Developers Club geek daily blog

2 years ago
The other day I managed to pull off something interesting. For all VKontakte groups with 5000 to 10 000 subscribers (~100 000 groups), I built a complete graph in which the weight of each edge equals the audience overlap of the two groups.

Finding similar VKontakte groups and publics

First, a graph like this looks beautiful:

[Figure: graph of similar VKontakte groups and publics]

Second, it lets you quickly pick out groups on a given topic. Say we need to find groups about knitting. We search for the keyword "knitting" and find one suitable group, for example "Knitting - Knitting of online-". Then we list the groups it is connected to:

Knitting - Knitting of online-:
6.04% PRYaZhA Corporation
5.90% Mamochkin the channel — for creative mothers (the HOOK!)
3.40% Knitting. In this world everything is connected...))
3.01% YARN CHEAP. FLEECE. ELASTIC BANDS FOR WEAVING OF BRACELETS
2.35% Spaghetti Spagetti Yarn
1.87% Eesti lõng yarn Little shop (Kauni, Kauni)
1.73% * knitting Art by hook *
1.70% Kauni's (Kauni) Yarn — legend of Estonia. Knitting.
1.66% "Lacy motives" — knitting and needlework
1.54% Yarn Turkish available and to order (Ukraine)

And we repeat this until we get bored or until new names stop appearing.

Knitting. In this world everything is connected...:
8.88% PRYaZhA Corporation
3.06% Mamochkin the channel — for creative mothers (the HOOK!)
2.58% YARN CHEAP. FLEECE. ELASTIC BANDS FOR WEAVING OF BRACELETS
2.30% Knitting - Knitting of online-
2.14% Online store of the Yarn "OPENWORK"
1.94% Kauni's (Kauni) Yarn — legend of Estonia. Knitting.
1.85% Shop of yarn — ღ YOUR YARN ღ
1.76% Yarn
1.72% Openwork world: it is connected with love!
1.55% Eesti lõng yarn Little shop (Kauni, Kauni)

PRYaZhA Corporation:
7.54% Knitting. In this world everything is connected...))
4.01% Mamochkin the channel — for creative mothers (the HOOK!)
3.47% Knitting - Knitting of online-
3.20% YARN CHEAP. FLEECE. ELASTIC BANDS FOR WEAVING OF BRACELETS
2.72% Online store of the Yarn "OPENWORK"
2.67% Yarn
2.11% "Madam Vyazalkina" the Yarn (goods for needlework)
2.00% Kauni's (Kauni) Yarn — legend of Estonia. Knitting.
1.85% Eesti lõng yarn Little shop (Kauni, Kauni)
1.82% Spaghetti Spagetti Yarn

"Madam Vyazalkina" the Yarn (goods for needlework):
2.49% Yarn
2.37% PRYaZhA Corporation
1.42% Eesti lõng yarn Little shop (Kauni, Kauni)
1.39% Kauni's (Kauni) Yarn — legend of Estonia. Knitting.
1.32% YARN CHEAP. FLEECE. ELASTIC BANDS FOR WEAVING OF BRACELETS
1.26% Shop of yarn and goods for needlework the TOW
1.24% Knitted headdresses and not only.
1.21% HOBBY & HOME | NEEDLEWORK
1.18% Online store of the Yarn "OPENWORK"
1.15% Spaghetti Spagetti Yarn

A similar result can be achieved by skillfully choosing search keywords: "knitting", "yarn", "needlework", "hook". But it is not always easy to come up with them.

Building this graph took a few non-obvious technical solutions, which I would like to describe.

To get the complete list of groups of a given size, I scraped the excellent site allsocial.ru. I wonder how they collect this data. Do they simply walk through all the IDs: vk.com/club1, vk.com/club2...? I took only medium-sized groups with 5000 to 10 000 subscribers, for two reasons: you would go nuts downloading huge publics like MDK, and, more importantly, membership in them carries hardly any signal, since such groups are connected to everything in the world.

The VKontakte API has a dedicated method for getting the list of a group's subscribers. But it returns only 1000 users per call and can be called only 3 times per second. And I needed to download about 1 000 000 000 users, which is a ton. That works out to 3-4 days of waiting, and only if VK answers every request instantly. That is tolerable in general, but the following note in the documentation bothered me:
In addition to restrictions on the request rate, there are also quantitative limits on calls to the same methods. For obvious reasons, we do not provide information on the exact limits.
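
The 3-4 day estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes every call returns instantly, so only the rate limit matters.

```python
SECONDS_PER_DAY = 86400


def download_days(total_users, users_per_call, calls_per_second):
    """Return (number of calls, days of waiting) when only the rate limit matters."""
    calls = total_users / users_per_call
    seconds = calls / calls_per_second
    return calls, seconds / SECONDS_PER_DAY


# Figures from the text: ~1 000 000 000 users, 1000 per call, 3 calls per second.
calls, days = download_days(1_000_000_000, 1000, 3)
print(f"{calls:,.0f} calls, about {days:.1f} days")
```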

In our case this note is worrying, because we would have to make 1 000 000 requests. This is where the wonderful execute method comes to the rescue. Big respect to the VK guys for it. I wonder whether anyone else has something like this. The idea is that through execute you can send VK a program in the special VKScript language and pack several API requests and, possibly, some logic into it. In my case the program looked roughly like this:

return [
    API.groups.getMembers(id=1, offset=0, count=1000),
    API.groups.getMembers(id=1, offset=1000, count=1000),
    API.groups.getMembers(id=1, offset=2000, count=1000),
    API.groups.getMembers(id=1, offset=3000, count=1000),
    API.groups.getMembers(id=1, offset=4000, count=1000),
    API.groups.getMembers(id=1, offset=5000, count=1000),
    ...
];

A program can contain no more than 25 API calls. So the number of requests drops to 40 000, which in theory might slip under the limits. Each such request was no longer anywhere near instant; it took about 5-6 seconds, so I still had to wait. Yes, I could have downloaded in several threads, but that felt risky. In two and a half days everything finished downloading and took up about 10 GB on my disk.
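
Packing calls into programs of 25 can be sketched in Python roughly as follows. This is an illustration, not my actual downloader: the function names and the sample group sizes are made up, and the call syntax simply copies the approximate snippet above rather than the official VKScript syntax.

```python
from itertools import islice

PAGE_SIZE = 1000  # users returned by one groups.getMembers call
MAX_CALLS = 25    # API calls allowed inside one execute program


def member_calls(groups):
    """Yield one getMembers call per page of every (group_id, member_count) pair.

    The call syntax mirrors the approximate snippet in the text.
    """
    for group_id, member_count in groups:
        for offset in range(0, member_count, PAGE_SIZE):
            yield (f"API.groups.getMembers(id={group_id}, "
                   f"offset={offset}, count={PAGE_SIZE})")


def execute_programs(groups):
    """Pack the calls into program texts of at most MAX_CALLS calls each."""
    calls = member_calls(groups)
    while True:
        batch = list(islice(calls, MAX_CALLS))
        if not batch:
            return
        yield "return [\n    " + ",\n    ".join(batch) + "\n];"


# Hypothetical groups: 8 + 10 + 10 = 28 pages, so two programs are needed.
programs = list(execute_programs([(1, 7500), (2, 9200), (3, 9800)]))
print(len(programs))
print(programs[1])
```

With 1 000 000 pages in total, the same packing gives 1 000 000 / 25 = 40 000 execute requests.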

Now the question is how to fit these 10 GB into RAM and how to compute pairwise audience intersections for 100 000 groups. What saves us is the fact that each user usually belongs to a small number of groups (99% of users belong to fewer than 15 groups). We can write out the contribution each user makes to the intersections and then add these contributions up. Suppose, for example, there are two users, A and B, and three groups 1, 2 and 3. A belongs to all three, and B only to 1 and 3. A contributes to three intersections: (1, 2), (1, 3) and (2, 3); B contributes to one: (1, 3). Adding them up, we get that groups 1 and 3 intersect in two users, and the other pairs in one. If we simply ignore users who belong to 15 groups or more, we need to write out about 500 000 000 intersections, which is much better than the brute-force solution, where we would have to compute 100 000 * 100 000 intersections.
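
The example with users A and B can be reproduced in memory with a short Python sketch. The names are made up for illustration, and this toy version keeps everything in a dictionary, which is exactly what does not scale to the real data.

```python
from collections import Counter
from itertools import combinations

MAX_GROUPS = 15  # users in this many groups or more are ignored


def pair_counts(memberships, max_groups=MAX_GROUPS):
    """Add up each user's contributions to the group-pair intersections."""
    counts = Counter()
    for groups in memberships.values():
        if len(groups) >= max_groups:
            continue
        # A user in k groups contributes to k * (k - 1) / 2 pairs.
        for pair in combinations(sorted(groups), 2):
            counts[pair] += 1
    return counts


memberships = {"A": {1, 2, 3}, "B": {1, 3}}
counts = pair_counts(memberships)
for pair, shared in sorted(counts.items()):
    print(pair, shared)
```

Because a user below the threshold belongs to at most 14 groups, one user never contributes more than 14 * 13 / 2 = 91 pairs, which is what keeps the total manageable.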

Great, only the RAM problem remains. Fortunately, the described algorithm fits the map-reduce paradigm well, so I hacked together a 50-line nano-Hadoop, and the computation looked like this: we write out the groups and the users who belong to them in two columns:

group	user
3953835	10
2065169	100001643
2112714	100001643
...

We get a ~9 GB file, sort it by the second column with Unix sort, and look at which groups Pavel Durov belongs to:
group	user
2226515	1
37110020	1
38354466	1
43453499	1
60140141	1
60615047	1
64980878	1
1019652	10
...

We read the file, group the stream by the second column, keep in memory only the current user's list of groups, and, if the user belongs to fewer than 15 groups, write out all the pairs to another file:

source	target
10000	10027193
9980615	9997141
9974	9976553
...
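
This step might look roughly like the following Python sketch. It is not the original nano-Hadoop: the function names are hypothetical, and it assumes the input is tab-separated and already sorted by the user column. Group IDs are kept as strings, so pairs are ordered the same way plain sort orders them.

```python
import io
from itertools import combinations, groupby

MAX_GROUPS = 15  # users in this many groups or more are skipped


def rows(lines):
    """Parse 'group<TAB>user' lines, skipping the header row."""
    for line in lines:
        group, user = line.rstrip("\n").split("\t")
        if group != "group":
            yield group, user


def emit_pairs(sorted_lines, out, max_groups=MAX_GROUPS):
    """Write every pair of groups shared by one user; input is sorted by user."""
    for _user, user_rows in groupby(rows(sorted_lines), key=lambda r: r[1]):
        # Only the current user's groups are held in memory.
        groups = sorted({group for group, _ in user_rows})
        if len(groups) < max_groups:
            for source, target in combinations(groups, 2):
                out.write(f"{source}\t{target}\n")


sample = io.StringIO(
    "group\tuser\n"
    "2226515\t1\n"
    "37110020\t1\n"
    "1019652\t10\n"
    "3953835\t10\n"
    "2065169\t100001643\n"
)
pairs_out = io.StringIO()
emit_pairs(sample, pairs_out)
print(pairs_out.getvalue(), end="")
```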

Since the threshold was chosen sensibly, the file is not too big, ~9 GB. We sort it by both columns:
source	target
10000	100000
10000	100000
10000	10009982
10000	100100
10000	100100
10000	10019194
10000	10019194
10000	1002
10000	1002
10000	1002
...

Next the file is read and grouped by both columns, and the intersection is counted on the fly. Groups 10000 and 100000, for example, share 2 users. This can be reported right away; nothing needs to be kept in memory.
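
The counting pass can be sketched in Python like this. Again, the function name is made up, and the sketch assumes the pair file is sorted, so identical lines sit next to each other and a run of equal lines is one finished intersection.

```python
import io
from itertools import groupby


def count_pairs(sorted_lines):
    """Yield (source, target, shared_users) for each run of identical lines."""
    pairs = (line.rstrip("\n") for line in sorted_lines)
    for pair, run in groupby(pairs):
        if pair == "source\ttarget":  # skip the header row
            continue
        source, target = pair.split("\t")
        # The length of the run is the size of the intersection.
        yield source, target, sum(1 for _ in run)


# The excerpt of the sorted file shown above, without the trailing "...".
excerpt = io.StringIO(
    "source\ttarget\n"
    "10000\t100000\n10000\t100000\n"
    "10000\t10009982\n"
    "10000\t100100\n10000\t100100\n"
    "10000\t10019194\n10000\t10019194\n"
    "10000\t1002\n10000\t1002\n10000\t1002\n"
)
edges = list(count_pairs(excerpt))
for edge in edges:
    print(*edge, sep="\t")
```

The count for the last pair only covers the lines in this excerpt; in the real file more identical lines may follow.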

Next the edges are filtered by some reasonable threshold so that not too many of them remain. The result can be viewed in Gephi. There are two secrets: to keep everything from being painfully slow, turn off edge rendering; and for the layout, download OpenOrd, which laid out my graph of ~100 000 nodes in ~5 minutes.
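
Filtering could be sketched in Python as below. The threshold value and the function name are placeholders, not the ones actually used, and the Source/Target/Weight header is an assumption about the CSV edge list that Gephi's spreadsheet import expects.

```python
import csv
import io


def write_edges(edges, out, min_weight):
    """Write edges with weight >= min_weight as CSV; return how many were kept."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    kept = 0
    for source, target, weight in edges:
        if weight >= min_weight:
            writer.writerow([source, target, weight])
            kept += 1
    return kept


# Hypothetical counted edges and a hypothetical threshold of 2 shared users.
counted = [("10000", "100000", 2), ("10000", "10009982", 1), ("10000", "1002", 3)]
csv_out = io.StringIO()
kept = write_edges(counted, csv_out, min_weight=2)
print(kept)
print(csv_out.getvalue(), end="")
```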

A similar approach can in theory be used in any task that involves two connected kinds of entities: sites and users, or search queries and search results, for example.

This article is a translation of the original post at habrahabr.ru/post/263753/