In last days off in the Museum of Moscow there took place the exhibition within which Beeline held a hackathon. I, just in case, decided to descend. The interesting challenge was offered: the graph, in tops subscribers is given, in edges the number of calls of one subscriber to another, their duration and number of sms is written. Data looked here so:
A, B — subscribers, x — the operator, with — number of calls, d — duration of talk, s — number of sms. In total ~ 6 000 000 edges. Besides there was a confidential set of edges which in advance in a random way deleted from the graph. It was offered to guess what edges were. That is on the known set of communications to tell what else communications I can appear.
First of all I took 10 000 couples of subscribers between whom there was a communication and 10 000 couples between which communication was not. Two main differences consisted in the following:
- If subscribers are connected, then almost always at one of them the operator 0. So it turns out because Beeline possesses information only on the clients
- If subscribers are not connected, then they almost always have no general contacts.
That is, roughly speaking, my solution consisted in the following: if couple of subscribers, has at least one general contact and at least one of subscribers uses the operator 0, then we add between them communication. The problem was only that in the graph there were ~ 1 000 000 subscribers and in a forehead to check how many the general contacts were impossible to each couple. Here once again the algorithm which already two times was mentioned on this website, in articles about search of similar groups in VK and about search of the connected requests comes to the rescue. I will shortly describe an essence. Let to eat 5 edges:
Subscribers 1 and 2 are crossed on two contacts 10 and 11. Let's group edges in B and for each group we will write out all matchings of A:
Let's count the frequency of all matchings and, about a miracle, at the couple 1, 2 frequency 2. This algorithm it is good to lay down on a paradigm map-redyyus therefore here again very much is useful nano-hadup on 20 lines.
To check on how many qualitative the solution turns out, I took away 20% of edges from the graph and tried to guess them. As a metrics organizers used f1 score. If to guess accidentally f1 turns out ~ 0. Beyzlayn who organizers provided gathers ~ 0.02. My solution — ~ 0.07. It turned out that when checking the direction of edges therefore f1 turns out a little higher — ~ 0.08 is not considered.
Still I tried to consider duration of talk. Really, one general contact with which both subscribers communicated only once and not for long, at all does not mean that subscribers have to be somehow connected. But for some reason in practice I did not receive any gain in quality.