Developers Club geek daily blog

Spark local mode: processing big files on an ordinary laptop

1 year, 1 month ago
Hi all.
On January 4 a new version of Spark came out, and existing write-ups range from introductions to experience of use in projects. Spark runs on most operating systems and can be started in local mode even on an ordinary laptop. Given how simple the Spark setup is in this case, it would be a sin not to use its main functions. In this article we will look at how to quickly set up, on a laptop, processing of a big file (larger than the computer's RAM) with ordinary SQL queries. This lets even an untrained user run queries. Connecting an IPython (Jupyter) notebook on top makes it possible to build full reports. The article walks through a simple file-processing example; other Python examples are here.
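As a rough illustration of the idea, here is a minimal sketch of querying a CSV file with SQL in Spark local mode. It assumes a recent PySpark (2.0 or later, which provides `SparkSession`); the function name `run_sql_on_csv`, the view name `data`, and the file name in the usage comment are hypothetical and not taken from the article.

```python
def run_sql_on_csv(path, query, view="data"):
    """Run a SQL query over a CSV file with Spark in local mode.

    Assumption: PySpark 2.0+ is installed. It is imported lazily so that
    this module can be loaded even where PySpark is missing.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # local mode, using all cores of this machine
        .appName("big-file-sql")   # hypothetical application name
        .getOrCreate()
    )
    try:
        # Spark reads the file partition by partition, so the file itself
        # does not have to fit in memory; the collected result does.
        df = spark.read.csv(path, header=True, inferSchema=True)
        df.createOrReplaceTempView(view)
        return spark.sql(query).collect()
    finally:
        spark.stop()


# Hypothetical query against the temporary view registered above.
QUERY = "SELECT COUNT(*) AS n FROM data"

# Usage (not executed here; "big_file.csv" is a placeholder):
# rows = run_sql_on_csv("big_file.csv", QUERY)
```

Only the result of the query is collected to the driver, so aggregate queries are the natural fit for this pattern.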

Read more »


AI, BigData & HPC Digest #3

1 year, 1 month ago


Hi, Habr!

Our FlyElephant team would like to wish everyone a happy New Year, all the best, and successful completion of every project planned for the coming year. So that you have something to read over the weekend, we are publishing a fresh issue of the digest. This issue has the traditional selection of interesting links to news and materials on artificial intelligence, big data, and high-performance computing.

On January 14 we will hold a webinar, "Introduction to machine learning", where we will talk about the history and basic concepts of machine learning. We will look at popular machine learning problems and algorithms, run examples of them on the FlyElephant platform, and see how the platform can be used to solve artificial intelligence problems. You can register for the webinar here.

Read more »


Numpy and multiple processors

1 year, 1 month ago
Many people already use the numpy library in their Python programs, since it considerably speeds up work with data and mathematical operations. However, in many cases numpy runs many times slower than it could, because it uses only one processor when it could use all that you have.

Read more »


Festival of Data at the Museum of Moscow: how it went

1 year, 1 month ago


Hi Habr,

So, we held the Festival of Data at the new technologies exhibition here.

This was the first event in a series in which we bring together experts from different areas of business, science, and public administration to talk about data analytics.

Data storage and analysis, once the prerogative of a narrow circle of companies and people, now affect the lives of practically everyone. That is why we started this series of events, where we tell a wide audience about data and data analytics.

Read more »


Semi-automatic classification of the websites

1 year, 2 months ago
Consider the following task: there are 1000 news websites, for example: it was written by me and the guys from DCA several months ago. The graph of the news websites looks roughly like this:


Indeed, some classes can be picked out automatically, for example "games" and "technologies":

Read more »


On taking part in the Beeline hackathon

1 year, 2 months ago
Last weekend the Museum of Moscow hosted an exhibition at which Beeline held a hackathon. I decided to drop by, just in case. An interesting problem was offered: a graph is given in which the vertices are subscribers, and each edge records the number of calls from one subscriber to another, their duration, and the number of SMS messages. The data looked like this:
A,B,x_A,x_B,c_AB,d_AB,c_BA,d_BA,s_AB,s_BA
941235,666804,0,1,1,20,1,22,0,0
604328,367223,1,0,0,0,5,1364,0,0
932768,977234,0,0,1,168,0,0,0,0
395101,677107,0,1,1,160,0,0,0,0
250712,102647,0,0,0,0,3,456,0,0
510653,896558,0,0,139,50954,22,2990,0,0
...

A, B are subscribers, x is the operator, c is the number of calls, d is the call duration, s is the number of SMS messages. About 6 000 000 edges in total. In addition, there was a secret set of edges that had been removed from the graph at random in advance. The task was to guess which edges those were, that is, to tell from the known set of connections which other connections might exist.
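To make the record layout concrete, here is a small sketch that parses rows of this format. The field names are my own, and reading the `_AB` / `_BA` suffixes as "from A to B" and "from B to A" is an assumption based on the column names, not something the post states.

```python
import csv
import io
from typing import NamedTuple


class Edge(NamedTuple):
    """One row of the hackathon graph (field names are hypothetical)."""
    a: int         # subscriber A
    b: int         # subscriber B
    op_a: int      # x_A: operator of A
    op_b: int      # x_B: operator of B
    calls_ab: int  # c_AB: number of calls, assumed A -> B
    dur_ab: int    # d_AB: call duration, assumed A -> B
    calls_ba: int  # c_BA
    dur_ba: int    # d_BA
    sms_ab: int    # s_AB
    sms_ba: int    # s_BA


def read_edges(lines):
    """Yield Edge records from an iterable of CSV lines with a header row."""
    reader = csv.reader(lines)
    next(reader)  # skip the header line
    for row in reader:
        yield Edge(*map(int, row))


# The first two data rows from the sample above.
SAMPLE = """A,B,x_A,x_B,c_AB,d_AB,c_BA,d_BA,s_AB,s_BA
941235,666804,0,1,1,20,1,22,0,0
604328,367223,1,0,0,0,5,1364,0,0
"""

edges = list(read_edges(io.StringIO(SAMPLE)))
```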

First I took 10 000 pairs of subscribers that were connected and 10 000 pairs that were not. The two main differences were:
  1. If subscribers are connected, then almost always one of them has operator 0. This is because Beeline has information only about its own clients.
  2. If subscribers are not connected, then they almost never have common contacts.

So, roughly speaking, my solution was this: if a pair of subscribers has at least one common contact and at least one of them uses operator 0, we add a connection between them. The only problem was that the graph had ~ 1 000 000 subscribers, and checking by brute force how many common contacts each pair has was impossible. This is where an algorithm already mentioned twice on this site comes to the rescue, in the articles about finding similar groups in VK and about finding related queries. I will briefly describe the idea. Suppose there are 5 edges:
A    B
1    10
2    10
3    10
1    11
2    11

Subscribers 1 and 2 share two contacts, 10 and 11. Let's group the edges by B and, for each group, write out all pairs of A:
1    2
1    3
2    3
1    2

Let's count the frequency of each pair and, lo and behold, the pair 1, 2 has frequency 2. This algorithm fits the map-reduce paradigm well, so here again the 20-line nano-hadoop comes in very handy.
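The grouping trick can be sketched in plain, single-machine Python as follows. This is not the author's nano-hadoop code; the function names and the `operator` / `known` arguments are hypothetical, and the second function is only one possible reading of the rule "at least one common contact and at least one subscriber on operator 0".

```python
from collections import Counter, defaultdict
from itertools import combinations


def common_contact_counts(edges):
    """For every pair of A-subscribers, count how many B-contacts they share."""
    by_contact = defaultdict(set)
    for a, b in edges:            # "map": key each edge by its contact B
        by_contact[b].add(a)
    counts = Counter()
    for subscribers in by_contact.values():   # "reduce": one group per contact
        # A contact with k subscribers yields k * (k - 1) / 2 pairs.
        for pair in combinations(sorted(subscribers), 2):
            counts[pair] += 1
    return counts


def predict_edges(counts, operator, known):
    """Keep unknown pairs with a common contact where one side is on operator 0.

    `operator` maps subscriber -> operator id; `known` holds existing edges
    as (smaller, larger) tuples. Both are assumptions for this sketch.
    """
    return {
        pair
        for pair, n in counts.items()
        if n >= 1
        and (operator[pair[0]] == 0 or operator[pair[1]] == 0)
        and pair not in known
    }


# The five edges from the example above.
edges = [(1, 10), (2, 10), (3, 10), (1, 11), (2, 11)]
counts = common_contact_counts(edges)

# Hypothetical operators and known edges, only to exercise the rule.
predicted = predict_edges(counts, operator={1: 0, 2: 1, 3: 1}, known={(1, 3)})
```

For an undirected graph, each edge would need to be fed in both directions so that every subscriber appears on both sides.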

To check how good the solution is, I removed 20% of the edges from the graph and tried to guess them. The organizers used the F1 score as the metric. Random guessing gives F1 ~ 0. The baseline provided by the organizers scores ~ 0.02. My solution scores ~ 0.07. It turned out that edge direction is not taken into account during checking, so F1 comes out slightly higher, ~ 0.08.

I also tried taking call duration into account. Indeed, a single common contact with whom both subscribers spoke only once and briefly does not at all mean the subscribers must be connected. But for some reason in practice I got no gain in quality.
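For reference, here is a minimal sketch of an F1 score over sets of edges, with a switch for ignoring edge direction. The organizers' actual checker is not shown in the post, so the function and its `directed` flag are assumptions, and the edges below are made up for illustration.

```python
def f1_score(predicted, actual, directed=True):
    """F1 of predicted edges against the held-out edges.

    With directed=False, (a, b) and (b, a) count as the same edge.
    """
    if directed:
        predicted, actual = set(predicted), set(actual)
    else:
        predicted = {frozenset(e) for e in predicted}
        actual = {frozenset(e) for e in actual}
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


# Made-up edges: one prediction has the right pair but the wrong direction.
actual = [(1, 2), (3, 4)]
predicted = [(2, 1), (3, 4), (5, 6)]

f1_directed = f1_score(predicted, actual, directed=True)
f1_undirected = f1_score(predicted, actual, directed=False)
```

Ignoring direction can only keep or raise the number of hits, which is consistent with the slightly higher score reported above.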

Read more »


Hub AI&BigData meetup #1

1 year, 2 months ago


On December 26 our FlyElephant team will take part in the Hub AI&BigData meetup, devoted to big data and artificial intelligence. The event will take place in Odessa and begin at 11:00. An online broadcast will be organized for everyone who cannot come.

Read more »


Player Relationship Management Platform at Wargaming: data collection and analysis

1 year, 2 months ago
Our company's activity extends far beyond game development. In parallel we run dozens of internal projects, and the Player Relationship Management Platform (PRMP) is one of the most ambitious.

Player Relationship Management Platform (PRMP) is a special system that analyzes large volumes of data in real time to personalize interaction with the player through recommendations, which reach the user based on the context of their latest gameplay experience.

PRMP lets our players get more enjoyment from the game, improves their user experience, and spares them from viewing unnecessary advertising and promo messages.

Architecture of PRMP

Read more »


Scalding: a reason to move from Java to Scala

1 year, 2 months ago


In this article I will talk about Twitter Scalding, a framework for describing data processing in Apache Hadoop. I will start from afar, with the history of frameworks on top of Hadoop. Then I will give an overview of Scalding's capabilities. Finally, I will show code samples that should be understandable to those who know Java but are barely familiar with Scala.

Interested? Let's go!

Read more »


Finding potential followers on Twitter

1 year, 2 months ago
Suppose there is a Twitter account that writes on a fairly narrow range of topics and is followed by several hundred or a few thousand people. How do you tell what share of the audience is not yet covered? How do you find these people?

As an example we will take the account @Russia_Direct. It is a small publication that covers events in Russia for English-speaking readers. Something like Russia Today, but with deeper and more academic material.



They currently have ~ 4000 followers (students, journalists, university lecturers):


Read more »