Developers Club geek daily blog

AI, Big Data & HPC Digest #3

1 year, 3 months ago


Hi, Habr!

Our FlyElephant team wishes everyone a happy New Year, all the best, and the successful completion of every project planned for the coming year. And so that you have something to read over the weekend, we are publishing a fresh issue of the digest. Today's issue is the traditional selection of interesting links to news and materials on artificial intelligence, big data, and high-performance computing.

On January 14th we will hold a webinar, "Introduction to machine learning", where we will talk about the history and basic concepts of machine learning. We will look at popular machine learning problems and algorithms, run examples of them on the FlyElephant platform, and learn how the platform can be used to solve artificial intelligence problems. You can register for the webinar here.

Read more »


NumPy and multiprocessing

1 year, 3 months ago
Many people already use the numpy library in their Python programs, because it considerably speeds up work with data and the execution of mathematical operations. However, in many cases numpy runs many times slower than it could, because it uses only one processor even though it could use all that you have.
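The teaser does not say which remedy the full article proposes. Purely as an illustration, and assuming the work can be split into independent chunks, the sketch below spreads a reduction over a thread pool. The function name `chunked_sum_of_squares` and the choice of threads rather than processes are assumptions made for this example, not the article's method; whether it actually runs faster depends on the operation, the array size, and how NumPy was built.

```python
from concurrent.futures import ThreadPoolExecutor
import os

import numpy as np


def chunked_sum_of_squares(values, workers=None):
    """Sum of squares of a 1-D array, computed chunk by chunk in a thread pool.

    Hypothetical helper for illustration only.
    """
    workers = workers or os.cpu_count() or 1
    # Split the array into roughly equal pieces, one per worker.
    chunks = np.array_split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reduces its own chunk; the partial results are combined below.
        partial = list(pool.map(lambda chunk: float(np.dot(chunk, chunk)), chunks))
    return sum(partial)


data = np.arange(10, dtype=np.float64)
total = chunked_sum_of_squares(data, workers=4)
```

For small arrays the overhead of the pool outweighs any gain, so a sketch like this is only worth trying on large inputs.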

Read more »


Festival of Data at the Museum of Moscow: how it went

1 year, 3 months ago


Hi Habr,

So, we held the Festival of Data at the new technologies exhibition here.

This is our account of the first event in a series that brings together experts from different areas of business, science, and public administration to talk about data analytics.

Data storage and analysis used to be the preserve of a narrow circle of companies and people, but they now affect the lives of practically everyone. That is exactly why we started this series of events, where we tell a wide audience about data and its analysis.

Read more »


Hackathon and winter school on deep learning and question answering systems

1 year, 3 months ago
Today machines effortlessly "string two words together" (1, 2), but cannot yet reliably hold a conversation on general topics. Yet tomorrow you will already be asking them to put together a proper summary and to pick the best chess club near home for your children. Want to understand in more detail how researchers from Facebook, Google, and elsewhere are working in this direction? Come and listen to them.

Read more »


Semi-automatic classification of websites

1 year, 4 months ago
Consider the following task: there are 1000 news websites, for example: it was written by me and the guys from DCA several months ago. The graph of the news websites looks roughly like this:


Indeed, some classes can be separated automatically, for example "games" and "technologies":

Read more »


Taking part in the Beeline hackathon

1 year, 4 months ago
Last weekend the Museum of Moscow hosted an exhibition during which Beeline held a hackathon. I decided to drop by, just in case. The challenge on offer was interesting: a graph is given whose vertices are subscribers and whose edges record the number of calls from one subscriber to another, their duration, and the number of SMS messages. The data looked like this:
A,B,x_A,x_B,c_AB,d_AB,c_BA,d_BA,s_AB,s_BA
941235,666804,0,1,1,20,1,22,0,0
604328,367223,1,0,0,0,5,1364,0,0
932768,977234,0,0,1,168,0,0,0,0
395101,677107,0,1,1,160,0,0,0,0
250712,102647,0,0,0,0,3,456,0,0
510653,896558,0,0,139,50954,22,2990,0,0
...

A and B are subscribers, x is the operator, c is the number of calls, d is the call duration, and s is the number of SMS messages. There are ~6,000,000 edges in total. In addition, there was a secret set of edges that had been removed from the graph at random in advance. The task was to guess which edges those were, that is, to predict from the known set of connections which other connections might exist.

First of all I took 10,000 pairs of subscribers that were connected and 10,000 pairs that were not. The two main differences were the following:
  1. If subscribers are connected, then almost always one of them has operator 0. This is because Beeline only has information about its own clients.
  2. If subscribers are not connected, then they almost never have common contacts.

Roughly speaking, my solution was this: if a pair of subscribers has at least one common contact and at least one of them uses operator 0, we add a connection between them. The only problem was that the graph had ~1,000,000 subscribers, so checking the number of common contacts for every pair by brute force was impossible. This is where an algorithm already mentioned twice on this site, in the articles about finding similar groups in VK and about finding related queries, comes to the rescue once again. I will briefly describe the idea. Suppose there are 5 edges:
A    B
1    10
2    10
3    10
1    11
2    11

Subscribers 1 and 2 share two contacts, 10 and 11. Let's group the edges by B and, for each group, write out all pairs of A:
1    2
1    3
2    3
1    2

Let's count the frequency of each pair and, lo and behold, the pair 1, 2 has frequency 2. This algorithm maps well onto the map-reduce paradigm, so the 20-line nano-Hadoop comes in very handy here once again.
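The grouping step above can be sketched in plain Python. This is a minimal single-machine illustration, not the author's map-reduce code; the function name `count_common_contacts` is made up for this example. Grouping by B plays the role of the map and shuffle phases, and counting the pairs plays the role of the reduce phase.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_common_contacts(edges):
    """Count, for every pair of A-side subscribers, how many B contacts they share."""
    # "Map/shuffle": collect all A subscribers seen for each contact B.
    by_contact = defaultdict(set)
    for a, b in edges:
        by_contact[b].add(a)

    # "Reduce": every pair inside one group shares that contact once.
    pair_counts = Counter()
    for subscribers in by_contact.values():
        for pair in combinations(sorted(subscribers), 2):
            pair_counts[pair] += 1
    return pair_counts


# The five edges from the example above, as (A, B) tuples.
edges = [(1, 10), (2, 10), (3, 10), (1, 11), (2, 11)]
common = count_common_contacts(edges)
```

Only pairs that share at least one contact are ever produced, which is what makes this cheaper than checking every possible pair. A contact with a very large number of subscribers still yields a quadratic number of pairs, so such groups may need to be capped or skipped in practice.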

To check how good the solution is, I removed 20% of the edges from the graph and tried to guess them. The organizers used the F1 score as the metric. Random guessing gives an F1 of ~0. The baseline provided by the organizers scores ~0.02. My solution scores ~0.07. It turned out that edge direction is not taken into account during checking, so F1 comes out slightly higher, at ~0.08.
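For reference, F1 is the harmonic mean of precision and recall over the guessed edges. The sketch below shows how ignoring edge direction can raise the score. The organizers' actual scoring script is not shown in the post, so the function `f1_for_edges` and its `directed` flag are assumptions made for illustration.

```python
def f1_for_edges(predicted, actual, directed=True):
    """F1 score of predicted edges against the hidden edges.

    With directed=False, (a, b) and (b, a) are treated as the same edge.
    """
    def normalize(edges):
        if directed:
            return set(edges)
        return {tuple(sorted(edge)) for edge in edges}

    pred = normalize(predicted)
    true = normalize(actual)
    hits = len(pred & true)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy data: one guess has the right endpoints but the wrong direction.
hidden = [(1, 2), (3, 4)]
guessed = [(2, 1), (3, 4), (5, 6)]
score_directed = f1_for_edges(guessed, hidden, directed=True)
score_undirected = f1_for_edges(guessed, hidden, directed=False)
```

In this toy case the undirected score is higher because the reversed guess (2, 1) now counts as a hit.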

I also tried taking call duration into account. After all, a single common contact with whom both subscribers spoke only once and briefly does not at all mean that the subscribers must be connected in some way. But for some reason I did not get any gain in quality in practice.

Read more »


Vacation. Where? When? R

1 year, 4 months ago
While the temperature outside is heading for new records, it is interesting to see what temperatures there have been over any interval, in any year of the last several decades, at 30,000 points worldwide. You can also avoid picking the wrong vacation days by choosing days when the selected location has some "statistical advantage" for warm weather, or perhaps for cold, judging it visually on any of three chart types. Or you can simply spin the globe and take in the variety of temperatures and "how beautiful this world is".

Read more »


War, Peace, and ABBYY Compreno: our affair with Tolstoy continues

1 year, 4 months ago
We recently wrote here about how the All Tolstoy in One Click project was done. With the help of 3249 (three thousand two hundred forty-nine) volunteers and 1 (one) good OCR technology, we digitized 46,820 pages of the writer's 90-volume collected works, carefully proofread them, and published them for open access.

But if you thought that our "affair with Tolstoy" ended there, you were mistaken. Having digitized the writer's texts, we began to study them with the ABBYY Compreno information extraction technology, since such rich material should not go to waste. Read on to find out what "Tolstoy text mining" gave us and where the results are now used.

Introduction


The main goal of the All Tolstoy in One Click project was to make Tolstoy's work truly public property, so that every text that came from his pen is available in one click anywhere on Earth. This, by the way, is what the author himself willed, having renounced all rights to his texts during his lifetime (yes, anonymous, Leo Tolstoy knew about copyleft and open data long before this Internet of yours and Richard Stallman).

However, being able to load a book in a convenient format onto a reader or tablet is not the only benefit of digitization. Tolstoy's texts can now be not only read but also "measured", that is, studied with various quantitative methods using the whole arsenal of automatic text processing (also known as NLP). If you have all of a writer's texts in electronic form, then even one or two well-formed search queries can yield curious data that in earlier times might have cost a literary scholar weeks or months of persistent work. And if you also have an advanced natural language analysis technology, there is a chance of making a serious philological discovery (even without being a philologist). Below I will describe what we managed to measure and learn, but first a few words about who works on automatic processing of literary texts, how and why, and what interesting results can come of it.

Read more »


Finding potential followers on Twitter

1 year, 4 months ago
Suppose there is a Twitter account that writes on a fairly narrow range of topics and has several hundred or several thousand followers. How do you work out what share of the audience has not been reached yet? How do you find these people?

As an example, we will look at the account @Russia_Direct. It is a small publication that covers events in Russia for English-speaking readers. Something like Russia Today, but with deeper and more academic material.



It currently has ~4000 followers: students, journalists, university lecturers:


Read more »


Applying machine learning to improve PostgreSQL performance

1 year, 4 months ago

Machine learning is about finding hidden patterns in data. The growing interest in this subject in the IT community is driven by the exceptional results it has produced. Recognition of speech and of scanned documents, search engines: all of these are built using machine learning. In this article I will describe a current project at our company: how to apply machine learning methods to improve DBMS performance.
The first part of this article examines the existing mechanism of the PostgreSQL planner; the second part covers how it can be improved using machine learning.

Read more »