Report from the winners of the Microsoft Azure ML hackathon (November 28-29, 2015 AD) on the Alfa-Bank task

1 year, 2 months ago
Last weekend we (Alexey Burnakov and Roman Gudchenko) took a determined part in the hackathon organized by Microsoft: link.

We chose the Alfa-Bank task on forecasting customer churn. In brief:

You are given a data set of 3 cohorts (3 consecutive reporting months) of active clients, some of whom became inactive 2 months later, i.e. churned (target field = 1). The test data set contains the next, 4th cohort of active clients.


Using the provided data set, build a binary classifier of probable customer churn 2 months later.
Quality criterion: AUC, measured on a control set.
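
For readers unfamiliar with the metric: AUC can be read as the probability that a randomly chosen churned client receives a higher score than a randomly chosen client who stayed active. Below is a minimal Python sketch of that pairwise definition. The function name `auc` and the toy labels and scores are illustrative assumptions; this is not the hackathon data or the winners' code.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: label 1 marks a client who churned.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

The double loop is quadratic in the number of clients, so it only suits small examples; on a real control set a rank-based computation from a statistics library would be the usual choice.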

Clubs at Moscow schools: which ones there are, how much they cost, and whether they affect USE results

1 year, 3 months ago
Earlier we studied the results of the USE, SFE and academic olympiads, then looked at where school students enroll and where they work. Now let's examine the clubs at schools.

In general, clubs can tell a lot about a school. A large number of sports sections probably indicates good gyms and equipment. The prices of clubs may indicate how commercialized the school is overall. It was also interesting to see whether the number of paid services affects a school's success, and whether clubs help students achieve better results in exams and olympiads.

One website has a section where you can enroll a child in a club. Overnight the whole section was scraped, and we had information on 80,000 Moscow clubs in hand.

Find the corrupt official: analyzing data on officials from the Clerical Hundred projects (with examples in R)

1 year, 3 months ago
How do you identify the officials who look most suspicious in terms of corruption? The simplest way is to compare their income with their standard of living.

In this article I want to show what websites with open information on officials can do, look at how these officials live, and try to identify those who are most suspicious in terms of corruption.

Why is open information on officials' income important? Because it makes it possible to monitor them.

Photo from the Instagram of the daughter of Alexander Yershov, former head of Ukraine's GAI. In the photo, Yershov's daughter is in Cannes next to Paris Hilton. After a scandal over the mismatch between the family's declared income and its lifestyle, Yershov resigned.

Analysis of price changes in Russian online stores

1 year, 3 months ago

For the last few years I have been quite interested in pricing in Russian online stores. Every time an online store announces a big discount, doubt creeps in... Is the discount really that big? Was the crossed-out price ever the actual price?
The sharp swings in the dollar exchange rate at the end of 2014 added fuel to the fire. I badly wanted to find out how prices actually depend on the dollar rate.
As a result, I decided to settle these questions and collect the price change history of Russian online stores. Below the cut: the results of the work plus several interesting patterns.

The new Tweet buttons, or farewell to the counter

1 year, 3 months ago
On November 20 Twitter disabled the ability to view the link counter through the private API of its buttons and removed the counter from their official design, as it had warned in its blog a little over a month earlier. The new buttons now look like this:

This concerns the API located at the address: Previously, by passing the required URL, you could get the number of tweets/retweets containing a link to that URL. From now on, instead of the usual JSON, the API returns a 404 error.

Although the API was officially private, in practice its use was not limited in any way, and over the years of its existence it became the foundation of many services that do some kind of data analysis.

Officially, Twitter attributes this to the difficulty of porting the functionality to a new platform, or more precisely to the low priority of this task relative to others, and assures users that the counter on the button plays no special role for website visitors. However, this is the first time a major social network has abandoned the counter on its button.

Power Query: steroids for MS Excel

1 year, 3 months ago

In this article I want to talk about some capabilities of a free and extremely useful, but so far little-known, add-in for MS Excel called Power Query.

Power Query lets you pull data from a wide variety of sources (such as csv, xls, json, text files, folders of such files, a wide range of databases, various APIs like Facebook Open Graph, Google Analytics, Yandex.Metrica, CallTouch and many more), build repeatable sequences of processing steps for that data, and load the results into Excel tables or the data model.

Below the cut you will find the details of all these magnificent capabilities.

Titanic on Kaggle: you will not read this post to the end

1 year, 3 months ago
Hi, Habr!

# { Data Science for beginners }

My name is Gleb Morozov; we already know each other from my previous articles. By popular request, I continue describing my experience of participating in educational projects (by the way, for anyone who has not managed to yet: there is still time to get the materials of the past courses; this is probably the shortest and most practical data analysis course imaginable).

This work describes my attempt to build a model that predicts which passengers of the Titanic survived. The main goal is practice with the tools used in Data Science for data analysis and for presenting research results, so this article will be very, very long. The main focus is on exploratory analysis and on creating and selecting predictors (feature engineering). The model is built for the Titanic: Machine Learning from Disaster competition hosted on Kaggle. I will use the R language in this work.

A study of USE, SFE and olympiad results for Moscow schools: which schools send students to which universities

1 year, 3 months ago
A month ago I wrote about our participation in an open data hackathon.

After the hackathon we did not settle for what had been achieved, as usually happens, and continued the work. We had in hand data that previously, probably, only Ministry of Education staff could access: SFE results and olympiad wins for 2014-2015 for 90% of Moscow schools. For 55% of schools we managed to collect USE data for 2015. We downloaded all the accounts of Moscow school students on VKontakte and looked at which universities they list in their profiles after graduation.

Naturally, it was interesting to study such a dataset. First, some trivial things that people in education probably know well:
  • USE scores in humanities subjects are higher than in technical ones. History is an exception;
  • Natural science subjects are in the middle.

Analysis of consumer baskets in retail

1 year, 3 months ago
Task No. 1 for a retailer is to understand who exactly shops in the store, study customer behavior, identify typical patterns, and use this knowledge to influence the quantity and quality of purchases.

This can be approached in the following ways:
  • analyzing data from loyalty programs and other ways of studying customers' identities and behavior;
  • analyzing data on purchases and transactions.

To rephrase the second approach: which goods did the customer put in the basket?
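
One simple way to start answering that question is to count how often pairs of goods land in the same basket. The Python sketch below is only an illustration under assumptions: each purchase is a plain list of item names, the function name `pair_support` and the sample receipts are made up, and "support" (the share of baskets containing both items) is a term borrowed from association rule mining rather than from the text above.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): share of baskets containing both items}."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical receipts: one list of goods per purchase.
toy_baskets = [
    ["bread", "milk"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
    ["bread"],
]
support = pair_support(toy_baskets)
```

Here `support[("bread", "milk")]` is 0.5, because two of the four toy baskets contain both items. Pairs with high support are natural candidates for a closer look.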

How to find the longest continuous series of events using SQL

1 year, 3 months ago
The problem of finding continuous sequences of events is solved quite easily with SQL. Let's clarify what these sequences are.

As an example, take Stack Overflow. It uses a cool reputation system with badges for certain achievements. As in many social projects, it encourages users to visit the site daily. Let's look at these two badges:

Their meaning is easy to understand. Visit the site on the first day. Then on the second day. Then on the third (possibly several times, it does not matter). Did not come on the fourth? The count starts over.
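
The counting rule just described can be sketched outside SQL too. The Python below is a minimal illustration under assumptions: visits arrive as a list of `datetime.date` values, and the function name `longest_streak` and the sample dates are made up for the example. It is not the SQL solution this article is about.

```python
from datetime import date, timedelta


def longest_streak(visit_dates):
    """Length in days of the longest run of consecutive calendar days."""
    days = sorted(set(visit_dates))  # several visits on one day count once
    best = current = 0
    previous = None
    for day in days:
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1  # a gap (or the very first day) restarts the count
        best = max(best, current)
        previous = day
    return best


# Hypothetical visit log for one user.
toy_visits = [
    date(2015, 11, 1),
    date(2015, 11, 2),
    date(2015, 11, 3),
    date(2015, 11, 3),  # second visit on the same day
    date(2015, 11, 5),
    date(2015, 11, 6),
]
streak = longest_streak(toy_visits)
```

For the toy log the longest streak is 3 days (November 1 to 3); the missed November 4 resets the count, so the later two-day run does not extend it.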

