After the hackathon, as usually happens, we were not satisfied with what we had achieved and kept working. We had our hands on data that previously, probably, only Ministry of Education staff could access: results of the SFE (the state final exam after 9th grade) and olympiad wins for 2014-2015 for 90% of Moscow schools. For 55% of schools we managed to collect data on the USE (the Unified State Exam after 11th grade) for 2015. We also downloaded all accounts of Moscow school students from VKontakte (VK) and looked at which universities they list in their profiles after graduating.
Naturally, it was interesting to study such a dataset. First, some trivial things that people in education probably know well:
- USE scores in the humanities are higher than in technical subjects. History is an exception;
- Natural sciences are in the middle.
For some schools there is USE data for 2014 as well, so we can try to look at the change over two years:
- It looks as if the physics score went up and the informatics score went down slightly;
- Either it is noise, or the tasks changed, or schools began preparing differently.
For some schools we have not only USE scores but also the number of students who took each subject, so we can look at the popularity of subjects. Most likely, people in the field already know this:
- Russian is mandatory, so everyone takes it;
- Some students probably take basic-level mathematics; we counted only the profile (advanced) level;
- The outliers in English and physics are probably due to specialized schools.
I thought that the more popular a subject is, the higher its average score. But it seems to be just the opposite:
Now a little about the SFE. I thought that the better a school does on the SFE, the better its USE scores two years later. It turned out that this holds only for Russian and mathematics, and partly for social studies. Who knows why?
One hypothesis was that subject preferences change: someone who took, say, physics in the 9th grade will not necessarily take physics in the 11th. But for the SFE we also have data on the number of test takers, and subject popularity generally matches what we see for the USE:
Perhaps it is a matter of the tasks. If we order subjects by average SFE score, the order is not at all the same as for the USE:
- Informatics scores are high;
- The spikes at whole-number scores appear because some schools round the average to zero decimal places;
- In history, as with the USE, scores are among the lowest.
Now about the olympiads. We have the number of winners of the Moscow and All-Russian olympiads in every subject. It was interesting to check whether success in olympiads correlates with a school's average USE score:
- Sometimes some dependence is visible: for English, social studies, and biology, for example;
- Sometimes not really: for Russian and literature there is no particular relationship.
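One way to put a number on "some dependence" is a correlation coefficient between olympiad winners and the average USE score per school. The sketch below computes the Pearson coefficient in plain Python. It is only an illustration: the post does not say which measure was used, and the sample numbers are made up, not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, NOT data from the article:
# olympiad winners per school and that school's average USE score.
winners = [0, 1, 2, 3, 4]
avg_use = [60.0, 62.0, 64.0, 66.0, 68.0]
r = pearson(winners, avg_use)  # close to 1 for this perfectly linear toy data
```

A coefficient near 0 would match the "not really" cases, while values closer to 1 would match the subjects where a dependence is visible. Since the number of winners is a count with many zeros, a rank-based measure might be a more robust choice.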
Coordinates are known for all schools. Yes, a school may have several buildings, but for now we look at the legal address.
I had the impression that the closer a school is to the center, the better it is. But it seems that is not so. At least, the average USE score does not depend on proximity to the center.
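For anyone who wants to repeat the check, proximity to the center can be computed from a school's coordinates with the haversine formula. This is a minimal sketch, not the code used for the post: the reference point (approximate coordinates of the Kremlin) and the function names are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Assumed reference point for "the center": approximate Kremlin coordinates.
CENTER_LAT, CENTER_LON = 55.7520, 37.6175


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_to_center_km(lat, lon):
    """Distance from a school's (legal address) coordinates to the assumed center."""
    return haversine_km(lat, lon, CENTER_LAT, CENTER_LON)
```

The resulting distances can then be plotted against the average USE score per school, or fed into a correlation coefficient, to see whether any dependence shows up.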
Some readers are probably wondering by now where the data comes from and why it can be trusted. The SFE and olympiad results were kindly provided by the Ministry of Education, which promised that this data will soon be publicly available. USE results by subject are for some reason treated as a big secret, so we had to collect them manually from school websites. All Moscow schools are hosted on the mskobr.ru portal and each has a "public report" section. It usually links to a document in which the principal reports on the past year in free form. Naturally, every school has its own idea of the report's content and layout:
So we had to forget about fully automatic data collection. We took Tabula, a great tool for recognizing tables in PDF documents, patched it a little, and the data collection process looked like this:
After ~30 hours all ~600 documents were processed. It turned out that USE data could be extracted from only ~55% of them. Often the report is outdated, or it has no USE results, or it gives not averages but only, for example, maximum scores. We then sent letters to the ~300 schools for which we managed to get USE scores, asking them to check the data. ~30 schools answered: 2 found errors, 5 sent scores slightly higher than those in the report, and the rest said everything was fine. So there are no big problems with accuracy; the problem is completeness. We still need to get scores for ~300 more schools somewhere.
Then we turned to VK. The goal was to determine which universities the graduates of each school most often enter. First we had to match the official school names with the ones VK uses. That is not so simple: for example, we have "School No. 17", while VK has "Evening school No. 17", "L. N. Oborin Music School No. 17", and "Boarding school No. 17". In addition, VK returns only the first 1000 search results. If a school is listed in more than 1000 accounts, which is almost always the case for Moscow schools, you have to come up with something. We split the single query "school No. 17" into several: "school No. 17, girls aged 6 to 14", "school No. 17, boys aged 6 to 14", "school No. 17, girls aged 15 to 17", "school No. 17, boys aged 15 to 17", and so on. Search requests also seem to have some unclear rate limit: after ~50 requests we were banned for ~1 hour. Still, after a couple of days all accounts were downloaded. On average there are ~1800 people per school, of whom ~450 list a university.
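The query-splitting trick can be sketched roughly as follows. This is not the code used for the post and it calls no real VK API: `search` is a hypothetical function supplied by the caller, the field names in the query dictionaries are illustrative, and only the first two age ranges come from the text, while the rest are assumed.

```python
import time

MAX_RESULTS = 1000  # per-search cap mentioned in the article

# The first two ranges are from the article; the others are assumptions.
AGE_RANGES = [(6, 14), (15, 17), (18, 20), (21, 25), (26, 35), (36, 80)]
SEXES = ["female", "male"]


def split_queries(school):
    """Split one broad query into narrower ones by sex and age range."""
    return [
        {"school": school, "sex": sex, "age_from": lo, "age_to": hi}
        for sex in SEXES
        for lo, hi in AGE_RANGES
    ]


def collect_accounts(school, search, pause_s=0.0):
    """Run every sub-query and merge the results.

    `search` is a hypothetical callable: it takes a query dict and returns
    a list of account ids (at most MAX_RESULTS of them).
    Returns the set of unique ids and the sub-queries that hit the cap.
    """
    seen = set()
    saturated = []
    for query in split_queries(school):
        ids = search(query)
        if len(ids) >= MAX_RESULTS:
            # This bucket was probably truncated and needs a finer split.
            saturated.append(query)
        seen.update(ids)
        time.sleep(pause_s)  # crude pacing to stay under the request limit
    return seen, saturated
```

Collecting ids into a set removes duplicates when the same account matches more than one sub-query. Any sub-query reported as saturated would need to be split further, for example into single years of age.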
If we use this data as is, then, oddly enough, 90% of Moscow school students go to MSU. So the following sophisticated algorithm is applied: throw out MSU. Yes, for Lyceum No. 1533, for example, where 50% of graduates go to MSU, this algorithm does not work very well, but other approaches badly reduce coverage for all schools: a school is left with not ~450 people but ~45, and you cannot build a distribution over universities from that. If you studied at one of the schools in the picture, please write whether the histogram matches reality:
You can try looking up other schools on obr.msk.ru.
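The "throw out MSU" step described above can be sketched in a few lines. This is a guess at the shape of the computation, not the original code: the minimum-sample threshold is an assumption added to reflect the remark that ~45 profiles are too few for a distribution, and the university names in the usage example are placeholders.

```python
from collections import Counter

EXCLUDED = {"MSU"}   # the university dropped by the "sophisticated algorithm"
MIN_PROFILES = 50    # assumed cutoff; the post gives no exact threshold


def university_shares(universities, excluded=EXCLUDED, min_profiles=MIN_PROFILES):
    """Share of each university among one school's profiles.

    `universities` is a list with one entry per profile that lists a university.
    Returns None when too few profiles remain after the exclusion.
    """
    kept = [u for u in universities if u not in excluded]
    if len(kept) < min_profiles:
        return None
    total = len(kept)
    # most_common() orders universities from most to least frequent.
    return {name: count / total for name, count in Counter(kept).most_common()}
```

For example, `university_shares(["MSU"] * 5 + ["HSE"] * 3 + ["MIPT"], min_profiles=1)` gives 0.75 for "HSE" and 0.25 for "MIPT", because the five "MSU" entries are dropped before the shares are computed.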
This article is a translation of the original post at habrahabr.ru/post/270675/