I had long wanted a tool that, for an arbitrary selection of resumes (by job title, keyword, and so on), would show the main characteristics of that sample: the distribution of salaries, ages, and much more, both as charts and as any percentile I cared to look at. The result of that wish is below the cut.
So, the first thing data visualization needs is the data itself, and the largest source of such data is the HeadHunter site (hh.ru). Knowing that it has an API, I thought I would quickly get everything through it without having to parse anything. After reading its documentation, however, I saw that API access to the resume database is available only to employers, and only for the purpose of offering jobs. No matter: hh.ru also has an openly accessible section with the resumes of people who have made them visible to the whole Internet, and there are many of them, about a third of the total. Of these people, we are interested in those who stated a desired income, and they are the vast majority, 80%.
In the end I implemented data collection in two ways: through the import.io API (first, as the quicker one to implement), and by fetching and parsing the pages directly in R, which turned out to be 20% faster. The maximum number of resumes, 5,000 (an hh.ru limit), is collected in about 3 minutes. A request of real interest usually returns far fewer, so the actual time difference between the two methods is a matter of seconds.
Overview of data
Most likely this sample is biased. I assume that people who open their resumes to the whole Internet are, for the most part, more interested in finding a job, and so their income expectations are probably somewhat understated. Without access to the full database this hypothesis cannot be tested, so we work with what we have.
When analyzing the results, one can keep this bias in mind and, for example, add some percentage to the figures obtained. I also decided to split the sample into two parts: resumes updated within the last half year and resumes updated earlier. This makes it possible to assess the trend, that is, whether the much-discussed crisis has affected the expectations and age of applicants.
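The split by update date and the optional percentage correction can be sketched as follows. The post's own implementation is in R and is not shown; this is an illustrative Python sketch, and the record layout (a dict with an `updated` date), the 182-day cutoff, and both function names are assumptions made for the example.

```python
from datetime import date, timedelta


def split_by_update_date(resumes, today, cutoff_days=182):
    """Split resumes into (recent, older) by their last update date.

    `resumes` is assumed to be a list of dicts with an "updated" date;
    182 days stands in for "half a year" (an assumption).
    """
    cutoff = today - timedelta(days=cutoff_days)
    recent = [r for r in resumes if r["updated"] >= cutoff]
    older = [r for r in resumes if r["updated"] < cutoff]
    return recent, older


def adjust_for_bias(salary, percent):
    """Raise a salary figure by `percent` to offset suspected understatement."""
    return salary * (100 + percent) / 100


# Hypothetical records, for illustration only.
sample = [
    {"salary": 120000, "updated": date(2015, 8, 1)},
    {"salary": 100000, "updated": date(2014, 12, 1)},
]
recent, older = split_by_update_date(sample, today=date(2015, 9, 1))
```

How large the correction should be is a judgment call; the post does not give a figure.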
Web-based graphical user interface
The search form on hh.ru is quite powerful, so there is no point in duplicating it in the R Shiny application. Instead, any initial request is built on the hh.ru site, and only its URL has to be pasted into the application. (This example uses the following request, which I have no personal stake in: Moscow, IT/Telecom, Programming/development, 3 years of experience.) The URL is automatically rewritten to show one hundred results per page, which speeds up collection, and the data are gathered from the result pages alone, without opening each resume, because the main characteristics of the sample are already listed there.
After the necessary transformations (dropping resumes without a ruble salary or without an age, if that was not already done on hh.ru, and processing the resume dates), several charts are built for the sample in addition to the summary chart from the post header; they are shown in fig. 1 and fig. 3. Everything is built with the Shiny Dashboard package. The charts in fig. 1 show the density distributions of both salary and age, with the deciles of these values marked. Incidentally, it is visible to the eye that applicants' expectations now do not differ qualitatively from those of more than half a year ago.
Fig. 1. Density distributions of age and salary, with deciles marked
A separate item in the side menu (fig. 2) lists all the resumes in a convenient table, where sorting and search filters help find something specific.
Fig. 2. Data table
The last chart (fig. 3) shows how the main characteristics of income (minimum, three quartiles, maximum, and outliers) vary with age, along with the overall income trend. It complements the scatter plot from the post header, which includes a smoothing curve.
Fig. 3. Box plot (distribution of income by age)
With this latest post about R, I wanted to show that any "data around us" can be processed and presented quickly and easily in a form that is visual and easier to take in. In this case one can take, for example, a "wide" view of an industry or field of work, or, conversely, a "narrow" one, refining the request as far as possible across many parameters (keyword, specialization), and see the main trends.
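The URL rewriting step can be illustrated with a small sketch. The post does not show its R code, so this is Python for illustration only; the query parameter names `items_on_page` and `page`, the example URL, and the function names are all assumptions, not confirmed details of hh.ru.

```python
import math
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url, name, value):
    """Return `url` with query parameter `name` set to `value`."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != name]
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url, total, page_size=100,
              size_param="items_on_page", page_param="page"):
    """Build one URL per result page (parameter names are hypothetical)."""
    base = set_query_param(url, size_param, page_size)
    pages = math.ceil(total / page_size)
    return [set_query_param(base, page_param, p) for p in range(pages)]


# Hypothetical search URL copied from the browser.
search_url = "https://hh.ru/search/resume?area=1&experience=between3And6"
urls = page_urls(search_url, total=5000)
```

With the 5,000-resume limit and one hundred results per page, this yields 50 pages to fetch.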
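The deciles marked on the density charts are the nine cut points that divide the sorted values into ten equal parts. A minimal Python sketch (the post itself uses R; the salary values here are invented for illustration, and the interpolation method is an assumption):

```python
from statistics import quantiles

# Hypothetical desired salaries, in thousands of rubles.
salaries = [60, 80, 90, 100, 110, 120, 130, 150, 180, 200, 250]

# n=10 returns the 9 decile cut points; "inclusive" treats the data as
# covering the full range, so every cut point lies between min and max.
deciles = quantiles(salaries, n=10, method="inclusive")
median = deciles[4]  # the 5th decile is the median
```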
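The numbers behind such a box plot can be sketched as follows. This is illustrative Python, not the post's R code; it assumes the common convention that points farther than 1.5 interquartile ranges from the box count as outliers, and the record layout and function names are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def box_stats(values):
    """Whisker ends, quartiles, and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low <= v <= high]
    return {
        "min": inside[0],    # lower whisker: smallest non-outlier
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": inside[-1],   # upper whisker: largest non-outlier
        "outliers": [v for v in data if v < low or v > high],
    }


def box_stats_by_age(resumes):
    """Group salaries by age; ages with fewer than 2 resumes are skipped."""
    by_age = defaultdict(list)
    for r in resumes:
        by_age[r["age"]].append(r["salary"])
    return {age: box_stats(v) for age, v in sorted(by_age.items())
            if len(v) >= 2}


# Hypothetical salaries (thousands of rubles) for one age group.
stats = box_stats([100, 110, 120, 130, 140, 500])
```

Here the value 500 falls outside the upper fence and is reported as an outlier, while the whiskers end at 100 and 140.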
This article is a translation of the original post at habrahabr.ru/post/266319/