
1 year, 6 months ago
I have just returned from the DSW 2015 datathon, where we took second place, and before I forget anything I would like to share my impressions.
How we took part in the Data Science Week 2015 hackathon

Three tasks had been prepared for us, all of them rather interesting and difficult. Getting decent quality on any one of them would, I think, take more than a week. We were asked to fit everything into one day.

The first task was to predict salaries from vacancy descriptions. On HH, instead of the amount an applicant can expect, postings often just say "salary not specified". We were asked to fix this somehow and fill the gaps with approximate values. We were given ~700,000 records that look like this:
{
  "code": null,
  "site": {
    "id": "hh",
    "name": "hh.ru"
  },
  "published_at": "2015-08-18T18:19:00+0300",
  "accept_handicapped": false,
  "key_skills": [],
  "employment": {
    "id": "full",
    "name": "Полная занятость"
  },
  "id": "4027565",
  "archived": false,
  "contacts": null,
  "response_url": null,
  "relations": [],
  "employer": {
    "logo_urls": null,
    "vacancies_url": "https://api.hh.ru/vacancies?employer_id=1942790",
    "name": "Сирота ЛБ",
    "url": "https://api.hh.ru/employers/1942790",
    "alternate_url": "http://hh.ru/employer/1942790",
    "id": "1942790"
  },
  "response_letter_required": false,
  "billing_type": {
    "id": "free",
    "name": "Бесплатная"
  },
  "hidden": false,
  "type": {
    "id": "open",
    "name": "Открытая"
  },
  "specializations": [
    {
      "id": "3.318",
      "profarea_name": "Маркетинг, реклама, PR",
      "profarea_id": "3",
      "name": "Управление маркетингом"
    },
    {
      "id": "3.98",
      "profarea_name": "Маркетинг, реклама, PR",
      "profarea_id": "3",
      "name": "Исследования рынка"
    },
    {
      "id": "3.230",
      "profarea_name": "Маркетинг, реклама, PR",
      "profarea_id": "3",
      "name": "Продвижение, Специальные мероприятия"
    },
    {
      "id": "3.90",
      "profarea_name": "Маркетинг, реклама, PR",
      "profarea_id": "3",
      "name": "Интернет-маркетинг"
    }
  ],
  "premium": false,
  "description": "<p><strong>Обязанности:</strong> консультирование клиентов, работа с электронной почтой, обучение новых сотрудников, ведение отчета по проделанной работе.</p> <p> </p> <p><strong>Требования:</strong> уверенный пользователь ПК, скоростной интернет, целеустремленность, желание зарабатывать, терпеливость при работе с клиентами, ответственность, доброжелательность.</p> <p> </p> <p><strong>Условия</strong>: трудоустройство по Т.З.Р.Ф., своевременная зарплата, бонусы, премии, бесплатное обучение и сопровождение до результата, карьерный рост.</p> <p>Обращаться по эл. почте</p>",
  "schedule": {
    "id": "fullDay",
    "name": "Полный день"
  },
  "suitable_resumes_url": null,
  "test": null,
  "department": null,
  "allow_messages": true,
  "address": null,
  "salary": {
    "to": null,
    "from": 31000,
    "currency": "RUR"
  },
  "name": "Менеджер он-лайн офиса",
  "created_at": "2015-08-18T18:19:00+0300",
  "area": {
    "url": "https://api.hh.ru/areas/1884",
    "id": "1884",
    "name": "Льгов"
  },
  "experience": {
    "id": "between1And3",
    "name": "От 1 года до 3 лет"
  },
  "negotiations_url": null,
  "branded_description": null
}

There was also a test set with empty "salary.to" and "salary.from" fields, which had to be filled in. In general, this data is very interesting. I think it can tell you a lot about the labor market in Russia. For a start, we looked at some basic things. The data we were given was fresh. A few days before the export there is a jump in the number of new vacancies. The guys explain this by saying that recruiters often post their ads at the beginning of the week so that more people see them.

Employers ask for people with no experience surprisingly often, and they usually need people for a full working day.

While I studied the data with interest and periodically shared my observations, Dima, without digging much into the contents, threw everything into Vowpal Wabbit and got a model that immediately took first place and stayed there in proud solitude almost until the end of the hackathon. I did not understand what the essence of it was, so over to Dima:
One of the main goals was to show that Spark should not be used indiscriminately everywhere, and that a large class of tasks can be solved quite effectively on desktop machines, even when the sample does not fit in memory.

In such cases online learning (en.wikipedia.org/wiki/Online_machine_learning) works well. In short, we take a handful of examples from the sample (more often just one) and make a single step of the optimization algorithm. The simplest variant, stochastic gradient descent (en.wikipedia.org/wiki/Stochastic_gradient_descent), is easy to describe and intuitively clear. Its convergence, however, is so-so, so more interesting methods are usually used. Follow-the-Regularized-Leader, for example, has proved itself well in applications; you can read about it in more detail here, for example: web.stanford.edu/class/cs229t/notes.pdf.
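To make the idea concrete, here is a minimal sketch of online training with plain stochastic gradient descent for a linear model with squared loss. It is only an illustration and not what Vowpal Wabbit does internally: the function names, the learning rate and the synthetic data stream are all assumptions made for the example.

```python
import random


def sgd_online(stream, n_features, lr=0.05):
    """Single pass of SGD for a linear model with squared loss.

    Each example is seen once and then discarded, so the whole
    dataset never has to fit in memory.
    """
    w = [0.0] * n_features
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # Gradient of 0.5 * err**2 with respect to w[i] is err * x[i].
        for i, xi in enumerate(x):
            w[i] -= lr * err * xi
    return w


def synthetic_stream(n, seed=0):
    """Hypothetical data: y = 3 + 2 * x, generated lazily."""
    rng = random.Random(seed)
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        yield [1.0, x1], 3.0 + 2.0 * x1


weights = sgd_online(synthetic_stream(20000), n_features=2)
```

Because the stream is a generator, memory use stays constant no matter how many examples pass through, which is the property that matters when the sample does not fit in RAM.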

But before using the algorithm, the data had to be prepared. Based on the analysis done earlier, and on common sense, a number of features were selected: geolocation, company name, experience, schedule, required skills and so on. The text description is a more interesting case. Since we wanted the simplest and fastest solution, we dropped methods that are popular but computationally expensive. In the end we used feature hashing (en.wikipedia.org/wiki/Feature_hashing), which works rather effectively with linear decision rules and high dimensions, and also solves the problem of encoding categorical features.
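A rough sketch of the hashing trick might look like this. The token naming scheme (`area=...`, `experience=...`), the choice of `zlib.crc32` as the hash and the 18-bit table size are assumptions for illustration; Vowpal Wabbit uses its own hash function and defaults.

```python
import zlib


def hash_features(tokens, n_bits=18):
    """Hashing trick: map string features to a sparse {index: count} vector."""
    n_buckets = 1 << n_bits
    vec = {}
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(tok.encode("utf-8")) % n_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec


# Categorical fields and title words taken from the sample vacancy above.
tokens = [
    "area=1884",
    "experience=between1And3",
    "schedule=fullDay",
    "employer=1942790",
    "менеджер", "он-лайн", "офиса",
]
vector = hash_features(tokens)
```

Categorical values and words go through the same function, so no dictionary of known categories has to be built or stored. The price is that two different features can occasionally collide in one bucket.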

We did not particularly want to reinvent the wheel, so we used a ready-made solution: Vowpal Wabbit, which is very popular on Kaggle. This utility was originally conceived as a tool for online training of purely linear models, but it has grown over time. Its useful niceties include the ability to cache data files, which speeds things up considerably (the speed is limited by reading data from disk), and progressive validation, which lets you estimate the speed of convergence.
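Progressive validation is easy to show on a toy model: every example is first used to score the current model and only then to update it, so the running average loss is computed on data the model has not yet seen. The sketch below is an illustration under assumed names and synthetic data, not Vowpal Wabbit's implementation.

```python
import random


def progressive_validation(stream, n_features, lr=0.05):
    """Online SGD that scores every example before learning from it."""
    w = [0.0] * n_features
    total_loss = 0.0
    running_avg = []
    for n, (x, y) in enumerate(stream, start=1):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # The example is still unseen here, so this is an honest test loss.
        total_loss += err * err
        running_avg.append(total_loss / n)
        for i, xi in enumerate(x):
            w[i] -= lr * err * xi
    return w, running_avg


# Hypothetical data: y = 3 + 2 * x.
rng = random.Random(0)
data = []
for _ in range(5000):
    x1 = rng.uniform(-1.0, 1.0)
    data.append(([1.0, x1], 3.0 + 2.0 * x1))

_, curve = progressive_validation(data, n_features=2)
```

Watching how quickly `curve` falls gives an estimate of convergence without setting aside a separate validation set.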

We did not manage to improve the model noticeably with quadratic features, regularization, or a larger number of passes.

The whole process of data preparation and training took about 6 minutes, of which 5 went on conversion to the required format; training itself takes less than a minute.
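The conversion step could look roughly like the sketch below, which turns one vacancy record into a line of Vowpal Wabbit's text input format (a label, then `|namespace` followed by space-separated features). Which fields were actually used is not stated here, so the field selection, the namespace names `c` and `t`, and the helper names are assumptions.

```python
import re


def vw_token(text):
    """Replace characters that are special in VW's text format ('|', ':', spaces)."""
    return re.sub(r"[|:\s]+", "_", text.strip())


def to_vw_line(vacancy):
    """Build one training line: label, categorical namespace, text namespace."""
    label = vacancy["salary"]["from"]
    categorical = [
        "area_" + vw_token(vacancy["area"]["id"]),
        "exp_" + vw_token(vacancy["experience"]["id"]),
        "sched_" + vw_token(vacancy["schedule"]["id"]),
        "emp_" + vw_token(vacancy["employer"]["id"]),
    ]
    words = [vw_token(w) for w in vacancy["name"].lower().split()]
    return "{} |c {} |t {}".format(label, " ".join(categorical), " ".join(words))


# A cut-down copy of the sample record shown earlier.
vacancy = {
    "salary": {"to": None, "from": 31000, "currency": "RUR"},
    "area": {"id": "1884"},
    "experience": {"id": "between1And3"},
    "schedule": {"id": "fullDay"},
    "employer": {"id": "1942790"},
    "name": "Менеджер он-лайн офиса",
}
line = to_vw_line(vacancy)
```

A real converter would also have to decide what to do with records where only "salary.to" is filled in, and with currencies other than RUR; this sketch ignores both.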

And here is his presentation, just in case.

I marveled for a while at how young people solve analytical problems nowadays and moved on to the second dataset. The task there was to build a system for finding similar search queries.

In principle, the guys from HH have already built such a system, and even wrote a detailed article about it a year ago, but there is not much data that is not covered by the NDA, so I had to reinvent the wheel. That said, my wheel turned out quite decent. We were given data generously. There were two files. The first listed the queries of each user (~100,000,000 lines):
Специалист	755713242
Call-центр	293043490
повар универсал	-1453491075
Бухгалтер	368599217
	83220527
Бухгалтер по расчету заработной платы	2002085826
	-1690082898
кладовщик	199265113
Водитель категории C	-571664634
starling	938815142
...

The second listed, for each query, which result each user clicked (~60,000,000 lines):
374962018	-1871849048	Перевозки
435199331	656053665	java
-479980995	2055924405	развитие территории
-312078053	1785295198	стажер
373352347	-1306352914	swatch group
-335100665	-786187311	обработка изображений
430556647	834763896	директор
430528038	1620277313	Бухгалтер
435232940	-1022351920	Программист 1с
433204418	-2121514172	координатор сервиса
...

Naturally, I could not help noticing how similar this task is to the task of finding similar VK groups, which I wrote about recently. Within a few hours a magnificent graph had been built.

Each vertex corresponded to a query, and similar queries merged into clusters. There was, for example, a cluster of queries about pharmacists and another about customs officers.
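How the edges of the graph were defined is not described here, so the following is only one plausible sketch: two queries are linked when they led to a click on the same result, and the edge weight counts the results they share. The column order (user, clicked result, query) is a guess from the sample lines, and the function names and the tiny log are made up.

```python
from collections import defaultdict
from itertools import combinations


def build_query_graph(click_rows):
    """Link two queries whenever they led to a click on the same result.

    Edge weight = number of distinct results the two queries share.
    """
    queries_by_result = defaultdict(set)
    for _user_id, result_id, query in click_rows:
        queries_by_result[result_id].add(query.strip().lower())

    edges = defaultdict(int)
    for queries in queries_by_result.values():
        for a, b in combinations(sorted(queries), 2):
            edges[(a, b)] += 1
    return edges


def similar_queries(query, edges, top_n=5):
    """Neighbours of a query, strongest edges first."""
    query = query.strip().lower()
    scored = []
    for (a, b), weight in edges.items():
        if a == query:
            scored.append((weight, b))
        elif b == query:
            scored.append((weight, a))
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [name for _weight, name in scored[:top_n]]


# Tiny made-up log in the assumed (user, clicked result, query) layout.
rows = [
    (1, 10, "java"),
    (2, 10, "Java developer"),
    (3, 10, "программист"),
    (4, 11, "java"),
    (5, 11, "java developer"),
    (6, 12, "бухгалтер"),
]
graph = build_query_graph(rows)
neighbours = similar_queries("java", graph)
```

On tens of millions of lines the pair enumeration would need care, because a very popular result generates a quadratic number of query pairs.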

The evaluation for this task was interesting. At the end of the hackathon we had to go on stage and find, in real time, recommendations for queries made up by the jury.

The first and second tasks went rather well for us, which cannot be said of the third and last one. There we had to build a product recommendation system for Ozon. The data looked roughly like this:
[
    100000, 
    {
        "0": "данная книга содержит подробное описание широко распространенных моделей телевизоров выпускаемых фирмами goldstar supra shivaki ham собранных на шасси pc04 и pc91a приведено комплексное описание работы телевизоров по функциональной и принципиальнойсхемам методика поиска неисправностей и регулировка этих телевизоров схемы сопровождаются таблицами назначения элементов данные модели широко распространены на нашем рынке и часто вызывают интерес у людей занимающихся ремонтом телевизионной аппаратуры в ответ на их многочисленные обращения была написана данная книга книга предназначена для специалистов занимающихся обслуживанием и ремонтом телевизионной техники и подготовленных радиолюбителей", 
        "1": "телевизоры goldstar на шасси pc04 pc91a", 
        "2": "ю бобылев", 
        "6": "зарубежная электроника", 
        "32": "русский", 
        "18": "наука и техника", 
        "53": "5_88977_036_5"
    }
]
[
    1000001, 
    {
        "0": "регион все регионы br рейтинг mpaa not rated p", 
        "1": "skinny tiger and fatty dragon", 
        "7": "skinny tiger and fatty dragon"
    }
]
[
    1000003, 
    {
        "0": "регион 1 usa and canada br рейтинг mpaa r not for sale to persons under age 18 p", 
        "1": "skipped parts", 
        "7": "skipped parts"
    }
]
[
    1000005, 
    {
        "0": "регион 1 usa and canada br рейтинг mpaa not rated p", 
        "1": "sky wars", 
        "7": "sky wars"
    }
]
...

We were given ~10,000,000 such records, which is a hell of a lot. There was also a pile of other files, but I did not even look at them. We had to find good, original recommendations for these products. Besides the fact that the task itself is, to put it mildly, not simple and the data is large, we had one more problem: Spark. About 12 hours went on loading the data into it and getting the simplest operations to run. A hackathon is definitely not the best place for a first acquaintance with such technologies. As a result, two hours before the deadline we made the tough decision to give up on Spark and knock together at least some solution locally. My little MacBook, as always, did not let me down: the disk ran out of space, but we did get a basic solution.

We did a very primitive thing. All the words from a product's description were put into a set. Then we went through the list of products and computed the intersection. Products with a large intersection went into the recommendations. In principle, the method worked. For the item

[
    28759795, 
    {
        "1": "bruder тягач mack с прицепом платформой с колесным экскаватором погрузчиком цвет красный желтый черный", 
        "5": "спецтехника", 
        "6": "bworld", 
        "9": "машинки танки самолеты", 
        "10": "bruder", 
        "11": "bruder spielwaren gmbh", 
        "45": "23 февраля", 
        "15": "тягач прицеп платформа экскаватор погрузчик", 
        "38": "для мальчиков", 
        "30": "сын"
    }
]

we recommended:
[
    30161483, 
    {
        "1": "bruder самосвал mercedes benz с колесным экскаватором цвет красный желтый", 
        "5": "спецтехника", 
        "6": "bworld", 
        "9": "машинки танки самолеты", 
        "10": "bruder", 
        "11": "bruder spielwaren gmbh", 
        "45": "23 февраля", 
        "15": "самосвал экскаватор", 
        "38": "для мальчиков", 
        "30": "сын"
    }
]
[
    28759817, 
    {
        "1": "bruder эвакуатор mercedes benz с внедорожником цвет красный желтый черный", 
        "5": "спецтехника", 
        "6": "bworld", 
        "9": "машинки танки самолеты", 
        "10": "bruder", 
        "11": "bruder spielwaren gmbh", 
        "45": "23 февраля", 
        "15": "эвакуатор внедорожник аксессуары", 
        "38": "для мальчиков", 
        "30": "сын"
    }
]
[
    28759801, 
    {
        "1": "bruder фургон scania цвет зеленый белый красный", 
        "5": "спецтехника", 
        "6": "bworld", 
        "9": "машинки танки самолеты", 
        "10": "bruder", 
        "11": "bruder spielwaren gmbh", 
        "45": "23 февраля", 
        "15": "машина погрузчик 2 паллета", 
        "38": "для мальчиков", 
        "30": "сын"
    }
]
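The word-set approach described above can be sketched in a few lines. The records are cut-down copies of the examples shown here (only some fields are kept), and the function names and the tie-breaking rule are assumptions.

```python
def words_of(fields):
    """Collect the words from every text field of one product."""
    words = set()
    for text in fields.values():
        words.update(text.split())
    return words


def recommend(target_id, catalog, top_n=3):
    """Rank all other products by the size of their word overlap with the target."""
    target_words = words_of(catalog[target_id])
    scored = []
    for item_id, fields in catalog.items():
        if item_id == target_id:
            continue
        overlap = len(target_words & words_of(fields))
        scored.append((overlap, item_id))
    # Largest overlap first; ties are broken by item id for determinism.
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [item_id for _overlap, item_id in scored[:top_n]]


catalog = {
    28759795: {
        "1": "bruder тягач mack с прицепом платформой с колесным экскаватором погрузчиком цвет красный желтый черный",
        "15": "тягач прицеп платформа экскаватор погрузчик",
    },
    30161483: {
        "1": "bruder самосвал mercedes benz с колесным экскаватором цвет красный желтый",
        "15": "самосвал экскаватор",
    },
    28759801: {
        "1": "bruder фургон scania цвет зеленый белый красный",
        "15": "машина погрузчик 2 паллета",
    },
    1000005: {
        "0": "регион 1 usa and canada br рейтинг mpaa not rated p",
        "1": "sky wars",
    },
}
top = recommend(28759795, catalog, top_n=2)
```

Every call scans the whole catalog and rebuilds each word set, so the cost of one recommendation grows linearly with the number of products.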


However, producing recommendations for a single product took us about a minute, and there were ~60,000 test cases, which at that rate means roughly 1,000 hours of computation. We had 15 minutes left until the deadline. So we blew the third task.

First place went to the guys who more or less solved all three problems. Respect to them. But we did well too.

This article is a translation of the original post at habrahabr.ru/post/265721/
