In late August, after a series of free lectures at Data Science Week 2015, the organizers held a two-day datathon: a competition in which teams of programmers and analysts solved business problems from the field of Data Science.
There were three tasks at the datathon: two prepared by the HeadHunter team and one by OZON. Preparing them was, I must say, not the easiest job, because most of our data is confidential: nobody wants programmers and analysts practicing on real resumes or private vacancy data. Still, we managed to put something together. To check the results, the organizers devised metrics and wrote checkers. And these are the people who won the datathon:

(photo: the winning team)
Right here and now I invite you to test your strength on the three problems the teams wrestled with at the datathon. The checkers and all the data files are attached.
Task 1

According to our statistics, employers do not specify the offered salary in every third vacancy. So when an applicant searches only among vacancies with a specified salary, he loses some of the offers that are relevant not only in description but also in compensation.
In the first task, you need to predict the salary interval from the data in the vacancy. The training sample was prepared from vacancies with a specified salary (both bounds, or only one of them). The task is to predict, as accurately as possible, the salary an employer would plausibly offer.
Formal statement and data

Using the HeadHunter vacancies for which the salary is known, forecast the salary for vacancies that lack this information.
Input data

Files for the first task:
- train.txt - the training sample: information on each vacancy together with its salary.
- test.txt - the test sample: information on each vacancy without the salary, plus a field naming the value you need to predict ("from", "to", or both).
- nosalary.txt - vacancies with no salary specified at all (can be used for word2vec, for example).
File format

{"salary": {"predict": "from", "to": null, "from": 4124, "currency": "RUR"}, "id": "0337565"}
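For orientation, here is a minimal sketch of reading these JSON-lines files and producing a naive baseline that answers every request with the training median. The file names come from the task; the field handling and the answer layout are my assumptions.

import json

def read_jsonl(path):
    # One JSON record per line, as in the format example above.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Collect the known salary bounds from the training sample.
froms = sorted(r["salary"]["from"] for r in read_jsonl("train.txt")
               if r["salary"].get("from") is not None)
tos = sorted(r["salary"]["to"] for r in read_jsonl("train.txt")
             if r["salary"].get("to") is not None)
median_from, median_to = froms[len(froms) // 2], tos[len(tos) // 2]

# Naive baseline: answer every request with the training median.
with open("answer.txt", "w", encoding="utf-8") as out:
    for rec in read_jsonl("test.txt"):
        wanted = rec["salary"]["predict"]  # "from", "to", or a token meaning both (exact token is an assumption)
        pred = {}
        if wanted != "to":
            pred["from"] = median_from
        if wanted != "from":
            pred["to"] = median_to
        out.write(json.dumps({"id": rec["id"], "salary": pred}) + "\n")

A real solution would, of course, model the salary from the vacancy text and attributes; this only shows the plumbing.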
Evaluation

Each predicted value is scored with the error metric e:

e = |min(forecast, actual) / max(forecast, actual) - 1|

The aggregate metric is the root mean square of e: sqrt(sum(e^2) / N), where N is the number of observations.
You must predict at least 50% of the "from" values and 50% of the "to" values.
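In code, as I read the description (the names are mine), the scoring looks roughly like this:

import math

def err(forecast, actual):
    # Per-value error: e = |min(f, a) / max(f, a) - 1|.
    return abs(min(forecast, actual) / max(forecast, actual) - 1)

def score(pairs):
    # Aggregate: root mean square of e over all predicted values.
    errors = [err(f, a) for f, a in pairs]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# e is 0 for a perfect forecast and grows towards 1 as the forecast
# diverges: err(50000, 100000) == 0.5.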
Task 2

In the second task, you had to develop an algorithm for suggesting similar search queries. Employers and job seekers often speak different languages: the applicant searches with one keyword while the employer titles the same vacancy differently. The task is to match these variants and help the applicant by offering queries that either broaden his search or, on the contrary, narrow it down, producing more relevant results.
Two datasets were exported for this task; you could use either one or both at once. The first holds 10 million rows of user id / query pairs. The second holds 60 million rows: user id, vacancy id, and the query through which that vacancy was found and viewed. The ids were hashed (consistently, so matching pairs still match). This is where knowledge of collaborative filtering came in handy; a sketch of one possible approach follows the file list below.
Formal statement and data

Using the search queries of HeadHunter users, develop an algorithm that recommends similar queries.
Input data

The following files are available:
- tu.tar.gz - fields: user id, the user's search query.
- vuq.tar.gz - fields: id of the vacancy the user opened from the search results, user id, the user's search query.
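As promised above, a minimal sketch of one possible approach: treat each user's query stream as a "sentence" and train word2vec over whole queries, so that queries used in similar contexts end up close together. gensim and a tab-separated extraction of tu.tar.gz are my assumptions.

import csv
from collections import defaultdict

from gensim.models import Word2Vec  # assumed dependency

# Group queries by user: each user's query stream becomes one "sentence".
sessions = defaultdict(list)
with open("tu.txt", encoding="utf-8") as f:  # extracted from tu.tar.gz; the delimiter is a guess
    for user_id, query in csv.reader(f, delimiter="\t"):
        sessions[user_id].append(query.strip().lower())

model = Word2Vec(
    sentences=list(sessions.values()),
    vector_size=100,  # embedding dimensionality
    window=5,         # queries issued close together count as context
    min_count=5,      # drop very rare queries
)

# Queries with similar usage contexts land close together.
for query, similarity in model.wv.most_similar("hadoop", topn=10):
    print(f"{similarity:.3f}  {query}")

The same idea extends to vuq.tar.gz, where a vacancy viewed under two different queries is direct evidence that those queries are related.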
Evaluation

The adequacy of the suggested similar queries was judged by a jury. Everyone who completed the task was asked to test it on the following list of queries:
- archivist
- manager
- sushi chef
- head of planning department
- 1s
- bigdata
- pediatrician
- hadoop
- brick
- political scientist
Task 3 (from OZON)

Recommendations for rare goods: the tail of the distribution.
It is very easy to recommend a product that is already popular. The conversion of such a recommendation will be high, but from a business point of view it is useless. In the literature this is called a banana trap: everyone takes the bananas a store recommends, but they would have bought them anyway. It is much more interesting to recommend something from rarely purchased goods. That is the task.
Formal statement and data

Using the available data on goods in the Ozon.ru online store, develop a content-based recommendation system.
Data

- ozon_train.txt - the training sample: one JSON object per line, giving for each item the most popular recommendations in true_recoms (a dictionary mapping the id of a recommended product to its weight; the higher, the better). The weights are clicks, produced by the current Ozon.ru recommendation system, a mixture of content-based and collaborative filtering. Example:
{"item":"24798277","true_recoms":{"24798314":1,"24798279":2,"24798276":4,"24798277":1,"24798280":2}}
The file contains one line with 40,000 recommendations - it is garbage (dropped in the loading sketch after this list).
- ozon_test.txt - the test sample.
- item_details_full.gz - product attributes. Example:
{"id":"4381194","name":" - - â„– 84 ( )","annotation":" , . 84.<br>\r\n\" \" (1846) - . - XVI - III, . , . <br>\r\n 245 , 1903 . . . \"-\" .","parent_id":"18255189"}
parent_id groups modifications of one product (for example, different iPhones).
- catalogs.gz - the catalogs in which a product is located (there may be several entries per product). Example:
{"itemid":"29040016","catalogid":"1179259"}
- catalog_path.gz - for each lowest-level catalog (the ones in which products actually sit), the full path to the root of the catalog tree. Example:
{"catalogid":1125630,"catalogpath":[{"1125630":" . "},{"1125623":" - !"},{"1112250":" (.-)"},{"1095865":" "}]}
- ratings.gz - the average rating of a product (stars). Example:
{"itemid": 2658646, "rating": 4.0}
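A loading sketch for these files: they are (gzipped) JSON lines with the field names shown above; everything else, including how I drop the 40,000-recommendation garbage line, is my assumption.

import gzip
import json

def read_jsonl(path, open_fn=open):
    with open_fn(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Training pairs: item -> {recommended item: weight}; skip the one
# line with ~40,000 recommendations that is known to be garbage.
train = {r["item"]: r["true_recoms"]
         for r in read_jsonl("ozon_train.txt")
         if len(r["true_recoms"]) < 40000}

# Product attributes, catalog membership and ratings come gzipped.
details = {r["id"]: r for r in read_jsonl("item_details_full.gz", gzip.open)}
ratings = {str(r["itemid"]): r["rating"] for r in read_jsonl("ratings.gz", gzip.open)}

catalogs_of = {}
for r in read_jsonl("catalogs.gz", gzip.open):
    catalogs_of.setdefault(r["itemid"], []).append(r["catalogid"])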
Response format

The file must have the following format (the higher a product's weight, the closer it is considered; that is, recommendations are sorted in descending order of weight):
{"item": "28759795", "true_recoms": {"28759801": 1, "28759817": 2, "28759803": 13}}
Evaluation

The result is evaluated with the NDCG@1000 metric.
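A sketch of NDCG@1000 under the usual definition (the checker's exact gain and discount conventions are an assumption):

import math

def dcg(gains):
    # Discounted cumulative gain: rank 1 is divided by log2(2), rank 2 by log2(3), ...
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(predicted, actual, k=1000):
    # predicted: item ids sorted by your score, descending.
    # actual: dict of item id -> true weight (clicks).
    if not actual:
        return 0.0
    gains = [actual.get(item, 0.0) for item in predicted[:k]]
    ideal = sorted(actual.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal)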
The organizers, together with 3data, also offered the teams a cluster running Spark.
Thanks for your attention.