
Two tasks from HeadHunter at Data Science Week: try to solve them yourself

In late August, after a series of free lectures at Data Science Week 2015, the organizers decided to hold a two-day datathon: a competition in which teams of programmers and analysts solved business problems from the field of Data Science.

There were three tasks at the datathon: two were prepared by the HeadHunter team and one by OZON. This was, I must say, not the easiest job, because most of our data is confidential. Nobody wants programmers and analysts practicing on real resumes or closed vacancy data. Still, we managed to put something together. To check the results, the organizers came up with metrics and wrote checkers. And these are the guys who won the datathon:


(photo of the winning team)
Right here and now I invite you to test your strength and solve the three problems the teams wrestled with at the datathon. The checkers and all data files are attached.

Task 1

According to our statistics, employers do not specify the proposed salary in every third vacancy. So when an applicant searches only for vacancies with a specified salary, they lose some of the offers that are relevant not only by description but also by compensation.

In the first task you need to predict the salary range from the data about a vacancy. A training sample was prepared from vacancies with specified salaries (with both upper and lower bounds, or with just one of them). The goal is to predict, as accurately as possible, the salary the employer could have offered.

Formal condition and data
Using the available data on HeadHunter vacancies for which the salary is known, forecast the salary for vacancies that lack this information.

Input data processing

Files for the first task:
  • train.txt - the training sample: information about a vacancy together with its salary.
  • test.txt - the test sample: information about a vacancy without a salary, plus an indication of which value you need to predict: "from", "to", or both.
  • nosalary.txt - vacancies with no salary specified at all (can be used for word2vec, for example; see the sketch after the file-format example below).

File format

{"salary": {"predict": "from", "to": null, "from": 4124, "currency": "RUR"}, "id": "0337565"} 


Evaluation

To evaluate your solution, the per-value error metric e is used:
e = |min(forecast, actual) / max(forecast, actual) - 1|
The aggregated metric is the square root of the sum of squared errors e divided by the number of observations.

You must predict at least 50% of the "from" values and 50% of the "to" values.
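
The formula is easy to restate in code. A minimal sketch of the computation (the attached checker is the authoritative implementation; this is only a readable restatement):

    import math

    def error(forecast, actual):
        """Per-value error: e = |min(forecast, actual) / max(forecast, actual) - 1|."""
        lo, hi = min(forecast, actual), max(forecast, actual)
        return abs(lo / hi - 1.0)

    def aggregate(pairs):
        """Square root of the sum of squared errors divided by the number of observations."""
        errs = [error(f, a) for f, a in pairs]
        return math.sqrt(sum(e * e for e in errs) / len(errs))

    print(aggregate([(40000, 50000), (120000, 100000)]))  # ~0.184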


Task 2

In the second task you had to develop an algorithm for suggesting similar search queries. Employers and job seekers often speak different languages: the applicant searches for vacancies by one keyword, while the employer names the vacancy something else. The task is to match these variants and help the applicant by suggesting queries that either broaden the search or, on the contrary, narrow it down to get more relevant results.

For the second task two datasets were dumped, and you could use either one or both at once. The first has 10 million rows of user id - query pairs. The second has 60 million rows: user id, vacancy id, and the query by which the vacancy was found and viewed. The ids are hashed (changed, but with the correspondences preserved). Here you had to apply your knowledge of collaborative filtering (a sketch follows the file list below).

Formal condition and data
Using the available search-query data from users of the HeadHunter portal, develop an algorithm for recommending similar queries.
How we did it at hh

Input data processing

The following files are available:
  • tu.tar.gz - fields: user id, their search query.
  • vuq.tar.gz - fields: id of the vacancy the user opened from the search results, user id, their search query.
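
A natural first attempt with these two files is item-based collaborative filtering over query-to-vacancy views: queries that lead users to the same vacancies are probably similar. A minimal sketch, assuming vuq.tar.gz unpacks to a single tab-separated file with the columns in the order listed above (the delimiter and column order are assumptions):

    import tarfile
    from collections import defaultdict
    from math import sqrt

    # query -> set of vacancy ids reached from that query
    query_vacancies = defaultdict(set)
    with tarfile.open("vuq.tar.gz", "r:gz") as tar:
        member = tar.getmembers()[0]  # assumption: a single file inside the archive
        for raw in tar.extractfile(member):
            parts = raw.decode("utf-8").rstrip("\n").split("\t")
            if len(parts) == 3:
                vacancy_id, user_id, query = parts
                query_vacancies[query.lower()].add(vacancy_id)

    def similar_queries(query, top_n=10):
        """Rank other queries by cosine similarity of their vacancy sets."""
        target = query_vacancies.get(query.lower(), set())
        scores = []
        for other, vacancies in query_vacancies.items():
            if other == query.lower():
                continue
            overlap = len(target & vacancies)
            if overlap:
                scores.append((overlap / sqrt(len(target) * len(vacancies)), other))
        return [q for _, q in sorted(scores, reverse=True)[:top_n]]

    print(similar_queries("hadoop"))

The first file (tu.tar.gz) can be used the same way, with user ids instead of vacancy ids, to broaden the co-occurrence signal.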


Evaluation

The adequacy of the similar queries was judged by the jury. Everyone who completed the task successfully was asked to check it on the following list of queries:
  • archivist
  • manager
  • sushi chef
  • head of planning department
  • 1s
  • bigdata
  • pediatrician
  • hadoop
  • brick
  • political scientist



Task 3 (from OZON)

Recommending rare goods: the tail of the distribution.

It is very easy to recommend a product that is already popular. The conversion of such a recommendation will be high, but from a business point of view it is useless. In the literature this is called the banana trap. It is much more interesting to recommend something from the rarely purchased goods. That is the task.

Formal condition and data
Using the available data on the goods of the Ozon.ru online store, develop a content-based recommendation system.

Data

  • ozon_train.txt - the training sample, one JSON object per line; for each item the most popular recommendations are given in true_recoms (a dictionary mapping the id of a recommended product to its weight; the higher, the better). The weights are clicks. The current Ozon.ru recommendation system is a mixture of content-based and collaborative filtering. Example:
     {"item":"24798277","true_recoms":{"24798314":1,"24798279":2,"24798276":4,"24798277":1,"24798280":2}} 
    The file contains one line with 40,000 recommendations; it is garbage.
  • ozon_test.txt - test sample
  • item_details_full.gz - product attributes: id, name, annotation (an HTML description) and parent_id. Example (the Russian text of name and annotation is omitted here):
     {"id":"4381194","name":"…","annotation":"…","parent_id":"18255189"} 
    parent_id groups modifications of the same product (for example, different iPhones).
  • catalogs.gz - which catalogs a product belongs to (there may be several entries per product). Example:
     {"itemid":"29040016","catalogid":"1179259"} 
  • catalog_path.gz - paths from the lower-level catalogs (the ones the goods actually sit in) to the root of the catalog tree; for each catalog the full path to the root is given. Example (the Russian catalog names are omitted here):
     {"catalogid":1125630,"catalogpath":[{"1125630":"…"},{"1125623":"…"},{"1112250":"…"},{"1095865":"…"}]} 
  • ratings.gz - the average product rating (stars). Example:
     {"itemid": 2658646, "rating": 4.0} 


Response format

The answer file must have the following format (the higher a product's weight, the closer it is considered to be; recommendations are sorted in descending order of weight). A sketch of producing such a file follows the example:
 {"item": "28759795", "true_recoms": {"28759801": 1, "28759817": 2, "28759803": 13}} 


Evaluation

To evaluate the result of your solution, the NDCG@1000 metric is used.
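
NDCG compares your ranking with the true recommendations: the true weights of the items you return are summed with a logarithmic position discount and normalised by the ideal ordering. A minimal sketch of NDCG@1000 with a linear gain (the attached checker is authoritative and may use a different gain variant):

    from math import log2

    def dcg(weights):
        """Discounted cumulative gain of relevance weights taken in ranked order."""
        return sum(w / log2(pos + 2) for pos, w in enumerate(weights))

    def ndcg_at_k(predicted_ids, true_recoms, k=1000):
        """predicted_ids: item ids sorted by descending weight; true_recoms: id -> true weight."""
        if not true_recoms:
            return 0.0
        gains = [true_recoms.get(item, 0.0) for item in predicted_ids[:k]]
        ideal = sorted(true_recoms.values(), reverse=True)[:k]
        return dcg(gains) / dcg(ideal)

    true = {"28759801": 1, "28759817": 2, "28759803": 13}
    print(ndcg_at_k(["28759803", "28759817", "28759801"], true))  # 1.0 for a perfect ranking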


Also, the organizers, together with 3data, offered the teams a cluster with Spark.

Checker code, answers and team solutions
Checkers for the first and third tasks
Data for the checkers: the reference salaries for the first task and the reference recommendations for the third.
A description of the solutions to the first and second problems from the team that took second place, and a presentation on applying online learning to the first problem.

And here there could be a link to your solution. Send it to us and we will gladly add it.


Thank you for your attention.

Source: https://habr.com/ru/post/268319/

