📜 ⬆️ ⬇️

HeadHunter Job Analysis


Once I wondered what if I tried to analyze the vacancies and compose some tops on them. Find out who pays the most, who is most in demand and much more.


As a data source, I used the well-known HeadHunter. Were collected and processed jobs for May of this year. Only for a month, because the API does not allow getting more.


Data collection


The HeadHunter API has excellent documentation that is located in the repository . Requests should be made to the https://api.hh.ru/ domain with the installed User-Agent , preferably of type _/_ (__) (other User-Agent options sometimes work, but if the server doesn’t like something, it will return an error ).


The logic of the collection is very simple, so I implemented it in bash using cURL and jq . However, I want to share a few nuances.


Pagination


For search of vacancies in various parameters there is endpoint GET /vacancies .


 curl -A 'irenica (https://irenica.com/)' 'https://api.hh.ru/vacancies' 

Search results will be divided into pages, the size of which is the parameter per_page (20 by default and 100 maximum). You can select a specific page by specifying the page parameter (the numbering starts from 0).


In the pages service information returned with job openings, the total number of pages of the result will be indicated.


With this, you can easily search through all pages:


 declare -ii=0 while true; do declare url="https://api.hh.ru/vacancies?per_page=100&page=$i" declare page="$(curl -A 'irenica (https://irenica.com/)' "$url")" #  $page ((i++)) declare -i totalCount=$(echo "$page" | jq '.pages') if ((i >= totalCount)); then break fi done 

Full job details


However, the search results contain only part of the job data. To get everything, you need to make a separate request for the endpoint of the form GET /vacancies/id_ .


Partial data on vacancies are in the items field of search results. At first we will collect from them ID of vacancies:


 declare vacanciesIds="$(echo "$page" | jq -r '.items[].id')" 

Then we will request complete information about the relevant vacancies separately:


 for vacancyId in $vacanciesIds; do declare url="https://api.hh.ru/vacancies/$vacancyId" declare vacancy="$(curl -A 'irenica (https://irenica.com/)' "$url")" #  $vacancy done 

Search limit bypass


The HeadHunter API has one feature - no matter how many are found, a maximum of 2000 will be returned. At the same time, the actual amount found will also be returned to the found field of the search results. Thanks to this, it is possible to know for sure whether you received all the requested data, or if there are losses.


To get around this limitation, I came up with the following. When searching, you can specify the length of time when vacancies of interest were published (via the date_from and date_to , which accept the date in ISO 8601 format). You can take a small interval and sort through all the results with such pieces: after all, the shorter the time interval, the less vacancies were published for it.


It is worth paying attention that the vacancies published only for the last month are returned. Therefore, it makes no sense to set the range anymore.


To iterate over the length of time, the latter is best represented as Unix time:


 declare -i startTime=$(date -d '-1 month' +%s) declare -i endTime=$(date -d now +%s) while ((startTime <= endTime)); do declare -i intervalEnd=$((startTime + 60*60)) declare startTimeIso="$(date -d @$startTime +%FT%T)" declare intervalEndIso="$(date -d @$intervalEnd +%FT%T)" # ... declare url="https://api.hh.ru/vacancies?per_page=100&page=$i&date_from=$startTimeIso&date_to=$intervalEndIso" # ... startTime=$intervalEnd done 

Payroll processing


To collect statistics, it was necessary to group vacancies on certain grounds. At bash, doing this was already problematic, so I used Python.


The logic of the collection is nothing special - the accumulation of data in the associative array, sorting and output to CSV. However, again a few nuances.


Salary fork


It should be noted that the salary is presented in the form of two numbers - the minimum and maximum, and any of them may be absent.


Since for analysis it was necessary to have one number, I decided to use the lower limit, and only if it is absent, the upper one.


 salary = None if vacancy['salary']: if vacancy['salary']['to']: salary = vacancy['salary']['to'] if vacancy['salary']['from']: salary = vacancy['salary']['from'] 

Exchange rates


Salary in a job can be specified in different currencies, and they - have a different rate. The HeadHunter API has endpoint GET /dictionaries containing all the necessary predefined values. Currency rates are presented in the currency field. For convenience, it would be better to put their list in an associative array, where the key is the alphabetic currency code:


 currencies = {} dictionaries = requests.get('https://api.hh.ru/dictionaries').json() for currency in dictionaries['currency']: currencies[currency['code']] = currency['rate'] 

Now, during processing, it will be easy to convert all salaries into one currency:


 salary /= currencies[vacancy['salary']['currency']] 

NDFL accounting


In some vacancies, the salary is indicated before the payment of personal income tax, in some - after. The specific field is indicated by the gross field: it is true in the first case and false in the second.


I decided to transfer all salaries to the option after tax:


 if vacancy['salary']['gross']: salary -= salary * 0.13 

Results analysis


Now is the time to show the numbers.


Remote work


Probably many of those who read this post, would like to work on the remote. But as we see, work from home in our country is not very much quoted yet. Salary is much lower, the number of vacancies is significantly less. And therefore there is less opportunity to choose for the applicant.


And it is rather strange, because in many professions and many firms (by the specific nature of the tasks), the presence of a person in the office is completely unnecessary. But this is an eternal argument.


NameSalary, averageSalary, minimumSalary, maximumNumber
Domestic staff11253610977130000nineteen
Information technology, Internet, telecom552251000300,0002828
Top management476879474100,00023
Extraction of raw materials4657920,0009089880
Installation and Service4543911874696009
Public service, non-profit organizations4491120,00090000nineteen
Working staff4421894996786037
Production423882372100,000236
Construction, real estate3989670110000329
Transport, logistics376629490100,000223


Applicants with disabilities


However, there is an even smaller category of vacancies - for people with disabilities. And this is completely illogical - even if employers do not want remote workers, but of those who are ready for this, why are there so few who think about people with disabilities? If you do not care that a person is in three time zones, what difference does it make to you, is he able to walk, for example?


Perhaps many of you are familiar with people with disabilities. I, too, and I wondered how difficult it is for them to find a job, and what they can count on.


NameSalary, averageSalary, minimumSalary, maximumNumber
Public service, non-profit organizations69675870090000eight
Top management4870530,0008242515
Information technology, Internet, telecom453214350200,0001050
Science education45056315890000376
Purchases4359115,00080,0009
Construction, real estate4214822250,000210
Production4096910,000130500189
Accounting, management accounting, finance companies363872610113100125
Lawyers343082610160,000131
Security334142290000178


Students


We all start with something, namely, with a job search, without any experience. I decided to assess the situation with positions open to such candidates.


The number of vacancies is encouraging for quick employment. And I do not know how realistic it is to get the maximum salary, but you can even somehow live by the average figures.


NameSalary, averageSalary, minimumSalary, maximumNumber
Counseling6260115002218502504
Construction, real estate55855209499896455
Top management5082611310400,000111
Extraction of raw materials381928,000100,000328
Security346173954100,0005844
Medicine, Pharma34475450200,00011776
Transport, logistics336005001500008,000
Science education3142611001245101660
Sales30444one35000052566
Installation and Service30360826480,000381


Common top


And now the most interesting thing: who pays the most? Sorted all vacancies found without any filters.


Of course, this is top management. Who would doubt that.


A curious fact: if you pay attention to the average salary in all tables, you can see that it is not that different.


NameSalary, averageSalary, minimumSalary, maximumNumber
Top management787891502,000,0002408
Extraction of raw materials616998,0001800002302
Counseling597971500500,0003762
Information technology, Internet, telecom527772668480425900
Construction, real estate485872094998933229
Production42007one261,00027269
Working staff4120325200,00043079
Car business38555208242549269
Installation and Service38412251800002390
Purchases3784650261,0002658


Cleaning woman


And here is the easiest way: why study for 5 years, if you can just wash the office? Below is the result of filtering the top vacancies for the query "cleaning *".


What if you get a job in several offices and come in the evening for a couple of hours for cleaning? So you can live quite luxurious. We will consider it life hacking.


NameSalary, averageSalary, minimumSalary, maximumNumber
Top management6300040,00087000eight
Marketing, Advertising, PR50,00050,00050,0006
Extraction of raw materials45,00045,00045,0003
HR management, trainings3324679088700058
Accounting, management accounting, finance companies32,00030,00035,000ten
Security3150720,00070,0006
Sales29696473755,000159
Construction, real estate2902441380,00073
Transport, logistics249871099045,00026
Car business24465712445,00061


Top by city


Finally, I decided to check the number of open positions in cities. The first places are not surprising, but then there are curious and even unexpected positions.


NameNumber
Moscow31137
St. Petersburg11745
Minsk7608
Almaty4386
Kiev3398
Yekaterinburg3182
Novosibirsk3097
Kazan3066
Ufa2980
Nizhny Novgorod2876


Repository


All code from the article, with improvements and instructions, is available in the repository .


')

Source: https://habr.com/ru/post/418281/


All Articles