
Analyzing market requirements for a data scientist

There is a lot of information on the Internet about what a data scientist should know and be able to do. But I decided to find out the requirements straight from employers, so we will analyze the text of job vacancies.


First, we will formulate the task and develop a plan:

Task:
Look through all the job vacancies on the market and find out the general requirements stated in them.

Plan:

1. Collect all vacancies matching the query "Data Scientist" in a format convenient for processing.
2. Find the words and phrases that occur most frequently in the descriptions.

For implementation, you need a little knowledge of SQL and Python.

If you don't have it, here is where to start:
I recommend sqlbolt.com for learning SQL, and the SoloLearn mobile application (Google Play and App Store) for Python.

Data collection


Source: hh.ru
At first I thought I would have to parse the site. Fortunately, it turned out that hh.ru has an API.

To begin, let's write a function that collects the list of vacancy ids to analyze. The function takes the search text (here we will pass 'Data Scientist') and the search area (as described in the API documentation) as parameters, and returns a list of ids. To get the data we use the vacancy search method of the API:

Here is the code
import requests
import json

def get_list_id_vacancies(area, text):
    url_list = 'https://api.hh.ru/vacancies'
    list_id = []
    params = {'text': text, 'area': area}
    r = requests.get(url_list, params=params)
    found = json.loads(r.text)['found']  # total number of vacancies found

    if found <= 500:  # the API returns at most 500 vacancies per page,
                      # so if there are 500 or fewer we can get them all in one request
        params['per_page'] = found
        r = requests.get(url_list, params=params)
        data = json.loads(r.text)['items']
        for vac in data:
            list_id.append(vac['id'])
    else:
        i = 0
        while i <= 3:  # otherwise "flip" through pages 0 to 3, 500 vacancies each;
                       # the API exposes at most 2000 vacancies, hence pages up to 3
            params['per_page'] = 500
            params['page'] = i
            r = requests.get(url_list, params=params)
            if 200 != r.status_code:
                break
            data = json.loads(r.text)['items']
            for vac in data:
                list_id.append(vac['id'])
            i += 1

    return list_id


For debugging, I sent requests directly to the API. I recommend the Postman Chrome application for this.
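If you prefer to stay in Python, a quick way to inspect the raw response is a minimal requests call; the query text and area below are just example values:

import requests

# quick sanity check of the search endpoint; the parameters are example values
r = requests.get('https://api.hh.ru/vacancies', params={'text': 'Data Scientist', 'area': 1})
print(r.status_code)
print(r.json()['found'])  # total number of matching vacancies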

After that you need to get detailed information about each vacancy:

Here is the code
def get_vacancy(id):
    url_vac = 'https://api.hh.ru/vacancies/%s'
    r = requests.get(url_vac % id)
    return json.loads(r.text)



We now have a list of vacancies and a function that fetches detailed information about each one. Next we must decide where to store the data. I had two options: save everything to a CSV file or create a database. Since it is easier for me to write SQL queries than to analyze data in Excel, I chose a database. First we need to create the database and the tables we will write into. To do that, we look at what the API returns and decide which fields we need.

We paste the API link into Postman, for example api.hh.ru/vacancies/22285538, make a GET request and get the response:

Full json
 { "alternate_url": "https://hh.ru/vacancy/22285538", "code": null, "premium": false, "description": "<p> ....", "schedule": { "id": "fullDay", "name": " " }, "suitable_resumes_url": null, "site": { "id": "hh", "name": "hh.ru" }, "billing_type": { "id": "standard_plus", "name": "+" }, "published_at": "2017-09-05T11:43:08+0300", "test": null, "accept_handicapped": true, "experience": { "id": "noExperience", "name": " " }, "address": { "building": "367", "city": "", "description": null, "metro": { "line_name": "", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": " ", "lng": 37.514401 }, "metro_stations": [ { "line_name": "", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": " ", "lng": 37.514401 } ], "raw": null, "street": " ", "lat": 55.739068, "lng": 37.525432 }, "key_skills": [ { "name": " " }, { "name": " " } ], "allow_messages": true, "employment": { "id": "full", "name": " " }, "id": "22285538", "response_url": null, "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "archived": false, "name": "/ Data scientist", "contacts": null, "employer": { "logo_urls": { "90": "https://hhcdn.ru/employer-logo/1680554.png", "240": "https://hhcdn.ru/employer-logo/1680555.png", "original": "https://hhcdn.ru/employer-logo-original/309546.png" }, "vacancies_url": "https://api.hh.ru/vacancies?employer_id=1475513", "name": "  ", "url": "https://api.hh.ru/employers/1475513", "alternate_url": "https://hh.ru/employer/1475513", "id": "1475513", "trusted": true }, "created_at": "2017-09-05T11:43:08+0300", "area": { "url": "https://api.hh.ru/areas/1", "id": "1", "name": "" }, "relations": [], "accept_kids": false, "response_letter_required": false, "apply_alternate_url": "https://hh.ru/applicant/vacancy_response?vacancyId=22285538", "quick_responses_allowed": false, "negotiations_url": null, "department": null, "branded_description": null, "hidden": false, "type": { "id": "open", "name": "" }, "specializations": [ { "profarea_id": "14", "profarea_name": ", ", "id": "14.91", "name": ",  " }, { "profarea_id": "14", "profarea_name": ", ", "id": "14.141", "name": "" }] } 


Everything we do not plan to analyze is removed from the JSON.

JSON with only the fields we need
 { "description": "<p> ....", "schedule": { "id": "fullDay", "name": " " }, "accept_handicapped": true, "experience": { "id": "noExperience", "name": " " }, "key_skills": [ { "name": " " }, { "name": " " } ], "employment": { "id": "full", "name": " " }, "id": "22285538", "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "name": "/ Data scientist", "employer": { "name": "  ", }, "area": { "name": "" }, "specializations": [ { "profarea_id": "14", "profarea_name": ", ", "id": "14.91", "name": ",  " }, { "profarea_id": "14", "profarea_name": ", ", "id": "14.141", "name": "" }] } 


Based on this JSON we create the database. It's straightforward, so I'll leave the details out :) A rough sketch of the schema is shown below.
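For reference, here is a minimal sketch of the tables the insert code below assumes (vacancies, key_skills, specializations). The column names follow insert_vac(); the exact types and sizes are my assumption, not the original schema:

import pymysql

# rough sketch of the schema used below; types and sizes are assumptions
DDL = """
CREATE TABLE IF NOT EXISTS vacancies (
    id INT PRIMARY KEY,
    name_v VARCHAR(255),
    description TEXT,
    code_hh VARCHAR(255),
    accept_handicapped BOOLEAN,
    area_v VARCHAR(255),
    employer VARCHAR(255),
    employment VARCHAR(255),
    experience VARCHAR(255),
    salary_currency VARCHAR(10),
    salary_from INT,
    salary_gross BOOLEAN,
    salary_to INT,
    schedule_d VARCHAR(255),
    text_search VARCHAR(255)
);
CREATE TABLE IF NOT EXISTS key_skills (
    vacancy_id INT,
    name VARCHAR(255)
);
CREATE TABLE IF NOT EXISTS specializations (
    vacancy_id INT,
    name VARCHAR(255),
    profarea_name VARCHAR(255)
);
"""

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='-', db='hh', charset='utf8')
with conn.cursor() as cur:
    for statement in DDL.split(';'):
        if statement.strip():
            cur.execute(statement)
conn.commit()
conn.close()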

Now we implement the module that interacts with the database. I used MySQL:

Here is the code
import pymysql

def get_salary(vac):
    # salary is a nested dict; if the vacancy has no salary, the dict is missing,
    # so we return a dict of Nones instead
    if vac['salary'] is None:
        return {'currency': None, 'from': None, 'to': None, 'gross': None}
    else:
        return {'currency': vac['salary']['currency'],
                'from': vac['salary']['from'],
                'to': vac['salary']['to'],
                'gross': vac['salary']['gross']}

def get_connection():
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='-', db='hh', charset="utf8")
    return conn

def close_connection(conn):
    conn.commit()
    conn.close()

def insert_vac(conn, vac, text):
    a = conn.cursor()
    salary = get_salary(vac)
    print(vac['id'])
    a.execute("INSERT INTO vacancies (id, name_v, description, code_hh, accept_handicapped, \
               area_v, employer, employment, experience, salary_currency, salary_from, salary_gross, \
               salary_to, schedule_d, text_search) \
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
              (vac['id'], vac['name'], vac['description'], vac['code'], vac['accept_handicapped'],
               vac['area']['name'], vac['employer']['name'], vac['employment']['name'],
               vac['experience']['name'], salary['currency'], salary['from'], salary['gross'],
               salary['to'], vac['schedule']['name'], text))
    for key_skill in vac['key_skills']:
        a.execute("INSERT INTO key_skills(vacancy_id, name) VALUES(%s, %s)",
                  (vac['id'], key_skill['name']))
    for spec in vac['specializations']:
        a.execute("INSERT INTO specializations(vacancy_id, name, profarea_name) VALUES(%s, %s, %s)",
                  (vac['id'], spec['name'], spec['profarea_name']))
    a.close()


Now we put everything together by adding a main() section to the file.

Data collection
area = 1  # search area id from the hh.ru areas dictionary
text_search = 'data scientist'

list_id_vacs = get_list_id_vacancies(area, text_search)
vacs = []
for vac_id in list_id_vacs:
    vacs.append(get_vacancy(vac_id))

conn = get_connection()
for vac in vacs:
    insert_vac(conn, vac, text_search)
close_connection(conn)


By changing the text_search and area variables we can collect different vacancies from different regions; a quick sketch of looping over several regions follows below.
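A simple loop over area ids does the job; the ids below are placeholders, look up the real ones in the hh.ru areas dictionary (https://api.hh.ru/areas):

# hypothetical example: collect the same query for several regions;
# the area ids are placeholders, check https://api.hh.ru/areas for real values
areas = [1, 2]
conn = get_connection()
for area in areas:
    for vac_id in get_list_id_vacancies(area, text_search):
        insert_vac(conn, get_vacancy(vac_id), text_search)
close_connection(conn)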
With that, data collection is complete, and we move on to the interesting part.

Text analysis


The main inspiration was an article about finding popular phrases in the TV series How I Met Your Mother.

First, we retrieve the descriptions of all vacancies from the database:

Here is the code
def get_vac_descriptions(conn, text_search):
    a = conn.cursor()
    a.execute("SELECT description FROM vacancies WHERE text_search = %s", (text_search,))
    descriptions = a.fetchall()
    a.close()
    return descriptions


To work with the text we will use the nltk package. By analogy with the article above, we write a function that extracts the most popular phrases from the text:

Here is the code
import string
import nltk
from collections import Counter

# nltk.download('punkt') may be needed once before word_tokenize will work

def get_popular_phrase(text, n, count_phrases, stopwords=None):
    # counts n-grams in the text, skipping punctuation (and stop words, if given),
    # and returns the count_phrases most common ones; the stopwords argument is
    # optional here and is used later in the article
    phrase_counter = Counter()
    words = nltk.word_tokenize(text.lower())
    for phrase in nltk.ngrams(words, n):
        if all(word not in string.punctuation for word in phrase) \
                and (stopwords is None or all(word not in stopwords for word in phrase)):
            phrase_counter[phrase] += 1
    return phrase_counter.most_common(count_phrases)

descriptions = get_vac_descriptions(get_connection(), 'data scientist')
text = ''
for description in descriptions:
    text = text + description[0]

result = get_popular_phrase(text, 1, 20)
for r in result:
    print(" ".join(r[0]) + " - " + str(r[1]))


We combine all the above methods in the main method and run it:

Here is the code
def main():
    descriptions = get_vac_descriptions(get_connection(), 'data scientist')
    text = ''
    for description in descriptions:
        text = text + description[0]
    result = get_popular_phrase(text, 1, 20)  # top 20 single words, no stop words yet
    for r in result:
        print(" ".join(r[0]) + " - " + str(r[1]))

main()


Run it and see:

li - 2459
/ li - 2459
and - 1297
p - 1225
/ p - 1224
in - 874
strong - 639
/ strong - 620
and - 486
ul - 457
/ ul - 457
from - 415
on - 341
data - 329
data - 313
the - 308
experience - 275
of - 269
for - 254
work - 233

We see that the result is dominated by words that are common to all vacancies and by HTML tags used in the descriptions. Let's remove these words from the analysis. For this we need a list of stop words. We will build it automatically by analyzing vacancies from completely different fields. I chose "cook", "cleaner" and "fitter".

Let's go back to the beginning and collect vacancies for these queries. After that, we add a function that builds the stop-word list.

Here is the code
def get_stopwords():
    # the stop-word queries on hh.ru were actually in Russian; 'cook', 'cleaner'
    # and 'fitter' are their English equivalents
    descriptions = get_vac_descriptions(get_connection(), 'cook') \
        + get_vac_descriptions(get_connection(), 'cleaner') \
        + get_vac_descriptions(get_connection(), 'fitter')
    text = ''
    for description in descriptions:
        text = text + description[0]
    stopwords = []
    top_words = get_popular_phrase(text, 1, 200)  # the 200 most frequent words
    for i in top_words:
        stopwords.append(i[0][0])
    return stopwords



We also see English words such as "the" and "of" — some vacancies are posted in English. Let's keep things simple and just drop the vacancies written in English.
We make the change in main():

Here is the code
    # requires: from langdetect import detect
    for description in descriptions:
        if detect(description[0]) != 'en':
            text = text + description[0]
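For completeness, here is a rough sketch of what the final main() looks like with both the language filter and the stop words plugged in (assuming the get_popular_phrase() variant above that takes an optional stopwords argument):

from langdetect import detect

def main():
    stopwords = get_stopwords()
    descriptions = get_vac_descriptions(get_connection(), 'data scientist')
    text = ''
    for description in descriptions:
        if detect(description[0]) != 'en':  # skip vacancies written in English
            text = text + description[0]
    result = get_popular_phrase(text, 1, 20, stopwords)
    for r in result:
        print(" ".join(r[0]) + " - " + str(r[1]))

main()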


Now the result looks like this:

data - 329
data - 180
analysis - 157
training - 134
machine - 129
models - 128
areas - 101
algorithms - 87
python - 86
tasks - 82
tasks - 82
development - 77
analysis - 73
construction - 68
methods - 66
will be - 65
statistics - 56
higher - 55
knowledge - 53
learning - 52

Well, these are single words, and a single word does not always tell the whole story. Let's see what two-word phrases show:
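This is the same pipeline, only with two-word n-grams, e.g.:

result = get_popular_phrase(text, 2, 20, stopwords)  # top 20 bigrams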

machine learning - 119
data analysis - 56
machine learning - 44
data science - 38
data scientist - 38
big data - 34
mathematical models - 34
data mining - 28
machine algorithms - 27
mathematical statistics - 23
will be a plus - 21
statistical analysis - 20
data processing - 18
English - 17
data analysis - 17
including - 17
as well - 17
methods of machine - 16
areas of analysis - 15
probability theory - 14

The results of the analysis.


The two-word phrases give a clearer picture of what we need to know:


Nothing new, but it was fun :)

Findings.


This is far from a perfect solution.

Errors:

1. Vacancies in English should not be excluded; they should be translated instead.
2. Not all stop words are filtered out.
3. All words should be reduced to their base form, i.e. lemmatized (e.g. "models" -> "model", "learning" -> "learn"); see the sketch after this list.
4. A better method for building the stop-word list is needed, one that answers the questions "why 200 words?" and "why the cleaner?".
5. We need a way to analyze the result automatically, to decide whether one, two or more words carry the meaning.
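As an illustration of point 3, here is a minimal sketch using nltk's SnowballStemmer. This is stemming rather than true lemmatization (for the original Russian descriptions a lemmatizer such as pymorphy2 would be more accurate), and it is a suggestion of mine, not part of the original pipeline:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")  # use "russian" for the original Russian descriptions

def normalize(words):
    # reduce each token to its stem so that "models" and "model" are counted together
    return [stemmer.stem(word) for word in words]

print(normalize(["models", "algorithms", "learning"]))  # ['model', 'algorithm', 'learn']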

Source: https://habr.com/ru/post/337124/

