
Analyzing market requirements for data scientists

There is a lot of information on the Internet about what a data scientist should know and be able to do. But I decided that becoming a data scientist should start right away, so we will find out the requirements for these specialists by analyzing the text of actual job vacancies.





First, we will formulate the task and develop a plan:



Task:


Look through the job vacancies on the market and find out the common requirements listed in them.



Plan:



1. Collect all vacancies matching the query "Data Scientist" in a format that is convenient to process.

2. Find the words and phrases that occur most frequently in the descriptions.



For implementation, you need a little knowledge of SQL and Python.



If you don't, I recommend sqlbolt.com for learning SQL and the SoloLearn mobile app (Google Play and App Store) for Python.



Data collection



Source: hh.ru

At first I thought I would have to parse the site. Fortunately, it turned out that hh.ru has an API.



To start, let's write a function that gets the list of vacancy ids to analyze. The function takes the search text (here we will send 'Data Scientist') and the search area (as defined in the API documentation) as parameters, and returns a list of ids. To get the data we use the vacancy search method of the API:



Here is the code
import requests
import json

def get_list_id_vacancies(area, text):
    url_list = 'https://api.hh.ru/vacancies'
    list_id = []
    params = {'text': text, 'area': area}
    r = requests.get(url_list, params=params)
    found = json.loads(r.text)['found']  # total number of vacancies found
    if found <= 500:  # the API returns at most 500 vacancies per page, so one request is enough
        params['per_page'] = found
        r = requests.get(url_list, params=params)
        data = json.loads(r.text)['items']
        for vac in data:
            list_id.append(vac['id'])
    else:
        i = 0
        while i <= 3:  # request pages 0 to 3, 500 vacancies each; the API returns at most 2000 results, hence 3
            params['per_page'] = 500
            params['page'] = i
            r = requests.get(url_list, params=params)
            if 200 != r.status_code:
                break
            data = json.loads(r.text)['items']
            for vac in data:
                list_id.append(vac['id'])
            i += 1
    return list_id




For debugging, I sent requests directly to the API. For this I recommend the Postman Chrome app.
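
If you prefer to stay in Python rather than Postman, a quick way to peek at the raw search response is a sketch like the one below; the query text, area id and per_page value are just example values of my choosing:

import requests
import json

# Inspect the raw response of the vacancy search method
params = {'text': 'Data Scientist', 'area': 1, 'per_page': 10}  # the area id comes from the hh.ru areas dictionary
r = requests.get('https://api.hh.ru/vacancies', params=params)
response = json.loads(r.text)

print(response['found'])            # total number of matching vacancies
print(response['items'][0].keys())  # fields available for each vacancy in the search results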



After that you need to get detailed information about each vacancy:



Here is the code
def get_vacancy(id):
    url_vac = 'https://api.hh.ru/vacancies/%s'
    r = requests.get(url_vac % id)
    return json.loads(r.text)






We now have a list of vacancies and a function that fetches detailed information about each one. Next we must decide where to write the data. I had two options: save everything to a csv file or create a database. Since it is easier for me to write SQL queries than to analyze things in Excel, I chose a database. First we need to create the database and the tables we will write into. To do this, we examine what the API returns and decide which fields we need.



We paste the API link, for example api.hh.ru/vacancies/22285538, into Postman, make a GET request and get the response:



Full json
 { "alternate_url": "https://hh.ru/vacancy/22285538", "code": null, "premium": false, "description": "<p> ....", "schedule": { "id": "fullDay", "name": " " }, "suitable_resumes_url": null, "site": { "id": "hh", "name": "hh.ru" }, "billing_type": { "id": "standard_plus", "name": "+" }, "published_at": "2017-09-05T11:43:08+0300", "test": null, "accept_handicapped": true, "experience": { "id": "noExperience", "name": " " }, "address": { "building": "367", "city": "", "description": null, "metro": { "line_name": "", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": " ", "lng": 37.514401 }, "metro_stations": [ { "line_name": "", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": " ", "lng": 37.514401 } ], "raw": null, "street": " ", "lat": 55.739068, "lng": 37.525432 }, "key_skills": [ { "name": " " }, { "name": " " } ], "allow_messages": true, "employment": { "id": "full", "name": " " }, "id": "22285538", "response_url": null, "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "archived": false, "name": "/ Data scientist", "contacts": null, "employer": { "logo_urls": { "90": "https://hhcdn.ru/employer-logo/1680554.png", "240": "https://hhcdn.ru/employer-logo/1680555.png", "original": "https://hhcdn.ru/employer-logo-original/309546.png" }, "vacancies_url": "https://api.hh.ru/vacancies?employer_id=1475513", "name": "  ", "url": "https://api.hh.ru/employers/1475513", "alternate_url": "https://hh.ru/employer/1475513", "id": "1475513", "trusted": true }, "created_at": "2017-09-05T11:43:08+0300", "area": { "url": "https://api.hh.ru/areas/1", "id": "1", "name": "" }, "relations": [], "accept_kids": false, "response_letter_required": false, "apply_alternate_url": "https://hh.ru/applicant/vacancy_response?vacancyId=22285538", "quick_responses_allowed": false, "negotiations_url": null, "department": null, "branded_description": null, "hidden": false, "type": { "id": "open", "name": "" }, "specializations": [ { "profarea_id": "14", "profarea_name": ", ", "id": "14.91", "name": ",  " }, { "profarea_id": "14", "profarea_name": ", ", "id": "14.141", "name": "" }] } 




We remove from the JSON everything we do not plan to analyze.



JSON with only the fields we need
 { "description": "<p> ....", "schedule": { "id": "fullDay", "name": " " }, "accept_handicapped": true, "experience": { "id": "noExperience", "name": " " }, "key_skills": [ { "name": " " }, { "name": " " } ], "employment": { "id": "full", "name": " " }, "id": "22285538", "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "name": "/ Data scientist", "employer": { "name": "  ", }, "area": { "name": "" }, "specializations": [ { "profarea_id": "14", "profarea_name": ", ", "id": "14.91", "name": ",  " }, { "profarea_id": "14", "profarea_name": ", ", "id": "14.141", "name": "" }] } 




Based on this JSON we design the database. It's easy, so I'll leave it out :) (a sketch of a possible schema follows below).
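
Since the schema itself is not shown, here is a minimal sketch of tables that would match the INSERT statements used later; the column types and sizes are my assumptions, not taken from the original:

# A possible schema, inferred from the columns used in insert_vac() below.
# Column types and sizes are assumptions.
import pymysql

schema = """
CREATE TABLE IF NOT EXISTS vacancies (
    id INT PRIMARY KEY,
    name_v VARCHAR(255),
    description TEXT,
    code_hh VARCHAR(255),
    accept_handicapped BOOLEAN,
    area_v VARCHAR(255),
    employer VARCHAR(255),
    employment VARCHAR(255),
    experience VARCHAR(255),
    salary_currency VARCHAR(10),
    salary_from INT,
    salary_gross BOOLEAN,
    salary_to INT,
    schedule_d VARCHAR(255),
    text_search VARCHAR(255)
);
CREATE TABLE IF NOT EXISTS key_skills (
    vacancy_id INT,
    name VARCHAR(255)
);
CREATE TABLE IF NOT EXISTS specializations (
    vacancy_id INT,
    name VARCHAR(255),
    profarea_name VARCHAR(255)
);
"""

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='-', db='hh', charset='utf8')
with conn.cursor() as cursor:
    # execute the CREATE TABLE statements one by one
    for statement in schema.split(';'):
        if statement.strip():
            cursor.execute(statement)
conn.commit()
conn.close()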



Next we implement the module that interacts with the database. I used MySQL:



Here is the code
import pymysql

def get_salary(vac):
    # Salary is not always specified in a vacancy; if it is missing,
    # return a dict of None values so that the insert does not fail.
    if vac['salary'] is None:
        return {'currency': None, 'from': None, 'to': None, 'gross': None}
    else:
        return {'currency': vac['salary']['currency'],
                'from': vac['salary']['from'],
                'to': vac['salary']['to'],
                'gross': vac['salary']['gross']}

def get_connection():
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='-', db='hh', charset="utf8")
    return conn

def close_connection(conn):
    conn.commit()
    conn.close()

def insert_vac(conn, vac, text):
    a = conn.cursor()
    salary = get_salary(vac)
    print(vac['id'])
    a.execute("INSERT INTO vacancies (id, name_v, description, code_hh, accept_handicapped, \
               area_v, employer, employment, experience, salary_currency, salary_from, salary_gross, \
               salary_to, schedule_d, text_search) \
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
              (vac['id'], vac['name'], vac['description'], vac['code'], vac['accept_handicapped'],
               vac['area']['name'], vac['employer']['name'], vac['employment']['name'],
               vac['experience']['name'], salary['currency'], salary['from'], salary['gross'],
               salary['to'], vac['schedule']['name'], text))
    for key_skill in vac['key_skills']:
        a.execute("INSERT INTO key_skills(vacancy_id, name) VALUES(%s, %s)", (vac['id'], key_skill['name']))
    for spec in vac['specializations']:
        a.execute("INSERT INTO specializations(vacancy_id, name, profarea_name) VALUES(%s, %s, %s)",
                  (vac['id'], spec['name'], spec['profarea_name']))
    a.close()




Now we put everything together by adding the main driver code to the file.



Data collection
area = 1  # search area id from the hh.ru areas dictionary (1 corresponds to Moscow)
text_search = 'data scientist'

list_id_vacs = get_list_id_vacancies(area, text_search)
vacs = []

for vac_id in list_id_vacs:
    vacs.append(get_vacancy(vac_id))

conn = get_connection()

for vac in vacs:
    insert_vac(conn, vac, text_search)

close_connection(conn)




By changing the text_search and area variables we can collect vacancies for different queries and regions.
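
For example, a small driver loop like the one below could collect several queries for several regions in one run; the particular query strings and area ids are placeholders of my choosing:

# Collect several search queries for several regions in one run.
# The query strings and area ids below are only examples.
queries = ['data scientist', 'data analyst', 'machine learning']
areas = [1, 2]  # area ids from the hh.ru areas dictionary

conn = get_connection()
for area in areas:
    for text_search in queries:
        for vac_id in get_list_id_vacancies(area, text_search):
            insert_vac(conn, get_vacancy(vac_id), text_search)
close_connection(conn)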

Data collection is now complete, so let's move on to the interesting part.



Text analysis



The main inspiration was an article about finding popular phrases in the TV series How I Met Your Mother.



First, we fetch the descriptions of all vacancies from the database:



Here is the code
def get_vac_descriptions(conn, text_search):
    a = conn.cursor()
    a.execute("SELECT description FROM vacancies WHERE text_search = %s", (text_search,))
    descriptions = a.fetchall()
    a.close()
    return descriptions




To work with the text we will use the nltk package. By analogy with the article above, we write a function that extracts the most popular phrases from the text:



Here is the code
import nltk
import string
from collections import Counter

def get_popular_phrase(text, n, count_phrases):
    phrase_counter = Counter()
    words = nltk.word_tokenize(text.lower())
    for phrase in nltk.ngrams(words, n):
        if all(word not in string.punctuation for word in phrase):
            phrase_counter[phrase] += 1
    return phrase_counter.most_common(count_phrases)

descriptions = get_vac_descriptions(get_connection(), 'data scientist')
text = ''

for description in descriptions:
    text = text + description[0]

result = get_popular_phrase(text, 1, 20)
for r in result:
    print(" ".join(r[0]) + " - " + str(r[1]))




We combine all of the above in a main() method and run it:



Here is the code
def main():
    descriptions = get_vac_descriptions(get_connection(), 'data scientist')
    text = ''
    for description in descriptions:
        text = text + description[0]
    result = get_popular_phrase(text, 1, 20)
    for r in result:
        print(" ".join(r[0]) + " - " + str(r[1]))

main()




Run it and see:



li - 2459

/ li - 2459

and - 1297

p - 1225

/ p - 1224

in - 874

strong - 639

/ strong - 620

and - 486

ul - 457

/ ul - 457

from - 415

on - 341

data - 329

data - 313

the - 308

experience - 275

of - 269

for - 254

work - 233



We see that the results contain a lot of words that are typical for any vacancy, as well as HTML tags used in the description markup. Let's remove these words from the analysis. For this we need a list of stop words. We will build it automatically by analyzing vacancies from completely different fields. I chose "cook", "cleaner" and "fitter".



Let's go back to the beginning and collect vacancies for these queries as well. After that, we add a function that builds the list of stop words.



Here is the code
def get_stopwords():
    # Descriptions of vacancies from unrelated fields: cook, cleaner, fitter
    # (the original post used the corresponding Russian query strings).
    descriptions = get_vac_descriptions(get_connection(), 'cook') \
                   + get_vac_descriptions(get_connection(), 'cleaner') \
                   + get_vac_descriptions(get_connection(), 'fitter')
    text = ''
    for description in descriptions:
        text = text + description[0]
    stopwords = []
    top_words = get_popular_phrase(text, 1, 200)  # take the 200 most frequent words as stop words
    for i in top_words:
        stopwords.append(i[0][0])
    return stopwords
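
The updated version of get_popular_phrase that actually filters out these stop words is not shown in the original post; here is a minimal sketch of how it could look, with the extra stopwords parameter being my addition:

def get_popular_phrase(text, n, count_phrases, stopwords=None):
    # Same as before, but phrases containing a stop word are skipped.
    stopwords = stopwords or []
    phrase_counter = Counter()
    words = nltk.word_tokenize(text.lower())
    for phrase in nltk.ngrams(words, n):
        if all(word not in string.punctuation for word in phrase) \
                and all(word not in stopwords for word in phrase):
            phrase_counter[phrase] += 1
    return phrase_counter.most_common(count_phrases)

# Possible usage in main():
# result = get_popular_phrase(text, 1, 20, get_stopwords())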






We also see English words such as "the", "of" and "for" creeping in from English-language postings. Let's take the easy route and simply drop vacancies written in English.

We make the following change in main():



Here is the code
from langdetect import detect

for description in descriptions:
    if detect(description[0]) != 'en':
        text = text + description[0]




Now the result looks like this:



data - 329

data - 180

analysis - 157

training - 134

machine - 129

models - 128

areas - 101

algorithms - 87

python - 86

tasks - 82

tasks - 82

development - 77

analysis - 73

construction - 68

methods - 66

will be - 65

statistics - 56

higher - 55

knowledge - 53

learning - 52



Still, single words do not always tell the whole story. Let's see what two-word phrases show:



machine learning - 119

data analysis - 56

machine learning - 44

data science - 38

data scientist - 38

big data - 34

mathematical models - 34

data mining - 28

machine algorithms - 27

mathematical statistics - 23

will be a plus - 21

statistical analysis - 20

data processing - 18

English - 17

data analysis - 17

including - 17

as well - 17

methods of machine - 16

areas of analysis - 15

probability theory - 14



Results of the analysis



The two-word phrases give a clearer picture of what we actually need to know.





Nothing new, but it was fun :)



Findings



This is far from a perfect solution.



Errors:



1. Vacancies in English should not be excluded; they should be translated instead.

2. Not all stop words are excluded.

3. All words should be reduced to their base form (lemmatized), so that different inflections of the same word are counted together (see the sketch after this list).

4. Come up with a more principled way to build the stop-word list, answering the questions "why 200 words?" and "why a cleaner?".

5. Figure out how to analyze the results automatically, i.e. decide whether the meaning is carried by one word, two words, or longer phrases.
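
As an illustration of point 3, here is a minimal sketch of normalizing words before counting, using nltk's Snowball stemmer for Russian; stemming is only a rough approximation of proper lemmatization, and this helper is not part of the original code:

import nltk
from nltk.stem.snowball import SnowballStemmer

# Rough normalization: stem every token before counting n-grams,
# so that different inflections of the same word are merged.
stemmer = SnowballStemmer("russian")

def normalize(text):
    words = nltk.word_tokenize(text.lower())
    return [stemmer.stem(word) for word in words]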

Source: https://habr.com/ru/post/337124/


