
Parsing resumes

Anyone who has faced the task of automated resume analysis knows the current state of affairs in this field: existing parsers are mostly limited to extracting contact information and a few more fields, such as "position" and "city".

That is not enough for any meaningful analysis. It is important not only to locate certain lines and mark them with tags, but also to determine what objects stand behind them.

A live example (a fragment of a resume-parsing result in XML from Sovren, one of the leaders in this field):

<EmployerOrg>
  <EmployerOrgName>-DSME</EmployerOrgName>
  <PositionHistory positionType="directHire">
    <Title>     </Title>
    <OrgName>
      <OrganizationName>-DSME</OrganizationName>
    </OrgName>

The Sovren parser handles field extraction well. Not for nothing have these guys been in this business for nearly 20 years!

But what do you do next with a "Leading Specialist of the Information Systems Development Department"? How do you figure out what kind of position that is, and how relevant this person's work experience is for a particular job?

If your task is to find an employee matching the requirements of a vacancy, or vice versa, vacancies matching a candidate's experience and wishes, then keyword search and bag-of-words comparison give mediocre results. For objects that go by many possible synonymous names, this approach simply will not work.

First you need to normalize the names: turn "specialists in something-or-other" into programmers, system administrators, and other otolaryngologists.
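As a toy illustration of what such normalization might look like (the synonym table and the canonical titles below are invented for the example, not taken from our actual knowledge base):

```python
# A minimal sketch: map synonymous job titles to a canonical name.
# In a real system this table is a large, curated taxonomy.
SYNONYMS = {
    "1c programmer": "programmer",
    "software developer": "programmer",
    "sysadmin": "system administrator",
    "specialist of the it department": "system administrator",
}

def normalize_title(raw: str) -> str:
    """Lowercase, collapse whitespace, and look the title up in the table."""
    key = " ".join(raw.lower().split())
    return SYNONYMS.get(key, key)  # fall back to the cleaned-up original

print(normalize_title("Software  Developer"))  # programmer
print(normalize_title("Sysadmin"))             # system administrator
```

A flat dictionary like this breaks down exactly where the article says it does: a title such as "specialist of department N" has no reliable canonical form without context.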

For this you have to build a knowledge base, a taxonomy of objects. Moreover, the specifics of the domain are such that it is not enough to describe, say, only builders: people change fields of activity, and non-construction jobs turn up in a builder's resume as well.

And if the taxonomy describes only construction, texts from other fields will produce false positives. An "architect" in construction is one thing; in IT it is quite another. "Operation", "Action", "Object" and the multitude of phrases containing these words are examples of ambiguities that must be resolved.

Simple normalization alone will not save the father of Russian democracy either. The imagination of people writing resumes and drawing up staff lists never ceases to amaze. Unfortunately for us developers, this means that in the general case an object cannot be identified from the single line describing it. Of course, you can try to train some classifier, feeding it the raw field and the desired position.
And it will even work, for "accountants", "secretaries", "programmers".
But people also write "specialist of department N" in their resumes, and whether that person is an accountant or a secretary can only be determined from context, from the set of duties performed.

It would seem the fix is obvious: take the context into account and let the classifier also learn from the responsibilities. But no: when extracting the set of duties we run into the same problem, ambiguity of interpretation, plus all kinds of unresolved anaphora.

We decided to apply a probabilistic (Bayesian) approach:

Analyzing the source text, for every line (for example, "architect", "work with clients") we determine the set of all its possible interpretations (for "architect" that would be "building architect", "software architect", etc.). The result is a set of sets of interpretations. We then look for the combination of interpretations, one drawn from each set, whose joint plausibility is maximal.
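A minimal sketch of this search, with a made-up pairwise log-plausibility table standing in for the real model (exhaustive enumeration is shown only for clarity; it would not scale to a full resume):

```python
from itertools import product

# Hypothetical log-plausibilities of object pairs; unseen pairs get a penalty.
PAIR_LOGP = {
    ("building architect", "work with construction clients"): -1.5,
    ("software architect", "work with construction clients"): -6.0,
    ("building architect", "customer support in IT"): -5.0,
    ("software architect", "customer support in IT"): -2.0,
}
UNSEEN = -10.0

def best_combination(interpretation_sets):
    """Enumerate all combinations and return the jointly most plausible one."""
    def score(combo):
        # Sum pairwise log-plausibility over every pair in the combination.
        return sum(
            PAIR_LOGP.get((a, b), PAIR_LOGP.get((b, a), UNSEEN))
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
    return max(product(*interpretation_sets), key=score)

sets = [
    ["building architect", "software architect"],          # readings of "architect"
    ["work with construction clients", "customer support in IT"],
]
print(best_combination(sets))
# ('building architect', 'work with construction clients')
```

The key point is that no interpretation is chosen in isolation: "architect" and "work with clients" disambiguate each other through the joint score.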

For example:


In both places of work, the word "manager" denotes completely different positions. From the context it is clear that in the first case the position really is managerial, while in the second it would be more accurate to call oneself a salesperson.

To choose between "customer service manager" and "salesperson", we evaluate the plausibility of combining each of these positions with the skills found at the same place of work. The skills themselves may likewise have several candidate interpretations, so the task is to pick the most plausible combination out of all the objects found in the text.

The number of objects of different types (skills, positions, industries, cities, etc.) is very large (hundreds of thousands in our knowledge base), so the space that resumes live in is very, very high-dimensional. Most machine learning algorithms would need an astronomical number of training examples for it.

So we decided to cut corners: drastically reduce the number of parameters, and use the learned model only where we have a sufficient number of examples.

To begin with, we started collecting statistics on combinations of attribute tuples, for example: position-industry, position-department, position-skill. Based on these statistics we estimate the plausibility of new, previously unseen combinations of objects and choose the best one.
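A sketch of how such counters might be turned into plausibility estimates. The labeled pairs and the Laplace smoothing below are illustrative assumptions, not the production model:

```python
from collections import Counter

# Toy labeled data: (position, skill) pairs from annotated resumes (made up).
pairs = [
    ("seller", "sales"), ("seller", "cash register"), ("seller", "sales"),
    ("manager", "negotiation"), ("manager", "team leadership"),
]
pair_counts = Counter(pairs)                     # counts of each tuple
position_counts = Counter(p for p, _ in pairs)   # marginal counts per position
vocab = {s for _, s in pairs}                    # distinct skills seen

def plausibility(position: str, skill: str, alpha: float = 1.0) -> float:
    """Laplace-smoothed estimate of P(skill | position)."""
    return (pair_counts[(position, skill)] + alpha) / (
        position_counts[position] + alpha * len(vocab)
    )

print(plausibility("seller", "sales"))    # (2+1)/(3+4) ~= 0.4286
print(plausibility("manager", "sales"))   # (0+1)/(2+4) ~= 0.1667, unseen but nonzero
```

Smoothing is what lets a previously unseen combination still receive a small, nonzero plausibility instead of being ruled out entirely.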

In the example above, the skill set tips the parser toward the manager in the first case and toward the salesperson in the second.

Using simple counters and Bayesian probability estimates makes it possible to get good results from a small number of examples. Our knowledge base now contains about a hundred thousand vacancies and resumes annotated by specialists, and that is enough to resolve most ambiguities for common objects.

At the output we get a JSON object that describes a job or a resume in the terms of our knowledge base, not in whatever terms the applicant or employer came up with.

This representation can be used for precise parametric search, scoring an applicant's resume, or comparing resume-vacancy pairs.
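For instance, one simple way to compare a resume-vacancy pair over this representation is set overlap of the normalized skill IDs. The vacancy IDs below are made up, and a real scorer would weight skills rather than treat them equally:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of normalized object IDs between a resume and a vacancy."""
    return len(a & b) / len(a | b) if a or b else 0.0

resume_skill_ids = {91, 596, 1474, 2688}   # skill_id values from the parser's JSON
vacancy_skill_ids = {91, 1474, 42}          # hypothetical vacancy requirements

print(round(jaccard(resume_skill_ids, vacancy_skill_ids), 2))  # 0.4
```

Because both sides are expressed as knowledge-base IDs, the comparison no longer depends on how the applicant or the employer happened to phrase the same skill.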

We built a simple interface where you can upload a resume (doc, docx, pdf (not a scanned image) and other formats) and get its JSON representation. Just don't forget about Federal Law 152-FZ: no experimenting with resumes containing real personal data :)

For example, this resume:

Pupkin Vasily Lvovich
Omsk
tel +7923123321123

Responsible and hardworking sales manager.

Experience

  • 2001-2002: LLC Bydomomuprav. Yard cleaner. Performed snow-removal duties especially conscientiously.
  • From 12.03.2005 to 30.01.2007: "Hope" shop. Senior salesman at a sausage kiosk. Increased sausage sales by 146% in 2 months.
  • 2002-2003: LLC Kuzremont. Tinsmith and body repairman in a car repair shop. Turned Zaporozhets cars into Mercedes.

Additionally
Physically strong, smart, handsome. Own a Range Rover Sport and a category B driving license.

Turns into the following JSON:

{
  "url": null,
  "name": "  , . ",
  "skill_ids": [
    { "cv_skill_id": 5109999, "skill_name": "", "skill_id": 91, "skill_level_id": 1, "skill_level_name": "" },
    { "cv_skill_id": 5110000, "skill_name": "", "skill_id": 596, "skill_level_id": 1, "skill_level_name": "" },
    { "cv_skill_id": 5109998, "skill_name": " ", "skill_id": 1474, "skill_level_id": 1, "skill_level_name": "" },
    { "cv_skill_id": 5109997, "skill_name": " ", "skill_id": 2688, "skill_level_id": 2, "skill_level_name": "" }
  ],
  "description": " ",
  "ts": "2016-09-14 06:00:51.136898",
  "jobs": [
    { "employer_id": null, "description": ":  .    .       ", "department_id": null, "company_size_id": null, "industry_id": null, "start_date": "2001-01-01", "cv_job_id": 1812412, "company_size_name": null, "employer_name": null, "job_id": 336, "department_name": null, "industry_name": null, "end_date": "2002-01-01", "job_name": "" },
    { "employer_id": null, "description": ":   -  .  \"\"  \"\".   , , .    Range Rover Sport    B", "department_id": null, "company_size_id": null, "industry_id": null, "start_date": "2002-01-01", "cv_job_id": 1812414, "company_size_name": null, "employer_name": null, "job_id": 268, "department_name": null, "industry_name": null, "end_date": "2003-01-01", "job_name": " " },
    { "employer_id": null, "description": " 12.03.2005  30.01.2007  \"\".     .     146%  2 ", "department_id": null, "company_size_id": null, "industry_id": 39, "start_date": "2005-03-12", "cv_job_id": 1812413, "company_size_name": null, "employer_name": null, "job_id": 354, "department_name": null, "industry_name": " ", "end_date": "2007-01-30", "job_name": "" }
  ],
  "cv_file_id": 16598,
  "favorite_industries": [ { "name": " ", "industry_id": 39 } ],
  "wage_min": null,
  "cv_id": 1698916,
  "favorite_areas_data": [
    [
      { "id": 198830, "name": " ", "level": 1 },
      { "id": 10005, "name": "  ", "level": 2 },
      { "id": 88, "name": " ", "level": 3 },
      { "id": 727, "name": ". ", "level": 4 }
    ]
  ],
  "certificate_ids": [ { "certificate_name": "   B", "certificate_id": 118, "cv_certificate_id": 604445 } ],
  "cv_owner": "own",
  "favorite_jobs": [ { "name": "  ", "job_id": 112 } ],
  "cv_status_id": 2,
  "filename": "test_resume.odt"
}

Personal data extraction is disabled in this parser, so don't look for it in the JSON.

In my admittedly biased opinion, the result is interesting, and there are many uses for it. That said, object-recognition accuracy comparable to a human's is still a long way off. We need to keep growing the knowledge base, training the algorithm on more examples, and adding extra heuristics and, possibly, narrowly specialized classifiers, for example, for detecting industries.

I wonder what techniques you use, or would use, for this. In particular, does anyone apply a semantic approach a la ABBYY Compreno?

Source: https://habr.com/ru/post/301936/

