Determining gender from a full name: when accuracy really matters

Some time ago I became interested in the task of determining a person's gender from their full name. At the time I worked in health insurance, where the problem was very real: the cost per insured person, and hence the rates at which people were accepted for insurance, could differ severalfold depending on the client's gender. Most contracts were corporate, and the insured were employees of the client company.

We never met most of them in person; all we had were lists of insured people in which gender was sometimes indicated (with many errors) but more often missing altogether. Most companies have their own line of work and professional traditions, so people of one sex tend to prevail in their teams. Even a small error rate could make a potentially profitable contract unprofitable (or vice versa, but that, by a strange coincidence, happened to our clients far less often). Overall, with a contract portfolio of several billion and a typical error rate of about one percent, correctly determining gender from the full name was worth some tens of millions.

On the Runet, the topic of determining gender from a name has come up repeatedly, but in most cases it boiled down to the advice to look carefully at the ending of the patronymic (“-vich” / “-vna”) or to use similar hand-picked patterns. Unfortunately, this did not work in my situation: there were many, really many, foreigners among the insured. The correct spelling of their patronymics showed no sign of the expected endings (and in some cases there was no patronymic at all).

So I decided to take a statistical approach: use the existing customer base to find out which sex people with a given surname, first name and patronymic usually belong to, and assign each new insured person to one class or the other on that basis. If a part of the full name belonged predominantly to men, it scored +1; if to women, -1; if the split was roughly even, 0. The scores of the three parts were summed: a total of +2 or more meant male, -2 or less meant female, and anything else meant the gender could not be determined this way and had to be established by other means.
Oddly enough, this very simple algorithm achieved remarkable accuracy: on a sample of several hundred thousand people (with a training base of one and a half million) it made only 6 errors, which are described below and each of which a human would quite likely have made as well.

Some details of how the training sample was prepared:

Each full name must be written entirely in Cyrillic or entirely in Latin script;
Only letters, spaces, hyphens and single quote characters are allowed. All other characters must be either deleted or replaced with similar allowed ones. There must be no spaces between a hyphen and the adjacent letters;
All letters must be in the same case (or every part must be capitalized the same way: first letter uppercase, the rest lowercase);
Parts of the full name must be separated by single spaces, with no leading or trailing spaces;
The full-name string is split into three parts at the first and second spaces. If there are only two parts, the patronymic is null; if there are more than three, the patronymic is everything after the second space.
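
The preparation rules above could be sketched roughly as follows. This is a Python illustration, not the author's PL/SQL; the function names are hypothetical, and the sketch makes two simplifying assumptions: disallowed characters are dropped rather than replaced with look-alikes, and everything is converted to uppercase.

```python
import re

# Assumption: anything outside letters, space, hyphen and single quote is dropped.
_DISALLOWED = re.compile(r"[^A-Za-zА-Яа-яЁё \-']")

def normalize_full_name(raw):
    """Return a cleaned, uppercased full name, or None if it mixes scripts."""
    name = re.sub(r"\s+", " ", raw)          # tabs/newlines become single spaces
    name = _DISALLOWED.sub("", name)         # keep only the allowed characters
    name = re.sub(r" *- *", "-", name)       # no spaces around hyphens
    name = re.sub(r" +", " ", name).strip()  # single spaces, trimmed edges
    has_latin = re.search(r"[A-Za-z]", name) is not None
    has_cyrillic = re.search(r"[А-Яа-яЁё]", name) is not None
    if has_latin and has_cyrillic:
        return None                          # mixed scripts are rejected
    return name.upper()

def split_full_name(name):
    """Split at the first and second space into (surname, first name, patronymic)."""
    parts = name.split(" ", 2)               # the third piece keeps any further spaces
    surname = parts[0]
    first = parts[1] if len(parts) > 1 else None
    patronymic = parts[2] if len(parts) > 2 else None
    return surname, first, patronymic
```

For example, `split_full_name(normalize_full_name("  Ivanov  - Petrov,  Ivan "))` yields a hyphenated surname, a first name and a null patronymic.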

Since I wrote all the logic in PL/SQL, I will not publish the whole package implementing the algorithm, as it is tightly coupled to the internal structure of our database and the way data is stored, but a few points are worth mentioning:

Scanning the whole counterparty table every time a person's gender has to be determined would take too long, so I stored aggregated frequencies for each part of the full name in an auxiliary table. Each row has four fields: [part of the full name, its type (F, I or O, i.e. surname, first name or patronymic), gender, number of records in the database]. The statistics were refreshed automatically every week.
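
As an illustration of what such an auxiliary table holds, here is a small in-memory Python sketch. The real implementation was a database table refreshed weekly; the function names and the 'M'/'F' gender codes below are assumptions made for the example.

```python
from collections import Counter

def build_name_stats(records):
    """Aggregate (surname, first_name, patronymic, gender) records.

    Returns a Counter keyed by (name part, part type, gender), where the part
    type is 'F', 'I' or 'O' and the gender is 'M' or 'F' (assumed codes).
    """
    stats = Counter()
    for surname, first, patronymic, gender in records:
        for part, part_type in ((surname, "F"), (first, "I"), (patronymic, "O")):
            if part:  # skip missing parts, e.g. a null patronymic
                stats[(part, part_type, gender)] += 1
    return stats

def counts_for(stats, part, part_type):
    """Return (number of men, number of women) recorded for one name part."""
    return stats[(part, part_type, "M")], stats[(part, part_type, "F")]
```

Looking up a name part that never occurred simply returns `(0, 0)`, which the scoring step can treat as "no statistics".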

To decide whether the ratio of men to women for one part of the full name is sufficient grounds for assigning it to a particular sex, I used the following function:

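-- Determines the sex for one part of a full name from the number of men (mcnt) and women (fcnt) who have it
-- +1 - male
--  0 - undetermined
-- -1 - female
function get_sex_by_cnt (mcnt number, fcnt number) return number is
begin
  if mcnt=fcnt then return 0; end if; -- equal counts
  if mcnt<=2 and fcnt<=2 then return 0; end if; -- too few records to decide
  if mcnt>=3 and fcnt=0 then return 1; end if; -- men only; also rules out division by zero below
  if mcnt=0 and fcnt>=3 then return -1; end if; -- women only
  if mcnt/fcnt>0.5 and mcnt/fcnt<2 then return 0; end if; -- neither sex outnumbers the other at least twofold
  if mcnt>fcnt then return 1; end if; -- men clearly prevail
  if mcnt<fcnt then return -1; end if; -- women clearly prevail
end; -- get_sex_by_cnt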

If the gender cannot be determined from the full name (the sum of the scores falls in [-1; 1]), then in most cases the surname can be ignored and only the first name and patronymic used, with the same decision rule (+2: male, -2: female, otherwise undetermined). This gets around cases where an indeclinable surname such as “Tymoshenko” is mistakenly attributed to one sex while the first name and patronymic point to the other.
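
Putting the pieces together, the decision rule with the surname fallback might look like this in Python. It is a sketch, not the author's package: `get_sex_by_cnt` is a direct port of the PL/SQL function above, while `classify` and its return values ('M', 'F', None) are hypothetical names chosen for the example.

```python
def get_sex_by_cnt(mcnt, fcnt):
    """Port of the PL/SQL function: +1 male, -1 female, 0 undetermined."""
    if mcnt == fcnt:
        return 0
    if mcnt <= 2 and fcnt <= 2:
        return 0                      # too few records to decide
    if mcnt >= 3 and fcnt == 0:
        return 1                      # men only
    if mcnt == 0 and fcnt >= 3:
        return -1                     # women only
    if 0.5 < mcnt / fcnt < 2:
        return 0                      # less than a twofold majority
    return 1 if mcnt > fcnt else -1

def classify(surname_cnt, first_cnt, patronymic_cnt):
    """Each argument is a (men, women) pair of counts for that name part."""
    f = get_sex_by_cnt(*surname_cnt)
    i = get_sex_by_cnt(*first_cnt)
    o = get_sex_by_cnt(*patronymic_cnt)
    total = f + i + o
    if total >= 2:
        return "M"
    if total <= -2:
        return "F"
    # Fallback: ignore the surname and require both remaining parts to agree.
    if i + o == 2:
        return "M"
    if i + o == -2:
        return "F"
    return None                       # must be established by other means
```

With made-up counts, `classify((10, 40), (1000, 0), (800, 0))` shows the fallback at work: the surname leans female, but the first name and patronymic are both male, so the result is 'M'.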

Algorithm errors. I found three situations in which the algorithm may give a wrong result:

Asian names. In Chinese there are no formal features that tie a name to a specific gender. The same probably holds for some other languages (I would guess at least Thai, Vietnamese and Korean), i.e. gender cannot be determined from the name in principle. This accounts for 3 of the 6 errors found. On average the numbers of men and women with Asian names were equal, but some (mostly rare) names and surnames belonged mainly to one sex. It is hard to suggest a simple fix other than collecting more statistics, and that will not solve the problem completely. A harder but fairly reliable fix would be a function that decides whether a name is Asian (I have not written one yet, but judging by a conversation with a translator colleague it is feasible) and returns 0 for all such names regardless of the collected statistics.
Initials. In some cases the full name is stored in the database as “Ivanov I. I.”. The function treats the individual letters as a first name and a patronymic; they enter the overall statistics and influence the decision about a person's gender. When a new person with a full name like “Kovalchuk O. I.” is added, the computed gender may differ from the real one. This accounts for one of the six errors found. Such errors can be handled, for example, by assigning a score of zero to single-letter surnames, first names and patronymics regardless of the collected statistics.
Names that are used more often by one gender than the other, or whose usage differs between countries. This accounts for two of the six errors found. Examples (I slightly altered the real names on which the errors occurred so as not to disclose personal data): Mohamad Suleiman Farkhonda (a woman) and Sasha Alexander Jefferson (a woman, US citizen). I do not know whether such cases can be corrected algorithmically, so I simply added these names to an exception table.
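
Two of the fixes mentioned above, zero scores for initials and an exception table, are easy to sketch in Python. The function names and the dictionary-based exception table are assumptions for illustration; the author's actual implementation is not shown in the article.

```python
def part_score(part, raw_score):
    """Override the statistical score for parts that are just an initial.

    part is one name part (may be None); raw_score is the -1/0/+1 score
    obtained from the collected statistics.
    """
    if part is None:
        return 0
    letters = part.rstrip(".")        # treat "I." and "I" the same way
    if len(letters) <= 1:
        return 0                      # single letters carry no gender information
    return raw_score

def gender_with_exceptions(full_name, computed, exceptions):
    """Prefer a manually maintained exception entry over the computed gender."""
    return exceptions.get(full_name, computed)
```

For instance, with `exceptions = {"SASHA ALEXANDER JEFFERSON": "F"}` the manually verified value wins over whatever the statistics suggest for that exact full name.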

How the function's errors were found: the system refused to enter new insured persons into the company's database with a gender different from the one computed by the function. Whenever that happened, company staff contacted the organization that had supplied the list of insured and clarified the person's real gender. Thus every discrepancy between real and computed data was handled manually. In total, about 250 thousand people went through this check.

Unfortunately, this method is not a silver bullet; it is simply better than all the others I have come across. I tested it on several databases at different companies. One drawback is that for some people the gender cannot be determined from the name because the statistics are insufficient: with a base of 1.5 million people this affects slightly more than 1%, with 300 thousand about 3%, and with 6 million 0.8%. My guess is that the share of people whose gender cannot be determined is inversely proportional to the square root of the training set size, but I have no explanation for why. Of course, the share of people who get a gender assigned can be increased (even to almost 100%) by relaxing the conditions for assignment, but for the tasks I worked on, accuracy mattered more than assigning everyone to one of the two classes.
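
The inverse-square-root guess can be checked crudely against the three figures quoted above by fitting a straight line in log-log coordinates: a slope of -0.5 would match the hypothesis exactly. The Python sketch below does that; note that it uses 1.0 as a stand-in for "slightly more than 1%", which is an assumption, and that three points are far too few for any firm conclusion.

```python
import math

# (training set size, percent of people with undetermined gender)
# Assumption: 1.0 stands in for "slightly more than 1%".
observations = [(300_000, 3.0), (1_500_000, 1.0), (6_000_000, 0.8)]

def loglog_slope(points):
    """Least-squares slope of log(percent) against log(size)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(p) for _, p in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

slope = loglog_slope(observations)
print(f"fitted exponent: {slope:.2f}")
```

Under these assumptions the fitted exponent lands in the neighbourhood of -0.5, which is at least not inconsistent with the guess.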

Another drawback of this method is that it handles typos poorly. Although some typos are fairly standard (a misspelled “Olga” occurs more often than, say, the perfectly valid “Oktyabrina”), for most of them there will be no statistics, so it will not always be possible to determine the gender from such a name. Unfortunately, the converse (if a name has never occurred in the database, it is misspelled) does not hold: people with unique names are no rarer than typos in ordinary names.

Like any tool, this one turned out to have uses I had not thought of when building it:

The collected statistics make it possible to find names containing errors of a certain kind. If, according to the statistics, one part of the full name belongs to one gender and another part to the other, there is probably a typo. An example is “Ivanov Natalia Sergeevna”: here the typo is most likely in the surname, where the final “a” was left out.
If most of the training sample is in one format (F-I-O), the collected data can be used to find full names written in a different order (for example, I-O-F), simply by checking which part of the name each word usually occurs as. This can matter if the company sends out mailings along the lines of “Dear Oleg Konstantinovich!”.
In the cases I observed in practice, the share of errors in the training sample affected neither the accuracy of the prediction nor the number of names that could not be assigned to a class. In one case the share of wrong gender values in the training sample was about 4%; after correcting them and recollecting the statistics, the number of full names whose gender could not be determined changed by less than 1%.
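
The first two side uses can be sketched in a few lines of Python. Both helpers are hypothetical: `has_gender_conflict` takes the per-part scores (-1, 0 or +1) and `guess_part_type` takes a plain dictionary of counts keyed by (word, part type), which is an assumed simplification of the auxiliary statistics table.

```python
def has_gender_conflict(scores):
    """scores maps part type ('F', 'I', 'O') to -1, 0 or +1.

    Returns True when at least one part points to men and another to women,
    which hints at a typo somewhere in the full name.
    """
    values = list(scores.values())
    return any(v > 0 for v in values) and any(v < 0 for v in values)

def guess_part_type(word, type_counts):
    """Return the part type ('F', 'I' or 'O') the word most often occurs as.

    type_counts maps (word, part type) to a number of records; returns None
    if the word has never been seen.
    """
    best = max(("F", "I", "O"), key=lambda t: type_counts.get((word, t), 0))
    return best if type_counts.get((word, best), 0) > 0 else None
```

For “Ivanov Natalia Sergeevna” the scores would be something like `{"F": 1, "I": -1, "O": -1}`, so `has_gender_conflict` flags the record for review.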

UPD 1
Commenters suggested many unusual names, and I checked how the algorithm described above handles them. At the link: the suggested names and their variations, with gender scores for the surname, first name and patronymic.
UPD 2
It was suggested below that determining gender from a name is a typical classification task. I found data on 500 thousand people, for each of whom the last letters of the surname (ssecondname), first name (sname) and patronymic (sthirdname) are known; the numbers 3, 2 and 1 indicate how many trailing characters of that part of the full name were used. Unfortunately, I have no other sample like this, and I cannot build one with a different set of features for further research.
I also built a decision tree on this data:
Unpruned (and therefore inevitably overfitted): h1analysis.ru/analysis/download/89
Pruned to remove all branches containing fewer than 10 people: h1analysis.ru/analysis/download/90
I did not check the quality of the classification, nor did I try validation schemes such as a 30/70 split.
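
For readers who want to reproduce this kind of feature set, here is a hedged Python sketch of extracting trailing-letter features. The column prefixes come from the text above, but the exact feature names (for example `sname2`) and the lowercasing are assumptions, since the original dataset layout is not shown.

```python
def suffix_features(surname, first, patronymic, lengths=(1, 2, 3)):
    """Build trailing-letter features for a decision tree or other classifier.

    Returns a dict such as {'ssecondname3': 'ova', 'sname2': 'ia', ...};
    missing name parts produce empty strings.
    """
    features = {}
    columns = (("ssecondname", surname), ("sname", first), ("sthirdname", patronymic))
    for column, value in columns:
        for n in lengths:
            # A slice shorter than n simply returns the whole word.
            features[f"{column}{n}"] = value[-n:].lower() if value else ""
    return features
```

The resulting dictionaries can be fed to any classifier that accepts categorical features after suitable encoding.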

Source: https://habr.com/ru/post/274499/
