⬆️ ⬇️

“Boys - to the left, girls - to the right”, or add the “Gender” field in the Oracle database

... there are three types of dragons: zero, imaginary and negative. All of them do not exist, but each type is in its own particular manner. The imaginary and zero dragons, called in the professional language imaginary and null, do not exist in a much less interesting way than the negative ones.

Stanislav Lem,

Cyberiad


Preamble



Constructive criticism of the article "Problems of merging records in a complex Oracle table" post. Let us try to analyze the methods for solving the problem of identifying the sex of a person based on his client data in the Oracle database.



Initial data



Suppose there is some experimental database on Oracle that contains customer data in the hotel.cards table. At a minimum, we have the last name (cards.last_name), the first name (cards.first_name) of the client ... and maybe even the middle name (cards.middle_name). Let's see what we can learn from this data.



middle name



The most unequivocal and trivial solution to identify the gender of our client is information about his patronymic. There are not so many suffixes of Russian middle names in Russian.

Wikipedia about patronymic
Patronyms, formed from the male names of the second declension, are formed by adding to the basis of the suffixes -ovich / -ovna, -evich / -evna: Roman - Romanovich, Nikolai - Nikolaevich; at the same time the names ending in -s change it to -v-: Vitaly - Vitalyevich.

The patronymic of men, formed from the male names of the first declension, are formed by adding to the base of the suffix -ich: Nikita-Nikitich, Luka-Lukich.

The patronymic of women formed from the male names of the first declension are formed by adding to the base of the suffix -ichna, if the ending was unstressed, and -ichicna, if the stress fell on the ending: Nikita - Nikitichna, but Luka - Lukinichna.


Actually, if all the patronymic names are entered into the base, then the field can be filled in a very trivial way.

Oracle user console on database server
sqlplus / as sysdba

alter table hotel.cards add gender varchar2 (10);

update hotel.cards hc set hc.gender = 'male' where hc.middle_name like '% ich';

update hotel.cards hc set hc.gender = 'female' where hc.middle_name like '% on';

commit;



But this method is not always applicable for several reasons:

1) the system was developed by a foreign customer, and the database scheme does not provide for patronymic names;

2) our foreign clients in foreign passports do not have the “Patronymic” field, and our operators do not specify this data with them.



For such cases, we still have the name and surname of the client.

')

Surname



Hours of joint meditation with Google led me to the following disappointing facts:

1) foreign surnames are usually unchanged in the masculine and feminine (English: Mr. and Mrs. Smidt, Ukrainian: Gregory and Maria Beloshtan, Ivan and Galina Krasnoshlapka);

2) the last letters of a foreign surname cannot unequivocally indicate the gender of its owner;

3) one of its parts may indicate to the floor of the owner of a compound last name.



In general, there are so many exceptions with the surnames that manual work is definitely not to be avoided. Naturally, our operators will do it, so that in the coming week do not expect invitations from them for tea with cookies.



Name



The most unambiguous parameter in our database for analyzing a person’s gender is his name. There are many sites with lists of male and female names, books with interpretations of male and female names can be found in paper and electronic form, but ... We, due to our laziness, should find either a ready-made database of names with gender (such by the way are sold), or search this information is in a form suitable for machine processing. All we need is to our sample.

select distinct hc.first_name from hotel.cards hc;



add the gender field and fill it with the values ​​"male / female". Meditations with Google on reference books and databases of names, various api to social networks and points of sale of various databases suddenly stopped when I came across the service “i-gender.com” . A very convenient API, recognition of names in Cyrillic and transliteration, as well as short testing of the veracity of its results convinced me that it is possible to transfer the routine work of our operators on the shoulders of this service so as not to distract them from the process of treating me with the liver.



The service is very unpretentious, demanding from us an http request of the form name = Innocent. In response, we get a string like {"gender": "male", "confidence": "100"}. True, the request must be made using the POST method. For curl, this is not a problem; it is easy to assign an analysis of the answer to a simple bash script. It remains to perform a series of routine gestures and to persuade the operators to review the results of work for possible mistakes of the i-gender service . Curious please under spoiler.

On the database server
bash-3.2 $ uname -a

Linux ******* 2.6.39-300.26.1.el5uek # 1 SMP Thu Jan 3 18:31:38 PST 2013 x86_64 x86_64 x86_64 GNU / Linux



bash-3.2 $ sqlplus / as sysdba

SQL * Plus: Release 11.2.0.1.0 Production on Tue Jul 30 09:39:27 2013

Copyright © 1982, 2009, Oracle. All rights reserved.

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options



mkdir / u02 / xtern

cd / u02 / xtern

touch sel.sql



sel.sql (request for a selection of the unique names of our customer base):

set linesize 9999

set pagesize 0

set head off

set underline off

set feedback off

set trimspool on

set echo off

spool result.log

set termout off

select distinct hc.first_name from hotel.cards hc;

set termout on

spool off

exit



Unload sample to file

sqlplus -s / as sysdba @ sel.sql



Remove from it extra spaces

cat ./result.log | sed 's / ^ * //; s / * $ //'> ./result.csv



On a car that has an outlet outside
[****] $ uname -a

Linux ******* 2.6.32-358.6.2.el6.i686 # 1 SMP Thu May 16 18:12:13 UTC 2013 i686 i686 i386 GNU / Linux

[****] $ curl -V

curl 7.19.7 (i386-redhat-linux-gnu) libcurl / 7.19.7 NSS / 3.14.0.0 zlib / 1.2.3 libidn / 1.18 libssh2 / 1.4.2

Protocols: tftp ftp telnet dict ldap ldaps http file https ftps scp sftp

Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz



touch ./myget.sh

chmod 751 ./myget.sh



myget.sh (script for requests to the i-gender.com service):

#! / bin / bash



while read NAME

do

ANSWER = $ (curl -d "name = $ NAME" www.i-gender.com/ai )

# answer will be something like this

# {"Gender": "male", "confidence": "100"}

GENDER = $ (echo $ ANSWER | cut -f 4 -d '"')

CONFIDENCE = $ (echo $ ANSWER | cut -f 8 -d '"')

echo "$ NAME, $ GENDER, $ CONFIDENCE" >> ./names.csv

It is some time to rest)))

sleep 1

done <./result.csv



result.csv (input):

Dmitriy

Anastasia

Valentine

Elissa

Anatoly

Natalia

Irina

Igor

Paul

Elizabeth

Lydia

Gregory

Alla

Osama



names.csv (resulting file)

Dmitry, male, 100

Anastasia, female, 100

Valentine, female, 85

Elissa, female, 100

Anatoly, male, 100

Natalia, female, 100

Irina, female, 100

Igor, male, 100

Paul, male, 100

Elizabeth, female, 100

Lydia, female, 100

Gregory, male, 100

Alla, female, 100

Usama, male, 100



We connect the result to our database
sqlplus / as sysdba



create or replace directory xtern_data_dir as '/ u02 / xtern';

create table pcmd.xtern_names

(

first_name varchar2 (50),

gender varchar2 (10),

confidence number

)

organization external

(

default directory xtern_data_dir

access parameters

(records delimited by newline

fields terminated by ','

)

location ('names.csv')

);



select * from pcmd.xtern_names;



When the data is connected to the database, they need to be subjected to preliminary verification using human labor. You can then integrate them into the existing schema, expanding the existing tables, creating new ones associated with views and triggers. But this has little to do with the topic of our article.



Conclusion



Using only the last name, first name and patronymic of our client, we cannot solve the task set automatically. But either purchased databases and third-party samples, or services like www.i-gender.com will help speed up and automate our task .

Source: https://habr.com/ru/post/188358/



All Articles