... there are three types of dragons: zero, imaginary and negative. All of them do not exist, but each type is in its own particular manner. The imaginary and zero dragons, called in the professional language imaginary and null, do not exist in a much less interesting way than the negative ones.
Stanislav Lem,
Cyberiad
Preamble
Constructive criticism of the article
"Problems of merging records in a complex Oracle table" post. Let us try to analyze the methods for solving the problem of identifying the sex of a person based on his client data in the Oracle database.
Initial data
Suppose there is some experimental database on Oracle that contains customer data in the hotel.cards table. At a minimum, we have the last name (cards.last_name), the first name (cards.first_name) of the client ... and maybe even the middle name (cards.middle_name). Let's see what we can learn from this data.
middle name
The most unequivocal and trivial solution to identify the gender of our client is information about his patronymic. There are not so many suffixes of Russian middle names in Russian.
Wikipedia about patronymicPatronyms, formed from the male names of the second declension, are formed by adding to the basis of the suffixes -ovich / -ovna, -evich / -evna: Roman - Romanovich, Nikolai - Nikolaevich; at the same time the names ending in -s change it to -v-: Vitaly - Vitalyevich.
The patronymic of men, formed from the male names of the first declension, are formed by adding to the base of the suffix -ich: Nikita-Nikitich, Luka-Lukich.
The patronymic of women formed from the male names of the first declension are formed by adding to the base of the suffix -ichna, if the ending was unstressed, and -ichicna, if the stress fell on the ending: Nikita - Nikitichna, but Luka - Lukinichna.
Actually, if all the patronymic names are entered into the base, then the field can be filled in a very trivial way.
Oracle user console on database serversqlplus / as sysdba
alter table hotel.cards add gender varchar2 (10);
update hotel.cards hc set hc.gender = 'male' where hc.middle_name like '% ich';
update hotel.cards hc set hc.gender = 'female' where hc.middle_name like '% on';
commit;
But this method is not always applicable for several reasons:
1) the system was developed by a foreign customer, and the database scheme does not provide for patronymic names;
2) our foreign clients in foreign passports do not have the “Patronymic” field, and our operators do not specify this data with them.
For such cases, we still have the name and surname of the client.
')
Surname
Hours of joint meditation with Google led me to the following disappointing facts:
1) foreign surnames are usually unchanged in the masculine and feminine (English: Mr. and Mrs. Smidt, Ukrainian: Gregory and Maria Beloshtan, Ivan and Galina Krasnoshlapka);
2) the last letters of a foreign surname cannot unequivocally indicate the gender of its owner;
3) one of its parts may indicate to the floor of the owner of a compound last name.
In general, there are so many exceptions with the surnames that manual work is definitely not to be avoided. Naturally, our operators will do it, so that in the coming week do not expect invitations from them for tea with cookies.
Name
The most unambiguous parameter in our database for analyzing a person’s gender is his name. There are many sites with lists of male and female names, books with interpretations of male and female names can be found in paper and electronic form, but ... We, due to our laziness, should find either a ready-made database of names with gender (such by the way are sold), or search this information is in a form suitable for machine processing. All we need is to our sample.
select distinct hc.first_name from hotel.cards hc;
add the gender field and fill it with the values "male / female". Meditations with Google on reference books and databases of names, various api to social networks and points of sale of various databases suddenly
stopped when I came across the service
“i-gender.com” . A very convenient API, recognition of names in Cyrillic and transliteration, as well as short testing of the veracity of its results convinced me that it is possible to transfer the routine work of our operators on the shoulders of this service so as not to distract them from the process of treating me with the liver.
The service is very unpretentious, demanding from us an http request of the form name = Innocent. In response, we get a string like {"gender": "male", "confidence": "100"}. True, the request must be made using the POST method. For curl, this is not a problem; it is easy to assign an analysis of the answer to a simple bash script. It remains to perform a series of routine gestures and to persuade the operators to review the results of work for possible mistakes of the
i-gender service . Curious please under spoiler.
On the database serverbash-3.2 $ uname -a
Linux ******* 2.6.39-300.26.1.el5uek # 1 SMP Thu Jan 3 18:31:38 PST 2013 x86_64 x86_64 x86_64 GNU / Linux
bash-3.2 $ sqlplus / as sysdba
SQL * Plus: Release 11.2.0.1.0 Production on Tue Jul 30 09:39:27 2013
Copyright © 1982, 2009, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
mkdir / u02 / xtern
cd / u02 / xtern
touch sel.sql
sel.sql (request for a selection of the unique names of our customer base):
set linesize 9999
set pagesize 0
set head off
set underline off
set feedback off
set trimspool on
set echo off
spool result.log
set termout off
select distinct hc.first_name from hotel.cards hc;
set termout on
spool off
exit
Unload sample to file
sqlplus -s / as sysdba @ sel.sql
Remove from it extra spaces
cat ./result.log | sed 's / ^ * //; s / * $ //'> ./result.csv
On a car that has an outlet outside[****] $ uname -a
Linux ******* 2.6.32-358.6.2.el6.i686 # 1 SMP Thu May 16 18:12:13 UTC 2013 i686 i686 i386 GNU / Linux
[****] $ curl -V
curl 7.19.7 (i386-redhat-linux-gnu) libcurl / 7.19.7 NSS / 3.14.0.0 zlib / 1.2.3 libidn / 1.18 libssh2 / 1.4.2
Protocols: tftp ftp telnet dict ldap ldaps http file https ftps scp sftp
Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz
touch ./myget.sh
chmod 751 ./myget.sh
myget.sh (script for requests to the i-gender.com service):
#! / bin / bash
while read NAME
do
ANSWER = $ (curl -d "name = $ NAME"
www.i-gender.com/ai )
# answer will be something like this
# {"Gender": "male", "confidence": "100"}
GENDER = $ (echo $ ANSWER | cut -f 4 -d '"')
CONFIDENCE = $ (echo $ ANSWER | cut -f 8 -d '"')
echo "$ NAME, $ GENDER, $ CONFIDENCE" >> ./names.csv
It is some time to rest)))
sleep 1
done <./result.csv
result.csv (input):
Dmitriy
Anastasia
Valentine
Elissa
Anatoly
Natalia
Irina
Igor
Paul
Elizabeth
Lydia
Gregory
Alla
Osama
names.csv (resulting file)
Dmitry, male, 100
Anastasia, female, 100
Valentine, female, 85
Elissa, female, 100
Anatoly, male, 100
Natalia, female, 100
Irina, female, 100
Igor, male, 100
Paul, male, 100
Elizabeth, female, 100
Lydia, female, 100
Gregory, male, 100
Alla, female, 100
Usama, male, 100
We connect the result to our databasesqlplus / as sysdba
create or replace directory xtern_data_dir as '/ u02 / xtern';
create table pcmd.xtern_names
(
first_name varchar2 (50),
gender varchar2 (10),
confidence number
)
organization external
(
default directory xtern_data_dir
access parameters
(records delimited by newline
fields terminated by ','
)
location ('names.csv')
);
select * from pcmd.xtern_names;
When the data is connected to the database, they need to be subjected to preliminary verification using human labor. You can then integrate them into the existing schema, expanding the existing tables, creating new ones associated with views and triggers. But this has little to do with the topic of our article.
Conclusion
Using only the last name, first name and patronymic of our client, we cannot solve the task set automatically. But either purchased databases and third-party samples, or services like
www.i-gender.com will help speed up and automate our task
.