
Data Mining: Primary data processing using DBMS. Part 1

What this article is about

Research into large volumes of data involves many subtleties and pitfalls, especially for those who are just beginning to explore the hidden dependencies and internal connections within arrays of information. For someone studying on their own, an extra difficulty is choosing examples to learn from and finding a community to exchange views with and evaluate their progress. The example should not be too complicated, yet it should cover the main problems that arise when solving tasks close to reality.

From this point of view, the Kaggle site [1] is very interesting: it turns the study of data into a sport by holding data analysis competitions. Some competitions come with educational materials and are intended for beginners. This series of articles is devoted to exactly that: learning data analysis through one of the training tasks. This first article covers data preparation and the use of a DBMS for that purpose; essentially, how and where to start. It is assumed that the reader knows SQL.

Task: "Titanic: Machine Learning in Disasters."

One of the two competitions for beginners is Titanic [2]. A translation of the assignment:
"The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of 2224 passengers and crew. This sensational tragedy shocked the international community and led to improved safety regulations for ships. One of the reasons the shipwreck caused such loss of life was the lack of lifeboats for the passengers and crew. There was also an element of luck involved in surviving the sinking, and some groups of people were more likely to survive than others, such as women, children, and the upper class. In this competition we ask you to complete an analysis of which sorts of people were most likely to survive. In particular, we ask you to apply machine learning tools to predict which passengers survived the tragedy."

Data

Two files are given for the competition: train.csv and test.csv. They contain comma-separated text data, one line per passenger record. Some values in a record may be unknown and omitted.
DESCRIPTION OF VARIABLES:
Variable name   Meaning
survival        Survival (0 = No; 1 = Yes)
pclass          Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of siblings/spouses aboard
parch           Number of parents/children aboard
ticket          Ticket number
fare            Passenger fare
cabin           Cabin
embarked        Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Note:
Pclass is a proxy for socio-economic status (SES):
1st ~ upper; 2nd ~ middle; 3rd ~ lower
Age is in years; fractional if the age is less than one (1).
If the age is estimated, it has the form xx.5.
The following definitions are used for sibsp and parch:
Siblings: brother, sister, stepbrother, or stepsister aboard the Titanic.
Spouse: husband or wife aboard the Titanic (mistresses and fiancés ignored).
Parent: mother or father aboard the Titanic.
Child: son, daughter, stepson, or stepdaughter aboard the Titanic.
Other relatives are excluded, including cousins, nephews/nieces, aunts/uncles, and in-laws.
Children traveling only with a nanny have parch = 0.
Similarly, people traveling with close friends or neighbors are not counted in sibsp or parch.

This task, as we can see, is well structured, and the fields are essentially already defined. The first stage is to prepare the data for machine learning. Questions of field selection and redefinition, splitting, merging, and classification all depend on how the data is fed in. By and large, it comes down to encoding and normalization: encoding the qualitative features (there are several approaches) and preprocessing the data before applying machine learning methods. This particular task does not involve a truly large volume of data, but most tasks (for example, the Heritage Health Prize or Yandex Internet Mathematics) are far less structured, and you have to operate on millions of records. It is more convenient to do that with a DBMS. I chose PostgreSQL; all the SQL below was written for it, but with minor changes it will also run on MySQL, Oracle, and MS SQL Server.
Load the data

Download the two files, train.csv and test.csv.
Then create tables to store the data:
-- drop the table if it already exists
DROP TABLE IF EXISTS titanik_train;
-- create the table
CREATE TABLE titanik_train (
    survived int,
    pclass int,
    name varchar(255),
    sex varchar(255),
    age float,
    sibsp int,
    parch int,
    ticket varchar(255),
    fare float,
    cabin varchar(255),
    embarked varchar(255)
);
-- load the data from the CSV file '/home/andrew/titanik/train.csv'.
-- HEADER tells COPY that the first line of the CSV is a header with column names.
COPY titanik_train FROM '/home/andrew/titanik/train.csv' CSV HEADER;

DROP TABLE IF EXISTS titanik_test;
CREATE TABLE titanik_test (
    pclass int,
    name varchar(255),
    sex varchar(255),
    age float,
    sibsp int,
    parch int,
    ticket varchar(255),
    fare float,
    cabin varchar(255),
    embarked varchar(255)
);
COPY titanik_test FROM '/home/andrew/titanik/test.csv' CSV HEADER;

As a result we obtain two tables: one with training data and one with test data.
In this case there were no data loading errors, i.e. all the data matched the types we defined. If errors do appear, there are two ways to handle them: either fix the source file first with regular expressions in an editor such as sed (or a perl -e one-liner), or load everything as text first and fix it with regular expressions and queries inside the DBMS.
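A minimal sketch of the second route (the staging table name and cleanup rules here are illustrative, not part of the original solution):

-- Staging table: every column as text, so COPY cannot fail on a bad value.
CREATE TABLE titanik_train_raw (
    survived text, pclass text, name text, sex text, age text, sibsp text,
    parch text, ticket text, fare text, cabin text, embarked text
);
COPY titanik_train_raw FROM '/home/andrew/titanik/train.csv' CSV HEADER;

-- Clean with regular expressions, then cast to the target types.
-- NULLIF turns empty strings into NULLs instead of cast errors.
INSERT INTO titanik_train
SELECT survived::int, pclass::int, name, sex,
       NULLIF(regexp_replace(age, '[^0-9.]', '', 'g'), '')::float,
       sibsp::int, parch::int, ticket,
       NULLIF(regexp_replace(fare, '[^0-9.]', '', 'g'), '')::float,
       cabin, embarked
FROM titanik_train_raw;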
Add the primary key:
-- create a sequence
CREATE SEQUENCE titanik_train_seq;
-- copy the data into a new table, adding a generated id column
select nextval('titanik_train_seq') as id, a.*
into titanik_train_pk
from titanik_train a;


Exploring the data


DBMS tables are just a tool, one that makes it more convenient to study and transform the data. From the very beginning it is convenient to divide each passenger record's fields into numeric and text fields, by data type. To use machine learning algorithms, everything will eventually have to be converted to a numeric representation, and how adequately we do that strongly affects the quality of the methods we choose.

Numeric data:
survived, pclass, age, sibsp, parch, fare
Text data:
name, sex, ticket, cabin, embarked
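Before encoding anything, it is worth profiling the fields. A small sketch (against the titanik_train_pk table built above) that counts missing ages and ages that look estimated under the xx.5 convention from the data description:

-- How many ages are missing, and how many look estimated (xx.5 for age >= 1)?
select count(*) as total,
       sum(case when age is null then 1 else 0 end) as age_missing,
       sum(case when age >= 1 and age - floor(age) = 0.5 then 1 else 0 end) as age_estimated
from titanik_train_pk;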

Let's start with the data that comes as text. Essentially, this is the problem of encoding information, and here the simplest approach is not necessarily the most correct one. For example, setting our task aside for a moment, imagine we have a field that holds a job position.

It is logical to encode them in this way:

Position             Code1   Code2
CEO                  5       1
Head of department   4       0.75
Shift supervisor     3       0.5
Cleaner              2       0.25
Trainee              1       0

That is, we have tried to write an element of reality, job positions, into a mathematical representation. This is one way (the simplest) to present such information, and it assumes the "distance" between positions is the same, which does not correspond to reality. Then the question arises: what is the "distance between positions"? Importance? Scope of authority? Amount of responsibility? Prevalence? Place in the hierarchy? Or the number of employees holding that position in the company? There are many such questions, and for a task divorced from reality they have no answers. In real tasks there are answers, and there are methods that help bring the mathematical representation closer to reality, but that is a topic for a separate and complex article. For now, let us assume that we encode the data into the range from 0 to 1 (the Code2 column).
So for data that can be compared, there is at least some explanation of why we decided what counts as smaller and what counts as larger.
After all, numbers carry quantitative information: something is more, and something is less!
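In SQL such an ordinal encoding is just a CASE expression. A minimal sketch, assuming a hypothetical employees table with a position column (none of this is part of the Titanic data):

-- Hypothetical: map an ordered categorical field into the [0, 1] range.
select employee_id,
       case "position"
            when 'CEO'                then 1.0
            when 'Head of department' then 0.75
            when 'Shift supervisor'   then 0.5
            when 'Cleaner'            then 0.25
            when 'Trainee'            then 0.0
       end as position_code
from employees;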
But what if we have "name", which is purely descriptive information? Do we have the right to encode the data, for example, like this:

Name      Code2
Nikolai   0
Peter     0.5
Paul      1

It seems we can encode it this way, but then we add a component to the data that does not exist: a comparison of names (as numbers: more vs. less) by some unknown attribute. It is more logical to present the data a little differently (this is also called bag-of-words); a minimal SQL sketch follows the table:
Record number   Nikolai   Peter   Paul
15602           0         1       0
15603           0         0       1
15604           1         0       0
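In SQL this expansion can be sketched with indicator expressions; the table and column names here are illustrative, and with many distinct values such a query would be generated rather than written by hand:

-- Hypothetical: expand a categorical column into 0/1 indicator columns.
select record_number,
       case when "name" = 'Nikolai' then 1 else 0 end as name_nikolai,
       case when "name" = 'Peter'   then 1 else 0 end as name_peter,
       case when "name" = 'Paul'    then 1 else 0 end as name_paul
from passengers_example;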

However many distinct names there are, that is how many fields we get, and the values are 0 or 1. There is a drawback: what if we have a million different names? What then? Our simple task has no such misfortune, but the answer is that this is a rapidly growing branch of computer science and mathematics, the compression of input data. Relevant keywords are "sparse matrix", "autoencoder", "Vowpal Wabbit" [3], and "feature hashing". Again, each is a topic for its own article, and materials on the subject have appeared on Habr before. For our task we can set this problem aside and return to the question:
Does the first approach have a right to exist?
The answer depends strongly on whether we can neglect the component we added, the ranking of names by some "unknown attribute by which we compared them". Often it has to be neglected, when there is no other option.

Now what if, instead of names, we take the alphanumeric ticket numbers (which is closer to our task)? Tickets are easier: there is a series and there is a number. The number can be left as is; perhaps it determines the boarding location and, accordingly, the distance from the entrance, and so on. Encoding the series is already harder. First of all, we must answer the question: can a single real-world value be represented by different strings in the data? The answer is simple: it can, so this needs to be checked. The most common case is a typo: an extra period, comma, space, or dash.
Such data already introduces distortions, and a model built on "corrupted" information will not perform well (unless the corruption was introduced in a controlled way to improve the model's generalization).
To begin, we separate the ticket number from the series (type, brand, etc.). To do this, we look through the data, pre-sorted by name, and pick out the data groups.

Next we create a table in which the series is split out into a separate field; where there is no series, we put an empty value for now. The query:
select id, survived, pclass, "name", sex, age, sibsp, parch,
       ticket, fare, cabin, embarked,
       m[1] as ticket_type,
       m[2] as ticket_number
into titanik_train_1
from (select id, survived, pclass, "name", sex, age, sibsp, parch,
             ticket, fare, cabin, embarked,
             regexp_matches(ticket, '^\s*?(.*?)\s*?(\d*?)$') as m
      from titanik_train_pk) as a;

where '^\s*?(.*?)\s*?(\d*?)$' is a regular expression that splits the ticket field into two parts, the series and the number, as an array.
We leave the numbers as they are; they can be checked for non-digit characters with a similar regular expression, as in the sketch below.
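For example, a check of this kind (it should return no rows if the split worked):

-- Any ticket numbers that still contain a non-digit character?
select id, ticket, ticket_number
from titanik_train_1
where ticket_number ~ '[^0-9]';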
The duplicate problem


If the same real-world object is named in different ways, we get the problem of duplicates. Let us dwell on only one source of duplicates: typos in the name. Given that our data volume is small and easy to look through by eye, we proceed simply and output all the unique values of the field from the table.

The query:

select ticket_type, count(ticket_type)
from titanik_train_1
group by 1
order by 2 desc;


The result contains 45 records; in it I marked the values that are presumably the same but written as different strings.

Quite a lot of coincidences. If you dug deeper into the task and found out on what basis the tickets were labeled, you could reduce the number of items further. But one of the conditions of the task was not to use additional information about this well-known tragedy, so we stop here.

The queries that change the data look like this:
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A./5.';
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A/5';
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A/5.';
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A/S';
update titanik_train_1 set ticket_type='A/4' where ticket_type = 'A/4.';
update titanik_train_1 set ticket_type='A/4' where ticket_type = 'A4.';
update titanik_train_1 set ticket_type='CA' where ticket_type = 'CA';
update titanik_train_1 set ticket_type='CA' where ticket_type = 'CA.';
update titanik_train_1 set ticket_type='SW/PP' where ticket_type = 'SW/PP';
update titanik_train_1 set ticket_type='SC/PARIS' where ticket_type = 'SC/Paris';
update titanik_train_1 set ticket_type='SOTON/OQ' where ticket_type = 'SOTON/OQ';
update titanik_train_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O 2.';
update titanik_train_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O2.';
update titanik_train_1 set ticket_type='W/C' where ticket_type = 'W./C.';
update titanik_train_1 set ticket_type='WEP' where ticket_type = 'WE/P';

Now there are 30 records.
In more complex tasks the processing is substantially different, especially when you cannot look through everything by eye. Then you can use the Levenshtein distance function, which finds words that are close in spelling; you can tweak it a little so that words differing only in punctuation come out even closer. Again we return to the concepts of measure and metric: what do we mean by the distance between words? Are characters that look similar, like B and 8, closer? Or ones that sound similar?
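In PostgreSQL the Levenshtein distance is available as the levenshtein() function from the standard fuzzystrmatch extension. A sketch of how it could be applied to the ticket series (the distance threshold of 2 is an arbitrary choice for illustration):

-- The fuzzystrmatch extension ships with PostgreSQL.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Pairs of distinct ticket series within edit distance 2 of each other.
select a.ticket_type, b.ticket_type,
       levenshtein(a.ticket_type, b.ticket_type) as dist
from (select distinct ticket_type from titanik_train_1) a
join (select distinct ticket_type from titanik_train_1) b
  on a.ticket_type < b.ticket_type
where levenshtein(a.ticket_type, b.ticket_type) <= 2;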
In principle, all the character fields in the training and test tables should be worked through in this simple way. As an improvement, the corresponding columns of the two tables can be merged before searching for duplicates.
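A sketch of that merge, assuming the test table has been split into series and number in the same way (the titanik_test_1 name is hypothetical):

-- Distinct ticket series across both tables, for a single duplicate search.
select ticket_type from titanik_train_1
union
select ticket_type from titanik_test_1;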
A note on dividing the work: at this stage the processing requires no special knowledge and is easy to parallelize among several people. This, for example, is a stage where a team strongly outperforms a lone competitor.

This post has already grown quite long, so the sequel follows in the next part.

References:

1. www.kaggle.com
2. www.kaggle.com/c/titanic-gettingStarted
3. hunch.net/~vw/

Update:
Part Two: habrahabr.ru/post/165281
Part Three: habrahabr.ru/post/165283
Part Four: habrahabr.ru/post/173819

Source: https://habr.com/ru/post/165001/

