Every half hour a new article appears with a screaming slogan:
Big data is the "new oil"! A godsend for marketing copy: Big Data = Big Oil = Profit. Where did this statement come from? Let's go beyond the cliché and dig a little deeper.
One of the first to use it was Michael Palmer [1], as early as 2006:
Data is just like crude oil. It is valuable, but without processing it cannot really be used. It has to be turned into gas, plastics, chemicals, and so on to create the value that drives profit; likewise, data must be broken down and analyzed to become valuable.
This understanding of the "Big data is the new oil!" trend is closer to reality than to marketing. And it does not cancel Disraeli's statement:
"There are three kinds of lies: lies, damned lies, and statistics."
This article is a continuation of the Data Mining topic: primary data processing using a DBMS. Part 1. We continue the extraction!

Continuing to remove duplicates
In the last article, a small code example was given that lets you get rid of "twin" values. Let's keep moving in the same direction. For this to work, the table with the test data also needs to be converted:
In this code example, I decided to reuse the sequence from the previous table, to make it easier to combine the data later if necessary. The ticket name is split into text and series in the same way as before.
We apply similar update statements to the test table (plus, at the end, we add two more to replace values that did not occur in the training table):
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A./5.';
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A/5';
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A/5.';
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A/S';
update titanik_test_1 set ticket_type='A/4' where ticket_type = 'A/4.';
update titanik_test_1 set ticket_type='A/4' where ticket_type = 'A4.';
update titanik_test_1 set ticket_type='CA' where ticket_type = 'CA';
update titanik_test_1 set ticket_type='CA' where ticket_type = 'CA.';
update titanik_test_1 set ticket_type='SW/PP' where ticket_type = 'SW/PP';
update titanik_test_1 set ticket_type='SC/PARIS' where ticket_type = 'SC/Paris';
update titanik_test_1 set ticket_type='SOTON/OQ' where ticket_type = 'SOTON/OQ';
update titanik_test_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O 2.';
update titanik_test_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O2.';
update titanik_test_1 set ticket_type='W/C' where ticket_type = 'W./C.';
update titanik_test_1 set ticket_type='WEP' where ticket_type = 'WE/P';
update titanik_test_1 set ticket_type='SOTON/OQ' where ticket_type = 'STON/OQ.';
update titanik_test_1 set ticket_type='SC/PARIS' where ticket_type = 'SC/PARIS';
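The same normalization can be expressed more compactly as a lookup table. A minimal Python sketch of the idea (the mapping mirrors the UPDATE statements above; anything not in the mapping is left unchanged):

```python
# Spelling variants of a ticket series mapped to one canonical form.
# This mirrors the UPDATE statements above; the no-op updates
# (e.g. 'CA' -> 'CA') are simply omitted.
TICKET_TYPE_MAP = {
    'A./5.': 'A.5.', 'A/5': 'A.5.', 'A/5.': 'A.5.', 'A/S': 'A.5.',
    'A/4.': 'A/4', 'A4.': 'A/4',
    'CA.': 'CA',
    'SC/Paris': 'SC/PARIS',
    'STON/O 2.': 'SOTON/O2', 'STON/O2.': 'SOTON/O2',
    'STON/OQ.': 'SOTON/OQ',
    'W./C.': 'W/C',
    'WE/P': 'WEP',
}

def normalize_ticket_type(raw: str) -> str:
    """Collapse spelling variants of a ticket series to one canonical form."""
    return TICKET_TYPE_MAP.get(raw, raw)

print(normalize_ticket_type('A/5'))        # A.5.
print(normalize_ticket_type('STON/O 2.'))  # SOTON/O2
print(normalize_ticket_type('PC'))         # PC (unknown values pass through)
```

This makes it easy to keep train and test normalization in sync: both go through the same mapping.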
The ticket data is processed. Now the remaining text data needs to be handled on the same principle:
Sex (sex): no duplicates found, no splitting needed:
select sex, count(sex) from titanik_train_1 group by 1 order by 1 asc;
select sex, count(sex) from titanik_test_1 group by 1 order by 1 asc;
Cabins (cabin): things are more interesting here:
If we execute the first query (commented out), we get a rather strange value: 0 records with no cabin specified. This is due to how aggregate functions work: they skip NULL values, so count(cabin) cannot count the empty cabins. Therefore we write count(id), and we get the result: 687 passengers with an unspecified cabin. One could assume this is a "common" compartment, and that for these records the ticket series is most likely not specified either.
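The NULL-skipping behavior is easy to demonstrate. A minimal sketch using SQLite (the toy table and values are invented for illustration, but the aggregate semantics are the same as in PostgreSQL):

```python
import sqlite3

# Why count(cabin) returned 0 for the NULL group: aggregate functions
# skip NULLs, so we count a column that is never NULL (id) instead.
conn = sqlite3.connect(':memory:')
conn.execute('create table t (id integer, cabin text)')
conn.executemany('insert into t values (?, ?)',
                 [(1, 'C85'), (2, None), (3, None)])

null_group = conn.execute(
    "select count(cabin), count(id) from t where cabin is null").fetchone()
print(null_group)  # (0, 2): count(cabin) ignores the NULLs, count(id) does not
```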
Let's check our assumption:
select id, cabin, ticket_type from titanik_train_1 where cabin ISNULL;
select id, cabin, ticket_type from titanik_train_1 where cabin NOTNULL;
Not confirmed: the first query returns many rows for which a ticket series is specified. And the reverse (query number 2)? Also not confirmed. We conclude that either the ticket series was lost for some people, or the series indicates something other than whether the person had a cabin. That is, it carries additional information. We return to the previous query.
In the output of the query that groups cabins and counts records, there are some interesting rows:

cabin | count
---|---
C23 C25 C27 | 4
C30 | 1
F G73 | 2
T | 1
First, there is information about the cabin type (the letter in front of the numbers).
Second, one ticket can list several cabins. And very often several people travel in the same cabin, holding tickets that list several cabins (read: berths). This is quite interesting data that cannot be ignored. In effect it duplicates the information about relatives, but also captures, for example, friends, acquaintances, or work colleagues, i.e. people who know each other and are willing to help each other. We also obtain information about how many people were in a cabin.
Conclusion: add a cabin type field, add the number of cabins on the ticket, and add a field with the number of people in those cabins. For example, a family of 4 occupies 2 cabins, or two unrelated people occupy the same cabin. The amount of data is growing!
The queries that implement this are quite complex and require an understanding of how POSIX regular expressions work in PostgreSQL.
You could fit everything into one huge query, but I decided to split it into two parts. Part one determines the cabin type and the number of cabins on the ticket; the second query determines the number of people with the same cabin (or set of cabins) on their ticket.
select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
       cnt as cabin_cnt, cabin_type, ticket_type, ticket_number
into titanik_train_2
from (
    select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?.*?$') as m,
           4 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_train_1
    UNION
    select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m,
           3 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_train_1
    UNION
    select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m,
           2 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_train_1
    UNION
    select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?$') as m,
           1 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_train_1
    UNION
    select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           NULL as m, 0 as cnt, NULL as cabin_type
    from titanik_train_1
    where cabin ISNULL
) as a;
In principle, the only difficult point here is the regular expression. Here is how I built it:
F 82 45 is an example of the kind of value to be extracted. The query is built from the basic unit:
([A-Z]\d*), where [A-Z] means there must be exactly one uppercase letter, and \d* matches any number (zero or more) of digits.
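The same parsing logic can be sketched in Python's re module to see what the basic unit does. This is an illustrative equivalent, not the query itself (PostgreSQL's regexp_matches uses POSIX regular expressions, but the unit [A-Z]\d* behaves identically here):

```python
import re

def parse_cabin(cabin):
    """Return (cabin_type, cabin_cnt) for a raw cabin string, or (None, 0)."""
    if cabin is None:
        return None, 0
    # The basic unit: one uppercase letter followed by zero or more digits.
    tokens = re.findall(r'[A-Z]\d*', cabin)  # 'C23 C25 C27' -> ['C23', 'C25', 'C27']
    if not tokens:
        return None, 0
    # First letter = cabin type (like substring(cabin, 1, 1)),
    # token count = number of cabins on the ticket.
    return tokens[0][0], len(tokens)

print(parse_cabin('C23 C25 C27'))  # ('C', 3)
print(parse_cabin('T'))            # ('T', 1)
print(parse_cabin(None))           # (None, 0)
```

Note that re.findall collapses the five UNION branches into one call: instead of trying the 4-, 3-, 2- and 1-cabin patterns separately, it simply counts how many units match.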
And the second query, which counts the number of people in the cabins:
select a.*, b.cnt as cabin_people_cnt
into titanik_train_3
from titanik_train_2 a,
     (select cabin as cabid, count(id) as cnt
      from titanik_train_1
      group by 1) as b
where a.cabin = b.cabid OR (a.cabin ISNULL AND b.cabid ISNULL);
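What this self-join computes can be sketched in a few lines of Python: for every passenger, count how many passengers share exactly the same cabin string (with NULLs grouped together, as the a.cabin ISNULL AND b.cabid ISNULL branch does). The toy rows below are invented for illustration:

```python
from collections import Counter

# (id, cabin) pairs -- toy data, not the real Titanic rows.
passengers = [
    (1, 'C23 C25 C27'), (2, 'C23 C25 C27'), (3, None), (4, None), (5, 'T'),
]

# Group by the raw cabin string (None groups with None, like the ISNULL branch).
counts = Counter(cabin for _, cabin in passengers)

# Attach the group size back to each passenger, like the join does.
cabin_people_cnt = {pid: counts[cabin] for pid, cabin in passengers}
print(cabin_people_cnt)  # {1: 2, 2: 2, 3: 2, 4: 2, 5: 1}
```

One caveat visible in the sketch: all passengers with NULL cabins end up in a single group, so their cabin_people_cnt is the size of the whole unspecified-cabin group.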
As a result, we obtain three additional fields: the type of cabin, the number of cabins per ticket, and the number of people in the cabin.
We do the same for the test data:
select id, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
       cnt as cabin_cnt, cabin_type, ticket_type, ticket_number
into titanik_test_2
from (
    select id, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?.*?$') as m,
           4 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_test_1
    UNION
    select id, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m,
           3 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_test_1
    UNION
    select id, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m,
           2 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_test_1
    UNION
    select id, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?$') as m,
           1 as cnt, substring(cabin, 1, 1) as cabin_type
    from titanik_test_1
    UNION
    select id, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
           ticket_type, ticket_number,
           NULL as m, 0 as cnt, NULL as cabin_type
    from titanik_test_1
    where cabin ISNULL
) as a;
and the second part:
select a.*, b.cnt as cabin_people_cnt
into titanik_test_3
from titanik_test_2 a,
     (select cabin as cabid, count(id) as cnt
      from titanik_test_1
      group by 1) as b
where a.cabin = b.cabid OR (a.cabin ISNULL AND b.cabid ISNULL);
Only one field remains: the port of embarkation (embarked):
select embarked, count(id) from titanik_train_3 group by 1 order by 1 asc;
select embarked, count(id) from titanik_test_3 group by 1 order by 1 asc;
The result: no duplicates are found, so no splitting is needed:

embarked | count
---|---
C | 168
Q | 77
S | 644
  | 2
What to do with the two records where the data is missing? You can replace them with random values, you can drop them, or you can substitute the average (for a categorical field like this, the most frequent value). Your choice.
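The most-frequent-value option can be sketched like this (the list below just reproduces the counts from the table above, it is not the real column):

```python
from collections import Counter

# Toy column mirroring the counts above: 168 x C, 77 x Q, 644 x S, 2 missing.
embarked = ['C'] * 168 + ['Q'] * 77 + ['S'] * 644 + [None] * 2

# Find the most frequent non-missing value and use it as the fill.
most_common = Counter(v for v in embarked if v is not None).most_common(1)[0][0]
filled = [v if v is not None else most_common for v in embarked]

print(most_common, filled.count('S'))  # S 646
```

In SQL terms this would be a single UPDATE setting embarked to the modal value where it is NULL.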
Conclusions
In this part we have pre-processed the text data in the training and test samples. Time-wise, this work took about three hours, from downloading the data to the current point.
This part turned out to be quite large, so the continuation will be in the next post, where we will try to form a table with numeric values instead of string values. If someone decides to work through this tutorial alongside me, running the queries and processing the data, I will answer questions in the comments. Criticism welcome.
Update. Part three: habrahabr.ru/post/165283
Part four: habrahabr.ru/post/173819