
Data Mining: Primary data processing using DBMS. Part 3 (Pivot Tables)

This series is devoted to analyzing data in search of patterns. As an example, we use one of the training tasks of the Kaggle competitive data analysis community. Although the data for this task is not large, the processing methods considered here are quite applicable to large volumes of data.
After completing Part 1 and Part 2, we have two tables containing the converted data:
titanik_test_3 and titanik_train_3.

Their field structures differ by a single field, survived, whose value we have to predict for the test data set. Here is the code describing the structure of the titanik_train_3 table:
CREATE TABLE titanik_train_3 (
    id bigint,
    survived integer,
    pclass integer,
    name character varying(255),
    sex character varying(255),
    age double precision,
    sibsp integer,
    parch integer,
    ticket character varying(255),
    fare double precision,
    cabin character varying(255),
    embarked character varying(255),
    cabin_cnt integer,
    cabin_type text,
    ticket_type text,
    ticket_number text,
    cabin_people_cnt integer
)

In essence, the task is to turn a table with mixed character and numeric data into a table with a purely numeric representation. Data dictionaries and pivot tables will help us do this: the numeric data is carried over as it is, while the character data is encoded.
The most important condition for using dictionaries is full coverage of the values. It is therefore best at this stage (although in principle it could have been done earlier) to merge the tables into one, putting NULL into the field that is missing from the test set. Since the same sequence was used to generate the primary keys, there should be no conflicts. The merge is done with the UNION operator:
select a.* into titanik_full_1
from (
    select * from titanik_train_3
    union
    select id, NULL::integer as survived, pclass, "name", sex, age, sibsp, parch,
           ticket, fare, cabin, embarked, cabin_cnt, cabin_type, ticket_type,
           ticket_number, cabin_people_cnt
    from titanik_test_3
) as a;
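As a quick sanity check, the merged table should contain both samples: rows from the training set have survived filled in, while rows from the test set have it set to NULL (the standard Kaggle split is 891 training and 418 test rows, 1309 in total):

-- count(survived) ignores NULLs, so it yields the training-set size
select count(*) as total_rows,
       count(survived) as train_rows,
       count(*) - count(survived) as test_rows
from titanik_full_1;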

We now have a single table containing both the test and the training data sets.
Next, drop all fields except the numeric ones:
select a.* into titanik_full_2
from (
    select id, survived, pclass::float, age::float, sibsp::float, parch::float,
           fare::float, cabin_cnt::float,
           CAST(ticket_number as float) as ticket_number,
           cabin_people_cnt::float
    from titanik_full_1
    where ticket_number != ''
    union
    select id, survived, pclass, age, sibsp, parch, fare, cabin_cnt,
           0 as ticket_number, cabin_people_cnt
    from titanik_full_1
    where ticket_number = ''
) as a;

We get the table titanik_full_2, which looks like this:
CREATE TABLE titanik_full_2 (
    id bigint,
    survived integer,
    pclass integer,
    age double precision,
    sibsp integer,
    parch integer,
    fare double precision,
    cabin_cnt integer,
    ticket_number integer,
    cabin_people_cnt bigint
)

Now we will add to this table a separate field for each distinct value of the categorical attributes, indicating whether that value applies to the given row. Such tables are called pivot tables, although these differ slightly from the usual kind: here each value field contains either 0 or 1.
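For instance, expanding the sex field by hand would amount to something like the following sketch (the helper function below automates this for every distinct value of any field):

-- manual expansion of a single field: one 0/1 column per distinct value
ALTER TABLE titanik_full_2 ADD COLUMN sexmale smallint NOT NULL DEFAULT 0;
ALTER TABLE titanik_full_2 ADD COLUMN sexfemale smallint NOT NULL DEFAULT 0;

UPDATE titanik_full_2 SET sexmale = 1
FROM titanik_full_1
WHERE titanik_full_2.id = titanik_full_1.id AND titanik_full_1.sex = 'male';

UPDATE titanik_full_2 SET sexfemale = 1
FROM titanik_full_1
WHERE titanik_full_2.id = titanik_full_1.id AND titanik_full_1.sex = 'female';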

That is, the table becomes wider: the number of new fields equals the number of distinct values. In principle, all of these columns could be created manually with queries like the one above, but it is better to write a small PL/pgSQL function that expands the fields automatically:
CREATE OR REPLACE FUNCTION sparse_matrix_generator(
    tablename_source character varying,
    tablename_dest character varying,
    field_name character varying)
RETURNS integer AS $$
DECLARE
    pgst_object REFCURSOR;
    unival character varying;
BEGIN
    -- iterate over the distinct non-NULL values of the field
    OPEN pgst_object FOR EXECUTE
        'select distinct '||field_name||' from '||tablename_source
        ||' where '||field_name||' NOTNULL';
    LOOP
        FETCH pgst_object INTO unival;
        EXIT WHEN NOT FOUND;
        -- add a 0/1 column named <field_name><value> ...
        EXECUTE 'ALTER TABLE '||tablename_dest||' ADD COLUMN "'||field_name||unival
                ||'" smallint NOT NULL DEFAULT 0';
        -- ... and set it to 1 for the rows that carry this value
        EXECUTE 'UPDATE '||tablename_dest||' SET "'||field_name||unival||'"= 1 FROM '
                ||tablename_source||' WHERE '||tablename_dest||'.id = '||tablename_source
                ||'.id and '||field_name||' = '''||unival||'''';
    END LOOP;
    RETURN 0;
END;
$$ LANGUAGE 'plpgsql';

This function is applied like this:
select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'cabin_type');
select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'ticket_type');
select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'embarked');
select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'sex');
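The number of columns the expanded table now has can be checked against the system catalog (a quick look via information_schema, assuming the table lives in the default schema):

select count(*) as column_cnt
from information_schema.columns
where table_name = 'titanik_full_2';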

Thus, we now have a sparse matrix of 58 columns. It remains to normalize it and to split it into test and training samples by the survived field.
There are different ways to normalize, and different data analysis methods impose different requirements on the input. We will use one of the simplest, min-max normalization: the minimum maps to 0, the maximum to 1, and everything in between is scaled proportionally, i.e. new_value = (value - min) / (max - min). To do this, we write the function:
CREATE OR REPLACE FUNCTION minmax_normalizer(
    tablename_source character varying,
    field_name character varying)
RETURNS integer AS $BODY$
DECLARE
    pgst_object REFCURSOR;
    maxval float;
    minval float;
    C RECORD;
BEGIN
    -- find the extremes of the field
    EXECUTE 'select min("'||field_name||'") as minval, max("'||field_name||'") as maxval from '
            || tablename_source INTO C;
    maxval := C.maxval;
    minval := C.minval;
    -- rescale every value to (value - min) / (max - min)
    EXECUTE 'UPDATE '||tablename_source||' SET "'||field_name||'"=("'||field_name||'"-$1)/($2-$1)'
        USING minval, maxval;
    RETURN 0;
END;
$BODY$ LANGUAGE plpgsql VOLATILE COST 100;

And apply it to the table fields that need to be normalized:
select minmax_normalizer('titanik_full_2', 'pclass');
select minmax_normalizer('titanik_full_2', 'age');
select minmax_normalizer('titanik_full_2', 'sibsp');
select minmax_normalizer('titanik_full_2', 'parch');
select minmax_normalizer('titanik_full_2', 'fare');
select minmax_normalizer('titanik_full_2', 'cabin_cnt');
select minmax_normalizer('titanik_full_2', 'ticket_number');
select minmax_normalizer('titanik_full_2', 'cabin_people_cnt');
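A quick way to verify the result for any of these columns (fare is used here only as an example; NULLs are ignored by min/max and are dealt with later):

-- after min-max normalization the extremes should be exactly 0 and 1
select min(fare) as min_fare, max(fare) as max_fare
from titanik_full_2;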

As a result, we obtain a table with only numerical values ​​in the range from zero to one.
Now split off the test and training samples:
select * into titanik_test_final from titanik_full_2 where survived isnull;
alter table titanik_test_final drop column survived;

for the test sample and accordingly:
 select * into titanik_train_final from titanik_full_2 where survived notnull; 

for training.
These tables still contain empty values. They can be replaced, for example, with the column average. For this we again use a function:
CREATE OR REPLACE FUNCTION null_normalizer(tablename_source character varying)
RETURNS integer AS $BODY$
DECLARE
    pgst_object REFCURSOR;
    fieldval character varying;
    count_null integer;
    field_avg float;
BEGIN
    -- iterate over every column of the table
    OPEN pgst_object FOR EXECUTE
        'select column_name from information_schema.columns where '
        || 'table_name='''||tablename_source||'''';
    LOOP
        FETCH pgst_object INTO fieldval;
        EXIT WHEN NOT FOUND;
        count_null := 0;
        -- how many NULLs does this column contain?
        EXECUTE 'select count(id) from '||tablename_source||' where "'||fieldval||'" isnull'
            INTO count_null;
        IF count_null > 0 THEN
            raise notice 'field: %', fieldval;
            -- replace the NULLs with the column average
            EXECUTE 'select avg('||fieldval||') from '||tablename_source INTO field_avg;
            EXECUTE 'UPDATE '||tablename_source||' set '||fieldval||'= $1 where '||fieldval||' isnull'
                USING field_avg;
        END IF;
    END LOOP;
    RETURN 0;
END;
$BODY$ LANGUAGE plpgsql VOLATILE COST 100;

The function works as follows: it selects all column names of the table, counts the NULL values in each column and, if that count is greater than zero, computes the column average and updates the empty values with it.
The function is called this way:
select null_normalizer('titanik_test_final');
select null_normalizer('titanik_train_final');
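To make sure no empty values remain, a spot check on one of the originally gappy columns can be run (age is used here purely as an example):

-- should return 0 once null_normalizer has been applied
select count(*) as remaining_nulls
from titanik_train_final
where age is null;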

The resulting tables are quite large and sparse; here, for example, is the structure of titanik_test_final:
CREATE TABLE titanik_test_final (
    id bigint,
    pclass double precision,
    age double precision,
    sibsp double precision,
    parch double precision,
    fare double precision,
    cabin_cnt double precision,
    ticket_number double precision,
    cabin_people_cnt double precision,
    "cabin_typeF" smallint, "cabin_typeB" smallint, "cabin_typeG" smallint,
    "cabin_typeC" smallint, "cabin_typeT" smallint, "cabin_typeD" smallint,
    "cabin_typeE" smallint, "cabin_typeA" smallint,
    "ticket_typeSW/PP" smallint, "ticket_typeC" smallint, "ticket_typePC" smallint,
    "ticket_typeAQ/3." smallint, "ticket_typeSC/A.3" smallint, "ticket_typeS.OC" smallint,
    "ticket_typeS.O./PP" smallint, "ticket_typeSC/AH" smallint, "ticket_typeSOTON/O2" smallint,
    "ticket_typeC.A." smallint, "ticket_typeW/C" smallint, "ticket_typeS.C./A.4." smallint,
    "ticket_typeFa" smallint, "ticket_typeLP" smallint, "ticket_typeSCO/W" smallint,
    "ticket_typeF.C." smallint, "ticket_typeA.5." smallint, "ticket_typeSC/AH Basle" smallint,
    "ticket_typeSC/A4" smallint, "ticket_typeS.C./PARIS" smallint, "ticket_typeS.OP" smallint,
    "ticket_typeLINE" smallint, "ticket_typeSO/C" smallint, "ticket_typeP/PP" smallint,
    "ticket_typeAQ/4" smallint, "ticket_typeSC" smallint, "ticket_typeW.EP" smallint,
    "ticket_typeSOTON/OQ" smallint, "ticket_typeA/4" smallint, "ticket_typeSC/PARIS" smallint,
    "ticket_typeA. 2." smallint, "ticket_typeF.CC" smallint, "ticket_typeS.P." smallint,
    "ticket_typePP" smallint, "ticket_typeC.A./SOTON" smallint,
    "embarkedC" smallint, "embarkedQ" smallint, "embarkedS" smallint,
    sexfemale smallint, sexmale smallint
)

To export the data to text files, use:
COPY titanik_train_final to '/tmp/titanik_train_final.csv';
COPY titanik_test_final to '/tmp/titanik_test_final.csv';
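Note that COPY without options writes PostgreSQL's default tab-delimited text format despite the .csv extension. If real comma-separated output with a header row is needed, the format options can be specified (a sketch assuming PostgreSQL 9.0+ option syntax and write access to /tmp on the server):

COPY titanik_train_final TO '/tmp/titanik_train_final.csv' WITH (FORMAT csv, HEADER true);
COPY titanik_test_final TO '/tmp/titanik_test_final.csv' WITH (FORMAT csv, HEADER true);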

The data is now ready, and we can try to look for patterns.
To reduce the dimensionality of what is now a sparse table, an autoencoder or linear PCA can be used. Continued in the next part, where the plan is to apply an autoencoder and a decision forest and see what result this gives on the leaderboard.

UPD: Part Four


Source: https://habr.com/ru/post/165283/

