Data mining: Toolkit - Theano

In the previous materials of this cycle, we considered methods of preprocessing data using a DBMS. This can be useful for very large volumes of information processed. In this article I will continue to describe the tools for intelligent processing of large amounts of data, focusing on the use of Python and Theano.

Consider the tasks that we will solve, and what tools for this we use. Read the previous parts [ 1 , 2 , 3 ] is desirable, but not necessary. The initial data that was obtained as a result of the implementation of the previous parts of the cycle of articles can be taken here

Tasks

If among the data there are hidden relationships, correlation between features, structural relationships, then the problem arises of reducing the dimension of the input data. This problem manifests itself especially brightly in a situation where a large number of sparse data is being processed.
To compare data dimension reduction methods, we are going to consider the neural network compression method (Autoencoder) and the principal component method (PCA).
The classic way to solve this problem is the principal component analysis method (PCA). Wikipedia is well described [ 4 ], there is also a link [ 5 ] to the site with examples in Excel, and a detailed and fairly clear description. Here are more links to articles on Habré [ 6 , 7 ].
As a rule, the results of reducing the dimension of input data are primarily compared with the results of applying this method.
The article [ 8 ] describes what an auto-encoder is and how it works; therefore, the main material will be devoted to the implementation and application of this approach.
Banguio and Joshua wrote a good and clear review of the use of neural network compression, if suddenly someone needs a bibliography and references to works in this direction to compare their results with the existing ones, in [ 9 ], section 4.6, pp. 48-50.
As a toolkit, I decided to use Python and Theano [ 10 ]. I found this package after listening to the Neural Networks for Machine Learning courses, studying the links provided by the teacher J. Hinton. With the help of this package neural networks of “deep learning” (deep learning neural networks) are implemented. Such networks, as described in the lectures and the article [ 11 , 12 ], formed the basis of the voice recognition system used by Google on Android.
This approach to the construction of neural networks in the opinion of some scientists is quite promising, and already shows good results. Because it became interesting to me to use it for solving problems on Kaggle.
Also, on the site deeplearning.net there are a lot of examples of building machine learning systems using this particular package.

Instruments

In the documentation of developers, given the definition:
Theano is a Python library and optimizing compiler that allows you to define, optimize, and calculate mathematical expressions efficiently using multidimensional arrays.
Library features:

tight integration with NumPy;
transparent use of the GPU;
effective differentiation of variables;
fast and stable optimization;
dynamic code generation in C;
advanced features of unit testing and self-testing;

Theano has been used in high-intensity computational research since 2007.
In fact, programming with Theano is not programming in the full sense of the word, since a Python program is written that creates an expression for Theano.
On the other hand, this is programming, since we declare variables, create an expression that says what to do with these variables, and compile these expressions into functions that are used in the calculation.
In short, here is a list of what exactly Theano can do, unlike NumPy.

execution speed optimization: Theano can use g ++ or nvcc to compile parts of your expression in a GPU or CPU instruction that runs much faster than pure Python;
variable differentiation: Theano can automatically build expressions to calculate the gradient;
stability of optimization: Theano can recognize some numerically inaccurately calculated expressions and calculate them using more reliable algorithms

The package closest to Theano is sympy. The sympy package can be compared to Mathematica, while the NumPy is more similar to the MATLAB package. Theano is a hybrid whose developers have tried to take the best sides of both these packages.
Most likely, in the continuation of the cycle, a separate article will be written on how to configure theano, connect the use of the CUDA GPU to the training of neural networks and solve problems arising during installation.
In the meantime, on my Linux operating system Slackware 13.37, python-2.6, setuptools-0.6c9-py2.6, g ++, python-dev, NumPy, SciPy, BLAS is installed. Detailed installation manual for common OS in English: [ 12 ].
Before we begin to reduce the data dimension, we will try to implement a simple example of using Theano to write a function that will allow us to solve our problem of splitting a group into two parts.
As a method for splitting data into two classes, we use logistic regression [ 13 ].
The example code is based on [ 14 ] with some changes:

import numpy import theano import theano.tensor as T import csv as csv #   csv  csv_file_object = csv.reader(open('./titanik_train_final.csv', 'rb'), delimiter='\t') data=[] # - for row in csv_file_object: #   data.append(row) #    data data = numpy.array(data) #      numpy.array data = data.astype('float') #    float Y = data[:,1] #       Y X = data[:,2:] #           X #   ,   Y csv_file_object = csv.reader(open('./titanik_test_final.csv', 'rb'), delimiter='\t') data_test=[] for row in csv_file_object: data_test.append(row) Tx = numpy.array(data_test) Tx = Tx.astype('float') Tx = Tx[:,1:] #      rng = numpy.random #  N = 891 #  feats = 56 #   training_steps = 10000 #    Theano x = T.matrix("x") y = T.vector("y") w = theano.shared(rng.randn(feats), name="w") b = theano.shared(0., name="b") #  «» Theano p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) #  ,    1 prediction = p_1 > 0.5 #    xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) #      cost = xent.mean() + 0.01 * (w ** 2).sum() #   gw,gb = T.grad(cost, [w, b]) #    #  «» Theano train = theano.function( inputs=[x,y], outputs=[prediction, xent], updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb))) predict = theano.function(inputs=[x], outputs=prediction) #   for i in range(training_steps): pred, err = train(X, Y) # P = predict(Tx) #   numpy.savetxt('./autoencoder.csv',P,'%i')

The most interesting lines in this code are:

 #  «» Theano p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) #  ,    1 prediction = p_1 > 0.5 #    xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) #      cost = xent.mean() + 0.01 * (w ** 2).sum() #   gw,gb = T.grad(cost, [w, b]) #    #  «» Theano train = theano.function( inputs=[x,y], outputs=[prediction, xent], updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb))) predict = theano.function(inputs=[x], outputs=prediction)

Here the following happens (we analyze it from the end! ): We have two “expressions” train and predict . They are “compiled” (this is not only a compilation, but at the beginning of the study it will be better to simplify) with the help of theano.function into a form that can be later called and executed.
Our input parameters for prediction are x, and the output is the prediction expression, which is equal to p_1> 0.5 — that is, the threshold that returns a yes / no value, that is, a value of 0 or 1. In turn, the expression p_1 contains information about what exactly needs to be done with the variable x , namely:

 p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))

where x, w, b are variables, while w and b we define using the expression train .
For train, the input will be x, y , and the output prediction and xent . At the same time, we will update (search for optimal values) for w and b , updating them using the formulas

 w-0.1*gw, b-0.1*gb

where gw and gb are gradients associated with the error xent through the expression cost .
And the error calculated by the formula is obtained from the following expression:

 xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1)

And when we “compile” the expressions predict and train, Theano takes all the necessary expressions, creates C code for the CPU / GPU for them and executes them accordingly. This gives a significant increase in performance, while not depriving us of the convenience of using the Python environment.
The result of the implementation of this code will be the quality of the forecast, equal to 0.765555 about the Kaggle evaluation rules for the Titanik competition.
In the next articles of the cycle, we will try to reduce the dimension of the problem using different algorithms, and compare the results.

Source: https://habr.com/ru/post/173819/

All Articles

Data mining: Toolkit - Theano

Tasks

Instruments

More articles: