In the previous parts of this series we looked at preprocessing data with a DBMS, which is useful when the volume of information being processed is very large. In this article I will continue describing tools for intelligent processing of large amounts of data, focusing on Python and Theano.
Let us look at the tasks we are going to solve and the tools we will use for them. Reading the previous parts [1, 2, 3] is desirable but not necessary. The initial data obtained as a result of the previous parts of the series can be taken here.
Tasks
If the data contains hidden relationships, correlations between features, or structural dependencies, the problem of reducing the dimensionality of the input data arises. It shows up especially clearly when a large amount of sparse data is being processed. To compare dimensionality reduction methods, we will consider neural network compression (an autoencoder) and principal component analysis (PCA).
The classic way to solve this problem is principal component analysis (PCA). It is well described on Wikipedia [4]; there is also a link [5] to a site with examples in Excel and a detailed, fairly clear explanation. Here are more links to articles on Habr [6, 7]. As a rule, the results of reducing the dimensionality of input data are compared first of all with the results of this method.
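For reference, a minimal PCA baseline of the kind used in such comparisons can be built with scikit-learn; this is only a sketch, and the array X and the number of components here are illustrative assumptions, not part of the original pipeline:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)                  # illustrative data: 100 samples, 50 features
pca = PCA(n_components=10)                   # keep the 10 strongest principal components
X_reduced = pca.fit_transform(X)             # project the data onto those components
print(pca.explained_variance_ratio_.sum())   # share of variance retained after reduction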
The article [8] describes what an autoencoder is and how it works, so the main material will be devoted to implementing and applying this approach.
Yoshua Bengio wrote a good and clear review of the use of neural network compression; if anyone needs a bibliography and references to work in this direction to compare their results with existing ones, see [9], section 4.6, pp. 48-50.
As a toolkit, I decided to use Python and Theano [10]. I found this package after taking the Neural Networks for Machine Learning course and studying the links provided by the lecturer, G. Hinton. This package is used to implement deep learning neural networks. Such networks, as described in the lectures and in the articles [11, 12], formed the basis of the voice recognition system used by Google on Android.
In the opinion of some scientists, this approach to building neural networks is quite promising and is already showing good results, which is why I became interested in using it to solve problems on Kaggle.
Also, the site deeplearning.net has many examples of building machine learning systems with this particular package.
Tools
The developers' documentation gives the following definition:
Theano is a Python library and optimizing compiler that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently.
Library features:
- tight integration with NumPy;
- transparent use of the GPU;
- efficient symbolic differentiation of variables;
- speed and stability optimizations;
- dynamic code generation in C;
- extensive unit-testing and self-verification features.
Theano has been used in computationally intensive research since 2007.
In fact, programming with Theano is not programming in the full sense of the word, because what you write is a Python program that builds an expression for Theano. On the other hand, it is programming: we declare variables, create an expression that says what to do with those variables, and compile these expressions into functions that are used in calculations.
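A minimal sketch of this workflow, declaring symbolic variables, building an expression over them, and compiling it into a callable function; the variable names here are mine, chosen for illustration:

import theano
import theano.tensor as T

# declare symbolic variables
a = T.dscalar('a')
b = T.dscalar('b')

# build an expression over them
expr = a + b ** 2

# compile the expression into a callable function
f = theano.function([a, b], expr)

print(f(1.0, 2.0))  # prints 5.0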
In short, here is a list of what exactly Theano can do that NumPy cannot:
- execution speed optimization: Theano can use g++ or nvcc to compile parts of your expression into CPU or GPU instructions that run much faster than pure Python;
- symbolic differentiation: Theano can automatically build the expressions needed to compute gradients (a small illustration follows this list);
- stability optimizations: Theano can recognize some numerically unstable expressions and compute them with more reliable algorithms.
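As an illustration of the automatic gradient construction mentioned above, here is a short sketch using T.grad; the function x**2 is my own toy example:

import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2              # expression to differentiate
gy = T.grad(y, x)       # Theano builds the gradient expression, here 2*x
f = theano.function([x], gy)

print(f(3.0))  # prints 6.0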
The package closest to Theano is SymPy. SymPy can be compared to Mathematica, while NumPy is more similar to MATLAB. Theano is a hybrid whose developers have tried to take the best sides of both packages.
Most likely, later in this series a separate article will be written about how to configure Theano, enable CUDA GPU support for training neural networks, and solve problems that arise during installation.
In the meantime, on my Linux (Slackware 13.37) system I have installed python-2.6, setuptools-0.6c9-py2.6, g++, python-dev, NumPy, SciPy, and BLAS. A detailed installation manual for common operating systems is available in English: [12].
Before we start reducing the dimensionality of the data, let us implement a simple example of using Theano: a function that solves our problem of splitting a group into two parts.
As a method for splitting the data into two classes, we use logistic regression [13]. The example code is based on [14], with some changes:
import numpy
import theano
import theano.tensor as T
import csv
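The body of the example is a sketch reconstructed from the classic Theano logistic regression example that [14] is based on, and it is what the explanation below refers to. In the original the matrix D is filled from the Titanic CSV via the csv module; here random stand-in data is used:

rng = numpy.random

N = 400                      # number of training examples
feats = 10                   # number of features per example
# random stand-in data; the original fills D from the Titanic CSV
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# declare symbolic inputs and shared parameters
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")

# build the expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))             # probability that the target is 1
prediction = p_1 > 0.5                              # threshold: yes/no answer
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)   # cross-entropy error
cost = xent.mean() + 0.01 * (w ** 2).sum()          # mean error plus L2 regularization
gw, gb = T.grad(cost, [w, b])                       # gradients of the cost

# compile the expressions into callable functions
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)))
predict = theano.function(inputs=[x], outputs=prediction)

# run the training loop
for i in range(training_steps):
    pred, err = train(D[0], D[1])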
The most interesting lines in this code are the two calls to theano.function at the end. Here is what happens (let's analyze it from the end): we have two "expressions", train and predict. They are "compiled" (it is not just compilation, but at the start it is easier to think of it that way) with the help of theano.function into a form that can later be called and executed.
Our input parameter for prediction is x, and the output is the prediction expression, which is equal to p_1 > 0.5, that is, a threshold that returns a yes/no answer as a value of 0 or 1. In turn, the expression p_1 contains the information about what exactly needs to be done with the variable x, namely:

p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))

where x, w, b are variables; w and b are determined using the expression train.
For train, the inputs are x and y, and the outputs are prediction and xent. At the same time we update (search for optimal values of) w and b using the formulas

w - 0.1*gw, b - 0.1*gb

where gw and gb are the gradients connected to the error xent through the expression cost.
The error itself is calculated by the following formula:

xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)
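In the listing above, this error is tied to the gradients through the averaged cost; the small L2 regularization term is an assumption carried over from the tutorial example:

cost = xent.mean() + 0.01 * (w ** 2).sum()  # average the per-example error, add an L2 penalty
gw, gb = T.grad(cost, [w, b])               # gradients used in the update formulas above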
When we "compile" the expressions predict and train, Theano takes all the necessary expressions, generates C code for the CPU/GPU, and executes it. This gives a significant performance gain without depriving us of the convenience of the Python environment.
Running this code gives a prediction quality of 0.765555 according to the Kaggle evaluation rules for the Titanic competition.
In the next articles of the series, we will try to reduce the dimensionality of the problem using different algorithms and compare the results.