Kaggle and Linux: Digit Recognizer for beginning analysts

For those who are just getting to know the field of multidimensional data analysis, I want to share my experience of how to feel like a junior data analyst.

If you are not familiar with Kaggle.com (the site is in English), I recommend spending a couple of hours there to get a general feel for the resource.

For four years now, the site has hosted a competition for the best image recognizer. Anyone can take part. The competition was originally due to end on 12/31/16, but it has since been extended until 2019.
Habr already has an article that describes how to write a program and take part, but it is far from beginner-friendly: "How to start working in Kaggle: a guide for beginners in Data Science".

I decided to try my hand at the competition without writing the program myself.

My skills do not allow me to write a program that gives a good result in a short time, so I decided to borrow code from those who had already succeeded.

I will describe everything step by step.

1. Register on Kaggle.com.
2. Find the link to the Digit Recognizer competition there.
3. Go to the list of participants and see which of them have published their code. I liked the code by the user daryadedik. The problem is that a number of libraries the code needs are missing on Windows. I decided not to bother with that and to run it on Linux Ubuntu, where everything it needs is available.

Steps 4-12 describe how to create a Linux virtual machine and install everything needed for Python.

If you already have Linux, PyCharm, Anaconda, and the tensorflow library, skip to step 13.
4. Download a program for creating virtual machines. (I downloaded VMware.)
5. Download Linux (Ubuntu 16.04.1 LTS Desktop): http://ubuntu.ru/get
6. Install it through VMware (in VMware, create a new virtual machine and choose the downloaded Linux image; you can give the virtual machine 2-3 GB of RAM and 20 GB of disk space).
7. Follow the instructions and remember the PASSWORD of the Linux user.
8. If you now see the Linux login screen, you are on the right track so far. Enter the password and log in.
9. This is an important step. To run a Python program, we need two things: PyCharm (a development environment) and Anaconda (a set of libraries).

So as not to repeat what is already written elsewhere, follow the existing instructions to install PyCharm. For Anaconda, perform only item 1 of its instructions (wget ... ...).

10. After installation, type into the terminal the command that the installer printed after the message "You may wish to edit .......". In my case it was: export PATH=/home/adminv/anaconda3/bin:$PATH

(For Python 3.5)

11. Open a new terminal and enter the same command again:

export PATH=/home/adminv/anaconda3/bin:$PATH

12. In the same terminal, also enter:

conda install -c https://conda.anaconda.org/jjhelmus tensorflow

If the command completes without errors, you have done everything right.

We now have PyCharm, Anaconda, and the tensorflow library, so we can move on to the program and the competition.
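Before moving on, it can be worth checking that Python actually sees the installed libraries. The sketch below is only an illustration: the helper name missing_packages is made up, and the list of packages other than tensorflow is an assumption about what the borrowed code might import.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "tensorflow" is the library installed in the steps above.
# "numpy" and "pandas" are an assumption about what the borrowed code needs.
missing = missing_packages(["tensorflow", "numpy", "pandas"])
print("missing:", missing if missing else "nothing")
```

If the list is not empty, the PATH line from the earlier steps is the first thing to re-check, because without it the terminal may be using the system Python instead of the Anaconda one.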
13. Go to the Kaggle site and download the data files: the training sample and the competition (test) sample, 3 files in total.
14. Next, launch PyCharm from the terminal (simply type pycharm in the terminal) and create a new project, into which we put the code.

15. Put the 3 downloaded files into the project folder as well.
16. Run the program and, voila, we get a model trained on the training sample and a file with a column of recognition results (the submission).
17. Upload that single result file (the submission) at https://www.kaggle.com/c/digit-recognizer/submit and wait for the test result (it takes about 20 seconds). Now you are a participant too.
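For readers curious what the result file looks like inside, here is a minimal sketch of writing one. It is not the borrowed program's code. The function name write_submission is made up, and the ImageId,Label header with numbering from 1 is my recollection of the competition's sample submission file, so compare it with the sample file you downloaded.

```python
import csv
import io


def write_submission(predictions, out):
    """Write predicted digits as ImageId,Label rows; ImageId starts at 1."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])  # assumed header, see the sample file
    for image_id, label in enumerate(predictions, start=1):
        writer.writerow([image_id, int(label)])


# Made-up predictions for three pictures, written to memory instead of disk.
buffer = io.StringIO()
write_submission([2, 0, 9], buffer)
print(buffer.getvalue())
```

To write a real file, pass an object opened with open("submission.csv", "w", newline="") instead of the in-memory buffer.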

That is how I reached 307th place out of 1387 (at the time of writing). The accuracy of the algorithm was 0.989.

Reference Information:

All pictures are 28x28 pixels, so the test file is a table with 784 columns.
One row is one picture, written out pixel by pixel.
(The train file has 785 columns; its first column is the known answer.)
The accuracy of the algorithm depends on the parameters you set in the code (for example, the number of training iterations).
The algorithm ran for 40 minutes with 5400 training iterations and 4 GB of RAM allocated to the virtual machine.
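The row layout described above can be sketched in a few lines of Python. This is an illustration on a made-up row, not part of the borrowed program. The name parse_train_row is hypothetical, and reading the 784 values row by row (28 values per picture row) is an assumption about the pixel order.

```python
import csv
import io

SIDE = 28  # pictures are 28x28, so there are 784 pixel columns


def parse_train_row(row):
    """Split one train row into (label, 28x28 grid of pixel values)."""
    if len(row) != 1 + SIDE * SIDE:
        raise ValueError("expected 785 columns, got %d" % len(row))
    label = int(row[0])  # first column: the known answer
    pixels = [int(value) for value in row[1:]]
    # Assumption: pixels are stored row by row, 28 values per picture row.
    grid = [pixels[i * SIDE:(i + 1) * SIDE] for i in range(SIDE)]
    return label, grid


# A made-up row: label 7, all pixels 0 except one set to 255.
fake_pixels = ["0"] * (SIDE * SIDE)
fake_pixels[SIDE + 3] = "255"  # picture row 1, column 3
fake_line = ",".join(["7"] + fake_pixels)

label, grid = parse_train_row(next(csv.reader(io.StringIO(fake_line))))
print(label, len(grid), len(grid[0]), grid[1][3])
```

When reading the real files, remember that the first line of each CSV is a header with column names and has to be skipped before parsing.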

Analytical experiment

The initial training sample of 42,000 images was split into 33,600 and 8,400 images (80% and 20%) so that the algorithm could be tested on data with known answers: 33,600 pictures were used for training and 8,400 for verification. The algorithm was trained for 5400 iterations. The run took 33 minutes with 4 GB of RAM and a 2.33 GHz processor. Let us turn to the results.
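A split like this can be sketched as follows. The text does not say how the rows were chosen, so shuffling with a fixed seed is an assumption, and the function name split_indices is made up for the example.

```python
import random


def split_indices(n_rows, train_share=0.8, seed=0):
    """Shuffle row indices and cut them into a training and a verification part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    cut = int(round(n_rows * train_share))
    return indices[:cut], indices[cut:]


train_idx, check_idx = split_indices(42000)
print(len(train_idx), len(check_idx))  # 33600 8400
```

The two index lists do not overlap, so no picture used for training ends up in the verification part.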

Table 1

Table 1 shows the result of classifying the pictures. With these data in hand, we can carry out an analysis. To understand the theory, recall Euler circles (Fig. 1): they show the parts into which we divide the result of the algorithm (the resulting sample).

Fig. 1

Table 2

In accordance with Figure 2, we divide our result and enter the data in Table 2.
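The division itself can be sketched in code. I am assuming here that the parts are the usual four groups for one digit against all the others (true positive, false positive, false negative, true negative); the text does not name them, so treat this as one common reading rather than the exact scheme behind Table 2. The labels below are made up.

```python
def one_vs_rest_counts(true_labels, predicted_labels, digit):
    """Count the four outcome groups for one digit against all the others."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == digit and truth == digit:
            counts["tp"] += 1  # predicted the digit, and it was the digit
        elif guess == digit:
            counts["fp"] += 1  # predicted the digit, but it was another one
        elif truth == digit:
            counts["fn"] += 1  # missed the digit
        else:
            counts["tn"] += 1  # correctly not the digit
    return counts


# Made-up labels for illustration only, not the experiment's data.
truth = [3, 3, 5, 8, 3, 5]
guess = [3, 5, 5, 3, 3, 8]
counts_for_3 = one_vs_rest_counts(truth, guess, digit=3)
print(counts_for_3)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}
```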

Now we have enough information to talk about the quality of the algorithm. In Table 3, I have tried to collect the main coefficients by which the program can be evaluated.
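As an illustration of such coefficients, the sketch below computes four widely used ones from the four outcome counts. These are common choices and not necessarily the exact list in Table 3. The function name quality_metrics and the counts are made up for the example.

```python
def quality_metrics(tp, fp, fn, tn):
    """Compute common quality coefficients from the four outcome counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a group is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up counts for illustration only, not the experiment's results.
metrics = quality_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Accuracy alone can look good even when one digit is recognized poorly, which is why precision and recall per digit are usually reported next to it.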