Hey. Many people are surely interested in machine learning methods and in solving problems that conventional approaches cannot handle. Recently I was fortunate enough to attend the Data Mining course organized as part of the GameChangers program. The first homework was to make a Kaggle submission: solve the Digit Recognizer problem.
Briefly about the data
The training data is a csv table: the first column contains the numeric value of the handwritten digit, and the remaining 784 columns contain the grayscale intensity values of the individual pixels (the images are black and white, 28×28 pixels). The test data is the same kind of table, but without the answers.
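For reference, a minimal sketch of loading this data with numpy; the file names follow the Kaggle layout (train.csv, test.csv), and the variable names are my own:

from numpy import loadtxt

# Skip the header row; the first column of train.csv is the digit label.
train = loadtxt('train.csv', delimiter=',', skiprows=1)
labels = train[:, 0]     # digit labels 0..9
pixels = train[:, 1:]    # 784 grayscale intensities per image
test = loadtxt('test.csv', delimiter=',', skiprows=1)  # no label column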
Methods
- Random forest
- kNN method
- SVM with a quadratic kernel
- SVM with a cubic kernel
- Ensemble of methods
All calculations were performed in Python using the scikit-learn library; links to the distribution and tutorials are given below.
I deliberately do not discuss here the optimizations that could and should have been applied to the algorithms used, otherwise this post about a homework assignment would grow into a full-length paper.
Random forest
The idea of the algorithm
Random Forest uses an ensemble of decision trees. A single decision tree by itself does not provide sufficient accuracy for this task, but it is very fast. The RF algorithm trains k decision trees, each on a randomly selected subset of the features (in our case the features are the brightness values of individual pixels), after which each test sample is classified by a vote among the trained ensemble. The algorithm is built on the idea that aggregating the answers of a large number of different weak learners into a single answer is likely to give a better result than a single powerful algorithm.
Implementation
from numpy import savetxt, loadtxt
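A minimal sketch of how the training-and-prediction step might look with scikit-learn; the classifier parameters, file names and variable names here are my own assumptions:

from numpy import savetxt, loadtxt
from sklearn.ensemble import RandomForestClassifier

train = loadtxt('train.csv', delimiter=',', skiprows=1)
test = loadtxt('test.csv', delimiter=',', skiprows=1)
labels, pixels = train[:, 0], train[:, 1:]

# 1000 trees gave the best result of the values tried; n_jobs=-1 uses all cores.
forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
forest.fit(pixels, labels)
savetxt('rf_submission.csv', forest.predict(test), fmt='%d')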
Summary
On four cores, training plus prediction took an hour and a half; this is the slowest of the algorithms considered. It is, however, the second most accurate: the Kaggle submission scored over 96% accuracy.
There is also the problem of choosing the optimal number of trees: for this task several runs were made with different values of n_estimators, and 1000 gave the best result.
kNN method
The idea of the algorithm
One of the fastest classification algorithms. In short: all the training samples are placed in a metric space whose dimension equals the number of features. Then, to classify a new vector, the k training vectors closest to it are found, and the class is decided by a vote among them.
The main question when using this algorithm is the choice of k. With k = 1 the algorithm loses stability and starts to misbehave in the presence of noise; with k close to the size of the training set the vote is dominated by the most common class and the algorithm degenerates.
I ran the algorithm with different values of k and finally settled on the initially recommended k = 10.
Implementation
The code is largely the same as in the previous section; in addition, it reuses the dumps saved while running Random Forest.
from sklearn import neighbors
from sklearn.externals import joblib
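A minimal sketch of how the kNN step might look; the dump file names are my own assumptions, and k is set to 10 as chosen above:

from sklearn import neighbors
from sklearn.externals import joblib
from numpy import savetxt

# Reuse the arrays dumped during the Random Forest run (file names assumed).
labels = joblib.load('labels.pkl')
pixels = joblib.load('pixels.pkl')
test = joblib.load('test.pkl')

knn = neighbors.KNeighborsClassifier(n_neighbors=10)
knn.fit(pixels, labels)
savetxt('knn_submission.csv', knn.predict(test), fmt='%d')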
Summary
kNN also showed accuracy above 96%, with a shorter running time.
SVM
SVM (Support Vector Machine) is one of the most versatile classification methods, notable for its speed and high reliability. Looking ahead: it produced the highest accuracy of all the algorithms used, 97.7%.
The idea of the algorithm
SVM classifies vectors placed in a multidimensional space (similar to the one used in kNN) by separating them with a hyperplane of dimension n − 1, where n is the dimension of the original space.
Kernels
One of the key hyperparameters of the SVM method is the kernel. The scikit-learn library supports all the commonly used kernels: linear, radial (RBF) and polynomial. Here are the test results on small samples of labelled data:
- Accuracy with the linear kernel: 0.9131
- Accuracy with the radial (RBF) kernel: 0.1265
- Accuracy with the quadratic polynomial kernel: 0.9635
- Accuracy with the cubic polynomial kernel: 0.9595
After these tests I dropped the linear and radial kernels and ran the algorithm on the largest possible data set with the quadratic and cubic kernels. The quadratic kernel gave higher accuracy, and it was used for the final solution.
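A minimal sketch of how such a kernel comparison might be run; the subset sizes, dump file names and variable names here are my own assumptions:

from sklearn import svm
from sklearn.externals import joblib

labels = joblib.load('labels.pkl')   # dumps from the Random Forest step (names assumed)
pixels = joblib.load('pixels.pkl')

# Small labelled subsets keep the comparison quick; the split sizes are arbitrary.
X_tr, y_tr = pixels[:5000], labels[:5000]
X_val, y_val = pixels[5000:7000], labels[5000:7000]

for name, clf in [('linear', svm.SVC(kernel='linear')),
                  ('rbf', svm.SVC(kernel='rbf')),
                  ('poly, degree 2', svm.SVC(kernel='poly', degree=2)),
                  ('poly, degree 3', svm.SVC(kernel='poly', degree=3))]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_val, y_val))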
Implementation
from sklearn.externals import joblib
from numpy import savetxt
from sklearn import svm
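A minimal sketch of the final step with the quadratic polynomial kernel; the normalization to [0, 1] and the file names are my own assumptions:

from sklearn.externals import joblib
from numpy import savetxt
from sklearn import svm

labels = joblib.load('labels.pkl')
pixels = joblib.load('pixels.pkl')
test = joblib.load('test.pkl')

# Normalize pixel intensities to [0, 1] and train with the quadratic polynomial kernel.
clf = svm.SVC(kernel='poly', degree=2)
clf.fit(pixels / 255.0, labels)
savetxt('svm_submission.csv', clf.predict(test / 255.0), fmt='%d')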
Summary
It works quite fast: with any kernel it outpaces Random Forest, and its accuracy on normalized data is excellent.
Ensemble of methods
At the course we were given an additional task: to see how the accuracy of the solution changes when several algorithms are used at the same time.
I wanted to cluster the training sample into superpixels and see what would happen, but I did not meet the deadline, so I submitted a simpler option: the answers produced by the RF, SVM and kNN algorithms were read in, and a straightforward vote was taken. If opinions were divided, preference was given to SVM as the slightly more accurate method. However, this ensemble gave a solution half a percent worse than pure SVM, so I did not manage to improve the result.
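A minimal sketch of that vote; the submission file names are my own assumptions:

from numpy import loadtxt, savetxt

# Predictions saved by the three methods (file names are illustrative).
rf_pred = loadtxt('rf_submission.csv', dtype=int)
knn_pred = loadtxt('knn_submission.csv', dtype=int)
svm_pred = loadtxt('svm_submission.csv', dtype=int)

final = []
for a, b, c in zip(rf_pred, knn_pred, svm_pred):
    if a == b or a == c:
        final.append(a)    # RF agrees with at least one other method
    elif b == c:
        final.append(b)    # kNN and SVM agree
    else:
        final.append(c)    # all three differ: fall back to the SVM answer
savetxt('ensemble_submission.csv', final, fmt='%d')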
Links
- Library used
- Wiki with a detailed description of the algorithms
- Documentation for the class implementing Random Forest
- SVM description with a small example
- More information on SVM
- All about kNN

That's all. In the next post we will discuss methods for optimizing the parameters of the algorithms themselves.