Hey. Many people are surely interested in machine learning methods and in solving problems that conventional approaches cannot handle. Recently I was fortunate enough to attend the Data Mining course organized as part of the GameChangers program. The first homework was to make a Kaggle submission: solve the Digit Recognizer problem.
Briefly about the data
The training data is a csv table: the first column contains the numeric value of the handwritten digit, and the remaining 784 columns contain the grayscale intensity values of the individual pixels (the images are black and white, 28×28 pixels). The test data is the same kind of table, but without the answers.
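For reference, a minimal sketch of loading this data with numpy; the file names follow the Kaggle layout (train.csv, test.csv), and the variable names are my own:

from numpy import loadtxt

# Skip the header row; the first column of train.csv is the digit label.
train = loadtxt('train.csv', delimiter=',', skiprows=1)
labels = train[:, 0]     # digit labels 0..9
pixels = train[:, 1:]    # 784 grayscale intensities per image
test = loadtxt('test.csv', delimiter=',', skiprows=1)  # no label column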
Methods
- Random forest
- kNN method
- SVM with a quadratic kernel
- SVM with a cubic kernel
- Ensemble of methods
All calculations were performed in Python using the scikit-learn library; links to the distribution and tutorials are given below.
I deliberately do not discuss here the optimizations that could and should have been applied to the algorithms used, otherwise this post about a homework assignment would grow into a full-length paper.
Random forest
The idea of the algorithm
Random Forest uses an ensemble of decision trees. A single decision tree by itself does not provide sufficient accuracy for this task, but it is very fast. The RF algorithm trains k decision trees, each on a randomly selected subset of the features (in our case the features are the brightness values of individual pixels), after which each test sample is classified by a vote among the trained ensemble. The algorithm is built on the idea that aggregating the answers of a large number of different weak learners into a single answer is likely to give a better result than a single powerful algorithm.
Implementation
from numpy import savetxt, loadtxt
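A minimal sketch of how the training-and-prediction step might look with scikit-learn; the classifier parameters, file names and variable names here are my own assumptions:

from numpy import savetxt, loadtxt
from sklearn.ensemble import RandomForestClassifier

train = loadtxt('train.csv', delimiter=',', skiprows=1)
test = loadtxt('test.csv', delimiter=',', skiprows=1)
labels, pixels = train[:, 0], train[:, 1:]

# 1000 trees gave the best result of the values tried; n_jobs=-1 uses all cores.
forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
forest.fit(pixels, labels)
savetxt('rf_submission.csv', forest.predict(test), fmt='%d')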
Summary
On four cores, training plus prediction took an hour and a half; this is the slowest of the algorithms considered. It is, however, the second most accurate: the Kaggle submission scored over 96% accuracy.
There is also the problem of choosing the optimal number of trees: for this task several runs were made with different values of n_estimators, and 1000 gave the best result.
kNN method
The idea of the algorithm
One of the fastest classification algorithms. In short: all the training samples are placed in a metric space whose dimension equals the number of features. Then, to classify a new vector, the k training vectors closest to it are found, and the class is decided by a vote among them.
The main question when using this algorithm is the choice of k. With k = 1 the algorithm loses stability and starts to misbehave in the presence of noise; with k close to the size of the training set the vote is dominated by the most common class and the algorithm degenerates.
I ran the algorithm with different values of k and finally settled on the initially recommended k = 10.
Implementation
The code is largely the same as in the previous section; in addition, it reuses the dumps saved while running Random Forest.
from sklearn import neighbors
from sklearn.externals import joblib
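A minimal sketch of how the kNN step might look; the dump file names are my own assumptions, and k is set to 10 as chosen above:

from sklearn import neighbors
from sklearn.externals import joblib
from numpy import savetxt

# Reuse the arrays dumped during the Random Forest run (file names assumed).
labels = joblib.load('labels.pkl')
pixels = joblib.load('pixels.pkl')
test = joblib.load('test.pkl')

knn = neighbors.KNeighborsClassifier(n_neighbors=10)
knn.fit(pixels, labels)
savetxt('knn_submission.csv', knn.predict(test), fmt='%d')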
Summary
kNN also showed accuracy above 96%, with a shorter running time.
SVM
SVM (Support Vector Machine) is one of the most versatile classification methods, notable for its speed and high reliability. Looking ahead: it produced the highest accuracy of all the algorithms used, 97.7%.
The idea of the algorithm
SVM classifies vectors placed in a multidimensional space (similar to the one used in kNN) by separating them with a hyperplane of dimension n − 1, where n is the dimension of the original space.
Kernels
One of the key hyperparameters of the SVM method is the kernel. The scikit-learn library supports all the commonly used kernels: linear, radial (RBF) and polynomial. Here are the test results on small samples of labelled data:
- Accuracy with the linear kernel: 0.9131
- Accuracy with the radial (RBF) kernel: 0.1265
- Accuracy with the quadratic polynomial kernel: 0.9635
- Accuracy with the cubic polynomial kernel: 0.9595
After these tests I dropped the linear and radial kernels and ran the algorithm on the largest possible data set with the quadratic and cubic kernels. The quadratic kernel gave higher accuracy, and it was used for the final solution.
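A minimal sketch of how such a kernel comparison might be run; the subset sizes, dump file names and variable names here are my own assumptions:

from sklearn import svm
from sklearn.externals import joblib

labels = joblib.load('labels.pkl')   # dumps from the Random Forest step (names assumed)
pixels = joblib.load('pixels.pkl')

# Small labelled subsets keep the comparison quick; the split sizes are arbitrary.
X_tr, y_tr = pixels[:5000], labels[:5000]
X_val, y_val = pixels[5000:7000], labels[5000:7000]

for name, clf in [('linear', svm.SVC(kernel='linear')),
                  ('rbf', svm.SVC(kernel='rbf')),
                  ('poly, degree 2', svm.SVC(kernel='poly', degree=2)),
                  ('poly, degree 3', svm.SVC(kernel='poly', degree=3))]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_val, y_val))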
Implementation
from sklearn.externals import joblib
from numpy import savetxt
from sklearn import svm
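A minimal sketch of the final step with the quadratic polynomial kernel; the normalization to [0, 1] and the file names are my own assumptions:

from sklearn.externals import joblib
from numpy import savetxt
from sklearn import svm

labels = joblib.load('labels.pkl')
pixels = joblib.load('pixels.pkl')
test = joblib.load('test.pkl')

# Normalize pixel intensities to [0, 1] and train with the quadratic polynomial kernel.
clf = svm.SVC(kernel='poly', degree=2)
clf.fit(pixels / 255.0, labels)
savetxt('svm_submission.csv', clf.predict(test / 255.0), fmt='%d')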
Summary
It works quite fast: with any kernel it outpaces Random Forest, and its accuracy on normalized data is excellent.
Ensemble of methods
At the course we were given an additional task: to see how the accuracy of the solution changes when several algorithms are used at the same time.
I wanted to cluster the training sample into superpixels and see what would happen, but I did not meet the deadline, so I submitted a simpler option: the answers produced by the RF, SVM and kNN algorithms were read in, and a straightforward vote was taken. If opinions were divided, preference was given to SVM as the slightly more accurate method. However, this ensemble gave a solution half a percent worse than pure SVM, so I did not manage to improve the result.
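A minimal sketch of that vote; the submission file names are my own assumptions:

from numpy import loadtxt, savetxt

# Predictions saved by the three methods (file names are illustrative).
rf_pred = loadtxt('rf_submission.csv', dtype=int)
knn_pred = loadtxt('knn_submission.csv', dtype=int)
svm_pred = loadtxt('svm_submission.csv', dtype=int)

final = []
for a, b, c in zip(rf_pred, knn_pred, svm_pred):
    if a == b or a == c:
        final.append(a)    # RF agrees with at least one other method
    elif b == c:
        final.append(b)    # kNN and SVM agree
    else:
        final.append(c)    # all three differ: fall back to the SVM answer
savetxt('ensemble_submission.csv', final, fmt='%d')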
Links
- Library used
- Wiki with a detailed description of the algorithms
- Documentation for the class implementing Random Forest
- SVM description with a small example
- More information on SVM
- All about kNN

That's all. In the next post we will discuss methods for optimizing the parameters of the algorithms themselves.