
Introduction to machine learning using Python and Scikit-Learn

Hi, Habr!



My name is Alexander. I work on machine learning and web graph analysis (mostly theoretical), as well as on the development of Big Data products at one of the Big Three mobile operators. This is my first post, so please don't judge it too harshly!
Recently, people who want to learn how to develop effective algorithms and take part in machine learning competitions have been coming to me with the question: "Where do I start?" Some time ago, I led the development of Big Data tools for analyzing media and social networks in one of the institutions of the Government of the Russian Federation, and I still have some of the material my team used for study, which I can share. It is assumed that the reader has a good knowledge of mathematics and machine learning (my team consisted mainly of graduates of the Moscow Institute of Physics and Technology and students of the School of Data Analysis).

In essence, that material was an introduction to Data Science, a field that has become quite popular lately. Machine learning competitions are held more and more often (for example, on Kaggle and TunedIT), frequently with considerable prize budgets. The purpose of this article is to give the reader a quick introduction to machine learning tools so that they can start participating in competitions as soon as possible.

The most common Data Science tools today are R and Python. Each has its pros and cons, but lately Python has been winning on all fronts (this is solely the opinion of the author, who uses both). This happened after the appearance of the well-documented Scikit-Learn library, which implements a large number of machine learning algorithms.

Let us note right away that this article focuses on machine learning algorithms. Exploratory data analysis is usually best done with the Pandas package, which the reader can learn on their own. So we will concentrate on implementation and, for definiteness, assume that the input is an object-feature matrix stored in a file with the *.csv extension.
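
For readers who would like a starting point with Pandas, here is a minimal sketch of reading and inspecting such a file; the file name data.csv is only a placeholder and is not part of the original article:

 import pandas as pd

 # read the object-feature matrix from a CSV file (the file name is a placeholder)
 df = pd.read_csv("data.csv")

 # a quick look at the data: first rows, column types, basic statistics
 print(df.head())
 print(df.info())
 print(df.describe())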

Data loading


First of all, the data must be loaded into memory so that we can work with it. The Scikit-Learn library uses NumPy arrays in its implementation, so we will load *.csv files with NumPy. Let's download one of the datasets from the UCI Machine Learning Repository:

 import numpy as np
 from urllib.request import urlopen  # in Python 2 this was urllib.urlopen

 # url with dataset
 url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
 # download the file
 raw_data = urlopen(url)
 # load the CSV file as a numpy matrix
 dataset = np.loadtxt(raw_data, delimiter=",")
 # separate the data from the target attribute
 X = dataset[:, 0:7]
 y = dataset[:, 8]

In all further examples we will work with this dataset, namely with the object-feature matrix X and the values of the target variable y.

Data normalization


It is well known that most gradient-based methods (on which, in essence, almost all machine learning algorithms rely) are highly sensitive to the scaling of the data. Therefore, before running an algorithm, the data are most often either normalized or standardized. Normalization rescales each numerical feature so that it lies in the range from 0 to 1. Standardization is a preprocessing step after which each feature has a mean of 0 and a variance of 1. Scikit-Learn already provides ready-made functions for this:

 from sklearn import preprocessing

 # normalize the data attributes
 normalized_X = preprocessing.normalize(X)
 # standardize the data attributes
 standardized_X = preprocessing.scale(X)
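
Note that preprocessing.normalize rescales each sample (row) to unit norm rather than each feature to the range from 0 to 1; if the feature-wise [0, 1] scaling described above is what you need, MinMaxScaler can be used instead. A minimal sketch:

 from sklearn.preprocessing import MinMaxScaler

 # scale each feature (column) into the [0, 1] range
 min_max_X = MinMaxScaler().fit_transform(X)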


Feature selection


It is no secret that the most important part of solving a problem is often the ability to select, and even construct, the right features. In the English-language literature this is called Feature Selection and Feature Engineering. While Feature Engineering is a rather creative process that relies mostly on intuition and expert knowledge, there are already plenty of ready-made algorithms for Feature Selection. Tree-based algorithms allow you to compute the informativeness of features:

 from sklearn.ensemble import ExtraTreesClassifier

 model = ExtraTreesClassifier()
 model.fit(X, y)
 # display the relative importance of each attribute
 print(model.feature_importances_)

All other methods are, in one way or another, based on efficiently searching through subsets of features in order to find the subset on which the resulting model gives the best quality. One such search algorithm is Recursive Feature Elimination, which is also available in the Scikit-Learn library:

 from sklearn.feature_selection import RFE
 from sklearn.linear_model import LogisticRegression

 model = LogisticRegression()
 # create the RFE model and select 3 attributes
 rfe = RFE(model, n_features_to_select=3)
 rfe = rfe.fit(X, y)
 # summarize the selection of the attributes
 print(rfe.support_)
 print(rfe.ranking_)


Algorithm construction


As already noted, Scikit-Learn implements all the basic machine learning algorithms. Let's consider some of them.

Logistic regression


It is most often used to solve (binary) classification problems, but multi-class classification is also possible (via the so-called one-vs-all method). The advantage of this algorithm is that for every object we get, as output, the probability of it belonging to a class.

 from sklearn import metrics
 from sklearn.linear_model import LogisticRegression

 model = LogisticRegression()
 model.fit(X, y)
 print(model)
 # make predictions
 expected = y
 predicted = model.predict(X)
 # summarize the fit of the model
 print(metrics.classification_report(expected, predicted))
 print(metrics.confusion_matrix(expected, predicted))
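
Since the text emphasizes class-membership probabilities, here is a small sketch (not part of the original) of how to obtain them from the fitted model above via predict_proba:

 # probability of each object belonging to each class;
 # columns follow the class order in model.classes_
 probabilities = model.predict_proba(X)
 print(probabilities[:5])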


Naive Bayes


This is also one of the best-known machine learning algorithms; its main task is to recover the probability densities of the training data. This method often gives good quality in multi-class classification problems.

 from sklearn import metrics
 from sklearn.naive_bayes import GaussianNB

 model = GaussianNB()
 model.fit(X, y)
 print(model)
 # make predictions
 expected = y
 predicted = model.predict(X)
 # summarize the fit of the model
 print(metrics.classification_report(expected, predicted))
 print(metrics.confusion_matrix(expected, predicted))


K-nearest neighbors


The kNN method (k-Nearest Neighbors) is often used as a component of a more complex classification algorithm. For instance, its output can be used as a feature for an object. And sometimes a simple kNN on well-chosen features gives excellent quality. With a proper choice of parameters (mainly the distance metric), the algorithm often performs well on regression tasks as well.

 from sklearn import metrics
 from sklearn.neighbors import KNeighborsClassifier

 # fit a k-nearest neighbor model to the data
 model = KNeighborsClassifier()
 model.fit(X, y)
 print(model)
 # make predictions
 expected = y
 predicted = model.predict(X)
 # summarize the fit of the model
 print(metrics.classification_report(expected, predicted))
 print(metrics.confusion_matrix(expected, predicted))
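
As a small illustration of the parameters mentioned above (not from the original text), the number of neighbors and the distance metric can be passed to the constructor explicitly; the values below are arbitrary examples:

 from sklearn.neighbors import KNeighborsClassifier

 # n_neighbors and metric are the main knobs to tune; these values are arbitrary
 model = KNeighborsClassifier(n_neighbors=7, metric='manhattan')
 model.fit(X, y)
 print(model.score(X, y))  # accuracy on the training data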


Decision trees


Classification and Regression Trees (CART) are often used in problems where the objects have categorical features, and they work for both regression and classification. Trees are very well suited to multi-class classification.

 from sklearn import metrics
 from sklearn.tree import DecisionTreeClassifier

 # fit a CART model to the data
 model = DecisionTreeClassifier()
 model.fit(X, y)
 print(model)
 # make predictions
 expected = y
 predicted = model.predict(X)
 # summarize the fit of the model
 print(metrics.classification_report(expected, predicted))
 print(metrics.confusion_matrix(expected, predicted))


Support Vector Machine


SVM (Support Vector Machine) is one of the best-known machine learning algorithms, used mainly for classification problems. Like logistic regression, SVM allows multi-class classification via the one-vs-all method.

 from sklearn import metrics
 from sklearn.svm import SVC

 # fit a SVM model to the data
 model = SVC()
 model.fit(X, y)
 print(model)
 # make predictions
 expected = y
 predicted = model.predict(X)
 # summarize the fit of the model
 print(metrics.classification_report(expected, predicted))
 print(metrics.confusion_matrix(expected, predicted))

In addition to classification and regression algorithms, Scikit-Learn offers a huge number of more complex algorithms, including clustering, as well as techniques for building compositions of algorithms, such as Bagging and Boosting.
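
As a brief illustration (not from the original article), here is a minimal sketch of a bagging ensemble, a boosting ensemble, and a clustering algorithm applied to the same data, with hyperparameters set to arbitrary example values:

 from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
 from sklearn.cluster import KMeans

 # bagging: averaging many models trained on bootstrap samples of the data
 bagging = BaggingClassifier().fit(X, y)
 # boosting: building models sequentially, each one correcting the previous ones
 boosting = GradientBoostingClassifier().fit(X, y)
 print(bagging.score(X, y), boosting.score(X, y))

 # clustering: grouping objects without using the target variable
 clusters = KMeans(n_clusters=2).fit_predict(X)
 print(clusters[:10])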

Optimization of algorithm parameters


One of the most difficult steps in building truly effective algorithms is choosing the right parameters. This becomes easier with experience, but one way or another you have to search over the parameter values. Fortunately, Scikit-Learn already provides quite a few functions for this.

As an example, let's look at selecting the regularization parameter by trying several values in turn:

 import numpy as np
 from sklearn.linear_model import Ridge
 from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

 # prepare a range of alpha values to test
 alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
 # create and fit a ridge regression model, testing each alpha
 model = Ridge()
 grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
 grid.fit(X, y)
 print(grid)
 # summarize the results of the grid search
 print(grid.best_score_)
 print(grid.best_estimator_.alpha)

Sometimes it is more efficient to sample the parameter at random from a given interval many times, measure the quality of the algorithm for each sampled value, and pick the best one:

 import numpy as np
 from scipy.stats import uniform as sp_rand
 from sklearn.linear_model import Ridge
 from sklearn.model_selection import RandomizedSearchCV  # formerly sklearn.grid_search

 # prepare a uniform distribution to sample for the alpha parameter
 param_grid = {'alpha': sp_rand()}
 # create and fit a ridge regression model, testing random alpha values
 model = Ridge()
 rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
 rsearch.fit(X, y)
 print(rsearch)
 # summarize the results of the random parameter search
 print(rsearch.best_score_)
 print(rsearch.best_estimator_.alpha)

We have walked through the whole process of working with the Scikit-Learn library, except for writing the results back to a file; this is left to the reader as an exercise, since one of the advantages of Python (and of the Scikit-Learn library itself) over R is its excellent documentation. In the following parts we will look at each topic in more detail and, in particular, touch on such an important subject as Feature Engineering.

I very much hope that this material will help novice Data Scientists to start solving machine learning problems in practice as soon as possible. In conclusion, I would like to wish success and patience to those who are just beginning to participate in machine learning competitions!

Source: https://habr.com/ru/post/247751/
