Hi, Habr!
Last time we got acquainted with the Apache Spark tool, which has recently become almost the most popular tool for processing big data and, in particular, Large Scale Machine Learning. Today we will take a closer look at the MLlib library, namely, we will show how to solve machine learning problems there: classification, regression, clustering, and also collaborative filtering. In addition, we will show how to examine features in order to select them and construct new ones (so-called Feature Engineering, which we have talked about earlier, and more than once).
Plan
First of all, we will look at how to store the objects of our training sample and how to compute basic statistics of the features; then we will cover the machine learning algorithms themselves (classification, regression, clustering); and finally we will consider an example of building a recommender system with so-called collaborative filtering methods, or more precisely, one of the most common algorithms of this kind, ALS.
Vectors
For simple "dense" vectors there is the Vectors.dense method:
from pyspark.mllib.linalg import Vectors
my_vec = Vectors.dense([1.12, 4.10, 1.5, -2.7, 3.5, 10.7, 0.7])
For "sparse" vectors, the Vectors.sparse method is used:
from pyspark.mllib.linalg import Vectors
my_vec = Vectors.sparse(10, [0, 2, 4, 9], [-1.2, 3.05, -4.08, 0.46])
Here, the first argument is the number of features (the length of the vector), followed by a list of the indices of the non-zero features, and then the values of those features themselves.
Labeled Vectors
For labeled points, Spark has a special LabeledPoint class:
from pyspark.mllib.regression import LabeledPoint
my_point = LabeledPoint(1.0, my_vec)
Here LabeledPoint.features is any of the vectors described above, and LabeledPoint.label is the label, which can take any real value in the case of a regression task and the values [0.0, 1.0, 2.0, ...] for classification tasks.
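For example, here is a minimal illustration of a labeled object for binary classification with a sparse feature vector (the values below are made up):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
# an object of class 0.0 described by a sparse vector of length 10
neg_point = LabeledPoint(0.0, Vectors.sparse(10, [1, 5], [2.0, -0.3]))
print(neg_point.label)     # 0.0
print(neg_point.features)  # (10,[1,5],[2.0,-0.3])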
Working with Features
It is no secret that often, in order to build a good machine learning algorithm, it is enough just to look at the features, select the most relevant ones, or come up with new ones. For this purpose Spark provides the Statistics class, with which you can do all these things, for example:
from pyspark.mllib.stat import Statistics
summary = Statistics.colStats(features)
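Continuing this snippet, here is a minimal sketch of what can be read from the computed summary, along with the pairwise correlations between features, which are handy when selecting relevant ones:
print(summary.mean())         # vector of per-feature means
print(summary.variance())     # vector of per-feature variances
print(summary.numNonzeros())  # number of non-zero values per feature
print(Statistics.corr(features, method="pearson"))  # feature correlation matrix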
In addition, Spark has a huge number of other capabilities, such as sampling, generation of standard features (like TF-IDF for texts), as well as such an important thing as feature scaling (the reader is encouraged to look this up in the documentation after finishing this article). For the latter there is a special StandardScaler class:
from pyspark.mllib.feature import StandardScaler
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaler.transform(features.map(lambda x: x.toArray()))
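For sparse data a common option is to scale by standard deviation only, without centering (see the note below); a minimal sketch, where sparse_features is a hypothetical RDD of SparseVectors:
from pyspark.mllib.feature import StandardScaler
# withMean=False keeps the vectors sparse; only the scale of each feature changes
sparse_scaler = StandardScaler(withMean=False, withStd=True).fit(sparse_features)
scaled_sparse = sparse_scaler.transform(sparse_features)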
The only thing that is important to remember is that in the case of sparse vectors centering does not work, and the scaling strategy must be thought out for the specific task. We now turn directly to the machine learning problems themselves.
Classification and Regression
Linear methods
As always, the most common methods are linear classifiers. Training a linear classifier reduces to a convex minimization problem over the weight vector. The differences lie in the choice of the loss function, the regularization function, the number of iterations, and many other parameters. As an example, consider below the logistic loss function (and, accordingly, the so-called logistic regression method), 500 iterations, and L2 regularization.
import pyspark.mllib.classification as cls
model = cls.LogisticRegressionWithSGD.train(train, iterations=500, regType="l2")
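To get a feel for the quality of the resulting model, one can, for example, compute the error rate on the training set; a minimal sketch, with train being the RDD of LabeledPoint used above:
# fraction of objects whose predicted class differs from the true label
labels_and_preds = train.map(lambda p: (p.label, model.predict(p.features)))
train_error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(train.count())
print(train_error)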
Linear regression is done similarly:
import pyspark.mllib.regression as regr
model = regr.RidgeRegressionWithSGD.train(train)
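Regression quality is usually measured differently, for example by the mean squared error; a sketch on the same training data:
# mean squared error of the predictions on the training set
values_and_preds = train.map(lambda p: (p.label, model.predict(p.features)))
MSE = values_and_preds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print(MSE)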
Naive Bayes
In this case, the learning algorithm takes only two parameters as input: the training set itself and the smoothing parameter:
from pyspark.mllib.classification import NaiveBayes
model = NaiveBayes.train(train, 8.5)
model.predict(test.map(lambda x: x.features))
Decision Trees
In Spark, as in many other packages, regression and classification trees are implemented. The learning algorithm takes many parameters as input, such as the number of classes and the maximum tree depth. The algorithm also needs to be told which features are categorical, along with many other parameters. However, one of the most important of them when training trees is the so-called impurity: the criterion for computing the so-called information gain, which can usually take the following values: entropy and gini for classification problems, and variance for regression problems. For example, consider a binary classification with the parameters defined below:
from pyspark.mllib.tree import DecisionTree
model = DecisionTree.trainClassifier(train, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5)
model.predict(test.map(lambda x: x.features))
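For regression problems, the analogous trainRegressor method is used with variance as the impurity; a sketch, where train_reg and test_reg are hypothetical RDDs of LabeledPoint with real-valued labels and feature 0 is, purely for illustration, categorical with 4 possible values:
from pyspark.mllib.tree import DecisionTree
# regression tree: 'variance' impurity; feature 0 is declared categorical with 4 values
reg_model = DecisionTree.trainRegressor(train_reg, categoricalFeaturesInfo={0: 4}, impurity='variance', maxDepth=5, maxBins=32)
reg_model.predict(test_reg.map(lambda x: x.features))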
Random forest
Random forests, as is well known, are among the most universal algorithms, and one would expect them to be implemented in this tool. They use the trees described above. Here, too, there are trainClassifier and trainRegressor methods for training a classifier and a regression function, respectively. Among the most important parameters are the number of trees in the forest, the impurity already familiar to us, and featureSubsetStrategy, the number of features considered when splitting at the next tree node (for more information about the possible values, see the documentation). Accordingly, below is an example of binary classification using 50 trees:
from pyspark.mllib.tree import RandomForest
model = RandomForest.trainClassifier(train, numClasses=2, categoricalFeaturesInfo={}, numTrees=50, featureSubsetStrategy="auto", impurity='gini', maxDepth=20, seed=12)
model.predict(test.map(lambda x: x.features))
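A regression forest is trained the same way via trainRegressor; a sketch on the hypothetical train_reg / test_reg data from above, with "onethird" being a common featureSubsetStrategy choice for regression:
from pyspark.mllib.tree import RandomForest
# regression forest: for regression the impurity must be 'variance'
reg_forest = RandomForest.trainRegressor(train_reg, categoricalFeaturesInfo={}, numTrees=50, featureSubsetStrategy="onethird", impurity='variance', maxDepth=10, seed=12)
reg_forest.predict(test_reg.map(lambda x: x.features))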
Clustering
As elsewhere, Spark implements the well-known KMeans algorithm, whose training takes as input the dataset itself, the number of clusters, the number of iterations, and the strategy for choosing the initial cluster centers (the initializationMode parameter, which defaults to k-means|| and can also take the value random):
from pyspark.mllib.clustering import KMeans
clusters = KMeans.train(features, 3, maxIterations=100, runs=5, initializationMode="random")
clusters.predict(features)
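Clustering quality is often judged by the within-cluster sum of squared errors (WSSSE); a minimal sketch, assuming features is an RDD of NumPy arrays (otherwise points can be converted with toArray()):
import numpy as np
# squared distance from a point to the center of its assigned cluster
def squared_error(point):
    center = clusters.clusterCenters[clusters.predict(point)]
    return float(np.sum((point - center) ** 2))
WSSSE = features.map(squared_error).reduce(lambda a, b: a + b)
print(WSSSE)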
Collaborative filtering
Considering that the best-known example of using Big Data is recommender systems, it would be strange if the simplest algorithms were not implemented in many packages. This also applies to Spark. It implements the ALS (Alternating Least Squares) algorithm, perhaps one of the best-known collaborative filtering algorithms. The description of the algorithm itself deserves a separate article. Here we will only say in a nutshell that the algorithm factorizes the rating matrix (whose rows are users and whose columns are products) into a product-topic matrix and a topic-user matrix, where the topics are hidden variables whose meaning is often unclear (the whole beauty of ALS lies precisely in finding the topics themselves and their values). The essence of these topics is that each user and each movie is now characterized by a set of features, and the dot product of these vectors is the rating a particular user gives to a particular movie. The training set for this algorithm is given as a table of the form userID -> productID -> rating. After that, the model is trained with ALS (which, like the other algorithms, takes many parameters as input, which the reader is invited to read about on their own):
from pyspark.mllib.recommendation import ALS
model = ALS.train(ratings, 20, 60)
predictions = model.predictAll(ratings.map(lambda x: (x[0], x[1])))
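To check how well the model reconstructs the known ratings, one can join the predictions with the original ratings and compute the mean squared error; a sketch, assuming ratings is an RDD of (userID, productID, rating) triples as above:
# key both RDDs by (user, product) and compare true and predicted ratings
rates = ratings.map(lambda x: ((x[0], x[1]), x[2]))
preds = predictions.map(lambda r: ((r[0], r[1]), r[2]))
MSE = rates.join(preds).map(lambda x: (x[1][0] - x[1][1]) ** 2).mean()
print(MSE)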
Conclusion
So, we have briefly reviewed the MLlib library from the Apache Spark framework, which was developed for distributed processing of big data. Recall that the main advantage of this tool, as discussed earlier, is that data can be cached in RAM, which significantly speeds up computations in the case of iterative algorithms, which most machine learning algorithms are.