πŸ“œ ⬆️ ⬇️

Big Data Training: Spark MLlib

Hi, Habr!

image

Last time we met with the Apache Spark tool, which has recently become almost the most popular tool for processing big data, and in particular, Large Scale Machine Learning . Today we will take a closer look at the MlLib library, namely, we will show how to solve machine learning problems β€” classification, regression, clustering, and also collaborative filtering. In addition, we will show how it is possible to investigate the signs in order to select and isolate new ones (the so-called Feature Engineering , which we talked about earlier , and not once ).

Plan


First of all, we will look at how to store the objects of our training sample, how to calculate the basic statistics of signs, then the machine learning algorithms (classification, regression, clustering) and finally, consider an example of building a recommender system - the so-called. collaborative filtering methods, or more precisely, one of the most common ALS algorithms.
')

Vectors


For simple "dense" vectors there is a special class Vectors.dense :

from pyspark.mllib.linalg import Vectors my_vec = Vectors.dence ([1.12, 4.10, 1.5, -2.7, 3.5, 10.7, 0.7]) 

For β€œsparse” vectors, the Vectors.sparse class is used :

 from pyspark.mllib.linalg import Vectors my_vec = Vectors.sparse(10, [0,2,4,9], [-1.2, 3.05, -4.08, 0.46]) 

Here, the first argument is the number of features (the length of the vector), followed by a list β€” the numbers of nonzero features, and after β€” the values ​​of the features themselves.

Marked Vectors


For marked points in Spark there is a special LabeledPoint class:

 from pyspark.mllib.regression import LabeledPoint my_point = LabeledPoint(1.0, my_vec) 

Where in the LabeledPoint class we have LabeledPoint.features - any of the vectors described above, and LabeledPoint.label is, respectively, a label that can take any real value in the case of a regression task and the value [0.0,1.0,2.0, ...] - for classification tasks

Work with signs


It is no secret that often, in order to build a good machine learning algorithm, it is enough just to look at the signs, select the most relevant ones or come up with new ones. For this purpose, in the Spark class Statistics , with which you can do all these things, for example:

 from pyspark.mllib.stat import Statistics summary = Statistics.colStats(features) # meas of features summary.mean # non zeros features summary.numNonzeros # variance summary.variance # correlations of features Statistics.corr(features) 

In addition, Spark has a huge number of additional features like sampling, generation of standard features (like TF-IDF for texts), as well as such an important thing as scaling of features (the reader is offered to read this documentation in the article after reading this article). For the latter, there is a special class Scaler :

 from pyspark.mllib.feature import StandardScaler scaler = StandardScaler(withMean=True, withStd=True).fit(features) scaler.transform (features.map(lambda x:x.toArray())) 

The only thing that is important to remember is that in the case of sparse vectors this does not work and the scaling strategy must be thought out for a specific task. We now turn directly to the problems of machine learning.

Classification and Regression


Linear methods


As always, the most common methods are linear classifiers. Learning a linear classifier is reduced to the problem of convex minimization of the functional of the weights vector. The difference lies in the choice of the loss function, the regularization function, the number of iterations, and many other parameters. For example, consider below the logistic loss function (and, respectively, the so-called logistic regression method), 500 iterations and L2 - regularization.

 import pyspark.mllib.classification as cls model = cls.LogisticRegressionWithSGD.train(train, iterations=500, regType="l2") 

Similarly, linear regression is done:

 import pyspark.mllib.regression as regr model = regr.RidgeRegressionWithSGD.train(train) 

Naive bayes


In this case, the learning algorithm takes as input only 2 parameters β€” the learning sample itself and the smoothing parameter:

 from pyspark.mllib.classification import NaiveBayes model = NaiveBayes.train(train, 8.5) model.predict(test.features) 

Decisive trees


In Spark, as in many other packages, regression and classification trees are implemented. The learning algorithm takes as input multiple parameters, such as multiple classes, maximum tree depth. Also, the algorithm must specify which categories have categorical features, as well as many other parameters. However, one of the most important of them in training trees is the so-called impurity β€” the criterion for calculating the so-called information gain , which can usually take the following values: entropy and gini β€” for classification problems, variance β€” for regression problems. For example, consider a binary classification with the parameters defined below:

 from pyspark.mllib.tree import DecisionTree model = DecisionTree.trainClassifier(train, numClasses=2, impurity='gini', maxDepth=5) model.predict(test.map(lambda x: x.features)) 

Random forest


Random forests, as is known, are among the universal algorithms and one would expect that they will be implemented in this tool. They use the trees described above. There are also exactly the trainClassifier and trainRegression methods for teaching the classifier and the regression function, respectively. One of the most important parameters is the number of trees in the forest, the impurity already known to us, as well as the featureSubsetStrategy , the number of signs that are considered when splitting a tree on the next node (for more information about the values, see the documentation). Accordingly, the following is an example of a binary classification using 50 trees:

 from pyspark.mllib.tree import RandomForest model = RandomForest.trainClassifier(train, numClasses=2, numTrees=50, featureSubsetStrategy="auto", impurity='gini', maxDepth=20, seed=12) model.predict(test.map(lambda x:x.features)) 

Clustering


As elsewhere, in the park, the well-known KMeans algorithm is implemented , whose training takes input directly, the number of clusters, the number of iterations, and the strategy for selecting initial cluster centers (the initializationMode parameter, which by default has a value of k-means , and also can value random ):

 from pyspark.mllib.clustering import KMeans clusters = KMeans.train(features, 3, maxIterations=100, runs=5, initializationMode="random") clusters.predict(x.features)) 

Collaborative filtering


Considering that the most famous example of using Big Data is a recommender system, it would be strange if the simplest algorithms were not implemented in many packages. This also applies to Spark. It implements the ALS (Alternative Least Square) algorithm - perhaps one of the most well-known collaborative filtering algorithms. The description of the algorithm itself deserves a separate article. Here we just say in a nutshell that the algorithm actually deals with decomposition of the response matrix (the rows of which are users, and the columns are products) into product matrices β€” a topic and a topic-user , where the topics are some hidden variables, the meaning of which is often not clear (the beauty of the ALS algorithm is just to find the topics themselves and their values). The essence of these topics is that each user and each movie are now characterized by a set of features, and the scalar product of these vectors is the assessment of the movie of a particular user. The training set for this algorithm is specified in the form of a table userID -> productID -> rating . After that, the model is trained using ALS (which, like other algorithms, accepts many parameters for input, which the reader is offered to read about on their own):

 from pyspark.mllib.recommendation import ALS model = ALS.train (ratings, 20, 60) predictions = model.predictAll(ratings.map (lambda x: (x[0],x[1]))) 

Conclusion


So, we briefly reviewed the MlLib library from the Apache Spark framework, which was developed for distributed processing of big data. Recall that the main advantage of this tool, as discussed earlier , is that the data can be cached in RAM, which allows you to significantly speed up the calculations in the case of iterative algorithms, which are most of the machine learning algorithms.

Source: https://habr.com/ru/post/251471/


All Articles