Machine learning: from Iris to Telecom

Mobile operators, providing a variety of services, accumulate a huge amount of statistical data. I represent the department that implements the subscriber traffic management system , which, during the operation of the operator, generates hundreds of gigabytes of statistical information per day. I was interested in the question: how in these Big Data (Big Data) to reveal the maximum of useful information? No wonder that one of the V in the definition of Big Data is an additional income.

I took on this task, not being an expert in data research. Immediately a lot of questions arose: what technical means to use for analysis? At what level is it enough to know mathematics, statistics? What methods of machine learning need to know and how deep? Is it better to start with a specialized language for studying R or Python data?
')
As my experience has shown, for the initial level of data research, it’s not at all necessary. But for a quick dive I didn’t have enough of a simple example that clearly showed the full algorithm for researching data. In this article, using the example of Iris Fisher, we will go all the way through the initial training, and then apply the understanding obtained to the real data of the telecommunications operator. Readers who are already familiar with data mining can skip straight to a chapter on Telecom.

Terms

To begin, let's understand the subject of study. Now the terms Artificial Intelligence, Machine Learning, and Deep Machine Learning are often used interchangeably, but in fact there is a well-defined hierarchy:

Artificial Intelligence includes all tasks in which machines perform intellectual tasks, such as playing checkers or chess, assistants capable of recognizing speech and giving answers to questions, various robots.
Machine Learning is a narrower concept and belongs to the class of tasks for which the computer is trained to perform certain actions, having previously known correct answers, for example, classifying objects according to a set of features or recommending music and movies.
Under the Deep Learning implies tasks that are solved using neural networks and Big Data, such as pattern recognition or text translation.

In the article we will talk about Machine Learning. There are two ways to learn:

with teacher
without teacher

With a teacher is when we have data with the correct answers. Then the algorithm can be trained on this data set, and then apply it to the prediction. Such algorithms include classification and regression. Classification is the assignment of objects to a particular class by a set of attributes. For example, recognition of license plates, or in medicine, diagnosis of diseases, or credit scoring in the banking sector. Regression is the prediction of a real variable, such as stock prices.

Without a teacher (self-study) is the search for hidden patterns in the data. Such algorithms include clustering. For example, all large retail chains are looking for patterns in their customers' purchases and are trying to work with target groups of customers, rather than with the general mass.

Regression, classification and clustering are the main data research algorithms, and therefore we will consider them.

Research data

The data mining algorithm consists of a specific sequence of steps. Depending on the task and available data, the set of steps may vary, but the general direction is always determined:

Collection and cleaning of data. As practice shows, this stage can take up to 90% of the time of the entire data analysis;
Visual analysis of data, their distribution, statistics;
Analysis of dependence (correlation) between variables (features);
Selection and identification of features that will be used to build models;
Separation of data for training models and test;
Building models on data for training / assessment of results on test data;
Interpretation of the resulting model, visualization of results.

With the algorithm figured out, and what tools to use for analysis? There are lots of tools, from Excel to specialized tools, for example, MathLab. We will take Python co specialized libraries. Do not be afraid of difficulties, everything is simple:

Download Python and all math packs in one distribution called Anaconda
Installation under Linux will not cause problems: bash Anaconda2-4.4.0-Linux-x86_64.sh

Run: jupyter notebook

This automatically opens the browser:

Check that the application is running: print “HelloWorld!”

Press Ctrl + Enter, see that everything is ok.

For self-study work in IPython Notebook on the Internet there is a lot of information, for example, a simple introduction: Review Ipython Notebook 2.0 .

And we begin our study!

Collecting and cleaning data

In the example of Irises for us all the data collected and filled. Just load them and see:

#  : import numpy as np import pandas as pd from sklearn import datasets from sklearn import linear_model from sklearn.cluster import KMeans from sklearn import cross_validation from sklearn import metrics from pandas import DataFrame %pylab inline

Further:

 #    : iris = datasets.load_iris() #     print iris.feature_names #   ,  10  : print iris.data[:10] #    : print iris.target_names print iris.target

We see that the data set consists of the length / width of two types of Iris petals: sepal and petal. Do not ask me where they are at Iris. The target variable is an Iris variety: 0 - Setosa, 1 - Versicolor, 2 - Virginica. Accordingly, our task is to try to find the dependencies between the size of the petals and the Iris varieties.

For the convenience of data manipulation we make DataFrame from them:

 iris_frame = DataFrame(iris.data) #     ,   : iris_frame.columns = iris.feature_names #     : iris_frame['target'] = iris.target #      : iris_frame['name'] = iris_frame.target.apply(lambda x : iris.target_names[x]) # ,  : iris_frame

It seemed to work out what they wanted:

Descriptive statistics

 #     : pyplot.figure(figsize(20, 24)) plot_number = 0 for feature_name in iris['feature_names']: for target_name in iris['target_names']: plot_number += 1 pyplot.subplot(4, 3, plot_number) pyplot.hist(iris_frame[iris_frame.name == target_name][feature_name]) pyplot.title(target_name) pyplot.xlabel('cm') pyplot.ylabel(feature_name[:-4])

Looking at these histograms, an experienced researcher can immediately draw the first conclusions. I only see that the distribution of some variables seems to be normal. Let's try to do more clearly. We build a table with dependencies between signs and color the points depending on the Iris varieties:

 import seaborn as sns sns.pairplot(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)','name']], hue = 'name')

Here, even an inexperienced researcher can see that "petal width (cm)" and "petal length (cm)" have a strong relationship - the points are stretched along one line. And in principle, according to the same features, a classification can be built, since dots by color are grouped quite compactly. But, for example, using the variables “sepal width (cm)” and “sepal length (cm)”, one cannot build a qualitative classification, since points related to Versicolor and Virginica varieties are intermingled.

Dependence between variables

Now let's look at the mathematical values of dependencies:

 iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']].corr()

In a more visual form, we construct a heat map of the dependence of the signs:

 import seaborn as sns corr = iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']].corr() mask = np.zeros_like(corr) mask[np.triu_indices_from(mask)] = True with sns.axes_style("white"): ax = sns.heatmap(corr, mask=mask, square=True, cbar=False, annot=True, linewidths=.5

The values of the correlation coefficient are interpreted as follows:

Up to 0.2 - very weak correlation
To 0.5 - weak
Up to 0.7 - average
Up to 0.9 - high
More than 0.9 - very high

Indeed, we see that between the variables “petal length (cm)” and “petal width (cm)” a very strong dependence of 0.96 has been revealed.

We select and create signs

In the first approximation, you can simply include all variables in the model and see what happens. Then you can think about what signs to remove and which ones to create.

Training and test data

We divide data into data for training and test data. Typically, the sample is divided into training and test in a percentage of 66/33, 70/30 or 80/20. Other splits are possible depending on the data. In our example, we assign 30% of the entire sample to the test data (parameter test_size = 0.3):

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']], iris_frame['target'], test_size = 0.3, random_state = 0) #  ,      : print train_data print test_data print train_labels print test_labels

Model building cycle - result evaluation

We turn to the most interesting.

Linear Regression - LinearRegression

How to visualize linear regression? If you look at the relationship between two variables, then this is the line holding so that the vertical distances from the line to the points are minimal in sum. The most common optimization method is minimization of the mean-square error using the gradient descent algorithm. The explanation of the gradient descent is much where, for example, in the section “What is a gradient descent?”. But you can not read and perceive linear regression as an abstract algorithm for finding the line that most closely follows the direction of the distribution of objects. We build a model using variables that, as we understood earlier, have a strong relationship - these are “petal length (cm)” and “petal width (cm)”:

 from scipy import polyval, stats fit_output = stats.linregress(iris_frame[['petal length (cm)','petal width (cm)']]) slope, intercept, r_value, p_value, slope_std_error = fit_output print(slope, intercept, r_value, p_value, slope_std_error)

We look at the quality metrics of the model:

(0.41641913228540123, -0.3665140452167277, 0.96275709705096657, 5.7766609884916033e-86, 0.009612539319328553)
Of the most interesting is the correlation coefficient between the variables r_value with a value of 0.96275709705096657. We have already seen it before, and here we are once again convinced of its existence. Draw a graph with points and a regression line:

 import matplotlib.pyplot as plt plt.plot(iris_frame[['petal length (cm)']], iris_frame[['petal width (cm)']],'o', label='Data') plt.plot(iris_frame[['petal length (cm)']], intercept + slope*iris_frame[['petal length (cm)']], 'r', linewidth=3, label='Linear regression line') plt.ylabel('petal width (cm)') plt.xlabel('petal length (cm)') plt.legend() plt.show()

We see that, indeed, the regression line found well repeats the direction of distribution of points. Now, if we have available, for example, the length of the leaflet pental, we will be able to determine with great accuracy how wide it is!

Classification

How to intuitively present the classification? If you look at the task of division into two classes of objects that have two signs (for example, you need to separate apples and bananas if their sizes are known), then the classification is reduced to drawing a line on a plane that divides objects into two classes. If it is necessary to divide into a larger number of classes, then several lines are drawn. If you look at objects with three variables, then three-dimensional space and the task of holding planes are represented. If variables are N, then you just need to imagine a hyperplane in N-dimensional space).

So, we take the most famous classification learning algorithm: Stochastic Gradient Descent. With the gradient descent, we have already met in linear regression, and stochastic says that for the speed of work, not all sampling is used, but random data. And apply it to the SVM (Support Vector Machine) classification method:

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']], iris_frame[['target']], test_size = 0.3, random_state = 0) model = linear_model.SGDClassifier(alpha=0.001, n_iter=100, random_state = 0) model.fit(train_data, train_labels) model_predictions = model.predict(test_data) print metrics.accuracy_score(test_labels, model_predictions) print metrics.classification_report(test_labels, model_predictions)

We look at the quality metrics of the model:

In fact, you can evaluate the model, not really understanding the essence of the values of metrics: if accuracy, precision and recall is more than 0.85, then this is a good model, if more than 0.95, then it is excellent.

In short, the metrics used in the example reflect the following:

accuracy is the main metric that shows the proportion of correct model responses. Its value is equal to the ratio of the number of correct answers given by the model to the number of all objects. But it does not fully reflect the quality of the model. Therefore, precision and recall are introduced.

These metrics are given both in terms of the quality of recognition of each class (iris variety) and the total values. We look at the total values:

precision (accuracy) - this metric shows how much we can trust the model, in other words, how many “false positives” we have. The value of the metric is equal to the ratio of the number of responses that the model considers correct, and they were indeed correct (this number is denoted by “true positives”) to the sum of “true positives” and the number of objects that the model considered correct, but in fact they were incorrect (this number denoted by "false positives"). In the form of the formula: precision = "true positives" / ("true positives" + "false positives")
recall (completeness) - this metric shows how much a model can even detect the correct answers, in other words, how many “false passes” we have. Its numerical value is equal to the ratio of the answers that the model considers correct, and they were indeed correct to the number of all the correct answers in the sample. In the form of the formula: recall = "true positives" / "all positives"
f1-score (f-measure) is a combination of precision and recall
support - just the number of objects found in the class

There are also important model metrics: PR-AUC and ROC-AUC, you can get acquainted with them, for example, here: Metrics in machine learning tasks .

Thus, we see that the values of the metrics in our example are very good. Let's look at the schedule. For clarity, the sample is drawn in two coordinates and color by class.

First, let's display the test sample as it is:

Then, as our model predicted. We see that the points on the border (which I circled in red) were classified incorrectly:

But most of the objects predicted correctly!

Cross-validation

Somehow a very suspiciously good result ... What could be wrong? For example, we accidentally broke the data into a training and test sample. To remove this chance, the so-called cross-validation is applied. This is when the data is divided several times into a training and test sample, and the result of the algorithm is averaged.

Let us check the operation of the algorithm on 10 random samples:

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']], iris_frame['target'], test_size = 0.3) model = linear_model.SGDClassifier(alpha=0.001, n_iter=100, random_state = 0) scores = cross_validation.cross_val_score(model, train_data, train_labels, cv=10) print scores.mean()

We look at the result. He expectedly deteriorated: 0.860909090909

Selection of optimal algorithm parameters

What else can be done to optimize the algorithm? You can try to find the parameters of the algorithm itself. We see that alpha = 0.001, n_iter = 100 are passed to the algorithm. Let's find the optimal values for them.

 from sklearn import grid_search train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']], iris_frame['target'], test_size = 0.3) parameters_grid = { 'n_iter' : range(5,100), 'alpha' : np.linspace(0.0001, 0.001, num = 10), } classifier = linear_model.SGDClassifier(random_state = 0) cv = cross_validation.StratifiedShuffleSplit(train_labels, n_iter = 10, test_size = 0.3, random_state = 0) grid_cv = grid_search.GridSearchCV(classifier, parameters_grid, scoring = 'accuracy', cv = cv)grid_cv.fit(train_data, train_labels) print grid_cv.best_estimator_

At the output we get a model with optimal parameters:

SGDClassifier (alpha = 0.00089999999999999998, average = False, class_weight = None,
epsilon = 0.1, eta0 = 0.0, fit_intercept = True, l1_ratio = 0.15,
learning_rate = 'optimal', loss = 'hinge', n_iter = 96, n_jobs = 1,
penalty = 'l2', power_t = 0.5, random_state = 0, shuffle = True, verbose = 0,
warm_start = False)

We see that in it alpha = 0.0009, n_iter = 96. We substitute these values into the model:

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']], iris_frame['target'], test_size = 0.3) model = linear_model.SGDClassifier(alpha=0.0009, n_iter=96, random_state = 0) scores = cross_validation.cross_val_score(model, train_data, train_labels, cv=10) print scores.mean()

We look, it became a little better: 0.915505050505

We select and create signs

It's time to experiment with the signs. Let's remove from the model less significant features, namely “sepal length (cm)” and “sepal width (cm)”. We drive into the model:

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['petal length (cm)','petal width (cm)']], iris_frame['target'], test_size = 0.3) model = linear_model.SGDClassifier(alpha=0.0009, n_iter=96, random_state = 0) scores = cross_validation.cross_val_score(model, train_data, train_labels, cv=10) print scores.mean()

We look, it became even better: 0.937727272727
To illustrate the approach, let's make a new sign: the area of a petal sheet and see what happens.

 iris_frame['petal_area'] = 0.0 for k in range(0,150): iris_frame['petal_area'][k] = iris_frame['petal length (cm)'][k] * iris_frame['petal width (cm)'][k]

Substitute in the model:

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['petal_area']], iris_frame['target'], test_size = 0.3) model = linear_model.SGDClassifier(alpha=0.0009, n_iter=96, random_state = 0) scores = cross_validation.cross_val_score(model, train_data, train_labels, cv=10) print scores.mean()

It's funny, but in our example it turns out that the petal area of the petal (or rather, not even the area, because the petals are not rectangles, but the “product of length by width”) is most accurately predicted by the Iris variety: 0.94237373737374

Perhaps this can be explained by the fact that the variables 'petal length (cm)' and 'petal width (cm)', and so well divides the Irises into classes, and their work also “stretches” the classes along the straight line:

We got acquainted with the main ways of model optimization, now let's consider the clustering algorithm - an example of machine learning without a teacher.

Clustering - K-means

The essence of clustering is extremely simple - it is necessary to divide the existing objects into groups, so that the groups include similar objects. We now do not have the correct answers to train the model, so the algorithm must itself group objects according to the "proximity" of the location of objects to each other.

For example, consider the most famous K-means algorithm. It is not for nothing that K-Means are called, since The method is based on finding K cluster centers so that the average distances from them to objects that belong to them are minimal. First, the algorithm determines K arbitrary centers, then all objects are distributed in proximity to these centers. Got K clusters of objects. Then, in these clusters, the centers are re-calculated by the average distance to the objects, and the objects are again redistributed. The algorithm works until the centers of the clusters stop moving to any particular delta.

 train_data, test_data, train_labels, test_labels = cross_validation.train_test_split(iris_frame[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']], iris_frame[['target']], test_size = 0.3) model = KMeans(n_clusters=3) model.fit(train_data) model_predictions = model.predict(test_data) print metrics.accuracy_score(test_labels, model_predictions) print metrics.classification_report(test_labels, model_predictions)

We look at the results:

We see that even with the default parameters it turns out very well: accuracy, precision and recall are greater than 0.9. We are convinced on the pictures. We see a decent, but not everywhere accurate result:

The algorithm has a drawback - for its operation, you need to specify the number of clusters that we want to find. And if it is inadequate, the results of the algorithm will be useless. Let's see what happens if you specify the number of clusters, for example, 5:

We see that in practice the result is not applicable. There are algorithms for determining the optimal number of clusters, but in this article we will not dwell on them.

Conclusion on the Iris Research

So, on the example of Irises, we considered three main methods of machine learning: regression, classification and clustering. Spent optimization algorithms and visualization of results. We obtained very good results, but this was expected on a specially prepared data set.

A complete Python Notebook can be found on Github . Go to Telecom.

Telecom

Telecom has tasks that are solved in other areas with the help of data analysis (banks, insurance, retail):

Predicting customer churn (Churn Prevention);
Fraud Prevention;
Identification of similar subscribers (subscriber base segmentation);
Cross sales (Cross-Sale) and raising the amount of the sale (Up-Sale);
Identify subscribers who strongly influence their surroundings (Alpha subscribers).

In addition, there are specific tasks:

Prediction of network resource consumption by subscribers: traffic volume, number of calls, SMS;
The study of the movement of subscribers to optimize the network.

Where does a telecom operator get data for analysis? Of the various information systems and equipment that is involved in providing services to subscribers:

The billing system stores data on payments and expenses of subscribers, tariffs, personal data;
Data on which sites the subscriber visited is extracted from the DPI equipment;
From base stations you can get geodata based on the location of the subscriber;
Service equipment generates data on the consumption of communication services by the subscriber.

My goal was to determine which tasks you can try to solve using the data generated by the subscriber traffic management system. In order for the billing system to charge the subscriber’s traffic correctly, it needs to know: who / where / when / what type and volume of traffic was consumed. This information comes from the equipment in the form of so-called CDR (Call Data Record) files. IMSI and MSISDN subscriber identifiers, location accurate to base station CELL ID, IMEI subscriber equipment identifier, session timestamp and information about the consumed service are recorded in these files in csv format.

To maintain confidentiality, all data for the study were impersonal and replaced with random values in compliance with the format. Let's look at the data:

What machine learning algorithms can be applied to this data? You can, for example, aggregate the consumption of different types of traffic by subscribers for a certain period and perform clustering. You should get something like this:

Those. if, for example, the result of clustering showed that subscribers were divided into groups that use Youtube, social networks and listen to music in different ways, then you can make tariffs that take into account their interests. I suppose that telecom operators do this, producing tariff lines with payment differentiation according to the type of traffic.

What else can be analyzed in the available data? There are several cases with subscribers equipment. The operator knows the model of the subscriber's device and can, for example, offer certain services only to Samsung users. Or, knowing the coordinates of the base stations, you can draw a heat map of the distribution of Samsung phones (I have no real coordinates, so the map has no relation to reality):

It may happen that in a certain region there will be a percentage of them more than in others. Then this information can be offered to Samsung for advertising campaigns or opening stores selling smartphones. Then you can look at the Top models of devices from which subscribers access the Internet:

To mask the current state of affairs, the outdated IMEI database was taken, but this does not change the essence of the approach. The list shows that most of the devices are Apple, modems and Samsung, and at the end Meizu, Micromax and Xiaomi appear.

Actually, these are all applications of the source data that I could find in a short time. Of course, using this data, you can look at various statistics and time series, analyze emissions, etc., but in order to identify any dependence by machine learning ... unfortunately, I have not yet found how to do this.

, : , .. , .

. , .
, , , .

Source: https://habr.com/ru/post/334738/

All Articles

Machine learning: from Iris to Telecom

Terms

Research data

Collecting and cleaning data

Descriptive statistics

Dependence between variables

We select and create signs

Training and test data

Model building cycle - result evaluation

Linear Regression - LinearRegression

Classification

Cross-validation

Selection of optimal algorithm parameters

We select and create signs

Clustering - K-means

Conclusion on the Iris Research

Telecom

More articles: