
Unusual Playboy models, or detecting outliers in data using Scikit-learn

Motivated by BubaVV's article on predicting a Playboy model's weight from her measurements and height, the author decided to dig deeper (if you know what I mean) into this blood-stirring research topic and hunt for outliers in the same data, that is, for especially striking models who stand out from the rest by their measurements, height, or weight. Against the background of this warm-up, the article also gives novice researchers a brief, lighthearted introduction to outlier and anomaly detection in data using the One-Class Support Vector Machine implementation from the Scikit-learn Python library.

Loading and initial data analysis


So, having duly credited the source of the data and the person who prepared it, we open girls.csv and see what is inside. It holds the parameters of 604 Playboy Playmates of the Month from December 1953 to January 2009: bust girth (Bust, in cm), waist girth (Waist, in cm), hip girth (Hips, in cm), height (Height, in cm), and weight (Weight, in kg).

Open your favorite Python programming environment (in my case Eclipse + PyDev) and load the data using the Pandas library. This article assumes that the Pandas, NumPy, SciPy, sklearn, and matplotlib libraries are installed. If not, Windows users can rejoice and simply install the precompiled libraries from here.
Users of *nix and Macs (the author among them) will have to suffer a little, but the article is not about that.

First, we import the modules we need; their roles will become clear as we go.
 import pandas
 import numpy as np
 import matplotlib.pyplot as plt
 import matplotlib.font_manager
 from scipy import stats
 from sklearn.preprocessing import scale
 from sklearn import svm
 from sklearn.decomposition import PCA

Create an instance of the Pandas DataFrame data structure named girls by reading the data from the girls.csv file (it lies next to this py-file; otherwise you must specify the full path). The header parameter says that the attribute names are in the first line (that is, line zero, counting like a programmer).

 girls = pandas.read_csv('girls.csv', header=0) 
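Just to be safe, we can peek at the first few rows to confirm the file was parsed correctly (a small extra check, not in the original article):

 print(girls.head())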

By the way, Pandas is a great option for those who are used to Python but still love how quickly data can be wrangled in R. The main thing Pandas inherited from R is precisely the convenient DataFrame data structure.
The author got acquainted with Pandas through the Kaggle tutorial for the introductory competition "Titanic: Machine Learning from Disaster". For those not yet familiar with Kaggle, this is an excellent reason to finally fix that.

Let's look at the general statistics of our girls:

 girls.info()  # info() prints its report itself, so no print is needed

It tells us that we have 604 girls at our disposal, each with 7 features: Month (type object), Year (type int64), and 5 more int64 features that we have already listed.
Then we learn more about the girls:

 print(girls.describe())

Oh, if only everything in life were this simple!
The interpreter lists the main statistical characteristics of the girls' features: the mean, minimum, and maximum values. Not bad already. From this we conclude that the average Playboy model measures 89-60-88 (as expected), with an average height of 168 cm and an average weight of 52 kg.
That height seems rather low. Apparently this is because the data is historical, reaching back to the middle of the twentieth century; nowadays 180 cm seems to be the standard for models.
Bust girth varies from 81 to 104 cm, waist from 46 to 89, hips from 61 to 99, height from 150 to 188 cm, and weight from 42 to 68 kg.
Aha, we can already suspect that an error has crept into the data. What kind of beer-barrel model has an 89 cm waist? And how can hips measure a mere 61 cm?

Let's see who these unique girls are:

 print(girls[['Month', 'Year']][girls['Waist'] == 89])  # the suspicious waist
 print(girls[['Month', 'Year']][girls['Hips'] == 61])   # the suspicious hips

These turn out to be the Playmates of the Month for December 1998 and January 2005, respectively. They are easy to find here. They are the triplets Nicole, Erica, and Jacqueline with the surname Dahm, all three "under one account", and Destiny Davis. It is easy to check that the triplets' waist is 25 inches (64 cm), not 89, and Destiny's hips are 86 cm, not 61.
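The article deliberately leaves these errors in the data: they make fine outlier candidates. If one did want to patch them, a minimal sketch could look like this (hypothetical; the pipeline below runs on the raw data):

 # Hypothetical fix, NOT applied in what follows
 girls.loc[girls['Waist'] == 89, 'Waist'] = 64  # Dahm triplets: 25 in = 64 cm
 girls.loc[girls['Hips'] == 61, 'Hips'] = 86    # Destiny Davis's actual hips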

For beauty's sake, one can also plot histograms of the distributions of the girls' parameters (for a change, these were made in R).
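If you would rather stay in Python, a rough pandas/matplotlib equivalent of those histograms might be (a sketch, not the article's R code):

 # One histogram panel per body-measurement feature
 girls[['Bust', 'Waist', 'Hips', 'Height', 'Weight']].hist(bins=20)
 plt.show()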




So, simply eyeballing the data can already reveal some oddities in it, provided, of course, that there is not too much data and the features can be interpreted in a human-readable way.

Data preprocessing


For training the model we keep only the numeric parameters, excluding the year. We write them into the NumPy array girl_params, converting to type float64 along the way. We then scale the data so that every feature has zero mean and unit variance (this is what sklearn's scale function does). This is important for many machine learning algorithms. Without going into details: scaling keeps a feature from receiving more weight merely because it has a larger range of values. For example, if we computed the Euclidean distance between people using the features "Age" and "Income", income would contribute far more to the metric only because it is measured, say, in thousands, while age is measured in tens.

 girl_params = np.array(girls.values[:, 2:], dtype="float64")
 girl_params = scale(girl_params)
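To convince yourself of what scale() actually did, you can check the column statistics (an extra check, not in the original article):

 # After scale() every column has (approximately) zero mean and unit std
 print(girl_params.mean(axis=0))  # ~0 for each feature
 print(girl_params.std(axis=0))   # ~1 for each feature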

Next, we extract the 2 principal components of the data so that it can be plotted. Here Scikit-learn's Principal Component Analysis (PCA) comes in handy. It also will not hurt to store the number of girls. In addition, we declare that we are looking for 1% of outliers in the data, that is, we limit ourselves to 6-7 "strange" girls. (Variables written in uppercase in Python conventionally denote constants and are usually placed at the beginning of the file, right after the imports.)

 X = PCA(n_components=2).fit_transform(girl_params)
 girls_num = X.shape[0]
 OUTLIER_FRACTION = 0.01
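Out of curiosity, one can also check how much of the total variance these two components retain by fitting the PCA object separately (an illustrative check, not in the original article):

 pca = PCA(n_components=2)
 X = pca.fit_transform(girl_params)     # same X as above
 print(pca.explained_variance_ratio_)   # share of variance kept by each component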

Model training


To detect "outliers" in the data, we use the one-class model of the reference vector machine. The theoretical work on this variation of the SVM began Alexey Yakovlevich Chervonenkis. As stated by "Yandex", now the development of methods for solving this problem takes the first place in the development of the theory of machine learning.
I will not explain here what SVMs and kernels are; plenty has been written about them already, for example on Habr (simpler) and on machinelearning.ru (more involved). I will only note that the One-Class SVM allows one, as the name implies, to recognize objects of a single class. Detecting anomalies in data is only a modest application of this idea. Nowadays, in the era of deep learning, one-class classification algorithms are used in attempts to teach a computer to "form a representation" of an object, the way a child, say, learns to tell a dog apart from everything else.

But back to the Scikit-learn implementation of the One-Class SVM, which is well documented on the Scikit-learn site.
We create a classifier instance with a Gaussian (RBF) kernel and "feed" it the data.

 clf = svm.OneClassSVM(kernel="rbf")
 clf.fit(X)
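As an aside, OneClassSVM can also target the expected share of outliers directly through its nu parameter, which upper-bounds the fraction of training errors; predict() then labels points as +1 or -1. A sketch of this alternative (not the approach the article takes):

 # Alternative, not used below: encode the expected outlier share via nu
 clf_nu = svm.OneClassSVM(kernel="rbf", nu=OUTLIER_FRACTION)
 clf_nu.fit(X)
 labels = clf_nu.predict(X)  # +1 = inlier, -1 = outlier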


Search for outliers


We create an array dist_to_border that stores the signed distances from the objects of the training sample X to the learned separating surface; then, having chosen a threshold, we build an array of indicators (True or False) of whether each object is a representative of the class rather than an outlier. The distance is positive if the object lies "inside" the region bounded by the separating surface (i.e., it is a representative of the class) and negative otherwise. The threshold is determined statistically: it is the distance below which only an OUTLIER_FRACTION share of the sample (in our case, one percent) falls, in other words, the 1st percentile of the distances to the separating surface.

 dist_to_border = clf.decision_function(X).ravel()
 threshold = stats.scoreatpercentile(dist_to_border, 100 * OUTLIER_FRACTION)
 is_inlier = dist_to_border > threshold
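A quick sanity check (illustrative, not in the original article): with OUTLIER_FRACTION = 0.01 and 604 girls, roughly 6-7 objects should end up below the threshold:

 print(np.sum(is_inlier == 0))  # expect about 6-7 flagged outliers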


Display and interpretation of results


Finally, we visualize what we got. I will not dwell on this step; anyone can figure out matplotlib on their own. This is reworked code from the Scikit-learn example "Outlier detection with several methods".

 # grid over the 2D principal-component plane
 xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
 n_inliers = int((1. - OUTLIER_FRACTION) * girls_num)
 n_outliers = int(OUTLIER_FRACTION * girls_num)
 # distance to the separating surface at every grid point
 Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
 Z = Z.reshape(xx.shape)
 plt.title("Outlier detection")
 # fill the "outlier" region in blue and draw the threshold contour in red
 plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)
 a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
 plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
 # white dots are outliers, black dots are inliers
 b = plt.scatter(X[is_inlier == 0, 0], X[is_inlier == 0, 1], c='white')
 c = plt.scatter(X[is_inlier == 1, 0], X[is_inlier == 1, 1], c='black')
 plt.axis('tight')
 plt.legend([a.collections[0], b, c],
            ['learned decision function', 'outliers', 'inliers'],
            prop=matplotlib.font_manager.FontProperties(size=11))
 plt.xlim((-7, 7))
 plt.ylim((-7, 7))
 plt.show()


We get this picture:


7 "emissions" are visible. To understand what kind of girls lurk under this unflattering “outliers,” let's look at them in the original data.

 print(girls[is_inlier == 0])

          Month  Year  Bust  Waist  Hips  Height  Weight
 54   September  1962    91     46    86     152      45
 67     October  1963    94     66    94     183      68
 79     October  1964   104     64    97     168      66
 173  September  1972    98     64    99     185      64
 483   December  1998    86     89    86     173      52
 507   December  2000    86     66    91     188      61
 535      April  2003    86     61    69     173      54

And now the most interesting part: interpreting the outliers we found.
Note that there are only 7 exhibits in our Kunstkammer (that is how well we chose OUTLIER_FRACTION), so we can go through each of them.

  1. Miki Winters. September, 1962. 91-46-86, height 152, weight 45.

     A 46 cm waist is, of course, something! And a 91 cm bust to go with it?
  2. Christine Williams. October, 1963. 94-66-94, height 183, weight 68.

     Not a small girl for those years. Quite the opposite of Miki Winters.
  3. Rosemary Hillcrest. October, 1964. 104-64-97, height 168, weight 66.

     Whoa, whoa, easy! An impressive lady.
  4. Susan Miller. September, 1972. 98-64-99, height 185, weight 64.

  5. The cute Dahm triplets. December, 1998. 86-89 (really 64)-86, height 173, weight 52.

     An example of a data error. It is also not quite clear how all three of them share one set of measurements.
  6. Kara Michel. December, 2000. 86-66-91, height 188, weight 61.

     Height 188 cm: taller than the author of this article. A clear "outlier" for such "historical" data.
  7. Carmella de Cesare. April, 2003. 86-61-69, height 173, weight 54.

     Probably here because of the hips.


Remarkably, the lady with the 61 cm hips, whom we suspected of differing sharply from the other girls, turns out to be quite normal in all her other parameters, and the SVM did not flag her as an "outlier".
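One can verify this directly by looking at her distance to the separating surface (an illustrative check on the raw, uncorrected data; not in the original article):

 # Default RangeIndex, so the index label equals the row position
 suspect = girls.index[girls['Hips'] == 61][0]
 print(dist_to_border[suspect] > threshold)  # True -> she is an inlier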

Conclusion


In closing, let me stress the importance of exploratory data analysis, "just looking at the data with your own eyes", and of course note that anomaly detection is also used in more serious tasks: in credit scoring to recognize unreliable clients, in security systems to detect potential weak points, in the analysis of banking transactions to catch fraudsters, and more. An interested reader will find many other algorithms for detecting anomalies and outliers in data, along with their applications.

Source: https://habr.com/ru/post/251225/

