
5 main sampling algorithms


Working with data means working with data-processing algorithms.


I have had to work with a wide variety of them on a daily basis, so I decided to compile a list of the most popular ones in this series of posts.


This article focuses on the most common sampling methods for working with data.





Simple random sampling


Suppose you want to draw a sample in which every element has an equal probability of being selected.


Below we select 100 such elements from the dataset.


# df is an existing pandas DataFrame; draw 100 rows uniformly at random
sample_df = df.sample(100)
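
As a fully self-contained sketch (the DataFrame below is synthetic, purely for illustration):

import pandas as pd
import numpy as np

# hypothetical dataset: 1,000 rows with a single feature column
df = pd.DataFrame({'value': np.random.randn(1000)})

# every row has an equal chance of being drawn; sampling is without replacement
sample_df = df.sample(100, random_state=42)
print(len(sample_df))  # 100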



Stratified sampling



Suppose we need to estimate the average number of votes for each candidate in the election. Voting takes place in three cities:


1 million workers live in city A


2 million artists live in city B


3 million senior citizens live in city C


If we draw a simple random sample of 60 people from the entire population, it will likely be unbalanced across the cities, and therefore biased, leading to serious errors in our estimates.


If instead we deliberately sample 10, 20, and 30 people from cities A, B, and C respectively, the sample mirrors the population proportions and the error will be minimal.


In Python, this can be done like this:


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25
)
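
train_test_split stratifies by the label y. If instead you need a stratified sample of a raw table, pandas (version 1.1+) can sample each stratum directly; a minimal sketch, assuming a hypothetical population table with a 'city' column:

import pandas as pd

# hypothetical population: 1,000 / 2,000 / 3,000 residents per city
population = pd.DataFrame({
    'city': ['A'] * 1000 + ['B'] * 2000 + ['C'] * 3000
})

# draw the same fraction from each stratum, preserving the proportions
sample = population.groupby('city').sample(frac=0.01, random_state=0)
print(sample['city'].value_counts())  # C: 30, B: 20, A: 10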



Reservoir sampling



I like this formulation of the problem:


Suppose you have a stream of elements of large unknown size that you can iterate only once.


Create an algorithm that randomly selects an element from the stream, such that every element has an equal probability of being selected.


How to do it?


Suppose we need to select 5 objects from an infinite stream, so that each element in the stream is selected with equal probability.


import random

def generator(max):
    number = 1
    while number < max:
        number += 1
        yield number

# create the stream
stream = generator(10000)

# reservoir size
k = 5
reservoir = []

for i, element in enumerate(stream):
    if i + 1 <= k:
        reservoir.append(element)
    else:
        probability = k / (i + 1)
        if random.random() < probability:
            # the element is selected: it replaces a random
            # element already in the reservoir
            reservoir[random.choice(range(0, k))] = element

print(reservoir)
# Output: [1369, 4108, 9986, 828, 5589]

It can be proved mathematically that each element is selected with equal probability.


How?


When it comes to mathematics, it is best to start the solution with a small special case.


So let's look at a stream consisting of 3 elements, where we need to select only 2.


We see the first element and save it in the list, since there is still room in the reservoir. We see the second element and save it in the list, since there is still room in the reservoir.


We see the third element. Here it gets more interesting: we save the third element with probability 2/3.


Let's now see the final probability of the first element being saved:


The probability that the first element is displaced from the reservoir equals the probability that the third element is selected, multiplied by the probability that the first element is the one of the two chosen for replacement. That is:


2/3 * 1/2 = 1/3


So the final probability that the first element is kept is:


1 - 1/3 = 2/3


Exactly the same logic applies to the second element, and the argument extends by induction to longer streams and larger reservoirs.


That is, each element is kept with probability 2/3, or k/n in the general case.
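
The k/n claim is also easy to check empirically. A quick sketch that repeats reservoir sampling 10,000 times over a 10-element stream with k = 2 and counts how often each element survives (each share should come out near k/n = 0.2):

import random
from collections import Counter

def reservoir_sample(stream, k):
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)
        elif random.random() < k / (i + 1):
            # replace a uniformly chosen occupant of the reservoir
            reservoir[random.randrange(k)] = element
    return reservoir

trials = 10_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(10), k=2))

# every element should appear in roughly 20% of trials
for element in range(10):
    print(element, counts[element] / trials)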




Random undersampling and oversampling



Imbalanced datasets come up all too often in practice.


The method widely used in this case is called resampling. Its essence lies in removing elements from the over-represented class (undersampling) and/or adding more elements to the under-represented class (oversampling).


Let's start by creating an imbalanced dataset.


from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(
    n_classes=2, class_sep=1.5,
    weights=[0.9, 0.1],          # 90% / 10% class imbalance
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=100, random_state=10
)

X = pd.DataFrame(X)
X['target'] = y

Now we can perform random undersampling and oversampling like this:


num_0 = len(X[X['target'] == 0])
num_1 = len(X[X['target'] == 1])
print(num_0, num_1)

# random undersampling: shrink the majority class to the minority size
undersampled_data = pd.concat([
    X[X['target'] == 0].sample(num_1),
    X[X['target'] == 1]
])
print(len(undersampled_data))

# random oversampling: grow the minority class to the majority size
oversampled_data = pd.concat([
    X[X['target'] == 0],
    X[X['target'] == 1].sample(num_0, replace=True)
])
print(len(oversampled_data))

# Output:
# 90 10
# 20
# 180
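
Note that the oversampling branch samples with replacement (replace=True): we are drawing 90 rows from a class that contains only 10, so each minority row necessarily appears multiple times.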



Undersampling and oversampling using imbalanced-learn


imbalanced-learn (imblearn) is a Python library for dealing with imbalanced datasets.


It contains several different methods for resampling.


a. Undersampling with Tomek links:


One of the methods it provides is based on Tomek links. A "link" here is a pair of elements from different classes that are each other's nearest neighbors.


The algorithm removes the element of each such pair that belongs to the larger class, which allows the classifier to learn a cleaner boundary.




from imblearn.under_sampling import TomekLinks

# in recent versions of imblearn, 'ratio' has become 'sampling_strategy'
# and 'fit_sample' has become 'fit_resample'
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
# indices of the rows that were kept are available as tl.sample_indices_

b. Oversampling with SMOTE:


SMOTE (Synthetic Minority Oversampling Technique) creates new elements of the minority class in close proximity to existing ones.


from imblearn.over_sampling import SMOTE

# 'sampling_strategy' replaces the older 'ratio' argument
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)

imblearn also offers other undersampling methods (Cluster Centroids, NearMiss, etc.) and oversampling methods (ADASYN and borderline-SMOTE) that can be useful as well.




Conclusion


Algorithms are the lifeblood of data science.


Sampling is one of the most important areas of working with data, and the above is only a superficial overview.


A well-chosen sampling strategy can carry an entire project; a poorly chosen one leads to erroneous results. So choose wisely.



Source: https://habr.com/ru/post/461285/

