
Cluster analysis (using consumer segmentation as an example), Part 1

We know that the Earth is one of 8 planets that revolve around the Sun. The Sun is just one of about 200 billion stars in the Milky Way galaxy. A number that large is hard to grasp. From it, one can estimate the number of stars in the universe at approximately 4 × 10^22. We can see about a million stars in the sky, although this is only a small fraction of the total. This raises two questions:
  1. What is a galaxy?
  2. What is the relationship between galaxies and the topic of this article (cluster analysis)?

image

A galaxy is a cluster of stars, gas, dust, planets and interstellar clouds. Galaxies usually have a spiral or elliptical shape. In space, galaxies are separated from one another. Huge black holes most often lie at the centers of galaxies.

As we will discuss in the next section, galaxies and cluster analysis have much in common. Galaxies exist in three-dimensional space, whereas cluster analysis is a multidimensional analysis carried out in n-dimensional space.

Note: The black hole is the center of the galaxy. We will use a similar idea, centroids, in cluster analysis.

Cluster analysis


Suppose you are the head of marketing and customer relations at a telecommunications company. You understand that consumers differ, and that attracting different consumers requires different strategies. You will appreciate how powerful customer segmentation is as a tool for cost optimization. To refresh your knowledge of cluster analysis, consider the following example of 8 consumers and the average duration of their calls (local and international). The data are shown below:

image

To make the data easier to read, we plot a chart with the average duration of international calls on the x-axis and the average duration of local calls on the y-axis. The chart is below:

image

Note: This is similar to analyzing the stars in the night sky (here the stars are replaced by consumers). Instead of three-dimensional space, we have a two-dimensional one, with the durations of international and local calls as the x and y axes.
In terms of galaxies, the problem can be stated as follows: find the positions of the black holes, which in cluster analysis are called centroids. To find the centroids, we begin by taking arbitrary points as their initial positions.

Euclidean Distance to Find Centroids for Clusters


In our case, we arbitrarily place two centroids (C1 and C2) at the points (1, 1) and (3, 4). Why two centroids? A visual inspection of the chart suggests that there are two clusters to analyze. However, as we will see later, this question is not so easy to answer for a large data set.
Next, we measure the distance between each centroid (C1 and C2) and every point on the chart, using the Euclidean formula for the distance between two points.

image
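For reference, the standard Euclidean distance, which the figure above presumably shows, can be written as follows. The symbols P, Q, p_i and q_i are notation chosen here for illustration and are not taken from the original figure.

```latex
d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
\quad \text{for } P = (x_1, y_1),\; Q = (x_2, y_2),
\qquad
d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
\quad \text{in } n \text{ dimensions.}
```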

Note: The distance can also be calculated using other formulas, for example:
  1. squared Euclidean distance - to give more weight to more distant objects
  2. Manhattan distance - to reduce the influence of outliers
  3. power distance - to increase or decrease the influence of individual coordinates
  4. percent disagreement - for categorical data
  5. and so on.
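The alternative distances listed above can be sketched in Python as follows. This is a minimal illustration using their commonly cited definitions; the function names and the `power` and `root` parameters are assumptions made for this example, not something the article specifies.

```python
def squared_euclidean(p, q):
    # Sum of squared coordinate differences (no square root),
    # so large differences weigh more heavily.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def manhattan(p, q):
    # Sum of absolute coordinate differences; a single large
    # difference is not squared, which dampens outliers.
    return sum(abs(a - b) for a, b in zip(p, q))


def power_distance(p, q, power=2, root=2):
    # Assumed form: (sum |a - b|^power)^(1/root).
    # With power = root = 2 this is the ordinary Euclidean distance.
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1 / root)


def percent_disagreement(p, q):
    # Share of coordinates whose (categorical) values differ.
    return sum(a != b for a, b in zip(p, q)) / len(p)


print(squared_euclidean((2, 2), (3, 4)))
print(manhattan((2, 2), (3, 4)))
print(percent_disagreement(("a", "b", "c", "d"), ("a", "x", "c", "y")))
```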

Columns 3 and 4 (distance from C1 and from C2) contain the distances calculated with this formula. For example, for the first consumer:

image

Cluster membership (last column) is determined by proximity to the centroids (C1 and C2). The first consumer is closer to centroid C1 (1.41 compared to 2.24) and therefore belongs to the cluster with centroid C1.
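The assignment step for the first consumer can be reproduced with a short Python sketch. The coordinates (2, 2) for consumer 1 are an assumption inferred from the distances quoted in the text (1.41 and 2.24), since the data table is only available as an image; the function names are illustrative.

```python
import math


def euclidean(p, q):
    # Euclidean distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_centroid(point, centroids):
    # Return the index of the closest centroid and all distances.
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances)), distances


centroids = [(1, 1), (3, 4)]   # C1 and C2, as chosen in the text
consumer_1 = (2, 2)            # assumed coordinates, inferred from the text

index, distances = nearest_centroid(consumer_1, centroids)
print([round(d, 2) for d in distances])  # distances to C1 and C2
print(f"assigned to C{index + 1}")
```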

image

Below is a chart showing the centroids C1 and C2 (drawn as blue and orange diamonds). Each consumer is drawn in the color of the centroid whose cluster it was assigned to.

image

Since we chose the centroids arbitrarily, the second step is to refine this choice iteratively. The new position of each centroid is the average of the points in its cluster. For example, the cluster of the first centroid consists of consumers 1, 2, and 3, so the new x coordinate of C1 is the average of their x coordinates: (2 + 1 + 1) / 3 = 1.33. This gives the new coordinates C1 (1.33, 2.33) and C2 (4.4, 4.2). The new chart is below:
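The centroid update is just a coordinate-wise mean, which can be sketched as below. The x coordinates (2, 1, 1) come from the text; the y coordinates of consumers 2 and 3 are hypothetical values chosen only so that their average matches the quoted 2.33, because the original data table is not available here.

```python
def mean_point(points):
    # Coordinate-wise average of a list of points.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Consumers 1, 2 and 3. The y values of consumers 2 and 3 are hypothetical.
cluster_1 = [(2, 2), (1, 2), (1, 3)]

new_c1 = mean_point(cluster_1)
print(tuple(round(v, 2) for v in new_c1))
```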

image

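Repeating these two steps (reassigning points to the nearest centroid, then recomputing the averages) until the centroids stop moving places each centroid at the center of its cluster. The chart is below: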

image

The positions of our black holes (cluster centers) in our example are C1 (1.75, 2.25) and C2 (4.75, 4.75). The two clusters above are similar to two galaxies separated in space from each other.
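The whole procedure (assign each point to the nearest centroid, move each centroid to the mean of its cluster, repeat until nothing changes) can be sketched as a small loop. The eight points below are hypothetical and are not the article's data, so the resulting centroids differ from the ones quoted above; only the starting centroids (1, 1) and (3, 4) are taken from the text. The function names and the `max_iterations` parameter are illustrative.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [euclidean(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of four consumers.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (5, 6), (6, 5), (6, 6)]
final_centroids, final_clusters = k_means(points, [(1, 1), (3, 4)])
print(final_centroids)
```

Starting from arbitrary centroids is the simplest choice and matches the article; on larger data sets the result can depend on that starting point.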

Let us look at another example. Suppose we need to segment consumers by two parameters: age and income. We have 2 consumers aged 37 and 44, with incomes of $90,000 and $62,000, respectively. If we measure the Euclidean distance between the points (37, 90000) and (44, 62000), we see that the income variable "dominates" the age variable: the distance is determined almost entirely by the change in income. We need a strategy to deal with this, otherwise our analysis will give an incorrect result. The solution is to bring our values to comparable scales, that is, to normalize them.
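A quick calculation makes the scale problem concrete. The sketch below uses the two consumers from the text; the variable names are illustrative.

```python
import math

consumer_a = (37, 90_000)  # (age, income)
consumer_b = (44, 62_000)

age_diff = consumer_a[0] - consumer_b[0]
income_diff = consumer_a[1] - consumer_b[1]

distance = math.sqrt(age_diff ** 2 + income_diff ** 2)
# Share of the squared distance that comes from income alone.
income_share = income_diff ** 2 / (age_diff ** 2 + income_diff ** 2)

print(round(distance, 3))
print(income_share)
```

The distance is practically equal to the income difference of 28,000; the 7-year age gap has almost no effect on it.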

Data normalization


There are many approaches to data normalization. One of them is min-max normalization, which uses the following formula:
image
Here X* is the normalized value, and min and max are the minimum and maximum values of the variable X over the entire data set.
(Note that this formula maps all coordinates onto the interval [0, 1].)
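Written out, min-max normalization is usually expressed as below. This is the standard form of the formula, offered as a reconstruction of what the figure presumably shows, using the symbols X*, min and max from the text.

```latex
X^{*} = \frac{X - \min(X)}{\max(X) - \min(X)}
```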
Returning to our example, let the maximum income be $130,000 and the minimum $45,000. The normalized income for consumer A (income $90,000) is:

image

We repeat this for every point and every variable (coordinate). The income of the second consumer ($62,000) becomes 0.2 after normalization. Likewise, let the minimum and maximum ages be 23 and 58, respectively. After normalization, the ages of our two consumers are 0.4 and 0.6.
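These numbers can be checked with a few lines of Python. The sketch assumes the standard min-max formula, (value - min) / (max - min); the function name `min_max` is illustrative.

```python
def min_max(value, minimum, maximum):
    # Rescale a value to the interval [0, 1].
    return (value - minimum) / (maximum - minimum)


# Incomes of the two consumers, with min $45,000 and max $130,000.
incomes = [min_max(x, 45_000, 130_000) for x in (90_000, 62_000)]
# Ages of the two consumers, with min 23 and max 58.
ages = [min_max(x, 23, 58) for x in (37, 44)]

print([round(v, 3) for v in incomes])
print([round(v, 3) for v in ages])
```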

It is easy to see that all our data now lie between 0 and 1, so the variables are on comparable scales.

Remember: normalization must be performed before cluster analysis.

Article found kuznetsovin

Source: https://habr.com/ru/post/228477/

