
Clustering: k-means and c-means algorithms

Good day!

As promised, I am continuing my series of publications on Data Mining. Today I want to talk about two clustering algorithms (k-means and c-means), describe their advantages and disadvantages, and give some recommendations on using them. So let's go...

Clustering is the division of a set of input vectors into groups (clusters) according to the degree of "similarity" to each other.
Clustering in Data Mining becomes valuable when it serves as one of the stages of data analysis, part of building a complete analytical solution. It is often easier for an analyst to identify groups of similar objects, study their features, and build a separate model for each group than to create one common model for all the data. This technique is constantly used in marketing: identifying groups of customers and products and developing a separate strategy for each of them (Wikipedia).


Distance measures


In order to compare two objects, it is necessary to have a criterion on the basis of which the comparison will take place. As a rule, such a criterion is the distance between objects.

There are many measures of distance, consider a few of them:

Euclidean distance is the most common distance. It is a geometric distance in multidimensional space.

Squared Euclidean distance. The standard Euclidean distance is sometimes squared in order to give greater weight to objects that are farther apart from each other.

City-block (Manhattan) distance. This distance is simply the sum of the absolute differences in the coordinates. In most cases, this measure leads to the same results as the usual Euclidean distance. Note, however, that for this measure the influence of individual large differences (outliers) is reduced, since they are not squared.

Chebyshev distance. This distance is the maximum difference over any single coordinate. It can be useful when you want to define two objects as "different" if they differ strongly along any one coordinate (in any one dimension).

Power distance. Sometimes one wishes to progressively increase or decrease the weight attached to a dimension along which the corresponding objects differ greatly. This can be achieved with the power distance.
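The formulas behind these measures are simple enough to state as code. Here is a minimal sketch in Python (function names are mine; for the power distance, p and r are the usual parameters: p controls the weight of individual coordinate differences, r the weight of large overall distances):

```python
def euclidean(a, b):
    # Geometric distance in multidimensional space.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def squared_euclidean(a, b):
    # Gives progressively greater weight to objects that are farther apart.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences;
    # large individual differences (outliers) are not amplified by squaring.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # The largest difference along any single coordinate.
    return max(abs(x - y) for x, y in zip(a, b))

def power_distance(a, b, p=2, r=2):
    # With p == r this is the Minkowski distance; p = r = 2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / r)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
print(chebyshev((0, 0), (3, 4)))   # 4
```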

The choice of distance measure (the similarity criterion) rests entirely with the researcher. With different measures, the clustering results may differ significantly.

The k-means algorithm


The simplest, but at the same time rather inaccurate, clustering method in its classical implementation. It splits a set of elements of a vector space into a previously known number of clusters k. The algorithm seeks to minimize the total squared deviation of cluster points from the centers of those clusters. The basic idea is that at each iteration the center of mass is recalculated for every cluster obtained at the previous step, and then the vectors are re-partitioned into clusters according to which of the new centers is closest in the chosen metric. The algorithm terminates when the cluster assignments stop changing from one iteration to the next.
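To make the loop concrete, here is a minimal sketch in Python (random initialization and squared Euclidean distance, as in the classic formulation; the sample points at the end are made up for illustration):

```python
import random

def k_means(points, k, max_iter=100):
    # Start from k randomly chosen points as the initial centers.
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sum((a - b) ** 2
                                                for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: recompute each center as the mean (center of mass)
        # of its cluster; an emptied cluster keeps its old center.
        new_centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # Stop when no center moves, i.e. the partition no longer changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = k_means(points, k=2)
```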

Problems of the k-means algorithm:
* The number of clusters must be known in advance. I proposed a method for determining the number of clusters that was based on finding clusters distributed according to a certain law (in my case, everything came down to the normal law). After that, the classic k-means algorithm was run, which gave more accurate results.
* The algorithm is very sensitive to the choice of the initial cluster centers. The classic version implies choosing them at random, which very often was a source of error. As a solution, the objects should be studied beforehand so that the initial centers can be determined more accurately; in my case, it is proposed to take the most mutually distant points as the initial centers (see the sketch after this list).
* It cannot cope with cases where an object belongs to different clusters to an equal degree, or does not belong to any of them.
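One greedy reading of the initialization idea from the second item is the "farthest-first" scheme sketched below (my interpretation for illustration, not necessarily the exact procedure described above):

```python
def farthest_point_init(points, k):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Greedy farthest-first traversal: start anywhere, then repeatedly
    # pick the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers
```

The resulting centers are deterministic (given the starting point) and well spread out, which removes the run-to-run variance of purely random initialization.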

Materials on the topic:
* Wikipedia - K-means
* Introduction to K-means
* Description of the kmeans function in the Matlab Statistics Toolbox
* K-means - Interactive demo (Java)

The fuzzy c-means clustering algorithm


The c-means algorithm successfully handles this last problem of k-means. Instead of a definite answer to the question of which cluster an object belongs to, it determines the object's degree of membership in each cluster. Thus, the statement "object A belongs to cluster 1 with a degree of 90% and to cluster 2 with a degree of 10%" is valid and more convenient.
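Below is a minimal sketch of the standard fuzzy c-means updates in Python (the fuzzifier m = 2 is the common default; variable names are mine, and this is an illustration rather than a reference implementation):

```python
import random

def fuzzy_c_means(points, c, m=2.0, max_iter=100, eps=1e-5):
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = [[random.random() for _ in range(c)] for _ in range(n)]
    u = [[v / sum(row) for v in row] for row in u]
    for _ in range(max_iter):
        # Center update: weighted mean, with memberships raised to m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            centers.append(tuple(sum(w[i] * points[i][d] for i in range(n)) / sum(w)
                                 for d in range(dim)))
        # Membership update: the closer a point is to a center, relative to
        # the other centers, the higher its membership in that cluster.
        new_u = []
        for i in range(n):
            dist = [max(sum((a - b) ** 2 for a, b in zip(points[i], centers[j])) ** 0.5,
                        1e-12) for j in range(c)]
            new_u.append([1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c)) for j in range(c)])
        # Stop when the memberships no longer change noticeably.
        shift = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(c))
        u = new_u
        if shift < eps:
            break
    return centers, u

# A butterfly-like set: two symmetric "wings" and the point (3, 2) in the middle.
pts = [(1, 1), (1, 3), (2, 2), (3, 2), (4, 2), (5, 1), (5, 3)]
centers, u = fuzzy_c_means(pts, c=2)
# The membership row for (3, 2) comes out close to (0.5, 0.5).
```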

A classic c-means example is the so-called "butterfly":

[figure: the "butterfly" point set]

As can be seen, the point with coordinates (3, 2) belongs equally to both the first and the second cluster.

The remaining problems of c-means are the same as for k-means, but they are mitigated by the fuzziness of the partitioning.

Related Links:
* Formal description of the algorithm and an implementation in C#
* Fuzzy c-means clustering
* Fuzzy C-means cluster analysis

P.S. I have not described the mathematical principles of the algorithms; they can easily be found at the links above.

Thanks for your attention!

Source: https://habr.com/ru/post/67078/

