Strangely enough, there are very few articles about knowledge mining (data mining) and clustering (one of the main tools used to extract knowledge), and when it comes to specific algorithms, only hard/soft k-means has been covered.
The article below describes the theory and an implementation (Python + matplotlib) of a little-known but extremely interesting hierarchical method that can be called the a-quasi-equivalence algorithm.
Knowledge Extraction and Clustering
Unfortunately, the author is no writer (especially not a Russian-speaking one), so for those unfamiliar with the terms in the subtitle I recommend reading the articles "Overview of the data clustering algorithms" and "Extraction of data or knowledge?" (both on the native Habré), or taking a look at the wiki (1, 2).
The task
We will divide and conquer points in two-dimensional space; the only features of the points are their coordinates. Since the method is hierarchical, the result will be a tree / list of possible partitions into clusters.
As a teaser, an example. We want to split the points in the image into clusters:
And how would an AI divide these points? Let's see what our algorithm comes up with.
Theory
The mathematics outlined below is taken from the book by A. A. Barsegyan, M. S. Kupriyanov, V. V. Stepanenko, and I. I. Kholod, Methods and Models for Data Analysis: OLAP and Data Mining (to be precise, from chapter 7.4, "Clustering data using fuzzy relations").
Most clustering algorithms (we will take k-means as the reference point) assess the similarity of samples by the distance between them. That is perfectly logical, but let us think more broadly: two samples are just as similar if their distance estimates to all the other samples are similar. What does this give us? We will be able to solve the task above, as well as other examples that k-means fails to cope with.
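To make the idea concrete: compare rows of the distance matrix rather than individual distances. A tiny illustration of the notion, not yet part of the algorithm itself:

import numpy as np

# Two close points and one far-away point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
d = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))

# Rows 0 and 1 agree almost element-by-element: the two points "see"
# the rest of the set the same way, hence they are similar.
print(d[0])
print(d[1])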
So, how is this strategy implemented?
- We have a set X with Q elements.
We construct a matrix of estimates of the distances between the elements; we will call these estimates the normal measures of similarity. Here d(x, y) is the Euclidean distance.
- We construct the relative measures of similarity. The resulting three-dimensional matrix shows, for every pair of elements, how similar that pair looks from the standpoint of each of the remaining elements.
- We construct the matrix of similarity measures of the elements of the set. The three-dimensional matrix collapses into a two-dimensional one that contains, for each pair of points, the worst of its similarity estimates.
- For the obtained matrix R, we construct its transitive closure. The closure operation is defined through an S-norm and a T-norm; in our case, S = max and T = min. A Python sketch of all four steps follows this list.
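A minimal sketch of the four steps, under two assumptions of mine (the book gives the exact formulas): the normal measure is taken as mu(x, y) = 1 - d(x, y) / d_max, and the relative measure as the ratio of the smaller to the larger of mu(x, z) and mu(y, z). The book's definitions may differ in detail, but the shape of the algorithm is the same:

import numpy as np

def normal_similarity(points):
    """Step 1: pairwise Euclidean distances turned into similarities.
    Assumption: mu(x, y) = 1 - d(x, y) / d_max."""
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return 1.0 - d / d.max()

def relative_similarity(mu):
    """Step 2: eps[i, j, k] says how alike elements i and j look from the
    standpoint of element k. Assumption: the ratio of the smaller to the
    larger of mu(i, k) and mu(j, k); two zero similarities count as alike."""
    lo = np.minimum(mu[:, None, :], mu[None, :, :])
    hi = np.maximum(mu[:, None, :], mu[None, :, :])
    return np.where(hi > 1e-12, lo / np.maximum(hi, 1e-12), 1.0)

def similarity_matrix(eps):
    """Step 3: keep the worst (smallest) estimate for each pair."""
    return eps.min(axis=2)

def transitive_closure(r):
    """Step 4: max-min transitive closure (S = max, T = min)."""
    while True:
        # max-min composition: (r o r)[i, j] = max_k min(r[i, k], r[k, j])
        composed = np.max(np.minimum(r[:, :, None], r[None, :, :]), axis=1)
        closed = np.maximum(r, composed)
        if np.allclose(closed, r):
            return closed
        r = closed

Note that for Q points the relative-measure tensor is Q x Q x Q, so this direct translation is for demonstration only, not for large data sets.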
At the end of the algorithm, for four points we obtain a matrix of the following kind:
|   | 1   | 2   | 3   | 4   |
| 1 | 1   | 0.5 | 0.3 | 0.3 |
| 2 | 0.5 | 1   | 0.3 | 0.3 |
| 3 | 0.3 | 0.3 | 1   | 0.6 |
| 4 | 0.3 | 0.3 | 0.6 | 1   |
The numbers in the matrix show the degree to which a pair of points belongs to the relation R; they are called levels of quasi-equivalence. Choosing a particular level (a) splits the set into equivalence classes that correspond to separate clusters. For instance, if we choose the level a = 0.4, then the pairs (1,4), (2,4), (1,3) and (2,3) no longer belong to the relation, since their levels are below the chosen one. In fact, this matrix can be read as the adjacency matrix of a graph, and at the level a = 0.4 the initial graph splits in two: (1, 2) and (3, 4).
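Turning a chosen level into a partition can be done exactly as described: threshold R at a and take the connected components of the resulting graph. A sketch of that step (the helper name clusters_at_level is my own choice):

import numpy as np

def clusters_at_level(r, alpha):
    """Alpha-cut: keep pairs with r >= alpha, then label the
    connected components of the resulting graph as clusters."""
    n = len(r)
    related = r >= alpha
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]  # depth-first search over the thresholded graph
        while stack:
            i = stack.pop()
            if labels[i] != -1:
                continue
            labels[i] = current
            stack.extend(j for j in range(n) if related[i, j] and labels[j] == -1)
        current += 1
    return labels

# The matrix from the example above: level 0.4 yields clusters (1, 2) and (3, 4).
r = np.array([[1.0, 0.5, 0.3, 0.3],
              [0.5, 1.0, 0.3, 0.3],
              [0.3, 0.3, 1.0, 0.6],
              [0.3, 0.3, 0.6, 1.0]])
print(clusters_at_level(r, 0.4))  # -> [0, 0, 1, 1]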
In some scientific papers, the description of the algorithm is accompanied by a flowchart:
Let's try to unpack it. The letters "mu" and "epsilon" and the T-norm are clear enough: they appear in the formulas above. A-tolerance and a-quasi-equivalence are properties that the relation acquires after the T-norm and the described closure are applied. After that comes the partition into clusters, which can be carried out entirely heuristically (for example, with graph theory).
An attentive reader will ask: where are the fuzzy relations from the title of the book chapter? They are used to prove that the described steps actually lead to a solution.
Practice
Instead of describing how the formulas translate into code, I'll simply show how it works. Here are the partition options the algorithm suggests for the example from the beginning of the article:
Quite unexpected, you must agree, and yet logical at the same time.
Well, for random data:
Source code: algorithm, drawing.
The code is demonstration-only and does not claim to be efficient or PEP 8 compliant.
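For completeness, here is how the sketches from the Theory section can be chained into a single run (again demonstration-only; all helper names come from the sketches above, not from the linked files):

import numpy as np

np.random.seed(42)
points = np.random.rand(20, 2)  # 20 random points in the unit square

mu = normal_similarity(points)
r = transitive_closure(similarity_matrix(relative_similarity(mu)))

# Every distinct off-diagonal value of R is a candidate level a;
# sweeping them from high to low yields the hierarchy of partitions.
levels = np.unique(r[np.triu_indices(len(r), k=1)])[::-1]
for alpha in levels:
    print(f"{alpha:.3f}", clusters_at_level(r, alpha))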
Conclusion
I have ideas about applying the described algorithm to a real problem: analyzing real-estate offers. If there are interesting results, I will share them.
References:
- the book whose theory was used here: Barsegyan A. A., Kupriyanov M. S., Stepanenko V. V., Kholod I. I., Methods and Models for Data Analysis: OLAP and Data Mining;
- another data mining book by the same authors: Data Analysis Technologies: Data Mining, Visual Mining, Text Mining, OLAP;
- a most interesting article in which a classification problem is solved on a similar example: Classification of Data by the Support Vector Method.
I would be grateful for any comments via PM.