Clustering is a special case of learning without a teacher (unsupervised learning): the task is to split a given set of data objects into subsets so that the elements of each subset differ significantly, with respect to some set of properties, from the elements of all other subsets. A data object is usually treated as a point in a multidimensional metric space, each dimension of which corresponds to a certain property (attribute) of the object, and the metric is a function of the values of these properties. The choice of clustering algorithm and metric depends on the types of the dimensions of this space, which can be both numerical and categorical; this choice is dictated by the different nature of the different attribute types.
This article gives a brief overview of clustering methods for numerical data spaces. It will be useful to those who are just starting to study Data Mining and cluster analysis, and it will help them find their bearings among the variety of modern clustering algorithms and form a general idea of them. The article does not claim to be a complete treatment of the material; on the contrary, the descriptions of the algorithms are kept as simple as possible. For a more detailed study of any algorithm, it is recommended to consult the publication in which it was originally presented (see the list of references at the end of the article).
Partitioning methods
The best-known representatives of this family are the k-means [1] and k-medoids [2] algorithms. They take an input parameter k and divide the data space into k clusters such that the similarity between objects within one cluster is maximal, while the similarity between objects of different clusters is minimal. Similarity is measured with respect to a certain cluster center, as the distance from the object under consideration to that center. The main difference between the two methods lies in how the cluster center is defined.
In the k-means algorithm, similarity is measured with respect to the cluster's center of mass, the mean of the coordinates of the cluster's objects in the data space. First, k objects are chosen at random; each of them is a cluster prototype and represents its center of mass. Each of the remaining objects is then assigned to the cluster it is most similar to, after which the center of mass of each cluster is recalculated. For each resulting partition an evaluation function is computed, whose values at each step form a convergent series. The process continues until this series converges to its limit; in other words, moving objects between clusters stops when the clusters no longer change from one iteration to the next. Minimizing the evaluation function makes the resulting clusters as compact and as well separated as possible. The k-means method works well when the clusters are compact "clouds" that are well separated from one another. It is efficient for processing large amounts of data, but it cannot detect clusters of non-convex shape or of very different sizes. Moreover, the method is very sensitive to noise and isolated points, since even a small number of such points can significantly shift the computed center of mass of a cluster.
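As an illustration, below is a minimal sketch of the k-means loop described above, written in Python with NumPy; the function name, the synthetic two-cloud data, and the choice of k are purely illustrative assumptions, not part of any particular library.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """A bare-bones k-means loop: assign each object to the nearest center,
    then recompute each center as the center of mass of its cluster."""
    rng = np.random.default_rng(seed)
    # k objects chosen at random serve as the initial cluster prototypes
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # distance from every object to every center (Euclidean metric)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the center of mass of each cluster (keep the old one if a cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: clusters no longer change
        centers = new_centers
    return labels, centers

# toy usage on two well-separated "clouds"
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```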
To reduce the influence of noise and isolated points on the clustering result, the k-medoids algorithm, unlike k-means, represents the center of a cluster not by its center of mass but by a representative object: one of the cluster's own objects. As in k-means, k representative objects are chosen at random, and each of the remaining objects is assigned to the cluster of the nearest representative object. Then each representative object is iteratively tested by replacing it with an arbitrary non-representative object of the data space. The replacement process continues as long as the quality of the resulting clusters improves. Clustering quality is measured by the sum of deviations between each object and the representative object of its cluster, which the method seeks to minimize. Iterations therefore continue until the representative object of each cluster becomes the medoid, the object closest to the cluster's center. The algorithm scales poorly to large amounts of data, a problem addressed by the CLARANS algorithm [3], which extends the k-medoids method. For clustering multidimensional spaces, the PROCLUS algorithm [4] was built on top of CLARANS.
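A naive sketch of the k-medoids idea (in the spirit of PAM) is shown below; the exhaustive swap loop, the precomputed distance matrix, and the function name kmedoids are illustrative assumptions and are not meant to reproduce CLARANS or PROCLUS.

```python
import numpy as np

def kmedoids(X, k, max_iter=100, seed=0):
    """Naive PAM-style loop: try swapping each medoid with each non-medoid
    and keep the swap if the total deviation decreases."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # sum of deviations: each object is charged its distance to the nearest medoid
        return dist[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):          # for every representative object...
            for h in range(n):      # ...try every non-representative object
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                c = cost(trial)
                if c < best:        # keep the swap only if quality improves
                    best, medoids, improved = c, trial, True
        if not improved:
            break                   # each medoid is now the object closest to its cluster center
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids
```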
Hierarchical methods
The general idea of this group of methods is a successive hierarchical decomposition of the set of objects. Depending on the direction in which the hierarchy is built, divisive and agglomerative methods are distinguished. In an agglomerative (bottom-up) method, the decomposition starts with each object forming its own cluster; then, at each iteration, pairs of nearby clusters are merged into a single cluster. Iterations continue until all objects have been merged into one cluster or until a certain stopping condition is met. A divisive (top-down) method, on the contrary, starts with all objects combined in a single cluster, which is split into smaller clusters at each iteration until every object ends up in its own cluster or the stopping condition is satisfied. The stopping condition can be a threshold on the number of clusters to obtain, but more commonly a threshold on the distance between clusters is used.
The main problem of hierarchical methods is the difficulty of choosing the stopping condition so that "natural" clusters are isolated without being split apart. Another problem is the choice of the point at which clusters are split or merged. This choice is critical because, after splitting or merging clusters at a given step, the method operates only on the newly formed clusters; a wrong choice of a merge or split point at any step can therefore lead to poor-quality clustering. In addition, hierarchical methods cannot be applied to large data sets, because deciding whether to split or merge clusters requires analyzing a large number of objects and clusters, which makes the methods computationally expensive. Examples of algorithms based on the hierarchical approach are BIRCH [5] and CHAMELEON [6].
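For a quick impression of agglomerative clustering in practice, here is a short example using SciPy's hierarchical clustering routines; the synthetic data, the average-linkage choice, and the threshold values are illustrative assumptions, and the example does not implement BIRCH or CHAMELEON.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic data: two well-separated "clouds"
X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 6])

# agglomerative (bottom-up) clustering: nearby clusters are merged step by step;
# Z encodes the full merge hierarchy
Z = linkage(X, method="average")

# stopping condition as a threshold on the distance between clusters
labels_by_distance = fcluster(Z, t=3.0, criterion="distance")

# alternative stopping condition: the desired number of clusters
labels_by_count = fcluster(Z, t=2, criterion="maxclust")
```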
Density-based methods
These methods treat clusters as regions of the data space with a high density of objects, separated by regions where the density of objects is low.
The DBSCAN algorithm [7] was one of the first density-based clustering algorithms. The algorithm relies on several definitions:
- The ε-neighborhood of an object is the neighborhood of radius ε around that object.
- A core object is an object whose ε-neighborhood contains at least a minimum number MinPts of objects.
- An object p is directly density-reachable from an object q if p lies in the ε-neighborhood of q and q is a core object.
- An object p is density-reachable from an object q for given ε and MinPts if there is a sequence of objects p1, ..., pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi for 1 ≤ i < n.
- An object p is density-connected to an object q for given ε and MinPts if there exists an object o such that both p and q are density-reachable from o.
To find clusters, the DBSCAN algorithm examines the ε-neighborhood of every object. If the ε-neighborhood of an object p contains more than MinPts points, a new cluster is created with p as a core object. DBSCAN then iteratively collects the objects that are directly density-reachable from core objects, which may merge several density-reachable clusters. The process ends when no new object can be added to any cluster.
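The sketch below shows how these definitions map onto parameters in practice, using the DBSCAN implementation from scikit-learn; the synthetic data and the particular values of eps and min_samples are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense "clouds" plus a few scattered noise points
X = np.vstack([
    rng.normal(0, 0.3, size=(100, 2)),
    rng.normal(5, 0.3, size=(100, 2)),
    rng.uniform(-2, 7, size=(10, 2)),
])

# eps corresponds to the radius ε of the neighborhood,
# min_samples to MinPts, the minimum number of objects in it
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# objects that end up in no dense region are marked as noise with label -1
print("clusters:", set(labels) - {-1}, "noise points:", int((labels == -1).sum()))
```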
Although, unlike the partitioning methods, DBSCAN does not require the number of clusters to be specified in advance, it does require values for the parameters ε and MinPts, which directly affect the clustering result. Optimal values of these parameters are hard to determine, especially for multidimensional data spaces. Moreover, the distribution of data in such spaces is often skewed, which makes it impossible to use global density parameters for clustering them. For clustering multidimensional data spaces, the SUBCLU algorithm [8] was created on the basis of DBSCAN.
Grid-based methods
The general idea of these methods is that the object space is divided into a finite number of cells forming a grid structure, within which all clustering operations are performed. The main advantage of this group of methods is their short running time, which usually does not depend on the number of data objects but only on the number of cells along each dimension of the space.
The CLIQUE algorithm [9], adapted for clustering high-dimensional data, is one of the classic grid-based algorithms. The method rests on the assumption that if the distribution of objects in a multidimensional data space is not uniform, i.e. there are regions of high and low density, then the projection of a high-density region onto a subspace of lower dimension will be part of a high-density region in that subspace. CLIQUE clusters a multidimensional data space as follows: the data space is divided into non-overlapping cells of fixed size, and among them the dense cells are identified, i.e. those in which the density of data objects exceeds a given threshold. From the dense cells found, the candidate space in which dense cells of higher dimension can exist is then formed. The process starts with one-dimensional subspaces (the described procedure is performed for each dimension) and then moves on to subspaces of higher dimension.
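To make the first step concrete, here is a small sketch of identifying dense cells on a fixed grid with NumPy; it covers only the dense-cell detection, not the bottom-up construction of higher-dimensional subspaces, and the grid resolution and density threshold are illustrative assumptions.

```python
import numpy as np

def dense_cells(X, cells_per_dim=10, density_threshold=5):
    """Split the space into equal-sized cells and return the dense ones,
    i.e. the cells whose object count exceeds the threshold."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # cell index of every object along every dimension
    idx = np.floor((X - mins) / (maxs - mins + 1e-12) * cells_per_dim).astype(int)
    idx = np.clip(idx, 0, cells_per_dim - 1)
    # count how many objects fall into each occupied cell
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    return cells[counts > density_threshold]

# two dense regions in a 2D space
X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 8])
print(dense_cells(X))
```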
CLIQUE scales well to large amounts of data; however, as the number of dimensions grows, the number of combinations to consider grows nonlinearly, so heuristics are needed to reduce it. In addition, the result depends heavily on the choice of the cell size and of the density threshold for objects in a cell. This is a serious problem, since the same values of these parameters are used for all combinations of dimensions considered. The problem is addressed by the MAFIA algorithm [10], which works on a similar principle but uses an adaptive cell size when partitioning the subspaces.
Model-based methods
The methods of this family assume that there is a certain mathematical model of a cluster in the data space and try to maximize the fit between that model and the available data. They often rely on the apparatus of mathematical statistics.
The EM algorithm [11] is based on the assumption that the data set under study can be modeled as a linear combination of multivariate normal distributions. Its goal is to estimate the distribution parameters that maximize the likelihood function, which is used as the measure of model quality. In other words, the data in each cluster are assumed to follow a particular distribution law, namely the normal distribution. Under this assumption it is possible to determine the optimal parameters of the distribution, the mean and the variance, for which the likelihood function is maximal. We thus assume that every object belongs to every cluster, only with different probabilities. The task then is to fit the set of distributions to the data and to determine the probability that an object belongs to each cluster. Clearly, an object should be assigned to the cluster for which this probability is highest.
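In practice this scheme corresponds to fitting a Gaussian mixture model; the sketch below uses scikit-learn's GaussianMixture, which runs EM internally. The synthetic data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# data modeled as a mixture of two multivariate normal distributions
X = np.vstack([
    rng.multivariate_normal([0, 0], np.eye(2), 200),
    rng.multivariate_normal([4, 4], [[1.0, 0.5], [0.5, 1.0]], 200),
])

# EM estimates the means and covariances that maximize the likelihood
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# probability of every object belonging to each cluster (soft assignment)
probs = gm.predict_proba(X)
# hard assignment: the cluster with the highest probability
labels = probs.argmax(axis=1)
```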
The EM algorithm is simple to implement, is not sensitive to isolated objects, and converges quickly given a good initialization. However, it requires the number of clusters k to be specified for initialization, which implies some a priori knowledge about the data. In addition, with a poor initialization the algorithm may converge slowly or produce a poor-quality result.
Such algorithms are clearly ill-suited to spaces of high dimensionality, since in that case it is extremely difficult to postulate a mathematical model for the distribution of the data in the space.
Conceptual clustering
Unlike traditional clustering, which detects groups of similar objects based on a measure of similarity between them, conceptual clustering defines clusters as groups of objects belonging to the same class or concept — a specific set of attribute-value pairs.
The COBWEB algorithm [12] is a classic method of incremental conceptual clustering. It builds a hierarchical clustering in the form of a classification tree: each node of the tree corresponds to a concept and contains a probabilistic description of that concept, which includes the probability of the concept at that node and conditional probabilities of the form P(Ai = vij | Ck), where Ai = vij is an attribute-value pair and Ck is the concept class.
The nodes located at a given level of the classification tree are called a slice. To build the classification tree the algorithm uses a heuristic evaluation measure called category utility: the increase in the expected number of correct guesses about attribute values given knowledge of the category they belong to, relative to the expected number of correct guesses without this knowledge. To place a new object in the classification tree, the COBWEB algorithm descends through the tree looking for the "best" node for the object. The node is chosen by placing the object in each candidate node and computing the category utility of the resulting slice; the category utility is also computed for the case where the object goes into a newly created node. As a result, the object is assigned to the node for which the category utility is highest.
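To illustrate what the category utility measures, below is a small sketch that computes it for a fixed partition of objects described by categorical attributes; the function and the toy data are illustrative assumptions and do not reproduce COBWEB's tree operations.

```python
from collections import Counter

def category_utility(partition):
    """Category utility of a partition: the average, over clusters, of the gain
    in expected correct guesses of attribute values given the cluster."""
    all_objects = [obj for cluster in partition for obj in cluster]
    n, n_attrs = len(all_objects), len(all_objects[0])

    def sum_sq_probs(objects):
        # sum over attributes i and values v of P(A_i = v)^2 within `objects`
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(obj[i] for obj in objects)
            total += sum((c / len(objects)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq_probs(all_objects)  # expected correct guesses without category knowledge
    gain = sum(
        (len(cluster) / n) * (sum_sq_probs(cluster) - baseline)
        for cluster in partition
    )
    return gain / len(partition)          # averaged over the number of categories

# toy example: objects described by two categorical attributes
partition = [
    [("red", "round"), ("red", "round"), ("red", "oval")],
    [("blue", "square"), ("blue", "square")],
]
print(category_utility(partition))
```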
COBWEB, however, has a number of limitations. First, it assumes that the probability distributions of the values of different attributes are statistically independent of one another. This assumption does not always hold, because attribute values are often correlated. Second, the probabilistic representation of clusters makes them expensive to update, especially when attributes have many possible values, since the complexity of the algorithm depends not only on the number of attributes but also on the number of their possible values.
Bibliography
- MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations / J. MacQueen // In Proc. 5th Berkeley Symp. on Math. Statistics and Probability, 1967. pp. 281-297.
- Kaufman, L. Clustering by Means of Medoids / L. Kaufman, P.J. Rousseeuw // Statistical Data Analysis Based on the L1 Norm (ed. Y. Dodge), 1987. pp. 405-416.
- Ng, R.T. Efficient and Effective Clustering Methods for Spatial Data Mining / R.T. Ng, J. Han // Proc. 20th Int. Conf. on Very Large Data Bases. Morgan Kaufmann Publishers, San Francisco, CA, 1994. pp. 144-155.
- Aggarwal, C.C. Fast Algorithms for Projected Clustering / C.C. Aggarwal, C. Procopiuc // In Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, PA, 1999. 12 p.
- Zhang, T. BIRCH: An Efficient Data Clustering Method for Very Large Databases / T. Zhang, R. Ramakrishnan, M. Livny // In Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM Press, New York, 1996. pp. 103-114.
- Karypis, G. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling / G. Karypis, E.-H. Han, V. Kumar // Computer, Vol. 32, Issue 8. IEEE Computer Society Press, Los Alamitos, CA, 1999. pp. 68-75.
- Ester, M. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise / M. Ester, H.-P. Kriegel, J. Sander, X. Xu // In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, 1996. pp. 226-231.
- Kailing, K. Density-Connected Subspace Clustering for High-Dimensional Data / K. Kailing, H.-P. Kriegel, P. Kröger // In Proc. 4th SIAM International Conference on Data Mining (SDM), 2004. pp. 246-257.
- Agrawal, R. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications / R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan // In Proc. ACM SIGMOD Int. Conf. on Management of Data, Seattle, WA, 1998. pp. 94-105.
- Nagesh, H. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets / H. Nagesh, S. Goil, A. Choudhary // Technical Report, Center for Parallel and Distributed Computing, Northwestern University, 1999. 20 p.
- Dempster, A.P. Maximum Likelihood from Incomplete Data via the EM Algorithm / A.P. Dempster, N.M. Laird, D.B. Rubin // Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977. pp. 1-38.
- Fisher, D.H. Knowledge Acquisition Via Incremental Conceptual Clustering / D.H. Fisher // Machine Learning, Vol. 2, 1987. pp. 139-172.