
Non-standard clustering, part 3: techniques and metrics for time series clustering

Part One - Affinity Propagation
Part Two - DBSCAN
Part Three - Time Series Clustering
Part Four - Self-Organizing Maps (SOM)
Part Five - Growing Neural Gas (GNG)

While other machine learning and data analysis specialists are figuring out how to bolt even more layers onto a neural network so that it plays Mario even better, let's turn to something more mundane and practically applicable.

Clustering time series is a thankless job. Even on static data, grouping often yields dubious results, let alone data spread out over time. Still, you cannot ignore a task just because it is hard. Let's try to figure out how to squeeze a little sense out of series without labels. This article covers the subtypes of time series clustering, general techniques, and popular measures of distance between series. It is intended for readers who have already dealt with sequences in data science: the basics (trend, ARMA/ARIMA, spectral analysis) will not be discussed.


Hereinafter $P$, $Q$, $R$ are time series with time stamps $t_1 \dots t_n$ and levels $x_1 \dots x_n$, $y_1 \dots y_n$ and $z_1 \dots z_n$ respectively. The values of the series, $x_i$, $y_i$, $z_i$, are one-dimensional real variables. The variable $t$ is treated as time, but the mathematical abstractions discussed below apply equally to spatial series (an unrolled curve), to series of symbols, and to sequences of other kinds (say, a plot of the emotional tone of a text). $D(P, Q)$ is a function measuring how much $P$ differs from $Q$. It is easiest to think of it as the distance from $P$ to $Q$, but keep in mind that it does not always satisfy the mathematical definition of a metric.

Time series “similarity” subtypes


Working with time series brings many of the typical data science difficulties: high dimensionality of the input, missing values, correlated noise. But the task of clustering sequences adds a couple more. First, different series may contain a different number of samples. Second, with sequential data we have more degrees of freedom in deciding how similar one element is to another. The point is not only the number of variables in a dataset instance; the way the data changes over time also contributes. This is an obvious but important observation: we can independently transform or renumber the values of static data instances, but with sequences this will not work. Metrics and algorithms must therefore respect the local dependence of the data.



In addition, when clustering series, the semantics of the data makes a stronger contribution. For example, how would you split the 8 series in the picture into groups?



What matters more to the researcher? The average level of $y$, or the presence of jumps? If it is $y$ itself, then the partition 1234/5678 fits best; if it is the changes in $y$, then 1278/3456 is better. Moreover, does it matter when exactly the jump occurs? Perhaps group 3456 should be split into subgroups 34 and 56? Depending on what $y$ actually represents, these questions get different answers.
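To make the example concrete, here is a hypothetical reconstruction of such a dataset (all parameters are mine, purely for illustration): eight noisy series, where 1-4 sit at a low level and 5-8 at a high one, while 3-6 contain a jump halfway through.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100

def make_series(level, jump):
    """Noisy constant series; optionally add a unit step halfway through."""
    x = level + 0.1 * rng.standard_normal(n_samples)
    if jump:
        x[n_samples // 2:] += 1.0
    return x

# Grouping by mean level gives 1234/5678; grouping by jumps gives 1278/3456.
data = [make_series(level, jump)
        for level, jump in [(0, False), (0, False), (0, True), (0, True),
                            (2, True), (2, True), (2, False), (2, False)]]
```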

Therefore, given these possible semantic differences in the data, it is useful to distinguish several types of similarity that occur in practical problems [1].




A clear understanding of which type of similarity matters for your task is essential to building meaningful clusters.

Techniques for representing the data


A large volume of input data calls for feature engineering. The best way to cluster time series is to cluster them NOT as time series. If the specifics of the task allow you to convert the samples to fixed-size data, use that opportunity: a whole group of problems disappears at once. A sketch of the idea follows.
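A minimal sketch, assuming `data` is a list of one-dimensional numpy arrays (possibly of different lengths); the particular features chosen here are mine and purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def global_features(x):
    """Fixed-size description of a series, regardless of its length."""
    return np.array([
        x.mean(),                                 # average level
        x.std(),                                  # overall variability
        np.diff(x).std(),                         # short-term roughness
        np.polyfit(np.arange(x.size), x, 1)[0],   # linear trend slope
        x.max() - x.min(),                        # range
    ])

# Each series becomes a 5-dimensional point; ordinary k-means now applies.
X = StandardScaler().fit_transform([global_features(x) for x in data])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```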




If you cannot get rid of the time dependence entirely, try at least to simplify the series or to transfer it to another domain, for example the frequency domain, as in the sketch below:
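One hedged option (by no means the only one) is to keep only the leading Fourier magnitudes; discarding phase makes the representation insensitive to time shifts. The helper below is mine and assumes series of at least a few dozen samples:

```python
import numpy as np

def spectral_representation(x, n_coeffs=16):
    """Magnitudes of the first Fourier coefficients of a mean-centered series."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    return spectrum[:n_coeffs] / x.size  # truncate and normalize by length

# Series of different lengths now map to vectors of equal size.
Z = np.array([spectral_representation(x) for x in data])
```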


Metrics


The global characteristics above will let you split the series into groups of similar structure. When that alone does not give meaningful results, you need to pick an appropriate metric on the space of sequences. To surface clear patterns in the data you must construct a good proximity function, but to do that you already need some idea of the dataset at hand, so do not fixate on any single one of the options presented below.
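As one concrete example of such a proximity function, here is a naive O(n·m) implementation of dynamic time warping (DTW), the classic measure for series that are similar in shape but locally stretched or compressed in time. In practice you would reach for an optimized library; this sketch only spells out the recurrence:

```python
import numpy as np

def dtw(p, q):
    """Naive dynamic time warping distance between 1-D series p and q."""
    n, m = len(p), len(q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(p[i - 1] - q[j - 1])
            # Best way to reach (i, j): diagonal match, insertion, or deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```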



Conclusion


Alas, I did not find any clustering algorithms adapted to sequential data that would be worth describing. Maybe I am missing something, but several evenings spent on arXiv and IEEE brought me no revelations. It seems researchers confine themselves to studying new metrics for comparing time series and are in no hurry to invent specialized algorithms. All the magic happens at the stage of computing the distance matrix; the actual division into groups is done by the good old algorithms for static data: k-means derivatives, DBSCAN, affinity propagation, plain hierarchical clustering, and others. There are a few worth a look [10] [11] [12] [13], but it is not as if their results are much better than those of the old methods.

K-means gives decent results less often than with static data. Some researchers have applied DBSCAN successfully, but finding good parameters for it is highly nontrivial: its main area of application is by no means time series. The last two candidates on the list look much more interesting; hierarchical clustering is used most often [14] [15] [16]. Modern approaches also involve computing and combining several distance matrices built with different metrics.

Summary. You have a pile of time series on your hands and don't know what to do with them?

First of all, check for missing values and obvious outlier samples. Then plot part of the dataset and look at it.
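A trivial sanity check of this kind might look as follows (the 5-sigma threshold is an arbitrary assumption of mine):

```python
import numpy as np

for i, x in enumerate(data):
    n_missing = int(np.isnan(x).sum())
    # Crude outlier flag: samples more than 5 sigma away from the series mean.
    z = np.abs(x - np.nanmean(x)) / (np.nanstd(x) + 1e-12)
    n_outliers = int((z > 5).sum())
    if n_missing or n_outliers:
        print(f"series {i}: {n_missing} missing, {n_outliers} suspect samples")
```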


Try computing summary statistics over the series and clustering those; k-means derivatives are appropriate here. If that is not enough, check whether any of the variables has a clearly bimodal distribution (the trend, in particular): this may help in understanding the dataset.
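One hedged way to test a variable for bimodality (the function name is mine): fit Gaussian mixtures with one and two components and compare their BIC scores.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def looks_bimodal(values):
    """True if a 2-component Gaussian mixture beats a 1-component one by BIC."""
    v = np.asarray(values).reshape(-1, 1)
    bic = [GaussianMixture(n_components=k, random_state=0).fit(v).bic(v)
           for k in (1, 2)]
    return bic[1] < bic[0]

slopes = [np.polyfit(np.arange(x.size), x, 1)[0] for x in data]
print("trend slopes look bimodal:", looks_bimodal(slopes))
```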

Compute the distance matrix using DTW, or several TWED variants with different parameters. Plot the output of a hierarchical clustering algorithm. Try to form a hypothesis about the shape of the dataset. Change the metric as the situation demands, or combine several metrics. Repeat.
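A sketch of one pass of that loop, reusing the `dtw` function above (the 50/50 weighting of matrices in the comment is an arbitrary illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

n = len(data)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(data[i], data[j])

# To combine metrics: normalize each matrix to [0, 1] and take a weighted sum,
# e.g. D = 0.5 * D1 / D1.max() + 0.5 * D2 / D2.max()

Z = linkage(squareform(D), method="average")   # condensed matrix in, tree out
labels = fcluster(Z, t=2, criterion="maxclust")
```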

And most importantly: try to find out what the values of these sequences actually mean. This can radically affect the choice of metric.

Yeah... not that this recipe differs much from any other rough-and-ready data analysis. What can you do: clustering time series is hard work, and too much depends on the specific data. I hope my article helps at least a little if you ever have to deal with it. After all, even a small gain in the quality of automatic data processing can mean, for the human specialist, the difference between "nothing is clear" and "it seems a pattern can be found".

Sources


[1] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, Teh Ying Wah - Time-Series Clustering: A Decade Review
[2] Konstantinos Kalpakis, Dhiral Gada, Vasundhara Puttagunta - Distance Measures for Effective Clustering of ARIMA Time-Series
[3] Subsequence Time Series (STS) Clustering Techniques for Meaningful Pattern Discovery
[4] Li Junkui, Wang Yuanzhen, Li Xinping - LB_HUST: A Symmetrical Boundary Distance for Clustering Time Series
[5] Joan Serrà, Josep Arcos - A Competitive Measure
[6] P.-F. Marteau - Time Warp
[7] Gustavo E.A.P.A. Batista, Xiaoyue Wang, Eamonn J. Keogh - A Complexity-Invariant Distance Measure for Time Series
[8] Vo Thanh Vinh, Duong Tuan Anh - Compression Rate
[9] Dinkar Sitaram, Aditya Dalwani, Anish Narang, Madhura Das, Auradkar
[10] John Paparrizos, Luis Gravano - k-Shape: Efficient and Accurate Clustering of Time Series // Fast and Accurate Time-Series Clustering
[11] Leonardo N. Ferreira, Liang Zhao - Time Series Clustering via Community Detection in Networks
[12] Zhibo Zhu, Qinke Peng, Xinyu Guan - A Time Series Clustering Method Based on Hypergraph Partitioning
[13] Shan Jicheng, Liu Weike - Clustering Algorithm for Time Series Based on Peak Interval
[14] Sangeeta Rani - Modified Clustering Algorithm for Time Series Data
[15] Pedro Pereira Rodrigues, Joao Gama, Joao Pedro Pedroso - Hierarchical Clustering of Time Series Data Streams
[16] Financial Time Series based on Agglomerative Hierarchical Clustering
[17] Goktug T. Cinar, Jose C. Principe - Hierarchical Linear Dynamic System

Source: https://habr.com/ru/post/334220/

