1 Introduction

Time-series are becoming more readily available with the introduction of mobile sensing devices, satellite constellations, wearable devices, and health monitors, to name but a few. Contemporary time-series data mining problems are therefore characterised by increasingly large volumes of data.

This complicates time-series classification because of the difficulty of collecting reliable ground truth (or reference) data and of defining thematic classes. As such, unsupervised clustering is often employed, which offers a solution based upon the data alone. These approaches, however, ignore expert knowledge and intuition (that is to say, knowledge of the potential thematic classes), and do not offer the possibility for an expert to propose modifications to the clustering.

Constrained clustering (also known as semi-supervised clustering) is the process of introducing background knowledge (also known as side information) to guide a clustering algorithm. The background knowledge takes the form of constraints that supplement the information derived from the data through a distance metric, for a (generally small) subset of the data. A constrained algorithm attempts to find a solution that balances the data-derived information with that derived from the user constraints. As such, these approaches offer a new tool for time-series clustering which, to the best of our knowledge, has not been applied to the domain.

This paper addresses this through the following three contributions.

  • A review of constrained clustering methods, including single algorithm approaches and collaborative and ensemble approaches, which define an interaction between algorithms.

  • An adaptation of a sample of these algorithms for use in time-series analysis, and a description of the properties of others that prevent their adaptation to time-series analysis.

  • An evaluation of these adapted methods on publicly available time-series data (Chen et al. 2015), which gives insight into the factors that influence their performance in constrained clustering.

As such, this article offers insight into the different formulations of constrained clustering algorithms and how they can be adapted for use in time-series clustering. The algorithms are selected from implementations that are publicly available. Some algorithms can be directly applied by inputting a dissimilarity/similarity matrix calculated using an appropriate dissimilarity measure; others require modification of the algorithm itself to integrate the measure. The evaluation is performed using nine different datasets and forty constraint cases for each dataset.

The remainder of this paper is organised as follows. Section 2 presents some background on clustering, user-constraints, and time-series clustering. Section 3 presents a comprehensive review of the literature on constrained clustering. Section 4 describes the modification of publicly available implementations for use in time-series clustering, and a comparative study of these algorithms using standard datasets. Section 5 analyses these results and discusses the limitations of existing approaches when applied to time-series data. Finally, the conclusions of the study are drawn in Sect. 6.

2 Background

2.1 Cluster analysis

Let \(\mathcal {O}\) be a set of instances (data points) \(\{o_1,\ldots ,o_n\}\) and \(d(o_i,o_j)\) a dissimilarity (or a similarity) measure between any two instances \(o_i\) and \(o_j\). The similarity or dissimilarity between instances can be computed from their features or given by a similarity graph. Partition clustering involves finding a partition of \(\mathcal {O}\) into K non-empty and disjoint groups called clusters, \(C_1,\ldots ,C_K\), such that instances in the same cluster are very similar and instances in different clusters are different. The homogeneity of the clusters is usually formalised by an optimisation criterion, and clustering aims at finding a partition that optimises the given objective. For distance-based clustering, different optimisation criteria exist, the most popular being (Hansen and Jaumard 1997):

  • minimising the maximal diameter of the clusters,

  • minimising the maximal radius of the clusters,

  • maximising the minimal split between clusters,

  • minimising the sum of stars,

  • minimising the within-cluster sum of dissimilarities (WCSD),

  • minimising the within-cluster sum of squares (WCSS).

All of these criteria, except the minimal split, are NP-Hard. Finding a partition by maximising the minimal split between clusters is polynomial (Delattre and Hansen 1980) but becomes NP-Hard under user constraints (Davidson and Ravi 2007). As for the maximal diameter criterion, the problem is polynomial with 2 clusters (\(K=2\)), but is NP-Hard with 3 or more clusters (\(K\ge 3\)) (Hansen and Delattre 1978). The NP-Hardness of the WCSS criterion in general dimensions when \(K=2\) is proved in Aloise et al. (2009).

Similarity-based clustering uses data in the form of an undirected and weighted similarity graph, \(G=(V,E)\), where each vertex, \(v\in V\), represents a data point and each edge between two vertices, \(v_i\) and \(v_j\), has a non-negative weight \(w_{ij}\). Spectral clustering aims to find a partition of the graph such that the edges between different groups have a very low weight and the edges within a group have high weight. Given a cluster \(C_i\), a cut measure is defined by the sum of the weights of the edges that link an instance in \(C_i\) and an instance not in \(C_i\). The two most common optimisation criteria are (Luxburg 2007):

  • minimising the ratio cut, which is defined by the sum of \(\tfrac{{{\mathrm{cut}}}(C_i)}{|C_i|}\),

  • minimising the normalised cut, which is defined by the sum of \(\tfrac{{{\mathrm{cut}}}(C_i)}{{{\mathrm{vol}}}(C_i)}\), where \({{\mathrm{vol}}}(C_i)\) is the sum of the degrees of the nodes belonging to \(C_i\).

These criteria are also NP-Hard. Spectral clustering algorithms solve relaxed versions of those problems: relaxing the normalised cut leads to normalised spectral clustering and relaxing the ratio cut leads to unnormalised spectral clustering.
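To make these criteria concrete, the following sketch (with illustrative helper functions, not taken from the cited works) computes the cut, ratio cut, and normalised cut of a given partition from a symmetric weight matrix.

```python
# A minimal sketch, assuming a symmetric, non-negative weight matrix W (numpy
# array) and a partition given as a list of index sets.
import numpy as np

def cut(W, cluster):
    """Sum of the weights of edges linking the cluster to the rest of the graph."""
    inside = np.zeros(W.shape[0], dtype=bool)
    inside[list(cluster)] = True
    return W[inside][:, ~inside].sum()

def ratio_cut(W, clusters):
    return sum(cut(W, c) / len(c) for c in clusters)

def normalised_cut(W, clusters):
    degrees = W.sum(axis=1)                      # vol(C) is the sum of degrees in C
    return sum(cut(W, c) / degrees[list(c)].sum() for c in clusters)

W = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
print(ratio_cut(W, [{0, 1}, {2}]), normalised_cut(W, [{0, 1}, {2}]))
```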

2.2 User constraints

In practice, a user may have some requirements for, or prior knowledge about, the final solution. For instance, the user may have some information on the labels of a subset of objects (Wagstaff and Cardie 2000). Because of the inherent complexity of clustering optimisation criteria, classic algorithms typically find only a local optimum. Several optima may exist, some of which may be closer to the user requirements. It is therefore important to integrate prior knowledge into the clustering process, and several studies have demonstrated the importance of this kind of domain knowledge in data mining processes (Anand et al. 1995). Prior knowledge is expressed by user constraints to be satisfied by the clustering solution. The subject of these user constraints can be the instances or the clusters (Basu et al. 2008).

Instance-level constraints are the most widely used type of constraint and were first introduced by Wagstaff and Cardie (2000). Two kinds of instance-level constraints exist: must-link (ML) and cannot-link (CL). An ML constraint between two instances \(o_i\) and \(o_j\) states that they must be in the same cluster: \(\forall k \in \{1,\dots ,K\}\), \(o_i \in C_k \Leftrightarrow o_j \in C_{k}\). A CL constraint on two instances \(o_i\) and \(o_j\) states that they cannot be in the same cluster: \(\forall k \in \{1,\dots ,K\}\), \(\lnot (o_i \in C_k\wedge o_j\in C_k)\). In semi-supervised clustering, this information is available to aid the clustering process and can be inferred from class labels: if two objects have the same label then they are linked by an ML constraint, otherwise by a CL constraint. Supervision by instance-level constraints is, however, more general and more realistic than class labels: using domain knowledge, a user can specify whether pairs of points belong to the same cluster or not, even when class labels are unknown (Wagstaff et al. 2001).
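As a concrete illustration (a minimal sketch with hypothetical helper names, not taken from the cited works), ML and CL constraints can be encoded as pairs of instance indices and a candidate partition checked against them:

```python
# A minimal sketch: instance-level constraints as index pairs, and a check of
# whether a given assignment of instances to clusters satisfies all of them.
def satisfies_constraints(assignment, ml_pairs, cl_pairs):
    """assignment maps each instance index to a cluster label."""
    for i, j in ml_pairs:
        if assignment[i] != assignment[j]:   # must-link violated
            return False
    for i, j in cl_pairs:
        if assignment[i] == assignment[j]:   # cannot-link violated
            return False
    return True

# Constraints inferred from a few known labels: same label -> ML, different -> CL.
assignment = {0: "A", 1: "A", 2: "B", 3: "B"}
print(satisfies_constraints(assignment, ml_pairs=[(0, 1)], cl_pairs=[(1, 2)]))  # True
```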

Cluster-level constraints define requirements on the clusters, for example:

  • the number of clusters K;

  • their absolute or relative maximal or minimal size;

  • their maximum diameter, i.e. clusters must have a diameter of at most \(\gamma \);

  • their split, i.e. clusters must be separated by at least \(\delta \) [note that although the diameter or split constraints state requirements on the clusters, they can be expressed by a conjunction of cannot-link constraints or must-link constraints, respectively (Davidson and Ravi 2005)];

  • the \(\epsilon \)-constraint, introduced in Davidson and Ravi (2005), demands that each object \(o_i\) has in its neighbourhood of radius \(\epsilon \) at least one other object in the same cluster.

See Fig. 1 for an example of these constraints.

Fig. 1: Examples of ML, CL, \(\delta \), \(\gamma \), and \(\epsilon \) constraints

Mechanisms to integrate these constraints into the clustering process can be categorised into three different approaches:

  • enforcing constraints by guiding clustering algorithms during their process or by modifying the objective function;

  • learning the distance function using metric learning;

  • declarative and generative methods.

By far the most common constraints to be used in clustering are must-link and cannot-link constraints. This is because they can be intuitively derived from user inputs without in-depth knowledge of the underlying clustering process and feature space. As such, the review will focus on algorithms that explicitly model these constraints.

2.3 Time-series clustering

Time-series increase the complexity of clustering due to the properties of the data. Almost all clustering algorithms use a distance function based upon an \(L_p\) norm between two vectors (Manhattan, \(L_1\); Euclidean, \(L_2\); and Maximum, \(L_\infty \)). This implies a fixed mapping between points in two time-series and, as such, norm based distances are sensitive to noise and to misalignment in time (however small) (Keogh and Kasetty 2003), and are unable to correct for sub-sequence, i.e. non-linear, time shifts (Wang et al. 2013). Dynamic time warping (DTW) (Sakoe and Chiba 1971, 1978), on the other hand, is a dissimilarity measure that finds an optimal alignment between two time-series by non-linearly warping them. As such, it overcomes the limitations of norm based distances when applied to time-series. Furthermore, certain types of clustering algorithms, for example k-Means, calculate centroids during their optimisation, which is not a trivial task in the case of time-series due to the misalignments discussed previously. The DTW Barycenter Averaging (DBA) algorithm (Petitjean et al. 2011) overcomes this limitation by iteratively refining an initial estimate of the average sequence (usually taken to be a random sample of the time-series being averaged) in order to minimise its squared DTW measure to the sequences being averaged. As such, classical constrained clustering implementations require modification to use the DTW measure and DBA averaging (if required) before being applied to time-series. A further consideration when working with time-series is that the dimensionality of the data can be very large, which means that the sampling of the input space can be sparse.
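For reference, the following is a minimal dynamic-programming sketch of the DTW dissimilarity between two univariate series (without a warping window); it is illustrative only and is not the implementation used in the experiments.

```python
import numpy as np

def dtw(x, y):
    """Classic O(nm) dynamic-programming DTW with squared point-wise costs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 60) + 0.3)   # shifted, differently sampled copy
print(dtw(a, b))
```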

For an in-depth background on time-series clustering the following reviews are recommended: (Keogh and Kasetty 2003; Laxman and Sastry 2006; Kavitha and Punithavalli 2010; Antunes and Oliveira 2001; Liao 2005; Rani and Sikka 2012; Aghabozorgi et al. 2015). Keogh and Lin (2005) define two categories of time-series clustering:

  • Whole clustering: “The notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster” (Keogh and Lin 2005).

  • Subsequence clustering: “Given a single time series, sometimes in the form of streaming time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series subsequences” (Keogh and Lin 2005).

They then proceed to demonstrate that subsequence clustering is “meaningless” because “clusters extracted from these time series are forced to obey a certain constraints that are pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random” (Keogh and Lin 2005). A typical goal in time-series analysis is to cluster the data using the full time-series and this is the most direct application of existing constrained clustering approaches. Therefore this review will focus on ‘Whole Clustering’.

It should be noted that DTW is not the only available method for measuring dissimilarity between time-series. It is, nevertheless, often found that alternative dissimilarity measures are not significantly better than DTW in real-world datasets (Ding et al. 2008; Wang et al. 2013; Lines and Bagnall 2015; Bagnall et al. 2017) (the reader is referred to these references for a comprehensive review and comparison of the alternatives). To simplify the presented work it will therefore focus on DTW.

3 Constrained clustering methods

This section presents a review of partitional constrained clustering methods. These range from algorithmic approaches to declarative approaches, from using the constraints to guide the search process to using them to learn a metric before and/or during searching, and from constructing a clustering directly from the dataset to constructing a clustering from a set of given clusterings. The algorithms that exist in the literature, and which are reviewed herein, are summarised in Table 1. The methods that are used in the experimental section of this paper will be discussed in more depth in Sect. 4.

Table 1 Categorisation of methods found in the literature

3.1 k-Means

In this type of approach, the clustering algorithm or the objective function is modified so that user constraints are used to guide the algorithm towards a more appropriate data partitioning. Most of these works consider instance-level must-link and cannot-link constraints. The extension is done either by enforcing pairwise constraints or by using pairwise constraints to define penalties in the objective function. A survey on partitional and hierarchical clustering with instance-level constraints can be found in Davidson and Basu (2007).

In the category of enforcing pairwise constraints, the first work proposed COP-COBWEB (Wagstaff and Cardie 2000), a modified version of COBWEB (Fisher 1987) that tends to satisfy all the pairwise constraints. Subsequent work extended the k-Means algorithm to instance-level constraints. The k-Means algorithm starts with initial assignment seeds and assigns objects to clusters over several iterations. At each iteration, the centroids of the clusters are computed and the objects are reassigned to the closest centroid. The algorithm converges and finds a solution which is a local optimum of the within-cluster sum of squares (WCSS or distortion). To integrate must-link and cannot-link constraints, the COP-KMeans algorithm by Wagstaff et al. (2001) extends the k-Means algorithm by choosing, at each iteration, a reassignment that does not violate any constraints.Footnote 1 This greedy behaviour without backtracking means that COP-KMeans may fail to find a solution that satisfies all the constraints even when such a solution exists. Basu et al. (2002) propose two variants of k-Means, the Seed-KMeans and Constrained-KMeans algorithms, which allow the use of labelled objects as seeds: the difference between the two being whether or not the cluster centres may change. In both approaches, it is assumed that there is at least one seed for each cluster and that the number of clusters is known. The seeds are used to overcome the sensitivity of the k-Means approaches to the initial parameterisation.

Incorporating must-link and cannot-link constraints makes clustering algorithms sensitive to the assignment order of instances and can therefore result in constraint violations. To address the issue of constraint violation in COP-KMeans, Tan et al. (2010) (ICOP-KMeans) and Rutayisire et al. (2011) propose a modified version with an assignment order, which is based either on a measure of certainty computed for each instance or on a sequenced assignment of cannot-linked instances. MLC-KMeans (Huang et al. 2008) takes an alternative approach by introducing assistant centroids, which are calculated using the points implicated by must-link constraints for each cluster, and which are used to calculate the similarity of instances and clusters.

For high-dimensional sparse data, the SCREEN method (Tang et al. 2007) for constraint-guided feature projection was developed, which can be used with a semi-supervised clustering algorithm. This method considers an objective function to learn a projection matrix, which projects the original high-dimensional dataset into a low-dimensional space such that the distance between any pair of instances involved in a cannot-link constraint is maximised while the distance between any pair of instances involved in a must-link constraint is minimised. A spherical k-Means algorithm is then used to try to avoid violating cannot-link constraints.

Other methods use penalties as a trade-off between finding the best clustering and satisfying as many constraints as possible. Considering a subset of instances whose labels are known, Demiriz et al. (1999) modify the clustering objective function to incorporate a dispersion measure and an impurity measure. The impurity measure is based on the Gini index and measures misplaced known labels. The CVQE (constrained vector quantization error) method (Davidson and Ravi 2005) penalises constraint violations using distance. If a must-link constraint is violated then the penalty is the distance between the two centroids of the clusters containing the two instances that should be together. If a cannot-link constraint is violated then the penalty is the distance between the centroid of the cluster to which the two instances are assigned and its nearest neighbouring cluster centroid. These two penalty types together with the distortion measure define a new differentiable objective function. An improved version, linear-time CVQE (LCVQE) (Pelleg and Baras 2007), avoids checking all possible assignments for cannot-link constraints and its penalty calculation takes into account the coordinates of the instances involved in the violated constraint. The PCK-Means method (Basu et al. 2004a) formulates the goal of pairwise constrained clustering as minimising a combined objective function, defined as the sum of the total squared distances between the points and their cluster centroids (WCSS) and the cost incurred by violating any pairwise constraints. The cost can be uniform but can also take into account the metric of the clusters, as in the MPCK-Means version that integrates both constraints and metric learning. Lagrangian constrained clustering (Ganji et al. 2016) also formulates the objective function as a sum of the distortion and the penalty for violating cannot-link constraints (must-link constraints are used to aggregate instances into super-instances so they are all satisfied). This method uses a Lagrangian relaxation strategy of increasing the penalties for constraints which remain unsatisfied in subsequent clustering iterations. A local search approach using Tabu search was developed to optimise the objective function, which is the sum of the distortion and the weighted cost incurred by violating pairwise constraints (Hiep et al. 2016). Grira et al. (2006) introduce the cost of violating pairwise constraints into the objective function of the Fuzzy C-Means algorithm. Li et al. (2007) use non-negative matrix factorisation to perform centroid-less constrained k-Means clustering (Zha et al. 2001).
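To make the penalty formulation concrete, the sketch below evaluates a PCK-Means-style objective: the WCSS plus a fixed cost for every violated constraint. The uniform weight w and the function name are illustrative choices, not the authors' implementation.

```python
import numpy as np

def penalised_objective(X, assignment, centroids, ml_pairs, cl_pairs, w=1.0):
    """WCSS plus a uniform penalty w for every violated ML or CL constraint."""
    wcss = sum(np.sum((X[i] - centroids[k]) ** 2) for i, k in enumerate(assignment))
    violations = sum(assignment[i] != assignment[j] for i, j in ml_pairs) \
               + sum(assignment[i] == assignment[j] for i, j in cl_pairs)
    return wcss + w * violations

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
centroids = np.array([[0.05, 0.0], [1.05, 1.0]])
assignment = [0, 0, 1, 1]
print(penalised_objective(X, assignment, centroids, ml_pairs=[(0, 1)], cl_pairs=[(0, 2)]))
```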

Hybrid approaches integrate both constraint enforcing and metric learning (see Sect. 3.2) into a single framework: MPCK-Means (Bilenko et al. 2004), HMRF-KMeans (Basu et al. 2004b), semi-supervised kernel k-Means (Kulis et al. 2005), and CLWC (Cheng et al. 2008). Bilenko et al. (2004) define a uniform framework that integrates both constraint-based and metric-based methods. This framework represents PCK-Means when considering only a constraint-based factor and MPCK-Means when considering both constraint-based and metric-based factors. Semi-supervised HMRF k-Means (Basu et al. 2004b) is a probabilistic framework based on Hidden Markov Random Fields, where the semi-supervised clustering objective minimises both the overall distortion measure of the clusters and the number of violated must-link and cannot-link constraints. A k-Means like iterative algorithm is used for optimising the objective, where at each step the distortion measure is re-estimated to respect user-constraints. Semi-supervised kernel k-Means (Kulis et al. 2005, 2009) is a weighted kernel-based approach that generalises HMRF k-Means. The method can perform semi-supervised clustering on data given either as vectors or as a graph. It can be used with a wide class of graph clustering objectives, such as minimising the normalised cut or ratio cut, and can therefore be applied to semi-supervised spectral clustering. Constrained locally weighted clustering (CLWC) (Cheng et al. 2008) integrates local distance metric learning with constrained learning. Each cluster is assigned its own local weighting vector in a different subspace. The data points in the constraint set are arranged into disjoint groups (chunklets), and each chunklet is assigned entirely in each assignment and weight update step.

Beyond pairwise constraints, Ng (2000) adds suitable constraints to the mathematical programming formulation of the k-Means algorithm to extend it to the problem of partitioning objects into clusters where the number of elements in each cluster is fixed. Bradley et al. (2000) avoid local solutions with empty clusters, or clusters having very few points, by explicitly adding k minimal capacity constraints to the formulation of the clustering optimisation problem. This work considers the k-Means algorithm, and the constraints are enforced during the assignment step at each iteration. Banerjee and Ghosh (2006) proposed a framework to generate balanced clusters, i.e. clusters of comparable sizes. Demiriz et al. (2008) integrated a minimal size constraint into the k-Means algorithm. Considering two types of constraints, the minimum number of objects in a cluster and the minimum variance of a cluster, Ge et al. (2007) proposed an algorithm that generates clusters satisfying both. This algorithm is based on a CD-Tree data structure, which organises data points in leaf nodes such that each leaf node approximately satisfies the significance and variance constraints and minimises the sum of squared distances.

3.2 Metric learning

Metric learning aims to automatically learn a metric from training data that best discriminates the samples according to a given criterion. In general, this metric is either a similarity or a distance (Klein et al. 2002). Many machine learning approaches rely on the learned metric; thus metric learning is usually a preprocessing step for such approaches.

In the context of clustering, the metric can be defined as the Mahalanobis distance parameterised by a matrix M, i.e. \({\mathbf {d_M}}(o_i,o_j) = \Vert o_i - o_j\Vert _M\) (Bellet et al. 2015). Unlike the Euclidean distance, which assumes that attributes are independent of one another, the Mahalanobis distance enables the similarity measure to take into account correlations between attributes. Learning the distance \({\mathbf {d_M}}\) is equivalent to learning the matrix M. For \({\mathbf {d_M}}\) to satisfy the distance properties (non-negativity, identity, symmetry, and the triangle inequality), M should be a positive semi-definite real-valued matrix.
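As a simple illustration (with an arbitrary, illustrative choice of M rather than a learned one), the parameterised distance can be computed as follows; constructing M as \(A^TA\) guarantees positive semi-definiteness.

```python
import numpy as np

def mahalanobis(o_i, o_j, M):
    """d_M(o_i, o_j) = sqrt((o_i - o_j)^T M (o_i - o_j))."""
    diff = o_i - o_j
    return np.sqrt(diff @ M @ diff)

A = np.array([[1.0, 0.2],
              [0.0, 1.0]])
M = A.T @ A                       # positive semi-definite by construction
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, 1.0]), M))
# With M = I this reduces to the Euclidean distance:
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.eye(2)))
```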

To guide the learning process, two sets are constructed from the ML and CL constraints: the set of supposedly similar—must-link—pairs \({{\mathrm{Sim}}}\), and the supposedly dissimilar—cannot-link—pairs \({{\mathrm{Dis}}}\), such that

  • \({{\mathrm{Sim}}}= \{(o_i,o_j)\ |\ o_i \text { and } o_j \text { should be as similar as possible}\}\),

  • \({{\mathrm{Dis}}}= \{(o_i,o_j)\ |\ o_i \text { and } o_j \text { should be as dissimilar as possible}\}\).

It is also possible to introduce unlabeled data along with the constraints to prevent over-fitting.


Several proposals have been made to modify (learn) a distance (or metric) taking into account this principle. We can cite works on the Euclidean distance and shortest path (Klein et al. 2002), Mahanalobis distance (Bar-Hillel et al. 2005, 2003; Xing et al. 2002), Kullback-Leibler divergence (Cohn et al. 2003), string-edit distance (Bilenko and Mooney 2003), and the Laplacian regularizer metric learning (LRML) method for clustering and imagery (Hoi et al. 2008, 2010).

Yi et al. (2012) describe a metric learning algorithm that avoids the high computational cost implied by the positive semi-definite constraint. Matrix completion is performed on the partially observed constraints and it is observed that the completed similarity matrix has a high probability of being positive semi-definite, thus avoiding the explicit constraint.

3.3 Spectral graph theory

Spectral clustering is an unsupervised method that takes as input a pre-calculated similarity matrix (graph) and aims to minimise the ratio cut criterion (Luxburg 2007) or the normalised cut criterion (Shi and Malik 2000). Spectral clustering is often considered superior to classical clustering algorithms, such as k-Means, because it is capable of extracting clusters of arbitrary form (Luxburg 2007). It has also been shown that algorithms that build partitions incrementally (like k-Means and EM) are prone to be overly constrained (Davidson and Ravi 2006). Moreover, spectral clustering has polynomial time complexity. The constraints can be expressed as ML/CL constraints or in the form of labels, and can be taken into account either as "hard" (binary) constraints or "soft" (probabilistic) constraints. Depending on the method, the user can specify a lower bound on constraint satisfaction, and all points are assigned to clusters simultaneously, even if the constraints are inconsistent.

Kamvar et al. (2003) first integrated ML and CL constraints into spectral clustering.Footnote 2 This is achieved by modifying the affinity matrix by setting ML constrained pairs to maximum similarity, 1, and CL constrained pairs to minimum similarity, 0. This has been extended to out-of-sample points and soft-constraints through regularisation (Alzate and Suykens 2009). Li et al. (2009) point out, however, that a similarity of 0 in the affinity matrix does not mean that the two objects tend to belong to different clusters.

Wang and Davidson (2010a) and Wang et al. (2014) introduce a framework for integrating constraints into spectral clustering. Constraints between N objects are modelled by a matrix Q of size \(N \times N\), such that

$$\begin{aligned} Q_{ij} = Q_{ji} = {\left\{ \begin{array}{ll} +1, &{}\text { if } {{\mathrm{ML}}}(i,j),\\ -1, &{}\text { if } {{\mathrm{CL}}}(i,j),\\ 0, &{}\text { otherwise,} \end{array}\right. } \end{aligned}$$
(1)

upon which a constraint satisfaction measure can be defined. Soft constraints can be taken into account by allowing real values to be assigned to Q or by allowing fuzzy cluster membership values. Subsequently, the authors introduce a method to integrate a user-defined lower-bound on the level of constraint satisfaction (Wang and Davidson 2010b). Work has also been described that allows for inconsistent constraints (Rangapuram and Hein 2012).
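For illustration, the following is a minimal sketch of the constraint matrix Q of Eq. (1), together with a simple satisfaction score \(u^TQu\) for a hard (\(\pm 1\)) two-cluster indicator vector u; the score increases with satisfied constraints and decreases with violated ones. The helper names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def constraint_matrix(n, ml_pairs, cl_pairs):
    """Build Q as in Eq. (1): +1 for ML pairs, -1 for CL pairs, 0 otherwise."""
    Q = np.zeros((n, n))
    for i, j in ml_pairs:
        Q[i, j] = Q[j, i] = 1.0
    for i, j in cl_pairs:
        Q[i, j] = Q[j, i] = -1.0
    return Q

Q = constraint_matrix(4, ml_pairs=[(0, 1)], cl_pairs=[(1, 2)])
u = np.array([1, 1, -1, -1])      # hard two-cluster indicator
print(u @ Q @ u)                  # 4.0: both constraints are satisfied
```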

Based on the Karush–Kuhn–Tucker (Kuhn and Tucker 1951) conditions, an optimal solution can then be found by first finding the set of solutions satisfying all constraints and then using a brute-force approach to find the optimal solution from this set.

These approaches have been extended to integrate logical combinations of constraints (Zhi et al. 2013), which are translated into linear equations or linear inequations. Furthermore, instead of modifying the affinity matrix using binary values, Anand and Reddy (2011) propose to modify the distances using an all-pairs-shortest-path algorithm such that the new distance metric is similar to the original space.

Lu and Carreira-Perpiñán (2008) state that an affinity matrix constructed using constraints is highly informative but only for a small subset of points. To overcome this limitation they propose a method to propagate constraints (in a method that is consistent with the measured similarities) to points that are not directly affected by the original constraint set. These advances are proposed for the two-class problem (multi-class extension is discussed but is computationally inefficient), multi-class alternatives have been proposed (Lu and Ip 2010; Chen and Feng 2012; Ding et al. 2013).

Several works (Zhang and Ando 2006; Hoi et al. 2007; Li et al. 2008; Li and Liu 2009) use the constraints and point similarities to learn a kernel matrix such that points belonging to the same cluster are mapped to be close and points from different clusters are mapped to be well-separated.

Most recently, progress has been made in introducing faster and simpler formulations, while providing a theoretical guarantee of the quality of the partitioning (Cucuringu et al. 2016).

3.4 Ensemble clustering

The abundance of clustering methods presented in this review can be explained by the ill-posed nature of the problem. Indeed, each clustering algorithm is biased by the objective function used to build the clusters. Consequently, different methods can produce very different clustering results from the same data. Furthermore, the same algorithm can produce different results depending upon its parameters and initialisation. Ensemble clustering methods aim to improve the overall quality of the clustering by reducing the bias of each single algorithm (Hadjitodorov and Kuncheva 2007). Ensemble clustering is composed of two steps. First, multiple clusterings are produced by a set of methods having different points of view. These methods can be different clustering algorithms (Strehl and Ghosh 2002) or the same algorithm with different parameter values or initialisations (Fred and Jain 2002). The final result is derived from the independently obtained results by applying a consensus function.

Constraints can be integrated in two manners: either each learning agent integrates them in its own fashion, or they are applied in the consensus function. The former approach faces an important dilemma: favour diversity or quality. High quality is desired, but the gain of ensemble clustering is derived from diversity (thus avoiding biased solutions). Clusterings from constrained algorithms tend to have a low variance, which implies low diversity (Yang et al. 2017), especially when using the same set of constraints. Therefore the advantage of ensemble clustering is limited.

Implementations of the first approach exist (Yu et al. 2011; Yang et al. 2012). For example, Iqbal et al. (2012) develop the semi-supervised clustering ensembles by voting (SCEV) algorithm, in which diversity is balanced by using different types of semi-supervised algorithms (e.g. constrained k-Means, COP-KMeans, SP-KMeans, etc.). In the first step, each semi-supervised agent computes a clustering given the data and the set of constraints. The algorithm then combines all the results using a voting algorithm after having relabelled and aligned the different clustering results. The authors propose to integrate a weight for each agent's contribution into the voting algorithm. This weight is a combination of two sub-weights: the first is defined a priori, based upon the expert's trust in each agent according to the data (e.g. seeded k-Means is more efficient for noisy data, COP-KMeans and constraints are more efficient if the data is noise free); the second is also user defined but based upon the user's feedback on the clustering result. As such, the algorithm allows more flexibility and user control over the clustering.

The second approach focuses on applying constraints in the consensus function (Al-Razgan and Domeniconi 2009; Xiao et al. 2016; Dimitriadou et al. 2002). These algorithms start by generating the set of clusterings from the clustering agents. The constraints are then integrated in the consensus function, which can be divided into four steps:

  1. generate a similarity matrix from the set of clusterings;

  2. construct a sparse graph from this similarity matrix using the CHAMELEON algorithm (an edge is constructed between two vertices if the value in the similarity matrix is greater than zero for the corresponding elements);

  3. partition the graph into a large number of sub-clusters using the METIS method;

  4. merge the sub-clusters using an agglomerative hierarchical clustering approach by finding the most similar pair of sub-clusters.

Constraints are integrated during partitioning. Cannot-link constraints are used as priorities for the split operation: sub-clusters that contain a CL constraint are partitioned until the two elements in the constraint are allocated to two different clusters.

3.5 Collaborative clustering

Collaborative clustering is similar to ensemble clustering, but considers that the information offered by different sources and different clusterings are complementary (Kittler 1998). An important problem encountered by ensemble clustering is the difficulty of computing a consensual result from different clusterings that have a wide range of numbers of clusters—the correspondence between each cluster is not a trivial problem (Forestier et al. 2010a).

Collaborative clustering consists in making multiple clustering methods collaborate to reach an agreement on a data partitioning. While ensemble clustering (and consensus clustering (Monti et al. 2003; Li and Ding 2008)) focuses on merging clustering results, collaborative clustering focuses on iteratively modifying the clustering results by sharing information between them (Wemmert et al. 2000; Gançarski and Wemmert 2007; Pedrycz 2002). In consequence it extends ensemble clustering by adding a refinement step before the unification of the results. For instance, in Samarah (Wemmert et al. 2000; Gançarski and Wemmert 2007) each clustering algorithm modifies its results according to all the other clusterings until all the clusterings proposed by the different methods are strongly similar.Footnote 3 Thus, they can be more easily unified through a voting algorithm (for example).

Three stages for integrating user constraints in the collaborative process can be identified (Forestier et al. 2010a): (1) generation of the final result (by labeling the clusters of the final result using label constraints); (2) directly in the collaborative clustering (in order to guide the collaborative process); and (3) using constrained agents. Integrating user constraints into the learning agents (3) is complex because it requires extensive modification of each of the clustering methods involved. The complexity of integrating constraints in the collaboration (2) depends on how information is exchanged between the learning agents. Integrating the constraints after collaboration (1), however, does not interfere in the collaborative process, which makes it easier to implement.

Samarah (Forestier et al. 2010a) is based on the principle of mutual and iterative refinement of multiple clustering algorithms. This is achieved by generating a set of initial results (using different algorithms, or the same algorithm with different parameter values), refining these results according to the constraints, and combining them. During the refinement stage, each result is compared with the set of results proposed by the other methods, the goal being to evaluate the similarity between the different results in order to observe differences in the clusterings. Once these differences (named conflicts) are identified, the objective is to modify the results to reduce these differences and the number of constraint violations, i.e. to resolve the conflicts (Forestier et al. 2010b). These are resolved by iteratively merging clusters, splitting clusters, or re-clustering clusters. This step can be seen as questioning each result according to the information provided by the other actors in the collaboration and the background knowledge. After multiple iterations of refinement [in which a local similarity criterion is used to evaluate whether the modifications of a pair of results are relevant (Forestier et al. 2010a)], the results are expected to be more similar than before the collaboration began. During the third and final step, the refined results are combined to propose a final and unique result (which is simplified due to the similarity of the results).

At level (3), the background knowledge is not used directly by the collaborative process but by each collaborative agent. A simple implementation of this approach is to replace the learning agents by constrained clustering methods. This naive approach results in a loss of diversity (as discussed in relation to ensemble clustering), making the collaborative process irrelevant and increasing error rates (Domeniconi and Al-Razgan 2008). To address this, a hybrid approach, which integrates constraints at levels (2) and (3), has been proposed by Domeniconi and Al-Razgan (2008).

The approach uses a set of constrained learning agents that also collaborate using constraints. Multiple instances of the constrained locally adaptive clustering (CLAC) algorithm, which is derived from the LAC method (Domeniconi et al. 2007), are used. Before each iteration, a chunklet graph is constructed from the constraints. This is achieved by grouping data points according to the ML constraints and adding edges according to the CL constraints. A new set of centroids, with a set of associated weights, is deduced from this chunklet graph by assigning vertices in the graph to the appropriate centroid without violating any ML or CL constraints. This set of new centroids and associated weights is then used as initialisation parameters for the learning agents.

The exchange of knowledge between the agents is achieved by adding new constraints at the end of each iteration. These constraints are built to highlight features shared by the majority of clusterings, selecting those that are most relevant.

3.6 Declarative approaches

These approaches offer the user a general framework to formalise the problem by choosing an objective function and explicitly stating the constraints. They enable the modeling of different types of user constraints and the search for an exact solution—a global optimum that satisfies all the user constraints. The frameworks are usually developed using a general optimisation tool, such as integer linear programming (ILP), SAT, or constraint programming (CP). While the other approaches usually focus on instance-level must-link and cannot-link constraints, declarative approaches using CP or ILP allow direct integration of cluster-level constraints. They also allow for the integration of different optimisation criteria within the same framework, while other approaches are usually developed for one particular optimisation criterion.

3.6.1 SAT

Considering constrained clustering problems with \(K=2\), a SAT based framework has been proposed (Davidson et al. 2010). Since \(K=2\), the assignment of objects to clusters can be represented by a Boolean variable \(x_i\) for each object i. This framework integrates different constraints such as must-link, cannot-link, maximum diameter, and minimum split. Using binary search, the framework offers both single-objective and bi-objective optimisation. Several single optimisation criteria are integrated: minimising the maximal diameter, maximising the minimal split, minimising the difference between diameters, and minimising the sum of diameters.

When optimising multiple objectives, the framework considers minimising the diameter and maximising the split, either by using one objective as a constraint and optimising the other under that constraint, or by combining them in a single objective which is the ratio of diameter to split. Approximation schemes are also developed to reduce the number of calls in the binary search, in order to make the framework more efficient.

3.6.2 Constraint programming

Problem modeling in CP consists in formalizing the problem into a Constraint Satisfaction Problem (CSP) or a Constraint Optimisation Problem (COP). A CSP is a triple \(\langle X, {{\mathrm{Dom}}}, C\rangle \) where X is a set of variables, \({{\mathrm{Dom}}}(x)\) for each \(x\in X\) is the domain of x and C is a set of constraints, each one expresses a condition on a subset of X. A solution of a CSP is a complete assignment of values from \({{\mathrm{Dom}}}(x)\) to each variable \(x\in X\) that satisfies all the constraints of C. A COP is a CSP with an objective function to be optimised. An optimal solution of a COP is a solution of the CSP that optimises the objective function.

In general, solving a CSP or a COP is NP-Hard. Nevertheless, the methods used by the CP solvers enable us to efficiently solve a large number of real-world applications. They rely on constraint propagation and search strategies (Rossi et al. 2006).

A CP-based framework for distance-based constrained clustering has been developed by Dao et al. (2013).Footnote 4 This framework enables the modeling of different constrained clustering problems, by specifying an optimisation criterion and by setting the user constraints. The framework has been evolved by improving the model and by developing dedicated propagation algorithms for each optimisation criterion (Dao et al. 2017). In this model, the number of clusters K does not need to be fixed beforehand, only bounds \(K_{\min }\le K \le K_{\max }\) are needed, and the model has three components: partition constraints, user constraints, and objective function constraints.

In order to improve the performance of CP solvers, different search strategies are elaborated for each criterion. For example, a CP-based framework using repetitive branch-and-bound search has been developed (Guns et al. 2016) for the WCSS criterion.

Another benefit of the declarative framework is its applicability to the bi-objective constrained clustering problem. This problem aims to find clusters that are both compact (minimising the maximal diameter) and well separated (maximising the split), under user constraints. In Dao et al. (2017) it is shown that this problem can be solved by iteratively changing the objective function and adding constraints on the other objective value. This framework has been extended to integrate user constraints on properties, in order to make clustering actionable (Dao et al. 2016).

3.6.3 Integer linear programming

Different frameworks using Integer Linear Programming (ILP) have been developed for constrained clustering. Using ILP, constrained clustering problems must be formalized by a linear objective function subject to linear constraints. In the formulation of clustering such as the one used in CP-based approaches, a clustering is defined by an assignment of instances to clusters. ILP-based approaches use a formulation that is orthogonal to this: a clustering is considered to be a subset of the set of all possible clusters.

In this formulation, the first constraint states that each instance must be covered by exactly one cluster (the clustering is therefore a partition of the instances) and the second states that the clustering is formed of K clusters. A binary variable decides whether each candidate cluster is kept in the final solution, and the number of candidate clusters is, in principle, exponential w.r.t. the number of instances. As such, two kinds of ILP-based approaches have been developed for constrained clustering: (1) use a column generation approach, where the master problem is restricted to a smaller set \(T'\subseteq T\), where T is the set of all possible non-empty clusters, and columns (clusters) are incrementally added until the optimal solution is proved (Babaki et al. 2014); and (2) restrict the cluster candidates to a subset \(T'\subseteq T\) and define the clustering problem on \(T'\) (Mueller and Kramer 2010; Ouali et al. 2016).

The first type of approach uses column generation to handle the exponential number of possible clusters. An ILP-based column generation approach for unconstrained minimum sum-of-squares clustering was introduced in Merle et al. (1999) and improved in Aloise et al. (2012). Column generation iterates between solving the restricted master problem and adding one or multiple columns. A column is added to the master problem if it can improve the objective function. If no such column can be found, it is certain that the optimal solution of the restricted master problem is also an optimal solution of the full master problem.

The column generation approach has been extended to integrate anti-monotone user constraints in Babaki et al. (2014). A constraint is anti-monotone if, whenever it is satisfied on a set of instances S, it is also satisfied on all subsets \(S'\subseteq S\). For instance, maximal capacity constraints are anti-monotone but minimal capacity constraints are not.

Another approach to handling the exponential number of cluster candidates is to restrict them to a smaller subset. In some clustering settings, such as conceptual clustering, the candidates can usually be taken from a smaller subset \(T'\). Considering a constrained clustering problem on a restricted subset \(T'\), Mueller and Kramer (2010) and Ouali et al. (2016) develop ILP-based frameworks that can integrate different kinds of user constraints.

3.7 Miscellaneous

Shental et al. (2013) argue that the EM procedure allows for constraints to be integrated in a principled way, instead of heuristically, and present a Gaussian mixture model approach. Lu and Leen (2005) extend this by relaxing the requirement for hard constraints.

Handl and Knowles (2006) address the difficulty of selecting a single clustering objective function and therefore propose a multi-objective evolutionary algorithm to optimise compactness, connectedness, and constraint satisfaction.

Zhu et al. (2016) modify the split function of random forests to include constraints and therefore introduce the constraint propagation random forest, which can deal with noisy constraints.

4 Constrained clustering for time-series

General constrained clustering algorithms have been reviewed in the previous section, and what now remains is to describe their application to time-series. In most cases, this involves modifying the algorithm to use an alternative dissimilarity measure. The modified algorithms are then evaluated by applying them to several publicly available datasets.

4.1 Algorithm adaptation

A subset of the reviewed algorithms were chosen based upon the public availability of implementations and their ability to be modified to use the DTW measure, enabling them to be applied to time-series analysis. Where possible, the initial implementations were taken from, and all modifications were validated with, the originating authors.

Spectral clustering algorithms take as their input a similarity matrix. This simplifies their application to time-series data as the similarity matrix can be pre-computed using DTW and the methods require no, or little, modification. In the case of spectral methods, the form of the Laplacian matrix needs consideration. In the presented evaluation, fully connected graphs were used and similarity was calculated using the Gaussian function, such that

$$\begin{aligned} s_{ij} = \exp \left( \frac{-d_{ij}}{2\sigma ^2}\right) , \end{aligned}$$
(2)

where \(d_{ij}\) is the DTW dissimilarity between points i and j and \(\sigma \) controls the widths of the neighbourhoods; its value was optimised on each training set using a grid search (as was the number of principal components). The algorithms modified using this methodology were: Adjacency Matrix Modification (Kamvar et al. 2003), Constrained 1-Spectral Clustering (Rangapuram and Hein 2012), Constrained Clustering via Spectral Regularization (CCSR) (Li et al. 2009), CSP (Wang et al. 2014), and Guaranteed Quality Clustering (Cucuringu et al. 2016).
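The following sketch illustrates this construction: a DTW dissimilarity matrix is pre-computed and converted to a fully connected affinity matrix via Eq. 2. The dtw function and the sigma value are assumptions standing in for the measure of Sect. 2.3 and the grid-searched parameter, respectively.

```python
import numpy as np

def affinity_from_dtw(series, dtw, sigma):
    """Pre-compute pairwise DTW dissimilarities and apply Eq. (2)."""
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dtw(series[i], series[j])
    return np.exp(-D / (2.0 * sigma ** 2))

# e.g. A = affinity_from_dtw(list_of_series, dtw, sigma=1.0), using a dtw()
# function such as the sketch given in Sect. 2.3.
```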

Declarative approaches can take as their input either the data points or a dissimilarity matrix, depending upon the objective function used. The declarative approach found for this study is CPClustering (Dao et al. 2017), which takes the pre-computed DTW dissimilarity matrix and therefore does not need modification.

k-Means based algorithms are more involved to apply to time-series as they iteratively calculate distances to cluster centroids and update these centroids. This implies that the algorithm itself needs to be modified to integrate the DTW measure and to use DBA to calculate the cluster centroids. The COP-KMeans (Wagstaff et al. 2001), LCVQE (Pelleg and Baras 2007), MPCK-Means (Bilenko et al. 2004), Tabu search (Hiep et al. 2016), and MIP-KMeans (Babaki 2017) were modified to incorporate these changes.

Metric learning approaches are inherently tied to the distance metric upon which they are based. Therefore, modifying them for use with time-series (i.e. different distance measures) implies the development of novel algorithms to tackle the problem. This was deemed outside the scope of this study. Similarly, it is unclear how probabilistic methods such as the constrained EM algorithm (Shental et al. 2013) could be modified to use alternative distance measures and averaging techniques, as it is not obvious how these would affect the probability estimates needed for their application.

Collaborative approaches offer several means to integrate constraints. In this study the Samarah (Forestier et al. 2010a) algorithm was modified to use pairwise constraints in the collaborative process (detailed in the following subsection) and DTW based k-Means agents.

Of these modified implementations, only a few were found to be suitable for inclusion in the study, for various reasons. The implementation by Wang et al. (2014) is formulated for two-class problems (although the authors describe in their article how the method can be extended to multi-class problems). That by Cucuringu et al. (2016) was modified but did not converge on all the datasets. Constrained 1-Spectral Clustering (Rangapuram and Hein 2012) and MIP-KMeans (Babaki 2017) were too slow once modified to use DTW and DBA. The implementation of Tabu search is heavily optimised for the Euclidean distance to make it computationally feasible (Hiep et al. 2016) and these efficiencies do not hold when using a measure such as DTW. LCVQE (Pelleg and Baras 2007) updates the centroids heuristically using a formulation which does not have an obvious extension to DTW. MPCK-Means (Bilenko et al. 2004) implements metric learning and as such is intrinsically tied to norm based distance metrics. A declarative approach using ILP (Babaki et al. 2014) is publicly available and was modified; however, it was either too slow (even though it does not require repetitive distance calculations and averaging) or did not converge.

Implementations of these modified algorithms are available from https://sites.google.com/site/tomalampert/code.

4.2 Evaluated algorithms

After modification and initial evaluation, the following algorithms were used in the remainder of this study: COP-KMeans (Wagstaff et al. 2001) (k-Means), Spec (Kamvar et al. 2003) (spectral), CCSR (Li et al. 2009) (spectral), CPClustering (Dao et al. 2017) (declarative), and Samarah (using three k-Means agents) (Forestier et al. 2010a) (collaborative).

These have been reviewed in the previous section but, before proceeding to an analysis of their performance, the manner in which each algorithm uses constraints will first be examined. This is summarised in Fig. 2a, b. At one end of the spectrum of constraint use (Fig. 2a) is COP-KMeans, which uses constraints to validate assignments; at the other end is CPClustering, which explicitly forms the clusters according to the constraints; and between these two extremes are Spec, CCSR, and Samarah, which determine the assignment of points according to a balance of the information derived from the constraints and the distance measure. From the point of view of constraint satisfaction (Fig. 2b), CPClustering and COP-KMeans are found at the end of the spectrum where constraints are guaranteed to be satisfied, and Spec, CCSR, and Samarah are found at the other end, where there is no guarantee of constraint satisfaction.

Fig. 2: Spectra describing the evaluated algorithms. a Constraint use, b constraint satisfaction

4.2.1 COP-KMeans

COP-KMeans represents the simplest approach in the spectrum of constraint use and is described in Algorithm 1. As with the (unconstrained) k-Means algorithm, clustering is performed according to the distance function. The algorithm tries to extend the partial assignment in such a way that all the constraints are satisfied. If the partial assignment cannot be extended, and without a backtracking mechanism, the algorithm fails. In this way COP-KMeans can be considered a heuristic method that tries to enforce the constraints. COP-KMeans is the most involved of the evaluated algorithms to adapt to time-series clustering. It is necessary to integrate the DTW measure in line 4 of Algorithm 1 to calculate the closest cluster for each point, i.e. the cluster centroid with the smallest distance to the point. In addition to this, it is necessary to integrate the DBA algorithm in line 7 to calculate the updated cluster centroids. DBA is a heuristic which aims to minimise the sum of squared DTW distances between the set of time-series and the resulting average sequence. It should be noted that the convergence of k-Means is only guaranteed with the Euclidean distance metric. Since DBA is non-deterministic, this guarantee does not hold; nevertheless, the effects of the non-deterministic averaging process are minimal and, on average, the cost function decreases at each iteration.

Algorithm 1: COP-KMeans
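To illustrate the constrained assignment step (line 4 of Algorithm 1) with the DTW adaptation, the following sketch assigns a series to the closest centroid (under DTW) whose cluster does not cause a constraint violation, and fails when no such cluster exists. The names and the dtw argument are illustrative, not the published implementation.

```python
def violates(i, k, assignment, ml_pairs, cl_pairs):
    """Would assigning instance i to cluster k violate any ML/CL constraint?"""
    for a, b in ml_pairs:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] != k:
                return True
    for a, b in cl_pairs:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] == k:
                return True
    return False

def constrained_assign(i, series, centroids, assignment, ml_pairs, cl_pairs, dtw):
    """Line 4 of Algorithm 1: closest non-violating centroid, otherwise fail."""
    order = sorted(range(len(centroids)), key=lambda k: dtw(series[i], centroids[k]))
    for k in order:
        if not violates(i, k, assignment, ml_pairs, cl_pairs):
            assignment[i] = k
            return k
    raise RuntimeError("COP-KMeans fails: no consistent cluster for instance %d" % i)
```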

4.2.2 Spec

On the other hand, spectral clustering methods form a balance between the information derived from the distance function and the constraint set, and therefore do not impose a hard requirement for constraints to be fulfilled. The most basic method is that presented by Kamvar et al. (2003), which is described in Algorithm 2. The goal of spectral clustering is to find a partition of the graph defined by the Laplacian matrix (line 6) such that the edges between different groups have very low weights (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weights (which means that points within the same cluster are similar to each other); this is achieved by taking the eigenvalue decomposition of the Laplacian and clustering the rows into 'blocks', or clusters (line 9) (Luxburg 2007). Constraints are integrated by modifying the affinity matrix (lines 3 and 4). As such, must-linked points are made more similar than any other pair of points in the data set and therefore their graph edges are weighted maximally, increasing the probability for them to be within the same partition of the graph. Conversely, cannot-linked points are made more dissimilar than any pair of points in the data set and therefore their graph edges are weighted minimally, decreasing the probability for them to be within the same partition of the graph.

The Spec algorithm is adapted for use with the DTW dissimilarity measure by simply constructing the distance matrix (B in Algorithm 2) from the output of the DTW algorithm, such that

$$\begin{aligned} B_{ij} = \text {DTW}(o_i, o_j). \end{aligned}$$
(3)
Algorithm 2: Spec
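A minimal sketch of the constraint-integration step (lines 3 and 4 of Algorithm 2): the DTW-derived affinity matrix is overwritten so that must-linked pairs receive maximal similarity (1) and cannot-linked pairs minimal similarity (0). The function name is illustrative.

```python
import numpy as np

def apply_constraints(A, ml_pairs, cl_pairs):
    """Overwrite affinities: 1 for must-linked pairs, 0 for cannot-linked pairs."""
    A = A.copy()
    for i, j in ml_pairs:
        A[i, j] = A[j, i] = 1.0
    for i, j in cl_pairs:
        A[i, j] = A[j, i] = 0.0
    return A

# e.g. A = apply_constraints(affinity_from_dtw(series, dtw, sigma), ml_pairs, cl_pairs)
```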

4.2.3 CCSR

Rather than modifying the affinity matrix to integrate ML and CL constraints, Li et al. (2009) propose to bias the spectral embedding towards one that is as consistent with the pairwise constraints as possible. This is inspired by the observation that “the spectral embedding consists of the smoothest eigenvectors of the normalised Laplacian on the graph, and adapting it to accord with pairwise constraints will in effect propagate the pairwise constraints to unconstrained objects” (Li et al. 2009). Algorithm 3 describes this process, which constructs a spectral embedding that minimises the cost function

$$\begin{aligned} \mathcal {L}(F) = \sum _{i=1}^{n}\left( \mathbf{y}_i^T \mathbf{y}_i - 1\right) ^2 + \sum _{(i,j) \in \text {ML}} \left( \mathbf{y}_i^T \mathbf{y}_j - 1\right) ^2 + \sum _{(i,j) \in \text {CL}} \left( \mathbf{y}_i^T \mathbf{y}_j - 0\right) ^2 \end{aligned}$$
(4)

via semidefinite programming (SDP), where \(F = (\mathbf{y}_1, \dots , \mathbf{y}_n)^T\) is the data representation. The minimum of \(\mathcal {L}\) should result in a representation in which objects are close to the unit sphere (the first term), i.e. the data is normalised; must-link constrained objects are close to each other (the second term); and cannot-link constrained objects are far apart (the third term).
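Purely to make the objective concrete, the sketch below evaluates Eq. 4 for a candidate embedding F (an \(n \times m\) array whose rows are the vectors \(\mathbf{y}_i\)); it is not the SDP optimisation used by CCSR, only the cost it minimises.

```python
import numpy as np

def ccsr_cost(F, ml, cl):
    """Evaluate the CCSR cost of Eq. 4 for an embedding F of shape (n, m)."""
    unit = np.sum((np.einsum("ij,ij->i", F, F) - 1.0) ** 2)   # keep each y_i close to the unit sphere
    ml_term = sum((F[i] @ F[j] - 1.0) ** 2 for i, j in ml)    # must-link pairs pulled together
    cl_term = sum((F[i] @ F[j]) ** 2 for i, j in cl)          # cannot-link pairs pushed apart
    return unit + ml_term + cl_term
```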

As with the Spec algorithm, the only change necessary for the algorithm’s application to time-series is to construct the distance matrix B using the DTW measure, Eq. 3.

[Algorithm 3: CCSR]

4.2.4 Samarah

Samarah (Forestier et al. 2010a) iteratively refines the output of multiple clustering agents to promote agreement on their solutions. This process is described in Algorithm 4.

At its core, the algorithm finds conflicts (differences between the output of two agents) and attempts to resolve them. A quality criterion is used to evaluate whether the modifications are relevant or not (line 8 of Algorithm 4). In unsupervised clustering, this criterion is a combination of the similarity between the clusterings and the quality of each clustering, e.g. inertia, the number of clusters, or the cluster size ratio. For semi-supervised clustering, this criterion is extended to ensure that constraints are represented when resolving conflicts and in this way a wide range of background knowledge can be integrated.

It requires a function to be defined that measures the satisfaction of the prior knowledge in agent n’s solution (\(\mathcal {R}^n\)) on the range [0, 1]. In the case of must-link and cannot-link constraints, the criterion measures the fraction of respected constraints, such that

$$\begin{aligned} Q(\mathcal {R}^n)=\frac{1}{|\text {ML}|+|\text {CL}|} \left( \sum _{(i,j)\in \text {ML}}{v_\text {ML}(\mathcal {R}^n,i,j)}+\sum _{(i,j)\in \text {CL}}{v_\text {CL}(\mathcal {R}^n,i,j)}\right) , \end{aligned}$$

where

$$\begin{aligned} v_\text {ML}(\mathcal {R}^n,i,j) = {\left\{ \begin{array}{ll} 1, &{}\text {if } o_i \in C^n_k \text { and } o_j \in C^n_{k},\\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} v_\text {CL}(\mathcal {R}^n,i,j) = {\left\{ \begin{array}{ll} 1, &{}\text {if } o_i \in C^n_k \text { and } o_j \notin C^n_{k},\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

This modification causes a balance to be sought between the background knowledge and the distance metric during conflict resolution. For example, a modification that causes a large improvement in either the similarity or the overall quality can be approved even if the fraction of satisfied constraints decreases.
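A minimal sketch of this criterion, assuming an agent's result is given as an array labels in which labels[i] is the cluster of \(o_i\):

```python
def constraint_satisfaction(labels, ml, cl):
    """Fraction of satisfied must-link and cannot-link constraints (Q above)."""
    satisfied = sum(labels[i] == labels[j] for i, j in ml) \
              + sum(labels[i] != labels[j] for i, j in cl)
    total = len(ml) + len(cl)
    return satisfied / total if total else 1.0
```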

For use in time-series clustering, it is necessary to modify each agent to use the DTW measure. In the case of k-Means agents, the same modifications as those described for the COP-KMeans algorithm are required.

[Algorithm 4: Samarah]

4.2.5 CPClustering

Different from the other methods, which are algorithmic, CPClustering is a declarative approach in which the problem and constraints are expressed as a constraint optimisation problem (COP). The COP, which can be viewed as the definition of the search space, is then solved using a constraint programming solver. The main principle of constraint programming is to explore the search space using constraint propagation and search. Propagating a constraint c means removing from the domains of the variables involved in c some or all of the values that cannot be part of a solution of c. In this way constraints are used to prune the search space. All the constraints of the COP are propagated until a stable state is found. If, in this state, the domain of a variable becomes empty, then the state is a failure and the solver backtracks. If the domain of each variable becomes a singleton, then a solution is reached. Otherwise the solver takes a variable whose domain is not a singleton, splits its domain to create subproblems, and proceeds on each subproblem. The choice of variable and the way subproblems are created and ordered are defined by a search strategy. For instance, for an optimisation problem with an objective function F to be minimised, a branch-and-bound mechanism is integrated: each time a solution is reached, its objective value f is computed, and the solver backtracks with a new added constraint \(F<f\), which enforces that the next solution must be better than the current one. The last solution found is therefore the best one.

Using constraint programming, CPClustering models a constrained clustering problem as a constraint optimisation problem. The assignment of objects to clusters is modelled by a variable \(G_i\) for each object \(o_i\). The domain of each variable \(G_i\) is the set \(\{1,\ldots ,K\}\), so an assignment \(G_i=c\) means that object \(o_i\) is grouped into cluster c. A complete assignment of all the variables \(G_i\) therefore defines a partition. To break symmetries between partitions, other conditions are imposed using CP constraints:

  • the first object must be in the first cluster,

  • an object \(o_i\) is assigned to a new cluster c iff cluster \(c-1\) contains an object \(o_j\) such that \(j<i\).

In this model, must-link and cannot-link constraints are expressed in a natural way: a must-link constraint on two objects \(o_i\), \(o_j\) is expressed by the constraint \(G_i=G_j\), and a cannot-link constraint by \(G_i\ne G_j\). CP solvers are extensible, which means new constraints, along with their propagation algorithms, can be added. Exploiting this advantage, CPClustering is reinforced with new, dedicated constraints for the principal clustering objective functions (Dao et al. 2017). For instance, for a clustering problem that minimises the maximal diameter of the clusters, a variable D representing this maximal diameter is introduced in the model. Any two objects \(o_i\), \(o_j\) whose distance is larger than D must then be in different clusters. This is expressed by the relation

$$\begin{aligned} \forall \ i,j\in \{1,\ldots ,N\},\quad d(o_i,o_j)>D \longrightarrow G_i\ne G_j, \end{aligned}$$

which is encapsulated by a new constraint \(\text {diameter}(D,G,d)\). In this relation, the distance measure between objects is represented by d. This distance measure can be either the Euclidean distance or DTW; therefore CPClustering can be used with both without modification (by pre-computing the distance matrix, e.g. Eq. 3).

By the principle of constraint programming, all the must-link and cannot-link constraints are satisfied by the returned solution.
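For illustration, the sketch below expresses such a model with Google OR-tools CP-SAT rather than the solver used by CPClustering: assignment variables, ML/CL constraints, the first symmetry-breaking rule, and a naive minimal-maximal-diameter objective over an integer-scaled precomputed distance matrix (e.g. DTW, Eq. 3). The second symmetry-breaking rule and the dedicated propagation of Dao et al. (2017) are omitted, so this is a simplified reformulation, not the authors' model.

```python
import numpy as np
from ortools.sat.python import cp_model

def cp_clustering_sketch(D, K, ml, cl, scale=1000):
    """D: precomputed distance matrix (e.g. DTW), K: number of clusters."""
    n = len(D)
    Dint = np.rint(np.asarray(D) * scale).astype(int)   # CP-SAT requires integer coefficients
    model = cp_model.CpModel()
    G = [model.NewIntVar(0, K - 1, f"G_{i}") for i in range(n)]
    model.Add(G[0] == 0)                                 # the first object is in the first cluster
    for i, j in ml:
        model.Add(G[i] == G[j])                          # must-link
    for i, j in cl:
        model.Add(G[i] != G[j])                          # cannot-link
    diameter = model.NewIntVar(0, int(Dint.max()), "D")
    for i in range(n):
        for j in range(i + 1, n):                        # quadratic number of reified pairs
            same = model.NewBoolVar(f"same_{i}_{j}")
            model.Add(G[i] == G[j]).OnlyEnforceIf(same)
            model.Add(G[i] != G[j]).OnlyEnforceIf(same.Not())
            # If o_i and o_j share a cluster, the diameter must cover their distance.
            model.Add(diameter >= int(Dint[i, j])).OnlyEnforceIf(same)
    model.Minimize(diameter)
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        return [solver.Value(g) for g in G]
    return None                                          # infeasible constraint set
```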

4.3 Methodology

Several datasets were chosen to evaluate the algorithms so that they represent typical time-series clustering problems. The UCR repository (Chen et al. 2015) is a standard repository that enables comparative results to be published. A subset of its datasets having a large number of samples and a moderate number of classes was chosen to reflect the characteristics of challenging problems; these are described in Table 2.

Table 2 Subset of the UCR repository used for experimentation

Constraints were generated by taking pairs of points at random and generating a must-link or cannot-link constraint depending upon whether they belonged to the same class or not. Different constraint set sizes were considered: 5, 10, 15, and \(50\%\) of the number of points N in the dataset (these represent a very small fraction of the total number of possible constraints, which is \(\tfrac{1}{2}N(N-1)\)). Ten repetitions of each constraint set size were generated, so each experiment was repeated ten times, each with a different random subset of constraints (all algorithms are evaluated using the same random constraint sets). Finally, the algorithms were executed with no constraints for comparison.
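A minimal sketch of this constraint-generation procedure, assuming y holds the ground-truth class labels:

```python
import random

def generate_constraints(y, n_constraints, seed=0):
    """Sample random pairs: same class gives a must-link, different classes a cannot-link."""
    rng = random.Random(seed)
    n = len(y)
    ml, cl, seen = [], [], set()
    while len(ml) + len(cl) < n_constraints:
        i, j = rng.sample(range(n), 2)
        pair = (min(i, j), max(i, j))
        if pair in seen:
            continue
        seen.add(pair)
        (ml if y[i] == y[j] else cl).append(pair)
    return ml, cl

# e.g. a constraint set of size 5% of N:
# ml, cl = generate_constraints(y, int(0.05 * len(y)), seed=repetition)
```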

Because these algorithms are applied as semi-supervised approaches, reference data in the form of training or validation sets do not exist and therefore optimising parameter values proves difficult. This does not pose too much of a problem for k-Means based algorithms (COP-KMeans and Samarah’s agents) because the cost function under DBA globally decreases (or remains the same) at each iteration, and therefore choosing a sufficiently large number of iterations mitigates the problem (in these experiments a value of 100 was taken). The same approach cannot be taken for the spectral approaches (Spec and CCSR), however, as their parameters (\(\sigma \) in Equation (2) and the number of eigenvectors) need to be specifically chosen for each problem. Unfortunately, these cannot be intuitively selected and the algorithms’ performance is highly dependent upon these values (Ng et al. 2001). To enable the spectral methods to be included in the study, the unrealistic use case was chosen in which these parameters were optimised by grid search on the training sets included in the UCR datasets with 5% constraints. To isolate the contribution of the constraints, the same parameter values were used for each constraint set size (including unconstrained). Being parameter free, CPClustering is the only method that can be applied without these considerations.

All samples were normalised to have unit length. The adjusted Rand index (ARI) (Hubert and Arabie 1985) and constraint satisfaction (Sat.) metrics are used to evaluate performance. The presented results represent the mean of ten repetitions of each experiment (each with a different random set of constraints) and are rounded to three decimal places.

Finally, to measure the amount of agreement between each algorithm’s underlying objective function and search bias and the constraints, the algorithms were evaluated with no constraints (i.e. unconstrained) and the fraction of satisfied constraints was then determined using the \(50\%\) constraint sets. This is referred to as consistency (Con.), the inverse of inconsistency (Wagstaff et al. 2006) or informativeness (Davidson et al. 2006), and measures the fraction of constraints that an algorithm is able to determine correctly using its default bias.
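Under these definitions, the evaluation of a single run can be sketched as follows, reusing the constraint_satisfaction helper sketched in Sect. 4.2.4 and scikit-learn's adjusted_rand_score for the ARI:

```python
from sklearn.metrics import adjusted_rand_score

def evaluate(labels, y_true, ml, cl):
    """ARI against the reference labels and the fraction of satisfied constraints (Sat.)."""
    return {"ARI": adjusted_rand_score(y_true, labels),
            "Sat.": constraint_satisfaction(labels, ml, cl)}

# Consistency (Con.): satisfaction of an unconstrained run, measured on the 50% constraint sets.
# con = constraint_satisfaction(unconstrained_labels, ml_50, cl_50)
```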

4.4 Results

As a sanity check, the effect of the distance measure on the distribution of the clusters (as defined by the ground truth) was evaluated using the mean silhouette score (Rousseeuw 1987). This score varies between \(-1\) and 1 and evaluates cluster overlap such that

$$\begin{aligned} s(i) = \frac{b(i)-a(i)}{\max \{a(i),b(i)\}}, \end{aligned}$$
(5)

where a(i) is the average dissimilarity of point i with all other points within the same cluster and b(i) is the lowest average dissimilarity of point i to each of the clusters to which it is not assigned. This can be averaged over all points in a dataset to calculate its silhouette score. Higher scores indicate that points belong to clusters in which the other points are similar, and are dissimilar to the points in the next closest cluster. The results, which are presented in Table 3, show that the DTW measure results in clusters that are more distinct than those obtained when using the Euclidean distance measure in five out of the nine datasets. The mean difference in scores when DTW results in more separated clusters is 0.122, compared to 0.027 with Euclidean. Therefore DTW, in general, results in more separated clusters. It should be noted that DTW was used without a locality constraint, which can speed up its calculation and increase performance on unseen data (Lines and Bagnall 2015); however, adding this constraint introduces an additional parameter.
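For instance, given the precomputed DTW dissimilarity matrix B of Eq. 3 and the ground-truth labels y, the mean silhouette score can be obtained directly with scikit-learn (a sketch):

```python
from sklearn.metrics import silhouette_score

# B: pairwise DTW dissimilarity matrix (Eq. 3); y: ground-truth class labels.
dtw_silhouette = silhouette_score(B, y, metric="precomputed")
```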

Table 3 Silhouette scores of the time-series datasets (test sets)

The unconstrained performance of the algorithms is presented first to establish a baseline. These results are presented in Appendix A, Table 8, together with their consistencies in Table 9. It is seen that Spec outperforms all other algorithms in most datasets, sometimes by large margins, e.g. ECG5000 and UWaveGestureLibraryAll. The overall ranking of algorithms in terms of the number of datasets (in parentheses) in which they achieved the highest ARI is: Spec (7), COP-KMeans (2), CCSR (0), Samarah (0), and CPClustering (0). Spec is also the most consistent of all the algorithms. The overall ranking of algorithms in terms of the number of datasets (in parentheses) in which they achieved the highest consistency is: Spec (6), COP-KMeans (2), CCSR (1), Samarah (0), and CPClustering (0).

Due to the added computational complexity of integrating DTW and DBA into COP-KMeans and Samarah’s underlying k-Means algorithms, not all of the experiments completed within a reasonable amount of time; the absence of these results is marked by a dash in the tables. These datasets represent the longest of the evaluated time-series. Furthermore, several constraint set repetitions resulted in constraint violations during the COP-KMeans clustering process, and these results are marked in the tables (with the number of completed repetitions mentioned in the table caption).

The mean ARIs and Constraint Satisfaction results for the constrained clustering experiments are presented in Appendix A, Tables 10, 11, 12, 13, 14, 15, 16, 17 and 18 and are summarised in Tables 4 and 5.

Table 4 ARI difference between unconstrained clustering and constrained clustering (Constrained ARI–Unconstrained ARI) for each constraint fraction averaged over all datasets, standard deviations in parentheses

The Spec algorithm outperforms all other algorithms in five of the datasets for each constraint fraction. This is unsurprising because it outperforms all other algorithms in five datasets in the unsupervised setting, indicating that it has a strong baseline performance. Analysing the average ARI changes (Constrained ARI–Unconstrained ARI) for each algorithm and each constraint fraction (Table 4) reveals that not all algorithms benefit from the introduction of constraints. CCSR benefits the most, resulting in an increase in ARI of \(\sim 0.262\). Interestingly, adding more constraints does not directly result in an increase in average ARI. In some cases (Spec and CPClustering), adding constraints leads to a small decrease in ARI; however, all of the standard deviations are greater than the change in ARI. Taking CCSR as an example, in six of the datasets (ECG5000, FacesUCR, MALLAT, TwoPatterns, UWaveGestureLibraryX, and UWaveGestureLibraryAll) a large increase in ARI and an associated increase in constraint satisfaction is observed, indicating that the algorithm has benefited from the additional information. In the remaining datasets, however, either no increase or a decrease is observed. CPClustering and COP-KMeans, on the other hand, guarantee full constraint satisfaction, resulting in a large increase in constraint satisfaction compared to the unconstrained case; however, this is not associated with significant increases in ARI, and in some cases a decrease is observed. This indicates that these algorithms focus on correctly clustering the points bound by constraints at the expense of correctly clustering the remaining data. It is therefore clear that simply adding constraints does not always lead to an increase in performance for all algorithms and all datasets; instead, several influencing factors seem to be at play.

Table 5 Constraint satisfaction difference between unconstrained clustering and constrained clustering (Satisfaction–mean consistency) for each constraint fraction averaged over all datasets, standard deviations in parentheses

5 Discussion

This section offers further analysis of the results and discusses implications that arise from them.

5.1 Analysis of the results

A multiple linear regression analysis was performed to uncover the factors that influence the change in clustering performance when constraints are introduced. The following candidate predictors were identified:

  • Consistency: when an algorithm tends to satisfy constraints using its default bias (i.e. unconstrained clustering), adding constraints should not considerably affect clustering performance;

  • Silhouette score: when the clusters in a dataset overlap, adding constraints should lead to an increase in clustering performance;

  • Unconstrained ARI: when an algorithm has a high baseline performance in unconstrained clustering, the benefit of constraints should be diminished;

  • Algorithm: certain algorithms may benefit from the introduction of constraints more than others.

Each algorithm’s mean unconstrained ARI was subtracted from its constrained ARI samples, which formed the dependent variable of the analysis (1782 samples in total). The categorical predictor representing the algorithms was encoded by dummy variables and the remaining predictors were mean centred to allow interpretation of the intercept as the base group (the Spec algorithm). An initial correlation analysis revealed that Consistency and Unconstrained ARI are strongly correlated (\(r=0.724\) and \(p=6.533\mathrm {e}{-289}\); the significance threshold was set at 0.01). As Consistency is a pairwise measure of performance on a subset of the data while ARI is a pointwise measure of performance for the whole dataset, they both capture similar information, and therefore Unconstrained ARI was removed from the analysis.
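A minimal sketch of this regression with statsmodels, assuming a data frame df with one row per constrained run and illustrative column names ari_diff, consistency, silhouette, and algorithm:

```python
import statsmodels.formula.api as smf

# df is assumed to be a pandas DataFrame with one row per constrained clustering run.
# Mean-centre the continuous predictors so the intercept refers to the base group (Spec).
df["consistency_c"] = df["consistency"] - df["consistency"].mean()
df["silhouette_c"] = df["silhouette"] - df["silhouette"].mean()

# The categorical predictor is dummy-coded automatically, with Spec as the reference level.
model = smf.ols(
    "ari_diff ~ consistency_c + silhouette_c + C(algorithm, Treatment(reference='Spec'))",
    data=df,
).fit()
print(model.summary())  # coefficients, R^2, adjusted R^2, F-statistic, p-values
```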

The result of the regression analysis is presented in Table 6 and the added variable plot for the model is presented in Fig. 3. The model has a root-mean-squared error (RMSE) of 0.125, \(R^2=0.549\), adjusted \(R^2=0.548\), F-statistic \(=360\), and \(p=1.020\mathrm {e}{-302}\) (significance threshold set at 0.01). As is intuitive, consistency has a large negative influence on ARI difference, indicating that, when all other factors remain constant, the higher the consistency of an algorithm in an unsupervised setting, the smaller the increase in ARI when adding constraints. This corroborates the findings of Wagstaff et al. (2006). This analysis, however, uncovers additional facets of the problem. It was found earlier that certain algorithms react more favourably to the introduction of constraints than others: Spec and CPClustering react negatively, and CCSR and Samarah positively. Unexpectedly, however, the dataset’s silhouette score has a moderate positive correlation with ARI difference. This may be explained by considering what happens when a point that is subject to an ML constraint is surrounded by points belonging to another cluster (i.e. has a low, or negative, silhouette score). An algorithm is biased to cluster the ML constrained points together. This may, however, also have the effect of biasing any points similar to (and therefore close to) a point linked by an ML constraint to be assigned to the same cluster, which in a dataset with a low silhouette score is more likely to be an incorrect assignment.

Table 6 ARI difference multiple linear regression coefficients (significance threshold was set at 0.01)
Fig. 3 Added variable plot for the whole model (Color figure online)

5.2 Constraint influence

It has been discussed that adding the maximum number of constraints does not necessarily lead to an increase in clustering accuracy. This implies that it is necessary to consider methods to measure the usefulness of constraints to determine whether they should be included or not. These measures fall into two categories, those that are dependent upon the clustering algorithm, and those that are not.

Davidson et al. (2006) demonstrate that constraints improve performance when they are both informative (algorithm dependent) and coherent (algorithm independent).

5.2.1 Algorithm dependent measures

In the case of algorithm dependent measures, Wagstaff et al. (2006) show that there is a strong negative correlation between inconsistency (also referred to as informativeness (Davidson et al. 2006)) and accuracy. Inconsistency is defined as the “amount of information in the constraint set that the algorithm cannot determine on its own” (Davidson et al. 2006) and has been measured in this study as consistency (the inverse of inconsistency). In the experiments presented herein, a negative correlation between ARI difference and consistency has been found, and it is true that, in general, when a high consistency is found, adding constraints does not improve performance or sometimes decreases it (for example, see Table 10 Spec; Table 11 Spec, CCSR, and Samarah; and Table 12 Spec, CCSR, and Samarah), and when a low consistency is measured, performance increases (for example, see Tables 10, 15, and 16 CCSR).

5.2.2 Algorithm independent measures

Measures in the second category attempt to quantify the amount of information contained within a set of constraints independent of the clustering algorithm and are therefore dependent upon the distance measure used. Coherence is one such measure, which quantifies “the amount of agreement between the constraints themselves, given a metric that specifies the distance between points” (Davidson et al. 2006).

To measure coherence, vectors are constructed between the pairs of points joined by ML and CL constraints. The constraints are coherent if the vectors are orthogonal to each other, and incoherent if they are parallel (and overlap). This is a useful measure when considering the Euclidean distance metric; however, it is not obvious how this concept can be extended to DTW, which does not define vector projection. Developing new measures to quantify the usefulness of constraints under such dissimilarity measures therefore remains an open research question.

Table 7 Average ML distance to average CL distance ratio and constraint coherence measured in 2D multi-dimensional scaling representations

We must instead indirectly measure some of the properties of the constraint set in order to gain some insight into the effect of using DTW. Table 7 presents the following measures, taken when using Euclidean distance and DTW dissimilarity:

  • ML/CL Dist. Ratio—the ratio of the average distance between must-link pairs and the average distance between cannot-link pairs; a value less than one means that ML pairs are closer together than CL pairs;

  • MDS Coherence—the constraint coherence (Davidson et al. 2006) measured within a two dimensional multidimensional scaling (MDS) (Kruskal 1964) representation of the distance matrix.

It is unlikely that the MDS representations of the distance matrices capture the necessary aspects of the original space in which the time-series exist (vector angles, exact point distances, etc.) to directly measure constraint coherence. These measurements should therefore only be used to observe general trends in order to understand the dissimilarity measure’s effect on constraints. Furthermore, embedding DTW dissimilarities into a Euclidean space will be less accurate than embedding Euclidean distances. The figures in Table 7 represent the average of the measures taken over all 40 constraint sets generated in the previous section.
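A sketch of how the two quantities in Table 7 can be obtained from a precomputed dissimilarity matrix B (Eq. 3 for DTW) and the constraint lists; the coherence computation itself follows the formulation of Appendix B and is omitted here:

```python
import numpy as np
from sklearn.manifold import MDS

def ml_cl_distance_ratio(B, ml, cl):
    """Average must-link pair distance divided by average cannot-link pair distance."""
    ml_mean = np.mean([B[i, j] for i, j in ml])
    cl_mean = np.mean([B[i, j] for i, j in cl])
    return ml_mean / cl_mean

def mds_embedding(B, n_components=2, seed=0):
    """Euclidean embedding of the precomputed dissimilarities, used to measure coherence."""
    mds = MDS(n_components=n_components, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(B)
```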

The method for calculating constraint coherence that exists in the literature (Davidson et al. 2006) was found to miss some of the cases when constraints overlap, and therefore a new formulation was developed. This is presented in Appendix B and follows the principle of the original description (Davidson et al. 2006).

It is clear that in five out of the nine datasets the ML/CL distance ratio is lower when using DTW than when using the Euclidean distance, in one dataset it is the same, and in three datasets Euclidean is lower but the differences are very small in comparison. This trend is mirrored when analysing the MDS coherences: the DTW measure increases constraint coherence when compared to the Euclidean distance in all but two of the datasets. In addition to the 2D MDS results presented in Table 7, the experiments were repeated using 3–10 dimensional representations and the same relative results were found in all cases, except that DTW resulted in the highest coherence in the InsectWingbeatSound dataset when more than three dimensions were used. The trend of these results is in line with the silhouette scores presented in Table 3: the datasets with the largest differences in DTW’s favour are those in which the constraints appear to be most faithfully represented.

To illustrate the data behind these figures, the dataset in which DTW has the most effect on the constrained points is presented in Fig. 4. It clearly shows that, when using the DTW dissimilarity, points under must-link constraints are more tightly clustered (Fig. 4a) than in the Euclidean embedding (Fig. 4b); in fact, in the Euclidean embedding there is no distinction between clusters of points having the same label. This is confirmed when analysing the cannot-link constraints: they clearly define separate clusters when using the DTW dissimilarity measure (Fig. 4c), whereas when using the Euclidean distance they are randomly distributed (Fig. 4d).

Fig. 4 Constraints represented by the point distances embedded in 2D space using multidimensional scaling. The original distance matrices were calculated using DTW (a, c) and the Euclidean distance (b, d). a DTW: must-link constraints, b Euclidean: must-link constraints, c DTW: cannot-link constraints, d Euclidean: cannot-link constraints (Color figure online)

5.3 Challenges

This study has focused on using constraints in time-series clustering; however, numerous challenges related to capturing these constraints remain. For example, it is necessary to study and detail the thematic constraints (i.e. the opinions of the expert) that have to be captured to guide the process.

These thematic constraints can be extremely broad and have to be translated into actionable constraints. In the current state of knowledge, a limited number of actionable constraints into which thematic constraints can be translated exist: ML, CL, label, number of clusters k, and size. The following observations are therefore translatable: “these two objects seem to be of the same nature”; “these two ensembles of objects are of the same nature” (ML constraints between all the pairs of objects of the two sets); “these three sets are completely different” (CL constraints between all the pairs of objects from different sets); “this object is of type X” (labelling constraint); and “a cluster cannot represent more than 20% of the data” (cluster size constraint).

Nevertheless, generating actionable constraints from a set of data points can rapidly lead to a significant increase in both combinatorial complexity and the scope of the constraints. For example, a constraint “of the same nature” on two sets of size \(N_1\) and \(N_2\) will generate \(\tfrac{1}{2}N_1(N_1-1) + \tfrac{1}{2}N_2(N_2-1) + N_1N_2\) ML constraints; for instance, two sets of 100 and 150 objects already yield \(4950 + 11{,}175 + 15{,}000 = 31{,}125\) ML constraints. This results in a very large number of constraints, which could prevent the use of declarative methods. Another example is a constraint that states two sets of size \(N_1\) and \(N_2\) are “of different nature”. Depending on the context, this constraint can correspond to a disjunction of several sets of constraints.

The following two problems are therefore identified. For each problem we give some direction on how it can be tackled.

  • How should algorithms be modified to enable them to deal with large constraint sets?

    • Reduce the size of the model by limiting the number of considered objects, for example by sampling and/or by identifying irrelevant objects.

    • Relax the optimality of the solution using a threshold on the execution time (with no guarantees of the quality of the final result).

    • Use local search instead of global search.

  • How can the number of constraints be reduced or limited without loss of quality (i.e. define a minimal set of constraints)?

    • Sample the constraints and/or sample the objects under constraint.

    • Identify informative constraints (some progress has been made in this direction, as discussed in the previous subsection).

6 Conclusions

To recapitulate, this manuscript has presented a background of constrained clustering in relation to time-series clustering. A comprehensive review of general constrained clustering algorithms has been presented and several publicly available implementations were modified to use the DTW dissimilarity measure and DBA averaging method. A comparative study of these approaches applied to publicly available data has been conducted and the results analysed. This investigation has been concluded by a discussion of the issues raised.

It has been shown that integrating DTW and the DBA averaging method allows classic constrained clustering algorithms to be used effectively for time-series. Nevertheless, the lack of backtracking in COP-KMeans means that if a constraint violation is unavoidable in the current iteration the algorithm fails, a problem that becomes more pronounced as more constraints are added. Furthermore, when iterative algorithms are in question, the execution time can become prohibitive when long time-series are to be analysed. Methods that take a distance matrix as their input (such as spectral clustering and declarative approaches) are more effective in this case. Within these methods, spectral algorithms offer superior performance; however, this is dependent upon the correct choice of parameter values, which is not possible in a semi-supervised setting. Declarative approaches are simpler still: they have no parameters and omit the need to calculate cluster centres (and therefore expensive averaging computations); however, their performance is lacking. In terms of constraint satisfaction, both COP-KMeans (if a solution is returned) and CPClustering guarantee to fulfil all the constraints.

By analysing the results of applying the modified algorithms to time-series datasets, several factors that influence the effectiveness of constraints have been identified, namely coherence and cluster overlap. These results have highlighted the need for measures of constraint usefulness. The current definition of constraint coherence, which may indicate whether a constraint set will increase performance or not, is dependent on the distance measure used and cannot be extended to measures without a definition of orthogonality (e.g. DTW). Furthermore, the links between cluster overlap and ARI offer new directions for research.