Keywords

1 Introduction

Exploration of spatiotemporal data is a key aspect in many areas of management, design and business. Rapid increase of collected spatiotemporal data is associated with an intensive development of wireless sensor networks, improving sensors design techniques and increasing transmission capacity in mobile networks. Spatiotemporal data may be related to the following areas of applications: collections of events generated by sensors deployed over certain geographical regions, information about trajectories of vehicles, animals or groups of people or evolutions of phenomenons in both spatial and temporal aspects. Analysis of changes in climate and weather is the field which may generate huge amounts of spatiotemporal data described not only by the sets of points, but also by the complex objects like polygons. The problem of discovering frequent patterns in spatiotemporal data is related to several applications tasks like: analysis of traffic in cities [20], movements prediction of celestial bodies in astronomy [6] or crime analysis [41].

In addition, standard methods used in canonical data mining problems like apriori based algorithms [1], efficient clustering [18], periodicity detection [39], fast validation and data interpolation [38] need to be integrated into non-trivial, sophisticated approaches which can deal with data uncertainty, shifts in spatial or temporal dimensions and non-invariant scaling problems. In the publication, we attempt to provide a survey on the most recent methods developed for the spatiotemporal data clustering.

In particular, we propose a review of clustering methods for both spatiotemporal points and polygons (geographical areas). Previous reviews on that matter consider only the categorization of spatial objects [34, 48]. Additionally, we review distance measures proposed for complex spatial or geographical objects. To the best of our knowledge this is the first survey on the mentioned measures, gathering their properties and computing algorithms from multiple resources and recently proposed publications. In opposition to our paper, the survey proposed in [47] considers the frequent patterns discovery methods rather than clustering event-based spatiotemporal data. An attempt to provide a review of patterns discovery methods for trajectory-based spatiotemporal data has been proposed in [39], which do not consider at all clustering methods for complex geographical objects and focus mainly on moving objects.

The layout of the paper is as follows: in Sect. 2 we review a categorization of spatiotemporal datatypes in the view of their adaptations to clustering algorithms. Section 3 summarizes results in the area of clustering complex spatiotemporal objects as polygons and areas. Section 4 recalls the most important distance measures for complex spatial objects. Section 5 provides a survey on the recently proposed clustering algorithms for moving objects and trajectories. Conclusions to the survey are given in Sect. 6.

2 Spatiotemporal Datatypes

Spatiotemporal datatypes are dependent on the real-world applications. Based on literature [34, 39, 48], we can distinguish two types of spatiotemporal data: event-based (also known as location-based), collected from stationary deployed sensors and trajectory-based (also referred as ID-based [39]) used to describe movements of objects. For the event-based data case, each event may be associated with a property p which value is denoted by the function f(xytp) where (xy) is a location (usually expressed in terms of longitude and latitude), t is a time stamp during which the event has been collected. Considering the more complex spatiotemporal objects (as polygons or areas), the location of an object may be denoted by the set of its coordinates. In the case of trajectory-based spatiotemporal data, for a given set of n objects \({o_1, o_2, \dots , o_n}\), a trajectory of an object \(o_i\) is represented by a sequence of geographical points \((x_1, y_1, t_1), (x_2, y_2, t_2), \dots , (x_n, y_n, t_n)\), where \((x_j, y_j)\) is a location at the time stamp \(t_j\). The above distinction between event-based and trajectory-based spatiotemporal data has been mainly introduced in [39].

The categorization presented above can be in addition broaden with the specification of different spatial datatypes (points, lines and polygons) and their extensions to time domain: database may contain only the last snapshot of actual positions of observed objects or the whole history of an evolution of a spatiotemporal phenomenon. In some other cases only a stream of spatiotemporal data may be available.

The categorization of spatiotemporal datatypes is given in [34]. The authors of [34] distinguish the following categories: Fixed location denoting datasets containing occurrences of events of predefined types on geographical areas. Algorithms for clustering (and partially patterns discovery) for that datatype have been proposed in [9, 23, 41]. If the spatial dimension is extended to polygons or geographical areas, then one may refer to the clustering algorithms presented in [30, 50]. On the other hand, category defined as Dynamic location refers to trajectory-based spatiotemporal data. Datatypes denoted by Updated snapshot and either by Dynamic location or Fixed location labels refer to spatiotemporal data streams. The former describes moving objects reporting only the current location or position. The latter denotes streams of events occurrences (with each event described as above). Algorithms for spatiotemporal data streams clustering and classification have been recently proposed in [7, 31, 32].

The forms denoted by Time Series (according to categorization presented in [34]) are particularly useful because provide means for adaptation of classical algorithms and similarity measures used in time series analysis. For example, [14] presents a new similarity measure (Edit Distance on Real sequence, EDR) developed for the comparison of trajectories of objects. On the other hand, the time series representations of event-based spatiotemporal data are still unknown and will be developed in the future years.

In Table 1 we provide references contributing to the spatiotemporal datatypes definitions and data mining techniques proposed for them.

Table 1. Summary of the publications on spatiotemporal data types and data mining techniques proposed for them.

First propositions of spatial and spatiotemporal clustering using statistical approach have been given in [24, 36]. [36] gives a clustering method using spatial scan statistics (the approach has been improved in [27]), whereas [24] proposes an extension taking into account spatial shifts in the nature of evolving phenomenon. Due to this, proposed algorithm is able to detect clusters of spatiotemporal data which dynamically change their position and shape.

Clustering complex spatial and spatiotemporal objects is gaining attention of researchers nowadays [29, 30, 50]. The idea is to discover neighboring areas or geographical regions characterized by the same (or similar) value of non-spatiotemporal attribute (f.e. pollution). [26] adapts Fuzzy C-Means algorithm to spatiotemporal data. [25] raise the problem of anomaly detection in spatial time series using spatiotemporal clustering.

3 Clustering Spatiotemporal Events and Complex Geographical Objects

Proposed algorithms often operate on the more complex spatial objects f.e. polygons or lines. Classical density-based clustering algorithm - DBSCAN has been proposed in [18]. Many variations of the well-known density clustering algorithms like DBSCAN, OPTICS, NN were adapted to operate on spatiotemporal data. In addition, some non-standard grouping algorithms have been proposed: f.e. Spatio-Temporal Polygonal Clustering (STPC). Clustering algorithms are categorized into five main domains: Partitioning, Hierarchical, Grid-based, Model-based and Density-based [22].

In Partitioning methods, clusters are computed according to the mean value in a cluster (K-means) or based on the selection of an object which is nearest to the cluster’s center (K-Medoid). The name of the category: Partitioning methods is inspired by the fact, that each object in a dataset is assigned to one and only one cluster (there are no objects classified as a noise). Objects partitioning is performed according to the predefined optimization criterion - for a given number of clusters we would like to find assignments which minimizes the sum of distances between objects and centers of clusters (their mean values or the most central elements) to which they belong. Hierarchical clustering methods can be divided into two approaches: ascending and descending. In the ascending approach, each data object is initially assigned to its own cluster. Then, the algorithm gradually merges clusters until a predefined number of groups will be reached. On the other hand, descending approach divides one cluster (into which all data objects are initially assigned) until a predefined number of clusters will be reached.

Grid-based methods proceed with dividing data space into cells (a grid). Then, in the clustering phase, some cells are merged based on a predefined condition. For example: two dense, neighboring cells are merged into one (a cell is dense if it contains a number of objects greater than the predefined threshold). STING is an example of grid-based clustering algorithm [50]. Model-based methods try to fit clusters according to the predefined model of the data (like probability distribution). An example of a model-based method is the Expectation-Maximization algorithm [42]. Denisity-based methods try to find clusters according to distributions of density in a dataset. Due to this, appropriate density threshold should be specified by user: f.e. an estimated number of objects in a predefined neighborhood of an object [18]. Due to their simplicity, density-based clustering algorithms are widely used in data mining. Attempts to improve their efficiency and reduce needs of expert knowledge during parameter specification have been made in [4, 19].

3.1 Algorithms for Clustering Complex Spatiotemporal Objects

In this section, we proceed with description of clustering algorithms for complex spatiotemporal objects.

ST-GRID (SpatioTemporal-GRID) is a clustering method based on the partitioning spatiotemporal space into two separate grids: for spatial and temporal dimensions. In [49], the authors propose to compute the precision of a grid, based on the so-called k-dist graph which is constructed by random sampling of a dataset, calculating distance from each sample to its k-nearest neighbor and sorting calculated distances in decreasing order. The presence of clusters will be indicated by the easily noticeable threshold in the sorted distances. Calculated thresholds may be used as grid resolutions. As in the typical grid clustering algorithms, dense neighboring cells are merged to create spatiotemporal clusters. The above procedure has been originally developed only for spatiotemporal points.

ST-DBSCAN (SpatioTemporal-DBSCAN) is the algorithm developed on the conceptions of the well-known density clustering algorithm - DBSCAN (Density Based Clustering Algorithm with Noise) [18]. ST-DBSCAN has been introduced in [49] and then rearranged in [9]. ST-DBSCAN modifies DBSCAN to detect clusters according to their non-spatial, spatial and temporal dimensions. Before we describe ST-DBSCAN algorithm, we took a quick glance on the pure DBSCAN algorithm. One of the most important properties of DBSCAN is the ability to detect clusters with an arbitrary shape: circular, ellipsoidal, linear or even more complicated. However, need to specify density thresholds may result that the algorithm will do not detect proper but sparse clusters. That problem has been addressed in the another density based clustering algorithm - OPTICS [4]. ST-DBSCAN considers cluster densities according to both spatial and temporal thresholds (assuming that for many applications they are very different). Additionally, ST-DBSCAN is able to include or exclude an object from the cluster on the basis of its non-spatiotemporal attributes: f.e. if the represented temperature is very different from the cluster’s average temperature.

ST-SNN and ST-SEP-SNN Algorithms are two variations of the well known Shared Nearest Neighbor (SNN) density-based clustering algorithm. It is a noteworthy fact, that both algorithms ST-SNN and ST-SEP-SNN have been originally presented (in [50]) for clustering sets of polygons rather than points. Similarity between two objects according to the SNN algorithm is defied as a number of nearest neighbors shared by these two objects.

  • A list of spatiotemporal neighbors of any polygon p is denoted by \(NN(p)= \) k-SPN-List(p) \(\cap \) k-TN-List(p) where k-SPN-List(p) and k-TN-List(p) are lists of k neighbors of a polygon p in respectively spatial and temporal dimensions.

  • Similarity between a pair of polygons p and q is the number of nearest spatiotemporal neighbors that they share: \(similarity(p, q) = \big | NN(p) \cap NN(q)\big |\).

  • The density of a polygon p is defined as the number of polygons that share at least Eps neighbors with p - \(density(p)= \big |\{q \in \mathcal {D}|similarity(p,q)\ge Eps \}\big |\).

  • A core polygon is a polygon where \(CoreP(\mathcal {D}) = \{ p \in \mathcal {D}|denisty(p) \ge MinPs\}\) where MinPs is a user specified threshold.

The above conceptions determine clustering spatiotemporal polygons according to the ST-SEP-SNN algorithm. After marking each polygon either as a core or non-core, the algorithm proceed with clusters creation by processing each polygon in the dataset. During processing step, if an unprocessed core polygon p has been encountered, a new cluster is created and all polygons in the NN(p) list are assigned to the new cluster (the same is recursively applied to the unprocessed core polygons encountered in the NN(p) list).

ST-SNN is an algorithm that proceeds similarly to the ST-SEP-SNN algorithm presented above, with exception that the list of nearest neighbors NN(p) of a polygon p is created using slightly different method. Rather than separately compute and then intersect lists of k-nearest spatial and temporal neighbors, ST-SNN combines spatial and temporal dimensions into one measure and computes only one list of the k-nearest neighbors.

STPC [30] is another denisty-based clustering algorithm developed for spatiotemporal polygons or areas. Again, the algorithm has been developed on the basis of the conceptions of the DBSCAN algorithm. Referring to the above mentioned ST-SEP-SNN algorithm, STPC computes lists of spatial and temporal neighbors on the basis of predefined distances (rather than k-nearest neighbors). The union of both lists contain spatiotemporal neighborhood of a polygon. If the neighborhood is appropriately dense, then the polygon is marked as a core polygon and the algorithm proceeds similarly to the above.

4 Distance Measures for Complex Spatiotemporal Objects

It is noteworthy to recall spatial distance measures used for polygons or other complex geographical objects. Table 2 provides a comparison of developed distance measures for complex spatiotemporal objects: polygons and trajectories. In the case of polygons, m and n denote the numbers of their vertices and in the case of trajectories their constituting sequences of locations.

Table 2. A comparison of distance measures for complex spatiotemporal data types.

Figure 1 presents a comparison between Minimum Vertices Approximation, Exact Separation Distance and Centroid Distance. Also, in the figure we show the Hausdorff distance for two polygons. The simplified Hausdorff distance is computed using the same formula as shown in Fig. 1, but only between vertices of polygons. Formula 1 presents a method for computing the Hausdorff distance for two polygons.

Fig. 1.
figure 1

Examples of distance measures for spatial polygons.

$$\begin{aligned} \begin{aligned}&\delta _H = max \big (\tilde{\delta }_H(A, B), \tilde{\delta }_H(B, A) \big ) \\&\tilde{\delta }_H(A, B) = \underset{a \in A}{max}\; \underset{b \in B}{min}\;\big |\big |\;(a, b)\;\big |\big | \;. \end{aligned} \end{aligned}$$
(1)

A few distance measures presented in Table 2 may be used both for polygons (geographical areas) and trajectories. If particular distance measure preserves triangle inequality property, then it is possible to reduce the time of computations performed during clustering: f.e. neighborhood search [35]. Also, the methods combining the above measures with spatial and metrical indexes have been proposed in [11, 33].

5 Other Clustering Problems in the Area of Spatiotemporal Data

In this section, we provide a look on the other clustering problems related to spatiotemporal data. In particular, we attempt to provide a general overview of the most important methods proposed for clustering trajectory-based data.

Finding groups of similar moving objects - let assume that for a set of objects \(o_1, o_2, \dots , o_n\) a database stores the trajectory of a movement of each object. Additionally, let assume that each trajectory is represented in the form of a sequence of points, each associated with a timestamp. The problem of discovering flocks in the dataset is described as the problem of finding those sets of objects which for a predefined time interval stay within a disk which radius length is a parameter specified by an expert. A time interval is expressed as the sequence of consecutive timestamps. The problem of finding flocks of objects have been introduced in [21] and also developed in [8]. The above problem has been extended to finding convoys [28] and swarms of moving objects [39]. A convoy is created from a flock by relaxing containment within a disk constraint i.e. rather than looking for the fixed disks of objects, the algorithm searches for dense regions using a clustering algorithm.

Clustering trajectories - the problem has been well studied in literature. Among the most popular algorithms for clustering spatiotemporal trajectories and their similar segments are: Trajectory-OPTICS [43], TRACLUS [37] or DENTRAC [12]. The important property of these algorithms is the ability to cluster segments of trajectories rather than whole trajectories. The property is motivated by the fact that objects usually move together only in small segments of their trajectories. For example: TRACLUS proceeds with three phases: in the first, a trajectory represented in the form of a sequence of points is simplified. The number of points in the sequence is reduced and the resulting parts of each trajectory are replaced with line segments. The replacement should preserve trends and angles representing turns in movements. Then, in the second phase, clustering of similar line segments is performed. In the last step, for each discovered group of similar line segments a representative trajectory is computed. Due to the complex nature of trajectories, an appropriate similarity metric should be selected. The proposed distance measure between two trajectories contains the following components: perpendicular, parallel and angle. The components are computed as follows: the perpendicular components is computed as a Lehmer mean of the distances between ending points of one segment projected into another. The parallel component is computed as a minimum from the distances between endings of one line segment projected into another. The angle component of the distance measure is defined as a product of the length of one of line segments and sinus of an angle between line segments.

6 Conclusions

In the publication, we provide a descriptive review of recently proposed algorithms for clustering complex spatiotemporal objects. In particular, we conduct a survey on the algorithms for clustering complex spatial objects: polygons or dynamically changing areas. Among the reviewed algorithms are ST-GRID, ST-DBSCAN, ST-SNN and STPC. Additionally, we provide references and a brief summary of the distance measures proposed for complex spatial objects (i.e. the Hausdorff distance, simplified Hausdorff distance and the other recently proposed heuristics). We also attempt to provide a look on the other methods proposed for clustering spatiotemporal objects i.e. trajectory-based data. The categorization of spatiotemporal datatypes presented at the beginning of the paper provides a staring point for considering new research fields in the area of spatiotemporal data mining. In particular, the most promising directions are: developing algorithms for spatiotemporal data streams and adaptation of knowledge discovery methods proposed in time series analysis to spatiotemporal data.