1 Introduction

Discovering interesting, but implicit spatiotemporal patterns from datasets is important for many scientific domains such as astronomy, ecology, meteorology, public health, agricultural sciences [15]. The ever-growing nature of data being generated and collected from various scientific sources makes the data-driven knowledge discovery process very challenging to the researchers in these fields. The spatiotemporal co-occurrence patterns (STCOPs) can be used for modeling various scientific phenomena (e.g., tornadoes, propagation of epidemics, clouds). The patterns can be utilized for performing large-scale verification of current knowledge, as well as the prediction of unknown spatiotemporal relationships among different types of feature (i.e. event) types (e.g., prediction the spread of epidemics [20], verification of hurricane landfall precipitation models [14], discovery of the patterns in wildlife migration [16], prediction of blastocyst formation [31]). One important application area for our applied research is solar physics. Spatiotemporal co-occurrences frequently transpire among solar events such as active regions and sunspots. Identifying STCOPs appearing on the surface of the Sun can help us better understand the relationships among solar event types and lead to better modeling and forecasting of crucial events such as solar flares and coronal mass ejections. These solar events can affect the radiation in space; they can impact the safety of space and air travel, and even damage power grids [21].

Spatiotemporal co-occurrence is a specific kind of generic movement patterns, which represents the relationships among the objects that frequently coincide both in space and time [4]. The mixed-drove spatiotemporal co-occurrence patterns represent spatiotemporal feature types whose (point-based) instances are located in spatial and temporal proximity [12]. Spatiotemporal co-occurrence patterns from evolving regions is interested in the discovery of the subsets of feature types whose (region-based) instances overlap in both space and time [25]. While various kinds of spatiotemporal co-occurrence patterns exist in the literature (See Section 3 for more information), we are specifically interested in the spatiotemporal co-occurrences of moving instances with continuously evolving polygon-based representations [25]. In the context of this paper, we will use spatiotemporal co-occurrence pattern (STCOP) for referring to the spatiotemporal co-occurrence patterns discovered from datasets with evolving region-based representations, unless otherwise stated.

1.1 Motivation

In recent years, we have introduced new algorithms and novel techniques for the discovery of spatiotemporal co-occurrence patterns. The spatiotemporal co-occurrence pattern mining is introduced in [25], and a brute-force Apriori-based solution is presented. In [24], the STCOP mining algorithm is improved using a filter-and-refine approach, where insignificant co-occurrences are efficiently eliminated using a filter that utilizes OMAX significance measure. In addition to that, two spatiotemporal trajectory indexing techniques are employed for faster data retrieval in the spatiotemporal co-occurrence pattern mining process to further improve the efficiency of the process [7, 8]. The recent research work on STCOP mining presents an efficient mining schemata for conventional single machine systems [7, 24].

Big spatiotemporal data poses an enormous set of challenges regarding analytics, data processing, capacity, and validation [2]. As pointed out in [30], with the rate at which spatiotemporal data is being generated, it is necessary to develop efficient distributed systems and analytical infrastructure. The applicability of current spatiotemporal co-occurrence mining systems to many real life datasets is very limited, and can further be improved using distributed settings. There are two major challenges of the centralized solutions specifically related to STCOP mining: (1) massive data transfers are required when accessing or updating the data, and (2) expensive geometric calculations are necessary when performing topological spatiotemporal operations. To overcome these issues, we present a scalable solution for mining spatiotemporal co-occurrence patterns in a distributed environment. Accumulo,Footnote 1 a non-relational and distributed database management system, is used for storage and retrieval of spatiotemporal data. One particular challenge we have encountered is the spatiotemporal data modeling in non-relational databases. Key-value stores (primarily used in non-relational databases) provide potentially advantageous schema-free storage; however, they lack the functionality of object-relational database management systems.

1.2 Scope

In this work, we have developed a distributed spatiotemporal co-occurrence pattern mining system. Conceptually, the implemented STCOP mining algorithm is similar to the state-of-art STCOP-miner algorithm presented in [8, 24]. The system uses a non-relational distributed database system as its backend. To integrate the STCOP-miner algorithm into a non-relational database ecosystem, we have designed spatiotemporal data models to be used in column-oriented databases. We also introduce data access and update algorithms along with an in-memory join-index structure that can handle big spatiotemporal data. The spatiotemporal data models are designed to exploit the distributed data storage and retrieval mechanisms of non-relational databases. On the other hand, data access and update algorithms with the indexing structure provide necessary spatiotemporal querying functionality for mining STCOPs in non-relational database systems.

Readers should note that the scope of this paper is limited to discovering spatiotemporal co-occurrence patterns from instances with evolving polygon-based representations, and other co-occurrence pattern mining algorithms are not considered. The STCOP mining algorithm uses a spatiotemporal join operation for identifying the co-occurrences. The join operation is a part of the Apriori-based candidate generation procedure. The frequent pattern growth-based or join-less approaches are not considered in the scope of this work.

1.3 Outline

The rest of this paper is organized as follows. In Section 2, to familiarize the reader with the terminology, the basic concepts related to Accumulo database are demonstrated. Moreover, we present the system architecture and discuss the applicability of the distributed mining schema to spatiotemporal data mining. In Section 3, previous work on different types of spatiotemporal co-occurrence patterns are discussed. In Section 4, the preliminary concepts for spatiotemporal co-occurrence pattern mining are demonstrated, and the STCOP mining algorithm is presented. In Section 5, we describe our methodology for discovery of STCOPs in distributed settings and introduce the spatiotemporal data models for non-relational databases. Moreover, we provide the algorithmic details of our distributed system. We demonstrate our experimental results in Section 6. In Section 7, we present future work and conclude the paper.

2 Non-relational databases for mining spatiotemporal data

Accumulo was inspired by Google’s BigTable data storage system [13]. BigTable is a compressed, high performance and proprietary data storage system primarily built on Google File System (GFS) [17], Chubby Lock Service [9], and Log-structured merge-tree [23]. BigTable’s architecture has been adopted by many popular non-relational databases such as HBase,Footnote 2 Cassandra,Footnote 3 and Accumulo. Accumulo provides high levels of consistency with scales to thousands of nodes and petabytes of data, and processes data in near real-time. We will briefly present the architecture behind our choice of Accumulo, and the suitability of its architecture for spatiotemporal data mining.

2.1 Architecture

Accumulo is a sparse, distributed, sorted and multi-dimensional key-value storage system that depends on Apache Hadoop Distributed File System (HDFS) for data storage and Apache Zookeeper for configuration [9, 27]. HDFS is designed to store very large files, with streaming data access running on a cluster of commodity hardware. It provides high throughput access to application data, and is pertinent for very large datasets. HDFS provides the clients with a single file system view to hide the underlying collection of data blocks spread across multiple data nodes.

The key components of the Accumulo architecture, shown in Fig. 1, are master, tablet servers, garbage collector, logger and monitor. The main function of master is to monitor the cluster for the status of tablet servers, assign tablets (partition of tables) to tablet servers, and perform load balancing. Tablet server component is responsible for handling all the reads and writes for the tables. In a typical deployment, one tablet server is colocated with one HDFS data node. Tablet server is registered with Accumulo’s software by obtaining a lock from Zookeeper, and failures and recovery processes are handled by the master. Another task assigned to tablet servers is handling minor and major compactions. Minor compaction is the process of flushing the data stored in memory to sorted files stored on disk. Major compaction is merging these sorted files into a bigger file. Additional components (not shown in the Fig. 1) include: (1) garbage collector that deletes obsolete files from the file system, (2) monitor that is used for monitoring the key metrics of system resources used by Accumulo, (3) logger used for tracing the system events.

Fig. 1
figure 1

Architectural overview of Accumulo

2.2 Applicability to spatiotemporal data mining

Some of the features of Accumulo, such as table partitioning, load balancing, and horizontal scalability, provide necessary tools for efficiently performing data mining tasks on spatiotemporal datasets. In this part of our discussion, we will briefly point to the important features of Accumulo to provide insight into the necessity of a distributed schema for spatiotemporal data mining from very large datasets.

The key strength of the data representations provided in Accumulo is the ability to store sparse multidimensional data. One practical feature of Accumulo database is automatic table partitioning, where tables are split after crossing a pre-configured row count threshold. Using this feature, tables can be stored across multiple tablet servers evenly, and the system can provide parallelized data access. Load balancing, provided by Accumulo, evenly spreads the workload of tablets to tablet servers, and ensures that tablet servers are not overloaded. Load balancing is important when implementing a distributed system for scalability purposes (e.g., avoiding hotspots in spatiotemporal data analysis by distributing the workload). Furthermore, horizontal scalability (also known as scaling out, i.e. adding more nodes to the system for increasing the workload capacity) can be achieved using Accumulo. In contrast to traditional object-relational databases used for spatiotemporal data in centralized settings, the scaling is cheaper in the means of economic cost [5]. Note that traditional databases should be scaled vertically (scaling up), meaning it is required to increase the system resources such as memory and CPU. Another noteworthy functionality of Accumulo database is the server side iterators. The main function of the iterators is concurrent traversal over the data with optional filtering or transformation. Server side iterators can offload some of the computations to the tablet servers, and lead to significant performance increases.

3 Related work on spatiotemporal co-occurrence patterns

Spatiotemporal co-occurrence pattern mining is conceptually similar to classical frequent pattern mining from transactional databases. However, the implicit spatial and temporal semantics (specifically spatial and temporal overlap) are required to be identified, and the identification of these relationships dramatically increase the complexity of the STCOP mining algorithms. In spatiotemporal frequent pattern mining, the underlying spatiotemporal semantic relationships are the main subjects of discovery. One of these relationships is the co-occurrence relationship, and it is originated from the significance of closeness in spatial and temporal domains, by asserting objects located in space and time proximity are more related than the others [28].

One pioneering advancement in spatial data mining is the discovery of spatial colocation patterns [29]. The spatial closeness of the objects is introduced as the colocation relationship. Given a set of boolean spatial features, spatial colocation mining aims to discover the subsets of features whose instances are frequently colocated together. As a matter of course, it is often very hard to observe point-based spatial objects sharing the same locations. Therefore, a neighborhood relationship (based on user specified thresholds) is used for defining the colocations. The colocation mining algorithm uses an Apriori-based approach [3], which requires a spatial join algorithm while generating and pruning the candidate patterns. Partial-join and join-less approach for mining colocations were presented in [32, 33].

While colocation refers to the purely spatial closeness of objects, the term co-occurrence is more frequently used for spatiotemporal closeness. Mixed-drove spatiotemporal co-occur-rence patterns (MDCOP) are introduced in [12]. MDCOPs represent the subsets of spatiotemporal feature types whose point-based instances are frequently occurring in spatial and temporal proximity. MDCOP-mining algorithms presented in [12] can be interpreted as a temporal extension of spatial colocation mining algorithms to spatiotemporal context. The proposed MDCOP-Miner algorithms follow a similar Apriori-based approach. Following mixed-drove spatiotemporal co-occurrence patterns, sustained emerging (SECOP) [12], partial (PACOP) [10], and periodical (PECOP) [11] spatiotemporal co-occurrence patterns are introduced. Fundamentally, emerging, partial, and periodical co-occurrence relationships are quite similar to MDCOPs. They include additional constraints, and require new interest measures tuned for these constraints. SECOPs represent the subsets of feature types whose instances are increasingly colocated in space and time. PACOPs are concerned with the discovery of spatiotemporal co-occurrences that are partially present in the database. PECOPs represent the subsets of feature types that are periodically co-occurring.

Spread patterns of spatiotemporal co-occurrences over zones (SPCOZ) are introduced in [26]. SPCOZs represent the subsets of feature types whose instances are spreading and co-occurring over particular zones. The main purpose of the mining SPCOZs is discovering spreading structures that co-occur together both in space and time (meaning correlations among the spreading structures are mined instead of trajectories). Another instance of spatiotemporal co-occurrence pattern mining is composite spatiotemporal co-occurrence (COSTCOP) [34]; where a new composite prevalence measure (using spatial and temporal dimensions together) is developed, and a pruning technique is developed for improving the performance of the mining algorithm.

Aforementioned spatiotemporal co-occurrence or colocation models were originally designed for instances with point-based geometric representations. As point-based instances exhibit nearly imperceptible spatial and temporal overlap relationships among each other, the spatial and temporal neighborhoods are to be defined for characterizing co-occur-rences or colocations. However, in spatiotemporal co-occurrence pattern mining from evolving region instances (defined over polygon data type), it is highly likely to observe spatial and temporal coincidences (namely spatiotemporal overlap relationships). Mining spatiotemporal co-occurrence patterns from datasets with evolving regions was introduced in [25]. The spatiotemporal instances, which are represented by polygons evolving over time, are treated as three-dimensional continuous objects. For deciding whether an overlap among these three-dimensional structures form a significant co-occurrence, a spatiotemporal version of Jaccard significance measure is used. Similar to the other co-occurrence patterns, an Apriori-based algorithm (including a spatiotemporal join over spatial and temporal overlap predicates) is used. In [24], a novel filter-and-refine strategy for pruning the instances in the spatiotemporal join phase using OMAX measure is proposed. This algorithm is further improved in [8] by utilizing trajectory-based spatiotemporal indexing techniques.

One of the most remarkable observations about the past research work in co-occurrence pattern mining is considerably small datasets being used for testing the algorithms. Another observation we have made is the rapid increase in the runtime when using relatively larger datasets. Given these observations, one can apprehend that such data analyses tasks, when conducted on massive real life datasets, could greatly benefit from adapting an environment containing distributed storage and computational resources. In our work, we have used a cloud-based distributed database setting for storing and querying data. We have used a slightly modified version of STCOP mining algorithm [8] in non-relational database settings. The details of the algorithms will be presented in Section 4 and Section 5.

4 Preliminary concepts on spatiotemporal co-occurrence pattern mining

As mentioned earlier, we utilize the spatiotemporal co-occurrence pattern mining algorithm introduced in [8, 24], which mines the evolving region data established upon an Apriori-based algorithm that effectively prunes the candidates using a filter-and-refine strategy. Given a set of spatiotemporal feature types and spatiotemporal instances associated with these feature types, a spatiotemporal co-occurrence pattern is a subset of all features, whose instances frequently overlap both in temporal and spatial dimensions. Following parts of this section include the basic definitions, significance measures and the general algorithm for STCOP mining algorithm.

4.1 Definitions

Definition 1

A spatiotemporal instance, denoted as I n s t, is a spatiotemporal object represented as a two-dimensional region continuously evolving over time. Each instance is identified with a unique identifier, denoted as InstanceId, and has a feature type associated with it. Instances have start and end times corresponding to birth and death times of the objects. For each valid timestamp, instances can have exactly one polygon-based geometric representation. The set of all instances is denoted by \(\mathbb {I}\), and the set of all instances of a particular feature type (f i ) is denoted as \(\mathbb {I}_{f_{i}}\).

Definition 2

A feature type (or event type) is a non-spatiotemporal attribute of an instance that signifies the sort (or class) of the instance. A feature type is denoted as f i , and the set of all features is denoted by \(\mathbb {F} = \{ f_{1}, f_{2}, {\ldots } , f_{m} \}\).

Definition 3

A spatiotemporal co-occurrence pattern (referred to as pattern in this paper) is a subset of all feature types, whose instances frequently co-occur in both space and time. A pattern is denoted as P, where \(P = \{f_{i_{1}}, \ldots , f_{i_{k}}\}\) and \(P \subset \mathbb {F}\). The number of feature types in a pattern will be referred as the cardinality of the pattern. A k-cardinality pattern refers to a k-subset of all feature types, \(\mathbb {F}\). The minimum cardinality of the pattern is 2 (k≥2).

Definition 4

Given a k-cardinality pattern \(P=\{ f_{i_{1}}, {\ldots } , f_{i_{k}}\}\), a pattern instance (denoted by P I n s t) of P is a unique incident of a spatiotemporal co-occurrence (namely, spatial and temporal overlap) among the instances of the all feature types in P. Similar to the instances, pattern instances have start and end times. The start time of a pattern instance corresponds to the minimum start time of the participating (co-occurring) instances. The end time of a pattern instance corresponds to the maximum end time of the participating instances. For each timestamp between the start and end time (lifetime), a pattern instance has two kinds of spatiotemporal representation. The first representation is designated for spatiotemporal intersection showing the overlapping regions over time for the participating instances while the second one is for spatiotemporal union demonstrating the union of all regions over the lifetime of the pattern instance. In Fig. 2, we illustrate two overlapping spatiotemporal instances (I n s t i and I n s t j ) and the pattern instance (P I n s t k ) generated from these instances. The start time of P I n s t k is t 1, and the end time is t 10. The intersection geometries of the P I n s t k are valid between the interval (t 3,t 4) while the union geometries of P I n s t k are valid between the interval (t 1,t 10).

Fig. 2
figure 2

The creation of a pattern instance (P I n s t k ) from two spatiotemporal instances (I n s t i and I n s t j ). On the left, the spatiotemporal instances are illustrated. On the right, the intersection (top) and union (bottom) geometries of the pattern instance are demonstrated

4.2 Measures

We utilize two types of interestingness measures in our algorithm. The first one is the co-occurrence coefficient (cce), which is used for assessing the strength of the spatiotemporal overlap in a pattern instance. The second one is the prevalence measure (p), and it is used for evaluating the prevalence of a pattern.

4.2.1 Significance measures

The spatiotemporal co-occurrence coefficient is used for determining the strength of an overlap (both in space and time) relationship. To assess the strength of spatiotemporal overlap, we utilize the commonly used Jaccard (J) measure as co-occurrence coefficient. Additionally, the computationally cheaper OMAX measure is utilized for filtering.

The J measure for a pattern instance is calculated as the ratio of spatiotemporal intersection volume to the spatiotemporal union volume. Given a k-cardinality pattern P, and the pattern instance, P I n s t, and let {I n s t 1,…,I n s t k } be the participating instances of P I n s t. Then, the J measure is calculated as follows:

$$ J = \frac{V(Inst_{1} \cap {\ldots} \cap Inst_{k} )} {V(Inst_{1} \cup {\ldots} \cup Inst_{k} )} $$
(1)

V is the spatiotemporal volume function. The spatiotemporal instances and pattern instances are three-dimensional objects in two spatial and one temporal dimensions. The spatiotemporal intersection operation is represented with ∩, and the spatiotemporal union operation is represented with ∪.

On the other hand, the OMAX measure for a pattern instance is calculated as the ratio of spatiotemporal intersection volume to the maximum volume of all the participating instances. Following the notation used for the J measure (in Eq. 1), OMAX measure is calculated as,

$$ OMAX = \frac{V(Inst_{1} \cap {\ldots} \cap Inst_{k} )} {Max(V(Inst_{1}), {\ldots} , V(Inst_{k}) )} $$
(2)

where M a x function returns the largest participating instance volume. The OMAX measure contains J measure. In other words, maximum volume of participating instances is less than or equal to the union volume of all participating instances (M a x(V(I n s t 1),…,V(I n s t k )≤V(I n s t 1∪…∪I n s t k )). Therefore, for any pattern instance, the OMAX value is always greater than or equal to the J value. Due to the containment property, given a specific cce threshold, for each pattern instance (PInst) passing Jaccard (J) filter, it is guaranteed that it also passes OMAX filter (O M A X(P I n s t)≥J(P I n s t)). Proof of this property can be found in [24].

4.2.2 Prevalance measure

In STCOP mining, for assessing the interestingness of a pattern, the participation index is used. The participation index signifies the prevalence of a pattern [18]. For a k-cardinality spatiotemporal co-occurrence pattern, the participation index (denoted as p i(P)) is defined as follows:

$$ pi(P) = Mi{n^{k}_{i}} \, pr(P, f_{i}) $$
(3)

where \(P = \{ f_{i_{1}}, {\ldots } , f_{i_{k}} \}\) and \(f_{i} \in \mathbb {F}\). Let \(|P_{f_{i}}|\) denote the number of unique instances of feature type f i participating in the pattern instances of P, and \(|\mathbb {I}_{f_{i}}|\) is the total number of instances of feature type f i ; then, the participation ratio of a feature type f i in the pattern P is,

$$ pr(P, f_{i}) = \frac{|P_{f_{i}}|}{ | \mathbb{I}_{f_{i}} |} $$
(4)

Prevalent spatiotemporal co-occurrence patterns are characterized by a co-occurrence coefficient threshold (c c e t h ) for determining the significant pattern instances of the pattern, and prevalence measure threshold (p t h ) for assessing the interestingness of the pattern. For a prevalent pattern, the co-occurrence coefficient for all of its pattern instances must be greater than or equal to c c e t h and the participation index for the pattern must be greater than or equal to p t h .

4.3 STCOP-miner algorithm

The pseudocode of the STCOP mining algorithm can be seen in Algorithm 1. The algorithm follows greedy Apriori iterations to discover the patterns (See Lines 5 to 10 in Algorithm 1). The input of the algorithm is a dataset containing instances of different feature types. The spatiotemporal instances have evolving polygon-based representations. The result returned by the algorithm is a list of prevalent spatiotemporal co-occurrence patterns.

figure d

The initialization step of the algorithm creates the database, and reads the dataset from input files (See Line 1 in Algorithm 1). The variable k, which is the index representing the size of the patterns is set to 1 (See Line 2 in Algorithm 1). Two-dimensional list of patterns, FP, represents the prevalent patterns. The index of the list represents the cardinality. In other words, F P[k] points to a list of k-cardinality patterns. For the conciseness and correctness of the algorithm, we store feature types in the first index of FP (See Line 4 in Algorithm 1). While the set of feature types may be a necessary output for some application areas, our actual prevalent STCOP set is the difference F PF P[1].

After the initialization steps, the algorithm follows the Apriori-based steps, where firstly candidate patterns are generated, and later pruned. The candidate patterns are generated using GenerateCandidatePatterns procedure, and stored in CP, which is a list of patterns (See Line 6 of Algorithm 1). The prevalent k-cardinality patterns found in the previous iteration (F P[k]) is self-joined (F P[kF P[k]), and (k+1)-cardinality candidates which have infrequent subpatterns are removed from CP. Using the candidate patterns (CP), the candidate pattern instances are generated in the procedure GeneratePatternInstances. (See Line 7 of 1). GeneratePatternInstances procedure initially identifies the significant pattern instances, and eliminates the infrequent candidate patterns.

In Algorithm 2, the pseudocode of candidate pattern instance generation procedure (GeneratePatternInstances) is demonstrated. The pattern instances are generated and pruned for each candidate pattern (See Lines 2 to 21 in Algorithm 2). For each candidate pattern (P, let the cardinality of P be k), firstly, two (k−1)-cardinality subpatterns are determined, and the pattern instances of those two subpatterns are joined with a spatiotemporal overlap predicate and candidate patterns instances are generated and stored in C P I n s t list. (See Lines 3 and 4 in Algorithm 2). It is important to note that the subpatterns of any generated candidate pattern (P) must be prevalent; therefore, their pattern instances are stored in the database. The spatiotemporal join operation is handled in ST _Join procedure. For each candidate pattern in C P I n s t, initially, the OMAX value is computed, then if it cannot pass the co-occurrence coefficient threshold (c c e t h ), the pattern instance is removed from C P I n s t (See Lines 5 to 10 in Algorithm 2). After filtering the pattern instances with the OMAX measure, the participation index is calculated for the candidate pattern (See Line 11 in Algorithm 2). Note that some of the candidate pattern instances are already be removed. Then, if the candidate pattern cannot pass the given participation index threshold (p i t h ), the candidate pattern is removed from the candidate pattern list (CP) (See Lines 12 to 15 in Algorithm 2). If it can pass, the J value is calculated for filtered pattern instances. Similar to the OMAX filtering, the J values are calculated for remaining candidate pattern instances and the insignificant ones are removed from the list. When calculating the J values, we only calculate union volumes, as intersection volumes are already calculated in OMAX filtering steps. Using the remaining candidate pattern instances, the participation index is calculated (See Lines 16 to 21 in Algo. 2), and if the candidate cannot pass the participation index threshold, it is removed from the list (See Lines 23 to 25 in Alg. 2); else, the significant candidate pattern instances are written back to the database (See Line 26 in Alg. 2).

figure e

It is important to note that the most vital part of this procedure is the spatiotemporal join operation based on the spatiotemporal overlap predicate (See Line 4 of Algorithm 2 – ST _Join( L 1,L 2,O v e r l a p )). We will explain different implementations of the join procedure in Section 5, and for the integrity of our research, we use a generic nested loop based join algorithm for every data model. See Algorithm 3 for a detailed view. The spatiotemporal search procedure (Search-ST-Overlap) in the generic join algorithm differs for each data model. These search algorithms can be found in Section 5.

5 Modeling spatiotemporal co-occurrences in non-relational databases

The data in Accumulo is represented as simple key-value pairs. However, Accumulo provides moderately richer data modeling opportunities compared to simple key-value stores [1]. The key field in Accumulo is comprised of a row identifier, column identifier, and timestamp. Column identifier has three values: column family, column qualifier and column visibility (See Fig. 3). In our data models, we only used row identifier, column family and column qualifier for identifying the values. Column visibility and timestamp values are not used.

Fig. 3
figure 3

The elements of a key-value pair in Accumulo

In the following parts of this section, the data models for storing spatiotemporal instances are presented. All the data models use the generic join algorithm presented in Algorithm 3. The spatiotemporal search procedure is different for each data model. The search algorithms of the data models are presented in their respective sections.

figure f

5.1 Classical data model

The first data model we have designed is the classical data model (CDM) where each row stores one spatiotemporal instance or pattern instance. This data model mimics the models used in relational database settings. In Fig. 4, we demonstrate the hierarchical decomposition of a row (which stores an instance (a) or a pattern instance (b)) in CDM. The row identifiers (Row Id) are the instance identifiers (or pattern instance identifiers). Column family points to feature types and column qualifiers show timestamps. Value field represents the polygon of an instance in each timestamp. In addition to those, for more efficient spatial and temporal filtering when accessing the instances, metadata fields are generated. Metadata fields store temporal and spatial boundaries of instances and pattern instances. Temporal boundaries are the start and end time of instances or pattern instances. Spatial boundaries are the minimum bounding rectangle of the union of instance geometries or the minimum bounding rectangle of the union of pattern instance’s intersection geometries. In CDM, one instance is stored in a row comprised of multiple columns. Each column in the row is uniquely identified by the timestamp value. The columns point to a particular polygon representing the spatial extension of the instance at a valid timestamp. For pattern instances, the timestamps (in column qualifiers) are divided into two groups: intersection and union geometries. Intersection timestamps point to the intersection geometries of the pattern instances while union timestamps point to the union geometries.

Fig. 4
figure 4

The hierarchical decomposition of key-value pairs as an instance (a) and a pattern instance (b) in classical data model (CDM)

The algorithm of spatiotemporal overlap search procedure for an instance, which is stored using the classical data model, is presented in Algorithm 4. The input parameters for the procedure are the query instance object (Q I n s t), and the table name, L, where the search (or scan) will be performed. In the initialization part, the minimum bounding rectangle (MBR), start time, and end time of query instance are fetched from the database. Also, a server-side scan iterator, which is used for scanning the entire table (from 0 to for row identifiers) is also initialized. (See Line 5 in Algorithm 4 - GetIterator.) Then, for each instance (I n s t), which is returned by the iterator (I t e r L ), temporal filter and spatial filter are applied using the metadata fields. If the lifetimes (start and end times) of Q I n s t and I n s t overlap, and their MBRs intersect, then the actual polygon representations of those two instances are checked. If the actual polygons also overlap (both spatially and temporally), then I n s t is added to the result set (S R e s u l t s). See Lines 7 to 17 in Algorithm 4. After the entire table, L, is scanned, the procedure returns the result set, S R e s u l t s.

figure g

5.2 Spatiotemporal data model

As mentioned earlier, the data in Accumulo is sorted based upon the row identifiers; however, the column identifiers do not carry such a property. With spatiotemporal data model, we intend to make use of the ordered nature of row identifiers. In spatiotemporal data model (STDM), we followed a three-dimensional space-driven partitioning strategy. The partitioning space consists of two spatial dimensions (denoted by x and y) and one temporal dimension (denoted by t), and it is divided into non-overlapping three-dimensional cells. The instances are considered as three-dimensional trajectories spanning through these cells. To divide the partitioning space into cells, three user-defined step-size parameters are used. Each parameter indicates the step size of their respective dimensions, and denoted by Δx, Δy, and Δt. Our space partitioning algorithm can be seen in Algorithm 5. In a nutshell, the partitioning algorithm iteratively determines the time partition (See Line 4 of Algorithm 5) and space partitions (See Lines 5 to 17 of Algorithm 5) of each polygon at a valid timestamp.

figure h

In STDM, the row identifier is the spatiotemporal partition cell identifier, while the column qualifier shows the instance (or pattern instance) identifier. The hierarchical decomposition of instances and pattern instances in STDM is shown in Fig. 5. For pattern instances, the partition cells are calculated for only intersection geometries. The entire list of timestamp–polygon pairs are serialized and stored in the value field. By setting row identifier as the spatiotemporal partition cell, we aim to retrieve possibly co-occurring instances more efficiently by exploiting the order enforced by Accumulo. For the instances (or pattern instances) which span through more than one partition cells, a duplication strategy is used. Namely, the instances may be inserted to database more than once. The rows can store more than one instance (or pattern instance) in STDM. Each column (identified by the instance or pattern instance identifier) of a row points to a different instance or pattern instance.

Fig. 5
figure 5

The hierarchical decomposition of key-value pairs as an instance (a) and a pattern instance (b) in spatiotemporal data model (STDM)

The algorithm of spatiotemporal overlap search procedure for STDM can be seen in Algorithm 6. The algorithm takes a query instance (denoted as Q I n s t) and a table name as its parameter. The initialization step of the search algorithm (See Line 1 in Algorithm 6) plays an important role when evaluating the runtime efficiency of this procedure. As all the instances are written to the database, an initial querying of the database is unavoidable when performing this particular search algorithm. After the initialization step, the procedure primarily determines the spatiotemporal partition cells (denoted as P a r t i t i o n C e l l s) of query instance (Q I n s t) (See Line 3 in Algorithm 6). The server-side scan iterators for searching the database are prepared using the cell identifiers (denoted as c e l l I d) in P a r t i t i o n C e l l s set (See Line 5 in Algorithm 6). Note that these iterators search for only one row, which is specified by their partition identifier; namely c e l l I d. For each instance returned by the iterators, the spatiotemporal overlap predicate is checked, and if they overlap, the instance is added to the result set (S R e s u l t s). See Lines 7 to 14 in Algorithm 6. After iterating over all partition cells, the procedure returns the result set, S R e s u l t s.

figure i

5.3 Indexed spatiotemporal data model

Indexed spatiotemporal data model (ISTDM) is an extension to the spatiotemporal data model. ISTDM employs an in-memory join-index structure, which stores the partition cells of each instance (or pattern instance). For mapping the partition cells to instances, an inverted index structure [22] is used. Traditionally, the inverted index is used for text retrieval where each index entry (a word) points to the documents where the queried word occurs. The index entries for our case are the locations (identified by partition cell identifiers) of instances or pattern instances. For each feature type (or pattern), an inverted index is created, while their instances are getting written to the database. An example scenario is demonstrated in Fig. 6. For a particular feature type (f i ), the instances are \(\mathbb {I}_{f_{i}} = \{Inst_{1}, Inst_{2} , Inst_{3} , Inst_{4}, Inst_{5} \}\). Spatiotemporal partition cells are denoted by S T P a r t i t i o n i . Instances can be part of multiple partition cells. For example, I n s t 4 is only in S T P a r t i t i o n 1, while I n s t 3 spans through S T P a r t i t i o n 1, S T P a r t i t i o n 5, and S T P a r t i t i o n 6. The row identifiers used for storing instances in the database are the partition cell identifiers. On the other hand, in the inverted index, the row key is the instance identifier, and each instance identifier is mapped to the partition cell identifiers.

Fig. 6
figure 6

Example: The creation of inverted index structure for a particular feature type

ISTDM and STDM use the same hierarchical decomposition of key-value pairs for instances and pattern instances. However, the spatiotemporal overlap search procedure is slightly modified for faster access. Algorithm 7 shows the search procedure for the indexed spatiotemporal data model. The main difference is in the initialization section. The STDM search algorithm (Algorithm 6) performs a brute-force search to locate the query instance in the database in the initialization step. However, IFSTDM search algorithm uses our inverted index structure to determine the partition cells, and use it to fetch Q I n s t and the possibly co-occurring instances in table L. Also, P a r t i t i o n C e l l s set is not calculated, as we fetch this information from the index. Therefore, ISTDM eliminates initial brute-force search and the determination of partition cells for the query instance.

figure j

6 Experimental evaluation

For analyzing our data models and the search algorithms designed for these models, we developed a distributed cloud-based STCOP mining system and experimented with it using different datasets and database settings. Our primary intent in these experiments is to observe the effect of data models and using a distributed database setting to the runtime performance of our STCOP mining system. In the following parts of this section, the experimental settings with the datasets will be described, the implementation details of our system will be demonstrated, and the results and analysis of our experiments will be given.

6.1 Experimental settings

We used six artificial and five real life datasets in our experiments. The artificial datasets are created using spatiotemporal dataset generator, ERMO-DG [6]. Two of the real life datasets are the solar event datasets, and the remaining three are the basketball datasets. All of our datasets are publicly available in our website.Footnote 4

The six artificial datasets are named as A, B, C, D, E, and F. The average lifetime of the instances is 12.5 (minimum = 10, maximum = 15), and it is kept the same for all datasets for the ease of analysis. The noise ratio used in all datasets are 4.0. The smallest artificial dataset (dataset A) has 16,799 polygons (containing 113,720 point references in total), while the largest one (dataset F) has more than 562,439 polygons (containing 3,391,559 point references in total). All artificial datasets have nine artificially created feature types.

For solar event datasets, the geometric representations of solar event instances are downloaded using the Web API of Heliophysics Event Knowledgebase,Footnote 5 and tracked and interpolated using the algorithm described in [19]. The solar datasets contain the instances of six different solar event types that are Active Regions, Coronal Holes, Emerging Flux, Filaments, Sigmoids, and Sunspots. The three-month solar event dataset (denoted as 3Mo) contains solar event instances between ’01/07/2013’ to ’30/09/2013’. The six-month solar event dataset (denoted as 6Mo) contains solar event instances between ’01/01/2013’ to ’30/06/2013’.

The basketball datasets are obtained from the NBA’s official statistics page.Footnote 6 We scraped the data for three games of Atlanta Hawks in 2014-2015 season, which are following: (1) ’Atlanta Hawks – Toronto Raptors’ in ’29/10/2014’ (denoted as ATL1), (2) ’Atlanta Hawks – Los Angeles Clippers’ in ’05/01/2015’ (denoted as ATL2), (3) ’Washington Wizards – Atlanta Hawks’ in ’04/02/2015’ (denoted as ATL3). The raw data from NBA contains the point locations of the basketball and ten players on the court for a particular period (a specific play) of a game. The specified point locations of the players are buffered to create polygons. We categorized the players based on their teams (Atlanta Hawks (Atl) or the opponent (Opp)), and the positions of players (i.e., center (C), forward (F), guard (G)). Therefore, including the basketball, we have seven feature types that are Ball, Atl-C, Atl-F, Atl-G, Opp-C, Opp-F, and Opp-G.

The properties of artificial and real life datasets can be seen in Table 1. The total number of points (vertices) in the polygons can be seen in the last column.

Table 1 Datasets used in our experiments and their basic properties

One goal of our experiments is to observe the effect of horizontal scaling (scaling out) in our system. In order to see the effects of scaling out, we conducted our experiments using one, three and six tablet servers hosting the database. As data nodes are colocated with tablet servers, the term tablet server is used instead of data nodes. The experiments using one tablet server is included for exhibiting an experimental environment showing very similar characteristics to traditional database settings. We ran the three and six tablet server experiments on Amazon Web Services cloud computing platform.Footnote 7 Because of the economic constraints, we ran the one tablet server experiments on a local virtual machine. The system settings when using three and six tablet servers can be seen in Table 2. Nodes that are assigned to the role NameNode controls the distributed file system (HDFS). Zookeeper leader coordinates the ZooKeeper followers (that are Secondary NameNode and Accumulo master) for input and output operations. Secondary NameNode role is an auditing system for Hadoop, performing periodical checkpoints. Accumulo master is responsible for load balancing, as well as error detection, in tablet servers. The tablet server and data node roles are colocated for decreasing the network latency. Tablet servers are responsible for managing (i.e., reads and writes) a subset of all tables. Data nodes simply store data in HDFS. In one tablet server settings, all the above-mentioned roles are performed on the same machine.

Table 2 The system settings for experiments using three (3TS) and six (6TS) tablet servers

In one tablet server experiments, a virtual machine in a personal computer (with 2Ghz Intel Core i7 CPU and 4GB memory and 64 GB SSD storage) is used for conducting the experiments. In three and six tablet server experiments, the nodes used in AWS, are medium size computing instances (officially listed as m3.large). These instances include 10-core 2.5 GHz Intel Xeon E5-2670 CPUs, 7.5 GB memory, and 64 GB SSD storage.

6.2 Implementation details

The STCOP mining system is implemented as a Java client connecting to the Accumulo database. JTS Topology SuiteFootnote 8 library is employed for performing geometric operations. The geometries are stored using the well-known text (WKT) format. In the experiments with artificial datasets, co-occurrence coefficient threshold (c c e t h ) is set to 0.01 and prevalence measure threshold (p t h ) is set to 0.01. For solar event and basketball datasets, c c e t h is set to 0.001 and p t h is set to 0.05. We increased the p t h for the experiments with real life datasets to keep the counts of generated patterns and pattern instances comparable among the datasets.

In Section 2, we have explained the automatic table splitting feature provided in Accumulo. Using automatic table splitting, a table can be split and distributed into multiple partitions when the number of rows in the table passes a certain threshold (which is provided by the administrator of the database). Another technique used for partitioning the data is pre-splitting. When pre-splitting is employed, a number of splitting points (let n be the number of splitting points) are provided to database upon creation, and tables are split and distributed into (n+1) partitions based on these splitting points. Balanced partitioning of data is important as the number of active search iterators can be adequately increased when tables are split across the different data nodes. Pre-splitting is more advantageous when compared to automatic table partitioning, as pre-splitting can decrease the data transfer times when splitting the tables, and efficiently utilize parallel server-side search iterators.

6.3 One tablet server experiments

In one tablet server experiments, we analyzed the runtime performance of STCOP mining algorithm using classical (CDM), spatiotemporal (STDM), and indexed spatiotemporal (ISTDM) data models. For these experiments, we employed relatively smaller artificial datasets: A, B,C, and D. We run the experiments using one tablet server and one scanner (search iterator). In Fig. 7 the total runtime and time spent on spatiotemporal join operations are shown in logarithmic scale, and in Fig. 8 the size of the database is shown. The total runtime (shown as Total Time in our charts) is the total time spent for running the STCOP-Miner algorithm shown in Algorithm 1. The time spent on spatiotemporal join operations (shown as Join Time in our charts) is calculated as the sum of the total time spent on locating the query instances and performing spatiotemporal overlap search. Note that the spatiotemporal joins are known to be the performance bottleneck of the STCOP miner algorithm [7].

Fig. 7
figure 7

The total runtime and time spent on joins for one tablet server experiments

Fig. 8
figure 8

Total database sizes of datasets for one tablet server experiments

From Fig. 8, it can be observed that CDM has less storage requirements. As STDM and ISTDM use the same storage model, the size of the database is the same for STDM and ISTDM. CDM uses a string representation (WKT) of polygon objects, and timestamps are encoded in the column qualifier field of keys. On the other hand, STDM and ISTDM perform serialization on each spatiotemporal instance and store them as byte arrays. As we use a simplistic Java-based serialization model without any compression, the increase in the storage requirements for STDM and ISTDM is understandable. Another factor contributing to this increase is the duplication strategy used (when instance trajectories span across more than one partition cells, the instance is stored in multiple partition cells). We also observed that from 11 % to 37 % of instances are duplicated at least once when running one tablet server experiments.

While CDM provides compact data storage, spatiotemporal data models (STDM and ISTDM) provide better runtime performance. For smaller datasets, the difference in the runtime is not visible. For the dataset A, the total running time of CDM is even less than STDM, while the IFSTDM achieves only 1.08x speedup. However, the difference in join times is apparent when we inspect the experimental results from larger datasets (i.e., B, C, and D). By using STDM (when compared to CDM), we can see from 2.09x speedup for dataset B to 3.51x speedup for dataset C in one tablet server settings. Furthermore, using indexing on top of STDM (in ISTDM) provides better total runtime. ISTDM achieves up to 8.7x speedup when compared to STDM, and 27x speedup when compared to CDM.

6.4 Multiple tablet server experiments

In multiple tablet server experiments, we employ the Amazon Web Services cloud computing platform with three and six tablet servers as shown in Table 2. The STCOP mining system is tested with larger size artificial, solar event, and basketball datasets. In multiple tablet server experiments, our focus is to demonstrate the horizontal scalability of the system and our data models with different dataset characteristics.

6.4.1 Experiments with artificial datasets

For multiple tablet server experiments, relatively larger artificial datasets C, D, E, and F were tested. Based on our findings from one tablet server experiments with smaller artificial datasets, we did not conduct the experiments with CDM due to the poor scalability of CDM as shown in Fig. 7. We test the capabilities of CDM in the experiments with real life datasets.

Figures 9 and 10 show the total runtime of STCOP-Miner algorithm and the time spent on spatiotemporal joins for STDM and ISTDM in three (shown as 3 TS) and six (6 TS) tablet server settings. Using multiple tablet servers and data nodes, we aimed to show the effects of horizontal scalability on database side for our STCOP mining system.

Fig. 9
figure 9

The total runtime and time spent on joins for the artificial datasets C, D, E, and F in three and six tablet server settings

Fig. 10
figure 10

The total runtime and time spent on joins for the artificial datasets C, D, E, and F in three and six tablet server settings

Firstly, from Fig. 9, we can see the differences of using more data nodes in STDM. Only for the relatively smaller dataset C, the three tablet server setting performs better than six tablet server setting for STDM. We observe 3.3 to 5.0x speedup when we increase the number of tablets from three to six. On the other hand, for ISTDM (shown in Fig. 10), we do not see the effect of increasing the number of tablet servers. However, this is expected as our inverted index provides direct access to key values of instances or pattern instances. Both query instance and the instances in search results can be accessed by a constant number of lookups. Therefore, we are not able to see particularly significant speedups when using ISTDM. For STDM parallel scan iterators significantly change the time spent on searching the instances. Additionally, ISTDM performs better than STDM for almost all datasets.

6.4.2 Experiments with solar datasets

The results of the multiple tablet server experiments with the solar event datasets (3Mo and 6Mo), are shown in Fig. 11. Three tablet server experiment results are shown with plain bars, and six tablet server experiment results are shown with striped bars. It is important to mention that the instances in artificial datasets have shorter lifespans and simpler geometries while the instances in solar datasets have unbalanced data characteristics and significantly more complex geometries. Another important difference between the artificial and solar datasets is the number of generated candidate pattern instances. Solar datasets generate less candidate pattern instances; therefore, the join and total runtime for solar datasets are shorter than artificial datasets.

Fig. 11
figure 11

The time spent on spatiotemporal joins (Join Time) and the total runtime of STCOP-Miner algorithm (Total time) for the solar event datasets 3Mo and 6Mo in three (3TS) and six (6TS) tablet server settings using classical (CDM), spatiotemporal (STDM), and indexed spatiotemporal (ISTDM) data models

We can observe that STDM and ISTDM perform significantly better than the CDM for both datasets, and the difference between the total runtimes of data models is caused mainly by the difference in the join times. We also notice the performance increase when six tablet servers are used. The performance of the joins in CDM has increased when using six tablet servers (in 3Mo by 9 % and in 6Mo by 21 %). Similarly, for ISTDM, the join performance has increased for both datasets, as well as the total runtime. For STDM in 6Mo dataset, we recognize an 11 % performance drop. This can be explained by the overhead of search iterator creation. Nevertheless, the total runtime performance for STDM (in 6Mo) has increased because of the parallelized writers.

6.4.3 Experiments with basketball datasets

The results of the multiple tablet server experiments with the basketball datasets (ATL1, ATL2 and ATL3), are shown in Fig. 12. Similar to the experiments with the solar event datasets, three tablet server experiment results are shown with plain bars, and six tablet server experiment results are shown with striped bars. Instances in basketball datasets have significantly longer lifespans when compared to artificial datasets. Additionally, the geometries of the instances in the basketball datasets are simpler than the ones in solar event datasets. Similar to the solar event datasets, the number of generated candidate pattern instances for the basketball datasets is smaller for the artificial datasets. However, the lifespans of the pattern instances are longer.

Fig. 12
figure 12

The time spent on spatiotemporal joins (Join Time) and the total runtime of STCOP-Miner algorithm (Total time) for the basketball datasets ATL1, ATL2 and ATL3 in three (3TS) and six (6TS) tablet server settings using classical (CDM), spatiotemporal (STDM), and indexed spatiotemporal (ISTDM) data models

The experimental results from the basketball datasets are similar to the results from the solar event datasets. Both STDM and ISTDM perform better than CDM. Using six tablet servers leads to a performance increase in join times, as well as the total runtimes for all the settings. In addition to that, indexing (used in ISTDM) accelerates the join operations for all the datasets.

6.5 Remarks

In a nutshell, we can affirm that ISTDM provides better runtime performance than STDM and CDM for large spatiotemporal datasets, and the total runtime performance of STCOP mining system is heavily affected by the spatiotemporal join performance.

The artificial, solar event and basketball datasets have significantly different characteristics, and our experiments with all of them provide better insight into the behavior of the mining system under different scenarios. Firstly, we have observed the highest speedup from CDM to ISTDM in the artificial datasets (up to 27x). This can be explained with the higher number of generated candidate pattern instances from the artificial datasets.

Another goal of our experiments was to inspect the scalability of our proposed data models. In the STCOP mining system, when multiple tablet servers are used, data access and update operations are performed in parallel. The distribution of data access and update workload increases the performance of the system for all the data models. The spatiotemporal join performance of STDM and ISTDM is not significantly improved when more tablet servers are used. However, the total runtime performance of these models is better when more tablet servers are used, as we write the pattern instances back to the database in parallelized fashion. Especially, in the solar event dataset experiments, we observe the performance increase caused by the parallelized write operations with six tablet servers for STDM. This is clearly visible from 6Mo dataset STDM results in Fig. 11, where the join time with six tablet servers is more than the join time with three tablet servers; however, the total runtime for six tablet server is shorter, as the parallelized writes significantly increase the runtime performance.

7 Conclusion and future work

One integral part of knowledge discovery is the efficient retrieval of data. For relational databases, data access methods are well-defined and efficient retrieval techniques are facilitated by the database vendors. However, in the context of spatiotemporal databases, efficient and effective access methods are not available because of the diversity of spatiotemporal data (historical, predicted, moving objects, etc.), complex representations (points, line strings, polygons, geometry collections, etc.) and rich semantics (temporal sequences, spatial colocations, spatiotemporal co-occurrences etc.). With the increasing volumes of spatiotemporal data, it becomes necessary to employ distributed databases when mining big spatiotemporal data. In this work, we designed a distributed STCOP mining system which employs a distributed non-relational database, Accumulo, for storing spatiotemporal instances. We have introduced data models to store spatiotemporal instances in column-oriented non-relational databases. These models are classical (CDM), spatiotemporal (STDM) and indexed spatiotemporal (ISTDM) models. Classical data model simulates the object-relational database modeling. Spatiotemporal data model follows a space-driven partitioning strategy to capture the implicit spatial and temporal information, and exploits the sorted nature of keys in Accumulo for efficient data retrieval. Indexed spatiotemporal data model uses the same data modeling on the database as STDM, and utilize an inverted index structure for improving the spatiotemporal search and join performance of STCOP mining.

For testing the performance of the distributed STCOP mining system, we conducted experiments with three proposed models and eleven datasets under different distribution settings. In our experiments, we consistently see that ISTDM performs the best in the means of runtime performance. On the other hand, CDM does not perform well mainly because it does not capture and take advantange of the implicit spatiotemporal information, which leads to higher spatiotemporal join times. However, it is worth mentioning that CDM has less storage requirement than STDM and ISTDM. We also observe that using more computing nodes increases the performance of the system for all data models.

In the future, one potential problem to investigate is the join-less mining schemas for discovering STCOPs in distributed database environments. Another interesting problem is the frequent pattern growth-based approaches for the task of STCOP mining.