1 Introduction

With the development of remote sensing technologies, satellite images with the characteristics of multi-scale, multi-band, and multi-date make it tend to be big data [7, 16, 22, 25]. There is a major computational challenges to extract the useful information from remote sensing big data in an efficient manner, such as to analyze, aggregate, and store, especially for time-series remote sensing images. Remote sensing has long been used as a means of detecting and classifying temporal changes in the condition of the land surface [5, 15]. Satellite sensors are well suited to this task because they provide consistent and repeatable measurements on a spatial scale appropriate for capturing the processes of change [13]. A given scene may be observed repeatedly from space, resulting in a times series of satellite images. The high spatial resolution of current sensors provides detailed information on spatial structures, which after a series of revisits, can be extended to spatiotemporal data structures. It follows that a time series of satellite images represents a highly complex data set that potentially contains valuable spatiotemporal information [6]. In order to efficiently use the huge amounts of data that will be produced by plans for more missions and higher resolution earth observation systems, a review of the currently used approaches is needed. On the one hand, most previous change detection studies have relied primarily on examining the differences between two or more satellite images acquired on different dates [3]. These procedures can be categorized into three types. Procedures in the first category consist of combining the values of the image at time t and those at time (t − 1) in order reveal the intrinsic temporal structure of the data. The combination operator can be a subtraction [18], a division [11], or more sophisticated ones [20]. And then, the resulting combined image is classified to map the change areas. Further, procedures in the second category involve building a vector from two multi-band values (one before and one after the change) in a multi-dimensional space. The norm and angle of this vector is used as information on the type and intensity of the change [4]. Finally, procedures in the third category considers that the pixel value at time t is linearly correlated to the pixel value at time (t − 1). The parameters of the regression (e.g., the residual) are studied to map and characterize the change [12]. These methods mentioned above are only suitable for two images at one time. Thus, in order to handle time-series images, they all have to be applied for several times (e.g., twelve images combined in one in [23]), which leads to hardly understandable results.

On the other hand, although some research progress has been reported with regard to the analysis of temporal trajectories, the method can be improved further to suit time-series remote sensing images processing. Some methods use a Fourier or a wavelet decomposition to analyze the time series images. Although these robust methods can handle time series as a whole, they require a regular sampling of the time series, which means data must be regularly sensed. For remote sensing time series images, this constraint is difficult to keep, since the acquisitions depend on several factors (operational, meteorological, etc.). Sequential pattern mining method is another way to deal with the time-series remote sensing images. This type of methods extract frequent sub-sequences of radiometric evolutions or land cover change trajectories by just using the ordering of the sequence. It has the advantage of being robust to noise and extracting meaningful patterns. As remote sensing data are recognized as “big data” in a certain sense [17], it is necessary to consider the performance efficiency of finding meaningful or interesting patterns from time-series remote sensing images. Therefore, there is a critical need for a method that can enable efficient and reliable characterization of spatiotemporal patterns contained in an image time series [9]. The objective of this study is to generalize the application of the sequential pattern mining method to time-series remote sensing images by using the Continuous Association Rule Mining Algorithm (CARMA) [10] to improve the mining efficiency.

The remainder of this article is organized as follows. Section 2 describes the sequential pattern mining method. Section 3 presents a case study and an analysis of the results. Finally, Section 4 discusses the principal findings and states our conclusions.

2 Methodology

The effective use of a satellite image time series to characterize and monitor land cover change trajectories requires the analysis of temporal variations in spatial patterns [9]. One basic problem when analyzing sequences of items is to find frequent episodes [14], or in other words, to extract regular patterns from temporal data. To perform such extraction tasks, we can rely on the various algorithms that have been designed to extract frequent sequential patterns. Mining sequential and spatial patterns is an active area of research in data mining, which is used for string mining or itemset mining in transactions analysis.

In data mining field, several algorithms can be used to extract sequential patterns, and most of them can be divided into two kinds: the Apriori-like, breadth-first search methods and the pattern-growth, depth-first search methods. The bottlenecks of the breadth-first search methods (such as AprioriAll [1], AprioriSome [1] and GSP [21]) include potentially huge sets of candidate sequences, multiple scans of databases and difficulties at mining long sequential patterns. And for the depth-first search methods (such as cSPADE [24], FreeSpan [8] and PrefixSpan [19]), the major cost is the construction of projected databases. All traditional algorithms operate offline: given minimal values for the support and the confidence the algorithms scan and rescan the sequence set, often several times, and eventually produce all association rules. However in general, the user does not know the appropriate thresholds in advance. An inappropriate choice yields, after a long wait, either too many or too few association rules. Among the various algorithms, CARMA is an efficient two-pass method for finding sequences. CARMA is an alternative to Apriori that reduces Input / Output costs, time, and space requirements. It uses only two data passes to deliver results which is much lower support levels than Apriori. In addition, it allows changes in the support level during execution. The time consumed is a crucial factor in remote sensing image processing, so in this paper we aim to apply CARMA to exploit sequential pattern mining within the field of data mining in order to analyze time-series remote sensing images.

This section introduces the sequential pattern mining method based on the CARMA algorithm. First, some basic concepts of pattern mining are provided in Section 2.1, followed by an introduction to the CARMA algorithm in Section 2.2.

2.1 Basic concepts

Let I = {I 1,I 2,…,I p } be the set of all items; a set of items is referred to as an itemset and a sequence is an ordered set of one or more itemsets. For example, in a sequence s = <e 1,e 2,…,e j >, itemset e 1 appears before e 2, and e 2 appears before e 3, and so on. Itemset e j is also an element of the sequence s denoted as (x 1,x 2,…,x q ) in which x q I . A sequence that contains k itemsets is a k-sequence. If there exists 1 ≤ i 1 < i 2 < … < i n ≤ m such that \( {a}_1\subseteq {b}_{i_1} \), \( {a}_2\subseteq {b}_{i_2} \),…, \( {a}_n\subseteq {b}_{i_n} \) appears in sequence A = 〈a 1, a 2, …, a n 〉 and sequence B = 〈b 1, b 2, …, b n 〉, then it is said that A is contained in B (denoted as A ⊆ B). A set of sequences is referred to as a sequence set. In a sequence set, if a sequence s is not contained in any other sequence, then s can be called the largest sequence. Additionally, in the sequence set, the number of sequences that contain s is known as the support of s (written as sup(s)). In the mining process, if a sequence satisfies the pre-determined minimum support, then it is a frequent sequence. Sequential pattern mining is the mining of the largest sequences from the frequent sequences in a sequence set, and the sequences found by sequential pattern mining can be called the sequential pattern. In a sequential pattern s = <e 1,e 2,…,e j >, sequence s’ = <e 1,e 2,…,e j-1  > is called the antecedent and sequence s” = <e j  > is called the consequent. The confidence for a sequential pattern can be defined as the ratio of sup(s) to sup(s’). The confidence for each generated sequential pattern should be calculated, and those sequential patterns whose confidence is larger than the minimum confidence threshold are recognized as more interesting (or more useful) than the other sequential patterns.

2.2 CARMA algorithm

The CARMA algorithm uses an effective two-step method to discover sequential patterns. The first step is used mainly to find frequent sequences, and the second step is used to generate sequential patterns based on these frequent sequences.

For the first step, the main target is to form a set V of all potentially large itemsets in a lattice. There are three parameters for each itemset in the lattice: count(v), firstTrans(v), and maxMissed(v). The meanings of these three parameters are introduced below and shown in Fig. 1:

Fig. 1
figure 1

Illustration of the meanings of the three parameters of CARMA

  • count(v): number of occurrences of itemset v since v was inserted in the lattice.

  • firstTrans(v): index of the sequence data at which v was inserted in the lattice.

  • maxMissed(v): upper bound on the number of occurrences of v before v was inserted in the lattice.

The construction of the lattice is shown in Table 1:

Table 1 Example of the lattice

In Fig. 1, t 1,t 2,…,t n are the sequences in the database. When t j was under the scan process, itemset v was inserted into V. Suppose that we are reading sequence t i ; therefore, maxMissed(v) denotes the number of occurrences of v from t 1 to t j-1, the value of firstTrans(v) is j, and count(v) denotes the number of occurrences of itemset v from t j to t i. For any itemset v in the lattice, we get a deterministic lower bound count(v)/i and upper bound [maxMissed(v) + count(v)]/i. We denote these bounds by minSupport(v) and maxSupport(v), respectively. The detailed flowchart of CARMA is shown in Fig. 2:

Fig. 2
figure 2

Flowchart of CARMA

  1. 1.

    The framework of CARMA algorithm has two parts: First, we initialize V to {ø} and set count(ø) = 0, firstTrans(ø) = 0, and maxmissed(ø) = 0. Thus, V is a support lattice for an empty sequence. Suppose that V is a support lattice up to sequence i-1; if we are reading the i-th sequence t i and want to transform V into a support lattice up to i, we have to go through three steps.

    1. 1)

      For each itemset v in V, if v is contained in t i , let count(v) = count(v) + 1.

    2. 2)

      We insert a subset v of t i into V if and only if all subsets w of v are already contained in V and satisfy maxSupport(w) ≥ σ i (where σ i is the current user-defined support threshold). As v is contained in the current sequence t i , let count(v) = 1, firstTrans(v) = i, and we compute the value of maxMissed(v).

      • As w is a subset of v, we obtain maxSupport(w) ≥ maxSupport(v). Furthermore, according to Formula (2.1), Formula (2.2) and Formula (2.3), we obtain Formula (2.4).

        $$ maxSupport(w) = \left[ maxMissed(w) + count(w)\right]/i $$
        (2.1)
        $$ maxSupport(v) = \left[ maxMissed(v) = count(v)\right]/i $$
        (2.2)
        $$ count(v) = 1 $$
        (2.3)
        $$ maxMissed(v)\ \le\ maxMissed(w) + count(w) - 1 $$
        (2.4)

        When we are inserting a subset v into V, the set v is not yet contained in V. Hence, the support of v for the first (i-1) sequences satisfies Formula (2.5), where |v| denotes the number of items in v.

        $$ suppor{t_i}_{-1}(v)\le av{g}_{i-1}\left({\left\lceil \sigma \right\rceil}_{i-1}\right)+\frac{\left|v - 1\right|}{i-1} $$
        (2.5)
      • In addition, considering Formula (2.6), we obtain Formula (2.7)

        $$ maxMissed(v)=suppor{t_i}_{-1}(v)\times \left(i\ {\textstyle \hbox{-} }\ 1\right) $$
        (2.6)
        $$ maxMissed(v)\le \left\lfloor \left(i-1\right)av{g}_{i-1}\left({\left\lceil \sigma \right\rceil}_{i-1}\right)\right\rfloor +\left|v\right|-1 $$
        (2.7)
      • Based on all of the above, we can define Formula (2.8).

        $$ maxMissed(v)= min\left\{\left\lfloor \left(i-1\right)av{g}_{i-1}\left({\left\lceil \sigma \right\rceil}_{i-1}\right)\right\rfloor +\left|v\right|-1, maxMissed(w)+ count(w)-\left.1\right|w\subset v\right\} $$
        (2.8)
    3. 3)

      We compute Formula (2.9)for each itemset v of V when every k sequences (the value of k is defined by the user) are scanned. For any itemset v whose maxSupport < σ i , we delete v from V.

      $$ maxSupport = \left( maxMissed + count\right)/I $$
      (2.9)
  2. 2.

    For the second step, the main aim is to scan the sequences a second time and generate sequential patterns based on the frequent sequences found in the first part of CARMA. In the second step, we compute the precise support of all itemsets v in V and continually remove itemsets with maxSupport < σ n where σ n is the last threshold of minSupport. While performing the scanning, all itemsets v of V are checked and the parameters associated with v are updated. Two situations may arise:

    1. 1)

      If firstTrans(v) < i, then v is considered as a large itemset. If the current sequence index is greater than firstTrans for all itemsets in the lattice, the second part of the CARMA algorithm stops.

    2. 2)

      If the current sequence contains itemset v of V, we use Formula (2.10) and Formula (2.11), and if firstTrans(v) = i, we use Formula (2.12). However, using Formula (2.12)for an itemset v might yield maxSupport(w) > maxSupport(v) for some superset w of v. Thus, we use Formula (2.13) for all supersets w of v with maxSupport(w) > maxSupport(v). We also remove the itemsets v from V with maxSupport < σ n .

      $$ count(v) = count(v) + 1 $$
      (2.10)
      $$ maxMissed(v)=maxMissed(v)\ {\textstyle \hbox{-} }\ 1 $$
      (2.11)
      $$ maxMissed(v) = 0 $$
      (2.12)
      $$ maxMissed(w)=count(v)\ {\textstyle \hbox{-} }\ count(w) $$
      (2.13)

From all of the above, it is clear that CARMA requires only two scans of the sequences to obtain the sequential pattern.

3 Case study and analysis

The main objective of this paper is to generalize the typical sequential pattern mining algorithm in order to obtain patterns for remote sensing time-series images that could be useful for reaching meaningful conclusions regarding land cover change. Here, we analyze some selected patterns obtained by the CARMA algorithm, which represent the process of land cover change. This approach quantifies land cover changes in terms of the percentage area affected and maps the spatial distribution of these changes.

3.1 Study area and data

The present paper is based on a study of Wuhan city in Hubei province, China. As shown in Fig. 3, Wuhan is located in the east of Hubei province, between 113°41′–115°05′E and 29°58′–31°22′N, and covers an area of around 8494.41 km2. The terrain of Wuhan city is dominated by plains and supplemented by hills with a surface elevation ranging from 11.3 m to 873.7 m. The landform falls under the hilly regions in the southeast Hubei province. It is a transitional area between the eastern margin of Jianghan plain and the southern low mountains and hills of the Dabie Mountains. Wuhan city has witnessed rapid urbanization and industrialization owing to a booming economy; the population growth and economic development have contributed to drastic changes in land use/land cover in Wuhan.

Fig. 3
figure 3

Location of Wuhan city in Hubei province, China. The spatial extent of the experiment is shaded in light grey

Time-series Landsat MSS/TM/ETM+ images for five periods (i.e., 1980, 1990, 2000, 2005, and 2010) were selected as the data source (Table 2) for the study. The images used in the experiment were all downloaded free of charge from two websites: http://ids.ceode.ac.cn/ and http://glovis.usgs.gov/.

Table 2 Earth observation data available and processed for the study

3.2 Experimental procedure

For rational and effective analysis of land cover changes, after the image pre-processing, we first classified the time-series images of the study area into land cover maps. Second, based on the land cover maps, we constructed the image sequence set within which each sequence is a land cover class trajectory at pixel level that is described through the classified images assembled in the time series. Third, we applied the sequential pattern mining algorithm to the image sequence set to search for sequential patterns. Finally, we analyzed some interesting sequential patterns to reveal the trajectory of land cover change and evaluated the degree of change. The flowchart of the experimental procedure is shown in Fig. 4.

Fig. 4
figure 4

Flowchart of the experimental procedure

3.3 Pre-processing and classification

The pre-processing steps included reprojection, image mosaicking, and subsetting. The images were projected to the Albers Conical Equal Area projection coordinate system with detailed parameters as follows: central longitude 105° E and WGS84 spheroid. Then, the images for each year were mosaicked in the same coordinate system, and the study area was clipped from the mosaicked images against the municipal boundary layer. Thus, all the data were arranged in the same coordinate system to form a data set with consistency and integrity, suitable for spatial and sequential comparative analyses.

We used eCognition software for image classification, and image enhancement was performed to increase the visual discrimination between features from the data. Because differences and disagreements may appear in the classification process when interpreting land cover types, the classification for all the images was undertaken by a single expert by combining software and manual techniques. The land cover types were classified into five categories, namely, forest land, grassland, wetland (including rivers, ponds, and reservoirs), farm land, and built-up land, using a modified Anderson land cover classification scheme [2]. After all the pre-processing steps, the original digital number values for every pixel in the five images were transformed into land cover type values. Then, the land cover types or classes were converted into symbols. The experiment used five characters, namely, “1”, “2”, “3”, “4”, and “5”, to represent forest land, grassland, wetland, farm land, and built-up land, respectively.

3.4 Construction of land cover change sequence set

An image time series portraying the same scene can be transformed into a landscape trajectory by decomposing the sequence image-by-image and projecting it as a time-ordered series of coordinates in a pattern metric space. In our research, as land classes were converted to symbols, a categorical land cover change sequence set that contains the pixel history, or the land cover trajectory, at the pixel-level, was created by obtaining each sequence for every pixel transition. Suppose that the classification result for the pixel located in position (1, 1) in the image is grassland in 1980, 1990, and 2000, and built-up land in 2005 and 2010; then, the land cover change sequence for this pixel can be denoted as “22255”. Therefore, the land cover change sequence set can be generated by copying the land cover change sequence sequentially for each pixel in the study area from the beginning of the image to the end. All the land cover change sequence for each pixel constitute the land cover change sequence set, and this sequence set is the data source for sequential pattern mining.

3.5 Sequential pattern mining

The greater the number of land cover classes, the greater is the number of change trajectories. For three successive land cover classifications with 10 land cover classes each, the potential number of land cover change trajectories is 1000. Analyzing and interpreting all the possible change trajectory results is a time-consuming task; moreover, it is difficult to draw conclusions regarding the land cover change for the entire area. In order to identify the typical land cover changes and determine the bases of the sequences, sequential pattern mining was performed on the constructed land cover change sequence set. Here, we used the CARMA algorithm to explore the sequential patterns that represent the typical land cover changes. In the mining process, the two most important parameters are the support and the confidence of the sequence mode. We selected a number of different combinations to establish the most appropriate support and confidence values and tested the obtained sequential patterns, as shown in Fig. 5.

Fig. 5
figure 5

Relationship of support, confidence, and number of patterns: different combinations to find the appropriate support and confidence values

In Fig. 5, the numbers of generated patterns for different supports tend to coincide as the confidence increases. Therefore, in subsequent experiments, we selected a confidence level of 40 % and a support rate of 0.02 % as the parameters for the sequential pattern mining.

3.6 Result and analysis

3.6.1 Classification result and area statistics

Five classes were considered to represent the different types of land cover for the entire study area. The classification results are shown in Fig. 6 and the statistical results for the classification are listed in Table 3.

Fig. 6
figure 6figure 6figure 6figure 6figure 6

ae Results of the classification for remote sensing images from 1980 to 2010, respectively

Table 3 Area statistics of the land cover types over the 30-year study period

According to the records of the Wuhan Planning and Design Institute, since the 1990s, the city has expanded considerably owing to rapid advances of the development zone and district economy. As a result of urban development, the city has expanded along the main lines of communication. This change is confirmed by the findings of the present study. According to the area statistics of the land cover types and their changes over the 30-year study period, the built-up area has grown remarkably from 555 km2 in 1980 to 1497 km2 in 2010, and the grassland area has decreased by nearly 90 % from 86 km2 in 1980 to 11 km2 in 2010. At the same time, other types of land cover have reduced by varying degrees.

3.6.2 Performance of CARMA

To compare the performance of CARMA to traditional algorithms, we implemented CARMA along with the cSPADE algorithm developed by Zaki in 2001, and the recorded time consume for both algorithms are given in Fig. 7. The comparison results (Fig. 7) shows the runtime of CARMA and cSPADE in mining sequential patterns with different support thresholds. According to Fig. 7, we can see that CARMA outperformed cSPADE especially on low support thresholds. That due to cSPADE requires more time for removing all useless sub-sequences that are computed by the algorithm. It also requires, for a given evolution, browsing the whole dataset for identifying pixels that are concerned by this evolution. In addition, CARMA is typically by an order of magnitude more memory efficient and readily computes association rules in cases which are intractable for cSPADE.

Fig. 7
figure 7

The performance of CARMA and cSPADE algorithm

3.6.3 Sequential pattern mining result

For the study area, the mining process of CARMA over the land cover change sequence set led to the identification of 113 sequential patterns, each with their own proportion. In order to highlight the most significant land cover change trajectories, the top 18 land cover trajectories whose support rates were more than 0.1 % were selected as the land cover change patterns.

The analysis of land cover change trajectories has focused on the selected top 18 patterns with a total cover equal to 95.31 % of the entire study area (Table 4). The remainder is spread among the other 95 patterns. To analyze the temporal human impact on the environment, we divided the selected top 18 patterns into three categories, namely, stable patterns, human-induced patterns, and natural patterns. Stable patterns are characterized by pixels that are constant over time (e.g., 11111 and 22222). In particular, the trajectories are dominated by five stable patterns accounting for 86.49 % of the study area, where the land cover type never changed between 1980 and 2010. This implies that the majority of the land cover types have remained the same since 1980 for the entire time series. The spatial distribution of the stable pattern “11111” is shown by the pixels in the yellow region in Fig. 8. The support of the pattern is 6.59 %, and in 1980, forest land accounted for 6.91 % of the entire area. This implies that more than 95 % of the forest land in 1980 belonged to that the same land cover type in 2010. Similarly, for pattern “55555” (the red region in Fig. 8), we concluded that 98.93 % of the built-up land in 1980 belonged to the same land cover type in 2010. This also indicates that the areas that are not covered by the stable patterns have undergone some kind of land cover change. Thus, the current land area share of the same land cover type can have different histories.

Table 4 Top 18 land cover change patterns
Fig. 8
figure 8

Spatial distribution of the stable patterns

Human-induced patterns are represented by pixel histories with decisive changes due to human activities, such as the change from farm land to built-up land. Table 5 shows the composition of the 12 human-induced patterns accounting for 8.79 % of the study area. Human-induced land cover change trajectories are dominated by the transition from farm land to built-up land, especially during 2000 to 2005. Further, during 1990 to 2010, the highest percentage (4.29 %) of the area was affected by human-induced land cover change.

Table 5 Composition of land cover change in the human-induced patterns (each land cover change area taken from the entire area)

Figure 9 shows the spatial distribution of human-induced land cover change patterns over the past 30 years. In the study area, the human-induced land cover change is more concentrated in urban areas and becomes gradually sparse toward the suburbs. It also can be noted that some land cover change patterns are continuous in time and space, such as “44555” and “44455”, and “33555” and “33355”. For example, consider “44555” and “44455”; in 2005, areas of human-induced change from farm land to built-up land most likely appeared in the farm land adjacent to the built-up land that had been converted from farm land in 2000. This implies that urban expansion has spatial connectivity and temporal continuity.

Fig. 9
figure 9

Spatial distribution of the human-induced patterns

Natural change patterns include indecisive changes due to natural processes. Only one sequential pattern falls into this category, i.e., “44433”; it indicates that the land cover type was farm land before 2005, which changed to wetland after 2005 probably because of floods.

4 Conclusion

The method developed in this study allows the identification, classification, and spatial localization of land cover types and their trajectories of change for a temporal series. It quantifies the land cover changes in terms of the percentage of area affected and maps the spatial distribution of these changes. Further, it provides a different measure for the description of land cover changes according to their current characteristics and history. The expected novel significance of this study is the generalization of the application of the sequential pattern mining method for capturing the spatial variability of landscape patterns and their trajectories of change, to reveal information regarding process regularities with satellite imagery.

Although the presented case study clearly demonstrates that the sequential pattern mining method is a promising analytical tool for spatiotemporal data analysis, a number of issues warrant further investigation. As with other studies using historical data for studying landscape changes, the availability and quality of the data, their classification, and analysis all influence the typology of the landscape patterns and of the changes detected. Discovering interesting patterns is an important requirement in this field. In future research, we intend to develop interestingness and mining methods that are more sophisticated in order to improve the utility and efficiency of applying sequential pattern mining to remote sensing data.