
1 Introduction

Spatial co-location mining has been a problem of great practical importance due to its broad applications in environmental protection [15], public transportation [14], location-based services [11] and urban planning [12]. Most studies in this field adopt the participation index [1] to measure a co-location pattern's prevalence, namely the frequency with which a set of spatial features is located together in a spatial database. For example, {Restaurant, Supermarket, Coffee shop} is a prevalent co-location pattern, meaning that restaurants, supermarkets, and coffee shops are frequently located together. However, in some cases, users are interested not only in the prevalence of a pattern but also in the features that play the dominant role in it.

Identifying the dominant features within a co-location pattern not only indicates which features are frequently co-located but also reveals which features dominate the rest of the features in the pattern. Therefore, in certain applications it is important to determine whether a co-location pattern has dominant features and to extract those features from prevalent co-location patterns. One such application is extracting dominant species from plant communities. Although prevalent co-location mining on vegetation data can discover the coexistence relations of plants, botanists require additional information on the dominant species among the prevalent co-located plant species. Another application is identifying the dominant facility among urban points of interest. In urban planning, co-location information can be used to analyze the consistency between neighboring facilities, and identifying dominant facilities can further support rational urban planning and business decision making. Therefore, it is necessary to identify co-location patterns with dominant features.

For example, {Restaurant (R), Supermarket (S), Coffee shop (C)} and {Hospital (H), Drugstore (D), Flower store (F)} are two prevalent co-location patterns with the same participation index; their spatial distributions are shown in Figs. 1 and 2, respectively. The points represent instances of features, and a line between two points indicates that the two instances are neighbors. From the perspective of pattern prevalence, the two co-location patterns seem very similar. However, in Fig. 1, if any feature is removed from the pattern, the remaining features are still frequently co-located. That is, no feature dominates the others in Fig. 1; all features in this pattern hold equivalent positions, so the co-location pattern {Restaurant, Supermarket, Coffee shop} has no dominant feature. On the contrary, in Fig. 2, the features "Flower store" and "Drugstore" always appear together with the feature "Hospital". More specifically, "Hospital" is always the center feature around which the other two features cluster: many instances of "Flower store" or "Drugstore" neighbor "Hospital" separately, yet there is no additional proximity relation between "Flower store" and "Drugstore" without "Hospital". Based on this observation, "Hospital" can be regarded as the dominant feature in the co-location pattern {Hospital, Drugstore, Flower store}. According to whether a pattern contains a dominant feature, prevalent co-location patterns can be classified either as non-dominant-feature co-location patterns (Non-DFCPs) or as dominant-feature co-location patterns (DFCPs). The features within a Non-DFCP are correlated with each other while holding equal positions in the pattern; in a DFCP, some particular features dominate the rest, so the positions of the dominant features and the other features are nonequivalent. The aforementioned pattern {Restaurant, Supermarket, Coffee shop} is a Non-DFCP and {Hospital, Drugstore, Flower store} is a DFCP.

Fig. 1.
figure 1

Dataset of {R, S, C}

Fig. 2.
figure 2

Dataset of {H, D, F}

With the above discussion as the starting point, this paper focuses on mining DFCPs from prevalent co-location patterns. The contributions of our work can be summarized as follows: (1) proposing the new concepts of dominant feature and dominant-feature co-location pattern (DFCP); (2) proposing a new measure, namely disparity, to measure the positions of features in a co-location pattern; (3) formulating the problem of mining DFCPs and proposing an efficient algorithm to identify the DFCPs and their corresponding dominant features; (4) evaluating the proposed method against an existing traditional co-location pattern mining method on both synthetic and real data sets in terms of efficiency, mining results, and significance.

The remainder of the paper is organized as follows: Sect. 2 reviews related work; Sect. 3 presents the basic concepts and describes the related measures; Sect. 4 demonstrates an efficient DFCP mining algorithm; the experimental evaluation is discussed in Sect. 5; Sect. 6 ends this paper with some concluding remarks.

2 Related Works

For over a decade, co-location pattern mining has attracted abundant research interest and has been widely used in applications such as environmental protection [15], public transportation [14], location-based services [11] and urban planning [12]. Shekhar and Huang [1] first defined the concept of spatial co-location patterns. Co-location mining aims to find all subsets of spatial features that are frequently located together. In [1], the authors proposed the participation index to measure the prevalence of a co-location and, based on its anti-monotone property, a join-based co-location pattern mining algorithm. However, this original algorithm requires a large number of instance-join operations. Since then, various algorithms have been developed, such as the Join-less algorithm [2], the Partial-join algorithm [3], the CPI-tree algorithm [4], the iCPI-tree algorithm [5] and the Order-Clique-Based algorithm [6], to avoid the expensive join operations and improve efficiency. Although these developments have significantly improved algorithmic efficiency, owing to the vastness of spatial data the typical co-location mining framework still yields large collections of results, which makes it hard for users to understand them and identify the targeted ones. This has become one of the biggest obstacles in the study and application of co-location mining. To address this problem, many researchers have worked to reduce the number of resulting patterns by mining co-location patterns with a specific relationship or a specific target. The works [6,7,8] focused on mining maximal co-location patterns, top-k closed co-location patterns and representative co-location patterns respectively, which effectively reduce the number of prevalent co-location results.
To improve the applicability of co-location patterns, many researchers have sought co-locations with specific targets based on expert knowledge, such as high-utility co-location pattern mining [9], ontology-based co-location pattern mining [10] and co-location pattern mining under domain constraints [13]. To discover more interesting knowledge hidden in prevalent co-location patterns, some researchers have also committed to finding co-locations with a specific relationship, such as a causal relationship [17] or a competitive relationship [16].

Extracting co-locations with specific relationships not only reveals the substantive connections between spatial features but also reduces the number of prevalent co-location pattern results. The dominant relation appears widely both in nature and in human society, and there is much work on dominant-factor analysis in specific applications such as public safety [18], power systems [19], audio recognition [20], dominant species identification and urban planning. Therefore, mining co-location patterns with dominant relations and extracting dominant features is a significant but challenging task.

3 Preliminary and Problem Formulation

This section first reviews the preliminary concepts of the typical prevalent co-location pattern mining framework and then formally defines feature disparity to measure the positional difference between features. Furthermore, the definitions of the proposed dominant feature and dominant-feature co-location pattern (DFCP) are provided, and finally the problem of DFCP mining is formulated.

3.1 Basic Concept

Given a set of spatial features F = {f1, f2, …, fn}, a set of their spatial instances \( S = S_{1} \cup S_{2} \cup \ldots \cup S_{n} \), where Si (1 ≤ i ≤ n) is the set of instances of feature fi, and a spatial neighbor relationship R between instances over S, the Euclidean metric is used for R. Each instance x of feature fi is represented as fi.x; two instances fi.x and fj.y are neighbors of each other if the Euclidean distance between them is not greater than a distance threshold d. A k-size co-location c = {f1, …, fk} is a subset of F (c ⊆ F). l = {f1.x1, f2.x2, …, fk.xk} (l ⊆ S) is called a row instance of c if l includes all the feature types of c and forms a clique under R. The set of all the row instances of c is called the table instance T(c).
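To make the neighbor relation and row instances concrete, here is a minimal sketch (not the paper's implementation) that enumerates the table instance T(c) of a pattern by brute force; the coordinates and the distance threshold are hypothetical:

```python
from itertools import combinations, product
from math import dist

# Hypothetical toy instances: feature -> list of (x, y) coordinates.
instances = {
    "H": [(0, 0), (10, 0)],
    "D": [(1, 0), (11, 0)],
    "F": [(0, 1), (50, 50)],
}
d = 2.0  # distance threshold for the neighbor relation R

def neighbors(p, q):
    # fi.x and fj.y are neighbors if their Euclidean distance is <= d
    return dist(p, q) <= d

def table_instance(c):
    # A row instance picks one instance per feature of c and must form
    # a clique under R; T(c) collects all such row instances.
    rows = []
    for combo in product(*(instances[f] for f in c)):
        if all(neighbors(p, q) for p, q in combinations(combo, 2)):
            rows.append(combo)
    return rows

rows = table_instance(["H", "D", "F"])
```

With these toy coordinates, only ((0, 0), (1, 0), (0, 1)) forms a clique, so T({H, D, F}) contains a single row instance.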

Typically, the prevalence of a k-size co-location c = {f1, …, fk} is measured by the participation index and the participation ratio. The participation ratio \( {\text{PR}}(c,f_{i} )\,(f_{i} \in c) \) of feature fi in c is the fraction of instances of fi that participate in the table instance of c, defined as \( {\text{PR}}(c,f_{i} ) = \frac{{|\pi_{{f_{i} }} (T(c))|}}{{ |T(\{ f_{i} \} )|}} \), where π is the relational projection operation. The participation index PI(c) of c is the minimum participation ratio \( {\text{PR}}(c,f_{i} ) \) over all features fi in c: \( {\text{PI}}(c) = \min_{i = 1}^{k} ({\text{PR}}(c,f_{i} ))\,(f_{i} \in c) \). Given a user-specified prevalence threshold min_prev, a co-location c is prevalent if PI(c) ≥ min_prev.
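The two measures can be sketched in a few lines. The table instance below is hypothetical, but its projected counts reproduce the ratios of the running example (0.6, 0.5, 0.57):

```python
# A sketch of participation ratio (PR) and participation index (PI), assuming
# the table instance is a list of row instances holding one instance id per
# feature, in pattern order. The ids are hypothetical.
counts = {"H": 5, "D": 8, "F": 7}          # |T({f})| for each feature
pattern = ["H", "D", "F"]
table = [                                  # T({H, D, F})
    ("H.1", "D.1", "F.1"),
    ("H.2", "D.2", "F.2"),
    ("H.3", "D.3", "F.3"),
    ("H.3", "D.4", "F.4"),
]

def participation_ratio(table, pattern, counts, f):
    # PR(c, f) = |pi_f(T(c))| / |T({f})|
    i = pattern.index(f)
    return len({row[i] for row in table}) / counts[f]

def participation_index(table, pattern, counts):
    # PI(c) = minimum PR over the features of c
    return min(participation_ratio(table, pattern, counts, f) for f in pattern)

pi = participation_index(table, pattern, counts)   # min(0.6, 0.5, 4/7) = 0.5
```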

Example 1.

Figure 2 shows a spatial database with 3 features: H has 5 instances, D has 8 instances and F has 7 instances. H.1 is the first instance of feature H. Co-location {H, D, F} is a super co-location of {H, D}, {D, F} and {H, F}. The table instances of all the co-locations in Fig. 1 are shown in Fig. 3(a), and those of all the co-locations in Fig. 2 are shown in Fig. 3(b). In Fig. 3(b), suppose min_prev = 0.4. For co-location pattern {H, D, F}, whose table instance is shown in Fig. 3(b), PI({H, D, F}) = min(PR({H, D, F}, H), PR({H, D, F}, D), PR({H, D, F}, F)) = min(0.6, 0.5, 0.57) = 0.5 ≥ min_prev, so {H, D, F} is a prevalent co-location pattern. Similarly, PI({R, S, C}) = 0.5.

Fig. 3.
figure 3

The Table instances of {R, S, C} and {H, D, F}

3.2 Definitions

According to the discussion in Sect. 1, if a co-location pattern is a DFCP, there will be a disparity between the positions of its features. To test feature positions in order to identify DFCPs and further extract the dominant features, we calculate the disparity degree as follows: first, we calculate the influence of each feature on the pattern; second, we propose a measure, namely feature disparity, to quantify the disparity between features; third, we define the dominant relation between features.

From Fig. 3, we can observe the following fact: the instances of \( c_{k - 1} \) that appear in \( T(c_{k - 1} ) \) but not in \( T(c_{k} ) \) are not dominated by the feature \( c_{k} - c_{k - 1} \). Thus, for a k-size co-location pattern \( c_{k} \), we can evaluate the influence of each of its features from its k sub-patterns of size k − 1.

Definition 1 (Loss Ratio).

Given a k-size (k > 2) prevalent co-location pattern \( c_{k} = \{ f_{1} ,f_{2} , \ldots ,f_{k} \} \), and a sub pattern of \( c_{k} \): \( c_{k - 1} = \{ f_{1} ,f_{2} , \ldots , f_{k - 1} \} \), for feature \( f_{i} (f_{i} \; \in \;c_{k - 1} )\,(1 \le i \le k - 1) \), the loss ratio of fi from \( c_{k - 1} \) to \( c_{k} \) is defined as follows:

$$ {\text{LR}}(c_{k} ,\,c_{k - 1} ,\,f_{i} ) = \frac{{|\pi_{{f_{i} }} (T(c_{k - 1} ))| - |\pi_{{f_{i} }} (T(c_{k} ))|}}{{|T(\{ f_{i} \} )|}} $$
(1)

Note that \( {\text{PR}}(c_{k} ,f_{i} ) \le {\text{PR}}(c_{k - 1} ,f_{i} ) \) is ensured by the anti-monotonicity of the participation ratio referred to in [1], so it can be confirmed that \( 0 \le {\text{LR}}(c_{k} ,c_{k - 1} ,f_{i} ) \le {\text{PR}}(c_{k - 1} ,f_{i} ) \). The loss ratio describes the fraction of a feature's instances that participate in \( c_{k - 1} \) but are lost when extending to \( c_{k} \). We then aggregate these ratios into a single value, the loss index from \( c_{k - 1} \) to \( c_{k} \), by taking their minimum.

Definition 2 (Loss Index).

The loss index from \( c_{k - 1} \) to \( c_{k} \) is defined as:

$$ {\text{LI}}(c_{k} ,c_{k - 1} ) = \min_{i = 1}^{k - 1} ({\text{LR}}(c_{k} ,c_{k - 1} ,f_{i} )) $$
(2)

Example 2.

In Fig. 4, the loss ratio of feature H from {H, D, F} to {H, D} is:

$$ \begin{aligned} & {\text{LR}}(\{ {\text{H,}}\,{\text{D,}}\,{\text{F}}\} ,\,\{ {\text{H,}}\,{\text{D}}\} ,\,{\text{H}}) = {\text{PR}}\left( {\left\{ {{\text{H,}}\,{\text{D}}} \right\},\,{\text{H}}} \right) - {\text{PR}}\left( {\left\{ {{\text{H,}}\,{\text{D,}}\,{\text{F}}} \right\},\,{\text{H}}} \right) = 1 - 0.6 = 0.4 \\ & {\text{LR}}(\{ {\text{H,}}\,{\text{D,}}\,{\text{F}}\} ,\,\{ {\text{H,}}\,{\text{D}}\} ,\,{\text{D}}) = {\text{PR}}\left( {\left\{ {{\text{H,}}\,{\text{D}}} \right\},\,{\text{D}}} \right) - {\text{PR}}\left( {\left\{ {{\text{H,}}\,{\text{D,}}\,{\text{F}}} \right\},\,{\text{D}}} \right) = 0.5 \\ \end{aligned} $$

The loss index LI({H, D, F}, {H, D}) = min(LR({H, D, F}, {H, D}, H), LR({H, D, F}, {H, D}, D)) = min(0.4,0.5) = 0.4. Similarly, LI({H, D, F}, {H, F}) = 0.4, LI({H, D, F}, {D, F}) = 0.

Fig. 4.
figure 4

Instance loss from {H, D} to {H, D, F}
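The calculation in Example 2 can be sketched from projected participant counts; the counts below (5 of 5 H instances in T({H, D}) versus 3 in T({H, D, F}); 8 versus 4 for D) are inferred from the stated participation ratios:

```python
# Loss ratio (Eq. 1) and loss index (Eq. 2) from projected participant
# counts, reproducing Example 2 for the step {H, D} -> {H, D, F}.
def loss_ratio(proj_sub, proj_sup, total):
    # (|pi_f(T(c_{k-1}))| - |pi_f(T(c_k))|) / |T({f})|
    return (proj_sub - proj_sup) / total

lr_H = loss_ratio(5, 3, 5)   # H: 0.4
lr_D = loss_ratio(8, 4, 8)   # D: 0.5
li = min(lr_H, lr_D)         # loss index LI({H,D,F}, {H,D}) = 0.4
```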

The loss index \( {\text{LI(}}c_{k} ,c_{k - 1} ) \) from \( c_{k - 1} \) to \( c_{k} \) represents the fraction of neighbor clique relations of \( c_{k - 1} \) in which the feature \( f\,(f = c_{k} - c_{k - 1} ) \) is not involved; it characterizes the instance loss of \( c_{k - 1} \) when the new feature f joins \( c_{k - 1} \) to form the higher-size pattern \( c_{k} \). The higher the loss index is, the more instances of \( c_{k - 1} \) are irrelevant to f, and the less f dominates the features in \( c_{k - 1} \). Therefore, we define the feature influence index as follows:

Definition 3 (Feature Influence Index).

Given a k-size co-location pattern \( c_{k} \), the influence index of feature \( f_{i} (f_{i} \; \in \;c_{k} ) \) (1 ≤ i ≤ k) in \( c_{k} \) is defined as:

$$ {\text{FII(}}c_{k} ,f_{i} )= 1- {\text{LI}}(c_{k} ,c_{k - 1} ) $$
(3)

where \( c_{k - 1} = c_{k} - \{ f_{i} \} .\)

The feature influence index \( {\text{FII(}}c_{k} ,f_{i} ) \) intuitively represents the influence of feature fi on the other features of ck: the higher \( {\text{FII(}}c_{k} ,f_{i} ) \) is, the more instances of the other features in \( c_{k} \) may be dominated by fi.
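Eq. (3) is a direct complement of the loss index. A minimal sketch, assuming the loss indices of the three 2-size sub-patterns are precomputed (values taken from the loss indices computed in Example 2):

```python
# Hypothetical precomputed loss indices LI(c_k, c_k - {f}) for c_k = {H, D, F},
# following Example 2.
loss_index = {
    "H": 0.0,   # LI({H,D,F}, {D,F})
    "D": 0.4,   # LI({H,D,F}, {H,F})
    "F": 0.4,   # LI({H,D,F}, {H,D})
}

def feature_influence(f):
    # FII(c_k, f) = 1 - LI(c_k, c_k - {f})   (Eq. 3)
    return 1 - loss_index[f]

influences = {f: feature_influence(f) for f in loss_index}
```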

Example 3.

In Fig. 3, for co-location pattern {H, D, F}, the feature influence of feature H in {H, D, F} is:

$$ \begin{aligned} & {\text{FII}}(\{ {\text{H,D,F}}\} ,{\text{H}}) = 1 - {\text{LI}}(\{ {\text{H,D,F}}\} ,\{ {\text{D,F}}\} ) = 1. \\ & {\text{FII}}(\{ {\text{H,D,F}}\} ,{\text{D}}) = 1 - {\text{LI}}(\{ {\text{H,D,F}}\} ,\{ {\text{H,F}}\} ) = 0.6. \\ & {\text{FII}}(\{ {\text{H,D,F}}\} ,{\text{F}}) = 1 - {\text{LI}}(\{ {\text{H,D,F}}\} ,\{ {\text{H,D}}\} ) = 0.6. \\ \end{aligned} $$

The feature influence index measures the influence of a single feature on the other features in the same pattern. Therefore, we can use it to measure the feature disparity: the higher the disparity between two features is, the more likely a dominant relation exists between them.

Definition 4 (Feature Disparity).

Given a k-size (k > 2) co-location pattern \( c_{k} = \{ f_{1} ,f_{2} , \ldots , f_{k} \} \), the feature disparity between \( f_{i} (f_{i} \; \in \;c_{k} )\,(1 \le i \le k) \) and \( f_{j} (f_{j} \; \in \;c_{k} )\,(1 \le j \le k,\,i \ne j) \) in \( c_{k} \) is:

$$ {\text{FD}}(f_{i} ,f_{j} ) = | {\text{FII}}(c_{k} ,f_{i} ) - {\text{FII}}(c_{k} ,f_{j} ) | $$
(4)

According to Definition 4, it is easy to prove that feature disparity is symmetric and non-negative.

Definition 5 (Dominant Relation).

Given a k-size (k > 2) prevalent co-location pattern \( c_{k} = \{ f_{1} ,f_{2} , \ldots , f_{k} \} \) and a minimum feature disparity threshold min_fd (0 ≤ min_fd ≤ 1), the feature \( f_{i} (f_{i} \; \in \;c_{k} )\,(1 \le i \le k) \) dominates \( f_{j} (f_{j} \; \in \;c_{k} )\,(1 \le j \le k,\,i \ne j) \) if they meet two conditions: (1) FD(fi, fj) ≥ min_fd and (2) FII(ck, fi) > FII(ck, fj).

It can be noticed that if there exists a dominant relation between features in a prevalent co-location, then the co-location contains dominant features.
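Definitions 4 and 5 can be sketched together. This is an illustrative check, not the mining algorithm; the FII values follow from the loss indices of Example 2, and min_fd = 0.4:

```python
# Feature influence indices for c_k = {H, D, F}, derived from the loss
# indices in Example 2 via FII = 1 - LI.
fii = {"H": 1.0, "D": 0.6, "F": 0.6}
min_fd = 0.4

def feature_disparity(fi, fj):
    # FD(fi, fj) = |FII(c_k, fi) - FII(c_k, fj)|   (Eq. 4)
    return abs(fii[fi] - fii[fj])

def dominates(fi, fj):
    # Definition 5: fi dominates fj
    return feature_disparity(fi, fj) >= min_fd and fii[fi] > fii[fj]

dominant = {f for f in fii if any(dominates(f, g) for g in fii if g != f)}
```

Here H dominates both D and F (FD = 0.4 in each case), so the pattern contains the dominant feature H.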

Example 4.

In Fig. 5, after calculating the loss indices, we can obtain the feature influence values and then calculate the feature disparity between each pair of features in the pattern. Setting min_fd = 0.4:

Fig. 5.
figure 5

The process of finding dominant features

$$ \begin{aligned} & {\text{FD}}\left( {{\text{H}},{\text{D}}} \right) = {\text{FII}}\left( {\left\{ {{\text{H}},{\text{D}},{\text{F}}} \right\},{\text{H}}} \right) - {\text{FII}}\left( {\left\{ {{\text{H}},{\text{D}},{\text{F}}} \right\},{\text{D}}} \right) = 0.4. \\ & {\text{FD}}\left( {{\text{H}},{\text{F}}} \right) = 0.4,\;{\text{FD}}\left( {{\text{D}},{\text{F}}} \right) = 0. \\ \end{aligned} $$

In {H, D, F}, FD(H, D) = 0.4 ≥ min_fd and FII({H, D, F}, H) > FII({H, D, F}, D), so H is a dominant feature of {H, D, F}.

Figure 5 illustrates the process of finding dominant features. We introduce two extreme-value functions – a minimum function and a maximum function – to obtain the greatest and least influence in a co-location pattern, which further helps to optimize the mining process.

Definition 6 (Max-Feature Influence Index).

Given a k-size co-location pattern \( c_{k} = \{ f_{1} ,f_{2} , \ldots f_{k} \} \), the maximum feature influence index of \( c_{k} \) is defined as:

$$ \hbox{max} \_{\text{FII}}(c_{k} ) = \max_{i = 1}^{k} ({\text{FII}}(c_{k} ,f_{i} )) $$
(5)

Definition 7 (Min-Feature Influence Index).

Given a k-size co-location pattern \( c_{k} = \{ f_{1} ,f_{2} , \ldots , f_{k} \} \), the minimum feature influence index of \( c_{k} \) is defined as:

$$ \hbox{min} \_{\text{FII}}(c_{k} ) = \min_{i = 1}^{k} ({\text{FII}}(c_{k} ,f_{i} )) $$
(6)

Example 5.

In Fig. 3b, for co-location pattern {H, D, F}, the max-feature influence index is:

$$ \begin{aligned} { \hbox{max} }\_{\text{FII}}(\{ {\text{H,}}\,{\text{D,}}\,{\text{F}}\} ) & = { \hbox{max} }({\text{FII}}(\{ {\text{H,}}\,{\text{D,}}\,{\text{F}}\} ,\,{\text{H}}),\,{\text{FII}}(\{ {\text{H,}}\,{\text{D,}}\,{\text{F}}\} ,\,{\text{D}}),\,{\text{FII}}(\{ {\text{H,}}\,{\text{D,}}\,{\text{F}}\} ,\,{\text{F}})) \\ & = { \hbox{max} }(1,\,0.6,\,0.6) = 1 \\ \end{aligned} $$

The min-feature influence index is min_FII({H, D, F}) = 0.6.

Lemma 1.

If FD(fi, fj) ≥ min_fd and \( {\text{FII}}(c_{k} ,f_{i} ) > {\text{FII}}(c_{k} ,f_{j} ) \), then \( {\text{FII}}(c_{k} ,f_{i} ) - \hbox{min} \_{\text{FII}}(c_{k} ) \ge min\_fd \).

Proof.

Since FII(ck, fi) > FII(ck, fj) ≥ min_FII(ck), we have

$$ {\text{FII}}\left( {c_{k} ,f_{i} } \right) - min\_{\text{FII}}(c_{k} ) \ge {\text{FII}}\left( {c_{k} ,f_{i} } \right) - {\text{FII}}\left( {c_{k} ,f_{j} } \right) = {\text{FD}}\left( {f_{i} ,f_{j} } \right) \ge min\_fd $$

Lemma 2.

Given a k-size (k > 2) prevalent co-location pattern ck = {f1, f2, …, fk}, if max_FII(ck) − min_FII(ck) ≥ min_fd, then ck is a DFCP.

Proof.

Let fi be a feature with FII(ck, fi) = max_FII(ck) and fj a feature with FII(ck, fj) = min_FII(ck), i ≠ j. Then FD(fi, fj) = FII(ck, fi) − FII(ck, fj) = max_FII(ck) − min_FII(ck) ≥ min_fd, and FII(ck, fi) > FII(ck, fj). According to Definition 5, fi dominates fj, so the feature with max_FII(ck) is a dominant feature and ck is a DFCP.

3.3 Problem Formulation

According to Lemmas 1 and 2, we can optimize the mining process. The problem can be formulated as follows:

Definition 8 (Max-Feature Disparity).

Given a k-size (k > 2) prevalent co-location pattern \( c_{k} = \{ f_{1} ,f_{2} , \ldots ,f_{k} \} \), the max_feature disparity of the feature \( f_{i} (f_{i} \; \in \;c_{k} )\,(1 \le i \le k) \) in \( c_{k} \) is defined as:

$$ { \hbox{max} }\_{\text{FD}}(c_{k} ,f_{i} ) = | {\text{FII}}(c_{k} ,f_{i} ) - { \hbox{min} }\_{\text{FII}}(c_{k} ) | $$
(7)

Given a minimum feature disparity threshold min_fd (0 ≤ min_fd ≤ 1), if max_FD(ck, fi) ≥ min_fd, then feature fi is a dominant feature.

Some co-location pattern recommendation applications require that interesting patterns be both prevalent and contain dominant features. On the one hand, a dominant-feature co-location carries more information to support specific decision-making, which makes the recommendations more instructive. On the other hand, identifying dominant-feature co-locations further decreases the number of prevalent patterns, which improves the usability of the results. Therefore, with the definitions of disparity and dominant features, we formulate the problem of mining DFCPs as follows.

Mining Dominant-Feature Co-location Patterns.

Given a minimum prevalence threshold min_prev, a minimum feature disparity threshold min_fd, a k-size co-location pattern ck is a DFCP if it meets the following conditions:

  (1) PI(ck) ≥ min_prev

  (2) max_FII(ck) − min_FII(ck) ≥ min_fd

Extracting Dominant Feature.

Given a minimum feature disparity threshold min_fd and a k-size DFCP \( c_{k} \), feature \( f_{i} (f_{i} \; \in \;c_{k} )\,(1 \le i \le k) \) is a dominant feature of \( c_{k} \) if max_FD(ck, fi) ≥ min_fd.
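A hedged sketch of the full test: condition (1) is the prevalence check, condition (2) is the Lemma 2 range check that lets non-DFCPs be rejected without any pairwise disparity computation, and the per-feature test uses max_FD from Definition 8. The FII values in the usage lines follow the {H, D, F} running example:

```python
# Illustrative DFCP classification, assuming PI and the per-feature FII
# values of a candidate pattern are already computed.
def classify(pi, fii, min_prev, min_fd):
    if pi < min_prev:
        return False, []                 # condition (1) fails: not prevalent
    lo = min(fii.values())
    if max(fii.values()) - lo < min_fd:
        return False, []                 # Lemma 2: no per-feature test needed
    # max_FD(c_k, f) = FII(c_k, f) - min_FII(c_k) >= min_fd  =>  f dominant
    dominant = [f for f, v in fii.items() if v - lo >= min_fd]
    return True, dominant

# {H, D, F}: PI = 0.5, FII values derived from the loss indices in Example 2
is_dfcp, dom = classify(0.5, {"H": 1.0, "D": 0.6, "F": 0.6}, 0.4, 0.4)
```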

Example 6.

In Fig. 3, set the prevalence threshold min_prev = 0.4 and the disparity threshold min_fd = 0.4. For co-location pattern {H, D, F}, PI({H, D, F}) = 0.5 ≥ min_prev and max_FII({H, D, F}) − min_FII({H, D, F}) = 1 − 0.6 = 0.4 ≥ min_fd, so {H, D, F} is a DFCP.

FII({H, D, F}, H) − min_FII({H, D, F}) = 1 − 0.6 = 0.4 ≥ min_fd.

FII({H, D, F}, D) − min_FII({H, D, F}) = 0.6 − 0.6 = 0 < min_fd.

FII({H, D, F}, F) − min_FII({H, D, F}) = 0.6 − 0.6 = 0 < min_fd.

H is a dominant feature in {H, D, F}.

For co-location pattern {R, S, C}: PI({R, S, C}) = 0.5 > min_prev, but

max_FII({R, S, C}) − min_FII({R, S, C}) = 0.9 − 0.71 = 0.19 < min_fd

We can therefore determine that {R, S, C} is a non-DFCP prevalent co-location pattern without calculating the feature disparity between its features.

4 Algorithm

In this section, we demonstrate a general algorithm for mining DFCPs from prevalent co-location patterns. The DFCP mining framework consists of two stages. In stage 1, the feature influence of each feature in a prevalent co-location pattern is computed, and DFCPs are selected by min_fd. In stage 2, the set of dominant features of each DFCP is extracted by min_fd. Algorithm AMDFCP (figure a) gives the DFCP mining framework.

Line 1 generates a set of star instances based on the distance threshold d. Lines 2–4 generate the k-size candidate co-location pattern sets. Lines 5–22 describe the DFCP mining process, including loss-index calculation, the disparity test, and dominant-feature extraction. Lines 5–6 calculate the participation index. Lines 7–10 compute the loss index and feature influence of each feature from a prevalent pattern and its sub-pattern set. Lines 11–12 aggregate the feature influences into max_FII and min_FII to prune DFCP mining and dominant-feature extraction. Line 13 is a pruning strategy based on Lemma 2, which uses the full range of feature disparity instead of testing the disparity of every feature pair. If, in line 13, pattern c is a DFCP, then lines 14–18 extract the dominant features by testing each feature in the pattern. Line 15 is another pruning strategy, which uses the difference between each feature influence and the minimum feature influence instead of calculating the disparity between a feature and all the other features in the pattern. Line 19 stores the DFCPs with their dominant features. Lines 3–23 are executed repeatedly, and finally a collection of DFCPs is returned.
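The two-stage flow described above can be sketched in a compact form. This is a simplified illustration rather than the AMDFCP pseudocode: candidates are enumerated by brute force instead of star instances, and `table_instance` is a hypothetical helper that returns T(c) for a pattern:

```python
from itertools import combinations

def amdfcp(features, counts, table_instance, min_prev, min_fd, max_size):
    """Return {pattern: dominant features} for every DFCP up to max_size."""
    results = {}
    for k in range(3, max_size + 1):             # k-size candidates (k > 2)
        for c in combinations(sorted(features), k):
            table = table_instance(c)
            # participation index (prevalence check)
            pr = {f: len({row[i] for row in table}) / counts[f]
                  for i, f in enumerate(c)}
            if min(pr.values()) < min_prev:
                continue                         # not prevalent
            # loss index / feature influence per (k-1)-size sub-pattern
            fii = {}
            for f in c:
                sub = tuple(g for g in c if g != f)
                sub_table = table_instance(sub)
                lrs = [(len({r[sub.index(g)] for r in sub_table})
                        - len({r[c.index(g)] for r in table})) / counts[g]
                       for g in sub]
                fii[f] = 1 - min(lrs)            # FII = 1 - LI
            lo = min(fii.values())
            # Lemma 2 pruning: full FII range vs. min_fd
            if max(fii.values()) - lo >= min_fd:
                results[c] = [f for f in c if fii[f] - lo >= min_fd]
    return results
```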

5 Experimental Study

In this section, comprehensive experiments are conducted to evaluate the proposed concepts and algorithms from multiple perspectives on both synthetic and real data sets. All the algorithms are implemented in Visual C#, and all experiments are performed on a 2.4 GHz Intel Core i3 machine with 8 GB of memory.

5.1 Experiment on Synthetic Data

5.1.1 Synthetic Data Generation

Four synthetic datasets are used in our experiments, generated with different methods to obtain different distributions. Datasets 1, 3 and 4 are randomly generated according to a Poisson distribution and distributed evenly in a 1000 × 1000 space. Dataset 2 is generated to simulate a data distribution with dominant features. To make the synthetic data as consistent as possible with the dominant-feature distribution characteristics, we assign different distributions to the instances of features classified as dominant and non-dominant by controlling specific parameters. Compared with sowing the data points randomly on a plane, we increase the density of the instances of non-dominant features around the dominant features by setting a density parameter α, and adjust the distance between instances of the same dominated feature by setting a distance parameter β to ensure they are not located too closely and tightly. The four synthetic datasets are shown in Table 1. In particular, we assign 5 dominant features and 15 non-dominant features in dataset 2.

We run the DFCP mining algorithm on the synthetic datasets and compare its efficiency with the traditional Join-less co-location algorithm [2] by considering the effects of the number of instances, the participation index threshold, the distance threshold, and the disparity threshold. Table 2 shows the default parameters of the experiments.

Table 1. The experimental data set
Table 2. The default values of the parameters

5.1.2 Efficiency

The Effect of Prevalence Threshold.

Figure 6 shows the running time of the DFCP mining algorithm at five different minimum prevalence thresholds (min_prev) on the four synthetic datasets. For each dataset, the running time decreases as min_prev increases; across datasets, the running time increases as the data volume increases and min_prev decreases. For dataset 4, the effect of min_prev on performance is particularly evident: a low threshold together with dense data leads to huge table instances of the candidates, and the table-instance computation dominates the running time. For datasets 1 and 2, the running time on dataset 2 is longer than on dataset 1 since the computation cost of the DFCP process is higher on dataset 2.

Fig. 6.
figure 6

Running time on different synthesized data sets (w.r.t min_prev)

The Effect of Distance Threshold.

Figure 7 depicts the running time on the four synthetic datasets with respect to the distance threshold. It can be observed that for each dataset, the running time increases as the distance threshold increases, and across datasets the running time increases with the data volume. The effect of a large distance threshold on performance is especially obvious, which indicates that the performance of the algorithm is mainly affected by data density. For datasets 1 and 2, when the distance threshold is 10 or 20, the efficiency on dataset 1 is better than on dataset 2; however, as the distance increases, the situation reverses. This is because when the distance is small, the computation cost of the DFCP mining process dominates on dataset 2, whereas when the distance is large, spatial auto-correlation occurs more frequently on dataset 1 than on dataset 2, so dataset 2 performs better.

Fig. 7.
figure 7

Running time on different synthesized data sets (w.r.t distance)

The Effect of Co-location Feature Disparity Threshold.

Figure 8 shows the running time on the four synthetic datasets with respect to the disparity threshold (min_fd). For each dataset, the running time decreases more rapidly as the disparity threshold increases. We note that the effect of min_fd on performance is particularly evident on dense datasets. The efficiency on dataset 1 is better than on dataset 2 because dataset 2 contains more DFCPs than dataset 1 and thus incurs a higher computation cost.

Fig. 8.
figure 8

Running time on different synthesized data sets (w.r.t min_fd)

5.1.3 The Mining Results of AMDFCP vs Join-Less

Figure 9 shows the mining results of the AMDFCP algorithm and the Join-less algorithm on the four synthetic datasets under the default parameters. The number of DFCPs increases as the data volume increases, indicating that the denser the data, the more prominent the dominance characteristics among the features of the prevalent co-locations. As the volume of the dataset grows, the number of DFCPs remains much smaller than the number of prevalent co-locations. Moreover, the number of DFCPs in dataset 2 is far larger than in dataset 1, which shows that our algorithm identifies DFCPs correctly.

Fig. 9.
figure 9

The comparison of mining results on different synthesized data sets

5.2 Experiments on Real Data

Two real-world data sets are used in our experiments. The first one is the points of interest (POI) in Beijing which consists of 26,546 spatial instances and 16 spatial feature types. The spatial distance threshold is 50 by default (meaning 50 m in the real world). The second data set is the vegetation data of the “Three Parallel Rivers Area” which consists of 335 spatial instances and 32 spatial feature types. The spatial distance threshold is 6000 by default (meaning 6 km in the real world). We set the default values as min_prev = 0.3 and min_fd = 0.2.

Compared with the synthetic data, the spatial correlation of the real data is higher, so the results have practical significance. We record the number of patterns produced by the AMDFCP algorithm and by the Join-less algorithm while varying the prevalence threshold, the distance threshold, and the feature disparity threshold. The results of varying min_prev on the POI and vegetation datasets are presented in Figs. 10 and 11, respectively. It can be observed that the AMDFCP algorithm reduces the number of mining results by as much as 50%: as min_prev decreases, the results of the Join-less algorithm increase quickly, while the AMDFCP algorithm efficiently filters out weakly correlated prevalent patterns. Figures 12 and 13 illustrate the results of varying the distance threshold on the POI and vegetation datasets, respectively. The number of DFCPs increases more slowly than the number of prevalent co-location patterns as the distance threshold grows, which indicates that the mining results are less affected by the distance threshold. This is because the disparity metric avoids the result explosion caused by the aggregation of a large number of instances in dense data.

Fig. 10.
figure 10

The mining results with different min_prev on POI data set

Fig. 11.
figure 11

The mining results with different min_prev on vegetation data set

Fig. 12.
figure 12

The mining results with different distance threshold on POI data set

Fig. 13.
figure 13

The mining results with different distance threshold on vegetation data set

5.3 A Real Application of DFCP Mining

We use the POI dataset to present and explain a practical application of DFCP mining. Table 3 shows the results of DFCP mining; the dominant features are labeled with "*". The mining results indicate that DFCP mining can offer targeted and abundant information.

Table 3. The results of DFCP mining

6 Conclusions

In this study, a new approach to mining dominant-feature co-location patterns (DFCPs) is proposed to reveal the dominant relations between the features of a pattern and to reduce the number of prevalent co-location pattern results. The DFCP mining problem consists of DFCP determination and dominant-feature extraction. To formulate the problem, we first propose a measure, namely disparity, to quantify the difference in feature influence, and then design an algorithm, AMDFCP, to mine DFCPs and their dominant features. Finally, the experimental results demonstrate that the proposed DFCP mining method can efficiently identify DFCPs among prevalent co-location patterns, and the results on the POI data further support that DFCP mining can provide effective support to users in specific applications.