1 Introduction

In recent decades, the rapid development of spatial technologies has made large amounts of spatial data available, and it has become popular to analyze spatial phenomena and discover knowledge from spatial data sets. Spatial co-location pattern mining plays an important role in this domain. A spatial co-location pattern is a subset of spatial features whose instances frequently appear in spatial proximity. For instance, the locations where collybia albuminosa grows usually contain nests of termites, so {collybia albuminosa, nest of termites} is a co-location pattern. {rhinoceros, oxpeckers} is another real example, because rhinoceroses often live with oxpeckers. Co-location patterns may yield important insights in many applications, such as Earth science, public health, biology, and transportation. Due to this wide applicability, researchers have developed multiple approaches for mining co-location patterns.

So far, most spatial co-location pattern mining approaches have not considered the influence of different features [1,2,3,4,5,6,7,8,9]; the few papers related to this topic studied either co-location pattern mining on extended spatial objects (points, line-strings, polygons) [10,11,12,13] or high impact co-location pattern mining on point-type objects [14], based on buffer overlap of instances. Almost all spatial co-location pattern mining approaches rely on the proximity of instances, without considering the non-spatial attributes of instances. The influence between different features is hard to capture with spatial proximity alone; we need to consider more non-spatial attributes of instances to analyze the implicit interactions between features. For example, supermarkets and convenience stores have different influence on the same residents: besides distance, many non-spatial attributes such as scale, commodity diversity, and service level also matter. Attributes can reflect the influence between instances, yet previous spatial co-location pattern mining approaches did not focus on the influence reflected by instances' attributes.

An example of five spatial instances is shown in Fig. 1, where the instances belong to features A (shopping center), B (store) and C (grocery), and adjacent instances within a preset distance threshold are linked with solid lines. Although instances A.1, B.1 and C.1 form a clique for the candidate co-location pattern {A, B, C}, we cannot accurately describe the influence between the three instances. Therefore, we add the 8-dimensional attributes described in Sect. 3.1 to the sample instances in Fig. 1. The principle of our approach is depicted in Fig. 2, in which each curve denotes the attribute vector of one of the five instances in Fig. 1. We can infer the influence between instances by calculating the similarity of instances from the vector values of their attributes, together with further computation.

Fig. 1 An example of spatial instances

Fig. 2 Similarity of instances in terms of attributes

It is challenging to introduce the influence of non-spatial attributes to co-location pattern mining. In this paper, we propose a novel approach that mines high influence co-location patterns by using spatial neighborhoods of instances and similarity of instances with attributes. In summary, the contributions of this paper are listed as follows:

(1) We introduce additional attributes of spatial instances, construct a new equation for calculating the similarity of instances that combines the cosine similarity and Mahalanobis distance functions, and apply information entropy approaches to compute the influence of instances.

(2) We define the concepts of influence ratio and influence index for co-location patterns, and prove that they satisfy the downward closure property, which can be used to prune the search space. Then, a novel High Influence Co-location Pattern Mining (HICPM) algorithm is designed to mine high influence co-location patterns.

(3) We conduct extensive experiments on synthetic and real data sets; the results show that our approach can find high influence co-location patterns effectively and efficiently.

The remainder of this paper is organized as follows: related works are introduced in Sect. 2. Section 3 gives the definitions and equations, the proof of the downward closure property, the HICPM algorithm, and the time complexity analysis. Section 4 conducts experiments on real and synthetic data sets and analyzes the results. Finally, Sect. 5 summarizes the paper and outlines future work.

2 Related works

Related work can be summarized in two aspects as follows:

Spatial co-location pattern mining Agrawal et al. [15, 16] published the founding work on mining association rules between sets of items in large databases in 1993, which has animated pattern mining research on methodologies and applications ever since [17,18,19]. Koperski et al. [1] introduced spatial association rules to analyze geographic information databases in 1995. Shekhar et al. [2] pioneered the concept of co-location pattern mining in 2001 and proposed a join-based method for discovering prevalent co-location patterns from spatial data sets.

A spatial co-location pattern c is defined as a set of spatial features. A spatial feature, denoted \(f_{i}\), represents a kind of instance in space, while an instance of \(f_{i}\), denoted \(f_{i} \cdot j\), represents an individual object at a specified location. Instances are adjacent when they are within the distance threshold. Adjacent instances of diverse features form cliques; the instances in a clique form a row_instance(c) of a co-location pattern c only if they cover all the features of the pattern and no proper subset of them does so. The table instance of c, denoted table_instance(c) or T(c), is the set of all row_instance(c). The participation ratio PR(c, \(f_{i}\)) is defined as the number of distinct instances of feature \(f_{i}\) involved in table_instance(c) divided by the total number of instances of feature \(f_{i}\). The participation index PI(c) is the minimum of PR\((c,f_{i} )\) over all features in c. A co-location pattern c is considered prevalent if PI(c) is no less than the preset threshold PIthreshold.

Example 1

Figure 3 shows features A, B, C with 9, 20 and 21 instances respectively. The co-location pattern c = {A, B, C} has table instance {{A.1, B.1, C.1}, {A.2, B.4, C.2}, {A.2, B.6, C.3}, {A.9, B.11, C.12}, {A.9, B.13, C.12}}. Thus, PR(c, A) = 3/9, PR(c, B) = 5/20, PR(c, C) = 4/21, and PI(c) = min{PR(c, \(f_{i}\))} = 4/21 ≈ 0.19. Assuming PIthreshold = 0.15, the co-location pattern {A, B, C} is prevalent.
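
To make these measures concrete, the following is a minimal sketch (not the authors' implementation; class and method names are hypothetical) that computes the participation ratios and the participation index of Example 1.

```java
import java.util.*;

// A minimal sketch of PR(c, f_i) and PI(c) from Example 1.
public class ParticipationIndex {

    /** Counts the distinct instances of each feature in the table instance of c
     *  and divides by that feature's total instance count; returns the minimum. */
    static double participationIndex(List<String[]> tableInstance,
                                     Map<String, Integer> totalInstances) {
        Map<String, Set<String>> distinct = new HashMap<>();
        for (String[] rowInstance : tableInstance) {
            for (String inst : rowInstance) {                 // e.g. "A.1"
                String feature = inst.split("\\.")[0];        // e.g. "A"
                distinct.computeIfAbsent(feature, k -> new HashSet<>()).add(inst);
            }
        }
        double pi = Double.MAX_VALUE;
        for (Map.Entry<String, Set<String>> e : distinct.entrySet()) {
            double pr = (double) e.getValue().size() / totalInstances.get(e.getKey());
            pi = Math.min(pi, pr);                            // PI(c) = min PR(c, f_i)
        }
        return pi;
    }

    public static void main(String[] args) {
        List<String[]> t = Arrays.asList(
            new String[]{"A.1", "B.1", "C.1"},  new String[]{"A.2", "B.4", "C.2"},
            new String[]{"A.2", "B.6", "C.3"},  new String[]{"A.9", "B.11", "C.12"},
            new String[]{"A.9", "B.13", "C.12"});
        Map<String, Integer> totals = new HashMap<>();
        totals.put("A", 9); totals.put("B", 20); totals.put("C", 21);
        System.out.println(participationIndex(t, totals));    // PI(c) = 4/21 ≈ 0.19
    }
}
```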

Fig. 3 Illustration of spatial co-location patterns

A landmark event in this field is the full-join algorithm proposed by Huang et al. [3] in 2004, which can mine complete and correct prevalent co-location patterns. As the running time of the full-join algorithm rises significantly with the number of instances, Yoo et al. [4] proposed a partial-join algorithm, which divides instances into disjoint clusters to reduce the computation of join operations. Yoo et al. [5, 6] proposed a join-less algorithm based on a star neighborhood materialized model, integrated the ideas of join-less and partial-join to make it run faster than the full-join method on dense data sets, and compared their time complexity. Huang et al. [20] used a compact prefix tree structure called FP-tree for mining prevalent co-location patterns, and Wang et al. [7] proposed the iCPI-tree, which updates the previous CPI-tree based algorithm [8] to improve the efficiency of co-location pattern mining. In order to compress the large number of co-location patterns mined, researchers proposed new concepts and algorithms for mining closed patterns [21], maximal frequent patterns and compressed prevalent patterns [9]. The forms of instances studied have diversified into uncertain data [22], interval data [23], fuzzy data [24], extended spatial objects [10,11,12,13] and incremental data [25, 26], suitable for practical scenarios. Different from the participation index mostly applied to co-location pattern mining, Xiong et al. [11] introduced extended spatial objects with buffers, proposed a coverage computation with the MBBR model and applied it to test route selection. Kim et al. [12] constructed a transaction-based framework for co-location pattern mining, making association analysis applicable and extended spatial objects usable, to obtain geographic context awareness for ubiquitous GIS. Li et al. [13] proposed grid based transactions for extended spatial objects and introduced a statistical test, rather than a global threshold, to validate the significance of candidate co-location patterns, in order to identify correlations between child cancer cases and pollutant emissions. Chai et al. [27] suggested a node-priority based large-scale overlapping community detection method. Chen [14] noticed that the participation index and the coverage computation did not reflect the effect of instances, thus defined the buffer as the effect of instances and proposed an algorithm for mining co-location patterns with high impact. In view of the shortcomings of relying on locational information alone and of the geometric computation cost, this paper introduces attributes of instances for further exploration.

Influence analysis With the rapid development of social networks, influence analysis has been widely studied. Xiang et al. [28] evaluated relationship strength from interactive actions and user similarity. Sathanur et al. [29] introduced transfer entropy as a measure of directed causal influence for online social interactions. Peng et al. [30] evaluated the influence of a node via two factors: an intimacy degree reflecting the proximity between users, and an activity degree determining active nodes. Bakshy et al. [31] counted pairwise influence by score, predicted the global influence of a user by the sum of scores, and suggested a disjoint influence tree with features to estimate a user's global influence. Huang et al. [32] measured individual social influence by the number of followers and the sensitivity of finding good items. Peng et al. [33] presented an evaluation model to measure direct and indirect influence based on a social relationship graph, introducing friend entropy and interaction frequency entropy to describe social influence in mobile social networks. This paper concerns the influence between adjacent instances, without considering influence beyond the distance threshold or the indirect influence of instances.

3 Formal definitions and the algorithm

In this section, we formally define the high influence co-location pattern and prove the downward closure property of the pattern. Then, we design an algorithm with a pruning strategy for mining high influence co-location patterns and analyze the time complexity of the algorithm. Table 1 lists the key notations used in this paper.

Table 1 Key notations used in this paper

3.1 Definitions and properties

In the data sets of this paper, spatial instances are expressed as vectors, which contain the feature symbol, instance number, latitude, longitude, and attributes. The vector values of some spatial instances in Figs. 1 and 3 are listed in Table 2.

Table 2 Vector values of some spatial instances in Figs. 1 and 3

We compute similarity of instances on attribute vectors of instances, with integration of cosine similarity and Mahalanobis distance equations as follows:

  • Cosine similarity on attribute vectors of instances

Cosine similarity represents the angle between two vectors with the cosine function. Assume that \(X_{{f_{i} \cdot j}}\) and \(Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}}\) are the attribute vectors of instances \(f_{i} \cdot j\) and \(f_{{i^{\prime}}} \cdot j^{\prime}\); their cosine similarity is computed as follows:

$${\text{Sim}}\left( {X_{{f_{i} \cdot j}} ,Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} } \right) = \frac{{X_{{f_{i} \cdot j}} \cdot Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} }}{{||X_{{f_{i} \cdot j}} ||||Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} ||}}$$
(1)

where · denotes the vector dot product, \(X_{{f_{i} \cdot j}} \cdot Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} = \mathop \sum \nolimits x_{{f_{i} \cdot j}} \cdot y_{{f_{{i^{\prime}}} \cdot j^{\prime}}}\), and \(||X_{{f_{i} \cdot j}} ||\) is the length of vector \(X_{{f_{i} \cdot j}}\), \(||X_{{f_{i} \cdot j}} || = \sqrt {\mathop \sum \nolimits (X_{{f_{i} \cdot j}} )^{2} } = \sqrt {X_{{f_{i} \cdot j}} \cdot X_{{f_{i} \cdot j}} }\).

  • Mahalanobis distance on attribute vectors of instances

Mahalanobis distance (also called the D² statistic) is an effective approach to measuring the similarity between two unknown sample sets; it is defined for two random variables that obey the same distribution and share the same covariance matrix. The Mahalanobis distance between the attribute vectors \(X_{{f_{i} \cdot j}}\) and \(Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}}\) is computed as follows:

$$D^{2} (X_{{f_{i} \cdot j}} ,Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} ) = (X_{{f_{i} \cdot j}} - Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} )^{T} \varSigma^{ - 1} (X_{{f_{i} \cdot j}} - Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} )$$
(2)

where \(D^{2}\) denotes the Mahalanobis distance between the attribute vectors of instances \(f_{i} \cdot j\) and \(f_{{i^{\prime}}} \cdot j^{\prime}\), \(\varSigma^{ - 1}\) (also written Icm) denotes the inverse covariance matrix of the variables, and T denotes the transpose.

Although both cosine similarity and Mahalanobis distance can express the similarity between vectors, their emphases differ. Cosine similarity emphasizes the consistency of the directions of vectors and is not sensitive to their numerical differences, while Mahalanobis distance emphasizes the numerical difference and is not sensitive to the consistency of directions. As the attributes of our data sets contain both sequence data and numerical data, we integrate cosine similarity and Mahalanobis distance into a new measure that reflects the similarity of instances with attributes more accurately.

  • Similarity between instances with attributes

Noticing that the cosine similarity ranges in [− 1, 1] and the Mahalanobis distance ranges in [0, M] (M denotes a finite constant), we construct a new measure, called the similarity between instances with attributes, which ranges in [0, 1]. When the cosine similarity is 1 and the Mahalanobis distance is 0, the attribute vectors of the neighbor pair coincide, i.e. they have the same orientation and length, and the similarity takes its maximum value 1. When the cosine similarity is − 1 and the Mahalanobis distance is M, the attribute vectors of the neighbor pair point in opposite orientations and differ in length, and the similarity takes its minimum value 0. The similarity of instances \(f_{i} \cdot j\) and \(f_{{i^{\prime}}} \cdot j^{\prime}\) is defined as follows.

$$S(f_{i} \cdot j, f_{{i^{\prime}}} \cdot j^{\prime}) = \frac{{1 + \cos (X_{{f_{i} \cdot j}} ,Y_{{f_{{i^{'} }} \cdot j^{'} }} )}}{{2 + {\text{D}}(X_{{f_{i} \cdot j}} ,Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} )}},\quad S(f_{i} \cdot j, f_{{i^{\prime}}} \cdot j^{\prime}) \in \left[ {0, \, 1} \right]$$
(3)

where S(\(f_{i} \cdot j,f_{{i^{\prime}}} \cdot j^{\prime}\)) denotes the similarity of instances \(f_{i} \cdot j\) and \(f_{{i^{\prime}}} \cdot j^{\prime}\), and \({\text{D}}(X_{{f_{i} \cdot j}} ,Y_{{f_{{i^{\prime}}} \cdot j^{\prime}}} )\) denotes the square root of the Mahalanobis distance.
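
The following is a minimal sketch of Eqs. (1)–(3), assuming the attribute vectors are given as double arrays and the inverse covariance matrix Icm has been precomputed elsewhere (e.g. in R, as in Sect. 4); the class and method names are hypothetical.

```java
// A minimal sketch of Eqs. (1)-(3); not the authors' implementation.
public class InstanceSimilarity {

    /** Eq. (1): cosine similarity of two attribute vectors. */
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx  += x[i] * x[i];
            ny  += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    /** Eq. (2): squared Mahalanobis distance (x - y)^T * Icm * (x - y). */
    static double mahalanobisSquared(double[] x, double[] y, double[][] icm) {
        int n = x.length;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) d[i] = x[i] - y[i];
        double sum = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += d[i] * icm[i][j] * d[j];
        return sum;
    }

    /** Eq. (3): similarity in [0, 1], using the square root of Eq. (2). */
    static double similarity(double[] x, double[] y, double[][] icm) {
        double d = Math.sqrt(mahalanobisSquared(x, y, icm));
        return (1 + cosine(x, y)) / (2 + d);
    }
}
```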

Using the example data set in Fig. 3, Fig. 4 gives an illustrated comparison of the cosine similarity, the square root of the Mahalanobis distance and the similarity between instances. The similarity of instances rises sequentially in [0, 0.5]; the cosine similarity increases with minor fluctuations, while the square root of the Mahalanobis distance decreases with relatively large fluctuations. Comparing the ordinate values that the neighbor pairs on the abscissa take on the three curves, we find that when the ordinate values of the cosine similarity are the same, the smaller the square root of the Mahalanobis distance is, the greater the similarity of instances is, as shown by the neighbor pairs A.2–B.4 and A.3–B.9; and when the ordinate values of the square root of the Mahalanobis distance are the same, the greater the cosine similarity is, the greater the similarity of instances is, as shown by the neighbor pairs A.7–C.10 and A.1–B.1. Therefore, we reckon that the constructed similarity of instances smooths the fluctuations of the cosine similarity and the Mahalanobis distance and reasonably reflects our expectation for the concept of influence, which is defined on the basis of the similarity between instances with attributes.

Fig. 4 Comparison on cosine similarity, square root of Mahalanobis distance, and similarity of instances

  • Introduction of information entropy

Information entropy was introduced by Shannon in his paper "A Mathematical Theory of Communication", published in 1948. It quantifies how much information an event contains: the more uncertain an event is, the more information it contains. The computation of information entropy has been widely applied in many areas for decades. In this paper, we use the concept and computation method of "average self-information" in Shannon's information entropy theory; that is, the average self-information of a random variable X, symbolized as H(X), is defined as:

$${\text{H}}\left( X \right) = \mathop \sum \limits_{i = 1}^{n} P\left( {x_{i} } \right)I\left( {x_{i} } \right) = - \mathop \sum \limits_{i = 1}^{n} P(x_{i} )\log P(x_{i} )$$
(4)

where X denotes a discrete random variable whose outputs are \(x_{i}\), i = 1, 2,…, n, P(\(x_{i}\)) is the probability of occurrence of output \(x_{i}\), and \(I(x_{i} )\) denotes the self-information of the event \(X = x_{i}\).
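
As a minimal sketch, Eq. (4) can be computed directly from a probability distribution; the base-10 logarithm is used here because the entropies in the remainder of this section use it (the class and method names are hypothetical).

```java
// A minimal sketch of Eq. (4): H(X) = -sum P(x_i) log P(x_i), with log base 10.
public class ShannonEntropy {

    static double entropy(double[] probabilities) {
        double h = 0;
        for (double p : probabilities) {
            if (p > 0) {                      // terms with P(x_i) = 0 contribute nothing
                h -= p * Math.log10(p);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // four equally likely outputs: H(X) = log10(4) ≈ 0.602
        System.out.println(entropy(new double[]{0.25, 0.25, 0.25, 0.25}));
    }
}
```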

As we aim at discovering high influence co-location patterns in this paper, the available information in the data sets is the number of neighbors of each instance and the similarity of instances with attributes, so we use Eq. 4 to calculate entropies that measure the information contained in an instance's neighborhood and in its attribute similarities, as follows.

  • Neighbor entropy of instance

The neighbor entropy, denoted \({\text{Ne}}(f_{i} \cdot j)\), is computed from the number of neighbors of instance \(f_{i} \cdot j\) with Eq. 5:

$${\text{Ne}}(f_{i} \cdot j) = - \mathop \sum \limits_{1}^{{N_{{f_{i} \cdot j}} }} \frac{1}{{N_{{f_{i} \cdot j}} }}\log_{10} \frac{1}{{N_{{f_{i} \cdot j}} }} - \left( { - \frac{1}{{N_{{f_{i} \cdot j}} }}\log_{10} \frac{1}{{N_{{f_{i} \cdot j}} }}} \right)$$
(5)

where \(N_{{f_{i} \cdot j}}\) denotes the number of neighbors of instance \(f_{i} \cdot j\) plus the instance itself. Since all \(N_{{f_{i} \cdot j}}\) terms in the sum are equal and the instance's own term is removed, Eq. 5 simplifies to \({\text{Ne}}(f_{i} \cdot j) = \frac{{N_{{f_{i} \cdot j}} - 1}}{{N_{{f_{i} \cdot j}} }}\log_{10} N_{{f_{i} \cdot j}}\). This entropy is used to calculate the influence of instances on their surrounding neighbors.

  • Attribute similarity entropy of instance

The attribute similarity entropy, denoted \({\text{Ase}}(f_{i} \cdot j)\), is computed from the similarities of instance \(f_{i} \cdot j\)'s attributes with its neighbors' attributes; the equation is as follows:

$$\begin{aligned} {\text{Ase}}(f_{i} \cdot j) & = - \mathop \sum \limits_{1}^{{N_{{f_{i} \cdot j}} }} \frac{{S(f_{i} \cdot j, f_{{i^{\prime}}} \cdot j^{\prime})}}{{\mathop \sum \nolimits_{1}^{{N_{{f_{i} \cdot j}} }} S(f_{i} \cdot j, f_{k} \cdot m)}}\log_{10} \frac{{S(f_{i} \cdot j, f_{{i^{\prime}}} \cdot j^{\prime})}}{{\mathop \sum \nolimits_{1}^{{N_{{f_{i} \cdot j}} }} S(f_{i} \cdot j, f_{k} \cdot m)}} \\ & \quad - \;\left( { - \frac{{S(f_{i} \cdot j, f_{i} \cdot j)}}{{\mathop \sum \nolimits_{1}^{{N_{{f_{i} \cdot j}} }} S(f_{i} \cdot j, f_{k} \cdot m)}}\log_{10} \frac{{S(f_{i} \cdot j, f_{i} \cdot j)}}{{\mathop \sum \nolimits_{1}^{{N_{{f_{i} \cdot j}} }} S(f_{i} \cdot j, f_{k} \cdot m)}}} \right) \\ \end{aligned}$$
(6)

where \(S(f_{i} \cdot j, f_{i} \cdot j)\) denotes the similarity of instances with attributes for the instance \(f_{i} \cdot j\) with itself, \(S(f_{i} \cdot j, f_{k} \cdot m)\) denotes the similarity of instances with attributes between the instance \(f_{i} \cdot j\) and any of its neighbors (i.e. instances \(f_{k} \cdot m\)).

Definition 1

(Influence of instance) The influence of an instance is defined as the degree to which the instance drives its adjacent instances toward the same attribute values, denoted Inf(\(f_{i} \cdot j\)), which represents all the influence that instance \(f_{i} \cdot j\) exerts on its adjacent instances. By means of information entropy, we construct the neighbor entropy and the attribute similarity entropy for instances. The influence of instance \(f_{i} \cdot j\) is computed from the two entropies as follows:

$${\text{Inf}}(f_{i} \cdot j) = \omega_{1} \cdot{\text{Ne}}(f_{i} \cdot j) + \omega_{2} \cdot{\text{Ase}}(f_{i} \cdot j) , \quad \omega_{1} + \omega_{2} = 1$$
(7)

where \({\text{Ne}}(f_{i} \cdot j)\) denotes the neighbor entropy of instance \(f_{i} \cdot j\), \({\text{Ase}}(f_{i} \cdot j)\) denotes the attribute similarity entropy of instance \(f_{i} \cdot j\), \(\omega_{1}\) denotes the weight of \({\text{Ne}}(f_{i} \cdot j)\), and \(\omega_{2}\) denotes the weight of \({\text{Ase}}(f_{i} \cdot j)\). In this paper, we set \(\omega_{1} = 0.2\) and \(\omega_{2} = 0.8\), assuming from practical experience that the influence of attributes is larger than that of neighbors.
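
The two entropies and the influence of an instance can be composed as in the following minimal sketch (assumptions: the neighbor similarities have already been computed with Eq. (3), the self-similarity S(\(f_{i} \cdot j\), \(f_{i} \cdot j\)) equals 1, and the class and method names are hypothetical).

```java
// A minimal sketch of Eqs. (5)-(7); not the authors' implementation.
public class InstanceInfluence {

    /** Eq. (5): neighbor entropy; n is the number of neighbors of the instance plus itself. */
    static double neighborEntropy(int n) {
        double p = 1.0 / n;
        double full = -n * p * Math.log10(p);    // entropy over n equally likely terms
        double selfTerm = -p * Math.log10(p);    // the instance's own term is removed
        return full - selfTerm;                  // = ((n - 1) / n) * log10(n)
    }

    /** Eq. (6): attribute similarity entropy; neighborSims holds S(f_i.j, f_k.m) for
     *  every neighbor, and selfSim is S(f_i.j, f_i.j) (= 1 by Eq. (3)). */
    static double attributeSimilarityEntropy(double[] neighborSims, double selfSim) {
        double total = selfSim;
        for (double s : neighborSims) total += s;
        double entropy = 0;
        for (double s : neighborSims) {          // the self term cancels out in Eq. (6)
            double p = s / total;
            entropy -= p * Math.log10(p);
        }
        return entropy;
    }

    /** Eq. (7): influence of an instance as the weighted sum of the two entropies. */
    static double influence(int neighborsPlusSelf, double[] neighborSims,
                            double w1, double w2) {
        return w1 * neighborEntropy(neighborsPlusSelf)
             + w2 * attributeSimilarityEntropy(neighborSims, 1.0);
    }

    public static void main(String[] args) {
        // an instance with two neighbors whose attribute similarities are 0.8 and 0.4
        System.out.println(influence(3, new double[]{0.8, 0.4}, 0.2, 0.8));
    }
}
```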

Definition 2

(Influence of feature) Given a spatial feature \(f_{i}\) and its instance set \({\text{S}}\left( {f_{i} } \right)\), the influence of feature \(f_{i}\) is defined as the total influence of all spatial instances belonging to this feature. It is described as follows:

$${\text{Inf}}\left( {f_{i} } \right) = \mathop \sum \limits_{{f_{i} \cdot j \in S(f_{i} )}} {\text{Inf}}(f_{i} \cdot j)$$
(8)

Definition 3

(Influence of feature in a co-location pattern) Given a k-size co-location pattern \(c = \{ f_{1} , f_{2} , \ldots , f_{k} \}\), \(f_{i} \in c\), k ⩾ 2, the influence of feature \(f_{i}\) in pattern c is defined as the sum, over the instances of feature \(f_{i}\) appearing in table_instance(c), of the minimum influence that each such instance exerts on the other features' instances in table_instance(c). It is described as follows:

$${\text{Inf}}\left( {c,f_{i} } \right) = \mathop \sum \limits_{{f_{i} \cdot j \in T(c)}} {\text{Inf}}_{min} (f_{i} \cdot j)$$
(9)

Definition 4

(Influence ratio of feature in a spatial co-location pattern) Given a k-size co-location pattern \(c = \{ f_{1} , f_{2} , \ldots ,f_{k} \}\), \(f_{i} \in c\), k ⩾ 2. The influence ratio of feature \(f_{i}\) in pattern c (\({\text{InR}}(c,f_{i} )\)) is defined as the ratio of the influence of feature \(f_{i}\) in the co-location pattern c to the total influence of feature \(f_{i}\). It is described as follows:

$${\text{InR}}\left( {c,f_{i} } \right) = \frac{{{\text{Inf}}(c,f_{i} )}}{{{\text{Inf}}(f_{i} )}}$$
(10)

Definition 5

(Influence index of the co-location pattern) Given a k-size co-location pattern \(c = \{ f_{1} , f_{2} , \ldots , f_{k} \}\), \(f_{i} \in c\), k ⩾ 2. The influence index of the pattern c (InI(c)) is defined as the minimum among influence ratios of all the features in the pattern c. It is described as follows:

$${\text{InI}}\left( c \right) = { \hbox{min} }_{i = 1}^{k} \{ {\text{InR}}(c,f_{i} )\}$$
(11)

Definition 6

(High influence co-location pattern) A spatial co-location pattern c is defined as a high influence co-location pattern only if its InI(c) is not smaller than the preset threshold InIthreshold.
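
Definitions 2–6 compose as in the following minimal sketch, which assumes the per-instance influences Inf(\(f_{i} \cdot j\)) from Eq. (7) and the per-pattern minimum influences \({\text{Inf}}_{min}(f_{i} \cdot j)\) are already available; the class and method names are hypothetical.

```java
import java.util.*;

// A minimal sketch of Eqs. (8)-(11) and Definition 6; not the authors' implementation.
public class InfluenceIndex {

    /** Eq. (8): total influence of a feature = sum of its instances' influence. */
    static double featureInfluence(Collection<Double> instanceInfluences) {
        double sum = 0;
        for (double inf : instanceInfluences) sum += inf;
        return sum;
    }

    /** Eqs. (9)-(10): influence ratio of a feature in pattern c. */
    static double influenceRatio(Collection<Double> minInfluencesInTableInstance,
                                 double featureInfluence) {
        double infCFi = 0;
        for (double inf : minInfluencesInTableInstance) infCFi += inf;  // Inf(c, f_i)
        return infCFi / featureInfluence;                               // InR(c, f_i)
    }

    /** Eq. (11): influence index of pattern c = minimum influence ratio over its features. */
    static double influenceIndex(Map<String, Double> influenceRatios) {
        double ini = Double.MAX_VALUE;
        for (double inr : influenceRatios.values()) ini = Math.min(ini, inr);
        return ini;
    }

    /** Definition 6: c is a high influence pattern iff InI(c) >= InI_threshold. */
    static boolean isHighInfluence(double ini, double threshold) {
        return ini >= threshold;
    }
}
```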

Example 2

Based on the table instances and the instances' attributes from the sample data set of Fig. 3, we illustrate the process of mining high influence co-location patterns in Table 3. Assuming InIthreshold = 0.001, since \({\text{InI}}\left( c \right)\) ⩾ InIthreshold for every pattern listed, all the co-location patterns in Table 3 are high influence patterns.

Table 3 The process of HICPM based on sample data in Fig. 3

3.2 The downward closure property

The downward closure property (also called the antimonotone property) and the Apriori principle are two cornerstones of mining spatial co-location patterns. It has already been proven that the participation ratio (PR) and participation index (PI), the measures of traditional co-location pattern mining, satisfy the downward closure property and the Apriori principle [3, 34], i.e. PR and PI decrease monotonically with the increase of pattern size, and any instance \(f_{i} \cdot j\) which belongs to table_instance(c) surely belongs to table_instance(\(c^{\prime}\)) when \(c^{\prime} \subseteq c\). Through calculation and proof, we find that the influence ratio and the influence index also satisfy the downward closure property.

Lemma 1

(Downward closure property of high influence co-location patterns) The influence ratio (InR) and influence index (InI) decrease monotonically as the size of a co-location pattern increases.

Proof

Assume \(c' \subseteq c\), |c| = k, \(|c'| = {\text{m}}\), so k > m (|c| denotes the size of c). T(c) denotes the table instance of co-location pattern c. According to the above definitions and equations, the following proof can be made:

$$\begin{aligned} & \because {\text{InR}}\left( {c,f_{i} } \right) = \frac{{{\text{Inf}}(c,f_{i} )}}{{{\text{Inf}}(f_{i} )}} = \frac{{\mathop \sum \nolimits_{{f_{i} \cdot j \in T(c)}} {\text{Inf}}_{min} (f_{i} \cdot j)}}{{{\text{Inf}}(f_{i} )}}, \\ & {\text{Inf}}_{{f_{i} \cdot j \in T(c)}} (f_{i} \cdot j) < {\text{Inf}}_{{f_{i} \cdot j \in T(c^{\prime})}} (f_{i} \cdot j),\quad \\&f_{i} \cdot j \in {\text{T}}(c)\;{\text{and}}\;f_{i} \cdot j \in {\text{T}}(c^{\prime}), \\ & \quad = > \mathop \sum \limits_{{f_{i} \cdot j \in T(c)}} {\text{Inf}}_{min} (f_{i} \cdot j) < \mathop \sum \limits_{{f_{i} \cdot j \in T(c^{\prime})}} {\text{Inf}}_{min} (f_{i} \cdot j), \\ & \quad = > {\text{Inf}}\left( {c,f_{i} } \right) < {\text{Inf}}\left( {c^{\prime},f_{i} } \right).\;{\text{As}}\;{\text{Inf}}(f_{i} )\,{\text{is a positive constant}}, \\ & \therefore {\text{InR}}\left( {c,f_{i} } \right) < {\text{InR}}\left( {c^{\prime},f_{i} } \right). \\ & {\text{InI}}\left( c \right) = { \hbox{min} }_{i = 1}^{k} \{ {\text{InR}}(c,f_{i} )\} ,\quad\\& { \hbox{min} }_{i = 1}^{m} \{ {\text{InR}}(c,f_{i} )\} < { \hbox{min} }_{i = 1}^{m} \{ {\text{InR}}(c^{\prime},f_{i} )\} , \\ \end{aligned}$$

Assumed \(f_{l} \in c\), \(f_{l} \notin c^{\prime}\).

If \({\text{InR}}\left( {c,f_{l} } \right) = { \hbox{min} }_{i = 1}^{k} \{ {\text{InR}}(c,f_{i} )\}\), then \({ \hbox{min} }_{i = 1}^{k} \{ {\text{InR}}(c,f_{i} )\} < { \hbox{min} }_{i = 1}^{m} \{ {\text{InR}}(c',f_{i} )\}\) holds. Otherwise, the inequality is also true.

\(\therefore {\text{InI}}\left( c \right) < {\text{InI}}\left( {c^{\prime}} \right)\). The proof is completed and the lemma holds.□

Theorem 1

(Apriori principle of high influence co-location patterns) If a co-location pattern c is with high influence, then all its sub-patterns \(c^{\prime} \subseteq c\) are also with high influence. Conversely, if a co-location pattern c is not with high influence, then all of its super-patterns \(c^{\prime\prime}\) (\(c \subseteq c^{\prime\prime}\)) must not be with high influence.

Proof

Based on Lemma 1, Theorem 1 is clearly true.□

3.3 High influence co-location pattern mining algorithm

We propose the HICPM algorithm for mining high influence co-location patterns with a pruning strategy. The pseudo code is listed in Table 4.

Table 4 The pseudo code of HICPM algorithm (key notations please refer to Table 1)

The description of pseudo code is as follows:

(1) Set the variables \(\omega_{1}\), \(\omega_{2}\), R, InIthreshold, compute the inverse covariance matrices, and use R to generate neighbor pairs, as data preprocessing;

(2) Generate 1-size candidate co-location patterns (i.e. features) and their table instances (i.e. instances) (steps 1–2); compute the cosine similarity and Mahalanobis distance and then the attribute similarity for every neighbor pair (steps 3–7); generate star-type neighborhoods and compute the influence of instances and features (steps 8–13);

(3) Initialize the data structures and start the iteration from 2-size candidate co-location patterns: k-size candidates are generated from (k−1)-size high influence co-location patterns, and a candidate is pruned if any of its subsets is not a high influence pattern (steps 14–16, sketched below); generate star-type instances for k-size candidate co-location patterns (steps 17–20). As 2-size star-type instances are already cliques, they can be processed directly (step 21); for 3-size or larger patterns, the star-type instances must be checked for being cliques, and before that the candidates are coarsely filtered, i.e. if the influence ratio computed on the star-type instances of a candidate is less than the preset InIthreshold, the candidate is pruned (step 22); generate the clique instances for k-size candidate co-location patterns (steps 23–24); generate the k-size high influence co-location patterns (step 25); continue the iteration and return high influence co-location patterns of all sizes (steps 26–28).
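
The candidate generation and pruning in step (3) is Apriori-style and relies on Lemma 1. The following is a minimal, hypothetical sketch of that step only; it is not the pseudo code of Table 4, and the names are our own.

```java
import java.util.*;

// A minimal sketch of the Apriori-style candidate generation with pruning:
// k-size candidates are joined from (k-1)-size high influence patterns, and a
// candidate is pruned if any (k-1)-size subset is not high influence (Lemma 1).
public class CandidateGeneration {

    static List<List<String>> generateCandidates(List<List<String>> highInfluenceKMinus1) {
        Set<List<String>> previous = new HashSet<>(highInfluenceKMinus1);
        List<List<String>> candidates = new ArrayList<>();
        int n = highInfluenceKMinus1.size();
        for (int a = 0; a < n; a++) {
            for (int b = a + 1; b < n; b++) {
                List<String> p = highInfluenceKMinus1.get(a);
                List<String> q = highInfluenceKMinus1.get(b);
                // join patterns that share the same first k-2 features (features kept sorted)
                if (!p.subList(0, p.size() - 1).equals(q.subList(0, q.size() - 1))) continue;
                List<String> candidate = new ArrayList<>(p);
                candidate.add(q.get(q.size() - 1));
                Collections.sort(candidate);
                // prune: every (k-1)-size subset must already be a high influence pattern
                boolean allSubsetsHigh = true;
                for (int drop = 0; drop < candidate.size(); drop++) {
                    List<String> subset = new ArrayList<>(candidate);
                    subset.remove(drop);
                    if (!previous.contains(subset)) { allSubsetsHigh = false; break; }
                }
                if (allSubsetsHigh) candidates.add(candidate);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // hypothetical 2-size high influence patterns over features A, B, H, I
        List<List<String>> size2 = Arrays.asList(
            Arrays.asList("A", "B"), Arrays.asList("A", "H"),
            Arrays.asList("A", "I"), Arrays.asList("B", "H"));
        System.out.println(generateCandidates(size2));
        // prints [[A, B, H]]; {A, B, I} and {A, H, I} are pruned because
        // {B, I} and {H, I} are not high influence patterns in this toy input
    }
}
```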

3.4 Time complexity

The HICPM algorithm shares the same process of forming star neighborhoods and cliques with the join-less algorithm. With reference to the time complexity analysis of join-less [6], we first compare the time complexity of the two algorithms:

Let \(T_{hi} , T_{jl}\) be the time costs of HICPM and join-less respectively, and let S denote the input spatial data set. \(T_{gen\_neighborhoods} (S)\) denotes the time cost of generating the neighbor pair set Nb, \(T_{star\_neighborhoods} (S)\) denotes the time cost of materializing star-type neighborhoods from Nb, \(T_{hi} (2)\) and \(T_{jl} (2)\) denote the time costs that HICPM and join-less respectively spend on finding 2-size co-location patterns, \(\mathop \sum \limits_{k > 2} T_{hi} (k)\) and \(\mathop \sum \limits_{k > 2} T_{jl} (k)\) denote the time costs that HICPM and join-less respectively spend on finding k-size (k > 2) co-location patterns, \(T_{compu\_influ} (S)\) denotes the time cost that HICPM spends on computing the influence of instances and features, \(T_{compu\_InI(c)} (2)\) and \(T_{compu\_PI(c)} (2)\) denote the time costs that HICPM and join-less respectively spend on computing the influence index or participation index for 2-size co-location patterns, and \(T_{compu\_InI(c)} (k)\) and \(T_{compu\_PI(c)} (k)\) denote the corresponding time costs for k-size co-location patterns. Then,

$$\begin{aligned}&\begin{aligned} \because T_{hi} & = T_{gen\_neighborhoods} (S) + T_{star\_neighborhoods} (S)\\&\quad + T_{hi} \left( 2 \right) + \mathop \sum \limits_{k > 2} T_{hi} \left( k \right)\end{aligned} \\& \begin{aligned}T_{jl} & = T_{gen\_neighborhoods} (S) + T_{star\_neighborhoods} (S)\\&\quad + T_{jl} (2) + \mathop \sum \limits_{k > 2} T_{jl} (k) \end{aligned}\\& T_{hi} - T_{jl} = T_{hi} (2) - T_{jl} (2) + \mathop \sum \limits_{k > 2} (T_{hi} (k) - T_{jl} (k)) \\ &\quad = T_{compu\_influ} (S) + T_{compu\_InI(c)} (2) - T_{compu\_PI(c)} (2) \\ & \qquad + \;\mathop \sum \limits_{k > 2} (T_{compu\_InI(c)} (k) - T_{compu\_PI(c)} (k)) \\ \end{aligned}$$

As \(T_{compu\_influ} (S) > 0\), \(T_{compu\_InI(c)} (2) > T_{compu\_PI(c)} (2)\), \(T_{compu\_InI(c)} (k) > T_{compu\_PI(c)} (k)\)

$$\therefore T_{hi} > T_{jl}$$

So we know that the HICPM algorithm spends more time than join-less does.

Secondly, we analyze the time complexity of HICPM algorithms:

Let n denote the number of instances, \(|C_{k} |\) the number of k-size candidate patterns, \(t_{compu\_InI(c)} (k)\) the average time cost that HICPM spends on computing the influence index of a k-size co-location pattern, and c a constant.

$$\begin{aligned} \because T_{hi} & = T_{gen\_neighborhoods} (S) + T_{star\_neighborhoods} (S) + T_{hi} (2) + \mathop \sum \limits_{k > 2} T_{hi} (k) \\ & = T_{gen\_neighborhoods} (S) + T_{star\_neighborhoods} (S) + T_{compu\_influ} (S) \\ & \quad + T_{compu\_InI(c)} (2) + \mathop \sum \limits_{k > 2} (T_{compu\_InI(c)} (k)) \\ \end{aligned}$$

The data preprocessing takes \(O(n^{2} )\) time to generate the spatial relationships between instances, and the number of instances is usually much larger than both the number of features and the average number of neighbors per instance. Therefore,

$$\begin{aligned} \therefore T_{hi} & = T_{gen\_neighborhoods} (S) + T_{star\_neighborhoods} (S) + T_{compu\_influ} (S) \\ & \quad + T_{compu\_InI(c)} (2) + \mathop \sum \limits_{k > 2} T_{compu\_InI(c)} (k) \\ & \le O\left( {n^{2} } \right) + c \times O\left( n \right) + |C_{k} | \times t_{compu\_InI(c)} (k) \\ & \approx O\left( {n^{2} } \right) \\ \end{aligned}$$

Therefore, the time complexity of the HICPM algorithm is \(O(n^{2} )\).

4 Experimental evaluation

The experiments use a synthetic data set and a real data set, are implemented in R 3.5.2 and Java 1.7.0_51 (Java SE 1.7.0_51-b13, Java Server VM 24.51-b03, mixed mode), and are run on an ordinary PC with an Intel Core i7-6700 CPU @ 3.40 GHz, 16.0 GB memory and Windows 10 (64-bit). We use R for synthesizing data and calculating the Icm, and Java for all other tasks. The real data set and the synthetic data set come from Beijing's points of interest (POIs) with attributes, downloaded from the Baidu Map API on April 22, 2018 [35].

4.1 Data description

The real data set contains 16,307 instances with 8-dimensional attributes, i.e. unit_price, overall_rating, service_rating, facility_rating, hygiene_rating, image_num, groupoon_num and comment_num.

The synthetic data set contains 23,083 instances with 8-dimensional attributes, i.e. ave_price, person_visit, investment, turnover, loyalty, comment_num, rank_rating and complaint_num, synthesized by R with means and variances referring to the annual growth rates of industries in Beijing in 2018.

The two data sets involve 9 kinds of features i.e. beauty store, restaurant, school, enterprise, hospital, hotel, residence, health-care center, shop.

4.2 Effectiveness of HICPM algorithm

In this section, the effectiveness of the HICPM algorithm is shown on the real data set, by comparison with the join-less algorithm.

(1) Comparison on Top-5 Co-location Patterns Mined by HICPM Algorithm and Join-less Algorithm

When the experimental variables are set to R = 10 m, \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\), InIthreshold = PIthreshold = 0.001, we can see in Table 5 that the HICPM algorithm finds patterns that the join-less algorithm misses, i.e. the 2-size pattern {B, I} and the 4-size patterns {A, B, H, I}, {A, D, H, I}, {A, F, H, I}, {A, B, D, H}. Besides, the mining results show that two other 2-size patterns {C, F}, {B, C} and twelve 3-size patterns {A, F, I}, {B, D, F}, etc. can be mined solely by the HICPM algorithm. This is because when R is small, few instances are found in the neighborhoods, and the join-less algorithm seldom finds prevalent co-location patterns since the participation ratios are too small to satisfy the threshold; the HICPM algorithm, however, uses the more effective InI, which is related only to the attributes of instances, regardless of prevalence.

Table 5 Comparison on Top-5 co-location patterns

Both algorithms can find the same-size co-location patterns at the same rank, e.g. the 3-size patterns {A, B, I}, {B, F, H}, {B, H, I}, or the same-size patterns at different ranks, e.g. the 2-size patterns {B, F}, {A, B}, {B, H}, {A, H} and the 3-size patterns {A, B, H}, {A, H, I}. This happens because the two algorithms share the same R, i.e. they use the same table instances, so some table instances with both high influence and high participation ratio can be found. However, the assumption in this paper that the influence of instances exists only within the neighborhood is made to simplify the analysis; once the influence range is set beyond R, more co-location patterns with high influence will be discovered solely by the HICPM algorithm.

Besides, when R = 20 m while the other variables remain unchanged, the results show that the two algorithms mine the same number of co-location patterns at each size: the 2-size patterns are identical, and one pattern differs among the 3-size patterns and among the 4-size patterns respectively. When R varies in [30, 200] while the other variables remain unchanged, the join-less algorithm mines more co-location patterns at all sizes and finds co-location patterns of larger sizes. Although most high influence patterns can also be found by the join-less algorithm, a few patterns can only be mined by the HICPM algorithm. It appears that the two algorithms show similar characteristics at larger sizes when InIthreshold is set at a relatively low level. Nevertheless, the HICPM algorithm can rank the patterns by their influence level. Therefore, the HICPM algorithm can be reckoned as a new co-location pattern mining approach, different from previous ones.

(2) Comparison on Mining Results of HICPM Algorithm and Join-less Algorithm

The mining results of the two algorithms are compared as the experimental variables vary.

  • Effect of distance threshold on pattern number

In Fig. 5a, b, \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\), InIthreshold = PIthreshold = 0.001, and R varies in [10, 200]. When R = 10 m, the HICPM algorithm discovers four more 2-size patterns, twelve more 3-size patterns, and four more 4-size patterns than the join-less algorithm. When R = 20 m, the two algorithms mine the same number of patterns at all sizes. When R ⩾ 30 m, the join-less algorithm mines more co-location patterns, as more cliques appear and the participation ratio (PR) grows faster than InR. The opposite is true when R < 20 m.

Fig. 5 Effect of distance threshold [for join-less in (a) and HICPM in (b)], InIthreshold/PIthreshold [for HICPM/join-less in (c)] and weight \(\omega_{1}\)/\(\omega_{2}\) [for HICPM in (d)], on pattern number

  • Effect of InIthreshold/PIthreshold on pattern number

In Fig. 5c, R = 40 m, \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\), and InIthreshold/PIthreshold varies in [0.001, 0.121]. It can be seen that the pattern number falls sharply when InIthreshold/PIthreshold rises from 0.001 to 0.021. The reason is that most of the patterns have their InI/participation index (PI) below 0.001.

  • Effect of weight \(\omega_{1} /\omega_{2}\) on pattern number

In Fig. 5d, R = 40 m, InIthreshold = PIthreshold = 0.001, and the weight \(\omega_{1} /\omega_{2}\) varies in [0, 1]. The figure depicts the growth of the pattern number as \(\omega_{1}\) rises from 0 to 1 (i.e. \(\omega_{2}\) falls from 1 to 0, since \(\omega_{1} + \omega_{2} = 1\)). It shows that as \(\omega_{1}\) rises, i.e. as the importance of the neighborhood number increases or the importance of attribute similarity decreases, more high influence patterns are found. As a balanced compromise, this paper sets \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\).

4.3 Performance of HICPM Algorithm

In part (1) of this section, the time performance of the HICPM algorithm under varying thresholds is shown on the real data set and the synthetic data set, by comparison with the join-less algorithm. In part (2), the scalability of the HICPM algorithm under variation of the number of instances, attribute dimensions and features is verified on the synthetic data set, also by comparison with the join-less algorithm.

(1) Comparison on time performance of HICPM algorithm and join-less algorithm running on the real data set and synthetic data set

Time performance of the algorithms is shown with the variation of two kinds of threshold in Fig. 6 (using real data) and Fig. 7 (using synthetic data).

Fig. 6 Comparison on time performance of HICPM algorithm and join-less algorithm running on the real data set, with the variation of R in (a) and InIthreshold/PIthreshold in (b)

Fig. 7 Comparison on time performance of HICPM algorithm and join-less algorithm running on the synthetic data set, with the variation of R in (a) and InI/PI in (b)

  • Effect of R on time performance

In Fig. 6a, \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\), InIthreshold = PIthreshold = 0.001, and R varies in [20, 200]. The curves run closely together, and the gap between them generally expands as R rises. The gap has its minimum of 0.189 seconds (s) at R = 20 m and its maximum of 6.769 s at R = 200 m. The HICPM algorithm runs slightly slower than the join-less algorithm, as the former uses the same clique-forming process as the latter and runs extra steps to calculate the influence of features.

  • Effect of InIthreshold/PIthreshold on time performance

In Fig. 6b, R = 40 m, \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\), and InIthreshold/PIthreshold (InI/PI) varies in [0.001, 0.121]. Both curves decline as InI/PI rises, and the curve of the HICPM algorithm runs above that of the join-less algorithm. The gap between the curves has its minimum of 0.091 s at InI/PI = 0.121 and its maximum of 0.197 s at InI/PI = 0.001. Additional experiments conducted with InI/PI < 0.001 indicate that most of the features have InI and PI values concentrated in (0, 0.001].

  • Effect of thresholds on time performance

The following curves in Fig. 7 reflect the time performance of the two algorithms running on the synthetic data set with the same experimental conditions as aforementioned for analyzing Fig. 6.

In Fig. 7a, the minimal gap between curves is 0.33 s at R = 40 m, and the maximal gap is 9.685 s at R = 180 m. In Fig. 7b, the minimal gap is 0.128 s at InI/PI = 0.121 and the maximal gap is 0.305 s at InI/PI = 0.001.

(2) Comparison on scalability of HICPM algorithm and join-less algorithm running on the synthetic data set

To verify the scalability, we test the algorithms on the synthetic data set. All variations share the same experimental conditions: R = 40 m, \(\omega_{1} = 0.2\), \(\omega_{2} = 0.8\), InIthreshold = PIthreshold = 0.001.

  • Effect of number of instances on scalability

Figure 8a depicts that the two curves run closely together, and the gap between them generally expands as the number of instances increases. The minimal gap is 0.033 s at n = 600, while the maximal gap is 0.262 s at n = 20,600. This demonstrates on actual data that the efficiency of the two algorithms differs even though they run at the same level of time complexity.

Fig. 8 Comparison on scalability of HICPM algorithm and join-less algorithm with the variation of number of instances in (a), number of attribute dimensions in (b) and number of features in (c)

  • Effect of number of attribute dimensions on scalability

It can be seen in Fig. 8b that the two algorithms are insensitive to the number of attribute dimensions, as the points on the curves fluctuate within a tiny range of approximately 4%. It therefore seems promising to apply higher-dimensional attributes to explore more details of spatial relationships and changes. The running time of the HICPM algorithm is on average 0.257 s more than that of the join-less algorithm, which is the cost of the former running extra steps.

  • Effect of number of features on scalability

Figure 8c shows that the curve of the HICPM algorithm runs above that of the join-less algorithm, and both curves rise steeply at f = 6, 8 and 9, as the number of instances belonging to the features added at f = 6, 8 and 9 is much larger than that of the features added at f = 3, 4, 5 and 7. It is also noticed that the gap between the curves changes considerably at f = 8; repeated experiments suggest that it may be a small deviation caused by the computer system.

5 Conclusion

The main work of this paper is to mine high influence co-location patterns from spatial instances with attributes. Based on the number of neighbors and the similarity between instances, the InI of features in co-location patterns is calculated for mining high influence co-location patterns. This paper proves that the InI satisfies the downward closure property and proposes the HICPM algorithm with a pruning strategy. With extensive experiments and comparisons on real and synthetic data sets, we analyzed the effectiveness, performance and scalability of the HICPM and join-less algorithms and found that HICPM can discover high influence co-location patterns at a time cost slightly higher than join-less. A high influence co-location pattern can be reckoned as a concise compression of traditional co-location patterns in the aspect of influence. This paper establishes a basic framework for mining high influence co-location patterns; many details remain to be explored further, e.g. a flexible influential distance, principal component analysis for high-dimensional data [36], weights of attributes, and the influence of extended spatial objects.