
1 Introduction

Co-location pattern mining in spatial datasets is a knowledge discovery problem that aims at finding sets of spatial features whose objects are frequently located in close geographic proximity [1]. It is useful in business [2], environmental science [3], biology [4] and many other fields. However, an important limitation of co-location pattern mining is that all features are considered equally important, which causes some important but non-prevalent patterns to be missed [5]. To address this issue, Yang et al. first proposed the problem of high utility co-location pattern mining (HUCM) in spatial datasets with feature-specific utilities [6].

In contrast to co-location pattern mining, HUCM considers the case where each feature has a utility. It can therefore be used to discover sets of features with high utility, i.e., high utility co-location patterns (HUCPs). However, the utility of a co-location pattern may be lower than, equal to, or higher than the utility of its subsets. Hence the pruning strategies based on the anti-monotonicity of prevalence in co-location pattern mining are not applicable to HUCM.

Several algorithms have been proposed for HUCM. Yang et al. [6] proposed the EPA algorithm, Wang et al. [7] proposed a base algorithm together with three pruning strategies, and a min/max feature utility ratio algorithm was designed in [8]. Despite these research efforts, HUCM is still very expensive in terms of computation and memory, because all of these algorithms mine HUCPs by generating and storing the row instances of candidate patterns.

In this paper, we propose an efficient algorithm, EHUCM (Efficient High Utility Co-location pattern Mining), which differs from past HUCM algorithms in that it does not generate all row instances of a candidate pattern c to compute the pattern utility; instead, it depends only on the participating objects of each feature in c. Moreover, to reduce the cost of dataset scanning, we organize the spatial relationships in a feature-object neighbor tree data structure that can be used to find potential candidate patterns.

2 Problem Definition

In a spatial dataset S, consider a set of spatial features F and a set of spatial objects O, where each object o is represented as a tuple <feature type, object id, location, utility>. If the distance between two objects \(o,o^{\prime} \in O\) is not greater than a given distance threshold d, the two objects satisfy the neighbor relationship R. A co-location pattern c is a subset of the spatial feature set F; the size of c is the number of features in c. A row instance RI of c is a subset of objects that includes one object of each feature in c such that any two objects in RI satisfy the neighbor relationship, i.e., the objects in RI form a clique. For a feature f in c, we say that an object o of f participates in c if at least one row instance of c involves o. The set of participating objects of f in c is denoted as Obj(f, c).

Each feature \(f_{i} \in F\) is associated with a positive number \(v(f_{i})\), called the external utility of the feature, which represents its importance. Correspondingly, the internal utility of \(f_{i}\) in c is the number of participating objects of \(f_{i}\) in c, denoted as \(q(f_{i}, c) = \left| {Obj(f_{i}, c)} \right|\). Given a size k co-location pattern \(c = \{ f_{1}, f_{2}, \ldots, f_{k} \}\), the feature utility of a feature \(f_{i}\) in c is defined as \(v(f_{i}, c) = v(f_{i}) \times q(f_{i}, c)\). The pattern utility of c is the sum of the utilities of the features in c, defined as \(u(c) = \sum\nolimits_{{f_{i} \in c}} {v(f_{i}, c)}\). The pattern utility ratio of c is defined as \(\lambda(c) = u(c)/U(S)\), where U(S) is the total utility of S. c is a HUCP if and only if \(\lambda(c) \ge\) minutil, where minutil is a user-specified minimum pattern utility ratio threshold.
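The utility definitions above can be sketched in a few lines of code. This is a minimal illustration, assuming the participating-object sets Obj(f, c) are already known; the feature names, utilities and object ids are hypothetical examples, not taken from the paper's datasets.

```python
# Sketch of u(c) and lambda(c) computed from participating objects
# Obj(f, c); all values below are made-up examples.

def pattern_utility(v, obj, c):
    """u(c) = sum over f in c of v(f) * q(f, c), with q(f, c) = |Obj(f, c)|."""
    return sum(v[f] * len(obj[f]) for f in c)

def utility_ratio(v, obj, c, total_utility):
    """lambda(c) = u(c) / U(S)."""
    return pattern_utility(v, obj, c) / total_utility

# Example: v(A)=3, v(B)=2; A has 2 participating objects in c, B has 3.
v = {"A": 3, "B": 2}
obj = {"A": {"A1", "A2"}, "B": {"B1", "B2", "B3"}}
u_c = pattern_utility(v, obj, ["A", "B"])        # 3*2 + 2*3 = 12
ratio = utility_ratio(v, obj, ["A", "B"], 40.0)  # 12 / 40 = 0.3
```

With minutil = 0.25, this example pattern would qualify as a HUCP since its utility ratio is 0.3.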

Problem Definition.

Given a spatial dataset S with feature-specific utilities, a distance threshold d and a minimum utility ratio threshold minutil, the problem of high utility co-location pattern mining is to find all high utility co-location patterns in S.

3 The EHUCM Algorithm

As stated above, the aim of this paper is to improve the efficiency of HUCM. In this section, we present the EHUCM algorithm.

3.1 The Search Space

Since the utility measure does not satisfy the downward closure property, all patterns in the spatial dataset need to be searched. Based on the idea that a combination of features with a greater total utility of objects in the spatial dataset is more likely to be a HUCP, we explore the search space in descending order of the total utility of the objects of each feature. Let \(\succ\) denote this descending order of total utility over the features in F.
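The order \(\succ\) amounts to a simple sort of the features by their total object utility. The sketch below assumes the total utilities tu(f) have already been accumulated from the dataset; the feature names and values are hypothetical.

```python
# Sketch of the search order >: features of F sorted by descending
# total utility tu(f) of their objects; tu values are hypothetical.

def order_features(tu):
    """Return the features in descending order of total utility."""
    return sorted(tu, key=lambda f: tu[f], reverse=True)

tu = {"A": 10, "B": 25, "C": 7}
order = order_features(tu)  # ['B', 'A', 'C']
```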

Definition 1 (Extensible Feature Set of a Pattern).

Given a co-location pattern c. Let E(c) denote the set of features that can be used to extend c according to the depth-first search, so

$$ E(c) = \{\, y \mid y \in F \wedge y \succ x,\ \forall x \in c \,\} $$
(1)

Definition 2 (Extension of a Pattern).

Given a co-location pattern c. A pattern \(c^{\prime}\) is a single-feature extension of c if \(c^{\prime} = c \cup \{ z \}\) for some \(z \in E(c)\). More generally, if \(c^{\prime} = c \cup Z\) where \(Z \in 2^{E(c)}\) and \(Z \ne \emptyset\), \(c^{\prime}\) is an extension of c.
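Definitions 1 and 2 can be sketched as follows, under the assumption that \(\succ\) is realized as a fixed processing sequence of features and E(c) contains the features that follow every feature of c in that sequence, as the depth-first search requires. Feature names are hypothetical.

```python
# Sketch of E(c) and single-feature extensions over a fixed feature order.

def extensible_features(order, c):
    """E(c): features that follow every feature of c in the search order."""
    if not c:
        return list(order)
    last = max(order.index(f) for f in c)
    return order[last + 1:]

def single_feature_extensions(order, c):
    """All patterns c' = c + {z} with z in E(c)."""
    return [c + [z] for z in extensible_features(order, c)]

order = ["B", "A", "C", "D"]
exts = single_feature_extensions(order, ["B", "A"])
# exts == [['B', 'A', 'C'], ['B', 'A', 'D']]
```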

3.2 Feature-Object Neighbor Tree (FONT)

The participating objects of each feature in a candidate pattern c are generated by scanning the dataset. To reduce the cost of dataset scanning, we adopt the idea of neighborhood materialization to organize spatial relationships in a feature-object neighbor tree (FONT) data structure. With the FONT we can easily find objects that have neighbor relationships with a given object.

Definition 3 (Object Neighbor Set).

Given an object \(o_{i} \in O\) with feature type \(o_{i} {\text{. feature}} = f_{i}\), the object neighbor set of \(o_{i}\) is defined as

$$ ONS(o_{i}) = \{\, o_{j} \mid o_{j} \in O \wedge R(o_{i}, o_{j}) \wedge o_{j}.\text{feature} \in F \setminus \{ f_{i} \} \,\} $$
(2)

The object neighbor set of \(o_{i}\) includes the objects that have neighbor relationships with \(o_{i}\) and whose feature type is different from \(f_{i}\).

Definition 4 (Feature-Object Neighbor Set).

Given an object \(o_{i}\) and its object neighbor set \(ONS(o_{i})\), the feature-object neighbor set of \(o_{i}\) on feature \(f_{j}\) is defined as

$$ FONS(o_{i}, f_{j}) = \{\, o_{j} \mid o_{j} \in ONS(o_{i}) \wedge o_{j}.\text{feature} = f_{j} \,\} $$
(3)

The feature-object neighbor set \(FONS(o_{i} ,f_{j} )\) is a subset of object neighbor set \(ONS(o_{i} )\), and it includes all objects of \(f_{j}\) in \(ONS(o_{i} )\).

Definition 5 (Feature-Object Neighbor Tree).

Given the set of spatial features \(F = \{ f_{1}, f_{2}, \ldots, f_{m} \}\) and all neighbor relationships among spatial objects, the feature-object neighbor tree (FONT for short) is constructed as follows. (1) The root of the FONT is marked as “null” and each feature is a child of the root. (2) The root of the sub-tree of feature \(f_{i}\) is \(f_{i}\), and the object neighbor sets of all objects of \(f_{i}\) constitute the branches of \(f_{i}\). Each branch records an object and its feature-object neighbor sets.
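The FONT can be sketched as nested dictionaries: root → feature → object → feature-object neighbor sets. This is an illustrative assumption about the concrete representation, not the paper's implementation; objects are (feature, id) tuples and the neighbor relation R is given as precomputed pairs of hypothetical data.

```python
from collections import defaultdict

# Sketch of a FONT as nested dicts; all objects and pairs are hypothetical.

def build_font(objects, neighbor_pairs):
    font = {f: {} for f, _ in objects}    # one sub-tree per feature
    neighbors = defaultdict(set)
    for a, b in neighbor_pairs:           # R is symmetric
        neighbors[a].add(b)
        neighbors[b].add(a)
    for o in objects:
        f_i = o[0]
        fons = defaultdict(set)           # FONS(o, f_j), keyed by f_j
        for nb in neighbors[o]:
            if nb[0] != f_i:              # ONS excludes same-feature objects
                fons[nb[0]].add(nb)
        font[f_i][o] = dict(fons)         # branch: object -> its FONS sets
    return font

objs = [("A", 1), ("A", 2), ("B", 1), ("C", 1)]
pairs = [(("A", 1), ("B", 1)), (("A", 1), ("C", 1)), (("A", 2), ("A", 1))]
font = build_font(objs, pairs)
```

For object (A, 1), the branch records FONS((A, 1), B) = {(B, 1)} and FONS((A, 1), C) = {(C, 1)}; the same-feature neighbor (A, 2) is excluded, as Definition 3 requires.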

3.3 Two Pruning Strategies

The search space of HUCPs contains \(2^{|F|} - |F| - 1\) candidates, so the number of candidates grows exponentially with the number of features. Therefore, in order to mine HUCPs efficiently, two pruning strategies are proposed in this subsection, based on the pattern utility loss ratio and the extended pattern utility ratio.

Definition 6 (Utility Loss Ratio).

Given a co-location pattern c, its utility loss ratio lu(c) is defined as

$$ lu(c) = \frac{\sum\limits_{f_{i} \in c} tu(f_{i}) - u(c)}{U(S)} $$
(4)

where \(tu\left( {f_{i} } \right)\) is the total utility of all objects of feature \(f_{i}\) in a spatial dataset S.

Lemma 1.

If lu(c) \(>\) 1 \(-\) minutil, then no extended pattern of c can be a high utility pattern.

Definition 7 (Extensible Objects of Extensible Feature of a Pattern).

Given a co-location pattern c. The set of extensible objects of a feature \(f^{\prime} \in E(c)\) is defined as

$$ nei(f^{\prime}) = \{\, o^{\prime} \mid o^{\prime}.\text{feature} = f^{\prime} \wedge \forall f \in c:\ FONS(o^{\prime}, f) \ne \emptyset \,\} $$
(5)

Definition 8 (Extensible Utility Ratio Upper-bound).

Given a co-location pattern c. The extensible utility ratio upper-bound of c in a spatial dataset S is defined as

$$ ub(c) = \frac{\sum\limits_{f_{i} \in E(c)} v(f_{i}) \left| nei(f_{i}) \right|}{U(S)} $$
(6)

Definition 9 (Extended Pattern Utility Ratio).

Given a co-location pattern c. The extended pattern utility ratio of c is defined as

$$ eu\left( c \right) = \lambda \left( c \right) + ub\left( c \right) $$
(7)

Lemma 2.

If eu(c) \(<\) minutil, then no extended pattern of c can be a high utility pattern.
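The two pruning tests of Lemmas 1 and 2 reduce to simple threshold checks once the quantities in Eqs. (4)–(7) are available. The sketch below assumes those quantities have been computed elsewhere; all numeric values are hypothetical examples, not taken from the paper.

```python
# Sketch of the pruning tests of Lemmas 1 and 2; values are hypothetical.

def prune_by_utility_loss(tu_sum, u_c, total_utility, minutil):
    """Lemma 1: if lu(c) > 1 - minutil, prune all extensions of c."""
    lu = (tu_sum - u_c) / total_utility
    return lu > 1 - minutil

def prune_by_extended_utility(lam_c, ub_c, minutil):
    """Lemma 2: if eu(c) = lambda(c) + ub(c) < minutil, prune all extensions of c."""
    return lam_c + ub_c < minutil

# U(S)=100, sum of tu(f_i)=60, u(c)=20 -> lu(c)=0.4 <= 1-0.5: keep c.
keep = prune_by_utility_loss(60, 20, 100, 0.5)       # False (no pruning)
# lambda(c)=0.1, ub(c)=0.2 -> eu(c)=0.3 < 0.5: prune.
prune = prune_by_extended_utility(0.1, 0.2, 0.5)     # True (prune)
```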

3.4 The EHUCM Algorithm

Algorithm 1 scans the dataset once to generate the feature-object neighbor tree and reorders the set of features F in descending order of the total utility of each feature. Then it initializes the candidate pattern c to the empty set.


The Search procedure (Algorithm 2) takes as a parameter the index k of the current element of the ordered feature set F. The procedure executes a loop that considers each single-feature extension of c of the form \(c \cup \left\{ {f_{k} } \right\}\), where \(f_{k}\) is the k-th element of F.
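A high-level sketch of this depth-first search is given below, under simplifying assumptions: the utility ratio of a candidate and the combined pruning test of Lemmas 1 and 2 are supplied as callbacks (`ratio` and `can_prune`), standing in for the FONT-based computations described above. The toy run uses pattern size as a stand-in utility.

```python
# Sketch of the Search procedure (Algorithm 2); `ratio` and `can_prune`
# are placeholders for the FONT-based utility and pruning computations.

def search(order, k, pattern, ratio, can_prune, minutil, results):
    for i in range(k, len(order)):
        c = pattern + [order[i]]            # single-feature extension
        if ratio(c) >= minutil:             # c is a HUCP
            results.append(list(c))
        if not can_prune(c):                # Lemmas 1 and 2
            search(order, i + 1, c, ratio, can_prune, minutil, results)

# Toy run: "utility" is the pattern size, nothing is pruned, minutil = 2.
res = []
search(["B", "A", "C"], 0, [], lambda c: len(c), lambda c: False, 2, res)
# res collects the 4 patterns of size >= 2 reachable in this order.
```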

4 Experimental Evaluation

The experiments were conducted on a Windows 10 platform with an Intel Core i7-8700K CPU @3.70 GHz and 32 GB of RAM. We used two real spatial datasets, namely plants dataset of Three Parallel Rivers of Yunnan Protected Area and Beijing POI dataset. We compared the performance of the EHUCM algorithm with the EPA [6].

Influence of the Distance Threshold d.

This experiment compares the running times of the two algorithms for mining HUCPs when the distance threshold is varied and the minimum utility ratio threshold parameter is fixed. Figure 1(a) and (b) show the results. It can be observed that EHUCM is always much faster than the EPA algorithm. For example, on Three Parallel Rivers with minutil \(=\) 0.18 and d \(=\) 10900, EHUCM is about 165 times faster than EPA.

Influence of the pattern utility ratio threshold minutil.

The performance of algorithms was evaluated by fixing the value of the distance threshold in each dataset and varying the value of the minimum utility ratio threshold. The results are shown in Fig. 1(c) and (d). The running time of both algorithms decreases as minutil increases, and EHUCM always runs in less time than EPA. For example, on Three Parallel Rivers with d = 10500 and minutil \(=\) 0.16, EHUCM is about 144 times faster than EPA.

The results of the EPA algorithm and the EHUCM algorithm in the above experiments are consistent.

Fig. 1. Influence of d and minutil on different datasets.

5 Conclusions

Existing high utility co-location pattern mining algorithms are still very expensive in terms of both runtime and memory consumption. This paper therefore proposes an efficient algorithm, EHUCM, for high utility co-location pattern mining. The algorithm differs from past algorithms in that it does not generate all row instances of candidate patterns to compute pattern utility; instead, EHUCM depends only on the participating objects of each feature in a candidate pattern. Because the utility measure fails to satisfy the downward closure property, we also propose two effective pruning strategies to prune the search space more efficiently. Extensive experimental results show that the EHUCM algorithm is efficient.