
1 Introduction

Co-location pattern mining in spatial datasets is a knowledge discovery problem that aims at finding sets of spatial features whose objects are frequently located in close geographic proximity [1]. It is useful in business [2], environmental science [3], biology [4] and many other fields. However, an important limitation of co-location pattern mining is that all features are considered equally important, which causes some important but non-prevalent patterns to be missed [5]. To address this issue, Yang et al. first proposed the problem of high utility co-location pattern mining (HUCM) in spatial datasets with feature-specific utilities [6].

In contrast to co-location pattern mining, HUCM considers the case where each feature has a utility. It can therefore be used to discover sets of features with high utility, i.e., high utility co-location patterns (HUCPs). However, the utility of a co-location pattern may be lower than, equal to, or higher than the utility of its subsets. Hence the pruning strategies based on the anti-monotonicity of prevalence in co-location pattern mining are not applicable to HUCM.

Several algorithms have been proposed for HUCM. Yang et al. [6] proposed the EPA algorithm, Wang et al. [7] proposed a base algorithm together with three pruning strategies, and a min/max feature utility ratio algorithm was designed in [8]. Despite these research efforts, HUCM is still very expensive in terms of computation and memory, because all of these algorithms mine HUCPs by generating and storing the row instances of candidate patterns.

In this paper, we propose an efficient algorithm, EHUCM (Efficient High Utility Co-location pattern Mining), which differs from past HUCM algorithms in that it does not generate all row instances of a candidate pattern c to compute the pattern utility; instead, it depends only on the participating objects of each feature in c. Moreover, to reduce the cost of dataset scanning, we organize the spatial relationships in a feature-object neighbor tree data structure that can be used to find potential candidate patterns.

2 Problem Definition

In a spatial dataset S, consider a set of spatial features F and a set of spatial objects O, where each object o is represented as a tuple <feature type, object id, location, utility>. If the distance between two objects \(o,o^{\prime} \in O\) is not greater than a given distance threshold d, the two objects satisfy the neighbor relationship R. A co-location pattern c is a subset of the spatial feature set F; the size of c is the number of features in c. A row instance RI of c is a subset of objects that includes one object of each feature in c such that any two objects in RI satisfy the neighbor relationship, i.e., the objects in RI form a clique. For a feature f in c, we say that an object o of f participates in c if at least one row instance of c involves o. The set of participating objects of f in c is denoted as Obj(f, c).

Each feature \(f_{i} \in F\) is associated with a positive number \(v(f_{i})\), called the external utility of the feature, which represents its importance. Correspondingly, the internal utility of \(f_{i}\) in c is the number of participating objects of \(f_{i}\) in c, denoted as \(q(f_{i}, c) = \left| {Obj(f_{i}, c)} \right|\). Given a size k co-location pattern \(c = \{ f_{1}, f_{2}, \ldots, f_{k} \}\), the feature utility of a feature \(f_{i}\) in c is defined as \(v(f_{i}, c) = v(f_{i}) \times q(f_{i}, c)\). The pattern utility of c is the sum of the utilities of the features in c, defined as \(u(c) = \sum\nolimits_{{f_{i} \in c}} {v(f_{i}, c)}\). The pattern utility ratio of c is defined as \(\lambda(c) = u(c)/U(S)\), where U(S) is the total utility of S. c is a HUCP if and only if \(\lambda(c) \ge\) minutil, where minutil is a user-specified minimum pattern utility ratio threshold.
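The utility definitions above can be sketched in a few lines of code. This is a minimal illustration, assuming the participating-object sets Obj(f, c) are already known; the feature names, utilities and object ids are hypothetical examples, not taken from the paper's datasets.

```python
# Sketch of u(c) and lambda(c) computed from participating objects
# Obj(f, c); all values below are made-up examples.

def pattern_utility(v, obj, c):
    """u(c) = sum over f in c of v(f) * q(f, c), with q(f, c) = |Obj(f, c)|."""
    return sum(v[f] * len(obj[f]) for f in c)

def utility_ratio(v, obj, c, total_utility):
    """lambda(c) = u(c) / U(S)."""
    return pattern_utility(v, obj, c) / total_utility

# Example: v(A)=3, v(B)=2; A has 2 participating objects in c, B has 3.
v = {"A": 3, "B": 2}
obj = {"A": {"A1", "A2"}, "B": {"B1", "B2", "B3"}}
u_c = pattern_utility(v, obj, ["A", "B"])        # 3*2 + 2*3 = 12
ratio = utility_ratio(v, obj, ["A", "B"], 40.0)  # 12 / 40 = 0.3
```

With minutil = 0.25, this example pattern would qualify as a HUCP since its utility ratio is 0.3.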

Problem Definition.

Given a spatial dataset S with feature-specific utilities, a distance threshold d and a minimum utility ratio threshold minutil, the problem of high utility co-location pattern mining is to find all high utility co-location patterns in S.

3 The EHUCM Algorithm

As stated above, the aim of this paper is to improve the efficiency of HUCM. In this section, we present the EHUCM algorithm.

3.1 The Search Space

Since the utility measure does not satisfy the downward closure property, all patterns in the spatial dataset need to be searched. Based on the idea that a combination of features with a greater total utility of objects in the spatial dataset is more likely to be a HUCP, we explore the search space in descending order of the total utility of the objects of each feature. Let \(\succ\) denote this descending order of total utility over the features in F.
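The order \(\succ\) amounts to a simple sort of the features by their total object utility. The sketch below assumes the total utilities tu(f) have already been accumulated from the dataset; the feature names and values are hypothetical.

```python
# Sketch of the search order >: features of F sorted by descending
# total utility tu(f) of their objects; tu values are hypothetical.

def order_features(tu):
    """Return the features in descending order of total utility."""
    return sorted(tu, key=lambda f: tu[f], reverse=True)

tu = {"A": 10, "B": 25, "C": 7}
order = order_features(tu)  # ['B', 'A', 'C']
```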

Definition 1 (Extensible Feature Set of a Pattern).

Given a co-location pattern c. Let E(c) denote the set of features that can be used to extend c according to the depth-first search, so

$$ E(c) = \{\, y \mid y \in F \wedge y \succ x,\ \forall x \in c \,\} $$
(1)

Definition 2 (Extension of a Pattern).

Given a co-location pattern c. A pattern \(c^{\prime}\) is a single-feature extension of c if \(c^{\prime} = c \cup \{ z \}\) for some \(z \in E(c)\). More generally, if \(c^{\prime} = c \cup Z\) where \(Z \in 2^{E(c)}\) and \(Z \ne \emptyset\), \(c^{\prime}\) is an extension of c.
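Definitions 1 and 2 can be sketched as follows, under the assumption that \(\succ\) is realized as a fixed processing sequence of features and E(c) contains the features that follow every feature of c in that sequence, as the depth-first search requires. Feature names are hypothetical.

```python
# Sketch of E(c) and single-feature extensions over a fixed feature order.

def extensible_features(order, c):
    """E(c): features that follow every feature of c in the search order."""
    if not c:
        return list(order)
    last = max(order.index(f) for f in c)
    return order[last + 1:]

def single_feature_extensions(order, c):
    """All patterns c' = c + {z} with z in E(c)."""
    return [c + [z] for z in extensible_features(order, c)]

order = ["B", "A", "C", "D"]
exts = single_feature_extensions(order, ["B", "A"])
# exts == [['B', 'A', 'C'], ['B', 'A', 'D']]
```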

3.2 Feature-Object Neighbor Tree (FONT)

The participating objects of each feature in a candidate pattern c are generated by scanning the dataset. To reduce the cost of dataset scanning, we adopt the idea of neighborhood materialization to organize spatial relationships in a feature-object neighbor tree (FONT) data structure. With the FONT we can easily find objects that have neighbor relationships with a given object.

Definition 3 (Object Neighbor Set).

Given an object \(o_{i} \in O\) with feature type \(o_{i} {\text{. feature}} = f_{i}\), the object neighbor set of \(o_{i}\) is defined as

$$ ONS(o_{i}) = \{\, o_{j} \mid o_{j} \in O \wedge R(o_{i}, o_{j}) \wedge o_{j}.\text{feature} \in F \setminus \{ f_{i} \} \,\} $$
(2)

The object neighbor set of \(o_{i}\) includes the objects that have neighbor relationships with \(o_{i}\) and whose feature type is different from \(f_{i}\).

Definition 4 (Feature-Object Neighbor Set).

Given an object \(o_{i}\) and its object neighbor set \(ONS(o_{i})\), the feature-object neighbor set of \(o_{i}\) on feature \(f_{j}\) is defined as

$$ FONS(o_{i}, f_{j}) = \{\, o_{j} \mid o_{j} \in ONS(o_{i}) \wedge o_{j}.\text{feature} = f_{j} \,\} $$
(3)

The feature-object neighbor set \(FONS(o_{i} ,f_{j} )\) is a subset of object neighbor set \(ONS(o_{i} )\), and it includes all objects of \(f_{j}\) in \(ONS(o_{i} )\).

Definition 5 (Feature-Object Neighbor Tree).

Given the set of spatial features \(F = \{ f_{1}, f_{2}, \ldots, f_{m} \}\) and all neighbor relationships among spatial objects, the feature-object neighbor tree (FONT for short) is constructed as follows. (1) The root of the FONT is marked as “null” and each feature is a child of the root. (2) The root of the sub-tree of feature \(f_{i}\) is \(f_{i}\), and the object neighbor sets of all objects of \(f_{i}\) constitute the branches of \(f_{i}\). Each branch records an object and its feature-object neighbor sets.
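The FONT can be sketched as nested dictionaries: root → feature → object → feature-object neighbor sets. This is an illustrative assumption about the concrete representation, not the paper's implementation; objects are (feature, id) tuples and the neighbor relation R is given as precomputed pairs of hypothetical data.

```python
from collections import defaultdict

# Sketch of a FONT as nested dicts; all objects and pairs are hypothetical.

def build_font(objects, neighbor_pairs):
    font = {f: {} for f, _ in objects}    # one sub-tree per feature
    neighbors = defaultdict(set)
    for a, b in neighbor_pairs:           # R is symmetric
        neighbors[a].add(b)
        neighbors[b].add(a)
    for o in objects:
        f_i = o[0]
        fons = defaultdict(set)           # FONS(o, f_j), keyed by f_j
        for nb in neighbors[o]:
            if nb[0] != f_i:              # ONS excludes same-feature objects
                fons[nb[0]].add(nb)
        font[f_i][o] = dict(fons)         # branch: object -> its FONS sets
    return font

objs = [("A", 1), ("A", 2), ("B", 1), ("C", 1)]
pairs = [(("A", 1), ("B", 1)), (("A", 1), ("C", 1)), (("A", 2), ("A", 1))]
font = build_font(objs, pairs)
```

For object (A, 1), the branch records FONS((A, 1), B) = {(B, 1)} and FONS((A, 1), C) = {(C, 1)}; the same-feature neighbor (A, 2) is excluded, as Definition 3 requires.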

3.3 Two Pruning Strategies

The search space of HUCPs contains \(2^{|F|} - |F| - 1\) candidates, so the number of candidates grows exponentially with the number of features. Therefore, in order to mine HUCPs efficiently, two pruning strategies are proposed in this subsection, based on the pattern utility loss ratio and the extended pattern utility ratio.

Definition 6 (Utility Loss Ratio).

Given a co-location pattern c, its utility loss ratio lu(c) is defined as

$$ lu(c) = \frac{\sum\limits_{f_{i} \in c} tu(f_{i}) - u(c)}{U(S)} $$
(4)

where \(tu\left( {f_{i} } \right)\) is the total utility of all objects of feature \(f_{i}\) in a spatial dataset S.

Lemma 1.

If lu(c) \(>\) 1 \(-\) minutil, then no extended pattern of c can be a high utility pattern.

Definition 7 (Extensible Objects of Extensible Feature of a Pattern).

Given a co-location pattern c. The set of extensible objects of a feature \(f^{\prime} \in E(c)\) is defined as

$$ nei(f^{\prime}) = \{\, o^{\prime} \mid o^{\prime}.\text{feature} = f^{\prime} \wedge \forall f \in c:\ FONS(o^{\prime}, f) \ne \emptyset \,\} $$
(5)

Definition 8 (Extensible Utility Ratio Upper-bound).

Given a co-location pattern c. The extensible utility ratio upper-bound of c in a spatial dataset S is defined as

$$ ub(c) = \frac{\sum\limits_{f_{i} \in E(c)} v(f_{i}) \left| nei(f_{i}) \right|}{U(S)} $$
(6)

Definition 9 (Extended Pattern Utility Ratio).

Given a co-location pattern c. The extended pattern utility ratio of c is defined as

$$ eu\left( c \right) = \lambda \left( c \right) + ub\left( c \right) $$
(7)

Lemma 2.

If eu(c) \(<\) minutil, then no extended pattern of c can be a high utility pattern.
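The two pruning tests of Lemmas 1 and 2 reduce to simple threshold checks once the quantities in Eqs. (4)–(7) are available. The sketch below assumes those quantities have been computed elsewhere; all numeric values are hypothetical examples, not taken from the paper.

```python
# Sketch of the pruning tests of Lemmas 1 and 2; values are hypothetical.

def prune_by_utility_loss(tu_sum, u_c, total_utility, minutil):
    """Lemma 1: if lu(c) > 1 - minutil, prune all extensions of c."""
    lu = (tu_sum - u_c) / total_utility
    return lu > 1 - minutil

def prune_by_extended_utility(lam_c, ub_c, minutil):
    """Lemma 2: if eu(c) = lambda(c) + ub(c) < minutil, prune all extensions of c."""
    return lam_c + ub_c < minutil

# U(S)=100, sum of tu(f_i)=60, u(c)=20 -> lu(c)=0.4 <= 1-0.5: keep c.
keep = prune_by_utility_loss(60, 20, 100, 0.5)       # False (no pruning)
# lambda(c)=0.1, ub(c)=0.2 -> eu(c)=0.3 < 0.5: prune.
prune = prune_by_extended_utility(0.1, 0.2, 0.5)     # True (prune)
```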

3.4 The EHUCM Algorithm

Algorithm 1 scans the dataset once to generate the feature-object neighbor tree and reorders the set of features F in descending order of the total utility of each feature. Then it initializes the candidate pattern c to the empty set.


The Search procedure (Algorithm 2) takes as a parameter the index k of the current element of the ordered feature set F. The procedure executes a loop that considers each single-feature extension of c of the form \(c \cup \left\{ {f_{k} } \right\}\), where \(f_{k}\) is the k-th element of F.
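A high-level sketch of this depth-first search is given below, under simplifying assumptions: the utility ratio of a candidate and the combined pruning test of Lemmas 1 and 2 are supplied as callbacks (`ratio` and `can_prune`), standing in for the FONT-based computations described above. The toy run uses pattern size as a stand-in utility.

```python
# Sketch of the Search procedure (Algorithm 2); `ratio` and `can_prune`
# are placeholders for the FONT-based utility and pruning computations.

def search(order, k, pattern, ratio, can_prune, minutil, results):
    for i in range(k, len(order)):
        c = pattern + [order[i]]            # single-feature extension
        if ratio(c) >= minutil:             # c is a HUCP
            results.append(list(c))
        if not can_prune(c):                # Lemmas 1 and 2
            search(order, i + 1, c, ratio, can_prune, minutil, results)

# Toy run: "utility" is the pattern size, nothing is pruned, minutil = 2.
res = []
search(["B", "A", "C"], 0, [], lambda c: len(c), lambda c: False, 2, res)
# res collects the 4 patterns of size >= 2 reachable in this order.
```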

4 Experimental Evaluation

The experiments were conducted on a Windows 10 platform with an Intel Core i7-8700K CPU @3.70 GHz and 32 GB of RAM. We used two real spatial datasets, namely plants dataset of Three Parallel Rivers of Yunnan Protected Area and Beijing POI dataset. We compared the performance of the EHUCM algorithm with the EPA [6].

Influence of the Distance Threshold d.

This experiment compares the running times of the two algorithms for mining HUCPs when the distance threshold is varied and the minimum utility ratio threshold parameter is fixed. Figure 1(a) and (b) show the results. It can be observed that EHUCM is always much faster than the EPA algorithm. For example, on Three Parallel Rivers with minutil \(=\) 0.18 and d \(=\) 10900, EHUCM is about 165 times faster than EPA.

Influence of the pattern utility ratio threshold minutil.

The performance of algorithms was evaluated by fixing the value of the distance threshold in each dataset and varying the value of the minimum utility ratio threshold. The results are shown in Fig. 1(c) and (d). The running time of both algorithms decreases as minutil increases, and EHUCM always runs in less time than EPA. For example, on Three Parallel Rivers with d = 10500 and minutil \(=\) 0.16, EHUCM is about 144 times faster than EPA.

The results of the EPA algorithm and the EHUCM algorithm in the above experiments are consistent.

Fig. 1. Influence of d and minutil on different datasets.

5 Conclusions

Existing high utility co-location pattern mining algorithms are still very expensive in terms of both runtime and memory consumption. This paper therefore proposes an efficient algorithm, EHUCM, for high utility co-location pattern mining. The algorithm differs from past algorithms in that it does not generate all row instances of candidate patterns to compute pattern utility; instead, EHUCM depends only on the participating objects of each feature in a candidate pattern. Because the utility measure fails to satisfy the downward closure property, we also propose two effective pruning strategies to prune the search space more efficiently. Extensive experimental results show that the EHUCM algorithm is efficient.