IG-Tree: an efficient spatial keyword index for planning best path queries on road networks

Haryanto, Anasthasia Agnes; Islam, Md. Saiful; Taniar, David; Cheema, Muhammad Aamir

doi:10.1007/s11280-018-0643-5


Published: 15 November 2018

World Wide Web, volume 22, pages 1359–1399 (2019)

Anasthasia Agnes Haryanto¹,
Md. Saiful Islam ORCID: orcid.org/0000-0001-7181-5328²,
David Taniar¹ &
…
Muhammad Aamir Cheema¹


Abstract

Due to the popularity of spatial databases, many search engine providers have started to expand their text searching capability to include geographical information. As a result, many new queries on spatial objects associated with textual information, known as Spatial Keyword Queries, have attracted significant research interest in recent years. Unfortunately, most existing work on Spatial Keyword Queries focuses only on object retrieval. There is barely any work on route planning queries, even though route planning is often needed in daily life. In this research, we propose the Best Path Query, which finds the optimum route between two spatial locations that visits or avoids the objects specified by the textual data given by the user. We show that the Best Path Query is an NP-hard problem. We propose an efficient indexing technique, namely IG-Tree, and three algorithms with different trade-offs to process Best Path Queries on road networks. Our extensive experimental study demonstrates the efficiency and accuracy of our proposed approach.


1 Introduction

Spatial databases have attracted considerable interest. Many applications, such as GPS navigation, use spatial data to support everyday needs. Spatial data applications are accessible not only on desktop computers but are also very popular in mobile environments, as mobile devices provide significant services in our everyday life [1, 24, 29, 38, 39, 44, 50]. According to StatisticBrain, about 81% of mobile users use their device for maps and directions [53]. With the popularity of spatial databases, search engine providers, such as Google and Yahoo, have broadened their text searching capability to provide geographical information [6, 18]. The GlobalWebIndex showed that Google Maps is the most used app, as it is used by 54% of smartphone users [54]. In the past, conventional search engines could only process simple user queries, such as finding a Point of Interest (POI) based on a certain keyword [25]. However, the high demand for spatial data compels search engines to process more than such simple queries. Many new queries on spatial objects associated with textual information, known as Spatial Keyword Queries [45], have been studied in recent years, and a number of studies have sought to improve geographical search engines. Yet there are still many challenges in combining and processing both textual information and spatial data.

Nowadays, each spatial object carries one or more meaningful keywords that describe the entity it represents [15, 18]. The keywords may contain a country name, city name, address, references to a landmark, or even a type of road [6]. For example, in Figure 1, the points denote spatial objects and each object is associated with one or more keywords. The availability of such spatial keyword information has led to a variety of Spatial Keyword Queries. Some of the commonly used queries are the Top-k kNN Query, Boolean kNN Query, and Boolean Range Query [7]. All of these queries require the user to give a spatial location (normally the current user location) and textual data in the form of keywords as the input, while the output is the spatial objects that are nearest to the user’s location and contain the given keywords. The principal parameter used to identify the nearest objects is the shortest path distance, i.e., the minimum distance between two location points. Although the existing shortest path-based solutions are useful, they are not always sufficient for our needs. In real life, we often want to plan a trip with the lowest cost (e.g., time or distance). We may need to stop by several locations before arriving at the designated destination, and there are also times when we would like to avoid some spatial objects that can interfere with our activities. Planning a trip is considerably more complex than a simple source-to-destination query. Unfortunately, the existing studies on trip planning queries in the spatio-textual area are not flexible enough to answer this kind of query. Furthermore, researchers often assume that users give keywords only to find POIs. In reality, this is not always the case. Some query keywords may have negative connotations, such as traffic jams, which means that not all user-given keywords can be considered as POIs.

In this paper, we propose a new variant of the spatial keyword query. A user at a given location wants to travel to a destination while stopping by or avoiding several locations denoted by certain keywords. For instance, a user wants to go from his house to his workplace, but before arriving at the workplace he wants to stop by a gas station to refuel his car and a bakery to get some breakfast, and he also wishes to avoid any highway along his trip (see Figure 1). In this query, the user specifies the source and destination locations and several keywords: gas station, bakery, and avoid highway. We therefore need to find the optimum path that satisfies the user’s keyword conditions. Using the illustration in Figure 1, assume that s_l is the user’s house (source location) and d_l is his workplace (destination location). There are a number of spatial objects that contain the keyword gas station or bakery, such as s₂, s₃, s₅, s₁₀, and s₁₁, hence there are many possible paths from s_l to d_l that pass through at least one object for each keyword. Looking at the road network, the path with the shortest distance goes from s_l to s₂ to stop by a gas station, then to s₃ to stop by a bakery, and then to d_l. However, this path passes through a highway. Since highway is one of the keywords the user wants to avoid, the path does not satisfy all the criteria given by the user. Therefore, the best path that offers the least total distance and meets the criteria goes from s_l to s₁₁ for the bakery, then to s₁₀ for the gas station, and finally to the destination d_l (this path also avoids any highway). We call this kind of query the Best Path Query (BP).

In the Best Path Query, we deal with both spatial and textual data. Hence the user-given query has two main parts: the spatial part, which consists of the source and destination locations given by the user, and the textual part, which consists of the keywords that the user would like to pass or avoid on the way to the destination. Based on the user input, each query keyword can be classified as positive or negative. In the previous example, the user gave a negative keyword, which is to avoid highway. The positive keywords are the POIs that he wants to pass by, namely the gas station and the bakery. The formal definition of the Best Path Query is given as follows:

Definition 1

(Best Path Query) Given a source location s_l, a destination location d_l, and a set of keywords K = {k₁, k₂,..., k_n}, where each k_i for 1 ≤ i ≤ n can be positive (denoted by k⁺) or negative (denoted by k⁻), find the Best Path from s_l to d_l, denoted by BP(s_l, d_l, K), that passes through all k⁺ and avoids all k⁻ with optimum cost.

1.1 Challenges

The main challenge in this research is the insufficiency of current solutions to trip planning in the spatial keyword area, especially for the Best Path Query. Existing studies do not take negative keywords into consideration. Negative keywords increase the complexity of the problem, as we have to make sure that certain paths are avoided. There are also cases where no result can be retrieved, because every path between the source and destination locations has to be avoided.

Road networks are usually represented as graphs. The edges typically represent the road segments, while the vertices represent the road intersections. Path generation is therefore limited to the edges between adjacent vertices, which increases the complexity of query computation in this environment. The complexity is intensified further by the types of query keywords given by the user. When the query keywords are negative, many paths are blocked because we have to avoid them. We have to plan the route optimally despite these complexities.

Another challenge is that most route planning problems are regarded as generalizations of the Traveling Salesman Problem (TSP) [3, 4], and they are NP-hard. The solutions offered for these queries are polynomial-time approximation algorithms. Even so, these solutions generally involve no pre-processing, which leads to more computation at query time, particularly since both spatial and textual relevance must be processed. An on-the-fly solution therefore does not always guarantee efficient performance. There are also very few indexing techniques in the spatio-textual area, especially for road networks. In this research, we attempt to provide a solution to this problem by offering a novel index, based on G-Tree [51, 52] and IR²-Tree [15], that incorporates both keyword and spatial road network information.

1.2 Contributions

Our main contributions in this paper are as follows:

We formally define Best Path problem on road networks and prove that this is an NP-hard problem.
We develop a novel indexing scheme, called IG-Tree, for planning Best Path queries in road networks.
We present three approximate algorithms with different trade-offs for searching Best Paths on road networks.
We also demonstrate the effectiveness and efficiency of our algorithms through comprehensive experiments on real datasets.

1.3 Organisation

The rest of the paper is organised as follows. Section 2 presents the preliminaries and the query model for Best Path problem on road networks. Section 3 discusses the computational complexities of the Best Path problem on road networks. We introduce the IG-Tree in Section 4 and discuss our algorithms to solve the Best Path problem in Section 5. Section 6 presents the experimental evaluations of all the algorithms proposed. We discuss some related works in Section 7. Finally, Section 8 concludes the paper.

2 Preliminaries

This section presents the necessary background information, and the data and query model for Best Path problem on road networks.

2.1 Road network

We consider a road network as an undirected weighted graph G = (V, E), where V is a set of vertices and E is a set of edges. Each edge (u, v) ∈ E connects two adjacent vertices u, v ∈ V and is associated with a positive weight w(u, v) > 0 that represents distance or travel time.

A path P(v₁, v_n) = {v₁, v₂,..., v_n} is a sequence of vertices such that v_i is adjacent to v_{i+1}, i.e., (v_i, v_{i+1}) ∈ E, for 1 ≤ i < n. The cost of a path P, denoted by cost(P), is the sum of the weights of the edges of P. Given vertices u and v, we use δ(u, v) to denote the shortest path from u to v and dist(u, v) to denote the cost of δ(u, v). Figure 2 shows an example of a road network. Given a source vertex v₁ and a destination vertex v₄, δ(v₁, v₄) = {v₁, v₂, v₅, v₄} is the shortest path between v₁ and v₄ and dist(v₁, v₄) = 6.
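The definitions of δ(u, v) and dist(u, v) can be illustrated with a minimal sketch of Dijkstra's algorithm. The edge weights below are hypothetical (Figure 2 is not reproduced here); they are chosen only so that the result agrees with the example δ(v₁, v₄) = {v₁, v₂, v₅, v₄} and dist(v₁, v₄) = 6. The function name `shortest_path` is likewise an illustrative choice.

```python
import heapq

def shortest_path(graph, source, target):
    """Return (dist, path) of a cheapest path, or (inf, []) if none exists."""
    dist = {source: 0}
    prev = {}
    done = set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in done:
        return float("inf"), []
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical undirected edges (u, v, weight); not taken from Figure 2.
edges = [("v1", "v2", 2), ("v2", "v5", 1), ("v5", "v4", 3),
         ("v1", "v4", 9), ("v2", "v3", 4), ("v3", "v4", 5)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

cost, path = shortest_path(graph, "v1", "v4")
print(cost, path)
```

With an empty keyword set, this is all a Best Path query needs, which is the case treated by Lemma 1 in Section 3.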

2.2 Data model

A spatio-textual object o is an object with a spatial location from S = {s₁, s₂,..., s_m} that contains a set of keywords from T = {t₁, t₂,..., t_x}. We assume that spatio-textual objects are located at vertices in V. The weight of an edge (u, v) ∈ E is the travel time or road network distance between two spatially adjacent objects o₁ and o₂ representing u and v, respectively. We use vertex v and spatio-textual object o interchangeably in this paper.

2.3 Query model

Given a road network G, a user query consists of a source location s_l, a destination location d_l, and a preferred set of keywords K = {k₁, k₂,..., k_n}, where each k_i, 1 ≤ i ≤ n, can be positive (k⁺) or negative (k⁻). A positive keyword k⁺ describes objects that the user wants to visit, while a negative keyword k⁻ describes objects that the user wants to avoid. We assume K ⊆ T. Given all this information, we want to find the Best Path BP(s_l, d_l, K), i.e., a shortest path from s_l to d_l that passes through all k⁺ and avoids all k⁻ keyword-matching vertices in G.

Table 1 presents the list of mathematical notations used in this paper.

Table 1 List of notations used in the paper


3 Complexity analysis

The Best Path problem is different from the general Shortest Path problem. In the Best Path problem, we want to find a path that includes objects of interest on the way to the destination. The objects of interest are determined by the keywords. The Best Path is often longer than the Shortest Path, but there are also cases where the two coincide. When the user does not specify any keywords in the query, the Best Path is simply the Shortest Path, since only the source and destination locations are provided. The Shortest Path problem is solvable in polynomial time. Therefore, we can apply commonly used Shortest Path algorithms, such as Dijkstra’s algorithm, in this particular case.

Lemma 1

Given s_l, d_l, and K = {}, BP(s_l, d_l, K) = δ(s_l, d_l).

Proof

Base Case: If |P| = 1 and K = {}, then P = {s_l} and cost(P) = 0 = dist(s_l, s_l). Hence, δ(s_l, s_l) = BP(s_l, s_l, K).

Inductive hypothesis: Let u be the last vertex added to P, P^′ = P ∪{u}. In this case our Inductive Hypothesis is

$$\text{for each } y \in P^{\prime} \text{, } cost(P^{\prime}(s_{l},y))=cost(\delta(s_{l}, y)) $$

Inductive step: Suppose that there is a shortest path Q from s_l to u and

$$cost(Q)<cost(P^{\prime}(s_{l},u)) $$

Since Q is a shortest path, then cost(Q) = dist(s_l, u).

Assume that the shortest path Q begins in P^′ and then leaves P^′ before arriving at the destination u. Let (y, z) be the first edge in Q that leaves P^′, and let Q_y be a shortest path from s_l to y, so

$$cost(Q_{y})+w(y,z) \leq cost(Q) $$

Since according to the Inductive Hypothesis cost(P^′(s_l, y)) is also the cost of δ(s_l, y), then cost(P^′(s_l, y)) ≤ cost(Q_y). So it gives us

$$cost(P^{\prime}(s_{l}, y))+w(y,z)\leq cost(Q_{z}) $$

As y and z are adjacent vertices, then

$$cost(P^{\prime}(s_{l}, z))\leq cost(P^{\prime}(s_{l},y))+w(y,z) $$

Since u is part of Q, so

$$cost(P^{\prime}(s_{l},u))\leq cost(P^{\prime}(s_{l},z)) $$

Therefore shortest path Q does not exist, so cost(P^′(s_l, u)) = δ(s_l, u) = BP(s_l, u, K). □

However, the situation changes when the user specifies a keyword. Depending on whether the keyword is positive or negative, a path may or may not be retrievable. If the query contains only one positive keyword, it is incorrect to run a shortest path search from the source to the nearest vertex with a matching keyword and then another shortest path search from that vertex to the destination. As an example using the road network in Figure 1, assume that the source s_l is v₅, the destination d_l is v₁₀, and the preferred keyword given by the user is located at v₁ and v₃. If we choose the keyword-matching vertex nearest to v₅, we pick v₃, since dist(v₅, v₃) is 3 while dist(v₅, v₁) is 6. However, the total distance from v₅ to v₃ to v₁₀ is 18, whereas the total distance from v₅ to v₁ to v₁₀ is 9, which is much shorter. Hence, choosing the nearest vertex with a matching keyword can trap the search in a local minimum.
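The arithmetic of this example can be checked with a few lines of Python. The two first-leg distances (3 and 6) and the two totals (18 and 9) are the values quoted in the text; the second-leg distances to v₁₀ are not stated explicitly and are inferred here by subtraction, which is an assumption of this sketch.

```python
# First-leg distances from the source v5, as quoted in the text.
dist_from_source = {"v3": 3, "v1": 6}
# Second-leg distances to v10, inferred from the quoted totals (18 and 9).
dist_to_dest = {"v3": 18 - 3, "v1": 9 - 6}

def total(v):
    """Cost of the trip source -> v -> destination."""
    return dist_from_source[v] + dist_to_dest[v]

# Greedy choice: the matching vertex nearest to the source.
greedy = min(dist_from_source, key=dist_from_source.get)
# Correct choice for a single positive keyword: minimise the whole trip.
best = min(dist_from_source, key=total)

print(greedy, total(greedy))
print(best, total(best))
```

For a single positive keyword, evaluating the full trip for every candidate avoids the local minimum; with several positive keywords the number of candidate combinations grows quickly, which is what motivates the complexity discussion that follows.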

Meanwhile if the query keyword does not exist in T, then BP(s_l, d_l, K) is also δ(s_l, d_l).

Lemma 2

Given s_l, d_l, and K = {k₁} with k₁ ∉ T, BP(s_l, d_l, K) = δ(s_l, d_l).

Proof

If k₁ ∉ T, then no vertex matches k₁, so the query is equivalent to one with K = {}, which is already proven in Lemma 1. □

If the user query contains a negative keyword, then the vertices that carry this keyword need to be avoided (blocked). When we process the query to find the Best Path, these vertices can be pruned (disconnected) from the graph, as they are no longer considered as POIs. This may create a dead end in the graph, since a potential path can be cut off by the removal of a vertex. So when an edge of a vertex with a negative keyword is a bridge, we will not be able to retrieve any Best Path. Lemmas 3 and 4 prove the non-existence of the Best Path for this particular case.

Lemma 3

If there exists a bridge (u, v) in G, and G consists of subgraphs H and I that are connected by (u, v), then, given s_l ∈ H and d_l ∈ I, (u, v) ⊆ BP(s_l, d_l, K).

Proof

Assume that (u, v) is a bridge in G and BP(s_l, d_l, K) on G does not contain (u, v). Since every vertex in the path BP(s_l, d_l, K) has to be connected to the next, and BP(s_l, d_l, K) ⊆ G ∖{(u, v)}, where G ∖{(u, v)} is a disconnected graph with s_l and d_l in different components, the path must be disconnected, which is a contradiction. □

Definition 2

A critical path cp(v_i, v_j), i≠j, is a path that consists of one or more graph bridges between v_i and v_j.

Lemma 4

Given s_l, d_l, K = {k⁻}, and BP(s_l, d_l, K) = cp(s_l, d_l), then BP(s_l, d_l, K) does not exist.

Proof

Suppose that there exists a bridge (u, v) in BP(s_l, d_l, K) and k⁻∈ (u, v). Since we have to avoid negative keywords, then (u, v) has to be pruned from BP(s_l, d_l, K). Hence, the graph is now disconnected as proved in Lemma 3. □

Even though negative keywords can cause path blockage, it does not mean that we cannot retrieve any Best Path at all. The path blockage might cause us to re-route to another path even though it may cause a longer path.

Lemma 5

Given d_l, s_l, and K = {k⁻}, if BP(s_l, d_l, K) ≠ cp(s_l, d_l), then there exists BP(s_l, d_l, K).

Proof

We can establish a path since the graph is still connected even though there is k⁻. □
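The effect of negative keywords described in Lemmas 4 and 5 can be sketched as a shortest path search that skips every vertex carrying a negative keyword. This is only an illustration under assumed data: the graph, the keyword assignment, and the function name `dist_avoiding` are hypothetical, and the sketch runs a plain Dijkstra search rather than the index-based processing proposed in this paper.

```python
import heapq

def dist_avoiding(graph, keywords, source, target, negative):
    """Shortest distance that never enters a vertex holding a negative
    keyword; returns None when no such path exists."""
    blocked = {v for v, kws in keywords.items() if kws & negative}
    if source in blocked or target in blocked:
        return None
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if v in blocked:
                continue  # pruned vertex
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None

# Hypothetical network: two routes from a to d, then a chain d - e - t
# whose edges are bridges.
edges = [("a", "b", 1), ("b", "d", 1), ("a", "c", 2), ("c", "d", 2),
         ("d", "e", 1), ("e", "t", 1)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w
keywords = {"b": {"highway"}, "e": {"toll"}}  # hypothetical keywords

unrestricted = dist_avoiding(graph, keywords, "a", "d", set())
detour = dist_avoiding(graph, keywords, "a", "d", {"highway"})
cut_off = dist_avoiding(graph, keywords, "a", "t", {"toll"})
print(unrestricted, detour, cut_off)
```

Avoiding the vertex b forces the longer route through c (the situation of Lemma 5), while avoiding e, which lies on the only connection to t, leaves no path at all (the situation of Lemma 4).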

In the case where the user gives a set of positive keywords as the query input and there is no negative keyword at all, the result of the Best Path will be similar to the result of the state-of-the-art Trip Planning Route Queries (TPQ) [27].

Lemma 6

Given d_l, s_l, and $K=\{k_{1}^{+}, k_{2}^{+}, ..., k_{n}^{+}\}$, BP(s_l, d_l, K) = TPQ.

Proof

When all keywords are positive, then by definition, Best Path is the same as TPQ where we have to find the best trip/route from s_l, passing through one point from each category and then ending the trip at d_l. □

Lemma 6 suggests that Best Path can also be considered an NP-hard problem. Thus, in the following lemma we try to reduce the Best Path problem to the Traveling Salesman Problem (TSP), which is a well-known NP-hard problem.

Lemma 7

BP(s_l, d_l, K) is NP-hard.

Proof

Assume a road network G with a set of spatio-textual objects at spatial locations S = {s₁, s₂,..., s_m}, where each object has a distinct keyword from the keyword set T. Moreover, the user query consists of a source location s_l, a destination location d_l, and preferred keywords K = {k₁, k₂,..., k_n}. Again, assume that for K, we need to visit every vertex in G. Now, we reduce the Best Path problem to TSP. Let G^′ = (V^′, E^′) be the instance of TSP, where V^′ = V and E^′ = {(u, v)} for any u, v ∈ V^′. That is, we complete the road network graph G by connecting all vertices. The cost function between G and G^′ is as follows:

$$cost(u,v)=\left\{\begin{array}{ll} 0,& \text{ if } edge(u,v) \in E\\ 1,& \text{ if } edge(u,v) \notin E \end{array}\right.$$

Suppose that Best Path BP(s_l, d_l, K) exists in G and has cost ≤ 0 in G^′, hence there exists a solution to TSP in G^′ with cost ≤ 0. □

4 Data index

This section presents the IG-Tree, an indexing technique for planning Best Path Queries on Road Networks. Before presenting the IG-Tree, we first provide a brief background on G-Tree [51, 52] and IR²-Tree [15], which are the indexing techniques that inspired us to develop the IG-Tree for processing BP(s_l, d_l, K) queries efficiently.

4.1 G-Tree

One of the most efficient indexing techniques on road networks is the G-Tree [51, 52]. In G-Tree, the road network is partitioned recursively into sub-networks. Each node in G-Tree corresponds to a single sub-network and contains two or more road network vertices. The graph partitioning is performed using the multi-level partitioning algorithm [26], which guarantees that the subgraphs will be of almost the same size. Figure 3 shows an example of graph partitioning of the road network given in Figure 2. Here, the original graph is partitioned into two subgraphs, G₁ and G₂. In the next level, G₁ is partitioned again into two equal-sized subgraphs G₃ and G₄. Similarly, G₂ is partitioned again into subgraphs G₅ and G₆.

The vertices that connect two sub-networks together are marked as borders and they are stored in the G-Tree nodes. Figure 3 shows the example where vertex v₁ is the border of G₃ since it connects partition G₃ with other partitions G₄ and G₅. The border in partition G₄ consists of v₈ and v₇, the border in partition G₅ consists of v₂ and v₃, and the border in partition G₆ is v₄. In the partition G₁, the borders are v₁ and v₇, as both connect G₁ with G₂, while the borders in G₂ are v₂ and v₃.

G-Tree does not store the distance between every pair of vertices; instead it stores the set of borders and the shortest path distances between borders, which are kept in the distance matrix. For example, for partition G₃ the border is v₁, so the distance matrix contains the shortest path distances from v₁ to all other vertices in the subgraph partition {v₁, v₁₀, v₁₁}. The distance matrix has been shown to be very efficient for processing kNN search on road networks [51, 52].

Though G-Tree is very efficient in indexing and processing nearest neighbor (NN), k nearest neighbor (kNN) and keyword-based kNN queries on road networks, it is not applicable for processing Best Path queries.

4.2 IR²-Tree

A number of indexing techniques have been proposed for processing Spatial Keyword Queries; one of them is the IR²-Tree. The IR²-Tree was first introduced by Felipe et al. [15]. It is a hybrid indexing approach that combines the R-Tree [5] and information retrieval signature files. However, this indexing technique is only applicable for spatial data objects in Euclidean space.

The indexing in IR²-Tree is performed by attaching the inverted index to the R-Tree, i.e., every tree node in IR²-Tree holds the information for both spatial location and keywords. The leaf nodes contain the actual spatial data and keywords. For example, assume that an object o₁ that contains the keyword book is located in leaf node N₁ with spatial location [38,4] [93,9] (the bottom-left and upper-right coordinates of the minimum bounding rectangle (MBR)), while an object o₂ with the keyword supermarket is located in leaf node N₂ with spatial location [8,15] [41,32]. Suppose that the inverted index for the keyword book is 10 and the inverted index for the keyword supermarket is 01. Thus the leaf node N₁ contains the spatial location [38,4] [93,9] and keyword index 10, while leaf node N₂ contains the spatial location [8,15] [41,32] and keyword index 01.

While a leaf node in IR²-Tree stores the spatial data and keyword index, a non-leaf node contains the combination of several objects. The spatial information is based on the MBR, while the inverted index of the keywords is calculated using logical OR [15]. For example, suppose the leaf nodes N₁ and N₂ from the previous example have the same parent node N₀. In this case, N₀ contains the keyword information 11, as this node covers the keywords of both N₁ and N₂.
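The combination step for N₀ can be sketched as follows. The rectangles and signatures are the ones given in the example above; representing an MBR as a tuple (xmin, ymin, xmax, ymax) and the function name `merge_entries` are assumptions made for this illustration only.

```python
def merge_entries(entries):
    """Combine child entries (mbr, signature) into a parent entry.

    The parent MBR encloses all child MBRs and the parent signature
    is the bitwise OR of the child signatures.
    """
    xmin = min(mbr[0] for mbr, _ in entries)
    ymin = min(mbr[1] for mbr, _ in entries)
    xmax = max(mbr[2] for mbr, _ in entries)
    ymax = max(mbr[3] for mbr, _ in entries)
    signature = 0
    for _, sig in entries:
        signature |= sig
    return (xmin, ymin, xmax, ymax), signature

n1 = ((38, 4, 93, 9), 0b10)    # keyword book
n2 = ((8, 15, 41, 32), 0b01)   # keyword supermarket

n0_mbr, n0_sig = merge_entries([n1, n2])
print(n0_mbr, format(n0_sig, "02b"))
```

A keyword lookup can then skip a whole subtree whenever the bitwise AND of the query signature and the node signature is zero.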

4.3 Proposed data index: IG-Tree

As previously discussed, the IR²-Tree [15] is used for indexing Spatial Keywords Queries in Euclidean space, while the G-Tree [51, 52] is used for indexing Road Networks. As Best Path is a type of Spatial Keywords Query on Road Network, each one of these indexing techniques has its own benefit to Best Path Query. Thus, we adopt these two indexing techniques to develop a new indexing scheme that can improve the processing of Best Path query: IG-Tree, a hybrid between IR²-Tree and G-Tree.

Using the road network in Figure 2, we now construct the IG-Tree. Following the graph partitioning technique used in G-Tree, we partition the graph into smaller subgraphs. Figure 3 shows the graph partitioning of the example road network given in Figure 2. The graph is divided into equal-sized subgraphs using the multi-level partitioning algorithm [26] and each partition consists of two or more vertices. At the leaf level of the tree, the subgraph G₃ consists of vertices v₁, v₁₀ and v₁₁; subgraph G₄ consists of vertices v₇, v₈ and v₉; subgraph G₅ consists of vertices v₂, v₃ and v₅; and finally, subgraph G₆ consists of vertices v₄ and v₆. Each partition makes up one node in the IG-Tree, as presented in Figure 4.

After the graph partitioning, we mark the borders of each partition. Borders are the vertices in one partition that connect it to another partition. For example, the border of partition G₃ is v₁, since v₁ connects the subgraph to partition G₄ and partition G₅. Likewise, the borders of G₄ are v₇ and v₈; the borders of G₅ are v₂, v₃ and v₅; and the border of G₆ is v₄. Based on these borders, we create the distance matrices: the shortest path distances for every border in every node are pre-computed and stored in the matrices. Tables 2, 3, 4, 5, 6, 7, and 8 show the distance matrices for each node in IG-Tree.
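Border marking can be sketched as a single pass over the edges: a vertex is a border of its partition if at least one of its edges ends in a different partition. The leaf partitions below follow the text, but the edge list is hypothetical (Figure 2 is not reproduced here); it was chosen so that the resulting borders agree with the ones listed above.

```python
def find_borders(edges, partition_of):
    """Map each partition to the set of its border vertices."""
    borders = {}
    for u, v in edges:
        if partition_of[u] != partition_of[v]:
            borders.setdefault(partition_of[u], set()).add(u)
            borders.setdefault(partition_of[v], set()).add(v)
    return borders

# Leaf partitions as described in the text.
leaves = {"G3": ["v1", "v10", "v11"], "G4": ["v7", "v8", "v9"],
          "G5": ["v2", "v3", "v5"], "G6": ["v4", "v6"]}
partition_of = {v: g for g, vs in leaves.items() for v in vs}

# Hypothetical edge list; only its cross-partition edges matter here.
edges = [("v1", "v10"), ("v10", "v11"), ("v1", "v8"), ("v1", "v2"),
         ("v8", "v9"), ("v7", "v8"), ("v7", "v3"), ("v2", "v3"),
         ("v2", "v5"), ("v5", "v4"), ("v4", "v6")]

borders = find_borders(edges, partition_of)
print({g: sorted(b) for g, b in sorted(borders.items())})
```

Borders of non-leaf nodes can be obtained the same way by mapping each vertex to its partition at the corresponding level of the tree.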

Table 2 Distance Matrix for G₀

Table 3 Distance Matrix for G₁

Table 4 Distance Matrix for G₂

Table 5 Distance Matrix for G₃

Table 6 Distance Matrix for G₄

Table 7 Distance Matrix for G₅

Table 8 Distance Matrix for G₆

Another aspect of the graph in Figure 3 is the keywords. Some vertices contain one or more keywords, thus we also need to index these keywords. The keywords can be turned into an inverted list. The first step is to sort all of the keywords in the graph. Then, for each vertex, we assign each keyword a binary value based on whether the vertex contains it. For instance, vertex v₂ contains only the keyword book, thus the inverted list for v₂ is 1000. Vertex v₈ contains both keywords book and supermarket, so its inverted list is 1001. The inverted index for all the vertices of the graph in Figure 3 is presented in Table 9.
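The encoding of v₂ and v₈ can be reproduced with a short sketch. Only book and supermarket are named in the text; the two middle keywords of the four-keyword vocabulary ("cafe" and "gas station") are hypothetical placeholders, as is the function name `keyword_bits`.

```python
def keyword_bits(vertex_keywords, vocabulary):
    """Return one bit per vocabulary keyword, in vocabulary order."""
    return "".join("1" if k in vertex_keywords else "0" for k in vocabulary)

# Sorted vocabulary; "cafe" and "gas station" are placeholders.
vocabulary = sorted(["supermarket", "book", "gas station", "cafe"])

v2_bits = keyword_bits({"book"}, vocabulary)
v8_bits = keyword_bits({"book", "supermarket"}, vocabulary)
print(vocabulary)
print(v2_bits, v8_bits)
```

Sorting the vocabulary once fixes the bit position of each keyword, so that the bit strings of different vertices are directly comparable.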

Table 9 Keyword index


Based on the above inverted index list, we attach each inverted index to its corresponding vertex at the leaf nodes. For each parent node, the inverted index is calculated using the logical OR of its children. For instance, G₃’s inverted index is the result of the logical OR of the inverted indexes of v₁, v₁₀, and v₁₁. The result of 0000 OR 0000 OR 0010 is 0010, thus G₃’s inverted index is 0010. The same calculation is applied to every non-leaf node. The root node will normally have all 1s in its index.
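The bottom-up OR can be sketched as a small recursion over the tree. The bit patterns of v₁, v₁₀, v₁₁, and v₈ are the ones quoted in the text; the patterns of v₇ and v₉ are assumed to be 0000 purely for illustration, and the dictionary layout is a hypothetical choice.

```python
def node_signature(node, vertex_bits, children):
    """OR together the keyword bits of everything below `node`."""
    if node not in children:       # a road network vertex
        return vertex_bits[node]
    signature = 0
    for child in children[node]:
        signature |= node_signature(child, vertex_bits, children)
    return signature

children = {"G1": ["G3", "G4"],
            "G3": ["v1", "v10", "v11"],
            "G4": ["v7", "v8", "v9"]}
vertex_bits = {"v1": 0b0000, "v10": 0b0000, "v11": 0b0010,  # from the text
               "v8": 0b1001,                                # from the text
               "v7": 0b0000, "v9": 0b0000}                  # assumed

g3 = node_signature("G3", vertex_bits, children)
g1 = node_signature("G1", vertex_bits, children)
print(format(g3, "04b"), format(g1, "04b"))
```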

Even though we have indexed all of the available keywords and assigned them to each node in the tree, these indexes alone are not adequate. The inverted index only indicates that a certain keyword exists in a node, but it does not identify the exact location until we reach the leaf node. Therefore, we propose a Keyword Distance Matrix for each node. The Keyword Distance Matrix contains the distance from each border to the nearest keyword-matching vertex. With this matrix, the keyword search is sped up, as we do not need to compute the keyword distances at query processing time. The Keyword Distance Matrices for the IG-Tree in Figure 4 are shown in Tables 10, 11, 12, 13, 14, 15 and 16.
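One way to pre-compute such a matrix is to run, for each keyword, a multi-source Dijkstra search that starts from all vertices holding that keyword and then read off the distances at the borders. This is a sketch under assumptions rather than the construction used for Tables 10 to 16: the small graph, the keyword assignment, and the function names are hypothetical, and the search is not restricted to a single subgraph.

```python
import heapq

def nearest_source_dist(graph, sources):
    """Distance from every reachable vertex to its nearest source."""
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def keyword_distance_matrix(graph, keywords, borders, vocabulary):
    """matrix[(border, keyword)] = distance to the nearest matching vertex."""
    matrix = {}
    for k in vocabulary:
        sources = [v for v, kws in keywords.items() if k in kws]
        dist = nearest_source_dist(graph, sources)
        for b in borders:
            matrix[(b, k)] = dist.get(b, float("inf"))
    return matrix

# Hypothetical path graph a -2- b -3- c with borders a and c.
graph = {"a": {"b": 2}, "b": {"a": 2, "c": 3}, "c": {"b": 3}}
keywords = {"b": {"book"}, "c": {"supermarket"}}
matrix = keyword_distance_matrix(graph, keywords, ["a", "c"],
                                 ["book", "supermarket"])
print(matrix)
```

Because the road network is undirected, one search per keyword yields the nearest-match distance for every border at once, instead of one search per border and keyword pair.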

Table 10 Keyword Distance Matrix for G₀

Table 11 Keyword Distance Matrix for G₁

Table 12 Keyword Distance Matrix for G₂

Table 13 Keyword Distance Matrix for G₃

Table 14 Keyword Distance Matrix for G₄

Table 15 Keyword Distance Matrix for G₅

Table 16 Keyword Distance Matrix for G₆

Based on the above discussion, there are several important components in an IG-Tree. Every non-leaf node in IG-Tree contains the partition name, the borders of the partition, and the inverted index (the logical OR of its child nodes). Each non-leaf node also contains two types of matrices, the Distance Matrix and the Keyword Distance Matrix. Every leaf node contains the road network’s vertex and the inverted list of the corresponding vertex. We also keep the geographic coordinates of each vertex in the leaf node.
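The components listed above can be collected in a pair of record types. This is a hypothetical layout for illustration only: the class and field names, the dictionary representation of the two matrices, and the coordinates in the example are all assumptions, not the storage format of the actual index.

```python
from dataclasses import dataclass, field

@dataclass
class VertexEntry:
    """A road network vertex stored at the bottom of the tree."""
    vertex: str
    coord: tuple          # geographic coordinate (hypothetical values below)
    keyword_bits: int     # inverted list of the vertex as a bit pattern

@dataclass
class IGTreeNode:
    """A partition node of the tree."""
    name: str
    borders: list
    keyword_bits: int = 0
    distance: dict = field(default_factory=dict)          # (border, x) -> dist
    keyword_distance: dict = field(default_factory=dict)  # (border, kw) -> dist
    children: list = field(default_factory=list)          # child IGTreeNodes
    entries: list = field(default_factory=list)           # VertexEntry items

    def refresh_keyword_bits(self):
        """Recompute the inverted index as the OR of everything below."""
        bits = 0
        for child in self.children:
            bits |= child.refresh_keyword_bits()
        for entry in self.entries:
            bits |= entry.keyword_bits
        self.keyword_bits = bits
        return bits

g3 = IGTreeNode(name="G3", borders=["v1"], entries=[
    VertexEntry("v1", (0.0, 0.0), 0b0000),
    VertexEntry("v10", (0.0, 1.0), 0b0000),
    VertexEntry("v11", (1.0, 1.0), 0b0010),
])
g1 = IGTreeNode(name="G1", borders=["v1", "v7"], children=[g3])
g1.refresh_keyword_bits()
print(format(g3.keyword_bits, "04b"), format(g1.keyword_bits, "04b"))
```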

4.3.1 Space complexity of the IG-Tree

Height

The height of IG-Tree is similar to that of G-Tree [51, 52], which is $\mathcal {H}=\log _{f} \frac {|V|}{\tau }+ 1$, where f is the number of partitions for each graph/subgraph, |V | is the number of vertices in the given (road network) graph G, and τ is the maximum number of vertices in a leaf node’s subgraph.

Number of nodes

Like G-Tree, IG-Tree has only one node in level 0, which is the root. In an arbitrary level i of the tree, there are fⁱ nodes, as the number of partitions for each graph (at level 0)/subgraph (at level > 0) is f. As τ is the maximum number of vertices in a leaf node’s subgraph, there are $\frac {|V|}{\tau }$ leaf nodes. As a result, the number of nodes in IG-Tree is $\mathcal {O}\left (\frac {f}{f-1}\cdot \frac {|V|}{\tau }\right )=\mathcal {O}\left (\frac {|V|}{\tau }\right )$, which is again similar to that of G-Tree.
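As a quick check of the node count, assume that f ≥ 2 and that $\frac {|V|}{\tau }$ is an exact power of f (simplifying assumptions made only for this derivation). The level sizes then form a geometric series, and the height formula gives $f^{\mathcal {H}-1}=\frac {|V|}{\tau }$:

$$\sum\limits_{i = 0}^{\mathcal{H}-1} f^{i}=\frac{f^{\mathcal{H}}-1}{f-1}=\frac{f\cdot \frac{|V|}{\tau}-1}{f-1}<\frac{f}{f-1}\cdot \frac{|V|}{\tau} $$

Since $\frac {f}{f-1}\le 2$ for f ≥ 2, the factor is a constant and the count is $\mathcal {O}\left (\frac {|V|}{\tau }\right )$.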

Number of inverted lists

A node in IG-Tree contains an inverted list representing the keywords covered in that node. As the number of nodes in an IG-Tree is $\mathcal {O}\left (\frac {|V|}{\tau }\right )$, the number of inverted lists is $\mathcal {O}\left (\frac {|V|}{\tau }+|V|\right )$. Thus, the space complexity of maintaining inverted lists in IG-Tree becomes $\mathcal {O}\left (\frac {|V|}{\tau }\cdot |T|+|V|\cdot |T|\right )$, where |T| is the number of keywords covered in the whole road network G and |V |⋅|T| is the space complexity of the inverted lists for all vertices in V.

Number of borders

If we assume the road network to be modeled as a planar graph, the number of borders on average in a node of level i is $\mathcal {O}\left (\log _{2}f\cdot \sqrt {\frac {|V|}{f^{i + 1}}}\right )$ as per the calculation conducted in [51, 52]. As there are fⁱ nodes in level i, the number of borders in an arbitrary level i is $\mathcal {O}\left (f^{i}\cdot \log _{2}f\cdot \sqrt {\frac {|V|}{f^{i + 1}}}\right )=\mathcal {O}\left (\log _{2}f\cdot \sqrt {|V|f^{i-1}}\right )$. If we sum this measure from level 1 to the height of the tree $\log _{f} \frac {|V|}{\tau }+ 1$, the total number of borders in an IG-Tree is $\mathcal {O}\left (\frac {\log _{2}f}{\sqrt {\tau }}|V|\right )$, which is again similar to the G-Tree under the planar graph assumption.

Distance matrices

The total distance matrix size of all leaf nodes is $\mathcal {O}(\sqrt \tau |V|\cdot \log _{2}f)$ and the total distance matrix size of all non-leaf nodes is $\mathcal {O}\left (|V|\cdot \log _{2}^{2}f\cdot \log _{f} \frac {|V|}{\tau }\right )$, as per the calculation conducted in [51, 52].

Keyword distance matrices

The average number of borders in a leaf node of IG-Tree is $\mathcal {O}(\log _{2}f\cdot \sqrt \tau )$ [51, 52] and the total number of keywords in G is |T|. Thus, the keyword distance matrix size in a leaf node is $\mathcal {O}(\log _{2}f\cdot \sqrt \tau \cdot |T|)$, and the total keyword distance matrix size of all leaf nodes becomes $\mathcal {O}\left (\frac {|V|}{\tau }\cdot \log _{2}f\cdot \sqrt \tau \cdot |T|\right )$. Each internal node at level i generates $\mathcal {O}\left (\log _{2}f\cdot \sqrt {\frac {|V|}{f^{i + 1}}}\right )$ borders on average [51, 52]. Therefore, the keyword distance matrix size of each node at level i is $\mathcal {O}\left (\log _{2}f\cdot \sqrt {\frac {|V|}{f^{i + 1}}}\cdot |T|\right )$. In IG-Tree, there are fⁱ nodes at level i; therefore, the keyword distance matrix size at level i is $\mathcal {O}\left (f^{i}\cdot \log _{2}f\cdot \sqrt {\frac {|V|}{f^{i + 1}}}\cdot |T|\right )=\mathcal {O}\left (\log _{2}f\cdot \sqrt {|V|f^{i-1}}\cdot |T|\right )$. Thus, the total keyword distance matrix size of non-leaf nodes is $\mathcal {O}\left ({\sum }_{0\le i < \mathcal {H}}\log _{2}f\cdot \sqrt {|V|f^{i-1}}\cdot |T|\right )$.

4.3.2 Index reconstruction for tree node with negative query keywords

In the Best Path Query, the user supplies query keywords, which can be positive or negative. The positive keywords denote the spatio-textual objects that the user wants to visit, while the negative keywords denote the spatio-textual objects that the user wants to avoid along the trip. Even though IG-Tree contains the textual information of spatial objects, we still have to check the textual relevance between each spatial object and the query keywords in order to decide whether the object is to be visited or avoided. For objects that contain positive keywords, IG-Tree can compute the path efficiently with the help of the Distance Matrices. For example, to compute the path from v₅ to the nearest book, we can directly refer to the Keyword Distance Matrix in Table 15 to save time. The situation is different when negative keywords exist, and this is a weakness of IG-Tree. As previously mentioned in Section 3, a vertex that holds one or more negative query keywords must be pruned from the road network graph, because this vertex holds a keyword that the user wants to avoid/block. The Distance Matrices of IG-Tree store the shortest-path distances between borders, so when a vertex is pruned from the graph, these shortest paths may change and the Distance Matrices no longer store accurate distances. The existing index therefore has to be modified. The new Distance Matrices replace the existing matrices while a query with the corresponding negative keywords is processed. The modification, however, depends on the location of the vertex in the IG-Tree:

Case-C1. If the vertex v that contains k⁻ is a border and no other border exists for a node that we must visit, then no path can be established. In this case, v is considered a bridge. This situation is already proved through Lemmas 3 and 4 in Section 3.
Case-C2. If the vertex v that contains k⁻ is a border and there are other borders, then path reconstruction is needed. The path reconstruction involves the whole tree node where v is located and also the borders of other nodes that are adjacent to v. For example, if v₃ in Figure 3 contains k⁻, then the path reconstruction occurs on the whole G₅ tree node and its adjacent borders. The adjacent borders in this case can be identified through the parent nodes of G₅, by checking whether the parent nodes have v₃ as one of their borders. G₂ and G₀ do share v₃ as their border, so the path reconstruction involves these two tree nodes as well. As v₃ is pruned from the graph, v₃ is omitted from the Distance Matrices of G₅, G₂ and G₀. The border-to-border distances in these matrices are also affected by the omission of v₃, so the entire matrices have to be recalculated to reflect the changes in the shortest paths between these borders. The path reconstruction itself can be carried out using Dijkstra’s algorithm. Tables 17, 18 and 19 show the Distance Matrices after path reconstruction.
Case-C3. If the vertex v that contains k⁻ is inside a leaf node (not a border), then the distances have to be recalculated. The index recalculation for this case does not affect the whole tree, but only the tree node where the vertex with the negative keyword lies. Similar to the previous case, the path reconstruction can be carried out using Dijkstra’s algorithm. When v is inside a leaf node, we can focus on its own subgraph partition, as it does not affect the other partitions as in the previous case. For instance, assume that v₁₀ in Figure 3 contains k⁻. v₁₀ is a non-border vertex, as it does not connect any sub-networks. v₁₀ is located in partition G₃, thus only this partition needs to be reconstructed, as shown in Table 20.
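
The three cases can be illustrated with a minimal sketch that prunes the vertices holding a negative keyword and recomputes border-to-border distances with Dijkstra’s algorithm. The graph, the vertex names, and the helper `rebuild_border_matrix` are hypothetical (this is not the network of Figure 3), and for brevity the sketch runs Dijkstra over one small graph instead of assembling matrices bottom-up per tree node.

```python
import heapq


def dijkstra(graph, source, blocked=frozenset()):
    """Shortest distances from `source`, skipping every vertex in `blocked`."""
    if source in blocked:
        return {}
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if v in blocked:
                continue                  # pruned vertex: never relax into it
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def rebuild_border_matrix(graph, borders, blocked):
    """Recompute border-to-border distances after pruning `blocked` vertices.
    Pruned borders are dropped; unreachable pairs get infinity."""
    alive = [b for b in borders if b not in blocked]
    matrix = {}
    for b in alive:
        dist = dijkstra(graph, b, blocked)
        matrix[b] = {c: dist.get(c, float("inf")) for c in alive}
    return matrix


# Hypothetical subgraph: a, b, c are borders, x is an inner (non-border) vertex.
graph = {
    "a": {"x": 1, "b": 5},
    "b": {"a": 5, "x": 1, "c": 2},
    "c": {"b": 2},
    "x": {"a": 1, "b": 1},
}
borders = ["a", "b", "c"]
before = rebuild_border_matrix(graph, borders, blocked=set())
after_inner = rebuild_border_matrix(graph, borders, blocked={"x"})   # like Case-C3
after_border = rebuild_border_matrix(graph, borders, blocked={"b"})  # border pruned
```

In this toy graph, pruning the inner vertex x lengthens the a-to-b distance from 2 to 5, while pruning the border b disconnects a from c, which mirrors the bridge situation of Case-C1.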

Table 17 Reconstructed Distance Matrix for G₅

Table 18 Reconstructed Distance Matrix for G₂

Table 19 Reconstructed Distance Matrix for G₀

Table 20 Reconstructed Distance Matrix for G₃

5 Query processing

We propose three Best Path query processing algorithms that can be applied on IG-Tree, namely the Optimal Distance Approximation Search, the Ancestor Priority Approximation Search, and the Euclidean-based Approximation Search. A baseline algorithm is also provided in this section. The baseline algorithm offers an exact solution, while the three proposed algorithms offer approximate solutions with different trade-offs. In each subsection, we discuss how the algorithm works and its trade-offs.

5.1 Baseline algorithm

In this section, we discuss the baseline algorithm that can be used on IG-Tree to find the Best Path. This algorithm computes the result of the Best Path query exactly. The key idea is to enumerate every combination of vertices that covers the positive keywords, together with every visiting order (permutation) of each combination, and then compare them in order to find the one with the lowest cost (least distance).

As an example, assume that we want to find the best path from v₁ to v₄ while passing through cinema and book. In this case, s_l = v₁, d_l = v₄, and keywords = {cinema, book}. The first step is to turn the preferred keywords into an inverted index so that we can check their relevance against the inverted index in IG-Tree. The preferred keywords are cinema and book, therefore the inverted list is 1100 (K = 1100). Then we have to find the partitions that contain the source and destination in the IG-Tree: v₁ is located under the partition G₃ and v₄ is located under the partition G₆. After the source and destination locations are found, we can start finding the best path BP(v₁, v₄,1100) that visits the chosen keywords.

We then scan every vertex in the leaf nodes for an inverted index that matches a bit of 1100. The inverted index 1000 can be found at v₂ and v₈, while the inverted index 0100 can be found at v₆. Knowing the exact locations of the keywords, we take the Cartesian product of the vertex sets of the keywords. In this case, the Cartesian product of {v₂, v₈} and {v₆} yields the combinations {v₂, v₆} and {v₈, v₆}. Then, for each combination, we generate its permutations to help compute the path with the least distance. The permutations for this case are {v₂, v₆}, {v₆, v₂}, {v₈, v₆}, and {v₆, v₈}. Based on these permutations, we find the shortest path from s_l through each permutation, and then from the permutation to d_l. In this case, we have four possible paths: v₁ → v₂ → v₆ → v₄ = 10, v₁ → v₆ → v₂ → v₄ = 22, v₁ → v₈ → v₆ → v₄ = 16, v₁ → v₆ → v₈ → v₄ = 32. Algorithm 1 shows the pseudocode for shortest path search in IG-Tree. While calculating the shortest paths, we also keep track of the path with the least total distance. At the end, we obtain the exact Best Path. For this example, the Best Path BP(v₁, v₄,1100) = v₁ → v₂ → v₆ → v₄ with the least total distance of 10.

This algorithm guarantees the accuracy of the Best Path found on road networks. However, since it enumerates every combination and permutation of keyword vertices for an NP-Hard problem, its running time grows exponentially, especially on large datasets. In our experiment, it can spend up to 17 hours merely to find the Best Path with 5 query keywords, even in a very small dataset with only 100 vertices. This is clearly impractical for daily use. The pseudo-code of the baseline algorithm is given in Algorithm 2.
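
The enumeration described above can be sketched as follows. This is a minimal illustration, not the paper’s Algorithm 2: the toy graph, the vertex names, and the function `baseline_best_path` are assumptions, plain Dijkstra stands in for the IG-Tree shortest path search, and only the order of visited stops (not the full vertex path) is returned.

```python
import heapq
from itertools import permutations, product


def dijkstra(graph, source):
    """Shortest road-network distance from `source` to every reachable vertex."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def baseline_best_path(graph, source, target, keyword_vertices):
    """keyword_vertices: one list of candidate vertices per positive keyword."""
    dist = {u: dijkstra(graph, u) for u in graph}   # all pairs; toy graphs only
    best_cost, best_stops = float("inf"), None
    for combo in product(*keyword_vertices):        # one vertex per keyword
        for order in permutations(combo):           # every visiting order
            stops = (source, *order, target)
            cost = sum(dist[a][b] for a, b in zip(stops, stops[1:]))
            if cost < best_cost:
                best_cost, best_stops = cost, stops
    return best_cost, best_stops


# Hypothetical undirected road network (not Figure 3).
graph = {
    "s": {"t": 4, "a1": 1, "a2": 3},
    "t": {"s": 4, "b1": 6},
    "a1": {"s": 1},
    "a2": {"s": 3, "b1": 3},
    "b1": {"a2": 3, "t": 6},
}
# Keyword "book" at a1 and a2, keyword "cinema" at b1 (assumed placement).
cost, stops = baseline_best_path(graph, "s", "t", [["a1", "a2"], ["b1"]])
print(cost, stops)
```

The two nested loops make the cost of the enumeration explicit: the number of candidate routes is the product of the keyword set sizes times the factorial of the number of keywords.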

5.2 Optimal distance approximation search

Because of the non-polynomial time complexity of the baseline algorithm, we propose an approximation algorithm that trades some accuracy for a shorter running time. This algorithm is a lot faster than the baseline one, but its result is not 100% accurate.

When multiple keywords are involved in the query, the complexity rises, as we have to consider all possible combinations of the keyword locations in order to get the optimal solution (distance-wise). However, when there is only one keyword involved in the query, the query can be answered in polynomial time. This approximation algorithm exploits that fact to answer multiple-keyword queries. For each query keyword given by the user, we find the best path from the source location to the query keyword and then to the destination. By getting the best path for each keyword, we can locate the best possible location of each keyword, that is, the one that gives the shortest source-keyword-destination distance. After we have the best candidate for each keyword, we find the path from the source location to its nearest candidate, then from that candidate to its next nearest candidate. We keep doing this until all keywords are covered, and then finish the path at the destination location.

For example, suppose a user wants to find the best path from v₁₀ to v₇ while passing through a bookstore and a cinema. In this case, s_l = v₁₀, d_l = v₇, and K = {book, cinema}. In order to find the best path, we have to transform the query keywords given by the user into an inverted list. Since the keywords are {book, cinema}, the inverted list is K_if = 1100.

In this algorithm, we first have to find the locations of s_l and d_l. Looking at the IG-Tree, s_l is located within partition G₃, while d_l is located within partition G₄, as shown in Figure 5. Now for each keyword k_n in K_if, we have to find the best path s_l − k_n − d_l. Assume that the first keyword is book, whose inverted index is 1000. In the road network, there are several vertices that contain the keyword book, so we have to calculate the total shortest path distance for each keyword location and then pick the one with the least distance. A naive solution here is to find the shortest path from s_l to k_n and then add the shortest path from k_n to d_l. Since our current keyword index is 1000, v₈ and v₂ match this index. Hence we have to establish the shortest paths δ(v₁₀, v₈) + δ(v₈, v₇) and also δ(v₁₀, v₂) + δ(v₂, v₇). The shortest path search works as in the previous section. The best path distance for visiting v₈ is dist(v₁₀, v₈) + dist(v₈, v₇) = 6 + 10 = 16, while the best path distance for visiting v₂ is dist(v₁₀, v₂) + dist(v₂, v₇) = 5 + 9 = 14. Based on these calculations, v₂ gives the best path from v₁₀ to v₇, as depicted in Figures 6 and 7. Thus, we store v₂ in a candidate queue Q_k as the candidate vertex for the multiple-keyword best path query. The same process applies to the keyword cinema, whose inverted index is 0100. There is only one vertex in the road network that contains 0100, which is v₆. Therefore the best path is based on δ(v₁₀, v₆) + δ(v₆, v₇). Hence, v₆ is stored in the candidate queue Q_k as the candidate vertex for the keyword cinema.

As we have found the candidates for each keyword specified by the user, we can search for the best path from s_l to the candidates, then to d_l. Q_k consists of v₂ and v₆. We first find which vertex in the candidate queue Q_k is the nearest to s_l (v₁₀). In this case, v₂ is the nearest, so we find the shortest path δ(v₁₀, v₂) = {v₁₀, v₁, v₂} with total distance dist(v₁₀, v₂) = 5. Next, we find the nearest remaining candidate from v₂, which is v₆, and establish another shortest path δ(v₂, v₆) = {v₂, v₅, v₄, v₆} with total distance dist(v₂, v₆) = 7. Since there are no more candidates in the queue, all the keywords specified by the user are covered by our path. Thus we can establish the final path from the last candidate to the destination d_l (δ(v₆, v₇) = {v₆, v₄, v₅, v₃, v₇} with total distance dist(v₆, v₇) = 11). The resulting best path is BP(v₁₀, v₇,{book, cinema}) = {v₁₀, v₁, v₂, v₅, v₄, v₆, v₄, v₅, v₃, v₇}, with a total distance of 5 + 7 + 11 = 23.

Based on our experiment, this algorithm runs faster than the baseline algorithm even though the approximation is not 100% accurate. However the approximation result is close to the baseline result even when the algorithm runs with a large dataset. The pseudo-code of this algorithm is given in Algorithm 3.
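
The two phases of this search (one candidate per keyword, then a nearest-first tour) can be sketched as below. This is an illustrative sketch, not the paper’s Algorithm 3: the toy graph, the keyword placement, and the function `opt_dist_path` are assumptions, and plain Dijkstra replaces the IG-Tree distance matrices.

```python
import heapq


def dijkstra(graph, source):
    """Shortest road-network distance from `source` to every reachable vertex."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def opt_dist_path(graph, source, target, keyword_vertices):
    dist = {u: dijkstra(graph, u) for u in graph}   # all pairs; toy graphs only
    # Phase 1: per keyword, keep the vertex minimizing source -> vertex -> target.
    candidates = []
    for vertices in keyword_vertices.values():
        best = min(vertices, key=lambda v: dist[source][v] + dist[v][target])
        if best not in candidates:
            candidates.append(best)
    # Phase 2: visit the candidates greedily, nearest first, then go to target.
    stops, current, total = [source], source, 0
    while candidates:
        nxt = min(candidates, key=lambda v: dist[current][v])
        total += dist[current][nxt]
        candidates.remove(nxt)
        stops.append(nxt)
        current = nxt
    total += dist[current][target]
    stops.append(target)
    return total, stops


# Hypothetical undirected road network (not Figure 3).
graph = {
    "s": {"t": 4, "a1": 1, "a2": 3},
    "t": {"s": 4, "b1": 6},
    "a1": {"s": 1},
    "a2": {"s": 3, "b1": 3},
    "b1": {"a2": 3, "t": 6},
}
keywords = {"book": ["a1", "a2"], "cinema": ["b1"]}  # assumed placement
total, stops = opt_dist_path(graph, "s", "t", keywords)
print(total, stops)
```

On this toy graph the sketch returns a cost of 14 via a1, whereas an exhaustive search finds the route s → a2 → b1 → t with cost 12. Choosing each keyword’s candidate in isolation misses a2, which lies on the way to b1, and this is the kind of error the approximation accepts in exchange for speed.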

5.3 Ancestor priority approximation search

In this paper, we propose another approximation algorithm. This algorithm utilizes the common ancestor of the source and destination locations with the purpose of minimizing the tree traversal time. When we try to find one or more keywords in the IG-Tree, we sometimes have to travel through most of the tree nodes even though the source and destination locations are in the same partition node. In this algorithm, the idea is instead to traverse only the branch of an ancestor node, which amounts to early pruning through the common ancestor of the source and destination locations in IG-Tree.

As an example, a user invokes a query with source location in vertex v₁₀, destination location in vertex v₇, and the preferred keyword is book. The query keyword inverted index in this case is 1000 for keyword book. Looking at the IG-Tree in Figure 8, the common ancestor between v₁₀ and v₇ is G₁. Thus in this algorithm we are going to only focus on the branch under G₁, especially if the user given keyword is available in this branch. As the query inverted index is 1000 and the inverted index attached in G₁ is 1011, we can see that the query keyword exists in G₁, so we can definitely focus on this node to find the best path from v₁₀ to v₇ while passing by a keyword book.

The Ancestor Priority Search algorithm works like the Optimal Distance Approximation algorithm once we know which branch to work on. First we compute the best path for each keyword, then record the candidate vertices in a queue Q_k in order to answer the final multiple-keyword best path query. Continuing the previous example, the focus is now only on the branch of G₁, so we do not need to travel to other branches outside partition G₁. In this case, we would first find the best path for each query keyword, but since there is only one keyword, we can find the Best Path directly. We locate each keyword through the inverted index attached to the IG-Tree nodes, traversing down until we find which vertex has the keyword. We know that G₁ has 1000, so we check its immediate children. G₃ does not have 1000, while G₄ does; thus we traverse down the partition of G₄ in order to find the keyword. The children of G₄ are v₈, v₉, and v₇. Only v₈ has 1000, therefore we find the best path from s_l to v₈ and then to v₇, as depicted in Figure 9. Similar to the previous algorithm, we find the shortest paths δ(v₁₀, v₈) and δ(v₈, v₇). The shortest path δ(v₁₀, v₈) = {v₁₀, v₁, v₈}, while the shortest path δ(v₈, v₇) = {v₈, v₁, v₇}. As there is only one keyword, the Best Path is BP(v₁₀, v₇,{book}) = {v₁₀, v₁, v₈, v₁, v₇}.

The previous case, however, does not always happen. If the keyword does not exist in the current ancestor node, we have to go to its parent node and check whether the query keyword is available there. If it is not, we keep going to the upper node until we find the query keyword. Once the query keyword is found in a node, we can continue the best path search. For example, assume that a user wants to find the best path from v₁₀ to v₇ while passing through a cinema. The inverted index for cinema is 0100, while the common ancestor of v₁₀ and v₇ is G₁. The partition G₁ does not contain cinema since its inverted index is 1011, thus we have to find out whether cinema is available in G₁’s parent node. The parent node of G₁ is G₀ and its inverted index is 1111, which means that cinema exists in this partition. Therefore the best path search will cover all the tree branches under G₀.

The main advantage of this algorithm is its early pruning. There is no need to explore the whole tree, as we only need to focus on one branch through the common ancestor of the source and destination. However, the disadvantage of this algorithm is that it has even lower accuracy than the Optimal Distance Approximation Search algorithm. Traversing only one branch does not guarantee the shortest distance for the best path, since some keywords with closer distances might be located in other partitions. The algorithm accepts this loss in exchange for a lower tree traversal cost. The pseudo-code of this algorithm is given in Algorithm 4.
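
The choice of the search branch can be sketched with parent pointers and keyword bitmasks. The bitmasks of G₀ (1111) and G₁ (1011) and the leaf locations of v₁₀ and v₇ follow the example above; the masks of G₃ and G₄, the omission of the subtree under G₂, and the function names are assumptions made only for illustration.

```python
def common_ancestor(parent, a, b):
    """Lowest common ancestor of tree nodes a and b using parent pointers."""
    seen = set()
    while a is not None:
        seen.add(a)
        a = parent.get(a)
    while b not in seen:
        b = parent[b]
    return b


def search_root(parent, mask, leaf_of, source, target, query_mask):
    """Climb from the common ancestor until the node's keyword bitmask covers
    every bit of the query; return None if even the root does not cover it."""
    node = common_ancestor(parent, leaf_of[source], leaf_of[target])
    while node is not None and (mask[node] & query_mask) != query_mask:
        node = parent.get(node)
    return node


BOOK, CINEMA = 0b1000, 0b0100

# Part of the tree only; the subtree under G2 is left out of this sketch.
parent = {"G0": None, "G1": "G0", "G3": "G1", "G4": "G1"}
mask = {
    "G0": 0b1111,   # from the example in the text
    "G1": 0b1011,   # from the example in the text
    "G3": 0b0011,   # hypothetical: only known not to contain book
    "G4": 0b1000,   # hypothetical: only known to contain book
}
leaf_of = {"v10": "G3", "v7": "G4"}

root_for_book = search_root(parent, mask, leaf_of, "v10", "v7", BOOK)
root_for_cinema = search_root(parent, mask, leaf_of, "v10", "v7", CINEMA)
print(root_for_book, root_for_cinema)
```

For the keyword book the search stays under G₁, while for cinema it has to climb to G₀, matching the two examples in this subsection.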

5.4 Euclidean-based approximation search

We propose another approximation algorithm in this paper. The idea behind this algorithm is to make use of the coordinates of each vertex and find the best path using Euclidean distance before applying it to the road network. This approximation algorithm is very fast compared to the previous algorithms since it relies on Euclidean distance computation. However, because Euclidean distance is applied to road network data, the accuracy of this algorithm is quite low.

The Euclidean-based Approximation has two main components, namely the Euclidean approximation (Algorithm 5, lines 1-10) and the best road network path (Algorithm 5, lines 11-20). In the Euclidean approximation part, we first need to find the Euclidean locations of both the source location s_l and the destination location d_l. Based on these two Euclidean locations, we calculate the best path in Euclidean distance for each keyword k_n in K_if. The way we find the best path for each keyword k_n is similar to the Optimal Distance Approximation algorithm: we find the vertex that minimizes δ(s_l, k_n) + δ(k_n, d_l) in Euclidean distance and store it in a candidate queue Q_k to help establish the final best path. Once we have found the best path of each keyword k_n, we move to the second part, which is the road network path.

In the road network path component of Euclidean-based Approximation algorithm, the best path search is done in a similar fashion as the previous algorithms where we have to find the nearest candidate Q_k1 from s_l then establish the shortest path δ(s_l, Q_k1) between s_l and the candidate Q_k1 in road network distance instead of the Euclidean distance. After the shortest path δ(s_l, Q_k1) is established, we need to find the next nearest candidate Q_kn from the Q_k1 and establish the shortest path δ(Q_k1, Q_kn). We repeat the same step until every single candidate in the queue Q_k has been visited. Then we can find the shortest path δ(Q_kn, d_l) to end the trip.

The Euclidean distance merely helps to decide which vertices to visit based on the keywords chosen by the user; the final best path is still computed in road network distance. The approximation quality of this algorithm is very low, as it can overestimate the result by up to 300% based on our experiment. However, the running time of this algorithm is a lot faster compared to the other algorithms.
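
The two components can be sketched as follows. This is an illustration under assumptions, not the paper’s Algorithm 5: the coordinates, the road weights, and the function names are hypothetical, and plain Dijkstra is used for the road network part.

```python
import heapq
from math import dist as euclid   # math.dist is available from Python 3.8


def dijkstra(graph, source):
    """Shortest road-network distance from `source` to every reachable vertex."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def euclidean_candidates(coords, source, target, keyword_vertices):
    """Part 1: per keyword, pick the vertex with the smallest straight-line
    detour source -> vertex -> target."""
    picks = []
    for vertices in keyword_vertices.values():
        best = min(vertices, key=lambda v: euclid(coords[source], coords[v])
                   + euclid(coords[v], coords[target]))
        if best not in picks:
            picks.append(best)
    return picks


def road_network_cost(graph, source, target, picks):
    """Part 2: visit the picks nearest-first using road-network distances."""
    remaining = list(picks)
    stops, current, total = [source], source, 0
    while remaining:
        dist = dijkstra(graph, current)
        nxt = min(remaining, key=lambda v: dist[v])
        total += dist[nxt]
        remaining.remove(nxt)
        stops.append(nxt)
        current = nxt
    total += dijkstra(graph, current)[target]
    stops.append(target)
    return total, stops


# Hypothetical coordinates and roads: a1 looks close on the map but is only
# reachable through t, so the straight-line choice is misleading.
coords = {"s": (0, 0), "t": (4, 0), "a1": (2, 1), "a2": (2, -2)}
graph = {
    "s": {"t": 4, "a2": 3},
    "t": {"s": 4, "a2": 3, "a1": 3},
    "a1": {"t": 3},
    "a2": {"s": 3, "t": 3},
}
picks = euclidean_candidates(coords, "s", "t", {"book": ["a1", "a2"]})
total, stops = road_network_cost(graph, "s", "t", picks)
print(picks, total, stops)
```

In this toy setting the straight-line choice a1 leads to a road distance of 10, while visiting a2 would cost only 6, which illustrates why the approach is fast but can be inaccurate on road networks.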

6 Experiment

In this section, we compare the efficiency and accuracy of the four Best Path query processing algorithms from Section 5: Baseline Algorithm (BruteForce), Optimal Distance Approximation Search (OptDist), Ancestor Priority Approximation Search (AncestorPriority), and Euclidean-based Approximation Search (Euclidean).

6.1 Settings

6.1.1 Environment

We perform our experiments on 2.5 GHz Intel Core i7-4870 CPU and 12 GB RAM running 64-bit Ubuntu. All of the algorithms were written in single-threaded C++.

6.1.2 Datasets

We use real datasets from 9th DIMACS Implementation Challenge - Shortest Paths [20] and [55] for the road network datasets. We select four datasets: California, New York City, Colorado, and Florida. Table 21 provides the details of the size of the real-world road network datasets.

Table 21 Road Network Datasets

For the textual information, we utilize keyword sets based on [55] and assign them to the vertices in the road network datasets. As the textual part needs to be able to detect whether the user gives one or more negative keywords, a number of negative words from the sentiment analysis lexicons of [21, 22] are used in order to accommodate the negative keyword(s) in the user query.

6.1.3 Queries

In our experiments, we generate the keyword set K for the test queries with a random distribution from a keyword pool. The size of the keyword set in the test queries varies from 1 to 15, while the object density (see Footnote 1) of these keyword sets varies from 1% to 30% of the whole road network dataset. We also evaluate the impact of varying the distance between s_l and d_l, which ranges from 2% to 64% of the maximum distance between two vertices in the space.

6.2 Index evaluation

This section evaluates the proposed IG-Tree index for planning Best Path queries on road networks in terms of index building time and space consumption, and index reconstruction time(s) for negative keywords in the tested queries. Figure 10 shows the index building times and space consumption of the proposed IG-Tree and the G-Tree indices for New York City dataset. We see that index building time of IG-Tree is comparable to that of G-Tree though IG-Tree combines IR²-Tree with G-Tree. The index size of IG-Tree is slightly larger than that of G-Tree as inverted lists and Keyword Distance Matrices are maintained in IG-Tree in addition to Distance Matrices.

Finally, the time required to reconstruct the IG-Tree index for negative keywords in the Best Path queries is quite reasonable, as we observe from Figure 11. The IG-Tree takes only about 0.8 s to reconstruct the index for up to 10 (negative) keywords. In everyday route planning queries, however, K usually contains only a few keywords, around 5-6 including the negative ones. Therefore, we believe that the index reconstruction time of IG-Tree for a few negative keywords is acceptable in practical applications of Best Path queries.

6.3 Performance study

We evaluate our query processing algorithms on two metrics, specifically on the running time and approximation accuracy. The approximation accuracy shows the percentage of the result accuracy produced by each algorithm compared to the expected correct result obtained by running the baseline algorithm.

6.3.1 Effect of k⁺

The positive keywords (k⁺) given by the user play a very important role in the Best Path Query. Each k⁺ must be visited at least once; therefore, the more k⁺ there are to visit, the longer every algorithm is expected to run. Figure 12 shows the query performance as the number of k⁺ increases. In these experiments, all query keywords in K are positive, without any negative keywords. The experiments show that the approximation algorithms (OptDist, AncestorPriority, Euclidean) run much faster than the baseline algorithm. The baseline algorithm has the worst running time of all, as its time increases exponentially. According to our experiments, the average running time of the baseline algorithm is 0.48 ms for |K| = 1, but it increases to 21507.41 ms for |K| = 5. Even though the baseline algorithm offers a precise solution, the time taken to get the result makes it unsuitable for daily usage. Imagine planning a trip to a new country and waiting 17 hours to get the best path with |K| = 10. This is clearly impractical for everyday use.

As we proposed three approximation algorithms, we also evaluate the approximation accuracy of each algorithm. Figure 13 shows the accuracy (in percent) of each approximation algorithm relative to the baseline algorithm. When |K| = 1, all three approximation algorithms have high accuracy. However, as |K| increases, the accuracy decreases. OptDist has the best approximation of the three. Even though its approximation is not 100% accurate, its accuracy stays above 75%. This is very different from the Euclidean-based algorithm, whose accuracy is only 6.83% when |K| = 15. The Euclidean approximation algorithm clearly shows the worst trend, as its inaccuracy escalates drastically.

Based on this experiment, we can see that OptDist runs far faster than the baseline algorithm and achieves better approximation accuracy than the other two approximation algorithms. So we can conclude that OptDist is the best overall choice for Best Path Queries with various numbers of keywords in K.

6.3.2 Effect of k⁻

The negative keywords (k⁻) have a great impact on the Best Path Query. As previously discussed in Section 4, there are several cases for what happens to the IG-Tree when a k⁻ is found. Often, when k⁻ is located on a border, no path can be retrieved at all. So the number of k⁻ is always kept smaller than the number of k⁺ in this experiment to ensure that we can retrieve some results. For this experiment, the query keywords in K include both positive and negative keywords.

According to our experimental result in Figure 14, the Euclidean-based algorithm always has a faster running time than the other three algorithms. OptDist and AncestorPriority are almost the same in terms of running time, even though AncestorPriority still seems to be a bit faster than OptDist. Meanwhile, the baseline algorithm has the worst running time, as expected. Having a negative keyword k⁻ affects the running time of some queries, as path reconstruction may happen during query processing. This also explains the difference between the times in Figures 12 and 14: the queries in Figure 12 contain only positive keywords and do not require any path reconstruction, so they run faster than those in Figure 14.

Another metric that we test is the accuracy of the approximation algorithms. Figure 15 shows the approximation accuracy of each algorithm relative to the result of the baseline algorithm. The trend in Figure 15 is almost the same as the trend in Figure 13, which might indicate that the path reconstructions caused by k⁻ in these queries do not significantly affect the accuracy. We can also see that the accuracy of OptDist and AncestorPriority is the same in this case, and both have better accuracy than the Euclidean-based algorithm. The Euclidean approximation is still the worst among the algorithms, with very low accuracy, even though its running time is a bit faster than the others.

6.3.3 Effect of keyword densities

In this experiment, we observe the query performance as we increase the keyword densities. Figure 16 shows the running time of queries for densities from 0.01 to 0.30. The running time of the baseline algorithm increases drastically compared to the rest. Even in the lowest density case, 0.01, the baseline algorithm takes about 17x more time than the OptDist algorithm. The Euclidean-based algorithm, however, keeps a nearly constant running time even as the density increases. OptDist and AncestorPriority, on the other hand, have similar running times.

We also run an experiment on the running time when |K| is varied from 1 to 15 while the density of each keyword is 0.05. Figure 17 shows the result of this experiment, which interestingly shows a constant running time for the Euclidean algorithm. The trend indicates that the Euclidean-based algorithm has the most constant running time, especially for denser datasets. This is totally different from the other three algorithms, whose running times keep increasing with the density. Although the experiments in Figures 12 and 14 are similar to this one, their results differ because the keyword density there is randomized and usually below 0.05.

We conducted another experiment to test the running time of queries with all negative keywords while increasing the density of the keywords from 0.01 to 0.055. We only increase the density up to 0.055 because a higher density of negative keywords returns no path/no result at all. Figure 18 shows the result of this experiment. Surprisingly, the running time of the Euclidean algorithm increases drastically, following almost the same trend as the baseline algorithm. Meanwhile, OptDist and AncestorPriority run in nearly constant time when the keywords are all negative.

6.3.4 Effect of positive and negative keywords ratio

Figure 19 shows the running time for different ratios of positive to negative keywords. The ratio is based on 0.01 density. In this experiment, we limit the ratio of the keywords (positive:negative) to 1:0, 0:1, 1:1, 5:0, 0:5, and 5:1. We exclude the ratios with more negative keywords than positive keywords, as no path is retrieved most of the time with this kind of query.

In Figure 19, we can see that the running time of the OptDist algorithm increases as the number of query keywords increases (both positive and negative). AncestorPriority follows a similar trend to OptDist, with its running time increasing as more keywords are involved. The baseline algorithm also shows an increasing trend, but its running time increases exponentially, as expected from the previous experiments. The Euclidean algorithm does not show this increasing trend, as its running time is comparatively constant even when more positive keywords are added to the query.

6.3.5 Effect of distance between s_l and d_l

We also evaluate the effect of varying the distance between s_l and d_l. The distance between s_l and d_l is varied over 2%, 4%, 8%, 16%, 32%, and 64% of the maximum distance between two points in the road network datasets. We set |K| = 15 by default and include both positive and negative keywords. Figure 20 shows the experimental results for the source-destination distance. We do not include the baseline in Figure 20 since its running time for 2% in the NY dataset already exceeds 10,000 ms, which is far above that of the other three algorithms.

From the experimental result, we can see that the three algorithms are running in constant time even though the source-destination distance increases. Both OptDist and AncestorPriority have a similar trend, while Euclidean has better running time most of the time than the rest.

6.3.6 Summary

Based on our experimental study, each algorithm has its own strengths. The baseline algorithm offers accurate results, but it has the worst running time, which increases exponentially as the queries and the dataset grow. OptDist has the best approximation compared to the other two approximation algorithms (AncestorPriority and Euclidean). Its running time is far better than that of the baseline algorithm, but the AncestorPriority and Euclidean-based algorithms still beat it by a few microseconds, especially when all the keywords in the user query are positive.

AncestorPriority often follows the trend of OptDist in both running time and accuracy. In terms of running time, AncestorPriority is frequently faster than OptDist, but only by a few microseconds. On the other hand, the accuracy of AncestorPriority is slightly lower than that of OptDist because of its early pruning. Compared to Euclidean, AncestorPriority has better accuracy, although it is slower.

The main strength of the Euclidean-based algorithm is its fast running time. It can be used for a quick approximation, but it has the lowest accuracy of all the algorithms. It also does not perform well when the density of the negative keywords is high.

In general, OptDist offers the best solution in comparison with the other algorithms. Even though the Euclidean-based algorithm is faster than OptDist when the queries contain only positive keywords, OptDist still provides a better approximation. OptDist is also more stable in both its speed and its approximation accuracy.

7 Related works

This section discusses the related works, specifically on Spatial Keyword and Route Planning Queries.

7.1 Spatial keywords queries

With the growth of geographical search engines, Spatial Keyword Queries are becoming more crucial and popular among researchers. Most of the early works on Spatial Keyword Queries focus on queries such as top-k nearest neighbor (TkNN) queries [7, 8, 9, 11, 15, 35, 42]. In TkNN queries, the goal is to rank objects by their keyword similarity (between the object's keywords and the query keywords) and their distance from the specified query location, in order to retrieve the k objects with the highest ranking. As discussed earlier, this type of query mainly accepts the user's spatial location and keywords as input, and produces spatial objects with matching keywords as output.
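The ranking idea behind TkNN queries can be sketched as follows. This is only an illustration under stated assumptions: the linear combination weighted by `alpha`, the Jaccard keyword similarity, the Euclidean distance, and the `max_dist` normalisation are choices made here for the example, and the cited works each define their own scoring functions.

```python
import heapq
import math


def jaccard(a, b):
    """Keyword similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(objects, query_loc, query_kws, k, alpha=0.5, max_dist=1.0):
    """Return the k objects with the highest combined score.

    Each object is a (name, location, keywords) tuple. The score is an
    assumed weighted sum of spatial proximity and keyword similarity.
    """
    def score(obj):
        _, loc, kws = obj
        # Proximity is 1 at the query location and 0 at max_dist or beyond.
        proximity = 1.0 - min(math.dist(loc, query_loc) / max_dist, 1.0)
        return alpha * proximity + (1.0 - alpha) * jaccard(kws, query_kws)

    return heapq.nlargest(k, objects, key=score)


# Hypothetical objects, for illustration only.
objects = [
    ("A", (0.0, 0.0), {"cafe"}),
    ("B", (3.0, 4.0), {"cafe", "wifi"}),
    ("C", (0.0, 1.0), {"bank"}),
]
ranked = top_k(objects, (0.0, 0.0), {"cafe", "wifi"}, k=2, max_dist=20.0)
ranked_names = [name for name, _, _ in ranked]
```

In this example, object B ranks above object A even though A sits exactly at the query location, because B matches both query keywords. The output is a set of objects rather than a route.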

Many other Spatial Keyword Query variants are also based on TkNN queries. These works extend TkNN queries to handle moving objects [40], continuous objects [17], reverse top-k queries [16, 32], joint queries [23, 41], or interactive TkNN queries [49]. Beyond the works that focus on TkNN queries, other variants of Spatial Keyword Queries have also been proposed, such as collective Spatial Keyword querying [10, 31, 48], diversified Spatial Keyword search [47], region-based query [13], scalable continual top-k query [43], reverse spatial and textual k nearest neighbor query [34], spatio-textual data clustering [14], fuzzy keyword search [2], and m-closest keyword queries [46]. However, none of these queries can be classified as route planning queries.

As Spatial Keyword Queries have become more varied, a number of indexing techniques that can process both spatial and textual data have been proposed in the past years. Many of the indexing techniques for Euclidean space build on the R-Tree, attaching additional textual information to it so that textual data can be processed. Some of these R-Tree based indices are the IR-Tree [28], the bR*-Tree [45], and the IR²-Tree [15]. The IR²-Tree has been discussed in Section 4, as it inspired us to develop the IG-Tree for planning Best Path queries.

Several indexing techniques for Spatial Keyword Queries on road networks have also been studied recently. These indexing techniques are considerably more complex than indices for Euclidean data spaces. One of the earlier works was proposed by Rocha-Junior et al. [35], whose basic indexing architecture consists of four components. The first is the spatial component, which uses the network R-Tree. The second is the adjacency component, which uses an adjacency B-Tree to traverse the network. The third is the mapping component, which uses a Map B-Tree to map the adjacency edges to the MBR that encloses them. The last is the spatio-textual component, which stores the spatial and textual properties of the objects. Another work on road networks was done by Luo et al. [33]. They introduced an indexing technique that is very different from that of Rocha-Junior et al. [35]. The proposed index, the Node-Partition-Distance (NPD) index, keeps useful distances so that the exact distance and the query keyword coverage can be computed independently. Another recent spatial keyword index on road networks was proposed by Li et al. [30]. They proposed SKQAI, a novel air index for spatial keyword query processing on road networks. The SKQAI indexing technique consists of three components: a weighted Quad-Tree of the road network, keyword Quad-Trees, and a network distance bound array.

Researchers have thus proposed many different indexing techniques over the past few years in order to process diverse spatial keyword queries efficiently. However, the proposed indexing techniques still treat the spatial part and the textual part of the data as two different entities. Current techniques often adopt hybrid indexing, in which separate indices are kept for the spatial data and the textual data and are then combined, especially for road networks. Looking at the existing studies, we find that these indices are not applicable to route planning queries on Spatio-Textual data, e.g., Best Path queries.

7.2 Route planning queries

There are a number of route planning queries similar to Best Path. The most well-known is the Trip Planning Query (TPQ), which retrieves the best trip between two different locations that passes through at least one point from each of the chosen categories [27]. Since the introduction of TPQ, many studies have investigated its variations and applications in certain areas, such as Group TPQ (GTP) [19] for processing multiple users' trips, and a recent study on TPQ with Location Privacy [37] for protecting the user's location privacy. Another popular route planning query is the Optimal Sequenced Route (OSR) query, a spatial query that finds the minimum-distance route from a source location that passes through a set of sequenced categories [36]. However, these queries do not consider any keyword processing. The Best Path Query focuses on the Spatio-Textual field, so it needs to process the textual part of the objects. Another difference is that none of these existing works takes negative keywords into consideration. Every chosen category in TPQ and OSR must be visited, while in Best Path there are some categories that we have to avoid, which increases the complexity of the problem.
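The role of negative keywords can be illustrated with a small exact search. The sketch below is not one of the algorithms proposed in this paper and does not use the IG-Tree; it is a textbook Dijkstra search over (vertex, covered-keywords) states, under the assumptions that objects sit on vertices, that every positive keyword must be matched by at least one visited vertex, and that no visited vertex may carry a negative keyword. The function name, graph, and keywords are hypothetical.

```python
import heapq


def best_path_length(graph, keywords, source, dest, positive, negative):
    """Length of the shortest source-to-dest walk that covers every positive
    keyword and never visits a vertex carrying a negative keyword.

    graph: vertex -> list of (neighbour, edge_length) pairs.
    keywords: vertex -> set of keywords of the objects on that vertex.
    Returns None when no such walk exists.
    """
    positive = sorted(set(positive))
    negative = set(negative)
    bit = {kw: 1 << i for i, kw in enumerate(positive)}
    full = (1 << len(positive)) - 1

    def blocked(v):
        return bool(keywords.get(v, set()) & negative)

    def cover(v):
        mask = 0
        for kw in keywords.get(v, set()):
            mask |= bit.get(kw, 0)
        return mask

    if blocked(source) or blocked(dest):
        return None

    start_mask = cover(source)
    dist = {(source, start_mask): 0.0}
    heap = [(0.0, source, start_mask)]
    while heap:
        d, v, mask = heapq.heappop(heap)
        if d > dist.get((v, mask), float("inf")):
            continue  # stale heap entry
        if v == dest and mask == full:
            return d
        for u, w in graph.get(v, []):
            if blocked(u):
                continue  # vertices with negative keywords are never entered
            new_mask = mask | cover(u)
            nd = d + w
            if nd < dist.get((u, new_mask), float("inf")):
                dist[(u, new_mask)] = nd
                heapq.heappush(heap, (nd, u, new_mask))
    return None


# Hypothetical undirected network: c is a dead end reachable only through a.
graph = {
    "s": [("a", 1.0), ("b", 2.0)],
    "a": [("s", 1.0), ("d", 1.0), ("c", 1.0)],
    "b": [("s", 2.0), ("d", 2.0)],
    "c": [("a", 1.0)],
    "d": [("a", 1.0), ("b", 2.0)],
}
keywords = {"a": {"toll"}, "b": {"atm"}, "c": {"cafe"}}

plain = best_path_length(graph, keywords, "s", "d", [], [])
with_cafe = best_path_length(graph, keywords, "s", "d", ["cafe"], [])
atm_no_toll = best_path_length(graph, keywords, "s", "d", ["atm"], ["toll"])
cafe_no_toll = best_path_length(graph, keywords, "s", "d", ["cafe"], ["toll"])
```

In this toy network, requiring "cafe" forces a detour into the dead end at c, while avoiding "toll" removes vertex a and makes the "cafe" query infeasible. The number of states grows as 2^|positive keywords|, so the search only suits small queries.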

In the Spatio-Textual area itself, to the best of our knowledge there is one study on route planning, namely the Keyword-aware Optimal Route Search (KOR) [12]. It is a query that finds an optimal route covering a set of user-given keywords subject to a specific budget constraint and objective score [12]. Compared with the Best Path Query, KOR differs in that it requires a budget constraint for processing. Unlike Best Path, KOR also does not take negative keywords into consideration.

Most route planning problems are regarded as generalizations of the Traveling Salesman Problem (TSP) [3, 4]. They are NP-hard problems. The solutions offered for these queries are polynomial-time approximation algorithms. Even so, the solutions offered do not involve significant pre-processing. This causes more computation at query time, particularly since the computation requires processing both spatial and textual relevancy. An on-the-go solution therefore does not always guarantee efficient performance, especially in the Spatio-Textual field. In this research, we therefore attempt to provide a better solution to this problem by offering a pre-processed index that incorporates both keywords and spatial road networks.

8 Conclusion

In this paper, we introduce a new variant of the Spatial Keyword Query, namely the Best Path Query. The Best Path Query is an NP-Hard problem, as the Travelling Salesman Problem (TSP) can be reduced to it. Throughout our study, we develop an indexing technique called the IG-Tree that can process both spatial and textual information. This indexing technique can be used for various types of Spatial Keyword Queries, especially the Best Path Query. Three algorithms to solve the Best Path Query are also proposed in this paper, namely the Optimal Distance Approximation Search, the Ancestor Priority Search, and the Euclidean-based Approximation solution. Each algorithm has its own strengths and weaknesses. The effectiveness and efficiency of the proposed algorithms are demonstrated through our extensive experiments. As possible future work, we can further improve the IG-Tree by including a keyword scoring function in order to increase the keyword search accuracy.

Notes

Object density: the quantity of keyword matched objects for each query keyword compared to the number of vertices in the road network.

References

1. Adhinugraha, K.M., Taniar, D., Indrawan, M.: Finding reverse nearest neighbors by region. Concurrency Comput. Pract. Exp. 26(5), 1142–1156 (2014)
2. Alsubaiee, S., Li, C.: Fuzzy keyword search on spatial data. In: International Conference on Database Systems for Advanced Applications, pp. 464–467 (2010)
3. Arora, S.: Polynomial time approximation schemes for euclidean traveling salesman and other geometric problems. J. ACM 45(5), 753–782 (1998)
4. Arora, S.: Approximation schemes for np-hard geometric optimization problems: a survey. Math. Program. 97(1), 43–69 (2003)
5. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: an efficient and robust access method for points and rectangles. In: ACM SIGMOD, pp. 322–331 (1990)
6. Chen, Y.Y., Suel, T., Markowetz, A.: Efficient query processing in geographic Web search engines. In: ACM SIGMOD, pp. 277–288 (2006)
7. Chen, L., Cong, G., Jensen, C.S., Wu, D.: Spatial keyword query processing: an experimental evaluation. In: Proceedings of the VLDB Endowment, vol. 6, pp. 217–228 (2013)
8. Cong, G., Jensen, C.S., Wu, D.: Efficient retrieval of the top-k most relevant spatial Web objects. Proc. VLDB Endow. 2(1), 337–348 (2009)
9. Cao, X., Cong, G., Jensen, C.S.: Retrieving top-k prestige-based relevant spatial Web objects. Proc. VLDB Endow. 3(1-2), 373–384 (2010)
10. Cao, X., Cong, G., Jensen, C.S., Ooi, B.C.: Collective spatial keyword querying. In: ACM SIGMOD, pp. 373–384 (2011)
11. Cao, X., Chen, L., Cong, G., Jensen, C.S., Qu, Q., Skovsgaard, A., Wu, D., Yiu, M.L.: Spatial keyword querying. In: Conceptual Modeling, pp. 16–29 (2012)
12. Cao, X., Chen, L., Cong, G., Guan, J., Phan, N.T., Xiao, X.: Kors: keyword-aware optimal route search system. In: IEEE ICDE, pp. 1340–1343 (2013)
13. Cao, X., Cong, G., Jensen, C.S., Yiu, M.L.: Retrieving regions of interest for user exploration. Proc. VLDB Endow. 7(9), 733–744 (2014)
14. Choi, D.W., Chung, C.W.: A k-partitioning algorithm for clustering large-scale spatio-textual data. Inf. Syst. 64(Supplement C), 1–11 (2017)
15. De Felipe, I., Hristidis, V., Rishe, N.: Keyword search on spatial databases. In: IEEE ICDE, pp. 656–665 (2008)
16. Gao, Y., Qin, X., Zheng, B., Chen, G.: Efficient reverse top-k boolean spatial keyword queries on road networks. IEEE Trans. Knowl. Data Eng. 27(5), 1205–1218 (2015)
17. Guo, L., Shao, J., Aung, H., Tan, K.L.: Efficient continuous top-k spatial keyword queries on road networks. GeoInformatica 19(1), 29–60 (2015)
18. Hariharan, R., Hore, B., Li, C., Mehrotra, S.: Processing spatial-keyword (Sk) queries in geographic information retrieval (Gir) systems. In: ACM SSDBM, pp. 16–16 (2007)
19. Hashem, T., Hashem, T., Ali, M.E., Kulik, L.: Group trip planning queries in spatial databases. In: SSTD, pp. 259–276 (2013)
20. http://www.dis.uniroma1.it/challenge9/download.shtml. Last accessed 22 January 2018
21. http://www.wjh.harvard.edu/~inquirer/No.html. Last accessed 22 January 2018
22. https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/blob/master/data/opinion-lexicon-English/negative-words.txt. Last accessed 22 January 2018
23. Hu, H., Li, G., Bao, Z., Feng, J., Wu, Y., Gong, Z., Xu, Y.: Top-k spatio-textual similarity join. IEEE Trans. Knowl. Data Eng. 28(2), 551–565 (2016)
24. Hwang, K., Cho, S.: A lifelog browser for visualization and search of mobile everyday-life. Mob. Inf. Syst. 10(3), 243–258 (2014)
25. Jones, C.B., Abdelmoty, A.I., Finch, D., Fu, G., Vaid, S.: Geographic information science: proceedings of the third international conference, GIScience, Chap. The spirit spatial search engine: architecture, ontologies and spatial indexing (2004)
26. Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: Proceedings of the ACM/IEEE Conference on Supercomputing (1995)
27. Li, F., Cheng, D., Hadjieleftheriou, M., Kollios, G., Teng, S.H.: On trip planning queries in spatial databases. In: SSTD, pp. 273–290 (2005)
28. Li, Z., Lee, K.C.K., Zheng, B., Lee, W.C., Lee, D., Wang, X.: Ir-tree: an efficient index for geographic document search. IEEE Trans. Knowl. Data Eng. 23(4), 585–599 (2011)
29. Li, Y., Wu, D., Xu, J., Choi, B., Su, W.: Spatial-aware interest group queries in location-based social networks. Data Knowl. Eng. 92(Supplement C), 20–38 (2014)
30. Li, Y., Li, G., Li, J., Yao, K.: Skqai: a novel air index for spatial keyword query processing in road networks. Inf. Sci. 430-431(Supplement C), 17–38 (2018)
31. Long, C., Wong, R.C.W., Wang, K., Fu, A.W.C.: Collective spatial keyword queries: a distance owner-driven approach. In: ACM SIGMOD, pp. 689–700 (2013)
32. Lu, J., Lu, Y., Cong, G.: Reverse spatial and textual k nearest neighbor search. In: ACM SIGMOD, pp. 349–360 (2011)
33. Luo, S., Luo, Y., Zhou, S., Cong, G., Guan, J., Yong, Z.: Distributed spatial keyword querying on road networks. In: EDBT, pp. 235–246 (2014)
34. Luo, C., Junlin, L., Li, G., Wei, W., Li, Y., Li, J.: Efficient reverse spatial and textual k nearest neighbor queries on road networks. Knowl-Based Syst. 93(Supplement C), 121–134 (2016)
35. Rocha-Junior, J.B., Nørvåg, K.: Top-K spatial keyword queries on road networks. In: EDBT, pp. 168–179 (2012)
36. Sharifzadeh, M., Kolahdouzan, M., Shahabi, C.: The optimal sequenced route query. VLDB J. 17(4), 765–787 (2008)
37. Soma, S.C., Hashem, T., Cheema, M.A., Samrose, S.: Trip planning queries with location privacy in spatial databases. World Wide Web 20(2), 205–236 (2017)
38. Waluyo, A.B., Srinivasan, B., Taniar, D.: Research in mobile database query optimization and processing. Mob. Inf. Syst. 1(4), 225–252 (2005)
39. Waluyo, A.B., Taniar, D., Rahayu, W., Srinivasan, B.: Mobile service oriented architectures for nn-queries. J. Netw. Comput. Appl. 32(2), 434–447 (2009)
40. Wu, D., Yiu, M.L., Jensen, C.S., Cong, G.: Efficient continuously moving top-k spatial keyword query processing. In: IEEE ICDE, pp. 541–552 (2011)
41. Wu, D., Yiu, M.L., Cong, G., Jensen, C.S.: Joint top-k spatial keyword query processing. IEEE Trans. Knowl. Data Eng. 24(10), 1889–1903 (2012)
42. Xu, J., Lu, H.: Efficiently answer top-k queries on typed intervals. Inf. Syst. 71(Supplement C), 164–181 (2017)
43. Xu, Y., Guan, J., Li, F., Zhou, S.: Scalable continual top-k keyword search in relational databases. Data Knowl. Eng. 86, 206–223 (2013)
44. Yairi, I., Igi, S.: Mobility support gis with universal-designed data of barrier/barrier-free terrains and facilities for all pedestrians including the elderly and the disabled. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 2909–2914 (2006)
45. Zhang, D., Chee, Y.M., Mondal, A., Tung, A.K., Kitsuregawa, M.: Keyword search in spatial databases: towards searching by document. In: IEEE ICDE, pp. 688–699 (2009)
46. Zhang, D., Ooi, B.C., Tung, A.K.H.: Locating mapped resources in Web 2.0. In: IEEE ICDE, pp. 521–532 (2010)
47. Zhang, C., Zhang, Y., Zhang, W., Lin, X., Cheema, M.A., Wang, X.: Diversified spatial keyword search on road networks. In: EDBT, pp. 367–378 (2014)
48. Zhang, P., Lin, H., Yao, B., Lu, D.: Level-aware collective spatial keyword queries. Inf. Sci. 378(Supplement C), 194–214 (2017)
49. Zheng, K., Su, H., Zheng, B., Shang, S., Xu, J., Liu, J., Zhou, X.: Interactive top-k spatial keyword queries. In: IEEE ICDE, pp. 423–434 (2015)
50. Zhong, R., Fan, J., Li, G., Tan, K.L., Zhou, L.: Location-aware instant search. In: ACM CIKM, pp. 385–394 (2012)
51. Zhong, R., Li, G., Tan, K.L., Zhou, L.: G-Tree: an efficient index for knn search on road networks. In: ACM CIKM, pp. 39–48 (2013)
52. Zhong, R., Li, G., Tan, K.L., Zhou, L., Gong, Z.: G-tree: an efficient and scalable index for spatial search on road networks. IEEE Trans. Knowl. Data Eng. 27(8), 2175–2189 (2015)
53. http://www.statisticbrain.com/mobile-browser-vs-application-preferences/
54. http://blog.globalwebindex.net/chart-of-the-day/top-global-smartphone-apps-who-s-in-the-top-10/
55. http://www.cs.utah.edu/~lifeifei/SpatialDataset.htm

Author information

Authors and Affiliations

Faculty of Information Technology, Monash University, Melbourne, Australia
Anasthasia Agnes Haryanto, David Taniar & Muhammad Aamir Cheema
School of Information and Communication Technology, Griffith University, Gold Coast, Australia
Md. Saiful Islam

Corresponding author

Correspondence to Anasthasia Agnes Haryanto.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Haryanto, A.A., Islam, M.S., Taniar, D. et al. IG-Tree: an efficient spatial keyword index for planning best path queries on road networks. World Wide Web 22, 1359–1399 (2019). https://doi.org/10.1007/s11280-018-0643-5

Received: 30 July 2018
Revised: 06 September 2018
Accepted: 31 October 2018
Published: 15 November 2018
Issue Date: 15 July 2019
DOI: https://doi.org/10.1007/s11280-018-0643-5
