Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

The proliferation of GPS and tracking technology has brought to availability huge volumes of trajectories from real moving objects such as mobile phone users, vehicles and animals. Searching such a collection of trajectories finds several applications, including route recommendation, behavior mining, and in transportation systems [1, 2]. Different from conventional retrieval tasks which identify similar trajectories to a given one or those crossing a specific spatial region, in this paper we focus on point-based search, which retrieves trajectories based on given points. In particular, taking as input a set of query points Q (e.g., a particular set of POIs), the distance-based trajectory search studied in [3, 4] retrieves the trajectories that pass as close as possible to all query points. Specifically, the distance of a trajectory t to Q is computed by summing up, for each query point \(q\in Q\), its distance to the nearest point in t.

Consider for instance a collection of touristic trajectories; a travel agency issues a distance-based query to survey or recommend popular routes that pass close to specific sightseeing attractions. As another example, query set Q could contain traffic congestion points; in this case, the traffic department seeks to discover the causes of the congestion by analyzing the trajectories that pass near the points in Q. In the context of surveillance and security applications, Q may contain locations of crime scenes, and hence the police department issues a distance-based query to investigate the correlation of these crime locations by identifying suspects who moved close to all of them.

Contributions. This paper tackles two problems under the point-based trajectory search. First, we thoroughly study the efficient evaluation of distance-based trajectory search. We review in detail existing algorithms \(\mathtt {IKNN}\)  [3] and \(\mathtt {GH}\)/\(\mathtt {QE}\)  [4]. These methods follow a candidate generation and refinement paradigm, and invoke a nearest neighbor (NN) search centered at each query point to examine the trajectories in ascending order of their distance to Q. By analyzing the pros and cons of these methods, we design a hybrid NN-based algorithm which consistently outperforms \(\mathtt {IKNN}\)  and \(\mathtt {GH}\)/\(\mathtt {QE}\) by over an order of magnitude. Going one step further, we tackle the inherent shortcomings of the NN-based approach itself, namely (a) the increased I/O cost due to independently running multiple NN searches and (b) the increased CPU cost for continuously maintaining a priority queue for each NN search. We propose a novel spatial range-based approach, which is up to 2 times faster than our hybrid algorithm.

Second, we observe that the distance-based search ranks trajectories solely on how close they pass to the query points in Q, ignoring however other qualitative characteristics of the retrieved results. To fill this gap, we introduce a practical variant of distance-based trajectory search, which also takes into account the temporal aspect of the trajectories. Specifically, this bounded distance-based search filters out non-interesting trajectories, whose points closest to Q span a time interval greater than a user-defined threshold.

Outline. The rest of the paper is organized as follows. Section 2 formally defines the distance-based and bounded distance-based trajectory search while Sects. 3 and 4 address their efficient evaluation. Then, Sect. 5 discusses problem variants where (a) the trajectories are ranked both on their distance to the query points and the time interval they span, and (b) Q is a sequence of query points, instead of a set. Section 6 presents our experimental analysis. Finally, Sect. 7 outlines related work, while Sect. 8 concludes the paper.

2 Problem Definition

Let T be a collection of trajectories. A trajectory in T is defined as a sequence of spatio-temporal points \(\{p_1,\ldots , p_n\}\), each represented by a \(\langle latitude, longitude,\) \(timestamp\rangle \) triple. The input of point-based trajectory search over collection T is a set of m spatial query points \(Q = \{q_1,\ldots ,q_m\}\). Given a query point \(q_j \in Q\) and a trajectory \(t_i \in T\), we define the \(\langle p^*_{ij},q_j\rangle \)  matching pair based on the nearest to \(q_j\) point \(p^*_{ij}\) of trajectory \(t_i\), i.e., \(p^*_{ij} = \arg \min _{p \in t_i}dist(p,q_j)\), where \(dist(\cdot ,\cdot )\) denotes the distance (e.g., Euclidean) between two points in space. We then define the distance of a trajectory to Q based on the matching pairs for every query point \(q_j\) as:

$$\begin{aligned} dist(t_i,Q) = \sum _{q_j \in Q}{dist(p^*_{ij},q_j)} \end{aligned}$$
(1)

Consider the example in Fig. 1(a), where query points are represented as diamonds, and trajectory points as circles; filled circles indicate matched points of the trajectory to query points. For trajectory \(t_1\), point \(p^*_{11}\) is its closest point to query point \(q_1\), and hence \(\langle p^*_{11}, q_1 \rangle \) represents a matching pair. The other matched trajectory points of \(t_1\) are \(p^*_{12}\) and \(p^*_{13}\). Note that it is possible for a trajectory point to be matched with multiple query points. This is the case with trajectory \(t_3\), where \(p^*_{32}\) is the closest point to both \(q_1\) and \(q_2\), i.e., \(p^*_{31} \equiv p^*_{32}\).

Fig. 1.
figure 1

Distance-based trajectory search with 4 trajectories, \(T = \{t_1, \dots , t_4\}\), and 3 query points, \(Q = \{q_1, \dots , q_3\}\); \(t_1\), \(t_2\) is the result to 2-DTS(TQ), while \(t_2\), \(t_3\) the result to 2-BDTS\((T,Q,\tau )\)

We now formally define the distance-based trajectory search problem [3, 4].

Problem 1

(Distance-based Trajectory Search). Given a collection of trajectories T and a set of query points Q, the k-Distance-based Trajectory Search, denoted by k-DTS(TQ), retrieves a subset of k trajectories \(R \subseteq T\) such that for each \(t \in R\) and \(t^{\prime } \in T \backslash R\), \(dist(t,Q) \le dist(t^{\prime },Q)\) holds.

Returning to the example of Fig. 1(a), trajectory \(t_1\) has the lowest distance to Q, followed by \(t_2\), \(t_3\) and \(t_4\); hence, the result to 2-DTS(TQ) is \(t_1\), \(t_2\).

Next, we introduce a novel point-based trajectory search problem by also taking into account the temporal aspect of the trajectories. Let \(P^*_i\) be the set of all matching pairs for a trajectory \(t_i\), sorted ascending on the timestamp of the involved trajectory points. We define the span of trajectory \(t_i\) with respect to Q, denoted by \(span(t_i,Q)\), as the length of the time interval between the first and the last pair in \(P^*_i\), or equivalently:

$$\begin{aligned} span(t_i, Q) = \max _{q_x, q_y \in Q} (timestamp(p^*_{ix}) - timestamp(p^*_{iy})) \end{aligned}$$
(2)

Intuitively, \(span(t_i,Q)\) equals the total time needed to reach as close as possible to all query points in Q, following trajectory \(t_i\).

Problem 2

(Bounded Distance-based Trajectory Search). Given a collection of trajectories T, a set of query points Q and a span threshold \(\tau \), the k-Bounded Distance-based Trajectory Search, denoted by k-BDTS(T,Q,\(\tau \)), retrieves the subset of k trajectories R \(\subseteq \) T such that:

  • for each t \(\in \) R, span(t,Q) \(\le \tau \) holds, and

  • for each t\(^\prime \) \(\in \) T \(\backslash \) R with span(t\(^\prime \),Q) \(\le \tau \), dist(t,Q) \(\le \) dist(t\(^\prime \),Q) holds.

Returning to Fig. 1(a), assume for simplicity that trajectory points are reported in fixed time intervals. As a result, the span of a trajectory is proportional to the number of its points from the first to the last matched point (excluding the first). For example, \(span(t_1,Q) = 4\) as there are 4 points from \(p^*_{11}\) and up to \(p^*_{13}\). Similarly, we obtain the spans of \(t_2\), \(t_3\), \(t_4\) as 2, 1, 2, respectively. Figure 1(b) plots the trajectories in the span-dist plane. DTS ignores the span values and simply returns the trajectories with the lowest dist coordinate. In contrast, BDTS introduces a threshold, e.g., \(\tau =3\), on the span of the trajectories, depicted as the dashed vertical line. Trajectories to the right of this line, i.e., \(t_1\), do not qualify as BDTS results. Therefore, the result of 2-BDTS is \(t_2\), \(t_3\), i.e., the trajectories with the 2 lowest distances among those left of the line. Notice that BDTS may not return the trajectory with the lowest distance to Q if its span exceeds the threshold; e.g., \(t_1\) in Fig. 1.

Depending on the application, one may consider alternative definitions for point-based trajectory search that take into account both the distance and the span metrics. We briefly overview one of them in Sect. 5, where we also discuss trajectory search given a sequence of query points, instead of a set.

3 Distance-Based Trajectory Search

We first discuss trajectory search based on the distance to a set of query points. Section 3.1 revisits existing work, while Sects. 3.2 and 3.3 present our NN-based and spatial range-based methods, respectively.

3.1 Existing Methods

Methods \(\mathtt {IKNN}\)  [3] and \(\mathtt {GH}\)/\(\mathtt {QE}\) [4] have previously tackled distance-based trajectory search. Note that in [3] the problem was defined with respect to the similarity of a trajectory \(t_i\) to the set of query points Q, defined as \(sim(t_i,Q) = \sum _{q_j \in Q}{e^{-dist(p^*_{ij},q_j)}}\). In what follows, we describe the straightforward adaptation of the \(\mathtt {IKNN}\)  algorithm for the distance metric of Eq. (1) (which was also used in [4]). The adaptation of \(\mathtt {GH}\)/\(\mathtt {QE}\) and our methods (Sects. 3.2 and 3.3) to the similarity metric of [3] is also straightforward and therefore, omitted. Moreover, the relative performance of all methods is identical independent of the metric used.

All existing methods adopt a candidate generation and refinement evaluation paradigm. During the first phase, a set of candidate trajectories is determined by incrementally retrieving the nearest trajectory points to the query points in Q. For this purpose, the methods utilize a single R-tree to index all trajectory points. A candidate trajectory t is called a full match if the matching pairs of t to all query points in Q have been identified; otherwise, t is a partial match. As soon as the candidate set is guaranteed to include the final results (even as partial matches), candidate generation is terminated, and the refinement phase is then employed to identify and output the results.

figure a

The \(\mathtt {IKNN}\)  Algorithm. Note that the \(\mathtt {IKNN}\)  algorithm comes in two flavors; in the following, we consider the one based on best-first search, as it was shown in [3] to be both faster and require fewer I/O operations. Algorithm 1 shows the pseudocode of \(\mathtt {IKNN}\) . During candidate generation (Lines 2–6), the algorithm iterates over the points of Q in a round robin manner. For each query point \(q_j\), the (next) batch of nearest to \(q_j\) trajectory points is retrieved using the R-tree index, in Line 4. The nearest neighbor search retrieves a different number of trajectory points \(\delta _j\) per query point \(q_j\), in order to expedite the termination of this first phase (details in [3]). Based on the newly identified matching pairs that involve \(q_j\), the set of candidates C is then updated in Line 5 by either adding new partial matches or filling an empty slot for existing. For each partial match \(t_i\) in C, \(\mathtt {IKNN}\)  computes an upper bound of its distance to Q by setting the distance of \(t_i\) to every unmatched query point equal to the diameter of the space (maximum possible distance between two points):Footnote 1

$$\begin{aligned} \overline{dist}(t_i,Q) = \sum _{q_j \in Q_i}{dist(p^*_{ij},q_j) + |Q\backslash Q_i| \cdot DIAM}, \end{aligned}$$
(3)

where set \(Q_i \subseteq Q\) contains all the query points already matched to a point in trajectory \(t_i\). We denote by \(UB_k\) the k-th smallest among the distance bounds for the trajectories in C. In addition, \(\mathtt {IKNN}\)  computes a lower bound LB of the distance to Q for all unseen trajectories (i.e., those not contained in C), by aggregating the distance of the farthest (retrieved so far) trajectory point to each query point in Q. Formally:

$$\begin{aligned} LB = \sum _{q_j \in Q}{dist(p^{\delta }_j,q_j)} \end{aligned}$$
(4)

where \(p^{\delta }_j\) is the last trajectory point returned by the NN search centered at \(q_j\).

The candidate generation phase of \(\mathtt {IKNN}\)  terminates when \(UB_k \le LB\); in this case, none of the unseen trajectories can have smaller distance to Q compared to the candidates in C. Last, \(\mathtt {IKNN}\)  invokes \(\mathtt {Refine_{DTS}}\) to produce the results. Briefly, the function examines candidates in ascending order of a lower bound on their distance, retrieving them from disk to compute \(dist(\cdot ,Q)\) (details in [3]).

figure b
figure c

The \(\mathtt {GH}\)/\(\mathtt {QE}\) Algorithms. Different from \(\mathtt {IKNN}\) , the methods in [4] retrieve trajectory points in ascending order of the distance to their closest query point. Specifically, a global heap H is used to retrieve at each iteration the globally nearest trajectory point \(p_{ij}\) to some query point \(q_j\), and then, to update candidate set C, accordingly. Algorithm 2 shows the pseudocode of \(\mathtt {GH}\). The candidate generation phase of \(\mathtt {GH}\) is terminated as soon as set C contains k full matches (proof of correctness in [4]). Note that these full matches are not necessary among the final results identified in Line 6 during the refinement phase.

In practice, the order imposed by global heap H cannot guarantee a good performance unless both trajectory and query points are uniformly distributed in space. For instance, if a particular query point is very close to many trajectories, \(\mathtt {GH}\) will generate a large number of partial matches with only that slot filled. Consequently, it will take longer to produce the k full matches needed to terminate the generation phase, and at the same time a large number of candidates would have to be refined. A similar problem occurs when a query point is located away from the trajectories.

To address these issues, Tang et al. [4] proposed an extension to \(\mathtt {GH}\) termed \(\mathtt {QE}\), which periodically fills the empty slots for the partially matched trajectories with the highest potential of becoming results. These are then retrieved from disk, and their actual distance is computed. A trajectory has high potential if it has (i) few empty slots and (ii) small distance in each filled slot with respect to the next point to be retrieved for that slot. These factors are captured respectively by the denominator and enumerator of the following equation:

$$\begin{aligned} potential(t_i) = \frac{\sum _{q_j \in Q_i}{\left( dist(p^H_j,q_j) - dist(p^*_{ij},q_i)\right) }}{|Q \backslash Q_i|} \end{aligned}$$
(5)

where set \(Q_i \subseteq Q\) contains all the query points already matched to a point in \(t_i\), \(p^H_j\) is the next nearest trajectory point to \(q_j\) contained in heap H and \(p^*_{ij}\) is the nearest to \(q_j\) point in trajectory \(t_i\).

Algorithm 3 shows the pseudocode of \(\mathtt {QE}\). The candidate generation phase of \(\mathtt {QE}\) terminates when candidate set C contains k full matches (similar to \(\mathtt {GH}\)), provided however that their distance to Q is smaller than the distance of all unseen trajectories (Line 2) (proof of correctness in [4]). To determine this, \(\mathtt {QE}\) computes in Line 7, a lower bound LB of the distance for the unseen trajectories (similar to \(\mathtt {IKNN}\) ) by aggregating the distance of the next nearest trajectory point to every query point, i.e., the contents of heap H:

$$\begin{aligned} LB = \sum _{q_j \in Q}{dist(p^H_j,q_j)} \end{aligned}$$
(6)

3.2 A Hybrid NN-based Approach

The DTS problem can be viewed as a top-k query [5, 6]. For each query point \(q_j\), consider a sorted trajectory list \(T_j\), where each trajectory is ranked according to its distance to the query point. Then, the objective is to determine the top-k trajectories that have the highest aggregate score, i.e., distance, among the lists. However, as these lists are not given in advance and constructing them is costly, the goal is to progressively materialize them, until the result is guaranteed to be among the already seen trajectories.

Following the top-k query processing terminology, a sorted access on list \(T_j\) corresponds to the retrieval of the next nearest trajectory to query point \(q_j\), which in turn may involve multiple trajectory point NN retrievals. In contrast, a random access for trajectory \(t_i\) on list \(T_j\) corresponds to the retrieval of \(t_i\) from disk and the computation of its distance to \(q_j\); in practice, once \(t_i\) is retrieved, its distance to all query points can be computed at negligible additional cost.

Methods \(\mathtt {IKNN}\) , \(\mathtt {GH}\) and \(\mathtt {QE}\) employ various ideas from top-k query processing (an overview of this field is presented in Sect. 7). Particularly, \(\mathtt {IKNN}\)  performs only sorted accesses and prioritizes them in a manner similar to \(\mathtt {Stream\!\!-\!\!Combine}\) [7]. Similarly, \(\mathtt {GH}\) performs only sorted accessses but follows an unconventional strategy for prioritizing them, which explains its poor performance on our tests in Sect. 6. On the other hand, \(\mathtt {QE}\) additionally performs random accesses following a strategy similar to the \(\mathtt {CA}\) algorithm [5] to select which trajectory to retrieve.

In the following, we present the \(\mathtt {NNA}\) algorithm, which combines the strengths of \(\mathtt {IKNN}\)  and \(\mathtt {QE}\). In short, it builds upon the \(\mathtt {Quick\!\!-\!\!Combine}\) top-k algorithm [8] performing both sorted and random accesses to generate the candidate set. \(\mathtt {NNA}\) has the following features. First, similar to \(\mathtt {IKNN}\) , the algorithm retrieves in a round robin manner, batches of nearest trajectory points to each query point in Q. This addresses the weaknesses of \(\mathtt {GH}\) when dealing with non-uniformly distributed data. Second, after performing the nearest neighbor search centered at each query point, \(\mathtt {NNA}\) fills the slots of the trajectories with the highest potential according to Eq. (5), similar to \(\mathtt {QE}\). Finally, \(\mathtt {NNA}\) employs the termination condition of \(\mathtt {IKNN}\)  for the candidate generation phase. In practice, \(\mathtt {NNA}\) extends Algorithm 1 by completing the most promising partial matches in C (similar to \(\mathtt {QE}\)), between Lines 5 and 6. Hence, it is able to compute tighter bounds compared to \(\mathtt {IKNN}\)  and thus terminate the generation phase earlier. In addition, it produces fewer candidates than \(\mathtt {IKNN}\) , reducing the cost of the refinement phase.

3.3 A Spatial Range-Based Approach

We identify two shortcomings of all the NN-based methods previously described. First, each NN search is implemented independently, which means that R-tree nodes and trajectory points may be accessed multiple (up to |Q|) times, which increases the total I/O cost. Second, each NN search is associated with a priority queue, whose continuous maintenance increases the total CPU cost.

Our novel Spatial Range-based algorithm, denoted by \(\mathtt {SRA}\), addresses both these shortcomings. Similar to the NN-based approaches, it follows a generation and refinement paradigm. However, to generate the candidate set, it issues a spatial range search of expanding radius centered at each query point in Q. All searches operate on a common set N of R-tree nodes, which avoids accessing nodes more than once and hence saves I/O operations. Moreover, set N needs not be sorted according to any distance, eliminating costly priority queue maintenance tasks. The range-based search for each query point \(q_j\) is associated with current radius \(r_j\), and is also assigned a maximum radius  \(\theta _j\). As the algorithm progresses, current radius \(r_j\) increases while maximum radius \(\theta _j\) decreases. Candidate generation terminates as soon as \(r_j > \theta _j\) for some query point \(q_j\).

figure d

Algorithm 4 shows the pseudocode of \(\mathtt {SRA}\). In Lines 2–4, \(\mathtt {SRA}\) initializes the current and maximum radius for each query point. For the latter, an upper bound \(UB_k\) to the k-th smallest distance to Q is computed. In particular, \(\mathtt {SRA}\) invokes a sum-aggregate nearest neighbor (sum-ANN) procedure [9] retrieving trajectory points in ascending order of \(\sum _{q_j \in Q}{dist(\cdot ,q_j)}\). Assuming that this procedure retrieves point \(p_i\) of trajectory \(t_i\), the sum-aggregate value is an upper bound to the distance of \(t_i\), i.e., \(dist(t_i) \le \sum _{q_j \in Q}{dist(p_i,q_j)}\). Hence, once points from k distinct trajectories have been retrieved, \(\mathtt {SRA}\) can determine a value for \(UB_k\).

During the candidate generation phase in Lines 5–13, \(\mathtt {SRA}\) first selects the query point \(q_c \in Q\) with the fewest retrieved points so far, and increases its radius by a fixed \(\xi \) Footnote 2, so that each location retrieves more or less the same number of points. Then, it extends the range search centered at \(q_c\) to new radius \(r_c\). In particular, all nodes in N that intersect with the search frontier are expanded, i.e., replaced by their children (Line 8). During the expansion, all trajectory points within the frontier are collected in set S (Line 9). Upon completion of the expansion, set N contains no R-tree node or point within \(r_c\) distance to \(q_c\), or with distance to \(q_c\) greater than \(\theta _c\), and N will be re-used in further iterations.

After the expansion, \(\mathtt {SRA}\) uses the newly seen trajectory points in S to properly update candidate set C. Note that for each trajectory \(t_i\) in C, \(\mathtt {SRA}\) keeps |Q| slots storing the closest trajectory points \(t_i.p_j\) seen so far to each query point \(q_j\). A slot is marked matched if the corresponding matching pair has been determined, i.e., when \(t_i.p_j \equiv p^*_{ij}\). \(\mathtt {SRA}\) in Line 10 performs the following tasks for each point \(p_x\) in S; let \(t_i\) be the trajectory \(p_x\) belongs to. For each slot \(q_j\) that is not matched, \(\mathtt {SRA}\) checks whether \(p_x\) is closer to \(q_j\) than \(t_i.p_j\), and updates the slot with \(p_x\) if true. If the slot for the current query point \(q_c\) was among those examined, it is marked as matched. The benefits of this update strategy are twofold. First, it guarantees that no matching trajectory point will be missed, even though \(\mathtt {SRA}\) does not access \(p_x\) again (removed from N) for \(q_j \ne q_c\). At the same time, it also helps to derive a tighter upper bound for the distance of \(t_i\):

$$\begin{aligned} \overline{dist}(t_i,Q) = \sum _{q_j \in Q_i}{dist(p^*_{ij},q_j) + \sum _{q_j \in Q\backslash Q_i}{dist(t_i.p_j,q_j)}}. \end{aligned}$$
(7)

Compared to Eq. (3) utilized by \(\mathtt {IKNN}\)  and \(\mathtt {NNA}\), Eq. (7) computes a tighter bound on unmatched slots. Based on these bounds, a tighter value for \(UB_k\) can be established (Line 11).

To better explain the procedure in Line 10, we use the example of Fig. 1(a) for \(k=2\). \(\mathtt {SRA}\) has just started and thus C is empty. Assume that the current query point is \(q_c = q_1\), and let \(r_1 = 0 + \xi \) be the radius of the shaded disk depicted in the figure. As a result, set S in Line 9 contains trajectory points \(\{p^*_{21}, p^*_{22}, p^*_{41} \}\). Moreover, candidate set C contains \(t_2\) and \(t_4\). For trajectory \(t_2\), \(p^*_{21}\) is settled as the matching point to \(q_1\) because \(dist(p^*_{21}, q_1) < dist(p^*_{22}, q_1)\) and no unseen point of \(t_2\) can be closer. On the other hand, the matching points to \(q_2\), \(q_3\) cannot be yet determined, but we can use \(p^*_{21}\) and \(p^*_{22}\) to bound \(t_2\)’s distances to \(q_2\) and \(q_3\). Therefore, the slots for \(t_2\) become \(\langle \mathbf {p^*_{21}}, p^*_{22}, p^*_{21} \rangle \), where bold indicates a matched slot. Moreover, an upper bound to the distance of \(t_2\) is determined as \(\overline{dist}(t_2,Q) = dist(p^*_{21}, q_1) + dist(p^*_{22}, q_2) + dist(p^*_{21}, q_3)\). Similarly, we obtain the slots for \(t_4\) as \(\langle \mathbf {p^*_{41}}, p^*_{41}, p^*_{41} \rangle \).

As a last step, \(\mathtt {SRA}\) updates the maximum radius for all query points with respect to the new \(UB_k\) in Lines 12–13. Observe that SRA’s termination condition for candidate generation is essentially identical to that of \(\mathtt {IKNN}\) . Any trajectory not in the candidate set C must have distance to each \(q_j\) at least \(\theta _j\), and thus distance at least equal to \(LB = \sum _{q_j \in Q}\theta _j\). The termination condition of Line 5, \(r_j > \theta _j\) for some \(q_j\), and the update of \(\theta _j\), imply that, when candidate generation concludes, \(UB_k \le LB\).

Finally, the performance of \(\mathtt {SRA}\) can be enhanced following the key idea of \(\mathtt {QE}\) to further improve the \(\overline{dist}(t_j,Q)\) bound and therefore, \(UB_k\). We denote this extension to the \(\mathtt {SRA}\) algorithm by SRA+. Specifically, in between Lines 10 and 11 in Algorithm 4, SRA+ fills the empty slots of the trajectories in C with the highest potential as computed using Eq. (5).

4 Bounded Distance-Based Trajectory Search

We next address the bounded distance-based trajectory search. Recall from Sect. 2 that k-BDTS\((T,Q,\tau )\) is equivalent to a k-DTS\((T^{\prime },Q)\) distance-based query over the subset \(T^{\prime } \subseteq T\) containing only trajectories with \(span(\cdot ,Q) \le \tau \). However, as span(tQ) can be computed only after all the matching pairs of a trajectory t to Q are identified, the major challenge is to limit the number of invalid partial matches generated, i.e., those with the \(span(\cdot ,Q) > \tau \). In the following, we address this issue in two alternative ways.

figure e

The idea behind the incremental approach, denoted as \(\mathtt {INCREMENTAL}\), is to progressively construct the result set R by utilizing the generation phase of a DTS method as a “black” box. Algorithm 5 illustrates \(\mathtt {INCREMENTAL}\); note that any of the algorithms in Sect. 3 can be used as the underlying DTS method. At each round, \(\mathtt {INCREMENTAL}\) asks for the missing \(k - |R|\) trajectories to complete the result set R in Lines 3–4. For this purpose, a \(\lambda \)-DTS(TQ) search is processed, with the \(\lambda \) value been increased at each round by \(k - |R|\); during the first round \(\lambda = k\). Each time \(\lambda \) is updated in Line 3, the DTS method in Line 4 does not run from scratch. It continues the candidate generation using a new termination condition with respect to the updated \(\lambda \) in order to expand candidate set C. Last, in Line 5, \(\mathtt {Refine_{BDTS}}\) examines the new candidates to update result set R by computing their \(dist(\cdot ,Q)\) and eliminating trajectories with \(span(\cdot ,Q) > \tau \).

Intuitively, \(\mathtt {INCREMENTAL}\) takes a conservative approach to bounded distance-based trajectory search. As it is unable to predict which partial matches could provide a valid trajectory (full match) with \(span(\cdot ,Q) \le \tau \), a refinement phase is needed to “clean” the candidate set. Hence, \(\mathtt {INCREMENTAL}\) may involve several rounds of generation and refinement phases. To address these issues, we propose the \(\mathtt {ONE\!\!-\!\!PASS}\) approach which involves a single generation and refinement round. The idea is again to build upon a DTS method but by extending its candidate generation phase in two ways. First, for each partial match \(t_i\) in candidate set C, \(\mathtt {ONE\!\!-\!\!PASS}\) computes a lower bound of \(span(t_i,Q)\) based on the points of \(t_i\) matching the current subset of query points \(Q_i \subset Q\), as follows:

$$\begin{aligned} \underline{span}(t_i,Q) = {\left\{ \begin{array}{ll} 0, &{}if |Q_i| = 1\\ span(t_i,Q_i), &{}otherwise \end{array}\right. } \end{aligned}$$
(8)

Every partial match with \(\underline{span}(\cdot ,Q) > \tau \) can be safely pruned. Second, the original termination is triggered only after candidate set C contains at least k valid full matches, i.e., with \(span(\cdot ,Q) \le \tau \). This is because the k-th upper bound \(UB_k\) of existing candidates can be computed only through full matches. For example, candidate generation of \(\mathtt {ONE\!\!-\!\!PASS}\) based on SRA+ terminates as soon as at least k valid full matches are identified and \(r_j > \theta _j\) holds for some query point \(q_j\).

5 Discussion

We discuss alternative definitions and variants to the point-based search problems introduced in Sect. 2.

Distance and Span-based Trajectory Search. Although taking into account their temporal span, the bounded distance-based search still ranks the trajectories solely on their distance to the query points in Q. As an alternative, we may rank the results with respect to a linear combination of the span-dist metrics:

$$\begin{aligned} f(t,Q) = \alpha \cdot dist(t,Q) + (1-\alpha ) \cdot span(t,Q) \end{aligned}$$
(9)

where \(\alpha \) weights the importance of each metric. With Eq. (9), we introduce the k-Distance & Span-based Trajectory Search, denoted by k-DSTS(TQ) which returns the subset of k trajectories \(R \subseteq T\) with the lowest \(f(\cdot ,Q)\) value.

All methods discussed in Sect. 3 can be extended for k-DSTS(TQ) by replacing \(dist(\cdot ,Q)\) with \(f(\cdot ,Q)\). Note that the upper bound \(\overline{f}(t,Q)\) of a partial match t can be computed by setting \(\overline{span}(t,Q)\) equal to the total duration of the trajectory t. In contrast, as no matching pairs are identified for the unseen trajectories, the lower bound LB or the \(\theta _j\) values are defined similar to the DTS methods, i.e., essentially setting the lower bound of span to zero. In Sect. 6.4, we experimentally investigate the efficient evaluation of DSTS.

Order-aware Trajectory Search. Similar to [3], we also consider a variation of the trajectory search when a visiting order is imposed for the query points. In this variation, the matched trajectory point \(p^*_{ij}\) to query point \(q_j\), is not necessarily the nearest to \(q_j\) point of trajectory \(t_i\). Consider for example trajectory \(t_2\) in Fig. 1. The depicted \(p^*_{22}\), \(p^*_{21}\), \(p^*_{23}\) for DTS cannot be the matched points in the \(q_1 \rightarrow q_2 \rightarrow q_3\) order-aware DTS, as they violate the visiting order. Instead, the matched points that preserve the imposed visiting order are \(p^*_{22}\), \(p^*_{22}\), \(p^*_{23}\), where \(p^*_{22}\) is matched with \(q_1\) although \(dist(p^*_{22},q_1) > dist(p^*_{21},q_1)\). The distance of a trajectory to sequence Q is recursively defined as follows:

$$\begin{aligned} {dist_o(t, Q) = {\left\{ \begin{array}{ll} \min {\left\{ \begin{array}{ll} dist_o(t, T(Q)) + dist(H(t), H(Q)) - DIAM \\ dist_o(T(t), Q) \end{array}\right. } &{} \text {if } t \ne \varnothing , Q \ne \varnothing \\ |Q| \cdot DIAM &{} \text {if } t = \varnothing \\ 0 &{} \text {if } Q = \varnothing \\ \end{array}\right. } } \end{aligned}$$
(10)

where H(S) is the first point (head) in a sequence S, T(S) indicates the tail of S after removing H(S), \(\varnothing \) denotes the empty sequence, and DIAM represents the diameter of the space. The distance can be computed by straightforward dynamic programming [3]. To derive an upper bound on a partial matched trajectory \(t_i\), we consider only the subsequence \(Q_i\) of Q that contains the matched query points, i.e., \(\overline{dist}_o(t_i, Q) = dist_o(t_i, Q_i)\). For order-aware BDTS, distance and its upper bound are the same as in order-aware DTS. Note, however that the lower bound on span (Equation (8)) does not apply as the matching are not yet finalized. For order-aware DSTS evaluation, \(f_o(t,Q)\) and its upper bound are defined in a similar manner to order-aware DTS. In Sect. 6, we experimentally investigate the order-aware variants of all three trajectory search problems.

6 Experimental Analysis

We evaluate our methods for point-based trajectory search. All algorithms were implemented in C++ and the tests run on a machine with Intel Core i7-3770 3.40 GHz and 16 GB main memory running Ubuntu Linux.

Table 1. POIs in Beijing
Table 2. Experimental parameters (default values in bold)

6.1 Setup

We conducted our analysis using real-world trajectories from the GeoLife Project [1012]. The collection contains 17,166 trajectories with 19 m points in Beijing, recording a broad range of outdoor movement. To generate our query sets, we considered around 90 k points of interest (POIs) of various types, located inside the same area covered by the trajectories (see Table 1 for details). A query set Q is formed by randomly selecting a combination of |Q| types and a particular POI from each type. We assess the performance of all involved methods measuring their CPU and I/O cost, and the number of candidates they generate over 1,000 distinct query sets Q, while varying (i) the number of returned trajectories k and (ii) the number of query points |Q|. In case of BDTS queries, we additionally vary the span threshold via the \(\tau /\tau _{min}\) ratio, where \(\tau _{min}\) is the minimum possible time required to travel among the query points in Q at a constant velocity of 50 km/h. Finally, for DSTS queries, we also vary the weight factor \(\alpha \) of Eq. (9). Table 2 summarizes all parameters involved in our study.

6.2 Distance-Based Trajectory Search

Figure 2 reports the CPU cost, the I/O cost and the number of generated candidates for the DTS methods. As expected the processing cost of all methods goes up as the values of k and |Q| increase. The tests clearly show that SRA+ is overall the most efficient evaluation method. We also make the following observations.

First, we observe that \(\mathtt {IKNN}\)  always outperforms \(\mathtt {GH}\)/\(\mathtt {QE}\); note that this is the first time the methods from [3, 4] are compared. Naturally, \(\mathtt {GH}\) comes as the least efficient method; due to the examination order imposed by global heap H, the algorithm is unable to cope with the skewed distribution of the real-world data. \(\mathtt {QE}\) manages to overcome the shortcomings of \(\mathtt {GH}\) by completing the empty slots of the most promising candidates. Yet, compared to \(\mathtt {IKNN}\) , \(\mathtt {QE}\) is less efficient due to its weak termination condition for the generation phase; recall that at least k full matches are needed for this purpose which also results in generating a larger number of candidates, as shown in Fig. 2(c) and (f). The advantage of \(\mathtt {IKNN}\)  over \(\mathtt {GH}\)/\(\mathtt {QE}\) justifies our decision to build the hybrid \(\mathtt {NNA}\) method upon the round robin-based candidate generation of \(\mathtt {IKNN}\)  which retrieves nearest neighbor points in batches, and its powerful threshold-based termination condition. \(\mathtt {NNA}\) is indeed the most efficient NN-based method, in fact with an order of magnitude improvement over \(\mathtt {IKNN}\)  and \(\mathtt {GH}\)/\(\mathtt {QE}\) on both CPU and I/O cost. Finally, Fig. 2 clearly shows the advantage of the spatial range-based evaluation approach over the NN-based one. \(\mathtt {SRA}\) is always faster while incurring fewer disk page accesses than \(\mathtt {IKNN}\) , and in a similar manner, SRA+ outperforms \(\mathtt {NNA}\).

Fig. 2.
figure 2

Performance comparison for Distance-based Trajectory Search

Fig. 3.
figure 3

Performance comparison for Distance-based Trajectory Search (order-aware)

We also experimented with the order-aware variant of DTS. Figure 3 depicts similar results to Fig. 2; the spatial range-based evaluation approach is again superior to the NN-based and overall, SRA+ is the most efficient method. Nevertheless, it is important to notice that the advantage of completing the most proposing candidates is smaller compared to Fig. 2, in terms of the CPU cost. Specifically, observe how close is the running time of \(\mathtt {GH}\) to \(\mathtt {QE}\), of \(\mathtt {IKNN}\)  to \(\mathtt {NNA}\) and of \(\mathtt {SRA}\) to SRA+, in Fig. 3(a) and (d). This is expected as completing partial matches employs dynamic programming to compute \(dist_o(\cdot ,Q)\).

6.3 Bounded Distance-Based Trajectory Search

Next, we investigate the evaluation of BDTS queries while varying the k, |Q| and \(\tau /\tau _{min}\) parameters. Based on the findings of the previous section, we use the SRA+ algorithm as the underlying DTS method. Note that due to lack of space we omit results for the order-aware variant of BDTS; the results however are similar. Figure 4 clearly shows that \(\mathtt {ONE\!\!-\!\!PASS}\) outperforms \(\mathtt {INCREMENTAL}\) in all cases. As expected, the conservative approach of \(\mathtt {INCREMENTAL}\) generates a larger number of candidates by performing multiple rounds of generation and refinement which results in both higher running time and more disk page accesses. Last, notice that the evaluation of BDTS becomes less expensive for both methods while increasing \(\tau /\tau _{min}\), as the number of invalid candidates progressively drops.

6.4 Distance and Span-Based Trajectory Search

Finally, we study the evaluation of DSTS queries. For this experiment, we extended the most dominant method from [3, 4], i.e., \(\mathtt {IKNN}\) , and our methods \(\mathtt {NNA}\), \(\mathtt {SRA}\) and SRA+ following the discussion in Sect. 5. The results in Fig. 5 demonstrate, similar to the DTS case, the advantage of both the spatial range-based approach and the SRA+ algorithm which is overall the most efficient evaluation method. Due to lack space, we again omit the figure for the order-aware variant of DSTS as the results are identical to Fig. 5.

7 Related Work

Apart from the studies [3, 4] for distance-based search on trajectories detailed in Sect. 3.1, our work is also related to top-k and nearest neighbor queries.

Top- k Queries. Consider a collection of objects, each having a number of scoring attributes, e.g., rankings. Given an aggregate function \(\gamma \) (e.g., SUM) on these scoring attributes, a top-k query returns the k objects with the highest aggregated score. To evaluate such a query, a sorted list for each attribute \(a_i\) organizes the objects in decreasing order of their value to \(a_i\); requests for random accesses of an attribute value based on object identifiers may be also possible. Ilyas et al. overviews top-k queries in [6] providing a categorization of the proposed methods. Specifically, when both sorted and random accesses are possible, the \(\mathtt {TA}\)/\(\mathtt {CA}\) [5] and \(\mathtt {Quick\!\!-\!\!Combine}\) [8] algorithms can be applied. \(\mathtt {TA}\) retrieves objects from the sorted lists in a round-robin fashion while a priority queue to organizes the best k objects so far. Based on the last seen attribute values, the algorithm defines an upper score bound for the unseen objects, and terminates if current k-th highest aggregate score is higher than this threshold. \(\mathtt {TA}\) assumes that the costs of the two different access methods are the same. As an alternative, \(\mathtt {CA}\) defines a ratio between these costs to control the number of random accesses, which in practice are usually more expensive than sorted accesses. Hence, the algorithm periodically performs random accesses to collect unknown values for the most “promising” objects. Last, the idea behind \(\mathtt {Quick\!\!-\!\!Combine}\) is to favor accesses from the sorted lists of attributes which significantly influence the overall scores and the termination threshold. In contrast, when only sorted accesses are possible, the \(\mathtt {NRA}\) [5] and \(\mathtt {Stream\!\!-\!\!Combine}\) [7] algorithms can be applied. Intuitively, \(\mathtt {Stream\!\!-\!\!Combine}\) operates similar to \(\mathtt {Quick\!\!-\!\!Combine}\) without performing any random accesses. In Sect. 3.1, we discuss how the methods in [3, 4] build upon previous work on top-k queries to address distance-based search on trajectories.

Fig. 4.
figure 4

Performance comparison for Bounded Distance-based Trajectory Search

Nearest Neighbor Queries. There is an enormous amount of work on the nearest neighbor (NN) query (also known as similarity search), which returns the object that has the smallest distance to a given query point; k-NN queries output the k nearest objects in ascending distance. Roussopoulos et al. proposed a depth-first approach to k-NN query in [13] while Hjaltason et al. enhanced the evaluation with a best-first search strategy in [14]. An overview of index-based approaches can be found in [15]; efficient methods for metric spaces, e.g., [16], and high-dimensional data, e.g., [17], have also been proposed.

Fig. 5.
figure 5

Performance comparison for Distance and Span-based Trajectory Search

For a set of query points, the aggregate nearest neighbor (ANN) query [9] retrieves the object that minimizes an aggregate distance to the query points. As an example, for the MAX aggregate function and assuming that the set of query points are users, and distances represent travel times, ANN outputs the location that minimizes the time necessary for all users to meet. In case of the SUM function and Euclidean distances, the optimal location is also known as the Fermat-Weber point, for which no formula for the coordinates exists.

8 Conclusions

In this paper, we studied the efficient evaluation of point-based trajectory search. After revisiting the existing methods (\(\mathtt {IKNN}\)  and \(\mathtt {GH}\)/\(\mathtt {QE}\)), which examine the trajectories in ascending order of their distance to the queries points, we devised a hybrid algorithm which outperforms them by a wide margin. Then, we proposed a spatial range-based approach; our experiments on real-world trajectories showed that this approach outperforms any NN-based method. Besides improving the performance of distance-based search, we also introduced and investigated the evaluation of a practical variant for point-based trajectory search, which also takes into account the temporal aspect of the trajectories. As a direction for future work, we plan to consider additional types of annotated data on the trajectories in point-based search, such as textual and social information.