1 Introduction

1.1 Motivation

Trajectory data, which records the historical movements of moving objects such as vehicles and ships, is becoming ubiquitous due to the widespread use of GPS devices. Such data, capturing geographical locations that change over time, is of crucial importance for emerging applications, e.g., route recommendation [9], tracking [28], monitoring [60], and crowdsourcing [47, 49], to name but a few.

Despite tremendous efforts on studying trajectory databases, proposals in the literature mainly deal with standard trajectories [33, 50, 66], i.e., sequences of timestamped geo-locations, and the majority of queries are limited to spatio-temporal evaluations such as range queries [51], nearest neighbors [21] and convoys [27]. In the real world, typical moving objects such as vehicles and persons are associated with pieces of descriptive information. A database system should fully represent moving objects and allow users to query them together with this descriptive knowledge to better understand the movement and the users’ behavior.

As a fundamental step towards that goal, we investigate a new form of trajectories called multi-attribute trajectories, each of which consists of a standard trajectory and a set of attribute values. Standard trajectories have been well established in the literature, while attributes carry application-dependent semantics. The combination allows users to query trajectories by specifying certain attributes. We study continuous k nearest neighbor queries over multi-attribute trajectories, CkNN_MAT for short. Given a standard trajectory, the number of neighbors k and attribute values, CkNN_MAT returns, at each defined time instant, k trajectories, each of which (i) contains the query attribute values and (ii) belongs to the k nearest neighbors of the query trajectory. Returned objects therefore fulfill two conditions: (i) attribute consistency and (ii) time-dependent distance closeness. Distances between moving objects vary over time, and thus the returned objects change at certain time points, complicating the evaluation. Euclidean distance is used.

Consider a running example. A database stores five vehicle trajectories {mo1, mo2, mo3, mo4, mo5}, each of which contains a sequence of timestamped locations and two attribute values from the domains COLOR = {RED, SILVER, BLACK} and BRAND = {BENZ, VW, TOYOTA}, respectively, as illustrated in Fig. 1. An interesting query “Continuously report the nearest SILVER VW to mo4” is issued. CkNN_MAT returns {([t1, t2], mo5), ([t2, t4], mo3)}, which highlights the key aspect that only objects fulfilling the attribute condition are evaluated on the time-dependent closeness. Although mo1 and mo2 are closer than mo3 and mo5 to the query trajectory, they do not contain (SILVER, VW) and are therefore excluded. The query cannot be answered in standard trajectory databases because the attribute is not defined and the system does not know which trajectories are SILVER VWs. If attributes are ignored, the query becomes a traditional nearest neighbor query [16, 21, 46] and reports {([t1, t2], mo2), ([t2, t3], mo1)}.

Fig. 1 The motivating example

The combined data enriches the representation and exposes semantics that are orthogonal to location data, on which most semantically enriched trajectories focus [2, 61, 62]. Multi-attribute trajectories give rise to a range of novel queries by integrating attribute values into the evaluation. We do not deal with attributes such as speed and acceleration, which can be calculated from the motion function and do not need to be defined explicitly.

1.2 Previous works and challenges

In the current state of the art, standard trajectories with additional information have been investigated, e.g., semantic trajectories [36, 61], activity trajectories [64], and transportation modes [56]. A semantic/activity trajectory is typically defined as a sequence of locations attached with semantic labels, e.g., park or restaurant. In principle, semantic/activity trajectories are an enriched version of standard trajectories in which semantic labels describe locations. This is orthogonal to multi-attribute trajectories. We point out the major differences in this section and provide a thorough comparison in the related work.

  (i) Attributes represent a range of aspects of the objects and provide a full picture of moving objects, as opposed to semantics limited to locations. This supports a different (even broader) range of applications.

  (ii) Semantic locations are sparsely defined, because only a few locations along a person’s trajectory carry semantics. Attributes are location-independent and associated with the complete trajectory; they are not derived from timestamped locations or the geographical environment.

  (iii) Semantic trajectories target similarity search or ranking queries rather than continuous queries with an exact match on attribute values, leading to different tasks when developing the index. Semantic trajectories are grouped in terms of locations and semantics, whereas attributes are not related to locations.

Efficiently answering CkNN_MAT is not trivial. Standard trajectory indexes such as the R-tree [24], TB-tree [37], SETI [6] and TrajStore [35] only deal with spatio-temporal data and do not manage attributes. We cannot use such an index to perform the selection on attribute values. Furthermore, the pruning technique based on min and max distances cannot be applied if trajectory attributes are not determined. The trajectory subset containing the query attributes changes dynamically with the query and cannot be pre-computed. False dismissals will occur if we perform the pruning without awareness of attribute values. Consider the example in Fig. 1. Although mo2 has smaller distances to mo4 than mo3, we cannot prune mo3 because mo2 does not contain (SILVER, VW).

CkNN_MAT could be answered by employing an attribute index: we evaluate the attribute condition to retrieve the trajectories containing the query attributes and then proceed to processing the standard trajectories. However, this method is limited in scope and inherently suffers from performance issues. The standard trajectories have to be processed either by a sequential scan or by accessing an index built on-line. If the attribute predicate is selective, the query cost may be acceptable because only a small dataset is processed. If the attribute predicate has poor selectivity, a large number of trajectories will be returned, and both the sequential scan and building an on-line index incur high CPU and I/O costs. Furthermore, creating an index for each query at runtime requires extra storage space.

To sum up, the main challenge is to let the index effectively and efficiently preserve the spatio-temporal proximity while also maintaining attributes. Managing each part individually is easy but does not yield optimal performance. We aim to develop a general structure that can be used for a range of queries rather than ad-hoc methods for a particular query workload.

1.3 The solution

We start by modeling attributes and then use a relational interface to integrate standard trajectories and attributes. This ensures that existing operators on standard trajectories can be leveraged, avoiding starting from scratch. To reduce the number of trajectory approximations, we identify small pieces of movement and pack them to obtain a compact dataset on which the index is built. To manage multi-attribute trajectories, we propose an index structure made up of a 3D R-tree and a composite structure BAR. The 3D R-tree preserves the spatio-temporal proximity, and BAR, built on top of the R-tree, manages attribute values.

We do not tightly integrate attribute values into the spatio-temporal index, but offer a versatile approach such that BAR is compatible with most spatio-temporal indexes and can be discarded if attributes are not defined. We provide a dynamic structure to support tracking incoming data and also define a cost model to estimate the update costs.

We design a query framework following the filter-and-refine strategy. The filter step accesses BAR to determine the R-tree nodes fulfilling the attribute condition, for which an efficient access method is proposed and a thorough analysis is provided. We traverse the R-tree from the top to the bottom level, evaluating both attribute and spatio-temporal predicates for each accessed node. The filter returns a set of candidate trajectories, each of which contains the query attributes and approximately belongs to the k nearest neighbors. The refinement performs the accurate distance computation. The proposal is implemented in the extensible database system SECONDO [20] to obtain a practical system. The contributions of the paper are summarized as follows:

  • We offer insight into multi-attribute trajectories by providing the data representation and proposing a new query, CkNN_MAT. A query expression is defined to formulate attribute values.

  • We propose a hybrid and flexible index on a compact dataset by packing standard trajectories and provide an optimal access method. The storage cost of the index is analytically provided.

  • Employing a light-weight structure, we develop a method to efficiently update the index to keep track of incoming multi-attribute trajectories and establish a cost model to estimate update I/Os.

  • Efficient query algorithms are developed to answer CkNN_MAT, supporting different forms of query attributes.

  • Using large real and synthetic datasets, an extensive experimental study is conducted to demonstrate the significant performance advantage of our method over three baseline methods in terms of efficiency and scalability.

The rest of the paper is organized as follows: The problem is defined in Section 2. We present the index structure in Section 3 and turn to the update issue in Section 4. Query algorithms are introduced in Section 5. Section 6 reports the evaluation, Section 7 discusses the generality of the proposal, and Section 8 surveys the related work, followed by conclusions in Section 9.

2 Problem definition

2.1 Standard trajectories

A standard trajectory is typically modeled as a function from time to 2D space. In a database system, a discrete model is implemented, that is, the trajectory is represented by a sequence of so-called temporal units, as illustrated in Fig. 2. Each unit records the start and end locations during a time interval, and locations between the start and end locations are estimated by interpolation. A data type called mpoint is defined. We refer to [15, 22] for the comprehensive framework.

Fig. 2 Standard trajectory representation

Definition 1

Standard trajectories

\(D_{\underline{mpoint}}\) = {⟨u1, ..., un⟩ | ∀i ∈ [1, n]: ui = (t1, t2, p1, p2), t1, t2 ∈ \(\underline{instant}\), p1, \(p_{2} \in \mathbb{R}^{2}\)}

2.2 An integration

Let A = {A1, ..., Ad} denote d attributes to be combined with standard trajectories, d ≥ 1. We refer to the domain of each attribute as dom(Ai) (i ∈ {1, ..., d}) and define ∀i ∈ {1, ..., d}: dom(Ai) ⊂ \(\mathbb{Z}^{+}\). A d-attribute object is represented by the following data type.

Definition 2

Multi-attributes

\(D_{\underline {\textit {att}}}\) = {(a1, ..., ad)|ai ∈ dom(Ai), i ∈ {1, ..., d}}

The overall domain is denoted by dom(A) = \(\bigcup_{i = 1}^{d}\)dom(Ai). Attribute semantics are set according to real applications. In the running example, we have two attributes A1 = COLOR and A2 = BRAND with domains dom(A1) = {RED, SILVER, BLACK} and dom(A2) = {BENZ, VW, TOYOTA}, respectively. We use a composite data model \(\mathcal{D}\)(Trip; Att) to represent a multi-attribute trajectory database, where Trip denotes standard trajectories and Att denotes multi-attributes. The data model is translated to a relation with the schema (Id: int, Trip: mpoint, Att: att) by embedding the data types mpoint and att as relational attributes, as shown in Table 1. The advantage of the relational interface is that (i) it allows combining heterogeneous data models; and (ii) existing operators on standard trajectories can be leveraged, benefiting the system development.

Table 1 Integrate standard trajectories and attributes

2.3 Queries

We incorporate attribute values into the evaluation and define the query expression

Definition 3

Attribute query expression

Qa = (a1, ..., ad), ai ∈ dom(Ai) or ai = ⊥.

The attribute expression for “Continuously report the nearest SILVER VW to mo4” is formed by Qa = (SILVER, VW). We use the notation ⊥ for an undefined value. If the user does not define any attribute, the query degenerates to a traditional nearest neighbor query. Given a d-attribute object o ∈ \(D_{\underline{att}}\), let o.Ai refer to the value of the attribute Ai. An operator called contain is defined.

contain(o, Qa) returns true if ∀ai ∈ Qa: o.Ai = ai or ai = ⊥.

Let dist(t, Trip1, Trip2) return the distance between two standard trajectories at a time point t. CkNN_MAT is formally defined as follows.

Definition 4

CkNN_MAT

Given a query standard trajectory mq, an integer k, and Qa, CkNN_MAT returns k trajectories denoted by \(\mathcal{D}^{\prime} \subseteq \mathcal{D}\) at each time t ∈ T(mq) such that \(\forall mo^{\prime} \in \mathcal{D}^{\prime}:\) contain(mo′.Att, \(Q_{a}) \wedge \forall mo \in \mathcal{D} \setminus \mathcal{D}^{\prime}:\) dist(t, mo′, mq) < dist(t, mo, mq)

Referring to Fig. 1, CkNN_MAT returns mo5 during [t1, t2] and mo3 during [t2, t3]. In the following, we will use the notation t for both a time point and a time interval, and explicitly indicate the intended meaning. Users can also define multiple values for an attribute, e.g., “Continuously report the nearest SILVER or BLACK VW to mo4”. We generalize the expression of the query attribute accordingly.

Definition 5

An extension

Qa = (X1, ..., Xd), Xi ⊆ dom(Ai) or Xi = ∅.

At the concept level, Qa = (X1, ..., Xd) defines the component for each attribute over {A1, ..., Ad}, in which Xi is a set of attribute values. The multi-value query “SILVER or BLACK VW ” is formed by Qa = ({SILVER, BLACK}, {VW}). At the implementation level, the query is defined by a relation in which a tuple supports multi-valued attributes.

The operator contain is extended accordingly:

contain(o, Qa) returns true if ∀Xi ∈ Qa: o.Ai ∈ Xi or Xi = ∅.
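To make the two predicates concrete, the following is a minimal C++ sketch of both variants of contain (Definitions 3 and 5). The integer encoding of the attribute values and the use of 0 for the undefined value ⊥ (respectively the empty set) are assumptions of this illustration, not the system's actual representation.

```cpp
#include <cassert>
#include <set>
#include <vector>

// A d-attribute object: one integer value per attribute (Definition 2).
using MultiAttr = std::vector<int>;

// Single-value expression Qa = (a1, ..., ad); 0 encodes the undefined value.
bool contain(const MultiAttr& o, const std::vector<int>& qa) {
    for (size_t i = 0; i < qa.size(); ++i)
        if (qa[i] != 0 && o[i] != qa[i]) return false;  // ai defined but not matched
    return true;
}

// Multi-value expression Qa = (X1, ..., Xd); an empty set means "no condition".
bool contain(const MultiAttr& o, const std::vector<std::set<int>>& qa) {
    for (size_t i = 0; i < qa.size(); ++i)
        if (!qa[i].empty() && qa[i].count(o[i]) == 0) return false;
    return true;
}

int main() {
    // Hypothetical encoding of the running example:
    // COLOR: RED=1, SILVER=2, BLACK=3; BRAND: BENZ=4, VW=5, TOYOTA=6.
    MultiAttr mo3 = {2, 5};                                          // (SILVER, VW)
    assert(contain(mo3, std::vector<int>{2, 5}));                    // (SILVER, VW)
    assert(contain(mo3, std::vector<std::set<int>>{{2, 3}, {5}}));   // (SILVER or BLACK, VW)
    assert(!contain(mo3, std::vector<int>{1, 5}));                   // (RED, VW) fails
    return 0;
}
```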

Table 2 summarizes notations frequently used in the following paper.

Table 2 Summary of notations

3 The index

3.1 An overview

The proposed index structure consists of a 3D R-tree and BAR, as shown in Fig. 3. The 3D R-tree, which indexes standard trajectories, is a height-balanced tree. Each node contains an array of entries, each of which couples (i) a pointer to a subtree or an object with (ii) a rectangle that bounds the data objects in the subtree. BAR is a composite structure that includes a B-tree, a relation Att_Rel and a record file RF; it will be elaborated in Section 3.3.

Fig. 3 Index architecture

We build BAR on top of the 3D R-tree by extracting attribute values from multi-attribute trajectories. The structure establishes the connection between attribute values and R-tree nodes and enables us to know which attribute values occur in a subtree. In a leaf node, each entry stores a pointer to a tuple in the relation; we access the tuple to obtain the values. For a non-leaf node, attribute values are collected by performing the union on the values from its child nodes. BAR maintains attribute values in an efficient way such that we can quickly determine the R-tree nodes that (i) contain the query attributes and (ii) fall into the query time.

3.2 Packing standard trajectories

The R-tree is supposed to be built on sorted minimum bounding rectangles (MBRs) that approximate trajectories.

By observation we find that raw trajectories from GPS records contain a large number of small units due to short time intervals or slow movement.

In order to reduce the size of the dataset, we pack small pieces of movement into fewer but larger units. Let ui denote the unit extent in the i-th dimension and Δi the average extent over all units in the i-th dimension. The deviation of a unit is then given as:

$$ f(u) = \sum\limits_{i \in \{d_{x}, d_{y}, d_{t}\}} \frac{u_{i}}{{\Delta}_{i}}. $$
(1)

A threshold Bound is defined to select small units. We remove duplicate values to overcome the impact of the number of small units and analytically estimate the lower bound.

$$ \text{\textit{Bound}} = \text{Avg}(\text{Unique}(\lfloor f(u)\rfloor)) \geq \text{Avg}(\lfloor f(u) \rfloor) \approx 3. $$

Let U be the set of all temporal units and we define

$$u^{*} = \underset{u \in U}{\arg\max} \ \ \text{Unique}(\lfloor f(u) \rfloor). $$

Not all values in {0, 1, ..., ⌊f(u*)⌋} may be defined and thus the upper bound is

$$\textit{Bound} \approx \text{Avg}(0 + ... + f(u^{*})) \leq \frac{f(u^{*})}{2}. $$

Using GPS records of Shanghai taxis [1], we calculate the unit deviation over 500 temporal units and report their values as well as Bound in Fig. 4. One can see that the majority of units have a deviation smaller than Bound. We pack successive small units of the same trajectory into one unit such that the deviation of the resulting unit is larger than Bound. The overall number of trajectory approximations (MBRs) is greatly reduced, leading to a compact dataset for building the index.

Fig. 4 Demonstrate packing

Lemma 1

Given |U| temporal units, the number of MBRs is less than \(\frac {3}{4}\cdot |U|\) after packing.

Proof

Let Us = {u | u ∈ U ∧ f(u) < Bound} ⊂ U be the set of small units. If the lower bound Avg(⌊f(u)⌋) were taken as the threshold, we would have \(|U_{s}| \approx \frac{|U|}{2}\). Since Bound = Avg(Unique(⌊f(u)⌋)) ≥ Avg(⌊f(u)⌋), we have \(|U_{s}| \geq \frac{|U|}{2}\). Each u ∈ Us will be packed with at least one other unit. Consequently, the |Us| small units become at most \(\frac{|U_{s}|}{2}\) units, and the number of units after packing is (|U| - |Us|) + \(\frac{|U_{s}|}{2}\) = |U| - \(\frac{|U_{s}|}{2} \leq |U|\) - \(\frac{|U|}{4}\) = \(\frac{3}{4}\cdot |U|\). □

The packing can be treated as building the R-tree in a different way. We pack small pieces of trajectories to obtain large units and take them as the input for a leaf node. The index uses the same threshold as the bulk load method [4] to group units into one leaf node, guaranteeing the spatio-temporal locality. The defined Bound is the average over Unique(⌊f(u)⌋) and thus does not group units into an overly large extent. During the packing procedure, raw units are neither modified/simplified nor lost. We do not need extra storage space and we maintain the same amount of original units. After packing, the R-tree is built by bulk load.
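The packing criterion can be realized as a single pass over the time-ordered units of one trajectory: successive units are accumulated while the merged deviation stays at or below Bound, and a merged unit is emitted once it exceeds the threshold. The following C++ sketch only illustrates this idea under simplifying assumptions (a unit is reduced to its extents in the three dimensions, merging simply sums the extents as a coarse stand-in for growing the merged bounding box, and Bound is supplied by the caller); it is not the system's implementation.

```cpp
#include <array>
#include <vector>

// A temporal unit approximated by its extents in the x, y and t dimensions.
struct Unit { std::array<double, 3> extent; };

// Deviation of a unit w.r.t. the average extents delta (Equation 1).
double deviation(const Unit& u, const std::array<double, 3>& delta) {
    double f = 0.0;
    for (int i = 0; i < 3; ++i) f += u.extent[i] / delta[i];
    return f;
}

// Pack successive small units of one trajectory so that every emitted unit
// has a deviation larger than bound.
std::vector<Unit> pack(const std::vector<Unit>& units,
                       const std::array<double, 3>& delta, double bound) {
    std::vector<Unit> packed;
    Unit current{{0.0, 0.0, 0.0}};
    bool open = false;                               // are there pending units?
    for (const Unit& u : units) {
        for (int i = 0; i < 3; ++i) current.extent[i] += u.extent[i];
        open = true;
        if (deviation(current, delta) > bound) {     // merged unit is large enough
            packed.push_back(current);
            current = Unit{{0.0, 0.0, 0.0}};
            open = false;
        }
    }
    if (open) packed.push_back(current);             // trailing small units
    return packed;
}
```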

3.3 BAR

We need an appropriate structure to manage trajectory attributes in order to prune the data when traversing the index. One possible approach is to integrate attribute values into existing trajectory indexes. However, each multi-attribute trajectory is associated with a set of attribute values, which makes this a complex task for a large number of trajectories. Modifying existing trajectory indexes would complicate the structures and also make them ad-hoc. We therefore keep the original trajectory index and build a structure on top of it to manage attributes.

3.3.1 The Relation Att_Rel

The key component in BAR is the relation Att_Rel that builds the connection between attribute values and R-tree nodes. The relation schema is defined as

Att_Rel(A_VAL: int, H: int, RecId: int).

For each attribute value, we maintain one tuple for all nodes containing the value at the same height; the nodes are stored in a record. A tuple stores the attribute value, the height, and the record identifier. The relation is created as follows. First, for each a ∈ dom(A) we traverse the R-tree in depth-first order to collect all nodes containing a and create an intermediate tuple for each node, setting a as the key and recording the node height. Second, the intermediate tuples are grouped according to the height and a record is used to store all nodes containing a at each height; a tuple is created to store the record id. We repeat this procedure for all a ∈ dom(A). The number of tuples in Att_Rel is O(|dom(A)| ⋅ H), in which |dom(A)| is the overall number of attribute values and H is the R-tree height. We create a B-tree on Att_Rel by making a key combining A_VAL and H. Using the example trajectories, we show the 3D R-tree and BAR in Fig. 5.

Fig. 5 Exemplify the index structure

Unique attribute values

We need a unique key for each attribute value. The ideal case is that attribute domains do not overlap. In practice, non-intersecting domains may not be possible, but this problem can be solved. One can use a composite number to represent the attribute value, achieved by combining the attribute id and the value. In turn, a two-dimensional point (i, a) (i ∈ [1, d], a ∈ dom(Ai)) is formed. Then, the Z-order space-filling curve is used to map points of the two-dimensional space to one-dimensional values. This is done by interleaving the binary coordinate values and guarantees that attribute domains do not overlap. We prove this in the Appendix.
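A minimal sketch of this composite key is given below, assuming 16-bit attribute ids and values (the bit widths are hypothetical). The pair (i, a) is treated as a two-dimensional point and its bits are interleaved, so keys originating from different attributes cannot collide even if the raw domains overlap.

```cpp
#include <cstdint>
#include <iostream>

// Interleave the bits of (attrId, value) into one Z-order key:
// attrId occupies the even bit positions, value the odd ones.
uint32_t zOrderKey(uint16_t attrId, uint16_t value) {
    uint32_t key = 0;
    for (int bit = 0; bit < 16; ++bit) {
        key |= static_cast<uint32_t>((attrId >> bit) & 1u) << (2 * bit);
        key |= static_cast<uint32_t>((value  >> bit) & 1u) << (2 * bit + 1);
    }
    return key;
}

int main() {
    // Two attributes whose raw domains overlap (both contain the value 2):
    // the composite keys are nevertheless distinct (9 and 12).
    std::cout << zOrderKey(1, 2) << " " << zOrderKey(2, 2) << "\n";
    return 0;
}
```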

3.3.2 Record storage

Record format

We maintain a list of items in each record. Each item is represented by a three-tuple (nid, b, t), in which nid is the node id, b is a bitmap and t is a time interval. The bitmap represents the entries containing the attribute value in a node and t is the overall time of those entries. This design is based on the observation that the number of entries containing an attribute value is no larger than the total number of entries, and usually much smaller. Also, the number of entries containing an attribute value increases from the leaf level to the root level, because if a node contains the value, all its ancestor nodes contain the value as well. Consider SILVER in Fig. 5: only one entry in N2 contains SILVER, but all entries in NR contain SILVER. To efficiently determine the entries fulfilling the attribute condition, we access the bitmap instead of performing a linear scan over all entries.

Bitmap

Let m denote the length of a bit-vector and E be the node fan-out. We perform the mapping between bitmaps and entries, distinguishing two cases. (i) m ≥ E: each bit maps to a unique entry; if the i-th entry (i ∈ [0, m)) contains the value, we have b[i] = 1, otherwise b[i] = 0. (ii) m < E: each bit maps to a sequence of entries; the corresponding entries for the i-th bit are [i \(\cdot \lceil \frac{E}{m} \rceil\), (i + 1) \(\cdot \lceil \frac{E}{m} \rceil\)), and we define b[i] = 1 if one of these entries contains the value. The bitmap index incurs little storage overhead and is efficient for processing data in small quantities due to the speed of bitwise operations. The length m depends on the implementation (e.g., a 32-bit integer) and E is bounded by the minimum and maximum numbers of entries in a node. The bitmap is able to quickly determine the qualified entries for the intersection condition of several attributes. With m = 8, Fig. 5 gives the bitmap of each item in the record. Consider Qa = (SILVER, VW). At H = 2, we access Records 1 and 3, and retrieve b = 00000111 for SILVER and b = 00000110 for VW, respectively. After performing the bitwise AND operation, the 0th bit is not set and therefore N0 will not be further considered. A data type is embedded into the relation to represent the records.

Definition 6

Record

\(D_{\underline{rec}}\) = {⟨(nid1, b1, t1), ..., (nidn, bn, tn)⟩ | nidi ∈ \(\underline{int}\), bi ∈ \(\underline{int}\), ti ∈ \(\underline{interval}\)}
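The entry-to-bit mapping and the conjunctive test can be summarized by the following C++ sketch, which assumes m = 32 and merely illustrates the two cases m ≥ E and m < E together with the bitwise AND used for several query attribute values.

```cpp
#include <cstdint>
#include <vector>

constexpr int m = 32;                        // bitmap length (implementation-dependent)

// Set the bit that is responsible for entry position pos in a node of fan-out E.
void setBitForEntry(uint32_t& bitmap, int pos, int E) {
    int bit = (m >= E) ? pos                       // case (i): one bit per entry
                       : pos / ((E + m - 1) / m);  // case (ii): one bit per group of entries
    bitmap |= (1u << bit);
}

// Entries that may satisfy a conjunction of attribute values: AND the bitmaps
// retrieved for the individual values (e.g. SILVER and VW in the running example).
uint32_t candidateBits(const std::vector<uint32_t>& perValueBitmaps) {
    uint32_t b = ~0u;
    for (uint32_t v : perValueBitmaps) b &= v;
    return b;                                // b == 0 means no entry qualifies
}
```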

3.3.3 Querying the bitmap

In order to know the entries containing the query attribute, we need to access the bitmap and report the defined bits. A straightforward way is to apply a linear scan, leading to the time complexity O(m). We propose an optimal method to boost the performance when the number of defined bits is smaller than m. Let \(\mathcal{B} = \langle 2^{0}, 2^{1}, \ldots, 2^{m-1}\rangle\) be a sequence of integers. Given a bitmap b, its defined bits are reported as follows. Step 1: by performing a binary search, we find the smallest \(2^{i} \in \mathcal{B}\) such that \(2^{i} \geq b\). Step 2: if \(2^{i} = b\), we report i and terminate the search because the bit just found is the last one to be reported. If \(2^{i} > b\), we report i - 1 and update b = b - \(2^{i-1}\). We repeat Steps 1-2 until b decreases to 0, during which bits are progressively reported from high to low positions. Figure 6 depicts the procedure of reporting b = 00000110 in Record 3.

Fig. 6 Report defined bits

We give the query algorithm in Algorithm 1. Let P denote the set of defined bit positions, initially empty. Two indexes s and e are used to refer to the s-th and e-th integers in \(\mathcal{B}\). To find the smallest i such that \(2^{i} \geq b\), the procedure performs a binary search and terminates when either b is equal to an integer in \(\mathcal{B}\) (line 8) or e = s + 1 (line 14). In the former case, all bits have been found. In the latter case, the position of the highest bit is found and put into P (lines 16-17).


To continue searching for bits, we update b as well as s and e. The baseline method for updating s and e is to set s ← 0 and e ← m after each iteration; consequently, the binary search is repeatedly performed on the overall bit range. Motivated by the fact that b decreases after each iteration, the position of the next defined bit cannot be larger than s. Thus, we perform the update by setting e ← s and s ← 0. The correctness is proved in the following.
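A compact C++ sketch of the procedure (a simplification of Algorithm 1, not the system's code) is shown below; for b = 00000110 it reports the positions 2 and 1, matching Fig. 6.

```cpp
#include <cstdint>
#include <vector>

// Report the positions of the defined bits in b, from high to low, by repeated
// binary search over the exponents 0 .. m-1 (i.e. over the powers of two in B).
std::vector<int> reportBitPos(uint32_t b, int m = 32) {
    std::vector<int> positions;
    int s = 0, e = m;                         // current search window over exponents
    while (b != 0) {
        while (e > s + 1) {                   // binary search: keep 2^s <= b < 2^e
            int mid = (s + e) / 2;
            if ((1u << mid) <= b) s = mid; else e = mid;
        }
        positions.push_back(s);               // Lemma 2: the highest bit of b is s
        b -= (1u << s);                       // clear it and continue below position s
        e = s;                                // next defined bit cannot be higher
        s = 0;
    }
    return positions;
}
```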

Lemma 2

If \(2^{s} \leq b < 2^{e}\) and e = s + 1, the highest bit in b is s.

Proof

Let s′ ≠ s be the highest bit; there are two cases:

  • (i) s′ > s, that is s′ ∈ [e, m). Then, we have b \(\geq 2^{s^{\prime}} \geq 2^{e}\). This contradicts b < \(2^{e}\).

  • (ii) s′ < s, that is s′ ∈ [0, s). Then, we have \(2^{s}\) > b because s′ is the highest bit. This contradicts \(2^{s}\) ≤ b. □

Time complexity

We need O(log m) to report the highest bit; denote its position by p ∈ [0, m). To find the second highest bit, a binary search is performed, leading to O(log p). The iteration cost depends on p: the smaller p is, the fewer steps are needed. If p ∈ [m/2, m), we need log p = log m steps. If p ∈ [0, m/2), log m - 1 steps suffice. To report the i-th highest bit, the cost lies in [log m - (i - 1), log m], depending on where the bit is located. To sum up, we have

Theorem 1

Upper Bound

$$O(|b|\log m) $$

Theorem 2

Lower Bound

$$O\left( \sum\limits_{i = 1}^{|b|}(\log m - (i - 1))\right) = O(|b|\log m - (|b| - 1)|b|/2) $$

Proof

We perform binary searches to look for the |b| defined bits over a range of size m. In the worst case, they are the |b| highest bits and we need O(log m) for each iteration, leading to O(|b|log m). An optimal case is that we need only O(log m - 1) for the second iteration, namely when the position of the highest bit is smaller than \(\frac{m}{2}\). If the position of the second highest bit is \(\leq \frac{m}{4}\), we need O(log m - 2) for the third iteration, and so on. □

The binary search guarantees an optimal performance as long as \(|b| \leq \lceil \frac{m}{\log m} \rceil\). The advantage may not be significant when the number of defined bits is large and the bits are mostly located at high positions. For example, if |b| = m, the upper bound is actually larger than the linear search performance, O(|b|log m) = O(m log m) > O(m). If this happens, we simply perform a linear scan.

3.3.4 The storage

BAR includes the relation Att_Rel, the B-tree and the record file. The storage cost of Att_Rel is O(|dom(A)| ⋅ H), and the same holds for the B-tree. The size of the record file depends on the number of record items. Each tuple in Att_Rel points to a record and the number of record items is determined by the node height. Specifically, the number of nodes at height h ∈ [1, H] is O(\(E^{H-h}\)), in which E is the node fan-out and h = 1 denotes the leaf level. Consider the two extreme cases: only one node (O(\(E^{0}\))) contains the value at the root level and O(\(E^{H-1}\)) nodes contain the value at h = 1. The number of record items is calculated by

$$ \sum\limits_{a\in \textit{dom}(A)} \sum\limits_{h \in [1, H]} E^{H-h} = \sum\limits_{a\in \textit{dom}(A)}(1 + ...+ E^{H-1}) $$

Theorem 3

The storage cost is O(|dom(A)|⋅ H) + \(O({\sum }_{a\in \textit {dom}(A)} E^{H})\).

The cost increases when the number of trajectories rises and is also proportional to the number of attribute values. We will verify this in the experimental evaluation.
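As a worst-case, back-of-the-envelope illustration with hypothetical parameters (E = 64, H = 4, |dom(A)| = 100, and every attribute value occurring in every node), the two components of Theorem 3 amount to

$$|\textit{Att\_Rel}| = |\textit{dom}(A)| \cdot H = 400 \ \text{tuples}, \qquad \#\text{items} \leq \sum\limits_{a\in \textit{dom}(A)} \sum\limits_{h \in [1, H]} E^{H-h} = 100 \cdot \frac{64^{4}-1}{64-1} \approx 2.7 \times 10^{7}, $$

i.e., the record file clearly dominates the storage of BAR.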

4 Update

The database needs to keep track of the incoming data and allow querying both the historical and the new data. An important task is to synchronize the index structures in order to stay consistent with the underlying data space. Given a set of incoming multi-attribute trajectories, inserting them into the index requires updating two structures: (i) the 3D R-tree and (ii) BAR. In general, a new R-tree named \(\mathcal{R}_{u}\) is created on the new trajectories and a BAR is built on \(\mathcal{R}_{u}\). To distinguish between the historical and new structures, we term the new one BARu. \(\mathcal{R}_{u}\) and BARu are inserted into the historical structures \(\mathcal{R}\) and BAR, as illustrated in Fig. 7.

Fig. 7 An outline for updating

4.1 Updating R-tree

We pack the incoming trajectories and create a new R-tree by bulk load. The new R-tree is maintained in the same storage file as the historical R-tree in order to simplify the procedure of accessing the structure. Otherwise, one would have to detect whether an accessed node belongs to the new or the historical R-tree and select the corresponding file, and maintaining many storage files under frequent updates is a rather complex task.

\(\mathcal{R}_{u}\) is inserted into \(\mathcal{R}\) as follows. Let Hu and H denote the heights of the two R-trees, respectively. We assume Hu ≤ H because the amount of incoming trajectories for one update is usually much smaller than that of the historical data. If Hu = H, a new root node is created to hold the root nodes of \(\mathcal{R}\) and \(\mathcal{R}_{u}\) as two entries. If Hu < H, we add the root node of \(\mathcal{R}_{u}\) as an entry to an appropriate node in the target tree \(\mathcal{R}\) so that the inserted subtree sits at height Hu. This is achieved by performing a top-down traversal in the target tree until reaching the level at which the new R-tree fits. During the traversal we always choose the last entry of each accessed node as the node to be processed at the next level, because entries are sorted by increasing time and incoming trajectories are certainly located after existing trajectories. If the node is not full, we insert the root of the new R-tree as an entry into it. Otherwise, we create a new node for the new R-tree.

Consider a new trajectory mo6 = (⟨u12, u13⟩, (SILVER, VW)) in Fig. 8. We exemplify the update procedure as follows. \(\mathcal{R}_{u}\) contains only one node and is inserted as the 3rd entry of NR. We record a path denoted by \(\mathcal{P}_{u}\) while inserting \(\mathcal{R}_{u}\) into \(\mathcal{R}\). The path starts from the root node and ends at the node receiving the new entry. Each element in \(\mathcal{P}_{u}\) is of the format (h, nid, pos), in which h is the node height, nid is the node id, and pos marks the entry position in the updated node. After the insertion the path will be \(\mathcal{P}_{u}\) = ⟨(H, nid, pos), ..., (Hu, nid, pos)⟩, and we update the spatio-temporal box of each node in the path in reverse (bottom-up) order.
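One way to realize the merge step is sketched below in C++ under simplifying assumptions (an in-memory node layout, bounding boxes omitted, Hu < H, and a hypothetical fan-out); the real system operates on disk-resident nodes. The sketch descends along the last entry of each node, inserts the root of the new R-tree one level above its own height so that the inserted subtree sits at height Hu, and records the update path \(\mathcal{P}_{u}\).

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int id;
    int height;                                     // 1 = leaf level
    std::vector<Node*> entries;                     // child pointers (boxes omitted)
    static constexpr std::size_t kMaxEntries = 64;  // hypothetical fan-out
};

struct PathElem { int height, nid, pos; };          // one element of the update path P_u

// Insert the root of the new R-tree (height hu) into the existing tree rooted at
// root (height > hu), always following the last (latest-in-time) entry.
std::vector<PathElem> insertSubtree(Node* root, Node* newRoot, int hu) {
    std::vector<PathElem> path;
    Node* n = root;
    while (n->height > hu + 1) {                    // descend to the level above hu
        path.push_back({n->height, n->id, static_cast<int>(n->entries.size()) - 1});
        n = n->entries.back();
    }
    if (n->entries.size() < Node::kMaxEntries) {    // room left: add the new root
        n->entries.push_back(newRoot);
        path.push_back({n->height, n->id, static_cast<int>(n->entries.size()) - 1});
    } else {
        // node overflow: a new sibling node is created for the new subtree (omitted)
    }
    return path;  // the spatio-temporal boxes of the nodes in path are refreshed bottom-up
}
```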

Fig. 8 Insert \(\mathcal{R}_{u}\) into \(\mathcal{R}\)

Theorem 4

Time complexity

Inserting a new R-tree into the existing one requires O(H), in which H is the height of the existing R-tree.

Proof

We require O(H) to find the node whose height is the same as the new R-tree because only one node is accessed at each level. Then, we update each node in the path by following the bottom-up approach. The length of the path is O(H) and thus the updating cost is O(H). To sum up, the merging operation requires O(H). □

Theorem 5

Node accesses

Inserting a new R-tree incurs O(H) node accesses in total.

Proof

Finding the node in the existing R-tree to insert the new R-tree needs to access O(H) nodes and only these nodes will be updated after the insertion. □

4.2 Updating BAR

The procedure of building BARu is the same as in Section 3.3.1. To update BAR, we insert BARu into BAR: step 1, for each tuple in BARu.Att_Rel, we search the matching tuple in BAR.Att_Rel and append record items for the nodes in \(\mathcal {R}_{u}\); step 2, we update record items for each node appearing in \(\mathcal {P}_{u}\).

Definition 7

Matching tuple

A tuple y ∈ BAR.Att_Rel is the matching tuple of the tuple x ∈ BARu.Att_Rel if x.A_VAL = y.A_VAL ∧ x.H = y.H.

Figure 9 shows BARu for the new trajectory. Consider the tuple for SILVER at H = 1. After determining the matching tuple, we append a new item in Record 2 for N3. If the matching tuple is not found (the attribute value does not appear in BAR), we create the tuple as well as the record and insert them into BAR.

Fig. 9 Exemplify the update

\(\mathcal{R}_{u}\) is inserted into \(\mathcal{R}\) as a sub-tree and the nodes in \(\mathcal{P}_{u}\) are updated in terms of (i) spatio-temporal boxes and (ii) attribute values. Part (i) is handled in Section 4.1 and part (ii) is processed as follows. For each attribute value in the new trajectories, we look for the tuples in BAR.Att_Rel having the value and the appropriate height according to \(\mathcal{P}_{u}\). Note that the height is numbered increasingly from the leaf to the root level, guaranteeing that the heights of \(\mathcal{R}\) and \(\mathcal{R}_{u}\) are consistent. If we find the tuple, we access the record to update the item for the node; precisely, the bitmap and the time box are updated. For example, the items of Records 1 and 3 are updated in Fig. 9 because N3 is inserted into NR. Later, we refresh the record to synchronize the data.

New trajectories cause an ongoing expansion of the time dimension. The time range of the nodes in \(\mathcal{P}_{u}\) overlaps with that of the new trajectories. To enhance the update performance, we sort the items by increasing time and update the items from the end of the list.

Definition 8

Sorted record items

The items in a record are sorted by increasing time intervals. Consequently, Definition 6 is updated to \(D_{\underline{rec}}\) = {⟨(nid1, b1, t1), ..., (nidn, bn, tn)⟩ | t1 < ... < tn, nidi ∈ \(\underline{int}\), bi ∈ \(\underline{int}\), ti ∈ \(\underline{interval}\)}

If a new root node is created, no matching tuple will be found in BAR.Att_Rel. Therefore, we create the tuple as well as the record and insert them into BAR. Afterwards, we drop BARu and \(\mathcal{P}_{u}\). Let \(\mathcal{D}_{u}\) be the set of newly arrived trajectories. The update algorithm is given in Algorithm 2 and its complexity is provided below.


Theorem 6

The time complexity of updating is O(|dom(A)| ⋅ H).

Proof

The time complexity depends on (i) the domain of attribute values and (ii) the maximum value between Hu and \(|\mathcal {P}_{u}|\), that is O(H). □

4.3 Maintaining light-weight BAR

By running some test workloads, we find that step 1 is not a costly procedure in terms of I/O accesses because one just appends new record items or creates tuples. On the contrary, step 2 is an expensive procedure in which we update the record items for the nodes appearing in \(\mathcal{P}_{u}\). For each attribute value, we look for the tuple in BAR that contains the value and the corresponding height. We then access the record file to load the entire record into main memory, find the record item to update (or create it), and write the record back to disk. This incurs a lot of I/O accesses because all items are loaded from disk. Figure 10 illustrates the procedure of updating the records for SILVER.

Fig. 10 Update records

In view of this deficiency, we propose a light-weight BAR named lw-BAR to reduce the I/O cost. We buffer record items in lw-BAR rather than updating BAR directly. lw-BAR consists of a relation and a B-tree; there is no record file. The relation schema is of the form

$$\text{lw-Att\_Rel: } (A\_VAL: \underline{int}, H: \underline{int}, RecItem: \underline{rec}). $$

For each attribute value appearing in BARu, we use a tuple in lw-Att_Rel for the node in \(\mathcal{P}_{u}\). In the update path, there is only one node at each height from Hu to H, and therefore the number of updated items in a record is one; lw-BAR thus stores only one item per record. To obtain a compact structure, we merge the record component (managed in a record file in BAR) into the relation by replacing the record id with the record item. Employing lw-BAR, the number of I/O accesses for updating is considerably reduced because only one tuple is processed. Continuing the example, the light-weight structure is depicted in Fig. 11.

Fig. 11 lw-BAR for mo6 (SILVER, VW)

To accommodate frequent updates, the record items for \(\mathcal{P}_{u}\) have to be updated whenever \(\mathcal{R}_{u}\) is inserted into \(\mathcal{R}\). If the entries in the node receiving the insertion do not overflow, we only update record items for historical nodes. However, under a continuous update load, frequent insertions will cause node overflows and lead to new nodes, as illustrated in Fig. 12. In this scenario, we merge the record item in lw-BAR into the corresponding one in BAR and create another record item in lw-BAR for the new node. We give the enhanced update algorithm in Algorithm 3.

Fig. 12 Overflow and Update lw-BAR, BAR

Our method has low latency, which is very important for on-line applications. The query evaluation can be executed when \(\mathcal{R}_{u}\) and BARu have been created but not yet inserted into \(\mathcal{R}\) and BAR. If the system detects that the update is not complete, we separately search \(\mathcal{R}\), BAR and \(\mathcal{R}_{u}\), BARu and perform the union on their results to obtain the final answer.

4.4 The cost model

Recall step 2 in which for each attribute value we update the record for each node in \(\mathcal {P}_{u}\). The number of updated records is determined by (i) the domain of attribute values dom(A), and (ii) the length of the update path, denoted by \(|\mathcal {P}_{u}|\). Let Rec denote the I/O cost of accessing a record. We measure the update cost by

$$ \textit{Cost}(\text{I/O}) = \sum\limits_{h \in [H_{u}, H]} \sum\limits_{a\in \textit{dom}(A)} \textit{Rec} $$
(2)

We assume that |dom(A)| is static. Updating new trajectories one by one is expensive because each update involves Cost(I/O) = H ⋅ |dom(A)| ⋅ Rec. The larger Hu is, the lower the cost. An optimal case occurs if Hu = H: a new root node is created and a new record is created for it, leading to Cost(I/O) = |dom(A)| ⋅ Rec. This is consistent with the fact that updating by bulk load is more efficient than performing updates individually [4]. Rec is the cost of loading the record items for all nodes containing an attribute value at a particular height. E is the capacity of an R-tree node and B is the block size. The I/O cost of accessing a record for the nodes at height h is estimated by

$$ \textit{Rec} = \frac{E^{H-h}}{B}, h \in [H_{u}, H] $$
(3)

Combining Eqs. 2 and 3, the update cost is

$$ \textit{Cost}(\text{I/O}) = \sum\limits_{h \in [H_{u}, H]}{|\textit{dom}(A)|\cdot \frac{E^{H-h}}{B}}. $$
(4)

Employing the lw-BAR, we only insert one record item for each attribute value at each height, reducing the cost to \(O(\frac {1}{B})\). In contrast, the method in Section 4.2 has to load all record items into the main-memory. If a new node is created, record items for all nodes at Hu are loaded into the main-memory, leading to \(O\left (\frac {E^{H-H_{u}}}{B}\right )\), although only one item is updated. To sum up, the cost of the enhanced method is

$$ {\textit{Cost}(\text{I/O})}_{lw} = \frac{E^{H-H_{u}}}{B} + \sum\limits_{h \in [H_{u}, H]}{\frac{|\textit{dom}(A)|}{B}} $$
(5)
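A back-of-the-envelope comparison of Eqs. 4 and 5 with hypothetical parameters (E = 64, H = 4, Hu = 1, |dom(A)| = 100 and B = 100 record items per block) illustrates the gap:

$$\textit{Cost}(\text{I/O}) = \sum\limits_{h \in [1, 4]}{100 \cdot \frac{64^{4-h}}{100}} = 64^{3} + 64^{2} + 64 + 1 \approx 2.7 \times 10^{5}, \qquad {\textit{Cost}(\text{I/O})}_{lw} = \frac{64^{3}}{100} + \sum\limits_{h \in [1, 4]}{\frac{100}{100}} \approx 2.6 \times 10^{3}, $$

i.e., in this setting the light-weight structure saves roughly two orders of magnitude of update I/O.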

5 The algorithm for CkNN_MAT

We outline the query procedure in Fig. 13. By default, we introduce the method by considering a single value for each attribute and will explicitly point out the difference for multiple values when necessary.

Fig. 13 The query framework

Filter

We access BAR to determine the set of R-tree nodes that include the query attribute values and intersect the query time, denoted by Na. We take Na coupled with mq and k as input to traverse the R-tree in breadth-first order, during which the search space is pruned by taking into account spatial and temporal parameters as well as attributes. The filter returns a set of candidates, each of which fulfills the attribute condition and approximately belongs to the k nearest neighbors.

Refinement

Unpack each candidate trajectory to get the original temporal units and perform the exact distance computation to return k nearest neighbors at each query time.

Since the refinement is straightforward, we focus on the filter phase, given in Algorithm 4, which calls the two subroutines CollectNodes and Candidates.


5.1 Collecting R-tree nodes

We collect the nodes containing Qa level by level. For each a ∈ Qa, we start from h = 1 and access BAR to find the records. For each item (nid, b, t) in a record, we check whether the item identified by nid is already in Na. If not, we insert the item into Na and attach a counter, initialized to 1; this counter represents the number of query attribute values contained in the node. The extended record item is denoted by λ. If the item already exists in Na, we increase the counter and update the bitmap by performing a bitwise AND, because a node fulfilling the condition must contain all values in Qa. Let |Qa| return the number of query attribute values. We present the algorithm in Algorithm 5 (a simplified sketch is given below) and use Lemma 3 to prune items. Figure 14 depicts the procedure of determining the nodes for Qa = (SILVER, VW).
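A simplified C++ sketch of this bookkeeping for single-value query attributes is given below; the time intervals and the retrieval of the records from BAR are omitted, and the types are assumptions of this illustration rather than the system's data structures.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Item { int nid; uint32_t bitmap; };            // record item (time interval omitted)

struct Extended { uint32_t bitmap; int counter; };    // the extended item lambda

// Merge the record items retrieved for the |Qa| query values at one height:
// a node survives only if it is reported for every value (counter == |Qa|)
// and the bitwise AND of its bitmaps is non-zero (Lemma 3).
std::vector<std::pair<int, uint32_t>>
collectNodes(const std::vector<std::vector<Item>>& itemsPerValue) {
    std::map<int, Extended> na;                       // N_a, keyed by node id
    for (const auto& items : itemsPerValue)           // one item list per query value
        for (const Item& it : items) {
            auto [pos, inserted] = na.try_emplace(it.nid, Extended{it.bitmap, 1});
            if (!inserted) {                          // node already seen: AND bitmaps
                pos->second.bitmap &= it.bitmap;
                pos->second.counter += 1;
            }
        }
    std::vector<std::pair<int, uint32_t>> qualified;
    const int q = static_cast<int>(itemsPerValue.size());
    for (const auto& [nid, ext] : na)
        if (ext.counter == q && ext.bitmap != 0)      // prune by Lemma 3
            qualified.push_back({nid, ext.bitmap});
    return qualified;
}
```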

Fig. 14 Determine the nodes at h = 1

Lemma 3

Given an item λ ∈ Na, the item is pruned if λ.counter ≠ |Qa| or λ.b = 0.

Proof

  • (i) λ.counter ≠|Qa| : It is impossible that λ.counter > |Qa| because distinct attribute values are counted. If λ.counter < |Qa|, the number of attributes in the node is less than |Qa| and λ can be safely pruned.

  • (ii) λ.b = 0 : There is no entry in the node containing all query values and we safely prune λ.

Time complexity

The time cost includes (i) searching BAR.Att_Rel to find the tuples and (ii) collecting record items. We perform a lookup on the B-tree for each attribute value at each h∈ [1, H]. This requires

$$O(d \cdot H \cdot \log (|dom(A)| \cdot H)) $$

as at most d attributes are queried. Next, we consider the number of processed record items for an attribute value at a particular height. Such a value is equivalent to the number of nodes fulfilling the attribute condition. Obviously, the maximum value occurs at leaf level, leading to O(EH− 1) nodes. Assume that attribute values are uniformly distributed. Let

$$\delta_{a} = \max \left\{\frac{1}{|dom(A_{1})|},...,\frac{1}{|dom(A_{d})|}\right\} $$

be the maximum ratio of a single value to all values in the domain among all attributes and δt be the ratio of T(mq) to the overall time. The number of leaf nodes containing a query value is approximated by

$$O(E^{H-1} \cdot \delta_{a} \cdot \delta_{t}). $$

The total number of accessed record items will be

$$\sum\limits_{h \in [1, H]} E^{H-h} \cdot \delta_{a} \cdot \delta_{t} = O(E^{H} \cdot \delta_{a} \cdot \delta_{t}). $$

Theorem 7

Given a query with O(d) attribute values, the time to report the nodes containing the query is O(d ⋅ (H ⋅ log(|dom(A)| ⋅ H) + \(E^{H}\) ⋅ δa ⋅ δt)).

5.2 An extension: multiple values

We make an extension in which a query attribute defines multiple values. A node is satisfied if it contains one of the values. We extend the aforementioned λ ∈ Na to (nid, b, t, aid) by adding the attribute id. A bitwise OR is performed on the bitmaps of values from the same attribute. Algorithm 5 is slightly modified, see Algorithm 6, and Lemma 4 is used for pruning.


Lemma 4

Let AttCount(Qa) (≤ |Qa|) return the number of query attributes. An item λ ∈ Na is pruned if λ.counter ≠ AttCount(Qa) or λ.b = 0.

Consider the query “Report the nearest SILVER or BLACK VW to mo4”. We look for qualified nodes that contain SILVER/BLACK and VW. Based on Fig. 5, the BAR entries for BLACK are shown in Fig. 15.

Fig. 15 BLACK in BAR

We exemplify the procedure of reporting qualified nodes for ({SILVER, BLACK}, VW) at h = 1 in Fig. 16. Records 2 and 6 are returned for SILVER and BLACK, respectively. We perform the bitwise OR on their bitmaps and do not increase the counter. Next, we find the nodes containing VW and perform the bitwise AND to determine the nodes containing {SILVER, BLACK} and VW; the counter increases if the resulting bitmap is not zero. Non-qualified nodes are pruned according to Lemma 4.

Fig. 16 Multiple attribute values at h = 1

Theorem 8

The time cost for multiple values is O(|dom(A)|⋅ (H ⋅ log(|dom(A)|⋅ H) + EHδaδt)).

5.3 Reporting candidates

The algorithm, given in Algorithm 7, traverses the R-tree from the root to the leaf level, during which we prune the search space using both spatio-temporal and attribute predicates. A list \(\mathcal{L}\) is used to maintain the accessed nodes. For each node n ∈ \(\mathcal{L}\), we determine whether (i) n belongs to Na and (ii) objects in the subtree rooted at n can contribute to the result. If n ∉ Na, we prune it. If n ∈ Na, we extend the method in [21] to maintain the candidates in terms of their count and their distances to mq. The query needs only k neighbors: if there are k candidates at each defined time, objects that are farther from mq than the current candidates can be safely pruned, see Lemma 5.


Lemma 5

Given a node n, if there are k objects in the candidate set during T(n) and their distances to mq are smaller than dist(n, mq) at each t ∈ T(n), then n is safely pruned.

To determine whether there are enough candidates, a segment tree is maintained that stores the time interval, the distance and the number of trajectories of a node. The key point is that the number of trajectories in the segment tree is not the number of all trajectories in an R-tree node, but the number of trajectories fulfilling the attribute condition; this value depends on the query. Although we can pre-compute it for each attribute value and use it for |Qa| = 1, it has to be calculated on-the-fly for |Qa| > 1, which is a costly procedure: the result is the intersection set of several query attributes, and one would need to traverse the subtree to count the number of trajectories containing the query attributes. We therefore do the pre-computation for single values. For each a ∈ dom(A), we maintain a temporal integer to represent the distribution, so each node needs O(|dom(A)|) temporal integers. If |Qa| = 1, we create candidates using the temporal integers and insert them into the segment tree. If |Qa| > 1, we obtain the temporal integers only when trajectories (leaf nodes) are accessed. Using the index in Fig. 5, we report possible temporal integers for SILVER and VW of the root node, as illustrated in Fig. 17. We intersect SILVER and VW to determine the trajectories containing both attributes and report the trajectory count.

Fig. 17 The number of trajectories in NR for SILVER and VW

If n belongs to Na and there are not enough candidates, we open the node to look up its entries and search for objects. We call ReportBitPos to establish the entries containing the query values. If n is a leaf node, we retrieve the tuple from the relation to evaluate the attribute precisely. This is necessary because if a bit maps to several entries (m < E), the bit is set as soon as one entry contains the value; therefore, the value must be checked accurately for each entry. If m ≥ E, this step is skipped. If n is a non-leaf node, its child nodes are put into \(\mathcal{L}\) for further consideration. Using the running example, Fig. 18 exemplifies the procedure of reporting candidates.

Fig. 18 Return candidates for (SILVER, VW)

The procedure of processing multiple query values is the same as processing a single value, but more nodes will be accessed. Using the query “Continuously report the nearest SILVER or BLACK VW to mo4”, we depict the procedure of reporting candidates in Fig. 19.

Fig. 19 Candidates for ({SILVER, BLACK}, VW)

5.4 The refinement

This step includes two phases: (i) unpack each candidate to obtain temporal units; (ii) apply the slightly modified plane-sweep algorithm [3] to determine the k lowest time-dependent distance curves to report the result.

Phase (i) is straightforward. Phase (ii) takes a sequence of candidate trajectories, ordered by time. We compute their time-dependent distances to the query trajectory and find the k nearest objects at each defined time point. To achieve this, one needs to determine pieces of movement with overlapping time and apply the distance function using linear interpolation on each piece [15, 16]. We manipulate temporal units to calculate the distance. The distance function has the form of the square root of a quadratic polynomial, as illustrated in Fig. 20. The lowest curve corresponds to the nearest trajectory to the query. Since distances change over time, split points between curves are found to determine the k lowest pieces of curves.
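For completeness, the quadratic form of the distance function follows directly from the linear interpolation: writing the interpolated positions of two units over a shared time interval as p(t) = p0 + t·v and q(t) = q0 + t·w, one obtains

$$ \textit{dist}(t) = \lVert p(t) - q(t) \rVert = \sqrt{({\Delta}_{x} + t\,{\Delta}_{vx})^{2} + ({\Delta}_{y} + t\,{\Delta}_{vy})^{2}} = \sqrt{a t^{2} + b t + c}, $$

with a = \({\Delta}_{vx}^{2} + {\Delta}_{vy}^{2}\), b = 2(\({\Delta}_{x}{\Delta}_{vx} + {\Delta}_{y}{\Delta}_{vy}\)) and c = \({\Delta}_{x}^{2} + {\Delta}_{y}^{2}\), where (\({\Delta}_{x}, {\Delta}_{y}\)) and (\({\Delta}_{vx}, {\Delta}_{vy}\)) denote the differences of the start positions and of the velocities, respectively. Split points between two candidate curves are obtained by equating the corresponding squared distances, which again reduces to a quadratic equation.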

Fig. 20 Distance curve

6 Experimental evaluation

We implement the proposal in C/C++ and perform the evaluation in an extensible database system SECONDO [20]. A standard PC (Intel(R) Core(TM) i7-4770CPU, 3.4GHz, 4GB memory, 2TB hard disk) running Suse Linux 13.1 (32 bits, kernel version 3.11.6) is used.

6.1 Datasets and parameter settings

The real datasets are from the company Datatang and include Shanghai taxis (TAXI) and Beijing buses (BUS) [1]. TAXI includes taxi GPS records from four companies in 2014; the company id is defined as the attribute. BUS contains bus card records in 2014. Each record stores when (the time the card is read) and where (the locations of bus stops) passengers get on and off the bus. Each bus is identified by its id and bus stops are identified by their order in the route and their locations. We build bus trips from the bus card records by grouping records on bus id and then sorting them on time. There are 384 bus routes in total and the route id is set as the attribute. Part of the data is published at http://dbgroup.nuaa.edu.cn/jianqiu/.

Synthetic datasets are generated using the tool MWGen [56]. Several datasets are created by varying the number of trajectories, the number of attributes and the attribute domain. For each attribute, the value is randomly selected from the domain. In the system, one can flexibly define parameters to generate datasets with different settings. We describe the default dataset MAT5 here and report the other datasets when they are used in the evaluation. In each query, the standard trajectory and attribute values are randomly selected over the domains. Dataset statistics and query parameters are listed in Table 3, in which the default query values are marked.

Table 3 Datasets and parameters

6.2 The effect of packing

We report the packing effect on all datasets by referring to Fig. 21. A compact dataset is achieved and the number of temporal units decreases by an order of magnitude, demonstrating that the effect in practice is better than the theoretical analysis in Lemma 1. In turn, the number of R-tree nodes is greatly reduced and the tree height decreases. This leads to an index structure with small size, reducing the storage overhead and accessing cost.

Fig. 21 Compact trajectory approximation

6.3 Searching the bitmap

We compare the performance of querying the bitmap with the binary search and the linear search. The results are reported in Fig. 22. The binary search achieves better performance than the linear search for BUS and MAT5, but the two methods perform the same for TAXI. This is because the average number of defined bits per attribute value is less than \(\lceil \frac{m}{\log m} \rceil\) for BUS and MAT5, and therefore the binary search guarantees an optimal performance, as analyzed in Section 3.3.3. This does not hold for TAXI, whose attribute domain is small (dom(A) = [1, 4]) and whose average number of defined bits per attribute value is larger than \(\lceil \frac{m}{\log m} \rceil\).

Fig. 22 Querying the bitmap (|Qa| = 3 and k = 5)

6.4 Performance evaluation

In the evaluation, CPU time and I/O accesses are used as performance metrics and the results are averaged over 20 runs. Three baseline methods are developed.

(i) 4D R-tree.

One can treat attributes as an extra dimension in addition to space and time; therefore, a high-dimensional index can be utilized. For each multi-attribute trajectory, the attribute values (a1, ..., ad) are decomposed into d individual values and each is combined with the standard trajectory, e.g., mo1(RED, BENZ) yields \(mo_{1}^{\prime}\)(RED) and \(mo_{1}^{\prime\prime}\)(BENZ). A relation with the schema (Id: int, Trip: mpoint, Att: att) is created in which Att is a single-value attribute. The number of tuples in the relation is O(d ⋅ |D|). To answer CkNN_MAT, we need |Qa| query boxes, each of which is created from a query attribute value and the defined time T(mq). The procedure starts from the root node and evaluates each accessed node. If the node intersects the query boxes, we open the node and access its child nodes for further consideration; otherwise, we safely prune the node. A multi-attribute trajectory is returned only if, among its d trajectories, there are |Qa| trajectories intersecting the query boxes. This method enlarges the dataset d times, leading to high storage cost, and does not achieve a good locality when grouping objects for a large d.

(ii) 3D R-tree + Attribute Relation (3D RAR).

The paper [21] develops an algorithm that employs a 3D R-tree with auxiliary structures to efficiently answer continuous nearest neighbor queries over standard trajectories. We extend this paradigm to support the proposed queries by recording the set of attribute values contained in each node. A relation stores the attribute values contained in the R-tree nodes. During the query procedure, we first determine whether the accessed node contains all values in Qa by performing a lookup on a B-tree built on the relation. If yes, we open the node and proceed to the spatio-temporal evaluation; otherwise, we prune the node. The main drawback is that recording attribute values can determine whether a node contains a particular value but cannot decide the AND predicate over several attributes.

(iii) 3D R-tree + Inverted Bitmap (RIB).

In the field of spatial keyword search, many R-tree based indexes have been proposed. According to [7], the method in [53] is the most efficient for answering boolean kNN queries. The approach partitions the data into multiple groups such that each group shares as few attributes as possible. In the context of multi-attribute trajectories, trajectories are grouped on their attributes (a1, ..., ad), which can be viewed as points in d-dimensional space. We apply the Z-order curve to map the d-dimensional value to one dimension and group objects based on Z-order values. Objects are then sorted by attribute values, time and spatial data, and the 3D R-tree is employed. Each R-tree node contains a pointer to an inverted bitmap that records the positions of the entries defining the attribute value. A relation is used to store the bitmaps and a B-tree is created by combining the node id and the attribute value as the key. The bitmap length is the fan-out of a node. For each accessed node, a sequential scan is made on the bitmap to determine the entries containing the query values.

6.4.1 Scalability

We conduct the experiments to obtain insights into different aspects affecting the performance: (i) the number of multi-attribute trajectories; and (ii) attributes. Table 4 reports dataset settings.

Table 4 Datasets for scalability

Index storage

When scaling \(|\mathcal {D}|\) and d, we report the storage cost of the index structure including 3D R-tree and BAR. As shown in Fig. 23, the storage costs of both structures increase when we enlarge the number of trajectories. By fixing the number of trajectories, the storage cost of BAR rises proportionally when d is enlarged. This is consistent with the analysis in Section 3.3.4.

Fig. 23 Index storage

Scaling \(|\mathcal {D}|\)

The query efficiency is reported in Fig. 24. The results demonstrate that when the data size grows the costs of all methods rise proportionally, but our method achieves two orders of magnitude better performance than baseline methods.

Fig. 24
figure 24

Scaling \(|\mathcal {D}|\)

Tuning d (\(|\mathcal {D}|\) = 1.8 million)

Figure 25 reports the results of tuning d, in which |Qa| = 3 by default and |Qa| = 1 for d = 1. Our method achieves the best performance in all settings, and RIB performs competitively with our method when d = 3. RIB orders the data on attribute values, and the predicate has good selectivity when |Qa| = d. However, when d increases, the performance degrades dramatically. We attribute this to Z-order values not preserving locality when the dimensionality becomes large, and to the linear scan being inferior to our bitmap querying approach. The 4D R-tree has poor performance because the data set is enlarged d times.

Fig. 25
figure 25

Scaling d

Tuning dom(A) (\(|\mathcal {D}|\) = 1.8 million)

We enlarge the attribute domains by conducting two experiments: d = 1 and d = 10. In the former case, there is no AND predicate (|Qa| = 1) and the number of attribute values affects the attribute selectivity. In the latter case, we set |Qa| = 3. Our method consistently outperforms the baseline methods in both settings, as demonstrated in Figs. 26 and 27. One can see that the performance of our method and RIB improves when dom(A) becomes large, because the selectivity of the attribute predicate increases. RIB shows different behavior when tuning d and dom(A); the method is sensitive to the attributes and thus limited in scope.

Fig. 26
figure 26

Scaling dom(A) (d = 1, |Qa| = 1)

Fig. 27
figure 27

Scaling dom(A) (d = 10, |Qa| = 3)

6.4.2 Effect by |Qa|

Figure 28 reports the experimental results, demonstrating that our method is more than an order of magnitude faster than the three baseline methods. The proposed method is able to efficiently find R-tree nodes containing query attribute values. When |Qa| increases, fewer nodes fulfill the condition, reducing the number of disk accesses. Formally, given a query attribute value \(a \in dom(A_{i})\), assuming |dom(Ai)| is the minimum among all domains, the number of trajectories containing a is calculated by \(|\mathcal {D}| \cdot \delta _{a}\) (defined in Section 5.1). The number of trajectories containing Qa (a1 AND a2 ... \(a_{|Q_{a}|}\)) is approximated by \(|\mathcal {D}| \cdot \delta _{a}^{|Q_{a}|}\).
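As a purely illustrative instance of this estimate (the numbers below are hypothetical, not measured values):

```python
# Hypothetical numbers to illustrate |D| * delta_a^{|Qa|}: the expected number
# of trajectories containing Qa shrinks geometrically as |Qa| grows.
D = 1_800_000      # number of multi-attribute trajectories
delta_a = 0.05     # selectivity of a single attribute value (assumed)
for q in (1, 2, 3):
    print(q, round(D * delta_a ** q))   # 90000, 4500, 225
```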

Fig. 28
figure 28

Scaling |Qa| using MAT5

6.4.3 Effect by k

The results are reported in Figs. 29, 30 and 31. TAXI and BUS contain only one attribute, and thus we set |Qa| = 1. The query costs of all methods are not sensitive to k, and our method substantially outperforms the alternative methods. For BUS (d = 1, dom(A) = [1, 384]), the RIB performance is close to ours in a few cases due to good attribute selectivity. However, its performance decreases for TAXI (d = 1, dom(A) = [1, 4]) and MAT5 (d = 10, dom(A) = [1, 127]). The insensitivity to k is explained as follows: the majority of the time cost lies in the filter step, in which the pruning ability is not significantly affected by k, while the refinement step requires little time and the number of processed candidates does not rise proportionally to k.

Fig. 29
figure 29

Scaling k using TAXI

Fig. 30
figure 30

Scaling k using BUS

Fig. 31
figure 31

Scaling k using MAT5

In the filter step, we prune a node if there are enough candidates, that is, if during the time covered by the node there are at least k trajectories in the candidate set whose distances to the query trajectory are smaller than the distance between the node and the query trajectory. However, for Ck NN_MAT queries (|Qa| > 1) such a value cannot be determined before the query is issued. Computing the value on-the-fly is not acceptable because we would need to traverse the subtree to count all trajectories containing Qa. We obtain the value only when leaf nodes are opened to evaluate trajectory attributes. The pruning method can therefore only be applied once trajectories in leaf nodes are accessed, postponing the pruning procedure; a large number of nodes have been accessed before we can perform the pruning. We record the time cost of the filter step, which is almost equal to the overall running time. Although the refinement step iteratively processes each candidate, the overall number of candidates is not large, as reported in Fig. 32. The number of candidates does not rise proportionally to k because the candidates depend not only on k but also on Qa and the query time.
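The pruning test itself can be sketched as follows, assuming node_min_dist is the minimum distance between the node and the query trajectory over the node's time interval and that each candidate carries an upper bound on its distance during that interval; the names are illustrative.

```python
# k-based pruning test used in the filter step (simplified sketch).

def can_prune(node_min_dist, candidate_max_dists, k):
    """Prune the node if at least k known candidates are certainly closer
    to the query trajectory than anything stored in the node."""
    closer = sum(1 for d in candidate_max_dists if d < node_min_dist)
    return closer >= k

# For CkNN_MAT with |Qa| > 1 the candidate set may only contain trajectories
# fulfilling Qa, which is unknown until leaf entries are inspected, so this
# test cannot fire early in the traversal.
```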

Fig. 32
figure 32

The number of returned candidates (MAT5, |Qa| = 3)

6.4.4 Applying packing to baseline methods

Our packing method can be applied to all baseline methods. Using the datasets MAT5, TAXI and BUS, we run the baseline methods by building their indexes on packed data; default query parameters are used. As expected, the performance of all baseline methods improves due to the more compact data representation and lower index overhead. On TAXI and MAT5, our method still exhibits the best performance, about 3 times faster than the other methods. The performance difference among all methods is marginal for BUS because the attribute is quite selective (Fig. 33).

Fig. 33
figure 33

Generalize packing

Short, median and long query trips

For each standard trajectory, we calculate the trajectory deviation in both the spatial and temporal dimensions by applying Eq. 1 in Section 3.2. Then, we intentionally select short, median and long trajectories in terms of deviation values as query trajectories. Figure 34 shows that our method outperforms the baseline methods by a factor of 3 to 100.

Fig. 34
figure 34

Short, median and long query trajectories

Multiple values for one attribute

Using |Qa| = 3, we let one of the query attributes include a set of values, i.e., |Qa.Xi| > 1, and scale the number of values over {2, 4, 6, 8, 10}; the other two attributes each define only one value. As reported in Fig. 35, the query costs of all methods rise when |Qa.Xi| increases because more data objects are involved in the evaluation. Our method performs at least 2 times faster than the other methods.
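The extended predicate amounts to an OR within the multi-valued attribute and an AND across attributes; a minimal sketch with illustrative names:

```python
# Attribute predicate with one multi-valued query attribute (simplified):
# OR within an attribute, AND across attributes.

def matches(trajectory_atts, query):
    """trajectory_atts: attribute name -> value of one trajectory;
    query: attribute name -> set of accepted values."""
    return all(trajectory_atts.get(a) in accepted for a, accepted in query.items())

query = {"att1": {"SILVER"}, "att2": {"VW", "TOYOTA"}}       # |Qa.att2| = 2
print(matches({"att1": "SILVER", "att2": "TOYOTA"}, query))  # True
print(matches({"att1": "BLACK",  "att2": "VW"}, query))      # False
```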

Fig. 35
figure 35

Multiple query values

6.4.5 Update evaluation

Update efficiency

Using MAT5 and BUS, we build the historical database on part of the dataset and take the rest as new trajectories, denoted by \(\mathcal {D}_{u}\). The performance is evaluated by scaling \(|\mathcal {D}_{u}|\), and the overall time is measured, as reported in Fig. 36a. Although \(|\mathcal {D}_{u}|\) increases by several orders of magnitude, the updating cost only rises marginally. We also evaluate a series of updates by setting \(|\mathcal {D}_{u}|\) = 500,000 for MAT5 and \(|\mathcal {D}_{u}|\) = 300,000 for BUS, respectively, and building the historical index on \(\mathcal {D}\) - \(\mathcal {D}_{u}\). Then, 50 updates are performed consecutively, each processing \(\frac {|\mathcal {D}_{u}|}{50}\) trajectories. We measure the overall update time and report the result in an accumulated way, as shown in Fig. 36b. The time cost increases only slightly.

Fig. 36
figure 36

Update performance

Performance between BAR and lw-BAR

We evaluate the efficiency of updating BAR by comparing the baseline method and the one using the light-weight structure. As illustrated in Fig. 37, lw-BAR substantially outperforms the baseline method, validating the proposed cost model. Using MAT5, an interesting behavior occurs when \(|\mathcal {D}_{u}|\) reaches 200,000: the two methods incur almost the same number of I/O accesses. We find that under this workload the height of the new R-tree equals that of the historical R-tree, so a new root node is created to hold the two subtrees and the length of the updating path is 0, requiring only a small update cost for both methods.

Fig. 37
figure 37

BAR and lw-BAR

We evaluate the query performance by building the index on the same dataset in two different ways. The first method involves updating: the structure is initially created on \(\mathcal {D}\) - \(\mathcal {D}_{u}\) (\(|\mathcal {D}_{u}|\) = 500,000 for MAT5 and \(|\mathcal {D}_{u}|\) = 300,000 for BUS) and then 10 updates are performed consecutively (\(\frac {|\mathcal {D}_{u}|}{10}\) trajectories in each update). The second method creates the index on \(\mathcal {D}\), i.e., without any update. We aim to test whether updates influence the query performance; default query parameters are used. Figure 38 reports the query costs, showing that continuously updating the structure does not greatly affect the performance. In fact, the performance in the update scenario is superior because a better partition is achieved in the time dimension.

Fig. 38
figure 38

Query performance

6.5 Discussion

We analyze each alternative method and explain why it is inferior to our method. In order to build the 4D R-tree, a multi-attribute trajectory is decomposed into d trajectories, each of which contains a single-value attribute. This enlarges the data set d times. During query processing, the attribute is evaluated only approximately. Using Qa = (SILVER, VW), we search the index for trajectories containing either SILVER or VW because each decomposed trajectory contains only one attribute. However, among the returned trajectories only some contain both of them; multi-attribute trajectories such as (SILVER, TOYOTA) or (BLACK, VW) will also be included, increasing the number of processed objects. To exactly determine the qualifying objects, we have to traverse the tree down to the leaf level and retrieve the tuples for accurate examination.

The method using 3D RAR is able to select R-tree nodes containing an individual attribute value, but cannot determine which trajectories contain several values from different attributes. Given a set of values, all trajectories containing any one of them are included, although some of them may not fulfill the condition. The AND predicate cannot be evaluated before leaf nodes are accessed, weakening the pruning ability.

RIB is limited in scope because it only achieves good performance when the attribute predicate is selective, e.g., (i) |Qa| = d, or (ii) |Qa| = 1 and dom(A) is large. In other settings, the performance significantly deteriorates. Furthermore, if traditional nearest neighbor queries are processed, the efficiency degrades because the spatio-temporal proximity is not preserved. Our method achieves stable performance and generalizes to queries on standard trajectories.

7 Generality

The generality of our method covers three aspects: (i) packing standard trajectories, (ii) managing attribute values, and (iii) supporting a range of queries on multi-attribute trajectories as well as queries on standard trajectories.

Packing

The packing method produces a compact data set with no information loss and no extra storage cost. The procedure can be applied to other trajectory queries to enhance their performance.

BAR

The system is able to flexibly build either the traditional trajectory index or the hybrid index, depending on whether standard trajectories or multi-attribute trajectories are processed. BAR is not tightly integrated into the spatio-temporal index and can therefore be combined with other traditional trajectory indexes, categorized into (i) R-tree based indexes, e.g., the TB-tree [37] and the MV3R-tree [45], and (ii) grid based indexes, e.g., SETI [6]. These well-established structures do not have to be modified, which benefits system development. Figure 39 reports BAR built on top of the TB-tree and a grid index, following a procedure similar to Section 3.3.1. We instantiate BAR on the 3D R-tree due to its advantage of preserving the spatio-temporal proximity and its efficiency in answering nearest neighbor queries, as demonstrated in [21], and compare it with other trajectory indexes as follows.

  • TB-tree. The structure has the trajectory preservation property that a leaf node only stores units of the same trajectory, resulting in a large spatial extent of the leaf nodes. The spatial proximity is not preserved because segments of different trajectories that lie spatially close will be stored in different nodes. We cannot effectively prune the search space by min and max distances, resulting in poor performance for nearest neighbor queries. The STR-tree [37] introduces a parameter to balance spatial properties and trajectory preservation, but its main concern is the spatial domain, treating the temporal dimension as a secondary issue.

  • SETI. The space is divided into disjoint cells, each of which contains the trajectory segments lying completely within the cell and has a temporal index (an R-tree) on the objects' time intervals. The number of spatial partitions plays a crucial role in the index design, but setting an optimal value is not trivial, e.g., trajectories may be uniformly or skewedly distributed, making the performance unstable. The method focuses on the spatial proximity and has the limitation that the boundaries of the spatial dimension remain constant. The SEB-tree [41] is similar to SETI in that the space is partitioned into zones, but it only stores the zone information in the database without knowing the exact location.

  • MV3R-tree. The index combines a multi-version R-tree (MVR-tree) with a small auxiliary 3D R-tree built on the leaf nodes of the MVR-tree. The former processes timestamp queries and the latter processes long interval queries; short interval queries are processed by selecting the appropriate tree. We deal with interval queries, and thus the structure is essentially a 3D R-tree. The MR-tree [58] builds a separate R-tree for each timestamp and achieves good performance for timestamp queries. However, it is not efficient for time window (interval) queries and contains many replicated node entries.

Fig. 39
figure 39

Popularizing BAR

Queries

The following queries can be answered within our framework: (i) spatio-temporal windows + attributes, e.g., “Did any RED BENZ pass the restricted area during [t1, t2]?”; (ii) continuous distance queries + attributes, e.g., “Continuously report all TOYOTAs within 200 meters of the target”; and (iii) spatio-temporal similarity + attributes, e.g., “Did any SILVER VW follow the target?”. We search BAR to find the subtrees of the spatio-temporal index fulfilling the attribute condition and then explore the spatio-temporal index. If attributes are not considered, the algorithm directly searches the spatio-temporal index without accessing BAR.
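The common two-step procedure can be sketched as follows, with BAR abstracted as a mapping from an attribute value to the set of subtree roots of the spatio-temporal index that contain it; the names and structures are simplified placeholders, not the actual system interfaces.

```python
# Generic query framework (simplified): step 1 restricts the search to the
# subtrees fulfilling the attribute condition via BAR, step 2 runs the usual
# spatio-temporal evaluation inside those subtrees only.

def candidate_subtrees(bar, query_atts):
    """Intersect the subtree sets of all query values (AND predicate)."""
    sets = [bar.get(a, set()) for a in query_atts]
    return set.intersection(*sets) if sets else set()

def answer_query(bar, st_index, query_atts, spatio_temporal_search):
    if query_atts:
        roots = candidate_subtrees(bar, query_atts)
    else:
        roots = {st_index.root}   # no attributes: plain trajectory query
    results = []
    for root in roots:
        results.extend(spatio_temporal_search(st_index, root))
    return results
```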

8 Related work

8.1 Querying trajectories

In the literature, tremendous efforts have been made on querying standard trajectories. Representative works include nearest neighbors [18, 19], similarity search [8, 42] and pattern discovery [27, 32]. In particular, continuous nearest neighbor queries are studied in [17, 21], but the result differs from Ck NN_MAT.

In the era of big data, large amounts of trajectory data are collected and used to support applications such as personalized route recommendation [11]. The paper [31] studies retrieving distinct trajectories that pass a user-specified spatio-temporal region over big trajectory data, and estimates the answers with a guaranteed error bound. Scalable algorithms for nearest neighbor joins are studied in [13]. The processed data, mainly retrieved from GPS, consists of standard trajectories, i.e., within the scope of location and time. Although GPS records contain a variety of attributes such as velocity, direction and acceleration, these are location-related and can be inferred from the motion function. We enrich the data representation to support attributes independent of locations and allow users to find k nearest neighbors over time with certain attribute values, extending the query capability.

Emerging applications require extensive information about trajectories, such as quality and semantics [65]. Semantic trajectories are studied to discover meaningful knowledge from locations [2, 59, 62]. A semantically enriched trajectory is typically defined as a sequence of timestamped places, where each place is represented by a location with a semantic label, so that interesting patterns can be properly defined and extracted. For example, a so-called fine-grained sequential pattern reports trajectories that simultaneously satisfy spatial compactness, semantic consistency and temporal continuity [61]. Consider actions that users can take at particular places, such as sports, dining and entertainment. Activity trajectories are defined by associating geo-spatial points with activities; a similarity search returns the k trajectories whose semantics contain the query and that have the shortest minimum match distance [64]. Motivated by the fact that standard trajectories do not make much sense to humans, a partition-and-summarization approach is proposed to automatically generate texts highlighting the significant semantic behavior [43]. A good survey of semantic trajectories is given in [36].

Moving objects with transportation modes are investigated in [55, 57]. A trajectory over diverse geographical spaces includes timestamped locations and a sequence of transportation modes such as Indoor → Walk → Car. Queries containing transportation modes can be answered, e.g., “who arrived at the university by taxi”. A generic model, called symbolic trajectory, is proposed to capture a wide range of meanings derived from a standard trajectory [23]. A systematic study of annotated trajectory databases represents a symbolic trajectory by a time-dependent label, which can be, for example, names of roads or a speed profile.

There are fundamental differences between those works and multi-attribute trajectories. First, we consider attributes that are location-independent and cannot be discovered from standard trajectories; this differs from attaching location labels in semantic trajectories. Symbolic trajectories do not contain geo-locations, while multi-attribute trajectories do. Second, different queries are evaluated. We incorporate attributes into the evaluation as Boolean conditions and search k nearest neighbors at each defined time, whereas previous queries deal with spatial closeness and attribute similarity instead of time-dependent distances and exact matches on attributes. Labels are sparsely defined in semantic trajectories because only a few locations may carry semantics; as a result, ranking queries are primarily dealt with rather than continuous queries with attributes.

Another related work is heterogeneous k-nearest neighbor queries [44]. A moving object is represented by a location-independent attribute and a set of coordinates. By defining a function that combines the costs of distances and the location-independent attribute, the query returns the objects with the k smallest values. Although that work considers a location-independent attribute, there are three major differences from ours. First, the data representation is limited in scope because each moving object is associated with only one attribute, whereas we consider multiple attributes to obtain a generic solution. Second, they query objects based on a ranking function over distance and attribute, but we require exact matches on attributes, leading to different results. Third, their distance function is not time-dependent, while we deal with distances changing over time to support continuous queries. The query in [63] continuously reports the k nearest spatial points with keywords to a moving object in a road network; the processed data consists of spatial points with keywords, whereas we deal with moving objects.

8.2 Indexing and manipulating trajectories

In the last decade, an impressive number of access methods have been proposed to optimize the processing of trajectory queries [6, 38, 39, 45, 48]. A survey of trajectory indexing and retrieval is given in [29]. These indexes only handle the proximity of spatial and temporal data and do not achieve optimal performance for Ck NN_MAT. On the one hand, we cannot use such an index to select objects fulfilling the attribute condition because spatio-temporal indexes do not manage attributes. On the other hand, the criteria of minimum and maximum distances cannot be used for pruning because objects with small distances may not contain the query attribute values. As exemplified in Fig. 40, employing the pruning heuristic of min and max distances, the node NA is pruned because the objects in NA are farther from mq than those in NB. However, this produces false results because the objects in NB do not fulfill the attribute condition. Spatio-temporal index structures are not aware of attribute values and therefore cannot determine the nodes containing query attribute values.

Fig. 40
figure 40

Min and max distances

A grid index is established to organize activity trajectories in a hierarchical manner [64]. Activities are location-dependent, and the index maintains the spatial and activity proximities. A similar structure is developed to incorporate both spatial and semantic information for approximate keyword search [62]. The grid index is a spatial index that maintains locations with activities rather than continuous movements. These works address ranking queries, whereas we deal with continuous queries.

Several algorithms have been proposed to minimize the total volume of trajectory approximations given a user-specified number of splits [25]. Rasetic et al. provide an improved solution that splits trajectories into a number of sub-trajectories and indexes them to minimize the number of expected disk I/Os with respect to an average size of spatio-temporal range queries [39]. Calibrating trajectory data is studied in [42], the goal of which is to transform heterogeneous trajectories into ones with unified sampling strategies; the method rewrites raw trajectories based on a reference set in order to remove the impact of sampling heterogeneity. Trajectories after calibration are not the same as the original form due to operations such as shifting, removing and inserting. The line of work on trajectory simplification [33, 34, 54] aims to approximate a trajectory by a simplified one with fewer points; the goal is to retrieve less data while preserving good quality, but the trajectories are modified and differ from the originals. In contrast, we merge small pieces of movement and approximate trajectories in a compact way without any information loss. Trajectory segmentation serves the purpose of partitioning each trajectory into sub-trajectories for a compact representation [62]. They evaluate spatial, temporal and semantic variances based on a local trajectory and aim to find an optimal method that uses the minimum number of trajectory segments such that the maximum decrease of the feature value is less than a threshold. The paper [12] optimizes the maintenance of continuous queries over standard trajectories.

Moving objects databases, as representatives of update-intensive applications, need to handle frequent updates. A number of approaches have been proposed to improve the efficiency of updating spatio-temporal indexes [5, 26, 40]. To support updating multi-attribute trajectories, not only locations and time but also attribute values have to be addressed, while low latency should be maintained to allow instantaneous queries.

8.3 Spatial keyword queries

Recently, spatial keyword queries (SKQ for short) have been extensively studied in the literature [7, 30, 52]. The task is to support queries that take a geo-location and a set of text descriptions called keywords as arguments and return either the objects that are close to the query location and contain the keywords, called Boolean k NN queries [14], or the objects with the highest ranking scores measured by a combination of the distance to the query location and the text relevance to the keywords, called Top-k NN queries [10]. To efficiently answer such queries, a spatial index such as a 2D R-tree is combined with a text index structure. For example, the IR-tree [10] augments each node of the R-tree with a pointer to an inverted file that summarizes the text content of the objects in the corresponding subtree. During the query procedure, the combined structure is used to estimate both the spatial distance and the text relevancy and to prune the objects that cannot contribute to the result.

Compared to SKQ, multi-attribute trajectories are in principle the combination of standard trajectories and attributes. Both problems extend traditional spatial and moving objects to enrich the data representation. However, there are some fundamental differences. In terms of data representation, SKQ focuses on static geo-locations and location-dependent text descriptions, whereas we cope with moving objects and location-independent attributes, leading to different queries. SKQ considers a static query location, while our query is a moving object and returns the nearest objects to the query at each query time, complicating the evaluation. Furthermore, text descriptions and attributes lead to different tasks when designing the index structure. In SKQ, the index groups close objects in terms of spatial distances and location-related text relevance. Attributes do not depend on particular locations but are associated with objects. It is possible to attach attributes to timestamped locations, but then each piece of a trajectory would carry all attributes along with it, resulting in an extremely large amount of redundant data. In fact, the key issue in boosting the index is to know which objects contain particular attribute values and where they are located in the spatio-temporal index; therefore, a different criterion is used to build the index. Last, there are no updates in SKQ, but we accommodate frequent updates.

9 Conclusions

We enrich the trajectory representation to form multi-attribute trajectories and propose a new query. A hybrid index structure is designed, and updating the index is also supported. We make a systematic design such that the attribute structure can be combined with a range of standard trajectory indexes, and efficient query algorithms are developed. An extensive experimental evaluation on large datasets demonstrates the performance advantage of our approach over baseline methods. Future work includes similarity queries on multi-attribute trajectories.