1 Introduction

During the last decade there has been an increased demand for applications in spatial networks with moving objects and trajectories. These range from navigation systems (road/river/railway networks, traffic analysis, map destinations, shortest routes, location-based services) to energy-resource networks (oil, electrical power, natural gas, telephone lines, water-sewer systems, etc.).

Spatial networks are characterized by topological restrictions, since moving objects follow specific routes to reach a destination. Even in aviation, flights respect predefined air routes. Moving objects may have properties which may affect their status through the network (e.g. speed, congestion tolerance and priorities, obeying certain network restrictions) or may not (e.g. ID, label, color).

A spatial network can be modeled as a graph with a set of nodes connected with edges. The main advantage of graph representations is that usually there is no need for the graph topology to perfectly match the real geography. Most applications require only the existence of nodes and weighted edges. Usually a moving object has a starting node and a destination node, and it may also have to visit some in-between network nodes. The path followed by an object, respecting the restrictions enforced by the network, is called the object trajectory.

Modern spatial applications involve big data; thus sophisticated indexing and processing methods are required for efficient data management. When new trajectories have to be inserted, or old trajectories have to be modified/deleted, the indexes must efficiently support these operations. Among the spatial access methods, a well-established and popular family of such indexes is that of the R-tree-based methods.

Another vital requirement in spatial applications is the efficient query processing of trajectories. In particular, a popular query is top-k similarity retrieval of trajectories, i.e. given a trajectory (or some spatial locations in the network and time restrictions for the transition between nodes) we want the top-k most similar trajectories to the given one (or passing close to the spatial locations).

In emergency applications the query response time is crucial. Therefore, another desired property is progressive query processing, i.e. the results are provided to the user in an incremental manner: when a trajectory satisfies the space and time restrictions, it is provided to the user, while the next results are being prepared.

In the area of trajectory query processing there are some major challenges. A large number of the proposed approaches require a preprocessing step to precompute all-to-all pairwise shortest paths in the spatial network, with significant space and time costs, especially in large networks. To avoid this preprocessing, some methods ignore the network restrictions and use the Euclidean node distances to provide preliminary results, which require subsequent filtering.

Another issue is that real-life applications generate large amounts of data in a short time. For instance, we can consider the number of trajectories generated in the road network of a medium-sized city during a single day. In such a case, it is crucial to efficiently suggest trajectories to the moving objects in the network. While the existing spatiotemporal indexing methods may be sufficient for small networks and trajectory datasets, their performance degrades and becomes inefficient in very large networks. Moreover, complex indexes require even higher processing cost to manage the trajectories. Therefore, the challenge is to have simple and efficient indexes for big data.

The most popular spatial access methods for such settings are either tree-based (R-tree, M-tree, etc.) or methods that exploit other structures. A distinct category is comprised of progressive trajectory similarity search methods, which can highly reduce the on-line query processing time, due to the fact that not all top-k results need to be retrieved, if users find the already retrieved results satisfactory.

To overcome the weaknesses of existing trajectory similarity search approaches, we propose an elaborated variation of our method Cluster-extended Adjacency List (CeAL) [32]. CeAL has been used to enhance location-based trajectory similarity top-k queries, where a user provides the query locations in a spatial network along with time restrictions, and the top-k similar trajectories to the locations that satisfy the time restrictions are provided to the user in a progressive manner. Here, the original CeAL method is modified and applied as a core indexing scheme to support efficient trajectory management and retrieval in very large spatial networks. This is achieved by facilitating trajectory similarity searching by taking into account both spatial and temporal restrictions between nodes.

The proposed variation of CeAL inherits the advantages of its original version [32] and, in addition, it is enhanced with the following characteristics:

  (a) User-defined locations and time restrictions have been replaced by recorded spatial positions and timestamps of moving objects. Automatically calculating and recording locations and time restrictions through the applied moving objects framework enhances CeAL.

  (b) Trajectory similarity calculations have been significantly reduced, by avoiding the computation of all-to-all shortest path distances as most of the previous methods do, by limiting the calculation of pairwise node distances from the small set of the selected spatial locations to the nodes of the spatial graph.

  (c) CeAL consumes linear space, since adjacent nodes are connected directly to stored trajectory data on disk, without building any complex index.

  (d) The trajectories can be provided in a personalized manner by means of a proposed spatio-temporal similarity measure adaptable to the user preferences by tuning the query to be more spatial- or more temporal-oriented. Additionally, users can set weights to spatially prioritize the selected spatial locations.

  (e) Finally, CeAL is established with all necessary theorems and proofs for its properties and its complexity.

The main contributions of this study are:

  • We propose a novel variation of algorithm CeAL for the progressive processing of top-k trajectory similarity queries in spatial networks. CeAL can operate in on-line environments, as the on-line query processing cost has been reduced due to its progressive approach, which enables early termination if adequate results have been reported to the user.

  • We propose a new spatio-temporal similarity measure, which satisfies the generalized metric properties and can also be used in other related problems, where the metric properties are applied to efficiently prune the search space.

  • We provide an extensive comparison of CeAL against other state-of-the-art methods with respect to their space and time performance. To this end, we use an enhancement of Brinkhoff’s classic generator [4], which takes into account the spatial and temporal restrictions.

The structure of the sequel is as follows. The next section provides an overview of the relevant literature, whereas Section 3 gives basic definitions, notations and assumptions. Alternative access methods are described in Section 4, whereas Section 5 focuses on the particular implementations. Section 6 reports the results of an extensive experimentation. Finally, the last two sections conclude the paper and discuss further extensions.

2 Related research

2.1 Trajectory generation

Trajectory generators for moving objects in spatial networks must be designed realistically. A classic trajectory generator has been proposed by Brinkhoff [4]; it takes into account the spatial restrictions imposed by the underlying network and the temporal restrictions due to characteristics of the moving objects. Some important aspects of the generation process are the maximum speed of the moving objects, the influence of the other moving objects on the speed and the route of an object, the maximum capacity of connections, the adequate determination of the start and destination of an object, the influence of external objects and events, and time-scheduled traffic. The generator is written in Java; to further enhance its performance in very large networks, we extended this method in a new implementation framework (C++, Boost Graph library [5]).

2.2 Trajectory indexing and retrieval

Access methods for trajectory management must be efficient during query processing in spatial networks. To this end, there are several relevant works proposed in the literature. In [20, 27] the trajectory of a moving object is represented as a set of graph edges followed by the object during its lifetime. Also, of interest is the time interval during which the moving object traverses a specific edge. Additionally, two kinds of transformation techniques are proposed for network data and for trajectory data. Both techniques store the data in R-trees. An alternative way to represent the trajectory data is to store the visited nodes along with the corresponding time instant when the visit takes place [11].

In [37] the notion of multi-attribute trajectories is studied, i.e. standard trajectories with descriptive attributes. Multi-attribute trajectories are indexed in a 3D R-tree and a composite structure which can be adapted to work with any R-tree-based or Grid-based index. However, the article focuses on the problem of continuous k nearest neighbor queries over the data trajectories and proposes efficient algorithms for query processing.

Several studies model trajectories as time series using transformation techniques, where trajectory similarity search is performed by using either distance measures or subsequence matching [1, 7, 8, 14, 21, 25, 29, 35, 38]. However, these works suffer from a high cost of similarity calculations. Therefore, pruning or approximation methods have been proposed to decrease the computational cost. Most of the works on similarity search assume Euclidean spaces, either transformed or not, using R-trees or variants [24]. The works of [9, 16, 22] introduce query processing algorithms for similarity search in trajectory data ignoring, however, the temporal domain.

The works of [6, 17] retrieve trajectories similar to a query trajectory in both spatial and temporal domains. In particular, similarity calculations and optimization techniques, such as pruning and bounding, are performed in Euclidean space, which contradicts the nature of spatial networks. On the other hand, the works of [33, 34] perform trajectory similarity search in spatial networks using M-trees [10], to prune the search space based on metric functions.

With respect to the storage of trajectories, several methods have been proposed. Frentzos et al. use the FNR-tree, a 2-d R-tree that stores graph edges and whose leaves point to the roots of 1-d R-trees, which store the visits of each specific edge [15]. In [31], trajectories are considered as sets of points in the Euclidean space and are indexed with R-trees. The algorithm returns the k most similar trajectories to a set of predefined point locations, and uses a heap to retrieve candidate trajectories from each individual query point. Subsequently, the candidate trajectories are refined according to specific bounds. The particular algorithm is based on the methodology of Fagin’s Threshold Algorithm [13]. The main distance measure is an aggregation of the distances from the query points to the corresponding nearest trajectory points. The idea of this aggregation has also been studied in [26].

With respect to similarity measures, mainly the spatial attributes of the trajectories are taken into consideration, with temporal data becoming relevant only occasionally. A query processing algorithm returns the most similar trajectories by searching over a set of candidate trajectories. Although a variety of trajectory similarity measures has been proposed, most works rely on a few specific ones. For instance, some widely used similarity measures are: Euclidean distance [1], Discrete Fourier Transformation and Wavelets [7], Edit Distance and its variations [8, 9], Longest Common Subsequence [35] and Dynamic Time Warping [38].

The existing spatiotemporal indexing methods are sufficient for small networks and trajectory datasets, but their performance gradually degrades and becomes inefficient in very large networks. Moreover, the more complex the index, the higher the processing cost to manage the trajectories. Tiakas et al. alternatively suggest that the network be represented by a structure based on adjacency lists, built at a preprocessing step [32]. This is the original version of CeAL. To each edge formed at preprocessing a cluster is assigned, which contains references to all the trajectories that pass through that particular edge. In this way a simple and efficient index is constructed to handle large data.

3 Preliminaries

Here we introduce the basic terminology and assumptions, present the main tasks and formulate the main problem.

3.1 Definitions

A spatial network can be represented as a graph G(V,E) consisting of a set of vertices V and a set of edges E. On a 2-d plane every vertex can also be defined by its coordinates as (xi,yi) ∈ V. Every edge can be defined as (vi,vj) ∈ E and represents a connection between vi and vj. We assume that the network is a static connected undirected graph. For real data, edges are weighted, i.e. a weight w(vi,vj) is given to any pair of neighbor nodes vi,vj, to represent the distance between them or the time spent to travel from one to the other, and so on.

We also assume that the distance between two non-neighbor nodes vi,vn along a path equals the sum of the weights of all the edges in the path: w(vi,vj) + w(vj,vk) + … + w(vm,vn). If there are several paths connecting the two nodes, then the path of interest is the geodesic path, i.e. the shortest one, whose length is called the network distance. We normalize this distance by dividing by the network diameter (the maximum shortest path distance between any two nodes), to have a distance measure in the interval [0,1]. We will denote this normalized distance between any two nodes va,vb of the network as d(va,vb).
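
As a minimal sketch of the normalization described above, the following Python fragment computes geodesic distances with Dijkstra's algorithm and divides them by the network diameter. The toy graph, the function names and the brute-force diameter computation are illustrative assumptions and not part of the original method; on a large network the diameter would have to be obtained in some other way.

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest path lengths on a weighted undirected graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def network_diameter(adj):
    """Maximum shortest path distance between any two nodes (brute force)."""
    return max(max(dijkstra(adj, s).values()) for s in adj)

def normalized_distance(adj, a, b, diameter):
    """d(a,b): network distance divided by the diameter, so it lies in [0,1]."""
    return dijkstra(adj, a)[b] / diameter

# Hypothetical toy network: node -> {neighbor: edge weight}
adj = {
    "a": {"b": 2.0, "c": 5.0},
    "b": {"a": 2.0, "c": 1.0, "d": 4.0},
    "c": {"a": 5.0, "b": 1.0, "d": 1.0},
    "d": {"b": 4.0, "c": 1.0},
}
diameter = network_diameter(adj)
```

On this toy graph the geodesic path from a to c runs through b (length 3), the diameter is 4, and therefore d(a,c) = 0.75 and d(a,d) = 1.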

This distance measure d(⋅) satisfies the metric properties:

  • Non-Negativity: Any transition from vertex va to vertex vb has a non-negative cost. Therefore, it holds that: d(va,vb) ≥ 0, whereas d(va,vb) = 0 ⇔ va = vb.

  • Triangular Inequality: For any three nodes va,vb,vc it holds that: d(va,vb) ≤ d(va,vc) + d(vc,vb).

  • Symmetry: Since the network is undirected, the distances from node va to node vb and vice-versa are equal: d(va,vb) = d(vb,va).

We denote a trajectory as Ti which is part of a trajectory set T: Ti ∈ T. Each trajectory Ti has its own length ri of spatial points, which is called description length. We assume that the trajectories have an arbitrary description length, which means that for two different trajectories Ti,Tj, it may hold that ri ≠ rj.

We assume that the spatial points of the trajectories lie on the nodes of the spatial network. Otherwise, if the spatial points of the trajectories lie on the edges, then they can be aligned to the closest nodes using map-matching methods [2, 3, 18, 23, 36]. This matching does not affect the proposed methods, since it can be performed in a preprocessing step, while generating the trajectory data.

Therefore we can define a trajectory as an ordered set of ri pairs, which correspond to nodes vi visited during the network traversal along with the time instances \(t_{v_{i}}\) that the visit takes place:

$$ T_{i} = \{(v_{i1},t_{v_{i1}}),(v_{i2},t_{v_{i2}}),\ldots,(v_{ir_{i}},t_{v_{ir_{i}}})\} $$
(1)

We consider a node visit to be an instantaneous event with zero time elapsed, i.e. we ignore the time spent by an object in any node. Defining a trajectory requires only the total time spent by the object when moving from one node to another within the network limits.

Finally, the multiset of the spatial points from all trajectories is denoted as R, and the multiset of all trajectory edges as RE. Both multisets R and RE represent the raw trajectory data, and it holds that:

$$ |R| = \sum\limits_{i=1}^{|T|} r_{i}\qquad |RE| = \sum\limits_{i=1}^{|T|} (r_{i}-1) = |R|-|T| $$
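
The two counts can be checked on a small hypothetical dataset. The sketch below stores each trajectory as a list of (node, time) pairs, following Eq. (1); the node labels, timestamps and helper names are assumptions made for illustration only.

```python
# Hypothetical trajectory set T: trajectory ID -> ordered (node, timestamp) pairs
trajectories = {
    1: [("a", 0.0), ("b", 2.0), ("c", 3.0)],
    2: [("d", 1.0), ("c", 2.0)],
    3: [("b", 0.5), ("c", 1.5), ("d", 2.5), ("b", 6.5)],
}

def edges_of(traj):
    """Consecutive node pairs traversed by a trajectory (r_i - 1 edges)."""
    nodes = [v for v, _ in traj]
    return list(zip(nodes, nodes[1:]))

def travel_times(traj):
    """Time spent moving between consecutive nodes; node visits are instantaneous."""
    times = [t for _, t in traj]
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]

size_R = sum(len(traj) for traj in trajectories.values())             # |R|
size_RE = sum(len(edges_of(traj)) for traj in trajectories.values())  # |RE|
```

Here the description lengths are 3, 2 and 4, so |R| = 9 and |RE| = 9 − 3 = 6, in agreement with the identity |RE| = |R| − |T|.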

3.2 Indexing and managing trajectories data

A focus of this work is to estimate the efficiency of the examined indexing methods to manage trajectories. In particular, we study their efficiency in supporting dynamic environments where new trajectories have to be inserted or old trajectories have to be modified/deleted, as well as their consumed space to index data. We will use large networks to study their behavior along with the respective algorithms. We will also conduct experiments in very large real road networks, which are relatively sparse but with millions of nodes and edges.

3.3 Problem definition for trajectory similarity top-k queries

Let G be the underlying graph of an undirected network and T a trajectory dataset. Let Q be a set of query locations q1,q2,…,qm, which are spatial points (nodes of G) that the resulting trajectories should pass as close to as possible. Let also qt2,qt3,…,qtm be the corresponding inter-arrival times, which are m − 1 tolerance time intervals, acceptable by users for travelling between the query locations (\(qt_{i}=\infty \) denotes the lack of time restriction for the transition to location qi). Let w1,w2,…,wm be the users’ predefined weights, expressing the personal preference of importance of the m query locations, where 0 < wj < 1 for j = 1,...,m and \({\sum }_{j=1}^{m} w_{j} = 1\). Given a similarity function sim(Q,Ti) between the set Q of query locations and a trajectory Ti ∈ T, the goal is to find the k trajectories in T with the highest similarity score to Q.

For this study the query locations q1,q2,…,qm and the corresponding inter-arrival times qt2,qt3,…,qtm can alternatively be given through a query trajectory Tq, where its nodes define the query locations and its time instances define the corresponding inter-arrival times.
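
One way to read this alternative input is sketched below: the nodes of Tq become the query locations and the differences of consecutive timestamps become the inter-arrival times. The function names and the uniform default weights are assumptions for illustration; the text does not prescribe how weights are chosen when a query trajectory is given.

```python
def query_from_trajectory(tq):
    """Derive (q1..qm) and (qt2..qtm) from a query trajectory of (node, time) pairs."""
    locations = [v for v, _ in tq]
    times = [t for _, t in tq]
    inter_arrival = [t2 - t1 for t1, t2 in zip(times, times[1:])]  # m - 1 values
    return locations, inter_arrival

def uniform_weights(m):
    """Assumed default: every query location is equally important (weights sum to 1)."""
    return [1.0 / m] * m

# Hypothetical query trajectory
tq = [("a", 10.0), ("c", 13.0), ("d", 14.5)]
locations, inter_arrival = query_from_trajectory(tq)
weights = uniform_weights(len(locations))
```

For this query trajectory the locations are a, c, d and the inter-arrival times are 3.0 and 1.5.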

4 Indexing methods and algorithms

The most widely used indexes for trajectories are based on R-trees [19] and their variants [24]. R-trees group nearby spatial objects in a minimum bounding rectangle, MBR, which is a key concept in all R-tree-based algorithms. Figure 1 illustrates an example of an MBR, which is a rectangle that encapsulates an edge in such a way that each of min(x),max(x),min(y),max(y) is in contact with the respective rectangle side. An MBR can also encapsulate a trajectory object with all its nodes and edges. Moreover, in upper R-tree levels there are also MBRs that enclose lower level MBRs.

Fig. 1 (a) Line segment of a corresponding edge encapsulated by an MBR, (b) Index node representing the MBR containing the edge
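
The MBR concept can be illustrated with a few lines of Python. This is a generic sketch, not code from any of the cited R-tree implementations; the class and function names are hypothetical.

```python
from typing import NamedTuple

class MBR(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float

def mbr_of_points(points):
    """Smallest axis-aligned rectangle touching min(x), max(x), min(y), max(y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return MBR(min(xs), min(ys), max(xs), max(ys))

def mbr_union(a, b):
    """MBR of an upper-level node enclosing two lower-level MBRs."""
    return MBR(min(a.min_x, b.min_x), min(a.min_y, b.min_y),
               max(a.max_x, b.max_x), max(a.max_y, b.max_y))

def mbr_intersects(a, b):
    """True if the two rectangles share at least one point."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x and
            a.min_y <= b.max_y and b.min_y <= a.max_y)

# An edge given by its two endpoints, and a second, disjoint rectangle
edge_mbr = mbr_of_points([(1, 4), (3, 2)])
other = MBR(5, 5, 6, 6)
parent = mbr_union(edge_mbr, other)
```

The same `mbr_of_points` call applies to a whole trajectory when all of its node coordinates are passed in.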

In the sequel, we will present three indexing methods for moving objects, two of them based on R-trees, and one based on adjacency lists. These methods have been previously tested experimentally in small-scale networks. However, here the efficiency and the performance of these methods are stress-tested in networks of large sizes, e.g. in the order of millions of links/edges.

4.1 Fixed network R-trees

FNR-trees are height balanced structures based on R-trees [15]. The idea is that any network with n links can be represented as a forest of 1-d R-trees, having a single 2-d R-tree on top. The 2-d R-tree is used to index the graph edges; i.e., every 2-d R-tree leaf represents a single graph edge and stores a pointer to a 1-d R-tree, which indexes the temporal intervals during which a moving object traveled through the particular edge represented by that leaf.

The FNR-tree can support efficient insertion and deletion of trajectory data. For the insertion process it uses Guttman’s search algorithm on the top-level 2-d R-tree to find the relevant graph edge encapsulated by the appropriate MBR. This leads to a 1-d R-tree containing the object visits. Since time increases monotonically, time intervals are inserted in increasing order. Thus, we can insert the new element in the bottom-most, right-most tree node without performing a search. This optimization leads to full 1-d R-tree leaves, and minimizes the leaf overlap. Without this optimization, space utilization in the 1-d R-trees is around 65%, whereas with it the utilization increases to 96% [15].

To perform a spatio-temporal query against an FNR-tree, a 3-d interval is used that consists of two spatial points and one temporal interval. Thus, the query can be defined as: ((x1,y1),(x2,y2),(t1,t2)). For the search process, Guttman’s search algorithm is executed on the top level 2-d R-tree, and the edges that fall within the spatial rectangle of the query are identified. After recovering the leaf nodes representing these edges, the edges are stored in memory. Then Guttman’s search algorithm is executed in each of the 1-d R-trees which are pointed to by the previously recovered leaves, and the corresponding edges are retrieved. If there are edges that are completely outside the spatial window of the query, they are discarded.
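
The two-level organization can be mimicked with plain Python containers. The sketch below is deliberately not an R-tree: the top level is scanned linearly instead of using Guttman's search, and each per-edge structure is an append-only list instead of a 1-d R-tree. The class name, the edge identifiers and the sample data are hypothetical; the final refinement step that discards edges lying completely outside the window is omitted.

```python
from collections import defaultdict

class SimplifiedFNRIndex:
    """Two-level stand-in: edges on top, per-edge visit intervals below."""

    def __init__(self, edge_coords):
        # edge_id -> ((x1, y1), (x2, y2)), the fixed network
        self.edge_coords = edge_coords
        # edge_id -> [(t_enter, t_leave, object_id)], kept in arrival order
        self.visits = defaultdict(list)

    def insert(self, edge_id, t_enter, t_leave, object_id):
        # Time only increases, so a new interval is appended at the right end
        self.visits[edge_id].append((t_enter, t_leave, object_id))

    def query(self, x1, y1, x2, y2, t1, t2):
        """Window ((x1,y1),(x2,y2),(t1,t2)) with x1 <= x2, y1 <= y2, t1 <= t2."""
        hits = []
        for edge_id, ((ax, ay), (bx, by)) in self.edge_coords.items():
            # Spatial filter on the MBR of the edge
            if min(ax, bx) > x2 or max(ax, bx) < x1:
                continue
            if min(ay, by) > y2 or max(ay, by) < y1:
                continue
            # Temporal filter on the visits of that edge
            for t_in, t_out, obj in self.visits.get(edge_id, []):
                if t_in <= t2 and t_out >= t1:
                    hits.append((obj, edge_id, t_in, t_out))
        return hits

index = SimplifiedFNRIndex({"e1": ((0, 0), (2, 0)), "e2": ((5, 5), (6, 6))})
index.insert("e1", 0, 3, "obj1")
index.insert("e1", 4, 6, "obj2")
index.insert("e2", 1, 2, "obj3")
result = index.query(0, -1, 3, 1, 2, 5)
```

In this example only edge e1 passes the spatial filter, and both of its visits overlap the time interval (2,5).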

4.2 Moving objects in network trees

MON-trees consist of a 2-d R-tree with leaves pointing to lower level 2-d R-trees (see Figure 2), which index the moving objects and their trajectories [11]. At the upper level of MON-trees, there is a hash structure with entries in the form (polyid, bottreeptr), where polyid is the unique trajectory ID and acts as key, whereas bottreeptr points to the lower level R-tree which indexes that trajectory.

Fig. 2 MON-tree overview [11]

The upper R-tree leaves are of the form (MBR, polypt, treept), where MBR is the MBR acting as a box for the trajectory, polypt is a pointer to the trajectory itself, and treept is a pointer to the lower level R-tree. The internal nodes have the form (MBR, childpt), where MBR is the MBR enclosing the MBRs of descendant nodes, and childpt is a pointer to the descendant node.

The lower level R-trees index the object trajectories. This is achieved with two intervals: the spatial interval (p1,p2) (where 0 ≤ p1,p2 ≤ 1), and the temporal interval (t1,t2). A combination of the two intervals gives the position of the moving object within the time interval defined by the two time points t1 and t2.

Searching is based on a spatio-temporal window: wnd = (x1,x2,y1,y2,t1,t2) and can be interpreted as: retrieve the moving objects within the space bounded by the rectangle r = (x1,x2,y1,y2), during the time period t = (t1,t2). To this end, the process is split into its orthogonal parts, the spatial and the temporal one. First, a search is performed on the top R-tree to retrieve all MBRs which intersect the rectangle defined by the spatial part of wnd. The result is a set of windows: \(wnd^{\prime }=\) \(\{(p_{1_{1}},p_{1_{2}},t_{1},t_{2}),\ldots ,(p_{n_{1}},p_{n_{2}},t_{1},t_{2} )\}\) as shown in Figure 3, where n is the number of elements, pn the position of the moving object, and t1, t2 the time interval given as input to the query. After retrieving these network portions, a search is performed on each of them, based on the time interval. As seen in Figure 3, the trajectories retrieved by the initial search are examined to determine in which parts they intersect the time interval provided as part of the spatio-temporal window.

Fig. 3 MON-tree search by spatio-temporal window [11]
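
The second, temporal phase of this search can be sketched as follows, assuming that the top-level search has already produced the set of windows. The lower-level R-trees are replaced by plain lists that are scanned linearly, and each stored movement is a tuple (p1, p2, t1, t2, object) with positions in [0,1]. All names and sample values are hypothetical, and the test is only a rectangle-overlap filter in the (position, time) plane.

```python
def search_bottom(entries, p_lo, p_hi, t_lo, t_hi):
    """Objects whose movement rectangle overlaps the (position, time) window."""
    found = []
    for p1, p2, t1, t2, obj in entries:
        overlaps_position = min(p1, p2) <= p_hi and max(p1, p2) >= p_lo
        overlaps_time = t1 <= t_hi and t2 >= t_lo
        if overlaps_position and overlaps_time:
            found.append(obj)
    return found

def search(bottom, windows, t_lo, t_hi):
    """windows: (polyline id, p_lo, p_hi) triples returned by the top-level search."""
    result = set()
    for poly_id, p_lo, p_hi in windows:
        entries = bottom.get(poly_id, [])
        result.update(search_bottom(entries, p_lo, p_hi, t_lo, t_hi))
    return result

# Hypothetical lower level: polyline id -> movement entries
bottom = {
    "r1": [(0.0, 0.5, 0, 10, "A"), (0.5, 1.0, 10, 20, "A"), (1.0, 0.2, 5, 15, "B")],
    "r2": [(0.0, 1.0, 0, 5, "C")],
}
objects = search(bottom, [("r1", 0.6, 0.9)], 0, 8)
```

Object A is rejected because its first movement stays below position 0.6 and its second starts after the time window, whereas object B overlaps the window in both dimensions.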

The insertion process takes a trajectory ID as input and uses the hash structure to discover the lower level R-tree which corresponds to that trajectory. Searching is accomplished by a spatio-temporal window, which defines the spatial and temporal intervals of interest. The search algorithm begins from the top level R-tree root, which narrows the search down to the MBRs of each trajectory. If there is no bottom level R-tree for inserting this trajectory, then a new R-tree node is created, and the pointer of the trajectory is inserted in the hash structure.

4.3 Cluster-extended adjacency lists

CeAL uses an adjacency list to model the network on which the trajectories will be mapped. Trajectory clusters are assigned to each node of the list, storing the trajectories that pass through it. In our variation of CeAL, query processing can be done in either of the following two ways: (a) the user can define specific spatial locations and time restrictions as well as optional weights of importance for the locations, (b) the user can input a query trajectory.

4.3.1 Creating the CeAL scheme

CeAL has a preprocessing phase that indexes the trajectories, either as they are generated or after they are extracted from a pre-compiled dataset [32]. This phase indexes the edges with adjacency lists: for each edge (vi,vj) a cluster Cij is created to store all the trajectories passing through this edge. If no trajectory passes through a specific edge, then the relevant cluster remains empty. The created cluster is assigned to the node vj, the ending point of the edge. Thus, the final structure comprises |V| adjacency lists, each representing the edges from a specific network node, extended by the clusters containing all the trajectories passing through each edge (see Figure 4).

Fig. 4 Extended adjacency list index of node vi with p adjacent nodes and trajectory clusters

The clusters are implemented as dynamic lists. Therefore, an initial traversal of the trajectory dataset T is required. For each trajectory Ti, its ID is passed as a parameter to a pre-selected hash function. In particular, we used the simple hash function: ID mod |T| to get the disk page location of the trajectory. Then, the trajectory Ti is traversed; during this traversal, we retrieve its edges and store the trajectory’s ID in each of the clusters associated with it.

Algorithm 1 presents the preprocessing procedure. Both the time complexity for reading the trajectory data and the space complexity of the preprocessing phase are linear: O(|V | + |E| + |RE|). Also, the created trajectory clusters are generally smaller than those of the structure proposed in [28], where clustering is based on network nodes, which leads to larger clusters. In CeAL, clustering is performed based on the graph edges. Additionally, since hashing is used to store the trajectories, they can be efficiently retrieved when required by using the same hash function.
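
A rough Python sketch of this preprocessing phase is given below. It is an illustration under stated assumptions rather than a transcription of Algorithm 1: the index is a dictionary of dictionaries, a cluster is a Python list stored with the entry of vj in the adjacency list of vi, only the direction in which an edge is traversed is recorded, and the "disk page" is simply the value ID mod |T|. The toy network and trajectories are hypothetical.

```python
def build_ceal(adj, trajectories):
    """Build cluster-extended adjacency lists in one pass over the trajectories."""
    # One adjacency list per node; every neighbor entry gets an empty cluster
    index = {
        u: {v: {"weight": w, "cluster": []} for v, w in nbrs.items()}
        for u, nbrs in adj.items()
    }
    n_traj = len(trajectories)
    page_of = {}
    for tid, traj in trajectories.items():
        page_of[tid] = tid % n_traj          # simple hash: ID mod |T|
        nodes = [v for v, _ in traj]
        for u, v in zip(nodes, nodes[1:]):   # every edge of the trajectory
            index[u][v]["cluster"].append(tid)
    return index, page_of

adj = {
    "a": {"b": 2.0, "c": 5.0},
    "b": {"a": 2.0, "c": 1.0, "d": 4.0},
    "c": {"a": 5.0, "b": 1.0, "d": 1.0},
    "d": {"b": 4.0, "c": 1.0},
}
trajectories = {
    1: [("a", 0.0), ("b", 2.0), ("c", 3.0)],
    2: [("d", 1.0), ("c", 2.0)],
    3: [("b", 0.5), ("c", 1.5), ("d", 2.5), ("b", 6.5)],
}
index, page_of = build_ceal(adj, trajectories)
```

Each trajectory edge is touched exactly once, which matches the |RE| term of the stated complexity, while the initialization of the adjacency lists accounts for the |V| + |E| term. After the build, the cluster of edge (b,c) holds trajectories 1 and 3, and the cluster of (c,b) stays empty.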

Algorithm 1 CeAL preprocessing procedure

Figure 5 depicts a small-scale example of a spatial network with 14 nodes, 21 edges and 3 trajectories (see Tables 1 and 2). The outcome of Algorithm 1 is the structure of Figure 6.

Fig. 5 An illustrative small-scale example

Table 1 Edge weights of the graph in Figure 5

Table 2 Trajectories of the graph in Figure 5

Fig. 6 Outcome of Algorithm 1 on the structure of Figure 5

4.3.2 Trajectory similarity measures

In the proposed CeAL method, trajectory retrieval is based on the similarity between the trajectories and the selected spatial positions and timestamps of the moving objects. Therefore, trajectory retrieval initiates a location-based query with time restrictions. To facilitate searching within CeAL, two new trajectory similarity metrics, Ds(.) and Dt(.), are proposed for the spatial and the temporal dimension, respectively. The spatial similarity measure is used to assess how close a trajectory is to the selected spatial positions Q with respect to the restriction the network imposes on the movement of objects. The spatial distance of a specific location qi ∈ Q from a trajectory Tj ∈ T, which passes through the nodes v1,v2,…,vn, is defined as the minimum among the distances between the location and each node of the trajectory. This measure is:

$$ d_{s} (q_{i},T_{j}) = \min_{(h=1,\ldots,n)}d(q_{i},v_{h}) = d(q_{i},v_{\min}) $$
(2)
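
Eq. (2) translates directly into code. In the sketch below the normalized network distances d(⋅) are assumed to be available as a lookup table; the table values, node labels and function name are hypothetical.

```python
# Assumed precomputed normalized network distances d(u, v) in [0, 1]
d = {
    "a": {"a": 0.0, "b": 0.5, "c": 0.75, "d": 1.0},
    "b": {"a": 0.5, "b": 0.0, "c": 0.25, "d": 0.5},
    "c": {"a": 0.75, "b": 0.25, "c": 0.0, "d": 0.25},
    "d": {"a": 1.0, "b": 0.5, "c": 0.25, "d": 0.0},
}

def d_s(q, traj_nodes, d):
    """Return (d_s(q, T), v_min): the distance to the closest trajectory node."""
    v_min = min(traj_nodes, key=lambda v: d[q][v])
    return d[q][v_min], v_min

far = d_s("a", ["d", "c"], d)        # closest node of the trajectory is c
on_path = d_s("b", ["a", "b", "c"], d)  # b lies on the trajectory itself
```

For location a and a trajectory visiting d and c, the closest node is c and the distance is 0.75; when the location is a node of the trajectory, the distance is 0.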

Proposition 1

ds(⋅) is a generalized metric function that satisfies the generalized triangular inequality with values in the range of [0,1].

Proof

It is sufficient to prove that the following properties hold for any location qj ∈ Q, for any node x ∈ V and for any trajectory Ti ∈ T:

  1. 0 ≤ ds(qj,Ti) ≤ 1

  2. ds(qj,Ti) = 0 ⇔ qj ∈ Ti

  3. ds(qj,Ti) ≤ d(qj,x) + ds(x,Ti)

Let vminj be the node of the trajectory Ti with the minimum distance from the location qj (see Figure 7). Since ds(qj,Ti) = d(qj,vminj) and the spatial function d(⋅) takes values in [0,1], the same holds for the ds(⋅) function, i.e. 0 ≤ ds(qj,Ti) ≤ 1. Moreover, it holds that: ds(qj,Ti) = 0 ⇔ d(qj,vminj) = 0 ⇔ qj = vminj (property of the d(⋅) function). Therefore, since vminj is a node of Ti we have: qj ∈ Ti.

Fig. 7 Proof of generalized triangular inequality

For the proof of the generalized triangular inequality, let x be an arbitrary graph node, where the closest node of trajectory Ti to x is not necessarily node vminj (Figure 7). Let vminx be the node of the trajectory Ti which has the minimum distance from node x. Then, we have: ds(qj,Ti) = d(qj,vminj) and ds(x,Ti) = d(x,vminx). Thus, it is sufficient to prove:

$$ d_{s}(q_{j},T_{i}) \le d(q_{j},x) + d_{s}(x,T_{i}) ~\Leftrightarrow~ d(q_{j},vmin_{j}) \le d(q_{j},x) + d(x,vmin_{x}) $$

Since vminj is the closest node to qj among all nodes of trajectory Ti (including vminx), it holds that:

$$ d(q_{j},vmin_{j}) \le d(q_{j},vmin_{x}) $$

Moreover, since function d(⋅) satisfies the triangular inequality for node x, it holds that:

$$ d(q_{j},vmin_{x}) \le d(q_{j},x) + d(x,vmin_{x})$$

By combining the last two inequalities we obtain the generalized triangular inequality:

$$ d(q_{j},vmin_{j}) \le d(q_{j},x) + d(x,vmin_{x}) $$
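
The three properties of Proposition 1 can also be checked numerically by brute force on a toy instance. This is only a sanity check under assumed data, not a substitute for the proof; the distance table and the trajectories below are hypothetical, and the table is chosen so that d(⋅) is itself a metric.

```python
from itertools import product

# Assumed normalized network distances (a metric with values in [0, 1])
d = {
    "a": {"a": 0.0, "b": 0.5, "c": 0.75, "d": 1.0},
    "b": {"a": 0.5, "b": 0.0, "c": 0.25, "d": 0.5},
    "c": {"a": 0.75, "b": 0.25, "c": 0.0, "d": 0.25},
    "d": {"a": 1.0, "b": 0.5, "c": 0.25, "d": 0.0},
}
# Trajectories reduced to their visited nodes
trajectories = {1: ["a", "b"], 2: ["c", "d"], 3: ["b", "c", "d"]}

def d_s(q, nodes):
    return min(d[q][v] for v in nodes)

violations = []
for q, x, (tid, nodes) in product(d, d, trajectories.items()):
    in_range = 0.0 <= d_s(q, nodes) <= 1.0                          # property 1
    zero_iff_member = (d_s(q, nodes) == 0.0) == (q in nodes)       # property 2
    triangle = d_s(q, nodes) <= d[q][x] + d_s(x, nodes)            # property 3
    if not (in_range and zero_iff_member and triangle):
        violations.append((q, x, tid))
```

The list of violations stays empty; for instance, with q = a, x = b and the trajectory through c and d, the inequality holds with equality: 0.75 ≤ 0.5 + 0.25.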

Each included location may have a different distance from the trajectory, which means that this distance is calculated separately for each location. Our objective is to have at least j nodes as close to the location qj as possible. Therefore, we calculate the sum of the distances of all locations from the trajectory, which approximates the total distance of the trajectory from the spatial locations under consideration. Consequently, the spatial similarity metric is defined as the average distance between all locations and the trajectory, and can be calculated as:

$$ D_{s} (Q,T_{i}) = \frac{1}{m} \sum\limits_{j=1}^{m} d_{s} (q_{j},T_{i}) $$
(3)

In an alternative approach, the user provides the importance of each spatial location. In this case, each calculated spatial distance is multiplied by the assigned weight of importance. The weights take values in the interval (0,1), sum to 1, and show the contribution of each location to the total similarity. The closer a weight is to 0, the less important the location is; conversely, the closer it is to 1, the more the location contributes to the final distance calculation. In this case, the metric is:

$$ D_{s} (Q,T_{i}) = w_{1} d_{s} (q_{1},T_{i} ) + {\ldots} + w_{m} d_{s} (q_{m},T_{i}) $$
(4)

Thus, if a spatial location is considered more important, its weight will be close to 1. If the distance of that location from the trajectory is large, it will affect the spatial measure adversely. In the opposite case, where the distance is large but the weight is close to 0, the effect on the final value of the spatial measure is less severe, which reduces the impact of that distance on the final trajectory similarity score.
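
Both forms of the spatial measure can be sketched with a single function, since the average of Eq. (3) is the weighted form with all weights equal to 1/m. The distance table, the query and the weights below are hypothetical values chosen for illustration.

```python
# Assumed precomputed normalized network distances
d = {
    "a": {"a": 0.0, "b": 0.5, "c": 0.75, "d": 1.0},
    "b": {"a": 0.5, "b": 0.0, "c": 0.25, "d": 0.5},
    "c": {"a": 0.75, "b": 0.25, "c": 0.0, "d": 0.25},
    "d": {"a": 1.0, "b": 0.5, "c": 0.25, "d": 0.0},
}

def D_s(query, traj_nodes, d, weights=None):
    """Spatial measure: plain average if no weights are given, weighted sum otherwise."""
    m = len(query)
    if weights is None:
        weights = [1.0 / m] * m   # uniform weights reproduce the average
    return sum(w * min(d[q][v] for v in traj_nodes)
               for q, w in zip(query, weights))

query = ["a", "d"]
traj_nodes = ["b", "c"]          # d_s(a,T) = 0.5 and d_s(d,T) = 0.25

average = D_s(query, traj_nodes, d)
favor_a = D_s(query, traj_nodes, d, weights=[0.75, 0.25])
favor_d = D_s(query, traj_nodes, d, weights=[0.25, 0.75])
```

Giving more weight to location a, which is the farther of the two from the trajectory, raises the measure from 0.375 to 0.4375, whereas favoring location d lowers it to 0.3125.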

Proposition 2

Ds(⋅) is a generalized metric function that satisfies the generalized triangular inequality in the range [0,1].

Proof

It is sufficient to prove that the following properties hold for any node x ∈ V and for any trajectory Ti ∈ T:

  1. 0 ≤ Ds(Q,Ti) ≤ 1

  2. Ds(Q,Ti) = 0 ⇔ qj ∈ Ti, ∀j = 1,…,m

  3. Ds(Q,Ti) ≤ dq(Q,x) + ds(x,Ti), where \(d_{q}(Q,x) = {\sum }_{j=1}^{m} w_{j} \cdot d(q_{j},x)\)

From Proposition 1 we have that: 0 ≤ ds(qj,Ti) ≤ 1, ∀j = 1,…,m. Since wj > 0, ∀j = 1,…,m, we have that: 0 ≤ wj ⋅ ds(qj,Ti) ≤ wj, ∀j = 1,…,m. By summing the above m inequalities, and since the weights sum to 1, we derive:

$$ 0 \le \sum\limits_{j=1}^{m} w_{j} \cdot d_{s}(q_{j},T_{i}) \le \sum\limits_{j=1}^{m} w_{j} \Leftrightarrow 0 \le D_{s}(Q,T_{i}) \le 1 $$

Moreover, \(D_{s}(Q,T_{i}) = 0 \Leftrightarrow {\sum }_{j=1}^{m} w_{j} \cdot d_{s}(q_{j},T_{i}) = 0\), and since ds(qj,Ti) ≥ 0 and wj > 0, ∀j = 1,…,m, the sum is zero if and only if all terms are zero, i.e. ds(qj,Ti) = 0, ∀j = 1,…,m ⇔ qj ∈ Ti, ∀j = 1,…,m (Proposition 1).

Finally, if x is an arbitrary graph node, according to Proposition 1, we have: ds(qj,Ti) ≤ d(qj,x) + ds(x,Ti), ∀j = 1,…,m. Thus: wj ⋅ ds(qj,Ti) ≤ wj ⋅ d(qj,x) + wj ⋅ ds(x,Ti), ∀j = 1,…,m. By summing these m inequalities, we get:

$$ \sum\limits_{j=1}^{m} w_{j} \cdot d_{s}(q_{j},T_{i}) \le \sum\limits_{j=1}^{m} w_{j} \cdot d(q_{j},x) + d_{s}(x,T_{i}) \cdot \sum\limits_{j=1}^{m} w_{j} $$
$$ \Leftrightarrow D_{s}(Q,T_{i}) \le d_{q}(Q,x) + d_{s}(x,T_{i}) $$

An advantage of this methodology is that the proposed similarity measures express the similarity between a trajectory Ti ∈ T and the selected spatial locations in Q. Therefore, the proposed measures are functions in the |Q|×|T| space, instead of the |T|×|T| space, which significantly speeds up computations. Moreover, the computation of all-to-all geodesic path distances is avoided, since pairwise node distances are calculated only from the small set of spatial positions Q to the nodes of T. This is in contrast to the majority of previous methods for trajectory similarity search, which require a computationally intensive preprocessing step with all-to-all geodesic path distance calculations.

A significant property of the proposed method is that it does not require all time points to be stored. The reason is that the temporal restrictions are defined in an absolute manner by the time restrictions set by the recorded timestamps between the location visits and by the resulting delay.

To calculate the temporal similarity we obtain the nearest nodes vminj for j = 1,…,m of the trajectory to the spatial locations, as described above. Then, we calculate the inter-arrival times between these nodes. These can be calculated instantly from the arrival times at these nodes. More specifically, if \(t_{vmin_{j}}\) for j = 1,…,m are the time points that correspond to each closest node, we calculate the corresponding inter-arrival times dt2,dt3,…,dtm as the differences between consecutive time points. We observe three distinct cases:

  1. dtj = qtj: The actual temporal distance of the location from the trajectory is equal to the time tolerance based on the recorded timestamps. The temporal distance is equal to 0.

  2. dtj > qtj: The temporal distance is greater than the time tolerance. Thus, more time is needed to pass through the trajectory, which means that the temporal difference must be taken into consideration and is equal to |qtj − dtj|.

  3. dtj < qtj: The temporal distance is less than the time tolerance. In this case, the temporal distance is not taken into consideration, since less time is needed to traverse the trajectory. This case is treated like the first one, i.e. the temporal distance is 0.

Based on the above, with dtj set equal to qtj whenever dtj < qtj (third case), the temporal distance metric is:

$$ D_{t}(Q,T_{i}) = \frac{1}{m-1} \sum\limits_{j=2}^{m} \frac{|qt_{j}-dt_{j}|} {\max\{qt_{j},dt_{j}\}} $$
(5)
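The three cases and Eq. 5 can be sketched as follows. The sketch assumes that each term is normalized by the larger value of its own pair (qtj, dtj), which is how the illustrative example of Section 4.3.3 evaluates the measure, and that a transition faster than its tolerance contributes zero. The function name `temporal_distance` is hypothetical.

```python
from typing import Sequence


def temporal_distance(qt: Sequence[float], dt: Sequence[float]) -> float:
    """Temporal distance D_t(Q, T_i) over the m-1 transitions j = 2..m.

    qt[i] is the time tolerance of a transition and dt[i] the observed
    inter-arrival time of the trajectory for the same transition.
    """
    if len(qt) != len(dt) or len(qt) == 0:
        raise ValueError("qt and dt need the same, non-zero length")
    total = 0.0
    for q, d in zip(qt, dt):
        if d > q:
            # Case 2: the trajectory is slower than tolerated.
            # Here max{q, d} = d, so the term is (d - q) / d.
            total += (d - q) / d
        # Cases 1 and 3 (d <= q): the term is 0 by definition.
    return total / len(qt)
```

For instance, tolerances of 3 time units with inter-arrival times 5 and 4 give (2/5 + 1/4)/2 = 0.325.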

Proposition 3

Dt(⋅) is a generalized metric function that satisfies the generalized triangular inequality in the range [0,1].

Proof

For j = 2,…,m, by considering the values qtj and dtj as m − 1 pairs of real values, the proof is the same as presented in [34]. □

The two previous similarity metrics are then combined into a spatio-temporal metric sim(.) as follows:

$$ sim(Q,T) = 1-dist(Q,T) $$
(6)

where:

$$ dist(Q,T) = a*D_{s}(Q,T) + (1-a)*D_{t}(Q,T) $$
(7)

Parameter a ∈ [0,1] expresses the preference for one of the two metrics, depending on how close its value is to 0 or 1. Thus, an application can tune which of the two, or which combination of the two, should be applied. If a = 0, an absolute preference for the temporal distance is expressed. Conversely, a = 1 means that only the spatial distance is included in the spatio-temporal similarity measure calculation.
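A minimal sketch of Eqs. 6 and 7, assuming that the spatial and temporal distances are already available as values in [0,1]; the function names `st_distance` and `similarity` are hypothetical.

```python
def st_distance(d_spatial: float, d_temporal: float, a: float) -> float:
    """Spatio-temporal distance dist(Q, T) of Eq. 7."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * d_spatial + (1.0 - a) * d_temporal


def similarity(d_spatial: float, d_temporal: float, a: float) -> float:
    """Spatio-temporal similarity sim(Q, T) = 1 - dist(Q, T) of Eq. 6."""
    return 1.0 - st_distance(d_spatial, d_temporal, a)
```

Because both inputs lie in [0,1] and the combination is convex, the resulting distance and similarity also lie in [0,1].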

Algorithm 2 presents the progressive trajectory retrieval process from the spatial locations. The main strategy is the following: from each location, perform an incremental Dijkstra expansion step following a round-robin strategy, collect the trajectory IDs that are included in the trajectory clusters of the visited edges, compute the spatiotemporal similarities based on the proposed measures, and progressively return the top-k retrieved trajectories whenever the updated threshold value allows it. The threshold and top-k process is similar to Fagin’s Threshold Algorithm [13].

figure b

When the algorithm begins, variables and structures are initialized (lines 1–4). Variable L keeps the threshold value. The ordered structure H keeps the retrieved trajectory IDs ordered by their calculated spatiotemporal distance. Each location qj uses a Fibonacci Heap HQj, in which the corresponding shortest-path distances from the Dijkstra expansion are updated. To avoid recalculations of spatiotemporal distances in any step of the algorithm, a bit-set B with |T| bits is kept in memory, where the bit of each trajectory is enabled on-the-fly as soon as its distance is calculated. Therefore, during query processing, the distance of each trajectory is calculated only once. The five main steps of the proposed algorithm are the following:

(S1): From each location qj, in a round-robin manner (initially vQj = qj), each neighbor node uQj of vQj is retrieved in the Dijkstra expansion step (lines 5–17). The heaps HQj are updated with the relevant shortest-path distances from the Dijkstra expansion. The candidate trajectories Th are collected from the corresponding edge clusters \(C_{(vQ_{j},uQ_{j})}\) of the extended adjacency list index (line 18).

(S2): The spatiotemporal distances dist between the collected candidate trajectories Th and the location set Q are calculated (Eq. 7). In bit-set B the corresponding bits of each calculated trajectory distance are enabled (lines 19–24). The currently calculated trajectory distances and their corresponding trajectory IDs are preserved and updated in H (ordered by dist) on-the-fly (line 23).

(S3): The threshold L is updated according to the aggregated network distances between the locations and the set of vminj nodes: \(L=\frac {a}{m} {\sum }_{j=1}^{m} d(q_{j},vmin_{j})\), where vminj is the closest node to location qj in the current Dijkstra expansion level, i.e. vminj has the shortest path distance to qj among all the nodes detected from qj in the current round. The threshold L is a lower bound of the final distance function dist and is used for generating the results. In each round (when the expansion level changes), the currently computed value Lcurr is compared with the L value of the previous round; if Lcurr is greater, L is updated to Lcurr, so that L never decreases (lines 28–31). Since the temporal distances Dt are aggregated with the spatial distances Ds in the final distance function dist(⋅), L is a lower bound for both spatial and spatiotemporal distances. Moreover, if the weights wj are used (Eq. 4), the threshold L is calculated as: \(L=a \cdot {\sum }_{j=1}^{m} w_{j} \cdot d(q_{j},vmin_{j})\) (alternative line 28).

(S4): After the end of each round, the trajectories in the current top-k list in H are examined to check whether they have a distance dist lower than L. If the condition is satisfied for a subset of trajectories in H, these trajectories are instantly added to the top results list (lines 32–36). The trajectory extraction proceeds progressively until L reaches a value greater than the distance of the k-th element in H or, in the extreme case, until the spatiotemporal distances of all trajectories in T have been calculated (stopping condition, line 37).

(S5): If not all top-k results have been retrieved, the algorithm proceeds to the next expansion round, where the algorithm repeats the loop in lines 5–40.

Correctness: As the threshold L is a lower bound of the distance function dist, all trajectories that have not been discovered yet will have spatiotemporal distances greater than or equal to L. This means that the trajectories stored in H with distances smaller than L definitely have smaller distances than any not-yet-discovered trajectory in all subsequent expansion levels. Also, when a trajectory is inserted into H, its calculated spatiotemporal distance is final and will not be modified (bit-set B ensures that it will not even be recalculated). Therefore, as L increases, the trajectories in the top positions of H that have distances smaller than L can safely be returned to the user.
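The threshold idea can be illustrated with a simplified sketch, which is not Algorithm 2 itself. It assumes a = 1 (spatial distance only), computes the full distance array of every location up front with a binary heap instead of expanding Fibonacci heaps incrementally, and then replays the expansion order round by round. The names `dijkstra`, `progressive_topk` and the `diameter` parameter (used for normalization) are hypothetical, and every trajectory is assumed to contain at least two nodes connected by graph edges.

```python
import heapq
from math import inf


def dijkstra(graph, src):
    """Shortest-path distances from src; graph is {u: {v: weight}}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def progressive_topk(graph, trajectories, queries, k, diameter):
    """Yield up to k (distance, trajectory_id) pairs in ascending order."""
    # Edge clusters: undirected edge -> IDs of the trajectories using it.
    clusters = {}
    for tid, nodes in trajectories.items():
        for u, v in zip(nodes, nodes[1:]):
            clusters.setdefault(frozenset((u, v)), set()).add(tid)

    dists = [dijkstra(graph, q) for q in queries]
    # Expansion order of each location: nodes by increasing distance.
    orders = [sorted(d, key=d.get) for d in dists]
    m = len(queries)
    seen = set()    # plays the role of bit-set B
    pending = []    # min-heap in the role of H
    emitted = 0
    for rnd in range(max(len(o) for o in orders)):
        radii = []
        for j in range(m):
            v = orders[j][min(rnd, len(orders[j]) - 1)]
            radii.append(dists[j][v])
            for u in graph[v]:
                for tid in clusters.get(frozenset((v, u)), ()):
                    if tid in seen:
                        continue  # distance already final
                    seen.add(tid)
                    nodes = trajectories[tid]
                    total = sum(min(dj.get(n, inf) for n in nodes)
                                for dj in dists)
                    heapq.heappush(pending, (total / (m * diameter), tid))
        # Lower bound L on the distance of every undiscovered trajectory.
        threshold = sum(radii) / (m * diameter)
        while pending and pending[0][0] <= threshold and emitted < k:
            yield heapq.heappop(pending)
            emitted += 1
        if emitted == k:
            return
    while pending and emitted < k:  # everything expanded: flush the rest
        yield heapq.heappop(pending)
        emitted += 1
```

The sketch emits a trajectory as soon as its distance does not exceed the threshold; since every undiscovered trajectory has no node inside the expanded radius of any location, its distance cannot be smaller than the threshold, so the emitted order is final.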

The time complexity of Algorithm 2 is: \(O(m*(|V|*\log |V|+|E|)+|RE|)\), where the part \(O(m*(|V|*\log |V|+|E|))\) corresponds to the Dijkstra expansion. On the other hand, the term O(|RE|) represents the number of trajectory edges the algorithm will take into account; it is at maximum |RE|, since the control bitmap will be storing the IDs of the trajectories with distances already calculated.

4.3.3 Trajectory retrieval for the illustrative small-scale example

The diameter of the graph in Figure 5 is DG = 27, which is the geodesic distance between the most distant nodes v1 and v14. The closest nodes of trajectory T1 to locations q1,q2,q3 are nodes v3,v12,v13, with distances 4,2,2, respectively. Then, the spatial distance between the set Q of locations and trajectory T1 is calculated as: \(D_{s}(Q,T_{1}) = \frac {1}{3} \cdot \frac {4+2+2} {27} \approx 0.099\) (considering equal weights \(w_{j}=\frac {1}{3}\)). The closest nodes of trajectory T2 to locations q1,q2,q3 are nodes v5,v8,v11, with distances 0,5,0, respectively. Then, Ds(Q,T2) is: \(D_{s}(Q,T_{2}) = \frac {1}{3} \cdot \frac {0+5+0} {27} \approx 0.062\). The closest nodes of trajectory T3 to locations q1,q2,q3 are nodes v5,v9,v11; the trajectory passes through all the locations, resulting in Ds(Q,T3) = 0. Therefore, if a = 1, i.e. only the spatial similarity contributes to the final score of sim(⋅), the top-3 similarity list is [T3,T2,T1].

If the time tolerance is 3 time units for the transition from q1 to q2 and 3 time units for the transition from q2 to q3, i.e. qt2 = qt3 = 3, then the corresponding inter-arrival times for T1 are dt2 = 5 > 3 (for the transition from v3 to v12), and dt3 = 4 > 3 (for the transition from v12 to v13). Therefore, Dt(Q,T1) is: \(D_{t}(Q,T_{1})= \frac {1}{2} \cdot \left (\frac {|3-5|}{5} + \frac {|3-4|}{4} \right ) = 0.325\). The corresponding inter-arrival times for T2 are dt2 = 2 < 3 (for the transition from v5 to v8), and dt3 = 2 < 3 (for the transition from v8 to v11). Therefore, we have: Dt(Q,T2) = 0, since dt values are set equal to qt. The corresponding inter-arrival times for T3 are dt2 = 5 > 3 (for the transition from v5 to v9), and dt3 = 5 > 3 (for the transition from v9 to v11). Therefore, Dt(Q,T3) is: \(D_{t}(Q,T_{3}) = \frac {1}{2} \cdot \left (\frac {|3-5|}{5} + \frac {|3-5|}{5} \right ) = 0.4\).

If a = 0.5, the final spatio-temporal distances of the trajectories are: dist(Q,T1) ≈ 0.212, dist(Q,T2) ≈ 0.031, dist(Q,T3) = 0.2. By considering both temporal and spatial domains, the top-3 similarity list becomes [T2,T3,T1]. Thus, whereas T3 is the top result when only the spatial domain is considered (case a = 1), T2 becomes the top result when both domains are considered (case a = 0.5), although it does not pass through all the locations.
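The rankings of this example can be reproduced with a short script. The spatial and temporal distances are the values derived above; the helper name `rank` is hypothetical.

```python
# Spatial distances: sum of closest-node distances over (m * diameter).
D_s = {"T1": (4 + 2 + 2) / (3 * 27), "T2": (0 + 5 + 0) / (3 * 27), "T3": 0.0}
# Temporal distances as computed in the text.
D_t = {"T1": 0.325, "T2": 0.0, "T3": 0.4}


def rank(a):
    """Return (trajectory IDs ordered by dist, dist values) for weight a."""
    dist = {t: a * D_s[t] + (1.0 - a) * D_t[t] for t in D_s}
    return sorted(dist, key=dist.get), dist


order_spatial, _ = rank(1.0)      # only the spatial distance counts
order_combined, dist = rank(0.5)  # both domains weighted equally
print(order_spatial)              # ['T3', 'T2', 'T1']
print(order_combined)             # ['T2', 'T3', 'T1']
print({t: round(v, 3) for t, v in dist.items()})
```

The change of a from 1 to 0.5 moves T2 ahead of T3 because the temporal penalty of T3 (0.4) outweighs the small spatial penalty of T2.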

5 Unified framework and extensions for the studied methods

Here we provide additional information about the unified framework implemented for the three studied methods, plus more details about the extensions. The studied methods and the respective algorithms were implemented so that objects and values are accessed through pointers, instead of copying whole objects across functions and methods. This way, the involved classes and structures can communicate with each other and have access to the needed objects.

We model the graphs with adjacency lists because the tested networks are sparse and rather stable, i.e. their initial topology does not change. Thus, graph data are stored with size proportional to that of the network. Especially for CeAL, each edge has an additional property, corresponding to the cluster containing the IDs of the trajectories passing through this edge.
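A minimal sketch of such a cluster-extended adjacency list is given here for illustration. It assumes undirected (bidirectional) edges and is not the C++ structure used in the experiments; the class and method names are hypothetical.

```python
from collections import defaultdict


class ClusterAdjacencyList:
    """Adjacency list whose edges carry a cluster of trajectory IDs."""

    def __init__(self):
        # node -> {neighbor: (weight, set of trajectory IDs)}
        self.adj = defaultdict(dict)

    def add_edge(self, u, v, weight):
        cluster = set()  # one set shared by both directions of the edge
        self.adj[u][v] = (weight, cluster)
        self.adj[v][u] = (weight, cluster)

    def add_trajectory(self, tid, nodes):
        # Register the trajectory on every edge it traverses; a missing
        # edge raises KeyError, since trajectories must follow the network.
        for u, v in zip(nodes, nodes[1:]):
            self.adj[u][v][1].add(tid)

    def remove_trajectory(self, tid, nodes):
        for u, v in zip(nodes, nodes[1:]):
            self.adj[u][v][1].discard(tid)

    def trajectories_on(self, u, v):
        """IDs of the trajectories passing through edge (u, v)."""
        return self.adj[u][v][1]
```

Storage grows with the number of edges plus the total number of trajectory edges, and inserting or deleting a trajectory touches only the edges on its own path.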

We used the Dijkstra algorithm [12] for pairwise shortest path calculations. We provided the initial node to the Dijkstra algorithm, so that it begins its expansion steps, as well as an array for the distances assigned to each edge. This results in an efficient derivation of the shortest paths, even in extremely large networks. The particular implementation of the Dijkstra algorithm uses the concept of a virtual visitor. We provided a visitor as parameter to the function call, which evaluates the potential options from the nodes adjacent to the one it is on, and selects one according to the specific algorithmic restrictions.

Most implementations use the predefined Dijkstra visitor, which only performs the default set of actions on each step. The main advantage is that we can replace this predefined visitor with a visitor of our own to perform modified tasks on each step. Therefore, especially for CeAL, which uses the Dijkstra algorithm to perform a set of actions on each expansion step, we implement our own visitor class and override its behavior when it finishes handling a node. Thus, the Dijkstra expansion step is overridden and the trajectories are discovered as described for the algorithm above. Before any expansion from each location, we call the Dijkstra algorithm by passing the predefined visitor as a parameter, and then we pass the resulting distance array to our custom visitor. This allows the distance calculations to be based on either the Euclidean distance or the actual distance obeying the network constraints, as adopted here.
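The visitor idea can be sketched in a language-neutral way as follows. This is a Python analogue and not the Boost Graph Library interface; the hook name `finish_vertex` is chosen here to mirror the event of finishing a node, and the classes `DijkstraVisitor` and `TrajectoryCollector` are hypothetical.

```python
import heapq
from math import inf


class DijkstraVisitor:
    """Default visitor: the hook does nothing."""

    def finish_vertex(self, u, graph):
        pass


class TrajectoryCollector(DijkstraVisitor):
    """Custom visitor: collect trajectory IDs from the edge clusters."""

    def __init__(self, clusters):
        self.clusters = clusters  # frozenset({u, v}) -> set of IDs
        self.found = []           # trajectory IDs in discovery order

    def finish_vertex(self, u, graph):
        for v in graph[u]:
            for tid in sorted(self.clusters.get(frozenset((u, v)), ())):
                if tid not in self.found:
                    self.found.append(tid)


def dijkstra(graph, source, visitor):
    """Dijkstra over {u: {v: weight}} that calls the visitor per node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    finished = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in finished:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
        finished.add(u)
        visitor.finish_vertex(u, graph)  # extension point
    return dist
```

Passing `DijkstraVisitor()` yields a plain shortest-path computation, whereas passing a `TrajectoryCollector` additionally reports the trajectories in the order in which the expansion reaches their edges.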

The trajectories collected through query processing are stored in a min-heap. At the end of the process, the min-heap contains all the discovered trajectories in ascending order. We then extract the top-k trajectories, where k is user-defined, either in an expanded form (all nodes that comprise a trajectory) or in a compressed form (number of nodes in the trajectory), together with the total spatio-temporal cost.
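The extraction step can be sketched with a binary heap as follows, assuming that each discovered trajectory is kept as a (cost, ID, nodes) record; the function name `top_k` and the record layout are hypothetical.

```python
import heapq


def top_k(results, k, expanded=True):
    """Return the k cheapest trajectories from (cost, tid, nodes) records.

    expanded=True reports the full node sequence of each trajectory,
    expanded=False only its number of nodes (compressed form).
    """
    best = heapq.nsmallest(k, results, key=lambda r: r[0])
    if expanded:
        return [(tid, list(nodes), cost) for cost, tid, nodes in best]
    return [(tid, len(nodes), cost) for cost, tid, nodes in best]
```

The compressed form avoids copying node sequences when only the ranking and the costs are of interest.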

Regarding the R-tree-based methods, we implemented the bottom R-trees aiming at maximizing control and the ability for modifications. Each R-tree node consists of a 1-d array containing the tree elements. Thus, the nodes are separated from the elements to be stored; this provides the flexibility to store whatever is necessary. Notably, in our implementation the edges follow a bidirectional rationale, i.e. the flag for the edge direction is ignored.

Our initial approach was to insert the TopTreeElement elements into each node, for both internal and leaf nodes. The first tests showed that this approach was not very efficient with respect to creating and inserting elements in each node. For this reason, the TopTreeElement elements were removed from the internal nodes, i.e. in our current R-tree implementation, elements of this type can only be found in the leaves, whereas internal nodes store a 1-d vector, which can store elements of any type. Thus, we can store either TopTreeElement elements, or pointers to lower R-tree levels. We use R-trees to store the edges. This provides a robust implementation with respect to time and storage efficiency. These extensions are important for enhancing the performance of the two studied R-tree-based methods.

The trajectory generator by Brinkhoff [4] was extended to further enhance its performance for very large networks. Especially for the generation process of the trajectories, a class has been implemented to model time with discrete clock ticks, each of which denotes a new movement cycle for the visitors, where each visitor moves only as far as its speed allows. Each time instance is numbered, which allows tying visits to nodes and to edges to a time instance or a time interval. Thus, it is possible for an object, depending on its speed, to spend more than one clock tick on the same edge; however, visiting a node is always considered instantaneous.

The creation of a visitor is based on a probability, which can be adjusted to model bustling or deserted networks. This probability is examined at each time instant; in addition to the visitors already moving in the network, new visitors may also appear. Each visitor knows the specific trajectory to follow in the network, which is indexed and stored in each particular studied method. Additionally, each visitor is assigned a specific speed, which determines the distance traversed in one clock tick. The way visitors are used can be modified so that moving objects can be studied on graphs. In our case, each visitor decides on the trajectory to follow, and it traces this trajectory with increasing time. The trajectory is stored as a sequence of nodes, and thus a visitor is used only to trace trajectories on the network.
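The clock-tick model can be illustrated with a small sketch. It assumes that a visitor advances continuously by at most `speed` distance units per tick and that a node visit is stamped with the first tick at which the node has been reached; the generator extension described above may differ in detail, and the function name `trace_trajectory` is hypothetical.

```python
import math


def trace_trajectory(nodes, edge_lengths, speed, start_tick=0):
    """Return [(node, tick)] for a visitor following `nodes`.

    edge_lengths[i] is the length of the edge nodes[i] -> nodes[i + 1].
    """
    if len(edge_lengths) != len(nodes) - 1:
        raise ValueError("one edge length per consecutive node pair")
    if speed <= 0:
        raise ValueError("speed must be positive")
    visits = [(nodes[0], start_tick)]
    travelled = 0.0
    for node, length in zip(nodes[1:], edge_lengths):
        travelled += length
        # A slow visitor spends several ticks on a long edge.
        visits.append((node, start_tick + math.ceil(travelled / speed)))
    return visits
```

For example, a visitor with speed 2 on edges of lengths 5 and 3 reaches the second node at tick 3 and the third node at tick 4.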

6 Experimental evaluation

We performed a series of exhaustive experiments on the studied methods. The datasets, retrieved from the 9th DIMACS Challenge webpage [30] (last update in 2010), represent various portions of the US road network (see Table 3). We appreciate the realistic network topology, which provides the real distances between nodes, so that they need not be calculated either in a preprocessing step during graph construction or on the fly during algorithm execution. However, our interest primarily lies in the network sizes, which cover a wide spectrum. By using these datasets, we trust that our conclusions are realistic.

Table 3 Datasets used for experimental evaluation

All methods and the unified framework were implemented in C++. The Boost Graph library [5] was also used for several primitive structure types and graph algorithms. The experiments were performed on a personal computer with an Intel Core i7-6700K quad-core processor clocked at 4.00 GHz with 8 MB Cache, 16 GB (8GBx2) DDR4 main RAM memory clocked at 2133 MHz, and an SSD drive, with a read speed of 550 MB/s, a write speed of 520 MB/s, and a capacity of 120 GB. To avoid any throttling on the processor or any other system part, the computer remained plugged in the power outlet throughout the whole experimentation.

6.1 Preprocessing - indexing and storing

Here we calculate the required space and time to construct the corresponding indexes of each method. In each case, we construct the network and store it in the relevant index. Afterwards, we insert a variable number of trajectories and observe the time and storage required. The number of trajectories varies from 1K to 10K, 100K and 1M (Figs. 8, 9, 10 and 11). The trajectories are created by our enhanced generator as described in Section 5.

Fig. 8 Time for constructing the trajectory indexes - 1K trajectories

Fig. 9 Time for constructing the trajectory indexes - 10K trajectories

Fig. 10 Time for constructing the trajectory indexes - 100K trajectories

Fig. 11 Time for constructing the trajectory indexes - 1M trajectories

We observe a more or less similar behavior, when bulk trajectory insertions are made in the networks. The time differences are small in all three methods; however, MON-tree and CeAL have a distinct advantage when storing an empty network over FNR-tree, as seen in Table 4.

Table 4 Time needed to construct networks for each index (secs)

With respect to the storage space needed while constructing the network, for brevity, we show results only for the NY network since our experiments indicate that the same behavior is observed across all datasets. Table 5 shows the increase of memory space required for a variable number of trajectories. We observe that CeAL displays the best behavior in comparison to the other methods. When the number of trajectories is small (e.g. 1000) there is not much difference between the examined methods; however, as the number of trajectories increases, the performance gap becomes more obvious.

Table 5 Network storage space for each index (MB)

Figure 12 compares the time needed to insert a trajectory in each index. MON-tree and CeAL outperform FNR-tree in all cases. In networks larger than the Great Lakes network, the insertion time of CeAL increases linearly, whereas MON-tree retains its performance. This is due to the index construction. FNR-tree stores the entire network; thus, inserting a new trajectory requires a search, which becomes more expensive as the network increases in size. Similarly, CeAL’s adjacency list grows in size as the network nodes increase, leading to larger search times for storing the new trajectory in the correct network edges. On the other hand, MON-tree does not store any part of the network until it becomes relevant (by being part of a new trajectory), which means that it is less sensitive to network size increases. Similar results are obtained for trajectory deletion.

Fig. 12 Comparison between FNR-tree, MON-tree and CeAL - trajectory insertion

Next, we examine the time to retrieve a trajectory from each index. In particular, we first compare FNR-tree with MON-tree, and then MON-tree with CeAL. Figure 13 shows that the performance of FNR-tree is inferior to that of MON-tree in all cases. In particular, its performance degrades seriously as the network size increases. Therefore, the only meaningful comparison is between MON-tree and CeAL. Figure 14 shows that CeAL outperforms MON-tree in retrieving the trajectory edges. Retrieving a trajectory from a MON-tree requires a traversal of the generated R-tree, which is sensitive to the network size, while doing the same in a CeAL index requires the traversal of a single adjacency list path, which does not require any comparisons or path decisions.

Fig. 13 Comparison between FNR-tree and MON-tree - trajectory discovery

Fig. 14 Comparison between MON-tree and CeAL - trajectory discovery

6.2 Trajectory similarity query processing

Here, we test the performance of the methods under examination during trajectory similarity query processing. We measure the total time performance (CPU time and I/O cost) as well as the total number of distance computations / shortest path calculations for searching and retrieving trajectories stored within each index.

6.2.1 Results for variation of |T|

We check how the examined methods behave with an increasing number of stored trajectories, since real-life applications deal with a large number of trajectories, and dynamic data can lead to increased load. Figs. 15, 16, 17 and 18 show how the algorithms behave across the different networks.

Fig. 15 FNR-tree, MON-tree and CeAL - 1000 trajectories

Fig. 16 FNR-tree, MON-tree and CeAL - 10000 trajectories

Fig. 17 FNR-tree, MON-tree and CeAL - 100000 trajectories

Fig. 18 FNR-tree, MON-tree and CeAL - 1000000 trajectories

As can be seen from these figures, small trajectory numbers favor the two spatial methods, which exhibit a more or less stable behavior. On the other hand, CeAL scales linearly with increasing network size, a behavior that is maintained across all cases of trajectory numbers. A small trajectory number combined with a smaller network still favors CeAL, since it performs as well as, and in some cases better than, the spatial methods.

CeAL outperforms the other methods when the number of stored trajectories is in the order of millions. Figure 18 shows that CeAL outperforms the two spatial methods in small, medium and large networks. Assuming that a real-life application will have millions of stored trajectories to provide accurate information to the users, we can conclude that, although the initial results do not favor CeAL across the whole dataset, it performs better in the crucial tests.

6.2.2 Results for variation of k

Next, we examine how the methods are affected by varying k. To stress the methods we use networks with 1000000 trajectories, for the reasons mentioned in the previous section. The results can be seen in Figure 19. It should be noted that the FNR-tree and MON-tree indexes do not support a top-k query, but rather work by providing a desired spatiotemporal box and retrieving all trajectories in it. Nevertheless, we include the relevant measurements for both methods in the figure, so that a comparison can be made about the relative efficiency between the algorithms.

Fig. 19 Comparison between FNR-tree, MON-tree and CeAL - trajectory discovery, number of top-k trajectories requested

6.2.3 Results for variation of q

Next, we examine the effect of increasing the query locations q = 5,10,20,40,80 when |T| = 100 K and k = 10. As seen in Figure 20, this has the most significant impact on CeAL’s performance, due to the increased number of shortest path calculations that need to be performed. Again, FNR-tree and MON-tree relevant measurements are included for comparison purposes.

Fig. 20 Comparison between FNR-tree, MON-tree and CeAL - trajectory discovery, number of query locations

6.2.4 Results for progressiveness of top-k result discovery

To show CeAL’s progressiveness in fetching the trajectory results, we logged the time when each trajectory was retrieved during the algorithm’s execution. We ran this experiment with 100K trajectories stored in the data structure, with 20 query points of interest, requesting 10 trajectories. In further experiments, the algorithm’s behavior remains consistent with what is shown below. Table 6 shows how trajectories are obtained in a progressive manner when performing a top-k query in CeAL. For each dataset, the time when a particular trajectory was obtained is shown. The above results show that CeAL retrieves its results progressively, thus being able to provide results in a more efficient manner.

Table 6 Progressive discovery of top-k results (msecs)

6.2.5 Results for I/O activity

Lastly, we present results on the average number of page accesses (I/O activity) needed by each method for each network to retrieve a trajectory, as seen in Table 7. Note that for this comparison, we use the results of MON-tree for 1 million stored trajectories. Also, for completeness in Table 8 we present the average number of disk accesses of MON-tree, which depends on the number of inserted trajectories into the structure.

Table 7 Average number of disk accesses for all methods
Table 8 Average number of disk accesses for a MON-tree with varying trajectory number

As can be seen from the above results, CeAL’s adjacency list-based structure offers a distinct advantage as far as disk accesses are concerned, as it is not sensitive to the increase of the network size. The spatial methods demonstrate different behaviors. FNR-tree constructs the entire network from the beginning; thus, the network size determines how deep the resulting R-tree’s leaf nodes are stored, and the total number of accesses needed to retrieve each level. On the other hand, MON-tree constructs only the relevant parts of the network, i.e. only the parts with existing trajectories. This means that the network size itself doesn’t affect the number of disk accesses, but the number of trajectories does.

6.2.6 Results for distance calculations

CeAL uses Dijkstra’s Shortest Path algorithm to find the closest trajectories to the points of interest q. Table 9 shows the number of performed calculations during the query trajectory processing across all of the network datasets. These results were obtained on a network with 100000 stored trajectories, a query of 20 locations, and requesting 10 top-k trajectories. The R-tree-based indexes do not use any shortest-path algorithm, since they return trajectories contained within a provided bounding box.

Table 9 Number of distance/shortest paths calculations for CeAL

7 Conclusions

There is an ever increasing need for spatial and temporal data (and their spatio-temporal synthesis), which a user needs to retrieve efficiently, even in real time. We have examined three approaches to the problem of trajectory storing, indexing and retrieval in large spatial networks, through the implementation of three different methods. Any application dealing with multi-dimensional spaces can be based on the methods using the R-tree family. On the other hand, our approach, based on simpler structures and algorithms, removes the need for intermediate R-trees and leads to better time performance.

The FNR-tree displays some issues when compared to the other two methods. The main disadvantage of this method is that it copies the whole network on its top level R-tree at the beginning of its execution, which increases the construction time and makes its reconstruction infeasible in case it is ever needed.

The MON-tree improves upon the above by not storing the entire network beforehand. Instead, edges are stored in the R-tree only when needed, i.e. only when storing a trajectory with edges that are not yet stored in the R-tree. If the edge has already been stored, then the temporal data insertion will proceed as usual. As seen in the experimental section, from the point of view of construction time, the MON-tree should be favored. A direct consequence of this is that each time we insert a new edge or search for an existing one, this is always performed on a top level R-tree of smaller size than that of the FNR-tree, thereby generally decreasing the time needed for these operations.

Our CeAL method is an alternative approach, which is characterized by the following advantages:

  • Decreased storage space and construction time in comparison to the other two methods. In particular, this advantage is more significant when compared with FNR-trees.

  • By using a generalized metric similarity function, the trajectories are ranked according to their relevance to the user query. This allows users to retrieve the best trajectory according to their needs. This could be achieved by the other two methods by manipulating the query rectangle, but this would come with another set of difficult problems.

  • It returns to the user the desired number of trajectories (k) in a progressive manner, whereas the other two methods return all the results they reach. This characteristic is important not only in a real-world setting but performance-wise as well.

  • It prevails in retrieving complete trajectories, although the other two methods perform better at retrieving those parts of a trajectory which pass through a specific part of the graph.

The effectiveness of all three methods depends on the number of the stored trajectories. If this number is small, then the results will be poor. In particular, for the first two methods, the result of a user query might contain a small number of trajectories, or even none at all. On the other hand, the CeAL method will always return a number of trajectories.

8 Future work

In this paper we focused on the problem of trajectory storing, indexing and retrieval in spatial networks, aiming at delivering an efficient and effective solution. Future research could capitalize on the following ideas.

The drawbacks of FNR-trees have been addressed by MON-trees. Thus, one could invest in further improving MON-trees instead of FNR-trees. MON-trees could be augmented with a ranking mechanism to deliver the trajectories according to the user input. Locations of interest could be integrated in this method as well, which could be translated to a representative trajectory. The trajectories discovered by the method would then be ranked based on spatial and temporal distances before being returned to the user.

Another improvement of this method could be to provide the capability to define the desired number of the trajectories, ensuring that the user will receive only the most relevant results. This requires a metric function to rank the trajectories and return only the closest ones. As seen, the number of requested trajectories does not affect their discovery speed; e.g. a query requesting k trajectories and one requesting k + 1000 trajectories will need approximately the same time.

MON-trees could be enhanced with a user-defined or derived threshold. For example, if the user requests k trajectories, and several trajectories have a distance less than the threshold, then the execution would stop and the trajectories would be delivered to the user as good enough answers to the query. This concept needs testing to come up with reasonable policies as to how the threshold should be set.

As explained before, all examined methods are based on the existence of a significant number of pre-stored trajectories to ensure that the results are relevant. If the number of trajectories is small, then the results, although algorithmically correct, will suffer in terms of quality and usefulness to the user. For this reason, an abundance of pre-existing trajectories is required, either real or generated. Using minimum spanning trees could help in separating the graph into segments, or neighborhoods, which can communicate with each other, and generating trajectories on each of these segments.

Another alternative could be based on the wide proliferation of social networks and on using spatio-temporal data gathered and aggregated from them. Proper anonymization techniques could be used to avoid invoking personal information issues. Unfortunately, there is a lack of such datasets extracted from major social networks.