1 Introduction

As location-based social networks (Zheng 2011a) have become popular in recent years, an increasing number of people have started using GPS-enabled devices to log their outdoor movements as GPS trajectories (Zheng et al. 2008c, 2009a, 2010c, 2011e). These trajectories not only record users’ location histories in the physical world but also imply their personal interests and preferences (Eagle et al. 2006; Zheng et al. 2011a, b). Meanwhile, trajectories generated by a large number of people carry rich social and community intelligence (Zhang et al. 2011). Figure 1 demonstrates the mobility of four individuals (A, B, C, and D) who each recorded a one-day trip with a GPS trajectory. From their outdoor movements, we can observe the following three insights.

Fig. 1 GPS trajectories and user interests

  1.

    Geographic overlaps: People with similar outdoor location histories (or mobility patterns) in the geographic space could share similar life interests. For instance, users A and B might share some similar interests, as both of them have visited the same cinema, museum, coffee shop, and shopping mall (although they may not know each other personally).

  2.

    Semantic overlaps: People could share similar interests if they have similar mobility patterns in the space of semantic locations. For example, though user C does not visit the same locations as B, the semantic meanings (categories) of the locations (museum, cinema, and coffee shop) are the same as those of B. That is, they could still share similar interests.

  3.

    Location sequence: Although user D also visited a museum, a coffee shop, a shopping mall, and a cinema, the sequence of these locations (“museum → shopping mall → coffee shop → cinema”) differs from that of users B and C (“cinema → museum → coffee shop → shopping mall”). Thus, the similarity between D and B might not be as significant as that between C and B.

In this paper, we aim to estimate the similarity between users according to the semantic location histories (SLH) inferred from their GPS trajectories. This similarity can be regarded as a potential social tie between users, thereby enabling friend and location recommendation. For instance, as shown in Fig. 1, if we know that users B and C are similar according to their location histories, we are able to recommend user B to user C in a social networking service. As a result, they may connect to each other, i.e., creating a social tie between them. Further, we can recommend museum 1 and cinema 1 to user C, and recommend museum 2 and cinema 2 to user B.

The two essential steps of finding similar users are (1) modeling users’ interests from their historical GPS trajectories and (2) measuring the similarity between them based on their location histories. Both tasks are non-trivial.

First, we model a user’s movements as a sequence of semantic locations (categories), e.g., “museum → cinema → restaurant”, instead of physical locations. Semantic locations are more informative in capturing the interests of users than physical geo-positions. Additionally, this semantic history enables us to detect similar users who have no overlap in the geographic space, e.g., users B and C shown in Fig. 1. However, there is uncertainty in identifying the location category of a GPS point due to the positioning errors of GPS devices and the crowded distribution of points of interest (POIs) in a city (sometimes, many POIs are located in the same building).

Second, it is non-trivial to estimate the similarity between users by matching their semantic location sequences. Our first intuition is that users sharing a longer sequence of semantic locations would be more similar than those sharing a shorter one. For example, people sharing the sequence “museum → restaurant” would be more similar than those visiting these two categories separately. In addition, a fine-grained semantic location, e.g., “art museum”, is more informative in reflecting users’ interests than a coarse one, e.g., “museum”. Thus, users sharing semantic locations of a coarse granularity would be less similar than those sharing locations of a finer granularity. Furthermore, semantic locations with different popularities contribute differently to the similarity between users. Intuitively, users sharing a category visited by few people, e.g., “museum”, would be more similar than those sharing a very common category, e.g., “restaurant”.

To address these problems, we first construct an SLH for each user based on their historical GPS trajectories. Then, we compute the similarity between different users in terms of their SLHs, considering the sequence, granularity, and popularity features mentioned above. The contributions of our work include:

  1.

    The SLH models a user’s interests well and accounts for the uncertainty of the semantic meaning of a place where a user stayed. Specifically, the SLH transfers a user’s location history from raw GPS trajectories into a set of high-level semantic location sequences on different levels of a hierarchy representing different granularities of location categories.

  2.

    We design the maximal travel match (MTM) algorithm to compare the location histories of different users. MTM finds the maximal sub-sequence matches instead of simply counting common locations. Moreover, by incorporating the travel time between two locations, MTM is more capable of finding common sub-sequences with meaningful visiting orders than existing work on sequence matching, such as Edit Distance (ED) (Levenshtein 1966), Longest Common Sub-Sequences (LCSS) (Vlachos et al. 2002), and Dynamic Time Warping (DTW) (Yi et al. 1998).

  3.

    We evaluated our approach on real-world GPS data collected by 109 users over a year. The results demonstrate the advantages of SLH and MTM over the respective baselines (the dataset has been released to the public (GeoLife GPS trajectories 2010) to facilitate further research).

The rest of this paper is organized as follows. Section 2 introduces the preliminaries and the architecture of this work. Section 3 describes the location history modeling, and Sect. 4 details the location history matching. Section 5 presents an efficient solution for using the similarity in a real system. Later, we report on the evaluation results in Sect. 6 and discuss the related work in Sect. 7. Finally, we conclude our work in Sect. 8.

2 Preliminaries

Definition 1

(GPS Trajectory) A GPS trajectory Tra is a sequence of time-stamped points, \( Tra = p_{0} \to p_{1} \to \cdots \to p_{k} \), where \( p_{i} = (x,y,t) (i = 0,1, \ldots ,k) \); (x, y) are latitude and longitude respectively, and t is a timestamp. \( \forall 0 \le i < k, p_{i + 1} \cdot t > p_{i} \cdot t \).

Definition 2

(Stay Point) A stay point s is a geographical region where a user stayed over a time threshold \( \theta_{t} \) within a distance threshold \( \theta_{d} \). In a trajectory, s is characterized by a set of consecutive points \( P = \left\langle {p_{m} ,p_{m + 1} , \ldots ,p_{n} } \right\rangle \), where \( \forall m < i \le n \), \( Dist(p_{m} ,p_{i} ) \le \theta_{d} \), \( Dist(p_{m} ,p_{n + 1} ) > \theta_{d} \) and \( Int(p_{m} ,p_{n} ) \ge \theta_{t} . \) Therefore, \( s = (x, y, t_{a} ,t_{l} ) \), where

$$ s.x = \sum\limits_{i = m}^{n} {p_{i} \cdot x/|P|,} $$
(1)
$$ s.y = \sum\limits_{i = m}^{n} {p_{i} \cdot y/|P|,} $$
(2)

stand for the average x and y coordinates of the collection P, respectively; \( s.t_{a} = p_{m} \cdot t \) is the user’s arriving time at s and \( s.t_{l} = p_{n} \cdot t \) represents the user’s leaving time.

As depicted in Fig. 2, {p 1, p 2, …, p 8} form a GPS trajectory, and a stay point would be detected from {p 3, p 4, p 5, p 6} if d ≤ θ d and Int(p 3, p 6) ≥ θ t . In contrast to a raw point p i , a stay point carries a particular semantic meaning, such as a shopping mall or a restaurant that a user visited.
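As a concrete illustration of Definition 2, the following is a minimal Python sketch of stay point detection, assuming a trajectory given as chronologically ordered (latitude, longitude, timestamp) tuples; the equirectangular distance helper and the default thresholds (200 m, 30 min, mirroring the settings reported later in Sect. 6) are illustrative choices, not the authors’ exact implementation.

```python
import math

def dist_m(p, q):
    """Approximate ground distance in meters between two (lat, lon, t) points."""
    # Equirectangular approximation; adequate at the few-hundred-meter scale used here.
    R = 6371000.0
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return R * math.hypot(x, y)

def detect_stay_points(traj, theta_d=200.0, theta_t=30 * 60):
    """Detect stay points (Definition 2) from a list of (lat, lon, timestamp) tuples."""
    stays, i, n = [], 0, len(traj)
    while i < n:
        j = i + 1
        # grow the candidate region while points stay within theta_d of the anchor p_i
        while j < n and dist_m(traj[i], traj[j]) <= theta_d:
            j += 1
        if traj[j - 1][2] - traj[i][2] >= theta_t:        # stayed long enough
            pts = traj[i:j]
            stays.append({
                "x": sum(p[0] for p in pts) / len(pts),   # Eq. (1): mean latitude
                "y": sum(p[1] for p in pts) / len(pts),   # Eq. (2): mean longitude
                "t_a": pts[0][2],                         # arrival time
                "t_l": pts[-1][2],                        # leaving time
            })
            i = j
        else:
            i += 1
    return stays
```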

Fig. 2 A GPS trajectory and a stay point

Figure 3 presents the architecture of our work, which consists of two major steps: location history modeling and location history matching. Given (1) GPS trajectories of multiple users and (2) a POI database, our objective is to infer a similarity score for each pair of users. Later, this similarity can be used as a distance function by existing methods, such as K-means clustering and KNN retrieval, to group users. Therefore, we can easily find the top k similar users of a person by ranking the others in the person’s group according to the similarity scores.

Fig. 3 The architecture of similar user discovery

In order to make different users’ location histories comparable, we first put all users’ GPS trajectories together and create a shared framework of location history. Here, a POI database is employed to transfer a user’s location history from the geographic space into the semantic space. The POI database contains a corpus of POI entities, each of which includes properties such as category, latitude, and longitude. Then, based on this framework, we build a location history for each user. Later, for each pair of users, we explore their similarity by matching their location histories. We provide more details of the architecture in the following sections.

3 Location history modeling

Figure 4 shows the process of modeling location history for each user, and Fig. 6 gives a demonstration. This step is comprised of three components denoted as grey boxes in Fig. 4 and described respectively in the following subsections.

Fig. 4 The procedure of modeling user location history

3.1 Stay points representation

In this component, we first extract stay points from each user’s GPS trajectories by using the stay point detection method proposed in (Li et al. 2008). These stay points carry more semantic meaning than raw GPS points, and allow us to filter out the places a user merely passed by, e.g., crossroads.

However, it is almost impossible to identify the exact POI a user visited from a stay point, given the GPS positioning error and the crowded distribution of POIs in a city. In practice, as shown in Fig. 5, a GPS reading may have an error of 10 m or more relative to the real position. Naturally, there could be multiple POIs of different categories within such a range, while the POI nearest to the stay point may not be the real place that the user visited. Sometimes, restaurants, shopping malls, and cinemas are even located in the same building.

Fig. 5 The architecture of similar user discovery

In this work, we represent a stay point as a [s.x − γ, s.x + γ] × [s.y − γ, s.y + γ] region (refer to Fig. 5 for an example), where γ is a parameter related to the GPS positioning error. After that, we construct a feature vector for each stay region according to the POIs falling within the region. Here, we employ the idea of TF-IDF (term frequency-inverse document frequency), a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Similarly, we regard categories of POIs as words and treat the users’ stay regions as documents. Intuitively, if POIs of a category occur in a region many times, this POI category is important in representing this region. Furthermore, a POI category (e.g., “museum” and “natural parks”) that occurs rarely in other regions is more representative for the region than a common POI category, e.g., “restaurant”, which could appear in many places. Thus, we consider both the occurring frequency of a POI category in a region (similar to TF) and the inverse location frequency (equivalent to IDF) of this category. Combining these two factors, we design the feature vector as follows.

Definition 3

(Feature Vector) The feature vector of a stay region r in a collection of regions R is \( f_{r} = \langle w_{1}, w_{2}, \ldots, w_{F} \rangle \), where w i is the weight of POI category i in region r. F is the number of unique POI categories in the POI database.

$$ w_{i} = \frac{n_{i}}{N} \times \log \frac{|R|}{|\{ \text{regions containing category } i\} |} $$
(3)

where n i is the number of POIs of category i located in region r and N stands for the total number of POIs in region r. The first part of Eq. 3 represents the occurring frequency of a category and the second part denotes the inverse location frequency of a category, in which |R| is the number of regions in the collection.

According to Eq. 3, we can represent a stay region with a feature vector (refer to the top part of Fig. 6 for an example). Although we still cannot identify the exact POI category visited by a user, this feature vector captures the interests of a user to some extent by representing the semantic meaning of a location. In short, the feature vector reflects the uncertainty of the visited categories while bypassing the difficulty of identifying the exact POI a user visited.
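The following Python sketch shows one way to compute the feature vector of Eq. 3, assuming each stay region is given as the list of categories of the POIs that fall within it; the function name and data layout are illustrative, not the authors’ implementation.

```python
import math
from collections import Counter

def stay_region_features(region_pois, all_categories):
    """TF-IDF style feature vectors (Eq. 3) for a collection of stay regions.

    region_pois: one list per stay region, holding the categories of the POIs
                 that fall within that region.
    all_categories: ordered list of the F unique POI categories.
    """
    R = len(region_pois)
    # |{regions containing category i}| for the inverse-location-frequency part
    region_freq = Counter()
    for pois in region_pois:
        for cat in set(pois):
            region_freq[cat] += 1

    vectors = []
    for pois in region_pois:
        N = len(pois)                                  # total POIs in this region
        counts = Counter(pois)                         # n_i per category
        vectors.append([
            (counts[c] / N) * math.log(R / region_freq[c]) if counts[c] else 0.0
            for c in all_categories
        ])
    return vectors
```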

Fig. 6 The demonstration of location history modeling

3.2 Generating location history framework

In the second component, as demonstrated in Fig. 6, we cluster the stay regions into groups according to their feature vectors. The stay regions in the same cluster can be regarded as locations of a similar type, having similar semantic meanings. However, a flat clustering is insufficient for differentiating users that are similar to different degrees. Intrinsically, we are more capable of discriminating similar users given categories of a finer granularity. For example, “restaurant” helps identify users who like dining out, while “Indian restaurant” and “Japanese restaurant” enable us to differentiate people interested in different types of food.

Considering this factor, we hierarchically cluster the feature vectors in a divisive manner and build a tree-structured semantic location hierarchy. As shown in the middle part of Fig. 6, we start by putting the feature vectors of all users into one cluster and treat this cluster as the root (i.e., the cluster at layer 1). For each cluster c at layer j (j ≥ 1), we split c into a set of sub-clusters by using a flat clustering algorithm. The resulting sub-clusters of c are considered as c’s child nodes at layer j + 1. This procedure is repeated until the hierarchy reaches a given number of layers L. As a result, we create a tree-structured hierarchy where clusters at the same layer share the same granularity and a lower layer denotes a finer granularity.
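A minimal sketch of this divisive clustering follows, using scikit-learn’s KMeans as the flat clustering algorithm; the branching factor and layer count are illustrative parameters (in Sect. 6 the second and third layers end up with 15 and 48 clusters, so the actual splitting differs), and the per-layer label arrays stand in for the hierarchy \( \mathcal{F} \).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_semantic_hierarchy(vectors, branching=4, n_layers=3, seed=0):
    """Divisively cluster stay-region feature vectors into a layered hierarchy.

    Returns one label array per layer (layer 1 = root, a single cluster);
    labels_per_layer[l-1][r] is the semantic location of stay region r on layer l.
    """
    X = np.asarray(vectors, dtype=float)
    labels_per_layer = [np.zeros(len(X), dtype=int)]           # layer 1: everything in the root
    for _ in range(n_layers - 1):
        parent = labels_per_layer[-1]
        child = np.zeros(len(X), dtype=int)
        next_id = 0
        for cid in np.unique(parent):
            idx = np.where(parent == cid)[0]
            k = min(branching, len(idx))                       # split each parent cluster
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
            child[idx] = km.labels_ + next_id
            next_id += k
        labels_per_layer.append(child)
    return labels_per_layer
```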

Definition 4

(Semantic Location) A semantic location c is a feature vector cluster and represents a set of stay regions sharing similar semantic meanings of a certain granularity.

Definition 5

(Semantic Location Hierarchy) A semantic location hierarchy \( \mathcal{F} \) is a tree-structured framework in the feature vector space, \( \mathcal{F} = \cup_{l = 1}^{L} \{ C^{l} \} \), where L is the total number of layers; \( C^{l} = \{ c_{l1} ,c_{l2} , \ldots ,c_{lk} \} \) is the set of semantic locations at layer l, and \( c_{lk} \) denotes the kth semantic location on the lth layer.

3.3 Building individual location history

In this component, we construct a location history for each user based on the semantic location hierarchy \( \mathcal{F} \) and the user’s stay points (illustrated in Fig. 6). Originally, a user’s location history in the geographic space is represented by a sequence of stay points with the travel time between every two consecutive stay points. Then, on each layer of the semantic location hierarchy \( \mathcal{F} \), we substitute each stay point with the semantic location to which the stay point’s feature vector pertains. After this projection, different users’ location histories become comparable.

Definition 6

(Semantic Location History) A user’s semantic location history is a sequence of semantic locations on each layer of \( \mathcal{F} \), \( H = \cup_{l = 1}^{L} \{ S^{l} \} \), where \( S^{l} = (c_{l0} \mathop{\longrightarrow}\limits^{{{\Updelta t_{1} }}}c_{l1} \mathop{\longrightarrow}\limits^{{{\Updelta t_{2}}}} \ldots \mathop{\longrightarrow}\limits^{{{\Updelta t_{k}}}}c_{lk} ) \) is the sequence on the lth layer of \( \mathcal{F} \). Given two consecutive stay points \( s_{k - 1} \) and \( s_{k} \), with \( s_{k - 1} \in c_{l,k - 1} \) and \( s_{k} \in c_{lk} , \) then \( \Updelta t_{k} = s_{k} .t_{a} - s_{k - 1} .t_{l} \) is the travel time from \( c_{l,k - 1} \) to \( c_{lk} \).

Example 1

(Location History) As demonstrated in the upper-right part of Fig. 6, according to trajectory Tra m , user m’s location history can be represented by

$$ H = (s_{1} \mathop{\longrightarrow}\limits^{{{\Updelta t_{1}}}}s_{2} \mathop{\longrightarrow}\limits^{{{\Updelta t_{2}}}} \cdots \mathop{\longrightarrow}\limits^{{\Updelta t_{6}}}s_{7} ), $$

where \( \Updelta t_{i} = s_{i + 1} .t_{a} - s_{i} .t_{l} \) is the travel time between s i and s i+1. Then, we extend each stay point into a stay region and calculate the feature vector of each stay region as follows. Suppose that s 1 contains two restaurants and one museum, and s 2 only has four restaurants. The total number of stay regions created by all the users is 100, of which 50 contain restaurants and two contain museums. So, the feature vectors of s 1 and s 2 are f 1 and f 2 respectively:

$$ f_{1} = \left( {\frac{2}{3} \times \log \frac{100}{50},\frac{1}{3} \times \log \frac{100}{2}, \ldots ,} \right), $$
$$ f_{2} = \left( {\frac{4}{4} \times \log \frac{100}{50},\quad 0, \ldots ,} \right), $$

After hierarchically clustering these stay regions in the feature space, we build the hierarchy \( \mathcal{F} \). Later, by replacing each stay point with the cluster ID (semantic location) to which the point’s feature vector pertains, we obtain two sequences, S 2 and S 3, on the second and third layers of \( \mathcal{F} \) respectively.

$$ S^{2} = (c_{20} \mathop{\longrightarrow}\limits^{{{\Updelta t_{1} }}}c_{20} \mathop{\longrightarrow}\limits^{{{\Updelta t_{2} }}}c_{21} \mathop{\longrightarrow}\limits^{{{\Updelta t_{3} }}}c_{21} \mathop{\longrightarrow}\limits^{{{\Updelta t_{4} }}}c_{21} \mathop{\longrightarrow}\limits^{{{\Updelta t_{5}} }}c_{21} \mathop{\longrightarrow}\limits^{{{\Updelta t_{6}}}}c_{20} ), $$
$$ S^{3} = (c_{30} \mathop{\longrightarrow}\limits^{{{\Updelta t_{1}} }}c_{30} \mathop{\longrightarrow}\limits^{{{\Updelta t_{2}}}}c_{32} \mathop{\longrightarrow}\limits^{{{\Updelta t_{3}} }}c_{32} \mathop{\longrightarrow}\limits^{{{\Updelta t_{4}} }}c_{33} \mathop{\longrightarrow}\limits^{{{\Updelta t_{5} }}}c_{33} \mathop{\longrightarrow}\limits^{{{\Updelta t_{6} }}}c_{30} ), $$

So, user m’s location history can be represented as \( H = \{ S^{2} ,S^{3} \} . \)
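Continuing the example, a short Python sketch of this projection step is given below; it assumes each stay point already knows the index of its stay region and reuses the per-layer cluster labels produced by the clustering sketch in Sect. 3.2 (the field names are hypothetical).

```python
def build_semantic_history(stay_points, labels_per_layer):
    """Project one user's stay-point sequence onto every layer of the hierarchy.

    stay_points: chronologically ordered dicts with 't_a' (arrival), 't_l'
                 (leaving) and 'region' (index of the stay region they fall in).
    labels_per_layer: per-layer cluster labels of all stay regions
                      (see build_semantic_hierarchy above).
    Returns {layer: [(semantic_location_id, travel_time_from_previous), ...]}.
    """
    history = {}
    for layer, labels in enumerate(labels_per_layer, start=1):
        seq = []
        for k, s in enumerate(stay_points):
            dt = s["t_a"] - stay_points[k - 1]["t_l"] if k > 0 else 0   # Delta t_k
            seq.append((int(labels[s["region"]]), dt))
        history[layer] = seq
    return history
```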

4 Location history matching

In this section, we explore the similarity between each pair of users by matching their semantic location histories. As shown in Fig. 7, we first match two users’ semantic sequences at each layer of \( \mathcal{F} \) (i.e., MaxTravMatch), and then calculate a similarity score for each pair of users (i.e., CalculateSimilarity) by aggregating the matched sub-sequences at all layers in a weighted manner. Table 1 shows the notations used in this paper.

Fig. 7 The algorithm for location history matching

Table 1 Notations

4.1 Maximal travel match

A simple approach is to count the number of items shared by two sequences. However, this method loses a lot of information about a user’s behavior and preferences, and leaves sequences sharing the same number of common locations indistinguishable. To address this issue, we consider both the visiting order of locations and the travel time between two locations. Intuitively, users sharing the habit of “cinema → restaurant → shopping” are more similar to each other than users visiting these three places separately or in a different order. So, given two different users’ location histories, we detect the shared sub-sequences at each layer of \( \mathcal{F} \). To guarantee the visiting order between two locations is meaningful, we require the travel times between the two locations to be similar.

Definition 7

(Sub-sequence) Given a sequence \( S = (c_{1} \mathop{\longrightarrow}\limits^{{{\Updelta t_{1} }}}c_{2} \mathop{\longrightarrow}\limits^{{{\Updelta t_{2} }}} \ldots \mathop{\longrightarrow}\limits^{{{\Updelta t_{m - 1}} }}c_{m} ) \), we denote the ith item of S as S[i] (e.g., S[1] = c 1) and represent its sub-sequence as \( S[a_{1} , a_{2} , \ldots a_{k} ] \) where \( 1 \le a_{1} < a_{2} < \cdots < a_{k} \le m. \)

For instance, S[1, 3, 6, 7] = \( c_{1} \to c_{3} \to c_{6} \to c_{7} \) in the above definition. Note that we allow holes in a sub-sequence, i.e., it can be discontinuous, to obtain better sequence matches.

Definition 8

(Travel Match) Given a temporal constraint factor \( \rho \in \left[ {0,1} \right] \) and sub-sequences \( S_{1} [a_{1} , a_{2} , \ldots a_{k} ] \) and \( S_{2} [b_{1} , b_{2} , \ldots b_{k} ] \) from two sequences S 1 and S 2 respectively, these two sub-sequences form a k-length travel match if they satisfy the following two conditions.

  1.

    \( \forall i \in [1,k], \quad S_{1} [a_{i} ] = S_{2} [b_{i} ], \) i.e., the matched items are the same semantic location, and

  2.

    \( \forall i \in \left[ {2,k} \right], \frac{{\left| {\alpha_{i} - \alpha_{i} '} \right|}}{{Max(\alpha_{i} , \alpha_{i} ')}} \le \rho \), where \( \alpha_{i} = S_{1} \left[ {a_{i} } \right].t_{a} - S_{1} \left[ {a_{i - 1} } \right].t_{l} \) and \( \alpha_{i}^{'} = S_{2} \left[ {b_{i} } \right].t_{a} - S_{2} \left[ {b_{i - 1} } \right].t_{l} \), i.e., the travel time between two locations.

In the rest of this paper, we represent a travel match as (a 1, b 1) → (a 2, b 2) → ··· → (a k , b k ).

Essentially, a travel match is a common sequence of semantic locations visited by two users with similar travel times. Note that the semantic locations in a travel match do not have to be consecutive in the users’ original location histories. For instance, one user went hiking from a lake (i.e., “lake → hiking park”). Another one had a similar route but stopped by a hotel to book a room or have lunch (i.e., “lake → hotel → hiking park”). In this case, “lake → hiking park” should still be considered a common sequence (between the two users’ location histories) as long as the gap between the two users’ travel times from the lake to the hiking park is not very big. However, if the second user stayed in the hotel for a few days and visited some other places before reaching the hiking park, we no longer regard “lake → hiking park” as a common sequence, due to the big time difference.
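To make the two conditions of Definition 8 concrete, here is a minimal Python sketch of the travel-match test, assuming each visit carries its semantic location id and its arrival and leaving times; the dictionary keys are illustrative.

```python
def is_travel_match(S1, S2, a, b, rho=0.2):
    """Check whether S1[a] and S2[b] form a travel match (Definition 8).

    S1, S2: lists of visits, each a dict with 'loc' (semantic location id),
            't_a' (arrival time) and 't_l' (leaving time).
    a, b:   equal-length lists of 0-based indices into S1 and S2.
    """
    if len(a) != len(b) or not a:
        return False
    # condition 1: matched items must be the same semantic location
    if any(S1[i]["loc"] != S2[j]["loc"] for i, j in zip(a, b)):
        return False
    # condition 2: travel times between consecutive matched items must be similar
    for k in range(1, len(a)):
        t1 = S1[a[k]]["t_a"] - S1[a[k - 1]]["t_l"]
        t2 = S2[b[k]]["t_a"] - S2[b[k - 1]]["t_l"]
        if max(t1, t2) <= 0 or abs(t1 - t2) / max(t1, t2) > rho:
            return False
    return True
```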

Definition 9

(Maximal Travel Match) A travel match (a 1, b 1) → (a 2, b 2)… → (a k , b k ) between two sequences S 1 and S 2 is a maximal travel match if,

  1.

    No left increment: \( \nexists a_{0} < a_{1} , b_{0} < b_{1} \), s.t. \( (a_{0} ,b_{0} ) \to (a_{1} ,b_{1} ) \to (a_{2} ,b_{2} ) \cdots \to (a_{k} ,b_{k} ) \) is a travel match;

  2.

    No right increment: \( \nexists a_{k + 1} > a_{k} , b_{k + 1} > b_{k} \), s.t. \( (a_{1} ,b_{1} ) \to (a_{2} ,b_{2} ) \cdots \to (a_{k} ,b_{k} ) \to (a_{k + 1} ,b_{k + 1} ) \) is a travel match, and

  3.

    No internal increment: \( \forall i \in \left[ {1,k - 1} \right], \nexists a_{i} < a_{{i^{'} }} < a_{i + 1} \) and \( b_{i} < b_{{i^{'} }} < b_{i + 1} \), s.t.

    $$ (a_{1} ,b_{1} ) \to (a_{2} ,b_{2} ) \cdots \to (a_{i} ,b_{i} ) \to (a_{{i^{'} }} ,b_{{i^{'} }} ) \to (a_{i + 1} ,b_{i + 1} ) \to \cdots \to (a_{k} ,b_{k} ) $$

    is a travel match.

Example 2

(Maximal Travel Match) Figure 8 demonstrates an example of the maximal travel match between two sequences S 1 and S 2. Here, a node stands for a semantic location and the letter in a node represents the ID of the location. The numbers on the top of the box denote the index of a node in its sequence, e.g., location A is the first node in both S 1 and S 2. The number appearing on a solid edge is the travel time between two consecutive nodes, and the number shown on a dashed edge denotes the stay time at a location.

Fig. 8 An example of the maximal travel match

Let ρ = 0.2 in this example. First, \( (1,1) \to (2,2) \), i.e., \( A \to B \), is a travel match, because the travel times (\( A \to B \)) in S 1 and S 2 are identical, |2 − 2|/2 = 0. Then, we find that \( (2,2) \to (3,4) \), i.e., \( B \to C \), also satisfies the conditions of Definition 8. Though B and C are not directly connected in S 2, the travel time between these two locations is 4 + 0.5 + 0.5 = 5, which is very similar to that of S 1: |5 − 4|/5 = 0.2. However, neither \( A \to B \) nor \( B \to C \) is a maximal travel match in this example, as both are contained in \( A \to B \to C \), i.e., \( (1,1) \to (2,2) \to (3, 4) \). Later, \( C \to E \) and \( C \to F \) cannot form travel matches due to the difference between the corresponding travel times. Using the same approach, we find that \( (1,1) \to (2,2) \to (4,3) \to (5,5) \to (6,6) \), i.e., \( A \to B \to D \to E \to F \), is another maximal travel match. Overall, we detect two maximal travel matches, \( A \to B \to C \) and \( A \to B \to D \to E \to F \), from S 1 and S 2.

4.2 Discovering maximal travel matches

Figure 9 presents the process of discovering the maximal travel matches. First, we detect the 1-length travel matches (termed trivial matches in Definition 10) between two sequences and identify a precedence relation (refer to Definition 11) between these trivial matches. Then, the trivial matches and their precedence relation are transformed into a graph G, where a node is a trivial match and an edge corresponds to the precedence relation between trivial matches. Second, we prove that a maximal match is equivalent to a maximal length path in the graph G′, which is a graph refined from G. Note that, for the sake of efficiency, we directly build G′ instead of building G first and then removing redundant edges from it (see Sect. 4.2.1). Later, by searching for the maximal length paths in G′, we can find the maximal travel matches (refer to Sect. 4.2.2 for details).

Fig. 9 The process of finding the maximal travel matches

Definition 10

(Trivial Match) A trivial match is a 1-length travel match, e.g., location A (1, 1) in Fig. 8.

Definition 11

(Precedence Relation) Let (i, j) and (i′, j′) be two trivial travel matches between S 1 and S 2. (i, j) is a precedence of (i′, j′) if:

  1.

    \( i < i' \) and \( j < j' \), and

  2.

    \( \frac{{\left| {\alpha_{i} - \alpha_{i} '} \right|}}{{Max(\alpha_{i} , \alpha_{i} ')}} \le \rho \), where \( \alpha_{i} = S_{1} \left[ {i'} \right].t_{a} - S_{1} \left[ i \right].t_{l} \) and \( \alpha_{i} ' = S_{2} \left[ {j'} \right].t_{a} - S_{2} \left[ j \right].t_{l} \).

The precedence relation does not satisfy reflexivity, i.e., (i, j) is not a precedence of (i, j). In addition, the precedence relation does not have transitivity, since the second condition may be violated. In Fig. 8, A(1, 1) is a precedence of B(2, 2) and B(2, 2) is a precedence of C(3, 4). However, A(1, 1) is not a precedence of D(4, 3), because the relative difference of the travel times from A to D in S 1 and S 2 is (9 − 7)/9 ≈ 0.22 > 0.2.

4.2.1 Building the precedence graph

Following the case demonstrated in Fig. 8, Fig. 10 depicts an example of building the precedence graph G = (V, E) based on the trivial matches and the corresponding precedence relations detected from S 1 and \( S_{2} \). Basically, each node in V is a trivial match between S 1 and S 2, and an edge in E from node (i, j) to (i′, j′) stands for a precedence relation between the two trivial matches. Algorithm 3, shown in Fig. 11, describes the process in detail.

Fig. 10 The precedence graph for S 1 and S 2

Fig. 11 Building the refined graph directly based on two sequences

Using Fig. 10 as an example, we illustrate Algorithm 3. In Fig. 10b, each node corresponds to a trivial match, and the number in a node indicates its order in the sorted list, i.e., F 66, E 55, D 43, C 34, B 22, and A 11 (see line 2 of Algorithm 3).

We first mark all nodes white, which means a node is not yet reachable via the edges already in G′. Then, the main loop (from Line 5 to 10) adds non-redundant edges starting from each node \( v_{l} \) into \( G^{'} \). If \( v_{l} \) has a precedence relation with \( v_{t} \), we can form a candidate edge \( e = v_{l} \to v_{t} \). Here, e is non-redundant only if \( v_{t} \) is marked white, i.e., unreachable from another path starting from \( v_{l} \). After adding a non-redundant edge e, we mark all nodes reachable from e black.

In Fig. 10b, the main loop starts from \( E_{55} \). According to Definition 11, \( E_{55} \) is a precedence of \( F_{66} \). As \( F_{66} \) is labeled white so far, we add \( E_{55} \to F_{66} \) to \( G^{'} \) and then mark \( F_{66} \) black. After that, the main loop comes to \( D_{43} \). In the sorted list, the nodes before \( D_{43} \) are \( E_{55} \) and \( F_{66} \). Since \( D_{43} \) is a precedence of \( E_{55} \) and \( E_{55} \) is white, we add \( D_{43} \to E_{55} \) into \( G^{'} \) and then mark \( E_{55} \) black. Though \( D_{43} \) is also a precedence of \( F_{66} \), \( F_{66} \) is already marked black. Thus, the edge \( D_{43} \to F_{66} \) is redundant and is not inserted into \( G^{'} \). Now, the main loop for \( D_{43} \) is over.
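Since Algorithm 3 itself appears only in Fig. 11, the following Python sketch reproduces the idea under two assumptions: candidate targets are scanned from nearest to farthest, and each visit exposes its semantic location id plus arrival/leaving times (the same hypothetical field names as in the earlier sketches). Processing sources in decreasing lexicographic order lets each source reuse the reachability sets of already-processed nodes, so redundant edges are skipped without ever building the full graph G.

```python
def build_refined_graph(S1, S2, rho=0.2):
    """Build the refined precedence graph G' (no redundant edges) for S1 and S2.

    S1, S2: lists of visits, each a dict with 'loc', 't_a', 't_l'.
    Returns (nodes, edges): nodes are trivial matches (i, j); edges[u] lists
    the non-redundant successors of u.
    """
    # trivial matches: pairs of positions holding the same semantic location
    nodes = [(i, j) for i, v1 in enumerate(S1) for j, v2 in enumerate(S2)
             if v1["loc"] == v2["loc"]]

    def precedes(u, v):
        """Definition 11: does u = (i, j) precede v = (i2, j2)?"""
        (i, j), (i2, j2) = u, v
        if not (i < i2 and j < j2):
            return False
        t1 = S1[i2]["t_a"] - S1[i]["t_l"]
        t2 = S2[j2]["t_a"] - S2[j]["t_l"]
        return max(t1, t2) > 0 and abs(t1 - t2) / max(t1, t2) <= rho

    edges = {u: [] for u in nodes}
    reach = {u: set() for u in nodes}            # nodes reachable from u in G'
    # process sources in decreasing order, so every possible target is done first
    for u in sorted(nodes, reverse=True):
        for v in sorted(nodes):                  # scan targets from nearest to farthest
            if precedes(u, v) and v not in reach[u]:
                edges[u].append(v)               # non-redundant edge u -> v
                reach[u] |= {v} | reach[v]       # everything behind v is now reachable
    return nodes, edges
```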

Lemma 1

A precedence graph G is a directed acyclic graph. Each travel match corresponds to a path in G.

More specifically, if \( (a_{1} ,b_{1} ) \to (a_{2} ,b_{2} ) \ldots \to (a_{k} ,b_{k} ) \) is a path in G, \( S_{1} [a_{1} , a_{2} , \ldots a_{k} ] \) and \( S_{2} [b_{1} , b_{2} , \ldots b_{k} ] \) form a travel match, and vice versa. For instance, the path \( A_{11} \to B_{22} \to C_{34} \) in Figure 10b corresponds to the travel match (1, 1) \( \to (2, \, 2) \to (3, \, 4) \) in Fig. 8.

Definition 12

(Maximal Length Path) A path P in G is a maximal length path if the first node of P has zero in-degree and the last node has zero out-degree.

For example, in Fig. 10b, the path \( A_{11} \to B_{22} \to C_{34} \) is a maximal path, which corresponds to a maximal travel match. However, the maximal path \( A_{11} \to C_{34} \) does not correspond to a maximal match.

To address this issue, we refine G into a new graph \( G^{'} \) by removing the redundant edges. Let \( Reach_{G} (u) \) be the set of nodes reachable from u in G. We define an edge \( u \to v \) as redundant in \( G \) if \( Reach_{G} (u) = Reach_{{G/\{ u \to v\} }} (u) \), where \( G/\{ u \to v\} = (V, E\backslash \{ u \to v\} ) \). In other words, an edge \( u \to v \) in G is redundant if there is an alternative path from u to v. For example, \( A_{11} \to C_{34} \) in Fig. 10b is redundant.

Lemma 2

The following two statements are equivalent:

  1.

    \( S_{1} [a_{1} , a_{2} , \ldots a_{k} ] \) and \( S_{2} [b_{1} , b_{2} , \ldots b_{k} ] \) form a maximal travel match M between S 1 and S 2.

  2.

    P = \( (a_{1} ,b_{1} ) \to (a_{2} ,b_{2} ) \to \ldots \to (a_{k} ,b_{k} ) \) is a maximal path in G′, i.e., \( (a_{1} ,b_{1} ) \) has zero in-degree and \( (a_{k} ,b_{k} ) \) has zero out-degree in \( G^{'} \).

Proof (1) ⇒ (2). According to Definition 9, a maximal travel match cannot be extended on either the left or the right side. So, the path in \( G^{'} \) corresponding to the maximal travel match has no predecessor or successor nodes. That is, (a 1, b 1) has zero in-degree and \( (a_{k} ,b_{k} ) \) has zero out-degree in G′. In short, P is a maximal path in G′.

Proof (2) ⇒ (1). First, as \( (a_{1} ,b_{1} ) \) has zero in-degree in \( G^{'} \), condition 1 of Definition 9 holds. Second, \( (a_{k} ,b_{k} ) \) has zero out-degree; therefore, condition 2 of Definition 9 also holds. Third, since there are no redundant edges in the refined graph \( G^{'} \), condition 3 holds. That is, given any two consecutive nodes in P, we cannot find another path passing through these two nodes except for P.

Lemma 2 enables us to find maximal matches by exploring maximal paths in \( G^{'} \). However, obtaining \( G^{'} \) from G is quite time consuming. Consequently, in the implementation, we directly construct \( G^{'} \) by using Algorithm 3 instead of first building G and then removing redundant edges from G. In Line 2, (i, j) is before \( (i', j') \) in the decreasing lexicographical order if \( i > i' \), or \( i = i^{'} \wedge j > j' \).

Lemma 3

Algorithm 3 outputs the correct graph \( G^{'} \) for two semantic location sequences S 1 and S 2.

Proof: Let \( (v_{1} = (i_{1} , j_{1} ), \ldots , v_{k} = (i_{k} , j_{k} )) \) be the sequence in Line 2 of Algorithm 3. Let G be the graph that contains all precedence relationships. Clearly, the edge set of \( G^{'} \) outputted by Algorithm 3 is a subset of that of G.

First, we show that each non-redundant edge of G is in \( G^{'} \). Let \( e = v_{l} \to v_{t} \) be a non-redundant edge in G. If e is not an edge in \( G^{'} \), it must be the case that \( v_{t} \) is black (see Line 7 of Algorithm 3). However, \( v_{t} \) is only marked black when there is another node \( v_{t} ' \) so that \( v_{t} \) is reachable from \( v_{t} ' \), and \( v_{l} \to v_{t} ' \) is an edge in \( G^{'} \). Then, there exists another path from \( v_{l} \) to \( v_{t} \) in \( G^{'} \) as well as in G. This contradicts the assumption that e is non-redundant.

Second, we show that each edge of \( G^{'} \) is non-redundant in \( G \). Let \( e = v_{l} \to v_{t} \) be an edge in \( G^{'} \), and assume on the contrary e is redundant in G. Since removing redundant edges does not change reachability, there must exist another path P of non-redundant edges from \( v_{l} \) to \( v_{t} \) in G. According to the first argument in the previous paragraph, \( G^{'} \) also contains this path. Let P be \( (v_{l} \to v_{{l_{1} }} \to , \ldots , \to v_{{l_{k} }} \to v_{t} ) \). For each node in P, let us consider the main loop processing node \( v_{l} \). When building edge \( v_{l} \to v_{{l_{1} }} \) in Line 8 of Algorithm 3, all other edges in P have been built. Therefore, \( v_{t} \) is already marked black since it is reachable from \( v_{{l_{1} }} \). As a result, the algorithm will not build \( v_{l} \to v_{t} \). It is a contradiction to the assumption. So, each edge of \( G^{'} \) is non-redundant in G.

4.2.2 Finding maximal paths

This subsection describes how to output all maximal travel matches given a refined graph \( G^{'} \), i.e., Line 2 in Algorithm 2. Here, we refer to a path from a node u to a zero out-degree node in \( G^{'} \) as a partial maximal path from u. As shown in Fig. 12, Algorithm 4 generates all partial maximal paths from a given node u in \( G^{'} \). Let \( No_{{G^{'} }} (u) \) be the set of nodes to which u has outgoing edges. For two sets of paths \( \mathcal{P}_{1} \) and \( \mathcal{P}_{2} \) in \( G^{'} \), if the last node of any path \( P_{1} \in \mathcal{P}_{1} \) is the same as the first node of any path \( P_{2} \in \mathcal{P}_{2} \), \( \mathcal{P}_{1} \otimes \mathcal{P}_{2} = \{ P_{1} + P_{2} |P_{1} \in \mathcal{P}_{1} ,P_{2} \in \mathcal{P}_{2} \} \), where P 1 + P 2 is the path concatenating P 1 and P 2.

Fig. 12 Output partial maximal paths

For completeness, we state the complete algorithm shown in Algorithm 2. Let \( ZeroIn(G') \) be the set of nodes with zero in-degree in \( G' \). Each path in the set returned by Algorithm 2 corresponds to a maximal match. For example, the maximal matches between S 1 and S 2 in Fig. 8 are \( A \to B \to C \) and \( A \to B \to D \to E \to F \).
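A compact Python sketch of this enumeration step follows (corresponding in spirit to Algorithms 2 and 4, whose exact pseudocode appears only in the figures); it memoizes the partial maximal paths from each node and then starts one enumeration from every zero in-degree node. Applied to the refined graph produced by the previous sketch for the sequences of Fig. 8, it should return the two matches A → B → C and A → B → D → E → F.

```python
def maximal_travel_matches(nodes, edges):
    """Enumerate all maximal paths of G', i.e., all maximal travel matches (Lemma 2)."""
    has_incoming = {v for succs in edges.values() for v in succs}
    memo = {}

    def partial_paths(u):
        """All partial maximal paths from u (paths ending at a zero out-degree node)."""
        if u not in memo:
            if not edges[u]:                     # zero out-degree: the path ends here
                memo[u] = [[u]]
            else:
                memo[u] = [[u] + p for v in edges[u] for p in partial_paths(v)]
        return memo[u]

    # a maximal path must start at a node with zero in-degree
    return [p for u in nodes if u not in has_incoming for p in partial_paths(u)]
```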

4.2.3 Calculating similarity using maximal matches

We consider three factors when measuring the similarity between two users in terms of their location histories.

  1.

    Sequential property: Users sharing longer semantic location sequences would be more similar. We detect the common sequences between two users’ location histories by using the maximal travel match algorithm and give a higher weight to a longer match.

  2.

    The granularity of a semantic location: Users sharing semantic locations with a finer granularity could be more similar. We give a relatively higher weight to the maximal travel matches detected at a lower layer of the hierarchy \( \mathcal{F} \).

  3.

    The popularity of a semantic location: Users sharing semantic locations that are less frequently visited by people would be more similar. Here, we propose inverted user frequency to give an unpopular location a high importance score: \( iuf(c) = \log \frac{N}{n} \), where N is the total number of users in the dataset and n is the number of users visiting the semantic location c.

    Combining the three factors, we calculate an overall similarity score for each pair of users according to Eq. 4.

    $$ SimUser(H_{1} , H_{2} ) = \mathop \sum \limits_{l = 1}^{L} f_{w} (l) \times SimSq(S_{1}^{l} , S_{2}^{l} ); $$
    (4)
    $$ SimSq(S_{1} , S_{2} ) = \frac{{\mathop \sum \nolimits_{j = 1}^{m} sg(t_{j} )}}{{|S_{1} | \times |S_{2} |}}, $$
    (5)
    $$ sg(s) = g_{w} (k) \times \mathop \sum \limits_{i = 1}^{k} iuf(c_{i} ). $$
    (6)

    Given two users’ location histories H 1 and H 2, we compute the similarity between them by summing the weighted similarity of the semantic location sequences detected at each layer of the hierarchy \( \mathcal{F} \). We use a function \( f_{w} (l) \) to assign a higher weight to the similarity of sequences occurring at a lower layer, e.g., \( f_{w} (l) = 2^{l - 1} \). Then, the similarity between two semantic location sequences S 1 and S 2 at a layer, \( SimSq(S_{1} , S_{2} ) \), is represented by the sum of the similarity scores, \( sg(t_{j} ) \), of the maximal matches between \( S_{1} \) and \( S_{2} \). Here, m is the total number of maximal matches. Meanwhile, \( SimSq(S_{1} , S_{2} ) \) is normalized by the product of the lengths of the two sequences, since a longer sequence has a higher probability of containing long matches. That is, without the normalization, a user with a longer history would be more likely to be deemed similar to others than a user with a short history. Later, we calculate the similarity of a maximal travel match \( t \), \( sg(t) \), by summing up the iuf of each semantic location c in t and weighting \( sg(t) \) by the length \( k \) of t, e.g., \( g_{w} (k) = 2^{k - 1}. \)
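The following short Python sketch assembles Eqs. 4–6 into a single scoring function, assuming the maximal travel matches per layer have already been computed and that a per-location visitor count is available for the iuf term; the argument layout is illustrative.

```python
import math

def sim_user(H1, H2, matches_per_layer, num_users, visitor_counts):
    """Overall similarity of two users (Eqs. 4-6).

    H1, H2: {layer: semantic location sequence} for the two users.
    matches_per_layer: {layer: list of maximal travel matches}, each match
                       given as the list of semantic location ids it shares.
    num_users: total number of users N.
    visitor_counts: {semantic location id: number of users who visited it}.
    """
    def iuf(c):                                   # inverse user frequency
        return math.log(num_users / visitor_counts[c])

    def sg(match):                                # Eq. 6 with g_w(k) = 2^(k-1)
        return 2 ** (len(match) - 1) * sum(iuf(c) for c in match)

    score = 0.0
    for l in H1:                                  # Eq. 4 with f_w(l) = 2^(l-1)
        sim_sq = sum(sg(m) for m in matches_per_layer.get(l, []))
        sim_sq /= len(H1[l]) * len(H2[l])         # Eq. 5: normalize by sequence lengths
        score += 2 ** (l - 1) * sim_sq
    return score
```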

5 Finding similar users

With the user similarity calculated above, we can hierarchically cluster users into groups in a divisive manner by using clustering algorithms such as K-means. As a result, as depicted in Fig. 13, we can build a user cluster hierarchy, where a cluster denotes a group of users sharing some similar interests and different layers represent different levels of similarity (Hung et al. 2009). The clusters on a higher layer stand for large communities in which people share some high-level interests, such as sports. The clusters on the lower layers denote people sharing finer interests, like hiking (the layer of the hierarchy can be chosen based on the needs of an application). Meanwhile, we can find one representative user (the center) of each cluster according to the similarity scores between each pair of users.

Fig. 13 Finding similar users and inserting new users into hierarchical user clusters

This user hierarchy brings two advantages.

  1.

    Fast retrieval of similar users: Instead of checking all the users, we can retrieve the top k similar users for a person by ranking only the users in the same cluster as the person (in terms of the similarity scores). This retrieval process starts from the bottom layer of the hierarchy, as depicted by the blue dashed arrow. If the bottom-layer cluster the person pertains to contains more than k users, we can directly rank the users in that cluster. If the number of users is less than k, we check the parent node (cluster) of this cluster, continuing until we find a cluster with more than k users.

  2.

    Inserting new users: When a new user u′ comes into the system, it is not necessary to compute the similarity score between u′ and every user in the system; that process is very time consuming and its cost keeps increasing with the number of users. Instead, we only need to insert this user into the most appropriate cluster on each layer of the hierarchy by computing the similarity between u′ and the representative user of a cluster. For example, as demonstrated by the red solid arrows in Fig. 13, we first compute the similarity between u′ and (u 1, u 2,…,u k ). If u 2 is the most similar to u′ among the k users, we insert u′ into u 2’s cluster C 2. Then, we further check the child clusters of C 2 and insert u′ into the cluster whose representative user is the most similar to u′. This process is performed iteratively until reaching the bottom layer of the hierarchy.

In practice, we do not need to re-build this hierarchy unless the number of newly inserted users exceeds a certain threshold. That is, in most cases we can find similar users for a person very efficiently.
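The two operations above can be sketched in a few lines of Python; the cluster-node fields ('members', 'representative', 'children', 'parent') and the similarity callback are hypothetical names standing in for the hierarchy of Fig. 13 and the SimUser score of Eq. 4.

```python
def insert_user(new_user, root, similarity):
    """Insert a new user into the cluster hierarchy layer by layer (Sect. 5).

    root: a cluster node, a dict with 'members', 'representative' and 'children'.
    similarity(u, v): SimUser score between two users.
    """
    node = root
    while True:
        node["members"].append(new_user)
        if not node["children"]:
            return node                           # reached the bottom layer
        # descend into the child whose representative is most similar to the new user
        node = max(node["children"],
                   key=lambda c: similarity(new_user, c["representative"]))

def top_k_similar(user, leaf, k, similarity):
    """Retrieve the top-k similar users, walking up from the user's bottom cluster."""
    node = leaf
    while len(node["members"]) <= k and node.get("parent") is not None:
        node = node["parent"]                     # too few candidates: go one layer up
    candidates = [u for u in node["members"] if u is not user]
    return sorted(candidates, key=lambda v: similarity(user, v), reverse=True)[:k]
```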

6 Evaluation

In this section, we evaluate our method based on a real-world GPS trajectory dataset collected by 109 users in a period of over 1 year.

6.1 Settings

6.1.1 GPS devices, users, and trajectories

As shown in Fig. 14, our GPS devices include Magellan Explorist 210/300, G-Rays 2, QSTARZ, and GPS-enabled phones. These devices were configured to record a GPS reading every 5 s and were distributed to 109 users with diverse backgrounds, including college students, housewives, and employees of different companies and organizations. We collected the GPS logs of these volunteers over a period of 1 year.

Fig. 14 GPS-enabled devices used for the user study

As a result, the collected GPS trajectories cover 36 cities and include a total of 8,027,911 GPS points, from which we detected 18,074 stay points. We used the 5,366 stay points detected on weekends as our data set, to make sure that these stay points reflect users’ interests in leisure time instead of routine commutes between home and office.

6.1.2 Parameter selections

The default values of parameters in the location history construction are as follows. (1) Stay point detection: we test a set of thresholds and find that \( \theta_{d} \) = 200 m and \( \theta_{t} \) = 30 min are more appropriate than other settings (refer to (Li et al. 2008; Zheng et al. 2011d) for more justification). These values allow us to detect meaningful places that users visited and to filter out places users merely passed by. Also, these two parameters are relatively robust to traffic jams. (2) Stay point extension: we test the performance of our method as γ changes and set γ = 200 m after the study. (3) Feature vector construction: our POI database contains 6,828,951 POIs of 13 categories, including restaurants, markets, natural parks, and museums. (4) Feature vector clustering: we use k-means to hierarchically cluster feature vectors into three layers in a divisive manner. As a result, the second layer contains 15 clusters and the third layer contains 48 clusters.

The default values of parameters in the user similarity exploration are as follows. We test a set of ρ values and set ρ = 0.2 to find maximal matches (refer to Fig. 24 for details). We use \( g_{w} (k) = 2^{k - 1} \) to give a larger weight to longer matches. As shown in Fig. 15, we observe that the occurrence of k-length travel matches drops exponentially as k increases. Thus, the significance of an occurrence of a k-length travel match increases exponentially with k. Similarly, we set \( f_{w} (l) = 2^{l - 1} \) to give a larger weight to lower layers, because the number of maximal matches at the second layer (132,056) is much larger than that at the third layer (48,252).

Fig. 15 The distribution of the maximal travel matches

6.2 The evaluation approach

6.2.1 Ground truth

To obtain the ground truth of a user’s interests, we conduct a questionnaire-style user study, in which each user answers the questions we propose by giving a rating (1–4), as shown in Fig. 16a. Then, we regard a user’s answers, e.g., Fig. 16b, as an interest vector, in which each entry is the user’s rating for the corresponding question. Later, we calculate the cosine similarity between two users’ interest vectors. A user’s true ranking list of the ten most similar users consists of the users having the closest interest vectors to hers.

Fig. 16 A questionnaire (a) and an example of answers (b)

6.2.2 Evaluation criteria

MAP and NDCG are employed to evaluate the performance of our approach. MAP is the most frequently used summary measure of a ranked retrieval run. In our experiment, it stands for the mean of the precision scores after each relevant user is retrieved. In the search results, a user is deemed relevant if his/her relevance level is ≥3. For instance, the MAP of a relevance vector \( G = {<}4, 0, 2, 3, 3, 1, 0, 2, 1, 1{>} \) is computed as follows:

$$ MAP = \frac{1 + 2/4 + 3/5}{3} = 0.7 $$

NDCG is used to compute the relative-to-the-ideal performance of information retrieval techniques (Jarvelin and Kekalainen 2002). The discounted cumulative gain of G is computed as follows: (In our experiments, b = 2.)

$$ DCG[i] = \begin{cases} G[1], & \text{if } i = 1 \\ DCG[i - 1] + G[i], & \text{if } i < b \\ DCG[i - 1] + \frac{G[i]}{\log_{b} i}, & \text{if } i \ge b \end{cases} $$
(7)

Given the ideal discounted cumulative gain DCG′, the NDCG at the ith position can be computed as \( NDCG\left[ i \right] = DCG\left[ i \right]/DCG^{'} [i]. \)
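For reference, a small Python sketch of both measures as they are used here (relevance level ≥ 3 counts as relevant, log base b = 2); the worked MAP example from the text is reproduced at the end.

```python
import math

def mean_average_precision(relevance, threshold=3):
    """MAP of one ranked list: mean precision at each relevant position."""
    precisions, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel >= threshold:                      # relevant if level >= 3
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def ndcg(relevance, ideal, b=2):
    """NDCG following Eq. 7 with log base b."""
    def dcg(g):
        total = 0.0
        for i, gain in enumerate(g, start=1):
            total += gain if i == 1 or i < b else gain / math.log(i, b)
        return total
    return dcg(relevance) / dcg(ideal)

G = [4, 0, 2, 3, 3, 1, 0, 2, 1, 1]                # the relevance vector from the text
print(mean_average_precision(G))                  # (1 + 2/4 + 3/5) / 3 = 0.7
print(ndcg(G, sorted(G, reverse=True)))           # NDCG at the last position
```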

6.2.3 Evaluation framework

The experiment aims to validate the four contributions we claimed: (1) We propose to use the semantic location history instead of physical positions to represent users’ interests. (2) We build a semantic location history with multiple granularities for each user so as to measure two users’ similarity more precisely. (3) We propose the maximal travel match to compare the semantic location sequences of two users; the maximal match performs better than existing sequence matching approaches in this application scenario. (4) We propose iuf to give high weights to unpopular semantic locations. We evaluate the effectiveness of the four points as follows.

  1.

    We compare the effectiveness of the semantic location history with that of HGSM (Li et al. 2008), a physical-location-based approach we proposed previously. In addition, to evaluate the effectiveness of the feature vector used to represent the semantic meaning of a stay region, we also compared our method with two basic semantic-based approaches. One basic approach, called NearestType, assigns the category of the nearest POI to a stay point. The other approach, LargestType, assigns a stay region the category having the largest number of POIs within the stay region. Since these two approaches cannot produce hierarchical location histories, we only used the second layer of our SLH in the comparison. The rest of the settings of the two baselines are the same as those of SLH-MTM.

  2.

    To validate the effectiveness of the multiple granularities of semantic locations, we compared the performance of our approach using multiple layers of the hierarchy \( \mathcal{F} \) with that only using the second layer or only using the third layer.

  3.

    To evaluate the effectiveness of MTM, we first compared MTM with three approaches that do not consider the sequential property: Count, Cosine, and Pearson. Suppose N semantic locations \( \left\{ {c_{i} , 1 \le i \le N} \right\} \) are generated on a certain layer of the hierarchy \( \mathcal{F} \). If User1 has k i stay points in c i and User2 has l i stay points, the location histories of User1 and User2 can be represented as follows.

    $$ u_{1} = (k_{1} , k_{2} , \ldots k_{i} , \ldots , k_{N} ),\,\quad u_{2} = (l_{1} , l_{2} , \ldots , l_{i} , \ldots , l_{N} ). $$

    The similarity of two users by count is computed as Eq. (8):

    $$ sim_{count} (u_{1} ,u_{2} ) = \mathop \sum \limits_{i = 1}^{N} \min (k_{i} ,l_{i} ) $$
    (8)

    For instance, the similarity between the two users’ location histories shown in Fig. 8 is 6 (A, B, C, D, E, F). When conducting the Cosine and Pearson methods, we compute the similarity between two users’ location histories according to Eqs. (9) and (10) respectively:

    $$ sim_{\cos ine} (u_{1} ,u_{2} ) = \frac{{\mathop \sum \nolimits_{i} k_{i} l_{i} }}{{\sqrt {\mathop \sum \nolimits_{i} l_{i}^{2} } \sqrt {\mathop \sum \nolimits_{i} k_{i}^{2} } }} $$
    (9)
    $$ sim_{pearson} (u_{1} ,u_{2} ) = \frac{{\mathop \sum \nolimits_{i} (k_{i} - \bar{u}_{1} )(l_{i} - \bar{u}_{2} )}}{{\sqrt {\mathop \sum \nolimits_{i} (k_{i} - \bar{u}_{1} )^{2} \mathop \sum \nolimits_{i} (l_{i} - \bar{u}_{2} )^{2} } }} $$
    (10)

    When performing these three baselines, we substitute SimSq in Eq. 4 with the corresponding similarity shown in Eqs. 8, 9, and 10, and keep the remaining settings the same as our SLH-MTM (a short sketch of these baseline measures is given after this list). We also compared MTM with four existing sequence matching approaches that consider the visiting order of locations but do not take into account the travel time constraints between locations. These approaches include τ-containment for trajectory pattern mining (Giannotti et al. 2007) and three well-known sequence matching approaches, namely ED, LCSS (Vlachos et al. 2002), and DTW (Yi et al. 1998).

  4.

    To evaluate the effectiveness of iuf, we compared iuf with the baseline using the same weights for all semantic locations, denoted as Same Weight.

  5.

    In addition to the four aspects, we examined the effects of the scale of stay regions (γ) and the difference ratio on travel time (ρ) on the performance of our approach.
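As referenced in item 3 above, a minimal Python sketch of the three non-sequential baselines (Eqs. 8–10) on the per-location visit-count vectors is given below; the example vectors are illustrative.

```python
import numpy as np

def sim_count(u1, u2):
    """Eq. 8: sum of the minimum visit counts per semantic location."""
    return np.minimum(u1, u2).sum()

def sim_cosine(u1, u2):
    """Eq. 9: cosine similarity of the two count vectors."""
    return (u1 @ u2) / (np.linalg.norm(u1) * np.linalg.norm(u2))

def sim_pearson(u1, u2):
    """Eq. 10: Pearson correlation of the two count vectors."""
    d1, d2 = u1 - u1.mean(), u2 - u2.mean()
    return (d1 @ d2) / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())

# Both sequences of Fig. 8 visit each of A..F once, so the Count similarity is 6.
u1 = np.array([1, 1, 1, 1, 1, 1], dtype=float)
u2 = np.array([1, 1, 1, 1, 1, 1], dtype=float)
print(sim_count(u1, u2))                          # -> 6.0
```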

6.3 Results

6.3.1 Effectiveness of SLH

Figure 17 shows that our approach achieves higher nDCG scores than HGSM. This justifies the advantage of semantic locations over geographic positions. The reason is that HGSM is relatively weak in detecting the similarity between users who have no overlap in the geographic space. Consider two users visiting “museum 1 → shop 1” and “museum 2 → shop 2”, respectively. Our approach regards them as similar since they share the semantic sequence “museum → shop”, while HGSM cannot handle this case.

Fig. 17 The semantic locations versus physical locations

Figure 18 shows that our approach obtains higher nDCG than the two basic semantic-based approaches, NearestType and LargestType. This suggests that our approach, which uses a feature vector to represent a stay region, is more capable of modeling a user’s interests than these two baselines. The reason is that our approach is aware of the GPS positioning errors and captures the uncertainty of location categories with feature vectors.

Fig. 18 SLH versus two baseline semantic-based methods

6.3.2 Effectiveness of multiple granularities

As shown in Fig. 19, the finer granularity (using the third layer) achieves higher nDCG than the coarse granularity (using the second layer, i.e., clustering the feature vectors only once). Moreover, our approach achieves the highest nDCG when using multiple granularities and assigning larger weights to the finer granularity. This result indicates that a finer granularity of semantic locations improves SLH’s capability in discriminating similar users (optimizing for precision), while a coarser one enhances SLH’s ability to find coarsely similar users (optimizing for recall). By combining the capabilities of multiple granularities, SLH is able to distinguish users more precisely and detect users that are similar to different degrees. For example, when combining multiple granularities, a coarse granularity discovers the group of people who like outdoor sports, and a fine granularity further distinguishes, within this group, the users who like hiking.

Fig. 19 The effectiveness of the semantic granularity

6.3.3 Effectiveness of MTM

We first validate the effectiveness of the sequential property. As shown in Fig. 20, MTM achieves higher nDCG than the three approaches that only consider the individual semantic locations shared by two users. This means a sequence of semantic locations is more capable of representing a user’s interests than locations considered individually. Actually, common semantic locations are simply 1-length travel matches; we can obtain the same user similarity by only using 1-length travel matches in our method. Naturally, we are able to further differentiate users by considering longer matches. For example, a couple visiting three semantic locations together in the sequence “museum → restaurant → shopping mall” are more similar than users visiting these locations separately, or in a different order such as “shopping mall → museum → restaurant”.

Fig. 20 MTM versus methods regardless of sequential properties

Then, we compare MTM with other sequence matching approaches. Figure 21 shows that MTM outperforms three widely used sequence matching approaches, ED, LCSS, and DTW, in this application scenario. By taking into account the travel time between two locations, MTM is more capable of modeling the sequential property of a user’s outdoor movement, and the behavior and intention behind the movement. For instance, suppose two users share the sequence “restaurant → shopping mall” in their location histories. One user went to the two places in different trips on different days. The other visited the two places in one trip (on the same day). MTM regards the two sequences as different, while the three sequence matching approaches mistakenly consider them a common sequence and use it to measure user similarity. Intuitively, different time intervals denote different extents of sequentiality.

Fig. 21 MTM versus other sequence matching algorithms

Figure 21 also shows that MTM outperforms the traditional trajectory pattern mining approach τ-containment, which aims to mine sequential patterns with travel time constraints from given trajectories. This method enables us to detect some frequent sequential patterns from a user’s location history, while ignoring common sequences (shared by two users) that occur infrequently but carry important information about a user’s interests. For instance, two users share the sequence “lake → hiking park”, which is not a frequent pattern in these two users’ location histories, as such events are relatively rare in people’s daily life (compared to shopping and dining). Though not frequent, such sequences are still important in reflecting a person’s interests. Sometimes, these infrequent sequences are even more valuable in differentiating people than frequent patterns. Besides, τ-containment is much less efficient than MTM: τ-containment took 292 s to mine patterns from our data set, whereas MTM ran in only 19 s.

6.3.4 Effectiveness of iuf

As shown in Fig. 22, according to nDCG, our approach using iuf outperforms the method using the same weight for different semantic locations. This indicates that the scheme assigning larger weights to less popular semantic locations is more capable of measuring the similarity between users. Intuitively, the semantic location of restaurant could appear in all users’ location histories. Obviously, it is hard to say all of them are similar given this observation. However, the semantic locations, like hiking park and lake, which do not frequently occur in a user’s location history, can reflect the user’s real interests and are much more distinguishing than a common location like restaurant. Without iuf, the similarity between users will be dominated by those common semantic locations, and cannot reveal the true correlation between users.

Fig. 22 iuf versus the method using the same weight for locations

6.3.5 Effects of γ and ρ

Figure 23 shows that when the scale of stay regions is 200 m, the nDCG is the highest in our data set. It is related to the positioning errors of our devices. When the region is too small, it may exclude the real places that users visited. When the region is too large, it may include many noisy POIs.

Fig. 23 Effect of scale of stay regions γ

Figure 24 shows that we achieve the highest nDCG when ρ is around 0.2–0.4. When ρ drops below 0.2, the constraint becomes too strict to find long matches. When ρ becomes too loose, some matches are meaningless, which causes the same problem as with ED, LCSS, and DTW.

Fig. 24 Effect of difference ratio ρ on travel time

7 Related work

7.1 Mining human location history

A branch of research has been performed on individual location histories recorded in GPS trajectories. These studies include detecting significant locations of a user (Ashbrook and Starner 2003; Hariharn and Toyama 2004; Liu et al. 2006; Cao et al. 2010), predicting the user’s movement among these locations (Ye et al. 2009), and recognizing user-specific activities at each location (Patterson et al. 2003). As opposed to these works, we aim to model multiple users’ location histories and learn patterns from numerous individuals’ behaviors.

Giannotti et al. (2007) mined similar sequences from users’ trajectories, and MSMLS (Krumm and Horvitz 2007) used a history of a driver’s destinations, along with driving behavior extracted from multiple users’ GPS traces, to predict where a driver may be going as a trip progresses. Zheng et al. (2008a, b, 2010b) classified the transportation modes of a GPS trajectory into driving, walking, taking a bus, and riding a bike. Instead of exploring individual users’ behaviors, in this paper we aim to understand the correlations between different users’ activities and to mine the similarity between users.

7.2 User similarity

7.2.1 In cyber systems

User similarity has been studied in recommender systems and online social networks, where the goal is to discover content of interest to a user based on her similar users on the Web. One example is Amazon’s book recommendation system (Linden et al. 2003), which recommends to a user books that she would like to read but has not yet discovered. Other examples are LinkedIn and Facebook, where user similarity can help a person find users sharing similar interests and backgrounds. Two widely used techniques are collaborative filtering (CF) (Breese et al. 1998) and nearest neighbor (NN) search (Sarwar et al. 2000). CF assumes that people who agreed in the past tend to agree in the future. NN regards users with similar interests as nearest neighbors with highly correlated preference data.

Our work is different from these techniques in two aspects. First, we study user behavior in the physical world instead of online behavior. The learned user similarity can bridge the gap between the virtual world and physical world. Second, we consider some properties of human location histories, such as the sequence and the granularity of semantic locations, when measuring the user similarity.

7.2.2 In the physical world

Eagle et al. (2006) aimed to recognize the social patterns in users’ daily activities from a dataset collected with Bluetooth-enabled mobile phones, and to extract social structures from mobile phone data provided by wireless communication operators (Eagle et al. 2009a, b). Cranshaw et al. (2010) examined the location traces of 489 users of a location-sharing social network to study the relationships between users’ mobility patterns and the structural properties of their underlying social network. Hung et al. (2009) targeted the problem of discovering communities among users, where users in the same community have similar trajectory patterns. Our work differs from the above-mentioned work in two aspects. First, we measure user similarity based on the semantic meanings of locations instead of their physical positions. Second, we consider the sequential property between locations, as well as the hierarchy and popularity of semantic locations.

Li et al. (2008) proposed HGSM, which mines the similarity between users in terms of their physical location histories. When calculating the similarity, HGSM also considers a user’s travel behaviors and the properties of geographical spaces. Further, Zheng et al. (2011d) incorporated this user similarity into a user-centric CF model to conduct personalized friend and location recommendation. Though HGSM is closely related to the work reported in this paper, the major difference lies in the semantic locations we infer from a given physical stay region; as discussed in this paper, modeling the semantic meaning of a physical location is nontrivial. Moreover, in terms of the evaluation, the proposed MTM is more effective and efficient than the sequence matching algorithm used in HGSM. Finally, this article is an extension of the poster paper (Xiao et al. 2010), presenting more details of the methodology and experiments.

7.3 Location recommendation

Zheng et al. (2009c) recommended to a user the top interesting locations and travel sequences mined from a large number of user-generated GPS trajectories; experienced users in a given region are also recommended. Zheng et al. (2010a) used a collaborative learning approach to enable activity-location recommendation based on GPS traces associated with user-generated comments: given an activity, the best k locations are recommended, and given a location, the best k activities are recommended. Chen et al. (2010) recommended trajectories to a user according to a set of user-specified point locations. As these are generic recommendations, they do not consider the similarity between users.

Froehlich et al. (2006) proposed to use a single user’s personal location history as votes for recommending locations, and CityVoyager (Takeuchi and Sugimoto 2006) recommended shops to users based on their past location histories. Zheng et al. (2009b) first learned the correlation between locations from multiple users’ location histories; in turn, this location correlation was employed in an item-based CF model to build a personalized location recommender (Zheng et al. 2011c). Extending (Zheng et al. 2010a), a user-centric location-activity recommender was further developed (Zheng et al. 2010d). Although these recommenders implicitly involve user similarity (based on location history), they still use traditional CF techniques without considering specific properties of location histories, e.g., the sequence of users’ movements and the hierarchy of geographic locations.

8 Conclusion and future work

In this paper, instead of using online social structures, we estimate the similarity between users in terms of their location histories in the physical world. This similarity can lead to social ties between users in a social networking service, hence bridging the gap between online social networks and the physical world. Rather than directly matching different users’ location histories in geographic space, we model a user’s GPS trajectories with a semantic location history (SLH). The SLH carries more semantic meaning about a user’s interests (beyond physical locations) and can identify similar users even without geospatial overlaps. We also believe that users sharing (1) finer-grained semantic locations, (2) a longer sequence of locations, and (3) less popular semantic locations are more similar to each other. We then compare different users’ SLHs by using the maximum travel match (MTM), which considers both the sequence information and the travel time between locations. The evaluation results based on real-world GPS data show that SLH has clear advantages over a physical-location-based approach and the basic semantic approaches. When the three factors mentioned above are incorporated simultaneously, our approach achieves the best performance. Additionally, MTM is more effective than several widely used sequence matching approaches, such as LCSS and dynamic time warping (DTW), in this application scenario.

In the future, we first aim to compare this user similarity with users’ relationships in online social networks. Second, we plan to further improve the efficiency of our approach and to employ this similarity in a personalized location recommendation system.