1 Introduction

With the proliferation of mobile devices equipped with positioning capabilities, recent years have witnessed the explosive growth of location-based services (LBS) on road networks. Typical examples include location-based store finders, road navigation, and location-aware advertisements. While offering great benefits and new business opportunities, the LBS also presents a new threat: the intrusion of location privacy [2, 3, 38, 43, 55]. Adversaries can exploit user location information for malicious purposes such as spamming, stalking, and inferring health conditions or alternative lifestyle habits.

Over the past years, many promising approaches have been proposed to preserve users' location privacy. Most of them focus on location k-anonymity or location l-diversity, which blur the exact location of a mobile user into a cloaked area (CR for short). To compute a CR, location k-anonymity extends the area until at least k-1 other users are included, whereas location l-diversity extends it until l-1 other locations are included. In this sense, an exact location is mixed with other users (k-anonymity) or other locations (l-diversity), which makes it difficult for an adversary to learn valuable information.

Both techniques guarantee some degree of privacy. Nevertheless, they have two serious drawbacks. First, the generated CR could leak semantic information, which potentially endangers an individual's privacy. More concretely, a CR might include only semantically homogeneous locations even when it is perturbed with other users and locations, so an adversary can infer semantic meaning from the CR. In other words, it is vulnerable to a semantic homogeneity attack. Second and more importantly, while providing privacy protection for a location-based query user, all previous solutions ignore the location privacy of the other users in the system. This may pose a serious threat to those users' privacy, as illustrated by the following example.

Example 1

Figure 1 shows an example of a query snapshot consisting of 8 mobile users {u1, u2,...,u8}. Consider a scenario where a patient u1 named Bob uses his GPS-enabled mobile phone on road e7 to find the nearest Italian restaurant. This simple query can be answered by an LBS hosted on a publicly available web server (e.g., Google Maps). However, the LBS is not trusted: an adversary, Alice, can collude with the LBS to acquire Bob's location and infer his health status. Thus, to prevent his location from leaking, instead of sending the query directly to the LBS, Bob uses an anonymizer, which is a trusted server. The location anonymizer perturbs his exact location according to his specified privacy requirement before forwarding the query to the LBS. The state-of-the-art approach used by the location anonymizer is based on segment l-diversity [7, 45], which cloaks Bob's walking road with other nearby roads. The set {e6, e7, e9} may be a CR under location k-anonymity (k = 5) and segment l-diversity (l = 3). Unfortunately, the roads e6, e7 and e9 all share the same semantics, namely hospital. Consequently, it is quite easy for an adversary to confidently derive that u1 is in the hospital. Hence, mobile users using {e6, e7, e9} for LBS query requests would likely be linked to hospitals and could be suspected of undergoing treatment. Alternatively, assume that an adversary knows the locations of all users except u3. By some technical means, he intercepts the generated CR {e6, e7, e9}. Given k = 5 and l = 3, he can definitely infer that u3 must be in this CR. This inference leaks u3's health status, which violates the privacy of u3.

Figure 1: A snapshot of mobile users over a road network

Several studies [10, 27, 53] have put significant effort into resisting semantic homogeneity attacks over road networks; however, they all have critical limitations, as discussed in [29]. In addition, to our knowledge, no existing study can effectively address the second drawback. In our early work [29], we studied the problem of location privacy preservation for mobile users on road networks and proposed the l-semantic diversity based EIRank algorithm. Unfortunately, in that work user identity is protected only by a pseudonym, which in some scenarios is inadequate for identity protection [15]. Instead, in this paper, we propose to leverage location k-anonymity to protect user identity. The differences between this extended paper and our previous work [29] are described in Related Work.

Besides, protecting the locations of the other users from leakage is very important. Utilizing location k-anonymity to protect the identity of query users requires acquiring the locations of other users. However, if the generated CRs carry a risk of location leakage, mobile users will be reluctant to report their locations. This, in turn, hinders the implementation of location k-anonymity and thereby undermines the effectiveness of identity protection. Moreover, due to an increasing awareness of privacy risks, users might refrain from accessing the LBS, which would hinder the proliferation of these services. Hence, it is paramount to provide location privacy for the other users as well.

To tackle these problems, we propose PrivSem, a novel framework that integrates segment l-semantic diversity, location k-anonymity, and 𝜖-differential privacy to protect users' location privacy over road networks. In this framework, the cloaking algorithm only has access to location data sanitized according to differential privacy (DP), rather than the original data. In practice, mobile users subscribe to a cellular service provider (CSP) that already has access to their locations (e.g., through cell tower triangulation). The CSP reaches an agreement with its subscribers on the terms and conditions of location disclosure, and can therefore release user locations to third parties in a noisy form. However, using DP introduces the following two difficult challenges.

First, the generated CRs are determined by noisy location data, which requires sophisticated strategies to ensure that these CRs remain effective. Specifically, determining a CR involves a serious dilemma: for location k-anonymity, the generated CR cannot be too small; yet extending it leads to more query results, and thus heavier system overhead in delivering those results. To create the sanitized location data released at the CSP, we resort to the Private Spatial Decomposition (PSD) approach [8]. Typically, a PSD is a sanitized spatial index, where each node reports a noisy count of the users rooted at that node. To guarantee that the generated CR has a high success rate, we devise an error analysis model that determines, with high probability, a spatial region that includes sufficient users. The other challenge is data sparsity in the spatial domain. In reality, the total number of roads can be very large compared to the total number of mobile users. Consequently, the majority of PSD nodes have very low to zero counts. This data sparsity poses a great challenge for privacy-preserving techniques, since the perturbation noise is more likely to dominate the released count when only a small set of mobile users is present. To remedy this deficiency, we propose an adaptive group mechanism to resist the influence of noise on the released counts.

In summary, we make the following contributions:

  • For the first time, we identify the privacy leakage problem of other users in the context of location-based queries, and we present a framework that achieves differentially private guarantees.

  • In [29], we provided a cloaking technique that balances privacy requirements with system overhead. In this paper, we develop a series of optimization strategies to further improve its performance.

  • We propose an error analysis model that quantifies the difference between the noisy count of users in a CR and its real count, and we introduce a search strategy that finds appropriate PSD regions to ensure a high success rate for the cloaking algorithm.

  • Through formal privacy analysis, we prove that our proposed solution provides the required privacy protection. We conduct an extensive experimental study over real-world datasets. Experimental results demonstrate that our proposed techniques are effective and efficient.

The remainder of this paper is organized as follows. Section 2 presents the necessary preliminaries. Section 3 introduces the proposed privacy framework, whereas Sections 4 and 5 detail the proposed solution. The privacy guarantees of the proposed solution are analyzed in Section 6, followed by the experimental evaluation in Section 7 and a survey of related work in Section 8. Finally, Section 9 concludes this paper.

2 Preliminaries

2.1 Road network

Definition 1

(Semantic Road Network) A road network is modeled as an undirected graph G = (V, E, ξ) with a node set V and an edge set E, such that (i) a node v ∈ V represents a road intersection or a location (e.g., a church); (ii) an edge e = (u, v) ∈ E, also referred to as a segment, connects two nodes u and v; and (iii) ξ is a semantic function, i.e., for each edge e ∈ E, ξ(e) is the sensitive semantic label of segment e.

Example 2

(Semantic Road Network) Figure 1a gives an example of a semantic road network, in which each edge is associated with a semantic ID. The semantic labels corresponding to the IDs are shown in Figure 1b. Nodes v4, v5 and v6 in Figure 1a are different buildings within the same hospital. Edges e6, e7 and e9 connecting these three nodes therefore have the same sensitive semantic label "hospital". Thus, the area represented by the triangle (v4, v5, v6) indicates the hospital.

2.2 Differential privacy

Differential Privacy (DP) has recently emerged as the state-of-the-art scheme for protecting individuals' privacy. It is a semantic model that provides strong protection against realistic adversaries with auxiliary background knowledge. Informally, DP guarantees that the output of an algorithm is relatively insensitive to the change of any one individual record, so an adversary can learn only limited information about any specified individual. One important notion in DP is that of neighboring datasets. We say that two datasets D and D′ are neighboring, denoted by D Θ D′, if |D| = |D′| and D and D′ differ in only one location record.

Definition 2

(𝜖-Differential Privacy) A randomized algorithm \(\mathcal {A}\) satisfies 𝜖-differential privacy if for any pair of neighboring datasets D and D′, and for any subset of outputs \(S \subseteq Range(\mathcal {A})\),

$$ Pr(\mathcal{A}(D)\in S)\leq e^{\epsilon}Pr(\mathcal{A}(D^{\prime})\in S) $$
(1)

The parameter 𝜖 is called the privacy budget and controls the level of privacy protection. A smaller 𝜖 imposes more restrictions on the influence of a single user location, and hence gives more privacy to the individual. To achieve differential privacy, one principal technique is the Laplace mechanism [11], which injects random noise following the Laplace distribution into the output. The amount of noise depends on the sensitivity of the query function f, formally defined as:

Definition 3

(Sensitivity) The sensitivity of a query function \(f: D\rightarrow {\mathbb {R}^{d}}\) is the maximum change caused by a single record. Formally, \({\Delta} f = \max_{D {\Theta} D^{\prime}}\parallel f(D) - f(D^{\prime}) \parallel_{1}\), where ∥⋅∥1 is the L1 norm.

Theorem 1

(Laplace Mechanism) Given a function \(f: D \rightarrow {\mathbb {R}^{d}}\), for any dataset D, a randomized algorithm \(\mathcal {A}_{f}\) that returns \(f(D)+ Lap(\frac {{\Delta} {f}}{\epsilon })^{d}\) satisfies 𝜖-differential privacy, where Lap(λ)d denotes a vector of d i.i.d. samples from the Laplace distribution Lap(λ).
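As a concrete illustration, the following minimal sketch (ours, in Python with NumPy; function name and parameter values are illustrative only) applies the Laplace mechanism to per-segment user counts, the kind of query used throughout this paper.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon (Theorem 1)."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale, size=np.shape(true_value))

# Releasing per-segment user counts: moving one user touches at most two
# segments, so the sensitivity is 2 (counts and epsilon are illustrative).
true_counts = np.array([12, 0, 3, 47])
noisy_counts = laplace_mechanism(true_counts, sensitivity=2, epsilon=0.5)
```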

Two composition properties are widely used to ensure the overall privacy, known as sequential and parallel compositions.

Theorem 2

(Sequential Composition [33]) Let \(\mathcal {A}_{1},\ldots ,\mathcal {A}_{m}\) be m algorithms, each providing 𝜖i-differential privacy. The sequence of algorithms \(\mathcal {A}_{i}(D)\) over the dataset D provides \(({\sum }_{i}{\epsilon _{i}})\)-differential privacy.

Theorem 3

(Parallel Composition [33]) Let \(\mathcal {A}_{1},\ldots ,\mathcal {A}_{m}\) be m algorithms, each providing 𝜖i-differential privacy. Then the sequence of \(\mathcal {A}_{i}(DS_{i})\) over disjoint subsets DSi of dataset D provides (maxi 𝜖i)-differential privacy.

In our approach, the noise injected may be from a sum of independent Laplace distributions instead of a single Laplace distribution. Thus, here we present two important results for sum of independent Laplace distributions.

Lemma 1

(Sum of Laplace Distributions [4]) Let \(Y = {\sum }_{i=1}^{n} \gamma _{i}\) be the sum of independent Laplace random variables γ1,...,γn with zero mean and scales bi, and let bM = max{bi}. Let \(v\geq \sqrt {{\sum }_{i=1}^{n} {b_{i}^{2}}}\) and \(0<\lambda <\frac {2\sqrt {2}v^{2}}{b_{M}}\). Then \(Pr(Y > \lambda )\leq \exp (-\frac {\lambda ^{2}}{8v^{2}})\).

Corollary 1

(Measure Concentration [4]) Let Y, v, {bi}i and bM be defined as in Lemma 1. Suppose 0 < δ ≤ 1 and \(v>\max \{\sqrt {{\sum }_{i=1}^{n} {b_{i}^{2}}},b_{M}\sqrt {2\ln \frac {2}{\delta }}\}\). Then \(Pr[|Y|>v\sqrt {8\ln \frac {2}{\delta }}]\leq \delta \).

2.3 Private Spatial decompositions (PSD)

Private Spatial Decompositions (PSD) were introduced to release spatial datasets in a DP-compliant manner [8]. Typically, a PSD is a sanitized spatial index transformed according to DP, where each node contains a noisy count of the data points rooted at that node. A variety of index types can serve as the basis for building a PSD, such as k-d trees, grids, or quad-trees.

Generally, the accuracy of a PSD is strongly affected by the type of spatial structure and its parameters. Existing PSDs can be divided into two categories: object-based PSDs and space-based partitioning PSDs. With an object-based PSD, the split position for a node depends on the placement of user locations. This category includes structures such as k-d trees and R-trees. To protect privacy, split decisions must also be made according to DP, which consumes a share of the privacy budget. Thus, the privacy budget must be split between building the index structure and reporting node counts. In theory, object-based PSDs are more balanced, but they are not very robust: even tiny changes in the PSD parameters can decrease the accuracy abruptly.

A space-based partitioning PSD splits nodes based only on the underlying structure (e.g., quad-trees, BSP-trees and grids), rather than on real user location data. The privacy budget is used entirely to report the user count in each node. Generally, all nodes at the same level have non-overlapping domains, which yields a constant and low sensitivity of 2 per level: changing a single location in the dataset affects at most two partitions in a level. The merit of space-based partitioning PSDs is that they are simple to construct, although they can become unbalanced.

The accuracy of a PSD also depends on the allocation of the privacy budget. The best allocation of budget 𝜖 across levels is geometric allocation [8], where higher levels receive less budget than leaf nodes. To ensure overall privacy, the sequential composition property is applied across nodes on the same root-to-leaf path, whereas the parallel composition property is applied to disjoint paths in the hierarchy. In this paper, we adapt the space-based partitioning PSD to address the specific requirements of the LBS over road networks.
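To make the budgeting concrete, the sketch below (our own illustration; the ratio 2^(1/3) is an assumption borrowed from the geometric scheme discussed in [8]) splits a total budget across the levels of a hierarchy so that the per-level shares on any root-to-leaf path sum to 𝜖, as sequential composition requires.

```python
def geometric_budget(epsilon, height, ratio=2 ** (1 / 3)):
    """Split the total budget across height+1 levels of a hierarchy so
    that deeper levels receive geometrically more budget; the shares on
    a root-to-leaf path sum to epsilon (Theorem 2)."""
    weights = [ratio ** i for i in range(height + 1)]  # level 0 is the root
    total = sum(weights)
    return [epsilon * w / total for w in weights]

# e.g. a 4-level tree PSD with total budget 1.0; leaves get the largest share
print(geometric_budget(1.0, height=4))
```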

3 PrivSem: an overview

In this section, we first present the system architecture and the workflow for privacy-preserving location-based queries, and then introduce the privacy model and assumptions. Finally, we discuss design challenges and associated performance metrics.

3.1 System architecture

PrivSem adopts the classic centralized architecture for providing anonymous information delivery in the LBS, as sketched in Figure 2. In this architecture, the location anonymizer (LA) is a trusted entity that sits between mobile users and LBS providers. It consists of three tiers. The first tier is the user profile model, which captures users' personalized location privacy requirements. The second tier comprises the location cloaking components specialized in the location anonymization service. The final tier is dedicated to filtering candidate results. The communication between mobile users and the LA takes place over an authenticated and encrypted connection.

More specifically, mobile users send their locations to a trusted cellular service provider (CSP) which periodically collects location updates and releases a PSD according to privacy budget 𝜖 mutually agreed upon with the users. This CSP is either located in the LA or a dedicated server. Then, when the LA receives a query request with the exact location information from a query user (Step 1), it queries the PSD to determine a CR according to the user’s privacy requirement, and relays it to the LBS provider (Step 2). Subsequently, the LBS provider computes the candidate results for received anonymous query, and forwards these produced results to the LA (Step 3). Finally, the LA extracts the exact answers from candidate results by properly filtering false hit information, and delivers them to the query user (Step 4).

Figure 2: Overview of PrivSem

3.2 Privacy model and assumptions

In PrivSem, our objective is two-fold: to protect both the locations and the identities of query users during the LBS, and to prevent the leakage of other users' location privacy. Therefore, two levels of privacy requirements must be considered: the personalized user privacy profile and the system privacy profile. The former allows a query user to specify a personalized privacy requirement, which is essential for providing anonymous location queries. The latter allows the system to control its privacy protection capability for the locations of other users.

Personalized user privacy profile

To protect mobile users' privacy, location k-anonymity and segment l-semantic diversity are provided so users can specify their personalized location privacy requirements. Location k-anonymity guarantees that it is difficult to identify a specific user among a set of users based on the CR. Segment l-semantic diversity ensures that it is difficult to link a user with a specific location semantic (such as a clinic or a church) with high certainty (≥ 1/l). For example, in Example 1, if Bob's CR is {e2, e7, e4}, then an adversary cannot pinpoint the exact semantic of Bob's walking road. In practice, a mobile user may alter his privacy preference (k and l values) as often as required. Thus, such privacy requirements should be customizable and provided on a per-query basis.

Moreover, our cloaking technique should not compromise the quality of service (QoS) of the LBS. To guarantee QoS, e.g., response time, a mobile user u should specify some customized requirements. In this framework, the maximum temporal tolerance σt allows a user to specify this critical QoS constraint: it ensures that the delay introduced by waiting to anonymize a query request stays within an acceptable time interval. In summary, the set of parameters (k, l, σt) constitutes u's service profile, as sketched below.
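A minimal sketch of such a profile as a data structure (the field names are ours; the paper only fixes the triple (k, l, σt)):

```python
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    k: int          # location k-anonymity: at least k users in the CR
    l: int          # segment l-semantic diversity: at least l distinct semantics
    sigma_t: float  # maximum tolerable anonymization delay (seconds assumed)

profile = ServiceProfile(k=5, l=3, sigma_t=2.0)  # e.g. Bob's request in Example 1
```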

System privacy profile

In addition, we argue that protecting the location privacy of other users is equally essential. Here, a single measure, 𝜖-differential privacy, is used to specify this type of privacy requirement. It guarantees that the outputs are approximately the same even if a single location record in the dataset is arbitrarily changed. In other words, the outputs are insensitive to the change of any single user location. This ensures that user privacy is protected, and thus mobile users are not discouraged from participating in the statistical analysis. The level of privacy protection is controlled by the parameter 𝜖: a lower 𝜖 indicates stronger privacy protection, but also noisier results.

Threat Model

Generally, in PrivSem, we assume the background knowledge of an adversary is as follows: (i) the cloaking algorithm used by the LA, (ii) all the CRs ever received at the LBS provider, and (iii) the locations of some of the mobile users. The first assumption is common in the security community, since data privacy algorithms are usually public. The second assumption states that either the communication channel between the LA and the LBS provider is not secure, or the LBS provider is not trusted (e.g., a commercial organization that collects unauthorized information from its clients for spamming). The third assumption is motivated by the fact that an adversary can pinpoint the locations of some users by illegal means. If an adversary already knows the locations of all users, location privacy preservation is moot.

3.3 Design goals and performance metrics

Protecting the privacy of both query users and other users complicates the location anonymization task, and may reduce the effectiveness and efficiency of the LBS. Due to the nature of DP, a segment may contain no users even though the PSD shows a positive count. Consequently, no users (or an insufficient number thereof) are introduced to protect the identity of the query user. Alternatively, the generated CR may be too large when a smaller one would suffice for the anonymization request. A larger CR deteriorates QoS as well as the efficiency of anonymous queries. Finally, in non-private location-based queries, where the exact location of a query is known, only the required final results are returned. With privacy awareness, many redundant candidate results may be returned, increasing system overhead. In summary, we focus our attention on the following performance metrics:

Success Rate

The main purpose of the cloaking algorithm is to maximize the number of query requests perturbed successfully while maintaining their privacy and QoS constraints. Due to PSD data uncertainty, the LA may fail to provide enough protection (e.g., an insufficient number of users is included). The anonymization success rate (ASR) measures the ratio of requests perturbed successfully to the total number of received anonymization requests. The major challenge lies in keeping the ASR close to 100%.

Relative Levels

In PrivSem, the LA no longer uses the accurate user count of a segment to compute the CR, so the generated CR tends to be larger, resulting in poor QoS and expensive processing of anonymous queries. The challenge is to generate small CRs for successfully perturbed messages even though the accurate user counts of segments are not known. Hence, the relative anonymity level (RAL) and relative semantic level (RSL) are used to measure the quality of the generated CRs. More specifically, RAL is equal to \(\frac {k_{c}}{k}\), and RSL is \(\frac {l_{c}}{l}\), where kc and lc denote the actual values achieved by the cloaking algorithm.

System Overhead

Dealing with inaccurate locations increases the complexity of the LBS, which raises scalability issues. An important metric for measuring overhead is the number of candidate results. This quantity affects both the communication overhead required to deliver the candidate results from the LBS provider to the LA and the computational overhead of the query processing algorithm.

4 Spatial cloaking algorithm

To fulfill user privacy requirements and achieve high QoS, the proposed anonymization algorithm is composed of two stages: (i) segment allocation and (ii) online cloaking. In the first phase, the segments of a road network are roughly grouped into different buckets, so that location anonymization can be performed within a single bucket rather than over the entire road network. Using this information, the second phase locates the bucket of each query user's segment and anonymizes the segment based on the user privacy profile. In this section, we present segment allocation and detail the online cloaking in the next section.

4.1 Segment allocation

This phase mainly allocates the segments of a road network to different buckets according to users’ privacy requirements. To capture most user privacy requirements, we make the following observation.

Observation 1

The location semantic privacy requirements L of user privacy profiles follow a Gaussian distribution L ∼ N(μ, σ2), i.e., most user location semantic privacy requirements fall in the middle range, and fewer users have higher privacy requirements. The parameter μ is the mean of the distribution, and the parameter σ is its standard deviation.

It follows that we can make use of the 68-95-99.7 empirical rule, also known as the 3σ rule, which states that about 99.7% of values sampled from a Gaussian distribution lie within three standard deviations of the mean. Consequently, setting the semantic number of a bucket to μ + 3σ is sufficient to satisfy almost all user location anonymization requests within a single bucket. Definition 4 formally states the objective of segment allocation.

Definition 4

(Segment Allocation) The segments of a road network G = (V, E, ξ) are allocated to p buckets, G1, G2,...Gp, where p > 1 and Gi = (Vi, Ei, ξi), such that \(V=\bigcup _{1\leq i\leq p}V_{i}\), \(E=\bigcup _{1\leq i\leq p}E_{i}\), \(\xi (E)=\bigcup _{1\leq i\leq p}\xi _{i}(E_{i})\), and the following conditions are satisfied.

  1. (i)

    The segments of all buckets are disjoint, i.e., ∀ 1 ≤ i, j ≤ p with i ≠ j, Ei ∩ Ej = ∅.

  2. (ii)

    The semantic number of a bucket must be at least the threshold μ + 3σ, i.e., \(|\xi (E_{i})|=|\bigcup _{e\in E_{i}}\xi (e)|\geq \mu +3\sigma \).

The cloaking algorithm aims to protect the location privacy of mobile users. At the same time, it should not compromise QoS, which depends mostly on the maximum temporal tolerance and the system overhead. As mentioned, the number of candidate results is used to measure system overhead, which is formalized in Definition 5. Without loss of generality, we focus our discussion on K-nearest neighbor (KNN) queries.

Definition 5

(LBS Server Processing) [7, 45] For a query q with an associated CR Sc, the candidate results of q consist of two parts: (1) the POIs on the segments of Sc, and (2) the results of q as if it were issued at the boundary nodes in the boundary set Sbn, where the boundary set is the set of nodes some of whose incident edges are not included in Sc. Formally, \(Cost(q, S_{c})=(\bigcup _{s\in S_{c}}O(q, s))\bigcup (\bigcup _{v\in S_{bn}}O(q, v))\).

With this query processing model, it can be observed that the system overhead Cost(q, Sc) is significantly affected by the parameters |Sc| and |Sbn|. However, decreasing |Sc| and |Sbn| imposes conflicting demands on Cost(q, Sc), mainly because segments that are near each other tend to have similar semantic labels. For a user privacy profile, our goal is to find the optimal CR that minimizes system overhead while satisfying location k-anonymity and segment l-semantic diversity. In summary, our problem is equivalent to the following optimization problem:

Minimize Cost(q, Sc), subject to Count(Sc) ≥ k and \(|\xi (S_{c})|=|\bigcup _{e\in S_{c}}\xi (e)| \geq l\).

As shown in [45], the problem of computing an optimal CR is NP-hard.

Solution

Given the analysis above, we develop a greedy solution called EIRank. An intuitive guideline is that adjacent segments with different semantic labels should be cloaked together in order to provide a compact structure and semantic diversity simultaneously. Under this guideline, we prefer to cloak segments exhibiting structural similarity and semantic label dissimilarity. To measure the similarity of linkage structures and the dissimilarity of semantic labels, we present two scoring functions: S(n1, n2) and Diff(ep.φ, eq.φ).

In many applications, objects are deemed similar if they are related to similar objects. Motivated by this intuition, a general similarity metric called SimRank is naturally adapted to capture the similarity of linkage structures. The calculation of SimRank is given in (2).

$$ S(n_{1},n_{2})=\left\{ \begin{array}{ll} 1 & n_{1}=n_{2}\\ \frac{C}{|I_{n_{1}}||I_{n_{2}}|}\sum\limits_{i\in I_{n_{1}}}\sum\limits_{j\in I_{n_{2}}}S(i,j) & n_{1} \neq n_{2} \end{array} \right. $$
(2)

In this equation, the parameter C, referred to as a decay factor, is a constant between 0 and 1, and In denotes the neighboring set of n. Note that (2) is set to 0 when \(I_{n_{1}} = \emptyset \) or \(I_{n_{2}} = \emptyset \).
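A minimal, non-optimized sketch (ours) of evaluating (2) by fixed-point iteration; the toy EI network at the end is hypothetical:

```python
def simrank(neighbors, C=0.8, iters=10):
    """Fixed-point iteration of the SimRank recurrence (2).
    `neighbors` maps each node to its neighboring set I_n."""
    nodes = list(neighbors)
    S = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        S_new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    S_new[(a, b)] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    S_new[(a, b)] = 0.0  # convention when either I_n is empty
                else:
                    total = sum(S[(i, j)]
                                for i in neighbors[a] for j in neighbors[b])
                    S_new[(a, b)] = C * total / (len(neighbors[a]) * len(neighbors[b]))
        S = S_new
    return S

# Toy EI network: e-nodes are adjacent if their segments share a road node.
ei = {"e1": {"e2"}, "e2": {"e1", "e3"}, "e3": {"e2"}}
print(simrank(ei)[("e1", "e3")])
```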

To evaluate the dissimilarity of the semantic labels of segments, we use the normalized edit distance. In this case, the dissimilarity of semantic labels Diff(ep.φ, eq.φ) is measured by the edit distance between the semantic labels, normalized by the length of the longer label. The edit distance Edit(ep.φ, eq.φ) between two semantic labels ep.φ and eq.φ is defined as the minimum number of basic operations required to transform one semantic label into the other. In this paper, the basic operations are insertion, deletion and substitution of symbols, formalized as follows.

Let Ti(a) denotes the insertion of symbol a, Td(a) denotes the deletion of symbol a, and Ts(b|a) denotes the substitution of symbol a by symbol b (ab). Then,

$$ \mathit{Diff}(e_{p}.\varphi, e_{q}.\varphi)=\frac{\mathit{Edit}(e_{p}.\varphi, e_{q}.\varphi)}{\mathit{Max}(|e_{p}.\varphi|, |e_{q}.\varphi|)} $$
(3)

where ep.φ represents the label of ep, and Max(|ep.φ|,|eq.φ|) computes the larger of the lengths of the two labels ep.φ and eq.φ.
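The following sketch (ours) computes Edit via the standard dynamic program and the normalized dissimilarity of (3); the sample labels are illustrative only:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion  T_d
                          d[i][j - 1] + 1,         # insertion T_i
                          d[i - 1][j - 1] + cost)  # substitution T_s
    return d[m][n]

def diff(label_p, label_q):
    """Normalized dissimilarity of two semantic labels, as in (3)."""
    return edit_distance(label_p, label_q) / max(len(label_p), len(label_q))

print(diff("church", "club"))  # 4 operations / max length 6 = 0.67
```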

To integrate semantic information and linkage structure for segment allocation, our greedy solution EIRank represents link-based similarity and semantic-based dissimilarity simultaneously. At a high level, it consists of four steps: EI network construction, label clustering, augmented EI network construction, and segment allocation. Below, we introduce each of them in detail.

EI Network Construction

For ease of exposition, we assume each edge carries a single semantic label. To combine linkage structure and segment semantics, an edge interaction (EI) network is first derived from the semantic road network. More specifically, an EI network node (e-node) represents an edge in the original semantic road network, and two e-nodes are adjacent if their corresponding edges share a common node in the original semantic road network. The labels of e-nodes in the EI network are given by the semantic labels of the corresponding edges in the road network. For instance, the edges e1 and e2 share the common node v2 in the semantic road network (Figure 3a), and thus the e-nodes e1 and e2 are linked in the EI network (Figure 3b). Furthermore, since the segment ID itself suffices to indicate the semantic label of a segment, the labels of the e-nodes are omitted from the figure.

Figure 3: Example of EIRank strategy

Label Clustering

The problem of calculating the dissimilarity of two segment labels translates into calculating the dissimilarity of two e-node labels in the EI network. As mentioned before, we use the normalized edit distance for this. Taking the labels of the two e-nodes e3 and e4 in Figure 3b, performing the basic operations Ts(l|h), Ts(b|r), Td(c) and Td(h) transforms the label of e-node e3 into the label of e-node e4. Thus, the dissimilarity of these two e-node labels is \(\mathit {Diff}(e_{3}.\varphi ,e_{4}.\varphi )=\frac {2}{3}\).

Based on the dissimilarity of the labels of e-nodes, we then perform a generalized k-medians clustering [32] for these e-node labels. In Figure 3b, the result of label clustering is {church, police, park} and {bar, club}.

Augmented EI Network Construction

In the third step, we create a virtual node for each label cluster and connect the e-nodes whose labels are in the same cluster to the virtual node. The newly generated network is called the augmented EI network. By adding the virtual nodes, the e-nodes in the same label cluster tend to have higher structural similarities. For example, Figure 3c gives the augmented EI network corresponding to Figure 3b. Two virtual e-nodes o1 and o2 are created to represent the clusters {church, police, park} and {bar, club}, respectively. In particular, the virtual node o1 is connected to the e-nodes in the set {e1, e2, e3, e8, e9, e10}. In the same manner, the e-nodes in the set {e4, e5, e6, e7} are connected to the virtual node o2.

Segment Allocation

As stated earlier, the segments of a CR need to be structurally similar and semantically dissimilar. Through the first three steps, the dissimilarity of the e-node labels has been converted into linkage-structure similarity, so a single structural criterion suffices. Next, the function S(ep, eq) is employed to evaluate the similarity of every pair of non-virtual e-nodes.

To calculate SimRank more efficiently, we adopt the method in [13]. In this case, the similarity of e-nodes is measured by (4), which states that the similarity of two e-nodes is the expectation of C raised to the meeting time τ(ep, eq), i.e., the time taken by two random walkers starting from the two nodes to first meet.

$$ S(e_{p},e_{q})=E(C^{\tau(e_{p},e_{q})}) $$
(4)

Once the similarity of all e-node pairs has been calculated, we apply single-linkage hierarchical clustering [39] to perform the segment allocation. The function Allocate(ep, eq, GS) describes this process.

The complete description of our EIRank strategy is shown in Algorithm 1.


4.2 Ordered locating index

Due to the daunting number of segments, it is costly and time-consuming to search for the position of a segment in a segment allocation. To perform this search quickly and efficiently, we devise a novel data structure, the Ordered Locating Index.

Ordered Locating Index (OLI)

For a particular segment, this data structure allows highly efficient computation of its position in the segment allocation. It organizes the segments in order. Each record is represented as (Sid, Bid, Pid, Pointer), where Sid is the segment identifier, Bid is the bucket identifier of segment Sid, Pid is the position identifier of segment Sid in bucket Bid, and Pointer is a pointer to the next record. With this mapping relation, we can quickly locate the position of a segment in a segment allocation. More precisely, (5) computes the sequence number Seq(ei, j) of a segment in the ordered linked list, from which the Bid and Pid of ei, j are obtained. In the equation, the symbol ei, j denotes the segment connecting nodes i and j, and the first three terms together count the total number of segments that precede ei, j.

$$ \begin{array}{@{}rcl@{}} Seq(e_{i,j})&=&\sum\limits_{k=1}^{i-1}degree(k)-|S_{overlap}|_{S_{overlap}=\{e_{lt}\},t<i,l<i}\\ &&+|S_{prior}|_{S_{prior}=\{e_{ip}\},i<p<j}+1 \end{array} $$
(5)

Take the segment e4,6 (i.e., e7) in Figure 1 as an example. According to (5), its sequence number is Seq(e4,6) = degree(v1) + degree(v2) + degree(v3) - |{e1,2, e2,3}| + |{e4,5}| + 1 = 2 + 3 + 2 - 2 + 1 + 1 = 7. Looking up the 7th record in the OLI shown in Figure 6, we obtain Bid = 2 and Pid = 8, meaning that segment e4,6 is in bucket 2, at position 8.
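A direct transcription of (5) into code (our own sketch; node identifiers are assumed to be integers, and edges are (u, v) pairs with u < v):

```python
def seq(i, j, edges):
    """Sequence number of segment e_{i,j} in the OLI, following (5).
    `edges` is a set of (u, v) node-index pairs with u < v."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    deg_sum = sum(degree.get(k, 0) for k in range(1, i))   # sum of degree(1..i-1)
    overlap = sum(1 for u, v in edges if u < i and v < i)  # S_overlap: both ends < i
    prior = sum(1 for u, v in edges if u == i and v < j)   # S_prior: e_{i,p}, i < p < j
    return deg_sum - overlap + prior + 1

# e.g. seq(4, 6, edges) reproduces the value 7 derived above, given the
# (here unspecified) edge set of Figure 1.
```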

5 Online cloaking

In the previous section, we elaborated the first phase of the proposed approach. Once the buckets have been obtained, the remaining work is to generate a CR for a user's online request. As illustrated in Example 1, to protect other users' privacy, we cannot use the real count of mobile users on each segment. Instead, the PSD is leveraged to publish this count.

5.1 Building the user PSD

This step builds a user PSD (at the CSP component) to be used later for location anonymization at the LA. Building the PSD is a vital step, because it determines how accurate the released data is, which in turn affects the ASR. In this part, a naive user PSD for mobile users on a road network is first presented. Then, motivated by data sparsity, we modify it to release more accurate counts.

Naive User PSD

To construct the user PSD, we devise the Segment User Count Map (SUCM) as the underlying structure to record the number of mobile users located on each segment. Each entry is a tuple (Sid, N), where Sid is the segment identifier and N is the number of mobile users on this segment. This structure is dynamically maintained to keep track of the total number of mobile users within each segment. In addition to the mapping from each segment to its current count, we also devise a hash table HT to keep track of the current locations of mobile users. Each entry in HT for a mobile user is represented as (Uid, Sid), where Uid is the mobile user identifier and Sid is the identifier of the segment where Uid is located. These structures allow efficient and fast computation of the mobile user count of any particular segment of a road network. Figure 4 illustrates these two data structures, and a small sketch of their maintenance follows the figure.

Figure 4: An illustration of the proposed fundamental data structures for the PSD
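A minimal sketch of how the SUCM and HT could be maintained together (our own illustration; names are hypothetical):

```python
class SegmentUserCountMap:
    """SUCM: per-segment user counts, plus the hash table HT that maps
    each user id to the segment id where that user currently is."""
    def __init__(self):
        self.count = {}  # Sid -> N
        self.ht = {}     # Uid -> Sid

    def update(self, uid, new_sid):
        """Move user `uid` to segment `new_sid`, keeping both maps consistent."""
        old_sid = self.ht.get(uid)
        if old_sid is not None:
            self.count[old_sid] -= 1
        self.ht[uid] = new_sid
        self.count[new_sid] = self.count.get(new_sid, 0) + 1

sucm = SegmentUserCountMap()
sucm.update("u1", "e7")  # e.g. Bob reports his location on segment e7
```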

With this fundamental spatial structure, we build a user PSD (denoted by PSDu) according to DP. Only one simple step is required to construct PSDu: the noisy counts of segments are computed by directly injecting random Laplace noise with scale \(\lambda =\frac {2}{\epsilon }\) into the actual counts of these segments. All segments have non-overlapping extents, which yields a constant and low sensitivity of 2 (i.e., changing the location of any one user in the dataset affects the counts of at most two segments). Thus, injecting random Laplace noise \(Lap(\frac {2}{\epsilon })\) satisfies 𝜖-differential privacy.

Customized User PSD

In real life, the total number of mobile users is very small compared to the total number of segments in a road network, so the majority of segments have very low to zero counts. This data sparsity issue is an immense challenge for privacy-preserving techniques, since the injected noise is more likely to dominate the released counts when only a small set of mobile users is present. In other words, when each segment is perturbed individually, the injected perturbation noise results in a high relative error for the released counts of sparse segments. Motivated by this, we propose the adaptive group mechanism to mitigate the data sparsity issue.

The key idea is that segments with small statistics should be grouped together if they have close statistics and similar statistical trends. Due to spatial correlation, adjacent segments are very likely to belong to the same area (such as a suburb or the city center). Therefore, they are subject to similar constraints on user counts, leading to similar statistical properties. For example, a collector road restricts the number of users to a low value, whereas an expressway attracts a considerably higher number of users; likewise, roads adjacent to a congested road also tend to be congested. To exploit these heuristics, we use structural information to characterize the statistical trend. Let S(ei) denote the neighboring set of a segment ei. We adopt (6) to measure the similarity of statistical trends. The expression signal(ei, ej) checks whether the two segments are roads of the same type: if both are expressways (collectors, etc.), the value is 1; otherwise, it is 0. J(ei, ej) is the Jaccard similarity coefficient between S(ei) and S(ej). Segments with small statistics and high similarity are then grouped together.

$$ SJ(e_{i},e_{j})= \alpha\times signal(e_{i}, e_{j})+(1-\alpha)J(e_{i},e_{j}) $$
(6)

Figure 5 describes our customized user PSD mechanism. As shown, it is decomposed into M1, M2 and M3, which operate sequentially. M1 compares the noisy count \(\widetilde {nc_{e_{i}}}\) of each segment ei against the noise resistance threshold τ1. The result of this comparison is forwarded to M2, which uses it to decide whether a segment forms a group on its own or is grouped with other segments; that is, M2 determines the final groups (called partitions) of segments. Once the partition structure of the segments is established, M3 injects Laplace noise into each partition count to ensure differential privacy. Since we assume a uniform data distribution within each partition, the noisy count of each segment is estimated as the average partition count.

Figure 5: Internal mechanics of the customized user PSD

The sub-mechanisms M1 and M3 can be performed directly, as they only depend on the Laplace mechanism. Next, we describe the procedure for the group strategy R. Initially, every segment ej with \(\widetilde {nc_{e_{j}}}> \tau _{1}\) forms an independent partition, which is added to R. Then, all segments not in R are sorted in increasing order of \(\widetilde {nc_{e_{j}}}\), forming a list Ω. As long as Ω is not empty, we perform the following operations: initialize a new partition g with the first segment e(1), and check the next segment e(k) in Ω. If the difference between the noisy counts of e(1) and e(k) is less than τ2, and the sum of the noisy counts of the segments in g is less than τ1, we calculate their similarity; if this value is larger than τ3, we remove e(k) from Ω and place it into partition g, otherwise we skip this segment and perform the same check on the next one. In all other cases, we add g to R and remove e(1) from Ω. Finally, the complete group strategy R is obtained.

Note that the threshold τ2 is the error threshold that decides whether the statistics of two segments are close to each other, and τ3 is the similarity threshold that decides whether the statistical properties of two segments are similar. Both M1 and M3 must be private, as they access the sensitive dataset D. Moreover, any post-processing of differentially private data remains differentially private, so M2 does not violate privacy. Let 𝜖1 and 𝜖2 be the budgets spent in M1 and M3, respectively. Then, by the sequential composition theorem (Theorem 2), the privacy budget of the adaptive group mechanism is 𝜖1 + 𝜖2 = 𝜖.

Example 3

(Adaptive Group Mechanism) Suppose three segments {e1, e2, e3} need to be grouped, and their similarities are SJ(e1, e2) = 0.92, SJ(e1, e3) = 0.35 and SJ(e2, e3) = 0.45. Let τ1 = 50, τ2 = 20 and τ3 = 0.8. The noisy counts of the three segments are 15.3, 9.7 and 65.2, respectively. Since 65.2 > 50, segment e3 forms a separate group, which is added to R. For segments e1 and e2, 15.3 - 9.7 = 5.6 is smaller than 20 and their statistical similarity 0.92 exceeds 0.8, so these two segments are grouped together. Thus, the final group strategy is R = {{e3},{e1, e2}}.
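One way to realize the scan described above (a sketch under our own naming; the demo reproduces Example 3):

```python
def adaptive_group(noisy, sim, tau1, tau2, tau3):
    """Group strategy R of the adaptive group mechanism (M2).
    noisy: segment -> noisy count from M1; sim(a, b): trend similarity SJ of (6)."""
    R = [[e] for e, c in noisy.items() if c > tau1]            # dense segments stand alone
    omega = sorted((e for e, c in noisy.items() if c <= tau1),
                   key=lambda e: noisy[e])                     # sparse ones, ascending
    while omega:
        head = omega.pop(0)
        group = [head]
        i = 0
        while i < len(omega):
            e = omega[i]
            if not (noisy[e] - noisy[head] < tau2
                    and sum(noisy[x] for x in group) < tau1):
                break                       # sorted order: later segments only farther
            if sim(head, e) > tau3:
                group.append(omega.pop(i))  # move e into the current partition
            else:
                i += 1                      # skip e, check the next segment
        R.append(group)
    return R

# Reproducing Example 3:
noisy = {"e1": 15.3, "e2": 9.7, "e3": 65.2}
SJ = {frozenset(p): s for p, s in [(("e1", "e2"), 0.92),
                                   (("e1", "e3"), 0.35),
                                   (("e2", "e3"), 0.45)]}
R = adaptive_group(noisy, lambda a, b: SJ[frozenset((a, b))], 50, 20, 0.8)
print(R)  # [['e3'], ['e2', 'e1']]
```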

Error Analysis

To protect user privacy, the exact location of a query user is extended to a larger CR that must meet the user's privacy requirement. Due to the nature of DP, a CR may not contain enough mobile users even if the released count exceeds the specified k. To ensure a high ASR, we should guarantee that the real count of a CR is larger than k with high probability. To this end, we analyze the error in the count reported for a CR. The noisy count of a CR is computed as the sum of the noisy counts of the partitions contained in the CR. Below, we quantify the noise accumulated in this process, which helps us improve the ASR of the cloaking algorithm.

Formally, let \(\widetilde {nc_{cr}}\) be the count released by the PSD for a CR, and denote by nccr its real count. Let n1 and n2 denote the number of complete partitions and partial partitions contained in the CR, respectively. Given a partial partition gp and its corresponding complete partition gc, the expected error is \(Error(g_{p})={\sum }_{e_{i}\in g_{p}}(\widetilde {nc_{e_{i}}}-nc_{e_{i}})={\sum }_{e_{i}\in g_{p}}(\frac {{\sum }_{e_{j}\in g_{c}}(nc_{e_{j}})}{|g_{c}|}+\frac {1}{|g_{c}|}Lap(\frac {2}{\epsilon })-nc_{e_{i}}) = \underbrace {(\frac {{\sum }_{e_{j}\in g_{c}}(nc_{e_{j}})}{|g_{c}|}|g_{p}|-nc_{g_{p}})}_{\text {Approximation error}} +\underbrace {\frac {|g_{p}|}{|g_{c}|}Lap(\frac {2}{\epsilon })}_{\text {Laplace error}}\).

As observed, the error of a partial partition is composed of an approximation error and a Laplace noise error. Since the approximation error is data-dependent and difficult to quantify, we use the introduced noise error as a rough measure of the error of a partial partition. In contrast, the error of a complete partition consists entirely of noise error. We can then write \(\widetilde {nc_{cr}}\) as follows.

$$ \widetilde{nc_{cr}}= nc_{cr}+\sum\limits_{i = 1}^{n1}{Lap(\frac{2}{\epsilon})}+\sum\limits_{j = 1}^{n2}{ratio_{j}\times Lap(\frac{2}{\epsilon})}=nc_{cr}+\sum\limits_{i = 1}^{n}{Lap(\frac{2}{\epsilon})} $$
(7)

In this equation, n is the sum of n1 and \(\lceil \sum \limits _{j = 1}^{n2}{ratio_{j}}\rceil \). For each partial partition gp, ratioj is the ratio of the number of segments contained in gp to that of its associated complete partition gc, i.e., \(\frac {|g_{p}|}{|g_{c}|}\). Let the random variable \(Y =\sum \nolimits _{i = 1}^{n}{Lap(\frac {2}{\epsilon })}\) denote the sum of these Laplace noises. Y determines the error \(\xi =\parallel \widetilde {nc_{cr}}-nc_{cr}\parallel _{1}\) in estimating the count. The following corollary formally bounds this quantity.

Corollary 2

(Error Bound for CR Counting) For any count query over a CR, with the noisy count \(\widetilde {nc_{cr}}\) obtained by summing the noisy counts of the n partitions contained in the CR, with probability at least 1 − δ, the quantity \(\xi =\parallel \widetilde {nc_{cr}}-nc_{cr}\parallel _{1}\) is at most \(O(\frac {2}{\epsilon }\sqrt {n}\log \frac {1}{\delta })\).

Proof (sketch). The proof follows from Corollary 1 by choosing \(v=\sqrt {{\sum }_{i=0}^{n-1}(\frac {2}{\epsilon })^{2}}\sqrt {2\ln {\frac {2}{\delta }}}\).
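For intuition, the bound can be evaluated numerically. The sketch below (ours) computes the threshold ξ that the cloaking algorithm in Section 5.2 subtracts from the noisy CR count; parameter values are illustrative assumptions:

```python
import math

def cr_count_error_bound(n, epsilon, delta):
    """xi such that |noisy CR count - real CR count| <= xi holds with
    probability >= 1 - delta (Corollary 2), with v chosen as in the proof."""
    v = (2.0 / epsilon) * math.sqrt(n) * math.sqrt(2 * math.log(2 / delta))
    return v * math.sqrt(8 * math.log(2 / delta))

# A CR spanning n = 6 partitions at epsilon = 0.5, delta = 0.05:
xi = cr_count_error_bound(6, 0.5, 0.05)
# the cloaking algorithm then requires noisy_count - xi >= k
```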

5.2 Cloaking algorithm

In this subsection, we present our online cloaking algorithm, which is sketched in Algorithm 2. It involves six main inputs: u (mobile user), (x, y) ∈ ei (user location), (k, l, σt) (user profile), H (hash table), OLI (ordered locating index), and PSDu (user PSD).

The algorithm starts by performing initialization and extracting some basic information (Lines 1-2). It then leverages a semantic-based cloaking function to discover the semantic-based cloaked area CRu (Line 3). At this step, the algorithm ignores the constraint on k and focuses on the semantic requirement. Next, it calculates the count of mobile users in CRu (Lines 4-5); to protect other users' location privacy, the noisy count of each segment is used rather than the real count. The algorithm then computes the lower bound for the real user count of CRu, namely \(\widetilde {nc_{CR_{u}}}-\xi \). Finally, it checks whether this value satisfies the location k-anonymity constraint. If the current CRu is k-anonymous, it stops. Otherwise, it repeatedly adds the neighboring segment with the largest noisy count to CRu until CRu satisfies k-anonymity (Lines 6-9). Note that if the current largest noisy count is less than \(\frac {2\sqrt 2}{\epsilon }\), the algorithm stops the cloaking process and returns failure.

During this process, some subtle handling of noisy counts is introduced. Specifically, if the noisy count of a PSD node containing the current segment is less than \(\frac {2\sqrt 2}{\epsilon }\), its value is set to zero. Recall that the major purpose of the cloaking algorithm is to achieve a high ASR; in that sense, we want to ensure that the user count of a selected segment is non-empty, i.e., the real user count of the segment is strictly positive. Under the Laplace mechanism of DP, each PSD node count is the sum of the real count and noise, and the injected noise has standard deviation \(\frac {2\sqrt 2}{\epsilon }\). Hence, if the count of a PSD node is below this value, the node is empty with high probability. Based on this analysis, we set the noisy counts of segments in such PSD nodes to 0, further increasing the ASR.


The core of the semantic-based cloaking function lies in the segment oblivious property [29]. When l = 1, it returns the segment of the user's location as the CR, since the user's current segment already satisfies the semantic privacy requirement. Otherwise, the function proceeds in iterations. In each iteration, it detects a cloaked segment set (cloaked set for short) SL starting from the first unprocessed location \(loc_{pos_{c}+1}\) in bucket Bidu; the formed cloaked set satisfies l-semantic diversity. It then checks the number of remaining semantics in the bucket: when this number is less than l, the remaining segments are also put into SL. Next, it determines whether Sidu is in SL. If not, the function continues to iteratively detect the next SL in the same manner until it finds the required SL. Finally, each segment e ∈ SL is inserted into CRu. During this process, the notation Poso denotes the last position of the SL detected in the previous iteration, and Posc the last position of the SL detected in the current iteration.
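A compact sketch of this function (our own rendering of the iteration just described; the bucket contents and semantic labels are supplied by the caller):

```python
def semantic_cloaking(bucket, semantics, user_seg, l):
    """Scan the bucket, cutting it into consecutive cloaked sets that each
    cover l distinct semantics, and return the set containing the user's
    segment. `bucket` is the ordered segment list; semantics[e] labels e."""
    if l == 1:
        return [user_seg]                   # the segment itself suffices
    pos = 0
    while pos < len(bucket):
        SL, seen = [], set()
        while pos < len(bucket) and len(seen) < l:
            SL.append(bucket[pos])
            seen.add(semantics[bucket[pos]])
            pos += 1
        remaining = {semantics[e] for e in bucket[pos:]}
        if len(remaining) < l:              # too few semantics left: absorb them
            SL.extend(bucket[pos:])
            pos = len(bucket)
        if user_seg in SL:
            return SL
    return None
```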

Example 4

(Semantic-Based Cloaking) Suppose the content of a bucket for Figure 1 is {e3, e1, e4, e11, e2, e6, e9, e7, e22, e17, e12}, and the semantics of these segments are as shown in Figure 6. Two users u1 and u2 have the same semantic privacy requirement l = 3, and they are located on segments e22 and e1, respectively. Initially, the algorithm traverses the bucket from scratch (i.e., Poso = 0) and finds the first 3-semantic diversity cloaked set SL = {e3, e1, e4}. It then checks the number of remaining semantics in the bucket (i.e., 6). Because 6 is larger than 3, SL = {e3, e1, e4} is a valid cloaked set (i.e., Posc = 3). Since segment e22 is not in {e3, e1, e4}, the algorithm continues to search for the next cloaked set SL = {e11, e2, e6} in the same manner (Poso = 3, Posc = 6). When it retrieves the cloaked set SL = {e9, e7, e22, e17}, it detects that the number of remaining semantics is 1, which is smaller than 3. Thus, segment e12 is added into SL. At this point, segment e22 is in SL = {e9, e7, e22, e17, e12}, so this cloaked set is u1's semantic-based CR. To compute the semantic-based CR for u2, the algorithm only needs to perform the first iteration, as for user u1, and obtains SL = {e3, e1, e4}.

Figure 6: An illustration of SOPlist and Cloaked l-maplist

5.3 Optimizations

Despite its simplicity, the basic version of PrivSem introduced above suffers from several drawbacks. (1) For each query request, the function Semantic_Cloaking() recalculates the basic semantic-based CR from scratch, which degrades the efficiency of the LA. (2) It attempts to execute anonymization immediately after a new query request arrives, i.e., the LA processes each query request completely independently. Many attempts may therefore be required to process these requests, incurring scalability problems. In what follows, corresponding to these drawbacks, we develop a series of optimization strategies to improve the performance of the LA.

Recording semantic cloaking (RSC)

To facilitate the execution of the semantic-based cloaking algorithm, we devise two additional data structures: SOPlist and Cloaked l-maplist. The former is a 2-semantic diversity index that speeds up the computation of the basic semantic-based CR. Each record of SOPlist has the form ((seman1, n1), (seman2, n2), Pointer), where (seman1, n1) ((seman2, n2)) indicates n1 (n2) adjacent segments with semantic label seman1 (seman2), while Pointer is a pointer to the next record.

Cloaked l-maplist is designed to record the CRs generated so far for distinct semantic requirements. It is achieved by re-using the mapping between segments and CRs. A basic cell of Cloaked l-maplist is represented as (li, npointer, spointer) together with li_set, where li denotes li-semantic diversity, npointer and spointer are pointers to the next basic cell and to li_set, respectively, and li_set records the last position of each CR for semantic requirement li. li_set is dynamically maintained to keep track of the current maximum position of CRs with semantic requirement li in a bucket.

Example 5

(SOPlist and Cloaked l-maplist) Continuing with Example 4, Figure 6 shows the SOPlist and Cloaked l-maplist corresponding to the bucket. In Example 4, to compute the semantic-based CR for u2, the algorithm recalculates from scratch, which degrades the efficiency of the LA. With the help of Cloaked l-maplist, instead of recalculating, the algorithm examines whether this cloaked set has been formed before. After deriving u1's semantic-based CR SL = {e9, e7, e22, e17, e12}, the maximum of the l3-set is updated to 11. From the OLI structure, it is easy to determine the position \(Pid_{u_{2}}\) = 2. Since 2 is smaller than 11, u2's semantic-based CR has been formed before, and the algorithm can obtain it directly without recalculating. As 2 < 3 (the last position of the first cloaked set is 3), the semantic-based CR is retrieved by taking the segments in the interval (0, 3], that is, SL = {e3, e1, e4}.

Delay and sharing anonymizing (DSA)

To increase the scalability of the LA, we propose the delay and sharing anonymizing strategy. It exploits shared processing in the location anonymization operation by combining query requests with nearby locations and perturbing them as a whole. This is feasible because each query request can wait for a period of time Tw before the anonymization process starts, where Tw is set according to the users' maximum tolerable delay σt in their profiles. Through shared anonymization, the burden on the LA can be significantly reduced, further increasing its scalability.

Concretely, consider a set of query requests \(\{q\}_{i=1}^{t}\) with corresponding user privacy profiles \(\{k_{i}, l_{i}, \sigma _{i}\}_{i=1}^{t}\) and arrival times \(\{r_{i}\}_{i=1}^{t}\). These query requests are pushed into the query queue Qq in increasing order of (ri + σi). Once the first query request qi is popped from Qq, the algorithm attempts to perform anonymization. First, it initializes the cloaked area CRu using Semantic_Cloaking(). Then, it traverses Qq to find the request set Sqr whose semantic requirements match qi, i.e., for ∀qj ∈ Sqr, lj = li. Next, for each query request qj ∈ Sqr, it checks whether its issuing segment ej is in CRu; if ej ∈ CRu, qj is removed from Qq. The parameter kmax records the maximum k associated with the requests in this anonymization. Finally, based on the noisy count of CRu and kmax, the algorithm extends CRu until it contains enough mobile users.
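The loop below sketches this strategy (our own illustration; semantic_cloaking and extend_to_k are hypothetical stand-ins for the paper's Semantic_Cloaking() and the k-anonymity extension step, and are assumed to be provided):

```python
import heapq

def dsa_loop(requests, semantic_cloaking, extend_to_k):
    """Delay-and-sharing anonymizing: serve requests in order of deadline
    r_i + sigma_i; pending requests with the same l whose segment already
    lies in the CR share one anonymization."""
    heap = list(requests)        # tuples (r_i + sigma_i, user, segment, k_i, l_i)
    heapq.heapify(heap)
    while heap:
        _, user, seg, k, l = heapq.heappop(heap)
        cr = set(semantic_cloaking(seg, l))   # initial semantic-based CR
        k_max, shared = k, [user]
        for req in list(heap):
            _, u2, seg2, k2, l2 = req
            if l2 == l and seg2 in cr:        # piggyback this request
                heap.remove(req)
                k_max = max(k_max, k2)
                shared.append(u2)
        heapq.heapify(heap)                   # restore the heap invariant
        extend_to_k(cr, k_max)                # grow CR until >= k_max users
        yield shared, cr
```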

6 Privacy analysis

In this section, we give a formal analysis of the privacy guarantees of PrivSem, both for query users and for other users. From the perspective of query users, we must consider the resilience of the location anonymization against an adversary's attack: based on his prior knowledge and his understanding of the anonymization model, the adversary attempts to pinpoint query users' identities, locations and location semantics from the perturbed information. For the other users, it should be guaranteed that it is difficult to link a specific segment or region to these users. Note that the attacks discussed here focus on one-shot queries.

Privacy guarantee for query users

Given the CR of a query user u as a set of segments CRu, the expected identity protection is provided if at least k users are indistinguishable to the adversary. For location protection, we argue that segment l-semantic diversity provides stronger privacy protection than l-segment diversity: at least l segments in CRu are indistinguishable to the adversary, and moreover it is difficult to pinpoint the exact semantic of the query location. Thus, from the adversary's perspective, each user u is associated with the query request with probability no larger than \(\frac {1}{k}\); the probability of u being associated with any particular segment is no more than \(\frac {1}{l}\); and the number of semantics associated with CRu is no less than l. However, with effective attacks, the adversary may find that these associations hold with much higher probability than required, thereby disclosing u's privacy with high confidence. The notion of linkability is therefore introduced to capture such vulnerability.

Definition 6

(Linkability) Consider a user u who issues a query q with exact location \(v_{loc}^{*}\in e_{u}\), location semantic eu.φ, and anonymized CR given as a set of segments CRu. Based on CRu and background knowledge Kb, the identity linkability Pr[q ← u | CRu, Kb] is the probability that an adversary can infer u's association with q, and the location linkability Pr[u ← eu | CRu, Kb] is the probability that an adversary can infer u's association with eu.

In particular, the background knowledge Kb considered in this paper is stated in Section 3.2. To compute the identity linkability, it suffices to estimate the number of users contained in CRu. During the cloaking process, we combine the noisy count and the error analysis to guarantee that the generated CRu contains enough users, i.e., the expected number of users exceeds k with high probability. It is then easy to conclude that \(Pr[q\leftarrow u|CR_{u}, K_{b}]\leq \frac {1}{k}\), which satisfies identity protection.

In the following, we analyze the location privacy guarantee in terms of location semantics. To evaluate the attack resilience of the location anonymization, we introduce a general replay attack model. In this model, for each segment e ∈ CRu, the adversary re-runs the cloaking algorithm with e assumed to be the exact location and estimates the likelihood of e generating CRu, denoted like[CRu | u ← e, Kb]. Then, Pr[u ← eu | CRu, Kb] is calculated as \(Pr[u\leftarrow e_{u}|CR_{u}, K_{b}]= \frac {like[CR_{u}|u\leftarrow e_{u}, K_{b}]}{{\sum }_{e\in CR_{u}}like[CR_{u}|u\leftarrow e, K_{b}]}\).

According to the construction principle of CRu, there exists a subset CRs ⊆ CRu that is first generated by the semantic-based cloaking function, with φ(CRs) ≥ l. Besides, due to the existence of multiple locations sharing the same semantics, |CRs| ≥ l. Based on the segment-oblivious property [29], the same CRs is generated for a query q issued from any e ∈ CRs; hence any e in CRs may be the query location. Further, \(Pr[u\leftarrow e_{u}|CR_{u}, K_{b}]\leq \frac {like[CR_{u}|u\leftarrow e_{u}, K_{b}]}{{\sum }_{e\in CR_{s}}like[CR_{s}|u\leftarrow e, K_{b}]} =\frac {like[CR_{u}|u\leftarrow e_{u}, K_{b}]}{|CR_{s}|\cdot like[CR_{u}|u\leftarrow e_{u}, K_{b}]}\leq \frac {1}{l}\). Moreover, the number of semantics in CRs is no less than l, and hence it is difficult for an adversary to infer the semantics of the query location. To sum up, our proposed cloaking algorithm provides privacy protection for query users.
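As an illustration, the following C++ sketch computes the location linkability by normalizing replay-attack likelihoods over the segments of CRu; the likelihood map is an assumed input rather than the paper's actual estimator.

```cpp
#include <map>
#include <string>

// A small sketch of the replay-attack computation: re-run the cloaking
// algorithm with each segment e in CR_u assumed to be the true location,
// record like[CR_u | u <- e, K_b], and normalize. The likelihood values
// are assumed to be supplied by the (re-run) cloaking procedure.
double locationLinkability(const std::string& eu,
                           const std::map<std::string, double>& likelihood) {
    double total = 0.0;
    for (const auto& [e, like] : likelihood)   // sum over all e in CR_u
        total += like;
    return likelihood.at(eu) / total;          // Pr[u <- e_u | CR_u, K_b]
}
```

Because the segment-oblivious property makes the likelihood identical for every segment of CRs and |CRs| ≥ l, the returned probability never exceeds 1/l.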

Privacy guarantee for other users

During the location anonymization for query users, we need to compute the user count of particular segments. In our framework, the PSD is used to release the noisy counts of the segments; this is the only step involving other users. If we can prove that the released PSD does not violate the location privacy of these users, then the whole algorithm does not leak their privacy. As stated above, in PrivSem, 𝜖-differential privacy is used to provide privacy protection for these users. Let D and D denote two neighboring datasets. Our goal is therefore to prove that \(\frac {Pr(\mathcal {A}(D)=PSD_{u})}{Pr(\mathcal {A}(D^{\prime })=PSD_{u})}\leq e^{\epsilon }\).

For our customized PSD, let M1 be the sparse calculation mechanism, M2 the group mechanism, and M3 the count publishing mechanism. Since changing one user's location can change the result of a location count query by at most 2, adding \(Lap(\frac {2}{\epsilon _{1}})\) noise satisfies 𝜖1-differential privacy, i.e., \(Pr(\mathcal {M}_{1}(D)=S)\leq e^{\epsilon _{1}}Pr(\mathcal {M}_{1}(D^{\prime })=S)\). Any post-processing of differentially private data remains differentially private, i.e., \(Pr(\mathcal {M}_{2}(\mathcal {M}_{1}(D)=S)=\Re )\leq e^{\epsilon _{1}}Pr(\mathcal {M}_{2}(\mathcal {M}_{1}(D^{\prime })=S)=\Re )\). For a specified segment e, it will be either an independent partition or a part of a larger partition. Based on this fact, \(Pr(\mathcal {A}(D,e)=\widetilde {nc_{e}})\) can be written as follows.

\(Pr(\mathcal {A}(D,e)=\widetilde {nc_{e}})=Pr(\mathcal {M}_{2}(\mathcal {M}_{1}(D,e)\geq \tau _{1})=\Re _{1})Pr(\mathcal {M}_{3}(D,e)=\widetilde {nc_{e}}|\Re _{1})+Pr(\mathcal {M}_{2}(\mathcal {M}_{1}(D,e)< \tau _{1})=\Re _{2})Pr(\mathcal {M}_{3}(D,e)=\widetilde {nc_{e}}|\Re _{2})\)

To analyze the expression above, we first investigate the properties of the following two expressions: \(\frac {Pr(\mathcal {M}_{3}(D,e)=\widetilde {nc_{e}}|\Re _{1})}{Pr(\mathcal {M}_{3}(D^{\prime },e)=\widetilde {nc_{e}}|\Re _{1})}\) and \(\frac {Pr(\mathcal {M}_{3}(D,e)=\widetilde {nc_{e}}|\Re _{2})}{Pr(\mathcal {M}_{3}(D^{\prime },e)=\widetilde {nc_{e}}|\Re _{2})}\).

$$ \begin{array}{@{}rcl@{}} \frac{Pr(\mathcal{M}_{3}(D,e)=\widetilde{nc_{e}}|\Re_{1})}{Pr(\mathcal{M}_{3}(D^{\prime},e)=\widetilde{nc_{e}}|\Re_{1})} \propto \frac{Pr(Lap(\widetilde{nc_{e}}-nc_{e}^{D}))}{Pr(Lap(\widetilde{nc_{e}}-nc_{e}^{D^{\prime}}))} &=&\frac{\frac{1}{2\lambda}e^{\frac{-|\widetilde{nc_{e}}-nc_{e}^{D}|}{\lambda}}} {\frac{1}{2\lambda}e^{\frac{-|\widetilde{nc_{e}}-nc_{e}^{D^{\prime}}|}{\lambda}}}\\ \leq e^{\frac{|nc_{e}^{D}-nc_{e}^{D^{\prime}}|}{\lambda}} &\leq& e^{\frac{2}{2/\epsilon_{2}}} = e^{\epsilon_{2}} \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} \frac{Pr(\mathcal{M}_{3}(D,e)=\widetilde{nc_{e}}|\Re_{2})}{Pr(\mathcal{M}_{3}(D^{\prime},e)=\widetilde{nc_{e}}|\Re_{2})} &=&\frac{Pr(\frac{{\sum}_{e_{j}\in g_{c}}nc_{e_{j}}^{D}}{|g_{c}|}+\frac{1}{|g_{c}|}Lap(\frac{2}{\epsilon_{2}})=\widetilde{nc_{e}})}{Pr(\frac{{\sum}_{e_{j}\in g_{c}}nc_{e_{j}}^{D^{\prime}}}{|g_{c}|}+\frac{1}{|g_{c}|}Lap(\frac{2}{\epsilon_{2}})=\widetilde{nc_{e}})}\\ &\propto& \frac{Pr(Lap(\frac{2}{\epsilon_{2}})=(\widetilde{nc_{e}}-\frac{{\sum}_{e_{j}\in g_{c}}nc_{e_{j}}^{D}}{|g_{c}|})|g_{c}|)}{Pr(Lap(\frac{2}{\epsilon_{2}})=(\widetilde{nc_{e}}-\frac{{\sum}_{e_{j}\in g_{c}}nc_{e_{j}}^{D^{\prime}}}{|g_{c}|})|g_{c}|)}\\ &=&\frac{\frac{1}{2\lambda}e^{\frac{-|(\widetilde{nc_{e}}-\frac{{\sum}_{e_{j}\in g_{c}}nc_{e_{j}}^{D}}{|g_{c}|})|g_{c}||}{\lambda}}} {\frac{1}{2\lambda}e^{\frac{-|(\widetilde{nc_{e}}-\frac{{\sum}_{e_{j}\in g_{c}}nc_{e_{j}}^{D^{\prime}}}{|g_{c}|})|g_{c}||}{\lambda}}}\\ &\leq& e^{\frac{|{\sum}_{e_{j}\in g_{c}}(nc_{e_{j}}^{D}-nc_{e_{j}}^{D^{\prime}})|}{\lambda}} =e^{\frac{2}{2/\epsilon_{2}}}=e^{\epsilon_{2}} \end{array} $$
(9)

Combining the results of (8) and (9), we have

$$ \begin{array}{@{}rcl@{}} &&Pr(\mathcal{A}(D,e)=\widetilde{nc_{e}})\\ &=&Pr(\mathcal{M}_{2}(\mathcal{M}_{1}(D,e)\geq \tau_{1})=\Re_{1})Pr(\mathcal{M}_{3}(D,e)=\widetilde{nc_{e}}|\Re_{1})\\ &&+Pr(\mathcal{M}_{2}(\mathcal{M}_{1}(D,e)< \tau_{1})=\Re_{2})Pr(\mathcal{M}_{3}(D,e)=\widetilde{nc_{e}}|\Re_{2})\\ &\leq& e^{\epsilon_{1}}Pr(\mathcal{M}_{2}(\mathcal{M}_{1}(D^{\prime}, e)\geq \tau_{1})=\Re_{1})e^{\epsilon_{2}}Pr(\mathcal{M}_{3}(D^{\prime},e)=\widetilde{nc_{e}}|\Re_{1})\\ &&+e^{\epsilon_{1}}Pr(\mathcal{M}_{2} (\mathcal{M}_{1}(D^{\prime},e)< \tau_{1})=\Re_{2})e^{\epsilon_{2}}Pr(\mathcal{M}_{3}(D^{\prime},e)=\widetilde{nc_{e}}|\Re_{2})\\ &=&e^{\epsilon_{1}+\epsilon_{2}}(Pr(\mathcal{M}_{2}(\mathcal{M}_{1}(D^{\prime}, e)\geq \tau_{1})=\Re_{1})Pr(\mathcal{M}_{3}(D^{\prime},e)=\widetilde{nc_{e}}|\Re_{1})\\ &&+Pr(\mathcal{M}_{2} (\mathcal{M}_{1}(D^{\prime},e)< \tau_{1})=\Re_{2})Pr(\mathcal{M}_{3}(D^{\prime},e)=\widetilde{nc_{e}}|\Re_{2}))\\ &=&e^{\epsilon}Pr(\mathcal{A}(D^{\prime},e)=\widetilde{nc_{e}}) \end{array} $$

Here, 𝜖 = 𝜖1 + 𝜖2. For the input D, we can utilize all segments to divide D into mutually disjoint sub-datasets. By the parallel composition property, the privacy budgets used in computing the user count of each segment do not need to accumulate. Thus, \(Pr(\mathcal {A}(D)=PSD_{u})\) is no more than \(e^{\epsilon }Pr(\mathcal {A}(D^{\prime })=PSD_{u})\), which completes the proof.
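For concreteness, the following C++ sketch shows the Laplace perturbation that underlies the mechanisms in this proof: a count query of sensitivity 2 is released as the true count plus Lap(2/𝜖) noise. This is a minimal illustration only; the customized PSD additionally performs the sparse check (M1) and the grouping (M2) before publishing.

```cpp
#include <cmath>
#include <random>

// Sample Lap(0, scale) by inverse transform: for U ~ Uniform(-1/2, 1/2),
// X = -scale * sgn(U) * ln(1 - 2|U|) follows the Laplace distribution.
double laplaceNoise(double scale, std::mt19937& gen) {
    std::uniform_real_distribution<double> uni(-0.5, 0.5);
    double u = uni(gen);
    return -scale * std::copysign(1.0, u) * std::log(1.0 - 2.0 * std::fabs(u));
}

// Release a segment's user count with sensitivity 2, as in the proof:
// changing one user's location alters the count query result by at most 2.
double noisyCount(int trueCount, double epsilon, std::mt19937& gen) {
    return trueCount + laplaceNoise(2.0 / epsilon, gen);
}
```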

7 Experimental evaluation

In this section, we conduct extensive experiments to evaluate the effectiveness and efficiency of the proposed framework PrivSem. Our methods are implemented in C++. All experiments are conducted on a machine with an Intel(R) Core(TM) i7-2600 CPU at 3.40GHz, 8GB memory, and a 500GB hard disk.

7.1 Datasets and compared methods

(1) Datasets. In the experiments, two real road network datasets are used: the California and Oldenburg road networks. These datasets involve diversified POIs (e.g., hospital, church, school), which we use as query objects in the experiments. The parameters of the two real road networks are summarized in Table 1.

(2) Query generator. In all experiments, we use the Network-based Generator of Moving Objects to generate 10,000 moving objects on the maps. Because the two maps are of different scales, we can simulate both peak and off-peak traffic conditions. In each simulation, each moving object generates a set of KNN queries (possibly none) with a randomly assigned probability; the query parameters are listed in Table 2. More specifically, the parameters k, l, σt, K, and γ follow normal distributions (see the sketch after this list). After issuing a query request, a moving object waits for some inter-wait time γ, until the request is either answered or dropped, before issuing another query request. The parameter c follows a uniform distribution over the interval [0,62]. We consider privacy budgets 𝜖 ∈ {0.5, 0.75, 1, 1.25, 1.5}, ranging from strict to loose privacy requirements.

(3) Compared methods. Under the PrivSem framework, we implement three methods: EIRank+, Naive, and PrivSem+. EIRank+ protects the identity and location privacy of query users over road networks; it does not consider other users' privacy and leverages the exact count of each segment. Conversely, the latter two methods take other users' privacy into consideration. The difference between them is the way a CR is determined: Naive determines an effective CR using the naive user PSD, while PrivSem+ uses the customized user PSD. In the experiments, we let τ1 = 15, τ2 = 5, τ3 = 0.5, α = 0.5, and \(\epsilon _{1}=\frac {1}{2}\epsilon \) for all datasets. In addition, we compare our EIRank+ method with the SA algorithm [27], which uses k-anonymity and 𝜃-security semantics to protect query users on road networks; it is straightforward to convert 𝜃-security semantics into l-semantic diversity.
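The sketch below illustrates how one query request could be drawn per moving object under the distributional assumptions in item (2); all concrete means and deviations in the code are hypothetical placeholders, since the actual settings are those of Table 2.

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// A hypothetical sketch of drawing one query request per moving object.
// The distribution parameters below are illustrative placeholders only.
struct QueryRequest {
    int k;          // k-anonymity level
    int l;          // segment l-semantic diversity
    int K;          // number of requested POIs (KNN)
    double sigmaT;  // maximum tolerable delay sigma_t
    double c;       // parameter c ~ Uniform[0, 62]
};

QueryRequest drawQuery(std::mt19937& gen) {
    std::normal_distribution<double> kDist(10.0, 3.0), lDist(4.0, 1.0),
                                     KDist(8.0, 2.0), sigmaDist(5.0, 1.5);
    std::uniform_real_distribution<double> cDist(0.0, 62.0);
    QueryRequest q;
    q.k = std::max(2, static_cast<int>(std::lround(kDist(gen))));
    q.l = std::max(2, static_cast<int>(std::lround(lDist(gen))));
    q.K = std::max(1, static_cast<int>(std::lround(KDist(gen))));
    q.sigmaT = std::max(0.0, sigmaDist(gen));
    q.c = cDist(gen);
    return q;
}
```

The inter-wait time γ would be drawn analogously, and an object's next request is only issued once the current one has been answered or dropped.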

Table 1 Real dataset parameters
Table 2 Parameter setting

7.2 Experimental results

A. Effectiveness of PrivSem for Query Users.

In the first set of experiments, we evaluate the utility of the PrivSem framework for query users only. Figure 7 shows the results with different parameters. From Figure 7a, it can be seen that the ASR of both SA and EIRank+ tends to decrease as k increases. This is because a larger k requires more cloaking time to anonymize a query request, causing many waiting requests to be dropped. Figure 7b and c show that EIRank+ substantially outperforms SA in terms of RAL and system overhead. The main reason is that the cloaking strategies of the two algorithms differ: EIRank+ performs segment-based perturbation, which stops as soon as the user-specified requirement is met, whereas SA performs vertex Voronoi-based perturbation. As a result, a CR of SA is larger than that of EIRank+ and contains more users.

Figures 7d–f demonstrate the impact of varying the semantic diversity on the performance of the two algorithms. From the results, we observe that as l increases, EIRank+ consistently outperforms SA in terms of ASR and system overhead. What is slightly surprising is that the RSL of SA remains unchanged while that of EIRank+ increases. This phenomenon is reasonable: for the SA algorithm, the semantic number of a CR exactly equals the user-defined semantic diversity, whereas for EIRank+, to resist reverse engineering attacks, the latest CR of each bucket contains more than l semantics.

Figure 8 depicts the time cost of the two algorithms with different parameters. As expected, when the location anonymization algorithm has to generate larger CRs to satisfy stricter privacy requirements, the cloaking time of both approaches increases (Figure 8a). Such larger CRs lead to a larger search space at the LBS provider, so the query quality worsens as the privacy requirements become stricter (Figure 8d and e). In addition, with the help of the optimization strategies (RSC and DSA), the cloaking time of EIRank+ decreases as l or σt becomes larger (Figure 8b and c). When more POIs are retrieved, i.e., for a larger K, the query processing time increases.

B. Effect of Other Users' Privacy.

In this set of experiments, we investigate the effect of considering other users' privacy. As shown in Figure 9, in general, PrivSem+ is slightly inferior to EIRank+ in terms of ASR, RAL (RSL), and system overhead.

Varying k.:

Figure 9a–c illustrate the performance of the two algorithms with different k. As shown, the performance of both algorithms deteriorates as k increases. This is intuitive, as a larger k imposes a stronger k-anonymity constraint, thus taking more cloaking time and generating a larger CR. Further, for PrivSem+, a larger CR tends to introduce more noise; compared with EIRank+, this noisy count brings more counting error. Hence, PrivSem+ exhibits relatively poor performance.

Varying 𝜖.:

Figure 9d–f show the performance of the two algorithms under varying privacy budget 𝜖. From the results, it is clear that the performance of EIRank+ remains constant. This is because EIRank+ is an exact algorithm and is not affected by 𝜖. In addition, the utility of PrivSem+ improves as 𝜖 increases. This conforms to the theoretical analysis that a larger privacy budget results in less noise and therefore more accurate results.

C. Effect of Customized User PSD.

Figure 10 shows the performance of the Naive and PrivSem+ algorithms with different 𝜖 values. Clearly, PrivSem+ consistently achieves better performance at the same level of privacy. Besides, both algorithms exhibit a similar trend: the utility of the results improves as 𝜖 increases. This is because, when the privacy budget 𝜖 increases, a smaller amount of perturbation noise is required and a lower degree of privacy is guaranteed. Here, we omit the effect of k on these two algorithms since they present a trend similar to Figure 9a–c.

D. Effect of Optimization Strategies.

In this part, we study how our optimization strategies affect the performance of PrivSem+. From Figure 11, we can see that, without RSC, directly calculating the basic semantic-based cloaked area from scratch each time produces poor results. This is in line with our analysis: by effectively reducing the number of calculations, the cloaking time of anonymizing requests can be significantly reduced. A similar trend is also observed for DSA, which means that our proposed optimization strategy DSA is effective in improving the utility of the cloaking algorithm.

Figure 7: Effectiveness comparison of EIRank+ and SA

Figure 8: Time cost of EIRank+ and SA

Figure 9: Effect of other users' privacy

Figure 10: Effectiveness comparison of PrivSem+ and Naive

Figure 11: Effect of optimization strategies

8 Related work

In this section, we discuss the two main streams of research that closely relate to our work: location privacy and location semantics. We introduce each stream in more detail below.

8.1 Location privacy

In the past decades, with the explosive growth of LBS [23, 43], location privacy has received increasing attention [2, 3, 30, 38, 46]. Most techniques for location privacy preservation can be categorized into location anonymization approaches and differential privacy based approaches.

Location anonymization

Location anonymization has gained popularity as a solution to preserving user location privacy in the LBS. It mainly leverages location obfuscation to perturb a user's exact location. Generally, it can be further classified into fake location [22, 40, 54], space transformation [6, 15, 36], mix-zones [35], and spatial cloaking. The main idea of fake location is to send a fake location or a series of dummy locations including the user's exact location to the LBS providers. Its major shortcoming is the expensive dummy generation cost; moreover, it sometimes cannot achieve the claimed privacy protection level due to the distance intersection attack [17]. Space transformation transforms the original data space into another space while maintaining spatial proximity, usually based on cryptographic theory. Although it provides strong privacy, it also incurs significant computation cost, which limits its applicability. Techniques based on mix-zones preserve a user's location by concealing it in a zone. Among these diverse anonymization strategies, spatial cloaking is the most prominent and the most relevant to ours.

Spatial cloaking blurs the exact location of a user into a CR until certain privacy metrics are satisfied, such as k-anonymity [41] and l-diversity [31]. Location k-anonymity was initially proposed by Gruteser et al. [16]. Subsequently, a series of studies has been conducted to improve the computation of a CR. CliqueCloak [14] and Casper [34] both proposed personalized location anonymization: the former locates a clique in a graph to compute the CR, while the latter utilizes a quadtree-based index structure for fast computation of the CR. HilbertCloak [19] uses the Hilbert space-filling curve, and its CR is independent of the mobile user distribution. Besides, there also exist studies using other techniques to generate the CR, such as probabilistic cloaking [5], historical locations-based location privacy [50], and feeling-based location privacy [51].

Using the location k-anonymity technique, the CR may include only one meaningful location (e.g., a specific clinic or church) and thus reveal a strong relationship to that location. To avoid this situation, PrivacyGrid [1] proposed location l-diversity, which enlarges a CR until l-1 different locations are included. Unfortunately, most of these existing spatial techniques are no longer applicable on road networks, because their area-based granularity of measurement fails there.

Recently, several studies have focused on location privacy preservation over road networks [7, 24, 25, 28, 45]. One of the best-known techniques is based on the concept of segment l-diversity [7, 45]. For instance, XSTAR [45] attempts to achieve an optimal balance between high query-processing efficiency and robust inference-attack resilience while taking k-anonymity and segment l-diversity into account together. However, as mentioned, these techniques cannot prevent location semantic information from leaking.

Differential privacy based location privacy

Differential privacy is a strong privacy model that originated in statistical analysis and has gradually been extended to location data. To date, a large body of literature has focused on differentially private location data analysis and publishing, which can be further divided into one-time release of static data [8, 20, 37, 47,48,49] and continuous release of dynamic data [12, 21, 44]. The work most closely related to ours is differentially private location data publishing [8, 37, 48] for count queries. Generally, these studies resort to standard spatial indexing, e.g., grids and quad-trees, to provide a private description of the data distribution. Various fundamental steps, such as selecting splitting points and describing the data distribution within a region, must be done privately. In order to minimize the non-uniformity error, Xiao et al. [48] employed a heuristic to select the split points of KD-trees so that the two sub-regions are as close to uniform as possible. Instead of using a uniformity heuristic, Cormode et al. [8] split the nodes along the median of the partition dimension. Qardaji et al. [37] introduced an adaptive-grid method that lays a coarse-grained grid over the dataset and then further partitions each cell according to its noisy count.

Differential privacy has also been applied to complex location data mining tasks. For instance, He et al. [18] studied differentially private trajectory publishing. To et al. [42] investigated location protection for worker datasets in spatial crowdsourcing. Li et al. [26] introduced privacy-preserving trajectory analysis for points-of-interest recommendation. These problems are orthogonal to ours.

8.2 Location semantics

Generally, sensitive information is often exposed through query semantics or location semantics. The first case implies that the query content of a CR should be diverse. In this paper, our work focuses on the second case, i.e., location semantics may disclose sensitive information. With the location l-diversity cloaking technique, the generated CR may embrace multiple locations yet include only one place type. Several previous studies [9, 52] have identified such semantic breach issues.

Lee et al. [25] proposed mining location semantics using the Earth Mover's Distance to prevent location semantics from leaking. Yigitoglu et al. [53] extended the semantic location cloaking model [10] to provide location privacy protection in urban settings. Since the CRs are generated offline for a particular privacy requirement, this approach fails to support varied privacy requirements. Recently, Li et al. [27] addressed this issue based on the vertex Voronoi-partition. Unfortunately, similar to the work in [25], it is vulnerable to reverse engineering attacks. To overcome these drawbacks, in earlier work we proposed EIRank [29] to resist semantic-based attacks. The differences of this extended manuscript from our conference version [29] are as follows:

  • We identify the privacy leakage problem of other users in the context of location-based queries and present a framework that achieves differential privacy guarantees; Section 3 summarizes this newly added content.

  • We propose an error analysis model that quantifies the difference between the noisy user count of a cloaked area and its real count (Section 5.1), and we devise a search strategy that finds appropriate PSD regions to ensure a high success rate of the cloaking algorithm (Section 5.2).

  • We develop a series of optimization strategies to further improve the performance of the proposed framework. This newly added part is detailed in Section 5.3.

  • Motivated by the data sparsity issue, we design a customized user PSD to mitigate the influence of perturbation noise on the released counts; this content is newly added in Section 5.1.

  • We give a formal analysis of the privacy guarantees of PrivSem in the newly added Section 6. Besides, we conduct more experiments on real datasets to evaluate its effectiveness and efficiency.

9 Conclusion

Protecting mobile user privacy is a fundamental problem in the LBS and has attracted intensive interest. However, most existing methods ignore semantic information and other users' privacy requirements. In this paper, we propose PrivSem, a novel framework that integrates location k-anonymity, segment l-semantic diversity, and differential privacy to protect user privacy over road networks. Under this framework, we determine a CR using data sanitized according to DP rather than the original data. This task is challenging due to the uncertainty introduced by DP. To address it, we present an error analysis model to quantify the error incurred in computing a CR. In addition, data sparsity in the spatial domain imposes another challenge to both user privacy and utility. To address this issue, we further propose a customized user PSD that groups similar segments together to release more accurate counts. Extensive experiments on several real road network datasets demonstrate the efficiency and effectiveness of our proposed framework PrivSem.