
1 Introduction

The era of big data has arrived, and it has been called the "fourth paradigm" of scientific research. More and more databases are being used in fields such as healthcare, education, finance, population, transportation, and science and technology, creating huge social benefits. However, privacy concerns hinder the wide use of these data: people may refuse to provide sensitive information such as salary, diseases, and behavioral records. How to release useful information without revealing individuals' privacy has therefore become a hot issue.

Dwork proposed the concept of differential privacy [2,3,4,5], which remains the state-of-the-art notion in data privacy. It provides a rigorous guarantee that removing or adding a single database item does not noticeably influence the outcome of any analysis. However, the initial framework of differential privacy is only effective for independent data records.

In practice, tuple correlation occurs naturally in datasets. User activity streams such as time-series data, GPS trajectories, and social networks typically generate correlated records. It has been shown that attackers can exploit data correlations to improve their inferences about individuals and cause privacy leakage [7]. Group differential privacy has been proposed to solve this problem [5]; it extends the privacy protection on an individual to a group of correlated individuals, but the required noise may severely degrade data utility.

Pufferfish, proposed in [8], is based on differential privacy but accommodates more situations. A Pufferfish framework has three important components: a set of potential secrets \( S \), a set of discriminative pairs \( S_{pairs} \), and a set of data evolution scenarios \( D(\theta \in D) \). It promises that the secret pairs are indistinguishable to the adversary. \( D \) captures how much knowledge potential attackers have, and can therefore take data correlations into consideration. However, the framework does not propose any specific perturbation algorithm to handle the correlation. Song et al. adopted the framework and used it to protect the privacy of time-series data such as physical activity measurements and power consumption data [11].

However, prior work focuses on the correlations among individuals with only one attribute [14], or on multiple attributes of a single individual [12]. In this paper, we consider the correlations among individuals as well as the correlations among multiple attributes inside each sequence, as in different people's time-series data. Such databases have wide applications, including stock markets, disease surveillance, and real-time traffic monitoring. For example, in a database that records the physical activities of members of the same family or company across time, each individual contributes one record, and each record is a data sequence. Our goal is to publish aggregate statistics on individuals' activities without leaking the privacy of a specific individual; here the secret is an individual's activity at any given moment.

The contributions of our paper can be summarized as follows:

  • We consider the simultaneous privacy protection for two types of correlations among categorical data sequences: the correlations among individuals, and the correlations inside each sequence.

  • We propose a protection mechanism based on Pufferfish privacy, modelling the correlations among variables with a multi-dimensional Markov Chain.

  • We conduct experiments on simulated data and demonstrate that our privacy mechanism provides both high privacy and utility guarantees.

2 Related Work

In the past decade, a growing body of work has been published on differential privacy [2,3,4,5]. As explained earlier, differential privacy assumes that records are independent, so it is not the right framework for scenarios where records are correlated.

Correlated differential privacy has emerged to solve this problem. Kifer [7] was the first to point out that differential privacy may not guarantee privacy when data correlations are ignored, and later proposed Pufferfish privacy [8], a generalization of differential privacy. That work provides some specific instances of the Pufferfish framework but lacks concrete mechanisms for many practical applications.

Existing privacy mechanisms for correlated data publishing can be classified into two types. The first replaces the global sensitivity with new correlation-based parameters, such as the dependence coefficient [9] and correlated sensitivity [15]; [10] used the Maximal Information Coefficient to measure correlations and achieved correlated differential privacy for big data publication. The second uses appropriate models to describe the correlations between variables. [14] modified Pufferfish and proposed Bayesian differential privacy, which represents data correlations by a Gaussian correlation model. Song et al. proposed the Markov Quilt Mechanism, representing data correlations via a Bayesian network [11]. There are also efforts on time-series release, such as [13], and on high-dimensional data release based on Markov networks [12]. However, these efforts consider only one-dimensional correlations, so they cannot simultaneously protect the two types of correlations: the correlations among sequences, and the correlations inside each sequence.

3 Preliminaries

We introduce some basic concepts in this section, including the Pufferfish privacy mechanism, multi-dimensional Markov Chain models, global sensitivity, and the Laplace mechanism. To start with, Table 1 lists the notations used throughout this paper and their explanations.

Table 1. Table of notations

3.1 Pufferfish Privacy Mechanism

We use the Pufferfish framework as our privacy definition and extend it to our setting. A Pufferfish framework consists of three parts: a set of potential secrets \( S \), a set of discriminative pairs \( S_{pairs} \), and a set of data evolution scenarios \( D(\theta \in D) \). \( S \) captures what is protected: the set of secrets that refer to individuals' private data. \( S_{pairs} \) captures how it is protected: attackers must not be able to distinguish between the secrets of a pair. Finally, \( D \) captures how much knowledge potential attackers have; it is a collection of plausible data-generating distributions, and it is where the correlations of the data are modeled in this paper. Each \( \theta \in D \) represents an adversary's belief about how the data are generated, and indistinguishability must hold under every such belief.

Definition 3.1 (\( \epsilon \)-Pufferfish(\( S, S_{pairs} , D \)) Privacy).

Given a set of potential secrets \( S \), a set of discriminative pairs \( S_{pairs} \) (with \( (s_{i}, s_{j}) \in S_{pairs} \)), a set of data evolution scenarios \( D(\theta \in D) \), and a privacy parameter \( \epsilon > 0 \), a mechanism \( M \) satisfies \( \epsilon \)-Pufferfish(\( S, S_{pairs}, D \)) privacy if

$$ P(M(X) = \omega \mid s_{i}, \theta) \le e^{\epsilon} P(M(X) = \omega \mid s_{j}, \theta) $$
(1)
$$ P(M(X) = \omega \mid s_{j}, \theta) \le e^{\epsilon} P(M(X) = \omega \mid s_{i}, \theta) $$
(2)

Equivalently,

$$ e^{-\epsilon} \le \frac{P(s_{i} \mid M(X) = \omega, \theta)}{P(s_{j} \mid M(X) = \omega, \theta)} \Big/ \frac{P(s_{i} \mid \theta)}{P(s_{j} \mid \theta)} \le e^{\epsilon} $$
(3)

whenever \( s_{i} \) and \( s_{j} \) satisfy \( P(s_{i} \mid \theta) \ne 0 \) and \( P(s_{j} \mid \theta) \ne 0 \).
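To make the definition concrete, the following minimal sketch checks inequalities (1) and (2) numerically for a toy randomized-response mechanism over a binary secret. The mechanism and its parameters are our own illustration (with \( \theta \) trivial, the secret determining the record), not the mechanism proposed later in this paper:

```python
import numpy as np

# Toy illustration: randomized response over a binary secret. We verify that
# the ratio of output probabilities stays within [e^-eps, e^eps], as in (1)-(2).
eps = np.log(3.0)
p_keep = np.exp(eps) / (1.0 + np.exp(eps))   # report the truth w.p. p_keep

# P[s, omega] = P(M(X) = omega | s): rows are secrets, columns are outputs.
P = np.array([[p_keep, 1.0 - p_keep],
              [1.0 - p_keep, p_keep]])

for omega in (0, 1):
    ratio = P[0, omega] / P[1, omega]
    assert np.exp(-eps) - 1e-12 <= ratio <= np.exp(eps) + 1e-12
    print(f"omega={omega}: ratio={ratio:.3f} <= e^eps={np.exp(eps):.3f}")
```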

3.2 Multi-dimensional Markov Chain Models

Markov Chain models are widely used for modeling data sequences [1]. In our work, we use a multi-dimensional Markov Chain model for correlated data sequences such as sales demand data, stock index data, and the physical activities of a group of individuals. We assume there are s sequences \( \left\{ y_{n}^{\left( k \right)}, k = 1,2, \ldots ,s \right\} \), where \( y_{n}^{\left( k \right)} \) is the state probability distribution vector of the kth sequence at time n. Each sequence has m possible states in M. If the kth sequence is in state j with probability one at time n, we write \( P\left\{ y_{n}^{\left( k \right)} = j \right\} = 1 \), or

$$ y_{n}^{\left( k \right)} = \left( {0, \ldots ,0,\underbrace {1}_{j},0, \ldots ,0} \right)^{T} $$
(4)

The following conditions are satisfied in a multivariate Markov Chain model:

$$ y_{n + 1}^{\left( j \right)} = \sum\nolimits_{k = 1}^{s} {\lambda_{jk} } P^{{\left( {jk} \right)}} y_{n}^{\left( k \right)} , \left( {j = 1,2, \ldots ,s} \right) $$
(5)

where \( \sum\nolimits_{k = 1}^{s} {\lambda_{jk} } = 1, \lambda_{jk} \ge 0, 1 \le j,k \le s \). \( P^{{\left( {jk} \right)}} \) is the matrix of transition probabilities from the states of the kth sequence at time n to the states of the jth sequence at time (n + 1), and \( \lambda_{jk} \) is the weight of the kth sequence's influence on the jth sequence.

The state probability distribution of the jth chain at time (n + 1) depends only on the weighted average of \( P^{{\left( {jk} \right)}} y_{n}^{\left( k \right)} \): it is related to the state distributions of the s sequences at time n, but is independent of the states before time n. In matrix notation:

$$ \left( {\begin{array}{*{20}c} {y_{n + 1}^{\left( 1 \right)} } \\ {y_{n + 1}^{\left( 2 \right)} } \\ \vdots \\ {y_{n + 1}^{\left( s \right)} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {\lambda_{11} P^{{\left( {11} \right)}} } & {\lambda_{12} P^{{\left( {12} \right)}} } & \cdots & {\lambda_{1s} P^{{\left( {1s} \right)}} } \\ {\lambda_{21} P^{{\left( {21} \right)}} } & {\lambda_{22} P^{{\left( {22} \right)}} } & \cdots & {\lambda_{2s} P^{{\left( {2s} \right)}} } \\ \vdots & \vdots & \vdots & \vdots \\ {\lambda_{s1} P^{{\left( {s1} \right)}} } & {\lambda_{s2} P^{{\left( {s2} \right)}} } & \cdots & {\lambda_{ss} P^{{\left( {ss} \right)}} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {y_{n}^{\left( 1 \right)} } \\ {y_{n}^{\left( 2 \right)} } \\ \vdots \\ {y_{n}^{\left( s \right)} } \\ \end{array} } \right) $$
(6)

Let \( y_{n} = \left( {y_{n}^{\left( 1 \right)} ,y_{n}^{\left( 2 \right)} , \ldots ,y_{n}^{\left( s \right)} } \right)^{T} \), then \( y_{n + 1} = Qy_{n} \).
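As a minimal sketch of Eqs. (5) and (6), the following code assembles the block matrix Q from transition matrices and weights and iterates \( y_{n+1} = Qy_{n} \). The numeric values are our own illustration, not parameters from the paper:

```python
import numpy as np

m, s = 2, 2
# P[j][k] maps the state distribution of sequence k at time n to a
# distribution over sequence j's states at time n+1 (columns sum to 1).
P = [[np.array([[0.9, 0.2], [0.1, 0.8]]), np.array([[0.6, 0.4], [0.4, 0.6]])],
     [np.array([[0.7, 0.3], [0.3, 0.7]]), np.array([[0.8, 0.1], [0.2, 0.9]])]]
lam = np.array([[0.5, 0.5],
                [0.4, 0.6]])   # each row sums to 1

# Assemble the block matrix Q of Eq. (6).
Q = np.block([[lam[j, k] * P[j][k] for k in range(s)] for j in range(s)])

y = np.concatenate([np.array([1.0, 0.0]),    # sequence 1 starts in state 0
                    np.array([0.0, 1.0])])   # sequence 2 starts in state 1
for _ in range(100):
    y = Q @ y                                # y_{n+1} = Q y_n
print(y.reshape(s, m))  # approximate stationary distribution per sequence
```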

Lemma 1.

If \( \lambda_{jk} \ge 0 \) for \( 1 \le j,k \le s \), then the matrix Q has an eigenvalue equal to 1, and all eigenvalues of Q have modulus smaller than or equal to 1.

Lemma 2.

For \( 1 \le j,k \le s \), assume that \( \lambda_{jk} \ge 0 \) and \( P^{{\left( {jk} \right)}} \) is irreducible. Then there exists a stationary vector \( y = \left( y^{\left( 1 \right)}, y^{\left( 2 \right)}, \ldots, y^{\left( s \right)} \right)^{T} \) such that \( y = Qy \) and \( \sum\nolimits_{i = 1}^{m} \left[ y^{\left( j \right)} \right]_{i} = 1, 1 \le j \le s. \)

To obtain the values of the parameters, the transition probability matrix of each pair of data sequences must first be estimated. Let \( f_{{i_{j} i_{k} }}^{{\left( {jk} \right)}} \) denote the frequency of transitions from state \( i_{k} \) in the sequence \( \left\{ {y_{n}^{\left( k \right)} } \right\} \) to state \( i_{j} \) in the sequence \( \left\{ {y_{n}^{\left( j \right)} } \right\} \). The transition frequency matrix can then be written as follows:

$$ F^{{\left( {jk} \right)}} = \left( {\begin{array}{*{20}c} {f_{11}^{{\left( {jk} \right)}} } & \cdots & \cdots & {f_{1m}^{{\left( {jk} \right)}} } \\ {f_{21}^{{\left( {jk} \right)}} } & \cdots & \cdots & {f_{2m}^{{\left( {jk} \right)}} } \\ \vdots & \vdots & \vdots & \vdots \\ {f_{m1}^{{\left( {jk} \right)}} } & \cdots & \cdots & {f_{mm}^{{\left( {jk} \right)}} } \\ \end{array} } \right) $$
(7)

The transition probabilities are then estimated by normalizing each column (so that \( P^{(jk)} \) maps probability vectors to probability vectors):

$$ \hat{p}_{i_{j} i_{k}}^{\left( jk \right)} = \left\{ \begin{array}{ll} \dfrac{f_{i_{j} i_{k}}^{\left( jk \right)}}{\sum\nolimits_{i_{j} = 1}^{m} f_{i_{j} i_{k}}^{\left( jk \right)}}, & \text{if } \sum\nolimits_{i_{j} = 1}^{m} f_{i_{j} i_{k}}^{\left( jk \right)} \ne 0 \\ 0, & \text{otherwise} \end{array} \right. $$
(8)

Using the transition frequency matrix \( F^{{\left( {jk} \right)}} \) and this normalization rule, one obtains the estimate of the transition probability matrix \( P^{{\left( {jk} \right)}} \):

$$ \widehat{P}^{{\left( {jk} \right)}} = \left( {\begin{array}{*{20}c} {\hat{p}_{11}^{{\left( {jk} \right)}} } & \cdots & \cdots & {\hat{p}_{1m}^{{\left( {jk} \right)}} } \\ {\hat{p}_{21}^{{\left( {jk} \right)}} } & \cdots & \cdots & {\hat{p}_{2m}^{{\left( {jk} \right)}} } \\ \vdots & \vdots & \vdots & \vdots \\ {\hat{p}_{m1}^{{\left( {jk} \right)}} } & \cdots & \cdots & {\hat{p}_{mm}^{{\left( {jk} \right)}} } \\ \end{array} } \right) $$
(9)
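The following sketch estimates \( \hat{P}^{(jk)} \) from a pair of observed state sequences, following Eqs. (7)-(9) with the column normalization above. The function name and the toy data are our own illustration:

```python
import numpy as np

def transition_estimate(seq_j, seq_k, m):
    """Estimate P^(jk) of Eq. (9): count transitions from the state of
    sequence k at time n to the state of sequence j at time n+1 (Eq. (7)),
    then normalize each column, leaving all-zero columns as zero (Eq. (8))."""
    F = np.zeros((m, m))
    for t in range(len(seq_k) - 1):
        F[seq_j[t + 1], seq_k[t]] += 1            # f_{i_j i_k}
    col_sums = F.sum(axis=0, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(col_sums > 0, F / col_sums, 0.0)

# Toy usage with two binary sequences (illustrative data only).
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=200)
x2 = rng.integers(0, 2, size=200)
print(transition_estimate(x1, x2, m=2))   # estimate of P^(12)
```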

We also need to estimate the parameters \( \lambda_{jk} \). The multi-dimensional Markov Chain has a stationary probability vector y, which we estimate by computing the empirical probability of each state in each sequence; the estimate is denoted \( \widehat{y} = \left( \widehat{y}^{\left( 1 \right)}, \widehat{y}^{\left( 2 \right)}, \ldots, \widehat{y}^{\left( s \right)} \right)^{T} \), and it should approximately satisfy \( \widehat{y} = Q\widehat{y} \). The values of \( \lambda_{jk} \) can then be obtained by solving the following optimization problem:

$$ \left\{ \begin{array}{c} \mathop{\min}\limits_{\lambda} \, \mathop{\max}\limits_{i} \left| \left[ \sum\nolimits_{k = 1}^{s} \lambda_{jk} \widehat{P}^{\left( jk \right)} \widehat{y}^{\left( k \right)} - \widehat{y}^{\left( j \right)} \right]_{i} \right| \\ \text{subject to} \ \sum\nolimits_{k = 1}^{s} \lambda_{jk} = 1 \ \text{and} \ \lambda_{jk} \ge 0, \ \forall k \end{array} \right. $$
(10)

This problem can be formulated as a linear program. Let \( B = \left[ \widehat{P}^{\left( j1 \right)} \widehat{y}^{\left( 1 \right)} \left| \widehat{P}^{\left( j2 \right)} \widehat{y}^{\left( 2 \right)} \right| \ldots \left| \widehat{P}^{\left( js \right)} \widehat{y}^{\left( s \right)} \right. \right] \); the model can then be written as follows. For each j:

$$ \mathop {\hbox{min} }\limits_{\lambda } w_{j} $$

Subject to

$$ \left\{ \begin{array}{c} \left( \begin{array}{c} w_{j} \\ w_{j} \\ \vdots \\ w_{j} \end{array} \right) \ge \widehat{y}^{\left( j \right)} - B\left( \begin{array}{c} \lambda_{j1} \\ \lambda_{j2} \\ \vdots \\ \lambda_{js} \end{array} \right), \\ \left( \begin{array}{c} w_{j} \\ w_{j} \\ \vdots \\ w_{j} \end{array} \right) \ge - \widehat{y}^{\left( j \right)} + B\left( \begin{array}{c} \lambda_{j1} \\ \lambda_{j2} \\ \vdots \\ \lambda_{js} \end{array} \right), \\ w_{j} \ge 0, \\ \sum\nolimits_{k = 1}^{s} \lambda_{jk} = 1 \ \text{and} \ \lambda_{jk} \ge 0, \ \forall k \end{array} \right. $$
(11)
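A minimal sketch of this linear program using scipy.optimize.linprog, with decision variables \( (w_{j}, \lambda_{j1}, \ldots, \lambda_{js}) \), might look as follows (the function name and interface are our own):

```python
import numpy as np
from scipy.optimize import linprog

def estimate_lambda(P_hat, y_hat, j):
    """Solve LP (11) for row j. P_hat[j][k] are the estimated transition
    matrices; y_hat[k] are the estimated stationary vectors (length m)."""
    s, m = len(y_hat), len(y_hat[0])
    # B = [P^(j1) y^(1) | ... | P^(js) y^(s)], an m x s matrix.
    B = np.column_stack([P_hat[j][k] @ y_hat[k] for k in range(s)])
    c = np.zeros(1 + s); c[0] = 1.0            # minimize w_j
    ones = np.ones((m, 1))
    A_ub = np.vstack([np.hstack([-ones, -B]),  # w_j*1 >= y_hat_j - B*lam
                      np.hstack([-ones,  B])]) # w_j*1 >= B*lam - y_hat_j
    b_ub = np.concatenate([-y_hat[j], y_hat[j]])
    A_eq = np.zeros((1, 1 + s)); A_eq[0, 1:] = 1.0   # sum_k lam_jk = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (1 + s))      # w_j, lam_jk >= 0
    return res.x[1:]                           # lambda_{j1}, ..., lambda_{js}
```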

3.3 Additional Notions

We introduce some additional definitions and notation to conclude this section.

Definition 3.4 (global sensitivity).

Let f be a function that maps a dataset to a fixed-size vector of real numbers (i.e., \( f: X \to R^{d} \)). For any two neighboring databases X and \( X^{\prime} \), the sensitivity of f is defined as

$$ GS_{f} = \mathop {\hbox{max} }\limits_{{X,X^{\prime}}} \left\| {f\left( X \right) - f\left( {X^{\prime}} \right)} \right\|_{p} $$
(12)

where p denotes the \( L_{p} \) norm used to measure \( \Delta f \); we usually use the \( L_{1} \) norm.

For any query function \( F: X \to R^{d} \), the privacy mechanism M

$$ M\left( X \right) = F\left( X \right) + Z $$
(13)

satisfies \( \epsilon \)-differential privacy, where \( Z \sim Lap\left( \Delta f / \epsilon \right) \). We use \( Lap\left( \sigma \right) \) to denote the Laplace distribution with mean 0 and scale parameter \( \sigma \); recall that its density function is \( h\left( x \right) = \frac{1}{2\sigma} e^{-\left| x \right|/\sigma} \).
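A minimal sketch of this Laplace mechanism, Eq. (13) (the helper name is our own):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng=None):
    """Release value + Z with Z ~ Lap(sensitivity / eps), as in Eq. (13)."""
    rng = rng or np.random.default_rng()
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps)

# Example: a counting query with L1 global sensitivity 1 and eps = 0.5.
print(laplace_mechanism(42, sensitivity=1.0, eps=0.5))
```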

4 A Mechanism for 2-Dimensional Correlated Data

4.1 Problem Statement

We consider a more restricted setting in which the database X consists of several categorical data sequences. We assume there are s categorical sequences, each with m possible states in M. Their dependence can be described by a multi-dimensional Markov Chain model, and the goal is to keep the value of each \( x_{i}^{k} \) private. We next use two examples to illustrate the problem.

Example 1: A Group Physical Activity Measurement.

A is a set of activities such as {walking, sleeping, working}, and \( s_{t}^{k*a} \) denotes the event that the kth person's state at moment t is activity a, i.e., \( x_{t}^{k} = a \). In the Pufferfish framework, we set S as {\( s_{t}^{k*a} : k = 1, \ldots ,s, t = 1, \ldots ,T, a \in A \)}, so the activity of each person at any specific moment t is a secret. \( S_{pairs} \) is the set of all pairs (\( s_{t}^{k*a} \), \( s_{t}^{k*b} \)) for a, b in A, for all t, and for each person; in other words, the attackers cannot tell whether a person is doing activity a or activity b at any time. D is a set of possible data-generating distributions, which captures how people switch between activities and how people influence each other. A plausible belief is to let D be a set of multi-dimensional Markov Chains in which each state is an activity in A. Each multi-dimensional Markov Chain is represented by an initial distribution \( y_{1} \) giving the initial state of each sequence, the transition probability matrices \( P^{{\left( {jk} \right)}} \), and the weights \( \lambda_{jk} \). For example, suppose we have two activities {walking, working}, represent walking by \( \left( {1, 0} \right)^{T} \), and have two sequences in the dataset. Then a distribution \( \theta \in D \) is represented by a tuple

$$ \left\{ {y_{1} , \left[ {\begin{array}{*{20}c} {P^{11} } & {P^{12} } \\ {P^{21} } & {P^{22} } \\ \end{array} } \right], \left[ {\begin{array}{*{20}c} {\lambda_{11} } & {\lambda_{12} } \\ {\lambda_{21} } & {\lambda_{22} } \\ \end{array} } \right]} \right\} $$

Then such a D can be the set:

$$ \left\{ \left( \begin{bmatrix} (0,1)^{T} \\ (1,0)^{T} \end{bmatrix}, \begin{bmatrix} \begin{bmatrix} 1 & 0.5 \\ 0 & 0.5 \end{bmatrix} & \begin{bmatrix} 0.7 & 0.6 \\ 0.3 & 0.4 \end{bmatrix} \\ \begin{bmatrix} 0.8 & 0.5 \\ 0.2 & 0.5 \end{bmatrix} & \begin{bmatrix} 0.9 & 0.6 \\ 0.1 & 0.4 \end{bmatrix} \end{bmatrix}, \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix} \right), \left( \begin{bmatrix} (1,0)^{T} \\ (0,1)^{T} \end{bmatrix}, \begin{bmatrix} \begin{bmatrix} 0 & 0.5 \\ 1 & 0.5 \end{bmatrix} & \begin{bmatrix} 0.5 & 0.5 \\ 0.6 & 0.4 \end{bmatrix} \\ \begin{bmatrix} 0.4 & 0.3 \\ 0.6 & 0.7 \end{bmatrix} & \begin{bmatrix} 0.9 & 0.6 \\ 0.1 & 0.4 \end{bmatrix} \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \right\} $$

Example 2: Sales Demand Data Sequences.

The database consists of a soft-drink company's sales demand data. The company has 5 products {A, B, C, D, E}, and each product is labeled by the moving rate of its sales volume: {very fast-moving, fast-moving, standard, slow-moving, very slow-moving, no sales volume}. Each customer of the company thus has 5 sales demand data sequences. The company can use the database to reduce its inventory and better meet each customer's needs, but it must not reveal a customer's privacy, i.e., the adversary must not be able to infer the customer's demand for any product at a specific time. Let M be the set of moving states and let \( s_{t}^{k*m} \) denote the event that the kth product's state at time t is m, namely \( x_{t}^{k} = m \). In the Pufferfish framework, we set S as {\( s_{t}^{k*m} : k = 1, \ldots ,5, t = 1, \ldots ,T, m \in M \)}, so the state of each product at each time t is a secret. \( S_{pairs} \) is the set of all pairs (\( s_{t}^{k*m} \), \( s_{t}^{k*n} \)) for m, n in M, for all t, and for each product. Similarly, D can be a set of multi-dimensional Markov Chains.

4.2 Our Mechanism

In our mechanism, we first use multi-dimensional Markov Chains to describe the 2-dimensional correlation and obtain the set of all possible distributions that could generate the data. Then we adopt the Pufferfish framework and customize the privacy definition for our application. Finally, we perturb the query result with appropriately calibrated noise, achieving both utility and privacy.

Our mechanism is based on the Laplace mechanism of differential privacy, which adds noise proportional to the global sensitivity to the result of F. In our mechanism, the noise is instead calibrated to the worst-case distance between the distributions \( P\left( {F\left( X \right) |s_{i} ,\theta } \right) \) and \( P\left( {F\left( X \right) |s_{j} ,\theta } \right) \) over the secret pairs \( (s_{i}, s_{j}) \). We first use the idea of the Earth Mover's Distance (EMD) to define the distance between two probability distributions.

Definition 4.1.

Let \( \mu ,\nu \) be two probability distributions on R, and let \( \varGamma \left( {\mu ,\nu } \right) \) be the set of all joint distributions with marginals \( \mu \) and \( \nu \). The distance between \( \mu \) and \( \nu \) is defined as:

$$ Distance_{\infty} \left( \mu ,\nu \right) = \inf_{\gamma \in \Gamma \left( \mu ,\nu \right)} \mathop{\max}\limits_{\left( a,b \right) \in support\left( \gamma \right)} \left| a - b \right| $$
(14)

This is the maximum distance that probability mass must be shifted when transforming \( \mu \) into \( \nu \) in the most economical way; in our mechanism, \( \mu \) and \( \nu \) are \( P\left( {F\left( X \right) |s_{i} ,\theta } \right) \) and \( P\left( {F\left( X \right) |s_{j} ,\theta } \right) \). To guarantee Pufferfish privacy, we add Laplace noise to the result of the query F proportional to \( Distance_{\infty } \left( {P\left( {F\left( X \right) |s_{i} ,\theta } \right),P\left( {F\left( X \right) |s_{j} ,\theta } \right) } \right) \). We describe the full mechanism in Algorithm 1.

Algorithm 1. The mechanism for 2-dimensional correlated data

Given a database X, a query F, a Pufferfish framework (\( S, S_{pairs} , D \)), and a privacy parameter \( \epsilon \), we find the supremum of the distance (EMD) between \( \mu_{i,\theta} \) and \( \mu_{j,\theta} \) over all pairs in \( S_{pairs} \) and all \( \theta \in D \). Then we add Laplace noise proportional to this distance to the result of F. The resulting mechanism for 2-dimensional correlated data satisfies Pufferfish privacy.
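The following sketch outlines Algorithm 1 under our reading of the text: the conditional distributions of F(X) are approximated by Monte Carlo sampling, \( Distance_{\infty} \) is approximated via the monotone (quantile) coupling of the two empirical samples, and Laplace noise scaled by the supremum distance is added. The `sampler` interface is hypothetical:

```python
import numpy as np

def w_inf(samples_a, samples_b):
    """Monte Carlo estimate of Distance_inf (Eq. (14)) between two
    one-dimensional distributions via the monotone (quantile) coupling;
    assumes both sample arrays have the same length."""
    return np.max(np.abs(np.sort(samples_a) - np.sort(samples_b)))

def pufferfish_release(X, F, secret_pairs, D, eps, sampler, n=2000, rng=None):
    """Sketch of Algorithm 1. `sampler(secret, theta, n)` is a hypothetical
    helper drawing n databases consistent with `secret` under `theta`."""
    rng = rng or np.random.default_rng()
    W = 0.0
    for s_i, s_j in secret_pairs:              # sup over discriminative pairs
        for theta in D:                        # ... and over adversary beliefs
            f_i = np.array([F(x) for x in sampler(s_i, theta, n)])
            f_j = np.array([F(x) for x in sampler(s_j, theta, n)])
            W = max(W, w_inf(f_i, f_j))
    return F(X) + rng.laplace(scale=W / eps)   # add Lap(W / eps) noise
```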

5 Experiments

We apply our mechanism to simulated data generated by a multi-dimensional Markov Chain with two sequences (s = 2), each of length T = 100 and with states {0, 1}. We use this prototype simulation to obtain an efficient implementation of our algorithm.

First, we generate the database X, which is determined by the initial distributions of the two sequences, given by the parameters \( q_{0}^{1} = P\left( {X_{1}^{1} = 0} \right) \) and \( q_{0}^{2} = P\left( {X_{1}^{2} = 0} \right) \); the transition probabilities \( P^{{\left( {jk} \right)}} \); and the weights \( \lambda_{jk} \), which all equal 0.5 in our setting. The transition probabilities are determined by four transition matrices \( P^{{\left( {11} \right)}} ,P^{{\left( {12} \right)}} ,P^{{\left( {21} \right)}} ,P^{{\left( {22} \right)}} \), each parameterized by \( p_{0}^{jk} \) and \( p_{1}^{jk} \), where \( p_{0}^{jk} = P(X_{i + 1}^{j} = 0|X_{i}^{k} = 0) \) and \( p_{1}^{jk} = P(X_{i + 1}^{j} = 1|X_{i}^{k} = 1) \).

The query is \( F\left( X \right) = \frac{1}{T*s}\sum\nolimits_{k = 1}^{s} {\sum\nolimits_{i = 1}^{T} {X_{i}^{k} } } \). We then calculate the conditional distributions \( P\left( {F\left( X \right) = \cdot |s_{i} ,\theta } \right) \) and \( P\left( {F\left( X \right) = \cdot |s_{j} ,\theta } \right) \) and measure the distance between them by the Earth Mover's Distance. The privacy budget \( \epsilon \) varies in {0.2, 0.5, 1, 2, 5}. We compare the actual F(X) with our output and report the average \( L_{1} \) error between them. As a baseline we use group differential privacy, which assumes that all variables are correlated and adds \( Lap\left( {1/ \epsilon } \right) \) noise to each bin. Table 2 shows the results of our experiments.
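A minimal sketch of this simulation setup follows. The parameter values are illustrative, and the per-step sampling rule, where sequence j follows parent k with probability \( \lambda_{jk} \), is one plausible reading of the model:

```python
import numpy as np

rng = np.random.default_rng(1)
T, s = 100, 2
lam = np.full((2, 2), 0.5)
p0 = np.array([[0.8, 0.6], [0.7, 0.5]])   # p0[j,k] = P(X_{i+1}^j=0 | X_i^k=0)
p1 = np.array([[0.7, 0.6], [0.6, 0.8]])   # p1[j,k] = P(X_{i+1}^j=1 | X_i^k=1)

def step(x, rng):
    """Sample the next pair of states: sequence j follows parent k with
    probability lam[j, k], then transitions according to P^(jk)."""
    nxt = np.empty(s, dtype=int)
    for j in range(s):
        k = rng.choice(s, p=lam[j])
        stay = p0[j, k] if x[k] == 0 else p1[j, k]
        nxt[j] = x[k] if rng.random() < stay else 1 - x[k]
    return nxt

X = np.empty((T, s), dtype=int)
X[0] = rng.integers(0, 2, size=s)          # initial states, q_0 = 0.5
for i in range(T - 1):
    X[i + 1] = step(X[i], rng)

F = X.mean()                               # F(X) = frequency of state 1
print(f"F(X) = {F:.3f}")
```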

Table 2. \( L_{1} \) error of frequency of state 1

From Table 2, we can see that our mechanism is more accurate than group differential privacy. As expected, the \( L_{1} \) error decreases as the privacy budget \( \epsilon \) increases, reflecting that smaller \( \epsilon \) means stronger privacy at the cost of more noise. The experiments show that our mechanism achieves higher utility than group differential privacy while still providing the desired privacy guarantee.

6 Conclusion

We propose a Pufferfish privacy mechanism for correlated categorical data sequences, such as a group of physical activity measurements, sales demand data sequences, and other time-series datasets. We use the multi-dimensional Markov Chain model to represent the correlations among individuals and inside each sequence. Experiments with simulated data show that our mechanism achieves both high utility and privacy.

There are still aspects of our work to improve in the future. The computational efficiency can be improved by exploiting the structural information of multi-dimensional Markov Chains. Experiments also need to be conducted on real-world datasets. Other types of correlated data, such as semi-structured data, graph data, and large-scale data, also require novel models and privacy mechanisms.