Introduction

The rapid growth in the availability of household data containing Geographical Information System (GIS) coordinates has opened new avenues to researchers interested in the complex interactions between space, human behavior, and outcomes (Arcury et al. 2005; Cooke et al. 2010; Seiber and Bertrand 2002; Tanser et al. 2009). The availability of geo-referenced household data allows researchers to study the density and spatial distribution of specific features of populations; it also allows researchers to assess the extent to which specific behaviors, such as school attendance or health care seeking, depend on spatial location and distance to public infrastructure like schools or hospitals (Kaplan and Hegarty 2006; Kyei et al. 2012; Lohela et al. 2012).

One of the most critical challenges faced by researchers working with geo-referenced household data is confidentiality and privacy protection (Kamel Boulos et al. 2009; O’Brien and Yasnoff 1999; Onsrud et al. 1994). Given that Global Positioning System (GPS) coordinates generally lie within a 10-m radius of the true location (Schwieger 2003), households and household members could be identified if their GPS coordinates were publicly available. Even in densely populated urban areas, relatively few households are within a 10-m radius of a given coordinate and even fewer may match all the other household characteristics included in the data set, including income, family size, or occupation. Identification of households is of particular concern if GPS coordinates are linked to sensitive data on household members, such as household members’ HIV status and sexual behavior. Allowing the identification of survey respondents constitutes a clear violation of the confidentiality that researchers typically guarantee as part of the consent process preceding a subject’s enrollment in a research study. If individuals with stigmatized traits or particular vulnerabilities can be identified, a range of harms and undesirable consequences may result, such as harassment, social exclusion, or crime (Hyman 2000).

Several approaches have been developed to address these data confidentiality concerns. Complete non-disclosure, restricted disclosure, and coordinate scrambling are the most common approaches (Golden et al. 2005). Other possible approaches include aggregating high-resolution spatial data to larger administrative units, as well as the use of software agents that allow researchers to analyze fully identified data without being able to physically access the underlying identifiable information (Kamel Boulos et al. 2006).

Complete non-disclosure essentially implies destroying the original household location information upon completion of the study. The approach is easy and highly effective for ensuring data confidentiality, but also implies the systematic destruction of potentially valuable information. Partial (restricted) disclosure is the approach currently taken by some of the larger European and US household surveys, such as the Survey of Health, Ageing and Retirement in Europe (SHARE) (Börsch-Supan et al. 2013; Linardakis et al. 2013) or the National Longitudinal Survey of Youth (Center for Human Resource Research 1997). In these surveys, GPS or address data are collected, but are not part of the data sets that are made publicly available. Upon request, GPS data are selectively made available to researchers with appropriate scientific reasons for using these data. To minimize the risk of data leakage, researchers generally need to sign strict data confidentiality agreements and may also be required to come to specific data centers to physically access the data. Partial disclosure has two main disadvantages: First, setting up confidentiality agreements with a potentially large number of researchers and research institutions often requires substantial human and legal resources; second, traveling to a data center may be impossible for many researchers because of financial and time constraints.

To avoid these challenges, large data collection operations, such as the Demographic and Health Surveys (DHS) and many Health and Demographic Surveillance Systems (HDSS), use a third approach, which is loosely referred to as “coordinate scrambling” (ICF International 2012). Similar to the “selective availability” program run by the US military until 2000 (National Archives and Records Administration 1996), a random noise vector is added to each coordinate to “mask” its true location, and the resulting “scrambled” coordinate is then made available to the public. In the case of the DHS, the scrambled coordinate falls within a 2-km radius of the original coordinate in urban areas and within a 5-km radius in rural areas. Both scrambling radii were chosen with the objective of making the identification of households sufficiently difficult, which essentially means ensuring that a sufficiently large number of households fall within the chosen radius. This could in theory be achieved by choosing the scrambling radius as a function of local population density. In practice, local population density data are not always available; as a result, large survey operations like the DHS simply use a smaller radius for the typically more densely populated urban areas and a larger radius for rural areas.

Conceptually, the idea of scrambling seems attractive, since scrambling implies protecting data confidentiality, while allowing researchers to work with collected geographical data. In practice, however, neither the theoretical nor the empirical implications of scrambling are well understood. To address this knowledge gap, we formally introduce the concept of scrambling in this paper and mathematically assess the implications of scrambling for estimates of average distance as well as estimates of the relationships between distance and other variables of interest. We prove mathematically that scrambling systematically biases the distance estimates between one point whose true coordinates are known to the analyst (e.g., a healthcare clinic) and another point whose true location has been disguised through scrambling (e.g., a household). To illustrate the resulting biases empirically, we scramble true GPS data from a demographic surveillance site (DSS) in rural South Africa and then show the resulting distance and regression biases in a range of study settings.

The effect of scrambling on average observed distances and distance variation

Let us define a point X as a reference or access point of interest and d as the distance of interest between X and some given household location h. One may think of X as the nearest fast food location, shop selling cigarettes, or healthcare clinic. For simplicity, but without loss of generality, we assume that the access point X is located at the origin, so that the vector from X to each household h is just h itself and has length |h|.

To derive the bias in the average distances computed based on scrambled data, we assume that the true location of h is perturbed by adding a random noise vector v drawn from some centrally symmetric distribution as illustrated in Fig. 1. “Centrally symmetric” requires that for any region R the probability that v ∊ R is the same as the probability that −v ∊ R. If the random noise vector v is drawn from such a centrally symmetric distribution, the expected value of the noise vector is zero. This means that there is no systematic bias in either direction, so that the perturbed vector h + v has the same expected position as the corresponding vector h.
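The central-symmetry assumption can be illustrated numerically. The sketch below is only an illustration: it assumes noise drawn uniformly from a disk of radius 5 km (one of many centrally symmetric distributions; the text does not prescribe a particular one), and the function names `draw_noise` and `in_region` as well as the test region R are hypothetical.

```python
import math
import random

def draw_noise(radius, rng):
    """Draw a noise vector v uniformly from a disk (assumed distribution)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # The square root makes the draws uniform over the disk's area.
    dist = radius * math.sqrt(rng.random())
    return dist * math.cos(angle), dist * math.sin(angle)

def in_region(v):
    """Hypothetical test region R: all points with x > 1 and y > 0."""
    return v[0] > 1.0 and v[1] > 0.0

rng = random.Random(0)
noise = [draw_noise(5.0, rng) for _ in range(100_000)]

# Central symmetry: P(v in R) should match P(-v in R).
share_v = sum(in_region(v) for v in noise) / len(noise)
share_minus_v = sum(in_region((-v[0], -v[1])) for v in noise) / len(noise)

# A centrally symmetric distribution has mean zero.
mean_vx = sum(v[0] for v in noise) / len(noise)
mean_vy = sum(v[1] for v in noise) / len(noise)

print(share_v, share_minus_v)  # two nearly identical shares
print(mean_vx, mean_vy)        # both close to zero
```

Any other centrally symmetric distribution (for instance, a bivariate normal centered at zero) could be substituted for `draw_noise` without changing the logic of the check.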

Fig. 1 Model setup

Under this rather general assumption, we demonstrate that the following proposition is true:

Proposition 1

For any vector h and a randomly added scrambling vector v the following must always be true:

  (i) The expected value of the scrambled distance from the point of interest, \( \langle |h + v| \rangle \), always exceeds the true distance \( |h| \) as long as v is not limited to the line segment between −h and +h.

  (ii) Adding a noise vector v always increases the expected square distance \( \langle |h + v|^{2} \rangle \), and the difference \( \langle |h + v|^{2} \rangle - |h|^{2} \) equals the mean square \( \langle |v|^{2} \rangle \) of v.

  (iii) The expected bias between the true and the observed distance is bounded by \( \langle |v|^{2} \rangle / (2|h|) \); thus, as long as \( |v| \le \gamma \) for some radius γ and the distribution of v is symmetric under \( v \to -v \), we have \( \left\langle \left| h + v \right| \right\rangle - \left| h \right| \le \frac{\gamma^{2}}{2\left| h \right|} \).

Part (i) of Proposition 1 states that the average (expected) scrambled distance is strictly larger than the true distance for any two-dimensional error. The magnitude of this bias increases with the maximum scrambling radius and decreases with the true distance as shown in Part (iii) of the proposition. Intuitively, adding two-dimensional noise terms biases the average distance due to the nonlinear relation defined in Pythagoras’ Theorem; the (linear) average of the scrambled distances turns out to be systematically larger than the actual true distance. Part (ii) of the proposition is more straightforward; given that the two vectors of interest are by assumption independent, the total variation in scrambled distance can be directly decomposed into the true variation in distance and the average variation generated by the scrambling error.
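The three parts of Proposition 1 can be checked by simulation. The following is a rough sketch rather than a proof: it assumes noise drawn uniformly from a disk of radius 5 km, places the household on the x-axis at hypothetical true distances of 5, 20, and 50 km, and uses the illustrative names `draw_noise` and `check_proposition`.

```python
import math
import random

def draw_noise(radius, rng):
    """Draw a noise vector uniformly from a disk (assumed distribution)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = radius * math.sqrt(rng.random())
    return dist * math.cos(angle), dist * math.sin(angle)

def check_proposition(h_len, radius, n, rng):
    """Monte Carlo estimates of <|h+v|>, <|h+v|^2> and <|v|^2> for |h| = h_len."""
    sum_d = sum_d2 = sum_v2 = 0.0
    for _ in range(n):
        vx, vy = draw_noise(radius, rng)
        d2 = (h_len + vx) ** 2 + vy ** 2  # household at (h_len, 0), X at origin
        sum_d += math.sqrt(d2)
        sum_d2 += d2
        sum_v2 += vx * vx + vy * vy
    return sum_d / n, sum_d2 / n, sum_v2 / n

rng = random.Random(1)
results = {}
for h_len in (5.0, 20.0, 50.0):
    mean_d, mean_d2, mean_v2 = check_proposition(h_len, 5.0, 100_000, rng)
    results[h_len] = {
        "bias": mean_d - h_len,             # part (i): should be positive
        "sq_gap": mean_d2 - h_len ** 2,     # part (ii): should be near mean_v2
        "mean_v2": mean_v2,
        "bound": mean_v2 / (2.0 * h_len),   # part (iii): should exceed the bias
    }
    print(h_len, results[h_len])
```

In runs of this kind the estimated bias is positive and shrinks as the true distance grows, in line with parts (i) and (iii); the agreement in part (ii) is only approximate because of sampling noise.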

The full mathematical proof of Proposition 1 is available in the Appendix.
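Short of the full proof, a brief sketch may help convey why parts (ii) and (iii) hold. It assumes only that v has mean zero (which central symmetry implies) and a finite mean square, and it relies on Jensen's inequality together with the elementary bound \( \sqrt{1 + x} \le 1 + x/2 \) for \( x \ge 0 \); the argument in the Appendix may proceed differently. Expanding the squared scrambled distance and taking expectations gives

$$ \left\langle \left| h + v \right|^{2} \right\rangle = \left| h \right|^{2} + 2\,h \cdot \left\langle v \right\rangle + \left\langle \left| v \right|^{2} \right\rangle = \left| h \right|^{2} + \left\langle \left| v \right|^{2} \right\rangle , $$

which is the statement in part (ii). Since the square root is concave,

$$ \left\langle \left| h + v \right| \right\rangle \le \sqrt{\left\langle \left| h + v \right|^{2} \right\rangle } = \left| h \right|\sqrt{1 + \frac{\left\langle \left| v \right|^{2} \right\rangle }{\left| h \right|^{2} }} \le \left| h \right| + \frac{\left\langle \left| v \right|^{2} \right\rangle }{2\left| h \right|}, $$

and \( \langle |v|^{2} \rangle \le \gamma^{2} \) whenever \( |v| \le \gamma \), which yields the bound in part (iii). As a worked example, with a scrambling radius of γ = 5 km and a true distance of |h| = 50 km, the bound evaluates to 25/100 = 0.25 km.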

Empirical implications

Proposition 1 has two main implications for empirical analysis. First, and most importantly, any population-based estimate of average distance based on scrambled data will display a systematic upward bias. This overestimation of true distances may have undesirable consequences if estimated distances are used for policy. Assume, for instance, that the government wants to know the average distance children travel to school or the fraction of individuals living outside a given distance to a health facility. Further, assume that the coordinate of the school or health facility is precisely known, but that all coordinates of households have been scrambled. If the government uses these scrambled coordinates to calculate distances, the average observed distance will be strictly larger than the average true distance, so that the fraction of individuals living beyond a given distance of interest will generally be overestimated.

The second major issue when working with scrambled data directly links to the statistical literature on measurement error in variables starting with the seminal work by Spearman (1904). As shown in part (ii) of Proposition 1, the addition of random spatial noise essentially implies adding measurement error to the variable of interest. Since the measurement error is orthogonal to the true distance by construction, the classical errors-in-variables (CEV) case will arise. In a standard ordinary least squares (OLS) regression framework, the probability limit of the coefficient estimated for the distance variable of interest is given by

$$ p\,\lim \left( {\hat{\beta }_{\text{OLS}} } \right) = \beta \frac{{\sigma_{h}^{2} }}{{\sigma_{h}^{2} + \sigma_{v}^{2} }}, $$

where β is the true coefficient of interest and \( \sigma_{h}^{2} \) and \( \sigma_{v}^{2} \) correspond to the variance in the true distance and the scrambling error, respectively (Wooldridge 2002, 2003). The greater the variance in the random noise term, the closer the estimated slope moves toward zero; this effect is generally referred to as “regression dilution,” “attenuation,” or “attenuation bias” following the original work by Spearman.
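The attenuation formula can be illustrated with a stylized simulation. This sketch covers the textbook CEV case, in which normally distributed error is added directly to the regressor, not the scrambling of coordinates itself; all parameter values (β = −0.5, σ_h = 3, σ_v = 2) and the helper `ols_slope` are hypothetical choices made for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(2)
beta, sigma_h, sigma_v = -0.5, 3.0, 2.0  # hypothetical values
n = 50_000

dist_true = [rng.gauss(10.0, sigma_h) for _ in range(n)]
y = [1.0 + beta * d + rng.gauss(0.0, 1.0) for d in dist_true]
# Observed distance = true distance + independent measurement error.
dist_obs = [d + rng.gauss(0.0, sigma_v) for d in dist_true]

slope_true = ols_slope(dist_true, y)
slope_obs = ols_slope(dist_obs, y)
predicted = beta * sigma_h ** 2 / (sigma_h ** 2 + sigma_v ** 2)

print(slope_true)            # close to beta
print(slope_obs, predicted)  # both close to beta * 9 / 13
```

With these values the attenuation factor is 9/13, so roughly 31% of the true slope is lost even though the measurement error has mean zero.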

To illustrate the importance of scrambling biases in practical application, we use data from the Wellcome Trust Africa Centre for Health and Population Studies (Africa Centre) in rural South Africa. As described in further detail in Tanser et al. (2008), the Africa Centre surveillance was launched in 2000 and longitudinally tracks demographic and health outcomes for all individuals who reside in a geographically contiguous demographic surveillance area covering a total of 438 km². The area is mostly rural and densely populated with about 25 households per square kilometer. As of June 2013, the site covers about 90,000 individuals. Figure 2 shows the spatial distribution of households and primary healthcare clinics in the area.

Fig. 2 Households and primary healthcare clinics. Notes: The demographic surveillance area shown here is located in the Hlabisa sub-district in rural KwaZulu-Natal, South Africa. The area is situated in the south-east portion of the uMkhanyakude district of KwaZulu-Natal province near the town of Mtubatuba. It is bounded on the west by the Umfolozi-Hluhluwe nature reserve, on the south by the Umfolozi river, on the east by the N2 highway (except for portions where the Kwamsane township straddles the highway) and in the north by the Inyalazi river for portions of the boundary. The physical homes of local residents, locally referred to as “homesteads”

To illustrate the effects of scrambling on estimation, we assume a simple population model, where the outcome of interest y for an individual i (such as school attendance, antenatal clinic attendance, HIV antiretroviral treatment uptake) is a linear function of distance and random error term, such that

$$ y_{i} = \alpha + \beta {\text{Dist}}_{i} + \varepsilon_{i} , $$

where Dist_i is the distance in kilometers for individual i and ε_i is a randomly distributed error term.

We start our simulations with the basic scenario outlined in the theoretical model and illustrated in Fig. 1, with one specific reference point and a given scrambling radius. In the DSS data, the true coordinates of both the households and the reference points are known. We can thus compare actual distances to the ones observed in a setting where the household coordinates have been scrambled. To evaluate the impact of scrambling on regression analysis, we assume a simple data generating process, where some generic outcome variable y is a linear function of the true distance with a stochastic error term as described above.

We simulate three different scenarios: a scenario with a very close reference point (inside the demographic surveillance area), a scenario with a mid-range reference point (20 km), and a scenario with a more distant reference point (50 km). For each scenario, we take the true coordinates of all 16,309 households shown in Fig. 2, generate a dependent variable as a linear function of the true distance, and then run 1,000 simulations with scrambled data. Each iteration of the simulation proceeds in three steps: In the first step, we add a random (scrambling) error between 0 and the chosen radius to each of the original household coordinates; in the second step, we compute the Euclidean distance between the scrambled household coordinates and the reference point of interest; and in the last step, we run a regression using the scrambled rather than the actual distance as the explanatory variable. We store the average distances and regression coefficients obtained in each iteration of the simulation and then compare them to the true values of both variables.
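The three simulation steps can be sketched as follows. This is not the code used for the results reported here: the household coordinates are synthetic (2,000 hypothetical points spread uniformly over a 20 km by 20 km square instead of the 16,309 surveyed households), the noise is assumed to be uniform on a disk, only 30 iterations are run, and all function names and parameter values are illustrative.

```python
import math
import random

def draw_noise(radius, rng):
    """Draw a noise vector uniformly from a disk (assumed distribution)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = radius * math.sqrt(rng.random())
    return dist * math.cos(angle), dist * math.sin(angle)

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def run_simulation(households, ref, radius, alpha, beta, n_iter, rng):
    true_dist = [math.dist(hh, ref) for hh in households]
    # The outcome is generated once, from the TRUE distances.
    y = [alpha + beta * d + rng.gauss(0.0, 1.0) for d in true_dist]
    mean_dists, slopes = [], []
    for _ in range(n_iter):
        # Step 1: add a scrambling error to each household coordinate.
        scrambled = []
        for hx, hy in households:
            vx, vy = draw_noise(radius, rng)
            scrambled.append((hx + vx, hy + vy))
        # Step 2: Euclidean distance from scrambled location to the reference.
        obs_dist = [math.dist(p, ref) for p in scrambled]
        # Step 3: regress the outcome on the scrambled distance.
        slopes.append(ols_slope(obs_dist, y))
        mean_dists.append(sum(obs_dist) / len(obs_dist))
    return {
        "true_mean_dist": sum(true_dist) / len(true_dist),
        "scrambled_mean_dist": sum(mean_dists) / n_iter,
        "true_slope": ols_slope(true_dist, y),
        "scrambled_slope": sum(slopes) / n_iter,
    }

rng = random.Random(3)
households = [(rng.uniform(0.0, 20.0), rng.uniform(0.0, 20.0)) for _ in range(2000)]
result = run_simulation(households, ref=(10.0, 10.0), radius=5.0,
                        alpha=1.0, beta=-0.5, n_iter=30, rng=rng)
print(result)
```

Moving `ref` further away from the square (for example to a point 20 or 50 km outside it) gives a rough analogue of the mid-range and distant scenarios.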

Figure 3 summarizes the results from these simulation models and illustrates the general relation between scrambling noise, observed distances, and expected regression point estimates. More scrambling noise unambiguously increases the expected average distance (bias) as well as the attenuation bias in regressions, while more distant reference points reduce the biases observed. Assuming an average distance from the household to the nearest school or clinic of <10 km (scenario 1), and a scrambling radius of 5 km recommended for rural areas, the average distance is overestimated by 0.51 km, which corresponds to an upward bias of about 5%. The bias is much larger in regression models, where the average estimated distance coefficient is 36% smaller than the true effect.

Fig. 3 Scrambling results

The situation is further complicated when subjects have a more complex choice set (such as multiple schools or clinics) to choose from. Often researchers may wish to investigate the importance of specific factors pertinent to the nearest facility, such as teacher quality or health staff availability. It is easy to see that scrambling will make this exercise rather difficult. As shown in Fig. 2, there are six health facilities that are located directly in the demographic surveillance area, and 15 health facilities that are located in the larger district. If household coordinates are scrambled and households are linked to the nearest health facility locations according to the scrambled household-facility distance estimate, a rather large fraction of households will be linked to an incorrect location and the correlation between the true and the scrambled distance will fall. This is illustrated in Table 1. With a recommended scrambling radius of 5 km, about one-third of households are linked to the incorrect nearest facility, and the correlation between the true and the scrambled distance is <0.5.

Table 1 Nearest primary healthcare clinics with scrambled household coordinates
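The mislinking mechanism can be sketched on synthetic data. The example below does not reproduce Table 1: it assumes 2,000 hypothetical households and six hypothetical facilities placed at random in a 20 km by 20 km square, uniform-disk noise, and illustrative helper names, so the resulting shares and correlations will differ from those obtained with the surveillance data.

```python
import math
import random

def draw_noise(radius, rng):
    """Draw a noise vector uniformly from a disk (assumed distribution)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = radius * math.sqrt(rng.random())
    return dist * math.cos(angle), dist * math.sin(angle)

def nearest_facility(point, facilities):
    """Return (index, distance) of the facility closest to point."""
    dists = [math.dist(point, f) for f in facilities]
    j = min(range(len(dists)), key=dists.__getitem__)
    return j, dists[j]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def link_quality(households, facilities, radius, rng):
    """Share of households linked to the wrong facility, and the correlation
    between true and scrambled distance to the (apparent) nearest facility."""
    true_links = [nearest_facility(hh, facilities) for hh in households]
    scr_links = []
    for hx, hy in households:
        vx, vy = draw_noise(radius, rng)
        scr_links.append(nearest_facility((hx + vx, hy + vy), facilities))
    wrong = sum(t[0] != s[0] for t, s in zip(true_links, scr_links))
    corr = pearson([t[1] for t in true_links], [s[1] for s in scr_links])
    return wrong / len(households), corr

rng = random.Random(4)
households = [(rng.uniform(0.0, 20.0), rng.uniform(0.0, 20.0)) for _ in range(2000)]
facilities = [(rng.uniform(0.0, 20.0), rng.uniform(0.0, 20.0)) for _ in range(6)]

quality = {}
for radius in (0.0, 2.0, 5.0):
    quality[radius] = link_quality(households, facilities, radius, rng)
    print(radius, quality[radius])
```

A radius of zero serves as a sanity check: without scrambling, no household is mislinked and the two distance measures coincide.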

While it is hard to generalize these biases due to their dependence on the local distribution of reference points, the added complexity of multiple reference points will in most cases substantially increase the biases generated by scrambling. Similarly, complex challenges arise if researchers want to use existing network information to measure actual (rather than Euclidean) distances traveled on roads or public transport to reach specific access points. In the case of network analysis, scrambling home locations will bias average distances and will also lead to miscoding of transport entry points, with resulting error in travel time as well as environmental risk exposure.

Summary and conclusion

In this paper, we have proved mathematically and demonstrated empirically that scrambling of GPS locations leads to a systematic overestimation of the average distance between households and other points of interest at the population level for descriptive purposes. For bi- or multivariate regression analysis, the use of scrambled GPS coordinates will lead to systematic underestimation of the true causal effects of distance. Both effects are problematic from a scientific and a policy perspective. The systematic underestimation of the true causal effect of proximity will likely undermine the perceived importance of spatial distance. This may discourage public investment that ensures geographical accessibility to essential services, such as education, health care, or transport; it may also reduce support for projects aimed at ensuring sufficient distance to nearby hazards, such as waste sites, nuclear reactors, or sources of noise pollution.

This paper is, to our knowledge, the first to fully quantify the biases resulting from scrambling. The main results presented strongly support the recommendation made in the 2007 Committee on the Human Dimensions of Global Change special report Putting People on the Map (2007, p. 62), which states that “[a]ltering data to mask the exact spatial locations impedes the ability of researchers to calculate accurate spatial relationships, such as distances.”

While scrambling is currently common practice in major population-based surveys, such as DHS, and longitudinal surveillance systems, such as HDSS, it is only one of many approaches to mask geo-spatial data that have been proposed and used. Rather than randomly moving coordinates, one may also displace coordinates deterministically to a new set of locations, through displacement, scaling, or rotation. Each of these deterministic geo-masking approaches fails to preserve some important geographical information. The random perturbations generated by scrambling have previously been thought to approximately preserve all important aspects of geographical information. Unlike deterministic geo-masking, scrambling has thus been judged to be “satisfactory from a comprehensive information-preservation standpoint” (Armstrong et al. 1999). We show in this paper that scrambling does not preserve distance information. For many analytical purposes, scrambling does not appear to be the superior geo-masking approach it has previously been considered to be, and its routine use to mask geographical information in major population-based surveys deserves reconsideration.

Given that geo-referenced data are as important for science as they are for policymaking, and given that the protection of data confidentiality is an important dimension of ethical research, alternative approaches to handling geo-referenced data appear preferable. One commonly practiced alternative to data scrambling, which appears strongly preferable to scrambling, is the “restricted release” of data suggested by the Committee on the Human Dimensions of Global Change (2007) as well as National Research Council (2005). Such data access restrictions require an application and review process, as well as specific access rules and locations, which are comparatively costly and may limit the use of this approach to resource-rich settings.

A less costly alternative to scrambling is to provide distance calculations on request. Rather than providing researchers with geographical coordinates, distances between two points (e.g., a household and a health care facility) could be calculated by data owners and the resulting distance measures could be shared with researchers instead of the coordinates. While this would ensure privacy protection from an individual point of view, it would also provide researchers with the key variables needed for empirical analysis. For instance, to assess the effect of distance on access to antiretroviral treatment in developing countries (see e.g., Bärnighausen et al. 2014), variables such as “distance to the nearest primary healthcare clinic” or “distance to the nearest road” could be computed. Given that researchers would neither know the reference point location nor the direction from the reference point, identifying households based on these distance measures would be impossible.
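On the data owner's side, such a release could look roughly like the sketch below. It is only one possible arrangement: it assumes household and facility coordinates are held in a planar projection measured in kilometers, and the function `release_distances`, the field name `dist_nearest_clinic_km`, and the example coordinates are all hypothetical.

```python
import math

def release_distances(households, facilities):
    """Compute distance to the nearest facility from true coordinates and
    return records that contain the distance but no coordinates."""
    released = []
    for hh_id, location in households.items():
        nearest = min(math.dist(location, f) for f in facilities)
        released.append({"household_id": hh_id,
                         "dist_nearest_clinic_km": nearest})
    return released

# Hypothetical true coordinates, kept by the data owner only.
households = {"A": (0.0, 0.0), "B": (3.0, 4.0)}
facilities = [(0.0, 0.0), (10.0, 0.0)]

released = release_distances(households, facilities)
print(released)
```

Only the `released` records would leave the data owner; the coordinate dictionaries stay behind the access restriction.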

Another promising approach is the use of software agents, which could allow researchers to analyze fully identified data without being able to physically access the underlying identifiable information (Kamel Boulos et al. 2006). Developing such systems may require substantial upfront investment in order to ensure that they are user-friendly and provide sufficient data protection, but they may be the most efficient solution in the long run. In the absence of software agents, restricted release of true coordinates or true distances appears strongly preferable to the scrambling approach.