Introduction

Vulnerability to environmental change and its implications for human and ecological well-being remain critical challenges within global development. Research on the myriad dimensions of vulnerability has grown rapidly over the past decade, and while vulnerability as a concept has been usefully theorized, understanding the patterns and implications of differential vulnerability also requires accessible data. However, a critical challenge in vulnerability scholarship is that research linking people and place requires knowledge of individuals’ or households’ geographic location. Such knowledge can compromise confidentiality.

This paper begins a methodological exploration to facilitate the availability of detailed socio-ecological data to advance understandings of population-environment-health connections. Our goal is ultimately to fuel research to inform policy designed to safeguard human and ecological well-being. The methodological exploration presented here involves innovative processing and analyses of data from a low-income setting in rural South Africa, a region from which social surveillance data are underutilized for the purposes of research on population-environment-health interactions. Several techniques of data anonymization are tested that are designed to mask true household locations in order to protect confidentiality. The primary contribution of this study is the identification of anonymization techniques—for this particular context and contextual measure—that yield more accurate estimations of environmental measures from anonymized household locations relative to the measures from the “true” household locations while sufficiently preserving confidentiality. Importantly, however, the process presented here is illustrative in that a wide variety of methodological decisions must be made that should be context-specific as well as driven by a project’s particular research objectives.

Background

Understanding associations between human populations and their environments is becoming increasingly important as evidence of climate change continues to mount. Such understandings are particularly essential to inform programs and policies in settings of high climate vulnerability including many rural areas of the Global South (Byers et al., 2018). In many such regions, livelihoods remain intimately intertwined with proximate natural resources (e.g., Wisley et al., 2018), resources which may become increasingly scarce in areas anticipating shifts in rainfall and heat extremes (e.g., Olsson et al., 2014).

To increase the sustainability of rural livelihoods, research at the individual- or household-scale is of particular importance since livelihood decision-making is focused within these realms (e.g., Sumner et al., 2017). A wide variety of useful secondary social science datasets are available for such examinations. These often include confidential information from human subjects including age, gender, race, ethnicity, income, livelihood strategies, and, in some cases, particular health outcomes. Collection of such data typically comes with important assurances as to confidentiality and protection of individual right to privacy.

Yet, examination of the environmental dimensions of well-being requires linking this individual- or household-scale data with information reflecting local environments. As two examples, locational data allow linking households to information on proximate rainfall and temperature conditions, thereby facilitating environmental health research. Locational data also allow for disease mapping to identify spatial clusters of incidences or outbreaks for programmatic targeting. Linking the necessary social and environment data for these questions requires geocodes—the geographic locations of survey or census respondents. Yet, making available such geographically specific data would typically violate ethical and legal requirements regarding the confidentiality of microdata in that respondents’ identities may be revealed. This tradeoff as related to protecting privacy while maintaining sufficient accuracy is the central issue addressed within this paper.

There are several approaches to address data confidentiality concerns with the most restrictive being complete non-disclosure such as the destruction of all location information after data collection has ceased. Partial disclosure can be implemented along a continuum, with any particular approach having its own drawbacks. There are no broadly representative comparative studies that examine the relative strengths of the many approaches nor that offer guidance on method choice. This paper’s contribution is examination of the tradeoffs between anonymity and analytical precision within one such method—geographic masking—thereby offering important insight related to this commonly used approach. The analyses presented make use of an environmental variable measuring vegetation cover and are undertaken in a rural South African study site. The work is illustrative of the process, and the implications, of different geo-masking approaches and is intended as guidance for population-environment researchers interested in linking micro and contextual data while maximizing privacy.

Of the variety of methods to protect anonymity, restricted data enclaves offer access to the highly specific geographic information. An example is the US Census Bureau’s network of Federal Statistical Research Data Centers (FSRDC) through which precise geographic information may be made available with highly structured confidentiality agreements and a requirement of travel to highly secured data centers for access. Such agreements also typically require high levels of oversight as to public presentation of results.

Offering more accessibility but substantially less spatial precision, geographic information is occasionally available within secondary data sets. In many cases, readily available spatial units are quite coarse (e.g., municipality in the Mexican Migration Survey). Yet, many organizations offer the possibility of more spatial precision through confidentiality agreements with individual researchers for specific projects. The Panel Study of Income Dynamics, for example, provides the option through a security agreement to access US Census linkages to tracts, block groups, or blocks which offer far more precision than positioning respondents in their larger-scale counties. Other methods to protect respondent identity include aggregation whereby data are transformed to characterize geographic units (e.g., county-scale age, gender, racial composition). Aggregation units can sometimes be quite small although the ecological fallacy remains a risk. This occurs when inferences are made about individuals or other subunits based on aggregated characteristics. Also, given the loss of microdata, aggregation yields less analytically useful data for the purposes of understanding micro-scale social and economic processes. Other approaches include spatial smoothing where data represent weighted averages of individual-level data averaged spatially by nearest neighbors (Zhou & Louis, 2010) and multiple imputation which simulates datasets that capture dependencies among variables in the original data (Wang & Reiter, 2012). Linear programming approaches add noise to individual locations based on mathematical probabilities to minimize risk of individual disclosures based on the desired level of privacy. Data swapping entails switching values between various records, while synthetic data entails the creation of a dataset which has similar properties to original data where individuals cannot be identified. 
The critical importance of the privacy/accuracy tradeoff is illustrated by the debate surrounding differential privacy methods proposed for US Census data. Broadly, differential privacy entails the addition of a precise amount of statistical noise—i.e., synthetic records—into the released dataset such that a user cannot identify individual data (Abowd & Schmutte, 1999). Ruggles et al. (2019) argue that proposed methods have the potential to reduce the utility of microdata and smaller-area estimates and may disproportionally impact racial/ethnic minorities and underrepresented individuals.

Another approach to respondent protection is illustrated by the Demographic Health Survey (DHS), which provides geocodes representing the center of a geographic cluster or small settlement. Each location is “geo-scrambled” to randomly add position error, with the displacement distance influenced by the local population density. Within this approach, the sociodemographic and economic characteristics of the households themselves are unchanged; only the location is altered. As such, the contextual variables generated for households will reflect the displaced, as opposed to the original, location. Several studies have cautioned researchers to carefully consider the spatial error introduced within analyses using the DHS (e.g., Elkies et al., 2015), and innovative alternatives to the DHS clusters have been proposed, such as using characteristics of nearby communities as proxies for environmental conditions (Grace et al., 2019).

Related, and the focus of this paper, geographic masking (also known as “geomasking” or “jittering”) entails displacement of the individual or household location using predefined parameters typically related to direction and distance from the true location. Proposed as early as 1999 (Armstrong, Rushton, and Zimmerman), geomasking has received far more attention in public health and epidemiological research than in population science, although it is potentially of substantial use to demographers.

Approaches to geographic masking

Masking techniques typically include some form of spatial dislocation to reduce the potential for identification of study households. A variety of techniques have been developed that structure displacement of the original locations through different approaches to randomization of distance and direction. After displacement, the original locations are removed from the dataset that is made publicly available or available through a data sharing agreement.

One of the most straightforward approaches is presented in Fig. 1 where a household is simply randomly offset within a buffer of predetermined size (in this case, 300 m). In the case of Fig. 1, the displacement is constrained by the village boundaries so that a household is not displaced outside the village.

Fig. 1

Example of a simple random offset for a household with available displacement area constrained by village boundary
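The simple offset in Fig. 1 can be sketched in a few lines. The following is a minimal illustration, not the procedure used in this study: it assumes projected coordinates in metres, represents the village boundary as a simple polygon, draws candidate locations uniformly over the area of the 300 m buffer, and rejects any candidate that falls outside the village. The function names (`point_in_polygon`, `random_offset`) and the square test village are hypothetical.

```python
import math
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in metres."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_offset(x, y, polygon, max_radius=300.0, rng=random, max_tries=1000):
    """Displace (x, y) within a disc of max_radius, staying inside the polygon."""
    for _ in range(max_tries):
        # The square root makes draws uniform over the disc's area (an assumption)
        r = max_radius * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        nx, ny = x + r * math.cos(theta), y + r * math.sin(theta)
        if point_in_polygon(nx, ny, polygon):
            return nx, ny
    raise RuntimeError("no valid displaced location found")

# Hypothetical 1 km x 1 km village with a household near one corner
village = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0), (0.0, 1000.0)]
masked = random_offset(50.0, 50.0, village, rng=random.Random(42))
```

Rejection sampling keeps the sketch short; for households close to the boundary it effectively shrinks the area available for displacement, which is the constraint illustrated in Fig. 1.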

The use of randomization has become more common in context-centered research (Armstrong et al., 1999; Cassa et al., 2008; Lu et al., 2012). However, there is little consensus as to the amount of displacement necessary to preserve confidentiality. One approach to quantitatively measuring privacy risk is the spatial k-anonymity factor, where k represents the number of people (or households) needed within a buffer to preserve confidentiality (Sweeney, 2002). This is an extension of the concept of k-anonymity where data are released only if there is a minimum of k-1 other individuals with the same combinations of characteristics (Zandbergen, 2014). In its spatial version, k-anonymity approaches consider the displacement distance necessary to protect privacy given a particular population density. Dense urban settings require smaller displacement distances than sparsely populated rural areas (Cassa et al., 2008).
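One way to make spatial k-anonymity operational is to count, for each masked household, how many other households lie closer to the true location than the masked point does. The sketch below assumes that reading, along with projected coordinates in metres; the function names and example coordinates are hypothetical, and other definitions of spatial k-anonymity are in use.

```python
import math

def achieved_k(true_xy, masked_xy, other_households):
    """Count other households closer to the true location than the masked point."""
    d_mask = math.dist(true_xy, masked_xy)
    return sum(1 for h in other_households if math.dist(true_xy, h) < d_mask)

def share_below_threshold(k_values, k_required):
    """Percentage of masked households whose achieved k falls short of k_required."""
    if not k_values:
        return 0.0
    return 100.0 * sum(1 for k in k_values if k < k_required) / len(k_values)

# Hypothetical example: one household displaced 100 m due east
others = [(10.0, 0.0), (50.0, 50.0), (0.0, 99.0), (200.0, 0.0), (100.0, 1.0)]
k = achieved_k((0.0, 0.0), (100.0, 0.0), others)  # three neighbours are closer than 100 m
```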

To better understand the implications of the various approaches to balancing research and confidentiality, we explore the differences that displacement brings for a particular contextual measure reflecting proximate vegetation, described below. We do so within a longstanding study site, the MRC/Wits-Agincourt Unit in rural South Africa.

Research setting

The Agincourt Health and Socio-Demographic Surveillance System (AHDSS), situated in the far northeast of South Africa, is operated by the Medical Research Council (MRC) and University of the Witwatersrand (Wits) Rural Public Health and Health Transitions Research Unit (MRC/Wits-Agincourt Unit). The 450 km2 study area includes 31 villages which are home to ~ 110,000 residents in ~ 22,700 households. Since 1992, the Agincourt Unit has conducted an annual census including the entire Agincourt HDSS population (Collinson, 2010).

A “homeland” area where black South Africans were forcibly resettled during the era of Apartheid, the study site is characterized by relatively high population densities (~ 170 persons per sq. km), high poverty, and a longstanding lack of development and access to state services (Collinson, 2010). The Agincourt study site’s settlement pattern is fairly typical of rural communities across South Africa, and socioeconomically, it is characterized by a high reliance on remittances from the large proportion of adults who are migrant laborers on commercial farms and in towns and cities across the country. A substantial portion of households also depend heavily on the state pensions of elderly members (Collinson, 2010).

The region is generally dry (annual rainfall 550–700 mm), although an east–west rainfall gradient results in local variation in natural resource availability. Homestead plots are typically too small to fully support subsistence agriculture and some households farm assigned plots in the surrounding communal lands. Residents are highly dependent on the natural environment for a range of uses. These include grazing livestock and collecting fuelwood, wild foods, thatching grass, construction timber, and other domestic products both for household consumption and for generating income (Paumgarten & Shackleton, 2011).

The centrality of natural resources to livelihoods in rural South Africa is key to the illustrative research presented here. Case studies in two rural villages found that 70% of households made use of non-timber forest products, such as fuelwood, wild fruit, and edible herbs during times of shortage and crisis (Paumgarten & Shackleton, 2011). Even in rural South African villages with readily available electricity, over 90% of households use fuelwood as a primary energy source due to the cost of electricity and appliances (Matsika et al., 2013). This trend has been observed in and near the Agincourt study site where natural resources also act as buffers against household shocks such as a breadwinner’s death (Hunter et al., 2007).

There is vast potential for linking population, environment and health data within Health and Demographic Surveillance Systems (HDSS). The INDEPTH network includes 48 such sites in low-income settings across sub-Saharan Africa, Asia, and Oceania. These study settings undertake continuous monitoring of all individuals within a defined study setting and, combined, the INDEPTH network data provides longitudinal health and demographic insight on nearly 4 million individuals in 18 countries, providing critically important opportunities for policy-relevant scholarship (INDEPTH, 2017).

The application of “geographic masking” to facilitate contextual research is, however, nascent within the HDSS community. A recent study with professionals possessing a working knowledge of surveillance systems found that most respondents (83.5%) were not aware of any written rule, policy, or regulation governing research with HDSS data although nearly 86% agreed that there was a need for such guidelines. The risk of personal or family data being compromised was of great concern, and 74% supported anonymizing data before release to researchers (Anane-Sarpong et al., 2016). Per special agreements with particular HDSS’ (e.g., Africa Centre Demographic Information System), temporary access to locational information can sometimes be obtained by individual researchers (e.g., Tlou et al. 2017). But a key motivation of the present project is to explore the potential for geographic masking approaches to facilitate greater data sharing for the purposes of population-environment-health research especially as related to current and future climate vulnerability.

Data and methods

We anonymize data from the AHDSS using nine different approaches (described below) and evaluate the impact of the geomasking through comparison of measures of proximate natural resources. The AHDSS provided data on specific physical locations of households but without any additional individual or household information, and in total the locations represent 31 separate villages with 22,708 households (Fig. 2).

Fig. 2

Villages within the MRC/Wits-Agincourt Unit

Given our focus on facilitating population-environment-health research, we examine the implications of geomasking for a vegetation measure reflective of proximate natural resources which are fundamental to most livelihoods in the Agincourt study site and predominantly collected from communal lands surrounding villages. Specifically, we use the normalized difference vegetation index (NDVI) which is well-correlated with vegetation amount and quality (Roerink et al., 2003; Wessels et al., 2004). It is important to acknowledge that NDVI is used here to illustrate the process involved in geomasking and to allow for exploration of the tradeoff between privacy and accuracy. This example demonstrates the important questions to be asked within the context of spatially-informed scholarship although the choice of specific data reflecting the proximate environment must be driven by research objectives.

NDVI values are derived from data from the Landsat 5, 7, and 8 missions and the index’s calculation exploits vegetation’s reflectance of near-infrared light and the absorption of red light (Tucker, 1979). The data used to calculate NDVI is consistent across the Landsat missions due to Collection 1 data processing completed by the USGS Earth Resources Observation and Science (EROS) Center. Values range from − 1 to + 1 with vegetation biomass and productivity positively correlated with NDVI (Foody et al., 2001; Mutanga & Skidmore, 2004; Wang & Rich, 2008). Low values (≤ 0.1) indicate barren land, rock, sand, or water, moderately positive values (0.2–0.3) may correspond to shrublands or grasslands, while high values (0.6–0.8) correspond to temperate or tropical rainforests (NASA, 2000).
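For readers less familiar with the index, NDVI is the normalized difference between near-infrared and red reflectance, (NIR − Red) / (NIR + Red). The function below is a small illustration of that calculation for a single pixel; the function name and the reflectance values are hypothetical, and the processed Landsat NDVI files used in this study already contain the computed values.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from near-infrared and red reflectance."""
    total = nir + red
    if total == 0:
        return None  # the index is undefined when both bands are zero
    return (nir - red) / total

# Dense vegetation reflects strongly in the near-infrared and absorbs red light
vegetated = ndvi(0.5, 0.1)
# Bare soil or rock reflects the two bands more evenly
barren = ndvi(0.25, 0.2)
```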

The Landsat data includes NDVI estimates for a given location at least every 16 days, and every 8 days when the Landsat 7 and Landsat 8 overpasses are combined, at a resolution of 30 m (~ 100 ft). Two corresponding files were obtained for each date: an NDVI image file containing processed NDVI values and the associated pixel QA file. The pixel QA file is a raster image with the same pixel grid as the remotely sensed image, in which each pixel is given a number that identifies its usability by reporting image quality issues for the associated pixel of the NDVI data (Footnote 1).

This project incorporates data from March 1997 to December 2017 to capture temporal changes in NDVI. Additional processing included filtering NDVI values based on typical quality control criteria. For instance, pixels flagged as “Cloud Shadow,” “Cloud,” or “Water” in the corresponding QA file were excluded from the analysis. Undertaking the data integration at the pixel scale allows for better coverage since clouded areas can be removed without discarding the entire Landsat image. Areas within village boundaries were also eliminated from consideration since they do not represent the communal areas where resource collection takes place. We also excluded neighboring game reserves and parkland (see Fig. 2) since village residents typically do not have access to these spaces.
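Pixel-scale quality filtering of this kind can be sketched as a bitmask test. The bit positions below (water = 2, cloud shadow = 3, cloud = 5) are assumptions made for illustration and should be checked against the QA documentation of the specific Landsat product; the function names are hypothetical.

```python
# Assumed bit positions in the pixel QA value; verify against the product guide
WATER_BIT, CLOUD_SHADOW_BIT, CLOUD_BIT = 2, 3, 5
EXCLUDE_MASK = (1 << WATER_BIT) | (1 << CLOUD_SHADOW_BIT) | (1 << CLOUD_BIT)

def usable(qa_value):
    """True when none of the water, cloud shadow, or cloud bits is set."""
    return (qa_value & EXCLUDE_MASK) == 0

def filter_ndvi(ndvi_pixels, qa_pixels):
    """Replace flagged pixels with None, keeping the rest of the image."""
    return [v if usable(q) else None for v, q in zip(ndvi_pixels, qa_pixels)]

# One clear pixel followed by cloud, water, and cloud-shadow pixels
cleaned = filter_ndvi([0.3, 0.5, 0.1, 0.4], [2, 32, 4, 8])
```

Because only the flagged pixels are dropped, a partly clouded scene still contributes its clear pixels to the buffer statistics.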

We use 2000 m (2 km) buffer zones within which the NDVI associated with a household point location is calculated; the choice was informed by research on typical travel distances for natural resource access (Giannecchini et al., 2007) (Footnote 2). The determination of appropriate buffer size must be informed by cultural context and existing knowledge of population-environment linkages within that socio-ecological space. Median NDVI values within each individual buffer zone are estimated as a measure of the central tendency of NDVI values available to each household from March 1997 to December 2017. We also use a measure of household resource availability that is the sum of the NDVI values divided by the number of households in each individual household buffer zone. This metric serves as a proxy of relative resource availability accounting for both distance of access and the number of households that may share proximate resources. Households located further from village boundaries or in high-density areas have lower natural resource availability than households near village boundaries or in less-dense areas (e.g., see Leyk et al., 2012).
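The two buffer-based measures can be illustrated as follows. This is a simplified sketch that treats pixels as points with projected coordinates in metres and counts the focal household among the households in its own buffer; both choices, along with the function name and example values, are assumptions made for illustration.

```python
import math
from statistics import median

def buffer_metrics(household, pixels, households, radius=2000.0):
    """Return (median NDVI, sum of NDVI / households in buffer) for one household.

    pixels: list of (x, y, ndvi), with ndvi set to None for excluded pixels.
    households: list of (x, y) locations, including the focal household.
    """
    values = [v for (px, py, v) in pixels
              if v is not None and math.dist(household, (px, py)) <= radius]
    if not values:
        return None, None
    n_households = sum(1 for h in households if math.dist(household, h) <= radius)
    return median(values), sum(values) / max(1, n_households)

pixels = [(100.0, 0.0, 0.2), (0.0, 1500.0, 0.4), (1000.0, 1000.0, 0.6),
          (3000.0, 0.0, 0.9), (10.0, 10.0, None)]
households = [(0.0, 0.0), (500.0, 0.0), (2500.0, 0.0)]
med, per_household = buffer_metrics((0.0, 0.0), pixels, households)
```

In this example three usable pixels fall inside the buffer and two households share them, so the second measure is half the NDVI sum.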

For this project, NDVI values are not aggregated over long periods of time (e.g., mean annual value) since such calculations necessarily diminish information on seasonality and other within-year changes. Instead, we incorporate 200 estimates of vegetation availability for each of the 22,708 households, an average of 10 months per year from 1997 to 2017, reflecting one measurement per month selected as close as possible to the middle of the month where available.

In the current study, we employ nine geomasking techniques; four of these use the “donut” approach such that no points are offset within a minimum radius. Approaches that use donut masking include (1) random displacement, (2) offsets that represent Gaussian distributions of displacement, and both (3) random displacement and (4) Gaussian displacement with a “distance/density factor” (explained below). For these four approaches, we consider a maximum radius of 300 m with a 150 m exclusion zone (i.e., residence locations displaced 150 to 300 m). This distance represents, on average, the largest distance between any two households located within the same village. Larger distances were tested but could not be used due to the smaller footprints of several villages. We also limit displacement to a household’s village extent (see Fig. 1). The influence of the village boundary constraint varies by village spatial size; as would be anticipated, larger villages have greater displacement potential. Village shape matters, too. For instance, Ireagh B is long and narrow and has smaller potential areas of displacement than villages of a similar size (e.g., Dumphries B) for all methods of displacement.
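The difference between random and Gaussian donut masking lies in how the displacement distance is drawn. The sketch below illustrates one possible parameterization and is not the implementation used here: the random variant draws distances so that points are uniform over the ring's area, while the Gaussian variant centres a normal distribution on the middle of the ring, with a standard deviation of one quarter of the ring width, and redraws until the distance falls inside the ring. Those distributional choices, and all function names, are assumptions. The village boundary constraint is omitted for brevity.

```python
import math
import random

def donut_random(rng, r_min=150.0, r_max=300.0):
    """Distance drawn so that displaced points are uniform over the ring's area."""
    u = rng.random()
    return math.sqrt(r_min ** 2 + u * (r_max ** 2 - r_min ** 2))

def donut_gaussian(rng, r_min=150.0, r_max=300.0):
    """Normally distributed distance, redrawn until it falls inside the ring."""
    mean = (r_min + r_max) / 2.0
    sd = (r_max - r_min) / 4.0  # assumed spread, not taken from the paper
    while True:
        r = rng.gauss(mean, sd)
        if r_min <= r <= r_max:
            return r

def displace(x, y, distance, rng):
    """Move (x, y) by the given distance in a uniformly random direction."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return x + distance * math.cos(theta), y + distance * math.sin(theta)

rng = random.Random(0)
masked_random = displace(0.0, 0.0, donut_random(rng), rng)
masked_gaussian = displace(0.0, 0.0, donut_gaussian(rng), rng)
```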

As mentioned, the “distance/density factor” is also considered in several of our illustrative masking approaches. The factor is calculated by relating the local population density (e.g., village) to total population densities (e.g., study region) (Cassa et al., 2008). Basically, within such weighted displacement approaches, the baseline displacement distance within buffers (300 m) is adjusted to compensate for proximity of households to one another within a particular village. More dense villages require less displacement for privacy protection.

$$Total \ density \ multiplier= \frac{Average \ total \ household \ density}{Village \ household \ density}$$
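As a worked illustration of the multiplier, the snippet below scales the baseline 150 m and 300 m radii for a village that is denser than average and one that is sparser. It assumes the adjustment is a simple multiplication of both radii, and the density values are hypothetical rather than taken from the study villages.

```python
def density_multiplier(avg_density, village_density):
    """Average household density of the study region over the village's density."""
    return avg_density / village_density

def adjusted_ring(avg_density, village_density, r_min=150.0, r_max=300.0):
    """Scale the baseline exclusion zone and maximum radius by the multiplier."""
    m = density_multiplier(avg_density, village_density)
    return r_min * m, r_max * m

# Hypothetical densities in households per km2
dense_village = adjusted_ring(400.0, 800.0)   # (75.0, 150.0): less displacement
sparse_village = adjusted_ring(400.0, 200.0)  # (300.0, 600.0): more displacement
```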

We also examine a method that adds consideration of k-anonymity to the density adjustment of displacement distance (Allshouse et al., 2010). With this approach, the minimum (Rai) and maximum (Rbi) displacement distances are defined by the density of households and user-defined levels of k-anonymity:

$${R}_{ai}={\left(\left({A}_{i} / \pi \right)\times \left({k}_{a} / {N}_{i}\right)\right)}^{1/2}$$
$${R}_{bi}={\left(\left({A}_{i} / \pi \right)\times \left({k}_{b} / {N}_{i}\right)\right)}^{1/2}$$

Ni represents the number of households in each village and Ai is the area. Here, ka is the minimum displacement threshold and kb equals the maximum displacement threshold. For example, specifying ka = 5 assumes that at least five households will be in closer proximity to the true household location than the displaced household location. For this study, we specify ka = 5 and 20 with kb = 10 × ka. The decision to use a multiplier of 10 follows work by Allshouse et al. (2010) and this particular measure is central to our exploration of the tradeoff between privacy vs. accuracy. These approaches are also displayed in Fig. 3 as (5) with ka = 5 and kb = 50, and (6) with ka = 20 and kb = 200.
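The intuition behind these radii is that, at a uniform density of Ni/Ai households per unit area, a circle of radius R contains on average πR² × Ni/Ai households; setting that count equal to k and solving for R gives the expressions above. The sketch below computes both radii for a hypothetical village of 1 km2 (expressed in m2 so that the radii are in metres) with 500 households; the function name and the input values are illustrative only.

```python
import math

def k_anonymity_radii(area_m2, n_households, k_a, k_b):
    """Minimum and maximum displacement radii from village area, size, and k levels."""
    base = area_m2 / math.pi
    r_a = math.sqrt(base * k_a / n_households)
    r_b = math.sqrt(base * k_b / n_households)
    return r_a, r_b

# Hypothetical village: 1 km2, 500 households, ka = 5 and kb = 10 x ka
r_a, r_b = k_anonymity_radii(1_000_000.0, 500, 5, 50)
# r_a is roughly 56 m and r_b roughly 178 m
```

Because kb is set to ten times ka, the maximum radius is always the square root of 10 (about 3.16) times the minimum radius, regardless of village size or density.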

Fig. 3

The nine illustrative geographic masking techniques examined. Black circle represents the original location; gray dots represent simulated possible locations using each masking method with a radius of 300 m and exclusion zones of 150 m for masking approaches 1 and 2. For approaches 3 and 4, the radius and exclusion zones are adjusted from 300 and 150 m using a total density multiplier (Dm). Dm = average total household density/village household density. For approaches 5 and 6, possible locations are placed within a radius Rbi and exclusion zone Rai. Approaches 7, 8, and 9 represent possible locations of displaced households within a village or sub-village boundaries

Our final three illustrative approaches involve random displacement of households within a geographic area defined differently than a circular buffer. These three approaches focus on within-village geographic clusters pre-defined by spatial attributes that likely shape livelihood strategies such as proximity to a major road or distance to communal lands. These involve random assignment and are also presented in Fig. 3, numbered within (7) the entire village; (8) sub-village areas defined by physical boundaries such as roads, rivers, or railroads; and (9) sub-village areas defined by distance from edge of village boundary in 100 m buffer zones.
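Approach (9) can be sketched as follows: each household is assigned to a band according to its distance from the village boundary, and its masked location is then drawn at random from within the same band. The code is an illustrative simplification, assuming a simple polygon in projected metres and rejection sampling over the polygon's bounding box; all function names and the square test village are hypothetical.

```python
import math
import random

def point_in_polygon(p, polygon):
    """Ray-casting point-in-polygon test."""
    x, y = p
    inside = False
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def distance_to_boundary(p, polygon):
    """Shortest distance from p to any edge of the polygon."""
    best = math.inf
    for i in range(len(polygon)):
        (ax, ay), (bx, by) = polygon[i], polygon[(i + 1) % len(polygon)]
        dx, dy = bx - ax, by - ay
        length2 = dx * dx + dy * dy
        t = 0.0 if length2 == 0 else ((p[0] - ax) * dx + (p[1] - ay) * dy) / length2
        t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
        best = min(best, math.hypot(p[0] - (ax + t * dx), p[1] - (ay + t * dy)))
    return best

def band_index(p, polygon, width=100.0):
    """0 for the outermost 100 m band, 1 for the next band inward, and so on."""
    return int(distance_to_boundary(p, polygon) // width)

def displace_within_band(p, polygon, rng, width=100.0, max_tries=100000):
    """Draw a random location inside the village that shares p's distance band."""
    target = band_index(p, polygon, width)
    xs, ys = [v[0] for v in polygon], [v[1] for v in polygon]
    for _ in range(max_tries):
        q = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(q, polygon) and band_index(q, polygon, width) == target:
            return q
    raise RuntimeError("no location found in the target band")

village = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0), (0.0, 1000.0)]
masked = displace_within_band((250.0, 500.0), village, random.Random(7))
```

A household can be moved anywhere within its band, so displacement distances under this approach can be much larger than under the buffer-based methods while its distance to communal land stays similar.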

Village descriptive profiles

There is substantial variation in overall population size across the AHDSS villages (Table 1); the village of Lillydale A is home to over 1600 residents while one sub-section of Somerset, Somerset B, has only 71 residents. This wide variation plays into the differentials in density, with the highest density in Somerset C (1180 households/km2) and lowest density in MP Stream (117 households/km2), a tenfold distinction. These variations speak to the need for randomization methods that account for substantial differences in size and density in that offsetting households a large distance may not be practical in the village of Somerset C with its small geographic footprint. However, for Somerset C, consideration must also be given to its low overall population which poses important challenges to privacy.

Table 1 Summary statistics for true household locations, growing-season months, 2 km buffers, MRC/Wits-Agincourt Unit
Table 2 Percentage of Households That Do Not Meet Privacy Thresholds

As to the environment, Fig. 4 presents estimated NDVI values for households during the summer of 2010 and reveals a substantial west–east greenness gradient. The study area’s western side is indeed characterized by slightly higher elevations, greater variation in topography, and more precipitation. It is also clear that villages in the eastern portion confront substantial resource constraints given both low NDVI values and boundaries with fenced reserves.

Fig. 4

NDVI values (Jan 2010) for true household locations, 2 km buffers, MRC/Wits-Agincourt Unit

Understanding implications of geomasking

Ultimately, this study’s objective is to demonstrate a process whereby researchers might better understand the implications of geomasking. Here, we aim to examine the significance of the differences in NDVI values for each masking approach as compared to the NDVI calculated based on the true household location. As such, the focus is on differences between geomasking methods as opposed to differences between households. The reference category represents the NDVI estimates derived from true household locations.

To generate a quantitative understanding of the distinctions in NDVI values, we determined differences between the two NDVI estimates calculated for the true household locations relative to those displaced. To quantify distinctions, however, our evaluation considers only growing-season months (September to April); Table 1 also presents this village-scale descriptive information across the entire study period, sorted by the average of the monthly median NDVI. The results reveal wide variation in the range of median NDVI values between and within villages. For instance, median NDVI values in Xanthia varied by 0.79, a relatively large amount given that NDVI values generally range between 0.0 and 0.8 in the study area. The villages of Cunningmore A, Makaringe, and Agincourt also show a high degree of variation in median household NDVI values. This pattern is also observed in the range of NDVI values estimated as the sum of NDVI divided by the number of households; the highest levels of variation are in the villages of Xanthia, Makaringe, and Agincourt. Although there is not a particularly strong correlation between density and NDVI, the villages in the study area’s western portion do tend to be larger in size and relatively less dense than villages with lower levels of variation of NDVI values which tend to be in the eastern portion of the study area (e.g., Somerset B, Lilydale A, and Rholane). Even so, the village with highest density, Xanthia, actually has especially high NDVI likely due to the area’s rolling topography and proximity to surface water sources.

These descriptive glimpses into variation are intriguing, but we also aim for a more thorough sense of the significance of these differences. To this end, and given the study’s longitudinal nature, we use multilevel models where repeated measures of NDVI (level 1) are nested within households (level 2) with both the intercept and slope for households varying across time. Since NDVI is a continuous measure, we use linear mixed regression models with fixed effects representing the different methods of randomization—the effect of the method used to estimate natural resource availability is reflected by the fixed effect’s estimated coefficient. For example, a statistically significant coefficient for the fixed effect using randomization approach 2 (i.e., Gaussian displacement with donut masking) would suggest important differences between the NDVI values calculated via this method and those reflecting NDVI surrounding households’ true locations while also accounting for temporal differences in NDVI. Analyses were conducted for each village to estimate variation of randomization methods at the village scale. This approach allows us to examine potential sources of deviation in given approaches that may be associated with regions of high or low NDVI values, or areas within the study site that have a high degree of variation in resource availability.
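One way a model of this kind could be written is given below. This is a reading of the verbal description above rather than the authors' stated specification, and the symbols are introduced here only for illustration.

$$NDVI_{tim} = \beta_{0} + \beta_{1} \, time_{t} + \sum_{m=1}^{9} \gamma_{m} D_{m} + u_{0i} + u_{1i} \, time_{t} + \varepsilon_{tim}$$

Here NDVI_tim is the NDVI estimate for household i at time t under location method m; D_m is an indicator for masking method m, with the true household location as the reference category; u_0i and u_1i are household-specific random deviations in the intercept and in the slope over time; and ε_tim is the residual. Under this reading, a coefficient γ_m that is statistically indistinguishable from zero suggests that method m reproduces the NDVI estimates from the true locations, on average, within the village being modeled.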

Results for median and sum NDVI are presented in Appendix Tables 3 and 4, respectively. Two particularly intriguing findings emerge. First, methods of displacement that account for housing density (random, Gaussian, and k-anonymity) provide results most similar to the median NDVI estimates of the true household locations as demonstrated by a consistent lack of statistically significant differences, especially for median NDVI estimates. While there are a greater number of statistically significant differences for the sum NDVI estimates, these differences are substantively small (less than 4%) for three specific methods (random density, Gaussian density and ka = 5 kb = 50). The greater number of significant differences is likely a result of the introduction of larger variation in NDVI estimates for masked household locations. For instance, sum NDVI estimates are more sensitive to the position of the household relative to the village boundary, where a house located in the interior portion of a village will have a smaller sum NDVI estimate than a house located near the boundary of a village and closer to the surrounding vegetation (Footnote 3).

Second, there are a few villages within which almost all of the geomasking approaches simply miss the mark, most notably Rholane, Somerset B, and Lilydale B. To spatially represent these patterns of error, Fig. 5 presents the differences in sum NDVI estimates between true and geomasked locations for each of the nine masking methods. As shown in Fig. 5, there is a high degree of variation in NDVI estimates depending on the method used. However, villages in the central and eastern portions of the study area consistently show a large difference between true and randomized NDVI values.

Fig. 5. Spatial distribution of difference in sum of NDVI values/number of households, true vs. geomasked locations, MRC/Wits-Agincourt Unit: (1) Random, donut; (2) Gaussian, donut; (3) Random (density), donut; (4) Gaussian (density), donut; (5) ka=5 kb=50, donut; (6) ka=20 kb=200, donut; (7) Random within village; (8) Random within sub-village by infrastructure (roads); (9) Random within sub-village by distance to boundary

To explore potential explanations for the variation in accuracy, we examined the underlying distributions of NDVI, population, and density, as well as other potentially influential factors such as village proximity to other villages and to the study region’s edge. Regardless of the method of masking, villages that are smaller and more densely populated tend to have lower variation between true and displaced NDVI values (e.g., Somerset C). This is because smaller and denser villages tend to have relatively low displacement distances, especially when using density-dependent approaches, which in turn results in less difference between true and displaced NDVI values. In the case of both random and Gaussian approaches of geomasking (panels 5a and 5b), villages with less difference between true and randomized NDVI values are located in the northern part of the study area (Makaringe, MP Stream, Dumphries A, Dumphries B, Dumphries C, Rolle C, and Kumani). These villages generally exhibit higher levels of NDVI compared to the middle part of the study area, where differences between true and randomized NDVI values are greater. Two southern villages, Belfast and Somerset C, also show better agreement. This is due to a combination of factors: the villages are relatively small and have small ranges of NDVI values. Belfast’s location at the edge of the study site further reduces variability between true and displaced NDVI values. In all, the spatial variation in error reflects the influence of NDVI variability, population, and/or population density, as well as other factors such as household proximity to village edges and to other households, and village proximity to other villages and protected areas. In a particular research project, it would be useful to consider how variation in these factors combines with the variable of interest to influence the accuracy of displaced values across the study region. Such understanding may ultimately influence the decision as to which geomasking method is most appropriate for a particular research project.

Tradeoffs between anonymity and accuracy

In addition to measurement error, it is important to understand potential implications for anonymity within each geomasking approach. In essence, this illustrates the balancing act that represents a key contribution of this paper but, as a reminder, these analyses illustrate a process which should be undertaken within spatially-explicit scholarship where anonymity is a concern. Recall that there are research-specific choices to be made regarding relevant environmental variables, buffer sizes, required anonymity threshold, and the like.

To illustrate tradeoffs, we calculated, for each household, the displacement distance between the true and masked locations and the estimated k-anonymity (defined as the number of households that are closer to the true location than the masked location is), and then averaged both measures within villages. The village-specific results are presented in Appendix Table 5 and summarized in Fig. 6a. On average, and as would be expected, masking methods that tend to be closer to the true location tend to have lower k-anonymity. In particular, random displacement and Gaussian approaches adjusted for density (approaches 1–4) provide smaller displacement distances and lower levels of actual k-anonymity (yellow oval, Fig. 6a). As shown in Appendix Table 5, for instance, in the least populated village, Somerset C, density-informed random geomasking results in an average of eight households being closer to the true location than the masked household. The density-informed Gaussian displacement lowered this to six.
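The two per-household quantities described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, coordinates are assumed to be in a projected system (e.g., metres), and the count follows the wording used here, i.e., the number of other households lying closer to the true location than the masked point does.

```python
import math

def displacement(true_xy, masked_xy):
    """Euclidean distance between the true and masked locations."""
    return math.dist(true_xy, masked_xy)

def k_anonymity(true_xy, masked_xy, other_households):
    """Count other households strictly closer to the true location
    than the masked location is (assumed operational definition)."""
    radius = math.dist(true_xy, masked_xy)
    return sum(1 for h in other_households if math.dist(true_xy, h) < radius)

# Toy coordinates, invented purely for illustration.
true_location = (0.0, 0.0)
masked_location = (3.0, 4.0)          # displaced by 5 units
neighbours = [(1.0, 1.0), (0.0, 4.0), (5.0, 0.0), (6.0, 6.0)]

d = displacement(true_location, masked_location)
k = k_anonymity(true_location, masked_location, neighbours)
```

In this toy case only two of the four neighbours fall strictly inside the displacement radius. Averaging `d` and `k` over the households of a village would give village-level summaries of the kind reported in the appendix.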

Fig. 6. Evaluation of accuracy and privacy for true vs. geomasked locations, MRC/Wits-Agincourt Unit. (a) Displacement distance versus k-anonymity. (b) k-Anonymity versus absolute difference in NDVI

Conversely, geomasking approaches that incorporate medium levels of both household density and k-anonymity (approaches 5 and 6) provide relatively higher levels of displacement and anonymity (green oval, Fig. 6a). At the other extreme, relatively high levels of anonymity and displacement characterize geomasking within village boundaries (approaches 7–9), as opposed to approaches based on buffers (orange oval, Fig. 6a). In particular, under approach 9, which uses sub-village areas defined by distance to village boundary, k-anonymity reached as high as 833 in the densely populated village of Agincourt, the study site’s namesake.

To evaluate implications of anonymity and accuracy, for each geomasking method we next plot the average k-anonymity versus the average difference in the sum of NDVI between true and masked household locations (Fig. 6b). We find that methods that have lower levels of k-anonymity (for example, random and Gaussian displacement) also tend to have smaller differences in NDVI measures (yellow oval, Fig. 6b). Conversely, methods with higher levels of k-anonymity tend to have larger differences in NDVI measures (for example, random village displacement). While methods that randomize locations within a village or sub-village boundary indicate higher levels of k-anonymity, these methods have relatively large differences in NDVI measures. Of particular interest, methods that incorporate both household density and k-anonymity as highlighted by the green oval in panel b (ka = 20 kb = 200) exhibit high levels of k-anonymity and relatively small differences in vegetation measures, suggesting perhaps the most effective approach for NDVI in this study setting.
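The per-method summary behind a plot of this kind can be sketched as below. The record layout, function name, method labels, and numbers are all hypothetical and serve only to show the aggregation step: one point per method, with average k-anonymity on one axis and average absolute NDVI difference on the other.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_method(records):
    """records: iterable of (method, k_anonymity, ndvi_true, ndvi_masked).
    Returns {method: (mean k-anonymity, mean absolute NDVI difference)}."""
    grouped = defaultdict(list)
    for method, k, ndvi_true, ndvi_masked in records:
        grouped[method].append((k, abs(ndvi_true - ndvi_masked)))
    return {
        method: (mean([k for k, _ in rows]), mean([diff for _, diff in rows]))
        for method, rows in grouped.items()
    }

# Invented values for two hypothetical methods, "A" and "B".
records = [
    ("A", 4, 10.0, 10.5),
    ("A", 6, 12.0, 11.5),
    ("B", 40, 10.0, 13.0),
    ("B", 60, 12.0, 11.0),
]
summary = summarize_by_method(records)
```

In this toy input, method "B" offers ten times the average k-anonymity of method "A" but four times the average NDVI difference, which is the kind of tradeoff the figure is meant to expose.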

These findings on tradeoffs are logical: larger displacements yield greater anonymity but also place the masked location further from the true location and are, therefore, more likely to yield a larger difference between true and displaced measures. Again, it is not our goal to firmly decide which of the nine approaches is “best”; rather, we aim to illustrate the process by which researchers might explore the implications of geomasking in a particular study context with particular research objectives.

The average level of anonymity provides insight into the utility of each method and the tradeoff between accuracy of NDVI measures and relative displacement. However, given that methods of anonymization assume a homogeneous distribution of households, we extend the analyses further by establishing a minimum acceptable threshold of k-anonymity for this particular illustration of the process of geomasking. For example, a threshold of five is not met if there are fewer than five households located closer to the true location than the masked location. In this study, we find that, regardless of the method of anonymization, at most 1.2% of households do not meet the privacy standard of five households. For a threshold of 10 households, only one method (Ka = 20 Kb = 200) provides a relatively high level of anonymization, with 99.8% of households meeting this privacy standard, although every method has at least one household that falls short of these privacy standards (Table 5). We recognize that the disclosure of even one household is problematic; as illustrated in Table 5, all methods disclose at least one household. This analysis therefore highlights that although masking methods may provide an acceptably high average level of protection (for example, approaches 7–9), it is important to consider how well a single household is protected. One response to this challenge is the removal of households that fail desired levels of anonymity from the dataset, of course with careful documentation. Another approach is to swap a specified number of households, or values within households, prior to displacement, thereby adding uncertainty that lessens the likelihood of identification. While these approaches may not be acceptable in some cases, the study’s investigators must balance the project’s needs against the level of detail required: more detailed data require a larger k-anonymity to minimize the likelihood of identification. (Table 2)
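The threshold check described above can be sketched in a few lines. This is an illustration under assumed names and invented values, not the study’s code: a household fails a threshold when its k-anonymity is strictly below it, and the flagged identifiers are the candidates for removal or swapping.

```python
def share_below_threshold(k_by_household, threshold):
    """Fraction of households whose k-anonymity is below the threshold."""
    if not k_by_household:
        raise ValueError("no households supplied")
    failing = sum(1 for k in k_by_household.values() if k < threshold)
    return failing / len(k_by_household)

def flag_households(k_by_household, threshold):
    """Sorted identifiers of households that fail the threshold."""
    return sorted(hh for hh, k in k_by_household.items() if k < threshold)

# Hypothetical household identifiers and k-anonymity values.
k_values = {"h1": 3, "h2": 5, "h3": 12, "h4": 9,
            "h5": 20, "h6": 4, "h7": 10, "h8": 30}

share_5 = share_below_threshold(k_values, 5)     # households with k < 5
share_10 = share_below_threshold(k_values, 10)   # households with k < 10
flagged_5 = flag_households(k_values, 5)
```

Reporting the share alongside the flagged list keeps both views in sight: a method can look adequate on average while still leaving individual households exposed.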

Discussion and conclusions

Substantial advancements in geographic information science have expanded the possibilities of scholarship linking the social and ecological worlds. Such scholarship is essential during this contemporary era of climate change in order to understand the processes shaping vulnerabilities.

The preliminary analysis presented here has been motivated by concern with facilitating socio-ecological research while preserving the confidentiality of social science survey respondents. Several studies have emerged over the past several years examining the potential error introduced by geomasking techniques, but most such research has examined error with regard to the creation of distance-based measures such as distance to health clinic (e.g., Warren et al., 2016). Results suggest that the creation of these distance-based covariates can yield consequential measurement error. But as opposed to distance-based measures, here we begin examination of the error introduced by geomasking for contextual measures reflecting proximate environmental conditions. We do so within the context of a health and demographic surveillance site, the Agincourt Health and Socio-Demographic Surveillance System in rural South Africa. The exploration suggests that in this study setting and for NDVI, geomasking approaches that use buffers and account for population density produce the most accurate results. However, the exploration also clearly demonstrates the tradeoff between accuracy and privacy, with greater accuracy resulting in a higher level of potential respondent identification. In this way, the analysis demonstrates the critical tradeoff between geographic accuracy and confidentiality, a tradeoff that must be carefully considered by study site administrators and research teams. It is important to remind readers that there are a variety of analytical choices that must be informed by specific study contexts, research questions, and the level of detail desired within the data.

While we focus on the Agincourt HDSS site, this is but one of nearly 50 such surveillance sites throughout sub-Saharan Africa, Asia, and Oceania. HDSS support research on the world’s most pressing development questions, and they do so in regions where reliable and comprehensive data would typically not otherwise be available. Within these social surveillance systems, health, mortality, fertility, and migration are often focal topics. Socio-demographic characteristics such as age, gender, and employment status are also recorded. Combined across individuals, households and years, these data provide extraordinary opportunities to improve understanding of social and ecological processes and their connection with health including infant mortality, child health, and disease outbreaks. As such, the approaches tested here have the potential for far broader impact in understanding connections between human well-being and environmental change over longer time periods and in different settings.

In the long run, we aim to contribute to methodologies balancing research and privacy needs. As noted by Zandbergen (2014:9), “the lack of comparative analysis of masking techniques provides a clear indication for desirable future directions.” Important next steps include examination of the error introduced within substantive analyses when using the NDVI estimates from anonymized household locations. Yet, the present manuscript offers an important first step in exploring anonymization prospects for health and demographic surveillance systems that are designed to facilitate essential population-health-environment research in this contemporary era of global environmental change.