1 Introduction

Geographic location provides a relational connection between individuals, their health behaviors, and a rapidly expanding array of data on neighborhood structure and composition [1,2,3]. Once these data are hierarchically integrated, it becomes possible to systematically investigate the impact of a wide range of neighborhood socio-ecological covariates on indicators of individual health and well-being [4,5,6,7]. The geographic landscape of retail tobacco products provides an excellent example [8,9,10,11,12,13,14,15,16,17,18,19]. A considerable body of empirical evidence demonstrates the ways point-of-sale tobacco marketing influences tobacco users’ product preferences as well as decisions to initiate or refrain from use [8, 9, 11, 20,21,22,23,24,25,26], and that targeted point-of-sale tobacco (POST) marketing tactics sustain tobacco-related health disparities [12, 27,28,29,30,31,32]. Studies have also focused on the lowest displayed pack price, promotions, placement for leading brands, and flavor descriptors. More recent work has found that cigarette packs cost less at retailers located near public schools than those near private schools [11], and that outlet proximity to schools and parks is linked to advertising practices and illicit youth sales [12].

Despite a proliferation of studies that seek to identify and study mechanisms linking neighborhood environments to the health-related decision-making and behavior of residents [2, 33,34,35,36], a number of unresolved conceptual issues persist, not least the definition of the ecological units of analysis themselves. Because there are no naturally occurring units of neighborhood, the neighborhood areal units employed by researchers are often arbitrary or linked to confounding geographic circumstances [37], undermining their utility for geographically explicit health research [2]. The ideal neighborhood unit would represent the “true causally relevant spatial context” [1]; however, so long as the true spatial context remains undefined, we are left with what Kwan has termed the ‘uncertain geographic context problem’ (UGCoP) [38]. Inferential errors brought about by the use of arbitrary methods for delineating neighborhoods likely contribute to the misclassification of effects and inconsistent evidence for socio-ecological relationships and health behavior across the research literature [38,39,40,41].

Residential location remains the most readily available geographic linkage for integrating individual and neighborhood data. Conventionally, researchers join residential addresses directly to the administrative zones they fall within (e.g., US Census polygons) or draw new ‘ego-centric’ buffer zones around those addresses [4]. These residential areal units act as spatially and temporally invariant containers for various neighborhood risk and protective factors, each of which can then be linked to residents themselves. Data on individual mobility patterns improve upon the assumptions that administrative zones or ego-centric buffers make about the propensity for individuals to come into contact with environmental risk factors. Mobility data capture actual travel patterns, including each person’s ever-accumulating set of origins and destinations, allowing for precise estimation of exposure to environmental risk factors over time. Dynamic conceptualizations of neighborhood exposures may enrich our understanding of health and place by better reflecting the true causally relevant spatial context of exposures to environmental risk in the population [6, 7, 42,43,44].

This paper presents a direct comparison of real time versus neighborhood approximated (residential address; [4]) exposure to a well-known environmental risk factor, the spatial distribution of tobacco outlets across the contiguous United States [19, 45,46,47]. This comparison provides an empirical framework for contrasting relative uncertainty in spatial exposures and estimating the likelihood of their misclassification. Residential and real-time exposures were computed for a sample of 363 people, each of whom voluntarily recorded their location every 10 min for 180 days using their cellular phone. Analyses use a common density-based exposure metric to examine the extent to which variation in exposure levels observed in real-time could be explained by exposure estimates approximated solely from residential location. By partitioning variation in exposure due to routine patterns of mobility, we specifically tackle the first of Chen and Kwan’s (2015) three key dimensions for research on retail access: contextual uncertainty due to real-time mobility patterns.

2 Methods

For clarity of presentation, the Methods are organized into three sections: 2.1 Data Preparation, 2.2 Tobacco Outlet Exposure Estimation, and 2.3 Statistical Analyses. Section 2.2 describes the calculation of home and real-time (hourly) exposure estimates, and Section 2.3 describes the statistical analysis framework, which is a special case of multinomial logistic regression and provides a framework for interpreting the model estimates. Agresti [48] shows the direct correspondence between the log-linear and logit modeling frameworks, both in mathematical and inferential terms, and we present an application of this approach here.

2.1 Data Preparation

2.1.1 Longitudinal Human Mobility Data

Longitudinal human mobility data was continuously recorded via the OpenPaths (https://openpaths.cc) platform, launched by the New York Times Company Research and Development Lab in May 2011 [49]. Participants who provided consent to participate in this research and downloaded the OpenPaths application for either iOS or Android were then able to use the application to continuously capture their current geographic location—i.e., latitude and longitude, wirelessly uploaded according to a 10-min sampling rate. When participants remained stationary (less than 5 m location change between successive coordinates), the OpenPaths application suspended data collection to save battery, ensuring that OpenPaths data represents periods of at least minimal mobile activity rather than extended periods of rest, such as while people sleep at night or sit for long periods at their desks. While stationary, the application continued to “listen” for a greater than 5-m location change and reinitiated continuous tracking whenever such a change was detected.

The raw longitudinal mobility dataset used for the present analysis contained 8,458,902 observations collected from 859 individuals worldwide from 01/01/2012 to 06/06/2015. Each observation included a unique user ID, date, time, latitude, and longitude. The raw mobility data were initially clipped to the United States using state outline polygons published by United States Census Bureau, yielding a US cohort of 744 individuals with 4,647,152 observations. Data from the lower 48 states and Washington, DC were then clipped to correspond with the tobacco density layer (Section 2.2), yielding 729 individuals with 4,556,619 observations. Participants tracked their location over a mean of 267 days (Median = 178; SD = 278), recording an average of 6251 location coordinates (Median = 3192; SD = 10,289). To standardize the amount of data contributed by each individual while maximizing the amount available for analysis, each participant’s daily data records were left-truncated, retaining their most recent 180 days of tracking data. Individuals with fewer than 180 days of data were not included in the present analyses, ensuring all between-person contrasts were based on an equivalent 180-day sampling frame. The final longitudinal mobility dataset thus included Np = 363 individuals and 1,418,151 real-time observations over the 180-day study period.

2.1.2 Estimating Tobacco Outlet Density

A nationwide density surface of tobacco retail outlet locations was generated using kernel density estimation (KDE). This non-parametric method extrapolates from spatially distributed point data by estimating their continuous density using spatial density functions known as kernels, each of which has a specified circular radius size known as the bandwidth [50]. Gaussian kernels with a fixed 8047 m (5 mile) bandwidth were used to generate the final density surface, from which density estimates could be extracted with a resolution of 250 m.
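As a rough illustration of the kernel density step described above, the following Python sketch evaluates a fixed-bandwidth Gaussian kernel surface at the cell centers of a regular grid. It is not the software used in the study. The function names (`gaussian_kde_density`, `density_grid`), the use of projected coordinates in meters, and the treatment of the bandwidth as the Gaussian standard deviation are assumptions made for illustration only.

```python
import math

def gaussian_kde_density(x, y, outlets, bandwidth_m=8047.0):
    """Summed Gaussian kernel density at (x, y), in outlets per square meter.

    outlets: iterable of (x, y) outlet coordinates in a projected CRS (meters).
    bandwidth_m: assumed here to be the kernel's standard deviation.
    """
    h2 = bandwidth_m ** 2
    norm = 1.0 / (2.0 * math.pi * h2)  # normalizing constant of a 2-D Gaussian
    total = 0.0
    for ox, oy in outlets:
        d2 = (x - ox) ** 2 + (y - oy) ** 2  # squared planar distance
        total += math.exp(-d2 / (2.0 * h2))
    return norm * total

def density_grid(outlets, xmin, ymin, ncols, nrows, cell_m=250.0):
    """Evaluate the density at the center of each cell of a regular grid."""
    return [
        [
            gaussian_kde_density(
                xmin + (c + 0.5) * cell_m, ymin + (r + 0.5) * cell_m, outlets
            )
            for c in range(ncols)
        ]
        for r in range(nrows)
    ]
```

Summing the kernels without dividing by the number of outlets yields an intensity (outlets per unit area) rather than a probability density; which convention a given GIS package uses should be checked before comparing values.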

The empirical basis for this probability density surface was a national dataset of tobacco retail outlets, identified by North American Industry Classification System (NAICS) codes. Developed by the Office of Management and Budget, NAICS is the standard used by Federal statistical agencies to classify businesses based on their primary activity [51]. In 2012, geocoded data were obtained from D&B Hoovers. The following retail categories and corresponding NAICS codes were included: beer, wine, and liquor stores (NAICS: 445310); supermarkets and other grocery stores (NAICS: 44511); convenience stores (NAICS: 44512); pharmacies and drug stores (NAICS: 446110); gasoline stations with convenience stores (NAICS: 44711); other gasoline stations (NAICS: 44719); department stores (NAICS: 452111); discount department stores (NAICS: 452112); and tobacco stores (NAICS: 453991). For pharmacies and department stores, we individually reviewed all chains with 50 or more locations to determine whether they sold tobacco and excluded those that did not [52]. Based on this analysis, we also excluded all other department stores and pharmacies, as they likely do not sell tobacco. In addition, we excluded major department store chains and grocery stores that, based on store policy, do not sell tobacco products (i.e., Target, Whole Foods, Trader Joe’s, Wegmans). We also excluded pharmacies and drug stores in the 55 Massachusetts and 2 California municipalities that have banned the sale of tobacco products within these establishments. The final dataset comprised N = 269,781 retail outlets.

2.2 Tobacco Outlet Exposure Estimation

2.2.1 Residential Exposure

Residential locations were extracted from participants’ OpenPaths mobility data using a two-step procedure based on Toole et al. [53] and originally given by Zheng and Xie [54]. In the first step, meaningful locations (stay events) were extracted from the mobility data for each participant for each nighttime period (8 pm–7 am). In the second step, each participant’s set of nighttime stay events covering the entire 180-day analysis period was combined into a set of aggregate stay points. The aggregate stay point comprising the most nighttime stay events for each participant was selected as the maximally unbiased estimator of residential location. Residential retail exposures were then generated by spatially joining each participant’s inferred residential location to the density surface of tobacco retail outlet locations (Mean = 9.50, Median = 3.81, SD = 12.25). Exposures were square root transformed owing to right skew and classified into five groups based on exposure intensity.
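The two-step logic above can be sketched in simplified form. The Python below is a stand-in rather than the stay-point algorithm of the cited works: it snaps nighttime fixes to a coarse latitude/longitude grid, counts the number of distinct nights on which each cell was visited, and returns the centroid of the most-visited cell. The function names, the 0.001-degree cell size, and the grid-snapping shortcut are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def is_night(ts):
    """True for timestamps inside the 8 pm-7 am nighttime window."""
    return ts.hour >= 20 or ts.hour < 7

def night_id(ts):
    """Calendar date on which the night containing ts began."""
    return (ts - timedelta(hours=7)).date()

def infer_home(fixes, cell_deg=0.001):
    """Estimate a home location from (datetime, lat, lon) fixes.

    Returns the mean (lat, lon) of the grid cell visited on the most
    distinct nights, or None if there are no nighttime fixes.
    """
    nights = defaultdict(set)    # cell -> set of nights with a visit
    points = defaultdict(list)   # cell -> raw coordinates falling in it
    for ts, lat, lon in fixes:
        if not is_night(ts):
            continue
        key = (round(lat / cell_deg), round(lon / cell_deg))
        nights[key].add(night_id(ts))
        points[key].append((lat, lon))
    if not nights:
        return None
    best = max(nights, key=lambda k: len(nights[k]))
    pts = points[best]
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
```

Counting distinct nights rather than raw fixes keeps a single long night away from home from outweighing many ordinary nights, which is the spirit of selecting the aggregate stay point with the most nighttime stay events.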

2.2.2 Real-Time Exposure

Participants’ OpenPaths mobility data were used to compute the hourly radius of gyration (Rg), measured in meters [28]. Rg estimates the size and spread of a participant’s personal activity space for a given hour. Rg is defined as the root-mean-square distance between a participant’s locations and their center of mass, analogous to a standard deviation:

$$ {R}_g=\sqrt{\frac{1}{N}\sum \limits_{k=1}^N{\left({r}_k-{r}_{\mathrm{mean}}\right)}^2} $$

where N is the total number of location coordinates collected from each individual per hour, and \( {r}_{\mathrm{mean}} \) is the individual center of mass, or the mean longitude and latitude of all N locations. The great circle distance in meters between a specific location and the center of mass (\( {r}_k-{r}_{\mathrm{mean}} \)) was calculated using Vincenty’s formulae [55]. The dataset was then aggregated to 465,279 hourly observations of Rg. Given the present focus on routine day-to-day mobility patterns, long-range travel was excluded by dropping observations with Rg larger than 160 km (Ndrop = 945), taken as the maximum distance a car can travel in 1 h. Hours with zero movement (Rg = 0) were then excluded, yielding 363 individuals with a total of 304,164 hourly observations (65.4% of the hourly data).
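The Rg definition above can be sketched in a few lines of Python. Note one deliberate substitution: this sketch uses the spherical haversine formula for distances in place of Vincenty’s formulae, so results will differ slightly from an ellipsoidal calculation. The function names and the mean Earth radius constant are assumptions introduced for illustration.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; an assumption of this sketch

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points on a sphere."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def radius_of_gyration_m(points):
    """Rg for one participant-hour.

    points: list of (lat, lon) fixes recorded within the hour.
    The center of mass is the mean latitude and longitude, as in the text.
    """
    n = len(points)
    if n == 0:
        raise ValueError("at least one location is required")
    lat_c = sum(p[0] for p in points) / n
    lon_c = sum(p[1] for p in points) / n
    sq = sum(haversine_m(lat, lon, lat_c, lon_c) ** 2 for lat, lon in points)
    return math.sqrt(sq / n)
```

With these definitions, an hour containing a single fix, or several identical fixes, gives Rg = 0, which corresponds to the zero-movement hours that were excluded from the analysis.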

Real-time exposure was conceptualized as the product of a participant’s hourly movement (Rg) and their hourly aggregate exposure to retail outlets (Mean = 22,620, Median = 7.01, SD = 144,191). Each real-time mobility coordinate contributed by the participants was joined to a tobacco outlet density value extracted from the KDE surface. Hourly exposure levels are, thus, the product of each participant’s Rg within each hour under observation and the average tobacco outlet density value across the set of mobility coordinates recorded within the same hour. This exposure variable approximates the number of tobacco outlets that surrounded each participant within each hour of the study and, as expected from a count variable of this kind, the observed distribution of exposure values was heavily skewed to the right. To improve correspondence with the standard assumptions of a categorical count-based data analysis framework, both residential and hourly exposure values were square root transformed and determined to closely follow a negative binomial distribution. This produced real valued outputs in the range 0–27, which were divided into 27 clusters by binning with a one unit spacing. Each cluster effectively labels all participant hours within a particular range of exposure intensity, with subsequent clusters representing the increasing intensity of participants’ hourly exposures in this sample.
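A minimal sketch of the hourly exposure calculation and binning described above follows. It assumes that each fix has already been assigned a density value from the KDE surface, and that clusters are labeled 1 through 27 with the top bin closed at 27; the function names and the labeling convention are illustrative assumptions, not details given in the text.

```python
import math

def hourly_exposure(rg_m, densities):
    """Product of hourly Rg and the mean outlet density across that hour's fixes."""
    if not densities:
        raise ValueError("at least one density value is required")
    return rg_m * (sum(densities) / len(densities))

def exposure_cluster(exposure, n_clusters=27):
    """Square-root transform, then bin with one-unit spacing into clusters 1..27.

    Bin k covers [k - 1, k) on the square-root scale; values at or above the
    upper limit are assigned to the last cluster.
    """
    root = math.sqrt(exposure)
    return min(int(root), n_clusters - 1) + 1
```

For example, an hour with Rg = 100 and a mean density of 1.0 has an untransformed exposure of 100, a square root of 10, and therefore falls in cluster 11 under this labeling.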

To model temporal patterns of exposure, time-related variables based on social convention were derived for each observation. These were time-of-day (24-h clock), day-of-week (weekday versus weekend), and season. Time of day was defined by four 6-h windows: 3:00–9:00 as “early,” 9:00–15:00 as “day,” 15:00–21:00 as “evening,” and 21:00–3:00 as “late,” treating observations recorded between midnight and 3 AM as part of the preceding day [56]. Day-of-week was coded binary, indicating whether each observation fell on a weekend, defined as falling after 17:00 Friday through 17:00 Sunday. Season was also defined as categorical, with December–February as winter, March–May as spring, June–August as summer, and September–November as fall.
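The temporal coding rules above translate directly into code. The Python sketch below is illustrative: the function names are hypothetical, the weekend window is treated as running from 17:00 Friday up to, but not including, 17:00 Sunday, and the weekend flag is computed on the raw timestamp (whether the 3-hour day shift was also applied to the weekend flag is not stated in the text).

```python
from datetime import datetime, timedelta

def time_of_day(ts):
    """Four 6-hour windows: early, day, evening, late."""
    h = ts.hour
    if 3 <= h < 9:
        return "early"
    if 9 <= h < 15:
        return "day"
    if 15 <= h < 21:
        return "evening"
    return "late"  # 21:00-3:00

def social_day(ts):
    """Date an observation belongs to, counting 00:00-03:00 as the preceding day."""
    return (ts - timedelta(hours=3)).date()

def is_weekend(ts):
    """True from 17:00 Friday until 17:00 Sunday."""
    wd = ts.weekday()  # Monday = 0 ... Sunday = 6
    if wd == 4:
        return ts.hour >= 17
    if wd == 5:
        return True
    if wd == 6:
        return ts.hour < 17
    return False

def season(ts):
    """Meteorological seasons as defined in the text."""
    m = ts.month
    if m in (12, 1, 2):
        return "winter"
    if m in (3, 4, 5):
        return "spring"
    if m in (6, 7, 8):
        return "summer"
    return "fall"
```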

2.3 Statistical Analyses

2.3.1 Log-Linear Model Selection

The interactive association between real-time exposures (27 clusters) and residential (5 clusters) exposures was stratified across time-of-day (4 categories), day-of-week (2 categories), and season (4 categories), using a set of multivariate contingency tables that populated a 5-dimensional matrix with a total of 27×5×4×2×4 = 4320 cells. Patterns of association within this large multivariate contingency table were analyzed with generalized categorical data analysis techniques [48]. Specifically, we employed an exponential, log-linear modeling framework. Log-linear models convert the multiplicative relations among joint and marginal counts in a contingency table to additive, linear associations by transforming the counts to logarithms [48]. Hierarchically nested model comparison techniques were used to iteratively identify the most parsimonious combination of factors required to explain the observed data. Systematic comparison of hierarchically nested log-linear models produced a likelihood ratio test statistic presented in the text. The saturated model represents the log frequencies for the cell index (h,w,t,s,e) of all non-ordinal combinations of both real-time exposure and residential retail exposure (home), time-of-day, weekend, and season:

$$ \mathit{\ln}\left({\mu}^{hw t se}\right)=\lambda +{\lambda}_h^H+{\lambda}_w^W+{\lambda}_t^T+{\lambda}_s^S+{\lambda}_e^E+{\lambda}_{hw}^{HW}+\cdots +{\lambda}_{hw t}^{HW T}+\cdots +{\lambda}_{hw t e}^{HW T E}+\cdots +{\lambda}_{hw t se}^{HW T SE},h=1,\dots, 5,w=1,2,t=1,\dots, 4,s=1,\dots, 4,e=1,\dots, 27. $$

where H is home exposure, T is time-of-day, W is weekend, S is season, and E is real-time exposure. h, t, w, s, and e are categories within H, T, W, S, and E. \( {\mu}^{hwtse} \) represents the expected cell frequencies in the five-dimensional contingency table. λ is the constant. \( {\lambda}_h^H \) denotes the row effect of \( {\lambda}_i^H,i\in h \). \( {\lambda}_w^W \), \( {\lambda}_t^T \), \( {\lambda}_s^S \), and \( {\lambda}_e^E \) also represent row effects. \( {\lambda}_{hw}^{HW} \) is the interaction term \( {\lambda}_{ij}^{HW} \) between H and W, where i ∈ h, j ∈ w. \( {\lambda}_{hwt}^{HWT} \),…, \( {\lambda}_{hwte}^{HWTE} \),…, \( {\lambda}_{hwtse}^{HWTSE} \) are higher dimensional interaction terms. For readability, letter symbols are used in Table 1 to represent the highest interaction model terms.
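To make the likelihood ratio statistic concrete, the Python sketch below computes G2 for the simplest log-linear model, independence in a two-way table, where the fitted counts have a closed form. The five-way models in this paper require iterative fitting and are not reproduced here; the function names and the two-way example are assumptions chosen purely for illustration.

```python
import math

def g_squared(observed, expected):
    """Likelihood ratio statistic G2 = 2 * sum(n * ln(n / mu)) over cells.

    Cells with an observed count of zero contribute nothing to the sum.
    """
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def independence_fit(table):
    """Fit the independence log-linear model to a two-way table of counts.

    Returns (G2, df), where fitted counts are row_total * col_total / n and
    df = (rows - 1) * (cols - 1).
    """
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    expected = [ri * cj / n for ri in rows for cj in cols]
    observed = [cell for row in table for cell in row]
    df = (len(rows) - 1) * (len(cols) - 1)
    return g_squared(observed, expected), df
```

The saturated model reproduces the observed counts exactly, so its G2 is 0 with 0 degrees of freedom; nested models are compared through the difference in their G2 values, referred to a chi-square distribution with the difference in df.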

Table 1 Step-down contrasts of “best-in-class” models

2.3.2 Log-linear Model Interpretation: Ordinal Logit Modeling

Ordinal logit models were used to interpret observed associations within a log-linear framework, utilizing well-established logistic model reporting and interpretation standards [48]. In this paper, log-linear estimation and selection were used to identify the most parsimonious model that best fit the observed data, and ordinal logit modeling was then used to address our primary aim: examining the extent to which variation in exposure to tobacco retail outlets observed in real-time could be explained by exposure estimates approximated as a function of residential location and time (time-of-day, day-of-week, season-of-year). Following Agresti (2012), this was accomplished by setting the residential exposure variable (home) from within the log-linear model as an ordinal dependent variable. Within this framework, each ordinal residential exposure level represented a “cluster” of participants, and model results estimate the probability, expressed on the log-odds (logit) scale, that each real-time exposure observation was contributed by a participant from each of the residential density clusters.

3 Results

Overall, mean hourly Rg was 0.55 km with a standard deviation of 2.66 km. Minimum and maximum were 0 and 152.94 km. Within each day, Rg was the lowest in the middle of the night and higher across the remainder of the day. Two spikes were observed in Rg in early morning and late afternoon on weekdays, which is likely related to commuting between home and work—while there was generally more variation on weekends. Figure 1 illustrates the daily drop in mobility across the late-night hours, followed by a steep rise across the early morning, and then divergence on Saturday and Sunday, with early Sunday afternoon revealed as the window of greatest mobility on average.

Fig. 1 Radius of Gyration Measured Mobility Patterns by Time of Day across Day of Week

Figure 2 presents generalized additive model smoothed real-time exposure intensity by time of day for each residential exposure cluster. On weekdays, real-time exposure spikes in the early morning and late afternoon across all residential exposure clusters, probably due to increased movement and exposure to retail outlets during the commute to and from work. However, this relationship becomes more prominent as residential exposure cluster increases from low to high. In contrast, weekend exposure was elevated at noon and then remained high until early evening, and was broadly consistent across residential exposure clusters, except in the high exposure cluster which deviates from the patterns observed.

Fig. 2 Weekday and Weekend Real-time Exposure by Time of Day for Each Residential Exposure Cluster

3.1 Model Fitting: Best-in-class Log-linear Model Selection

Table 1 presents an overview of the best-in-class model selection process used to identify the most parsimonious model, defined as the minimal set of parameters required to provide an adequate fit to the observed data. Following hierarchically nested model comparison techniques, model performance is evaluated using both the degrees of freedom (df) and the likelihood ratio test statistic (G2). The p value tests lack of fit: a model with a p value of 1 fits at the 99% confidence level, whereas a p value of 0 indicates that the model does not fit. The initial basis for comparison is Model 1.9, the saturated model, which corresponds perfectly to the raw data and has zero degrees of freedom because the number of parameters equals the total number of cells generated by all interactive combinations of the five factors under study: retail exposure (the conceptual dependent variable), residential exposure (home), weekend, time-of-day (time), and season. Model 1.4 fits while using only 829 cells to model the total 1325 cells under study. It also accurately predicts the five-dimensional contingency table using lower-level four-way interactions. This is the most parsimonious model, effectively isolating an informative pattern in the data that then becomes the basis for inference.

To measure the separate strength and significance of each interaction in Model 1.4, models excluding one of the two interaction terms were evaluated. Table 2 shows the likelihood ratio evaluation of the influence of the excluded term compared to Model 1.4 with all two-term four-way interactions. This method provides a specific test of conditional independence between seasonal effects and real-time exposure levels. These model fits indicate that while the interactions between weekend effects and other factors are important, their associations do not contribute in Model 1.5 as much as the interactions between residential (home) exposure, time, season, and real-time exposure.

Table 2 Likelihood ratio evaluation: removal of four-way terms from maximally parsimonious model

3.2 Model Inference: Ordinal Logit Modeling of Residential Density Clusters

To examine real-time tobacco retail outlet exposures among clusters of residential exposure, an ordinal logit model was constructed to be mathematically equivalent to the most parsimonious log-linear model (see Section 2.3.2 and Table 1, Model 1.4), itself identified through the model selection process described in Section 3.1:

$$ logit\left[P\left(H=h|T=t,S=s,E=e\right)\right]=\alpha +{\beta}_t^T+{\beta}_s^S+{\beta}_e^E+{\beta}_{ts}^{TS}+{\beta}_{te}^{TE}+{\beta}_{se}^{SE}+{\beta}_{ts e}^{TS E}+\varepsilon, h=1,\dots, 5,t=1,\dots, 4,s=1,\dots, 4,e=1,\dots, 27. $$

where the log-odds (i.e., logit) of residential exposure cluster membership (home, H) for each hour is modeled as a function of the real-time exposure level (E), time-of-day (T), and season (S) associated with each hour. α is a constant. h, t, s, and e are categories within H, T, S, and E, and ε is the error term. \( {\beta}_t^T \), \( {\beta}_s^S \), and \( {\beta}_e^E \) represent the effects of factors T, S, and E, respectively. \( {\beta}_{ts}^{TS} \), …, \( {\beta}_{tse}^{TSE} \) represent the effects of interactions between factors. For readability, letter symbols are used in Table 2 to represent the highest interaction model terms. Model-predicted residential exposure cluster membership and associated confidence intervals were generated via bootstrapping with 500 random samples of 10,000 observations, drawn with replacement from the empirical data distribution, effectively capturing the uncertainty associated with each parameter estimated by the model. This simulation-based resampling approach allowed for precise discrimination between the different residential exposure clusters (Fig. 3).
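The resampling scheme described above (500 samples of 10,000 observations drawn with replacement) can be illustrated with a simple percentile bootstrap. The sketch below resamples a plain statistic rather than refitting the ordinal logit model, so it shows the mechanics only; the function names, the percentile interval, and the fixed seed are assumptions of this example.

```python
import random

def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)

def bootstrap_ci(values, stat=mean, n_boot=500, sample_size=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values).

    Draws n_boot samples of sample_size observations with replacement,
    evaluates stat on each, and returns the (alpha/2, 1 - alpha/2)
    percentiles of the resulting estimates.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    estimates = sorted(
        stat(rng.choices(values, k=sample_size)) for _ in range(n_boot)
    )
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
```

If `values` were a 0/1 indicator of whether each hour was contributed by a participant from a given residential cluster, the returned interval would bound the proportion of hours attributable to that cluster; non-overlapping intervals between clusters correspond to the separation discussed for Fig. 3.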

Fig. 3 Real-time Exposure for Each Residential Exposure Cluster

Figure 3 presents the model-predicted results of this process, illustrating the predicted probability that each of the 304,164 hours under study was contributed by a participant from each of the residential exposure clusters. Separation between the bootstrapped 95% confidence intervals corresponds to regions of the distribution of real-time hourly exposures that were significantly explained by one or more of the density clusters. Overall, 61.3% of real-time, hourly exposures were of relatively low intensity, and after controlling for temporal and seasonal variation, 72.8% of the variance among these low-level exposures was accounted for by residence in one of the two lowest residential density quintiles. Residence in one of the two highest residential density quintiles accounted for approximately 50% of the variance among extreme exposure levels, but extreme levels of exposure were rare, constituting about 1% of the data. Altogether, 55.2% of the variance in real-time exposures was not explained by participants’ residential exposure cluster, and most moderate- to high-intensity real-time exposures (38.7% of all hourly exposures) were no more likely to have been contributed by subjects from any single residential density cluster than another. In sum, OpenPaths participants experienced heterogeneity in hourly tobacco retail outlet exposures that is only partially explained by their static residential exposures.

4 Discussion

While environmental “exposures” are most commonly thought of as biological – “internal” contact with toxic particles in the environment (e.g., air pollution and infectious pathogens) – there is growing recognition that monitoring of exposures to the broader ecosphere or “eco-exposome” is also important [57,58,59]. Individual-level geographic location data provide a spatial linkage that makes it possible to estimate the multivariate impact of countervailing societal and environmental systems on individual decision-making and behavior. This gives rise to the possibility of using such information for disease prevention and intervention delivery. It is our position, however, that continuous, real-time location data need not and should not be limited to use within real-time, “just-in-time adaptive” interventions. In fact, we believe the present paper provides an example of the way such micro real-time data can be better understood when it is aggregated, because it is only then that we can properly account for the relative significance of the various locations each participant frequents, at least as they pertain to the tobacco point-of-sale landscape.

Traditional estimates of exposure to risk and protective factors in neighborhoods are founded on the idea that the relative concentration of health-related factors around people’s homes sufficiently captures and thus can be used to characterize aggregated patterns of environmental exposure within and between neighborhood areas. This paper evaluated the degree to which residential locations approximate actual exposures by comparing empirical observations collected in real-time with static neighborhood estimates that only used information about each participant’s residence. Results demonstrate the utility of a continuous geolocation data smoother (i.e., Rg) that makes it possible to generate dynamic, mobility weighted KDE exposure values that retain both spatial and temporal resolution. Findings suggest that real-time exposures are misclassified by person-level residential exposure estimates to a substantial degree, especially among people residing in areas characterized by moderate to high levels of residential density.

Essentially, results of this work indicate that exposures to moderate and high levels of tobacco outlet density were systematically less likely than lower density exposures, and that subjects who resided in moderate- to high-density areas were less likely to experience real-time exposures as high as those estimated from the observed density around their residential location. This finding suggests that residence-based neighborhood approximations tended to overestimate the exposure levels experienced by residents in the real world and, somewhat counter-intuitively, that this may have been particularly true within dense urban areas, where despite high levels of density overall, shorter travel distances among a smaller set of stores dampened observed hourly exposure levels.

This paper advances the literature in a number of ways. Focusing on a reliable, national source of tobacco outlet data allowed us to identify variation in urban dynamics and behavior across different regions of the US. The use of continuous real-time geo-location tracking provides excellent temporal and spatial resolution, which improved sensitivity to detect dynamic patterns in the data. The analysis framework developed here can be used to assess mobility patterns, exposure to points of interest, and associated effects on health behavior. Nevertheless, this study also had methodological limitations that should be considered. This sample is not nationally representative, as participation in the mobility tracking was based on self-selection, and required access to the Internet and a smartphone. Additionally, because no participant demographic information was available, other factors that could potentially affect participants’ mobility, such as occupation or income level, could not be measured.

5 Conclusions

Results of this work shed light on the nature of real-time exposures to a spatially distributed environmental risk factor, as compared to a commonly used neighborhood exposure estimate. Future work should leverage methods of this kind to advance our understanding of individual decision-making and behavior change dynamics as a function of environmental conditions. Natural extensions would incorporate other policy and health-relevant risk and protective factors, such as the distribution of food, alcohol, and cannabis products. Research that involves clinical populations attempting to modify habitual health behaviors would be useful, including work with patients working to adhere to dietary restrictions or quit cigarette smoking. It will also be interesting to investigate the basic mechanisms underlying these associations, such as memory and other cognitive processes affected by regular product exposures and associated preferences. Variations across geographic areas and over time may provide insight and identify targets of intervention for public health practitioners, urban planners, and policy makers.