15.1 Introduction

Many infectious disease experiments result in non-independent data because of spatial autocorrelation across fields (as discussed in Chap. 13), repeated measures on experimental animals (such as the in-host Plasmodium data discussed in Sect. 7.7), or other sources of correlated experimental responses among experimental units (such as the possibility of correlated infection fates among the rabbit littermates discussed in Sect. 4.3). Statistical methods that assume independence of observations are neither strictly valid nor fully effective on such data (e.g., Legendre 1993; Keitt et al. 2002). “Mixed-effects models” and “generalized linear mixed-effects models” (GLMMs) have been developed to optimize the analysis of such data (Pinheiro and Bates 2006).

While the full topic is outside the main scope of this text, it is very pertinent to analyses of disease data, so we will consider three case studies.

15.2 Spatial Dependence

We use the rust example introduced in Sect. 13.2 (Fig. 13.1) to illustrate two approaches to accounting for spatial dependence in disease data: (1) random blocks and (2) spatial regression. The experiment scored the severity of a foliar rust infection on three focal individuals of flat-top goldenrod in each of 120 plots across a field divided into four blocks. The experimental treatments were (1) watering or not, and (2) whether the surrounding non-focal host plants were conspecifics only, a mixture of conspecifics and an alternative host (the Canadian goldenrod), or the alternative host only.

Fig. 15.1 The spline correlogram of the residuals from the regression model of Keslow’s rust data

15.2.1 Random Blocks

As in our spatial pattern analysis, we jitter the coordinates because some methods require unique coordinates for each data point.
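
A minimal sketch of the jittering, assuming the data are in a data frame rust with plot coordinates x and y (placeholder names):

    set.seed(1)
    rust$x = rust$x + runif(nrow(rust), -0.01, 0.01)
    rust$y = rust$y + runif(nrow(rust), -0.01, 0.01)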

We first use lme to fit two random-effects models. The first considers individuals within blocks; the second considers plots nested within blocks.
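
A sketch of the two fits, with placeholder names for the rust severity (score), the treatments (water and spp), and the grouping factors (block and plot):

    library(nlme)
    # random intercept for block only
    fit1 = lme(score ~ water + spp, random = ~ 1 | block, data = rust)
    # random intercepts for plots nested within blocks
    fit2 = lme(score ~ water + spp, random = ~ 1 | block / plot, data = rust)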

We next use a likelihood-ratio test (provided by anova) to compare the two fits; it shows that the nested model provides the better fit.
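
Continuing the sketch above:

    anova(fit1, fit2)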

The intervals-call shows that the between-plot variance is about twice as large as the between-block variance, and that watered plots have a significantly higher rust burden.
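
Continuing the sketch:

    intervals(fit2)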

15.2.2 Spatial Regression

The above randomized-block mixed-effects models are the classic solution to analyzing experiments with spatial structure. An alternative is to formulate a regression model that considers the spatial dependence among observations as a function of separating distance. To investigate how proximate observations on different experimental treatments may be spatially autocorrelated, we can explore the spatial dependence among the residuals from a simple linear analysis of the data. We use the nonparametric spatial covariance function (as implemented in the spline.correlog()-function in the ncf-package) discussed in Chap. 13. We first fit the simple regression model that ignores space altogether.
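
A sketch of this step, using the same placeholder names as above:

    library(ncf)
    # non-spatial regression of rust severity on the treatments
    fitn = lm(score ~ water + spp, data = rust)
    # nonparametric spatial correlation function of the residuals
    fitc = spline.correlog(x = rust$x, y = rust$y, z = residuals(fitn),
       resamp = 500)
    plot(fitc)     # Fig. 15.1
    summary(fitc)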

The nonparametric spatial correlation function reveals strong spatial autocorrelation that decays to zero around 38 m (with a CI of 31–43 m).

To fit the spatial regression model we use the gls-function from the nlme-package (Pinheiro and Bates 2006). This function fits models to data comprising a single dependence group (i.e., one spatial map, one time series, etc.); with multiple groups we use the lme-function (see Sect. 15.3). There are many possible models for spatial dependence. We compare the exponential model (which assumes the correlation to decay with distance according to exp(−d/a), where d is distance and a is the scale) and the Gaussian model (exp(−(d/a)²)). (The nugget flag means that the correlation function is not anchored at one at distance zero.) We compare these to the nonspatial model (fitn) and the best random-block model (fit2) using AIC.
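
A sketch of the spatial regression fits (placeholder names as before), with the AIC comparison against fitn and fit2:

    library(nlme)
    # exponential spatial correlation with a nugget
    fite = gls(score ~ water + spp,
       correlation = corExp(form = ~ x + y, nugget = TRUE), data = rust)
    # Gaussian spatial correlation with a nugget
    fitg = gls(score ~ water + spp,
       correlation = corGaus(form = ~ x + y, nugget = TRUE), data = rust)
    AIC(fitn, fite, fitg, fit2)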

The AICs show that the exponential model provides the best fit. Moreover, the spatial regression model provides a better fit than the nested random effect model. This is presumably because of the gradual decay in correlation with distance (Fig. 15.1).

The parametrically estimated range of 9.8 m is a bit longer than (but within the confidence interval of) the e-folding scale (5.5 m) estimated by the spline correlogram; 1 − nugget = 0.64 is comparable to (though a little greater than) the 0.55 y-intercept. We can use the Variogram-function from the nlme-package to check whether the spatial model adequately captures the spatial dependence (Fig. 15.2). It looks like a plausible fit.
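
Continuing the sketch, for the exponential fit:

    plot(Variogram(fite))     # Fig. 15.2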

Fig. 15.2 A variogram plot of the fitted and observed spatial dependence for the spatial regression model

15.3 Repeated Measures of In-Host Mouse Malaria

Repeated measurements usually result in non-independent data because of inherent serial dependence. Consider Huijben’s data on anemia of mice infected by five different strains of Plasmodium chabaudi, introduced in Sect. 7.7, with measurements taken on each of days 3 through 21 and then on days 24, 26, 28, 31, 33, and 35. We will study the red blood cell (RBC) counts of mice infected by one of the five clones as well as an uninfected control group. The sample sizes per treatment were 10 for AQ, BC, CB, and ER, 7 for AT, and 5 for the controls. Eleven of the animals died. SH9 has the data (in long format). For the analysis we strip columns 1, 3, 4, 7, 8, and 11, which are extraneous, to focus on the RBC count:
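
A sketch of the column stripping, using the column positions listed above:

    SH9RBC = SH9[, -c(1, 3, 4, 7, 8, 11)]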

For the repeated-measures analyses we create a groupedData-object from the data frame using the nlme-package. The call below declares that the RBC counts represent a time series for each mouse. Note that mice that died are scored with a zero RBC count in the data set, and these zeros end up dominating the patterns; we therefore rescore them as missing (NA) and plot the grouped-data object to visualize the anemia by treatment (Fig. 15.3).
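
A sketch, assuming columns RBC, Day, and Ind2 as in the text, and a treatment column here called Treatment (a placeholder name):

    library(nlme)
    # dead mice are scored as zero RBCs; rescore these as missing
    SH9RBC$RBC[SH9RBC$RBC == 0] = NA
    # declare the RBC counts as time series per mouse, grouped by treatment
    dat = groupedData(RBC ~ Day | Ind2, data = SH9RBC, outer = ~ Treatment)
    plot(dat, outer = TRUE)     # Fig. 15.3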

Fig. 15.3 RBC counts of control and P. chabaudi-infected mice. Each panel represents a different parasite strain

The main difference is between the control and the infected treatments, but the maximum anemia varies somewhat among strains. To test for significant differences we use lme to build a repeated-measures model. In the simplest case we follow standard convention and model the time series using day as an ordered factor, and assume the treatment effect to be additive. The random = ~ 1 | Ind2 argument in the call indicates that we assume individual variation in the intercept (but not the slopes). We then use the ACF function to look for evidence of serial dependence in the residuals from the fit. As is apparent from the ACF plot, there is temporal autocorrelation in the residuals out to at least 4 days (Fig. 15.4).
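
A sketch of the additive repeated-measures model and the residual ACF:

    # day as an ordered factor; random intercept per mouse; NA rows dropped
    dat$Fday = ordered(dat$Day)
    fit = lme(RBC ~ Fday + Treatment, random = ~ 1 | Ind2,
       data = dat, na.action = na.omit)
    plot(ACF(fit, maxLag = 10))     # Fig. 15.4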

Fig. 15.4 Serial dependence as quantified using the ACF-function on the repeated measures mixed-effects model of the SH9RBC data

There are many models for serial dependence. We use a first-order autoregressive process (AR1), specified by the correlation = corAR1(form = ~ Day | Ind2) argument. Note that this is one of a variety of time-series models available in the nlme-package, the most general of which is the ARMA(p, q) model discussed in Sect. 6.2.1.
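
Continuing the sketch:

    fitAR1 = update(fit, correlation = corAR1(form = ~ Day | Ind2))
    summary(fitAR1)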

The Phi1 parameter of 0.7088 represents the estimated day-to-day correlation, which is substantial. We can plot the predicted and observed correlations; the AR1 model seems to be a nice fit (Fig. 15.5).
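
A sketch of the comparison, overlaying the AR(1)-implied correlations (Phi^lag) on the empirical residual ACF; 0.7088 is the estimate quoted above:

    emp = ACF(fit, maxLag = 10)
    plot(emp$lag, emp$ACF, type = "b", xlab = "Lag (days)",
       ylab = "Correlation")
    lines(emp$lag, 0.7088^emp$lag, lty = 2)     # AR(1) prediction (Fig. 15.5)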

Fig. 15.5 An ACF plot of the fitted and observed serial dependence for the repeated measures regression model

Moreover, a formal likelihood-ratio test provided by the anova function reveals that the correlated error model provides a significantly better fit to the data:
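
Continuing the sketch:

    anova(fit, fitAR1)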

Statistically, the time-by-treatment interaction model is better still than the additive model:
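
A sketch of the interaction model; because the two models differ in their fixed effects, they are refit by ML (rather than the default REML) before the comparison:

    fitAR1.ml = update(fitAR1, method = "ML")
    fitAR1x.ml = lme(RBC ~ Fday * Treatment, random = ~ 1 | Ind2,
       correlation = corAR1(form = ~ Day | Ind2), data = dat,
       na.action = na.omit, method = "ML")
    anova(fitAR1.ml, fitAR1x.ml)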

Finally we can plot the predicted values against time (filtering out predictions for the missing values in the original data) (Fig. 15.6). There is a distinct ordering in the virulence of the strains:
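
One way to sketch this plot, using population-level (level = 0) predictions over the non-missing observations:

    ok = !is.na(dat$RBC)
    pred = fitted(fitAR1x.ml, level = 0)
    plot(dat$Day[ok], dat$RBC[ok], col = grey(0.8), xlab = "Day",
       ylab = "RBC count")     # Fig. 15.6
    for (tr in unique(dat$Treatment)) {
       sel = dat$Treatment[ok] == tr
       ord = order(dat$Day[ok][sel])
       lines(dat$Day[ok][sel][ord], pred[sel][ord])
    }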

Fig. 15.6 Predicted and observed for the repeated measures RBC data

Modeling time as an ordered factor is wasteful of parameters (the full interaction model has 153 parameters). A flexible yet more economical approach is to model time using smoothing splines. The following example uses a B-spline with 5 degrees of freedom (Fig. 15.7). The qualitative features are similar to those of the more parameter-rich model (Fig. 15.6).
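
A sketch of the spline-in-time model:

    library(splines)
    fitspl = lme(RBC ~ bs(Day, df = 5) * Treatment, random = ~ 1 | Ind2,
       correlation = corAR1(form = ~ Day | Ind2), data = dat,
       na.action = na.omit, method = "ML")
    AIC(fitAR1x.ml, fitspl)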

Fig. 15.7 Predicted and observed for the repeated measures RBC data using a spline model in time

15.4 B. bronchiseptica in Rabbits

Bordetella bronchiseptica is a respiratory infection of a range of mammals (e.g., Bjørnstad and Harvill 2005). Its congeners, B. pertussis and B. parapertussis, cause whooping cough in humans, but B. bronchiseptica is usually relatively asymptomatic (though it can cause snuffles in rabbits and kennel cough in dogs). The data come from a commercial rabbitry that breeds NZW rabbits and were collected to study transmission paths in the colony; this is the same study we used to examine the age-specific force of infection in Sect. 4.3. Nasal swabs of female rabbits and their young were taken at weaning ( ∼ 4 weeks old). A total of 86 does and 408 kits were included in the study (Long et al. 2010).

To investigate whether (a) offspring of infected mothers have an increased instantaneous risk of becoming infected and (b) offspring of the same litter tend to share the same infection fate because of within-litter transmission, we use a random-effects (generalized linear mixed model, GLMM) logistic regression, with litter as a random effect. We first do some data formatting.

Here, the concern is with whether littermates share correlated fates. Unlike for spatial or temporal autocorrelation, there are no canned functions to quantify this correlation. However, following our discussion of autocorrelation in Sect. 13.3, it is easy to customize our own calculation, as sketched below. The first double loop makes a sibling-sibling “contact matrix,” tmp, that flags kits according to litter membership. Next, tmp2 rescales the binary sick vector that flags whether or not an animal was infected, and tmp3 generates the associated correlation matrix. Finally, mean(tmp3*tmp) provides the within-litter autocorrelation in infection status averaged across all litters.
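
A rough sketch of this calculation, assuming the kit data are in a data frame rabbit with a binary infection flag sick and a litter identifier Litter (placeholder names):

    n = nrow(rabbit)
    # sibling-sibling "contact" matrix: 1 for littermate pairs, NA otherwise
    tmp = matrix(NA, n, n)
    for (i in 1:n) {
       for (j in 1:n) {
          if (i != j && rabbit$Litter[i] == rabbit$Litter[j]) tmp[i, j] = 1
       }
    }
    # rescale the binary sick vector and form the cross-product matrix
    tmp2 = (rabbit$sick - mean(rabbit$sick)) / sd(rabbit$sick)
    tmp3 = outer(tmp2, tmp2)
    # average over littermate pairs only (non-pairs are NA in tmp)
    mean(tmp3 * tmp, na.rm = TRUE)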

The within-litter correlation of 0.53 represents substantial interdependence among littermates. Since the response variable is binary (infected vs. noninfected) we cannot use lme. Instead we use the lmer-function from the lme4-package and specify, using the “family” argument, that the response is binomial. Using AICs we contrast the fit with within-litter correlation (fitL) against the fit that assumes independence (fit0); the independence fit is generated by declaring that each of the 408 individuals is in its own group (variable X in the data set).
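
A sketch of the two fits; note that current versions of lme4 fit binomial GLMMs with glmer() (the family argument of lmer() has been retired). The mother’s infection status is here called msick (a placeholder name); anyothersick, Litter, and X are as described in the text:

    library(lme4)
    # litter as a random effect
    fitL = glmer(sick ~ msick + anyothersick + (1 | Litter),
       family = binomial, data = rabbit)
    # "independence" fit: each individual in its own group
    fit0 = glmer(sick ~ msick + anyothersick + (1 | X),
       family = binomial, data = rabbit)
    AIC(fitL, fit0)
    summary(fitL)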

The litter-dependent model is clearly the better one (no surprise given the strong empirical intra-litter correlation). The summary of the best model reveals that the key predictor of infection fate is whether or not a sibling was infected (anyothersickTRUE). The infection status of the mother was not a significant predictor. The mixed-effects logistic regression thus reveals that the most important route of infection is likely to be sib-to-sib transmission (Long et al. 2010).