Introduction

Biomarkers of environmental chemical exposures are widely used to assess etiologic relationships in perinatal epidemiology [1]. For example, there is great interest in understanding the effects of toxic chemicals, such as phthalates or polychlorinated biphenyls (PCBs), on pregnancy outcomes [2,3,4,5]. These effects can be studied using measurements from human biospecimens, such as blood or urine, which quantify the presence and the concentration of various toxic environmental agents. A key advantage of using biomarkers is that they unify diverse sources and routes of exposure to estimate the biological dose in the target tissue [6, 7]. They also allow investigators to study multiple exposures simultaneously by using data taken from the same biospecimen [6, 7].

The analysis of biomarker data entails an extraordinary collection of statistical challenges [8,9,10,11,12]. A biomarker can describe a measurement in the causal sequence of events between an exposure to a hazardous factor and a health outcome [6]. However, a biomarker cannot be used to directly quantify the effects of interventions because the actual source and route of exposure are difficult to determine. Furthermore, attributing the cause of a health outcome to a single biomarker can be difficult because there are often dozens of biomarkers to take into consideration and they may be highly correlated. PCBs, for example, involve 209 congeners measured in blood plasma at very low levels [13]. Chemical concentrations measured in blood or urine samples from pregnant women may also be influenced by a variety of factors including dilution-dependent sample variation, censoring at low levels due to detection limits, short chemical half-life, and variability due to physiology and metabolism.

This review aims to discuss recent developments in the statistical analysis of biomarkers with particular emphasis on chemical exposures during pregnancy. Given the breadth of the field, we limit our review to significant trends and developments related to two topics: (1) statistical methods for exposure assessment using biomarkers and (2) multivariable modeling of perinatal outcomes where multiple biomarkers are included as inputs in a linear regression model. Throughout, we focus on settings where the scientific objective is to estimate the confounder-adjusted association between individual biomarkers within a single chemical class and a continuous outcome.

Statistical Challenges in Exposure Assessment Using Biomarkers

The quality of exposure measurement plays a critical role in estimating the effects of environmental exposures. The use of biomarkers adds an additional layer of complexity [9, 10]. Errors in exposure measurement can introduce bias and uncertainty into estimates of health effects [14]. When working with biomarkers, the objective is to estimate the health effects of usual personal exposures over several months or during specific windows of vulnerability (e.g., trimesters of pregnancy) [15]. However, there are a myriad of data-analytic challenges that one may encounter. For example, non-persistent chemicals such as phthalates and bisphenol A tend to metabolize quickly and measured analyte concentrations from urine samples tend to reflect recent exposure [12].

In the following sections, we review three statistical challenges in exposure assessment that are specifically related to biomarkers. We note that there is a vast literature on biostatistical methods to correct for measurement error and misclassification [10, 14]. These approaches, including regression calibration and Bayesian approaches, model the relationship between the observed exposure and an unobserved true exposure, and they permit statistical adjustment for bias and uncertainty in the exposure-disease relation [10]. However, these methods are less commonly used in epidemiological studies with biomarkers of environmental chemical exposures because, in most settings, there is no unambiguous definition of the true exposure that can be measured objectively and then used to estimate the size of the measurement error [14].

Biomarkers That Fall Below the Limit of Detection

Biomarkers for low-level chemical exposure frequently fall below the minimum detectable capacity of the analytical instruments. These measurements, reported by many laboratories as less than the limit of detection (< LOD), cannot be used directly in a statistical analysis. Substitution with a fixed value such as LOD/2 or LOD/√2 is the most common strategy for handling the left-censored data. Deletion of all observations < LOD is also commonly seen. However, when the same value is substituted repeatedly or when data are not censored at random, then increased bias, decreased power and reduced variability will result [16,17,18,19,20]. Although deletion and substitution are widely used and easy to implement, the consensus in the literature is that they should be avoided when the frequency of non-detection exceeds 10% [18, 21].

In contrast, maximum likelihood estimation (MLE) is usually superior. It uses the uncensored data to maximize the likelihood function for the log-transformed censored data and iteratively produces estimates with low bias and high statistical power [17, 19, 20, 22,23,24]. MLE is the gold standard when the data follow a parametric distribution [17, 18]. However, MLE is not suitable for datasets with small sample sizes or non-parametric distributions. In these settings, multiple imputation (MI) is preferable [18, 22]. MI generates multiple imputed datasets, analyzes each one individually, and combines the resulting estimates to form a final estimate [18, 22, 25,26,27]. Between five and ten imputations are recommended [22, 28], and MI can be implemented easily with statistical software. However, the selection of a suitable imputation model may be complicated [22, 28, 29].
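As a minimal sketch of the contrast between substitution and MLE, the following assumes that log concentrations are normally distributed with a single LOD, and obtains the censored-data maximum likelihood estimates with an expectation-maximization (EM) iteration. The function names, the simulated data, and the use of `None` to flag a value < LOD are illustrative assumptions and are not taken from the cited studies.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def substitute_lod(log_values, log_lod):
    """Replace censored values (None) with log(LOD / sqrt(2))."""
    fill = log_lod - 0.5 * math.log(2.0)
    return [fill if v is None else v for v in log_values]

def censored_normal_mle(log_values, log_lod, n_iter=200):
    """EM estimate of the mean and SD of log concentrations (None = < LOD)."""
    observed = [v for v in log_values if v is not None]
    n = len(log_values)
    n_cens = n - len(observed)
    # Starting values from detected observations only (refined below).
    mu = sum(observed) / len(observed)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in observed) / len(observed))
    for _ in range(n_iter):
        # E-step: mean and variance of a normal truncated above at log_lod.
        alpha = (log_lod - mu) / sigma
        lam = STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)
        cens_mean = mu - sigma * lam
        cens_var = sigma ** 2 * (1.0 - alpha * lam - lam ** 2)
        # M-step: update the parameters from the expected sufficient statistics.
        mu_new = (sum(observed) + n_cens * cens_mean) / n
        ss = sum((v - mu_new) ** 2 for v in observed)
        ss += n_cens * (cens_var + (cens_mean - mu_new) ** 2)
        mu, sigma = mu_new, math.sqrt(ss / n)
    return mu, sigma

# Simulated example (assumed values): true mean 0 and SD 1 on the log scale.
random.seed(7)
log_lod = -0.5
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
censored = [v if v >= log_lod else None for v in sample]
mu_hat, sigma_hat = censored_normal_mle(censored, log_lod)
filled = substitute_lod(censored, log_lod)
```

Comparing the mean and standard deviation of `filled` with `mu_hat` and `sigma_hat` gives a sense of how substitution compresses variability under these assumptions. The sketch does not cover multiple LODs, covariates, or multiple imputation.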

In general, there is no single method that is superior in all settings and researchers are encouraged to select methods that best fit their own scenarios [17]. As bias increases with the proportion of < LOD, it is sometimes recommended that a chemical be omitted completely from the analysis if the frequency of non-detection exceeds 20%. Furthermore, it is often incorrectly assumed that the measurement error in the reported values < LOD is greater than the errors in those above the LOD [26, 30]. It has been suggested that machine readings, which are the uncensored outputs from the analytical instruments, may be more accurate than the imputed values [18, 31]. Therefore, before choosing an imputation method, obtaining the machine readings is encouraged to better understand the measurement error process [32].

Adjustment for Lipid or Urine Dilution to Better Estimate Biologically Relevant Exposure

Variabilities in physiological processes can influence biomarker measurements in blood and urine samples. Lipophilic chemicals in blood samples require adjustment for lipid dilution [32, 33••], whereas urinary chemicals require creatinine or specific gravity adjustment to account for hydration level [34]. For example, when measuring PCBs, individuals with more body fat will have greater PCB concentrations compared to leaner individuals and, if uncorrected, this measurement error leads to an overestimation of PCB exposure levels in the target tissue [35]. Consequently, to estimate the biologically relevant exposure, it is necessary to account for dilution-dependent sample variation.

Traditionally, lipid dilution is adjusted by dividing the chemical concentration by serum lipid levels (e.g., μg/L or μg/g lipids) [34, 35], and urine diluteness is adjusted using either creatinine or specific gravity (SG) [36,37,38]. There is an ongoing debate about which adjustment methods are best, but there has been no definitive answer that works in all settings [34]. O’Brien et al. [39••], using directed acyclic graphs and simulation studies, compared traditional adjustment methods with novel methods. They recommended adjusting urinary samples by creating a “creatinine z-score,” which divides the chemical concentration by the creatinine ratio (Cratio), defined as the observed creatinine level divided by the predicted value [39••]. An alternative option is to adjust for both the “creatinine z-score” and creatinine as covariates [39••]. As for serum samples, both the traditional adjustment method and the inclusion of lipids as a covariate in the regression model are recommended [39••].
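The two urinary adjustments described above can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of the cited authors: predicted creatinine is obtained here from a least-squares fit of log creatinine on a single hypothetical covariate, whereas in practice the prediction model would include several covariates. All function and argument names are assumptions.

```python
import math

def creatinine_standardized(conc, creat):
    """Traditional adjustment: analyte concentration per unit creatinine."""
    return [c / cr for c, cr in zip(conc, creat)]

def fit_line(x, y):
    """Ordinary least squares intercept and slope for a single covariate."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def covariate_adjusted(conc, creat, covariate):
    """Divide each concentration by (observed / predicted) creatinine."""
    log_creat = [math.log(cr) for cr in creat]
    intercept, slope = fit_line(covariate, log_creat)
    predicted = [math.exp(intercept + slope * z) for z in covariate]
    return [c / (obs / pred) for c, obs, pred in zip(conc, creat, predicted)]
```

When a participant's observed creatinine equals the value predicted from her covariates, the ratio is 1 and the concentration is left unchanged; more dilute or more concentrated samples are scaled up or down accordingly.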

Two-stage models for creatinine adjustment have also been proposed [40, 41], and other newer methods involve examining total lipid and creatinine levels across participants according to sociodemographic factors [42], biological media (e.g., blood or urine), and chemical characteristics (e.g., persistence, lipophilicity, or hydrophilicity) [43•]. Regardless of the adjustment method, it is important to keep in mind that it is logistically challenging to obtain tissue-specific measurements which could help better understand the toxicant distribution in the target tissues. Blood and urine samples, therefore, are proxies of how the chemicals affect inaccessible human tissues [33••, 39••]. Even after correcting for dilution differences, the exposure measurements may not reflect exposure in the target tissue.

Handling Repeated Measures of Biomarkers Within a Single Pregnancy

Biomarker exposures have traditionally been assessed using spot samples. Recent literature has revealed that this practice of assessing exposure at a single time point during pregnancy does not adequately represent the exposure throughout the whole pregnancy [44•, 45••, 46•]. Single exposure measurements typically ignore within-person variability and may lead to exposure misclassification [45••, 47], especially among non-persistent chemicals with low intraclass correlation coefficients (ICCs), such as bisphenol A (BPA) and some phthalates, where within-person variability is especially high [45••].

ICCs determine the reliability of the exposure measurements by calculating the ratio of between-person variability to total variability (i.e., between- plus within-person variability) over a period of time. For persistent chemicals such as polybrominated diphenyl ethers (PBDEs), where within-person variability is small compared to between-person variability, the ICCs are close to 1 (e.g., 0.87 to 0.99), which indicates that single samples can reliably measure average exposures [48]. For chemicals with ICCs less than 0.6, such as BPA and phthalates, it was found that as many as 35 samples per participant are needed to achieve adequate statistical power and limit bias [45••]. For these low ICCs, multiple measurements per participant collected over the course of the pregnancy are recommended [44•, 46•, 49]. However, the quality of exposure measurement also depends on the characteristics of the biological sample. For example, cotinine measured in hair reflects longer periods of exposure to secondhand smoke while serum cotinine reflects exposure in recent days [50].
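To make the ICC definition concrete, the sketch below computes a one-way random-effects ICC from balanced repeated measures and, as an added illustration, applies the Spearman-Brown formula for the reliability of a mean of k samples per participant. The function names are assumptions, the calculation presumes the same number of samples for every participant, and it is not intended to reproduce the sample-size results of the cited simulation study.

```python
def icc_oneway(samples):
    """One-way random-effects ICC; samples[i] holds k repeated measures for person i."""
    n = len(samples)
    k = len(samples[0])
    grand = sum(sum(person) for person in samples) / (n * k)
    means = [sum(person) / k for person in samples]
    # Between-person and within-person mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for person, m in zip(samples, means) for x in person
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def reliability_of_mean(icc, k):
    """Spearman-Brown formula: reliability of the average of k samples."""
    return k * icc / (1 + (k - 1) * icc)
```

For instance, `reliability_of_mean(0.2, k)` can be evaluated over a range of k to see how slowly reliability improves when the single-sample ICC is low. Biomarker concentrations would usually be log-transformed before computing the ICC.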

Assaying multiple samples per participant results in higher assay costs. A simulation study by Perrier et al. [45••] showed that repeated measurements can be pooled before analysis with equal aliquots from each sample, and this reduces costs. Compared to using single measures, this cost-saving method improved power and reduced bias [43•, 45••]. Alternatively, at the data analysis stage, repeated biomarker measures can be summarized into a single measure of cumulative exposure [33••, 51]. Using the mean biomarker concentration across visits improves power to detect exposure-disease relationships even when exposure has poor stability over time [33••]. However, the mean concentration may not reflect exposure during biologically relevant periods, and any data that are missing not at random can bias the results. When an acute exposure is of interest, using the maximum rather than the mean exposure across visits is appropriate. It is important to note that the effects of the acute exposure represented by maximum exposure will also depend on its temporal relationship with an outcome. Recently, Chen et al. [33••] reviewed several statistical models for repeated measures of biomarkers in pregnant women including multivariable regression, regression-in-parallel and multiple informants modeling, two-stage methods, and clustering methods. There are also statistical approaches using measurement error models such as regression calibration and simulation extrapolation (SIMEX) to correct for exposure misclassification [10].
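The pooling and summary strategies above amount to simple arithmetic, sketched here with hypothetical function names. The pooled concentration is the volume-weighted average of the individual samples, so pooling equal aliquots yields the arithmetic mean across visits; missing visits are not handled in this sketch.

```python
def pooled_concentration(concs, volumes):
    """Concentration of a physical pool: volume-weighted average of the samples."""
    total_volume = sum(volumes)
    return sum(c * v for c, v in zip(concs, volumes)) / total_volume

def summarize_visits(concs):
    """Mean (cumulative exposure) and maximum (acute exposure) across visits."""
    return {"mean": sum(concs) / len(concs), "max": max(concs)}
```

With equal volumes, `pooled_concentration` and the `"mean"` entry of `summarize_visits` coincide, which is why pooling before the assay and averaging after the assay target the same quantity.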

Besides providing a more accurate exposure assessment, repeated measurements can identify critical windows of vulnerability by analyzing changes in biomarker levels across pregnancy. The identification of critical windows is difficult statistically because the biomarkers are correlated within individuals, and this induces collinearity if the repeated measures are included in the outcome model [33••]. Furthermore, important time windows are mostly unknown and are highly dependent on the chemicals and outcomes studied. To determine differences in effects across time of exposure, while accounting for correlated exposures, several modeling approaches such as those examined by Chen et al. [33••] and Sanchez et al. [51] can be applied.

Statistical Challenges in Multivariable Modeling with Biomarkers of Environmental Chemical Exposure

When examining the relation between multiple biomarkers in the same chemical class and a health outcome, the critical issue is to determine the scientific objective of the analysis [12, 52, 53]. The investigator must carefully consider the causal questions, competing hypotheses, and the quantification and interpretation of evidence [52]. However, in the case of biomarkers of environmental chemical exposures the inferences about causality are primarily at the hypothesis-generating end of the research spectrum. For example, when examining the dose-response relationship over continuous levels of PCBs measured in blood, there are no clearly defined interventions that can directly modify biomarker levels in human tissue [7, 54]. Consequently, in the discussion that follows, we focus on statistical challenges where the goal is to estimate the confounder-adjusted association between individual biomarkers within a chemical class (e.g., PCBs) and a continuous outcome, where the dose-response is presumed to have a simple mathematical form (e.g., linear or monotonically increasing).

The most widely used analytic strategy is multiple linear regression where one or more of the biomarkers are included as predictor variables, along with confounders [55••]. Controlling for too many biomarkers can lead to data sparsity or multicollinearity, particularly when the number of predictors is large in relation to the sample size [56]. Consequently, researchers will seek to reduce the number of variables in the model. Biomarkers present a host of unique challenges in multivariable modeling because some analytes are chemically related. For example, phthalate metabolites in urine are often derived from the same parent compound and tend to be highly correlated. Three specific problems include the following: (1) accounting for heterogeneity in biomarker levels between chemicals within the same chemical grouping (e.g., PCBs), (2) variable selection and shrinkage techniques for biomarkers, and (3) the role of dimension reduction techniques such as summation of chemical concentrations.

Accounting for Heterogeneity in Biomarker Levels Between Chemicals Within the Same Class

When examining the association between several exposure biomarkers within a chemical class and a health outcome, it is desirable to report a measure of effect that conveys the magnitude of the association, per unit change in the biomarker. However, choosing a measure of effect that is easily interpreted is not simple. Consider, for example, the case of PCB153, which is more widely detected, and also more variable between individuals, than PCB118 [1]. If the measured concentrations for both compounds are included directly in a multivariable model for the outcome, then the regression coefficient for PCB153 will be diminished relative to PCB118 merely because the exposure level for PCB153 is more heterogeneous in the study population. More generally, within any chemical class there are often dozens of analytes to take into consideration, many of which are detectable in fewer than 50% of the study subjects. Furthermore, different authors use different scales of measurement of exposure, such as μg/L vs. ng/g lipid. Consequently, it is extremely difficult to assemble a collection of predictor variables for inclusion in a multivariable model.

It is common practice to log transform the biomarker concentrations, using base 2 or 10, before incorporating them into a multivariable model. In addition to limiting the influence of outliers, this approach also incorporates a non-linear dose-response between the biomarker concentration and the mean of the outcome variable, where the slope of the curve levels off at higher doses [5]. Nonetheless, the interpretation of the multivariable regression analysis results remains problematic. The regression coefficients correspond to changes in the mean of the outcome that are associated with a 2-fold (or 10-fold) increase in the level of the biomarker. This is difficult to interpret without referring to percentiles of the concentration on the original scale. For example, a 10-fold increase in concentration from the median may entail extrapolation of effects outside of the range of biologically plausible exposure levels.
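The interpretation of a coefficient on the log scale can be illustrated with a short sketch. It assumes a single biomarker and no confounders, and the function names are hypothetical. A slope fitted on log2 concentration is the change in the mean outcome per doubling, and multiplying it by log2 of any fold change gives the change associated with that fold change.

```python
import math

def slope_on_log2(conc, outcome):
    """Least-squares slope of the outcome on log2 concentration (per doubling)."""
    x = [math.log2(c) for c in conc]
    mx, my = sum(x) / len(x), sum(outcome) / len(outcome)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, outcome))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def change_per_fold(beta_per_doubling, fold):
    """Change in mean outcome for a given fold increase in concentration."""
    return beta_per_doubling * math.log2(fold)
```

Setting `fold` to the ratio of the 75th to the 25th percentile of the concentration on the original scale is one way to report an effect that stays within the observed range of exposure, rather than a 10-fold increase that may extrapolate beyond it.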

A different strategy is to put the biomarkers onto a common scale by subtracting the mean and dividing by the standard deviation [57]. Subtracting the mean typically improves the interpretation of main effects in the presence of interactions [58]. The resulting regression coefficient is the expected difference in the outcome comparing participants that differ by one standard deviation in the input variable, conditional on the remaining predictors. More generally, scaling predictors is widely used in combination with shrinkage methods, such as lasso regression, because it can dramatically affect regression coefficient estimation [59]. However, scaling input variables using the standard deviation has been criticized because the resulting scale of measurement is not transportable, meaning that the regression coefficients cannot be compared between studies because they depend on the distribution of exposure, which is sensitive to arbitrary features of the study design [60, 61]. Furthermore, right-skewness of biomarker concentrations tends to distort the standard deviation to render it meaningless as a measure of variability.

We recommend two alternative approaches to account for heterogeneity in biomarker levels: either discretizing the predictor into quartiles or tertiles [62, 63], or rescaling the biomarker concentrations using a standard reference such as an interquartile range (IQR). The standard reference should be expressed in natural units of the exposure, which should be established based on the distribution of the exposure in the target population. Ideally, it should correspond to levels of exposure where biologically meaningful changes in the outcome are anticipated. Discretization has been criticized in the statistical literature [64] because the presumed model for the relationship between the predictor and the outcome is typically unrealistic, which can introduce bias. However, discretization has the advantage that it balances the tradeoff between simplicity and interpretability versus proper fit [65].
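Both recommended approaches can be sketched with the Python standard library. The function names are assumptions, the reference values stand in for the exposure distribution of the target population, and the "inclusive" quantile method is one of several conventions for computing quartiles.

```python
from bisect import bisect_left
from statistics import quantiles

def iqr(reference_values):
    """Interquartile range of the exposure in a reference distribution."""
    q1, _, q3 = quantiles(reference_values, n=4, method="inclusive")
    return q3 - q1

def coefficient_per_iqr(beta_per_unit, reference_values):
    """Rescale a per-unit regression coefficient to a per-IQR contrast."""
    return beta_per_unit * iqr(reference_values)

def quartile_groups(values):
    """Assign each value to quartile group 1 to 4 (values at a cut point go low)."""
    cuts = quantiles(values, n=4, method="inclusive")
    return [bisect_left(cuts, v) + 1 for v in values]
```

Because the IQR is expressed in the natural units of the exposure, a per-IQR coefficient can be reported together with the IQR itself, which allows readers to convert it back to a per-unit effect.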

Variable Selection and Shrinkage Techniques for Biomarkers Within the Same Chemical Class

Variable selection and model building using biomarkers of environmental chemical exposure is uniquely challenging [9]. Consider, for example, the case of PCBs, which can be measured by as many as 209 unique congeners in blood. In populations not subject to widespread pollution, most PCBs are detectable in fewer than 50% of the study participants. On the one hand, we could attempt to impute low concentrations. However, the low detection rates raise questions about which biomarkers should be incorporated into the model. Many investigators discard all but the most widely detected congeners and retain those that meet a particular detection threshold (e.g., 20%). Other authors limit their focus entirely to PCB153 based on the argument that it is correlated with other PCBs and with PCB summary metrics [13]. Discarding some PCBs is reasonable because, individually, we do not expect low-level exposures to predict measurable variation in the outcome. On the other hand, all PCBs reveal information about cumulative PCB exposure, and some PCBs may be toxic even at very low levels. The health effects of low-level chemical exposures during pregnancy are mostly unknown [66]. Therefore, new research is required to support multivariable modeling of high-dimensional low-level biomarkers [9, 67]. One important direction of research is to develop novel MI techniques for joint imputation of multiple correlated biomarkers that fall below the LOD and to characterize settings in which such methods will be most useful [67].

A related issue in multivariable modeling concerns the adjustment for co-pollutant confounding [55••, 68••, 69••]. Consider the case of PCB153 and PCB180, which are both widely detected in human populations [13]. The biomarkers are correlated because the congeners were manufactured together as commercial PCB mixtures. The investigators must decide whether both compounds should be included into the same multivariable model for a health outcome or not. If both PCBs are included, the regression coefficients become harder to interpret etiologically because they describe changes in the mean response conditional on the respective PCBs. However, PCB153 and PCB180 do not vary independently and the standard error of the regression coefficients may increase dramatically due to collinearity. Conversely, if both PCBs are not included in the model, then the regression coefficients will be confounded by one another [69••]. It is likely that etiologically relevant parameters that describe the causal effects of individual chemicals can be recovered via prediction from a model that incorporates multiple biomarkers from the same class, for example, by using the parametric g-formula [58, 70, 71].
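The loss of precision caused by including two correlated biomarkers in the same model can be quantified with the variance inflation factor (VIF), which for two predictors equals 1/(1 - r^2), where r is their correlation. The sketch below uses hypothetical function names and ignores other covariates.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when x1 and x2 enter the same linear model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)
```

The standard error of each coefficient is inflated by the square root of the VIF relative to the case of uncorrelated predictors, so a correlation of 0.9 between two congeners roughly doubles the standard errors.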

One widely used analytic strategy is to combine variable selection with shrinkage methods, such as Bayesian hierarchical modeling [72, 73], the lasso or elastic net [55••, 74, 75]. Bayesian methods use probability distributions to model uncertainty. These distributions, called prior distributions, are combined with the data to obtain the posterior distribution for model parameters. Bayesian hierarchical models generalize ordinary regression to distinguish multiple levels of information in a model. The hierarchical structure introduces shrinkage, which pulls the regression coefficients towards zero or towards a group-specific mean parameter. Shrinkage introduces bias, which is typically not desirable; however, shrinkage methods also reduce the standard error and, consequently, can reduce the overall mean squared error of estimation. Thus, shrinkage methods are particularly useful in the presence of collinearity because they stabilize parameter estimates and reduce prediction error of the outcome variable. As a result of these advantages and their usefulness in settings with multiple correlated exposures, shrinkage methods and Bayesian techniques have been used to analyze biomarkers [3, 76,77,78,79,80].
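The shrinkage idea can be illustrated with ridge regression, which is used here only because it has a closed form; it is a simpler relative of the lasso, elastic net, and hierarchical models cited above, and it corresponds to the posterior mode under independent zero-mean normal priors on the coefficients. The sketch assumes two centered predictors and a centered outcome, and the function name and penalty value are illustrative.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Ridge coefficients for two centered predictors and a centered outcome."""
    # Entries of X'X + lam * I and of X'y.
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    g1 = sum(u * w for u, w in zip(x1, y))
    g2 = sum(v * w for v, w in zip(x2, y))
    # Solve the 2 x 2 system using the explicit inverse.
    det = a * d - b * b
    return (d * g1 - b * g2) / det, (a * g2 - b * g1) / det

# Two highly correlated (hypothetical) biomarkers and an outcome.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.9, -1.1, 0.0, 1.1, 1.9]
y = [u + v for u, v in zip(x1, x2)]
ols = ridge_two_predictors(x1, x2, y, lam=0.0)
shrunk = ridge_two_predictors(x1, x2, y, lam=5.0)
```

With `lam=0.0` the function returns the ordinary least squares solution, and increasing `lam` pulls both coefficients towards zero. In practice the penalty would be chosen by cross-validation or, in a Bayesian analysis, determined by the prior.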

Dimension Reduction Methods for Biomarkers Within the Same Chemical Class, Including the Sum-Of-Chemical Approach

Data from highly correlated exposures can be combined using dimension reduction techniques to reduce the number of variables [74]. The idea is to create linear combinations of biomarkers that serve as lower dimensional summaries, and these can be included as predictors in the outcome model. Rather than looking at the effects of individual biomarkers, dimension reduction techniques address a different scientific objective, which is to examine the association between a broad class of chemical exposures and a health outcome [11]. One of the most common examples is to sum the individual biomarker concentrations [13], or alternatively, to sum the concentrations after dividing by the molar mass (see [81] and [5] for examples related to PCBs and phthalates). This approach is straightforward and intuitively appealing; however, it does not account for the potency of the individual biomarkers. If a biomarker does not affect the outcome, then including it in the sum will introduce measurement error in the estimated dose-response relationship between the summation and the outcome. Alternatively, sums can be created by weighting each biomarker by a measure of its toxicity, such as toxic equivalency factors (TEFs), which describe dioxin-related toxicity for PCBs [82]. It is also possible to use data-driven approaches to create lower dimensional summary variables based on relationships in the data. Examples include principal component analysis (PCA), partial least squares, cluster analysis, weighted quantile sum regression [83], environmental risk scores [84], and structural equation models [85] that use a latent variable for the true unobservable exposure.
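The simple, molar, and toxicity-weighted sums described above are all weighted linear combinations, as the following sketch shows. The analyte names, molar masses, and weights in the usage lines are placeholders chosen for illustration and are not real chemical properties or published TEFs.

```python
def weighted_sum(concs, weights=None):
    """Weighted sum of biomarker concentrations; both arguments are dicts by analyte."""
    if weights is None:
        weights = {name: 1.0 for name in concs}  # simple sum of concentrations
    return sum(concs[name] * weights[name] for name in concs)

def molar_weights(molar_mass):
    """Weights that convert mass concentrations to molar concentrations."""
    return {name: 1.0 / mass for name, mass in molar_mass.items()}

# Placeholder values for two hypothetical analytes.
concs = {"analyte_a": 2.0, "analyte_b": 3.0}
simple = weighted_sum(concs)
molar = weighted_sum(concs, molar_weights({"analyte_a": 200.0, "analyte_b": 300.0}))
potency = weighted_sum(concs, {"analyte_a": 0.1, "analyte_b": 0.0})
```

Setting a weight to zero removes an analyte that is presumed not to affect the outcome, which makes explicit the assumption that the unweighted sum leaves implicit.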

However, a limitation of dimension reduction techniques is that the biological interpretation of the analysis results is challenging. A unit change in the low-dimensional summary (e.g., +1 IQR) need not correspond to changes in more distal exposures (see [15] for an example related to household dust). When analyzing biomarkers of environmental chemical exposure, there are two distinct scientific objectives: estimating the health effects of individual chemicals, and estimating the health effect of cumulative exposure to multiple chemicals. Dimension reduction techniques for biomarkers of environmental chemical exposures lie squarely within the second objective. Further research is needed to bridge the gap between dimension reduction techniques and the growing literature on causal inference.

Conclusions

We have highlighted several unique challenges in the analysis of biomarkers of environmental chemical exposures with particular focus on exposure assessment and multivariable modeling. Please refer to Table 1 for key references listed according to the statistical challenges we have identified. Although our emphasis was on exposures during pregnancy, the challenges are also relevant to epidemiologists working in other areas of environmental health. Given the broad scope of the review, it is inevitable that some topics are omitted. We did not discuss the literature on uncertainty quantification for high-dimensional data (e.g., multiple testing) or machine learning and non-linear modeling techniques, and we refer the reader to the textbooks [64, 74] for further details. Additionally, we fully acknowledge having sidestepped a detailed discussion of causal inference using biomarkers. We focused on settings where the scientific objective is to estimate the confounder-adjusted association between individual biomarkers within a chemical class and a health outcome. This is consistent with the usual custom in epidemiology of exercising caution in using language about causality [86]. When analyzing biomarkers, the topic of cause and effect relationships inevitably drifts to formulating difficult-to-answer questions such as “Had their measured biomarker levels been different, would the pregnancy outcome have been different?” [7]. Ignoring causality has the disadvantage that it does not shed light on the health effects of specific interventions [15]. However, it should be emphasized that association studies have the advantage that they can reveal specific biological mechanisms [15]. This contributes to the overall body of evidence linking chemical exposures and health outcomes. More generally, there is a vast literature on statistical methods for causal inference [87]. Much of this work has been in applied settings with a binary treatment comparison [88]. There is also an emerging body of work about the analysis of chemical mixtures [89, 90]. New research is needed to clearly articulate the definitions of causal contrasts that are relevant to the study of biomarkers of environmental chemical exposures.

Table 1 Key references according to statistical challenges