
Introduction

Conducting a dietary exposure assessment consists of combining, deterministically or probabilistically, food consumption figures (Q) with the concentrations (C) of a given chemical substance in a number of foods or food categories. For comparison with the acceptable daily intake or another health-based reference value, the resulting exposure (E) is then divided by the number of survey days (n) and by the individual's body weight (bw). The basic formula is therefore:

$$ E_i = \frac{1}{n_i \, bw_i} \sum_{k} \sum_{t} Q_{i,t,k} \, C_{i,t,k} $$
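As an illustration, here is a minimal sketch of this deterministic calculation for a single individual i in Python; the consumption and concentration values, and the array shapes, are entirely hypothetical:

```python
import numpy as np

# Hypothetical data for one individual: a 2-day survey (rows, t)
# and 3 foods (columns, k).
Q = np.array([[0.20, 0.05, 0.10],      # quantities consumed (kg)
              [0.15, 0.00, 0.12]])
C = np.array([[0.8, 2.5, 0.3],         # concentrations (mg/kg)
              [0.8, 2.5, 0.3]])

n_days = Q.shape[0]                    # n_i: number of survey days
bw = 70.0                              # bw_i: body weight (kg)

# E_i = (1 / (n_i * bw_i)) * sum over t and k of Q * C
E = (Q * C).sum() / (n_days * bw)
print(f"Exposure: {E:.4f} mg/kg bw per day")
```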

Occurrence data can be obtained either from control and monitoring programs or from a total diet study (TDS). In both cases, data reported to be below the limit of detection (LOD), often called ‘non-detects’ or ‘left-censored data’, are likely to have a critical influence on the results of the assessment. The LOD and the limit of quantification (LOQ), the latter also known as the “limit of determination”, are of special importance for exposure estimations in risk assessments, as they determine the minimum value that can be detected and quantified, respectively. It should be noted that many definitions of the LOD and LOQ have been suggested over time in different analytical areas. The LOD represents the minimum concentration or mass of an analyte that can be detected with a given confidence for a given analytical procedure. More formally, the LOD can be defined as the lowest concentration level that can be determined to be statistically different from a blank [1], customarily using confidence levels of 95 % or 99 %. Similarly, the LOQ is the minimum concentration or mass of the analyte that can be quantified with acceptable accuracy and precision [1], given that at this level the analyte is considered to be present. In the Australian TDS, this has been defined with slightly different criteria as the limit of reporting (LOR) (see Chap. 20 – The Australian Experience in Total Diet Studies).

The objective of a TDS is to provide concentration data for dietary exposure assessment; the data are obtained by analyzing food as consumed, using composite samples expected to represent an average value for a food, a food group of interest, or even the whole diet. In theory, the dietary exposure to a chemical could therefore be based on a single sample comprising a weighted mix of all the foods of the diet in which the chemical is expected to occur. At the other end of the spectrum of possibilities, a TDS can be based on each relevant food item, such as fish, or on a composite of the various species available on the market. Finally, composite samples can be prepared locally, with sampling repeated in various areas of a country or region and in various seasons, to capture the variability of the analyte content across these parameters.

In current TDS practice, a low number of composite samples (generally 1–4) is prepared for relevant single food items or food groups (e.g. bread, fish, beef). For food group samples, weighted composites are generally made up from different foods within the group according to the ratio in which they are consumed (e.g. different types of bread or species of fish). Pooling different foods into composite samples has several drawbacks. First, it introduces considerable uncertainty about the variability of the concentrations in the individual foods present in the food groups. Moreover, compositing may dilute individual food samples with high concentrations when the remaining samples have much lower concentrations. This dilution effect may even prevent the determination of a chemical if it occurs at very low levels and/or in only one or a few of the foods within a composite [2]. In addition, the analysis of weighted food composites allows only one mixture of foods (i.e. representation of only one age-sex group of the population, or of the whole population) to be evaluated. Analysis of individual foods allows greater coverage of population subgroups, because the daily consumption of foods for different groups can then be simulated and calculated [2]. For these reasons, the analysis of single food items is preferred over composite samples, although each approach has its own advantages and disadvantages (see Chap. 9 – Food Sampling and Preparation in a Total Diet Study).

This chapter covers the handling of non-detects in TDSs. It is based on a review of the literature included in a recent report of the European Food Safety Authority (EFSA) dedicated to this topic [3]. While none of the reviewed works were specific to TDSs, many were based on realistic datasets in the field of chemical occurrence in food.

Dealing with Non-detects in Dietary Exposure Assessment

An important factor in evaluating the presence of chemical substances is the possibility of distinguishing between non-detects and true zero values. For persistent organic pollutants, such as dioxins, polychlorinated biphenyls (PCBs) and polybrominated diphenyl ethers (PBDEs), and for naturally occurring heavy metals such as lead and cadmium, it is generally accepted that there are no true zero values in food: these substances are ubiquitous and will be consistently present in foodstuffs, although sometimes at extremely low concentrations. On the other hand, for process contaminants, such as acrylamide and 3-monochloropropane-1,2-diol (3-MCPD), and also for most pesticides, true zero values can occur if the contaminant is not formed in the food or the pesticide is not used on a crop. When dealing with non-detects, it should be kept in mind to which group the substance of interest belongs.

Communication with the analytical laboratory that measures the TDS samples is very important. The laboratory should be able to reach the lowest LODs and/or LOQs possible while maintaining good performance on other important quality control factors (high reproducibility, low blanks, high recoveries). The definitions of the LOD and LOQ used by the laboratories should be available. When in contact with the analytical laboratories, it is recommended to emphasize the need for correct reporting of the LOD and LOQ. Analytical laboratories are often not aware of how exposure assessors use their reported values, so usually little effort is put into accurate reporting of the LOD or LOQ. Depending on how strictly the LOD and LOQ are defined by the analytical laboratory, it may be decided to use different definitions, or to report values between the LOD and LOQ, such as the LOR.

Methods for Handling Non-detects

There are a variety of statistical methods to deal with non-detects. The most commonly used are: deletion, substitution, maximum likelihood estimation (MLE), log-probit regression, and non-parametric methods.

Deletion

Among the methods available to deal with non-detect samples, deletion consists of removing all non-detected data from the dataset. For a TDS, when more than one sample for a food or food group is available, and depending on the number of non-detects in that food or food group, this solution is likely to result in a considerable overestimation both of the frequency of occurrence of a chemical substance in a set of foods (if true zero values are removed) and of the levels of contamination (since all the values below the LOD are excluded). When only one sample is available, the exposure from the total diet may be underestimated if food groups with concentrations below the LOD are deleted. For these reasons, this approach is not considered further in this chapter.

Substitution Method

In the field of food safety, the most commonly used recommendations for handling left-censored data are those from the GEMS/Food-EURO workshop in 1995 [4]. In practice, depending on the proportion of positive values and the overall sample size, a value equal to the LOD, zero, or LOD/2 is used as a surrogate for the unknown non-detected value (see Table 16.1). This method is referred to as the substitution method, whereby the substitution of the non-detect with zero, LOD/2 or the LOD is customarily defined as the lower-, middle-, and upper-bound scenario, respectively. It is important to note that the GEMS/Food-EURO workshop recommended that, for the purpose of dietary exposure assessments, laboratories and analysts report as quantified results the data between the LOD and LOQ, as this promotes the best use of the available data. If this is done, only censoring at the LOD remains.

Table 16.1 Statistical treatment of data sets containing various proportions of non-quantified results

The substitution of non-detects with other values is widely recognized to be biased, with the bias being a function of the true variability in the data, the percentage of censored observations, and the sample size [5]. Another disadvantage of substitution is that it does not work well when non-detects exceed 60 % of the results. More generally, a dataset containing 1 % of non-detects and one containing 60 % are likely to have different underlying distributions, so a single substitution rule cannot suit both. The most critical situation for the substitution method arises when there are multiple LOD values, because the substituted values then depend on the conditions that determined each detection limit, such as laboratory sensitivity and precision and sample matrix interferences; these factors do not necessarily bear any relation to the true value [6].

A WHO publication recognizes the impact of left-censored data on the overall uncertainty in chemical exposure assessment and recommends using statistical methods to provide more accurate estimates of a fitted distribution and its statistics than the classical substitution method [7]. Despite its drawbacks, the substitution method is easy to implement and widely understood, and the upper-bound practice leads to conservative estimates in exposure assessment calculations, i.e. overestimation of the mean and underestimation of the variability.
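A minimal sketch of the substitution method in Python, assuming non-detects are reported at their LOD together with a boolean censoring flag (function and variable names are hypothetical):

```python
import numpy as np

def substituted_means(values, censored):
    """Lower-, middle- and upper-bound mean concentrations.

    values:   reported concentrations; for non-detects this is the LOD
    censored: True where the result is below the LOD (a non-detect)
    """
    values = np.asarray(values, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    bounds = {}
    # Substitute non-detects with 0, LOD/2 and LOD, respectively
    for label, factor in (("lower", 0.0), ("middle", 0.5), ("upper", 1.0)):
        v = np.where(censored, factor * values, values)
        bounds[label] = v.mean()
    return bounds

# Example: three detects and two non-detects with LOD = 0.10
print(substituted_means([0.25, 0.40, 0.18, 0.10, 0.10],
                        [False, False, False, True, True]))
```

Reporting all three bounds brackets the unknown mean and makes the sensitivity of the assessment to the censoring explicit.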

Statistical Methods Available

Beyond substitution and deletion, the most commonly used statistical methods to deal with non-detects are parametric maximum likelihood estimation (MLE), log-probit regression, and non-parametric methods. It is important to note that, both for TDS data and for other datasets, when the occurrence of a chemical in a food or food group below the LOD/LOQ rests on a single or a very small number of analytical results, none of the statistical techniques described below can be used. The only possibility is therefore to follow the WHO recommendation, and more precisely the last row of Table 16.1, i.e. to conduct lower-bound and upper-bound estimations.

The parametric maximum likelihood estimation (MLE) method is often considered the preferred approach because the distribution of concentration values in food products can be expected to be log-normal if the food product is grown or made in a ‘homogeneous environment’. Data both below and above the detection limit are assumed to follow a log-normal distribution. The parameters of the chosen distribution are estimated so as to best fit the observed values above the detection limit while remaining compatible with the percentage of data below the limit; the estimated parameters are those that maximize the likelihood function. It is also possible to use other distributions, such as the Weibull and gamma distributions. However, the reported data often do not fit a parametric model, particularly when they are collected in an international context: a variety of point sources are likely to be present, leading to different background levels in different regions/countries and in different foods. In addition, true zero concentration values may be present, and the concentration in a food or food group may be better described by a combination of more than one distribution, e.g. a binomial and a log-normal. According to Helsel [6], for datasets of at least 50 observations where the percentage of censored observations is small, MLE is usually the method of choice. Improvements of the MLE method are possible, for example by accounting for different sources of heterogeneity or by constraining the distribution so that the observed fraction of non-detects equals the predicted fraction.
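The censored likelihood is straightforward to write down: detects contribute the log-density and non-detects the log of the cumulative probability below their own LOD, so multiple LODs are handled naturally. A minimal sketch under a log-normal assumption, using SciPy (all names are hypothetical):

```python
import numpy as np
from scipy import stats, optimize

def lognormal_censored_mle(values, censored):
    """Fit a log-normal to left-censored data by maximum likelihood.

    values:   detected concentrations, or the LOD for non-detects
    censored: True where the result is a non-detect (< its LOD)
    Returns (mu, sigma) of the underlying normal on the log scale.
    """
    x = np.log(np.asarray(values, dtype=float))
    censored = np.asarray(censored, dtype=bool)

    def neg_log_likelihood(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)                              # keeps sigma > 0
        ll = stats.norm.logpdf(x[~censored], mu, sigma).sum()  # detects
        ll += stats.norm.logcdf(x[censored], mu, sigma).sum()  # non-detects
        return -ll

    start = np.array([x.mean(), np.log(x.std() + 1e-6)])
    res = optimize.minimize(neg_log_likelihood, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

# Example with two different LODs (0.10 and 0.12) among the non-detects
mu, sigma = lognormal_censored_mle([0.25, 0.40, 0.18, 0.10, 0.12],
                                   [False, False, False, True, True])
print(np.exp(mu + sigma**2 / 2))    # mean of the fitted log-normal
```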

In the log-probit regression method, the data are sorted and a linear relationship is assumed between the logarithm of the concentration values and the inverse cumulative normal distribution of the observations’ plotting positions. It has been suggested that log-probit regression should not be applied to datasets with multiple LOD values [8].
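A minimal sketch of this idea for a single LOD, often called regression on order statistics: the detects are ranked within the full sample, assigned plotting positions, and their log concentrations are regressed on the corresponding normal quantiles. All names are hypothetical, and the Blom plotting position used here is one common choice among several:

```python
import numpy as np
from scipy import stats

def log_probit_fit(detects, n_censored):
    """Fit mu, sigma of a log-normal by regression on order statistics.

    detects:    concentrations above the (single) LOD
    n_censored: number of non-detects, all sharing that LOD
    """
    detects = np.sort(np.asarray(detects, dtype=float))
    n = len(detects) + n_censored
    # Ranks of the detects in the full sample (censored values rank lowest)
    ranks = np.arange(n_censored + 1, n + 1)
    pp = (ranks - 0.375) / (n + 0.25)          # Blom plotting positions
    z = stats.norm.ppf(pp)                     # probit of the plotting position
    slope, intercept, *_ = stats.linregress(z, np.log(detects))
    return intercept, slope                    # mu, sigma on the log scale

mu, sigma = log_probit_fit([0.25, 0.40, 0.18], n_censored=2)
print(np.exp(mu + sigma**2 / 2))               # implied log-normal mean
```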

The standard non-parametric technique for censored data is the Kaplan-Meier (KM) method. The advantage of this approach is that the mean, the median, and other quantiles can be estimated in the presence of non-detect values without relying on distributional assumptions [9]. With the KM method, the weight of the censored data is distributed over the different observed values below the censoring values (i.e. the LODs and LOQs) and zero. It is therefore of little interest to apply the KM method when there is only one LOD value, as it would then be equivalent to substituting the censored values with zero or with the largest observed value below the LOD. Because it is non-parametric, the KM method tends to be insensitive to outliers, which occur frequently in environmental data [10].
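A minimal sketch of the KM estimate of the mean for left-censored data, using the standard trick of “flipping” the data so that left-censoring becomes right-censoring. All names are hypothetical, and ties between detects and non-detects at the same value are not given any special treatment:

```python
import numpy as np

def km_mean_left_censored(values, censored):
    """Kaplan-Meier estimate of the mean for left-censored data.

    values:   reported concentrations; for non-detects this is the LOD
    censored: True where the result is a non-detect (< its LOD)
    """
    values = np.asarray(values, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    M = values.max()                     # flip constant
    t = M - values                       # left-censored -> right-censored
    event = ~censored                    # detects are the 'events'
    order = np.argsort(t)
    t, event = t[order], event[order]
    n = len(t)
    at_risk = n - np.arange(n)           # observations still at risk
    # Survival curve steps down only at event (detected) observations
    surv = np.cumprod(np.where(event, (at_risk - 1) / at_risk, 1.0))
    # Mean of the flipped variable = area under the survival curve
    times = np.concatenate(([0.0], t))
    s_prev = np.concatenate(([1.0], surv[:-1]))
    mean_flipped = np.sum(s_prev * np.diff(times))
    return M - mean_flipped              # back-transform to original scale

print(km_mean_left_censored([0.25, 0.40, 0.18, 0.10, 0.10],
                            [False, False, False, True, True]))
```

In this single-LOD example the estimate coincides with upper-bound substitution, illustrating the remark above that the KM method adds little when only one LOD is present.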

Bayesian statistics are based on a fundamentally different paradigm from the “frequentist” statistics underlying MLE methods: model parameters are not assumed to be fixed unknown constants to be estimated, but are instead treated as random variables. In general, all models fitted by MLE approaches can also be fitted using Bayesian approaches. When no prior information is available and the same underlying model is used, Bayesian methods will theoretically lead to very similar (if not identical) results to those obtained by MLE. An example of Bayesian modeling of left-censored data can be found in a paper by Paulo [11], which shows that the application of Bayesian modeling to pesticide risk assessment is feasible and that, in a data-rich situation, the model compares well with empirical Monte Carlo modeling.
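As an illustration of the Bayesian route, here is a minimal sketch with PyMC, assuming a version in which pm.Censored is available (PyMC ≥ 4); non-detects are recorded at the log of their common LOD and declared censored below it. The data and priors are hypothetical:

```python
import numpy as np
import pymc as pm

# Hypothetical log concentrations; non-detects are set to log(LOD)
log_obs = np.log([0.25, 0.40, 0.18, 0.10, 0.10])
log_lod = np.log(0.10)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)    # weakly informative priors
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    # Observations at the lower bound are treated as left-censored
    pm.Censored("log_conc", pm.Normal.dist(mu=mu, sigma=sigma),
                lower=log_lod, upper=None, observed=log_obs)
    idata = pm.sample()   # posterior for mu, sigma of the log-normal
```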

Several publications have evaluated the performance of statistical treatments of left-censored data [6, 8, 10]. The authors used various procedures and relied on different indicators to evaluate the performance of the proposed approaches; a complete analysis of these papers is included in the EFSA report [3]. In summary, the choice of method depends on the characteristics of the dataset under consideration on the one hand, and on the resources available for an accurate statistical analysis and modeling on the other.

Because most dietary assessments employ a tiered approach, a sophisticated analysis of the data should be performed only when necessary and after clarification of the issues above. The following steps should then be followed:

  1. Initial analysis

    The main quantities to be evaluated are the size of the dataset, its potential sources of heterogeneity, the number of distinct LODs, and the percentage of non-detects. In practice, these preliminary analyses should be conducted separately for each food or food group analyzed.

  2. Sensitivity of concentration data

    The sensitivity to the concentration distributions can be assessed by calculating the lower and upper bounds of dietary exposure, obtained by substituting non-detects with 0 and with the LOD, respectively. The comparison should be made on the mean and/or the high percentile(s). If the effect is negligible, the dietary exposure assessment can rely on the upper-bound approach without the need for modeling. On the contrary, if the difference between the lower and the upper bound is important, i.e. if the health-based guidance value lies between the two estimates, modeling of the left-censored data is needed.

  3. Treating left-censored data

    As mentioned above, a TDS usually involves food categories with a single or very few analytical results, e.g. one to four analytical results per food group. Under such circumstances there is no robust way to deal with censored data; based on the available literature and on the recommendations from both WHO and EFSA, the only possibility is to estimate the lower and the upper bound. However, because the use of TDS at the regional level represents an important and valuable trend, it is likely that future TDS will aim to include more samples for each food or food group in order to capture the variability in occurrence. As an example, on the basis of four samples analyzed per country, a regional TDS involving 15 countries would yield 60 analytical results describing the distribution of occurrence of a single food item at the regional level; such a number would allow a statistical analysis. Moreover, the introduction of uncertainty analysis in the risk analysis process will require a more accurate picture of the distribution of occurrence than a single average value.

When the dataset contains more than 50 observations and the percentage of censoring is between 50 % and 80 %, the parametric (MLE) approach is recommended: a set of candidate parametric models, such as the log-normal, gamma, and Weibull, should be considered, and the final model should be checked for goodness of fit. When the dataset contains more than 50 observations and the percentage of censoring is lower than 50 % with a single LOD, a parametric approach (MLE) is also recommended.

When the dataset contains more than 50 observations and the percentage of censoring is lower than 50 % with multiple LODs, both the parametric approach and the KM method can be used; the latter has the advantage of avoiding any assumption about the form of the underlying distribution (see Fig. 16.1). A sketch of these decision rules as code is given after the figure.

Fig. 16.1 Flowchart of the overall strategy for the treatment of left-censored observations proposed by EFSA
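The decision rules above can be summarized in a small helper function mirroring the flowchart. This is a sketch: the function name is hypothetical, and the fallback for more than 80 % censoring (not stated explicitly in the text) is assumed to be the lower/upper-bound approach:

```python
def recommend_method(n_obs, pct_censored, n_lods):
    """Suggest a treatment of non-detects following the rules above (a sketch)."""
    if n_obs <= 50:
        # Too few data for modeling: bound estimation only
        return "substitution (lower- and upper-bound estimates)"
    if pct_censored > 80:
        # Assumed fallback for very heavy censoring
        return "substitution (lower- and upper-bound estimates)"
    if pct_censored >= 50:
        return "parametric MLE (compare log-normal, gamma and Weibull fits)"
    if n_lods == 1:
        return "parametric MLE"
    return "parametric MLE or Kaplan-Meier (distribution-free)"

print(recommend_method(n_obs=60, pct_censored=30, n_lods=3))
```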