How to statistically analyze nano exposure measurement results: using an ARIMA time series approach

Klein Entink, Rinke H.; Fransman, Wouter; Brouwer, Derk H.

doi:10.1007/s11051-011-0610-x


Research Paper
Published: 21 October 2011

Volume 13, pages 6991–7004, (2011)

Journal of Nanoparticle Research


Rinke H. Klein Entink¹,
Wouter Fransman¹ &
Derk H. Brouwer¹


Abstract

Measurement strategies for exposure to nano-sized particles differ from traditional integrated sampling methods for exposure assessment by the use of real-time instruments. The resulting measurement series is a time series, where typically the sequential measurements are not independent from each other but show a pattern of autocorrelation. This article addresses the statistical difficulties when analyzing real-time measurements for exposure assessment to manufactured nano objects. To account for autocorrelation patterns, Autoregressive Integrated Moving Average (ARIMA) models are proposed. A simulation study shows the pitfalls of using a standard t-test and the application of ARIMA models is illustrated with three real-data examples. Some practical suggestions for the data analysis of real-time exposure measurements conclude this article.


Introduction

Assessment of the exposure of workers to manufactured nano particles at the workplace receives considerable attention, because the number of workers involved with nanotechnology is increasing rapidly. In the absence of sufficient scientific knowledge on the hazardous potential and possible health effects of exposure to nano particles, exposure measurements are performed to locate sources of emission and to characterize exposure in different work situations to gain knowledge on how to reduce personal exposure levels. However, there is no consensus method yet on how to (statistically) analyze and report these exposure measurement results.

Although real-time measurements are also used in other areas of exposure assessment, for instance in studying exposure to noise, measurement strategies for exposure to nano-sized particles differ from the majority of traditional integrated sampling methods by the use of real-time instruments. Common instruments for on-line measurement of number concentrations or surface area concentrations of (nano) particles are the SMPS, CPC, and ELPI, and diffusion chargers, respectively (Brouwer et al. 2004), with response times ranging from t = 1 to 180 s. That is, every t seconds, a measurement result (number concentration) is recorded. The resulting measurement series, therefore, is a time series, a set of measurements collected sequentially over time. Typically, the measurement results in such a time series are not independent and show significant autocorrelation between subsequent samples.

The current literature on exposure to manufactured nano objects discusses several ways of statistically analyzing data obtained from real-time exposure measurements. Brouwer et al. (2004) identified the effect of work activities on particle number concentration and the percentage of ultrafine particles graphically. Although the authors showed that useful information can be retrieved, graphical analysis is limited to making qualitative inferences. Demou et al. (2008) repeatedly collected time series data on 20 days with the same production process. The analysis of the data was done by averaging the 20 time series and making a graphical analysis. Bello et al. (2009) mention the testing of mean differences at a significance level of P < 0.05, but the article is unclear about what kind of test was performed. Park et al. (2010) used t-tests for evaluating mean differences of time series data. There are two potential problems with using t-tests: First, the mean of a time series only has a substantial interpretation when the time series is stationary, that is, the time series is a random fluctuation around the mean and there are no trends in the data. Second, the autocorrelation in the measurements leads to underestimation of the variance in the data, resulting in a biased test statistic of the t-test. Subsequently, Park et al. (2010) modeled the data using random effects models for which they assumed a compound symmetry covariance structure. Such a covariance structure assumes that all measurements have the same correlation with each other. This implies, for instance, that two measurements with a time lag of two hours are equally correlated with each other as two measurements with a time lag of two minutes. Clearly, this assumption is not appropriate. Evans et al. (2010) emphasized the usefulness of real-time measurements, but these authors also reported only graphical inferences. Pfefferkorn et al. (2010) used (partial) autocorrelation assessments, Autoregressive Integrated Moving Average (ARIMA) time series models and first-order autocorrelation models to analyze their data. ARIMA models are well-known statistical models for autocorrelated time series observations, and we also propose them in this article. ARIMA methods for analyzing time series data have also been used in environmental studies of (the effects of) air pollution. Recent examples are, for instance, Sharma et al. (2009) and Mann et al. (2010).

Note that in this article we focus on the statistical modeling and analysis of real-time nano exposure measurements. As such, the statistical methods are only descriptive and not explanatory. For a good understanding of what is happening at a workplace, it is necessary to combine mechanistic models with data collection and statistical analysis, but this is outside the scope of this article. For a mechanistic modeling approach of exposure to nano-sized particles, see for instance Schneider et al. (2011).

The statistical analysis of time series requires special attention, since the measurements are not independent from each other. This was addressed in the late 1980s by O’Brien et al. (1989), who recommended a check for autocorrelation for real-time (dust) exposure data. When studying exposure to manufactured nano objects, people are confronted with several measurement series of different nano processes. Relevant research questions then involve testing for elevated exposure levels when a certain task is performed, or evaluating the similarity in trends and levels of repeated experiments. Other questions may relate to the identification of exposure determinants from a set of measurements. From the literature mentioned above, it is clear that exposure measurement time series have been dealt with in different ways, of which some are inappropriate for such data, and others are incomplete in checking specific (but important) assumptions of the applied models. In this article, we aim to show the difficulties of analyzing and interpreting time series data in the context of exposure assessment. First, we briefly introduce the statistical theory of time series analysis. Second, by means of simulated data examples we highlight some potential problems of dealing with nano exposure time series data. Third, we illustrate the statistical analysis with some real data examples from exposure measurements on nano-sized particles. We conclude this article with a discussion and some practical recommendations for real-time measurements and the analysis of such data.

Statistical analysis of time series

The first step in analyzing time series data is making a graph, which serves to quickly identify peaks and trends in the data. A potential next step is fitting a model to the data to make quantitative inferences.

The modeling and analysis of time series data has frequent application in fields such as economics (e.g., stock exchange data), geography, and engineering. A foundation for the statistical analysis of such data was laid by Box and Jenkins (1970), who developed and applied Autoregressive Moving Average (ARMA) and ARIMA models. An introductory text is Cowpertwait and Metcalfe (2009). A Bayesian statistical approach to time series can be found in West and Harrison (1997). In this article, we have applied ARIMA models to real-time nano exposure measurements. In this section, we will first introduce ARIMA models and then give a brief overview of the model fitting procedure on the basis of a simulated example.

ARIMA models

Let Y_t, t = 1,…, T denote a sequence of measurements of a variable Y at subsequent and equally spaced times t. The autoregressive (AR) part of an ARIMA model refers to the regression of Y_t on time lags of itself. That is, it expresses the time series as a linear function of its past values. It is common to denote the order of the model as the number of time lags p, or AR(p). The simplest AR model is the first-order autoregressive, or AR(1), model

$$ Y_{t} - a_{1} Y_{t - 1} = \mu + e_{t} , \quad e_{t} \sim N(0,\sigma^{2} ), $$

(1)

where a₁ is the coefficient of the autoregression, μ is an intercept, and e_t is the residual error term, which follows a normal distribution with mean 0 and variance σ². A value of a₁ close to 1 or −1 denotes a high autocorrelation in Y, and a value of a₁ close to 0 denotes little autocorrelation in Y.
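To make Eq. 1 concrete, the following minimal sketch simulates an AR(1) series in plain Python (this is not the R code used in the article; the function names `simulate_ar1` and `lag1_autocorrelation` are illustrative assumptions). For |a₁| < 1 the series fluctuates around the stationary mean μ/(1 − a₁), which the sketch uses as the starting value.

```python
import random


def simulate_ar1(n, a1, mu=0.0, sigma=1.0, seed=None):
    """Simulate Y_t = mu + a1 * Y_{t-1} + e_t with e_t ~ N(0, sigma^2) (Eq. 1)."""
    if abs(a1) >= 1:
        raise ValueError("a stationary AR(1) series needs |a1| < 1")
    rng = random.Random(seed)
    prev = mu / (1.0 - a1)  # start at the stationary mean
    series = []
    for _ in range(n):
        prev = mu + a1 * prev + rng.gauss(0.0, sigma)
        series.append(prev)
    return series


def lag1_autocorrelation(y):
    """Sample correlation between Y_t and Y_{t-1}."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((v - mean) ** 2 for v in y)
    c1 = sum((y[t] - mean) * (y[t - 1] - mean) for t in range(1, n))
    return c1 / c0


if __name__ == "__main__":
    # A small a1 should give a lag-1 autocorrelation near 0, a large a1 one near 1.
    for a1 in (0.1, 0.9):
        y = simulate_ar1(2000, a1, seed=1)
        print(a1, round(lag1_autocorrelation(y), 2))
```

For long series the printed lag-1 autocorrelation should be close to the chosen a₁, which is the sense in which a₁ measures the autocorrelation in Y.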

The moving average (MA) part of the ARIMA model denotes the structure on the error term. The simplest MA(q) model is the first-order MA(1) model, with q = 1 denoting the order, given by

$$ Y_{t} = \mu + e_{t} - c_{1} e_{t - 1} , \quad e_{t} \sim N\left( {0,\sigma^{2} } \right), $$

(2)

where c₁ is the moving average coefficient of the first order. A value of c₁ close to 1 or −1 denotes a high autocorrelation of the error term. For a value of c₁ close to 0 the model reduces to an ANOVA model with an intercept and random error component.
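As a worked check on Eq. 2 (a standard textbook result, added here for illustration and not part of the original derivation), the autocorrelations ρ_k of the MA(1) model follow directly from the independence of the error terms:

$$ \operatorname{Var}(Y_{t}) = (1 + c_{1}^{2})\sigma^{2}, \quad \operatorname{Cov}(Y_{t}, Y_{t - 1}) = -c_{1}\sigma^{2}, \quad \rho_{1} = \frac{-c_{1}}{1 + c_{1}^{2}}, \quad \rho_{k} = 0 \ \text{ for } k \ge 2. $$

The autocorrelation of an MA(1) series is therefore nonzero at lag 1 only. With the sign convention of Eq. 2, a positive c₁ gives a negative lag-1 autocorrelation, and |ρ₁| is at most 0.5, reached at c₁ = ±1.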

The integrated (I) part refers to the order of differencing of a time series. In case of a non-stationary time series, differencing can be applied to remove trends from the data to obtain a stationary series that subsequently can be modeled with AR and MA terms. The idea is that a trend in Y can be accounted for by taking the derivative (dY/dt) of the series, which then might be stationary. Since the measurements are at discrete time steps t = 1, 2,…, T, a model of order d = 1 is the first-order (or I(1)) model that models the differenced series of Y with lag 1: dY/dt = Y_t − Y_{t−1}.
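The differencing step can be sketched in a few lines of plain Python (the helper name `difference` and the example series are illustrative assumptions, not taken from the article):

```python
def difference(y, lag=1):
    """Return the differenced series Y_t - Y_{t-lag}; the result is `lag` values shorter."""
    if lag < 1 or lag >= len(y):
        raise ValueError("lag must be between 1 and len(y) - 1")
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


# A series with a linear trend becomes constant after one differencing step (d = 1).
trend = [10 + 3 * t for t in range(6)]      # 10, 13, 16, ...
print(difference(trend))                    # [3, 3, 3, 3, 3]

# A quadratic trend needs two differencing steps (d = 2).
quadratic = [t * t for t in range(6)]       # 0, 1, 4, 9, 16, 25
print(difference(difference(quadratic)))    # [2, 2, 2, 2]
```

Each differencing step shortens the series by one observation, so a model with d = 1 is fitted to T − 1 values.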

From Eqs. 1 and 2 it can be seen that several assumptions are made in an ARMA model, of which we want to stress two explicitly: First, the ARMA model assumes that the time series is stationary. A stationary process is defined as a stochastic process whose probability distribution is not a function of time. This means that parameters like the mean and the variance of the series are constant over time. In other words, it is assumed that there is no trend or seasonal variation present in the data Y. An example of a stationary process in a workplace situation would be a time series showing a constant particle concentration over time with only some random fluctuation. A non-stationary example could be an increasing particle concentration in the first hour after starting up a production process. Second, it is assumed that the error terms follow a normal distribution with constant variance. Those assumptions have important practical implications when evaluating exposure measurements. If a series of exposure measurements is not stationary, sample statistics like the mean, variance, and correlations with other variables are not meaningful since they are dependent on the length of the measurement series. This is quickly seen when considering a series with an increasing trend over time, i.e., a series with a positive slope over time. In that case, both the estimates for the mean and the variance will grow with sample size over time. As a consequence, correlations with other variables are not well-defined, and comparisons such as a t-test are not meaningful. Symanski and Rappaport (1994) investigated autocorrelation and stationarity of exposure measurements (not on the nano scale) and noted that in assessing occupational exposure, summaries like the mean and variance components play an important role, illustrating the importance of assessing stationarity, see also Rappaport (1991).

Note that only the distribution of the error term is assumed to be normal; this does not imply that the (empirical) distribution of Y is necessarily normal. These assumptions are therefore very important to check before proceeding with any inferences on the time series.

The combination of the AR(p), I(d), and MA(q) parts specifies an ARIMA(p,d,q) model, where p, d, and q denote the orders of the specific model terms. Selection of the appropriate orders is the topic of the next subsection.

Estimation and software

In most statistical software packages, standard routines are available for time series modeling with ARIMA models. For the analyses in this article, we used the arima() function available in the stats-package in the free statistical environment R (R Development Core Team 2011).

Stepwise approach for analyzing time series data

This section describes a stepwise approach for statistically analyzing time series data from nano exposure measurements using ARIMA models.

Step 1

The first step in fitting an ARIMA model is checking the stationarity of a time series. A standard procedure is to study a time series plot of the data together with the autocorrelation function (ACF, more details below). With a graph of the data one can visually check the constancy of the mean and variance (Fig. 1a). A plot of the autocorrelation of the data can also be informative (Fig. 1b). Typically, when the sample autocorrelation is high initially (>0.8) and shows a very slow decay to zero (e.g., over more than ten time lags), this is a sign of non-stationarity of the data. In that case, difference the data once (corresponding to an ARIMA(0,1,0) model) and see if the differenced series appears stationary. An exponential trend in the data, alongside an increasing variance with time, is an indicator of a multiplicative relationship instead of (linear) additive growth. In that case, a log-transform of the data is useful to obtain a linear growth and to stabilize the variance.
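The effect of the log-transform on multiplicative growth can be illustrated with a small plain-Python sketch. The series below is hypothetical (10 % growth per time step, no noise) and only serves to show why differencing the log series, rather than the raw series, removes an exponential trend.

```python
import math

# Hypothetical series growing by 10 % per time step (multiplicative growth).
series = [200.0 * 1.1 ** t for t in range(8)]

# Raw first differences keep growing with the level of the series ...
raw_diff = [series[t] - series[t - 1] for t in range(1, len(series))]

# ... whereas first differences of the log series are constant, log(1.1).
log_series = [math.log(v) for v in series]
log_diff = [log_series[t] - log_series[t - 1] for t in range(1, len(log_series))]

print([round(d, 1) for d in raw_diff])
print([round(d, 4) for d in log_diff])
```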

To quantitatively test if a time series is stationary or a differencing step might be necessary first, one can test the hypothesis H₀: a₁ = 1 versus the alternative H_a: a₁ < 1. To do so, an AR(1) model can be fitted and compared to a first-order differencing or ARIMA(0,1,0) model. In case the series is non-stationary, the AR(1) coefficient will be close to 1. For comparison of the AR(1) and the ARIMA(0,1,0) models, the AIC model fit criterion can be used (explained below in Step 2). In case the assumption of stationarity is reasonable for the data at hand, the AIC will favor the AR(1) model over the ARIMA(0,1,0) model. For a worked example, see Example 2.
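A rough plain-Python sketch of this comparison is given below. It is an approximation under stated assumptions and not the procedure of R's arima(): both models are fitted by conditional least squares (conditioning on the first observation), and the AIC is computed from the residual sum of squares up to a constant shared by both models. The function name `aic_ar1_vs_random_walk` is hypothetical.

```python
import math


def aic_ar1_vs_random_walk(y):
    """Compare AR(1) (with intercept) and ARIMA(0,1,0) by a conditional Gaussian AIC.

    Both models are conditioned on the first observation, so they are scored
    on the same n = len(y) - 1 values. The AIC is reported up to a constant
    shared by both models: n * log(RSS / n) + 2 * k.
    """
    x, z = y[:-1], y[1:]            # predictor Y_{t-1} and response Y_t
    n = len(z)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    a1 = sxz / sxx                  # least-squares AR(1) coefficient
    intercept = mz - a1 * mx
    rss_ar1 = sum((b - intercept - a1 * a) ** 2 for a, b in zip(x, z))
    rss_rw = sum((b - a) ** 2 for a, b in zip(x, z))   # a1 fixed at 1, no drift
    tiny = 1e-300                   # guards against log(0) for a perfect fit
    aic_ar1 = n * math.log(max(rss_ar1, tiny) / n) + 2 * 3   # a1, mu, sigma^2
    aic_rw = n * math.log(max(rss_rw, tiny) / n) + 2 * 1     # sigma^2 only
    return a1, aic_ar1, aic_rw
```

The function returns the estimated a₁ together with both AIC values; an a₁ well below 1 combined with a lower AIC for the AR(1) model supports treating the series as stationary.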

When, after differencing, the series is stationary, proceed with Step 2. If it is not possible to obtain approximate stationarity of the series, ARIMA models are not appropriate. Then, qualitative graphical inferences are possible, or other statistical methods have to be considered.

Step 2

In the second step, the orders p and q of the ARMA(p,q) components need to be determined. Here, the ACF and the Partial Autocorrelation Function (PACF) play an important role. The Cross Correlation Function (CCF) is related to the ACF, but estimates the correlation between two different series.

The ACF is the correlation of a variable with itself at different times. For example, the autocorrelation at lag 2 is the correlation between Y_t and Y_{t−2}. The CCF is estimated in exactly the same way as the ACF, except that it is the correlation between two different time series at lags k = 0, 1,…, K rather than the autocorrelation of one series with itself. The CCF is helpful to determine the similarity in the patterns of two time series, for instance in determining the similarity between repeated experiments (see also Example 3 in the “Empirical Examples” section). The PACF at time lag k is the correlation that remains after removing the effects of autocorrelation at shorter time lags. For example, the PACF at lag k = 2 is the autocorrelation that remains after correcting for the propagating effect of the autocorrelation at lag k = 1 (if there is an autocorrelation of 0.5 between lag k = 0 and lag k = 1, then this autocorrelation propagates to lag 2 since the samples at lag 1 and lag 2 are also correlated, resulting in an autocorrelation between lag 0 and lag 2 of 0.5 × 0.5 = 0.25).
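The following plain-Python sketch shows one common way to compute these quantities: the sample ACF, and the PACF obtained from a set of autocorrelations with the Durbin-Levinson recursion. The function names are illustrative assumptions, and statistical packages may use slightly different estimators.

```python
def acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag of the series y."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def pacf_from_acf(rho):
    """Partial autocorrelations for lags 1..K from autocorrelations rho[0..K],
    using the Durbin-Levinson recursion."""
    max_lag = len(rho) - 1
    pacf = []
    phi_prev = []                     # AR coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Autocorrelations 0.5, 0.25, 0.125 are fully explained by the lag-1 value:
# the partial autocorrelation is 0.5 at lag 1 and zero at lags 2 and 3.
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125]))  # [0.5, 0.0, 0.0]
```

The example reproduces the 0.5 × 0.5 = 0.25 case from the text: the lag-2 autocorrelation of 0.25 is only a propagated effect, so the PACF at lag 2 is zero.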

To determine the orders p and q of the AR and MA model terms, plots of the ACF and PACF of the (differenced) data are typically used. It is best to look at the ACF and PACF together, and to apply the following rules of thumb:

If the PACF has a sharp cut-off, then an AR term should be considered. The order of the autoregression is determined by looking at the PACF, and checking after which time lag the PACF is approximately zero. For instance, if the PACF has significant spikes at time lags 1 and 2, but is (almost) zero at time lag 3 and higher, then an AR(2) model might be appropriate.
For determination of the order (q) of the MA part the ACF is used in a similar way, where q is chosen based on significant (positive or negative) spikes in the ACF. The rule of thumb here is: if the ACF has a sharp cut-off, then an MA term should be considered. Again, the lag where the ACF cuts off corresponds with the order of the MA model part.
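These rules of thumb can be sketched as a small helper in plain Python. It assumes the usual approximate 95 % significance bounds of ±1.96/√T for the (partial) autocorrelations of a white-noise series of length T; the function name `cutoff_lag` and the example values are hypothetical. Isolated spikes at higher lags can occur by chance, so the result is a suggestion rather than a decision.

```python
import math


def cutoff_lag(correlations, n, z=1.96):
    """Largest lag whose (partial) autocorrelation lies outside +/- z / sqrt(n).

    `correlations` holds the values for lags 1, 2, ...; `n` is the series length.
    Returns 0 when no lag is significant.
    """
    bound = z / math.sqrt(n)
    significant = [lag for lag, r in enumerate(correlations, start=1)
                   if abs(r) > bound]
    return max(significant) if significant else 0


# Hypothetical PACF values for a series of 200 samples: spikes at lags 1 and 2.
print(cutoff_lag([0.62, -0.41, 0.05, -0.03, 0.02], n=200))  # 2
```

Applied to the PACF the returned lag is a candidate for p, and applied to the ACF it is a candidate for q.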

We illustrate the order selection procedure for the AR(p) component with the simulated example presented in Fig. 1. Fig. 1a shows the time series plot of an AR(2) process. In Fig. 1b the ACF is plotted, showing a significant positive spike at lag 1, and significant negative spikes at lags 3 and 4. However, it can be seen that the PACF plot shows a significant positive spike at the first time lag, and a negative spike at the second time lag and is approximately zero afterward. Based on the PACF, it seems that after accounting for the autocorrelation at lags 1 and 2, no significant higher order terms are needed to describe this time series. An appropriate model then would be an AR(2) model.

It can sometimes be difficult in practice to select the best model based on a data sample and the estimated ACF and PACF. For instance, it can be hard to distinguish between an AR(1) or an MA(1) model by visual inspection of the ACF and PACF, and then decide which of the two models fits the data best. Therefore, information criteria like the Akaike Information Criterion (AIC) can be used to search for the best fitting model. The AIC is a relative measure of model fit that balances an improvement in model fit based on the log-likelihood with the number of added model parameters. As such, it is only useful to compare two (or more) models with each other, where the model with the lower AIC value is considered to provide a better fit to the data. An alternative to the AIC is the Bayesian Information Criterion (BIC), which differs from the AIC in how it penalizes for adding model parameters, but otherwise its interpretation is similar to the AIC. Most statistical software packages provide such model fit statistics along with the estimated model parameters.
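For reference, the two criteria in their standard form are shown below, where L̂ denotes the maximized likelihood of the fitted model, k the number of estimated parameters, and n the number of observations (these symbols are introduced here for illustration):

$$ \mathrm{AIC} = -2\log \hat{L} + 2k, \qquad \mathrm{BIC} = -2\log \hat{L} + k\log n. $$

The BIC penalty per parameter, log n, exceeds the AIC penalty of 2 as soon as n ≥ 8, so for measurement series of realistic length the BIC tends to favor models with fewer parameters than the AIC does.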

Step 3

ARIMA models can be extended to include covariates to explain observed effects or changes in the concentration level of a time series. An example is an indicator variable X_t that models the effect of a process activity on exposure, where X_t = 1 denotes activity and X_t = 0 denotes no activity. For an AR(1) model, the resulting equation then is:

$$ Y_{t} - a_{1} Y_{t - 1} = \mu + (X_{t} - a_{1} X_{t - 1} )\beta + e_{t} , \quad e_{t} \sim N(0,\sigma^{2} ), $$

(3)

where β is the regression coefficient of concentration level Y on indicator variable X. In this case the autoregression on Y also applies to X. When a differencing step is necessary, note that also X is differenced, i.e., the change in X is related to the change in Y. Effectively, this means that the ARIMA model is fitted to the errors of the regression of Y on X (note that without explanatory variables, the residual errors plus the mean term equal the observations Y). The cross correlation between the series Y and X can be helpful to identify if there is a relationship between observed trends in Y and a covariate X.

The model in (3) can also be viewed as a linear regression model that accounts for the serial correlation in the measurements. In Eq. 3, the serial correlation is modeled by an AR(1) model, but a regression model with an MA structure on the error terms can also be used. However, it is important to evaluate the stationarity of the series before making inferences and conclusions from ARMA regression models, since both ARMA- and regression models (and their combination) assume stationarity with constancy of variance of the error terms. Non-stationary series typically violate such assumptions and may bias the estimated coefficients.
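The structure of Eq. 3 can be illustrated with a minimal plain-Python sketch. It assumes that a₁ is already known, which is a simplification: in practice a₁ is estimated together with μ and β (for instance, R's arima() accepts covariates through its xreg argument). The function name `fit_ar1_regression` is hypothetical.

```python
def fit_ar1_regression(y, x, a1):
    """Least-squares estimates of mu and beta in Eq. 3 for a given a1.

    Both series are quasi-differenced, Y*_t = Y_t - a1 * Y_{t-1} and
    X*_t = X_t - a1 * X_{t-1}, after which Y*_t = mu + beta * X*_t + e_t
    is an ordinary regression with independent errors.
    """
    ys = [y[t] - a1 * y[t - 1] for t in range(1, len(y))]
    xs = [x[t] - a1 * x[t - 1] for t in range(1, len(x))]
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    sxy = sum((v - mx) * (w - my) for v, w in zip(xs, ys))
    beta = sxy / sxx
    mu = my - beta * mx
    return mu, beta
```

Note that μ in Eq. 3 is the intercept on the quasi-differenced scale; the mean level of Y during the period with X_t = 0 is μ/(1 − a₁).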

Step 4

When the time series appears stationary and the appropriate orders of the ARIMA model have been determined, model assumptions such as the normality of the residuals and the absence of residual autocorrelation have to be checked. This step finalizes the model fitting and model checking procedure. Subsequently, interpretation of the results remains.
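One common check for residual autocorrelation is the Ljung-Box statistic, sketched below in plain Python. The article does not state which residual test was used, so this choice is an assumption, and the function name `ljung_box_q` is illustrative.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k), k = 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, max_lag + 1):
        # r_k is the sample autocorrelation of the residuals at lag k
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q
```

Under the hypothesis of no residual autocorrelation, Q approximately follows a chi-square distribution with max_lag − (p + q) degrees of freedom for a fitted ARMA(p,q) model; for example, with 10 degrees of freedom a value of Q above about 18.3 indicates remaining autocorrelation at the 5 % level. Normality of the residuals is typically assessed separately, for instance with a histogram or a normal quantile plot.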

Testing for mean differences: standard t-test or ARIMA?

To show the influence of autocorrelation between measurements on statistical testing for mean differences (i.e., a standard t-test and an ARIMA regression model), a simulation study was performed. All data were simulated from an AR(1) model with μ = 0 and σ² = 1, see Eq (1), for four different values of the autocorrelation, a ₁ = 0.3, 0.5, 0.7, and 0.9. The length of the series was N = 200 samples. A switch in the mean level of the series occurred at t = 101, and two tests for the difference in mean level between the first 100 and the second 100 samples were performed. The first test was a standard t-test for a difference in means. In the second test, an AR(1) model was fitted, where an indicator variable modeled the mean difference between the second and the first half of the data. From the fitted AR(1) regression model, a t-test for the estimated mean difference was obtained from the estimated coefficient and its standard deviation for the indicator variable, corresponding to a regular t-test but with the important difference that autocorrelation between the samples was accounted for. For each condition (mean difference and autocorrelation) 100 data sets were simulated.

Figure 2 summarizes the results, where the averages over the 100 simulated data sets of the estimated t-values were plotted against the simulated mean differences. It can be seen that the t-test overestimated the statistical significance of the mean difference in all cases, where the overestimation was greater for increasing autocorrelation between the samples. Although the mean, variance, and sample size of the simulated data sets were chosen arbitrarily, the principle of this result stands for any data set where there is autocorrelation between subsequent samples. It shows that when testing for mean differences, autocorrelation in the data cannot be ignored. Neglecting autocorrelation can lead to false significant results, especially when mean differences are small or autocorrelation between samples is high.
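The principle can be illustrated with a small plain-Python simulation. This sketch does not reproduce the simulation study or Fig. 2: it only estimates how often a standard t-test declares a significant difference between the two halves of an AR(1) series when there is no true mean difference. The function names, the number of replicates, and the seed are assumptions.

```python
import math
import random


def simulate_ar1(n, a1, rng, shift=0.0):
    """AR(1) series with mu = 0, sigma = 1; `shift` is added to the second half."""
    # Draw the starting value from the stationary distribution of the process.
    prev = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - a1 ** 2))
    y = []
    for t in range(n):
        prev = a1 * prev + rng.gauss(0.0, 1.0)
        y.append(prev + (shift if t >= n // 2 else 0.0))
    return y


def t_statistic(first, second):
    """Standard two-sample t statistic (unequal variances) for second - first."""
    def mean_var(v):
        m = sum(v) / len(v)
        return m, sum((e - m) ** 2 for e in v) / (len(v) - 1)
    m1, v1 = mean_var(first)
    m2, v2 = mean_var(second)
    return (m2 - m1) / math.sqrt(v1 / len(first) + v2 / len(second))


def false_positive_rate(a1, n=200, replicates=200, seed=0):
    """Share of series WITHOUT a mean shift for which |t| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        y = simulate_ar1(n, a1, rng)
        if abs(t_statistic(y[: n // 2], y[n // 2:])) > 1.96:
            hits += 1
    return hits / replicates


for a1 in (0.0, 0.3, 0.5, 0.7, 0.9):
    print(a1, false_positive_rate(a1))
```

For independent data (a₁ = 0) the rate should be close to the nominal 5 %, and it should rise well above that level as a₁ approaches 1, in line with the overestimated significance described above.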

Empirical examples

In this section, we present three real data examples to illustrate the use of ARIMA models for (statistically) analyzing time-series of nano exposure measurement results.

Example 1: Testing for an effect of an activity on the particle number concentration level

A measurement series of a certain activity or task resulted in a time series of 1500 subsequent measurements of number concentration of particles smaller than 100 nm, using an ELPI on-line measurement device with a response time of 1 s. For simplicity, we refer to this measurement series as “Example 1” from now on. A time series plot of the data is shown in Fig. 3a, in which a dotted line denotes the region where the task or activity was performed during the measurement period. The research question was to investigate a potential rise in exposure to nano-sized particles (<100 nm) during performance of the activity compared with the non-activity period.