Abstract
In this paper, we estimate the distribution of population by exposure to multiple airborne pollutants, taking into account the spatio-temporal variability of daily air quality and the high-resolution spatial spread of human population around Europe. In particular, we consider monitoring network data for five pollutants, namely carbon monoxide, nitrogen dioxide, ozone, coarse and fine particulate matters. The spatial information contained in the large dataset of daily continental air quality is exploited using a multivariate spatio-temporal model capable to cover cross correlation among pollutants, covariates, and missing data as well as spatial and temporal variability and correlation. At the same time, the model is simple enough to be feasible for the large dataset of daily continental air quality over three years. Maximum likelihood estimation is performed using the EM algorithm, and kriging-like spatial estimates are used to compute high-resolution exposure distribution. Moreover, a novel semi-parametric bootstrap technique is used to assess the exposure distribution uncertainty. In this way, we compare the daily population exposure of 33 European countries and three important metropolitan areas in years 2009–2011 using a single flexible model. Extensive tabulations and graphs are reported in the supplementary material.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Thanks to the wide development of fixed monitoring networks, human exposure to airborne pollution is often related to the so-called “ambient exposure,” namely the pollutant concentration at which people are exposed when outdoor, which is assessed by means of pollutant concentration data coming from the mentioned monitoring networks or from chemical transport computer models.
Since using the ambient exposure as personal exposure may lead to ecological fallacy, individual exposure is deserving an increasing attention in recent years. The exposure of an individual to airborne pollutants can be defined as the total amount of pollutants the person is exposed to, during a given period of time of his/her life. Clearly, this amount depends on a large number of factors, the main factors being the spatial location of the person and, conditionally on this, his/her life style and activity level. This is often called personal exposure as, in principle, it can be related to each specific person.
In the recent statistical literature, two main approaches to model personal exposure emerged. On the one side, the probabilistic approach of Zidek et al. (2005) allows to simulate individual exposure temporal profiles for a finite set of individuals and, based on this, it allows to estimate a simulated population distribution. On the other side, an empirical approach based on personal monitors has been considered by few authors. For example, McBride et al. (2007) consider a Bayesian hierarchical model to model the relationship between observed personal exposure and ambient exposure for a small sample of individuals running personal monitors. Moreover, Jahn et al. (2013) performed an extensive study in Guangzhou assessing the correlation between \(\hbox {PM}_{2.5}\) measured by ambient and personal monitors.
In principle, aggregating personal exposures brings us to the population exposure of an entire city, country or continent. Since this total personal exposure is difficult or impossible to compute in practice, Finazzi et al. (2013) and Shaddick et al. (2013) consider a simplified concept of “population exposure,” which is halfway between ambient and personal exposure and it assumes that, in a small area, ambient and personal exposure are roughly proportional. This assumption is especially critical when using yearly averages, which are used not only in Shaddick et al. (2013) and Beelen et al. (2009) , but also in European regulations, see “Air quality in Europe—2015 report” (EEA 2015). In Finazzi et al. (2013), this aggregation effect is mitigated by the use of daily data which take into account also seasonal and meteorological effects to exposure. Moreover, the latter authors consider spatio-temporal modeling to estimate the exposure burden at the country level, while the former authors consider a purely spatial model at the continental level. In this paper, we use the population exposure distribution introduced by Finazzi et al. (2013), but with a different statistical model capable to handle daily data at the European level, uncertainty included.
Spatio-temporal statistical models for air quality have been extensively developed during the last decade in the univariate case (see e.g., Sahu et al. 2006; Calder 2008 and McMillan et al. 2010). On the other side, multivariate spatio-temporal data are increasingly being used in the analysis of air quality. Pollice and Jona (2010) propose the use of a multivariate model for daily concentrations of three pollutants in the Taranto area, Italy. De Iaco et al. (2013) implement spatio-temporal cokriging based on the linear coregionalization model (LCM). Calculli et al. (2015) use the multivariate dynamic map model for insight into Saharian dust events.
In this paper, European multi-pollutant daily concentrations are estimated in high resolution using an extension of the multivariate dynamic coregionalization model (DCM) implemented in the D-STEM software (Finazzi and Fassò 2014), which covers Box–Cox data transformation, prediction and semi-parametric bootstrap, which is an extension of Rister and Lahiri (2013). Model outputs are then extensively used to compute the exposure distribution for European countries and three metropolitan areas.
Alternatives to DCM are available which cover non-separable correlation functions (e.g., Bourotte et al. 2016). Unfortunately in this case the variance covariance matrix has dimension \(ST\times ST\) and, as shown in Cameletti et al. (2011), computation time increases rapidly, making this unfeasible for our dataset. Approximations such as tapering or pairwise likelihood could be also considered. Nonetheless Bevilacqua et al. (2016) showed that the sparsity level required to make this method computationally competitive is unachievable for our application. This paper shows that DCM approach is flexible enough and a reach set of covariates can reduce modeling effort for the covariance function.
The rest of the paper is organized as follows. In Sect. 2, we introduce the concept of exposure distribution and its estimation technique based on a general spatio-temporal model coupled with a novel bootstrap technique. Section 3 presents the air quality datasets and the geographic area covered by the study. Candidate models are fitted using the EM algorithm and cross-validation is adopted for model selection. Then, exposure distribution results are reported for 33 European countries and three metropolitan areas. Joint exposure to multiple pollutants is also taken into account. A closing section draws some conclusions while extensive tabulations and graphs are reported in the supplementary material.
2 Statistical Methods
In this section we detail the statistical modeling techniques used to estimate the population exposure distribution and its uncertainty. To start with, we define the exposure distribution based on high-resolution data observed over a geographic area, which may range from a metropolitan area to the European continent. Then we summarize the multivariate modeling technique and we introduce the novel bootstrap approach, which is used in propagating the uncertainty of the model output to the estimated exposure distribution.
2.1 The Distribution of Population by Exposure
In order to define the exposure distribution, we assume that the concentrations of q pollutants are available on a fine spatial grid of pixels \(\mathbb {B}=\left\{ \mathcal {B}_{1},\ldots ,\mathcal {B}_{R}\right\} \), which covers the geographic region of interest, namely \(\mathcal {R=\cup }_{r=1}^{R}\mathcal {B}_{r}\). Let \(y_{i}\left( \mathcal {B},t\right) \) denote the average concentration of pollutant i over pixel \(\mathcal {B}\in \mathbb {B}\) at discrete time \(t\in \mathcal {T=}\left\{ 1,\ldots ,T\right\} \). Moreover, let \(n\left( \mathcal {B}\right) \) be the number of people living inside pixel \(\mathcal {B\in }\mathbb {B}\), which is supposed to be constant over time and let \(N=\sum _{r=1}^{R}n\left( \mathcal {B}_{r}\right) \) be the total population living in \(\mathcal {R}\). Note that population stationarity is a simplifying assumption and may be relaxed if data about weekly and seasonal population movements are available as in Secchi et al. (2015).
Improving over Finazzi et al. (2013) and Shaddick et al. (2013), which used the (weighted) average exposure for region \(\mathcal {R}\), we focus on the more detailed information provided by the cumulative exposure distribution for region \(\mathcal {R}\), namely
where I(A) is the indicator function of set \(A, n(\mathcal {B})\) and N are defined above.
Hence, for a fixed \(y, F_{i}(y,t)\) gives the fraction of the population that was exposed to a concentration level lower or equal to y at time t for pollutant i. This can be integrated over time. For example, considering the time frame \(\mathcal {T}\), the (average) exposure distribution in \(\mathcal {T}\) is given by
Note that distribution in (1) may be interpreted as a mixture distribution with equal weights and it is much more informative than the distribution of the yearly average concentration as it takes into account the daily variability of the pollutant and the extreme concentrations are not averaged out. The exposure distribution density is also defined in a natural way by \(f_{i}(y) =\frac{\mathrm{d}}{\mathrm{d}y} F_{i}(y)\), and it can be readily approximated by suitable data binning. Moreover, the exposure exceedance is given by the quantity
which gives the fraction of the population that was exposed to a pollutant concentration higher than y, averaged over time.
Multi-pollutant distributions can be suitably defined generalizing \(F_{i}(y) \) and \(H_{i}(y) \). For instance, considering two pollutants and their exposure distribution \(F_{ij}\left( y,z\right) \), the simultaneous exposure exceedance is given by the positive quadrant frequency and, in practice, \(H_{ij}\) is computed using
In the case of a general q-dimensional pollutant, the corresponding positive orthant frequency \(H\left( y_{1},\ldots ,y_{q}\right) \) is suitably defined.
Since the concentration levels \(y_{i}(\mathcal {B},t)\) are usually unknown in practice, the next section summarizes a statistical modeling approach for estimating \(y_{i}(\mathcal {B},t)\) starting from data observed on a irregularly spaced and heterogeneous monitoring network. In the rest of the paper, pixels \(\mathcal {B}\) are supposed to be small and the following approximation is considered
where \(\varvec{s}_{r}\) is the center of \(\mathcal {B}\) and \(\left| \mathcal {B}\right| \) is its area.
2.2 Dynamic Coregionalization Model
Mapping multi-pollutant concentrations at the continental scale requires to adopt a statistical model able to handle the complexity of the pollutant phenomenon in an efficient way. From the data point of view, the model must accommodate for an unbalanced monitoring network and extensive missing data. In particular, different pollutants are often observed at non-co-located spatial locations. Moreover, some pollutants are known to be correlated with others and mapping may benefit from a model able to describe the spatio-temporal correlation across pollutants. Hence the aim is to have a model which is both flexible and easy to be estimated for a large dataset.
In order to estimate \(y_{i}(\mathcal {B},t)\) of the previous section, a hierarchical multivariate spatio-temporal model is considered here. To this end, let \(\varvec{y}(\varvec{s},t)\) be the q-dimensional vector of the pollutant concentrations \(y_{1},\ldots ,y_{q}\) observed at spatial location \(\varvec{s}\in \mathcal {R}\) and time \(t\in \mathcal {T}\). Since environmental variables are commonly positively skewed, similarly to Oliveira et al. (1997), we use Box–Cox transformation \(\xi _{i}=g_{i}(y_{i})=g\left( y_{i},\lambda _{i}\right) \) with parameter \(\lambda _{i}\), and assume that \(\varvec{\xi }=\left( \xi _{1},\ldots ,\xi _{q}\right) ^{\prime }\) is a q-dimensional Gaussian spatio-temporal process as defined in Finazzi and Fassò (2014), Sect. 5. In particular, \(\xi _{i}\) are supposed to be locally linearly related to some covariates \(\varvec{x}_{i}\) and the following DCM is used
In the observation Eq. (5), the measurement errors \(\varepsilon _{i}\) are Gaussian distributed with zero mean and variance \(\sigma _{\varepsilon _{i}}^{2}\), uncorrelated over space, time and pollutants. In the trend Eq. (6), \(\alpha _{i}\) is a spatio-temporal stochastic intercept, \(\varvec{x}_{i}\) is a \(b_{i}\)-dimensional vector of covariates with stochastic coefficient vector \(\varvec{\beta }_{i}=\left( \beta _{i1},\ldots ,\beta _{i,b_{i}}\right) ^{\prime }\). The stochastic structure of \(\alpha \) and \(\varvec{\beta }\) is given by the following models
where \(j=1,\ldots ,b_{i}, \alpha _{i}^{0}\) and \(\beta _{ij}^{0}\) are the fixed effects while \(z_{ij}\) are the temporal random effects such that \(\varvec{z}(t)=\!\left( z_{10}\left( t\right) ,\ldots ,z_{q0}\left( t\right) ,z_{11}\left( t\right) ,\ldots ,z_{1b_{1}}\left( t\right) ,\ldots ,z_{q1}\left( t\right) ,\ldots ,z_{qb_{q}}\right. \left. \left( t\right) \right) ^{\prime }\) is a stochastic vector with smooth dynamics given by
with \(\varvec{\eta }(t)\sim N\left( \varvec{0},\varvec{\Sigma }_{\varvec{\eta }}\right) \) and \(\varvec{\Sigma }_{\varvec{\eta }}\) a diagonal variance-covariance matrix.
In (7), \(\varvec{w}\left( \varvec{s},t\right) =\left( w_{1}\left( \varvec{s},t\right) ,\ldots ,w_{q}\left( \varvec{s},t\right) \right) ^{\prime }\) is a q-dimensional Gaussian process, uncorrelated over time but correlated over space. In particular, \(\varvec{w}\left( \varvec{s},t\right) \) is a so-called LCM with the following spatial variance-covariance matrix function
where \(\varvec{V}\) is a valid correlation matrix and \(\rho \left( \left\| \varvec{s}-\varvec{s}^{\prime }\right\| ;\theta ,\nu \right) \) is the Matérn correlation function, with \(\left\| \varvec{s}-\varvec{s}^{\prime }\right\| \) the distance between spatial locations \(\varvec{s}\) and \(\varvec{s}^{\prime }, \theta \) the range parameter and \(\nu >0\) the shape parameter.
The model parameter set is
where \(\varvec{\alpha }=\left( \alpha _{1}^{0},\ldots ,\alpha _{q}^{0}\right) ^{\prime }, \varvec{\beta }=\left( \left( \varvec{\beta }_{1}^{0}\right) ^{\prime },\ldots ,\left( \varvec{\beta }_{q}^{0}\right) ^{\prime }\right) ^{\prime }, \varvec{\beta }_{i}^{0}=\left( \beta _{i1}^{0},\ldots ,\beta _{ib_{i}}^{0}\right) ^{\prime }, \varvec{\sigma }_{\varvec{\varepsilon }}^{2}=\left( \sigma _{\varepsilon _{1}}^{2},\ldots ,\sigma _{\varepsilon _{q}}^{2}\right) ^{\prime }, \varvec{\tau }=\left( \tau _{1},\ldots ,\tau _{q}\right) ^{\prime }, \varvec{\sigma }_{\varvec{\eta }}^{2}=\mathrm{diag}\left( \varvec{\Sigma }_{\varvec{\eta }}\right) \) and \(\varvec{v}\) is the \(q\left( q-1\right) /2\) dimensional vector obtained by stacking the unique and non-diagonal elements of the symmetric matrix \(\varvec{V}\). Note that, in practice, we may have \(\alpha _{i}^{0}\equiv 0\) and/or \(\beta _{ij}^{0}\equiv 0\ \)and/or \(z_{ij}\equiv 0\), for some i, j. Hence the vectors \(\varvec{\alpha },\varvec{\beta ,z}(t)\) and \(\varvec{\sigma }_{\varvec{\eta }}^{2}\) include only non-zero elements.
2.3 Prediction and Bootstrap
Let \(\varvec{Y}=\left\{ \varvec{y}_{1},\ldots ,\varvec{y}_{q}\right\} \) be the dataset of pollutant concentrations. In particular, each \(\varvec{y}_{i}\) is the \(n_{i}T-\)dimensional vector of the i-th pollutant concentrations at spatial locations \(\mathbb {D}_{i}\mathbb {=}\left\{ \varvec{s}_{1},\ldots ,\varvec{s}_{n_{i}}\right\} , i=1,\ldots ,q\). The vectors \(\varvec{y}_{i}\) may include missing values and, in general, the sets \(\mathbb {D}_{i}\) may be different. This means that the q pollutants are not necessarily observed at the same spatial locations (see Fassò and Finazzi 2011).
The minimum mean square error spatial prediction of \(y_{i}=y_{i}\left( \varvec{s},t\right) \) at point \(\varvec{s}\notin \mathbb {D}_{i}\) is given by \(\bar{y}_{i}=E\left( y_{i}|\tilde{\varvec{Y}}\right) \), where \(\tilde{\varvec{Y}}=\left\{ \tilde{\varvec{y}}_{1},\ldots ,\tilde{\varvec{y}}_{q}\right\} \) is the dataset restricted to the non-missing data. On the other side, using the plug-in approach, DCM gives the natural estimator for \(\xi _{i}\), namely \(\hat{\xi }_{i}=E\left( \xi _{i}|\tilde{\varvec{Y}}\right) \). Hence we approximate \(\bar{y}_{i}\) using the back-transformed \(\hat{\xi }_{i}\) with bias correction given by the well known delta method, namely
where \(g_{i}^{-1}\) is the Box–Cox back-transformation function, \(\sigma _{\hat{\xi }_{i}}^{2}={\text {Var}}\left( \xi _{i}|\tilde{\varvec{Y}}\right) \) and \(\left[ \cdot \right] ^{\prime \prime }\) is the second derivative.
In order to assess the uncertainty of the population exposure distributions and exceedances defined in Sect. 2.1, we aim at estimating the uncertainty related to \(\hat{y}_{i}\). To this end, it is worth noting that \(p\left( \xi _{i}\mid \tilde{\varvec{Y}}\right) =N\left( \hat{\xi }_{i},\sigma _{\hat{\xi }_{i}}^{2}\right) \). It is thus straightforward to draw a sample from \(N\left( \hat{\xi }_{i},\sigma _{\hat{\xi }_{i}}^{2}\right) \) and to back-transform the sample using (12) in order to assess the uncertainty of \(\hat{y}_{i}\). This approach, however, is not robust against misspecification of DCM and Box–Cox transformation. Additionally, it does not reflect the uncertainty on the estimated model parameters, which are plugged-in when \(\hat{\xi }_{i}\) and \(\sigma _{\hat{\xi }_{i}}^{2}\) are computed.
In order to obtain a robust assessment of the uncertainty on \(\hat{y}_{i}\), we develop a semi-parametric approach based on the conditional distributions \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , \hat{\xi }_{i}\in \mathbb {R}\), which are estimated using the DCM output. In order to estimate \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) we assume that the following conditional independence and ergodicity conditions hold true.
Condition 1
For each \(i=1,\ldots ,q\), the conditional process \(\left( y_{i}\left( \varvec{s},t\right) |\hat{\xi }_{i}\left( \varvec{s},t\right) \right) , \varvec{s}\in \mathcal {R}, t\in \mathbb {N} ^{+}\), is stationary, white noise over space and time, with known distribution \(p_{i}\left( y_{i}|\xi _{i}\right) \).
Condition 2
For each \(i=1,\ldots ,q\), the bivariate distribution \(p_{i}\left( y_{i},\hat{\xi }_{i}\right) \) is described by a bivariate Gaussian mixture of \(m_{i}\) components.
Condition 3
For each \(i=1,\ldots ,q\), let \(\varvec{x}_{i}\left( \varvec{s},t\right) \) be the vector of covariates at spatial location \(\varvec{s}\). There exists a stochastic process \(\left( y_{i}\left( \varvec{s},t\right) ,\varvec{x}_{i}\left( \varvec{s},t\right) \right) \) which is ergodic for \(p_{i}\left( y_{i},\hat{\xi }_{i}\right) \). In other words, the bivariate distributions \(p_{i}\left( y_{i},\hat{\xi }_{i}\right) \), and thus \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \), can be consistently estimated using a realization \(\left( y_{i}\left( \varvec{s},t\right) ,\hat{\xi }_{i}\left( \varvec{s},t\right) ,\varvec{x}_{i}\left( \varvec{s},t\right) \right) \) for \(T\rightarrow \infty \).
Condition 1 amounts to say that, conditionally on \(\xi _{i}\), the pollutant concentration \(y_{i}\) is i.i.d. over space and time. Indeed, Condition 1 is the basis for bootstrap resampling and it is far less restrictive than bootstrap assumptions of both Rister and Lahiri (2013) and Finazzi et al. (2013). Condition 2 is made for convenience and it is not restrictive due to the universal approximator nature and flexibility of Gaussian mixtures. Eventually, Condition 3 allows to have consistent estimates of the distributions \(p_{1},\ldots ,p_{q}\) based on the observed data and DCM estimates \(\hat{\xi }_{1},\ldots ,\hat{\xi }_{q}\). In the sequel, the set of distributions \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) ,i=1,\ldots ,q\) is called data model.
2.4 Model Estimation and Selection
The overall procedure is described in this section using eight steps. Preliminarily, in step 1, the Box–Cox parameter set \(\varvec{\lambda }\) is estimated and data are transformed accordingly, then in step 2, after a preliminary analysis of transformed data, a pool of candidate DCMs for model selection is defined. For each candidate DCM, conditionally on \(\varvec{\lambda }\) the parameter set \(\Psi \) is estimated in step 3 and the data model \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , i=1,\ldots ,q\), is estimated in step 4. In step 5, model selection is performed. Applying plug-in approach to the selected model, in steps 6, daily maps required for the population exposure are estimated, in step 7, the multivariate exposure distribution is computed and, finally, assessment of uncertainty is performed in step 8.
Since data model is intended to be robust against DCM and Box–Cox transform misspecification, a two-stage procedure based on data splitting is used. Moreover, cross-validation is performed for model selection. To do this, each monitoring network \(\mathbb {D}_{i}\) is partitioned in three subsets, say \(\mathbb {D}_{i}^{[1]},\mathbb {D}_{i}^{[2]}\) and \(\mathbb {D}_{i}^{[3]}\), and dataset \(\tilde{\varvec{y}}_{i}\) is partitioned accordingly, that is \(\tilde{\varvec{y}}_{i}^{[h]}=\left( \tilde{y}_{i}\left( \varvec{s},t\right) ,\varvec{s}\in \mathbb {D}_{i}^{[h]},t\in \mathcal {T}\right) \). The estimation of \(\Psi \) and \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) is based on the data \(\tilde{\varvec{y}}_{i}^{\left[ 1\right] }\) and \(\tilde{\varvec{y}}_{i}^{\left[ 2\right] }\), respectively, while cross-validation is performed using \(\tilde{\varvec{y}}_{i}^{\left[ 3\right] }\). We now discuss in more details the above steps.
-
(1)
Box–Cox transform estimation
Each element of the parameter set \(\varvec{\lambda }=\left( \lambda _{1},\ldots ,\lambda _{q}\right) ^{\prime }\) is estimated using the maximum likelihood approach as in Box and Cox (1964), applied separately to data \(\tilde{\varvec{y}}_{i}^{[1] },i=1,\ldots ,q\). In order to keep computational time feasible, we use here an iid distribution which ignores cross, spatial and temporal correlation. Ignoring autocorrelation is known to give suboptimal estimates and estimated standard deviations of \(\hat{\lambda }_{i}\) obtained by above likelihood are known to be biased. Nonetheless, the objective of this paper is exposure and \(\varvec{\lambda }\) may be considered as a nuisance parameter. Hence, in view of the large dataset used, we consider above \(\varvec{\lambda }\) as a good estimate for our purposes but we avoid to compute its standard deviations in the application section.
-
(2)
Preliminary analysis
After a preliminary analysis of \(\xi \) transform, based on correlations, spatial regressions and simplified DCMs, a set of candidate DCMs is defined. On the one side this step has to consider the relevant model components and covariates but on the other side, considering computational burden, a large full factorial model search has to be avoided.
-
(3)
DCM estimation
For each DCM of the previous step, conditionally on \(\varvec{\hat{\lambda }}\), estimation of the model parameter set \(\Psi \) is based on the EM algorithm and is implemented by means of the D-STEM software applied to data \(g_{i}\left( \tilde{\varvec{y}}_{i}^{[1]}\right) , i=1,\ldots ,q\), giving \(\hat{\Psi }^{\left[ 1\right] }\), say. This is a computer intensive task for daily data at continental level as discussed in the opening of Sect. 3.2.
-
(4)
Data model estimation
The semi-parametric estimation of the data model \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , i=1,\ldots ,q\), is composed of the following two sub-steps:
-
(a)
Kriging. Using \(\hat{\Psi }^{\left[ 1\right] }, \xi _{i}\) are estimated at spatial locations \(\mathbb {D}_{i}^{[2]}\):
$$\begin{aligned} \hat{\xi }_{i}^{\left[ 2\right] }=E_{\hat{\Psi }^{\left[ 1\right] }}\left( \xi _{i}^{\left[ 2\right] }\mid \tilde{\varvec{y}}^{\left[ 1\right] }\right) , \end{aligned}$$(13)where \(E_{\hat{\Psi }^{\left[ 1\right] }}\) is the expectation computed using the DCM with parameter set \(\hat{\Psi }^{\left[ 1\right] }\). Note that the estimates \(\hat{\xi }_{i}^{\left[ 2\right] }\) are obtained using the information from all the pollutants, exploiting their spatio-temporal cross-correlation and are stacked in the vector \(\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }=\left( \hat{\xi }_{i}^{\left[ 2\right] }\left( \varvec{s},t\right) ,\varvec{s}\in \mathbb {D}_{i}^{[2]},t\in \mathcal {T}\right) \) for convenience.
-
(b)
Mixture estimation. For each \(i=1,\ldots ,q\), data \(\left( \tilde{\varvec{y}}_{i}^{\left[ 2\right] },\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }\right) \) are used to estimate the bivariate mixtures distributions \(p_{i}\) using the standard EM algorithm for Gaussian mixtures with \(m_{i}\) components (see e.g., McLachlan and Krishnan 2008). Hence, conditional distributions \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) are obtained straightforwardly.
-
(a)
-
(5)
Model selection
Given a set of candidate DCMs, unbiasedness and mean square error optimality are of interest. Hence, using \(\tilde{\varvec{y}}_{i}^{\left[ 3\right] }\), errors \(\hat{e}_{i}\left( \varvec{s},t\right) =\tilde{y}_{i}\left( \varvec{s},t\right) -\hat{y}_{i}\left( \varvec{s},t\right) , \varvec{s}\in \mathbb {D}_{i}^{[3]}\) are computed and the DCMs are compared in terms of bias, root mean squared error (RMSE) and \(R^{2}\) statistic, where, for each pollutant, \(R^{2}=1-\mathrm{MSE}/\mathrm{Var}(y)\). The DCM with the highest \(R^{2}\) is retained.
-
(6)
Mapping
In order to map the pollutant concentration over space and time at high resolution, small pixels \(\mathcal {B}_{r}\in \mathbb {B}\) with centers \(\varvec{s}_{r}\) are considered. The actual average pollutant concentration over pixel \(\mathcal {B}_{r}\) is estimated by \(\hat{y}_{i}(\varvec{s}_{r},t)\) in Eq. (12).
-
(7)
Estimation of exposure distributions
The definitions and methods of Sect. 2.1 are applied to \(\hat{y}_{i}(\varvec{s}_{r},t)\), giving the univariate and multivariate estimated exposure distributions, say \(\hat{F}_{i}, \hat{H}_{i}\) and \(\hat{f}_{i}\).
-
(8)
Uncertainty of exposure distributions
To evaluate the uncertainty of quantities such as \(\hat{H}_{i}\) and \(\hat{f}_{i}\), a bootstrap approach is adopted and confidence bands of \(f_{i}\) and \(H_{i}\) are provided. In particular, \(B=200\) bootstrap replicates of the daily average pollutant concentration maps are generated by randomly sampling the pollutant concentration of pixel \(\mathcal {B}_{r}\) and time t from \(p_{i}\left( y_{i}|\hat{\xi }_{i}(\varvec{s}_{r},t)\right) \). This gives the maps \(\varvec{y}_{b}^{*}=\left( y_{b}^{*}\left( s_{r},t\right) ,t\in \mathcal {T}\right) , r=1,\ldots ,R, b=1,\ldots ,B\). Step 7 applied to each replica \(\varvec{y}_{b}^{*}\) gives the bootstrap sample \(H_{i,b}^{*}\) from which confidence bands for \(H_{i}\) are immediately computed.
3 European Case Study
3.1 Data Description
Air quality data from the AirBase database of the European Environment Agency are considered. The geographic area is restricted to the box (\(35{{}^\circ }N, 67{{}^\circ }N, 11{{}^\circ }W, 35{{}^\circ }E\)) of Fig. 1, which includes in full the 33 European countries listed in the table of Fig. 2 and overall time frame covers years 2009–2011. With some abuse of language, terms like European level or European average will be used for the union of the 33 countries.
The daily average concentrations, in \(\mathrm{\mu g/m^{3}}\), for the following five relevant airborne pollutants are used: carbon monoxide (CO), nitrogen dioxide (\(\hbox {NO}_{2}\)), ozone (\(\hbox {O}_{3}\)), particulate matter not larger than \(10\,\mathrm{\mu m}\) (\(\hbox {PM}_{10}\)) and \(2.5\,\mathrm{\mu m}\) (\(\hbox {PM}_{2.5}\)). Only background-type stations are taken into account because traffic-type and industrial-type stations are subject to heavy preferential sampling and are known to overestimate the pixel average, which is relevant for the population exposure assessment at daily scale. Table 1 shows that, overall, the monitoring network is unbalanced since the number of sensors measuring each pollutant is highly variable. Moreover, the missing data rate is quite substantial and averts the use of standard methods not suited to handle missing data. Table 1 also reports the parameters \(\lambda _{i}\) of the Box–Cox transformation, which are estimated using data \(\tilde{\varvec{y}}_{i}^{[1]}\) as in step 1 of Sect. 2.4. Both the need and the effect of the Box–Cox transformation are clear from skewness and kurtosis of Table 1.
Population count, land elevation, and meteorology data are considered as covariates. Population data are obtained from the LandScan database (Dobson et al. 2000) which provides the 24 h average population spatial distribution globally with spatial resolution 1 km. This dataset is constant over the year, but the 24 h average takes into account the intra-day population mobility. Variations across the three years of the case study are considered negligible. Population data are used for the exposure distribution computation and as a fundamental covariate, being proxy of local emissions averaged over time. Other candidate covariates for the model are land elevation, taken from the GTOPO30 dataset, and meteorological variables including wind speed, pressure and temperature, taken from the ERA Interim dataset (ECMWF Interim Re-Analysis).
A special focus is dedicated to the three metropolitan areas of Berlin, Greater London and Milan province (see Table 2). The three areas are characterized by a major source of pollution related to car traffic and house heating. Nonetheless, they are different in population size, population density, transportation and, more importantly, in atmospheric circulation regime. In particular, Berlin and Greater London have a higher population density but they are subject to oceanic air circulation. Viceversa, the Milan province has a lower population density but a higher vehicle rate, and it is surrounded by the c-shaped mountain range composed by the Alps and the Appennines, which prevents major air circulation and leads to pollutant accumulation in the lower troposphere.
3.2 DCM for European Data
Model estimation and selection are performed following the strategy defined by the eight steps of Sect. 2.4 which are referred to, in the sequel. Each set \(\mathbb {D}_{i}, i=1,\ldots ,q\), is randomly partitioned in such a way that \(\mathbb {D}_{i}^{[1]}, \mathbb {D}_{i}^{[2]}\) and \(\mathbb {D}_{i}^{[3]}\) include 2 / 3, 1 / 6 and 1 / 6 of the total number of sensors, respectively.
A preliminary (step 2) analysis, applied to standardized data \(\mathbf {\xi }_{i}^{[1]}\) based on simple regressions and DCM univariate models gave the following results:
-
(1)
Selected covariates include meteorological variables at lags 0, 1 and 2, land elevation, log-transformed population count and two dummies identifying Saturday and Sunday effects. For each pollutant, covariates have been pruned using p values. In particular, due to the large number of observations and degrees of freedom, pruning is done implementing t tests with p value \(\ge 3\,\times \,10^{-6}\).
-
(2)
Random temporal components \(z_{ij}\left( t\right) \) of Eqs. (7) and (8) have been selected if empirical variance contribution was larger than 0.01. As a result only \(z_{i0}, i=1,\ldots ,q\) have been retained. This means that only the intercept in DCM needs to be time varying.
-
(3)
The exponential covariance function, \(\rho (\Vert \varvec{s}-\varvec{s}^{\prime }\Vert ;\theta _{i},\nu =\frac{1}{2}),\) resulted the best in the restricted Matérn class \(0<\nu _{i}\le \frac{1}{2}\) \(i=1,\ldots ,q\), according to cross-validation results. Note that the restriction \(\nu \le \frac{1}{2}\) is used to assure a positive definite covariance function on a sphere or a part of it, such as the European domain, which can be hardly projected on the plane, see Chunfeng et al. (2009). Indeed in this case study, we use the geodetic distance for \(\Vert \varvec{s}-\varvec{s}^{\prime }\Vert \).
Going on with step 2, we formulate a candidate model class where each one-dimensional equation is given by
which is a particular case of DCM model in Sect. 2.2 and \(\xi \) are standardized variables. Using Eq. (14) , we define various univariate DCM candidates with independent \(w_{i}\) components, and one five-variate DCM candidate with \(w_{i}\) given by the LCM of Sect. 2.2. The first candidate (M1a) is a set of univariate models obtained by Eq. (14), deleting both random effects \(z_{i0}\) and \(w_{i}\), that is spatial regression. The second candidate (M1b) is given by Eq. (14) deleting only w, while the third candidate (M1c) is the fully saturated univariate case obtained by Eq. (14) with all its terms. The fourth candidate (M5) is the fully saturated five-variate model corresponding to M1c. Note that M1a and M1b are nested in M1c, which is quasi-nested in M5 since, in the multivariate model, restriction \(\theta _{1}=\cdots =\theta _{5}\) is applied.
Moving to step 3, estimation of M5 is built from the estimated univariate models. In particular to ease the convergence of the EM algorithm, initial values of M5 are taken as the estimates of M1c with the spatial correlation range parameter \(\theta \) initialized to the average of \(\hat{\theta }_{1},\ldots ,\hat{\theta }_{5}\). Now the dataset is \(\varvec{Y}^{[1]}=\left\{ \varvec{y}_{1}^{[1]},\ldots ,\varvec{y}_{5}^{[1]}\right\} \), which includes a total of \(S=4636\) sensors and \(T=1095\) days. The number S determines the size of the largest variance-covariance matrix involved in the EM algorithm, which, in the case of no-missing data, must solve a linear system \(4636\times 4636\) for each day and each EM iteration. To face this computational burden, model estimation is performed on a cluster of five server machines, for a total of 48 CPU cores and 192 GB of RAM. We defer further discussion of steps 3 and 4 to Sects. 3.3 and 3.4.
The step 5 cross-validation results, based on data \(\tilde{\varvec{y}}_{i}^{[3]}\), are reported in Table 3. Note that \(R^{2}\) statistic increases markedly, when the terms \(z_{i0}\) and \(w_{i}\) are successively added to M1a, as they capture the pollutant space-time variability which is not explained by the covariates. Finally, note that, for all the pollutants, \(R^{2}\) increases when moving from the univariate models M1c to the multivariate model M5. The largest \(R^{2}\) increase is obtained by \(\hbox {PM}_{2.5}\) for the reasons discussed in the next section.
3.3 DCM Results
The estimated parameters of the multivariate model M5 are reported in Tables 4 and 5, with approximated standard deviations in brackets. In particular, Table 4 reports the fixed effect coefficients \(\hat{\beta }_{ij}^{0}\). As expected, an increase in the wind speed implies a lower pollutant concentration, with the exception of ozone. In fact, ozone is a photochemical pollutant derived mainly by \(\hbox {NO}_{X}\) and affected by transportation. As a consequence it exhibits also negative correlation with all the other pollutants and an increase during weekend [see Wolff et al. (2013) and references therein]. Pressure and temperature show complex dynamical transfer functions as the coefficients at different lags often differ in their sign. Eventually, land elevation, population density and weekend have coefficient signs corresponding to the expected effect on pollutant concentrations.
Table 5 reports spatial and temporal random effect parameters. In order to understand vector \(\varvec{\hat{\tau }}\) remember that data \(\xi _{i}\) have unit variance and \(\hat{\tau }_{i}^{2}\) may be interpreted as the variances of the spatial latent components \(w_{i}\). Hence, we see that such components are quite important as \(\hat{\tau }_{i}^{2}\) are between 0.42 and 0.62. The value of \(\hat{\theta }\) suggests that the spatial correlation of \(\varvec{w}\) has a regional range since correlation is smaller than 0.05 at 550 km distance. Moreover, the elements of the correlation matrix \(\varvec{\hat{V}}\) can be interpreted as the cross-correlations between the pollutants after adjusting for all the covariates in the model. Consistently with expectations, it is seen that \(\hbox {O}_{3}\) is negatively correlated with all the other pollutants. Interestingly, the correlation between \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\) is particularly high (0.93). This means that the multivariate model captures relationships among pollutants even if the network is unbalanced. This is especially useful to improve the mapping of \(\hbox {PM}_{2.5}\) which has a monitoring network much sparser than \(\hbox {PM}_{10}\) and is consistent with cross-correlation results of Table 3.
3.4 Data Model Results
Considering the data model \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , i=1,\ldots ,q\), discussed in step 4 of Sect. 2.4, estimation results are presented in Fig. 3 for a subset of the pollutants. In particular, the left panels depict the data \(\left\{ \tilde{\varvec{y}}_{i}^{\left[ 2\right] },\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }\right\} \) (dots) and the contour plot of the estimated \(\hat{p}_{i}\left( y_{i},\hat{\xi }_{i}\right) \). The right panels, instead, depict the data model \(\hat{p}_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) directly derived from \(\hat{p}_{i}\left( y_{i},\hat{\xi }_{i}\right) \). Note that \(\hat{p}_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) is a univariate distribution for each value \(\hat{\xi }_{i}\), while, with an abuse of representation and only to ease the interpretation, Fig. 3 depicts \(\hat{p}_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) using a bivariate plot.
3.5 Mapping
Following step 6 of Sect. 2.4, European daily maps at \(5 \times 5\,\mathrm{km}\) spatial resolution are obtained using Eq. (12) for the five pollutants of this study. In Fig. 1 daily maps are averaged over each year for \(\hbox {PM}_{10}\), depicting the average level for 2009 and the yearly average variations in 2010 and 2011. Note that, for a given location, the concentration can widely change from year to year and this is mainly due to the “average weather” of each year. Also note that \(\hbox {PM}_{10}\) increased in northern Europe during these years, and it decreased in the south–west. In the south–east, a significant increase from 2010 to 2011 is observed.
3.6 Exposure Distributions and Bootstrap
Following steps 7 and 8 of Sect. 2.4, the daily exposure distributions of Sect. 2.1 are considered here at the yearly level, aggregated at European level, country level and for the metropolitan areas detailed in Table 2. Since they are based on daily data, these distributions take into account the seasonal variability of the pollutants and daily peaks are not averaged out.
Considering first exceedances, Table 6 reports the median and the 98th percentile for the five pollutants, at the European level, years 2009–2011. Deterioration of 98th percentiles for \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\) is apparent, while all the other pollutants have a stable pattern. Extended exceedance functions \(H_{i}\) at the country level are reported in supplementary material.
In order to compare the exposure distribution of each country, pollutant and year, the European percentiles of Table 6 are used as reference concentration levels. As an example, the resulting exceedances of the European 98th percentile for \(\hbox {PM}_{2.5}\) are reported in the table of Fig. 2, with 95 % bootstrap confidence intervals. A value higher than \(2\,\%\) is reported in red as it implies a population exposure higher than the joint European population exposure; green color is for lower percentages and white is for values around \(2\,\%\). Poland and some eastern and Balkanian countries result to have the higher exposition to high \(\hbox {PM}_{2.5}\) concentrations over the three years. Extended tabulations are reported in supplementary material for all the pollutants.
To illustrate the capability of assessing the population exposure at the metropolitan scale, Fig. 4 depicts the population exposure density, uncertainty included, with respect to \(\hbox {NO}_{2}\) in 2009 for the metropolitan areas of Berlin, Greater London and Milan province. This shows that the percentage of Milan population, which is exposed to high \(\hbox {NO}_{2}\) concentrations, is larger than the other two metropolitan areas and also larger than Europe altogether. Moreover, Fig. 5 depicts the joint population exposure density in 2009 with respect to \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\). Joint exceedance frequencies \(H_\mathrm{{PM_{10},PM_{2.5}}}\) given by Eq. (3) are reported in Table 7. This shows that the percentage of Milan population exposed to high multi-pollutant concentrations is larger than the other metropolitan areas and Europe altogether.
4 Conclusions
The assessment of the population exposure to airborne pollutants benefits from considering the joint spatio-temporal distribution of pollutants and population. In this paper we showed how the population exposure to five different pollutants around Europe can be assessed with a unified approach using a multivariate spatio-temporal model. Considering all the pollutants jointly is important because it is known that there are negative synergetic effects on health. In addition, from the statistical point of view, the multivariate approach turns out to be important when the monitoring network is unbalanced with respect to the different pollutants. Indeed, we found improved fitting using a multivariate model rather than using simpler univariate models.
Exploiting the model output we computed population exposure and exceedance at various spatial and temporal scales, ranging from daily to yearly and from relatively small metropolitan areas to the continental level. Confidence intervals and bands were provided by a bootstrap technique based on a novel semi-parametric approach which is fast and suitable for large datasets.
Population exceedance results clearly show that the air quality \(\cdot \)across Europe is still highly variable and that there exists a significant variability across time. However, this is partly due to the variability in the meteorological conditions; some years may be worst than others and some European areas are disadvantaged by local geophysical and climatic conditions.
References
Beelen R., Hoek G., Pebesma E., Vienneau D., de Hoogh K., Briggs D. (2009), Mapping of background air pollution at a fine spatial scale across the European Union, Sci. Total Environ. 407(6), 1852-1867.
Bevilacqua M., Fassò A., Gaetan C., Porcu E., Velandia D. (2016), Covariance tapering for multivariate Gaussian random fields estimation. Statistical Methods and Applications. 25(1), 21-37.
Bourotte M., Allard D., Porcu E. (2016), A flexible class of non-separable cross-covariance functions for multivariate space–time data. Spatial Statistics. doi:10.1016/j.spasta.2016.02.004
Box G.E., Cox D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B, 26(2) 211-252.
Calculli C., Fassò A., Finazzi F., Pollice A. and Turnone A. (2015), Maximum likelihood estimation of the multivariate hidden dynamic geostatistical model with application to air quality in Apulia, Italy. Environmetrics, 26(6), 406-417.
Calder C. A. (2008), A dynamic process convolution approach to modeling ambient particulate matter concentrations. Environmetrics, 19(1), 39-48.
Cameletti M., Ignaccolo R., Bande S. (2011), Comparing spatio-temporal models for particulate matter in Piemonte. Environmetrics, 22, 985-996.
Chunfeng H., Haimeng Z., Scott M.R. (2009), On the validity of commonly used covariance and variogram functions on the sphere. Mathematical Geosciences, 43, 721-733.
De Iaco S., Palma M., Posa D. (2013). Prediction of particle pollution through spatio-temporal multivariate geostatistical analysis: spatial special issue. Advances in Statistical Analysis, 97(2), 133-150.
De Oliveira V., Kedem B., Short D. A. (1997), Bayesian prediction of transformed Gaussian random fields. Journal of the American Statistical Association, 92(440), 1422-1433.
Dobson J.E., Edward A. Brlght, Coleman P.R., Durfee R, and Worley B.A. (2000), LandScan: A Global Population Database for Estimating Populations at Risk, Photogrammetric Engineering & Remote Sensing, 66(7) 849-857.
EEA (2015), Air quality in Europe - 2015 report, European Environmental Agency publications, doi:10.2800/62459.
Fassò A., Finazzi F. (2011) , Maximum likelihood estimation of the dynamic coregionalization model with heterotopic data, Environmetrics, 22(6), 735-748.
Finazzi F., Fassò A. (2014), D-STEM: A Software for the Analysis and Mapping of Environmental Space-Time Variables, Journal of Statistical Software. 62(6), 1-29.
Finazzi F., Scott M.E., Fassò A. (2013), A model based framework for air quality indices and population risk evaluation. With an application to the analysis of Scottish air quality data, Journal of the Royal Statistical Society, series C, 62(2), 287-308.
Jahn H.K., Kraemer A., Chen X.C., Chan C.Y., Engling G., Ward T.J. (2013), Ambient and personal PM2.5 exposure assessment in the Chinese megacity of Guangzhou, Atmospheric Environment, 74, 402-411.
McBride S.J., Williams R.W., Creason J. (2007), Bayesian hierarchical modeling of personal exposure to particulate matter. Atmospheric Environment, 41(29), 6143–6155.
McMillan N. J., Holland D. M., Morara M., Feng J. (2010). Combining numerical model output and particulate data using Bayesian space–time modeling. Environmetrics, 21(1), 48-65.
McLachlan G., Krishnan T. (2008), The EM Algorithm and Extensions, 2nd Edition, Wiley, New York.
Pollice A., Jona Lasinio G. (2010), A multivariate approach to the analysis of air quality in a high environmental risk area. Environmetrics, 21(7-8), 741-754.
Rister K., Lahiri S.N. (2013) , Bootstrap based Trans-Gaussian Kriging, Statistical Modelling, 13, 509–539.
Sahu S. K., Gelfand A. E., Holland D. M. (2006), Spatio-temporal modeling of fine particulate matter. Journal of Agricultural, Biological, and Environmental Statistics, 11(1), 61-86.
Secchi P., Vantini S., Vitelli V. (2015) , Analysis of spatio-temporal mobile phone data: a case study in the metropolitan area of Milan, Statistical Methods and Applications, 24, 279-300.
Shaddick G., Yan H., Salway R., Vienneau D., Kounali D. and Briggs D. (2013) , Large-scale Bayesian spatial modelling of air pollution for policy support, Journal of Applied Statistics, 40: 4,777-4,794
Wolff G. T., Kahlbaum D. F., Heuss, J. M. (2013), The vanishing ozone weekday/weekend effect. Journal of the Air & Waste Management Association, 63(3), 292-299.
Zidek J. V., Shaddick G., White R., Meloche J., Chatfield C. (2005), Using a probabilistic model (pCNEM) to estimate personal exposure to air pollution. Environmetrics, 16(5), 481-493.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Fassò, A., Finazzi, F. & Ndongo, F. European Population Exposure to Airborne Pollutants Based on a Multivariate Spatio-Temporal Model. JABES 21, 492–511 (2016). https://doi.org/10.1007/s13253-016-0260-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13253-016-0260-7