1 Introduction

Thanks to the wide development of fixed monitoring networks, human exposure to airborne pollution is often related to the so-called “ambient exposure,” namely the pollutant concentration at which people are exposed when outdoor, which is assessed by means of pollutant concentration data coming from the mentioned monitoring networks or from chemical transport computer models.

Since using the ambient exposure as personal exposure may lead to ecological fallacy, individual exposure is deserving an increasing attention in recent years. The exposure of an individual to airborne pollutants can be defined as the total amount of pollutants the person is exposed to, during a given period of time of his/her life. Clearly, this amount depends on a large number of factors, the main factors being the spatial location of the person and, conditionally on this, his/her life style and activity level. This is often called personal exposure as, in principle, it can be related to each specific person.

In the recent statistical literature, two main approaches to model personal exposure emerged. On the one side, the probabilistic approach of Zidek et al. (2005) allows to simulate individual exposure temporal profiles for a finite set of individuals and, based on this, it allows to estimate a simulated population distribution. On the other side, an empirical approach based on personal monitors has been considered by few authors. For example, McBride et al. (2007) consider a Bayesian hierarchical model to model the relationship between observed personal exposure and ambient exposure for a small sample of individuals running personal monitors. Moreover, Jahn et al. (2013) performed an extensive study in Guangzhou assessing the correlation between \(\hbox {PM}_{2.5}\) measured by ambient and personal monitors.

In principle, aggregating personal exposures brings us to the population exposure of an entire city, country or continent. Since this total personal exposure is difficult or impossible to compute in practice, Finazzi et al. (2013) and Shaddick et al. (2013) consider a simplified concept of “population exposure,” which is halfway between ambient and personal exposure and it assumes that, in a small area, ambient and personal exposure are roughly proportional. This assumption is especially critical when using yearly averages, which are used not only in Shaddick et al. (2013) and Beelen et al. (2009) , but also in European regulations, see “Air quality in Europe—2015 report” (EEA 2015). In Finazzi et al. (2013), this aggregation effect is mitigated by the use of daily data which take into account also seasonal and meteorological effects to exposure. Moreover, the latter authors consider spatio-temporal modeling to estimate the exposure burden at the country level, while the former authors consider a purely spatial model at the continental level. In this paper, we use the population exposure distribution introduced by Finazzi et al. (2013), but with a different statistical model capable to handle daily data at the European level, uncertainty included.

Spatio-temporal statistical models for air quality have been extensively developed during the last decade in the univariate case (see e.g., Sahu et al. 2006; Calder 2008 and McMillan et al. 2010). On the other side, multivariate spatio-temporal data are increasingly being used in the analysis of air quality. Pollice and Jona (2010) propose the use of a multivariate model for daily concentrations of three pollutants in the Taranto area, Italy. De Iaco et al. (2013) implement spatio-temporal cokriging based on the linear coregionalization model (LCM). Calculli et al. (2015) use the multivariate dynamic map model for insight into Saharian dust events.

In this paper, European multi-pollutant daily concentrations are estimated in high resolution using an extension of the multivariate dynamic coregionalization model (DCM) implemented in the D-STEM software (Finazzi and Fassò 2014), which covers Box–Cox data transformation, prediction and semi-parametric bootstrap, which is an extension of Rister and Lahiri (2013). Model outputs are then extensively used to compute the exposure distribution for European countries and three metropolitan areas.

Alternatives to DCM are available which cover non-separable correlation functions (e.g., Bourotte et al. 2016). Unfortunately in this case the variance covariance matrix has dimension \(ST\times ST\) and, as shown in Cameletti et al. (2011), computation time increases rapidly, making this unfeasible for our dataset. Approximations such as tapering or pairwise likelihood could be also considered. Nonetheless Bevilacqua et al. (2016) showed that the sparsity level required to make this method computationally competitive is unachievable for our application. This paper shows that DCM approach is flexible enough and a reach set of covariates can reduce modeling effort for the covariance function.

The rest of the paper is organized as follows. In Sect. 2, we introduce the concept of exposure distribution and its estimation technique based on a general spatio-temporal model coupled with a novel bootstrap technique. Section 3 presents the air quality datasets and the geographic area covered by the study. Candidate models are fitted using the EM algorithm and cross-validation is adopted for model selection. Then, exposure distribution results are reported for 33 European countries and three metropolitan areas. Joint exposure to multiple pollutants is also taken into account. A closing section draws some conclusions while extensive tabulations and graphs are reported in the supplementary material.

2 Statistical Methods

In this section we detail the statistical modeling techniques used to estimate the population exposure distribution and its uncertainty. To start with, we define the exposure distribution based on high-resolution data observed over a geographic area, which may range from a metropolitan area to the European continent. Then we summarize the multivariate modeling technique and we introduce the novel bootstrap approach, which is used in propagating the uncertainty of the model output to the estimated exposure distribution.

2.1 The Distribution of Population by Exposure

In order to define the exposure distribution, we assume that the concentrations of q pollutants are available on a fine spatial grid of pixels \(\mathbb {B}=\left\{ \mathcal {B}_{1},\ldots ,\mathcal {B}_{R}\right\} \), which covers the geographic region of interest, namely \(\mathcal {R=\cup }_{r=1}^{R}\mathcal {B}_{r}\). Let \(y_{i}\left( \mathcal {B},t\right) \) denote the average concentration of pollutant i over pixel \(\mathcal {B}\in \mathbb {B}\) at discrete time \(t\in \mathcal {T=}\left\{ 1,\ldots ,T\right\} \). Moreover, let \(n\left( \mathcal {B}\right) \) be the number of people living inside pixel \(\mathcal {B\in }\mathbb {B}\), which is supposed to be constant over time and let \(N=\sum _{r=1}^{R}n\left( \mathcal {B}_{r}\right) \) be the total population living in \(\mathcal {R}\). Note that population stationarity is a simplifying assumption and may be relaxed if data about weekly and seasonal population movements are available as in Secchi et al. (2015).

Improving over Finazzi et al. (2013) and Shaddick et al. (2013), which used the (weighted) average exposure for region \(\mathcal {R}\), we focus on the more detailed information provided by the cumulative exposure distribution for region \(\mathcal {R}\), namely

$$\begin{aligned} F_{i}\left( y,t\right) =\frac{\sum \limits _{r=1}^{R}n\left( \mathcal {B}_{r}\right) I\left( y_{i}\left( \mathcal {B}_{r},t\right) \le y\right) }{N}, \end{aligned}$$

where I(A) is the indicator function of set \(A, n(\mathcal {B})\) and N are defined above.

Hence, for a fixed \(y, F_{i}(y,t)\) gives the fraction of the population that was exposed to a concentration level lower or equal to y at time t for pollutant i. This can be integrated over time. For example, considering the time frame \(\mathcal {T}\), the (average) exposure distribution in \(\mathcal {T}\) is given by

$$\begin{aligned} F_{i}(y)=\frac{1}{T}\sum _{t=1}^{T}F_{i}(y,t). \end{aligned}$$
(1)

Note that distribution in (1) may be interpreted as a mixture distribution with equal weights and it is much more informative than the distribution of the yearly average concentration as it takes into account the daily variability of the pollutant and the extreme concentrations are not averaged out. The exposure distribution density is also defined in a natural way by \(f_{i}(y) =\frac{\mathrm{d}}{\mathrm{d}y} F_{i}(y)\), and it can be readily approximated by suitable data binning. Moreover, the exposure exceedance is given by the quantity

$$\begin{aligned} H_{i}(y) =1-F_{i}(y) , \end{aligned}$$
(2)

which gives the fraction of the population that was exposed to a pollutant concentration higher than y, averaged over time.

Multi-pollutant distributions can be suitably defined generalizing \(F_{i}(y) \) and \(H_{i}(y) \). For instance, considering two pollutants and their exposure distribution \(F_{ij}\left( y,z\right) \), the simultaneous exposure exceedance is given by the positive quadrant frequency and, in practice, \(H_{ij}\) is computed using

$$\begin{aligned} H_{ij}(y,z) =\frac{\sum \limits _{t=1}^{T}\sum \limits _{r=1}^{R}n\left( \mathcal {B}_{r}\right) I(y_{i}(\mathcal {B}_{r},t)>y) I\left( y_{j}\left( \mathcal {B}_{r},t\right) >z\right) }{TN}. \end{aligned}$$
(3)

In the case of a general q-dimensional pollutant, the corresponding positive orthant frequency \(H\left( y_{1},\ldots ,y_{q}\right) \) is suitably defined.

Since the concentration levels \(y_{i}(\mathcal {B},t)\) are usually unknown in practice, the next section summarizes a statistical modeling approach for estimating \(y_{i}(\mathcal {B},t)\) starting from data observed on a irregularly spaced and heterogeneous monitoring network. In the rest of the paper, pixels \(\mathcal {B}\) are supposed to be small and the following approximation is considered

$$\begin{aligned} \hat{y}_{i}(\mathcal {B},t)=\frac{1}{\left| \mathcal {B}\right| }\int _{\mathcal {B}}\hat{y}_{i}(\varvec{s},t)\mathrm{d}\varvec{s}\cong \hat{y}_{i}(\varvec{s}_{r},t), \end{aligned}$$
(4)

where \(\varvec{s}_{r}\) is the center of \(\mathcal {B}\) and \(\left| \mathcal {B}\right| \) is its area.

2.2 Dynamic Coregionalization Model

Mapping multi-pollutant concentrations at the continental scale requires to adopt a statistical model able to handle the complexity of the pollutant phenomenon in an efficient way. From the data point of view, the model must accommodate for an unbalanced monitoring network and extensive missing data. In particular, different pollutants are often observed at non-co-located spatial locations. Moreover, some pollutants are known to be correlated with others and mapping may benefit from a model able to describe the spatio-temporal correlation across pollutants. Hence the aim is to have a model which is both flexible and easy to be estimated for a large dataset.

In order to estimate \(y_{i}(\mathcal {B},t)\) of the previous section, a hierarchical multivariate spatio-temporal model is considered here. To this end, let \(\varvec{y}(\varvec{s},t)\) be the q-dimensional vector of the pollutant concentrations \(y_{1},\ldots ,y_{q}\) observed at spatial location \(\varvec{s}\in \mathcal {R}\) and time \(t\in \mathcal {T}\). Since environmental variables are commonly positively skewed, similarly to Oliveira et al. (1997), we use Box–Cox transformation \(\xi _{i}=g_{i}(y_{i})=g\left( y_{i},\lambda _{i}\right) \) with parameter \(\lambda _{i}\), and assume that \(\varvec{\xi }=\left( \xi _{1},\ldots ,\xi _{q}\right) ^{\prime }\) is a q-dimensional Gaussian spatio-temporal process as defined in Finazzi and Fassò (2014), Sect. 5. In particular, \(\xi _{i}\) are supposed to be locally linearly related to some covariates \(\varvec{x}_{i}\) and the following DCM is used

$$\begin{aligned} \xi _{i}\left( \varvec{s},t\right)&=\mu _{i}\left( \varvec{s},t\right) +\varepsilon _{i}\left( \varvec{s},t\right) , \end{aligned}$$
(5)
$$\begin{aligned} \mu _{i}\left( \varvec{s},t\right)&=\alpha _{i}\left( \varvec{s},t\right) +\varvec{x}_{i}\left( \varvec{s},t\right) ^{\prime }\varvec{\beta }_{i}\left( t\right) . \end{aligned}$$
(6)

In the observation Eq. (5), the measurement errors \(\varepsilon _{i}\) are Gaussian distributed with zero mean and variance \(\sigma _{\varepsilon _{i}}^{2}\), uncorrelated over space, time and pollutants. In the trend Eq. (6), \(\alpha _{i}\) is a spatio-temporal stochastic intercept, \(\varvec{x}_{i}\) is a \(b_{i}\)-dimensional vector of covariates with stochastic coefficient vector \(\varvec{\beta }_{i}=\left( \beta _{i1},\ldots ,\beta _{i,b_{i}}\right) ^{\prime }\). The stochastic structure of \(\alpha \) and \(\varvec{\beta }\) is given by the following models

$$\begin{aligned} \alpha _{i}\left( \varvec{s},t\right)&=\alpha _{i}^{0}+z_{i0}\left( t\right) +\tau _{i}w_{i}\left( \varvec{s},t\right) , \end{aligned}$$
(7)
$$\begin{aligned} \beta _{ij}\left( t\right)&=\beta _{ij}^{0}+z_{ij}\left( t\right) , \end{aligned}$$
(8)

where \(j=1,\ldots ,b_{i}, \alpha _{i}^{0}\) and \(\beta _{ij}^{0}\) are the fixed effects while \(z_{ij}\) are the temporal random effects such that \(\varvec{z}(t)=\!\left( z_{10}\left( t\right) ,\ldots ,z_{q0}\left( t\right) ,z_{11}\left( t\right) ,\ldots ,z_{1b_{1}}\left( t\right) ,\ldots ,z_{q1}\left( t\right) ,\ldots ,z_{qb_{q}}\right. \left. \left( t\right) \right) ^{\prime }\) is a stochastic vector with smooth dynamics given by

$$\begin{aligned} \varvec{z}(t)=\varvec{z}(t-1)+\varvec{\eta }(t), \end{aligned}$$
(9)

with \(\varvec{\eta }(t)\sim N\left( \varvec{0},\varvec{\Sigma }_{\varvec{\eta }}\right) \) and \(\varvec{\Sigma }_{\varvec{\eta }}\) a diagonal variance-covariance matrix.

In (7), \(\varvec{w}\left( \varvec{s},t\right) =\left( w_{1}\left( \varvec{s},t\right) ,\ldots ,w_{q}\left( \varvec{s},t\right) \right) ^{\prime }\) is a q-dimensional Gaussian process, uncorrelated over time but correlated over space. In particular, \(\varvec{w}\left( \varvec{s},t\right) \) is a so-called LCM with the following spatial variance-covariance matrix function

$$\begin{aligned} \varvec{\Gamma }=\varvec{V}\rho \left( \left\| \varvec{s}-\varvec{s}^{\prime }\right\| ;\theta ,\nu \right) , \end{aligned}$$
(10)

where \(\varvec{V}\) is a valid correlation matrix and \(\rho \left( \left\| \varvec{s}-\varvec{s}^{\prime }\right\| ;\theta ,\nu \right) \) is the Matérn correlation function, with \(\left\| \varvec{s}-\varvec{s}^{\prime }\right\| \) the distance between spatial locations \(\varvec{s}\) and \(\varvec{s}^{\prime }, \theta \) the range parameter and \(\nu >0\) the shape parameter.

The model parameter set is

$$\begin{aligned} \Psi =\left\{ \varvec{\alpha },\varvec{\beta },\varvec{\sigma }_{\varvec{\varepsilon }}^{2},\varvec{\tau },\varvec{\sigma }_{\varvec{\eta }}^{2},\theta ,\varvec{v}\right\} , \end{aligned}$$
(11)

where \(\varvec{\alpha }=\left( \alpha _{1}^{0},\ldots ,\alpha _{q}^{0}\right) ^{\prime }, \varvec{\beta }=\left( \left( \varvec{\beta }_{1}^{0}\right) ^{\prime },\ldots ,\left( \varvec{\beta }_{q}^{0}\right) ^{\prime }\right) ^{\prime }, \varvec{\beta }_{i}^{0}=\left( \beta _{i1}^{0},\ldots ,\beta _{ib_{i}}^{0}\right) ^{\prime }, \varvec{\sigma }_{\varvec{\varepsilon }}^{2}=\left( \sigma _{\varepsilon _{1}}^{2},\ldots ,\sigma _{\varepsilon _{q}}^{2}\right) ^{\prime }, \varvec{\tau }=\left( \tau _{1},\ldots ,\tau _{q}\right) ^{\prime }, \varvec{\sigma }_{\varvec{\eta }}^{2}=\mathrm{diag}\left( \varvec{\Sigma }_{\varvec{\eta }}\right) \) and \(\varvec{v}\) is the \(q\left( q-1\right) /2\) dimensional vector obtained by stacking the unique and non-diagonal elements of the symmetric matrix \(\varvec{V}\). Note that, in practice, we may have \(\alpha _{i}^{0}\equiv 0\) and/or \(\beta _{ij}^{0}\equiv 0\ \)and/or \(z_{ij}\equiv 0\), for some ij. Hence the vectors \(\varvec{\alpha },\varvec{\beta ,z}(t)\) and \(\varvec{\sigma }_{\varvec{\eta }}^{2}\) include only non-zero elements.

2.3 Prediction and Bootstrap

Let \(\varvec{Y}=\left\{ \varvec{y}_{1},\ldots ,\varvec{y}_{q}\right\} \) be the dataset of pollutant concentrations. In particular, each \(\varvec{y}_{i}\) is the \(n_{i}T-\)dimensional vector of the i-th pollutant concentrations at spatial locations \(\mathbb {D}_{i}\mathbb {=}\left\{ \varvec{s}_{1},\ldots ,\varvec{s}_{n_{i}}\right\} , i=1,\ldots ,q\). The vectors \(\varvec{y}_{i}\) may include missing values and, in general, the sets \(\mathbb {D}_{i}\) may be different. This means that the q pollutants are not necessarily observed at the same spatial locations (see Fassò and Finazzi 2011).

The minimum mean square error spatial prediction of \(y_{i}=y_{i}\left( \varvec{s},t\right) \) at point \(\varvec{s}\notin \mathbb {D}_{i}\) is given by \(\bar{y}_{i}=E\left( y_{i}|\tilde{\varvec{Y}}\right) \), where \(\tilde{\varvec{Y}}=\left\{ \tilde{\varvec{y}}_{1},\ldots ,\tilde{\varvec{y}}_{q}\right\} \) is the dataset restricted to the non-missing data. On the other side, using the plug-in approach, DCM gives the natural estimator for \(\xi _{i}\), namely \(\hat{\xi }_{i}=E\left( \xi _{i}|\tilde{\varvec{Y}}\right) \). Hence we approximate \(\bar{y}_{i}\) using the back-transformed \(\hat{\xi }_{i}\) with bias correction given by the well known delta method, namely

$$\begin{aligned} \hat{y}_{i}=g_{i}^{-1}\left( \hat{\xi }_{i}\right) +\frac{1}{2}\sigma _{\hat{\xi }_{i}}^{2}\left[ g_{i}^{-1}\left( \hat{\xi }_{i}\right) \right] ^{\prime \prime }, \end{aligned}$$
(12)

where \(g_{i}^{-1}\) is the Box–Cox back-transformation function, \(\sigma _{\hat{\xi }_{i}}^{2}={\text {Var}}\left( \xi _{i}|\tilde{\varvec{Y}}\right) \) and \(\left[ \cdot \right] ^{\prime \prime }\) is the second derivative.

In order to assess the uncertainty of the population exposure distributions and exceedances defined in Sect. 2.1, we aim at estimating the uncertainty related to \(\hat{y}_{i}\). To this end, it is worth noting that \(p\left( \xi _{i}\mid \tilde{\varvec{Y}}\right) =N\left( \hat{\xi }_{i},\sigma _{\hat{\xi }_{i}}^{2}\right) \). It is thus straightforward to draw a sample from \(N\left( \hat{\xi }_{i},\sigma _{\hat{\xi }_{i}}^{2}\right) \) and to back-transform the sample using (12) in order to assess the uncertainty of \(\hat{y}_{i}\). This approach, however, is not robust against misspecification of DCM and Box–Cox transformation. Additionally, it does not reflect the uncertainty on the estimated model parameters, which are plugged-in when \(\hat{\xi }_{i}\) and \(\sigma _{\hat{\xi }_{i}}^{2}\) are computed.

In order to obtain a robust assessment of the uncertainty on \(\hat{y}_{i}\), we develop a semi-parametric approach based on the conditional distributions \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , \hat{\xi }_{i}\in \mathbb {R}\), which are estimated using the DCM output. In order to estimate \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) we assume that the following conditional independence and ergodicity conditions hold true.

Condition 1

For each \(i=1,\ldots ,q\), the conditional process \(\left( y_{i}\left( \varvec{s},t\right) |\hat{\xi }_{i}\left( \varvec{s},t\right) \right) , \varvec{s}\in \mathcal {R}, t\in \mathbb {N} ^{+}\), is stationary, white noise over space and time, with known distribution \(p_{i}\left( y_{i}|\xi _{i}\right) \).

Condition 2

For each \(i=1,\ldots ,q\), the bivariate distribution \(p_{i}\left( y_{i},\hat{\xi }_{i}\right) \) is described by a bivariate Gaussian mixture of \(m_{i}\) components.

Condition 3

For each \(i=1,\ldots ,q\), let \(\varvec{x}_{i}\left( \varvec{s},t\right) \) be the vector of covariates at spatial location \(\varvec{s}\). There exists a stochastic process \(\left( y_{i}\left( \varvec{s},t\right) ,\varvec{x}_{i}\left( \varvec{s},t\right) \right) \) which is ergodic for \(p_{i}\left( y_{i},\hat{\xi }_{i}\right) \). In other words, the bivariate distributions \(p_{i}\left( y_{i},\hat{\xi }_{i}\right) \), and thus \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \), can be consistently estimated using a realization \(\left( y_{i}\left( \varvec{s},t\right) ,\hat{\xi }_{i}\left( \varvec{s},t\right) ,\varvec{x}_{i}\left( \varvec{s},t\right) \right) \) for \(T\rightarrow \infty \).

Condition 1 amounts to say that, conditionally on \(\xi _{i}\), the pollutant concentration \(y_{i}\) is i.i.d. over space and time. Indeed, Condition 1 is the basis for bootstrap resampling and it is far less restrictive than bootstrap assumptions of both Rister and Lahiri (2013) and Finazzi et al. (2013). Condition 2 is made for convenience and it is not restrictive due to the universal approximator nature and flexibility of Gaussian mixtures. Eventually, Condition 3 allows to have consistent estimates of the distributions \(p_{1},\ldots ,p_{q}\) based on the observed data and DCM estimates \(\hat{\xi }_{1},\ldots ,\hat{\xi }_{q}\). In the sequel, the set of distributions \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) ,i=1,\ldots ,q\) is called data model.

2.4 Model Estimation and Selection

The overall procedure is described in this section using eight steps. Preliminarily, in step 1, the Box–Cox parameter set \(\varvec{\lambda }\) is estimated and data are transformed accordingly, then in step 2, after a preliminary analysis of transformed data, a pool of candidate DCMs for model selection is defined. For each candidate DCM, conditionally on \(\varvec{\lambda }\) the parameter set \(\Psi \) is estimated in step 3 and the data model \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , i=1,\ldots ,q\), is estimated in step 4. In step 5, model selection is performed. Applying plug-in approach to the selected model, in steps 6, daily maps required for the population exposure are estimated, in step 7, the multivariate exposure distribution is computed and, finally, assessment of uncertainty is performed in step 8.

Since data model is intended to be robust against DCM and Box–Cox transform misspecification, a two-stage procedure based on data splitting is used. Moreover, cross-validation is performed for model selection. To do this, each monitoring network \(\mathbb {D}_{i}\) is partitioned in three subsets, say \(\mathbb {D}_{i}^{[1]},\mathbb {D}_{i}^{[2]}\) and \(\mathbb {D}_{i}^{[3]}\), and dataset \(\tilde{\varvec{y}}_{i}\) is partitioned accordingly, that is \(\tilde{\varvec{y}}_{i}^{[h]}=\left( \tilde{y}_{i}\left( \varvec{s},t\right) ,\varvec{s}\in \mathbb {D}_{i}^{[h]},t\in \mathcal {T}\right) \). The estimation of \(\Psi \) and \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) is based on the data \(\tilde{\varvec{y}}_{i}^{\left[ 1\right] }\) and \(\tilde{\varvec{y}}_{i}^{\left[ 2\right] }\), respectively, while cross-validation is performed using \(\tilde{\varvec{y}}_{i}^{\left[ 3\right] }\). We now discuss in more details the above steps.

  1. (1)

    Box–Cox transform estimation

    Each element of the parameter set \(\varvec{\lambda }=\left( \lambda _{1},\ldots ,\lambda _{q}\right) ^{\prime }\) is estimated using the maximum likelihood approach as in Box and Cox (1964), applied separately to data \(\tilde{\varvec{y}}_{i}^{[1] },i=1,\ldots ,q\). In order to keep computational time feasible, we use here an iid distribution which ignores cross, spatial and temporal correlation. Ignoring autocorrelation is known to give suboptimal estimates and estimated standard deviations of \(\hat{\lambda }_{i}\) obtained by above likelihood are known to be biased. Nonetheless, the objective of this paper is exposure and \(\varvec{\lambda }\) may be considered as a nuisance parameter. Hence, in view of the large dataset used, we consider above \(\varvec{\lambda }\) as a good estimate for our purposes but we avoid to compute its standard deviations in the application section.

  2. (2)

    Preliminary analysis

    After a preliminary analysis of \(\xi \) transform, based on correlations, spatial regressions and simplified DCMs, a set of candidate DCMs is defined. On the one side this step has to consider the relevant model components and covariates but on the other side, considering computational burden, a large full factorial model search has to be avoided.

  3. (3)

    DCM estimation

    For each DCM of the previous step, conditionally on \(\varvec{\hat{\lambda }}\), estimation of the model parameter set \(\Psi \) is based on the EM algorithm and is implemented by means of the D-STEM software applied to data \(g_{i}\left( \tilde{\varvec{y}}_{i}^{[1]}\right) , i=1,\ldots ,q\), giving \(\hat{\Psi }^{\left[ 1\right] }\), say. This is a computer intensive task for daily data at continental level as discussed in the opening of Sect. 3.2.

  4. (4)

    Data model estimation

    The semi-parametric estimation of the data model \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , i=1,\ldots ,q\), is composed of the following two sub-steps:

    1. (a)

      Kriging. Using \(\hat{\Psi }^{\left[ 1\right] }, \xi _{i}\) are estimated at spatial locations \(\mathbb {D}_{i}^{[2]}\):

      $$\begin{aligned} \hat{\xi }_{i}^{\left[ 2\right] }=E_{\hat{\Psi }^{\left[ 1\right] }}\left( \xi _{i}^{\left[ 2\right] }\mid \tilde{\varvec{y}}^{\left[ 1\right] }\right) , \end{aligned}$$
      (13)

      where \(E_{\hat{\Psi }^{\left[ 1\right] }}\) is the expectation computed using the DCM with parameter set \(\hat{\Psi }^{\left[ 1\right] }\). Note that the estimates \(\hat{\xi }_{i}^{\left[ 2\right] }\) are obtained using the information from all the pollutants, exploiting their spatio-temporal cross-correlation and are stacked in the vector \(\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }=\left( \hat{\xi }_{i}^{\left[ 2\right] }\left( \varvec{s},t\right) ,\varvec{s}\in \mathbb {D}_{i}^{[2]},t\in \mathcal {T}\right) \) for convenience.

    2. (b)

      Mixture estimation.      For each \(i=1,\ldots ,q\), data \(\left( \tilde{\varvec{y}}_{i}^{\left[ 2\right] },\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }\right) \) are used to estimate the bivariate mixtures distributions \(p_{i}\) using the standard EM algorithm for Gaussian mixtures with \(m_{i}\) components (see e.g., McLachlan and Krishnan 2008). Hence, conditional distributions \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) are obtained straightforwardly.

  5. (5)

    Model selection

    Given a set of candidate DCMs, unbiasedness and mean square error optimality are of interest. Hence, using \(\tilde{\varvec{y}}_{i}^{\left[ 3\right] }\), errors \(\hat{e}_{i}\left( \varvec{s},t\right) =\tilde{y}_{i}\left( \varvec{s},t\right) -\hat{y}_{i}\left( \varvec{s},t\right) , \varvec{s}\in \mathbb {D}_{i}^{[3]}\) are computed and the DCMs are compared in terms of bias, root mean squared error (RMSE) and \(R^{2}\) statistic, where, for each pollutant, \(R^{2}=1-\mathrm{MSE}/\mathrm{Var}(y)\). The DCM with the highest \(R^{2}\) is retained.

  6. (6)

    Mapping

    In order to map the pollutant concentration over space and time at high resolution, small pixels \(\mathcal {B}_{r}\in \mathbb {B}\) with centers \(\varvec{s}_{r}\) are considered. The actual average pollutant concentration over pixel \(\mathcal {B}_{r}\) is estimated by \(\hat{y}_{i}(\varvec{s}_{r},t)\) in Eq. (12).

  7. (7)

    Estimation of exposure distributions

    The definitions and methods of Sect. 2.1 are applied to \(\hat{y}_{i}(\varvec{s}_{r},t)\), giving the univariate and multivariate estimated exposure distributions, say \(\hat{F}_{i}, \hat{H}_{i}\) and \(\hat{f}_{i}\).

  8. (8)

    Uncertainty of exposure distributions

    To evaluate the uncertainty of quantities such as \(\hat{H}_{i}\) and \(\hat{f}_{i}\), a bootstrap approach is adopted and confidence bands of \(f_{i}\) and \(H_{i}\) are provided. In particular, \(B=200\) bootstrap replicates of the daily average pollutant concentration maps are generated by randomly sampling the pollutant concentration of pixel \(\mathcal {B}_{r}\) and time t from \(p_{i}\left( y_{i}|\hat{\xi }_{i}(\varvec{s}_{r},t)\right) \). This gives the maps \(\varvec{y}_{b}^{*}=\left( y_{b}^{*}\left( s_{r},t\right) ,t\in \mathcal {T}\right) , r=1,\ldots ,R, b=1,\ldots ,B\). Step 7 applied to each replica \(\varvec{y}_{b}^{*}\) gives the bootstrap sample \(H_{i,b}^{*}\) from which confidence bands for \(H_{i}\) are immediately computed.

3 European Case Study

3.1 Data Description

Air quality data from the AirBase database of the European Environment Agency are considered. The geographic area is restricted to the box (\(35{{}^\circ }N, 67{{}^\circ }N, 11{{}^\circ }W, 35{{}^\circ }E\)) of Fig. 1, which includes in full the 33 European countries listed in the table of Fig. 2 and overall time frame covers years 2009–2011. With some abuse of language, terms like European level or European average will be used for the union of the 33 countries.

Fig. 1
figure 1

Yearly average \(\hbox {PM}_{10}\) concentration [\(\mu gm^{-3}\)]. Top-left level of year 2009. Top right difference between 2010 and 2009. Bottom-left difference between 2011 and 2010. Bottom right difference between 2011 and 2009 (Color figure online).

Fig. 2
figure 2

Percentage above the \(\hbox {PM}_{2.5}\) European 98th percentile by country. H is the exceedance percentage; LCL-UCL is the \(95\,\%\) bootstrap confidence interval. Green is below the European percentage, red is above and white is intermediate (Color figure online).

The daily average concentrations, in \(\mathrm{\mu g/m^{3}}\), for the following five relevant airborne pollutants are used: carbon monoxide (CO), nitrogen dioxide (\(\hbox {NO}_{2}\)), ozone (\(\hbox {O}_{3}\)), particulate matter not larger than \(10\,\mathrm{\mu m}\) (\(\hbox {PM}_{10}\)) and \(2.5\,\mathrm{\mu m}\) (\(\hbox {PM}_{2.5}\)). Only background-type stations are taken into account because traffic-type and industrial-type stations are subject to heavy preferential sampling and are known to overestimate the pixel average, which is relevant for the population exposure assessment at daily scale. Table 1 shows that, overall, the monitoring network is unbalanced since the number of sensors measuring each pollutant is highly variable. Moreover, the missing data rate is quite substantial and averts the use of standard methods not suited to handle missing data. Table 1 also reports the parameters \(\lambda _{i}\) of the Box–Cox transformation, which are estimated using data \(\tilde{\varvec{y}}_{i}^{[1]}\) as in step 1 of Sect. 2.4. Both the need and the effect of the Box–Cox transformation are clear from skewness and kurtosis of Table 1.

Population count, land elevation, and meteorology data are considered as covariates. Population data are obtained from the LandScan database (Dobson et al. 2000) which provides the 24 h average population spatial distribution globally with spatial resolution 1 km. This dataset is constant over the year, but the 24 h average takes into account the intra-day population mobility. Variations across the three years of the case study are considered negligible. Population data are used for the exposure distribution computation and as a fundamental covariate, being proxy of local emissions averaged over time. Other candidate covariates for the model are land elevation, taken from the GTOPO30 dataset, and meteorological variables including wind speed, pressure and temperature, taken from the ERA Interim dataset (ECMWF Interim Re-Analysis).

Table 1 Dataset description with respect to the five pollutants. Skewness and kurtosis are given before (y) and after (\(\xi \)) the Box–Cox transformation with estimated \(\lambda \).

A special focus is dedicated to the three metropolitan areas of Berlin, Greater London and Milan province (see Table 2). The three areas are characterized by a major source of pollution related to car traffic and house heating. Nonetheless, they are different in population size, population density, transportation and, more importantly, in atmospheric circulation regime. In particular, Berlin and Greater London have a higher population density but they are subject to oceanic air circulation. Viceversa, the Milan province has a lower population density but a higher vehicle rate, and it is surrounded by the c-shaped mountain range composed by the Alps and the Appennines, which prevents major air circulation and leads to pollutant accumulation in the lower troposphere.

Table 2 Metropolitan areas characteristics: population (thousands); area (\(\mathrm{km^2}\)); density (inhabitants/\(\mathrm{km^2}\)) and Motor Vehicles per 1000 inhabitants at national level.

3.2 DCM for European Data

Model estimation and selection are performed following the strategy defined by the eight steps of Sect. 2.4 which are referred to, in the sequel. Each set \(\mathbb {D}_{i}, i=1,\ldots ,q\), is randomly partitioned in such a way that \(\mathbb {D}_{i}^{[1]}, \mathbb {D}_{i}^{[2]}\) and \(\mathbb {D}_{i}^{[3]}\) include 2 / 3, 1 / 6 and 1 / 6 of the total number of sensors, respectively.

A preliminary (step 2) analysis, applied to standardized data \(\mathbf {\xi }_{i}^{[1]}\) based on simple regressions and DCM univariate models gave the following results:

  1. (1)

    Selected covariates include meteorological variables at lags 0, 1 and 2, land elevation, log-transformed population count and two dummies identifying Saturday and Sunday effects. For each pollutant, covariates have been pruned using p values. In particular, due to the large number of observations and degrees of freedom, pruning is done implementing t tests with p value \(\ge 3\,\times \,10^{-6}\).

  2. (2)

    Random temporal components \(z_{ij}\left( t\right) \) of Eqs. (7) and (8) have been selected if empirical variance contribution was larger than 0.01. As a result only \(z_{i0}, i=1,\ldots ,q\) have been retained. This means that only the intercept in DCM needs to be time varying.

  3. (3)

    The exponential covariance function, \(\rho (\Vert \varvec{s}-\varvec{s}^{\prime }\Vert ;\theta _{i},\nu =\frac{1}{2}),\) resulted the best in the restricted Matérn class \(0<\nu _{i}\le \frac{1}{2}\) \(i=1,\ldots ,q\), according to cross-validation results. Note that the restriction \(\nu \le \frac{1}{2}\) is used to assure a positive definite covariance function on a sphere or a part of it, such as the European domain, which can be hardly projected on the plane, see Chunfeng et al. (2009). Indeed in this case study, we use the geodetic distance for \(\Vert \varvec{s}-\varvec{s}^{\prime }\Vert \).

Going on with step 2, we formulate a candidate model class where each one-dimensional equation is given by

$$\begin{aligned} \xi _{i}\left( \varvec{s},t\right)&=\alpha _{i}^{0}+z_{i0}\left( t\right) +\tau _{i}w_{i}\left( \varvec{s},t\right) +\varvec{x}_{i}\left( \varvec{s},t\right) ^{\prime }\varvec{\beta }_{i}^{0}+\varepsilon _{i}\left( \varvec{s},t\right) \nonumber \\ z_{i0}\left( t\right)&=z_{i0}\left( t-1\right) +\eta _{i}(t), \end{aligned}$$
(14)

which is a particular case of DCM model in Sect. 2.2 and \(\xi \) are standardized variables. Using Eq. (14) , we define various univariate DCM candidates with independent \(w_{i}\) components, and one five-variate DCM candidate with \(w_{i}\) given by the LCM of Sect. 2.2. The first candidate (M1a) is a set of univariate models obtained by Eq. (14), deleting both random effects \(z_{i0}\) and \(w_{i}\), that is spatial regression. The second candidate (M1b) is given by Eq. (14) deleting only w, while the third candidate (M1c) is the fully saturated univariate case obtained by Eq. (14) with all its terms. The fourth candidate (M5) is the fully saturated five-variate model corresponding to M1c. Note that M1a and M1b are nested in M1c, which is quasi-nested in M5 since, in the multivariate model, restriction \(\theta _{1}=\cdots =\theta _{5}\) is applied.

Moving to step 3, estimation of M5 is built from the estimated univariate models. In particular to ease the convergence of the EM algorithm, initial values of M5 are taken as the estimates of M1c with the spatial correlation range parameter \(\theta \) initialized to the average of \(\hat{\theta }_{1},\ldots ,\hat{\theta }_{5}\). Now the dataset is \(\varvec{Y}^{[1]}=\left\{ \varvec{y}_{1}^{[1]},\ldots ,\varvec{y}_{5}^{[1]}\right\} \), which includes a total of \(S=4636\) sensors and \(T=1095\) days. The number S determines the size of the largest variance-covariance matrix involved in the EM algorithm, which, in the case of no-missing data, must solve a linear system \(4636\times 4636\) for each day and each EM iteration. To face this computational burden, model estimation is performed on a cluster of five server machines, for a total of 48 CPU cores and 192 GB of RAM. We defer further discussion of steps 3 and 4 to Sects. 3.3 and 3.4.

The step 5 cross-validation results, based on data \(\tilde{\varvec{y}}_{i}^{[3]}\), are reported in Table 3. Note that \(R^{2}\) statistic increases markedly, when the terms \(z_{i0}\) and \(w_{i}\) are successively added to M1a, as they capture the pollutant space-time variability which is not explained by the covariates. Finally, note that, for all the pollutants, \(R^{2}\) increases when moving from the univariate models M1c to the multivariate model M5. The largest \(R^{2}\) increase is obtained by \(\hbox {PM}_{2.5}\) for the reasons discussed in the next section.

Table 3 Cross-validation results for the nested univariate models and the 5-variate model.
Table 4 Fixed effect coefficients \(\hat{\beta }^0_{ij}\) with standard deviations in brackets for the five-variate model M5.
Table 5 Random effect parameters with standard deviations in brackets for the five-variate model M5.

3.3 DCM Results

The estimated parameters of the multivariate model M5 are reported in Tables 4 and 5, with approximated standard deviations in brackets. In particular, Table 4 reports the fixed effect coefficients \(\hat{\beta }_{ij}^{0}\). As expected, an increase in the wind speed implies a lower pollutant concentration, with the exception of ozone. In fact, ozone is a photochemical pollutant derived mainly by \(\hbox {NO}_{X}\) and affected by transportation. As a consequence it exhibits also negative correlation with all the other pollutants and an increase during weekend [see Wolff et al. (2013) and references therein]. Pressure and temperature show complex dynamical transfer functions as the coefficients at different lags often differ in their sign. Eventually, land elevation, population density and weekend have coefficient signs corresponding to the expected effect on pollutant concentrations.

Table 5 reports spatial and temporal random effect parameters. In order to understand vector \(\varvec{\hat{\tau }}\) remember that data \(\xi _{i}\) have unit variance and \(\hat{\tau }_{i}^{2}\) may be interpreted as the variances of the spatial latent components \(w_{i}\). Hence, we see that such components are quite important as \(\hat{\tau }_{i}^{2}\) are between 0.42 and 0.62. The value of \(\hat{\theta }\) suggests that the spatial correlation of \(\varvec{w}\) has a regional range since correlation is smaller than 0.05 at 550 km distance. Moreover, the elements of the correlation matrix \(\varvec{\hat{V}}\) can be interpreted as the cross-correlations between the pollutants after adjusting for all the covariates in the model. Consistently with expectations, it is seen that \(\hbox {O}_{3}\) is negatively correlated with all the other pollutants. Interestingly, the correlation between \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\) is particularly high (0.93). This means that the multivariate model captures relationships among pollutants even if the network is unbalanced. This is especially useful to improve the mapping of \(\hbox {PM}_{2.5}\) which has a monitoring network much sparser than \(\hbox {PM}_{10}\) and is consistent with cross-correlation results of Table 3.

Fig. 3
figure 3

Data model estimation for \(\hbox {NO}_{2}\) (top panels) pollutant, \(\hbox {PM}_{10}\) (middle pnales) and \(\hbox {PM}_{2.5}\) (bottom panels). The left panels show the dataset \(\left\{ \varvec{Y}_{i}^{\left[ 2\right] },\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }\right\} \) (dots) and the contour plot of the estimated joint distribution \(\hat{p}_{i}\left( y_{i},\xi _{i}\right) \). The right panels show the contour plot of the data model \(\hat{p}_{i}\left( y_{i}\mid \xi _{i}\right) \) (Color figure online).

3.4 Data Model Results

Considering the data model \(p_{i}\left( y_{i}|\hat{\xi }_{i}\right) , i=1,\ldots ,q\), discussed in step 4 of Sect. 2.4, estimation results are presented in Fig. 3 for a subset of the pollutants. In particular, the left panels depict the data \(\left\{ \tilde{\varvec{y}}_{i}^{\left[ 2\right] },\varvec{\hat{\xi }}_{i}^{\left[ 2\right] }\right\} \) (dots) and the contour plot of the estimated \(\hat{p}_{i}\left( y_{i},\hat{\xi }_{i}\right) \). The right panels, instead, depict the data model \(\hat{p}_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) directly derived from \(\hat{p}_{i}\left( y_{i},\hat{\xi }_{i}\right) \). Note that \(\hat{p}_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) is a univariate distribution for each value \(\hat{\xi }_{i}\), while, with an abuse of representation and only to ease the interpretation, Fig. 3 depicts \(\hat{p}_{i}\left( y_{i}|\hat{\xi }_{i}\right) \) using a bivariate plot.

3.5 Mapping

Following step 6 of Sect. 2.4, European daily maps at \(5 \times 5\,\mathrm{km}\) spatial resolution are obtained using Eq. (12) for the five pollutants of this study. In Fig. 1 daily maps are averaged over each year for \(\hbox {PM}_{10}\), depicting the average level for 2009 and the yearly average variations in 2010 and 2011. Note that, for a given location, the concentration can widely change from year to year and this is mainly due to the “average weather” of each year. Also note that \(\hbox {PM}_{10}\) increased in northern Europe during these years, and it decreased in the south–west. In the south–east, a significant increase from 2010 to 2011 is observed.

3.6 Exposure Distributions and Bootstrap

Following steps 7 and 8 of Sect. 2.4, the daily exposure distributions of Sect. 2.1 are considered here at the yearly level, aggregated at European level, country level and for the metropolitan areas detailed in Table 2. Since they are based on daily data, these distributions take into account the seasonal variability of the pollutants and daily peaks are not averaged out.

Table 6 Median and 98th percentile of the European exposure distribution in years 2009–2011 (in \(\mathrm{\mu gm^{-3}}\)).

Considering first exceedances, Table 6 reports the median and the 98th percentile for the five pollutants, at the European level, years 2009–2011. Deterioration of 98th percentiles for \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\) is apparent, while all the other pollutants have a stable pattern. Extended exceedance functions \(H_{i}\) at the country level are reported in supplementary material.

In order to compare the exposure distribution of each country, pollutant and year, the European percentiles of Table 6 are used as reference concentration levels. As an example, the resulting exceedances of the European 98th percentile for \(\hbox {PM}_{2.5}\) are reported in the table of Fig. 2, with 95 % bootstrap confidence intervals. A value higher than \(2\,\%\) is reported in red as it implies a population exposure higher than the joint European population exposure; green color is for lower percentages and white is for values around \(2\,\%\). Poland and some eastern and Balkanian countries result to have the higher exposition to high \(\hbox {PM}_{2.5}\) concentrations over the three years. Extended tabulations are reported in supplementary material for all the pollutants.

Fig. 4
figure 4

Population exposure density for \(\hbox {NO}_{2}\) in 2009 evaluated for Berlin, Greater London and Milan Province. Shaded areas: \(95\,\%\) bootstrap confidence bands. The 50th and 98th percentile vertical bars refer to European values of Table 6 (Color figure online).

Fig. 5
figure 5

Contour plots of the joint population exposure density for \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\) evaluated for Berlin (top-left), Greater London (top-right) and Milan province (bottom) in year 2009. The thick black lines are the 98th percentiles of the joint European exposure distribution in 2009 reported in Table 6 (Color figure online).

To illustrate the capability of assessing the population exposure at the metropolitan scale, Fig. 4 depicts the population exposure density, uncertainty included, with respect to \(\hbox {NO}_{2}\) in 2009 for the metropolitan areas of Berlin, Greater London and Milan province. This shows that the percentage of Milan population, which is exposed to high \(\hbox {NO}_{2}\) concentrations, is larger than the other two metropolitan areas and also larger than Europe altogether. Moreover, Fig. 5 depicts the joint population exposure density in 2009 with respect to \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\). Joint exceedance frequencies \(H_\mathrm{{PM_{10},PM_{2.5}}}\) given by Eq. (3) are reported in Table 7. This shows that the percentage of Milan population exposed to high multi-pollutant concentrations is larger than the other metropolitan areas and Europe altogether.

4 Conclusions

The assessment of the population exposure to airborne pollutants benefits from considering the joint spatio-temporal distribution of pollutants and population. In this paper we showed how the population exposure to five different pollutants around Europe can be assessed with a unified approach using a multivariate spatio-temporal model. Considering all the pollutants jointly is important because it is known that there are negative synergetic effects on health. In addition, from the statistical point of view, the multivariate approach turns out to be important when the monitoring network is unbalanced with respect to the different pollutants. Indeed, we found improved fitting using a multivariate model rather than using simpler univariate models.

Table 7 Marginal and joint exposure exceedances for \(\hbox {PM}_{10}\) and \(\hbox {PM}_{2.5}\) in 2009 considering as threshold the 98th European percentiles.

Exploiting the model output we computed population exposure and exceedance at various spatial and temporal scales, ranging from daily to yearly and from relatively small metropolitan areas to the continental level. Confidence intervals and bands were provided by a bootstrap technique based on a novel semi-parametric approach which is fast and suitable for large datasets.

Population exceedance results clearly show that the air quality \(\cdot \)across Europe is still highly variable and that there exists a significant variability across time. However, this is partly due to the variability in the meteorological conditions; some years may be worst than others and some European areas are disadvantaged by local geophysical and climatic conditions.