Introduction

Forecasting airborne pollen levels is an interesting problem not only from an environmental point of view but also in health-care planning—mainly in vaccination strategies related to allergies among children and the elderly. Cupressaceae pollen has one of the highest pollen incidences in the Mediterranean area and is present in the atmosphere practically all year round, although it is predominant in the winter period, when no other plants are flowering, making this particle a powerful allergen. In Europe, allergy to Cupressaceae pollen was considered a rarity until 1975, but is now a recognised clinical entity (Belmonte et al. 1999). In order to develop a stochastic model to explain this phenomenon, several meteorological covariates must be taken into account, as was studied in Spain by several authors (e.g., Aira et al. 2001; Belmonte et al. 1999; Díaz de la Guardia et al. 2006; Galán et al. 1998; Sabariego et al. 2011; Tortajada and Mateu 2008).

The regression approach that we follow in this paper has been considered previously by several authors such as Stark et al. (1997), who applied a Poisson regression model for ragweed pollen; Brumback et al. (2000), who proposed an extension of the generalized linear models to the nonlinear framework; Smith and Emberlin (2005), who adjusted several regression models after considering pre-peak, peak and post-peak periods; Moseholm et al. (1987), Ocaña Peinado et al. (2008) and Rodríguez-Rajo et al. (2006), who used ARIMA processes; Valderrama et al. (2010), who proposed a two-step functional regression model; Makra and Matyasovszky (2011) and Makra et al. (2011), who considered nonparametric regression methods; and Díaz de la Guardia et al. (2006) and Sabariego et al. (2011), who used polynomial and multiple linear regression.

The aim of this paper was to select a set of variables suitable for modelling the stochastic process of Cupressaceae airborne pollen concentration during the pollination season. To do so, a dimensionality reduction on the basis of principal component analysis (PCA) was developed for both this process and for the above-mentioned climatic processes. The time predictive approach applied in this model means that the sample paths of the main processes were recorded 1 week in advance of the others. Multiple linear regression among the principal components (PCs) was then performed to obtain the predictive model.

The behavior of our methodology was tested by its application to data recorded by the Aerobiology Center at the University of Granada (southern Spain) over a period of 12 years (1995–2006).

Materials and methods

This study was carried out in the city of Granada (SE Spain), in the Mesomediterranean bioclimatic level. All the data used in this paper were collected using methodology analogous to that of Díaz de la Guardia et al. (2006). Data were recorded over 11 years (1995–2005), from 15 January to 15 April (90 days of data per year), taking 2 weeks as the applicable time interval. Thus, six intervals were obtained for each year, i.e., 66 in total. During these 90 days the pollen intensity is very high because species of the genus Cupressus, which are very prevalent in urban vegetation, produce pollen on a massive scale (Díaz de la Guardia et al. 2006). Because of the predictive aim of the model, the pollen concentration process was considered 1 week in advance of the climatic processes, i.e., the climatic processes were observed on \( I_1 = [T_{i-1}, T_i] \) and the pollen process on \( I_2 = [T_i, T_{i+1}] \), for i = 1, 2, …, 65. The stochastic processes taken into consideration were as follows:

  • Cupressaceae pollen concentration: {P (t), t ∈ I2} expressed as number of pollen grains per cubic meter of air (grains/m3).

  • Daily average temperature: {T (t), t ∈ I1} expressed in degrees centigrade (°C)

  • Daily average relative humidity: {H(t), t ∈ I1} expressed in percent (%)

  • Daily hours of sun: {S(t), t ∈ I1}

  • Daily maximum wind speed: {W (t), t ∈ I1} expressed in kilometers per hour (km/h)
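
The arrangement of the daily records into sample paths described above can be sketched as follows. This is only an illustration with synthetic numbers, not the authors' procedure: the 15-day interval length is inferred from 90 days split into six intervals, and the function names `to_sample_paths` and `lagged_pairs` are hypothetical.

```python
import numpy as np

N_YEARS, DAYS_PER_YEAR, INTERVAL_LEN = 11, 90, 15  # 15 days is an inferred assumption


def to_sample_paths(daily):
    """Reshape a (years, days) array into consecutive 15-day sample paths."""
    daily = np.asarray(daily, dtype=float)
    n_years, n_days = daily.shape
    if n_days % INTERVAL_LEN != 0:
        raise ValueError("days per year must be a multiple of the interval length")
    return daily.reshape(n_years * (n_days // INTERVAL_LEN), INTERVAL_LEN)


def lagged_pairs(climate_paths, pollen_paths):
    """Pair each climatic path on I1 with the pollen path on the next interval I2."""
    return climate_paths[:-1], pollen_paths[1:]


# Synthetic stand-in data (NOT the Granada records).
rng = np.random.default_rng(0)
temperature = rng.normal(12.0, 4.0, size=(N_YEARS, DAYS_PER_YEAR))
pollen = rng.gamma(2.0, 150.0, size=(N_YEARS, DAYS_PER_YEAR))

T_paths = to_sample_paths(temperature)   # 66 sample paths of 15 days
P_paths = to_sample_paths(pollen)
T_I1, P_I2 = lagged_pairs(T_paths, P_paths)  # 65 lagged pairs
print(T_paths.shape, T_I1.shape, P_I2.shape)
```

Pairing consecutive rows follows the indexing i = 1, …, 65 literally, so the last interval of one year is paired with the first interval of the next; whether the original analysis treated year boundaries this way is not stated.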

Because of the erratic nature of the pollen data measured, a logarithmic transformation was applied in order to smooth them: X(t) = log[P (t) + 1]. Then, to perform the PCA, all the variables considered were standardized. In the principal component analysis (PCA), the respective PCs for each stochastic process are denoted by:

$$ \left\{ {\xi_i^{(X)}} \right\},\left\{ {\xi_i^{(T)}} \right\},\left\{ {\xi_i^{(H)}} \right\},\left\{ {\xi_i^{(S)}} \right\},\left\{ {\xi_i^{(W)}} \right\} $$

so that the Karhunen-Loève expansion for {X(t), t ∈ I2} is given by:

$$ X(t) = \sum\limits_{i = 1}^n {u_i^\prime {\xi_i},t \in {I_2}} $$
(1)

and the PCs are estimated by means of multiple linear regression as follows:

$$ \begin{array}{l} \widehat{\xi}_i^{(X)} = \gamma_0 + \sum\limits_{j = 1}^{n_1} a_j \xi_j^{(T)} + \sum\limits_{j = 1}^{n_2} b_j \xi_j^{(H)} + \sum\limits_{j = 1}^{n_3} c_j \xi_j^{(S)} \\ \qquad\quad + \sum\limits_{j = 1}^{n_4} d_j \xi_j^{(W)}, \quad i = 1,2, \ldots, n. \end{array} $$
(2)
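
The chain of steps described so far (logarithmic transformation, standardization, PCA, and regression of the pollen PCs on the climatic PCs) can be sketched with NumPy as below. The study itself used SPSS, so this is a swapped-in illustration only: the data are synthetic, only temperature is used as a predictor, and the helper names `standardize`, `pca` and `ols` are hypothetical.

```python
import numpy as np


def standardize(paths):
    """Centre and scale each day (column) across sample paths."""
    mean = paths.mean(axis=0)
    sd = paths.std(axis=0, ddof=1)
    return (paths - mean) / sd


def pca(paths):
    """PCA of standardized sample paths via their correlation matrix."""
    z = standardize(paths)
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs, z @ eigvecs         # scores: one row per sample path


def ols(y, predictors):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2


# Synthetic stand-in data: 65 lagged pairs of 15-day sample paths.
rng = np.random.default_rng(1)
level = rng.normal(size=(65, 1))                              # shared signal
temp_I1 = 12.0 + 3.0 * level + rng.normal(size=(65, 15))      # T(t) on I1
pollen_I2 = np.exp(4.0 + 0.8 * level + 0.3 * rng.normal(size=(65, 15)))  # P(t) on I2

x_I2 = np.log(pollen_I2 + 1.0)      # X(t) = log[P(t) + 1], natural log assumed
eig_x, vec_x, xi_x = pca(x_I2)
eig_t, vec_t, xi_t = pca(temp_I1)

# Regress the first pollen PC on the first two temperature PCs
# (Eq. 2 restricted to a single climatic process).
coef, r2 = ols(xi_x[:, 0], xi_t[:, :2])
print(coef, r2)
```

Because the variables are standardized, the eigenvalues of each process sum to the number of days in a sample path, which is what makes an eigenvalue threshold of 1 meaningful.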

The criterion for selecting the PCs of each process to be included in the model is that their eigenvalue (explained variance) must be greater than 1 (Kaiser 1958). The explanatory variables to be introduced into the multiple regression model are then determined by the stepwise method. The goodness-of-fit of the model given by the expansion in Eq. 1 is evaluated by the \( R^2 \) coefficient. The forecasts for 2006 are tested by evaluating the mean square error (MSE):

$$ MSE = \frac{1}{{15}}\sum\limits_t {{{\left[ {X(t) - \widehat{X}(t)} \right]}^2}} $$
(3)
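
As a minimal sketch, Eq. 3 and the back-transformation from the logarithmic scale to pollen counts could be computed as follows. The natural logarithm is an assumption (the base is not stated), and the function names are hypothetical.

```python
import numpy as np


def mse(x_obs, x_hat):
    """Mean square error over one sample path (15 days in Eq. 3)."""
    x_obs = np.asarray(x_obs, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    if x_obs.shape != x_hat.shape:
        raise ValueError("observed and forecast paths must have the same shape")
    return float(np.mean((x_obs - x_hat) ** 2))


def to_log_scale(p):
    """X(t) = log[P(t) + 1], assuming the natural logarithm."""
    return np.log1p(np.asarray(p, dtype=float))


def to_pollen_scale(x):
    """Inverse transformation: P(t) = exp[X(t)] - 1."""
    return np.expm1(np.asarray(x, dtype=float))


# Toy check with made-up values: a forecast that is off by 1 every day.
x_obs = np.zeros(15)
x_hat = np.ones(15)
print(mse(x_obs, x_hat))
```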

A parametric 95% confidence interval was computed for each forecast in order to assess how well it agreed with the observed values.

The whole statistical analysis was carried out using SPSS 15.0 software (SPSS, Chicago, IL).

Results

A complete set of data corresponding to 12 years (1995–2006) was provided by the Aerobiology Unit of the University of Granada. The peak day for P(t) usually occurred in February, with concentrations higher than 1,000 grains/m3. This study considers the pollination interval from 15 January to 15 April during 1995–2005, while data for 2006 were used to compare forecast and real values.

Table 1 shows the mean, standard deviation and partial correlation coefficient between Cupressaceae pollen concentration and the four climatic variables during 1995–2005. All correlations are significant at the 0.01 level. The results of the PCA of the transformed process {X(t), t ∈ I2} and of the four climatic processes mentioned above are shown in Table 2. Note that the accumulated percentage of explained variance is approximately 75–80%. On the basis of the criterion for selecting the number of PCs, three are considered for X(t), and the linear regressions (Eq. 2) in terms of the PCs of the climatic processes are the following:

$$ \begin{array}{ll} \widehat{\xi}_1^{(X)} = 0.508\,\xi_1^{(T)} + 0.275\,\xi_1^{(S)} + 0.373\,\xi_1^{(W)}, & R^2 = 87.44\% \\ \widehat{\xi}_2^{(X)} = 0.384\,\xi_1^{(T)} + 0.271\,\xi_2^{(T)} + 0.802\,\xi_1^{(W)}, & R^2 = 83.77\% \\ \widehat{\xi}_3^{(X)} = 0.764\,\xi_2^{(T)} - 0.286\,\xi_1^{(H)} + 0.181\,\xi_1^{(S)}, & R^2 = 58.67\% \end{array} $$
Table 1 Mean, standard deviation and partial correlation coefficient between P (t) and the four climatic variables: T (t), S (t), H (t) and W (t) in the period 1995–2005
Table 2 Principal component analysis (PCA) for X(t) and the four climatic variables. PC: principal components; \( \lambda_i \): eigenvalues of the PCA; \( CV_i \): cumulative variance for each PC
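
The rule used to choose the number of PCs (eigenvalue greater than 1) and the cumulative explained variance can be sketched as follows. The eigenvalues in the example are made up for illustration and are not the values of Table 2; `kaiser_select` is a hypothetical helper name.

```python
import numpy as np


def kaiser_select(eigenvalues):
    """Return how many PCs have eigenvalue > 1 and their cumulative variance (%)."""
    eig = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    n_keep = int(np.sum(eig > 1.0))
    cumulative = 100.0 * np.cumsum(eig) / eig.sum()
    return n_keep, cumulative[:n_keep]


# Made-up eigenvalues of a 15-variable correlation matrix (they sum to 15).
made_up = [6.0, 3.5, 2.0, 0.9, 0.7, 0.5, 0.4, 0.3,
           0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05]
n_keep, cumulative = kaiser_select(made_up)
print(n_keep, np.round(cumulative, 1))
```

With these illustrative numbers three components pass the threshold; the actual counts and percentages for each process are those reported in Table 2.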

Predictions for the sample paths of X(t) in 2006 were obtained by substituting the estimated PCs above into the expansion in Eq. 1. The \( R^2 \) coefficient obtained for the general model given by this expansion was 75.35%.
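
The forecasting step can be sketched in two parts: applying the three fitted regressions reported above to the climatic PC scores, and inserting the estimated pollen PCs into the truncated expansion of Eq. 1. The coefficients are those given in the text; everything else (the climatic scores, the eigenvectors, the daily means and standard deviations, and the function names) is hypothetical and stands in for quantities that would come from the PCA of the 1995–2005 data.

```python
import numpy as np

# Coefficients as reported in the fitted regressions (no intercept is reported).
# Keys are (process, PC index): T temperature, H humidity, S sun hours, W wind.
FITTED = [
    {("T", 1): 0.508, ("S", 1): 0.275, ("W", 1): 0.373},
    {("T", 1): 0.384, ("T", 2): 0.271, ("W", 1): 0.802},
    {("T", 2): 0.764, ("H", 1): -0.286, ("S", 1): 0.181},
]


def predict_pollen_scores(climate_scores):
    """Estimate the three pollen PCs from climatic PC scores observed on I1."""
    return np.array([sum(c * climate_scores[key] for key, c in eq.items())
                     for eq in FITTED])


def reconstruct_path(scores, eigvecs, mean, sd):
    """Truncated expansion (Eq. 1) followed by undoing the standardization."""
    k = len(scores)
    z_hat = eigvecs[:, :k] @ scores        # standardized X(t) on I2
    return mean + sd * z_hat


# Hypothetical inputs for illustration only.
climate_scores = {("T", 1): 1.0, ("T", 2): 0.0, ("H", 1): 0.0,
                  ("S", 1): 0.0, ("W", 1): 0.0}
rng = np.random.default_rng(2)
eigvecs, _ = np.linalg.qr(rng.normal(size=(15, 15)))   # orthonormal stand-in
mean = np.full(15, 5.0)                                # stand-in daily means of X(t)
sd = np.full(15, 1.2)                                  # stand-in daily std. deviations

xi_hat = predict_pollen_scores(climate_scores)
x_hat = reconstruct_path(xi_hat, eigvecs, mean, sd)    # forecast on the log scale
p_hat = np.expm1(x_hat)                                # back to grains/m3 (natural log assumed)
print(xi_hat, x_hat.shape)
```

Whether the PC scores entering the regressions were themselves rescaled is not stated in the text, so the scale of `x_hat` here is purely illustrative.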

Real and forecast values for the first three sample paths in 2006 of {P(t), t ∈ I2} are shown in Table 3, together with their associated MSE. A 95% confidence interval for each forecast is also included in Table 3. Figures 1 and 2 show these forecasts for the first two sample paths.

Table 3 Real pollen P(t), and forecast pollen \( \widehat{P}(t) \), for the two sample paths in 2006 and their mean square error (MSE). LL and UL are lower and upper limits in the 95% confidence intervals for \( \widehat{P}(t) \), respectively
Fig. 1 Observed and forecast pollen values in the period 15 January–29 January

Fig. 2 Observed and forecast pollen values in the period 30 January–13 February

Discussion

Pollen is an important component in the development of allergic diseases. This research examined Cupressaceae pollen found in the atmosphere of Granada in the period 1995–2005. Because of the prevalence of this pollen in Granada during the winter, the influence of the most important meteorological parameters on daily pollen counts was examined in order to investigate the conditions that affect Cupressaceae pollination.

In agreement with the results of other studies in the Mediterranean area (Díaz de la Guardia et al. 2006; Galán et al. 1998; Sabariego et al. 2011; Tortajada and Mateu 2008), meteorological variables such as daily average temperature, daily average relative humidity, daily hours of sun and daily maximum wind speed proved to be useful predictors for constructing a PC multiple regression model.

From Tables 1–3 and Figs. 1 and 2 we can observe that the model proposed in this paper captures the trends well and allows the appearance of peaks in the Cupressaceae airborne pollen process to be anticipated. However, pollen levels are related not only to meteorological variables; human activities such as pruning, watering, or the introduction or elimination of plants can also modify pollen values.