Introduction

Betula pollen is considered to be one of the main causes of pollinosis in northern and central Europe (Wallin et al. 1991; D’Amato and Spieksma 1992; Wihl et al. 1998). In localities in NW Spain between 13% and 60% of individuals who are immunosensitive to pollen grains respond positively to Betula pollen allergens (Aira et al. 2001; Dopazo 2001). There is no unique criterion for establishing the minimum level capable of triggering allergic symptoms since the response is highly individual. However, it has been pointed out that concentrations greater than 30 grains/m3 trigger severe symptoms and values greater than 80 grains/m3 produce allergic symptoms in 90% of patients (Viander and Koivikko 1978; Corsico 1993).

One of aerobiology’s objectives is to develop statistical models enabling the short- and long-term prediction of atmospheric pollen concentrations. This would allow allergic individuals to take preventive measures to protect themselves from the severity of the pollen season. Pollen concentration predictions are usually long-term, trying to predict the onset and severity of the pollen season. Such models published recently have used, as prediction values, the sum of temperatures starting from a certain date (Clot 2001; Caramiello et al. 1994; Ruffaldi and Greffier 1991) or different phenological variables such as the chilling units and growing degree days required to trigger the phenological phase of flowering (Andersen 1991; Laaidi 2001; Jato et al. 2002). Another type of model aims at making short-term predictions of concentration, e.g. 1 or 2 days in advance, making use of statistical tools such as time series (Box-Jenkins type). The difficulty of modelling using time series lies in the very nature of the series, since it consists of zero pollen values throughout almost the entire year, interrupted by one or several random intervals, of short duration, of high values with very fast fluctuations. Prediction based on the correlation of pollen levels with meteorological variables involves uni-dimensional non-linear equations using variables with the greatest prediction capacity (Norris-Hill 1995). In the case of Betula, the most frequently described models are based on linear parametric statistics (Atkinson and Larsson 1990; Norris-Hill and Emberlin 1991; Spieksma et al. 1995; Aira et al. 1998).

The objective of this study is the short-term prediction of Betula pollen levels, with a prediction horizon of 1–3 days starting from the last recorded observation. To that end we applied a nonparametric additive logistic model and a partially linear regression model, which respectively estimate the probability of concentrations greater than 30 grains/m3 and predict future values of pollen grains. Both methods (generalized additive models and semiparametric regression models) are powerful and increasingly widespread statistical tools that relax the usual structural assumptions imposed on the model. Among the different nonparametric techniques, generalized additive models are of special importance because of the need to work with flexible multivariate models that can be adapted to a wide variety of situations. Their main advantages are interpretability and flexibility.

Let \( ({\mathbf{Z}}_{l}, Y_{l}) \), l = 0, ±1, ±2, …, be a (p + 1)-dimensional time series, where \( {\mathbf{Z}}_{l} \) is a p-dimensional series and \( Y_{l} \) a one-dimensional response series. Given a random sample \( \{({\mathbf{Z}}_{i}, Y_{i})\}_{i=1}^{k} \) and n = k + κ, we want to predict \( Y_{n} \) given \( {\mathbf{Z}}_{n} \). Our prediction can be established by a logistic additive model

$$ P(Y_{l} = 1|{\mathbf{Z}}_{l} ) = {\text{Logit}}{\left[ {\alpha + {\sum\limits_{j = 1}^p {f_{j} (Z_{{jl}} )} }} \right]}, $$
(1)

where \( {\text{Logit}}(u) = e^{u}/(1 + e^{u}) \) is the link function and the unknown functions \( f_{j} \) are assumed to have mean 0. We applied this model to predict whether the concentration of pollen grains would exceed a certain level with a horizon of κ = 1, 2 and 3 days. Specifically, we established a threshold of u = 30 grains/m3, since this is the minimum level triggering severe allergy symptoms. We therefore defined the binary response variable of interest as \( Y_{l} = I\{X_{l + \kappa} > u\} \), κ = 1, 2, 3, where \( X_{l} \) is the pollen concentration on day l, and we are interested in \( P(Y_{l} = 1|{\mathbf{Z}}_{l}) = E\{Y_{l}|{\mathbf{Z}}_{l}\} \). The usual regression models for a binary response are generalized linear models (McCullagh and Nelder 1989). These models are restrictive since they specify that the structure of the covariates is linear. Generalized additive models extend generalized linear models by dropping the linearity assumption and requiring only that the covariates’ effects be additive.
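
As a minimal sketch of how the binary response and the link function in Eq. 1 fit together, the following Python fragment builds the exceedance indicator from a daily pollen series and evaluates the link. It assumes that the response for day l refers to the concentration κ days ahead, as the κ-day horizon implies; the function names and the toy series are illustrative and are not taken from the study.

```python
import math

def inverse_logit(u):
    """Link of Eq. 1, e^u / (1 + e^u), written so that exp cannot overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def exceedance_response(pollen, kappa, threshold=30.0):
    """Y_l = 1 if the concentration kappa days after day l exceeds the threshold.

    The last kappa days have no known future value and are dropped.
    """
    return [1 if pollen[l + kappa] > threshold else 0
            for l in range(len(pollen) - kappa)]

# Hypothetical daily concentrations (grains/m3), for illustration only.
pollen = [0, 5, 12, 31, 80, 45, 28, 10]
y1 = exceedance_response(pollen, kappa=1)
y3 = exceedance_response(pollen, kappa=3)
```

With an estimated additive predictor for a given day, `inverse_logit` would return the estimated probability that the 30 grains/m3 threshold is exceeded κ days later.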

Partially linear models (Härdle et al. 2000) deserve special attention because of their excellent behaviour in real situations (Prada-Sánchez et al. 2000). This is largely due to the fact that the nonparametric component makes the model highly flexible, enabling the inclusion of variables whose contribution is not completely known and/or the handling of heteroscedasticity. Thus, restructuring the starting series as an (r + p + 1)-dimensional series \( ({\mathbf{Z}}_{l}, {\mathbf{V}}_{l}, Y_{l}) \), l = 0, ±1, ±2, …, with \( {\mathbf{V}}_{l} \) r-dimensional, and supposing that the contribution of the variables \( {\mathbf{V}}_{l} \) is linear, results in the partially linear model

$$ Y_{l} = {\mathbf{V}}^{t}_{l} \beta + \varphi ({\mathbf{Z}}_{l} ) + \varepsilon _{l} , $$
(2)

where ϕ is an unknown smooth function. We applied this model to the pollen series to obtain a point prediction 1, 2 and 3 days in advance, defining the response as \( Y_{l} = X_{l + \kappa} \), κ = 1, 2, 3, and \( {\mathbf{Z}}_{l} = (X_{l}, X_{l - 1}) \), with \( X_{l} \) being the pollen series and \( {\mathbf{V}}_{l} \) the meteorological series.
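
The alignment of predictors and response implied by Eq. 2 can be sketched as follows. The fragment assumes that the response for day l is the concentration κ days ahead and that the nonparametric argument is the pair formed by the current and the previous day's concentrations; `build_design` and the sample values are hypothetical names and data used only for illustration.

```python
def build_design(pollen, meteo, kappa):
    """Align each day l with its predictors and its kappa-day-ahead response.

    Returns a list of (z, v, y) with z = (X_l, X_{l-1}), v = meteo[l]
    and y = X_{l+kappa}. Days lacking a previous or a future value are skipped.
    """
    rows = []
    for l in range(1, len(pollen) - kappa):
        z = (pollen[l], pollen[l - 1])
        rows.append((z, tuple(meteo[l]), pollen[l + kappa]))
    return rows

# Hypothetical values: pollen in grains/m3, meteo as (T_min, T_med) in deg C.
pollen = [0, 5, 12, 31, 80, 45]
meteo = [(8.0, 12.0), (9.0, 13.5), (10.0, 15.0),
         (11.0, 16.0), (9.5, 14.0), (8.5, 12.5)]
rows = build_design(pollen, meteo, kappa=2)
```

Treating each row as an independent observation is what allows the series to be handled from a regression point of view.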

Materials and methods

The city of Vigo is situated at sea level in the north-west corner of Spain on the shore of the Vigo Estuary (42°14′15′′N; 8°43′30′′W). Its climate is temperate maritime with a mean temperature of 18.8 °C, a mean minimum temperature of 11 °C and 1,412 mm of total annual precipitation.

Birch is represented in north-west Spain by one species, Betula alba L. (Moreno 1990). It is widely distributed and is the dominant tree in altimontane oro-Cantabrian acidophilic forests, with a clearly Euro-Siberian distribution. Such forests are found above a height of 1,150 m and are the last tree formations of the altitudinal series, with montane thermo-climates and hyper-humid ombro-climates (Costa et al. 1990). In this same area, but on siliceous soils with a greater Mediterranean influence, there are also birch forests in the Galician-Portuguese altimontane layer and the Ourense-Sanabria and supra-Mediterranean layer. In the Euro-Siberian region, birch may also form part of riparian forests, along with Alnus glutinosa, Salix atrocinerea and Frangula alnus.

This study was based on the monitoring of atmospheric pollen concentrations from 1995 to 2001. Sampling was carried out by means of a 7-day Lanzoni pollen trap (Hirst 1952) placed on a terrace of the town hall 15 m above ground level. The standard sampling procedures proposed by Domínguez et al. (1992) were used to obtain the pollen counts. The meteorological data were supplied by the National Institute of Meteorology from the Vigo station located in the proximity of the pollen trap. Data from March, April and May 1995–2000 were used to estimate the models, and data from the corresponding months in 2001 were used to evaluate their behaviour.

The following group of variables was available for the construction of both models: the number of pollen grains (\( X_{l} \)) and meteorological variables (rainfall, humidity, temperatures, hours of sunshine, wind speed and direction). To process the series we broke the temporal dependence among the observations, treating them as observations independent of time, which allows the problem to be handled from a regression point of view.

Additive logistic model

There are no standard criteria for selecting the variables included in the model. Our selection combined the most important variables affecting the development, release and dispersion of pollen grains with the results obtained when they were included in the model. The following variables were therefore chosen for the model (Eq. 1), whose objective is to estimate \( P(Y_{l} = 1|{\mathbf{Z}}_{l}) = E\{Y_{l}|{\mathbf{Z}}_{l}\} \) with \( Y_{l} = I\{X_{l + \kappa} > u\} \), κ = 1, 2, 3, u = 30: \( T^{min}_{l} \) and \( T^{med}_{l} \), the daily minimum and mean temperature on day l; \( T^{max}_{l} - T^{max}_{l - 1} \), the difference in daily maximum temperature between l and l – 1; \( t^{calm}_{l} \), the daily mean time without wind on day l; and \( V^{3}_{l} \), the daily mean S–SW wind velocity on day l. We also added the pollen concentration at instant l and with a 1-day delay, \( (X_{l}, X_{l - 1}) \), in order to distinguish between upward and downward phases. Thus, the model being considered is

$$ P(Y_{l} = 1|{\mathbf{Z}}_{l} ) = {\text{Logit}}{\left[ {\alpha + f_{1} (X_{l} ,X_{{l - 1}} ) + f_{2} (T^{{max}}_{l} - T^{{max}}_{{l - 1}} ) + f_{3} (T^{{min}}_{l} ) + f_{4} (T^{{med}}_{l} ) + f_{5} (t^{{calm}}_{l} ) + f_{6} (V^{3}_{l} )} \right]} $$
(3)

On the basis of the observations and using kernel-type estimators for the additive functions, we applied the local scoring algorithm (Hastie and Tibshirani 1990), initialising the additive functions from a linear logistic regression model, i.e. \( f_{1}(u, v) = \beta_{1} u + \beta_{2} v \); \( f_{2}(u, v) = \beta_{3} u + \beta_{4} v \); \( f_{j}(u) = \beta_{j + 2} u \), j = 3, 4, 5, 6. The algorithm was considered to have converged when the relative changes in the fitted additive functions were small.

Depending on the data point to be predicted we determined smoothing parameters (locally) using the cross-validation method for each variable at each internal step of the backfitting algorithm.
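
The local scoring procedure can be illustrated with a simplified sketch. It is not the implementation used in the study: it assumes univariate components only (the bivariate pollen term is not reproduced), a Gaussian Nadaraya-Watson smoother, fixed bandwidths instead of locally cross-validated ones, and a zero start instead of a linear logistic fit. All names and the synthetic data are hypothetical.

```python
import numpy as np

def expit(eta):
    """Inverse logit; eta is clipped so that exp cannot overflow."""
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))

def kernel_smooth(x, r, w, h):
    """Weighted Nadaraya-Watson smooth of r on x, evaluated at the sample points."""
    kw = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2) * w[None, :]
    return (kw @ r) / kw.sum(axis=1)

def local_scoring(X, y, h, n_outer=25, n_inner=10, tol=1e-4):
    """Fit P(y = 1 | x) = expit(alpha + sum_j f_j(x_j)) by local scoring."""
    n, p = X.shape
    ybar = np.clip(y.mean(), 1e-3, 1 - 1e-3)
    alpha = np.log(ybar / (1 - ybar))
    f = np.zeros((n, p))                      # f[:, j] holds f_j at the sample points
    for _ in range(n_outer):
        eta = alpha + f.sum(axis=1)
        prob = np.clip(expit(eta), 1e-3, 1 - 1e-3)
        w = prob * (1 - prob)                 # working weights
        z = eta + (y - prob) / w              # working response
        f_old = f.copy()
        alpha = np.average(z, weights=w)
        for _ in range(n_inner):              # weighted backfitting on z
            for j in range(p):
                partial = z - alpha - f.sum(axis=1) + f[:, j]
                fj = kernel_smooth(X[:, j], partial, w, h[j])
                f[:, j] = fj - np.average(fj, weights=w)   # centre each f_j
        # stop when the relative change of the additive functions is small
        if np.sum((f - f_old) ** 2) <= tol * (np.sum(f_old ** 2) + 1e-12):
            break
    return alpha, f

# Synthetic data, for illustration only (not the Vigo series).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
eta_true = 1.5 * X[:, 0] - X[:, 1] ** 2 + 1.0
y = (rng.uniform(size=200) < expit(eta_true)).astype(float)

alpha, f = local_scoring(X, y, h=[0.5, 0.5])
p_hat = expit(alpha + f.sum(axis=1))
```

Each outer iteration linearises the logistic likelihood around the current fit, and the inner loop solves the resulting weighted additive regression one component at a time.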

Partially linear model

We considered the data as a whole, as in the previous section, except that the response here is \( Y_{l} = X_{l + \kappa} \), κ = 1, 2, 3, and the covariates included in the model are given by the vectors:

$$ {\mathbf{V}}_{l} = {\left( {T^{{min}}_{l} ,T^{{med}}_{l} ,T^{{max}}_{l} - T^{{max}}_{{l - 1}} ,t^{{calm}}_{l} ,V^{3}_{l} } \right)}\quad {\text{and}}\quad {\mathbf{Z}}_{l} = (X_{{l - 1}} ,X_{l} ), $$

i.e. a nonparametric autoregressive part formed by the pollen series itself and a linear explanatory part formed by the meteorological variables.

We followed the procedure proposed by Speckman (1988) to estimate the model. If we subtract \( E\{Y_{l}|{\mathbf{Z}}_{l}\} \) from both sides of Eq. (2)

$$ Y_l - E\{Y_l|{\mathbf{Z}}_l \} = ({\mathbf{V}}_{l} - E\{ {\mathbf{V}}_{l} |{\mathbf{Z}}_{l} \} )^{t} \beta + \varepsilon _{l} , $$

we obtain a linear model relating \( Y_{l} \) and \( {\mathbf{V}}_{l} \) after adjustment for their expected values given \( {\mathbf{Z}}_{l} \). Given the smoother matrix in the nonparametric context, \( {\mathbf{Z}}_{{\mathbf{H}}} \), where \( ({\mathbf{Z}}_{{\mathbf{H}}})_{ij} = w^{{\mathbf{H}},k}_{j}({\mathbf{Z}}_{i}, ({\mathbf{Z}}_{1}, \ldots, {\mathbf{Z}}_{k})) \), \( \{w^{{\mathbf{H}},k}_{j}\} \) is a set of kernel-generated weights and H is a square matrix of smoothing parameters, writing \( {\mathbf{\tilde{Y}}} = ({\mathbf{I}} - {\mathbf{Z}}_{{\mathbf{H}}} ){\mathbf{Y}} \) and \( {\mathbf{\tilde{V}}} = ({\mathbf{I}} - {\mathbf{Z}}_{{\mathbf{H}}} ){\mathbf{V}} \) we obtain

$$ \hat{\beta} = ({\mathbf{\tilde{V}}}^{t} {\mathbf{\tilde{V}}})^{{ - 1}} {\mathbf{\tilde{V}}}^{t} {\mathbf{\tilde{Y}}} $$

and the predictor given by

$$ \hat{\varphi }({\mathbf{V}}_{n} ,{\mathbf{Z}}_{n} ) = {\mathbf{V}}^{t}_{n} \hat{\beta } + {\sum\limits_{j = 1}^k {w^{{{\mathbf{H}},k}}_{j} ({\mathbf{Z}}_{n}, ({\mathbf{Z}}_{1} , \ldots ,{\mathbf{Z}}_{k} ))(Y_{j} - {\mathbf{V}}^{t}_{j} \hat{\beta })} }. $$

The smoothing parameters were selected using cross-validation, i.e. H was selected so as to minimise the expression

$$ {\text{CV}}({\mathbf{H}}) = {\sum\limits_{i \in I} {{\left[ {y_{i} - \hat{\varphi }^{{ - i}} ({\mathbf{V}}_{i} ,{\mathbf{Z}}_{i} )} \right]}^{2} \tilde{w}({\mathbf{Z}}_{i})} } $$

where \( \hat{\varphi }^{{ - i}} ({\mathbf{V}}_{i} ,{\mathbf{Z}}_{i} ) \) is the above predictor evaluated at the ith point after being fitted without that point, and \( \{ \tilde{w}( \cdot )\} \) is a set of kernel-generated weights.
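
The estimator, the predictor and the cross-validation criterion described above can be sketched together in a few lines of NumPy. This is an illustration under stated assumptions rather than the study's code: the bandwidth matrix is taken to be a scalar multiple of the identity (a single bandwidth `h`), the kernel is Gaussian, the cross-validation weights are set to 1 for every point, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def kernel_weights(z0, Z, h):
    """Normalised Gaussian kernel weights of the rows of Z around the point z0."""
    w = np.exp(-0.5 * np.sum(((Z - z0) / h) ** 2, axis=1))
    return w / w.sum()

def fit_beta(V, Z, Y, h):
    """Speckman-type estimate: least squares of (I - S)Y on (I - S)V."""
    S = np.vstack([kernel_weights(z, Z, h) for z in Z])   # smoother matrix
    beta, *_ = np.linalg.lstsq(V - S @ V, Y - S @ Y, rcond=None)
    return beta

def predict(v0, z0, V, Z, Y, beta, h):
    """Linear part plus a kernel smooth of the partial residuals Y - V beta."""
    return v0 @ beta + kernel_weights(z0, Z, h) @ (Y - V @ beta)

def cv_score(V, Z, Y, h):
    """Leave-one-out squared prediction error with unit weights (an assumption)."""
    k = len(Y)
    total = 0.0
    for i in range(k):
        keep = np.arange(k) != i
        beta = fit_beta(V[keep], Z[keep], Y[keep], h)
        pred = predict(V[i], Z[i], V[keep], Z[keep], Y[keep], beta, h)
        total += (Y[i] - pred) ** 2
    return total

# Synthetic data, for illustration only (not the Vigo series).
rng = np.random.default_rng(0)
k = 80
Z = rng.uniform(0, 1, size=(k, 2))
V = rng.normal(size=(k, 2))
beta_true = np.array([2.0, -1.0])
Y = V @ beta_true + np.sin(2 * np.pi * Z[:, 0]) + 0.1 * rng.normal(size=k)

grid = [0.05, 0.1, 0.2, 0.5, 1.0]             # candidate bandwidths
scores = [cv_score(V, Z, Y, h) for h in grid]
h_best = grid[int(np.argmin(scores))]
beta_hat = fit_beta(V, Z, Y, h_best)
```

Refitting the model for every left-out point is the most direct reading of the criterion; it costs one fit per observation and bandwidth, which is affordable for series of a few hundred days.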

Results

Betula pollen is found in Vigo’s atmosphere during 3 months in the spring and represents between 1% and 5% of the total annual pollen. The pollination period is short – less than 40 days – except in the last 2 sampling years, when it lasted 59 and 70 days respectively. It generally begins during March and ends during the first days in May, except in 1997 when it ended on 21 April. The highest concentrations are normally attained in mid-April, although the maximum peaks of 1997 and 1998 occurred earlier, on 28 and 27 March respectively.

There were also fluctuations in the maximum values during the years under study, with a minimum of 29 grains/m3 (27 March 1998) and a maximum of 247 grains/m3 (7 April 1999).

The wide variations in the Betula pollen concentrations make it difficult to obtain prediction models. We can illustrate the behaviour of the predictor of the model defined by Eq. 3 with κ = 1 with the design data in Fig. 1 (analogous figures were obtained for the cases κ = 2 and 3). We quantify the different types of error (type I error and type II error) that are made in the decision

$$ \left\{ {\begin{array}{*{20}l} {{H_{0}\!\! :\ Y_{n} = 1\left| {{\mathbf{Z}}_{n} } \right.} \hfill} & { \Leftrightarrow \hfill} & {{P(X_{{n + 1}} \geqslant 30\left| {{\mathbf{Z}}_{n} } \right.) \geqslant p} \hfill} \\ {{H_{1}\!\! :\ Y_{n} = 0\left| {{\mathbf{Z}}_{n} } \right.} \hfill} & { \Leftrightarrow \hfill} & {{P(X_{{n + 1}} \geqslant 30\left| {{\mathbf{Z}}_{n} } \right.) < p} \hfill} \\ \end{array} } \right. $$

for a given p between 0 and 1. Thus, we define

  • $$ {\text{Type\ I\ error}} = \frac{{\# \{(\hat{P}(Y_n = 1\left| {{\mathbf{Z}}_n} \right.) < p) \cap (Y_n = 1)\}}} {{\# \{Y_n = 1\} }}, $$

    i.e. for a prediction κ days ahead, the probability that the model will fail to detect a level equal to or greater than 30 grains/m3 when this level was actually exceeded.

  • $$ {\text{Global\ type\ I\ error}} = \frac{{\# \{ (\hat{P}(Y_{n} = 1\left| {{\mathbf{Z}}_{n} } \right.) < p) \cap (Y_{n} = 1)\} }} {k}, $$

    i.e. the same count expressed as a proportion of all k observations: the overall probability that the model will fail to detect a level equal to or greater than 30 grains/m3.

  • $$ {\text{Type\ II\ error}} = \frac{{\# \{ (\hat{P}(Y_n = 1\left| {{\mathbf{Z}}_n} \right.) > p) \cap (Y_n = 0)\}}} {{\# \{ Y_n = 0\} }}, $$

    i.e. for a prediction κ days ahead, the probability that the model will signal a level greater than 30 pollen grains/m3 when this level was not actually exceeded (a false alarm).

  • $$ {\text{Global\ type\ II\ error}} = \frac{{\# \{ (\hat{P}(Y_{n} = 1\left| {{\mathbf{Z}}_{n} } \right.) > p) \cap (Y_{n} = 0)\} }} {k}, $$

    i.e. the same count expressed as a proportion of all k observations: the overall probability that the model will raise a false alarm.

  • $$ {\text{Global\ error}} = \frac{{\# \{ (\hat{P}(Y_{n} = 1\left| {{\mathbf{Z}}_{n} } \right.) < p) \cap (Y_{n} = 1)\} + \# \{ (\hat{P}(Y_{n} = 1\left| {{\mathbf{Z}}_{n}} \right.) > p) \cap (Y_{n} = 0)\} }} {k}, $$

    i.e. the overall error of the decision for a prediction κ days ahead, assigning equal importance to both types of error.
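
The five error measures defined above reduce to simple counts. The sketch below computes them from a list of estimated probabilities and the observed binary responses; the function name and the sample values are hypothetical. Following the strict inequalities in the definitions, an estimated probability exactly equal to p is counted in neither error.

```python
def decision_errors(p_hat, y, p=0.5):
    """Return the type I, type II and global error rates for cut-off p."""
    k = len(y)
    n_pos = sum(y)                 # days on which the threshold was exceeded
    n_neg = k - n_pos
    missed = sum(1 for ph, yi in zip(p_hat, y) if ph < p and yi == 1)
    false_alarm = sum(1 for ph, yi in zip(p_hat, y) if ph > p and yi == 0)
    return {
        "type_I": missed / n_pos if n_pos else float("nan"),
        "global_type_I": missed / k,
        "type_II": false_alarm / n_neg if n_neg else float("nan"),
        "global_type_II": false_alarm / k,
        "global": (missed + false_alarm) / k,
    }

# Hypothetical estimated probabilities and observed exceedances.
p_hat = [0.9, 0.4, 0.2, 0.7, 0.1, 0.6]
y = [1, 1, 0, 0, 0, 1]
errors = decision_errors(p_hat, y, p=0.5)
```

In this toy example one exceedance out of three is missed and one non-exceedance out of three is flagged, so both conditional errors are 1/3 and the global error is 2/6.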

Fig. 1. Power function (solid line) versus type I error function (star)

It should be pointed out that all of the errors behave in a similar way in both the design data and the prediction data as a whole. The global errors are 13.44%, 16.2% and 36.45% with p = 0.5 for κ = 1, 2 and 3, respectively.

Choosing this value of p as the cut-off point gives the errors for the 3 months of 2001 shown in Table 1.

Table 1. Errors during the 3 months of 2001

The prediction errors are the same for κ = 1 and 2 because the limit of 30 grains/m3 was only exceeded on 10 days. Figure 2 shows the real pollen values during the months under study from 1995 to 2000 and the predictions 1, 2 and 3 days in advance that were obtained with the partially linear model of Eq. 2. It should be noted that the real series is displaced so that the comparisons can be made on the same vertical line. Tables 2 and 3 quantify different types of mean errors per year:

$$ {\text{SE}} = {\left[ {y_t - \hat{\varphi} ({\mathbf{V}}_{t - \kappa}, \mathbf{Z}_{t - \kappa})} \right]}^{2} ,\quad \kappa = 1,2,3 $$

$$ {\text{Relative absolute error}} = {\left| {\frac{{y_{t} - \hat{\varphi }{\left( {{\mathbf{V}}_{{t - \kappa }} ,{\mathbf{Z}}_{{t - \kappa }} } \right)}}} {{y_{t} }}} \right|}\quad \text{if}\ y_{t} \ne 0, {\enspace} \kappa = 1,2,3 $$

Fig. 2. Daily pollen concentrations (area) for Betula from March to May, along with the predictions for 1 (thick solid line), 2 (fine solid line) and 3 (broken line) days ahead (1995–2000)

Table 2. Standard error according to year
Table 3. Relative absolute error according to year
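
As a small illustration of the two error measures summarised in Tables 2 and 3, the following sketch averages the squared error and the relative absolute error over a set of days, skipping days with a zero observed count in the relative measure, as the definition requires. The function names and the sample values are hypothetical, and averaging over the days of a year is assumed from the description of the tables.

```python
def squared_error(y, y_pred):
    return (y - y_pred) ** 2

def relative_absolute_error(y, y_pred):
    """|(y - y_pred) / y|; undefined (None) when the observed value is 0."""
    return None if y == 0 else abs((y - y_pred) / y)

def mean_errors(observed, predicted):
    """Mean squared error and mean relative absolute error over the given days."""
    se = [squared_error(y, p) for y, p in zip(observed, predicted)]
    rae = [relative_absolute_error(y, p) for y, p in zip(observed, predicted)]
    rae = [e for e in rae if e is not None]
    mean_se = sum(se) / len(se)
    mean_rae = sum(rae) / len(rae) if rae else float("nan")
    return mean_se, mean_rae

# Hypothetical observed and predicted concentrations (grains/m3).
observed = [10.0, 0.0, 40.0]
predicted = [8.0, 3.0, 50.0]
mean_se, mean_rae = mean_errors(observed, predicted)
```

Because days with a zero count are excluded from the relative measure only, the two means may be taken over different numbers of days.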

Finally, Fig. 3 shows the corresponding graph with data from 2001, values that were not taken into account when formulating the model. These values were used to check the validity and prediction capacity of the proposed model.

Fig. 3. Verification of the model during 2001. (These data were not used to formulate the model.) Daily pollen concentrations (area) for Betula from March to May, along with predictions for 1 (thick solid line), 2 (fine solid line) and 3 (broken line) days ahead

Discussion

The highest concentrations of Betula pollen in Spain are recorded in the north-west Iberian Peninsula (Jato et al. 1999). The values found in Vigo are similar to those described by Méndez (2000) for Ourense but lower than those found for Santiago de Compostela, where a total 24-h pollen count of 4,000 grains (annual average) is attained during some years (Aira et al. 1998). The concentrations of this pollen type are higher in the north of Europe, where they are recorded during May (Atkinson and Larsson 1990; Spieksma et al. 1995).

Betula pollen values higher than 30 grains/m3, the quantity considered sufficient to trigger severe allergy symptoms, were attained on 16 days in years of high concentrations and on 7 days in years of low concentrations. There were never more than 6 days on which a count of 80 grains/m3 or more was reached; this is the value cited as sufficient to produce symptoms in 90% of patients (Viander and Koivikko 1978; Detandt and Nolard 1996). In studies carried out in London, Norris-Hill and Emberlin (1991) indicate that these maximum values were generally attained on days with temperatures greater than 18 °C. In the city of Vigo, they coincided with temperatures higher than 20 °C.

During the study period, the annual average 24-h sum of Betula pollen fluctuated from year to year, oscillating between 1,540 in 1995 and 157 in 1998 and describing a biennial pattern, with an alternation of years in which the tree gives priority to reproduction and others in which it focuses on vegetative growth (Nilsson and Persson 1981; Emberlin et al. 1993; Jäger et al. 1991; El-Ghazaly et al. 1993; Spieksma et al. 1995; Latorre 1999). This biennial pattern is also influenced by the meteorological conditions prevailing during the pollination period, since the intense flowering during 1997 coincided with limited precipitation in March and April. On the other hand, the concentrations recorded in 1996 and 1998 may be related to the low temperatures and especially the intense rainfall during the flowering period.

Linear logistic models are usually used in aerobiological studies aimed at predicting pollen concentrations. The regression lines proposed by this methodology, which take into account the meteorological variables with the highest correlation coefficients in relation to pollen concentrations, explain up to 57% of the variability of the Betula pollen data in our area (Rodríguez-Rajo 2000; Méndez 2000). In this study, the proposed additive logistic model for predicting Betula pollen concentrations starts from the regression model that is generally used in aerobiological studies. The additive logistic model’s prediction behaviour is therefore superior to that described by these linear logistic models, since the fitting procedure starts from such a model and thereafter improves on it.

The models for predicting pollen concentrations that use only meteorological variables as predictors produce results with a low prediction level, which shows that meteorological variables do not explain such behaviour on their own. Other variables that better reflect the factors affecting the plant, and on which its pollen production and release depend, should be taken into account. In this regard, the pollen concentrations of previous days reflect these factors and therefore substantially improve the prediction capacity when included as a variable in prediction models. In this case, the prediction behaviour of the additive logistic model attained an acceptable error level during the months of 1995–2000 (Table 1). It is important to note that, despite the previously described biennial pattern in which years of high and low concentrations alternate, the proposed model adapted well to both cases, accurately describing the curve found in the behaviour of Betula pollen grains (Fig. 2).

In 2001 there were few peaks greater than 30 grains/m3, so the error values are merely indicative. As in the case of previous years, the estimated curve accurately describes the behaviour of the Betula pollen grains. It is worth mentioning that the partially linear model behaved well, although as the prediction horizon increases we observe a time lag in the prediction series of the same length as the prediction horizon (see days 7–14 to 24–4 in Fig. 3). During the model-verification year of 2001 we found three concentration peaks. The first was detected by the proposed model, which predicted a mean concentration of 138 pollen grains/m3 air. The following two were also predicted with similar values although with a delay of 1 day, which is logical in view of the importance of the “previous day’s pollen” variable as a predictor in the established model.

As previously discussed, this situation suggests that the currently available meteorological variables do not have sufficient explanatory power. It is the series itself that has to provide almost all of the information, which results in this delay.

Conclusions

In Vigo, Betula pollen is recorded between March and May, attaining concentrations capable of triggering allergy symptoms. Meteorological parameters, along with the previous day’s pollen concentrations, are good tools for establishing models predicting pollen concentrations, since together they reflect the entire series of factors that affect the plant and on which its pollen production and release depend. The integration of both types of variable substantially improves the percentage of variability explained by the regression lines.

The models were tested with data from 2001 and the predicted curve very closely followed the observed variations of the daily mean concentrations, in spite of the special meteorological conditions registered during this year. Even though temperatures were similar to the climatological average, precipitation values were almost double overall during the month of Betula flowering. It is worth highlighting the good prediction behaviour of the models used, especially the partially linear model. The results are better than those obtained by classical methodology, although their main drawback is their short prediction horizon, since the pollen data are not available until 24 h before the day for which a determination of the pollen concentration is required.

The model obtained would be applicable to different geographical areas, but its adjustment, using the available and adequate variables, would be necessary in every case.

One of the main goals of aeropalynological models is to predict concentration levels that could trigger allergic symptoms. The proposed model is reliable in this respect: when predicting 1 day ahead, the probability that it will detect a level equal to or greater than 30 pollen grains/m3 is higher than 90%, while the probability of raising a false alarm for allergic individuals is 1.2%.