A principal component regression model to forecast airborne concentration of Cupressaceae pollen in the city of Granada (SE Spain), during 1995–2006

Ocaña-Peinado, Francisco M.; Valderrama, Mariano J.; Bouzas, Paula R.

doi:10.1007/s00484-012-0527-9

Short Communication
Published: 22 February 2012

Volume 57, pages 483–486, (2013)

International Journal of Biometeorology

Abstract

This paper tackles the problem of producing a 2-week-ahead forecast of atmospheric cypress pollen levels by developing a principal component multiple regression model involving several climatic variables. The efficacy of the proposed model is validated by means of an application to real data of Cupressaceae pollen concentration in the city of Granada (southeast Spain). The model was fitted to data from 11 consecutive years (1995–2005), with 2006 being used to validate the forecasts. Based on the work of different authors, factors such as temperature, humidity, hours of sun and wind speed were incorporated in the model. This methodology explains approximately 75–80% of the variability in the airborne Cupressaceae pollen concentration.

Introduction

Forecasting airborne pollen levels is an interesting problem not only from an environmental point of view but also in health-care planning—mainly in vaccination strategies related to allergies among children and the elderly. Cupressaceae pollen has one of the highest pollen incidences in the Mediterranean area and is present in the atmosphere practically all year round, although it is predominant in the winter period, when no other plants are flowering, making this particle a powerful allergen. In Europe, allergy to Cupressaceae pollen was considered a rarity until 1975, but is now a recognised clinical entity (Belmonte et al. 1999). In order to develop a stochastic model to explain this phenomenon, several meteorological covariates must be taken into account, as was studied in Spain by several authors (e.g., Aira et al. 2001; Belmonte et al. 1999; Díaz de la Guardia et al. 2006; Galán et al. 1998; Sabariego et al. 2011; Tortajada and Mateu 2008).

The regression approach that we follow in this paper has been considered previously by several authors, such as Stark et al. (1997), who applied a Poisson regression model to ragweed pollen; Brumback et al. (2000), who proposed an extension of generalized linear models to the nonlinear framework; Smith and Emberlin (2005), who fitted several regression models after considering pre-peak, peak and post-peak periods; Moseholm et al. (1987), Ocaña Peinado et al. (2008) and Rodríguez-Rajo et al. (2006), who used ARIMA processes; Valderrama et al. (2010), who proposed a two-step functional regression model; Makra and Matyasovszky (2011) and Makra et al. (2011), who considered nonparametric regression methods; and Díaz de la Guardia et al. (2006) and Sabariego et al. (2011), who used polynomial and multiple linear regression.

The aim of this paper was to select a set of variables suitable for modelling the stochastic process of Cupressaceae airborne pollen concentration during the pollination season. To do so, a dimensionality reduction on the basis of principal component analysis (PCA) was developed for both this process and for the above-mentioned climatic processes. The time predictive approach applied in this model means that the sample paths of the main processes were recorded 1 week in advance of the others. Multiple linear regression among the principal components (PCs) was then performed to obtain the predictive model.

The behavior of our methodology was tested by its application to data recorded by the Aerobiology Center at the University of Granada (southern Spain) over a period of 12 years (1995–2006).

Materials and methods

This study was carried out in the city of Granada (SE Spain), in the Mesomediterranean bioclimatic level. All the data used in this paper were collected using methodology analogous to that of Díaz de la Guardia et al. (2006). Data were recorded for 11 years (1995–2005), from 15 January to 15 April (90 data days per year), taking 2 weeks as the applicable time interval. Thus, six intervals were obtained for each year, i.e., 66 in total. During these 90 days the pollen intensity is very high because species of the genus Cupressus, which are very prevalent in urban vegetation, produce pollen on a massive scale (Díaz de la Guardia et al. 2006). Due to the predictive aim of this model, the pollen concentration process was considered 1 week in advance of the climatic processes, i.e., I₁ = [Tᵢ₋₁, Tᵢ] and I₂ = [Tᵢ, Tᵢ₊₁], for i = 1, 2, ..., 65. The stochastic processes taken into consideration were as follows:

Cupressaceae pollen concentration: {P(t), t ∈ I₂}, expressed as the number of pollen grains per cubic meter of air (grains/m³)
Daily average temperature: {T(t), t ∈ I₁}, expressed in degrees centigrade (°C)
Daily average relative humidity: {H(t), t ∈ I₁}, expressed in percent (%)
Daily hours of sun: {S(t), t ∈ I₁}
Daily maximum wind speed: {W(t), t ∈ I₁}, expressed in kilometers per hour (km/h)
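The interval construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the arrays are synthetic stand-ins rather than the Granada records, each 2-week interval is taken as 15 days (90 days / 6 intervals), and the names `to_intervals`, `climate_I1` and `pollen_I2` are hypothetical. Pairing every interval with the next one, including across year boundaries, is one reading of "i = 1, 2, ..., 65".

```python
import numpy as np

N_YEARS, DAYS_PER_YEAR, INTERVAL_LEN = 11, 90, 15  # 1995-2005, 15 Jan to 15 Apr

def to_intervals(daily):
    """Cut a (years, days) array into consecutive 15-day sample paths."""
    daily = np.asarray(daily, dtype=float)
    return daily.reshape(-1, INTERVAL_LEN)

# Synthetic data, for illustration only (not the measured series).
rng = np.random.default_rng(0)
temperature = rng.normal(12.0, 4.0, size=(N_YEARS, DAYS_PER_YEAR))
pollen = rng.gamma(2.0, 150.0, size=(N_YEARS, DAYS_PER_YEAR))

temp_paths = to_intervals(temperature)    # 66 sample paths of 15 days
pollen_paths = to_intervals(pollen)

# Climate is observed on I1 = [T_{i-1}, T_i], pollen on I2 = [T_i, T_{i+1}].
climate_I1 = temp_paths[:-1]              # 65 predictor paths
pollen_I2 = pollen_paths[1:]              # 65 response paths, one interval later
```

Row `k` of `climate_I1` and row `k` of `pollen_I2` then form one predictor/response pair.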

Because of the erratic nature of the measured pollen data, a logarithmic transformation was applied in order to smooth them: X(t) = log[P(t) + 1]. Then, to perform the PCA, all the variables considered were standardized. The respective PCs of each stochastic process are denoted by:

$$ \left\{ {\xi_i^{(X)}} \right\},\left\{ {\xi_i^{(T)}} \right\},\left\{ {\xi_i^{(H)}} \right\},\left\{ {\xi_i^{(S)}} \right\},\left\{ {\xi_i^{(W)}} \right\} $$

so that the Karhunen-Loève expansion for {X(t), t ∈ I₂} is given by:

$$ X(t) = \sum\limits_{i = 1}^{n} u_i^{\prime}\,\xi_i^{(X)}, \quad t \in I_2 $$

(1)

and the PCs are estimated by means of multiple linear regression as follows:

$$ \widehat{\xi}_i^{(X)} = \gamma_0 + \sum\limits_{j = 1}^{n_1} a_j\,\xi_j^{(T)} + \sum\limits_{j = 1}^{n_2} b_j\,\xi_j^{(H)} + \sum\limits_{j = 1}^{n_3} c_j\,\xi_j^{(S)} + \sum\limits_{j = 1}^{n_4} d_j\,\xi_j^{(W)}, \quad i = 1, 2, \ldots, n. $$

(2)
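The chain of steps above (logarithmic transformation, standardization, PCA, regression of the pollen PCs on the climatic PCs) can be sketched with NumPy as follows. This is a minimal illustration and not the SPSS procedure used in the paper: the data are synthetic, only temperature is used as a predictor, no stepwise selection is performed, the natural logarithm is assumed, and the helper names `standardize` and `pca_scores` are hypothetical. PCs are retained when their eigenvalue exceeds 1.

```python
import numpy as np

def standardize(paths):
    """Centre and scale each column (day within the interval) to unit variance."""
    return (paths - paths.mean(axis=0)) / paths.std(axis=0, ddof=1)

def pca_scores(paths):
    """Eigenvalues, eigenvectors and scores of the PCs with eigenvalue > 1."""
    z = standardize(paths)
    corr = np.cov(z, rowvar=False)                  # correlation matrix of the days
    eigval, eigvec = np.linalg.eigh(corr)           # returned in ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # sort descending
    keep = eigval > 1.0
    return eigval[keep], eigvec[:, keep], z @ eigvec[:, keep]

# Synthetic stand-ins for 65 paired sample paths of 15 days (not real data).
rng = np.random.default_rng(1)
n_paths, n_days = 65, 15
temp = rng.normal(12, 4, (n_paths, 1)) + rng.normal(0, 1, (n_paths, n_days))
pollen = np.exp(0.3 * temp + rng.normal(0, 0.5, (n_paths, n_days)))

x = np.log(pollen + 1.0)                            # X(t) = log[P(t) + 1]
_, _, xi_x = pca_scores(x)                          # pollen PCs
_, _, xi_t = pca_scores(temp)                       # temperature PCs

# Eq. 2 restricted to temperature: least squares with intercept gamma_0.
design = np.column_stack([np.ones(n_paths), xi_t])
coef, *_ = np.linalg.lstsq(design, xi_x[:, 0], rcond=None)
fitted = design @ coef
resid = xi_x[:, 0] - fitted
r2 = 1.0 - np.sum(resid**2) / np.sum((xi_x[:, 0] - xi_x[:, 0].mean()) ** 2)
print(f"retained pollen PCs: {xi_x.shape[1]}, R^2 for the first one: {r2:.3f}")
```

Because PC scores of standardized data have zero mean, the fitted intercept is numerically zero in this sketch.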

The criterion for selecting the number of PCs of each process to be included in the model is that their eigenvalue (explained variance) must be greater than 1 (Kaiser 1958). The explanatory variables to be introduced into the multiple regression model are then determined by the stepwise method. The goodness of fit of the model given by the expansion in Eq. 1 is evaluated by the R² coefficient. The forecasts for 2006 are tested by evaluating the mean square error (MSE):

$$ MSE = \frac{1}{{15}}\sum\limits_t {{{\left[ {X(t) - \widehat{X}(t)} \right]}^2}} $$

(3)
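As a small worked illustration of Eq. 3, the MSE over one forecast interval can be computed as below. The function name and the example values are hypothetical; the factor 1/15 in Eq. 3 corresponds here to averaging over the 15 points of the sample path.

```python
import numpy as np

def mean_square_error(observed, forecast):
    """Average squared difference between observed and forecast X(t)."""
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    if observed.shape != forecast.shape:
        raise ValueError("observed and forecast must have the same shape")
    return float(np.mean((observed - forecast) ** 2))

# Hypothetical 15-point sample path with a constant forecast error of 0.5.
x_observed = np.arange(15, dtype=float)
x_forecast = x_observed + 0.5
mse = mean_square_error(x_observed, x_forecast)   # 0.25
```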

A parametric 95% confidence interval was computed for each forecast in order to assess how closely it fits the observed values.

The whole statistical analysis was carried out using SPSS 15.0 software (SPSS, Chicago, IL).

Results

A complete set of data corresponding to 12 years (1995–2006) was provided by the Aerobiology Unit of the University of Granada. The peak day for P(t) usually occurred in February, with concentrations higher than 1,000 grains/m³. This study considers the pollination interval from 15 January to 15 April during 1995–2005, while data for 2006 were used to compare forecast and real values.

Table 1 shows the mean, standard deviation and partial correlation coefficient between Cupressaceae pollen concentration and the four climatic variables during 1995–2005. All correlations are significant at the 0.01 level. The results of the PCA of the transformed process {X(t), t ∈ I₂} and of the four climatic processes mentioned above are shown in Table 2. Note that the cumulative percentage of explained variance is approximately 75–80%. On the basis of the criterion for selecting the number of PCs, three are retained for X(t), and the linear regressions (Eq. 2) in terms of the climatic PCs are the following:

$$ \begin{array}{ll} \widehat{\xi}_1^{(X)} = 0.508\,\xi_1^{(T)} + 0.275\,\xi_1^{(S)} + 0.373\,\xi_1^{(W)}, & R^2 = 87.44\% \\ \widehat{\xi}_2^{(X)} = 0.384\,\xi_1^{(T)} + 0.271\,\xi_2^{(T)} + 0.802\,\xi_1^{(W)}, & R^2 = 83.77\% \\ \widehat{\xi}_3^{(X)} = 0.764\,\xi_2^{(T)} - 0.286\,\xi_1^{(H)} + 0.181\,\xi_1^{(S)}, & R^2 = 58.67\% \end{array} $$
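To show how the three fitted equations are applied, the sketch below evaluates them for one set of climatic PC scores. The coefficients are the ones reported above; the dictionary layout, the key names (`"T1"` for the first temperature PC, and so on) and the input scores are hypothetical and chosen only for illustration.

```python
# Coefficients of the three fitted regressions, keyed by climatic PC.
COEFFS = {
    1: {"T1": 0.508, "S1": 0.275, "W1": 0.373},
    2: {"T1": 0.384, "T2": 0.271, "W1": 0.802},
    3: {"T2": 0.764, "H1": -0.286, "S1": 0.181},
}

def predict_pollen_pcs(climate_scores):
    """Estimate the three pollen PCs from a dict of climatic PC scores."""
    return {
        i: sum(coef * climate_scores[name] for name, coef in terms.items())
        for i, terms in COEFFS.items()
    }

# Hypothetical scores: one unit on the first temperature PC, zero elsewhere.
scores = {"T1": 1.0, "T2": 0.0, "H1": 0.0, "S1": 0.0, "W1": 0.0}
pollen_pcs = predict_pollen_pcs(scores)
# pollen_pcs[1] is 0.508, pollen_pcs[2] is 0.384 and pollen_pcs[3] is 0.0
```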

Table 1 Mean, standard deviation and partial correlation coefficient between P (t) and the four climatic variables: T (t), S (t), H (t) and W (t) in the period 1995–2005

Table 2 Principal component analysis (PCA) for X(t) and the four climatic variables. PC Principal components, λᵢ eigenvalues of the PCA, CVᵢ cumulative variance for each PC

Predictions for the sample paths of X(t) in 2006 were obtained by substituting the estimated PCs into the expansion in Eq. 1. The R² coefficient of the overall model given by this expansion was 75.35%.
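The substitution step can be sketched as follows: estimated PC scores are mapped back through the eigenvectors (Eq. 1), the standardization is undone, and the transformation X(t) = log[P(t) + 1] is inverted. This is an illustration under assumptions: the data are synthetic, the natural logarithm is assumed (the base is not stated in the text), and `reconstruct_pollen` is a hypothetical helper name.

```python
import numpy as np

def reconstruct_pollen(xi_hat, eigvec, mean, std):
    """Map PC scores back to pollen counts on the original scale."""
    z_hat = xi_hat @ eigvec.T          # Eq. 1 on the standardized scale
    x_hat = z_hat * std + mean         # undo the standardization
    return np.expm1(x_hat)             # invert X = log(P + 1), natural log assumed

# Synthetic sample paths (20 paths of 15 days), for illustration only.
rng = np.random.default_rng(2)
pollen = rng.gamma(2.0, 100.0, size=(20, 15))
x = np.log1p(pollen)
mean, std = x.mean(axis=0), x.std(axis=0, ddof=1)
z = (x - mean) / std

eigval, eigvec = np.linalg.eigh(np.cov(z, rowvar=False))
eigvec = eigvec[:, ::-1]               # leading components first
xi = z @ eigvec                        # here the true scores stand in for estimates

full = reconstruct_pollen(xi, eigvec, mean, std)               # all components
approx = reconstruct_pollen(xi[:, :3], eigvec[:, :3], mean, std)  # three PCs only
```

With all components the original counts are recovered exactly; truncating to three PCs gives the kind of smoothed approximation that a forecast built on three estimated PCs would produce.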

Real and forecast values for the first three sample paths in 2006 of {P(t), t ∈ I₂} are shown in Table 3, together with their associated MSE. A 95% confidence interval for each forecast is also included in Table 3. Figures 1 and 2 show these forecasts for the first two sample paths.

Table 3 Real pollen P(t) and forecast pollen $ \widehat{P}(t) $ for the two sample paths in 2006, and their mean square error (MSE). LL and UL are the lower and upper limits of the 95% confidence intervals for $ \widehat{P}(t) $, respectively

Discussion

Pollen is an important component in the development of allergic diseases. This research examined Cupressaceae pollen found in the atmosphere of Granada in the period 1995–2005. Given the prevalence of this pollen in Granada during the winter, the influence of the most important meteorological parameters on daily pollen counts was studied in order to investigate the conditions that affect Cupressaceae pollination.

In agreement with the results of other studies in the Mediterranean area (Díaz de la Guardia et al. 2006; Galán et al. 1998; Sabariego et al. 2011; Tortajada and Mateu 2008), meteorological variables such as daily average temperature, daily humidity, daily hours of sun and daily maximum wind speed were found to be useful predictors for constructing a PC multiple regression model.

From Tables 1–3 and Figs. 1 and 2 we can observe that the model proposed in this paper captures the trends well and allows the appearance of peaks in the Cupressaceae airborne pollen process to be anticipated. However, pollen levels are related not only to meteorological variables; human activities such as pruning, watering, or the introduction or elimination of plants can also modify pollen values.

References

Aira MJ, Dopazo A, Jato MV (2001) Aerobiological monitoring of Cupressaceae pollen in Santiago de Compostela (NW Iberian Peninsula) over six years. Aerobiologia 17:319–325
Belmonte J, Canela M, Guardia R et al (1999) Aerobiological dynamics of the Cupressaceae pollen in Spain, 1992–1998. Polen 10:27–38
Brumback BA, Ryan LM, Schwartz JD et al (2000) Transitional regression models, with application to environmental time series. J Am Stat Assoc 95:16–27
Díaz de la Guardia C, Alba F, De Linares C et al (2006) Aerobiological and allergenic analysis of Cupressaceae pollen in Granada. J Investig Allergol Immunol 16:24–33
Galán C, Fuillerat MJ, Comtois P et al (1998) Bioclimatic factors affecting daily Cupressaceae flowering in southwest Spain. Int J Biometeorol 41:95–100
Kaiser HF (1958) The Varimax criterion for analytic rotation in factor analysis. Psychometrika 23:187–200
Makra L, Matyasovszky I (2011) Assessment of the daily ragweed pollen concentration with previous-day meteorological variables using regression and quantile regression analysis for Szeged, Hungary. Aerobiologia 27:247–259
Makra L, Matyasovszky I, Thibaudon M et al (2011) Forecasting ragweed pollen characteristics with nonparametric regression methods over the most polluted areas in Europe. Int J Biometeorol 55:361–371
Moseholm L, Weeke ER, Petersen BN (1987) Forecast of pollen concentrations of Poaceae (Grasses) in the air by time series analysis. Pollen Spores XXIX:305–322
Ocaña Peinado FM, Valderrama MJ, Aguilera AM (2008) A dynamic regression model for air pollen concentration. Stoch Environ Res Risk Assess 22:59–63
Rodríguez-Rajo FJ, Fernández-González D, Vega-Maray AM et al (2006) Biometeorological characterization of the winter in the north west Spain based on Alnus pollen flowering. Grana 45:288–296
Sabariego S, Cuesta P, Fernández-González F et al (2011) Models for forecasting airborne Cupressaceae pollen levels in Central Spain. Int J Biometeorol. doi:10.1007/s00484-011-0423-8
Smith M, Emberlin J (2005) Constructing a 7-day ahead forecast model for grass pollen at north London, United Kingdom. Clin Exp Allergy 35:1400–1406
Stark PC, Ryan LM, McDonald JL et al (1997) Using meteorological data to predict daily ragweed pollen levels. Aerobiologia 13:177–184
Tortajada B, Mateu I (2008) Cupressaceae pollen in the atmosphere of Valencia (East of Spain) and relationship with meteorological parameters. Polen 18:51–59
Valderrama MJ, Ocaña FA, Aguilera AM, Ocaña Peinado FM (2010) Forecasting pollen concentration by a two-step functional model. Biometrics 66:578–585

Acknowledgments

This work was partially supported by project MTM2010-20502 of Dirección General de Investigación y Gestión del Plan Nacional I+D+I and grant FQM-307 of Consejería de Innovación de la Junta de Andalucía, both in Spain. The authors are grateful to Dr. Consuelo Díaz de la Guardia (Botanic Department of the University of Granada) for providing the data used in this research.

Author information

Authors and Affiliations

Department of Statistics and Operations Research, Faculty of Pharmacy, University of Granada, 18071, Granada, Spain
Francisco M. Ocaña-Peinado, Mariano J. Valderrama & Paula R. Bouzas

Corresponding author

Correspondence to Francisco M. Ocaña-Peinado.

About this article

Ocaña-Peinado, F.M., Valderrama, M.J. & Bouzas, P.R. A principal component regression model to forecast airborne concentration of Cupressaceae pollen in the city of Granada (SE Spain), during 1995–2006. Int J Biometeorol 57, 483–486 (2013). https://doi.org/10.1007/s00484-012-0527-9

Received: 13 December 2011
Revised: 16 January 2012
Accepted: 21 January 2012
Published: 22 February 2012
Issue Date: May 2013
DOI: https://doi.org/10.1007/s00484-012-0527-9
