Introduction

Pollen allergy

Pollen allergy had become a widespread disease by the end of the 20th century. Nowadays, every 5th or 6th person, on average, suffers from this disease of the immune system in Europe. Pollinosis involves unpleasant symptoms and can become asthma. It has been proved that those who fall ill with pollen allergy can not concentrate on their work, feel unwell and are frequently on sick leave.

About 30% of the Hungarian population has some type of allergy, 65% of them have pollen sensitivity, and at least 60% of this pollen sensitivity is caused by ragweed. It is a shocking fact that the number of patients with registered allergic illnesses has doubled and the number of cases of allergic asthma has increased by a factor of four in Southern Hungary over the last 40 years. However, we have to remember that the diagnosis of asthma has also developed significantly during this period.

The aim of the study is to analyse the relation of ragweed pollen concentration to meteorological elements in a medium-sized city: Szeged, Southern Hungary.

The main plants causing pollen allergy in Europe are grasses (Poaceae), birch (Betula), mugwort (Artemisia) and, in Southern Europe, the olive-tree (Oleaceae). In the 1980s a new plant that spreads extremely aggressively joined the list. It appears in more and more countries, blooms for a long time (in some cases for 3 months) and it produces a great deal of pollen, which when breathed-in, rapidly produces characteristic symptoms of pollinosis (coughing, sneezing, nasal discharge and inflammation of the mucous membranes of the eyes and nose). The plant is the short (or common) ragweed (Ambrosia artemisiifolia = Ambrosia elatior) and clinical investigations have demonstrated that its very allergenic pollen is the main reason for the most massive, most serious and most long-lasting pollinosis.

Ambrosia was the name given to the delicious food eaten by the mythical Greek gods to make them live forever. Though the Latin name for ragweed is also Ambrosia it bears no resemblance to the heavenly food, the only food of newborns of gods. It is mentioned in the literature either as a food or as a drink and one imagines that it was some kind of honey. It was also believed to be effective for curing wound. Of course, the Greek gods did not eat ragweed.

Ragweed is an important genus of the Asteraceae family, which probably has its origin in Southern North America. This plant has evolved in response to a dry climate and open environment and from an entomophilous to an anemophilous pollination. In Europe, a steppe-prairie dry environment is found in the east-central part of the continent which, like ragweed’s region of origin, favours for its growth. The presence of ragweed here is recorded in botanical writings of the 19th century. However, it became part of the human landscape in Europe after the First World War, when contaminated seed shipments to the former Austro-Hungarian Empire allowed the plant to establish itself on this part of the continent. Ragweed can take hold and prosper, if the climatic threshold for its floral and seed development is reached, i.e. up to 1,200 °C heating degree days in total. Under a maritime climate, e.g. in North-Eastern Spain or in the Netherlands, ragweed populations seem to just survive and do not seem able to reach population levels that would make them noxious. Hence, it is the climate that limits ragweed, even if the human environment would allow its establishment (Comtois 1998).

In Western Europe, four American species have already become estabilished. These are ragweed with mugwort leaves (= short ragweed; Ambrosia artemisiifolia = Ambrosia elatior), giant ragweed (Ambrosia trifida), perennial ragweed (Ambrosia psilostachya) and silver ragweed (Ambrosia tenuifolia). However, short ragweed is the most widely spread of all, not only in Western Europe but in Hungary too (Járai-Komlódi and Juhász 1993).

Considering the annual total pollen counts of various plants measured between 1990 and 1996 in Southern Hungary, ragweed generates about half of the total pollen production (47.3%). Though this ratio depends strongly on meteorological factors year by year (in 1990 this ratio was 35.9%, while it was 66.9% in 1991), ragweed seems to be the main aero-allergen plant (Juhász 1995).

The Carpathian Basin can be considered the region most polluted with airborne ragweed pollen in Europe. Some of the highest counts on peak days are 3,247 in Novi Sad (Serbia-Montenegro 2001), 2,003 in Szeged (Hungary 1991) and 1,394 in Pécs (Hungary 1994). The highest values observed in Novi Sad and Szeged on peak days are about one order of magnitude higher than those in other cities of Europe considered to be rather polluted.

Climatic characteristics of Szeged

Szeged (20°06′E; 46°15′N) lies near the confluence of the Tisza and Maros Rivers. It is the largest city in the south-eastern part of Hungary. The population of the city is around 155,000 and its built-up area is about 46 km2. Szeged and its surroundings are a flat and open region and the city has the lowest elevation in Hungary. In addition, the country lies in the Carpathian Basin; hence, Szeged is situated in a so-called double basin, which strengthens the effects of anticyclonic circulation patterns in accumulating pollutants together with pollen concentrations. The mean annual temperature is 11.2 °C and the mean annual precipitation is 570 mm.

Data and methods

In Szeged, the pollen content of the air has been monitored with the help of a high volume pollen trap (Lanzoni VPPS 2000) since 1989. The air sampler is located in the city, on the roof (20 m above ground level) of the building of the Faculty of Arts, University of Szeged. Daily pollen data were obtained by counting all pollen grains on four longitudinal transects (Käpylä and Penttinen 1981). Meteorological data were obtained from the monitoring station located 2 km from the sampling site in the city center, which is operated by ATIKÖFE (Environmental Protection Inspectorate of the Lower-Tisza Region, Branch of the Ministry of Environment).

There have been many studies analysing the effect of meteorological parameters on the pollen concentration of various plants (Giner et al. 1999; Galán et al. 2000, 2001; Jato et al. 2000). According to these the most important parameters are temperature, humidity and precipitation.

The data basis consists of daily ragweed pollen counts and daily averages of 11 meteorological variables not only for the main pollination periods but for the whole of the period 1997–2001. The criterion of main pollination periods was introduced by Nilsson and Persson (1981), and takes into account 90% of the annual total pollen concentration, eliminating the initial 5% and the final 5%.

The daily averages of the following meteorological variables were considered mean air temperature (T mean, °C), maximum air temperature (T max, °C), minimum air temperature T min, °C), daily temperature range (Δ T = T maxT min,°C), relative humidity (RH, %), irradiance (I, Wm–2), wind speed (WS, m/s), vapour pressure (VP, mbar), saturation vapour pressure (E, mbar), potential evapotranspiration (PE, mm) and dew point temperature (T d, °C).

A new interpretation of the two-sample test

The statistical test applied developed by Makra, is a method that is considered to be a new interpretation of the two-sample test.

The basic question addressed by this test is whether or not a significant difference can be found between the averages of an arbitrary subsample of a given time series and the whole sample (Makra et al. 2000, 2002; Tar et al. 2001).

Let ξ1, ξ2, ... ξ n ... ξ N represent independent random variables of normal distribution, with mean m. Let E(ξ) be the expected value of ξ, with D(ξ) its standard deviation.

Suppose that the standard deviations of ξ i values are identical and equal to σ. Now, choose an optional subsample of n elements from the given whole time series of N elements (n < N).

Let

$$\bar m = \frac{{\xi _1 + \ldots + \xi _n }}{n}\quad {\text{and}}\quad \bar M = \frac{{\xi _1 + \ldots + \xi _N }}{N}$$
(1)

where n < N.

Then, after elementary steps, the random variable

$${\text{Probe}}\;{\text{statistics}}({\text{PS}}) = \frac{{\bar M - \bar m}}{{\sqrt {\frac{{N - n}}{{N \cdot n}}} \cdot \sigma }}$$
(2)

can be characterised by a standard normal distribution, N(0;1).

This implies that having fixed the sample mean \({\bar M}\) and the standard deviation σ, a test of the above null hypothesis for a given subsample mean \({\bar M}\) leads to the following comparison of PS and x p ,

$$p\left( {\left| {\frac{{\bar M - \bar m}}{{\sqrt {\frac{{N - n}}{{N \cdot n}}} \cdot \sigma }}} \right| > x_p } \right) = p$$
(3)

In Eq. 3, x p is taken with p probability from the distribution function of the standard normal distribution, corresponding to a selected 0 < p1 probability threshold.

If the absolute value of PS (Eq. 2) is higher than x p then \(\bar M\) and \({\bar m}\) differ significantly. The null hypothesis, according to which there is no significant difference between \(\bar M\) and \({\bar m}\), can be rejected with the significance level p. (The significance tests are carried out at p = 0.01 significance level.)

This is a sufficient condition, ensuring the normal distribution of PS in Eq. 2, but it can be softened in the case of very large N and n, in the light of the central limit theorem. Stationarity and independence of the original distribution, however, are unavoidably necessary conditions of the Makra test.

The Makra test performs Eq. 3 for all possible subsamples with n = 3, 4, ..., N – 1 elements of duration, starting from the 1st, 2nd, ..., (Nn)th element of the time series. For example, in the case of 99 pieces of data (years) this means 4,752 repeated comparisons of the subsample average and the overall mean. Detection of significant deviations also includes information on their duration, onset and end (Makra et al. 2000, 2002; Tar et al. 2001).

When performing the Makra test, the significant breaks obtained may or may not be distinct. If they are not distinct, only one break is considered to be significant; namely, that for which the value of the test statistics is the maximum.

Factor analysis

In order to reduce the dimensionality of the above data set and thus to explain the relations between the 12 variables, the multivariate statistical method of factor analysis is used. The main object of factor analysis is to describe the initial variables X 1, X 2, ..., X p in terms of m linearly independent indices (m < p), the so-called factors, measuring different “dimensions” of the initial data set. Each variable X can be expressed as a linear function of the m factors:

$$X_i = \sum\limits_{j = 1}^m {\alpha _{ij} F_j }$$
(4)

where α i,j are constants called factor loadings. The square of α i,j represents the part of the variance of X i that is accounted for by the factor F j . It is a common practice for both the initial set of parameters X i and the resultant factors F j to be standardised having zero mean and unit variance. The first argument for using standardised variables is to give all variables equal weight, whereas the original variables may have extremely different variances. Another reason for using standardised variables is to overcome the problem of the different units of the various variables used. From the above, it is apparent that α i,j ≤ 1. If a factor loading ∣α i,j ∣→ 1, the variable X i is highly correlated to the factor F j . Furthermore, a high correlation of some of the initial variables with the same factor is strong evidence of their covariability. The knowledge of covariability among the variables is a very important tool, since significant conclusions can be drawn for the causes of variation and/or the linkages between the initial variables (Bartzokas and Metaxas 1993, 1995; Sindosi et al. 2003).

One important step in this method is the decision about the number (m) of retained factors. Many criteria have been proposed for this. In this study, the Guttmann criterion or rule 1 is used, which keeps the factors with eigenvalues greater than 1 and neglects those that do not account for at least the variance of one standardised variable X i . Extraction was performed by principal component analysis (the kth eigenvalue is the variance of the kth principal component). There are an infinite number of alternative equations to Eq. 4. In order to select the best or the desirable ones, so-called factor rotation is applied, a process that either maximises or minimises factor loadings for a better interpretation of the results. In this study, “varimax” or orthogonal factor rotation is applied, which keeps the factors uncorrelated (Jolliffe 1990, 1993).

The factors can be considered vectors, components of which are the factor loadings. The complexity of the factors represents a matrix of factor loadings. The number of the factors is generally substantially less than that of the original variables. Since the complexity of the factors reflects that of the original variables, the problem examined becomes much simpler and can be interpreted much more easily.

The analysis was applied to a data set consisting of 12 columns (variables) and 465 rows (days) between 15 July and 15 October (first day: 15 July; last day: 15 October 1997–2001.

The linear regression model

In the linear regression model, the dependent variable is assumed to be a linear function of one or more independent variables plus an error introduced to account for all other factors:

$$ y_{i} = \beta _{1} x_{{i1}} + \ldots + \beta _{k} x_{{ik}} + u_{i} $$
(5)

In the above regression equation, y i is the dependent variable, x i1, ..., x ik are the independent or explanatory variables, and u i is the disturbance or error term. The goal of regression analysis is to obtain estimates of the unknown parameters β1, ..., β k , which indicate how a change in one of the independent variables affects the values taken by the dependent variable.

The regression line expresses the best prediction of the dependent variable y i . However, nature is rarely (if ever) perfectly predictable and usually there is substantial variation of the observed points around the fitted regression line. The deviation of a particular point from the regression line (its predicted value) is the error of the estimate, which is known as the deviation or residual value.

The aim of the method is to determine the values of the parameters that minimize the sum of the squared residual values (the sum of the squared errors of prediction) for the set of observations. This is known as a least-squares regression fit.

The square root of the average squared error of prediction is used as a measure of the accuracy of prediction. This measure is called the standard error of the estimate and is designated as σ est . The formula for the standard error of the estimate is as follows:

$$\sigma _{{est}} = {\sqrt {\frac{{{\sum\limits_{i = 1}^n {(y_{i} - {y}^{\prime}_{i} )^{2} } }}}{n}} }$$
(6)

where n is the number of pairs of (x, y) points.

The smaller the variability of the residual values around the regression line relative to the overall variability, the better is our prediction. For example, if there is no relationship between the x and y variables, then the ratio of the residual variability of the y variable to the original variance is equal to 1. If x and y are perfectly related, then there is no residual variance and the ratio of variance would be 0. In most cases, the ratio would fall somewhere between these extremes, that is, between 0 and 1. One minus this ratio is referred to as R 2 or the coefficient of determination. This value is immediately interpretable in the following manner. If we have an R 2 of 0.4, then we know that the variability of the y values around the regression line is 1.0–0.4 times the original variance; in other words we have explained 40% of the original variability, and are left with 60% residual variability. Ideally we would like to explain most if not all of the original variability. The R 2 value is an indicator of how well the model fits the data (e.g., R 2 close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model).

All statistical computations were performed with SPSS (version 9.0) software.

Results and discussion

Characteristics of the main pollination period of ragweed pollen for the 5-year data set examined as well as their averages, are shown in Table 1. Both the starting date of the pollination period (between 20 June and 13 July) and the finishing date (11–29 October) vary widely. The duration, average daily count and total count, except for 1998, show definite fluctuations (the total count: the sum of daily counts per cubic metre of air per year). On the other hand, 5 years might not be enough to detect clear trends.

Table 1. Characteristics of ragweed pollen in Szeged for their main pollination period, according to Nilsson and Persson (1981)

Other parameters of ragweed pollen

Figure 1 shows daily ragweed pollen counts for the five data sets examined. Furthermore, averages for each pollination period (1997–2001) are also displayed (dotted lines). The curves display 5-day running averages; vertical lines indicate the main pollination period. Maximum values can be observed between 15 August and 15 September. This 1-month period is characterised by the highest ratio of ragweed pollen to total pollen release: 1997: 79.3%, 1998: 77.3%, 1999: 86.6%, 2000: 83.5%, 2001: 86.9%.

Fig. 1.
figure 1

Ragweed pollen counts and their running means, 1997–2001. Dotted lines The averages for the period examined Vertical lines The main pollination period

Five-year daily averages of the ragweed pollen counts for the period between 15 July and 15 October were analysed with the help of the Makra test. According to the χ2-test, the data set of these average values fits the normal distribution; hence, the Makra test can be applied. The result received is reliable if this subset of the pollination period was chosen, since, on the one hand, the main pollination period differs year by year and, on the other hand, the period between 15 July and 15 October contains almost the annual total of the pollen grains each year. Only a mere fraction (0.35% in 1997; 0.05% in 1998; 0.08% in 1999; 0.38% in 2000 and 0.04% in 2001) can be found beyond it.

With the help of the Makra test we determined whether or not significant differences can be found between the average of an arbitrary subsample of the time series mentioned and that of the whole sample. It was found that averages of the subsamples with periods 15 July–14 August, and 17 September–10 October are significantly lower than that of the whole sample, while the average of the period between 16 August and 13 September is significantly higher than that of the whole sample. This means that, according to the data set examined, the period between 16 August and 13 September can be considered to be the one most polluted by ragweed pollen in the air, hence the most dangerous one for pollinosis (Fig. 2). The result, obtained by this method indicates the same period as the most dangerous one for the highest pollen concentration, established according to Fig. 1.

Fig. 2.
figure 2

Subperiods with significantly different averages of ragweed pollen counts from the mean of the entire data series, i.e. the “breaks”; diurnal average pollen counts, Szeged, 15 July–15 October, 1997–2001

The usefulness and the added value of the Makra test is that, by applying it, we can determine whether or not the averages of non-independent data sets, namely, the average of a selected subsample, and that of the whole data set differ significantly.

One might ask, why the Makra test is used and not the two-sample test or the F-test. The main difference between the methods is that the Makra test compares two non-independent samples (the whole sample and any of its subsamples) which, after elementary steps, become independent of each other whereas the two-sample test and the F-test compare two different and independent samples.

Connection of ragweed pollen concentration with meteorological elements

The pollen season starts about the middle of July and ends about the middle of October, reaching a maximum in the middle of this 12-week time span. During this period temperature and vapour pressure have a systematic trend towards lower values. The highest temperatures are measured at the beginning of the pollen season, together with low pollen concentrations, and the lowest temperatures at the end of the pollen season, again with low pollen concentrations. If this systematic trend of the meteorological variables is not considered, the factor analysis applied might be influenced by it and the multiple regression model might not be as good as it could be.

In order to analyse any connection between ragweed pollen concentration and the meteorological elements, multivariate statistical analysis was applied with the help of factor analysis.

Factor analysis was performed for the 465-day-long period containing 5-year daily values of the examined 12 variables between July 15 and October 15 1997–2001. The mean seasonal variation was subtracted from the original daily values of each variable. According to the χ2-test the deseasonalised data sets fit the normal distribution; therefore, the method can be applied.

Factor loadings of the rotated component matrix are shown in Table 2. After performing factor analysis, according to the Guttmann criterion, four factors were retained. Eigenvalues of the retained and rotated components as well as the explained and cumulative variances are also shown in Table 2. The four retained factors explain 84.4% of the total variance of the 12 original variables.

Table 2. Factor loadings of the rotated component matrix. Loadings higher than ∣0.091∣ (the threshold value for the 95% significance level) are written in bold type. The variables represent deaseasonalised data sets

When performing factor analysis on standardised data, factor loadings received are correlation coefficients between the original variable and its co-ordinates (factor scores) belonging to the rotated axes. Statistical significance of a factor loading – as a correlation coefficient – can be calculated by the following formula:

$$t = \sqrt {\frac{{r^2 (n - 2)}}{{1 - r^2 }}} $$
(7)

where r is the given factor loading, n is the number of the element pairs [n = 465 is the length of the 5-year (1997–2001) daily deseasonalised values of the 12 variables, n – 2 is the number of degrees of freedom and t is the parameter to be calculated, which follows Student’s t-distribution. If the absolute value of t (in the case of n – 2 = 463 degrees of freedom and 5% significance level) is higher than the threshold value of Student’s t-distribution (t threshold = 0.091) then we can say that the null hyphotesis – according to which the time series of the original variable and that of its factor scores are independent – is not fulfilled. Hence, if values of factor loadings are higher than 0.091 then the a priori hypothesis, according to which the two time series are independent, is rejected. For the sake of an easier survey, factor loadings higher than ∣0.091∣ are indicated by bold figures (Bartzokas and Metaxas 1993, 1995; Sindosi et al. 2003).

Connections between the 12 variables examined will now be analysed according to the factor loadings of the retained and rotated factors (Table 2).

Factor 1

This factor explains 42.8% of the total variance and, on the basis of its significant factor loadings, contains the following daily parameters: mean air temperature (T mean), saturation vapour pressure (E), dew point temperature (T d), maximum and minimum temperatures (T max and T min), potential evapotranspiration (PE), relative humidity (RH), irradiance (I), daily temperature range (ΔT = T maxT min), and wind speed (WS). Factor 1 reveals the opposite relationship between, on the one hand, potential evapotranspiration and, on the other hand, dew point temperature (the temperature at which the air becomes saturated), vapour pressure and relative humidity. Since the factor loading of the daily pollen concentration is not significant, no relation with the meteorological variables considered (however many of them have significant factor loadings) can be established.

Factor 2

This factor explains 22.9% of the total variance and comprises each of the 12 variables. Only the irradiance (I) has negative factor loading. Temperature (T mean, T min, T max) and humidity (E, T d, VP, PE) variables are in direct proportion. Higher temperature involves higher potential evapotranspiration (PE) and, thus, higher vapour pressure (VP), too; hence, dew point temperature (T d) will also be higher. The factor loading of the daily pollen concentration is relatively low; however it is significant and is in direct proportion to the variables mentioned, except for irradiation (I).

In the light of the above results, why might hot and moist weather promote high pollen concentrations in the air? The role of temperature is clear; it is in direct proportion to pollen concentration. On the other hand, the effect of humidity might be complex. According to Giner et al. (1999), a nightly relative humidity in excess of 60% appeared to influence the atmospheric pollen concentration negatively during the day but, when the relative humidity was higher than 80% in the morning, pollen concentrations increased again. Another factor might be that, in the case of high relative humidity, pollen grains with an uneven surface can stick together more easily, causing the sampler to trap more pollen grains. Also, it has been observed in Szeged that, on the day following a rainy day (when relative humidity is higher), pollen concentration suddenly increases.

Factor 3

This factor explains 10.2% of the total variance and contains the mean air temperature (T mean), saturation vapour pressure (E), vapour pressure (VP), minimum air temperature (T min), irradiance (I), daily temperature range (ΔT = T maxT min) and wind speed (WS). In this factor, daily temperature range, together with minimum temperature, has the highest loading. The factor loading of the daily pollen concentration is not significant; hence, its relationship to the variables mentioned can not be interpreted.

Factor 4

This factor explains 8.4% of the total variance and comprises only daily pollen concentration, maximum temperature (T max) and wind speed (WS). High loadings of daily pollen concentration (–0.522) and wind speed (0.854) indicate that only part of the variance of daily pollen concentration is controlled by the above-mentioned temperature and humidity variables.

A study of factors 2 and 4 raises the question: why can wind promote high pollen concentrations in some cases and not in others? Under which conditions would pollen accumulate in the plant and under which conditions would pollen be released? The higher concentration of airborne pollen sometimes depends less on phenological phases and meteorological conditions than on transportation by the wind. Wind direction was not considered when factor analysis was performed (because of the disturbing effect of a building, situated next to the meteorological station). Pollen concentration depends not only on wind speed but on wind direction too. If winds blow from a source area, the pollen concentration increases; on the other hand, wind from other directions would modify the pollen counts less. The pollen concentration may also increase through the re-suspension of high-atmosphere pollen that could have been released several hours earlier (Giner et al. 1999; Fehér and Járai-Komlódi 1997).

Assessment of the daily average number of ragweed pollen grains

Further statistical analysis was performed to detect the influence of meteorological parameters on the daily ragweed pollen counts for the years examined. For the calculations, data for the period 1997–2000 were used, while the year 2001 was considered to be the reference year; namely, the assessment was performed for this year.

Let us introduce some notations. Let p be the daily pollen number, p′ the mean seasonal variation of the daily pollen number (1997–2000) and p 1–3 the average pollen number for the preceding 3 days. The daily sums of precipitation (namely: rainy days) were not considered (e.g. Giner et al. 1999). The reason for this is that the role of precipitation is complicated just because of the negative effect of rain intensity on pollen counts (Galán et al. 2000). According to Fornaciari et al. (1992), on the one hand, the best correlation was obtained by comparing pollen concentrations (Urticaceae) and meteorological parameters on non-rainy days and, on the other hand, correlating rainfall and pollen concentration is very difficult.

Multiple regression analysis was performed for the above-mentioned variables, using the “Enter” method. This method lets us select how independent variables are entered into the analysis. The Enter method enters all variables at the same time.

The multiple regression analysis can have two aims: (a) the estimation of future data on the basis of past ones; (b) to detect relations between the independent variables considered and the dependent variable. In the latter case, of course, we also receive an estimation for the chosen dependent variable (p), and the daily meteorological variables can also be used for estimating the daily number of pollen grains. We chose this latter aim. According to which, p′ and p 1–3 and the meteorological variables were used in the regression procedure in order to detect the goodness of their connection with p.

Since p′ is the mean seasonal variation of the daily pollen number (1997–2000), it is independent of the daily pollen number (p) of the reference year (2001). Furthermore, p′contributes to the quality of the regression considerably, since its standardized coefficient is much higher than that of the other variables.

The daily pollen concentration (p) depends on the average pollen number for the preceding 3 days (p 1–3). On the other hand, our aim is to detect the goodness of their relation; hence, p 1–3 is applied in the regression procedure. Besides, application of p 1–3 in the regression model gives better estimation for p.

When performing regression analysis, firstly the variable pollen number (p) or its modified variant (Δp) is considered to be the dependent variable, while the meteorological variables or their modified variants are counted as independent variables (in some cases not only meteorological variables are considered to be independent variables). The modified variable is defined as the difference of the original variable and its mean seasonal variation. (They are signed as follows: Δp, ΔT mean, ΔT max, ΔT min, ΔE, ΔVP, ΔT d, ΔI, ΔWS, ΔRH, ΔPE.)

Significance values of each independent variable are evaluated during the regression process. The less the significance value belonging to a variable, the better the role of the variable in the regression is. If the significance value belonging to the variable is relatively great, the variable cannot significantly improve the accuracy of the estimate.

The regression analysis is performed by using several steps. The variable having the greatest significance value will be omitted from the next step of the regression analysis. The procedure is carried out step-by-step until the significance values belonging to each variable become less than about 0.2. Five different cases were examined according to the selection of dependent and independent variables. The results (cases 1–5) are summarized in Table 3.

Table 3. Regression analysis of the daily pollen number (p) as well as that of the difference of the daily pollen number and its mean seasonal variation (Δp). Variables represented in each case were considered in the first step of the regression analysis, while those indicated by bold type have remained until the last step of the regression. 1Δp 1–3 is the difference between the average pollen number for the preceding 3 days and its mean seasonal variation. \( {\text{Error}} = {\sum\limits_{{\text{time}}\;{\text{interval}}} {|({\text{real}}\;{\text{value}} - {\text{estimate}}\;{\text{value}})/({\text{real}}\;{\text{value}})} }| \). Values of the error function can only be used to compare various regression equations.

Since WS and RH as original variables have no seasonality (trend) (Fig 34), they were used in the regression analysis instead of variables ΔWS and ΔRH. According to the preliminary expectations, the result was the same (see case 1, Table 3). WS and RH seem not to play a role in the regression analysis in determining the daily pollen counts.

Fig. 3.
figure 3

The mean seasonal variation of the daily wind speed (WS)

Fig. 4.
figure 4

The mean seasonal variation of the daily relative humidity (RH)

Except for case 2, each resulting regression equation comprises the following variables: I, E, PE. Hence they might have a greater role in daily pollen concentration. Variables T d and VP occur in three cases; therefore, they might have a limited role compared to the former variables. If Δp 1–3 is considered to be an independent variable, together with meteorological variables (deseasonalised data), then the regression equation contains only the temperature variables of meteorological ones (case 2, Table 3).

The equation resulting from the regression was applied to the data. When applying the regression analysis, the error was only calculated for those days, on which the daily pollen number exceeded 10. The reason for this is as follows. When comparing differences between the averaged daily and the estimated daily pollen numbers, then, considering the accuracy of the forecast, it is not very important if, in the case of a daily pollen count below 10 grains/m3, the error of the assessment is large (e.g. the procedure forecasts 2 or 3 pollen grains/m3 instead of 1). As mentioned above, the threshold value for clinical symptoms is considered to be 20 pollen grains/m3 air (Jäger 1998)].

After performing the regression analysis only two cases are emphasised, since the lowest errors (Table 3) belong to following equations.

$$ \begin{array}{*{20}l} {{\Delta p} \hfill} & { = \hfill} & {{0.089 + 12.751\Delta T_{{{\text{mean}}}} - 6.385\Delta T_{{\max }} - 2.708\Delta T_{{\min }} + 0.729\Delta p_{{1 - 3}} \quad ({\text{case}}\;2)} \hfill} \\ {p \hfill} & { = \hfill} & {{ - 27.510 - 0.123{\text{I}} - 35.017E + 88.243{\text{PE}} + 32.770{\text{VP}} - } \hfill} \\ {{} \hfill} & {{} \hfill} & {{ - 1.756T_{{\min }} + 0.464{p}' + 0.588p_{{1 - 3}} \quad ({\text{case}}\;5)} \hfill} \\ \end{array} $$
(8)

If the estimated daily pollen numbers (p, regression Eq. 8) are compared to the original ones, the equations seem to follow the trends tightly (Figs. 56). Both figures clearly indicate, on the one hand, that maxima are underestimated and, on the other hand, have negative estimates too. Estimates for case 2 show higher extremes than those for case 5 (see e.g. dates 14 August 1997; 18 August 1998; 04 September 2000; 02 September 2001; 28 August 2001). Results represented by Figs. 56 reflect the explained variances (R, Table 3).

Fig. 5.
figure 5

Recorded and forecasted daily pollen numbers 1997–2000

Fig. 6.
figure 6

Recorded and forecasted daily pollen numbers 2001

For purpose of comparison, input data for case 2 and case 5 are the same; however, different variables were prepared from them (original data, deseasonalised data) (Table 3).

Conclusion

Parameters of ragweed pollen [maximum daily concentration per year; total number per year; first observation day; last observation day; duration (days); average daily number; number of days per year exceeding the threshold value for clinical symptoms (20–30–50 pollen grains m–3day–1)], fluctuate considerably; hoewver, the short observation period does not allow to perform trend analysis.

The ragweed pollen load of Szeged is at its most serious between 16 August and 13 September, hence this is the most dangerous period for pollinosis. This period of highest pollen concentration, established by the Makra test, confirms the results of empirical pollen calculations.

Application of factor analysis reduced the dimensions of the original data set (daily ragweed pollen concentration and the 11 meteorological variables examined), faciliating the detection of connections among them. After factor analysis, four factors were retained according to the Guttmann criterion. These four factors explain 84.4% of the total variance of the original 12 variables.

When factor analysis is performed with deseasonalised data instead of the original ones, the most characteristic changes in factor loadings belonging to the variables considered are as follows. Daily pollen concentration has significant loadings in factors 2 and 4 instead of factors 1, 3 and 4. The parameters play, to various extents, a positive role in increasing daily pollen concentrations, except for irradiance, which seems to be inversely related to daily pollen number. Minimum air temperature indicates a clear positive connection with daily pollen concentration. A significant connection between the daily pollen number and wind speed, with both positive and negative seems to explain both high and low pollen concentration as a consequence of high wind speed. This might hint at the ambivalent role of wind speed.

Surprisingly, if seasonality is subtracted, then the explained variance of the regression decreases. According to our preliminary expectations the relationship between daily pollen number and the meteorological variables would have been more clear-cut. However, this is not the case. Seasonality does not play such a great role in the time interval we have examined. It might be that a statistical description of pollen data simplifies complex processes like phenology and dispersion of pollen in the air.

This is a preliminary study based on data from only 5 years. The resulting regression equation is only an approximate indication of which variables are useful for predicting ragweed pollen counts. Further data are required to improve the model accuracy and, hence, to allow refined conclusions to be drawn. In further research, phenological phases of ragweed will also be taken into account, that influence the formation and ripening of the pollen in the plant. A phenological model for ragweed, which calculates the timing of phenological phases on the basis of temperature sums, will be helpful to improve the accuracy of the prediction model. Another direction for further research might be the analysis of the relation between the different types of air mass established for the Carpathian Basin and pollen concentrations.