
1 Introduction

Worldwide, attention is increasingly drawn to the quality of atmospheric air and the negative impact of air pollutants on human health. Numerous studies in the field of ecology report worrying statistics on poor air quality and the growing number of people who suffer from disease or die as a result of air pollution [1, 2]. Recent data show that Bulgaria ranks among the European countries with the most polluted air and the highest levels of particulate matter 10 \(\upmu \text {m}\) in diameter or smaller (PM10) [3, 4]. The main sources of PM10 are emissions from various fuel-based commercial and domestic processes, busy traffic and others, predominantly during the winter periods. In Bulgaria, 36 automated measuring stations constantly monitor the levels of 12 main pollutants, including PM10, and are controlled by the national Executive Environment Agency (EExA) [5].

In the literature, various methods are used to investigate and forecast air pollutant concentrations in order to ensure air quality and to prevent pollution. Multivariate linear and non-linear regression, principal component analysis, factor analysis, cluster analysis, and others have a wide range of applications [6, 7]. A large number of publications apply stochastic methods for modeling time series of air pollutants (see [8, 9] and the literature cited therein). The most frequently used methods of this type are ARIMA, transfer function methods and others based on the Box-Jenkins methodology [10].

In recent years, alongside these methods, the scientific community has become increasingly interested in new computer-oriented, data-driven machine-learning methods. These methods extract dependencies directly from the data and generate mathematical models from them. The most widely used are artificial neural networks (ANN), support vector machine (SVM) regression, random forest (RF), multivariate adaptive regression splines (MARS), etc. In [11], forecasts are made for the concentrations of SO2 by applying three different methods: least squares SVR, MARS, and M5 Model Tree. It was found that the MARS method is the second best for predicting SO2 concentrations. The authors of [12] present predicted PM10 concentrations for the city of Sarajevo, obtained using the ANN method with meteorological input variables (wind speed, humidity, temperature and pressure) over the period from 2010 to 2013. Predictive RF models are built and analyzed in [13] for data from 31 air quality monitoring sites in Switzerland, using surface meteorological, synoptic scale, boundary layer height, and time variables to explain daily PM10 concentrations. A recent summary paper [14] presents the current state of air pollution forecasting methods and published results.

This study applies the non-parametric data-driven regression technique MARS to analyze and model the concentrations of PM10 in the town of Smolyan, Bulgaria. The objective of this empirical case study is to construct the most suitable models, which describe the measured data with the highest degree of predictive power, and to use them to forecast future pollution up to 7 days ahead.

Statistical procedures are carried out using the software packages Salford Predictive Modeler (SPM) [15] and IBM SPSS [16].

2 Description of the Study Area and Data

The town of Smolyan is located in South Bulgaria and is the administrative center of Smolyan municipality and Smolyan province. The town has a population of 28,000 people and is the smallest main provincial town in Bulgaria. The climate of Smolyan is transitional-continental with predominant Aegean influence. The terrain is mountainous with an abundance of coniferous forests. The average altitude is around 1000 m. Winters are mild and snowy, with a stable snow cover lasting between 80 and 120 days, and summers are cool and humid. The average temperature is between \(-1\) and \(-5\,^\circ \text {C}\) in January and between 13 and \(20\,^\circ \text {C}\) in July. The town of Smolyan was chosen for this study as representative of the mountain settlements and regional centers in Bulgaria with systemic PM10 pollution. Similar examples are the cities of Blagoevgrad, Kardzhali and others.

The investigation uses average daily measurements of the pollutant PM10 (\({\upmu }\text {g/m}^3\)) in the town of Smolyan over a period of about 9 years, from 1 January 2010 to 30 November 2018, for a total of N = 3254 observations. The PM10 variable has 167 missing values in the observed time interval, which is about 5%. In the modeling procedures, the missing data are replaced using linear interpolation. The analysis also includes 13 initial predictors: CLOUD (cloud cover), HUMID (relative air humidity), MAXT (maximum daily temperature), MINT (minimum daily temperature), PRESS (atmospheric pressure), WINDSP (wind speed), PRECIP (precipitation), as well as lagged variables such as PM10_1, MINT_1, and others, which hold the values from the previous day. In order to maintain correspondence with the weather, the models also use 3 time variables: YEAR_MONTH, MONTH, and MONTH_DAY.
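The two preprocessing steps described above, gap filling by linear interpolation and building previous-day (lagged) predictors, can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used in SPM or SPSS; the function names `interpolate_linear` and `lag` and the sample values are assumptions made for the example.

```python
from typing import List, Optional


def interpolate_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) on the straight line between the nearest
    known neighbours. Leading and trailing gaps are left as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = values[right] - values[left]
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left) / (right - left)
    return filled


def lag(values: List[Optional[float]], k: int = 1) -> List[Optional[float]]:
    """Shift a daily series by k days, so that lagged[i] == values[i - k].
    The first k entries have no predecessor and are set to None."""
    return [None] * k + list(values[: len(values) - k])


# Hypothetical daily PM10 values with a two-day gap (not data from the study).
pm10 = [30.0, None, None, 60.0, 45.0]
pm10_filled = interpolate_linear(pm10)   # gap filled on a straight line
pm10_1 = lag(pm10_filled, 1)             # previous-day predictor, like PM10_1
print(pm10_filled)
print(pm10_1)
```

The first row of a lagged variable has no previous day, so in practice that observation is either dropped or treated as missing.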

Table 1 shows descriptive statistics of the initial data for the town of Smolyan. The table indicates that the average value of the pollutant PM10 is 44.14 \({\upmu }\text {g/m}^3\), which is close to the maximum permissible average daily threshold of 50 \({\upmu }\text {g/m}^3\) as per European and national standards [17, 18]. The maximum recorded value is 260.5 \({\upmu }\text {g/m}^3\), which exceeds the threshold by a factor of five. The initial analysis clearly shows that the PM10 air pollutant is problematic for the town.

Table 1. Descriptive statistics of the initial data.

Figure 1 shows the sequence plot of the time series for the PM10 pollutant over the considered 9-year period. The horizontal line in the plot marks the regulatory limit of 50 \({\upmu }\text {g/m}^3\). The plot clearly shows multiple instances where this threshold value is exceeded. The exceedances are observed mainly during the cold months of the year. This is explained by the fact that households use mainly solid fuel and coal for heating.

Fig. 1. Sequence plot of the measured daily concentrations of PM10 for the town of Smolyan.

3 MARS Method

The MARS method was developed by the American statistician and physicist Friedman in 1990–1991 [19]. The advantage of the MARS method lies in its ability to process both small- and large-scale datasets. If the dependent variable (here PM10) is \(y=y(X)\) and the predictors \(X=(X_{1},X_{2}, \ldots , X_{p})\) are vectors with dimension N, the MARS model \(\widehat{y}=\widehat{y}_{[M]}\) has the following form:

$$\begin{aligned} \widehat{y}_{[M]}=b_{0}+\sum ^{M}_{j=1}b_{j}BF_{j}(X), \end{aligned}$$
(1)

where \(b_{0}, b_{j}, j=1,2, \ldots , M\) are the coefficients in the model, \(BF_{j}(X)\) are its basis functions (BFs), M is their number, and p is the number of predictors in the model. A one-dimensional BF has one of the following forms:

$$\begin{aligned} BF_{j}(X)=\max (0, X_{k}-c_{k,j}), \quad BF_{j}(X)=\max (c_{k,j}-X_{k}, 0), \end{aligned}$$
(2)

where \(c_{k,j}\in X_{k}\) are nodes (knots) selected and fixed by the MARS algorithm, and \(X_{k}\) is a variable (predictor), \(k=1,2, \ldots , p\). For non-linear interactions, BFs are built as products of other BFs.
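To make formulas (1) and (2) concrete, the sketch below evaluates a small MARS-type model with two one-dimensional BFs and one interaction BF. The coefficients, nodes, and helper names (`hinge_pos`, `hinge_neg`, `product`, `mars_predict`) are hypothetical and chosen only for illustration; they are not taken from the fitted models in this study.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]          # one observation: predictor name -> value
BasisFn = Callable[[Row], float]


def hinge_pos(name: str, knot: float) -> BasisFn:
    """BF of the form max(0, X_k - c)."""
    return lambda row: max(0.0, row[name] - knot)


def hinge_neg(name: str, knot: float) -> BasisFn:
    """BF of the form max(c - X_k, 0)."""
    return lambda row: max(knot - row[name], 0.0)


def product(*factors: BasisFn) -> BasisFn:
    """Interaction BF built as a product of other BFs."""
    def bf(row: Row) -> float:
        value = 1.0
        for factor in factors:
            value *= factor(row)
        return value
    return bf


def mars_predict(row: Row, b0: float, coefs: List[float], bfs: List[BasisFn]) -> float:
    """Evaluate formula (1): b0 + sum_j b_j * BF_j(X)."""
    return b0 + sum(b * bf(row) for b, bf in zip(coefs, bfs))


# Hypothetical model: nodes at PM10_1 = 40 and MINT = 0, plus their interaction.
bfs = [hinge_pos("PM10_1", 40.0), hinge_neg("MINT", 0.0)]
bfs.append(product(bfs[0], bfs[1]))
coefs = [0.8, 1.5, 0.05]

day = {"PM10_1": 60.0, "MINT": -4.0}
forecast = mars_predict(day, 20.0, coefs, bfs)
print(round(forecast, 2))
```

Each BF is zero on one side of its node, so a term contributes to the prediction only in the part of the predictor range where it is active. This is what makes the fitted model piecewise linear.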

When constructing the model, some control parameters need to be set by the researcher: the maximum number of basis functions (\(L_{0}\)) and the maximum number r of multipliers (i.e. the degree of interactions) in the BFs, where \(L_{0}\ge 3p\). The algorithm consists of two steps. During the first step, an initial value for \(b_{0}\) is taken (for example \(b_{0}=\min _{i} y_{i}\)) and the model is then successively extended with BFs of type (2). For the current model with L BFs, the variables and nodes are selected so as to minimize the sum

$$\begin{aligned} SSL_{[L]}=\sum ^{N}_{i=1} \left[ y_{i}-\widehat{y}_{[L],i}(X)\right] ^2. \end{aligned}$$
(3)
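The node selection in the first step can be illustrated with a deliberately simplified sketch: for a single predictor and a single BF of the form max(0, x - c), every observed value is tried as a candidate node and the one giving the smallest sum of squared residuals (3) is kept. The actual MARS algorithm also searches over all predictors, adds BFs in reflected pairs, and refits all coefficients jointly, so the code below (with the assumed helper names `fit_line` and `best_knot` and synthetic data) only shows the principle.

```python
from typing import List, Tuple


def fit_line(h: List[float], y: List[float]) -> Tuple[float, float]:
    """Closed-form least squares fit of y ~ b0 + b1 * h."""
    n = len(y)
    mean_h = sum(h) / n
    mean_y = sum(y) / n
    s_hh = sum((hi - mean_h) ** 2 for hi in h)
    if s_hh == 0.0:
        # The BF is constant on this sample, so only the intercept is fitted.
        return mean_y, 0.0
    s_hy = sum((hi - mean_h) * (yi - mean_y) for hi, yi in zip(h, y))
    b1 = s_hy / s_hh
    return mean_y - b1 * mean_h, b1


def best_knot(x: List[float], y: List[float]) -> Tuple[float, float, float, float]:
    """Try each observed x as a node c for max(0, x - c) and return
    (node, sum of squared residuals, b0, b1) for the best candidate."""
    best = None
    for c in sorted(set(x)):
        h = [max(0.0, xi - c) for xi in x]
        b0, b1 = fit_line(h, y)
        sse = sum((yi - (b0 + b1 * hi)) ** 2 for hi, yi in zip(h, y))
        if best is None or sse < best[1]:
            best = (c, sse, b0, b1)
    return best


# Synthetic data: flat at 10 up to x = 3, then rising with slope 2.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 10.0, 10.0, 10.0, 12.0, 14.0, 16.0]
knot, sse, b0, b1 = best_knot(x, y)
print(knot, round(b0, 6), round(b1, 6))
```

On this synthetic series the search recovers the break at x = 3, because that node is the only one for which a single hinge reproduces the data without residual error.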

The second step of the method includes an assessment of the accuracy of the constructed MARS model and a reduction of the number of BFs using the generalized cross-validation (GCV) criterion, which takes into account not only the residual error but also the complexity of the model [19]. BFs which do not contribute significantly to the accuracy of the model are removed. The form of the criterion is:

$$\begin{aligned} GCV_{[m]}=\frac{\sum ^{N}_{i=1} \left[ y_{i}-\widehat{y}_{[m],i}(X)\right] ^2}{N(1-C(m)/N)^2}, \ \ m=1,2, \ldots , L_{1}, \ \ L_{1}\le L_{0}. \end{aligned}$$
(4)

Here \(C(m)=m+\delta (m-1)/2\) with \(\delta \in [2, 3]\). The second step of the algorithm prevents the overfitting of the model. The best MARS models were selected as those with the highest coefficient of determination \(R^2\), the highest GCV \(R^2\) obtained after step 2, and the minimum root mean square error (RMSE), given by the expressions

$$\begin{aligned} R^2=\frac{\sum ^{N}_{i=1} (\widehat{y}_{i}-\overline{y})^2}{\sum ^{N}_{i=1} (y_{i}-\overline{y})^2}, \ \ RMSE=\sqrt{\frac{1}{N} \sum ^{N}_{i=1} (y_{i}-\widehat{y}_{i})^2} \end{aligned}$$
(5)

where \(y_{i}\) are the observed values of the dependent variable \(y=PM10\), \(\overline{y}\) is its mean, and \(\widehat{y}_{i}\) are the values predicted by the MARS model.
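The criteria in (4) and (5) are straightforward to compute once observed and predicted values are available. The sketch below is an illustration on made-up numbers; the function names and the default penalty `delta=3.0` are assumptions for the example, and \(R^2\) is computed in the explained-variance form written in (5).

```python
import math
from typing import Sequence


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root mean square error, as in (5)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r_squared(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """R^2 as in (5): explained sum of squares over total sum of squares."""
    mean_y = sum(y) / len(y)
    explained = sum((b - mean_y) ** 2 for b in y_hat)
    total = sum((a - mean_y) ** 2 for a in y)
    return explained / total


def gcv(y: Sequence[float], y_hat: Sequence[float], m: int, delta: float = 3.0) -> float:
    """GCV as in (4) for a model with m BFs, using C(m) = m + delta*(m-1)/2."""
    n = len(y)
    c = m + delta * (m - 1) / 2
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return sse / (n * (1 - c / n) ** 2)


# Made-up observed and predicted values (not data from the study).
y = [10.0, 20.0, 30.0, 40.0]
y_hat = [12.0, 18.0, 32.0, 38.0]
print(rmse(y, y_hat), r_squared(y, y_hat))
print(gcv(y, y_hat, m=1), gcv(y, y_hat, m=2))
```

For the same residuals, GCV grows quickly as m increases, which is how the second step penalizes BFs that add complexity without a matching gain in accuracy.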

4 Results and Discussion

4.1 Construction of MARS Models

An important part of constructing MARS models is the choice of control parameters. Given the relatively large sample size N, maximum numbers of BFs of 100, 200 and 300 and maximum degrees of interaction between the predictors r = 2, 3 and 4 were selected. Furthermore, a more detailed analysis of the PM10 observations showed only two very high measured values: 260.5 \({\upmu }\text {g/m}^3\) and 253 \({\upmu }\text {g/m}^3\). In order not to complicate the models unnecessarily, these two outliers were replaced by linear interpolation.

The modeling procedures showed that the best predictive results are obtained by using all 7 meteorological variables from Table 1, their lagged variables, and the lagged variable PM10_1, which accounts for the PM10 pollution from the previous day. The time variables MONTH, YEAR_MONTH and MONTH_DAY were also used as predictors, with the latter two given by the expressions:

$$\begin{aligned} YEAR\_MONTH=YEAR+(MONTH-0.5)/12, \end{aligned}$$
(6)
$$\begin{aligned} MONTH\_DAY=MONTH+(DAY-0.5)/31 \end{aligned}$$
(7)
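Expressions (6) and (7) map each calendar date to continuous time predictors. A minimal sketch of their computation is given below for a few dates at the end of the sample period; the function names `year_month` and `month_day` are chosen here for illustration.

```python
from datetime import date, timedelta


def year_month(d: date) -> float:
    """Expression (6): YEAR + (MONTH - 0.5) / 12."""
    return d.year + (d.month - 0.5) / 12


def month_day(d: date) -> float:
    """Expression (7): MONTH + (DAY - 0.5) / 31."""
    return d.month + (d.day - 0.5) / 31


# The last seven days of the sample period, 24-30 November 2018.
last_week = [date(2018, 11, 24) + timedelta(days=i) for i in range(7)]
for d in last_week:
    print(d.isoformat(), d.month, round(year_month(d), 4), round(month_day(d), 4))
```

The offset of 0.5 places each value at the middle of its month or day, so YEAR_MONTH is constant within a month while MONTH_DAY increases day by day.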

Interactions between these time variables were ruled out. In order to assess the forecasting capabilities of the models, the data for the last 7 days of the sample (24–30 November 2018) were not included in the modeling process and were used as holdout data. The total maximum number of predictors is 18. Table 2 shows the main statistics of the selected best MARS models.

Table 2. Summary statistics of the selected best MARS models of PM10.

The best MARS models obtained describe the observed PM10 data with \(R^2\) = 86% and GCV \(R^2\) = 83%. Models M3 (200; 4) and M5 (300; 3) have the highest performance, followed by M2 (200; 3). Since their respective RMSEs are not significantly different, we choose models M2 and M5 as the simplest models, with interactions between predictors up to the third degree. Figure 2 shows the correspondence between M5 and PM10 with a 5% confidence interval.

Fig. 2. Comparison of the predicted PM10 values using model M5 and the measured PM10 values.

4.2 Application of Models for Forecasting Future Pollution

The MARS method has the advantage over many other methods that it generates flexible models from which predictions and forecasts are easily obtained through formulas (1)–(2) at given predictor values. Figure 3 compares the estimates calculated with models M2 and M5 against the PM10 data for 7 days ahead. Very good correspondence is observed.

Fig. 3. Forecasting results obtained using MARS models M2 and M5 against the measured PM10 values for 7 days ahead.

5 Conclusion

Using the powerful data-driven method MARS, statistical models were built for analyzing and forecasting PM10 atmospheric air pollution data depending on changes in weather conditions. Models with very good statistical characteristics were obtained. The agreement of the models with the measured data and their high performance in predicting future pollution concentrations have been demonstrated.

The performed analyses and the obtained results suggest that MARS can successfully be used to model other air pollutants as well as similar time series of ecology data.