
1 Introduction

Ensemble methods are machine learning algorithms in which performance or classification accuracy is improved by combining individual models. Different variants and approaches to creating these ensembles can be found in the scientific literature. A first approach employs the same learner but varies the training datasets; Boosting [1], Bagging [2], Random Forest [3] and AdaBoost [4] belong to this category. Another possible approach relies on the use of different learning methods; in this case, majority voting, weighted voting and averaging are the most common techniques. Finally, stacking ensembles [5] use the outputs of individual models as inputs to a second-stage algorithm as a way to improve the performance of the models.

Air pollution is one of the most important environmental problems that must be faced in order to preserve the population's quality of life. Nitrogen dioxide (NO2) is one of the main pollutants. Its origins are manifold, but it is closely related to combustion processes [6] and to the reactions between nitrogen oxides and ozone [7]. It has harmful effects on human health [8] and is considered the main cause of air quality loss in urban areas [9].

The main objective of this paper is to improve NO2 estimations in a monitoring network located in the Bay of Algeciras (Spain). To achieve this goal, a stacking ensemble is proposed. Artificial neural networks (ANNs) and linear and nonlinear genetic algorithms (GAs) are employed as individual learners, and an ANN is used as the second-stage algorithm. This ensemble produces promising results, outperforming all the individual models and the other stacking ensembles that are also calculated. Improving the NO2 estimations matters because they can give monitoring networks autonomous capabilities, such as missing data imputation or the detection of decalibration situations.

The rest of this paper is organized as follows. Section 2 describes the area of study and the database. Section 3 presents the methods used in this work. Section 4 describes the experimental design. Results are discussed in Sect. 5. Finally, the conclusions are shown in Sect. 6.

2 Data and Area Description

The Bay of Algeciras area is a heavily industrialized region located in the south of Spain, with a population of nearly 300,000 inhabitants. The sources of NO2 are numerous, including not only the aforementioned industries but also heavy traffic in the urban areas. Additionally, the Port of Algeciras Bay is one of the most prominent ship-trading ports in Europe; thus, vessels constitute another important source of gaseous air pollution in this area.

All the aforementioned facts highlight the importance of an adequate pollution control strategy for preserving the wellbeing of the population. For this purpose, a pollution monitoring network is deployed in the area. It comprises 14 stations and records hourly NO2 measurements. Figure 1 shows the location of the Bay of Algeciras and the positions of the monitoring stations (depicted using their codes). Table 1 shows the correspondence between stations and their codes.

Fig. 1. Area of study

Table 1. Location of the NO2 monitoring stations

The database used in this work contains hourly NO2 concentration measurements obtained by the aforementioned monitoring stations over a period of 6 years (2010–2015). As a preliminary step, the database was normalized. It was then split into two datasets: the first, covering records from 2010 to 2014, was used to select the best parameters of the models and to train them; the second, containing only measurements taken in 2015, was used as the test set. Results are reported only on the test set in order to assess the performance of the models on unseen data.

3 Methods

This section presents a brief description of the methods and techniques used in this work.

3.1 Artificial Neural Networks

Backpropagation feedforward multilayer perceptron [10], which includes at least one hidden layer different from the input and output layers, is the most widely used design for ANNs. According to [11], ANNs with enough neurons and a single hidden layer can be considered as universal approximators of any nonlinear function.

In this work, backpropagation neural networks (BPNNs) with a single hidden layer have been used to create hourly NO2 estimation models. The Levenberg–Marquardt algorithm [12] has been employed for optimization purposes. Additionally, the early stopping technique [13] has been applied to the training process with the aim of avoiding overfitting and ensuring good generalization capabilities in the models.

To determine the optimal number of hidden neurons, the authors used a 5-fold cross-validation resampling procedure, which has been applied previously with good results [14,15,16,17,18].
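The selection procedure above can be sketched as follows. The resampling logic is written generically, with a `fit_predict` callable standing in for training a single-hidden-layer BPNN on each fold; the Levenberg–Marquardt training itself is not reproduced here, and all function names are illustrative rather than taken from the original work.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal random folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_select(X, y, candidates, fit_predict, k=5):
    """Return the candidate hyperparameter with the lowest mean k-fold MSE.

    fit_predict(param, X_tr, y_tr, X_val) -> predictions on X_val.
    In the paper, `param` would be the number of hidden neurons (1..50)
    of a single-hidden-layer BPNN.
    """
    folds = kfold_indices(len(y), k)
    best, best_mse = None, np.inf
    for p in candidates:
        errs = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            pred = fit_predict(p, X[tr], y[tr], X[val])
            errs.append(np.mean((y[val] - pred) ** 2))
        if np.mean(errs) < best_mse:
            best, best_mse = p, float(np.mean(errs))
    return best, best_mse
```

In the paper this inner loop is additionally repeated 20 times with different random initializations and the per-repetition results are kept for the later statistical comparison.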

3.2 Genetic Algorithms

Genetic algorithms [19] are search methods inspired by natural selection processes. The decision variables of a particular problem are encoded into strings of a certain alphabet. These strings are known as chromosomes and act as candidate solutions of the problem (known as individuals). The set of all individuals is known as the population. To assess the goodness of each possible solution, a fitness value is calculated for each individual in the population.

The general process starts with the generation of a random initial population. This population evolves from one generation to the next through the application of genetic operators, moving towards a global optimum of the problem according to the fitness values obtained. The genetic operators include selection, crossover and mutation. The process continues, creating new generations, until the stopping criteria are met. The interested reader can find a more detailed explanation of this process in the work of [20].
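As an illustration of the loop just described, a minimal real-coded GA with tournament selection, uniform crossover, Gaussian mutation and elitism might look as follows. This is a didactic sketch under assumed operator choices, not the MATLAB implementation used in the paper.

```python
import numpy as np

def minimize_ga(fitness, dim, pop_size=40, generations=100,
                mutation_rate=0.1, bounds=(-5.0, 5.0), seed=0):
    """Minimal real-coded genetic algorithm minimizing `fitness`."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, dim))  # random initial population
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        # tournament selection: keep the better of two random individuals
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = pop[np.where(fit[a] < fit[b], a, b)]
        # uniform crossover between consecutive parents
        mask = rng.random((pop_size, dim)) < 0.5
        children = np.where(mask, parents, np.roll(parents, 1, axis=0))
        # Gaussian mutation on a fraction of the genes
        mut = rng.random((pop_size, dim)) < mutation_rate
        children = children + mut * rng.normal(0.0, 0.3, (pop_size, dim))
        # elitism: preserve the best individual found so far
        children[0] = pop[np.argmin(fit)]
        pop = np.clip(children, lo, hi)
    fit = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(fit)], float(fit.min())
```

Running it on a simple convex fitness (e.g. the sum of squared weights) shows the population converging towards the minimizer over the generations.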

In this work, four different genetic algorithm models have been developed to estimate the hourly NO2 concentrations at the EPSA monitoring station (see Table 1). In all cases, the fitness function to be minimized is the mean squared error (MSE) between the dependent variable and the estimation produced as the output of a function specific to each case, as shown in Eq. 1.

$$ err = MSE(y, \widehat{y}) $$
(1)

where y is the dependent variable and \( \widehat{y} \) is the estimation produced by the GA model. The main differences between these models lie in the specific function used to produce the estimations. Equations 2, 3, 4 and 5 show the estimation functions corresponding to GA model 1 (GA-1), GA model 2 (GA-2), GA model 3 (GA-3) and GA model 4 (GA-4), respectively.

$$ \widehat{y} = \sum\nolimits_{i = 1}^{n} {(w_{{1_{i} }} \cdot (S (w_{{2_{i} }} \cdot x_{i} ) + w_{{3_{i} }} \cdot x_{i } + w_{{4_{i} }} \cdot x_{i}^{{w_{{5_{i} }} }} + e^{{w_{{6_{i} }} }} \cdot x_{i} + w_{{7_{i} }} )) + k} $$
(2)
$$ \widehat{y} = \sum\nolimits_{i = 1}^{n} {(S(w_{{1_{i} }} ) \cdot (S (w_{{2_{i} }} \cdot x_{i} ) + w_{{3_{i} }} \cdot x_{i } + w_{{4_{i} }} \cdot x_{i}^{{w_{{5_{i} }} }} + e^{{w_{{6_{i} }} }} \cdot x_{i} + w_{{7_{i} }} )) + k} $$
(3)
$$ \widehat{y} = \sum\nolimits_{i = 1}^{n} { (w_{{1_{i} }} \cdot x_{i } + S \left( {w_{{2_{i} }} \cdot x_{i} } \right) + w_{{3_{i} }} \cdot x_{i}^{{w_{{4_{i} }} }} + e^{{(w_{{5_{i} }} \cdot x_{i} ) }} ) + k} $$
(4)
$$ \widehat{y} = \sum\nolimits_{i = 1}^{n} {(w_{i} \cdot x_{i} ) + k} $$
(5)

where y is the dependent variable, \( x_{i} \) are the independent variables (predictors), n is the total number of predictors, \( w_{{1_{i} }} \), \( w_{{2_{i} }} \), \( w_{{3_{i} }} \), \( w_{{4_{i} }} \), \( w_{{5_{i} }} \), \( w_{{6_{i} }} \), \( w_{{7_{i} }} \), \( w_{i} \) and k are the weights determined by the GA, and S is the sigmoid function, expressed in Eq. 6.

$$ S\left( n \right) = \frac{1 }{{1 + e^{{\left( { - n} \right)}} }} $$
(6)

It is important to note that \( w_{{6_{i} }} \) in Eqs. 2, 3 and 4 has been constrained to the [\( 10^{ - 12} \), +∞) interval. The genetic algorithm function provided by MATLAB R2016b has been used to develop the GA models. In this software, the codification of the variables into chromosomes is done internally without any intervention by the user. As can be seen in Eqs. 2–5, GA-1, GA-2 and GA-3 present non-linear behaviour, whereas the GA-4 estimation function is linear.
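To make the encoding concrete, the linear GA-4 case can be written out directly from Eqs. 1, 5 and 6; variable names are illustrative and the GA search over the weights is omitted.

```python
import numpy as np

def sigmoid(n):
    """Eq. 6: S(n) = 1 / (1 + e^(-n))."""
    return 1.0 / (1.0 + np.exp(-n))

def ga4_estimate(weights, X):
    """Eq. 5: y_hat = sum_i (w_i * x_i) + k.
    `weights` packs [w_1 .. w_n, k]; X holds one column per predictor."""
    w, k = weights[:-1], weights[-1]
    return X @ w + k

def ga4_fitness(weights, X, y):
    """Eq. 1: MSE between the dependent variable and the estimate;
    this is the quantity the GA minimizes for each chromosome."""
    return float(np.mean((y - ga4_estimate(weights, X)) ** 2))
```

The nonlinear models GA-1 to GA-3 follow the same pattern, simply replacing `ga4_estimate` with the richer expressions of Eqs. 2–4 (which also use `sigmoid`).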

3.3 Stacked Ensembles

Stacked ensembles [21] are techniques intended to supply an overall prediction or estimation value based on the combination of the outputs of individual models. This type of ensemble can benefit from the different perspectives offered by the individual models and usually improves their results. A brief description of the ensembles used in this work is presented next:

  • Average (avg): The final estimation is calculated as the average of the individual models’ estimations.

  • Weighted average (wavg): In this case, each individual model has a different contribution to the final estimation according to the goodness of its estimation power, as is shown in Eq. 7.

    $$ E_{final} = \sum\nolimits_{i = 1}^{j} {(w_{i} \cdot E_{i} )} $$
    (7)

    where \( w_{i} \in \left[ {0,1} \right] \) for each \( i \in \left[ {1, \ldots ,j} \right] \), \( \sum\nolimits_{i = 1}^{j} {w_{i} = 1} \), and \( E_{i} \) represents the estimation of individual model i.

  • ANN weighted ensemble (ANNwe): Inspired by the wavg ensemble, this work proposes a type of ensemble that uses the outputs of the individual models as inputs to a BPNN. The obtained model represents the best possible combination of the inputs to produce the aggregated output.
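The first two combinations can be sketched directly (the ANNwe ensemble additionally requires training a BPNN on these outputs). The inverse-MSE weighting shown here is the "inversely proportional" scheme described later in Sect. 4; the function names are illustrative.

```python
import numpy as np

def avg_ensemble(preds):
    """avg: plain mean of the individual estimations.
    preds has shape (n_models, n_samples)."""
    return preds.mean(axis=0)

def wavg_ensemble(preds, mse):
    """wavg (Eq. 7): weights inversely proportional to each model's
    MSE, normalised so that they sum to one."""
    w = 1.0 / np.asarray(mse, dtype=float)
    w = w / w.sum()
    return w @ preds
```

With equal MSE values, wavg reduces to avg; as one model's MSE grows, its contribution to the final estimation shrinks accordingly.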

4 Experimental Procedure

The objective of this study is to determine the possible performance improvements in NO2 estimation models when the proposed stacked ensemble is applied. In this approach, GAs are used in conjunction with ANNs. The proposed estimation functions (see Eqs. 2–6) allow the GA models to capture linear and nonlinear relations between variables and increase their estimation performance.

For estimation purposes, the hourly NO2 values measured at the EPSA monitoring station (see Table 1) were considered the dependent variable, while the hourly NO2 values from the rest of the monitoring stations were used as predictor variables. As an initial step, the original database was normalized and divided into two disjoint groups: the first included hourly NO2 records from 2010 to 2014 and was used as the training set; the second included records from 2015 and acted as the test set.
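This preprocessing step can be sketched as follows. Computing the min–max statistics on the training years only is an assumed convention, since the paper does not specify the normalisation details.

```python
import numpy as np

def normalize_and_split(X, years, test_year=2015):
    """Min-max normalisation followed by a chronological split:
    records before `test_year` form the training set, the rest the
    test set. Statistics come from the training rows only (an
    assumption; the paper does not state this detail)."""
    train = years < test_year
    lo = X[train].min(axis=0)
    hi = X[train].max(axis=0)
    Xn = (X - lo) / (hi - lo)
    return Xn[train], Xn[~train]
```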

The experimental process was divided into two stages. In the first stage, five estimation models were developed: one using ANNs and the rest using different genetic algorithm approaches. In the case of the ANN models, the BPNNs used a single hidden layer with a varying number of hidden neurons (hns) (1 to 50). The Levenberg–Marquardt algorithm was selected for optimization, and the early stopping technique was employed to improve the generalization capabilities of the models. Starting with the training set, a random resampling procedure using 5-fold cross-validation was applied for each number of hns and the average performance measures were calculated. This process was repeated 20 times to mitigate the effect of randomness in the initialization of the ANN weights, and the average results were also calculated. Additionally, the individual results of each repetition were stored so that a multicomparison procedure could later determine meaningful differences among the models. Regarding performance measures, the Pearson correlation coefficient (R), the mean squared error (MSE), the index of agreement (d) and the mean absolute error (MAE) [22] were calculated. These performance indexes are defined in Eqs. 8–11.

$$ R = \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - \overline{O} } \right)\cdot\left( {P_{i} - \overline{P} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{N} (O_{i} - \overline{O} )^{2} \cdot \mathop \sum \nolimits_{i = 1}^{N} (P_{i} - \overline{P} )^{2} } }} $$
(8)
$$ MSE = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {(P_{i} - O_{i} )^{2} } $$
(9)
$$ d = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {P_{i} - O_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {\left| {P_{i} - \overline{O} } \right| + \left| {O_{i} - \overline{O} } \right|} \right)^{2} }} $$
(10)
$$ MAE = \frac{1}{N} \sum\nolimits_{i = 1}^{N} {\left| {P_{i} - O_{i} } \right|} $$
(11)

where P indicates predicted values and O indicates observed values.
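Eqs. 8–11 translate directly into code; the function name is illustrative.

```python
import numpy as np

def performance_indexes(obs, pred):
    """R, MSE, index of agreement d, and MAE (Eqs. 8-11).
    `obs` are observed values O, `pred` are predicted values P."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    do, dp = obs - obs.mean(), pred - pred.mean()
    # Eq. 8: Pearson correlation coefficient
    r = (do * dp).sum() / np.sqrt((do ** 2).sum() * (dp ** 2).sum())
    # Eq. 9: mean squared error
    mse = np.mean((pred - obs) ** 2)
    # Eq. 10: index of agreement
    d = 1.0 - ((pred - obs) ** 2).sum() / (
        (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2).sum()
    # Eq. 11: mean absolute error
    mae = np.mean(np.abs(pred - obs))
    return r, mse, d, mae
```

A perfect prediction yields R = 1, MSE = 0, d = 1 and MAE = 0, which is a quick sanity check for the implementation.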

Finally, the best model was selected using the Friedman test [23] and the Bonferroni method [24], together with the aforementioned performance measures. The Friedman test determines whether meaningful differences exist between the models, and the Bonferroni method evaluates which models are not statistically equivalent. Following Occam's razor, the model with the fewest hns was selected among those showing no significant differences from the model with the best performance indexes.
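The Friedman statistic over the stored repetitions can be computed as below. This is a simplified sketch that ignores rank ties; in practice a statistics package would also supply the p-value and the Bonferroni post-hoc comparisons.

```python
import numpy as np

def friedman_statistic(errors):
    """Friedman chi-square for an (n_repetitions, n_models) error
    matrix. Ranks are assigned within each repetition (ties are not
    handled in this simple sketch); large values suggest that the
    models differ in performance."""
    n, k = errors.shape
    # rank 1 = lowest error within each repetition
    ranks = np.argsort(np.argsort(errors, axis=1), axis=1) + 1.0
    mean_ranks = ranks.mean(axis=0)
    chi2 = 12.0 * n / (k * (k + 1)) * np.sum(mean_ranks ** 2) - 3.0 * n * (k + 1)
    return chi2, mean_ranks
```

When every repetition ranks the models in the same order, the statistic reaches its maximum of n(k − 1), its strongest evidence of a real difference.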

After the model selection, a new BPNN model was trained using the whole training dataset and the number of hns of the most accurate model. Then, this model was fed with the inputs of the test set in order to obtain the final NO2 estimation for the year 2015. Finally, performance measures were calculated through the comparison of observed vs. estimated values.

In the case of the GA models, the fitness function presented in Sect. 3.2 (Eq. 1, with the estimation functions of Eqs. 2–5) was minimized using the training dataset. Regarding the parameters that control the genetic algorithms, different tests were carried out to select the best possible parameter combination, and each combination was repeated 20 times. Table 2 shows the values tested for each parameter. A detailed description of each parameter can be found on MATLAB's Genetic Algorithm Options web page [25]. Table 3 shows the final combination selected for each GA model.

Table 2. Parameters tested in the GA models
Table 3. Selected parameters for each GA model

Once the stopping criteria were met, the corresponding weights were stored for each GA model. As the last step, the final NO2 estimations for the year 2015 were obtained using Eqs. 2–5 with the corresponding weights and the test set in each case. Finally, values of R, MSE, d and MAE were calculated by comparing the observed NO2 values against those estimated with each model.

In the second stage, the avg, wavg and ANNwe ensembles (see Sect. 3.3) were calculated. In the case of the avg ensemble, the calculation is straightforward, as it simply averages the estimations obtained with the individual models. For the wavg ensemble (see Eq. 7), each stage-1 estimation was weighted according to its MSE value following an inversely proportional distribution. To calculate the ANNwe ensemble, ANN models were trained using the stage-1 outputs as their inputs and the 2015 measured NO2 values as their targets. The same network configuration as in the stage-1 ANN models was applied, and the final output was the one producing the lowest MSE value after 20 repetitions.

5 Results and Discussion

The results of the experimental procedure are presented in this section. In the first stage, different models have been developed to estimate the hourly NO2 concentration values at the EPSA monitoring station (station 1). NO2 values measured at the other stations have been used as inputs of the models (see Table 1 and Fig. 1). The initial data set has been split into two disjoint datasets and the results are obtained through the comparison of observed vs. estimated values for the test set (2015). This lets us evaluate the performance of the models with unseen data. For comparative purposes, a Lasso model using the same datasets and 5-fold cross validation has also been included. Table 4 shows the performance measures corresponding to stage-1 models.

Table 4. Performance indexes for stage-1 estimation models

As expected, the best ANN model outperforms all the GA models. However, the non-linear GA models (GA-1, GA-2 and GA-3) easily beat the performance offered by the linear GA-4 model. This indicates that Eqs. 2, 3 and 4 are able to capture linear as well as a substantial amount of non-linear relations between input and output variables. Nevertheless, the proposed estimation functions cannot compete with ANNs' ability to act as universal approximators of any nonlinear function, as mentioned in Sect. 3.1.

Table 5 shows the results obtained by the proposed ensembles in the second stage. These methods combine the outputs of stage-1 models with the aim of improving the estimation results.

Table 5. Performance indexes for stage 2 ensembles

The results show that the avg and wavg ensembles improve on the GA models but do not reach the estimation goodness offered by the stage-1 ANN model. This can be explained by the fact that the average and, to a lesser extent, the weighted average are highly influenced by extreme values that lie far from the mean of the individual learners at a given instant. In our case, this influence comes primarily from the GA-4 output. As an example, if the GA-4 output is removed from the ensemble, the MSE of avg drops to 260.837 and its R-value rises to 0.735.

In the case of the ANNwe ensemble, its performance indexes are far superior to those of all the models proposed in the first and second stages. As can be seen, the second-stage ANN is able to take advantage of the different linear and non-linear relations captured by the first-stage GA and ANN models. Some of these relations are already present in the first-stage ANN model, but others are provided by the GA models. Considering the results, the proposed two-stage approach yields a better estimation performance for the NO2 concentration values at the EPSA monitoring station.

A comparison between the best models of the first and second stages is presented in Figs. 2 and 3, where estimated versus measured NO2 hourly values are depicted for January 2015. As can be seen, the fit to the observed values is superior in the case of ANNwe compared to the first-stage ANN model, confirming the improvement provided by the proposed approach.

Fig. 2. Estimated vs. real values for January 2015 using the most accurate stage-1 ANN model

Fig. 3. Estimated vs. real values for January 2015 using the stage-2 ANNwe ensemble

6 Conclusions

The aim of this paper is to verify the improvements that a stacked ensemble approach can provide to NO2 estimations compared to individual models. This approach uses artificial neural networks and linear and nonlinear genetic algorithms as individual learners; their outputs are then used as inputs to the second-stage ANN models.

Regarding the first-stage results, the proposed GA models that use non-linear functions produce much better results than the GA model using a linear function. This indicates that their estimation functions can detect useful relationships between variables that linear approaches ignore. However, ANNs outperform them due to their ability to act as universal approximators (see Sect. 3.1).

The results of both stages show that the ANNwe approach outperforms all the other proposed approaches, achieving a better estimation performance of NO2 in the monitoring network. The main reason is that the stage-1 models capture different linear and nonlinear relations between the inputs and the targets, so the ANNwe approach can exploit the advantages offered by each individual model and find an optimal combination of their outputs that increases the global estimation performance.

The use of the proposed model provides better and more reliable NO2 estimations than the other proposed models. This can be very useful, as these estimations can give robustness and autonomous capabilities to the monitoring network. They can also help with missing data imputation or the detection of decalibration situations.