Introduction

Due to rapid population growth and underdeveloped economies, air pollution has been one of the major problems perplexing many developing countries. According to the latest world air quality report released by AirVisual, Asian locations dominated the 100 highest average PM2.5 levels during 2018, with cities in India, China, Pakistan, and Bangladesh occupying the top 50 (AirVisual 2018). China is the largest developing country in the world, and many of its cities have suffered from serious air pollution in the past few years, such as Hotan, Shijiazhuang, Baoding, Xianyang, Jiaozuo, and Cangzhou. Although China’s air pollution exposures have stabilized and even begun to decline slightly after several years of strict restrictions on industrial emissions and on the use of fossil fuels for indoor heating and cooking (HEI and IHME 2018), efforts are still needed to keep protecting the environment. Sulfur oxides, carbon oxides, nitrogen oxides, hydrocarbons, particulate matter 10 (PM10), and particulate matter 2.5 (PM2.5) in the atmosphere are the main contributors to air pollution, and many efforts have been put into predicting air quality based on the observations of scattered air monitoring stations.

Deterministic methods usually build simulation models to simulate and predict the diffusion and transport of atmospheric pollutants (Ma et al. 2019). However, such methods suffer from large computation costs and low prediction accuracy when the underlying atmospheric conditions are complex and a large amount of observed data is involved. Moreover, they require IT technologists to master specific domain knowledge for parameter identification. Machine learning methods are another kind of approach for air quality prediction based on a large amount of observed data. In recent years, researchers have employed many machine learning methods to predict air quality because of their solid theoretical foundations, diverse models, and accurate forecasting effects, such as multiple linear regression (Stadlober et al. 2008; Genc et al. 2010; Li et al. 2011), support vector machine (SVM) (Deng et al. 2018; Osowski and Garanty 2007), and artificial neural network (ANN) (Cabaneros et al. 2019; Perez and Reyes 2006; Feng et al. 2015). However, on the one hand, although such traditional models are widely used and perform reasonably in many domains, they are not suitable for handling time series data since they cannot properly process the time steps of a sequence; in other words, they cannot capture the relationship between old information and new input in a sequence. On the other hand, such efforts do not yield the desired performance for air quality prediction. There are also efforts that employ deep learning models in air quality prediction, and the main solutions utilize recurrent deep learning models, such as the recurrent neural network (RNN) (Pineda 1987), long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), and transferred bi-directional LSTM (Ma et al. 2019; Ma et al. 2019). Such models generate multi-layered representations of data and also exhibit temporal dynamic behavior for time series data, thus providing better performance than traditional machine learning methods. Moreover, there are other efforts: Lin et al. proposed a neuro-fuzzy network, in which the training data are described by fuzzy clusters with statistical means and variances to address the uncertainty of the involved impact factors (Lin et al. 2020). Jiang et al. presented a hybrid air quality prediction approach with pigeon-inspired optimization and extreme learning machine, employing a modified extreme learning machine to predict the data sub-series clustered based on multidimensional scaling and K-means clustering (Jiang et al. 2019). Wang et al. proposed an ensemble deep learning model that considered both weather patterns and spatial-temporal properties (Wang and Song 2018). Maciag et al. proposed a clustering-based ensemble model that runs several evolving spiking neural networks on separate sets of time series for air quality prediction (Maciag et al. 2019). Compared with the efforts mentioned above, in this paper, we propose a hybrid ensemble model, CERL, to exploit the merits of both forward neural networks and recurrent neural networks for hourly air quality prediction in Northwest China. We take two cities in Northwest China, i.e., Lanzhou and Xi’an, as examples and demonstrate the superiority of CERL. Moreover, we analyze the impact on CERL’s performance as the prediction step length increases.

The rest of this paper is organized as follows. “Related work” presents a brief literature review of work related to air quality prediction. “Proposed approach” presents several prominent machine learning methods used for air quality prediction and then introduces the hybrid method proposed in this paper. “Materials” describes the materials used in this paper. “Experiments and results” presents the results on the hourly air quality data. “Discussions” discusses the CERL improvements for different prediction step lengths and its superiority based on hypothesis testing. “Conclusion” summarizes the achievements and highlights of this paper and outlines directions for future work.

Related work

Air quality forecasting predicts air pollution levels for a period ahead and provides important information to the public. However, the prediction is still a challenge because of the complexity of the process involved and the strong coupling across many parameters, which affect the modeling performance (Leksmono et al. 2006; Biancofiore et al. 2017). There have been three main types of air quality prediction methods: deterministic methods, statistical methods, and machine learning methods (Ma et al. 2019; Athira et al. 2018; Kwok et al. 2017; Singh et al. 2012). Deterministic methods usually build simulation models to simulate and predict the diffusion and transport of atmospheric pollutants, but such methods often have large computation costs and low prediction accuracy when the underlying atmospheric conditions are complex. Statistical methods are a data-driven way of air quality prediction, and most of them assume that the relationships between the input variables and the target outputs are linear (Ma et al. 2019), for example, multiple linear regression (Stadlober et al. 2008; Genc et al. 2010; Li et al. 2011). Such linear approaches suffer from the non-linearity of the real world. Machine learning-based methods often focus on nonlinear models, and the main methods falling into this category are ANN (Cabaneros et al. 2019; Perez and Reyes 2006; Feng et al. 2015), SVM (Deng et al. 2018; Osowski and Garanty 2007), etc. For example, Cabaneros et al. reviewed the research activities in air pollution forecasting with ANNs and showed that feed-forward and hybrid ANN models with ad hoc optimization approaches were predominantly used to forecast long-term air pollutant factors (Cabaneros et al. 2019). Yang et al. presented a support vector regression model to predict PM2.5 concentrations by considering spatial heterogeneity and dependence among the data (Deng et al. 2018). Note that there are also efforts that treat both statistical methods and machine learning methods as statistical methods (Mallet and Sportisse 2008; Zhang et al. 2012). Such linear and nonlinear data-driven methods usually build models fast with moderate accuracy and have been studied extensively in recent years. For example, Singh et al. explored both linear and nonlinear approaches to predict air quality with selected air pollutant factors and meteorological conditions as the estimators (Singh et al. 2012). They argued that the nonlinear models, especially artificial neural network-based models and their variants, performed relatively better than linear PLSR models. Garcia et al. predicted PM10 concentrations based on generalized linear models (GLMs), which focused on the relationship between atmospheric concentrations of air pollutants and meteorological variables (Garcia et al. 2016). In the GLM, the PM10 concentration was considered a dependent variable, and both gaseous pollutants and meteorological variables were considered independent variables. Based on the similarity of PM2.5 variation in a monitoring network, He et al. proposed two methods, the linear method of stepwise regression and the nonlinear method of support vector regression, to predict PM2.5 concentration (He et al. 2018). Shang et al. proposed a method of training local models based on a combination of classification and regression tree (CART) and ensemble extreme learning machine (EELM) to address the global-local duality and improve the prediction accuracy (Shang et al. 2019).

Besides the traditional methods based on machine learning algorithms, there are efforts that employ deep learning models in air quality prediction. Deep learning is a branch of machine learning that generates multi-layered representations of data, commonly using artificial neural networks, and has improved the state of the art in various machine learning tasks (Lang et al. 2019). The main solutions of adopting deep learning in air quality prediction utilize recurrent deep learning models, such as RNN (Pineda 1987) and LSTM (Hochreiter and Schmidhuber 1997). For example, Biancofiore et al. adopted a recurrent neural architecture, i.e., the Elman recurrent network, to forecast the daily averaged concentration of PM10 and argued that the RNN performed better than both the multiple linear regression model and the neural network model without the recursive architecture (Biancofiore et al. 2017). Athira et al. (2018) compared different RNN models and their variations on the pollution and meteorological time series of the AirNet data (Zhao et al. 2018) and showed that the performance of the gated recurrent unit network was slightly higher than that of the RNN and LSTM networks. Ma et al. used a bi-directional LSTM model to learn long-term dependencies of PM2.5 (Ma et al. 2019). The highlight of the work was the combination of a bi-directional LSTM with a transfer learning technique, which could transfer knowledge from smaller temporal resolutions to larger ones. Based on that work, Ma et al. also proposed a stacked bi-directional LSTM that combined deep learning and transfer learning to deal with the data shortage problem (Ma et al. 2019).

In addition, there are hybrid models that exploit the advantages of multiple models, such as a hybrid AQI prediction model based on sample entropy, secondary decomposition, least squares support vector machine, and LSTM (Wu and Lin 2019). Wang et al. proposed a deep spatial-temporal ensemble model, which considered not only meteorological information but also spatial and temporal properties to predict air quality; LSTM was also used to learn both short-term and long-term dependencies (Wang and Song 2018).

To sum up, there are a variety of differences between the aforementioned efforts and our work. Our work is an ensemble model that exploits the merits of both forward neural networks and recurrent neural networks designed for handling time series data. Based on the advantages of both types of neural networks, CERL provides better performance than the baseline models. In particular, we focus on the air quality prediction of two rarely studied capital cities in Northwest China and build prediction models, hour by hour, for the main pollutant factors, i.e., AQI (AirNow 2019), PM2.5, PM10, CO, SO2, NO2, and O3.

Proposed approach

In this work, we combined forward neural networks with several recurrent neural networks into a hybrid model with the aim of improving the accuracy of air quality prediction. This section first introduces several machine learning methods that are often used for air quality prediction and then introduces our hybrid approach CERL.

Prominent approaches for time series data

Cascade-forward neural network

The cascade-forward neural network (CFNN) is an artificial neural network in which information moves only forward, i.e., from the input nodes, through the hidden nodes, to the output nodes. Moreover, a CFNN includes a connection from the input and every previous layer to the following layers. In other words, in a CFNN with three layers, the output layer is connected directly to the input layer in addition to the hidden layer, as shown in Fig. 1a. As with feed-forward networks, CFNNs with a single hidden layer can arbitrarily closely approximate any continuous function that maps intervals of real numbers to some output interval of real numbers. Thanks to the direct connection between the input and output, the CFNN is often used for time series prediction. For example, Tengeleng et al. utilized a cascade-forward back-propagation neural network (BPNN) to predict rain parameters, i.e., water content, rain rate, and radar reflectivity, from raindrop size distributions (Tengeleng and Armand 2014). Warsito et al. showed that CFNN models could successfully predict both simulated time series data and monthly palm oil price index data (Warsito et al. 2018).
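To make the cascade (skip) connection concrete, the following minimal NumPy sketch shows a three-layer cascade-forward pass; it is our own illustration rather than the MATLAB cascadeforwardnet used later in this paper, and all weight names and layer sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfnn_forward(x, W_ih, b_h, W_ho, W_io, b_o):
    # hidden layer
    h = np.tanh(W_ih @ x + b_h)
    # output layer: hidden->output plus the direct input->output (cascade) link
    return W_ho @ h + W_io @ x + b_o

# e.g., a window of 5 hourly readings predicting 1 value (sizes are assumptions)
n_in, n_hid, n_out = 5, 3, 1
W_ih = rng.normal(size=(n_hid, n_in)); b_h = np.zeros(n_hid)
W_ho = rng.normal(size=(n_out, n_hid)); b_o = np.zeros(n_out)
W_io = rng.normal(size=(n_out, n_in))   # the extra cascade connection
y = cfnn_forward(rng.normal(size=n_in), W_ih, b_h, W_ho, W_io, b_o)
```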

Fig. 1 The models employed for air quality prediction

RNN

RNN is a kind of artificial neural network specially designed to model time series data. Unlike feed-forward networks, the hidden layers of an RNN are connected back into themselves to maintain an internal state, which allows the RNN to exhibit temporal dynamic behavior over a time sequence, as shown in Fig. 1b. This recurrence enables the network to perform temporal processing, and Biancofiore et al. argued that the RNN had better performance than other neural network models without the recursive architecture in forecasting the daily averaged concentration of PM10 (Biancofiore et al. 2017).

ESN

Roughly speaking, the echo state network (ESN) is a special case of the recurrent neural network with a non-trainable sparse random recurrent part (the reservoir) and a simple linear readout (Jaeger 2001), as shown in Fig. 1c. The connection weights in the ESN reservoir, as well as the input weights, are randomly generated. Compared with other RNN models, ESNs can efficiently process the temporal dependency of time series with high nonlinear mapping capacity and dynamic memory (Shen et al. 2016; Lukoševičius and Jaeger 2009).
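The following toy sketch illustrates the two defining ingredients under our assumptions (a fixed random reservoir scaled to a spectral radius below 1, and a linear readout fit by ridge regression); the experiments in this paper use Jaeger’s MATLAB library instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res = 1, 50
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.abs(np.linalg.eigvals(W_res)).max()  # fix the spectral radius

def run_reservoir(u_seq):
    # Drive the fixed random reservoir and collect its states
    x = np.zeros(n_res)
    states = []
    for u in u_seq:
        x = np.tanh(W_in @ u + W_res @ x)   # non-trainable recurrence
        states.append(x.copy())
    return np.array(states)

# The only trained part is the linear readout, fit here by ridge regression
u = np.sin(np.linspace(0, 20, 200)).reshape(-1, 1)   # stand-in series
X, y = run_reservoir(u[:-1]), u[1:, 0]               # one-step-ahead targets
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
pred = X @ W_out                                      # readout predictions
```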

Recurrent networks using previous outputs

Besides the standard recurrent neural networks, in which each layer has a recurrent connection with an associated tap delay, there are variant RNNs that have delayed recurrent connections between their output and input layers, as shown in Fig. 1d. In such networks, the state of the model is influenced not only by its previous internal states but also by its previous outputs. This is useful in modeling time series data, since the output for timestep t helps predict the output for timestep t + d, where d is the step length of the time series prediction. A sketch of this data flow follows.
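The snippet below is a rough sketch of one timestep of such an output-feedback recurrence; the weight names and sizes are ours, not Pyrenn’s API, and the code only illustrates the data flow of Fig. 1d.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 1, 8, 1
W_u = rng.normal(size=(n_hid, n_in)) * 0.3   # input -> hidden
W_h = rng.normal(size=(n_hid, n_hid)) * 0.3  # hidden -> hidden recurrence
W_y = rng.normal(size=(n_hid, n_out)) * 0.3  # delayed output -> hidden feedback
W_o = rng.normal(size=(n_out, n_hid)) * 0.3  # hidden -> output

def run(u_seq):
    # The hidden state sees the input, its own past, and the previous output
    h, y = np.zeros(n_hid), np.zeros(n_out)
    outputs = []
    for u in u_seq:
        h = np.tanh(W_u @ u + W_h @ h + W_y @ y)
        y = W_o @ h
        outputs.append(y.copy())
    return np.array(outputs)

out = run(np.sin(np.linspace(0, 10, 50)).reshape(-1, 1))
```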

Proposed hybrid approach

As we can see from the sections above, recurrent neural networks are a powerful type of artificial neural network in which the outputs of hidden layers are fed back into the same hidden layers. Such internal memory is helpful for handling time series data, i.e., data that occur in a time sequence. In this paper, we focus on combining forward neural networks with prominent recurrent neural networks into a hybrid model, CERL, with the aim of improving the accuracy of air quality prediction. The general process of building the hybrid model has two stages: single model learning and hybrid model learning, as shown in Fig. 2.

Fig. 2 The general process of building the hybrid ensemble model

As with other supervised learning algorithms, we split the data set into two sets, the training set and the test set, which are used to fit a model and to assess it at the end of model building, respectively. In the first stage, several single recurrent neural network models are built by mapping input features to output labels, and each is optimized to reach its best performance. After the optimized single models are built, they are used to calculate predictions on the training set, denoted by \(train\_Y_{1}^{\prime }\), \(train\_Y_{2}^{\prime }\), …, \(train\_Y_{n}^{\prime }\). Accordingly, the predictions on the test set are denoted by \(test\_Y_{1}^{\prime }\), \(test\_Y_{2}^{\prime }\), …, \(test\_Y_{n}^{\prime }\). In the general model building process of a supervised learning algorithm, such prediction results are used to calculate the training error and test error. In our work, they are instead grouped together as the features of new training and test sets to build the hybrid ensemble model. It is worth noticing that the labels of the training and test sets used to build the single models are reused. In other words, the goal of the hybrid ensemble model is to map the intermediate prediction results of the single models to the final output labels of the training set. Such a regression process can be implemented by many machine learning algorithms, such as linear regression, BPNN, and SVM.
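The following sketch illustrates this two-stage data flow; for brevity, the stage-2 combiner is a least-squares linear map rather than the three-layer BPNN actually used in this paper, and all function names are our assumptions.

```python
import numpy as np

def fit_combiner(preds_train, train_Y):
    # preds_train: (n_samples, n_models); columns are train_Y'_1 ... train_Y'_n
    A = np.column_stack([preds_train, np.ones(len(preds_train))])  # bias column
    w, *_ = np.linalg.lstsq(A, train_Y, rcond=None)                # stage-2 fit
    return w

def combine(preds, w):
    # Map stacked single-model predictions to the final output
    return np.column_stack([preds, np.ones(len(preds))]) @ w

# usage: predictions of, e.g., 4 single models on 100 training samples
rng = np.random.default_rng(3)
truth = np.sin(np.linspace(0, 6, 100))
preds_train = np.column_stack([truth + 0.05 * rng.normal(size=100)
                               for _ in range(4)])
w = fit_combiner(preds_train, truth)
final = combine(preds_train, w)  # at test time, pass test_Y'_1 ... test_Y'_n
```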

Since artificial neural networks have been well established by many successful applications in a variety of fields (Yoon et al. 2011; Singh et al. 2012), in this work, we employed a three-layer BPNN for our hybrid ensemble model. We used the logistic sigmoid function as the activation function of the hidden neurons, which is defined as follows,

$$ f(y_{j})= \frac{1}{1+e^{-y_{j}}} $$
(1)

where yj is the net input of hidden neuron j, which is calculated as follows,

$$ y_{j}= \sum\limits_{i=1}^{n}w_{ij}train\_Y_{i}^{\prime} + \theta_{j} $$
(2)

where \(train\_Y_{i}^{\prime }\) is the output of the i-th single model and is used as an input of the ensemble model, wij is the weight from input neuron i to hidden neuron j, and 𝜃j is the bias of neuron j.

At the output layer, we used mean square error (MSE) as the loss function, which is defined as follows,

$$ \min_{w} J(w) = \frac{1}{n}\sum\limits_{i=1}^{n}(Y_{i}-\hat{y}_{i})^{2} $$
(3)

where Yi is the actual output of training instance i and \(\hat {y}_{i}\) is the output of the neural network for instance i. Our goal is to minimize the loss function J as the neural network is trained.
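As a worked sketch of Eqs. (1)–(3), the snippet below computes the hidden net inputs, the logistic activations, and the MSE loss; the linear output layer and all variable names are our assumptions.

```python
import numpy as np

def combiner_forward(train_Y_prime, W, theta, v, c):
    # train_Y_prime: (n,) single-model predictions for one instance
    # W: (n_hidden, n) weights w_ij; theta: (n_hidden,) biases
    y = W @ train_Y_prime + theta   # Eq. (2): net input of each hidden neuron
    h = 1.0 / (1.0 + np.exp(-y))    # Eq. (1): logistic sigmoid activation
    return v @ h + c                # linear output layer (our assumption)

def mse_loss(Y, Y_hat):
    # Eq. (3): mean square error over the training instances
    return np.mean((np.asarray(Y) - np.asarray(Y_hat)) ** 2)
```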

Materials

Datasets

We evaluated the performance of our proposed model on a set of air quality data extracted from the website of historical air quality data in China (Wang 2019), which provides download services of historical air quality data for all cities in China since May 13, 2014. The air quality data come from the China Environmental Monitoring Station (CNEMC 2019), which updates the data daily. The air pollutant factors include AQI, PM2.5, PM10, CO, SO2, NO2, and O3, hour by hour. Moreover, the data also include the average values of PM2.5, PM10, CO, SO2, NO2, and O3 over a 24-h period.

We selected the air quality data of two capital cities in Northwest China, i.e., Xi’an and Lanzhou, from January 1–31, 2019, since both cities often have the worst air quality in December and January, as shown in Fig. 3, which presents the monthly average AQI and PM2.5 values of Xi’an and Lanzhou over the 69 months from January 2014 to August 2018. Note that the data of Fig. 3 come from the China Air Quality Online Monitoring and Analysis Platform (Wang 2019). The data are collected once an hour, so we obtained 744 (24 × 31) samples for each city in the first month of 2019. Besides the date and time, each sample includes 7 factors for the given hour, i.e., AQI and the concentrations of PM2.5, PM10, CO, SO2, NO2, and O3.

Fig. 3 The air quality of Xi’an and Lanzhou (Jan. 2014–Aug. 2018)

Data preprocessing

Before building machine learning models for air quality prediction, we preprocessed the data by filling missing values, handling noisy data, normalization, and dataset splitting. The details of these processes are presented in this section.

Filling missing values

Due to technical and other reasons, a small amount of the air quality data provided by the Environment Cloud was missing. This is very common in environmental monitoring but may have a significant effect on the conclusions drawn from the data.

Since our data form a time series, we filled missing values based on linear inter-/extrapolation, which constructs new data points from a discrete set of known data points. For example, if two samples are given by the coordinates (t1, y1) and (t2, y2), the missing value y∗ at time t∗ is calculated with the following formula:

$$ y_{*}=y_{1}+\frac{(t_{*}-t_{1})}{(t_{2}-t_{1})}(y_{2}-y_{1}). $$
(4)

In other words, the data points y1, y2, and y∗ lie on a straight line. Note that t∗ can be within or outside the time interval [t1, t2].
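Eq. (4) transcribes directly into a small helper (the function name is ours):

```python
def fill_missing(t1, y1, t2, y2, t_star):
    # Linear inter-/extrapolation: t_star may lie inside or outside [t1, t2]
    return y1 + (t_star - t1) / (t2 - t1) * (y2 - y1)

# e.g., PM2.5 was 40 at hour 3 and 52 at hour 6; estimate the missing hour 4:
pm25_4 = fill_missing(3, 40.0, 6, 52.0, 4)   # -> 44.0
```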

In this step, we filled 7 missing values, and the statistical details of each pollutant factor after filling missing values can be found in Table 1.

Table 1 The data statistical details

Noise reduction based on singular spectrum analysis

In machine learning, noisy data caused by various erratic factors usually affect the forecast accuracy. To deal with this problem, we employed singular spectrum analysis (SSA) to handle the noisy data. SSA is a model-free method that decomposes an original series into a sum of interpretable components, such as trend, periodic components, and noise. Afterwards, the signal can be extracted from noisy data by discarding some of the decomposed components. In other words, the noise-reduced data are obtained by adding the first several decomposed components together.

In practical applications of SSA, the optimal number of components for data reconstruction is usually half the number of decomposed components (He et al. 2019). In our work, the data series were decomposed into 100 components, and different numbers of components ranging from 10 to 70 were regarded as noise and discarded to evaluate the denoising performance. Generally speaking, discarding components results in a smoother and more slowly varying data series. Figure 4 presents the noise reduction residuals of the AQI and PM2.5 of Lanzhou from January 1 to January 15 for different percentages. The residuals are the differences between the original values and the values after noise reduction. We can see that the residuals become larger as more components are regarded as noise and discarded. However, the noise reduction does not change the trend of the data since the residuals are often very small; the only difference is that the curve becomes smoother as more data are reduced. The performance evaluation of denoising is detailed in “Denoise evaluation.”
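A compact sketch of this decompose-and-discard procedure follows, under the assumption of the standard Hankel-embedding/SVD formulation of SSA; the window length is set to 100 so that the series is decomposed into 100 components, and the function name is ours.

```python
import numpy as np

def ssa_denoise(series, L=100, keep=30):
    # Embed the series into an L-by-K trajectory (Hankel) matrix
    series = np.asarray(series, dtype=float)
    N = len(series)
    K = N - L + 1
    X = np.column_stack([series[i:i + L] for i in range(K)])
    # Decompose and keep only the leading `keep` components as signal
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xk = (U[:, :keep] * s[:keep]) @ Vt[:keep]
    # Diagonal averaging maps the rank-reduced matrix back to a series
    rec, cnt = np.zeros(N), np.zeros(N)
    for j in range(K):
        rec[j:j + L] += Xk[:, j]
        cnt[j:j + L] += 1
    return rec / cnt
```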

Fig. 4 The noise reduction residuals by SSA

Normalization

The dataset in our work was normalized by the mapminmax function of MATLAB, which is defined as:

$$ mapminmax(X, y_{min}, y_{max}) = \frac{(y_{max} - y_{min})(x - x_{min})}{x_{max} - x_{min}} + y_{min} $$
(5)

where X is the matrix to be normalized; ymin and ymax are the expected minimum and maximum values of each row of X, respectively; and xmin and xmax are the actual minimum and maximum values of each row of X. In our work, the dataset was normalized to [0.001, 1].
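A NumPy equivalent of Eq. (5), as a sketch (MATLAB’s mapminmax operates row-wise; the function here mirrors it but is our own code):

```python
import numpy as np

def mapminmax(X, y_min=0.001, y_max=1.0):
    # Rescale each row of X linearly into [y_min, y_max]
    x_min = X.min(axis=1, keepdims=True)
    x_max = X.max(axis=1, keepdims=True)
    return (y_max - y_min) * (X - x_min) / (x_max - x_min) + y_min

# e.g., normalize each pollutant factor (one row per factor) to [0.001, 1]
scaled = mapminmax(np.array([[35.0, 80.0, 120.0], [0.4, 0.9, 1.6]]))
```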

Dataset split

In our work, we took 7 pm on January 24, 2019, as the time point to split the training and test sets, ensuring that they contained 80% and 20% of the instances, respectively. Moreover, we further split the time series training and test sets with a sliding window algorithm, which segments a collection of historical air quality data into groups. The procedure is given in Algorithm 1, where data is a 1-by-d matrix, d is the length of the data, window_size is the number of consecutive observations per sliding window, and step_length is the number of steps ahead to forecast. The algorithm takes the data, window size, and step length as input and outputs X and Y, which are then used to learn the target air quality prediction models. For example, suppose a time sequence is denoted as data = (s1, s2, …, s100); with window_size = 3 and step_length = 1, X = (〈s1,s2,s3〉, 〈s2,s3,s4〉, …, 〈s97,s98,s99〉) and Y = (s4, s5, …, s100). With window_size = 5 and step_length = 2, X = (〈s1,s2,s3,s4,s5〉, 〈s2,s3,s4,s5,s6〉, …, 〈s94,s95,s96,s97,s98〉) and Y = (〈s6,s7〉, 〈s7,s8〉, …, 〈s99,s100〉). Note that the slide step of our sliding window algorithm is 1.

Algorithm 1 The sliding window algorithm
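As a sketch, Algorithm 1 can be reconstructed from the worked examples above (the slide step is fixed at 1; the function name is ours):

```python
import numpy as np

def sliding_window(data, window_size, step_length):
    # data: a 1-by-d series; returns X (windows of consecutive observations)
    # and Y (the step_length values to forecast after each window)
    d = len(data)
    n = d - window_size - step_length + 1      # number of (X, Y) pairs
    X = np.array([data[i:i + window_size] for i in range(n)])
    Y = np.array([data[i + window_size:i + window_size + step_length]
                  for i in range(n)])
    return X, Y

# e.g., window_size=3, step_length=1 on s1..s100 yields 97 pairs:
X, Y = sliding_window(np.arange(1, 101), 3, 1)   # X[0]=[1,2,3], Y[0]=[4]
```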

Evaluation metrics

In this paper, the following four metrics were employed to evaluate the performance of the involved models: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and the correlation coefficient (R), which are calculated with the following formulas.

$$ {RMSE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(x_{i}-\hat{x}_{i})^{2}} $$
(6)
$$ {MAE} = \frac{1}{n}\sum\limits_{i=1}^{n}\left |x_{i}-\hat{x}_{i}\right| $$
(7)
$$ {MAPE} = \frac{1}{n}\sum\limits_{i=1}^{n}\left |\frac{x_{i}-\hat{x}_{i}}{x_{i}}\right| \times 100\% $$
(8)
$$ R=\frac{n{\sum}_{i=1}^{n}{x_{i}\hat{x}_{i}}-\left({\sum}_{i=1}^{n}{x_{i}}\right)\left({\sum}_{i=1}^{n}{\hat{x}_{i}}\right)}{\sqrt{n{\sum}_{i=1}^{n}{x_{i}^{2}}-\left({\sum}_{i=1}^{n}{x_{i}}\right)^{2}}\sqrt{n{\sum}_{i=1}^{n}{\hat{x}_{i}^{2}}-\left({\sum}_{i=1}^{n}{\hat{x}_{i}}\right)^{2}}} $$
(9)

where xi and \(\hat {x}_{i}\) represent the actual value and the predicted value, respectively, and n is the number of test samples.
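For reference, Eqs. (6)–(9) transcribe directly into NumPy; x holds the actual values and x_hat the predictions:

```python
import numpy as np

def rmse(x, x_hat):
    return np.sqrt(np.mean((x - x_hat) ** 2))          # Eq. (6)

def mae(x, x_hat):
    return np.mean(np.abs(x - x_hat))                  # Eq. (7)

def mape(x, x_hat):
    return np.mean(np.abs((x - x_hat) / x)) * 100.0    # Eq. (8), in percent

def r_coeff(x, x_hat):
    # Eq. (9): the Pearson-style correlation coefficient
    n = len(x)
    num = n * np.sum(x * x_hat) - np.sum(x) * np.sum(x_hat)
    den = (np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
           * np.sqrt(n * np.sum(x_hat ** 2) - np.sum(x_hat) ** 2))
    return num / den
```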

The model training parameters

In this paper, our hybrid ensemble model CERL combines forward neural networks with recurrent neural networks designed for handling time series data to predict air quality in Lanzhou and Xi’an. More precisely, we took CFNN, RNN, ESN, and the RNN using previous outputs as baseline models and combined them using a BPNN to improve the prediction performance. To demonstrate the superiority of CERL over these baseline models, both the baseline models and CERL were optimized so that their best performance could be compared. We used the MATLAB functions cascadeforwardnet and layrecnet to implement CFNN and RNN, respectively, and specified the number of hidden neurons of CFNN and RNN as \(\log _{2}n\), where n is the size of the input layer. We used the ESN MATLAB library developed by Jaeger et al. (2007). The number of internal units was set to nn, where n is the size of the input layer. The spectral radius of the ESN reservoir was 0.01 to ensure that the ESN had the echo state property. We used Pyrenn (Atabay 2019), a recurrent neural network toolbox for Python and MATLAB, to implement the RNN using previous outputs. As with CFNN and RNN, we specified the number of hidden neurons of Pyrenn as \(\log _{2}n\), and the number of output delays as 2. Note that, for short, we refer to the RNN model using previous outputs as Pyrenn in what follows. Moreover, in CERL, we used a BPNN to combine the prediction results of CFNN, RNN, ESN, and Pyrenn. We specified the BPNN learning parameters as follows: learning rate 0.001, maximum number of iterations 2000, and number of hidden neurons 4. Moreover, we used “trainbr” as the training function to avoid overfitting, since it usually works well with early stopping. To make the comparison more reasonable, the reported best performance of each of the aforementioned models was averaged over 100 runs of training and testing.

Experiments and results

Denoise evaluation

As mentioned in “Noise reduction based on singular spectrum analysis,” noise reduction does not change the trend of the data but makes the curve smoother as more data are reduced. Moreover, removing noisy data is helpful for improving the model performance. In order to demonstrate the denoising performance, we used the datasets reduced by different percentages as the input of the baseline models, i.e., CFNN, RNN, ESN, and Pyrenn, to build 1-step PM2.5 prediction models for Lanzhou. Figure 5 shows the performance of the different models. In the SSA of our work, the data series were decomposed into 100 components. As we can see, the performance of the different models improves as more data are reduced, and the best performance is obtained when the noise reduction percentage is 70%; in other words, the noise-reduced data are obtained by adding the first 30 components together. The performance of all models follows the same trend, which shows that noise reduction is useful for improving model performance. In the following experiments, the noise-reduced data are obtained by adding the first 30 components together unless clearly specified otherwise.

Fig. 5 Denoised by different percentages

1-step prediction

1-step prediction means that the models predict air quality for the next hour. To obtain an optimal model, we used different window sizes to split the training set and the test set. The optimal window size was determined by analyzing the prediction performance of PM2.5, and it was then used to build the models for the other pollutant factors, i.e., AQI, PM10, CO, SO2, NO2, and O3.

Window size decision

In this paper, we used a sliding window algorithm to split the training and test sets. In order to find an appropriate window size, we used different window sizes ranging from 1 to 10 to prepare the training set and the test set. The different models, i.e., CFNN, RNN, ESN, and Pyrenn, used these data as input to build 1-step PM2.5 prediction models for Lanzhou. Figure 6a shows the performance of the different models. We can see that the performance improves as the window size increases, and the curves become flat once the window size is bigger than 5. Therefore, in our 1-step air quality prediction models, we took 5 as the window size. It is worth noticing that some models had slightly better performance when the window size was bigger than 5, but we still specified the window size as 5 to reduce computing costs. As a result, the dataset was divided into a training set of 590 samples and a test set of 144 samples.

Fig. 6 N-step window size decision

Performance comparison

To illustrate the performance of CERL, we used CFNN, RNN, ESN, and Pyrenn as baselines to build 1-step models for the 7 air pollutant factors, i.e., AQI, PM2.5, PM10, CO, SO2, NO2, and O3, and then compared their performance with that of CERL. The noise-reduced data in these experiments were obtained by adding the first 30 components together, and the window size was 5. Each model was optimized to reach its best performance, and each model was trained and tested 100 times to obtain its average performance. Tables 2 and 3 show the performance of the different models in Lanzhou and Xi’an, respectively. Note that the best results are indicated in italics.

Table 2 The performance of different models for 1-step prediction in Lanzhou
Table 3 The performance of different models for 1-step prediction in Xi’an

We can see that all models perform well in predicting the air quality of both Lanzhou and Xi’an. The average MAPE values of these models for AQI, PM2.5, PM10, SO2, NO2, O3, and CO in Lanzhou are 2.14%, 3.22%, 3.56%, 9.03%, 5.20%, 5.03%, and 6.50%, respectively. In Xi’an, the corresponding average MAPE values are 2.56%, 2.70%, 2.87%, 4.98%, 2.98%, 8.96%, and 2.38%. Moreover, among these models, CERL exhibits an improvement over CFNN, ESN, RNN, and Pyrenn on 6 of the 7 air pollutant factors in both Lanzhou and Xi’an. In Lanzhou, CERL provides superior performance on all pollutant factors except NO2. For example, the MAPE value of CERL for the AQI prediction in Lanzhou is 2.01%, CFNN has the second smallest MAPE value of 2.04%, and Pyrenn has the worst MAPE value of 2.37%. But the MAPE value of CERL for the NO2 prediction is 5.32%, which is larger than that of CFNN, whose MAPE value of 4.88% is the best. In Xi’an, CERL misses the best performance only on the SO2 prediction. For example, the MAPE value of CERL for the AQI prediction in Xi’an is 2.25%, CFNN has the second smallest MAPE value of 2.31%, and Pyrenn has the worst MAPE value of 3.60%. But the MAPE value of CERL for the SO2 prediction is 5.04%, which is larger than that of ESN, whose MAPE value of 4.47% is the best. However, although CERL has superior performance over the other models, the performance of all these models is similar, because all of them are adequate for predicting air quality precisely in the short term. Figure 7 compares these models on the MAPE metric.

Fig. 7 1-step comparison (MAPE)

The final prediction results of CERL in Lanzhou are given in Fig. 8. We can see that all models have adequate performance, and the forecast values fit the actual values very well for all pollutant factors.

Fig. 8 CERL 1-step prediction in Lanzhou

N-step prediction

To demonstrate the performance of CERL on longer-term air quality prediction, this section compares CERL with the baseline models for air quality prediction in the next 3, 5, and 8 h, respectively. Note that the values N steps ahead are predicted simultaneously by the models in our work rather than derived from the results of 1-step prediction. The procedure used to prepare the training and test sets can be found in Algorithm 1.

3-step prediction

As with the 1-step prediction, we first ran experiments to decide the window size of the 3-step prediction, and Fig. 6b shows the performance of the different models with different window sizes when predicting PM2.5 in Lanzhou. We can see that the performance of all models except Pyrenn reaches its optimum when the window size is 10. As a result, the dataset is divided into 583 samples for the training set and 137 samples for the test set. Pyrenn does not reach its best performance at window size 10, but its performance there is near-optimal. Therefore, we used 10 as the window size to build the 3-step air quality prediction models, and the results are shown in Fig. 9.

Fig. 9 3-step comparison (MAPE)

We can see that, unlike in the 1-step prediction, CERL provides better performance on all air pollutant factors in both Lanzhou and Xi’an. Moreover, CERL’s improvement over the baseline models is more obvious than in the 1-step case. For example, the MAPE value of CERL for the 1-step PM2.5 prediction is 2.94%, which improves only 2.04% over CFNN, which has the second best performance with a MAPE value of 3.00%. However, in the 3-step prediction in Lanzhou, the MAPE value of CERL for PM2.5 is 3.92%, which improves 7.98% over RNN, which has the second best MAPE value of 4.26%. The same holds for the SO2 prediction. In Xi’an, the MAPE value of CERL for the 1-step SO2 prediction is 5.04%, which is even worse than those of CFNN, ESN, and RNN. However, in the 3-step prediction in Xi’an, the MAPE value of CERL for SO2 is 6.89%, which improves 11.67% over ESN, which has the second best MAPE value of 7.80%.

5-step prediction

In the 5-step prediction, the best prediction performance is obtained when the window size is 10, as shown in Fig. 6c. We can see that the ESN and Pyrenn models do not reach their best performance at window size 10, but their performance there is near-optimal. As a result, the dataset is divided into 581 samples for the training set and 135 samples for the test set. The performance of the different 5-step models with a window size of 10 is presented in Fig. 10.

Fig. 10 5-step comparison (MAPE)

As in the 3-step prediction, CERL performs obviously better than the other baseline models on all air pollutant factors, and the improvement is clearer than in both the 1-step and 3-step predictions. For example, the MAPE value of CERL for the 3-step PM2.5 prediction is 3.92%, which improves 7.98% over RNN, which has the second best MAPE value of 4.26%; in the 5-step PM2.5 prediction in Lanzhou, the MAPE value of CERL is 6.87%, which improves 9.96% over RNN, which has the second best MAPE value of 7.63%. In Xi’an, the MAPE value of CERL for the 3-step SO2 prediction is 6.89%, which improves 11.67% over ESN, which has the second best MAPE value of 7.80%; in the 5-step SO2 prediction, the MAPE value of CERL is 12.30%, which improves 21.36% over ESN, which has the second best MAPE value of 15.64%. However, as the step length increases, the overall performance declines. As shown in Figs. 7 and 9, the MAPE values of CERL for 1-step and 3-step prediction fall in the ranges of 2.01∼8.72% and 2.66∼11.84% for almost all air pollutant factors, respectively. As the step length is increased to 5, the MAPE values of CERL fall in the range of 5.05∼16.12%.

8-step prediction

In our 8-step air quality prediction models, we took 8 as the window size, as shown in Fig. 6d. We can see that the best performance is achieved when the window size is 6, and the performance even declines when the window size is bigger than 10. As a result, the dataset is divided into 585 samples for the training set and 139 samples for the test set.

In the 8-step prediction, CERL also has the best performance on almost all air pollutant factors, as shown in Fig. 11. The MAPE value of CERL for the 8-step PM2.5 prediction in Lanzhou is 10.97%, which improves 20.04% over RNN, which has the second best MAPE value of 13.72%. In Xi’an, the MAPE value of CERL for the 8-step SO2 prediction is 18.39%, which improves 8.14% over ESN, which has the second best MAPE value of 20.02%. The improvement is not as large as in the previous experiments because Pyrenn has the worst MAPE value of 65.02% for the 8-step SO2 prediction. Moreover, we can see that, although CERL performs better, its MAPE values for air quality prediction increase to the range of 6.59∼31.69%.

Fig. 11 8-step comparison (MAPE)

Discussions

The CERL improvements

To sum up, we can see that CERL provides better performance than the baseline models. In the 1-step prediction, all models perform well in predicting air quality: the average MAPE values of these models for all air pollutant factors (except O3) fall in the ranges of 2.13∼6.05% in Lanzhou and 2.56∼4.98% in Xi’an, respectively. This is because all these models are adequate for dealing with time series data, especially for short-term prediction. Although CERL has superior performance over the other models, its improvement is not obvious, and it is sometimes even worse than the baseline models. As the step length increases, CERL shows a more obvious improvement, as shown in Tables 4 and 5. For example, the improvements of CERL in the 1-step, 3-step, 5-step, and 8-step predictions for PM2.5 in Lanzhou are 1.82%, 8.01%, 9.98%, and 20.03%, respectively.

Table 4 The CERL MAPE improvements in different step prediction in Lanzhou
Table 5 The CERL MAPE improvements in different step prediction in Xi’an

However, as the step length increases, the overall performance of all models declines. For example, the MAPE values of CERL for 1-step, 3-step, and 5-step prediction fall in the ranges of 2.01∼8.72%, 2.66∼11.84%, and 5.05∼16.12%, respectively. As the step length is increased to 8, the MAPE values of CERL fall in the range of 6.59∼31.69%. We did not evaluate larger step lengths, since doing so makes little sense once the prediction quality falls below expectations.

Diebold Mariano test

We further compare the performance of the different models with a hypothesis testing method, the Diebold-Mariano (DM) test. The DM test is often used to check whether two forecasts of a time series are significantly different.

Let \({e}_{i}^{1}\) and \({e}_{i}^{2}\) be the residuals of the two forecasts for sample i, i.e.,

$$ {{e}_{i}^{1}}=y_{i}-g_{i} \qquad {{e}_{i}^{2}}=y_{i}-h_{i} $$
(10)

where yi is the actual value, and gi and hi are the predicted values of the two forecasts.

The loss functions of the two forecasts are defined as:

$$ L({e}_{i}^{1})=({e}_{i}^{1})^{2} \qquad L({e}_{i}^{2})=({e}_{i}^{2})^{2} $$
(11)

The DM test statistic can then be defined by:

$$ DM=\frac{\frac{1}{n}{\sum}_{i=1}^{n}\left(L({e}_{i}^{1})-L({e}_{i}^{2})\right)}{\sqrt{\frac{S^{2}}{n}}}. $$
(12)

where S2 is an estimator of the variance of \(d_{i}=L({e}_{i}^{1})-L({e}_{i}^{2})\). To check whether our CERL model is more accurate than the other ones, we test the equal accuracy hypothesis. Given a significance level α, the two hypotheses H0 and H1 are defined as:

$$ H_{0}: L({e}_{i}^{1})=L({e}_{i}^{2}) $$
(13)
$$ H_{1}: L({e}_{i}^{1})\not=L({e}_{i}^{2}) $$
(14)

The null hypothesis H0 denotes that there is no significant difference in the prediction performance of the two forecasts. Against the null hypothesis H0, the hypothesis H1 indicates that the two forecasts have different levels of performance. The DM statistic approximately follows a standard normal distribution N(0,1) under the null hypothesis. In this work, we set the significance level to 5%; in other words, the null hypothesis is rejected if |DM| > 1.96. Table 6 shows the DM test values for the PM2.5 prediction between our CERL and the other baseline models, i.e., CFNN, RNN, ESN, and Pyrenn. We can see that the DM value closest to zero is −2.06, whose absolute value still exceeds 1.96. As a result, we can draw the conclusion that the null hypothesis is rejected and CERL has better performance than the other models.
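The DM statistic of Eq. (12) with the squared-error loss of Eq. (11) can be sketched as follows, taking S² to be the sample variance of di (our assumption, appropriate for 1-step-ahead comparisons):

```python
import numpy as np

def dm_test(y, g, h):
    # Diebold-Mariano statistic for forecasts g and h of actuals y with
    # squared-error loss; under H0 it is approximately N(0,1), so reject
    # equal accuracy at the 5% level when |DM| > 1.96.
    d = (y - g) ** 2 - (y - h) ** 2          # loss differential d_i
    n = len(d)
    return d.mean() / np.sqrt(d.var(ddof=1) / n)
```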

Table 6 DM test of different models

To sum up, there are several reasons for these results. One is that CERL is an ensemble model that employs different analytical models and then synthesizes their results into a single score in order to improve the prediction performance. Moreover, CERL not only involves forward neural networks but also exploits the merits of recurrent neural networks designed for handling time series data, such as RNN, ESN, and recurrent networks using previous outputs. In other words, CERL is capable of capturing different underlying patterns in the data, thereby having superiority over the other baseline models.

Conclusion

This paper proposed a hybrid ensemble model, CERL, which exploits the merits of both forward neural networks and recurrent neural networks designed for handling time series data to predict air quality hourly. Measured air pollutant factors including AQI, PM2.5, PM10, CO, SO2, NO2, and O3 are used as input to predict air quality from 1 to 8 h ahead. Based on the air quality prediction of two rarely studied capital cities in Northwest China, Lanzhou and Xi’an, CERL further improves the prediction performance over recurrent neural networks. However, this work is based on prediction at the hour level and does not have high accuracy for long-term prediction. Our future study will be expanded to explore air quality prediction at the day level. Moreover, this work only considers measured air pollutant factors, and adding measured meteorological information into the air quality prediction may be another direction of our future research. In the future, we will also investigate how the information from multiple meteorological monitoring stations influences air quality prediction. In addition, we plan to employ the convolutional neural network (CNN), a well-known deep learning model, in air quality prediction, since multiple factors are involved.