1 Introduction

Today, PM2.5 is becoming more and more harmful to the human body, and it causes a drop in air quality. Polluted air contains a large amount of harmful substances that are most likely to affect the health of the respiratory system (Feng et al. 2015). Moreover, the body’s respiratory system is exposed to the air for a long time. Therefore, when air quality is poor, people inhale more particulate matter and higher concentrations of viruses, which can easily cause many diseases (Wang et al. 2019a). In recent years, smog has frequently affected people’s health in many provinces and cities across the country. Haze pollutants are prone to causing cardiovascular disease. Smog also reduces visibility and increases the probability of traffic accidents, which affects people’s travel, production and daily life. Haze pollution has become one of the most serious air quality problems in the world.

Air quality prediction describes and analyzes the future air pollution status, environmental quality trends, and the dynamic changes of major pollutants and pollution sources (Tao et al. 2019). Previously, the data and predictions all came from sensors and the Internet of Things (Al-Janabi 2020; Al-Janabi et al. 2020b). Air quality prediction provides a basis for proposing countermeasures that prevent further deterioration of the environment and improve it. It is therefore particularly important to predict the smog index in a timely manner so that people can take preventive measures and minimize losses. As air quality issues become more and more important, providing accurate environmental predictions is a very important problem (Wang et al. 2019d), and the prediction of air quality data is very necessary.

Timely air quality prediction lets people take preventive measures to minimize losses (Qian-Rao 2016; Al-Janabi and Alkaim 2020). However, air pollution has a variety of sources, e.g., automobile exhaust, volcanic eruptions and industrial emissions, and is usually formed by a mixture of them. Moreover, in smoggy weather, the extent of the effect of each pollution source varies from region to region. The atmospheric system is very complex, and various factors in the atmosphere have a certain correlation (Lonati et al. 2005). Factor analysis can be used to find out which factors in the atmosphere are related to the smog index. Therefore, smog is difficult to predict.

At present, the methods of air quality prediction can be roughly divided into three categories (Yan et al. 2018): statistical methods, traditional forecasting methods, and deep learning algorithms (Li and Xu 2018; Al-Janabi et al. 2020a). Statistical methods include linear regression, grey prediction, Markov prediction, and so on. Most statistical models impose certain requirements on the data and have relatively clear mathematical forms, so it is difficult for them to describe data with complex components using limited mathematical formulas. Traditional methods are characterized by simplicity, mature theory and easy implementation. They only need historical statistics of the various indicators. Traditional analysis (Al-Janabi and Abaid Mahdi 2019) makes it easy to analyze the future development of data, especially short-term trends. Deep learning algorithms can establish the intrinsic mapping between environmental indicators and influencing factors; their prediction accuracy is significantly higher than that of traditional analysis methods, and their predictions can be accurate to a specific date.

To this end, a Convnet and Dense-based Bidirectional Gated Recurrent Unit (CDBGRU) is proposed, which achieves better performance on air quality data than other methods. It uses a rebuilt convnet in place of the traditional convnet, in which another convolutional layer is used instead of the max-pooling layer. To sum up, the main contributions of this paper are as follows:

  1. This paper proposes a Convnet and Dense-based Bidirectional Gated Recurrent Unit (CDBGRU). In our model, a rebuilt convnet, a Bi-GRU and an additional Dense layer are used to predict a more accurate air quality value.

  2. We verify the effectiveness and performance advantages of the proposed method through a large number of comparative experiments on Beijing air quality data from 2018-01-01 to 2019-07-01.

The rest of this article is organized as follows: Sect. 2 introduces the work related to this article. Section 3 introduces the research methods and principles of this paper. Section 4 shows the experimental results and forecast accuracy. Finally, we give the article summary and future prospects in Sect. 5.

Fig. 1 Some typical air pollution features

2 Related work

The study of air quality has a long history. In the past, methods for forecasting air quality focused on statistical approaches. Afterward, shallow machine learning methods were proposed to solve the forecasting problem, such as SVR (Drucker et al. 1997), DTR (Xu et al. 2005) and GBR (Huang and Oosterlee 2011). As a common regressor, SVR achieves superior performance in many scenarios. Generally, SVR can be divided into three classes, i.e., linear SVR, polynomial SVR and RBF kernel SVR. In particular, RBF kernel SVR often has the best performance on complex data such as air quality data. DTR is based on the decision tree, and the CART decision tree is usually used for regression because ID3 and C4.5 trees are inconvenient for this task. GBR is a boosting model, a technique that learns from its mistakes. In essence, it integrates a group of weak learning algorithms into a stronger one. Additionally, many studies consider the links between the features in air quality data; for example, Deleawe et al. (2010) extended machine learning to predicting the \(\hbox {CO}_2\) level, which inspired later prediction methods.

However, with the study of deep learning (Hao et al. 2018) and big data analysis (Wang et al. 2019c), more and more studies have considered how to combine deep learning models (Singh and Srivastava 2016) with air quality data. Because air quality data have dynamic, nonlinear and, especially, time series-related characteristics, more and more studies focus on data-driven models. So far, a large number of air quality forecasting methods based on big data analysis (Najafabadi et al. 2015) have been proposed to predict each feature and pollution value. For example, Zheng et al. (2013) proposed a semi-supervised learning approach based on a co-training framework consisting of ANN and CRF to predict PM2.5.

In recent years, deep learning models, especially convolutional neural networks and recurrent neural networks, have made great progress. As a time series prediction model, RNN (Zaremba et al. 2014) performs well on air quality forecasting because air quality data are typical time series data. So far, a large number of derivatives of RNN have appeared. To deal with vanishing gradients in RNN, LSTM (Gers et al. 1999) and GRU were proposed, and both have been widely used in industry. Derivatives of LSTM and GRU have also appeared, such as the bidirectional GRU (Schuster and Paliwal 1997), which also processes the data in the backward direction to smooth the nonlinear characteristics. CNN (Zhang et al. 2018; Long and Zeng 2019; Pan et al. 2019) is a good tool for extracting features, and Tao et al. (2019) combined CNN and Bi-GRU into a model called CBGRU to predict the value of PM2.5, which performs well. However, the methods above still have room for improvement. Therefore, CDBGRU is proposed.

3 Proposed method

In this section, we propose a method for predicting the value of PM2.5, called Convnet and Dense-based Bidirectional Gated Recurrent Unit (CDBGRU), and introduce each part of the method.

3.1 Problems and motivations

Nowadays, air quality forecasting has become a popular research topic in the control of urban and rural air pollution. The goal of air quality forecasting is to predict the change over time of the PM2.5 value at the observation point. The observation time interval is usually set to one or two hours, which is decided by the ground-based monitoring station (Wang et al. 2019b). Figure 1 shows typical air pollution data such as PM2.5, PM10, AQI and \(\hbox {NO}_2\).

The PM2.5 prediction problem is usually formulated as follows. Given time T and the air quality index \(P_{T}\), the goal is to predict the PM2.5 concentration value \(P_{T+1}\) at time \(T+1\) or \(P_{T+n}\) at time \(T+n\). Usually, an accurate result is obtained by modeling the historical air quality-related time series dataset \(\{ \;{P_t}|t = 1,2, \ldots ,T\} \), where P represents the historical air quality index, including PM2.5 and other air quality-related time series data such as \(\hbox {NO}_2\), temperature, \(\hbox {SO}_2\), etc. The difficulty of prediction lies in how to combine the trend of the historical data with the relationships within the time series. To this end, CDBGRU is proposed.
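The formulation above can be illustrated with a minimal sketch that turns a multivariate series into supervised samples. The function name `make_windows`, its parameters and the toy array are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def make_windows(series, lookback, horizon=1, target_col=0):
    """Turn a (T, F) multivariate series into supervised samples.

    X[i] holds `lookback` consecutive time steps of all F features;
    y[i] is the target column `horizon` steps after the window ends.
    """
    series = np.asarray(series, dtype=float)
    n_samples = series.shape[0] - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback/horizon")
    X = np.stack([series[i:i + lookback] for i in range(n_samples)])
    first_target = lookback + horizon - 1
    y = series[first_target:first_target + n_samples, target_col]
    return X, y

# Toy data (made up): 6 hourly records with 2 features;
# column 0 plays the role of PM2.5.
toy = np.arange(12, dtype=float).reshape(6, 2)
X, y = make_windows(toy, lookback=3, horizon=1)
print(X.shape, y)
```

Setting `horizon=n` corresponds to predicting \(P_{T+n}\) instead of \(P_{T+1}\).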

3.2 Our proposed method

3.2.1 Overview of CDBGRU

To solve the above problems, this paper proposes an air quality forecasting method for predicting PM2.5, called Convnet and Dense-based Bidirectional Gated Recurrent Unit (CDBGRU). In general, because the representations and data structures differ over time, the statistical characteristics of air quality-related time series vary, and shallow machine learning models cannot perform well in such complex scenarios. In recent years, hybrid deep learning models have also been used to predict the PM2.5 value, achieving more effective results than classic machine learning models.

CDBGRU is a combination of CNN, additional fully connected layers and a bidirectional GRU that considers the spatial-temporal dependence of air quality-related time series data. The model consists of three parts. In the first part, a one-dimensional convnet without max-pooling performs local feature learning and dimensionality reduction on the input variables. Compared with traditional convnets, using another convolutional layer in place of max-pooling performs better at turning the original data into low-dimensional feature sequences. In the second part, the feature sequences are fed into the bidirectional GRU. GRU is an improvement on LSTM whose reset-gate and update-gate parameters are constantly adjusted during training, so the relationships between the features extracted by the convnets are learned from the time series data. At the end of the model, two fully connected layers are used, and the last layer contains only one unit to predict the PM2.5 concentration value. The innovation of this method is to use a one-dimensional convnet as the preprocessing step before the GRU and to replace the max-pooling layer with an additional convolution layer, which combines local feature extraction with the time series prediction ability of the GRU. In addition, by processing the sequence in both directions, the bidirectional GRU can capture patterns that may be ignored by a unidirectional GRU. Figure 2 shows the architecture of our deep model.

Fig. 2 The architecture of CDBGRU

3.2.2 1D convnet for local trend features learning

CNN not only performs well in image processing but can also be used for air quality data processing and time series analysis. CNN usually has three layers: a convolutional layer, an activation layer and a pooling layer. 1D convnets can be utilized for learning local trend features: they perform convolution operations to extract features from local segments of the input, which allows modular representation and data efficiency. These characteristics make convnets excellent in image processing and suitable for time series processing. In this paper, we regard time as a spatial dimension. The local perception and weight sharing of convnets reduce the difficulty of handling multivariate time series, accelerate the learning of parameters and improve learning efficiency. Because the same input transformation is applied to every subsequence, a pattern learned at one position in a sequence can later be recognized at other positions, a property summarized as the time-shift invariance of convnets.

As shown in Fig. 3, convolution kernel windows are used in each convolution layer to process the meteorological and PM2.5 time series. Sequence fragments are learned within one window size, and these subsequences can be identified anywhere in the whole time series, so that the local trend changes of the multivariate series over time can be captured. In a general convolution network, a max-pooling layer follows the 1D convolution operation for secondary sampling. In our network, another convolution layer is used instead to output the subsequences extracted from the input time series, because the weather data do not need pooling-based downsampling after convolution and the samples do not require high translation invariance. This paper therefore chooses CNN + ReLU with stride = 2 to replace the original pooling layer, which not only reduces the length of the input time series but also retains most of the characteristics of the air pollution data.
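To make the replacement of pooling by a strided convolution concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes unpadded ("valid") convolutions, random weights, 5 input features and 8 filters; all of these values and the helper name `conv1d_relu` are illustrative assumptions.

```python
import numpy as np

def conv1d_relu(x, kernels, bias, stride=1):
    """Unpadded 1D convolution over time followed by ReLU.

    x:       (T, C_in) input sequence
    kernels: (K, C_in, C_out) filters
    bias:    (C_out,)
    returns: (T_out, C_out) with T_out = (T - K) // stride + 1
    """
    T, _ = x.shape
    K, _, c_out = kernels.shape
    t_out = (T - K) // stride + 1
    out = np.empty((t_out, c_out))
    for i in range(t_out):
        window = x[i * stride:i * stride + K]  # (K, C_in) slice of the series
        # Contract the window with every filter over time and channel axes.
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 5))            # 10 time steps, 5 features (assumed)
w1 = rng.normal(size=(3, 5, 8)) * 0.1   # feature-extracting convolution
w2 = rng.normal(size=(2, 8, 8)) * 0.1   # strided convolution standing in for pooling
h = conv1d_relu(x, w1, np.zeros(8), stride=1)
d = conv1d_relu(h, w2, np.zeros(8), stride=2)
print(h.shape, d.shape)
```

Unlike max-pooling, the stride-2 layer has learnable weights, so the network decides how neighbouring time steps are merged while still roughly halving the sequence length.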

Fig. 3 1D convnets processing time series

3.2.3 Bidirectional gated recurrent unit for time series forecasting

The bidirectional Gated Recurrent Unit (BiGRU) used in our model is shown in Fig. 4; it is often used for predicting time series processes. It is well known that RNN is a distinct kind of neural network developed for processing sequential data. However, due to disadvantages of RNN such as vanishing and exploding gradients, it is difficult for RNN to learn long-term-dependent tasks such as long time series of air quality data. To solve this problem, LSTM, GRU and their derivatives were proposed. The LSTM keeps track of long-term information through gates (input gate, forget gate and output gate). GRU is an improved version of the LSTM, which can also learn long-term dependencies. Different from LSTM, GRU has no memory unit and only two gates, the update gate and the reset gate, instead of three. Compared with LSTM, GRU has a simpler architecture, requires less computation and trains faster.

In the GRU cell, the update gate decides which information is retained and delivered to the next state, and the reset gate determines how the previous state information is combined with the new input information. The state update rules for the next output and state value in the GRU unit are as follows.

$$z_t = \sigma \left( W_z * [x(t), h(t-1)] \right) \tag{1}$$
$$r_t = \sigma \left( W_r * [x(t), h(t-1)] \right) \tag{2}$$
$$\hat{h}(t) = \sigma \left( W_h * \left[ x(t), \left( r_t * h(t-1) \right) \right] \right) \tag{3}$$
$$h(t) = \left( 1 - z_t \right) * h(t-1) + z_t * \hat{h}(t) \tag{4}$$

where \(\sigma \) is the activation function, x(t) is the input, \(h(t- 1)\) is the previous output, and \(W_z\), \(W_r\) and \(W_h\) are the weights of the update gate, reset gate and candidate output, respectively.
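A single GRU update following Eqs. (1) to (4) can be sketched in NumPy as below. This is an illustrative sketch under two assumptions that the text does not state: the gates use the logistic sigmoid, and the candidate output uses tanh, as in the common GRU formulation (Eq. (3) only writes a generic activation \(\sigma \)). Biases are omitted, as in the equations, and the dimensions are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU update; each weight matrix has shape (hidden, n_in + hidden)."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh)                                     # update gate, Eq. (1)
    r = sigmoid(W_r @ xh)                                     # reset gate, Eq. (2)
    # Candidate output, Eq. (3); tanh is an assumption (see lead-in).
    h_hat = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))
    return (1.0 - z) * h_prev + z * h_hat                     # new state, Eq. (4)

rng = np.random.default_rng(0)
n_in, hidden = 4, 3                       # illustrative sizes
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden, n_in + hidden))
                 for _ in range(3))
h = np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):    # 5 made-up time steps
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h)
```

Because Eq. (4) is a convex combination of the previous state and the bounded candidate, the state stays within \((-1, 1)\) when it starts from zero.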

Fig. 4 Bidirectional GRU

The Bi-GRU consists of two GRUs: one processes the input time series in chronological order and the other processes it in anti-chronological order, and their representations are then combined into a total state. Features and time series data such as air quality data and meteorological data follow the distribution of some continuous function, so we can fit a function to the previous observations to predict future data. In the same way, future data can be used to fit a function that predicts the value at the previous moment, which is often used in back propagation. In previous research on time series forecasting, only historical data provided predictive power through this continuity; a bidirectional training model, however, provides additional useful information by also modeling the reverse continuity. Viewing air quality data in both chronological and anti-chronological order enables the model to obtain more accurate representations and to capture patterns that may be ignored by a chronological GRU, thereby improving on the performance of an ordinary GRU.
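The two-pass idea can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes that the two representations are combined by concatenation (the text only says they are combined into a total state), that the candidate activation is tanh, and it uses made-up sizes and random weights. The helpers `gru_run`, `bigru` and `init_params` are hypothetical names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_run(xs, params, hidden):
    """Run a GRU over xs (T, F) and return all hidden states (T, hidden)."""
    W_z, W_r, W_h = params
    h = np.zeros(hidden)
    states = []
    for x_t in xs:
        xh = np.concatenate([x_t, h])
        z = sigmoid(W_z @ xh)
        r = sigmoid(W_r @ xh)
        h_hat = np.tanh(W_h @ np.concatenate([x_t, r * h]))
        h = (1.0 - z) * h + z * h_hat
        states.append(h)
    return np.stack(states)

def bigru(xs, fwd_params, bwd_params, hidden):
    """Concatenate a chronological pass with a time-reversed pass."""
    forward = gru_run(xs, fwd_params, hidden)
    # Reverse the input, run the second GRU, then re-align to original time order.
    backward = gru_run(xs[::-1], bwd_params, hidden)[::-1]
    return np.concatenate([forward, backward], axis=1)

def init_params(rng, n_in, hidden):
    return tuple(rng.normal(scale=0.1, size=(hidden, n_in + hidden))
                 for _ in range(3))

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 4))             # 10 time steps, 4 features (assumed)
fwd, bwd = init_params(rng, 4, 6), init_params(rng, 4, 6)
out = bigru(xs, fwd, bwd, hidden=6)
print(out.shape)
```

At every time step the output row holds a summary of the past (first half) and a summary of the future (second half), which is what lets the bidirectional model see patterns a chronological GRU would miss.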

3.2.4 Fully connected layer (dense)

A fully connected layer often plays the role of classifier or regressor in the whole network. It maps the learned distributed feature representation to the sample label space. Using more than one fully connected layer can improve the accuracy of our model because of the large number of parameters in these layers.

3.3 Discussion about limitation

This paper proposes the CDBGRU algorithm to predict the PM2.5 value with an advanced neural network. However, the model has some limitations.

  1. A large neural network increases the time cost. To some extent, the network in this paper is complex: compared with previous methods, it has more parameters, which increases the computation time to some degree. However, in this paper, it is worthwhile to sacrifice time complexity for more accurate results.

  2. A 1D convnet is used to deal with nonlinear features. Previously, most methods used the correlation coefficient to screen air quality data. Compared with that selection strategy, the 1D convnet may have advantages, but there is no evidence that the correlation coefficient is worse than the 1D convnet. Highly linear parameters selected by the correlation coefficient can also achieve good results.

4 Experiment

In this section, we will perform experiments on real air quality data to evaluate the proposed method. By comparing the classic shallow learning model, deep learning model and our model, the prediction performance and effectiveness of the model are verified.

4.1 Dataset

The experiment in this paper uses the Beijing air quality dataset, which includes PM2.5 values and other features such as date, time, temperature, humidity, wind speed and wind direction. The data interval in the dataset is one hour, and the data used for the experiment range from 2018-01-01 to 2019-07-01.

4.2 Setup

4.2.1 Error measure

We choose Root Mean Square Error (RMSE) as the loss function because RMSE better reflects the true size of the prediction error. In addition, R-Square is used as an evaluation index to measure how well the model explains the variation in the data and thus the prediction quality of the model. The two evaluation indexes are defined as follows:

$$\mathrm{RMSE}_{\left( y^\prime , y \right) } = \sqrt{\frac{1}{n}\sum \nolimits _{i = 1}^n \left( y_i^\prime - y_i \right) ^2 } \tag{5}$$
$$R\text{-}\mathrm{Square}_{\left( y^\prime , y \right) } = 1 - \frac{\sum \nolimits _{i = 1}^n (y_i - y_i^\prime )^2 }{\sum \nolimits _{i = 1}^n (y_i - \bar{y})^2 } \tag{6}$$

where n is the number of samples, \({y_i}\) is the real value, \(y_i^\prime \) is the predicted value and \({\bar{y}}\) is the mean of the real values.
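Equations (5) and (6) translate directly into a few lines of NumPy. The sketch below is illustrative; the function names and the four toy values are made up and are not results from the paper.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error, Eq. (5)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r_square(y_pred, y_true):
    """Coefficient of determination, Eq. (6)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up values purely to exercise the formulas.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(rmse(y_pred, y_true), r_square(y_pred, y_true))
```

A lower RMSE and an R-Square closer to 1 both indicate a better fit; R-Square equals 1 only when every prediction matches the real value.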

4.2.2 Experiment setup

We choose eight other models to compare with our model. Besides GRU and BGRU, the other six models are introduced as follows:

  1. Support Vector Regression (SVR) (Drucker et al. 1997): SVR is an important branch of the support vector machine. It holds that as long as the deviation between the predicted value and the true value is not too large, the prediction can be considered correct without calculating the loss.

  2. Gradient Boosting Regressor (GBR) (Huang and Oosterlee 2011): GBR is a boosting method that optimizes the regression results through a series of iterations; each iteration introduces a weak regressor to overcome the shortcomings of the existing combination of weak regressors.

  3. Decision Tree Regressor (DTR) (Xu et al. 2005): DTR is an application of the decision tree. It realizes reasonable regression prediction through continuous branching and pruning.

  4. Recurrent neural network (RNN) (Schuster and Paliwal 1997): RNN is a kind of neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain. It is usually used to deal with sequence data.

  5. LSTM (Gers et al. 1999): LSTM is designed to deal with long-term dependence. It solves the vanishing gradient problem to a certain extent.

  6. CBGRU (Tao et al. 2019): CBGRU tries to predict air quality data with fewer parameters. It uses a convolutional layer to obtain more effective information.

Some information about, and comparisons of, each method are shown in Table 1.

Table 1 Information about and comparisons of each method

4.2.3 Results analysis

All deep learning models are trained for 200 epochs, and each result is averaged over 10 trials. We use dropout between layers with a probability of 0.2 to avoid over-fitting. In addition, all the deep learning models use an early stopping condition during training: if the loss on the validation data does not change for 20 epochs, training stops. After obtaining the trained models, each data point in the testing dataset is tested, and MAE, RMSE and R-Square are calculated.
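The early stopping rule can be sketched as a small helper. This is one possible reading rather than the authors' code: "does not change" is interpreted here as "does not improve on the best validation loss so far", and the function name `early_stopping_epoch`, the `min_delta` parameter and the loss values are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=20, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value by more than `min_delta` for `patience` consecutive epochs.
    If that never happens, all epochs are used.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)

# Made-up validation curve: improves for 3 epochs, then plateaus.
plateau = [1.0, 0.9, 0.8] + [0.8] * 30
stop_at = early_stopping_epoch(plateau, patience=20)
print(stop_at)
```

With a patience of 20, a run whose validation loss keeps improving uses the full 200-epoch budget, while a run that plateaus early is cut short.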

Before starting the experiment, we need to preset the hyperparameters of the model for the best performance. Because GRU is our baseline, we tune the hyperparameters on GRU, including lookback and the number of neurons in the hidden layers, i.e., how many time steps the input data should go back and how many neurons are needed to achieve an optimal prediction effect. Lookback was chosen from the candidate set \(\{5,10,15,20,25\}\), and the experimental result is shown in Fig. 5.

Fig. 5 The lookback value with RMSE

From Fig. 5, we can see that as lookback increases, the forecasting performance first improves greatly and then begins to deteriorate. The model achieves the best performance when lookback is 10. A small lookback cannot guarantee long-term memory in the input, while a large lookback feeds in more redundant information, which is not conducive to modeling. Next, we search for a suitable number of neurons, chosen from the candidate set \(\{32, 64, 80, 100, 128, 256\}\). The results are shown in Fig. 6.

Fig. 6 The number of neurons with RMSE

From Fig. 6, we can see that as the number of neurons increases, the model first gets better and then worse. The model performs best when the number of neurons is 80. Combining Figs. 5 and 6, we use the past 10 hours to predict the next hour’s data and set the number of neurons in the GRU hidden layer to 80.

As for the convolutional neural network, a common setting is two convolutional layers with ReLU activation functions, where the length of the kernel window is 1\(\times \)3. In place of the max-pooling layer with a pool size of 2, another convolution with a 1\(\times \)2 window is used. Finally, we add another fully connected layer with only 1 neuron.
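The effect of this configuration on the sequence length can be checked with simple arithmetic. The sketch below assumes unpadded convolutions, which the text does not specify, and takes the stride of 2 for the replacement layer from Sect. 3.2.2; with "same" padding the first two layers would keep the length at 10 instead.

```python
def conv_out_len(length, kernel, stride=1):
    """Output length of an unpadded 1D convolution."""
    return (length - kernel) // stride + 1

lookback = 10                                                # chosen from Fig. 5
after_conv1 = conv_out_len(lookback, kernel=3)               # first 1x3 convolution
after_conv2 = conv_out_len(after_conv1, kernel=3)            # second 1x3 convolution
after_down = conv_out_len(after_conv2, kernel=2, stride=2)   # 1x2 convolution, stride 2
print(lookback, after_conv1, after_conv2, after_down)
```

Under these assumptions the Bi-GRU receives a feature sequence of 3 time steps for every 10-hour input window.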

4.3 Forecasting results and analysis

To verify the efficiency and accuracy of CDBGRU, we train several comparative models alongside our model. In SVR, we choose the RBF kernel, and the kernel coefficient is the reciprocal of the number of features. In DTR, the maximum depth of the tree is 10 and the criterion is gini. In GBR, we set the loss to least squares. For the deep learning models, i.e., RNN, LSTM, GRU, BGRU and CBGRU, the number of hidden layers is set to 2, each with 80 nodes.

Table 2 The result in different methods
Fig. 7 The true value and the predicted value from 2019-1-1 to 2019-2-28

4.3.1 Results analysis

To verify the effectiveness of our method, several experiments are performed. Table 2 shows the quantitative results in terms of RMSE and R-Square. From Table 2, we can see that shallow models usually perform worse than deep learning models: compared with deep learning models (such as RNN or LSTM), shallow models (such as SVR) have a larger RMSE and a smaller R-Square. Among the deep learning methods, LSTM and GRU have similar performance, and both are better than RNN. Adding bidirectional training improves performance further because it enhances network stability. Compared with CBGRU, our model performs better owing to the additional convolutional layer used instead of the max-pooling layer and the additional fully connected layer. The result shows that our model can learn more local trend information, time series information and long-term dependencies. Besides, to illustrate the validity of the model more intuitively, Fig. 7 compares the predicted values with the true values on part of the data. From Fig. 7, we can observe that our method performs well on PM2.5 forecasting.

4.3.2 Parameter sensitivity analysis

Although two parameters (lookback and the number of neurons) were tested before, there are still other parameters, such as the number of epochs and the size of the convolution kernel. To better evaluate the effect of these parameters, we evaluate the results on both the training set and the test set. Figures 8 and 9 show the sensitivity of these two parameters.

Fig. 8 The sensitivity of epochs

From Fig. 8, we can see that as the number of epochs increases, the results on the test data get better and better. However, when the number of epochs is greater than 200, the effect on the training set decreases, which is attributed to the overfitting caused by the increased number of training iterations. Therefore, we choose 200 epochs. Similarly, as shown in Fig. 9, every kernel size performs well on the training data, while a kernel size of 3 performs best on the testing data. A convolution kernel that is too large loses features, whereas one that is too small leads to feature redundancy. Therefore, we choose \(1\times 3\) as the best kernel size.

4.3.3 Discussion about experiments

In this section, we analyze the differences between the methods and their advantages and disadvantages. From Table 2, we can easily see that our method achieves the best result. Compared with the other deep learning methods, especially CBGRU, our method has an extra fully connected layer to ensure the prediction accuracy. However, the parameter sensitivity experiments also show that CDBGRU has no obvious parameter sensitivity, which makes it difficult to find the best parameters. Moreover, the additional Dense layer increases the number of parameters, which increases time complexity. Finally, our model gives robust predictions: in Fig. 7 there are two outliers, and both are forecasted correctly. Overall, CDBGRU satisfies the current demand for PM2.5 prediction.

Fig. 9 The sensitivity of kernel size

5 Conclusion

This paper proposed the Convnet and Dense-based Bidirectional Gated Recurrent Unit (CDBGRU), into which a special type of RNN is introduced. Different from the classic RNN model, the features in air quality data are extracted by the Convnet and Dense layers, which perform well on large-scale data. Moreover, we choose the better-performing Bi-GRU network to deal with time series data, e.g., air quality data. We then compared the model with traditional machine learning methods and deep learning methods. Through evaluation experiments, we verified its performance superiority and examined its parameter sensitivity. Given the complexity of the network, our method cannot yet be called an excellent algorithm because it has too many parameters. In future work, we may consider a better network to predict PM2.5 values more accurately.