1 Introduction

Air pollution has long attracted substantial attention in the environmental sciences (Sun et al. 2017). Long-term exposure to haze can cause various diseases, such as lung cancer, heart attacks, and respiratory diseases (Yu and Stuart 2017). In particular, severe haze episodes have occurred in Beijing since January 2013, resulting in excess deaths due to respiratory and circulatory diseases (Gao et al. 2017; Chen et al. 2013; David et al. 2014). Among suspended particles, PM2.5 is the most harmful to human health. Thus, an accurate prediction approach is essential for decision makers to formulate prevention measures.

In recent years, approaches to PM2.5 concentration prediction have been enriched. Generally, the existing methods can be classified into deterministic methods and statistical methods. Deterministic methods focus on the temporal and spatial evolution of air pollutants. Specifically, they model the emission, dispersion, transformation and diffusion of air pollutants based on meteorological factors and chemical reactions (Bray et al. 2017; Zhou et al. 2017; Woody et al. 2016). Statistical methods widely applied in air pollutant prediction include multiple linear regression (MLR) (Donnelly et al. 2015), the autoregressive integrated moving average model (ARIMA) (Jian et al. 2012), and support vector regression (SVR) (Yang et al. 2018). Nevertheless, regression models and time series models fail to handle stochastic uncertainty. Thus, these methods perform poorly at extreme points.

To address the shortcomings of linear models, artificial neural networks (ANN) have been employed in recent years to predict air pollutants with satisfactory performance. Gennaro et al. (2013) used an ANN to predict the PM10 concentration at two contrasting sites, and the results demonstrated its applicability to air pollutant prediction. Maleki et al. (2019) used an ANN to predict the air quality index in Ahvaz, Iran, and confirmed its applicability through comparison tests. However, the volume and dimension of the data available for model training have grown rapidly in recent years. Deep learning, as a new artificial intelligence technology, has been exploited in different fields such as computer vision (Chan et al. 2015), text processing (Liu et al. 2019) and time series prediction (Wang et al. 2019a, b). Likewise, deep neural networks have been applied to air pollutant prediction with excellent performance (Ong et al. 2016; Soh et al. 2018; Li et al. 2016). Previous scholars have used LSTM models for air pollutant prediction (Wen et al. 2019; Wu and Lin 2019), since LSTM performs well on time series problems. Nevertheless, a single LSTM model fails to learn spatial information. Specifically, the air pollutant concentration changes with emission, diffusion, and reaction with other suspended particles, which indicates that air pollution also has a spatial dimension. The convolutional neural network (CNN) (LeCun et al. 1998) has proven to have strong processing ability in the spatial dimension and has been widely applied in image recognition (Ren et al. 2015). Moreover, the monitoring data in this paper are spatially correlated, because air pollutants in different areas affect each other. Thus, the CNN model is a reasonable approach to capturing spatial correlation in air pollutant prediction.

Given the limitations of the above methods, a hybrid CNN-LSTM model is proposed to handle the complexity and variability of air pollutants. The CNN model extracts spatial features of air pollutants from different cities around Beijing, and thereby reflects the spatial effect of these cities as air pollutants diffuse and spread. The output of the CNN model is then used as the input of the LSTM model. LSTM is widely used for time series prediction and achieves good prediction performance because it mitigates the gradient explosion and vanishing problems (Zhang et al. 2018a, b; Zhao et al. 2017). Therefore, the LSTM model is employed to predict the daily average PM2.5 concentration by extracting features along the time dimension.

The remainder of the article is organized as follows. Section 2 reviews the relevant literature on air pollutant prediction methods. Section 3 gives the data description and the specific modeling approach of CNN-LSTM. Section 4 presents a detailed analysis of the experimental results. Finally, Sect. 5 concludes the paper.

2 Related works

Deep learning methods have been widely applied to PM2.5 prediction in place of conventional prediction models (Ong et al. 2016; Soh et al. 2018; Li et al. 2016). Conventional prediction models consist of deterministic methods and statistical methods. Deterministic methods focus on the emission and diffusion process of air pollutants based on historical data. However, factors such as a lack of prior knowledge and incomplete data make air pollutant prediction more difficult. Thus, deterministic methods suffer from low precision and instability. Statistical methods rely on mathematical principles and probability models, and offer flexibility and simplicity. Zhang et al. (2018a, b) utilized the ARIMA approach to predict PM2.5 in Fuzhou, China, and found that the PM2.5 concentration experienced seasonal fluctuations. Metia et al. (2016) proposed a hybrid model that integrates a chemical transport model with the Kalman filter to overcome the uncertainties related to emission inventory data.

As the data dimension increases, the above conventional methods fail to deal with stochastic uncertainty and perform poorly in predicting extreme points. Therefore, deep neural networks (DNN) have been widely adopted. A restricted Boltzmann machine was used to predict time series data (Kuremoto et al. 2014). In addition, a deep recurrent neural network (DRNN) was adopted to predict air pollutant concentration with acceptable accuracy.

However, these approaches usually rely on a single prediction model and ignore the spatiotemporal correlation of air pollutants. A hybrid model tends to outperform a single model. Based on this viewpoint, a hybrid model called CNN-LSTM is exploited. The CNN model is adopted to extract features, while LSTM deals well with time series prediction (Huang and Kuo 2018; Qin et al. 2019; Li et al. 2020). Huang et al. (2018) introduced the CNN-LSTM model to predict particulate matter concentration, and it achieved the best prediction performance among the models compared. However, these researchers considered only the local air pollutant concentration and ignored the impact of air pollutants in different regions. It is well known that the concentration of air pollutants changes with emission, diffusion, and reaction with other suspended particles. Therefore, it is necessary to account for the spatiotemporal correlation in the deep neural network proposed in this paper.

3 Materials and methods

3.1 Data description

The study area in this paper is Beijing and its surrounding areas, including Tianjin, Hebei, and others. Figure 1 shows the PM2.5 concentration distribution in China in Feb. 2014. PM2.5 pollution is a serious concern in Beijing and its surrounding cities. These areas have experienced industrialization and urbanization over the past years, and they are geographically close to each other.

Fig. 1 The PM2.5 concentration distribution in China

In this paper, the historical data from Beijing are divided into two subsets: pollutant concentration and meteorological factors. The statistical information of the dataset is shown in Table 1. The dataset contains 1887 samples ranging from Jan. 1st, 2015 to Mar. 1st, 2020. The pollutant concentration data are collected from the air quality online monitoring platform (https://www.aqistudy.cn/), and the meteorological data are obtained from the weather forecasting website (http://tianqi.2345.com/). Table 1 shows that the ranges of the variables differ widely. Meanwhile, some categorical variables need to be converted into numerical variables. Therefore, in order to speed up model training, the following feature processing techniques are applied:

(1) As shown in Fig. 2, the probability distributions of the continuous variables are skewed, which is unfavorable for prediction accuracy because many models assume approximately normally distributed data. Logarithmic transformation is a good remedy for data with a skewed distribution. The final probability distribution after the logarithmic transformation is shown in Fig. 11.

(2) For discrete variables, such as wind direction, weather, and wind, one-hot encoding is utilized to convert each category into a separate binary variable, which is beneficial to modeling.

(3) The present dataset contains 20,757 records for model training. The dataset is divided into a training set and a test set: 80 percent of the data are used as the training set, and the remaining data are used as the test set to verify the model's effect.
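The three processing steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the paper does not say whether the logarithm is applied as log(x) or log(1 + x) (the latter is assumed here so that zero values are handled), nor whether the 80/20 split is chronological (assumed here). The function names and toy values are hypothetical.

```python
import math


def log_transform(values):
    """Compress a long tail with log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]


def one_hot(labels):
    """Return the sorted category list and one 0/1 indicator row per label."""
    categories = sorted(set(labels))
    position = {category: k for k, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[position[label]] = 1
        rows.append(row)
    return categories, rows


def split_train_test(records, train_fraction=0.8):
    """Assumed chronological split: the first 80 percent of records train the model."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Hypothetical toy values, not taken from the paper's dataset.
pm25 = [0.0, 9.0, 99.0]
wind_direction = ["north", "south", "north"]
print(log_transform(pm25))
print(one_hot(wind_direction))
print(split_train_test(list(range(10))))
```

A chronological split keeps future observations out of the training set, which is a common choice for time series, but a random split would be implemented differently.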

Table 1 The statistical information of the dataset used in the model
Fig. 2 The probability distribution of different continuous variables

3.2 Spatiotemporal correlation analysis

Because Beijing is severely polluted and geographically close to its neighboring cities, we consider the spatial correlation of PM2.5 concentration across different cities. Pearson's correlation coefficient is a common measure of the linear correlation between variables, and the model features can be filtered according to their correlation coefficients. Figure 3 shows the calculation results for variables from different cities. The correlation coefficient values range from −0.289 to 0.761. The farther a region is from Beijing, such as Henan and Shandong, the smaller the correlation coefficient. The threshold value of the correlation coefficient is set to 0.5 for feature selection in this paper, since a coefficient above 0.5 indicates a significant correlation between variables (Li et al. 2017). The CO, PM2.5 and PM10 from Tianjin and Hebei strongly correlate with PM2.5 in Beijing. Thus, the spatial correlation provides strong support for improving the prediction performance with one joint model instead of establishing a separate model for each city.

Fig. 3 The correlation coefficients of different features from surrounding cities
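The correlation-based feature selection described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the threshold is applied to the signed coefficient (the text says "more than 0.5"), and the series names and values are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_features(target, candidates, threshold=0.5):
    """Keep the candidate series whose correlation with the target exceeds the threshold."""
    selected = {}
    for name, series in candidates.items():
        r = pearson(target, series)
        if r > threshold:
            selected[name] = r
    return selected


# Hypothetical toy series, not values from the paper's dataset.
beijing_pm25 = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "tianjin_pm25": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the target
    "remote_pm25": [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as the target rises
    "hebei_co": [1.0, 3.0, 2.0, 5.0, 4.0],       # partly aligned with the target
}
selected = select_features(beijing_pm25, candidates)
print(sorted(selected))
```

Note that `pearson` divides by zero for a constant series, so such columns would need to be dropped beforehand.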

Then, we analyze the temporal correlation using the autocorrelation function, which can be written as follows:

$$\rho_{i} = \frac{Cov(y(t),y(t + i))}{{\sigma_{y(t)} \sigma_{y(t + i)} }}, \quad i = 1,2,3,\ldots,$$
(1)

where \(Cov(\cdot)\) represents the covariance, \(\sigma (\cdot)\) denotes the standard deviation, \(y(t)\) and \(y(t + i)\) represent the target time series at time \(t\) and the delayed time series with a time delay \(i\), respectively.
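As a minimal sketch, Eq. (1) can be evaluated by correlating the series with a copy of itself shifted by the lag. The function name and the toy series are hypothetical. Some statistical libraries instead use a single global mean and variance for both segments, which gives slightly different values on short series.

```python
import math


def autocorrelation(series, lag):
    """Eq. (1): correlation between y(t) and y(t + lag), for lag >= 1."""
    head, tail = series[:-lag], series[lag:]
    n = len(head)
    mean_h, mean_t = sum(head) / n, sum(tail) / n
    cov = sum((a - mean_h) * (b - mean_t) for a, b in zip(head, tail))
    var_h = sum((a - mean_h) ** 2 for a in head)
    var_t = sum((b - mean_t) ** 2 for b in tail)
    return cov / math.sqrt(var_h * var_t)


# Hypothetical toy series that flips sign at every step.
series = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf = [autocorrelation(series, lag) for lag in (1, 2, 3)]
print(acf)
```

For this alternating series the coefficient is negative at odd lags and positive at even lags, whereas a real PM2.5 series would typically show a gradual decay with the lag.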

Figure 4 shows the autocorrelation coefficients of PM2.5 from different cities. The curves decrease as the lag time increases, which reflects that the older a PM2.5 observation is, the less impact it has on the current state. In addition, the decline is steepest at the beginning and gradually slows as the lag time increases.

Fig. 4 The autocorrelation coefficients of different time lags

Based on the above analysis, PM2.5 in Beijing shows a significant spatiotemporal correlation with its surrounding cities, which is beneficial to prediction accuracy.

3.3 The introduction of the Artificial Neural Network

The Artificial Neural Network (ANN) is a mathematical model that simulates the structure of brain neurons and was adopted early on because of its strong capacity for handling nonlinear problems. Among ANNs, the Multilayer Perceptron (MLP) is a typical structure that has been widely applied over the past years. An MLP contains an input layer, one or more hidden layers, and an output layer. As shown in Fig. 5, the simplest MLP structure has one hidden layer. However, as data volume and feature dimension increase, the traditional MLP model with a three-layer structure cannot achieve good performance. Therefore, more complex neural networks such as CNN (Chu and Thuerey 2017) and LSTM (Song et al. 2019) have been put forward. In this study, the CNN and LSTM models are combined to deal with the time series prediction problem.

Fig. 5 The specific structure of three-layer perceptron network

3.3.1 Convolutional neural network model

The Convolutional Neural Network (CNN) originates from the LeNet-5 network proposed by LeCun et al. (1998). That network achieved remarkable performance in handwritten character recognition, which attracted scholars' close attention. The structure of a convolutional neural network is shown in Fig. 6.

Fig. 6 The structure of a simple convolutional neural network

Different from the traditional neural network model (NN), a CNN has multiple feature maps in every layer, and every feature map contains multiple neurons. The value of each neuron is obtained by convolving the output of the previous layer with a convolutional kernel. The convolutional kernel is essentially a weight matrix, which is used to extract the features of the local receptive field.

The structure of a convolutional neural network mainly includes convolutional layers, pooling layers and fully connected layers. The convolutional and pooling layers are the essential hidden modules of a CNN. The convolutional layer extracts local features of the data, while the pooling layer condenses these features further by down-sampling.

A CNN can automatically learn features from data such as sequences, text and images. Standard variants include 1D, 2D and 3D CNNs. Given that PM2.5 data are one-dimensional, a 1D CNN is utilized for feature learning in this study. The specific process of the 1D CNN is demonstrated in Fig. 7. The blue part indicates a filter, which acts as a sliding window that convolves across the data. The feature extracted by one sliding window has the same dimension as the input data. The green part denotes another filter, and its sliding process is the same. If the dimension of the input data is M and the number of filters is N, then the total number of extracted features is M × N (Huang and Kuo 2018).

Fig. 7 The learning mechanism of 1D CNN
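The sliding-window mechanism can be illustrated with a small pure-Python sketch. It assumes zero padding at both ends so that each filter returns as many values as the input, which matches the statement that the extracted feature has the same dimension as the input. As in most deep learning frameworks, the kernel is applied without flipping. The kernel values are hypothetical, whereas in a trained CNN they are learned.

```python
def conv1d_same(signal, kernel):
    """Slide the kernel across the signal with zero padding, keeping the input length."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    return [
        sum(padded[start + j] * kernel[j] for j in range(k))
        for start in range(len(signal))
    ]


def extract_features(signal, kernels):
    """Apply N filters to a length-M signal, giving N feature maps of length M."""
    return [conv1d_same(signal, kernel) for kernel in kernels]


# Hypothetical input (M = 5) and two hand-picked filters (N = 2).
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [
    [0.0, 1.0, 0.0],  # copies the input unchanged
    [1.0, 1.0, 1.0],  # sums each value with its two neighbours
]
features = extract_features(signal, kernels)
total = sum(len(feature_map) for feature_map in features)
print(features, total)
```

Here `total` equals M × N = 10, in line with the count given in the text.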

3.3.2 Long Short-term memory model

Another important neural network widely applied to sequential data is the Recurrent Neural Network (RNN). Unlike feedforward neural networks, an RNN passes its hidden state from one time step to the next, so the output depends on both the current input and the preceding inputs. The basic structure of an RNN is shown in Fig. 8.

Fig. 8 The structure of a simple recurrent neural network

As shown in Fig. 8, \(x\) denotes the input data, \(o\) denotes the output data, \(U\) represents the weight matrix from the input layer to the hidden layer, \(V\) represents the weight matrix from the hidden layer to the output layer, \(W\) represents the weight matrix from the hidden layer at the previous time step to the hidden layer at the current time step, and \(s\) is the state value of the hidden layer.
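For reference, the recurrence implied by these definitions can be written in the standard form below. The activation functions \(f\) and \(g\) and the time index \(t\) are not named in the text and are introduced here as assumptions; bias terms are omitted.

$$s_{t} = f\left( {U x_{t} + W s_{t - 1} } \right), \quad o_{t} = g\left( {V s_{t} } \right),$$

so the hidden state \(s_{t}\) carries information from earlier inputs forward through \(W\).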

However, the gradient vanishing problem often occurs in the training process of an RNN: the gradients propagated back through many time steps shrink toward zero, so the parameters stop being updated effectively. The Long Short-Term Memory (LSTM) model was therefore introduced to solve the gradient vanishing problem. LSTM is a special RNN model that was first proposed in 1997 (Hochreiter and Schmidhuber 1997). Figure 9 displays the specific network structure of the LSTM model.

Fig. 9 The specific network structure of LSTM model

As shown in Fig. 9, \(\sigma\) and \(\tanh\) represent activation functions: \(\sigma\) maps its input to a value between 0 and 1, while \(\tanh\) maps its input to a value between −1 and 1. The activation functions are given in Eqs. (2) and (3).

$$\sigma (x) = \frac{1}{{1 + e^{ - x} }},$$
(2)
$$\tanh (x) = \frac{{e^{x} - e^{ - x} }}{{e^{x} + e^{ - x} }},$$
(3)

Unlike the internal structure of an RNN, the state of an LSTM is controlled by an input gate \(i_{t}\), a forget gate \(f_{t}\) and an output gate \(o_{t}\). The forget gate is designed to discard information from the memory cell. It receives the output value \(h_{t - 1}\) of the previous time step and the input value \(x_{t}\) of the current time, and computes a value \(f_{t}\) between 0 and 1 through the \(\sigma\) function, which determines how much of the previous cell state \(C_{t - 1}\) is retained. The input gate is responsible for adding new information to the cell state. Specifically, the proportion of the state to update is controlled by the output \(i_{t}\) of the \(\sigma\) function, and a new candidate value \(\tilde{C}_{t}\) is generated through the \(\tanh\) function. The output gate controls the output of the external state \(h_{t}\) according to the internal state \(C_{t}\) at the current time. The specific process is described by Eqs. (4)–(9).

$$f_{t} = \sigma \left( {W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} } \right),$$
(4)
$$i_{t} = \sigma \left( {W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} } \right),$$
(5)
$$o_{t} = \sigma \left( {W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} } \right),$$
(6)
$$\tilde{C}_{t} = \tanh \left( {W_{c} x_{t} + U_{c} h_{t - 1} + b_{c} } \right),$$
(7)
$$C_{t} = f_{t} \odot C_{t - 1} + i_{t} \odot \tilde{C}_{t} ,$$
(8)
$$h_{t} = o_{t} \odot \tanh \left( {C_{t} } \right),$$
(9)

where \(W_{f}\), \(W_{i}\), \(W_{o}\) and \(W_{c}\) represent the weight matrices for the input vector \(x_{t}\); \(U_{f}\), \(U_{i}\), \(U_{o}\) and \(U_{c}\) denote the weight matrices for the previous hidden state \(h_{t - 1}\); \(b_{f}\), \(b_{i}\), \(b_{o}\) and \(b_{c}\) are bias vectors; \(\odot\) represents element-wise multiplication; \(x_{t}\) is the input vector at time \(t\); \(h_{t}\) denotes the output vector at time \(t\); \(\tilde{C}_{t}\) is the candidate cell state; and \(C_{t}\) represents the cell state at time \(t\).
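To make Eqs. (4)–(9) concrete, a single LSTM time step can be sketched in plain Python with list-based vectors and matrices. This is an illustrative sketch only: the function names and the parameter dictionary `p` are hypothetical, and a practical implementation would rely on a deep learning framework.

```python
import math


def sigmoid(x):
    """Logistic function of Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b, with matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds W_*, U_* and b_* for the gates f, i, o and c."""
    f_t = [sigmoid(z) for z in affine(p["W_f"], x_t, p["U_f"], h_prev, p["b_f"])]  # Eq. (4)
    i_t = [sigmoid(z) for z in affine(p["W_i"], x_t, p["U_i"], h_prev, p["b_i"])]  # Eq. (5)
    o_t = [sigmoid(z) for z in affine(p["W_o"], x_t, p["U_o"], h_prev, p["b_o"])]  # Eq. (6)
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], x_t, p["U_c"], h_prev, p["b_c"])]  # Eq. (7)
    # Element-wise products: keep part of the old cell state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # Eq. (8)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # Eq. (9)
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate state is zero, so the step simply halves the previous cell state. This gives a quick sanity check on the gating logic.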

3.3.3 The hybrid CNN-LSTM model

The hybrid CNN-LSTM model was first applied in computer vision and text processing. There, the CNN served as a feature extractor on image and text data, and its output was fed to the LSTM for further processing. Likewise, in this study the CNN is adopted to extract features of the time series data, while the LSTM makes predictions from the output of the CNN model.

Figure 10 shows the specific structure of the CNN-LSTM model. A one-dimensional convolutional layer and a pooling layer form the base layers of the hybrid model, which suits the sequential nature of time series. To feed the output of the CNN into the LSTM, a flatten layer is constructed between the CNN layers and the LSTM layer. A fully connected layer is then constructed to decode the LSTM output and produce the prediction results.

Fig. 10 The specific network structure of CNN-LSTM

To improve the robustness of the model, we use 336 samples as a validation set to adjust the model parameters and the remaining 28 samples for prediction. The parameters are selected by grid search. The specific parameters of CNN-LSTM in this paper are shown in Table 2. We adopt the ReLU function as the activation function instead of other common activation functions, because its gradient does not saturate for positive inputs and it therefore mitigates the vanishing gradient problem in neural networks. In addition, an efficient optimizer called Adam is utilized in this study instead of plain gradient descent. In the Adam optimizer, the effective learning rate of each parameter is updated dynamically, which gives the parameters more opportunities to escape a local optimum.

Table 2 The specific parameters of the hybrid CNN-LSTM model

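Three common performance indices, the root mean square error (RMSE), the mean absolute percentage error (MAPE) and the coefficient of determination (R2), are employed to evaluate the model accuracy. They are expressed as follows: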

$$RMSE = \sqrt {\frac{1}{N}\sum\limits_{t = 1}^{N} {\left( {y_{t} - \hat{y}_{t} } \right)}^{2} } ,$$
(10)
$$MAPE = \frac{1}{N}\sum\limits_{t = 1}^{N} {\left| {\frac{{y_{t} - \hat{y}_{t} }}{{y_{t} }}} \right|} ,$$
(11)
$${\text{R}}^{2} = 1 - \frac{{\sum\limits_{t = 1}^{N} {\left( {y_{t} - \hat{y}_{t} } \right)^{2} } }}{{\sum\limits_{t = 1}^{N} {\left( {y_{t} - \overline{y}} \right)^{2} } }},$$
(12)

where \(N\) is the sample size of the test set, \(\hat{y}_{t}\) represents the predicted value of PM2.5 at time \(t\), \(y_{t}\) denotes the observed value of PM2.5 at time \(t\), and \(\overline{y}\) is the mean of the observed values.
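The three indices of Eqs. (10)–(12) translate directly into code. The sketch below uses hypothetical function names and toy numbers, and it returns MAPE and R2 as fractions, whereas the paper reports them as percentages. MAPE is undefined when an observed value is zero, which this sketch does not guard against.

```python
import math


def rmse(y, y_hat):
    """Root mean square error, Eq. (10)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean absolute percentage error as a fraction, Eq. (11)."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination, Eq. (12)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values, not results from the paper.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(rmse(observed, predicted), mape(observed, predicted), r2(observed, predicted))
```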

4 Results and discussion

4.1 Prediction performance

The hybrid CNN-LSTM model based on spatiotemporal correlation is used to predict the daily average PM2.5 concentration from February 2020 to March 2020. Figure 11 displays the prediction performance. The predicted values are close to the observed values over the whole prediction range, and the proposed model is accurate even at local high points. This indicates that the hybrid CNN-LSTM based on spatiotemporal correlation handles the nonlinear characteristics and sudden changes of the time series well. More specifically, the RMSE, MAPE and R2 of the training set are 11.56, 41.91% and 94.72%, while those of the test set are 10.60, 39.58% and 96.47%, respectively. Because the test-set indexes are no worse than the training-set indexes, the proposed model obtains high prediction accuracy without over-fitting. These results suggest that the CNN extracts the inherent features efficiently and thereby improves the prediction accuracy of the LSTM.

Fig. 11 The daily average PM2.5 concentration prediction result

4.2 Comparison with other neural network models

To compare the performance of different models, we select two commonly used neural networks: the Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM). MLP was widely used in early studies to predict air pollution with good performance. Table 3 lists the evaluation indexes of the different models, while Fig. 12 shows the specific prediction results. The hybrid model outperforms the single models, and the values forecast by CNN-LSTM are the most consistent with the observed values. In Table 3, the CNN-LSTM model achieves the lowest RMSE and MAPE values and the highest R2 value in daily air pollutant prediction. More specifically, the RMSE, MAPE and R2 of LSTM are 14.84, 52.53% and 93.08%, while those of MLP are 22.16, 87.43% and 84.56%, respectively. The prediction accuracy of the deep neural network models, LSTM and CNN-LSTM, is thus superior to that of MLP, and CNN-LSTM in turn outperforms LSTM. The CNN-LSTM model makes full use of the advantages of both of its component models to account for the spatiotemporal correlation and reduce the prediction error. Therefore, the hybrid CNN-LSTM model achieves better prediction accuracy than the other neural networks compared.

Table 3 The comparisons of different evaluation indexes in prediction performance
Fig. 12 Comparison of prediction results of different models

4.3 Comparison of the spatiotemporal correlation results

In this section, we train the same models with different data in order to evaluate the effect of the spatiotemporal correlation on the prediction performance (Russo and Soares 2014; Soh et al. 2018). In the first case, we train the three models with the air pollutant concentration data and meteorological factors in Beijing only. In the second case, this input data is integrated with the air pollutant concentration data from other cities around Beijing, and the integrated data is fed into the same models. The evaluation results are shown in Table 3 (with spatiotemporal correlation) and Table 4 (without). For the same model, the second case obtains lower RMSE and MAPE values.

Table 4 The comparisons of different evaluation indexes in prediction performance without spatiotemporal correlation

Meanwhile, the model considering spatiotemporal correlation has a higher R2. Specifically, the RMSE, MAPE and R2 of the CNN-LSTM model without spatiotemporal correlation are 16.46, 58.45% and 91.49%, respectively, compared with 10.60, 39.58% and 96.47% with spatiotemporal correlation. By comparing the above results, the hybrid CNN-LSTM model combined with spatiotemporal correlation has less error than the other neural network models. This shows that the spatiotemporal correlation plays an important part in achieving higher accuracy.

5 Conclusion

An effective model with high accuracy and stability is essential to protect people from the adverse effects of haze. In this study, a hybrid CNN-LSTM model based on spatiotemporal correlation is proposed to predict the daily PM2.5 concentration in Beijing. We consider not only the PM2.5 in Beijing but also the air pollutants in its surrounding cities, because air pollutants move between regions. Moreover, meteorological factors affect the transmission and diffusion of air pollutants, so meteorological data are included in model training for better prediction accuracy. To explore the spatiotemporal correlation of PM2.5 in Beijing, we adopt Pearson's correlation coefficient and the autocorrelation function, and identify the air pollutants in surrounding cities that are highly correlated with it. The results show that the model considering spatiotemporal correlation achieves excellent prediction performance. The advantage of the proposed hybrid model is that the CNN model acquires spatial features from the input data while the LSTM model handles the temporal correlation in the time series data. Overall, the CNN-LSTM model is verified to be suitable for PM2.5 prediction. In future work, more training data should be used to verify the generalization of the proposed model, and more meteorological factors related to PM2.5 concentration need to be taken into account.