Introduction

Air pollution is a serious environmental issue that is attracting increasing attention globally (Kurt and Oktay 2010). Many developing countries suffer from heavy air pollution. For example, extreme air pollution events have frequently occurred in China in recent years, especially in the Beijing, Tianjin, and Hebei region. According to Reports on the State of the Environment in China (2015), among 338 monitored cities, 265 (78.4 %) failed to meet the national healthy air quality standard, and the percentage of days below the standard reached 23.3 % on average.

Particulate matter with an aerodynamic diameter of 2.5 μm or less (PM2.5) is an air pollutant that can be inhaled through the nasal passages into the throat and even the lungs. Long-term exposure to PM2.5 increases the incidence of associated diseases (e.g., respiratory and cardiovascular diseases, reduced lung function, and heart attacks) in humans (Künzli et al. 2000; Bravo and Bell 2011). Obtaining real-time air quality information is of great importance for air pollution control and for protecting humans from the adverse health impacts of air pollution (Zheng et al. 2013). Hence, air quality prediction is necessary to better reflect the changing trend of air pollution, to provide prompt and complete environmental quality information for environmental management decisions, and to avoid serious air pollution accidents (Chen et al. 2013).

Many studies have focused on air quality predictions, and the following two types of methods are generally used: deterministic and statistical. A deterministic method employs theoretical meteorological emissions and chemical models (Bruckman 1993; Coats 1996; Guocai 2004; Jeong et al. 2011) to simulate pollutant discharge, its transfer and diffusion processes, and removal processes using dynamic data of a limited number of monitoring stations in a model-driven way (Kim et al. 2010; Baklanov et al. 2008). Representative methods, such as CMAQ (Chen et al. 2014) and WRF-Chem (Saide et al. 2011), are widely used for urban air quality forecasting. However, due to unreliable pollutant emission data, complicated underlying surface conditions, and an incomplete theoretical foundation, the simulation results suffer from low prediction accuracy (Vautard et al. 2007; Stern et al. 2008).

In contrast to these complicated theoretical models, statistical methods simply use a statistical modeling technique to predict the air quality in a data-driven manner. Straightforward methods such as the multiple linear regression (MLR) (Li et al. 2011) model and the autoregressive moving average (ARMA) (Box and Jenkins 1970) model are commonly used for air quality prediction. However, these methods usually yield limited accuracy due to their inability to model nonlinear patterns; thus, they cannot predict extreme air pollutant concentrations (Goyal et al. 2006). Promising alternatives to these linear models are artificial neural networks (ANNs) (Gardner and Dorling 1998; Hooyberghs et al. 2005; Lal and Tripathy 2012; Sánchez et al. 2013) and support vector regression (SVR) models (Nieto et al. 2013; Suárez Sánchez et al. 2013; Hájek and Olej 2012). A previous study showed that an ANN model was more accurate than linear models (such as ARMA or MLR) because the air quality data presented clearer nonlinear patterns than linear patterns (Prybutok et al. 2000). Studies have also combined these models for air quality predictions, and the results have shown that hybrid methods have a better predictive performance than single models (Díaz-Robles et al. 2008; Chen et al. 2013; Sánchez et al. 2013).

However, all these methods usually predict air quality at each station separately and neglect the high spatial correlations between stations. Spatial correlations generally occur between environmental variables (Legendre 1993). Air quality for all monitoring stations was highly correlated, thereby reflecting air pollutant dispersion patterns to some extent (Jerrett et al. 2005; Kracht et al. 2015). Therefore, it is important to fully model spatiotemporal correlations for air quality predictions.

Spatiotemporal prediction models that treat air quality as a spatiotemporal process have been introduced, such as the spatiotemporal autoregressive moving average (STARMA) (Martin and Oeppen 1975), the spatiotemporal artificial neural network (STANN) (Nguyen et al. 2012), and the spatiotemporal support vector regression (STSVR) models (Cheng et al. 2007). These methods can deal with nonlinear spatiotemporal features to a certain degree; however, such shallow models commonly rely on hand engineering to extract low-level features. Thus, the performance of these models is greatly affected by the hand-crafted features, which inspired us to examine the air quality prediction problem in terms of deep architecture models capable of capturing these spatiotemporal features for accurate predictions.

Recently, deep learning, a new and promising machine learning methodology, has attracted considerable academic and industrial attention (Bengio 2009) and has been successfully applied to image classification, natural language processing, prediction tasks, object detection, artificial intelligence, motion modeling, etc. (Silver et al. 2016; Hinton et al. 2006; Zhang et al. 2015; Collobert and Weston 2008; Mohamed et al. 2011; Bengio 2009; Chan et al. 2015). Deep learning algorithms use multiple-layer architectures to extract the inherent features of data layer-by-layer from the lowest to the highest level, and they can identify representative structure in data. The air quality process is inherently complicated: its temporal trends and spatial distribution are affected by various factors, such as air pollutant emissions and deposition, weather conditions, traffic flow, and human activities. This complexity makes it difficult for traditional shallow models to provide a good representation of air quality features. Deep learning algorithms can extract representative air quality features without prior knowledge and may lead to a good performance for air quality predictions.

In this paper, we introduced a deep learning-based method for air quality predictions. A stacked autoencoder model is used to extract representative spatiotemporal air quality features, and it is trained in a greedy layer-wise manner. Thus, spatial and temporal correlations are inherently considered in the model. Furthermore, experimental results have demonstrated that the proposed method for air quality predictions has superior performance.

The main novelty and contributions of this paper are summarized as follows:

  1. We introduced the deep learning approach for research on air quality prediction. The latent air quality features can be automatically learned using a stacked autoencoder model, and the learned representations are used to construct a regression model for air quality prediction.

  2. We treated the regional air quality as a spatiotemporal process and used the deep learning algorithm to build a spatiotemporal prediction framework, which considers the spatial and temporal correlations of air quality data in the modeling process. The experimental results demonstrated the advantages of this approach over time series models.

  3. Our model can predict the air quality of all monitoring stations simultaneously and shows a satisfactory seasonal stability.

The remainder of this paper is structured as follows: the “Methodology” section presents the deep learning-based approach for air quality predictions; the “Experiment and results” section discusses the experiments and results; and the “Conclusion” section presents the concluding remarks.

Methodology

First, a stacked autoencoder model is introduced. The stacked autoencoder model is a widely used deep learning architecture that incorporates autoencoders as building blocks to construct a deep network (Bengio et al. 2007).

Autoencoder

An autoencoder is a neural network that attempts to reconstruct its inputs (Lv et al. 2015). To accomplish this reconstruction and obtain a good representation, the autoencoder must capture the most important features of the input, much as methods such as principal component analysis (PCA) do. Figure 1 illustrates a basic schema of an autoencoder. Given a set of training samples \( \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(N)}\} \) in which \( x^{(i)} \in \mathbb{R}^d \), an autoencoder first encodes the input vector x to a higher-level hidden representation y based on equation (1), and then it decodes the representation y back to a reconstruction z, calculated as in equation (2):

$$ y=f\left({W}_1x+b\right) $$
(1)
$$ z=g\left({W}_2y+c\right) $$
(2)
Fig. 1 Autoencoder architecture. The autoencoder transforms input vector x to y via the encoder f and attempts to reconstruct x via the decoder g to produce reconstruction z. The reconstruction error is measured by the loss \( L_H(x, z) \)

where \( W_1 \) and \( W_2 \) are weight matrices and b and c are bias vectors. We employed the logistic sigmoid function 1/(1 + exp(−x)) for f(x) and g(x) in this study. The parameters of this neural network are optimized to minimize the average reconstruction error,

$$ J\left(\theta \right)=\frac{1}{N}{\displaystyle \sum_{i=1}^NL\left({x}^{(i)},{z}^{(i)}\right).} $$
(3)

Here, L is a loss function. We used the traditional squared error in our model.
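Equations (1) to (3) can be illustrated with a minimal NumPy sketch. The function names, the array shapes, and the toy data are assumptions made for illustration only and are not taken from the original implementation.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_forward(X, W1, b, W2, c):
    """Encode and decode a batch of inputs.

    X has shape (N, d), W1 has shape (h, d), and W2 has shape (d, h).
    """
    Y = sigmoid(X @ W1.T + b)  # equation (1): hidden representation y
    Z = sigmoid(Y @ W2.T + c)  # equation (2): reconstruction z
    return Y, Z

def reconstruction_error(X, Z):
    """Equation (3) with the squared error as the loss L."""
    return float(np.mean(np.sum((X - Z) ** 2, axis=1)))

# Toy data (hypothetical): N = 5 samples, d = 4 inputs, h = 6 hidden units.
rng = np.random.default_rng(0)
X = rng.random((5, 4))
W1, b = rng.normal(scale=0.1, size=(6, 4)), np.zeros(6)
W2, c = rng.normal(scale=0.1, size=(4, 6)), np.zeros(4)

Y, Z = autoencoder_forward(X, W1, b, W2, c)
J = reconstruction_error(X, Z)
```

Because the decoder also uses the sigmoid, the reconstruction lies in (0, 1), so this sketch assumes inputs that have been scaled to that range.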

However, the reconstruction criterion alone cannot guarantee the extraction of representative features because it can lead to the trivial solution of “simply copy the input” or similarly undesirable solutions that maximize mutual information (Vincent et al. 2010). To force the autoencoder to extract more robust features and prevent it from simply learning the identity, Ranzato introduced the sparse over-complete (i.e., higher dimension than the input) representation method (Poultney et al. 2006; Boureau and Cun 2008). A sparse over-complete representation can be perceived as a compressed representation because it has implicit compressibility due to the large number of inactive hidden units rather than an explicit lower dimensionality (Vincent et al. 2008, 2010). To achieve the sparse representation, a sparsity constraint is added to the reconstruction error:

$$ J\left(\theta \right)=\frac{1}{N}{\displaystyle \sum_{i=1}^N{\left\Vert {x}^{(i)}-{z}^{(i)}\right\Vert}^2}+\lambda \left({\left\Vert {W}_1\right\Vert}^2+{\left\Vert {W}_2\right\Vert}^2\right)+\mu {\displaystyle \sum_{j=1}^{H_D}KL\left(\rho \left\Vert {\rho}_j\right.\right).} $$
(4)

where \( {\left\Vert {W}_1\right\Vert}^2 \) and \( {\left\Vert {W}_2\right\Vert}^2 \) are the regularization terms, λ and μ are the weights for the regularization term and the sparsity term, respectively, \( H_D \) is the number of hidden units, ρ is a sparsity parameter (typically a small value close to zero), \( {\rho}_j=\left(1/N\right){\displaystyle {\sum}_{i=1}^N{y}_j\left({x}^{(i)}\right)} \) is the average activation of hidden unit j over the training set, and the sparsity term \( KL\left(\rho \left\Vert {\rho}_j\right.\right) \) is the Kullback–Leibler (KL) divergence, which is defined as follows:

$$ KL\left(\rho \left\Vert {\rho}_j\right.\right)=\rho \log \frac{\rho }{\rho_j}+\left(1-\rho \right) \log \frac{1-\rho }{1-{\rho}_j}. $$
(5)

The KL divergence imposes the sparsity constraint on the coding procedure by penalizing hidden units whose average activation deviates from ρ. Gradient-based procedures, such as stochastic gradient descent algorithms, can be used to solve this optimization problem.
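As a hedged illustration, the sparse objective of equations (4) and (5) could be evaluated as follows. The function names and the values chosen for λ, μ, and ρ are hypothetical, and the reconstruction term is taken to be the squared error between each input and its reconstruction.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kl_divergence(rho, rho_hat):
    """Equation (5), evaluated element-wise over the hidden units."""
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_autoencoder_cost(X, W1, b, W2, c, lam, mu, rho):
    """Equation (4): reconstruction error + weight penalty + sparsity penalty."""
    Y = sigmoid(X @ W1.T + b)
    Z = sigmoid(Y @ W2.T + c)
    reconstruction = np.mean(np.sum((X - Z) ** 2, axis=1))
    weight_penalty = lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    rho_hat = Y.mean(axis=0)  # average activation of each hidden unit
    sparsity = mu * np.sum(kl_divergence(rho, rho_hat))
    return float(reconstruction + weight_penalty + sparsity)

# Toy example with hypothetical sizes and hyperparameters.
rng = np.random.default_rng(0)
X = rng.random((5, 4))
W1, b = rng.normal(scale=0.1, size=(6, 4)), np.zeros(6)
W2, c = rng.normal(scale=0.1, size=(4, 6)), np.zeros(4)

cost_sparse = sparse_autoencoder_cost(X, W1, b, W2, c, lam=1e-4, mu=0.1, rho=0.05)
cost_plain = sparse_autoencoder_cost(X, W1, b, W2, c, lam=1e-4, mu=0.0, rho=0.05)
```

The KL term is zero only when the average activation of a hidden unit equals ρ, so the sparse cost is never smaller than the cost computed with μ = 0.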

Stacked autoencoder

A stacked autoencoder (SAE) is a concatenation of autoencoders in which the outputs of the autoencoder on the layer below are wired to the inputs of the successive layer (Bengio et al. 2007). More specifically, for an SAE with L layers, the first layer is trained using the training set as the input. After obtaining the first hidden layer, the output of the kth (k < L) hidden layer is utilized as the input for the (k + 1)th hidden layer. Using this method, sequential autoencoders can be stacked hierarchically. Each hidden layer is a higher-level abstraction of the previous layer, and the last hidden layer contains high-level structure and representative information of the input, which are more effective for the subsequent prediction (Wang et al. 2016).

To employ the SAE model for air quality predictions, a real-value predictor must be added on the top layer. In this paper, a logistic regression (LR) layer was embedded into the network for real-value air quality predictions. The logistic regression model could also be replaced with other regression models, such as the SVR. The SAE model combined with the LR predictor constitutes the entire deep architecture model for air quality predictions, as illustrated in Fig. 2.

Fig. 2 Deep architecture model for air quality predictions: Stacked autoencoders are at the bottom for feature extraction, and a logistic regression layer is at the top for real-value predictions

Training algorithm

The backpropagation (BP) algorithm with the gradient-based optimization technique is widely used for training neural networks (Barnard 1992). Unfortunately, deep networks trained in this manner are known to have poor performance (Hanson and Giles 1993; Kambhatla and Leen 1997; Tenenbaum et al. 2000). Deep networks with large initial weights usually lead to poor local minima, whereas deep networks with small initial weights produce tiny gradients in the bottom layers, which makes it difficult to train networks with numerous hidden layers (Hinton and Salakhutdinov 2006). To solve this difficulty, Hinton et al. (2006) proposed a greedy layer-wise unsupervised learning technique that can train deep networks effectively. The key idea of this technique is to pre-train the deep network layer-by-layer in a bottom–up manner. After the pre-training stage, the BP algorithm can be used to fine-tune the entire network’s parameters in a top–down fashion. Our training procedure, which is provided below, is based on the studies by Hinton et al. (2006) and Bengio et al. (2007).

Algorithm 1. Training SAE

Given a training set X, a preset number of hidden layers L, and the number of nodes in each hidden layer, set the training hyperparameters (i.e., the pre-training epochs, the pre-training learning rate, the fine-tuning epochs, the fine-tuning learning rate, and the mini-batch size).

Step 1 Pre-training the SAE

— Set the weight parameters λ and μ. Randomly initialize the weight matrices and bias vectors.

— Train the first hidden layer using the training set as the input.

— Train the successive hidden layers in a greedy layer-wise manner while using the output of the kth hidden layer as the input for the (k + 1)th hidden layer.

Step 2 Fine-tuning the whole network

— Use the output of the last hidden layer as the input for the logistic regression layer.

— Randomly initialize \( \{W_{L+1}, b_{L+1}\} \).

— Use the BP algorithm with the gradient-based optimization technique to update the whole network’s parameters in a top–down fashion.
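The two steps of Algorithm 1 can be sketched in NumPy as shown below. This is a simplified illustration under stated assumptions rather than the original implementation: it uses full-batch gradient descent with a fixed learning rate instead of mini-batches, omits the regularization and sparsity terms of equation (4), assumes inputs and targets scaled to [0, 1], and models the top layer as a sigmoid output layer. All function names, sizes, and hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_layer(X, n_hidden, epochs, lr, rng):
    """Train one autoencoder on X and return its encoder parameters (W, b)."""
    n, d = X.shape
    W1, b = rng.normal(scale=0.1, size=(d, n_hidden)), np.zeros(n_hidden)
    W2, c = rng.normal(scale=0.1, size=(n_hidden, d)), np.zeros(d)
    for _ in range(epochs):
        Y = sigmoid(X @ W1 + b)
        Z = sigmoid(Y @ W2 + c)
        # Gradients of the mean squared reconstruction error.
        dZ = 2.0 * (Z - X) * Z * (1.0 - Z) / n
        dY = (dZ @ W2.T) * Y * (1.0 - Y)
        W2 -= lr * (Y.T @ dZ)
        c -= lr * dZ.sum(axis=0)
        W1 -= lr * (X.T @ dY)
        b -= lr * dY.sum(axis=0)
    return W1, b

def pretrain_sae(X, hidden_sizes, epochs=50, lr=0.5, seed=0):
    """Step 1: greedy layer-wise pre-training, bottom-up."""
    rng = np.random.default_rng(seed)
    layers, H = [], X
    for n_hidden in hidden_sizes:
        W, b = pretrain_layer(H, n_hidden, epochs, lr, rng)
        layers.append((W, b))
        H = sigmoid(H @ W + b)  # output of layer k is the input of layer k + 1
    return layers, H

def fine_tune(X, T, layers, epochs=50, lr=0.5, seed=1):
    """Step 2: add a randomly initialized top layer and backpropagate."""
    rng = np.random.default_rng(seed)
    net = [(W.copy(), b.copy()) for W, b in layers]
    n_top = net[-1][0].shape[1]
    net.append((rng.normal(scale=0.1, size=(n_top, T.shape[1])),
                np.zeros(T.shape[1])))
    for _ in range(epochs):
        acts = [X]
        for W, b in net:
            acts.append(sigmoid(acts[-1] @ W + b))
        delta = 2.0 * (acts[-1] - T) * acts[-1] * (1.0 - acts[-1]) / len(X)
        for k in range(len(net) - 1, -1, -1):  # top-down parameter updates
            W, b = net[k]
            grad_W, grad_b = acts[k].T @ delta, delta.sum(axis=0)
            if k > 0:
                delta = (delta @ W.T) * acts[k] * (1.0 - acts[k])
            net[k] = (W - lr * grad_W, b - lr * grad_b)
    return net

def predict(X, net):
    H = X
    for W, b in net:
        H = sigmoid(H @ W + b)
    return H

# Toy run with hypothetical sizes: 64 samples, 8 inputs, 2 outputs.
rng = np.random.default_rng(42)
X = rng.random((64, 8))
T = rng.random((64, 2))
pretrained, H = pretrain_sae(X, hidden_sizes=[6, 4])
network = fine_tune(X, T, pretrained)
P = predict(X, network)
```

Only the encoder half of each pre-trained autoencoder is kept; the decoders are discarded once the next layer has been trained.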

Experiment and results

Data description

The hourly PM2.5 concentration data for Beijing City from 2014/1/1 to 2016/5/28 at 12 air quality monitoring stations were downloaded from the Ministry of Environmental Protection of China (http://datacenter.mep.gov.cn/). The PM2.5 concentration of all these stations was measured using a Thermo Fisher 1405F detector based on the tapered element oscillating microbalance (TEOM) method. Figure 3 shows the distribution of these air quality monitoring stations. This dataset contains 20,196 records for each station. Seasonal statistical data are shown in Table 1. In our experiment, we randomly selected 60 % of the data as the training set, 20 % as the validation set, and the remaining 20 % as the test set.

Fig. 3 Distribution of the air quality monitoring stations in Beijing City

Table 1 Average PM2.5 concentration (μg/m3) for each station in different seasons
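The random 60/20/20 partition described above can be sketched as follows. The helper name and the seed are hypothetical, and the per-station record count is used as the sample count purely for illustration (the number of usable samples is slightly smaller once lagged inputs are formed).

```python
import numpy as np

def random_split(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Return shuffled index arrays for the training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(20196)
```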

Index of performance

To evaluate the performance of the proposed model, we adopted three performance indexes: the root-mean-square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). These indexes are calculated as follows:

$$ \mathrm{RMSE}=\sqrt{\frac{1}{N}{\displaystyle \sum_{i=1}^N{\left({O}_i-{P}_i\right)}^2}}, $$
(6)
$$ \mathrm{MAE}=\frac{1}{N}{\displaystyle \sum_{i=1}^N\left|{O}_i-{P}_i\right|}, $$
(7)
$$ \mathrm{MAPE}=\frac{1}{N}{\displaystyle \sum_{i=1}^N\frac{\left|{O}_i-{P}_i\right|}{O_i}}. $$
(8)

where \( O_i \) denotes the observed air quality, \( P_i \) denotes the predicted air quality, and N denotes the number of evaluation samples. The RMSE and MAE were used to evaluate the absolute error, while the MAPE was used to measure the relative error. The former reflects the extremum effect and error range of the predicted values, and the latter reflects the specificity of the average predicted value (Chen et al. 2013). The optimal structure of our model was determined when the MAPE was minimized.
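For concreteness, the three indexes of equations (6) to (8) can be computed as in the short sketch below. The function names and the three sample values are hypothetical; the MAPE is returned as a fraction and would be multiplied by 100 to express it in percent.

```python
import numpy as np

def rmse(observed, predicted):
    """Equation (6): root-mean-square error."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def mae(observed, predicted):
    """Equation (7): mean absolute error."""
    return float(np.mean(np.abs(observed - predicted)))

def mape(observed, predicted):
    """Equation (8): mean absolute percentage error, as a fraction."""
    return float(np.mean(np.abs(observed - predicted) / observed))

# Hypothetical PM2.5 concentrations in μg/m3.
observed = np.array([50.0, 100.0, 200.0])
predicted = np.array([40.0, 110.0, 180.0])

scores = (rmse(observed, predicted),
          mae(observed, predicted),
          mape(observed, predicted))
```

Note that the MAPE is undefined when an observed value is zero, so such records would need to be handled separately.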

Deep architecture structure

Our spatiotemporal deep learning (STDL) model contains several parameters that must be determined to build the architecture, including the size of the input layer, the number of hidden layers, and the number of hidden units in each hidden layer. For the input layer, we used the data collected from all stations as the input; thus, the model could be built upon a monitoring network that considers spatial correlations. Furthermore, with respect to the temporal relationship of the air quality, we used the air quality data at previous time intervals as the inputs (i.e., t − 1, t − 2, ..., t − r) to predict the air quality at time interval t. Thus, the proposed model inherently accounts for the spatial and temporal correlations of the air quality data. The dimensions of the input space and output are mr and m, respectively, where m is the number of stations.
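The construction of the input and output vectors described above can be sketched as follows. The helper name and the ordering of the lagged values inside each input vector (oldest first, stations interleaved) are assumptions; the text specifies only the dimensions mr and m.

```python
import numpy as np

def make_lagged_dataset(series, r):
    """Build spatiotemporal samples from a (T, m) array of station readings.

    Each input row concatenates the readings of all m stations at times
    t - r, ..., t - 1; the matching output row holds the readings at time t.
    """
    n_times, _ = series.shape
    X = np.stack([series[t - r:t].ravel() for t in range(r, n_times)])
    Y = series[r:]
    return X, Y

# Toy series (hypothetical): 10 time steps at m = 2 stations, r = 3 lags.
series = np.arange(20, dtype=float).reshape(10, 2)
X, Y = make_lagged_dataset(series, r=3)

# With m = 12 stations and r = 4, the input dimension is 12 * 4 = 48.
X12, Y12 = make_lagged_dataset(np.zeros((100, 12)), r=4)
```

Each record contributes to r consecutive input windows, which is how the temporal correlation enters the model, while stacking all stations in one vector supplies the spatial correlation.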

We chose the number of time intervals r from the set {4, 6, 8, 10, 12}; thus, with 12 stations, the input dimensions vary from 48 to 144. The number of layers was selected from the set {1, 2, 3, 4}. For simplicity, the number of nodes was set to be the same in each hidden layer and selected from the set {100, 200, 300, 400, 500}. All parameters for our prediction model are listed in Table 2. Moreover, the number of training epochs and the learning rate are also important during the learning phase because the reconstruction error usually increases dramatically with a large learning rate and the model would overfit the training data when the number of epochs is too large. In our experiment, we set the initial pre-training and fine-tuning learning rate to 2 and the scaling learning rate to 0.9995.

Table 2 Parameters for our air quality prediction architecture

As we tested the effect of each parameter, the other parameters were kept fixed. In this step, the validation set was used to evaluate the performance. Better parameter configurations could be identified using a grid search or other heuristic searching methods; however, due to the large search spaces, these methods would be tedious and computationally prohibitive (Huang et al. 2014). Thus, a random search of a fixed set was the preferred method in our experiments.

First, we inspected the effect of various network sizes, which represents one of the most typical problems in neural network design. The training time and the generalization capability of neural network models are highly affected by the network size parameters. In this experiment, the MAPE was the main evaluation index. Figure 4a–c shows the MAPE, RMSE, and MAE values for the various network sizes (i.e., the number of layers and the number of nodes in each layer). The performance could be improved by increasing the number of hidden layers from one to three. High-level air quality features are inherently learned in this manner. However, deeper structures do not have an advantage over a three-layer structure, and models with a structure that is too complex are prone to overfitting.

Fig. 4 Performance with various network sizes: (a) MAPE, (b) RMSE, and (c) MAE

Figure 4a–c also shows that increasing the number of nodes in each layer can slightly improve the performance. When the number reached 300, our model presented the best performance. More nodes in each layer would unnecessarily increase the training time and result in overfitting. This effect is evident for the four-layer structure with 400 or 500 nodes in each layer, for which the validation error increases rapidly. To maintain efficiency and accuracy, a structure with three hidden layers and 300 nodes in each layer was used in subsequent experiments.

Next, we tested the effect of different time intervals, as shown in Table 3. A large r increases the size of the input layer and provides a sufficient number of temporally correlated features, although it also increases the training time. The prediction performance improved markedly at first but showed no further improvement beyond r = 8. If r is greater than 8, then the additional, largely unrelated inputs make it more difficult for the complicated architecture to learn a good representation.

Table 3 Effect of the time intervals (layer = 3, nodes = 300)

Finally, we investigated the effect of the number of fine-tuning epochs. Figure 5 shows the error curve (measured by the MAPE) on the training set and the validation set as a function of the number of epochs. When the number of epochs was less than 3000, increasing it clearly decreased the training and validation errors. When the number of epochs was greater than 3000, the model appeared to overfit, and the generalization capability did not improve but fluctuated weakly. Because a large number of epochs leads to a high time cost, we set the number of fine-tuning epochs to 3000, the point at which the training and validation errors converged in our experiment.

Fig. 5 Effects of the fine-tuning epochs

Results and discussion

First, we evaluated the spatial stability of our STDL model. The predictive performance for each station is shown in Table 4, which indicates that our STDL model showed different predictive performances for these stations. In detail, the RMSE varied from 13.83 to 16.31 μg/m3, the MAE varied from 8.44 to 9.33 μg/m3, and the MAPE varied from 18.60 to 27.32 %. The Guanyuan station (No. 3) had the best performance with the lowest MAPE value of 18.60 %, the lowest MAE value of 8.44 μg/m3, and a low RMSE value of 14 μg/m3. The prediction results are shown in Fig. 6, which indicates that the predicted data are generally consistent with the recorded data. The \( R^2 \) value between the recorded and predicted hourly PM2.5 concentrations in this testing phase indicated that the model explained 98.24 % of the variance. The Zhiwuyuan station (No. 9) had the highest relative error and a MAPE value higher than 25 %, which was mainly because this station is located at the border of an urban area and is affected by only limited air pollutant sources, such as traffic. Therefore, this station has relatively good air quality and lower average PM2.5 concentrations (Table 1). Because our model produced similar absolute errors (RMSE and MAE) at every station, the lower concentrations at the Zhiwuyuan station translate into a higher MAPE value.

Table 4 Prediction performance for each station
Fig. 6 Predicted and recorded values of the test set at the Guanyuan station

Next, we tested the temporal stability of our model. We calculated the performance index for the four seasons, and the results are shown in Table 5, revealing that our model presented a consistent performance in each season. This feature is beneficial because it indicates that a separate model is not required for each season.

Table 5 Prediction performance for each season

Next, we evaluated the rank prediction performance of our model. According to the National Technical Regulation on the Ambient Air Quality Index (see Table 6), we calculated the recorded and predicted rank rate, which is shown in Table 7. Each row shows the predicted air quality rank ratio, and each column contains the recorded air quality rank ratio. Table 7 shows that the prediction rank rate of our model was high for each air quality rank, and the overall prediction rank accuracy rate was 82.66 %.

Table 6 PM2.5 air quality levels
Table 7 Predicted and recorded air quality levels (in percent)
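The rank comparison described above amounts to binning the recorded and predicted concentrations into air quality levels and cross-tabulating them. The sketch below illustrates the idea; the function name, the sample values, and in particular the breakpoints are placeholders chosen for illustration and should be replaced with the level boundaries of Table 6.

```python
import numpy as np

def rank_confusion(observed, predicted, breakpoints):
    """Cross-tabulate predicted (rows) against recorded (columns) levels."""
    obs_level = np.digitize(observed, breakpoints)
    pred_level = np.digitize(predicted, breakpoints)
    n_levels = len(breakpoints) + 1
    counts = np.zeros((n_levels, n_levels), dtype=int)
    np.add.at(counts, (pred_level, obs_level), 1)
    accuracy = float(np.mean(obs_level == pred_level))
    return counts, accuracy

# Placeholder breakpoints in μg/m3 (an assumption, not taken from Table 6).
breakpoints = [35.0, 75.0, 115.0, 150.0, 250.0]
observed = np.array([20.0, 50.0, 90.0, 130.0, 200.0, 300.0])
predicted = np.array([30.0, 80.0, 100.0, 120.0, 260.0, 310.0])

counts, accuracy = rank_confusion(observed, predicted, breakpoints)
```

The diagonal of the matrix counts the samples whose predicted level matches the recorded level, and the overall rank accuracy is the diagonal sum divided by the number of samples.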

Finally, we compared the performance of the proposed STDL model with that of the STANN model, the SVR model, and the ARMA model. These models were trained and tested using the same training and testing sets applied for the STDL model; however, the input data differed slightly between models. The STANN model uses the same inputs as our STDL model and likewise predicts the air quality of all stations simultaneously based on the spatiotemporal correlations of the input data. The main difference between the STANN model and our STDL model is that the STANN model does not use the greedy layer-wise unsupervised learning algorithm to pre-train the deep network. The SVR and ARMA methods are purely time series prediction models; for these, we conducted the prediction tasks for each station separately, using data from a single station as the input. The results are shown in Table 8.

Table 8 Prediction performance for the STDL, STANN, SVR, and ARMA models

Table 8 reveals that the STDL model presented more accurate air quality predictions than the STANN, SVR, and ARMA models and had lower RMSE, MAE, and MAPE values. Table 8 indicates that the two spatiotemporal models (STDL and STANN) had higher accuracy than the time series models (ARMA and SVR), which shows that spatial correlations are important for air quality predictions. Moreover, a comparison of the performance of the two spatiotemporal models showed that the MAPE of the STDL decreased by 5.12 % compared with that of the STANN, indicating that the deep architecture method with unsupervised pre-training can automatically learn better features than shallow models, thus improving the prediction performance.

Conclusion

In this paper, a spatiotemporal deep learning-based model was developed for air quality prediction. This model consists of a stacked autoencoder model at the bottom for unsupervised feature extraction and a logistic regression model at the top for real-value regression. Compared with existing methods that generally model the shallow structure of air quality data, the proposed method can effectively extract latent air quality feature representations from air quality data, especially nonlinear spatial and temporal correlations. Compared with traditional time series air quality prediction models, our model was able to predict the air quality of all monitoring stations simultaneously, and it showed a satisfactory seasonal stability. We evaluated the performance of the proposed method and compared it with the performance of the STANN, ARMA, and SVR models, and the results showed that the proposed method was effective and outperformed the competitors.