
1 Introduction

In recent years, air pollution has become a vital issue in most developing countries and has gained worldwide attention due to its negative effects on health, the economy and environmental sustainability [1, 2]. Rapid industrialization, infrastructure development and urbanization have caused serious air quality deterioration, especially in urban areas [3]. Carbon monoxide (CO), one of the most dangerous air pollutants, can negatively affect human health, causing respiratory infections, lung cancer and heart diseases that may lead to mortality [4]. CO is a colourless, tasteless and odourless gas that is commonly emitted from the combustion of fossil fuels and coal [5]. CO concentration levels are generally higher in urban areas than in rural areas, as industrial, commercial and heavy traffic activities are concentrated there [6]. Therefore, reliable forecasting of air pollutant concentration is essential to provide accurate information on air quality in affected areas and to support environmental management [4].

Forecasting of time series air pollutants using intelligent modelling strategies has been shown to achieve higher accuracy than statistical modelling such as the Auto Regressive Integrated Moving Average (ARIMA) [7]. Deep learning is a subset of machine learning based on neural networks that has been successfully applied to problems in speech recognition and image classification [8]. Deep learning strategies such as the convolutional neural network (CNN) and recurrent neural network (RNN) have gained popularity in numerous air quality forecasting studies due to their advantages over traditional machine learning models such as the artificial neural network (ANN) and support vector machine (SVM) [3, 4, 9]. However, RNN suffers from a drawback during the learning process known as the vanishing gradient problem [10]. Considering this limitation, an improved method, long short-term memory (LSTM), which uses memory blocks in the recurrent learning process, was introduced and has been widely applied in air quality forecasting [11, 12].

Furthermore, hybrid architectures combining multiple deep learning methods, such as CNN-LSTM [13, 14] and sequence to sequence (seq2seq) models [15, 16], are able to improve on individual models in air quality forecasting. For instance, Wang et al. [17] developed a hybrid seq2seq model based on bidirectional LSTM and the gated recurrent unit (GRU), and Jia et al. [18] used stacked GRU layers to forecast hourly ozone concentration. In addition, Sharma et al. [19] and Du et al. [20] developed hybrid CNN-LSTM models for forecasting particulate matter. The literature shows that such hybrid architectures outperform individual deep learning models and yield higher forecasting accuracy. However, these studies do not compare the forecasting performance of CNN-LSTM and seq2seq LSTM hybrid architectures. Comparative analysis of different hybrid models may provide new insight into the effectiveness and efficiency of hybrid architectures in air quality forecasting. Although sequence to sequence deep learning models have previously been developed for air quality forecasting, their evaluation in multistep forecasting of CO concentration is still limited.

The objective of this study is to develop two multistep hybrid deep learning models, namely CNN-LSTM and seq2seq LSTM, for hourly forecasting of CO concentration in Selangor, Malaysia. The study involves hourly air quality datasets from six air quality monitoring stations for 1 to 6 h ahead forecasting of CO concentration, together with the development and comparison of the proposed deep learning architectures. The comparison was conducted to highlight the performance of the different architectures and evaluate the impact of each network architecture on forecasting accuracy. The forecasting models were evaluated using statistical measures, namely root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE).

2 Data and Methods

2.1 Study Area and Data

Study Area and Data Collection.

Hourly historical air quality data consisting of six air pollutants, namely PM2.5, PM10, SO2, NO2, O3 and CO, were obtained from the Department of Environment Malaysia for the period 1 January 2019 to 31 December 2019. The datasets were collected at six air quality monitoring stations in Selangor. Figure 1 shows the locations of the monitoring stations considered in this study. The hourly dataset contains 8760 records per station. The mean hourly air pollutant concentrations for the six monitoring stations were calculated and are summarized in Table 1.

Fig. 1. Location of air quality monitoring stations

Table 1. Statistics of the parameters for Selangor

Data Preprocessing.

The collected datasets contain missing values that may be due to instrumental error, invalid readings and regular maintenance. In this study, the mean value of the corresponding attribute is used to substitute missing data. The mean hourly air pollutant values across the monitoring stations were then computed to represent the air quality of Selangor. The dataset was split into a training set (80% of total records) and a testing set (20%). The values were normalized to the range [0, 1] to avoid negative impacts on the model's learning process due to nonuniform value ranges. Data normalization is defined in Eq. 1.

$$z=\frac{x-\mathrm{min}(x)}{\mathrm{max}\left(x\right)-\mathrm{min}(x)}$$
(1)

where x is the actual value and z is the normalized value.
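As an illustration, a minimal preprocessing sketch in Python is given below. The file name, column names and the choice to fit the normalization statistics on the training set only are assumptions, since these details are not specified in the text.

```python
# Illustrative preprocessing pipeline; file and column names are assumed.
import pandas as pd

POLLUTANTS = ["PM2.5", "PM10", "SO2", "NO2", "O3", "CO"]

df = pd.read_csv("selangor_air_quality.csv")  # hypothetical dataset file

# Mean imputation: substitute missing values with the attribute mean
df[POLLUTANTS] = df[POLLUTANTS].fillna(df[POLLUTANTS].mean())

# Average the six stations per hour to represent Selangor
hourly = df.groupby("datetime")[POLLUTANTS].mean()

# Chronological 80/20 train/test split
split = int(len(hourly) * 0.8)
train, test = hourly.iloc[:split], hourly.iloc[split:]

# Min-max normalization to [0, 1] (Eq. 1); statistics fitted on training data
x_min, x_max = train.min(), train.max()
train_z = (train - x_min) / (x_max - x_min)
test_z = (test - x_min) / (x_max - x_min)
```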

2.2 Long Short-Term Memory

LSTM is an improved version of RNN that is capable of learning long-term dependencies and solving the vanishing gradient problem in RNN through self-looping memory blocks [4]. An LSTM unit consists of a memory block that includes three gates, namely the forget gate, input gate and output gate, as illustrated in Fig. 2. The three gates respectively write information from the input, forget information, and determine the final output. The gate units control the information flow from one LSTM unit to another and allow the network to learn over many time steps [9].

Fig. 2. LSTM unit architecture

LSTM takes the current input \({x}_{t}\), the previous hidden layer output \({h}_{t-1}\) and the previous cell state \({C}_{t-1}\) as inputs. The gate structures help LSTM learn long-term dependencies in sequential series and control how information passes through the network, making LSTM an effective model for learning sequential data. The forget gate, input gate, output gate and memory cell can be defined by the following equations:

$${f}_{t}=\sigma \left({W}_{f}{x}_{t}+{U}_{f}{h}_{t-1}+{b}_{f}\right)$$
(2)
$${i}_{t}=\sigma \left({W}_{i}{x}_{t}+{U}_{i}{h}_{t-1}+{b}_{i}\right)$$
(3)
$${o}_{t}=\sigma \left({W}_{o}{x}_{t}+{U}_{o}{h}_{t-1}+{b}_{o}\right)$$
(4)
$${\tilde{C }}_{t}=ReLU\left({W}_{C}{x}_{t}+{U}_{C}{h}_{t-1}+{b}_{C}\right)$$
(5)

where \({U}_{f}\), \({U}_{i}\), \({U}_{o}\) and \({U}_{C}\) are the weight matrices connecting the preceding output to the gate units and memory cell; \({b}_{f}\), \({b}_{i}\), \({b}_{o}\) and \({b}_{C}\) are the bias vectors; and \({W}_{f}\), \({W}_{i}\), \({W}_{o}\) and \({W}_{C}\) are the weight matrices mapping the current input to the gate units and memory cell. σ denotes the sigmoid function defined in Eq. 6, and the ReLU activation function is defined in Eq. 7. The cell state and the layer output are then computed using Eq. 8 and Eq. 9, respectively.

$$\sigma \left(x\right)=\frac{1}{1+{e}^{-x}}$$
(6)
$$ReLU\left(z\right)=\mathrm{max}(0,z)$$
(7)
$${C}_{t}={f}_{t}\circ {C}_{t-1}+{i}_{t}\circ ReLU({U}_{C}{h}_{t-1}+{W}_{C}{x}_{t}+{b}_{C})$$
(8)
$${h}_{t}={o}_{t}\circ ReLU({C}_{t})$$
(9)
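For concreteness, the following NumPy sketch implements a single LSTM step according to Eqs. 2 to 9, including the ReLU cell activation used here in place of the more common tanh. The dictionary-based parameter layout is purely illustrative.

```python
# One LSTM step following Eqs. 2-9; parameters W, U, b are dicts of
# NumPy arrays keyed by gate name ('f', 'i', 'o', 'c').
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # Eq. 2: forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # Eq. 3: input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # Eq. 4: output gate
    c_tilde = relu(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # Eq. 5: candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                       # Eq. 8: cell state
    h_t = o_t * relu(c_t)                                    # Eq. 9: layer output
    return h_t, c_t
```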

2.3 Convolutional Neural Network

CNN is a biologically inspired network that has been successfully applied in image recognition, object detection and text processing [21]. CNN is able to work on data arrays of multiple dimensions: 1D for signals, sequences and text; 2D for images; and 3D for video or images taken across time [22]. A general CNN architecture consists of convolutional, max pooling, dropout and fully connected layers, as illustrated in Fig. 3. In CNN, the convolutional layer extracts features from the input variables using convolutional kernels [8]. The pooling layer is introduced after the convolutional layer to speed up filtering and reduce the number of operations; it simplifies and downsamples the convolutional output to avoid overfitting [10]. After the convolutional and pooling layers, the output is flattened into a 1D array for subsequent forecasting.

Fig. 3. CNN architecture

Owing to the ability of 1D CNN to handle time series data, it has gained worldwide attention in various fields. The equations for 1D CNN are as follows [20]:

$${c}_{j}^{l}=\sum\nolimits_{i}{x}_{i}^{l-1}*{\omega }_{ij}^{l}+{b}_{j}^{l}$$
(10)
$${x}_{j}^{l}=ReLU\left({c}_{j}^{l}\right)$$
(11)
$${x}_{j}^{l}=Flatten\left({x}_{j}^{l}\right)$$
(12)
$${x}_{k}^{l+1}=FC\left({\omega }_{kj}^{l+1}{x}_{j}^{l}+{b}_{k}^{l+1}\right)$$
(13)

The convolutional layer learning process is modelled by Eq. 10 and Eq. 11, where ⁎ denotes the convolution operator, \({\omega }_{ij}^{l}\) is the filter, \({b}_{j}^{l}\) is the bias and l denotes the layer. The ReLU activation function is used within the layer. \({x}_{i}^{l-1}\) and \({c}_{j}^{l}\) represent the input and output vectors of the convolutional layer. Eq. 12 flattens the convolutional output into a 1D array, and Eq. 13 defines the fully connected (FC) layer that produces the next-layer output \({x}_{k}^{l+1}\).
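As a sketch, the convolution and activation of Eqs. 10 and 11 can be written directly in NumPy for a single input channel; the valid-convolution boundary handling and the random example data are assumptions.

```python
# 1D convolution with ReLU for one input channel (Eqs. 10-11), followed
# by flattening (Eq. 12); shapes are illustrative only.
import numpy as np

def conv1d_relu(x, kernels, biases):
    # x: (T,) series; kernels: (n_filters, k); biases: (n_filters,)
    n_filters, k = kernels.shape
    n_out = len(x) - k + 1                       # "valid" convolution length
    out = np.empty((n_filters, n_out))
    for j in range(n_filters):
        for t in range(n_out):
            out[j, t] = x[t:t + k] @ kernels[j] + biases[j]  # Eq. 10
    return np.maximum(0.0, out)                  # Eq. 11: ReLU

features = conv1d_relu(np.random.rand(24), np.random.rand(32, 3), np.zeros(32))
flat = features.flatten()                        # Eq. 12: input to the FC layer
```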

2.4 Experimental Design

This study aims to evaluate the performances of two hybrid LSTM-based models for CO concentration forecasting at time horizons of 1 to 6 h ahead, incorporating historical air quality datasets from six air quality monitoring stations. In addition, a comparative analysis was conducted to highlight the effectiveness of the different hybrid architectures in forecasting 1 to 6 h ahead CO concentration in terms of error assessment.

The seq2seq LSTM model consists of two LSTM layers, with 128 and 64 units respectively, in both the encoder and decoder processing layers. A manual search was performed to find the optimum hyperparameters of the models. The activation function used in the network is the rectified linear unit (ReLU), which reduces the vanishing gradient problem and has better convergence performance. Adaptive moment estimation (ADAM) is used as the optimizer, as it works well in online and non-stationary settings. The exponential decay rates for the first and second moment estimates are 0.9 and 0.999, respectively, and the learning rate is set to 0.001. The forecasting models are fitted with a batch size of 128, and mean square error (MSE) is used as the loss function. An early stopping criterion is applied to determine the number of training epochs. The hyperparameters used in this study are summarized in Table 2.
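A minimal Keras sketch of this seq2seq LSTM configuration is given below. The RepeatVector-style decoder wiring, window lengths and early stopping patience are assumptions, since only the layer sizes and training hyperparameters are stated above.

```python
# Sketch of the seq2seq LSTM; N_STEPS_IN, N_FEATURES, N_STEPS_OUT are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

N_STEPS_IN, N_FEATURES, N_STEPS_OUT = 24, 6, 6   # illustrative window sizes

model = models.Sequential([
    # Encoder: two LSTM layers with 128 and 64 units, ReLU activation
    layers.LSTM(128, activation="relu", return_sequences=True,
                input_shape=(N_STEPS_IN, N_FEATURES)),
    layers.LSTM(64, activation="relu"),
    # Repeat the encoded vector once per forecast step
    layers.RepeatVector(N_STEPS_OUT),
    # Decoder: two LSTM layers with 128 and 64 units
    layers.LSTM(128, activation="relu", return_sequences=True),
    layers.LSTM(64, activation="relu", return_sequences=True),
    # One CO value per forecast hour
    layers.TimeDistributed(layers.Dense(1)),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                 beta_1=0.9, beta_2=0.999),
              loss="mse")
early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(X_train, y_train, batch_size=128, epochs=200,
#           validation_split=0.1, callbacks=[early_stop])
```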

The CNN-LSTM model consists of a 1D convolutional layer with 32 filters and a kernel size of 3. The hyperparameters of the LSTM in the CNN-LSTM architecture are set equal to those of the seq2seq LSTM model. The architectures of the seq2seq LSTM and CNN-LSTM models proposed in this study are illustrated in Fig. 4 and Fig. 5, respectively.
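A comparable sketch of the CNN-LSTM model follows; the pooling layer and the dense output head are assumptions, since only the filter number, kernel size and LSTM hyperparameters are specified.

```python
# Sketch of the CNN-LSTM: Conv1D feature extraction feeding two LSTM layers.
from tensorflow.keras import layers, models

N_STEPS_IN, N_FEATURES, N_STEPS_OUT = 24, 6, 6   # as in the previous sketch

cnn_lstm = models.Sequential([
    layers.Conv1D(filters=32, kernel_size=3, activation="relu",
                  input_shape=(N_STEPS_IN, N_FEATURES)),
    layers.MaxPooling1D(pool_size=2),            # assumed pooling layer
    layers.LSTM(128, activation="relu", return_sequences=True),
    layers.LSTM(64, activation="relu"),
    layers.Dense(N_STEPS_OUT),                   # 1-6 h ahead CO forecasts
])
cnn_lstm.compile(optimizer="adam", loss="mse")
```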

Table 2. Hyperparameters of proposed models
Fig. 4. Sequence to sequence LSTM architecture

Fig. 5. CNN-LSTM architecture

2.5 Performance Evaluation

The proposed forecasting models were evaluated using statistical metrics, namely root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). RMSE represents the difference between observed and forecasted values at different time intervals. MAE measures the absolute difference between observed and forecasted values over all data points. MAPE expresses the average absolute forecast error as a percentage, measuring the model's forecasting accuracy. Smaller values of RMSE, MAE and MAPE indicate better forecasting performance.

The equations of performances criteria are defined as follows:

$$RMSE=\sqrt{\frac{1}{n}\sum\nolimits_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}$$
(14)
$$MAE=\frac{1}{n}\sum\nolimits_{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|$$
(15)
$$MAPE=\frac{1}{n}\sum\nolimits_{i=1}^{n}\left|\frac{{y}_{i}-{\widehat{y}}_{i}}{{y}_{i}}\right|\times 100$$
(16)

where n is the number of data points; \({y}_{i}\) and \({\widehat{y}}_{i}\) are the observed and forecasted values, respectively.
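These metrics translate directly into code, for example:

```python
# RMSE, MAE and MAPE as defined in Eqs. 14-16; y and y_hat are NumPy arrays
# of observed and forecasted values.
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))        # Eq. 14

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                # Eq. 15

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100    # Eq. 16
```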

3 Results and Discussion

The performances of the CNN-LSTM and seq2seq LSTM models in forecasting CO concentration 1 h to 6 h ahead, in terms of RMSE, MAE and MAPE, are shown in Fig. 6, Fig. 7 and Fig. 8, respectively. The error values gradually increase as the forecasting time horizon increases. Both forecasting models show the same trend in evaluation scores, indicating that forecasting accuracy decreases for larger forecasting time horizons [3]. Therefore, it is important to select an appropriate forecasting resolution to maintain optimum forecasting accuracy and reduce bias in the dataset.

The forecasting performances of the proposed architectures were compared to highlight their effectiveness and impact on air quality forecasting. Both forecasting models were developed to extract input data features using the first processing layer and forecast future CO concentration using the second processing layer; in effect, both are encoder-decoder frameworks with different architectural designs. The seq2seq LSTM architecture yields RMSE values of 0.1623, 0.1823, 0.1980, 0.2082, 0.2153 and 0.2215 for 1 h to 6 h ahead forecasting, respectively, which are lower than those of the CNN-LSTM model. Similarly, the MAE and MAPE values of seq2seq LSTM are lower than those of CNN-LSTM. Therefore, the seq2seq LSTM model outperforms CNN-LSTM in terms of RMSE, MAE and MAPE at every forecasting step.

Seq2seq LSTM reduces the RMSE, MAE and MAPE of CNN-LSTM by 23.6%, 24.2% and 28.0%, respectively, at 6 h ahead forecasting. Table 3 summarizes the error values of both proposed architectures. The higher performance of seq2seq LSTM indicates that the architecture successfully extracts important features and captures the temporal distribution of the time series air quality dataset to forecast CO concentration multiple hours ahead [15]. Therefore, the architectural design of a forecasting model affects its learning process and forecasting performance. However, the improvement over CNN-LSTM is slight, which suggests that CNN-LSTM remains a consistent choice for multi-hour CO concentration forecasting.

Overall, both the CNN-LSTM and seq2seq LSTM models yield promising forecasting performance, producing CO concentration forecasts close to the observed values. This indicates that the proposed hybrid models are able to extract the important features in multiple input variables and successfully forecast future CO concentration. The comparison of observed and forecasted CO concentration for 6 h ahead forecasting is presented in Fig. 9. It can be concluded that both forecasting models are reliable for multistep ahead forecasting of air pollutant concentration. Different architectural designs and hyperparameter combinations can be further explored to enhance forecasting performance.

Fig. 6. RMSE of proposed models

Fig. 7. MAE of proposed models

Fig. 8. MAPE of proposed models

Table 3. RMSE, MAE and MAPE of proposed architectures at 6 h forecasting
Fig. 9. Forecasted value and observed value of CO concentration

4 Conclusion

In this study, two hybrid architectures based on LSTM were proposed to forecast hourly CO concentration using air quality datasets from multiple monitoring stations in Selangor. CNN-LSTM consists of a 1D convolutional layer and two LSTM layers, while seq2seq LSTM contains two LSTM layers in both the encoder and decoder processing layers. Both models are designed to extract the features in multiple input variables using the first processing layer and forecast future CO concentration using the second processing layer. The seq2seq LSTM model achieves slightly higher forecasting performance than CNN-LSTM for 1 h to 6 h ahead forecasting. However, both hybrid architectures show strong forecasting performance and yield forecasted CO concentrations close to the observed values. Overall, the optimum hybrid architecture may depend on the input parameters and forecasting requirements. The study can be extended in several ways. First, considering other parameters such as weather and traffic data may enhance forecasting performance; these were excluded from this study due to data source limitations. Second, the study can be extended with spatiotemporal analysis across multiple air quality monitoring stations. Lastly, the hybrid deep learning architectures can be extended with more sophisticated methods, such as bidirectional LSTM to handle larger datasets, and with optimization techniques to find optimum hyperparameters.