1 Introduction

Human activities have increased the level of air pollution, and its adverse consequences are suffered by all. Researchers across the world have been concerned with the challenging problem of high ambient aerosol concentrations, which directly affect health, the economy and the climate [1, 2]. A World Health Organization (WHO) report [3] suggests that 90% of the world’s population breathes air with pollutant levels exceeding the WHO air quality recommendations, and that around 7 million people die every year from exposure to ambient air pollution. A survey conducted by WHO in 2016 reported that fourteen of the world’s twenty most polluted cities are in India. Therefore, reliable forecasting of PM2.5 concentration is imperative to forewarn the public as well as policy planners so that corrective measures can be taken. In the last couple of decades, different forecasting methods such as deterministic models [4,5,6,7,8], statistical models [9] and artificial neural network (ANN)-based models have been explored by various researchers across the world. In recent years, ANN-based models have gained prominence owing to their ability to handle the linear and nonlinear variability present in environmental data with a limited set of variables. The general regression neural network [10] and neuro-fuzzy models [11], feed forward network, radial basis function network, multilayer perceptron (MLP) model [12], back-propagation neural network [13, 14], and recurrent neural network (RNN) [15, 16] with error back-propagation learning techniques are some of the widely used ANN architectures for air pollution forecasting [17,18,19,20]. With the advancement in computing technology, the emerging field of machine learning and artificial intelligence has attracted many researchers to apply deep learning techniques (a subset of machine learning) to diverse problems of societal relevance.

Among the many deep learning architectures, the long short-term memory (LSTM) network has been the most widely used for air quality time series forecasting in the recent past because of its ability to capture long- and short-term dependencies [21,22,23,24,25,26]. Different hybrid network architectures have also been employed for different aspects of air quality studies [27,28,29]. The LSTM network was further used to ascertain long-term and short-term dependencies [30] and the effectiveness of encoder-decoder networks for building prediction machines with time-series data [31, 32]. A hybrid model consisting of convolutional LSTM and CNN was also attempted to predict the concentration of particulate matter [33, 34], where the convolutional LSTM was used for sequential spatiotemporal information and the CNN for extracting temporal features in parallel. Similarly, a transfer learning BiLSTM model was examined for hourly, daily and weekly prediction of air quality [35].

Most of these studies focused on predicting air quality at a single (or a few) monitoring stations, which limits the wider use of such modelling techniques. India is a very diverse country (the 7th largest by area and the 2nd most populous in the world), extending from 8°4′ to 37°6′ latitude and 68°7′ to 97°25′ longitude, so models trained on selected cities may not be efficient for use in other cities, and the model architecture needs constant alteration. Therefore, a uniform, simpler model that is less data intensive and applicable across India without structural changes is required. The present study aims to develop a hybrid deep learning network with a uniform model architecture that is applicable to all monitoring stations across India. Besides developing and testing the model for multi-step ahead forecasting, the dependence of model performance on the relative variance of the input data, measured as the signal-to-noise ratio (SNR), is also ascertained. Section 2 details the site description and data pre-processing; model development and architecture are presented in Sect. 3. Results and discussion are given in Sect. 4 and the conclusion in Sect. 5.

2 Site and Data Description

For the present study, the air pollution data were acquired for 26 different cities across the country from the Central Pollution Control Board (CPCB), Government of India (http://www.cpcb.nic.in/) (Fig. 1). For ease of analysis, India was subdivided into 15 agroclimatic zones (Table 1) as per the India Meteorological Department (IMD), Government of India (GOI) classification [36]. No data were available for two regions, namely WH (Union Territory of Jammu and Kashmir and Union Territory of Ladakh) and IR (Andaman and Nicobar Islands, Lakshadweep Islands); hence, they were not included in the present study. The data were collected for the period from 1 January 2015 to 31 May 2020, depending upon data availability. Details of the data used in this work are presented in Table 1.

Fig. 1 Agroclimatic zones of India

Table 1 Agroclimatic zones and data description

2.1 Data Pre-Processing

The data acquired from secondary sources are often contaminated with outliers and missing values. Therefore, pre-processing the data to eliminate or minimise such errors is imperative. In the present study, unreasonably high values were considered outliers and were replaced using the linear interpolation method [37]. The same technique was followed to fill the missing values present in the data.
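The pre-processing step can be illustrated with a minimal sketch. It assumes that missing readings are encoded as `None` or NaN and that an "unreasonably high" value is one above a user-chosen `upper_limit`; the source does not state the threshold, so both the function name `interpolate_gaps` and the limit used below are hypothetical.

```python
import math


def interpolate_gaps(values, upper_limit):
    """Replace missing or above-limit readings by linear interpolation.

    `upper_limit` is an assumed outlier threshold, not a value from the study.
    Gaps at either end of the series are filled with the nearest valid reading.
    """
    def is_bad(v):
        return v is None or math.isnan(v) or v > upper_limit

    cleaned = [None if is_bad(v) else float(v) for v in values]
    good = [i for i, v in enumerate(cleaned) if v is not None]
    if not good:
        raise ValueError("no valid readings to interpolate from")

    for i in range(len(cleaned)):
        if cleaned[i] is not None:
            continue
        # Nearest originally valid neighbours on each side of the gap.
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            cleaned[i] = cleaned[right]
        elif right is None:
            cleaned[i] = cleaned[left]
        else:
            frac = (i - left) / (right - left)
            cleaned[i] = cleaned[left] + frac * (cleaned[right] - cleaned[left])
    return cleaned


# Illustrative hourly readings: one missing value and one spike.
hourly = [40.0, None, 60.0, 5000.0, 80.0]
print(interpolate_gaps(hourly, upper_limit=1000.0))
```

In this sketch the missing reading and the spike are both replaced by values lying on the straight line between their valid neighbours.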

3 Model Development

3.1 Network Architecture

In the present study, an ANN architecture within a deep learning framework was adopted for multi-step (eight-step) ahead forecasting. The model architecture is an encoder-decoder–based (Fig. 2) [38] sequence-to-sequence hybrid model, which has three main components, namely,

  1. 3D-CNN: 3-dimensional convolutional neural network model

  2. ConvLSTM: convolutional long short-term memory

  3. BiLSTM: bidirectional long short-term memory

Fig. 2 Encoder-decoder–based sequence-to-sequence model

Essentially, 3D-CNN and LSTM networks are the backbone of this architecture. The LSTM model has been widely applied for time series prediction because of its ability to store information in self-recurrent cells that can be retrieved at different time steps. The LSTM network performs exceedingly well in reducing the Gaussian noise present in the data [39] but is unable to filter out the non-Gaussian noise that is inherently present in the data set. To address this shortcoming, a BiLSTM network was applied to reduce overfitting to noisy data. Besides, the ConvLSTM model performs better on datasets having long-duration sequential features with multiple temporal information [40]. Furthermore, the 3D-CNN model is an advancement over the 2D-CNN model, with a better processing ability for large contextual data that is helpful in extracting spatiotemporal features. The ability of the 3D-CNN to extract features from large sequential data into different time–frequency domains was exploited to reduce the noise present in the data as well as to abstract features that can be stored and fed into the next fully connected layer. The schematic diagram of the hybrid model architecture is presented in Fig. 3.

Fig. 3 Model architecture

In the model architecture, the ConvLSTM encoder layer generates a feature map that is further refined and filtered by the second ConvLSTM network with a batch normalisation layer. The output is fed into the 3D-CNN to extract spatiotemporal patterns from the state matrix. The output of the 3D-CNN layer then feeds into the decoder, which has four BiLSTM networks. The BiLSTM generates the entire output sequence containing values for 8 h. The first, second and third fully connected layers act as interpretation layers for each time step of the output sequence, and the last fully connected layer is the output layer of the model that generates the final 8-step-ahead predictions. A dropout layer was used after the first BiLSTM to minimize overfitting. Each filter in a convolutional layer abstracts one feature from its input. Since the initial network layers receive the noisy raw data, fewer filters were used there to capture only the basic features. In the subsequent layers, the number of filters was increased to capture deeper abstractions of features. A smaller filter (kernel) size can capture more features than a larger one. We applied 64 filters of size (1,7) in the first ConvLSTM layer. In the second ConvLSTM layer, the number of filters was increased to 128, and the kernel size was decreased to (1,3). Odd kernel sizes were used to maintain symmetry around the centre or origin of the abstraction layer.

The BiLSTM layer used in the model acts as a decoder and generates multiple output values in a sequence. Cross-validation and out-of-sample testing techniques were employed to evaluate the model performance. A similar model framework was earlier applied by [41] to smart manufacturing problems using time series data. However, they used stacked ConvLSTM as an encoder and stacked BiLSTM as a decoder layer in an autoencoder framework. In the present study, the stacked ConvLSTM layer outputs were instead fed into the 3D-CNN layer. Air pollution time series data are the net outcome of the complex interplay between different stochastic and dynamic processes having different characteristic frequencies [16]. Therefore, the 3D-CNN was used to take these characteristic features into account, enhancing the ability of the network to predict.

To forecast the PM2.5 value for the next 8 h, we used 3 sequences of 8-h duration, i.e., 24 h of data as the input sequence with the next 8 h of data as the target. With non-overlapping sequences, however, the number of instances would be rather limited for training a deep learning model. Therefore, an overlapping moving window method was used on the time series data to generate more training instances. This method is a modified rolling window method as proposed in [42] and later adopted for air pollution studies by [35]. Here, a large training dataset was generated by shifting the entire sequence by one step (Fig. 4), as described below.

Fig. 4 Overlapping data by one step

Let us consider a time series u(t) = {u_1, u_2, u_3, …, u_t}. In order to forecast the next k values of the sequence ŝ = (ŝ_1, ŝ_2, …, ŝ_k), equivalent to (u_{t+1}, u_{t+2}, …, u_{t+k}), with the help of the last observation and a moving window of fixed size w, it would be

$$\widehat{s}= (\widehat{s}_{1}, \widehat{s}_{2},..., \widehat{s}_{k}) = f(u_{t\,-\,w\,+\,1}, u_{t\,-\,w\,+\,2},..., u_{t})$$

When the above operations are applied to a univariate time series of length N, it generates a sequence-to-sequence prediction with an input set U ∈ R^{n×w} and output set S ∈ R^{n×k}. Here, n is the size of training data given by

$$n = (N - w - k + 1).$$

As evident from the above description, the entire sequence of 24-h time series data (h1 to h24) was converted into 3 × 8 sequence internally and mapped to the next 8-h values (h25 to h32). In the next step, the data were shifted by one value, and now, the data (h2 to h25) were mapped to the next 8-h sequence (h26 to h33) and so on.
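The overlapping window construction described above can be sketched as follows. This is an illustration rather than the study's own code: the function name `make_windows` and the use of nested Python lists are assumptions, while the window size w = 24, horizon k = 8, the 3 × 8 reshaping and the sample count n = N − w − k + 1 follow the text.

```python
def make_windows(series, w=24, k=8, rows=3):
    """Build overlapping (input, target) pairs shifted by one step.

    Each input is the last `w` values reshaped to rows x (w // rows);
    each target is the following `k` values.
    """
    if w % rows != 0:
        raise ValueError("w must be divisible by rows")
    n = len(series) - w - k + 1  # number of training instances
    if n <= 0:
        raise ValueError("series is too short for the chosen w and k")

    cols = w // rows
    inputs, targets = [], []
    for start in range(n):
        window = series[start:start + w]
        inputs.append([window[r * cols:(r + 1) * cols] for r in range(rows)])
        targets.append(series[start + w:start + w + k])
    return inputs, targets


# Hours h1..h40 stand in for 40 hourly PM2.5 readings.
hours = list(range(1, 41))
X, y = make_windows(hours)
print(len(X))        # number of instances: 40 - 24 - 8 + 1
print(X[0], y[0])    # h1..h24 as 3 x 8, mapped to h25..h32
print(X[1][0][0], y[1][0])  # second instance starts at h2 and targets h26
```

Shifting by a single step means consecutive instances share 23 of their 24 input hours, which is what multiplies the number of training samples relative to non-overlapping windows.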

Since the model was applied to many stations situated at different geographical locations in India, the model parameters were generalized so that they yield near-optimal results for most of the stations.

3.2 Hyperparameters

Hyperparameters are settings that are fixed before training rather than learned from the data, and they play a significant role in determining the performance of a deep learning model. The hyperparameters used in this paper are listed in Table 2.

Table 2 Optimal architecture of the parameters used in the study

The most widely used activation function for deep learning is the rectified linear unit (ReLU), f(x) = max(0, x). A newer activation function named ‘swish’, f(x) = x · sigmoid(x), was proposed by the Google Brain team [43] and performs better in deeper networks. Hence, in the present study, the swish activation function was used, while the ‘tanh’ activation function was applied for the BiLSTM part.
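The two activation functions defined above can be compared with a short sketch. It only evaluates the stated formulas on scalars; the numerically stable form of the sigmoid is an implementation choice made here, not something taken from the study.

```python
import math


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x):
    """Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

Unlike ReLU, swish is smooth and returns small negative values for negative inputs instead of clipping them to zero.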

There exist different types of optimizing algorithms such as Gradient Descent, Stochastic Gradient Descent, Momentum Based Gradient Descent, Adaptive Moment Estimation (Adam), Nesterov Accelerated Gradient (NAG) and Root Mean Square Propagation (RMSProp) to minimize the loss function during the training of a machine learning model. In the present work, Adaptive Moment Estimation (Adam) optimizer was used due to its adaptive nature and combined momentum component [44].

3.3 Model Evaluation

The effectiveness of the model was tested following a walk-forward validation method. Statistical error metrics, namely root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), were used for performance evaluation. The equations for the error metrics are as follows:

$$RMSE=\sqrt{\frac{1}{n}{\textstyle\sum }_{i\,=\,1}^{n}{\left({A}_{i}-{P}_{i}\right)}^{2}}$$
$$MAE= \frac{1}{n}\textstyle\sum_{i\,=\,1}^{n}|{A}_{i}-{P}_{i}|$$
$$MAPE=\left(\frac{1}{n}\textstyle\sum_{i\,=\,1}^{n}\frac{\left|{A}_{i}\,-\,{P}_{i}\right|}{\left|{A}_{i}\right|}\right)\times 100$$

where A_i is the observed value, P_i is the model predicted value and n is the total number of samples.
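The three error metrics can be computed as in the following sketch, where `actual` plays the role of the observed values A_i and `predicted` the model values P_i. MAPE is expressed as a percentage (a factor of 100), consistent with the percent values reported in the results; the function names and sample numbers are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when an observed value is zero, so such inputs are rejected.
    """
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined for zero observed values")
    n = len(actual)
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n


observed = [100.0, 200.0]   # illustrative values, not study data
forecast = [110.0, 180.0]
print(rmse(observed, forecast), mae(observed, forecast), mape(observed, forecast))
```

RMSE and MAE carry the unit of the data (µg/m3 here), whereas MAPE is unitless, which is why it is the more convenient measure when comparing sites with very different mean concentrations.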

The entire model was developed on a single HP-Z6-G4 workstation in a Linux environment. An NVIDIA Quadro P2200 GPU was used with Python 3.8 and the TensorFlow and Keras libraries to run the model.

4 Results and Discussion

4.1 Statistical Distribution Analysis

The time series data were subjected to normality tests using the KS statistic, the Shapiro-Wilk test and the Jarque-Bera test to understand the nature of the time series. The results of the normality tests (Supplementary Table T1) indicate the non-normal nature of the PM2.5 dataset across the different agroclimatic zones of India. The seasonal analysis of the data revealed a maximum average PM2.5 concentration of 195.5 μg/m3 during the winter season and a minimum of 113.8 μg/m3 during the monsoon season. Similar observations were also reported by many authors in the past [11]. During the monsoon season, the WCPG area witnessed the minimum average concentration and WD the maximum average concentration. During the winter season, the EH region has the minimum average concentration and the MGP the maximum average concentration. It is worth mentioning that the TGP region has the maximum average concentration during the post-monsoon season, whereas the MGP region has the maximum average concentration in the pre-monsoon season. The seasonal contour plots (Fig. 5) revealed that north-western India has a higher concentration of PM2.5, especially during the monsoon and post-monsoon seasons. In the winter and pre-monsoon seasons, the highest concentration was confined to the Indo-Gangetic plain region.
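As an illustration of one of the normality tests named above, the Jarque-Bera statistic can be computed directly from the sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4). The sketch below uses the asymptotic chi-square distribution with two degrees of freedom for the p-value; the synthetic Gamma-distributed sample is an assumption standing in for a PM2.5 series and is not data from the study.

```python
import math
import random


def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("the series is constant")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Synthetic right-skewed sample (assumed shape and scale, for illustration).
random.seed(0)
sample = [random.gammavariate(2.0, 50.0) for _ in range(2000)]
jb, p = jarque_bera(sample)
print(f"JB = {jb:.1f}, p = {p:.3g}")
```

A small p-value rejects the null hypothesis of normality, which is the outcome the text reports for the PM2.5 series; the asymptotic p-value is only reliable for reasonably long series.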

Fig. 5 Seasonal contour plots

A detailed statistical distribution analysis was carried out to ascertain the nature of the distribution prevalent in the PM2.5 data in different agroclimatic zones. It is imperative to understand the statistical distribution of any data, as it determines the effectiveness of the model performance measures used in forecasting problems. To ascertain the best fit, each dataset was tested against 7 distributions commonly applied in air pollution studies, namely, Normal, Log Normal, Logistic, Laplace, Weibull, Gamma and Beta. The best fitted distribution (Supplementary Fig. F1(A-Z)) was selected based on the minimum sum of squared errors criterion. The results (Table 3) reveal the predominance of the Gamma and Beta distributions at 24 out of 26 sites. Overall, PM2.5 concentration follows the Gamma distribution at 14 sites, the Beta distribution at 10 sites and the log normal distribution at the remaining two sites. It is pertinent to mention that all the sites in UGP, TGP and EPH have the Gamma distribution as the best fit. Similarly, LGP, ECPH and GPH follow the Beta distribution only. The log normal distribution was observed only at Shillong in the EH region and at Patna in MGP. In the rest of the regions, mixed distribution fitting results were obtained.

Table 3 Best fit Statistical distribution parameters values

4.2 Model Performance Evaluation Results

The model performance evaluations were carried out through commonly used error functions, namely RMSE, MAE and MAPE. The error functions RMSE and MAE are scale dependent and are not an optimum measure for comparing different data sets whose means differ widely. However, MAPE is a unitless function that is scale independent and more suitable for model comparisons, even with data that have large variance and contain extreme values. The model’s performance (Fig. 6) in terms of RMSE values for 8-consecutive-hour advance predictions ranged from a minimum of 7.09 in Shillong in the EH region to a maximum of 53.81 in Patna in MGP. Similarly, in terms of MAE, a minimum of 5.41 and a maximum of 34.09 were obtained at Shillong and Patna, respectively. In terms of MAPE, a minimum of 18.6% and a maximum of 52.7% were observed in Hyderabad and Chennai, respectively. The first step prediction errors in terms of RMSE were found to be less than 10 µg/m3 at nine cities, whereas those at Shillong in EH, Howrah in LGP and Mandideep and Nagpur in CPH were found to be between 10 and 15 µg/m3. In terms of MAE, 15 out of 26 cities have MAE values less than 10 µg/m3, and overall, 24 out of 26 cities have MAE values ≤ 15 µg/m3. Only Patna (MAE = 20.52) in the MGP and Talcher (MAE = 20.29) in EPH have MAE values more than 15 µg/m3. Similarly, 11 out of 26 sites have MAPE values less than 30%, and 20 sites show MAPE values < 35%. Apart from Jodhpur in WD and Chennai in the ECPH zone, the rest of the sites have MAPE values less than 40%. (The detailed results are presented in Supplementary Table T2). Overall, for predicting up to 8-h ahead concentrations, the model performance was found to be relatively better in the central, southern and western regions of India in comparison to the northern and eastern regions. The robustness of the prediction ability of the proposed model framework is also evident from the 1-h ahead and 8-h average PM2.5 concentration estimates.
The performance evaluation for the 1-h ahead and 8-h average prediction horizons is essential, as regulatory air quality is mostly reported in this temporal range. The minimum values of RMSE, MAE and MAPE for the 1-h ahead concentration were found to be 5.81 (at Shillong), 3.92 (at Aurangabad) and 10.8 (at Howrah), respectively. The maximum RMSE values of 41.384 and 29.04 were obtained for the 1-h and 8-h average periods at Patna and Talcher, respectively. Overall, RMSE values of less than 20 were observed at 18 sites and 13 sites for the 1-h ahead and 8-h average forecasting horizons, respectively. In the case of MAE, the minimum and maximum values for the 1-h and 8-h averages were again found at Shillong and Patna, respectively. The results were found to be more uniform across India in terms of MAE, as only 3 sites, namely Delhi, Talcher and Patna, have MAE values of more than 20 µg/m3.

Fig. 6 Model performance results for 8-h cumulative forecasting at different cities in various agroclimatic zones of India

Across agroclimatic zones (Table 4), the best model performance for 1-h ahead prediction in terms of RMSE was achieved for SPH (7.6), followed by CPH (8.4) and WPH (9.4). Overall, 7 zones have RMSE values less than 20, and 3 zones each have RMSE values in the ranges 20 to 30 and 30 to 40 µg/m3, respectively. In terms of MAE, a similar trend in the error was observed, with the minimum value (5.4) obtained at SPH and the maximum (16.3) at MGP. In the case of 8-consecutive-hour advance predictions, the same pattern was obtained, with minimum RMSE and MAE values of 11.1 and 8.0, respectively, observed at SPH, and maximum RMSE (40.4) and MAE (25.8) values obtained for MGP. However, the results for MAPE were slightly different, with the minimum error values observed at LGP (10.8%) and EH (21.3%) for 1-h ahead and cumulative 8-consecutive-hour ahead prediction, respectively.

Table 4 Model performance agroclimatic zone wise

The observed forecasting results exhibit spatial variability in model performance. As evident from the heatmap (Fig. 7) for the multi-step hourly forecast, the regions mostly along the southeast, south, centre and southwest of India have better model performance in terms of MAE values. In most parts of North India, relatively poor model performance was observed, except for Jamshedpur, Agra, Amritsar and Shillong. It is worth mentioning that, for the first step, model performance is best across India except at Patna and Talcher. To date, a country-wide analysis of model performance has not been attempted in India, although ANNs with deep learning architectures have been attempted for selected pollution hotspots in India by many researchers [45, 46]. The results obtained at different locations and the corresponding observed values were further subjected to statistical distribution analysis. The best fit distribution was found to be the same as that of the original training data at each location in India. (The statistical distribution analysis plots for the test results are not displayed here, as they are the same as those of the observed training dataset).

Fig. 7 Heat map of model performance errors in terms of MAE values

4.3 Effects of Data Length and SNR on Model Performance

The variability in the forecasting results across the different agroecosystems of India prompts us to examine the effect of data length and of the deterministic signal and random components present in the PM2.5 time series, using correlation analysis and signal-to-noise ratio (SNR) measurements. SNR quantifies the proportion of desired information (signal) relative to unwanted information (noise) in each data series. In the present study, SNR was calculated [47] (Table 5) for each pre-processed dataset through the following equation:

$$SNR=\frac{\mu }{\sigma }$$

where µ is the mean and σ is the standard deviation of the time series data. This definition is used in situations where all values are non-negative. The scatter plots and trend line between the data length and the model performance error (Supplementary Fig. F2) do not show any relationship between them, indicating minimal or no effect of data length on model performance. A scatter plot (Fig. 8) of MAE vs. SNR for 1-h ahead and 8-h cumulative forecasting reveals a sharply decreasing trend. It is evident that as the noise component reduces, the model error also declines significantly for both forecasting horizons. It is pertinent to mention that when SNR is greater than ~ 1.5, the error variance reduces significantly, i.e. model performance improves significantly. The variability in the results across the sites in India may be attributed to the level of noise present in the data series. In northern India, the pollution sources vary significantly because of large population density and traffic loads, thereby increasing the relative variance in the data set. The Indo-Gangetic plain (IGP) region is known for large-scale farming and agricultural waste burning. It is to be noted that westerly winds are dominant in this region throughout the year except for the monsoon season, when easterlies bring monsoon rains. Dust storms driven by the westerlies and agricultural burning bring large uncertainty in the dust load over the area. The poor performance of the model for multistep ahead forecasts in this region may be attributed to these weather-induced uncertainties.
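The SNR defined above (mean divided by standard deviation) can be computed as in the sketch below. Whether the study used the population or the sample standard deviation is not stated; the population form is assumed here, and the sample series is illustrative only.

```python
import statistics


def snr(values):
    """Signal-to-noise ratio as mean / standard deviation.

    Uses the population standard deviation (an assumption) and expects
    a non-negative series, as noted in the text.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("SNR is undefined for a constant series")
    return mu / sigma


series = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
print(snr(series))  # mean 5, standard deviation 2
```

Under the relationship reported in the text, a series with an SNR above roughly 1.5 would be expected to yield noticeably lower forecasting error than one below that level.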

Table 5 Signal-to-noise ratio
Fig. 8 Scatter plot of MAE vs SNR

4.4 Comparison with Other Studies

A comparison with earlier studies reveals a significant improvement in the model errors in terms of RMSE relative to other models that produce multiple outputs in a sequence. [48] applied a multi-output autoencoder model for forecasting PM2.5 and PM10 concentrations in Beijing and obtained a best RMSE value of 39, although that model used multiple inputs such as meteorological variables in addition to the time series data of the pollutant concentrations. [49] applied an ANN model to achieve an error of 0.0191 in terms of MSE with a correlation coefficient of 0.7301. However, the model prediction horizon was single-step only, and the model viability for a multistep ahead prediction horizon was not examined. Similarly, [46] evaluated a simple feed-forward artificial neural network model for the Kolkata region in eastern India using multivariate input parameters to predict single-step PM2.5 concentrations during the COVID-induced lockdown period and reported an RMSE value of 3.74 and an MAE value of 1.14. [50] tested 8 different models, including Stacked LSTM, LSTM-autoencoder, BiLSTM and Conv2DLSTM models, on different air pollutants in Kolkata and observed RMSE and MAE values of more than 10 µg/m3. [51] achieved an MAE value of ~ 15 for PM2.5 forecasting in Delhi. Furthermore, [48] reported RMSE values of 31, 56 and 68 for 3-h, 5-h and 9-h ahead prediction using an ANN model for the Talcher station in India. Using LSTM and BiLSTM, they reported RMSE values of 26, 41 and 80 and of 42, 155 and 168, respectively, in comparison to the RMSE values of 29.04 and 40.41 for the 1-h and 8-h ahead prediction horizons. In the present study, the proposed model has achieved RMSE values ranging from 7.09 to 53.81 across different monitoring locations spread over 13 different agroclimatic zones in India. Out of the total 26 locations, 18 locations have RMSE values less than 30.
The results indicate that the model is robust enough to be applied to different locations in India without alterations.

5 Conclusion

Air pollution data mostly contain seasonal trends, multiple periodicities and stochastic components. To address the multiple complexities present in air pollution data, a hybrid deep learning model was formulated by integrating a convolutional LSTM, a 3D convolutional neural network and a bidirectional LSTM network, and its forecasting efficiency was examined across India on univariate PM2.5 time series data. The PM2.5 data series across India share a common property, as all of them rejected the null hypothesis of normal distribution. They largely follow either the Gamma or the Beta distribution, apart from Patna in MGP and Shillong in the EH region, which follow the log normal distribution. The results obtained for 8-h ahead sequential prediction reveal significant variations across the regions, with the minimum (7.09) and maximum (53.81) RMSE values obtained at Shillong in EH and Patna in MGP, respectively. Similar results (minimum: 5.41 and maximum: 34.09) were also found in terms of MAE values at Shillong and Patna, respectively. In terms of MAPE, the minimum and maximum values were observed to be 18.6% and 52.7% at Hyderabad and Chennai, respectively. The robustness of model performance was evident from the small variations observed in the model error estimates for 1-h ahead and 8-h sequential forecasts. The results (MAE) were further analysed against SNR, and a strong association was found between the level of error and the SNR values: as SNR decreases, model performance decreases (MAE values increase). The variations in the SNR may be attributed to anthropogenic activities in the region. The results reveal weak performance in and around the IGP in comparison to the rest of India. The model has the potential to be utilized for policy and planning for pollution control. It could be a useful tool for forewarning about impending air pollution events.