1 Introduction

Data is one of the most valuable treasures in the world, forming the basis of different branches of computer science. Data refers to any set of objects with organized features, or to specific characteristics of an object or collection thereof and their features. Data can be of different types and can be obtained by observation, search, or recording (Alkaim & Al-Janabi 2020). In general, researchers dealing with the concept called data science work in three domains, related to data, intelligence, and statistics (Buyya et al. 2016). Data science can be divided into three fields, viz. small, normal, and big/huge data. Small data is organized into uniform structures such as tables or lists containing no more than 30 samples; such data cannot be assumed to follow the normal distribution and is thus of limited use for decision-making. On the other hand, normal data is also structured but does follow the normal distribution and is thus useful for making different types of decisions, such as clustering, classification, prediction, optimization, etc. Finally, big data can be of different types, such as structured, semistructured, or unstructured, with size ranging from 1 TB to 1 ZB. Extraction of useful knowledge or patterns from big data can be achieved by combining the two main concepts of machine learning and cloud computing.

Deep learning is a branch of modern science that considers multilevel learning processes, where learning is applied at each level for a specific part of the problem and aggregation of the corresponding results enables the overall problem to be solved. It is thus classified as a branch of artificial intelligence (Liu et al. 2019).

Prediction is a type of decision-making technique where future events are forecast based on historical information. Among the three types of prediction technique, viz. traditional (offering accuracy), self (offering speed), and intelligent (offering both speed and accuracy), this work relates to the last of these (Al-Janabi et al. 2015).

Increasing air pollution caused by technological development represents one of the most important challenges facing the world today. It can be categorized into several classes depending on its origin, viz. pollution due to living organisms such as bacteria and fungi in environments such as water, air, or soil; chemical air pollution due to an imbalance in the ecosystem resulting from chemical effects, in the form of solid particles, liquid droplets, or gases; and technological pollution, due to a change in the balance between the components of an ecosystem that prevents its efficient operation and its ability to perform its natural role in the disposal of pollutants.

2 Related work

The issue of air quality prediction is one of the critical topics related to human lives and health. The aim of the work presented herein is to develop a new method for such prediction based on the huge amount of available data, operating on data series. This section first reviews previous studies by researchers in this area and compares them based on the database used in each case, the methods applied to assess the results, the advantages of each method, and its limitations.

Ong et al. (2015) used a deep recurrent neural network (DRNN) reinforced with a novel pretraining system using an autoencoder, principally designed for time-series prediction. Moreover, the sensors were chosen within the DRNN without degrading the accuracy of the predictions by considering the sparsity of the system. This method was applied to the prediction of air pollution, in particular for PM2.5 particulate matter concentration, offering more accurate results compared with the poor performance achieved using the noise reduction approach. The results were evaluated using four measures, viz. the root-mean-square error (RMSE), precision (P), recall (R), and F measure. The work presented herein is similar in that it uses the same technique (RNN), albeit based on the LSTM approach.

Al-Janabi et al. (2015) applied a hybrid system using genetic neural computing (GNC) to analyze and understand data corresponding to the concentration of dissolved gases, clustered into four subgroups based on the IEEE C57.104 specification using a genetic algorithm (GA). The clustered data were then input to the neural network to predict the different types of errors. The hybrid system generates decision rules which identify the error accurately. Two measures were used in that work, viz. the Davies–Bouldin (DB) index and the mean square error (MSE). The results indicated that the problem could be solved at lower cost and that the described method facilitated the prediction process and enabled a more accurate approach through the analysis of errors and ways to address them. This work is similar to that presented herein in that it uses neural networks, while the difference lies in the use of the PSO algorithm combined with LSTM.

Li et al. (2016) described an air quality prediction method based on a spatiotemporal deep learning (STDL) model. A stacked autoencoder (SAE) method was applied to extract inherent air quality characteristics, being trained using a greedy layerwise method. In comparison with traditional time-series prediction models, the described model could predict the air quality at all stations at the same time and exhibited temporal stability across all seasons. In addition, a comparison with the spatiotemporal artificial neural network (STANN), autoregression moving average (ARMA), and support vector regression (SVR) models was presented. The results of the model were evaluated using three measures, viz. RMSE, mean absolute error (MAE), and mean absolute percentage error (MAPE). The work presented herein is similar in that the same technique (RNN) is applied to predict the air quality indexes, but now dealing with huge data and also applying the LSTM approach to enhance the operation of the network.

Li et al. (2017) used a long short-term memory extended (LSTME) neural network model with combined spatial–temporal links to predict concentrations of air pollutants. In that approach, the LSTM layers automatically extract potential intrinsic properties from historical air pollutant and accompanying data, while meteorological data and timestamp data are also incorporated into the proposed model to improve its performance. The technique was evaluated using three measures (RMSE, MAE, and MAPE) and compared with the STANN, ARMA, and SVR models. The work presented herein is similar in its use of the LSTM approach as part of a recurrent neural network structure but differs in its use of another evaluation measure.

Ghoneim and Manjunatha (2017) described a new prediction model based on deep learning for ozone levels, considering pollution and weather correlations in an integrated fashion. This deep learning model was used to learn ozone level features and was trained using a grid search technique. A deep architecture model was utilized to represent the ozone level features for the predictions. Experiments demonstrated that the proposed method offered superior performance for ozone level predictions. The results of this study could be helpful for predicting ozone level pollution in Aarhus City as a model for smart cities, to improve the accuracy of ozone forecasting tools. The results of the model were evaluated based on the RMSE, MAE, MAPE, R2, and the correlation coefficient. The work presented herein also uses a memory (LSTM in this case) for processing of large data, but differs in that the optimal structure of the neural network is found by applying a PSO algorithm.

Lifeng et al. (2018) reported that the best predictions of air quality could be obtained using the GM(1,1) model with fractional order accumulation, i.e., FGM(1,1), to find the expected average annual concentrations of PM2.5, PM10, SO2, NO2, 8-h O3, and 24-h O3. The measure used in that work was the MAPE. Application of the FGM(1,1) method resulted in much better performance compared with the traditional GM(1,1) model, revealing that the average annual concentrations of PM2.5, PM10, SO2, NO2, 8-h O3, and 24-h O3 will decrease from 2017 to 2020. The work presented herein is similar in that it predicts the concentration of air pollutants and finds ways to address them, but differs in its use of the LSTM method for the predictions.

Popoola et al. (2018) considered sensor measurements including SNAQ boxes and network deployment, sensor measurement validation, and source apportionment to build a predictive model for the ADMS-Airport tool, using the concentration of pollutants to determine the air quality model. The results showed that such a method can be applied in many environments that suffer from air pollution, potentially reducing the health effects of reduced air quality and decreasing cost, as well as for monitoring of greenhouse-gas emissions. The work presented herein is similar in that the concentration of air pollutants is determined, but differs in its use of the LSTM RNN method.

For effective extraction of spatiotemporal features, Wen et al. (2019) combined a convolutional neural network (CNN) and LSTM neural network (NN), as well as meteorological and aerosol data, to refine the prediction performance of the model. Data collected from 1233 air quality monitoring stations in Beijing and the whole of China were used to verify the effectiveness of the proposed model (C-LSTME). The results showed that the model achieved better performance than state-of-the-art technologies for predictions over different durations at various regional and environmental scales. The technique was evaluated using three measures (RMSE, MAE, and MAPE). In comparison, the LSTM approach is also applied in a RNN in this work, but after having identified the best structure for the network. In addition, another evaluation measure is used herein.

Shang et al. (2019) described a prediction method based on a classification and regression tree (CART) approach in combination with the ensemble extreme learning machine (EELM) method. Subgroups were created by dividing the datasets using a shallow hierarchy tree through the CART approach. At each node of the tree, EELM models were constructed using the training samples of the node, to minimize the verification errors sequentially in all of the subtrees of each tree by identifying the number of hidden neurons, where each node is considered to be a root. Finally, the EELM models along each path from a root to a leaf are compared, selecting only the path with the smallest error for that leaf. The measures used in that work were the RMSE and MAPE. The experimental results revealed that such a method can address the issue of global–local duplication of the prediction method at each leaf and that the combined CART–EELM approach worked better than the random forest (RF), ν-SVR, and EELM models, while also showing superior performance compared with seasonal EELM or k-means-EELM. The work presented herein is similar in that it uses the same set of six air pollution indexes (PM2.5, O3, PM10, SO2, NO2, CO) but differs in terms of the mechanism applied to reduce air pollutants, applying the RNN method.

Li et al. (2019) applied a new air quality forecasting method and proposed a new positive analysis mechanism that includes complex analysis, improved prediction units, data pretreatment, and air quality control problems. The system analyzes the original series using an entropy model and a data processing process. The multiobjective multiverse optimization (MOMVO) algorithm is used to achieve the required performance, revealing that the least-squares support vector machine (LSSVM) achieved the best accuracy in addition to stable predictions. Three measures were used for the evaluation in that work, viz. RMSE, MAE, and MAPE. The results of the application of the proposed method to the dataset revealed good performance for the analysis and control of air quality, in addition to the approximation of values with high precision. The work presented herein uses the same evaluation measures but differs in its use of the LSTM approach in the RNN after identifying the best structure for the network.

Table 1 presents a comparison of the cited previous works based on the type of dataset considered, the methodology used, the evaluation measures applied, and the advantages offered.

Table 1 Comparison of previous works

3 Main concept

3.1 Big data

This term is commonly used today due to the abundance and diversity of data sources, which lead to difficulty in dealing with the resulting data because they may be unorganized and require large storage systems. Big data was first defined by Douglas Laney based on the 3Vs, viz. volume, velocity, and variety, a definition widely cited since 2001, although many have tried to increase the number of Vs to 4, 5, 6, and even 11. Big data has also been defined from an application perspective, emphasizing its different applications based on the different types of data; for instance, Barry Devlin defined it as the application of process-mediated data, human-sourced information, and machine-generated data. Shaun Connolly focused on analyzing transactions, interactions, and observations of data, seeking insights using big data technology; this type of definition is oriented toward new technological developments such as MapReduce, bulk synchronous parallel (BSP) computing such as Hama, resilient distributed datasets (RDDs) such as Spark, and the Lambda architecture such as Flink (Buyya et al. 2016) (Fig. 1).

Fig. 1 Big data

3.2 Big data analysis stages

Data analysis is the process of inspecting, transforming, and modeling data with the goal of discovering useful information (Al-Janabi & Alkaim 2019), informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names while being used in different business, science, and social science domains. In today’s business, data analysis plays a role in making decision-making more scientific and helping businesses operate effectively (Buyya et al. 2016) (Fig. 2).

Fig. 2 Big data analysis stages

3.3 Deep learning

Deep learning is a new area of machine learning which has gained popularity in the recent past. Deep learning refers to architectures that contain multiple hidden layers (deep networks) to learn different features at multiple levels of abstraction. Deep learning algorithms seek to exploit the unknown structure in the input distribution to discover useful representations, often at multiple levels, where higher-level learned features are defined in terms of lower-level features (Ali et al. 2019).

3.4 Prediction

Prediction can be defined as the task of data analysis to predict unknown values of target features. It includes a classification task for class label prediction and numerical prediction whose aim is to predict continuous or ordered values. The type of target attribute specifies whether the problem is classification with binary values or numerical prediction with continuous values. Many statistical methodologies have been used for such numerical prediction, among which regression analysis is most often applied (Basavaraju et al. 2019) (Fig. 3).

Fig. 3 Main types of machine learning techniques (Al-Janabi & Mahdi 2019)

3.5 Air pollution

Air pollution remains a serious concern and has attracted attention from industries, governments, as well as the scientific community. One type of air pollutant that has attracted immense attention is fine particulate matter. PM2.5 is a widespread air pollutant, consisting of a mixture of solid and liquid particles suspended in the air, in addition to PM10 and O3 as other types of air pollution. Air pollution is a global issue that transcends geographical boundaries and calls for an interdisciplinary approach to solve a global problem. Thus, forecasting concentrations of air pollutants is an effective method for protecting public health by providing early warnings of harmful air pollutants (Liu et al. 2019).

4 Building the DLSTM-PSO model

In this section, an effective prediction model is built in four stages, as outlined in the sketch below. The first stage involves dataset preprocessing, including data collection, splitting, handling of missing values, and normalization of the dataset. In the second stage, the PSO algorithm is applied to identify the best structure for the LSTM network, including determination of the best weights, biases, number of hidden layers, number of nodes in each hidden layer, and activation function. In the third stage, the prediction model (called DLSTM-PSO) is built to predict the concentrations of the six pollutants considered. The final stage is the evaluation of the results based on the symmetric mean absolute percentage error (SMAPE) and 10-fold cross-validation.
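A high-level outline of how these four stages could be chained is sketched below; the function names are hypothetical placeholders for the stages just described, not the authors' actual implementation.

```python
# Hypothetical outline of the four DLSTM-PSO stages described above.
# All function names are illustrative placeholders.

def preprocess(raw_records):
    """Stage 1: split by station, drop missing values, normalize to [0, 1]."""
    ...

def dsn_ps_search(train_data):
    """Stage 2: PSO search for weights, biases, layer count, nodes, activation."""
    ...

def build_dlstm(structure):
    """Stage 3: build the LSTM predictor from the structure found by PSO."""
    ...

def evaluate_smape(model, test_data):
    """Stage 4: score the 48-h predictions with SMAPE and cross-validation."""
    ...

def run_pipeline(raw_records):
    train, test = preprocess(raw_records)
    structure = dsn_ps_search(train)
    model = build_dlstm(structure)
    return evaluate_smape(model, test)
```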

First, the main details of the model are presented, being based on the following assumptions:

  • The air quality data file contains the concentrations of several major air pollutants: PM2.5 (μg/m3), PM10 (μg/m3), and NO2 (μg/m3); in addition, the concentrations of CO (mg/m3), O3 (μg/m3), and SO2 (μg/m3) from Beijing in 2018 are provided.

  • All points with NA or negative values in the PM2.5, PM10, or O3 data are considered to be invalid and are dropped from the truth file. For example, the record [2957631, CT3, 2018-04-13 00:00:00, 24.0, , 15.7, ] has no PM2.5 value and thus is dropped from the scoring matrix, even though it includes PM10 data (a minimal check implementing this rule is sketched after Fig. 4).

  • The PM2.5 limit is taken as 10 μg/m3 (average allowable value per year) or 25 μg/m3 (average allowable value in 24 h).

  • The PM10 limit is taken as 20 μg/m3 (average allowable value per year) or 50 μg/m3 (average allowable value in 24 h).

  • The O3 limit is taken as 100 μg/m3 (average allowable value in 8 h). The recommended maximum value, previously set at 120 μg/m3 in 8 h, has been reduced to 100 μg/m3 based on recent findings of the relationships between daily mortality and ozone levels in locations where the concentration of this substance is less than 120 μg/m3.

  • The NO2 limit is taken as 40 μg/m3 (average allowable value per year) or 200 μg/m3 (average allowable value per hour).

  • The SO2 limit is taken as 20 μg/m3 (average allowable value in 24 h) or 500 μg/m3 (average allowable value in 10 min) (Fig. 4).

    Fig. 4 Block diagram of the proposed DLSTM-RNN
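As a minimal illustration of the validity rule and the limits listed in the assumptions above, the sketch below drops records with missing or negative PM2.5, PM10, or O3 values and flags readings above the short-term limits; the column names and the pandas-based layout are assumptions for illustration, while the thresholds themselves come from the assumptions.

```python
import pandas as pd

# Short-term limits from the assumptions above (ug/m3; CO omitted, given in mg/m3).
LIMITS = {"PM2.5": 25, "PM10": 50, "O3": 100, "NO2": 200, "SO2": 500}

def drop_invalid(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows whose PM2.5, PM10, or O3 value is missing or negative."""
    key_cols = ["PM2.5", "PM10", "O3"]
    mask = df[key_cols].notna().all(axis=1) & (df[key_cols] >= 0).all(axis=1)
    return df[mask]

def exceedances(df: pd.DataFrame) -> pd.DataFrame:
    """Boolean frame marking hourly readings above the short-term limits."""
    return pd.concat({p: df[p] > lim for p, lim in LIMITS.items()}, axis=1)
```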


4.1 Data preprocessing stage

As explained above, datasets were collected from two types of resources (i.e., from the directory of websites such as the KDD Cup 2018 dataset, or by building stations to capture concentrations). These datasets must be handled before using them to build predictive models. The dataset for each station was split and saved in a separate file containing the name of the station. Then, missing values were treated by dropping each row in which one or more values were lacking. Finally, normalization was applied to each column of the dataset related to each station, using the MinMaxScaler process on all the dataset (PM2.5, PM10, NO2, CO, O3, and SO2) to make the concentration values lie in the range [0, 1]. The main steps in this stage are described in Algorithm 2.
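A minimal sketch of this preprocessing stage is given below; it assumes the raw file is a CSV with a station_id column and the six pollutant columns (an assumption about the KDD Cup 2018 layout), and the exact steps of Algorithm 2 may differ.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

POLLUTANTS = ["PM2.5", "PM10", "NO2", "CO", "O3", "SO2"]

def preprocess(path: str) -> dict:
    """Split by station, drop rows with missing values, scale each column to [0, 1]."""
    raw = pd.read_csv(path)
    stations = {}
    for name, group in raw.groupby("station_id"):
        group = group.dropna(subset=POLLUTANTS).copy()     # drop rows lacking any value
        scaler = MinMaxScaler(feature_range=(0, 1))
        group[POLLUTANTS] = scaler.fit_transform(group[POLLUTANTS])
        group.to_csv(f"{name}.csv", index=False)            # one file per station
        stations[name] = group
    return stations
```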


4.2 Determined structure network-particle swarm (DSN-PS)

In this step, the structure of the LSTM is specified by PSO. During the PSO, the following three steps are repeated until the maximum epoch limit is reached or one of the stop conditions is met:

  • Calculate the value of the fit for each element among the particles;

  • Update the pBest appropriate values for each particle, as well as the best gBest general value;

  • Update the speed and position of each particle.

The aim of the PSO algorithm is to optimize the LSTM-RNN by specifying the optimal values of the weight, bias, number of hidden layers, number of nodes in each hidden layer, and activation function, as shown in the diagram in Fig. 5 and presented in Tables 2 and 3.
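The sketch below shows one way such a PSO loop could search over the number of hidden layers, nodes per layer, and activation function; the particle encoding and constants are illustrative assumptions rather than the authors' DSN-PS algorithm, and the fitness function is a stand-in for training an LSTM with the candidate structure and returning its validation error.

```python
import random

ACTIVATIONS = ["tanh", "sigmoid", "relu"]   # candidate activation functions (assumed set)

def fitness(layers, nodes, activation):
    """Stand-in: train an LSTM with this structure and return its validation error."""
    return random.random()  # replace with the real training/validation error

def dsn_ps(n_particles=20, epochs=50, w=0.7, c1=1.5, c2=1.5):
    # Each particle encodes (number of hidden layers, nodes per layer, activation index).
    pos = [[random.uniform(1, 5), random.uniform(8, 128), random.uniform(0, 2)]
           for _ in range(n_particles)]
    vel = [[0.0, 0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [float("inf")] * n_particles
    gbest, gbest_fit = None, float("inf")

    for _ in range(epochs):
        for i, p in enumerate(pos):
            layers = max(1, int(round(p[0])))
            nodes = max(1, int(round(p[1])))
            act = ACTIVATIONS[int(round(p[2])) % len(ACTIVATIONS)]
            f = fitness(layers, nodes, act)              # 1) evaluate fitness
            if f < pbest_fit[i]:                         # 2) update pBest and gBest
                pbest_fit[i], pbest[i] = f, p[:]
            if f < gbest_fit:
                gbest_fit, gbest = f, p[:]
        for i in range(n_particles):                     # 3) update velocity and position
            for d in range(3):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
    return gbest, gbest_fit
```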

Fig. 5 Determination of the optimal parameters of the LSTM using PSO

Table 2 Hyperbolic Functions with description (Al-Janabi and Alwan 2017)
Table 3 Polynomial Functions with description (Al-Janabi and Alwan 2017)

4.3 Development of the long short-term memory (DLSTM) approach

The common LSTM module consists of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. LSTM networks are well suited to classifying, processing, and making predictions based on time-series data, where there may be unknown delays between important events in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over traditional RNNs, hidden Markov models, and other sequence learning methods in many applications (Inácio et al. 2019).

An LSTM module (or cell) has five essential components, which allow it to model both long- and short-term data:

  • Memory cell: This represents the internal memory of the cell, which stores both short- and long-term memories.

  • Hidden state: This is the output state information calculated from the current input, previous hidden state, and current cell input, eventually being used to predict the future concentrations. Additionally, the hidden state can decide to retrieve only the short- or long-term or both types of memory stored in the cell state to make the next prediction.

  • Input gate: Decides how much information from the current input flows to the cell state.

  • Forget gate: Decides how much information from the current input and the previous cell state flows into the current cell state.

  • Output gate: Decides how much information from the current cell state flows into the hidden state.

4.3.1 The variables in LSTM–RNN

This algorithm requires multiple variables to be set at the beginning, then these are updated by applying computational operations, as shown below:

  • Step 1: The forward components

    Step 1.1: Compute the gates:

    Input activation:

    $$ a_{t} = \tanh \left( W_{a} X_{t} + U_{a} \text{out}_{t-1} + b_{a} \right) $$
    (1)

    Input gate:

    $$ i_{t} = \sigma \left( W_{i} X_{t} + U_{i} \text{out}_{t-1} + b_{i} \right) $$
    (2)

    Forget gate:

    $$ f_{t} = \sigma \left( W_{f} X_{t} + U_{f} \text{out}_{t-1} + b_{f} \right) $$
    (3)

    Output gate:

    $$ o_{t} = \sigma \left( W_{o} X_{t} + U_{o} \text{out}_{t-1} + b_{o} \right) $$
    (4)

    Then find:

    Internal state:

    $$ \text{state}_{t} = a_{t} \odot i_{t} + f_{t} \odot \text{state}_{t-1} $$
    (5)

    Output:

    $$ \text{out}_{t} = \tanh \left( \text{state}_{t} \right) \odot o_{t} $$
    (6)

    where

    $$ \text{gates}_{t} = \begin{bmatrix} a_{t} \\ i_{t} \\ f_{t} \\ o_{t} \end{bmatrix}, \quad W = \begin{bmatrix} W_{a} \\ W_{i} \\ W_{f} \\ W_{o} \end{bmatrix}, \quad U = \begin{bmatrix} U_{a} \\ U_{i} \\ U_{f} \\ U_{o} \end{bmatrix}, \quad b = \begin{bmatrix} b_{a} \\ b_{i} \\ b_{f} \\ b_{o} \end{bmatrix} $$
  • Step 2: The backward components:

    Step 2.1: Find \( \Delta_{t} \), the output difference as computed by any subsequent layer, and \( \Delta \text{out}_{t} \), the output difference as computed by the next time step. Then:

    $$ \delta \text{out}_{t} = \Delta_{t} + \Delta \text{out}_{t} $$
    (7)
    $$ \delta \text{state}_{t} = \delta \text{out}_{t} \odot o_{t} \odot \left( 1 - \tanh^{2} \left( \text{state}_{t} \right) \right) + \delta \text{state}_{t+1} \odot f_{t+1} $$
    (8)

    Step 2.2: This gives

    $$ \delta a_{t} = \delta \text{state}_{t} \odot i_{t} \odot \left( 1 - a_{t}^{2} \right) $$
    (9)
    $$ \delta i_{t} = \delta \text{state}_{t} \odot a_{t} \odot i_{t} \odot \left( 1 - i_{t} \right) $$
    (10)
    $$ \delta f_{t} = \delta \text{state}_{t} \odot \text{state}_{t-1} \odot f_{t} \odot \left( 1 - f_{t} \right) $$
    (11)
    $$ \delta o_{t} = \delta \text{out}_{t} \odot \tanh \left( \text{state}_{t} \right) \odot o_{t} \odot \left( 1 - o_{t} \right) $$
    (12)
    $$ \delta x_{t} = W^{T} \cdot \delta \text{gates}_{t} $$
    (13)
    $$ \delta \text{out}_{t-1} = U^{T} \cdot \delta \text{gates}_{t} $$
    (14)
  • Step 3: Update the internal parameters:

    $$ \delta W = \sum\limits_{t = 0}^{T} \delta \text{gates}_{t} \otimes x_{t} $$
    (15)
    $$ \delta U = \sum\limits_{t = 0}^{T} \delta \text{gates}_{t+1} \otimes \text{out}_{t} $$
    (16)
    $$ \delta b = \sum\limits_{t = 0}^{T} \delta \text{gates}_{t+1} $$
    (17)
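To make Eqs. (1)–(6) concrete, the following NumPy sketch implements a single forward step of the cell; the dictionary-based weight layout (keys a, i, f, o, shapes (h, d) for W, (h, h) for U, and (h,) for b) is an assumption made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, out_prev, state_prev, W, U, b):
    """One forward step of the LSTM cell, following Eqs. (1)-(6).

    W: dict of input weights {a, i, f, o}, each of shape (h, d)
    U: dict of recurrent weights, each of shape (h, h)
    b: dict of biases, each of shape (h,)
    """
    a_t = np.tanh(W["a"] @ x_t + U["a"] @ out_prev + b["a"])   # Eq. (1) input activation
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ out_prev + b["i"])   # Eq. (2) input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ out_prev + b["f"])   # Eq. (3) forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ out_prev + b["o"])   # Eq. (4) output gate
    state_t = a_t * i_t + f_t * state_prev                     # Eq. (5) internal state
    out_t = np.tanh(state_t) * o_t                             # Eq. (6) output
    return out_t, state_t
```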

5 Experiment

This section presents the results of each stage in the prediction model. Also, a justification is presented for all the results.

5.1 Dataset used

Data from the KDD Cup 2018 dataset* are used, containing the names of 35 stations and the hourly concentration of each pollutant, viz. PM2.5, PM10, SO2, CO, NO2, and O3. Table 4 presents the raw dataset.

Table 4 Dataset before handling missing values

The dataset is split by station and saved in separate files containing the name of each station. Each station file thus contains the six types of concentration and the number of records presented in Table 5.

Table 5 Split stations

Missing values in each column are then handled as illustrated in Table 6.

Table 6 Dataset after handling missing values

Table 7 presents a description of the results after handling missing values.

Table 7 Description of data after preprocessing

5.2 Data visualization

Figure 6 illustrates the resulting data, which contains various patterns occurring over time.

Fig. 6 Data visualization

This graph is already informative. The specific reason for choosing these data is that the graph presents a wide range of different behaviors in the concentrations of the air pollutants over time, which makes the learning process more robust and provides the opportunity to test the quality of the predictions in a variety of situations.

Another feature to notice is that the values at the beginning of 2017 are much higher and fluctuate more than the values close to the end of the dataset. Therefore, one must ensure that the data exhibit similar value ranges throughout the time frame, which will be considered during the data normalization phase.

5.3 Normalizing the data

Before normalizing, the dataset is split into a training set and a test set, using 70% for training and 30% for testing.

A scaler must now be defined to normalize the data. MinMaxScaler scales all the data to the range [0, 1]. One can also reshape the training and test data to have the shape [data_size, num_features].

Due to the observation above that different time periods of the data have different value ranges, one should normalize the data after splitting the full series into windows. Otherwise, the earlier data will be close to 0 and will not add much value to the learning process. Here, a window size of 2500 is chosen.

The data can now be smoothed using an exponential moving average, which helps to remove the inherent raggedness of the concentration data and produce a smoother curve. Note that only the training data should be smoothed in this way.
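A sketch of this windowed normalization and smoothing is shown below, assuming the training series is a one-dimensional NumPy array; the window size of 2500 follows the text, while the smoothing factor gamma is an illustrative choice.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def normalize_in_windows(train: np.ndarray, window: int = 2500) -> np.ndarray:
    """Fit MinMaxScaler separately on each window so early, high-valued periods
    are not squashed toward zero by later, lower-valued ones."""
    scaler = MinMaxScaler(feature_range=(0, 1))
    out = np.empty_like(train, dtype=float)
    for start in range(0, len(train), window):
        chunk = train[start:start + window].reshape(-1, 1)
        out[start:start + len(chunk)] = scaler.fit_transform(chunk).ravel()
    return out

def ema_smooth(train: np.ndarray, gamma: float = 0.1) -> np.ndarray:
    """Exponential moving average to remove raggedness (training data only)."""
    smoothed = np.empty_like(train, dtype=float)
    ema = 0.0
    for i, value in enumerate(train):
        ema = gamma * value + (1.0 - gamma) * ema
        smoothed[i] = ema
    return smoothed
```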

5.4 Data generator

A data generator is first implemented to train our model, including a method called unroll_batches(…) that will output a set of num_unrollings batches of input data obtained sequentially, where each batch of data is of size [batch_size, 1]. Then, each batch of input data will have a corresponding output batch of data (Tables 8, 9).

Table 8 Dataset after normalization and splitting for training and testing
Table 9 Data generator to train our model

For example, if num_unrollings = 3 and batch_size = 6, a set of unrolled batches might look like (a minimal generator sketch follows this example):

  • input data: [x0, x10, x20, x30, x40, x50], [x1, x11, x21, x31, x41, x51], [x2, x12, x22, x32, x42, x52]

  • output data: [x1, x11, x21, x31, x41, x51], [x2, x12, x22, x32, x42, x52], [x3, x13, x23, x33, x43, x53]
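A minimal version of such a generator is sketched below; the unroll_batches(…) name follows the text, but the cursor-based indexing (one cursor per batch element, spaced one segment apart and advanced by one step per unrolling) is an assumption about how the batches in the example could be produced.

```python
import numpy as np

class DataGenerator:
    """Yields num_unrollings consecutive (input, output) batches of shape [batch_size, 1]."""

    def __init__(self, data, batch_size, num_unrollings):
        self.data = np.asarray(data, dtype=float)
        self.batch_size = batch_size
        self.num_unrollings = num_unrollings
        self.segment = len(self.data) // batch_size
        # One cursor per batch element, spaced one segment apart (x0, x10, x20, ... style).
        self.cursors = [i * self.segment for i in range(batch_size)]

    def next_batch(self):
        inputs = np.array([self.data[c] for c in self.cursors]).reshape(-1, 1)
        outputs = np.array([self.data[c + 1] for c in self.cursors]).reshape(-1, 1)
        self.cursors = [(c + 1) % (len(self.data) - 1) for c in self.cursors]
        return inputs, outputs

    def unroll_batches(self):
        return [self.next_batch() for _ in range(self.num_unrollings)]
```

For instance, with batch_size = 6 and 60 data points, the cursors start at x0, x10, …, x50, reproducing the unrolled batches shown above.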

Then, one finds the best weights of the input between the hidden layers as illustrated in Tables 10, 11, and 12. Meanwhile, the weights of the recurrent connections are illustrated in Tables 13 and 14.

Table 10 Weights of the input between input and first hidden layers
Table 11 Optimal weights of input between first and second layer
Table 12 Optimal weights of input between second and third layer
Table 13 Weights of input between third and output layer
Table 14 Weight of recurrent connections

After constructing the LSTM model using the DSN-PS algorithm, the model consists of several layers capable of predicting the concentrations of the air pollutants. We have 32 stations × 6 pollutants (PM2.5, PM10, SO2, CO, NO2, and O3), resulting in 192 readings per hour, 4608 per day, and 138,240 within the 30 days of the training process of the network. After the training, the DLSTM-RNN can predict air pollution concentrations over the next 48 h based on the previous training. The SMAPE error measure is then used to evaluate the results of the DLSTM network, seeking the smallest error (Table 15).
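Since SMAPE is the evaluation measure used here, a small implementation is sketched below; the exact variant used in the paper is not spelled out, so the common form with (|y| + |ŷ|)/2 in the denominator is assumed.

```python
import numpy as np

def smape(actual, predicted) -> float:
    """Symmetric mean absolute percentage error, in percent.

    Assumes the common form: mean of |y - y_hat| / ((|y| + |y_hat|) / 2);
    pairs where both values are zero contribute zero error.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    safe_denom = np.where(denom == 0, 1.0, denom)
    ratio = np.where(denom == 0, 0.0, np.abs(actual - predicted) / safe_denom)
    return float(100.0 * np.mean(ratio))
```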

Table 15 DLSTM-PS and SMAPE results

The combination of LSTM and PSO reduces the training time for the network, because the PSO algorithm provides the best activation function and identifies the number of hidden layers and the number of nodes in each hidden layer, as well as better weights; at the same time, however, it complicates the network, for the reason described above.

6 Discussion and conclusions

Air quality index datasets represent huge data, requiring intelligent and deep computation for the extraction of useful patterns. Despite the advantage of their large size, the limitations of such datasets include the possibility of missing values, the fact that each concentration may fluctuate between high and low values, and the fact that the number of records may differ between stations.

DSN-PS is used to determine the parameters and activation function of the DLSTM, offering the advantage of reduced execution time for the LSTM; its limitation is that it increases the complexity of the LSTM.

The DLSTM is constructed using the LSTM structure determined by DSN-PS, where PSO is used to determine the optimal number of hidden layers, number of nodes in each hidden layer, weights, biases, and activation function. The advantage of the DLSTM is its ability to deal with huge data and its use of memory cells to save information over the long term, while its limitation is the huge number of parameters.

Evaluation is the process of calculating the error between actual and predicted values, which can be achieved using different types of error measure, including prediction measures (i.e., SMAPE, MSE, RMSE, MAE, MAPE, etc.) and confusion-matrix-based measures (i.e., accuracy, F-measure, FP rate, etc.).

  • How can particle swarm optimization be useful in building a recurrent neural network (RNN)?

PSO gradually modifies the behavior of each particle in a particular environment, depending on the behavior of its neighbors, until the optimal solution is obtained. Neural networks, on the other hand, use trial and error to select their basic parameters, modifying them gradually to reach acceptable values for those parameters.

Based on these two principles, the PSO algorithm is used herein to find the optimal parameters and the activation function of the neural network.

  • How can a multilayer model be built by combining two technologies (LSTM-RNN with particle swarm optimization)?

By building a new predictor called SAQPM that combines DSN-PS and DLSTM, where DSN-PS is used to find the best structure and parameters for the LSTM, while the DLSTM is used to predict the concentrations of air pollutants.

  • Is the SMAPE measure enough to evaluate the results of the suggested predictor?

Yes. SMAPE is sufficient to evaluate the results of the predictor for the next 48 h.

  • What is the beneficial result of building a predictor by combining DSN-PS and DLSTM?

Combining DSN-PS and DLSTM reduces the execution time by predefining the network parameters, but at the same time increases the computational complexity.