Introduction

Because air pollution has been shown to be one of the significant threats to the health of society and the ecosystem, governments regard clean air as a basic requirement of health and well-being. Air pollution is one of the most important causes of serious illnesses that lead to early death, such as heart and lung diseases, stroke, and lung cancer (Scovronick 2015; European Environment Agency 2019; WHO 2019). It also harms natural ecosystems and biodiversity by degrading the quality of the water and soil that support the ecosystem (European Environment Agency 2019). Its negative economic impacts include a decrease in life expectancy, an increase in medical costs, and a decrease in productivity. People living in urban areas in particular need better living conditions in terms of environment, transportation, health, and so on. Authorities set air quality standards as an important component of environmental policies and risk management to protect public health and quality of life. Many countries have taken important steps towards developing smart cities (Santander, Barcelona, Singapore) (Sánchez et al. 2013; Kloeckl et al. 2012; Bakici et al. 2012). The European Commission offers EU-wide targets and policy objectives to reduce these impacts for the period from 2021 to 2030 and to enable the EU to implement its commitments under the Paris Agreement (European Commission 2017).

Recent works on air quality monitoring and prediction point out the requirements for smart city air quality applications (Park et al. 2017). Air pollution depends not only on the concentration of pollutants (Bashir Shaban et al. 2016; Zhang et al. 2019) but also on meteorological conditions such as temperature, wind direction, and wind strength (Soh et al. 2018; Chu et al. 2012; Tai et al. 2010). The information about air quality and meteorology gathered from various types of sensors must be combined in a way that gives a useful representation of real-world environments. In addition, changes in city conditions and people’s behavior over time degrade the performance of a model obtained with old data. Since each city’s data have a different marginal probability distribution, training and optimizing the model at regular intervals with current data are necessary for an up-to-date prediction system (Zanella et al. 2014).

This study aims to address these challenges by developing a CNN+LSTM (convolutional neural network + long short-term memory) neural network model for hourly air pollution prediction. The historical information from various locations in a city is processed to predict the concentration of particles in the same locations in the next hour. In addition, the proposed model is supported by the transfer learning method to obtain a deep learning structure with the optimal set of hyperparameter values for use in various other cities.

CNN extracts the relationship between different features with spatial dependency in sensor locations. LSTM extracts temporal information characteristics from time series data and it consolidates the nonlinear relationship between air pollutants. Various sensor data that are collected at different locations in a city are processed to predict not only the local pollution affecting a small area but also the propagation of pollution factors affecting a wider precinct over time.

Artificial intelligence and machine learning–based techniques consist of computational methods which improve the learning performance of the machine in solving nonlinear problems from complex data. Therefore, these techniques are commonly used for modeling, search, and optimization in various smart city systems (Soh et al. 2018; Khan et al. 2018; Chen et al. 2019; Qin et al. 2019; Tao et al. 2019; Ma et al. 2019). This has led to a tendency of using various machine learning methods in recent air quality studies (Soh et al. 2018; Qin et al. 2019; Tao et al. 2019; Ma et al. 2019; Qi et al. 2018). Deep learning is a multilayer neural network–based machine learning method; it learns data representations with multiple levels of abstraction from complex structures. Also, deep learning models can consist of more than one machine learning algorithm. As for another advantage of the model, CNN and LSTM work based on parameter sharing to reduce the number and the complexity of the parameters that the model has to learn.

Transfer learning focuses on extracting information from a data set and reapplying it in another data set that has a different distribution from the first one. It provides the transfer of the knowledge/model from a source domain to a target domain. In this study, the proposed deep neural network is supported by transfer learning to transfer weights/features to improve the prediction performance in different cities.

The European Innovation Partnership on Smart Cities and Communities, supported by the European Commission, aims to bring sustainable solutions to city problems in different fields such as energy, transportation, and communication (EU 2021). Especially in developing countries, the health and environmental conditions of big cities are at risk due to transportation and energy sources. The main aim for such cities is to reduce emissions in a certain region and control the air pollution that may become a threat. Intelligent air quality monitoring and prediction systems play a major role in building sustainable and clean cities. This kind of work raises awareness of a clean environment; residents of the city can learn about environmental information, services, and precautions instantly. Existing studies are generally limited to monitoring, and there is no application available that integrates large amounts of collected data into a consistent prediction model.

The data used by the proposed model include information about both pollutant concentrations and meteorological conditions. The advantage of artificial intelligence (AI) is its capability to extract relevant features from input data automatically and to learn the correlation between them. CNN is able to generalize and learn by reducing the size and complexity of data, while LSTM has a memory that connects previous information to the current input and generates output according to the sequence of inputs. Another advantage is that AI is capable of running different types of machine learning algorithms together to improve the learning performance of air pollution prediction. Thus, the model can process relevant spatial-temporal features simultaneously, where the spatial features are about the location of sensors and the temporal features are based on the hourly concentration of the pollutants. One of the important challenges for air pollution prediction studies is to design a model and obtain the optimal hyperparameters that give high prediction accuracy for many regions/cities. Transfer learning enables a model that successfully performs a specific task on the source domain to be used on target data for the same task or for a related task.

The remainder of the paper is organized as follows: the “Literature survey” section describes prior work in this area; the “Methodology” section explains the proposed air pollution prediction method; the “Implementation” section describes the implementation; the “Results” section presents the results of simulation studies investigating the performance; and, finally, the “Conclusion and discussion” section concludes the study and evaluates the results.

Literature survey

The studies in the literature consider the effects of spatial and temporal features on prediction performance and compare several machine learning algorithms (Bashir Shaban et al. 2016; Tao et al. 2019; Soh et al. 2018; Zhao et al. 2019; Djalalova et al. 2015; Zeinalnezhad et al. 2020; Schürholz et al. 2020; Qin et al. 2019; Ma et al. 2019; Aceves-Fernandez et al. 2020; Di Antonio et al. 2019; Zhou et al. 2018; Lv et al. 2019; Wei Y et al. 2016; Li et al. 2016; Zhang et al. 2021; Zhao et al. 2020). A summary of machine learning–based air quality studies in the literature is given in Table 1.

Table 1 Proposed architectures and their performances in literature

In Tao et al. (2019), the prediction model is designed with 1D-CNN (1-Dimensional CNN) and Bi-GRU (Bi-directional Gated Recurrent Units). They use the UCI machine learning repository database for PM2.5 data of Beijing. The model is trained and tested employing different combinations of various features. The best performance is obtained with inputs of pollution, temperature, wind speed, wind direction, and the dew point. Adding other features to the input increases the complexity and causes difficulty in learning. The study concludes that variants of RNN (recurrent neural networks) such as LSTM and GRU have better performance than RNN.

In Soh et al. (2018), a spatial-temporal deep neural network (ST-DNN) model is proposed to predict air quality. They use historical data including temporal information, the concentration of pollutants (PM2.5 and PM10), and meteorological characteristics as input; the output is the future PM2.5 concentration. Two real-world data sets (76 locations in 23 cities in Taiwan and locations from Beijing) are used in experiments. While the convolutional layer can extract the temporal delay factor from surrounding features by learning spatial information, the proposed model provides a long time frame prediction by LSTM.

In Zeinalnezhad et al. (2020), a regression technique by ANFIS modeling is applied to conduct a time series prediction for air pollution. Input contains the information of O3, SO2, CO, NO2 with 24-h time series length. The input layer takes five input variables which represent the contamination rate for consecutive 5 days. The output layer gives a single output variable which is the predicted contamination rate for the next day. According to the comparison between errors related to different pollutants, the ANFIS model designed for NO2 has the lowest error for short-term contamination prediction.

In Qin et al. (2019), an air quality prediction model is built by combining CNN and LSTM. The input features contain the concentration of pollutants and meteorological conditions; the output is the predicted sequence of PM2.5 for Shanghai. According to the comparison between the three models (CNN, LSTM, CNN+LSTM), the performance of the CNN+LSTM model is the best for long-term sequence prediction. CNN and RNN have the same RMSE value, whereas CNN+LSTM reduces RMSE by 41.45% compared to CNN and RNN.

In Di Antonio et al. (2019), the performance comparison is done between univariate and multivariate prediction for PM10 with LSTM. The PM10 concentration data set contains time series measurements from January 2009 to December 2017 in Pescara, Italy. An increase in the number of features causes a decrease in prediction performance because it increases the complexity of the network.

In Zhou et al. (2018), a multi-output LSTM network–based deep learning model built with three deep learning algorithms is proposed to make regional multi-step-ahead predictions (horizons t + 1 up to t + 4) for the variables PM2.5, PM10, and NOX in Taipei City, China, by using air quality data and meteorological data. The proposed Deep Multi-output LSTM model has a smaller RMSE (with improvement rates ranging from 2.18% to 13.91%) than the Selective Multimodal LSTM model.

In Li et al. (2016), a deep learning–based air quality method is proposed for hourly PM2.5 concentration. The data set is from 12 monitoring stations in Beijing between the 1st of January 2014 and the 28th of May 2015. A stacked autoencoder model is used to extract representative spatiotemporal features for air pollution data. The study shows that deep learning outperforms the traditional time series models; it is able to predict the air quality with seasonal stability at all monitoring stations.

In Zhang et al. (2021), the cutting-edge bidirectional long short-term memory neural network (BiLSTM) is used for time series air quality prediction. The real-world air quality data set consists of hourly PM2.5 concentrations collected from 12 observation points in Beijing, China, during 2013–2017. In terms of prediction error, the BiLSTM model outperforms the existing models.

In Zhao et al. (2020), forward neural networks are combined with recurrent neural networks to create a hybrid model (called CERL) in order to improve the prediction accuracy of air quality in two capital cities Xi’an and Lanzhou. Each sample includes the date, time, and 6 air pollutant factors (AQI, the concentration of PM2.5, PM10, CO, SO2, NO2, and O3). The performance of CERL model is compared with CFNN, ESN RNN, and Pyrenn in terms of RMSE. It is observed that the proposed model improves the 1-h step prediction performance for only PM2.5.

In Al-Janabi et al. (2019), an intelligent predictor called smart air quality prediction model (SAQPM) is proposed for the prediction of the air pollutant concentrations PM2.5, PM10, NO2, CO, O3, SO2 over the next 2 days. The model is based on LSTM. Particle swarm optimization algorithm is used to determine the optimal structure of the proposed method including weights, bias, number of hidden layers, number of nodes in each hidden layer, and activation function.

In Al-Janabi et al. (2021), a real time intelligent programmable forecaster system is designed which is capable of predicting the pollutant concentrations within the next 48 h. The proposed model consists of two parts. In the first part, air pollution data including PM2.5, PM10, NO2, CO, O3, SO2 are collected and stored. In the second part, concentrations of pollutants are predicted for the next 48 h using LSTM neural network.

Regarding studies using transfer learning for air quality, in Lv et al. (2019), the transfer learning model is proposed to transfer features from an urban area to a non-urban area. Three types of features (terrain, spatial, and temporal) are used as inputs. Real air quality data (Air Quality Index (AQI), PM10, PM2.5, SO2, NO2, CO, O3) and meteorology data (weather, temperature, wind speed, humidity) are collected hourly from three cities (Hangzhou, Ningbo, Wuxi) in China. As a result, adding terrain features in the learning process does not make any significant improvement in prediction for urban areas.

In Ma et al. (2019), a bidirectional long short-term memory (BLSTM) network and transfer learning are used to predict air quality, and their performance is compared with other commonly seen models. A case study is conducted in Guangdong, China; the hourly PM2.5 concentration data are collected from all the monitoring stations in the city for three years. Transfer learning improves the prediction accuracy of BLSTM at larger temporal resolutions by up to 40%.

Methodology

The first objective of this study is to predict the hourly concentration of air pollutants in multiple locations in a city effectively. The second objective is to transfer the model between cities that have different marginal probability distributions for air quality and meteorological data. For data preprocessing, all duplicate data are cleaned in data sets, and linear interpolation is applied to fill in missing values. Then the data are normalized to scale the values into a specific range to reduce the data redundancy/bias effect on learning, which is caused by the wide variation of the value range in the data (Eessaar 2016).
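The interpolation and normalization steps described above can be illustrated with a minimal sketch. It assumes that missing values are marked as `None`, that only gaps with a known value on both sides are filled, and that min-max scaling maps values into [0, 1]; the function names are hypothetical and not taken from the study's MATLAB implementation.

```python
def interpolate_linear(values):
    """Fill interior runs of None by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            # Point on the straight line joining the two known neighbours.
            filled[i] = values[left] + (values[right] - values[left]) * (i - left) / span
    return filled  # leading or trailing gaps have only one neighbour and stay None


def min_max_normalize(values):
    """Scale values into [0, 1] (assumed target range)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


series = [10.0, None, None, 40.0, 20.0]  # toy hourly readings, not real data
filled = interpolate_linear(series)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Scaling every variable to a common range keeps pollutants with large numeric ranges from dominating those with small ranges during training.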

The steps of the study can be summarized as follows.

  1. The data are preprocessed; bad data are removed, missing data are corrected by linear interpolation, normalization is applied.

  2. Data are separated for training, validation, and test. Three types of sample selection methods are identified.

  3. Time series data structure is created with the rolling-window method.

  4. Two types of input structure are defined for the neural network.

  5. Hybrid CNN+LSTM deep neural network is created.

  6. Hyperparameter tuning is applied according to validation and test performances.

  7. The results are compared with the previous works in the literature.

  8. A 1-hidden layer LSTM network is run on the same data sets and their performances are compared.

  9. Transfer learning method is applied for weight transfer between different cities.

The study can be split into three parts according to the input-output relationship (Table 2).

Table 2 Explanation of the methods according to the input-output type

Two types of input structure are defined for the neural network: 2D and 3D array; these inputs hold the spatial and temporal features together. 2D input consists of the historical information about a target pollutant at sensor locations; 3D input consists of both historical information about the target pollutant and other pollutants related to air quality. In this way, the model can use the interaction between different variables on pollutants in neighboring locations. The summary of data properties used for the methods is given in Table 3.

Table 3 Explanation of the data properties used for the methods

The first step of the machine learning model design is to determine the layers and their properties. Then, the optimal architecture and hyperparameters are determined by training-testing with different activation functions. Rectified Linear Unit (ReLU) and sigmoid activation functions are used to search for performance improvement. After the optimal model and hyperparameter values are determined, the neural network is transferred from the source city to the target city by changing the last learning layers, which is the fully connected layer in the proposed model.
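The transfer step described above (keep the learned layers, change the final fully connected layer) can be sketched as follows. This is only an illustration under stated assumptions: the model is represented as a plain list of dictionaries, the transferred layers are frozen, and the new output layer is randomly initialized. Whether the study froze the transferred layers or fine-tuned them is not stated, and all names and sizes here are hypothetical.

```python
import copy
import random


def new_fc_layer(n_in, n_out, seed=0):
    """Create a freshly initialized fully connected layer (toy initialization)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return {"name": "fc", "weights": weights, "bias": [0.0] * n_out, "trainable": True}


def transfer_model(source_layers, n_out_target, seed=1):
    """Copy every layer except the last one, then attach a new output layer."""
    transferred = copy.deepcopy(source_layers[:-1])
    for layer in transferred:
        layer["trainable"] = False  # assumption: transferred weights are frozen
    n_in = len(source_layers[-1]["weights"][0])  # input width of the old output layer
    transferred.append(new_fc_layer(n_in, n_out_target, seed))
    return transferred


# Toy source network; the weights are placeholders, not trained values.
source_model = [
    {"name": "conv", "weights": [[0.2, -0.1], [0.4, 0.3]], "bias": [0.0], "trainable": True},
    {"name": "lstm", "weights": [[0.5, 0.1, -0.2]], "bias": [0.0], "trainable": True},
    new_fc_layer(n_in=4, n_out=7, seed=0),  # e.g. one output per source-city location
]
target_model = transfer_model(source_model, n_out_target=3)
print([(layer["name"], layer["trainable"]) for layer in target_model])
```

Replacing only the output layer lets the target city have a different number of outputs while reusing the feature extractors learned on the source city.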

Data

Barcelona Air Quality Data: Data for air quality in Barcelona is taken from The Open Data BCN that is a service of the Municipal Data Office (Barcelona City Council 2020). Real-time hourly measurement is made for three pollutants (O3 (tropospheric ozone), NO2 (Nitrogen dioxide), PM10 (Suspended particles)) by the stations throughout Catalonia from 06/13/2018 to 01/31/2019. Data are generated by those three sensors deployed in 7 different locations (Ciutadella, Eixample, Gracia, Palau Reial, Poblenou, Sants, Vall Hebron) where the distance between sensor locations in Barcelona is not more than 5 km.

Kocaeli and İstanbul Air Quality Data: Data for air quality and meteorology are provided by Republic of Turkey Ministry of Environment and Urbanization (Republic of Turkey Ministry of Environment and Urbanization 2019). The data contain the information about the concentration of various pollutants (Sulfur Dioxide (SO2), Nitrogen Oxides (NOX), Tropospheric Ozone (O3), Particulate Matters (PM), etc.) and meteorological conditions (temperature, relative humidity, air pressure, wind speed, and direction) for Kocaeli and İstanbul which are two large cities in Turkey. The hourly measurement data for Kocaeli are dated between 11/14/2017 17.00 and 4/11/2020 23.00 (total of 21,103 h) and for İstanbul data are dated between 01/01/2015 00.00 and 04/11/2020 23.00 (total of 46,272 h).

Sensors are deployed at 14 locations in Kocaeli and at 43 locations in İstanbul. Three locations are chosen for each city (Alikahya, Gebze, and Körfez for Kocaeli; Silivri, Esenyurt, and Sultangazi for İstanbul) in this study. The amount of missing data and the availability of the information for a variable were taken into consideration while selecting these locations. In Kocaeli, the distances between locations are 25 km and 18 km for Gebze-Körfez and Gebze-Alikahya, respectively. In İstanbul, the distances between locations are 37 km and 17 km for Silivri-Esenyurt and Esenyurt-Sultangazi, respectively.

Time series data set preparation

Rolling-window

The “rolling-window” method (Zivot and Wang 2003) is used to create time series input for the neural networks. One sample for time step t0 is created by using the values in [t0 − d, t0) as the features of the target value at time step t0. Figure 1 displays how the rolling-window method works; d and s represent the frame size and the step size, respectively. The values of the past d time record points are taken as features, and the value at time step t0 is taken as the target value. Then, the frame is slid s steps for the next sample.
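A minimal sketch of the rolling-window construction, assuming the series is a plain Python list and using the frame size d and step size s as defined in the text; the function name is hypothetical.

```python
def rolling_window(series, d, s=1):
    """Return (features, target) pairs: features are the d values before t0,
    the target is the value at t0, and the frame slides s steps each time."""
    samples = []
    t0 = d  # first index that has d past values available
    while t0 < len(series):
        samples.append((series[t0 - d:t0], series[t0]))
        t0 += s
    return samples


hourly = [1, 2, 3, 4, 5, 6]  # toy values, not real measurements
for features, target in rolling_window(hourly, d=3, s=1):
    print(features, "->", target)
```

With d = 3 and s = 1, a series of six values yields three samples; a larger s produces fewer, less overlapping samples.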

Fig. 1 Rolling window method for the time series data

Input structure of the CNN

Time series sequences created by the rolling-window method are combined to construct a 2D sample (Fig. 2a) as the input for the convolutional neural network. The 2D array in Fig. 2a is input to the model with Method-1 (UNI/UNI) to obtain an output of size m × 1 holding the predicted values of the pollutant concentration in each location.

Fig. 2 Input structures for proposed CNN+LSTM neural network

The input is rearranged (Fig. 2b) to take advantage of CNN’s ability to process 3D structures. m, n, and k represent the number of locations, the number of time steps, and the number of pollutants, respectively. The 3D input is used by Method-2 (MULTI/UNI) for the prediction of one pollutant and by Method-3 (MULTI/MULTI) for the prediction of multiple pollutants.
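The two input structures can be sketched with nested lists. The axis order used here (location, then time step, then variable) is an assumption made for illustration, since the exact layout of Fig. 2 is not reproduced in the text; the function names and data are hypothetical.

```python
def build_2d(series_by_location, t0, n):
    """m x n sample: one row per location, holding the n time steps before t0."""
    return [loc_series[t0 - n:t0] for loc_series in series_by_location]


def build_3d(series_by_location_and_variable, t0, n):
    """m x n x k sample: sample[i][j][c] is variable c at location i, time step j."""
    sample = []
    for location in series_by_location_and_variable:  # location: list of k series
        rows = []
        for j in range(t0 - n, t0):
            rows.append([variable_series[j] for variable_series in location])
        sample.append(rows)
    return sample


# Toy data: m = 2 locations, k = 2 variables, 5 hourly values each.
data = [
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
    [[6, 7, 8, 9, 10], [60, 70, 80, 90, 100]],
]
target_only = [location[0] for location in data]  # first variable as target pollutant
sample_2d = build_2d(target_only, t0=3, n=3)
sample_3d = build_3d(data, t0=3, n=3)
print(sample_2d)
print(sample_3d)
```

The 2D sample corresponds to the single-pollutant input of Method-1, and the 3D sample adds the variable axis used by Method-2 and Method-3.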

Fig. 3 CNN+LSTM deep learning structure

Figure 3 illustrates an example of the CNN+LSTM deep neural network where the feature maps obtained by CNN are input to the LSTM.

Convolutional neural network (CNN)

Convolutional neural networks were initially used to process image data. A deep convolutional neural network structure is a multilayer perceptron network with more than one hidden layer. A typical CNN consists of convolutional layers and pooling layers. The convolutional layer is a set of artificial neurons that represent convolutional filters to generate feature maps. The input is split into small blocks, each of which is convolved with a specific set of weights. Different sets of features are obtained by sliding convolutional filters over the input with the same weights.
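The weight sharing described above can be made concrete with a small sketch of a single filter slid over a 2D input. It assumes no padding and a stride of 1, and it computes the cross-correlation form that deep learning libraries usually call convolution; the function name is hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide one filter over the input, reusing the same weights at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # toy 3 x 4 input
kernel = [[1, 0], [0, 1]]                              # toy 2 x 2 filter
print(conv2d_valid(image, kernel))
```

A 2 × 2 filter has only four weights regardless of the input size, which is the parameter-sharing property that keeps the number of learned parameters small.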

The pooling layer is employed to reduce the spatial size of the input representation and the number of parameters. Similar information in the local region is identified and the dominant response is output.
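As a rough illustration of how pooling shrinks a feature map, the sketch below applies max and average pooling over non-overlapping windows. The window size and the choice of stride equal to the window size are assumptions; the function name is hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size window to a single value."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled


fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # toy values
print(pool2d(fmap, mode="max"))
print(pool2d(fmap, mode="avg"))
```

With 2 × 2 windows, each pooled map holds a quarter of the original values.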

The activation function is a nonlinear function used for learning complicated nonlinear patterns. In this study, the Rectified Linear Unit (ReLU) and sigmoid are used in learning layers (Nwankpa et al. 2018). The fully connected learning layer is implemented at the end of the neural network; it combines the features by globally analyzing the outputs of previous layers.

Long short-term memory

Recurrent neural networks (RNNs) combine previous information with the current information to generate an output that depends on the input sequence, so the network can produce different outputs for the same input. Long short-term memory (LSTM) is a special type of RNN that is used to capture the long-term dependencies in sequence data. There are four interacting layers (called “gates”) in the hidden state of LSTM (Fig. 4), connected in a special way. Information received from outside is stored in, written to, and read from the memory cell via the gates.

Fig. 4 The structure of the long short-term memory (LSTM) neural network

The memory unit controls the flow of information to decide the influence of previous information on the output; a copy of the predictions is held by this unit. The weights and the information in memory are multiplied to decide which data, and how much of it, will be used; some of it is then added back to the prediction. A few predictions are selected as the prediction for that moment, and the information that is not immediately relevant is set apart to block its effect on subsequent predictions.
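The gate mechanism can be sketched with the standard LSTM update for a single time step. For readability, the sketch uses scalar input, hidden state, and cell state, so every weight is a single number; the parameter names are hypothetical, and the values are arbitrary rather than trained.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar state (standard formulation)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c


keys = ["wf", "uf", "bf", "wi", "ui", "bi", "wg", "ug", "bg", "wo", "uo", "bo"]
params = {k: 0.5 for k in keys}  # arbitrary illustrative weights

# Feeding the same input twice gives different outputs because the state changed.
h1, c1 = lstm_step(1.0, 0.0, 0.0, params)
h2, c2 = lstm_step(1.0, h1, c1, params)
print(h1, h2)
```

The forget gate scales the previous memory, the input gate scales the new candidate, and the output gate decides how much of the updated memory becomes the prediction for that moment.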

Transfer learning

The representation of transfer learning is given in Fig. 5. Transfer learning is classified into three categories (Pan and Yang 2010): inductive transfer learning, unsupervised transfer learning, and transductive transfer learning. The source task and target task in inductive transfer learning are different. Unsupervised transfer learning is similar to inductive transfer learning, but it focuses on unsupervised learning tasks in the target domain. Source and target tasks in transductive transfer learning are the same; the target domain is different from but related to the source domain.

Fig. 5 Representation of transfer learning between two neural network models

The aim of transfer learning is to improve the learning of the target prediction function, fT(·), in the target domain, DT, using the knowledge in the source domain, DS, and the source task, TS, where DS ≠ DT or TS ≠ TT, with the target task given as TT.

There are two cases in transfer learning according to the relation between the source domain and the target domain. Firstly, the feature space of the source domain can be different from that of the target domain (χS ≠ χT). Secondly, the feature spaces of the source and target domains can be the same while the marginal probability distributions of the input data are different (χS = χT, PS(X) ≠ PT(X)). It is called homogeneous transfer learning when χS = χT, and heterogeneous transfer learning when χS ≠ χT (Weiss et al. 2016).
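For reference, the notation used above can be written out compactly. The block below restates the usual definitions from Pan and Yang (2010), in which a domain is a feature space together with a marginal distribution and a task is a label space together with a prediction function; the calligraphic symbols are a typesetting choice made here.

```latex
\begin{align}
  \mathcal{D} &= \{\mathcal{X},\, P(X)\}, &
  \mathcal{T} &= \{\mathcal{Y},\, f(\cdot)\}, \\
  \mathcal{D}_S \neq \mathcal{D}_T
    &\iff \mathcal{X}_S \neq \mathcal{X}_T \ \text{or}\ P_S(X) \neq P_T(X), \\
  \text{homogeneous:}\quad & \mathcal{X}_S = \mathcal{X}_T, &
  \text{heterogeneous:}\quad & \mathcal{X}_S \neq \mathcal{X}_T .
\end{align}
```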

Implementation

In the initial step of data preprocessing, the data were cleaned, and missing values were interpolated where at most five consecutive time points were missing; the data were then normalized by the min-max normalization method. The time series data set was created from the interpolated data with various frame sizes and different data separation methods. After the 2D and 3D input structures were defined, the CNN+LSTM deep learning–based time series prediction model was implemented by using the Deep Network Designer tool in MATLAB R2020a. Experiments were run on a computer with an Intel(R) Core(TM) i3-4005U CPU running Windows 10. Hyperparameters were tuned to increase the prediction performance of the model. The model was trained on the data of three cities to obtain the optimal hyperparameter values. Overfitting is guarded against because the training includes validation: training stops if the validation error increases for a certain number of training epochs (e.g., 6) in MATLAB. Finally, the pre-trained network, which had been trained with the source city’s data, was run with transfer learning on the target city’s data, where the probability distributions differ from each other. Performance metrics (RMSE and correlation coefficient) were measured to evaluate the prediction performance of the proposed system.
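The validation-based stopping rule mentioned above can be sketched as a patience counter. The exact criterion applied by the training tool is not given in the text, so the rule below (stop after a fixed number of consecutive validation checks without a new best value) is an assumption, and the loss values are invented for illustration.

```python
def early_stopping(val_losses, patience=6):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Training stops once `patience` consecutive checks fail to improve on the
    best loss seen so far; otherwise it runs to the end of the sequence.
    """
    best = float("inf")
    best_index = -1
    bad_checks = 0
    for index, loss in enumerate(val_losses):
        if loss < best:
            best, best_index, bad_checks = loss, index, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return index, best_index
    return len(val_losses) - 1, best_index


# Invented validation curve: improves, then drifts above its best value.
losses = [0.9, 0.7, 0.5, 0.6, 0.55, 0.52, 0.51, 0.53, 0.58, 0.4]
stop_at, best_at = early_stopping(losses, patience=6)
print(stop_at, best_at)
```

In this invented curve the rule stops before the final, lower value is reached, which shows the trade-off behind the patience setting: a small value saves training time but can stop too early.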

The hyperparameters were optimized during the training and test runs. The model was run 15 times for each method, and the neural network structure with the lowest validation RMSE was selected for use on the test data. If the test RMSE was smaller than the reference level (the smaller of two values: the results of the studies in the literature or the result of the 1-hidden layer LSTM run for comparison in this study), the results were accepted. Otherwise, the neural network was redesigned and trained again by changing network properties and hyperparameters (such as hidden layer type and number, neuron number, and activation function).
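The selection and acceptance rule above can be written as a short sketch. The run records and reference values are invented placeholders, not results from the study, and the function name is hypothetical.

```python
def select_and_accept(runs, literature_rmse, baseline_lstm_rmse):
    """Pick the run with the lowest validation RMSE, then compare its test RMSE
    with the reference level (the smaller of the two reference values)."""
    best_run = min(runs, key=lambda run: run["val_rmse"])
    reference = min(literature_rmse, baseline_lstm_rmse)
    accepted = best_run["test_rmse"] < reference
    return best_run, reference, accepted


# Placeholder records standing in for repeated training runs.
runs = [
    {"run": 1, "val_rmse": 0.12, "test_rmse": 0.13},
    {"run": 2, "val_rmse": 0.10, "test_rmse": 0.14},
    {"run": 3, "val_rmse": 0.11, "test_rmse": 0.09},
]
best_run, reference, accepted = select_and_accept(
    runs, literature_rmse=0.20, baseline_lstm_rmse=0.15
)
print(best_run["run"], reference, accepted)
```

Note that the run is chosen by validation RMSE only; in the placeholder data another run has a lower test RMSE, but selecting on the test set would bias the reported result.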

Method-1 (UNI/UNI) was firstly used for non-normalized time series data of Barcelona to predict the actual concentration value of the target pollutant. Then, it was employed for the normalized data to compare the performance with non-normalized data. Then, all three methods (Method-1, Method-2, Method-3) were performed with normalized time series data for each city (Barcelona, Kocaeli, İstanbul).

Meteorology data available in Kocaeli and İstanbul were added to the input with Method-2 (MULTI/UNI) and Method-3 (MULTI/MULTI) because of the need for 3D input structure. Temperature, relative humidity, air pressure, wind speed, and wind direction were used along with the concentration of pollutants.

Deep neural networks that have high prediction accuracy in Barcelona and Kocaeli data were transferred to İstanbul. Also, transfer learning was performed in cases where different pollutants’ information exist in the target domain and source domain, for example, between Barcelona and Kocaeli.

Descriptive statistics of the pollution data for the three cities (Barcelona, Kocaeli, İstanbul) are given in Table 4, where all pollutant concentrations are given in μg/m³.

Table 4 Descriptive statistics of air pollution data in three target cities

Hyperparameter tuning

Hyperparameters can be related to the model selection tasks (topology and size of the network) or to the optimization and training process (such as learning rate and mini-batch size). The following set of hyperparameters were tuned in this study:

Frame Size: Frame size was varied as integer values in the interval [8, 15].

Step Size: The step size of the time series sample was taken as 1 in this study.

Data Separation: Data have been split by 70%-15%-15% and 80%-10%-10% for training set, validation set, and test set respectively.

Sample Selection: There are three sample selection methods: random, sequential, and consecutive. In random selection, the training, validation, and test samples are selected randomly from the data. Figure 6 shows the consecutive and sequential selection methods. In sequential selection, the data are split consecutively into training and test sets, and the validation samples are selected sequentially from the training data set by using a validation frequency. In consecutive selection, the data are split into three consecutive pieces for training, validation, and test, respectively.

Fig. 6 Consecutive and sequential sample selection

Validation Frequency: Validation frequency refers to the number of iterations between evaluations of validation metrics; it was taken as 10, 15, and 20. It also refers to the period of selecting the validation sample in the sequential sample selection method.

Mini-batch Size: According to the amount of the time series data set, mini-batch size per iteration was taken as 30, 70, 100, 150, and 200 for training progress.

Number of Epochs: The early stopping method was used in this study to determine when to stop training the model.

Number of Convolutional Layers: The CNN part was constructed with one, two, and three convolutional layers.

Pooling Layer: The CNN part was firstly built without a pooling layer, then an average pooling layer, and a max-pooling layer were used.

Filter Size: In the CNN part, the filter size was set as 2×2 and 3×3.

Number of Filters: In the CNN part, the number of filters was taken as 4, 5, and 10.

Number of LSTM Layers: The LSTM part was constructed with one and two LSTM layers.

Number of Hidden Units in LSTM Layer: The number of hidden units in LSTM layers was taken as 25, 50, 75, 100, 150, and 200.

Activation Function: Rectified Linear Unit (ReLU) and sigmoid were used as activation functions.

Learning Rate: Learning rate was set as 0.001, 0.005, 0.01.
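The three sample selection methods listed above can be sketched on sample indices. This reflects one reading of the descriptions: in the sequential method the last part of the data is held out for testing and every `val_every`-th sample of the remainder is used for validation. The function names, the 80%-10%-10% defaults, and the rounding are assumptions for illustration.

```python
import random


def _cut(indices, train, val):
    """Cut an index list into train / validation / test pieces by fraction."""
    n_train = round(len(indices) * train)
    n_val = round(len(indices) * val)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]


def split_consecutive(n, train=0.8, val=0.1):
    """Three consecutive blocks in time order."""
    return _cut(list(range(n)), train, val)


def split_random(n, train=0.8, val=0.1, seed=0):
    """Shuffle all samples, then cut."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return _cut(indices, train, val)


def split_sequential(n, test=0.1, val_every=10):
    """Hold out the last block for testing; take every val_every-th earlier
    sample for validation and the rest for training."""
    n_test = round(n * test)
    head = range(n - n_test)
    val = [i for i in head if (i + 1) % val_every == 0]
    train = [i for i in head if (i + 1) % val_every != 0]
    return train, val, list(range(n - n_test, n))


for split in (split_consecutive(100), split_random(100), split_sequential(100)):
    print([len(part) for part in split])
```

Under this reading, the size of the sequential validation set is set by the validation frequency rather than by the split percentage.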

Results

The structure of the proposed CNN+LSTM deep learning prediction model and its properties are given in Table 5 with the following parameters:

  • m: number of locations

  • n: number of time steps

  • k: number of features

  • nk: number of convolutional filters

  • nl: number of LSTM units

  • no: number of neurons in fully connected layer.

Table 5 Proposed deep learning model’s layers and their properties

In our study, adding a pooling layer did not improve the performance of the model. On the contrary, it caused unstable training performance with high test RMSE values. Because the pooling layer is meant to reduce the information in large inputs, applying it to the smaller inputs used here caused the loss of their main features.

The model was run 15 times for all combinations of different hyperparameter values (described above) to predict the concentration of pollutants by each method. The best test results were chosen according to the lowest RMSE values, then, test RMSE and the correlation coefficients were calculated. A learning rate of 0.005 provided the lowest RMSE during the training process. Also, the random sample selection method reached a lower test RMSE value than consecutive and sequential sample selection methods.
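The two performance metrics can be computed as in the sketch below. The text does not say which correlation coefficient was used, so Pearson's r is an assumption, and the sample values are invented. The last helper relies on the fact that, under min-max scaling, an RMSE on normalized data equals the RMSE in original units divided by (max − min).

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x, y):
    """Pearson correlation coefficient (assumed choice of correlation measure)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rmse_in_original_units(rmse_normalized, lo, hi):
    """Undo min-max scaling for an RMSE computed on normalized values."""
    return rmse_normalized * (hi - lo)


observed = [0.2, 0.4, 0.6, 0.8]   # invented normalized observations
predicted = [0.1, 0.5, 0.5, 0.9]  # invented normalized predictions
print(rmse(observed, predicted), pearson_r(observed, predicted))
```

Because normalized RMSE depends on each data set's value range, comparing it across studies is only meaningful when the same scaling is applied, which is why the range is needed to convert back to μg/m³.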

Table 6 gives the data properties and methods for the best prediction results for different cities. Test RMSE values were used as the prediction performance metric, and the best prediction performance was obtained for PM10 for all three cities. 3D samples created from the normalized time series data were used as input to the model. For Kocaeli, Method-3 (multi-input/multi-output) provided the lowest test RMSE, while Method-2 (multi-input/uni-output) provided the lowest test RMSE for İstanbul. For Barcelona, both methods reached the lowest test RMSE for PM10 prediction (Table 7).

Table 6 Data properties and methods that give the best prediction results
Table 7 Comparison of the prediction performance between 1-hidden layer LSTM model and CNN+LSTM model

Proposed architectures and their performances in the literature are given in Table 8. In Zeinalnezhad et al. (2020), the proposed ANFIS model reached RMSE values (μg/m3) of 0.25 for SO2, 0.20 for O3, and 0.16 for NO2. The architecture proposed in this paper reached lower RMSE values for these pollutants in three cities than the study in Zeinalnezhad et al. (2020). Results of Method-2 and Method-3 based on multivariate input, which is also called “multivariate prediction” (Di Antonio et al. 2019), were compared with other studies in the literature (Soh et al. 2018; Di Antonio et al. 2019). The RMSE values in related studies were normalized for comparison purposes. The RMSE of the ANN-based prediction model in Soh et al. (2018) was 0.47 for O3 and 0.11 for SO2 with univariate input, and 0.21 for O3 and 0.15 for SO2 with multivariate input. The RMSE of our CNN+LSTM neural network is lower than that of Soh et al. (2018) for O3 in Barcelona and İstanbul, and for SO2 in Kocaeli and İstanbul. The study in Di Antonio et al. (2019) measured an RMSE of 0.08 with both univariate and multivariate input. In our proposed model, the RMSE was measured as 0.07 in Barcelona and Kocaeli, while it was 0.08 in İstanbul.

Table 8 Proposed architectures and their performances in the literature

In Zhou et al. (2018), the multi-output-based DM-LSTM model has a smaller RMSE than the SM-LSTM model; the improvement rates are reported to range from 2.18% to 13.91%. Table 7 shows the test RMSE and improvement rates of the models for three cities (Barcelona, Kocaeli, İstanbul). With respect to the LSTM model, the CNN+LSTM model has improvement rates between 11 and 53% for PM10, 20 and 31% for O3, 8 and 15% for NO2, 18 and 46% for SO2, and 9 and 47% for NOX.
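An improvement rate of this kind is commonly computed as the relative reduction in RMSE with respect to the baseline model. The sketch below assumes that definition; the text does not state the formula explicitly, and the function name and the example values are hypothetical and are not taken from Table 7.

```python
def improvement_rate(baseline_rmse, model_rmse):
    """Relative RMSE reduction over the baseline, in percent (assumed definition)."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Made-up example: baseline LSTM RMSE 0.10, CNN+LSTM RMSE 0.08.
print(round(improvement_rate(0.10, 0.08), 1))  # 20.0
```

A negative value would indicate that the candidate model performs worse than the baseline.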

Method-1 (UNI/UNI)

2D samples created from non-normalized Barcelona data contain historical information on one target pollutant and are used to predict its concentration in μg/m3. The network with the lowest validation RMSE was tested, and performance metrics were measured. In the second stage, the model was trained on normalized Barcelona data to compare the performance with that on non-normalized data. The activation function in the fully connected layer was set to ReLU and to sigmoid in turn, and their performances were compared for the normalized case. Scatter plots of the prediction results are given in Fig. 7. As can be seen in Fig. 7a, b, and c, although the predicted values produced by the model are above the actual values, the model generally captured the behavior of the non-normalized time series data. The results obtained with the normalized data in Fig. 7d, e, and f show that using ReLU in the last layer leads to predictions far above the actual values. Using sigmoid instead of ReLU in the last layer allows the model to capture the behavior of the normalized time series data.

Fig. 7 Results of Method-1 for the prediction in Barcelona for non-normalized (a, b, c) and normalized (d, e, f) air pollution data

The test RMSE for non-normalized data was measured as 37.29 for O3, 29.58 for NO2, and 13.87 for PM10; the correlation coefficient between target and predicted values was measured as 0.87 for O3, 0.79 for NO2, and 0.96 for PM10. The test RMSE for normalized data with sigmoid was measured as 0.06 for O3, 0.09 for NO2, and 0.06 for PM10; the test RMSE with ReLU was measured as 0.28 for O3, 0.39 for NO2, and 0.42 for PM10.
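The difference in scale between the two sets of RMSE values follows from the normalization itself. The sketch below assumes min-max scaling to the range [0, 1], which is consistent with the use of a sigmoid output but is not confirmed by this section; the function names and the concentration values are hypothetical. Under this assumption, an RMSE on the normalized scale maps back to μg/m3 by multiplying by the range of the data, because the offset cancels in the error term.

```python
def fit_min_max(values):
    """Return the (low, high) bounds used for scaling."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("values must not be constant")
    return low, high


def normalize(values, low, high):
    """Map values from [low, high] to [0, 1]."""
    return [(v - low) / (high - low) for v in values]


def denormalize(values, low, high):
    """Map values from [0, 1] back to [low, high]."""
    return [v * (high - low) + low for v in values]


def rmse_to_original_units(normalized_rmse, low, high):
    """Rescale an RMSE from the normalized scale to the original units."""
    return normalized_rmse * (high - low)


# Made-up concentrations in ug/m3.
concentrations = [10.0, 35.0, 60.0, 110.0]
low, high = fit_min_max(concentrations)
scaled = normalize(concentrations, low, high)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
print(round(rmse_to_original_units(0.06, low, high), 2))  # 6.0
```

Because the normalized and non-normalized models were trained separately, this conversion only indicates the order of magnitude and does not reproduce the reported non-normalized values.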

Method-2 (MULTI/UNI)

3D input samples were created from normalized data to combine the information about various variables that have different ranges of values. The activation functions in hidden layers were set to ReLU, while sigmoid was used only in the fully connected layer. Scatter plots of the prediction results are given in Fig. 8.

Fig. 8 Scatter plots of target and predicted concentrations by Method-2 and Method-3

In Barcelona, test RMSE was measured as 0.12 for O3, 0.11 for NO2, and 0.07 for PM10; correlation coefficient was measured as 0.87 for O3, 0.83 for NO2, and 0.95 for PM10.

In Kocaeli, test RMSE was measured as 0.08 for PM10, 0.08 for SO2, and 0.08 for NOX; correlation coefficient was measured as 0.85 for PM10, 0.74 for SO2, and 0.77 for NOX.

İstanbul data contain the information of five pollutants (O3, NO2, PM10, SO2, NOX); therefore, two different data sets were created to compare prediction results with other cities. The first data set (Dataset-1) contains the information of the same pollutants as Barcelona (O3, NO2, and PM10); the second data set (Dataset-2) contains the information of the same pollutants as Kocaeli (PM10, SO2, NOX).

For Dataset-1 in İstanbul, test RMSE was measured as 0.12 for O3, 0.11 for NO2, and 0.08 for PM10; correlation coefficient was measured as 0.89 for O3, 0.82 for NO2, and 0.75 for PM10.

For Dataset-2 in İstanbul, test RMSE was measured as 0.08 for PM10, 0.09 for SO2, and 0.09 for NOX; correlation coefficient was measured as 0.79 for PM10, 0.79 for SO2, and 0.86 for NOX.

Method-3 (MULTI/MULTI)

Method-3 was performed to predict the concentration of multiple variables at the same time for a city by using 3D input that contains the information of all pollutants. The target is the concentration of three pollutants in multiple locations in the next hour. The network was built with ReLU activation function in hidden layers and sigmoid activation function in the fully connected layer. Scatter plots of the prediction results are given in Fig. 8.
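The construction of 3D input samples with next-hour targets can be sketched with a sliding window. The layout used here (locations × time steps × features, matching m, n, and k in the parameter list) and the function name are assumptions made for illustration; the exact sample layout of the study is not restated in this section. Passing a single feature index gives the uni-output targets of Method-2, and passing all pollutant indices gives the multi-output targets of Method-3.

```python
def make_samples(series, n_steps, target_features):
    """Build sliding-window samples.

    series: list over hours; each hour is a list over locations; each
    location is a list of feature values.
    Returns (inputs, targets), where each input has shape
    locations x n_steps x features and each target has shape
    locations x len(target_features) for the hour after the window.
    """
    if len(series) <= n_steps:
        raise ValueError("series must be longer than n_steps")
    n_locations = len(series[0])
    inputs, targets = [], []
    for start in range(len(series) - n_steps):
        window = series[start:start + n_steps]
        sample = [[list(hour[loc]) for hour in window] for loc in range(n_locations)]
        next_hour = series[start + n_steps]
        target = [[next_hour[loc][f] for f in target_features]
                  for loc in range(n_locations)]
        inputs.append(sample)
        targets.append(target)
    return inputs, targets


# Toy data: 6 hours, m = 2 locations, k = 3 features, n = 4 time steps.
# Each value encodes its own position as 100*hour + 10*location + feature.
hours, m, k, n = 6, 2, 3, 4
series = [[[100 * t + 10 * loc + f for f in range(k)] for loc in range(m)]
          for t in range(hours)]

multi_inputs, multi_targets = make_samples(series, n, target_features=[0, 1, 2])
uni_inputs, uni_targets = make_samples(series, n, target_features=[0])
print(len(multi_inputs), multi_targets[0], uni_targets[0])
```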

In Barcelona, test RMSE was measured as 0.12 for O3, 0.11 for NO2, and 0.07 for PM10; correlation coefficient was measured as 0.82 for O3, 0.73 for NO2, and 0.94 for PM10.

In Kocaeli, test RMSE was measured as 0.07 for PM10, 0.09 for SO2, and 0.09 for NOX; correlation coefficient was measured as 0.86 for PM10, 0.74 for SO2, and 0.79 for NOX.

For Dataset-1 in İstanbul, test RMSE was measured as 0.11 for O3, 0.10 for NO2, and 0.08 for PM10; correlation coefficient was measured as 0.86 for O3, 0.69 for NO2, and 0.70 for PM10.

For Dataset-2 in İstanbul, test RMSE was measured as 0.09 for PM10, 0.10 for SO2, and 0.10 for NOX; correlation coefficient was measured as 0.77 for PM10, 0.68 for SO2, and 0.78 for NOX.

Results with meteorological information

Meteorological information was added to the input data to observe the contribution of meteorological conditions to the prediction accuracy. Hourly information about temperature, air pressure, relative humidity, wind speed, and wind direction was added to the 3D input. The results were obtained with Method-2 and Method-3 using the same deep learning model structure. Since meteorological data are available only for Kocaeli and İstanbul, this part of the study was done for these two cities.

Table 9 gives the results observed when ReLU is used in hidden layers and sigmoid is used in the fully connected layer as activation functions. For the prediction of PM10 concentration, using meteorological information (MI) and air quality (AQ) data together reduced the RMSE by 10% with Method-2 and Method-3 in İstanbul. However, it increased the RMSE by 12% with Method-2 and by 30% with Method-3 in Kocaeli. For the prediction of SO2 concentration, the use of MI data with AQ data caused a 15% increase in RMSE with Method-2 for Kocaeli, whereas there was no change in the test RMSE in İstanbul. For the prediction of NOX concentration, adding MI data to the input increased the RMSE in both cities: by about 45% in İstanbul and 25% in Kocaeli with Method-2, and by 33% in Kocaeli and 10% in İstanbul with Method-3. When meteorological data are added to the model along with air quality data, the complexity of the input increases, and the model has difficulty capturing the relation between spatial-temporal features. Di Antonio et al. (2019) also observed a decrease in prediction performance as the number of features increased.

Table 9 Prediction performance comparison between studies with and without meteorological data for Kocaeli and İstanbul, bold entries indicate the lowest RMSE value for the pollutant

Transfer learning

The transductive transfer learning method was applied to transfer network weights where the source task and target task are the same, whereas the source domain and target domain are different but related. Barcelona and Kocaeli were selected as source cities and İstanbul was selected as the target city. The pre-trained model was first tested directly on the target data. Then, the last learning layer (fully connected layer) was changed using the Deep Network Designer tool, and a short training process was performed on the target city’s data with a learning rate of 0.001.
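The idea of keeping the transferred weights and briefly retraining only the last layer can be illustrated with a framework-free toy. This is not the authors' implementation, which used the Deep Network Designer tool; it is a sketch under stated assumptions: the transferred layers are frozen (the text does not say whether they were), the output layer is a single sigmoid unit trained by full-batch gradient descent on the mean squared error, and all weights, data, and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frozen_features(sample, hidden_weights):
    """Stand-in for the transferred layers: ReLU units whose weights are never updated."""
    return [max(0.0, sum(w * x for w, x in zip(row, sample))) for row in hidden_weights]


def predict(sample, hidden_weights, out_weights, out_bias):
    feats = frozen_features(sample, hidden_weights)
    return sigmoid(sum(w * f for w, f in zip(out_weights, feats)) + out_bias)


def mean_squared_error(samples, targets, hidden_weights, out_weights, out_bias):
    errors = [(predict(s, hidden_weights, out_weights, out_bias) - t) ** 2
              for s, t in zip(samples, targets)]
    return sum(errors) / len(errors)


def fine_tune_last_layer(samples, targets, hidden_weights, out_weights, out_bias,
                         learning_rate=0.001, epochs=20):
    """Update only the output layer with full-batch gradient descent."""
    out_weights = list(out_weights)
    for _ in range(epochs):
        grad_w = [0.0] * len(out_weights)
        grad_b = 0.0
        for sample, target in zip(samples, targets):
            feats = frozen_features(sample, hidden_weights)
            pred = sigmoid(sum(w * f for w, f in zip(out_weights, feats)) + out_bias)
            # Derivative of the mean squared error w.r.t. the pre-activation.
            delta = 2.0 * (pred - target) * pred * (1.0 - pred) / len(samples)
            grad_b += delta
            for i, f in enumerate(feats):
                grad_w[i] += delta * f
        out_weights = [w - learning_rate * g for w, g in zip(out_weights, grad_w)]
        out_bias -= learning_rate * grad_b
    return out_weights, out_bias


# Hypothetical "pre-trained" hidden weights and made-up target-city data.
hidden_weights = [[0.5, -0.2], [0.1, 0.4]]
samples = [[0.2, 0.4], [0.6, 0.1], [0.9, 0.7], [0.3, 0.8]]
targets = [0.7, 0.8, 0.9, 0.75]

initial_w, initial_b = [0.0, 0.0], 0.0  # newly initialized output layer
loss_before = mean_squared_error(samples, targets, hidden_weights, initial_w, initial_b)
tuned_w, tuned_b = fine_tune_last_layer(samples, targets, hidden_weights,
                                        initial_w, initial_b)
loss_after = mean_squared_error(samples, targets, hidden_weights, tuned_w, tuned_b)
print(loss_before > loss_after)  # True
```

With such a small learning rate and few epochs the loss changes only slightly, which mirrors the short training process described above.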

Transfer learning was performed for the prediction of PM10 concentration in İstanbul. Figure 9a shows the relation between target and predicted values; the graphs in Fig. 9b show the predicted and target values in İstanbul. These results are for transfer learning where Kocaeli is the source domain and İstanbul is the target domain. Test RMSE was measured as 0.12 and 0.09 for Method-2 and Method-3, respectively.

Fig. 9 Transfer learning results for PM10 concentration in İstanbul by pre-trained neural network with Kocaeli data

Conclusion and discussion

In this study, a CNN+LSTM-based deep neural network model was proposed to predict the hourly concentration of air pollutants in three cities (Barcelona, Kocaeli, İstanbul) based on spatial-temporal features. Three different methods were used according to the input-output relationship. Method-1 is based on univariate input and univariate output; Method-2 is based on multivariate input and univariate output; Method-3 is based on multivariate input and multivariate output. All methods predict air pollutants at multiple locations in the city.

The first objective of the study is to develop a supervised model for air pollution prediction from real sensor data obtained at different locations, and to set the parameters that give the optimal level of learning based on the best performance metric values achieved for the largest number of pollutants and cities. The CNN+LSTM neural network predicts the future hourly concentration of air pollutants under given environmental factors (air pollution and meteorological conditions). Convolutional layers extract the relation between locations for spatial features, while LSTM layers extract temporal feature characteristics from time series data. The nonlinear relationships between multi-variable time series and air pollutants were combined, and the effects of air pollution and meteorological data on prediction performance were observed. Although the model selection task-related hyperparameters generally did not change among different pollutants, the training process-related hyperparameters changed among the pollutants and the cities.

The use of different activation functions at the last layer of the network resulted in different prediction performance. ReLU in the fully connected layer produces predicted values that exceed the target values for each pollutant when working with normalized data. Sigmoid limits the prediction values to between 0 and 1, and it improves the prediction performance for normalized data. However, it produces predictions lower than the target values at higher pollution levels. The best-performing configuration uses different activation functions: ReLU in each hidden layer and sigmoid in the last learning layer, which is the fully connected layer.
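The different behavior of the two output activations on normalized targets can be seen from their ranges. The sketch below uses made-up pre-activation values: ReLU is unbounded above, so an output unit can return values far outside the [0, 1] target range, while sigmoid always stays strictly between 0 and 1. Sigmoid also saturates, so outputs close to 1 require large pre-activations; this is one plausible reading of the lower predictions at high pollution levels, not a result established in the study.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Made-up pre-activation values of a single output unit.
pre_activations = [-2.0, 0.0, 0.5, 3.0, 6.0]

relu_outputs = [relu(z) for z in pre_activations]
sigmoid_outputs = [sigmoid(z) for z in pre_activations]

print(relu_outputs)  # [0.0, 0.0, 0.5, 3.0, 6.0]
print([round(s, 3) for s in sigmoid_outputs])

# Pre-activation needed for a sigmoid output of 0.95 (inverse of the sigmoid).
needed = math.log(0.95 / (1.0 - 0.95))
print(round(needed, 2))  # 2.94
```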

While O3, NO2, and PM10 were selected as target pollutants in Barcelona, PM10, SO2, and NOX were selected in Kocaeli. Since İstanbul has data for five pollutants, two data sets were used: the first data set (Dataset-1) contains the same pollutants as Barcelona (O3, NO2, and PM10); the second data set (Dataset-2) contains the same pollutants as Kocaeli (PM10, SO2, and NOX).

Considering the test RMSE, Method-2 and Method-3 perform better than Method-1 for time series prediction in each city. In Barcelona, the test RMSE values are the same for Method-2 and Method-3, whereas Method-2 has higher correlation coefficients than Method-3. Test RMSE for PM10 prediction in Barcelona and Kocaeli is generally the same. The RMSE values for O3 and NO2 in Barcelona are 0.05 and 0.04 higher, respectively, than the RMSE value for PM10 on the normalized scale. The difference between the RMSE values of different pollutants is 0.01 in Kocaeli. The RMSE for nitrogen-based pollutants (NO2 and NOX) is 0.04 higher than the RMSE for PM10 in İstanbul. The İstanbul data set is larger than the Barcelona and Kocaeli data sets. Nevertheless, the RMSE values for O3, NOX, and NO2 are approximately the same as in the other cities, while the RMSE values for PM10 and SO2 are higher than in the other two cities.

3D samples were created with information on pollutants and meteorology to be used for Method-2 and Method-3. The meteorological features are temperature, relative humidity, air pressure, wind speed, and wind direction. Adding meteorological features to the input generally increased the test RMSE by 10 to 40% in both Kocaeli and İstanbul; using only air pollution information as features made the model more successful. The use of meteorological data improved the prediction performance only for PM10 concentration in İstanbul, with both Method-2 and Method-3, by a rate of 12%.

The second objective of the study is to transfer the models between cities, to examine the prediction ability of different models under weight transfer, and to identify whether the parameters that give an optimal level of learning are the same for different cities. The training process with the target city’s data set did not improve the prediction performance; the lowest test RMSE was observed when the model was applied directly to the target domain.

Transductive transfer learning was successful in weight transfer for Method-2 and Method-3 from Kocaeli to İstanbul. Weight transfer between Barcelona and İstanbul or between Barcelona and Kocaeli has a higher test RMSE than the transfer between Kocaeli and İstanbul. Kocaeli and İstanbul are neighboring cities with similar air pollution and meteorological characteristics, whereas the characteristics of Barcelona are not related to those of the other two cities.

Looking at the optimal hyperparameters, the model selection task-related hyperparameters are the same for different cities; however, the training process-related hyperparameters change among cities. Although a common deep learning network structure was determined, the transfer of the model did not increase the prediction performance in the target city. For an air pollution prediction model with high accuracy, it would be better to train the model with the data of each city and to set training-related hyperparameters as given in our work. Further work using data from other cities will enhance the prediction capability of the proposed model.

The major outcomes of this study are as follows: different input types for the CNN+LSTM hybrid machine learning algorithm were defined and their effects on the performance of the model were examined. The relationship between the concentrations of various pollutants was observed. Besides the concentration information of air pollutants, the effect of meteorological conditions was observed for the prediction of air pollution. The efficiency and performance of transfer learning between cities were examined. Both spatial and temporal features were used as inputs to the algorithm, and the concentrations of pollutants for the next hour were predicted at multiple locations. Also, as the output of the model, the estimated concentrations of more than one pollutant were obtained at the same time.

However, some limitations should also be noted for this study. There is a large amount of missing and/or bad data in the publicly available data sources for the target cities selected for this study. As a result, the size of the data set used for training the model is limited by the lack of usable data collected from all sensors located in the cities. Also, due to this limitation of available and reliable data, small-sized samples were created as model input. This makes it difficult for the model to extract meaningful relationships from the data. In addition, since the behavior and characteristics of the data are specific to each target city, the lack of previous studies on this type of data has also affected the data analysis and the selection of the appropriate algorithm.