Introduction

With the acceleration of China’s industrialization, severe environmental problems have arisen; among them, air pollution has received widespread public attention. According to the World Health Organization, about 4.2 million people die prematurely each year due to exposure to ambient air pollution (Song et al. 2017). The concentration of particulate matter 2.5 (\( PM_{2.5}\)), the main pollutant, is widely used as an air quality monitoring and regulatory indicator (Qi et al. 2019). \( PM_{2.5}\) can remain suspended in the air for extended periods and penetrate deep into the lungs (Liu et al. 2023). Epidemiological studies have shown that \( PM_{2.5}\) severely affects human health (Dong et al. 2019), and long-term exposure to high \( PM_{2.5}\) concentrations increases the risk of respiratory diseases, lung cancer, and cardiovascular diseases (Zou et al. 2016; Li et al. 2015). Since one of the respiratory system’s exposure pathways to \( PM_{2.5}\) is the nasal-brain route, the central nervous system is also highly vulnerable (Liang et al. 2023); consequently, individuals in areas with high \( PM_{2.5}\) pollution are more prone to developing neurodegenerative diseases (Younan et al. 2020).

Currently, Beijing operates 35 monitoring stations that report hourly air quality data, including the air quality index (AQI) and the concentrations of \( PM_{2.5}\), particulate matter 10 (\( PM_{10}\)), carbon monoxide (CO), nitrogen dioxide (\( NO_{2}\)), ozone (\( O_{3}\)), and sulfur dioxide (\( SO_{2}\)), which helps people understand the pollution levels in their areas. However, \( PM_{2.5}\) concentration exhibits nonlinear characteristics in time and space, implying that monitoring stations alone have limited effectiveness in preventing and controlling \( PM_{2.5}\) pollution. Therefore, accurate prediction of \( PM_{2.5}\) concentration holds significant importance for air pollution prevention and control.

Air quality prediction methods involve statistical regression, machine learning, and deep learning techniques. In Park et al. (2023), the authors utilized outdoor \( PM_{2.5}\) concentration, temperature, and humidity data near indoor target points as input to a multiple linear regression model for calculating indoor \( PM_{2.5}\) concentration. Experimental results demonstrated the feasibility of this approach, and an evaluation of the model by season revealed that seasonal characteristics significantly influence indoor \( PM_{2.5}\) concentration and the predictive model’s performance. In Song et al. (2015), the authors constructed a generalized additive model (GAM) to estimate the statistical relationship between latent variables and \( PM_{2.5}\) concentration; the model’s \( R^{2}\) value increased by 18.73% compared to stepwise linear regression, indicating its applicability for \( PM_{2.5}\) prediction. However, because the relationships between variables are often complex and nonlinear, the predictive accuracy of these statistical methods leaves room for improvement. Machine learning methods address this to some extent. For example, Li et al. (2022b) constructed a random forest regression model incorporating MAIAC AOD, meteorological, topographical, date, and location data to estimate daily \( PM_{2.5}\) concentrations in the Huaihai Economic Zone from 2000 to 2020, and the results demonstrated the effectiveness of this approach in accurately estimating \( PM_{2.5}\) concentrations. Besides, in Wang et al. (2021), the authors employed PSO-SVR, GWO-SVR, PSO-GSA-SVR, and GRNN (with spread=0.4 and spread=0.5) to fit three intrinsic mode functions (IMFs) obtained via CEEMD. By randomly combining the predictions of the three IMFs to generate 125 individual models and subsequently selecting among these models with the DPC method for combined prediction, the authors achieved accurate forecasts of \( PM_{2.5}\) time series in four Chinese cities. However, the methods mentioned above did not address the temporal correlations in \( PM_{2.5}\) concentration data. Additionally, these models have limited capability to represent complex functions, and their generalization for complex prediction problems can be improved.

In recent years, neural networks have developed significantly alongside improvements in computing power. Compared with traditional methods, neural networks can model complex nonlinear relationships (Zhang et al. 2021a). Therefore, an increasing number of scholars worldwide utilize deep learning for regression problems. Among these, recurrent neural networks (RNNs) are designed to handle sequential information effectively (Zhang et al. 2019) and have found widespread application in fields like fault diagnosis, machine translation, and speech recognition, achieving promising results (Mansouri et al. 2022; Li et al. 2018; Kim and Lee 2020; Ackerson et al. 2021; Zhou et al. 2019). However, RNNs suffer from vanishing and exploding gradients on long sequences.

LSTM and GRU neural networks have been proposed to address this issue. Since \( PM_{2.5}\) concentration data has dynamic characteristics over time and can be described using time series models, LSTM and GRU neural networks have been applied in \( PM_{2.5}\) concentration prediction research. For example, Li et al. (2022c) established a GRU-based \( PM_{2.5}\) concentration prediction model evaluated with the mean relative error (MRE), root mean square error (RMSE), and Pearson correlation coefficient; extensive experiments demonstrated the model’s appealing predictive performance. Besides, Zhou et al. (2019) used hourly \( PM_{2.5}\) concentration and weather information in Beijing as input and trained four season-specific GRU models, demonstrating that the GRU-based model has higher prediction accuracy and is suitable for time series prediction of atmospheric pollutants. Ge et al. (2019) used a deep bidirectional and unidirectional long short-term memory (DBU-LSTM) neural network to extract features from \( PM_{2.5}\) concentration data and relied on tensor decomposition to complete missing data; experiments highlighted the model’s feasibility. However, none of these works discussed the correlation between the other pollutants in the dataset and \( PM_{2.5}\) concentration, or the autocorrelation of \( PM_{2.5}\) concentrations over time.

Furthermore, Huang et al. (2021) proposed an EMD-GRU neural network based on empirical mode decomposition (EMD) for predicting \( PM_{2.5}\) concentration. The \( PM_{2.5}\) concentration sequence was decomposed using EMD, and the resulting sub-sequences and meteorological features were input into a GRU neural network for training and prediction. Experimental results highlighted that this method accurately predicted \( PM_{2.5}\) concentration. Zhang et al. (2022a) suggested a method for hourly prediction of Beijing’s \( PM_{2.5}\) concentration based on a Bi-LSTM neural network and discussed the effectiveness of incorporating meteorological features; the corresponding experimental results revealed that exploiting meteorological features effectively reduces the prediction error of \( PM_{2.5}\) concentration. Besides, Ding and Zhu (2022) constructed an LSTM model based on principal component analysis (PCA) and an attention mechanism to eliminate correlations between indicators and reduce model complexity, achieving good experimental prediction results. However, the performance of LSTM and GRU models degrades toward random guessing as the length of the time series increases.

This study proposes a hybrid neural network called TCN-biGRU to address this issue and preserve historical information to a greater extent. The model combines the advanced feature extraction capability of TCN neural networks with the time series prediction ability of bi-GRU neural networks. Unlike studies that improve accuracy by optimizing model parameters or increasing model complexity, this model is designed based on an analysis of the data’s features, matching the advantages of the TCN and bi-GRU models to the inherent characteristics of the data. In the developed architecture, the TCN neural network achieves an exponentially large receptive field (Liu et al. 2020) thanks to its dilated convolutions and residual connections, so the network’s input can be a long segment of time series data. Compared to LSTM neural networks, GRU neural networks have a simpler architecture, lower computational complexity, and faster training (Wang et al. 2021). Moreover, incorporating directional information can improve the model’s accuracy by exploiting the strong correlation between \( PM_{2.5}\) concentrations in the previous and subsequent periods. To this end, the bi-GRU neural network runs two GRU models, one over the time series and one over the reversed series, providing complete historical and future information for each time point in the input sequence of the output layer (Liang et al. 2020). Hence, the proposed TCN-biGRU neural network combines the advantages of both models and can be used for \( PM_{2.5}\) concentration prediction. This research fills a gap by exploring the fusion of TCN and bi-GRU models. The main contributions of this paper are as follows:

1) Pollutants highly correlated with \( PM_{2.5}\) concentrations are investigated as inputs to the neural network, improving the prediction accuracy compared to relying solely on \( PM_{2.5}\) concentrations.

2) The autocorrelation of \( PM_{2.5}\) concentrations is explored, and a strong correlation between \( PM_{2.5}\) concentrations and the concentrations in the previous and subsequent periods is verified, providing a basis for using bi-GRU neural networks.

3) A neural network model named TCN-biGRU is proposed, and comparative experiments are conducted using Beijing air quality data from 2021/01/01 to 2021/12/31 to validate the effectiveness and performance advantages of the developed method.

Methods

To accurately predict the \( PM_{2.5}\) concentration (\( \mu g/m^{3} \)), this paper proposes a neural network model based on TCN-biGRU, implemented using measurement data from air quality monitoring stations in Beijing. Additionally, to improve the model’s accuracy, the relationship between \( PM_{2.5}\) concentration and the other factors (\( PM_{10}\) (\( \mu g/m^{3} \)), AQI, CO (\( mg/m^{3} \)), \( NO_{2}\) (\( \mu g/m^{3} \)), \( O_{3}\) (\( \mu g/m^{3} \)), and \( SO_{2}\) (\( \mu g/m^{3} \))) is discussed, as well as the autocorrelation of the \( PM_{2.5}\) concentration.

The model input is historical air quality data. First, to increase the model’s accuracy, the correlation between each pollutant and \( PM_{2.5}\) concentration is calculated. Since the Pearson correlation coefficient measures the correlation between variables (Shi et al. 2021), it is utilized to analyze this relationship. Then, the autocorrelation coefficient of the \( PM_{2.5}\) concentration data is calculated, and a suitable number of timesteps is selected by weighing computation cost against accuracy over 4, 6, 12, 18, and 28 timesteps. Finally, the monitoring point data is fed to the input layer of the TCN-biGRU neural network, which outputs the predicted \( PM_{2.5}\) concentration at the monitoring point. The process is described in detail below.

Correlation analysis

The monitoring stations report data on multiple pollutants, with existing studies revealing a correlation between pollutants (Zhang et al. 2021b; Wu et al. 2022; Popescu et al. 2017). Therefore, in this study, the Pearson correlation coefficient represents the relationship between \( PM_{2.5}\) concentration and the concentration of other pollutants. The Pearson correlation coefficient (Pearson 1900) formula is as follows:

$$\begin{aligned} \rho _{XY} = \frac{Cov(X,Y)}{\sigma _{X}\sigma _{Y}} \end{aligned}$$
(1)
$$\begin{aligned} Cov(X,Y)=\frac{1}{n}\sum _{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y}) \end{aligned}$$
(2)

where X and Y represent the concentrations of \( PM_{2.5}\) and another pollutant, respectively, Cov(X,Y) is the covariance of X and Y, and \(\sigma _{X}\) and \(\sigma _{Y}\) denote the standard deviations of X and Y, respectively.
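For illustration, the following minimal pandas sketch computes these coefficients directly from the pollutant table; the file name and column headers are hypothetical and would need to match the actual dataset.

```python
import pandas as pd

# Hypothetical file name and column headers; adjust to the released dataset.
cols = ["PM2.5", "PM10", "AQI", "CO", "NO2", "O3", "SO2"]
df = pd.read_csv("beijing_air_quality_2021.csv", usecols=cols)

# Pairwise Pearson coefficients (Eqs. 1-2); the PM2.5 column ranks each
# pollutant's linear correlation with PM2.5 concentration.
corr = df.corr(method="pearson")
print(corr["PM2.5"].sort_values(ascending=False))
```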

Fig. 1 The structure diagram of the dilated convolution

Autocorrelation analysis

To validate that the \( PM_{2.5}\) concentration at time T is influenced by the \( PM_{2.5}\) concentrations at the previous and subsequent time points, and to demonstrate the importance of the bi-GRU in the TCN-biGRU model, the autocorrelation function (ACF) is utilized to show that the \( PM_{2.5}\) concentration time series is autocorrelated, i.e., the ACF reveals the correlation at each lag (Flores et al. 2012). The concept and formula of the ACF were first introduced in Yule (1927) and have gradually evolved throughout the development of time series analysis. The ACF is defined as follows:

$$\begin{aligned} Corr_{k} =\frac{\sum _{i=1}^{n-k}(x_{i}-u)(x_{i+k}-u)}{\sum _{i=1}^{n}(x_{i}-u)^{2}} \end{aligned}$$
(3)

where k is the lag order, u is the mean of the sequence, n is the sequence length, and \(x_{i}\) and \(x_{i+k}\) are the i-th elements of the original sequence and of the sequence shifted by k, respectively.
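A direct NumPy transcription of Eq. 3 may clarify the computation; the placeholder series below merely stands in for the hourly \( PM_{2.5}\) record.

```python
import numpy as np

def acf(x: np.ndarray, k: int) -> float:
    """Sample autocorrelation at lag k, following Eq. 3."""
    u = x.mean()
    num = np.sum((x[: len(x) - k] - u) * (x[k:] - u))
    den = np.sum((x - u) ** 2)
    return num / den

pm25 = np.random.default_rng(0).normal(50.0, 20.0, 1000)  # placeholder series
print([round(acf(pm25, k), 3) for k in (4, 6, 12, 18, 28)])
```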

TCN-biGRU neural network

We combine the TCN and bi-GRU neural networks into TCN-biGRU to benefit from the advantages of each network. This section introduces the TCN and bi-GRU neural networks separately.

TCN neural network

TCN is a convolutional neural network proposed by Bai et al. (2018). Its design combines best practices such as fully convolutional networks, dilated convolutions, residual connections, and causal convolutions (Hu et al. 2022). Experimental results have shown that TCN outperforms RNN and LSTM networks in predicting longer time series (Yan et al. 2020) because its dilated convolutions and residual modules increase the network’s receptive field and capture more historical information. The proposed model exploits this characteristic, so the following sections detail the dilated convolution and residual modules of TCN. The fully convolutional structure and the causal convolutions ensure that the input and output sequence lengths are the same and that no future information “leaks” into past predictions; these two parts are not discussed further in this paper. Dilated convolutions achieve exponentially large receptive fields. The receptive field can be understood as the maximum number of steps the network looks back from the current data at time T. For example, with kernel size k=2 and dilation rates d=[1,2,4], the input is sampled at spaced intervals during convolution. Figure 1 illustrates the structure diagram of a dilated convolution, which is formulated as follows (Bai et al. 2018):

$$\begin{aligned} F(s)=(x*_{d}f)(s)=\sum _{i=0}^{k-1}f(i)\cdot x_{s-d\cdot i} \end{aligned}$$
(4)

where \(x \in R^{n}\) is the one-dimensional input sequence, f is the convolution kernel of size k, d is the dilation factor, and s indexes an element of the sequence. The receptive field’s formula is as follows:

$$\begin{aligned} R=(k-1)*(\sum _{i}d_{i}+1) \end{aligned}$$
(5)

Therefore, to increase the receptive field, one can enlarge the convolution kernel size k or increase the dilation factor d. However, as the network becomes deeper, this strategy increases the computational cost and the risk of gradient explosion and gradient vanishing (Li et al. 2022a).
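As a small sketch of Eq. 5, the helper below reproduces the receptive-field arithmetic for the configuration in Fig. 1 (k=2, d=[1,2,4]):

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of a stack of dilated convolutions (Eq. 5)."""
    return (kernel_size - 1) * (sum(dilations) + 1)

print(receptive_field(2, [1, 2, 4]))  # -> 8 steps of history per output
```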

Fig. 2 The structure diagram of the GRU model and bi-GRU model

To solve these problems, residual modules are added to the TCN, with Fig. 3 presenting the updated structure diagram. Each residual module comprises two identical layers, each consisting of a dilated causal convolution, weight normalization, ReLU, and dropout (the dropout is not used in this work). A \(1 \times 1\) convolution layer ensures that the input and output of the residual connection have the same dimensions. The output o for input i is given in Eq. 6 (Bai et al. 2018), and the receptive field R of a TCN with N residual blocks is given in Eq. 7 (Bai et al. 2018). The factor 2 appears because each block contains two convolutional layers, and the simplification in Eq. 7 assumes dilation factors \(d_{i}=2^{i}\) for \(i=0,\ldots ,n-1\).

$$\begin{aligned} o=Activation(i+F(i)) \end{aligned}$$
(6)
$$\begin{aligned} R=2*(k-1)*N*\left( \sum _{i=0}^{n-1}d_{i}+1\right) =N*(k-1)*2^{n+1} \end{aligned}$$
(7)
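The following Keras sketch illustrates one such residual block under stated simplifications: weight normalization (used in the paper) is omitted, dropout is skipped as in this work, and the layer sizes are illustrative rather than the authors’ exact configuration.

```python
from tensorflow.keras import layers

def residual_block(x, filters: int, kernel_size: int, dilation: int):
    """One TCN residual block: two dilated causal convolutions plus a skip
    connection, with the output formed as in Eq. 6."""
    skip = x
    for _ in range(2):  # the factor 2 in Eq. 7: two conv layers per block
        x = layers.Conv1D(filters, kernel_size, padding="causal",
                          dilation_rate=dilation)(x)
        x = layers.Activation("relu")(x)
    if skip.shape[-1] != filters:  # 1x1 convolution to match dimensions
        skip = layers.Conv1D(filters, 1)(skip)
    return layers.Activation("relu")(layers.Add()([x, skip]))
```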

Bi-GRU neural network

Compared to the input, forget, and output gates in LSTM neural networks, the GRU neural network is streamlined and has only a reset gate and an update gate. The reset gate preserves useful past information and captures short-term dependencies in the time series, while the update gate combines past and current information and captures long-term dependencies. The simple structure of the GRU reduces processing and training time (Zhang et al. 2022b). The structure diagram of the GRU is illustrated in Fig. 2, where \(H_{t-1}\) represents the hidden state of the previous timestep, \(H_{t}\) is the hidden state of timestep t, \(\tilde{H}_{t}\) is the candidate hidden state of timestep t, and \(R_{t}\) and \(Z_{t}\) represent the reset gate and update gate, respectively. W denotes the weight parameters, and b denotes the bias parameters. The GRU core formulas are as follows (Cho et al. 2014):

$$\begin{aligned} R_{t}=\sigma (X_{t}W_{xr}+H_{t-1}W_{hr}+b_{r}) \end{aligned}$$
(8)
$$\begin{aligned} Z_{t}=\sigma (X_{t}W_{xz}+H_{t-1}W_{hz}+b_{z}) \end{aligned}$$
(9)
$$\begin{aligned} \tilde{H}_{t}=tanh(X_{t}W_{xh}+(R_{t}* H_{t-1})W_{hh}+b_{h}) \end{aligned}$$
(10)
$$\begin{aligned} H_{t}=Z_{t}* \tilde{H}_{t}+(1-Z_{t})* {H_{t-1}} \end{aligned}$$
(11)
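A plain NumPy transcription of Eqs. 8–11 makes the gating explicit; the parameter shapes are assumed to be (input_dim, hidden) for the \(W_{x\cdot }\) terms and (hidden, hidden) for the \(W_{h\cdot }\) terms.

```python
import numpy as np

def gru_step(x_t, h_prev, W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h):
    """One GRU update following Eqs. 8-11."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    r = sigmoid(x_t @ W_xr + h_prev @ W_hr + b_r)              # reset gate, Eq. 8
    z = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)              # update gate, Eq. 9
    h_tilde = np.tanh(x_t @ W_xh + (r * h_prev) @ W_hh + b_h)  # candidate, Eq. 10
    return z * h_tilde + (1.0 - z) * h_prev                    # new state, Eq. 11
```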

The bi-GRU neural network comprises a forward and a backward GRU (Ortega-Bueno et al. 2019), one processing the time series in its original order and the other in reverse. The structure diagram is presented in Fig. 2, and the output is formulated in Eq. 12. Considering that the \( PM_{2.5}\) concentration is highly correlated with the concentrations at the previous and subsequent moments, training the network in both the forward and backward directions with a bi-GRU improves its predictive accuracy.

$$\begin{aligned} H_{t}=(\overleftarrow{H_{t}} + \overrightarrow{H_{t}})/2 \end{aligned}$$
(12)
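In Keras, the averaging in Eq. 12 corresponds to the “ave” merge mode of a bidirectional wrapper; the unit count below is illustrative.

```python
from tensorflow.keras import layers

# merge_mode="ave" averages the forward and backward hidden states (Eq. 12);
# "sum" and "concat" are common alternatives.
bigru = layers.Bidirectional(layers.GRU(50), merge_mode="ave")
```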

TCN-biGRU neural network

Figure 3 illustrates the main network architecture of the proposed TCN-biGRU prediction model. The Dense 1 layer changes the shape of the TCN output from (batch_size, nb_filters) to (batch_size, timesteps * input_dim), where nb_filters is the number of filters used in the convolution layers. It should be noted that this work does not employ dropout in the residual blocks, as our trials revealed that it did not significantly improve the model’s performance. The TCN-biGRU neural network has a large receptive field and extracts features from both the forward and the reversed time series, enhancing \( PM_{2.5}\) concentration prediction.
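A minimal sketch of this architecture is given below, assuming the keras-tcn package for the TCN layer; the reshaping after Dense 1 and all layer sizes and dilations follow our reading of the description above rather than the authors’ released configuration.

```python
from tensorflow.keras import layers, models
from tcn import TCN  # keras-tcn package, assumed here

timesteps, input_dim = 4, 5  # values selected in the experiments below

inp = layers.Input(shape=(timesteps, input_dim))
x = TCN(nb_filters=50, kernel_size=2, dilations=[1, 2, 4],
        return_sequences=False)(inp)               # -> (batch_size, nb_filters)
x = layers.Dense(timesteps * input_dim)(x)         # the "Dense 1" layer
x = layers.Reshape((timesteps, input_dim))(x)      # back to a sequence for the bi-GRU
x = layers.Bidirectional(layers.GRU(50), merge_mode="ave")(x)
out = layers.Dense(1)(x)                           # predicted PM2.5 concentration

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```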

Fig. 3 The structure diagram of the TCN-biGRU prediction model

Experiment

The deep learning models employed in this paper are built using the Keras framework and the Python programming language. All experiments are conducted on a 64-bit Windows 10 operating system with an Intel Core i7-8750H CPU.

Data preprocessing

Data source

To verify the prediction accuracy of the proposed TCN-biGRU model, Beijing air quality data collected and released by the National Environmental Monitoring Station were used for time series prediction. The dataset covers hourly data from January 1st, 2021, to December 31st, 2021, and includes 7 features: \( PM_{2.5}\) (\( \mu g/m^{3} \)), \( PM_{10}\) (\( \mu g/m^{3} \)), AQI, CO (\( mg/m^{3} \)), \( NO_{2}\) (\( \mu g/m^{3} \)), \( O_{3}\) (\( \mu g/m^{3} \)), and \( SO_{2}\) (\( \mu g/m^{3} \)). For each pollutant, if three or more consecutive samples (rows) are missing, the rows with missing attribute values are deleted; if fewer than three consecutive values are missing, they are completed with the mean, as sketched below. We use 80% of the data as the training set and 20% as the test set of the neural network.
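A pandas sketch of this cleaning rule follows; the file name is hypothetical, and filling short gaps with the column mean is our reading of “the mean is used for completion”.

```python
import pandas as pd

df = pd.read_csv("beijing_air_quality_2021.csv")  # hypothetical file name

# Identify runs of consecutive rows with missing values.
missing = df.isna().any(axis=1)
run_id = (missing != missing.shift()).cumsum()
run_len = missing.groupby(run_id).transform("size")

# Drop runs of three or more missing rows; mean-fill the shorter gaps.
df = df[~(missing & (run_len >= 3))]
df = df.fillna(df.mean(numeric_only=True))

# Chronological 80/20 split: the test set follows the training set in time.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```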

Correlation analysis

The correlation between \( PM_{2.5}\) concentration and the concentrations of the other pollutants is verified using the Pearson correlation coefficients between the \( PM_{2.5}\) concentration and the remaining six pollutant concentrations. Figure 4 depicts the corresponding heatmap, highlighting that \( PM_{2.5}\) concentration correlates positively with \( PM_{10}\), AQI, \( NO_{2}\), and \( SO_{2}\) concentrations and negatively with \( O_{3}\) concentration. The Pearson correlation coefficient ranges from \(-\)1 to 1, with each value mapped to a specific color (shown on the right side of the figure): the color becomes lighter as the coefficient approaches 1 and darker as it approaches \(-\)1. A larger absolute value of the coefficient indicates a stronger correlation. The correlations with CO and \( O_{3}\) are relatively small; therefore, the neural network inputs are \( PM_{2.5}\), \( PM_{10}\), AQI, \( NO_{2}\), and \( SO_{2}\). Note that the same network structure and parameters are used in a comparative experiment against using only the \( PM_{2.5}\) concentration as input; the corresponding results are presented in Sect. 3.3.

Fig. 4 A heatmap of the correlation between \( PM_{2.5}\) concentration and other pollutant concentrations

Autocorrelation analysis

The suitability of the developed bi-GRU neural network is demonstrated by analyzing the autocorrelation of the \( PM_{2.5}\) concentration data. The results are depicted in Fig. 5, where the vertical axis is the autocorrelation coefficient, the horizontal axis is the lag k, and the blue area is the confidence interval. Figure 5 shows that the \( PM_{2.5}\) concentration is highly correlated with itself and that the autocorrelation coefficient gradually decreases as the lag k increases: it is still around 0.9 when k=4, 0.8 when k=6, 0.7 when k=12, 0.6 when k=18, and 0.5 when k=28. The lag k provides a basis for the timestep values in Section 3.5.1. In addition, the bi-GRU neural network obtains more feature information from the forward and reverse sequences. Therefore, our model uses a bi-GRU neural network and fuses it with the TCN neural network.

Fig. 5 Results of the autocorrelation analysis of \( PM_{2.5}\) concentration

Determination of model parameters

Because the activation function type and the neuron count play critical roles in the accuracy of artificial neural network models, all other parameters were kept constant while the effects of the activation function and the neuron count on the model were investigated separately. Four activation functions (linear, tanh, sigmoid, and ReLU) were selected for experimentation. For the neuron count analysis, the TCN network was chosen as the baseline: while maintaining the same structure for the bi-GRU neural network, the neuron count of the TCN network was varied within the bounds of its predictive capacity. After reaching the TCN’s predictive limit, the TCN-biGRU model was established on this baseline structure.

Fig. 6 Prediction results of the model when timesteps=4

Data standardization

The Z-score normalization method is utilized to unify the different scales of the data and improve comparability. This common method standardizes the dataset to a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation; here, the overall mean and standard deviation of the entire dataset are used, which mitigates the impact of outliers. This technique enhances model predictive performance (Tanaka et al. 2022). The Z-score is formulated as follows:

$$\begin{aligned} Z=\frac{x-\mu }{\sigma } \end{aligned}$$
(13)

where \(\mu \) and \(\sigma \) are the average and the standard deviation of all data, respectively.
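A short NumPy sketch of Eq. 13 follows; per the description above, the statistics are computed over the entire dataset, and the random array is merely a placeholder for the pollutant matrix.

```python
import numpy as np

def zscore(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Eq. 13: subtract the mean and divide by the standard deviation."""
    return (x - mu) / sigma

data = np.random.default_rng(0).normal(50.0, 20.0, (8760, 5))  # placeholder
mu, sigma = data.mean(axis=0), data.std(axis=0)  # whole-dataset statistics
data_std = zscore(data, mu, sigma)               # mean 0, standard deviation 1
```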

Fig. 7 Prediction results of the model when timesteps=6

Fig. 8 Prediction results of the model when timesteps=12

Evaluation criteria

Our model’s prediction accuracy is evaluated based on the mean squared error (MSE), used as the loss function, and the MAE, RMSE, and \(R^{2}\) evaluation metrics. These metrics are commonly used to evaluate the variability and accuracy of data, with the corresponding formulas presented in Eqs. 14–17, where N is the number of samples, \(y_{i}\) is the actual value, \(\hat{y}_{i}\) is the predicted value, and \(\bar{y}_{i}\) is the average of the actual values. Note that the smaller the MSE, MAE, and RMSE, the higher the model’s accuracy; moreover, the closer \(R^{2}\) is to 1, the higher the accuracy.

$$\begin{aligned} MSE=\frac{1}{N} \sum _{i=1}^{N}\left( y_{i}-\hat{y}_{i}\right) ^{2} \end{aligned}$$
(14)
$$\begin{aligned} MAE=\frac{1}{N} \sum _{i=1}^{N}\mid y_{i}-\hat{{y}_{i}} \mid \end{aligned}$$
(15)
$$\begin{aligned} RMSE=\sqrt{\frac{1}{N} \sum _{i=1}^{N}\left( y_{i}-\hat{y}_{i}\right) ^{2}} \end{aligned}$$
(16)
$$\begin{aligned} R^{2}=1-\frac{\sum _{i}\left( \hat{y}_{i}-y_{i}\right) ^{2}}{\sum _{i}\left( \bar{y}_{i}-y_{i}\right) ^{2}} \end{aligned}$$
(17)
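The four metrics translate directly into NumPy; the sketch below follows Eqs. 14–17 as written.

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MSE, MAE, RMSE, and R^2 following Eqs. 14-17."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true.mean() - y_true) ** 2),
    }
```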
Fig. 9 Prediction results of the model when timesteps=18

Fig. 10 Prediction results of the model when timesteps=28

Model prediction experiments

Determination of timesteps and input data

Table 1 Performance comparison of models with different timestep values
Table 2 Performance comparison of models with different input dimension values
Table 3 Predictive results of different activation function models
Table 4 Predictive results of models with different neuron counts
Fig. 11 With the increase in the number of iterations, the loss values of the three models stabilize after about 150 iterations

The “timesteps” parameter indicates how many consecutive timestamps of input data are included in each model input, i.e., how many previous data points are considered relevant to the current one. The value of timesteps is therefore determined by referring to the lag order k in the autocorrelation analysis (Section 3.1.3). Timesteps is an important hyperparameter in neural networks, and an appropriate value can improve the accuracy of time series prediction models. To select a suitable value, comparative experiments were conducted with the timesteps set to 4, 6, 12, 18, and 28 while keeping the other network parameters the same. The prediction results for the different timestep values are depicted in Figs. 6, 7, 8, 9, and 10 (partial results are displayed), revealing that with timesteps of 4, the model tracks the \( PM_{2.5}\) concentration during dramatic changes better than with the other four values. The experimental results reported in Table 1 confirm that the proposed model performs best with timesteps of 4 while keeping the computation and training time relatively short; therefore, 4 timesteps are selected. Additionally, two input setups are compared, one using \( PM_{2.5}\), \( PM_{10}\), AQI, \( NO_{2}\), and \( SO_{2}\) and one using only \( PM_{2.5}\), to verify that the network’s prediction accuracy is higher with the five inputs. The input dimension is modified while preserving the other parameters, with the corresponding results reported in Table 2. The results show that the MAE is slightly better when input_dim=1, but the RMSE and \(R^{2}\) are better when input_dim=5; hence, to enhance the model’s accuracy, input_dim=5 is chosen. This observation indicates that insufficiently comprehensive input parameters can impair the accuracy of the model’s predictions. The sliding-window construction of the model inputs is sketched below.
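For clarity, the sketch below shows one common way to build such windows, assuming the standardized pollutant matrix has \( PM_{2.5}\) in its first column; the exact windowing used in the experiments may differ.

```python
import numpy as np

def make_windows(series: np.ndarray, timesteps: int = 4):
    """Slice a (n_samples, n_features) array into inputs of length
    `timesteps` and next-hour PM2.5 targets (assumed in column 0)."""
    X, y = [], []
    for i in range(len(series) - timesteps):
        X.append(series[i : i + timesteps])
        y.append(series[i + timesteps, 0])
    return np.array(X), np.array(y)

data_std = np.random.default_rng(0).normal(0.0, 1.0, (8760, 5))  # placeholder
X, y = make_windows(data_std, timesteps=4)
print(X.shape, y.shape)  # (8756, 4, 5) (8756,)
```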

Determination of activation functions and neuron count

To achieve optimal performance and more accurate predictions, we separately investigated the impact of different activation functions and neuron counts on the model’s accuracy. While keeping the other network parameters consistent, four commonly used activation functions were explored: linear, tanh, sigmoid, and ReLU. The experimental results presented in Table 3 reveal that the model performs best when the ReLU activation function is chosen.

For the selection of the neuron count, five cases were evaluated: 32, 50, 64, 100, and 128, as shown in Table 4. The results indicate that as the number of neurons in the TCN hidden layers increases, the prediction performance first improves significantly and then gradually decreases. The model’s performance is optimal when the neuron count is set to 50; therefore, 50 neurons are used in the experiments.

Prediction results

The proposed model is compared against the LSTM and GRU neural networks to validate its accuracy. Since the LSTM, GRU, and TCN-biGRU neural networks converged after about 150 iterations (see Fig. 11), 150 epochs are set for each model. The prediction results of all models are reported in Table 5, and some of them are visualized in Fig. 12.

Furthermore, the proposed model is compared against the bi-GRU and TCN-GRU neural networks for further validation. On the test data, the TCN-biGRU neural network achieved a lower mean absolute error and root mean square error and a higher \(R^{2}\) than the competing networks. Figure 12 reveals that the TCN-biGRU predictions follow the trend of the actual values and outperform the competing networks when the data exhibit larger variations. Additionally, the results demonstrate that the TCN-biGRU neural network converged significantly faster than the LSTM and GRU neural networks. Therefore, the proposed network can be effectively applied to \( PM_{2.5}\) concentration prediction.

Furthermore, to validate the effectiveness of the proposed model, its predictive results are compared with those of two different hybrid models. The predictive outcomes of the CNN-LSTM model were sourced from Xie et al. (2023), and those of the LSTM-Attention model from Gao and Li (2022), ensuring consistency in the dataset and model settings. The results presented in Table 6 reveal that the TCN-biGRU neural network outperforms both models. Hence, the TCN-biGRU model is well suited to enhancing the accuracy of \( PM_{2.5}\) concentration prediction.

Table 5 Performance comparison of five different neural networks under the same conditions
Fig. 12 The prediction results of \( PM_{2.5}\) concentration using five types of neural networks

Discussion and conclusion

Discussion

This study introduces the TCN-biGRU neural network for predicting \( PM_{2.5}\) concentration, with various factors and parameters influencing the model’s accuracy. The paper investigates the impact of single-input versus multiple-input variables on predictive accuracy, as well as the influence of neural network hyperparameters such as the timesteps, activation function, and number of neurons.

Table 2 highlights that using five input variables yields a prediction result with a 0.76 lower RMSE and a 0.009 higher \( R^{2}\) compared to using only \( PM_{2.5}\) concentration as input. Table 1 reveals that the predictive accuracy is highest when timesteps=4, with MAE and RMSE reduced by 0.74 and 2.03, respectively, and \( R^{2}\) improved by 0.023, compared to the other scenarios. Table 3 demonstrates that using the ReLU activation function yields the largest reduction in MAE and RMSE, by 1.28 and 2.98, respectively, compared to the reference experiment. The impact of the TCN neuron count on predictive accuracy is presented in Table 4, where using 50 neurons yields the maximum reduction in MAE and RMSE, by 0.46 and 1.22, respectively, compared to the reference experiment. Both the input variables and the hyperparameters influence the model’s accuracy; among these, the choice of timesteps and activation function has a more pronounced impact on the model’s output than the input variables and the number of neurons.

Table 6 Performance comparison of two different neural networks under the same conditions

Figure 12 reveals that all five models can effectively track the variations in the \( PM_{2.5}\) concentration data. However, when confronted with larger fluctuations, the TCN-biGRU neural network exhibits superior tracking performance compared to the other four models. Additionally, comparing Tables 5 and 6 shows that fused neural networks like CNN-LSTM and LSTM-Attention achieve better predictive accuracy than standalone LSTM and GRU networks, although a gap remains relative to the TCN-biGRU network proposed in this study. From the above discussion, it is evident that the TCN-biGRU neural network can be utilized for predicting \( PM_{2.5}\) concentration.

Conclusion

This paper introduces the TCN-biGRU model for predicting \( PM_{2.5}\) concentration in Beijing, combining a temporal convolutional network (TCN) with a bidirectional gated recurrent unit (bi-GRU) neural network. The model considers the relationship between other pollutant features and \( PM_{2.5}\) concentration, as well as the impact of the autocorrelation of \( PM_{2.5}\) concentration on predictions. It also investigates the influence of the input parameters, neuron counts, and activation functions on prediction accuracy. The experimental results show that both the input variables of the neural network and its hyperparameters influence the model’s accuracy. Furthermore, the TCN-biGRU model is compared with other prediction models, and the results indicate that it outperforms the comparative models with MAE, RMSE, and \( R^{2}\) values of 4.19, 8.13, and 0.955, respectively. This research offers valuable insights for \( PM_{2.5}\) concentration prediction and environmental control.