1 Introduction

With the rapid urbanization and industrialization of China in recent years, large amounts of harmful substances have been released into the atmosphere, and increasing attention has been paid to data on air pollutants such as carbon monoxide (CO), carbon dioxide (CO2), sulfur dioxide (SO2), methane (CH4), nitrogen oxides (NOx), ozone (O3), and particulates (PM2.5 and PM10). These harmful substances degrade urban air quality and pose a great threat to human health [24, 47]. Many regions have suffered from serious air pollution, especially Beijing, Tianjin, Hebei, and Shandong province in China [51, 56].

PM2.5 and nitrogen dioxide (NO2), as dominant pollutants, have attracted wide attention [13, 15]. PM2.5 refers to particulate matter with an aerodynamic diameter of 2.5 μm or less. It consists of toxic and hazardous substances with high activity, and it is characterized by a long residence time and a long transport distance in the atmosphere [43]. The sources of PM2.5 include fuel combustion in motor vehicles (automobiles, buses, and trucks), power plants, wood burning, and industrial processes. It is also formed in the atmosphere when gases such as SO2, NOx, and volatile organic compounds are transformed by chemical reactions. NO2 is a poisonous, reddish-brown gas with a pungent odor at room temperature. Its participation in photochemical reactions catalyzes ozone production and thus leads to photochemical smog pollution. NO2 is mainly derived from fossil fuel and biomass burning, soil emissions, and lightning. Among these, anthropogenic sources account for the larger proportion, including motor vehicle emissions, power plants, and other industrial sources [11, 27]. Numerous studies [8, 9, 40, 55] have shown that exposure to high levels of NO2 and PM2.5 leads to breathing difficulty and to lung and cardiovascular diseases, and that these pollutants contribute to acid deposition and damage to ecosystems. To provide early warning of air quality changes and to protect human health and the environment, an effective and accurate model for short- and long-term forecasts of PM2.5 and NO2 concentrations is necessary [48, 53, 54].

Forecasting methods can be divided into three main categories: numerical methods, statistical methods, and artificial intelligence (AI)-based methods [21]. A large number of numerical models [20, 44, 50], such as the box model, Gaussian model, Lagrangian model, and Euler model, have been used for air pollutant concentration forecasting. These models simulate the physical and chemical processes in the atmosphere and are also called atmospheric dispersion models. However, their use is restricted under many operational conditions because they require accurate and detailed data on meteorology, terrain geomorphology, pollution sources, and so on [6, 37]. Among the statistical methods, the multiple regression model [13, 45], grey model [34], Kalman filter techniques [38], and autoregressive moving average (ARMA) model [25] have been widely used to forecast air pollutant levels; such models can be generalized and are consistent with actual observations. However, because air pollutant concentrations are strongly nonlinear, it is difficult to improve the prediction accuracy with the abovementioned methods [32].

In recent decades, AI-based methods have attracted growing interest in air pollutant concentration forecasting. Among them, the artificial neural network (ANN) and the support vector machine (SVM) are particularly popular. ANN is good at solving nonlinear problems and is considered a promising forecasting tool [4, 16, 41]. Moustris et al. [29] presented an ANN to forecast the maximum daily value of a pollutant index in Athens, Greece. The results indicated that ANN could give reliable forecasts of air quality. Gennaro et al. [10] proposed an ANN to forecast daily PM10 concentration at a regional site and an urban site. The results showed that ANN could be a powerful tool for obtaining real-time information on air quality status. Feng et al. [15] introduced a novel hybrid model combining air mass trajectory analysis and wavelet transformation to improve the accuracy of ANN for PM2.5 concentration forecasting. The air mass trajectory analysis was applied to recognize different transport corridors, and the wavelet transformation was used to deal with the fluctuation of PM2.5 concentration. Nevertheless, ANN suffers from a number of weaknesses, such as overfitting, convergence to local minima, the difficulty of choosing the network structure, and the need for a large amount of training data. These weaknesses make ANN difficult to apply to some forecasting problems [2, 41].

SVM was proposed on the basis of statistical learning theory and overcomes these shortcomings of the ANN model [39]. It employs the structural risk minimization principle, instead of the empirical risk minimization principle, to obtain the global optimum. Originally, SVM was applied to pattern classification. With the introduction of the ε-insensitive loss function, SVM was gradually extended to nonlinear regression estimation and time series prediction problems [17, 46], in which case it is called support vector regression (SVR). Ortiz-García et al. [32] established an SVR model to forecast hourly O3 concentration in the Madrid urban area, with the model parameters optimized by an improved grid search method. The findings showed that the SVR model was superior to the multi-layer perceptron. Yeganeh et al. [49] used a hybrid model based on partial least squares (PLS) and SVM to forecast hourly and daily CO concentration. The results indicated that this hybrid model predicted faster and more accurately. Moazami et al. [28] applied an SVR model to predict the next-day CO concentrations in the Tehran metropolitan area; the results showed that SVR has less uncertainty in CO prediction than the adaptive neuro-fuzzy inference system (ANFIS) and ANN models.

However, some shortcomings still exist in these studies. On the one hand, the original time series of air pollutant concentration is highly nonlinear and time-varying. A fixed SVR model has difficulty adapting to this feature, whereas an online SVR model can be updated dynamically; therefore, an online SVR model based on a re-modeling method is used to predict air pollutant concentration in this study. On the other hand, the performance of the SVR model is greatly affected by three parameters (penalty factor C, kernel parameter σ, and insensitive coefficient ε). Traditional selection methods, such as grid search, cross validation, and gradient descent, are limited by low computational efficiency and poor accuracy [2, 19]. Therefore, it is necessary to overcome these shortcomings. A heuristic algorithm is a kind of local optimization algorithm based on intuition or experience; such algorithms are applied in many fields, such as the optimization of neural networks [30] and of scheduling problems [35, 36]. Heuristic algorithms have also shown great advantages in parameter selection. Several of them have been applied to select parameters, such as the genetic algorithm [14], immune algorithm [26], and simulated annealing algorithm [33]. However, compared with the particle swarm optimization (PSO) algorithm, these methods search slowly and with poor accuracy in multi-dimensional optimization problems [5, 12]. The PSO algorithm was introduced by Kennedy and Eberhart [22]; it is equipped with a memory mechanism and has a simple structure. Therefore, it is more suitable for selecting the SVR parameters [12, 52].

In order to prevent premature convergence and local minimum of the standard PSO algorithm, a quantum-behaved PSO algorithm (QPSO) is applied, and a hybrid QPSO-SVR model is established to forecast PM2.5 and NO2 concentration. At the same time, in order to select the optimal prediction method, the recursive multi-step prediction, direct multi-step prediction, and online direct multi-step prediction methods are compared to predict PM2.5 concentration in three selected months.

The rest of this paper is organized as follows: Section 2 describes the preliminary knowledge of mathematics, including SVR method, QPSO algorithm, and multi-step prediction method. Section 3 shows the data required for the experiment, and the experiment results for PM2.5 and NO2 concentration prediction. Finally, conclusions are given in Section 4.

2 Preliminary Knowledge of Mathematics Algorithm and Model

2.1 Support Vector Regression Model

Support vector machine (SVM) was developed on the basis of statistical learning theory [1]. In 1992, Boser, Guyon, and Vapnik first proposed the optimal boundary learning theory in a conference paper on computational learning, which was the initial form of SVM. In 1995, Vapnik presented the complete SVM learning algorithm; it has outstanding theoretical advantages, realizes the nonlinear mapping to a high-dimensional space by means of a kernel function, and has been used to solve nonlinear classification and regression estimation problems.

In conventional ε-support vector regression (ε-SVR) algorithm, the basic idea is to map the input vector into a high-dimensional feature space via a nonlinear mapping function. The structure risk minimization principle is applied to construct the optimal decision function in the feature space so that the relationship between the input and the output is approximated. Given the data set {(xi,yi),i = 1,2,...,l} (xi is the input vector, yi is the desired value, l is the number of samples), the regression estimation can be performed by the following formula:

$$ f(\mathbf{x})=\omega^{T}\phi(\mathbf{x})+b $$
(1)

where ω and b are the coefficients to be adjusted, and ϕ(x) is a mapping function of the input vector in the high-dimensional space. These can be estimated by minimizing the structure risk function described as follows:

$$ R(f)=\frac{1}{2}\|\omega\|^{2}+C\sum\limits_{i = 1}^{l} L_{\varepsilon}(y_{i},f(\mathbf{x}_{i})) $$
(2)

where \(\frac {1}{2}\|\omega \|^{2}\) is used as a measure of function smoothness, and C is a regularization constant determining the trade-off between the model complexity and the generalization ability. The ε-insensitive loss function is denoted by Lε and is defined as follows:

$$ L_{\varepsilon}(y_{i},f(\mathbf{x}_{i}))=\left\{\begin{array}{ll} 0&|\text{y}_{i}-f(\mathbf{x}_{i})|<\varepsilon,\\ |y_{i}-f(\mathbf{x}_{i})|-\varepsilon&|\text{y}_{i}-f(\mathbf{x}_{i})|\geq\varepsilon \end{array}\right. $$
(3)

where y and f(x) are the observed and predicted values, respectively. This function is used to penalize the training error between y and f(x). The above problem of finding ω and b can be expressed as a convex quadratic programming problem, which is described as follows:

$$ \left\{\begin{array}{ll} \min\limits_{\omega ,b}\ \frac{1}{2}\|\omega\|^{2}+C\sum\limits_{i = 1}^{l} (\xi_{i} + \xi_{i}^{*})\\ s.t.\left\{\begin{array}{lll} y_{i}-\omega^{T}\phi(\mathbf{x}_{i})-b\le\varepsilon+\xi_{i}&{i = 1,2,\cdots,l}\\ -y_{i}+\omega^{T}\phi(\mathbf{x}_{i})+b\le\varepsilon+\xi_{i}^{*}&{i = 1,2,\cdots,l}\\ \xi_{i}\ge 0,\xi_{i}^{*}\ge 0&{i = 1,2,\cdots,l} \end{array}\right. \end{array}\right. $$
(4)

where ε defines the error requirement of regression function, which determines the number of support vectors and guarantees the sparseness of the solution. The slack variables \(\xi _{i},\xi _{i}^{*}\) are used to control the upper and lower bounds of the output.
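As a minimal sketch of Eqs. (2) and (3), the following Python functions evaluate the ε-insensitive loss and the structural risk for given predictions. The function names and the numbers in the usage note are illustrative assumptions, not part of the original study.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Eq. (3): zero inside the eps-tube, linear outside it."""
    residual = abs(y - f_x)
    return 0.0 if residual < eps else residual - eps


def structural_risk(w, targets, predictions, C, eps):
    """Eq. (2): 0.5 * ||w||^2 + C * sum of eps-insensitive losses."""
    smoothness = 0.5 * sum(w_k * w_k for w_k in w)
    empirical = sum(eps_insensitive_loss(y, f, eps)
                    for y, f in zip(targets, predictions))
    return smoothness + C * empirical
```

For example, with ε = 0.1 a residual of 0.05 costs nothing, while a residual of 0.5 contributes 0.4 to the sum that is weighted by C.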

In order to solve the above quadratic programming problem, the Lagrange function is introduced. In this case, the dual form of optimization problem is described as follows:

$$ \left\{\begin{array}{ll} \max\limits_{\alpha ,\alpha^{*}}[-\frac{1}{2}\sum\limits_{i = 1}^{l} \sum\limits_{j = 1}^{l} (\alpha_{i} - \alpha_{i}^{*})(\alpha_{j} - \alpha_{j}^{*})K(\mathbf{x}_{i},\mathbf{x}_{j}) - \sum\limits_{i = 1}^{l} (\alpha_{i} + \alpha_{i}^{*}) \varepsilon + \sum\limits_{i = 1}^{l} (\alpha_{i} - \alpha_{i}^{*})y_{i}]\\ s.t.\left\{\begin{array}{lll} \sum\limits_{i = 1}^{l} (\alpha_{i} - \alpha_{i}^{*}) = 0 \\ 0 \le \alpha_{i} \le C\\ 0 \le \alpha_{i}^{*} \le C \end{array}\right. \end{array}\right. $$
(5)

where αi and \(\alpha _{i}^{*}\) are the Lagrange multipliers. The function K(xi,xj) = ϕ(xi)ϕ(xj) is the kernel function and can be replaced by any function satisfying Mercer’s condition. A common choice for this kernel function is the radial basis function (RBF):

$$ K(\mathbf{x}_{i},\mathbf{x}_{j})=exp\left( -\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{\sigma^{2}}\right) $$
(6)

where σ is the width of RBF; it reflects the degree of correlation between support vectors. The impact of support vector is too strong to achieve sufficient accuracy if σ is too large; in contrast, the support vector is relatively loose if σ is too small, and the model is relatively complex.
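The kernel of Eq. (6) can be sketched in a few lines of Python. Note that the denominator here is σ², as written in Eq. (6); in libraries that parameterize the RBF kernel as exp(−γ‖xi − xj‖²), this corresponds to γ = 1/σ². The helper names are assumptions made for illustration.

```python
import math


def rbf_kernel(x_i, x_j, sigma):
    """Eq. (6): exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / sigma ** 2)


def kernel_matrix(samples, sigma):
    """Pairwise kernel values K(x_i, x_j) for a list of input vectors."""
    return [[rbf_kernel(a, b, sigma) for b in samples] for a in samples]
```

A larger σ makes the kernel value decay more slowly with distance, so distant samples are treated as more similar.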

By solving the optimization problem described above, the coefficients of Eq. (1) can be found as the following:

$$ \omega^{*}=\sum\limits_{i = 1}^{l} (\alpha_{i}-\alpha_{i}^{*})\phi(\mathbf{x}_{i}) $$
(7)
$$\begin{array}{@{}rcl@{}} b^{*} &=& \frac{1}{N_{nsv}}\left\{\sum\limits_{0 < \alpha_{i} < C} [y_{i} - \sum\limits_{\mathbf{x}_{j} \in SV} (\alpha_{j} - \alpha_{j}^{*}) K(\mathbf{x}_{i},\mathbf{x}_{j}) - \varepsilon ]\right.\\&&\left. + \sum\limits_{0 < \alpha_{i}^{*} < C} [y_{i} - \sum\limits_{\mathbf{x}_{j} \in SV} (\alpha_{j} - \alpha_{j}^{*}) K(\mathbf{x}_{i},\mathbf{x}_{j}) + \varepsilon ] \right\} \end{array} $$
(8)

where Nnsv is the number of normal support vectors, and SV is the set of support vectors. The regression function is then given by:

$$\begin{array}{@{}rcl@{}} f(\mathbf{x}) = {\omega^ * }\phi (\mathbf{x}) + {b^ * } &=& \sum\limits_{i = 1}^{l} {({\alpha_{i}} - \alpha_{i}^ * )\phi ({\mathbf{x}_{i}})\phi (\mathbf{x}) + {b^ * }}\\ &=& \sum\limits_{i = 1}^{l} {({\alpha_{i}} - \alpha_{i}^ * )K({\mathbf{x}_{i}},\mathbf{x}) + {b^ * }} \end{array} $$
(9)
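To make Eq. (9) concrete, the sketch below evaluates the regression function from a set of support vectors and their dual coefficients (αi − αi*). It assumes the RBF kernel of Eq. (6); the support vectors, coefficients, and bias in the usage note are hypothetical values, not fitted results.

```python
import math


def rbf_kernel(x_i, x_j, sigma):
    """Eq. (6): exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / sigma ** 2)


def svr_predict(x, support_vectors, dual_coefs, b, sigma):
    """Eq. (9): f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.

    dual_coefs[i] holds the difference (alpha_i - alpha_i^*).
    """
    weighted = sum(coef * rbf_kernel(sv, x, sigma)
                   for sv, coef in zip(support_vectors, dual_coefs))
    return weighted + b
```

With hypothetical support vectors `[[0.0], [1.0]]`, coefficients `[0.5, -0.5]`, `b = 0.2`, and `sigma = 1.0`, the prediction at `x = [0.0]` is 0.5 − 0.5·exp(−1) + 0.2. Only samples with nonzero coefficients contribute, which is why the solution is sparse.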

The fixed ε-SVR builds a model from the existing sample data and then predicts unknown values with this fixed model. For highly nonlinear and time-varying data, however, it is difficult for the fixed SVR model to adapt to such characteristics, which decreases the prediction accuracy. Therefore, an online SVR model based on a re-modeling method is proposed to overcome this shortcoming. The main idea of this approach is to re-establish the SVR model on the basis of the time series as it is updated online. When a new sample arrives, it is added to the previous training set, a new SVR model is obtained, and this new model is used for the next forecast. The single forecasting processes of the fixed SVR model and the proposed online SVR model are shown in Fig. 1.

Fig. 1 The single forecasting process of the fixed SVR model and the proposed online SVR model
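The difference between the fixed model and the re-modeling approach can be sketched as follows. The `fit` argument stands for any training routine that returns a callable model; `fit_mean_model` is a deliberately trivial stand-in for SVR training, used only so that the example runs on its own. The sketch also assumes that the true value of each test sample becomes available before the next forecast is made.

```python
def fit_mean_model(inputs, targets):
    """Stand-in for SVR training: always predicts the training mean."""
    mean = sum(targets) / len(targets)
    return lambda x: mean


def fixed_forecast(fit, train_X, train_y, test_X):
    """Train once, then reuse the same model for every forecast."""
    model = fit(train_X, train_y)
    return [model(x) for x in test_X]


def online_forecast(fit, train_X, train_y, test_X, test_y):
    """Re-model before each forecast on all samples seen so far."""
    X, y = list(train_X), list(train_y)
    predictions = []
    for x_new, y_new in zip(test_X, test_y):
        model = fit(X, y)              # rebuild the model from scratch
        predictions.append(model(x_new))
        X.append(x_new)                # the observed sample joins the
        y.append(y_new)                # training set for the next step
    return predictions
```

With training targets `[1.0, 1.0]` and test targets `[4.0, 4.0]`, the fixed forecast stays at 1.0, whereas the online forecast moves to 2.0 on the second step because the first observed test value has been absorbed into the training set.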

As mentioned above, the SVR parameters (C, σ, and ε) affect the performance of the model. Hence, it is essential to select appropriate parameters, and a quantum-behaved particle swarm optimization algorithm is used to find them.

2.2 Particle Swarm Optimization

2.2.1 The Original Particle Swarm Optimization

PSO [23] is a stochastic optimization algorithm based on population intelligence. It has a simple structure that is easy to implement and requires no gradient information. In continuous function optimization problems especially, it performs well in terms of convergence speed and computational time. Hence, it has become an actively studied algorithm in the field of intelligent optimization. The basic principle is described below.

A swarm consists of m particles that fly at a certain speed in a D-dimensional search space, and each particle represents a bird in the search space. For the problem to be solved, each particle represents a potential solution and has a velocity that determines the distance and direction of its flight. Moreover, every particle has a fitness value determined by the function being optimized. During the flight, the particles are adjusted dynamically according to their own flight experience and that of the group. After several iterations, the optimal solution is obtained. In each iteration, a particle updates itself by tracking two “extremes”: one is the best solution found by the particle itself, called the individual extremum; the other is the best solution found by the whole population, called the global extremum. The velocity and position are updated according to the following equations.

$$ \mathbf{V}_{t + 1} = \omega \cdot \mathbf{V}_{t} + c_{1} \cdot r_{1} \cdot (\mathbf{p}_{Best_{i}} - \mathbf{X}_{t}) + c_{2} \cdot r_{2} \cdot (g_{Best} - \mathbf{X}_{t}) $$
(10)
$$ \mathbf{X}_{t + 1} = \mathbf{X}_{t} + \mathbf{V}_{t + 1} $$
(11)

where t denotes the iteration number, and Xt and Vt are the position and velocity of the i th particle at iteration t; ω is the inertia weight; c1 and c2 are two positive constants, called the cognitive learning rate and the social learning rate respectively; r1 and r2 are random numbers in the range [0,1]; Xi = (xi1,xi2,⋯ ,xiD) represents the i th particle; \(\mathbf {p}_{Best_{i}}=(p_{Best_{i1}},p_{Best_{i2}},\cdots ,p_{Best_{iD}})\) represents the best previous position of the i th particle; gBest represents the best particle among all the particles in the population; Vi = (vi1,vi2,⋯ ,viD) represents the velocity of the i th particle, and the velocities are confined within [Vmin,Vmax]D; if a component of Vi exceeds the threshold Vmin or Vmax, it is set equal to the corresponding threshold.
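A single particle update following Eqs. (10) and (11) might look like the sketch below. The default values of the inertia weight, the learning rates, and the velocity limits are placeholders, and drawing fresh random numbers r1 and r2 for every dimension is an assumption, since the text does not specify this detail.

```python
import random


def pso_step(position, velocity, p_best, g_best,
             w=0.7, c1=2.0, c2=2.0, v_min=-1.0, v_max=1.0, rng=random):
    """One velocity and position update for a single particle."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, p_best, g_best):
        r1, r2 = rng.random(), rng.random()
        # Eq. (10): inertia + cognitive pull + social pull.
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        # Confine the velocity to [v_min, v_max].
        v_next = max(v_min, min(v_max, v_next))
        new_velocity.append(v_next)
        # Eq. (11): move the particle.
        new_position.append(x + v_next)
    return new_position, new_velocity
```

If a particle already sits on both its individual best and the global best with zero velocity, the update leaves it in place, which illustrates how standard PSO can stagnate.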

2.2.2 Quantum-Behaved Particle Swarm Optimization

The main disadvantage of PSO is that global convergence cannot be guaranteed [31]. To deal with this problem, QPSO was developed and reported by [42].

In QPSO, the dynamic behavior of a particle differs widely from that in traditional PSO, because the exact values of V and X cannot be determined simultaneously. The state of a particle is therefore described by a wave function ψ(X,t) instead of by velocity and position. It is only possible to learn the probability that the particle appears at position X, which is given by the probability density function \(|\psi (\mathbf {X},t)|^{2}\), the form of which depends on the potential field in which the particle lies. Thus, a particle can appear at any point of the space with a certain probability, and the whole space can be searched without diverging to infinity. The particles move according to the following iteration equations:

$$ \mathbf{X}_{t + 1}=\left\{\begin{array}{ll} \mathbf{P}_{i}-\beta(m_{Best}-\mathbf{X}_{t})\ln (1/u)&\text{if } k \ge 0.5\\ \mathbf{P}_{i}+\beta(m_{Best}-\mathbf{X}_{t})\ln (1/u)&\text{if } k < 0.5 \end{array}\right. $$
(12)

where,

$$ {\mathbf{P}_{i}} = \varphi \cdot \mathbf{p}_{Bes{t_{i}}} + (1 - \varphi ) \cdot g_{Best} $$
(13)
$$ m_{Best} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\mathbf{p}_{Bes{t_{i}}}} $$
(14)

u, k, and φ are random numbers in the range [0,1]; N is the number of particles; mBest is the mean best position, defined as the mean of the individual best positions of all particles in the population; β, called the contraction-expansion coefficient, can be tuned to control the convergence speed of the algorithm, and it is the only parameter of the QPSO algorithm.

QPSO has already been applied in various optimization problems with excellent results [3, 7]. Therefore, QPSO is used to optimize the parameters of SVR, and the optimized SVR model is applied to predict air pollutant concentrations.
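The position update of Eqs. (12) to (14) can be sketched as follows. The code follows the equations as written above; drawing φ, u, and k independently for each dimension, and sampling u from (0, 1] to avoid a division by zero, are assumptions of this sketch.

```python
import math
import random


def mean_best(p_bests):
    """Eq. (14): component-wise mean of all individual best positions."""
    n = len(p_bests)
    return [sum(p[d] for p in p_bests) / n for d in range(len(p_bests[0]))]


def qpso_update(x, p_best, g_best, m_best, beta, rng=random):
    """Eqs. (12)-(13): new position of one particle."""
    new_x = []
    for x_d, p_d, g_d, m_d in zip(x, p_best, g_best, m_best):
        phi, k = rng.random(), rng.random()
        u = 1.0 - rng.random()                     # u in (0, 1]
        attractor = phi * p_d + (1.0 - phi) * g_d  # Eq. (13)
        step = beta * (m_d - x_d) * math.log(1.0 / u)
        # Eq. (12): the sign of the step depends on k.
        new_x.append(attractor - step if k >= 0.5 else attractor + step)
    return new_x
```

Because no velocity is stored, β is the only quantity that has to be chosen; the step length is scaled by the distance between the particle and the mean best position.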

2.3 QPSO for Parameter Determination of the SVR Model

As described above, the QPSO algorithm is used to select the penalty factor C, kernel parameter σ, and insensitive coefficient ε of the SVR model, and the optimized SVR model is then used to forecast the PM2.5 and NO2 concentrations. The flowchart of the QPSO algorithm for the selection of the three SVR parameters is shown in Fig. 2, and the procedure of the QPSO-SVR model is as follows:

  1. Step 1:

    Initializing the QPSO parameter. The number of particles is 10, the maximum iteration number is 30, and the search ranges of C, σ, and ε are [0.1,100], [0.1,100], and [0.01,10] respectively. Each particle’s position is determined by the three-dimensional parameters, and the particle swarm position is randomly initialized according to the initial range of given variables. Contraction-expansion coefficient β is set to the following linear decreasing form:

    $$ \beta=(1.0-0.5)(T-t)/T + 0.5 $$
    (15)

    where T is the maximum iteration number and t is the current iteration number.

  2. Step 2:

    Calculating the current fitness of all particles. The fitness value for each particle’s position is determined by the fivefold cross validation error. In this study, mean square error (MSE) is utilized as cross-validation error, which is defined as follows:

    $$ MSE = \frac{1}{n}\sum\limits_{i = 1}^{n} {{{({Y_{i}} - Y_{i}^ * )}^{2}}} $$
    (16)

    where Yi is the measured value, \(Y_{i}^{*}\) is the predicted value, and n is the number of the data points.

  3. Step 3:

    Choosing the individual historical optimal position and the global optimal position. The individual historical optimal position of each particle is initialized to its current position, and the position with the smallest fitness value among all the particles is chosen as the global optimal position.

  4. Step 4:

    Updating the position of the particles. First, the mean best position of the particles is calculated according to Eq. (14); then, a random position is calculated for each particle according to Eq. (13); finally, the position of each particle is updated according to Eq. (12).

  5. Step 5:

    The fitness value of each updated particle is recalculated and compared with the fitness value of its individual optimal position from the previous iteration. If it is better, the individual optimal position of the particle is updated to its current position.

  6. Step 6:

    The current global optimal position and fitness value of the population are calculated and compared with the fitness value of the global optimal position of the previous iteration. If it is better, the global optimal position of the population is updated to the current global optimal position.

  7. Step 7:

    Checking the termination criterion. Optimal parameters are determined if the termination criterion is satisfied. Otherwise, return to step 2.
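Steps 1 to 7 can be sketched as a compact loop. In the study the fitness is the fivefold cross-validation MSE of an SVR model; here `toy_fitness` is a hypothetical stand-in with a known minimum so that the example runs on its own. Clipping positions to the search ranges and stopping after a fixed number of iterations are assumptions of this sketch.

```python
import math
import random


def beta_schedule(t, T):
    """Eq. (15): decreases linearly from 1.0 at t = 0 to 0.5 at t = T."""
    return (1.0 - 0.5) * (T - t) / T + 0.5


def toy_fitness(params):
    """Hypothetical stand-in for the cross-validation MSE of SVR(C, sigma, eps)."""
    C, sigma, eps = params
    return (C - 10.0) ** 2 + (sigma - 2.0) ** 2 + (eps - 0.1) ** 2


def qpso_minimize(fitness, bounds, n_particles=10, max_iter=30, seed=0):
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: random initial positions inside the search ranges.
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds]
             for _ in range(n_particles)]
    # Steps 2-3: initial fitness, individual bests, and global best.
    p_best = [list(x) for x in swarm]
    p_fit = [fitness(x) for x in swarm]
    best = min(range(n_particles), key=p_fit.__getitem__)
    g_best, g_fit = list(p_best[best]), p_fit[best]

    for t in range(max_iter):          # Step 7: fixed iteration budget
        beta = beta_schedule(t, max_iter)
        # Step 4: mean best position, Eq. (14), then Eqs. (13) and (12).
        m_best = [sum(p[d] for p in p_best) / n_particles for d in range(dim)]
        for i, x in enumerate(swarm):
            for d, (lo, hi) in enumerate(bounds):
                phi, k = rng.random(), rng.random()
                u = 1.0 - rng.random()  # u in (0, 1]
                attractor = phi * p_best[i][d] + (1.0 - phi) * g_best[d]
                step = beta * (m_best[d] - x[d]) * math.log(1.0 / u)
                x_new = attractor - step if k >= 0.5 else attractor + step
                x[d] = min(hi, max(lo, x_new))  # assumption: clip to range
            # Step 5: update the individual best if the new position is better.
            f = fitness(x)
            if f < p_fit[i]:
                p_best[i], p_fit[i] = list(x), f
        # Step 6: update the global best after all particles have moved.
        best = min(range(n_particles), key=p_fit.__getitem__)
        if p_fit[best] < g_fit:
            g_best, g_fit = list(p_best[best]), p_fit[best]
    return g_best, g_fit
```

Calling `qpso_minimize(toy_fitness, [(0.1, 100.0), (0.1, 100.0), (0.01, 10.0)])` uses the swarm size, iteration count, and search ranges given in Step 1 and returns the best (C, σ, ε) found together with its fitness value.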

Fig. 2 The flowchart of QPSO-SVR model

2.4 Multi-step Ahead Forecast Method

Multi-step ahead forecasting estimates future values from given previous observations. There are several strategies for multi-period forecasting, such as the recursive strategy, the direct strategy, and the MIMO strategy [18]. In this study, multi-step ahead forecast methods based on the recursive strategy and the direct strategy are compared to select the optimal prediction method. In the recursive strategy, a regression model is first trained on M samples. A single-step forecast is then made with the established model. Finally, the subsequent forecasting steps are calculated iteratively, with each single-step predicted value used as part of the historical time series for the next point. The estimates of the next H values are given by Eq. (17). The direct strategy, in contrast, forecasts the value H steps ahead directly from the observed values, which gives an easily understandable result; the estimate is given by Eq. (18).

$$ \left\{\begin{array}{llll} \hat{y}(t + 1)=f(y(t),y(t-1),\cdots,y(t-n + 1))\\ \hat{y}(t + 2)=f(\hat{y}(t + 1),y(t),\cdots,y(t-n + 2))\\ \vdots\\ \hat{y}(t+H)=f(\hat{y}(t+H-1),\hat{y}(t+H-2),\cdots,\hat{y}(t+1),y(t),\cdots,y(t-n+H)) \end{array}\right. $$
(17)
$$ \left\{\begin{array}{ll} \hat{y}(t+H)=f(y(t),y(t-1),y(t-2),\cdots,y(t-n + 1)) \end{array}\right. $$
(18)

where n is the maximum embedding order, y is the observed value, \(\hat {y}\) is the predicted value, and f represents the established model. H = 1,2,...,M, where M is the maximum prediction horizon.

The H-step prediction results can be obtained from the above equations. When H is large, all the inputs may eventually be predicted values as the prediction step increases, which may reduce the forecasting accuracy. In this study, in order to limit error accumulation and computational complexity, the values of H and n are set to 4 and 1, respectively. Therefore, the following recursive equation and direct equation are obtained respectively:

$$ \left\{\begin{array}{llll} \hat{y}(t + 1)=f(y(t),\mathbf{P}(t))\\ \hat{y}(t + 2)=f(\hat{y}(t + 1),\mathbf{P}(t + 1))\\ \hat{y}(t + 3)=f(\hat{y}(t + 2),\mathbf{P}(t + 2))\\ \hat{y}(t + 4)=f(\hat{y}(t + 3),\mathbf{P}(t + 3)) \end{array}\right. $$
(19)
$$ \left\{\begin{array}{ll} \hat{y}(t + 4)=f(y(t),\mathbf{P}(t)) \end{array}\right. $$
(20)

where y(t) is the concentration of the air pollutant to be predicted at time t, and P(t) represents the values of the auxiliary variables at time t. In the experiment, it is assumed that the values of all the auxiliary variables, which are described in Section 3.1, are known 4 hours in advance.
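The two strategies of Eqs. (19) and (20) differ only in how the trained model is applied, which the sketch below makes explicit. The models are passed in as plain callables; the lambda functions in the usage note are hypothetical and stand in for trained SVR models.

```python
def recursive_forecast(one_step_model, y_t, aux_values, horizon=4):
    """Eq. (19): feed each prediction back as the next input.

    aux_values[h] is the auxiliary vector P(t + h), assumed known in advance.
    """
    predictions = []
    y_prev = y_t
    for h in range(horizon):
        y_prev = one_step_model(y_prev, aux_values[h])
        predictions.append(y_prev)
    return predictions


def direct_forecast(h_step_model, y_t, aux_t):
    """Eq. (20): one model maps (y(t), P(t)) straight to y(t + 4)."""
    return h_step_model(y_t, aux_t)
```

For instance, `recursive_forecast(lambda y, p: y + p, 1.0, [1.0, 2.0, 3.0, 4.0])` returns `[2.0, 4.0, 7.0, 11.0]`: every step reuses the previous prediction, so an error made at the first step propagates to all later ones. The direct strategy avoids this feedback at the cost of training a separate model for the 4-hour horizon.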

3 Simulation Results and Discussions

3.1 Original Dataset

Beijing, the capital of China, has built 35 air quality monitoring sites so far. Among them, the Wanliu monitoring site, located in Haidian District, is an environmental assessment point close to the city center, so its air quality is reasonably representative of the overall air quality of Beijing. For this reason, the Wanliu monitoring site is selected as the object of this experiment. From the dataset collected in the Urban Air project [57,58,59], the air quality data measured at the Wanliu Monitoring Station from May 2014 to April 2015 are selected as the original dataset. The selected dataset includes five major air pollutants, i.e., PM2.5, NO2, CO, O3, and SO2, and six meteorological parameters, i.e., weather (W), temperature (T), pressure (P), relative humidity (RH), wind speed (WS), and wind direction (WD), all measured hourly at the Wanliu Monitoring Station. The weather is described by 17 different values and the wind direction by 10 different values, as shown in Tables 1 and 2, respectively. All input variables of the models are shown in Table 3. Three prediction methods are used to forecast PM2.5 and NO2 concentrations 4 hours ahead: (1) multi-step prediction based on the recursive strategy (recursive forecast), in which the pollutant concentrations are predicted according to Eq. (19); (2) multi-step prediction based on the direct strategy (direct forecast), in which the values of all input variables at time t are used to forecast the pollutant concentration at time (t + 4) directly; and (3) online multi-step prediction based on the direct strategy (online direct forecast), in which the regression model is updated dynamically during direct multi-step prediction. The first two methods use a fixed model throughout the forecasting process.

Table 1 Different weather conditions are represented by 17 different values
Table 2 Different wind directions are represented by 10 different values
Table 3 All input variables for the models

Because meteorological conditions have a great impact on atmospheric pollutant concentrations, and Beijing is a city with four distinct seasons, the recorded levels of PM2.5 and NO2 in July 2014, November 2014, and January 2015 are selected as original samples in order to assess the effect of seasonal variations on model performance. The numbers of valid hourly records in those months were 688 (July 3, 2014, 00:00–July 31, 2014, 15:00), 720 (November 1, 2014, 00:00–November 30, 2014, 23:00), and 720 (January 1, 2015, 00:00–January 30, 2015, 23:00). In the experiments, the first 70% of the data is selected as the training set, and the remaining data is used as the testing set. The fivefold cross-validation method is adopted to obtain the optimal prediction model. Its main idea is as follows: the training data (the first 70%) are first divided into five equally sized, mutually complementary subsets; a model is then trained on four of the subsets and evaluated on the remaining subset, and this process is repeated for the five possible choices. Finally, the model with the smallest error among the five experiments is selected as the optimal model, and the held-out 30% test data is used to evaluate this optimal model. All the algorithms were coded in MATLAB and C++ and run on a PC with an Intel(R) Core(TM) i5-4210U 1.70 GHz processor and 4 GB of RAM.
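The data partition described above can be sketched as follows. Two details are assumptions of this sketch rather than statements from the study: the 70% cut point is rounded down with integer arithmetic, and the five folds are contiguous blocks whose sizes differ by at most one when the training set is not divisible by five.

```python
def chronological_split(samples, train_percent=70):
    """First train_percent% of the series for training, the rest for testing."""
    cut = len(samples) * train_percent // 100
    return samples[:cut], samples[cut:]


def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs over contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size
```

For a month with 720 hourly records this gives 504 training and 216 testing samples, and each validation fold then contains 100 or 101 samples.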

In order to eliminate the influence of different dimensions and units, the input and output data of the samples are normalized separately during data preprocessing. The formula is as follows:

$$ \mathbf{X}_{norm}=\frac{\mathbf{X}-\mathbf{X}_{min}}{\mathbf{X}_{max}-\mathbf{X}_{min}} $$
(21)

The formula normalizes the original data into the range of [0, 1], where Xnorm is the normalized data, X is the original data, and Xmax and Xmin are the maximum and minimum values in the original data set respectively.
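A minimal sketch of Eq. (21), applied to one variable at a time, is given below. Mapping a constant column to zero and the `denormalize` helper for converting predictions back to physical units are additions made for illustration.

```python
def min_max_normalize(values):
    """Eq. (21): scale one variable into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(normalized, lo, hi):
    """Invert Eq. (21) to recover values in the original units."""
    return [v * (hi - lo) + lo for v in normalized]
```

For example, `min_max_normalize([10, 20, 30])` gives `[0.0, 0.5, 1.0]`. Predicted concentrations need to be mapped back with the same minimum and maximum before errors are reported in μg/m3.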

3.2 Evaluation of the Model Performance

The test results of the QPSO-SVR model are analyzed quantitatively based on mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2) in this study. The three evaluation functions are defined as follows:

$$ MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {{Y_{i}} - Y_{i}^ * } \right|} $$
(22)
$$ RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {{{({Y_{i}} - Y_{i}^ * )}^{2}}} } $$
(23)
$$ {R^{2}} = 1 - \frac{{\sum\limits_{i = 1}^{n} {{{({Y_{i}} - Y_{i}^ * )}^{2}}} }}{{\sum\limits_{i = 1}^{n} {{{({Y_{i}} - \bar Y)}^{2}}} }} $$
(24)

where Yi is the measured concentration level, \(Y_{i}^{*}\) is the forecast value, \(\overline {Y}\) is the average of the measured values, and n is the number of data points.

MAE uses the absolute deviation, so positive and negative errors do not cancel and the measure reflects the actual size of the prediction error. RMSE gives relatively high weight to large errors and is therefore most useful when large errors are of particular concern. Smaller MAE and RMSE values indicate better performance, and a value of R2 closer to 1 indicates a better fit.
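The three criteria of Eqs. (22) to (24) can be computed directly, as in the sketch below. The function names are illustrative, and the numbers in the usage note are made up to show how a single large error affects each measure.

```python
import math


def mae(measured, predicted):
    """Eq. (22): mean absolute error."""
    return sum(abs(y - p) for y, p in zip(measured, predicted)) / len(measured)


def rmse(measured, predicted):
    """Eq. (23): root mean square error."""
    squared = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    return math.sqrt(squared / len(measured))


def r_squared(measured, predicted):
    """Eq. (24): coefficient of determination."""
    mean = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot
```

With measured values `[1, 2, 3, 4]` and predictions `[1, 2, 3, 6]`, MAE is 0.5, RMSE is 1.0, and R2 is 0.2: the single error of 2 is weighted more heavily by RMSE than by MAE.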

3.3 PM2.5 Concentration Forecasting Results

In the experiments, the QPSO-SVR model proposed in this paper is used to select the optimal prediction method among three prediction methods, and then this optimal prediction method is applied to compare the performance of different optimization algorithms for SVR parameter selection.

Figure 3 shows the original time series of hourly PM2.5 concentrations in July 2014, November 2014, and January 2015. The highest PM2.5 concentrations in the 3 months are 262 μg/m3, 435 μg/m3, and 482 μg/m3, respectively, and the PM2.5 concentration increased significantly in late November; this is mainly due to the combustion of coal, which produces a large amount of pollutants. In addition, meteorological conditions affect the diffusion of atmospheric pollutants. Therefore, it is necessary to analyze and predict the PM2.5 concentration.

Fig. 3 The original time series of hourly PM2.5 concentrations. a July 2014. b November 2014. c January 2015

In order to select the best prediction method, the three prediction methods were tested with the QPSO-SVR model. Figure 4 shows their results for the prediction of PM2.5 concentration in July 2014, November 2014, and January 2015. The recursive prediction method gives the worst result in all selected months because of the cumulative effect of errors in the recursive strategy, while the other two methods differ little in their prediction results. Nevertheless, the online direct prediction method is slightly better than the direct prediction method. This can be seen from Table 4: both MAE and RMSE produced by the online direct prediction method are smaller than those produced by the recursive and direct prediction methods in the selected months. Hence, the online direct prediction method is superior to the other two methods, and it is selected to test the prediction performance of several models.

Fig. 4 The results of the three prediction methods based on the QPSO-SVR model for the prediction of PM2.5 concentrations. a July 2014. b November 2014. c January 2015

Table 4 The comparison of three prediction methods for PM2.5 concentration prediction based on the QPSO-SVR model

Figures 5, 6, and 7 present the prediction results of PM2.5 concentration based on the QPSO-SVR model and the PSO-SVR model in July 2014, November 2014, and January 2015, respectively. Each figure includes the fitting curves of the two models in the test phase and the corresponding absolute errors. Fig. 5 shows that the PSO-SVR predictions contain many deviation points, whereas the QPSO-SVR predictions contain only a few. Both in individual cases and on average, the QPSO-SVR model shows better prediction performance than the PSO-SVR model. Figures 6 and 7 support the same conclusion. In addition, the robustness of both models is examined under the influence of meteorological factors such as weather, temperature, pressure, humidity, wind speed, and wind direction in the three different seasons. Hence, it can be concluded that the QPSO-SVR model has advantages over the PSO-SVR model even under the influence of meteorological factors.

Fig. 5 The online multi-step prediction results for PM2.5 concentration and the absolute error of two models in July 2014. a QPSO-SVR model. b PSO-SVR model. c Absolute error of two models

Fig. 6 The online multi-step prediction results for PM2.5 concentration and the absolute error of two models in November 2014. a QPSO-SVR model. b PSO-SVR model. c Absolute error of two models

Fig. 7 The online multi-step prediction results for PM2.5 concentration and the absolute error of two models in January 2015. a QPSO-SVR model. b PSO-SVR model. c Absolute error of two models

Table 5 compares the prediction performance of the QPSO-SVR, PSO-SVR, GA-SVR, and GS-SVR models for PM2.5 concentration in the test stage. The QPSO-SVR model has the lowest prediction error, and its calculation time is less than that of the GA-SVR and GS-SVR models. The running times of the QPSO-SVR and PSO-SVR models differ little, but the QPSO-SVR model is still the faster of the two. Therefore, it can be concluded that QPSO is superior to the other optimization algorithms for parameter selection of the SVR model, and the results show that the proposed hybrid QPSO-SVR model is effective in the prediction of atmospheric PM2.5 concentration.
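For orientation, the kind of search QPSO performs over the SVR parameters can be sketched as follows. This is a minimal illustration of the QPSO position update as it is commonly stated in the literature, not the configuration used in the experiments. The function name qpso_minimize, the swarm size, the iteration count, the linearly decreasing contraction-expansion coefficient, the parameter bounds, and the toy objective (which stands in for a validation error over C, σ, and ε) are all assumptions.

```python
import math
import random


def qpso_minimize(objective, bounds, n_particles=20, n_iter=100, seed=0):
    """Minimal QPSO for a box-bounded problem; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_val = [objective(x) for x in xs]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for it in range(n_iter):
        # Contraction-expansion coefficient, decreased linearly from 1.0 to 0.5 (assumed schedule)
        beta = 1.0 - 0.5 * it / max(n_iter - 1, 1)
        # Mean of all personal best positions
        mbest = [sum(p[d] for p in pbest) / n_particles for d in range(dim)]
        for i in range(n_particles):
            for d in range(dim):
                phi = rng.random()
                u = 1.0 - rng.random()  # lies in (0, 1], so log(1 / u) is finite
                attractor = phi * pbest[i][d] + (1.0 - phi) * gbest[d]
                step = beta * abs(mbest[d] - xs[i][d]) * math.log(1.0 / u)
                x_new = attractor + step if rng.random() < 0.5 else attractor - step
                lo, hi = bounds[d]
                xs[i][d] = min(max(x_new, lo), hi)  # clamp to the search box
            val = objective(xs[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = xs[i][:], val
    return gbest, gbest_val


# Toy objective standing in for a validation error over (C, sigma, epsilon);
# the optimum and the bounds are illustrative assumptions, not the paper's settings.
def toy_error(params):
    c, sigma, eps = params
    return (c - 10.0) ** 2 + (sigma - 1.0) ** 2 + (eps - 0.1) ** 2


best, best_val = qpso_minimize(toy_error, [(0.1, 100.0), (0.01, 10.0), (0.001, 1.0)])
print(best, best_val)
```

In an actual experiment, the toy objective would be replaced by a function that trains an SVR with the candidate parameters and returns its validation error. Unlike PSO, this update has no velocity term and only one main control coefficient.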

Table 5 The comparison of model performance for PM2.5 concentration prediction based on the online direct prediction method

3.4 NO2 Concentration Forecasting Results

Because pollutants differ in character, for example the accumulation of PM2.5 and the chemical and physical complexity of NO2, the prediction performance and robustness of the QPSO-SVR model are further verified by forecasting NO2.

Figure 8 shows the original time series of hourly NO2 concentrations in July 2014, November 2014, and January 2015. The NO2 concentration follows the same pattern of variation as the PM2.5 concentration, and its frequent fluctuation may affect the prediction model.

Fig. 8 The original time series of hourly NO2 concentrations. a July 2014. b November 2014. c January 2015

Figures 9, 10, and 11 present the prediction results of NO2 concentration based on the QPSO-SVR model and the PSO-SVR model in July 2014, November 2014, and January 2015, respectively. The prediction results of the QPSO-SVR model are much better than those of the PSO-SVR model in all three months. In July especially, many points predicted by the PSO-SVR model deviate from the measured values, whereas only a few points predicted by the QPSO-SVR model do. Moreover, the MAE of the QPSO-SVR model is smaller than that of the PSO-SVR model. The same conclusion therefore holds: the QPSO-SVR model has better prediction performance than the PSO-SVR model.

Fig. 9 The online multi-step prediction results for NO2 concentration and the absolute error of two models in July 2014. a QPSO-SVR model. b PSO-SVR model. c Absolute error of two models

Fig. 10 The online multi-step prediction results for NO2 concentration and the absolute error of two models in November 2014. a QPSO-SVR model. b PSO-SVR model. c Absolute error of two models

Fig. 11 The online multi-step prediction results for NO2 concentration and the absolute error of two models in January 2015. a QPSO-SVR model. b PSO-SVR model. c Absolute error of two models

Table 6 compares the prediction error and computational time of the QPSO-SVR, PSO-SVR, GA-SVR, and GS-SVR models for NO2 concentration in the test stage. Both the MAE and the RMSE of the QPSO-SVR model are smaller than those of the other models in the three selected months, while its R2 values are greater than those of the other models in all selected months. In addition, the heuristic optimization algorithms are more efficient than the grid search method in selecting SVR parameters. Moreover, the PSO-SVR model requires much less computational time than the GA-SVR model, and the QPSO-SVR model further reduces the calculation time compared with the PSO-SVR model.
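The three error measures used in this comparison can be computed as in the following minimal Python sketch. It assumes that R2 denotes the coefficient of determination, 1 - SS_res / SS_tot; if R2 is defined as a squared correlation coefficient instead, the last function would differ. The function names and the sample values are illustrative only and are not taken from Table 6.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination (assumed definition of R2)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured and predicted concentrations, for illustration only
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

print(mae(measured, predicted), rmse(measured, predicted), r2(measured, predicted))
```

For these sample values, the MAE is 2.0, the RMSE is about 2.12, and R2 is 0.964. A better model has smaller MAE and RMSE and an R2 closer to 1, which is the direction of comparison used in the table.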

Table 6 The comparison of model performance for NO2 concentration prediction based on the online direct prediction method

Based on the above experiments, it can be concluded that the QPSO-SVR model outperforms the PSO-SVR, GA-SVR, and GS-SVR models. It shows good, robust prediction performance for both air pollutants in all selected months.

3.5 Experiments Summary

To summarize and compare the experiments and to emphasize the contribution of the proposed algorithm, Table 7 presents a comparison of the different methods used in this study. Three aspects are considered for improving the prediction accuracy of PM2.5 and NO2 concentration: prediction methods, models, and optimization algorithms. According to the above experimental results, the QPSO-SVR model based on the online direct prediction method is more suitable for the prediction of atmospheric PM2.5 and NO2 concentration than the other methods.

Table 7 The comparison of different methods in this study

4 Conclusions

This paper develops a hybrid QPSO-SVR model to predict atmospheric PM2.5 and NO2 concentrations in the short term, in which the QPSO algorithm is used to select the optimal parameters (C, σ, and ε) that influence the performance of SVR. Firstly, three prediction methods are proposed: a multi-step prediction method based on the recursive strategy, a multi-step prediction method based on the direct strategy, and an online multi-step prediction method based on the direct strategy. PM2.5 concentration was predicted by these three methods based on the QPSO-SVR model, and the results show that the online multi-step prediction method based on the direct strategy gives the best prediction results. Secondly, the prediction performances of the QPSO-SVR, PSO-SVR, GA-SVR, and GS-SVR models were compared using the online direct prediction method, and the atmospheric PM2.5 and NO2 concentrations in three different seasons were predicted. The results demonstrate that the QPSO-SVR model has better prediction performance in terms of both prediction accuracy and computational time. Moreover, the QPSO-SVR model is more robust because it is less affected by meteorological factors. Finally, the proposed model can be used for the prediction of other pollutant concentrations, and our team has installed a device on our campus to collect pollutant concentration and meteorological data in order to analyze and evaluate the campus environment. Additionally, the values of H and n in multi-step prediction affect the prediction results, and the online SVR model suffers from heavy computation and poor real-time performance when the sample is too large. Solving these problems will be the subject of our future research.