
1 Introduction

With the development of society, China's economy has grown rapidly, but this growth is often accompanied by environmental pollution. Air pollution is one of the main forms of environmental pollution. Pollutant gases can cause a variety of diseases; among them, PM2.5 poses a particularly serious threat to human health because of its small particle size, and it is also a major cause of haze weather. More efficient and accurate forecasting of air pollution components can provide useful reference and guidance for people's safe travel and for the air pollution prevention and control work of environmental protection departments.

Numerical model prediction is one method for predicting atmospheric pollution components. It requires a wide range of data, involves a complex system, and is still relatively immature. The other method is statistical model prediction, which is more convenient and efficient and achieves better prediction results. Therefore, statistical models are generally used to learn the relationships between numerous relevant features and air pollution components, so as to predict air pollution components.

A large number of studies predict air pollution components based on machine learning models. Optimization of such models mainly concerns data processing, model selection and combination, and model parameter tuning [1]. In order to obtain the optimal combination, different combination methods should be analyzed according to the specific data set.

The quality of the data features determines the upper limit of machine learning prediction performance [2]. There is a strong correlation among several of the air-related features selected in this paper. In order to further improve the performance of air pollution component prediction, this paper introduces the mRMR algorithm to reduce the dimensionality of the data features from the perspective of maximum relevance and minimum redundancy, selecting important features and reducing the influence of redundant features. Finally, the XGBoost model is used to mine the relationships between the air features and the PM2.5 concentration labels, and a high-performance model for predicting the PM2.5 concentration in the next hour is constructed based on feature selection and XGBoost.

2 Data Set Analysis

The data used for PM2.5 concentration prediction in this paper are collected by a self-developed integrated system in Foshan. The data are hourly measurements of four polluted gases and four weather factors from 1 March 2020 to 2 September 2020, specifically SO\(_2\), NO\(_2\), PM10, and PM2.5, and temperature, relative humidity, wind speed, and wind direction. Combined with the hour of the day at which each observation was taken, this forms the complete data set with 9 features used in this paper. The units of the gas concentrations, temperature, relative humidity, wind speed, and wind direction are \(\mu g/m^3\), \(^\circ C\), \(\%R.H.\), m/s, and degrees, respectively. The characteristics of the data set are analyzed in the following subsections.

2.1 Data Description Statistics

The following is a descriptive statistical analysis of the data set, carried out with the SPSS statistical software. The statistical results are shown in Table 1. According to the average values of the polluted gases, the local air quality during this period is good; according to the maximum values, episodes of serious air pollution exist; according to the standard deviations, the concentrations of PM10 and PM2.5 vary considerably.

Table 1. Data description statistics.
Fig. 1. Time series of PM2.5 concentration values

2.2 PM2.5 Time Series

By looking at the time series, we can get a general idea of the data. The time series of PM2.5 concentration values is given in Fig. 1. As can be seen from the figure, the concentration of PM2.5 varies greatly and generally stays at a low level in summer.

2.3 Normalized Mutual Information Between Features

Mutual information theory describes how much information one random variable contains about another random variable. The value range of normalized mutual information is [0, 1], and the larger the value, the greater the amount of shared information. The normalized mutual information among the features in the data set is shown in Table 2, where PM2.5n denotes the PM2.5 concentration in the next hour. As can be seen from the table, there is a strong correlation between the PM2.5 concentration in the next hour and several features; at the same time, there is a certain amount of mutual information among the other features, which leads to a certain degree of redundancy among the features.

Table 2. Normalized mutual information between features
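
For illustration, a minimal sketch of how such normalized mutual information values can be estimated is shown below; the file name, column names, and the equal-width binning of the continuous features are assumptions made for the example and are not taken from the paper:

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical CSV holding the hourly data set described in Sect. 2,
# including the next-hour target column "PM2.5n"; no missing values assumed.
df = pd.read_csv("foshan_air_2020.csv")

def nmi(a, b, bins=20):
    """Normalized mutual information between two series, estimated by
    discretizing each into equal-width bins."""
    a_binned = pd.cut(a, bins=bins, labels=False)
    b_binned = pd.cut(b, bins=bins, labels=False)
    return normalized_mutual_info_score(a_binned, b_binned)

# NMI between each feature and the next-hour PM2.5 concentration.
for col in df.columns.drop("PM2.5n"):
    print(f"{col}: NMI with PM2.5n = {nmi(df[col], df['PM2.5n']):.3f}")
```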

3 mRMR Algorithm and XGBoost Prediction Model

A reliable feature set is an important part of machine learning, so the air data set was analyzed from different perspectives in the previous section. All the features in the data set are collected by the actual system. However, it is not certain that every feature is strongly correlated with the PM2.5 concentration, and some redundancy may exist among the features, which reduces the prediction performance of the model. Therefore, to ensure the reliability of the prediction model, the features need to be properly processed before the data set is fed into the machine learning model. In addition, an appropriate machine learning model needs to be chosen according to the characteristics of the actual data set, so that the relationships in the data can be fully mined and a better prediction effect achieved. The following therefore introduces the mRMR feature selection algorithm and the XGBoost prediction model used in this paper.

3.1 mRMR Feature Selection Algorithm

mRMR is based on mutual information, which is derived from the concept of entropy [3]. Entropy gives abstract information a concrete metric and can be used to describe the uncertainty of random events. Mutual information represents the information shared by two random variables: the more information they share, the greater the mutual information between them, and thus the stronger their dependence. The mutual information between random variables x and y can be calculated as follows:

$$\begin{aligned} I(x, y)=\iint p(x, y) \log \frac{p(x, y)}{p(x) p(y)} d x d y . \end{aligned}$$
(1)

Here p(x) and p(y) are the marginal probability densities of the two variables, and p(x, y) is their joint probability density. It can be seen that mutual information is the expectation, under the joint distribution p(x, y), of the log ratio between the joint density and the product of the marginal densities.
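
As a concrete illustration of Eq. (1) in the discrete case, the following sketch evaluates the mutual information of a small, hypothetical joint probability table (the probability values are made up for illustration):

```python
import numpy as np

# Hypothetical 2x2 joint probability table p(x, y) for two binary variables.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# Discrete analogue of Eq. (1): I(x, y) = sum_xy p(x,y) log[p(x,y) / (p(x)p(y))]
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(x, y) = {mi:.4f} nats")
```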

mRMR is a feature selection algorithm that, on the one hand, measures the relevance between features and the target based on mutual information in order to identify the feature subset most correlated with the target; on the other hand, it measures the redundancy among the features in the selected set, so that redundant features are excluded while maximum relevance is retained. In this way, mRMR eliminates redundant information while keeping the key features, which reduces the complexity of the model and effectively helps prevent over-fitting of machine learning models [4]. The maximum relevance in mRMR can be expressed as:

$$\begin{aligned} \max D(S, c), \quad D=\frac{1}{|S|} \sum _{x_{i} \in S} I\left( x_{i}, c\right) . \end{aligned}$$
(2)

Here \(I(x_i, c)\) is the mutual information defined above, in this case between the feature \(x_i \) and the target variable c; S is the feature set containing all \(x_i \), and |S| is the number of features in S. It can be seen that the maximum relevance is expressed as the average mutual information between the features and the target. The minimum redundancy can be expressed as:

$$\begin{aligned} \min R(S), \quad R=\frac{1}{|S|^{2}} \sum _{x_{i}, x_{j} \in S} I\left( x_{i}, x_{j}\right) . \end{aligned}$$
(3)

Here \(I(x_i,x_j)\) represents the mutual information between the features \(x_i\) and \(x_j\). Finally, the criterion for selecting the optimal feature subset is given by:

$$\begin{aligned} \left\{ \begin{array}{l} \max \psi (D, R) \\ \psi =D-R \end{array}\right. \end{aligned}$$
(4)
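
A minimal sketch of the greedy, incremental form of this criterion is given below; it is an illustration under the assumption of a user-supplied mutual information estimator, not the exact implementation used in the experiments:

```python
import numpy as np

def mrmr_select(X, y, mi, k):
    """Greedy, incremental mRMR selection following Eqs. (2)-(4).

    X  : (n_samples, n_features) feature matrix
    y  : target vector (here, the next-hour PM2.5 concentration)
    mi : callable estimating the mutual information between two 1-D arrays
    k  : number of features to select
    Returns the indices of the selected features.
    """
    n_features = X.shape[1]
    relevance = np.array([mi(X[:, j], y) for j in range(n_features)])  # I(x_j, c)
    selected, remaining = [], list(range(n_features))
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            # Redundancy: mean MI between the candidate and already selected features.
            redundancy = (np.mean([mi(X[:, j], X[:, s]) for s in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy          # psi = D - R
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```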

3.2 XGBoost Prediction Model

Both XGBoost and GBDT belong to the family of Boosting algorithms. XGBoost was proposed on the basis of GBDT; it mainly optimizes the objective function and improves the base learner [5]. The XGBoost model works by training a sequence of weak learners, each fitting the residual of the current ensemble, and summing their outputs to obtain the final prediction. Its expression is as follows:

$$\begin{aligned} \dot{y}_{i}=\sum _{t=1}^{k} f_{t}\left( x_{i}\right) . \end{aligned}$$
(5)

Here k denotes the number of weak learners, \(f_t\) the t-th weak learner, \(x_i\) the i-th sample, and \(\dot{y}_{i}\) the final predicted value obtained by accumulating the outputs of all weak learners.

The XGBoost model is an ensemble learning framework containing many decision trees. Unlike other ensemble tree models, the training process of the XGBoost model is more elaborate. Compared with GBDT, which is also a Boosting algorithm, it adds a regularization term to the objective function and optimizes the feature splitting process. Its objective function is expressed as follows:

$$\begin{aligned} o b j=\sum _{i=1}^{n} L\left( y_{i}, \dot{y}_{i}\right) +\sum _{i=1}^{k} { \Omega }\left( f_{i}\right) . \end{aligned}$$
(6)

Here \(L(y_{i},\dot{y}_{i})\) is the loss function, \({ \Omega }(f_{i})\) is the regularization term of the i-th tree, and n is the sample size. Unlike in GBDT, the loss function can be defined case by case, and each tree is trained to make the objective function as small as possible. When training the t-th tree, the objective function can be approximated by a second-order Taylor expansion as follows:

$$\begin{aligned} {o b j^{t}} \approx \sum _{i=1}^{n}\left[ g_{i} f_{t}\left( x_{i}\right) +0.5 h_{i} f_{t}\left( x_{i}\right) ^{2}\right] +{ \Omega }\left( f_{t}\right) . \end{aligned}$$
(7)

Here \(g_i\) is the first derivative of the loss function, \(h_i\) is the second derivative of the loss function, and the regularization term is:

$$\begin{aligned} { \Omega }\left( f_{t}\right) =\gamma {T}+0.5 \lambda \sum _{j=1}^{T} w_{j}^{2} . \end{aligned}$$
(8)

Here T is the number of leaves in the t-th tree, \(w_j\) is the weight of the j-th leaf, and \(\gamma \) and \(\lambda \) are regularization coefficients. At this point, the minimum point of the objective function can easily be obtained [6], and hence the minimum value of the objective function. The smaller this minimum value is, the better the structure of the tree. To obtain it, a greedy algorithm can be used to search, among a series of candidate tree structures, for the one that minimizes the objective function: the information before and after each candidate split is calculated and compared, and a split is adopted when it yields a sufficient information gain. Proceeding in this way, all the tree structures are obtained, until the learning of every weak learner is finished and the training of the whole model is complete.
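
As an illustration, a minimal sketch of training such a model with the xgboost library is given below; the hyperparameter values and the variable names X_train, y_train and X_valid are assumptions for the example, not the settings used in this paper:

```python
from xgboost import XGBRegressor

# Illustrative hyperparameters: n_estimators is the number of weak learners k
# in Eq. (5); reg_lambda and gamma correspond to the lambda and gamma of the
# regularization term in Eq. (8).
model = XGBRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    reg_lambda=1.0,                   # L2 penalty on the leaf weights w_j
    gamma=0.0,                        # minimum gain required to make a split
    objective="reg:squarederror",
)
model.fit(X_train, y_train)           # X_train, y_train: training features/labels
pm25_next_hour = model.predict(X_valid)
```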

4 Experimental Studies

4.1 Evaluation Index of the Experimental Results

In order to judge and compare the prediction performance of the models objectively and reasonably, appropriate evaluation indexes need to be selected. In this paper, three evaluation indexes, RMSE, MAE and \(R^2\), are adopted, and their expressions are as follows:

$$\begin{aligned} R M S E=\sqrt{\frac{1}{m} \sum _{i=1}^{m}\left( y_{i}-\widehat{y}_{i}\right) ^{2}} , \end{aligned}$$
(9)
$$\begin{aligned} M A E=\frac{1}{m} \sum _{i=1}^{m}\left| \left( y_{i}-\widehat{y}_{i}\right) \right| , \end{aligned}$$
(10)
$$\begin{aligned} R^{2}=\frac{T S S-R S S}{T S S}=1-\frac{R S S}{T S S} . \end{aligned}$$
(11)

Here \(y_i\) is the actual observed value, \(\widehat{y}_{i}\) is the predicted value, and m is the number of samples. In Eq. (11), \(RSS=\sum _{i=1}^{m}(y_{i}-\widehat{y}_{i})^{2}\) is the residual sum of squares and \(TSS=\sum _{i=1}^{m}(y_{i}-\bar{y})^{2}\) is the total sum of squares, where \(\bar{y}\) is the mean of the observed values.

RMSE is the root mean square error, which describes the degree of difference between the predicted values and the measured values; a large value generally means that most of the samples show large deviations. MAE is the mean absolute error, which averages the absolute differences between the actual observed values and the predicted values. \(R^2\), the coefficient of determination, describes how much of the variation in the observed values is explained by the predictions.
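
A minimal sketch of computing these three indexes with NumPy and scikit-learn is shown below (the variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return RMSE, MAE and R^2 as defined in Eqs. (9)-(11)."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return rmse, mae, r2
```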

4.2 Comparative Experiment of Prediction Model

In the selection of prediction models, SVM, KNN and RF prediction models are compared with XGBoost. In the experiment, all nine features of the data set are used; the first 3900 samples of the data set form the training set and the last 490 samples form the validation set, as in all subsequent experiments. The experimental results are shown in Table 3, with time reported in seconds.
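
A simplified sketch of this comparison is given below; the models' default hyperparameters and the variable names X and y (holding the nine features and the next-hour PM2.5 labels) are assumptions, not the exact settings of the experiment:

```python
import time
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

# Chronological split used in the paper: first 3900 samples for training,
# last 490 samples for validation.
X_train, y_train = X[:3900], y[:3900]
X_valid, y_valid = X[3900:4390], y[3900:4390]

models = {
    "SVM": SVR(),
    "KNN": KNeighborsRegressor(),
    "RF": RandomForestRegressor(),
    "XGBoost": XGBRegressor(objective="reg:squarederror"),
}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    pred = model.predict(X_valid)
    rmse = np.sqrt(mean_squared_error(y_valid, pred))
    mae = mean_absolute_error(y_valid, pred)
    r2 = r2_score(y_valid, pred)
    print(f"{name:8s} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f} "
          f"train_time={train_time:.2f}s")
```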

Table 3. Evaluation indexes of prediction performance of different models

It can be seen from the experimental results that the training time of the XGBoost prediction model is short and that, compared with the other models, its prediction error is the smallest, its degree of fit is the highest, and its prediction time is the shortest. Since the XGBoost model gives the best prediction of the PM2.5 concentration in the next hour, the following experiments on the mRMR feature selection algorithm are conducted on the basis of this model.

4.3 Comparative Experiment of Feature Selection Algorithm

Identifying the important features in the data set can reduce the feature dimensionality and yield a more efficient data set, thereby improving the performance of the prediction model. In this paper, the mRMR algorithm is used for feature selection, and three other feature selection algorithms, Pearson [7], PCA [8] and ReliefF [9], are used for comparison. XGBoost is used as the prediction model to study their performance in PM2.5 concentration prediction. The experimental results are shown in Table 4, which lists, for each feature selection algorithm, the evaluation indexes of the best prediction it achieves and the corresponding number of features used.

Table 4. The optimal performance and the number of features of each feature selection algorithm

The process of feature selection consists of successively excluding the features with the lowest scores and finally selecting the configuration that achieves the best prediction. Figures 2 and 3 show how the predicted \(R^2\) and RMSE evolve during feature selection for each algorithm, which helps in further understanding the feature selection process. To make the important details visible, the vertical axes in the figures are limited to a certain range; for the PCA feature selection algorithm, the RMSE value at one feature is 16.59454 and the corresponding \(R^2\) value at one feature is 0.01650.

Fig. 2. RMSE curve of each feature selection method

Fig. 3. \(R^2\) curve of each feature selection method

As can be seen from the figures, the overall trend of the model's evaluation indexes improves as the number of features increases, because more features provide more information to the model. It can also be seen that the mRMR feature selection algorithm allows the model to achieve better performance while learning fewer features, whereas the other three algorithms can hardly achieve a better prediction effect than using the original data set during feature selection. This is because there is indeed redundancy between the features, and some features contribute less to the model than the redundancy they introduce; only when the features that contribute little to the prediction are accurately eliminated can better results be obtained. When the mRMR algorithm is used for feature selection, the prediction performance indexes RMSE and \(R^2\) reach their optimum with only 5 of the features in the data set, which not only reduces the computational complexity of the model but also improves its prediction performance.
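
A hypothetical sketch of such a sweep for the mRMR case is given below; it reuses the mrmr_select and nmi sketches from Sects. 3.1 and 2.3 and the split from Sect. 4.2, so it illustrates the procedure rather than reproducing the exact experiment behind Figs. 2 and 3:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

# Train XGBoost on the top-d mRMR features for d = 1..9 and record RMSE / R^2,
# mirroring the curves in Figs. 2 and 3 (X_train, y_train, X_valid, y_valid
# are NumPy arrays as in Sect. 4.2; mrmr_select and nmi are the earlier sketches).
results = []
for d in range(1, 10):
    idx = mrmr_select(X_train, y_train, mi=nmi, k=d)
    model = XGBRegressor(objective="reg:squarederror")
    model.fit(X_train[:, idx], y_train)
    pred = model.predict(X_valid[:, idx])
    results.append((d, np.sqrt(mean_squared_error(y_valid, pred)),
                    r2_score(y_valid, pred)))
for d, rmse, r2 in results:
    print(f"{d} features: RMSE={rmse:.3f}, R2={r2:.3f}")
```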

5 Conclusion

In this paper, air-related data collected by a self-developed integrated system are used to form the original data set. In order to make the data set more reliable, the mRMR feature selection algorithm is used to reduce its dimensionality and obtain the feature set that yields the best prediction. Finally, the selected data set is combined with the XGBoost model for learning, training and prediction, and an mRMR-XGBoost model for predicting the PM2.5 concentration in the next hour is designed. The experimental results show that the mRMR-XGBoost model performs well in PM2.5 concentration prediction.