Introduction

Blasting is one of the most effective methods in open-cast mining for moving rock and overburden. However, only 20–30% of the explosion energy is used for rock fragmentation (Chen and Huang 2001; Coursen 1995; Gad et al. 2005; Gao et al. 2018e). The remaining energy is wasted and generates undesirable effects such as ground vibration, air-blast overpressure (AOp), fly rock, and back break (Ak and Konuk 2008; Bui et al. 2019; Chen and Huang 2001; Ghasemi et al. 2016; Hajihassani et al. 2014; Hasanipanah et al. 2017a; Monjezi et al. 2011a; Nguyen and Bui 2018b; Nguyen et al. 2018a). Among these effects, ground vibration, measured as peak particle velocity (PPV), is one of the most undesirable because it may be harmful to humans and structures. To reduce the adverse effects of blasting operations, many researchers have proposed empirical equations to predict PPV, including the United States Bureau of Mines (Duvall and Fogelson 1962; Ambraseys and Hendron 1968; Davies et al. 1964; Standard 1973; Roy 1991). However, the influencing parameters are numerous, and the relationships among them are complicated. Thus, the empirical methods may not be entirely suitable for predicting PPV in open-cast mines (Ghasemi et al. 2013; Hajihassani et al. 2015; Hasanipanah et al. 2015; Monjezi et al. 2011b, 2013; Nguyen and Bui 2018a; Nguyen et al. 2018b, 2019; Saadat et al. 2014).

Nowadays, artificial intelligence (AI) is well known as a robust tool for solving real-life problems (Alnaqi et al. 2019; Gao et al. 2018a, c; Moayedi and Nazir 2018; Moayedi et al. 2019; Moayedi and Rezaei 2017). Many researchers have studied and applied AI to predict blast-induced issues, especially blast-induced PPV. Longjun et al. (2011) applied two benchmark algorithms, support vector machine (SVM) and random forest (RF), to estimate PPV; of the 93 recorded explosions, 15 observations were selected as the testing dataset and the remainder were used for training. Their study indicated that both the SVM and RF models performed well in estimating blast-induced PPV, with the SVM model introduced as the superior model. Hasanipanah et al. (2017b) developed a classification and regression tree (CART) model to predict PPV at the Miduk copper mine (Iran) using 86 blasting events. Multiple regression (MR) and various empirical techniques were also considered for predicting PPV and compared with the CART model. As a result, the CART model exhibited better performance than the other models, with RMSE = 0.17 and R2 = 0.95. In another work, Chandar et al. (2017) estimated blast-induced PPV using an artificial neural network (ANN) model; 168 blasting operations were collected from dolomite, coal, and limestone mines (Malaysia) for this purpose. The results indicated that the ANN model, with R2 = 0.878 across the three mines, was the best among the approaches used in their study. A metaheuristic algorithm, gene expression programming (GEP), was also used to predict PPV by Faradonbeh and Monjezi (2017); 115 blasting operations were used in their study. First, a formula based on GEP was developed to estimate PPV; it was then compared with several nonlinear and conventional equation models. Their results indicated that the GEP model outperformed the other models in forecasting blast-induced PPV. Similar works can be found in the references (Faradonbeh et al. 2016; Hasanipanah et al. 2017c; Sheykhi et al. 2018; Taheri et al. 2017).

In this study, an XGBoost model was developed to predict blast-induced PPV in the Deo Nai open-pit coal mine (Vietnam). Three other models, namely SVM, RF, and k-nearest neighbors (KNN), were also developed for comparison with the constructed XGBoost model.

This paper is organized as follows. Section "Site study and data used" describes the study site and the data used. Section "Preview of XGBoost, SVM, RF, and KNN" provides an overview of the algorithms used in this study. Section "Results and discussion" reports the results and discussion. Section "Validation performance of models" shows the validation of the constructed models. Finally, Section "Conclusions and recommendations" presents our conclusions.

Site study and data used

Study area

With a total area of approximately 6 km2, the Deo Nai open-pit coal mine is a large open-cast coal mine in Vietnam (Fig. 1). It is located in Quang Ninh province, with a proven reserve of 42.5 Mt and a productivity of 2.5 Mt/year. The study area has a complex geological structure that includes many different phases and faults. The overburden of this mine consists of conglomerate, siltstone, sandstone, claystone, and argillic rock (Vinacomin 2015). The hardness of these rocks (f) is in the range of 11–12 according to Protodiakonov's classification (Protodiakonov et al. 1964), and their specific weight (γ) is in the range of 2.62–2.65 t/m3. Therefore, blasting is a high-performance method for rock fragmentation in this mine.

Fig. 1 Location of the study area

However, the Deo Nai open-pit coal mine is located near residential areas (Fig. 1), which lie approximately 400 m from the blasting sites. Moreover, the explosive charge per blast can exceed 20 tons, so the adverse effects of blasting operations (especially PPV) on the surrounding environment are substantial. Thus, we selected this area as a case study to predict PPV caused by blasting operations, with the aim of controlling the undesirable effects on the environment and residential areas.

Data collection

To conduct this study, 146 blasting events were collected, each described by nine parameters: the number of borehole rows per blast (N), charge per delay (Q), powder factor (q), length of stemming (T), burden (B), monitoring distance (D), spacing (S), bench height (H), and time interval between blasts (Δt). These were used as the nine input parameters to predict the outcome, i.e., PPV. Table 1 summarizes the datasets used in this study.

Table 1 Blasting events recorded for this study

For monitoring PPV, the Blastmate III instrument (Instantel, Canada) was used; its specifications are shown in Table 2. In this study, the recorded PPV values ranged from 2.140 to 33.600 mm/s. A GPS device was used to determine D. The remaining parameters were extracted from the blast patterns.

Table 2 Basic parameters of the PPV monitoring instrument

Preview of XGBoost, SVM, RF, and KNN

eXtreme gradient boosting (XGBoost)

XGBoost is an improved algorithm based on the gradient boosting decision tree (Friedman et al. 2000, 2001; Friedman 2001, 2002). XGBoost, which was created and developed by Chen and He (2015), can construct boosted trees efficiently, operate in parallel, and solve both classification and regression problems. The core of the algorithm is the optimization of the value of the objective function. It implements machine learning algorithms, such as the gradient boosting decision tree and the gradient boosting machine, under the gradient boosting framework, and can solve many data science problems quickly and accurately with parallel tree boosting.

An objective function usually consists of two parts (training loss and regularization):

$${\text{Obj}}(\varTheta ) = L(\varTheta ) + \varOmega (\varTheta ),$$
(1)

where L is the training loss function and \(\varOmega\) is the regularization term. The training loss measures the model performance on the training data. The regularization term aims to control the complexity of the model, e.g., to prevent overfitting (Gao et al. 2018d). Complexity can be defined in various ways; however, the complexity of each tree is often computed as follows:

$$\varOmega (f) = \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} {\omega_{j}^{2} } ,$$
(2)

where T is the number of leaves and \(\omega\) is the vector of scores on leaves.

The structure score of XGBoost is the objective function defined as follows:

$${\text{Obj}} = \sum\limits_{j = 1}^{T} {\left[ {G_{j} \omega_{j} + \frac{1}{2}(H_{j} + \lambda )\omega_{j}^{2} } \right] + \gamma T} ,$$
(3)

where the \(\omega_{j}\) are independent of each other, and \(G_{j}\) and \(H_{j}\) are the sums of the first- and second-order gradients of the loss over the samples in leaf j. The form \(G_{j} \omega_{j} + \frac{1}{2}(H_{j} + \lambda )\omega_{j}^{2}\) is quadratic in \(\omega_{j}\), so for a given tree structure q(x) the best \(\omega_{j}\) is \(\omega_{j}^{*} = - G_{j} /(H_{j} + \lambda )\).
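To make Eq. (3) concrete, the following minimal sketch computes the optimal leaf weight \(\omega_{j}^{*}\) and the corresponding structure score for a single leaf, assuming a squared-error loss (so each sample contributes gradient \(g_i = \hat{y}_i - y_i\) and Hessian \(h_i = 1\)); the function and variable names are illustrative, not part of the XGBoost API:

```python
import numpy as np

def leaf_weight_and_score(g, h, lam=1.0, gamma=0.0):
    """Optimal weight and objective contribution of one leaf (from Eq. 3).

    g, h: first- and second-order loss gradients of the samples in the leaf;
    lam:  L2 regularization term lambda; gamma: penalty per leaf.
    """
    G, H = np.sum(g), np.sum(h)
    w_star = -G / (H + lam)                    # best leaf score omega_j*
    score = -0.5 * G ** 2 / (H + lam) + gamma  # this leaf's part of Obj
    return w_star, score

# Squared-error loss: g_i = y_hat_i - y_i, h_i = 1
y = np.array([3.0, 5.0, 4.0])
y_hat = np.array([4.0, 4.0, 4.0])
print(leaf_weight_and_score(y_hat - y, np.ones_like(y)))
```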

Support vector machines (SVM)

SVM is a machine learning method based on statistical learning theory and was developed by Cortes and Vapnik (1995). The method often achieves high performance with only slight tuning. Similar to CART, SVM can be used to solve both classification and regression problems. According to Cortes and Vapnik (1995), SVM was originally used for classification analysis; SVR, a version of SVM for regression analysis, was proposed by Drucker et al. (1997).

In SVM, the task is to fit the data \(\{ x_{i} ,y_{i} \} ,\,(i = 1,2, \ldots ,n),\,x_{i} \in R^{n} ,\,y_{i} \in R\) with a function \(f(x) = w \cdot x + b\). According to SVM theory, the fitting function is expressed as follows:

$$f(x) = w \cdot x + b = \sum\limits_{i = 1}^{k} {(a_{i} - a_{i}^{*} )K(x,x_{i} ) + b}$$
(4)

where \(a_{i}\), \(a_{i}^{*}\), and b are obtained by solving the following quadratic optimization problem. Usually, only a small fraction of the \(a_{i}\), \(a_{i}^{*}\) are nonzero; the corresponding samples are called support vectors.

Maximize:

$$w(a,a^{*} ) = - \frac{1}{2}\sum\limits_{i,j = 1}^{k} {(a_{i} - a_{i}^{*} )(a_{j} - a_{j}^{*} )K(x_{i} ,x_{j} ) + \sum\limits_{i = 1}^{k} {y_{i} (a_{i} - a_{i}^{*} ) - \varepsilon \sum\limits_{i = 1}^{k} {(a_{i} + a_{i}^{*} )} } } ,$$
(5)
$${\text{s}} . {\text{t}} .\left\{ {\begin{array}{*{20}l} {\sum\limits_{i = 1}^{k} {(a_{i} - a_{i}^{*} ) = 0} } \hfill \\ {0 \le a_{i} ,a_{i}^{*} \le C,(i = 1,2, \ldots ,k)} \hfill \\ \end{array} } \right.,$$
(6)

where C is a penalty factor that controls the degree of penalty applied to samples whose error exceeds ε, and \(K(x_{i} ,x_{j} )\) is the kernel function, which handles high-dimensional calculations efficiently. The kernel functions are mainly of the following types:

  1. Linear kernel

    $$K\left( {x,y} \right) = x \cdot y,$$
    (7)
  2. Polynomial kernel

    $$K\left( {x,y} \right) = [(x \cdot y) + 1]^{d} ;\quad d = (1,2, \ldots ),$$
    (8)
  3. Radial basis kernel function

    $$K\left( {x,y} \right) = \exp \left[ {\frac{{ - \left\| {x - y} \right\|^{2} }}{{\sigma^{2} }}} \right],$$
    (9)
  4. Two-layer neural kernel

    $$K\left( {x,y} \right) = \tanh \left[ {a(x \cdot y) - \delta } \right].$$
    (10)
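These four kernels translate directly into code; the sketch below restates Eqs. (7)–(10) with NumPy (σ, a, and δ are the free parameters named above; the sample vectors are placeholders):

```python
import numpy as np

def linear(x, y):                       # Eq. (7)
    return np.dot(x, y)

def polynomial(x, y, d=2):              # Eq. (8)
    return (np.dot(x, y) + 1) ** d

def rbf(x, y, sigma=1.0):               # Eq. (9)
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def two_layer(x, y, a=1.0, delta=0.0):  # Eq. (10)
    return np.tanh(a * np.dot(x, y) - delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, y), polynomial(x, y), rbf(x, y), two_layer(x, y))
```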

In this study, the SVM method with a polynomial kernel function was used to develop the SVM model for predicting PPV.
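The paper does not specify the software used; as a hedged sketch, a polynomial-kernel SVR of this kind can be set up with scikit-learn as follows (the synthetic data and hyperparameter values are placeholders within the ranges tuned later):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(size=(146, 9))          # stand-in for the nine blast parameters
y = 20 * X[:, 1] / (1 + X[:, 5]) + rng.normal(0, 0.5, 146)  # synthetic "PPV"

# Polynomial-kernel SVR (Eq. 8); SVMs are scale-sensitive, hence the scaler
svr = make_pipeline(StandardScaler(), SVR(kernel="poly", degree=2, C=0.5))
svr.fit(X, y)
print(svr.predict(X[:3]))
```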

Random forest (RF)

RF is a decision-tree-based algorithm first introduced by Breiman (2001). It is well known as a robust non-parametric statistical technique for both regression and classification problems. RF is an ensemble method that combines the results of different trees to achieve predictive accuracy (Vigneau et al. 2018). For each new observation, RF combines the values predicted by the individual trees in the forest to give the best result; each tree acts as a voter for the final decision of the RF (Gao et al. 2018b). The core of the RF model for regression can be described in the following three steps:

  • Step 1 Create as many bootstrap samples as there are trees in the forest (ntree) from the dataset.

  • Step 2 Grow an unpruned regression tree for each bootstrap sample, randomly sampling a subset of the predictors (mtry) at each node and selecting the best split among those variables.

  • Step 3 Predict new observations by aggregating the values predicted by the ntree trees. For regression problems such as predicting blast-induced PPV, the average of the values predicted by the individual trees in the forest is used.

Based on the training dataset, an estimate of the error rate can be obtained by the following:

  • At each bootstrap iteration, predict the data not included in the bootstrap sample (the "out-of-bag" (OOB) data) using the tree grown on that bootstrap sample.

  • Aggregate the OOB predictions and calculate the error rate.

The implementation of the RF algorithm for predicting blast-induced PPV in this study is shown in Fig. 2; a minimal code sketch follows Fig. 2. More details of the RF algorithm can be found in the references (Breiman 2001; Bui et al. 2019; Nguyen and Bui 2018b).

Fig. 2 Workflow of RF in predicting blast-induced PPV
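A minimal sketch of this workflow, assuming scikit-learn's RandomForestRegressor as the implementation: n_estimators plays the role of ntree, max_features the role of mtry, and oob_score=True reproduces the out-of-bag error estimate described above (the data are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(146, 9))                      # nine blast parameters
y = 20 * X[:, 1] / (1 + X[:, 5]) + rng.normal(0, 0.5, 146)

# ntree -> n_estimators, mtry -> max_features; bootstrap sampling is on
rf = RandomForestRegressor(n_estimators=150, max_features=9,
                           bootstrap=True, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB R2 estimate: {rf.oob_score_:.3f}")      # error estimate from OOB data
```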

k-nearest neighbor (KNN)

KNN is a popular technique for solving regression and classification problems in machine learning and was introduced by Altman (1992). The KNN algorithm predicts a testing point based on its k closest neighbors in the feature space. In contrast to the other algorithms, KNN does not learn an explicit model from the training data; it simply stores the training samples. When forecasting a new observation, it searches for similar samples by calculating the distances to the neighbors. Therefore, KNN is classified as a "lazy learning" algorithm (Fig. 3).

Fig. 3 Illustration of KNN algorithm for two-dimensional feature space (Hu et al. 2014)

For regression problems such as predicting blast-induced PPV, the KNN algorithm uses a weighted average of the k-nearest neighbors, weighted inversely by their distances. KNN regression proceeds in the following four steps (a minimal code sketch follows the list):

  • Step 1 Determine the distance from the query sample to the labeled samples.

    $$d(x_{tr} ,x_{t} ) = \sqrt {\sum\limits_{n = 1}^{N} {w_{n} (x_{tr,n} - x_{t,n} )^{2} } }$$
    (11)

    where N is the number of features; \(x_{tr,n}\) and \(x_{t,n}\) denote the nth feature values of the training (\(x_{tr}\)) and testing (\(x_{t}\)) points, respectively; and \(w_{n}\) is the weight of the nth feature, which lies in the interval [0, 1].

  • Step 2 Order the labeled examples by increasing distance.

  • Step 3 Based on RMSE (Eq. 12), define the optimal number of neighbors. Cross-validation can be used for this task.

  • Step 4 Calculate the inverse-distance-weighted average of the k-nearest neighbors.
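A minimal sketch of these four steps, assuming scikit-learn's KNeighborsRegressor: with weights="distance", the prediction is the inverse-distance-weighted average of Step 4, and the Euclidean metric corresponds to Eq. (11) with all \(w_{n} = 1\) (synthetic placeholder data again):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(146, 9))
y = 20 * X[:, 1] / (1 + X[:, 5]) + rng.normal(0, 0.5, 146)

# k neighbors, Euclidean distance (Eq. 11 with w_n = 1),
# predictions averaged with inverse-distance weights (Step 4)
knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X, y)
print(knn.predict(X[:3]))
```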

Results and discussion

In this study, the datasets were divided into two subsets: training and testing. Of the total dataset, 80% (approximately 118 blasting events) was used for the training process, and the rest (28 observations) was used for the testing process. The training dataset was used to develop the models described above; the testing dataset was used to assess the performance of the constructed models.
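A hedged sketch of this 80/20 partition (the actual split used in the study is not published; the random seed and placeholder data are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(146, 9))        # placeholder for the 146 blast records
y = rng.uniform(2.14, 33.6, 146)      # placeholder PPV values (mm/s)

# 80/20 split: 118 training events, 28 testing events
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=28, random_state=42)
print(len(X_train), len(X_test))      # 118 28
```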

To evaluate the performance of the constructed models, two statistical criteria were used: the determination coefficient (R2) and the root-mean-square error (RMSE). RMSE provides an idea of how wrong the predictions are (0 is perfect), and R2 provides an idea of how well the model fits the data (1 is perfect, 0 is worst). In this study, RMSE and R2 were computed using the following equations:

$${\text{RMSE}} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }$$
(12)
$$R^{2} = 1 - \frac{{\sum\limits_{i} {(y_{i} - \hat{y}_{i} } )^{2} }}{{\sum\limits_{i} {(y_{i} - \bar{y})^{2} } }}$$
(13)

where n denotes the number of data points; \(y_{i}\) and \(\hat{y}_{i}\) denote the measured and predicted values, respectively; and \(\bar{y}\) is the mean of the measured values.
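Eqs. (12) and (13) translate directly into code; this sketch is a plain restatement of the two formulas:

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))   # Eq. (12)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot                # Eq. (13)

y = np.array([3.0, 5.0, 4.0])
y_hat = np.array([2.5, 5.5, 4.0])
print(rmse(y, y_hat), r2(y, y_hat))             # 0.408..., 0.75
```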

Additionally, the Box–Cox transform and 10-fold cross-validation were used to avoid overfitting and underfitting.
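A hedged sketch of these two safeguards, assuming scikit-learn's PowerTransformer for the Box–Cox step (which requires strictly positive inputs) and KFold for the 10-fold cross-validation; the model and data are placeholders, as the paper does not state exactly how the transform was applied:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(146, 9))   # Box-Cox needs strictly positive data
y = rng.uniform(2.14, 33.6, 146)           # placeholder PPV values (mm/s)

model = make_pipeline(
    PowerTransformer(method="box-cox"),    # Box-Cox transform of the inputs
    RandomForestRegressor(random_state=0),
)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print(-scores.mean())                      # mean 10-fold CV RMSE
```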

XGBoost

In XGBoost, two stopping criteria, namely the maximum tree depth and the number of boosting rounds (nrounds), were considered to control model complexity. Selecting excessively large values for the maximum tree depth and nrounds causes excessive growth of the trees and an overfitting problem. Therefore, the maximum tree depth was set in the 1–3 range, and nrounds was set to 50, 100, and 150.

To find the optimum combination of these two parameters, a trial-and-error procedure was conducted over the proposed ranges. The performance indices, RMSE and R2, were calculated to evaluate the XGBoost models on both the training and testing datasets (Table 3).

Table 3 Performance indicators of the XGBoost models
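A hedged sketch of the trial-and-error grid behind Table 3, assuming the xgboost Python package (the study's exact tooling is not stated; max_depth and n_estimators correspond to the maximum tree depth and nrounds):

```python
import numpy as np
from itertools import product
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(146, 9))
y = 20 * X[:, 1] / (1 + X[:, 5]) + rng.normal(0, 0.5, 146)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=28, random_state=42)

for depth, rounds in product([1, 2, 3], [50, 100, 150]):   # 9 candidate models
    model = XGBRegressor(max_depth=depth, n_estimators=rounds).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(depth, rounds,
          round(np.sqrt(mean_squared_error(y_te, pred)), 3),   # testing RMSE
          round(r2_score(y_te, pred), 3))                      # testing R2
```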

Based on Table 3, nine XGBoost models were developed and evaluated. The results of the XGBoost models in Table 3 are very close to one another, which makes selecting the best model difficult. Thus, the simple ranking method proposed by Zorlu et al. (2008) was applied: each model receives a rank for each performance indicator, and the ranks are summed into an overall grade. The XGBoost models are ranked through these indicators in Table 4, and the overall grades for XGBoost models 1–9 are summarized in Table 5.

Table 4 The ranking of the XGBoost models based on their performance
Table 5 Total rank of XGBoost models

According to Table 5, model 1, with a total rank value of 35, reached the highest value among all the constructed XGBoost models. In other words, XGBoost model no. 1 performed better than the other XGBoost models in this study.
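A minimal sketch of the ranking computation, assuming (as in Zorlu et al. 2008) that for each performance indicator the best model receives the highest rank and the ranks are then summed per model; in the paper this is applied to RMSE and R2 on both the training and testing datasets:

```python
import numpy as np

def total_ranks(rmse_vals, r2_vals):
    """Sum per-indicator ranks; the best model gets the highest rank."""
    rmse_vals = np.asarray(rmse_vals)
    r2_vals = np.asarray(r2_vals)
    n = len(rmse_vals)
    rmse_rank = n - rmse_vals.argsort().argsort()   # lower RMSE -> higher rank
    r2_rank = r2_vals.argsort().argsort() + 1       # higher R2  -> higher rank
    return rmse_rank + r2_rank

# Three hypothetical models (values are illustrative only)
print(total_ranks([1.74, 1.80, 1.92], [0.95, 0.94, 0.91]))  # -> [6 4 2]
```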

Support vector machine (SVM)

For SVM, a polynomial kernel function was used to develop the SVM models. Two tuning parameters, namely the degree and the cost, were considered to control model complexity, while the scale parameter was held constant at 0.1. In this study, we selected the range 1–3 for the degree and set the cost to 0.25, 0.5, and 1.

To find the optimum combination of these two parameters, a trial-and-error procedure was conducted over the ranges of the two SVM parameters, similar to that used for the XGBoost method. The performance indices, RMSE and R2, were calculated to evaluate the SVM models on both the training and testing datasets (Table 6).

Table 6 Performance indices of SVM models

Table 6 shows some low-performance models, such as nos. 1, 4, 7, 2, and 5. However, the remaining models exhibit high performances that are almost similar. Thus, the simple ranking method was again applied to determine the best SVM model among the developed ones, as shown in Table 7. Table 8 indicates the total ranks of SVM models 1–9.

Table 7 Performance indices of SVM models with the rank
Table 8 Total rank of SVM models

According to Table 8, model 6, with a total rank of 32, achieved the best performance among all the developed SVM models. Thus, we conclude that model 6 is the best model obtained with the SVM method. Note that the same training and testing datasets were used for the development of the SVM models as for the XGBoost models.

Random forest (RF)

With the RF technique, two tuning parameters, called ntree and mtry, were considered to control complexity and reduce the running time of the model. A trial-and-error procedure was implemented with ntree in the range of 50–150 and mtry set to 5, 7, and 9 (Table 9). As with the development of the XGBoost and SVM models, the same training and testing datasets were used for the development of the RF models in this study.

Table 9 Performance of the RF models for predicting blast-induced PPV

Based on Table 9, all nine constructed RF models are suitable for estimating blast-induced PPV in this study. Some of the RF models, such as models 5–9, provide higher performance than the others. However, the results of the models are nearly similar, so concluding which RF model is the best is difficult. The ranking technique was therefore used to identify the best RF model, as reported in Table 10; the total rankings of the RF models are computed in Table 11.

Table 10 The RF models with their rank through performance indicators
Table 11 Total ranking of RF models

According to Tables 10 and 11, RF model 7, with a total ranking value of 30, reached the highest value among all the developed RF models. Thus, we can conclude that RF model 7, with ntree = 150 and mtry = 9, is the superior model of the RF technique for predicting blast-induced PPV in this study.

k-nearest neighbor (KNN)

In this study, nine KNN models were developed with the number of neighbors k set in the range of 3–11 using the training dataset. The performance of the KNN models was then evaluated on the testing dataset. Note that the same datasets were used for the development of the KNN models as for the models above. The performance indices of the KNN models are shown in Table 12.

Table 12 Performance of the KNN models in this study

As shown in Table 12, the results of the constructed KNN models are close to one another, so determining the most optimal among the built KNN models is difficult. The simple ranking method of the previous sections was applied to the KNN technique: the performance indices of the KNN models with their ranks are presented in Table 13, and Table 14 shows the total ranks of the KNN models.

Table 13 Performance of the KNN models with the rank
Table 14 Total rank of KNN models

According to Tables 13 and 14, the nine KNN models were ranked with total rank values in the range of 9–31. As shown in the tables, KNN model 3, with a total rank value of 31, achieved the highest value among the developed KNN models.

Validation performance of models

In this study, two statistical criteria, namely R2 and RMSE, were employed to measure the performance of the selected predictive models, computed using Eqs. (12) and (13). After the optimal model for each technique was selected, the values of these statistical criteria were calculated for both the training and testing datasets, as indicated in Table 15. According to these results, the accuracy of the XGBoost technique is better than those of the SVM, RF, and KNN models. Figure 4 demonstrates the performance of the models in forecasting blast-induced PPV on the testing dataset.

Table 15 Statistical values for selected predictive models
Fig. 4 Measured versus predicted values on the testing dataset

Figure 5 presents a useful way to examine the spread of the estimated accuracies of the various methods and how they compare among the XGBoost, SVM, RF, and KNN techniques. According to Fig. 5, the KNN technique has the lowest accuracy level with several outliers, whereas the XGBoost technique exhibits the highest accuracy level without outliers. The RF technique approaches the performance of XGBoost; however, a closer look shows that the developed XGBoost model offers higher performance than the RF model, and the RF technique produces outliers whereas the established XGBoost model has none. Additionally, the accuracies of the selected PPV predictive models are compared in Fig. 6. According to Fig. 6, among the developed models, the XGBoost technique yields the most reliable results in forecasting blast-induced PPV.

Fig. 5 Comparison of machine learning algorithms in box and whisker plots

Fig. 6 Prediction values of selected predictive models on testing datasets

Considering the input variables in this study, the number of input variables is high (nine). Therefore, a sensitivity analysis was performed to determine which input variable(s) are the most influential on blast-induced PPV, as shown in Fig. 7. As a result, Q (charge per delay) and D (monitoring distance) are the most influential factors on blast-induced PPV in this study, and they should be used in practical engineering to control blast-induced PPV. The other input parameters also affect blast-induced PPV, but to a lesser extent.

Fig. 7 Sensitivity analysis of independent variables for the PPV predictive model
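The paper does not state which sensitivity method produced Fig. 7; as a stand-in, the following hedged sketch ranks the inputs by the impurity-based feature importances of a fitted random forest (the synthetic target deliberately depends mainly on the Q and D columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["N", "Q", "q", "T", "B", "D", "S", "H", "dt"]   # the nine inputs
X = rng.uniform(size=(146, 9))
y = 20 * X[:, 1] / (1 + X[:, 5]) + rng.normal(0, 0.5, 146)  # Q and D dominate

rf = RandomForestRegressor(n_estimators=150, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")   # higher = more influence on the synthetic PPV
```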

Conclusions and recommendations

In practice, an accurate and efficient estimation of PPV is essential to reduce the environmental effects of blasting operations, especially near residential areas. This study developed XGBoost, SVM, RF, and KNN models to predict PPV caused by blasting operations in the Deo Nai open-pit coal mine in Vietnam. Nine input parameters (Q, H, B, S, T, q, N, D, and Δt) from 146 blasting events at the mine were used to predict PPV. For modeling purposes, the datasets were divided into training and testing sets, with 80% (118 observations) of the entire dataset used for training and 20% (28 observations) for testing. The performance of the predictive models was evaluated based on two criteria, namely R2 and RMSE, on the training and testing datasets. Based on the results of this study, RMSE values of 1.554 and 1.742 were obtained for the XGBoost model on the training and testing datasets, respectively. These values are the smallest among the RMSE values of the constructed models, which shows that the XGBoost model can be introduced as a new approach to solving environmental problems caused by blasting. Furthermore, R2 values of 0.955 and 0.952 for the training and testing datasets, respectively, indicate that the capability of the proposed XGBoost technique is slightly higher than that of the other developed models for PPV prediction.

Although XGBoost was a robust model for predicting blast-induced PPV in this study, further research is needed to improve its accuracy and computational time. A hybrid model that combines XGBoost with another algorithm would also be a promising direction for future work.