1 Introduction

Rockbursting is a rock mass instability phenomenon accompanied by spalling, cracking, splitting or even ejection due to the sudden release of strain energy (Gong et al. 2019). For deep mines and tunnels, rockbursting is becoming an increasingly prominent problem. It has been reported that many countries have encountered the rockburst hazard, making it a worldwide challenge (Keneti and Sainsbury 2018; Hu et al. 2019). For example, seismicity and rockbursting were deemed as the biggest threat to workers’ safety in Ontario’s underground mines based on an investigation by the Ontario Ministry of Labour in Canada (Ontario Ministry of Labour 2015). An intense rockburst can cause casualties and huge economic losses. Durrheim (2010) reported the largest rockburst with a local magnitude of 5.2 in the Klerksdorp district of South Africa. It caused two deaths, fifty-eight injuries and severe damage to the surface buildings. Zhang et al. (2012) described four extremely intense rockburst cases in the tunnels of Jinping II Hydropower Station. One of them occurred at the depth of 2330 m and caused seven deaths, one injury and destruction of a tunnel boring machine. Because of its serious consequences, predicting the risk of rockbursting efficiently is crucial and plays a significant role in disaster prevention.

Rockburst risk prediction can be classified into two types: long-term and short-term (Zhou et al. 2012). Long-term rockburst prediction is generally conducted in the design or early excavation stages of projects. The intrinsic rock mechanics parameters, such as strength, stiffness, brittleness and energy storage capacity, and the in situ or mining-induced stress are used to establish the predictive model (Liang et al. 2019). The likelihood of rockbursting for various rock types in different stress conditions is estimated. The results are then used to provide guidance for the subsequent construction stages. Short-term rockburst prediction aims to predict the risk of rockbursting in the near future based on in situ measurement techniques. Among these techniques, microseismic monitoring is one of the most widely used methods (Zhang et al. 2018). By monitoring and analyzing the microseismic waves released by the rock fracture, some premonitory characteristics of rockbursts are determined, which can be used to predict the risk of rockbursting. The most commonly used microseismic indicators for rockburst prediction include the number of events (Srinivasan et al. 1997), b value (Ma et al. 2018), energy index (Liu et al. 2013), apparent volume (Ma et al. 2018) and others (Cai et al. 2018; Liang et al. 2020) and their changing rate over a time window. For instance, Brady and Leighton (1977) found that the seismic activity increased rapidly and then decreased distinctly before a rockburst; Ma et al. (2018) revealed that an elbow point existed in the b-value time curve prior to the mainshocks; Liu et al. (2013) found that the apparent volume and spatial correlation length increased, whereas energy index, fractal dimension and b value dropped before the occurrence of large events; Xue et al. (2020) used the number of daily events N and b value to predict rockburst risk and found that rockbursting was more likely to occur when (lgN)/b was larger than 1. 
However, the precursory characteristics for rockburst prediction are not always consistent in different geological environments; a general early warning threshold has not yet been determined. Although numerous promising results have been achieved in various aspects of rockburst research, short-term rockburst prediction still remains problematic. Currently, there is no uniform approach or consensus in the engineering community.

Rockbursting is affected by many complex factors that include the properties of the rock mass, geological structures, stress regime and excavation activities (Wu et al. 2019). Some of these factors are uncertain, correlated, and cannot be determined quantitatively (Liang et al. 2019). In addition, there is a highly nonlinear relationship between rockburst risk level and influencing factors, which makes it very difficult to determine the contribution of each rockburst factor. Thus, the prediction accuracy for rockbursting is greatly restricted. In addition, different mechanisms can cause rockbursts. Rockbursts can be classified into three categories, namely strainbursts, pillar bursts and fault-slip bursts (Kaiser and Cai 2012). Strainbursts and pillar bursts are generally produced due to high strain energy concentration and release from the rock mass itself, whereas fault-slip bursting is induced by a remote source such as slippage of faults or other geological structures. Strainbursts are the most common type of rockbursts experienced in tunnels. In strainbursts, the location of the seismic event is the same as where damage occurs, providing the opportunity to predict the rockburst risk using microseismic indicators in tunnels.

More recently, machine learning (ML) approaches have started to be used to predict rockbursts because of their ability to handle nonlinear and complex problems. Rockburst prediction is generally treated as a classification problem according to the risk level. In this paper, we propose using ensemble learning methods for predicting short-term rockburst risk in tunnels based on microseismic monitoring. Ensemble learning is a branch of machine learning that incorporates multiple models to achieve particular tasks where the errors of an individual model are likely compensated by others (Sagi and Rokach 2018). These models are generally decision trees. Decision trees fit the short-term rockburst prediction task well because they do not need any underlying distribution assumption, can handle multiple types of data and work especially well with small datasets. However, decision trees also have some disadvantages, such as being sensitive to small variations of the input data and being prone to overfitting, which means the model fits the given data too closely and cannot predict unseen behavior effectively, particularly if the number of samples is small (Kotsiantis 2013; Rokach 2016). Ensemble learning techniques overcome this drawback by combining multiple models, thus reducing the bias (Woźniak et al. 2014). Therefore, the overall predictive performance of the ensemble classifier would be better than that of a single model (Dev and Eden 2019; Krawczyk et al. 2017). The typical ensemble learning methods include random forest (RF) (Breiman 2001), adaptive boosting (AdaBoost) (Freund and Schapire 1997), gradient boosted decision tree (GBDT) (Friedman 2001), extreme gradient boosting (XGBoost) (Chen and Guestrin 2016) and light gradient boosting machine (LightGBM) (Ke et al. 2017).
Recently, these algorithms have received a lot of attention and have proven to have excellent prediction performance in many engineering geology fields, such as formation lithology classification (Dev and Eden 2019), landslide assessment (Dou et al. 2020) and rock core image processing (Chauhan et al. 2016). To the best of our knowledge, a comparative analysis of these algorithms for short-term rockburst risk prediction has not been conducted.

The original contribution of this paper is to develop a methodology that uses ensemble learning methods to predict short-term rockburst risk. First, the rockburst case data based on microseismic monitoring are collected. Then, five ensemble learning algorithms are used to predict the short-term rockburst risk. Finally, the predictive performance of each model is comprehensively analyzed and evaluated using five metrics.

2 Literature review

In the last decade, many machine learning (ML) approaches have been used to predict the risk of rockbursts, of which only a small subset targeted short-term rockburst prediction. A summary of approaches that use ML for rockburst prediction is provided in Table 1. These approaches can be classified into three categories: supervised learning approaches, comparative decision strategies and unsupervised learning approaches. Artificial neural networks (ANN) (Feng and Wang 1994), Gaussian process (GP) (Su et al. 2009), RF (Dong et al. 2013), stochastic gradient boosting (SGB) (Zhou et al. 2016), Bayesian networks (BNs) (Li et al. 2017), logistic regression (LR) (Li and Jimenez 2018), regression models (RMs) (Afraei et al. 2018) and DT (Pu et al. 2018; Ghasemi et al. 2020) have been applied as supervised learning approaches to long-term rockburst prediction. The majority of these approaches are not well suited to the typical size and structure of rockburst datasets: ANNs typically need large datasets for training and tend to overfit; GP assumes each observation is normally distributed, which is not appropriate in many applications, and it uses the whole sample and feature information to perform the prediction; BNs cannot obtain good prediction performance when the correlation between features is large because they assume feature independence; LR tends to underperform when multiple or nonlinear decision boundaries exist; RMs are sensitive to outliers; and DT is unstable and prone to overfitting.

Table 1 Summary of ML algorithms on rockburst prediction

Some more advanced supervised learning techniques were adapted through combining different algorithms and other performance improvement methods. Zhou et al. (2012) used the genetic algorithm (GA) and particle swarm optimization (PSO) to optimize the hyperparameters of support vector machines (SVM); Jiang et al. (2016) adopted the GA-based synthetic minority over-sampling technique (SMOTE) and C4.5 DT algorithms for the imbalanced data classification of rockbursts; Li et al. (2017) used the extreme learning machine (ELM) optimized by GA to improve the accuracy of rockburst prediction; Pu et al. (2019) combined the fruit fly optimization (FOA) and generalized regression neural networks (GRNN) to predict rockbursts; Wu et al. (2019) used Copula theory and least-squares support vector machine (LSSVM) to establish rockburst prediction probability models; Liu and Hou (2019) introduced PSO to optimize back propagation neural network (BPNN), probabilistic neural network (PNN) and SVM simultaneously for rockburst prediction; Zheng et al. (2019) developed an entropy weight integrated with grey relational BPNN model to analyze rockburst risk; Zhou et al. (2020) integrated the firefly algorithm (FA) and ANN for rockburst risk prediction; and Xue et al. (2020) established the PSO-ELM model to determine rockburst risk. In these papers, the prediction performances of supervised learning algorithms are improved through hyperparameter optimization, sampling techniques and data preprocessing.

Comparative analysis of multiple ML algorithms was performed by several researchers for long-term rockburst prediction. For instance, Zhou et al. (2016) compared the predictive performance of ten supervised learning methods, which included linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), partial least-squares discriminant analysis (PLSDA), naive Bayes (NB), k-nearest neighbor (KNN), multilayer perceptron neural network (MLPNN), DT, SVM, RF and gradient boosting machine (GBM); Lin et al. (2018) made a performance comparison of the rough set (RS)-cloud model (CM), NB, KNN and RF for rockburst prediction; Faradonbeh and Taheri (2019) predicted rockburst hazard based on emotional neural network (ENN), gene expression programming (GEP) and C4.5 DT; and Pu et al. (2019) used a generative model (GP) and a discriminative approach (SVM) to evaluate rockburst liability, respectively. When compared to only using one algorithm, these comparative analyses yield more stable and robust results. However, except for DT and the ensemble approaches, these algorithms are not suitable for solving prediction problems with small datasets.

The last category of approaches considers the difficulty of obtaining the actual rockburst levels in some instances. Therefore, they adopt an unsupervised learning approach, where the labels of the data do not need to be known in advance. The classification is performed based on clustering and taking the distance from the center of the cluster. Gao (2015) used an abstraction ant colony clustering algorithm (ACCA) for the rockburst prediction to improve the computational efficiency and accuracy of the traditional ACCA; Pu et al. (2019) adopted the k-means method to relabel the original data and then used SVM to predict the rockburst risk for kimberlite pipes; Faradonbeh et al. (2019) advanced the two clustering approaches, namely, self-organizing map (SOM) and fuzzy c-mean (FCM), for rockburst prediction in deep underground projects. However, because the actual rockburst levels can be obtained in most situations, supervised learning methods received the most research attention for rockburst prediction.

In contrast to the extensive literature on long-term rockburst prediction, only three papers focus on short-term rockburst prediction. Zhou et al. (2016) predicted the field rockburst damage using the SGB model; Feng et al. (2019) combined the mean impact value algorithm (MIVA), modified firefly algorithm (MFA) and PNN model to predict the short-term rockburst risk based on the monitored microseismicity; and Ji et al. (2020) used a GA-SVM model to predict rockbursts (high energy tremors) in coal mines. Short-term rockburst prediction is very important as it can be used as a mechanism for real-time warning to prevent accidents. This paper addresses the need in the literature for machine learning methods of short-term rockburst prediction.

3 Data and variables

From the work of Feng et al. (2013), a total of 93 rockburst cases for short-term risk prediction are obtained from the tunnels of the Jinping-II hydropower project in China. This hydropower station includes seven parallel tunnels with an average length of 16.67 km and a maximum depth of 2525 m. The maximum, intermediate and minimum principal stresses are found by back analysis to be approximately 63 MPa, 34 MPa and 26 MPa, respectively (Feng et al. 2013). The main lithology is marble, which has a saturated uniaxial compressive strength of 30–114 MPa, elastic modulus of 25–40 GPa and tensile strength of 3–6 MPa, and the surrounding rock masses are hard and intact (Feng et al. 2019; Ma et al. 2015). Under these geological conditions, the rockburst risk is very high. The rockburst intensity is divided into four levels based on radiated energy and damage phenomena, namely none, slight, moderate and intense. The classification method of rockburst risk level is described in Table 2 (Feng et al. 2013; Chen et al. 2015). In this paper, these four levels acting as the model output are labeled as 0, 1, 2 and 3, respectively. The distribution of the rockburst risk levels in the overall dataset is given in Fig. 1. The complete rockburst case data are presented in “Online Appendix A.”

Table 2 Rockburst risk level classification method (Feng et al. 2013; Chen et al. 2015)
Fig. 1 Distribution of the rockburst risk levels

This dataset is obtained from the microseismic monitoring system, which provides many parameters that can be used to quantify seismic events. Details of the data acquisition process are discussed in Feng et al. (2015). In general, the number of events, seismic energy and apparent volume are used to represent seismic activity. Among them, the number of events reflects the frequency of seismicity; seismic energy indicates the energy released by seismic events in the form of waves and characterizes the strength of seismicity. The seismic energy of P or S waves is calculated by (Glazer 2018):

$$E = 8\pi \rho c\int_{0}^{\infty } {v^{2} (f)} df,$$
(1)

where \(\rho\) indicates the density of source material; \(c\) is the velocity of P or S wave; \(v^{2} (f)\) represents the velocity power spectrum; and \(f\) denotes the frequency.

The apparent volume represents the volume of inelastic deformation zone of seismic source, which can be determined by (Liu et al. 2013):

$$V_{A} = \mu P^{2} /E,$$
(2)

where \(\mu\) indicates the shear modulus of rock; and \(P\) is the seismic potency.
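As a rough numerical illustration of Eqs. (1) and (2), the sketch below integrates a sampled velocity power spectrum with the trapezoidal rule and then evaluates the apparent volume. The function names, the finite frequency range standing in for the infinite upper limit, and all input values are assumptions made for illustration; they are not taken from the monitoring system or the case data.

```python
import math

def seismic_energy(rho, c, freqs, power_spectrum):
    """Eq. (1): E = 8*pi*rho*c * integral of v^2(f) df (trapezoidal rule)."""
    integral = 0.0
    for k in range(1, len(freqs)):
        df = freqs[k] - freqs[k - 1]
        integral += 0.5 * (power_spectrum[k] + power_spectrum[k - 1]) * df
    return 8.0 * math.pi * rho * c * integral

def apparent_volume(mu, potency, energy):
    """Eq. (2): V_A = mu * P^2 / E."""
    return mu * potency ** 2 / energy

# Placeholder inputs (assumed values, only to exercise the formulas).
freqs = [0.0, 50.0, 100.0, 150.0, 200.0]   # sampled frequencies
spectrum = [0.0, 2e-9, 4e-9, 2e-9, 0.0]    # sampled velocity power spectrum
rho, c, mu = 2700.0, 3000.0, 3.0e10        # density, wave velocity, shear modulus

energy = seismic_energy(rho, c, freqs, spectrum)
volume = apparent_volume(mu, potency=0.05, energy=energy)
```

Because the energy appears in the denominator of Eq. (2), an event with a given potency has a larger apparent volume when it radiates less energy.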

Since seismic events are induced by dynamic rock fractures, the parameters of seismic events can reflect the characteristics of rock damage to some extent. In this study, six indicators are selected to predict the short-term rockburst risk, and their statistics are shown in Table 3. From the definitions of these seismic event parameters, \(C_{1}\), \(C_{2}\) and \(C_{3}\) signify the number, strength and size of rock mass fractures during the rockburst development process, respectively. In addition, \(C_{4}\), \(C_{5}\) and \(C_{6}\) are adopted to indicate the effect of time. Therefore, these six variables can be introduced to illustrate the state of the rock mass fractures (Feng et al. 2015). To make calculation convenient, the values of \(C_{2}\), \(C_{3}\), \(C_{5}\) and \(C_{6}\) are used in their logarithmic form. Taking the logarithm does not change the correlation of the data, but it compresses the scale of the variables and reduces the absolute value of the data.

Table 3 Statistics of each indicator

The box plot of each indicator for the four rockburst levels is shown in Fig. 2. Overall, the rockburst level is positively correlated with each indicator: the larger the indicator values, the higher the rockburst level. However, some outliers exist in all indicators under each level, which indicates the complexity of the rockburst formation mechanism. In addition, for the same indicator in different levels, the distance between upper and lower quartiles (the height of the box) varies, and the ranges of indicator values partly overlap. Therefore, the effect of all indicators needs to be incorporated for better accuracy.

Fig. 2 Box plot of each indicator in different levels

4 Methodology

In this paper, RF, AdaBoost, GBDT, XGBoost and LightGBM are used to predict the short-term rockburst risk and their predictive performances are comprehensively compared from multiple perspectives. The structure of the proposed methodology is indicated in Fig. 3. First, the raw rockburst data are randomly separated into a training set (70% of the data) and a test set (30% of the data). Of note, the proportion of rockburst samples with different risk levels in the training and test sets is kept consistent during data partitioning. Second, a fivefold cross-validation (CV) method is applied on the training set to obtain the optimal hyperparameter values of the five ensemble learning models (RF, AdaBoost, GBDT, XGBoost and LightGBM). Third, each model with tuned hyperparameters is fitted on the training set. Fourth, the test set is used to evaluate the performance of the models using classification accuracy, Cohen’s kappa, precision, recall and F-measure. Finally, the optimal model can be obtained based on these performance evaluation metrics. If its predictive performance is acceptable, it can be used for engineering applications. The detailed descriptions of ensemble learning models, hyperparameter optimization and model evaluation metrics are given in this section.
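The partitioning step, in which each risk level keeps roughly the same share in the training and test sets, can be sketched as a stratified split. The helper `stratified_split`, the fixed seed and the label vector below are hypothetical and only illustrate the idea; they do not reproduce the actual split used in this study.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for idx, level in enumerate(labels):
        by_level[level].append(idx)
    train_idx, test_idx = [], []
    for level in sorted(by_level):
        members = by_level[level]
        rng.shuffle(members)                      # random assignment within a level
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels for four risk levels (0-3); not the real case data.
labels = [0] * 40 + [1] * 20 + [2] * 20 + [3] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting within each level rather than over the pooled samples prevents a small class, such as the intense level, from being left out of the test set by chance.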

Fig. 3 Structure of the proposed methodology

4.1 Ensemble learning models

According to the dependencies between the base learners, the ensemble learning methods can be mainly divided into two types: bagging (Breiman 1996) and boosting (Schapire 1990). In bagging methods, the base learners are usually weak and have no dependencies, which allows them to be trained in parallel. The diagram of bagging ensemble learning is shown in Fig. 4. First, a bootstrap sampling technique is adopted to generate sample sets from the original dataset. Then, the base learners are independently trained using each sample set. Finally, the predicted result is obtained based on the integration rules. A voting method is usually used for classification problems. RF proposed by Breiman (2001) is a typical representative of the bagging approach. Nevertheless, RF differs from plain bagging in how the features are selected. The features are randomly selected from the original features in RF, which is favorable for improving the generalization ability of the model.
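The three bagging steps (bootstrap sampling, independent training, voting) can be sketched as follows. This is a toy illustration under stated assumptions: a single-threshold stump on one feature stands in for a decision tree, the problem is reduced to two classes, and the data are invented. The random feature selection that distinguishes RF is not shown.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Toy base learner: pick the threshold with the fewest training errors."""
    best = None
    for x_thr, _ in sample:
        errors = sum((x > x_thr) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, x_thr)
    threshold = best[1]
    return lambda x: int(x > threshold)

def bagging_fit(data, n_learners=25, seed=0):
    """Train each base learner independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_learners)]

def bagging_predict(learners, x):
    """Integrate the base learners by majority vote."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

# Invented (feature, label) pairs; label 1 for larger feature values.
data = [(0.1, 0), (0.4, 0), (0.5, 0), (0.9, 0),
        (1.6, 1), (2.0, 1), (2.3, 1), (3.0, 1)]
learners = bagging_fit(data)
```

Since no learner depends on another, the list comprehension in `bagging_fit` could be replaced by a parallel map without changing the result.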

Fig. 4 Diagram of bagging ensemble learning

In boosting methods, dependencies exist between the weak base learners, and the subsequent base learner increases the emphasis on the misclassified instances obtained by the previous base learners. The diagram of boosting ensemble learning is shown in Fig. 5. First, the initial sample set is used to train the first base learner. Then, the input of samples for the next base learner is adjusted based on the classification error of the first base learner. This process is repeated with the subsequent base learners and is terminated upon reaching the specified number of iterations or meeting the convergence conditions. Lastly, the final result is obtained by combining the results from all base learners. AdaBoost, GBDT, XGBoost and LightGBM are boosting approaches that use decision trees as base learners (Kotsiantis 2014). In AdaBoost, the weighted samples are taken as the input for the next base learner. The samples’ weights are adjusted based on the principle that misclassified samples obtain the larger weights. In addition, the final result is obtained using the weighted results from all base learners. The weight of the base learners is attained according to their classification error rate, and the base learner with a smaller error rate is provided a larger weight (Sagi and Rokach 2018). In GBDT, the residual of the previous base learner is selected as the input for the next base learner. Therefore, the goal of the subsequent base learners is to reduce the residual. The final result is obtained by adding the results from all base learners. GBDT has a good classification performance because the loss decreases along the gradient direction in each iteration (Deng et al. 2019). XGBoost and LightGBM are proposed as improvements to GBDT. 
In the XGBoost algorithm, a regularization term is added in the objective function to prevent overfitting, and a second-order Taylor expansion of the loss function instead of a first-order derivative in GBDT is used (Chen and Guestrin 2016). In LightGBM, a histogram-based algorithm is utilized to increase the training speed and reduce memory consumption, and leaf-wise growth strategy of the trees with a maximum depth limit is adopted (Ke et al. 2017).
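The AdaBoost weighting principle described above (misclassified samples gain weight, and learners with a smaller error rate receive a larger weight) can be illustrated with one round of the classic two-class update. This is a sketch of the textbook binary form, not the exact routine used in this study; a four-level problem would need a multiclass variant. The function name and the example weights are assumptions.

```python
import math

def adaboost_round(weights, is_misclassified):
    """One AdaBoost round: return (learner_weight, updated_sample_weights).

    Assumes the incoming sample weights sum to 1.
    """
    error = sum(w for w, miss in zip(weights, is_misclassified) if miss)
    error = min(max(error, 1e-10), 1 - 1e-10)      # guard against log(0)
    alpha = 0.5 * math.log((1.0 - error) / error)  # smaller error -> larger alpha
    raw = [w * math.exp(alpha if miss else -alpha)
           for w, miss in zip(weights, is_misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]         # renormalize to sum to 1

# Five equally weighted samples, the third one misclassified.
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, [False, False, True, False, False])
```

In this example the single misclassified sample ends up carrying half of the total weight, so the next base learner is pushed to classify it correctly.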

Fig. 5 Diagram of boosting ensemble learning

The strengths and weaknesses of the ensemble learning methods used in this paper are summarized in Table 4. These algorithms were implemented in Python: RF, GBDT and AdaBoost used the scikit-learn library (Pedregosa et al. 2011), and XGBoost and LightGBM used the xgboost and lightgbm libraries, respectively.

Table 4 Strengths and weaknesses of each ensemble learning method

4.2 Hyperparameter optimization

Hyperparameters are the parameters within the ML algorithm itself that need to be adjusted based on the dataset. To improve the performance of the model, the hyperparameters should be optimized rather than specified manually. In this study, some key hyperparameters in the ensemble learning models are selected for optimization. For the RF algorithm, the number of trees in the forest (n_estimators) and the maximum depth of the tree (max_depth) are selected. For ensemble learning algorithms belonging to the boosting type, the same hyperparameters are selected to make the comparison more meaningful, which include the maximum number of trees (n_estimators) and the shrinkage coefficient of each tree (learning_rate).

K-fold CV is commonly used for hyperparameter configuration (Zhou et al. 2016). In general, a value of K between five and ten is selected (Jung 2018). After considering the number of samples, fivefold CV is used in this study. The training set is randomly split into five folds. Among them, four folds make up the training sub-set, and the remaining one is used as the validation set. The training sub-set is utilized for fitting the models, and the validation set is utilized for assessing the model’s performance. This process is repeated five times, rotating the validation fold, until each sample in the training set has been predicted once. Then, the average validation accuracy over the five folds is used to determine the hyperparameter values.
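The fold rotation can be sketched as below. The generator `k_fold_indices`, the seed and the sample count of 63 are hypothetical choices for illustration; the fold accuracies passed to `mean_cv_accuracy` would come from fitting and scoring a model on each split.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

def mean_cv_accuracy(fold_accuracies):
    """Average the validation accuracy over the folds."""
    return sum(fold_accuracies) / len(fold_accuracies)

# Hypothetical training-set size, used only to show the fold structure.
splits = list(k_fold_indices(63, k=5))
```

The hyperparameter setting with the highest value of `mean_cv_accuracy` would then be retained.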

4.3 Model evaluation metrics

For the rockburst risk prediction problems, the predictive performance for both total samples and the samples belonging to a certain risk level should be taken into account. Therefore, five metrics including two global metrics (classification accuracy and Cohen’s Kappa) and three within-class metrics (precision, recall and F-measure) are adopted to evaluate the performance of each model (Kumar 2019). For a certain model, these metrics can be calculated based on the confusion matrix. Suppose a confusion matrix is:

$$M = \left[ {\begin{array}{*{20}c} {k_{11} } & {k_{12} } & \cdots & {k_{1n} } \\ {k_{21} } & {k_{22} } & \cdots & {k_{2n} } \\ \vdots & \vdots & \ddots & \vdots \\ {k_{n1} } & {k_{n2} } & \cdots & {k_{nn} } \\ \end{array} } \right],$$
(3)

where \(n\) indicates the number of levels, \(k_{ii}\) indicates the number of samples of level \(i\) that are correctly predicted, and \(k_{ij}\) indicates the number of samples of level \(i\) that are predicted to level \(j\).

The accuracy represents the proportion of all samples that are classified correctly, which can be calculated by:

$${\text{Accuracy}} = \frac{1}{m}\sum\limits_{i = 1}^{n} {k_{ii} } ,$$
(4)

where \(m\) indicates the number of samples in the dataset.

The Cohen’s Kappa is a robust criterion that is generally utilized to evaluate the interrater reliability for categorical variables. The range of its value is from − 1 to 1, with larger values indicating better agreement. A value greater than or equal to 0.4 indicates a good agreement (McHugh 2012). The Cohen’s Kappa can be calculated by:

$${\text{Kappa}} = \frac{{m\sum\nolimits_{i = 1}^{n} {k_{ii} } - \sum\nolimits_{i = 1}^{n} {\left( {\sum\nolimits_{j = 1}^{n} {k_{ij} } \times \sum\nolimits_{j = 1}^{n} {k_{ji} } } \right)} }}{{m^{2} - \sum\nolimits_{i = 1}^{n} {\left( {\sum\nolimits_{j = 1}^{n} {k_{ij} } \times \sum\nolimits_{j = 1}^{n} {k_{ji} } } \right)} }}.$$
(5)
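The two global metrics can be computed directly from a confusion matrix, as in the sketch below, where the chance-agreement term of Cohen's Kappa is the sum over levels of the row total (actual count) times the column total (predicted count). The function names and the matrix `M` are illustrative assumptions; `M` is not one of the matrices reported in this paper.

```python
def accuracy(matrix):
    """Share of all samples on the main diagonal."""
    m = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / m

def cohen_kappa(matrix):
    """Cohen's Kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    n = len(matrix)
    m = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(n))
    row_totals = [sum(matrix[i]) for i in range(n)]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    chance = sum(r * c for r, c in zip(row_totals, col_totals))
    return (m * diagonal - chance) / (m * m - chance)

# Invented four-level confusion matrix with 25 samples.
M = [
    [8, 1, 0, 0],
    [2, 3, 1, 0],
    [0, 1, 5, 0],
    [0, 0, 1, 3],
]
acc = accuracy(M)
kappa = cohen_kappa(M)
```

For this matrix the accuracy is 19/25 and the Kappa is 301/451, which shows how Kappa discounts the agreement that would be expected by chance.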

The precision for level \(i\) is the ratio of the number of samples correctly predicted as level \(i\) to the total number of samples predicted as level \(i\), which can be calculated by:

$${\text{Precision}} = \frac{{k_{ii} }}{{\sum\nolimits_{j = 1}^{n} {k_{ji} } }}.$$
(6)

The recall for level \(i\) is the ratio of the number of samples correctly predicted as level \(i\) to the actual number of samples of level \(i\), which can be calculated by:

$${\text{Recall}} = \frac{{k_{ii} }}{{\sum\nolimits_{j = 1}^{n} {k_{ij} } }}.$$
(7)
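In terms of the confusion matrix, precision for a level divides the diagonal entry by its column total (all samples predicted as that level), whereas recall divides it by its row total (all samples that actually belong to that level). A minimal sketch follows; the function names, the invented matrix and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision(matrix, level):
    """Diagonal entry divided by the column total (samples predicted as level)."""
    predicted_as_level = sum(row[level] for row in matrix)
    return matrix[level][level] / predicted_as_level if predicted_as_level else 0.0

def recall(matrix, level):
    """Diagonal entry divided by the row total (samples actually of level)."""
    actual_level = sum(matrix[level])
    return matrix[level][level] / actual_level if actual_level else 0.0

# Invented four-level confusion matrix (rows: actual, columns: predicted).
M = [
    [8, 1, 0, 0],
    [2, 3, 1, 0],
    [0, 1, 5, 0],
    [0, 0, 1, 3],
]
p0, r0 = precision(M, 0), recall(M, 0)
p3, r3 = precision(M, 3), recall(M, 3)
```

Level 3 in this matrix has a precision of 1 but a recall of 0.75: every prediction of that level is right, yet one actual case of that level is missed.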

The higher the precision and the recall, the better the results are. However, these two metrics can conflict in some cases. The F-measure therefore combines them into a single value, which can be calculated by:

$$F{\text{-measure}} = \frac{{(\beta^{2} + 1) \times {\text{Precision}} \times {\text{Recall}}}}{{\beta^{2} \times {\text{Precision}} + {\text{Recall}}}},$$
(8)

where \(\beta\) indicates the weighting factor; when \(\beta = 1\), the F-measure is called \({\text{F}}_{1}\).
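A minimal sketch of the F-measure is given below, using the standard definition in which the denominator is \(\beta^{2} \times {\text{Precision}} + {\text{Recall}}\); for \(\beta = 1\) it reduces to the harmonic mean of precision and recall. The function name and the zero-denominator convention are assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0                      # convention when both inputs are zero
    return (1 + beta ** 2) * precision * recall / denominator

f1 = f_measure(1.0, 0.75)               # harmonic mean of 1.0 and 0.75
f2 = f_measure(1.0, 0.5, beta=2.0)      # recall-weighted score
```

With a precision of 1.0 and a recall of 0.75, `f1` is about 0.857, closer to the smaller of the two inputs than their arithmetic mean would be.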

5 Results and analysis

5.1 Cross-validation results

The hyperparameter optimization results for each model based on the training sets are shown in Table 5. First, the range and interval of values for each hyperparameter were specified, which were identical for the same hyperparameters in different models to make the comparison more reasonable. The search range for n_estimators was between 10 and 100 with an interval of 10, the search range for max_depth was between 1 and 10 with an interval of 1, and the search range for learning_rate was between 0.01 and 0.2 with an interval of 0.01.
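The search space described above can be enumerated as follows, assuming both endpoints of each range are included. Under that assumption, RF has 100 candidate settings and each boosting model has 200. The helper `best_setting` and the toy scoring function are hypothetical; in practice the score would be the mean fivefold CV accuracy of a model fitted with that setting.

```python
from itertools import product

# Grids as stated in the text, with inclusive endpoints (an assumption).
n_estimators_grid = list(range(10, 101, 10))                     # 10, 20, ..., 100
max_depth_grid = list(range(1, 11))                              # 1, 2, ..., 10
learning_rate_grid = [round(0.01 * k, 2) for k in range(1, 21)]  # 0.01, ..., 0.20

rf_grid = list(product(n_estimators_grid, max_depth_grid))
boosting_grid = list(product(n_estimators_grid, learning_rate_grid))

def best_setting(grid, score):
    """Return the grid point with the highest score."""
    return max(grid, key=score)

# Toy score that peaks at (50, 3); it replaces real cross-validation here.
toy_best = best_setting(rf_grid, lambda p: -abs(p[0] - 50) - abs(p[1] - 3))
```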

Table 5 Hyperparameter optimization results

The average accuracy of the models for each group of hyperparameter values is shown in Fig. 6. According to Fig. 6, the overall performance of each model can be determined. Compared with the other models, the results of GBDT for different hyperparameter values are more stable. With the increase in hyperparameter values, the average accuracy of AdaBoost decreases, whereas the average accuracy of XGBoost and LightGBM increases. For RF, there are several peaks, but its overall accuracy is higher. Based on the best average accuracy of fivefold CV, the optimal hyperparameter values of each model are obtained (see Table 5).

Fig. 6 Average accuracy of CV for different hyperparameters. (a) RF; (b) AdaBoost; (c) GBDT; (d) XGBoost; (e) LightGBM

5.2 Prediction results of the test set

The models with the optimal hyperparameter values were used to predict short-term rockburst risk on the test set. The prediction results of each model are shown in Table 6. The results of each model form a confusion matrix as defined in Sect. 4.3. The values on the main diagonal represent the number of cases that are classified correctly, and the off-diagonal values represent the misclassified cases. The values in the first row and fourth column and the fourth row and first column are particularly important. The value in the first row and fourth column indicates the number of samples with risk level 0 that were predicted as level 3. Such a false alarm would cause unnecessary panic and wasted prevention costs. According to Table 6, only one such case exists in GBDT, and there are no such cases in the other models. The value in the fourth row and first column represents the number of samples with risk level 3 that were predicted as level 0. In this condition, not predicting a major seismic event could produce very serious accidents and even casualties. No such case occurs in any of the five models, which indicates that the ensemble learning models have a certain reliability for this situation.

Table 6 Prediction results of each model

The accuracy and Cohen’s Kappa of each model are shown in Fig. 7. It can be observed that RF performs the best with the highest accuracy of 0.8, followed by GBDT and XGBoost with an accuracy of 0.7667 and 0.7333, respectively. AdaBoost and LightGBM perform the worst with an accuracy of 0.6667. However, the Cohen’s Kappa value of AdaBoost is larger than that of LightGBM, which indicates that the performance of AdaBoost is better than that of LightGBM. In addition, RF, GBDT and XGBoost remain the top three performers in terms of Cohen’s Kappa. Therefore, after considering both accuracy and Cohen’s Kappa, it can be concluded that the performance rank is \({\text{RF}} > {\text{GBDT}} > {\text{XGBoost}} > {\text{AdaBoost}} > {\text{LightGBM}}\).

Fig. 7 Accuracy and Cohen’s Kappa values of each model

However, moderate and intense rockbursts should receive more attention in engineering, because they can cause more serious consequences. Therefore, the none and slight risk rockburst cases are combined into one group (low risk rockburst), and the moderate and intense risk rockburst cases are combined into another group (high risk rockburst). The accuracy of these two groups is displayed in Fig. 8. In terms of high risk rockburst, GBDT possesses the highest value with an accuracy of 0.9167, followed by RF with an accuracy of 0.8333. LightGBM possesses the lowest value with an accuracy of 0.6667. XGBoost and AdaBoost have the same accuracy on high risk rockburst, whereas XGBoost performs better than AdaBoost when predicting low risk rockburst. Therefore, if the focus is on high risk rockbursts, then the performance rank is \({\text{GBDT}} > {\text{RF}} > {\text{XGBoost}} > {\text{AdaBoost}} > {\text{LightGBM}}\).

Fig. 8 Accuracy for low and high risk rockburst
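The grouping can be sketched as follows. This is an illustration under a stated assumption: a sample counts as correct only when its exact level is predicted, and the group merely determines which samples are pooled. The text does not spell out this detail, and the labels are hypothetical.

```python
LOW_RISK = {0, 1}   # none and slight
HIGH_RISK = {2, 3}  # moderate and intense


def group_accuracy(y_true, y_pred, group):
    """Accuracy restricted to samples whose actual level belongs to `group`.

    Assumption: a prediction is correct only if the exact level matches.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in group]
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical labels, NOT the test set of this study.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
y_pred = [0, 0, 3, 1, 0, 2, 2, 1, 3, 3]

low_acc = group_accuracy(y_true, y_pred, LOW_RISK)
high_acc = group_accuracy(y_true, y_pred, HIGH_RISK)
print(low_acc, high_acc)
```

If a prediction were instead counted as correct whenever it falls in the right group, only the comparison `t == p` would need to be replaced by a group-membership check on `p`.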

To compare the classification performance of each model for each level, the precision, recall and F1 values are calculated. Precision is the proportion of samples predicted as a given level that actually belong to that level. The precision value of each model for each level is shown in Fig. 9. The best-performing model differs from level to level. RF, AdaBoost and LightGBM perform the best for the intense level with a precision of 1.0000, XGBoost performs the best for the moderate level with a precision of 0.7500, GBDT performs the best for the slight level with a precision of 1.0000, and RF performs the best for the none level with a precision of 0.8462. Overall, the precision for the intense and none levels is higher than that for the moderate and slight levels. The reason may be that the discrimination boundaries of the moderate and slight levels are more uncertain.

Fig. 9 Precision of each model for each level

Recall is the proportion of samples actually at a given level that are correctly predicted. The recall value of each model for each level is shown in Fig. 10. Compared with the precision values, the recall values vary less among the five models. RF, GBDT and XGBoost achieve the highest recall value (0.7500) for the intense level, GBDT achieves the highest recall value (1.0000) for the moderate level, RF and XGBoost achieve the highest recall value (0.4286) for the slight level, and RF achieves the highest recall value (1.0000) for the none level. Compared with the other risk levels, the recall for the slight level is the lowest.

Fig. 10 Recall of each model for each level
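Per-level precision and recall can both be read from a confusion matrix: precision divides each diagonal entry by its column total, and recall divides it by its row total. The sketch below assumes rows are actual levels and columns are predicted levels, and uses a hypothetical matrix rather than the values in Table 6.

```python
# Hypothetical confusion matrix (rows = actual, columns = predicted;
# 0 = none, 1 = slight, 2 = moderate, 3 = intense). NOT data from this study.
cm = [
    [4, 1, 0, 0],
    [2, 2, 1, 0],
    [0, 0, 3, 0],
    [0, 0, 1, 1],
]


def precision_per_level(cm):
    """Diagonal entry divided by the column total (samples predicted as that level)."""
    k = len(cm)
    result = []
    for j in range(k):
        predicted = sum(cm[i][j] for i in range(k))
        result.append(cm[j][j] / predicted if predicted else 0.0)
    return result


def recall_per_level(cm):
    """Diagonal entry divided by the row total (samples actually at that level)."""
    result = []
    for i, row in enumerate(cm):
        actual = sum(row)
        result.append(row[i] / actual if actual else 0.0)
    return result


precision = precision_per_level(cm)
recall = recall_per_level(cm)
print([round(p, 4) for p in precision])
print([round(r, 4) for r in recall])
```

In this toy matrix the intense level has a precision of 1.0 but a recall of only 0.5, the same kind of gap between the two metrics that the text reports for several models.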

In general, high precision and recall values indicate a good result. However, the two are not necessarily positively correlated and may sometimes even be negatively correlated. F1, the harmonic mean of precision and recall, is an effective index that balances the two. The F1 value of each model for each level is shown in Fig. 11. RF possesses the highest F1 values for the intense, slight and none levels, with values of 0.8571, 0.5455 and 0.9167, respectively. For the moderate level, GBDT possesses the highest F1 value (0.8421), and RF ranks second with an F1 value of 0.7778. Therefore, RF has the best performance for rockburst risk prediction from this point of view. In addition, according to Fig. 11, all of these models perform worse when predicting the slight level.

Fig. 11 F1 value of each model for each level
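As a quick consistency check, the F1 values quoted for RF can be recomputed from the precision and recall values reported in the text, assuming the usual harmonic-mean definition \(F1 = 2PR/(P + R)\). The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of RF as reported in the text.
rf_intense = f1_score(1.0000, 0.7500)  # intense level
rf_none = f1_score(0.8462, 1.0000)     # none level

print(round(rf_intense, 4), round(rf_none, 4))
```

Both results round to the F1 values given in the text for RF (0.8571 for the intense level and 0.9167 for the none level).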

5.3 Overfitting analysis

Overfitting is an important consideration when evaluating the performance of a machine learning model. A good algorithm should not only perform well on the training data but also predict future data reliably. The prediction accuracy on the training and test sets for the five ensemble learning methods is shown in Fig. 12. GBDT, XGBoost and LightGBM overfit the training set and perform markedly worse on the test set, AdaBoost performs worst on both the training and test sets, and RF has better generalization ability. From this point of view, this study identifies RF as the first-choice algorithm for short-term rockburst risk prediction on this dataset.

Fig. 12 Prediction accuracy of training and test sets
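One simple way to make this comparison explicit is to look at the gap between training and test accuracy. The sketch below is only an illustration: the model names, accuracies and the gap threshold are hypothetical placeholders, not values read from Fig. 12.

```python
def generalization_gap(train_acc, test_acc):
    """Training accuracy minus test accuracy; a large gap suggests overfitting."""
    return train_acc - test_acc


# Hypothetical (training accuracy, test accuracy) pairs; NOT values from Fig. 12.
scores = {
    "model_a": (0.85, 0.80),  # small gap, good test accuracy
    "model_b": (1.00, 0.70),  # fits the training set perfectly, large gap
    "model_c": (0.68, 0.65),  # small gap, but weak on both sets
}

GAP_THRESHOLD = 0.15  # hypothetical cut-off for flagging overfitting

gaps = {name: generalization_gap(tr, te) for name, (tr, te) in scores.items()}
overfit = sorted(name for name, gap in gaps.items() if gap > GAP_THRESHOLD)
best_test = max(scores, key=lambda name: scores[name][1])

print(gaps)
print("flagged as overfitting:", overfit)
print("highest test accuracy:", best_test)
```

A small gap alone is not sufficient: a model can generalize consistently and still be weak on both sets, which is why the gap is read together with the test accuracy.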

6 Discussion

The database used in this study is the same as that in Feng et al. (2019), but the results are not directly comparable because their test set was manually specified and contained only 10 samples. In our work, the training and test sets were randomly split, and the test set included 30 samples, to reduce the effects of overfitting and to obtain a more reliable estimate of model performance. In this study, RF obtained the highest accuracy for the overall rockburst samples, whereas GBDT obtained the best accuracy for the moderate and intense rockburst cases. Taking the precision, recall and F-measure metrics into account, the optimal model was not the same for every rockburst level. Nevertheless, RF and GBDT had the best overall performance. The reason may be that the dataset includes many outliers and these two methods are not sensitive to outliers.

For the performance evaluation of models, most researchers used only the overall accuracy (the ratio of the number of correctly predicted samples to the total number) (Wu et al. 2019; Feng et al. 2019; Pu et al. 2019a, b; Liu and Hou 2019). However, this measure does not indicate the classification performance for each risk level. Using other performance metrics, we observed that the prediction performance for slight rockbursting was lower than for the other risk levels; in engineering applications, however, this level is less important than the correct classification of moderate and intense rockbursts. There might be three reasons for this misclassification. The first is that the discrimination boundary based on radiated energy is too rigid. For example, according to Table 2, if the radiated energy is equal to 100 J, the rockburst risk is slight, whereas if it is equal to 101 J, the risk is moderate. The second is that the discrimination interval of the slight level in terms of radiated energy is smaller than those of the moderate and intense levels. The third is that there is a certain subjectivity in assigning the rockburst risk level according to the observed rockburst damage. In addition, the most serious consequence occurs when an intense rockburst takes place although no risk was predicted. Conversely, economic losses are incurred when an intense rockburst is predicted but there is in reality no risk. Therefore, using performance measures other than classification accuracy helps in understanding the prediction results. In this study, we adopted accuracy, Cohen’s Kappa, precision, recall and F-measure to assess the predictive performance of the models. In addition, the stability of the models under different hyperparameter values, the accuracy for moderate and intense rockbursting, and the misjudgment between the none and intense risk levels were analyzed.

The proposed methodology is theoretically more scalable to dynamic data, which is important given that the number of seismic events in a database increases with time. Comparing multiple models provides better robustness and can address such problems more effectively. In actual engineering projects, the optimal method can vary from dataset to dataset. However, following our proposed methodology, the best-fitting ensemble method for another dataset can be identified by comparing the performance of the different methods. If the optimal approach achieves reliable predictive performance, it can be used to predict rockburst risk in new areas. In this study, the highest predictive accuracy is 0.8000 for all cases and 0.9167 for high risk cases, which indicates that the proposed methodology can be applied to provide effective guidance for short-term rockburst risk management.

Although the overall prediction results are satisfactory using the proposed methodology, there are also some limitations.

  1. The dataset of rockburst samples is relatively small. ML algorithms rely heavily on the quality of the dataset. In general, model overfitting occurs when the training dataset is small, which would decrease the model’s generalization and reliability. Although numerous rockburst cases have been reported from all over the world, a public dataset based on microseismic monitoring is rare. If a more comprehensive rockburst database can be established, the short-term rockburst prediction problem could be solved more efficiently using ensemble learning methods.

  2. Only microseismic indices are selected for the short-term rockburst prediction. Rockbursting is affected by various factors, such as the intrinsic nature of the rock, stress conditions, geologic structure and external disturbance. It is still uncertain whether the microseismic information can completely reflect the combined influence of these factors. Although the six selected microseismic indicators can describe the state of rock mass fracturing, it would be worthwhile to explore the prediction results obtained by combining the microseismic indicators with other influencing factors.

  3. The proposed risk prediction method is only suitable for rockbursts in which the source of the seismic event is at the same location as the damage in the tunnel, commonly referred to as strainbursting. Another type of rockburst is induced by the impact of seismic waves due to fault slip or blasting and is called impact-induced bursting (He et al. 2012). The mechanisms of these two types of rockbursts are different. This study focuses only on the short-term risk prediction for strainbursts.

7 Conclusions

To ensure the safety of workers and the smooth implementation of projects, short-term rockburst risk prediction using high-performance algorithms is desired in deep-level rock engineering. This study compared the comprehensive performance of five ensemble learning methods, namely RF, AdaBoost, GBDT, XGBoost and LightGBM, for short-term rockburst prediction. The hyperparameter values of these five models were optimized on a training set using fivefold CV. GBDT obtained the most stable results across different hyperparameter values. After tuning the hyperparameters, the model performances were evaluated using the test set. According to the analysis of the model performance metrics, the RF and GBDT methods yielded the best prediction results. In addition, the predictive performance of these models differed among rockburst levels: the prediction performance for the none, moderate and intense levels was better than that for the slight level. Overall, the predictions obtained provide a valuable reference on which engineers can base decisions.

In the future, a larger and higher-quality rockburst database should be established through joint efforts worldwide. In addition, the rockburst data acquisition framework, including the indicators and risk levels, should be unified. Considering that rockbursting is affected by numerous factors, the effect of different indicators on the prediction results is worth investigating.