1 Introduction

With the increasing depletion of near-surface deposits and the development of underground space, underground engineering operations are advancing to ever greater depths, giving rise to problems of rock mechanics. The resulting increase in underground stress causes rockburst hazards that can seriously damage an excavation (Bruning 2018). Rockburst is a phenomenon in which the rock mass around an underground opening fails violently due to extreme stress concentrations (Hoek 2000; Song et al. 2024). Rockburst particularly threatens the safety of workers and destroys underground works, especially in deep-level mining where tunnels and shafts are driven through hard, brittle rock (Nussbaumer 2000). Since humans began exploring for minerals and constructing underground structures, rockburst has been a common instability phenomenon hindering major projects. The earliest recorded instance, at the Altenberg tin mine in Germany in 1640, compelled a halt to mining operations for several decades (Ortlepp 2005). Since then, countries around the world, including Australia, China, South Africa and the USA, have reported cases of rockburst in mining and tunnelling works (Głowacka 1993; Wang et al. 2021; Ling et al. 2023; Zhang et al. 2024a, b). The sudden and severe consequences of the rockburst hazard have urged researchers around the world to identify prediction methods and control measures. Generally, rockburst prediction is investigated via two main approaches: long-term prediction and short-term prediction (Man Singh Basnet et al. 2023). Long-term prediction typically focuses on predicting the tendency towards different intensity levels at different sections of mines and tunnels, relying on rock mechanical parameters available during the design stage of the project. Nevertheless, long-term prediction does not indicate the time of occurrence or the location of events (Liang and Zhao 2022). Unlike long-term prediction, short-term prediction employs field monitoring to issue early warnings of rockburst during the excavation phase (Feng 2017; Zhang et al. 2024a, b). To predict rockburst in real time, microseismic (MS) monitoring is commonly used to calculate the three-dimensional source location and the level of damage severity from the monitored information (Song et al. 2024). Hence, scholars have utilised various kinds of MS information to evaluate the occurrence of rockburst. For instance, Feng et al. (2014) established a formula for dynamic warning of rockburst with the help of six MS parameters. Feng et al. (2016) studied the fractal behaviour of the energy distribution associated with MS events during the rockburst development process. Chen et al. (2015) developed a quantitative intensity classification method based on radiated MS energy. Additionally, Ma et al. (2015) applied MS monitoring to investigate the relation between rockburst and its influencing factors. Further, Liu et al. (2013) examined MS parameters in the Hongtoushan copper mine and concluded that all the multi-parameters demonstrated precursory behaviour before the ground-pressure hazard. Zhang et al. investigated the occurrence processes of fault-slip rockburst utilising in-situ failure analysis, geological surveys and MS information to understand the development and occurrence mechanisms.
The results demonstrated that fault-slip rockbursts can be effectively monitored and analysed using MS technology, highlighting the importance of understanding geological and stress conditions to mitigate such events (Zhang et al. 2022). Tang et al. studied the mechanism of rockburst in water-rich (WR) areas during tunnel excavation using microseismic monitoring and experimental analysis, comparing WR and water-poor (WP) sections. The results showed that rockbursts are influenced by the classification of the surrounding rock and the excavation method, with high in-situ stress conditions accelerating rockburst occurrence in WR areas (Tang et al. 2023).

Although different scholars have experimentally studied rockburst occurrence utilising various MS multi-parameters, inconsistencies in establishing the exact relation between MS information and rockburst intensity warning have led researchers to develop mathematical and intelligent models to predict rockburst risks. Recently, intelligent machine learning (ML) methods have attracted more attention because they do not need prior understanding of the input and output variables and instead simply learn the pattern from data to predict the outcome (Pu et al. 2019; Man Singh Basnet et al. 2023). Taking advantage of ML, attempts have been made to predict short-term rockburst. Feng et al. (2019) developed a probabilistic neural network (PNN) model using 93 rockburst cases obtained from the Jinping II hydropower station; the model was further improved using a mean impact value algorithm and a modified firefly algorithm, and after optimisation its accuracy increased by almost 25% compared with the single PNN model. Liang et al. (2020) employed the ensemble boosting technique to verify its strength in predicting short-term rockburst; the results showed that random forest and the gradient boosted decision tree achieved better performance in classifying rockburst intensity. Liang et al. (2021) further developed a stacking ensemble technique embedding six different classifiers as base learners; the model was evaluated using different performance metrics and the outcome illustrated that the ensemble classifier is more powerful than each individual base learner. Zhao et al. (2021) created a decision tree (DT) model and investigated the relationship between MS features and rockburst; the model was successfully employed to predict case histories and misclassified only two samples. Additionally, Yin et al. (2021) took 1500 MS events and proposed a tree-based model for real-time prediction; to establish precursory MS sequences, a dimensionality reduction technique was used and the data were labelled to form 300 precursory MS sequences via a grouping rule, and finally two types of precursor trees, with and without pruning, were used to validate the result. Jin et al. (2022) built a nonlinear support vector machine (nonlinear SVM) and tested it on 22 samples; the radial basis function (rbf) was identified as the best-performing kernel and was used to build the final model by optimising the hyperparameters, which proved remarkable in predicting the samples with few misclassifications. Feng et al. (2020) employed clustering analysis to establish rockburst intensity warnings using MS monitoring parameters; the main advantage of the proposed approach is that it needs only a few samples as input, and the predicted cases from the Jinping II hydropower station in China corresponded with the actual situation. Finally, Basnet et al. (2024) developed an explainable risk prediction model for short-term rockburst risks; the results showed that the model not only predicts the risks but also interprets the decisions it makes.

All the aforementioned works are examples of the non-parametric modelling approach; indeed, recent applications of ML to short-term rockburst prediction have relied on non-parametric models. Parametric models such as logistic regression (LR), naïve Bayes (NB) and linear discriminant analysis (LDA) have their own theoretical bases and modelling processes, in contrast to non-parametric models. Nonetheless, parametric models have been infrequently employed in short-term rockburst risk prediction, which is why it is worth exploring their significance now. The defining characteristic of a parametric model is that it learns a pre-defined functional form with a fixed number of parameters. Parametric models are generally less sensitive to the size of the data than non-parametric models, but they typically assume the data follow a normal distribution. By contrast, non-parametric models rely on neither a fixed functional form nor a particular data distribution, yet generalise well only when the dataset is large enough. As a result, to check the feasibility of the parametric model against the non-parametric model, two datasets are prepared, one normally distributed (transformed) and one non-normally distributed (original), using a small number of samples and inputs.

Rockburst prediction is a complex process, and there has been no study of parametric models in short-term rockburst prediction. Hence, this work compares the results of a parametric and a non-parametric model, typified by LR and SVM respectively. The remainder of this paper is organised as follows: the dataset description section describes the source of the data and provides statistical information on the two datasets; the preliminaries section briefly introduces parametric and non-parametric models, followed by an elaboration of LR and SVM. In the model building section, both LR and SVM are constructed on the two datasets and then compared and evaluated.

2 Dataset description

The fundamental step in building ML models is obtaining data samples to train and test the model. Therefore, a dataset has been collected by gathering information from field work and various articles on short-term rockburst based on microseismic parameters. 108 case records were gathered from different sources (Feng et al. 2019, 2013; Liu et al. 2021; Zhao et al. 2021), each containing four features: cumulative number of events (CN), cumulative released energy (CE), cumulative apparent volume (CV) and rockburst development day (RD). CN accounts for the number of microfractures during the events, CE gives the amount of radiated energy, which measures the fracturing strength of the rock mass, CV is the volume of inelastic deformation, which also describes the amount of damage in the rock mass, and RD refers to the incubation period of the incident. In this study, the two predictors CE and CV are taken in logarithmic form to keep the correlation consistent and make calculation convenient. All of the mentioned input indices are used to predict the target variable, rockburst intensity. There are four rockburst intensity levels, categorised according to the degree of damage. The characteristics of the four intensity levels are shown in Table 1.

Table 1 Characteristics of four rockburst intensity levels (Chen et al. 2015)

The microseismic features contain the information utilised to build a model that predicts the target (rockburst intensity). The characteristics of and patterns between the rockburst features and the target can be seen in Fig. 1, a parallel coordinate plot that describes the relation between inputs and outputs. Parametric models generally rely on the assumption of normality for better performance; therefore, two datasets have been prepared so the models can be compared on both: one with the original data directly collected from the field and literature, and another with the data transformed to a normal distribution. The first is named "Dataset I" and the second "Dataset II". Since both datasets contain the target variable in categorical form and ML models cannot handle such labels, the categorical labels are transformed into numeric form by assigning a numeric value to each class: 0, 1, 2 and 3 for None, Slight, Moderate and Intense rockburst, respectively (a minimal sketch of this encoding is given after Fig. 1). Both datasets contain 38 None cases, 27 Slight cases, 29 Moderate cases and 14 Intense cases.

Fig. 1 Parallel plot of distribution of four input indices with respect to intensity levels
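To make the label encoding above concrete, the following is a minimal sketch; the column names and the small illustrative rows are assumptions, not the published dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real dataset has 108 records.
df = pd.DataFrame({
    "CN":        [35, 120, 260, 410],
    "logCE":     [3.2, 4.8, 5.9, 7.1],
    "logCV":     [2.1, 3.4, 4.2, 5.0],
    "RD":        [5, 9, 14, 21],
    "intensity": ["None", "Slight", "Moderate", "Intense"],
})

# Map the four categorical intensity levels to the integers 0-3 used in the paper.
label_map = {"None": 0, "Slight": 1, "Moderate": 2, "Intense": 3}
df["intensity"] = df["intensity"].map(label_map)
print(df["intensity"].value_counts().sort_index())
```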

2.1 Dataset I visualization

Dataset I contains the information collected from the field and literature in its original form. For parametric ML models, normality is a basic assumption because they rely on a known form of mapping function to make a prediction, which aids in better fitting of a line or hyperplane. There are two ways of checking normality: graphical and numerical methods (Park 2008). These methods employ the histogram, the Q-Q plot, and the skewness and kurtosis values to determine whether feature variables are normally distributed (Field 2013). Normally distributed data should have a histogram resembling a bell-shaped curve and a Q-Q plot with all data points falling on the straight line (Thode 2002). Similarly, skewness and kurtosis measure the symmetry and the heaviness of the distribution tails, respectively. Observing the histograms and Q-Q plots in Fig. 2, none of the features is particularly normally distributed: CN and RD are right-skewed and kurtotic whereas CE and CV are left-skewed. In the Q-Q plots, the data points do not all fall on the straight line, which simply means Dataset I is not particularly normally distributed. The publicly available dataset can be found in the Appendix.

Fig. 2 Histogram and QQ plot of features in Dataset I. a The histogram and QQ plot of CN. b The histogram and QQ plot of CE. c The histogram and QQ plot of CV. d The histogram and QQ plot of RD
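The numerical and graphical checks described above can be reproduced along the following lines; the lognormal sample is a stand-in assumption, not the actual CN feature.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
cn = rng.lognormal(mean=3.0, sigma=0.8, size=108)  # right-skewed stand-in data

# Numerical checks: skewness and excess kurtosis are both near 0 for normal data.
print("skewness:", stats.skew(cn))
print("excess kurtosis:", stats.kurtosis(cn))

# Graphical checks: histogram (bell shape?) and Q-Q plot (points on the line?).
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(cn, bins=15)
stats.probplot(cn, dist="norm", plot=ax2)
plt.show()
```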

2.2 Dataset II visualisation

From a statistical point of view, Dataset I does not meet the normality assumption. Therefore, Dataset II has been prepared by applying the Box-Cox transformation, which transforms non-normally distributed data into a Gaussian or Gaussian-like distribution (Box and Cox 1964). As Fig. 3 shows, all features approximately follow the Gaussian distribution after transformation: in the histograms, the distribution of the independent variables more or less resembles the bell-shaped curve, and in comparison to Dataset I, the Q-Q plots for Dataset II show that, for every feature, almost all of the data points fall on the straight line.

Fig. 3 Histogram and QQ plot of each feature in Dataset II. a The histogram and QQ plot of CN. b The histogram and QQ plot of CE. c The histogram and QQ plot of CV. d The histogram and QQ plot of RD
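A minimal sketch of the transformation on synthetic right-skewed data is given below; scikit-learn's PowerTransformer fits the Box-Cox lambda by maximum likelihood and, by default, also standardises the output, which is the "auto standardisation" referred to in Sect. 4.4.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer
from scipy import stats

rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=0.8, size=(108, 1))  # strictly positive, skewed

# Box-Cox requires positive inputs; standardize=True rescales to zero mean, unit variance.
pt = PowerTransformer(method="box-cox", standardize=True)
X_bc = pt.fit_transform(X)

print("fitted lambda:", pt.lambdas_)
print("skewness before/after:", stats.skew(X[:, 0]), stats.skew(X_bc[:, 0]))
```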

3 Preliminaries

3.1 Parametric and non-parametric model

Parametric models generally assume the form of the function (f) that portrays the relation between input (X) and output (Y), represented as Y = f(X). The algorithm utilises the training records to learn the target mapping function. Parametric models presume the population can be accurately represented by a probability distribution with a defined set of parameters (James et al. 2013). This is convenient when the population is normal; even when it is not, the distribution can often be approximated by a normal one by invoking the central limit theorem. Parametric approaches are thus model-based and rely on an assumption about the shape of the function to be inferred; a suitable model is chosen and its set of parameters is then estimated.

Non-parametric models make no assumptions about the structure of the function to be estimated. These techniques estimate the unknown function (f), which may take any form, without requiring any assumption about the population's parameters (James et al. 2013; Russell 2010). In fact, the non-parametric method does not require the data to be normally distributed; it instead trades assumptions for flexibility, seeking the best fit for the observations. Generally, a large amount of data is necessary to approximate the unknown function (f) accurately.

Both parametric and non-parametric models have their own strengths and shortcomings depending on the specific problem. To explore their significance in short-term rockburst prediction with few data and features, logistic regression and support vector machine classifiers are adopted as the parametric and non-parametric model, respectively.

3.2 Logistic regression

Logistic regression can be regarded as a transformed form of linear regression that applies the sigmoid function. Logistic regression utilises the sigmoid function to map the input variables to output probabilities. Let X = (\({X}_{1},\dots ,{X}_{n}\)) be an input vector containing the feature information; the conditional probability of the output variable Y is then estimated from a set of observations on X, represented as P(Y = y|X = x) (Cox 1958). A general graphical representation of the probability curve of logistic regression can be seen in Fig. 4, in which the vertical axis is the probability for a given classification whereas the horizontal axis stands for the value of X.

Fig. 4 Working principle of LR

Since the distribution of y|x is assumed to be a Bernoulli distribution, the conditional probability is written as:

$$p\left( {y{|}x} \right) = f\left( x \right)^{y} \left( {1 - f\left( x \right)} \right)^{1 - y}$$
(1)

f(x) denotes the parameter of the Bernoulli distribution, which is a function of the input data:

f(x) = p(y = 1|x) = E(y|x).

Further, f(x) can be calculated by applying the logistic function to a linear transformation of the input variable:

$$f\left( x \right) = \frac{1}{{1 + e^{{ - \left( {\beta_{0} + \beta_{1} x} \right)}} }}$$
(2)

\({\beta }_{0}+{\beta }_{1}x\) is analogous to the linear form y = ax + b. The sigmoid function confines the output value between 0 and 1.

In practice, LR exposes several parameters: the "C" parameter is the inverse of the regularisation strength, so the greater the value of C, the weaker the regularisation. The "fit_intercept" parameter defines whether a constant term is added to the decision function. Likewise, "solver" selects the algorithm for the optimisation problem; newton-cg, lbfgs, liblinear, sag and saga are the common solvers.
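A hedged sketch of these parameters in scikit-learn, on synthetic stand-in data (the dataset and the chosen values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 4-feature, 4-class rockburst data.
X, y = make_classification(n_samples=108, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# C: inverse regularisation strength; fit_intercept: add a constant term;
# solver: optimisation algorithm (lbfgs is the scikit-learn default).
clf = LogisticRegression(C=10, penalty="l2", fit_intercept=True,
                         solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:5]))
```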

3.3 Support vector machine

Support vector machine is a non-parametric model built for classification and regression on the basis of the statistical learning framework (Cortes and Vapnik 1995). SVM maximises the width of the margin between two classes by mapping the training samples to points in space; new instances are then mapped into the same space and assigned to the category to which they belong. SVM can handle linear as well as nonlinear classification problems; if the data is nonlinear, it uses the kernel trick to classify the data by representing it in a higher-dimensional feature space.

Consider training samples in a feature space with n data points, written as \(({x}_{1}, {y}_{1}),\dots ,({x}_{n}, {y}_{n})\), where \({y}_{i}\in \left\{+1,-1\right\}\), \({x}_{i}\) denotes the ith feature vector and \({y}_{i}\) the label of \({x}_{i}\). SVM aims to maximise the margin between the sets of points \({x}_{i}\), as shown in Fig. 5. Any hyperplane in this space can be represented with the following equation:

$$w^{T} x - b = 0$$
(3)

w signifies a normal vector to the hyperplane.

Fig. 5 SVM general principle for hyperplane maximisation (Jin et al. 2022)

If the samples are linearly separable, two parallel hyperplanes separating the two classes of data are selected so that the width between them is maximised. The region bounded by these two parallel hyperplanes is known as the margin, and their respective equations are

$$w^{T} x - b{ } = { }1$$
(4)

Data points that lie on or above this boundary are indicated with label 1

$$w^{T} x - b{ } = { } - 1$$
(5)

On the contrary, data points situated on or below this boundary are labelled with -1. Geometrically, the distance between these hyperplanes is \(\frac{2}{\parallel w\parallel }\), so \(\parallel w\parallel\) must be as small as possible to give the largest margin between the two planes. The distance is calculated using the point-to-plane distance equation. To prevent data points from falling inside the margin, constraints are added; for each i, the constraints are expressed by

$$w^{T} x_{i} - b \ge { }1,{\text{ for }}y_{i} = 1$$
(6)

Or

$$w^{T} x_{i} - b \ge { } - 1,{\text{ for }}y_{i} = - 1$$
(7)

Combining (6) and (7), the following is obtained

$$y_{i} \left( {w^{T} x_{i} - b} \right) \ge 1,\,{\text{for all }}1 \le i \le n$$
(8)

By minimising \(\parallel w\parallel\) subject to Eq. (8), the values of w and b that determine the classifier x \(\mapsto\) sgn(\({w}^{T}x-b\)) can be obtained, where sgn(·) denotes the sign function. The maximum-margin hyperplane is completely determined by the \({x}_{i}\) situated nearest to it; such \({x}_{i}\) are known as support vectors.

On the other hand, for non-linearly separable data, the hinge loss function is utilised:

$$Max \, (0,1 - y_{i} (w^{T} x_{i} - b))$$
(9)

Here, \({y}_{i}\) denotes the ith target and \({w}^{T}{x}_{i}-b\) the ith output. The optimisation then focuses on minimising the following function

$${\text{C}}\parallel w\parallel^{2} + \left[ {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} {\text{Max }}\left( {0,1 - y_{i} \left( {w^{T} x_{i} - b} \right)} \right)} \right]$$
(10)

C is a non-negative parameter that penalises misclassification. Minimising Eq. (10) is a constrained optimisation problem with a differentiable objective function; by introducing a new slack variable \({\xi }_{i}\) for every data point, Eq. (8) can be rewritten as the constraint condition

$$y_{i} \left( {w^{T} x_{i} - b} \right) \ge { }1 - \xi_{i}$$
(11)

Finally, the optimisation problem is stated as follows

$$min\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \xi_{i} + {\text{C}}\parallel w\parallel^{2}$$
(12)

Subject to \({y}_{i}\left({w}^{T}{x}_{i}-b\right)\ge 1-{\xi }_{i}\) and \({\xi }_{i}\ge 0\) for i = 1, 2, …, n

Applying the Lagrangian, Eq. (12) reduces to the following dual problem.

$$Maximizef\left( {c_{1} \ldots c_{n} } \right) = \mathop \sum \limits_{i = 1}^{n} c_{i} - \frac{1}{2}\mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{n} y_{i} c_{i} \left( {x_{i}^{T} x_{j} } \right)y_{j} c_{j}$$
(13)

This is the dual maximisation problem; given its solution \({c}_{i}\), the weight vector is expressed as

$$w = \mathop \sum \limits_{i = 1}^{n} c_{i} y_{i} x_{i}$$
(14)

Once the optimal result of Eq. (13) is obtained, the hyperplane and the decision function are determined.

In the case of the nonlinear SVM, a kernel function k satisfying k\(\left({x}_{i},{x}_{j}\right)\) = \(\varphi ({x}_{i})\cdot \varphi ({x}_{j})\) is introduced for the transformed data points. In terms of the optimisation problem, the SVM can then be expressed as

$$\begin{gathered} {\text{Maximize}}\;f\left( {c_{1} \ldots c_{n} } \right) = \mathop \sum \limits_{i = 1}^{n} c_{i} - \frac{1}{2}\mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{n} y_{i} c_{i} k\left( {x_{i} ,x_{j} } \right)y_{j} c_{j} \hfill \\ {\text{subject to}}\;\mathop \sum \limits_{i = 1}^{n} c_{i} y_{i} = 0\;{\text{and}}\;0 \le c_{i} \le \frac{1}{2nC}\;{\text{for all }}i \hfill \\ \end{gathered}$$
(15)

The values of \({c}_{i}\) can be obtained by solving Eq. (15), and the offset can be recovered as \(b = \left[ {\mathop \sum \limits_{j = 1}^{n} c_{j} y_{j} k\left( {x_{j} ,x_{i} } \right)} \right] - y_{i}\) for a support vector \({x}_{i}\) on the margin boundary. The decision function of the non-linear SVM is then given by

$$z \mapsto {\text{sgn}}\left( {\left[ {\mathop \sum \limits_{i = 1}^{n} c_{i} y_{i} k\left( {x_{i} ,z} \right)} \right] - b} \right)$$
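To ground the formulation, here is a minimal numpy sketch, not the authors' code, of the rbf kernel used later in Sect. 4 and the regularised hinge objective of Eq. (10) for a linear classifier:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.1):
    # k(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def soft_margin_objective(w, b, X, y, C=1.0):
    # Eq. (10): C * ||w||^2 plus the mean hinge loss max(0, 1 - y_i (w^T x_i - b)).
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return C * np.dot(w, w) + hinge.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {+1, -1}

print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y))
print(rbf_kernel(X[0], X[1]))
```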

4 Model building

4.1 Model building with dataset I

The models trained and tested on Dataset I are termed \({LR}_{I}\) and \({SVM}_{I}\). The modelling is conducted with the Python scikit-learn library (Pedregosa et al. 2012). In the initial phase, after importing the dataset, it is randomly partitioned into training and testing samples in a 70:30 ratio, meaning that 30% of the observations are separated from the dataset, do not participate in the training phase, and are set aside to validate the model's performance. Further, the training and testing features are scaled using standardisation to avoid the negative influence of the differing magnitudes of the features; standardisation rescales every feature to zero mean and a standard deviation of 1 using the following formula.

$$x^{\prime} = \frac{x - \mu }{\sigma }$$
(16)

where x is the original value of the variable, \({x}^{\prime}\) indicates the standardised feature, and \(\mu\) and \(\sigma\) denote the mean and standard deviation of the feature, respectively. During the modelling process, given the size of our training set, underfitting and overfitting could be a problem; therefore, fivefold cross-validation is adopted while training the model. The schematic diagram of fivefold cross-validation is shown in Fig. 6. The idea behind cross-validation is that the training set is randomly broken into five subsets of approximately equal size. Training is then conducted in five rounds, with four subsets combined in each round to form a training set while the remaining one serves as the validation set. The results from the five rounds are averaged to produce an overall accuracy. This strategy helps the model generalise to new data and yields a less biased estimate.

Fig. 6 Schematic diagram of fivefold cross-validation
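The steps of Sect. 4.1 can be sketched as follows; the synthetic data and the stratified split are assumptions standing in for the actual 108-record dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=108, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# 70:30 split; stratify keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Standardisation (Eq. 16) inside a pipeline so test statistics never leak in.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_tr, y_tr, cv=5)  # fivefold cross-validation
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```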

4.2 Modelling process for \({{\varvec{L}}{\varvec{R}}}_{{\varvec{I}}}\)

In general, LR is a binary classifier; hence it is extended here to deal with multiclass classification problems (Greene 2003). While training the model, two hyperparameters, C and penalty, are optimised using the grid search method with embedded fivefold cross-validation. C is the inverse of the regularisation strength, which penalises large parameter magnitudes to reduce overfitting. Likewise, the penalty is the form of regularisation, of which there are basically three types: l1, l2 and elasticnet. l1 penalises the sum of the absolute values of the weights and l2 penalises the sum of the squares of the weights; in elasticnet, both l1 and l2 terms are added. In this study, the C parameter value is chosen between 1e-3 and 1e+8 and four attributes are considered for the penalty parameter: l1, l2, elasticnet and none. The model achieved its best accuracy of 66.67% on the test set with the optimal value C = 10 and the l2 penalty.
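A hedged sketch of this grid search (synthetic data; the grid granularity is an assumption, and elasticnet/none would need extra settings such as l1_ratio):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=108, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# C spans 1e-3 .. 1e+8 as in the text; the saga solver supports both l1 and l2.
pipe = Pipeline([("scale", StandardScaler()),
                 ("lr", LogisticRegression(solver="saga", max_iter=5000))])
param_grid = {"lr__C": np.logspace(-3, 8, 12), "lr__penalty": ["l1", "l2"]}

grid = GridSearchCV(pipe, param_grid, cv=5)  # grid search + fivefold CV
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```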

4.3 Modelling process for \({{\varvec{S}}{\varvec{V}}{\varvec{M}}}_{{\varvec{I}}}\)

Similar to the multiclass \({LR}_{I}\) model, \({SVM}_{I}\) adopts the one-vs-one (OVO) multiclass approach (Wu et al. 2004), creating a base binary classifier \({C}_{pq}\) (p and q denote the pth and qth categories present in the training samples) for every pairwise combination of categories in the training records. Altogether, m(m−1)/2 base classifiers are created when there are m categories. For a training record T, if a base classifier places T in category p then a vote is given to category p, otherwise the vote goes to category q; once all base classifiers have voted, T is assigned to the category with the most votes. A brief sketch of this scheme is given below.
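This illustrative sketch shows the pairwise structure: with m = 4 intensity classes, m(m−1)/2 = 6 base classifiers are trained (scikit-learn's SVC already uses OVO internally; the explicit wrapper merely exposes the estimators).

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=108, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# One binary SVC per pair of classes: 4 * 3 / 2 = 6 base classifiers.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
print(len(ovo.estimators_))  # -> 6
```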

As indicated in Sect. 3.3, SVM can handle both linear and nonlinear data, and there are basically four types of kernel function: linear, radial basis function (rbf), polynomial and sigmoid. These kernel functions behave differently depending on the nature of the data. To estimate the kernel function that performs best on the given data, the kernel is first identified using the grid search method with fivefold cross-validation under default parameter settings. The result is shown in Fig. 7. Of the four kernels, rbf has the highest cross-validation accuracy of 57.14%; the accuracy gap between the linear and rbf kernels is only slight, which suggests the data is only mildly non-linear. Hence the model is further evaluated with the rbf kernel by optimising the SVM hyperparameters.

Fig. 7 Best performing SVM kernel on dataset I

As rbf performed best among the kernels, two hyperparameters of the rbf kernel, C and gamma, are further optimised through the grid search approach with embedded fivefold cross-validation while building the classification model. C is the penalty parameter that governs how heavily classification errors are penalised. Similarly, the gamma parameter controls how the data distribution is mapped into the new dimension. While tuning the hyperparameters, the search ranges are 1e-2 to 1e+9 for the C parameter and 1e-9 to 1e+2 for the gamma parameter, arranged in grid form. The pair that provides the best accuracy is selected as the optimal hyperparameter values. The heatmap in Fig. 8 shows the accuracy for different pairs of C and gamma, with different colours indicating different accuracies. With the optimal C and gamma of 1e+07 and 1e-04, the model achieved a prediction accuracy of 60.61% on the test sample.

Fig. 8 Hyperparameter tuning for \({SVM}_{I}\)
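The C-gamma grid search can be sketched as follows; the data and grid granularity are illustrative assumptions, while the ranges mirror those reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=108, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# C in 1e-2 .. 1e+9 and gamma in 1e-9 .. 1e+2, as in the reported search ranges.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {"svc__C": np.logspace(-2, 9, 12),
              "svc__gamma": np.logspace(-9, 2, 12)}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```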

4.4 Model building with dataset II

The models trained and tested on Dataset II are termed \({LR}_{II}\) and \({SVM}_{II}\), respectively. As in Sect. 4.1, the dataset is partitioned in the same 70:30 ratio; the same samples are used for training as in Dataset I, and likewise for testing. Feature scaling has not been performed for this dataset because, after the Box-Cox transformation, all features are in a similar range due to the automatic standardisation. However, the same fivefold cross-validation approach as in Sect. 4.1 is adopted to overcome overfitting and underfitting.

4.5 Modelling process for \({{\varvec{L}}{\varvec{R}}}_{{\varvec{I}}{\varvec{I}}}\)

While creating a model on Dataset II, the same configuration is maintained. The multiclass classification approach is adopted and, in optimising the hyperparameters, the C and penalty parameters are tuned with embedded fivefold cross-validation. The C parameter is still considered between 1e-3 and 1e+8 and the four attributes l1, l2, elasticnet and none are selected for the penalty parameter. After optimisation, \({LR}_{II}\) obtained an accuracy of 72.73% in predicting the test set with C = 10 and the l2 penalty. In comparison to Dataset I, the logistic regression model performs better on Dataset II, which simply suggests that the parametric model is more reliable at prediction when the data distribution is relatively normal.

4.6 Modelling process for \({{\varvec{S}}{\varvec{V}}{\varvec{M}}}_{{\varvec{I}}{\varvec{I}}}\)

During the modelling process, \({SVM}_{II}\) follows the same procedure as \({SVM}_{I}\). Since each kernel function may perform differently depending on the characteristics of the data, the most suitable kernel is first estimated by applying grid search with fivefold cross-validation under default parameter settings. The result is shown in Fig. 9. The rbf kernel still has the best accuracy on Dataset II, as it did on Dataset I: among the four kernels, rbf has the highest score of 60% whereas polynomial has the lowest score of 49.82%.

Fig. 9 Best performing SVM kernel on dataset II

Further, \({SVM}_{II}\) is built with the rbf kernel by optimising the C and gamma parameters with grid search and fivefold cross-validation; the parameter tuning ranges are the same as for \({SVM}_{I}\). The hyperparameter tuning process is shown in Fig. 10; with the optimal values C = 1 and gamma = 1, the model achieves its best prediction accuracy of 63.64% on the test sample.

Fig. 10 Hyperparameter tuning for \({SVM}_{II}\)

5 Performance measurement and evaluation

In Sect. 4, prediction models were built on the two datasets after hyperparameter tuning. Even though classification accuracy is regarded as a straightforward performance measure, it is unreliable when the data is imbalanced. As mentioned in Sect. 2, the class proportions are unequal; in such cases, the precision, recall and F1 metrics help establish how robust the classifier is at classifying each class correctly. In rockburst prediction, the intense and moderate levels are considered "high risk" whereas none and slight are regarded as "low risk". These two kinds of risk deserve particular attention: classifying high risk as low risk suppresses warnings, which might cause unexpected casualties and serious consequences for the project, while falsely predicting low-risk rockburst as high risk leads to large sums being spent on controlling a rockburst that never happens. For these reasons, both problems should be addressed properly. A confusion matrix for high-risk and low-risk rockburst is presented in Table 2.

Table 2 Confusion matrix for high-risk and low-risk rockburst

In Table 2, True Positive (TP) indicates the positively predicted sample which is actually positive, False Negative (FN) represents a negatively predicted observation which in reality is positive, False Positive (FP) denotes a predicted positive that is actually negative and True Negative (TN) is a predicted negative which is actually negative. For any model, precision, recall and F1 score is calculated based on this matrix.

Precision (P) is the number of true positives divided by the sum of true positives and false positives (Goutte and Gaussier 2005), given by the formula:

$$P = \frac{TP}{{TP + FP}}$$
(17)

Similarly, Recall (R) is the ratio of true positives to the sum of true positives and false negatives (Goutte and Gaussier 2005), represented by the formula below.

$$R = \frac{TP}{{TP + FN}}$$
(18)

A good classifier should have high precision and recall values, but in practice there is always a trade-off between them. Thus, the F1 score estimates the quality of the model by calculating the harmonic mean of precision and recall (Goutte and Gaussier 2005) using the formula:

$$F1 = 2\frac{P \times R}{{P + R}}$$
(19)
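Per-class precision, recall and F1 (Eqs. 17-19) can be computed directly from predictions; the labels below are illustrative, not the authors' actual test-set outputs.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative predictions for the four intensity classes (0=None .. 3=Intense).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 2, 0]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3, 2, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
      target_names=["None", "Slight", "Moderate", "Intense"]))
```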

From Table 2, if the classification of low risk is the major concern, because falsely predicting low risk as high risk leads to unnecessary control costs, we would expect a model with a low false positive count: a high false positive count means the model predicts too many low-risk cases as high risk. A situation in which a false positive is most dangerous corresponds to a type I error, and to minimise type I errors a high-precision model is more effective. Similarly, if high-risk prediction is the primary concern, models with a low false negative count are advantageous: once a model predicts high risk as low risk, the user may assume there are no consequences when in actuality a very harmful rockburst might occur. In this case, a false negative is more dangerous, which is a type II error, and models with higher recall are more beneficial. The final results for the models trained on the two datasets can be found in the confusion matrix plot in Table 3.

Table 3 Confusion matrix of models trained on dataset I and dataset II for the testing set

From the table we can see that the parametric model achieved better accuracy than the non-parametric model on both datasets; moreover, \({LR}_{II}\) has the best results, which simply illustrates that parametric models perform better when the features are transformed to a normal distribution. \({SVM}_{II}\) benefits less from Dataset II, its accuracy differing little from that of \({SVM}_{I}\). Viewing the confusion matrix for each model, the diagonal entries denote the correct predictions made by the model, and all samples falling outside the diagonal are misclassifications. The matrix is asymmetric for each classifier because the classification results vary between intensity classes. The precision of the models on the two datasets is shown in Fig. 11.

Fig. 11 Precision for models on two different datasets

As discussed above, when a type I error is more dangerous, higher precision is given more priority. For the low-risk cases, \({SVM}_{I}\) (1.0) and \({SVM}_{II}\) (0.90) have the best precision for the none class, followed by \({LR}_{I}\) (0.90) and \({LR}_{II}\) (0.84), while \({LR}_{II}\) (0.4) outperforms \({LR}_{I}\) (0.27), \({SVM}_{I}\) (0.26) and \({SVM}_{II}\) (0.2) for the slight class. Turning to the high-risk cases, all four models have a precision of one for intense rockburst, meaning all of them classify the intense cases correctly. For the moderate class, \({LR}_{II}\) has the highest precision (0.81), and \({LR}_{I}\) has slightly better precision than \({SVM}_{I}\) and \({SVM}_{II}\).

When a type II error is more dangerous, high-recall models are beneficial. From Fig. 12, for the high-risk cases, \({LR}_{I}\) (0.75), \({LR}_{II}\) (0.75) and \({SVM}_{II}\) (0.75) all show the same recall for intense rockburst. For moderate rockburst, \({LR}_{I}\) achieved the greatest recall, followed by \({LR}_{II}\) (0.5) and \({SVM}_{II}\) (0.5), while \({SVM}_{I}\) shows the lowest recall of 0.41. For the low-risk cases, in terms of no rockburst, \({LR}_{I}\) (0.83) and \({LR}_{II}\) (0.91) are better than \({SVM}_{I}\) (0.66) and \({SVM}_{II}\) (0.83). \({LR}_{II}\) (0.8) and \({SVM}_{I}\) (0.8) show equal recall for slight rockburst whereas \({SVM}_{II}\) (0.40) has the lowest recall score.

Fig. 12 Recall for models on two different datasets

Regardless of whether high- or low-risk rockburst is deemed more important, in real scenarios of control and safety both should be given equal importance and predicted accurately, so both precision and recall should be high for high-risk and low-risk cases alike. For an optimal model, high precision and recall are desirable, but a trade-off always exists between them: as one increases, the other decreases. Consequently, we cannot rely on a single precision or recall score to measure the robustness of each classifier. When the reliability of the model cannot be interpreted from a single precision or recall score, the F1 score, the harmonic mean of precision and recall, can be computed to evaluate the classifier's performance. The F1 score for each intensity class is given in Fig. 13. Based on the F1 metric, the models can be ranked as \({LR}_{II}\) > \({LR}_{I}\) > \({SVM}_{I}\) > \({SVM}_{II}\), with average F1 scores of 0.7254, 0.6754, 0.6458 and 0.6412, respectively. Overall, LR outperforms SVM, with a better F1 score for each intensity level on both datasets.

Fig. 13 F1 score for models on two different datasets

To further evaluate the performance of the parametric and non-parametric models, the ROC (receiver operating characteristic) curve is used. ROC was originally a performance metric for binary classification (Spackman 1989); here, we extend it to the multiclass setting by drawing a ROC curve for each intensity class and then averaging the ROC for both models on the two datasets. The multiclass ROC plots the true positive rate on the Y-axis against the false positive rate on the X-axis. In general, the best classifier is the one with the larger area under the curve (AUC); in this experiment, the AUC of the averaged ROC estimates the robustness of the classifiers. A hedged sketch of this computation is given below.
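This sketch uses synthetic stand-in data; roc_auc_score with multi_class="ovr" computes one-vs-rest ROC AUCs per class and macro-averages them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=108, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # per-class probabilities for the ROC curves

# One-vs-rest ROC AUC per class, macro-averaged across the four classes.
print("macro-average AUC:", roc_auc_score(y_te, proba,
                                          multi_class="ovr", average="macro"))
```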

Different AUC values characterise different levels of classifier performance. The most commonly suggested bands are 0.5, 0.7–0.8, 0.8–0.9 and above 0.9: 0.5 denotes no discrimination at all, whereas 0.7–0.8 and 0.8–0.9 suggest acceptable and excellent discrimination, respectively, and an outstanding model should have an AUC greater than 0.9 (Hosmer Jr et al. 2013). From Fig. 14a and b, comparing the two models trained on Dataset I, \({LR}_{I}\) has an AUC of 0.9 whereas \({SVM}_{I}\) has an AUC of only 0.86. This simply illustrates that, when the available data is comparatively small, a parametric model can still reliably predict short-term rockburst. In the same way, from Fig. 14c and d, for \({LR}_{II}\) and \({SVM}_{II}\) trained on Dataset II, \({LR}_{II}\) has the highest AUC of 0.91, even compared with the three other models. Further, the non-parametric \({SVM}_{I}\) and \({SVM}_{II}\) trained on the two datasets have similar AUC scores, although in terms of average F1 score \({SVM}_{I}\) is slightly ahead of \({SVM}_{II}\), which suggests that, under a class imbalance problem, the non-parametric model trained on the original dataset can achieve a slightly better result.

Fig. 14 Multiclass ROC for \({LR}_{I}\), \({LR}_{II}\), \({SVM}_{I}\) and \({SVM}_{II}\). a Multiclass ROC for \({LR}_{I}\). b Multiclass ROC for \({SVM}_{I}\). c Multiclass ROC for \({LR}_{II}\). d Multiclass ROC for \({SVM}_{II}\)

6 Results and discussion

Parametric and non-parametric ML models are two distinct families of methods with their own theory and modelling processes. In short-term rockburst risk evaluation, however, application of the parametric approach has been infrequent, even though parametric models do not rely on data size and benefit from normally distributed features, in contrast to non-parametric models. Therefore, a parametric model (LR) and a non-parametric model (SVM) are employed and studied on two datasets: the original data (Dataset I), which is not particularly normally distributed, and another (Dataset II), prepared by applying the Box-Cox transformation, which is approximately normally distributed. First, the models are built on Dataset I after partitioning it in the ratio 70:30; during the modelling on Dataset I, standardisation is preferred for feature scaling to bring all features into a similar range for both LR and SVM. Once the hyperparameter optimisation is done, LR achieves an accuracy of 66.67% whereas SVM obtains only 60.61% on the testing set of Dataset I. Similarly, the models are trained on Dataset II with the same configuration as for Dataset I, the only difference being that feature scaling is not performed for Dataset II because, after the Box-Cox transformation, all feature values are automatically scaled. The LR model again obtains the best accuracy, 72.73%, whereas SVM achieves 63.64% on the testing data. From the accuracy results, the parametric LR scores highest on Dataset II.

Although accuracy is often regarded as a straightforward measure of model performance, it cannot describe performance fully. Therefore, unlike previous studies, three further evaluation metrics, precision, recall and F1 score, are calculated, and the average AUC is computed for each model to identify its robustness. From the perspective of predicting high-risk and low-risk rockburst cases, the final F1 scores indicate that LR trained and tested on Dataset II is more suitable than the model trained on Dataset I. Similarly, SVM trained on Dataset I is slightly ahead of that trained on Dataset II: \({SVM}_{II}\) has a comparatively higher accuracy than \({SVM}_{I}\), but from the F1 point of view, \({SVM}_{I}\) is more appropriate for classifying both high risk and low risk under imbalanced classes, as it can give more correct guidance. Finally, the AUC is calculated for each classifier on both datasets to assess robustness. Overall, \({LR}_{II}\) has the best AUC of 0.91, followed by \({LR}_{I}\) (0.90); likewise, \({SVM}_{I}\) and \({SVM}_{II}\) score equal AUCs. This simply indicates that, for short-term rockburst risk prediction, parametric models can still effectively identify risks even on small datasets with few features, and they are even more capable of predicting accurate outcomes if non-normally distributed data are transformed into a normal distribution.

Short-term rockburst prediction using microseismic monitoring during the rockburst development process relies on different factors. Given the complex geological conditions during data acquisition, obtaining sufficient records and features for training a model is always challenging. A parametric model such as LR is often more powerful in defining the true relationship between input and output variables using a known form of function, and can predict the outcome independently of the size of the data: once the functional form is determined, the parametric model estimates its coefficients from the training data to give a better predictive model, and the amount of data has little impact on the training process. Moreover, if the independent variables are more or less normally distributed, such models can find the best fit for accurate prediction. The work of Feng et al. (Feng 2017) states that the degree of microseismicity is positively correlated with the frequency and intensity of rockburst, which means MS multi-parameters and rockburst risk levels also have a high level of correlation. Parametric models are often more effective when the inputs and outputs have a strong relationship, because assuming a specific functional form, such as a line or hyperplane, simplifies the training process. Therefore, a parametric model like logistic regression gives better results regardless of the data size compared with non-parametric models. The primary reason for the poorer performance of the non-parametric SVM on both datasets is that, even though non-parametric models do not depend on an assumed mapping function or on the normality of the variable distributions, they require a large amount of supporting data and inputs to learn the pattern, so that they can define the functional form that best fits the training instances and generalise to unseen observations. In short-term rockburst prediction, however, it is always challenging to acquire the amount of data needed to train non-parametric models; models trained with insufficient records often lose the ability to generalise to unseen samples and cannot deliver high prediction results. On the other hand, the accuracy of the non-parametric SVM differs only a little between the two datasets, and on the other performance metrics it behaves almost identically in every respect. This is simply because the non-parametric SVM depends on the number of training records, rather than the distribution of the input variables, to estimate the parameters necessary for prediction. Hence, unlike the parametric model, only a larger amount of data increases the efficiency of the non-parametric model.

Although the overall results are promising, some limitations can be handled in future research in a similar area:

1. The available dataset contains four intensity classes, among which the proportion of intense rockburst cases is comparatively small. When an ML model is trained on a dataset where the class ratios are unequal, it could produce biased results. Therefore, the dataset can be updated in future work to give equal representation to each class and yield more accurate outcomes; this can be done by adding more cases of the minority classes. That will further help the model generalise over observations of each intensity class for real-time warning.

2. This paper mainly focuses on developing intelligent prediction models for immediate types of rockburst, such as strainbursts and strain-structure slip bursts, which often occur immediately after excavation in deep engineering projects. There is also a different type, the fault-slip rockburst, which has a different mechanism; however, due to the self-similarity of the rockburst development process, monitored MS information can still be utilised for early warning of its risk levels (Feng et al. 2017). Future research on this topic using intelligent models is worth exploring.

7 Conclusion

The prediction of rockburst hazard by traditional approaches is always challenging, so an intelligent approach that aids the accurate prediction of risk is necessary. This paper therefore proposes and compares two widely adopted parametric and non-parametric methods, typified by LR and SVM, on two different datasets. Initially, 108 rockburst records obtained from MS monitoring were collected, containing four attributes: cumulative number of events, cumulative released energy, cumulative apparent volume and rockburst development day. Since a parametric model performs better when the feature variables are normally distributed, two datasets were prepared to better assess the feasibility of both approaches: Dataset I consists of the original, non-normally distributed data, whereas the normally distributed Dataset II was prepared by applying the Box-Cox transformation to Dataset I. The models were then built on both datasets using the same samples for training and testing under a 70:30 split in both conditions. During training, hyperparameter optimisation was done with the grid search method and fivefold cross-validation to find the C and penalty parameters for LR, as well as the C and gamma parameters for the best-performing rbf kernel of SVM, using the multiclass classification approach. With the optimal parameters for each classifier, the results show that the logistic regression model trained on Dataset II obtained the highest accuracy of 72.73% on the testing sample compared with the other models. For further evaluation, three more performance metrics, precision, recall and F1 score, were computed, from which we conclude that LR performed well on both datasets whereas SVM had almost equal performance on both. To investigate the robustness of the classifiers, ROC curves were drawn for each class and the average AUC was computed for each classifier. The final output indicates that the LR model trained on Dataset II achieved the highest AUC of 0.91, against 0.90 when trained on Dataset I, while SVM had an equal AUC of 0.86 on both datasets. This simply says that, for short-term rockburst risk prediction, limited data availability is always a threat: even though a non-parametric model can achieve good output, it needs more data and inputs to generalise well and yield higher accuracy. A parametric model, in contrast, performs well because data size does not influence its generalisation; it constrains the algorithm to a specified functional form whose coefficients are estimated from the training data to give a better predictive model. Regardless of data size, a highly accurate model can still be achieved if the data are relatively normally distributed. Hence, in future research, parametric models are worth exploring when datasets are comparatively small and the number of variables is limited.