1 Introduction

The term “liquefaction” commonly refers to the transformation of saturated loose sandy soil from a solid to a liquid state as a result of increased pore water pressure and the consequent complete loss of effective stress under strong and rapid seismic loading. Engineers did not consider this phenomenon seriously until 1964 [1], when the Alaska and Niigata earthquakes caused significant liquefaction-induced damage to the environment, structures, and underground facilities [2,3,4]. The damaging effects of liquefaction were also observed during recent earthquakes [5,6,7,8]. Therefore, it is crucial to evaluate the soil liquefaction potential appropriately when designing any soil-structure system. In this regard, many researchers have carried out extensive investigations on earthquake-induced soil liquefaction and its prediction.

Various empirical and semi-empirical methods have been proposed to evaluate liquefaction potential. The most common is the stress-based approach known as the simplified procedure [9], which requires an estimate of the capacity of the soil to resist liquefaction. This liquefaction resistance can be determined from laboratory tests or in-situ geotechnical investigations. In-situ tests are generally preferred because laboratory tests have shortcomings, such as the difficulty of representing actual field conditions and of obtaining high-quality soil samples. The in-situ tests commonly used for liquefaction triggering analysis include the shear wave velocity test, the cone penetration test (CPT), and the standard penetration test (SPT) [10,11,12,13,14,15,16,17]. Numerous nonlinear numerical analyses have also been performed to assess soil liquefaction. However, the numerical approach may require a sophisticated constitutive soil model and solid expertise in dynamic analysis, and the complexity of the model may render it ineffective and time-consuming. Considering these facts, nonconventional approaches (e.g., machine learning) have become an alternative tool for solving various problems in this field. Machine learning (ML) methods can capture potential correlations in the data without any prior assumptions [18].

Currently, ML methods such as support vector machine (SVM), logistic regression (LR), artificial neural network (ANN), random forest (RF), naive Bayes, eXtreme gradient boosting (XGBoost), and decision tree are increasingly used in different engineering applications. In geotechnical engineering, ML methods can improve our understanding of the complex behavior of liquefiable soils and can also contribute to improving liquefaction hazard analysis [19]. According to Xie, et al. [20], various ML techniques have been adopted many times for liquefaction assessment [21,22,23,24,25,26,27,28,29,30,31,32,33].

In recent years, ensemble algorithms have gained great attention in various fields owing to their predictive capabilities [31, 34,35,36,37]. An ensemble method combines multiple classifiers to solve a complex problem and improve model performance. The idea of ensemble learning is to reduce the chance of error while increasing the overall reliability and confidence of the model [38]. Over the past decades, widely used algorithms such as SVM, ANN, and LR have shown strong performance in both regression and classification tasks. However, tree-based ensemble (TBE) approaches have significantly improved performance and are becoming more widely accepted [39]. Bagging [40] (e.g., RF) and boosting (e.g., XGBoost) are the classical, well-known TBE approaches, and they differ mainly in two respects. First, boosting mainly uses weighted averages to turn multiple weak learners into a stronger learner, whereas bagging (short for bootstrap aggregating) combines multiple independent learners [41]. Second, boosting aims to produce an ensemble model with less bias, whereas bagging mainly aims to obtain a strong model with less variance than its components [42]. Perhaps not surprisingly, TBE methods have become particularly popular because they combine properties of both statistical and ML methods [43] and provide highly accurate prediction models with ease of interpretation and feature importance analysis for large datasets. Ensemble methods based on decision trees (e.g., RF and XGBoost) have also been widely used to deal with nonlinear problems. Besides, the primary advantages of TBE methods are that they rely on fewer assumptions than alternative classification methods (such as SVM, LR, and discriminant analysis) and require minimal data pre-processing [43, 44].
With the growing interest in ML, many studies have applied TBE methods in various domains, such as information science [45], biology [46], energy [47], and healthcare [48]. These studies compared the prediction results with those of other classification methods and concluded that TBE methods provide better prediction accuracy than the other classification models.

TBE methods have also been successfully introduced in geotechnical engineering. For example, Wang, et al. [35] and Bharti, et al. [36] applied the XGBoost approach to slope stability problems. Zhang, et al. [37] adopted the XGBoost and RF algorithms to predict the relationship between the undrained shear strength and several basic soil parameters. Pham, et al. [49] used the Adaptive Boosting (AdaBoost) algorithm for the classification of soils, and the results indicated that AdaBoost gives promising results for the automatic classification of soil samples. Wang, et al. [50] used five different algorithms, including gradient boosting machine (GBM) and RF, to estimate the bearing deformation and column drift ratio responses of extended pile-shaft-supported bridges. They concluded that the GBM algorithm predicted the seismic responses of the soil-bridge systems better than the other methods studied. However, the literature review shows that advanced boosting algorithms such as GBM and XGBoost have rarely been employed for liquefaction prediction [31, 51]. Moreover, no previous study has investigated the AdaBoost algorithm for seismic soil liquefaction prediction, and there are too few examples of a quantitative, systematic comparison of boosting algorithms in liquefaction prediction. The goal is to build ML models using not only robust algorithms, such as ensemble learners, but also simpler and faster learning algorithms, in order to assess which algorithm predicts best. Considering the many advantages of TBE methods (e.g., reliability, robustness, and high accuracy) and the lack of liquefaction prediction studies based on TBE methods in the literature, the AdaBoost, GBM, and XGBoost algorithms are applied in the present study.

It is well known that building an ML-based prediction model requires a large amount of data and many features. However, not all features in the dataset may be necessary during the modeling phase. The principal goal for engineers and researchers is to achieve the best predictive ability of the model. Hence, removing irrelevant data may help minimize errors, enhance learning accuracy, and reduce computation time [52,53,54]. Furthermore, using the dataset without pre-processing would increase the overall complexity of the model. This limitation can be remedied by feature selection (FS) methods, which reduce the size of the training dataset and remove superfluous features. FS is the procedure of identifying an optimal subset of features among all possible combinations of feature subsets from the original dataset, reducing the number of predictors as far as possible without compromising predictive performance [55]. Das, et al. [54] determined the important input features of SPT, CPT, and Vs datasets using FS methods in a multi-objective optimization framework. Hu [56] used the filter method, one of the classes of FS methods, to remove all irrelevant variables for gravelly soil liquefaction. Demir and Sahin [57] applied the RFE method as an efficient FS technique for liquefaction prediction. All of these studies concluded that FS methods can improve the predictive capability of models. Therefore, identifying the relevant features of a dataset is an important step in preparing a prediction model.

The objective of this study is to predict soil liquefaction with the AdaBoost, GBM, and XGBoost algorithms, using three FS methods, namely Recursive Feature Elimination (RFE), Boruta, and Stepwise Regression (SR), to enhance the performance of the TBE algorithms. For this purpose, 620 SPT case records collected from the Kocaeli and Chi-Chi earthquakes are used in the experiments. Four performance metrics, namely Overall Accuracy, Precision, Recall, and F-measure, were used to measure the performance of the models. All analyses were performed with the R software [58]. The novelty of this paper can be summarized as follows: (1) The application of GBM and XGBoost is still rare in the prediction of soil liquefaction. In addition, the AdaBoost algorithm was applied for the first time in this study and compared with the other boosting algorithms for liquefaction prediction. Investigating these algorithms and comparing them with each other is necessary to build sufficient background and obtain sound findings. (2) The growing popularity of FS methods and their frequent application raise new questions about their influence on the prediction performance of models. Hence, the results of the RFE, Boruta, and SR methods were compared with those of the original dataset including all features, to better understand how well these methods identify the optimal features. (3) Sampling is an important consideration when building a prediction model. In current ML-based liquefaction prediction practice, data are generally subdivided into training and testing samples by simple random sampling (SRS), which may be problematic when the distribution of the liquefaction events in the dataset is imbalanced [55]. In this study, however, the training and test samples were produced with the Stratified Random Sampling (StrRS) technique to ensure the selection of balanced samples. This approach generates random sampling points and distributes them equally between the classes (liquefied and non-liquefied). (4) A nonparametric statistical test, the Wilcoxon signed-rank test [59], was applied to determine whether the differences between the prediction results of the algorithms are statistically significant. (5) Lastly, the computation costs of the TBE algorithms were evaluated with parallel and non-parallel processing.

2 Methodology

This section presents the theoretical details of TBE algorithms and FS methods applied in this study. Moreover, an introduction to the liquefaction dataset and the performance measurement and accuracy assessment steps are briefly mentioned. A flowchart of the methodology is presented in Fig. 1.

Fig. 1 The route of the methodology followed in this study

2.1 Description of the dataset

A total of 620 SPT records with 12 parameters, collected from two major earthquakes in 1999, are considered in this study [24]. The dataset is labeled for binary classification and includes 256 liquefied (Yes) and 364 non-liquefied (No) cases. Further details about the dataset are summarized in Table 1.

Table 1 Parameters and some statistical measures of the dataset [24]

2.2 Overview of tree-based ensemble (TBE) algorithms

Ensemble algorithms utilize several weak learners and aggregate their outcomes to improve the performance of a model. There are several types of TBE algorithms. Among them, AdaBoost, GBM, and XGBoost were employed to predict soil liquefaction. A brief discussion of the three algorithms is provided here.

2.2.1 Adaptive boosting (AdaBoost)

AdaBoost, introduced by Freund and Schapire [60], is one of the most commonly applied boosting algorithms. It performs efficiently because it can generate increasing diversity among its classifiers. It has been successfully applied to two-class, multi-class single-label, and multi-class multi-label problems, as well as to categories of single-label problems. The algorithm is an iterative process that tries to build a strong classifier from weak classifiers [61]. In each iteration of the training process, a weak classifier is chosen to minimize the error, and the chosen classifiers are used to build a much better classifier, which boosts the performance of the weak classification algorithm [62]. This boosting is accomplished by taking a weighted average of the outputs of the set of weak classifiers. Pseudocode for the AdaBoost algorithm is presented in Algorithm 1 [53].

Algorithm 1 Pseudocode of the AdaBoost algorithm
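As an illustration only, the iterative reweighting described above can be sketched in Python. This is not the R implementation used in this study: the choice of one-split decision stumps as weak classifiers, the label coding (+1 for "Yes", -1 for "No"), and the function names `train_stump`, `adaboost_fit`, and `adaboost_predict` are all assumptions made for the sketch.

```python
import math

def train_stump(X, y, w):
    """Pick the one-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] <= thr else -pol for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best  # (weighted error, feature index, threshold, polarity)

def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] <= thr else -pol

def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost for labels in {+1, -1}."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)         # avoid log(0) for a perfect stump
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy problem whose classes cannot be separated by a single threshold, a few rounds are enough for the weighted vote to fit the training labels, which is the sense in which weak classifiers are boosted into a stronger one.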

2.2.2 Gradient boosting machine (GBM)

GBM can be used for both classification and regression problems in ML applications. GBM seeks an additive model that minimizes the loss function by constructing each new base-learner to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble [63]. The loss function can be chosen arbitrarily. For intuition, if the error function is the classic squared-error loss, the learning procedure results in consecutive error-fitting. A GBM model mainly contains a few hyperparameters, such as max_depth, min_rows, ntrees, col_sample_rate, and learn_rate. The GBM algorithm is summarized as pseudocode in Algorithm 2 [63].

Algorithm 2 Pseudocode of the GBM algorithm
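The consecutive error-fitting that arises with the squared-error loss can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than the gbm or h2o implementation used in this study: the base-learners are one-split regression stumps, and the helper names (`fit_stump`, `gbm_fit`, `gbm_predict`) are hypothetical. The arguments `ntrees` and `learn_rate` only mirror the hyperparameter names mentioned above. A binary classifier would normally use a logistic-type loss instead of the squared error.

```python
def fit_stump(X, r):
    """Least-squares regression stump fitted to the residuals r."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [ri for row, ri in zip(X, r) if row[j] <= thr]
            right = [ri for row, ri in zip(X, r) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return None if best is None else best[1:]

def stump_value(stump, row):
    j, thr, lm, rm = stump
    return lm if row[j] <= thr else rm

def gbm_fit(X, y, ntrees=50, learn_rate=0.1):
    """Gradient boosting with squared-error loss."""
    f0 = sum(y) / len(y)                   # initial constant model
    pred = [f0] * len(y)
    stumps = []
    for _ in range(ntrees):
        # For 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:                  # no valid split exists
            break
        stumps.append(stump)
        pred = [p + learn_rate * stump_value(stump, row)
                for p, row in zip(pred, X)]
    return f0, learn_rate, stumps

def gbm_predict(model, row):
    f0, learn_rate, stumps = model
    return f0 + learn_rate * sum(stump_value(s, row) for s in stumps)
```

Each added stump fits what the current ensemble still gets wrong, so the training error shrinks as `ntrees` grows, at a pace controlled by `learn_rate`.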

2.2.3 eXtreme gradient boosting (XGBoost)

XGBoost is a free, open-source library implementing the gradient boosted tree algorithm, which has recently dominated data science competitions for structured or tabular data. XGBoost combines the predictions of a set of weak learners to create a strong learner with better prediction performance [64]. The method requires its primary hyperparameters to be set before the model is trained. As with any ML algorithm, XGBoost achieves its best performance only with well-chosen hyperparameters, so appropriate tuning is particularly important [65]. Therefore, the grid search (GS) method is used to find appropriate values of the XGBoost hyperparameters, namely eta (the learning rate), subsample (subsample ratio of the training instance), max_depth (maximum depth of a tree), gamma (minimum loss reduction), colsample_bytree (subsample ratio of columns when constructing each tree), min_child_weight (minimum sum of instance weight) and nrounds (number of boosting iterations). The “xgboost” package was used through the “caret” package in R to perform XGBoost operations [66]. The pseudocode of XGBoost is given in Algorithm 3 [67].

Algorithm 3 Pseudocode of the XGBoost algorithm
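To make the roles of gamma and min_child_weight more concrete, the following Python sketch evaluates the standard second-order split gain of the gradient boosted tree algorithm for one candidate split. It is an illustration only and does not call the xgboost library. The function names are hypothetical, and the L2 regularization term `lam` (called lambda in xgboost) is an assumption that is not discussed in the text. G and H denote the sums of first and second derivatives of the loss over the samples in a node.

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf value for gradient sum G and Hessian sum H."""
    return -G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Loss reduction of a split; gamma is the minimum reduction required."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

def split_allowed(G_left, H_left, G_right, H_right,
                  lam=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and the gain is positive."""
    if H_left < min_child_weight or H_right < min_child_weight:
        return False
    return split_gain(G_left, H_left, G_right, H_right, lam, gamma) > 0.0

def logistic_grad_hess(p, y):
    """Gradient and Hessian of the logistic loss for predicted probability p."""
    return p - y, p * (1.0 - p)
```

For example, with four samples labelled 1, 1, 0, 0 and an initial probability of 0.5, separating the two classes gives a positive gain when gamma is 0, while a larger gamma or a larger min_child_weight rejects the same split, which is how these hyperparameters make the trees more conservative.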

2.3 Feature selection (FS) methods

FS is the part of developing predictive ML models that reduces the number of input variables. Its potential benefits include facilitating data understanding, reducing computational cost, and mitigating the curse of dimensionality to improve the performance of the prediction model [68]. Several FS methods have been developed to identify which features are most relevant and should be used in prediction models. In this study, three FS methods, namely RFE, Boruta, and SR, were used to select only the important and relevant features. The three FS methods are described in the following sections.

2.3.1 Recursive feature elimination (RFE)

RFE is a commonly used FS method proposed by Guyon, et al. [69]. RFE is used to rank the features in a dataset according to the importance provided by the RF algorithm. The RFE method consists of several main steps [70, 71]. First, the importance of each feature is calculated in each iteration of the elimination process. Second, the features are sorted from high to low according to their importance. Finally, the least important feature(s) is removed from the model. The model is then rebuilt, and the feature importance scores are recalculated. This process is repeated until no redundant or irrelevant features remain. Detailed information on the RFE method is given as pseudocode in Algorithm 4 [72].

Algorithm 4 Pseudocode of the RFE method
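The rank, remove, and refit loop of RFE can be sketched in Python as follows. This is a minimal illustration, not the caret-based procedure used in this study: the importance measure is passed in as a function, and in the toy usage it is the absolute Pearson correlation with the label, which is a stand-in assumption for the RF importance. The names `rfe`, `pearson`, and `n_keep` are hypothetical.

```python
def pearson(u, v):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su and sv else 0.0

def rfe(features, importance_fn, n_keep):
    """Repeatedly drop the least important feature until n_keep remain."""
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        # Importance is recomputed on the current subset in every round.
        scores = importance_fn(remaining)
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated

# Toy data with made-up values (not taken from the SPT dataset).
toy_y = [0, 0, 1, 1, 0, 1]
toy_columns = {
    "signal": [0.1, 0.2, 0.9, 0.8, 0.3, 0.7],
    "weak": [1, 2, 1, 2, 1, 3],
    "noise": [5, 1, 4, 2, 3, 3],
}

def toy_importance(subset):
    return {f: abs(pearson(toy_columns[f], toy_y)) for f in subset}
```

Calling `rfe(list(toy_columns), toy_importance, n_keep=1)` removes the uninformative columns first. In practice the number of features to keep would be chosen by resampling performance rather than fixed in advance.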

2.3.2 Boruta

Boruta (Algorithm 5 [73]) is a wrapper feature ranking and selection method built around RF for feature relevance estimation and implemented in the R statistical package [74]. Feature importance is obtained from the RF algorithm, which measures the importance of a variable as the decrease in accuracy when the information about that variable in a node is removed from the model. Like the RF algorithm, the Boruta method is based on adding randomness to the model and collecting results from the ensemble of randomized samples [75]. The Boruta method involves the following steps [66]: (1) duplicating all features to extend the information system, (2) shuffling the added attributes to eliminate their correlations with the response, (3) running the RF model on the extended system and gathering importance (Z) scores, (4) finding the MZSA (the maximum Z score among the duplicated (i.e., shadow) features) and assigning a hit to every feature that scored better than the MZSA, (5) applying a two-sided equality test with the MZSA for each feature with undetermined importance, (6) deeming the features with importance lower than the MZSA unimportant and removing them from the information system permanently, (7) removing the duplicated variables, and (8) repeating steps (1) to (7) until importance is assigned for all attributes. The method is described in detail in Kursa and Rudnicki [74].

Algorithm 5 Pseudocode of the Boruta method
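The shadow-feature idea in steps (1) to (4) can be illustrated with a simplified Python sketch. It is not the Boruta package: the RF Z scores are replaced by the absolute Pearson correlation with the response as a stand-in importance measure, the statistical test of step (5) is omitted, and only the number of hits per feature is returned. The names `boruta_hits` and `pearson` are hypothetical.

```python
import random

def pearson(u, v):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su and sv else 0.0

def boruta_hits(columns, y, n_iter=20, seed=0):
    """Count how often each feature beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)          # step (1): duplicate the feature
            rng.shuffle(shadow)            # step (2): break its link to y
            shadow_scores.append(abs(pearson(shadow, y)))
        max_shadow = max(shadow_scores)    # stand-in for the MZSA
        for name, values in columns.items():
            if abs(pearson(values, y)) > max_shadow:
                hits[name] += 1            # step (4): assign a hit
    return hits
```

A feature that collects hits in nearly every iteration would be a candidate for confirmation, and one that almost never beats the shadows would be a candidate for rejection. The actual method makes this decision with a statistical test on the hit counts.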

2.3.3 Stepwise regression (SR)

SR [76] is one of the best-known methods for choosing the features of a model, keeping relevant features and removing irrelevant or redundant ones. SR was developed as an FS procedure for linear regression models and combines forward and backward selection. The method modifies forward selection so that, after each step of the algorithm, all variables already in the model are checked to see whether their significance has fallen below the specified tolerance level. If a non-significant variable is found, it is excluded from the model [77]. SR requires two independent statistical significance cut-off values, one for adding variables to the model and one for deleting them [78]. More detail about the mathematical background of SR can be found in the literature [76, 78]. The pseudocode of the SR method is shown in Algorithm 6 [79].

Algorithm 6 Pseudocode of the SR method
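The interplay of the two cut-off values can be sketched in Python as follows. This is a schematic illustration only: the regression fit is abstracted into a user-supplied function `p_value_fn(feature, others)` that returns the p-value of `feature` in a model containing `others`, and the default cut-offs of 0.05 and 0.10 are assumptions, not values taken from this study. The toy p-values are invented to show a variable being removed after another one enters.

```python
def stepwise_select(candidates, p_value_fn,
                    alpha_enter=0.05, alpha_remove=0.10, max_steps=100):
    """Forward selection with a backward check after every step."""
    model = []
    for _ in range(max_steps):             # guard against cycling
        changed = False
        # Forward step: add the most significant variable outside the model.
        outside = [f for f in candidates if f not in model]
        if outside:
            best = min(outside, key=lambda f: p_value_fn(f, model))
            if p_value_fn(best, model) < alpha_enter:
                model.append(best)
                changed = True
        # Backward step: drop the least significant variable in the model.
        if model:
            def p_in_model(f):
                return p_value_fn(f, [g for g in model if g != f])
            worst = max(model, key=p_in_model)
            if p_in_model(worst) > alpha_remove:
                model.remove(worst)
                changed = True
        if not changed:
            break
    return model

def toy_p_value(feature, others):
    """Made-up p-values: x2 loses significance once x3 is in the model."""
    if feature == "x1":
        return 0.001
    if feature == "x2":
        return 0.30 if "x3" in others else 0.01
    return 0.03                            # x3
```

With the toy function, x1 and x2 enter first, x3 enters next, and x2 is then removed because its p-value rises above the removal cut-off. Choosing the entry cut-off no larger than the removal cut-off is a common way to avoid adding and removing the same variable repeatedly.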

2.4 Performance evaluation methodology

Many performance evaluation metrics are available for evaluating the classification performance of ML models. In this study, Overall Accuracy (\(Acc\)), Precision (\(P\)), Recall (\(R\)), and F-measure (\(F\)) were used. The performance of two-class classification models is described by the Confusion Matrix (CM). Accordingly, the CM parameters, namely TP, FN, FP, and TN, were used to compute the performance evaluation metrics, as shown in Fig. 2.

Fig. 2 Description of CM and performance metrics
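For reference, the four metrics can be computed from the CM counts with the standard definitions, sketched here in Python. The function name and the counts in the usage note are hypothetical and are not results of this study. The liquefied class ("Yes") is taken as the positive class.

```python
def classification_metrics(tp, fn, fp, tn):
    """Overall accuracy, precision, recall, and F-measure from CM counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"Acc": acc, "P": precision, "R": recall, "F": f_measure}
```

For instance, `classification_metrics(tp=8, fn=2, fp=1, tn=9)` gives an accuracy of 17/20, a precision of 8/9, a recall of 8/10, and an F-measure of 16/19. The same function applied with the roles of the classes swapped gives the metrics for the "No" class.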

The accuracy of the liquefaction predictions produced by the ML models is estimated from the CM for the validation data. Although the models may perform well according to these measures, a statistical significance test is needed to determine the single best model among them. Therefore, the Wilcoxon signed-rank test [59], one of the most important nonparametric tests for comparing models, was used to identify significant differences between the models.
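A minimal Python sketch of the signed-rank statistic is given below for illustration. It assumes paired scores for two models (for example, one score per resampling fold, which is an assumption about how the pairs are formed), drops zero differences, assigns average ranks to ties, and uses the normal approximation without tie or continuity corrections for the two-sided p-value. Statistical software may instead compute an exact p-value for small samples, so the numbers can differ slightly.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, two-sided p-value) for paired samples a, b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks for ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided normal tail
    return w_plus, w_minus, p_value
```

A p-value at or below the chosen significance level (0.05 in this study) suggests that the two models perform differently. A larger p-value gives no evidence of a difference.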

3 Results and discussion

Several TBE algorithms and FS methods were applied in the present work. First, three FS methods were used to detect the most relevant features according to their importance. The objective of FS is to remove irrelevant and redundant features while keeping those most useful for prediction. The FS process may help decrease computational time, improve the performance of the algorithm, and prevent overfitting. After the FS process, the TBE algorithms were employed to analyze the best feature subset. In addition, the full feature set (i.e., RAW SPT data including 12 features) was considered as a comparison model. In preparing an optimum model, the most important fact is that model accuracy depends mainly on the selected hyperparameters. Therefore, tenfold cross-validation, in which the training data are randomly partitioned into k = 10 equal-sized subsamples, was applied to the proposed models for hyperparameter tuning. The AdaBoost, GBM, and XGBoost algorithms were then compared to determine the best-performing model overall. The quality of the resulting models was evaluated using the Acc, P, R, and F metrics. The Wilcoxon signed-rank test was also used to assess the statistical significance of the differences in model accuracy. The proposed methodology was implemented in the R programming language (version 3.6.3) [58] with the following main R packages: caret, sp, randomForest, Boruta, adabag, gbm, e1071, h2o, and xgboost. All applications for liquefaction assessment were performed on a PC with a 4.0 GHz AMD Ryzen 9 3950X CPU, 64 GB RAM, and the Windows 10 operating system.

3.1 Determination of training and test sample size

ML algorithms build a model from training samples in order to make predictions or decisions. Therefore, the training sample size can have a larger impact on model accuracy than the algorithm used [80]. However, there is no established guideline or exact ratio for the minimum number of samples required for ML prediction. The best sample size may depend on the ML algorithm, the number of input variables, and the size of the original database [81]. Another important aspect of determining the training and test sample sizes is the method used to select the training data. Sampling strategies can be divided into two types, namely probability (random) sampling and non-probability sampling. In random sampling, each member of the population has an equal chance of being selected for the sample, and several random sampling techniques are available for managing sample sizes [82, 83]. In non-probability sampling, the user selects samples based on subjective judgment rather than random selection [84]. Examples of non-probability sampling methods in the literature include quota, snowball, judgment, and convenience sampling.

Sampling is the technique of selecting a subset of a population in order to determine the characteristics of the whole population and make statistical inferences. There are several types of sampling techniques, but Simple Random Sampling (SRS) and Stratified Random Sampling (StrRS) are the most preferred in the ML area. SRS is the basic sampling technique, in which a group of samples is selected at random from a population. It is the most appropriate option when the entire population from which the sample is taken is homogeneous. Otherwise, StrRS is the ideal approach when the population is heterogeneous or dissimilar. In this study, StrRS was used as the sampling strategy. In this method, the population is divided into subgroups, called strata, and a random sample is taken from each subgroup, so that each subgroup's sample has the same sampling fraction. The main advantage of StrRS is that it captures key population characteristics in the sample, and stratification reduces sampling error while ensuring a greater level of representation. Usually, the data are split into training and testing sets in a ratio of 60:40, 70:30, or 80:20. The training dataset is used for model building and the test dataset for model evaluation. In particular, many researchers have adopted a ratio of 70:30 or 80:20 [26, 29, 31,32,33, 54, 57, 85]. In this study, the ratio of 70:30 (training/test set) was chosen, as in other studies in the literature [54, 57, 85, 86], and the SPT data (a total of 256 “Yes” and 364 “No”) were used for the analysis. A comparison of the two classes (i.e., Yes and No) shows that the No class is proportionally larger than the Yes class, so the classes are not homogeneously distributed. In other words, the distribution of the liquefaction events in the dataset is imbalanced. Therefore, the SPT data were divided into training and test sets using the StrRS technique. After the sampling process, the training data contained 179 events each of “Yes” and “No”, and the test data contained 77 events each of “Yes” and “No”. As a result, both the training and test samples consist of homogeneous, equally represented strata. A comparative example of the two sampling strategies is given in Table 2. The mean values of liquefied events for the dataset using the StrRS and SRS techniques are 0.500 and 0.4124, respectively. This finding shows that the distribution of the classes (i.e., Yes and No) is disproportionate for SRS. Thus, using the StrRS technique guarantees that each class has sufficient samples.
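The class-wise sampling described above can be sketched in Python. The sketch assumes that the reported counts mean 179 training and 77 test events per class, which is consistent with the reported StrRS mean of 0.500. It is not the R code used in this study, and the function name and its arguments are hypothetical.

```python
import random

def stratified_split(labels, n_train_per_class, n_test_per_class, seed=0):
    """Draw the same number of training and test indices from every class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    needed = n_train_per_class + n_test_per_class
    for lab, indices in by_class.items():
        if len(indices) < needed:
            raise ValueError("class %r has too few samples" % (lab,))
        picked = rng.sample(indices, needed)   # random draw without replacement
        train.extend(picked[:n_train_per_class])
        test.extend(picked[n_train_per_class:])
    return train, test

# Class sizes as reported for the SPT dataset: 256 "Yes" and 364 "No".
labels = ["Yes"] * 256 + ["No"] * 364
train_idx, test_idx = stratified_split(labels, 179, 77, seed=0)
```

With these inputs, half of the training samples and half of the test samples are "Yes" events, whereas the share of "Yes" events in the full dataset is 256/620, or about 0.413, which is roughly what a simple random sample would reproduce.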

Table 2 A comparative example of the difference between StrRS and SRS techniques according to SPT dataset

3.2 Feature selection for dimensionality reduction

FS methods select the variables in the original dataset that are most important and relevant for the prediction process and remove unrelated ones. They offer several benefits: circumventing the curse of dimensionality, speeding up the learning process, simplifying models, and improving the quality of ML methods as well as training efficiency [87, 88]. In this study, three different FS methods, namely RFE, Boruta, and SR, were compared. Table 3 shows the ranks of all selected features and their feature importance (\(FI\)) scores.

Table 3 Selected factors and their importance scores estimated by FS methods

The model was initially applied to the training data using RF-\(FI\) analysis to identify which features are the most effective in liquefaction prediction. The most important features obtained from the RF-\(FI\) scores are given in Table 3. The features \(FC\), \(\phi ^{\prime}\), and \((N_{1} )_{60}\) were found to be the most important parameters for the SPT dataset, whereas \(a_{t}\), \(Vs\), \(a_{max}\), and \(M_{w}\) were the least effective features according to the RF-\(FI\) scores. It is important to note that these results only show which features are more or less important. After the model was evaluated using the entire set of features (i.e., RAW data) as input, the RFE, Boruta, and SR methods were applied to identify the least important feature(s) and remove them from the dataset. The FS results showed that the RFE, Boruta, and SR methods selected 4 (\(FC\), \(\phi ^{\prime}\), \((N_{1} )_{60}\), and \(CSR\)), 9 (\(\phi ^{\prime}\), \((N_{1} )_{60}\), \(CSR\), \(z\), \(\sigma ^{\prime}_{v}\), \(\sigma_{v}\), \(a_{max}\), \(Vs\), and \(FC\)), and 10 (\(\phi ^{\prime}\), \(CSR\), \(FC\), \(a_{t}\), \(Vs\), \((N_{1} )_{60}\), \(d_{w}\), \(\sigma ^{\prime}_{v}\), \(\sigma_{v}\), and \(a_{max}\)) parameters as the most effective features, respectively. Examining the overlap among the feature subsets selected by the FS methods revealed that \(\phi ^{\prime}\) and \(CSR\) were common features. The results of Boruta and SR were very similar, with most features ranked similarly. The biggest difference in the FS analyses was between RFE and the other two methods, because RFE is a greedy optimization algorithm that eliminates most features. As a result of the FS processes, new data models were created and named Model_RFE, Model_Boruta, and Model_SR. In addition, RAW_Data (i.e., the original SPT data) was used as a benchmark model for an objective comparison with the other models (i.e., Model_RFE, Model_Boruta, and Model_SR).

3.3 Optimization of hyperparameter with grid-search

Hyperparameter optimization in ML aims to find the hyperparameters that deliver the best performance as measured on a validation set. One method for tuning ML models is k-fold cross-validation (CV). CV is also very useful for mitigating overfitting and providing a less biased estimate of a tuned model’s performance on the dataset. In this approach, the training data are randomly split into k subgroups (e.g., k = 10 for tenfold CV), and the model is then run k times, with one of the subsets held back for validation each time. Importantly, each sample in the data is assigned to a single group and stays in that group for the duration of the procedure. The results of each run are evaluated using the held-out data, and the results are averaged across all k scores [81]. At the end of the process, the configuration that gives the best performance, commonly defined by the mean of the model skill scores, is chosen.
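The k-fold procedure can be sketched in Python as follows. This is a generic illustration, not the resampling code of the caret package: `fit_and_score` is a hypothetical user-supplied function that trains a model on the training indices and returns a score on the validation indices, and the folds here are formed by plain random shuffling rather than by stratification.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, validation) index lists; each index is validated once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k groups of near-equal size
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

def cross_val_mean(n, k, fit_and_score, seed=0):
    """Average the k validation scores returned by fit_and_score."""
    scores = [fit_and_score(train, validation)
              for train, validation in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)
```

With n = 358 training samples and k = 10, eight folds contain 36 samples and two contain 35, and every sample is held out exactly once.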

In this study, the StrRS technique was used as the sampling strategy to make sure that each fold is a good representative of the whole data. The SPT dataset was split into training and test sets in the ratio of 70:30 using StrRS for hyperparameter estimation and performance analysis. The training data were used to train the model with the hyperparameters that gave the best overall performance score in k-fold CV, and the test data (or validation data) were used to evaluate the performance of the models. The hyperparameter tuning process was performed using Grid Search based on tenfold CV (Table 4). Tuning was carried out using only the training data to allow a fairer comparison between models. It should be noted that the “train” function of the “caret” package was used to find the tuning parameters automatically for these models [89].
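Grid search itself amounts to scoring every combination of candidate hyperparameter values and keeping the best one, as the following Python sketch shows. The function `cv_score` is a hypothetical placeholder for the mean tenfold CV score of a model trained with the given hyperparameters. The grid values and the toy scoring function in the usage note are invented for illustration and are not the values listed in Table 4.

```python
from itertools import product

def grid_search(grid, cv_score):
    """Evaluate every hyperparameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)           # e.g., mean accuracy over k folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, with `grid = {"eta": [0.01, 0.1, 0.3], "max_depth": [2, 4, 6]}` the search evaluates nine combinations. The cost grows multiplicatively with every additional hyperparameter, which is one reason the computation time of the tuning step is worth reporting.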

Table 4 Details of the hyperparameter tuning of methods according to the FS based models

3.4 Evaluating performance of models

Using the same dataset with different types of ML algorithms, or different datasets with the same type of algorithm, can produce very different performances. Therefore, this study systematically compares how performance varies across combinations of FS methods and ML algorithm types. To this end, the FS methods RFE, Boruta, and SR were used to select the best feature combination, and three TBE methods, namely AdaBoost, GBM, and XGBoost, were used as liquefaction prediction algorithms. The performance of the resulting 12 (4 × 3) models, each combining a feature set with a prediction algorithm, was analyzed from two perspectives. The first is to calculate the performance of the models with the Acc, P, R, and F metrics based on the CM. The second is to identify statistically significant differences between the models using the Wilcoxon signed-rank test. The performance metrics of the models based on the CM are given in Table 5. A review of the Acc results shows that the XGBoost classifier achieved the highest accuracies among the FS-based models. It should also be noted from Table 5 that Model_RFE and Model_Boruta were the best feature subsets compared with Model_SR and Model_RAW for various performance metrics.

Table 5 Comparison of performance results of the models based on CM

It is essential to assess the P, R, and F values for predicting the "Yes" and "No" classes to specify the strengths and weaknesses of each method and fully understand the quality of the result [90]. When the CM results in Table 5 were analyzed, almost all three ML methods predicted the "Yes" class better than the "No" class. For the "Yes" class, P is the number of actual "Yes" cases relative to all samples identified as "Yes". A higher P means that fewer "No" cases are wrongly labelled as "Yes". R is the percentage of actual liquefaction events (i.e., "Yes" cases) that are properly identified. Low R values mean that the model predicted more False Negatives (should be positive but labelled negative). Higher R values indicate that most of the liquefaction events are labelled as "Yes". The F-measure is the harmonic mean of P and R on the validation dataset, and a higher F value shows that the final model makes more accurate predictions. Another important observation is that the best accuracy for Model_RAW was obtained by using the AdaBoost algorithm.
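The metric definitions above can be sketched in a few lines of Python. The function name and the confusion-matrix counts below are hypothetical and are not taken from Table 5. They only show how Acc, P, R, and F follow from the four cells of a CM, and how the "No" class is scored by swapping the roles of the two classes.

```python
def class_metrics(tp, fp, fn, tn):
    """Acc, P, R and F for the class treated as positive."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    p = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    r = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    f = 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean of P and R
    return {"Acc": acc, "P": p, "R": r, "F": f}


# Hypothetical counts with "Yes" (liquefied) as the positive class.
yes_metrics = class_metrics(tp=9, fp=3, fn=1, tn=7)

# Swapping the roles of the classes gives the metrics for "No".
no_metrics = class_metrics(tp=7, fp=1, fn=3, tn=9)

print(yes_metrics)
print(no_metrics)
```

Acc is identical for both classes because it counts all correct predictions, whereas P, R, and F differ by class, which is why they are reported separately for "Yes" and "No".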

When the overall performance of all pairs of models was analyzed, XGBoost with the Model_Boruta feature set showed the best Acc result (Acc = 0.9675), compared to AdaBoost with Model_RFE (Acc = 0.9545) and GBM with Model_Boruta (Acc = 0.8961). The differences between the Acc values of the models are small. However, whether these differences are significant should be established by statistical analysis. Therefore, the Wilcoxon signed-rank test was applied for pairwise comparisons of the models, with significance judged by the p-value. If the calculated p-value is lower than or equal to 0.05, the performances of the two models are considered different; otherwise, the difference is non-significant and the two models tend to perform the same. For the comparison of p-values, only Model_Boruta with XGBoost, Model_RFE with AdaBoost, and Model_Boruta with GBM were considered because of their Acc results. The results of the Wilcoxon signed-rank test given in Table 6 show that the performance of Model_Boruta with XGBoost was not significantly different from that of Model_RFE with AdaBoost. Furthermore, when the XGBoost and AdaBoost results were each compared to the GBM results, the p-value was lower than the significance level of 0.05, which means that both methods performed differently from GBM. Overall, all models gave acceptable results for liquefaction prediction, but XGBoost with Model_Boruta exhibited the most stable and best performance according to the validation and statistical results.

Table 6 Results of the Wilcoxon signed-rank test
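To make the pairwise comparison concrete, the sketch below implements an exact two-sided Wilcoxon signed-rank test in plain Python. It is an illustration under stated assumptions, not the procedure used to produce Table 6. The paired values are hypothetical per-resample scores (in integer percent) for two models, and the p-value is obtained by enumerating all sign patterns, which is only practical for small samples. Statistical packages typically switch to a normal approximation for larger samples.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)

    # Under H0 every sign pattern is equally likely: count those that
    # deviate from the centre at least as much as the observed W_plus.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical paired scores (percent) from ten resamples of two models.
model_a = [96, 94, 97, 95, 98, 93, 96, 97, 95, 99]
model_b = [90, 89, 90, 91, 89, 90, 88, 87, 92, 86]
w_plus, p_value = wilcoxon_signed_rank(model_a, model_b)
print(f"W+ = {w_plus}, p = {p_value:.4f}, different at 0.05: {p_value <= 0.05}")
```

Integer scores are used in the toy data so that tied differences compare exactly. With floating-point accuracies, ties would need a tolerance.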

3.5 Computational costs

The training times of the ML algorithms for two feature subsets, Model_RAW and Model_RFE, were calculated under two options: parallel and non-parallel processing. Parallel processing is a technique in which two or more computer processors (CPUs) handle separate parts of an overall task. In contrast, using a single processor to handle all parts of the task is called non-parallel processing. The training and tuning times of the two models with the three ML algorithms are shown in Table 7. It should be noted that the computational cost stated here is only the average training/tuning time (i.e., not including the prediction process), which does count the time of each hyperparameter tuning operation. To observe the computational costs of the ML algorithms clearly, only Model_RFE, which contains the fewest features (four), and Model_RAW, which includes all features (twelve), were selected. Table 7 shows that the XGBoost algorithm completed the training/tuning process more quickly than the other algorithms for both datasets with both parallel and non-parallel processing. On the other hand, the AdaBoost algorithm had the longest training/tuning time for both the parallel and non-parallel options. In addition, the GBM algorithm ran in acceptable times, but the library used, "Package h2o", does not currently allow parallelization. Hyperparameter tuning, especially with k-fold CV, is an expensive process that can benefit from parallelization. The results showed that parallel processing was beneficial for reducing computational costs in this study. However, k-fold CV required a large computation time, in which case other approaches may be more appropriate for users who have limited computing power and time.

Table 7 Training/tuning computation times with and without parallel processing
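The difference between the two options can be illustrated by timing the same hyperparameter sweep with and without concurrent workers. The Python sketch below is only a schematic under loud assumptions: `evaluate` is a hypothetical stand-in that sleeps instead of training a model, the grid is a placeholder, and threads are used because sleeping releases the interpreter lock. A real CPU-bound training run would need separate processes or a library with native multi-threading to see a comparable gain.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Stand-in for one CV fit: waits briefly and returns a made-up score."""
    time.sleep(0.02)  # placeholder for the training cost
    return params["max_depth"] * 0.01 + params["learning_rate"]


def tune(candidates, executor=None):
    """Score every candidate; return (best_params, best_score, seconds)."""
    start = time.perf_counter()
    mapper = executor.map if executor is not None else map
    scores = list(mapper(evaluate, candidates))
    elapsed = time.perf_counter() - start
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best], elapsed


# Hypothetical grid: names and values are placeholders.
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.1, 0.3]}
candidates = [dict(zip(grid, v)) for v in product(*grid.values())]

# Non-parallel: candidates are evaluated one after another.
serial_best, serial_score, serial_time = tune(candidates)

# Parallel: up to four candidates are evaluated at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_best, parallel_score, parallel_time = tune(candidates, pool)

print(f"non-parallel: {serial_time:.3f} s, parallel: {parallel_time:.3f} s")
```

Both runs select the same candidate; only the elapsed time differs, which mirrors how the timings in this section compare the two processing options for a fixed tuning task.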

4 Conclusions

Soil liquefaction is accepted as one of the essential risk factors for the seismic performance of structures in liquefaction-prone areas due to its behavioral complexity. ML algorithms are now considered a useful tool for the prediction of soil liquefaction with impressive predictive accuracy. Therefore, this research investigated and compared the performance of the TBE algorithms AdaBoost, GBM, and XGBoost for predicting soil liquefaction. These algorithms are relatively new in geotechnical applications and have rarely been employed to predict soil liquefaction. In addition, the performances of three FS methods (RFE, Boruta, and SR) were compared in combination with the TBE algorithms. The results indicated that although all models performed acceptably well, the XGBoost algorithm with the Boruta method (i.e., Model_Boruta) achieved the highest overall accuracy \((Acc = 96.75\% )\). Besides, XGBoost with Model_RFE successfully predicted liquefaction events even though only four of the twelve parameters (\(FC\), \(\phi ^{\prime}\), \((N_{1} )_{60}\), and \(CSR\)) were selected from the original SPT dataset. The Acc values of the XGBoost model were \(Acc = 96.10\%\) and \(Acc = 92.21\%\) for the four-feature model (Model_RFE) and the original dataset (Model_RAW), respectively. In addition, the XGBoost algorithm required a significantly shorter training time than the other algorithms. At the same time, while k-fold CV required a large computation time, the parallel processing utilized by the TBE algorithms except GBM reduced the computational costs. This study can provide insights into parameter settings, feature selection, and algorithm selection for liquefaction prediction analysis. Moreover, the results of this study may be helpful for researchers who build prediction models and evaluate their performance on different problems using TBE algorithms and FS methods.
In the future, detailed feature engineering strategies may be addressed to improve the performance of these ensemble methods. Also, the implementation of relatively new and sophisticated boosting algorithms such as LightGBM and CatBoost may be considered in further studies to predict soil liquefaction.