1 Research background and motivation

A deep foundation is a common and often obligatory type of foundation for supporting superstructures that carry heavy loads or rest on weak ground. Besides drilled shafts, driven piles made of timber, steel, precast concrete, and composite materials are also an effective solution in terms of cost and quality. In pile foundation design, the axial pile bearing capacity is regarded as the most important parameter. Therefore, estimating this parameter has been the subject of numerous theoretical and experimental studies in geotechnics.

Overall, there are five main methods to evaluate the pile bearing capacity, namely static analysis, dynamic analysis, dynamic testing, pile load tests, and in-situ testing [49, 50, 57]. Design guidelines based on static analysis often recommend using the critical depth concept. However, the critical depth is an idealization that has neither theoretical nor reliable experimental support, and it contradicts physical laws.

Dynamic analysis methods are based on wave mechanics for the hammer-pile-soil system. The ambiguity of the hammer impact effect, as well as changes in soil strength between the time of pile driving and the time of loading, causes uncertainties in bearing capacity determination. Dynamic testing methods are based on monitoring acceleration and strain near the pile head during driving. However, the measurements can only be analyzed by an experienced person. Another considerable limitation is that the capacity estimate is not available until the pile has been driven [53]. The pile load test, a field measurement of full-scale pile settlement under static load, is believed to provide the most accurate results. However, this method is time-consuming and costly [59]. Therefore, developing a simple, economical, and accurate method is highly desirable.

In-situ test methods for measuring soil properties have developed rapidly since the 1970s. Concurrent with this development is the increasing use of in-situ test data in the prediction of pile bearing capacity. The common tests include the standard penetration test (SPT), cone penetration test (CPT), flat dilatometer test (DMT), pressuremeter test (PMT), plate loading test (PLT), dynamic probing test (DP), press-in and screw-on probe test (SS), and field vane test (FVT). Each test applies a different loading scheme to measure the corresponding soil response in an attempt to evaluate material characteristics. Among these in-situ test data, SPT results are the most commonly used for predicting the bearing capacity of piles [5, 7].

Different SPT-based methods for determining the bearing capacity of piles have been proposed in the literature. They can be categorized into two main approaches: direct and indirect methods. Of the two, the direct methods are more widely accepted among field engineers owing to their ease of computation. For example, the works in [2, 4, 8, 18, 34, 58] proposed SPT direct methods for sandy or clayey soils. For a case study in Iran [59], the authors analyzed a pile by the finite element method and compared the results with four different SPT direct methods to find a reasonable prediction of its bearing capacity. However, according to [57], all of these empirical formulations have some inadequacies. Therefore, researchers have been exploring other ways to utilize SPT data to predict pile bearing capacity. Previous studies show that using machine learning algorithms is a viable option [24].

Machine learning (ML), a branch of artificial intelligence that mimics the operation of the human brain, can nonlinearly infer new facts by adaptively learning from historical data [36, 43, 52]. Moreover, the performance of ML-based models can be improved gradually as the amount of learning data increases, so they can keep up with the high accuracy requirements of complex engineering problems. Many contributions have demonstrated the effectiveness and efficiency of ML-based models in dealing with civil engineering problems, for example, predicting the mechanical properties (compressive/tensile/shear strength) of hardened concrete [10, 23, 28], the ultimate bond strength between corroded reinforcement and the surrounding concrete [25], the bearing capacity of piles [11, 13], the pulling capacity of ground anchors [14, 56], etc.

ML-based models, especially the Artificial Neural Network (ANN), have been extensively used to predict pile bearing capacity. Early works in this direction include [30, 61], where ANNs with error back-propagation are utilized. In [39], a combination of ANN and the Genetic Algorithm (GA), in which the ANN weights are optimized by the GA, is trained using data from 50 dynamic load tests conducted on precast concrete piles. A similar approach is proposed in [40], where, in addition to the GA, Particle Swarm Optimization (PSO) is utilized to optimize the ANN connection weights. The GA has also been used to select the most important features in the raw dataset when applying an ANN to predict the bearing capacity of piles [50]. ML techniques other than ANN have also been considered; for instance, Samui [55] used the Support Vector Machine (SVM), Pham et al. [49] investigated Random Forest, and Chen et al. [13] studied neuro-imperialism and neuro-genetic methods.

Based on this literature review, it is clear that the ANN is the current state-of-the-art method owing to its black-box nature and ease of use. However, the prediction accuracy provided by the ANN can be improved, and its prediction robustness with respect to different data modeling methods should be studied more thoroughly. There is also a need to investigate more advanced machine learning algorithms. In this paper, we propose to use Extreme Gradient Boosting (XGBoost) [12], an ensemble tree model, to predict the axial pile bearing capacity from a set of influential variables and a large-scale dataset. XGBoost is the winning algorithm in multiple machine learning competitions and has become a very popular algorithm in science and engineering.

Le et al. [29] recently utilized an XGBoost-based ensemble model for predicting the heating load of buildings for smart city planning and concluded that the proposed ensemble model is the most robust in comparison with other machine learning models, including a standard XGBoost model, SVM, Random Forest (RF), Gaussian Process (GP), and Classification and Regression Trees (CART). Nguyen et al. [42] recently demonstrated that XGBoost is a promising tool to assist civil engineers in forecasting deflections of reinforced concrete members. In addition, the outstanding performance of XGBoost-based models has been further demonstrated convincingly in a variety of practical problems [19, 20, 31, 60, 67, 68].

Motivated by the successes of XGBoost-based ensemble models, this study investigates an XGBoost-based model for predicting the bearing capacity of reinforced concrete piles and compares its performance with that of the deep ANN, a popular machine learning model for regression analysis. For XGBoost, as for other machine learning models, determining an optimal set of hyper-parameters to achieve the best generalization is a crucial task [15, 66]. Nevertheless, most previous studies related to machine learning-based pile bearing capacity estimation relied on manual or simple grid search methods for hyper-parameter setting [1, 3, 17]. Employing the default hyper-parameters, performing model selection on the basis of experience, or selecting hyper-parameters through trial-and-error usually leads to suboptimal performance. Systematic and automated approaches to model selection should be used to construct machine learning-based pile bearing capacity prediction models effectively.

Accordingly, using metaheuristic approaches (i.e., heuristic optimization algorithms), such as PSO, GA, the bat algorithm, history-based adaptive differential evolution, and differential flower pollination, to find optimal parameter sets for machine learning models has become popular [25, 42, 44,45,46, 63]. In this research, we propose to use the Whale Optimization Algorithm (WOA) to search for the best parameter configuration of XGBoost. WOA is modeled after the prey-searching and feeding behaviors of humpback whales [35]. The motivation for selecting WOA is that it is one of the current state-of-the-art metaheuristics and has been successfully used in many engineering applications [22].

In our proposed hybrid model, WOA-XGBoost, XGBoost is the main machine learning-based prediction method, estimating the axial pile bearing capacity from a set of explanatory variables. WOA is configured to search for the set of XGBoost parameters that yields the smallest root mean squared error. The proposed model is tested on a dataset of 472 static load pile tests. The hybrid model demonstrates superior performance over the default XGBoost and, especially, the deep ANN, a popular model in previous works. In summary, the main contributions of the current paper are as follows:

  (i) Machine learning-based models for the estimation of pile bearing capacity are constructed and verified using a large-scale dataset consisting of 472 pile test experiments. It is noted that most previous studies relied only on small-scale datasets [16, 38, 39].

  (ii) Although various models have been proposed [3], few studies have investigated XGBoost, a current state-of-the-art regressor, for estimating pile bearing capacity.

  (iii) Since the problem of model selection (i.e., the determination of a suitable set of hyper-parameters) is crucial [6], a novel hybridization of WOA and an XGBoost-based regression machine is proposed and verified.

The rest of the paper is organized as follows. Section 2 presents the research methodology, including the dataset, the experiment variables, and the main ideas of WOA and XGBoost. The hybrid WOA-XGBoost model is described in Sect. 3. Section 4 presents the experiments and compares and discusses the performances. The paper ends with conclusions in Sect. 5.

2 Research methodology

2.1 The collected dataset of static load tests

To train and validate the proposed machine learning method, this study relies on a dataset of static load tests of driven reinforced concrete piles. The dataset includes 472 test records compiled in the previous work of [50]. It is a fairly large dataset, highly appropriate for constructing and verifying sophisticated machine learning models. It is noted that precast piles with closed tips were driven into the soil layers using a hydraulic pile driving machine and the pile capacities were recorded. Figure 1 shows the experimental set-up used for data measurement. Figure 2 illustrates the pile structure, its geometrical variables, and the soil stratigraphy. Table 1 summarizes the ten conditioning factors employed to predict the dependent variable Y, the axial pile bearing capacity. In addition, this table also reports statistical descriptions of the predictor and dependent variables.

Fig. 1
figure 1

Static load test experiment for measuring pile bearing capacity

Fig. 2
figure 2

Demonstration of the pile structure and soil stratigraphy

Table 1 Statistical descriptions of the employed variables

2.2 Whale optimization algorithm

The Whale Optimization Algorithm was first introduced by Mirjalili and Lewis [35]. It is a swarm-based metaheuristic whose exploration and exploitation phases are modeled after the prey-searching and feeding behaviors of humpback whales. In nature, after identifying the prey (usually a school of small fish or plankton), humpback whales dive down nearly 12 m and then swim upward to the surface in a spiral trajectory while creating bubbles that form a virtual net to herd and corral the prey. This is referred to as the bubble-net feeding method [9, 51].

More specifically, the algorithm considers a pod of \(n\) whales hunting for food. The coordinates of their positions \(\overrightarrow{{X}_{i}}\) are candidate values of the parameters to be optimized. In the exploitation (feeding) phase, the current best position, \(\overrightarrow{{X}^{*}}\), is treated as the target prey. To mimic the shrinking-encircling behavior around the prey, the position of the \(i\)th whale in the \((t+1)\)th iteration is updated by:

$$ \overrightarrow{X_i}(t+1) = \overrightarrow{X^*}(t) - A\,\overrightarrow{D} $$
(1)
$$ \overrightarrow{D} = \left| C\,\overrightarrow{X^*}(t) - \overrightarrow{X_i}(t) \right| $$
(2)

where \(A\) and \(C\) are coefficients determined by:

$$ A = 2a \cdot r - a \quad \text{and} \quad C = 2r $$
(3)

where \(r\) is a random number in \([0,1]\) and \(a\) linearly decreases from 2 to 0.

In addition, the following equation is utilized to imitate the position spiral updating mechanism:

$$ \overrightarrow{X_i}(t+1) = \overrightarrow{D}'\, e^{bl} \cos(2\pi l) + \overrightarrow{X^*}(t) $$
(4)

where \(\overrightarrow{D}'=|\overrightarrow{{X}^{*}}\left(t\right)-\overrightarrow{{X}_{i}}\left(t\right)|\), \(b\) is a predefined parameter, and \(l\) is a random number in \([-1,1]\).

As whales swim in shrinking circles and along a spiral trajectory simultaneously, a random number \(p\in [0,1]\) is used to determine which of the two behaviors a whale exhibits in a given iteration.

In the exploitation phase, the algorithm assumes that the current best position is the location of the prey. However, this is not always true, so whales also search randomly according to the positions of other whales in the pod. This exploration is modeled using the following equations:

$$ \overrightarrow{X_i}(t+1) = \overrightarrow{X}_{\mathrm{rand}}(t) - A\,\overrightarrow{D} $$
(5)
$$ \overrightarrow{D} = \left| C\,\overrightarrow{X}_{\mathrm{rand}}(t) - \overrightarrow{X_i}(t) \right| $$
(6)

where \({\overrightarrow{X}}_{\mathrm{rand}}\) is the position of a random whale in the population other than the \(i\)th.

The decision whether to search locally (exploitation) or globally (exploration) is made by checking whether \(\left|A\right|\le 1\). The pseudo code of the algorithm is presented in Fig. 3. For more information on new variants and applications of WOA, we refer the reader to [22, 32, 35] and the references therein.

Fig. 3
figure 3

The Whale Optimization algorithm
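To make the update rules above concrete, the following Python sketch implements a minimal version of WOA for a generic minimization problem under box bounds. It only illustrates Eqs. (1)–(6); the function name woa_minimize and its arguments are illustrative assumptions, and the experiments in this paper rely on the mealpy library rather than on this sketch.

```python
import numpy as np

def woa_minimize(cost, lb, ub, pop_size=50, max_iter=150, b=1.0, seed=0):
    """Minimal WOA sketch: minimize `cost` over the box bounds [lb, ub]."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = len(lb)
    X = rng.uniform(lb, ub, size=(pop_size, dim))        # initial pod of whales
    fitness = np.array([cost(x) for x in X])
    best = X[np.argmin(fitness)].copy()                   # current best position X*
    best_f = fitness.min()
    for t in range(max_iter):
        a = 2.0 * (1.0 - t / max_iter)                    # 'a' decreases linearly from 2 to 0
        for i in range(pop_size):
            r = rng.random(dim)
            A = 2.0 * a * r - a                           # Eq. (3)
            C = 2.0 * rng.random(dim)
            if rng.random() < 0.5:
                if np.all(np.abs(A) < 1.0):               # exploitation: shrinking encircling, Eqs. (1)-(2)
                    D = np.abs(C * best - X[i])
                    X[i] = best - A * D
                else:                                     # exploration around a random whale, Eqs. (5)-(6)
                    X_rand = X[rng.integers(pop_size)]
                    D = np.abs(C * X_rand - X[i])
                    X[i] = X_rand - A * D
            else:                                         # spiral position update, Eq. (4)
                l = rng.uniform(-1.0, 1.0, dim)
                D_prime = np.abs(best - X[i])
                X[i] = D_prime * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best
            X[i] = np.clip(X[i], lb, ub)                  # keep each whale inside the search bounds
        fitness = np.array([cost(x) for x in X])
        if fitness.min() < best_f:                        # update the prey (best position) if improved
            best_f = fitness.min()
            best = X[np.argmin(fitness)].copy()
    return best, best_f
```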

2.3 Extreme gradient boosting machine (XGBoost)

XGBoost is an open-source library that provides machine learning algorithms, for both regression and classification, within the gradient boosting framework [21, 33]. It originated from an academic research project but has become a widely used library in both academia and industry [65]. The library is highly efficient, flexible, and portable. It supports multiple languages, including C++, Python, R, etc. The library also supports distributed training on clusters on cloud computing platforms such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

Under the hood, the XGBoost algorithm builds a series of weak learners, which are classification and regression trees (CART) [54, 64]. These weak learners are then combined to form the final prediction model. Like other boosting methods, XGBoost does not build all the regression trees at the same time but step by step. The tree added at the current step is constructed so as to minimize the overall loss of the ensemble built so far on the training set.

More specifically, let the training data be \(D={\left\{{x}_{i},{y}_{i}\right\}}_{i=1}^{n}\), where \({x}_{i}\in {R}^{m}\) is an input vector with m features, and \({y}_{i}\in R\) is the corresponding output. Assume \({\widehat{y}}_{i}^{(t-1)}\) is the prediction output at step \(t-1\). Then, at step \(t\), the XGBoost builds the tree that minimizes the following objective function:

$$ L^{t} = \sum_{i=1}^{n} l\left(y_i,\hat{y}_i^{(t)}\right) + \Omega(f_t) = \sum_{i=1}^{n} l\left(y_i,\hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) $$
(7)

where \(l\) is any convex and differentiable loss function measuring the difference between the prediction and the provided output; \({f}_{t}: {R}^{m}\to R\), \({f}_{t}\left(x\right) ={w}_{q(x)}\) is the prediction function of the tree, where \(w\in {R}^{T}\) is the vector of scores on the leaves, \(q:{R}^{m}\to \left\{1,2,\dots ,T\right\}\) is a function assigning each data point to the corresponding leaf, and \(T\) is the number of leaves. The last term, \(\Omega \), is the regularization term. Its purpose is to reduce overfitting, a common issue in machine learning. This term penalizes complex trees with many leaves and gives priority to simpler, more predictive trees; more specifically,

$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 $$
(8)

where \(\gamma \) and \(\lambda \) are parameters.

Approximating the right-hand side of Eq. (7) using the second-order Taylor expansion of \(l\) with respect to its second argument, we have:

$$ L^{t} \approx \sum_{i=1}^{n}\left[ l\left(y_i,\hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) $$
(9)

where \({g}_{i}={\partial }_{{\widehat{y}}_{i}^{\left(t-1\right)}}l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)\) and \({{h}_{i}=\partial }_{{\widehat{y}}_{i}^{\left(t-1\right)}}^{2} l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)\).
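For instance, under the common squared-error loss \(l\left(y,\widehat{y}\right)=\frac{1}{2}{\left(y-\widehat{y}\right)}^{2}\) (an illustrative choice used here only to make the derivatives concrete), these quantities reduce to:

$$ g_i = \hat{y}_i^{(t-1)} - y_i, \qquad h_i = 1 $$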

Removing the term \(\sum_{i=1}^{n}l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)\), which does not depend on the choice of the decision tree in the current step, from Eq. (9) and collecting the terms associated with the same score (the data points on the same leaf get the same score), we are left with a simpler objective function to be minimized:

$$ \tilde{L}^{t} = \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) = \sum_{j=1}^{T}\left[ \left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i\in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T $$
(10)

where \({I}_{j}\) is the subset of the input set associated with leaf \(j\), i.e., \({I}_{j}=\left\{i:q\left({x}_{i}\right)=j\right\}\).

It can be noted that the first term of Eq. (10) is a sum of independent quadratic functions in \({w}_{j}\). Therefore, the optimal \({w}_{j}^{*}\) and the minimal objective \({\widetilde{L}}^{*t}\) are given by:

$$ w_j^{*} = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}, \qquad \tilde{L}^{*t}(q) = -\frac{1}{2}\sum_{j=1}^{T} \frac{\left(\sum_{i\in I_j} g_i\right)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma T $$
(11)

This equation can be used to measure how good a tree is as a candidate for the current step. However, these optimal values can only be calculated once the structure of the tree in the current step has been determined. As it is not feasible to consider all possible tree structures, XGBoost builds trees iteratively.

At the beginning, XGBoost sorts the input data set according to feature values and starts from a tree of zero depth. Then, in each step, a new tree is created by an optimal branch splitting. According to Eq. (11), this splitting, which maximizes the loss reduction, is determined by:

$$ \tilde{L}_{\mathrm{split}} = \frac{1}{2}\left[ \frac{\left(\sum_{i\in I_L} g_i\right)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{\left(\sum_{i\in I_R} g_i\right)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{\left(\sum_{i\in I} g_i\right)^2}{\sum_{i\in I} h_i + \lambda} \right] - \gamma $$
(12)

where \({I}_{L}\) and \({I}_{R}\) are the subsets of input indices on the left and right of the split, respectively, and \(I={I}_{L}\cup {I}_{R}\).

The XGBoost algorithm is summarized in Fig. 4. For more detailed information, we refer the reader to [41, 42, 64]. In addition to the regularization term \(\Omega\) in Eq. (7), XGBoost lets users specify two parameters, \(\mathrm{max\_depth}\) and the learning rate \(\eta\), to combat overfitting. The parameter \(\mathrm{max\_depth}\), as suggested by its name, limits the maximal depth of the trees built by XGBoost. The possible range of this parameter is \([0,\infty)\) and the default value is \(\mathrm{max\_depth}=6\). The learning rate, also called shrinkage, scales the prediction of the newly built tree by a factor \(0<\eta <1\) to reduce the influence of each individual tree and give trees in later steps a chance to improve the model. More specifically:

$$ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i) $$
(13)
Fig. 4
figure 4

The XGBoost algorithm

The default value of \(\eta \) is 0.3.
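As an illustration of how these two hyper-parameters are exposed in practice, the following snippet configures an XGBoost regressor through its scikit-learn-compatible interface with the default values discussed above. It is a hedged sketch, not the exact configuration script used in this study; the regularization parameters are included only to connect with Eq. (8).

```python
import xgboost as xgb

# Illustrative configuration of an XGBoost regressor with the hyper-parameters
# discussed in the text set to their default values.
model = xgb.XGBRegressor(
    objective="reg:squarederror",  # squared-error regression objective
    n_estimators=100,              # number of boosting rounds (trees)
    max_depth=6,                   # limits the depth of each tree
    learning_rate=0.3,             # shrinkage factor eta in Eq. (13)
    reg_lambda=1.0,                # L2 leaf-score penalty lambda in Eq. (8)
    gamma=0.0,                     # per-leaf penalty gamma in Eq. (8)
)
# Typical usage: model.fit(X_train, y_train); y_pred = model.predict(X_test)
```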

3 The proposed WOA-XGBoost model for intelligent estimation of pile bearing capacity

Our study combines XGBoost and WOA into a hybrid model, WOA-XGBoost, to predict the bearing capacity of concrete piles. In this hybrid model, XGBoost serves as the main prediction machinery, establishing a function that derives the axial pile bearing capacity from a set of explanatory variables: pile diameter (X1), thickness of the first soil layer (X2), thickness of the second soil layer (X3), thickness of the third soil layer (X4), elevation of the natural ground (X5), top-of-pile elevation (X6), elevation of the extra segment of the pile top (X7), depth of the pile tip (X8), mean SPT blow count along the pile shaft (X9), and mean SPT blow count at the pile tip (X10). However, even though the XGBoost algorithm is quite robust, its performance does depend on the choice of its parameters. Therefore, WOA is utilized to determine an optimal set of XGBoost parameters. Once these optimal parameters have been determined, the corresponding XGBoost model is trained and used to provide the final prediction.

The complete workflow of WOA-XGBoost is illustrated in Fig. 5. It consists of three phases: preprocessing, parameter optimization, and final prediction. In the preprocessing phase, the input data are normalized to have zero mean and roughly similar magnitudes using the following Z-score formula:

$$ X_{{\text{N}}} = \frac{{X_{{\text{O}}} - m_{{\text{X}}} }}{{s_{{\text{X}}} }} $$
(14)

where XN and XO are the normalized and original feature variables, and mX and sX are the mean and standard deviation of the considered feature over the whole input data. Finally, the whole data set is split into a training set and a testing set. The training set is used for training the model and optimizing the parameters. The testing set is used only to benchmark the performance of the trained models.
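A minimal sketch of this preprocessing phase is given below, assuming the ten predictors are stored in a NumPy array X (472 x 10) and the measured capacities in a vector y; the variable names, the random_state, and the 90/10 split ratio (the ratio later used in the experiments) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Z-score normalization of each feature over the whole input data, Eq. (14)
m_X, s_X = X.mean(axis=0), X.std(axis=0)   # per-feature mean and standard deviation
X_n = (X - m_X) / s_X

# Random split of the normalized data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_n, y, test_size=0.1, random_state=42)
```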

Fig. 5
figure 5

Operation procedure of the WOA-XGBoost model

In the parameter optimization phase, for WOA to be able to find optimal XGBoost parameters, we need to provide it with a way to evaluate how good a given set of parameters is. This is done by formulating an objective/cost function based on the root mean square error (RMSE) and k-fold cross-validation. The training set is further split into \(k\) folds (subsets of roughly equal size). Given a set of parameters, the associated XGBoost model is trained and validated \(k\) times using different training and validation sets. For each fold, the model is trained on the data from the other \(k-1\) folds and validated against the data in the current fold. The cost function is defined as the average of the validation RMSE; more specifically:

$$ \mathrm{CF} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{RMSE}_i $$
(15)

where \({\text{RMSE}}_{i}\) is the validation RMSE associated with the \(i\)th fold (when the training set is the union of the other folds):

$$ \mathrm{RMSE}_i = \sqrt{\frac{\sum_{j\in S_i}\left(Y_{A,j} - Y_{P,j}\right)^2}{\left|S_i\right|}} $$
(16)

where \({S}_{i}\) is the index set of the \(i\)th fold, \(\left|\cdot\right|\) denotes the cardinality of a set, \({Y}_{A,j}\) is the actual output for the \(j\)th data sample, and \({Y}_{P,j}\) is the corresponding predicted output.

The WOA search stops when the cost function does not improve (decrease) after a certain number of iterations or when the prescribed maximal number of iterations has been reached. Once an optimal set of parameters has been determined, the associated XGBoost model is used in the final prediction phase, where the predicted values of the pile bearing capacity are obtained and documented.
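The following Python sketch shows how such a cross-validated cost function can be implemented with scikit-learn and XGBoost. The function name cv_cost and the encoding of a candidate as (learning rate, maximal depth) are assumptions made for illustration; the actual implementation used in the study may differ in detail.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cv_cost(params, X_train, y_train, k=5):
    """Cost function CF of Eq. (15): average validation RMSE over k folds
    for a candidate (learning_rate, max_depth) pair proposed by WOA."""
    eta, max_depth = params[0], int(round(params[1]))   # max_depth is treated as an integer
    rmses = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_train):
        model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100,
                                 learning_rate=eta, max_depth=max_depth)
        model.fit(X_train[train_idx], y_train[train_idx])
        pred = model.predict(X_train[val_idx])
        rmses.append(np.sqrt(mean_squared_error(y_train[val_idx], pred)))  # Eq. (16)
    return float(np.mean(rmses))                         # Eq. (15)
```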

4 Experimental results and discussion

4.1 Experiment setup

In the experiments, the integrated WOA-XGBoost model was developed in Python using the following packages: (i) the XGBoost Python package version 0.90, the official implementation of XGBoost in Python; (ii) mealpy version 1.1.0, a Python module of cutting-edge nature-inspired metaheuristic algorithms, including WOA [62]; and (iii) scikit-learn version 0.23.2, a machine learning library for Python. All experiments with the WOA-XGBoost model were performed on a laptop computer equipped with an Intel® Core™ i5-3437U CPU @ 1.90 GHz × 4, 8 GB of DDR3 RAM at 1600 MHz, and the Ubuntu 20.04.4 LTS operating system. For WOA, we set the whale population (\(\mathrm{pop\_size}\)) to 50 and the maximal number of iterations (epoch) to 150. It is configured to optimize two XGBoost parameters, namely the learning rate (\(\eta\)) and the maximal tree depth (\(\mathrm{max\_depth}\)). The search ranges of \(\eta\) and \(\mathrm{max\_depth}\) are \(\left[0.05, 0.3\right]\) and \([3, 10]\), respectively.

For XGBoost, the objective is set to reg:squarederror (regression with squared error) and the number of boosting rounds (number of iterations) is set to 100. The data, consisting of 472 samples, are randomly split into two sets: a training set of 424 samples (90%) and a testing set of 48 samples (10%). During the parameter optimization phase, the training set is further partitioned into 5 folds of roughly equal size to evaluate the cost function according to Eqs. (15) and (16).
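For illustration, an end-to-end tuning run can be assembled from the earlier sketches (woa_minimize, cv_cost, and the preprocessed X_train and y_train); the bounds follow the search ranges stated above. The actual experiments used the mealpy implementation of WOA rather than the sketch, so this snippet is only a hedged outline of the same procedure.

```python
import numpy as np

lb = np.array([0.05, 3.0])    # lower bounds: learning rate eta, max_depth
ub = np.array([0.30, 10.0])   # upper bounds of the search ranges
best_params, best_cf = woa_minimize(
    lambda p: cv_cost(p, X_train, y_train, k=5),  # CF of Eq. (15) as the objective
    lb, ub, pop_size=50, max_iter=150)
eta_opt, depth_opt = best_params[0], int(round(best_params[1]))
```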

In Fig. 6, the convergence history of the cost function in the parameter optimization phase of a typical run of WOA-XGBoost is illustrated. It can be seen that, in this particular run, the cost function decreases very quickly at the beginning and stalls after roughly 50 iterations. Early stopping is currently not supported in the employed implementation of the WOA algorithm, so the code continues to run until the maximum of 150 iterations is reached. Keeping the maximal number of WOA iterations at 150 is nevertheless necessary to maintain the robustness of the optimization process, because the number of iterations required for WOA convergence may vary between runs. Table 2 lists the tuned learning rate and max_depth parameters of XGBoost obtained by WOA in this particular run. It should be noted that, since the tree depth is an integer parameter, the effective value of the \(\mathrm{max\_depth}\) found is actually 3. In addition, each time the cost function is computed, the XGBoost model is trained and tested five times to perform the fivefold cross-validation. On average, each training and testing run lasts 0.082511 and 0.002290 s, respectively. In total, the WOA-XGBoost model requires nearly 1 h and 15 min (4454 s) to find a set of optimal XGBoost parameters.

Fig. 6
figure 6

Convergence curve of the cost function in the parameter optimization phase

Table 2 An optimal set of tuned parameters

In order to assess the predictive capability of the different models accurately, the following performance metrics are considered: RMSE, the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the coefficient of determination (R2). RMSE was introduced in the previous section. MAPE, MAE, and R2 are given by:

$$ {\text{MAPE}} = \frac{100\% }{N}\mathop \sum \limits_{i = 1}^{N} \frac{{\left| {Y_{A,i} - Y_{P,i} } \right|}}{{Y_{A,i} }} $$
(17)
$$ {\text{MAE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left| {Y_{A,i} - Y_{P,i} } \right| $$
(18)
$$ R^{2} = 1 - \mathop \sum \limits_{i = 1}^{N} \left( {Y_{A,i} - Y_{P,i} } \right)^{2} /\mathop \sum \limits_{i = 1}^{N} \left( {Y_{A,i} - \overline{Y}} \right)^{2} $$
(19)

where YA,i and YP,i are the actual and predicted bearing capacities, respectively; N is the number of data instances; and \(\overline{Y}\) is the average of the actual values. It should be noted that smaller RMSE, MAE, and MAPE values are better, while a higher R2 is better.
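For reference, all four metrics can be computed with a few lines of NumPy; the helper name evaluate below is an assumption used only for illustration.

```python
import numpy as np

def evaluate(y_actual, y_pred):
    """Performance metrics of Eqs. (17)-(19) plus RMSE; smaller RMSE/MAE/MAPE
    and larger R2 indicate better predictions."""
    err = y_actual - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))                                   # Eq. (18)
    mape = 100.0 * np.mean(np.abs(err) / y_actual)               # Eq. (17)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)  # Eq. (19)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```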

4.2 Comparison between the tuned and the default model

Table 3 compares the performance of XGBoost with the default parameters recommended by the XGBoost toolbox and WOA-XGBoost with parameters optimized by the metaheuristic algorithm. In the training phase, the default XGBoost is better than WOA-XGBoost in all metrics: its RMSE and MAE are almost 37% and 56% smaller, respectively, than those of WOA-XGBoost. However, in the testing phase the opposite holds. WOA-XGBoost outperforms XGBoost in all metrics; its RMSE and MAE are 13.4% and 12.4% smaller, respectively, than those of XGBoost. In the testing phase, the MAPE and R2 of WOA-XGBoost are also noticeably better than those of XGBoost. This result indicates that tuning the parameters with WOA helps WOA-XGBoost reduce overfitting and provide more accurate predictions.

Table 3 Performance comparison of XGBoost and WOA-XGBoost

All of the above results are specific to the considered partition of the data and could be biased toward it. Therefore, we further compare the performance of XGBoost (with default parameters) and WOA-XGBoost (with the optimal parameters found by WOA and listed in Table 2) on 20 other random partitions of the data with the same splitting ratio (9:1). In Fig. 7, the distributions of the residuals of both models are presented. It can be seen that the distribution is more symmetric and closer to a normal distribution for the tuned model. This is the first indication that WOA-XGBoost is more robust than the default XGBoost. In Fig. 8, the prediction capability of the models is compared. In the training phase, the default XGBoost model fits the data better (the predicted output is closer to the line of best fit). However, in the testing phase, the tuned model (WOA-XGBoost) provides more accurate predictions. Thus, it is apparent that the tuned hyper-parameters have made XGBoost less prone to overfitting.

Fig. 7
figure 7

Residual comparison

Fig. 8
figure 8

Model prediction comparison

The comparison of testing-phase performance is presented in Fig. 9. In terms of RMSE, the error incorporated in the objective/cost function, the tuned model is clearly the winner: in the 20 experiments, there are only two cases where the RMSE of the tuned model is larger than that of the default model. In terms of MAE and MAPE, the tuned model is no longer the clear winner, but it is still better in more than half of the cases. In terms of R2, an important benchmark of how well a model predicts the observed data, the tuned model again beats the default model. There is only one instance where the R2 of the tuned model is smaller (worse) than that of the default model; however, even in that case, the R2 of the tuned model is 0.9, which still indicates good prediction capability.

Fig. 9
figure 9

Performance comparison in the testing phase

In order to reach a more reliable conclusion, we perform Wilcoxon signed-rank tests on the claim that WOA-XGBoost is better than the default XGBoost in a specific metric. Here, "better" means a smaller RMSE, MAE, or MAPE, and a greater R2. The hypotheses are as follows:

  • H0: The (pseudo) median of the benchmark of WOA-XGBoost in the 20 experiments is the same as that of default XGBoost.

  • H1: The (pseudo) median of the benchmark of WOA-XGBoost in the 20 experiments is better than that of default XGBoost (claim).

The results of the tests are summarized in Table 4. At \(\alpha =0.05\), there is not enough evidence to support the claim for MAPE, but there is strong evidence to support the claim for RMSE, MAE, and R2. The decision is especially conclusive for RMSE and R2 because the p-values are very small. In summary, the tuned model is more accurate and reliable than the default model because it is much better in the objective benchmark, RMSE, and in the important benchmarks MAE and R2.
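Such a one-sided paired test can be carried out, for example, with SciPy; the arrays rmse_tuned and rmse_default below are hypothetical containers for the per-partition testing RMSE of the two models over the 20 partitions.

```python
from scipy.stats import wilcoxon

# One-sided Wilcoxon signed-rank test: claim that the tuned model has a smaller RMSE.
stat, p_value = wilcoxon(rmse_tuned, rmse_default, alternative="less")
reject_h0 = p_value < 0.05   # reject H0 at the 5% significance level if True
# For R2, the alternative would be "greater", since a larger value is better.
```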

Table 4 Wilcoxon signed-rank test the claim that WOA-XGBoost is better than XGBoost in different benchmarks

Thus, it can be observed that, even though the XGBoost algorithm is robust, its performance depends significantly on the selection of its hyper-parameters. Hence, a metaheuristic is needed to optimize the process of finding these hyper-parameters. The WOA has proven to be highly appropriate for assisting the learning phase of XGBoost: it shows good convergence properties and helps locate a good set of hyper-parameters for the XGBoost algorithm. Once an optimal set of parameters has been determined, the corresponding XGBoost model is trained and used to provide the final prediction.

4.3 Comparison with the benchmark approaches

In this section, to confirm the predictive capability of the newly developed WOA-XGBoost hybrid model for pile bearing capacity prediction, its performance is compared to that of capable machine learning models based on the Deep Neural Network (DNN) for regression. The DNNs were trained with the state-of-the-art Adam optimizer [27] and implemented via the scikit-learn Python library [48]. DNNs with 2, 3, 4, and 5 hidden layers were constructed. The DNN is selected as the benchmark model in this section because neural networks have been extensively and successfully employed in data-driven pile capacity estimation [26, 37, 39, 47]. In the DNNs, the ReLU (Rectified Linear Unit) activation function is employed, the number of training epochs is set to 1000, and the number of neurons in the hidden layers is selected via fivefold cross-validation.

Tables 5 and 6 provide the means and standard deviations of all metrics for all considered models in both the training and testing phases. The DNN variants with different numbers of hidden layers (2, 3, 4, and 5) appear to have similar performances, especially in the testing phase. Among them, the variant with 4 layers is the best. However, even this variant lags behind WOA-XGBoost in both the training and testing phases. The reductions in the average testing RMSE of WOA-XGBoost compared to the DNNs with 2, 3, 4, and 5 layers are roughly 12%, 11.7%, 9%, and 12%, respectively. The box plots in Fig. 10 further confirm this for the testing phase: WOA-XGBoost not only has a better mean but also a better median and interquartile range in all metrics. The performance of WOA-XGBoost in all metrics is also very robust. There is no outlier in the negative direction for WOA-XGBoost, whereas there are relatively many for the DNN models, especially in the R2 index.

Table 5 Performance statistic in the training phase
Table 6 Performance statistic in the testing phase
Fig. 10
figure 10

Performance comparison of different models in the testing phase

Similar to Sect. 4.2, we perform Wilcoxon signed-rank tests to compare WOA-XGBoost with each and every variant of DNN in all considered benchmarks. The general hypotheses can be written as:

  • H0: The (pseudo) median of the benchmark of WOA-XGBoost in the 20 experiments is the same as that of DNN with n layers

  • H1: The (pseudo) median of the benchmark of WOA-XGBoost in the 20 experiments is better than that of DNN with n layers (claim)

where n = 2, 3, 4 and 5.

The test results are summarized in Tables 7, 8, 9 and 10. It can be seen that there is sufficient evidence to support the claim in all comparisons and all benchmarks, except for one inconclusive case, in which the R2 of WOA-XGBoost is compared to that of the DNN with 4 layers. Given all of this evidence, it can be concluded that WOA-XGBoost is the best model in these experiments.

Table 7 Wilcoxon signed-rank pairwise comparison in RMSE
Table 8 Wilcoxon signed-rank pairwise comparison in MAE
Table 9 Wilcoxon signed-rank pairwise comparison in MAPE
Table 10 Wilcoxon signed-rank pairwise comparison in R2

5 Concluding remarks

In this paper, we have formulated and tested the hybrid WOA-XGBoost model for predicting the bearing capacity of concrete piles. XGBoost is the crucial part of the model, providing the prediction from a set of ten feature variables: the thicknesses of the first, second, and third soil layers; the pile diameter; the elevations of the natural ground, of the pile top, and of the extra segment of the pile top; the depth of the pile tip; the mean SPT blow count along the pile shaft; and the mean SPT blow count at the pile tip.

Although the XGBoost algorithm is a state-of-the-art machine learning method and is highly effective for complex function approximation, this study has shown that its performance can still be improved with the use of advanced metaheuristic algorithms. Accordingly, WOA is used to find an optimal set of values for the learning rate and the maximal depth, two important hyper-parameters of XGBoost. The hybrid model is set up so that the selected set of parameters minimizes the average RMSE in a fivefold cross-validation. The model is trained, validated, and compared on subsets of a dataset consisting of 472 samples.

The experimental results, supported by statistical hypothesis tests, confirm that the hybridization of XGBoost and WOA significantly outperforms the individual XGBoost model as well as the DNN-based regression models. Therefore, it is highly recommended for use in pile bearing capacity prediction. Incorporating advanced feature selection and utilizing other state-of-the-art metaheuristics are two lines of research that can be pursued to advance the current study. Since the proposed method combines machine learning and a metaheuristic, the model construction phase is completely data-dependent. Hence, the hybrid WOA-XGBoost model can be trained and deployed autonomously without domain knowledge in machine learning, and the proposed method therefore has high potential for modeling other sophisticated problems in the field of civil engineering.