Introduction

With the desperate shortage of land and the growing tendency toward high-rise structures in recent years, pile foundations are attracting considerable attention owing to their ability to support high loads in weaker soils. Soft soils underneath foundations cause high total settlements, differential settlements, and bearing-capacity problems (Charlie et al. 2009; Gabrielaitis et al. 2013; Wang et al. 2016). The dynamic pile load test is increasingly used to evaluate the load capacity of piles since it is cheaper and easier to conduct than static load tests, allowing many tests to be performed (Rajagopal et al. 2012). More importantly, the results obtained by dynamic testing are quite similar to those of static tests (Nayak et al. 2000; Rausche et al. 2004; Bradshaw and Baxter 2006; Long 2007; Basarkar 2011; Sakr 2013; Liu et al. 2020). Therefore, contractors are encouraged to choose dynamic pile load tests as an alternative for pile testing if the code allows.

Machine learning (ML) is an area of research that allows computers to learn from observed data without being specifically programmed (Asteris et al. 2021a; Bardhan et al. 2021; Kardani et al. 2021a; Kumar et al. 2022a). Moreover, geophysical design parameters are not always collected directly from field or laboratory tests but are often approximated by fitting regressions to datasets. The artificial neural network (ANN), one of the most commonly used ML methods, has been used to estimate the bearing capacity of piles (Asteris et al. 2021b; Benali et al. 2017; Che et al. 2003; Goh and Goh 2007; Goh 2000; Moayedi 2018; Lee and Lee 1996; Jiang et al. 2016; Jiang and Zhang 2018; Kiefa 1998; Low et al. 2001; Moayedi and Hayati 2019; Pal and Deswal 2008; Pradeep et al. 2021; Shahin et al. 2009). Kiefa (1998) developed a general regression neural network (GRNN)-based model to predict the bearing capacity of piles in cohesion-less soils. Che et al. (2003) used data collected from dynamic wave tests to develop a back-propagation neural network-based model to predict the bearing capacity of piles, building a feed-forward neural network with one hidden layer of 10 neurons. ANN models have become increasingly popular and have been successfully used in various fields of geotechnical engineering (Shahin et al. 2001). Recently, Alzo’ubi and Ibrahim (2019) used a back-propagation neural network and a generalized regression neural network to accurately predict pile static load test curves.

Among the various ANN training schemes, the back-propagation (BP) algorithm is the most popular. However, in the BP algorithm, the trial-and-error approach needed to ascertain the optimal number of hidden neurons makes it very time-consuming. To improve the simulation performance of ANN, integrating ANN with metaheuristic optimization techniques has become preferred (Kumar et al. 2022b). The optimization techniques are used to optimize parameters such as the weights and biases of the neural network to improve its performance (Kardani et al. 2021b). Benali et al. (2017) presented ANN and principal component analysis (PCA)-based ANN models to predict the axial load capacity of piles, and concluded that the results obtained by the PCA-based models were in good agreement with those of standard penetration test (SPT)-based analysis. Nguyen et al. (2020) applied hybrid ANN models to predict the deflection of columns exposed to seismic conditions; the particle swarm optimization (PSO)-based model gave satisfactory results and outperformed the traditional ANN model. Murlidhar et al. (2020) applied hybrid ANN models, namely genetic algorithm (GA)-based ANN and PSO-based ANN (ANN-PSO), to predict pile bearing capacity. Chen et al. (2020) compared the performances of genetic programming (GP) and ANN in predicting the load capacity of piles using 50 datasets of concrete piles collected from the literature, and found that the GP model outperformed ANN, GA-ANN (hybrid model of ANN and GA), and ICA-ANN (hybrid model of ANN and imperialist competitive algorithm). Liu (2020b) compared the performance of ANN, the adaptive neuro-fuzzy inference system (ANFIS), and GA-ANN in the reliability analysis of the vertical settlement of pile raft foundations; GA-ANN was shown to outperform the ANN and ANFIS models.

Over the past three decades, researchers and academics have shown a growing interest in meta-heuristic optimization, leading to the regular proposal of novel meta-heuristics for solving complicated, real-world problems in many fields. Single-solution-based algorithms and population-based algorithms are the two primary categories of meta-heuristics. Single-solution-based meta-heuristics, also called trajectory algorithms, generate a single solution at each iteration and improve it through a neighbourhood mechanism. Population-based meta-heuristics, in contrast, produce a set of multiple solutions (a population) at each iteration. Population-based meta-heuristics can be broken down into four distinct types: those based on evolution, swarm intelligence, human behaviour, and the physical sciences. Based on the principles of natural evolution, evolutionary algorithms (EAs) use the three operators of selection, recombination, and mutation to achieve their goals. Swarm intelligence (SI), the second category, draws its inspiration from the study of collective behaviour in the natural world, such as that of insects, birds, mammals, reptiles, and fish. The third category, which includes techniques such as teaching-learning-based algorithms, is inspired less by the wonders of nature than by the actions of humans.

This study presents a comparative analysis of five hybrid ANN models, namely ANN-PSO (hybrid model of ANN and particle swarm optimization), ANN-GOA (hybrid model of ANN and grasshopper optimization algorithm), ANN-ABC (hybrid model of ANN and artificial bee colony), ANN-ACO (hybrid model of ANN and ant colony optimization), and ANN-ALO (hybrid model of ANN and ant lion optimizer), and three traditional soft computing models, namely multivariate adaptive regression splines (MARS), GP, and the group method of data handling (GMDH), for estimating the bearing capacity of piles. These methods have not previously been explored in foundation engineering but have been found robust in the literature (Alizadeh et al. 2019; Moayedi et al. 2020; Seifi et al. 2020). PSO is a widely used optimization technique of the swarm intelligence family imitating bird flocking behaviour (Armaghani et al. 2020b; Kashani et al. 2020; Ray et al. 2021). GOA is based on the swarming behaviour of grasshoppers. The ABC algorithm mimics the social cooperation of honey bees (Bui et al. 2020; Huang et al. 2020; Wang et al. 2020). ACO is based on the foraging behaviour of ants and has been found very reliable in the literature (Moayedi et al. 2019b; Xu et al. 2019; Zhang et al. 2020a). ALO mimics the way an antlion hunts its prey. Moayedi et al. (2019a) demonstrated the robust prediction of ALO-ANN and its superiority over conventional models. MARS, GP, and GMDH are popular models that have been used successfully in various geotechnical problems (Ardakani and Kordnaeij 2019; Hassanlourad et al. 2017; Kardani et al. 2021a; Mola-Abasi and Eslami 2019; Samui et al. 2019; Yin et al. 2020; Zhang et al. 2020b; Zhang and Goh 2013). In this study, based on the results of dynamic tests on piles, the five hybrid ANN models and three traditional models are thoroughly investigated for the prediction of the bearing capacity of pile foundations.

Methodology and theoretical background

High strain dynamic testing of piles

Dynamic testing of piles (the PDA test) is an innovative method to determine the load capacity of piles (Fellenius 1999; Rausche et al. 1985, 2004; Smith 2002). One-dimensional wave propagation theory can be extended to piles in the PDA test, since the hammer strike generates waves that propagate down the pile. The pile, with its uniform cross-section, is treated as a slender element enclosed by materials of far lesser stiffness (Salgado 2008). A pair of accelerometers and strain transducers deployed near the pile top monitors the pile response; the recorded data are transmitted through a cable to the PDA, where they are converted and recorded as force and velocity. In the next stage, the bearing capacity of the pile is estimated using the CAPWAP program. To assess the soil resistance and its distribution along the pile, CAPWAP combines the measured force and velocity with wave equation analysis. It uses an iterative curve-fitting technique that matches the response of the model pile, subjected to wave analysis, to that of the pile under investigation for a single hammer strike (FHWA 2006). Susilo (2006) suggested guidelines for monitoring criteria such as the impact factor and hammer weight: the minimum hammer weight should be 1% of the required ultimate load capacity, increased to 2% of the required load capacity for piles anticipated to have high end-bearing capacity.

Details of models and meta-heuristic optimization algorithms

Artificial neural network (ANN)

ANN is a popular approximation tool for simulating and predicting outputs, developed by emulating the neural system of the human body. It comprises three layers connected via weights and biases: an input layer, a hidden layer, and an output layer, as shown in Fig. 1 (Moayedi and Rezaei 2019). Back-propagation (BP) is the most popular learning scheme applied in feedforward ANN models; it uses the gradient descent optimization technique. Powered by dense neuronal interconnections, an ANN can handle complex and non-linear correlations between input and output variables. The number of neurons in the hidden layer can be adjusted by the user to obtain the best performance. Through such a structure, ANNs have been employed as effective soft computing techniques for purposes such as function approximation and pattern recognition in many engineering disciplines. A minimal numerical sketch of the forward pass is given after the figure.

Fig. 1
figure 1

A basic structure of an ANN
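To make the structure in Fig. 1 concrete, the following minimal Python/NumPy sketch shows the forward pass of a single-hidden-layer network. The sigmoid activation, the 5-6-1 layer sizes (chosen to match the five inputs and single output used later in this study), and the random weights are illustrative assumptions, not the exact configuration of the trained models.

```python
import numpy as np

def ann_forward(x, W1, b1, W2, b2):
    """Forward pass of a single-hidden-layer feedforward ANN.
    x: (n_in,), W1: (n_hid, n_in), b1: (n_hid,), W2: (1, n_hid), b2: (1,)."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # sigmoid hidden layer (assumed)
    return W2 @ h + b2                        # linear output layer

# 5 inputs (w, H, A, L, S), 6 hidden neurons, 1 output (Qu): illustrative only
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 5)), rng.normal(size=6)
W2, b2 = rng.normal(size=(1, 6)), rng.normal(size=1)
print(ann_forward(rng.random(5), W1, b1, W2, b2))
```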

Particle swarm optimization (PSO)

PSO is a widely used optimization technique that belongs to the swarm intelligence family, proposed by Kennedy and Eberhart (1995). The principal source of inspiration for the PSO algorithm is the flocking and schooling patterns of birds and fish, and its central goal is to find the global best solution in a multidimensional space. PSO searches for the optimal solution through particles whose trajectories are adjusted by a stochastic and a deterministic component: each particle is influenced by its own best achieved position and the group's best position, but also tends to move randomly. In PSO, the population P is represented by:

$$P=\left({p}_{1},{p}_{2},{p}_{3} \ldots {p}_{n}\right)$$
(1)

The velocities of the individual particles are denoted by:

$$u=\left({u}_{1},{u}_{2},{u}_{3}...{u}_{n}\right)$$
(2)

The previously visited best locations (lbest) are denoted by:

$$l=\left({l}_{1},{l}_{2},{l}_{3} \ldots {l}_{n}\right)$$
(3)

The swarm is updated as follows (for i = 1, 2, …, n, with k being the current iteration):

$${u}_{i}^{k+1}={w}^{k}{u}_{i}^{k}+{d}_{1}{k}_{1}^{k}\left({l}_{i}^{k}-{p}_{i}^{k}\right)+{d}_{2}{k}_{2}^{k}\left({l}_{g}^{k}-{p}_{i}^{k}\right)$$
(4)
$${p}_{i}^{k+1}={p}_{i}^{k}+{u}_{i}^{k+1}$$
(5)
(5)

where n is the total dimension, \(l_{g}\) denotes the best particle, and the superscript k is the iteration counter. w is the inertia weight, and d1 and d2 are two learning factors called the cognitive and social parameters, respectively (position constants); the best model performance requires proper tuning of these two constants. k1 and k2 are uniformly distributed random numbers in the range 0–1. Unlike evolutionary algorithms, PSO uses neither the Darwinian principle of 'survival of the fittest' nor genetic operators. Instead, PSO adopts as its working principle the sociometric exchange of information between the experience of individual swarm members and the best performer (Gaitonde and Karnik 2012).
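A minimal sketch of the update rules in Eqs. (4) and (5), tested on the sphere function, follows; all hyper-parameter values (swarm size, inertia weight, learning factors) are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def pso(cost, dim, n_particles=30, iters=100, w=0.7, d1=1.5, d2=1.5,
        lb=-1.0, ub=1.0, seed=0):
    rng = np.random.default_rng(seed)
    p = rng.uniform(lb, ub, (n_particles, dim))   # positions
    u = np.zeros((n_particles, dim))              # velocities
    l = p.copy()                                  # personal bests (lbest)
    l_cost = np.array([cost(x) for x in p])
    g = l[l_cost.argmin()].copy()                 # global best (lg)
    for _ in range(iters):
        k1 = rng.random((n_particles, dim))       # uniform random in [0, 1]
        k2 = rng.random((n_particles, dim))
        u = w * u + d1 * k1 * (l - p) + d2 * k2 * (g - p)   # Eq. (4)
        p = np.clip(p + u, lb, ub)                          # Eq. (5)
        c = np.array([cost(x) for x in p])
        better = c < l_cost
        l[better], l_cost[better] = p[better], c[better]
        g = l[l_cost.argmin()].copy()
    return g, l_cost.min()

best, val = pso(lambda x: np.sum(x**2), dim=5)    # sphere test function
print(best, val)
```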

Grasshopper optimization algorithm (GOA)

Because of the harm they cause to agriculture, grasshoppers are acknowledged as pests. The grasshopper optimization algorithm (GOA) imitates the behaviour of a swarm of grasshoppers searching for food sources in nature. Grasshoppers do not act as individuals but form some of the largest swarms of all living organisms; like large rolling cylinders, millions of grasshoppers jump and advance together. Swarm motion is influenced by the interactions of individuals within the swarm, the wind, gravity, and food sources. Saremi et al. (2017) proposed a mathematical model for this behaviour, given by:

$${P}_{i}={R}_{i}+G+W$$
(6)

where \({P}_{i}\) represents the position of the ith grasshopper, \({R}_{i}\) represents social interaction, \(G\) is the gravity force on the ith grasshopper, and \(W\) is the wind direction. The expanded formulation of the expression can be given by:

$${P}_{i}= \sum\limits_{\begin{array}{c}j=1\\ j\ne i\end{array}}^{N}r\left(\left|{P}_{j}^{k}-{P}_{i}^{k}\right|\right)\frac{{P}_{j}^{k}-{P}_{i}^{k}}{{d}_{ij}}-g{\widehat{e}}_{g}+{w}^{^{\prime}}{\widehat{e}}_{\mathrm{w}}$$
(7)

where \(r\) is a function that simulates the effect of the social interactions of the \(N\) individual grasshoppers, expressed as:

$$r\left(p\right)=f{\mathrm{e}}^{-\frac{p}{l}}-{\mathrm{e}}^{-p}$$
(8)

where \(d_{ij}\) is the distance between the ith and jth grasshoppers, given by:

$${d}_{ij}=\left|{P}_{j}^{k}-{P}_{i}^{k}\right|$$
(9)

If g is the gravitational constant and \({\widehat{e}}_{g}\) represents a unit vector towards the centre of the earth, gravitational force G is given by:

$$G=-g{\widehat{e}}_{g}$$
(10)

If \({w}^{^{\prime}}\) is the wind drift constant and \({\widehat{e}}_{\mathrm{w}}\) represents a unit vector towards the direction of the wind, the wind drift effect W is given by:

$$W={w}^{^{\prime}}{\widehat{e}}_{\mathrm{w}}$$
(11)

The effects of wind and gravity are much weaker than the interactions between grasshoppers, and the basic model converges prematurely once grasshoppers reach their comfort zones. Thus, the modified version of Eq. (6) can be re-written as:

$${P}_{i}={c}^{k}\left( \sum\limits_{\begin{array}{c}j=1\\ j\ne i\end{array}}^{N}{c}^{k}\frac{\mathrm{ub}-\mathrm{lb}}{2}\,r\left(\left|{P}_{j}^{k}-{P}_{i}^{k}\right|\right)\frac{{P}_{j}^{k}-{P}_{i}^{k}}{{d}_{ij}}\right)+{\widehat{T}}_{d}$$
(12)

where ub and lb are the upper and lower bounds of the variables, respectively, c is a decreasing coefficient described in Eq. (13), and \({\widehat{T}}_{d}\) is the value of the dth dimension of the target (the best solution obtained so far).

$${c}^{k}={c}_{\text{max}}-\left({c}_{\mathrm{max}}-{c}_{\mathrm{min}}\right)\frac{k}{{k}_{\mathrm{max}}}$$
(13)

where \({c}_{\mathrm{max}}\) and \({c}_{\mathrm{min}}\) are set to 1 and 0.00001, respectively, in the present work. The higher the value of c, the greater the swarm's exploration.
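The position update of Eq. (12) with the decreasing coefficient of Eq. (13) can be sketched as follows. The remapping of inter-grasshopper distances into [2, 4] follows the reference implementation of Saremi et al. (2017) and is an assumed detail here; the test function and hyper-parameters are likewise illustrative.

```python
import numpy as np

def r_func(p, f=0.5, l=1.5):
    """Social force of Eq. (8); f = 0.5 and l = 1.5 as in Saremi et al. (2017)."""
    return f * np.exp(-p / l) - np.exp(-p)

def goa(cost, dim, n=30, k_max=100, lb=-1.0, ub=1.0, c_max=1.0, c_min=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.uniform(lb, ub, (n, dim))
    costs = np.array([cost(x) for x in P])
    target = P[costs.argmin()].copy()             # T_d: best solution so far
    target_cost = costs.min()
    for k in range(k_max):
        c = c_max - (c_max - c_min) * k / k_max   # Eq. (13)
        P_new = np.empty_like(P)
        for i in range(n):
            social = np.zeros(dim)
            for j in range(n):
                if j == i:
                    continue
                d = np.linalg.norm(P[j] - P[i]) + 1e-12
                unit = (P[j] - P[i]) / d
                # distances remapped into [2, 4] so r_func stays informative,
                # as in the reference implementation (an assumed detail)
                social += c * (ub - lb) / 2.0 * r_func(2.0 + d % 2.0) * unit
            P_new[i] = np.clip(c * social + target, lb, ub)   # Eq. (12)
        P = P_new
        costs = np.array([cost(x) for x in P])
        if costs.min() < target_cost:
            target, target_cost = P[costs.argmin()].copy(), costs.min()
    return target, target_cost

print(goa(lambda x: np.sum(x**2), dim=5))
```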

Artificial bee colony (ABC)

The ABC algorithm (Karaboga 2005; Tereshko and Loengarov 2005) is a metaheuristic optimization approach that translates the social cooperation of honey-bee swarms into machine learning. It divides honey bees into two types, employed and unemployed, with unemployed bees further sub-divided into onlooker and scout bees. Employed bees exploit the food sources, which represent candidate solutions, and search for other food sources in the neighbourhood; there are as many employed bees as food sources. The onlooker bees observe the employed bees and, based on the information conveyed about the amount of nectar, i.e., the fitness value of the solutions, select the food sources to be exploited; the best solutions are memorised and the poor ones abandoned. When a food source is exhausted, its employed bee becomes a scout and searches for a new food source to replace the abandoned one.

Let \(x_{i}\) (i = 1, 2, …, m) be the food sources. Neighbouring food sources (candidate solutions) are generated by:

$${V}_{i}= {x}_{i} + {\mu }_{i}({x}_{i}-{x}_{j})$$
(14)

where µi is a random number between − 1 and 1 and xj is chosen randomly (j = 1, 2, …, n; j ≠ i). Based on this information, an onlooker bee chooses a food source with the following probability:

$${P}_{i}=\frac{{f}_{i}}{{\sum }_{n=1}^{t}{f}_{n}}$$
(15)

where t is the total number of sources and fi is the amount of nectar, i.e., the fitness value, of the ith source, calculated using an objective function \(\phi ({x}_{i})\):

$$ {f_i} = \left\{ {\begin{array}{*{20}{l}} {\frac{1}{{1 + \phi ({x_i})}},}&{\phi ({x_i}) \geq 0} \\ {1 + \left| {\phi ({x_i})} \right|,}&{\phi ({x_i}) < 0} \end{array}} \right.$$
(16)

Scout bees replace the abandoned food sources with new ones using the following expression:

$${x}_{i}={x}_{\mathrm{min}}+\mathrm{rand}\left(0,1\right)\left({x}_{\mathrm{max}}-{x}_{\mathrm{min}}\right)$$
(17)

where \({x}_{\mathrm{max}}\) is the upper bound and \({x}_{\mathrm{min}}\) is the lower bound of \({x}_{i}\), and rand(0, 1) is a uniformly distributed random number. The iteration is repeated until the termination condition is met.
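A compact sketch of the three phases (employed, onlooker, scout) follows, using the neighbourhood move of Eq. (14), the selection probability of Eq. (15), the fitness of Eq. (16), and the scout re-initialization of Eq. (17); the trial limit, one-scout-per-iteration rule, and other settings are illustrative assumptions.

```python
import numpy as np

def abc(cost, dim, n_sources=20, iters=100, limit=20, lb=-1.0, ub=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n_sources, dim))        # food sources
    phi_v = np.array([cost(s) for s in x])           # objective values
    fit = np.where(phi_v >= 0, 1/(1+phi_v), 1+np.abs(phi_v))   # Eq. (16)
    trials = np.zeros(n_sources, int)

    def try_neighbour(i):
        j = rng.choice([a for a in range(n_sources) if a != i])
        v = x[i].copy()
        d = rng.integers(dim)
        v[d] = np.clip(x[i, d] + rng.uniform(-1, 1)*(x[i, d]-x[j, d]), lb, ub)  # Eq. (14)
        cv = cost(v)
        fv = 1/(1+cv) if cv >= 0 else 1+abs(cv)
        if fv > fit[i]:                              # greedy selection
            x[i], phi_v[i], fit[i], trials[i] = v, cv, fv, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_sources):                   # employed-bee phase
            try_neighbour(i)
        P = fit / fit.sum()                          # Eq. (15)
        for _ in range(n_sources):                   # onlooker-bee phase
            try_neighbour(rng.choice(n_sources, p=P))
        worn = trials.argmax()                       # scout-bee phase
        if trials[worn] > limit:                     # one scout per iteration (assumed)
            x[worn] = lb + rng.random(dim)*(ub - lb)             # Eq. (17)
            phi_v[worn] = cost(x[worn])
            fit[worn] = (1/(1+phi_v[worn]) if phi_v[worn] >= 0
                         else 1+abs(phi_v[worn]))
            trials[worn] = 0
    best = phi_v.argmin()
    return x[best], phi_v[best]

print(abc(lambda s: np.sum(s**2), dim=5))
```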

Ant colony optimization (ACO)

ACO simulates the food-searching behaviour of ants (Dorigo et al. n.d.; Dorigo and Blum 2005; Dorigo and Socha 2007). Artificial ants search for the best solutions in the parameter space. An ant's journey from the nest to the food source and back is modelled as one iteration of the algorithm. Along the way, ants deposit pheromone, which guides later ants toward promising solutions instead of a purely random search. The shortest path accumulates the highest pheromone concentration because it is traversed by the largest number of ants; since pheromone evaporates over time, its concentration decays on rarely traversed and longer paths, which take more time to complete. The algorithm consists of three phases: ant-based solution construction, pheromone update (deposition and evaporation), and iteration. In the first phase, artificial ants explore candidate solutions and build paths, recording the positions and quality of the solutions. In later iterations, more ants follow the reinforced paths, while the pheromone records of longer paths evaporate. Simulated ants probabilistically pick a trail based on heuristic values such as the pheromone density and the objective function value.

If i and j are the beginning and end nodes of a path, \(d_{ij}\) is the distance between them, and \(\tau_{ij}\) is the pheromone density, then the probability of choosing the path from i to j among n nodes is:

$${P}_{ij}=\frac{{\tau }_{ij}^{\alpha }{d}_{ij}^{\beta }}{{\sum }_{i,j=1}^{n}{\tau }_{ij}^{\alpha }{d}_{ij}^{\beta }}$$
(18)

The pheromone concentration decreases exponentially with time due to evaporation between time t and t + 1:

$${\tau }_{ij}(t+1)=\rho {\tau }_{ij}(t)+\Delta {\tau }_{ij}$$
(19)

where 0 < ρ < 1 is the evaporation constant and \(\Delta {\tau }_{ij}\) is the pheromone increment. For m ants, the additional pheromone laid on path (i, j) at the tth iteration is:

$$\Delta {\tau }_{ij}= \sum\limits_{k=1}^{m}\Delta {\tau }_{ij}^{k}(t)$$
(20)

Each ant makes an individual pheromone contribution of:

$$\Delta \tau _{ij}^k(t) = \left\{ {\begin{array}{*{20}{l}} {\frac{Q}{{{L_k}}}}&{{\text{if the }}k{\text{th ant passes }}\left( {i,j} \right){\text{ in the current tour}}} \\ 0&{{\text{otherwise}}} \end{array}} \right.$$
(21)

where Q is a constant and \(L_k\) is the length of the path traversed by the kth ant.
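The following sketch applies Eqs. (18)-(21) to a small travelling-salesman-style problem. Using the visibility η = 1/d in place of the raw distance in Eq. (18) is a common convention assumed here, and all parameter values and the random city layout are illustrative.

```python
import numpy as np

def aco_tsp(dist, n_ants=20, iters=100, alpha=1.0, beta=2.0, rho=0.5, Q=1.0, seed=0):
    """Minimal ACO for a symmetric TSP; eta = 1/d is the assumed heuristic."""
    n = len(dist)
    rng = np.random.default_rng(seed)
    tau = np.ones((n, n))                        # pheromone density
    eta = 1.0 / (dist + np.eye(n))               # visibility; eye avoids /0
    best_tour, best_len = None, np.inf
    for _ in range(iters):
        d_tau = np.zeros((n, n))
        for _ in range(n_ants):
            tour = [rng.integers(n)]
            while len(tour) < n:
                i = tour[-1]
                mask = np.ones(n, bool)
                mask[tour] = False               # exclude visited nodes
                w = (tau[i]**alpha) * (eta[i]**beta) * mask   # Eq. (18) numerator
                tour.append(rng.choice(n, p=w / w.sum()))
            L = sum(dist[tour[k], tour[(k+1) % n]] for k in range(n))
            for k in range(n):                   # Eq. (21): each ant deposits Q/L
                i, j = tour[k], tour[(k+1) % n]
                d_tau[i, j] += Q / L
                d_tau[j, i] += Q / L
            if L < best_len:
                best_tour, best_len = tour, L
        tau = rho * tau + d_tau                  # Eqs. (19)-(20): evaporate + deposit
    return best_tour, best_len

pts = np.random.default_rng(1).random((8, 2))    # 8 random cities
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(aco_tsp(D))
```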

Ant lion optimizer (ALO)

ALO is a metaheuristic algorithm based on the hunting behaviour of antlions (Mirjalili 2015). Antlions catch their prey, ants, by digging sharp cone-shaped pits. An antlion positions itself at the bottom of the pit, waiting for ants to fall in. As soon as an ant falls into the trap, the antlion throws sand outward to prevent the escaping prey from climbing out. When the ant reaches the bottom, the antlion consumes it and then builds another, bigger cone-shaped trap. The matrices \(M_{\mathrm{ant}}\) and \(M_{\mathrm{antlion}}\) give the positions of the ants and antlions, respectively, and \(M_{\mathrm{oa}}\) and \(M_{\mathrm{oal}}\) store the corresponding objective function values for the m ants and m antlions.

$${M}_{\mathrm{ant}}=\left[\begin{array}{cccc}{A}_{11}& {A}_{12}& \dots & {A}_{1d}\\ {A}_{21}& {A}_{22}& \dots & {A}_{2d}\\ \dots & \dots & \dots & \dots \\ {A}_{m1}& {A}_{m2} & \dots & {A}_{md}\end{array}\right]; \; {{M}_{\text{antlion}}}=\left[\begin{array}{cccc}{L}_{11}& {L}_{12}& \dots & {L}_{1d}\\ {L}_{21}& {L}_{22}& \dots & {L}_{2d}\\ \dots & \dots & \dots & \dots \\ {L}_{m1}& {L}_{m2}& \dots & {L}_{md}\end{array}\right]$$
(22)
$${M}_{\mathrm{oa}}=\left[\begin{array}{c}f\left(\left[{A}_{11},{A}_{12},...,{A}_{1d}\right]\right)\\ f\left(\left[{A}_{21},{A}_{22},...,{A}_{2d}\right]\right)\\ ....\\ f\left(\left[{A}_{m1},{A}_{m2},...,{A}_{md}\right]\right)\end{array}\right]; \; {M}_{\mathrm{oal}}=\left[\begin{array}{c}f\left(\left[{L}_{11},{L}_{12},...,{L}_{1d}\right]\right)\\ f\left(\left[{L}_{21},{L}_{22},...,{L}_{2d}\right]\right)\\ ....\\ f\left(\left[{L}_{m1},{L}_{m2},...,{L}_{\mathit{md}}\right]\right)\end{array}\right]$$
(23)

The random walk of the ants is modelled as:

$${X}_{i}=\left[0,cumsum\left(2\psi -1\right)\right]$$
(24)

where \(cumsum\) computes the cumulative sum over the iterations. If rand is a random number with uniform distribution in the range [0, 1], the stochastic function is defined as:

$$ \psi = \left\{ {\begin{array}{*{20}{l}} 1&{{\text{if}}\;{\text{rand > 0}}{\text{.5}}} \\ 0&{{\text{if}}\;{\text{rand}} \leq 0.5} \end{array}} \right.$$
(25)

The normalized ant position is given by the following equation:

$${X}_{i}^{\mathrm{itr}}=\frac{\left({X}_{i}^{\mathrm{itr}}-{\alpha }_{i}\right)\times \left({\lambda }_{i}^{\mathrm{itr}}-{\nu }_{i}^{\mathrm{itr}}\right)}{\left({\phi }_{i}-{\alpha }_{i}\right)}+{\nu }_{i}^{\mathrm{itr}}$$
(26)

where \({\lambda }_{i}^{\mathrm{itr}}\) and \({\nu }_{i}^{\mathrm{itr}}\) are the maximum and minimum values of the ith variable at the particular iteration, respectively, and \({\phi }_{i}\) and \({\alpha }_{i}\) are the maximum and minimum values of the random walk of the ith variable. The lower and upper bounds of the ith dimension are calculated as follows, where \(\mathrm{Antlion}_{j}^{\mathrm{itr}}\) denotes the position of the jth antlion at the particular iteration:

$${\nu }_{i}^{\text{itr}}=Antlio{n}_{j}^{itr}+{\nu }_{i}^{{\text{itr}}-1}$$
(27)
$${\lambda }_{i}^{\text{itr}}=Antlio{n}_{j}^{\text{itr}}+{\lambda }_{i}^{itr-1}$$
(28)

To model the phenomenon of ants falling to the bottom of the pit, their random walk is reduced by a factor:

$${\nu }_{i}^{\mathrm{itr}}=\frac{{\nu }_{i}^{\mathrm{itr}}}{K}; {\lambda }_{i}^{\mathrm{itr}}=\frac{{\lambda }_{i}^{\mathrm{itr}}}{K}$$
(29)

K is an iteration-dependent factor given by:

$$K=1{0}^{\mu }\times \frac{\text{current iteration}}{\text{maximum number of iterations}}$$
(30)

Taking \(\mathrm{itr}_{\mathrm{max}}\) as the maximum number of iterations, µ is calculated as follows, with the largest satisfied case applying:

$$ \mu = \left\{ {\begin{array}{*{20}{l}} 2&{{\text{if}}\;{\text{itr}} > 0.1\,{\text{itr}}_{\max }} \\ 3&{{\text{if}}\;{\text{itr}} > 0.5\,{\text{itr}}_{\max }} \\ 4&{{\text{if}}\;{\text{itr}} > 0.75\,{\text{itr}}_{\max }} \\ 5&{{\text{if}}\;{\text{itr}} > 0.9\,{\text{itr}}_{\max }} \\ 6&{{\text{if}}\;{\text{itr}} > 0.95\,{\text{itr}}_{\max }} \end{array}} \right.$$
(31)

The antlion catches the ant, consumes it by dragging it into the sand, and then updates its own position to that of the ant:

$$\mathrm{Antlion}_{j}^{\mathrm{itr}}={\mathrm{Ant}}_{i}^{\mathrm{itr}} \quad \mathrm{if} \; f\left({\mathrm{Ant}}_{i}^{\mathrm{itr}}\right)>f\left({\mathrm{Antlion}}_{j}^{\mathrm{itr}}\right)$$
(32)

Further, elitism is applied in the optimization method: the best antlion found is kept as the elite, which, being the fittest, influences the movement of every ant in the iteration. Each ant performs a random walk around an antlion selected by the roulette wheel and another around the elite, and its new position is their average:

$$An{t}_{i}^{\text{itr}}=\frac{{W}_{al}^{\text{itr}}+{W}_{elite}^{itr}}{2}$$
(33)

where \({W}_{al}^{\text{itr}}\) is the random walk around the roulette-wheel-selected antlion and \({W}_{elite}^{\text{itr}}\) is the random walk around the elite antlion.
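The random-walk machinery of Eqs. (24)-(33) can be sketched as follows. The bound shrinking combines Eqs. (27)-(29) in simplified form, and the base value µ = 1 for the earliest iterations is an assumption, since the piecewise definition above starts at 10% of the iterations; the antlion positions are invented for illustration.

```python
import numpy as np

def mu(itr, itr_max):
    """Eq. (31): the largest satisfied case applies; the base value 1 for
    the earliest iterations is an assumption, not stated in the source."""
    for m, frac in ((6, 0.95), (5, 0.90), (4, 0.75), (3, 0.50), (2, 0.10)):
        if itr > frac * itr_max:
            return m
    return 1

def random_walk(n_iter, lb, ub, antlion, rng):
    """One normalized random walk around an antlion, Eqs. (24)-(30),
    with the bound shrinking of Eqs. (27)-(29) in simplified form."""
    steps = np.where(rng.random(n_iter) > 0.5, 1.0, -1.0)      # Eq. (25)
    X = np.concatenate(([0.0], np.cumsum(steps)))              # Eq. (24)
    a, f = X.min(), X.max()                                    # walk extremes
    walk = np.empty(n_iter + 1)
    for itr in range(n_iter + 1):
        K = 10.0**mu(itr, n_iter) * max(itr, 1) / n_iter       # Eq. (30); itr=0 guard
        nu, lam = antlion + lb / K, antlion + ub / K           # shrinking bounds
        walk[itr] = (X[itr] - a) * (lam - nu) / (f - a) + nu   # Eq. (26)
    return walk

rng = np.random.default_rng(0)
w_al = random_walk(100, -1.0, 1.0, antlion=0.3, rng=rng)   # roulette-selected antlion
w_elite = random_walk(100, -1.0, 1.0, antlion=0.5, rng=rng)
ant = (w_al + w_elite) / 2.0                               # Eq. (33): elitism average
print(ant[-1])
```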

Multivariate adaptive regression splines (MARS)

MARS, introduced by Friedman (1991), is a non-parametric regression method that uses piecewise linear splines, called basis functions (BFs), to describe the relationship between the input parameters and the output variable. The MARS methodology proceeds in a forward stepwise phase (construction) followed by a backward stepwise phase (pruning). The forward phase starts with a constant BF and adds BFs step by step according to a lack-of-fit criterion, typically producing an over-fitted model; the pruning phase then removes the least effective terms, and the optimum model is finally selected. More details about the method can be found in the literature (Samui and Kim 2013; Zhang et al. 2020c, 2021; Zhang and Goh 2016).
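The building block of a MARS model is the hinge basis function. The short sketch below shows how a constant plus weighted hinge terms forms a piecewise linear prediction, cf. Eq. (47); the knot and coefficients are made up for illustration and are not the fitted values of this study.

```python
import numpy as np

def hinge(x, knot, sign):
    """A MARS basis function: max(0, x - knot) for sign=+1,
    max(0, knot - x) for sign=-1."""
    return np.maximum(0.0, sign * (x - knot))

# Illustrative MARS-type prediction: constant + a mirrored pair of hinges
x = np.linspace(0.0, 1.0, 5)
y = 0.28 - 1.14 * hinge(x, 0.4, +1) + 0.13 * hinge(x, 0.4, -1)
print(y)
```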

Genetic programming (GP)

GP (Koza 1992) is a symbolic machine learning technique that uses the Darwinian concepts of natural selection and genetic recombination. It evolved from GA (Holland 1975) and uses tree-structured computer programs instead of strings of numbers. The model is initialized by the creation of a random population, followed by the reproduction of individuals and the creation of new ones by mutation and crossover. In traditional GP, symbolic regression is typically performed to generate a population of trees, each of which encodes a mathematical expression. The generated expression predicts the desired output (\(m\times 1\)) using the given inputs (\(n\times m\)), where \(n\) and \(m\) are the number of input variables and the number of observations, respectively. Multi-gene GP (MGGP), on the other hand, forms a weighted linear combination of GP trees. For each symbolic regression model, the linear weights are derived from the training dataset and then used to predict new outputs. The literature suggests that the MGGP regression technique is computationally more efficient than traditional GP. However, to obtain high predictive accuracy, hyper-parameters such as the population size, tournament size, maximum number of generations, maximum number of genes, crossover and mutation probabilities, and the function set must be designed properly.
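The MGGP idea described above can be sketched as follows: each gene is a small symbolic expression, and the model output is a least-squares weighted combination of the gene outputs plus a bias. The two genes and the stand-in data below are purely illustrative, not the evolved genes behind Eq. (48).

```python
import numpy as np

# Each "gene" is a small symbolic tree; these two are arbitrary examples.
genes = [
    lambda X: np.tanh(1.5 * X[:, 3]),        # gene 1: tanh(1.5 * x4)
    lambda X: np.cos(X[:, 4]) * X[:, 0],     # gene 2: cos(x5) * x1
]

rng = np.random.default_rng(0)
X, y = rng.random((50, 5)), rng.random(50)   # stand-in training data
G = np.column_stack([np.ones(len(X))] + [g(X) for g in genes])
w, *_ = np.linalg.lstsq(G, y, rcond=None)    # linear gene weights + bias
y_hat = G @ w                                # MGGP prediction
print(w, float(np.sqrt(np.mean((y - y_hat) ** 2))))
```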

Group method of data handling (GMDH)

GMDH is a self-organized neural network. In this feed-forward method, the elementary unit is a quadratic equation of two variables, whose coefficients are calculated using regression analysis (Armaghani et al. 2020a). It models datasets having several inputs (u1, …, un) and one output (V):

$$V=f({u}_{1}, \ldots {u}_{n})$$
(34)

A simplified example is a polynomial comprising two variables, ui and uk:

$${U}_{i}={\alpha }_{0}+{\alpha }_{1}{u}_{i}+{\alpha }_{2}{u}_{k}+{\alpha }_{3}{u}_{i}^{2}+{\alpha }_{4}{u}_{k}^{2}+{\alpha }_{5}{u}_{i}{u}_{k}$$
(35)

The GMDH model (shown in Fig. 2) is built layer by layer: in each layer, pairs of inputs are combined through quadratic polynomials, creating new neurons in the process. The candidate outputs of the first layer are denoted by U1, U2, …, Un, of which only the best (U1, U2, …, Ur) pass through the selection layer. The next layer produces outputs Z1, Z2, …, Zp, which are polynomials of higher degree than those of the previous layer; again, only a selected few (Z1, Z2, …, Zq) are passed on. The process is repeated until the desired outcome is achieved. A sketch of the elementary quadratic neuron is given after the figure.

Fig. 2
figure 2

A basic structure of GMDH algorithm
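The sketch below fits the elementary GMDH neuron of Eq. (35) by least squares on synthetic data (the data and target are invented for illustration). In a full GMDH network, such a neuron is fitted for every pair of inputs, and only the best neurons, judged by validation error, survive into the next layer.

```python
import numpy as np

def fit_gmdh_neuron(ui, uk, v):
    """Least-squares fit of the quadratic polynomial of Eq. (35):
    U = a0 + a1*ui + a2*uk + a3*ui^2 + a4*uk^2 + a5*ui*uk."""
    A = np.column_stack([np.ones_like(ui), ui, uk, ui**2, uk**2, ui*uk])
    alpha, *_ = np.linalg.lstsq(A, v, rcond=None)
    return alpha, A @ alpha      # coefficients and the neuron's output

rng = np.random.default_rng(0)
u1, u2 = rng.random(50), rng.random(50)
v = 1 + 2*u1 - u2 + 0.5*u1*u2 + 0.01*rng.normal(size=50)   # synthetic target
alpha, out = fit_gmdh_neuron(u1, u2, v)
print(np.round(alpha, 2))        # recovers roughly [1, 2, -1, 0, 0, 0.5]
```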

Hybridization of ANN and metaheuristic algorithms

The shortcomings of ANN include extensive computation time and the trial-and-error approach needed to discover the appropriate number of hidden neurons. There is a growing initiative to combine ANN with metaheuristic optimization strategies to boost its simulation performance. Several neural network parameters, such as weights and biases, are optimized using optimization methods to improve performance. Recently, many studies have been conducted in engineering applications to augment the capability of ANN models with optimization algorithms (OAs) such as ABC, ACO, ALO, PSO, and GOA (Adnan et al. 2019; Armaghani et al. 2014; Malekpour and Mohammad Rezapour Tabari 2020; Moayedi et al. 2019b; Ozturk and Karaboga 2011; Rahgoshay et al. 2019; Taheri et al. 2017; Umar et al. 2019; Xu et al. 2019). ANN models may produce unwanted outcomes because back-propagation (BP) may fail to find the exact global minimum; they are also vulnerable to becoming trapped in local minima. OAs have been found to alleviate this problem by searching globally for suitable weights and biases. In this study, PSO, GOA, ABC, ACO, and ALO were used to optimize the learning parameters (weights and biases) of ANN, and five hybrid models, namely ANN-PSO, ANN-GOA, ANN-ABC, ANN-ACO, and ANN-ALO, were constructed to predict the bearing capacity of the pile foundation. The steps of developing the hybrid ANN models are shown in the flow chart of Fig. 3, and a minimal sketch of the shared cost function is given after the flow chart.

Fig. 3
figure 3

A flow chart showing the steps of developing hybrid ANN models
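All five hybrid models share one ingredient: a cost function that maps a flat parameter vector to the ANN's weights and biases and returns the training RMSE, which the OA then minimizes. In the sketch below, the 5-6-1 architecture (43 parameters, matching the ANN-PSO configuration reported later) and the stand-in data are assumptions for illustration.

```python
import numpy as np

def unpack(theta, n_in=5, n_h=6):
    """Map a flat vector of length n_in*n_h + n_h + n_h + 1 (= 43 here) to
    input weights, hidden biases, output weights, and the output bias."""
    i = 0
    W1 = theta[i:i + n_in * n_h].reshape(n_h, n_in); i += n_in * n_h
    b1 = theta[i:i + n_h]; i += n_h
    W2 = theta[i:i + n_h].reshape(1, n_h); i += n_h
    b2 = theta[i:i + 1]
    return W1, b1, W2, b2

def rmse_cost(theta, X, y):
    """Training RMSE of the ANN defined by theta: the OA's cost function."""
    W1, b1, W2, b2 = unpack(theta)
    H = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))   # sigmoid hidden layer
    y_hat = (H @ W2.T + b2).ravel()              # linear output layer
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

rng = np.random.default_rng(0)
X_tr, y_tr = rng.random((35, 5)), rng.random(35)   # stand-in normalized data
theta0 = rng.uniform(-1.0, 1.0, 43)                # lb = -1, ub = +1
print(rmse_cost(theta0, X_tr, y_tr))
# Any OA (e.g. the pso() sketch above) can now search the 43-dimensional
# theta space for the weights and biases that minimize this cost.
```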

Data processing and analysis

Descriptive statistics of the datasets

To simulate the soft computing models, 50 PDA datasets were collected from the study of Momeni et al. (2015). 36 PDA tests were conducted at various project sites in Indonesia. Note that these tests were conducted as per the guidelines of ASTM D4945-08, in cohesion-less soils (American Society for Testing and Materials 2010). Table 1 presents the descriptive statistics of the parameters of the collected dataset. The dataset comprises six parameters, namely the weight of the hammer (w in kN), the height of fall of the hammer (H in m), the cross-sectional area (A in cm2), the length of the pile (L in m), the pile set value (S in mm), and the ultimate bearing capacity (QU) of the pile, among which the first five were used as input parameters to predict QU, the output parameter. As can be seen, the sample variances are scattered in the range of 0.81 to 836,134.34, which indicates that the present dataset has a wide range of input and output parameters. Fig. 4 presents the frequency histograms of the input and output variables. In addition, the values of the standard error (scattered in the range of 0.35–129.32) confirm that the present database covers a wide range of variables and is hence useful for soft computing modelling.

Table 1 Descriptive statistics of the input and output variables
Fig. 4
figure 4

Frequency histogram of the input and output variables

Data processing and computation of models

In the soft computing field, it is important to normalize the input and output variables within a predefined range to enhance model accuracy. Normalization adjusts the numeric data values to a standard scale without distorting the differences between the value ranges. The process is not essential for all machine learning datasets, but only where the parameters have different ranges. In this dataset, all variables were normalized to the range 0 to 1 using the expression:

$${x}_{\mathrm{Normalized}}=\frac{\left({x}_{\mathrm{Actual}}-{x}_{\mathrm{min}}\right)}{\left({x}_{\mathrm{max}}-{x}_{\mathrm{min}}\right)}$$
(36)

where \({x}_{\mathrm{Actual}}\) is the actual value of the particular parameter, and \({x}_{\mathrm{min}}\) and \({x}_{\mathrm{max}}\) are the minimum and maximum values of that parameter in the dataset. Post-normalization, the dataset was randomly divided into training (70% of the total dataset) and testing (30% of the total dataset) subsets. The training subset was used to train the models; in this phase, a model learns the correlation between the input and output variables and constructs a predictive mapping. The testing dataset was then used to test the predictions of the trained models. The performance of the models was further ascertained using various statistical parameters, described in detail in later sections. The entire methodology is depicted in Fig. 5. The results of the employed models were compared with those of the traditional FOSM model, and the robustness of the models was determined. A minimal sketch of the normalization and split follows the figure.

Fig. 5
figure 5

Research methodology of AI-based models
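The sketch below applies Eq. (36) column-wise and performs the 70/30 split on a stand-in 50-record array; the column scales are invented for illustration. With 50 records, the split yields 35 training and 15 testing samples.

```python
import numpy as np

def min_max_normalize(x):
    """Eq. (36): scale each column to [0, 1]."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

rng = np.random.default_rng(42)
# stand-in 50-record dataset with columns (w, H, A, L, S, Qu); scales invented
data = rng.random((50, 6)) * [10, 3, 5000, 30, 15, 4000]
norm = min_max_normalize(data)

idx = rng.permutation(len(norm))                 # random 70/30 split
n_train = int(0.7 * len(norm))
train, test = norm[idx[:n_train]], norm[idx[n_train:]]
X_train, y_train = train[:, :5], train[:, 5]
X_test, y_test = test[:, :5], test[:, 5]
print(X_train.shape, X_test.shape)               # (35, 5) (15, 5)
```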

Results and discussion

Performance parameters

To estimate the performance of the developed models, several widely used statistical indices were determined (Behar et al. 2015; Despotovic et al. 2015; Kumar et al. 2021; Kumar and Samui 2020; Legates and Mccabe 2013; Stone 1993). These are the coefficient of determination (R2), performance index (PI), Nash–Sutcliffe efficiency (NS), Willmott's index of agreement (WI), variance account for (VAF), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), ratio of the root mean square error to the observation's standard deviation (RSR), and weighted mean absolute percentage error (WMAPE). The expressions for these parameters are given below:

$${R}^{2}=\frac{\sum_{i=1}^{N}{\left({d}_{i}-{d}_{\mathrm{mean}}\right)}^{2}-\sum_{i=1}^{N}{\left({d}_{i}-{y}_{i}\right)}^{2}}{\sum_{i=1}^{N}{\left({d}_{i}-{d}_{\mathrm{mean}}\right)}^{2}}$$
(37)
$$PI=adj.{R}^{2}+\left(0.01\times \mathrm{VAF}\right)-\mathrm{RMSE}$$
(38)
$$\mathrm{NS}=1-\frac{{\sum }_{i=1}^{N}({d}_{i}-{y}_{i}{)}^{2}}{{\sum }_{i=1}^{N}({d}_{i}-{d}_{\mathrm{mean}}{)}^{2}}$$
(39)
$$\mathrm{WI}=1-\left[\frac{\sum_{i=1}^{N}{\left({d}_{i}-{y}_{i}\right)}^{2}}{\sum_{i=1}^{N}{\left\{\left|{y}_{i}-{d}_{\text{mean}}\right|+\left|{d}_{i}-{d}_{\text{mean}}\right|\right\}}^{2}}\right]$$
(40)
$$\mathrm{VAF}=\left(1-\frac{var\left({d}_{i}-{y}_{i}\right)}{var({d}_{i})}\right)\times 100$$
(41)
$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left({d}_{i}-{y}_{i}\right)}^{2}}$$
(42)
$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|({y}_{i}-{d}_{i})\right|$$
(43)
$$\mathrm{MAPE}=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{{d}_{i}-{y}_{i}}{{d}_{i}}\right|\times 100$$
(44)
$$\mathrm{RSR}=\frac{\mathrm{RMSE}}{\sqrt{\frac{1}{N}{\sum }_{i=1}^{N}({d}_{i}-{d}_{\mathrm{mean}}{)}^{2}}}$$
(45)
$$\mathrm{WMAPE}=\frac{{\sum }_{i=1}^{N}\left|\frac{{d}_{i}-{y}_{i}}{{d}_{i}}\right|\times {d}_{i}}{{\sum }_{i=1}^{N}{d}_{i}}$$
(46)

where \({d}_{i}\) is the ith observed value, \({y}_{i}\) is the ith predicted value, \({d}_{\mathrm{mean}}\) is the average of the observed values, and \(N\) is the number of data samples. Note that, for an ideal model, the values of these indices should equal their ideal values, the details of which are presented in Table 2. A sketch computing several of these indices is given after the table.

Table 2 Ideal values of performance indices
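Several of the indices in Eqs. (37)-(46) can be computed as in the sketch below, with d denoting observed and y predicted values, following the notation above. The WMAPE line uses the fact that Eq. (46) reduces to Σ|d − y| / Σd when all observed values are positive; the toy arrays are illustrative.

```python
import numpy as np

def performance_indices(d, y):
    """A few of the indices of Eqs. (37)-(46); d = observed, y = predicted."""
    dm = d.mean()
    ss_tot = np.sum((d - dm) ** 2)
    ss_res = np.sum((d - y) ** 2)
    r2 = (ss_tot - ss_res) / ss_tot                  # Eq. (37)
    ns = 1 - ss_res / ss_tot                         # Eq. (39)
    vaf = (1 - np.var(d - y) / np.var(d)) * 100      # Eq. (41)
    rmse = np.sqrt(np.mean((d - y) ** 2))            # Eq. (42)
    mae = np.mean(np.abs(y - d))                     # Eq. (43)
    mape = np.mean(np.abs((d - y) / d)) * 100        # Eq. (44)
    rsr = rmse / np.sqrt(ss_tot / len(d))            # Eq. (45)
    wmape = np.sum(np.abs(d - y)) / np.sum(d)        # Eq. (46), positive d
    return dict(R2=r2, NS=ns, VAF=vaf, RMSE=rmse, MAE=mae,
                MAPE=mape, RSR=rsr, WMAPE=wmape)

d = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
print(performance_indices(d, y))
```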

Configuration of the developed hybrid ANN models

As mentioned earlier, five OAs were used to optimize the learning parameters of ANN. In ANN, these learning parameters are the input weights, the biases of the hidden neurons, the output weights, and the output bias. After the initialization of the ANN, the OAs were used to optimize these weights and biases. For this purpose, the OAs were set up before optimizing the learning parameters of the ANNs, including the population size (\({n}_{\mathrm{s}}\)), the maximum number of iterations (\(k\)), the lower bound (\(\mathrm{lb}\)), the upper bound (\(\mathrm{ub}\)), and other parameters besides the number of hidden neurons (\(Nh\)) of the ANNs. The weights and biases of the ANN were then optimized by the OAs based on the training dataset, with RMSE set as the cost function. It is pertinent to mention that, although \({n}_{\mathrm{s}}\), \(k\), \(\mathrm{lb}\), and \(\mathrm{ub}\) were kept the same during the optimization process, the optimized values of the learning parameters differ in each case.

Following the above-mentioned procedure and using the same training dataset, \(Nh\) was varied in the range of 5–20, and the most appropriate value obtained was 6 for ANN-PSO, 7 for ANN-GOA, 6 for ANN-ABC, 7 for ANN-ACO, and 7 for ANN-ALO. The values of the other parameters were set as \({n}_{\mathrm{s}}\) = 50, \(k\) = 200, \(\mathrm{lb}\) = − 1, and \(\mathrm{ub}\) = + 1. Therefore, the total number of optimized weights and biases is 43 (5 × 6 + 6 + 6 + 1) for ANN-PSO, 50 (5 × 7 + 7 + 7 + 1) for ANN-GOA, 43 (5 × 6 + 6 + 6 + 1) for ANN-ABC, 50 (5 × 7 + 7 + 7 + 1) for ANN-ACO, and 50 (5 × 7 + 7 + 7 + 1) for ANN-ALO; note that the values of these optimized weights and biases differ from model to model (a short snippet verifying these counts is given after Table 3). The detailed configuration of the developed hybrid ANN models is presented in Table 3. Furthermore, the convergence behaviour of the developed hybrid ANN models is presented in Fig. 6, from which the ability of the hybrid models to find the global minimum can be assessed.

Fig. 6
figure 6

Convergence curves of the developed hybrid ANN models

Table 3 Configuration of optimum hybrid ANN models
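As a quick check, the parameter counts quoted above (input weights + hidden biases + output weights + output bias) can be verified with a one-line formula:

```python
def n_learning_params(n_inputs, n_hidden, n_outputs=1):
    """Input weights + hidden biases + output weights + output bias."""
    return n_inputs*n_hidden + n_hidden + n_hidden*n_outputs + n_outputs

print(n_learning_params(5, 6))   # 43 -> ANN-PSO, ANN-ABC
print(n_learning_params(5, 7))   # 50 -> ANN-GOA, ANN-ACO, ANN-ALO
```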

Configuration of the employed MARS, GP, and GMDH models

With the same training dataset, the MARS, GP, and GMDH models were constructed and accordingly evaluated. To design the MARS model, a piecewise linear regression variant of MARS was considered in the present study. The hyper-parameters, such as the number of BFs, the generalized cross-validation (GCV) penalty, self-interaction, maximum interactions, threshold value, and pruning option, were designed using trial-and-error runs; the details of the designed MARS model are presented in Table 4. In addition, the details of each BF are presented in Table 5, and the expression of the designed MARS model is given in Eq. (47), which can readily be used to estimate the bearing capacity of piles.

Table 4 Optimal values of effective parameters of MARS model
Table 5 Equations of the basis functions in MARS model

Analogous to the MARS model, the parameters of the GP and GMDH models were designed based on trial-and-error approaches. The most effective choices of the GP parameters and terminating criteria (population size, number of generations, tournament size, maximum number of genes, maximum tree depth, mutation probability, and function set) are presented in Table 6, and the final GP model for predicting the bearing capacity of the pile is given in Eq. (48), which can also be used as a readymade formula to estimate the bearing capacity. On the other hand, the GMDH structures examined consisted of up to 4 hidden layers with 10 neurons in each layer; the best performance was achieved when the number of hidden layers was set to 3.

$${y}_{\text{MARS}} = 0.28444 -1.1447*BF1 +12.706*BF2 -1.6146*BF3 +8.3353*BF4 -12.524*BF5 +7.3737*BF6 +0.12589*BF7 +12.732*BF8 -1.4795*BF9 -4.6314*BF10 -0.59395*BF11 +14.463*BF12 -12.111*BF13 -0.32972*BF14 +8.811*BF15$$
(47)
$$\begin{aligned}{y}_{\text{GP}}&=4.488{x}_{1}+1.122{x}_{2}-0.5154\,\mathrm{exp}(2{x}_{1})+1.122\,\mathrm{tanh}(\mathrm{exp}({x}_{1}))\\ &\quad +0.1091\,\mathrm{cos}\left(\mathrm{sin}\left(({x}_{5}+4.662)^{2}\right)\right)+0.28\,\mathrm{tanh}(1.533{x}_{4})-0.28\,\mathrm{cos}({x}_{5})\\ &\quad -0.28\,\mathrm{tanh}({x}_{2}-1.315)+0.1233\,{\mathrm{cos}}^{2}\left(2{x}_{2}+{x}_{5}^{2}\right)\\ &\quad -1.456\,\mathrm{tanh}\left((-2.482{x}_{1})^{2}+{x}_{1}{x}_{5}\,\mathrm{exp}({x}_{5})\right)-0.6767\end{aligned}$$
(48)
Table 6 Parametric configuration and terminating criteria of the optimum GP model

Statistical details of results

This sub-section describes the outcomes of all the performance parameters of the models. The performance parameter values for all nine models are presented in Tables 7 and 8 for the training and testing datasets, respectively. Note that one or two parameters are never enough, because every parameter has its advantages as well as limitations; therefore, to determine the efficiency of the developed models, ten performance indices were determined and assessed in detail. As can be seen, all models have captured the correlations in estimating the pile bearing capacity. However, based on the R2 criterion, the R2 values of the top two performing models in the training phase are 0.9967 (MARS) and 0.9914 (GP), demonstrating that the conventional soft computing models attained the most accurate predictions in the training phase; based on the R2 and RMSE criteria, ANN-GOA attained the best prediction performance among the ANN-based models. On the other hand, in the testing phase, GP outperformed all other models by far, with R2 = 0.9859 and RMSE = 0.0353, while ANN-PSO was found to be the second-best model (R2 = 0.9773 and RMSE = 0.0439) in estimating the bearing capacity of the pile foundation. Tables 7 and 8 report the prediction performances of all the models using the 10 performance metrics for the training and testing phases, respectively. It is observed that MARS and GP achieved the best outcomes across all metrics in the training and testing phases, respectively. Among the ANN-based hybrid models, ANN-PSO achieved second place, followed by ANN-ALO, ANN-ACO, ANN-GOA, and ANN-ABC. Figures 7 and 8 depict the comparison of the actual values with the predicted values of all the employed models for the training and testing phases, respectively.

Table 7 Details of performance parameters for the training dataset
Table 8 Details of performance parameters for the testing dataset
Fig. 7
figure 7

Illustration of actual vs. predicted values of the developed models for the training (TR) dataset

Fig. 8
figure 8

Illustration of actual vs. predicted values of the developed models for the testing (TS) dataset

Furthermore, to visualise the results, the Taylor diagram and accuracy matrix are presented. The Taylor diagram (Taylor 2001) is a simple visual representation of how well a model performs compared to the other models; it plots the correlation, standard deviation, and RMSE on a two-dimensional graph. In a Taylor diagram, the radial distance from the origin and the azimuthal angle denote the standard deviation and the correlation coefficient, respectively, while the RMSE is proportional to the distance between the observed and simulated points and is expressed in the same units as the standard deviation. On the other hand, the accuracy matrix, recently proposed by Kardani et al. (2021a), was used to analyse the accuracy level of the developed models in the form of a heat-map matrix; the overall status of the developed models can be estimated from the colour variation of the performance parameters. Figures 9a, b and 10a, b present the Taylor diagrams and accuracy matrices, respectively, for the developed models.

Fig. 9
figure 9

Taylor diagrams: a training phase and b testing phase

Fig. 10
figure 10

Accuracy matrix: a for training results and b for testing results

Discussion

In the above sub-sections, the performance of the applied machine learning models in predicting the bearing capacity of pile foundations was analysed and presented. For this purpose, 50 dynamic pile load test records of concrete piles were collected from the literature and utilized. Five hybrid ANN models and three conventional soft computing models were employed to estimate the bearing capacity of the piles, and the employed models were evaluated using various statistical parameters. Based on the experimental results presented in the above sub-sections, the MARS model attained the highest prediction accuracy in the training phase (R2 = 0.9967, RMSE = 0.0155), while GP (R2 = 0.9859, RMSE = 0.0353) and ANN-PSO (R2 = 0.9773, RMSE = 0.0439) attained the most accurate results in the testing phase. Note that all the models were developed in the MATLAB environment (MATLAB 2015a) on a machine with an i3-8130U CPU @ 2.20 GHz and 12.00 GB RAM. The computational costs of the two best-performing models were 69.32 s (ANN-PSO) and 14.25 s (GP). It is pertinent to mention that a prediction model attaining higher accuracy in the testing phase should be accepted with more conviction; therefore, the ANN-PSO and GP models can be considered robust models for the analysis of piles.

Conclusion

Soft computing has transformed all sectors of engineering, and civil engineering is no exception. Soft computing models can potentially be used as an alternative to expensive and time-intensive field tests and inefficient numerical methods. This study presented a comparative assessment of five hybrid ANN models and three conventional soft computing models for estimating the bearing capacity of piles. For this purpose, 50 sets of dynamic pile testing data were collected from the available literature. The values of the statistical performance parameters, the regression curves, the Taylor diagrams, and the accuracy matrices suggest that the piles considered in the analysis can be considered safe against bearing capacity failure. The experimental results indicate that ANN-PSO and GP can estimate the bearing capacity of piles accurately in both the training and testing phases; a detailed review of the results reveals that ANN-PSO (R2 = 0.9773, RMSE = 0.0439) and GP (R2 = 0.9859, RMSE = 0.0353) showed comparatively better performance in the testing phase. The unique advantages of the proposed ANN-PSO model are its higher prediction accuracy, ease of implementation with the existing datasets, and high generalization capability. On the other hand, the predictive expression of GP can be used as a user-friendly equation to determine the bearing capacity of piles. Furthermore, the ANN-PSO and GP models proposed in this study could be used to analyse other civil engineering structures once the corresponding databases are prepared.