Introduction

Liquefaction is a rapid loss of shear strength in non-cohesive soils subjected to dynamic loading such as earthquakes. In some cases the shear strength drops to almost zero; in others it is significantly reduced. In both cases, liquefaction leads to severe ground-related problems (Coduto 2003). The characteristic feature of all liquefaction events is the build-up of excess pore water pressure under undrained loading conditions. It is well known that dry non-cohesive soils tend to densify under static and repeated loads. When cohesionless soils are saturated, however, loading occurs rapidly under undrained conditions, and the tendency of the soil to densify generates excess pore water pressure and reduces the effective stress (Kramer 1996).

Liquefaction is one of the most important and complex issues in geotechnical earthquake engineering. The destructive effects of liquefaction in the 1964 Alaska (Mw = 9.2) and Niigata (Ms = 7.5) earthquakes drew the attention of researchers to this phenomenon. The most striking liquefaction-induced damages in both earthquakes were slope failures, failures of bridge and building foundations, and flotation of embedded structures. After the subsequent earthquakes of 1971 San Fernando, 1976 Tangshan, 1985 Mexico City, 1989 Loma Prieta, 1995 Kobe, and 1999 Golcuk (Turkey), many researchers turned to studying the conditions that govern liquefaction. For this reason, determining the factors causing liquefaction, assessing the liquefaction potential in vulnerable areas, and predicting the resulting damage are among the most important research topics in geotechnical earthquake engineering.

The liquefaction potential depends on the geotechnical properties of the ground, topography, seismicity, groundwater level, and geological history (Youd and Perkins 1978). Various empirical methods based on experimental and probabilistic calculations have been developed to determine the liquefaction potential (Kramer and Mayfield 2007). Liquefaction potential can be determined by laboratory tests (cyclic triaxial, cyclic shear, and shaking table tests) and in situ tests (standard penetration test (SPT), cone penetration test (CPT), and seismic measurements) (Kramer 1996; Ishihara 1996; Liu and Qiao 1984; Elgamal et al. 1989; Lambe 1981; Husmand et al. 1988; Seed and Idriss 1971; Tokimatsu and Yoshimi 1983; Iwasaki et al. 1981; Suzuki et al. 1997; Robertson and Wride 1998; Stokoe et al. 1988; Andrus and Stokoe 2000). However, because laboratory tests are time-consuming and expensive, methods based on SPT and CPT data are generally preferred, and SPT-based methods in particular have been favored by geotechnical engineers for many years in liquefaction assessment. In these methods, the safety of the ground against liquefaction during an earthquake is evaluated by comparing the cyclic resistance ratio (CRR) with the cyclic stress ratio (CSR) (Seed and Idriss 1971; Youd et al. 2001; Cetin et al. 2004; Idriss and Boulanger 2006; Idriss and Boulanger 2010; Boulanger and Idriss 2012).

Recently, soft computing methods, especially artificial neural networks (ANNs), have become popular for practical solutions of geotechnical engineering problems such as the bearing capacity of shallow and pile foundations, slope stability, settlement behavior, and compressibility parameters of soils (Nejad et al. 2009; Lee and Lee 1996; Kiefa 1998; Sakellariou and Ferentinou 2005; Wang et al. 2005; Kuo et al. 2009; Abdalla et al. 2015; Chenari et al. 2015; Kalinli et al. 2011; Sulewska 2011; Chik et al. 2014). In addition, researchers have attempted to predict the liquefaction potential of soils using different artificial intelligence applications over the last 20 years (Goh 1994, 1996, 2002; Juang and Chen 1999; Rahman and Wang 2002; Baziar and Nilipour 2003; Kim and Kim 2006; Hanna et al. 2007; Chern et al. 2008; Ramakrishnan et al. 2008; Mughieda et al. 2009; Samui and Sitharam 2011; Karthikeyan et al. 2013; Muduli and Das 2015a; Muduli and Das 2015b; Erzin and Ecemis 2015; Xue and Xiao 2016; Xue and Liu 2017; Goharzaya et al. 2017; Hoang and Bui 2018).

Goh (1994) suggested ANN models to predict the liquefaction potential of soils based on actual field records using SPT data. Rahman and Wang (2002) developed fuzzy neural network models for the evaluation of liquefaction potential with large SPT-based databases of liquefaction case histories. Hanna et al. (2007) proposed a general regression neural network model to predict the liquefaction potential in soil deposits with SPT-based data, including field tests from the 1999 major earthquakes in Turkey and Taiwan. Ramakrishnan et al. (2008) proposed an ANN model to predict the liquefaction susceptibility of unconsolidated sediments using SPT field data. Samui and Sitharam (2011) proposed two machine learning methods, ANN and SVM, to predict the liquefaction susceptibility of soils based on SPT data from the 1999 Chi-Chi, Taiwan earthquake. Hoang and Bui (2018) proposed a novel soft computing model named KFDA-LSSVM (combining kernel Fisher discriminant analysis with a least squares support vector machine) to evaluate earthquake-induced soil liquefaction. They used three historical data sets, based on shear wave velocity, CPT, and SPT, including real cases of earthquake-induced soil liquefaction.

In this paper, an alternative and novel approach is proposed using the group method of data handling (GMDH), a type of ANN. The GMDH model was first proposed by Ivakhnenko (1971, 1976); the GMDH network is a self-organizing machine learning method. Being self-organizing, GMDH arrives at an optimal network by trying a number of architectures that depend on the number of input variables. Recently, the GMDH method has begun to be applied to some geotechnical problems (Kordnaeij et al. 2015; Ardakani and Kordnaeij 2017; Hassanlourad et al. 2017; Jirdehi et al. 2014). In this regard, a novel ensemble GMDH model (EGMDH) based on classification with different activation functions has been developed to best capture the relationship between input and output variables when predicting the liquefaction potential of soils, using SPT-based field data from two major earthquakes (Chi-Chi, Taiwan earthquake, 21.09.1999, Mw = 7.6, and Kocaeli, Turkey earthquake, 17.08.1999, Mw = 7.4). The results of the proposed EGMDH model were also compared with other classifier models: GMDH, artificial neural network (ANN) (Haykin 1994), support vector machine (SVM) (Cortes and Vapnik 1995), logistic regression (LR) (Le Cessie and Van Houwelingen 1992), and random forest (RF) (Ali et al. 2012).

Group method of data handling

The GMDH algorithm is a self-organizing approach based on evaluating performance on multiple input-single output data pairs. GMDH, proposed by Ivakhnenko in the 1970s (Vissikirsky et al. 2005), is an architectural class of polynomial neural network models. Since the GMDH network has a flexible structure, hybrid methods have been developed that combine it with heuristic methods such as genetic, evolutionary, and particle swarm optimization (Ghanadzadeh et al. 2012). The main idea of the GMDH model is to define an analytical function in a feed-forward neural network whose weights are obtained on a regression basis using quadratic neurons. In the GMDH network, neurons in one layer are connected to the next layer through second- and third-order polynomials, forming new neurons in the next layer. In this model, the input variables are mapped to the output variable. The goal of this mapping is to construct a function f() that estimates the output value \( \hat{y} \) from the input vector X = (X1, X2, X3, . . . , Xn) (Kordnaeij et al. 2015), with estimates as close as possible to the actual output values y. For multiple inputs and a single output, the function between them is expressed as follows (Ardakani and Kordnaeij 2017):

$$ {y}_i=f\left({x}_{i1},{x}_{i2},{x}_{i3},\dots, {x}_{in}\right)\kern0.75em \left(i=1,2,3,\dots M\right) $$
(1)

Thus, it is possible to estimate the output value \( \hat{y} \) from the input vector X = (Xi1, Xi2, Xi3, . . . , Xin), where yi is the dependent variable and xi are the independent variables. The prediction equation can be written as:

$$ {\hat{y}}_i=\hat{f}\left({X}_{i1},{X}_{i2},{X}_{i3},\dots, {X}_{in}\right)\kern0.75em \left(i=1,2,3,\dots M\right) $$
(2)

To solve this problem, GMDH generates the general relation between the output and input variables in the form of a mathematical definition, also referred to as a reference function. The aim is to minimize the difference between the actual output values and the estimated values:

$$ {\sum}_{i=1}^M{\left[\hat{f}\left({x}_{i1},{x}_{i2},{x}_{i3},\dots, {x}_{in}\right)-{y}_i\right]}^2\to \mathrm{minimum} $$
(3)

The general connection between input and output variables can be expressed as a complex discrete form of a series of Volterra functions as below (Ardakani and Kordnaeij 2017; Zhu et al. 2012):

$$ y={w}_0+{\sum}_{i=1}^n{w}_i{x}_i+{\sum}_{i=1}^n{\sum}_{j=1}^n{w}_{ij}{x}_i{x}_j+{\sum}_{i=1}^n{\sum}_{j=1}^n{\sum}_{k=1}^n{w}_{ij k}{x}_i{x}_j{x}_k+\cdots, $$
(4)

The above equation is known as the Kolmogorov-Gabor polynomial. GMDH uses a recursive polynomial regression procedure to synthesize this model: rather than fitting the full series directly, it builds a high-order polynomial from partial quadratic regressions of pairs of effective predictors (Eq. 5):

$$ Quadratic:\hat{y}=G\left({x}_i,{x}_j\right)={w}_0+{w}_1{x}_i+{w}_2{x}_j+{w}_3{x}_i{x}_j+{w}_4{x}_i^2+{w}_5{x}_j^2 $$
(5)

The mathematical relation between the input variables of the generated network and the output variable has the form of Eq. 4. The weights of the partial polynomial in Eq. 5 are calculated by regression so that, for each input pair xi and xj, the difference between the real output y and the estimate \( \hat{y} \) is minimized; the weights are obtained by the least squares method. In this way, the weighting coefficients of the quadratic function Gi are fitted optimally to the outputs over all input-output data pairs. During training, the GMDH model takes the input variables two at a time and builds a second-order polynomial equation (Eq. 5) for each pair, so that every input pair (attribute pair) forms one quadratic regression polynomial. For the first layer, L = m(m − 1)/2 regression polynomial equations are obtained, where L is the number of polynomial equations produced in a layer and m is the number of variables entering that layer. For example, if the number of input variables is m = 4, then L = 6 regression polynomial equations are obtained in the first layer. The outputs of these equations become new variables for the next layer, and new variables are generated in the same way in each subsequent layer; thus, the variables that best explain the output variable are generated from the input variables. The error in a given layer is expected to be smaller than the error in the previous layer; if the minimum error in the current layer is greater than that of the previous layer, adding layers only complicates the model, and network growth stops. The GMDH network architecture is given in Fig. 1.

Fig. 1 GMDH network architecture
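
As a concrete illustration, the following minimal Python sketch (our illustration, not code from the paper; the function names are ours) builds one GMDH layer: it fits the quadratic neuron of Eq. 5 to every pair of incoming variables by least squares, producing the L = m(m − 1)/2 candidate neurons described above.

```python
from itertools import combinations

import numpy as np

def quadratic_features(xi, xj):
    """Design matrix of Eq. 5: columns [1, xi, xj, xi*xj, xi^2, xj^2]."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def fit_neuron(xi, xj, y):
    """Fit the six weights w0..w5 of one quadratic neuron by least squares."""
    A = quadratic_features(xi, xj)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = np.mean((A @ w - y) ** 2)  # the error criterion of Eq. 6
    return w, mse

def build_layer(X, y):
    """One GMDH layer: a neuron for every pair of the m incoming variables."""
    m = X.shape[1]
    return [fit_neuron(X[:, i], X[:, j], y) for i, j in combinations(range(m), 2)]

# For m = 4 incoming variables this yields L = 4 * 3 / 2 = 6 candidate neurons.
# Layer growth stops when the best (smallest) error of a new layer is no longer
# smaller than that of the previous layer.
```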

Each input data pair forms a regression equation, and the outputs of these regression equations become new inputs to the next layer. The final output is composed of the regression equations selected across all layers. In the GMDH model, the aim is to minimize the sum of squared errors as specified in Eq. 6: the sum of the squared differences between the actual output values (yi) and the estimated values (Gi(xi, xj)) is expected to be smallest.

$$ E=\frac{\sum_{i=1}^M{\left({y}_i-{G}_i\left({x}_i,{x}_j\right)\right)}^2}{M}\rightarrow \mathrm{minimum} $$
(6)

The GMDH network is constructed using all possible binary combinations of the n input variables to build the polynomial regression equation (Eq. 4) that best predicts the dependent variable y by the least squares method. From the observed samples {(yi, xip, xiq), (i = 1, 2, 3, …, M)}, the first layer of the GMDH network is constructed using n(n − 1)/2 quadratic polynomial neurons.

$$ \left[\begin{array}{ccc} x_{1p} & x_{1q} & y_1 \\ x_{2p} & x_{2q} & y_2 \\ \dots & \dots & \dots \\ x_{mp} & x_{mq} & y_m \end{array}\right] $$
(7)

Here, p and q denote any two of the variables entering the layer. Using the input-output variables above, Eq. 5 can be written in matrix form as follows:

$$ AW=Y $$
(8)

where W is the vector of unknown weight coefficients of the quadratic polynomial and Y is the vector of output values:

$$ W={\left\{{w}_0,{w}_1,{w}_2,{w}_3,{w}_4,{w}_5\right\}}^T $$
(9)
$$ Y={\left\{{y}_1,{y}_2,{y}_3,{y}_4,{y}_5,\dots, {y}_M\right\}}^T $$
(10)
$$ A=\left[\begin{array}{cccccc} 1 & x_{1p} & x_{1q} & x_{1p}x_{1q} & x_{1p}^2 & x_{1q}^2 \\ 1 & x_{2p} & x_{2q} & x_{2p}x_{2q} & x_{2p}^2 & x_{2q}^2 \\ \dots & \dots & \dots & \dots & \dots & \dots \\ 1 & x_{mp} & x_{mq} & x_{mp}x_{mq} & x_{mp}^2 & x_{mq}^2 \end{array}\right] $$
(11)

The weights are obtained from the multiple regression equations in matrix form by the least squares solution:

$$ W={\left({A}^TA\right)}^{-1}{A}^TY $$
(12)

where W is the weight vector to be estimated, A is the input matrix, and Y is the output vector. The flowchart of the GMDH algorithm is shown in Fig. 2.

Fig. 2 The flowchart for the GMDH algorithm
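
The matrix solution of Eqs. 8-12 can be checked numerically in a few lines. The snippet below is an illustrative sketch on synthetic data, not the authors' code: it builds the design matrix A of Eq. 11 for one input pair (p, q) and solves the normal equations of Eq. 12.

```python
import numpy as np

rng = np.random.default_rng(0)
xp, xq = rng.normal(size=50), rng.normal(size=50)  # one input pair (p, q)
Y = 1.0 + 2.0 * xp - 0.5 * xq + 0.3 * xp * xq      # synthetic output for the demo

# Design matrix A of Eq. 11 and the normal-equations solution of Eq. 12
A = np.column_stack([np.ones_like(xp), xp, xq, xp * xq, xp**2, xq**2])
W = np.linalg.inv(A.T @ A) @ A.T @ Y               # W = (A^T A)^{-1} A^T Y
```

In practice `np.linalg.lstsq(A, Y)` is numerically safer, since it avoids explicitly inverting A^T A, which can be ill-conditioned.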

Ensemble GMDH model

The main goal in ensemble classification is to produce a result by combining the outputs of different classifiers. The combination consists of training the classifiers separately, possibly on resampled training sets, and then performing the classification according to their joint estimates. In general, the classification accuracy of the combined classifier is better than that of each classifier used singly: although a single classifier can have a higher test error, the diversity of the classifiers usually compensates for the mistakes of any individual one, so fewer test errors are obtained with the combination (Pal and Mather 2003). During this process, the combined estimate is computed by assigning certain weights to the individual classifiers. The main problem is deciding which classification algorithms to combine and in what proportions. The chief advantage is that better values can be obtained because the outputs of the different methods are used together (Augusty and Izudheen 2013).

In the current study, GMDH models were ensembled using different activation functions under otherwise identical conditions (learning rate, number of hidden layers, weights, number of neurons per hidden layer). Activation functions are used to better capture the relationship between input and output (Kondo and Ueno 2012). These activation functions are given below:

$$ Sigmoid=\frac{1}{1+{e}^{-y}} $$
(13)
$$ Radial\ Basis={e}^{-{y}^2} $$
(14)
$$ Polynomial=y $$
(15)
$$ Tangent=\tanh (y) $$
(16)
$$ Sine=\sin (y) $$
(17)

The diagram of the proposed ensemble GMDH (EGMDH) model is shown in Fig. 3. The outputs of five GMDH models operated under the same conditions are produced with different activation functions. Each model makes its own output decision for a given sample, and the output of EGMDH is the collective decision of these five models.

Fig. 3 The algorithm of the EGMDH model
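
A minimal sketch of the combination step, assuming five already-trained GMDH classifier objects exposing a hypothetical `predict` method returning 0/1 labels (an assumed interface, not the paper's implementation): the majority vote implements the collective decision described above.

```python
import numpy as np

def egmdh_predict(models, X):
    """Majority vote over the 0/1 outputs of the five GMDH classifiers."""
    votes = np.stack([m.predict(X) for m in models])  # shape: (5, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)    # the collective decision
```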

Performance criteria

In assessing the performance of classification models in machine learning, the confusion matrix, which compares the actual and predicted values, is frequently used (Fig. 4) (Kaya 2013). Accuracy, precision, recall, and the F-criterion were used to demonstrate the performance of the methods proposed in this study. These success criteria are calculated from the confusion matrix as follows:

Fig. 4 The simplified confusion matrix

$$ Accuracy=\frac{TP+ TN}{TP+ TN+ FP+ FN} $$
(18)
$$ Precision= TP/\left( TP+ FP\right) $$
(19)
$$ Recall= TP/\left( TP+ FN\right) $$
(20)
$$ F- criterion=2\left( Recall\times Precision\right)/\left( Recall+ Precision\right) $$
(21)

In these equations, T, F, P, and N denote true, false, positive, and negative, respectively. For example, TP is the number of positive samples correctly classified, while FN is the number of positive samples misclassified as negative.

Accuracy is the most popular and simplest measure of success and is defined as the ratio of the number of correctly classified samples (TP + TN) to the total number of samples (TP + TN + FP + FN). Precision reflects the exactness of the classifier and is defined as the ratio of the number of true positives (TP) to the total number of samples classified as positive (TP + FP). Recall is the ratio of the true positives (TP) to the total number of actually positive samples (TP + FN). The F-criterion is calculated from the precision and recall metrics and is used to tune the system towards precision or recall.
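
For illustration, Eqs. 18-21 can be computed directly from the four confusion-matrix counts; the example counts below reproduce the EGMDH test result reported later in the "Results" section (88 non-liquefied samples all correct, 46 of 47 liquefied samples correct).

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. 18-21 computed from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_criterion = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_criterion

# Accuracy = 134/135 = 0.993, i.e. the 99.3% reported for EGMDH
print(classification_metrics(tp=46, tn=88, fp=0, fn=1))
```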

Data processing

The database used in the present study was compiled for predicting the presence of liquefaction by the EGMDH model in SPT-based liquefaction assessment. In this context, 451 SPT-based field data from two major earthquakes in 1999 were used; both sets of case records were obtained from Hanna et al. (2007). Of these, 239 case records belong to the Chi-Chi, Taiwan earthquake (Mw = 7.6) and 212 to the Kocaeli, Turkey earthquake (Mw = 7.4); 309 of the records were classified as non-liquefied and 142 as liquefied. The input parameters used in the EGMDH model are the corrected SPT blow count (N1,60), fines content finer than 75 μm (F ≤ 75 μm, %), depth of the groundwater table (dw), total and effective overburden stresses (σvo, σ′vo), peak ground acceleration (amax), and earthquake magnitude (Mw), and the output is the occurrence of liquefaction.

SPT-based liquefaction assessment

Liquefaction depends on many factors, such as particle size and distribution, geological age and sedimentation conditions, volume change potential, permeability, water table level, earthquake magnitude and duration, and distance to the epicenter. In general, loose, water-saturated sandy soils are the most susceptible to liquefaction during large earthquakes (Kramer 1996; Coduto 2003). Liquefaction can occur only if all contributing factors, such as loose soil, water saturation, and an earthquake of sufficient magnitude and duration, are present at the same time.

In the literature, the most important study investigating the liquefaction potential of soils is the "simplified procedure" presented by Seed and Idriss (1971). Seed and Idriss (1971) expressed the liquefaction potential of soils essentially through two parameters. The first is the cyclic stress ratio (CSR), which indicates the level of cyclic loading that can be caused by the earthquake, and the second is the cyclic resistance ratio (CRR), which indicates the resistance of the soil against liquefaction. The cyclic stress ratio generated during earthquakes (CSR) is defined as in Eq. 22.

$$ CSR=0.65\times \frac{a_{max}}{g}\times \frac{\sigma_v}{\sigma_v^{\prime }}\times {r}_d $$
(22)

Here, amax is the peak horizontal acceleration at the ground surface during the earthquake, g the gravitational acceleration, σv and \( {\sigma}_v^{\prime } \) the total and effective vertical overburden stresses, and rd the stress reduction coefficient. In engineering applications, average values of rd depending on the depth z are used, as in Eq. 23.

$$ {r}_d=\left\{\begin{array}{c}1.0-0.00765z,\kern0.5em z\le 9.15\ \mathrm{m}\\ {}1.174-0.0267z,\kern0.5em 9.15<z\le 23\ \mathrm{m}\end{array}\right. $$
(23)

To determine the cyclic resistance ratio (CRR) for a magnitude 7.5 earthquake, Youd et al. (2001) suggested the following equation:

$$ CRR=\frac{1}{34-{(N1)}_{60}}+\frac{(N1)_{60}}{135}+\frac{50}{{\left[10\times {(N1)}_{60}+45\right]}^2}-\frac{1}{200} $$
(24)

The corrected SPT-N values used in the liquefaction analysis are further adjusted as follows to account for the effect of the fines content (FC) on liquefaction resistance:

$$ {(N_1)}_{60,\mathrm{CS}}=\alpha +\beta {(N_1)}_{60} $$
(25)
$$ \alpha =0\ \mathrm{and}\;\beta =1\ \mathrm{for}\ \mathrm{FC}\le 5\% $$
(26a)
$$ \alpha =\exp \left(1.76-\frac{190}{FC^2}\right)\kern0.5em \mathrm{and}\kern0.5em \beta =\left[0.99+\frac{FC^{1.5}}{1000}\right]\kern0.5em \mathrm{for}\kern0.37em 5\%<\mathrm{FC}<35\% $$
(26b)
$$ \alpha =5.0\kern0.5em \mathrm{and}\kern0.5em \beta =1.2\kern0.5em \mathrm{for}\ \mathrm{FC}\ge 35\% $$
(26c)

Here, α and β are the fines content correction coefficients, and the subscript CS denotes the clean sand equivalent value.

The safety factor against liquefaction is defined as FS (Eq. 27). If the safety factor is less than 1, the zone carries a risk of liquefaction; if it is greater than 1, the zone does not.

$$ \mathrm{FS}=\frac{\mathrm{CRR}}{\mathrm{CSR}} $$
(27)
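
To make the simplified procedure concrete, the following sketch chains Eqs. 22-27 for a single soil layer. The numerical inputs are assumed example values for illustration, not a record from the database used in this study.

```python
import math

def stress_reduction(z):
    """Eq. 23: stress reduction coefficient rd for depth z (m), z <= 23 m."""
    return 1.0 - 0.00765 * z if z <= 9.15 else 1.174 - 0.0267 * z

def csr(a_max_over_g, sigma_v, sigma_v_eff, z):
    """Eq. 22: cyclic stress ratio."""
    return 0.65 * a_max_over_g * (sigma_v / sigma_v_eff) * stress_reduction(z)

def n1_60_cs(n1_60, fc):
    """Eqs. 25-26: clean sand correction for fines content FC (%)."""
    if fc <= 5:
        alpha, beta = 0.0, 1.0
    elif fc < 35:
        alpha = math.exp(1.76 - 190 / fc**2)
        beta = 0.99 + fc**1.5 / 1000
    else:
        alpha, beta = 5.0, 1.2
    return alpha + beta * n1_60

def crr(n):
    """Eq. 24: cyclic resistance ratio for corrected blow count n < 34."""
    return 1 / (34 - n) + n / 135 + 50 / (10 * n + 45) ** 2 - 1 / 200

# Assumed example values: N1,60 = 12, FC = 15%, z = 6 m, amax = 0.4 g,
# sigma_v = 110 kPa, sigma_v' = 70 kPa
fs = crr(n1_60_cs(12, fc=15)) / csr(0.4, 110, 70, 6.0)
print(f"FS = {fs:.2f} -> {'liquefaction risk' if fs < 1 else 'no liquefaction risk'}")
```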

Results

GMDH model

The occurrence of liquefaction in soils was first estimated using the GMDH algorithm. GMDH is a nonlinear regression method, but it is also a model that carries the characteristics of supervised and unsupervised artificial neural networks (ANNs). Regression is a statistical model that examines the cause-and-effect relationship between independent and dependent variables; a linear regression model describes the relationship between one or more independent variables and a dependent variable. Trials with the GMDH model were conducted for different training-test set ratios, and the performance results are given in Table 1.

Table 1 Success rates of the GMDH model according to different training-test ratios

Since there is no criterion in the literature as to which training-test split should be used, the authors experimented with training-test sets at different ratios. As can be seen in Table 1, trials were conducted for 50-50%, 60-40%, and 70-30% training-test splits. The highest success, 97.00%, was achieved for the 70-30% training-test data set. The performance measures are calculated from the confusion matrix; the confusion matrix for the 70-30% training-test set is given in Fig. 5, where it can be seen that only 4 samples are misclassified in both the training and test sets.

Fig. 5 Confusion matrix by GMDH for 70–30% training-test data. a Training data. b Test data

The GMDH can be built with different numbers of layers and different numbers of neurons in each layer. The performance measures obtained from trials with different numbers of hidden layers in the GMDH architecture are given in Table 2. Since the number of input variables is low, the trials were performed for 1, 2, and 3 hidden layers. These trials show that the success rate increases as the number of hidden layers increases; the highest success is achieved with 3 hidden layers (Table 2).

Table 2 GMDH success rates for different numbers of hidden layers with 70–30% training-test set (with 10 neurons)

The performance measures obtained from trials with different numbers of neurons in the hidden layers of a 3-layered GMDH model are given in Table 3. As can be seen in Table 3, the performance did not change for 5, 10, and 15 neurons in the hidden layers. The best success rate for the test set, 97.00%, was obtained with 5 neurons per hidden layer.

Table 3 GMDH success rates of different neurons in hidden layers with 70–30% training-test set (with 3 hidden layers)

EGMDH model

In this paper, a new GMDH-based approach was proposed for predicting soil liquefaction. A novel ensemble GMDH (EGMDH) model with different activation functions was developed, by modifying the GMDH algorithm, to best capture the relationship between the input and output variables. A GMDH model was trained for each of the sigmoid, radial basis, sine, tangent, and polynomial functions, and the individual GMDH classifiers were then combined to produce a common output. In general, the classification accuracy of the combined classifier is better than that of each classifier used singly, because the diversity of the classifiers usually compensates for the mistakes of a single classifier, which can have a higher test error on its own; thus, less test error is achieved with the combination of classifiers. The success rates obtained with the EGMDH model for the 70-30% training-test set are given in Table 4.

Table 4 EGMDH model success rates for different numbers of the hidden layers with 70–30% training-test set (with 10 neurons)

Examination of Table 4 shows that the EGMDH model is more successful than the GMDH model in estimating soil liquefaction: a high classification success rate of 99.30% was obtained with EGMDH. The output confusion matrix for the EGMDH model is given in Fig. 6. As can be seen in Fig. 6b, only one sample of the "1" (liquefied) state is misclassified, while all samples of the "0" (non-liquefied) state are correctly classified.

Fig. 6 Confusion matrix by EGMDH for 70–30% training-test data. a Training data. b Test data

Discussion

This study aimed to develop a novel prediction model for the liquefaction potential of soils using an ensemble group method of data handling (EGMDH) algorithm based on the GMDH model. To this end, the GMDH model was converted into an ensemble model over different activation functions. The main goal of ensemble classification is to achieve a result by combining the values obtained by different classifiers: the classifiers are trained separately, possibly on resampled training sets, and the classification is then performed according to their joint estimates. The accuracy of the combined classifier is better than that of each classifier used singly.

In total, 451 SPT-based field records obtained from two major earthquakes were used for the prediction models. The success rate of liquefaction prediction achieved with the GMDH model was 97.00%, which increased to 99.30% with EGMDH. The EGMDH model was also compared with different classifier models, namely ANN, SVM, LR, and RF; the performance values of all models are shown in Table 5. It is evident that the proposed EGMDH model outperforms the other classifier models.

Table 5 Comparison of EGMDH with other classifier models

There are 88 "non-liquefaction" and 47 "liquefaction" cases in the test set. As seen in Table 6, both the GMDH and EGMDH models were considerably more capable in predicting the "liquefaction" cases. The data of all cases in the test set and the estimation results of the models are given in Table 7; the proposed EGMDH model has only one false estimate of liquefaction status.

Table 6 The performance of the models on the prediction of field cases
Table 7 Comparison of actual and predicted liquefaction records

As mentioned in the "Data processing" section, the data used in this study were obtained from the study of Hanna et al. (2007), who proposed a GRNN model for SPT-based liquefaction assessment. The success of the GRNN model was 92.9% for the test set, 94.7% for the forecast set, and 97% for the total data set. The proposed EGMDH model, with a 99.3% success rate, thus achieved higher success than the GRNN model.

The results achieved by the proposed EGMDH model also compare well with other artificial intelligence (AI) studies on predicting liquefaction status in the literature. The success rates of some of these studies are summarized below. Rahman and Wang (2002) proposed a fuzzy neural network model for SPT-based liquefaction prediction with 205 field records; they used 27 cases for testing and achieved an 81.5% success rate, with five misclassified cases. Ramakrishnan et al. (2008) proposed an SPT-based ANN model for predicting the liquefaction susceptibility of unconsolidated sediments; they used 23 case records, tested the model with 5 cases, and achieved a success rate of 99.9%. However, unlike similar studies, the input parameters of that model were the liquefaction severity index, the liquefaction sensitivity index, and estimated CRR and CSR values. Samui and Sitharam (2011) proposed ANN and SVM models to predict the liquefaction susceptibility of soils based on SPT data from 288 case records, using only two input parameters in the models (CSR and N1,60, or PGA and N1,60). The test-set performances were between 87.2% and 88.37% for the ANN model and between 94.19% and 95.35% for the SVM model. Muduli and Das (2015a) studied the uncertainty of the SPT-based method for assessing seismic soil liquefaction potential using multi-gene genetic programming (MGGP); the statistical performance of the developed "best" MGGP-based CRR model was R = 0.96 for training and R = 0.98 for testing. Hoang and Bui (2018) proposed a novel soft computing model named KFDA-LSSVM (combining kernel Fisher discriminant analysis with a least squares support vector machine) for predicting shear-velocity-, CPT-, and SPT-based soil liquefaction. The results of the proposed KFDA-LSSVM were compared with benchmark models including LSSVM, extreme learning machine (ELM), and support vector machine (SVM); the success rates obtained in the SPT-based prediction were 84.95% for KFDA-LSSVM, 84.06% for LSSVM, 82.63% for SVM, and 80.05% for ELM.

The abovementioned studies were performed with different or the same case records, different input parameters and numbers of parameters, and different methods. What their achievements have in common is that such methods offer a good alternative to traditional calculation methods for determining the liquefaction susceptibility of soils. It is clear that the EGMDH model proposed in the present study can likewise be used as an effective alternative model for predicting liquefaction potential.

Sensitivity analysis

A sensitivity analysis of the proposed model was carried out to evaluate the influence of the input parameters on the model output. Sensitivity analysis concerns the selection of appropriate parameters for a classification algorithm; the parameters in a data set are among the most important factors affecting classification performance. Too few parameters may, in some cases, prevent the classes from being properly separated, while too many lead to problems such as increased training time and a decreased accuracy rate. It is therefore necessary to determine the correct number of parameters. Given the large number of samples in the data sets, performing this assessment manually is practically impossible, and different approaches have been proposed for parameter selection (Das and Basudhar 2008). In the present study, the data set consists of 8 input parameters and one output parameter. The InfoGainParameterEval and ChiSquaredParameterEval weighting methods were used to determine the effect of the input parameters on the output parameter. The InfoGainParameterEval method uses the information gain between each input parameter and the output parameter (Lee and Lee 2006), while the ChiSquaredParameterEval method uses the chi-square statistic between the input parameters and the output parameter (Aggarwal 2013). The weight values obtained for each parameter are given in Table 8. According to both methods, the most effective input parameter is the effective overburden stress (σ′), while the least effective is the depth of the groundwater table (dw) (Table 8).

Table 8 Sensitivity analysis results
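
As a rough open-source analogue of this weighting, scikit-learn's `mutual_info_classif` (an information-gain-style score) and `chi2` can rank input parameters against the liquefaction label. The sketch below uses synthetic placeholder data; it only illustrates the procedure and does not reproduce the Table 8 weights or the original evaluators.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(1)
X = rng.uniform(size=(451, 8))    # placeholder for the 8 input parameters
y = rng.integers(0, 2, size=451)  # placeholder liquefied / non-liquefied labels

info_gain = mutual_info_classif(X, y, random_state=0)  # information-gain analogue
chi2_scores, p_values = chi2(X, y)                     # chi2 needs non-negative X

# Rank parameters by weight, as in Table 8 (meaningless here on random data)
print(np.argsort(info_gain)[::-1], np.argsort(chi2_scores)[::-1])
```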

Conclusions

In this paper, an ensemble model based on a GMDH-type neural network was proposed for SPT-based liquefaction assessment. The proposed novel approach achieved successful results, with almost 100% accuracy, in predicting the liquefaction potential of soils. All the models used in the present study, including the proposed ensemble model, were considerably more effective in predicting the "liquefaction" cases than the "non-liquefaction" cases. Although there are many studies in the literature on liquefaction prediction with different artificial intelligence techniques, the authors believe that new models for predicting the liquefaction phenomenon, such as the EGMDH model proposed in this study, will continue to be developed.