1 Introduction

The transportation system is an essential component of the socio-economic development of a nation (Alam et al. 2020). Highways are infrastructure facilities that connect greenfield developments. The most important elements of highway construction are the mechanical properties of the soil and the design loads (Reddy et al. 2019). The California bearing ratio (CBR) is a soil parameter widely used in designing flexible pavements in India. The CBR of soil is determined experimentally as per IS 2720 (P-16): 1979 to design the sub-base and base course. The CBR test yields an empirical value, which indicates the shear stress in the subgrade, sub-base, and base. Finally, the CBR test gives the soil’s mechanical strength and bearing value. The CBR test is of two types, soaked and unsoaked. In the soaked CBR test, a soil sample prepared at optimum moisture content is soaked for 96 h, and a 50 mm diameter plunger then penetrates the soil at a rate of 1.25 mm/min (IS 2720: 1979, P-16 2022). Determining the unsoaked CBR value is easier and less time-consuming than determining the soaked CBR. Still, both procedures are time-consuming, and inaccurate results are sometimes obtained because of improper maintenance of the CBR equipment and sample disturbance (Taskiran, 2010). Therefore, prediction models or AI approaches are applied to compute the CBR value of fine-grained soils because of their complex behaviour (González et al. 2018). Correlation equations are the most straightforward approach for determining the CBR of sub-grade materials (Taskiran, 2010). Numerous researchers have attempted to relate soaked CBR to index parameters (gravel, sand, fine content, D10, D30, D50, D60), consistency limits (liquid limit, plastic limit, plasticity index), compaction parameters (optimum moisture content, maximum dry density), specific gravity, and compaction energy.
The researchers have derived regression equations and concluded that the soil parameters LL, PI, PL, OMC, MDD, FC, G, S, M, and C affect the prediction of the CBR (Taskiran, 2010; Venkatasubramanian et al. 2011; Datta et al. 2011; Harini HN et al. 2014; Phani Kumar et al. 2015; Janjua et al. 2016; Pradeep Kumar et al. 2016; Chandrakar et al. 2016; Bourouis et al. 2016; Reddy et al. 2019). These parameters have been used as input parameters to construct high-performance ANN models, which have been compared with MLR, GEP, and OLS models.

Taskiran (2010) has compared GEP and ANN models and has reported that GEP predicts CBR better than ANN, with a coefficient of determination (COD) of 0.918. On the other hand, Ajalloeian et al. (2015) have concluded that the ANN model has performed better than the OLS approach. Venkatasubramanian et al. (2011) have compared ANN and MLR in predicting soil CBR and observed that MLR has predicted better than ANN. In addition, the authors have concluded that a large data set can improve the performance of ANN models. Therefore, many researchers have used large data sets to develop ANN models and compare them with MLR.

Furthermore, Harini et al. (2014) and Pradeep Kumar et al. (2016) have concluded that the ANN model has outperformed the MLR model in predicting CBR. Rehman et al. (2017a; b) have reported that D50 and CU are the most influencing parameters for the CBR of granular soil. Rehman et al. (2017a; b) have also concluded that the combined LL and PI predict the soaked CBR of soil with a performance (R²) of 0.9. González et al. (2018) have estimated CBR using MLR, polynomial regression, and radial basis network models. The authors have suggested that soil can be divided into two groups based on the gravel content to predict CBR better. Ravichandra et al. (2019) have concluded that MDD is the most influencing parameter in computing the soaked CBR. The GMDH-NN model predicts the CBR of soil better than the ANN and MLR models (Kurnaz et al. 2019). Alam et al. (2020) have reported that the Kriging method predicts the CBR of soil better than the ANN and GEP approaches. Vu et al. (2021) have concluded that the random forest model can predict the CBR of soil with a testing performance (R²) of 0.92.

In addition, many authors have used the field test data to predict the CBR of soil. For example, Al-Refeai et al. (1997) have successfully computed the CBR of fine-grained soil using gravel content and DCP parameters. Gül et al. (2021) have estimated the CBR from field measurements, such as laboratory soil and SPT-N parameters. The study of statistical relationships has revealed that the grain size, stiffness–looseness, DUW, and OMC influence the CBR of soil. In addition, it has also been observed that the ground vibration test and SPT predict CBR with high performance.

The literature study shows that simple linear, polynomial, and logistic regression analysis, multiple regression analysis, and artificial neural network approaches can predict the soaked CBR of fine-grained soil. Furthermore, many researchers have concluded that the G, S, FC, LL, PL, PI, OMC, and MDD of fine-grained soil play an important role in predicting soaked CBR. However, the effect of the correlation coefficient and the size of the training data set on the prediction of soaked CBR of fine-grained soil has not been studied. In addition, hybrid learning approaches (relevance vector machine, least-square support vector machine, etc.) have not been applied to predict soaked CBR. Therefore, the following objectives are set for the present research work based on the literature study.

  • To develop and train the gene expression programming and GA/PSO optimized RVM models with 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets.

  • To study the effect of optimization techniques (GA and PSO) on the performance of single-kernel-based RVM (SRVM) models in predicting soaked CBR of fine-grained soils.

  • To evolve dual kernel-based RVM (HRVM) models using a better-performing kernel (determined from each GA and PSO optimized SRVM model) and train the proposed HRVM models using 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets.

  • To identify the best architectural model for predicting soaked CBR by mapping a comparative study between better-performing GA optimized SRVM, PSO optimized SRVM, GA optimized HRVM, PSO optimized HRVM, and GEP.

  • To determine the effect of the correlation coefficient and training data set on performance and overfitting of the developed RVM and GEP models.

  • To study the effect of multicollinearity in the training data sets on the performance and overfitting of the developed RVM and GEP models.

  • To determine the sensitivity of input parameters for soaked CBR using the cosine amplitude method.

2 Data analysis

This section briefly describes data sources, frequency distribution and histogram, Pearson's product-moment correlation coefficient, and statistics of collected data sets.

2.1 Data sources

The collected data set includes gravel, sand, fine content, liquid limit, plastic limit, plasticity index, optimum moisture content, maximum dry density, and soaked CBR of fine-grained soil. The soil data sets of soaked CBR have been collected from various sources, as given in Table 1.

Table 1 Details of data sources of soaked CBR

Table 1 shows that 182 training data sets have been collected from different published research articles to train the RVM and GEP models. In addition, 15 and 36 data sets have been collected from published articles to test and validate the trained AI models, respectively. Based on the performance comparison, the best architecture model has been determined to predict the soaked CBR. Furthermore, 12 fine-grained soil samples have been collected from in and around Kota, Rajasthan, and tested in the laboratory to cross-validate the performance and prediction of the best architecture model.

2.2 Frequency distribution and histogram of data sets

The gradational parameters, LL, PL, PI, OMC, and MDD, are the input parameters of models to predict soaked CBR. The frequency distribution of input parameters of soaked CBR data sets is shown in Fig. 1a–i.

Fig. 1 Frequency distribution of input parameters of soaked CBR data sets

2.3 Descriptive statistics

The present study has 182 training, 15 testing, 36 validation, and 12 cross-validation data sets of fine-grained soil. In addition, the training data sets have been subdivided into 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets. The statistical parameters, minimum, maximum, mean (average), standard deviation (St. Dev), and confidence level (CL) at 95% are calculated for 50%, 60%, 70%, 80%, 90%, and 100% training data sets, as given in Table 2.

Table 2 Descriptive statistics for training data of soaked CBR

2.4 Pearson's product-moment correlation coefficient

In the present study, Pearson's product-moment correlation coefficient has been used to quantify the relationship between input and output parameters. The level of relationship in terms of the correlation coefficient is given in Table 3 (Hair et al. 2010).

Table 3 Level of relationship vs value of correlation coefficient
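
As an illustration of how such a coefficient is obtained, the following minimal sketch computes Pearson's product-moment correlation coefficient between one input parameter and soaked CBR. The function name `pearson_r` and the sample values are hypothetical and are not taken from the data sets of this study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient of two samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


# Hypothetical illustrative values (NOT data from this study).
gravel_content = [2.0, 5.0, 8.0, 12.0, 15.0]
soaked_cbr = [3.1, 4.0, 5.2, 6.9, 8.1]

r = pearson_r(gravel_content, soaked_cbr)
print(f"r = {r:.4f}")
```

The value of `r` always lies between −1 and 1, and its magnitude can be read against the levels of relationship listed in Table 3.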

Furthermore, the correlation coefficients are calculated for 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets, as shown in Fig. 2a–f.

Fig. 2 Relationship between input and output parameters for different training data sets

Figure 2a–f depicts the relationship between input and output parameters for 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets. Figure 2 demonstrates that the gravel content very strongly correlates with soaked CBR, whereas the fine content very strongly correlates with maximum dry density. The liquid limit of fine-grained soil very strongly correlates with PL, PI, and OMC. It has also been observed that the sand content is strongly correlated with consistency limits. The compaction parameters (OMC and MDD) and soaked CBR (CBRs) are strongly influenced by the gradational parameters and consistency limits of fine-grained soils. The plasticity index strongly influences the optimum moisture content and maximum dry density. Fine content strongly influences consistency limits, and liquid limit strongly influences soaked CBR. The gravel content moderately correlates with liquid limit, plasticity index, and optimum moisture content of fine-grained soil, and the plastic limit moderately influences soaked CBR. The correlation coefficient for soaked CBR shows that the sand content has no relationship with soaked CBR of fine-grained soil.

Figure 2a demonstrates that the sand content has multicollinearity of −0.6644 and −0.6700 with fine content and plastic limit, respectively. On the other hand, the OMC has multicollinearity with the plastic limit (0.7751) and the plasticity index (0.7410) of soil. Moreover, the liquid limit and PI have multicollinearity with soaked CBR. Figure 2b shows that the gravel content has multicollinearity with PL (−0.4525) and PI (−0.4830). On the other hand, sand content has multicollinearity with fine content (−0.6684), LL (−0.6856), PL (−0.4535), and PI (−0.6052). Furthermore, FC has multicollinearity with PL (0.7461), PI (0.7511), and CBRs (−0.7788). In addition, the liquid limit has multicollinearity with PL (0.8111) and OMC (0.8321). Moreover, the soaked CBR has multicollinearity with liquid limit and OMC.

Figure 2c shows that multicollinearity is present between sand content and liquid limit (−0.6892), PL (−0.6574), PI (−0.5346), and OMC (−0.5543). Furthermore, maximum dry density has multicollinearity of −0.8192 with FC and −0.7949 with LL. In addition, optimum moisture content has multicollinearity with PL (0.6933) and PI (0.7198) of fine-grained soil. The correlation coefficients for CBRs show that CBRs is influenced by its multicollinearity with the liquid limit (−0.5690) and OMC (−0.5978) of fine-grained soil. Figure 2d illustrates that fine content has multicollinearity with gravel (−0.7336) and sand (−0.7059) content. The plastic limit and plasticity index have multicollinearity of −0.4407 and −0.4614, respectively, with gravel content. On the other hand, the liquid limit and plastic limit have multicollinearity of −0.6440 and −0.6230, respectively, with sand content. In addition, the plasticity index (−0.5032) and OMC (−0.5335) have multicollinearity with sand content. However, the liquid limit and plastic limit determine the plasticity index of soil, which develops a relationship (PI = LL − PL) between plastic limit and plasticity index. Thus, Fig. 2d shows that the plastic limit and plasticity index have multicollinearity of 0.7020 and 0.7259, respectively, with the optimum moisture content of fine-grained soils. In addition, the liquid limit (−0.5751) and plasticity index (−0.5343) have multicollinearity with the soaked CBR of fine-grained soil.

Figure 2e depicts that the gravel content has multicollinearity with the plastic limit (−0.4236) and plasticity index (−0.4485). Furthermore, liquid limit (−0.6719), plastic limit (−0.6444), plasticity index (−0.5359), and OMC (−0.5478) have multicollinearity with the sand content present in the fine-grained soils. Figure 2e also shows that the soaked CBR of fine-grained soils has multicollinearity with liquid limit and plasticity index. Figure 2f demonstrates that the fine content has multicollinearity with gravel (−0.7067) and sand (−0.7222) content. In addition, gravel content has multicollinearity with the plastic limit (−0.4128) and the plasticity index (−0.4332). On the other hand, the sand content has multicollinearity with the plastic limit (−0.5119) and optimum moisture content (−0.5271). Moreover, the OMC has multicollinearity of 0.6959 and 0.7308 with plastic limit and plasticity index, respectively. In addition, it has been found that the liquid limit and plasticity index influence the prediction of soaked CBR.
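
The pairwise screening described above can be sketched as follows. This is only an illustration: the helper `collinear_pairs`, the threshold of 0.7, and the sample columns are assumptions made for the example, and the study itself does not state a single cut-off value (it reports pairs with coefficients of about 0.41 and above).

```python
import numpy as np


def collinear_pairs(data, names, threshold=0.7):
    """List parameter pairs whose absolute Pearson coefficient is at or above
    a chosen threshold (the threshold is an assumption, not from the study)."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Hypothetical columns (NOT data from this study): "b" is exactly 2 * "a",
# while "c" is uncorrelated with "a".
names = ["a", "b", "c"]
data = [
    [1.0, 2.0, 1.0],
    [2.0, 4.0, -1.0],
    [3.0, 6.0, -1.0],
    [4.0, 8.0, 1.0],
]

for first, second, coefficient in collinear_pairs(data, names):
    print(f"{first} vs {second}: r = {coefficient:+.4f}")
```

Applied to a real training subset, the rows would be soil samples and the columns would be G, S, FC, LL, PL, PI, OMC, and MDD.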

3 Methodology used in the present study

The relevance vector machine and gene expression programming artificial intelligence approaches have been used in the present research work to predict soaked CBR of fine-grained soils.

3.1 Relevance vector machine

A relevance vector machine is a hybrid artificial intelligence approach, and this approach is an advanced version of the support vector machine. RVM gives the most promising results, because it uses a Bayesian inference system (Tipping 2001):

$$k\left( {x,x^{\prime}} \right) = \mathop \sum \limits_{j = 1}^{N} \frac{1}{{\beta_{j} }}\varphi \left( {x,x_{j} } \right)\varphi \left( {x^{\prime},x_{j} } \right).$$
(1)

The relevance vector machine models use Linear, Gaussian, Polynomial, and Laplacian kernels in the present study. The mathematical expressions of the kernels are as follows:

Linear Kernel

$$K\left( {x,x_{i} } \right) = C + a_{i} \left\langle {x,x_{i} } \right\rangle .$$
(2)

Gaussian Kernel

$$K\left( {x,x^{\prime}} \right) = \exp \left( { - \frac{{\left\| {x - x^{\prime}} \right\|^{2} }}{{2\sigma^{2} }}} \right).$$
(3)

Polynomial Kernel

$$K\left( {x,y} \right) = \left( {x^{T} y + c} \right)^{d} .$$
(4)

Laplacian Kernel

$$K\left( {x,y} \right) = \exp \left( { - \frac{{\left\| {x - y} \right\|}}{\sigma }} \right).$$
(5)
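
The four kernels can be sketched in code as follows. This is a minimal illustration that uses the standard textbook forms of the kernels (with a negative exponent for the Gaussian and Laplacian kernels), not the implementation used in the study. The function names, the default parameter values, and the choice of the Euclidean norm in the Laplacian kernel are assumptions made for the example.

```python
import numpy as np


def linear_kernel(x, y, a=1.0, c=0.0):
    """Linear kernel: scaled inner product plus a constant."""
    return float(a * np.dot(x, y) + c)


def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel: (x^T y + c)^d."""
    return float((np.dot(x, y) + c) ** d)


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def laplacian_kernel(x, y, sigma=1.0):
    """Laplacian kernel: exp(-||x - y|| / sigma), Euclidean norm assumed."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.linalg.norm(diff) / sigma))


# Two hypothetical (already scaled) input vectors, for illustration only.
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel, laplacian_kernel):
    print(kernel.__name__, kernel(u, v))
```

The Gaussian and Laplacian kernels equal 1 when both arguments coincide and decay towards 0 as the points move apart, with σ controlling how quickly the similarity drops.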

Sixty models are developed in the present study to predict the soaked CBR of fine-grained soil. The model designation with different percentages of training data sets is given in Table 4.

Table 4 Details of RVM models

For each of the Linear, Polynomial, Gaussian, and Laplacian kernels, six RVM models have been constructed using 50%, 60%, 70%, 80%, 90%, and 100% of the training data sets. These single kernel function-based (SRVM) models are optimized by GA and PSO techniques. In addition, the two kernel function-based (HRVM) models are developed using the kernels of the best-performing SRVM models in GA and PSO optimization. In this study, Laplacian and Gaussian kernel function-based SRVM models have performed better than the other RVM models. Therefore, HRVM models have been developed using Laplacian and Gaussian kernels as Kernel1 and Kernel2, respectively. The proposed HRVM models are also optimized by GA and PSO techniques. The hyperparameters of the proposed RVM models are given in Table 5.

Table 5 Configuration of RVM Models

3.2 Gene expression programming

Gene expression programming is an evolutionary algorithm developed to solve user-defined problems using automatically generated computer programs (Zhong, 2017), and it uses mutation and crossover operators (Tenpe, 2020). It is a genotype–phenotype system encoded in simple linear chromosomes of fixed length. The general gene structure is shown in Fig. 3.

Fig. 3 Gene Structure (Tenpe, 2020)

In Fig. 3, “–” is the root node, “Rt, + , −” are the function nodes (head), and “X1, X2, X3, 7” are the terminals (tail). The tail length (t) is computed from the head length (h) and the maximum number of arguments of the functions used (n) using Eq. 6:

$$t = \left( {n - 1} \right)h + 1.$$
(6)
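
As a worked example of this relation, the short sketch below computes the tail length and the total gene length. The function names and the values h = 8 and n = 2 are hypothetical choices for illustration and are not the settings listed in Table 7.

```python
def tail_length(head_length, max_arity):
    """Tail length of a GEP gene: t = (n - 1) * h + 1."""
    return (max_arity - 1) * head_length + 1


def gene_length(head_length, max_arity):
    """Total gene length: head plus tail."""
    return head_length + tail_length(head_length, max_arity)


# Hypothetical settings: head length 8, functions taking at most 2 arguments.
h, n = 8, 2
print("tail length:", tail_length(h, n))
print("gene length:", gene_length(h, n))
```

With these example settings the tail has 9 symbols and the gene has 17, which guarantees that every function in the head can be supplied with enough terminals.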

The present study has developed six gene expression programming models to predict the soaked CBR of fine-grained soil. The model designation with different percentages of training data sets is given in Table 6.

Table 6 Details of GEP models

The gene expression programming models have been trained by 50%, 60%, 70%, 80%, 90%, and 100% training data sets. The hyperparameters of the proposed GEP models are given in Table 7.

Table 7 Configuration of GEP models

The following equations are derived while training the GEP models:

$$\begin{aligned} {\text{Model 61}} =& \left( \left( GEP3Rt\left( {\left( {OMC * {-} 9.94} \right)} \right) \right.\right.\\ &\left. * GEP3Rt\left( {\left( {G {-} MDD} \right)} \right) \right)\\ &\left. * \left( {1.0/\left( {\left( {\left( {OMC {-} 8.12} \right)/2.0} \right)} \right)} \right) \right)\\ & + MAX\left( \left( GEP3Rt\left( {FC} \right)\right.\right.\\ &\left. * GEP3Rt\left( {\left( {S {-} LL} \right)} \right) \right),\\ &\left. GEP3Rt\left( {\left( {{-} 58.24 {-} G} \right)} \right) \right)\end{aligned}$$
(7)
$$\begin{aligned}{\text{Model 62}} = &\, ( ( ( 1.0 {-} ( ( {G + S} )/( {G {-} OMC} ) ) ) \\ &+ ( ( {OMC + 1.87} )/( {OMC {-} 6.25} ) ) )/2.0 )\\ &+ ( ( G + GEP3Rt( ( ( ( ({2.07 + S} )/2.0 ) \\ &+ ( {OMC/PL} ) ) + ( {{-} 2.24 {-} LL} ) ) ) )/2.0)\end{aligned}$$
(8)
$$\begin{aligned}{\text{Model }}63 = &\left( 1.0/\left( \left( \left( ATAN\left( {\left( {1.0 {-} \left( {S/LL} \right)} \right)} \right)\right.\right.\right.\right.\\ &+ MIN\left( GEP3Rt\left( {MDD} \right),\right.\\ &\left.\left.\left.\left. \left.\left( {\left( {S + MDD} \right)/2.0} \right) \right) \right)/2.0 \right) \right) \right)\\ & + \left( MAX\left( \left( {POW\left( {MDD,2.0} \right) * \left( {1.0/\left( {OMC} \right)} \right)} \right),\right.\right.\\ &\left.\left.\left( {\left( {6.34 {-} LL} \right) {-} PI} \right) \right) * G \right)\end{aligned}$$
(9)
$$\begin{aligned}{\text{Model }} 64 =& MIN( EXP( EXP( ATAN( {MDD} ) )),\\ &MIN( ( ( ( ( {S + MDD} )/2.0 ) + MDD )/2.0 ),FC )) \\ &+ ( 1.0 {-} MIN( MAX( ( {PI {-} 3.12}),\\ &( {OMC {-} PI} ) ),MIN( ( G *PL ),EXP( {MDD} ) ) ) ) \end{aligned}$$
(10)
$$\begin{aligned}{\text{Model }} 65 =& MIN( ( G {-} ( ( ( 1.0/( {FC}) ) * S )\\ & + ( {-} 3.76 + MDD )) ),G ) \\ &+ ( ( ( PI * (EXP( MIN( 1.15, {-} 3.35 ) )\\ & * ({{-} 3.35 * G}) ) ) + 4.52 )/2.0) \end{aligned}$$
(11)
$$\begin{aligned} {\text{Model }}66 =& ( ( 1.0 {-} GEP3Rt( ( {FC * G} ) )) \\ &{-} GEP3Rt( ( Log( {LL} ) * ({FC + PI} ) ) ) ) \\ &+ ( 1.0 {-}( POW( ( ( {S + S} )\\ & + PL ),2.0)/( ( {7.96 {-} LL} ) * POW( {OMC,2.0}) ) ) )\end{aligned}$$
(12)
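
To show how such an expression is evaluated, the sketch below transcribes Model 65 (Eq. 11) term by term. It assumes that MIN and EXP denote the usual minimum and exponential functions. The function name `gep_model_65` and the input values are hypothetical, and the scaling or units the inputs must have before they are passed to the model are not stated in this section, so the numerical output is illustrative only.

```python
import math


def gep_model_65(G, S, FC, PI, MDD):
    """Term-by-term transcription of Model 65 (Eq. 11).

    G, S, FC: gravel, sand, and fine content; PI: plasticity index;
    MDD: maximum dry density. Input scaling is an assumption of this sketch.
    """
    # MIN((G - (((1.0 / FC) * S) + (-3.76 + MDD))), G)
    first_term = min(G - (((1.0 / FC) * S) + (-3.76 + MDD)), G)
    # (((PI * (EXP(MIN(1.15, -3.35)) * (-3.35 * G))) + 4.52) / 2.0)
    second_term = ((PI * (math.exp(min(1.15, -3.35)) * (-3.35 * G))) + 4.52) / 2.0
    return first_term + second_term


# Hypothetical input values, for illustration only.
print(gep_model_65(G=0.0, S=0.0, FC=50.0, PI=10.0, MDD=3.76))
```

Since MIN(1.15, −3.35) always evaluates to −3.35, the exponential factor in the second term is the constant exp(−3.35).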

3.3 Methodology

The present research predicts soaked CBR of fine-grained soil using the best architectural AI model. Therefore, a comparative study is carried out between the adopted AI approaches, namely relevance vector machines and gene expression programming. The relevance vector machine models are developed using single and dual kernel functions, denoted by SRVM and HRVM, respectively. The kernel functions are Linear, Polynomial, Gaussian, and Laplacian. For each kernel function, six SRVM models are developed using 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets. Each SRVM model is optimized by the genetic algorithm (GA) and particle swarm optimization (PSO) techniques. The optimized SRVM models have predicted soaked CBR of 15 testing fine-grained soil data sets. For each of GA and PSO, four better-performing models (one from each kernel function) are identified and compared to obtain the best kernel function. The two kernel functions identified in this way (one from GA and one from PSO) are used to develop the dual kernel-function-based HRVM models. In the present study, the Laplacian and Gaussian kernel functions are the best kernel functions identified from the SRVM models. Therefore, six HRVM models are developed for 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets and optimized by each of the GA and PSO techniques. Based on the comparative study of the performance of the HRVM models, one better-performing model is recognized from each of the GA and PSO techniques. Finally, the present study has four better-performing models: one from SRVM optimized by GA, one from SRVM optimized by PSO, one from HRVM optimized by GA, and one from HRVM optimized by PSO.

On the other hand, six GEP models have been developed using 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets. The GEP model having maximum performance is recognized as the better-performing model and compared with the four models obtained from the comparison of RVM models. Thus, comparing one GEP and four RVM models gives the best architectural model to predict the soaked CBR of fine-grained soil. In addition, geotechnical properties of 36 fine-grained soils have been collected from published articles to compile a validation data set to validate the best architectural model. Moreover, the published models have been used to predict the soaked CBR of fine-grained soil in the validation data sets, and their predictions have been compared with those of the best architectural model. After validation, the best architectural model has been used to predict the soaked CBR of twelve fine-grained soil samples tested in the laboratory.

4 Results and discussion

Numerous researchers have applied different AI approaches to predict the soaked CBR of fine-grained soil. The present study uses the relevance vector machine AI approach, optimized by the genetic algorithm and particle swarm optimization, to predict soaked CBR of fine-grained soils. In addition, the gene expression programming AI approach is also applied to predict the soaked CBR of fine-grained soil. The best architectural AI model has been determined by comparing the performance of the developed models. Furthermore, the performance of the models has been determined in terms of root mean square error (RMSE), mean absolute error (MAE), and correlation coefficient (R).
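
The three performance measures can be computed as in the minimal sketch below. It assumes the standard definitions of RMSE and MAE, and it assumes that R is the Pearson correlation coefficient between measured and predicted values. The function names and the sample values are hypothetical and are not results of this study.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean square error between measured and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error between measured and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def correlation_r(actual, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    return float(np.corrcoef(actual, predicted)[0, 1])


# Hypothetical measured and predicted soaked CBR values (NOT study results).
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.0, 2.0, 3.0, 6.0]

print("RMSE:", rmse(measured, estimated))
print("MAE:", mae(measured, estimated))
print("R:", correlation_r(measured, estimated))
```

Lower RMSE and MAE and an R closer to 1 indicate a better-performing model.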

4.1 Results of single kernel-based RVM (SRVM) models optimized by GA

The single kernel-based RVM models have been developed using Linear, Polynomial, Gaussian, and Laplacian kernel functions and optimized by a genetic algorithm. The RVM models have been trained by 50%, 60%, 70%, 80%, 90%, and 100% of 182 data sets. Fifteen data sets have been used to test the capabilities of the developed RVM models. The training and testing performance of the SRVM models is plotted for different percentages of training data sets, as shown in Fig. 4a–h.

Fig. 4 Training (TR) and testing (T) performance of GA optimized SRVM models

Figure 4 depicts the performance of the developed Linear, Polynomial, Gaussian, and Laplacian SRVM models (GA optimized) using 50–100% training data sets. The following observations have been made from Fig. 4: (i) Model 3 has predicted soaked CBR with a test RMSE of 1.0390, MAE of 0.8596, and R of 0.8974; (ii) Model 9 has predicted soaked CBR with a test RMSE of 0.5116, MAE of 0.4169, and R of 0.9701; (iii) Model 15 has predicted soaked CBR with a test RMSE of 0.8909, MAE of 0.7332, and R of 0.9260; and (iv) Model 21 has predicted soaked CBR with a test RMSE of 0.3140, MAE of 0.2418, and R of 0.9874. Models 3, 9, 15, and 21 are developed using Linear, Polynomial, Gaussian, and Laplacian kernels, respectively, and trained by the 70% training data set. In the 70% training data set, gravel content is very strongly correlated, and fine content and MDD are strongly correlated, with soaked CBR of fine-grained soils. The soaked CBR of fine-grained soil is highly influenced by gravel content (Al-Refeai et al. 1997), fine content (Reddy et al. 2019), and MDD (Ravichandra et al. 2019). The correlation coefficient of sand content shows no relationship, and the rest of the input parameters have a moderate relationship with the soaked CBR of fine-grained soil. The performance of Models 3, 9, 15, and 21 has been compared, and it has been observed that Model 21 (Laplacian kernel-based) has predicted soaked CBR with the highest performance and has been identified as the better-performing model. Furthermore, the overfitting of the single kernel-based RVM models optimized by GA has been calculated, as shown in Fig. 5.

Fig. 5 Overfitting of single kernel-based RVM models optimized by GA

The ratio of test RMSE to training RMSE (overfitting = \({\text{Test RMSE}}/{\text{Training RMSE}}\)) is known as the overfitting or overfitting ratio of the model (Tenpe et al. 2020). Figure 5 illustrates the comparison of overfitting of the models in predicting soaked CBR of fine-grained soil. Figure 5 shows that Models 3, 9, and 21 have overfitting of 0.1279, 0.1319, and 0.2155, respectively, which is comparatively less than that of the other models with the same kernel. On the other hand, the Gaussian SRVM model (Model 18) has an overfitting of 0.1686, which is less than that of the other Gaussian SRVM models (optimized by GA). The comparison of overfitting for Models 3, 9, 18, and 21 demonstrates that Model 3 (linear SRVM) has the least overfitting. Therefore, it can be stated that the linear SRVM model optimized by the GA technique achieves less overfitting in predicting soaked CBR of fine-grained soil than the nonlinear SRVM models optimized by the GA technique.
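
The overfitting ratio and the comparison above can be sketched as follows. The function name `overfitting_ratio` and the RMSE pair passed to it are hypothetical, whereas the four ratios in the dictionary are the values quoted above for Models 3, 9, 18, and 21.

```python
def overfitting_ratio(test_rmse, training_rmse):
    """Overfitting ratio = test RMSE / training RMSE."""
    if training_rmse <= 0:
        raise ValueError("training RMSE must be positive")
    return test_rmse / training_rmse


# Hypothetical RMSE values, only to show the call.
print(overfitting_ratio(test_rmse=0.5, training_rmse=2.0))

# Overfitting ratios quoted in the text for the GA optimized SRVM models.
reported_ratios = {
    "Model 3 (Linear)": 0.1279,
    "Model 9 (Polynomial)": 0.1319,
    "Model 18 (Gaussian)": 0.1686,
    "Model 21 (Laplacian)": 0.2155,
}

# The model with the smallest ratio has the least overfitting.
least_overfitted = min(reported_ratios, key=reported_ratios.get)
print(least_overfitted)
```

Selecting the minimum of the quoted ratios returns the linear SRVM model, in line with the comparison described above.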

4.2 Results of single kernel-based RVM models optimized by PSO

Similarly, single kernel-based RVM models have been developed using Linear, Polynomial, Gaussian, and Laplacian kernel functions and optimized by the particle swarm optimization technique. Fifteen data sets have been used to test the capabilities of the developed RVM models. The training and testing performance of the SRVM models is plotted for different percentages of training data sets, as shown in Fig. 6a–h.

Fig. 6 Training (TR) and testing (T) performance of PSO optimized SRVM models

Figure 6 depicts the performance of the developed Linear, Polynomial, Gaussian, and Laplacian SRVM models (PSO optimized) using 50–100% training data sets. Models 27, 33, 39, and 45 are developed using Linear, Polynomial, Gaussian, and Laplacian kernels, respectively, and trained by the 70% training data set. The following observations have been made from Fig. 6: (i) Model 27 has predicted soaked CBR with a test RMSE of 0.8447, MAE of 0.7183, and R of 0.9315; (ii) Model 33 has predicted soaked CBR with a test RMSE of 0.9350, MAE of 0.7585, and R of 0.9026; (iii) Model 39 has predicted soaked CBR with a test RMSE of 0.4747, MAE of 0.3930, and R of 0.9748; and (iv) Model 45 has predicted soaked CBR with a test RMSE of 0.4788, MAE of 0.3553, and R of 0.9696. The input parameters gravel content, fine content, and maximum dry density very strongly correlate with the soaked CBR of fine-grained soil. In addition, it is found that the fine content strongly correlates with consistency limits, which indirectly enhances the prediction of soaked CBR of fine-grained soil. Furthermore, the overfitting of the single kernel-based RVM models optimized by the PSO technique has been calculated, as shown in Fig. 7.

Fig. 7 Overfitting of single kernel-based RVM models optimized by PSO

Figure 7 demonstrates the comparison of overfitting of the SRVM models optimized by the PSO technique. Figure 7 shows that Models 27, 33, 39, and 45 have overfitting of 0.104, 0.2518, 0.1631, and 0.3134, respectively, which is comparatively less than that of the other SRVM models optimized by the PSO technique. The overfitting comparison of Models 27, 33, 39, and 45 shows that Model 27 (linear model) has less overfitting in predicting the soaked CBR of fine-grained soil than the nonlinear models (Models 33, 39, and 45).

On the other hand, the performance of the GA and PSO optimized SRVM models has been compared. The particle swarm optimization technique has improved the performance (R) of the linear kernel-based model from 0.8974 (Model 3) to 0.9315 (Model 27) and of the Gaussian kernel-based model from 0.9260 (Model 15) to 0.9784 (Model 39). Therefore, it can be stated that the PSO optimization technique boosts the performance of linear and Gaussian kernel-based SRVM models of soaked CBR. However, using the PSO technique, the performance of the polynomial kernel-based model has dropped from 0.9701 (Model 9) to 0.9026 (Model 33), and that of the Laplacian kernel-based model has fallen from 0.9874 (Model 21) to 0.9696 (Model 45). Therefore, it can be stated that the PSO optimization technique decreases the performance of polynomial and Laplacian kernel-based SRVM models in predicting soaked CBR of fine-grained soil.

Furthermore, the genetic algorithm optimized SRVM Model 21 has outperformed the other GA optimized SRVM models, and the PSO optimized SRVM Model 39 has outperformed the other PSO optimized SRVM models. Model 21 and Model 39 have been constructed using the Laplacian and Gaussian kernel functions, respectively. Therefore, the hybrid relevance vector machine (HRVM) models are developed using the Laplacian and Gaussian kernels as kernel1 and kernel2, respectively, and optimized by GA and PSO techniques.

4.3 Results of two kernel-based RVM models optimized by GA

HRVM models have also been trained by 50%, 60%, 70%, 80%, 90%, and 100% of 182 data sets and optimized by GA. Fifteen data sets have been used to test the ability of the developed HRVM models. The training and testing performance of the HRVM models is plotted for different percentages of training data sets, as shown in Fig. 8a, b.

Fig. 8 Training (TR) and testing (T) performance of GA optimized HRVM models

Figure 8 depicts the performance of the developed GA optimized HRVM models using 50–100% training data sets. From Fig. 8, it has been observed that Model 51 has predicted soaked CBR with a test RMSE of 0.7381, MAE of 0.6094, and R of 0.9606. Model 51 has been trained by the 70% training data set. In terms of R, Model 51 has outperformed the GA-optimized Models 3 and 15 and the PSO-optimized Models 27 and 33. Still, Models 9, 21, 39, and 45 have predicted soaked CBR better than Model 51. Furthermore, the overfitting of the HRVM (GA optimized) models has been calculated, as shown in Fig. 9.

Fig. 9 Overfitting of dual kernel-based HRVM models optimized by GA

Figure 9 illustrates that Model 51 has an overfitting of 0.2805 in predicting soaked CBR of fine-grained soil, which is comparatively less than that of the other GA-optimized HRVM models. Model 51 has been trained by 70% of 182 training data sets. As a result of the relationship between soaked CBR and the input parameters (gravel content, fine content, and maximum dry density) in this data set, Model 51 has predicted soaked CBR of fine-grained soil with the least overfitting.

4.4 Results of two kernel-based RVM models optimized by PSO

Similarly, HRVM models have been trained by 50%, 60%, 70%, 80%, 90%, and 100% of 182 data sets and optimized by the PSO technique. Fifteen data sets have been used to test the capabilities of the developed HRVM models. The training and testing performance of the HRVM models is plotted for different percentages of training data sets, as shown in Fig. 10a, b.

Fig. 10 Training (TR) and testing (T) performance of PSO optimized HRVM models

Figure 10 depicts the performance of the developed PSO optimized HRVM models using 50–100% training data sets. From Fig. 10, it has been observed that Model 57 has predicted soaked CBR with a test RMSE of 0.4761, MAE of 0.3895, and R of 0.9701. Model 57 has been trained by the 70% training data set. In terms of R, Model 57 has outperformed the GA optimized Models 3 and 15 and the PSO optimized Models 27, 33, and 45. Still, Models 9, 21, and 39 have predicted soaked CBR better than Model 57. On the other hand, the comparison of overfitting of the models, as shown in Fig. 11, demonstrates that Model 57 has predicted the soaked CBR of fine-grained soil with less overfitting (0.2812) than the other PSO optimized HRVM models.

Fig. 11 Overfitting of dual kernel-based HRVM models optimized by PSO

Finally, from the comparison of single and dual kernel RVM models, the single kernel-based Models 21 (GA optimized) and 39 (PSO optimized) and the two kernel-based Models 51 (GA optimized) and 57 (PSO optimized) have been identified as the better-performing RVM models in predicting soaked CBR of fine-grained soils. The results of the RVM models show that high performance and accuracy in predicting soaked CBR can be achieved by a 70% training data set. In this data set, soaked CBR has a very strong correlation with gravel content and a strong correlation with FC and MDD of soil.

On the other hand, multicollinearity has been observed in the 70% of 182 training data sets. Multicollinearity is a statistical phenomenon that affects the prediction and performance of models. The 70% training data set shows that the fine content has multicollinearity with gravel and sand content. However, the sand content is the least influential input parameter in predicting soaked CBR of fine-grained soil. Moreover, the sand content has multicollinearity with consistency limits and optimum moisture content. Thus, the effect of consistency limits and optimum moisture content on the prediction of soaked CBR is reduced. In addition, the maximum dry density has multicollinearity with fine content and liquid limit, and the optimum moisture content has multicollinearity with plastic limit and plasticity index. Despite the presence of multicollinearity in the 70% training data set, the nonlinear RVM models have predicted the soaked CBR with a performance (R) of more than 0.95, which shows that the RVM approach is robust and is not affected by multicollinearity when predicting soaked CBR of fine-grained soil.
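A multicollinearity check of this kind can be illustrated with a pairwise correlation screen of the input parameters. The following is a minimal Python sketch and not the procedure used in this study: the function names, the synthetic data, and the cut-offs of 0.40, 0.60, and 0.80 (rounded from the moderate, strong, and very strong bands quoted in this paper) are assumptions made for illustration.

```python
import numpy as np

def correlation_band(r):
    """Label |r| with the bands quoted in this paper (cut-offs are assumed)."""
    a = abs(r)
    if a > 0.80:
        return "very strong"
    if a > 0.60:
        return "strong"
    if a > 0.40:
        return "moderate"
    return "below moderate"  # label assumed; this band is not named in the text shown

def collinear_pairs(X, names, threshold=0.40):
    """Return input pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are input parameters
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = float(corr[i, j])
            if abs(r) > threshold:
                pairs.append((names[i], names[j], round(r, 3), correlation_band(r)))
    return pairs

# Synthetic, hypothetical gradation data: fine content is what remains after
# gravel and sand, so it is collinear with both by construction.
rng = np.random.default_rng(0)
G = rng.uniform(0, 10, 50)
S = rng.uniform(10, 40, 50)
FC = 100 - G - S
for pair in collinear_pairs(np.column_stack([G, S, FC]), ["G", "S", "FC"]):
    print(pair)
```

Because fine content is the remainder after gravel and sand, a screen of this kind flags the sand and fine content pair, which is the same kind of dependence reported for the 70% training data set.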

4.5 Results of gene expression programming

The gene expression programming (GEP) models have been trained by 50%, 60%, 70%, 80%, 90%, and 100% of the 182 data sets. The capabilities of the developed GEP models have been tested with fifteen data sets. The training and testing performance of the GEP models is mapped for different percentages of training data sets, as shown in Fig. 12a, b.

Fig. 12 Training (TR) and testing (T) performance of GEP models

Figure 12 depicts the performance comparison of the trained and tested GEP models using 50–100% training data sets. From Fig. 12, it has been observed that Model 62 has predicted soaked CBR with test RMSE of 1.2148%, MAE 0.9902%, and R of 0.8847. Model 62 has been trained by 60% training data sets. Therefore, Model 62 has outperformed the other GEP models in predicting the soaked CBR of soil. Furthermore, the overfitting is also calculated for the GEP models, as shown in Fig. 13.

Fig. 13 Overfitting of GEP models

Figure 13 demonstrates that Model 62 has achieved an overfitting of 0.3063 in predicting soaked CBR of fine-grained soil, which is comparatively less than other GEP models. The expression tree of model 62 is shown in Fig. 14.

Fig. 14 Expression tree of Model 62

Figure 14 shows that Model 62 has three genes, i.e., Sub-ET1, Sub-ET2, and Sub-ET3, which consist of three constant values, viz. c0 = 2.07, c6 = 6.63, and c7 = −4.31, and the input parameters d0 = G, d1 = S, d3 = LL, d4 = PL, and d6 = OMC. The fine content and maximum dry density play a vital role in predicting the soaked CBR of fine-grained soil (Taskiran 2010; Chandrakar et al. 2016; Bourouis et al. 2016; Reddy et al. 2019). However, both input parameters are missing from the expression tree, which indicates a less effective model for predicting the soaked CBR of fine-grained soil. In addition, the plasticity index and C + S content are absent from Model 62. The GEP models have been tested on a data set of inorganic clays of high plasticity (CH soil). High plasticity clay requires the plasticity index to predict soaked CBR (Datta et al. 2011). Therefore, Model 62 has predicted soaked CBR values lower than the experimental values for CH soil.

On the other hand, Model 62 has been trained by 60% of the 182 training data sets, which show multicollinearity between gradational parameters and consistency limits. The gravel content has moderate multicollinearity (0.41 < R < 0.60) with plastic limit and plasticity index. The sand content has multicollinearity with FC and liquid limit, but the expression tree does not include FC as an input parameter, so the multicollinearity between sand and fine content does not affect the prediction. Still, the liquid limit has very strong multicollinearity (R > 0.8) with plastic limit and optimum moisture content. Therefore, Model 62 did not perform as well as the RVM models in the presence of multicollinearity.

4.6 Best architecture model

The test performance of Models 21, 39, 51, 57, and 62 has been compared to determine the best architecture model for predicting soaked CBR of soil, as shown in Fig. 15.

Fig. 15 Comparison of test performance of Models 21, 39, 51 and 62

Figure 15 illustrates that Model 21 (SRVM optimized by GA) has predicted soaked CBR of fine-grained soil with RMSE of 0.3140%, MAE of 0.2418, and R of 0.9874, i.e., with lower prediction error and higher R than the other better-performing models. Therefore, the best architectural model for predicting the soaked CBR of fine-grained soil is Model 21.

Furthermore, data sets of 36 fine-grained soil specimens have been collected from published research articles to validate Model 21. In addition, published equations/models have been used to compute the soaked CBR of the 36 fine-grained soil specimens, and the results have been compared with Model 21. Patel et al. (2010) derived a multilinear regression equation to predict soaked CBR of soil using PI, MDD, and OMC:

$$\begin{aligned}{\text{CBR}}_{s}& = 43.907 - 0.093*{\text{PI}} \\ & \quad - 0.3081*{\text{OMC}} - 18.78*{\text{MDD}}\end{aligned}$$
(13)

Harini et al. (2014) also developed a regression model to predict soaked CBR of soil using liquid limit and fine content:

$${\text{CBR}}_{s} = 4.86 - 0.07*{\text{LL}} + 0.01*{\text{FC}}$$
(14)

Pradeep Kumar et al. (2016) proposed a multiple linear regression equation to predict the soaked CBR of soil using MDD, OMC, PL, LL, PI, G, S, and FC:

$$\begin{aligned} {\text{CBR}}_{s} &= 0.6981*{\text{FC}} + 0.7802*S + 0.9193*G\\ &\quad + 0*{\text{PI}} + 0.0483*{\text{LL}} \\ &\quad -0.2356*{\text{PL}} + 0.1579*{\text{OMC}}\\ &\quad + 7.8239*{\text{MDD}} - 73.412 \end{aligned}$$
(15)

Reddy et al. (2019) also suggested a simple regression equation to predict the CBR using the plasticity index of soil:

$${\text{CBR}}_{s} = 6.1596 - 0.1024*{\text{PI}}$$
(16)

Equations 13–16 and Model 21 have been used to predict the soaked CBR of the 36 soil specimens collected from published articles, and the results are shown in Fig. 16.
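For readers who wish to apply the published equations, Eqs. 13 to 16 can be written directly as functions. The following Python sketch copies the coefficients from the equations above; the function names and the sample soil values are hypothetical, and the inputs are assumed to be expressed in the same units as in the original studies.

```python
def cbr_patel_2010(pi, omc, mdd):
    """Eq. 13: soaked CBR from plasticity index, OMC, and MDD."""
    return 43.907 - 0.093 * pi - 0.3081 * omc - 18.78 * mdd

def cbr_harini_2014(ll, fc):
    """Eq. 14: soaked CBR from liquid limit and fine content."""
    return 4.86 - 0.07 * ll + 0.01 * fc

def cbr_pradeep_kumar_2016(fc, s, g, pi, ll, pl, omc, mdd):
    """Eq. 15: soaked CBR from gradation, consistency, and compaction parameters."""
    return (0.6981 * fc + 0.7802 * s + 0.9193 * g + 0.0 * pi
            + 0.0483 * ll - 0.2356 * pl + 0.1579 * omc
            + 7.8239 * mdd - 73.412)

def cbr_reddy_2019(pi):
    """Eq. 16: soaked CBR from plasticity index only."""
    return 6.1596 - 0.1024 * pi

# Hypothetical soil, for illustration only (not one of the 36 specimens).
soil = dict(g=10.0, s=30.0, fc=60.0, ll=40.0, pl=20.0, pi=20.0, omc=15.0, mdd=1.8)
print("Eq. 13:", cbr_patel_2010(soil["pi"], soil["omc"], soil["mdd"]))
print("Eq. 14:", cbr_harini_2014(soil["ll"], soil["fc"]))
print("Eq. 15:", cbr_pradeep_kumar_2016(**soil))
print("Eq. 16:", cbr_reddy_2019(soil["pi"]))
```

The four equations use different subsets of the soil parameters, so their predictions for the same soil can differ considerably.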

Fig. 16 Actual vs predicted plot for soaked CBR of 36 fine-grained soil specimens

Figure 16 depicts the actual vs predicted plot of soaked CBR of fine-grained soil using Model 21 and Eqs. 13–16. Figure 16 shows that Model 21 and Eqs. 13, 14, 15, and 16 have predicted soaked CBR with a performance (R) of 0.8631, 0.7743, 0.8437, 0.8133, and 0.7919, respectively. The performance comparison shows that Model 21 has performed better than the published equations/models. Furthermore, 12 fine-grained soil samples have been collected from in and around Kota, Rajasthan, for the laboratory validation of Model 21. The gravel, sand, and fine content, liquid limit, plastic limit, plasticity index, OMC, MDD, and soaked CBR have been determined in the geotechnical laboratory. The actual vs predicted plot for the laboratory-tested soil data sets is shown in Fig. 17.

Fig. 17 Actual vs predicted plot for soaked CBR of 12 fine-grained soil specimens

Figure 17 depicts the actual vs predicted plot of soaked CBR of the 12 fine-grained soil specimens using Model 21. Figure 17 illustrates that Model 21 predicts the soaked CBR for laboratory-tested soil with RMSE of 0.4237%, MAE of 0.32%, and R of 0.8139. A correlation coefficient (R) of 0.8 or more shows a strong relationship between actual and predicted values (Smith 1986). Therefore, Model 21 can be used to predict the soaked CBR of fine-grained soil.
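The error measures quoted for Model 21 can be computed from paired actual and predicted values. The following is a minimal Python sketch, assuming that RMSE and MAE take their usual definitions and that R is the Pearson correlation coefficient; the function names and sample values are hypothetical and are not the laboratory data of this study.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(a - p)))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.corrcoef(a, p)[0, 1])

# Hypothetical soaked CBR values (%), for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
print("R:", pearson_r(actual, predicted))
```

RMSE and MAE carry the unit of the predicted quantity, whereas R is dimensionless, which is why the study reports all three.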

5 Sensitivity analysis

In the present study, the non-linear sensitivity analysis has been performed using the cosine amplitude method. The sensitivity analysis is performed by Eq. 17 (Ardakani et al. 2019):

$${\text{SS}} = \frac{{\mathop \sum \nolimits_{c = 1}^{n} \left( {X_{ic} *X_{jc} } \right)}}{{\sqrt {\mathop \sum \nolimits_{c = 1}^{n} X_{ic}^{2} \mathop \sum \nolimits_{c = 1}^{n} X_{jc}^{2} } }},$$
(17)

where \(X_{ic}\) is the value of input parameter i (G, S, FC, LL, PL, PI, OMC, or MDD) for data set c, \(X_{jc}\) is the corresponding value of the output parameter (soaked CBR of fine-grained soil), and n is the number of data sets. An SS value close to one shows that the input parameter highly influences the output parameter. The sensitivity analysis has been performed for 50%, 60%, 70%, 80%, 90%, and 100% of 182 training data sets in this study, as shown in Fig. 18.
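The cosine amplitude calculation can be sketched in a few lines of Python. The sketch assumes the usual form of the method, in which the sum of products of the input and output values is divided by the square root of the product of their sums of squares; the function name and sample values are hypothetical and are not data from this study.

```python
import numpy as np

def cosine_amplitude(x, y):
    """Strength of relation (SS) between one input x and the output y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Numerator: sum of element-wise products over all data sets.
    # Denominator: square root of the product of the two sums of squares.
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

# Hypothetical values for one input parameter and soaked CBR.
gravel = [1.0, 3.0, 5.0, 8.0]
soaked_cbr = [2.1, 2.9, 4.2, 6.0]
print("SS:", cosine_amplitude(gravel, soaked_cbr))
```

For non-negative inputs such as soil contents and limits, SS lies between 0 and 1, and it equals 1 only when the input is exactly proportional to the output.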

Fig. 18 Sensitivity analysis for different percentages of training data sets

Figure 18 illustrates the sensitivity analysis for different percentages of training data sets. Model 21, which is trained by 70% of the 182 training data sets, has been identified as the best architectural model on the basis of test performance in the present study. In addition, Model 21 has outperformed the published models and equations in predicting soaked CBR of fine-grained soil. Furthermore, Fig. 18 shows that, for the 70% training data sets, G, S, FC, LL, PL, PI, OMC, and MDD have sensitivities of 0.9209, 0.5139, 0.2281, 0.3079, 0.3885, 0.1970, 0.2673, and 0.6265, respectively, with soaked CBR of fine-grained soil. In addition, the sensitivity for consistency limits and compaction parameters has increased with training data sets. The soaked CBR of fine-grained soil is highly affected by index properties (Taskiran 2010). Therefore, among the AI models proposed in the present study, Model 21, trained by 70% of the 182 training data sets, has achieved high prediction performance.

6 Conclusions

The present study has successfully employed gene expression programming (GEP) and relevance vector machine (optimized by GA and PSO techniques) models to predict the soaked CBR of fine-grained soils. Based on the outcomes of the study, the following conclusions are drawn.

  • The performance comparison of GA and PSO optimized SRVM models demonstrates that the performance of the linear and Gaussian SRVM models increases after optimizing the models by the PSO optimization technique. On the other hand, the performance of the polynomial and Laplacian SRVM models decreases after optimizing the models by the PSO optimization technique.

  • From the performance comparison of the SRVM model, the Laplacian SRVM model (Model 21) and Gaussian SRVM model (Model 39) have been identified as better-performing models with a performance of 0.9874 and 0.9748, respectively. Therefore, the dual-kernel-based HRVM (optimized by GA and PSO) models are developed using the Laplacian kernel (as kernel1) and Gaussian kernel (as kernel2). The test performance of HRVM models illustrates that the PSO optimization technique enhances the performance and predicts the soaked CBR better than GA-optimized HRVM models.

  • Model 21 has been identified as the best architectural model for predicting soaked CBR of fine-grained soils by comparing the performance of Model 21 (0.9874), Model 39 (0.9748), Model 51 (0.9606), Model 57 (0.9701), and Model 62 (0.8847). In addition, the validation and cross-validation (laboratory validation) results have confirmed that Model 21 can predict the soaked CBR better than published models in the literature.

  • The present study demonstrates that the RVM model predicts soaked CBR better if the input and output parameters of training data sets have a strong (0.61 < R < 0.80) to very strong (0.81 < R < 1.0) correlation. In addition, the developed RVM model predicts the soaked CBR with the least prediction error (RMSE and MAE) and overfitting.

  • Based on the performance comparison of GEP models, Model 62 outperformed the other GEP models with a performance of 0.8847. Model 62 was trained using input parameters gravel, sand content, OMC, and PL to predict the soaked CBR. Model 62 did not perform well in the absence of the maximum dry density and the presence of multicollinearity in 60% of 182 training data sets.

  • Multicollinearity in the training data set affects the performance and overfitting of the GEP model. On the other hand, the optimized SRVM and HRVM models are less influenced by multicollinearity.

  • The sensitivity analysis confirms that the soaked CBR is highly influenced by the gravel, sand content, and maximum dry density of fine-grained soils.

Finally, the present research introduces a robust SRVM model (Laplacian SRVM model optimized by GA) to predict the soaked CBR of fine-grained soil.