Introduction

The deformability modulus (Em) is among the essential properties of intact rock and affects the large-scale engineering behavior of rock mass. The rock mass Em is defined as the stress-to-strain ratio during loading of a rock mass, covering both elastic and inelastic behavior (Sousa and Grossmann 2022). Em is commonly used as an input parameter in empirical and numerical relationships to determine the engineering behavior of rock mass (Saedi et al. 2019), and it provides the basis for structural design and stability analyses (Li et al. 2020; Wu et al. 2020). It is also a factor in the design of rock structures because it indicates the deformation properties of the rock mass (Aksoy et al. 2022). This parameter is determined by field and laboratory tests, such as dilatometric tests (DMTs), radial loading, and the plate-jacking method. These tests are hampered by unpredictable formation conditions, high cost, operational problems, and long duration (Ko et al. 2016), so applying the standard procedures for in-situ Em estimation is very difficult. These difficulties can be overcome using regression models that estimate Em indirectly (Aboutaleb et al. 2018; Ceryan et al. 2021). Researchers have also investigated the relation between Em and rock discontinuities (Wu et al. 2020). The Em of rock mass is a complex and sensitive parameter whose measurement requires highly advanced methods and techniques. In the past two decades, machine learning methods (MLMs) have been widely used in geotechnical engineering (Abdallah 2019; Aengchuan and Phruksaphanrat 2018), in tunneling and engineering geology (Ghorbani et al. 2021; Marcher et al. 2020), and for rock and composite materials. These methods process data efficiently and do not require time-consuming, expensive in-situ experiments (Sun et al. 2020a; Xu et al. 2022). MLMs can also capture relationships between complex, high-dimensional variables (Zhang et al. 2020).
Various MLMs have been used in geoengineering (Zhao et al. 2018) to predict seepage behavior (Xiao and Zhao 2019), rock mechanics (Huang et al. 2021; Xia et al. 2019), slope stability (Đurić et al. 2019; Huang et al. 2022), the mechanical and physical properties of rocks (Salimi et al. 2019), and the interpolation and stratification of multilayer soil property profiles (Zhao and Wang 2020). In this respect, Sun et al. (2020b) showed that the support vector regression (SVR) method provides the best predictive model for determining rough rock fractures. Alemdag et al. (2015) used statistical and artificial neural network (ANN) methods to estimate Em. In another study, the effect and application of Em on fractured rock mass were estimated using the equivalent discrete fracture network (E-DFN) under uniaxial compression (Ma et al. 2020). Hasanipanah et al. (2022) applied the Levenberg–Marquardt algorithm (LMA), scaled conjugate gradient (SCG), and Bayesian regularization (BR) algorithms in the cascaded forward neural network (CFNN) MLM to determine Em. In another effort, four ANNs were built to determine the influence of overburden stress on Em (Tokgozoglu et al. 2021). Fathipour-Azar (2022) applied various machine learning (ML) models to estimate triaxial rock mass strength. Several ML models, such as SVR, Gaussian process regression (GPR), and ANN, have been designed to predict the elastic modulus of magmatic rock (Ceryan et al. 2021). Meng et al. (2018) applied the back-propagation neural network technique to find the effective fracture propagation zone in gypsum rock. Indirect estimation is the most cost-effective technique for Em prediction. In this respect, Fattahi (2021) developed a relevance vector regression (RVR) model for the indirect prediction of Em. Babanouri and Fattahi (2018) used SVR to construct a constitutive model for rock fractures to estimate the shear and peak displacement.

The present research aims to extend the use of geomechanical parameters in Em estimation. A significance index is assigned to each geomechanical parameter according to its effect on, and relation with, Em. This index represents the influence of a single parameter, among the entire network of parameters, on Em. Statistical analyses were carried out to determine the best parameters for Em. Finally, multiple MLMs, namely SVR, the Copula method, and multivariate nonlinear regression (MNLR), were applied to validate the reliability of the best model for Em. The Copula method used in this study can capture the dependence structure between two or more random variables. Five dependent variables and one independent variable relationship were determined in this case. The present study is the first effort to use the Copula method to forecast Em in view of the complicated dependence among the variables. In general, this study aims to provide optimal predictive models in which the correlation between the input variables is not an unfavorable factor. Therefore, the K-fold cross-validation (CV) method was used to determine the models’ forecasting ability. The results confirmed the capability of predictive models using these input parameters.

Study area and data collection

Khersan Dam-II is located about 2 km southwest of Lordegan city in Chaharmahal and Bakhtiari Province (Iran) on the Khersan River. The coordinates of the study area are 50°36′ E and 31°25′ N, in the southwestern part of Zagros. The dam location can be accessed by a 2-km long asphalted road from Lordegan through the villages of Qale-e-Madresh and Abchnar. Abchnar village is almost 6 km from the village of Shamalk, on the right bank of the river and the dam axis. The stratigraphy of the Khersan Dam-II site consists of four low-upward-folding rock units, i.e., Asmari (lower, middle, and upper), Gachsaran, Aghajari, and Quaternary deposits, including slopes, crumbs, and river uplifts. Almost all structures of the dam site except the surface power plant are located in the upper Asmari. This unit consists of thick to medium-layered limestone with regular layering and a small percentage of thin-layered limestone to marly limestone, as shown in the geological map of the dam site (Fig. 1).

Fig. 1 a The location of Chaharmahal and Bakhtiari Province (Iran), b the location of Lordegan in the Bakhtiari Province, and c the geological map of the site location of Khersan Dam-II

Exploratory operations were carried out in the first phase of studies of the Khersan Dam and Power Plant Design. These operations included drilling boreholes and galleries and conducting tests, including the laboratory uniaxial compressive strength (UCS) test and the field dilatometry test. Rock quality designation (RQD), joint condition (Jc), joint spacing (Js), and groundwater condition (Wc) data were obtained to determine Em through the rock mass rating (RMR) system. This project used a flexible dilatometer and the volume change measurement method to determine Em. The dilatometer, a metallic cylinder with rubber cladding, was lowered to a pre-determined depth in wells with a diameter of 1 mm. Afterward, compressed air with stepwise pressure values of 1, 2, and 3 MPa was injected into the space between the middle part and the rubber cladding in the borehole. Meanwhile, borehole wall deformability values were recorded at each step. Of the 149 dilatometric tests that determined Em in the left and right flanks of the dam, only 88 results matched the other data from exploration boreholes and field surveys. The results of laboratory and field experiments are divided into quantitative and qualitative forms (Table 1). In this section, experiments are conducted, and the collected results are corrected to increase the validity of probabilistic models. Next, the outlier values are removed from the final estimation calculations.

Table 1 Khersan Dam-II site Laboratory and Field Parameters data

Methodology

Statistical methods

Today, statistical analysis is among the most commonly used tools in engineering geology, rock mechanics, civil engineering, hydrology, and petroleum engineering. In this section, we offer a relation for estimating Em using the geomechanical parameters of rock. R software was used for estimating Em because of its high efficiency and ease of use for complex calculations (Charbel and Hassan 2021). This software can perform computational operations such as multivariate regression, principal component analysis, and function analysis (Górecki and Smaga 2019). For this purpose, the statistical data of dilatometric experiments at the Khersan Dam-II site were analyzed using R software, and standard statistical indices were used to select the appropriate regression models. The data were first screened in R by sensitivity analysis, and various estimation methods were then used to estimate this modulus.

Machine learning methods (MLMs)

In this paper, Em is predicted using (1) the log-linear approach, (2) the SVR approach, and (3) the Copula approach. The previous section mentioned that log-linear predictive modeling could lead to better results than linear models. Recently, MLMs have offered new possibilities for solving complex regression problems. In addition, machine learning regression methods use data directly to train a model without prior knowledge of the predictor-target relationship (Abdallah 2019). The SVR method has good stability and generalization ability for complex and high-dimensional problems (Kang and Li 2016). There are two types of support vector machine (SVM), namely (1) SVR, which performs function fitting, and (2) support vector classification, which performs pattern identification (Kavaklioglu 2011). SVR is widely used in engineering geology and geotechnical engineering, for example for rock fracture and blast-induced rock movement (Sun et al. 2020b; Yu et al. 2020). The significance of the Copula method lies in handling multivariate analysis and dependence modeling. In addition, it can decompose the multivariate joint distribution function into all of its marginal (univariate) distribution functions and a Copula function (Qian et al. 2020). Copula theory plays an important role in hydrology problems, geotechnical reliability analysis, engineering geology, rock mechanics, and wind engineering. Therefore, it has received increasing interest owing to its high accuracy compared with other methods (Abdollahi et al. 2019; Gaidai et al. 2019; Ismail et al. 2018; Zhou et al. 2019). In this study, the MNLR, SVR, and Copula methods were applied to develop prediction models and to study the influence of geomechanical parameters, namely the UCS, RQD, joint condition (Jc), joint spacing (Js), dynamic modulus (Edyn), and groundwater condition (Wc), on Em. These analyses were performed using R software.
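The Copula decomposition mentioned in this paragraph is usually stated through Sklar's theorem. As a brief illustration (the symbols F, F_i, C, and d are introduced here and are not part of the original text), for d random variables with joint distribution function F and marginal distribution functions F_1, ..., F_d, there exists a Copula function C such that

$$F\left( x_{1}, \ldots, x_{d} \right) = C\left( F_{1}\left( x_{1} \right), \ldots, F_{d}\left( x_{d} \right) \right)$$

When the marginals are continuous, C is unique, which is what allows the dependence structure to be modeled separately from the individual variables.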
Statistical indices were used to evaluate the performance of the prediction models and to compare the predicted values with the actual values. These indices include the root-mean-square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2). Equations (1) and (2) give the expressions for MAE and RMSE, respectively, where Yi is the actual value, Ýi is the predicted value, and n is the number of observations.

$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} \left| Y_{i} - \acute{Y}_{i} \right|$$
(1)
$${\text{RMSE}} = \sqrt{\frac{1}{n}\sum\limits_{i = 1}^{n} \left( Y_{i} - \acute{Y}_{i} \right)^{2}}$$
(2)
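Equations (1) and (2) can be evaluated directly, as in the minimal Python sketch below. The text does not give a formula for R2, so the usual definition (one minus the ratio of residual to total sum of squares) is assumed here; the function names and the four data points are hypothetical and serve only as an illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (1)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error, Eq. (2)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition: 1 - SS_res / SS_tot)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted values, not taken from the study.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print(mae(measured, predicted), rmse(measured, predicted), r_squared(measured, predicted))
```

Because RMSE squares the residuals before averaging, it penalizes large individual errors more strongly than MAE does; the two coincide only when all absolute errors are equal, as in this toy example.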

Results

Results of statistical analysis

Descriptive statistics are applied to summarize data in an organized manner and to describe the univariate behavior of the data and the various relationships between them. Descriptive statistics are an essential part of data analysis and provide the foundation of inferential statistics. Tables 2 and 3 provide univariate descriptive statistics of the data. Table 2 summarizes the descriptive statistics for the continuous data (i.e., Em, Jc, UCS, RQD, and Edyn), and Table 3 presents a frequency table for the non-continuous data (i.e., Js and Wc). The univariate analyses show that Em and Jc are almost symmetric, but UCS, Edyn, and RQD are asymmetric (i.e., highly right-skewed, right-skewed, and left-skewed, respectively). An extreme asymmetry is observed in Wc. In addition, RQD and Jc are heavy-tailed relative to a normal distribution, but UCS is light-tailed.

Table 2 Descriptive statistics for continuous data
Table 3 Descriptive statistics for non-continuous data

Figure 2 presents the relationships between the considered data. The upper panel shows the Pearson correlation between the continuous variables and the box plots for the combinations of continuous and non-continuous variables. The lower panel shows the scatter plots with fitted smooth nonlinear curves between the continuous variables and the dot plots for the combinations of continuous and non-continuous variables. Finally, the diagonal panel shows the density plots of the continuous variables and the bar plots of the non-continuous variables. Figure 3 shows a heatmap of the correlation matrix of the dataset containing the data and the natural logarithm of the continuous data. The Pearson correlation reflects only a linear relationship between variables and ignores other relationship types. Hence, the natural logarithm of the continuous variables was considered because of the possibility of exponential nonlinear relationships. Figure 4 illustrates a heatmap of Spearman’s correlation coefficient matrix of the data. Spearman’s correlation can determine whether there is a monotonic association between variables. Also, Spearman’s correlation coefficient is invariant to monotone transformations, such as the natural logarithm transform. Analyzing the relationships between the data shows that they are nonlinear. The logarithm of Em has a stronger relationship with the predictor variables than Em itself. Hence, we considered the log-linear technique to model the data.
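The difference between the two coefficients can be checked on a small synthetic example. The sketch below, in plain Python, uses hypothetical data in which y grows exponentially with x: Pearson's coefficient increases once y is log-transformed, whereas Spearman's coefficient is unchanged because it depends only on ranks. The helper names are illustrative and are not taken from the study's R workflow.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical exponential relationship, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]
log_y = [math.log(v) for v in y]

print(pearson(x, y), pearson(x, log_y))    # linear association before and after the log
print(spearman(x, y), spearman(x, log_y))  # rank association is the same in both cases
```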

Fig. 2 The scatterplot matrix of the data

Fig. 3 Heatmap of Pearson’s correlation coefficients of the continuous data and their natural logarithm

Fig. 4 Heatmap of Spearman’s correlation coefficients for Em

Next, we used the best subsets method to identify the best-fitting models. In regression modeling, eliminating unessential variables makes the model easier to interpret, less susceptible to overfitting, and more generalizable. Best subsets regression is an exploratory model-building analysis that compares all possible models built from a specified set of predictors and displays the best-fitting ones. The results of the best subsets regression analysis are summarized in Table 4. We used the predicted R-square (Pred.R-sq), the estimated error of prediction (MSEP), and Sawa’s Bayesian Information Criterion (SBIC) to identify the best-predicting models. Models with larger Pred.R-sq values or smaller MSEP or SBIC values have better predictive ability. In the log-linear approach, the best subsets regression analysis models Em based on logEdyn, Edyn, Js, UCS, Wc, RQD, and Jc, in the order of their appearance. Table 4 shows that the model in the 6th row (containing the predictors logEdyn, Edyn, Js, UCS, Wc, and RQD) and the model in the 7th row (containing the predictors logEdyn, Edyn, Js, UCS, Wc, RQD, and Jc) are the best candidates for predicting Em, in the order of their appearance.
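The enumeration behind best subsets regression can be sketched as follows. This Python example only illustrates the search over predictor combinations; the scoring function `toy_score` is a hypothetical stand-in for a criterion such as MSEP or SBIC (smaller is better) and does not fit any model or reproduce Table 4. The predictor names are those listed above.

```python
from itertools import combinations

PREDICTORS = ["logEdyn", "Edyn", "Js", "UCS", "Wc", "RQD", "Jc"]

def all_subsets(predictors):
    """Yield every non-empty subset of the predictors, smallest subsets first."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset

def best_per_size(predictors, score):
    """Return {subset size: (score, subset)} keeping the lowest score per size."""
    best = {}
    for subset in all_subsets(predictors):
        value = score(subset)
        size = len(subset)
        if size not in best or value < best[size][0]:
            best[size] = (value, subset)
    return best

def toy_score(subset):
    """Hypothetical criterion: rewards including logEdyn, penalizes model size."""
    return len(subset) - (5 if "logEdyn" in subset else 0)

n_models = sum(1 for _ in all_subsets(PREDICTORS))
best = best_per_size(PREDICTORS, toy_score)
print(n_models)  # 2**7 - 1 candidate models for seven predictors
print(best[1])
```

With seven candidate predictors the search covers 127 models, which is why the method is practical here but becomes expensive as the number of predictors grows.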

Table 4 Best subsets of regression

Results of MLMs for rock mass deformability modulus (Em)

Results of the Copula method

The Copula method was implemented using R software to determine Em. As can be seen from Fig. 5, R2 = 0.85 for Em in the Copula method. Em is the dependent variable, and the other four variables (i.e., Jc, UCS, RQD, and Edyn) are independent variables. In the Copula MLM for Em, the variables are numbered as follows: 1 for the deformability modulus (Em), and 2, 3, 4, and 5 for joint condition (Jc), uniaxial compressive strength (UCS), rock quality designation (RQD), and dynamic modulus (Edyn), respectively. As a result, four Copula trees were developed. In Tree 1, each pair of parameters is formed from one parameter and another. The pair of deformability modulus and dynamic modulus provides the best result, with a par value of 1.76 and a τ value of 0.69 (Table 5). The graphical models of all Copula trees are presented in Fig. 6. In Tree 1, the pair of Em with UCS provides the second-best result, with a τ value of 0.46 (Fig. 7). Moreover, the pairs of Em with Jc and with RQD give τ values of 0.41 and 0.41, respectively (Fig. 7). In Tree 2, each pair of parameters was formed from one parameter and two others. In Tree 3, each pair was formed from two parameters and the other two parameters. Finally, in Tree 4, each pair was formed from two parameters and the other three. It is noteworthy that all these Copula trees yielded an independence relationship between the parameters (Table 6).
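The τ values quoted for each pair are rank-based dependence measures. Assuming τ denotes Kendall's rank correlation, as is usual in pair-Copula output (the text does not say so explicitly), it can be computed by counting concordant and discordant pairs, as in the minimal Python sketch below. The two short data series are hypothetical and contain no ties.

```python
def kendall_tau(x, y):
    """Kendall's tau (tau-a): (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1   # both variables move in the same direction
            elif sign < 0:
                discordant += 1   # the variables move in opposite directions
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical paired observations, for illustration only.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [1.0, 3.0, 2.0, 5.0, 4.0]

tau = kendall_tau(first, second)
print(tau)  # 8 concordant and 2 discordant pairs out of 10
```

A τ close to 1 indicates that the two variables almost always increase together, while a τ near 0 is consistent with the independence relationship reported for the higher trees.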

Fig. 5 Scatter plots of the predicted Em from the Copula method

Table 5 The results of the Copula Tree 1 for Em
Fig. 6 The graphical models of all Copula trees for Em

Fig. 7 The results of the Copula Trees 1 and 2 for Em

Table 6 The results of the Copula Trees 2, 3, and 4 for Em

Using SVR to predict deformability modulus

In this study, Em was also predicted using the SVR method. The model was constructed with the e1071 package, which is the first and most intuitive SVM package in R. The UCS, RQD, Jc, Js, Edyn, and Wc were selected as the model’s inputs. The SVR model was constructed with the radial basis function (RBF) kernel. The kernel function maps the data into a space in which a nonlinear relationship can be treated as linear. The SVR results are presented in Table 9. As can be seen, the models can accurately predict Em with R2 = 0.868 (Fig. 8). Overall, three SVR models were developed for Em; they are summarized in Table 7.
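Two standard ingredients of SVR with an RBF kernel can be written out explicitly. The Python sketch below is only illustrative and is not the e1071 implementation: it assumes the usual RBF kernel, exp(-gamma * squared distance), and the usual epsilon-insensitive loss. The values of `gamma` and `epsilon` and the sample points are hypothetical, since the text does not report the tuned hyperparameters.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: similarity decays with the squared distance between x and z."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger errors grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical (already scaled) input vectors and hyperparameters.
origin = [0.0, 0.0]
neighbor = [1.0, 0.0]

k_same = rbf_kernel(origin, origin, gamma=0.5)
k_far = rbf_kernel(origin, neighbor, gamma=0.5)
small_error = epsilon_insensitive_loss(10.0, 10.05, epsilon=0.1)
large_error = epsilon_insensitive_loss(10.0, 10.5, epsilon=0.1)
print(k_same, k_far, small_error, large_error)
```

The kernel value acts as a similarity weight between observations, and the tolerance band of width epsilon is what makes the fitted function depend only on a subset of the training points (the support vectors).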

Fig. 8 Scatter plots predicted by SVR for Em

Table 7 Results of SVR model for Em

Predicting deformability modulus by MNLR

Power multivariate nonlinear regression was applied to determine the effect of geomechanical parameters on Em. In the MNLR method, Em is the dependent variable, and UCS, RQD, Jc, Js, Edyn, and Wc are independent variables. A total of 21 MNLR models were developed for Em, among which the hybrid MNLR model provides the best results. The hybrid MNLR model is the second-best model among all the machine learning models and gives the relationship shown in Eq. (3). Furthermore, the results of the hybrid MNLR are very close to those of SVR. The results of the nonlinear regression analysis for predicting Em (R2 = 0.866) are presented in Fig. 9.

$$E_{m} = 0.4675 \times E_{dyn}^{1.3641} \times 0.9220^{E_{dyn}} \times 1.0023^{{\text{UCS}}} \times 1.1262^{I\left( Js = 20 \right)} \times 1.0020^{{\text{RQD}}}$$
(3)
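Equation (3) can be evaluated directly, as in the short Python sketch below. The coefficients are copied from Eq. (3); the function name, the argument names, and the input values are hypothetical, and the inputs are assumed to be in the same units as the data used to fit the model, which the text does not restate here. The indicator I(Js = 20) is passed as a Boolean.

```python
def em_hybrid_mnlr(edyn, ucs, rqd, js_is_20):
    """Evaluate Eq. (3); js_is_20 is True when the joint spacing class equals 20."""
    indicator = 1 if js_is_20 else 0
    return (
        0.4675
        * edyn ** 1.3641
        * 0.9220 ** edyn
        * 1.0023 ** ucs
        * 1.1262 ** indicator
        * 1.0020 ** rqd
    )

# Hypothetical inputs, for illustration only.
with_js_20 = em_hybrid_mnlr(edyn=10.0, ucs=50.0, rqd=80.0, js_is_20=True)
without_js_20 = em_hybrid_mnlr(edyn=10.0, ucs=50.0, rqd=80.0, js_is_20=False)
print(with_js_20 / without_js_20)  # the indicator multiplies Em by 1.1262
```

Taking logarithms of Eq. (3) gives a model that is linear in logEdyn, Edyn, UCS, the Js indicator, and RQD, which matches the log-linear approach adopted in the statistical analysis.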
Fig. 9 Scatter plots predicted by hybrid MNLR for Em

Comparisons between MLMs

One of the significant challenges with MLMs is checking model accuracy on unseen data to know whether the designed model performs well. To evaluate a model’s accuracy, it must be tested against data points that were not present during model training. CV, implemented here in the R programming language, is one of the best methods for checking the accuracy of a machine learning model, and it is a standardized technique for testing the performance of a predictive model. In this process, a part of the dataset is set aside and is not used in model training. When the model is ready, this held-out part is used for testing. CV methods are divided into two types: (1) non-exhaustive CV (e.g., K-fold CV, the holdout method, and repeated random sub-sampling validation) and (2) exhaustive CV (e.g., leave-p-out CV and leave-one-out CV).

The K-fold CV method was used to find the best-designed machine learning models for Em. This method is one of the most accurate and reliable methods for testing machine learning models. The CV technique divides the data into K equal subsets (folds). Of these K folds, one is used as the validation set, and the remaining folds are used to train the model. The K-fold CV method was applied to SVR, the Copula method, and MNLR to find the best MLM for Em, and the fitted models were built with this technique. In the K-fold CV method, 25 models were constructed, including 21 MNLR models, 1 Copula method model, and 3 SVR models. Model training was based on 84 observations, and the fitted models were tested on 4 observations that were left out randomly (K-fold CV, where K = 4). The process was repeated 100 times. The results showed that SVR Model 25 is the best-designed model for Em. However, this model’s output differs only very slightly from that of SVR Model 24. Moreover, the hybrid MNLR Model 17 is the second-best-designed machine learning model for Em (Fig. 10). The Copula method model did not provide good results for Em. In addition, the coefficients of variation (ratio of standard deviation to average) of R2, RMSE, and MAE over the 100 repetitions of K-fold CV are 0.004, 0.013, and 0.009, respectively (Table 8).
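The repeated K-fold procedure can be sketched in a few lines of Python. This is a generic illustration of how the data are shuffled and split, not the authors' R code; the function names, the seed, and the toy sizes (12 observations, K = 4, 3 repetitions) are hypothetical, and model fitting and scoring are left out.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[start::k] for start in range(k)]

def repeated_k_fold(n, k, repeats, seed=0):
    """Yield (train, test) index lists for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        for held_out in k_fold_indices(n, k, rng):
            held_out_set = set(held_out)
            train = [i for i in range(n) if i not in held_out_set]
            yield train, held_out

# Toy configuration, for illustration only.
splits = list(repeated_k_fold(n=12, k=4, repeats=3))
print(len(splits))                          # k * repeats train/test pairs
print(len(splits[0][0]), len(splits[0][1])) # training and test sizes of one split
```

In each repetition every observation is held out exactly once, and averaging the error indices over all repetitions gives the kind of stability measure summarized by the coefficients of variation above.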

Fig. 10 The comparison of all machine learning models for Em

Table 8 K-fold cross-validation results of all machine learning models

In the following, we outline the overall performance of the various models presented in this work in predicting Em values, based on RMSE. Table 9 summarizes the comparison of all statistical indices (MSE, RMSE, and MAE) for the different models.

Table 9 Comparison of the best final models

Conclusion

Rock mass deformability is one of the most demanding, sensitive, and complicated subjects in engineering geology, civil engineering, rock mechanics, and petroleum engineering. In this respect, it is necessary to determine the effect of different geomechanical parameters, e.g., uniaxial compressive strength (UCS), rock quality designation (RQD), joint condition (Jc), joint spacing (Js), dynamic modulus (Edyn), and groundwater condition (Wc), on Em. For this purpose, different modern and reliable machine learning and statistical methods were used. Statistical analysis is a straightforward technique for selecting the best parameters and for the quantitative analysis of Em data. These methods are significantly less time-consuming and less costly than other methods, such as in-situ field tests. The SVR model provided the most realistic and accurate results for Em. In comparison, the Copula method did not provide excellent results for Em owing to limited data availability. Multivariate nonlinear regression (MNLR) is the second-best machine learning model for predicting Em. The results showed that Em correlates more strongly with the dynamic modulus (Edyn) than with the UCS, RQD, and Jc. Furthermore, Em has the weakest relationship with the geomechanical parameters Js and Wc. It is noteworthy that both Js and Wc are classified as non-continuous. Overall, this research effectively predicts Em by resolving the high-dimensional and nonlinear relationship between Em and the mentioned parameters.