Introduction

By synergizing geospatial data, remote sensing techniques, and diverse Earth science datasets, informatics plays a pivotal role in fostering a holistic comprehension of natural systems and processes (Sermet and Demir 2019). In the realm of engineering applications, this encompassing knowledge significantly enhances the precision of decision-making and predictive capabilities. For instance, within geotechnical engineering, Earth Science Informatics facilitates the integration of soil behavior, geological insights, and hydrological information into prognostic models, thereby refining the design and evaluation of critical structures such as foundations, dams, and slopes (Pham et al. 2023). This interdisciplinary methodology not only optimizes engineering outcomes but also advances sustainable solutions by meticulously accounting for the intricate interplay between natural and constructed environments.

The Soil Water Characteristic Curve (SWCC) offers valuable insights, both directly and indirectly, into the behavior of water within unsaturated soils (Zhai and Rahardjo 2012). The accurate determination of the SWCC for a given soil necessitates a combination of precise measurement techniques and predictive methodologies. Nonetheless, the field, laboratory, and computer vision-based measurements of SWCC are resource-intensive, laborious, time-consuming, and occasionally unfeasible due to challenges concerning scaling, spatial variability, and site inaccessibility (Achieng 2019). As a result, the utilization of modeling procedures has become a widely adopted approach for predicting SWCC (Dobarco et al. 2019).

The application of machine learning (ML) algorithms in soil moisture research has witnessed a substantial upsurge. These algorithms are favored for their non-parametric essence and adeptness in capturing intricate and non-linear associations (Padarian et al. 2019). ML techniques employed to estimate SWCC predominantly fall within the realm of supervised learning, which entails the provision of a labeled training dataset containing known output values. The model is then trained through algorithms applied to the input dataset, enabling the prediction of the desired output. The training process continues until the model attains the intended accuracy on the training dataset. Supervised learning finds widespread application in tasks encompassing classification and regression (Rani et al. 2022).

In various studies, researchers have used different algorithms to understand the complex relationship between soil properties and water content. These methods range from traditional models to advanced ones like artificial neural networks (ANN), support vector machines (SVM), random forests (RF), and deep neural networks (DNN). Neural network models consist of input, hidden, and output layers, with the number of hidden layers determined by problem complexity. For general geotechnical engineering problems, a single hidden layer often suffices (Wang et al. 2022). By combining data preprocessing with Bayesian regularization neural networks (BRNN), Pham et al. (2019) improved the precision of SWCC prediction and showed that a three-hidden-layer BRNN-based pedotransfer function (BRNN-PTF) performed considerably better in predicting soil water content. The effectiveness of these algorithms was assessed through metrics including the Root Mean Squared Error (RMSE) and R-squared (R2) values, which indicate the predictive capacity of the models. Table 1 synthesizes these studies, outlining the ML algorithms employed, their performance metrics, noteworthy observations, and the features included in the models.

Table 1 Review on using machine learning algorithm to estimate SWCC

Several pedotransfer functions (PTFs) for predicting SWCC with acceptable accuracy have been proposed (Pachepsky et al. 2006; Leij et al. 2004; Børgesen et al. 2008). However, little attention has been given to assessing the significance of the features within the datasets used. Although Pham et al. (2023) determined the importance of database features when constructing their ML-PTFs and assessed the effectiveness of this step in SWCC estimation, they did not specifically examine the role of soil porosity in water retention (Tuller et al. 2004).

The works by Fredlund and Rahardjo (1993) as well as Hopmans and Dane (1986) underscore the significance of matric suction in governing water dynamics and mechanical responses within soils of varying compositions, encompassing sandy and silty soils. While the assertion by Achieng (2019) maintains that a deep learning approach enables the exclusive prediction of SWCC using soil matric suction as the sole input feature, other investigations posit that precise SWCC predictions frequently derive advantages from a broader spectrum of input parameters, obtained through laboratory analyses and image processing methods. The incorporation of supplementary parameters like soil texture, porosity, particle size distribution, and mineral composition augments the predictive efficacy of ML models.

Nguyen et al. (2014) and Vereecken et al. (2010) reached the conclusion that incorporating soil structure information into Pedotransfer Functions (PTFs) holds the potential to enhance their performance. They further recommended in-depth exploration to determine the robustness of these improvements across various data mining techniques and diverse categories of PTFs. Employing ImageJ's built-in capability for soil porosity analysis, Bakhshi et al. (2023) demonstrated that the water retention capacity and SWCC pattern are contingent on soil pore geometrics. This feature yields valuable output parameters, encompassing porosity surface area (Total Area of Porous Regions, cm²), volume (Total Number of Porous Voxels × Voxel Volume, cm³), elongation (Major Axis Length/Minor Axis Length, dimensionless), flatness (Average Length of Major Plane/Average Length of Minor Plane), sphericity (4π × area/perimeter², dimensionless), and compactness (volume of the porous region/surface area of the porous region, dimensionless).
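The shape descriptors listed above are simple ratios, and a minimal Python sketch can make their definitions concrete. The function names are illustrative only and are not part of ImageJ; the inputs are assumed to be measurements already extracted from the images.

```python
import math


def sphericity_2d(area, perimeter):
    """Sphericity as defined in the text: 4*pi*area / perimeter**2.

    The value is 1 for a perfect circle and decreases for irregular outlines.
    """
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


def elongation(major_axis_length, minor_axis_length):
    """Elongation: major axis length divided by minor axis length."""
    if minor_axis_length <= 0:
        raise ValueError("minor axis length must be positive")
    return major_axis_length / minor_axis_length


def compactness(pore_volume, pore_surface_area):
    """Compactness: volume of the porous region over its surface area."""
    if pore_surface_area <= 0:
        raise ValueError("surface area must be positive")
    return pore_volume / pore_surface_area


# A circle of radius 1 (area pi, perimeter 2*pi) has sphericity 1,
# while a 2 x 2 square (area 4, perimeter 8) has sphericity pi/4.
circle = sphericity_2d(math.pi, 2.0 * math.pi)
square = sphericity_2d(4.0, 8.0)
```

A circular pore outline therefore scores 1, and more irregular outlines score progressively lower, which is what makes the descriptor useful for distinguishing pore geometries.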

Building upon these findings, our study leveraged intricate soil structural attributes derived from image analysis as inputs for the ML technique employed. To this end, in conjunction with other frequently employed algorithms, we assessed the application of gradient boosting (GB) and Ada Boost (AdaB) in estimating SWCC using ML within the Orange.3 data mining software. The predictive exercise was undertaken under two scenarios: 1) using matric suction as the sole predefined input, and 2) integrating an array of input parameters garnered from both laboratory measurements and image analysis techniques.

Material and methods

Soil sampling, treatment preparation, and experimental setup

This study delved into the intricate effects of diverse treatments on soil porosity and SWCC within soil samples obtained from distinct textural classes in Central Iran. The samples originated from Arenosols (coordinates 35° 54′ N and 50° 32′ E) and Vertisols (coordinates 36° 22′ N and 49° 35′ E) and comprised loamy sand and silty clay soils.

Soil sampling and analysis

Topsoil samples (0–10 cm) were collected, dried, and passed through a 2 mm sieve to ensure uniformity in the analysis. Established methods were employed to evaluate the soil properties relevant to predicting SWCC. These properties included soil organic carbon (SOC) (Walkley and Black 1934), serving as an indicator of organic matter content; particle size distribution (PSD) (Gee and Or 2002), revealing soil texture composition; cation exchange capacity (CEC) (Rhoades 1983), a measure of ion retention capacity; electrical conductivity (EC) (Rhoades 1996), reflecting soil salinity; pH (Thomas 1996), indicating soil acidity; and parameters characterizing soil porosity (α, n, θs, and θr) (Dexter et al. 2008). To preserve sample integrity, bulk density was determined through the core method, utilizing Kopecky rings (5 cm in height and 5 cm in diameter) (Grossman and Reinsch 2002).

Treatment preparation and experimental setup

The soil samples underwent a comprehensive range of treatments, each meticulously designed to investigate specific soil responses. This included the application of various levels of CaCO3, ranging from 0 to 5%, to assess the influence of calcium carbonate on soil properties (Huang et al. 2016). Similarly, Fe2O3.7H2O, varying from 0 to 2%, was introduced to explore the effects of iron oxide (Li et al. 2021). The incorporation of vermicompost, at varying levels (0% to 2%), allowed insights into how organic carbon and nutrient content impacted soil characteristics (Demir 2020). Furthermore, combined treatments involving CaCO3, Fe2O3.7H2O, and vermicompost in specific ratios (1.5%, 0.5%, and 1%, respectively), as well as higher levels, were investigated. Additionally, treatments were prepared based on Sarkar et al. (2014), where organic matter, iron oxide, and CaCO3 were removed at specific levels.

Cation treatment and structural degradation

To comprehend the influence of cations on soil structure, solutions containing CaCl2 and NaCl at concentrations of 0, 5, 10, and 20 meq L−1 (Mi et al. 2018) were employed for irrigation during the incubation period. This facilitated an examination of how varied cation levels affected soil behavior. The study also encompassed a comprehensive analysis of degraded treatments, achieved through a tailored consolidation process designed to replicate conditions resulting from natural degradation.

Incubation and testing

Following treatment application, the soil samples were placed in pots and incubated at room temperature (24–26 °C). To emulate real-world conditions, the samples underwent 20 cycles of shrink-swell and wetting–drying and were monitored throughout to capture any variations. Rewetting to field capacity was done by carefully adding water to the sponge cover placed on top of the columns, which avoided disturbing the soil.

Sample count and analysis

The study encompassed a total of 128 samples, involving diverse combinations of amendment treatments and degraded treatments. A subset of samples was selected for direct measurements of the SWCC, while others underwent preparation for subsequent image analysis through impregnation with a mixture of polyester resin, catalyst, hardener, and fluorescent dye. For a comprehensive breakdown of specific treatments, consult Table 2.

Table 2 Initial soil samples properties

Determination of the SWCC

The SWCC was constructed by combining the results obtained for water content at both low matric suctions (0, 10, 20, 40, and 70 cm) using a sandbox apparatus (Cresswell et al. 2008) and higher matric suctions (100, 300, 500, 1000, 3000, 5000, 9000, and 15,000 cm) using pressure plate/pressure membrane apparatus (Dane et al. 2002). Despite the acknowledged methodological limitations, this approach stands as the most prevalent technique for SWCC measurement (Schindler et al. 2012). Undisturbed samples were used to determine the lower matric suctions ranging from 0 to 1000 cm, while disturbed samples were utilized for matric suctions ranging from 3000 to 15,000 cm.

Sample preparation, imaging, and image preprocessing

A total of 128 soil samples, having undergone pre-treatment, were subjected to a methodical impregnation procedure involving a mixture of polyester resin and styrene in a 5:1 ratio (Eben et al. 2020), accompanied by suitable amounts of hardener and catalyst. To enhance the visibility of soil pores during the forthcoming digital imaging phase, a brightener, 2 g.L−1 of fluorescent dye, was deliberately introduced into the mixture (Ringrose-Voase 1996). This strategic inclusion served to augment the luminance of pores under UV illumination, facilitating their subsequent visual analysis.

The impregnation process unfolded within plastic containers that were housed in a meticulously regulated environment within a vacuum desiccator. The desiccator underwent an evacuation process set at 8 psi for a duration of 2 h. This step assumed paramount importance, ensuring comprehensive resin infiltration throughout the samples and the effective displacement of air from the pores. Consequent to this evacuation, the samples were refilled with the impregnation mixture and subjected to an additional two-hour vacuum cycle, thereby optimizing resin penetration. Following this sequence, the samples were diligently sealed to counteract the rapid volatilization of styrene. Approximately seven days subsequent to sealing, the samples were unsealed, allowing for the gradual and natural volatilization of styrene over time, ultimately leading to the desired hardening of the polyester resin. This polymerization process culminated after an average duration of approximately 75 days (Wei et al. 2019).

Upon the completion of the resin hardening phase, the samples underwent meticulous cutting and polishing procedures. Each individual sample was meticulously subjected to two horizontal and two vertical cuts, which collectively resulted in the exposure of four proximate surfaces, thus providing an extensive range of viewing angles for subsequent imaging. Notably, the imaging process was executed within an environment carefully configured as a controlled dark room, a setting that was equipped with specialized UV lamps. The strategic utilization of these lamps aimed to maximize the fluorescence emission of the dye embedded within the pores, thereby significantly enhancing their visibility. The images were captured using a digital camera boasting a resolution of 12 MP and an aperture of f/1.8.

Following the successful acquisition of the color images, the next crucial step entailed their systematic preprocessing within the ImageJ software. This versatile software platform facilitated an array of operations essential for effective analysis. Specifically, the color images were subject to grayscale conversion, a step that transformed the images into a grayscale format, subsequently enhancing their suitability for further analysis. To accentuate the visual distinction between pores and solid regions, thresholding was systematically applied to the grayscale images, resulting in their conversion into binary images. This binary representation enabled a sharp demarcation between pores, represented as white pixels, and solid areas, depicted as black pixels.
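The grayscale conversion and thresholding steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the ImageJ procedure itself: the unweighted channel average and the threshold value of 128 are assumptions, since the conversion settings and threshold used in the study are not stated here.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to gray levels.

    Assumption: an unweighted mean of the three channels is used.
    """
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb_image]


def to_binary(gray_image, threshold):
    """Threshold a grayscale image: 1 = pore (bright), 0 = solid (dark)."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in gray_image]


# Tiny 2 x 2 example image; bright pixels stand for fluorescing pores.
rgb = [
    [(255, 255, 255), (0, 0, 0)],
    [(30, 30, 30), (200, 220, 240)],
]
gray = to_grayscale(rgb)
binary = to_binary(gray, threshold=128)  # hypothetical threshold value
```

Because the dye makes pores fluoresce under UV light, bright pixels map to 1 (pore) and dark pixels to 0 (solid), matching the white/black convention described above.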

The stacking of these binary images yielded an ensemble of four 3D volumes for each individual sample. These volumes served as the foundational data for the subsequent analysis, providing a multi-dimensional perspective of the spatial distribution of pores within the samples. The ensuing analysis drew extensively from the specialized 2D and 3D plugins embedded within the ImageJ software. These plugins facilitated the extraction of key parameters characterizing the identified pores. This encompassed pivotal parameters including the determination of 3D porosity, pore sphericity, aspect ratio, and the orientation of pore objects. The orientation was expressed through two angles: φ, representing the angular deviation between the horizontal plane and the long axis of the pore channel (ranging from 0° to 90°), and θ, representing the azimuthal orientation of the long axis on the horizontal plane (ranging from 0° to 180°). Furthermore, critical metrics such as pore space surface area and sphericity were directly ascertained through the utilization of ImageJ. Integral to the analysis was the calculation of porosity, which manifested as the fraction of image volume characterized by pore space.
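As a minimal sketch of the last step, porosity can be computed as the fraction of voxels in a stacked binary volume that are marked as pore space. The data layout (a list of 2D binary slices with 1 for pore and 0 for solid) is an assumption made for illustration and is not the internal format of ImageJ.

```python
def porosity(volume):
    """Fraction of voxels that are pore space.

    `volume` is a list of 2D binary slices (1 = pore, 0 = solid).
    """
    total_voxels = sum(len(row) for image in volume for row in image)
    if total_voxels == 0:
        raise ValueError("volume contains no voxels")
    pore_voxels = sum(value for image in volume for row in image
                      for value in row)
    return pore_voxels / total_voxels


# Two stacked 2 x 2 slices: 3 pore voxels out of 8.
stack = [
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
]
phi_3d = porosity(stack)
```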

ML procedure

Exploring ML models

Acknowledging the potential of ML models to unveil complex data patterns, these models were employed to unravel intricate relationships within the acquired soil dataset. However, it was recognized that the effectiveness of these models depended on the quality, quantity, and representativeness of the training dataset. Table 3 provides a comprehensive overview of the ML algorithms employed in this study for the prediction of SWCC. Each algorithm is described along with its key hyperparameters, strengths, and limitations.

Table 3 Overview of the utilized machine learning algorithms to predict the Soil Water Characteristic Curve

Orange.3

Ahangar-Asr et al. (2012) emphasized that the simplicity of a procedure and its capability to apply multiple models simultaneously are key factors in determining the priority of a method for estimating SWCC. In line with this, we utilized Orange.3 software, which offers a user-friendly and efficient ML process. This approach facilitated a rapid comparison of diverse fitted models, encompassing Gradient Boosting, Ada Boost, Decision Tree, Random Forest, Neural Network, Support Vector Machine, k-Nearest Neighbors, and Linear Regression. Furthermore, the Feature Importance widget was used to determine the relative importance of input features in predicting SWCC with a minimal dataset.

Statistical analysis

The statistical analysis included the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Relative Root Mean Squared Error (RRMSE), the Pearson correlation coefficient (R), the Performance Index (Pi, Jalal et al. 2021), and Willmott’s index of agreement (d1, Zhang et al. 2020; Achieng 2019), which collectively quantify the predictive performance and reliability of the models (Eqs. 1–6). These metrics provide a comprehensive view of how well the algorithms capture the complex relationships inherent in SWCC data.

$$RMSE=\sqrt{\frac{{\sum }_{\mathrm{i}=1}^{\mathrm{n}}{({\mathrm{o}}_{\mathrm{i}}-{\mathrm{t}}_{\mathrm{i}})}^{2}}{\mathrm{n}}}$$
(1)
$$MAE=\frac{{\sum }_{\mathrm{i}=1}^{n}\left|{\mathrm{o}}_{\mathrm{i}}-{\mathrm{t}}_{\mathrm{i}}\right|}{n}$$
(2)
$$RRMSE= \frac{1}{\left|\overline{o }\right|} \sqrt{\frac{{\sum }_{\mathrm{i}=1}^{\mathrm{n}}{({\mathrm{o}}_{\mathrm{i}}-{\mathrm{t}}_{\mathrm{i}})}^{2} }{\mathrm{n}}}$$
(3)
$$R=\frac{{\sum }_{i=1}^{n}({\mathrm{o}}_{\mathrm{i}}-\overline{o })({\mathrm{t}}_{\mathrm{i}}-\overline{t }) }{\sqrt{{\sum }_{i=1}^{n}{({\mathrm{o}}_{\mathrm{i}}-\overline{\mathrm{o} })}^{2} {\sum }_{i=1}^{n}{\left({\mathrm{t}}_{\mathrm{i}}-\overline{\mathrm{t} }\right)}^{2}}}$$
(4)
$${P}_{i}=\frac{RRMSE}{1+R}$$
(5)
$${d}_{1}=1-\frac{{\sum }_{i=1}^{n}\left|{t}_{i}- {o}_{i}\right|}{{\sum }_{i=1}^{n}\left(\left|{o}_{i }-\overline{o }\right|+ \left|{t}_{i}- \overline{t }\right|\right)}$$
(6)

Where oi and ti represent the i-th actual and predicted output values, respectively. \(\overline{o }\) and \(\overline{t }\) denote the average values of the actual and predicted output values, respectively. The parameter n signifies the number of samples under consideration, and it is worth noting that in our analysis, no parameters were excluded during the regression process.
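For readers who want to reproduce these statistics outside Orange.3, the following is a minimal Python sketch of Eqs. 1 to 6. The function name and the returned dictionary keys are illustrative. Eq. 6 is read here with both the numerator and the denominator summed over all samples, which is an assumption about the intended form.

```python
import math


def swcc_metrics(observed, predicted):
    """Compute RMSE, MAE, RRMSE, R, Pi and d1 for paired observations."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    o_mean = sum(observed) / n
    t_mean = sum(predicted) / n
    pairs = list(zip(observed, predicted))

    squared_error = sum((o - t) ** 2 for o, t in pairs)
    absolute_error = sum(abs(o - t) for o, t in pairs)

    rmse = math.sqrt(squared_error / n)              # Eq. 1
    mae = absolute_error / n                         # Eq. 2
    rrmse = rmse / abs(o_mean)                       # Eq. 3

    covariance = sum((o - o_mean) * (t - t_mean) for o, t in pairs)
    spread_o = sum((o - o_mean) ** 2 for o in observed)
    spread_t = sum((t - t_mean) ** 2 for t in predicted)
    if spread_o == 0 or spread_t == 0:
        raise ValueError("R is undefined for constant inputs")
    r = covariance / math.sqrt(spread_o * spread_t)  # Eq. 4

    pi = rrmse / (1.0 + r)                           # Eq. 5

    # Eq. 6, with the denominator summed over all samples (assumption).
    denominator = sum(abs(o - o_mean) + abs(t - t_mean) for o, t in pairs)
    d1 = 1.0 - absolute_error / denominator

    return {"RMSE": rmse, "MAE": mae, "RRMSE": rrmse,
            "R": r, "Pi": pi, "d1": d1}


# Made-up water contents (g/g), for illustration only.
measured = [0.40, 0.30, 0.20, 0.10]
estimated = [0.38, 0.32, 0.19, 0.11]
stats = swcc_metrics(measured, estimated)
```

A perfect prediction gives RMSE, MAE, RRMSE and Pi of zero, with R and d1 equal to one, which is a quick sanity check for any implementation.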

Feature importance analysis

As suggested by Pham et al. (2023), the analysis of feature importance employed the Shapley additive explanations (SHAP) technique, which allowed for a quantitative assessment of the significance of the selected features. This methodology revolves around the assessment of model variance using a variance-based measurement approach (Molnar 2020). Notably, the geotechnical field has widely adopted the Shapley value approach (Wadoux and Molnar 2022; Cheng et al. 2022) to unravel the intricacies of feature importance. The Shapley value (φ) embodies the average marginal contributions of features within various coalitions, as encapsulated by the equation:

$${\mathrm{\varphi }}_{j}=\sum_{S\subseteq X\setminus \{{x}_{j}\}}\frac{\left|S\right|!\left(p-\left|S\right|-1\right)!}{p!}\left(val(S\cup \{{x}_{j}\})-val(S)\right)$$
(7)
$$\sum\nolimits_{j=1}^{p}{\mathrm{\varphi }}_{j}=\widehat{\mathrm{f}}\left(\mathrm{X}\right)-{\mathrm{E}}_{X}(\widehat{\mathrm{f}})$$
(8)

Where S represents a subset of features, X = {X1,X2,…,Xp} designates the observed data point, xj denotes the value of the feature under consideration, p signifies the total count of features, val(S) reflects the model prediction marginalized over the remaining input features, \(\widehat{f}(X)\) stands for the model prediction for the given X, and EX(\(\widehat{f}\)) represents the anticipated model predictions for a given dataset. This comprehensive approach elucidates the intricate dynamics of feature importance and model contributions. All computations and feature importance analyses in this study were carried out automatically using Orange.3 software.
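To illustrate how the Shapley value weights marginal contributions across coalitions, the following Python sketch evaluates the definition exactly for a toy value function. It is not the routine used by Orange.3, which computes SHAP values automatically. The feature names, the value function `toy_val`, and all numbers are hypothetical and are not results from this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, val):
    """Exact Shapley values for a value function defined on feature subsets."""
    p = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! for coalitions of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (val(s | {j}) - val(s))
        phi[j] = total
    return phi


# Hypothetical contributions to the prediction, for illustration only.
BASE = {"matric_suction": 0.10, "flatness": 0.04, "bulk_density": 0.02}


def toy_val(subset):
    """Toy value function: additive terms plus one interaction term."""
    value = sum(BASE[name] for name in subset)
    if "matric_suction" in subset and "flatness" in subset:
        value += 0.02  # interaction, shared equally by the two features
    return value


FEATURES = list(BASE)
phi = shapley_values(FEATURES, toy_val)
```

In this toy case the interaction term is split evenly between the two interacting features, and the values sum to the difference between the prediction with all features and the prediction with none. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations.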

Results and discussion

The properties of initial soil samples

Table 2 presents the routine properties of the soil samples and the parameters obtained from their SWCCs prior to any treatment. The two samples were selected deliberately to ensure a wide range of variation in their physical, chemical, and hydraulic properties, allowing for a comprehensive evaluation. The loamy sand sample has a high sand content of 82.6% with approximately 12% clay, while the silty clay sample has a clay content exceeding 40% and a lower sand content of around 10%. Both samples are non-saline and slightly alkaline, but they differ significantly in terms of organic carbon (OC) content (0.12% vs. 0.42%) and cation exchange capacity (CEC) values (4.8 cmol+ kg−1 vs. 24.1 cmol+ kg−1). The matric suction at the inflection point (hi) of the SWCC varies from 300 cm in the loamy sand sample to 70 cm in the silty clay sample. The shape factor (n) of the SWCC in Van Genuchten’s (1980) model ranges from 2.07 in the loamy sand sample to 1.0 in the silty clay sample. The bulk density of the studied samples did not show significant differences. However, there were significant differences between the two samples in the alpha coefficient, which corresponds to the inverse of the air-entry suction (α, cm−1), as well as in the saturation water content (θs, g.g−1) and residual water content (θr, g.g−1).
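To show how the parameters α, n, θs, and θr shape the curve, the following Python sketch evaluates the van Genuchten (1980) retention function at the matric suctions used in this study. The constraint m = 1 − 1/n is the common choice and is assumed here. Apart from n = 2.07, which is the value reported for the loamy sand, the parameter values are hypothetical and are not taken from Table 2.

```python
def van_genuchten(h, theta_r, theta_s, alpha, n):
    """Water content at matric suction h (cm), assuming m = 1 - 1/n."""
    if h <= 0:
        return theta_s  # saturated at zero suction
    m = 1.0 - 1.0 / n
    return theta_r + (theta_s - theta_r) / (1.0 + (alpha * h) ** n) ** m


# Matric suctions (cm) used for the sandbox and pressure plate measurements.
SUCTIONS_CM = [0, 10, 20, 40, 70, 100, 300, 500,
               1000, 3000, 5000, 9000, 15000]

# Hypothetical parameters (theta in g/g, alpha in 1/cm); n from the text.
curve = [van_genuchten(h, theta_r=0.05, theta_s=0.40, alpha=0.02, n=2.07)
         for h in SUCTIONS_CM]
```

The curve starts at θs for zero suction and decreases toward θr as suction grows. A larger α shifts the drop to lower suctions, and a larger n makes the drop steeper.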

Changes in properties of treated samples and the results obtained from image analyses

Table 4 presents the changes in the physical and chemical properties of the treated samples after the incubation period, compared to the blank samples. Additionally, Table 2 provides the results from image analyses of the soil pores developed as a result of the treatments. The trends in Table 4 reveal noteworthy patterns in the impact of treatments, treatment levels, and soil textures on various soil properties. Generally, bulk density tends to increase with higher treatment levels, particularly in the “CaCO3” and “OM” treatments. “Cations” treatments consistently lead to elevated CEC with increasing levels, while “Removed Fe2O3” and “Removed OM” exhibit reduced CEC after removal. “CaCO3” treatments show increased electrical conductivity (EC) with higher levels, and “OM” treatments correlate with higher organic carbon (OC) content. pH values decrease with higher levels in “Cations” treatments, and “Removed CaCO3” results in decreased pH after removal. Porosity-related attributes are influenced by treatments and levels, with “OM” treatments consistently yielding higher porosity surface area and porosity volume. Removal and degradation treatments lead to various property changes, such as decreased OC content, CEC, and porosity-related attributes, highlighting the complexity of soil responses to alterations. Overall, these trends provide valuable insights into the intricate relationships between treatments, soil properties, and textures.

Table 4 Properties of treatments at the end of incubation and the results obtained from image analysis

Similar to Table 4 and Table 1, a dataset of individual treatments was prepared, which was automatically divided into model training and test datasets. This practice of preparing a unified database, as advocated by Wang et al. (2022) and Zhang et al. (2020), is essential in ML procedures to ensure consistency and facilitate the training, testing, evaluation, and comparison of different models, thereby yielding robust and reliable results. The features listed in Table 2 and Table 1 were then applied in eight algorithms to predict the soil water content at different matric suction levels. Soil matric suction was used as a predefined input feature, while the other features were applied separately in all evaluated models. The most important features were determined based on their effects on the model output, as shown in Fig. 1.
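The division into training and test datasets can be sketched as a seeded random split. This is only an illustration: the split ratio and sampling scheme applied by the software in this study are not stated here, so the 20% test fraction and the seed below are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into training and test lists, preserving order."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# 128 placeholder sample identifiers, matching the number of samples.
samples = list(range(128))
train_rows, test_rows = train_test_split(samples, test_fraction=0.2, seed=1)
```

Fixing the seed keeps the split identical across the eight algorithms, so that differences in the reported statistics reflect the models rather than the data partition.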

Fig. 1 Input parameters and their relative importance in accurate prediction of GB, AB, RF, SVM, DT, ANN, kNN, and LR models

Impacts and relative importance of the input parameters on the models

Researchers have utilized various soil properties, including the percentages of clay, silt, and sand, as well as void ratio and water content at saturation, along with soil matric suction related to gravimetric water content, for the estimation of SWCC (Pham et al. 2023; Rastgou et al. 2020). Identifying the most significant features in SWCC estimation can greatly reduce time and energy consumption while increasing accuracy. Fig. 1 (panels a to h) illustrates the effects of the different input parameters on the model outputs and their relative importance in terms of model accuracy (RMSE) in the eight ML algorithms. Similar to studies conducted in previous years (Pham et al. 2023; Gunarathna et al. 2019; Nguyen et al. 2017), our observations indicate that matric suction emerged as the most pivotal parameter in the GB (Fig. 1a), AB (Fig. 1b), RF (Fig. 1d), and SVM (Fig. 1f) models. On the other hand, organic carbon percentage, soil texture, porosity surface area, and electrical conductivity emerged as the most significant parameters in the DT (Fig. 1c), ANN (Fig. 1e), kNN (Fig. 1g), and LR (Fig. 1h) models, respectively. Matric suction was among the three most influential parameters affecting the model outputs in all models, except for the ANN model (Fig. 1e). Lower matric suction values resulted in higher prediction accuracy in the models, while higher matric suction values led to a decrease in accuracy. The results indicated that, except for the ANN model, three to five of the input characteristics were identified as the most influential parameters for prediction accuracy in different models.

Following matric suction, soil pore characteristics emerged as the next most significant parameters for accurate predictions, except in the ANN model. Pham et al. (2023) demonstrated that soil texture-related properties hold significant importance following soil matric suction. However, in our study, the prediction of SWCC reveals the involvement of one or two pore characteristics. Notably, attributes like structural flatness and porosity surface area exhibit a stronger influence than other pore characteristics. Soil bulk density, as another structural feature, has garnered attention from various researchers in recent years (Amanabadi et al. 2019; Gunarathna et al. 2019). However, this soil physical property serves as an average indicator of soil compaction and fails to provide insights into the detailed attributes of porosity. Some studies have endeavored to indirectly assess soil structure by incorporating soil moisture content across different matric suctions into their models (Senyurek et al. 2020; Cai et al. 2019). In one of the rare instances exploring soil porosity, Ahangar-Asr et al. (2012) integrated soil void ratio as an input parameter within a model geared towards SWCC and soil porosity characteristic prediction. Nevertheless, their investigation did not specifically delve into the impact of these properties on the outcomes of the model.

Comparison of the models’ predicted results

The output of the models when all parameters were used

When comparing the SWCCs generated by the models using all the studied parameters, it was found that the GB, AB, RF, and DT models produced the most accurate results, with lower RMSE (< 0.028) and MAE (< 0.018) and higher d1 (> 0.93) and R2 (> 0.968) in the test dataset (TstD), as shown in Table 5. This means that the mean difference between the predicted and measured water contents was less than 0.02 g g−1 for all matric suctions used to plot the SWCCs. Achieng (2019) used ML techniques, including ANN, DNN, and SVM models, to estimate SWCC. In most cases of drying SWCC, the models achieved an RMSE of less than 0.01, with R2 and d1 values exceeding 0.99 and 0.94, respectively, demonstrating high accuracy in the estimation of SWCC for the studied loamy sand soil sample. Lamorski et al. (2017) employed various SVM models trained with physical soil properties, including the SWCC drying branch, BD, sand%, silt%, clay%, OC, and soil specific surface, as input variables. The resulting models successfully estimated SWCC wetting branches with an R2 greater than 0.98 and an RMSE less than 0.02. Srivastava et al. (2013) utilized the SVM algorithm, which yielded an RMSE of 0.013 and an R2 of 0.69. In contrast, the performance of the random forest algorithm varied across studies. Long et al. (2019) and Im et al. (2016) reported RMSE values greater than 0.04 m3 m−3, while Bai et al. (2019) achieved accurate results with an RMSE less than 0.02 m3 m−3. However, in this study the ANN, SVM, kNN, and LR algorithms showed a significant decrease in model accuracy (as indicated by higher values of RMSE and MAE and lower values of d1 and R2) compared to the acceptable limits of accuracy. Consequently, these models were unable to generate SWCCs that met the required level of accuracy. Similar to the findings of Hastie et al. (2009), which demonstrated that regression-based methods may yield inaccurate results in pedotransfer function methods, the LR algorithm in this study produced an R2 of 0.66 and an RMSE of 0.69 when applied in the ML method, categorizing it as an inaccurate model. Nguyen et al. (2017) highlighted the benefits of the kNN model, including its flexibility, simplicity, accuracy in limited data availability conditions, and the ability to incorporate new observations into training datasets without the need to redevelop the PTF models. However, Guevara and Vargas (2019) examined the performance of the kNN algorithm for predicting soil moisture content based on DEM data and found that the prediction RMSE exceeded 0.05 m3 m−3. In another study, Liu et al. (2017) observed an RMSE greater than 0.07 m3 m−3 in the prediction of moisture content using the kNN algorithm with inputs derived from satellite data.

Table 5 The statistics obtained for the models used to generate SWCC using all parameters

Table 6 presents the Pearson correlation (r) between the measured water content (θMeasured) and the evaluating models, along with the identified important features. Previous studies have reported correlation coefficients greater than 0.9 between estimated and measured SWCC or soil moisture content using the random forest algorithm (Im et al. 2016; Bai et al. 2019; Long et al. 2019; Zappa et al. 2019). However, it is important to note that the ability of the same algorithm to estimate soil moisture content may vary depending on the input features used in the modeling procedure. For example, the aforementioned studies utilized different sets of input features, including satellite-derived data, soil texture (Zappa et al. 2019), and leaf area index (Im et al. 2016). These variations in input features can result in different levels of correlation with the target values. As illustrated in Fig. 1 and further supported by Table 6, certain features exhibit a stronger correlation with the measured soil moisture content. Notably, matric suction has shown a strong negative correlation with θMeasured, indicating its influence on soil moisture dynamics.

Table 6 Pearson correlation (r) of the model outputs and the input features with the measured water content

The reduction in soil pore size distribution resulting from increased soil compaction leads to elevated matric suction across all soil texture classes (Fredlund and Rahardjo 1993). Thus, soil bulk density and sand percentage exhibit a negative correlation with soil water content. Additionally, a negative correlation was observed between water content and structural flatness, indicating that increased soil pore compaction leads to a decrease in water content at varying matric suction levels. Notably, based on Pearson correlation coefficients, structural flatness (r = −0.625) demonstrates a more explicit effect on the decrease of soil water content compared to soil bulk density (r = −0.469).
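As a minimal sketch of how a Pearson correlation coefficient such as those reported above can be computed, the following standard-library Python function implements the textbook definition. The function name `pearson_r` and the sample values are hypothetical and chosen only for illustration; they are not data from this study, and the study itself used data mining software rather than this code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical illustrative values, NOT measurements from this study.
log_suction = [0.0, 1.0, 2.0, 3.0, 4.0]          # log10 of matric suction (cm)
theta_measured = [0.45, 0.40, 0.30, 0.18, 0.10]  # water content

r = pearson_r(log_suction, theta_measured)
print(f"r = {r:.3f}")  # strongly negative for this illustrative sample
```

A negative r, as in this illustrative sample, indicates that water content falls as the feature increases, which is the direction reported for matric suction, bulk density, and structural flatness.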

Applying only soil matric suction as the model input feature

To assess whether additional input features are needed to improve the model outputs, an evaluation was conducted using only the matric suction feature as the input. While soil matric suction has a significant impact on model learning and prediction accuracy, the results presented in Table 7 demonstrate that models trained solely on matric suction and the related water content data did not achieve acceptable precision. The models exhibited high error rates and low R2 values when tested on the dataset. These findings indicate the need for additional input features to improve the accuracy and reliability of the models.

Table 7 Statistics of models in the case where matric suction was used as the only input parameter

Despite the negative correlation observed between soil water content and matric suction in the evaluating models (Table 8), the calculated RMSE values revealed relatively high errors in the model outputs. The mean absolute errors further indicated significant inaccuracies in the prediction of soil water contents at different matric suction levels, with values ranging from 0.08 to 0.09. Such errors are far from acceptable in this context. Moreover, the considerably low values of R2 highlight the inconsistency between the predicted SWCC patterns and the observed data.

Table 8 Pearson correlation (r) when the matric suction is included as the only modeling parameter

Using soil matric suction as the sole input feature in the eight evaluated models significantly reduces the correlation between the model outputs and the measured water content (θMeasured). As a result, the correlation of the Linear Regression model with θMeasured is lower than the correlation between matric suction itself and θMeasured (Table 8). In other words, the raw matric suction values track the measured SWCC more closely than the output of the Linear Regression model trained only on matric suction values. This observation suggests that in this case the modeling process was not effective and did not produce useful outcomes. Zhang et al. (2020) also observed differences in the capacity of ML procedures depending on the number of parameters used, in their case for the prediction of soil thermal conductivity, which reinforces the importance of parameter selection.
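The error statistics discussed in this section (RMSE, MAE, and R2) can be sketched in a few lines of Python. This is an illustration under stated assumptions: R2 is taken here as the coefficient of determination, 1 − SS_res/SS_tot, which may differ from the definition used by the software in this study, and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(measured, predicted):
    """Return RMSE, MAE and R2 for paired measured/predicted values.

    Assumption: R2 is the coefficient of determination, 1 - SS_res / SS_tot.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    residuals = [p - m for m, p in zip(measured, predicted)]
    ss_res = sum(e * e for e in residuals)
    mean_m = sum(measured) / n
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all measured values are equal")
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical values for illustration only (not data from this study).
theta_measured = [0.45, 0.40, 0.30, 0.18, 0.10]
theta_predicted = [0.38, 0.36, 0.31, 0.25, 0.19]
print(regression_metrics(theta_measured, theta_predicted))
```

Under this definition, a model that always predicts the mean of the measured values has R2 = 0, so low R2 values mean the model explains little more of the variation in water content than a constant would.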

Predicted SWCCs with evaluating models based on the ML procedure

The predictive accuracy of the eight ML algorithms was assessed by comparing their estimated results to the actual measurements. This evaluation is visualized in Fig. 2, where eight individual panels (labeled "a" to "h") depict the performance of each algorithm and show how well its predictions align with the actual measurements. The 1:1 line in each panel serves as a reference for perfect agreement between predictions and measurements. Among these algorithms, Gradient Boosting (GB) shows remarkable predictive capability, reflecting its potential to closely replicate the actual SWCC.

Fig. 2 Comparison of predicted and observed SWCC around the 1:1 line

Figures 3 and 4 illustrate the SWCC for Loamy Sand and Silty Clay soil samples, respectively. As mentioned earlier, the evaluated models can be categorized into two classes based on their prediction accuracy: high and low. Figs. 3 and 4 demonstrate these differences explicitly. Specifically, for the Loamy Sand soil sample, the Gradient Boosting, AdaBoost, Tree, and Random Forest models (Fig. 3a–d) exhibited almost perfect predictions of SWCC. The high prediction accuracy is also seen in the Silty Clay soil samples, although for soil matric suctions higher than 1000 cm the error of these models shows a relatively decreasing trend. Previous studies have highlighted the flexibility and reliability of ML algorithms such as ANN, kNN, and SVM in providing accurate estimations, as they do not rely on stringent assumptions about the underlying data and can adapt to various situations (Nguyen et al. 2017; Hastie et al. 2009). However, in the present study, the Neural Network, SVM, kNN, and Linear Regression models predicted the SWCC of both the Loamy Sand and Silty Clay soil samples with unacceptable errors. Specific details regarding the nature and magnitude of these errors would provide further insights into the limitations of these models in the context of the study. These errors resulted in deviations between the predicted and measured SWCC patterns across the entire range of matric suctions (Figs. 3e–h and 4e–h). Specifically, the models underestimated at low matric suction and overestimated at high matric suction for all studied soil samples. The SVM and kNN models fail to exhibit the expected decreasing trend with respect to matric suction in the Loamy Sand sample, rendering them unable to adequately explain the SWCC. Similarly, the kNN model yields inaccurate outputs for the Silty Clay soil sample.
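The expectation that water content decreases as matric suction increases can be checked programmatically. The sketch below is one possible way to flag a predicted curve that violates this trend; the function name `is_non_increasing` and the `tolerance` parameter are hypothetical and were not part of the workflow described in this study.

```python
def is_non_increasing(suction, theta, tolerance=0.0):
    """Check that water content never rises as matric suction increases.

    `tolerance` (hypothetical parameter) allows small upward steps, e.g. to
    ignore noise smaller than the measurement precision.
    """
    if len(suction) != len(theta):
        raise ValueError("suction and theta must have the same length")
    # Order the points by suction so the curve is scanned from wet to dry.
    ordered = [t for _, t in sorted(zip(suction, theta))]
    return all(later <= earlier + tolerance
               for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical predicted curves for illustration only.
suction_cm = [1, 10, 100, 1000, 10000]
plausible_curve = [0.42, 0.40, 0.31, 0.18, 0.11]
flat_then_rising = [0.30, 0.30, 0.33, 0.29, 0.31]

print(is_non_increasing(suction_cm, plausible_curve))   # True
print(is_non_increasing(suction_cm, flat_then_rising))  # False
```

A check of this kind only tests the shape of the curve; a prediction can pass it and still deviate strongly from the measured water contents.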

Fig. 3 Comparison of the predicted and measured SWCCs by different models in Loamy Sand soil sample

Fig. 4 Comparison of the predicted and measured SWCCs by different models in Silty Clay soil samples

Uncertainty of the evaluated models

Table 9 presents the error percentages that quantify the mean differences between the observed and predicted SWCCs in both the Loamy Sand and Silty Clay soil samples. These error percentages provide insights into the uncertainty associated with each evaluated model. Wang et al. (2021) demonstrated high accuracy in determining SWCC for soils with a high clay fraction; in this study, the average error of the eight models for Loamy Sand soil samples, at 35%, was considerably higher than for Silty Clay soil samples. However, the four well-performing models, namely Gradient Boosting, AB, Tree, and Random Forest, exhibited an equal average error percentage of approximately 5% in both Loamy Sand and Silty Clay soil samples, and no significant difference in the estimation of SWCC was observed between the two studied soil textures. The Gradient Boosting model demonstrated superior prediction capability in both studied soil textures, and it exhibited the lowest error percentage in Loamy Sand soil samples, with an average uncertainty of 2.7%. The other evaluated models, namely Neural Network, SVM, kNN, and Linear Regression, produced unreliable outputs with error percentages exceeding 20%. In particular, the SVM model performed poorly in Loamy Sand soil samples, reaching approximately 90% error. Interestingly, these models showed better prediction performance in Silty Clay soil samples than in Loamy Sand soil samples.

Table 9 Uncertainties of evaluating models (error percentage between observed and predicted results) in prediction of soil moisture content at different soil matric suction of Loamy Sand and Silty Clay soil samples
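The exact formula behind the error percentages is not restated here, so the following sketch assumes one common choice: the mean of the absolute differences between predicted and observed water content, each expressed relative to the observed value. The function name `mean_error_percentage` and the sample values are hypothetical.

```python
def mean_error_percentage(observed, predicted):
    """Mean absolute percentage error between observed and predicted values.

    Assumption: each error is expressed relative to the observed value;
    the study may have used a different normalization.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        raise ValueError("observed values must be non-zero")
    relative_errors = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical values for illustration only (not data from this study).
theta_observed = [0.40, 0.30, 0.20, 0.10]
theta_predicted = [0.41, 0.29, 0.21, 0.11]
print(f"{mean_error_percentage(theta_observed, theta_predicted):.1f} %")
```

Because each error is divided by the observed water content, the same absolute deviation produces a larger percentage at the dry end of the curve, where water contents are small.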

Prediction errors at two sides of the inflection point (hi)

Some researchers have observed that their models underestimated the water content of the SWCC at relatively high suction heads (Nguyen et al. 2017; Hwang and Powers 2003; Meskini-Vishkaee et al. 2014; Mohammadi and Meskini-Vishkaee 2012; Tuller and Or 2001; Tuller et al. 1999). Nguyen et al. (2017) attributed the underestimation of SWCC to the lack of measured input features at high matric suctions. Other studies have shown the existence of corner water, lens water, and film water in soils, which may be one of the main causes of the underestimation phenomenon (Mohammadi and Meskini-Vishkaee 2012; Or and Tuller 1999; Shahraeeni and Or 2010; Tuller and Or 2005; Tuller et al. 1999). Wang et al. (2021) claim that their improved prediction model can effectively predict soil–water characteristic curves, especially for soils at high matric suctions. In contrast, in this study, we observed visual evidence of increasing model errors with higher soil matric suction in Figs. 3 and 4, as well as in Table 9. To further support this observation, the error percentages of the evaluated models were compared at matric suction values below and above the matric suction at the inflection point (hi). For the Loamy Sand soil samples, hi was calculated as 70 cm, while for the Silty Clay soil samples, hi was calculated as 300 cm. Figure 5 presents the results of this comparison. In both studied soil textures, the error percentages of all evaluated models are considerably higher at matric suctions greater than hi than at matric suctions less than hi. Among the models, the DT model exhibited the maximum difference between the measured and predicted SWCCs at the two sides of the inflection point. Moreover, the prediction error percentages at matric suctions greater than hi were found to be ten times higher than those at matric suctions less than hi. Similarly, Bakhshi et al. (2023), employing an image analysis approach and substantiating their findings with the Laplace equation (Tuller et al. 2004) for elliptical pores, reported an overestimation of moisture contents at matric suctions exceeding hi. Additionally, the SVM and kNN models exhibit minimal changes in prediction errors with respect to matric suction. Consequently, there is a minimal difference between the prediction errors of SWCC at the two sides of the inflection point for these models. Based on this concept, the best-performing models are identified as those with lower error percentages and a minimal difference in prediction errors at the two sides of the inflection point. Models such as Gradient Boosting, AB, and Random Forest exhibit these characteristics.
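The comparison of errors on the two sides of the inflection point can be sketched as follows. This is an illustration under stated assumptions: the per-point error is taken as the absolute difference relative to the observed value, points exactly at hi are assigned to the upper side, and the function name `error_by_inflection_side` is hypothetical. The value of hi is supplied by the caller and is not computed here.

```python
def error_by_inflection_side(suction, observed, predicted, h_i):
    """Mean error percentage below and above the inflection suction h_i.

    Assumptions: per-point error is |predicted - observed| / |observed|,
    and a point with suction equal to h_i counts as "above".
    Returns (mean_below, mean_above); a side with no points yields NaN.
    """
    below, above = [], []
    for h, o, p in zip(suction, observed, predicted):
        error_pct = 100.0 * abs(p - o) / abs(o)
        (below if h < h_i else above).append(error_pct)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(below), mean(above)


# Hypothetical values for illustration only (not data from this study).
suction_cm = [10, 50, 100, 1000]
theta_observed = [0.40, 0.30, 0.20, 0.10]
theta_predicted = [0.40, 0.30, 0.22, 0.12]

low_side, high_side = error_by_inflection_side(
    suction_cm, theta_observed, theta_predicted, h_i=70
)
print(f"below h_i: {low_side:.1f} %, above h_i: {high_side:.1f} %")
```

Reporting the two means separately makes it possible to compare both the overall error level of a model and the gap between the two sides of the inflection point, which are the two criteria used above to identify the best-performing models.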

Fig. 5 Comparison of the error percentages of the evaluated models on the two sides of the SWCC inflection point in (a) Loamy Sand and (b) Silty Clay soil samples

Residual contents of predicted SWCCs

To quantify the absolute differences between predicted and measured SWCCs, difference curves were produced for both the Loamy Sand and Silty Clay soil samples. Figure 6 depicts the difference curves for Loamy Sand samples, while Fig. 7 displays the difference curves for Silty Clay samples. Each figure includes eight subfigures (a–h), one for each evaluated model. Consistent with the previous discussion of the high capability of the Gradient Boosting, AB, Tree, and Random Forest models, Figs. 6 and 7 (subfigures a–d) show that these models exhibit minimal fluctuation around zero. In contrast, the other studied models, namely Neural Network, SVM, kNN, and Linear Regression, demonstrate significant underestimation at low matric suction and overestimation at higher matric suctions, as depicted in Figs. 6 and 7 (subfigures e–h). Similar to the results of this study, Achieng (2019) observed residual SWCC values of about −0.1 to 0.1 g g−1, but did not find a specific pattern for changes in errors with increasing matric suction. However, as illustrated in Figs. 6f and 7f for both the studied Loamy Sand and Silty Clay soil textures, the highest estimation errors are observed at the two ends of the SWCC. In other words, the SVM model shows the highest error in the estimation of the structural-based and textural-based sections of the SWCC, whereas around the inflection point its estimation error diminishes to about zero.
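A difference curve of the kind described above can be sketched as the pointwise residual between predicted and measured water content, ordered by matric suction. The sign convention (predicted minus measured, so negative values indicate underestimation) is an assumption made for this illustration, as are the function names and sample values.

```python
def residual_curve(suction, measured, predicted):
    """Return (suction, predicted - measured) pairs ordered by suction.

    Assumed sign convention: negative residual = underestimation,
    positive residual = overestimation.
    """
    if not (len(suction) == len(measured) == len(predicted)):
        raise ValueError("all inputs must have the same length")
    ordered = sorted(zip(suction, measured, predicted))
    return [(h, p - m) for h, m, p in ordered]


def largest_residual(curve):
    """Return the (suction, residual) pair with the largest absolute residual."""
    return max(curve, key=lambda point: abs(point[1]))


# Hypothetical values for illustration only (not data from this study).
suction_cm = [1, 10, 100, 1000, 10000]
theta_measured = [0.45, 0.40, 0.30, 0.18, 0.10]
theta_predicted = [0.38, 0.36, 0.30, 0.24, 0.18]

curve = residual_curve(suction_cm, theta_measured, theta_predicted)
for h, res in curve:
    print(f"h = {h:>6} cm  residual = {res:+.3f}")
print("largest deviation at h =", largest_residual(curve)[0], "cm")
```

In this illustrative sample the residual is negative at low suction, near zero in the middle of the curve, and positive at high suction, which mirrors the pattern described for the low-accuracy models.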

Fig. 6 Absolute difference between the predicted and measured SWCC for the evaluated models in Loamy Sand soil samples

Fig. 7 Absolute difference between the predicted and measured SWCC for the evaluated models in Silty Clay soil samples

Conclusion

  • Role of informatics in precise estimation of SWCC: The application of ML techniques has led to the simplification of the intricate process of predicting the SWCC in this study. Additionally, the utilization of Orange.3 data mining software has enabled the incorporation of a wide range of measured physical soil properties into the predictive model, eliminating the requirement for extensive programming knowledge. Thus, through the utilization of informatics principles, we establish a connection between scientific insights and practical engineering applications, thereby facilitating a smoother transition of predictive models into real-world soil-based scenarios.

  • Shortcomings of Solely Matric Suction-based Models: Our investigation into SWCC prediction revealed a noteworthy limitation. Models constructed solely on the basis of soil matric suction, while seemingly intuitive, were inadequate for accurately predicting SWCC behavior. This deficiency was evident in a Mean Absolute Error (MAE) exceeding 0.08 and an R-squared (R2) value below 40% in the test dataset.

  • Enhancing Accuracy through Multivariate Approach: To improve prediction accuracy, we examined the influence of additional soil properties on SWCC behavior. Incorporating soil properties such as bulk density, organic carbon content, and micro-porosity characteristics like flatness or porosity surface area as measured features yielded substantial improvements in the precision of SWCC estimation. Statistical analysis revealed that in this scenario, the Gradient Boosting algorithm led to an almost perfect estimation of SWCC, yielding RMSE and Pi values of 0.016 and 0.03, respectively. Furthermore, the AB, Random Forest, and Tree models resulted in highly accurate estimations with RMSE and Pi values lower than 0.03 and 0.04, respectively. However, the other evaluated models, namely Neural Network, SVM, kNN, and Linear Regression, did not improve during the training phase, even with the inclusion of additional properties of the studied soils.

  • Feature importance analysis: Among the evaluated models, matric suction stood out as the most critical parameter in the GB, AB, and RF models. Its exclusion from these models resulted in a notable increase in RMSE, reaching up to 0.08. Lower matric suction values correlated with higher accuracy, while higher values reduced accuracy. Following matric suction, soil micro-porosity characteristics gained importance, lowering model RMSE by up to 0.04 in highly accurate models. Notably, structural flatness and porosity surface area played a significant role compared to other pore characteristics in predicting SWCC accurately.

  • Navigating Errors and Achieving Realistic Prediction: It is important to acknowledge the errors in the estimated SWCC within this study, especially at matric suctions beyond the SWCC inflection point. These errors were observed even in the high-accuracy models, amounting to up to 8 percent in silty clay soils. However, upon analysis, these errors are relatively minor and do not substantially compromise the models’ effectiveness in predicting SWCC behavior. This is attributed to the decreasing trend of variations in water content at high matric suctions.