Introduction

Soil physical and mechanical (SPM) properties are crucial parameters for evaluating soil quality and health and for determining soil aggregate stability. They play a significant role in land degradation and land management worldwide, especially in areas with climatic conditions similar to those of Iran (Rezaee et al., 2020a, 2020b; Mozaffari et al., 2021, b, 2022, Mozaffari et al., 2022; Zahedifar, 2023a, b). These properties also provide valuable information on water infiltration and nutrient cycling and play a crucial role in soil ventilation, microbial activity, and tillage performance (Mustafa et al., 2020). Therefore, knowing the variability of SPM properties across the landscape is necessary for determining crop fertilizer requirements and for managing water and cultivation practices, such as tillage, toward sustainable production (Moosavi and Sepaskhah, 2012; Brevik et al., 2015).

Traditional methods for mapping SPM properties are labor-intensive, time-consuming, and costly because they require dense soil observations, extensive field surveys, and laboratory activities (Shahabi et al., 2017, Kazemi Garajeh et al., 2022). Moreover, they rely on expert interpretation (expert opinion) of environmental covariates involving the key soil-forming factors (Gorji et al., 2015). Furthermore, conventional soil mapping methods cannot provide quantitative information on the accuracy and uncertainty of the mapped variation of soil properties in soil survey projects (Zahedi et al., 2017). To overcome the limitations of the conventional approach, novel methods such as digital soil mapping (DSM) have been increasingly applied by researchers in recent decades. DSM employs mathematical and statistical methods to establish the relationship between soil properties and environmental factors that represent the soil-forming factors (McBratney et al., 2003). The output of DSM consists of spatial prediction maps along with their quantitative validation, which can help reduce the cost and time of soil surveys (Esfandiarpour-Boroujeni et al., 2020).

In current DSM studies, pedometricians typically derive environmental covariates from topographic attributes of a digital elevation model (DEM) (Wang et al., 2018) and from remote sensing (RS) data (Xiao et al., 2019), both of which are easily accessible sources of information. The use of gridded topographic attributes and RS indices as proxies has been confirmed in different studies on SPM prediction (Mashalaba et al., 2020; Camera et al., 2017; Ugbaje and Reuter, 2013). In this regard, Mashalaba et al. (2020) reported that topographic features had the greatest effect on soil property prediction in central Chile. However, although environmental covariates have been widely applied worldwide, they may not be sufficient on their own for predicting soil properties. In two different studies, Mousavi et al. (2022) and Khosravani et al. (2023) investigated the estimation of soil organic carbon (SOC) and other soil properties under two scenarios: one considering both soil and environmental covariates and one considering only environmental covariates. Their results showed that including soil variables along with environmental covariates improved the accuracy of the ML algorithms compared to the scenario without soil variables. Similarly, Zeraatpisheh et al. (2021) demonstrated that soil and RS variables were the most important drivers of soil aggregate stability. In this regard, Mozaffari et al. (2022) suggested that using primary soil properties for modeling particle size distribution could lead to acceptable accuracy.

Recently, the application of machine learning (ML) algorithms to model soil properties with the aid of environmental factors in DSM has attracted growing attention from pedometricians. Selection of appropriate ML algorithms can have a significant impact on the accuracy of the produced maps (Khaledian and Miller, 2020). In a comprehensive review of research conducted over the last 10 years, Khaledian and Miller (2020) assessed the capability of six ML algorithms in soil mapping, namely, multivariate linear regression (MLR), k-nearest neighbor (k-NN), support vector machine regression (SVR), Cubist (CB), random forest (RF), and artificial neural network (ANN). According to their review, the RF algorithm was known to outperform the other algorithms, including in the modeling of aggregate stability (Bouslihim et al., 2021). Also, Yamaç et al. (2020) confirmed that the k-NN algorithm had the highest R2 value (0.8) among the ML algorithms tested for predicting permanent wilting point (PWP) in calcareous soils.

In recent years, most DSM studies have focused on applying ML algorithms to predict soil properties in the topsoil (Zeraatpisheh et al., 2019; Parsaie et al., 2021), whereas the potential of ML algorithms to predict SPM properties with depth has not been deeply explored (Hengl et al., 2017). Some case studies have focused mostly on SOC prediction in the topsoil and subsoil simultaneously (Taghizadeh-Mehrjardi et al., 2014; Mousavi et al., 2022). According to the literature available at the time of writing, few studies have been conducted on the vertical variation of SPM properties using soil depth functions such as splines, ML algorithms, environmental covariates, and soil variables.

In short, few studies have mapped the spatial variation of SPM properties at the surface and subsurface (horizontal and vertical dimensions) by considering environmental covariates along with soil variables. Thus, the current research was conducted with the main aim of modeling SPM attributes, including the geometric mean diameter of aggregates (GMD), mean weight diameter of aggregates (MWD), shear strength (SS), and penetration resistance (PR), using only environmental covariates as a first scenario (S1) and using basic soil variables together with environmental covariates as a second scenario (S2) in the southwest of Iran. Furthermore, we evaluated the capability of three ML algorithms (RF, CB, and k-NN) for preparing spatial estimation maps of GMD, MWD, SS, and PR at four depth increments of 0–5, 5–15, 15–30, and 30–60 cm, to provide more accurate and detailed maps of SPM properties that can be used in land and water management strategies, soil erosion control, and the improvement of soil stability.

Materials and methods

Research workflow

The general framework of this study comprises six main steps and is presented in Fig. 1. The steps are as follows: (1) designing the sampling point locations using the “clhs” package in R statistical software, collecting soil samples from 0 to 30 and 30 to 60 cm, and standardizing soil depth with a spline depth function; (2) preparing/collecting all possible environmental covariates from RS indices and the DEM as representatives of the soil-forming factors; (3) selecting the most appropriate environmental covariates for predicting SPM properties; (4) evaluating three ML algorithms (RF, k-NN, and CB) at the four standard soil depths (spatial modeling of soil properties) based on two scenarios (S1: environmental factors, S2: environmental factors + soil variables); (5) determining the relative importance (RI) of covariates; and (6) preparing prediction maps of SPM properties.

Fig. 1

Flowchart of the research in the study area at soil standard depth (0–5, 5–15, 15–30, and 30–60 cm). Geometric mean diameter of aggregates (GMD), mean weight diameter of aggregates (MWD), shear strength (SS), penetration resistance (PR), conditioned Latin hypercube sampling (clhs), machine learning model (ML)

Description of study area

Here, the area of interest extends from longitude 52° 41′ 35.82″ to 52° 57′ 1.07″ E and from latitude 30° 2′ 14.72″ to 29° 48′ 35.02″ N, covering about 48,963 ha in Marvdasht, which is located to the north of Shiraz (Fig. 2). The slope gradient varies from 0 to 12%, with a mean altitude of 1605 m above sea level. Most of this area has low relief, and over 85% of the land has a slope of less than 5%. The mean annual precipitation and temperature are 287 mm and 17.5 °C, respectively, and according to the closest climatic station, July and January are the hottest and coldest months, respectively. The soil moisture and temperature regimes of the study area are xeric and thermic, respectively. Marvdasht Plain is a major agricultural region for crops such as irrigated winter wheat, barley, alfalfa, and canola. Therefore, preparing digital maps of the key soil properties in this region offers valuable insights into the soil condition, and the maps can serve as a useful tool for evaluating and adjusting land management practices.

Fig. 2

a Location of Fars province in Iran, b location of Marvdasht plain in Fars province, and c location of the soil samples (red circle) in the study area

Field survey and laboratory activity

For the field survey, the locations of 200 sampling points were determined by conditioned Latin hypercube sampling (CLHS), a stratified random method that selects sampling points based on initial information about a suite of environmental factors in the area of interest (Minasny & McBratney, 2006), using the open-source R statistical software (version 4.0.3). The locations are shown in Fig. 2c. After the sampling locations were fixed, soil shear strength (SS) and penetration resistance (PR) were determined directly in the field using a vane shear resistance meter and a pocket cone penetrometer (ELE model), respectively (Fig. 3a, b). The SS was measured using a Torvin resistance tester with three replications around each sampling point. PR was measured using a hand-held penetrometer on intact soils, also with three replications around each point. The instrument had a narrow cylindrical rod of 6 mm diameter and 5.7 cm length. The penetrometer was pushed into the soil up to the marked part (about 6 mm), and the required pressure (kPa) was recorded. The average of the three replicate measurements was used as the SS and PR value at each sampling point. The replicates were taken at equally spaced points on the circumference of a circle with a radius of about 0.5 m (Fig. 3c). After sampling, the soil samples were transferred to the laboratory to measure aggregate stability by the wet sieving method (Kemper and Rosenau, 1986). In other words, the stability of the soil aggregates against water was measured by the standard sieving method with seven size classes delimited by sieve openings of 2, 1, 0.5, 0.25, 0.125, and 0.053 mm (from > 2 mm to < 0.053 mm) (Kemper and Rosenau, 1986; Le Bissonnais, 2016).
Then, to quantify the structural stability of the soil aggregates, the geometric mean diameter (GMD) and mean weight diameter (MWD) were calculated from the results of the wet sieving method using the following equations:

$$\textrm{GMD}=\exp \left[\sum_{i=1}^n{w}_i\textrm{Log}\left(\overline{d_i}\right)\right]$$
(1)
$$\textrm{MWD}=\sum_{i=1}^n{w}_i\overline{d_i}$$
(2)

where \({\overline{\textrm{d}}}_{\textrm{i}}\) is the mean diameter of two consecutive sieves (mm), \(w_i\) is the weight of aggregates in that size range as a proportion of the total sample, n is the number of size classes, and Log denotes the natural logarithm.
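As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes GMD and MWD from the mass retained in each size class. It assumes that the weights are expressed as fractions of the total sample (so that they sum to one) and that the logarithm is natural; the function name `aggregate_indices` and the example masses and diameters are hypothetical and are not data from this study.

```python
import math

def aggregate_indices(class_masses, mean_diameters):
    """Return (GMD, MWD) in mm from wet-sieving results.

    class_masses   -- mass retained in each size class (any consistent unit)
    mean_diameters -- mean diameter of each size class (mm)
    """
    if len(class_masses) != len(mean_diameters):
        raise ValueError("one mean diameter is needed per size class")
    total = sum(class_masses)
    # Assumption: w_i is a fraction of the total sample, so the weights sum to 1
    weights = [m / total for m in class_masses]
    # Eq. (1): exponential of the weighted sum of log diameters
    gmd = math.exp(sum(w * math.log(d) for w, d in zip(weights, mean_diameters)))
    # Eq. (2): weighted arithmetic mean of the class diameters
    mwd = sum(w * d for w, d in zip(weights, mean_diameters))
    return gmd, mwd

# Hypothetical masses (g) and class mean diameters (mm), for illustration only
masses = [10.0, 15.0, 25.0]
diameters = [1.5, 0.75, 0.375]
gmd, mwd = aggregate_indices(masses, diameters)
print(f"GMD = {gmd:.3f} mm, MWD = {mwd:.3f} mm")
```

Because GMD is a weighted geometric mean and MWD a weighted arithmetic mean of the same diameters, GMD can never exceed MWD for a given sample, which offers a quick sanity check on computed values.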

Fig. 3

a Shear strength (SS), b penetration resistance (PR), c field sampling point, and d wet sieving tool

Moreover, auxiliary soil properties, including soil organic matter (SOM) and soil textural components (i.e., sand, silt, and clay contents), were measured using the wet oxidation (Nelson and Sommers, 1996) and hydrometer (Gee and Bauder, 1986) methods, respectively. Furthermore, to prepare continuous maps of these soil attributes for use in subsequent steps, Ordinary Kriging was used to estimate their values at unsampled points from the measured values and their modeled spatial structure. Detailed descriptions of this geostatistical estimation procedure can be found in the literature (Moosavi and Sepaskhah, 2012; Moradi et al., 2016; Azizi et al., 2022).

Standardization of soil depth

The spline method was used to extract soil properties at standard depths along vertical profiles continuously using the R package “GSIF.” The values for GMD, MWD, SS, and PR were standardized at four depths 0–5, 5–15, 15–30, and 30–60 cm. For more details about splines, see Bishop et al. (1999) and Malone et al. (2009).

Feature selection

A total of 79 driving factors were prepared from soil variables, RS data, and topographic attributes. Four soil covariates (clay, silt, sand, and soil organic matter (SOM)) were selected based on expert opinion and the related literature in this field (Celik, 2005; Ayoubi et al., 2012; Zeraatpisheh et al., 2021). In addition, 36 remote sensing covariates and individual bands were prepared from Landsat 8 imagery with 30-m spatial resolution after the necessary (radiometric) corrections using ENVI software version 5.3, and 39 topographic attributes were extracted from the DEM (ALOS PALSAR satellite) using the topographic analysis method (Wilson, 2018) in SAGA GIS version 7.9.1. As mentioned, the maps of soil variables were prepared by Ordinary Kriging interpolation (Azizi et al., 2022). Finally, the spatial resolution of all covariates was fixed to 30 m in ArcGIS software. The details of the covariates are described in Table 1.

Table 1 Soil and environmental covariates used in this study to predict soil physical and mechanical (SPM) attributes

After the soil and environmental factors were prepared, the environmental factors were screened with the variance inflation factor (VIF) method (Akinwande et al., 2015), using the “VIF” package in R software, to limit computation time and simplify model fitting. The VIF method works stepwise and eliminates the covariates that are most strongly correlated with each other. After the VIF method was applied, 36 environmental covariates remained. To further select the most appropriate covariates, the Boruta method was applied.
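The stepwise VIF screening can be sketched in Python as follows. This is an illustrative sketch only, not the R “VIF” package used in the study: it relies on the standard result that the VIF of each covariate equals the corresponding diagonal element of the inverse correlation matrix, and it repeatedly drops the covariate with the largest VIF. The threshold of 10 is a common rule of thumb assumed here (the threshold used in the study is not stated), and the function names and example values are hypothetical.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_filter(data, threshold=10.0):
    """Drop the covariate with the largest VIF until all VIFs are <= threshold.

    data      -- dict mapping covariate name to a list of values
    threshold -- assumed cut-off (rule of thumb, not taken from the study)
    """
    kept = dict(data)
    while len(kept) > 1:
        names = list(kept)
        inverse = invert(correlation_matrix([kept[k] for k in names]))
        # VIF_j is the j-th diagonal element of the inverse correlation matrix
        vifs = {k: inverse[i][i] for i, k in enumerate(names)}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= threshold:
            break
        del kept[worst]
    return list(kept)

# Hypothetical covariates: cov_b is almost a multiple of cov_a
covariates = {
    "cov_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "cov_b": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8],
    "cov_c": [5, 1, 4, 2, 8, 3, 7, 6],
}
selected = vif_filter(covariates)
print(selected)
```

In this toy example one of the two nearly collinear covariates is removed, while the weakly correlated one is retained.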

The Boruta algorithm for selecting environmental covariates was proposed by Kursa and Rudnicki (2010). This approach is a semi-automated supervised method for feature selection based on the random forest (RF) algorithm, which selects the most important environmental covariates through an iterative backward and forward procedure. The covariates were then selected according to the value of the Z factor, which falls into four general categories: a covariate is unrelated, slightly related, moderately related, or completely related when the Z factor is lower than 5, 5 to 10, 10 to 15, or more than 15, respectively (Keskin et al., 2019). Additionally, the soil variables were included in the modeling process based on expert opinion.
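The four Z-factor categories can be expressed as a small helper, sketched below in Python. The function name and the example scores are hypothetical, and the treatment of the boundary values (exactly 10 and exactly 15 are assigned to the lower category) is an assumption, since the text does not specify it.

```python
def boruta_relevance(z_score):
    """Map a Boruta Z factor to one of the four relevance categories.

    Assumption: boundary values of 10 and 15 fall into the lower category.
    """
    if z_score < 5:
        return "unrelated"
    if z_score <= 10:
        return "slightly related"
    if z_score <= 15:
        return "moderately related"
    return "completely related"

# Hypothetical Z factors for illustration; not values from this study
example_scores = {
    "covariate_a": 3.2,
    "covariate_b": 7.5,
    "covariate_c": 12.0,
    "covariate_d": 18.4,
}
for name, z in example_scores.items():
    print(name, "->", boruta_relevance(z))
```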

Machine learning (ML) algorithms

In this study, we evaluated three ML algorithms, RF, k-NN, and CB, to predict SPM by using soil and environmental factors and employing two scenarios including S1 (using just environmental factors, i.e., topographic and RS covariates) and S2 (using both the mentioned environmental covariates and soil variables), at four standard depths.

Random forest (RF)

Random forest (RF) is a non-linear ML algorithm that is widely used in DSM of soil properties. The RF algorithm is easy to implement and requires few parameters to tune (Rahmani et al., 2022). Here, the RF algorithm was applied to predict the SPM properties at the surface and at depth. The algorithm was tuned according to two hyper-parameters: the number of trees (ntree), which was varied from 100 to 1000 in steps of 100, and mtry, the number of covariates randomly considered at each split, which was chosen according to the minimum error (Breiman, 2001).

k-nearest neighbor (k-NN)

The k-nearest neighbor (k-NN) algorithm is a non-linear method that operates by calculating the Euclidean distance between the target soil sample and the other observation points. The k-NN method then weights the k nearest observation samples according to their distance to the target sample. Based on the weight of each sample in this set of k samples, an estimate of the target value is made according to the minimum error in that set (Nemes et al., 2006).
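The distance-weighting idea behind k-NN regression can be illustrated with the short Python sketch below. It assumes inverse-distance weights, which is one common choice and not necessarily the weighting used in the study; the function name, the value of k, and the example data are hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=3, eps=1e-9):
    """Predict a value at `query` as the inverse-distance-weighted mean
    of the k nearest training samples (Euclidean distance)."""
    # Pair each training sample's distance to the query with its target value
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    # Assumption: weight = 1 / distance; eps avoids division by zero
    weights = [1.0 / (d + eps) for d, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)

# Hypothetical covariate vectors and target values, for illustration only
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

# The three nearest samples are equidistant from this query point
prediction = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
print(round(prediction, 3))
```

In practice the covariates would usually be rescaled to comparable ranges before distances are computed, because Euclidean distance is otherwise dominated by the covariates with the largest numerical values.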

Cubist (CB)

The Cubist algorithm is a rule-based regression tree algorithm that generates models from training data. Each model comprises multiple rules, each of which is defined by one or more conditions (Holmes et al., 1999). When all the conditions of a rule are met, the corresponding linear relationship is used to predict the SPM property. The rules are ranked in order of decreasing importance, which implies that the first rule makes the highest contribution to the model's accuracy and the last rule the least. The algorithm predicts the value of the target variable from the influential variables, and the number of rules is adjusted according to the best-fitting regression model. To optimize the algorithm, two hyper-parameters were tuned: the number of committees and the number of neighbors (Ma et al., 2017).

Assessment of prediction performance

To assess the ML algorithms (RF, k-NN, and CB), the data were split into training and testing subsets comprising 80% and 20% of the samples, respectively. Four statistical indices were used. The coefficient of determination (R2), normalized root mean square error (nRMSE), and Nash-Sutcliffe coefficient (NS) are commonly applied to assess the performance of ML algorithm predictions. In addition, the mean standardized squared prediction error (MSSPE) was applied to assess the uncertainty of the ML algorithms; it is defined as the mean squared prediction error divided by the variance of the observed values (Rossel and McBratney, 2008). These statistical measures were calculated using the following equations:

$${R}^2=\frac{{\left[\sum_{i=1}^n\left({O}_i-\overline{O}\right)\left({P}_i-\overline{P}\right)\right]}^2}{\sum_{i=1}^n{\left({O}_i-\overline{O}\right)}^2\ \sum_{i=1}^n{\left({P}_i-\overline{P}\right)}^2}$$
(3)
$$\textrm{nRMSE}\ \left(\%\right)=\kern0.5em \frac{\sqrt{\frac{1}{n}\sum_{i=1}^n{\left({O}_i-{P}_i\right)}^2}\kern1.5em }{\overline{O}}\times 100$$
(4)
$$\textrm{NS}=1-\frac{\sum_{i=1}^n{\left({O}_i-{P}_i\right)}^2}{\sum_{i=1}^n{\left({O}_i-\overline{O}\right)}^2}$$
(5)
$$\textrm{MSSPE}=\frac{1}{n}\sum_{i=1}^n\frac{{\left({O}_i-{P}_i\right)}^2}{S^2}$$
(6)

where \(O_i\) and \(P_i\) are the observed and predicted values, respectively; \(\overline{O}\) and \(\overline{P}\) are their means; n is the number of observations; and \(S^2\) is the variance of the observed values. Because SPM properties vary on different scales, the nRMSE is a suitable statistical index for quantifying algorithm accuracy in this study. The nRMSE values range from 0 to 100%, where values close to zero indicate excellent performance and values above 30% (0.3 as a fraction) indicate poor algorithm validation (Bannayan and Hoogenboom, 2009).
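For reference, Eqs. (3) to (6) can be computed as in the following Python sketch. The function name and the example values are hypothetical, and the use of the sample variance (divisor n − 1) for \(S^2\) is an assumption, since the text does not state which variance estimator was used.

```python
import math

def validation_metrics(observed, predicted):
    """Return R2, nRMSE (%), NS, and MSSPE for paired observations and predictions."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_obs = sum((o - o_mean) ** 2 for o in observed)
    ss_pred = sum((p - p_mean) ** 2 for p in predicted)
    cross = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))

    r2 = cross ** 2 / (ss_obs * ss_pred)             # Eq. (3)
    nrmse = math.sqrt(ss_res / n) / o_mean * 100.0   # Eq. (4), in percent
    ns = 1.0 - ss_res / ss_obs                       # Eq. (5)
    s2 = ss_obs / (n - 1)                            # assumption: sample variance
    msspe = (ss_res / n) / s2                        # Eq. (6)
    return {"R2": r2, "nRMSE": nrmse, "NS": ns, "MSSPE": msspe}

# Hypothetical values: predictions are offset from observations by a constant
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
metrics = validation_metrics(observed, predicted)
print(metrics)
```

The example illustrates why several indices are reported together: a constant bias leaves R2 at 1 because the predictions are perfectly correlated with the observations, whereas nRMSE, NS, and MSSPE all penalize the offset.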

Results and discussion

Summary statistics

The descriptive statistics of the SPM properties for the 200 soil samples at the four standard depths in the Lapuee plain are presented in Table 2. The results showed that both GMD and MWD decreased with increasing depth, while SS and PR showed irregular trends with depth. According to the coefficients of variation (CVs), all four soil properties at the four standard depths showed high variability based on the classification of Wilding (1985). One reason for the high CV values of the SPM properties may be agricultural activities and land management (Heydari et al., 2020). According to the findings, the average SPM values at the standard depths decreased from the topsoil to the subsoil (Table 2). Higher amounts of SOM were observed in the upper layers, which is consistent with the findings of Mousavi et al. (2023). According to pedological theory, SOM increases porosity and ventilation and reduces soil compaction (Soane, 1990; Elbasiouny et al., 2014). Therefore, it seems that SOM has a significant effect on SPM properties.

Table 2 Descriptive statistics of soil physical and mechanical properties at the four standard soil depths for the soil samples (n = 200)

Selected features

When dealing with a large pool of data, using all of it can be time-consuming and can increase algorithm complexity. Feature selection is a useful method for choosing the appropriate covariates to be used in the modeling process (Neyestani et al., 2021).

In line with our aim of selecting the most relevant environmental factors, the VIF method reduced the number of environmental factors from 75 to 36. The Boruta algorithm was then applied, and five covariates were ultimately selected from the 36 environmental covariates (Fig. 4); together with the four soil variables, this gave a total of nine variables for predicting the SPM properties (Table 3). Among the selected environmental covariates, three (SIPI, MNDWI, and IRON) were RS indices, while two (Watershed Basins (WB) and Channel Network Base Level (CNBL)) were extracted from the DEM (Fig. 5). Also, as mentioned in Section 2.5.1, the soil variables SOM, clay, silt, and sand contents were selected based on expert opinion (Fig. 6). The most important soil and environmental covariates, based on the best scenarios and ML algorithms (Table 4), were applied to predict each SPM property.

Fig. 4

Selection of important variables with the Boruta algorithm

Table 3 Selected soil and environmental covariates for the four properties obtained with Boruta at the four standard depths

Fig. 5

Five environmental covariates obtained from RS and DEM data: Structure Insensitive Pigment Index (SIPI), Modified Normalized Difference Water Index (MNDWI), Iron Oxide Ratio (IRON), Watershed Basins (WB), and Channel Network Base Level (CNBL)

Fig. 6

Four variables were obtained from soil analysis: clay, sand, silt, and SOM

Table 4 The most important soil and environmental covariates for prediction of soil physical and mechanical properties based on the best model at each of the four standard depths

Algorithm performance

The accuracy of the algorithm predictions was evaluated using statistical indices such as R2, nRMSE, and NS for GMD, MWD, SS, and PR at the four standard depths. The comparison between scenarios S1 and S2, as measured by R2 and NS, showed that S2 had the higher prediction accuracy for the SPM properties at the four standard depths. This finding indicates that including soil variables along with environmental factors improves the performance of ML algorithms (Mousavi et al., 2022). Tables 5 and 6 list the quantitative validation results for S1 and S2, respectively, for each ML algorithm. Because S2 had the higher accuracy overall, the subsequent sections focus on its results.

Table 5 Validation results for prediction of the soil physical and mechanical properties at the four standard depths (scenario 1: using environmental covariates)
Table 6 Validation results for prediction of the soil physical and mechanical properties at the four standard depths (scenario 2: using both environmental covariates and soil variables)

Validation results of GMD

According to Table 6, the RF algorithm showed the best prediction performance in mapping GMD at the 0–5 and 15–30 cm depths (R2 = 0.70 and 0.68, nRMSE = 18.21 and 10.21, and NS = 0.67 and 0.60, respectively). The k-NN algorithm performed best for predicting GMD at the depth of 5–15 cm, with R2 of 0.45, nRMSE of 10.17, and NS of 0.48. Finally, CB showed the best predictions of GMD at a depth of 30–60 cm, with R2 of 0.59, nRMSE of 6.19, and NS of 0.58. Furthermore, based on the report of Rossel and McBratney (2008), all ML algorithms applied to GMD prediction at the four standard depths showed intermediate prediction performance; however, RF performed best at the 0–5 and 15–30 cm depths. Chen et al. (2015) examined surface and subsurface soil salinity variation and reported that RF had a higher prediction capability than other ML algorithms. Similarly, Mousavi et al. (2022) and Rahmani et al. (2022) confirmed that the RF algorithm has high accuracy and low error for predicting topsoil thickness.

Validation results of MWD

The k-NN algorithm performed well in predicting MWD (Table 6). For the three depths of 0–5, 15–30, and 30–60 cm, k-NN was the best algorithm (R2 of 0.57, 0.56, and 0.45 and nRMSE of 14.12, 8.21, and 10.23, respectively), while CB showed the best prediction at the 5–15 cm depth, with R2 of 0.61 and nRMSE of 8.13. In general, the prediction performance for MWD was similar to that for GMD at all depths. Overall, the validation results of the algorithms were intermediate at the standard depths; however, k-NN outperformed the other algorithms for MWD at all standard depths except 5–15 cm, where CB gave the best prediction.

Validation results of SS

Validation results of the predictive algorithms for SS showed that the best algorithm at all depths was k-NN, with R2 of 0.65, 0.54, 0.57, and 0.59; nRMSE of 11.14, 12.15, 13.23, and 11.16; and NS of 0.55, 0.53, 0.54, and 0.51 at depths of 0–5, 5–15, 15–30, and 30–60 cm, respectively (Table 6). Furthermore, the results showed that all ML algorithms used in this research had similar performances in predicting SS in surface soils. The highest R2 value (R2 = 0.65) for SS prediction was obtained at the depth of 0–5 cm. Overall, k-NN was the best predictive algorithm for SS at all of the studied depths. In this regard, research conducted by Hengl et al. (2021) on modeling soil fertility properties showed that the RF and CB algorithms had better accuracy than the k-NN and SVR algorithms. Furthermore, Khosravani et al. (2023) reported that the CB algorithm, followed by RF, had the best prediction capability for soil fertility attributes. As regards SPM properties, similar results were reported by Bouslihim et al. (2021), while Yamaç et al. (2020) stated that k-NN was the best predictive algorithm for permanent wilting point (PWP) in calcareous soils.

Validation results of PR

The validation results indicated that CB was the best algorithm (R2 = 0.67) for predicting PR at the depth of 0–5 cm, while k-NN was the best algorithm (R2 = 0.68) at the depth of 5–15 cm. The RF algorithm was the best for predicting PR at the depths of 15–30 cm and 30–60 cm (R2 = 0.92 and 0.86, respectively) (Table 6).

Based on the validation results for PR, the RF algorithm showed the best performance (R2 = 0.92) compared to the other ML algorithms at the depth of 15–30 cm. However, the k-NN algorithm also performed well relative to the other ML algorithms in predicting GMD, MWD, SS, and PR at different depths. Overall, the best predictive algorithms were RF for GMD and PR, and k-NN for MWD and SS. Zeraatpisheh et al. (2021) reported that the k-NN and support vector machine algorithms performed well in predicting SOC in different aggregate sizes. Furthermore, the RF algorithm has been shown to outperform the ANN algorithm in predicting the soil surface erosion rate (Khosravi Aqdam et al., 2022).

The importance ranking of factors

The comparison between scenarios showed that S2 had higher accuracy in predicting SPM properties than S1. Therefore, the relative importance is described based on scenario S2. The results indicated that clay and SOM were two important variables in the prediction of SPM properties at the four standard depths. Increasing SOM can improve soil aggregate stability, which may explain the high GMD and MWD values in cultivated land (Lacoste et al., 2014). Also, Mozaffari et al. (2021, b) observed a strong relationship between SOM and both MWD and GMD in all of their datasets. They suggested that SOM plays an important role in protecting soil aggregate stability (SAS) and reducing the effects of wind and water erosion. Additionally, Celik (2005) and Ayoubi et al. (2012) reported that SOM directly contributes to the formation and stability of soil aggregates and that the level of SOM can define and explain the type of soil aggregates (macro, meso, and micro aggregates). Correlation analysis between MWD, soil properties, and covariates revealed that organic carbon had the highest influence (27.9%) on MWD. Similar results were reported by Tang et al. (2016) and Wang et al. (2019).

Apart from the soil variables, only CNBL, a proxy of topographic attributes, was important among the environmental covariates: it contributed to predicting MWD and PR at the depths of 15–30 cm and 30–60 cm and SS only at a depth of 30–60 cm. Topography affects soil properties by influencing the soil climate and hydrology (Wang et al., 2018; Tu et al., 2018; Nsabimana et al., 2020). Forghani et al. (2020) confirmed that topographic features such as CNBL and valley depth are the most influential factors for soil physical parameters. Also, topographic attributes, organic matter, and geology data were the most important parameters in the spatial prediction of soil aggregate stability (SAS) (Bouslihim et al., 2021). In contrast to the soil variables and topographic attributes, the RS indices had a weak effect on SPM prediction.

Spatial prediction

In this study, GMD, MWD, SS, and PR maps were prepared with the best ML algorithm at each of the four standard depths (Figs. 7, 8, 9, and 10). High GMD values were observed in the southwestern, southern, and southeastern sections of the region. In these areas, GMD decreased from 2.27 to 1.25 mm between the 0–5 and 30–60 cm depths, possibly because of the higher SOM in the topsoil than in the subsurface layers and the improved soil structure resulting from agricultural activities (Fig. 7). For GMD, MWD, SS, and PR, a decreasing trend was observed from the surface to the subsoil, particularly in the northern zone of the region. The trend of MWD was similar to that of GMD, and its value decreased from the surface to the deeper layers (Figs. 7 and 8). According to Le Bissonnais (2016), soils with MWD > 2.0 mm have very stable aggregates, so no surface crusting occurs. The minimum values of GMD and MWD were observed at the northern boundaries (Fig. 7), where pastures with low vegetation cover increase erosion rates and thus cause a weak structure in the topsoil. These results may also be related to the mountainous conditions (stone fragments). Furthermore, the maximum values of GMD and MWD were obtained in the southwest, south, and southeast, which could be due to the high SOM and clay contents and good soil structure (Fig. 6). Increasing SOM can improve soil aggregate stability, which could explain the high GMD and MWD values in the cultivated land (Lacoste et al., 2014). Therefore, regular application of organic matter is recommended. In addition, a comparison of Figs. 6, 7, and 8 showed that SOM and clay contents are the most important covariates influencing the prediction of GMD and MWD.
Among the remaining important environmental covariates, CNBL and WB had the same trend as GMD and MWD in the study area. CNBL, as a proxy of topography, has an important role in GMD and MWD and is correlated with SOC for soil conservation (Schillaci et al., 2017; Sabetizade et al., 2021). The spatial variability of SPM properties such as SS and PR reflects the soil quality condition and provides useful information for making appropriate decisions to improve soil fertility. The highest values of SS were observed in the southern, central, and northwestern parts of the area (Fig. 9). In contrast, the lowest values of PR were observed in the southern, central, and northwestern zones, whereas the highest values were observed in the northern and northeastern zones (Fig. 10). Also, SS showed a decreasing trend from the surface to depth (3.65 to 3.42 kPa), while PR showed an increasing trend (1.09 to 3.05 kPa). The most influential soil and environmental covariates in predicting SS and PR were those derived from expert opinion and the DEM (Table 3). Clay, silt, and SOM showed a direct relationship with SS and an inverse relationship with PR, unlike sand (Fig. 6). WB and CNBL showed a negative relationship with SS, GMD, and MWD, while they, especially WB, showed a positive relationship with PR (Fig. 5). SIPI and MNDWI showed no significant trend with changes in GMD and MWD (Fig. 5). Khalil et al. (2011) used topographic attributes such as slope, slope direction, and elevation to predict SS and reported that the use of topographic attributes increases the accuracy of SS prediction maps. Changes in SS and PR resulting from land use and agricultural activities affect vegetation type, SOM, soil structure, and porosity.
From the start of shearing to the point of maximum shear, soil shear behavior is related to the soil physical condition, especially soil compaction (Komandi, 1992), and as soil density increases, more force is required to break soil particles apart (Brevik et al., 2015). The lower SS in the northern parts is attributed to the mountain and piedmont physiographic units, whereas higher SS was obtained in the low-relief areas (alluvial plains). Higher-elevation areas have a weak soil structure, low SOM, a high erosion rate, and low shear resistance. Based on field observations, severe erosion, the presence of stones and gravel, and surface runoff can lead to a weak soil structure at depth and a decrease in SS from the surface to depth (Castro Filho et al., 2002). According to Fig. 10, PR values decreased from the north and northeast to the central, southwest, south, and southeast parts of the area, with the lowest values in the central and southern parts. Suitable vegetation, minimal tillage, appropriate land use, and high organic matter reduce penetration resistance, depending on the soil structure and the porosity of the topsoil. Increasing SOM through organic amendments, vermicompost, and biological sludge has been reported to create a strong and stable soil structure, thereby reducing surface crust formation and PR values (Asghari et al., 2010). From the surface to depth, PR showed an increasing trend, indicating lower soil quality and weaker soil structure due to the low SOM content and reduced soil-forming processes in the deeper layers. Overall, the maps prepared with the ML algorithms indicated that GMD, MWD, and SS decreased from the surface to depth, while PR increased from the surface to the deeper layers.
In contrast to PR, GMD, MWD, and SS were higher in the southern and central parts of the study area than in the northern parts. These trends indicate that the southern and central parts have a more favorable soil structure than the northern parts and that the quality of soil structure decreases from the surface to depth.
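To put the depth trends quoted above on a common scale, the short sketch below converts the reported value pairs into relative changes. It assumes that each quoted range runs from the shallowest (0–5 cm) to the deepest (30–60 cm) layer; the text states this explicitly only for GMD, and the GMD pair refers to the southern parts of the region. The variable and function names are illustrative.

```python
# Value pairs quoted in the text: (near-surface value, deepest-layer value).
# Assumption: each pair spans the 0-5 cm and 30-60 cm layers.
reported = {
    "GMD (mm)": (2.27, 1.25),
    "SS (kPa)": (3.65, 3.42),
    "PR (kPa)": (1.09, 3.05),
}


def relative_change_percent(surface_value, deep_value):
    """Change from surface to depth, as a percentage of the surface value."""
    return (deep_value - surface_value) / surface_value * 100.0


changes = {name: relative_change_percent(surface, deep)
           for name, (surface, deep) in reported.items()}

for name, pct in changes.items():
    direction = "increase" if pct > 0 else "decrease"
    print(f"{name}: {pct:+.1f}% ({direction})")
# GMD (mm): -44.9% (decrease)
# SS (kPa): -6.3% (decrease)
# PR (kPa): +179.8% (increase)
```

On these figures, the change in SS with depth is small relative to the changes in GMD and PR.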

Fig. 7 Spatial prediction maps of GMD for the four standard depths in the Marvdasht area. Maps were prepared with the best predictive model for each depth: (a) RF for 0–5 cm, (b) k-NN for 5–15 cm, (c) RF for 15–30 cm, and (d) CB for 30–60 cm

Fig. 8 Spatial prediction maps of MWD for the four standard depths in the Marvdasht area. Maps were prepared with the best predictive model for each depth: (a) k-NN for 0–5 cm, (b) CB for 5–15 cm, (c) k-NN for 15–30 cm, and (d) k-NN for 30–60 cm

Fig. 9 Spatial prediction maps of SS for the four standard depths in the Marvdasht area. Maps were prepared with the best predictive model (k-NN) for depths of (a) 0–5 cm, (b) 5–15 cm, (c) 15–30 cm, and (d) 30–60 cm

Fig. 10 Spatial prediction maps of PR for the four standard depths in the Marvdasht area. Maps were prepared with the best predictive model for each depth: (a) CB for 0–5 cm, (b) k-NN for 5–15 cm, (c) RF for 15–30 cm, and (d) k-NN for 30–60 cm

Conclusions

In this study, DSM maps of SPM properties were produced at four standard depths using the RF, k-NN, and CB algorithms in a semi-arid region. Two covariate scenarios were assessed: environmental factors only, i.e., RS and topographic attributes (S1), and environmental factors plus soil properties (S2). S2, which accounted for both soil variables and environmental covariates, proved to be the better scenario for modeling SPM properties. Based on the relative ranking of soil and environmental features, SOM and clay played a more important role in predicting SPM properties at all studied depths than the topographic and remote sensing attributes. The validation results revealed that the RF algorithm outperformed the other ML algorithms in predicting PR at the depth of 15–30 cm, while the k-NN algorithm was most frequently the best-performing model. The k-NN model therefore has high potential for mapping SPM properties in agricultural areas and can help soil scientists fill the gap in SPM mapping in these areas. Our findings show that three soil properties (GMD, MWD, and SS) decreased from the surface to the subsurface layers, whereas PR increased.
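The frequency with which each algorithm was selected can be checked directly against the captions of Figs. 7–10. The following short tally is a sketch in which the dictionary simply transcribes the best model listed for each property and depth; the variable names are illustrative.

```python
from collections import Counter

# Best model per property, ordered by depth (0-5, 5-15, 15-30, 30-60 cm),
# transcribed from the captions of Figs. 7-10.
best_model = {
    "GMD": ["RF", "k-NN", "RF", "CB"],
    "MWD": ["k-NN", "CB", "k-NN", "k-NN"],
    "SS": ["k-NN", "k-NN", "k-NN", "k-NN"],
    "PR": ["CB", "k-NN", "RF", "k-NN"],
}

# Count how often each algorithm was the best across all 16 cases.
counts = Counter(model
                 for models in best_model.values()
                 for model in models)
total_cases = sum(counts.values())

for model, n in counts.most_common():
    print(f"{model}: {n} of {total_cases} property-depth combinations")
```

In this tally, k-NN is the selected model in 10 of the 16 property-depth combinations, and RF and CB in 3 each.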

For the spatial distribution of SPM properties, we conclude that including both soil and environmental factors can increase the accuracy of soil property predictions. Overall, this research highlights the role of soil properties in DSM research. When soil variables are not measured, it is recommended to use freely available global soil databases, such as SoilGrids products, to account for the role of soil properties along with environmental covariates in the modeling process. The applied method is a promising approach for land-use planners and farmers seeking better management of agricultural zones, especially in areas with highly intensive cultivation. Looking ahead, future research could further refine the digital mapping of SPM properties by incorporating more detailed soil and environmental covariate data and by expanding the study to other regions with diverse soil-forming factors. By continuing to advance our knowledge of SPM spatial variability, we can better inform agricultural management practices and contribute to the sustainability of our planet’s natural resources.