Introduction

Soil organic carbon (SOC) stock is one of the most important soil properties. It is closely linked to soil behavior and production potential, including nutrient supply to plants, water retention, greenhouse gas retention, resistance to physical degradation, and yield. Therefore, its reduction can have detrimental effects on soil properties (Maia et al., 2010; Venter et al., 2021). The effects of climate, soil characteristics, and management on SOC stock accumulation have been extensively investigated (Rabbi et al., 2015; Söderström et al., 2014). However, the relative importance of these factors remains unclear, particularly in arid and semiarid zones (Sabetizade et al., 2021).

Sanderman et al. (2017) stated that environmental factors such as land use change affect the amount of SOC stocks (Chakan et al., 2017). Environmental factors are therefore useful predictors of SOC stocks (Dong et al., 2021). Among them, topography is an important factor in soil formation across different climates. Topographic features, including elevation, slope, aspect, curvature, and derived attributes, control the movement and retention of soil water and therefore influence most soil characteristics, including SOC stocks (Hu et al., 2018; Prietzel et al., 2016).

In areas with greater topographic variation, larger SOC stock variation is expected (Zhu et al., 2019). In addition, studying the relationships between climatic and environmental factors and SOC stocks in different regions helps to predict SOC stocks and to simulate how environmental changes affect soil carbon levels. Modeling with these parameters can therefore be a useful tool for studying SOC stocks (Prichard et al., 2000).

The development of digital soil mapping (DSM) methods and their applications (McBratney et al., 2003) has made it possible to study the spatial distribution of SOC stocks using SCORPAN factors (e.g., soil, climate, organisms, parent material) (Bargaoui et al., 2019; Minasny et al., 2013). Many algorithms have been used for modeling SOC stocks, such as the random forest (RF) model (Gomes et al., 2019; Hengl et al., 2015; Hounkpatin et al., 2018; Yang et al., 2016), the support vector machine (SVM) model (Minasny et al., 2018; Ottoy et al., 2017; Wang et al., 2018), kriging-based models (Gomes et al., 2019; Wang et al., 2018), and the partial least squares regression (PLSR) model (Jiménez et al., 2019; Keskin et al., 2019; Zhu et al., 2019). The RF and PLSR methods are based on well-known classification and regression approaches. These models have been used in various digital soil mapping studies over the past decade (Behrens et al., 2019). Huang et al. (2018) showed that these models predict the spatial distribution of soil properties from environmental factors with higher accuracy.

Identifying suitable environmental factors for SOC prediction models is still challenging. Therefore, the aims of this study are the following: (1) modeling surface SOC stocks using environmental factors including terrain attributes, moisture index, and normalized difference vegetation index (NDVI); (2) selecting environmental factors with RF and PLSR models to identify the most useful and effective factors for optimizing the model; and (3) evaluating the accuracy and comparing the efficiency of RF and PLSR models in modeling and estimating the spatial distribution of SOC stocks.

Materials and methods

Study area, land use, and sampling points

The study area is located in northwestern Iran (Fig. 1A). It extends from latitudes of 36°24′00″ N to 36°46′00″ N and from longitudes of 45°52′00″ E to 46°23′00″ E, with a total area of 1.14 × 10³ km² (Fig. 1B). The study area has an average annual temperature of 12 °C and 350–450 mm of annual precipitation. Grasslands, gardens and irrigated farming, dry farming, and watercourses are the major land uses (Fig. 1B). The elevation of this region ranges from 1311 to 2224 m, and slopes range from 2% to more than 60%. The area also has a wide variety of aspects. The soil orders of this region include Entisols and Inceptisols. Some of the highlands of this region are rock outcrops (Iranian soil and water institute, 1991).

Fig. 1 Location of the study area in Iran and West Azerbaijan province (A), and locations of sampling points (B)

For the land use map, Landsat 8 satellite images with a spatial resolution of 30 m were used (Mohajane et al., 2018). The images of the study area were downloaded from the EarthExplorer website (https://earthexplorer.usgs.gov/). Pre-processing, including atmospheric and radiometric calibration, was performed in ENVI 5.3 software. To classify land uses, a maximum likelihood algorithm (Jensen, 2005) was employed, using 200 control points in the different land uses and 200 points checked in Google Earth software. On this basis, a land use map was obtained (Fig. 1B).

The multiple conditioned Latin hypercube method (cLHm) was used to select the sampling points. Using this method, 210 points, corresponding to a density of 0.184 points per km², were identified for sampling (Fig. 1B) (Ließ, 2020; Minasny & McBratney, 2006; Minasny et al., 2013). The sampling points were located in the field with a Garmin Montana 680 GPS, and soil sampling was performed. All soil samples were collected from 8 June to 30 July 2019.

Laboratory analysis and calculation of SOC stocks

After sampling, the soil samples were air-dried and passed through a 2-mm sieve. Organic carbon (OC) was measured using the Walkley–Black method (Nelson & Sommers, 1982). Some researchers have shown that the Walkley–Black method recovers only about 76% of OC. OC exists in a reduced form in organic compounds and can be oxidized to CO₂, whereas mineral carbonates exist in oxidized forms and do not participate in the oxidation and reduction reactions (Schumacher, 2002). To compensate for the incomplete recovery, a correction factor of 1.32 (approximately 1/0.76) is often applied (Eq. 1).

$${\mathrm{OC}}_{\mathrm{Corrected}}={\mathrm{OC}}_{\mathrm{Measured}}\times 1.32$$
(1)

where \({OC}_{Measured}\) is the organic carbon measured in the laboratory and \({OC}_{Corrected}\) is the organic carbon after correction for incomplete recovery.

Soil bulk density was measured by the cylinder method (Klute & Page, 1986). Because gravel cannot hold SOC, it was removed, and the actual amount of soil was calculated (Tian et al., 2009). After removing the gravel, the equivalent soil depth was calculated by Eq. 2 (Ellert et al., 2002). Finally, SOC stocks were obtained using Eq. 3 (Deng et al., 2014).

$${\mathrm{h}}_{\mathrm{i}}=\frac{\mathrm{D}\times {\mathrm{Bd}}_{\mathrm{min}}}{{\mathrm{Bd}}_{\mathrm{i}}}$$
(2)
$${\mathrm{SOC}}_{\mathrm{stocks}}={\mathrm{OC}}_{\mathrm{Corrected}}\times {\mathrm{Bd}}_{\mathrm{i}}\times {\mathrm{h}}_{\mathrm{i}}\times 10$$
(3)

where \(h_i\) is the equivalent soil depth (m), \(D\) is the soil depth (0.3 m), \({Bd}_{min}\) is the minimum soil bulk density (g/cm³) across all samples (gravel removed), and \({Bd}_i\) is the measured soil bulk density (g/cm³) of sample i (gravel removed).
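
As an illustration of Eqs. 1 to 3, the calculation can be sketched in Python as follows. This is a minimal sketch rather than the procedure actually used in the study: the function names are hypothetical, and it assumes that OC is expressed in percent, bulk density in g/cm³, and depth in m, so that the factor 10 yields SOC stocks in kg/m².

```python
WB_CORRECTION = 1.32   # Walkley-Black correction factor (Eq. 1)
SOIL_DEPTH_M = 0.3     # sampled soil depth D (m)


def corrected_oc(oc_measured):
    """Eq. 1: adjust measured OC for incomplete Walkley-Black recovery."""
    return oc_measured * WB_CORRECTION


def equivalent_depth(bd_i, bd_min, depth=SOIL_DEPTH_M):
    """Eq. 2: equivalent soil depth h_i (m) for a sample with bulk density bd_i."""
    return depth * bd_min / bd_i


def soc_stock(oc_measured, bd_i, bd_min, depth=SOIL_DEPTH_M):
    """Eq. 3: SOC stock, assuming OC in % and bulk density in g/cm3 (assumed units)."""
    h_i = equivalent_depth(bd_i, bd_min, depth)
    return corrected_oc(oc_measured) * bd_i * h_i * 10


# Hypothetical sample: 1.0 % measured OC, Bd_i = 1.5 g/cm3, Bd_min = 1.2 g/cm3
example_stock = soc_stock(1.0, bd_i=1.5, bd_min=1.2)
```

Because \(h_i\) is inversely proportional to \({Bd}_i\), the product \({Bd}_i \times h_i\) equals \(D \times {Bd}_{min}\) for every sample, so the stocks are compared on an equal soil-mass basis.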

Environmental factors including terrain attribute, vegetation, and moisture indices

A digital elevation model (DEM) of the study area, with a spatial resolution of 30 m × 30 m, was acquired from the EarthExplorer website (https://earthexplorer.usgs.gov/). Based on the DEM data, 23 terrain attributes (Guo et al., 2019) were derived using SAGA GIS software (Conrad et al., 2015). All of these attributes are listed in Table 1. The terrain attributes were divided into three groups: local attributes, calculated from a fixed window of neighboring pixels; regional attributes, calculated from contributing-area concepts; and combined attributes, calculated from both local and regional attributes (Quinn et al., 1991).

Table 1 The list of terrain attributes as predictors of SOC stock modeling derived from the DEM (Guo et al., 2019)

To obtain the moisture index, evaporation data from MODIS products and precipitation data from TRMM products were downloaded from the Giovanni website (https://giovanni.gsfc.nasa.gov/). These parameters were resampled to 30 m × 30 m in R-Studio software, using the digital elevation model (DEM) data as a covariate. The moisture index was then calculated in R-Studio according to Ivanov’s moisture formula (Eq. 4) (Wang et al., 2019).

$$\mathrm{K}=\frac{\mathrm{R}}{{\mathrm{E}}_{0}}$$
(4)

where \(E_0\) is the evaporation, \(K\) is the moisture index, and \(R\) is the annual precipitation (mm).

After the Landsat 8 satellite images were obtained from the USGS website and pre-processed, including all corrections applied to the image bands, NDVI was calculated from the red (R) and near-infrared (NIR) bands according to Eq. 5 in ENVI 5.3 software (Zhao et al., 2014).

$$\mathrm{NDVI}=\frac{\mathrm{NIR}-\mathrm{R}}{\mathrm{NIR}+\mathrm{R}}$$
(5)
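
For illustration, Eq. 5 can be applied pixel by pixel as in the following Python sketch. The study computed NDVI in ENVI 5.3; the function names, the nested-list representation of the bands, and the reflectance values below are hypothetical, and returning `None` for pixels where NIR + R is zero is an assumption of this sketch.

```python
def ndvi(nir, red):
    """Eq. 5 for a single pixel; returns None when NIR + R is zero."""
    total = nir + red
    if total == 0:
        return None  # undefined ratio (e.g., a no-data pixel)
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply Eq. 5 to two equally shaped 2-D bands given as nested lists."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2 x 2 reflectance values
nir_band = [[0.5, 0.4], [0.3, 0.0]]
red_band = [[0.1, 0.2], [0.3, 0.0]]
ndvi_values = ndvi_grid(nir_band, red_band)
```

For non-negative reflectance values, the result always lies between −1 and 1, with higher values indicating denser green vegetation.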

Selecting environmental factors to predict spatial distribution of SOC stocks

In the first stage, Pearson’s correlation between SOC stocks and environmental factors was calculated. Then, SOC stocks were modeled by the random forest (RF) (Gomes et al., 2019; Hounkpatin et al., 2018) and partial least squares regression (PLSR) (Jiménez et al., 2019; Keskin et al., 2019) models to select environmental factors for spatial prediction. For both the RF and PLSR methods, the data were divided into two groups: a training set (170 of the 210 points) and a test set (40 of the 210 points) (RColorBrewer & Liaw, 2018). For the RF model, the most important parameters for predicting SOC stocks were identified with SAS JMP software. For the PLSR model, SmartPLS software was first used to identify environmental factors for estimating SOC stocks. The data selected in SmartPLS were then transferred to the Unscrambler software, and the major parameters were identified. Finally, the spatial distribution of SOC stocks was predicted by the RF and PLSR models in R.
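
The first two steps, the correlation screening and the 170/40 partition of the data, can be sketched as follows. This is an illustrative Python sketch only: the text does not state how points were assigned to the two groups, so the random shuffle and the seed are assumptions, and the function names are hypothetical.

```python
import math
import random


def train_test_split(n_samples=210, n_test=40, seed=42):
    """Randomly partition sample indices into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: simple random assignment
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


train_idx, test_idx = train_test_split()
```

Fixing the seed keeps the partition reproducible, so that both models are trained and validated on exactly the same points.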

Evaluation of spatial estimation methods

Estimated SOC stocks at the training and test sample sites were extracted from the predicted spatial distribution maps. Different validation indices, including the root-mean-square error (RMSE), mean absolute deviation (MAD), coefficient of determination (R²), and concordance correlation coefficient (ρc), were used to compare the measured and estimated values of SOC stocks using the following equations (Eqs. 6, 7, 8, and 9) (Kuhn & Johnson, 2013).

$${\mathrm{R}}^{2}=1-\frac{{\sum }_{\mathrm{i}=1}^{\mathrm{n}}{\left(\mathrm{Obs}-\mathrm{Pred}\right)}^{2}}{{\sum }_{\mathrm{i}=1}^{\mathrm{n}}{\left(\mathrm{Obs}-\overline{\mathrm{Obs} }\right)}^{2}}$$
(6)
$$\mathrm{RMSE}=\sqrt{\frac{{\sum }_{\mathrm{i}=1}^{\mathrm{n}}{\left(\mathrm{Obs}-\mathrm{Pred}\right)}^{2}}{\mathrm{n}}}$$
(7)
$$\mathrm{MAD}=\frac{{\sum }_{\mathrm{i}=1}^{\mathrm{n}}\left|\mathrm{Obs}-\mathrm{Pred}\right|}{\mathrm{n}}$$
(8)
$${\uprho }_{\mathrm{c}}=\frac{2\uprho {\upsigma }_{\mathrm{obs}}{\upsigma }_{\mathrm{pred}}}{{\upsigma }_{\mathrm{obs}}^{2}+{\upsigma }_{\mathrm{pred}}^{2}+{\left({\upmu }_{\mathrm{obs}}-{\upmu }_{\mathrm{pred}}\right)}^{2}}$$
(9)

where Obs is the measured value, Pred is the value predicted by the model, \(\overline{Obs }\) is the average of the measured values, n is the number of sampling points, ρ is Pearson’s correlation coefficient between the predictions and observations, and \({\mu }_{obs}\) and \({\mu }_{pred}\) are the means of the observed and predicted values, respectively. \({\sigma }_{obs}^{2}\) and \({\sigma }_{pred}^{2}\) are the corresponding variances.
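
The four indices can be computed together, as in the following Python sketch. It is offered as an illustration only: it takes RMSE as the square root of the mean squared error, uses population (divide-by-n) variances for the concordance coefficient, which is an assumption since the text does not specify the divisor, and the function name and sample values are hypothetical.

```python
import math


def validation_indices(obs, pred):
    """Return R2, RMSE, MAD, and the concordance coefficient (Eqs. 6 to 9)."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    sq_err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    var_obs = ss_tot / n  # population variance (assumed divisor n)
    var_pred = sum((p - mean_pred) ** 2 for p in pred) / n
    # The covariance equals rho * sigma_obs * sigma_pred in Eq. 9.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred)) / n
    return {
        "r2": 1 - sq_err / ss_tot,
        "rmse": math.sqrt(sq_err / n),
        "mad": sum(abs(o - p) for o, p in zip(obs, pred)) / n,
        "ccc": 2 * cov / (var_obs + var_pred + (mean_obs - mean_pred) ** 2),
    }


# Hypothetical observed and predicted SOC stocks (kg/m2)
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
indices = validation_indices(obs, pred)
```

Unlike Pearson’s correlation, the concordance coefficient also penalizes systematic bias, because the squared difference between the two means appears in its denominator.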

Results

Descriptive statistics of SOC stocks in different land uses

Summary statistics of SOC stocks for all land uses combined and for each land use are shown in Table 2. Across all land uses in the study area, the maximum, minimum, mean, median, skewness, and kurtosis values of SOC stocks were 4.5, 0.514, 2.7, 2.571, −0.139, and −0.746, respectively (Table 2). SOC stocks were highest in grasslands. In this land use, the presence of natural vegetation has increased SOC stocks and thus improved soil quality (Roose et al., 2005). Higher microbial activity has further promoted the accumulation of SOC stocks (Hooper et al., 2000; Wang et al., 2019). The maximum, minimum, mean, median, skewness, and kurtosis values of SOC stocks in grasslands were 4.5, 1.286, 3.239, 3.20, −0.768, and 0.266, respectively (Table 2). The lowest SOC stocks were found in the watercourse, where soil erosion has probably reduced SOC stocks (Wang et al., 2010). The maximum, minimum, mean, median, skewness, and kurtosis values of SOC stocks in the watercourse were 3.729, 0.514, 1.977, 2.121, 1.641, and 0.036, respectively (Table 2).

Table 2 Descriptive statistics of SOC stocks (kg/m²) data

Relationships between environmental factors and SOC stocks

Based on Pearson’s correlation (at the p < 0.05 level), SOC stocks were not correlated with NDVI or with aspect in the local group. SOC stocks were also not correlated with midslppst (mid-slope position) or sink (closed depressions) in the regional group (Fig. 2). Rahmati et al. (2016) investigated SOC stocks in the Lighvan watershed in northwestern Iran under four land uses (barren lands, weak grasslands, irrigated lands, and dry farming) using the ETM+ sensor. Their results revealed that remote sensing was ineffective for estimating SOC in areas with vegetation cover. This result might be attributed to the interference of vegetation with the spectral reflectance of OC.

Fig. 2 Pearson’s correlation (P-value < 0.05 level) between SOC stocks and environmental factors (including vegetation index, moisture index, and terrain attributes)

Selecting environmental factors by RF model

SOC stock modeling with the RF model showed that the environmental factors with the greatest effect on the prediction of SOC stocks were slope height (slph), standardized height (standh), terrain surface texture (texture), elevation, relative slope position (rsp), and normalized height (normalh). The total effects of slph, standh, texture, elevation, rsp, and normalh were 34.32, 15.9, 15.1, 14.6, 10.81, and 9.36%, respectively (Fig. 3). Thus, slph had the highest total effect and normalh the lowest. The spatial distribution of the factors selected by the RF model is shown in Fig. 4. The selected factors varied widely: slph (2.4 to 287.8), texture (0 to 56.85), standh (1307.09 to 2212.45), elevation (1311 to 2224), rsp (0 to 1), and normalh (0.08 to 0.99). The highest values of these factors occurred in the west of the watershed, and the lowest values in the middle of the watershed.

Fig. 3 Total effect (%) of each environmental factor on estimating SOC stocks by the RF model

Fig. 4 Environmental factors selected by the RF model: slope height (A), terrain surface texture (B), standardized height (C), elevation (D), relative slope position (E), and normalized height (F)

Selecting environmental factors by PLSR model

The analytical model in Fig. 5 shows the effects of the studied environmental factors. In this diagram, each line has a path and a direction, and its value is the path coefficient, that is, the standardized beta coefficient of the multiple regression model. Each coefficient represents the size of the effect of the independent variable on the dependent variable. In path analysis, an unexplained error term (e²) remains, and the coefficient of determination and the error term sum to one (R² + e² = 1) (Norris et al., 2015). The results of SOC stock modeling with the PLS algorithm in SmartPLS software showed that the path coefficient of the moisture index was 0.099, and those of the local, regional, and combined terrain attributes were 0.221, 0.395, and −0.023, respectively (Fig. 5).

Fig. 5 PLS algorithm with all environmental factors (including moisture index, local, regional and combined attributes)

The purpose of factor analysis is to summarize the data into a smaller number of more effective factors in the model (Harman, 1976). The factor analysis results showed that local parameters including slope, ruggedness, elevation, convexity, and convergence, and regional parameters including rsp, standh, normalh, texture, chnl base, slph, and eaf, had significant effects on SOC stocks (Table 3). Modeling was therefore carried out with these parameters. Modeling with the parameters selected by factor analysis showed no change in the R² value (Fig. 6).

Table 3 The result of factor analysis
Fig. 6 PLS algorithm using selected environmental factors by factor analysis

The PLSR analysis in Unscrambler software showed that, among the parameters selected by factor analysis in SmartPLS software (Zhu et al., 2019), four factors (standh, rsp, slope, and chnl base) explained 40% of the variation in SOC stocks. For the test data, the explained variation was 34% (Fig. 7). The path coefficients of standh, rsp, slope, and chnl base were 0.929, 0.885, 0.850, and 0.843, respectively (Fig. 6). Thus, among the parameters selected by factor analysis, these four had path coefficients greater than 0.840. Therefore, the spatial distribution of SOC stocks was predicted with the PLSR model in R software using these four factors (Fig. 7).

Fig. 7 The relative importance of covariates for SOC stock prediction using the PLSR model

After the main parameters of the PLSR model were identified, the PLSR relationship for these four factors was obtained (Table 4). This relationship predicts SOC stocks from the selected parameters with a coefficient of determination (R²) of 40%. The spatial distribution of the factors selected for the PLSR model is shown in Fig. 8. The parameter values varied as follows: standh (1307.09 to 2212.45), rsp (0 to 1), slope (0 to 1.09%), and chnl base (1311 to 2224). The highest and lowest values occurred in the northwest and the middle of the watershed, respectively.

Table 4 PLS regression model of SOC stocks at 0–30 cm soil depth (n = 170)
Fig. 8 Environmental factors selected by the PLSR model: standardized height (A), relative slope position (B), slope (C), and channel network base level (D)

Spatial distribution of SOC stocks

The spatial distributions predicted by the RF (Fig. 9A) and PLSR (Fig. 9B) models using the training points (170 points) are presented in Fig. 9. The R² values for the RF and PLSR models are 0.81 and 0.40, respectively. The RMSE, MAD, and ρc values are also better for the RF model than for the PLSR model (Table 5). The differences in spatial variation arise from the different factors selected by the two models to estimate the SOC stock distribution, but the overall pattern of the SOC stock distribution was similar for the RF and PLSR methods.

Fig. 9 Spatial prediction of SOC stocks at 0–30 cm soil depth using the RF (A) and PLSR (B) models

Table 5 Calibration and validation indices of SOC stocks (0–30 cm) predicted by RF and PLSR methods

The R² values for model validation using the test points (40 points) are 0.76 and 0.34 for the RF and PLSR methods, respectively. The RMSE, MAD, and ρc values are also better for the RF model than for the PLSR model (Table 5). It can therefore be concluded that the RF method is more suitable than PLSR for estimating the SOC stock distribution. Increasing elevation and topographic variation increase the rate of spatial change in SOC stocks. As a key factor in soil formation, topography has a significant effect on soil properties. Therefore, SOC stocks are expected to vary more in areas with high topographic variation (Zhu et al., 2019). SOC stocks were highest in the western and eastern regions (Fig. 9). The increase in elevation has probably reduced anthropogenic activity, so that plant residues are returned to the soil, and the accumulation of these residues has increased SOC stocks in these areas (Bonfatti et al., 2016).

Discussion

The benefits of SOC stock for agricultural development are well known, and many models have been proposed to understand and predict SOC stock (Gurung et al., 2020). What matters, however, is predicting SOC stock using the most efficient indicators. In this study, we tried to select the best environmental factors for predicting SOC stock using RF and PLSR models. The results showed that the prediction accuracy of these models varied (Tables 4 and 5). These differences can be due to the variability of natural conditions, the nature of the models, and differences in the characteristics of the sampling points (Zhao & Li, 2017). Therefore, it is not possible to state with certainty which models are inefficient for predicting SOC stock, but it is clear that the accuracy of the predictive models varies (Gurung et al., 2020).

In this study, selecting the parameters of the PLSR model by path analysis produced no change in the R² value (Table 3, Figs. 6, 7, and 8). We concluded that path analysis can be useful for recognizing the effects of variables on each other and prioritizing them in predicting the spatial variation of SOC stocks (Jiménez et al., 2019). Factor analysis of the studied indicators showed that the standh index had the maximum effect on SOC stocks, because in many areas the climate is controlled by topographic variation (Gao et al., 2015). Increasing elevation probably affects soil formation processes such as increasing clay content, limestone leaching, and reducing soil acidity (Rhoton et al., 2006). Other parameters chosen by path analysis were the slope and relative slope position (rsp). These parameters significantly affect SOC stocks, because particles transferred downslope by erosion accumulate at the foot slope and increase SOC stocks in these areas (Zhao & Li, 2017). Another effective factor in path analysis was the chnl base. Together with slope, the chnl base plays an essential role in the movement of materials and in erosion. Therefore, this parameter has a fundamental impact on SOC stocks (Maerker et al., 2016; Schillaci et al., 2017; Shahini Shamsabadi et al., 2019).

The RF model showed some differences in the relative contribution of attributes. In the RF model, the main factor for estimating SOC stocks was slph; the other selected parameters were standh, elevation, rsp, and normalh (Fig. 3). The complex topography of this area may have led to heterogeneity in SOC stock estimation, because in areas with complex topography there are many uncertainties in estimating SOC stocks. In these areas, changes in the topographic pattern are expected to cause changes in slope-dependent parameters. As a result, slope and rsp vary, which makes the estimation of SOC stocks difficult. A deep understanding of the spatial variation of SOC stocks and its controlling factors has not yet been achieved (Zhu et al., 2019). In these and similar areas, however, elevation, slope, and aspect, with their related parameters, are probably the main factors controlling SOC stocks, because they cause changes in climatic, hydrological, and environmental conditions (Qin et al., 2016). These conditions respond to topographic variation and, as a result, affect SOC stocks (Zhao & Li, 2017). Texture is another factor that was selected by the RF model (Fig. 3). It describes the smoothness and roughness of the ground surface. As this index increases, surface roughness probably increases, so the surface acts as a barrier against particle transfer (Iwahashi & Pike, 2007).

Conclusions

In this study, we aimed to select environmental factors to predict and estimate the spatial distribution of SOC stocks using RF and PLSR models. The overall results showed that the RF model was more accurate than the PLSR model in selecting suitable environmental factors for estimating SOC stocks. In both the RF and PLSR models, standh and rsp were selected as effective parameters, which indicates that they play important roles in determining SOC stocks. By entering these parameters and the other important factors into these models, SOC stocks can be estimated readily. Nevertheless, because of the complex relationships between SOC stocks and related environmental factors, more detailed studies are needed to find causal relationships and enhance the accuracy of SOC stock estimates. As the depletion of SOC stocks is an emerging threat that contributes to land degradation and reduces agricultural production potential, simple ways need to be found to estimate SOC stocks.