
1 Introduction

Soil is an essential natural resource for the development of life in ecosystems: it harbors a large amount of biodiversity, stores and supplies water and nutrients, and supports human activities such as agriculture.

The importance of soil is highlighted in the Sustainable Development Goals (SDGs) of the 2030 Agenda. Several SDGs involve soil directly or indirectly in their targets. In particular, Goals 12 “Responsible consumption and production” and 15 “Life on land” mention the use of sustainable production systems and agricultural practices to improve soil quality and to prevent soil pollution through proper management of chemicals and waste. In this way, they also seek to curb the causes of soil degradation, such as salinization [1].

Therefore, preventive measures are necessary, such as the development of cartography and models to predict how salinity will evolve. To this end, the Digital Soil Mapping (DSM) methodology and its derived models, such as the scorpan model, are proposed. These are based on the statistical inference of soil properties: the statistical relationship between properties measured in the field and different auxiliary variables or environmental covariates (climate, lithology, land use, vegetation, topography, etc.) is established and then extrapolated to areas lacking data [2,3,4,5].

Thus, the main objective of this research is to apply the Digital Soil Mapping (DSM) method in an irrigated area of Castile and León (Spain) to determine which covariates are the most useful and relevant for modeling soil salinity.

2 Study Area

The study area is located between the provinces of León and Zamora (Spain) (Fig. 1). It covers an area of 1500 km2, with altitudes between 680 and 930 m and generally flat relief. The average annual temperature is between 10 and 13 °C, with rainfall between 400 and 500 mm and average annual ETP of up to 800 mm [6, 7].

Fig. 1 Location of the irrigated area in León and Zamora, Spain

The dominant lithologies correspond to the Pleistocene and Holocene, formed by alluvial deposits in terrace areas, and by sand, silt and clay in valley bottom areas and river plains. Among the dominant soil types, Cambisols are found in terrace zones in the center and north of the study area, and Fluvisols in the floodplains. Finally, the main land use is irrigated cropland, with a predominance of maize (65,000 ha); poplar plantations also stand out in the riparian areas [8].

3 Methodology

The DSM methodology was used for the study through the application of the scorpan model developed by McBratney et al. [2], which proposes the integration of different environmental variables into a function that allows, in this specific case, the prediction of soil salinity (Ss) (Eq. 1).

$$S_s = f\left( {s,c,o,r,p,a,n} \right)$$
(1)

These environmental variables, also called “soil-forming factors”, comprise soil (s), climate (c), organisms (o), topography (r), parent material (p), age (a) and spatial position (n). In turn, these variables are defined by different environmental covariates [2].

In this way, based on soil samples with salinity data and the different environmental covariates related to it, a relationship is established that allows values to be estimated in unsampled areas. To do this, the soil samples were first spatially overlaid on the maps of the different covariates, which yields, at each sampling point, both the value of the soil property and the value of each environmental covariate used. Once these values are obtained, the relationships between the independent variables (environmental covariates) and the dependent variable can be modeled using different statistical methods. In this study, the dependent variable associated with salinity is the electrical conductivity measured in the saturated paste extract (ECx), and the statistical technique applied was Multiple Linear Regression (MLR). The flow chart in Fig. 2 outlines the procedure followed.
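
The overlay step can be illustrated with a minimal sketch. It assumes a north-up raster stored as a list of rows and a simple cell lookup; the function name `sample_raster`, the grid values and the sampling points are all hypothetical and are not taken from the study, which does not state the GIS tool used for this step.

```python
import math

def sample_raster(grid, x_min, y_max, cell_size, x, y):
    """Return the value of the raster cell containing point (x, y).

    grid is a list of rows, row 0 being the northernmost row;
    (x_min, y_max) is the upper-left corner of the raster in map units.
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        return None  # the point falls outside the raster extent
    return grid[row][col]

# Hypothetical 3 x 3 elevation raster (m) with 25 m cells.
elevation = [
    [700.0, 705.0, 710.0],
    [702.0, 707.0, 712.0],
    [704.0, 709.0, 714.0],
]

# Hypothetical sampling points: (x, y, ECx).
samples = [(10.0, 70.0, 0.21), (60.0, 5.0, 0.35)]

# One record per sample: the soil property plus the covariate value there.
table = [
    {"ecx": ecx, "elevation": sample_raster(elevation, 0.0, 75.0, 25.0, x, y)}
    for x, y, ecx in samples
]
```

Repeating the lookup for every covariate raster gives the sample-by-covariate table on which the regression is fitted.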

Fig. 2 Applied methodology flow chart

3.1 Soil Data

The Soil Database of Castile and León, obtained from the Agrarian Technological Institute [9], was used to acquire soil data. It contains 914 surface samples from the first 25–30 cm of soil with measurements of electrical conductivity in the saturated paste extract (ECx). However, the number of samples within the irrigated area was lower (132); these were also analyzed according to their laboratory of origin (Table 1).

Table 1 Samples and ECx values range (minimum and maximum) for each laboratory analysed in the study area

3.2 Environmental Covariates

To obtain the environmental covariates, different data sources were used to derive a total of 24 covariates (Table 2). In the case of land use and lithology, as these are categorical variables, each of their classes was transformed into a binary numerical variable.
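
As a rough illustration of the binary transformation of categorical covariates, the sketch below encodes one column of land-use labels as one 0/1 variable per class. The helper `one_hot` and the class labels are hypothetical; the actual classes used in the study are those listed in Table 2.

```python
def one_hot(values):
    """Turn a list of class labels into one binary (0/1) list per class."""
    classes = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in classes}

# Hypothetical land-use class at four sampling points.
land_use = ["maize", "poplar", "maize", "cereal"]
dummies = one_hot(land_use)
```

Each resulting binary variable can then enter the regression as an ordinary numerical covariate.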

Table 2 Data sources and environmental covariates used in the study for the different soil forming factors

It should be noted that all covariates were produced in raster format with a spatial resolution of 25 m × 25 m. The source data included covariates with high spatial resolution (≤ 25 m) and others with very low resolution (those related to climate). Resampling from a high resolution to a lower one is generally preferable to the reverse, because refining a coarse grid adds no new information. For this reason, 25 m was chosen as the working resolution for the high-resolution covariates. Climate variables generally do not show such high spatial variability, so although resampling them from 500 to 25 m is not the best solution, the resulting error was considered acceptable for the DSM method.
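
The source does not state which resampling method was used. As one possible reading, the sketch below refines a coarse grid by nearest-neighbour replication, so that each 500 m cell becomes a 20 × 20 block of identical 25 m cells. The function `upsample_nearest` and the temperature values are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Replicate every cell of a coarse grid into a factor x factor block."""
    fine = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        fine.extend(list(fine_row) for _ in range(factor))
    return fine

# Hypothetical 2 x 2 mean-temperature raster (deg C) at 500 m resolution.
coarse = [[11.0, 12.0], [12.5, 13.0]]
fine = upsample_nearest(coarse, 500 // 25)  # 40 x 40 cells at 25 m
```

The refined grid aligns with the 25 m covariates but carries exactly the same information as the coarse one, which is why this direction of resampling is the less desirable of the two.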

3.3 Statistical Analysis

After overlaying and extracting the values of the 24 covariates at the sampling points in the study area, a correlation analysis was carried out to reduce the high number of environmental covariates. Covariates whose Pearson coefficient (P) with another covariate was greater than 0.75 were eliminated from the analysis, as they provide redundant information.
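
A minimal sketch of such a correlation filter follows. Two points are assumptions not stated in the source: the absolute value of the coefficient is compared with the threshold, and when a pair is redundant the covariate listed first is the one retained. The covariate values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_redundant(covariates, threshold=0.75):
    """Keep a covariate only if |r| <= threshold with every covariate kept so far."""
    kept = []
    for name, values in covariates.items():
        if all(abs(pearson(values, covariates[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical covariate values at five sampling points.
covariates = {
    "EVI": [0.10, 0.20, 0.30, 0.40, 0.50],
    "NDVI": [0.12, 0.21, 0.33, 0.41, 0.52],
    "SLOPE": [2.0, 0.5, 3.0, 0.4, 1.5],
}
kept = drop_redundant(covariates)
```

In this toy example NDVI is almost collinear with EVI and is dropped, while SLOPE is retained.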

After this, Multiple Linear Regression (MLR) was applied using IBM SPSS Statistics 26 software. This technique is summarized by the following equation (Eq. 2), in which Y is the dependent variable, Xn are the predictors that explain the dependent variable, β0 is the intercept or origin, βn are the coefficients that represent the weight and relationship of each environmental covariate with the dependent variable, and ε are the residual values that cannot be explained by the model [15].

$$Y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \ldots + \beta_n \cdot X_n + \varepsilon$$
(2)

Prior to its application, the dataset was split, with 70% of the samples being used for model calibration and the remaining 30% for subsequent validation. With the corresponding calibration samples, the MLR was applied using the “backward elimination” method. Although this method yields numerous models, only two were selected in each case, on the basis that the coefficient of determination (R2) was adequate given its significance value and that the number of covariates was as small as possible.
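
The calibration/validation split and the backward elimination can be sketched as follows. This is not the SPSS procedure: as a simplifying assumption, a covariate is removed while its |t| statistic is below a fixed cut-off (`t_crit=2.0`) instead of using SPSS's F-probability criterion. The data are synthetic and the names `COV_A`, `COV_B` and `COV_NOISE` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns coefficients and t statistics."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = (resid @ resid) / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, beta / se

def backward_elimination(X, y, names, t_crit=2.0):
    """Drop the weakest covariate until every remaining |t| >= t_crit."""
    cols = list(range(X.shape[1]))
    while cols:
        _, t = fit_ols(X[:, cols], y)
        t_cov = np.abs(t[1:])  # skip the intercept
        weakest = int(np.argmin(t_cov))
        if t_cov[weakest] >= t_crit:
            break
        cols.pop(weakest)
    return [names[c] for c in cols]

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: two informative covariates and one pure-noise covariate.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 0.5 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
names = ["COV_A", "COV_B", "COV_NOISE"]

# 70% of the samples for calibration, 30% for validation.
order = rng.permutation(n)
n_cal = n * 7 // 10
cal, val = order[:n_cal], order[n_cal:]

selected = backward_elimination(X[cal], y[cal], names)

# Refit on the selected covariates only and validate on the held-out samples.
sel = [names.index(s) for s in selected]
beta, _ = fit_ols(X[cal][:, sel], y[cal])
pred = beta[0] + X[val][:, sel] @ beta[1:]
r2_val = r_squared(y[val], pred)
```

Computing R2 on the held-out 30% guards against reporting a fit that merely reflects the calibration samples.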

Subsequently, for the selection of the final model variables, Generalised Linear Models (GLM) were estimated using the covariates of each model chosen, obtaining a value of the Akaike Information Criterion (AIC). In this way, the model with the lowest AIC value was selected or, in the case where the value was similar for both models, the one with the lowest number of covariates was selected.
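
The selection rule described above can be written compactly. In the sketch below, the Gaussian AIC formula (up to an additive constant) is a standard textbook form and is only an assumption about what the software computed; the tolerance `tol` that defines "similar" AIC values and the covariate count of the first candidate are hypothetical, while the R2 and AIC values are those reported for the first laboratory.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant.

    rss: residual sum of squares, n: number of samples,
    k: number of estimated parameters.
    """
    return n * math.log(rss / n) + 2 * k

def choose_model(candidates, tol=0.01):
    """Lowest AIC wins; among near-ties, prefer the fewest covariates."""
    best_aic = min(c["aic"] for c in candidates)
    near_best = [c for c in candidates if c["aic"] - best_aic <= tol]
    return min(near_best, key=lambda c: c["n_covariates"])

candidates = [
    {"r2": 0.587, "aic": -124.11, "n_covariates": 11},  # count is hypothetical
    {"r2": 0.581, "aic": -124.11, "n_covariates": 10},
]
chosen = choose_model(candidates)
```

With equal AIC values the rule falls back on parsimony, so the model with fewer covariates is retained even though its R2 is marginally lower.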

Finally, once the covariates that best explained the electrical conductivity in each case were known, MLR was applied once more, forcing only these covariates to be used in order to obtain the final equation.

4 Results

Due to the low R2 coefficients achieved when working with all the available soil data in the study area, it was decided to work with two sample groups according to the analytical laboratory. The initial correlation analysis applied to the 132 soil samples gave similar results for both laboratories: the profile curvature and planform curvature variables, as well as one land use variable and one lithology variable, were discarded from the analysis.

Although the soil indices show high correlations, all of them were retained since each one represents different edaphic characteristics. Regarding the vegetation indices, however, NDVI and SAVI were discarded because their Pearson coefficient (P) with EVI was 0.990. After this preliminary analysis, MLR was implemented.

  • “Análisis Integrales” Laboratory

Two models were selected with R2 coefficients of 0.587 and 0.581, respectively. After performing the GLM with these models, an AIC value of − 124.11 was obtained for both, and the second model (R2 = 0.581) was finally chosen as it was the one that considered the fewest covariates. The indices, mainly the soil indices, are the covariates that best explain the ECx, although topographic variables are also important (Eq. 3). This model showed a significance value of 0.000 (p < 0.001), so the results are statistically significant.

$$\begin{aligned} \mathbf{EC}_{\mathbf{x}} & = -0.597 + \left( 4.682 \times 10^{-5} \times \text{BI} \right) + \left( -0.036 \times \text{LAND\_USE} \right) \\ & + \left( 0.053 \times \text{CURVATURE} \right) + \left( 0.745 \times \text{CAI} \right) + \left( -0.043 \times \text{TEMP} \right) \\ & + \left( -0.004 \times \text{ASPECT} \right) + \left( 0.037 \times \text{LITHO} \right) + \left( -0.461 \times \text{CI} \right) \\ & + \left( 0.001 \times \text{ELEVATION} \right) + \left( 0.566 \times \text{EVI} \right) \\ \end{aligned}$$
(3)
  • “APPLUS” Laboratory

Again, two models were selected, with R2 values of 0.395 and 0.382, respectively. The GLM provides an AIC value of − 115.66 for both, with the second model (R2 = 0.382) being selected because it has fewer covariates. The covariates associated with the indices have the highest weight, followed by those corresponding to the topography factor (Eq. 4). In this case, the significance value is high (p = 0.255), so the results are not statistically significant.

$$\begin{aligned} \mathbf{EC}_{\mathbf{x}} & = -0.470 + \left( 0.049 \times \text{TEMP} \right) + \left( -0.001 \times \text{RAIN} \right) \\ & + \left( -0.081 \times \text{LAND\_USE} \right) + \left( 0.021 \times \text{SLOPE} \right) + \left( -0.005 \times \text{ASPECT} \right) \\ & + \left( -0.010 \times \text{MRVBF} \right) + \left( -0.080 \times \text{CURVATURE} \right) + \left( 0.280 \times \text{SI} \right) \\ & + \left( 0.208 \times \text{EVI} \right) + \left( 0.260 \times \text{CAI} \right) + \left( -3.036 \times 10^{-5} \times \text{BI} \right) \\ \end{aligned}$$
(4)
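
To show how Eqs. 3 and 4 would be applied at a map cell, the sketch below stores the reported coefficients and evaluates the linear predictor. The coefficients are copied from the two equations; the covariate values of the example cell are hypothetical, since the units and value ranges of the covariates are not given in this section.

```python
# Coefficients reported in Eq. 3 ("Análisis Integrales") and Eq. 4 ("APPLUS").
EQ3 = {
    "intercept": -0.597,
    "coefs": {
        "BI": 4.682e-5, "LAND_USE": -0.036, "CURVATURE": 0.053,
        "CAI": 0.745, "TEMP": -0.043, "ASPECT": -0.004, "LITHO": 0.037,
        "CI": -0.461, "ELEVATION": 0.001, "EVI": 0.566,
    },
}
EQ4 = {
    "intercept": -0.470,
    "coefs": {
        "TEMP": 0.049, "RAIN": -0.001, "LAND_USE": -0.081, "SLOPE": 0.021,
        "ASPECT": -0.005, "MRVBF": -0.010, "CURVATURE": -0.080, "SI": 0.280,
        "EVI": 0.208, "CAI": 0.260, "BI": -3.036e-5,
    },
}

def predict_ecx(model, covariates):
    """Evaluate intercept + sum(coefficient * covariate value).

    Raises KeyError if a covariate required by the model is missing.
    """
    return model["intercept"] + sum(
        coef * covariates[name] for name, coef in model["coefs"].items()
    )

# Hypothetical covariate values for a single raster cell.
cell = {
    "BI": 3000.0, "LAND_USE": 1, "CURVATURE": 0.0, "CAI": 0.5, "TEMP": 12.0,
    "ASPECT": 90.0, "LITHO": 1, "CI": 0.1, "ELEVATION": 750.0, "EVI": 0.3,
}
ecx_cell = predict_ecx(EQ3, cell)
```

Applying the same function to every cell of the covariate rasters would yield the predicted ECx map for the corresponding laboratory model.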

5 Discussion

The results obtained show that the DSM method is useful for identifying which variables best explain and model salinity, since the R2 coefficients obtained (0.581, 0.382) are in line with those reported in comparable studies. However, there is considerable uncertainty, which may be caused by the soil samples themselves, either through heterogeneity of the measurements or through poor spatial distribution of the samples. It may also be due to the high number of environmental covariates used and to the error associated with them, as they come from several different data sources.

The indices derived from satellite images stand out, both those associated with the soil factor and the organism factor (Table 2), as well as the covariates corresponding to the topography factor. In the latter case, curvature is the most relevant.

These results are similar to those obtained by other authors. Omuto et al. [16] applied MLR in a study area located in Lesotho, where they obtained an R2 value of 0.460, which does not differ much from those obtained in this study (0.581, 0.382). Mosleh et al. [17] also applied MLR, resulting in a worse R2 (0.110). Taghizadeh-Mehrjardi et al. [18] used super-learning techniques; among the statistical techniques they tested, MLR gave an R2 of 0.230, which is lower than the results of this study.

In turn, these authors corroborate the importance of both satellite image-derived covariates and topographic covariates for spatial modelling of salinity [16, 18, 19]. It is also worth noting that Mousavi et al. [4] concluded that the best results are obtained when both types of covariates are used. In their case, applying MLR using only satellite indices gave an R2 of 0.506, while also including topographic variables increased the R2 value to 0.660.

6 Conclusions

Following the analysis and discussion in this research, DSM is concluded to be a useful methodology for identifying the most relevant covariates in soil salinity modelling, notably those associated with the topography factor and those corresponding to the indices calculated from satellite images.

Soil salinity is an emerging challenge in agricultural areas, especially in those that are more susceptible. Salinity threatens soil quality and crop productivity, which means a reduction in supply to the population. Given this, further research on the topic is required, using methodologies such as DSM to identify susceptible areas and to allow preventive measures to be applied in order to achieve the SDG targets.