Introduction

World soils contain around four times as much carbon (C) as the vegetation and three times as much as the atmosphere1,2. Globally, soil organic C (SOC) stocks are currently estimated at 1400 ± 150 petagrams of carbon (Pg C) to 1 m in depth and 2060 ± 220 Pg C to 2 m in depth3. Additionally, SOC improves soil health and the delivery of other related ecosystem services4,5, with clear benefits for crop yields, boosting or maintaining the production of food, feed, fiber, and energy. Accordingly, any change in the soil C reservoir would significantly impact both world food security and global climate change6,7. In this sense, assessing and monitoring SOC is mandatory1,8. However, the conventional laboratory-based analytical methods for SOC accounting are expensive and time-consuming9,10. The development of accurate and cost-efficient alternative techniques for SOC quantification is a major concern in climate policy11.

Diffuse reflectance spectroscopy (DRS) is a technique that records the absorption and scattering of light at the surface of soil particles12. It is a fast, non-polluting, and low-cost technique that has already been used to characterize different soil attributes13,14,15,16. Its use for estimating soil C content17,18,19,20 may allow the detailed mapping of this variable on large scales and the understanding of its spatial variation in landscapes8,21.

Beyond governing numerous soil properties directly related to crop productivity22,23,24, soil mineralogy plays a large role in regulating the size and stability of C stocks25. Some studies have indicated a direct relationship between oxides and C flow in the soil of small areas15,18,26. However, the relationships between mineralogy and C in large areas, with higher geological and geomorphological variability, remain little studied. Understanding the relationship between soil attributes in detail is the best basis for management practices and soil planning27.

Multivariate analyses, such as partial least squares regression (PLSR)28, and machine learning algorithms, such as random forest (RF)29, have been widely used in C estimation using DRS data18,19,20,21,30. Bellon-Maurel and McBratney17 provided a detailed literature review about DRS and C estimates. According to the authors, future work should optimize these studies spatially for mapping purposes and associate the estimates with robust mathematical models.

Soil is a system that is constantly changing. Therefore, accurate predictions of the consequences of human activities on terrestrial ecosystems may be difficult31. The diffusion of new technologies and agricultural research developed in the coming decades will determine the possibility of mitigating resources and adapting to climate change to ensure continuity in food production32. Moreover, there is always a shortage of data, inventories, detailed C maps for large-scale projects21,33, and the best mathematical techniques for these studies30,34, mainly in Brazil.

Studies of wide magnitude, which present not only C estimates but also its relationship with other soil attributes and their spatial variability on maps are essential for agriculture and the sustainability of terrestrial ecosystems. In this context, this study aimed to (a) investigate the relationship of C with different soil mineralogical, chemical, and physical attributes for different geological and geomorphological compartments; (b) understand which spectral bands are most important for estimating C content; (c) estimate C content from diffuse reflectance spectroscopy using different mathematical techniques and indicate which one is the best for tropical soil conditions; and (d) map C contents of the Western Plateau of São Paulo in detail.

Material and methods

Location and characterization of the area

The study area was the Western Plateau of São Paulo (WPSP), which covers approximately 13 million hectares (~ 48% of the State of São Paulo, Brazil) (Fig. 1). The geological outline is mainly characterized by sandy, clayey, and gravel sediments, volcanic rocks with basic composition, and sedimentary sequences, mainly psammitic, which may include pyroclastic sequences (Fig. 1a). Approximately 2 million hectares of WPSP are represented by basalt (15.5%), 7.4 million hectares by the Peixe River Valley Formation (57.1%), and 3.6 million hectares by other sedimentary formations (27.5%).

Figure 1

(a) Updated geological map of the Western Plateau of São Paulo. (b) Pedological map of the Western Plateau of São Paulo. Source: Agronomic Institute of Campinas. (c) Dissection map of the Western Plateau of São Paulo. (d) Sample planning of the collected points. (e) Hematite map made by X-ray diffraction (Hm, g kg−1). Source: Extracted from Silva et al.37. (f) Goethite map made by X-ray diffraction (Gt, g kg−1). Source: Extracted from Silva et al.37. (g) Kaolinite/(kaolinite + gibbsite) ratio [Kt/(Kt + Gb)] map made by X-ray diffraction. Source: Extracted from Fernandes et al.38.

The geological formations of WPSP are divided into two groups: (i) Caiuá group, composed of Santo Anastácio and Paraná River formations in the State of São Paulo, corresponding to deposits of sand sheets of dry climate, accumulated in extensive and monotonous desert plains, marginal to the large sand sea dune complexes (Caiuá Desert), extending to the northern region of the State of Paraná; and (ii) Bauru group, composed of Uberaba, Peixe River Valley, Araçatuba, São José do Rio Preto, Presidente Prudente, and Marília formations, which include Taiúva analcimites and volcanic rocks locally interspersed in the sequence35,36.

Soils with the highest occurrence are classified as Argissolo Vermelho-Amarelo (Ultisol), Latossolo (Oxisol), Latossolo férrico (Oxisol), Neossolo Litólico (Entisol), Nitossolo Vermelho (Nitisol), and Gleissolo Háplico (Gleysol) (Fig. 1b). The geomorphology of the area is shown in Fig. 1c. Moderately dissected areas are predominant in the region. The dissection level expresses the soil evolution in the landscape (Fig. 1c), which is associated with pedogenesis (soil formation rate) and geomorphogenesis (landscape evolution rate). The lowest hematite (Hm) content, highest amorphous goethite (Gt) contents37, and highest kaolinite (Kt) proportion are observed in the central portion of WPSP, where sandstones prevail38.

The mineralogical characterization at the detail level can infer numerous other soil characteristics. Silva et al.37 and Fernandes et al.38 pointed out that iron oxides and minerals Kt and Gb accurately reflect their respective formation environments and the variability imposed by geological and geomorphological material. Soils undergo a reduction process in humid environments in highly dissected compartments in the most drained regions of the landscape, favoring Gt formation. On the contrary, sandstone compartments, rich in silica, favor Kt formation.

The tropical climate with a dry winter season prevails in the north and northwest of WPSP (C2rA′a′), while the humid temperate climate with a hot summer prevails in the south (B4rB′4a). In addition, the climate in the east and southeast consists of a humid temperate climate with a dry winter and hot summer (B2rB′3a). The climate in these regions can be subclassified into four more variations, according to the Thornthwaite classification. The natural vegetation in the area consisted of Atlantic Forest in the west and Cerrado in the east and southwest of WPSP, with the most representative current land use being sugarcane (Saccharum spp.), citrus, and pasture.

A total of 265 soil samples georeferenced in WPSP were collected to study the spatial variability of mineralogical attributes at a depth of 0.00–0.20 m. The minimum spacing between samples was 10 km and the maximum spacing was 60 km (Fig. 1d). The samples were collected along the highways of the State of São Paulo in representative locations that have suffered minimal anthropogenic interference. The sampling plan was prepared based on the file of State highways supplied by the Department of Roads and Highways using the ET GeoWizards tool in the software ArcView 9.3. The geographic coordinate information for each point was previously defined and inserted in a GPS navigation device with an accuracy of 3–5 m. Sample collection was guided by real-time navigation using GPS and laptops.

Points simulating the stratified sampling (in red in Fig. 1d) were selected after identifying the patterns of variability in soil attributes and identifying homogeneous areas39. The choice of the sampled points considered previous experiences within WPSP, using geostatistical techniques40,41, proposing a representative distribution of the studied area.

Analyses

Soil particle size and chemical analysis

Particle size analyses were performed for all WPSP points. The pipette method with a 0.1 mol L−1 NaOH solution as a chemical dispersant and mechanical stirring at a low rotation for 16 h was used, as recommended by Teixeira42. Calcium, magnesium, and potassium contents were extracted using the ion-exchange resin procedure43. Base saturation (V, %) was given by the equation V = 100 × (Ca + Mg + K)/CEC, where CEC is the cation exchange capacity.

Mineralogical analyses

Iron contents related to the totality of pedogenetic irons extracted by dithionite-citrate-bicarbonate (Fed) were determined following the procedure of Mehra and Jackson44 for all grid points and the stratified samples. The contents of iron extracted by ammonium oxalate (Feo) relative to low-crystallinity pedogenetic iron oxides were determined following the methodology mentioned by Camargo et al.45 adapted from Schwertmann46.

Minerals were quantified for all points on the total grid and stratified samples. Clay for the X-ray diffraction (XRD) analysis was separated from the soil sample by the centrifugation method47. The clay fraction was subjected to the elimination of iron oxides by the dithionite-citrate-bicarbonate (DCB) method to characterize kaolinite (Kt) and gibbsite (Gb), according to Mehra and Jackson48, and sieved in a 0.10-mm opening mesh. XRD characterized the minerals of the clay fraction Kt and Gb in sheets made with material without orientation (powder). Goethite (Gt) and hematite (Hm) were characterized after treating the clay fraction using 5 mol L−1 NaOH (1 g clay 100 mL−1 solution) for their concentration, according to the method of Norrish and Taylor49, modified by Kämpf and Schwertmann50.

XRD was performed with the samples prepared by the powder method using an HGZ apparatus equipped with cobalt cathode and iron filter and Kα radiation (20 mA, 30 kV) for Hm and Gt diffraction and copper cathode with nickel filter for Kt and Gb diffraction. The scanning speed was 1°2θ min−1 with an amplitude from 23° to 49°. The reflections of Kt (001), Gb (002), Hm (012 and 110), and Gt (110 and 111) were used for mineralogical evaluation.

Diffuse reflectance spectroscopy (DRS)

Diffuse reflectance spectra were obtained using approximately 1 g of air-dried fine soil ground in an agate mortar until obtaining constant color, and the content was placed in sample holders with a 16-mm diameter cylindrical space. The soil subsamples (1 g) used for the DRS analysis came from the samples corresponding to 250 g of soil collected at each of the 265 points georeferenced in WPSP.

Reflectance values were determined in a Lambda 950 UV/Vis/NIR spectrophotometer coupled to an integrating sphere of 150 mm in diameter. The spectra were recorded at 0.5-nm intervals, with an integration time of 2.43 nm s−1 over the 350 to 2500-nm range (350–800 nm—visible—VIS and 801–2500 nm—near-infrared—NIR).

The pre-processing technique used to remove noise from the raw curves was the standard normal variate (SNV). This technique removes interference due to light scattering, helping to correct curvilinear and linear trends in the baseline of the original spectra14.
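The SNV transform described above can be sketched in a few lines. This is a minimal illustration, not the authors' processing script: each spectrum is centred on its own mean and divided by its own standard deviation. The function name `snv`, the use of the sample standard deviation, and the reflectance values are assumptions made for the example.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale ONE spectrum by its own mean and SD."""
    mu = mean(spectrum)
    sd = stdev(spectrum)  # sample SD (n - 1); assumed convention, some tools use n
    if sd == 0:
        raise ValueError("flat spectrum: SNV is undefined")
    return [(r - mu) / sd for r in spectrum]


# Hypothetical reflectance values at five wavelengths (not measured data)
raw = [0.12, 0.18, 0.25, 0.31, 0.34]
corrected = snv(raw)
```

Because the correction is applied row by row, it needs no reference spectrum and can be applied to each of the 265 samples independently.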

Carbon content and stock

The total soil carbon content (C%) was determined for all points of WPSP. The analysis was carried out by dry combustion using a LECO CN-20009. The soil samples (air-dried fine soil) were sieved on a 100-mesh sieve and oxidized at high temperatures (oven at 1350 °C) using 2.8 ultrapure oxygen.

The C stock was analyzed only for stratified samples. Undisturbed soil samples were collected with a volumetric ring (with known volume) at a depth of 0.00–0.05 m to determine the following physical attributes: porosity, soil density (Ds), and soil moisture. The respective masses and soil moisture in the ring were determined for Ds calculation. The density of each layer was calculated according to the equation:

$$\text{Ds}\ (\text{g cm}^{-3}) = \frac{\text{dry soil mass (g)}}{\text{cylinder volume (cm}^{3}\text{)}}$$

The total contents of C stock were determined using cross-calibration curves and based on the weight of the analyzed sample. C stocks (in Mg ha−1) were calculated for each soil sample according to the equation described below:

$$\text{C stock}\ (\text{Mg ha}^{-1}) = \text{C}\ (\%) \times \text{Ds}\ (\text{g cm}^{-3}) \times \text{Thickness}\ (\text{cm})$$
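As a worked illustration of the two equations above, the following sketch computes soil density and C stock for a single sample. The function names and all input values are hypothetical. The unit check in the comment shows why the product comes out directly in Mg ha−1.

```python
def soil_density(dry_mass_g, ring_volume_cm3):
    """Ds (g cm-3) = dry soil mass (g) / cylinder volume (cm3)."""
    return dry_mass_g / ring_volume_cm3


def carbon_stock(c_percent, ds_g_cm3, thickness_cm):
    """C stock (Mg ha-1) = C (%) x Ds (g cm-3) x thickness (cm).

    Unit check: 1 % x 1 g cm-3 x 1 cm = 0.01 g C cm-2,
    and 0.01 g cm-2 x 1e8 cm2 ha-1 = 1e6 g ha-1 = 1 Mg ha-1.
    """
    return c_percent * ds_g_cm3 * thickness_cm


# Hypothetical sample: 139 g of dry soil in a 100 cm3 ring, 2.0 % C, 20 cm layer
ds = soil_density(139.0, 100.0)       # 1.39 g cm-3
stock = carbon_stock(2.0, ds, 20.0)   # about 55.6 Mg ha-1
```

Note that C content must be expressed in percent for this shortcut to hold; a value in g kg−1 has to be divided by 10 first.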

Statistical analysis

The data were subjected to descriptive statistics after the completion of laboratory analysis. The mean, maximum, minimum, standard deviation, and coefficient of variation were calculated. Moreover, simple correlation coefficients (Spearman) were calculated between soil attributes. All wavelengths of the spectral curves were submitted to stepwise regression, applied gradually and iteratively (forward–backward) based on the Akaike information criterion (AIC), to select the most relevant wavelengths for estimating C content. The selected wavelengths were then subjected to cluster analysis.
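A forward–backward stepwise search guided by AIC can be sketched as follows. This is a simplified illustration using ordinary least squares and NumPy, not the software used in the study. The function names, the Gaussian-likelihood form of AIC (n·ln(RSS/n) + 2k), and the synthetic data are assumptions.

```python
import numpy as np


def aic_ols(X, y, cols):
    """AIC of a least-squares fit on the given columns plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]


def stepwise_aic(X, y):
    """Alternate forward and backward steps while either one lowers the AIC."""
    selected = []
    best = aic_ols(X, y, selected)
    improved = True
    while improved:
        improved = False
        # Forward step: add the unused column that lowers AIC the most
        adds = [(aic_ols(X, y, selected + [c]), c)
                for c in range(X.shape[1]) if c not in selected]
        if adds and min(adds)[0] < best:
            best, c = min(adds)
            selected.append(c)
            improved = True
        # Backward step: drop a selected column if that lowers AIC
        drops = [(aic_ols(X, y, [s for s in selected if s != c]), c)
                 for c in selected]
        if drops and min(drops)[0] < best:
            best, c = min(drops)
            selected.remove(c)
            improved = True
    return selected, best


# Synthetic data (assumption): only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=60)
selected, score = stepwise_aic(X, y)
```

With real spectra the number of candidate wavelengths far exceeds the number of samples, so the number of selected bands must stay well below the sample count for the least-squares fit to remain meaningful.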

Two approaches were used to estimate C contents: partial least squares regression (PLSR) and machine learning (ML). PLSR uses chemometric calibrations and validations by a cross-validation procedure. Reflectance measurements were converted into absorbance [Log10 (1/Reflectance)] for data processing in the software. The software ParleS®28 was used to determine the relationship between the entire spectral curve and the studied attributes. The database was subdivided into a calibration set (70%) and a prediction set (30%) for analysis. The split was performed with “train_test_split”, a function in the “model_selection” module of the scikit-learn machine learning library that randomly divides a dataset into a training set and a test set.
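The absorbance conversion and the 70/30 split described above might look like the sketch below. The array shapes, the random data, and the `random_state` value are assumptions chosen for illustration. Only `train_test_split` and the Log10 (1/Reflectance) transform come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical data: 265 samples x 50 bands of reflectance in (0, 1)
reflectance = rng.uniform(0.05, 0.60, size=(265, 50))
carbon = rng.uniform(5.0, 30.0, size=265)  # hypothetical C contents, g kg-1

# Absorbance = log10(1 / reflectance); reflectance must be strictly positive
absorbance = np.log10(1.0 / reflectance)

# 70 % calibration, 30 % prediction; random_state fixed only for repeatability
X_cal, X_val, y_cal, y_val = train_test_split(
    absorbance, carbon, test_size=0.30, random_state=42
)
```

Fixing `random_state` makes the partition reproducible, which matters when calibration and prediction statistics from different models are compared on the same split.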

The performance of each model was evaluated using the coefficient of determination (R2), mean absolute error (MAE), and the root mean square error (RMSE). These metrics were evaluated both in calibration and in prediction. The ML algorithm was the random forest (RF) regression51. RF is a non-parametric technique developed to improve the prediction of the classification and regression tree models, as it uses fully grown decision trees and reduces the error and variation52. It consists of the combination of several trees, which are generated from an input variable sampled at random. All trees are the same size. However, the subdivisions of the trees are based on a subset randomly sampled from the total database, and the final result of RF consists of the mean of the results of all trees53. RF has been one of the most used ML algorithms for presenting high performance in the prediction of soil attributes54.
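The three evaluation metrics named above can be written out explicitly. The sketch below is a plain-Python illustration with hypothetical observed and predicted values. It assumes R2 is defined as 1 − SSres/SStot, which is one common convention (some studies instead report the squared correlation coefficient).

```python
import math


def r2_mae_rmse(observed, predicted):
    """Return (R2, MAE, RMSE) for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot          # assumed definition, see lead-in
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical C contents (g kg-1), not values from the study
observed = [10.0, 12.0, 20.0, 22.0]
predicted = [11.0, 11.0, 19.0, 23.0]
r2, mae, rmse = r2_mae_rmse(observed, predicted)
```

MAE and RMSE share the unit of the predicted attribute, so they are directly comparable with the C contents reported in g kg−1. RMSE weights large errors more heavily than MAE.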

The performance of prediction models was evaluated on a set of independent, previously unseen data. In this sense, the database was randomly subdivided into two sets, with 50% for training and 50% for testing. This division was chosen because of the high variability of the data. The performance of each model was evaluated using R2 and RMSE in the training and testing stages. The package scikit-learn was used to run the algorithm55.
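A random forest regression with a 50/50 split, as described above, could be set up in scikit-learn roughly as follows. The data are synthetic, and the hyperparameters (`n_estimators=200`, `random_state=0`) are assumptions, since the text does not report the settings used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical absorbance matrix (265 samples x 30 bands) and C contents
X = rng.uniform(0.2, 1.2, size=(265, 30))
y = 10.0 + 8.0 * X[:, 5] + rng.normal(scale=1.0, size=265)

# 50 % training, 50 % testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)  # assumed settings
model.fit(X_train, y_train)

# R2 and RMSE in both stages
results = {}
for name, Xs, ys in (("train", X_train, y_train), ("test", X_test, y_test)):
    pred = model.predict(Xs)
    rmse = float(np.sqrt(mean_squared_error(ys, pred)))
    results[name] = (float(r2_score(ys, pred)), rmse)
```

Reporting both stages side by side helps reveal overfitting: a forest of fully grown trees usually fits the training half much more closely than the testing half.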

Geostatistical analysis was used to characterize the spatial variability pattern of the observed values56. Semivariograms relating the distance vector to the semivariance were constructed and models were fitted. The values were interpolated by ordinary kriging for the construction of spatial distribution maps. The semivariograms were generated and the best-fitting models selected in the R language using the geostat package.
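The empirical semivariogram underlying this step can be illustrated with the classical estimator γ(h) = Σ(zi − zj)² / (2N(h)), where N(h) is the number of point pairs in a lag class. The sketch below is in plain Python rather than R and is a stand-in for illustration only. The function name, lag settings, and sample values are assumptions, and model fitting and kriging are not shown.

```python
import math


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return (lag centre, semivariance or None, pair count) for each lag class.

    Pairs are binned by separation distance into [k * lag_width, (k + 1) * lag_width).
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [
        ((k + 0.5) * lag_width,
         sums[k] / (2 * counts[k]) if counts[k] else None,
         counts[k])
        for k in range(n_lags)
    ]


# Hypothetical transect: four points 10 km apart with made-up attribute values
coords = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(coords, values, lag_width=10.0, n_lags=4)
```

A theoretical model (for example spherical or exponential) would then be fitted to these points, and its parameters would feed the ordinary kriging interpolation.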

Results and discussion

Soil attributes across the Western Plateau of São Paulo

Soils originating from sandstone had a mean sand content of 82%, while those from basalt were more clayey, with a mean of 53% clay (Table 1). The maximum and minimum clay values for basalt soils reached 67 and 27%, respectively. This wide range between minimum and maximum shows the variation of the attribute within the compartment. The same can be observed from the high coefficients of variation (CV), according to the classification of Warrick and Nielsen57. Sandstone soils have a maximum of 90% and a minimum of 73% sand and, therefore, less variability compared to basalt soils. The CV for sand content was classified as low for the compartment.

Table 1 Descriptive statistics for soil attributes from different geological and geomorphological compartments in agricultural areas and native forest in the Western Plateau of São Paulo.

The mean clay values in the geomorphological compartments increased from slightly (23% clay) to highly dissected compartments (34% clay). The means of silt and sand tended to increase from the highly to the slightly dissected compartments (Table 1). The maximum and minimum values for clay and silt were higher for the slightly dissected compartment than for the others. This shows that soils of highly dissected compartments tend to be more clayey and those of slightly dissected compartments tend to be sandier.

Slightly dissected environments would be those flatter ones, favoring higher water infiltration, better drainage, higher leaching, more uniform solar radiation, and other characteristics that condition higher soil weathering (higher pedogenesis rates). Therefore, higher mean values of clay and lower mean values of sand were expected. However, considering that highly dissected compartments have concave areas of high intensity and a depositional behavior58, finer materials, such as clay and organic material, may have accumulated in this compartment. Cunha et al.59 and Silva et al.37 observed similar results.

Basalt compartment soils have higher mean contents of Fed (52 g kg−1) than those from sandstone (12 g kg−1) (Table 1). The same was observed for the Feo content, in which sandstone compartment soils presented a mean of approximately 1 g kg−1, while basalt soils had a mean of 3 g kg−1. The iron content by geomorphological compartment showed that the slightly dissected compartment had the lowest means of Fed (24 g kg−1) and Feo (1.37 g kg−1). Fed content had the same behavior as the clay content, that is, lower values in the slightly dissected compartment and higher values in the highly dissected compartment.

Basalt compartment soils showed a higher V (70%) than soils originating from sandstone (65%) (Table 1). WPSP is an intensely cultivated region and soil liming/fertilization and management practices may have influenced these results. Also, sandstone rocks have larger particle sizes and, consequently, the weathering process is slower in these rocks than in basalt rocks. The mean of V for the highly dissected compartment reached 71%, while soils from the slightly dissected environment showed a mean of 67%. The moderately dissected compartment was intermediate compared to the others (65%). The differences are small for these compartments.

The mean value of soil density (Ds) was higher for sandstone soils (1.52 g cm−3) than for basalt soils (1.39 g cm−3) (Table 1). Thus, the data analysis by compartments for large areas may present results similar to those observed in the literature. In this case, the shape of the mineral in kaolinitic soils, such as sandstones, may favor higher Ds values, while the shape of these minerals for basalt soils, with higher iron oxide content, may contribute to better arrangement of particles and lower Ds60. Ds presented intermediate means for the moderately dissected compartment (1.49 g cm−3), increasing from highly to slightly dissected (1.39–1.44 g cm−3, respectively).

The means of total pore volume (TPV) showed an inverse behavior when compared to the means of Ds (Table 1). TPV is higher for basalt compartment soils (41%) and lower for sandstone compartment soils (36%). The means for geomorphological compartments differed little, i.e., 40, 37, and 38% for highly, moderately, and slightly dissected compartments, respectively. This indicates that the geomorphological compartments did not capture TPV variability well.

The total C contents were higher for areas of basalt and highly dissected compartments than the other compartments. The mean for the basalt compartment (20.82 g kg−1) was almost double the mean observed for the sandstone compartment (10.74 g kg−1). Bahia et al.18 observed C values similar to those found in this research when studying smaller WPSP areas. All mean attribute values are similar to those found in the literature for different sandstone and basalt geological compartments18,60,61.

The basalt compartment presented a higher C stock (Cs) (144 Mg ha−1) than the sandstone compartment. The same behavior was found for highly (137 Mg ha−1) and slightly dissected compartments (100 Mg ha−1), matching the higher C and clay contents. The highest Cs values are related to compartments with predominant oxidic mineralogy (Table 1), as verified for the entire WPSP extension.

Sandstone compartment soils are kaolinitic and basalt compartment soils are oxidic, showing the highest mean values of Gt (15 g kg−1) and Hm (55 g kg−1). It is also observed in the mean value for the Gt/(Gt + Hm) ratio of 0.22 for soils originating from basalt and the mean of 0.89 for the Kt/(Kt + Gb) ratio for sandstone soils (Table 1). According to Curi and Franzmeier62, soils originating from basalt present higher iron contents than those originating from sandstone, favoring the formation of iron oxides to the detriment of other minerals. Mineralogy presented means with small differences for geomorphological compartments. The highest difference was found for the Gt/(Gt + Hm) ratio of 0.40 in the slightly dissected compartment, which was higher than the others. Thus, this compartment has a predominance of oxidic mineralogy, while the others (highly and moderately dissected) have a predominance of kaolinitic mineralogy.

Correlation between C contents and soil attributes

Figure 2 shows a correlation matrix with all the studied attributes and geological and geomorphological compartments. A direct correlation can be observed between sand content and the sandstone compartment and between clay content and the basalt compartment, as found in the descriptive statistics. Sandstone soils are sandier and basalt soils are more clayey. Geomorphological compartments did not show good correlations with geological compartments. Clay content correlated directly with the highly dissected compartment, whereas sand content correlated inversely with it.

Figure 2

Spearman correlation matrix for attributes at a depth of 0–0.20 m and different geological and geomorphological compartments of the Western Plateau of São Paulo. Fed crystalline iron, Feo non-crystalline iron, V base saturation, Ds soil density, TPV total pore volume, C carbon, Cs carbon stock, Kt kaolinite, Gb gibbsite, Gt goethite, Hm hematite.

Fed and Feo contents showed a high and inverse correlation with the sandstone compartment (− 0.83) and a direct correlation with the basalt compartment (Fig. 2). Regarding the geomorphology, Fed and Feo contents showed low correlation coefficients, with the highest values of − 0.24 and − 0.22 for soils of the slightly dissected compartment, respectively. The attributes V, TPV, C, Cs, Gt, and Hm showed a direct correlation with the basalt compartment and an inverse correlation with the sandstone compartment, while Ds and the Kt/(Kt + Gb) and Gt/(Gt + Hm) ratios showed positive correlations with the sandstone compartment and negative correlations with the basalt compartment, in line with the higher means of these attributes in sandstone soils (Table 1).

Considering the correlations between the attributes and C and Cs contents, the highest values were found between clay content and soil mineralogy. The higher the clay content and predominance of soil oxides, the higher the C and Cs contents. The opposite was observed for sandier soils, with a predominance of kaolinitic mineralogy, as the correlation between these attributes and C and Cs was high but negative. Therefore, areas with higher means of clay content are potential C reservoirs, as proposed by Mendes63. Soil CO2 emission can be described by the emission and C stock decay coefficients. Thus, the potentials for higher or lower emissions are intrinsic to the soil and closely related to formation processes and factors, expressed by covariate attributes, such as texture and mineralogy15,26,64.

Carbon stocks vary according to the studied soil type and depth, and although these correlations are poorly studied, the stock is an important indicator of environmental services65. Assad et al.66 observed that soil properties gain prominence in the influence of Cs on local work scales. Saiz et al.67 pointed out that soil texture was a determinant factor for Cs variations.

The descriptive statistics and all the studied attributes show that basalt soils and slightly dissected compartments are more weathered than sandstone soils and highly dissected compartments. The moderately dissected compartment presented intermediate means for highly and slightly dissected compartments for all attributes, not allowing for a more specific differentiation. All intermediate landscape forms were included in this group. Therefore, the attributes are expected to present higher variation than the other compartments (Table 1).

The intensity of landscape shapes, indicated by the dissection level, follows a pattern imposed by the structural control of the parent material, indicating a pedogenesis (soil formation) and geomorphogenesis variation (landscape carving). The highly dissected compartment shows concave areas of high intensity, indicating higher dissections in the region and, consequently, higher geomorphogenesis, while the slightly dissected compartment encompasses more preserved regions, with more flattened surfaces and higher pedogenesis rates than geomorphogenesis.

Vasconcelos et al.68 studied the pedo-geomorphogenesis evolution model of Serra da Canastra and observed that the dynamics of the relief developed from the process of cutting the land (higher landscape dissection) causes environments of water saturation and stagnation and better conditions for drainage in flatter areas. However, environments of higher water stagnation favor the advance of hydromorphy, resulting in the formation of amorphous minerals and Kt predominance69, while flatter and well-drained areas favor the advance of oxidation, with the domain of more crystalline minerals and higher Gb content70.

The observations showed that highly dissected basalt compartments with higher clay content and oxidic mineralogy are more favorable for C storage. The opposite occurs for sandstone soils. These results can assist producers in making decisions regarding planting, reforestation, and practices aiming to increase Cs in the soil and reduce CO2 emissions. Some of these practices are cited by Minasny et al.8. The use of these practices has already been considered in C credit and monetization policies.

Spectral signature and most important bands for C estimation

Figure 3 shows the spectral signatures for the different geological and geomorphological compartments of WPSP. The curves for the sandstone soil presented higher reflectance. Sandstone soils have a higher content of light-colored minerals, such as orthoclase, quartz, and plagioclase, which increase the reflectance of samples16. On the other hand, the reflectance values for basalt samples are lower. Darker minerals in soils originating from basalt, such as iron oxides, reflect less, generating a lower spectral curve than that of sandstone soils71.

Figure 3

Spectral curves of stratified samples for sandstone and basalt in the area of the Western Plateau of São Paulo. Adapted from Fernandes et al.38. *Kt kaolinite, Gb gibbsite, Gt goethite, Hm hematite, OM organic matter, H2O water molecule, OH hydroxyl. The ranges most frequently cited in the literature for C content are 850–900 and 2320–2370 nm for organic C74,80, inorganic C, and OM20; 410, 570, and 660 nm for organic C81; and 476 and 808 nm30.

The range from 400 to 690 nm in the visible (VIS) is used to characterize the presence of the oxides Hm and Gt14,15,16,35,72. The curve in this band has a greater concavity for the basalt soil, pointing out a higher expression of oxides for soils with this parent material. The spectral behavior for basalt soils also points to a predominance of Hm relative to Gt (greater characteristic concavity). Genú et al.73 observed similar results for Oxisols of the Serra Geral Formation.

The characteristic range for Kt and Gb varies from 2000–2100 to 2300–2350 nm38,72. The sandstone soil presented only one characteristic valley (2100–2200 nm), referring to Kt. The predominant mineral in sandstone samples, rich in quartz and silicon, was Kt. Therefore, this mineral is more abundant in the clay fraction in WPSP soils, as most of the soils in the area are of sandstone origin. Silva et al.37 observed values of weathering indices (Ki and Kr) that classify the reference soils of the region as kaolinitic to oxidic kaolinitic.

Two features were observed for the basalt soil in the range from 2150 to 2300 nm, the first referring to Kt and the second to Gb. The valley depth of Kt was smaller regarding the curve of the sandstone soil, showing a less favorable environment for Kt formation in basalt soils. Demattê et al.16 studied spectral curves of Brazilian biomes and observed that the characteristic features of Gb are found in more weathered soils, such as those of the Cerrado biome. On the contrary, less weathered soils, such as those of the State of Rio Grande do Norte, presented no characteristic features of Gb. Basalt has minerals more easily weathered than the minerals present in the sandstone, thus showing higher Gb content.

Stenberg and Viscarra Rossel74 pointed out the presence of hydroxyls in the range of 1400 nm related to molecular water. A marked presence of water molecules (H–O–H) bound to minerals or impurities has been observed in the range of 1900 nm75,78,77. Dufréchou et al.78 observed that the band of 1900 nm is not characteristic of a specific mineral but due to the clay composition. These bands were indicated in the importance values as important in the prediction of the Kt/(Kt + Gb) ratio, thus being only the influences of other minerals or the water molecule.

Moreover, Bishop et al.79 pointed out that variations in absorbance ranges occur due to the stretching of Si, Fe, and Al oxides, cation size, and folding mode, among other crystallographic characteristics that can divide the occurrence of attributes into several spectral bands. Demattê et al.78 observed that the relationship between the spectral behavior of soils varies according to the different soil pedogenetic processes and landscape position. Therefore, the identification of spectral bands is restricted to the characteristics and conditions of soil formation.

The ranges most frequently cited in the literature for C content are 850–900 nm74,80 and 2320–2370 nm for organic C74,80, inorganic C, and OM20; 410, 570, and 660 nm for organic C81; and 476 and 808 nm30. Several bands in the Vis–NIR range have the potential to predict the total, organic, or inorganic C content. Bahia et al.18 observed that the best prediction parameters are obtained in the NIR range, as several bands in this range are related to bonds such as C–C, C=C, C–H, and C–N.

All spectral bands were inserted into the stepwise model to minimize errors. This procedure gradually and iteratively selected the most important bands to predict the total C. The errors measured by RMSE and MAE decreased and R2 increased as wavelengths were inserted into the model. The first band for the Vis range to be inserted into the model was 699 nm, followed by 352, 696, 697, 503, 501, 455, 442, and 478 nm. The best model presented the following parameters: RMSE of 4.74, MAE of 3.77, and R2 of 0.30.

The first band for the NIR range to enter the model was 2434 nm, followed by 2211, 2413, 1696, 1617, 2207, 1946, 1465, 1400, 1078, 1022, 1902, 1696, 1493, 1410, 1943, 1939, 1896, 1892, 1522, 1453, 1402, 1420, 1129, 1092, and 949 nm. The best model found for the NIR range presented an RMSE of 3.87, MAE of 3.07, and R2 of 0.52.
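
The band-by-band selection described above can be illustrated with a minimal forward-selection sketch. It assumes ordinary least squares as the underlying model and a minimum RMSE gain as the stopping rule; neither detail is stated in the text, and the function names (`fit_ols`, `fit_rmse`, `forward_stepwise`) and the toy reflectance values are hypothetical.

```python
import math

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def fit_rmse(X, y):
    """In-sample RMSE of an OLS fit of y on the columns of X."""
    beta = fit_ols(X, y)
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y))

def forward_stepwise(spectra, y, bands, max_bands=5, min_gain=1e-6):
    """Greedily add the band that lowers the in-sample RMSE the most."""
    selected, best = [], float("inf")
    while len(selected) < min(max_bands, len(bands)):
        trials = [
            (fit_rmse([[row[c] for c in selected + [j]] for row in spectra], y), j)
            for j in range(len(bands)) if j not in selected
        ]
        err, j = min(trials)
        if best - err < min_gain:  # assumed stopping rule: negligible improvement
            break
        selected.append(j)
        best = err
    return [bands[j] for j in selected], best

# Hypothetical reflectance at three bands (nm) for six samples
bands = [455, 699, 2211]
spectra = [
    [0.10, 0.20, 0.35], [0.12, 0.30, 0.30], [0.08, 0.25, 0.42],
    [0.15, 0.40, 0.33], [0.11, 0.35, 0.45], [0.09, 0.45, 0.38],
]
# Toy carbon values built to depend only on the 699 nm column
carbon = [20.0 - 30.0 * row[1] for row in spectra]
chosen, err = forward_stepwise(spectra, carbon, bands)
print(chosen)
```

Because in-sample error can only decrease as bands are added, a real application would normally judge each candidate on held-out data or with an information criterion rather than on the training RMSE used here.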

The best parameters obtained for NIR can be explained according to the observations of Bahia et al.18 as previously mentioned. More bands related to C can be found in the NIR range. Some of the bands classified as important, according to the stepwise model, were similar to those mentioned by other authors, such as 660 and 410 nm81 and 808 and 476 nm30. These results indicate that some variations in the most important ranges for estimating C can be observed when comparing studies of different soils and regions around the world.

Cluster plots with heatmaps were generated from the most important bands (Fig. 4) to group the spectral bands, C contents, and geological compartments and understand their cause-and-effect relationships with the wavelengths selected by stepwise. Figure 4 shows that the highest C contents (lighter colors) were grouped with the basalt compartment for both Vis (Fig. 4A) and NIR ranges (Fig. 4B). The sandstone compartment was grouped and associated with the lowest C contents.

Figure 4

Cluster analysis with the heatmap for the most important spectral ranges for C estimation, as indicated by the stepwise modeling, C contents, and geological and geomorphological compartments for soils in the Western Plateau of São Paulo in the Vis (A) and NIR ranges (B). The color legend is a table that associates the colors used in the heatmap with the values of the main variable. The heatmaps were generated in R, version 4.2.2 (https://www.R-project.org/), using the gplots package.

The most important spectral bands in the Vis and NIR ranges (Fig. 4) showed the highest C contents grouped with the lowest reflectances. C contents in the Vis range, which expresses the soil color14,15,24, are associated with organic C, present in organic matter (OM). Demattê et al.82 studied the influence of OM on soil reflectance and pointed out that its removal increases reflectance across the spectral curve. In other words, the lower the OM content, the lighter the soils and the higher their reflectance.

The opposite was observed for the sandstone compartment. The highest reflectances were grouped for sandstone soils with lower C contents. These facts can be observed in the spectral curves of Fig. 3. The lowest reflectances are observed for soils originating from basalt, which have higher C means (Table 1). Spectral curves with higher reflectance are observed for sandstone soils (Fig. 3), which, in turn, have lower C means (Table 1).
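
The inverse relationship between reflectance and C content described here can be quantified with a simple correlation coefficient. The sketch below computes Pearson's r in plain Python; the reflectance and C values are hypothetical and are not taken from Table 1 or Fig. 3.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples: mean reflectance factor and total C (g/kg)
reflectance = [0.12, 0.15, 0.22, 0.30, 0.34, 0.41]
carbon = [21.0, 18.5, 12.0, 8.0, 7.5, 4.0]

r = pearson_r(reflectance, carbon)
print(round(r, 2))  # strongly negative for these toy values
```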

Some samples of the basalt compartment were not grouped, being dispersed among the sandstone compartment samples. In theory, all samples from each compartment should form a single group. However, maps and scales for collecting the points were different. The points were chosen and collected based on previous experiences in smaller areas40,41 to represent all the variability of 13 million hectares. Moreover, agronomic data present high variability in space and, therefore, present different behaviors from those expected.

Total C estimation and mapping

Estimation by random forest (RF)

Table 2 shows that model performance was poorer in the testing stage than in the training stage. Heil et al.83 also observed a decline in metrics in the prediction stage for soil C and iron content. The C estimate for the Vis–NIR range showed the best metrics, with MAE of 2.50, RMSE of 3.40, and R2 of 0.74 for training, and MAE of 3.81, RMSE of 4.88, and R2 of 0.54 for testing. The metrics for the C estimate in the NIR range (MAE = 2.46, RMSE = 3.41, and R2 = 0.73) were similar to those observed for the Vis–NIR range in the training stage. However, the R2 in the NIR range was higher than that observed for the Vis–NIR range in the testing stage.
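
For reference, the three metrics reported for each stage can be computed as in the sketch below. It assumes that R2 is the coefficient of determination (1 − SSres/SStot); the text does not state which definition was used, and the function name and sample values are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return MAE, RMSE, and R2 (as 1 - SS_res / SS_tot) for paired values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical observed and predicted total C (g/kg)
observed = [5.0, 10.0, 15.0, 20.0]
predicted = [6.0, 9.0, 17.0, 18.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Computing the same function separately on the training and testing subsets gives the pair of values compared in the paragraph above.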

Table 2 Evaluation metrics of random forest models to estimate the total C (g kg−1) for the Vis, NIR, and Vis–NIR ranges at a depth of 0.00–0.20 m for soils in the Western Plateau of São Paulo.

Morellos et al.30 used the cubist ML algorithm and obtained an RMSE value of 2.18 and an R2 of 0.78 for estimating organic C in a Luvisol in Germany. Gelsleichter et al.19 studied the estimation of C by RF and obtained an R2 of 0.84 for soils in the Itatiaia National Park, Brazil. The metrics reported in the literature are better than those obtained in this study, but the high variability of the study area can lead to higher errors and more erratic predictions. Even so, the estimates remain efficient when cost–benefit ratios are compared28,84,85. Stenberg13 pointed out that samples from sandy soils in the testing and/or calibration stages can result in worse attribute estimates. Sand has a small total surface area compared to clay and OM, which can cause more erratic effects due to high absorption12.

Figure 5 shows the regression between observed and estimated data. For the three ranges (Vis, NIR, and Vis–NIR), C values up to 10 g kg−1, observed and estimated by RF, are very close to each other and to the line. Values of C > 10 g kg−1 showed more dispersion between the observed data and those estimated by RF. This may indicate that the algorithm has little sensitivity to learn and estimate higher C contents. Therefore, the sampling grid in areas with a higher C content must be densified to minimize errors. The wide differences between minimum and maximum values can make the prediction more erratic, as it is a large area with high geological and geomorphological variability.

Figure 5

Regression analysis between predicted C (g kg−1) data by the (A) Vis (visible), (B) NIR (near-infrared), and (C) Vis–NIR ranges by random forest and observed data for soils in the Western Plateau of São Paulo.

In general, ML shows better results than traditional methods, but its interpretability can be considered low86. To overcome this possible difficulty, the importance of the variables used in the estimation and the number of times they were mentioned in the rules must be evaluated87,88,89. This makes it possible to visualize how the models were generated and to interpret them (Fig. 5).

The C contents observed and estimated by RF for the Vis, NIR, and Vis–NIR ranges were subjected to geostatistical analysis for detailed mapping (Table 3). The C data observed and estimated by RF in the Vis range were best fitted by the exponential model. The C data estimated by RF in the NIR and Vis–NIR ranges were best fitted by a spherical model.

Table 3 Geostatistical metrics of the total C contents observed and estimated by the Vis (visible), NIR (near-infrared), and Vis–NIR ranges by random forest (RF) at a depth of 0.00–0.20 m for soils of the Western Plateau of São Paulo.

Bahia et al.18 worked with C mapping for a small area of the State of São Paulo and fitted the spherical model to the data. La Scala Jr. et al.26 and Bahia et al.90 studied the soil CO2 flux in areas of the State of São Paulo and also fitted the spherical model to the data, pointing out that the data agreed with those observed in the literature. The spherical model is considered the most common for fitting soil variables56,91, as these variables present abrupt variations along the landscape92.
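
The model names used in these fits correspond to standard semivariogram functions. The sketch below writes out the spherical, exponential, and Gaussian forms, assuming the practical-range parameterization (factor 3 in the exponent) for the last two; the parameterization used by the authors' software is not stated, and the function names and parameter values are hypothetical.

```python
import math

def spherical(h, nugget, partial_sill, a):
    """Spherical model: reaches the sill exactly at the range a."""
    if h <= 0:
        return 0.0
    if h >= a:
        return nugget + partial_sill
    s = h / a
    return nugget + partial_sill * (1.5 * s - 0.5 * s ** 3)

def exponential(h, nugget, partial_sill, a):
    """Exponential model: approaches the sill asymptotically (about 95% at h = a)."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / a))

def gaussian(h, nugget, partial_sill, a):
    """Gaussian model: parabolic near the origin, about 95% of the sill at h = a."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * (h / a) ** 2))

# Hypothetical parameters: nugget 1, partial sill 4, range 10 (arbitrary units)
for lag in (2.5, 5.0, 10.0, 15.0):
    print(lag,
          round(spherical(lag, 1.0, 4.0, 10.0), 3),
          round(exponential(lag, 1.0, 4.0, 10.0), 3),
          round(gaussian(lag, 1.0, 4.0, 10.0), 3))
```

The spherical form flattens completely beyond its range, whereas the other two only approach the sill, which is one reason the spherical model suits variables that change abruptly along the landscape.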

The data estimated by RF presented shorter ranges and lower spatial dependence. The results showed that the algorithm has a high sensitivity to capture the heterogeneity of the data. The highest C contents were found in the south, southwest, and east edges, where basalt compartments are located (Fig. 6). The lowest C contents were found in the other WPSP regions, especially in the central region, where sandstones are located. The estimates using DRS and RF could detect these wide variations of C contents in the sandstone.

Figure 6

Maps of observed C and of C estimated by random forest based on the Vis (A), NIR (B), and Vis–NIR (C) ranges of the spectral curves, and their respective error maps (D–F), for soils of the Western Plateau of São Paulo. Vis visible, NIR near-infrared. The semivariograms and maps were generated in R, version 4.2.2 (https://www.R-project.org/), using the geostat package.

The maps of data estimated by RF showed similar behavior to the observed C map (Fig. 6A–C). However, the algorithm tends to underestimate the C contents in both geological and geomorphological compartments. As indicated by the metrics, the estimation model that combined the Vis and NIR ranges (Vis–NIR) generated the C map most similar to the observed map. The association of ranges allowed a better capture of transitions in the compartments, especially in the sandstone compartments. The C content estimation by Vis and NIR in the sandstone compartment was generalized, although the resulting maps were similar to the observed map.
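
A tendency to under- or overestimate within each compartment can be checked with the mean signed error (predicted minus observed) per group, where negative values indicate underestimation. The sketch below is only an illustration: the function name, compartment labels, and C values are hypothetical.

```python
from collections import defaultdict

def mean_error_by_group(groups, observed, predicted):
    """Mean signed error (predicted - observed) for each group label."""
    sums = defaultdict(lambda: [0.0, 0])
    for g, o, p in zip(groups, observed, predicted):
        sums[g][0] += p - o
        sums[g][1] += 1
    return {g: total / count for g, (total, count) in sums.items()}

# Hypothetical samples: compartment, observed C, and estimated C (g/kg)
compartment = ["basalt", "basalt", "sandstone", "sandstone"]
observed = [20.0, 16.0, 8.0, 6.0]
predicted = [17.0, 15.0, 7.0, 6.0]

bias = mean_error_by_group(compartment, observed, predicted)
print(bias)
```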

The superiority of a mathematical model depends on the database, and the choice of modeling approach depends on the database size and professional expertise. The RF analysis was carried out using the database reduced by stepwise analysis for practical and data processing purposes. The selection of specific wavelengths allows the researcher to work with a smaller database and with sensors that capture fewer wavelengths. Models derived from smaller spectral libraries can provide more accurate predictions of C content, surpassing predictions obtained with a model derived from a large spectral library93.

The most robust ML algorithms can capture the non-linearity of the data, adapt to the database, and improve model accuracy94. ML techniques capture valuable spatial information that is not captured by environmental covariates, and the inclusion of this information improves the overall predictive performance95. These algorithms associated with DRS allow detailed mapping of C content in a precise and fast way. Indirect techniques, such as DRS, are efficient in capturing the spatial variability of C, being faster and less costly than conventional techniques18,85. The maps generated using these estimates can assist in defining public policies for land use, reducing CO2 emissions, and implementing C-credit policies.

Other potential technologies, such as UAV (unmanned aerial vehicles)-based NIR, also offer quality assessment and monitoring of soil C96 and plant biomass97. Aldana-Jague et al.96 investigated the potential of UAV multi-spectral imagery (480–1000 nm) for estimating the OC content in bare cultivated soils at a high spatial resolution (12 cm) and concluded that the methodology has a clear potential for use in precision agriculture or monitoring important soil properties following management changes. The authors obtained a coefficient of determination of 95% for the validation and an RMSE of 0.21% C based on cross-validation despite a sampling design that was not fully optimized for spectral calibration or spatial mapping. In this context, remote sensors attached to UAVs have been successfully calibrated using ground control points with known reflectance98,99. Cruciol et al.100 investigated calibration approaches of UAV-based near-infrared digital images and observed that the conversion of visible and near-infrared digital images into reflectance promoted a decrease in the coefficient of variation of the spectral data for all visible bands.
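
One common way to calibrate image digital numbers against ground targets of known reflectance is a linear fit, often called the empirical line approach. The sketch below assumes that approach for illustration only; the cited studies may have used a different procedure, and the function names, digital numbers, and panel reflectances are hypothetical.

```python
def fit_empirical_line(dn, reflectance):
    """Least-squares gain and offset mapping digital numbers to reflectance."""
    n = len(dn)
    mean_dn = sum(dn) / n
    mean_ref = sum(reflectance) / n
    sxy = sum((d - mean_dn) * (r - mean_ref) for d, r in zip(dn, reflectance))
    sxx = sum((d - mean_dn) ** 2 for d in dn)
    gain = sxy / sxx
    return gain, mean_ref - gain * mean_dn

def to_reflectance(dn, gain, offset):
    """Convert a digital number to reflectance with the fitted line."""
    return gain * dn + offset

# Hypothetical calibration panels: image digital number and known reflectance
panel_dn = [30.0, 110.0, 200.0]
panel_reflectance = [0.05, 0.25, 0.475]

gain, offset = fit_empirical_line(panel_dn, panel_reflectance)
pixel_reflectance = to_reflectance(150.0, gain, offset)
print(round(gain, 4), round(offset, 3), round(pixel_reflectance, 3))
```

In practice, a separate line would be fitted for each spectral band, since gain and offset generally differ between bands.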

Estimation by partial least squares regression (PLSR)

Table 4 shows the metrics obtained for the partial least squares regression (PLSR). The errors given by the MAE and RMSE values are slightly higher in the prediction stage. Bahia et al.18 estimated the total C by PLSR and obtained an R2 of 0.82 for the Vis range and 0.88 for the Vis–NIR range in a small study area within WPSP. Asgari et al.20 studied the estimate of organic C in Iran and obtained an R2 of 0.74 for the Vis–NIR range. Viscarra Rossel et al.81 obtained an R2 of 0.86 for the NIR range and 0.91 for the Vis–NIR range. Although the metrics reported in these studies are higher than those found here, the extent and high variability of WPSP should be considered.

Table 4 Metrics for evaluating models by partial least squares regression to estimate the total C (g kg−1) for the Vis, NIR, and Vis–NIR ranges at a depth of 0.00–0.20 m for soils of the Western Plateau of São Paulo.

The Vis range presented the best metrics for this modeling, with an R2 of 0.64, MAE of 3.49, and RMSE of 4.40 for calibration, and R2 of 0.58, MAE of 3.52, and RMSE of 6.65 for prediction (Table 4). The Vis range was followed by the Vis–NIR range, with an R2 of 0.52, MAE of 3.40, and RMSE of 4.59 for calibration, and R2 of 0.52, MAE of 3.48, and RMSE of 6.42 for prediction. The NIR range presented the worst metrics, mainly in the prediction stage, with an R2 of 0.32, MAE of 3.52, and RMSE of 5.56 for calibration, and R2 of 0.35, MAE of 3.59, and RMSE of 7.75 for prediction. Bahia et al.18 explained that the metrics obtained for the Vis range can be justified by the good relationship between C and iron oxides. The relationships between these oxides and C contents were observed in this study and also by La Scala Jr. et al.26 and Bahia et al.15,90.

Unlike RF and regardless of the higher or lower C contents, the observed and estimated data showed higher dispersion along the line in the regression analysis (Fig. 7). The entire spectral curve was used for the PLSR modeling, as the calibration and prediction models need a wide variety of data. Wavelengths can have wide variations that influence the predictive capacity of the chemometric analysis101. Also, Wold et al.101 explained that the PLSR analysis is based on data homogeneity, and very sudden variations of dilution in the curves or noise can lead to errors. In other words, working with only a few wavelengths or doing extreme preprocessing can favor errors in the estimation of attributes. Gelsleichter et al.19 studied the preprocessing of spectral curves for C estimation and pointed out that it can decrease the predictive potential of PLSR.
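
To illustrate how PLSR builds its prediction from the whole spectrum rather than from a few selected bands, the following is a minimal single-response PLS (PLS1, NIPALS-style) sketch in plain Python. It is not the implementation used in the study; the function names, the number of components, and the toy data are assumptions for illustration.

```python
import math

def pls1_fit(X, y, n_components):
    """Fit PLS1 by successive deflation; returns means and per-component terms."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered predictors
    f = [v - y_mean for v in y]                                # centered response
    comps = []
    for _ in range(n_components):
        # Weights: covariance of each predictor column with the response residual
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate predictors and response before extracting the next component
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict one sample by repeating the same deflation steps."""
    x_mean, y_mean, comps = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat

# Hypothetical two-band "spectra" and a response that is linear in both bands
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [2.0 * a - b + 3.0 for a, b in X]
model = pls1_fit(X, y, n_components=2)
preds = [pls1_predict(model, row) for row in X]
print([round(v, 6) for v in preds])
```

With as many components as independent predictors, PLS1 reproduces the ordinary least-squares fit; with real spectra, far fewer components than bands are kept, and their number is usually chosen by cross-validation.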

Figure 7

Regression analysis between the C (g kg−1) data predicted by the (A) Vis (visible), (B) NIR (near-infrared), and (C) Vis–NIR ranges by partial least squares regression and observed data for soils in the Western Plateau of São Paulo.

The comparison between PLSR and RF modeling showed that the latter algorithm performed better than the multivariate chemometric analysis. Morellos et al.30 used RF and PLSR and observed that ML presents a better performance for estimating C. The authors stated that the efficiency of this model is due to its high capacity to deal with the nonlinear pattern of the data. Viscarra Rossel et al.14, Keskin et al.102, and Gelsleichter et al.19 made the same considerations, corroborating the results observed in this study.

Regarding the wavelengths considered the most important for each range, those between 420 and 480 nm are considered important for C prediction. PLSR does not consider the wavelength of 352 nm or those above 500 nm as important for the estimation. In the NIR, RF considers wavelengths between 1000 and 2000 nm as important, while PLSR considers wavelengths above 2000 nm as more important. The ranges considered important for C prediction were wide compared to those found in the literature. Viscarra Rossel et al.81 cited wavelengths of 400–2198 nm in the Vis–NIR range and 1100–2498 nm in the NIR range. Demattê et al.103 cited several wavelengths, such as 825, 853, 1100, 1138, 1449, 1524, 1650, 1706, 1730, 1754, 1930, 1961, 2033, 2135, 2137, 2275, 2316, 2307, 2336, 2381, and 2469 nm. These authors corroborate the results of this study regarding the specific wavelengths and the fact that more wavelengths in the NIR range may be important for estimating C.

The geostatistical analysis for detailed mapping (Table 5) showed that the C data estimated by PLSR were fitted by three different models: exponential (Vis range), spherical (Vis–NIR range), and Gaussian (NIR range).

Table 5 Geostatistical metrics of the total C contents observed and estimated by the Vis (visible), NIR (near-infrared), and Vis–NIR ranges by partial least squares regression (PLSR) at a depth of 0.00–0.20 m for soils of the Western Plateau of São Paulo.

The closest ranges were obtained for the observed C data (5.95 m) and the data estimated by PLSR for the Vis (5.11 m), NIR (3.15 m), and Vis–NIR ranges (6.01 m), with the highest values of spatial dependence (DSD). The longest ranges for the estimated data indicate greater underestimation, which generates a more homogeneous database.

The maps of data estimated by PLSR showed similar behavior to the map of observed C (Fig. 8A–C). The data estimated by the association of ranges (Vis–NIR) and NIR could not capture the variations between sandstone and basalt compartments, tending to overestimate C contents in basalt regions, which have more C, and underestimate these contents in sandstone regions, with less C content. The southwest and south regions of WPSP have an extension of the basalt compartment with a higher C content (map of observed total C). However, these ranges in the PLSR analysis did not capture this variation.

Figure 8

Maps of observed C and of C estimated by partial least squares regression based on the Vis (A), NIR (B), and Vis–NIR (C) ranges of the spectral curves, and their respective error maps (D–F), for soils of the Western Plateau of São Paulo. Vis visible, NIR near-infrared. The semivariograms and maps were generated in R, version 4.2.2 (https://www.R-project.org/), using the geostat package.

Following the best PLSR metrics, the estimated data map for the Vis range showed the most similar patterns to the observed map (Fig. 8). The data in the basalt compartment followed the same patterns as the observed map. The data in the sandstone compartment were overestimated mainly in the central region of WPSP. Bahia et al.18 mapped the total C content in a small area within WPSP and observed that PLSR underestimates the maximum C values, while it is more assertive for regions with lower C contents.

PLSR is a technique successfully used to predict numerous soil attributes, using spectral curves in various regions of the world. Using PLSR, Fernandes et al.38 estimated soil mineral contents, Camargo et al.61 estimated available and adsorbed P content, and Bahia et al.18 estimated C and N contents. These studies, carried out in Brazilian soils, demonstrate the potential and diversity of the technique. Several other studies at an international level with the most varied soil attributes can also be mentioned20,28,94.

Conclusions

The highest C contents are associated with more clayey soils, oxidic mineralogy, higher total pore volume, and lower soil density in highly dissected basalt compartments.

Soils with a higher reflectance have lower C content.

The most important wavelengths for estimating carbon are 352, 696, 697, 503, 501, 455, 442, and 478 nm in the visible range and 2211, 2413, 1696, 1617, 2207, 1946, 1465, 1400, 1078, 1022, 1902, 1696, 1493, 1410, 1943, 1939, 1896, 1892, 1522, 1453, 1402, 1420, 1129, 1092, and 949 nm in the near-infrared range.

The random forest algorithm associated with the Vis–NIR spectral range is more efficient in estimating C content in tropical soils. The random forest algorithm associated with the Vis–NIR spectral range allowed the construction of a map of the estimated C content more similar to the observed map, pointing out that it can be used to define public policies for land use, reduce CO2 emissions, and implement C credit policies in an economic and fast way.

Implications of the study: the importance of using a spectral sensor for pedogenesis in soil C mapping

This study provides an alternative strategy to the usual methodologies for evaluating and mapping soil characteristics. Traditional methods for quantifying soil properties involve field sampling and laboratory analysis, followed by interpolation to transform point data into a surface. This makes the process expensive, time-consuming, and laborious, and it may not provide accurate information for large spatial areas9,10. Thus, spectral pedology using proximal sensors has been gaining prominence with the growing need for soil mapping with higher spatial resolution, presenting higher applicability in countries with large territorial extensions, such as Brazil16. It is a very promising technology that relates soil studies to the interpretation of electromagnetic spectra. This technology allows the discrimination of soil attributes14,104, and the information is added to digital mapping. In addition, new methodologies, such as diffuse reflectance spectroscopy (DRS), and robust mathematical techniques, such as machine learning and geostatistics, work together to make research of this magnitude possible16,105.

Soil mineralogy and texture are directly related to C stock due to the physical properties of the clay surface106,107. Inadequate and unplanned land use, related to a lack of knowledge of the dynamics of C entry and exit from the soil, increases CO2 emissions by 31%32. In this context, the topic addressed by this research is based on environmental sustainability and food security. According to England and Viscarra Rossel105, C stocks in the soil can be stabilized or increased through the identification and implementation of agronomic and environmental management. However, soil C shows high variability, and verifying changes in its stock due to human activities requires quantification, which demands a large sample volume93. Therefore, the study of C concentrations and their dynamics in the landscape through mapping could serve as a basis for countless future decisions, identifying, for example, where the soil has the highest potential to store C naturally, thus guiding actions for the mitigation of CO2 emissions from agricultural food production systems108.

The use of geology and geomorphology to compartmentalize the study area helps to determine the spatial variability of soil attributes. In this context, we understand that knowledge of these attributes in an area of 13 million hectares with economic relevance and heterogeneous in numerous characteristics can guide not only work in the State of São Paulo but also in neighboring states that have similar conditions, demonstrating the future potential of this research.

Therefore, we present here unprecedented results when considering methodology, territorial extension, and mapping. Thus, associated research in other Brazilian regions, using different methodologies and mathematical techniques or showing how these maps can be used through field experimentation, is extremely important for validating the results. All the generated data are now part of a regional database and can be used as a basis for other research, guiding a pioneering development of research and innovation and contributing to the economical and sustainable advancement of science in Brazil.