1 Introduction

Rare earths, also known as industrial vitamins, are vital strategic resources widely used in many fields, such as the military, petrochemicals, and textiles (Yang et al. 2013). Locations with ion-adsorption rare earth deposits, such as Ganzhou, Jiangxi Province, China, are commonly characterized by a warm and humid climate and low, undulating hilly landforms. Owing to disorderly mining activity and outdated technology in the earlier stages of mining, substantial abandoned tailings are present in rare earth mining areas, resulting in serious eco-environmental problems that need to be solved urgently. In particular, the soil properties in rare earth mining areas are seriously affected by the leaching process (Yang et al. 2013).

The major soil type of the rare earth mining areas in Ganzhou is red soil. Under hot and rainy conditions, the processes of desilicification and allitization during the formation of red soil result in a high content of iron and aluminum oxides in the soil. Clay minerals composed of halloysite and kaolin are also formed, providing a good environment for the accumulation of rare earth elements (Li and Zhou 2020). There are three typical types of rare earth mining areas, namely unexploited, in situ leaching, and heap leaching mining areas. The unexploited mining area is less affected by human activity and has high vegetation coverage. In the in situ leaching and heap leaching areas, rare earth elements are extracted by chemical methods, but the degree of environmental impact differs. In the in situ leaching mining area, overall damage to the mountain is minor, with the greater damage to soil and vegetation mainly occurring around the injection wells and collection ditches. The heap leaching mining area is mainly covered by rare earth tailings, and the soil shows severe desertification, with only a few Pinus massoniana plants. Collecting soil samples from these three types of mining areas allows a more comprehensive study of the soil in rare earth mining areas, which in turn supports efforts to address the eco-environmental problems.

As a sink for atmospheric carbon dioxide, soil plays an important role in achieving global carbon neutrality (Paustian et al. 2016). The impact of human activities on changes in soil properties is a long-term and complicated process (Gu et al. 2021). The extraction of rare earth elements using (NH4)2SO4 negatively affects soil properties, the most important of which are organic carbon (OC) content, total nitrogen (TN) content, pH value, and clay content. Soil OC and TN influence soil functions related to water and nutrient retention, while also providing nutrients for plant growth. When (NH4)2SO4 is used, NH4+ replaces Ca2+ adsorbed on soil colloids, which destroys the soil aggregate structure and causes a subsequent loss of soil OC and TN. Soil pH is related to the growth and development of animals and plants, and the low pH values found in rare earth mining areas affect soil health. NH4+ also replaces H+ on soil colloids, which increases the amount of H+ in the soil and leads to soil acidification and compaction (Guo et al. 2010). Soil clay plays an essential role in soil, affecting many soil properties and processes (Song et al. 2021); it also acts as a “glue” that holds soil particles together (Bronick and Lal 2005). Rare earth elements are adsorbed and enriched by clay minerals in ionic form. The use of (NH4)2SO4 results in the replacement of rare earth ions on clay minerals by H+ and NH4+ and in the destruction of the soil binding agent (clay). Therefore, these four soil properties in rare earth mining areas should be monitored in a timely manner to provide supporting data for soil erosion control and ecological restoration.

Traditionally, soil properties are measured using physical and chemical methods in the laboratory (Greenberg et al. 2020). Although accurate results can be obtained, traditional soil testing methods require substantial labor, materials, and financial resources. Furthermore, these methods have limitations in large-scale monitoring owing to spatial variability in soil properties. The emergence of visible–near-infrared spectroscopy provides a powerful tool for the rapid monitoring of soil properties (Chen et al. 1989), based on acquiring soil spectral data using a ground spectroradiometer. Studies have shown the potential of soil spectra to predict soil properties. For example, Kovačević et al. (2010) successfully used the Gaussian kernel with the support vector machine (SVM) to predict soil pH values. Zhang et al. (2019) predicted the soil TN content using feature bands and the SVM method. Ji et al. (2019) combined data from four soil spectral sensors to predict soil organic matter (SOM) and pH, and the concentrations of soil ions, including phosphorus, potassium, and calcium. Tsakiridis et al. (2020) found that the convolutional neural network method performed well in predicting the soil clay, silt, and sand contents, pH value, cation exchange capacity (CEC), and OC, CaCO3, and N contents in the LUCAS topsoil database. The mechanisms for predicting soil properties based on visible–near-infrared spectroscopy depend on different spectral interactions of the main soil chromophores (Vohland et al. 2011).

Extracting useful information from the original soil spectrum is difficult owing to spectral overlaps occurring in the visible and near-infrared range (Stenberg et al. 2010; Chen et al. 2020). Therefore, spectral transformation methods are used for spectral pre-processing to reduce the influence of environmental noise and enhance the useful information. The first-order derivative (FOD) can remove interference from linear or nearly linear background noise to improve analysis accuracy (Ben-Dor et al. 1997). Continuum removal (CR) generally magnifies the absorption and reflection characteristics in spectra and normalizes the spectra to a consistent spectral background, which helps to identify feature bands (Clark and Roush 1984; Tziolas et al. 2020). As an effective signal processing method, the continuous wavelet transform (CWT) is able to decompose the original spectrum into multi-scale wavelet coefficients through operations such as scaling and translation. This decomposition process can enhance certain information in the spectrum, including the location and nature of high-frequency features (narrow absorption features, spikes, and noise), or the size and shape of continuous features on a large scale (Vohland et al. 2016).

Soil spectral analysis commonly uses linear and non-linear calibration methods. As a linear multivariate regression method, partial least-squares regression (PLSR) is superior to other regression methods, such as stepwise multiple regression (SMLR) and principal component regression (PCR), in processing multi-dimensional collinearity data (Conforti et al. 2015; Shi et al. 2013). However, soil is formed under the effects of multiple factors, such as parent material, topography, and climate. Owing to the complex composition of soil, the relationship between soil spectra and soil properties might not be a simple linear relationship (Vohland et al. 2011). Non-linear methods might outperform linear methods in dealing with such issues. For example, the SVM is a useful tool for solving non-linear problems with multi-dimensional data and small sample sizes (Nawar et al. 2016). Based on soil spectroscopy, the SVM has been successfully used to predict soil properties. Nawar and Mouazen (2017) found that the SVM and multivariate adaptive regression splines (MARS) outperformed PLSR and achieved similar accuracy for predicting soil TN, total carbon (TC), and water content at different geographical scales. Furthermore, extreme gradient boosting (XGBoost) based on the gradient descent algorithm can solve classification, regression, and sorting issues (Chen and Guestrin 2016), but is rarely used to predict soil properties based on spectroscopy. Wei et al. (2019) found that XGBoost performed well in estimating the arsenic (As) content in soil, suggesting that this method has potential to predict other soil properties.

Recently, owing to the emergence of data mining and deep-learning methods, a growing number of studies have focused on the prediction of soil properties using large spectral libraries (Tsakiridis et al. 2020; Zhong et al. 2021). Although a model with high overall accuracy can be obtained, its applicability might not be high when the prediction is downscaled to a single soil type in a specific area, owing to the large coverage scale of the soil samples. Therefore, the aims of this study were (1) to verify the feasibility of using visible–near-infrared spectroscopy to predict the OC content, TN content, pH value, and clay content of soil in rare earth mining areas; and (2) to select the optimal spectral transformation method (FOD, CR, and CWT) coupled with different calibration methods (PLSR, SVM, and XGBoost) for predicting soil properties based on visible–near-infrared spectroscopy.

2 Materials and methods

2.1 Study area and sampling

The study area, located in Longnan, Dingnan, and Xinfeng Counties, Jiangxi Province, China, is rich in ionic rare earth ores. This area has a mid-subtropical humid climate, with a mean annual rainfall of 1500–1600 mm and a mean annual temperature of 19.0 °C. The elevation is in the range of 200–400 m, and the major landform is low hills, with forestland as the dominant land-use type. The soil type in the study area is mainly red soil (Alumi-Ferric Alisols) and the clay minerals mainly consist of kaolinite.

Nine typical mining sites representing unexploited, in-situ leaching, and heap leaching mining areas (three sites each) were selected in the study area (Fig. 1). Considering the danger of sampling in the mountains, soil samples of the unexploited mining area were mainly collected on both sides of the mountain road (> 5 m distance from the road). In the in situ leaching mining area, soil sampling was mainly conducted near the injection wells. Soil sampling points in the heap leaching mining areas were evenly distributed.

Fig. 1 Locations of the study area and soil sampling points in Jiangxi Province, China

A total of 232 topsoil samples (depth, 0–20 cm) were collected using a five-point sampling method in June and July 2020. The center position of each sample was recorded using a handheld global positioning system (Fig. 1). After removing plant roots and stones by hand, all soil samples were kept in resealable bags and labeled. The samples were then air-dried, ground, and passed through a 10-mesh sieve (2 mm). Each sample was equally divided into three parts for the determination of soil properties, measurement of soil spectra, and further use.

2.2 Chemical analysis

The OC and TN contents of soil samples passed through a 100-mesh sieve (0.149 mm) were determined using a Vario MACRO cube elemental analyzer (Elementar, Hanau, Germany) based on the combustion–oxidation method (Wang et al. 2020). As the measured soil pH values were strongly acidic, the samples contained nearly no inorganic carbon, and the OC content was considered to be equal to the TC content. The pH values of soil samples (2 mm) were measured using a potentiometric method with a water–soil ratio of 2.5:1 (v/w) (Kovačević et al. 2010). The clay content of the soil samples (2 mm) was measured using the pipette method (Kilmer and Alexander 1949).

2.3 Spectral measurement and pre-processing

An ASD FieldSpec 4 spectroradiometer (Analytical Spectral Devices Inc., Boulder, CO, USA) was used to obtain the spectral reflectance of soil samples in the range of 350–2500 nm. The spectral sampling resolutions of the instrument were 3 nm (at 700 nm) and 10 nm (at 1400 and 2100 nm). The spectra were resampled to 1-nm intervals, so that 2151 bands (350–2500 nm inclusive) were exported for each spectrum. Spectral measurements were conducted in a dark room to reduce interference from external light sources. Soil samples were placed in a black sample container kept in the slot at the top of the MugLite instrument (Analytical Spectral Devices Inc.) and measured with the built-in light source. A white Spectralon panel (Analytical Spectral Devices Inc.) was used to calibrate the instrument every 10 min. Each soil sample was scanned five times and the mean of the spectra was used as the final spectrum for each sample.

The splice correction function in ViewSpecPro v6.0 (Analytical Spectral Devices Inc.) was used to eliminate the effects of breakpoints generated by the instrument when measuring soil spectra. The sections at 350–399 nm and 2451–2500 nm, which were considerably affected by the instrument and environmental noise during the measurement, were removed. The Savitzky–Golay filter was then used to smooth the spectrum and remove noise caused by the instrument and environment, while maintaining the original spectral characteristics (Savitzky and Golay 1964).
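The smoothing step can be illustrated with a minimal sketch. The window length (5 points) and polynomial order (quadratic) are assumptions made for illustration, since the text does not report the filter settings, and the function name `savgol5` is hypothetical.

```python
# Savitzky-Golay smoothing with a 5-point window and a quadratic fit.
# The weights (-3, 12, 17, 12, -3) / 35 are the standard convolution
# coefficients for this window; the window size itself is an assumption.
SG5_WEIGHTS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG5_NORM = 35.0


def savgol5(values):
    """Return a smoothed copy of `values`; the two points at each end are left unchanged."""
    values = list(values)
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(w * v for w, v in zip(SG5_WEIGHTS, window)) / SG5_NORM
    return smoothed
```

Because the filter fits a local polynomial rather than taking a plain moving average, smooth features such as broad absorption bands pass through largely unchanged while isolated spikes are damped.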

2.4 Spectral transformation methods

The first-order derivative (FOD), continuum removal (CR), and continuous wavelet transform (CWT) were selected to compare with the original reflectance.
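A minimal sketch of the FOD and CR transformations is given here for orientation. It assumes a finite-difference derivative and a continuum defined by the upper convex hull of the spectrum; the function names are hypothetical, and the software used in the study may differ in detail (for example, in how end points are handled).

```python
def first_derivative(wavelengths, reflectance):
    """Finite-difference derivative: central inside, one-sided at the ends.

    Requires at least two bands with strictly increasing wavelengths.
    """
    n = len(reflectance)
    deriv = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        deriv.append((reflectance[hi] - reflectance[lo]) / (wavelengths[hi] - wavelengths[lo]))
    return deriv


def upper_hull(points):
    """Upper convex hull of (wavelength, reflectance) points sorted by wavelength."""
    hull = []
    for x, y in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # A non-negative cross product means the middle point lies on or
            # below the segment joining its neighbours, so it is not on the hull.
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull


def continuum_removal(wavelengths, reflectance):
    """Divide each reflectance value by the continuum (hull) value at that band."""
    hull = upper_hull(list(zip(wavelengths, reflectance)))
    removed = []
    seg = 0
    for x, y in zip(wavelengths, reflectance):
        # Advance to the hull segment that contains this wavelength.
        while hull[seg + 1][0] < x:
            seg += 1
        (x1, y1), (x2, y2) = hull[seg], hull[seg + 1]
        continuum = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
        removed.append(y / continuum)
    return removed
```

In the continuum-removed spectrum, bands lying on the hull take the value 1 and absorption features appear as values below 1, which is why this transformation emphasizes absorption valleys.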

The Mexican hat wavelet (Torrence and Compo 1998) was selected as the mother wavelet function for the CWT, which transformed each spectrum into a set of wavelet coefficients at different scales (Mallat 1989). The decomposition scales were set to \(2^{1}, 2^{2}, 2^{3}, \ldots, 2^{10}\) to prevent data redundancy (Cheng et al. 2011).
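The following sketch shows one way to compute a Mexican hat CWT at ten dyadic scales (2 to the powers 1 through 10). It is an illustration under stated assumptions rather than the study's Matlab routine: the \(1/\sqrt{a}\) normalization, the truncation of the wavelet at eight scale units, and the treatment of values beyond the spectrum edges as zero are all choices made here, and the function names are hypothetical.

```python
import math

# Dyadic decomposition scales: 2, 4, ..., 1024.
DYADIC_SCALES = [2 ** k for k in range(1, 11)]


def mexican_hat(t):
    """Mexican hat (second derivative of a Gaussian) mother wavelet."""
    c = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return c * (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_mexican_hat(signal, scales):
    """Return one row of wavelet coefficients per scale (rows have len(signal) entries)."""
    n = len(signal)
    coeffs = []
    for a in scales:
        half = int(math.ceil(8 * a))  # the wavelet is negligible beyond 8 scale units
        kernel = [mexican_hat(k / a) / math.sqrt(a) for k in range(-half, half + 1)]
        row = []
        for b in range(n):
            lo = max(0, b - half)
            hi = min(n - 1, b + half)
            total = 0.0
            for i in range(lo, hi + 1):
                total += signal[i] * kernel[i - b + half]
            row.append(total)
        coeffs.append(row)
    return coeffs
```

Small scales respond to narrow absorption features, whereas large scales respond to the broad shape of the spectrum. This pure-Python loop is slow at the largest scales on full-length spectra and is meant only to make the computation explicit.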

2.5 Calibration methods

2.5.1 Partial least-squares regression

PLSR is a linear multivariate regression method that projects the independent (X) and dependent (Y) variables into a new space and identifies the relationship between them to construct a prediction model (Viscarra Rossel and Behrens 2010; Wold et al. 2001). PLSR is able to extract the main information from multiple independent variables by reducing the dimensionality and effect of multicollinearity in the independent variables. The correlation between independent and dependent variables is also considered, with the dependent variables predicted through several latent variables extracted from multiple independent variables. This method is suitable for situations where the number of samples is less than the number of independent variables (Kuang et al. 2015). In this study, the number of latent variables in the PLSR model was determined by ten-fold cross-validation, and model construction was implemented in The Unscrambler X v10.4 (CAMO, Oslo, Norway).

2.5.2 Support vector machine

The SVM is a machine-learning method that can model non-linear relationships: it projects the input data into a feature space and finds an optimal hyperplane that minimizes the distance from all samples to that hyperplane (Wang et al. 2019). To reduce computational complexity and avoid the curse of dimensionality, a kernel function is introduced, which allows high-dimensional problems to be solved through calculations in the low-dimensional input space (Smola and Schölkopf 2004). In this study, the radial basis function (RBF) kernel was used to construct the SVM model. Two main parameters, C (cost parameter) and γ (kernel parameter), must be optimized in the construction of the SVM model (Hong et al. 2018; Dong et al. 2021). Therefore, a grid search with ten-fold cross-validation was used to optimize the C and γ values. The SVM model was constructed using Matlab R2017b (MathWorks Inc.).
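The tuning procedure (a grid search scored by ten-fold cross-validation) can be sketched independently of any particular SVM implementation. In the sketch, `fit_predict` is a hypothetical callable that trains a model on the training fold and returns predictions for the held-out fold; the candidate values in `SVM_GRID` are assumptions, since the search ranges are not reported, and the round-robin fold assignment is likewise only one possible choice.

```python
import itertools
import math


def kfold_indices(n_samples, k=10):
    """Split sample indices 0..n_samples-1 into k disjoint folds (round-robin)."""
    return [list(range(f, n_samples, k)) for f in range(k)]


def cv_rmse(X, y, fit_predict, params, k=10):
    """Mean held-out RMSE of `fit_predict` over k folds for one parameter set."""
    scores = []
    for fold in kfold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in fold], **params)
        mse = sum((p - y[i]) ** 2 for p, i in zip(preds, fold)) / len(fold)
        scores.append(math.sqrt(mse))
    return sum(scores) / len(scores)


def grid_search(X, y, fit_predict, grid, k=10):
    """Try every parameter combination in `grid`; return (best_params, best_score)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_rmse(X, y, fit_predict, params, k)
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Hypothetical candidate values; the study does not report its search ranges.
SVM_GRID = {"C": [2.0 ** e for e in range(-2, 9, 2)],
            "gamma": [2.0 ** e for e in range(-10, 1, 2)]}
```

Scoring each (C, γ) pair only on held-out folds is what keeps the selected parameters from simply reflecting the fit to the calibration samples.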

2.5.3 Extreme gradient boosting

XGBoost (Chen and Guestrin 2016) is a scalable end-to-end tree boosting system inspired by the gradient boosting algorithm (Friedman 2001). XGBoost uses not only the first derivative of the loss function but also the second derivative, through a second-order Taylor expansion of the loss function. Consequently, the model converges quickly and its operating efficiency is improved (Friedman et al. 2000). In this study, the root mean square error (RMSE) was used as the loss function in the objective function. A regularization term was added to the objective function to improve the generalization ability of the prediction model and prevent over-fitting. For large datasets, a multi-threaded parallel method was used to improve the computational efficiency (Wei et al. 2019).

For a given dataset of n samples and m independent variables, the objective function of XGBoost can be defined as follows:

$$\mathrm{Obj}(\theta )=\sum_{i=1}^{n}l({y}_{i},{\widehat{y}}_{i})+\sum_{t=1}^{T}\Omega ({f}_{t})$$
(1)

where \(l\) is the loss function; \(y_{i}\) and \(\hat{y}_{i}\) are the measured and predicted values of the i-th sample, respectively; \(f_{t}\) is the t-th tree; and \(\sum_{t=1}^{T}\Omega ({f}_{t})\) is the total complexity of the T trees, which is used as a regularization term in the objective function.

Expanding the objective function according to Taylor’s formula, the second-order Taylor expression of the loss function after t iterations can be obtained, which can be approximately expressed as follows:

$${L}^{(t)}=\sum_{i=1}^{n}\left[l({y}_{i},{\widehat{y}}^{(t-1)})+{g}_{i}{f}_{t}({x}_{i})+\frac{1}{2}{h}_{i}{f}_{t}^{2}({x}_{i})\right]+\Omega ({f}_{t})$$
(2)

where \({g}_{i}={\partial }_{{\widehat{y}}^{(t-1)}}l({y}_{i},{\widehat{y}}^{(t-1)})\) and \({h}_{i}={\partial }_{{\widehat{y}}^{(t-1)}}^{2}l({y}_{i},{\widehat{y}}^{(t-1)})\) are the first-order and second-order partial derivatives of the loss function, respectively.
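As a worked illustration of these derivatives, the sketch below assumes a squared-error loss \(l(y,\hat{y})=(y-\hat{y})^{2}\), which is a choice made here for illustration. For this loss, \(g_{i}=2(\hat{y}^{(t-1)}-y_{i})\) and \(h_{i}=2\), and the second-order expansion reproduces the loss exactly because the loss is quadratic. The function names are hypothetical.

```python
def squared_error(y, y_hat):
    """Squared-error loss for a single sample."""
    return (y - y_hat) ** 2


def grad_hess(y, y_hat_prev):
    """First and second derivatives of the squared error w.r.t. the previous prediction."""
    g = 2.0 * (y_hat_prev - y)
    h = 2.0
    return g, h


def taylor_loss(y, y_hat_prev, f_t):
    """Second-order expansion of l(y, y_hat_prev + f_t) around y_hat_prev."""
    g, h = grad_hess(y, y_hat_prev)
    return squared_error(y, y_hat_prev) + g * f_t + 0.5 * h * f_t ** 2
```

For example, with a measured value of 1.2, a previous prediction of 0.9, and a new tree output of 0.25, both the expansion and the directly evaluated loss at the updated prediction 1.15 equal 0.0025.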

In the XGBoost model, a grid search was used to optimize the hyperparameters, which were set as follows: the maximum tree depth was 7; the learning rate, which controls the step size of each iteration, was 0.1; and the number of trees (n_estimators) was 80. The XGBoost model was constructed using the xgboost package in Python v3.7 (https://www.python.org/).
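For reference, the reported settings map onto the parameter names of the scikit-learn style regressor in the xgboost package roughly as sketched here. The helper `build_xgb_regressor` is hypothetical, and the squared-error objective is an assumption chosen because the text reports RMSE as the loss; the study's actual script is not given.

```python
# Hyperparameters reported in the text, keyed by the parameter names
# used by xgboost's scikit-learn style regressor.
XGB_PARAMS = {
    "max_depth": 7,        # maximum depth of each tree
    "learning_rate": 0.1,  # shrinkage applied to each boosting step
    "n_estimators": 80,    # number of trees
}


def build_xgb_regressor(extra_params=None):
    """Create an untrained regressor; requires the third-party xgboost package."""
    from xgboost import XGBRegressor  # imported lazily so the dict is usable without it

    params = dict(XGB_PARAMS)
    # Assumption: squared-error objective, chosen because the text uses RMSE.
    params.setdefault("objective", "reg:squarederror")
    params.update(extra_params or {})
    return XGBRegressor(**params)
```

A model built this way would then be fitted on the calibration spectra and applied to the validation spectra with the usual `fit` and `predict` methods.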

2.6 Model accuracy evaluation

The accuracy of the constructed prediction models was evaluated using the coefficient of determination (R2), RMSE, and ratio of performance to inter-quartile distance (RPIQ). Larger R2 or RPIQ values and a smaller RMSE value indicated better prediction accuracy. The prediction performance of models was divided into five categories according to the RPIQ values, as follows: excellent (RPIQ ≥ 4.05); good (3.37 ≤ RPIQ < 4.05); approximately quantitative (2.70 ≤ RPIQ < 3.37); distinguishing between high and low values (2.02 ≤ RPIQ < 2.70); and unsuccessful (RPIQ < 2.02) (Saeys et al. 2005; Ludwig et al. 2017).
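The three metrics and the RPIQ categories can be written compactly as in the sketch below. It assumes that R2 is computed as one minus the ratio of the residual to the total sum of squares, and that quartiles follow the inclusive interpolation of Python's `statistics.quantiles`; the text does not specify either convention, and the function names are hypothetical.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    """1 - SSE/SST; some studies report the squared Pearson correlation instead."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def rpiq(observed, predicted):
    """Interquartile range (Q3 - Q1) of the observed values divided by the RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


def rpiq_category(value):
    """Map an RPIQ value to the performance classes listed in the text."""
    if value >= 4.05:
        return "excellent"
    if value >= 3.37:
        return "good"
    if value >= 2.70:
        return "approximately quantitative"
    if value >= 2.02:
        return "distinguishing between high and low values"
    return "unsuccessful"
```

Because RPIQ scales the error by the interquartile range rather than the standard deviation, it is less sensitive to skewed distributions of the measured property.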

2.7 Statistical analysis

Statistical analysis was conducted using IBM SPSS Statistics v22.0 (IBM Corp., Armonk, NY, USA) and Microsoft Excel v2019 (Microsoft Corp., Redmond, WA, USA). Graphical drawing was performed using ArcGIS v10.5 (ESRI Inc., Redlands, CA, USA) and OriginPro v2021 (OriginLab Corp., Northampton, MA, USA).

The FOD, CR, and CWT spectral transformations were performed using OriginPro v2021 (OriginLab Corp., Northampton, MA, USA), ENVI Classic v5.5 (Harris Geospatial Inc., Broomfield, CO, USA), and Matlab R2017b (MathWorks Inc., Natick, MA, USA), respectively.

3 Results

3.1 Descriptive statistics of soil properties

The 232 soil samples were divided into two parts by the Kennard–Stone method (Kennard and Stone 1969), namely, the calibration dataset (N = 174) and the validation dataset (N = 58). Descriptive statistics of the measured soil OC content, TN content, pH value, and clay content for the whole, calibration, and validation datasets are summarized in Table 1. The OC content, TN content, pH value, and clay content of the whole dataset ranged from 0.04 to 2.08%, 0.01 to 0.18%, 3.90 to 5.84, and 3.82 to 47.30%, respectively. The CV (coefficient of variation) values of OC content, TN content, and clay content were higher than 35%, whereas the CV of pH value was below 15%, which meant that the pH value of the study area had little variability according to Wilding (1985).
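The Kennard–Stone selection can be sketched as follows. The sketch assumes Euclidean distances between sample vectors (the text does not state which variables or distance measure were used), and the function name is hypothetical. The algorithm starts from the two most distant samples and then repeatedly adds the sample that is farthest from its nearest already-selected sample.

```python
import math


def kennard_stone(X, n_select):
    """Return (calibration_indices, validation_indices) for sample vectors X."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Start with the two samples that are farthest apart.
    best = (-1.0, 0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            if d > best[0]:
                best = (d, i, j)
    selected = [best[1], best[2]]
    remaining = set(range(n)) - set(selected)
    # Distance from each remaining sample to its nearest selected sample.
    min_dist = {k: min(math.dist(X[k], X[s]) for s in selected) for k in remaining}
    while len(selected) < n_select:
        # Pick the remaining sample farthest from the selected set (ties: lowest index).
        nxt = max(remaining, key=lambda k: (min_dist[k], -k))
        selected.append(nxt)
        remaining.remove(nxt)
        del min_dist[nxt]
        for k in remaining:
            d = math.dist(X[k], X[nxt])
            if d < min_dist[k]:
                min_dist[k] = d
    return selected, sorted(remaining)
```

Called with `n_select=174` on 232 samples, this yields the 174/58 split sizes used here. Because samples at the extremes of the data space are selected first, the calibration set tends to span the full range of the data.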

Table 1 Descriptive statistics of soil properties in the study area

The mean, SD (standard deviation), and CV of the whole, calibration, and validation datasets were similar. Levene’s test (Levene 1960) was conducted to check the reliability of the method used to split the datasets. The p-values of the four soil properties from Levene’s test were 0.011, 0.298, 0.666, and 0.759, respectively, all above the significance level (α = 0.01), indicating that the calibration and validation datasets had equal variances and could represent the whole dataset.

3.2 Soil spectra and transformations

Figure 2a shows the original (OR) spectrum after averaging the obtained 232 soil spectra. The soil spectrum increased rapidly in the visible range, showing a steep slope. Slight changes were observed from 800 to 1300 nm and from 1500 to 1800 nm, while the spectrum exhibited a slow downward trend in the range of 2200–2450 nm.

Fig. 2 Soil spectral curves of (a) original reflectance (OR) and (b–d) transformed reflectance

Figures 2b–d show the soil spectra after three different transformations. When transformed using the FOD, the spectrum mainly fluctuated near 0, and bands, such as those at 500, 1400, 1900, and 2200 nm, became more distinct. After CR transformation, absorption valleys were observed in the spectrum at approximately 500, 900, 1400, 1900, and 2200 nm. The range of wavelet coefficients obtained by the CWT increased with increasing scale. The wavelet coefficients at scales 1–6 had a small variation range, fluctuating between − 1 and 1. The wavelet coefficients at scales 7–10 exhibited a remarkable expansion of the value range. At scale 8, two wavelet coefficient peaks appeared near 800 and 2200 nm, respectively, while at scales 9 and 10, the curves were smooth with a convex center at approximately 1400 nm.

3.3 Correlation between soil properties and spectra

Pearson correlation analysis was used to measure the relationship between soil properties (OC, TN, pH, and clay) and soil spectra (OR, FOD, CR, and CWT). Bands in the range of 600–800 and 2000–2400 nm were highly correlated with the OC content (Fig. 3). The highest correlation with the OC content was observed at 793 nm for FOD spectra, and at 2110, 1290, and 1945 nm for OR, CR, and CWT1 spectra, respectively. The OR spectra were negatively correlated with the OC content over the full wavelength range, with almost all bands passing the significance test. The correlation coefficients between FOD, CR, or CWT spectra and the OC content fluctuated between positive and negative values (Fig. 3a). The highest correlation between soil spectra and the TN content was found at 793 nm in the FOD spectra, with the correlation coefficient being higher than the corresponding coefficient between soil spectra and the OC content. However, the number of bands that passed the significance test was slightly smaller for the TN content than for the OC content (Fig. 3b).
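The band-wise correlation analysis can be sketched as follows, assuming the spectra are stored as one list of band values per sample. The function names are hypothetical, and the significance test applied in the study is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def band_correlations(spectra, soil_property):
    """Correlate each band (column of `spectra`) with the measured soil property."""
    n_bands = len(spectra[0])
    return [pearson_r([sample[b] for sample in spectra], soil_property)
            for b in range(n_bands)]


def strongest_band(wavelengths, correlations):
    """Return (wavelength, r) for the band with the largest absolute correlation."""
    idx = max(range(len(correlations)), key=lambda i: abs(correlations[i]))
    return wavelengths[idx], correlations[idx]
```

Applying `band_correlations` separately to the OR, FOD, CR, and each CWT scale gives one correlation curve per transformation, from which the most strongly correlated wavelength can be read off with `strongest_band`.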

Fig. 3 Correlations between soil properties (a OC, b TN, c pH, and d clay) and spectra. Pearson correlation coefficients that pass the p = 0.01 significance test are shown

The soil pH value showed a relatively low correlation with the OR, FOD, and CR spectra. The CWT spectra showed a high correlation with the pH value, mainly at approximately 400, 1000, and 2400 nm at low decomposing scales, with the highest correlation observed at 1011 nm in the CWT2 spectra (Fig. 3c). The clay content of soil showed a lower correlation with the OR spectra compared with the OC content. Bands in transformed spectra in the range of 400–500, 1600–1700, and 2000–2450 nm showed a relatively high correlation with the clay content, with the highest correlation observed at 1672 nm in the FOD spectra. For the CWT spectra, bands from 400 to 500 nm at scales 1–4 appeared as narrow features, while bands from 2000 to 2450 nm at scales 7–9 appeared as broad features (Fig. 3d).

3.4 Prediction of soil properties

For the calibration dataset, the prediction accuracy of the XGBoost method for soil properties was higher than that of the PLSR and SVM methods. The best results for the OC content, TN content, pH value, and clay content in the calibration dataset were obtained using XGBoost based on CWT spectra (OC: R2 = 0.99, RMSE = 0.05; TN: R2 = 0.99, RMSE = 0.01; pH: R2 = 0.97, RMSE = 0.07; clay: R2 = 0.99, RMSE = 1.54; Table 2).

Table 2 Comparison of the prediction accuracy for soil properties using different spectral transformation and calibration methods based on the calibration dataset (N = 174)

To evaluate the influence of different spectral transformation and calibration methods on the prediction of soil properties, the RPIQ value was additionally calculated for the validation dataset. For the OC content, nearly all models constructed based on FOD and CWT spectra outperformed the models based on OR spectra (Table 3 and Fig. 4). In particular, the best results with CWT spectra were obtained using SVM (R2 = 0.88, RMSE = 0.26, RPIQ = 4.37) and XGBoost (R2 = 0.89, RMSE = 0.24, RPIQ = 4.67). Models based on CR spectra yielded the worst result when coupled with SVM (R2 = 0.62, RMSE = 0.45, RPIQ = 2.44). Compared with the OC content, the prediction accuracy for the TN content was lower, with the best result obtained using CWT spectra with XGBoost (R2 = 0.86, RMSE = 0.01, RPIQ = 4.14). For each calibration method, the results obtained with OR, FOD, and CWT spectra had higher accuracy than those obtained with CR spectra (Table 3 and Fig. 5). The models based on CWT spectra performed well with the SVM and XGBoost methods. For OR and FOD spectra, the accuracies of the models obtained using the SVM and XGBoost methods were similar, and lower than those based on CWT spectra. The best prediction result for soil pH value was obtained using CWT spectra with XGBoost (R2 = 0.73, RMSE = 0.19, RPIQ = 1.95). However, every prediction model had an RPIQ value below 2.02, indicating that none of the models could successfully predict soil pH value (Table 3 and Fig. 6). For clay content, the best result was obtained with CWT spectra and the SVM method (R2 = 0.67, RMSE = 6.45, RPIQ = 3.12). The models constructed with original or transformed spectra and different calibration methods were mostly able to approximately quantify the clay content (Table 3 and Fig. 7).

Table 3 Comparison of the prediction accuracy for soil properties using different spectral transformation and calibration methods based on the validation dataset (N = 58)

Fig. 4 Scatter plot of OC content models based on a OR, b FOD, c CR, and d CWT spectra with partial least-squares regression (PLSR, left), support vector machine (SVM, middle), and extreme gradient boosting (XGBoost, right) methods using the validation dataset. The soil samples from unexploited, in situ leaching, and heap leaching mining areas are in red, green, and blue colors, respectively (the same below)

Fig. 5 Scatter plot of TN content models based on a OR, b FOD, c CR, and d CWT spectra with PLSR (left), SVM (middle), and XGBoost (right) methods using the validation dataset

Fig. 6 Scatter plot of pH value models based on a OR, b FOD, c CR, and d CWT spectra with PLSR (left), SVM (middle), and XGBoost (right) methods using the validation dataset

Fig. 7 Scatter plot of clay content models based on a OR, b FOD, c CR, and d CWT spectra with PLSR (left), SVM (middle), and XGBoost (right) methods using the validation dataset

4 Discussion

4.1 Features of soil spectra and transformations

For the original soil spectra (OR), five absorption features were observed at approximately 400–600, 900, 1400, 1900, and 2200 nm. The absorption feature at approximately 400–600 nm was associated with humus and iron (Palacios-Orueta and Ustin 1998; Stoner and Baumgardner 1981). The broad absorption band at approximately 900 nm was primarily attributed to ferric ion (Stoner and Baumgardner 1981). The absorption valleys at 1400, 1900, and 2200 nm were associated with O–H groups (Viscarra Rossel and Behrens 2010; Whiting et al. 2004). Specifically, the valley at 1400 nm was attributed to the O–H stretching of water and clay minerals. The absorption at 1900 nm was dominated by hygroscopic water and lattice water retained in the air-dried soil samples. The absorption at 2200 nm was attributed to Al–OH bending and stretching in clay minerals (Bishop et al. 1994).

Many studies have explored the relationship between soil properties and spectra. Dalal and Henry (1986) reported that soil OC and TN contents have the same feature bands at approximately 1870 and 2052 nm. Shi et al. (2013) found that bands at 1450, 1850, 2250, 2330, and 2430 nm were essential for predicting the soil TN content. Jiang et al. (2017) observed 400–800, 1900, and 2000–2350 nm as essential band ranges for predicting soil OC and TN contents. In the present study, the ranges of bands highly correlated with soil OC and TN contents were fairly similar, mainly concentrated at 600–800 and 2000–2400 nm. However, there is no consensus on whether the TN content in soil is predicted through its correlation with the OC content or based on its own spectral features (Jiang et al. 2017; Zhang et al. 2019). Despite this, the TN content clearly has a close relationship with the OC content of soil, because most nitrogen in the topsoil is organic and generally accounts for one-tenth of the OC content in soil (Stenberg et al. 2010). In the present study, the OC and TN contents showed a significant correlation (r = 0.84, p < 0.01), and their correlation coefficients with OR spectra were similar, despite the slight variation in coefficient values (Fig. 3a and b).

The mechanisms used to predict soil pH based on visible–near-infrared spectroscopy are mainly related to organic materials, iron oxides, and clay minerals (Viscarra Rossel and Behrens 2010). Vašát et al. (2014) found that, in the range of 350–2500 nm, no band was highly correlated with soil pH, while bands at 400, 800, 1400, 1850, and 2300 nm recognized by PLSR might be correlated with soil color, OC content, and clay minerals. The results of our study also showed that bands at 400, 1000, and 2300 nm were highly correlated with soil pH, in partial agreement with previous studies. Furthermore, Peng et al. (2014) attributed bands at 410–572, 1400, 1900, 2200, and 2300 nm to iron, water, O–H stretching, aluminum, and magnesium in clay minerals. Nawar et al. (2016) observed that bands at 1900, 2000, and 2200 nm in CR spectra showed strong correlations with the clay content of soil. Similarly, the results of our study showed absorption features attributed to clay content at approximately 400–500 and 2000–2450 nm.

4.2 Comparison of prediction models

Three calibration methods (PLSR, SVM, and XGBoost) were used to compare the prediction accuracy of models constructed based on OR, FOD, CR, and CWT spectra for soil properties in the study area.

Regarding spectral transformations, FOD and CWT improved the prediction accuracy relative to OR, whereas CR decreased it. The effectiveness of CR transformation has been reported by Vašát et al. (2014). However, in the present study, CR was the worst spectral transformation method for predicting soil properties. In CR spectra, the normalization process mainly emphasizes feature peaks and troughs, so other detailed information in the data might not be well preserved, leading to reduced prediction accuracy. FOD transformation could remove baseline drift and enhance absorption features, thereby improving prediction accuracy (Hong et al. 2018). CWT was the optimal spectral transformation method for predicting the soil properties in the present study. Decomposing the spectrum at multiple scales provided more possibilities for predicting soil properties and capturing useful information hidden in the spectrum. As the decomposition scale increases, the width of the absorption features correlated with soil properties also increases.

Regarding calibration methods, SVM and XGBoost outperformed PLSR in predicting soil properties. Nawar et al. (2016) and Yang et al. (2019) both showed that non-linear models are superior for predicting soil properties based on visible–near-infrared spectroscopy. The linear regression method, PLSR, might not be able to integrate a large amount of information and extract the effective part of it, whereas the non-linear methods, SVM and XGBoost, could handle this problem well (Zhang et al. 2019). Previously, Viscarra Rossel and Behrens (2010) found that the prediction accuracy of the non-linear model was positively correlated with the number of variables used for modeling. When there were more variables, the model could extract more features.

The best prediction accuracies for OC and TN were obtained using CWT spectra with XGBoost, and the RPIQ values were both higher than 4.05, indicating excellent results. Furthermore, although XGBoost outperformed the PLSR and SVM methods in predicting pH value, none of the constructed models predicted it successfully (RPIQ < 2.02). For the clay content, most of the constructed models could approximately quantify it.

OC has broad absorption bands in the visible range, and TN has almost the same feature bands. TN in soil is closely related to organic matter, because most of the N is organic and stored in nitrogen-containing compounds (Stenberg et al. 2010). The clay content is negatively correlated with soil spectral reflectance, which is mainly affected at 1400, 1900, and 2200 nm (Peng et al. 2014). However, pH has no direct spectral response in the visible–near-infrared range; previous studies reported that the prediction mechanisms for pH value might be due to other soil properties (Viscarra Rossel and Behrens 2010; Vašát et al. 2014). Furthermore, the variation of the sample source is also an important factor affecting the prediction accuracy of the models (Stenberg et al. 2010). The greater the variation of the soil sample dataset, the better the prediction accuracy. The CV of pH value was below 10%, which might have limited the prediction accuracy.

5 Conclusions

In this study, three spectral transformation methods (FOD, CR, and CWT) were used to predict soil OC content, TN content, pH value, and clay content in rare earth mining areas based on visible–near-infrared spectroscopy. The accuracies of the prediction models constructed using different calibration methods (PLSR, SVM, and XGBoost) were compared. Spectral transformations based on FOD and CWT were useful for predicting the soil properties. Overall, the models based on CWT spectral transformation coupled with XGBoost calibration outperformed other models in predicting the OC content, TN content, and pH value. However, the optimal model for clay content estimation was CWT spectra coupled with SVM. Soil spectra used in this study were measured in the laboratory, resulting in less disturbance from the external environment compared with field spectra. In future research, we will explore the potential of field and satellite spectra in predicting large-area soil properties.