1 Introduction

Soil is a fundamental natural resource that consists of organic matter, inorganic mineral matter, water, and air, and people rely on it for the production of food and energy (Palma et al. 2007; Cuffney 2010). Soil also acts as an environmental filter for metals, nutrients, and other contaminants that may leach into the environment. The ability of a soil to support any of these functions depends on its structure, composition, and chemical, biological, and physical properties (Karlen et al. 1997). Evaluation of soil quality usually relies on a combination of physical and chemical indicators as a minimum data set (Bo et al. 2010; Rossel et al. 2016). Compared with conventional analytical methods, visible and near-infrared (vis–NIR) spectroscopy is nondestructive, fast, and cost-effective (Shepherd and Walsh 2002; Rossel et al. 2006; Hong et al. 2017).

Soil spectral properties are affected by soil minerals, water content, organic matter, and texture. Different soil components have different energy-level transitions, and when NIR radiation interacts with a soil sample, it is the overtones and combinations of fundamental molecular vibrations that are detected (Hbroge et al. 2004; Rossel et al. 2008). Therefore, the different absorption bands in the soil absorption spectrum allow quantitative analysis of the soil component content. In the visible region (400–780 nm), absorption arises mainly from electronic excitations because the energy of the radiation is high, and this region contains useful information on organic and inorganic materials in the soil (Mortimore et al. 2004; Schoell et al. 2005). In the NIR region (780–2500 nm), soil absorption is mainly caused by overtones (frequency doubling) and combinations of the fundamental vibrations of C–H, N–H, and O–H bonds (Bo et al. 2010; Rodionov et al. 2014). Water also has a strong influence on the vis–NIR spectra of soil, with dominant absorption bands near 1400 and 1900 nm and weaker effects in other parts of the vis–NIR region.

Over the most recent 20 years, many studies have shown that vis–NIR spectroscopy combined with chemometrics can predict soil physical and chemical properties (Rossel et al. 2006; Mouazen et al. 2010; Dotto et al. 2017). For example, soil bulk density and soil water content have been predicted from spectra with satisfactory results (Quraishi and Mouazen 2013; Al-Asadi and Mouazen 2014; Liu et al. 2014; Morellos et al. 2016). Regarding chemical properties, several researchers have demonstrated accurate predictions of soil chemical content. Hong et al. (2017) and Jiang et al. (2016) demonstrated new possibilities for estimating soil organic matter (SOM) using vis–NIR spectroscopy with multivariate modeling techniques. In addition, vis–NIR spectroscopy combined with partial least squares regression (PLSR) has been applied to total nitrogen (TN), moisture content, and pH (Buondonno et al. 2012; Kuang and Mouazen 2013; Morellos et al. 2016; Filippi et al. 2018). In some agricultural areas, the vis–NIR spectroscopy-based approach has also been widely used to monitor agricultural pollution, especially heavy metal pollution (Angelopoulou et al. 2017; Todorova et al. 2018).

In this study, we collected 44 soil samples from common parent materials and textures in the three regions of Shaanxi, China. Soil samples were scanned in the laboratory in the range of 350–2500 nm; twenty-four preprocessing methods and correlation analysis were tested to improve predictions, and a partial least squares regression was used to predict soil quality indicators. Our goal is to establish a quantitative model of soil quality and evaluate the ability of soil spectra for predicting soil physical and chemical properties in Shaanxi Province.

2 Materials and methods

2.1 Study area and soil sampling

The study site was located in the middle of Guanzhong Plain, Shaanxi Province, China (Fig. 1). The terrain is high in the northwest and low in the southeast. It descends from north to south and is divided into three roads. It consists of four land types: the frontal alluvial fan, the loess plateau, the flood plain, and the alluvial terrace. The annual average temperature is 12.9 °C, and the average annual precipitation is between 552.6 and 663.9 mm.

Fig. 1 Study area and sampling point distribution

We collected 44 soil samples in February 2013. The soil type is earth–cumulic–orthic anthrosols, and the sampling depth was the thickness of the tillage layer, usually 0–30 cm. These soil samples differ in their physical and chemical properties, and these differences affect the reflectance and absorption spectral characteristics of the soil. In order to establish a more accurate spectral quantitative model for the study area, seventeen soil property indicators were measured: soil organic matter (SOM), pH, bulk density (BD), soil water (SW), total nitrogen (TN), nitrate nitrogen (NN), available potassium (AK), available phosphorus (AP), Cr, Mn, Ni, Cu, Zn, As, Cd, Hg, and Pb.

2.2 Chemical analysis and spectral measurement

2.2.1 Soil physical measurements

Soil samples were air-dried and crushed, and crop residues, root material, and stones were removed so that the soil could pass through a 2-mm sieve. Bulk density was determined by collecting a known volume of soil with a metal ring pressed into the soil (intact core) and weighing the core after drying (Ito et al. 2002). The moisture content was determined gravimetrically after heating the samples at 105 °C for 24 h (Ruberto et al. 2010).

2.2.2 Soil chemical measurements

SOM was measured by the combustion method using a LECO® FP2000 analyzer (Laboratory Equipment Corporation, St Joseph, MI, USA). Soil pH was measured in deionized water (Zornoza et al. 2008). TN was measured by the Kjeldahl method (Houba 1997). NN was determined by a UV spectrophotometer (Gross et al. 2010). AP was measured by the NaHCO3 method (Yaseen and Malhi 2009). AK was measured by the acetic acid-flame photometric method (Kataoka et al. 1991). The heavy metal content was determined by inductively coupled plasma mass spectrometry (ICP-MS, Agilent 7700), and included Cr, Mn, Ni, Cu, Zn, As, Cd, Hg, and lead (Pb) (Gajek et al. 2013).

2.2.3 Soil reflectance measurements

Diffuse reflectance spectra of soil samples were measured using a portable spectroradiometer (Fieldspec 4, Analytical Spectral Devices, Inc.) with a spectral range of 350–2500 nm. The spectroradiometer had a bandwidth of 1.4 nm at 350–1000 nm and 1.1 nm at 1001–2500 nm. Measurements were made with a Hi-Brite Contact Probe that uses a halogen bulb (color temperature 2901 ± 10 K) for illumination. The contact probe measures a spot size of 10 mm. The sensor was calibrated with a white reference panel once every ten measurements. For every sample, the Petri dish was rotated 4 times, each time by 90°, and the 40 recorded spectra were averaged to form a representative spectrum with minimal noise and a high signal-to-noise ratio. Spectra were recorded with a sampling resolution of 1 nm, and thus, each spectrum comprised 2151 reflectance channels in the range of 350–2500 nm.

2.3 Spectral preprocessing

Mathematical preprocessing is commonly used to correct measurement artifacts and noise in spectra. In addition to chemical properties, soil structural properties also influence soil spectra, for example through light scattering effects. To enhance the more chemically relevant peaks in the spectra and reduce the effects of baseline shifts and overall curvature, various spectral preprocessing approaches have been studied. For example, Savitzky–Golay (SG) smoothing with first and second derivatives, a second-order polynomial, and a window size of 9 wavelengths has been used to reduce baseline variation and increase the resolution of spectral peak features (Peng et al. 2014; Fu et al. 2018). The standard normal variate (SNV) transformation is used to reduce the particle size effect and the curvilinear trend of the spectrum (Dhanoa et al. 1995). Multiplicative scatter correction (MSC) and normalization (NOR) remove additive or multiplicative signal effects (Burger and Geladi 2007). The logarithmic transformation (LOG (1/R)) not only enhances differences in the visible spectrum but also reduces the influence of multiplicative factors under different light conditions (Liu et al. 2018). In this paper, prior to building the predictive models, twenty-four spectral preprocessing methods were explored in order to improve the final predictions of soil physical and chemical properties. The treatments included SG smoothing with first and second derivatives (second-order polynomial, window size of 9 wavelengths), alone and in conjunction with SNV, MSC, NOR, and LOG (1/R).
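As an illustration only, the transformations named above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the software used in this study: the function names are hypothetical, each spectrum is assumed to be a list of reflectance values at a uniform 1-nm spacing, and only the first-derivative SG filter is shown. For a symmetric window and a second-order polynomial, the fitted slope at the window center reduces to sum(i * y_i) / sum(i^2), which is what the last function computes.

```python
import math


def log_inverse_reflectance(spectrum):
    """LOG (1/R): convert reflectance R to apparent absorbance log10(1/R)."""
    return [math.log10(1.0 / r) for r in spectrum]


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum of the set."""
    n_bands = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_bands)]
    ref_mean = sum(ref) / n_bands
    ref_var = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_bands
        # Least-squares fit s = a + b * ref, then remove offset a and slope b.
        b = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_var
        a = s_mean - b * ref_mean
        corrected.append([(v - a) / b for v in s])
    return corrected


def sg_first_derivative(spectrum, half_window=4):
    """SG first derivative, second-order polynomial, window = 2 * half_window + 1.

    The first and last half_window points are dropped (assumption: no edge padding).
    """
    offsets = range(-half_window, half_window + 1)
    norm = sum(i * i for i in offsets)
    return [
        sum(i * spectrum[c + i] for i in offsets) / norm
        for c in range(half_window, len(spectrum) - half_window)
    ]
```

With the default `half_window=4` the filter uses the 9-wavelength window described above. A combined treatment such as LOG (1/R) + SG + MSC + FD would chain these functions, although the exact order applied in this study is not specified here and would have to be chosen by the user.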

2.4 Prediction model building and testing

Partial least squares regression (PLSR) is a widely used multivariate statistical method that combines features of principal component analysis and multiple regression in order to reduce the dimensionality of a complex spectral matrix (Ergon 2014). It is particularly useful for predicting a set of dependent variables from a very large set of independent variables. Because it can handle the high dimensionality and collinearity of data produced by vis–NIR spectroscopy, this method has been widely applied in the field of chemometrics (Qiao et al. 2015). To train and then test our models, we randomly selected 70% of the 44 soil samples as the calibration set (31 samples) to build the PLSR models, and the remaining 30% formed the validation set (13 samples).
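The random calibration/validation split and a single-response PLSR fit can be sketched as follows. This is a minimal sketch, not the implementation used in this study: it assumes the classical NIPALS algorithm for one response variable (PLS1), and the function names, the seed, and the number of components are hypothetical choices for illustration.

```python
import math
import random


def split_indices(n_samples, calib_fraction=0.7, seed=0):
    """Randomly split sample indices into calibration and validation sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_cal = round(calib_fraction * n_samples)
    return idx[:n_cal], idx[n_cal:]


def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; X is a list of spectra (rows), y a list of values."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    weights, loadings, coefs = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        loadings.append(load)
        coefs.append(q)
    return x_mean, y_mean, weights, loadings, coefs


def pls1_predict(model, X_new):
    """Predict y for new spectra by repeating the deflation steps of the fit."""
    x_mean, y_mean, weights, loadings, coefs = model
    predictions = []
    for row in X_new:
        e = [v - m for v, m in zip(row, x_mean)]
        y_hat = y_mean
        for w, load, q in zip(weights, loadings, coefs):
            t = sum(a * b for a, b in zip(e, w))
            y_hat += q * t
            e = [a - t * b for a, b in zip(e, load)]
        predictions.append(y_hat)
    return predictions
```

In use, `split_indices(44)` would give the index lists for the two sets, a model would be fitted on the calibration spectra only, and `pls1_predict` would then be applied to the validation spectra. Keeping `n_components` small limits the risk of over-fitting with so few samples.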

For each PLSR model, several performance metrics were used for model evaluation. The coefficient of determination (R2) and the root mean squared error (RMSE) give indications of the validity of the prediction models. The standard error (SE) measures the precision of the prediction models. The ratio of performance to deviation (RPD) was calculated as the standard deviation (SD) of the validation set divided by the RMSE (Sarathjith et al. 2014; Gholizadeh et al. 2017). Equations (1)–(4) are as follows:

$$ {R}^2=\frac{\sum_{i=1}^N{\left(\hat{y}_i-\overline{y}\right)}^2}{\sum_{i=1}^N{\left({y}_i-\overline{y}\right)}^2} $$
(1)
$$ \mathrm{RMSE}=\sqrt{\frac{\sum_{i=1}^N{\left(\hat{y}_i-{y}_i\right)}^2}{N}} $$
(2)
$$ \mathrm{SE}=\sqrt{\frac{\sum_{i=1}^N{\left(\hat{y}_i-\overline{y}\right)}^2}{N-1}} $$
(3)
$$ \mathrm{RPD}=\frac{\mathrm{SD}}{\mathrm{RMSE}} $$
(4)

where \( \hat{y}_i \) indicates the values estimated by the model, \( y_i \) indicates the observed values, \( \overline{y} \) is the mean of the observed values, and N is the number of observations of the variable to be modeled. The RPD was evaluated with the following criteria: RPD > 3 represents an excellent prediction; 2 < RPD < 3 has limited predictive ability; RPD < 2 has no predictive ability (Prs et al. 2012; Sun et al. 2018).
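The four metrics and the RPD criteria can be written out as a short sketch. It rests on a few assumptions that are not stated in the text: R2 is read as the ratio of the explained to the total sum of squares about the observed mean, SD is taken as the sample standard deviation (N − 1 denominator) of the observed validation values, and RPD values of exactly 2 or 3 are assigned to the lower class. The function names are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Eq. (1): explained sum of squares over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    explained = sum((p - mean_obs) ** 2 for p in predicted)
    total = sum((o - mean_obs) ** 2 for o in observed)
    return explained / total


def rmse(observed, predicted):
    """Eq. (2): root mean squared error of prediction."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def standard_error(observed, predicted):
    """Eq. (3): spread of the predictions about the observed mean."""
    n = len(observed)
    mean_obs = sum(observed) / n
    return math.sqrt(sum((p - mean_obs) ** 2 for p in predicted) / (n - 1))


def rpd(observed, predicted):
    """Eq. (4): sample SD of the observed values divided by the RMSE."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sd = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / (n - 1))
    return sd / rmse(observed, predicted)


def classify_rpd(value):
    """Map an RPD value to the three classes used in the text."""
    if value > 3:
        return "excellent"
    if value > 2:
        return "limited"
    return "none"
```

For example, `classify_rpd(2.09)` returns `"limited"`, which matches the reading that a model with RPD between 2 and 3 is not an excellent model.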

3 Results and discussion

3.1 Soil properties

The statistics of the reference values of the soil properties are summarized in Table 1. The soil texture of the study area is silt. The pH values ranged from 7.47 to 8.38, that is, from neutral to weakly alkaline, and the mean pH indicates that the soil in the study area is alkaline. The BD values ranged from 1.02 to 1.61. The SOM content ranged from 11.83 to 26.93 g kg−1. In addition, the ranges of TN, NN, AK, and AP concentrations were 0.61–1.60 mg kg−1, 0.88–13.75 mg kg−1, 80.1–368.9 mg kg−1, and 0.8–105.1 mg kg−1, respectively. Relative to the Soil Environment Quality standards in China, the Hg, Pb, and Mn concentrations showed more skewed and irregular distributions. This is probably caused by the improper use of fertilizers and pesticides, which often contain mercury. In addition, compared with Ni, Cu, As, and Cd, the Hg concentrations presented a higher standard deviation (SD = 5.17) and a high coefficient of variation (CV = 41%), thereby exceeding the critical value. The contamination may result from sewage irrigation. These results indicate that the selected properties were spatially variable within the study area, and the wide range of physical and chemical properties supports the accuracy of vis–NIR calibration and validation (Abdi et al. 2012; Xu et al. 2018).
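The coefficient of variation quoted above is the standard deviation expressed as a percentage of the mean. A minimal sketch, assuming the sample standard deviation (N − 1 denominator) is used and with a hypothetical function name:

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample SD divided by the mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0
```

A property with a large CV varies strongly across the sampling points relative to its mean, which is the sense in which the Hg concentrations are described as highly variable.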

Table 1 Statistical analysis of the attributes of soil properties

3.2 Soil spectral analysis

Because different soil components have different transition energy levels, the soil absorption spectrum contains distinct absorption bands. The calculated mean reflectance spectra for soil samples with different SOM content are shown in Fig. 2. In general, the soil samples have similar spectral patterns. In particular, the spectral reflectance increases with wavelength and has three prominent absorption features near 1400 nm, 1900 nm, and 2200 nm, largely because of strong absorption by soil moisture. The first absorption feature at approximately 1400 nm is the first overtone of the O–H stretch and the C–H combination of aromatic structures from lignin (Vasques et al. 2008). The second at approximately 1900 nm is due to the combination of O–H and H–O–H vibrations. The third at approximately 2200 nm results from O–H stretches and the Al/Fe–OH bend (Bendor 2002; Ding et al. 2013). Between 350 and 760 nm, the reflectance curves showed an increasing trend. The absorptions between 350 and 950 nm were caused by iron oxides and organic matter, those between 1100 and 1550 nm result from clay minerals and water, those between 1700 and 1800 nm were caused by carbon, and those between 2100 and 2500 nm result from kaolinite, smectite, carbonates, and organic compounds (Bishop et al. 2008; Nawar et al. 2016). Despite the similarity, it is also clear that the samples with higher SOM content tend to have lower reflectance values. This is because the soil color gradually darkens as the SOM content increases, so more of the incident light is absorbed at the soil surface and the soil reflectance gradually decreases.

Fig. 2 Reflectance curves of soil samples with different content

3.3 Correlation analysis

Soil is the product of the weathering of surface crustal rocks and subsequent soil formation. Generally, the chemical composition of the soil is relatively stable, and the content level and variation range of the elements are low. Therefore, the contents of soil chemical elements are highly comparable. By studying the correlation between elements, it can be inferred whether they share the same source. Elements in the soil have different migration and enrichment trends, so the correlations between element contents differ. The Pearson correlation coefficient was used to analyze relationships between soil elements (Table 2).
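A pairwise Pearson correlation table of the kind described here can be computed as in the following minimal sketch. The function names are hypothetical, and the input is assumed to be a dictionary that maps each property name to its list of measured values in the same sample order.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict r[a][b] for every pair of named properties."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }
```

Values near +1 or −1 indicate a strong linear relationship between two properties, and values near 0 indicate little linear relationship. The significance level attached to each coefficient is a separate calculation and is not covered by this sketch.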

Table 2 The linear correlation coefficients among soil properties (P < 0.01)

From the results, we found a strong correlation between SOM and Zn (r = − 0.45), which was mainly attributed to zinc accumulation in soil with high humus content. The zinc content of soil increases with increasing humus and organic matter content (Wang et al. 2014). The correlations between heavy metal elements are also significant. For example, strong correlations were observed between Cr and Ni (r = 0.94), Mn (r = 0.88), Pb (r = 0.64), Cu (r = 0.62), and Hg (r = 0.59). The correlation coefficients of Ni and Cu, Zn, Hg, and Pb were 0.83, 0.57, and 0.59, respectively; the correlation coefficients of Cu and Zn, Cd, Hg, and Pb were 0.61, 0.55, 0.50, and 0.68, respectively. The correlation coefficients of Zn and Hg and Pb were 0.83 and 0.55, respectively. The correlation coefficient of Hg and Pb was 0.68. Cu, Zn, and Pb are chalcophile (sulfur-loving) elements; such similar elements share common behavior in epigenetic geochemistry and are therefore significantly correlated. This shows that these heavy metals are more likely to come from the same source of pollution and to occur together.

3.4 Model performance

After S–G nine-point smoothing, the spectra were transformed with NOR, MSC, SNV, the first derivative (FD), the second derivative (SD), and LOG (1/R) in various combinations, and PLSR was then used to establish the prediction models. At the same time, in order to avoid over-fitting, the dimension of the input variables was kept as small as possible.

The model performances for the different soil indicators are summarized in Table 3; accuracies are reported for the testing dataset. Of the physical indicators BD and SW, the best result was achieved for BD: with the best preprocessing, log(1/R) + SG + MSC + FD, the accuracies were R2 = 0.97, RPD = 5.90, RMSE = 0.02, and SE = 0.02, which can be considered an excellent model. For SW, R2 < 0.90 and RPD = 2.09 < 3, so the model cannot be considered excellent (Fig. 3). Similar predictions of BD and SW have been reported in previous studies (Quraishi and Mouazen 2013; Al-Asadi and Mouazen 2014; Liu et al. 2014). BD can be affected by soil structure, and good soil structure leads to overall good soil quality (Figueroa 2003).

Table 3 The results of calibration and validation using the preprocessing in the soil properties
Fig. 3 Scatter plots of the measured and vis–NIR predicted soil bulk density and gravimetric soil water. Note: BD, bulk density; SW, gravimetric soil water

The calibration and validation results show that the choice of preprocessing method has a significant effect on the prediction of soil physical and chemical properties. For the chemical properties, judged by the performance metrics R2, RMSE, SE, and RPD, the best calibration and validation results were achieved for SOM (R2 = 0.98, RPD = 8.56, RMSE = 0.45), pH (R2 = 0.95, RPD = 4.40, RMSE = 0.05), and TN (R2 = 0.98, RPD = 6.67, RMSE = 0.02) (Table 4). The three best preprocessing combinations were SG + MSC + FD, R + SG + SNV + FD, and R + SG + NOR + SD. The predictions of SOM, pH, and TN were better than those of AK (R2 = 0.89, RPD = 2.91 < 3.0, RMSE = 1.39), AP (R2 = 0.85 < 0.90, RPD = 3.58, RMSE = 5.7495), and NN (R2 = 0.90, RPD = 3.07, RMSE = 0.62) (Fig. 4), which is attributed to the overtones and combinations of O–H and C–H bonds (Rossel et al. 2006; Vasques et al. 2008). Therefore, further research is needed to improve the calibration and validation accuracy for soil properties, especially AK, AP, and NN.

Table 4 Best results of calibration and validation in the soil chemical properties
Fig. 4 Scatter plots of the measured and vis–NIR predicted soil chemical contents. Note: SOM, soil organic matter; TN, total nitrogen; NN, nitrate nitrogen; AK, available potassium; AP, available phosphorus

Regarding the heavy metals, calibration and validation statistics of the best results for five heavy metals are presented in Table 5. These five models are Hg (R2 = 0.99, RPD = 8.59, RMSE = 0.12), Cr (R2 = 0.97, RPD = 5.96, RMSE = 0.10), Ni (R2 = 0.93, RPD = 3.74, RMSE = 0.13), Pb (R2 = 0.97, RPD = 5.57, RMSE = 0.10), and Cu (R2 = 0.92, RPD = 3.38, RMSE = 0.08). The models for As (R2 = 0.87, RPD = 2.58), Mn (R2 = 0.80, RPD = 2.09), and Cd (RPD = 2.77) had R2 < 0.9 and RPD < 3.0; for the Zn model, although R2 = 0.91 > 0.90 and RPD = 3.13 > 3.0, the offset deviated too much (Fig. 5). According to these results, the predictions for Mn, Zn, and As were not as good as those for the other metals, essentially because these heavy metals are spectrally featureless in soil and their signal in vis–NIR spectra is limited (Wu et al. 2007; Wang et al. 2014).

Table 5 Calibration and validation statistics of the best results of five heavy metals
Fig. 5 Scatter plots of the measured and vis–NIR predicted soil heavy metal contents

4 Conclusions

Vis–NIR spectroscopy and chemometric analysis have been demonstrated to be reliable tools for estimating soil physical and chemical properties. In this study, based on the collected soil spectral data and chemical indicators, we established prediction models for seventeen soil properties and analyzed the potential of vis–NIR spectroscopy to predict soil properties using the PLSR technique.

Our results show that accurate calibration and validation models were achieved for soil physical and chemical properties, including BD, SOM, pH, and TN. In addition, Hg, Cr, Ni, Pb, and Cu concentrations were well predicted. Moreover, vis–NIR spectroscopy is an effective, nondestructive, fast, and cost-effective tool for accurate prediction of soil properties and, combined with chemometric methods, can be used for rapid acquisition of spectra, multi-parameter measurement, and assessment of soil physical and chemical properties in the center of Shaanxi Province. Although the predictions were less accurate than laboratory analyses, we are optimistic about the potential of vis–NIR and chemometric methods for predicting soil properties.