Introduction

Soil organic matter (SOM) is an important indicator of soil quality in coal mining regions. SOM estimation methods for land reclamation and ecological restoration plans have therefore been receiving a great deal of attention (Demirel et al. 2011a, b). A rapid and inexpensive method for determining SOM content is essential for evaluating and managing soil resources (Chang et al. 2001). However, most approaches are based on traditional methods, which tend to be time consuming, expensive, and laborious (Sebag et al. 2006; Seely et al. 2010). Therefore, the development of fast, inexpensive, accurate, and real-time tools for measuring SOM content has become a priority task for scholars.

Hyperspectral remote sensing techniques have the potential for monitoring SOM content in coal mining regions because they provide abundant spectral information (Doetterl et al. 2013; Gomez et al. 2013). Compared with traditional laboratory methods, hyperspectral techniques are quicker and cheaper, and they require neither sample preparation nor chemical reagents (Chang and Laird 2002; Demattê et al. 2004). Consequently, there have been many studies of SOM monitoring models (Croft et al. 2012). For example, using a regression tree method and hyperspectral technology, Gmur et al. (2012) quantified several soil properties, including SOM. Steffens et al. (2014) applied Vis-NIR imaging spectroscopy to map SOM quality in visually homogeneous organic surface layers. Yang and Li (2013) quantified SOM content through a combination of soil spectroscopy and multivariate stepwise linear regression. Nevertheless, there has been limited research into hyperspectral monitoring of SOM in coal mining regions, where serious soil degradation is typical.

Among the most efficient methods for constructing reliable models in the hyperspectral remote sensing field, partial least squares regression (PLSR) has been the most frequently used for estimating SOM content. Nocita et al. (2011) quantified SOM content via a combination of soil spectroscopy and PLSR. Vohland et al. (2011) integrated field visible near infrared spectroscopy and PLSR to predict SOM. Many scholars have indicated that the PLSR method can mitigate the effects of multicollinearity and may avoid the information loss introduced by multiple stepwise regression (which is attributable to characteristic band screening) (Janik et al. 2009; Vohland and Emmerling 2011). However, PLSR analysis and processing are severely affected by having too many variables. Still worse, the accuracy of SOM estimates can be seriously affected by noise (Groenigen et al. 2003). To alleviate these disadvantages of PLSR, some researchers have combined it with canonical correlation analysis (Chen et al. 2013; Kim et al. 2014). However, it is doubtful that this combined method can be applied to coal mining regions, because they are unique geographical and industrial areas with geospatial, social, and environmental factors that are widespread, comprehensive, dynamic, and complicated (Demirel et al. 2011a, b; Erener 2011). Furthermore, some useful information may not appear correlated because of noise in the soil spectrum. Methods that reduce noise while retaining as much useful information as possible have therefore become an urgent requirement for macroscopic SOM monitoring in coal mining regions.

In the present study, wavelet and correlation analyses were used to amplify useful information that was previously obscured by noise. These analyses were then combined with the PLSR method to develop a satisfactory model for SOM prediction based on soil samples from coal mining regions.

Experiments

Study area

The ecological environment of the coal mining regions in Datong, China, one of the main coal-producing areas of Shanxi Province, has been seriously damaged, and the damage is ongoing. The Jinghuagong National and Xinzhouyao mines are typical coal mining regions of Datong. They are composed of temperate hill zones, with widespread salinized chestnut soils. Forty-six samples of these soils were randomly selected from those collected at 0–20 cm depths from Jinghuagong National Mining Park (40°7′N, 113°7′E) and a coal mining subsidence region of Xinzhouyao mine (40°4′N, 113°5′E) (Fig. 1). All samples were air dried, crushed to pass through a 2-mm sieve, and then pulverized by grinding. The samples were split into two sets, one for chemical analysis and the other for hyperspectral measurements. SOM contents of the samples were determined using the potassium dichromate oxidation-ferrous sulfate titrimetric method (Devi et al. 2011; Ramesh et al. 2012).

Fig. 1 Soil samples collected from (a) Jinghuagong National Mining Park and (b) the coal mining subsidence region of Xinzhouyao mine

Measurement and data processing

An ASD FieldSpec 3 spectroradiometer from Analytical Spectral Devices™ was used to obtain the soil reflectance spectrum over the 350–1000 nm and 1000–2500 nm bands, with increments of 1.4 and 2 nm, respectively. The spectral resolution was 3 nm at 700 nm and 10 nm at 1400 and 2100 nm. Each soil sample filled a container (10-cm diameter and 2-cm depth) and was illuminated from above with a halogen lamp. After adjusting the zenith angle and the distance between the light source and soil surface, 10 scans were made of each sample, and a reference reading from a white panel was taken for calibration. All these operations were performed in a dark room to avoid the effect of stray light (Farifteh et al. 2008). The spectrum of each sample was taken as the mean of its 10 scans.

Spectral filtering and transformations

Derivative processing reduces the influence of low-frequency noise (Ghiyamat et al. 2013; Liaghat et al. 2014). In the inverse log mode, spectral differences in the visible region can be highlighted, and the influence of illumination variation is minimized (Wang et al. 2009). In the present work, each original spectral reflectance (REF) was transformed into a first derivative (FDR), second derivative (SDR), first derivative of the reciprocal logarithm (log(1/R))', and second derivative of the reciprocal logarithm (log(1/R))".
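
The four transformations can be illustrated with a minimal sketch. The source does not state how the derivatives were computed or which logarithm base was used, so the central finite differences, the base-10 logarithm, and the function names `first_derivative` and `transform_spectrum` are all assumptions made for illustration.

```python
import math

def first_derivative(wavelengths, values):
    """Finite-difference derivative: central inside, one-sided at both ends.

    Assumption: the paper does not specify its derivative scheme.
    """
    n = len(values)
    deriv = [0.0] * n
    for k in range(n):
        lo = max(k - 1, 0)
        hi = min(k + 1, n - 1)
        deriv[k] = (values[hi] - values[lo]) / (wavelengths[hi] - wavelengths[lo])
    return deriv

def transform_spectrum(wavelengths, reflectance):
    """Return FDR, SDR, (log(1/R))' and (log(1/R))'' for one spectrum."""
    fdr = first_derivative(wavelengths, reflectance)
    sdr = first_derivative(wavelengths, fdr)
    # Reciprocal logarithm; base 10 is an assumption, and R must be > 0.
    log_inv = [math.log10(1.0 / r) for r in reflectance]
    log_d1 = first_derivative(wavelengths, log_inv)
    log_d2 = first_derivative(wavelengths, log_d1)
    return {"FDR": fdr, "SDR": sdr, "LOG_D1": log_d1, "LOG_D2": log_d2}
```

Each transformed curve has the same length as the input spectrum, so the band index still maps directly to a wavelength in the later correlation step.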

Gomez et al. (2008) and Lin et al. (2014) found obvious characteristic absorption troughs using the continuum removal method, so they could more easily distinguish absorption bands and build better PLSR models. The main spectral response areas of SOM content can be evaluated according to previous studies and the results of continuum removal processing. Therefore, we used continuum removal to enhance and standardize spectral absorption features (Kokaly and Clark 1999).
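
As an illustration of continuum removal, the sketch below divides each reflectance value by an upper convex hull fitted over the spectrum, which is one common way to define the continuum. The source does not describe its implementation, so the hull construction and the function names are assumptions; wavelengths are assumed to be sorted in ascending order and reflectance to be positive.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def continuum_removed(wavelengths, reflectance):
    """Divide a spectrum by its upper convex hull (the continuum).

    Values equal 1 where the spectrum touches the hull and fall below 1
    inside absorption troughs.
    """
    points = list(zip(wavelengths, reflectance))
    hull = []
    for p in points:
        # Keep only right turns so the hull stays above the spectrum.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)

    removed = []
    seg = 0
    for x, y in points:
        # Advance to the hull segment that contains wavelength x.
        while x > hull[seg + 1][0]:
            seg += 1
        (x0, y0), (x1, y1) = hull[seg], hull[seg + 1]
        continuum = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        removed.append(y / continuum)
    return removed
```

Because the result is normalized to the hull, troughs from samples of different overall brightness become directly comparable.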

Wavelets-correlations-partial least squares regression method

PLSR is a mainstream, linear multiple regression method that compresses spectral data by reducing measured collinear spectral variables to a few non-correlated latent variables or factors (Geladi and Kowalski 1986; Feret et al. 2011; Singh et al. 2013). The basic aim of PLSR is to build a linear model:

$$ Y=Xb+E $$
(1)

where Y is a mean-centered matrix that contains the response variables, X is a mean-centered matrix that contains the predictor variables, b is a matrix that contains the regression coefficients, and E is a matrix of residuals (Cho et al. 2007). Wavelet-correlation-PLSR (W-C-PLSR) is closely related to PLSR, but applies wavelet and correlation analyses to the transformed spectra instead of using them directly. The original soil spectra have obvious burrs, revealing a large amount of noise in the spectral reflectance curves. This noise is also present in the transformed spectra. Therefore, the transformed data were decomposed using a wavelet de-noising method based on the Sym8 wavelet function (Liu et al. 2011). Liu et al. (2011) noted that a three-level decomposition with the threshold de-noising method based on wavelet analysis provides an appropriate balance between curve smoothing and retention of spectral features. To select the optimal decomposition level, each spectral curve was decomposed into five layers. After the wavelet analysis, the correlation between the SOM content and the transformed spectra was calculated. The correlation coefficient is given by

$$ r=\frac{{\displaystyle {\sum}_{i=1}^N\left({R}_i\left({\lambda}_j\right)-\overline{R}\left({\lambda}_j\right)\right)\left({S}_i-\overline{S}\right)}}{{\left[{\displaystyle {\sum}_{i=1}^N{\left({R}_i\left({\lambda}_j\right)-\overline{R}\left({\lambda}_j\right)\right)}^2{\displaystyle {\sum}_{i=1}^N{\left({S}_i-\overline{S}\right)}^2}}\right]}^{0.5}} $$
(6)

where N is the number of soil samples, \( {R}_i\left({\lambda}_j\right) \) is the transformed spectral reflectance of sample i at wavelength \( {\lambda}_j \), \( {S}_i \) is the corresponding measured SOM value, \( \overline{R}\left({\lambda}_j\right) \) denotes the sample mean of \( {\left\{{R}_i\left({\lambda}_j\right)\right\}}_{i=1}^N \), and \( \overline{S} \) is the sample mean of \( {\left\{{S}_i\right\}}_{i=1}^N \).
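
The correlation coefficient defined above can be computed band by band as in the following sketch. The data layout (one list of band values per sample) and the names `pearson_r`, `band_correlations`, and `select_bands` are illustrative assumptions. The critical value `r_crit` is a hypothetical parameter: it must be looked up for the chosen significance level and sample size, and is not taken from the source.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def band_correlations(spectra, som):
    """Correlate SOM with every band.

    spectra: list of N samples, each a list of transformed reflectance
             values (one value per wavelength).
    som:     list of N measured SOM values.
    """
    n_bands = len(spectra[0])
    return [pearson_r([s[j] for s in spectra], som) for j in range(n_bands)]

def select_bands(r_values, r_crit):
    """Indices of bands whose |r| reaches the critical value r_crit."""
    return [j for j, r in enumerate(r_values) if abs(r) >= r_crit]
```

The selection is made on the absolute value of r, so strongly negative correlations count as sensitive bands as well.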

Based on the main spectral response areas of SOM content obtained through previous studies and continuum removal, sensitive bands with significant correlation coefficients (P < 0.01) were selected for further analysis. Lastly, PLSR analysis was performed using the selected bands and measured SOM values. The W-C-PLSR flowchart is shown in Fig. 2.
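
The final PLSR step can be sketched with the classical NIPALS algorithm for a single response variable (PLS1). This is a generic textbook formulation written for illustration, not the authors' implementation; the function names and the choice of the number of latent variables (`n_components`) are assumptions.

```python
import math

def _dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS. X: list of samples (rows), y: list of responses."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [_dot(E[i], w) for i in range(n)]                  # scores
        tt = _dot(t, t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = _dot(f, t) / tt
        # Deflate X and y by the part explained by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict the response for one new sample x."""
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = _dot(e, w)
        y_hat += q * t
        e = [ev - t * lv for ev, lv in zip(e, load)]
    return y_hat
```

Prediction repeats the deflation used in fitting, which avoids forming an explicit regression vector b and therefore needs no matrix inversion.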

Fig. 2 Flowchart for wavelet-correlation-partial least squares regression

By carefully combining wavelet analysis, correlation analysis, and PLSR, the W-C-PLSR method can highlight subtle information that was obscured by noisy data, so as to take full advantage of useful spectral information and enhance accuracy.

Establishment and verification of the model

Thirty soil samples were randomly selected to construct the model. The remaining 16 samples were used for verification. Stabilities and accuracies of all the models were determined by the coefficient of determination (\( R^2 \)), root mean square error of calibration (RMSEC), and mean relative error of calibration (MREC). Predictive capabilities were evaluated by root mean square error of validation (RMSEV) and mean relative error of validation (MREV). An effective model should have high \( R^2 \), low RMSE (RMSEC and RMSEV), and small MRE (MREC and MREV).

$$ RMSE=\sqrt{\frac{1}{N}{\displaystyle {\sum}_{i=1}^N{\left({S}_i-{\overset{\frown }{S}}_i\right)}^2}} $$
(2)
$$ MRE=\frac{1}{N}{\displaystyle {\sum}_{i=1}^N\frac{\left|{S}_i-{\overset{\frown }{S}}_i\right|}{S_i}} $$
(3)

where \( {S}_i \) is the measured value, \( {\overset{\frown }{S}}_i \) is the predicted value, and N is the number of samples in the set being evaluated (calibration or validation).
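
Equations (2) and (3) translate directly into code. In the sketch below the function names are illustrative, and MRE is returned as a fraction, so it must be multiplied by 100 to obtain a percentage.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((s - p) ** 2 for s, p in zip(measured, predicted)) / n)

def mre(measured, predicted):
    """Mean relative error as a fraction; measured values must be non-zero."""
    n = len(measured)
    return sum(abs(s - p) / s for s, p in zip(measured, predicted)) / n
```

Applying the same two functions to the calibration set gives RMSEC and MREC, and applying them to the validation set gives RMSEV and MREV.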

Results and discussion

Interpretation of soil spectral reflectance

Reflectance spectra of samples with different SOM contents (11.45, 8.07, 8.78, and 6.92 mg/kg) are shown in Fig. 3a. This figure indicates that soil reflectance generally decreases with increasing SOM content. The sample with an SOM content of 11.45 mg/kg showed lower reflectance values than the others, probably because of its greater SOM content. Spectral continuum-removed curves (Fig. 3b) show seven major absorption troughs at 400–592, 609–685, 707–813, 826–1006, 1365–1531, 1850–2056, and 2146–2241 nm. Additionally, there were three remarkable water absorption peaks at 1413, 1918, and 2211 nm. Because SOM produces apparent differences in spectral characteristics, the spectral regions suited to SOM prediction with hyperspectral models can be identified. Many studies have indicated that the 400–1000 nm spectral range is the main SOM spectral response area. Additionally, some investigators have found that the highest correlation between SOM and reflectance occurred at ∼600 nm (Krishnan et al. 1980; Nocita et al. 2011). According to our analysis, the results of prior studies, and the soil spectral continuum-removed curves in Fig. 3b, 400–1006 nm was determined as the main SOM spectral response area in our study.

Fig. 3 Original reflectance curves (a) and spectral continuum-removed curves (b) of soil samples with differing SOM contents

Wavelet analysis

Each REF was transformed into FDR, SDR, (log(1/R))', and (log(1/R))". To retain as much useful information as possible, each transformed spectrum was decomposed using five levels of approximation. Noise was reduced using the Sym8 wavelet function. Figure 4 compares the initial FDR and decomposed results of the wavelet analysis at various levels (data of SDR, (log(1/R))', and (log(1/R))" not shown). It is clear that the de-noised results were not satisfactory using one level (Fig. 4b) as compared with the initial FDR (Fig. 4a). The smoothness of the spectral curve clearly improved using two levels (Fig. 4c), but some noise remained. Using three levels (Fig. 4d), the method dramatically reduced the noise while preserving spectral characteristics, especially in the 400–1006 nm spectral range. At the same time, the maximum value was achieved near 600 nm, and the response characteristics were obvious. However, using four and five levels (Fig. 4e–f) produced curves that were too smooth, so that extreme points were not clear and useful information was lost.
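
The effect of the decomposition level can be illustrated with a minimal multi-level threshold de-noising sketch. To stay self-contained it uses the Haar wavelet as a simplified stand-in for the Sym8 wavelet used in this study, so it will not reproduce the curves in Fig. 4. The soft-thresholding rule and the threshold value `thr` are also assumptions, because the source does not state them.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(signal):
    """One level of the Haar transform: approximation and detail halves."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Invert one Haar level."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def soft_threshold(coeffs, thr):
    """Shrink coefficients toward zero by thr; small ones become zero."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def wavelet_denoise(signal, levels, thr):
    """Decompose to `levels`, threshold all detail coefficients, rebuild."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(soft_threshold(detail, thr))
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx
```

Raising `levels` thresholds progressively coarser detail coefficients, which mirrors the trade-off described above: more smoothing at the cost of losing genuine spectral features.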

Fig. 4 De-noised first-derivative spectral reflectance curve (FDR) under different wavelet decomposition scales (b–f), compared with the initial FDR (a)

Correlation analysis

Correlation coefficients between the measured SOM contents and both the initial transformed spectra and the decomposed spectra (one, two, and three levels) were calculated and compared over the range 400–2300 nm. The results for the transforms, including FDR, SDR, (log(1/R))', and (log(1/R))", are shown in Figs. 5 and 6. These figures demonstrate that the Sym8 wavelet decomposition remarkably improved correlations between SOM content and spectrum transformations in the range 400–1006 nm, especially for SDR and (log(1/R))". The correlation between SOM and SDR was strongest using three levels, with a maximum correlation coefficient of −0.9044 (at 658 nm). For 400–1006 nm, maxima of all correlation coefficients and the number of sensitive bands (P < 0.01) are given in Table 1. Table 1 and Fig. 5 demonstrate that the wavelet analysis amplified useful information that was previously obscured by noise. The decomposed SDR with three levels was the most significant. The initial FDR performed better than the other initial spectra. According to these correlation results, the SDR with three levels after continuum removal was used to build the W-C-PLSR models.

Fig. 5 Correlation analysis of SOM contents with FDR (a–d) and SDR (e–h) (initial and decomposed)

Fig. 6 Correlation analysis of SOM contents with (log(1/R))' (a–d) and (log(1/R))" (e–h) (initial and decomposed)

Table 1 Correlation analysis between SOM and mathematically transformed spectra (initial and decomposed) in the range 400–1006 nm

Applicability of W-C-PLSR method

The decomposed SDR with three levels maximized correlations between SOM contents and the major spectral response areas (400–1006 nm). Based on the main spectral response areas of SOM content obtained in other studies and continuum removal, 289 sensitive bands whose correlation coefficients were significant (P < 0.01) were selected for further analysis. Then, PLSR analysis was done using these 289 bands as independent variables and the measured SOM values. The C-MLSR method based on correlations and multiple linear stepwise regression (MLSR) often guarantees very accurate evaluation results and has great potential (Chang et al. 2001). Thus, for comparison, using wavelet, correlation, and MLSR methods, we constructed the W-C-MLSR model based on the 289 sensitive bands of the SDR with three levels. There was information loss upon screening the bands during MLSR, and the initial FDR performed better than the other initial spectra. Accordingly, results of a correlation-partial least squares regression (C-PLSR) model based on the 338 sensitive bands (P < 0.01) of the initial FDR were also compared (Pu 2012).

Table 2 shows results of the W-C-PLSR, W-C-MLSR, and C-PLSR models.

Table 2 Test results of W-C-PLSR (SDR at level 3), C-PLSR (FDR), and W-C-MLSR (SDR at level 3) models of SOM content

The W-C-MLSR model fitted the calibration set reasonably well (\( R^2 \) = 0.869, RMSEC = 7.48) but had a large validation error (MREV = 36.41 %), indicating that the W-C-MLSR method is unreliable. This is probably due to the complicated spectra of poor-quality soils in coal mining regions and the information loss caused by band screening. The C-PLSR model gave satisfactory results in predicting SOM content (RMSEV = 7.74, MREV = 24.83 %; Table 2), but the W-C-PLSR model produced smaller errors for both calibration (\( R^2 \) = 0.970, RMSEC = 3.10, MREC = 8.75 %) and validation (RMSEV = 5.85, MREV = 14.32 %). As shown in Fig. 7a, W-C-PLSR samples were almost all near the 1:1 line, a much better performance than the C-PLSR model (Fig. 7b). In summary, in an environment such as the coal mining region of Datong, W-C-PLSR generated less error in the prediction of SOM content, showing that the W-C-PLSR method can augment useful information that was previously obscured by noise and greatly improve the accuracy of SOM estimates.

Fig. 7 Calibration and validation results using (a) the W-C-PLSR method based on SDR with three levels and (b) the C-PLSR model based on initial FDR

Conclusions

To establish satisfactory SOM estimation models, the issue of how to reduce noise while retaining as much useful information as possible was investigated herein. By carefully applying wavelet, correlation, and PLSR techniques, the potential of the W-C-PLSR method for rapid SOM quantification was studied. According to our analysis, the results of previous studies, and the soil spectral continuum-removed curves in Fig. 3b, 400–1006 nm was determined as the main SOM spectral response region. The W-C-PLSR model was built on the 289 sensitive bands of SDR with three levels, whose correlation coefficients were significant (P < 0.01). For comparison, we built the W-C-MLSR model based on the same 289 sensitive bands of the SDR with three levels and the C-PLSR model based on 338 sensitive bands of the initial FDR. The results indicate that the W-C-MLSR method was very unreliable. The C-PLSR model produced favorable predictions of SOM content. However, the W-C-PLSR model performed the best, giving smaller prediction errors than C-PLSR. In conclusion, the W-C-PLSR method has great potential to monitor SOM in coal mining restoration regions.