The most common way of estimating soil properties from pre-processed spectra is by calibrating a statistical model. If the response of the spectra at a particular wavelength follows the Beer-Lambert law, the absorbance at that wavelength is proportional to the concentration of a soil property. In this case, a linear model can be fitted between this wavelength and the measured values of the soil property. In most cases, however, the response of the spectrum follows a complex form, i.e. the concentration of a soil property is related to several interacting wavelengths and overlapping regions of the spectrum. In recent years, chemometric methods based on multivariate statistical models and machine learning algorithms have therefore considered the entire spectrum as a predictor. Because there are many hundreds of predictor variables (wavelengths), these methods are described as multivariate. Multivariate models can be calibrated using the whole spectrum, with the target variables being measured values of soil properties.

Predictions made by a calibrated model need to be validated. Three common validation methods exist: data splitting, cross-validation and additional probability sampling. In digital soil spectroscopy, one most often has only a single dataset for both calibration and validation: collecting an additional probability sample to validate soil spectral models independently is generally not feasible. Both data splitting and cross-validation are sub-optimal compared to collecting an additional probability sample because the information contained in the dataset cannot be fully exploited during calibration. In most cases, unfortunately, they are the only option.

In data splitting, the dataset is split into two subsets, generally containing 75% and 25% of the data, which are used for calibration and validation, respectively. In cross-validation, the dataset is split into K subsamples; K − 1 subsamples are used for calibration, and validation statistics are computed on the subsample left aside. The procedure is repeated so that each soil sample is used once for validation. Cross-validation should be preferred over simple data splitting, especially when the dataset is small (Brus et al. 2011).

In this chapter, we use data splitting for the sake of demonstration. More information on validation methods can be found in the statistical (e.g. Friedman et al. 2001, Chapter 7) or pedometrics (e.g. Brus et al. 2011) literature.

The set of packages used in this chapter is installed using the lines below. The book-associated soilspec package is also required; see Chap. 3 for information on its installation.
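A sketch of the installation, assuming the packages used in the examples of this chapter (the exact list may differ):

install.packages(c("pls", "Cubist", "randomForest", "resemble",
                   "caret", "soiltexture", "prospectr"))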

9.1 Goodness of Fit Measures

The process of creating a model begins with calibration. After a model is calibrated, we can use it to make predictions on new samples. An important step in the analysis is to evaluate the calibrated model by predicting the value of the soil property and comparing it with its associated measured value.

In the following sections, we will review the most common indicators of the quality of a prediction made by a calibrated model. These indicators are routinely employed in soil spectroscopy and in the general statistical modelling literature.

Root mean square error (RMSE)

The root mean square error (RMSE) is the standard deviation of the residuals between observed and predicted values of a variable. The RMSE evaluates the dispersion of the residuals: it tells how concentrated the data are around the line of best fit between observed and predicted values. RMSE values are non-negative; a value of 0 indicates a perfect fit between observed and predicted values. The RMSE depends on the scale of the data and is therefore not suitable for comparison between datasets. In general, the lower the RMSE, the better the fit between observed and predicted values. The RMSE is computed as follows:

$$\displaystyle \begin{aligned} \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{pred}}_{i})^{2}}, {} \end{aligned} $$
(9.1)

where n is the validation sample size and obs and pred are vectors of observed and predicted values of the soil properties, respectively. The RMSE can simply be derived in R by:
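For example, assuming obs and pred are numeric vectors of observed and predicted values:

rmse <- sqrt(mean((obs - pred)^2))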

Mean error (ME) or bias

The mean error (ME) is often computed along with the RMSE to assess the bias of the predictions. The ME is simply the average of all the errors between predictions and observations. Ideally, the value of the ME is zero, which indicates no bias in the prediction. Note that a ME of zero does not indicate that there is no error (the positive and negative errors cancel out), but that there is no systematic bias in the predictions made by the model. The ME is computed as follows:

$$\displaystyle \begin{aligned} \mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{pred}}_{i}), {} \end{aligned} $$
(9.2)

which in R gives:
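Using the same obs and pred vectors as above:

me <- mean(obs - pred)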

Squared correlation coefficient (r2)

Pearson's squared correlation coefficient (r2) is commonly used to assess the dispersion around the regression line. In other words, the r2 represents the strength of the linear association between observed and predicted values with respect to the fitted regression line. The r2 takes values between 0 and 1. It is computed as follows:

$$\displaystyle \begin{aligned} r^{2} = \left(\frac{\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{\overline{obs}}})({\mathrm{pred}}_{i}-{\mathrm{\overline{pred}}})}{\sqrt{\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{\overline{obs}}})^{2}}\sqrt{\sum_{i=1}^{n}({\mathrm{pred}}_{i} - {\mathrm{\overline{pred}}})^{2}}}\right)^{2}. {} \end{aligned} $$
(9.3)

In R, this can be efficiently derived using the cor function from the stats package:
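A one-line sketch with the same vectors:

r2 <- cor(obs, pred)^2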

While the r2 is widely used, there is general confusion in the literature about what an r2 is and how to compute it. When the r2 is computed as the squared Pearson's r correlation coefficient, it measures the closeness of fit to the fitted linear regression line between observed and predicted values, but not the closeness to the 1:1 line of observed versus predicted values, which is what matters when validating. The r2 is not sensitive to the departure of the fitted regression line from the 45-degree line of agreement. It is therefore often not recommended to compute the r2 as the squared correlation coefficient, in particular when predictions are biased.

Coefficient of determination (R2)

The coefficient of determination (R2) is the amount of variance explained by the model. The R2 quantifies the improvement made by the model over simply using the mean of the observations as prediction (Janssen and Heuberger 1995). In the literature, the R2 is sometimes referred to as the amount of variance explained, a modelling efficiency coefficient (Wadoux et al. 2018), a skill score (Nussbaum et al. 2017) or a Nash-Sutcliffe model efficiency coefficient (Nash and Sutcliffe 1970). As for the r2, its optimal value is 1, but it can be negative if the root mean square error exceeds the standard deviation of the data. It is computed as follows:

$$\displaystyle \begin{aligned} \mathrm{R2} = 1-\frac{\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{pred}}_{i})^{2}}{\sum_{i=1}^{n}({\mathrm{obs}}_{i}-{\mathrm{\overline{obs}}})^{2}} {} \end{aligned} $$
(9.4)

which is equal to 1 − SSE/SST, where SSE is the sum of squared errors and SST the total sum of squares. The R2 is derived in R by:
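A direct transcription of Eq. (9.4), using the obs and pred vectors defined earlier:

R2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)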

Lin's concordance coefficient (ρc)

The concordance correlation coefficient (ρc) was introduced by Lawrence and Lin (1989) to assess the agreement between observed and predicted values with respect to the 1:1 line. If the predictions are in perfect agreement with the observations, all the points fall on the 1:1 line. The ρc is given by:

$$\displaystyle \begin{aligned} \rho_{c} = \frac{2r\sigma_{\mathrm{{pred}}}\sigma_{\mathrm{obs}}}{\sigma_{\mathrm{obs}}^{2}+\sigma_{\mathrm{{pred}}}^{2}+(\mu_{\mathrm{obs}} - \mu_{\mathrm{{pred}}})^{2}} = rC_{b}, {} \end{aligned} $$
(9.5)

where r is Pearson's correlation coefficient, σobs and σpred are the standard deviations of the observed and predicted values (so that rσobsσpred is the covariance between observed and predicted values) and μobs and μpred are the means. Lawrence and Lin (1989) have shown that ρc reduces to rCb, where Cb is the bias correction factor defined as:

$$\displaystyle \begin{aligned} C_{b} = \left(\frac{v+1/v+u^{2}}{2}\right)^{-1}, {} \end{aligned} $$
(9.6)

with v = σpred/σobs being the scale shift and \(u=(\mu _{\mathrm {pred}}-\mu _{\mathrm {obs}})/\sqrt {\sigma _{\mathrm {pred}}\sigma _{\mathrm {obs}}}\) being the location shift relative to the scale. In other terms, ρc assesses the correlation between observed and predicted values, with a bias correction. The ρc can be implemented in R by:
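A minimal sketch following Eq. (9.5); the function name rhoC is our own choice:

rhoC <- function(obs, pred) {
  # means, variances and covariance of observed and predicted values
  muObs <- mean(obs); muPred <- mean(pred)
  varObs <- mean((obs - muObs)^2)
  varPred <- mean((pred - muPred)^2)
  covOP <- mean((obs - muObs) * (pred - muPred))
  2 * covOP / (varObs + varPred + (muObs - muPred)^2)
}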

There are several implementations for computing ρ c with associated confidence intervals and p-value, and the reader can find examples in the DescTools package with the CCC function or in the epiR package with the epi.ccc function.

Ratio of performance to deviation (RPD)

The ratio of performance to deviation (RPD) was proposed by Williams and Thompson (1978) as the ratio of the standard deviation of the observations to the standard error of prediction. The objective of the RPD is to scale the error in prediction with the standard deviation of the property. It is widely used in the infrared spectroscopy literature to assess the goodness of fit of calibration models.

The RPD is calculated as follows:

$$\displaystyle \begin{aligned} \text{RPD}=\frac{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}({\mathrm{obs}}_{i} - \mathrm{\overline{obs}})^{2}}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{pred}}_{i})^{2}}}, {} \end{aligned} $$
(9.7)

which is equivalent to sd(obs)/RMSE(obs, pred). This metric and its systematic use have been criticized, in particular because the standard deviation of the soil property used to scale the error is misleading in the case of skewed or non-normal observations. The RPD is computed in R by:
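Using the rmse value computed earlier:

rpd <- sd(obs) / rmse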

It can be seen that the RPD is directly related to the coefficient of determination (R2): the R2 is based on variance, while the RPD is based on standard deviations. If we assume a normal distribution, then RPD = 1/√(1 − R2). We can test it (Fig. 9.1).
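A sketch of this test; the sequence of R2 values is arbitrary:

R2seq <- seq(0, 0.99, by = 0.01)
RPDseq <- 1 / sqrt(1 - R2seq)
plot(R2seq, RPDseq, type = "l", xlab = "R2", ylab = "RPD")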

Fig. 9.1

Values of the R2 (x-axis) against those of the RPD (y-axis) computed on the validation dataset

Note that when R2 > 0.8, the RPD rapidly increases for larger R2 values. This suggests that a small increase in accuracy could be interpreted as a large boost in RPD.

Ratio of performance to inter-quartile distance (RPIQ)

The ratio of performance to inter-quartile distance (RPIQ) was proposed by Bellon-Maurel et al. (2010) to account for a possibly non-normal distribution of the observations. The RPIQ is similar to the RPD, but it uses the inter-quartile range to represent the spread of the observations. It is therefore less sensitive to the statistical distribution of the observations. The values of the RPIQ can be interpreted the same way as those of the RPD. The RPIQ is given by:

$$\displaystyle \begin{aligned} \text{RPIQ} = \frac{(Q_3(\mathrm{obs}) - Q_{1}(\mathrm{obs}))}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}({\mathrm{obs}}_{i} - {\mathrm{pred}}_{i})^{2}}}, {} \end{aligned} $$
(9.8)

where Q1(obs) and Q3(obs) are the first (25%) and third (75%) quartiles of the observations (Q3(obs) − Q1(obs) is the inter-quartile distance) and the denominator is the RMSE. In R, it can be computed as follows:
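Using quantile from base R and the rmse computed earlier:

rpiq <- unname(quantile(obs, 0.75) - quantile(obs, 0.25)) / rmse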

From the literature, it appears that various arbitrary limits of the RPD have been set to characterize a good model performance. For example, for agricultural products, Batten (1998) quoted that RPD values greater than 3 are useful for screening, values greater than 5 can be used for quality control and values greater than 8 can be used for any application. In soil science, Chang et al. (2001) made three other categories: Category A, RPD > 2.0; Category B, RPD 1.4–2.0; and Category C, RPD < 1.4. This was interpreted by other authors as the reference standard of model performance: excellent if RPD > 2 and non-reliable when RPD < 1.4. Other authors have slightly modified these limits to justify the quality of their predictions.

A soil scientist would say their model is excellent as it has RPD > 2, while a plant scientist would disagree as an excellent model should have RPD > 3. Limitations of RPD are described in Minasny and McBratney (2013), and the myth of RPD as a single measure of accuracy is summarized in the article of Esbensen et al. (2014): ‘It is a myth that the RPD statistics furthers an objective, across-model, comparative, unambiguous prediction validation figure-of-merit’.

We warn against using an arbitrary classification system to justify model performance, as the RPD, R2 and other measures can easily be affected by the distribution of the data. We can illustrate how these accuracy measures are sensitive to the data distribution. Consider a set of random numbers (Fig. 9.2):
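A sketch of this simulation; the seed is arbitrary:

set.seed(1)
obs <- runif(100)   # 100 random 'observations' between 0 and 1
pred <- runif(100)  # 100 random 'predictions', unrelated to obs
plot(obs, pred)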

Fig. 9.2

Set of 100 randomly generated values between 0 and 1

Obviously, the accuracy measures should indicate that there is no clear relationship between the two random variables. We use the eval function from the book-associated soilspec package for this.
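A hypothetical call, assuming eval takes the predicted values first and the observed values second:

library(soilspec)
eval(pred, obs, obj = "quant")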

##   ME RMSE r2 R2 rhoC  RPD RPIQ
## 1  0 0.38  0 -1 0.02 0.71 1.18

Let us now see the effect of adding two extreme observations to the data (Fig. 9.3).
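A sketch: two extreme values between 8 and 10 are appended to both vectors at the same positions, so that the added points appear correlated:

obs2 <- c(obs, 8.5, 9.5)
pred2 <- c(pred, 8.7, 9.8)
plot(obs2, pred2)
eval(pred2, obs2, obj = "quant")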

Fig. 9.3

Set of 100 randomly generated values between 0 and 1 and 2 values between 8 and 10

##     ME RMSE r2   R2 rhoC  RPD RPIQ
## 1 0.01 0.38  0 0.89 0.95 3.04 1.19

It is now visible that we can achieve RPD > 3, ρc > 0.9 and an R2 of almost 0.9, indicating that we have created an 'excellent' model. The RPIQ seems not to be much affected by the outliers.

For categorical variables, the overall accuracy

The overall accuracy (OA) measures the fraction of predictions that are correctly classified. Its optimal value is 1 (all predicted classes equal the observed classes), and it falls to 0 if none of the predicted classes equals the observed class. It is formally calculated as follows:

$$\displaystyle \begin{aligned} \text{OA} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}, {} \end{aligned} $$
(9.9)

where the numerator is the number of samples for which the predicted class equals the observed class and the denominator is the validation sample size. In R it is computed as follows:
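Assuming obsClass and predClass are factors of observed and predicted classes:

oa <- mean(as.character(predClass) == as.character(obsClass))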

The OA only calculates the overall accuracy, so if the data are imbalanced, i.e. one class dominates over the others, a predictive model can predict only the dominant class and still provide a high accuracy. To address this problem, Cohen's kappa statistic is used.

For categorical variables, Cohen’s kappa statistic

Cohen's kappa statistic (Cohen 1960) is another measure of the agreement between predicted and observed classes. It is often described as a comparison of the overall accuracy with the expected random chance accuracy. Cohen's kappa is defined as the difference between the overall accuracy and the random chance accuracy divided by 1 minus the random chance accuracy. The statistic can be negative, but more often lies between 0 and 1, where 1 indicates a perfect agreement between the predicted and observed classes and 0 no more agreement than expected by chance. It is derived as follows:

$$\displaystyle \begin{aligned} {\text{Cohen's kappa}} = \frac{\text{OA}-p_{e}}{1-p_{e}}, {} \end{aligned} $$
(9.10)

where OA is the overall accuracy derived previously and pe is the expected probability of chance agreement. In other words, pe is the expected random chance accuracy. In R, Cohen's kappa statistic is computed as follows:
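A minimal sketch computing pe from the confusion table, with the same obsClass and predClass as above:

tab <- table(predClass, obsClass)
oa <- sum(diag(tab)) / sum(tab)
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
kappa <- (oa - pe) / (1 - pe)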

9.2 Models for Quantitative Variables

A general problem with spectroscopy data is the large number of predictors, i.e. the number of spectral bands. The machine learning literature describes this as a 'large p, small n' problem. In addition, the spectral bands are highly correlated. One way of handling data with a high number of predictor variables, such as in infrared spectroscopy, is variable reduction. Another way is to select only relevant variables to use in the model (variable selection). Principal component and partial least squares regression (PLSR) methods are routinely used in chemometrics for variable reduction. Both boil down to a linear regression on components derived from the spectra by principal component analysis. The PLSR model is particularly useful for prediction purposes. We will first explore these linear models and then continue with machine learning using cubist and random forest models.

In this section, we use the spectra described in Chap. 3, pre-processed following the steps explained in the previous chapters (Fig. 9.4).
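A sketch of the data loading and pre-processing, assuming a moving-average smoothing of the absorbance (log 1/R) spectra with an arbitrary window size; the object name spcAMovav follows the code output shown later in this chapter:

library(soilspec)
library(prospectr)
data(datsoilspc)
# absorbance from reflectance, then moving-average smoothing
datsoilspc$spcAMovav <- movav(log(1 / datsoilspc$spc), w = 11)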

Fig. 9.4

Pre-processed absorbance spectra from the datsoilspc dataset provided in the book package soilspec

For the example, we further separate the data into calibration (75%) and validation (25%) by randomly splitting the dataset (Fig. 9.5).
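A sketch of the random split; the seed is arbitrary:

set.seed(123)
idx <- sample(nrow(datsoilspc), size = round(0.75 * nrow(datsoilspc)))
datC <- datsoilspc[idx, ]   # calibration (75%)
datV <- datsoilspc[-idx, ]  # validation (25%)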

Fig. 9.5

Histograms of the soil total carbon content from the Geeves et al. (1994) dataset provided in the book package soilspec for both calibration (left) and validation (right)

The distribution of the observed total carbon content is slightly skewed. In a real-world case study, we might want to test whether a transformation makes the carbon values look normally distributed. The slight skewness is not critical in this case study.

9.2.1 Principal Component Regression

Principal component regression (PCR) boils down to principal component analysis (Sect. 6.2) followed by multiple linear regression: the principal components of the spectra are computed, and a linear regression is built on the component scores. Recall the matrix of scores T obtained by PCA of the matrix of spectral variables X. In PCR, a linear regression model is fitted between the scores of the PCs of X and the soil property (response variable) y (if only one soil property is predicted) or Y of size n × c, where c is the number of soil properties of interest (c = 1 for a single response). The PCR model takes the following form:

$$\displaystyle \begin{aligned} \mathbf{Y} = \mathbf{T}\boldsymbol{\beta} + \mathbf{E}, {} \end{aligned} $$
(9.11)

where β denotes the vector of regression coefficients, found by ordinary least squares as \(\boldsymbol{\beta } = ({\mathbf{T}}^{T}\mathbf{T})^{-1}{\mathbf{T}}^{T}\mathbf{Y}\), and E is a matrix of residuals of the same size as Y.

The user must then decide how many components to use in the model. Usually, the optimal number of components is defined by cross-validation. The cross-validation can be done by using the pls package, but for illustration we will use the princomp and lm functions.
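A sketch of the PCA step. Note that we use prcomp rather than princomp in this sketch, because princomp requires more observations than variables, which is not the case for these spectra:

pcaSpec <- prcomp(datC$spcAMovav)
# cumulative percentage of variance explained by the components
varExpl <- 100 * cumsum(pcaSpec$sdev^2) / sum(pcaSpec$sdev^2)
plot(varExpl[1:20], type = "b",
     xlab = "Number of components", ylab = "Variance explained (%)")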

Figure 9.6 shows that around nine components capture more than 99% of the variation.

Fig. 9.6

Percentage of variance explained by the principal components against number of components

One can now fit a simple PC regression. First we specify the number of principal components to use in the model and then use the PC scores as independent variables and the soil property as dependent variable in a linear model.
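A sketch using nine components, as suggested by Fig. 9.6; the object name sdata matches the model summary below:

nc <- 9
sdata <- as.data.frame(pcaSpec$x[, 1:nc])
colnames(sdata) <- paste0("PC", 1:nc)
pcrModel <- lm(datC$TotalCarbon ~ ., data = sdata)
summary(pcrModel)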

##
## Call:
## lm(formula = datC$TotalCarbon ~ ., data = sdata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.2051 -0.4448 -0.0744  0.3529  4.5404
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.214198   0.046704  25.998  < 2e-16 ***
## PC1          0.027437   0.002848   9.635  < 2e-16 ***
## PC2          0.059561   0.004554  13.080  < 2e-16 ***
## PC3          0.001337   0.009975   0.134  0.89348
## PC4          0.157447   0.013955  11.282  < 2e-16 ***
## PC5          0.303688   0.018495  16.420  < 2e-16 ***
## PC6          0.014171   0.032577   0.435  0.66389
## PC7          0.115872   0.036931   3.138  0.00188 **
## PC8          0.400682   0.053989   7.422 1.36e-12 ***
## PC9         -0.043479   0.076195  -0.571  0.56871
## ---
## Signif. codes:  0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.7994 on 283 degrees of freedom
## Multiple R-squared:  0.7196, Adjusted R-squared:  0.7107
## F-statistic:  80.7 on 9 and 283 DF,  p-value: < 2.2e-16

We can now assess the goodness of the fit by plotting the observed versus predicted values of the total carbon, for both the calibration and validation datasets (Fig. 9.7).

Fig. 9.7

Scatterplot of observed versus predicted value of the total carbon. The predictions are made by principal component regression

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##   ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1  0 0.79 0.57 0.72 0.84 1.89 1.46

and for validation.

##     ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1 0.08 0.74 0.42 0.36  0.7 1.26 1.63

9.2.2 Partial Least Squares Regression

Partial least squares regression (PLSR) is a technique that attempts to combine PCA and multiple regression (Wold et al. 2001). It aims to predict a set of dependent variables (soil properties) by extracting from the spectra a set of ‘orthogonal’ factors (or latent variables) which give the best prediction. The components in partial least squares are determined by the predictor variables X (as in PCR) but also by the response variable(s) (y or Y if multiple responses). The PLSR model takes the following form:

$$\displaystyle \begin{aligned} \mathbf{X} = \mathbf{TP}^{T} + \mathbf{E}, {} \end{aligned} $$
(9.12)
$$\displaystyle \begin{aligned} \mathbf{Y} = \mathbf{UQ}^{T} + \mathbf{F}, {} \end{aligned} $$
(9.13)

where X, T and P are defined previously. T and U are score matrices of size n × d, where d is the number of components, for X and Y, respectively, and P (size b × d) and Q (size c × d) are the loading matrices of X and Y. Finally, E is the matrix of residuals of size n × b for X and F is the matrix of residuals of size n × c for Y. A regression between X and Y is obtained by:

$$\displaystyle \begin{aligned} \mathbf{U} = \mathbf{T}\boldsymbol{\beta}, {} \end{aligned} $$
(9.14)

where β is the vector of regression coefficients of the linear model. Substituting this relationship into the original model, predictions are obtained by:

$$\displaystyle \begin{aligned} \mathbf{Y} = \mathbf{UQ}^{T} = \mathbf{T}\boldsymbol{\beta}{\mathbf{Q}}^{T}. {} \end{aligned} $$
(9.15)

This is implemented in R with the plsr function from the pls package, made by Wehrens and Mevik (2007). As for the PCA and PCR, we must choose the optimal number of principal components (Fig. 9.8).
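A sketch of the call, assuming datC stores the spectra as a matrix column named spcAMovav; the maximum number of components is arbitrary:

library(pls)
pls_model <- plsr(TotalCarbon ~ spcAMovav, data = datC, ncomp = 20,
                  validation = "CV", method = "oscorespls")
# RMSEP and bias-adjusted RMSEP against the number of components
plot(pls_model, plottype = "validation")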

Fig. 9.8

Root mean square error of the prediction (RMSEP, black line) and bias-adjusted RMSEP (red dashed line) obtained by cross-validation against the number of components used in the PLSR model

More information on the RMSEP and bias-adjusted RMSEP values can be found in the vignette of the pls package. Note that we use method = "oscorespls", which is not the default of the plsr function; this choice is discussed later in this section. This figure, obtained from a 10-fold cross-validation on the calibration data, shows that 14 components produce a minimal RMSEP, so we use this number.

It is also possible to plot directly the predicted and observed values of the soil properties. The predictions are made using the fitted pls_model using nc principal components (Fig. 9.9).
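For example, using the plot method of the pls package with the nc components chosen above:

nc <- 14
plot(pls_model, ncomp = nc, asp = 1, line = TRUE)  # observed vs predicted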

Fig. 9.9

Scatterplot of observed against predicted values of the total carbon. The predictions are made by partial least squares regression

This method plots observed against predicted values based on the cross-validated predictions using nc components. Additional figures can be derived from the soilCPlsModel object. Note that the values on the x-axis are the indices of the wavelengths (Fig. 9.10).

Fig. 9.10

First three loadings of the principal components of the spectra

Figure 9.10 shows the first three loadings of the PLS components of the spectra. It gives an idea of which wavelengths have the most influence on each of the components, which can be used for spectral interpretation. In addition to the loadings, in PLSR the regression coefficients can be plotted. We do not use the default plot function of the pls package and instead manually select the regression coefficients for the case of nc = 14 (Fig. 9.11).
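A sketch extracting the coefficients with coef from the pls package:

regCoefs <- coef(pls_model, ncomp = nc)
plot(as.numeric(regCoefs), type = "l",
     xlab = "Wavelength index", ylab = "Regression coefficient")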

Fig. 9.11

Standardized regression coefficient of the PLSR model for predicting total carbon

This gives a more meaningful interpretation: the standardized coefficients of the regression. Wavelengths with a large (positive or negative) value of the regression coefficient are more influential in the prediction.

As an alternative to the regression coefficients and the principal component loadings, one can use the variable importance on projection (VIP) proposed by Wold et al. (1993). It measures the importance of each wavelength in the projection of the components and is calculated as a weighted sum of the squares of the PLS weights (Ng et al. 2019). Influential variables have a VIP value greater than 1. As there is no direct implementation in the pls package, we provide the code below. For more information, the reader is referred to the article of Wold et al. (1993). Note that the VIP calculation is valid only for the orthogonal scores algorithm (method = "oscorespls" in the pls package) and for a single-response PLSR model. By default, the plsr function in R uses the kernelpls algorithm.
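The sketch below follows the VIP code circulated with the pls package documentation; it returns a matrix of VIP values for each number of components:

VIP <- function(object) {
  if (object$method != "oscorespls")
    stop("Only implemented for the orthogonal scores algorithm")
  if (nrow(object$Yloadings) > 1)
    stop("Only implemented for single-response models")
  SS <- c(object$Yloadings)^2 * colSums(object$scores^2)
  Wnorm2 <- colSums(object$loading.weights^2)
  SSW <- sweep(object$loading.weights^2, 2, SS / Wnorm2, "*")
  sqrt(nrow(SSW) * apply(SSW, 1, cumsum) / cumsum(SS))
}
vip <- VIP(pls_model)
plot(vip[nc, ], type = "l", xlab = "Wavelength index", ylab = "VIP")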

Figure 9.12 shows that the important wavelengths in the projection of the PLS components to predict the total soil carbon are similar to the wavelengths found important in the previous figures using the standardized regression coefficient or the first three principal component loadings.

Fig. 9.12

Importance of each wavelength on projection of the principal components

Now that we have created the PLS model, we can use it to predict from the spectra of the calibration and validation datasets (Fig. 9.13).
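A sketch of the predictions on both datasets:

predC <- predict(pls_model, ncomp = nc)
predV <- predict(pls_model, newdata = datV, ncomp = nc)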

Fig. 9.13

Scatterplot of observed against predicted values of the total carbon. Predictions are made by a PLSR model. The left-hand side plot shows the observed against predicted values for the calibration dataset, while the right-hand side plot shows the observations against predictions for the validation dataset

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##   ME RMSE  r2   R2 rhoC  RPD RPIQ
## 1  0 0.56 0.8 0.86 0.92 2.63 2.04

and for validation.

##     ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1 0.02 0.52 0.64 0.69 0.85 1.81 2.34

Bagging PLSR

One way of 'strengthening' the PLSR prediction is to generate multiple models and average their predictions (making an ensemble model). Bootstrap aggregating, or bagging (Breiman 1996), manipulates the calibration data to generate different models. The bootstrap is a general statistical method used to assess the accuracy of a prediction by sampling the calibration data with replacement. Suppose the calibration dataset of size n is composed of predictors and response; we randomly generate B datasets by sampling the calibration data with replacement. For each bootstrap dataset, we fit a PLSR model. The bagging estimate is calculated as the average of all the model predictions.

Bagging therefore combines the outputs of many models to produce a powerful 'committee', which is useful when dealing with data with high variation, as each realization will produce a model that fits a particular subset of the data which may differ from other realizations. The important element of bagging is that perturbing the calibration data can cause significant changes in the predictor. The aggregated predictor averages the predictions over a collection of bootstrap samples, thereby reducing the variance of prediction. The accuracy of the prediction is increased when the prediction method is unstable, i.e. when small changes in the calibration data used in the bootstrap can result in large changes in the resulting predictor. Bagging was used by McBratney et al. (2006) for soil prediction and quantification of its uncertainty. The improvement over a single PLSR model may be small, but the ensemble can be more robust against noise in the spectra, and it is also possible to obtain uncertainty intervals of the prediction (Mevik et al. 2004).

A function called fitBagPlsr was created for this.

Based on our previous single PLS model, we use bagging to generate 50 bootstrapped models.
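A hypothetical call; the signature of fitBagPlsr is not shown here, so the argument names are assumptions:

bagModel <- fitBagPlsr(spec = datC$spcAMovav, y = datC$TotalCarbon,
                       nbag = 50, nc = nc)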

The output of the function contains model, the bootstrapped PLSR models; oobRmse, the mean out-of-bag RMSE; and calRmse, the mean calibration RMSE.

Out-of-bag (oob) is the internal validation used in the bootstrap. At each bootstrap iteration, a sample of size n of the original data is drawn with replacement, which means that about one-third of the data are not used in the calibration. These oob data are used to estimate the error of the model.

## [1] 0.7093668

## [1] 0.5187187

Now that we have created the bagged PLS model, we can use it to predict from the spectra of the calibration and validation datasets using the following function.
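A hypothetical call using the argument names described below:

bagPredV <- predictBagPlsr(modelBpls = bagModel$model,
                           newspec = datV$spcAMovav, nbag = 50, nc = nc)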

In the function predictBagPlsr, the argument modelBpls is the bagged PLSR model, newspec is the spectra of the validation set, nbag is the number of bootstrap samples, and nc is the number of PCs used in the PLSR. The function returns bagPred, the bagged predicted values; predAve, the mean of the predictions; and predStd, the standard deviation of the predictions (Fig. 9.14).

Fig. 9.14

Scatterplot of observed against predicted values of the total carbon. Predictions are made using a bagged partial least squares regression model

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##   ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1  0 0.56 0.79 0.86 0.92 2.64 2.04

and for validation.

##     ME RMSE   r2  R2 rhoC  RPD RPIQ
## 1 0.01  0.5 0.66 0.7 0.86 1.85 2.39

9.2.3 Cubist

Another way of handling high-dimensional data is to use variable selection techniques to find the best predictors. Once the high-dimensional data have been reduced to several components or important variables have been selected, these are used for prediction using either linear regression or data-mining tools. Regression trees, neural networks and support vector machines have been used for such predictions.

There are also data-mining tools designed to extract information from data containing a large number of variables and a large number of samples. This is potentially useful as the data reduction step is then not needed. Models that improve on regression trees have been proposed, including random forest (RF). While RF has been used successfully for prediction, the model form is complex, and interpretation can be difficult as no explicit formulae can be given.

Other forms of data-mining tools based on the idea of decision trees are regression rules, rule-based regression or the cubist model. This in effect transforms regression into a classification problem: the model consists of a set of rules, and each rule is associated with a linear model. The idea is similar to the regression tree algorithm, but while regression trees have a single value at each 'leaf', regression rules build a multivariate linear function. Regression rules are thus analogous to piecewise linear functions.

The cubist model takes the form of:

Rule 1: [10 cases, mean -0.96, range -1.77 to 0.66, est err 0.27]
    if
        R880 <= 0.0144
        R1610 > -0.921
    then
        outcome = -4.06 + 1.27 * R540 + 1.61 * R1610

The model has several rules. Each rule has a 'condition' (here, reflectance at 880 nm <= 0.0144 and at 1610 nm > -0.921); if this condition is met by the data, then the prediction is given by the associated linear function. The program also reports the statistics of each rule, which refer to the range of values of the predicted property and the error of the model.

Cubist was initially a commercial regression-rules program, but a public GNU version has since been released and ported to the Cubist R package by Kuhn et al. (2012). A cubist model is a set of comprehensible rules, where each rule has an associated linear model. Whenever a situation matches a rule's conditions, the associated model is used to calculate the predicted value. The first use of cubist for soil spectroscopy modelling was by Minasny and McBratney (2008).
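A sketch matching the call shown in the model summary below:

library(Cubist)
cubistModel <- cubist(x = datC$spcAMovav, y = datC$TotalCarbon)
summary(cubistModel)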

##
## Call:
## cubist.default(x = datC$spcAMovav, y = datC$TotalCarbon)
##
##
## Cubist [Release 2.07 GPL Edition]  Tue Aug 25 15:51:17 2020
## ---------------------------------
##
##     Target attribute ‘outcome’
##
## Read 293 cases (422 attributes) from undefined.data
##
## Model:
##
##   Rule 1: [106 cases, mean 0.347, range 0.06 to 1.94, est err 0.120]
##
##     if
##  1415 > -0.4015094
##     then
##  outcome = 4.663 - 447.07 850 + 1223.8 1410 - 1089.7 1400 + 326.8 860
##            - 240.31 2305 + 277.82 845 + 220.88 2310 - 493.2 1415
##            - 162.2 865 + 133.4 2165 + 458.8 1395 - 111.19 2170
##            + 150.9 2075 - 60 810 + 56.88 825 - 119.1 2100 - 36.37 625
##            + 35.28 630 - 55.7 2015 - 70.4 1815 + 65.8 1805 - 131.6 1455
##            + 32.19 2285 - 94.3 1405 + 22.26 800 + 95.1 1465 + 13.9 1940
##            + 39.6 1365 - 12.23 2380 + 54.4 1435 - 12.34 790 - 47.2 1430
##            + 9.65 2345 - 9.11 2350 - 9 2120 - 6 885 + 2.69 645
##            - 14.6 1445 + 5.7 2145 + 3.3 910
## etc... [shortened]

The summary of the model provides the full model. It also reports the rules in the model and the number of times (frequency) certain variables (wavelengths) are used as conditions and as predictors. We can plot these as an indicator of which wavelengths are useful in the model, as we will describe later.

The calibrated cubist model is then used to predict on both the calibration and validation data (Fig. 9.15).
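For example:

predC <- predict(cubistModel, newdata = datC$spcAMovav)
predV <- predict(cubistModel, newdata = datV$spcAMovav)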

Fig. 9.15

Scatterplot of observed against predicted values of the total carbon. Predictions are made using a cubist model

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##      ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1 -0.02 0.27 0.91 0.97 0.98 5.46 4.23

and for validation.

##     ME RMSE  r2   R2 rhoC RPD RPIQ
## 1 0.03 0.39 0.8 0.82 0.92 2.4 3.11

Note that the gap between the calibration and validation statistics of the cubist model is much larger than for PLSR, which suggests overfitting. From the model summary, we can also see that rules 4, 7 and 8 are fitted to only eight, five and seven observations, respectively. This could make the model overfit the data, so we need to simplify the model by setting fewer rules. We re-run the model using two rules by adding the option control = cubistControl(rules = 2).
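A sketch of the re-run with two rules:

cubistModel2 <- cubist(x = datC$spcAMovav, y = datC$TotalCarbon,
                       control = cubistControl(rules = 2))
summary(cubistModel2)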

##
## Call:
## cubist.default(x = datC$spcAMovav, y = datC$TotalCarbon)
##
##
## Cubist [Release 2.07 GPL Edition]  Tue Aug 25 15:51:17 2020
## ---------------------------------
##
##     Target attribute ‘outcome’
##
## Read 293 cases (422 attributes) from undefined.data
##
## Model:
##
##   Rule 1: [106 cases, mean 0.347, range 0.06 to 1.94, est err 0.120]
##
##     if
##  1415 > -0.4015094
##     then
##  outcome = 4.663 - 447.07 850 + 1223.8 1410 - 1089.7 1400 + 326.8 860
##            - 240.31 2305 + 277.82 845 + 220.88 2310 - 493.2 1415
##            - 162.2 865 + 133.4 2165 + 458.8 1395 - 111.19 2170
##            + 150.9 2075 - 60 810 + 56.88 825 - 119.1 2100 - 36.37 625
##            + 35.28 630 - 55.7 2015 - 70.4 1815 + 65.8 1805 - 131.6 1455
##            + 32.19 2285 - 94.3 1405 + 22.26 800 + 95.1 1465 + 13.9 1940
##            + 39.6 1365 - 12.23 2380 + 54.4 1435 - 12.34 790 - 47.2 1430
##            + 9.65 2345 - 9.11 2350 - 9 2120 - 6 885 + 2.69 645
##            - 14.6 1445 + 5.7 2145 + 3.3 910
##
##   Rule 2: [155 cases, mean 1.263, range 0.31 to 3.5, est err 0.280]
## etc... [shortened]

The calibrated cubist model is then used to predict on both the calibration and validation data (Fig. 9.16).

Fig. 9.16

Scatterplot of observed against predicted values of the total carbon. Predictions are made using a cubist model with only two rules

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##      ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1 -0.01  0.4 0.86 0.93 0.96 3.76 2.91

and for validation.

##     ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1 0.06 0.39 0.79 0.82 0.91 2.38 3.08

We can also infer which wavelengths are useful in the model based on the model summary. The following plot shows which variables are important as predictors and as ‘conditions’ (Fig. 9.17).
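A sketch; Cubist model objects store this information in the usage element:

# percentage of times each wavelength is used in conditions and models
head(cubistModel2$usage)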

Fig. 9.17

Important variables of the cubist model to predict total carbon. The black line is an example pre-processed absorbance spectrum from the datsoilspc dataset, the pink vertical lines are the variables (wavelength) used as predictors in the cubist model, and the vertical blue lines are the cubist model conditions

9.2.4 Random Forest

Random forest (RF) is a widely used algorithm for data science applications in a broad array of scientific domains. An RF model is an ensemble of decision trees organized as a set of structured classifiers and can be used for both regression and classification purposes (Breiman 2001). The advantage of RF over its much simpler data-mining counterpart, the CART (Classification and Regression Tree) model, is its ensemble approach. Rather than a single tree as in CART, the RF model has many trees, where each is constructed using different perturbations of the data, in terms of both calibration cases and explanatory variables. During the development of each tree of the ensemble, about two-thirds of the cases are sampled (bootstrap sampling with replacement) and used to grow a regression tree, and the remaining one-third is used to perform a cross-validation in parallel with the calibration step. These samples are called out-of-bag (oob) samples and are used to obtain an estimate of the model performance. For regression, the final prediction is the average of the individual tree outputs, whereas in classification the trees vote by majority on the classification (mode). Note the similarity of this cross-validation procedure with the bagged PLSR model described earlier in this chapter.

Random forest has three tuning parameters: mtry, the number of trees and the minimum node size. The first parameter, mtry, is the number of input variables that are randomly selected as candidates at each split, which can range from 1 to the number of predictors. The second parameter is the number of trees, which must be sufficiently large for the oob error to stabilize. In general, 500 trees are sufficient; choosing more trees will not change the results significantly but will require more time for model fitting. The last tuning parameter is the minimum node size, which determines the minimum size of nodes below which no split will be attempted; the default value is 5 for regression and 1 for classification.

The RF model is a natural candidate for soil spectral inference because of the high dimensionality of soil spectral data, for example, vis-NIR or MIR data. Ideally it is used in situations where the number of spectra is also large, as the model has a tendency to overfit, particularly if the tuning parameters mentioned previously are not optimized. The issue is generalization when the model is extended to new data, which is not a strong feature of the RF model. Its wide use in data science has also led some researchers to experiment with this model for soil spectral data with good results, e.g. Santana et al. (2018) and Hobley et al. (2017) as relatively recent examples.

In this example, we use the randomForest package.
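A sketch matching the call shown in the model summary below; datCSub is assumed to combine the response and the spectra in one data frame:

library(randomForest)
datCSub <- data.frame(TotalCarbon = datC$TotalCarbon, datC$spcAMovav)
soilCRFModel <- randomForest(TotalCarbon ~ ., data = datCSub, ntree = 1000,
                             mtry = 10, importance = TRUE,
                             na.action = na.omit)
soilCRFModel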

##
## Call:
##  randomForest(formula = TotalCarbon ~ ., data = datCSub, ntree = 1000,
##      mtry = 10, importance = TRUE, na.action = na.omit)
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 10
##
##           Mean of squared residuals: 0.4582793
##                     % Var explained: 79.18

The randomForest function saves the importance of the variables used to explain the variance of the soil property. For this, we specified importance = TRUE, and we can now plot the important variables of the soilCRFModel model. Two measures are proposed: the change in MSE and the change in node purity. The reader is referred to a statistical learning book (e.g. Friedman et al. 2001) for more information on these measures. Figure 9.18 shows the most important variables sorted in order of importance.
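For example:

varImpPlot(soilCRFModel)  # %IncMSE (left) and node purity (right)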

Fig. 9.18

Important variables used by the random forest model to explain the variability of the total carbon. The two methods are the mean decrease in accuracy (left) and mean decrease in node impurity (right). The first variables (bands) are the most important ones

Using the fitted RF model, we can generate predictions on the validation set and validate the predictions (Fig. 9.19).
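A sketch, with the validation spectra arranged as in the calibration data frame:

datVSub <- data.frame(datV$spcAMovav)
predV <- predict(soilCRFModel, newdata = datVSub)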

Fig. 9.19

Scatterplot of observed against predicted values of the total carbon. Predictions are made using a random forest model

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##   ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1  0  0.3 0.94 0.96 0.98 5.03 3.89

and for validation.

##    ME RMSE   r2   R2 rhoC RPD RPIQ
## 1 0.1 0.52 0.57 0.69 0.81 1.8 2.33

9.2.5 Memory-Based Learning

Memory-based learning (MBL) is a local calibration algorithm presented in Ramirez-Lopez et al. (2013a) and implemented in the package resemble. MBL has been developed to deal with complex, often continental or global, soil (infrared) spectral datasets. Instead of building a single (global) function to link the soil property and the spectra (as in PLSR or RF), the spectra are split into a number of subsets sharing similar spectral characteristics. In this sense, the technique is closely related to Chap. 7 because it makes use of distance metrics in the principal component space to select the closest points for building a local model. As a result, MBL can deal with complex non-linear relationships between the spectra and a particular soil property.

MBL works in the following stages:

  1. Build a p-dimensional space of the spectra, where p is the number of principal components.

  2. For each point of the validation dataset, select its k-nearest neighbours from the calibration dataset.

  3. Fit a local model for each point of the validation dataset using its nearest-neighbour spectra found in the calibration dataset. Several models are available for the fit.

Some aspects are therefore particularly important in any MBL algorithm. One must choose the similarity/dissimilarity metric used to select the closest points in the spectral space (see also Chap. 7). A second important point is the number of neighbours to use in the fitting process. This number must be sufficiently large to build a realistic model, but increasing the number of points too much might also decrease the prediction accuracy. The package resemble provides options to optimize the number of neighbours.

In the function mbl, the user must then decide:

  • the similarity/dissimilarity metric diss_method, in this example the Mahalanobis distance in the PC space (diss_method = "pca");

  • the choice of the optimal number of PCs under the argument pc_selection; in this example, the PCs are selected based on the cumulative amount of variance explained;

  • the model for fitting, in our example the weighted average PLS model (method = local_fit_wapls(min_pls_c = 4, max_pls_c = 17));

  • the validation method, in this example the leave-nearest-neighbour-out cross-validation (validation_type = "NNv" in the mbl_control argument).

We further define a sequence of numbers of nearest neighbours k to test (in order to find the optimal number of neighbours for the local model fitting). Here we test a sequence from 20 up to 120 in steps of 10.
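For example:

k2t <- seq(20, 120, by = 10)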

We can now perform the model fitting using MBL and the mbl function. We use the object control, which contains the parameters of the mbl function. We make use of a regression function called weighted average partial least squares (wapls). It uses multiple models generated by multiple numbers of PLS components (i.e. between a minimum and a maximum number of PLS components). At each local partition, the final predicted value is a weighted average of all the predicted values generated by the multiple PLS models. See the package documentation for more details.
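A sketch mirroring the call shown in the output below; the cumulative variance threshold maxexplvar is an assumed value:

library(resemble)
control <- mbl_control(validation_type = "NNv")
maxexplvar <- 0.99  # assumed cumulative variance threshold
mblModel <- mbl(Xr = datC$spcAMovav, Yr = datC$TotalCarbon,
                Xu = datV$spcAMovav, Yu = NULL, k = k2t,
                method = local_fit_wapls(min_pls_c = 4, max_pls_c = 17),
                diss_method = "pca", diss_usage = "none",
                pc_selection = list("cumvar", maxexplvar),
                control = control, center = TRUE, scale = FALSE)
mblModel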

We can summarize the information derived and plot the results.

##
## Call:
##
## mbl(Xr = datC$spcAMovav, Yr = datC$TotalCarbon, Xu = datV$spcAMovav,
##     Yu = NULL, k = k2t, method = local_fit_wapls(min_pls_c = 4,
##         max_pls_c = 17), diss_method = "pca", diss_usage = "none",
##     pc_selection = list("cumvar", maxexplvar),
##     control = mbl_control(validation_type = "NNv"),
##     center = TRUE, scale = FALSE)
##
## _______________________________________________________
##
##  Total number of observations predicted: 98
## _______________________________________________________
##
##  Nearest neighbor validation statistics
##
##       k  rmse st_rmse    r2
##  1:  20 0.530  0.0833 0.784
##  2:  30 0.444  0.0699 0.843
##  3:  40 0.523  0.0823 0.846
##  4:  50 0.542  0.0853 0.850
##  5:  60 0.527  0.0829 0.848
##  6:  70 0.556  0.0874 0.829
##  7:  80 0.549  0.0863 0.828
##  8:  90 0.568  0.0894 0.834
##  9: 100 0.549  0.0863 0.850
## 10: 110 0.550  0.0864 0.848
## 11: 120 0.535  0.0841 0.851
## _______________________________________________________

Figure 9.20 shows the number of neighbours against the RMSE. The optimal number of neighbours that minimizes the RMSE is 30.

Fig. 9.20

Values of the root mean square error against number of neighbours in the memory-based learning algorithm

Instead of assuming that the total carbon content of the validation set is unknown (the NULL in the Yu argument), we can supply the actual total carbon values of this dataset. Note that they are not used at all during computation; they are only used for validation purposes (i.e. comparing the predictions with the actual values).
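A sketch of the same call with Yu supplied and the optimal k of 30:

mblModel2 <- mbl(Xr = datC$spcAMovav, Yr = datC$TotalCarbon,
                 Xu = datV$spcAMovav, Yu = datV$TotalCarbon, k = 30,
                 method = local_fit_wapls(min_pls_c = 4, max_pls_c = 17),
                 diss_method = "pca", diss_usage = "none",
                 pc_selection = list("cumvar", maxexplvar),
                 control = control, center = TRUE, scale = FALSE)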

We can plot the predicted and observed values of the total carbon as done before for the other calibration algorithms (Fig. 9.21).

Fig. 9.21

Scatterplot of observed against predicted values of the total carbon. The predictions are made by MBL using a weighted local PLS model

We derive some accuracy measures, for example using the eval function and specifying the argument obj = "quant" for continuous variables:

##     ME RMSE   r2   R2 rhoC  RPD RPIQ
## 1 0.13  0.5 0.65 0.71 0.85 1.86 2.42

9.3 Models for Categorical Variables

Soil properties such as organic carbon or clay are continuous. Some other soil properties or attributes are categorical, that is, they are assigned to a finite number of groups. Discrete values can also be considered as categorical if their number is limited. Examples of categorical soil properties are soil texture classes, mineral categories or soil types.

For prediction of categorical soil properties, the models need to be adapted. Since predicting categorical properties with spectroscopy is relatively rare in soil science, we will present two models, which are adaptations of models already presented in the previous section on modelling continuous properties.

As an example, we use the continuous values of clay, silt and sand provided in the book-associated Geeves dataset and convert them to soil texture classes. The dataset originates from Australia (see also Chap. 3), hence the use of the Australian soil texture triangle (Northcote 1971; National Committee on Soil and Terrain (Australia) and CSIRO Publishing 2009) to define the class names. The Australian soil texture triangle has 11 classes. We show below how to convert the soil clay, silt and sand content to soil texture classes using the soiltexture R package.

We start by plotting the Australian soil texture triangle and the points from our dataset (Fig. 9.22).
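A sketch, assuming the soiltexture naming convention of CLAY, SILT and SAND columns in percent and the Australian triangle class system AU2.TT:

library(soiltexture)
texC <- data.frame(CLAY = datC$clay, SILT = datC$silt, SAND = datC$sand)
texV <- data.frame(CLAY = datV$clay, SILT = datV$silt, SAND = datV$sand)
TT.plot(class.sys = "AU2.TT", tri.data = rbind(texC, texV))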

Fig. 9.22

Australian soil texture triangle with location of the calibration (black dots) and validation (red dots) soil texture classes within the triangle

We can assign the name of the Australian soil texture class to the calibration and validation datasets. By default, the points that fall on the border of two classes are assigned the names of both classes. For the demonstration, we remove these points. In most cases, however, it would be more judicious to make further diagnostics on these soil samples and assign them a single class.
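A sketch of the class assignment; PiC.type = "t" returns the class names as text, and points on a boundary return two names (these are removed, as described above):

classC <- TT.points.in.classes(tri.data = texC, class.sys = "AU2.TT",
                               PiC.type = "t")
classV <- TT.points.in.classes(tri.data = texV, class.sys = "AU2.TT",
                               PiC.type = "t")
unique(classC)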

##  [1] "Cl"     "Lo"     "SiLo"   "SaLo"   "ClLo"   "LoSa"   "SiClLo" "SaClLo"
##  [9] "SiCl"   "SaCl"

Now we will show how to use two models for predicting soil texture classes.

9.3.1 Partial Least Squares Discriminant Analysis

Partial least squares discriminant analysis (PLS-DA, Barker and Rayens 2003) is a linear classification model which builds on conventional PLS regression. PLS-DA is a PLS regression between two matrices. The first is the matrix X of size n × b containing the spectra, where n is the sample size and b is the number of spectral bands. The second is a dummy matrix Y of size n × c, where c is the total number of classes (e.g. soil texture classes). The columns of Y represent class membership, that is, a value of 1 or 0 depending on whether the soil sample belongs to the class. The PLS-DA model is then calibrated like PLS, by a linear regression between the rotated scores of the principal components of the X and Y matrices. The calibrated model predicts a value between 0 and 1, which is assigned to the closest class by a softmax function. A discussion of PLS-DA is provided by Brereton and Lloyd (2014).

In this example, we use the caret package which provides functionalities for cross-validation.
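A sketch of the model fitting, assuming the texture classes obtained above (with boundary points removed) are stored in classC:

library(caret)
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
plsdaModel <- train(x = datC$spcAMovav, y = factor(classC),
                    method = "pls", tuneLength = 30, trControl = tc,
                    preProcess = c("center", "scale"))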

We can display a summary of the fitted PLS-DA model.

## Partial Least Squares
##
## 290 samples
## 421 predictors
##  10 classes: ’Cl’, ’ClLo’, ’Lo’, ’LoSa’, ’SaCl’, ’SaClLo’, ’SaLo’, ’SiCl’, ’SiClLo’, ’SiLo’
##
## Pre-processing: centered (421), scaled (421)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 260, 258, 260, 261, 263, 263, ...
## Resampling results across tuning parameters:
##
##   ncomp  Accuracy   Kappa
##    1     0.5534921  0.3461652
##    2     0.5593630  0.3558506
##    3     0.5593189  0.3567091
##    4     0.5650486  0.3656058
##    5     0.5650045  0.3654928
##    6     0.5650045  0.3660005
##    7     0.5708395  0.3736145
##    8     0.5744451  0.3835748
##    9     0.5823474  0.3970096
##   10     0.5640763  0.3746830
##   11     0.5738831  0.3908864
##   12     0.5758560  0.3937251
##   13     0.5850061  0.4117775
##   14     0.5668808  0.3889140
##   15     0.5726905  0.3977221
##   16     0.5853583  0.4178228
##   17     0.5868221  0.4213265
##   18     0.5960500  0.4360405
##   19     0.6052095  0.4509891
##   20     0.6046920  0.4543307
##   21     0.6234867  0.4825286
##   22     0.6270725  0.4894305
##   23     0.6304389  0.4948148
##   24     0.6418556  0.5108486
##   25     0.6484922  0.5213268
##   26     0.6416819  0.5124769
##   27     0.6437369  0.5150439
##   28     0.6414408  0.5116607
##   29     0.6357714  0.5035562
##   30     0.6295964  0.4961511
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 25.

The train function of the caret package automatically selects the optimal number of principal components based on the overall accuracy evaluated by the cross-validation subset left out (Fig. 9.23).

Fig. 9.23

Number of components used in the partial least squares discriminant analysis model against accuracy computed from a repeated cross-validation

The train function selected 25 components as the optimal number. The fitted PLS-DA model can now be applied to unseen data.
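For example:

predProb <- predict(plsdaModel, newdata = datV$spcAMovav, type = "prob")
head(predProb)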

##          Cl       ClLo         Lo       LoSa       SaCl     SaClLo       SaLo
## 1 0.1201323 0.16463553 0.07922130 0.09377720 0.08920005 0.08531693 0.08447906
## 2 0.2054752 0.10797734 0.06762277 0.10687099 0.08748491 0.08216171 0.07374156
## 3 0.1930218 0.09038021 0.08228958 0.09765407 0.08766723 0.08802700 0.09041075
## 4 0.1163598 0.07402499 0.11089104 0.14420159 0.08809790 0.08977505 0.10009583
## 5 0.2927797 0.05545218 0.09967716 0.07337192 0.08037484 0.08205305 0.07691914
## 6 0.1025771 0.06678866 0.23378266 0.08995787 0.08500366 0.08440278 0.08189733
##         SiCl     SiClLo       SiLo
## 1 0.09069487 0.09012774 0.10241505
## 2 0.08595896 0.08518364 0.09752289
## 3 0.08966126 0.09750844 0.08337970
## 4 0.08502138 0.08641169 0.10512070
## 5 0.08054953 0.07722688 0.08159563
## 6 0.08272078 0.08220722 0.09066194

Here we can see that the output is a number between 0 and 1. As mentioned previously, the PLS-DA model returns a probability-like value of the assignment to each class. For example, the fifth soil sample is more likely to belong to the class Cl (probability of 0.29) than to the other classes (probability below 0.1 for all other classes). To obtain the final result, the softmax function is applied to the probability values to return a single value, that is, the membership of a class. This is done by default in the caret package.

We can now visualize the predictions and validate them against the measured values of the classes.
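A sketch using confusionMatrix from caret:

predClass <- predict(plsdaModel, newdata = datV$spcAMovav)
confMat <- confusionMatrix(data = predClass, reference = factor(classV))
confMat$table
confMat$overall[1:2]  # Accuracy and Kappa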

##           Reference
## Prediction Cl ClLo Lo LoSa SaCl SaClLo SaLo SiCl SiClLo SiLo
##     Cl     21    1  2    0    0      0    0    0      0    0
##     ClLo    6    5  7    0    0      0    0    0      0    0
##     Lo      0    2 24    0    0      0    0    0      0    0
##     LoSa    0    0  5    7    0      0    0    0      0    0
##     SaCl    0    0  0    0    0      0    0    0      0    0
##     SaClLo  0    1  0    0    0      0    0    0      0    0
##     SaLo    0    1  3    2    0      0    0    0      0    0
##     SiCl    2    0  1    0    0      0    0    0      0    0
##     SiClLo  0    0  0    0    0      0    0    0      0    0
##     SiLo    0    0  6    1    0      0    0    0      0    0

##  Accuracy     Kappa
## 0.5876289 0.4584787

The confusion matrix confMat shows that there is on average a good agreement between predicted and measured soil texture classes. For example, 21 of the reference Cl samples are correctly classified and 8 are assigned to a different class, 6 of which to the most similar class (ClLo). The overall accuracy and Cohen's kappa values show that the predictions are reasonably accurate.

9.3.2 Random Forest

Random forest applied to categorical variables is similar to the model applied to continuous variables. Note that parameter tuning is not performed in this example and that, for example, mtry is held constant. The reader will find a large number of examples of parameter tuning in the package vignette. Parameter tuning is a key element of the modelling, in particular to avoid overfitting of the model or an excessive number of trees, without compromising prediction accuracy.
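A sketch; the fixed mtry value (here the square root of the number of predictors) approximates the one shown in the summary below:

rfClassModel <- train(x = datC$spcAMovav, y = factor(classC),
                      method = "rf", trControl = tc,
                      tuneGrid = data.frame(mtry = sqrt(ncol(datC$spcAMovav))))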

## Random Forest
##
## 290 samples
## 421 predictors
##  10 classes: ’Cl’, ’ClLo’, ’Lo’, ’LoSa’, ’SaCl’, ’SaClLo’, ’SaLo’, ’SiCl’, ’SiClLo’, ’SiLo’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 260, 258, 260, 261, 263, 263, ...
## Resampling results:
##
##   Accuracy   Kappa
##   0.6269545  0.5055345
##
## Tuning parameter ’mtry’ was held constant at a value of 20.54264

Let us now apply the calibrated random forest model to the validation data and display the overall accuracy and Cohen’s kappa statistics. Note that the same statistics can be obtained by using the book-associated package soilspec with the eval function.
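For example:

predClassRF <- predict(rfClassModel, newdata = datV$spcAMovav)
confMatRF <- confusionMatrix(data = predClassRF, reference = factor(classV))
confMatRF$table
confMatRF$overall[1:2]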

##           Reference
## Prediction Cl ClLo Lo LoSa SaCl SaClLo SaLo SiCl SiClLo SiLo
##     Cl     22    2  0    0    0      0    0    0      0    0
##     ClLo    8    8  2    0    0      0    0    0      0    0
##     Lo      0    1 20    4    0      0    1    0      0    0
##     LoSa    0    0  0   12    0      0    0    0      0    0
##     SaCl    0    0  0    0    0      0    0    0      0    0
##     SaClLo  0    1  0    0    0      0    0    0      0    0
##     SaLo    0    0  6    0    0      0    0    0      0    0
##     SiCl    3    0  0    0    0      0    0    0      0    0
##     SiClLo  0    0  0    0    0      0    0    0      0    0
##     SiLo    0    0  4    1    0      0    0    0      0    2

##  Accuracy     Kappa
## 0.6597938 0.5641933

9.4 Soil Spectral Inference Systems

Another way to estimate soil properties is to combine the spectra with pedotransfer functions (PTF, Bouma 1989) in a soil spectral inference system (McBratney et al. 2006). Several soil physical, chemical and biological properties cannot be estimated directly from the spectra, as is done in Sects. 9.2 and 9.3. The reasons are that (i) some soil properties do not have a clear spectral response in the spectrum and (ii) the development of calibration functions for a soil property from soil spectral libraries is not always possible due, for example, to budget constraints. McBratney et al. (2006) thus proposed a two-step approach to estimate a large range of soil properties by combining spectroscopy and PTFs, as follows:

  • Step 1, calibration: a multivariate model is built between the spectra and the measured values of some basic soil properties (that have been shown to demonstrate a spectral response in the spectral region of interest). In the infrared, the basic soil properties are, for example, clay, silt, sand, organic carbon, pH and cation exchange capacity. This step can be implemented using one of the methods presented in Sects. 9.2 or 9.3. When a model is calibrated, it can be used to predict soil properties of a soil sample where only the spectrum is available.

  • Step 2, inference: the basic soil properties estimated in Step 1 are used as input in a PTF to predict a set of different soil properties, such as the permanent soil wilting point or field capacity. The PTF can be found in the literature or derived using a large soil database. Ideally, a PTF developed on similar soil is used to derive the soil properties. A number of PTFs have been published, for example, by McBratney et al. (2002) or Pachepsky and Rawls (2004). The PTFs can be concatenated into a network structure to create an inference system that estimates the target soil property or properties and also propagates the uncertainty of the estimate.

So one of the main features of soil spectral inference systems is the quantification and propagation of uncertainty. Model uncertainty, for example, can be quantified using an ensemble model by bootstrap and aggregating, for which an example using partial least squares regression is provided in Sect. 9.2.2.

Since Step 1 was already presented in Sects. 9.2 and 9.3, we do not repeat it here and give a simple example of Step 2 instead. Note also that uncertainty quantification and propagation are discussed elsewhere in the literature (e.g. in Tranter et al. 2010, Van der Klooster et al. 2011 or Brodský et al. 2013).

Take the following PTF from Rab et al. (2011) to estimate the volumetric field capacity (FC, m3/m3, %) as a function of basic soil properties. We chose this PTF because it has been derived using data from a case study in Australia, for an area in which the soils are similar to those of the Geeves dataset. The Geeves dataset is provided by the book-associated soilspec package. The PTF makes use of the soil clay and silt content.

$$\displaystyle \begin{aligned} Field~capacity = 7.759 + 0.7165 \times clay + 0.9708 \times silt - (0.01729 \times clay)\times silt {} \end{aligned} $$
(9.16)

We can use the Geeves dataset provided in the soilspec package. It contains values of the soil clay and silt content.
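For example:

library(soilspec)
data(datsoilspc)
names(datsoilspc)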

## [1] "clay"        "silt"        "sand"        "TotalCarbon" "spc"

Note that here the soil properties are available in our dataset, but in many cases they would be estimated from the spectra (Step 1) or a soil spectral library (Viscarra-Rossel et al. 2008). We can now derive the field capacity using the basic soil properties clay and silt (Fig. 9.24).
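A direct transcription of Eq. (9.16):

fieldCapacity <- 7.759 + 0.7165 * datsoilspc$clay +
  0.9708 * datsoilspc$silt -
  (0.01729 * datsoilspc$clay) * datsoilspc$silt
boxplot(fieldCapacity, ylab = "Field capacity (m3/m3, %)")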

Fig. 9.24

Boxplot of the estimated values of the field capacity (in m3/m3, %) of the Geeves dataset