Introduction

Near infrared (NIR) spectroscopy has been widely utilized for analyzing wood (Tsuchikawa and Kobori 2015; Schimleck and Tsuchikawa 2021). Among existing nondestructive methods of wood analysis (e.g., SilviScan, acoustics, resistograph), NIR spectroscopy is unique in that it can be used to estimate properties related to wood chemistry (Schimleck et al. 2019). From an industrial perspective, the most important of these properties is pulp yield (Greaves and Borralho 1996), which is defined (in percent) as the yield of pulp from a given volume of wood (Raymond et al. 2001). Traditional assessment of pulp yield, which involves cooking wood chips to a fixed Kappa number (lignin content) in a laboratory digestor, is time-consuming, costly, and limits the number of samples that can be examined (Raymond et al. 2001). NIR diffuse reflectance spectroscopy provides a viable alternative for estimation of pulp yield (Downes et al. 2011; Trung et al. 2015), as the variation at specific wavelengths in the range 1100–2500 nm is directly related to variation in pulp yield (Michell and Schimleck 1996). Indeed, it has been successfully applied in tree breeding programs to improve pulp yield of short rotation plantations that are established on a large-scale to provide wood for pulping facilities worldwide (Schimleck 2008).

NIR spectroscopic estimation of pulp yield involves two main components, namely data collection and model development. Spectra of milled wood from a large number of samples with known pulp yield are first collected. These samples are expected to typically represent individual trees or composites of several trees (Downes et al. 2006) and should ideally be from a variety of sites (Downes et al. 2011). Then, a partial least squares (PLS) regression model is built based on the collected spectral and pulp yield data. The model relates the spectral information of each sample to its pulp yield and is used to predict the pulp yield of unexamined samples.

Increasingly sophisticated approaches are available for model development (Cogdill et al. 2004; Mora and Schimleck 2010; Fernandes et al. 2013; Li et al. 2019; Nasir et al. 2019; Ayanleye et al. 2021); however, exploration of their use for pulp yield estimation, or for estimating other wood properties, is rare in wood-related research. Recent papers (Ho et al. 2021, 2022) have investigated the potential to utilize genetic algorithms (GA) in model development (Bangalore et al. 1996; Villar et al. 2014; De et al. 2017). Ho et al. (2021) focus on optimization of pulp yield models for Tasmanian blue gum (Eucalyptus globulus Labill.) samples, and Ho et al. (2022) expand the investigation to several loblolly pine (Pinus taeda L.) wood properties (density, microfibril angle, modulus of elasticity and tracheid coarseness, radial diameter, tangential diameter, and wall thickness), which were measured by SilviScan (Evans 1994, 1999, 2006). It is shown that the GA can improve PLS regression models in calibration and prediction by identifying critical wavelengths for model building. Furthermore, the most representative wavelengths selected by the GA consistently have band assignments arising from the wood components that directly impact pulp yield (e.g., cellulose, hemicellulose, or lignin) and SilviScan measured properties.

In both studies (Ho et al. 2021, 2022), the coefficients of determination of the PLS regression model for calibration and prediction sets (\({R}_{c}^{2}\) and \({R}_{p}^{2}\), respectively) and corresponding standard errors (SEC and SEP, respectively) are specified as constraint conditions. In effect, the performance indicators of optimized models have to be better than those in the model using all wavelengths with a small number of latent variables. The GA is employed to optimize an objective function consisting of \({R}_{c}^{2}\) and \({R}_{p}^{2}\) with equal weight to balance the optimization process equally in terms of calibration and prediction and to avoid overfitting. However, there are knowledge gaps in utilization of optimization methods to the wood properties prediction models based on NIR data. For example, how is the prediction performance of optimized models affected when different weight factors are used? Or how will the most representative wavelengths, which are selected from optimization process, change with different weight factors?

The objective of this paper is to address these questions by investigating the performance of optimized models and the most representative wavelengths with different weight factors in the objective function. The state-of-the-art global optimization solver pySOT (Eriksson et al. 2015), which implements a surrogate optimization technique as developed in Regis and Shoemaker (2007, 2013), is utilized in this paper to investigate these questions. The optimization process is illustrated by a flow chart shown in Fig. 1. In essence, pySOT evaluates the objective function at several points, internally constructs an approximation to the objective function based on this data and attempts to find points that either maximize the approximation (and hopefully maximize the objective) or explore areas where the objective function has not been evaluated. As the points evaluated are generated randomly at each iteration, pySOT also is a non-deterministic algorithm. Its advantage over the GA is that a smaller number of function evaluations are required and hence more practical for computationally expensive objective functions (Eriksson et al. 2019). Wavelength selection and number of PLS components result from the optimization process using pySOT, and, in addition, the performance of optimized model based on selected wavelengths and different weight factors are compared. Furthermore, a robustness analysis is conducted to suggest sensible choices of the weights used in the objective function.

Fig. 1
figure 1

Flowchart of pySOT optimization process for NIR wavelength selection

Materials and methods

Data sets

Two data sets (pulp yield-min and pulp yield-max) containing Tasmanian blue gum samples with different pulp yield variation are used as in Ho et al. (2021). Pulp yield-min contains 67 clonal blue gum samples of the same age and similar pulp yields (ranging from 50.8% to 55.8%), while pulp yield-max consists of 30 blue gum samples from different forests of various ages and with more diverse pulp yields (ranging from 37.6% to 60.2%) (Ho et al. 2021). Both untreated and second derivative datasets for pulp yield-min and pulp yield-max are utilized with the same partitioning as Ho et al. (2021) into calibration and prediction datasets based on the DUPLEX selection method (Snee 1977).

Optimization problem

Wavelength selection and number of latent variables (\({N}_{comp}\)) for PLS regression are investigated at a specific number of wavelengths (\({N}_{Wvl}\)). Specifically, \({N}_{Wvl}\) wavelengths are selected from the 700 wavelengths ranging from 1100 to 2500 nm in 2-nm increments (i.e., {1100, 1102, …, 2496, 2498}). These wavelengths, together with \({N}_{comp}\), result in (\({N}_{Wvl}\)+ 1) variables for each PLS model.

The performance of the PLS model is assessed using associated \({R}^{2}\) scores, denoted as \({R}_{c}^{2}\) and \({R}_{p}^{2}\) for the calibration dataset and prediction datasets, respectively, with the aim of maximizing both. Noting that these two quantities may be in conflict (i.e., in order to increase one quantity, the other must be decreased), therefore, the problem presented here is a multi-objective optimization problem (Ehrgott 2005) with \({R}_{c}^{2}\) and \({R}_{p}^{2}\) being the two objective values to optimize simultaneously.

The most common goal of an optimization method is to find inputs, which are Pareto optimal—where one objective value cannot be unilaterally improved without causing another to worsen. Selecting an optimal input from a collection of Pareto optimal points is a difficult problem, where the relative trade-offs in each objective value must be assessed using expert judgment. In this case, all Pareto optimal points may be found by maximizing an objective function \(\alpha {R}_{c}^{2}+\left(1-\alpha \right){R}_{p}^{2}\) for all values of \(\alpha\) between 0 and 1 inclusive (Ehrgott 2005). Here, \(\alpha\) represents a relative weighting of the two objective values: \(\alpha\) = 0 corresponds to only maximizing \({R}_{p}^{2}\) without concern for \({R}_{c}^{2}\), and \(\alpha\) = 1 corresponds to only maximizing \({R}_{c}^{2}\).

The optimization in Ho et al. (2021) corresponds to considering \(\alpha\) = 0.5, and this study extends the analysis by considering the robustness of the wavelength selection process relative to the choice of \(\alpha\). It is expected that choosing \(\alpha\) too large would cause overfitting to the calibration data; however, there is no way of knowing what numerical values are “too large” in any single instance. The extent of any possible overfitting and whether the negative impact would be substantial is also unclear. Hence, multi-objective optimization techniques are applied in this study to extend the work of Ho et al. (2021) in order to assess both the overall quality of calibrated NIR models based on automatic wavelength selection and also the impact of the choice of \(\alpha\) on the results.

Optimization method

Optimization for \(\alpha\) = 0, 0.1, …, 0.9, 1 is performed to estimate the full set of Pareto optimal points. There is a standard (single-objective) optimization problem for each value of \(\alpha\) with (\({N}_{Wvl}\)+1) variables that take discrete values within lower and upper bounds. PySOT is employed to solve the optimization problems, and the budget of each optimization is set to be 500 function evolutions. Moreover, each optimization is run 5 times as pySOT is a non-deterministic algorithm.

In order to investigate the impact exerted by the number of latent variables on the performance of the PLS model, the ranges for \({N}_{\mathrm{comp}}\) is selected to be 1–7 or 8–14 or 15–20 for pulp yield-min datasets, and 1–7 or 8–15 for pulp yield-max datasets. As mentioned in Ho et al. (2021), the upper bounds for the number of latent variables (20 for pulp yield-min datasets and 15 for pulp yield-max datasets) are large enough to explain the variance of the full datasets. All the results for \({N}_{Wvl}\) from 20 to 30 with different ranges of \({N}_{\mathrm{comp}}\) are saved to compare performance of the PLS model among different ranges of \({N}_{\mathrm{comp}}\).

To investigate the robustness of model performance with respect to \(\alpha\) and suggest sensible choices of \(\alpha\), the regions in \({R}_{c}^{2}\)/\({R}_{p}^{2}\) space, which are covered by the optimized values obtained by different weight vectors with all ranges of \({N}_{\mathrm{comp}}\) and all values of \({N}_{Wvl}\), are analyzed.

Finally, representative wavelengths for PLS regression in different ranges of \(\alpha\) are inspected. The frequency of each wavelength present in the optimal sets of all optimization cases is determined, as some optimized wavelengths can be just a local maximizer for the objective values and lack generalizability (Ho et al. 2021).

Results and discussion

There are 1210 optimization cases for pulp yield-max datasets and 1815 optimization cases for pulp yield-min datasets. These include the combination of 11 choices of \({N}_{Wvl}\) (varying from 20 to 30), 11 choices of \(\alpha\) (varying from 0 to 1 by 0.1 increments), and two/three different sets of constraints on \({N}_{comp}\), with each combination being run five times. The optimization results presented in the following sections consist of the combined results of all five iterations.

Performance of PLS model with different numbers of latent variables

To investigate the impact of \({N}_{comp}\) on the performance of PLS regression, \({R}_{c}^{2}\) is plotted against \({R}_{p}^{2}\) for all optimization cases with the ranges of the number of latent variables being specified in Fig. 2, and the extreme values are listed in Table 1. It is observed that the \({R}_{c}^{2}\) scores of PLS models with the ranges of larger \({N}_{\mathrm{comp}}\) are significantly greater than that of PLS models with the ranges of smaller \({N}_{\mathrm{comp}}\), suggesting that larger \({N}_{comp}\) generally results in better performance in calibration. However, improvement in prediction performance is small with an increase in \({N}_{\mathrm{comp}}\), especially for second derivatives datasets. A slight improvement of 0.005 in the highest \({R}_{p}^{2}\) scores of PLS models is observed in pulp yield-max raw dataset when increasing \({N}_{\mathrm{comp}}\) from 1–7 to 8–15, and no notable difference is observed in the second derivative pulp yield-max dataset. The highest \({R}_{p}^{2}\) score of PLS models increases from 0.953 to 0.989 and drops to 0.985 when \({N}_{\mathrm{comp}}\) varies from 1–7 to 8–14 to 15–20 for pulp yield-min dataset. A similar trend is observed for second derivative pulp yield-min dataset with much smaller improvement when increasing \({N}_{\mathrm{comp}}\) from 1–7 to 8–14. Moreover, the lowest \({R}_{p}^{2}\) scores with the ranges of larger \({N}_{comp}\) are significantly lower than that of smaller \({N}_{\mathrm{comp}}\). As shown in Table 1, negative \({R}_{p}^{2}\) scores for PLS models are observed with large \({N}_{\mathrm{comp}}\)(-1.092 when \({N}_{\mathrm{comp}}\) is 8–15 for pulp yield-max dataset; -0.185 and -0.344 when \({N}_{comp}\) is 15–20 for pulp yield-min datasets). These indicate that if \({N}_{\mathrm{comp}}\) is chosen to be too large, it may exacerbate the overfitting of PLS regression (with improper choice of \(\alpha\)), leading to poor performance in prediction. Hence, lower \({N}_{\mathrm{comp}}\) may be selected to decrease computing time while still providing equivalent model performance.

Fig. 2
figure 2

PLS model performance for different ranges of Ncomp and different sets of NIR spectra a pulp yield-max untreated, b pulp yield-max second derivative, c pulp yield-min untreated, and d pulp yield-min second derivative

Table 1 Extreme values comparison for different ranges of \({N}_{\mathrm{comp}}\)

Furthermore, the variation of \({R}_{p}^{2}\) in untreated pulp yield-max is larger than that in untreated pulp yield-min as PLS models with extremely poor performance in prediction (\({R}_{p}^{2}\le -0.5\)) are observed for untreated pulp yield-max. This observation may result from the large variation (both in terms of origin and pulp yield) in the samples contained in pulp yield-max.

Choice of Alpha

There are a number of PLS models with nearly perfect calibration performance but poor prediction performance from Fig. 2, suggesting that focusing only on calibration performance (by setting large alpha) results in overfitting. To investigate the range of alpha generating reasonable fit to the prediction data, optimal objective values (\({R}_{c}^{2}\) and \({R}_{p}^{2}\)) achieved by each value of alpha are analyzed. The regions (convex hulls) of optimal objective values (in \({R}_{c}^{2}\)/\({R}_{p}^{2}\) space) for each value of alpha are plotted in Fig. 3, and the mean and variance of \({R}_{p}^{2}\) for each value of alpha are also calculated. The smaller variance and smaller region indicate more robust model performance relative to the choice of value of alpha. As shown in Fig. 3, the variance of \({R}_{p}^{2}\) increases dramatically when \(\alpha\) varies from 0.9 to 1, which is consistent with the rightmost large convex hulls in the figures. The means of \({R}_{p}^{2}\) are remarkably small at \(\alpha = 1\), and significant decreases in \({R}_{p}^{2}\) start to be observed at \(\alpha = 0.9\) as in Fig. 3.

Fig. 3
figure 3

\({R}_{c}^{2}\)/\({R}_{p}^{2}\) of different values of \(\alpha\) for a pulp yield-max untreated, b pulp yield-max second derivative, c pulp yield-min untreated, d pulp yield-min second derivative, e pulp yield-min untreated omitting small \({N}_{\mathrm{comp}}\), and f pulp yield-min second derivative omitting small \({N}_{\mathrm{comp}}\)

Interestingly, even if a small \(\alpha\) is chosen, relatively good calibration results are still obtained, especially for pulp yield-max datasets (\({R}_{c}^{2}\ge 0.96\) for all values of \(\alpha\)). The variation in \({R}_{c}^{2}\) for pulp yield-min datasets is larger than that for pulp yield-max datasets. It is probably because the number of samples in the pulp yield-min dataset are around two times of the number of samples in the pulp yield-max dataset, and the number of latent variables varying from one to seven is not enough for PLS regression. Omitting the results for model performance with \(1\le {N}_{comp}\le 7\), the variation in \({R}_{c}^{2}\) is much smaller for untreated data as shown in Fig. 3(e), and \({R}_{c}^{2}\) is larger than 0.8 for all values of \(\alpha\) for both untreated and second derivate treated data as shown in Fig. 3(e–f). Though these results are to some extent dataset specific, the similarities in the performance across the datasets are sufficient to conclude that \(\alpha \le 0.8\) should be chosen in order to obtain a reasonably good predictive performance, and \(\alpha \ge 0.9\) tends to overfit the calibration data.

Representative wavelengths

For each wavelength in the domain (i.e., {1100, 1102, …, 2496, 2498}), the number of optimization cases for which it was present in the optimum wavelength set selected by pySOT was counted. Since \(\alpha \ge 0.9\) tends to overfit the calibration data, the spectra for \(\alpha \le 0.8\) and \(\alpha \ge 0.9\) are plotted separately to observe if the most representative wavelengths selected from the optimization process are impacted by the choice of weight vector. The frequency plots for the four datasets are presented in Figs. 4 and 5, with the ten most frequently identified wavelengths being marked.

Fig. 4
figure 4

Pulp yield-max for a untreated spectra for \(\alpha \le 0.8\), b untreated spectra for \(\alpha \ge 0.9\), c second derivative spectra for \(\alpha \le 0.8\), and d derivative spectra \(\alpha \ge 0.9\)

Fig. 5
figure 5

Pulp yield-min for a untreated spectra for \(\alpha \le 0.8\), b untreated spectra for \(\alpha \ge 0.9\), c second derivative spectra for \(\alpha \le 0.8\), and d derivative spectra \(\alpha \ge 0.9\)

The selected wavelengths for \(\alpha \le 0.8\) are more concentrated than wavelengths selected when \(\alpha \ge 0.9\). Moreover, untreated spectra and second derivative spectra for different datasets also show different frequency distributions. Yet similarities exist across the datasets in the wood components corresponding to the representative wavelengths, and these wood components are observed to directly impact pulp yield (cellulose, lignin, and hemicellulose).

For \(\alpha \le 0.8\) and untreated NIR spectra, wavelengths utilized most frequently are listed in Table 2. Origins of the wavelengths are based on band assignments presented by Schwanninger et al. (2011) for the primary components of wood. As observed in Ho et al. (2022), regions arising from CH and OH bond vibrations are consistently utilized and include 2100–2400 nm (predominately CH and CH2 combination bands), 1350–1800 nm (first overtone CH stretch bond vibrations), 1400–1600 nm (first overtone OH stretch bond vibrations) and 1140–1225 nm (second overtone CH stretch bond vibrations), regions where bands assigned to cellulose and lignin are common.

Table 2 Band assignments for optimization when \(\alpha \le 0.8\)

With the exception of 1862 nm (pulp yield-min), all important wavelengths had recognized bond vibrations occurring at, or within 5 to 10 nm of, a wavelength band assigned to a wood component. For second-derivative-treated NIR spectra and \(\alpha \le 0.8\), similar correspondence is observed with only two wavelengths in the region 1120–1130 nm (pulp yield-max) not related to a bond vibration arising from a specific wood component (Table 2).

For \(\alpha \ge 0.9\), the most frequently used wavelengths generally occurred in regions unrelated to vibrations arising from specific wood components. While weaker predictive performance may be expected for models emphasizing calibration statistics, this observation may help explain why they do not perform as well as the models developed using values of \(\alpha\) that balance both calibration and prediction performance.

Discussion on wavelength selection problem and optimization algorithms

Optimization can improve PLS calibration performance by identifying a relatively small number of NIR wavelengths for model development (Ho et al. 20212022). Weighting of objective values is a critical decision in this approach, and it was chosen to balance calibration and prediction performance, i.e., \(\alpha\) = 0.5. To understand the impact of different values of \(\alpha\) (0, 0.1, …, 0.9, 1) on the robustness of wavelength selection, the global optimization solver pySOT was utilized to optimize PLS regression model performance. Using pySOT, it was demonstrated that models are robust over a wide range of \(\alpha\) values (0 to 0.8).

A key feature of optimized PLS models is the consistent identification of wavelengths occurring in NIR regions associated with C-H and O–H bond vibrations arising from cellulose, hemicellulose, and lignin. The identification of wavelengths whose origin is directly related to the main components of wood explains why both calibration and prediction performance (for samples included in the optimization process) can be improved.

Further work is required to investigate the applicability of optimized models to the estimation of wood properties, for example a comparison of estimated pulp yield genetic parameters (Raymond et al. 2001; Schimleck 2008) based on predicted pulp yield data from PLS models and optimized PLS models. Ho et al. (2022) investigated GA optimized model prediction of microfibril angle (MFA) for loblolly pine; however, the prediction set employed (referred to as “Maps”) differed in age and geographic origin and was unrelated to the calibration / prediction samples (“Agenda 2020”) used for model development. Ho et al. (2022) recognized this issue and stated: “The Agenda 2020 model did not perform as well as the optimized MFA model that incorporated the Maps data for its prediction phase, indicating that characteristic wavelengths for the Maps data set differ from those of the Agenda 2020 samples.” This is an important finding and while consistent with other NIR-based studies of wood that have attempted to apply models based on samples from one location to those from a different location (Schimleck 2008). It indicates that optimized models are also site(s) limited and that important wavelengths likely differ to some extent with age and location. Hence, an assessment of optimized model performance should be based on samples sharing the same characteristics (age and location) as those used for model development but exclusive of the calibration/prediction phases of optimization. This question should be explored in future studies.

Furthermore, the performance of optimized models can be evaluated by comparing among different optimization methods in terms of associated computational times and trade-offs between optimized and unoptimized models. Utilizing an optimization approach not only likely results in better models, but also adds time and complexity (in terms of model development). Further studies will answer whether better predictive models help achieve incremental increases in the value of wood products derived from forests whose wood properties are accurately characterized, and what types of applications will most benefit from the optimization results.

Conclusion

In this study, pySOT is utilized to develop PLS regression pulp yield models based on two Eucalyptus globulus data sets (characterized by narrow and extreme yield ranges) and examine predictive performance of models using different weight factors (i.e., factor \(\alpha\) in the objective function \(\alpha {R}_{c}^{2}+\left(1-\alpha \right){R}_{p}^{2}\) to be maximized in the optimization algorithm). Values of \(\alpha \le 0.8\) provided good predictive performance, whereas \(\alpha \ge 0.9\) tended to overfit the calibration data indicating that models are robust for values of \(\alpha\) from 0 to 0.8. Representative wavelengths for each data set were identified. For \(\alpha \le 0.8\), almost all wavelengths arose from C-H and O–H bond vibrations observed in cellulose, hemicellulose, and lignin. However, identified wavelengths for \(\alpha \ge 0.9\) generally occurred in regions, which are unrelated to vibrations arising from specific wood components.

The reliability with which pySOT identified chemically relevant representative wavelengths indicates its usefulness as a global optimization algorithm in this setting. However, because the underlying algorithm is randomized, repeated runs of the optimization are important to assess the variability in results, so deterministic methods may be overall faster (and give more reproducible representative wavelengths).

There was consistency between the optimal ranges of alpha between the two Eucalyptus globulus data sets, but it would be useful in future work to assess whether this generalizes to other species/data sets, or whether alternative metrics of PLS performance (such as RMSE) are suitable. Testing the generalizability of the identified representative wavelengths on a separate partition of data (i.e., not calibration or prediction) would also give further evidence of the utility of this framework but is more suited to larger data sets than those used in this work. In addition, the applicability of optimized models to the estimation of wood properties, the comparison between optimization methods and associated computational times, and trade-offs between optimized and unoptimized models are recommended for future studies.