Introduction

Efficient utilization of forest resources requires information on wood property variation at multiple scales. Information is often limited owing to the cost and time associated with measuring many wood properties and various methodologies have been developed for estimating these properties rapidly (Schimleck et al. 2019). Near infrared (NIR) spectroscopy is one such technique that has been widely applied to wood (Tsuchikawa and Kobori 2015; Schimleck and Tsuchikawa 2020), and it is the only nondestructive technique that can provide an estimate of pulp yield (the yield of chemically derived pulp from a given volume of wood). Pulp yield is critical to the economics of the pulp and paper industry (Greaves and Borralho 1996) and is very expensive to measure (Meder et al. 2011). Hence, there is increased interest in utilizing a rapid, inexpensive approach for its determination (Michell 1995).

Owing to its importance and direct relationship with wood chemistry, the estimation of pulp yield by NIR spectroscopy began with the earliest wood-NIR papers (Birkett and Gambino 1988; Wright et al. 1990), and pulp yield has remained a consistent focus of NIR-wood related research (Downes et al. 2009, 2010, 2011; Meder et al. 2011; White et al. 2009). However, efforts to improve calibration performance through the utilization of advanced selection techniques are rare. For example, Mora and Schimleck (2008) utilized three different sample selection techniques (CADEX, DUPLEX and SELECT algorithms) to identify samples most representative of their data set for the development of pulp yield calibrations. They showed calibration performance was improved by utilizing only selected samples and recommended that these methods be employed to identify unique samples prior to doing any wood property determination utilizing models based on NIR spectra. More recently, Li et al. (2019) utilized a particle swarm optimization (PSO)-support vector machine (SVM) approach and observed improved density prediction for four commercially important Chinese species.

The selection of the most representative wavelengths in the spectra data of all samples might improve both calibration and prediction performances of partial least squares (PLS) regression and reduce computational workload. This selection problem can be defined as an optimization problem (Bangalore et al. 1996).

Many complex real-world problems involve optimizing goals, which means searching for the maximum and/or minimum values of these goals (objective values). For example, in manufacturing, maximizing profit and minimizing cost are common aims, whereas in logistics the management, distribution and transportation of goods or services are of interest, so that they can be delivered in the shortest time and in a cost-effective manner. In these examples, profit, cost and time are objective values. Objective values are affected by many factors, which are called design variables (or decision variables). Restrictions, which are typically expressed mathematically as inequalities or equations, are called constraints. A function that expresses the relationship between objective values and design variables is called the objective function.

The common optimization problems in the field of NIR spectroscopy/chemometrics include wavelength selection (i.e., variable selection) and selection of the appropriate number of components (i.e., latent variables) in partial least squares (PLS) regression. Others include preprocessing techniques, such as feature selection and optimization of the parameters in calibration models with Support Vector Machines (Ramirez-Morales et al. 2016), and instrumentation optimization (signal precision and wavelength resolution) (Greensill and Walsh 2000). These problems have been solved by many optimization methods; examples include the binary dragonfly algorithm (Chen and Wang 2019), genetic algorithms (GA) (Bangalore et al. 1996; Villar et al. 2014; De et al. 2017), artificial bee colony (Sun et al. 2019), particle swarm optimization (De et al. 2017; Lou et al. 2014), ant colony optimization (Xiaowei et al. 2014) and simulated annealing (Swierenga et al. 1998; Balabin and Smirnov 2011). These metaheuristic algorithms help to save time and computational resources, especially in wavelength selection problems, in which the solution space is too large to search exhaustively. Among the mentioned algorithms, simulated annealing is a single-solution approach that improves a local search heuristic to find a better solution, while the others are population-based approaches that maintain and improve multiple potential solutions by generating a new population based on principles of natural systems. Evolutionary algorithms (e.g., genetic algorithms) and swarm-intelligence-based algorithms (e.g., the binary dragonfly algorithm, artificial bee colony, particle swarm optimization and ant colony optimization) are two common categories of population-based methods. It is worth noting that, in spite of the popularity of these optimization methods in the field of NIR spectroscopy/chemometrics, their applications in the field of wood-NIR are very limited.

Xiaobo et al. (2010) and Balabin and Smirnov (2011) reviewed variable selection methods for NIR spectroscopy, including GA. Xiaobo et al. (2010) concluded that GA combined with PLS regression showed superiority over other applied multivariate methods because wavelengths selected by GA did not lose prediction capacity and provided useful information about the chemical system.

Villar et al. (2014) applied three variable selection methods, including Martens Uncertainty Test, interval Partial Least Squares (iPLS) and GA to Visible-NIR spectra. The application of iPLS and GA resulted in considerable improvement of the calibration model with the number of latent variables being reduced while also decreasing the root mean square error of the cross-validation (RMSECV) and the standard error of cross-validation (SECV) and increasing the ratio of prediction to deviation (RPD) compared to a full spectrum model.

Evolutionary algorithms are a branch of evolutionary computation inspired by natural evolution and adaptation processes. They comprise three major families: evolution strategies, evolutionary programming and genetic algorithms. Rechenberg (1973) introduced evolution strategies as a numerical optimization technique, while the current framework of genetic algorithms was first proposed by Holland (1975) and his students (Jong 1975). An important addition was the development and introduction of the population concept into evolution strategies by Schwefel (1981, 1995). Evolutionary algorithms have been adapted to various optimization problems, including constrained numerical optimization (Michalewicz and Schoenauer 1996; Kim and Myung 1997), unconstrained numerical optimization (Yao and Liu 1996, 1997) and multi-objective optimization (Fonseca and Fleming 1995, 1998).

All evolutionary algorithms have two prominent features that distinguish them from other search algorithms. First, they are all population-based; second, there is communication and information exchange among individuals in a population. This exchange is the result of selection and/or recombination. Most recombination (crossover) operators use two parents and produce two offspring, which inherit information (genes) from their parents.

Genetic algorithms have been applied in the area of NIR spectroscopy since the 1980s. Koljonen et al. (2008) reviewed applications of GAs, including wavelength selection, wavelength interval selection, feature selection, co-optimization for wavelength selection and the number of PLS components, pre-processing optimization and wavelet transformation. The authors also proposed some potential research directions and applications of GAs in chemometrics.

In this paper, the GA approach was applied to a variable selection problem, which can be considered an optimization problem, for NIR spectroscopy data sets. Two data sets, each represented by untreated and second derivative spectra, were used to predict pulp yield. The goals of the optimization problem were to reduce the number of variables (i.e., wavelengths) for PLS regression and to identify the most frequent optimum wavelengths (i.e., representative wavelengths) for each data set. NIR band assignment was utilized to provide useful information about the wood components related to the optimum wavelengths.

Materials and methods

Optimization problem

Wavelength selection, number of wavelengths (NWvL) and number of latent variables (Ncomp) for PLS regression are often optimized in the same procedure using GA. Using an approach first implemented by Bangalore et al. (1996), a chromosome includes a series of (N + 1) genes, in which N is the total number of wavelengths in the wavelength domain. Therefore, each gene in the first N genes corresponds to a specific wavelength. The value of a gene is binary, which indicates whether the wavelength is included in the model or not (i.e., 1 = yes and 0 = no). The number of wavelengths for the regression model (NWvL) is counted as the number of genes among the first N genes assigned the value of 1. However, as a result, the number of selected wavelengths could not be controlled. The last gene represents the number of latent variables, which is an integer. By developing the problem in this way, the wavelengths, NWvL and Ncomp are co-optimized.

In the study presented here, the optimum wavelengths and number of latent variables for PLS regression are investigated at a fixed number of wavelengths, which is increased from 10 to 100. This approach allows observation of how these variables and the PLS model metrics change with the number of wavelengths. Therefore, the implementation of GA for the optimization problem differs from that of the aforementioned studies (Bangalore et al. 1996; Koljonen et al. 2008) and is summarized as follows.

Each calibration model for PLS regression includes (NWvL + 1) variables, which are a combination of wavelengths selected from the wavelength domain and the number of latent variables for PLS regression. They are combined into a vector, called a chromosome (or an individual) \({x}^{*}={\left[{x}_{1} {x}_{2}\dots {x}_{NWvL+1}\right]}^{T}\). Each value in a chromosome is called a gene. The first gene \({x}_{1}\) represents the number of latent variables while the others (from \({x}_{2}\) to \({x}_{NWvL+1}\)) are assigned integer values which belong to a NWvL-combination without repetition of all wavelength values in their domain (N elements). This combination is sorted in ascending order before being assigned to genes.

Data sets

The optimization problem was developed and applied to two NIR data sets selected as they represented two extremes in terms of pulp yield variation. The first (pulp yield-min) was comprised of 67 clonal blue gum (Eucalyptus globulus) samples (Schimleck and French 2002) all the same age and with Kraft pulp yields that ranged from 50.8 to 55.8%. The second (pulp yield-max) included 30 blue gum samples (Michell 1995) from several different native forests in Tasmania, Australia. The forests were of various ages and pulped samples had a much wider yield range (soda pulp yields = 37.6 to 60.2%). Details regarding sample preparation and collection of NIR spectra are described in Michell (1995) and Schimleck and French (2002). Briefly, wood chip samples (representative of individual trees or clones) were milled in a model 4 Wiley mill (Thomas Scientific, Swedesboro, NJ, USA). For both data sets, milled wood was placed in a large NIR systems sample cup (NR-7070) and duplicate spectra (the cell was repacked between scans) collected using a NIR Systems Inc. Model 5000 scanning spectrophotometer (Silver Spring, Maryland, USA). Duplicate spectra (wavelength range 1100–2500 nm in 2 nm increments, total N = 700) were averaged prior to analysis. For the pulp yield-max samples, a static sample holder was used, whereas a spinning sample holder was utilized for the collection of spectra from the pulp yield-min samples. Schimleck and French (2002) and Turner et al. (1983) provide information regarding the determination of pulp yield for samples included in the two datasets.

Each data set was separated into two subsets (i.e., calibration set and prediction set) based on the DUPLEX selection method (Snee 1977), which uses Euclidean distance to determine the proximity of samples to others in a factor space (Mora and Schimleck 2008). Basic information on the data sets is shown in Table 1.
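The split described above can be illustrated with a minimal sketch of a DUPLEX-style selection as it is commonly described: the two most distant samples seed the calibration set, the next two most distant seed the prediction set, and the sets then alternately receive the remaining sample farthest from their current members. The function name `duplex_split`, the handling of unequal subset sizes and the use of raw coordinates instead of a factor space are assumptions made for illustration, not details taken from this study.

```python
import math
from itertools import combinations

def duplex_split(points, n_cal):
    """Split sample indices into calibration and prediction sets (DUPLEX-style sketch)."""
    n_test = len(points) - n_cal
    if n_cal < 2 or n_test < 2:
        raise ValueError("each subset needs at least two samples")
    remaining = list(range(len(points)))

    def dist(a, b):
        # Euclidean distance between two samples
        return math.dist(points[a], points[b])

    def take_farthest_pair():
        # Remove and return the two remaining samples that are farthest apart
        a, b = max(combinations(remaining, 2), key=lambda pair: dist(*pair))
        remaining.remove(a)
        remaining.remove(b)
        return [a, b]

    def take_farthest_from(group):
        # Remove and return the remaining sample whose nearest group member is farthest away
        pick = max(remaining, key=lambda r: min(dist(r, g) for g in group))
        remaining.remove(pick)
        return pick

    cal = take_farthest_pair()
    test = take_farthest_pair()
    # Alternate between the two sets until each reaches its target size (assumed rule)
    while remaining:
        if len(cal) < n_cal:
            cal.append(take_farthest_from(cal))
        if remaining and len(test) < n_test:
            test.append(take_farthest_from(test))
    return cal, test

# Toy one-dimensional "scores" for six hypothetical samples
scores = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
cal_idx, test_idx = duplex_split(scores, n_cal=3)
```

In practice the distances would be computed on the factor scores of the spectra rather than on toy coordinates.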

Table 1 Data set information

The maximum number of latent variables was set to 20 for the pulp yield-min data and 15 for the pulp yield-max data. A maximum Ncomp of 20 was considered more than necessary for a PLS model using the pulp yield-min set, but we wanted to allow for instances where the optimization required more latent variables than the 10 used in preliminary models. Based on an analysis of the percentage of variance in Y explained for the full data set, Ncomp = 20 explained 99.54% of the variance of Y. Therefore, a number of latent variables greater than 20 does little to improve the PLS model and might make the model more complicated and increase the computing time. In the case of the pulp yield-max data set, the maximum number of latent variables was limited by the size of the calibration set (20 samples). When cross-validation sets of 4 were used, the number of latent variables should not be larger than the size of the training set (i.e., 15). Again, the analysis of the percentage of variance in Y explained showed that Ncomp = 15 explained 99.94% of the variance of Y.

Therefore, the domains of optimum variables were defined as:

For number of latent variables: \(D\left[{x}_{1}\right]=\left[\begin{array}{ccccccc}1& 2& 3& \dots & 18& 19& 20\end{array}\right]\) for pulp yield-min data

\(D\left[{x}_{1}\right]=\left[\begin{array}{ccccccc}1& 2& 3& \dots & 13& 14& 15\end{array}\right]\) for pulp yield-max data

For wavelength variables: \(D\left[{x}_{i}\right]=\left[\begin{array}{ccccccc}1100& 1102& 1104& \dots & 2494& 2496& 2498\end{array}\right]\) (nm)

(i = 2 … NWvL+1)
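As an illustration of these domains, the following minimal sketch draws one random chromosome of the form described earlier: the first gene is the number of latent variables and the remaining genes are a sorted combination of unique wavelengths. The names `WAVELENGTHS` and `random_chromosome` are hypothetical and are not taken from the authors' implementation.

```python
import random

# Wavelength domain from the text: 1100 to 2498 nm in 2 nm steps (N = 700)
WAVELENGTHS = list(range(1100, 2500, 2))

def random_chromosome(n_wvl, max_ncomp, rng=random):
    """Return [Ncomp, wl_1, ..., wl_NWvL] with sorted, unique wavelengths."""
    ncomp = rng.randint(1, max_ncomp)                # gene x1, uniform over 1..max_ncomp
    chosen = sorted(rng.sample(WAVELENGTHS, n_wvl))  # NWvL-combination without repetition
    return [ncomp] + chosen

# Example: 10 wavelengths with the pulp yield-min limit of 20 latent variables
chromosome = random_chromosome(n_wvl=10, max_ncomp=20, rng=random.Random(0))
```

Sampling without replacement guarantees the uniqueness of the wavelength genes at creation, so no repair step is needed for the first generation.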

The performance of a calibration model for PLS regression (i.e., a chromosome or an individual) in this study was evaluated by four inequality constraints (m = 4) and two objective values. The constraint conditions are R-squares for the calibration and prediction sets (\({R}_{c}^{2}\) and \({R}_{p}^{2}\), respectively) and the standard errors for the calibration and prediction sets (SEC and SEP, respectively). These constraint conditions can be expressed as follows:

$$\left\{ {\begin{array}{*{20}c} {R_{c}^{2} - R_{c,\min }^{2} \ge 0} \\ {R_{p}^{2} - R_{p,\min }^{2} \ge 0} \\ {{\text{SEC}}_{max} - SEC \ge 0} \\ {{\text{SEP}}_{max} - SEP \ge 0} \\ \end{array} } \right.$$

in which \({R}_{c,\mathrm{min}}^{2}\), \({R}_{p,\mathrm{min}}^{2}\), SECmax and SEPmax are the limit values for \({R}_{c}^{2}\), \({R}_{p}^{2}\), SEC and SEP, respectively. These limit values were selected so that the performance of an optimum calibration model is equal to, or better than, that of a calibration model using all wavelengths in the data set and Ncomp = 6. The details of the constraint values are shown in Table 2.

Table 2 Constraint values for data sets and the result of optimized values

The objective values in this study are the aforementioned R-squares for the calibration and prediction sets. This choice reflects the overall goal of the optimization problem presented here, which is to obtain a set of wavelengths that produces good R-square values for both sets. If the objective value were only the R-square for the calibration set, the PLS regression might overfit and, as a result, its predictive performance would be reduced. The objective function is defined as: \({f}_{obj}=\alpha \times {R}_{c}^{2}+\beta \times {R}_{p}^{2}\), in which α and β are weighting factors for \({R}_{c}^{2}\) and \({R}_{p}^{2}\), respectively (\(\alpha +\beta =1\)). In this study, α and β were both set to 0.5.
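To make the objective function and the four inequality constraints concrete, a minimal sketch follows. The function names and the numbers in `limits` are hypothetical placeholders; the actual limit values are those reported in Table 2.

```python
def objective(r2_c, r2_p, alpha=0.5, beta=0.5):
    """Weighted sum of calibration and prediction R-squares (alpha + beta = 1)."""
    return alpha * r2_c + beta * r2_p

def constraint_values(r2_c, r2_p, sec, sep, limits):
    """Left-hand sides of the four inequality constraints; each must be >= 0."""
    return [
        r2_c - limits["r2_c_min"],
        r2_p - limits["r2_p_min"],
        limits["sec_max"] - sec,
        limits["sep_max"] - sep,
    ]

def is_feasible(values):
    """An individual is feasible when no constraint value is negative."""
    return all(v >= 0 for v in values)

# Hypothetical limit values, for illustration only
limits = {"r2_c_min": 0.90, "r2_p_min": 0.85, "sec_max": 1.0, "sep_max": 1.2}
fitness = objective(r2_c=0.98, r2_p=0.96)
feasible = is_feasible(constraint_values(0.98, 0.96, 0.4, 0.6, limits))
```

With equal weights, a model cannot score well by fitting the calibration set alone, which is the overfitting safeguard described above.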

Optimization process

In the optimization problem for PLS regression, the first generation is created randomly. The first generation of parents, P, is represented by the following matrix, in which k is the number of parents (or the population size), n = NWvL + 1 is the number of genes in a chromosome, and each row represents an individual’s chromosome. In this study, the number of parents (k) is 100.

$$P = \left[ {\begin{array}{*{20}c} {x_{1,1} } & {x_{1,2} } & \ldots & {x_{1,n - 1} } & {x_{1,n} } \\ {x_{2,1} } & {x_{2,2} } & \ldots & {x_{2,n - 1} } & {x_{2,n} } \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ {x_{k,1} } & {x_{k,2} } & \ldots & {x_{k,n - 1} } & {x_{k,n} } \\ \end{array} } \right]$$

In the initial step, one hundred individuals were created by randomly selecting gene values from the pre-defined variable domains using a uniform distribution. The strength (or fitness) of each individual is evaluated by the objective function \({f}_{obj}\). Good individuals are selected to be parents based on their fitness to create the next generation (offspring) during the search process. After that, the objective function of each offspring is evaluated and compared to that of its parents using a penalty function (Van de Lindt and Dao 2007) in a process named tournament selection. The best individuals are identified and become the parents of the next generation. The process is repeated until pre-determined convergence criteria are satisfied.

The searching process is performed through the crossover (recombination) and mutation operators. In the crossover operator, two or more offspring are often produced by randomly exchanging genes from two or more parents. In most cases, two parents are selected randomly; thus, only two offspring are created, and they inherit genes from their parents. The number of individuals selected to perform the crossover operator depends on a crossover rate. A crossover point, where the gene exchange occurs, is chosen randomly between 1 and (n-1). More than one crossover point is possible; however, only one crossover point is used in this study. For example, if individuals Xi and Xj are selected for crossover at the crossover point k (here k denotes a gene position, not the population size), the two offspring are expressed as:

$$X_{i}^{^{\prime}} = \left[ {\begin{array}{*{20}c} {x_{i,1} } & \ldots & {x_{i,k - 1} } & {x_{j,k} } & \ldots & {x_{j,NWvL + 1} } \\ \end{array} } \right]$$
$${\text{and}}\quad X_{j}^{^{\prime}} = \left[ {\begin{array}{*{20}c} {x_{j,1} } & \ldots & {x_{j,k - 1} } & {x_{i,k} } & \ldots & {x_{i,NWvL + 1} } \\ \end{array} } \right]$$
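The single-point crossover above can be sketched as follows. The function name `single_point_crossover` is hypothetical, and the cut index is 0-based and restricted so that both parents contribute at least one gene; this indexing convention is an assumption, not a detail from the source.

```python
import random

def single_point_crossover(parent_i, parent_j, rng=random):
    """Swap the tails of two equal-length chromosomes at one random cut point."""
    n = len(parent_i)
    cut = rng.randint(1, n - 1)  # genes [0, cut) are kept, genes [cut, n) are exchanged
    child_i = parent_i[:cut] + parent_j[cut:]
    child_j = parent_j[:cut] + parent_i[cut:]
    return child_i, child_j, cut

# Toy chromosomes: [Ncomp, wavelength genes...]
a = [6, 1100, 1200, 1300]
b = [8, 1500, 1600, 1700]
child_a, child_b, cut = single_point_crossover(a, b, rng=random.Random(1))
```

Because the tails come from different parents, the wavelength genes of an offspring may contain duplicates, so a separate uniqueness check is still needed after this step.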

The mutation operator changes some genes in some individuals in every generation. Similar to the crossover operator, the number of chromosomes selected to be mutated depends on the pre-defined mutation rate while the mutation points are chosen randomly between 1 and n. At a mutation point q on a selected chromosome p, a gene’s value is changed to a random value which is within the gene’s domain. A new offspring is expressed as:

$$X_{p}^{^{\prime}} = \left[ {\begin{array}{*{20}c} {x_{p,1} } & \ldots & {x_{p,q - 1} } & {x_{p,q}^{^{\prime}} } & {x_{p,q + 1} } & \ldots & {x_{p,NWvL + 1} } \\ \end{array} } \right]$$

Since the genes from \({x}_{2}\) to \({x}_{NWvL+1}\) are required to create a set of unique wavelengths, the new offspring produced by crossover and mutation operators are checked. If the values of wavelength genes are not unique, the operator is repeated until a set of unique wavelength values is obtained. In addition, the large domains defined for variables result in numerous possible individuals. Therefore, in this study, the crossover and mutation rates were chosen to be 0.5 to introduce various new genes to the population.
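A minimal sketch of the mutation operator with the uniqueness check follows. The names `mutate` and `has_unique_wavelengths`, the retry limit and the re-sorting of wavelength genes after mutation are assumptions made for illustration.

```python
import random

# Wavelength domain from the text: 1100 to 2498 nm in 2 nm steps
WAVELENGTHS = list(range(1100, 2500, 2))

def has_unique_wavelengths(chromosome):
    """Check that the wavelength genes (all genes after the first) are unique."""
    wavelengths = chromosome[1:]
    return len(set(wavelengths)) == len(wavelengths)

def mutate(chromosome, max_ncomp, rng=random, max_tries=1000):
    """Replace one randomly chosen gene; repeat until the wavelengths are unique."""
    n = len(chromosome)
    for _ in range(max_tries):
        child = list(chromosome)
        q = rng.randrange(n)  # 0-based mutation point
        if q == 0:
            child[q] = rng.randint(1, max_ncomp)  # latent-variable gene
        else:
            child[q] = rng.choice(WAVELENGTHS)    # wavelength gene
        if has_unique_wavelengths(child):
            child[1:] = sorted(child[1:])         # keep ascending order (assumed)
            return child
    raise RuntimeError("no valid mutation found within max_tries")

parent = [6, 1100, 1102, 1104]
mutant = mutate(parent, max_ncomp=20, rng=random.Random(3))
```

The input chromosome is copied rather than modified, so the parent remains available for the later comparison between parents and offspring.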

The offspring matrix, O, obtained from crossover and mutation operators, is expressed as:

$$O = \left[ {\begin{array}{*{20}c} {x_{1,1} } & {x_{1,2} } & \ldots & {x_{1,n - 1} } & {x_{1,n} } \\ {x_{2,1} } & {x_{2,2} } & \ldots & {x_{2,n - 1} } & {x_{2,n} } \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ {x_{r,1} } & {x_{r,2} } & \ldots & {x_{r,n - 1} } & {x_{r,n} } \\ \end{array} } \right]$$

The selection process is conducted for parents and offspring using the tournament selection method. Each individual is a PLS regression model for the respective data set. The regression produces constraint values, and the fitness of each individual is evaluated based on the objective function. The fitness vector Y and the constraint value matrix C for (k + r) individuals can be expressed as:

$$Y = \left[ {\begin{array}{*{20}c} {y_{1} } \\ {y_{2} } \\ \ldots \\ {y_{{\left( {k + r} \right)}} } \\ \end{array} } \right] \,\text{and}\,C = \left[ {\begin{array}{*{20}c} {c_{1,1} } & {c_{1,2} } & \ldots & {c_{1,m} } \\ {c_{2,1} } & {c_{2,2} } & \ldots & {c_{2,m} } \\ \ldots & \ldots & \ldots & \ldots \\ {c_{{\left( {k + r} \right),1}} } & {c_{{\left( {k + r} \right),2}} } & \ldots & {c_{{\left( {k + r} \right),m}} } \\ \end{array} } \right]$$

where m is the number of constraint values being considered (as described earlier, m = 4).

As proposed by Van de Lindt and Dao (2007), one should concentrate on searching for individuals around those individuals having the best fitness, so that the approach to global optimization is as stable as possible. In that scenario, some individuals, which do not satisfy the constraint conditions but have very good fitness, might be considered for retention. A penalty function was proposed by Van de Lindt and Dao (2007) so the individuals having fitness values around the best fitness value will have a higher probability of survival. The mathematical form of this penalty function for a minimum optimization problem can be expressed as:

$${f}_{p}\left(x\right)=\left\{\begin{array}{l}f\left(x\right)\, \qquad \qquad \qquad \quad \quad \, if\, \, \left[{g}_{k}\left(x\right)\ge 0 \,\mathrm{\,and\,}\, {h}_{m}\left(x\right)=0\right]\,\text{or}\, \left[{f}_{b}\left(x\right)<f(x)\right] \\ {f}_{b}\left(x\right)+\left[{f}_{b}\left(x\right)-f\left(x\right)\right]\quad if\,\left[{g}_{k}\left(x\right)<0\, \mathrm{or}\, {h}_{m}\left(x\right)\ne 0\right] \,\text{and} \,\left[{f}_{b}\left(x\right)\ge f(x)\right] \end{array}\right.$$
$$\text{for}\, k=1,2,\dots ,N;\,m=1,2,\dots ,M$$

where \({f}_{p}\left(x\right)\) is fitness after penalizing; \(f\left(x\right)\) is fitness before penalizing, and \({f}_{b}\left(x\right)\) is fitness of the best individual in the constraint domain \(\left[{g}_{k}\left(x\right)\ge 0 \mathrm{\,and\,} {h}_{m}\left(x\right)=0\right]\), in which \({g}_{k}\left(x\right)\) and \({h}_{m}\left(x\right)\) are constraint functions for N inequality constraints and M equality constraints, respectively.
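The penalty function can be sketched directly from its piecewise definition for a minimization problem. The function name `penalized_fitness` and the example numbers are hypothetical; how the form was adapted to the maximization of \({f}_{obj}\) in this study is not stated here, so the sketch follows the minimization form as written.

```python
def penalized_fitness(f, feasible, f_best):
    """Penalized fitness for a minimization problem.

    f        -- fitness before penalizing
    feasible -- True if all constraints are satisfied
    f_best   -- fitness of the best individual inside the constraint domain
    """
    if feasible or f_best < f:
        return f                    # first branch: fitness is left unchanged
    return f_best + (f_best - f)    # second branch: reflect f about f_best

# Best feasible fitness is 10.0 in this toy example
unchanged_feasible = penalized_fitness(9.0, True, 10.0)
reflected = penalized_fitness(8.0, False, 10.0)
unchanged_worse = penalized_fitness(12.0, False, 10.0)
```

An infeasible individual that appears better than the best feasible one is moved to the other side of \({f}_{b}\) by the same distance, so it stays close to the best fitness value and keeps a reasonable chance of survival without outranking the best feasible individual.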

Results and discussion

Optimization results

As mentioned, optimization was implemented for a specific number of wavelengths. There were 91 optimization cases corresponding to the change in number of wavelengths from 10 to 100. Figure 1 shows the objective values (\({R}_{c}^{2}\) and \({R}_{p}^{2}\)) from each optimum set of wavelengths resulting from the optimization process for the four spectral data sets (two data sets, each as untreated and second derivative spectra). Overall, the objective values were greatly improved compared to the corresponding values obtained from using all wavelengths (see Table 2). The best number of wavelengths for optimizing the prediction result (\({R}_{p}^{2}\)) differed among data sets. For pulp yield-min untreated spectra, \({R}_{p}^{2}\) increases from 0.96 to 0.98 when NWvL increases from 10 to 22. It then stays above 0.98 until NWvL exceeds 56, beyond which it fluctuates widely between 0.9 and 1. \({R}_{p}^{2}\) of pulp yield-max untreated spectra reaches a peak value of 0.99 at NWvL = 14, followed by a downward trend to around 0.965 as NWvL increases. Excellent values of \({R}_{p}^{2}\) were observed for both second derivative sets. \({R}_{p}^{2}\) of pulp yield-min second derivative spectra increases from 0.97 and remains above 0.99 for NWvL ≥ 27, while \({R}_{p}^{2}\) of pulp yield-max second derivative spectra is higher than 0.99 for all investigated cases of NWvL. The optimization also reduces SEC and SEP values, indicating an improvement in model fitting and predictive performance (Fig. 2).

Fig. 1 Optimum objective values for the different spectral data sets

Fig. 2 SEC and SEP from optimum results for the different spectral data sets

Figure 3 shows the optimization results for the number of latent variables (Ncomp). These are the values that, combined with the corresponding optimum wavelength sets, resulted in the highest objective values. Only the pulp yield-max untreated spectral data show a convergence to Ncomp = 6 as the number of wavelengths (NWvL) changes. For the other data sets, the optimum Ncomp fluctuates over a wide range. However, Ncomp = 6 tends to be the lower limit, while the upper limit reaches the preselected maximum number of latent variables in some cases. This suggests that the true upper limit might be higher if the maximum number of latent variables were increased.

Fig. 3 Optimization results for the number of latent variables

For optimization based on different numbers of wavelengths, the sets of identified wavelengths share few common components. For example, the result from optimization for pulp yield-min untreated spectra shows that the optimum wavelength sets for NWvL = 10 and NWvL = 11 are as follows:

$${\text{WvL}}_{NWvL = 10} = \, [1470\,\,1636\,\,1706\,\,1790\,\,1852\,\,1854\,\,2286\,\,2364\,\,2474\,\,2476] \, \left( {\text{nm}} \right)$$
$${\text{WvL}}_{NWvL = 11} = \left[ {1152\,\, 1154\,\, 1198\,\, 1200\,\, 1472\,\, 1918\,\, 2032\,\, 2322\,\, 2328\,\, 2364\,\, 2372} \right] \left( {\text{nm}} \right)$$

These two sets only share common wavelengths in the range 1470–1472 nm and at 2364 nm. This suggests that the optimization result for a specific number of wavelengths might be just a local optimum for that specific case. The local optimum therefore contains not only the common wavelengths but also its own distinguishing wavelengths. This means that not all the optimized wavelengths in a given case contribute to global optimization or help to explain the relationship between wavelengths and wood components or wood properties.

Most frequently identified wavelengths

A statistical approach was applied to analyse the optimization results. Each wavelength in the domain (i.e., from 1100 to 2498 nm) was counted for its presence in the different optimum wavelength sets resulting from 91 optimization cases. The most frequent wavelengths of a data set were considered representative for that data set. The frequency of wavelengths across all optimization cases for a given data set is plotted in Fig. 4. Frequency distribution for the untreated spectral data sets is more concentrated than that for second derivative data sets (Fig. 4). Moreover, although the distributions are concentrated for untreated spectra, the wavelengths with highest frequency of pulp yield-min and -max untreated spectra are not the same indicating that representative wavelengths are different for the untreated spectra.
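The counting step can be sketched with the two optimum sets reported above for NWvL = 10 and NWvL = 11 of the pulp yield-min untreated spectra. The function names `wavelength_frequency` and `representative` are hypothetical; the full analysis would use all 91 optimum sets of a data set.

```python
from collections import Counter

def wavelength_frequency(optimum_sets):
    """Count in how many optimum wavelength sets each wavelength appears."""
    counts = Counter()
    for wl_set in optimum_sets:
        counts.update(set(wl_set))  # each set contributes at most once per wavelength
    return counts

def representative(counts, min_freq):
    """Wavelengths present in at least min_freq optimum sets, in ascending order."""
    return sorted(w for w, c in counts.items() if c >= min_freq)

# Optimum sets (nm) quoted in the text for NWvL = 10 and NWvL = 11
set_10 = [1470, 1636, 1706, 1790, 1852, 1854, 2286, 2364, 2474, 2476]
set_11 = [1152, 1154, 1198, 1200, 1472, 1918, 2032, 2322, 2328, 2364, 2372]

counts = wavelength_frequency([set_10, set_11])
shared = representative(counts, min_freq=2)
```

Raising `min_freq` shrinks the representative set, which is how different sets of most frequent wavelengths can be obtained from the same counts.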

Fig. 4 Presence frequency of wavelengths in the optimum results

Different sets of the most frequent wavelengths were determined for each data set based on different minimum frequency values. For example, for the pulp yield-min untreated spectra, 304 wavelengths were present at least seven times and 12 wavelengths had a minimum frequency of 26. The objective values of the models using the representative wavelength sets as their input are plotted in Fig. 5. In general, the representative wavelength sets also greatly improved the PLS model, although the performance was not as high as that provided by the optimized wavelength sets. Figure 5 also shows that \({R}_{c}^{2}\) increases with the number of representative wavelengths (NRWvL), whereas \({R}_{p}^{2}\) peaks when NRWvL is in the range of 100–200 wavelengths and tends to decrease when more wavelengths are added to the model input.

Fig. 5 Objective values for the different representative wavelength sets

Comparison of band assignments

Schwanninger et al. (2011) reviewed and provided a summary of band assignments for wood and its components. Results from that study were utilized here and matched to the most frequent wavelengths of each data set. Table 3 shows the representative wavelengths of each data set, their frequency, bands in the NIR spectrum identified as arising from wood, the related bond vibration and the corresponding wood components.

Table 3 Band assignments for optimization results

Strong agreement was observed between the most frequently observed representative wavelengths and bands corresponding to wood components. The strong agreement is very encouraging as it indicates that wavelengths identified as important for optimization originate from bond vibrations in wood components that directly influence pulp yield (Poke and Raymond 2006). For the pulp yield-max data, nearly all identified wavelengths that had a wood related analog arose from cellulose while for the pulp yield-min data the frequency of bands related to lignin, while still relatively small, was greater. The contrasting range in yields for the data sets influenced the selection of wavelengths. It is likely that the wide range of yields for the pulp yield-max data set has permitted clear identification of specific wavelengths related to cellulose utilizing untreated spectra (Fig. 4c), whereas for the pulp yield-min untreated spectra (Fig. 4a) the narrow yield range resulted in more wavelengths being identified as important and also allowed lignin-related wavelengths to have a greater influence. This suggests variation in lignin content is more important for pulp yield models based on data that has a narrow range. For the second derivative data, more wavelengths had influence which can be expected as this treatment baselines the data and highlights differences amongst wavelengths (Barton 1989).

Conclusion

This study presents an optimization problem for Eucalyptus globulus pulp yield models. Two NIR data sets represented by untreated and second derivative spectra were used in multivariate calibration models based on partial least squares (PLS) regression to predict pulp yield. The genetic algorithm was used to select optimum wavelengths, with an objective function including both R-squares for the calibration and prediction sets. The optimization process was run for 91 cases corresponding to the change in number of wavelengths from 10 to 100. Results show that optimum wavelengths considerably improved PLS regression model performance (represented by R-square and standard error), not only for the calibration sets but also the prediction sets. However, each spectral data set has its own optimum number of wavelengths. Despite differences, R-square values for prediction were still greater than 0.96. The optimum number of latent variables varied over a wide range from the maximum allowed (20) to a lower limit of six. A statistical approach was applied to determine representative wavelengths for each spectral data set. Representative wavelengths were assigned to corresponding wood components through a band assignment process, which showed strong agreement. The result also suggests variation in lignin content is more important for pulp yield models based on data having a narrow range.