Introduction

Bamboo, a major non-wood forest product belonging to the Poaceae family, is well known for its industrial uses (Fu 2001). Adult bamboo wood is one of the most important substitutes for timber in the wood industry (Janssen 2000). Another important product of bamboo is its edible juvenile shoot. Bamboo shoots have been eaten in China for more than 2500 years and are valued for their rich nutrient content and delicious taste (Satya et al. 2010). There are more than 300 species of bamboo in Asia, and most of them produce edible shoots, but fewer than 100 species are used for food (Grosser and Liese 1971). Bamboo shoots can be broadly categorized into two types, winter bamboo shoots and spring bamboo shoots, of which spring bamboo shoots are more popular (Choudhury et al. 2012). However, the taste of bamboo shoots varies greatly, and the large number of suboptimal-quality shoots on the market could seriously harm the healthy development of the bamboo shoot market (Kumar et al. 2017). Growing market demand therefore makes quality assessment of bamboo shoots necessary. Traditionally, bamboo shoots are identified by visual inspection or by laboratory wet chemical analysis, which is expensive and time-consuming. Wet chemical analysis also destroys the samples. A fast and reliable alternative method to classify different bamboo shoot species is therefore needed. Near-infrared reflectance (NIR) spectroscopy is rapid, repeatable, and non-destructive and has been widely applied in agriculture and the food industry (Nicolai et al. 2007; Porep et al. 2015).

For example, NIR spectroscopy has been used to investigate the quality of fruits such as strawberry (Amodio et al. 2017), apple (Beghi et al. 2014), and banana (Zude 2003), as well as to identify plants such as tea (Li and He 2008), Eucalyptus species (Castillo et al. 2008), and grapevine (Gutierrez et al. 2015). NIR spectroscopy has also been used to determine bamboo properties. For instance, NIR showed promise for the discrimination of three bamboo species by scanning their leaves (Wang et al. 2016). The lignification that is associated with the crude fiber content and firmness of bamboo shoots has also been successfully predicted by NIR spectroscopy (Xu et al. 2014). Furthermore, spectroscopic determination of the chemical composition and classification of bamboo fractions has been achieved (Ramirez et al. 2015). However, few studies have demonstrated the ability to discriminate different bamboo shoot species based on NIR spectroscopy.

NIR spectroscopy produces profiles containing a large amount of information, including not only information relevant to the target but also a great deal of irrelevant noise (Brenchley et al. 1997). It is therefore important to find a method that exploits the useful spectral information for better calibration. Recently, multivariate and chemometric methods, including support vector machines (SVM) (Devos et al. 2009; Vishwanathan and Murty 2002), partial least squares-discriminant analysis (PLSDA) (Ballabio and Consonni 2013), and random forest (RF) (Liaw and Wiener 2002), have been combined with NIR spectroscopy and successfully and widely used for classification. For instance, RF has been successfully applied to distinguish two pigweeds from three soybean varieties, with classification accuracies ranging from 93.8 to 100% (Fletcher and Reddy 2016). SVM produced efficient and promising results in the discrimination of three types of tea (Chen et al. 2007). PLSDA combined with NIR spectroscopy could be used as a rapid and non-destructive method to discriminate castor seeds for breeding programs (Santos et al. 2014). Apart from the classification method, the selection of important features can also strongly influence model performance. In plant samples, NIR spectra contain numerous overtones and combination bands of C-H, O-H, and N-H vibrations (Yang et al. 2018), most of which overlap strongly, and this affects the robustness and reliability of model calibration (Inagaki et al. 2018). Pre-processing methods, such as detrending and derivatives, as well as variable selection algorithms (Caliari et al. 2017; Mancini et al. 2018), combined with chemometric statistics, can efficiently reduce this band overlap and eliminate the irrelevant variables that influence model calibration (Rinnan et al. 2009). However, the comparison of different pre-processing methods combined with different classification methods, including SVM, PLSDA, and RF, for the discrimination of multiple bamboo shoot species has not been well researched.

Therefore, the present paper aims (1) to evaluate the capacity of NIR reflectance spectroscopy to discriminate four types of bamboo shoot species using SVM, PLSDA, and RF methods; (2) to compare the performance of different NIR spectra pre-processing methods in classification models for the best discrimination of bamboo shoots; and, most importantly, (3) to identify the most important variables related to bamboo shoot identification and to test the possibility of using optimal spectral informative variables for discrimination.

Methods and Materials

Sample Collection

In spring 2019, a total of 747 bamboo shoots from four of the most common edible bamboo species were collected at three bamboo forest sites in Chongqing for classification model building: Baijia (BJ) (Phyllostachys bissetii) bamboo, Gaojie (G) (Phyllostachys prominens) bamboo, Shui (S) (Phyllostachys heteroclada) bamboo, and Ping (P) (Qiongzhuea communis Hsueh) bamboo. Details are shown in Table 1.

Table 1 Main site characteristics of four bamboo shoot species

NIR Spectra Collection

All fresh bamboo shoots were extracted from the soil without roots and labeled. The outer sheath of each shoot was peeled off, and NIR spectra were then collected using a field spectrometer (LF-2500, Spectral Evolution, USA) with a 5 mm diameter fiber optic probe, a spectral range of 1100 to 2500 nm, and a 5.5 nm sampling interval; each spectrum was the average of 32 scans. For each bamboo shoot sample, three lines spaced 120° apart were first established. Along each line, NIR spectra were collected from the bottom to the top at a fixed spacing of 10 mm, and all spectra from the three lines of each sample were then averaged for final use. In total, 747 averaged spectra (one per sample) were obtained. The full range from 1100 to 2500 nm was used because wavelengths within this range are related to plant nutrition components (Gillon et al. 1999; Min et al. 2006).
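
The per-sample averaging described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `average_spectra` and the toy reflectance values are assumptions, and each spectrum is represented as a plain list of values on a shared wavelength grid.

```python
def average_spectra(spectra):
    """Return the band-wise mean of several spectra of equal length."""
    if not spectra:
        raise ValueError("need at least one spectrum")
    n_bands = len(spectra[0])
    if any(len(s) != n_bands for s in spectra):
        raise ValueError("all spectra must share the same wavelength grid")
    return [sum(values) / len(spectra) for values in zip(*spectra)]


# Toy example (hypothetical numbers): readings taken along three lines
# of one shoot, each reading with four bands instead of the full range.
line_a = [[0.40, 0.42, 0.39, 0.35], [0.44, 0.46, 0.41, 0.37]]
line_b = [[0.38, 0.40, 0.37, 0.33]]
line_c = [[0.42, 0.44, 0.43, 0.39], [0.41, 0.43, 0.40, 0.36]]

# Pool every reading from the three lines, then average per band.
sample_spectrum = average_spectra(line_a + line_b + line_c)
```

Pooling all readings before averaging gives each reading equal weight, so a line with more readings contributes more to the sample spectrum; averaging each line first would be an alternative design.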

Model Calibration and Validation

Three classification methods (PLSDA, SVM, and RF) were combined with three spectra pre-processing methods and their combinations, i.e., detrending (Det) and first (1st) and second (2nd) derivatives computed using Savitzky-Golay smoothing with a window size of 15 data points and a polynomial order of 2 (Press and Teukolsky 1990), to obtain the best classification model. In total, six spectra pre-processing options were tested for each machine learning model: raw spectra, Det, 1st derivative, 2nd derivative, Det+1st derivative, and Det+2nd derivative. The best pre-processing and classification methods were chosen for further use. For each classification model, 80% of the data set was randomly selected for internal calibration, and the remaining 20% was used for validation. This random split was repeated 200 times to evaluate performance. The approach, first described by Couture et al. (2016) for estimating model stability, provides the classification error based on 200 calibration models and is highly recommended for model calibration and validation. The overall accuracy (OVA.ACC), specificity (Spe), sensitivity (Sen), negative predictive value (Neg), and positive predictive value (Pos) from both calibration (Cal) and validation (Val) were used to track the model performance. A threshold of 0.5 was used to convert predicted probabilities to class labels. All analyses were conducted in R software (version 3.1.2) (R Core Team 2017). The e1071 package (Meyer et al. 2018) was used to fit the SVM models, the caret package (Wing et al. 2017) was used for PLSDA and RF model building, the prospectr package (Stevens and Ramirez-Lopez 2014) was used for spectra pre-processing, and the ggplot2 package (Wickham 2016) was used for data visualization.
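
The validation protocol above (repeated random 80/20 splits scored with Spe, Sen, Neg, Pos, and OVA.ACC) can be sketched as follows. This is an illustrative Python sketch, not the R workflow used in the study: all function names are hypothetical, the per-class metrics are assumed to be computed one-vs-rest, and `majority_class` is only a placeholder standing in for a real classifier such as SVM.

```python
import random


def split_indices(n_samples, cal_fraction, rng):
    """Randomly split sample indices into calibration and validation sets."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_cal = round(n_samples * cal_fraction)
    return indices[:n_cal], indices[n_cal:]


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sen, Spe, Pos and Neg with one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "Sen": ratio(tp, tp + fn),  # true positives / all actual positives
        "Spe": ratio(tn, tn + fp),  # true negatives / all actual negatives
        "Pos": ratio(tp, tp + fp),  # positive predictive value
        "Neg": ratio(tn, tn + fn),  # negative predictive value
    }


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def repeated_validation(X, y, fit_predict, n_repeats=200,
                        cal_fraction=0.8, seed=0):
    """Validation accuracy over repeated random calibration/validation splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        cal, val = split_indices(len(y), cal_fraction, rng)
        y_pred = fit_predict([X[i] for i in cal], [y[i] for i in cal],
                             [X[i] for i in val])
        accuracies.append(overall_accuracy([y[i] for i in val], y_pred))
    return accuracies


def majority_class(X_cal, y_cal, X_val):
    """Placeholder classifier: always predicts the most common label."""
    most_common = max(set(y_cal), key=y_cal.count)
    return [most_common] * len(X_val)
```

Calling `repeated_validation(X, y, majority_class)` returns 200 validation accuracies, whose mean and range correspond to the kind of summary reported for each model; a real classifier would be passed in place of `majority_class`.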

Results

Spectra Information

The average original (no pre-processing) and Det+2nd derivative spectra of the four bamboo shoot species are displayed in Fig. 1. The four species had similar curves in the original spectra, and it was difficult to observe differences with the naked eye. However, after Det+2nd derivative processing, three regions were found to differ among the four species: around 1680, 1950, and 2040 nm. These results indicate that NIR has the potential for use in bamboo shoot identification.
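
To illustrate how a Det+2nd derivative transform can expose differences that are hidden in the raw curves, the sketch below detrends a spectrum and then takes a Savitzky-Golay second derivative with a 15-point window and a second-order polynomial. It is a simplified Python stand-in under stated assumptions: a straight-line detrend is used here, whereas the prospectr routine used in the study may differ (for example, by normalising the spectrum first or fitting a higher-order polynomial), and both function names are hypothetical.

```python
def linear_detrend(y):
    """Subtract the least-squares straight line from a spectrum."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    return [v - (y_mean + slope * (i - x_mean)) for i, v in enumerate(y)]


def savgol_second_derivative(y, window=15):
    """Second derivative from a local quadratic fit (Savitzky-Golay, order 2).

    Only interior points are returned: the first and last window // 2
    points are dropped. Values are per squared band index; divide by the
    squared band spacing to express them per squared wavelength unit.
    """
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be an odd integer >= 3")
    m = window // 2
    offsets = range(-m, m + 1)
    s0 = window
    s2 = sum(i ** 2 for i in offsets)
    s4 = sum(i ** 4 for i in offsets)
    denom = s0 * s4 - s2 ** 2
    # Weight of each point in 2 * a2, where a2 is the fitted quadratic term.
    weights = [2 * (s0 * i ** 2 - s2) / denom for i in offsets]
    return [
        sum(w * y[c + i] for w, i in zip(weights, offsets))
        for c in range(m, len(y) - m)
    ]


def det_second_derivative(y, window=15):
    """Detrend first, then take the Savitzky-Golay second derivative."""
    return savgol_second_derivative(linear_detrend(y), window)
```

Because the second derivative removes constant and linear baseline terms, broad baseline differences between spectra are suppressed and narrow absorption features become easier to compare.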

Fig. 1

The average original (Raw)-NIR (upper) and Det+2nd derivative-NIR (lower) spectra of four bamboo species. Dotted line: same position as in (b). Each line represents one bamboo shoot species; the spectra were averaged at the species level

Model Comparison

The performance of the six pre-processing methods combined with the three classification models is shown in Table 2. The Spe, Sen, Neg, Pos, and OVA.ACC were recorded for the comparison of model performance. Regardless of the classification model, the six pre-processing methods ranked in the same order: the Det+2nd derivative had the highest Spe, Sen, Neg, Pos, and OVA.ACC values, followed by the Det+1st derivative, 2nd derivative, 1st derivative, and Det; the raw spectra (no pre-processing) had the lowest values in all models. The SVM model had the best performance with all six pre-processing methods. Therefore, the best pre-processing method and classification model were the Det+2nd derivative and SVM, which produced a mean and range of Spe, Sen, Neg, Pos, and OVA.ACC values all equal to 1 in the calibration and a mean Spe value of 0.98 (range 0.92–1), mean Sen value of 0.96 (range 0.83–1), mean Neg value of 0.98 (range 0.91–1), mean Pos value of 0.96 (range 0.84–1), and mean OVA.ACC value of 0.95 (range 0.91–0.99) in the validation. The Spe, Sen, Neg, Pos, and OVA.ACC values of the SVM model using the Det+2nd derivative NIR spectra varied over only a narrow range across the 200 simulations.

Table 2 Distribution (95% confidence intervals) of Spe, Sen, Neg, Pos, and OVA.ACC values in the calibration and validation statistics from 200 simulations of bamboo shoot classification models built with the SVM, PLSDA, and RF methods and six different pre-processing methods using full-length NIR spectra. Each model permutation included 80% of the data for internal calibration and the remaining 20% for validation

Variable Importance of NIR Spectra Applied to Bamboo Shoot Discrimination

The importance of the spectral variables used by the SVM model combined with the Det+2nd derivative is plotted in Fig. 2. The black color represents the important variables for the four bamboo shoot species selected by SVM. Most of the important spectral regions selected by the SVM model were similar among the four bamboo shoot species. Eight wavelength bands were found to strongly influence the model accuracy, i.e., 1015, 1135, 1175, 1338, 1380, 1620, 1690, and 1750 nm; among them, the bands around 1015, 1135, and 1338 nm are considered the most important regions. The most important spectral bands were mostly located in the 1000–1800 nm region.

Fig. 2

Influence of bamboo shoot classification types on NIR spectra in the SVM model using Det+2nd derivative spectra. Black line: important spectral regions selected by the SVM model; red line: regions less important than the black regions

Model Evaluation

In total, 40 out of 256 variables considered to be important to the model classification were selected to build a reduced SVM classification model. This model was then applied to the validation set for comparison with the full-length spectra SVM model. The confusion matrices of the validation data predicted by the SVM model using the selected variables and the full-length variables are displayed in Fig. 3. A high classification accuracy of 0.95 was obtained from the full-length spectra (Fig. 3a). A reasonably high classification accuracy of 0.91 was also produced by the important variables alone (Fig. 3b), which is only 0.04 lower than that of the full-length spectra even though fewer than 16% of the variables were used.
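
The reduction from 256 to 40 variables can be illustrated with a simple top-k selection. The text does not state how the importance scores were computed or thresholded, so the scores below are hypothetical placeholders and the function names are assumptions; the sketch only shows the selection and the resulting share of variables.

```python
def top_k_variables(importance, k):
    """Indices of the k variables with the largest importance scores."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


def subset_columns(spectrum, keep):
    """Keep only the selected variables of one spectrum."""
    return [spectrum[i] for i in keep]


# Hypothetical importance scores for 256 variables (not from the study).
importance = [((i * 37) % 256) / 255 for i in range(256)]

keep = top_k_variables(importance, k=40)
fraction_kept = len(keep) / len(importance)  # 40 / 256 = 0.15625
```

A reduced model would then be calibrated on `subset_columns(spectrum, keep)` for every spectrum, using the same calibration and validation split as the full-length model so that the two accuracies are comparable.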

Fig. 3

The misclassification of four bamboo shoot species based on the validation of the SVM model using full-length (a) and selected spectral variables (b) based on Det + second-derivative NIR spectra

Discussion

Different classification methods may have different accuracies depending on the target materials. In this study, a high OVA.ACC ranging from 0.91 to 0.99 was obtained using the Det+2nd derivative spectra and the SVM model. The SVM model performed better than PLSDA and RF, similar to the findings of Tankeu et al. (2018), who reported that the SVM model yielded better predictions for the identification of true black cohosh from hyperspectral imaging data, and Sun et al. (2017), who showed that the SVM model combined with a hyperspectral reflectance imaging technique had high accuracy (92.96–97.28%) in the detection of chilling injury in peaches. SVM combined with NIR spectroscopy also yielded promising results for the prediction of the chilling storage stage of eggplants (Tsouvaltzis et al. 2020) and of different rice flour types (Sampaio et al. 2020). In addition, SVM methods can be applied to image classification; for example, the SVM model has been used to classify different soil types using soil images. In contrast to these results, PLSDA had better accuracy than SVM in the identification of transgenic maize kernels (Feng et al. 2017). Spectra pre-processing can also improve model performance. Compared with the raw spectra, the pre-processed NIR spectra greatly improved the performance of the PLSDA, SVM, and RF models in this study, similar to the result reported by Qiu et al. (2018), who found that first and second derivatives with the Savitzky-Golay filter yielded a better classification accuracy (approximately 98%) using the PLSDA model for the detection of artificial aging of corn seeds compared with embryo and endosperm FT-NIR spectra. The second derivative also extracted the most useful spectral variables for the prediction of extractive contents in Eucalyptus bosistoana during model calibration (Li and Altaner 2019). This study found that Det combined with the 2nd derivative yielded the best accuracy for the PLSDA, SVM, and RF models.

Eight wavelengths, i.e., 1015, 1135, 1175, 1338, 1380, 1620, 1690, and 1750 nm, in the NIR spectra were considered the most important bands for the classification of bamboo shoots in this study. The bands around 1015, 1135, and 1175 nm were strongly related to the second overtone of the C-H stretching vibration, which mainly represents terpenes (Ma et al. 2019). The first overtones of the C-H and O-H group stretching were mainly located at 1338 and 1380 nm, respectively (Schwanninger et al. 2011). The bands around 1620, 1690, and 1750 nm were associated with the first overtone of the C-H group, which is related to compounds such as fatty acids, starch, and lignin (Hourant et al. 2000; Qiu et al. 2018; Schwanninger et al. 2011). Carbohydrates and proteins are the most important components that strongly influence the quality of bamboo shoots. The bands associated with these components were strongly represented among the important NIR variables in this study, which suggests that bamboo shoot species can be identified using NIR technology.

The results showed that the important variables from the NIR spectra, which made up fewer than 16% of the full-length spectral variables, yielded a promising and reliable classification with a mean accuracy of 0.91, i.e., slightly less than that of the full-length spectra (0.95). Besides the chemically informative bands, NIR spectra contain much irrelevant information, which increases the processing time of model calibration and prediction. Therefore, using only the most important wavelengths to build a model that is similarly reliable to the full-length spectra model could greatly reduce the number of spectral bands used and improve the model performance (Workman Jr and Weyer 2012).

The cost and time required by traditional methods limit the large-scale quality assessment of bamboo shoots and will restrict the development of the bamboo shoot market. The method presented here provides an advanced approach for bamboo shoot identification and allows for rapid measurement of a large number of samples.

Conclusions

In conclusion, the use of NIR spectroscopy for the classification of different bamboo shoot species is feasible. The Det+2nd derivative pre-processing combined with the SVM method produced the highest classification accuracy and significantly avoided overfitting compared with the PLSDA and RF models. Using only the most relevant spectral variables in the SVM model efficiently reduced the number of variables and yielded an accuracy nearly as high as that of the full set of variables. The results indicate that NIR technology could be a reliable and robust method for on-line bamboo shoot classification and can support farmers and industries in monitoring the quality of bamboo shoots.