Introduction

Pummelo is one of the most important species in Citrus. Pummelo fruits are attractively large, with flesh colors ranging from white, yellow and pink to red, and are high in vitamin C. Pummelo peels are very thick and contain significant amounts of antioxidant flavonoids, mostly naringin and neohesperidin, as well as the dietary fiber pectin; the leaves, flowers and fruits are rich in essential oils (Du and Chen 2009; Huang and Chen 1998). For these reasons, the pummelo industry is very important in China and in other pummelo-growing countries.

There is no simple and rapid method for identifying pummelo cultivars. Many pummelo cultivars are derived from elite lines of seed progenies, so they are morphologically very similar to each other. Existing identification methods are mainly based on morphology, palynology, biochemistry, cytology and molecular biology (Uzun et al. 2009); they are usually time-consuming and costly, and are therefore impractical where rapid identification is needed.

Vis/NIR spectroscopy is a powerful technique widely used in the quantitative and qualitative analysis of materials, including plant materials, that have characteristic spectroscopic profiles. The technique is inexpensive and simple to use, requires a very small amount of sample and achieves rapid and highly sensitive measurement. It has been used in the identification and classification of various agricultural materials including strawberries (Sánchez et al. 2012), wolf berries (Du and Gong 2013), milk powders (Wu et al. 2008b) and natural textile raw materials (Zhou et al. 2008). A wide range of plant materials can be used in Vis/NIR analysis, but leaves are particularly well suited to cultivar identification for two reasons: first, leaves are available nearly all year round, unlike other plant organs, especially on evergreen fruit trees; second, a leaf spectrum is easy to obtain.

Vis/NIR spectroscopy is usually combined with pattern recognition methods in identification and classification research, for example on pharmaceutical excipients (Candolfi et al. 1999), instant milk teas (Liu et al. 2009), apple cultivars (Luo et al. 2011) and moldy chestnuts (Zhou et al. 2011). Soft independent modeling of class analogy (SIMCA), partial least squares discriminant analysis (PLS-DA), back propagation neural network (BPNN) and least squares-support vector machine (LS-SVM) are the most widely used pattern recognition methods in qualitative analysis.

As far as the authors are aware, there has been no report on the identification of pummelo cultivars using Vis/NIR spectroscopy. In this study, the applicability of the technique to identifying pummelo cultivars was evaluated, and models were established using SIMCA, PLS-DA, BPNN and LS-SVM. The objective of this study was to develop a rapid and reliable method for the identification of pummelo cultivars using leaf spectra obtained by Vis/NIR spectroscopy.

Materials and methods

Sample preparation

The four pummelo cultivars used in this study (Table 1) were maintained in the National Citrus Germplasm Repository of Chongqing, China. Healthy mature leaves were randomly collected from spring shoots growing inside the tree canopy in mid-September of 2013. Sampled leaves were temporarily kept in sealed plastic bags that were immediately placed in a storage box chilled with ice packs. A total of 240 leaf samples (60 leaves for each cultivar) were obtained. To develop calibration models, 192 samples (48 for each cultivar) were randomly selected as the calibration set and the remaining 48 samples (12 for each cultivar) were used as the validation set.
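The per-cultivar random split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `stratified_split`, the seed and the use of Python's `random` module are assumptions; only the counts (60 leaves per cultivar, 48 for calibration, 12 for validation) and the cultivar codes come from the text.

```python
import random

def stratified_split(labels, n_cal_per_class, seed=0):
    """Return (calibration_indices, validation_indices), drawn separately per class."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for repeatability
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    cal, val = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cal.extend(indices[:n_cal_per_class])
        val.extend(indices[n_cal_per_class:])
    return sorted(cal), sorted(val)

# 60 leaves for each of the four cultivars (codes as used later in the text)
labels = [c for c in ("LN", "GX", "PS", "GL") for _ in range(60)]
cal_idx, val_idx = stratified_split(labels, n_cal_per_class=48)
print(len(cal_idx), len(val_idx))  # 192 48
```

Drawing the split within each cultivar, rather than over all 240 leaves at once, is what guarantees exactly 48 and 12 samples per class.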

Table 1 The plant materials used in the study

Equipment and spectra acquisition

The Vis/NIR spectra of pummelo leaves were collected in a darkroom using a Fieldspec Handheld spectrometer (Analytical Spectral Devices, Inc., Boulder, Colorado, USA). Illumination was provided by a halogen lamp (ASD Pro Lamp) at a 45° angle. The distance between the light source and the surface of the leaf sample was 300 mm. The spectrometer was approximately 150 mm from the surface of the leaf being analyzed. The detectable wavelength range of the spectrometer was 325 to 1075 nm, with a sampling interval of 1.6 nm and a spectral resolution of 1 nm. The field-of-view (FOV) of the spectrometer was 25°. The light source was turned on 30 min before analysis to warm up the halogen bulb. Before spectra acquisition, the instrument was calibrated with a standard whiteboard. Each leaf was placed on a 150 mm × 300 mm black fabric. Ten spectra were recorded and averaged for every leaf sample.

All spectra were read and exported from the instrument by the RS3 software (Analytical Spectral Devices, Inc., Boulder, Colorado, USA). The software Unscrambler V9.7 (CAMO ASA, Oslo, Norway) was used for data preprocessing and for the SIMCA and PLS-DA models. The BPNN model was implemented in Matlab R2010a (The MathWorks, Inc., Massachusetts, USA), and the free LS-SVM toolbox (LS-SVM v 1.7, Suykens, Leuven, Belgium) was used to build the LS-SVM model.

Pre-processing

To avoid regions with a low signal-to-noise ratio, only the spectral range of 400–1000 nm was used in this study. The raw spectral data were pre-processed to reduce noise by moving average smoothing with nine segments (Gao et al. 2004). To remove baseline drifts and enhance spectral differences, the second-order derivative was calculated by the Savitzky–Golay convolution method with a second-order polynomial and a 5-point segment (Chu et al. 2004).
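A minimal sketch of the two pre-processing steps is given below. The study used Unscrambler for this, so the code is only an illustration in plain Python. It assumes that "nine segments" means a 9-point centered window and that edge points are handled with a shrunken window; both are assumptions, as are the function names. The weights [2, -1, -2, -1, 2]/7 are the standard Savitzky–Golay coefficients for a second derivative with a 5-point window and a second-order polynomial.

```python
def moving_average(values, window=9):
    """Centered moving average; near the edges the window shrinks to stay symmetric."""
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        k = min(half, i, n - 1 - i)
        segment = values[i - k:i + k + 1]
        out.append(sum(segment) / len(segment))
    return out

# Savitzky-Golay second-derivative weights (5 points, second-order polynomial)
SG_D2_5PT = (2 / 7, -1 / 7, -2 / 7, -1 / 7, 2 / 7)

def sg_second_derivative(values, step=1.0):
    """Second derivative at interior points; two points are lost at each end."""
    out = []
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out.append(sum(w * v for w, v in zip(SG_D2_5PT, window)) / step ** 2)
    return out

# Synthetic stand-in for one reflectance spectrum (not measured data)
spectrum = [0.5 * x * x + 3.0 for x in range(30)]
smoothed = moving_average(spectrum, window=9)
d2 = sg_second_derivative(smoothed)
print(len(d2))  # 26
```

For a purely quadratic input the second derivative is constant away from the edges, which gives a quick sanity check on the coefficients.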

Principal component analysis

Principal component analysis (PCA) was used to simplify the complex spectral data (He et al. 2006). In the end, the first eight principal components (PCs) were extracted and used in relevant analysis.
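As an illustration of how principal component scores and their cumulative explained variance can be obtained, the sketch below applies a singular value decomposition to mean-centered data with NumPy. It is not the Unscrambler implementation used in the study; the function name `pca_scores` and the random matrix standing in for the calibration spectra are assumptions.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project mean-centered rows of X onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = Xc @ Vt[:n_components].T
    return scores, np.cumsum(explained)

# Synthetic stand-in: 192 "spectra" with 50 variables each (not measured data)
rng = np.random.default_rng(0)
X = rng.normal(size=(192, 50))
scores, cumulative = pca_scores(X, n_components=8)
print(scores.shape)  # (192, 8)
```

The `cumulative` array plays the role of the accumulated values used to decide how many PCs to keep.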

Soft independent modeling of class analogy

Soft independent modeling of class analogy (SIMCA) is a pattern recognition method based on PCA. PCA was applied to the calibration set to develop a PCA model for each class. The number of PCs needed to describe most of the variation within each class was determined by cross-validation (Wold 1976). Unknown samples were then identified by their distances to the different PCA models. The model distance limit S_max is calculated for the class model m as in Eq. 1:

$$ S_{max\left( m \right)} = S_{o} \left( m \right)\sqrt {F_{C} } $$
(1)

where S_o is the average distance within the model and F_c (Fisher criterion) is the critical value. The performance of SIMCA is evaluated by recognition rate and rejection rate (Ning et al. 2008).
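Eq. 1 can be written as a short helper. The sketch assumes that a sample is accepted by class m when its distance to that class model does not exceed S_max; this acceptance rule, the function names and the numerical values are illustrative assumptions, not taken from the study.

```python
import math

def simca_distance_limit(s0, f_crit):
    """Eq. 1: S_max(m) = S_o(m) * sqrt(F_c)."""
    return s0 * math.sqrt(f_crit)

def accepted_by_class(sample_distance, s0, f_crit):
    """Assumed rule: accept the sample if its distance is within the class limit."""
    return sample_distance <= simca_distance_limit(s0, f_crit)

# Hypothetical values chosen only to show the arithmetic
limit = simca_distance_limit(s0=0.02, f_crit=4.0)
print(accepted_by_class(0.03, 0.02, 4.0), accepted_by_class(0.05, 0.02, 4.0))
```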

Partial least squares discriminant analysis

Partial least squares discriminant analysis (PLS-DA) is a supervised pattern recognition method that is based on partial least squares regression. In PLS-DA, each sample in the calibration set needs to be assigned a dummy variable as a categorical variable Y for each cultivar (0001 for LN, 0010 for GX, 0100 for PS and 1000 for GL). A regression analysis between categorical variable Y and spectral data of calibration set X is then conducted to establish the PLS regression model. The general model of PLS-DA can be written as Eqs. 2 and 3:

$$ X = TP^{T} + E $$
(2)
$$ Y = UQ^{T} + F $$
(3)

where X represents the matrix of spectral data, T is a factor score matrix, P is the X loadings and E is the residual or noise term. Y is a matrix of the categorical variable, U is the scores for Y, Q is the Y loadings, and F is the residuals. Full cross-validation is used to evaluate the quality of the calibration model and to prevent over-fitting. Finally, predicted categorical variable values (Yp) are determined by the PLS-DA model. If, for a sample of the validation set and a given categorical variable, \( \left| {Yp - Y} \right| < 0.5 \) and the deviation value is <0.5, the sample is assigned to that class (Yang et al. 2008).
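The dummy coding and the class assignment step can be sketched as follows. The sketch reads the decision rule as "the predicted value lies within 0.5 of the target dummy value in every column", which is an interpretation; it also leaves out the separate check on the deviation value, which depends on the software's prediction uncertainty estimate. The names `CLASS_CODES` and `assign_class` are hypothetical.

```python
# Dummy codes for the four cultivars, as listed in the text
CLASS_CODES = {
    "LN": (0, 0, 0, 1),
    "GX": (0, 0, 1, 0),
    "PS": (0, 1, 0, 0),
    "GL": (1, 0, 0, 0),
}

def assign_class(y_pred, tolerance=0.5):
    """Return the cultivar whose code is matched within the tolerance on every column, else None."""
    for name, code in CLASS_CODES.items():
        if all(abs(p - c) < tolerance for p, c in zip(y_pred, code)):
            return name
    return None

print(assign_class((0.1, -0.05, 0.2, 0.9)))  # LN
print(assign_class((0.5, 0.5, 0.0, 0.0)))    # None
```

Because the tolerance is strict and the codes differ in at least one column, at most one cultivar can match a given prediction.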

Back propagation neural network

As one of the most popular neural network topologies, the BPNN is commonly used in current Vis/NIR spectroscopy analysis. The BPNN consists of an input layer, one or more hidden layers and an output layer. The training procedure of a BPNN comprises a forward propagation of the input signal and a backward propagation of the error. The signal is transmitted successively through the input layer, the hidden layer and the output layer, and the state of each layer affects only the next layer. If the expected output is not obtained at the output layer, the procedure switches to back-propagation. According to the back-propagated error signal, the algorithm adjusts the weights in each successive layer to reduce the errors at each level until all the errors are within the required tolerance (Yang et al. 2011). A schematic diagram of the multilayer neural network architecture is shown in Fig. 1.

Fig. 1 Schematic diagram of the topological structure of BPNN

The transfer function of each network layer takes the form:

$$ f\left( x \right) = \frac{1}{{1 + e^{ - x/Q} }} $$
(4)

where Q is the parameter of the sigmoid function.
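Eq. 4 translates directly into code. The sketch below is illustrative only; the function name and the default Q = 1 are assumptions. It shows that Q acts as a scale on the input, so a larger Q gives a flatter curve.

```python
import math

def sigmoid(x, q=1.0):
    """Eq. 4: f(x) = 1 / (1 + exp(-x / Q))."""
    return 1.0 / (1.0 + math.exp(-x / q))

# Dividing by Q rescales the input: f(2; Q=2) equals f(1; Q=1)
print(sigmoid(0.0))  # 0.5
```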

When constructing a BPNN model, parameters such as the structure, the initial connection weights and thresholds, the learning rate and the expected error need to be considered. In theory, a BPNN with three layers can solve an arbitrary classification problem. In this study, a three-layer BPNN was constructed to identify different cultivars of pummelo. PCs extracted by PCA were used as inputs, which simplified the complex spectral data (He et al. 2006). Dummy variable values (0001 for LN, 0010 for GX, 0100 for PS and 1000 for GL) were used as outputs.

Least squares-support vector machine

Least squares-support vector machine (LS-SVM) is a statistical learning method that adopts a least squares linear system as its loss function. LS-SVM is capable of linear and non-linear multivariate calibration and solves multivariate calibration problems relatively quickly, so it has been commonly applied in current Vis/NIR spectroscopy analysis. The theoretical background of the LS-SVM algorithm is given in Wu et al. (2008a) and Suykens and Vandewalle (1999). The LS-SVM regression model can be expressed as:

$$ Y\left( X \right) = \mathop \sum \limits_{i = 1}^{n} \alpha_{i} K\left( {X,X_{i} } \right) + b $$
(5)

where α_i is the Lagrange multiplier, called the support value, X_i is the input vector, K(X, X_i) is the kernel function that maps the input into a high-dimensional space and b is the bias term. Determining the optimal input feature subset, a proper kernel function and the optimum kernel parameters are three crucial problems that need to be solved in LS-SVM. Because the radial basis function (RBF) kernel can handle non-linear relationships, reduce the computational complexity of the training procedure and give good performance under general smoothness assumptions (Borin et al. 2006), it was used as the kernel function of LS-SVM in this study. The RBF kernel can be expressed as:

$$ {\text{Radial basis function}}:K\left( {X,X_{i} } \right) = \exp \left( { - \frac{{\left\| {X - X_{i} } \right\|^{2} }}{{\sigma^{2} }}} \right) $$
(6)

where σ is the kernel parameter.
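To make Eqs. 5 and 6 concrete, the sketch below fits an LS-SVM on toy data with NumPy. It is not the LS-SVM toolbox used in the study. It assumes the standard LS-SVM dual formulation, in which the support values α and the bias b are obtained from one linear system with a regularization parameter, here called `gamma`. The function names, the toy data and the parameter values are all hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Eq. 6 for all row pairs: K(x, x_i) = exp(-||x - x_i||^2 / sigma^2)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha] = [0, y] for alpha and b."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    solution = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return solution[1:], solution[0]

def lssvm_predict(X_new, X_train, alpha, b, sigma2):
    """Eq. 5: Y(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# Toy two-class data with targets -1 and +1 (not measured spectra)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
alpha, b = lssvm_fit(X_train, y_train, gamma=10.0, sigma2=4.0)
pred = lssvm_predict(np.array([[0.5, 0.5], [5.5, 5.5]]), X_train, alpha, b, sigma2=4.0)
print(np.sign(pred))
```

Training reduces to a single call to a linear solver, which is the main practical difference from a standard SVM, where a quadratic program has to be solved.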

A grid-search technique and cross-validation were applied to find the optimal parameter values. To simplify the complex spectral data, PCs extracted by PCA were used as inputs to LS-SVM.

Results and discussion

Principal component analysis

Three-dimensional scatter plots of the four pummelo cultivars were generated from a PCA of their leaf spectra. As shown in Fig. 2, leaf samples from the four cultivars formed four groups in 3D space. However, there was a moderate overlap between samples of GX and LN and a small overlap between samples of PS and GL. This indicated either that the leaf spectra of GX and LN were more similar to each other, or that three PCs, with a total accumulated reliability of only 64.4 %, were not sufficient to express their differences. Table 2 shows the accumulated reliability: the first 6 PCs represented 81.7 %, considerably more than the 77.1 % represented by the first 5 PCs, whereas the accumulated reliability changed little from the first 6 PCs to the first 10 PCs. The BPNN and LS-SVM models were therefore established with different numbers of PCs, from 6 to 10. The optimal number of PCs was determined to be 8 after comparing the final prediction results of the BPNN models and the LS-SVM models. Thus, the first 8 PCs were used as the input variables in building the subsequent BPNN and LS-SVM models.

Fig. 2 3D-scatter plots of PC1 vs PC2 vs PC3 extracted by PCA of four cultivars of pummelo

Table 2 The first 10 principal components of the accumulated reliability

Results using SIMCA

The discrimination results based on SIMCA are shown in Table 3. When F_c was set to 25 %, the appropriate number of PCs determined by cross-validation was 6, 4, 5 and 3 for LN, GX, PS and GL respectively. In the calibration set, a recognition rate of 100 %, with the smallest percentage of rejection, was reached for PS. The average rates of recognition and rejection for the calibration set were 95.83 and 83.33 % respectively. In the validation set, the highest rate of recognition was 91.67 %, and the average rates of recognition and rejection were 79.17 and 83.33 % respectively, which was not acceptable for discriminating the four pummelo cultivars.

Table 3 Results of calibration set and validation set using SIMCA

Results using PLS-DA

The discrimination results of the PLS-DA model are shown in Table 4. Using the model, 98.44 % of the samples in the calibration set were correctly identified. In the validation set, all samples from GX and GL were correctly discriminated, but two PS samples and one LN sample were misidentified. In total, 93.75 % of the samples in the validation set were correctly discriminated. The PLS-DA model performed better than the SIMCA model, though a discrimination rate of 83.33 % for PS was still insufficient.
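The validation percentages quoted for PLS-DA follow directly from the error counts given in the text (12 validation leaves per cultivar, two errors for PS and one for LN). The short check below reproduces that arithmetic; the helper name is an assumption.

```python
def accuracy_percent(n_correct, n_total):
    """Percentage of correctly identified samples, rounded to two decimals."""
    return round(100.0 * n_correct / n_total, 2)

# Misidentified validation samples per cultivar, as stated in the text
errors = {"LN": 1, "GX": 0, "PS": 2, "GL": 0}
per_class = {c: accuracy_percent(12 - e, 12) for c, e in errors.items()}
overall = accuracy_percent(48 - sum(errors.values()), 48)
print(per_class["PS"], overall)  # 83.33 93.75
```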

Table 4 Results of calibration set and validation set using PLS-DA

Results using BPNN

In building the BPNN model, the momentum and the minimum learning rate were set at 0.6 and 0.01, while the threshold residual error and the training time were set at 0.001 and 1500 s, respectively. Hyperbolic tangent sigmoid and logarithmic sigmoid transfer functions were applied in the hidden and the output layers respectively. After adjustment of the parameters, a three-layer BPNN model was built, in which the number of neurons was 8, 9 and 4 for the input layer, the hidden layer and the output layer respectively.
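The forward pass of such an 8-9-4 network can be sketched in plain Python as below. Only the layer sizes and the two transfer functions come from the text. The weights are random and untrained, the biases are zero, and all names are hypothetical, so the outputs illustrate the data flow only and carry no classification meaning.

```python
import math
import random

def tansig(x):
    """Hyperbolic tangent sigmoid (hidden layer)."""
    return math.tanh(x)

def logsig(x):
    """Logarithmic sigmoid (output layer)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(pcs, params):
    hidden = layer(pcs, params["W1"], params["b1"], tansig)
    return layer(hidden, params["W2"], params["b2"], logsig)

def random_params(n_in=8, n_hidden=9, n_out=4, seed=0):
    """Untrained placeholder weights; a real model would learn these by back-propagation."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}

params = random_params()
pcs = [0.1 * k for k in range(8)]  # stand-in for the first 8 PC scores of one leaf
outputs = forward(pcs, params)
print(len(outputs))  # 4
```

Each of the four outputs lies between 0 and 1, which matches the 0/1 dummy coding used as the training target.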

As shown in Table 5, the BPNN model identified 98.44 % of the calibration samples correctly, which is comparable to the result of the PLS-DA model. The overall accuracy of the BPNN model in identifying the samples of the validation set was 97.92 %, the best among the three models and higher than that obtained by either the SIMCA model or the PLS-DA model. This indicated that Vis/NIR spectroscopy combined with BPNN could be used to identify pummelo cultivars.

Table 5 Results of calibration set and validation set using BPNN

Results using LS-SVM

LS-SVM is usually applied to two-class classification. To use it for multi-class classification, the minimum output coding (MOC) scheme was introduced into LS-SVM in this study (Van Gestel et al. 2002). The first 8 PCs obtained from PCA were used as inputs in LS-SVM modeling. The adjustment parameter γ and the RBF kernel parameter σ² were 43.3768 and 0.017499 respectively, both of which were optimized by a grid-search technique and cross-validation. As shown in Table 6, the overall accuracy for the calibration set was 98.44 %, the same as that obtained by the PLS-DA and the BPNN models. For the validation set, a prediction accuracy of 97.92 % was obtained, the same as that obtained by BPNN.
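Minimum output coding represents M classes with the smallest number of binary classifiers, ceil(log2 M), so four cultivars need only two. The sketch below builds such a codebook and decodes a pair of classifier outputs. The particular assignment of code words to cultivars and the function names are hypothetical, since the text does not give the codebook used.

```python
import math

def moc_codebook(classes):
    """Assign each class a distinct +1/-1 code word of ceil(log2(n_classes)) bits."""
    n_bits = max(1, math.ceil(math.log2(len(classes))))
    return {
        name: tuple(1 if (i >> bit) & 1 else -1 for bit in reversed(range(n_bits)))
        for i, name in enumerate(classes)
    }

def moc_decode(outputs, codebook):
    """Map the signs of the binary classifier outputs back to a class, or None."""
    signs = tuple(1 if o >= 0 else -1 for o in outputs)
    for name, code in codebook.items():
        if code == signs:
            return name
    return None

codebook = moc_codebook(["LN", "GX", "PS", "GL"])
print(codebook["PS"])                       # (1, -1)
print(moc_decode((0.7, -0.2), codebook))    # PS
```

Each bit position corresponds to one binary LS-SVM, trained to separate the classes whose code has +1 in that position from those with -1.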

Table 6 Results of calibration set and validation set using LS-SVM

The results showed that both the BPNN model and the LS-SVM model were superior to the PLS-DA model and the SIMCA model for the classification of pummelo cultivars. However, there were cases where LN was misidentified as GL, or vice versa, when the calibration samples were analyzed by PLS-DA, BPNN and LS-SVM. This is perhaps because GL and LN are genetically closer to each other.

Conclusions

In this study, leaf Vis/NIR spectra of four pummelo cultivars were obtained and analyzed, and four pattern recognition methods, SIMCA, PLS-DA, BPNN and LS-SVM, were compared. PCA was used to extract PCs that served as inputs for the LS-SVM and BPNN models. The BPNN and LS-SVM models performed better than the SIMCA and PLS-DA models, as shown by their respective discrimination accuracies. Both the BPNN model and the LS-SVM model achieved a discrimination accuracy of 98.44 % for the calibration set and 97.92 % for the validation set. The PLS-DA model reached the same identification rate of 98.44 % for the calibration set, yet its discrimination accuracy for PS in the validation set was only 83.33 %. Thus, the PLS-DA model was insufficient for discriminating pummelo cultivars. In contrast, the BPNN model and the LS-SVM model, as non-linear methods, performed better in discriminating pummelo cultivars.

This research demonstrated that Vis/NIR spectroscopy combined with BPNN or LS-SVM has the potential to identify different cultivars of pummelo and offers a new approach to fast discrimination. Such a system would considerably reduce the labor required of crop experts in identifying pummelo cultivars, and would make rapid, on-the-go identification in the field possible, especially for nursery stock of Citrus before field planting.

Further studies are still needed to cover more cultivars of pummelo and to develop a more robust discrimination model before this fast discrimination system can be applied in the field.