Introduction

Sorghum is an ancient cereal crop, characterized by strong disease resistance, wide adaptability, tolerance to barrenness, aridity, saline, and alkali stresses, and tolerance to water-soluble fertilizer. It can also be planted in arid hills and barren mountainous areas where staple crops requiring high level of water-soluble fertilizer are inappropriate to be planted [1]. Sorghum planting can increase the efficiency of the grain-production process, provide diverse types of grain, and promote the development of livestock farming [2, 3]. Moreover, Sorghum can be used as an industrial raw material applied in products, such as starch, alcohol, vitamins, and biomass. In particular, it is broadly applied in the brewing industry and has significance to broader socio-economic development [3,4,5,6,7,8]. However, the differences between varieties of sorghum are significant so accurate identification of the varieties is conducive to determination of the characteristics of different varieties of sorghum. This helps sorghum planters to choose the varieties suitable for their planting conditions and demands to set more effective planting management strategies [4]. Meanwhile, it can also ensure diversity and integrity of varieties in the seed repository and guarantee the effective utilization and development of germplasm resources. During breeding, identification of the varieties can help breeding operators to make a rational selection of the parents and avoid hybridization and mixing of different varieties. This can improve breeding efficiency and the rate of success, also facilitating the determination of the genetic purity of hybrid offspring for hybrid and related varieties [4,5,6]. In doing so, breeding hybrid failure can be avoided. Identification of the varieties is of significance to improvement of varieties, new cultivar breeding, and their popularization. By means of detection of the varieties, the contents of starch and tannin in sorghum grain can be obtained. The research provides supports for brewing enterprises, also helping to justify the present research into the identification of varieties of sorghum.

Different sorghum varieties have different growth characteristics and agronomic traits. Traditional methods of detection of varieties, such as morphological and physiological methods, fail to identify precisely the varieties due to small differences among the varieties of various families and genera. Chemical analysis methods such as phenol staining, and DNA-based molecular technique require grinding of the grains, rendering the process inefficient, while preventing rapid non-destructive assay [9]. As the hyperspectral imaging technique is a rapid, non-destructive identification technique, it integrates merits, such as machine vision and spectral analysis. It has been applied in detection and identification of varieties for crops, such as wheat, corn, and millet [10]. Different sorghum varieties have unique reflectance characteristics in the hyperspectral bands, and by analyzing this spectral data, differences between varieties can be identified. Compared to traditional observation and measurement methods, hyperspectral imaging technology has higher spectral resolution and accuracy, providing more accurate variety identification results. At the same time, labor and time costs are reduced. Due to a huge amount of spectral data and redundancy generated when using the hyperspectral technique, feature extraction algorithms were used to conduct dimensionality reduction on the spectra data and chemical significance of selecting variables becomes easier to elucidate [11, 12]. Song et al. performed principal component analysis to extract 25 characteristic wavelengths of wheat spectral and used a support vector machine (SVM) to construct classification models, with an accuracy of 97.54% [13]. Zhu et al. utilized a deep convolutional neural network to identify wheat grains [14]. Yang et al. used a successive projection algorithm to extract 19 characteristic wavelengths from the hyperspectral data of waxy corn seed varieties and used an SVM to establish cultivar classification models with an accuracy of 98.2% [15]. Kabir et al. used visible–near-infrared spectroscopy on 16 varieties of millet and integrated principal component analysis (PCA) dimensionality reduction with the k-nearest-neighbor (KNN) algorithm, linear discriminant analysis, logistic regression, random forest (RF), and SVM algorithms, respectively to build cultivar classification methods [16]. The results indicated that RF and SVM were the most effective. At present, research into rapid identification of varieties of sorghum using VNIR-HSI technique is sparse, thus, the research is undertaken with the aim of obtaining an efficient non-destructive testing method for distinguishing between varieties of sorghum.

In these experiments, taking 27 varieties of sorghum as research objects, Innovation includes the following three points:

  1. (1)

    Visible and near-infrared spectroscopy were used to acquire the sample information.

  2. (2)

    Competitive adaptive reweighted sampling (CARS) was then integrated with RF to obtain a model for the identification of varieties of sorghum.

  3. (3)

    Indices, such as identification precision and Cohen’s kappa value, were adopted to test the performance of the proposed method which was applied across the entire sample and each single cultivar therein.

In doing so, a rapid, non-destructive method of identification of varieties of sorghum was developed (Fig. 1).

Fig. 1
figure 1

Analysis process flow chart

Materials and methods

Experimental samples

A total of 3240 samples of 27 varieties of sorghum (120 samples for each cultivar) cultivated from the Sorghum Research Institute of Shanxi Agricultural University were selected. The sorghum seeds selected that were plump and highly regular in shape were classified, then sealed for storage in a jar. The type of the varieties of sorghum is represented by V1–V27, as shown in Fig. 2, among which, the varieties (V1, V3, V9, V13, V15, V16, and V24) are glutinous varieties of sorghum, having a white color. The other varieties of sorghum appear red in color. Other physical properties of the sorghum seeds are similar. In this experiment, non-glutinous and glutinous sorghum seeds were randomly arranged to increase the adaptability of the model.

Fig. 2
figure 2

Sample diagram

Hyperspectral imaging system

The VNIR-HSI (Headwall Photonics, USA) scanning platform was used to finish the image acquisition of sample varieties of sorghum. This system consists of a canning platform, a micro near-infrared hyperspectral imager with an aperture of 1.4 and focal length of 25 mm, light source, a controller, and computer. Its spectral range is 380–1000 nm at a spectral resolution of 0.727 nm, with a total of 856 wave bands. The image acquisition parameters are as follows: the object distance is 370 mm, the pushing distance is 100 mm, and the platform was driven at a speed of 2.938 mm s−1, in doing so, clear, undistorted images can be captured.

To reduce interference caused by systemic light sources and the dark current, a black-and-white correction was applied to the hyperspectral images based on the equation \(R = \frac{{\mathop R\nolimits_{0} - \mathop R\nolimits_{{\text{b}}} }}{{\mathop R\nolimits_{{\text{w}}} - \mathop R\nolimits_{{\text{b}}} }}\), among which R refers to the corrected hyperspectral images; R0 denotes the original hyperspectral images; Rw represents the white background image of a standard white correction board with a reflectivity approaching 99.9%; Rb represents the dark black background image with a reflectivity of 0% taken after closing the lens cover.

The samples of varieties of sorghum were placed into a sample dish with a diameter of 40 mm and depth of 15 mm, leveled smooth, and compacted. Afterward, they were placed on the movable scanning platform to allow collection of hyperspectral images. Image acquisition on the samples of each cultivar was conducted 120 times, producing a total of 3240 hyperspectral images.

Data extraction and pre-processing

Data extraction

Hyperspectral images contain both the spectral and image information from sorghum samples. Each pixel point on the image corresponds to a curve on the diffuse reflectance spectrum. ENVI 5.0 [17] was adopted to extract the spectral data of the region of interest (ROI) for each hyperspectral image: the reflectivity of each pixel point was calculated, and its arithmetic mean was taken as basic data for subsequent processing.

Data pre-processing

The data pertaining to the influences of noise and light scattering of the instrument in the data acquisition process were obtained. This work successively used Savitzky–Golay (S–G) [17, 18] smoothing filters, the standard normal variate (SNV) [19, 20], and multiplicative scatter correction (MSC) [21, 22] to pre-process the spectral data. This can further eliminate random errors, remove the influence of light scattering to improve the signal-to-noise ratio (SNR), enhance the correlation between the light spectrum and data, effectively eliminating instrument noise from the spectral data. On this basis, the performance of the established model can be improved. Considering the stability and general adaptability of the constructed model, the total number of samples in the modeling part is 3240, and the calibration set and the prediction set are randomly divided according to a 2:1 ratio using Matlab to generate random numbers, where the calibration set contains 2160 samples and the prediction set contains 1080 samples.

Extraction of characteristic wavelengths

The hyperspectral can effectively provide quantification data. However, a host of variables leads to significant redundancy, reducing the model load and prediction capability. To improve the precision of the predictive model, this experiment adopted CARS for dimensionality reduction, further improving interpretability of the variables.

The CARS algorithm is used to select the optimal combination of the effective variables in the spectra by mimicking Darwin’s “survival of the fittest” principle [23,24,25,26,27,28,29]. For the spectral variables of dimensions m × p, CARS adopted the effective variables based on the following steps:

  1. 1.

    Based on Monte Carlo sampling (MCS), 80% of the samples were randomly selected from the correction set to construct the PLS model. The regression coefficient |Ki| (i = 1, 2, …, p) of the ith wavelength can be obtained;

  2. 2.

    Exponentially decreasing function (EDF) was adopted to eliminate smaller wavelength points of |Ki|. The retention rate of the variable rj = ae−bj (j = 1, 2, …, N): where j denotes the jth MCS; N represents the total number of MCS operations. The parameters (a and b) are constants. According to r1 = 1 and rN = 2/p, they can be calculated thus

    $$ a = \left(p/2\right)^{1/(N - 1)} $$
    (1)
    $$ b = \ln \left(p/2\right)/(N - 1) $$
    (2)
  3. 3.

    Based on an adaptive reweighted sampling (ARS) technique, the variable was further screened. According to Darwin’s principle of the survival of the fittest, \(w_{i} = \left| {K_{i} } \right|/\sum\limits_{i = 1}^{p} {\left| {K_{i} } \right|}\) (i = 1, 2, …, p) was adopted for variable selection;

  4. 4.

    The aforementioned steps are repeated until the number of MCS reach its pre-set value N;

  5. 5.

    Using tenfold cross-validation, the root mean square error of cross-validation (RMSECV) can be used as the evaluation standard. Then, the value of the variable subset can be obtained by comparing difference of each MSC. The variable subset corresponding to the minimal RMSECV value is seen as the optimal variable.

The classification model and evaluation criteria

The classification model

The RF algorithm, which is widely applied in issues relating to data regression and classification, is an intelligent combinational classification algorithm [30,31,32,33]. A self-sampling method (bootstrap) is used to repeatedly and randomly select n samples from the original calibration set N to generate new calibration sample decision trees by placement. Repeating the aforementioned steps can help generate m decision trees to form an RF. The classified results of new data are determined according to the scores of the polling of classification trees. The classification capability of single tree may be small. However, after many decision trees are generated, after generating statistics pertaining to the classified results for each tree in each tested sample, the most possible type of classification can be ascertained.

The steps can be described as follows:

  1. 1.

    N is used to express the number of training examples (samples) and M denotes the number of features;

  2. 2.

    The number of features m is input for determining the decision results of the last node at the decision tree, among which, m should be less than M;

  3. 3.

    Using a method of random sampling with replacement in the Nth training examples (samples), N sampling operations can be conducted to form a training set, namely, (bootstrap) and then unselected examples (samples) are used for prediction to estimate the error of the method;

  4. 4.

    For each node, m features are randomly selected and each node at the decision tree is determined based on these features. According to m features, the optimal method of division can be calculated. The purpose of randomly selecting the training set includes: if random sampling is not made, the training sets for each of different trees are same. Hence, the classified results of finally trained trees are identical. The objectives of sampling with replacement involve the following: if sampling with replacement is not used, the training sample for each different tree differs, showing no intersection. That is to say, the trained results from each tree are largely different;

  5. 5.

    Finally, RF is voted using multiple trees to determine the type of the samples.

Evaluation criteria

The RF method was used to build a prediction and classification model of varieties of sorghum. Then the performance of the model was estimated using the classification accuracy. The accuracy can be calculated thus

$$ {\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}}} \times 100 $$
(3)

where ACC refers to the accuracy; TP refers to true positive; TN refers to true negative; FP refers to false positive; FN refers to false negative.

Meanwhile, in order to visualize the effect of the algorithm, the confusion matrix (CM) of the sorghum variety classification results was constructed and Cohen’s kappa value was calculated using the formula:

$$ K = \frac{{P_{{\text{o}}} - P_{{\text{e}}} }}{{1 - P_{{\text{e}}} }} $$
(4)

where K: Cohen’s kappa; Po: proportion of observation-consistent units; Pe: proportion of chance consistent units. Cohen’s kappa is calculated as − 1 to 1, but usually the kappa falls between 0 and 1, which can be divided into five groups to indicate different levels of agreement: 0.0–0.20 very low agreement (low), 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement (moderate), 0.61–0.80 high agreement (substantial), and 0.81–1 almost perfect.

Results and discussion

Spectral characteristics

Figure 3 describes the original spectral curve of 3240 samples selected in this experiment and pre-processed spectral curve. Comparison of Fig. 3a and d shows that the influences of light scattering and noise on the spectra after pre-processing with S–G, SNV, and MSC methods are eliminated as soon as possible to improve SNR, which is conducive to improving the subsequent classification accuracy of the model.

Fig. 3
figure 3

Raw spectra and pretreatment of the samples. a Raw spectra. b Spectral pre-processing by S–G. c Spectral pre-processing by S–G and SNV. d Spectral pre-processing by S–G SNV and MSC

Figure 4 depicts the average spectra of 27 varieties of sorghum. Within the wavelength range of 380–1000 nm, the general trends of the sorghum spectral curves are similar. Wave peak and trough show relatively little change. Meanwhile, some curves intersect and overlap. Figure 4 displays the pre-processing of the average spectra for 27 varieties of sorghum. After pre-processing, the curve intersection and overlapping can be reduced, the noise is reduced, allowing clearer distinction between different samples. Among these, the spectral curves of the varieties (V1, V3, V9, V13, V15, V16, and V24) at wavelengths ranging from 400 to 850 nm are shown above the rest of the varieties. This information can be used to differentiate red sorghum from white sorghum: a trough appears at around 670 nm in the spectral curve, possibly caused by a bathochrome effect, while the spectral curve at wavelengths greater than 850 nm presents a declining trend. This is possibly related to the stretching vibration of molecular bonds for O2, O–H bonds, and hydroxyl functional groups [28]. These differences provide an effective discrimination basis for varieties of sorghum when using spectral identification.

Fig. 4
figure 4

Average spectra of 27 samples

Characteristic extraction

The aim of CARS is to eliminate irrelevant variables and reduce collinearity among variables. Figure 5 describes with the increase in the number of MCS, changes in the number of sample variables in the subsets of sorghum samples, RMSECV, and regression coefficient are shown in the subsets of sorghum samples. As the number of MCS is increased, the variables selected from the effect of EDF present an exponential decrease and then gradually tend to stabilize; the value of RMSECV first decreases, then increases, indicating an absence of correlation between the variables initially eliminated during the variable screening process and components to be tested. Thereafter, the variables irrelevant to the components to be tested are added to the variable subsets; effective wavelengths are retained at the labeled position, at this time, the value of RMSECV is minimal. The screened variables are the optimal variable combination, as shown in the figure, when the number of sampling operations is 62, RMSECV is 5.6174 (its minimum value). At this time, there are no redundant wavelengths to be screened. The subset contains 20 variables, the corresponding wavelengths are 425, 426, 429, 430, 588, 589, 591, 592, 661, 662, 668, 669, 672, 673, 686, 880, 881, 885, 911, and 915 nm.

Fig. 5
figure 5

Key extraction results of the CARS algorithm. a Changes in the number of waveband variables. b Variation of RMSEcv. c Path of variable regression coefficients. Five-pointed star denotes the optimal point where root mean square error of cross-validation (RMSECV) values achieve the lowest

Result of classification

Identification precision

Full-spectral data for 27 types of varieties of sorghum and 20 characteristic spectral data were respectively used to establish RF-based classification models. When using the full-spectrum data, the number of trees (ntree) is 1000, and the number of features (mtry) randomly sampled is 29. The experimental results demonstrate that the accuracy of the calibration set is 94.58% while the accuracy of the prediction set is 64.44%; when using characteristic spectral data, the number of trees generated (ntree) is 1000 and the number of randomly sampled features (mtry) is 4. Among them, ntree and mtry are two important parameters of random forests. ntree is the number of base classifiers included, with a default of 500; mtry is the number of variables included in each decision tree, with a default of logN. By optimizing the parameters, the error is minimized when mtry is taken to be 29 for full-spectrum data modeling, mtry is minimized when mtry is taken to be 4 for eigenspectral data modeling, and the error is stable when ntree is taken to be 1000.

Experimental results indicate that the accuracy of the correction set is 95.00% while the accuracy of the prediction set is 84.07%. The overall accuracy of the testing sets for the classification models built using CARS -RF can be significantly increased, as shown in Fig. 6.

Fig. 6
figure 6

Results of classification for the calibration and prediction in single and overall sample. a Calibration results of RF model based on full and key wavelengths. b Prediction results of RF model based on full and key wavelengths

Cohen’s kappa value

In this paper, the confusion matrix is established for the classification results of the classification model established based on the feature spectral data, and its Cohen’s kappa value is calculated, and the calibration and prediction Cohen’s kappa values of all samples are 0.9212 and 0.9231, respectively, as shown in Fig. 7.

Fig. 7
figure 7

Results of Cohen’s kappa value for the calibration and prediction in single and overall sample

Discussion

Using CARS algorithm, 20 characteristic wavelengths were screened from 27 types of varieties of sorghum. The characteristic wavelength is compressed to 2.4% of the total number of full-wave bands. The spectral data are saved in the matrix with array dimensions of 3240 × 20, thereby reducing the computational time. The characteristic wavelengths are between 420–430 nm, 580–600 nm, 660–690 nm, and 880–920 nm. The characteristic wavelengths occur at 420–430 nm, corresponding to the blue–violet region, which corresponds to the adsorption peak of compounds such as flavonoids contained in sorghum seeds [34, 35], while the wavelength range of 580–600 nm corresponds to the green light region, corresponding to the adsorption peak of pigments including chlorophyll contained in sorghum seeds [36,37,38]. The wavelength range of 660–690 nm corresponds to the red light region, which mainly involve excitation and fluorescence of chlorophyll contained in sorghum seeds; while at wavelengths of around 880–920 nm, due to the effects of stretching vibrations for C=C, C=O, and C=N, the changes in optical properties of substances including protein, starch, cellulose, and moisture content contained in sorghum at this wavelength range contribute to the phenomenon [39, 40].

According to the classification results, it can be found that when modeling using full-spectrum data, the accuracies of the calibration and prediction sets are 94.58% and 64.44%, respectively. When using the characteristic spectral data, the accuracy of the calibration set is 95.00% and the accuracy of the prediction set is 84.07%. The total accuracy of the prediction set for the classification model established based on CARS-RF is increased by 19.63%. This is because irrelevant variables are eliminated during the extraction of characteristic variables while retaining related characteristic wavelengths pertaining to organic matter contained in sorghum as soon as possible.

When the CARS-RF model was used for classification prediction of each of 21 types of varieties of sorghum (V1–V11, V15, and V20–V27), the accuracies of the correction set are above 92.50%, and the accuracies of the prediction set are higher than 90.00%. The accuracies of some varieties reach 100%, while the accuracies of the prediction set for two varieties of sorghum V12–V13 ranged between 81.25 and 90% and the accuracies of the prediction set are beyond the range of 80.00–85%; while the accuracies of the correction set for four types of varieties of sorghum V16–V19 are in the range of 73.75–80.00%; the accuracies of the prediction set ranged between 70.00 and 75.00%. Owing to physical properties of four types of varieties of sorghum V16–V19 are similar, some deviations occur in the polling process of RF.

As can be seen from Table 1, using Cohen’s kappa value to evaluate the accuracy of the single-species model, the Cohen’s kappa values of the calibration and prediction sets of V1–V15, V19–V27 were all in the range of 0.81–1.0, it shows that the evaluation results are almost identical to the accuracy results. The Cohen’s kappa values for V16–V18 calibration and prediction sets ranged from 0.61 to 0.8, indicating a high level of consistency. The calibration and prediction Cohen’s kappa values for the full sample were 0.9212 and 0.9231 respectively, ranging from 0.81 to 1.0, also falling into the category of almost perfect agreement.

Table 1 Recognition accuracy and Cohen’s kappa value for the prediction set

In summary, the CARS-RF was used for modeling of varieties of sorghum. Its identification accuracy meets the classification requirements of the varieties, presenting a certain application value.

Conclusion

This experiment took 27 types of varieties of sorghum as research objects to identify sorghum seeds using the method integrating the hyperspectral non-destructive detection technique with machine learning. The hyperspectral imaging system was used for acquisition of varieties of sorghum across the average spectral range of 380–1000 nm. Using CARS algorithm, irrelevant variables can be eliminated but effectively retain the characteristic wavelength related to organic matter contained in sorghum the further to realize spectral dimensionality reduction. Meanwhile, RF was used to establish a model capable of identifying varieties of sorghum. The accuracy of a part of the correction set in the model for the varieties V1–V27 is 95.00%. The accuracy of the prediction set is 84.07%. Using the Confusion matrix to calculate Cohen’s kappa values, the calibration and prediction Cohen’s kappa values for the full sample were 0.9212 and 0.9231 respectively, indicating that the evaluation results are almost identical to the correctness results. The results indicated that non-glutinous and glutinous varieties of sorghum with similar physical properties can be distinguished using the CARS-RF model.

Sorghum variety identification with CARS-RF, not only can accurate, rapid and non-destructive identification of sorghum varieties be achieved, but it can also provide a way of thinking for varietal identification of other crops, as well as helping with seed management, variety protection and germplasm resource management.