Introduction

Sugarcane (Saccharum Officinarum L.) is the largest sugar crop and the second largest bioenergy crop in the world. Most of the world’s sugarcane is grown in tropical regions of countries including Brazil, India, and China, where precipitation is extremely unstable and the sugarcane plants can suffer from drought stress (Inman-Bamber & Smith, 2005). Among the five major developmental stages of sugarcane growth: sprouting, seedling, tillering, elongation (i.e. nutritional growth stage), and maturity, the elongation stage is unanimously considered to be the key growth stage (Yadav et al., 2020; Zhao & Li, 2015). At elongation stage, drought stress leads to a series of physiological disorders such as discoloration, folding, tillers and leaf area reduction of sugarcane leaves, which eventually reduces the yield and quality of sugar. The biological mechanism of drought response is that under drought conditions, the stomatal conductance of leaves decreases, which hinders the synthesis of chlorophyll and accelerates the decomposition of chlorophyll, thus hindering photosynthesis in leaves. Therefore, early detection of drought stress can provide theoretical guidance for the diagnosis of crop nutrient conditions as well as nutrient management in the field.

Chlorophyll content is considered to be an indicator of drought stress (as well as nitrogen stress, disease stress, etc.) (Peng et al., 2019). Although high chlorophyll content measurement accuracy can be achieved with traditional chemical methods, they are often complex, costly, and time-consuming, making them unsuitable for large-scale plant monitoring in the field. Recently developed spectroscopic technologies offer the advantages of non-destructive, time efficient and highly automated in detecting characterizing material structures and compositions (DAIE, 1992; Gitelson & Merzlyak, 1996). For most plants species, decreased chlorophyll content in leaves were found leads to the spectral reflection pattern changes (Zarco-Tejada et al., 2000), i.e. an increase in overall spectral reflectance; an increase in the proportion of reflected red and blue light, and a decrease in green light.

Scientists have extensively investigated rapid, non-destructive chlorophyll content measurement techniques based on plant spectroscopy and spectral analysis technology (Inoue et al., 2016; Wang et al., 2018). The existing spectroscopic-based chlorophyll content measurement techniques can be divided into three categories: the first category uses leaf or canopy radiation transmission models, including PROSAIL, SAIL, PROSPECT, etc. (Jacquemoud & Baret, 1990; Verhoef, 1984). Although a high measurement accuracy can be achieved, it is less frequently used due to the complexity of the radiation transmission model.

The second category for chlorophyll content estimation is using vegetation indices (VI), which were defined as linear combination or ratios of the reflectance intensities in two or three bands (Sims & Gamon, 2002). Due to the effectiveness in minimizing the potential influence of scale factors and enhancing spectral features of vegetation, Vis were widely adopted in remote sensing and photogrammetry for plant science research, especially in crop growth information monitoring. Many VIs have been reported to be effective in estimating biochemical parameters of plants. For instance, Normalized difference vegetation index (NDVI), Soil-Adjusted vegetation index (SAVI) and Ratio vegetation index (RVI) were reported effective in chlorophyll content estimation of maize (Chaoyang et al., 2010). There are studies focusing on clarifying the response differences of VIs to environmental disturbances of various kinds, therefor screening and combining different VIs to construct more robust chlorophyll content estimation models. For instance, Li et al. (2015) constructed a model for chlorophyll content estimation of sugarcane leaves at tillering stage. NDVI and red edge position (REP) were selected as variables, and a determination coefficient of 0.7386 was achieved. Qiao et al. (2022) constructed a partial least square (PLS) model for chlorophyll content estimation for maize crop using remote sensing data. The model used VISs screened with rand frog method (RFM) as variables, and achieved an RMSE  of 2.361 mg/L. And Zhang et al. (2022) extracted spectral indices from hyperspectral images, and achieved a R2 ranging from 0.77 to 0.78 for the prediction of leaf chlorophyll content. However, most of the VI-based algorithms were designed for remote sensing data with limited wavebands available, and the dimensionality reduction nature of VIs may discard significant information reflecting the physiological effects corresponding to chlorophyll content changes, thus decrease the estimation accuracy.

The third category uses machine learning technology to establish regression models of spectral reflectance signatures and plant physiological parameters using full spectra. In recent years, different algorithms have been developed to combine spectral reflectance features with machine learning techniques for the estimation of leaf physiological parameters, including chlorophyll content, moisture content, nitrogen content, etc. (Krishna et al., 2019; Liu et al., 2020; Wang et al., 2020). Typically, the collected spectra are in large quantities and also contain noisy, redundant and interfering information. Therefore, the spectra are usually pre-processed first to remove the noise. Commonly used pre-processing methods include: Savitzky–Golay smoothing (SG) (Yoosefzadeh-Najafabadi et al., 2021), multiplicative scatter correction (MSC) (Sonobe et al., 2021), standard normal variate (SNV) (Yamashita et al., 2020). Then, in order to eliminate redundant and interfering information, dimensionality reduction algorithms are usually applied in the form of extracting characteristic bands (El-Hendawy et al., 2021; Hoeppner et al., 2020; Liang et al., 2018; Wang et al., 2019; Zhu et al., 2020), which are less correlated and preserve majority of useful information in the estimation of leaf physiological parameters. Commonly used characteristic band selection methods include correlation coefficient method, successive projections algorithm (SPA) (Xuan et al., 2021), and random forest (RF) (Abdel-Rahman et al., 2013), etc. Finally, the extracted characteristic bands are used as the input of machine learning to predict the physiological parameters of plants. Commonly used machine learning methods include Back Propagation Neural Networks (BPNN) (Khoshhal & Mokarram, 2012), K-Nearest Neighbors (KNN) (Osco et al., 2020), Support Vector Regression (SVR) (Bhadra et al., 2020), etc.

Although significant progress has been made in the study of drought monitoring and parameter inversion using spectral data for different plant species, the spectral response of sugarcane plants to drought during the early elongation stage is seldom studied, and the quantitative relationships between leaf chlorophyll content and spectral reflection properties of sugarcane plants are still poorly discovered.

In this paper, we present our work on non-destructive estimation of chlorophyll content of sugarcane leaves under drought stress using VIS/NIR spectroscopy. The contribution of this study is that we explored the quantitative relationships between leaf chlorophyll content and spectral reflection properties of sugarcane plants under drought stress, which has not been extensively investigated before. Another methodological contribution is that we systematically investigated the performance of different existing pre-processing, dimensionality reduction, and regression techniques in establishing a model for estimating leaf chlorophyll for sugarcane plants under drought stress, which has not been reported in other studies. The specific objectives of this study were to:

  1. (1)

    Analyze the spectral changes of sugarcane leaves in the early elongation stage under different drought stress levels

  2. (2)

    Explore the optimal characteristic spectral bands and regression model for spectroscopy-based chlorophyll content estimation, and establish an optimal estimation model

  3. (3)

    Evaluate the performance of the proposed chlorophyll content estimation algorithm, and compare the performance of the proposed model with conventional VI-based methods

  4. (4)

    Validate generalizability of the proposed model, as well as the potential for monitoring the degree of drought stress in sugarcane.

Experimental materials and data collection

Experimental materials and experimental design

The experimental materials in this study were prepared at the State Key Laboratory of Protection and Utilization of Subtropical Agricultural Biological Resources of Guangxi University. Four sugarcane cultivars were involved, namely Zhongzhe No. 2 (ZZ2), Zhongzhe No. 9 (ZZ9), TD11 (i.e., GN18, use TD instead in this article), Funong 95-1702 (FN). Among them, both cultivars ZZ2 and ZZ9 were from the same parents (ROC25 × Yuncan 89-7), with ZZ2 is a relatively drought-sensitive variety whereas ZZ9 is a drought-resistant variety. FN is a drought-sensitive wild-type variety. TD is a transgenic strain as the recipient of the Ea-DREB2B gene introduced from FN, which exhibits greater drought tolerance with higher photosynthetic capacity and ROS scavenging ability relative to FN.

Plants of each cultivar were divided into five groups including four drought treatment groups (W1, W2, W3, W4) and one control group (CK). CK indicates normal water treatment (80% of the field capacity), W1 indicates mild drought (60% of the field capacity), W2 indicates moderate drought (45% of the field capacity), W3 indicates severe drought (35% of the field capacity), and W4 indicates extreme drought (less than 30% of the field capacity).

The experiments were carried out at the teaching and research base of the College of Agronomy, Guangxi University, China (22°51′N, 180°17′E) from September to November, 2021. As the pre-planting operations, the sugarcanes were cut into single-bud stems, soaked in carbendazim for 30 min, and cultivated at 28 °C for germination. Stems with consistent budding were selected for transplantation.

The plants was divided into two experiments, experiment 1 was to transplant stems into plastic pots and placed in a greenhouse, and experiment 2 was to transplant stems into an experimental field consisted of 20 plots, in which one plot was designed for each cultivar in each experimental group, respectively. The plants in both experiments were exposed to natural variations on photoperiod throughout the experiment, and the soil moisture was controlled according to the assigned group. For experiment 1, and the soil moisture was controlled daily by weighing the pots and replenishing them to maintain the target field holding capacity. For experiment 2, the soil moisture content was monitored daily through a soil moisture meter (TDR150, Spectrum Technologies, Inc, USA), and the field was replenished when the soil moisture content fell below the target value of the group.

Measurement methods

From all sugarcane plants involved, leaves were sampled for data acquisition during the elongation stage. For both experiments, 4 plants were selected for each cultivar and each group for data acquisition. For experiment 1, the top 5 leaves of each sugarcane plant were collected for data acquisition, resulting in a total of (5 leaves) × (4 plants) × (4 cultivars) × (5 groups) = 400 samples for sugarcane plants grown in greenhouse. For experiment 2, the top 3 leaves were collected, resulting in a total of (3 leaves) × (4 plants) × (4 cultivars) × (5 groups) = 240 samples for sugarcane plants grown in the field. The leaf VIS/NIR spectral data and the corresponding leaf chlorophyll content were measured right after leaf collection.

VIS/NIR spectral data acquisition

A CID portable CI-710 optical fiber spectrometer (CID Bio-Science Inc., Camas, WA, USA) (Fig. 1) was used for VIS/NIR spectral data acquisition. The sensor is capable of measuring the reflectance spectrum between 400 and 950 nm with an optical resolution of 2.0 nm. The diameter of the measuring chamber is 7.6 mm. A blue LED and an incandescent lamp were used as the radiation sources, with a leaf clipping mechanism to protect the measurement area from the environmental light interference. Before spectral data collection, the sensor was calibrated using a whiteboard. During data acquisition, measurements were taken at 4 randomly selected positions of the leaves. The measurements were repeated 10 times at each position, and the average vector was taken as the reflection spectrum of the leaf.

Fig. 1
figure 1

The portable CI-710 fiber spectrometer to measure the reflectance spectrum of a sugarcane leaf

Chlorophyll content determination

After completing the leaf spectral measurement, the chlorophyll content of the leaf was determined by laboratory chemical analysis. A standard protocol was followed (Lichtenthaler & Wellburn, 1983): firstly, the surface of the sampled leaf was cleaned by wiping the dirt. The leaf sample (with midrib removed) was then sheared and placed in a 25 mL capacity flask. Then 20 mL of 80% acetone was added to the flask. The samples were placed in a dark environment and soaked until the tissue was white, and then the volume was brought to 25 mL with dip reagent and shaken thoroughly. The dip reagent was used as a reference for the absorbance of the blank, then the absorbance value at 663 nm and 645 nm (OD663, OD645) were taken by a UV–Vis spectrophotometer (UV-1800, Shimadzu Corporation, Kyoto, Japan) for colorimetric analysis. The chlorophyll content was calculated as:

$${\text{Chlorophyll}}\,{\text{content}}\,({\text{mg}}/{\text{cm}}^{2} ) = (8.02 \times {\text{OD}}_{{663}} + 20.21 \times {\text{OD}}_{{645}} ) \times V/(S \times 1000),$$
(1)

where V is the liquid volume, and S is the area of the sampled leaf.

Data processing

In this study, an algorithmic framework consisting of pre-processing, dimensionality reduction, and regression steps was used to process the collected spectrums. Firstly, as a pre-processing step, a smoothing algorithm was applied to the obtained spectra to remove the noise caused by human operations, environmental and instrumental factors. Secondly, the dimensionality of the spectra data was reduced using feature extraction techniques. Dimensionality reduction is necessary because each measured reflectance spectrum contains hundreds of values and exhibits a high degree of redundancy. Incorporating all the data into a prediction model leads to severe overfitting problems. Finally, a regression model was established using machine learning techniques for the estimation of chlorophyll content.

The technical procedure for constructing the optimal model for estimation of sugarcane leaf chlorophyll content is shown in Fig. 2, in which different combinations of dimensionality reduction methods and regression models were constructed. Then these combinations were systematically evaluated and compared before the optimal chlorophyll estimation model was finally selected.

Microsoft Excel 2019 was used to organize the data, Origin 2019 was used for statistical analysis and plotting, and Python with “Scikit-learn” library (Pedregosa et al., 2011) was used to implement the data processing algorithms.

Fig. 2
figure 2

The technical procedure for constructing the optimal model for estimation of sugarcane leaf chlorophyll content

Spectral data preprocessing

The measurement noise, which appears as glitches on the collected spectrum (Fig. 3a), can negatively affect the accuracy of the estimation. Smoothing algorithm is one of the most commonly used methods to eliminate noise. Savitzky–Golay (SG) is a convolutional smoothing method that performs polynomial least squares fit to the spectral data in a moving window manner to eliminate glitches (Fig. 3b). The SG convolutional smoothing algorithm is:

$$x_{{a\_smooth}} = \frac{1}{H}\sum\limits_{{i = - w}}^{{ + w}} {x_{{a + i}} h_{i} } ,$$
(2)

where \({h}_{i}\) is the smoothing coefficient and obtained by least squares fitting. H is the normalization coefficient.

Fig. 3
figure 3

Comparison before and after spectrum smoothing

Dimensionality reduction and feature band extraction

In this study, multiple characteristic spectral bands extraction algorithms was applied to the pre-processed spectral data in order to reduce the data dimension and reduce the data redundancy.

Pearson correlation coefficient method (Wang et al., 2017): The correlation coefficient between the reflectance of each band in the spectrum and the content of a specific substance, e.g. chlorophyll content, was calculated as follows:

$$R_{j} = \frac{{\sum\nolimits_{{i = 1}}^{N} {\left( {x_{{ij}} - \bar{x}_{J} } \right)} \left( {y_{i} - \bar{y}} \right)}}{{\sqrt {\sum\nolimits_{{i = 1}}^{N} {\left( {x_{{ij}} - \bar{x}_{J} } \right)^{2} \sum\nolimits_{{i = 1}}^{N} {\left( {y_{i} - \bar{y}} \right)^{2} } } } }},$$
(3)

where \({x}_{ij}\), \({y}_{i}\) are the reflectance value of the jth band and the chlorophyll content in the ith spectrum data sample, respectively, and \(\bar{x}_{j}\), \({\bar{y}}\) are the mean of the jth spectral reflectance values and the mean of chlorophyll contents of all the spectrum data samples, respectively. A higher absolute value of correlation coefficient indicates a higher explanatory power of a certain spectral band in the estimation of chlorophyll content. Therefore the spectral band values with significant correlations (i.e. \(\left| R \right| \ge 0.8\)) with chlorophyll content were selected as the characteristic bands (Källén, 2016).

Successive projections algorithm (SPA) (Araújo et al., 2001): SPA is a forward feature selection method for solving collinearity issues. In characteristic bands selection, starting from a randomly selected band, a new band with minimum information redundancy was iteratively selected until certain number of bands were selected. Projection analysis was used to evaluate the information redundancy of a certain band with respect to the selected bands. By comparing the magnitude of the projected vectors onto the orthogonal complement space of the selected bands, the band with the largest projected vector magnitude was selected as a new characteristic band. With different SPA-selected band sets started with different initial bands, a partial least square regression model was applied to the training set to compare these sets and select the best combination. The flow of the SPA algorithm is expressed as follows:

figure a

Random Forest algorithm (RF) (Bylander, 2002): RF is a model capable of identifying the relative importance of each characteristic band in the prediction. If there is correlation between high-dimensional characteristic bands, the contribution to the regression is not significant and it reduces the computational efficiency. The Gini coefficient can be used as an evaluation criterion to determine the contribution of each input feature to the random forest-based regression. A higher coefficient indicates a higher feature importance.

Stepwise regression analysis (SR) (Greenland, 1989): SR is an iterative characteristic band selection method. Firstly, arbitrary bands were selected to establish an initial regression model. Then, an F test was performed on each band in the regression equation to test their correlation significance to the chlorophyll content. Bands with insignificant correlation were eliminated. After elimination, F tests were performed on a new set of candidate bands. Bands tested for significantly correlation with chlorophyll content were selected for inclusion in the regression model. The iterations of elimination and selection continued until no new bands were selected.

Regression models, evaluation criterions and dataset partitioning

After dimensionality reduction of the spectrum data, machine learning-based regression models were used to estimate the chlorophyll content of sugarcane leaves. In this study, the performance of several commonly-used regression models was evaluated and compared. The regression models include Partial Least Squares Regression (PLSR) (Rosipal & Krämer, 2006), Back Propagation Neural Network (BPNN) (Khoshhal & Mokarram, 2012), Support Vector Regression (SVR) (Fan et al., 2018), K-Nearest Neighbors (KNN) (Cover & Hart, 1967) and Stacking (Wu et al., 2021). PLSR is capable of determining a linear relationship between the input variables and the response variables by simultaneously decomposing and screening data matrices with a few independent principal components (PCs). BPNN typically uses a forward neural network with input, hidden and output layers to a build non-linear predictive models from a large dataset. The nonlinear mapping between inputs and outputs was approximated by layers of simple operation collections (neurons). SVR is the generalization of support vector machines for nonlinear regression problems, the algorithm inherits the advantages of SVM including the low complexity and non-linearity. KNN is an intuitive data classification method and also applicable to regression prediction problems. Stacking, with several weak learners based on different algorithms trained from the initial training dataset, the prediction results of these regressors are then used as a new training set to train a meta learner for the best combination of the weak learners into a strong learner. In this study, BPNN, SVR and KNN were used as weak learners of the Stacking algorithm, and a linear regressor was used as the meta-learner.

As the performance metrices, the coefficient of determination (\({R}^{2}\)), the root mean square error (RMSE)) were used for the performance assessment of different regression models, the equations are:

$$R^{2} = 1 - \frac{{\sum\nolimits_{{i = 1}}^{N} {\left( {y_{i} - \widehat{y}_{i} } \right)} ^{2} }}{{\sqrt {\sum\nolimits_{{i = 1}}^{N} {\left( {y_{i} - \bar{y}} \right)} ^{2} } }},$$
(4)
$$RMSE = \sqrt {\frac{1}{N}\sum\nolimits_{{i = 1}}^{N} {\left( {y_{i} - \widehat{y}_{i} } \right)} ^{2} .}$$
(5)

In which N is the number of spectrum samples, \({y}_{i}\) is the measured chlorophyll content corresponding to the ith sample, \({\widehat{y}_{i} }\) is the model predicted value of the ith sample, and \({\bar{y}}\) is the mean of the measured chlorophyll content of all samples. \({R}_{t}^{2}\), \({RMSE}_{t}\) and \({R}_{p}^{2}\), \({RMSE}_{p}\) were used as the matrices for performance evaluation, within which the subscript letters ‘t’ and ‘p’ represent the prediction performance of using the training and testing datasets, respectively. Larger \({R}^{2}\) and smaller RMSE indicate better the prediction performance of the model. The unit of RMSE is (mg/cm2).

The data samples was randomly divided into a training set (70%) and a testing set (30%). The training dataset was used for selecting the feature bands, evaluate the dimensionality reduction models, and train different regression models. The testing set was used for the assessment of the constructed regression models.

Five-fold cross-validation technique was applied during training the regression models, in order to search for the optimal hyper-parameters of the regression models and avoid the impact of data bias during model assessment. The training set was further divided into five mutually exclusive subsets, in which four subsets were selected as the cross-training set and one subset as the cross-validation set. The validation was repeated until all subsets have been cross-validated, and the average of the five results were used as the final validation result. The optimal hyper-parameters were determined based on the validation results.

Vegetation index-based models

A VI is a parameter constituted by a linear or non-linear combination of specific spectral bands, which effectively enhances certain details or features of vegetation. Many VIs have been reported to be effective in estimating biochemical parameters of plants including leaf chlorophyll content (Table 1). In this study, 8 reported effective VIs were selected, and a linear regression models was fitted for each VI using the training dataset. The prediction performance of these models were then evaluated using the testing dataset.

Table 1 Several VIs reported effective in the estimation of leaf chlorophyll content

In addition, science the characteristic sensitive wavelengths of leaf chlorophyll content are different among different plant species, also varies upon the growing environment. Therefore, in this study, we reselected variable wavelengths from 6 empirical VIs (Table 2) in order to investigate a series of optimized Vis for chlorophyll content estimation in sugarcane leaves. The optimized Vis were constructed by searching over all available band combinations in the VI formulas. The correlation coefficients R to the chlorophyll content were calculated for each formed VI. The combination with the maximum \(\left|R\right|\) was selected to form the optimal VI. A linear regression models was then fitted for each VI using training data, for estimation of chlorophyll content of sugarcane leaves.

Table 2 The empirical VIs and the corresponding formulas (Sun et al., 2019), \({R}_{i}\) and \({R}_{j}\) indicate the reflectance at bands i and j nm respectively

Results

Analysis of the effects of drought stress on sugarcane leaves

One-way ANOVA test and Turkey’s HSD method was applied to the chlorophyll content data collected from plants with different treatments for each cultivar. The results of the analysis showed that the leaf chlorophyll content of different sugarcane varieties tended to decrease with the increase of drought stress, and the differences of chlorophyll content were test significant (p-value < 0.05) between different treatments (Fig. 4). The average leaf chlorophyll contents under different drought stress treatments were compared with CK group. Under W1, W2, and W3 treatments, the decrease in chlorophyll content was greater in FN and Z2 than in TD and Z9 with increasing drought stress (Fig. 5). That is due to the fact that FN and Z2 are drought-sensitive cultivars, while TD and Z9 are more drought-tolerant.

Fig. 4
figure 4

Statistical analysis and Tukey’s HSD compact letters for chlorophyll content in leaves of different sugarcane cultivars under different drought stress treatments. Different letters indicate significant group-wise differences

Fig. 5
figure 5

The degrees of decreasing leaf chlorophyll content of different sugarcane cultivars compared with the control group (CK) as drought stress level increases

In addition, for each sugarcane cultivar, sugarcane leaves under different stress treatments showed different spectral characteristics, but the overall trend of the spectral curves was the same (Fig. 6). There is a peak in reflectance at 560 nm in the visible green light region (525–600 nm), which is due to the photosynthetic effect absorb less light in this region. In the visible red region (630–690 nm), a reflectance valley is formed at 685 nm (known as the red valley), chlorophyll in plants has strong absorption and photosynthetic activity for light in this waveband. In the near infrared region (700–900 nm), the spectrum curve formed a reflectance peak at 755 nm, which was due to the reflection from the internal foliar cells inside the crop. The reflectance spectrum at 525–700 nm and 750–900 nm increases significant with increasing drought stress.

Fig. 6
figure 6

Mean spectral profiles (525–900 nm) of different sugarcane cultivars under different drought stress treatments

Given the aforementioned points, it is clear that drought stress causes changes in chlorophyll content of sugarcane leaves, which further leads to changes in the spectral profile. This result supports the feasibility of using spectral reflectance to predict the chlorophyll content of sugarcane leaves.

Results of different spectral data dimensionality reduction methods

In order to reduce redundant spectral information and the computational cost of training machine learning models, dimensionality reduction techniques including the correlation coefficient method, SPA, RF and SR were applied to the spectra in the dataset. The results of these dimensionality reduction methods were analyzed, and their performance was evaluated and compared.

Correlation coefficient method

By calculating the correlation coefficients between reflectance and chlorophyll content in each band, it was observed that the chlorophyll content in sugarcane leaves were negatively correlated with the spectral reflectance. The absolute values of the correlation coefficients between reflectance of most bands and the chlorophyll content were greater than 0.7, indicating strong correlations (Fig. 7a). The bands with absolute correlation coefficients greater than 0.8 were selected in this study, which were mainly concentrated in 525–745 nm (Fig. 7b). A total of 221 bands were selected, accounting for 58.78% of total bands (Table 3).

Fig. 7
figure 7

Results of the correlation coefficient method. a Correlation coefficients between each band reflectance and chlorophyll content; b bands selected by the correlation coefficient method

Successive projections algorithm (SPA)

In this study, 16 characteristic bands were selected using SPA (Fig. 8), accounting for 4.3% of the total bands (Table 3). Of these, 7 filtered bands were in the visible region and 11 bands were in the NIR region. Figure 8 shows the bands selected by the successive projection algorithm.

Fig. 8
figure 8

Bands screened by successive projections algorithm

Random forest (RF) algorithm

The importance score of each band for chlorophyll prediction was obtained by RF (Fig. 9a). The bands with high importance were mainly concentrated in the 560–725 nm range, which implies the spectral reflectance within this range was obviously influenced by drought stress. The top 20 bands with the highest characteristic importance were selected (Fig. 9b), accounting for 5.32% of the total bands (Table 3).

Fig. 9
figure 9

Results of the RF method. a Importance of each band obtained by the RF algorithm; b selected bands with the RF algorithm

Stepwise regression (SR) analysis

Six characteristic bands were selected using stepwise regression analysis, accounting for 2.5% of the total bands. Figure 10 shows the bands screened using the stepwise regression analysis. The bands screened using the stepwise regression analysis were mainly between 400 and 750 mm, with fewer bands selected after 750 nm. This indicates the bands after 750 nm contained less important information.

Fig. 10
figure 10

Bands screened by SR analysis

Table 3 Comparative analysis of using input variable sets selected by different dimensionality reduction methods (M2–M6) with directly using the original spectrum (M1)

Based on the previous results, six sets of variables were established as the input variables for the regression model (Table 4). Among them, M1 is the full spectrum, M2 is the set of the 221 characteristic bands selected by the correlation coefficient method, M3 is the set of the 16 characteristic bands selected by SPA, M4 is the set of the 20 characteristic bands selected by RF, M5 is the set of the 6 characteristic bands selected by SR.

Table 4 Grouping of chlorophyll content prediction model input variables

Applying dimensionality reduction methods greatly reduced the complexity of the chlorophyll content prediction models comparing with the full spectrum (M1), at the cost of a slight decrease in prediction accuracy. Among the dimensionality reduction methods, SPA (M3) achieved the largest \({R}_{p}^{2}\) and the smallest \({\text{R}\text{M}\text{S}\text{E}}_{p}\) values when using a PLSR as a baseline regression model for chlorophyll content estimation, in which the \({R}_{p}^{2}\) was only 0.0174 lower than using the full spectrum, but with 95.7% fewer inputs variables. In summary, dimensionality reduction is effective in eliminating redundant data in the spectra for sugarcane leaf chlorophyll content estimation, and SPA was the optimal method in both reducing computational effort and preserving the prediction accuracy.

Comparative analysis between different regression models

For each input variable set, four regression methods, including BPNN, SVR, KNN and Stacking, were used to build the prediction models of sugarcane leaves chlorophyll content. The hyper-parameters of each model were optimized through grid searches with 5-fold cross-validation. The performance of these prediction models with different combinations of input variables are shown in Table 5. It can be observed that all fitted models had \({R}_{t}^{2}\), \({R}_{cv}^{2}\) and \({R}_{p}^{2}\) values greater than 0.7, and \(RMSE_{t}\), \(RMSE_{{cv}}\) and \(RMSE_{p}\) values less than 0.2, indicating the models were trained properly, without overfitting issues.

Table 5 Performance of different combinations of input variable sets and regression models for predicting chlorophyll content in sugarcane leaves

The prediction performance of the four models with different input variable sets were evaluated and compared. The prediction results of different models were plotted in Fig. 11. the fitted lines were close to the reference line (y = x), indicating the models were trained properly, without overfitting issues.

As observed in Table 5; Fig. 12, there is no method outperforms others. BPNN has the highest prediction accuracy when using input variable sets M1 (full spectrum), with \({R}_{p}^{2}\) equals to 0.9854, and \({{R}{M}{S}{E}}_{p}\) equals to 0.0641. For models using reduced input variable sets (M2–M5), Stacking-based models have the highest prediction accuracy while using each reduced input variable sets. And the prediction performance is comparable to using full spectrum and BPNN while using M2 (59.4% of original spectrum) and M3 (4.3% of original spectrum) as input variable sets, with \({R}_{p}^{2}\) equals to 0.9841 and 0.9834, respectively, and \({{R}{M}{S}{E}}_{p}\) equals to 0.0533 and 0.0544, respectively.

Considering both prediction performance and model complexity, the model combining Stacking method with M3 (reduced by SPA) was selected as the optimal model for our future experiments. The optimal \({R}_{p}^{2}\) is 0.9834 and the \({{R}{M}{S}{E}}_{p}\) is 0.0544.

Fig. 11
figure 11

Performance of different chlorophyll content prediction models using testing sets

Fig. 12
figure 12

Performance comparison of different chlorophyll content prediction models composed of different input variable set and different regression methods

Prediction models based on vegetation indices

A series of VIs were optimized by investigating the combination of bands variables in the formulas of each defined empirical VIs (Table 2). For each VI, the formed indices were evaluated by their absolute correlation coefficient \(\left|R\right|\) to the chlorophyll content (Fig. 13). The maximum \(\left|R\right|\) for each index definition with their corresponding variable wavelength combination are shown in Table 6.

Fig. 13
figure 13

Plots of correlation coefficients between different VIs and the chlorophyll content. For each VI definition, the x-axis and y-axis are the variable wavelength combination used in the formula, range from 525 to 900nm. The color corresponds to the correlation coefficient R between the VI formed by certain wavelength combination and the leaf chlorophyll content

It can be observed that the \({\left|R\right|}_{max}\) of the eight optimized VIs were all higher than 0.75, indicating a good correlation with the leaf chlorophyll content. The highest \({\left|R\right|}_{max}=0.8888\) was obtained using the MSR index definition, with band combinations of 607 and 896 nm.

Figure 13. Plots of correlation coefficients between different VIs and the chlorophyll content. For each VI definition, the x-axis and y-axis are the variable wavelength combination used in the formula, range from 525 to 900 nm. The color corresponds to the correlation coefficient R between the VI formed by certain wavelength combination and the leaf chlorophyll content.

Table 6 The Maximum correlation coefficient and the corresponding band position for different VIs.

The training data was used to fit linear regression models describing how both existing and optimized VIs are related to the chlorophyll contents, respectively (Table 7). The results revealed that the linear models based on optimized VIs generally outperforms existing VIs in terms of chlorophyll content prediction performance. And model with optimized NDVI (570, 894) had the best performance with the highest Rp2 equals to 0.8208 and the lowest RMSEp equal to 0.1786.

Table 7 The regression models for different existing and optimized VIs.

Generalizability assessment of the chlorophyll content prediction model

To further validate the generalizability of the optimal chlorophyll content prediction model, the established optimal model proposed was applied to the 240 samples collected from sugarcane plants grown in the field. The predicted chlorophyll content values were found to be highly correlated (R2 = 0.9577) and accurate (RMSE  = 0.0775) with respect to the ground truth measurements (Fig. 14). The fitted linear model between measured and predicted values has a slope of 0.8387, and an intercept of 0.4533 mg/cm2. The 95% confidence limit was about ± 1 mg/cm2 for chlorophyll content prediction.

Fig. 14
figure 14

Performance of the proposed model for chlorophyll content estimation on field samples

It was shown that the model established by SPA-based dimensionality reduction with Stacking achieved certain level of stability and generalizability in predicting chlorophyll content of sugarcane leaves.

Discussions

Effects of drought stress on chlorophyll content and spectrum of sugarcane leaves

The results showed that the chlorophyll content of sugarcane leaves decreases with increasing drought stress, and the reflectance spectrum of sugarcane leaves of all cultivars showed a significant trend with increasing drought stress. From the biological point of view, this is because drought stress accelerates the decomposition of chlorophyll in sugarcane leaves (Bin Zhang et al., 2020). As a consequence, the light absorption was hindered and therefore the reflectance of the spectrum at 525–700 nm, which is associated with respiratory pigments. Water stress also causes damage to the cellular structure of sugarcane leaves, which affects light absorption and increase reflectance to the spectrum of 750–950 nm (Zarco-Tejada et al., 2009). Therefore, the apparent changes in spectra caused by water stress made it possible to quantify the chlorophyll content using NIR/VIS reflectance spectroscopy, providing a foundation for predicting the degree of drought stress.

Dimensionality reduction and regression methods

A significant reduction in the number of input variables was obtained through dimensionality reduction methods (correlation coefficient method, SPA, RF and SR) without a significant loss in prediction accuracy. This demonstrates the effectiveness of dimensionality reduction in reducing the redundancy of spectrum data for chlorophyll content estimation in sugarcane leaves.

In the aspect of selected bands, the characteristic bands selected by SR and SPA were mainly around 500 and 700 nm in the visible light region and 850–900 nm in the NIR region. This is consistent with previous studies on the important spectrum for chlorophyll content prediction (Carter & Knapp, 2001; Zhou et al., 2021). In contrast, the correlation coefficient method and the RF method select the band mainly in the visible light and fail to extract the important spectra in the near-infrared region.

In the aspect of performance in dimensionality reduction, as the experimental results, the overall performance of SR and SPA was better than the correlation coefficient and RF. That is because correlation coefficient method and RF only extracts variables that are highly correlated to the measured chlorophyll content without intercorrelation checked. While SPA and SR are algorithms capable of selecting features with less intercorrelation, which further reduced the redundancy of the data. This also revealed the necessity of NIR spectrum in the predication of chlorophyll content prediction in sugarcane leaves.

With machine learning-based regression models, as shown in Fig. 13, the predicted chlorophyll contents were found highly correlated and accurate with respect to measured chlorophyll contents under most situations. However, in all cases, none of the methods outperforms others in terms of correlation (R2) and accuracy (RMSE). The optimal regression model varied with the input variable set selected. Considering both prediction performance and model complexity, the optimal prediction model was constructed by combining Stacking regression algorithm with 16 characteristic bands selected by SPA. The model could yield an prediction accuracy of \({R}_{p}^{2}\) equals to 0.9834 and \({\text{R}\text{M}\text{S}\text{E}}_{p}\) equals to 0.0544.

Comparison of our proposed model with VI-based models

A series of linear regression models based on existing VIs and optimized VIs were constructed in this study, as these were the common approaches for the estimation of biochemical parameters of sugarcane plants (Amarasingam et al., 2022; Li et al., 2015). Performance of linear models based on optimized VIs is generally higher than those based on existing VIs reported effective in Chlorophyll content estimation, which is consistent with the findings in Yang et al. (2021). The models with optimized NDVI (570, 894), optimized RI (607, 886) and optimized MSR (607, 896) yield a \({R}_{p}^{2}\) greater than 0.8. The models with optimized NDVI (570, 894) has the best performance (\({R}_{p}^{2}\) = 0.8208 and \({\text{R}\text{M}\text{S}\text{E}}_{p}\) = 0.1786) in VI-based models. The characteristic bands selected were around 600 and 890 nm, which were in red zone related to the absorbance band of chlorophyll, and in NIR zone with high reflectance related to the leaf cellular structure. Therefore, optimized VI-based models can achieved a decent performance in leaf chlorophyll content prediction with very few variable bands, and results of selected bands could be useful in designing optical sensors for leaf chlorophyll content prediction.

The prediction performance of the proposed optimal model (i.e. SPA + Stacking) were compared with VI-based models. The performance of our best optimized NDVI-based models (\({R}_{p}^{2}\) = 0.8208 and \({\text{R}\text{M}\text{S}\text{E}}_{p}\) = 0.1786) is still lower than the proposed model based on feature band extraction and machine learning (\({R}_{p}^{2}\) = 0.9834 and \({\text{R}\text{M}\text{S}\text{E}}_{p}\) = 0.0544). Our results demonstrated that our proposed model outperforms of commonly used VI-based methods in estimation of chlorophyll content of sugarcane leaves. That is because the VIs’ nature of dimensionality reduction may lead to the loss of significant information reflecting the physiological effects corresponding to chlorophyll content changes, thus decrease the estimation accuracy, thus not suitable for an accurate estimation of leaf chlorophyll content.

Generalizability of the chlorophyll content prediction model

When applying the trained optimal regression model onto field-collected data for generalizability evaluation, the predicted chlorophyll content values were found to be highly correlated (R2 = 0.9577) and accurate (RMSE = 0.0775) with respect to the ground truth values. A certain level of generalizability was obtained in monitoring sugarcane drought stress. However, the prediction accuracy was lower compared to data collected in the greenhouse. A slope below one was observed, indicating the relationship between the spectrum and chlorophyll content varies for sugarcane plants cultivated in the greenhouse and in the field. The main reason was that the field environment was less controllable and different from that of the greenhouse. This affected the growth of the sugarcane plants, including the cellular structures.

Therefore, the question regarding the optimal dimensionality reduction method and regression model has been answered in this paper. A series of spectral data and models for chlorophyll content estimation of sugarcane leaves at elongation stage were proposed and could be treated as priority indices to be tested in any follow-up work. Further research will be conducted regarding the optimal strategy for drought stress assessment of the sugarcane plants using the chlorophyll content as an intermedium variable. At the same time, environmental variables affecting plant growth should be investigated and incorporated into the estimation model to improve the generalizability of the predication models.

Conclusions

In this study, we proposed models for the estimation of leaf chlorophyll content of drought-affected sugarcane using VIS/NIR reflectance spectroscopy. By comparing the combinations of different dimensionality reduction techniques and different regression models, a chlorophyll content estimation model with optimal performance was constructed. The results show that the selection of characteristic bands is effective in reducing the redundancy in the spectrum, while improving the prediction accuracy of certain models. Considering both prediction performance and model complexity, the optimal prediction model was constructed by combining successive projection algorithm (SPA) and Stacking regression algorithm, with R2 = 0.9834, RMSE  = 0.0544. The proposed model outperforms the commonly used VI-based approaches. By applying the trained optimal model onto field-collected data, the predicted chlorophyll content values were found correlated (R2 = 0.9577) and accurate (RMSE  = 0.0775) with respect to the measured ground truth, which demonstrated a certain level of generalizability of the model.

The overall results manifests that the proposed SPA + Stacking model could provide an effective, accurate and non-destructive way for monitoring the leaf chlorophyll content of sugarcane plants. As chlorophyll content is closely related to the degree of drought, applying the proposed chlorophyll content prediction method to sugarcane cultivation can provide general guidance for irrigation management and a scientific basis for water resources allocation and drought preparation, thus improving sugarcane yield and reduce drought risk.

However, the developed prediction model developed has the major limitation that only a limited range of environmental factors affecting the sugarcane growth were considered in this study. Factors including meteorological conditions, water diversion and other uncontrollable factors were not considered, which may reduce the performance of the prediction model. Therefore, our future research plan is to further investigate and incorporate environmental factors into the model proposed in this study.