1 Introduction

In this era of automation, the agricultural domain still requires rigorous manual intervention, which often limits the growth of plants and crops. Research is therefore ongoing in the domain of plant phenomics, the study of the growth and structure of plants. Recently, extracting the shape of a plant and counting its leaves have become major tasks in plant phenotyping. Detecting and counting the leaves is quite challenging because the leaves are often small and/or overlap one another. Manually counting the leaves of such a plant is a tedious and time-consuming task. Therefore, an automatic process is needed to solve this problem.

Deep learning (DL) is one of the most commonly applied modern artificial intelligence (AI) tools and plays a crucial role in plant phenotyping [1]. Convolutional neural networks (CNNs) extract the growth dynamics and morphological traits of a plant from an image with high precision [2]. Plant phenotyping has recently been recognized as a bottleneck in modern plant breeding and related research [3]. Moreover, plant phenotyping is an expanding research field that connects plant biology, technology, DL, and automation engineering, and its importance is increasing as it accelerates the process of plant breeding [4]. Most image-based phenotyping research relies on the structural data and holistic physiological information of the plants present in the image. The phenotypes are extracted from the sprouting, appearance, and senescence of leaves, and the appearance of fruits and flowers [5]. Related work in phenotyping has grown rapidly over the last few years.

In recent times, digital cameras (often integrated into mobile phones) have become widely available. A robust plant phenotyping scheme for camera-captured images therefore helps to reduce costs and improve the production of plants and crops, and several researchers are investigating phenotyping schemes for such images [1]. The growth of a plant can be estimated from its projected leaf area [7]. As mentioned earlier, because leaves are often small and/or overlapped by other leaves, CNN-based segmentation of the plant leaf area is a challenging task. Furthermore, random rotation or displacement of leaves and lighting effects on plant images make segmentation more difficult [7], and existing CNN-based approaches often fail to produce highly accurate results. A review of previous plant phenotyping studies shows that few CNN models combine a small parameter count with good accuracy, although such models are needed for mobile devices and systems with limited memory. To address this issue, the goal of this research is to design a CNN model with a small number of parameters but high accuracy, so that it can be executed on systems with limited access to high-performance hardware. The proposed model, LS-Net, has a small number of parameters (approximately 5 M) and requires limited storage. Although the leaf segmentation task is relatively simple for images of rosette plants [3], existing methods have yet to produce highly accurate results; hence, the model has been trained and tested using images of rosette plants. The proposed approach is also compared with four standard CNN models, namely (a) DeepLab V3+ [30]; (b) Seg Net [29]; (c) Fast FCN with Pyramid Pooling module [38]; and (d) U-Net [42]. The leaf segmentation efficacy of the tested CNN models has been investigated by computing segmentation accuracy, Dice score, and Intersection over Union (IoU) score. The results are quite encouraging.

In summary, the contributions of the work are as follows:

  • A novel convolutional neural network for leaf segmentation of rosette plants has been proposed. It consists of a powerful and lightweight backbone, MobilenetV2; an Atrous Convolution Block, which allows the density of encoder features to be controlled [34]; and an effective decoder that reuses encoder information saved before each downsampling step for better accuracy.

  • Unlike existing approaches, separable convolutions have been applied both in the Atrous convolution block used for leaf segmentation and in the decoder, resulting in a faster and more robust encoder-decoder network.

  • On the merged dataset of CVPPP and KOMATSUNA, the proposed model achieves state-of-the-art performance. The proposed model was also tested on the CVPPP dataset alone to allow a fairer comparison with previously published models.

  • A normalization layer has been introduced in the proposed model to filter out unnecessary pixels from the images. Details are given in Sect. 2.5.

The rest of the paper is organized as follows. Section 1.1 reviews related existing work. The proposed method is discussed in Sect. 2, the experimental study is presented in Sect. 3, and concluding remarks are given in Sect. 4.

1.1 Related works

Successful plant phenotyping research has been carried out in recent years, and several researchers have concentrated their efforts on this topic. Phenotyping of plants has been performed in climate-controlled laboratories, in greenhouses, and even outdoors. Kumar et al. [9] proposed a plant region segmentation approach based on orthogonal transform coefficients and employed a deep CNN to count the leaves. Their models were tested on mobile phone images as well as on the CVPPP benchmark datasets, and their system achieved a segmentation accuracy of 93.72%. Wu et al. [10] proposed a highly effective object-aware embedding learning architecture that includes a distance regression module, which generates seeds for fast clustering. By combining two U-Nets, they obtained an 8% increase in mean symmetric best Dice (mSBD). The fundamental flaw in their model is that it does not produce predicted results for the A3 category of the CVPPP dataset. Gomes and Zheng [11] described experimental research on the constraints of phenotyping datasets and the best-performing approach in the leaf segmentation challenge; they also showed that model cardinality and test-time augmentation can be useful in single-class image segmentation. Huther et al. [12] created a versatile pipeline that can extract phenotypic measurements from plant images in an unsupervised manner; using segmented images, leaf tissue was classified into three categories: healthy, anthocyanin-rich, and senescent. Another study was conducted by Yang et al. [13], who investigated how to segment and classify leaf images with a complex background using a Mask R-CNN model.

Bell et al. [14] described a method for segmenting Arabidopsis thaliana plant images at the leaf level based on detected edges. The model classified the leaf margins as well as the number of leaves, and the method is effective at extracting occluding pairs of leaves that overlap extensively. Pape and Klukas [15] published another paper on leaf segmentation using leaf edge detection. They also offered a strategy based on the image-analysis software IAP [16] for extracting a broad collection of image attributes to forecast the number of leaves; owing to varied resolution and noise, the model's performance is reduced on the A3 category of the CVPPP dataset. Kuznichov et al. [18] suggested a data augmentation technique that preserves the geometric structure of the data objects, attempting to match the appearance of augmented images as closely as possible to that of authentic images. By combining different approaches, they obtained a best Dice score of 86.7; however, the A2 and A3 categories of the CVPPP dataset showed lower segmentation accuracy than A1 and A4.

Ubbens et al. [17] used both real and synthetic datasets to show that the two can be interchanged when training a model for the leaf counting task. They primarily provided a new strategy for augmenting plant phenotyping datasets with rendered images from synthetic databases, and found that their strategy reduces the mean absolute count error by approximately 27%.

While most research focused on 2D images, Itakura et al. [21] used an efficient approach to obtain spatial information. In a 2D top-view image of a rosette plant, the boundaries between leaves are unclear because of overlapping; to overcome this drawback, they collected 3D point-cloud images of plants with various cameras and sensors, obtaining an absolute leaf area estimation error of 0.06 cm². Pape and Klukas [19] provided another 3D histogram-based segmentation for rosette plants, in which a Euclidean-distance-map-based method was implemented to detect and segment the leaves; in addition, a method was introduced to detect the optimal leaf split points to separate overlapping leaves.

Yin et al. [20] suggested a unique paradigm for processing videos of fluorescent plants. Leaf segmentation and alignment were applied to one frame to determine the number of well-aligned leaves, after which leaf tracking was applied using information from earlier frames. Their proposed model attained a Symmetric Best Dice (SBD) of 78.0. Scharr et al. [22] compared several leaf segmentation methods on a unique dataset containing images from phenotyping experiments. Ren et al. [23] presented an attention-based end-to-end Recurrent Neural Network (RNN) trained on successively constructed Regions of Interest (RoIs), with object segmentation inside each region; the model was validated on the CVPPP dataset as well as other datasets.

Aksoy et al. [24] proposed a multi-level approach for locating and tracking rosette plant leaves. They worked on three-week-old tobacco plants in their early phases of growth, imaged with infrared cameras, which allowed automatic and non-invasive measurement of essential plant factors such as leaf growth rates. Pre-processing, leaf segmentation, and leaf tracking were the three major parts of this operation, and leaf-shape models were also used to calculate leaf sizes. Dellen et al. [25] evaluated the growth of a tobacco plant from time-lapse images of growing plants; an image-shape model partitioned the leaf area in each frame, and a new graph-based tracking method was developed that can fill gaps across a set of neighbouring frames in the sequence. Janssens et al. [26] described another method for the automatic segmentation of individual plant leaves, together with an approach for extracting the line of symmetry of a leaf.

The above discussion clearly demonstrates that accurate leaf segmentation is very important, because the accuracy of leaf detection, localization, counting, tracking, and boundary estimation depends directly on it. It is also noticeable that the majority of the discussed models are unable to produce satisfactory results on the A2 and A3 categories of the CVPPP dataset, owing to green background noise and a large number of overlapping leaves. Therefore, this study develops a novel CNN model, called LS-Net, which performs better leaf segmentation in the presence of varied backgrounds and overlapping leaves. Brief descriptions of the proposed CNN model and the dataset are given in the following section.

2 Materials and methods

This section describes the proposed CNN model as well as the dataset. The first sub-section describes the dataset used, with some sample images. The next sub-section describes the backbone of the proposed model. The following sub-sections discuss the Atrous convolution block, the upsampling method, and the normalization layer. Finally, the architecture of the proposed model is demonstrated.

2.1 Dataset design

This experimental study is performed on a merged dataset, a combination of two well-known datasets: the Computer Vision Problems in Plant Phenotyping (CVPPP) benchmark dataset [7] and the KOMATSUNA dataset [8]. The CVPPP dataset is divided into four parts, termed A1, A2, A3, and A4, containing 810 images in total.

The KOMATSUNA dataset, on the other hand, is divided into two sections: one captured with an RGB-D camera and another captured with multiple RGB cameras, containing 1200 images in total. The merged dataset therefore consists of 2010 images together with their binary segmentation ground truths, all of size 224 × 224; the images are resized using bilinear interpolation [27]. The merged dataset has then been divided at random into a training set of 1410 images, a validation set of 300 images, and a test set of 300 images. For augmentation, random clockwise and anti-clockwise rotations of 90\(^\circ\) and 180\(^\circ\), as well as vertical flips, have been applied to the images; a minimal sketch of this preprocessing is given below. Some sample images from the dataset are shown in Fig. 1.
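The following is a minimal sketch of the resizing and augmentation pipeline described above, written with TensorFlow (the framework used in this study, see Sect. 3); the helper names and the exact composition of transforms are illustrative assumptions, not the authors' released code.

```python
import tensorflow as tf

def preprocess(image, mask):
    # Bilinear resize to the 224 x 224 input size used throughout the paper;
    # nearest-neighbour for the mask keeps the ground truth binary.
    image = tf.image.resize(image, (224, 224), method="bilinear")
    mask = tf.image.resize(mask, (224, 224), method="nearest")
    return image, mask

def augment(image, mask):
    # Apply the same random rotation (multiples of 90 degrees) and random
    # vertical flip to the image and its ground-truth mask.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    image, mask = tf.image.rot90(image, k), tf.image.rot90(mask, k)
    if tf.random.uniform([]) > 0.5:
        image, mask = tf.image.flip_up_down(image), tf.image.flip_up_down(mask)
    return image, mask
```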

Fig. 1

Example of leaf images from the utilized datasets

To allow a better comparison with previously published CNN models, a separate test set of 140 randomly selected images from the CVPPP dataset is also utilized.

2.2 Backbone

The CNN backbone strongly influences the precision of an image segmentation model: the backbone extracts the feature maps of an image, and the image is then segmented based on these feature maps. Many segmentation models with different backbones are reported in [28]. In this paper, MobilenetV2 has been applied as the backbone of the proposed model because it has proved successful in the CNN-based segmentation domain; it is discussed in the next paragraph.

MobilenetV2 [31] is a recently developed CNN that has been utilized by several well-known segmentation models. Hence, it has been applied as the backbone of the proposed model, and its architecture is depicted in Fig. 2. Some minor modifications are made to the architecture of MobilenetV2 to fit the decoding part; these are discussed in Sect. 2.6. Many powerful CNN models rely on depthwise convolutional layers, which reduce both the number of parameters and the amount of computation. A bottleneck block, presented in Fig. 2(a), is employed in MobilenetV2. ReLU6 [31], a clipped ReLU function defined as \(min\left(max\left(x,0\right),6.0\right)\), is used here. A ‘conv block’ consists of three layers, i.e., a convolution layer, a batch normalization layer, and an activation function, as shown in Fig. 2(b). In addition, a ‘depthwise conv block’, which is the combination of a depthwise convolution layer, a batch normalization layer, and an activation function, is presented in Fig. 2(c).
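As a small illustration, the ‘conv block’ of Fig. 2(b) and the clipped ReLU6 activation can be sketched in Keras as follows; the filter count and kernel size here are placeholders, not values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size=3, stride=1):
    # Convolution -> batch normalization -> activation, as in Fig. 2(b).
    x = layers.Conv2D(filters, kernel_size, strides=stride,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)  # ReLU6(x) = min(max(x, 0), 6)
```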

Fig. 2

a Bottleneck Block; b Conv Block; c Depthwise Conv Block; d MobilenetV2 Architecture

2.3 Atrous convolution block

The Atrous convolution block, introduced by Chen et al. [34], has shown efficacy when integrated with the well-known DeepLab V3+ [30], DeepLab [35], and PSP Net [36]. Atrous convolutions are very efficient at explicitly controlling the resolution of the feature maps extracted by deep CNN backbones; for that reason, the Atrous convolution block of Chen et al. [30] has been adopted for this work. However, the traditional convolution layers are replaced by separable convolution layers, which have significantly fewer parameters and thereby reduce the parameter size of the proposed model. The block also sets the field-of-view of the filters in order to capture multi-scale information while simplifying the typical convolution operations. In the Atrous convolution block, the input is fed in parallel into five layers, namely a (1 × 1) separable convolution layer, three (3 × 3) dilated separable convolution layers [37] with dilation rates of 6, 12, and 18, and a max pooling layer; the outputs of all these layers are then concatenated. The architecture is presented in Fig. 3, and a hedged sketch is given below.
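A minimal Keras sketch of this block follows, assuming a size-preserving max pooling branch and a common output channel count of 256 so that the five branches can be concatenated; neither value is stated in the text, so both are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def atrous_block(x, filters=256):
    # Five parallel branches over the same input, as in Fig. 3.
    b0 = layers.SeparableConv2D(filters, 1, padding="same", activation="relu")(x)
    b1 = layers.SeparableConv2D(filters, 3, dilation_rate=6,
                                padding="same", activation="relu")(x)
    b2 = layers.SeparableConv2D(filters, 3, dilation_rate=12,
                                padding="same", activation="relu")(x)
    b3 = layers.SeparableConv2D(filters, 3, dilation_rate=18,
                                padding="same", activation="relu")(x)
    b4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    return layers.Concatenate()([b0, b1, b2, b3, b4])
```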

Fig. 3

Atrous Convolution Module

2.4 Utilized upsampling method

Upsampling methods such as nearest neighbour [44], bed of nails [44], max unpooling [44], and bilinear interpolation [27] are predefined and do not depend on the data: they learn nothing during training and hence cannot adapt to the task at hand. To address this issue, a well-known learnable upsampling method, transpose convolution, has been applied in this work. Transposed convolutions underpin many modern segmentation and super-resolution algorithms and provide a general, learned upsampling of abstract representations. The working methodology is presented in Fig. 4, where each input element is multiplied by the kernel and the resulting patches are accumulated in the output [44]. Transposed convolutions also suffer from checkerboard artefacts, although less than the above-mentioned interpolation techniques; the main cause is uneven kernel overlap in some parts of the image. This can be avoided or reduced by using a kernel size divisible by the stride, e.g., a kernel size of 2 × 2 or 4 × 4 with a stride of 2. Hence, this study utilizes a (2 × 2) kernel for transpose convolution. For bilinear interpolation, in contrast, the checkerboard effect observed here was significantly stronger than for transpose convolution.
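As a quick illustrative check of the kernel-size-divisible-by-stride rule, the following sketch builds the (2 × 2)-kernel, stride-2 transpose convolution used here; the filter count and feature-map shape are arbitrary examples.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stride-2 transpose convolution whose 2 x 2 kernel tiles the output evenly,
# avoiding the uneven overlap that causes checkerboard artefacts.
up = layers.Conv2DTranspose(filters=64, kernel_size=2, strides=2, padding="same")
x = tf.random.normal([1, 14, 14, 128])  # an example encoder feature map
print(up(x).shape)                       # (1, 28, 28, 64): spatial size doubled
```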

Fig. 4

Working methodology of Transpose Convolution

2.5 Normalization layer

The normalization layer used in the proposed LS-Net model is simply a layer that sets to zero all values less than 0.5 and leaves the remaining values unchanged. The idea behind this layer is to filter out unnecessary pixels, which can be generated by the background or by reflections of light and which can decrease segmentation accuracy. With this normalization layer, it becomes easier for the designed model to predict more accurately; a sketch of the layer is given below. Figure 5 shows the segmented outputs of the model with and without the normalization layer and demonstrates its crucial impact: the proposed model's test accuracies with and without the normalization layer are 97.36% and 92.85%, respectively, on the merged dataset. Hence, the normalization layer significantly influences the accuracy of LS-Net. The other tested CNN models, i.e., DeepLab V3+, Seg Net, Fast-FCN with Pyramid Pooling Module, and U-Net, do not incorporate such a normalization layer. The experimental study therefore indicates that this normalization layer is an important component of the proposed LS-Net model.
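The thresholding behaviour described above can be expressed in one line; this is a sketch of the described behaviour, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Zero out activations below 0.5; pass all other values through unchanged.
normalization_layer = layers.Lambda(
    lambda x: tf.where(x < 0.5, tf.zeros_like(x), x)
)
```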

Fig. 5

Segmentation Outcomes of the proposed LS-Net model with and without Normalization Layer

2.6 The architecture of the proposed LS-Net Model

The main purpose of the proposed model is to increase prediction accuracy. The proposed LS-Net utilizes MobilenetV2 [31] as the backbone because, besides being efficient, it has fewer parameters than the alternatives. The architecture of this backbone is discussed in Sect. 2.2.

Many efficient neural network topologies [31,32,33] employ depthwise convolution layers as a crucial building element, and this study uses them as well, together with separable convolution layers, to reduce the parameter size. The main concept is to replace a full convolution operator with a factorized version that splits the convolution into two layers: the first, known as depthwise convolution, applies a single convolutional filter per input channel to perform lightweight filtering; the second is a (1 × 1) convolution, also known as a pointwise convolution, which generates new features by computing linear combinations of the input channels [31]. A quick parameter count illustrates the saving (see the sketch below).
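As a rough back-of-the-envelope example (with illustrative channel counts, not taken from the paper), factorizing a 3 × 3 convolution over 64 input and 128 output channels shrinks the weight count by roughly 8×:

```python
# Standard 3x3 convolution vs. its depthwise-separable factorization
# (bias terms omitted). Channel sizes are illustrative assumptions.
c_in, c_out, k = 64, 128, 3
standard_params = k * k * c_in * c_out           # 73,728 weights
separable_params = k * k * c_in + c_in * c_out   # 576 + 8,192 = 8,768 weights
print(standard_params / separable_params)        # about 8.4x fewer weights
```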

In addition, the feature maps are saved before each downsampling step and reused later in the model, as shown in Fig. 6. ReLU6 [31] is used as the activation function. The Atrous convolution block of Fig. 3 is added after the backbone. The entire architecture is illustrated in Fig. 6. The kernel size of the separable convolution layer is (1 × 1), while the kernel size of all the ‘separable conv blocks’ is (3 × 3).

Fig. 6

a Conv Block; b Separable Conv Block; c Depthwise Conv Block; d Bottleneck Block; e Architecture of Proposed LS-Net

The decoder part contains four decoder blocks. Each decoder block consists of a depthwise conv block as presented in Fig. 6(c), two bottleneck blocks as presented in Fig. 6(d), a transpose convolution as shown in Fig. 4 [44], a further depthwise conv block, and a concatenation with the corresponding saved encoder features.

A depthwise conv block, as shown in Fig. 6(c), consists of a conv block as presented in Fig. 6(a) with a (1 × 1) kernel, a (3 × 3) depthwise convolution layer [31] followed by batch normalization and an activation function, a (1 × 1) convolution layer [39], and a batch normalization layer.

The bottleneck block is the same as the depthwise conv block, with the sole difference that, at the end, the input is concatenated with the output of the block, as presented in Fig. 6(d). A bottleneck block thus consists of a conv block with a (1 × 1) kernel, a (3 × 3) depthwise convolution layer [31] followed by batch normalization and an activation function, a (1 × 1) convolution layer [39], a batch normalization layer, and the concatenation. A hedged sketch of a decoder block built from these pieces follows.
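Putting the above descriptions together, one decoder block can be sketched as below. The filter counts, activation choice, and the source of the skip features are assumptions where the text is silent; the exact arrangement is given only in Fig. 6.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_conv_block(x, filters):
    # Fig. 6(c): 1x1 conv block, 3x3 depthwise conv + BN + activation,
    # then 1x1 conv + BN.
    y = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    y = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    y = layers.Conv2D(filters, 1, padding="same", use_bias=False)(y)
    return layers.BatchNormalization()(y)

def bottleneck_block(x, filters):
    # Fig. 6(d): same as the depthwise conv block, but the input is
    # concatenated with the block output.
    return layers.Concatenate()([x, depthwise_conv_block(x, filters)])

def decoder_block(x, skip, filters):
    # One of the four decoder blocks described above.
    y = depthwise_conv_block(x, filters)
    y = bottleneck_block(y, filters)
    y = bottleneck_block(y, filters)
    y = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(y)
    y = depthwise_conv_block(y, filters)
    # Concatenate with features saved before the matching downsampling step.
    return layers.Concatenate()([y, skip])
```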

3 Experimental results

The experimental study has been performed on an NVIDIA GeForce 1650 GPU with CUDA 10.0 and cuDNN, and an AMD Ryzen 5 3550H processor with 16 GB RAM and a 256 GB SSD. The software environment is Anaconda with Jupyter Notebook, and all experiments have been performed using TensorFlow and scikit-learn.

The segmentation performance has been investigated by computing three widely used metrics, namely segmentation accuracy, Dice score, and Intersection over Union (IoU). The details of these metrics are reported in Table 1.
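For reference, the commonly used forms of these metrics can be computed as follows for binary masks; this is a generic sketch, with the paper's exact formulations listed in Table 1.

```python
import numpy as np

def dice(pred, gt):
    # Dice = 2|P ∩ G| / (|P| + |G|) for binary masks pred (P) and gt (G).
    inter = np.sum(pred * gt)
    return 2.0 * inter / (np.sum(pred) + np.sum(gt))

def iou(pred, gt):
    # IoU = |P ∩ G| / |P ∪ G|
    inter = np.sum(pred * gt)
    return inter / (np.sum(pred) + np.sum(gt) - inter)

def accuracy(pred, gt):
    # Fraction of pixels labelled correctly.
    return np.mean(pred == gt)
```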

Table 1 Segmentation Quality Parameters

3.1 Parameter settings

Binary cross-entropy has been utilized as the loss function for all CNN models. The input image is in RGB format with shape (224, 224, 3), and the output is a binary mask of shape (224, 224, 1). The batch size varies with the model and machine capacity. The parameter settings of the proposed and other tested CNN models are presented in Table 2.

Table 2 Parameter Settings of CNN models over input image size: (224, 224, 3)

3.2 Training and validation of the proposed model

Images belonging to the training and validation sets are normalized during preprocessing, before training. The proposed model uses the Adam optimizer with binary cross-entropy as the loss function. To obtain an optimized result, the training process is terminated once the training loss is less than or equal to 0.011, a threshold set empirically; a sketch of this stopping rule is given below. The training accuracy began at 94.01% and ended at 97.38%, while the training loss began at 0.1335 and ended at 0.0102. The training details of the proposed LS-Net are presented in Fig. 7, and the training outcomes are shown in Table 3.
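The empirical stopping rule can be implemented as a small Keras callback, sketched below; the names model, train_ds, and val_ds and the epoch cap are placeholders, not values from the paper.

```python
import tensorflow as tf

class StopAtTrainingLoss(tf.keras.callbacks.Callback):
    # Halt training once the training loss reaches the empirical 0.011 threshold.
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", float("inf")) <= 0.011:
            self.model.stop_training = True

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=200,
          callbacks=[StopAtTrainingLoss()])
```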

Fig. 7

Training details of Proposed LS-Net: a training loss; b training accuracy; c validation loss

Table 3 Training results of the Proposed LS-Net

3.3 Discussion on testing and prediction

The segmentation efficacy of the tested CNN models has been measured by computing the Dice score and IoU along with the testing loss and accuracy. The proposed LS-Net model has been compared with DeepLab V3+, Seg Net, Fast FCN with Pyramid Pooling module, and U-Net on both the merged dataset and the CVPPP dataset. In addition, LS-Net is compared, in terms of Dice score on the CVPPP dataset, with SLIC_Seg [22, 45] and with the segmentation models of Nottingham [9], Wageningen [9], and Kumar & Domnic [9, 45].

The numerical values of the mentioned quality metrics on the merged dataset are recorded in Table 4. In addition, a graphical analysis of the Dice score and IoU for all employed models on the merged dataset is presented in Fig. 8. The results on the CVPPP dataset are presented in Tables 5 and 6, and the segmented leaf outcomes of the tested CNN models are presented in Fig. 9.

Table 4 The testing results obtained by different CNN models over the merged dataset
Fig. 8

Graphical analysis of Dice score and IoU: a DeepLab V3+; b Seg Net; c Fast-FCN with Pyramid Pooling Module; d U-Net; e Proposed LS-Net Model

Table 5 The testing results obtained by different CNN models over the CVPPP dataset
Table 6 Dice score comparison with previously published leaf segmentation models on the CVPPP dataset

According to Table 4 and the graphical analysis in Fig. 8, it is clear that the proposed LS-Net model delivers superior results compared to the other tested CNN models in terms of test accuracy, test loss, Dice, and IoU on the merged dataset. Seg Net provided the second best results under the same evaluation metrics, DeepLab V3+ and Fast FCN with Pyramid Pooling block produced nearly identical results, and U-Net produced the worst results.

The tested CNN models were evaluated again on the CVPPP dataset. The values of test accuracy, test loss, Dice, and IoU in Table 5 clearly show that the proposed LS-Net produced competitive results, achieving a testing accuracy of 98.92%. Identifying the second best model on the CVPPP dataset is more difficult: based solely on testing accuracy it is DeepLab V3+, whereas in terms of test loss and IoU it is Seg Net, and Fast FCN with Pyramid Pooling block provided a slightly better Dice score than Seg Net. U-Net again produced the worst numerical results. Table 6 compares the proposed LS-Net model to some state-of-the-art segmentation techniques via Dice scores on the CVPPP dataset; these scores lend further credence to the superiority of LS-Net. The Dice scores of the compared state-of-the-art techniques on the CVPPP dataset are taken from [9, 45].

Following the metric-based analysis, the leaf segmentation outcomes of the tested CNN models on the merged dataset are presented in Fig. 9 for visual analysis, together with the corresponding ground-truth images for better comparison. Overall, the proposed LS-Net produces better segmentation results than the other tested CNN models. Segmented results for complex images are also presented; images with a small green background are considered complex because the colours of the background and the leaves are very similar. The last three rows of Fig. 9 show the prediction results of the tested models on such images, demonstrating that LS-Net produced competitive results.

Fig. 9

Segmented outcomes of the tested CNN Models

4 Conclusion

This paper introduces LS-Net, a deep convolutional neural network (DCNN) for the leaf segmentation of rosette plants. LS-Net is trained and evaluated on the CVPPP [7] and KOMATSUNA [8] datasets and compared with four well-known DCNN models, DeepLab V3+, Seg Net, Fast-FCN with Pyramid Pooling Module, and U-Net, to investigate its efficacy. Based on the architecture and the experimental results, the following conclusions can be made:

  1. The proposed LS-Net model performs better in the leaf segmentation domain because of its powerful and lightweight backbone, MobilenetV2, an Atrous convolution block that allows the density of encoder features to be controlled, and an effective decoder that reuses encoder information saved before each downsampling step.

  2. The normalization layer used in the proposed LS-Net model has a great impact because it helps to filter out unnecessary pixels from the images. LS-Net also takes up less memory and is therefore better suited than the other models reviewed.

  3. The visual and numerical findings clearly show that LS-Net produces competitive outcomes in the leaf segmentation area when compared with other state-of-the-art CNN models. The key advantage of the proposed LS-Net model over the other evaluated CNN models is that it better segments complex leaf images, such as those with a light green background.

  4. On the merged dataset and the CVPPP dataset, LS-Net obtains test accuracies of 97.36% and 98.92%, respectively, demonstrating its effectiveness in the field of leaf segmentation. This is supported not only by accuracy but also by other well-known segmentation quality scores such as Dice and IoU.

However, the limitation of the proposed model is that it provides unsatisfactory results when there is a large overlap between leaves or when images have a significant green background. Based on the above conclusions and limitations, future directions for the study are as follows:

  1. Although LS-Net performs well in the leaf segmentation of rosette plants, it is important to test it in other real-world leaf segmentation settings as well as in medical image segmentation.

  2. Accurate leaf segmentation is required for leaf counting; hence, leaf counting built on an LS-Net-based segmentation model could be a promising future endeavour.

  3. Aside from the development of CNN models, enhancing image quality with more advanced techniques can also improve performance. Furthermore, using an embedded system, it will be feasible to segment and classify leaves automatically in real time.

  4. Last but not least, deep learning has its own set of open issues, such as network structure design, 3D image segmentation model design, and loss function design. Designing 3D convolution models to analyze 3D leaf image data is a promising research direction, and loss function design has long been a challenge in deep learning research. Nature-inspired optimization algorithms [46] and the optimized deep learning models based on them [47] can also be an emerging research topic.