Introduction

Maize stands as a fundamental staple crop, playing a pivotal role in ensuring food security. Additionally, it serves as a vital source of feed, energy, and forage (Tanumihardjo et al. 2020). However, drought is a primary contributor to significant declines in maize yield (Farhangfar et al. 2015). To mitigate the adverse impacts of environmental stresses, plants have developed diverse mechanisms, among which leaf rolling is noteworthy. The rolling of leaves is a prevalent adaptive response seen in plants experiencing drought stress (Kadioglu et al. 2012). This physiological adaptation diminishes light interception, transpiration, and leaf dehydration. As a result, it emerges as a potentially valuable mechanism for drought avoidance, especially in arid regions (Kadioglu et al. 2007). Besides drought, leaf rolling can be triggered by other abiotic stresses, such as water deficit and high temperature, as well as biotic stresses, including insect infestation and fungal infection. Understanding the mechanisms behind leaf rolling alterations provides researchers with a distinct opportunity to enhance stress tolerance in crops exhibiting this trait, like maize (Kadioglu et al. 2012).

To gain a more profound understanding of leaf rolling as a mechanism, it is imperative to ascertain the occurrence and extent of this phenotype. Traditional leaf rolling detection has primarily been a manual process, known for its labor-intensive and time-consuming nature. Clarke visually assessed the degree of leaf rolling (Clarke 1986). Premachandra et al. assessed the extent of leaf rolling by quantifying the percentage decrease in leaf width caused by rolling (Premachandra et al. 1993). An analogous scoring method, which evaluates the percentage decrease in the width of the central part of the leaf due to rolling, was employed to establish the correlation between drought resistance and rolling (Saruhan et al. 2011). Zhang et al. computed a rolling index by evaluating the widths of leaves in both their natural and unfolded states (Zhang et al. 2009). Sirault et al. developed a repeatable protocol to quantify leaf curvature. Micro-photographs of leaf cross-sections were taken, and two approaches were employed for quantifying leaf rolling: one based on the convex hull of the cross-section and the other using cubic smoothing splines for mathematical approximation. Both approaches yielded objective measurements (Sirault et al. 2015). Baret et al. investigated the viability of an efficient method for assessing leaf rolling in maize through aerial observation using UAVs, but no further applications were pursued (Baret et al. 2018). Visual scoring methods for leaf rolling are often subjective, while instrumented assessment experiments can be both costly and inefficient. These low-throughput techniques present challenges when applied to large-scale phenotyping experiments. Within the scope of our investigation, however, research into high-throughput methods for determining leaf rolling remains limited. Therefore, there exists an urgent demand for high-throughput methodologies, especially within the realm of field experiments.

Recently, the ongoing advancement of high-throughput plant phenotyping measurement and analysis technology has been accompanied by progress in artificial intelligence, notably in deep learning, contributing to plant phenotyping research (Jiang et al. 2020). Leaves, being integral components of plants, demand accurate detection and analysis, crucial for various applications such as species recognition (Mehdipour Ghazi et al. 2017; Waldchen et al. 2018a, b), disease diagnosis (Darwish et al. 2020; Martinelli et al. 2014), and vegetation analysis (Ding et al. 2020). Cutting-edge object detection algorithms in deep learning have found extensive applications in leaf detection, counting, and disease detection (Liu et al. 2020; Oo et al. 2018; Pal et al. 2023; Thai et al. 2023; Ubbens et al. 2018). These advancements lay the groundwork for our proposal of a method for detecting leaf rolling. The intricacies of dense leaves, characterized by occlusion, have consistently posed challenges in leaf-related tasks, thereby presenting difficulties in leaf rolling detection. Scale variations among leaves in different growth stages, alterations in leaf shape due to rolling, and background interference in complex environments are additional factors influencing our detection results. Our aim is to address these challenges and present a precise, high-throughput method for detecting leaf rolling in maize using an object detection algorithm.

This study introduces a method that integrates DCNv2 (Deformable ConvNets v2) (Zhu et al. 2019) and the CBAM (Convolutional Block Attention Module) (Woo et al. 2018) into YOLOv8. Our method introduces DCNv2 to address deformation and scale disparities in leaf rolling detection in maize, and CBAM, a lightweight and effective attention mechanism, to strengthen feature extraction capability and feature validity. We term this method LRD-YOLO. The proposed LRD-YOLO model undergoes validation and testing on our dataset. Experimental results show that our proposed method surpasses others in accuracy, demonstrating its effectiveness for detecting leaf rolling in maize. The contributions highlighted in this study are as follows:

  • We created a dataset comprising maize leaves in different growth stages and with varying degrees of rolling in complex natural environments for leaf rolling detection in maize, meticulously labeling all data.

  • We proposed a novel approach for leaf rolling detection in maize based on improved YOLOv8 with Deformable ConvNets v2 and Convolutional Block Attention Module.

  • Through a comprehensive set of experiments on our dataset, we showcase that LRD-YOLO demonstrates exceptional performance in both accuracy and efficiency, surpassing other methods.

Materials and methods

Image acquisition

The images of maize were obtained from a greenhouse situated at the Shenzhen Experimental Base of the Chinese Academy of Agricultural Sciences, using the rear cameras of iPhone 13 and iPhone 14. Scientific water replenishment measures were implemented throughout the maize’s growth cycle to manage water stress levels, resulting in varying degrees of leaf rolling, ranging from mild to severe.

As illustrated in Fig. 1, these images were obtained under diverse conditions, including overlap, occlusion, and multi-scale occurrences between leaves. The backgrounds featured a mix of weeds and wilted maize leaves, and light effects were also considered. The data collection took place in July 2023, yielding a total of 724 original maize images from multiple perspectives, including 7878 individual target leaves, which were used to construct the dataset for this study.

Fig. 1 Samples of the data

Image annotation

To accurately assess the occurrence of maize leaf rolling, we employed the leaf rolling assessment criteria established by CIMMYT (Bänziger et al. 2000). The assessment involved measuring rolling on individual leaves, and the criteria are depicted in Fig. 2. In Stage 1, the leaf is unrolled and turgid, while from Stage 2 onwards, the leaf rim starts to roll. By Stage 3, the leaf blade displays pronounced rolling, appearing V-shaped; by Stage 4, the rolled leaf rim extends over a section of the leaf blade. By Stage 5, the leaf is rolled tightly, resembling an onion.

Fig. 2 Leaf rolling stages from 1 to 5. Stage 1, the leaf is unrolled and turgid; Stage 2, the leaf rim starts to roll; Stage 3, the leaf blade displays pronounced rolling, appearing V-shaped; Stage 4, the rolled leaf rim extends over a section of the leaf blade; Stage 5, the leaf is rolled tightly, resembling an onion

In this study, the dataset is categorized into two classes based on the various stages of maize leaf rolling during labeling: leaf and rolled. During the classification process, leaves at Stage 1 are labeled as leaf, while leaves at Stage 2 to Stage 5 are labeled as rolled.

The images used in this study were annotated with the LabelImg (Tzutalin 2015) software, with labels saved in .txt format. After the labeling process was finished, the labeled images were divided into training, validation, and test sets in an 8:1:1 ratio.
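A minimal sketch of this split, assuming each image has a YOLO-format .txt label with the same file stem in the same folder (the paths, extension, and seed are illustrative, not the study's actual layout):

```python
import random
import shutil
from pathlib import Path

random.seed(0)  # fixed seed so the split is reproducible
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.8 * n)],            # 8 parts
    "val": images[int(0.8 * n): int(0.9 * n)],  # 1 part
    "test": images[int(0.9 * n):],              # 1 part
}

for split, files in splits.items():
    for img in files:
        label = img.with_suffix(".txt")  # LabelImg's YOLO-format annotation
        for src, sub in ((img, "images"), (label, "labels")):
            dst = Path("dataset") / split / sub
            dst.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, dst / src.name)
```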

YOLOv8 model

YOLOv8 (Jocher et al. 2023), created by Ultralytics, stands as a cutting-edge YOLO model, demonstrating versatile applications in object detection and image classification tasks. Ultralytics, known for their impactful YOLOv5 model (Jocher 2020), has once again set industry benchmarks with YOLOv8.

While YOLOv8 maintains the overarching network architecture of YOLOv5, encompassing the structural design of both backbone and neck while also considering various scale models, it introduces numerous modifications and improvements. YOLOv8 integrates the C2f module into its backbone, resulting in a reduction in the overall network size. The C2f module serves as the fundamental building block of the backbone, featuring a smaller parameter count and superior feature extraction capabilities compared to the C3 module of YOLOv5. Refer to Fig. 3 for a graphical depiction of the structures of the C3 and C2f modules. YOLOv8 also introduces the Decoupled-Head concept (Ge et al. 2021). It retains the Path Aggregation Network (Liu et al. 2018) concept but removes the convolutional structure in the upsampling stage. Furthermore, it discards the anchor-based design in favor of an anchor-free approach. These improvements lead to increased performance in object detection, positioning YOLOv8 as the selected baseline model for our study.

Fig. 3 The structures of the C3 and C2f modules

Improvement of the YOLOv8 model

To improve the performance of detecting leaf rolling, we propose the LRD-YOLO, as depicted in Fig. 4. LRD-YOLO addresses challenges associated with scale variation and occlusion in leaves at different growth stages.

Fig. 4 Overall architecture of the proposed LRD-YOLO

To capture the scale variation induced by leaves at various growth stages, we incorporate Deformable ConvNets v2 (DCNv2) into the model. Specifically, we substitute the convolution in the C2f module with DCNv2, which enhances the model's capability to detect leaves with deformations or significant scale variations. Additionally, to improve leaf rolling detection in scenarios where leaves occlude or overlap one another, we incorporate the CBAM before the small and medium detection heads.

The proposed enhancements to the LRD-YOLO model significantly contribute to the overall accuracy and robustness of leaf rolling detection. Furthermore, these improvements enable the model to effectively adapt to the challenges posed by multiscale and occluded leaf detection within complex natural environments.

Deformable convnets v2

In traditional convolutional neural networks, convolution operations are performed at fixed positions within the input feature maps, as depicted in Fig. 5a. However, real-world scenarios often entail objects within images undergoing various transformations, such as deformations, rotations, or changes in scale. These transformations pose challenges for traditional CNNs, impeding their ability to effectively capture relevant features. The DCN (Deformable Convolutional Networks) (Dai et al. 2017) is intricately designed to overcome the inherent constraints of conventional methodologies.

Fig. 5 Comparison of traditional convolution and deformable convolution. a Traditional convolution kernel. b Deformable convolution kernel

DCN addresses this limitation by introducing offsets ∆Pn that adapt the convolutional kernels. By incorporating offsets into deformable convolutions, the convolutional kernels gain increased flexibility, enabling them to dynamically adjust their sampling positions. This flexibility enables the network to prioritize areas of interest within the input, effectively handling geometric variations and deformations. The deformable convolution operation is formulated as:

$$y(P_{0}) = \sum_{P_{n} \in R} w(P_{n}) \cdot x(P_{0} + P_{n} + \Delta P_{n})$$
(1)

For a single feature map input, depicted in Fig. 5b, an extra \(3\times 3\) convolutional layer learns the offset. The output dimension matches the original feature map size. Deformable convolution starts with an interpolation operation using the generated offset, followed by standard convolution.

However, deformable convolution may introduce extraneous regions that interfere with feature extraction, resulting in a degradation of algorithm performance. To address this issue, Deformable ConvNets v2 not only includes the offset for each sampling point but also incorporates a modulation weight ∆mn to distinguish whether the introduced region aligns with our area of interest. The DCNv2 operation is formulated as:

$$y(P_{0}) = \sum_{P_{n} \in R} w(P_{n}) \cdot x(P_{0} + P_{n} + \Delta P_{n}) \cdot \Delta m_{n}$$
(2)

The weight coefficient is designed to distinguish between regions that align with the area of interest and those that do not. By incorporating these weight coefficients, DCNv2 can effectively filter out extraneous regions that may interfere with feature extraction, thereby leading to an enhancement in the overall algorithm performance.
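As a concrete illustration, a modulated deformable convolution in the spirit of DCNv2 can be sketched with torchvision's DeformConv2d: an auxiliary 3 × 3 convolution predicts both the offsets ∆Pn and the modulation scalars ∆mn (the module and parameter names here are ours, not the paper's implementation):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNv2Block(nn.Module):
    """Sketch of a modulated deformable convolution (Eq. 2)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        # 3*k*k channels: 2*k*k offsets (dx, dy) + k*k modulation scalars
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, stride, padding)
        self.deform = DeformConv2d(in_ch, out_ch, k, stride, padding)
        nn.init.zeros_(self.offset_mask.weight)  # offsets start at zero,
        nn.init.zeros_(self.offset_mask.bias)    # i.e. a regular sampling grid
        self.k = k

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = torch.split(om, [2 * self.k ** 2, self.k ** 2], dim=1)
        mask = torch.sigmoid(mask)  # modulation weights delta_m in (0, 1)
        return self.deform(x, offset, mask)

feat = torch.randn(1, 64, 80, 80)      # dummy feature map
print(DCNv2Block(64, 64)(feat).shape)  # torch.Size([1, 64, 80, 80])
```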

In summary, the offsets in DCN aim to pinpoint the location of regions containing valid information, while the incorporation of weight coefficients in DCNv2 serves to assign significance to these identified locations. Both mechanisms collectively ensure the precise extraction of valid information. Maize leaves undergo substantial geometric deformation during the rolling process, and there is also a challenge associated with considerable scale differences between leaves at various growth stages. Consequently, the application of Deformable ConvNets v2 proves instrumental in addressing both the deformation and scale disparities inherent in the detection of rolled maize leaves.

Convolutional block attention module

As an attention mechanism, CBAM is intended to amplify the representation capability of convolutional neural networks by concurrently emphasizing both channel-wise and spatial-wise features. In Fig. 6, the CBAM attention module's comprehensive structure is depicted, with the channel attention module focusing on essential features and the spatial attention module attending to their respective positions.

Fig. 6 Overall architecture of CBAM. The module has two sub-modules: channel attention module and spatial attention module

As depicted in Fig. 7a, the input feature map \(\text{F}\) first undergoes average-pooling and max-pooling operations to produce two feature descriptors. These are then concurrently fed into a weight-sharing multilayer perceptron, which reduces and then restores the channel dimension to manage the parameter count. The two resulting vectors are summed element-wise and passed through a sigmoid activation, yielding the channel attention map \({\text{M}}_{\text{c}}\). This map is subsequently multiplied by \(\text{F}\) to derive the refined output \({\text{M}}_{\text{c}}\text{(F)}\).

Fig. 7 Architecture of each attention sub-module. a Channel attention module. b Spatial attention module

The computation for the channel attention module is outlined as follows:

$$M_{c} = \sigma \left( \mathrm{MLP}\left( \mathrm{AvgPool}(F) \right) + \mathrm{MLP}\left( \mathrm{MaxPool}(F) \right) \right)$$
(3)

The spatial attention module takes \({\text{M}}_{\text{c}}\text{(F)}\) as input. Initially, it performs average-pooling and max-pooling along the channel axis, generating two distinct feature maps, which are subsequently concatenated across channels. Following this, a \(7\times 7\) convolutional kernel is employed to create a new feature map, with sigmoid activation applied to generate the spatial attention map \({\text{M}}_{\text{s}}\). Finally, \({\text{M}}_{\text{s}}\) is multiplied by \({\text{M}}_{\text{c}}\text{(F)}\) to yield the resulting output \({\text{M}}_{\text{s}}\text{(F)}\). The computation for the spatial attention module is outlined below:

$$M_{s} = \sigma \left( f^{7 \times 7} \left( \left[ \mathrm{AvgPool}(F); \mathrm{MaxPool}(F) \right] \right) \right)$$
(4)
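A minimal PyTorch sketch of the two sub-modules following Eqs. (3) and (4); the reduction ratio of 16 and the 7 × 7 kernel follow the original CBAM design, while the class names are ours:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(  # weight-sharing MLP with a bottleneck
            nn.Conv2d(ch, ch // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean((2, 3), keepdim=True))   # AvgPool branch
        mx = self.mlp(x.amax((2, 3), keepdim=True))    # MaxPool branch
        return torch.sigmoid(avg + mx) * x             # Eq. (3), then reweight

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = x.mean(1, keepdim=True)                  # channel-wise AvgPool
        mx = x.amax(1, keepdim=True)                   # channel-wise MaxPool
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], 1)))  # Eq. (4)
        return attn * x

class CBAM(nn.Module):
    def __init__(self, ch, reduction=16, k=7):
        super().__init__()
        self.ca = ChannelAttention(ch, reduction)
        self.sa = SpatialAttention(k)

    def forward(self, x):
        return self.sa(self.ca(x))  # channel attention first, then spatial
```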

In summary, CBAM dynamically adjusts feature map weights, enhancing the model's ability to capture vital image features. As a strategic enhancement, we incorporated the CBAM module to extract features effectively and ensure their validity for leaf rolling detection in maize.

Experimental results

Environment of experiment

The experimental setting for this study operates on a Linux server equipped with 100 GB of RAM and a Tesla V100S-PCIE graphics card, featuring Intel® Xeon® Gold 6230R CPUs @ 2.10 GHz. PyTorch serves as the framework for experiments, with the software environment comprising CUDA 11.1, Python 3.8.16, and Torch 1.10.1. During the training phase, we run the network for 150 epochs. We set the input image size to 640 × 640 and the batch size to 16. Utilizing the AdamW optimizer, we set the learning rate at 0.001667, momentum at 0.9, and weight decay at 0.0005.
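For reference, these settings map onto a standard Ultralytics training call roughly as follows (the dataset YAML and model file are placeholders, and LRD-YOLO's modified modules would first need to be registered with the framework):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")  # baseline; LRD-YOLO would load a modified YAML
model.train(
    data="maize_leaf_rolling.yaml",  # hypothetical dataset config
    epochs=150,
    imgsz=640,
    batch=16,
    optimizer="AdamW",
    lr0=0.001667,
    momentum=0.9,
    weight_decay=0.0005,
)
```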

Evaluation metrics

To thoroughly evaluate the proposed model for detecting leaf rolling in maize, we employed several evaluation metrics including FLOPs (floating point operations), precision, FPS (frames per second), recall, mAP (mean Average Precision), and the number of parameters. The following equations are utilized to compute the precision and recall:

$${\text{Recall}} = \frac{{\text{True Positive}}}{{\text{True Positive + False Negative}}}$$
(5)
$${\text{Precision}} = \frac{{\text{True Positive}}}{{\text{True Positive + False Positive}}}$$
(6)

The following equation is employed to compute mAP:

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_{i}$$
(7)

In this equation, N is the number of categories, and \({\text{AP}}_{\text{i}}\) is the average precision for the i-th category. A higher mAP score indicates more accurate detection.

FPS measures the inference speed, which is critical for assessing real-time model performance. FLOPs provide an estimate of the number of floating-point arithmetic operations necessary for a model during inference, while parameters encompass the trainable biases and weights in the neural network.
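A worked illustration of Eqs. (5) to (7), using hypothetical detection counts and per-class AP values rather than the study's measurements:

```python
def precision_recall(tp, fp, fn):
    """Eqs. (6) and (5): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=90, fp=10, fn=20)    # hypothetical counts
print(f"precision={p:.2f}, recall={r:.2f}")     # precision=0.90, recall=0.82

# Eq. (7): mAP averages AP over the N classes; ours are "leaf" and "rolled".
ap = {"leaf": 0.84, "rolled": 0.79}             # illustrative AP values
print(f"mAP={sum(ap.values()) / len(ap):.3f}")  # mAP=0.815
```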

Ablation experiments

To assess the influence of each suggested enhancement of LRD-YOLO for leaf rolling detection in maize, we conducted ablation experiments. The hardware environment and parameter settings remained consistent throughout the ablation experiments.

Ablation experiments of the baseline and the LRD-YOLO

We first evaluate the effectiveness of our LRD-YOLO model against the baseline YOLOv8n model. The latter was trained using the same dataset as the former but lacked the incorporation of DCNv2 and the CBAM.

Table 1 displays the ablation experiment results. The comparison shows that our two enhancements each outperform the YOLOv8n model significantly. By incorporating the CBAM attention, the mAP increases by 2.4% to 78.9%, with only a slight increase of 0.03 M parameters. Upon introducing the DCNv2 module into YOLOv8n, the mAP (80.5%) sees an improvement of 4.0%, and the FLOPs decrease from 8.9 G to 8.0 G. By combining these two improvements, LRD-YOLO significantly improves mAP (81.6%) by 5.1% and decreases the FLOPs from 8.9 G to 8.0 G, with only a marginal increase of 0.32 M in the number of parameters.

Table 1 Ablation experiment of the YOLOv8n model and the LRD-YOLO model

As depicted in Fig. 8, we performed a detailed analysis of the changes in loss values. It is apparent that LRD-YOLO exhibits a quicker reduction in loss compared to YOLOv8n on the validation set. This indicates the effectiveness of our enhancements.

Fig. 8
figure 8

Analysis of the training loss

The results indicate initial support for the effectiveness of improvements to the baseline YOLOv8n in detecting maize leaf rolling under complex environmental conditions.

Ablation experiments of the Deformable ConvNets v2

Next, we execute a more specific ablation analysis to assess the influence of DCNv2 on the performance of the LRD-YOLO. While the C2f component within YOLOv8 facilitates the acquisition of multi-scale features and broadens the scope of receptive fields, it concurrently raises computational demands and parameter counts. Furthermore, it demonstrates a lack of sensitivity to variations in the shape of the leaves. By replacing convolutional layers within the C2f component with DCNv2, we effectively alleviate computational loads and bolster the performance of the baseline model. This enhancement proves especially significant for leaves manifesting notable scale fluctuations across growth phases and for those experiencing alterations in shape due to rolling.

The data in Table 2 highlights the performance contrast across various placements of DCNv2. Clearly, replacing the convolutional layers within the C2f components of both the backbone and the neck with DCNv2 yields enhancements in both mAP and FLOPs reduction. These outcomes emphasize the efficacy of incorporating DCNv2 into the C2f component, consequently amplifying the capability of LRD-YOLO to efficiently tackle the challenges posed by deformation and scale variations in identifying rolled maize leaves.

Table 2 Comparison of adding DCNv2 to different positions

Ablation experiments of the convolutional block attention module

Finally, we examine the impact of CBAM on the efficacy of the LRD-YOLO. We incorporate the CBAM module before detection heads of various sizes to evaluate its effect on our models.

Table 3 displays a comparison of performance across different placements of the CBAM module. Notably, integrating the CBAM module before the small and medium detection heads showcases the most significant enhancement in mAP. This improvement can be attributed to the dataset’s inclusion of small and medium-sized leaves, which are prone to occlusion and overlap. These outcomes validate the effectiveness of applying CBAM attention before the small and medium detection heads in mitigating missed detections of occluded and small targets.

Table 3 Comparison of adding CBAM module to different detection heads

In summary, the outcomes from all ablation experiments affirm that the integration of both DCNv2 and the CBAM module into the LRD-YOLO significantly enhances the accuracy of leaf rolling detection in maize, especially under challenging environmental conditions.

Comparison with state-of-the-art detection methods

Comparison of performance

We conducted a comprehensive performance evaluation on the test set, comparing the LRD-YOLO model with six advanced methods: Faster R-CNN (Ren et al. 2017), SSD (Liu et al. 2016), YOLOv5n (Jocher 2020), YOLOv6n (Li et al. 2022), YOLOv7-Tiny (Wang et al. 2022), and Real-Time Detection Transformer (RT-DETR) (Zhao et al. 2023). All experiments were executed on an NVIDIA Tesla V100S GPU, maintaining a consistent software environment. The performance analysis of these methods is presented in Table 4.

Table 4 Performance comparison of LRD-YOLO with other detection methods

SSD and Faster R-CNN face challenges in achieving a harmonious balance between detection accuracy and inference speed. Burdened by an excess of parameters and arithmetic operations, Faster R-CNN exhibits a low inference speed of only 17.1 FPS. Conversely, while the SSD model showcases a reasonable speed of 48.6 FPS, its diminished precision makes it unsuitable for real-time tasks.

The YOLO methods, particularly adept at leaf rolling detection in maize, reveal distinctive performance characteristics. YOLOv5n stands out with the lowest FLOPs and Params, recorded at 4.2 G and 1.8 M, respectively, while YOLOv7-Tiny boasts the highest FPS at 76.3. Nevertheless, the detection precision, recall, and mAP metrics of YOLOv5n, YOLOv6n, and YOLOv7-Tiny do not align proportionately with their impressive inference speeds.

The Real-Time Detection Transformer (RT-DETR), an advanced end-to-end object detector devised by Baidu, stands out for its exceptional accuracy while maintaining real-time performance capabilities. RT-DETR exhibits outstanding performance on our dataset, achieving an impressive mAP of 79.5% and precision of 83.3%, surpassing other models within the YOLO series, all while sustaining a speed of 31.1 FPS.

Our proposed model, LRD-YOLO, emerges as the frontrunner with the highest mAP of 81.6%. Notably, its detection accuracy surpasses that of RT-DETR while achieving an improved speed of 56.0 FPS and requiring only 8.0 G FLOPs and 3.5 M parameters. These results underscore that our LRD-YOLO model is the optimal choice for leaf rolling detection in maize, successfully balancing speed and accuracy in this domain.

Comparison of detection results

To further assess the efficacy of these methods, we carried out experiments to compare the actual effectiveness of seven object detection methods for leaf rolling detection. The results are illustrated in Fig. 9.

Fig. 9
figure 9

Comparison of the detection results. The arrow points to the incorrect results, and the yellow box represents the missing target

As depicted in the figures, leaves marked by a yellow box or arrow exhibit varying degrees of occlusion and overlap, leading to false or missed detections for all models except the LRD-YOLO model. Faster R-CNN exhibits missed detections when leaves overlap and occlude each other. Conversely, SSD is more prone to generating redundant detection boxes in dense scenarios. YOLOv5n incorrectly classified the rolled leaves in Fig. 9d as normal leaves, while both YOLOv6n and YOLOv7-Tiny displayed identical missed detections where leaves were obscured or overlapped. RT-DETR showcased high accuracy in both images, with only one missed detection.

Only the LRD-YOLO model accurately predicted the position and quantity of the rolled leaves. These findings suggest that LRD-YOLO successfully addresses the challenge of detecting leaf rolling in maize under complex environmental conditions.

In summary, the comparison of performance and detection results further underscores the effectiveness of LRD-YOLO for leaf rolling detection in maize under intricate environmental conditions.

Robustness in adverse weather conditions

Although object detection methods have shown encouraging outcomes when applied to high-quality datasets, the ongoing challenge lies in precisely localizing objects within low-quality images taken in adverse weather conditions (Liu et al. 2022). To assess the robustness of LRD-YOLO, we conducted experiments comparing its effectiveness to the baseline model in leaf rolling detection under adverse weather conditions.

As depicted in Fig. 10, we applied data augmentation techniques to include more severe conditions such as bright light, rain, and fog in our test sets. Moreover, we simulated scenarios where water droplets obscure the lens during rainy conditions, as well as instances of mud splattering caused by windy weather.

Fig. 10 Data augmentation for severe weather conditions
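One way to generate such corruptions, sketched here with the albumentations library (the transform choices and magnitudes are illustrative, not the study's settings; these photometric corruptions leave the geometry, and hence the YOLO labels, unchanged):

```python
import cv2
import albumentations as A

weather = {
    "bright": A.RandomBrightnessContrast(brightness_limit=(0.4, 0.6), p=1.0),
    "rain": A.RandomRain(blur_value=3, brightness_coefficient=0.8, p=1.0),
    "fog": A.RandomFog(fog_coef_lower=0.4, fog_coef_upper=0.7, p=1.0),
    "spatter": A.Spatter(mode="mud", p=1.0),  # mud splatter on the lens
}

img = cv2.imread("maize.jpg")  # placeholder path
corrupted = {name: t(image=img)["image"] for name, t in weather.items()}
```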

The detection results of LRD-YOLO and YOLOv8n are illustrated in Fig. 11. While LRD-YOLO demonstrates robust performance under rainy conditions, it occasionally experiences false positives and misses in foggy and bright light environments when lens-obscuring water droplets are present. In comparison, the YOLOv8n model shows significant issues with false positives and misses across all adverse environments tested. These findings highlight LRD-YOLO's effectiveness in enhancing the baseline method's resilience to adverse weather conditions, significantly improving object detection accuracy in challenging environments.

Fig. 11 Comparison of the detection results in adverse weather conditions. The arrow points to the incorrect results, and the yellow box represents the missing targets

In addition to applying the aforementioned data augmentation methods to our test set, we extended these techniques to our training and validation sets, resulting in a training set comprising 4088 images and a validation set of 490 images. Based on this augmentation, we trained the LRD-WEATHER model, which is specifically designed to excel in severe weather conditions while maintaining high detection accuracy.

The performance of YOLOv8n, LRD-YOLO, and LRD-WEATHER on the test set is detailed in Table 5; the bolded entries highlight the model with the highest score under the corresponding weather conditions. As shown, LRD-YOLO consistently improves mAP under mild weather conditions by 2.9%, 2.3%, 3.2%, and 5.0% over YOLOv8n, respectively, while maintaining high accuracy. However, in more extreme scenarios, both YOLOv8n and LRD-YOLO exhibit significant performance degradation, with the mAP metric dropping below 50% under Spatter_Severe conditions. In contrast, owing to robust data augmentation during training and validation, the LRD-WEATHER model maintains over 75% accuracy and mAP under severe weather conditions, showcasing its superior detection performance in challenging environments.

Table 5 Performance comparison of YOLOv8n, LRD-YOLO and LRD-WEATHER

These results underscore the effectiveness of LRD-YOLO and LRD-WEATHER in enhancing the robustness of the baseline method against adverse weather conditions. They demonstrate the significant advancements our model brings to achieving precise object detection in challenging environmental contexts.

Discussion

Visualization of the detection results

To further underscore the efficacy of our improvements to the baseline model, we performed a detailed analysis of the results obtained by LRD-YOLO and the YOLOv8n model for maize leaf rolling detection. For this analysis, we utilized Grad-CAM (Selvaraju et al. 2020) visualization as a tool. Grad-CAM is designed to visualize the distinct contributions of various regions within a deep neural network to the prediction results. This method aids in pinpointing significant areas within images. Figure 12 presents a random selection of examples illustrating Grad-CAM visualizations generated by both LRD-YOLO and YOLOv8n on the test set. The Grad-CAM visualization provides valuable insights into the model’s attention focus during leaf rolling detection in maize.
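A hedged sketch of producing such heatmaps with the pytorch-grad-cam package; `detector` and `last_stage` are placeholders, and applying CAM methods to a detection model usually requires a small wrapper so its output can be reduced to a scalar score:

```python
import numpy as np
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

# `detector`: a loaded nn.Module in eval mode; `last_stage`: a chosen
# backbone/neck layer whose activations we want to explain.
cam = GradCAM(model=detector, target_layers=[last_stage])

image = torch.rand(1, 3, 640, 640)    # stand-in for a preprocessed image
heatmap = cam(input_tensor=image)[0]  # (H, W) attention map in [0, 1]
overlay = show_cam_on_image(
    image[0].permute(1, 2, 0).numpy().astype(np.float32),  # RGB in [0, 1]
    heatmap,
    use_rgb=True,
)
```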

Fig. 12 Grad-CAM visualization of LRD-YOLO and YOLOv8n

Upon careful examination of the Grad-CAM visualizations, our model exhibits a notable ability to concentrate on the specific area of the maize leaf where rolling occurs. For uncurled leaves, the model also maintains focus. The introduction of DCNv2 significantly enhances the model’s proficiency in detecting leaves with diverse scale sizes and shape variations. In contrast, Grad-CAM visualizations from the YOLOv8n model display less precision, often extending to regions outside of the leaves. Remarkably, Grad-CAM visualizations from LRD-YOLO are characterized by increased focus and accuracy, capturing the key features of the leaves with precision. This underscores the excellent contribution of the CBAM module to our model. These findings highlight the effectiveness of our LRD-YOLO in improving the performance of the baseline YOLOv8n for detecting leaf rolling in maize. The LRD-YOLO model showcases an improved ability to navigate the complexities of the surrounding environment, ensuring robust performance even in the presence of interfering factors. The application of the Grad-CAM visualization technique further highlights the LRD-YOLO model’s enhanced focus on the key characteristics of maize leaves.

Lightweight improvement of the LRD-YOLO model

The integration of DCNv2 and CBAM significantly enhances the model's feature extraction and adaptability to shape and scale variations, but it also increases the complexity of the YOLOv8 model. These factors can pose challenges, particularly in resource-limited settings such as small farms or remote areas without advanced computing infrastructure. Model complexity is as important a metric as accuracy, and while LRD-YOLO excels in accuracy, there is still room to reduce its complexity.

We have taken steps to address the model's complexity and computational requirements. Specifically, we have employed a channel pruning algorithm based on Layer-Adaptive Magnitude-based Pruning (LAMP) (Lee et al. 2020) to optimize the LRD-YOLO model. This approach aims to reduce network complexity by eliminating less critical channels, thereby improving computational efficiency. Detailed experimental results demonstrating the effectiveness of this optimization are provided in the following table.
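The LAMP score itself is simple to state: each weight's squared magnitude is normalised by the cumulative squared magnitude of all weights in the same layer that are at least as large. A per-weight sketch follows (LAMP is usually formulated for unstructured pruning, and the 50% sparsity here is illustrative, not the study's setting):

```python
import torch

def lamp_scores(weight):
    """LAMP score (Lee et al. 2020): w_i^2 normalised by the cumulative
    squared magnitude of weights in the layer at least as large as w_i."""
    sq = weight.detach().flatten() ** 2
    sorted_sq, order = torch.sort(sq, descending=True)
    denom = torch.cumsum(sorted_sq, dim=0)  # sum over weights >= current
    scores = torch.empty_like(sq)
    scores[order] = sorted_sq / denom
    return scores.view_as(weight)

conv = torch.nn.Conv2d(64, 64, 3)
scores = lamp_scores(conv.weight)
# prune the lowest-scoring 50% of connections in this layer
threshold = scores.flatten().kthvalue(int(0.5 * scores.numel())).values
conv.weight.data *= (scores > threshold).float()
```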

As illustrated in Table 6, the pruned model demonstrates significant improvements over the original LRD-YOLO in terms of parameter reduction by 77.8%, 50% fewer FLOPs, and a 9% increase in inference speed. Importantly, despite these reductions, the pruned model maintains a marginal decrease of only 2.1% in mAP compared to the original, still surpassing the baseline YOLOv8n by 3%. This underscores the efficacy of our pruning strategy in balancing model complexity with performance.

Table 6 Results of the pruning experiment

Figure 13 visually represents the impact of our pruning approach on the convolutional layers, showcasing a substantial reduction in channel counts. This reduction signifies the successful optimization of model complexity, enhancing its suitability for resource constrained environments such as small farms and remote areas.

Fig. 13 Channel comparison between the base and pruned models

While our pruning efforts have significantly reduced the complexity of the model, we recognize that further improvements in inference speed are necessary. To address this challenge, we have explored alternative lightweight backbone networks as replacements for the original backbone in the LRD-YOLO model. Specifically, we evaluated MobileNetV3 (Howard et al. 2019), ShuffleNetV2 (Ma et al. 2018), and VanillaNet (Chen et al. 2023) with different layer configurations.

The experimental results presented in Table 7 highlight VanillaNet-9 as particularly promising, achieving a remarkable 52.5% improvement in inference speed compared to the original LRD-YOLO model. Although its accuracy is lower than that of LRD-YOLO, it remains slightly higher than the baseline model's, and its inference speed also improves on the baseline. This enhancement is achieved while maintaining low model complexity, demonstrating superior performance among the tested backbone networks.

Table 7 Results of the backbone network experiment

Compared to other models in the YOLOv8 family (s, m, l), the YOLOv8n model stands out as the most lightweight variant. While the LRD-YOLO model introduces a slight increase in complexity compared to YOLOv8n, it remains a relatively lightweight solution suitable for a wide range of application scenarios.

Particularly for resource-constrained environments such as small farms or remote areas, the pruned LRD-YOLO model offers a practical and efficient solution. For scenarios demanding higher inference speeds, we have explored enhancing the LRD-YOLO model by integrating lightweight backbone networks like VanillaNet-9.

These optimizations directly address the concerns raised regarding computational demands and suitability for real-world agricultural applications. By significantly reducing model complexity while maintaining competitive performance metrics, our approach ensures that the pruned LRD-YOLO model is well-equipped for practical deployment across varied agricultural settings.

Limitations

Our study's dataset, although diverse, may not be sufficiently large to capture all variations in leaf rolling across different maize varieties and environmental conditions. Advanced data augmentation methods could help enhance the dataset's diversity and richness, so we employed a comprehensive suite of seven methods, as illustrated in Fig. 14. These methods encompassed random cropping, cutout, brightness adjustment, flipping, noise addition, rotation, and shift.

Fig. 14 Examples of data augmentation
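These seven augmentations can be composed, for instance, with albumentations, using YOLO-format bounding-box handling so the labels are transformed alongside the images (probabilities and magnitudes are illustrative, not the study's settings):

```python
import albumentations as A

augment = A.Compose(
    [
        A.RandomCrop(height=576, width=576, p=0.3),
        A.CoarseDropout(max_holes=8, max_height=32, max_width=32, p=0.3),  # cutout
        A.RandomBrightnessContrast(p=0.3),   # brightness adjustment
        A.HorizontalFlip(p=0.5),             # flipping
        A.GaussNoise(p=0.3),                 # noise addition
        A.Rotate(limit=15, p=0.3),           # rotation
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0,
                           rotate_limit=0, p=0.3),  # shift
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
# usage: augment(image=img, bboxes=boxes, class_labels=labels)
```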

The performance of LRD-YOLO after data augmentation is shown in Table 8.

Table 8 Results of data augmentation

The rolling of maize leaves is a process that spans from mild to severe, manifesting phenotypic variations at different degrees of rolling. Although our model can successfully accomplish the binary classification task of detecting rolled maize leaves, its efficacy is limited by the size of the dataset, impeding a comprehensive detection of the entire rolling process. Finer-grained classification reduces the number of instances within each class, which makes it difficult to train the model properly. Data augmentation alone cannot fundamentally address the shortage of instances within each class and often leads to overfitting.

Moreover, the model requires a substantial number of images to discern subtle differences in rolling degree between classes, a requirement not currently met by our dataset. In future work, we intend to establish a larger-scale dataset to delve deeper into the phenotypic characteristics of rolled maize leaves. The imbalance across the various stages of leaf rolling in our dataset is also a critical issue that will require careful consideration as we expand it. Future work will endeavor to cover leaf rolling caused by changes in soil type, climatic conditions, and biotic stresses (e.g., pests and diseases) wherever possible. Our objective is to enhance the depth of the study and ultimately apply our research to field conditions.

Conclusion

We propose the LRD-YOLO model, an innovative approach for leaf rolling detection in maize that focuses on achieving high accuracy without compromising real-time inference speed. To initiate the study, a new leaf rolling dataset was meticulously collected, encompassing the challenges inherent in this task, such as severe occlusion, changes in leaf scale and shape, and complex background scenarios. The principal contributions of our approach involve integrating the CBAM mechanism into the YOLOv8 architecture, which enhances feature extraction capability and feature validity, thereby improving detection accuracy in occluded scenes and complex environments. Additionally, we introduce DCNv2 to better adapt to changes in target shape and scale. Our experimental findings underscore the role of LRD-YOLO in significantly improving detection accuracy for leaf rolling in maize, surpassing existing methods while maintaining real-time inference capability.