1 Introduction

Plant diseases currently account for more than 60% of yield losses in agriculture. To identify them, traditional agricultural management relies on plant pathology techniques, namely chemical analysis of the infected area of the plant and physical methods such as spectroscopy. These techniques require the intervention of qualified personnel and can be very expensive, so developing automated, low-cost plant disease detection methods has become imperative. Most automated methods proposed in the literature are based on computer vision and artificial intelligence; indeed, software-based products can perform vision-based tasks with greater accuracy than human experts. The objective of automatic plant disease detection is to diagnose disease rapidly and accurately: images of plants (typically plant leaves) are captured and analyzed, and a diagnosis is produced. However, vision-based plant disease detection algorithms face many challenges, such as disease complexity, lighting variations, low resolution of the acquired images, and the lack of labeled data needed to train accurate machine/deep learning models.

This article evaluates the effect of super-resolved plant leaf images on the early, automatic, and precise detection of potato late blight disease (caused by Phytophthora infestans). Increasing image resolution is an important processing step that yields more precise and reliable data, with benefits in vegetation/urban detection, precision agriculture, land use/land cover mapping, change detection, and many other computer vision applications. The goal of super-resolution techniques is to generate a high-resolution (HR) image from one or more low-resolution (LR) images. In this work, a new method is developed to accurately detect disease on plant leaves: a super-resolved plant leaf image dataset is built by training the EDSR model on a dataset of healthy/diseased plant leaves, and three recent object detection models are then trained on both the standard and super-resolved datasets to detect late blight lesions on diseased leaf images.

2 Related works

Plant leaf disease detection from images uses both traditional image processing techniques and more advanced ones. Traditional image processing techniques analyze visually observable patterns on the plant. Segmentation-based methods identify differences in the color, texture, and shape of the affected area by dividing the leaf image into regions and analyzing each region for signs of disease. These methods can also be used to identify the type of disease present and to monitor its progression over time. Dey et al. [1] describe the process of capturing leaf images, pre-processing them to remove noise and enhance contrast, and then applying Otsu thresholding-based segmentation to detect leaf rot disease in betel vine leaves. Feature extraction methods extract various features from plant leaf images; these features can then be fed to classification algorithms to label a plant as healthy or diseased. The authors in [2] select the most relevant features using fuzzy feature selection techniques (fuzzy curves and fuzzy surfaces) to improve classification accuracy.

Advanced techniques mainly use deep learning architectures. Neural network models can be trained on a large dataset of plant leaf images to learn the features that differentiate healthy and diseased plants; the trained model can then make predictions on new leaf images. Figure 1 shows the steps required for automatic identification of plant diseases using machine/deep learning approaches.

Fig. 1

Automatic identification of plant disease

A summary of the most recent approaches proposed for automatic plant disease classification and detection is given in Table 1.

Table 1 Summary of most recent proposed approaches for automatic plant disease classification and detection

From Table 1, we can observe that deep learning-based models outperform segmentation, feature extraction, and classification techniques. Segmentation-based techniques have lower computational complexity but do not account for the low contrast and poor illumination of acquired images, and therefore rarely achieve good plant disease detection results.

In [3, 5, 7], the authors used SVMs to classify leaf images as healthy or infected with various types of diseases; SVMs are generally applied after one or more feature extraction steps. However, SVMs require hand-crafted features, which can be time-consuming to design and limit the model's ability to learn complex features. SVMs have also been combined with deep learning models to recognize four types of rice diseases [11]; leveraging the strengths of both approaches allowed the authors to achieve high accuracy.

Deep learning, and specifically Convolutional Neural Network (CNN) architectures, is commonly used for plant disease classification. Ferentinos et al. [8] fine-tuned five CNN architectures (AlexNet, GoogLeNet, VGG-16, ResNet50, InceptionV3) on a new plant disease dataset, compared their accuracy, precision, and recall, and found that VGG-16 achieved the highest accuracy. When fine-tuning a pretrained model, its weights are updated during training on the new dataset, so the new model benefits from the feature extraction capabilities of the pretrained one; fine-tuning has been shown to improve accuracy and reduce the time and resources required for training. Ashwinkumar et al. [15] proposed a modified CNN-based model and introduced a new optimization technique, the emperor penguin optimizer (EPO), to prevent overfitting and improve the generalization performance of the model; their method outperforms other CNN-based models in both accuracy and computational efficiency. Zhang et al. [14] improved a CNN network by adding two generative network models and used a modified Vision Transformer (ViT) as a branch network to improve the CNN's ability to capture global features; the model has a low inference speed but achieves high precision. Borhani et al. [16] compared a simplified ViT model, a CNN-based model, and a hybrid model combining CNN and ViT; the ViT-based model achieved the highest accuracy, while combining attention blocks with convolutional blocks was attempted to improve prediction speed. Recently, Liu et al. [17] developed a deep neural network based on an automatic pruning mechanism that enables high-accuracy plant disease detection even under limited computational power. Sarawast et al. [18] proposed a Modified-MDNN algorithm using DSURF (Dynamic-SURF) features and presented a comparative analysis of DNN and MDNN performance.

It is important to note that the performance of all the proposed machine learning methods depends on the diversity of the dataset used for training and testing. The dataset should be diverse and representative of the diseases to be recognized. Increasing the diversity of the training dataset is necessary, especially when large datasets are not available, and this is typically done through data augmentation. Shorten et al. [19] survey the data augmentation techniques proposed in the literature, covering existing methods (color space augmentations, geometric transformations, kernel filters, image mixing, random erasing, feature space augmentation, noise injection, adversarial training, neural style transfer), promising developments, and meta-level decisions for implementing data augmentation. A minimal sketch of common geometric and color space augmentations is given below.
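The following sketch illustrates geometric and color space transformations with torchvision; the specific transforms and parameter values are illustrative choices, not those of any cited work.

```python
# Illustrative geometric and color space augmentations with torchvision.
# Transform choices and parameter values are example assumptions.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # geometric
    transforms.RandomRotation(degrees=30),             # geometric
    transforms.ColorJitter(brightness=0.2,
                           saturation=0.2, hue=0.05),  # color space
    transforms.ToTensor(),
])
# augmented = augment(pil_leaf_image)  # PIL image -> augmented tensor
```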

3 Methodology

In this section, we describe the proposed plant disease detection methodology. We first super-resolve the dataset using the enhanced deep super-resolution network (EDSR) architecture and then perform detection and recognition using the Faster R-CNN, Yolo V8, and Detr models. Short descriptions of the super-resolution concept and of the EDSR, Faster R-CNN, Yolo V8, and Detr architectures are given below.

3.1 Super-resolution

Super-resolution methods use information from several images (multi-image super-resolution, MISR) or from a single image (single-image super-resolution, SISR) to create one upscaled image. Three categories of SISR approaches can be distinguished: reconstruction-based approaches, which use a priori information to recover the HR image; interpolation-based approaches, which reconstruct the HR image by interpolating missing pixels from existing ones; and learning-based approaches, which use dictionary pairs of training and testing images to predict the HR image [20]. Recent developments in deep learning have provided many tools for the SISR problem: architectures such as CNNs [21], residual CNNs [22], GANs [23], and DNNs [24] have been proposed to reconstruct a high-resolution image from a low-resolution one. Multi-image super-resolution approaches aim to reconstruct high-resolution details from multiple low-resolution images of the same scene (acquired by the same or different sensors). Deep learning-based MISR methods have achieved satisfactory results [25, 26] but depend on the fidelity of the training datasets. Optimization-based iterative MISR methods solve the problem in the spatial or frequency domain [27]. In the frequency domain, multiple images with subpixel displacements are combined to enhance resolution. Because a priori knowledge is difficult to use as regularization in that domain, many spatial-domain MISR techniques were proposed, including maximum likelihood (ML), maximum a posteriori (MAP), and projection onto convex sets (POCS) [28, 29].

3.2 Enhanced deep super-resolution network (EDSR)

Particular attention is given to SISR learning methods. Recent comparative studies [30, 31] identified the enhanced deep super-resolution network (EDSR) proposed by Lim et al. [32] and the enhanced super-resolution generative adversarial network (ESRGAN) proposed in [33] as competitive approaches for the super-resolution task. While the ESRGAN architecture is well known for reconstructing fine details and textures and enhancing perceptual quality, it may introduce artificial details, which is undesirable in domains where precision and accuracy are imperative. Furthermore, ESRGAN is a GAN-based model that requires a careful training process, a significant challenge when computational resources are limited. In contrast, the EDSR training process is simple, requiring only a single loss function to optimize. The EDSR architecture is relatively intuitive and is an optimization of the super-resolution residual network (SRResNet) architecture [34], with unnecessary modules removed to simplify the network. The authors in [32] found that removing batch normalization saves 40% of memory usage during training compared to SRResNet. Moreover, EDSR enhances image resolution while maintaining high fidelity to the original image (better PSNR and SSIM values than ESRGAN). This is crucial in the plant disease identification context, where accurate representation of vein patterns, disease lesions, and edges is essential. A minimal upscaling sketch with a pretrained EDSR model is given below.
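This sketch uses the pretrained EDSR weights exposed through OpenCV's dnn_superres module (opencv-contrib-python); the weights file name and image paths are assumptions, and this is not necessarily the exact setup used in this work.

```python
# Hedged sketch: 4x super-resolution with pretrained EDSR weights via
# OpenCV's dnn_superres module. "EDSR_x4.pb" is an assumed local file.
import cv2

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")           # pretrained EDSR weights
sr.setModel("edsr", 4)               # model name, upscale factor

lr = cv2.imread("leaf_256x256.png")  # hypothetical 256x256 leaf image
hr = sr.upsample(lr)                 # -> 1024x1024 super-resolved image
cv2.imwrite("leaf_1024x1024.png", hr)
```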

3.3 Region-based convolutional neural network (R-CNN)

Region with CNN features (R-CNN) is a popular object detection and localization framework [35]. The R-CNN architecture involves three modules: the first generates category-independent region proposals; the second extracts features from each proposed region using a pretrained CNN; and the last performs object classification and bounding box regression. R-CNN suffers from several limitations, such as slow inference, high memory consumption, and multistage training, which motivated optimized versions such as Fast R-CNN [36], Faster R-CNN [37], and Mask R-CNN [38]. Fast R-CNN uses a single CNN to extract features from the entire image and then generates region proposals from the feature map; it consists of a single stage, compared to three stages in R-CNN. Fast R-CNN samples multiple RoIs from the same image using a new layer called RoI pooling, which extracts equal-length feature vectors from all proposals, making the detector faster and more efficient. Faster R-CNN extends Fast R-CNN; its architecture consists of two modules (a minimal torchvision loading sketch is given after the list):

  • Region Proposal Network (RPN) is built for generating region proposals.

  • Fast R-CNN is used for detecting objects in the proposed region
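The sketch below loads a COCO-pretrained Faster R-CNN with a ResNet-50 FPN backbone from torchvision and runs it on a dummy image; the confidence threshold is an illustrative value.

```python
# Minimal Faster R-CNN inference sketch with torchvision (assumption:
# torchvision >= 0.13 for the "weights" argument).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained
model.eval()

image = torch.rand(3, 1024, 1024)    # dummy RGB image, values in [0, 1]
with torch.no_grad():
    out = model([image])[0]          # dict: 'boxes', 'labels', 'scores'
keep = out["scores"] > 0.5           # example confidence threshold
print(out["boxes"][keep])
```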

3.4 Detection transformer model (Detr)

Detr combines a common CNN with a transformer architecture to produce the final set of detections directly. The authors in [39] cast object detection as a set prediction problem and use transformers with a bipartite matching loss. The Detr object detection model processes input images with self-attention transformers and directly produces a fixed-size set of detections, without region proposal networks (RPNs) or non-maximum suppression. It is important to note that the transformer architecture used in Detr is computationally intensive, and training large Detr models consumes a large amount of memory.
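As a minimal illustration, the reference Detr with a ResNet-50 backbone can be loaded from torch.hub as below; the variant trained in this work uses a ResNet-101 backbone, so this sketch only shows the prediction interface.

```python
# Hedged Detr inference sketch via torch.hub (Facebook Research's public
# DETR repo). The ResNet-50 variant here is for illustration only.
import torch

model = torch.hub.load("facebookresearch/detr", "detr_resnet50",
                       pretrained=True)
model.eval()

img = torch.rand(1, 3, 800, 800)     # dummy batch (normalize in practice)
with torch.no_grad():
    out = model(img)                 # 'pred_logits' and 'pred_boxes'
probs = out["pred_logits"].softmax(-1)[0, :, :-1]  # drop "no object" class
keep = probs.max(-1).values > 0.7    # example confidence threshold
print(out["pred_boxes"][0, keep])    # normalized (cx, cy, w, h) boxes
```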

3.5 You only look once model (Yolo)

The Yolo detection model [40] addresses object detection as a single-stage, end-to-end process. It uses a single deep convolutional neural network to detect objects in the image. The input image is divided into a grid of cells, whose size is determined by the architecture of the YOLO model. Yolo then predicts, for each grid cell, bounding boxes containing the objects found there. The bounding box values (the (x, y) coordinates of the box center, width w, and height h) are relative to the cell size. A confidence score is attached to each bounding box to indicate the presence of an object, and a class probability is assigned using softmax activation; only the most confident bounding boxes with high class probability are kept. The original model has 24 convolutional layers, 4 max-pooling layers, and 2 fully connected layers; 1×1 convolutions followed by 3×3 convolutions are used to reduce the number of channels. Yolo predicts multiple bounding boxes per grid cell and filters them based on confidence score and overlap with the anchor boxes. Several versions of Yolo exist, the most recent being Yolo V8 [41]. The Yolo V8 architecture focuses on performance and speed and achieves state-of-the-art results on a number of benchmarks. Yolo V8 predicts the center of an object directly instead of predicting offsets from predefined anchor boxes; this anchor-free detection reduces the number of box predictions. Yolo V8 uses mosaic augmentation (stitching four images together) and augments images online during training, so at each epoch the model sees a slightly different variation of the provided images.
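A minimal Yolo V8 inference sketch with the ultralytics package follows; the model file and image path are illustrative.

```python
# Minimal Yolo V8 inference sketch (ultralytics package); model file and
# image path are example assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # pretrained nano variant
results = model("leaf.jpg", conf=0.5)   # single-image prediction
for r in results:
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)  # boxes, scores, classes
```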

3.6 Proposed method

The proposed method was tested on potato plant leaves to detect late blight disease. The detection models were trained on plant leaf images taken from the popular PlantVillage dataset (https://www.kaggle.com/datasets/emmarex/plantdisease). Figure 2 displays examples of potato leaf images: Fig. 2(a) shows a healthy potato leaf, while Fig. 2(b), (c), and (d) show various forms of late blight lesions on potato leaves.

Fig. 2

Healthy and diseased potato plant leaves images. (a) Healthy potato plant leaf, (b), (c) and (d) infected potato plant leaves

To diversify and enlarge the training set, the potato leaf (healthy and diseased) dataset was first augmented using geometric and color space transformations. A pretrained EDSR model was then used, via transfer learning, to upscale the images by a factor of 4×. The effect of super-resolution on leaf disease detection is evaluated by training three object detection models with transfer learning on the two datasets: the standard dataset (SD) and the super-resolved dataset (SRD). Training the Faster-RCNN, Detr, and Yolo V8 object detection models typically requires input images of a specific size for optimal processing. In this work, the standard dataset contains potato leaf images (healthy and diseased) of size 256 × 256, and the super-resolved dataset contains the same images upscaled to 1024 × 1024. Both datasets are then split into training (70%), testing (20%), and validation (10%) sets, as sketched below. The block diagram of the proposed method is depicted in Fig. 3.
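A minimal sketch of the 70/20/10 split described above; the folder name, file pattern, and random seed are assumptions.

```python
# Illustrative 70%/20%/10% train/test/validation split of an image list.
# Folder name, glob pattern, and random seed are example assumptions.
import random
from pathlib import Path

files = sorted(Path("SRD").glob("*.png"))   # hypothetical dataset folder
random.seed(42)
random.shuffle(files)

n = len(files)
n_train, n_test = int(0.7 * n), int(0.2 * n)
train = files[:n_train]
test = files[n_train:n_train + n_test]
val = files[n_train + n_test:]              # remaining ~10%
```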

Fig. 3

Block diagram of the proposed method

4 Experiments and results

4.1 Hyperparameters and setting

All training experiments were run on a machine with a 10 GB RTX 3080 GPU, a 10th-generation i7 CPU, and 32 GB of RAM. The Faster R-CNN, Detr, and Yolo V8 training processes were implemented in the PyTorch framework. The learning rate is set to 0.0001 and the batch size to 8, with the Adam optimizer; smooth L1 loss is used for bounding box regression. In this work, Faster R-CNN uses ResNet50 as the backbone network [39], with a region proposal network consisting of convolutional layers followed by fully connected layers that takes the backbone feature map as input. Detr uses the ResNet-101 architecture as the backbone [42] and incorporates deformable convolution networks (DCN). Yolo V8 uses CSPDarkNet as the backbone [43]. The SD and SRD datasets contain 918 images before data augmentation and 2204 images after augmentation. A hedged sketch of the corresponding Yolo V8 training call is given below.
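For reference, this sketch shows a Yolo V8 training call matching the hyperparameters above; the dataset YAML file is a hypothetical name, and the Faster R-CNN and Detr training loops are configured analogously in PyTorch.

```python
# Hedged Yolo V8 training sketch matching the hyperparameters above.
# "potato_blight.yaml" is a hypothetical dataset config (train/val paths).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="potato_blight.yaml",
    epochs=50,          # as used in Sect. 4.3
    batch=8,
    lr0=1e-4,           # initial learning rate
    optimizer="Adam",
)
```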

4.2 Evaluation metrics

The quality of the super-resolution results is generally assessed using the PSNR and SSIM metrics, which describe the similarity between the predicted (upscaled) image and the ground truth image.

  • Peak signal to noise ratio (PSNR)

The PSNR is calculated from the mean squared error (MSE) of the pixels and the maximum possible pixel value (\(MAX_{I}\)) as follows [44]:

$$PSNR\left(x,y\right)=10\cdot\log_{10}\left(\frac{MAX_{I}^{2}}{MSE\left(x,y\right)}\right)$$
  • Structural Similarity Index Measure (SSIM)

The SSIM measures the similarity between two images. SSIM was designed to improve on the PSNR metric, which has proven inconsistent with human visual perception.

$$SSIM=\frac{\left(2\,\overline{x}\,\overline{y}+C_{1}\right)\left(2\,\sigma_{xy}+C_{2}\right)}{\left(\overline{x}^{2}+\overline{y}^{2}+C_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2}\right)}$$

where \(\overline{x}\) and \(\overline{y}\) are the means of x and y, \({\sigma }_{x}^{2}\) and \({\sigma }_{y}^{2}\) are the variances of x and y, and \({\sigma }_{xy}\) is the covariance of x and y. \({C}_{1}={({k}_{1}L)}^{2}\) and \({C}_{2}={({k}_{2}L)}^{2}\) are two variables used to stabilize the division when the denominator is weak, and L represents the dynamic range of the pixel values. Common values for k1 and k2 are 0.01 and 0.03, respectively [45].
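Both metrics can be computed with scikit-image as in the sketch below; the file names are illustrative.

```python
# Computing PSNR and SSIM between a ground truth and a super-resolved
# image with scikit-image; file names are example assumptions.
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = cv2.imread("leaf_hr_groundtruth.png")
sr = cv2.imread("leaf_sr_edsr.png")

print("PSNR:", peak_signal_noise_ratio(gt, sr))
print("SSIM:", structural_similarity(gt, sr, channel_axis=-1))
```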

The evaluation of the object detection results, in our case late blight disease detection, is done by calculating several metrics [46]. The detection results are compared to manually labeled data (ground truth annotations) to determine how accurately the detection method performed.

  • Precision/Recall

Precision measures the proportion of correctly predicted bounding boxes among all predicted bounding boxes.

$$Precision=\frac{TP}{TP+FP}$$

Recall measures the proportion of correctly predicted bounding boxes among all ground truth bounding boxes.

$$Recall=\frac{TP}{TP+FN}$$
  • mean Average Precision (mAP50-95/mAP 50)

mAP 50–95 refers to the mAP computed over a range of Intersection over Union (IoU) thresholds from 0.5 to 0.95. IoU measures the overlap between the predicted and ground truth bounding boxes. To compute mAP 50–95, precision and recall values are calculated at each IoU threshold within the range.

mAP50 refers to the mAP at an IoU threshold of 0.5. mAP50 is used as a standard benchmark to evaluate the performance of object detection models; it measures how well the model detects objects with a reasonable level of overlap with the ground truth.

$$IoU=\frac{Area\;of\;Overlap}{Area\;of\;Union}$$

AP is calculated across a set of IoU thresholds for each class k, and the mAP is the average of all AP values:

$$mAP=\frac{1}{n}\sum_{k=1}^{n}{AP}_{k}$$

where \({AP}_{k}\) is the AP of class k and n is the number of classes.
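A direct implementation of the IoU formula above, for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
# IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)      # overlap / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))        # 25 / 175 ≈ 0.143
```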

  • Training/validation loss box

The training and validation losses are usually used to diagnose the model's performance and identify which aspects need tuning. The training and validation box losses measure the difference between the predicted and ground truth bounding box coordinates. Common loss functions include mean squared error and smooth L1, sketched below.
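For illustration, the smooth L1 box loss can be evaluated on dummy coordinates with PyTorch:

```python
# Smooth L1 loss on dummy predicted vs. ground-truth box coordinates.
import torch

loss_fn = torch.nn.SmoothL1Loss()
pred = torch.tensor([[48.0, 52.0, 120.0, 118.0]])    # predicted box
target = torch.tensor([[50.0, 50.0, 122.0, 120.0]])  # ground-truth box
print(loss_fn(pred, target))  # quadratic near zero, linear for large errors
```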

4.3 Results and performance analysis

Figure 4 shows an example of a potato leaf infected by late blight disease. Figures 4(a) and (b) show the original images; Fig. 4(c) and (d) show the corresponding super-resolved images (upscaled by a factor of ×4).

Zoomed-in patches of the original and super-resolved images are displayed in Fig. 4. The PSNR and SSIM values are given at the bottom of Fig. 4(c). From the obtained results, we can see that the EDSR architecture preserves high-frequency details well and achieves high SSIM and PSNR values.

Fig. 4

Infected potato plant leaf: (a), (b) examples from the SD dataset; (c), (d) corresponding examples from the SRD dataset

Table 2 summarizes the automatic detection results obtained. Each model was run for 50 epochs. Performance is assessed by comparing the mAP, precision, and recall of the six trained models:

  • Train and validate Faster-RCNN on SD and SRD

  • Train and validate Detr on SD and SRD

  • Train and validate Yolo V8 on SD and SRD

Table 2 Performance comparison of the six detection models

From Table 2, it can be observed that training the detection models on the super-resolved dataset (SRD) improves detection results: Faster-RCNN SRD, Detr SRD, and Yolo V8 SRD achieve higher mAP at the 0.5 IoU threshold than Faster-RCNN SD, Detr SD, and Yolo V8 SD. The Yolo V8 SRD model outperforms all the other trained models in terms of mAP, precision, and recall.

The performance of the proposed method is also analyzed through the training and validation behavior of the six models, represented by the mAP 50–95 validation curves (Fig. 5) and the training/validation box loss curves (Fig. 6).

For Faster-RCNN SD, the mAP (50–95) after the last epoch is 0.58, but this is not the best epoch: the model achieves its best mAP of 0.6 at epoch 32. The best mAP (50–95) values of the other models are likewise not reached at the last epoch: 0.63 at epoch 35 for Faster-RCNN SRD, 0.6 at epoch 27 for Detr SD, 0.62 at epoch 29 for Detr SRD, 0.71 at epoch 32 for Yolo V8 SD, and 0.76 at epoch 30 for Yolo V8 SRD. The best mAP (50–95) is achieved by the Yolo V8 SRD model, and the models trained and validated on SRD achieve better mAP (50–95) overall.

Figure 6 shows the training and validation box losses, visualized together on one graph for the six models.

Fig. 5

Mean average precision curves

All the models localize the object well (in our case, late blight lesions on potato leaves): the training and validation box losses both decrease, converge to a low value, and stabilize. The lowest training/validation box losses are attained by the Yolo V8 models: 0.055 for training on SD, 0.047 for validation on SD, 0.045 for training on SRD, and 0.031 for validation on SRD. The training results could be further improved with more powerful hardware, which would allow larger batch sizes and improve convergence (Fig. 6).

Fig. 6

Loss curves for training and validation data

Figure 7(a)…(f) displays the detection results (localization of late blight lesions on a potato leaf, with the corresponding confidences) of the six trained models on an infected leaf image taken from the test set. Figure 7(a)-(b) show the detections of Faster-RCNN-SD and Faster-RCNN-SRD, Fig. 7(c)-(d) those of Detr-SD and Detr-SRD, and Fig. 7(e)-(f) those of Yolo V8-SD and Yolo V8-SRD. Comparing the results shows that training the models on the super-resolved dataset enhances detection: five late blight lesions were detected by Yolo V8-SRD compared to four by Yolo V8-SD, and Yolo V8-SRD detected a very small lesion with high confidence (0.95). In addition, four late blight lesions were detected by Faster-RCNN-SRD and Detr-SRD, compared to only three by Faster-RCNN-SD and Detr-SD.

Fig. 7

Automatic potato late blight detection results on test image. (a) Faster-RCNN-SD detection, (b) Faster-RCNN-SRD detection, (c) Detr-SD detection, (d) Detr-SRD detection, (e) YoloV8-SD detection, (f) YoloV8-SRD detection

Finally, the training and inference times of the six trained models were measured. This evaluation is relevant in scenarios where real-time plant disease detection and monitoring is required.

Table 3 shows the training time of the six trained models. With the training parameters presented in Sect. 4.1, Yolo V8 SD/SRD and Faster-RCNN SD/SRD take the least training time. Faster-RCNN requires more training time than Yolo V8 because it relies on a region proposal stage and processes each proposed region with a CNN, which can be time-consuming, whereas Yolo V8 combines object classification and localization into a single stage, making training significantly faster. Detr has the longest training time due to its large number of parameters and the complexity of transformers.

Table 3 Training time evaluation

Inference time refers to the time a trained model takes to make predictions on unseen data. For a more accurate assessment, the inference time was averaged over 20 different input images (20 standard inputs at the original resolution and 20 super-resolved inputs). The results, reported in Table 4, show that the lowest inference time, 12 ms, is achieved by Yolo V8 SRD when processing standard input images, and the highest, 91 ms, by Detr SD when processing super-resolved input images. As demonstrated above, training Faster-RCNN, Detr, and Yolo V8 on the super-resolved dataset improves the models' ability to precisely detect small lesions; however, processing high-resolution images increases inference time, and the inference times of all six trained models are higher on super-resolved inputs. From Tables 3 and 4, using higher-resolution images increases both training and inference time. Overall, Yolo V8 SRD maintains a good balance between detection precision and inference time. A hedged sketch of how such per-image inference times can be measured is shown after Table 4.

Table 4 Average inference time evaluation
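The following sketch measures mean per-image inference time on GPU; the detector call convention (a list of image tensors) follows torchvision-style models and is an assumption.

```python
# Measuring mean per-image inference time (ms) on GPU. The explicit CUDA
# synchronization ensures asynchronous kernels are included in the timing.
import time
import torch

def mean_inference_ms(model, images, device="cuda"):
    model.to(device).eval()
    times = []
    with torch.no_grad():
        for img in images:
            img = img.to(device)
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            model([img])                  # torchvision-style detector call
            torch.cuda.synchronize()
            times.append((time.perf_counter() - t0) * 1000)
    return sum(times) / len(times)        # average over e.g. 20 images
```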

5 Conclusion

In this work, three object detection models (Faster-RCNN, Detr, and Yolo V8) were trained on two datasets: a standard dataset (SD) and a super-resolved dataset (SRD). A comparison of the six trained models was conducted. The experimental results showed that training object detection models on the super-resolved dataset improves the precise recognition and localization of potato late blight disease, achieving high mean Average Precision (mAP): models trained on SRD outperformed their SD counterparts. Using a high-resolution image dataset allows very small lesions on the leaves to be detected, at the cost of a small increase in training and inference time. Hardware resources and specifications must be taken into consideration, since more powerful hardware can process complex architectures or higher-resolution inputs faster. Note that, to deploy a plant disease detection model on a smart device, specialized inference engines such as TFLite or OpenVINO should be used to optimize the model for specific hardware. We conclude that there is a fundamental trade-off between precision and inference time: increased resolution captures more details, which is important for detecting small lesions on infected leaves, but results in slower inference and requires more computational resources. In the future, the proposed method could be evaluated and generalized to automatically detect and localize other types of plant diseases.