
1 Introduction

Disease outbreaks are increasingly rampant globally, especially since some are extremely difficult to control and can lead to famine [1, 2]. Notably, plant diseases caused by pathogens, such as early and late blight, are known to reduce the overall yield of vegetable crops and are a great menace to both home gardeners and large-scale producers [1]. The large production scale and adaptability of these crops come with high risk and high susceptibility to such diseases. Plant disease identification is a crucial component of precision agriculture that primarily deals with observing the stages of disease in plants [3]. Since 60% to 70% of all visible symptoms first appear on the leaves rather than the stem or fruit, plant diseases are most commonly observed on the leaves. Thus, early symptom detection is vital for disease diagnosis, control, and damage assessment. Traditional manual methods of plant disease identification, both direct and indirect, have provided definitive answers for centuries. However, even with modern equipment, such methods are labor-intensive, exhausting, and too time-consuming to provide answers promptly. Moreover, most of these techniques require specialized tools and consumable chemical reagents, making them inefficient and unsustainable [4]. As part of machine vision technology, machine learning (ML) systems can mimic the direct identification method through pattern recognition, providing accurate and timely identification.

ML systems are generally classified into conventional classifier (CC) and deep learning (DL), or deep convolutional neural network (D-CNN), methods [5, 6]. Feature representation in a CC method involves explicitly extracting features as patterns whose properties sufficiently portray the quantifiable details of the disease symptom, the region of interest (ROI). Complete automatic identification is then achieved with a machine learning classifier built on those features. The DL method, on the other hand, involves automated, implicit feature representation: pixels in the entire image, a neighborhood, or a group are treated as the characteristics for feature learning. The typical feature representation thus becomes an implicit process rather than a separate stage, embedded in the model architecture. There are two ways of training a DL model: training from scratch and transfer learning. Training a DL model from scratch involves designing and building the network layer by layer, which is often complicated and time-consuming due to the depth of the architecture [6]. Transfer learning, as introduced by Bengio [7], applies an already established architecture that has been successful in other computer vision problems and adapts it to the problem under consideration, significantly reducing the complexity.

Throughout the literature, many works use the whole leaf image as training data with varying degrees of precision. Mohanty et al. [8] presented the first DL model to identify multiple plant diseases on the relatively comprehensive Plant Village (PV) image dataset, which constitutes over 54,000 images of 14 different crop species covering 26 diseases along with healthy leaves, including early and late blight. The pre-trained networks used for transfer learning were AlexNet and GoogLeNet, which were fine-tuned on the training data and validated on the testing data. Zhang et al. [9] also implemented AlexNet, GoogLeNet, and ResNet pre-trained models on the same dataset to identify tomato diseases. Xu et al. proposed a VGG-16 model trained with transfer learning [10]. Fuentes et al. applied a DL architecture for real-time implementation on tomato crops [11]. Lightweight CNNs are also proving useful in reducing some of the limitations of the approach. Geetharamani et al. and Durmus et al. focused on architecture simplification by reducing the number of deep layers [12, 13]. Durmus et al. proposed SqueezeNet, a compressed and lightweight version of the D-CNN with fewer layers, to identify tomato diseases [13]; the tomato images used are part of the PV dataset. Dasgupta et al. also proposed a lightweight CNN for plant disease detection designed for mobile applications [14]. Chen et al. proposed a modified lightweight MobileNet-V2 model with a “step” transfer learning scheme to optimize feature learning for multiple plant disease classification [15]: during training, the weights and parameters of the deeper layers are frozen while those of the upper layers are updated. Such models require less computation and are cost-effective; however, their accuracy is generally lower than that of regular models. Other recent D-CNN methods employ manual data augmentation and use segmented ROI images as input data. Sharma et al. manually cropped the images to include only the ROIs before training the D-CNN, instead of using the whole leaf image [16]. Barbedo also implemented ROI localization on the image data based on similar criteria of symptom size and color [17]. Sun et al. proposed augmenting existing image data with neural-network-generated lesion images [18]. However, despite the use of ROI image data, a significant number of DL methods still record low classification accuracies, often attributed to insufficient data and to combining diseases with similar symptoms without the basis of an inference rule [17].

A recent study by Lee et al. concluded that DL feature representations learned from whole images do not necessarily focus on the ROIs [19]. The features are instead learned from the areas with the most common distinctive characteristics, such as leaf venation. Toda and Okura supported this claim, noting that a DL model tends to learn visual shape characteristics, such as the profoundly grooved leaf edges of the tomato crop, instead of the visible symptom characteristics [20]. With ROI image data, different levels of subjectivity arise during segmentation, mainly from loosely characterizing the ROI without the necessary pathological inference. This subjectivity influences the separation boundary during segmentation, resulting in the removal of the blurred region from the ROI, whereas earlier studies indicated that this region is important for improving the quality of learned features [21]. In this regard, this paper proposes a DL plant disease identification method that uses extended region of interest (EROI) segmented images as training data. Instead of the typical ROI image data, the proposed method has the advantage of using segmented EROI data that includes the blurred region, enhancing feature representation during training and improving classification accuracy.

2 Materials and Method

This study implements DL with transfer learning to identify plant diseases using the proposed segmented EROI image data. ROI segmentation, which involves isolating the visible disease lesions from the rest of the leaf, is typically practiced in conventional classifier methods. In this paper, the term identification refers to detection and classification. The ROIs of the diseases were identified and segmented using the proposed pathological segmentation algorithm. Three popular pre-trained networks, AlexNet, ResNet-50, and VGG-16, were used. The study considers the vegetable early blight (EB) and late blight (LB) diseases [22, 23], both of which cut across tomato, pepper, potato, and eggplant with similar symptoms [24]. These crops are grown on planting areas ranging from small backyard plots to much larger field acreages and greenhouses, which makes them exceptional candidates for research in precision agriculture. Thus, even though only two diseases are considered in this study, the findings apply across several economically important crops.

2.1 Image Dataset

The primary image data of 1,400 potato leaf images used in this study was obtained from the comprehensive PV dataset [8]. In total, 500 images show symptoms of EB, 500 show symptoms of LB, and 400 are healthy leaves (HL). Figure 1 shows samples of each disease symptom [8]. All images are in RGB format and of equal size, \( 256 \times 256 \) pixels.

Fig. 1. Example of potato leaf image samples from the PV dataset. From right: healthy, early blight, and late blight.

2.2 Pathological Extended Region of Interest (EROI) Segmentation

As an EB infection manifests, it forms concentric rings in a bulls-eye pattern emanating from a dark-brown focus and surrounded by a yellowish chlorosis zone [22]. An LB infection starts as a small, dark central lesion that develops into a dark-brown or black area bordered by a water-soaked lesion with a pale whitish-green border that fades into the healthy tissue [23]. The dark (brown) foci are the necrotic regions, while the chlorosis boundary zones are the symptomatic regions.

Following the characteristics of the two blight diseases, the symptoms typically show a significant color difference from the surrounding tissue areas, varying from light green to yellow, brown, or black. The proposed pathological segmentation method uses the proportion of each (RGB) color channel intensity for the tissue pixels of a healthy leaf image. Typically, pixels within a leaf image whose intensity deviates more towards the green hue than blue and red belong to healthy tissue [25, 26]; mathematically, \( G > R \gg B \). Hence, to establish the degree-of-certainty threshold, the proposed ROI segmentation starts by computing the percentage of each pixel’s green color intensity in the original RGB image. An input leaf image \( I\left( {x,y} \right) \) of size \( M \times N \) is made up of pixels \( p_{i,j} \left( {x,y} \right) \) (for \( i = 0, 1, \ldots ,M - 1 \) and \( j = 0, 1, \ldots ,N - 1 \)). Each pixel has a color value that is a combination of the three RGB channels, and each channel intensity ranges from 0 (no color) to 255 (maximum intensity). The percentage of each color value relative to the combined color values is computed using Eqs. (1)–(4).

$$ r_{g} = \frac{{G_{i,j} }}{{R_{i,j} + G_{i,j} + B_{i,j} }} \times 100\% $$
(1)
$$ r_{r} = \frac{{R_{i,j} }}{{R_{i,j} + G_{i,j} + B_{i,j} }} \times 100\% $$
(2)
$$ r_{b} = \frac{{B_{i,j} }}{{R_{i,j} + G_{i,j} + B_{i,j} }} \times 100\% $$
(3)
$$ r_{gr} = \sqrt {\frac{{R_{i,j} }}{{R_{i,j} + G_{i,j} }}} \times 100\% $$
(4)

From the results of Eqs. (1)–(4), the pixel intensity proportions of healthy (green) tissue with the highest degree of certainty are \( 42.4\% \) for \( r_{g} \), \( 34.5\% \) for \( r_{r} \), and \( 23.3\% \) for \( r_{b} \). Further experiments showed that lower percentages correspond to fewer green-toned pixels, and vice versa; hence, adjusting the values changes the segmentation boundary between healthy and diseased tissue. Following this, four threshold values are proposed to generate four binary masks from \( I\left( {x,y} \right) \) that incorporate the blurred region and provide invariance to small intensity variations, using Eqs. (5)–(8).

$$ g_{1} = \left\{ {\begin{array}{*{20}c} 0 & {38\% < r_{g} \le 47\% } \\ {1 } & {otherwise} \\ \end{array} } \right. $$
(5)
$$ g_{2} = \left\{ {\begin{array}{*{20}c} 0 & { 32\% \le r_{r} < 37\% } \\ {1 } & {otherwise} \\ \end{array} } \right. $$
(6)
$$ g_{3} = \left\{ {\begin{array}{*{20}c} 0 & {18\% \le r_{b} < 29\% } \\ 1 & { otherwise} \\ \end{array} } \right. $$
(7)
$$ g_{4} = \left\{ {\begin{array}{*{20}c} 0 & {65\% < r_{gr} < 85\% } \\ 1 & {otherwise} \\ \end{array} } \right. $$
(8)

The four masks are then combined to generate two binary segmentation masks: \( m_{1} = g_{1} ||g_{2} ||g_{3} \) and \( m_{2} = g_{1} ||g_{4} \), where \( || \) denotes the pixel-wise logical OR. The first mask, \( m_{1} \), segments the healthy tissue pixels, incorporating the lighter green pixels, while the second mask, \( m_{2} \), segments the darker green pixels. Finally, the binary segmentation mask is given by Eq. (9).

$$ S_{mask} = m_{1} ||m_{2} $$
(9)

Morphological post-processing operations are applied to clean up the binary mask and remove isolated border pixels: a closing operation using a disk-shaped structuring element of radius three (3), followed by a dilation with the same structuring element. Applying the completed \( S_{mask} \) to the original input image masks out the healthy green tissue pixels, turning them black (zero). The mathematical expression is given in Eq. (10).

$$ I_{EROI} \left( {x,y} \right) = \left\{ {\left( {x,y} \right) \in I\left( {x,y} \right) |S_{mask} } \right\} $$
(10)

Algorithm 1 shows the pseudocode for the proposed EROI segmentation, and Fig. 2 shows a sample result.

Fig. 2. EROI segmentation sample result: (a) input image; (b) EROI segmented image.

Algorithm 1. Pseudocode for the proposed pathological EROI segmentation.
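For illustration, the following is a minimal Python/NumPy sketch of the segmentation in Eqs. (1)–(10). The function name, the use of OpenCV for the morphological operations, and the division-by-zero guard are our own additions; the published implementation may differ in detail.

import numpy as np
import cv2  # OpenCV, used here only for the morphological post-processing

def eroi_segment(img_rgb):
    """Sketch of the pathological EROI segmentation, Eqs. (1)-(10).
    img_rgb: H x W x 3 uint8 RGB leaf image. Returns the EROI image
    with healthy green tissue pixels turned to black (zero)."""
    rgb = img_rgb.astype(np.float64)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = R + G + B + 1e-9                      # guard against black pixels

    # Per-pixel color proportions in percent, Eqs. (1)-(4)
    r_g = 100.0 * G / total
    r_r = 100.0 * R / total
    r_b = 100.0 * B / total
    r_gr = 100.0 * np.sqrt(R / (R + G + 1e-9))

    # Threshold masks, Eqs. (5)-(8): 0 inside the healthy range, 1 otherwise
    g1 = ~((r_g > 38) & (r_g <= 47))
    g2 = ~((r_r >= 32) & (r_r < 37))
    g3 = ~((r_b >= 18) & (r_b < 29))
    g4 = ~((r_gr > 65) & (r_gr < 85))

    # Combined masks and final segmentation mask, Eq. (9)
    m1 = g1 | g2 | g3          # covers the lighter green healthy tissue
    m2 = g1 | g4               # covers the darker green healthy tissue
    s_mask = (m1 | m2).astype(np.uint8)

    # Morphological clean-up: closing, then dilation, with a disk of radius 3
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    s_mask = cv2.morphologyEx(s_mask, cv2.MORPH_CLOSE, disk)
    s_mask = cv2.dilate(s_mask, disk)

    # Eq. (10): keep the pixels flagged by the mask; healthy tissue goes black
    return img_rgb * s_mask[..., None]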

2.3 Disease Identification with DL Classifiers

In this study, three transfer learning pre-trained CNN models, AlexNet [27], VGG-16 [28], and ResNet-50 [29], have been implemented using the segmented EROI images as input data.

The AlexNet model is eight (8) layers deep. To re-train it, the last three layers of AlexNet are trimmed and replaced with new layers that classify the three classes of EB, LB, and HL; this way, the features from the remaining layers, i.e., the transferred layer weights, are kept. The weight and bias learning rates of the new layers are increased by a factor of 10 to enable faster learning than in the transferred layers. Before training, the images were resized to \( 227 \times 227 \) pixels, as accepted by the network, and augmented to optimize training given the minimal data. The augmentation includes vertical and horizontal flipping, scaling, and translation. The modified network is then re-trained with stochastic gradient descent (SGD) optimization, an initial learning rate of \( 1 \times 10^{ - 4} \), and a mini-batch size of 10 for a maximum of 12 epochs.
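As an illustration only, this set-up might look as follows in PyTorch; the original implementation likely uses a different framework, and the variable names, momentum value, and torchvision API usage here are our assumptions.

import torch
import torchvision
from torch import nn
from torchvision import transforms

# Pre-trained AlexNet with its 1000-class head replaced by a 3-class one
model = torchvision.models.alexnet(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 3)           # EB, LB, HL

# New head learns 10x faster than the transferred layers
base_lr = 1e-4
head = list(model.classifier[6].parameters())
head_ids = {id(p) for p in head}
body = [p for p in model.parameters() if id(p) not in head_ids]
optimizer = torch.optim.SGD(
    [{"params": body, "lr": base_lr},
     {"params": head, "lr": 10 * base_lr}],
    momentum=0.9)                                   # momentum is assumed

# Augmentation: flips, scaling, and translation on 227 x 227 inputs;
# training then runs with mini-batches of 10 for up to 12 epochs.
train_tf = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1),
                            scale=(0.9, 1.1)),
    transforms.ToTensor()])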

The ResNet-50 model is 50 layers deep and requires input images of \( 224 \times 224 \) pixels; thus, the images were resized to the network-acceptable size. The layer with the learnable weights is the last fully-connected layer; as in the AlexNet transfer learning, this layer, along with the output layer, is replaced by new ones with the number of outputs equal to the number of disease classes. In this case, however, while the weights of the new layers are initialized afresh, the weights of the first ten layers of the network are frozen by setting their learning rates to zero. This speeds up training, since the gradients in those layers are not updated, and limits the risk of overfitting given the relatively small dataset. The modified network is then trained with SGD optimization, an initial learning rate of \( 3 \times 10^{ - 4} \), and a mini-batch size of 10 for a maximum of 12 epochs.
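A corresponding hedged PyTorch sketch of the ResNet-50 set-up follows. Which modules count as the “first ten layers” depends on the framework’s layer enumeration, so freezing the stem and first residual stage below is only an approximation.

import torchvision
from torch import nn

# Pre-trained ResNet-50 with a new 3-class fully-connected head
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 3)      # EB, LB, HL

# Freeze the earliest layers: their gradients are neither computed nor
# applied, which speeds up training and limits overfitting on small data.
for module in (model.conv1, model.bn1, model.layer1):
    for p in module.parameters():
        p.requires_grad = False

# Training would then use SGD with an initial learning rate of 3e-4,
# mini-batches of 10, and a maximum of 12 epochs, as described above.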

The VGG-16 model features an extremely homogeneous architecture that performs \( 3 \times 3 \) convolutions with stride 1 and padding 1, and \( 2 \times 2 \) pooling with stride 2, from the first to the last layers. It is 16 layers deep and, as with ResNet-50, requires the images to be resized to \( 224 \times 224 \) pixels. For implementation, the same re-training and optimization hyper-parameters used for the ResNet-50 model were applied.

2.4 Performance Measures

The standard performance measures used to compare the classifiers are precision, recall, F1 score, and overall accuracy. These are computed from the confusion matrices using Eqs. (11)–(14).

$$ Precision = \frac{TP}{TP + FP} $$
(11)
$$ Recall \left( {TPR} \right) = \frac{TP}{TP + FN} $$
(12)
$$ F_{1} Score = \frac{2TP}{2TP + FP + FN} $$
(13)
$$ Accuracy = \frac{TP + TN}{TP + FP + FN + TN} $$
(14)
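For a multi-class problem, these measures are computed one-vs-rest from the confusion matrix. A small sketch, with our own function name, is shown below.

import numpy as np

def classification_metrics(cm):
    """Per-class precision, recall, and F1 plus overall accuracy from a
    confusion matrix cm, where cm[i, j] counts samples of true class i
    predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp     # predicted as the class, actually another
    fn = cm.sum(axis=1) - tp     # of the class, predicted as another
    precision = tp / (tp + fp)                   # Eq. (11)
    recall = tp / (tp + fn)                      # Eq. (12)
    f1 = 2 * tp / (2 * tp + fp + fn)             # Eq. (13)
    accuracy = tp.sum() / cm.sum()               # overall accuracy, cf. Eq. (14)
    return precision, recall, f1, accuracy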

3 Experimental Results and Discussion

The EROI image dataset was split into \( 80\% \) training and \( 20\% \) testing sets. For benchmarking, a separate dataset generated using the typical ROI segmentation approach described in [25] was used to train and test the models under the same setup. Table 1 summarizes the performance measures of the three implemented DL models on the ROI testing data.

Table 1. Performance measures on ROI test data

From Tables 1 and 2, the ResNet-50 model achieved the highest performance on both the ROI and EROI data, while VGG-16 recorded the lowest. With its deeper layers, the ResNet-50 model harbors denser learned features, but at the expense of simplicity of implementation, as it uses considerably more memory and parameters. The classification results on the image data segmented with the proposed pathological method achieved higher performance measures across all three implemented DL models. Using the typical ROI data, AlexNet, ResNet-50, and VGG-16 attained average accuracies of \( 93.86\% \), \( 95.18\% \), and \( 93.86\% \), respectively; with the EROI data, the accuracies improved by \( 1.75\% \), \( 2.19\% \), and \( 0.44\% \), respectively.

Table 2. Performance measures on EROI test data

Furthermore, there is a significant improvement in the metric measures of the EB and LB classes, which indicates a better characterization of the two disease symptoms. From the results (Tables 1 and 2), the change in the performance statistics is attributed to misclassifications of LB symptoms as EB. Regardless, the improved results indicate improved feature representation for classification due to the incorporation of the extended blurred region. Hence, improved data quality leads to better feature learning and more efficient performance, since the classification accuracy improved given comparatively little data.

4 Conclusion

In this work, a DL plant disease identification method is actualized using segmented image data from a proposed pathological disease-region segmentation algorithm. Instead of the typical disease-region (ROI) image data, the proposed approach uses pathological inference to incorporate the extended region of interest (EROI), which includes the fuzzy blurred region. Comparative results using state-of-the-art pre-trained DL models show the efficacy of the proposed approach in improving feature representation and classification performance.