1 Introduction

Plant diseases play a major role in reducing yield, resulting in economic, social, and environmental losses in global agricultural production. Manual identification of disease symptoms on leaves is difficult and time-consuming; even experienced plant pathologists and agronomists may struggle to diagnose specific diseases, which leads to erroneous assumptions and conclusions. In this context, researchers have recognized the need for automatic tools to detect infected portions of leaves (Barbedo 2013; Yang and Guo 2017; Barbedo 2016). Such tools would be helpful for farmers, non-experts, and non-botanists, especially in large cultivated crops where continuous on-site monitoring is impractical and specialized technical assistance is not readily available. Image processing techniques resolve these issues to some extent where manual assessment is insufficient or inaccurate (Weizheng et al. 2008; Patil and Bodhe 2011; Scharr et al. 2016). Furthermore, machine learning algorithms can help classify diseased plants from non-diseased ones and predict the infected portions of the plants (Zawbaa et al. 2014).

At present, with the advancement of computational systems with GPU-embedded processors, deep learning techniques have grown rapidly through the development of new methods and models (LeCun et al. 2015). These techniques have been introduced into agriculture, and as a result, the diagnosis of plant diseases using deep learning has begun in the last few years (Carranza-Rojas et al. 2017; Yang and Guo 2017). Nowadays, the basic CNN model is the most widely preferred deep learning framework for automating plant disease identification (Lee et al. 2015; Grinblat et al. 2016). Many researchers (Mohanty et al. 2016; Ferentinos 2018; Zhang et al. 2019) have used the open-source PlantVillage dataset for predicting plant diseases and have produced promising results. Unfortunately, misclassification is still observed for specific crop species. From our observations, mango tree leaves, when trained using a multilayer convolutional neural network-based ternary classification model, are adversely affected by Anthracnose, a fungal disease (Singh et al. 2019). Other issues such as invariance, overfitting, and spatial resolution may further affect detection accuracy. Hence, a modified architecture for the classification, segmentation, and detection of plant diseases is desirable. In this work, an experimental version of Mask R-CNN is proposed for the identification of multi-class plant leaf diseases using a deep learning framework, to achieve reliable classification and positional accuracy in finding the infected portions of leaf images. The major contributions of this paper are as follows:

1.

    A novel deep learning DPD-DS framework to detect multi-class plant leaf diseases is proposed.

2.

The structure of the Mask-RCNN backbone network is modified, and the anchor proportions in the RPN are changed, resulting in high detection precision.

3.

The proposed framework is tested on two different datasets with multiple classes by manually generating 4785 annotations using an annotation tool, to accurately identify even small infected regions of diseased leaves. The proposed model gives effective instance segmentation results compared to other state-of-the-art models.

4.

    A transfer-learning technique is used to speed up training of the DPD-DS model on a new group of classes.

The rest of the paper is organized as follows. Section 2 presents related work on classification, detection, and segmentation. In Sect. 3, the proposed DPD-DS framework is described in detail. In Sect. 4, the experimental results and performance metrics are described, and Sect. 5 concludes the work with future directions.

2 Related work

Diseases in plants significantly affect crop cultivation in an agricultural environment. In recent years, researchers have used several approaches and made considerable progress in plant disease detection. The first group of approaches employs handcrafted features to classify symptoms of the disease (Mahlein et al. 2012). In this case, there is a risk that human raters deviate and lose focus in a short time when performing the task. The second group of approaches uses digital imaging for disease detection and evaluates the diseased portions from microscopic to regional scale (Bock et al. 2010; Price and Osborne 1990). In addition, the quantification of various types of symptoms on plant leaves is also taken into consideration in order to assess the extent of leaf injury in a crop (Bowen et al. 1991). To identify disease symptoms on the leaf, color digital photography was used to show the contrast between uninjured and injured leaf sections in photographs of Zostera marina (eelgrass); images were classified into healthy and injured classes using an unsupervised algorithm, and colors were then assigned by the user to the different classes (Boese et al. 2008). The resulting color image was compared with the true original color image. However, the preprocessing involves tedious work prior to segmentation, and camera exposure settings also influence the contrast levels of infected and uninfected leaf portions. The work in Sannakki et al. (2011) identified the infected portion of the leaf using color image segmentation and graded the percentage of the infected portion using fuzzy logic. Unfortunately, the model produced a nonlinear and discontinuous space, which makes changes in color hue hard to follow due to the discontinuities. All the methods mentioned above may give results only when plants reach certain stages of growth; moreover, image acquisition is strictly performed under controlled lighting conditions at limited distances. As other digital technologies co-exist, large amounts of information are obtained in real time, and as a result, several machine learning algorithms, such as logistic regression, random forests (Zawbaa et al. 2014), K-means clustering, decision trees (DT), Naive Bayes (NB), K-nearest neighbors (KNN) (Liakos et al. 2018), and support vector machines (SVM) (Priya et al. 2012), have provided better decision-making strategies for identifying disease symptoms in crops (Behera et al. 2020).

The third group is based exclusively on deep learning techniques for the identification of plant diseases. Convolutional neural networks (CNNs), in particular, have achieved considerable success in plant disease classification (Lee et al. 2017; Liu et al. 2018). At present, deep learning techniques also achieve outstanding results in object detection (He et al. 2015; Nanehkaran et al. 2020), which aims to classify and localize the objects in a given image. The current state-of-the-art object detection models include OverFeat (Sermanet et al. 2013), RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), SSD (Liu et al. 2016), and Mask-RCNN (He et al. 2017).

3 Proposed DPD-DS system

The proposed DPD-DS framework for plant disease segmentation and classification based on Mask-RCNN is shown in Fig. 1. Mask-RCNN is an improved instance segmentation model that extends Faster-RCNN. Instance segmentation leverages a clustering approach that groups identical pixels to form instances. The model operates in two stages: in the first stage, the image is scanned and region proposals are produced; in the second stage, the objects are marked, identified, and segmented. Our proposed leaf disease detection and segmentation (DPD-DS) system can be divided into three major parts:

1.

    Classification of PDs (healthy or diseased).

2.

    Detecting the infected portion of PDs (bounding-box).

3.

    Generating mask for each infected region of PDs.

Since our proposed framework is based on the Mask-RCNN architecture (He et al. 2017), its different components are explained in detail below:

Fig. 1 Schematic illustration of the proposed DPD-DS system using Mask-RCNN

3.1 Backbone (feature extractor)

In this work, we adopt the residual network (ResNet) for deep feature extraction from images. ResNet uses "skip connections", which increase the convergence rate during training and alleviate the vanishing-gradient (degradation) problem in deep networks. Currently, the residual network family consists of five architectures, namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. Considering both detection performance and computational cost, we chose ResNet-50 with a bottom-up Feature Pyramid Network (FPN) as the feature extractor.
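As a minimal sketch (not the authors' exact modified network), assuming the Keras/TensorFlow framework stated in Sect. 4, a ResNet-50 feature extractor truncated for use ahead of an FPN can be instantiated as follows:

```python
import tensorflow as tf

# Hypothetical sketch of a ResNet-50 backbone for feature extraction;
# the actual DPD-DS backbone additionally swaps ReLU for Swish and
# feeds multi-scale stage outputs into a Feature Pyramid Network.
backbone = tf.keras.applications.ResNet50(
    include_top=False,          # drop the ImageNet classification head
    weights="imagenet",         # transfer-learning initialization
    input_shape=(512, 512, 3),  # input size used in this work
)
features = backbone(tf.random.normal([1, 512, 512, 3]))  # [1, 16, 16, 2048]
```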

Table 1 Configuration details of the proposed ResNet-50 architecture; Swish activation and batch normalization are included in each convolution block but omitted from the table; input size is 512 \(\times \) 512

The configuration details of the proposed ResNet-50 architecture are shown in Table 1. It accepts an input of dimension 512 \(\times \) 512 and is composed of five convolution blocks. Each block is accompanied by an activation function and batch normalization (He et al. 2016). In this paper, we replace ReLU with the 'Swish' activation function, which outperforms ReLU in classification accuracy. The function is relatively simple: the input x is multiplied by the sigmoid of x (Fig. 2). The mathematical representation of Swish (Ramachandran et al. 2017) is given in Eq. (1). Swish is a smooth activation function, which means it does not change direction abruptly the way ReLU does close to x = 0; instead, it bends smoothly for values less than or equal to zero and then ascends again. In Table 1, \(2/1^{*}\) indicates 2 for the first iteration and 1 from the second iteration onward. Each convolution block from Conv2_x, ..., Conv5_x has a shortcut connection with a 1 \(\times \) 1 convolution filter, and the 3 \(\times \) 3 \(\times \) depth convolutions in Conv2_x, ..., Conv5_x are applied to the feature map with a padding of 1. No padding is used in all other cases.

$$\begin{aligned} f(x)= x \times Sigmoid(x) \end{aligned}$$
(1)
Fig. 2 Swish activation function
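For reference, a minimal implementation of Eq. (1), assuming the Keras/TensorFlow framework used in this work:

```python
import tensorflow as tf

def swish(x):
    """Swish activation of Eq. (1): f(x) = x * sigmoid(x)."""
    return x * tf.sigmoid(x)

# Recent TensorFlow releases also provide this as a built-in,
# tf.keras.activations.swish; either can be passed to any layer:
layer = tf.keras.layers.Conv2D(64, 3, activation=swish)
```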

Table 2 Configuration details of the proposed DPD-DS system

Fig. 3 Sample leaf images from the databases: A mango leaves, diseased (a, b) and healthy (c, d); B pomegranate leaves, diseased (e, f) and healthy (g, h); C Pongamia pinnata leaves, diseased (i, j) and healthy (k, l); D apple leaves, diseased (m, n) and healthy (o, p); E grape leaves, diseased (q, r) and healthy (s, t)

Table 3 Dataset details of the proposed system

3.2 Region proposal network

RPN is a lightweight convolutional neural network, first implemented in Faster-RCNN; it replaces selective search, which is a slower method for generating bounding boxes. In our framework, the RPN takes the convolutional features from the backbone as input and produces object bounding boxes as output. The RPN scans the image with regions called anchors, amounting to about 200k region proposals in this framework; the boxes are scattered over the image with these anchors. Anchors of different sizes and aspect ratios are used to cover enough of the image area, and the N anchors with the highest objectness probability are chosen. Because proposals overlap, we use non-maximum suppression (NMS) based on the per-class confidence scores to control redundancy (false positives); a minimal sketch of this filtering step is given below.
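The sketch uses TensorFlow's built-in NMS; the boxes and thresholds are illustrative assumptions, not the paper's exact values:

```python
import tensorflow as tf

# Toy proposals as (y1, x1, y2, x2) with objectness scores.
boxes = tf.constant([[0.0, 0.0, 10.0, 10.0],
                     [0.5, 0.5, 10.5, 10.5],    # heavily overlaps the first box
                     [50.0, 50.0, 60.0, 60.0]])
scores = tf.constant([0.9, 0.8, 0.7])

keep = tf.image.non_max_suppression(
    boxes, scores, max_output_size=100, iou_threshold=0.7)
kept_boxes = tf.gather(boxes, keep)  # the lower-scored overlap is dropped
```

After proposal filtering, the overall loss for the RPN is defined as follows (see Eq. (2))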

$$\begin{aligned} \lambda (P_{i},O_{i})= & {} \frac{1}{N_{cls}}\sum _{i}L_{cls} (P_{i},P_{i}^{*}) \nonumber \\&+ \psi \frac{1}{N_{reg}}\sum P_{i}^{*} L_{reg}(O_{i},O_{i}^{*}) \end{aligned}$$
(2)
$$\begin{aligned} \lambda _{overall}= & {} \lambda _{cls}+\lambda _{bbox}+\lambda _{mask}, \end{aligned}$$
(3)

where \(\lambda _{cls}\) represents the classification loss, which mainly uses the cross-entropy loss defined in Eq. (4), and \(\lambda _{bbox}\) is the bounding-box loss, which is based on smooth L1 normalization (see Eqs. (5), (6)).

$$\begin{aligned} H(y,y^{*})= & {} - \sum _{i=0}^{N} y_{i}\log (y_{i}^{*}) \end{aligned}$$
(4)
$$\begin{aligned} L_{bbox}(t^{u},u)= & {} \sum _{i\in {{x,y,w,h}}} Smooth_{L_{1}}(t_{i}^u-u_{i}) \end{aligned}$$
(5)

where

$$\begin{aligned} Smooth_{L_{1}}(x)= {\left\{ \begin{array}{ll} 0.5x^{2} ,&{}\quad \text {if} \mid x \mid <1\\ \mid x \mid -0.5,&{}\quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

In the formula stated above, 'i' indicates the anchor-box index in the mini-batch; \(N_{cls}\) and \(N_{reg}\) indicate the number of classification layers and regression layers; \(P_{i}\) is the predicted probability of anchor i being an object; \(P_{i}^{*}\) is 0 when the anchor box is negative and 1 when it is positive; \(O_{i}\) represents the vector of four coordinate parameters of the predicted candidate box; \(O_{i}^{*}\) refers to the vector of four parametrized coordinates of the ground-truth region; and \(L_{cls}\) and \(L_{reg}\), respectively, denote the classification loss and the regression loss. The proposed DPD-DS system is based on Mask-RCNN with slight modifications, such as the anchor proportions, the backbone structure, the activation function, and the lightweight CNN, to make it fit for plant disease detection.
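As a concrete reading of Eqs. (5) and (6), the following is a minimal TensorFlow sketch of the smooth L1 regression term (an illustrative re-implementation, not the training code itself):

```python
import tensorflow as tf

def smooth_l1(d):
    """Eq. (6): quadratic for |d| < 1, linear otherwise."""
    abs_d = tf.abs(d)
    return tf.where(abs_d < 1.0, 0.5 * tf.square(abs_d), abs_d - 0.5)

def bbox_regression_loss(pred, target):
    """Eq. (5): sum of smooth L1 over the (x, y, w, h) offsets."""
    return tf.reduce_sum(smooth_l1(pred - target), axis=-1)

# Toy offsets (illustrative only):
pred = tf.constant([[0.2, -0.1, 1.5, 0.3]])
target = tf.constant([[0.0, 0.0, 0.0, 0.0]])
print(bbox_regression_loss(pred, target))  # 0.02 + 0.005 + 1.0 + 0.045 = 1.07
```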

3.3 ROI-Align network

This layer predicts well-defined object pixel masks, as used in Mask-RCNN, and requires well-aligned region-of-interest (ROI) features. The anchors created by the RPN layer vary in size and aspect ratio, and the sizes of these regions are normalized in ROI-Align. For this purpose, bilinear interpolation (BI) is used to compute the exact pixel characteristics in the ROI-Align layer. After this, global average pooling and max pooling operations are carried out on the current features for further enhancement. Finally, ROI-Align extracts the features, which are moved to the head of the network to generate class labels, bounding boxes, and masks. Based on the features created by the ROI-Align layer, two fully connected layers are used to generate class labels and bounding boxes. For each ROI, classification, detection, and segmentation are carried out in parallel. The overall loss of the proposed method is computed as stated in Eq. (3).
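The bilinear resampling at the heart of ROI-Align can be sketched with tf.image.crop_and_resize, which interpolates each ROI to a fixed grid; a full ROI-Align samples several points per output bin, but the interpolation idea is the same (the feature-map sizes here are assumptions):

```python
import tensorflow as tf

features = tf.random.normal([1, 32, 32, 256])   # backbone/FPN feature map
rois = tf.constant([[0.1, 0.1, 0.6, 0.6]])      # normalized (y1, x1, y2, x2)

# Bilinearly resample each ROI to a fixed 7x7 grid of features.
pooled = tf.image.crop_and_resize(
    features, rois, box_indices=tf.constant([0]),
    crop_size=(7, 7), method="bilinear")        # shape [1, 7, 7, 256]
```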

During the implementation of the DPD-DS system, various parameters are considered and utilized during training, as shown in Table 2. The trained model segments and detects the infected portions of the leaf images correctly.

4 Experimental results and performance evaluation

To verify the robustness and efficacy of the proposed DPD-DS framework, the experiments were performed on an Intel(R) Xeon(R) processor with an NVIDIA Quadro GPU, 12 GB RAM, and the Windows 10 operating system. Our framework is implemented using deep neural network libraries in the Python Keras/TensorFlow framework.

4.1 Image data collection

In this work, two benchmark datasets are used. These datasets comprise leaf images of several crop species (Mohanty et al. 2016; Siddharth et al. 2019). Among them, specific crops, namely apple, mango, pomegranate, Pongamia pinnata, and grape, are considered. The details of these datasets are given below.

4.1.1 Dataset1

PlantVillage dataset: The archive for the image-based study of plant disease is available as the PlantVillage dataset. It contains 54,309 labeled images of 14 different crop species. We use two crop species from this archive, namely apple and grape. These species are divided into two groups: healthy and diseased.

4.1.2 Dataset2

A Database of Leaf Images: This repository contains 4503 labeled images of 12 distinct crop species. Here, we utilize only 3 species, namely mango, pomegranate, and Pongamia pinnata. Each species is segregated into healthy and diseased. A few samples of plant disease combinations (including non-diseased) are depicted in Fig. 3, and the details pertaining to the image dataset are furnished in Table 3.

4.2 Image annotation procedure

Image labeling is one of the key parts of segmentation models, and several image annotation or labeling tools are available for this purpose. We use the VGG Image Annotator (VIA) to make annotations manually. It is a simple HTML file that can be opened in any web browser without any installation procedure. The annotated information is saved in a JSON format file, as shown in Fig. 4; during the training process, this JSON file is used as the ground truth for the corresponding images. A total of 3953 images with 7 different diseased class labels and 6564 annotations are collected for training the DPD-DS framework. The steps for training the proposed framework are summarized in Algorithm 1. The procedure starts with the preparation of the plant leaf disease dataset from the two repositories, namely Dataset1 and Dataset2. The images are scaled to 512 \(\times \) 512 dimensions and annotated as described in this section. The annotated images are then split for training with the 80-20 rule, and the ResNet architectures (ResNet-50, 101) are trained on the annotated and labeled image datasets. A minimal sketch of reading the VIA annotations follows.
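The key names in this sketch follow the common VIA v2 JSON layout and the file name is hypothetical; exports may differ with other VIA settings:

```python
import json

with open("via_region_data.json") as f:   # hypothetical export file
    project = json.load(f)

for entry in project.values():            # one entry per annotated image
    filename = entry["filename"]
    for region in entry["regions"]:
        shape = region["shape_attributes"]       # polygon geometry
        xs = shape["all_points_x"]
        ys = shape["all_points_y"]
        label = region["region_attributes"]      # e.g. disease class
        print(filename, label, list(zip(xs, ys))[:3])
```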

Fig. 4 Sample bounding box annotations

Algorithm 1

4.3 Evaluation measures

To assess the efficiency of object detection and segmentation models, the following metrics are used:

4.3.1 Intersection over Union

This is the most widely used technique to assess the performance of segmentation models, also called the Jaccard similarity coefficient. It measures the similarity between the predicted bounding-box region and the ground-truth bounding-box region, as shown in Eq. (7).

$$\begin{aligned} IoU=\frac{P_{b}\cap G_{b}}{P_{b}\cup G_{b}} \end{aligned}$$
(7)

where \(P_{b}\) is the predicted bounding box and \(G_{b}\) is the ground-truth bounding box.
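A minimal sketch of Eq. (7) for axis-aligned boxes (the coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Eq. (7); boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```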

Table 4 Performance comparison in terms of precision, recall, and F1-score values for different state-of-the-art models

4.3.2 Precision, recall, F1-score

These measurements are expressed at the pixel level in terms of true positives (\(True_{pos}\)), true negatives (\(True_{neg}\)), false positives (\(False_{pos}\)), and false negatives (\(False_{neg}\)), as shown in Eqs. (8), (9), and (10). Precision gives the ratio of correctly predicted positive observations to all predicted positive observations, while recall gives the ratio of correctly predicted positive observations to all observations of the relevant class. The F1-score is the harmonic mean of precision and recall and is used when measuring object detection performance, as shown below.

$$\begin{aligned} Precision= & {} \frac{True_{pos}}{True_{pos}+False_{pos}} \end{aligned}$$
(8)
$$\begin{aligned} Recall= & {} \frac{True_{pos}}{True_{pos}+ False_{neg}} \end{aligned}$$
(9)

where \(True_{pos}\) represents the event in which an infected-region pixel is segmented as an infected-region pixel, \(True_{neg}\) the event in which a background pixel is segmented as background, \(False_{pos}\) the event in which a background pixel is segmented as an infected-region pixel, and \(False_{neg}\) the event in which an infected-region pixel is segmented as background.

$$\begin{aligned} F1-Score=2 \times \left( \frac{Precision \times Recall}{Precision+Recall}\right) . \end{aligned}$$
(10)
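A minimal sketch of Eqs. (8)-(10) from raw pixel counts (the counts below are illustrative, not from the paper's experiments):

```python
def precision_recall_f1(tp, fp, fn):
    """Pixel-level metrics of Eqs. (8)-(10)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=30))  # (0.9, 0.75, ~0.818)
```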

4.3.3 Average precision and mean average precision

Average precision (AP) measures the ratio of correctly predicted pixels to the number of pixels in the ground truth for a single class. Mathematically, it is defined as follows

$$\begin{aligned} Average Precision(AP)=\frac{True_{pos}}{True_{pos}+False_{pos}+False_{neg}}. \end{aligned}$$
(11)

The mean average precision (mAP) is the average of the per-class average precisions, as follows

$$\begin{aligned} mAP={\frac{1}{N} \sum _{i=1}^{N} AP_{i}}. \end{aligned}$$
(12)
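Read literally, Eqs. (11) and (12) reduce to the following per-class computation (the counts are illustrative only):

```python
def average_precision(tp, fp, fn):
    """Per-class AP as defined in Eq. (11)."""
    return tp / (tp + fp + fn)

def mean_average_precision(aps):
    """Eq. (12): mean of the per-class AP values."""
    return sum(aps) / len(aps)

aps = [average_precision(90, 10, 20), average_precision(70, 15, 15)]
print(mean_average_precision(aps))  # (0.75 + 0.70) / 2 = 0.725
```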
Fig. 5 Comparison of the proposed detection results with existing models: [A] original image, [B] ground truth (proposed), [C] ground truth (existing models), [D] SSD, [E] Faster_RCNN, [F] Faster_RCNN+FPN, [G] YOLOv2, [H] YOLOv3, [I] proposed

Table 5 Performance comparisons in terms of total parameters, memory consumption and time with existing models

4.3.4 Results comparison and analysis

Initially, to assess the performance of the proposed segmentation model, two tests were conducted against different state-of-the-art detection models, and the results were compared with the proposed model. One test uses a dataset containing only diseased samples (Test 1), and the other uses a dataset with both healthy and diseased samples (Test 2). The experimental results in terms of precision, recall, F1-score, and mean average precision (mAP) are reported at an IoU threshold of 0.75. As given in Table 4, the proposed DPD-DS system achieves the highest precision, recall, and F1-score rates, while SSD shows the lowest performance among all models. It is well understood from the test results that the proposed DPD-DS gives better detection results than the other detection algorithms.

Further, the mean average precision (mAP) is computed for various state-of-the-art methods and compared with the proposed DPD-DS framework. As shown in Table 4, the proposed model achieves the best mean average precision for Test 1 and Test 2, with mAP values of 0.8500 and 0.8611, respectively. We fine-tuned the model for 100 epochs while empirically recording the validation loss values for Test 1 and Test 2. Figure 6 shows the validation loss curves against epochs of the proposed deep learning framework for Test 1 and Test 2. In both cases, the validation loss falls rapidly in the earlier iterations and decreases slowly afterward. From Fig. 6, it is observed that Test 1 converges faster than Test 2.

Fig. 6 Overall loss curves of the proposed model under two tests. Test 1: [A] classification loss, [B] localization loss, [C] mask loss, [D] total loss. Test 2: [E] classification loss, [F] localization loss, [G] mask loss, [H] total loss

In addition, Table 5 shows the performance of all the models in terms of parameter count, memory, and time consumption. Although YOLOv3 achieves a better mAP than the other baselines, it requires more parameters and training time than the proposed model. The Faster_RCNN model takes the most training time of all the models. The SSD model takes less test time, but more training time than the proposed model. In line with the results reported in Tables 4 and 5, the proposed model achieves satisfactory results; it is efficient not only in terms of mAP but also computationally. Figure 5 depicts the detection results of all models. The Faster_RCNN model is not able to detect all the small diseased portions of the leaves. YOLOv3 and Faster_RCNN+FPN give better detection results, but these methods do not segment the diseased portions pixel by pixel. It is observed from Fig. 5 that the proposed DPD-DS system can detect and segment the infected regions on the leaf images while assigning precise class labels. Moreover, the experimental results confirm that our proposed framework is suitable for the accurate detection of crop diseases. It can be embedded in drones, smartphones, or robots to help farmers and agronomists automate the identification of plant diseases for real-time monitoring of large-scale cultivated crops.

5 Conclusions and future work

In this work, a novel DPD-DS model for the automated diagnosis of plant diseases is proposed. The proposed system includes the creation of diseased image libraries, the generation of the DPD-DS system, and the analysis of performance metrics. Our model extracts useful features by changing the backbone structure, using adjustable anchor proportions with different scales, and tuning the hyperparameters from the experiments, increasing the reliability of disease classification for small infected regions. We assessed the proposed framework on two openly available benchmark datasets of healthy and diseased images. Our proposed system is compared with existing state-of-the-art one-stage and two-stage object detection models, and two experiments are conducted on different state-of-the-art object detection methods to test the efficacy of the proposed model. From the overall experimental results, it is understood that our proposed DPD-DS system gives significant results in terms of precision, recall, and mean average precision. This work has a few limitations: (1) Chinar infected regions are not segmented properly when only a small number of images is available; (2) detection depends on exact ground-truth data (annotations), and currently there are very few publicly available datasets, with a limited number of classes. In the future, we will focus on an ensemble network for plant leaf disease detection by extending the dataset to include more disease classes. Based on symptoms alone, it is difficult to identify diseases at an early stage of growth manually, but deep learning models will simplify this. Our model is designed in such a way that it allows us to detect multiple diseases on leaves.