1 Introduction

The world's agricultural land is sufficient to feed today’s population. The wealth of developing nations rests largely on agricultural production. Early-stage, leaf image-based identification of disease therefore plays an important role in sustaining their economies (Beucher and Meyer 1993).

Szegedy et al. (2013) analyzed deep CNNs for multi-object detection, treating localization and classification with a simplified model that develops an object mask. Tseng et al. (2014) proposed a method for plant disease recognition using tone-based features. Ioffe and Szegedy (2015) introduced batch normalization, which allows a higher learning rate with less careful initialization, in some cases eliminates the need for dropout, and requires less training time. Mahlein (2016) studied crop management in the form of precision agriculture, and defined plant phenotyping as the noninvasive analysis of physiological, biochemical, or anatomical plant properties, for example through data collected by optical, multispectral, and thermal sensors. Kasun et al. (2016) worked on data redundancy reduction with the extreme learning machine auto-encoder (ELM-AE) and the sparse ELM-AE (SELM-AE) to make the system fast.

Mohanty et al. (2016) combined the capabilities of smartphones with computer vision through deep learning for the diagnosis of diseases. Wang et al. (2017) used transfer learning for automatic disease severity detection from apple leaf images. Brahimi et al. (2017) introduced automatic feature extraction through a convolutional neural network (AFE-CNN) and trained it with 14,828 images of tomato leaves to classify nine diseased classes and healthy images. He et al. (2017) presented a mask region-based CNN method (M-R-CNN) that adds to Faster R-CNN a branch for predicting an object mask, at a small additional time cost.

Artificial intelligence-based methodologies have been used for the extraction of visual features, followed by clustering and classification (Alexander et al. 2018). Completely automated features combined with handcrafted visual attributes (CDHVA) have been employed to jointly learn the three fine-tuned layers of pre-trained deep convolutional neural networks (DCNNs) with handcrafted descriptors in Zhang et al. (2018) and Barbedo (2018). Sardogan et al. (2018) presented a system based on color information and vector quantization learning along with a CNN (CICQL-CNN) for the detection and categorization of tomato leaf diseases.

Transfer learning with stacked sparse autoencoder (SSAE) subnetworks was employed to extract deep spatial and spectral features, with one sequential SSAE subnetwork performing the smooth fusion of these deep attributes (Deng et al. 2019; Kurmi and Chaurasia 2020), to reduce the dependency of the system on a large labeled sample dataset. Apart from this, transfer learning (TL) also helps to analyse individual lesions and spots in images and to identify and classify multiple diseases on a single leaf (Arnal Barbedo 2019). Bauer et al. (2019) presented AirSurf, a hybrid system that combines computer vision, machine learning, and software engineering (Kurmi and Chaurasia 2020) for measuring yield-related phenotypes from aerial imagery.

A large group of researchers (Aversano et al. 2020; Zeng et al. 2020) applied transfer learning to various standard deep learning models (AlexNet, Inception v3, DenseNet-169, SqueezeNet-1.1, ResNet-34, and VGG13) with dataset augmentation through Generative Adversarial Networks (GAN) (Li et al. 2021; Liu and Wang 2020) for classifying citrus leaf images. An AlexNet-based multi-scale CNN has been presented in Lv et al. (2020), with batch normalization to prevent overfitting and with the AdaBound optimizer and the parametric rectified linear unit function to increase the system's robustness.

Karthik et al. (2020) presented a cascaded connection of deep CNN models for significant feature extraction and context-relevant attention. The first model gives equal importance to each feature, while the second model decides the weights based on the relevant context.

Nagasubramanian et al. (2021) proposed a system that continuously observes crop growth and leaf diseases to advise farmers in need. The proposed framework uses machine learning techniques such as support vector machines and convolutional neural networks to provide analytical statistics on plant growth and disease patterns. It employs an ensemble nonlinear support vector machine (ENSVM) within Ensemble Classification and Pattern Recognition for Crop Monitoring System (ECPRC) to identify plant diseases at the early stages.

The existing works still show limited performance and accuracy on multi-class categorization of diseased data. This work presents a novel leaf color information-based localization method for region-of-interest segmentation, followed by CNN-based categorization. The proposed technique outperforms the conventional localization approaches in terms of accuracy and G-mean. To localize the region that confines the disease, we employ the seed-based region growing approach (Callara et al. 2020). Mixture model-based region expansion is employed to refine the leaf area. Applying the CNN model to the localized images provides discriminative feature extraction. A compact representation of these features is obtained using the dropout method, which yields strong results for the classification of leaf images.

The rest of the article is organized as follows. Section 2 covers the materials and methods: the dataset, the proposed localization approach, and the developed convolutional neural network-based feature extraction and categorization. Section 3 presents the results obtained from various models and compares them with the proposed model. Finally, Section 4 provides the conclusion with future research directions.

2 Materials and method

2.1 Dataset

We use the PlantVillage dataset, which comprises 54,309 labeled images covering 14 crops and their diseases. From it we take three crops: bell pepper, potato, and tomato, as shown in Fig. 1. The bell pepper dataset has 1478 healthy and 997 bacterial spot diseased images. The potato crop has three categories: 1000 early blight, 1000 late blight, and 152 healthy images. The tomato crop has ten categories: target spot with 1404 images, mosaic virus with 373, yellow leaf curl virus with 3209, bacterial spot with 2127, early blight with 1000, healthy with 1591, late blight with 1909, leaf mold with 952, septoria leaf spot with 1771, and spider mites with 1676.

Fig. 1

The PlantVillage dataset description. The left image shows the original number of images, while the right image illustrates the number of images after augmentation for dataset balancing

2.2 Proposed method

The proposed system for plant leaf classification is depicted in Fig. 2 and comprises two steps: 1) image localization using segmentation and 2) feature extraction for classification.

Fig. 2

The process flow of developed localization and classification technique

The proposed localization method is expressed in three steps. The first step covers foreground segmentation, i.e. extraction of the leaf area from the background. The second step describes the initial seed extraction for the region growing approach, and the third step is the region growing itself. The workflow of the proposed method is given in Fig. 2.

2.2.1 Preprocessing

The leaf images contain some artifacts (noise). The environmental noise that affects the sensors during image acquisition is called speckle noise. A pre-processing mechanism is therefore needed to counter this problem. We analyze the pixels of the leaf area through the histogram of their intensities and apply auto-thresholding (Torr and Murray Sep. 1997). A pixel whose intensity is below the local threshold is taken as a background pixel.

2.2.2 Initial seed points selection

In the leaf region, the green component takes larger values and the blue component smaller values than in the background. The green-to-blue ratio (G/B) therefore marks the seed region efficiently.

A global threshold has been used to separate the foreground and background (Al-Kofahi et al. April 2010), but it is acceptable only up to a certain level and needs to be complemented by a local-level analysis (Ridler and Calvard Aug 1978). The combined effect of local and global thresholding provides efficient leaf region extraction, with an empirically selected window size of at least 9\(\times \)9 for local thresholding.
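The seed marking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the global threshold of 1.0, and the `local_scale` factor applied to the local window mean are hypothetical choices; only the G/B ratio and the idea of combining a global with a windowed local threshold come from the text.

```python
def gb_ratio(image):
    """Per-pixel green-to-blue ratio of an RGB image given as rows of (r, g, b) tuples."""
    eps = 1e-6  # assumption: small constant to avoid division by zero
    return [[g / (b + eps) for (_, g, b) in row] for row in image]


def local_mean(ratio, i, j, half):
    """Mean of the ratio map in a window of half-width `half`, clipped at the borders."""
    rows = range(max(0, i - half), min(len(ratio), i + half + 1))
    cols = range(max(0, j - half), min(len(ratio[0]), j + half + 1))
    values = [ratio[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def seed_mask(image, global_thr=1.0, window=9, local_scale=0.9):
    """Mark a pixel as seed when its G/B ratio passes both thresholds.

    global_thr and local_scale are illustrative values, not taken from the paper.
    """
    ratio = gb_ratio(image)
    half = window // 2
    mask = []
    for i, row in enumerate(ratio):
        mask_row = []
        for j, value in enumerate(row):
            local_thr = local_scale * local_mean(ratio, i, j, half)
            mask_row.append(value > global_thr and value > local_thr)
        mask.append(mask_row)
    return mask


# Toy image: two leaf-like columns (green dominant) and two background columns.
leaf, background = (30, 200, 50), (120, 120, 200)
toy = [[leaf, leaf, background, background] for _ in range(4)]
mask = seed_mask(toy, window=3)
```

On the toy image, the two green-dominant columns are marked as seed pixels and the two blue-dominant columns are rejected by the global test.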

2.2.3 Region growing framework

The pixels neighboring the initialized leaf region are chosen from the seeds and merged into it for the region growing. Morphological operations are non-linear operations on the shapes in an image. A repetitive, shape-based erosion (Haralick et al. 1989) was employed to shrink the leaf and initialize the seed. The region growing approach extracts the homogeneous region in the neighborhood, which may not be the exact region of interest. To mitigate this, we have applied the mixture model-based region growing method (Callara et al. 2020). A global Gaussian distribution is combined with the prior information obtained from the region growing approach.

Consider an image Y with an associated K-class classification pattern X, where a point l with intensity \(y_l\) is classified as belonging to class \(x_l \in \{1,\ldots K \}\). The class model is a mathematical description of the conditional probability \(P(y_l|x_l = k)\) (Calapez and Rosa Sep. 2010). The kth class is described by the linear mixture model:

$$\begin{aligned} \psi _k (y)= \alpha _k \Psi _B (y-K_0) +(1-\alpha _k) \Psi _{Sk} (y-K_0) \end{aligned}$$
(1)

where y represents the intensity level of the pixel, \(\alpha _k\) denotes the mixture parameter, and \(K_0\) denotes the system offset. \(\Psi _B\) represents the distribution of the background pixels, which is generally normal with mean \(K_0\) and variance \(v_B\), and \(\Psi _{Sk} \) signifies the intensity distribution of the kth class pixels, given by a negative binomial distribution with mean \(\mu _{Sk}\) and variance \(v_{Sk}\). As per Calapez and Rosa (Sep. 2010), the parameters of the negative binomial distribution are given by:

$$\begin{aligned} p_K= \frac{\mu _{Sk} }{v_{Sk}} \end{aligned}$$
(2)

and

$$\begin{aligned} r_K= \frac{\mu _{Sk}^2 }{v_{Sk}-\mu _{Sk}} \end{aligned}$$
(3)
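As a small sketch of Eqs. 2 and 3, the method-of-moments estimates can be computed from the sample mean and variance of the pixel intensities. The function name and the use of the population variance are assumptions made for illustration; the estimate is only defined when the variance exceeds the mean.

```python
def negative_binomial_moments(samples):
    """Method-of-moments estimates (p, r) following Eqs. 2 and 3.

    Hypothetical helper: uses the population variance (divide by N).
    """
    n = len(samples)
    mu = sum(samples) / n
    var = sum((s - mu) ** 2 for s in samples) / n
    if var <= mu:
        # Eq. 3 needs v > mu (over-dispersion); otherwise r is undefined or negative.
        raise ValueError("variance must exceed the mean for a negative binomial fit")
    p = mu / var                # Eq. 2
    r = mu * mu / (var - mu)    # Eq. 3
    return p, r


intensities = [0, 0, 2, 2, 4, 10]  # toy, offset-corrected intensities
p, r = negative_binomial_moments(intensities)
```

A quick consistency check is that the fitted distribution reproduces the sample moments: its mean \(r(1-p)/p\) and variance \(r(1-p)/p^2\) equal the sample mean and variance.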

For the region growing, a single class k of pixels is assumed to be present at the local level. To this end, a broad model for a pixel \(y_l\) is defined by five distribution parameters:

$$\begin{aligned} \begin{aligned} \Psi _K (y_l;K_0, v_B, \alpha , r, p)&= \alpha \frac{1}{Z(v_B)} exp(-\frac{(y_l-K_0)^2}{2v_B}) \\&\quad + (1-\alpha ) \frac{\Gamma (y_l-K_0+r)}{(y_l-K_0)!\Gamma (r)}p^r(1-p)^{y_l-K_0} \end{aligned} \end{aligned}$$
(4)

where \(\alpha \in [0, 1]\). \(Z(v_B)\) is a normalizing parameter depending on variance \(v_B\). The fitting of the model is performed using an expectation-maximization (EM) technique in which:

  1. The parameters p and r are computed through the method of moments (Eqs. 2, 3).

  2. Mean \(K_0\) and variance \(v_B\) are calculated by maximizing the log-likelihood

    $$\begin{aligned} L(\Theta |Y,X) = \sum _{y=\mathrm{min}(Y)}^{\mathrm{max}(Y)} \mathrm{ln}(\Psi ) \end{aligned}$$
    (5)

  3. \(\alpha \) is given by the posterior density for \(N_s\) samples of pixel values {\(\alpha _y\) \(\forall \) \(y=1, 2,\ldots N_s\)}

    $$\begin{aligned} \alpha = \frac{ \sum _{y=\mathrm{min}(Y)}^{\mathrm{max}(Y)} \alpha _y}{N_s} \end{aligned}$$
    (6)
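The mixture density of Eq. 4 and the update of \(\alpha \) in Eq. 6 can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, \(Z(v_B)\) is assumed to be the usual Gaussian normalizer \(\sqrt{2\pi v_B}\), and \(\alpha _y\) is assumed to be the posterior probability that pixel y belongs to the Gaussian (background) component.

```python
import math


def gaussian_term(y, k0, v_b):
    """Background component of Eq. 4; Z(v_B) = sqrt(2*pi*v_B) is an assumption."""
    z = math.sqrt(2.0 * math.pi * v_b)
    return math.exp(-((y - k0) ** 2) / (2.0 * v_b)) / z


def negative_binomial_term(y, k0, r, p):
    """Signal component of Eq. 4, evaluated in log space for numerical stability."""
    n = y - k0
    if n < 0:
        return 0.0  # the negative binomial has no mass below the offset
    log_pmf = (math.lgamma(n + r) - math.lgamma(n + 1) - math.lgamma(r)
               + r * math.log(p) + n * math.log(1.0 - p))
    return math.exp(log_pmf)


def mixture_density(y, k0, v_b, alpha, r, p):
    """Eq. 4: alpha * Gaussian + (1 - alpha) * negative binomial."""
    return (alpha * gaussian_term(y, k0, v_b)
            + (1.0 - alpha) * negative_binomial_term(y, k0, r, p))


def update_alpha(pixels, k0, v_b, alpha, r, p):
    """Eq. 6: average the per-pixel posteriors of the background component."""
    posteriors = []
    for y in pixels:
        bg = alpha * gaussian_term(y, k0, v_b)
        fg = (1.0 - alpha) * negative_binomial_term(y, k0, r, p)
        total = bg + fg
        posteriors.append(bg / total if total > 0.0 else 0.0)
    return sum(posteriors) / len(posteriors)


near_offset = update_alpha([10, 10, 11], k0=10, v_b=1.0, alpha=0.5, r=2.0, p=0.1)
far_from_offset = update_alpha([40, 50], k0=10, v_b=1.0, alpha=0.5, r=2.0, p=0.1)
```

With these toy parameters, pixels close to the offset \(K_0\) are attributed almost entirely to the background component, while pixels far above it are attributed to the signal component.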
Fig. 3

Sample images: the original images in the top row and the corresponding background-separated images in the bottom row

Homogeneity-based region growing is applied by establishing local threshold levels for the confocal dataset. The background statistics and the signal distribution are combined in a linear mixture model (MM) to determine the likelihood that a given pixel (voxel) belongs to the foreground. The rule to grow regions is then designed from these probabilities. The initial seed is considered the central point of the region of interest. The homogeneity property of the localized area is then derived from an image volume centered on the seed. A generalized threshold using the Otsu thresholding method (Siddique et al. 2018) has been employed for segmentation, which is an optimum solution for a multimodal distribution (Ng 2006). The mixture of the normally distributed background and the negative binomial signal is fitted through an expectation-maximization (EM) method (Callara et al. 2020).

All original images are RGB images with values in the range 0 to 255, as shown in Fig. 3. These values would be too large for the proposed model to process, so each pixel value is rescaled from the [0, 255] range to the [0, 1] range by a factor of 1/255.

2.3 Convolutional neural network model

2.3.1 CNN architecture

CNNs are feed-forward neural networks made up of several layers, and they have been shown to be very effective in the recognition and classification of images. CNNs comprise filters (kernels) and neurons that have learnable weights and biases. Each filter receives inputs, transforms them, and optionally follows the result with a nonlinearity (Uçar et al. 2017). Figure 4 shows the CNN architecture, which comprises convolutional, pooling, fully connected, and rectified linear unit (ReLU) layers.

Fig. 4

The proposed CNN architecture for the attributes retrieval and classification of various diseases for different crops

The 256\(\times \)256\(\times \)3 RGB input image is converted into a gray image as in Fig. 5. Sixteen convolution filters of size 3\(\times \)3 produce a 128\(\times \)128\(\times \)16 output, followed by the ReLU activation function as illustrated in Fig. 6. The width of the convolutional layers (the number of channels) is initially 16 and doubles with each convolution layer. After pooling, 64\(\times \)64\(\times \)16 image patches are obtained as depicted in Fig. 7.

Fig. 5

The image at first layer of size 256\(\times \)256

Next, 32 convolution filters of size 3\(\times \)3 provide a 32\(\times \)32\(\times \)32 output. Pooling provides downsampled images of dimension 16\(\times \)16\(\times \)32. Then, 64 convolution filters provide a 16\(\times \)16\(\times \)64 output, as shown in Fig. 8.

Fig. 6

The original image at second convolutional layer of size 128\(\times \)128

The pooling provides the downsampled images of dimension 8\(\times \)8\(\times \)64. Further, the number of channels is taken as 128 (convolution filters) of 3\(\times \)3 sizes that provide 8\(\times \)8\(\times \)128.

Fig. 7

The original image at the third convolutional layer of size 64\(\times \)64

The vertical and horizontal center lines of each patch are located to keep the information intact, and max pooling of the remaining 8\(\times \)8\(\times \)256 output provides downsampled images of dimension 4\(\times \)4\(\times \)256.

This output then feeds a fully connected layer of size 1\(\times \)1\(\times \)4096, followed by a second fully connected layer of size 1\(\times \)1\(\times \)256 and a softmax layer whose number of nodes equals the number of output classes.
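The layer dimensions quoted above can be cross-checked with a short bookkeeping sketch. The stage labels are hypothetical names introduced only for illustration, and the shapes are copied from the text rather than derived from any stride or padding rule; the sketch only verifies that the channel width doubles at each convolution stage and that flattening the last pooled output gives the 4096 inputs of the first fully connected layer.

```python
# (label, height, width, channels) as quoted in the text; labels are illustrative.
STAGES = [
    ("input", 256, 256, 3),
    ("conv_a", 128, 128, 16),
    ("pool_a", 64, 64, 16),
    ("conv_b", 32, 32, 32),
    ("pool_b", 16, 16, 32),
    ("conv_c", 16, 16, 64),
    ("pool_c", 8, 8, 64),
    ("conv_d", 8, 8, 128),
    ("conv_e", 8, 8, 256),
    ("pool_e", 4, 4, 256),
]


def flattened_size(height, width, channels):
    """Number of values obtained when a feature volume is flattened to a vector."""
    return height * width * channels


def conv_channels(stages):
    """Channel counts of the convolution stages, in order."""
    return [c for (label, _, _, c) in stages if label.startswith("conv")]


channels = conv_channels(STAGES)
_, h, w, c = STAGES[-1]
fc_inputs = flattened_size(h, w, c)
```

Here `fc_inputs` evaluates to 4 × 4 × 256 = 4096, which matches the stated size of the first fully connected layer.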

Fig. 8

The original image at fifth convolutional layer of size 16\(\times \)16

2.3.2 Convolutional layer

The convolutional layer is the main building block of a convolutional network and performs several computational tasks. Its primary purpose is to extract features from the input image. By learning image features, it preserves the spatial relationship among pixels, operating on small squares of the given image. The input image is convolved with a group of learnable filters. This generates an activation or feature map at the output, and the feature maps are subsequently passed to the next convolution layer as input data.

2.3.3 Pooling layer

The pooling layer decreases the dimensionality of every activation map while retaining the most relevant information. The input is split into a group of non-overlapping rectangles, and a non-linear operation such as the average or maximum down-samples each region. A pooling or sub-sampling layer is added after a convolution layer once the feature maps have been obtained. Reducing the dimensionality lowers the computing power needed for data processing, shortens training time, and helps control overfitting. The max-pooling layer usually follows a rectified linear unit (ReLU) activation layer. Here, we use max-pooling with a window size of 2\(\times \)2 pixels. After the pooling, the feature map is obtained by applying the ReLU activation function.
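A minimal sketch of 2\(\times \)2 max-pooling on a single feature map is given below. The function name is hypothetical, and the sketch assumes non-overlapping windows (stride 2) and drops a trailing row or column when the size is odd.

```python
def max_pool_2x2(feature_map):
    """Down-sample a 2-D feature map with non-overlapping 2x2 max-pooling."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - 1, 2):
        pooled_row = []
        for j in range(0, cols - 1, 2):
            window = (
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            pooled_row.append(max(window))  # keep the strongest response
        pooled.append(pooled_row)
    return pooled


example = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(example)  # [[4, 2], [2, 8]]
```

Each output value is the maximum of one 2\(\times \)2 block, so a 4\(\times \)4 map becomes 2\(\times \)2, halving both spatial dimensions.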

2.3.4 ReLU layer

The ReLU layer is an element-wise non-linear operation built from rectifier units. It is applied per pixel, and all negative values in the feature map are replaced by zero. The ReLU activation function is defined as

$$\begin{aligned} F(x) = \mathrm{max}(0, x) \end{aligned}$$
(7)

where x is the weighted sum of inputs.
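Eq. 7 applied element-wise to a feature map can be sketched in a few lines; the function name is a hypothetical helper.

```python
def relu_map(feature_map):
    """Apply F(x) = max(0, x) to every element of a 2-D feature map (Eq. 7)."""
    return [[max(0.0, x) for x in row] for row in feature_map]


activated = relu_map([[-1.0, 2.0], [0.0, -3.5]])  # [[0.0, 2.0], [0.0, 0.0]]
```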

2.3.5 Fully connected layer

When every unit of the preceding layer is connected to every unit of the next layer, the layer is called a fully connected layer (FCL). The outputs of the convolutional, pooling, and ReLU layers represent high-level features of the input image. The purpose of the FCL is to use all these features to assign the input image to one of the classes, based on the training set. The final FCL uses softmax as its activation function to turn the features into class scores for the classifier. A softmax output layer is used for multi-class classification; it is a generalization of logistic regression that normalizes an input vector into a vector of values that follows a probability distribution summing to 1. The softmax activation function is defined as:

$$\begin{aligned} \sigma (x_j) = \frac{e^{x_j}}{\sum _{k=1}^{n} e^{x_k}} \end{aligned}$$
(8)

where x is a vector of the inputs to the output layer, j indexes the output unit, and n is the number of output units.
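Eq. 8 can be sketched as follows. The function name is hypothetical, and subtracting the maximum input before exponentiation is a standard numerical-stability step that leaves the result of Eq. 8 unchanged.

```python
import math


def softmax(inputs):
    """Softmax of Eq. 8, shifted by max(inputs) to avoid overflow in exp()."""
    shift = max(inputs)
    exps = [math.exp(x - shift) for x in inputs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]   # toy inputs to the output layer
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```

The outputs are non-negative and sum to 1, and the predicted class is the index of the largest probability.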

2.3.6 Optimization

For DL models, the right choice of optimization algorithm can significantly reduce training time and improve accuracy. Adaptive moment estimation (ADAM) was first reported in 2014 as an optimization algorithm to train deep neural networks (DNN) with adaptive learning rates. The ADAM optimizer is gaining enormous popularity in DL applications such as computer vision. It is an improved version of the traditional stochastic gradient descent algorithm and shows better results than classical stochastic gradient descent. The ADAM optimizer calculates an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients. The intuition behind ADAM is that we do not want to roll too fast, because we could jump over the minimum; instead, we decrease the velocity a little for a careful search. The weight update equations of ADAM are given by

$$\begin{aligned}&m_t = B_1 m_{t-1} + (1-B_1)g_t \end{aligned}$$
(9)
$$\begin{aligned}&v_t = B_2 v_{t-1} + (1-B_2)g_t^2 \end{aligned}$$
(10)

where \(m_t\) and \(v_t\) are estimates of first and second order moment respectively.

$$\begin{aligned}&m_t' = \frac{m_t}{1-B_1^t} \end{aligned}$$
(11)
$$\begin{aligned}&v_t' = \frac{v_t}{1-B_2^t} \end{aligned}$$
(12)

where \(m_t'\) and \(v_t'\) are bias corrected estimates of first and second moment, respectively. Finally, we update the parameter as shown below,

$$\begin{aligned} W_n = W_0 - \frac{nm_t'}{\sqrt{v_t'+\epsilon }} \end{aligned}$$
(13)

where \(W_n\) is the updated weight, \(W_0\) is the old weight, and n is the learning rate. \(B_1\), \(B_2\), and \(\epsilon \) are hyperparameters.
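A single-parameter sketch of Eqs. 9 to 13 is shown below. It follows Eq. 13 as written, with \(\epsilon \) inside the square root; many implementations instead add \(\epsilon \) to the root. The function name is hypothetical, and the default values (learning rate 0.001, \(B_1 = 0.9\), \(B_2 = 0.999\), \(\epsilon = 10^{-8}\)) are the commonly suggested defaults, assumed here because the text does not state the values used.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar weight; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # Eq. 9: first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad   # Eq. 10: second moment
    m_hat = m / (1.0 - beta1 ** t)                # Eq. 11: bias correction
    v_hat = v / (1.0 - beta2 ** t)                # Eq. 12: bias correction
    w = w - lr * m_hat / math.sqrt(v_hat + eps)   # Eq. 13 as written in the text
    return w, m, v


# Toy problem: minimise J(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
first_w, _, _ = adam_step(w, 2.0 * w, m, v, t=1)
for step in range(1, 101):
    w, m, v = adam_step(w, 2.0 * w, m, v, t=step, lr=0.01)
```

On the first step the bias-corrected ratio \(m_t'/\sqrt{v_t'}\) is close to 1 in magnitude, so the weight moves by roughly one learning rate regardless of the gradient scale.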

The stochastic gradient descent (SGD) method is used with a learning rate of 0.01, and its weight update equation is given by

$$\begin{aligned} W_n = W_0 - n \triangledown J (W) \end{aligned}$$
(14)

where \(W_n\) is the new weight, \(W_0\) is the initial weight, n is the learning rate, and \(\triangledown J (W)\) represents the gradient with respect to the parameter W. After that, a Flatten function is used to convert all the pooled images into one continuous vector. No additional parameters are required here, since Keras understands that the classifier object already holds the pooled image pixels that need to be flattened. In the next step, two Dense functions, each a fully connected layer, are used. The first dense layer takes the vector obtained above as the input to the neural network and provides an output of 4 classes using the ReLU activation function. In the next dense layer, a softmax function is used to determine the specific target output. The ADAM optimizer has been used for better results.

2.4 Performance parameters

The segmentation system is evaluated and compared in terms of \( F_1 \)-score, modified Hausdorff distance (MHD), and Dice similarity coefficient (DC) (Kurmi et al. 2019; Dubuisson and Jain 1994). The performance of the classification method is defined in terms of accuracy (Ac) (Kurmi and Chaurasia Aug. 2018; Chaurasia and Chaurasia 2016) and the receiver operating characteristic (ROC) curve (Fawcett 2006; Kurmi and Gangwar 2021), along with the area under the characteristic curve (AUC) (Fawcett 2006). For a good classification system the AUC should be high (Kurmi et al. 2021). The evaluation measures are defined as:

$$\begin{aligned}&True~ positive ~rate ~(TPR)= (Sensitivity)= \frac{Tr_{Po}}{Tr_{Po}+Fa_{Na}},~~ TNR= \frac{Tr_{Na}}{Tr_{Na}+Fa_{Po}} \end{aligned}$$
(15)
$$\begin{aligned}&False ~ positive ~rate ~(FPR)= (1-Specificity) = \frac{Fa_{Po}}{Fa_{Po}+Tr_{Na}} \end{aligned}$$
(16)

where \(Tr_{Po}\) signifies the number of correctly identified positive samples, \(Tr_{Na}\) indicates the correctly classified negative samples, \(Fa_{Po}\) represents the negative samples falsely marked as positive, and \(Fa_{Na}\) is the number of positive samples incorrectly classified as negative. The true negative rate (TNR) applies to the major class and the false positive rate (FPR) to the minor class. Precision, recall, and G-mean are defined as:

$$\begin{aligned}&Precision= \frac{Tr_{Po}}{Tr_{Po}+Fa_{Po}} ~~and~~ Recall = \frac{Tr_{Po}}{Tr_{Po}+Fa_{Na}} \end{aligned}$$
(17)
$$\begin{aligned}&G\text{-}mean = \left( \prod _{k=1}^{m}Recall_k\right) ^{\frac{1}{m}} \end{aligned}$$
(18)

where m is the number of classes. The G-mean is the geometric mean of the per-class recalls. The overall accuracy does not provide a true score when the number of samples is imbalanced across the classes. The G-mean compensates for this imbalance, since a poor recall on any class, including a minority class, lowers the score (He and Garcia Sep. 2009). The ROC curve plots the true positive rate against the false positive rate of the model at different cutoff points to assess the performance of the system. The area under the curve (AUC) (Fawcett 2006; Huang and Ling Mar. 2005) represents how well the binary classes (one vs. all in the case of multiple classes) can be separated, with the ideal point at (0,1), where there are no misclassifications. A higher AUC implies better performance of the model. The accuracy is given by:

$$\begin{aligned} Ac=(Tr_{Po}+Tr_{Na})/(Tr_{Po}+Fa_{Po}+Fa_{Na}+Tr_{Na}) \end{aligned}$$
(19)
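The measures of Eqs. 15 to 19 can be sketched from confusion counts as follows. The function names and the toy counts are hypothetical; the confusion matrix is assumed to be indexed as `confusion[true_class][predicted_class]`.

```python
import math


def binary_rates(tp, fp, fn, tn):
    """TPR, TNR, FPR, precision, and accuracy from binary confusion counts."""
    return {
        "tpr": tp / (tp + fn),                        # Eq. 15 (sensitivity, recall)
        "tnr": tn / (tn + fp),                        # Eq. 15
        "fpr": fp / (fp + tn),                        # Eq. 16
        "precision": tp / (tp + fp),                  # Eq. 17
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # Eq. 19
    }


def per_class_recall(confusion):
    """Recall of each class: diagonal entry divided by the row total."""
    return [row[k] / sum(row) for k, row in enumerate(confusion)]


def g_mean(recalls):
    """Eq. 18: geometric mean of the per-class recalls."""
    return math.prod(recalls) ** (1.0 / len(recalls))


rates = binary_rates(tp=8, fp=2, fn=2, tn=88)

# Imbalanced toy example: 100 majority samples, 10 minority samples.
confusion = [[90, 10], [5, 5]]
recalls = per_class_recall(confusion)
score = g_mean(recalls)
overall_accuracy = (90 + 5) / 110
```

In the imbalanced toy example the overall accuracy is about 0.86, while the G-mean is about 0.67, because the minority class is recalled only half of the time.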

3 Result and discussion

The performance of the proposed localization-based classification technique is analyzed using three existing PlantVillage datasets of leaf images of bell pepper, potato, and tomato crops. The simulation has been performed using Python and its supporting packages: the TensorFlow backend (Team 2018), the Keras API (Keras 2018), and the Scikit-learn library (Pedregosa et al. 2011). A personal desktop (CPU: Core i5 processor 2.30 GHz, RAM: 16 GB) with Google Colab was used to train and test the network.

3.1 Evaluation of segmentation work

The performance of the proposed leaf region extraction has been compared with three state-of-the-art methods. The set of images used for the segmentation performance analysis comprised 150 images, 10 from each class. The LSSC (Soares and Jacobs 2013) and SFAT (Sharma et al. 2017) approaches offered \(F_1\)-scores of 0.862 and 0.868, respectively. The DC (MHD) values for the LSSC and SFAT methods are 0.725 (10.58) and 0.769 (10.32), respectively. The \( F_1 \)-score provided by the proposed method is 0.908, with 0.817 DC and 7.54 MHD, which is better than the state-of-the-art methods as shown in Table 1.

Table 1 Segmentation performance measures comparison with existing approaches

The complexity of the SFCC (Biswas et al. 2014) and SFAT (Sharma et al. 2017) approaches is approximately of order \(O(N^3)\). The computations of the proposed segmentation system, on the other hand, are on par with LSSC (Soares and Jacobs 2013), of order \(O(N^2)\).

3.2 Classification work evaluation

For the evaluation of the classification work, 20\(\%\) of the total images are reserved for testing and the remaining images are used for training with 10-fold cross-validation. The comparison of the proposed classification work with existing models is provided in Table 2. For the classification of the tomato dataset, the SELM-AE (Singh and Misra 2017) method shows 0.887 Ac and 0.918 AUC. The categorization of potato images has been performed by SELM-AE with 0.914 Ac and 0.882 AUC.

Table 2 A comparative analysis of the performance measures as Average accuracy (Ac) as well as area under the characteristics curve (AUC) on considered dataset for proposed method

The DLLA (Bharali et al. 2019) method offered accuracies of 0.916, 0.917, and 0.931 for the tomato, potato, and bell pepper datasets, respectively, with AUC values of 0.922, 0.908, and 0.928. The accuracy of the CLIQL-CNN (Kaur et al. May 2018) approach is 0.904, 0.908, and 0.948 for the tomato, potato, and pepper datasets, respectively.

One additional analysis has been performed using the dataset of all classes for training and testing of the model. The confusion matrix for all 15 classes is shown in Fig. 9. It illustrates all categories, with the diagonal elements as the correctly categorized classes and the off-diagonal elements as the wrongly classified ones.

Fig. 9

The confusion matrix shows the comparison of the proposed system for internal classes categorization accuracy

The performance measure of the proffered approach as compared with existing approaches in terms of AUC is illustrated in Fig. 10.

Fig. 10

The comparison of ROC curves for the proposed system with different existing approaches

The DLDIC (Hang et al. 2019) approach shows the lowest AUC value, 0.879, and the PDDD technique offers 0.893. The proposed method shows 0.942 AUC, about 5\(\%\) better than the existing methods. Accuracy alone does not always provide a fair comparison, so the G-mean is also reported, as shown in Fig. 11.

Fig. 11

The G-mean comparison of the proposed technique with different existing classification methods

The PDDD approach offers 0.934 G-mean while the DLLA technique provides 0.943 and the DLDIC method gave 0.948 average G-mean. The proffered approach provides a better G-mean of 0.952 than the existing approaches.

Fig. 12

Analysis of training time (in minutes) for the proposed system with existing techniques

The training time (in minutes) has also been analyzed for the different methods and is shown in Fig. 12. The PDDD technique is the most efficient in terms of training time, at 140 min, while SELM-AE is the most costly. The DLLA (Bharali et al. 2019) method requires 150 min of average training time, while the DLDIC (Hang et al. 2019) approach needs 143 min. The average time required to train the proposed method is 480 min. The classification accuracy of DLDIC (Hang et al. 2019) is on par with that of the proposed method, but its computational complexity is higher. Hence, comparing accuracy, AUC, and time complexity, the proposed method provides better performance measures than the existing approaches.

4 Conclusion

Image signal processing systems have broad application in almost every domain of science, engineering, and management. Here we discuss their application in agro-economic growth systems such as vegetation measurement, vigor diagnosis, and phenotyping. The proposed region localization-based deep CNN learning system offers discriminative features to identify the crops as well as the diseases. The leaf region localization was carried out using the leaf color properties and a region growing approach. The localization system achieves a segmentation \(F_1\)-score of 0.916, with a Dice coefficient of 0.824 and a modified Hausdorff distance of 7.29. The mean classification Ac and AUC are 0.942 and 0.948, respectively, with a G-mean score of 0.952 and a training time of 480 min. Hence, the proposed system gives better performance measures than the state-of-the-art techniques. The proposed work can also be employed for other crop identification and disease classification tasks.