Introduction

Coffee is widely regarded as the second-largest traded commodity globally, following crude oil, and is a highly beneficial agricultural product. Around fifteen billion coffee trees are cultivated worldwide, and coffee production supports around 25 million producers in various countries while catering to global demand. The commercial coffee industry predominantly revolves around two main species: Arabica coffee (Coffea arabica L.) and Robusta coffee (Coffea canephora). Arabica coffee predominantly flourishes in damp tropical highlands, while Robusta coffee can prosper at sea level but is also commonly cultivated in these moist tropical highlands [1]. Brazil, the largest producer, accounts for roughly 40% of global production. Since 1849, Arabica coffee has thrived in the highlands of Thailand, where its cultivation has proved highly appropriate and feasible as a lucrative crop offering a sustainable alternative to opium, which is derived from the dried latex of the opium poppy (Papaver somniferum) [2]. The most controversial challenge in the Ethiopian coffee sector is producers' substitution of khat (Catha edulis), a plant with stimulating characteristics, for coffee because of its profitability [3, 4]. Much of the harvest is performed by small, low-profile coffee-growing families; pests and diseases harm and sicken the plants, decreasing production and affecting the crop's quality and flavour [5]. A mobile application using fuzzy inference was developed to diagnose crop disease [6]. A deep neural network model has been used to quantify and categorize coffee leaf rust and leaf miner symptoms, achieving above 97% classification accuracy and making it possible to evaluate the severity of biotic stress [7, 8]. Small producers need help to keep a steady supply of premium coffee on the market [9]. Colombia grows only Arabica coffee beans, which have a softer flavour than Canephora (Robusta) coffees and are more sought after on international markets. Most harvesting is done by medium-sized or smaller families specialising in coffee cultivation; when plants are attacked by pests, both quality and quantity can suffer [10]. The growing severity of poor cultivation has impacted quality and quantity indicators over the past few decades [2]. Coffee leaf rust (CLR) [11], the coffee borer beetle (Hypothenemus hampei) [12], the coffee leaf miner, the citrus mealybug, the root borer, the red spider mite, and the coffee stem mite severely damage coffee plants, affecting crop yield [13, 14]. The high level of horizontal resistance and the tremendous genetic variety of coffee populations may have kept the rust at bay. CLR aggression is altitude dependent, with more aggressive behaviour at lower elevations and warmer temperatures. Coffee leaf rust stands as the most devastating coffee disease that affects C. arabica plants [15, 16]. In the late nineteenth century, CLR damaged Arabica crops across large areas of Ceylon. Many countries use pesticides to prevent rust, which still causes production declines of 20% or more in some countries [17]. CLR was only found in Ethiopia in 1934, although it had been present in other countries for long periods without spreading epidemics or wiping out specific C. arabica types; coffee and rust have coexisted for a very long period. Coffee berry disease (CBD) was found for the first time in Kenya in 1922, west of the Rift Valley and close to Mount Elgon.
Some reports found that it damages crops by up to 75%. As a result, coffee growing west of the Rift Valley ended, and tea plantations became the dominant form of agriculture in the region. The Rift Valley's arid climate long kept the disease from spreading to the country's crucial coffee-growing regions in the highlands east of the Rift, where Rayner first noted the presence of CBD in 1951 [18, 19]. Deep learning is a new class of models that has emerged in artificial intelligence owing to the tremendous growth in machine learning's practical applications over the past few years. The deep learning approach known as the Convolutional Neural Network (CNN) has achieved excellent results in image recognition. A CNN automatically learns relevant characteristics from the training dataset, eliminating the need for the manually crafted features, based on prior problem understanding, used in traditional methods. Moreover, a CNN integrates the segmentation step within its convolution filters, simplifying its usage further [19, 20]. According to the International Coffee Organisation, Brazil is the world's top producer of coffee. Coffee plantations encounter various biotic stresses like leaf miners, rust, and brown leaf spot; these stresses lead to defoliation, reduced photosynthesis, and, ultimately, a negative impact on crop yield [21]. In an earlier system, a threshold-based method was used to assess the various symptoms and signs associated with coffee leaves rather than categorising them. Leaf miner and rust are the two categories used to classify the symptoms, and both relied on handcrafted components. The segmentation of lesions yielded data demonstrating that the symptoms could be differentiated and rated on a severity scale. While the results were satisfactory, this segmentation technique fails to account for the interplay between pixel positions, which can be altered by lighting and specular reflection. The main achievements of that project can be encapsulated as follows: firstly, the dataset was expanded by including more coffee leaves, enabling a larger and more diverse collection of images of both healthy and ill leaves; secondly, the project introduced a multi-task system that performs categorization and severity evaluation of biotic stress on coffee leaves; lastly, the project conducted comparative studies on various deep learning structures, exploring their effectiveness in analyzing coffee leaf images [22]. A texture-based pattern recognition algorithm has been proposed that computes features such as local binary features and statistical features to identify coffee plant leaves with lesions, and its detection rate has been compared with other CNN methods [23]. CLR disease is the primary phytosanitary threat to coffee plants and is caused by the fungus Hemileia vastatrix. It is not uncommon for the severity of the problem to increase before any significant measures are taken to fix it; poor disease control can severely weaken coffee plants, leaving only a handful of leaves on the trees and dramatically reducing yield and quality [24]. Deep learning techniques are a machine learning subfield that entails training artificial neural networks to identify objects in pictures, transcribe speech, and make predictions.
Deep learning models are built from layers of interconnected artificial neurons and can learn to recognise patterns in data through backpropagation [25]. In this research, a novel classification model is presented to effectively detect various ailments affecting coffee leaves, such as coffee leaf rust, brown eye spot, and black rot, using coffee leaf images. The model utilizes two pre-trained convolutional neural networks (CNNs), namely InceptionV3 and DenseNet121, chosen for their reasonable parameter sizes. To enhance the classification performance, feature fusion is employed to integrate the predictive abilities of both CNN models, creating a comprehensive and accurate classification system. The main aim of this study is to utilize the feature fusion method to combine the predictive capabilities of the two pre-trained CNNs, InceptionV3 and DenseNet121: by implementing an end-to-end learning model, the features derived from both CNNs are fused together, with the objective of enhancing the classification accuracy of coffee plant disease detection. The key points of this work can be encapsulated as follows:

a. The application of feature fusion in an end-to-end learning paradigm, combining features extracted from the InceptionV3 and DenseNet121 CNNs.

b. The evaluation of classification accuracy using the proposed model, together with comparisons against the DenseNet121, EfficientNetB0 and InceptionV3 models on the same dataset.

c. The achievement of higher accuracy in detecting and classifying coffee plant diseases with the proposed model, while maintaining reasonable parameter sizes.

Related Work

In [26], the authors proposed a model that achieved a classification accuracy of 98.56% by combining traits from multiple CNN approaches. In [27], the authors' model achieved an accuracy rate of 98% by addressing the issue of overfitting through data augmentation and transfer learning. In [28], the authors presented a densely optimized CNN to classify four types of maize leaves: healthy maize leaves and three disease classes, namely corn grey leaf spot, common corn rust, and corn northern leaf blight. The proposed model deployed a softmax classifier layer after five dense blocks and attained a classification accuracy of 94%. In [29], the suggested transfer learning method using DenseNet-121 attained a classification accuracy of 99.81% and required fewer training parameters, reducing computational complexity. In [30], the authors retrieved deep features from tomato, maize, and potato photos in the Plant Village dataset repository using a pre-trained VGG19. They concentrated on extracting information from layers 6 and 7 of the VGG19 network and fused the extracted features in parallel using the partial least squares (PLS) method to choose the most appropriate features. The network achieved an impressive classification accuracy of 97.50% with these parameters. In [31], the authors suggested a methodology that integrates both classification and metaheuristic techniques for the purpose of categorizing breast cancer. In [32], the authors suggested an optimized approach for detecting breast cancer through the examination of ultrasound images. In [33], a combination of features was utilized by the authors to predict breast cancer, merging patient gene modality data with image data. In [34], a method involving the application of fuzzy-based association rules mining was employed by the authors to analyze data associated with breast cancer. Decision trees were applied to anthropometric and blood data in [35] to identify symptoms of breast cancer, resulting in highly favorable outcomes. Lastly, in [36], a two-stage filter-based method for breast cancer analysis was implemented by the authors, incorporating two statistical methods in the first stage and two theoretical approaches in the second stage.

Materials and Methods

A convolutional neural network (CNN) comprises three types of layers: convolutional, pooling, and fully connected. These layers work together to create a powerful framework for analyzing image datasets that requires minimal preprocessing compared with other classification algorithms, given sufficient training. CNNs are capable of learning filters and extracting features from images, enabling them to effectively process large areas. Compared with other neural networks, they are exceptionally good at processing visual, audio, or voice signals [37].

Classification using CNN

The multiple layers of a CNN are described as follows (a minimal code sketch appears after the list):

a) Input Layer: This layer provides the raw data as a matrix of pixel values.

b) Convolutional Layer: This layer conducts convolution operations on its input using trainable filters that extract features from the image to assist in pattern recognition.

c) Activation Layer: In this layer, an activation function introduces non-linearity into the model so that it can identify more complex patterns.

d) Pooling Layer (Subsampling Layer): This layer uses max-pooling or average-pooling to reduce the output's spatial dimensions, thus decreasing the parameter count and computational load.

e) Fully Connected Layer (FC Layer): This layer connects every neuron in the preceding layer to every neuron in the present layer, facilitating the learning of intricate patterns and prediction-making.

f) Flatten Layer: This layer reshapes the output of the preceding layers into a one-dimensional vector for the fully connected layers.

g) Dropout Layer: This layer is used to prevent overfitting and improve generalisation.

h) Batch Normalisation Layer: This layer is used to normalise the previous layer's output.
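To make these layer roles concrete, the following minimal Keras sketch stacks the layer types listed above; it is an illustrative toy, not the paper's architecture, and the input shape and layer sizes are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal illustrative CNN stacking the layer types described above.
# Input shape and layer sizes are arbitrary assumptions, not the paper's model.
model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),       # a) input layer: raw pixel matrix
    layers.Conv2D(32, (3, 3)),              # b) convolutional layer: trainable filters
    layers.Activation("relu"),              # c) activation layer: non-linearity
    layers.MaxPooling2D((2, 2)),            # d) pooling layer: spatial down-sampling
    layers.BatchNormalization(),            # h) normalise the previous layer's output
    layers.Flatten(),                       # f) flatten layer: to a 1-D vector
    layers.Dense(128, activation="relu"),   # e) fully connected layer
    layers.Dropout(0.5),                    # g) dropout layer: regularisation
    layers.Dense(4, activation="softmax"),  # output: 4 classes as softmax probabilities
])
model.summary()
```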

CNNs utilize the softmax function to ascertain the correct class for an image; it operates as a probability distribution function, assigning to each potential class the likelihood that the input image belongs to it. Equation 1 represents the softmax function, where \(X_i\) denotes the i-th input value and N the total number of classes. Exponentiating each input value \(X_i\) makes it positive, and dividing by the sum of the exponentials normalises the outputs into a probability distribution [38].

$$\sigma \left(X\right)_{i}= \frac{e^{X_{i}}}{\sum_{j=1}^{N}e^{X_{j}}}\quad \text{for } i=1,\dots,N \text{ and } X=\left(X_{1},\dots,X_{N}\right)$$
(1)
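As an illustrative sketch, Eq. 1 can be rendered in NumPy; subtracting the maximum input is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax of Eq. 1; subtracting max(x) avoids overflow without changing the output."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # probabilities sum to 1
```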

Feature Extraction Process

To achieve this objective, we employ a sequence of layers to analyse the image and extract pertinent information. Pooling layers are applied after multiple convolutional and activation layers. The feature-extraction module uses the pixel values of the image to generate feature maps. The convolutional layer's filters learn low-level, intermediate, and high-level image features, such as edges, patterns, and textures, as they are applied to the layer's input. Typically, a filter tensor with odd dimensions is slid over the input in a sliding-window fashion; at each position, the filter values are multiplied by the corresponding window values and summed to produce the result. Equation 2 represents the convolutional product of tensors T (input) and F (filter), where \(n_h\), \(n_c\) and \(n_w\) denote the height, channel count, and width, respectively.

$$Conv{\left(T,F\right)}_{m,n}= \sum_{x=1}^{{p}_{h}}\sum_{y=1}^{{p}_{w}}\sum_{z=1}^{{p}_{c}}{F}_{x,y,z}\times {T}_{m+x-1,n+y-1,z}$$
(2)
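The following NumPy sketch is a direct, illustrative transcription of Eq. 2 for a single filter (no padding, stride 1); the tensor sizes are toy assumptions.

```python
import numpy as np

def conv_valid(T: np.ndarray, F: np.ndarray) -> np.ndarray:
    """Direct NumPy transcription of Eq. 2: a 'valid' convolution (no padding, stride 1).

    T has shape (n_h, n_w, n_c); the filter F has shape (p_h, p_w, p_c) with p_c == n_c.
    """
    n_h, n_w, _ = T.shape
    p_h, p_w, _ = F.shape
    out = np.zeros((n_h - p_h + 1, n_w - p_w + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            # element-wise product of the filter with the window, summed over x, y, z
            out[m, n] = np.sum(F * T[m:m + p_h, n:n + p_w, :])
    return out

T = np.random.rand(5, 5, 3)    # toy input tensor
F = np.random.rand(3, 3, 3)    # toy filter
print(conv_valid(T, F).shape)  # -> (3, 3)
```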

Equation 3 shows the Rectified Linear Unit (ReLU) activation function, which acts as a non-linear transform on the output of its convolutional layer [21].

$$F\left(x\right)=\text{max}\left(0,x\right)$$
(3)

The activation layers add non-linearity to the output using the activation function, as represented in Eq. 4.

$$Conv\left({T}^{[A-1]},{F}^{(p)}\right)_{m,n} = \sum_{x=1}^{{p}_{h}}\sum_{y=1}^{{p}_{w}}\sum_{z=1}^{{p}_{c}}{F}_{x,y,z}\times {T}_{m+x-1,n+y-1,z}+ {b}_{p}^{[A]}$$
(4)

Here, \(T^{[A-1]}\) denotes the input to layer A, and \(p_h\), \(p_w\) and \(p_c\) the filter dimensions. Activation layers, which are typically placed after convolutional layers, introduce non-linearity into the output. Equation 5 illustrates the computation of the output layer, where \(F^{(n)}\) represents the n-th filter.

$${T}^{[A]}=\left[ {\psi }^{\left[A\right]}\left( conv\left({T}^{\left[A-1\right]} , {F}^{\left(1\right)}\right)\right), \dots, {\psi }^{\left[A\right]}\left(conv\left({T}^{\left[A-1\right]} , {F}^{({n}_{c}^{[A]})}\right)\right)\right]$$
(5)

Classification Process

Classification is employed to assign the extracted features to the appropriate categories. In typical scenarios, fully connected layers connect all the elements: each input connects to every neuron in the subsequent layer. This network configuration functions as a classifier, where each layer consists of numerous neurons connected to the output layer. Equation 6 represents the softmax function, which normalises the output to determine the highest class probabilities.

$$\sigma (P)_{i} = \frac{{e^{{P_{i} }} }}{{\sum\nolimits_{j = 1}^{k} {e^{{P_{j} }} } }}$$
(6)

Transfer Learning Method

Transfer learning is a machine learning method in which a model built for one task is used as the foundation for another. When employing deep learning for computer vision or natural language processing tasks, it is customary to start from pre-trained models, since training capable neural network models from scratch demands massive computational time and resources. The two most frequently used transfer learning strategies are model development and the pre-trained model approach [38].

Essential Points of the Model Development Approach

1. Select a source task with abundant data and a known relationship between the input and output data; the system learns this relationship while mapping inputs to outputs.

2. Develop a source model: construct a competent model for this primary task. To qualify as feature learning, the model must outperform a naive model.

3. Reuse the model: the model may be used whole or in part, depending on the techniques used.

4. Adapt to the input/output data: depending on the input/output data of the intended task, the model might need to be adjusted or improved.

Pre-Trained Model Approach

1. Select a pre-trained source model: we used InceptionV3 and DenseNet121 as pre-trained models for this work.

2. Reuse the model: building a new model on top of a previously trained one is a method frequently used in machine learning, known as transfer learning; the knowledge gained by one model enhances a separate but related model.

3. Tune the model: taking an already trained model and further training it on a new dataset related to the original task allows the model to be fine-tuned.

4. Map to the new classes: the network learns to map the feature extractor's output to the classes of the new dataset.

The InceptionV3 and DenseNet121 convolutional neural network architectures, both pre-trained, were utilized in this work. As shown in Fig. 1, InceptionV3, part of Google's Inception architectural lineage, employs diverse inception modules featuring various filter sizes to capture information across different scales. Designed for discerning intricate image patterns, the model combines these modules with traditional convolutional layers to gather features at multiple scales. Although it demands more parameters and increases computational cost, InceptionV3 addresses this by incorporating auxiliary classifiers during training to enhance gradient flow and convergence. Recognized for its versatility, InceptionV3 is beneficial for various computer vision tasks, encompassing feature extraction, object recognition, and image classification [43].
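As a hedged illustration of this pre-trained-model approach, the sketch below loads an ImageNet-pre-trained InceptionV3 in Keras, freezes it as a feature extractor, and attaches a new four-class head; the input size, optimizer, and freezing policy are assumptions, not the paper's reported settings.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionV3

# Step 1: select a pre-trained source model (ImageNet weights, no top classifier).
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # Step 2: reuse the model as a frozen feature extractor

# Steps 3-4: attach a new head mapping the extracted features to the new classes,
# then fine-tune on the target dataset (four coffee-leaf classes here).
inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```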

Fig. 1 InceptionV3 CNN architecture

The structure of DenseNet121 consists of an initial convolutional layer followed by four dense blocks and three transition layers. In this structure, each dense block densely connects several dense layers. The purpose of the transition layers is to decrease the number of feature maps and to down-sample the spatial dimensions of the input. After the final dense block, as Fig. 2 illustrates, a pooling layer and softmax activation are employed to aggregate the feature maps and generate the final class probabilities. In deep neural networks, particularly in architectures like DenseNet, it is common to use a block consisting of six dense layers. Each dense layer comprises a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer. This configuration serves as a fundamental building block of the network [39].

Fig. 2 Six dense layers in a dense block

Equation 7 expresses the dense connectivity between network layers.

$${x}_{l}=H\left(\left[{x}_{0}, {x}_{1}, {x}_{2},\dots, {x}_{l-1}\right]\right)$$
(7)

where \([x_{0}, x_{1},\dots, x_{l-1}]\) denotes the concatenated feature maps of the preceding layers, from which the output \(x_{l}\) is generated.

Figure 3 shows the DenseNet121 CNN model, a typical DenseNet implementation consisting of two main fragments: dense blocks (DB) and transition blocks (TB). Each dense layer comprises one 1 × 1 convolution layer followed by a 3 × 3 convolution layer. Between the dense blocks, transition blocks consist of a batch normalization layer, followed by a 1 × 1 convolution layer and a 2 × 2 average pooling layer. In this implementation, there are four dense blocks containing six, twelve, twenty-four, and sixteen dense layers, respectively; a code sketch of a dense block follows Fig. 3.

Fig. 3 DenseNet121 CNN architecture
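The sketch below is a hedged Keras illustration of one dense block built from the dense layer described above (a 1 × 1 then a 3 × 3 convolution) with the concatenation of Eq. 7; the growth rate and input shape are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dense_layer(x, growth_rate=32):
    """One DenseNet-style dense layer: BN -> 1x1 conv (bottleneck) -> BN -> 3x3 conv."""
    y = layers.BatchNormalization()(x)
    y = layers.Conv2D(4 * growth_rate, 1, padding="same", activation="relu")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(growth_rate, 3, padding="same", activation="relu")(y)
    # Eq. 7: concatenate all preceding feature maps with the newly produced ones.
    return layers.Concatenate()([x, y])

inputs = keras.Input(shape=(56, 56, 64))
x = inputs
for _ in range(6):          # a dense block of six dense layers, as in Fig. 2
    x = dense_layer(x)
model = keras.Model(inputs, x)
print(model.output_shape)   # (None, 56, 56, 64 + 6 * 32)
```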

Feature Fusion

By combining multiple features or representations extracted from different sources or modalities, feature fusion creates a more comprehensive model suitable for various applications, including object recognition, image segmentation, and speech recognition. The fusion algorithms operate on input tensors \(m_{x,y,d}\) and \(m^{\prime}_{x,y,d}\) with dimensions x, y and d. In the sum method, represented by Eq. 8, the input tensors are combined element by element to produce a tensor with the same dimensions as the inputs. In the maximum method, represented by Eq. 9, the highest-valued element is taken at each position of the input tensors, again producing a tensor of identical shape. If we average instead of maximising, we obtain the average method, represented by Eq. 10. Finally, in the concatenation method, represented by Eq. 11, the input tensors are stacked together along the depth axis. A NumPy sketch of these four fusion rules follows the equations.

$${Y}_{x,y,d}^{sum}= {m}_{x,y,d}+ {m^{\prime}}_{x,y,d}$$
(8)
$${Y}_{x,y,d}^{max} =\text{max}\left({m}_{x,y,d} , {m^{\prime}}_{x,y,d}\right)$$
(9)
$${Y}_{x,y,d}^{avg}=avg\left({m}_{x,y,d} , {m^{\prime}}_{x,y,d}\right)$$
(10)
$${Y}_{x,y, 2d}^{concat}= {m}_{x,y,d},\qquad {Y}_{x,y,2d-1}^{concat}= {m^{\prime}}_{x,y,d}$$
(11)
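The four fusion rules of Eqs. 8-11 can be illustrated in NumPy as follows; the tensor shapes are toy assumptions.

```python
import numpy as np

m  = np.random.rand(7, 7, 256)   # features from one branch, indexed (x, y, d)
mp = np.random.rand(7, 7, 256)   # features from the other branch

y_sum    = m + mp                            # Eq. 8: element-wise sum
y_max    = np.maximum(m, mp)                 # Eq. 9: element-wise maximum
y_avg    = (m + mp) / 2.0                    # Eq. 10: element-wise average
y_concat = np.concatenate([m, mp], axis=-1)  # Eq. 11: stack along the depth axis

print(y_sum.shape, y_concat.shape)  # (7, 7, 256) (7, 7, 512)
```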

Figure 4 portrays the comprehensive procedure of the proposed work, encompassing three main stages: data preparation, training, and evaluation.

Fig. 4 Proposed procedural framework for the classification process

Description of Dataset

In this experiment, the authors used Kaggle's Plant Village dataset [40] to classify 3076 images of healthy and diseased coffee leaves into four groups: healthy coffee leaves and three types of diseased leaves, namely coffee leaf rust, black rot, and brown eye spot. The numbers of images and sample images are shown in Table 1 and Fig. 5, respectively. Coffee leaf rust appears as small patches of pale yellow on the upper surface of coffee leaves [17]; on the lower surface of the leaf, these spots gradually expand in size and vary from orange-yellow to orange-red. Brown eye spot disease gives coffee leaves an eye-like appearance: small chlorotic spots on the leaves grow to a diameter of 3/16 to 5/8 inch and turn brown. Infected coffee leaves with a dark brown or black appearance indicate "black rot"; sclerotia cover the leaves, petioles, and twigs. Following standard machine learning practice, the experimental dataset was split into training and test sets to evaluate the model's performance on unseen data [41, 42]. Model training uses 80% of the data, and testing reserves the remaining 20%. Further, a validation subset, kept distinct from the training subset, is used for hyperparameter tuning, so the model is evaluated on data not observed during training; this helps ensure good training performance and prevents overfitting. Table 2 displays the augmentation methods used; an illustrative augmentation sketch follows Table 2. Before further processing, all images are standardised to a uniform size of 244 × 244 pixels.

Table 1 The combined count of images used across all categories within the datasets
Fig. 5 The dataset includes examples representing each category: healthy, coffee leaf rust, brown eye spot and black rot

Table 2 Data augmentation values to be used
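A hedged Keras sketch of a typical augmentation pipeline of the kind summarised in Table 2 is given below; the specific ranges, directory name, and batch size are assumptions, since Table 2's exact values are not reproduced here.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation pipeline; the specific ranges are assumptions,
# not necessarily the values reported in Table 2.
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,    # hold-out split, as described in the text
)
train_gen = datagen.flow_from_directory(
    "coffee_leaves/",        # hypothetical dataset directory
    target_size=(244, 244),  # resizing to the stated 244 x 244 pixels
    batch_size=32,
    class_mode="categorical",
    subset="training",
)
```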

Training Phase

To initialize the proposed model, we utilized pre-trained weights from the DenseNet121 and InceptionV3 CNNs, as depicted in Fig. 6. In each branch, we replaced the original classification layers with an average pooling layer followed by a fully connected layer of 1024 neurons. Additionally, three extra fully connected layers, with 256, 512, and 1024 neurons respectively, followed the merged branches. Finally, a categorical cross-entropy loss function was applied to a softmax layer of four neurons. Considering that DenseNet121 and InceptionV3 require different pre-processing, we incorporated a pre-processing layer before feature extraction in each branch to accommodate their distinct requirements. The proposed CNN architecture, as illustrated in Fig. 6, encompasses two branches, each performing its own pre-processing and feature extraction, followed by a fully connected layer with average pooling. The outputs of the fully connected layers in each branch are merged by concatenation. The concatenated output proceeds through several fully connected layers and culminates in a softmax classification layer (a hedged code sketch follows Fig. 6).

Fig. 6 Proposed hybrid CNN architecture
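The following is a hedged Keras sketch of the two-branch architecture of Fig. 6: each branch applies its own pre-processing and pre-trained backbone, pools and projects its features, and the branches are fused by concatenation before the 256/512/1024-neuron layers and the four-neuron softmax head described above; the remaining details (input size, optimizer, activations) are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import DenseNet121, InceptionV3
from tensorflow.keras.applications.densenet import preprocess_input as dense_pre
from tensorflow.keras.applications.inception_v3 import preprocess_input as incep_pre

inputs = keras.Input(shape=(244, 244, 3))

# Branch 1: DenseNet121 with its own pre-processing layer.
d = layers.Lambda(dense_pre)(inputs)
d = DenseNet121(weights="imagenet", include_top=False)(d)
d = layers.GlobalAveragePooling2D()(d)
d = layers.Dense(1024, activation="relu")(d)

# Branch 2: InceptionV3 with its own pre-processing layer.
i = layers.Lambda(incep_pre)(inputs)
i = InceptionV3(weights="imagenet", include_top=False)(i)
i = layers.GlobalAveragePooling2D()(i)
i = layers.Dense(1024, activation="relu")(i)

# Feature fusion by concatenation (Eq. 11), then fully connected layers of
# 256, 512 and 1024 neurons and a four-neuron softmax head, as in the text.
x = layers.Concatenate()([d, i])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(1024, activation="relu")(x)
outputs = layers.Dense(4, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```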

Evaluation Phase

The authors tested the model during the assessment stage to evaluate its effectiveness and its ability to avoid overfitting. Training and testing used 80% and 20% of the samples from the original dataset, respectively. During evaluation, the images are divided into four groups: healthy, coffee leaf rust, brown eye spot, and black rot. The proposed model was implemented with the Keras library, which provides a wide range of pre-defined layers, activation functions, loss functions, optimisers, and utilities to streamline the process of building and training models. The model was run on an Intel i7 CPU equipped with a 250 GB solid-state drive (SSD) and 16 GB of random-access memory (RAM). Cross-entropy is often used as a loss or objective function for training models, particularly in classification tasks. For multi-class problems with K classes, the formula for categorical cross-entropy is:

$$CE=-\sum_{k=1}^{K} {y}_{k}\,\text{log}\left({p}_{k}\right)$$
(12)

where \(y\) is the true one-hot probability distribution, and \(p\) represents the predicted probability distribution across all classes. The limitations of our proposed model include the possibility of heightened system complexity resulting from the integration of diverse models and datasets. Fine-tuning the entire structure might prove intricate and time-consuming. Additionally, without rigorous validation, there could be susceptibility to overfitting.
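As a minimal illustration of Eq. 12 above, the categorical cross-entropy of a single prediction can be computed as follows; the example vectors are hypothetical.

```python
import numpy as np

def categorical_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Eq. 12 for one sample: y_true is one-hot, y_pred a predicted distribution."""
    eps = 1e-12                      # guard against log(0)
    return float(-np.sum(y_true * np.log(y_pred + eps)))

y = np.array([0.0, 1.0, 0.0, 0.0])      # true class, e.g. coffee leaf rust (one-hot)
p = np.array([0.05, 0.90, 0.03, 0.02])  # model's predicted probabilities
print(categorical_cross_entropy(y, p))  # ~0.105
```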

Results and Discussions

The researchers constructed the hybrid CNN model shown in Fig. 6 and trained it on the augmented training set. The training process begins with a learning rate of 0.01, run for up to 50 epochs; the learning rate is reduced by a factor of 0.1 whenever the validation loss shows no improvement for five epochs, and training stops early if there is no improvement for ten consecutive epochs. The other CNN models, namely InceptionV3, DenseNet121, and EfficientNetB0, undergo the same training as the suggested hybrid model. In multi-class classification problems, classifiers are frequently evaluated using true positives, true negatives, false positives, and false negatives (TP, TN, FP, and FN, respectively). Using these concepts, we can construct a confusion matrix for each class by employing the one-versus-all strategy, which allows us to assess each class classifier's effectiveness independently. Several criteria, such as the F1 score, recall, accuracy, and precision, are computed to evaluate each class classifier's performance. These metrics offer valuable information about the classifier's effectiveness for individual classes. Accuracy evaluates the overall correctness of the classifier's predictions for a class. Precision measures the proportion of accurately forecasted positive events among all predicted positive cases. Recall calculates the rate of accurately predicted positive events relative to all positive samples. The F1 score offers a balanced evaluation of the classifier's performance. Analyzing these metrics for each class provides a thorough understanding of the classifier's strengths and weaknesses in classifying multiple categories and allows the performance of individual class classifiers to be evaluated and compared effectively.

Fig. 7 Confusion matrices of the adapted models for the experiment: a InceptionV3, b DenseNet121, c EfficientNetB0, and d the proposed model

Accuracy: Eq. 13 evaluates the ratio of accurate predictions to the total predictions generated by the model.

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(13)

Precision: Eq. 14 shows that precision measures the ratio of accurate positive predictions to all positive predictions.

$$Precision=\frac{TP}{TP+FP}$$
(14)

Recall: Eq. 15 measures the ratio of accurately predicted positives among all actual positive examples.

$$Recall=\frac{TP}{TP+FN}$$
(15)

F1 score: Eq. 16 shows the F1 score, used when the misclassification of positive examples is as important as that of negative examples.

$$F1 score=\frac{(2*Precision*Recall)}{Precision+Recall}$$
(16)
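The one-versus-all metrics of Eqs. 13-16 can be computed per class from a confusion matrix, as in this illustrative sketch; the matrix values are hypothetical, not the paper's results.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray, k: int):
    """One-versus-all metrics (Eqs. 13-16) for class k of a confusion matrix cm,
    where cm[i, j] counts samples of true class i predicted as class j."""
    TP = cm[k, k]
    FP = cm[:, k].sum() - TP
    FN = cm[k, :].sum() - TP
    TN = cm.sum() - TP - FP - FN
    accuracy  = (TP + TN) / cm.sum()
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical 4-class confusion matrix (healthy, rust, brown eye spot, black rot).
cm = np.array([[50, 2, 1, 0],
               [1, 48, 2, 1],
               [0, 3, 46, 2],
               [1, 0, 2, 49]])
print(per_class_metrics(cm, 1))  # metrics for the coffee leaf rust class
```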

The computational accuracy of the trained models, namely InceptionV3, DenseNet121, EfficientNetB0 and the proposed model, was assessed by evaluating their F1 score, precision, and recall. Subsequently, we used the receiver operating characteristic (ROC) curve on the testing dataset to analyse the predictions of the suggested model. The overall predictive accuracy of a model refers to its capacity to correctly predict the outcome or label of a given dataset, as measured by the ROC. The ROC curve (Fig. 9) is generated by comparing the recall value against the false positive rate (FPR) at various threshold levels. Equation 17 defines the FPR.

Fig. 8 Training and validation accuracy and loss for a InceptionV3, b DenseNet121, c EfficientNetB0, and d the proposed model

$$FPR=\frac{FP}{FP+TN}$$
(17)
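As an illustrative sketch, scikit-learn's roc_curve sweeps the decision threshold and returns the FPR of Eq. 17 together with the recall (TPR) at each threshold; the labels and scores below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical binary (one-versus-all) labels and scores for one class.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3])

# roc_curve returns the FPR of Eq. 17 and the TPR at each threshold;
# auc summarises the resulting curve in a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))
```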

The primary purpose of the proposed work was to rapidly build and develop a deep learning model, for which we used the Keras library. We conducted the experiment on the Google Colaboratory (Colab) platform with an Intel processor equipped with 16 GB of RAM. This study closely explored the variance between the real and forecasted output with the help of cross-entropy (CE). The Adam algorithm minimises the resulting cost function so that the model performs better. After the initial training set is augmented, the suggested hybrid CNN shown in Fig. 6 is trained. The learning rate for training is initially set to 0.02 for a total of 50 epochs; thereafter, the learning rate is reduced by a factor of 0.1 if there is no improvement after four consecutive epochs. This procedure is used for InceptionV3, DenseNet121, EfficientNetB0 and the proposed hybrid model. Figure 8 illustrates the classification accuracies obtained on the same dataset by the various models, namely EfficientNetB0, DenseNet121, InceptionV3, and the proposed hybrid model, recorded as 98%, 96.25%, 94%, and 99%, respectively. Table 3 exhibits the accuracy of the proposed hybrid model along with EfficientNetB0, DenseNet121, and InceptionV3, whereas Tables 4, 5, 6, and 7 showcase the calculated recall, accuracy, and F1 scores of these models. The results show that the proposed hybrid model achieved the maximum F1 score for coffee leaf rust with the highest accuracy, outperforming InceptionV3, DenseNet121, and EfficientNetB0. The confusion matrices displayed in Fig. 7 show the classifications produced by the suggested hybrid model, EfficientNetB0, DenseNet121, and InceptionV3 into coffee leaf rust, black rot, brown eye spot, and healthy leaves; it is apparent that the proposed hybrid model performs better. Figure 9 displays the ROC curve of the proposed model.

Table 3 Accuracy of the experimental models
Table 4 Accuracy, recall, and F1-score of the InceptionV3 model
Table 5 Accuracy, recall, and F1-score of the EfficientNetB0 model
Table 6 Accuracy, recall, and F1-score of the DenseNet121 model
Table 7 Accuracy, recall, and F1-score of the proposed model
Fig. 9 ROC curve for the suggested model

Conclusion and Outline of Future Work

The results show that improving the classification accuracy of coffee plant disease detection requires a combination of in-depth features, achieved by fusing different CNNs into complex feature sets. This work concludes that the presented model performed well compared with other methods, with 99% classification accuracy. In this work, we used CNNs with few parameters and combined feature sets to generate a robust model, which performed very well on a large dataset. In the future, we will study more diseases so that more conditions can be detected at an early stage. The use of additional augmentation methods to boost image detail could also make the model more useful. We will further test other hybrid techniques and create new models by hybridising more than two models to obtain better results. Our hybrid model can extract additional insights by incorporating diverse datasets, including cucumber, pumpkin, grapes, and more. The proposed model also allows for precise predictions beyond plant-related contexts, for example in the analysis of biological data such as breast cancer, Parkinson's disease, and various other human diseases.