1 Introduction

The lungs are among the most important organs in the human body, yet people neither take good care of them nor give due importance to their respiratory and breathing-related issues, which later cause various infections and injuries [1]. According to the WHO, airway diseases are a foremost cause of death and disability worldwide; hence, it is essential to ensure better diagnostic techniques that can provide an accurate diagnosis and support appropriate treatment. Numerous factors contribute to the rise of such diseases, including direct or indirect exposure to tobacco smoke, low birth weight, malnutrition, air pollution, and exposure to viruses such as the influenza virus or the coronavirus. In addition, there are similarities in the symptoms of these diseases that can cause confusion and lead to misdiagnosis and mistreatment; it is therefore important to detect and diagnose multiple airway diseases accurately and in a timely manner [2].

In the healthcare sector, where massive and complex amounts of data are generated, artificial intelligence (AI) has proven to be an invaluable asset. AI techniques have performed remarkably well in image-recognition tasks in healthcare, such as evaluating and classifying lung cancer images, diagnosing fibrotic lung disease, interpreting pulmonary function tests, and diagnosing various restrictive and obstructive lung diseases. More broadly, the medical sector is increasingly adopting artificial intelligence to assist doctors in predicting and diagnosing numerous types of diseases, particularly in recent years, when the COVID-19 pandemic left too few hospitals able to provide adequate care to the ill [3]. According to the NHS-commissioned Topol Report, advances in mathematical algorithms, cloud computing, and related technologies have accelerated the development of AI-based methods to analyze, interpret, and forecast healthcare data [4].

One group of researchers created a cough detection-based application, shown in Fig. 1, that uses sensors to record patients' or users' symptoms such as cough sound, body temperature, and airflow. The recorded data was then converted and processed by machine learning-based techniques to identify patterns and classify the combined symptoms of various respiratory disorders [2].

Fig. 1 AI-based cough detection application [2]

Another study stated that Google AI scientists developed a neural network reported to be as accurate as, or better than, radiologists at detecting malignant lung nodules [5]. A similar model was developed to detect chronic obstructive pulmonary disease (COPD) in smokers and to predict acute respiratory disease events and mortality [6]. In paper [7], the authors found that the machine learning algorithm performed on par with radiologists in interpreting thoracic high-resolution computed tomography images, with 73% agreement. Their study also demonstrated that deep learning algorithms could be valuable in diagnosing interstitial lung disease. Similarly, in paper [8], the authors discovered that deep learning improved the diagnosis of chronic hypersensitivity pneumonitis, nonspecific interstitial pneumonia, cryptogenic organizing pneumonia, and usual interstitial pneumonia patterns. Therefore, it can be said that AI-based techniques have demonstrated superior performance and provide clinicians with a powerful decision-support tool. The importance of such technology in improving clinical practice will drive its acceptance by the medical community in the real world [9,10,11,12].

In this paper, image datasets of four respiratory diseases, namely lung cancer, pulmonary embolism (PE), COVID-19, and pneumoconiosis, together with normal lung images, have been used to train and evaluate various deep learning models: EfficientNetB6, EfficientNetV2B1, EfficientNetV2B3, DenseNet201, Xception, ResNet50V2, Inception-v3, EfficientNetV2S, InceptionResNet-v2, ResNet101V2, and the proposed hybrid model (EfficientNetB6 and ResNet101V2). The models are evaluated using several metrics, including loss, accuracy, F1 score, MCC, precision, and recall. The research found that the proposed hybrid model obtained the highest training and testing accuracies of 99.84% and 99.77%, respectively.

1.1 Contribution

The contributions made to develop the respiratory disease prediction system are as follows:

  1. The dataset of 19,488 images was initially taken from the four disease sources, including normal lungs, and later pre-processed by applying the CLAHE technique to enhance contrast and remove noisy signals from the images.

  2. Using histogram equalization, the images have been visualized graphically to study the pixel patterns and to detect anomalies, if present.

  3. For extracting features and obtaining the ROI, various techniques have been used, such as contour features, Otsu thresholding, and adaptive thresholding. This results in 38,876 image features, which are later split into training and testing sets in a 70:30 ratio (a minimal split sketch follows this list).

  4. Subsequently, ten deep transfer learning models, along with the proposed hybridized model, are applied and trained on the training and testing datasets. These models are further examined through various parameters such as loss, accuracy, recall, F1 score, precision, and MCC values. In addition, the confusion matrices showcasing the best model for identifying and classifying the respiratory diseases have also been generated.
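
As a minimal sketch of the 70:30 split mentioned in contribution 3, the snippet below uses scikit-learn; the feature and label arrays are illustrative placeholders, and stratified sampling is an assumption rather than a detail reported in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the extracted image features and the
# five class labels (lung cancer, PE, COVID-19, pneumoconiosis, normal);
# the real arrays come from the feature-extraction step in Sect. 3.4.
features = np.random.rand(1000, 64)
labels = np.random.randint(0, 5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.30,    # the 70:30 split used in this work
    stratify=labels,   # stratification is an assumption, not stated in the paper
    random_state=42,
)
```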

1.2 Road Map of the Paper

The first section of the paper, the Introduction, provides concise information on respiratory diseases, their impact, and AI-based strategies for combating them. Section 2 presents the context of researchers' work in the field of respiratory disease detection. Section 3 describes the dataset, techniques, and parameters used to develop the respiratory disease detection system, while Sect. 4 presents the system's outcomes. Section 5 contrasts the proposed work with existing approaches in the discussion, and Sect. 6 summarizes and concludes the paper.

2 Background

In this section, the work of various researchers in detecting lung cancer, COVID-19, PE, and pneumoconiosis with the help of machine and deep learning techniques is showcased [13, 14]. A tabular representation in Table 1 has also been provided to make it more informative, showing the datasets, the methods used by the researchers, and their outcomes and limitations. In the case of lung cancer detection, researchers such as Dunke et al. [15] and Sori et al. [16] classified lung nodules and detected their malignancy level using approaches including a 3D multi-path VGG network, a U-Net architecture, and a multi-phase CNN. Likewise, Chen et al. [17] researched lung cancer treatment to support better diagnosis by providing higher interpretability of the output. During the research, they analyzed the model's performance on a small imbalanced dataset and, to overcome its limitations, proposed a new bag simulation method for multiple instance learning. Similarly, Said et al. [18] discussed the use of deep learning techniques for accurate diagnosis of lung cancer through medical image segmentation. The study used a dataset of CT scans of lung cancer patients and compared the performance of different deep learning architectures for image segmentation. The results showed that deep learning techniques significantly improved the accuracy of lung cancer diagnosis through medical image segmentation, with important implications for the future of lung cancer diagnosis and treatment.

Table 1 Analysing the work of the researchers

In the case of pneumoconiosis, Sun et al. [19] proposed a fully deep learning technique comprising segmentation and a staging procedure. The researchers initially segmented and extracted the lung regions in CXR images and later classified them into four stages using a focal staging loss and deep log-normal label distribution learning. Similarly, Yang et al. [20] developed an automatic pneumoconiosis screening system that used a pre-processing pipeline along with a ResNet classification model; according to the authors, a large set of data was used. In their paper, Zhang et al. [21] presented an AI-based model that assisted radiologists in screening and staging pneumoconiosis from CXR images. The model initially segmented the lung region into six sub-regions, after which a CNN-based network was applied to classify and predict the opacity level of each sub-region. Their research concluded by diagnosing each area and classifying it as normal, stage I, stage II, or stage III based on the prediction results. Peng et al. [22] investigated the use of convolutional neural networks on medical images to enhance pneumoconiosis diagnosis. The research gathered 8361 chest X-ray films for the first round of model testing and 24,887 chest X-ray films for the third round. Three distinct models were designated with test sets, and the diagnostic efficacy of each was computed.

In the case of PE [23], the authors applied a novel approach for analyzing incomplete and partial datasets based on Q-analysis and ML algorithms. Their main aim was to introduce a hybridization of hypernetwork theory and supervised artificial neural networks. Using this strategy, they developed new computer-aided detection software for PE, to reduce the number of CT-angiography analyses and ensure highly efficient diagnosis. Similarly, in [24], the researchers worked on CT-angiography images of PE by training a deep neural network on weakly labelled data. The authors used a small dataset and showed that the results obtained were considerably better, demonstrating that small research groups with limited resources can use DL models. In [25], the authors stated that a CT exam is necessary for fast detection and diagnosis of PE. Based on this, they proposed a pipeline-based technique that used a U-Net (Fig. 2) to detect embolisms in CT images and classified the detections into true positives and false positives using machine learning algorithms.

Fig. 2 U-Net architecture for detecting pulmonary embolism

Grenier et al. [26] developed a CNN model with a hybrid 3D/2D U-Net topology to detect suspected PEs on computed tomography angiograms (CTAs). They used a dataset of 387 anonymized real-world chest CTAs acquired on 41 different scanner models. The results showed that their algorithm correctly identified 170 of 186 positive PE cases (91.4% sensitivity) and 184 of 201 negative PE cases (91.5% specificity).

To detect COVID-19, the authors in [27] used a fuzzy technique, MobileNetV2, SqueezeNet, and a support vector machine. The data classes were restructured during the pre-processing phase and stacked with the original images; the MobileNetV2 and SqueezeNet models were then trained on the stacked dataset, and a support vector machine was used to combine and classify the efficient features. Similarly, in [28], the researchers used deep learning and laboratory data to predict COVID-19 in patients. The model was validated using tenfold cross-validation after testing 18 laboratory outcomes from 600 patients. In [29], the authors predicted that during COVID-19, children would experience stress, depression, and anxiety; a Deep Learning Neural Network (DLNN)-based method was used to assess the children's stress, depression, and anxiety levels. Using cutting-edge machine learning techniques, Duong et al. [30] presented a practicable method to detect COVID-19 in chest X-ray (CXR) and lung computed tomography (LCT) images. The primary classification engine used the EfficientNet and MixNet techniques on four real-world datasets, i.e., two CXR datasets of 17,905 and 15,000 images and two LCT datasets of 411,500 and 2,482 images, respectively. The approach was evaluated using five-fold cross-validation, in which the dataset was divided into five parts; accuracy consistently exceeded 95.0% across all configurations, indicating promising prediction performance across all datasets.

3 Methodology

This section addresses the various phases of the research: Sect. 3.1 provides details about the dataset, Sect. 3.2 describes the data pre-processing procedure, Sect. 3.3 depicts the graphical visualization of the images, Sect. 3.4 presents the methods for extracting the features, and Sect. 3.5 describes the models briefly. Finally, Sect. 3.6 gives an overview of the parameters used to evaluate the models' performance. The flow of all these phases is shown in Fig. 3.

Fig. 3 Proposed system design for respiratory disease detection and classification

3.1 Dataset

The initial step in developing an automatic identification system for predicting and classifying airway disorders such as lung cancer, PE, pneumoconiosis, and COVID-19 is to gather data from authorized sources. To fit the models, the lung cancer images are gathered from a dataset of chest CT scan images in .jpg or .png format [31]. The pneumoconiosis images are obtained from the Chongqing CDC as chest X-rays; the dataset is divided into two subfolders, training and validation, which contain 568 and 140 images of pneumoconiosis-affected and normal lungs, respectively [32]. The COVID-19 images are obtained from the SARS-COV-2 Ct-Scan Dataset, which includes 1230 COVID-negative and 1252 COVID-positive CT scans, for a total of 2482 scans [33]. The images for PE were acquired from CT imaging of PE; this dataset consists of computed tomography angiography (CTA) images of 35 different patients [34]. Finally, the normal lung images were extracted from all of the datasets mentioned above and merged to generate a single dataset. Figure 4 depicts the original images of the various airway illnesses, including normal lungs, used in the research.
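
The snippet below is a hedged sketch of how the four source datasets and the pooled normal images could be merged into a single labelled collection; the folder names are hypothetical stand-ins, not the datasets' actual layouts.

```python
from pathlib import Path

# Hypothetical folder layout for the four public datasets [31-34]; the
# directory names below are illustrative, not the datasets' real structure.
SOURCES = {
    "lung_cancer":    Path("data/chest_ct_scan"),      # [31]
    "pneumoconiosis": Path("data/chongqing_cdc_cxr"),  # [32]
    "covid19":        Path("data/sars_cov_2_ct"),      # [33]
    "pe":             Path("data/pe_cta"),             # [34]
    "normal":         Path("data/normal_pooled"),      # normals merged from all sources
}

# Build a single list of (image path, class label) pairs for the merged dataset.
dataset = [
    (img, label)
    for label, folder in SOURCES.items()
    for img in sorted(folder.glob("**/*"))
    if img.suffix.lower() in {".jpg", ".png"}
]
```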

Fig. 4 Original images of (a) lung cancer, (b) PE, (c) COVID-19, (d) pneumoconiosis, and (e) normal lungs, taken from the respective datasets

3.2 Pre-processing

Following the collection of images of size 224 × 224 × 1 from the various disease datasets, pre-processing has been performed to enhance their characteristics and remove noisy signals so that they can be evaluated more readily. The CLAHE approach, which stands for contrast limited adaptive histogram equalization, is useful for medical images: it improves the contrast of an original image by dividing it into small regions called tiles and equalizing the histogram of each tile separately. The approach is adaptive because it adjusts the contrast enhancement locally to the characteristics of each tile, which can vary within an image, allowing it to handle images with large variations in illumination and contrast. Two important parameters of the CLAHE approach are the clip limit and the tile grid size. The clip limit, set by the clipLimit parameter, places a threshold on the amount of contrast enhancement applied to each tile; this prevents over-amplification of the contrast, which can lead to the loss of image details and the introduction of artifacts. In this study, the clip limit has been set to 10, which is lower than the default value of 40. The tile grid size, set by the tileGridSize parameter, determines the number of tiles into which the image is divided for histogram equalization. The tile size matters because it controls the trade-off between local and global contrast enhancement: a larger tile size leads to more global contrast enhancement, while a smaller tile size enhances contrast more locally. In this approach, the image is initially divided into non-overlapping tiles of equal size, and each tile is processed independently. After the CLAHE technique is applied to each tile, the resulting tiles are merged using bilinear interpolation to produce a higher-contrast, more visible output image. Figure 5 shows the output images obtained using the CLAHE approach with the specified parameters.
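
A minimal OpenCV sketch of this pre-processing step is given below; the clip limit of 10 follows the text, while the 8 × 8 tile grid size is an assumed value, since the paper does not report the grid it used.

```python
import cv2

# Load a scan in grayscale (the paper works with 224 x 224 x 1 images);
# the file name is illustrative.
img = cv2.imread("lung_scan.png", cv2.IMREAD_GRAYSCALE)

# Clip limit 10 follows the text; the 8 x 8 tile grid is an assumption.
clahe = cv2.createCLAHE(clipLimit=10.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)  # per-tile equalization, merged by bilinear interpolation

cv2.imwrite("lung_scan_clahe.png", enhanced)
```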

Fig. 5 Pre-processed images of various respiratory diseases

3.3 Exploratory Data Analysis

During pre- and post-processing of the original images, histograms have been generated using hist() to examine the pixel patterns of the images. Figure 6a shows the histograms of the original images, which indicate that the pixel intensity distribution is not uniform and contains noisy signals. By contrast, after applying the contrast enhancement technique, Fig. 6b shows that the technique improves the visibility of certain features in the image as well as reducing the noise.
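
The following sketch reproduces the before/after comparison of Fig. 6, assuming hist() refers to a matplotlib histogram of the flattened pixel intensities.

```python
import cv2
import matplotlib.pyplot as plt

original = cv2.imread("lung_scan.png", cv2.IMREAD_GRAYSCALE)  # illustrative file
enhanced = cv2.createCLAHE(clipLimit=10.0, tileGridSize=(8, 8)).apply(original)

# Plot intensity histograms before and after CLAHE, as in Fig. 6a-b.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(original.ravel(), bins=256, range=(0, 255))
axes[0].set_title("Before pre-processing")
axes[1].hist(enhanced.ravel(), bins=256, range=(0, 255))
axes[1].set_title("After CLAHE")
plt.show()
```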

Fig. 6 (a) Histograms of images before pre-processing. (b) Histograms of images after pre-processing

3.4 Feature Extraction

In this section, after obtaining the histogram-equalized images, the features have been extracted using various image-processing techniques. Initially, contour features are used to find the extreme points for cropping the images and obtaining the desired region via thresholding techniques. During this phase, the properties of the images have been generated by calculating parameters such as area, epsilon, perimeter, height, width, extent, equivalent diameter, minimum value, aspect ratio, maximum value, minimum value location, maximum value location, extreme leftmost point, extreme rightmost point, mean color, extreme topmost point, and extreme bottommost point using Eqs. (1) to (17). All the computed values are displayed in Table 2.

Table 2 Characteristics of different images of respiratory diseases

Initially, we calculated the area, the product of height and width; from the same dimensions, the aspect ratio has also been calculated. Equations (1) and (2) compute them:

$$\text{area}=\text{height}\times \text{width}$$
(1)
$$\text{Aspect\;ratio}= \frac{{\text{width}}}{{\text{height}}}$$
(2)

Further, the parameters height and width are computed, as shown in Eqs. (3) and (4), based on the contour feature points passed to the bounding rectangle function of the OpenCV library.

$$\text{height}=cv2.boundingRect\left(cnt\right)$$
(3)
$$\text{width}=cv2.boundingRect\left(cnt\right)$$
(4)

Moreover, the perimeter, equivalent diameter, extent, and epsilon are also calculated using Eqs. (5)–(8). The perimeter is computed through the arc length, and the extent is the ratio of an object's area to its bounding rectangle area. The equivalent diameter is derived from the image's contour area, and epsilon measures the distance between two points of the same class.

$$\text{epsilon}= \sqrt{{({x}_{2}-{x}_{1})}^{2}+{({y}_{2}-{y}_{1})}^{2}}$$
(5)
$$\text{Perimeter}=0.1\times cv2.arcLength\left(cnt,True\right)$$
(6)
$$\text{Extent}= \frac{\text{object\; area}}{\text{bounding \;rectangle\; area}}$$
(7)
$$\text{Equivalent \;diameter}= \sqrt{\frac{4 \times \text{contour\; area}}{\pi }}$$
(8)

In addition to this, the maximum and minimum value locations, as well as the maximum and minimum values of the feature, are calculated along with the mean color intensity, as shown in Eqs. (9)–(13)

$$\text{Minimum\;value\;location}=cv2.\text{minMaxLoc}()$$
(9)
$$\text{Maximum\;value\;location}=cv2.\text{minMaxLoc}()$$
(10)
$$\text{Minimum\; value}=cv2.\text{min}()$$
(11)
$$\text{Maximum\; value}=cv2.\text{max}()$$
(12)
$$\text{Mean\; color}=\text{cv}2.\text{mean}()$$
(13)

In the end, the extreme leftmost, rightmost, bottommost, and topmost points are also calculated. Index 0 selects the x-coordinate, meaning the values for the extreme leftmost and rightmost points are computed in the horizontal direction, while index 1 selects the y-coordinate, computing the values for the extreme topmost and bottommost points in the vertical direction.

$$\text{Extreme\;leftmost\; point}=tuple(cnt\left[cnt\left[:,:,0\right].argmin()\right]\left[0\right])$$
(14)
$$\text{Extreme\;rightmost\; point}=tuple(cnt\left[cnt\left[:,:,0\right].argmax()\right]\left[0\right])$$
(15)
$$\text{Extreme\;topmost \;point}=tuple(cnt\left[cnt\left[:,:,1\right].argmin()\right]\left[0\right])$$
(16)
$$\text{Extreme\;bottommost \;point}=tuple(cnt\left[cnt\left[:,:,1\right].argmax()\right]\left[0\right])$$
(17)

Using cv2.findContours(), the contours of the image are generated, and the largest contour is selected from its morphological values. Contours are the curves that connect all continuous points (along a boundary) having the same color or intensity. They are helpful for shape analysis as well as object identification and recognition. In this research, they are utilized to generate the extreme points for cropping the image, so that the characteristics can be extracted and extraneous information or details can be discarded to save space and time. The colors red, green, blue, and teal mark the extreme points on the x–y coordinates, which are determined using argmax() and argmin(), as shown in Fig. 7.
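
A hedged sketch of this contour-feature step is shown below: it binarizes an enhanced scan, selects the largest contour, computes a few of the Table 2 properties, and crops the image at the four extreme points; the file name is illustrative.

```python
import cv2

img = cv2.imread("lung_scan_clahe.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnt = max(contours, key=cv2.contourArea)           # largest contour by area

x, y, w, h = cv2.boundingRect(cnt)                 # Eqs. (3)-(4)
area = w * h                                       # Eq. (1)
aspect_ratio = w / h                               # Eq. (2)
perimeter = cv2.arcLength(cnt, True)               # arc length, cf. Eq. (6)
extent = cv2.contourArea(cnt) / area               # Eq. (7)
equiv_diameter = (4 * cv2.contourArea(cnt) / 3.141592653589793) ** 0.5  # Eq. (8)

# Extreme points: index 0 = x (horizontal), index 1 = y (vertical)
leftmost = tuple(cnt[cnt[:, :, 0].argmin()][0])    # Eq. (14)
rightmost = tuple(cnt[cnt[:, :, 0].argmax()][0])   # Eq. (15)
topmost = tuple(cnt[cnt[:, :, 1].argmin()][0])     # Eq. (16)
bottommost = tuple(cnt[cnt[:, :, 1].argmax()][0])  # Eq. (17)

# Crop the scan to the region bounded by the extreme points.
cropped = img[topmost[1]:bottommost[1], leftmost[0]:rightmost[0]]
```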

Fig. 7 Applying feature extraction to multiple respiratory diseases (color figure online)

Further, the cropped images are segmented to obtain the region of interest by generating bounding boxes using the Otsu and adaptive thresholding techniques, as shown in Figs. 8 and 9 respectively, resulting in 38,976 image features. The Otsu method, also known as a binarization algorithm, is an uncomplicated and efficient automatic thresholding technique; in OpenCV, its results are generated by calling cv2.threshold() with the cv2.THRESH_OTSU flag. An image consists of two classes, background and foreground. The Otsu technique computes an optimized threshold value that minimizes the intra-class variance (σwc) and maximizes the inter-class variance (σbc) of these two classes. The two variances, σwc and σbc, are calculated using Eqs. (18) and (19), respectively, for all possible thresholds (thresh = 0 to I, the maximum intensity level). In the end, if a pixel's luminance is less than or equal to the threshold, it is replaced by 0 (black), and if greater than the threshold, it is replaced by 1 (white), yielding the binary (black/white) image.

Fig. 8 Images after applying Otsu thresholding

Fig. 9 Images after applying adaptive thresholding: (i) lung cancer, (ii) PE, (iii) normal lung, (iv) COVID-19, (v) pneumoconiosis

$${\sigma }_{wc}^{2}\left(t\right)= {\omega }_{1}\left(t\right){\sigma }_{1}^{2}\left(t\right)+{\omega }_{2}\left(t\right){\sigma }_{2}^{2}\left(t\right)$$
(18)
$${\sigma }_{bc}^{2}\left(t\right)= {\sigma }^{2}- {\sigma }_{wc}^{2}\left(t\right)$$
(19)

where the weights \({\omega}_{1}\left({t}\right)\) and \({\omega}_{2}\left({t}\right)\) are the probabilities of the two classes separated by the threshold t, and \({\sigma }_{1}\) and \({\sigma }_{2}\) are the variances of these two classes [35].
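
In OpenCV, the search over all thresholds is performed internally by cv2.threshold() with the THRESH_OTSU flag, as in the minimal sketch below (the input file name is illustrative).

```python
import cv2

img = cv2.imread("cropped_lung.png", cv2.IMREAD_GRAYSCALE)

# THRESH_OTSU makes OpenCV scan all thresholds t for the one that minimizes
# the within-class variance of Eq. (18); the input threshold (0) is ignored.
t, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Pixels <= t become 0 (black background); pixels > t become 255 (white object).
print("Otsu threshold:", t)
```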

An adaptive threshold is also chosen based on the statistical properties of the pre-processed images, which are cropped after their extreme points have been generated. The function cv2.adaptiveThreshold() is used as a weight-updating unit to find an acceptable threshold value for images that are bimodal in nature. Consider an image of size [W × H] and assign two weights, \({\mu }_{1}\) and \({\mu }_{2}\), which are compared against every pixel value in the [W × H] image. The weight closest to the pixel value is selected for updating: the difference between the closest weight and the input pixel is multiplied by the learning rate \(\beta\) and added to that weight. If \({\mu }_{1}\) is closer to the pixel value, \({\mu }_{1}\) is updated, and if \({\mu }_{2}\) is closer, \({\mu }_{2}\) is updated, by applying Eq. (20).

$${\mu }_{new}={\mu }_{old}+\beta \times (pixel-{\mu }_{old})$$
(20)

The updated weights are applied across all image pixels, and the average of the two weights is used as the threshold value, as Eq. (21) describes. This threshold can then be used to convert an image to binary form [36].

$${a}_{th}= \frac{{\mu }_{1}+ {\mu }_{2} }{2}$$
(21)

Pixels above the \({a}_{th}\) value are considered object, and those below it are considered background.
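
Below is a sketch of the weight-updating scheme of Eqs. (20) and (21) implemented directly in NumPy; the learning rate β and the initial weights are assumed values, since the paper does not report its settings, and OpenCV's built-in cv2.adaptiveThreshold() offers a related neighbourhood-based alternative.

```python
import numpy as np

def adaptive_threshold(img, beta=0.01, mu1=64.0, mu2=192.0):
    """Weight-updating threshold of Eqs. (20)-(21).

    beta and the initial weights mu1/mu2 are assumed values; the paper
    does not report the settings it used.
    """
    for pixel in img.ravel().astype(float):
        if abs(pixel - mu1) <= abs(pixel - mu2):
            mu1 += beta * (pixel - mu1)   # update the closest weight, Eq. (20)
        else:
            mu2 += beta * (pixel - mu2)
    a_th = (mu1 + mu2) / 2.0              # final threshold, Eq. (21)
    return np.where(img > a_th, 255, 0).astype(np.uint8)

# Stand-in image; in the pipeline this is a cropped, CLAHE-enhanced scan.
img = np.random.randint(0, 256, (224, 224), dtype=np.uint8)
binary = adaptive_threshold(img)
```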

3.5 Classifiers

This section briefly describes all the deep learning models that have been applied to the dataset (Sect. 3.1) for predicting and classifying airway diseases. In addition, their hyper-parameter values, which have been kept fixed throughout the research, are shown in Table 3.

Table 3 Hyper-parameters of applied deep learning models

EfficientNet It is a convolutional neural network based scaling and design method that uses a compound coefficient to scale the depth, width, and resolution dimensions consistently (Fig. 10). EfficientNet is constructed upon the foundational network derived from the neural architecture search conducted by the AutoML MNAS framework. The architectural design incorporates a mobile inverted bottleneck convolution technique, which bears resemblance to the MobileNetV2 model; however, this architecture is larger, primarily owing to the corresponding rise in floating-point operations per second (FLOPS) [37]. In this paper, four EfficientNet variants were used: EfficientNetB6 (total parameters: 40,970,656; trainable: 40,746,221; non-trainable: 224,435), EfficientNetV2B3 (total: 12,937,587; trainable: 12,828,371; non-trainable: 109,216), EfficientNetV2B1 (total: 6,934,391; trainable: 6,863,319; non-trainable: 71,072), and EfficientNetV2S (total: 20,337,333; trainable: 20,183,461; non-trainable: 153,872).

Fig. 10 Architecture of the EfficientNet model

DenseNet201 DenseNet201, shown in Fig. 11, has the property of reusing features with the help of its multiple layers, which increases variation in the subsequent layers' input and enhances performance. The model has a more complex and denser network in which all the layers are linked together through shorter connections for efficient training [38]. The total number of parameters generated by the DenseNet201 model in this study is 18,321,475, of which 18,092,419 are trainable and 229,056 are non-trainable.

Fig. 11 Architecture of the DenseNet201 model

Inception-v3 As shown in Fig. 12, the model contains 42 layers and has a lower error rate than Inception-v1 and Inception-v2. The core building block of Inception-v3 is the inception module. Each inception module comprises parallel branches of different filter sizes, including 1 × 1, 3 × 3, and 5 × 5 convolutions; these branches are designed to capture features at various spatial scales. In addition, 1 × 1 convolutions are used within the inception module to reduce the number of channels and control computational complexity. The outputs of all branches are concatenated along the channel dimension, providing a rich set of multi-scale features [39].

Fig. 12 Architecture of the Inception-v3 model

In this study, the Inception-v3 model generated a total of 21,808,355 parameters, of which 21,773,923 are trainable and 34,432 are non-trainable.

Xception The Xception model, shown in Fig. 13, comprises multiple modules called Xception blocks. Each Xception block consists of a sequence of depthwise separable convolutions, batch normalization, and nonlinear activation functions. Residual connections, as in the Inception-ResNet architecture, are also incorporated into Xception blocks to facilitate gradient flow and ease optimization. The Xception model typically concludes with global average pooling and a fully connected layer with softmax activation for classification: the global average pooling reduces the spatial dimensions to a vector representation, and the fully connected layer generates class probabilities [40]. In this study, the Xception model produced a total of 20,867,051 parameters, of which 20,812,523 are trainable and 54,528 are non-trainable.

Fig. 13 Architecture of the Xception model

ResNet50V2 The modified version of ResNet50 is ResNet50V2 (Fig. 14), and it performs better on the ImageNet dataset than ResNet50 and ResNet101. ResNet50V2 is organized into multiple stages, including Stage 01, Stage 02, Stage 03, Stage 04, and Stage 05, each containing several residual blocks. The feature map sizes decrease as the network goes deeper, capturing features at different scales. The architecture also uses a bottleneck design within each residual block, consisting of 1 × 1 convolutions to reduce dimensionality, 3 × 3 convolutions for feature extraction, and another 1 × 1 convolution for dimension restoration. This bottleneck architecture reduces computational complexity and allows for more efficient feature learning [41].

Fig. 14 Architecture of the ResNet50V2 model

The total number of parameters generated by the ResNet50V2 model in this study is 23,564,675, of which 23,519,235 are trainable and 45,440 are non-trainable.

InceptionResNet-v2 The core building blocks of Inception-ResNetV2 are the Inception blocks. These blocks capture multi-scale features crucial for understanding complex visual patterns. Each Inception block contains parallel branches with different filter sizes and pooling operations. By operating in parallel, the network can capture and efficiently combine features at various scales. One notable feature of Inception-ResNetV2 is the incorporation of residual connections. Residual connections allow for the direct propagation of information from earlier layers to later layers. This enables smoother gradient flow during training and helps alleviate the vanishing gradient problem, which can hinder the training of very deep networks. The residual connections also contribute to the network's ability to learn shallow and deep features effectively. Inception-ResNetV2 architecture also includes auxiliary classifiers. The auxiliary classifiers typically combine convolutional layers, pooling layers, and fully connected layers. These classifiers are inserted at intermediate stages of the network and help with gradient propagation during training. They encourage the network to learn more meaningful representations and prevent overfitting [42].

Figure 15 depicts the InceptionResNet-v2 basic block diagram. In this study, the InceptionResNet-v2 model generated 54,343,845 parameters in total, of which 54,283,301 are trainable and 60,544 are non-trainable.

Fig. 15 Architecture of InceptionResNet-v2

ResNet101V2 The ResNet101V2 architecture consists of 101 layers and is widely used in computer vision tasks like image classification and object detection. The network starts with an input layer that takes an image and feeds it to convolutional layers that extract low-level features. The key innovation of ResNet101V2 lies in its residual blocks, which include skip or shortcut connections. These connections enable the network to learn residual mappings by preserving the input and combining it with the output of the convolutional layers. The residual blocks also employ a bottleneck structure, which reduces the dimensionality of feature maps to improve efficiency without degrading performance. Further, global average pooling reduces the spatial dimensions, followed by fully connected layers for the final classification or regression. Activation functions such as ReLU introduce non-linearity, while the shortcut connections ensure the flow of gradients during training [43].

Overall, ResNet101V2 is a powerful architecture that leverages skip connections and bottleneck structures to train deep networks effectively and extract meaningful features for visual tasks. In this research, the ResNet101V2 model (Fig. 16) produced a total of 42,630,533 parameters, of which 42,532,869 are trainable and 97,664 are non-trainable.

Fig. 16 Architecture of ResNet101V2

Proposed hybrid transfer learning model The proposed hybrid model is composed of two pre-trained models, EfficientNetB6 and ResNet101V2, which are trained with an input size of 224 × 224. It generates 83,601,949 parameters, of which 83,279,085 are trainable and 322,099 are non-trainable, as shown in Fig. 17.

Fig. 17 Layered structure of the proposed hybrid model

The layered structure of EfficientNetB6 consists of one input layer, one rescaling layer, one normalization layer, two 2D convolution layers, two batch normalization layers, and two activation layers. The architecture also contains seven blocks as well as sub-blocks, which are connected sequentially. In blocks 1 and 7, there are three sub-blocks, each consisting of one GlobalAveragePooling2D layer, one reshape layer, three 2D convolution layers, one multiply layer, two batch normalization layers, one dropout layer, one activation layer, and one add layer. Likewise, from block 2 to block 6, the eight sub-blocks consist of four 2D convolution layers, two batch normalization layers, one activation layer, one ZeroPadding2D layer, one depthwise 2D convolution layer, one GlobalAveragePooling2D layer, and one reshape, add, and multiply layer each.

On the other side, the layered architecture of ResNet101V2 consists of one input layer, one ZeroPadding2D layer, two 2D convolution layers, one MaxPooling2D layer, two batch normalization layers, and three activation layers. The architecture also contains three blocks, which are followed by twenty-three blocks connected via one activation layer. Each block has sub-blocks consisting of three 2D convolution layers, two batch normalization layers, two activation layers, one ZeroPadding2D layer, and one add layer.

Finally, the output activation layer of each model is concatenated at the concatenate layer, which is further connected to a dense layer and a softmax layer, from which the probabilities of the airway-disease classes are obtained.
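
A hedged Keras sketch of this construction is given below. It assumes global average pooling of each backbone's final activation before concatenation (Table 4 suggests the dense layer may instead have been applied to the 7 × 7 feature maps directly), and it leaves the weights uninitialized because ImageNet weights require three-channel input, whereas the paper trains on grayscale images.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB6, ResNet101V2

# Shared 224 x 224 x 1 grayscale input; weights=None because ImageNet
# weights would require a 3-channel input.
inputs = layers.Input(shape=(224, 224, 1))
eff = EfficientNetB6(include_top=False, weights=None, input_tensor=inputs)
res = ResNet101V2(include_top=False, weights=None, input_tensor=inputs)

# Pool each backbone's final activation to a vector (an assumed
# simplification), concatenate, and classify into the five classes.
eff_vec = layers.GlobalAveragePooling2D(name="eff_gap")(eff.output)
res_vec = layers.GlobalAveragePooling2D(name="res_gap")(res.output)
merged = layers.Concatenate()([eff_vec, res_vec])
outputs = layers.Dense(5, activation="softmax")(merged)

hybrid = Model(inputs, outputs, name="efficientnetb6_resnet101v2")
hybrid.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
hybrid.summary()
```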

Besides this, the architecture is also shown in Table 4, where representative layers have been listed in sequential order to give the gist and the parameters of EfficientNetB6 + ResNet101V2.

Table 4 Architecture of proposed hybrid model (EfficientNetB6 + ResNet101V2)

The columns depicted in the table are Layer, which lists the name or type of each layer in the model; Output Shape, which indicates the shape of the output tensor or feature map produced by each layer, represented as (batch_size, height, width, channels); and Param #, which shows the number of parameters (weights and biases) associated with each layer. The description of each layer is as follows:

input_5 (InputLayer) This is the input layer of the model, expecting input tensors with a shape of (None, 224, 224, 1). "None" represents a variable batch size, 224 × 224 is the input image size, and 1 is the number of channels (grayscale).

rescaling_2 (Rescaling) This layer rescales the input data, so the values fall within a specific range.

normalization_2 (Normalization) This layer normalizes the input data, making it have zero mean and unit variance.

stem_conv_pad (ZeroPadding2D) This layer adds zero-padding to the input tensor.

stem_conv (Conv2D) It applies convolutional operations to the input and produces an output tensor with a shape of (None, 112, 112, 56).

stem_bn (BatchNormalization) This layer performs batch normalization on the previous output tensor.

stem_activation (Activation) It applies an activation function to introduce non-linearity to the tensor.

block1a_dwconv (DepthwiseConv2D) This layer performs depthwise convolution, which applies separate convolutions to each input channel.

conv2_block1_1_conv (Conv2D) This convolutional layer produces an output tensor with a shape of (None, 56, 56, 64).

block7c_project_conv (Conv2D) This layer applies convolution to the input tensor, resulting in an output tensor with a shape of (None, 7, 7, 576).

conv5_block2_2_conv (Conv2D) It performs convolution on the input tensor, generating an output tensor of shape (None, 7, 7, 512).

conv5_block3_3_conv (Conv2D) This convolutional layer produces an output tensor with a shape of (None, 7, 7, 2048).

top_conv (Conv2D) It applies convolution to the input tensor, resulting in an output tensor with a shape of (None, 7, 7, 2304).

conv5_block3_out (Add) This layer performs element-wise addition between two input tensors.

top_bn (BatchNormalization) It performs batch normalization on the previous output tensor.

post_bn (BatchNormalization) This layer applies batch normalization to the input tensor.

Activation It applies an activation function to the tensor.

concatenate_2 This layer concatenates multiple input tensors along the channel axis.

dense_2 (Dense) It is a fully connected (dense) layer that produces an output tensor with a shape of (None, 7, 7, 5).

In a nutshell, the table provides a summary of the architecture, input/output shapes, and parameter counts for each layer in the model.

3.6 Evaluation Parameters

The applied models now have to be evaluated to test their performance, and for that, certain parameters are required, which are described in this section.

Accuracy This parameter measures the efficiency of the model in correctly classifying the image of any respiratory disease [44]. It is calculated by Eq. (22)

$$\text{Accuracy}=\frac{\text{True\; Positive}+\text{True\;Negative}}{\text{True\;Positive}+\text{True\;Negative}+\text{False\;Positive}+\text{False\;Negative}}$$
(22)

Loss This parameter measures the discrepancy between the actual and the predicted values. If the loss value is close to zero, the model performs well; otherwise, it should be re-trained [45]. It is calculated by Eq. (23)

$$\text{Loss}= \frac{{\left(\text{Actual} - \text{Predicted}\right)}^{2}}{\text{Total\; number\; of\; observations}}$$
(23)

Precision and Recall These parameters are used to examine the model in terms of its positive predictions [46]. Both metrics are calculated by Eqs. (24) and (25), respectively.

$$\text{Precision}= \frac{\text{True\; Positive}}{\text{True\;Positive}+\text{False\;Positive}}$$
(24)
$$\text{Recall}= \frac{\text{True\;Positive}}{\text{True\;Positive}+\text{False\;Negative}}$$
(25)

F1 score This parameter evaluates the performance of the classifier, especially in scenarios where both precision and recall are important and need to be balanced efficiently [47]. It is represented by Eq. (26)

$$F1\;\text{score}= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}$$
(26)

Matthews correlation coefficient The Matthews correlation coefficient (MCC) is a parameter that depends on the values of the confusion matrix and characterizes the quality of the classifier's predictions [48]. It is computed using Eq. (27)

$$\text{MCC}= \frac{\text{True\;Positive} \times \text{True\;negative}-\text{False\;negative} \times \text{False\;positive}}{\sqrt{\left(\text{True\;positive}+\text{False\;positive}\right)\left(\text{True\;positive}+\text{False\;negative}\right)\left(\text{True\;negative}+\text{False\;positive}\right)(\text{True\;negative}+\text{False\;negative})}}$$
(27)
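
For reference, all of these metrics are available in scikit-learn; the sketch below computes them for placeholder five-class labels and predictions, using macro averaging as an assumption for the multi-class precision, recall, and F1 values.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = np.random.randint(0, 5, size=500)   # placeholder ground truth
y_pred = np.random.randint(0, 5, size=500)   # placeholder predictions

print("Accuracy :", accuracy_score(y_true, y_pred))                     # Eq. (22)
print("Precision:", precision_score(y_true, y_pred, average="macro"))   # Eq. (24)
print("Recall   :", recall_score(y_true, y_pred, average="macro"))      # Eq. (25)
print("F1 score :", f1_score(y_true, y_pred, average="macro"))          # Eq. (26)
print("MCC      :", matthews_corrcoef(y_true, y_pred))                  # Eq. (27)
```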

4 Experimental Results

In this section, the models EfficientNetB6, EfficientNetV2B3, DenseNet201, Inception-v3, Xception, EfficientNetV2B1, ResNet50V2, EfficientNetV2S, InceptionResNet-v2, ResNet101V2, and the hybrid (EfficientNetB6 + ResNet101V2) are evaluated using the parameters described in Sect. 3.6 to test their performance on the different disease datasets.

Initially, the models are evaluated for their training and testing accuracy as well as loss. Later, a 5 × 5 confusion matrix for the classes C0 (lung cancer), C1 (PE), C2 (COVID-19), C3 (normal lungs), and C4 (pneumoconiosis) is generated (Fig. 20) to evaluate the classification performance on these classes by comparing the actual target values with the predicted ones.

In Table 5, during the training phase, EfficientNetB6 and Xception generated the best accuracies of 98.99% and 99.75%, with loss values of 0.02 and 0.01, respectively. During the testing phase, EfficientNetB6 and ResNet101V2 achieved the best accuracies of 99.26% and 99.13%, respectively, each with a loss of 0.01.

Table 5 Evaluation of models for various respiratory diseases

Based on their best accuracy and loss on the testing dataset, these two models were combined to form the hybrid model, which was tested on the same dataset. Overall, the values in bold, generated by the hybrid model, indicate the highest accuracy for both the training and testing datasets, 99.84% and 99.77% respectively, with a minimum testing loss of 0.001.

Moreover, the curves generated by the models while iterating over the training and testing datasets for 15 epochs are studied in Fig. 18. Analysis of the curves reveals that, at certain epochs, the training loss drops to a point of stability, as does the testing loss, with only a small gap between them; similarly, the testing accuracy rises to a point of stability with a small gap from the training accuracy. This shows that the models have well-fitting learning curves. At the remaining epochs, however, there is a large gap between the accuracy and loss curves, which indicates that the training dataset does not provide enough information to learn the problem, compared to the testing dataset used during evaluation. Compared with the remaining models, the accuracy and loss curves of the hybrid model and EfficientNetB6 are superior.

Fig. 18 Analyzing the curves of the models during the training and testing phases

The models are also evaluated on the parameters F1 score, recall, and precision, as shown in Table 6. The proposed hybrid model (EfficientNetB6 and ResNet101V2) generated the highest precision (1.00), recall (0.99), and F1 score. The lowest values, 0.63, 0.66, and 0.60 respectively, were obtained by EfficientNetV2B1, which means that EfficientNetV2B1 generated the highest number of false positives relative to true positives (Table 8).

Table 6 Evaluating applied models for multi-disease detection

Figure 19 shows the execution time taken by the models to generate the testing accuracy. The lowest execution time was achieved by EfficientNetB6 and Inception-v3 with 3280 s, while the highest was taken by ResNet101V2 with 3779 s. The proposed hybrid model took 3291 s to generate the testing accuracy output. As for training time, all the transfer models took on average 4 to 5 h to train, whereas the proposed hybrid model took 10 h to generate the training accuracy and loss.

Fig. 19 Execution time of the models

After training and testing the models with the airway diseases dataset, the confusion matrix shown in Fig. 20 has been generated for the five target classes to compute their true positive, false positive, false negative, and true negative values using the formulae shown in Table 7.

Fig. 20 Confusion matrices of the models

Table 7 Formulae to compute values of confusion matrix

Here, the values of i and j correspond to the label of the class. For example, for class 0, the true positive is the value at \({C}_{00}\), and so on. In a nutshell, all the diagonal values of the confusion matrix are the true positives of their corresponding classes.
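
The Table 7 formulae can be expressed compactly; the sketch below derives the per-class TP, FN, FP, and TN values from a placeholder 5 × 5 confusion matrix whose rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Placeholder 5x5 confusion matrix (rows = actual class, columns = predicted).
C = np.random.randint(0, 100, size=(5, 5))

for i in range(5):
    tp = C[i, i]                 # diagonal value of class i
    fn = C[i, :].sum() - tp      # rest of row i: actual i, predicted otherwise
    fp = C[:, i].sum() - tp      # rest of column i: predicted i, actual otherwise
    tn = C.sum() - tp - fn - fp  # everything outside row i and column i
    print(f"Class {i}: TP={tp} FN={fn} FP={fp} TN={tn}")
```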

Table 8 shows that, using EfficientNetB6 for class 0 (lung cancer), the true positive value of 430 indicates that 430 positive-class data points are successfully classified. The false negative value of 31 indicates that 31 positive-class data points are classified incorrectly as negative, the false positive value of 25 indicates that 25 negative-class data points are classified incorrectly as positive, and the true negative value of 5362 indicates that 5362 negative-class data points are correctly classified. The TP, FN, FP, and TN values for the remaining classes and classifiers can be read in the same way. After analyzing the table completely, it is found that the classifiers performed quite well on our dataset, obtaining high true negative and true positive values, except for EfficientNetV2B1. In the end, the system performance of the proposed hybrid (EfficientNetB6 and ResNet101V2) model has also been validated using images taken from the dataset to predict the class of each disease; the results are shown in Fig. 21.

Table 8 Values of TP, TN, FP, and FN for different classes of respiratory diseases
Fig. 21 Prediction of respiratory diseases using the proposed hybrid model

5 Discussion

Artificial intelligence technologies have been used to forecast the mortality rate in patients with airway illnesses, as these diseases are among the most common causes of mortality worldwide. In this paper, various techniques have been used to develop a system for identifying and classifying airway diseases, namely PE, COVID-19, lung cancer, and pneumoconiosis, along with normal lung images. Initially, the CLAHE technique was used to enhance the quality and contrast of the images, followed by contour feature extraction (Sects. 3.2, 3.4). These contour features were used to crop the images, which were then segmented to obtain the region of interest; two thresholding techniques, Otsu/binarization and adaptive, were applied to the image dataset to obtain the ROI efficiently (Sect. 3.4). Later, ten deep pre-trained models were used, EfficientNetB6, EfficientNetV2B3, DenseNet201, Inception-v3, Xception, EfficientNetV2B1, ResNet50V2, EfficientNetV2S, InceptionResNet-v2, and ResNet101V2, from which the two best models were hybridized and re-trained on the same dataset. During testing, the proposed hybrid model achieved the highest recall, accuracy, precision, and F1 score compared with the other models. Figure 22 depicts further assessments of the models using recall, precision, F1 score, and Matthews correlation coefficient for the distinct classes of respiratory diseases.

Fig. 22 Analysis of models for different performance metrics

EfficientNetB6, Inception-v3, and InceptionResNet-v2 obtained the highest precision, accuracy, F1 score, recall, and MCC of 1.00 for PE and pneumoconiosis, whereas EfficientNetV2B3, DenseNet201, Xception, ResNet50V2, EfficientNetV2S, and ResNet101V2 obtained the same values only for PE. On the other hand, EfficientNetV2B1 computed the highest accuracy of 0.98 for lung cancer, a precision of 1.00 and an F1 score of 0.78 for PE, a recall of 0.78 for COVID-19, and an MCC of 0.79 for pneumoconiosis. The proposed hybrid model (EfficientNetB6 and ResNet101V2) obtained 1.00 accuracy, recall, precision, F1 score, and MCC for PE and COVID-19.

After obtaining all the results, a comparison has been made between the proposed hybridized method and the techniques used by other researchers to predict multiple airway diseases, on the basis of the accuracy metric, as presented in Table 9.

Table 9 Comparing the existing and the current technique

6 Conclusion

In this paper, ten deep transfer learning models, EfficientNetB6, EfficientNetV2B3, DenseNet201, Inception-v3, Xception, EfficientNetV2B1, ResNet50V2, EfficientNetV2S, InceptionResNet-v2, and ResNet101V2, along with the proposed hybrid model (EfficientNetB6 + ResNet101V2), were trained using the dataset of four different respiratory diseases. Hybridizing the two models, EfficientNetB6 and ResNet101V2, obtained the highest testing accuracy of 99.77%, while the lowest values were obtained by EfficientNetV2B1, with 69.48% accuracy and the highest loss of 0.84. The research also has limitations: considerable computational time was needed to pre-process the data and extract its features, and the Otsu threshold did not work as efficiently as the adaptive threshold in finding the region of interest in the dataset images. The reason is the limited flexibility of the technique, which cannot generate an accurate ROI for complex images with multiple regions and varying intensity values. In addition, EfficientNetV2B1 generated the highest number of false positives, with a low precision of 0.63, an F1 score of 0.60, and a recall of 0.66. This indicates that the model suffers from underfitting, which should be addressed in the future to enhance its prediction accuracy. This can be done by adding more layers, increasing the number of neurons in the existing layers, or using a more complex model architecture. Another approach is to provide more training data or increase the number of epochs during training so the model can learn the underlying patterns more effectively. On a larger scale, researchers can also work toward a unified platform that detects all airway diseases instantly, minimizing the time required of patients and clinicians.