1 Introduction

Skin cancer is the uncontrolled growth of malignant cells in the epidermis, the skin’s outermost layer, as a result of unrepaired DNA damage that generates mutations. Skin cells with these abnormalities develop fast, resulting in malignant tumours. Because malignant cancer cells are lethal, they are treated with chemotherapy, radiation, and surgery. Tumours may be distributed differently depending on gender and age, and cancer presents itself in a variety of ways depending on the species and organ affected [1]. As per the World Health Organization (WHO), approximately 14 million cancer cases and 9.6 million cancer-related deaths have been reported worldwide, and statistics show that cancer is the foremost cause of human mortality [2]. In 2022, an estimated 197,700 new cases were detected in the United States: 97,920 in situ (non-invasive) cases limited to the epidermis (the top layer of skin) and 99,780 invasive cases that penetrate the epidermis and enter the second layer of skin (the dermis).

The most prevalent types include basal cell carcinoma (BCC), nevus, melanoma, and pigmented benign keratosis (PBK). BCCs are uncontrolled, abnormal growths that arise from the basal cells in the skin’s outermost layer; BCC is the most frequently identified type of skin cancer in the United States. Melanoma develops in melanocytes, the cells responsible for producing the pigment melanin, and appears in different colours, viz. rose pink, royal purple, azure, and colourless [3]. White people are more than 20 times more likely than black people to develop melanoma: the lifetime risk of melanoma is 2.6% (1 in 38) for whites, 0.1% (1 in 1,000) for blacks, and 0.58% (1 in 172) for Hispanics [4]. Understanding one’s skin condition, and what precautions and steps to take, can help in early diagnosis. Skin cancer is caused by air pollution, UV radiation, and an unhealthy lifestyle [5, 41]. Regularly applying sunscreen with an SPF of 15 or higher can lower the incidence of melanoma and squamous cell carcinoma by 50% and 40%, respectively [6]. Dermoscopy allows skin lesions to be viewed without being impeded by skin surface reflections [7]. It is usually utilized to evaluate pigmented lesions in order to distinguish benign melanocytic nevi and seborrheic keratosis from malignant melanoma and pigmented BCC [8]. Any unusual skin patch that progresses over time should be noted [9]. Manual dermoscopy image evaluation is laborious, error-prone, and discretionary, and it can yield widely disparate diagnostic conclusions [10].

In addition, many techniques have been introduced in dermatology to prevent skin cancer, elucidating a set of elements through which confirmed cases of melanoma can be identified [11]. A computer-aided diagnostic procedure is more objective and reliable than a human-expert diagnosis, which is subjective and not necessarily reproducible [12, 38]. Several meta-analyses have found that dermoscopy improves the accuracy of melanoma diagnosis compared to naked-eye assessment, but the practice is time-consuming, relies on the clinician’s interpretation, is therefore highly subjective, and may deliver significantly different diagnostic conclusions [13]. Computer-aided design (CAD) is a tool that helps in design creation, modification, analysis, and optimization; it is also utilised to boost designer productivity, design quality, documentation, and the creation of a manufacturing database, and CAD technology allows medical procedures to be customized to an individual’s specific needs. The techniques and technologies designed to automate the visual inspection of skin lesions, typically performed by dermatologists, are intended to help clinicians i) detect early signs of cancer; ii) allow patients to properly evaluate lesions; iii) focus on promoting melanoma prevention; and iv) support awareness programs [14]. Deep neural networks have had a substantial effect on healthcare data, reaching high accuracy in the categorization of a variety of illnesses, including skin cancer [15]. Deep learning algorithms have reached human-level performance on a variety of complex computer vision and image classification tasks [17]. Computer-based detection employs imaging techniques and artificial intelligence; its phases include collecting dermoscopic images, hair filtering, noise removal, entropy-threshold segmentation, feature extraction, and classification. A Back-Propagation Neural Network (BPN) has been utilised for classification: it determines whether a lesion is malignant or not. In a BPN, weights are randomly assigned at the start of training, and various machine-learning techniques have built on this idea. During the forward pass of the signal, the network generates an output based on the initial weights and the activation function used; the result is compared to the expected result, and if they do not match, the error is calculated by subtracting the desired output from the actual output [22, 40], as illustrated in the sketch below. This study presents a weighted ensemble strategy using an ensemble of seven DL algorithms.
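The sketch below illustrates this back-propagation step under stated assumptions: the 4–3–1 network shape, sigmoid activation, and all values are illustrative placeholders, not the study’s configuration.

```python
import numpy as np

# Hypothetical one-pass sketch of the BPN idea described above: weights
# start random, a forward pass produces an output, and the error is the
# actual output minus the desired output.
rng = np.random.default_rng(seed=0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.random(4)                 # example input features
w_hidden = rng.random((4, 3))     # randomly assigned initial weights
w_out = rng.random((3, 1))

hidden = sigmoid(x @ w_hidden)    # forward pass through the hidden layer
y_pred = sigmoid(hidden @ w_out)  # actual network output

y_true = np.array([1.0])          # desired output (e.g., malignant = 1)
error = y_pred - y_true           # error used to update the weights backwards
```

The following are the research contributions: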

  • A weighted ensemble approach has been proposed using an ensemble of seven deep learning algorithms, including InceptionV3, Xception, ResNet50, EfficientNetB4, and MobileNet. The suggested methodology classifies skin cancer data into two categories: melanoma and nevus.

  • The ISIC skin lesion image dataset has been used for experimentation.

  • To evaluate the efficacy and efficiency of the proposed technique, the evaluation metrics used are accuracy, precision, recall, F1 score, sensitivity, and specificity.

  • The robustness of the proposed method has also been checked on another ISIC dataset.

  • On the same set of parameters, comparisons have been made between modern approaches and conventional algorithms.

The paper is organized as follows: Section 2 reviews the related literature, and Section 3 describes the detailed architecture of the methodology. Section 4 elucidates the results and analysis, which is followed by the conclusion.

2 Literature review

Detailed research has been conducted using machine learning (ML) and deep learning (DL) algorithms for skin lesion classification and identification. Satin Jain et al. proposed six different transfer learning models and applied multi-class classification on the HAM10000 skin cancer dataset; the methodology achieved an accuracy of 90.48% [9, 30], and it was concluded that Xception outperformed the other transfer learning networks employed in the study. The authors of [3] proposed a Region of Interest (ROI) based methodology that uses a Convolutional Neural Network (CNN) with data augmentation for ROI images from the DermIS and DermQuest datasets; the suggested approach provided accuracies of 97.9% and 97.4% for DermIS and DermQuest, respectively. Fengying Xie et al. developed an ensemble method for binary classification of melanocytic tumours as benign or malignant, with an accuracy of 94.17% for xanthous races and 91.11% for Caucasian races [18]. On the Dermweb dataset, Suganya R et al. introduced a support vector machine classifier that binary classifies images with an accuracy of 96.80% [19]. The authors of [2] proposed GoogleNet, which categorises eight types of skin lesions with an accuracy of 94.92%. M. A. Farooq et al. used a probabilistic strategy in which active contours and a watershed merged mask were applied, after which SVM and neural classifiers were used to categorize the segmented mole; the DermIS and DermQuest datasets were used during experimentation to categorise melanoma or non-melanoma images with an accuracy of 80% [20]. Fikret Ercal et al. used a feed-forward neural network to extract discriminant features from 326 images, attaining an accuracy of 86% [21]. J. Abdul Jaleel et al. used an ANN trained with back-propagation to obtain an accuracy of 93% [22]. The authors of [11] proposed graphics processors for clinical skin image analysis, using an ANN to analyse the image and study borders or contours to determine a final diagnosis; the collection included 730 images of benign and malignant cases from the International Skin Imaging Collaboration’s (ISIC) MED-NODE project, and the approach achieved 76.67% accuracy, with a 78.79% success rate on melanoma cases and a 74.07% success rate on benign lesions. The authors of [17] proposed a set of procedures for segmenting skin lesions and assessing the observed area and surrounding skin tissue for melanoma detection; tested on the ISIC 2016 dataset, the method achieved a sensitivity of 95%. Ardan Adi Nugroho et al. put forward a CNN-based identification system using dermoscopy images, with accuracies of 80% and 78% at training and testing time, respectively [23]. Amirreza Rezvantalab et al. studied the effectiveness and capability of CNNs to classify eight skin diseases using pre-trained models on the HAM10000 and PH2 datasets, obtaining an accuracy of 94.40% for melanoma and BCC [13]. Yuexiang Li et al. proposed a DL system that performs segmentation and coarse classification simultaneously using a fully convolutional residual network (FCRN); a lesion index calculation unit was designed to refine the results, and a basic CNN was presented for dermoscopy image feature extraction. Evaluated on the ISIC 2017 dataset, the proposed frameworks reached accuracies of 75.3%, 84.8%, and 91.2%, respectively [24]. Gerald Schaefer et al. suggested an ensemble method that achieved a sensitivity of 93.76% on 564 skin lesion images [25]. E. Nasr-Esfahani et al. implemented a computationally complex CNN-based method on several clinical images, resulting in an accuracy of 81% [26]. Titus J. Brinker et al. proposed a CNN classifier using the International Symposium on Biomedical Imaging 2016 challenge dataset, giving an average precision of 70.9% [27]. Shorfuzzaman applied deep learning models (MobileNet, Xception, ResNet50, ResNet50V2, and DenseNet121) pre-trained on ImageNet for transfer learning, evaluating a classifier for each model [32]. Shahsavari fused four different CNN models and evaluated classification results on 934 and 200 images from the ISIC and PH2 test data, with average accuracies of 97.1% and 96%, areas under the receiver operating characteristic curve (AUC) of 98.6% and 98.1%, precisions of 87.1% and 90.2%, and recalls of 86.7% and 85.4% for the ISIC and PH2 test data, respectively [33]. Jin proposed a cascade knowledge diffusion network (CKDNet) to transfer and aggregate knowledge learnt from different tasks and simultaneously boost the performance of classification and segmentation [34]. Imran deployed a model using VGG, CapsNet, and ResNet learners for skin cancer detection [35]. Tembhurne designed a model that achieves a higher accuracy of 93%, with individual recall scores of 99.7% and 86% for the benign and malignant forms of cancer, respectively [36]. Basak used MFSNet (Multi-Focus Segmentation Network), with differently scaled feature maps, to compute the final segmentation mask from raw input RGB images of skin lesions [37].

3 Weighted ensemble framework

The weighted average ensemble, also known as the weighted sum ensemble, combines predictions from multiple models, where each model’s weight is based on its performance. The weight for each learner (YInceptionV3, YMobileNet, YXception, YVGG16, YCNN, YResNet50, YEfficientNetB4) in the weighted ensemble is determined from its average accuracy on the testing dataset. The resulting weights Yk; k = 1, …, 7 are scaled to sum to one; this scaling does not influence the weighted ensemble result. The weights allocated to each model in the weighted ensemble model are shown in Table 1, and the algorithm for the same is shown in Fig. 1.

Table 1 Weights used in second weighted ensemble model
Fig. 1 The flow diagram of the proposed weighted ensemble model

The individual learners’ decision values for each image in the test dataset are stacked to create the ensemble decision mapping. For the two proposed ensemble approaches, the average ensemble model and the weighted ensemble model, the ensembled decision values are calculated using an indicator function for the ith class of the dataset, which aligns the predicted value of the kth model with the associated class label, as in Eq. (1).

$$\Delta \left({p}_{k}^{j},{l}^{i}\right)= \begin{cases}1; & {p}_{k}^{j} \in {l}^{1}\\ 2; & {p}_{k}^{j} \in {l}^{2}\end{cases}$$
(1)

Here, \({p}_{k}^{j}\) is the predicted value of the kth model for image I(j), i.e., \({p}_{InceptionV3}^{j}\)= InceptionV3(I(j)), \({p}_{MobileNet}^{j}\)= MobileNet(I(j)), \({p}_{Xception}^{j}\)= Xception(I(j)), \({p}_{VGG16}^{j}\)= VGG16(I(j)), \({p}_{CNN}^{j}\)= CNN(I(j)), \({p}_{ResNet50}^{j}\)= ResNet50(I(j)), \({p}_{EfficientNetB4}^{j}\)= EfficientNetB4(I(j)), where \(j \in \{1, \dots, M\}\) and M is the number of test images.

$${final\_pred}_{(i)}^{\left(j\right)}=\sum\nolimits_{k=1}^{7}{Y}_{k}\times \Delta \left({p}_{k}^{j},{l}^{i}\right),\quad \mathrm{for}\; i\in [1,2]$$
(2)

The weighted ensemble model’s final predicted output for an image I(j) from the ith class is obtained by summing the products of the individual predictions and their respective weights across all models, and then selecting the class with the highest weighted sum.

$${output}^{\left(j\right)}=\underset{i}{\mathrm{arg\,max}}\left({final\_pred}_{(i)}^{\left(j\right)}\right)$$
(3)
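A minimal sketch of Eqs. (1)–(3), assuming each learner outputs per-class scores for every test image; the weight values below are placeholders, not those reported in Table 1.

```python
import numpy as np

# Placeholder weights Y_k for the seven learners, scaled to sum to one.
Y = np.array([0.10, 0.12, 0.13, 0.14, 0.15, 0.17, 0.19])

def weighted_ensemble(per_model_scores, Y):
    """per_model_scores: shape (7, n_images, n_classes), one row per learner."""
    # Eq. (2): weighted sum of the individual predictions over the 7 models.
    final_pred = np.tensordot(Y, per_model_scores, axes=1)  # (n_images, n_classes)
    # Eq. (3): pick the class with the highest weighted score for each image.
    return final_pred.argmax(axis=-1)

# Example: 7 learners, 4 test images, 2 classes (melanoma, nevus).
scores = np.random.rand(7, 4, 2)
print(weighted_ensemble(scores, Y))
```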

3.1 Algorithm: proposed CNN model

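The original presents the algorithm as a figure; the following is a minimal Keras sketch of the custom CNN described in Section 4.3. The filter counts (32/64) and dense width (128) are assumptions, while the kernel size, input shape, pooling and dropout placement, activations, loss, and optimizer follow the text.

```python
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(100, 100, 3)):
    """4-layer CNN, 3x3 kernels: max pooling on layers 2 and 4,
    dropout 0.25 on layers 2-4, two dense layers with dropout 0.5."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),   # conv layer 1
        layers.Conv2D(32, (3, 3), activation="relu"),   # conv layer 2
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu"),   # conv layer 3
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu"),   # conv layer 4
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="sigmoid"),        # dense layer 1
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),          # dense layer 2: melanoma/nevus
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```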

4 Proposed methodology

This study proposes an ensemble of classifiers for identifying skin cancer lesions on the ISIC dataset, focusing on the weighted ensemble technique, which combines the classifiers’ predicted outputs in a weighted manner to determine the final output [28]. Ensemble techniques built from a variety of pre-trained CNN models have been employed for skin cancer classification before: the approach investigated in [4] obtained an accuracy of 76%, and [25] produced a sensitivity of 93.76%. This section also provides a comparison of the effectiveness of several transfer learning networks. Figure 1 shows the proposed methodology.

4.1 Dataset description

The dataset comprises 11449 skin lesion images in .jpg format. The ISIC dataset used in the experimentation has been taken from a publicly available repository (https://www.kaggle.com/datasets/qikangdeng/isic-2019-and-2020-melanoma-dataset). It is divided into two classes of skin cancer, melanoma and nevus; sample images of both classes are displayed in Fig. 2a, b.

Fig. 2 a, b Sample images of melanoma and nevus classes from the ISIC dataset

Figure 3 represents the distribution of images across classes, where ~44% of the data belongs to the melanoma class and ~56% to the nevus class. The dataset was split into training and testing sets with an 80–20 split; the complete distribution is detailed in Table 2, and the class distribution is depicted in Fig. 3.

Fig. 3 Class distribution of skin cancer images

Table 2 A complete distribution of skin cancer images

4.2 Data preprocessing

The skin lesion images have been pre-processed before training and validation using various image pre-processing techniques. This is done to reduce the complexity of the data provided to the model. Each image in the dataset was resized to 100 × 100 pixels with three channels: Red, Green, and Blue (RGB). The data was first separated into training and testing sets, with 80% of the data used for training and 20% for testing. The next preprocessing phase was label encoding, the act of converting labels into a numeric format so that machines can read them; ML techniques can then make better use of those labels. It is a necessary pre-processing step for structured datasets in supervised learning. In the suggested model, the labels melanoma = 0 and nevus = 1 were assigned to the two classes, as sketched below.
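A minimal sketch of this preprocessing, assuming the images are loaded with OpenCV; the path list, label strings, rescaling to [0, 1], and random seed are assumptions, not the authors’ exact pipeline.

```python
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(image_paths, label_strings):
    """Resize to 100x100 RGB, label-encode (melanoma=0, nevus=1), split 80-20."""
    images = []
    for path in image_paths:
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)  # BGR -> RGB
        images.append(cv2.resize(img, (100, 100)))
    X = np.array(images, dtype="float32") / 255.0  # scaling is an assumption
    y = np.array([0 if s == "melanoma" else 1 for s in label_strings])
    return train_test_split(X, y, test_size=0.2, random_state=42)
```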

4.3 Model architecture

An ensemble of classifiers employs several deep learning classifiers in order to outperform a single classifier [29]. This study develops a weighted ensemble technique in which the final output decision is made from the weighted sum of the predicted outputs of the classifiers. The weighted ensemble model of the proposed methodology combines predictions from numerous CNN models, including InceptionV3, MobileNet, Xception, ResNet50, VGG16 and others, with weights assigned to each model based on performance and expertise [28]. The weighted ensemble is a variation of voting: rather than assuming that all models are equally capable and contribute equally to the ensemble’s predictions, each model is given a specific weight, which is multiplied by its predicted value and used to form the summed or averaged forecast. The suggested classification model added up the expected probabilities for each class and selected the class with the highest probability. Figure 4 depicts the suggested methodology’s architecture.

Fig. 4 Architecture of the weighted ensemble deep learning model

To make a weighted average prediction, each ensemble member must first be assigned a fixed weight coefficient. This can be a percentage expressed as a floating-point value between 0 and 1, or an integer starting from 1 that represents the number of votes each model should receive. The first weighted ensemble model employed fixed, equal weights, whereas the second used weights computed by a search method that evaluated the model’s performance over various weight combinations. Models pre-trained on the ImageNet dataset were used for transfer learning, along with a basic CNN model, to build the weighted ensemble approach. The proposed methodology combines the pre-trained models mentioned below using transfer learning, changing the output layers to fit the dataset; these models are briefly discussed in this section. The InceptionV3 model includes convolutions, average pooling, max pooling, concatenations, dropouts, and fully connected layers. Batch normalization is applied to the activation inputs, and loss is calculated using the softmax function [13], depicted in Eq. (4):

$${\sigma (z)}_{i}=\frac{{e}^{{z}_{i}}}{\sum_{j=1}^{k}{e}^{{z}_{j}}},$$
(4)

where i = 1, 2, …, k and z = (z1, z2, …, zk) ∈ ℝ^k.

MobileNet uses a 53-layer deep CNN architecture targeted at mobile and embedded vision applications. The Xception architecture is a 71-layer CNN with the same number of parameters as InceptionV3; its implementation advantage comes from more effective use of model parameters rather than increased capacity. VGG16 is a deep CNN with 16 layers that can be imported pre-trained on the ImageNet database; the network classifies 1000 different object categories, so it contains rich feature representations for a wide range of images. ResNet50 has 48 convolutional layers along with one max-pool and one average-pool layer [31]. EfficientNet has a CNN architecture that evenly balances the depth, width, and resolution dimensions using a compound coefficient with preset scaling factors. A 4-layer deep CNN model was fed into the ensemble model, with a kernel size of 3 × 3 throughout, an input image size of 100 × 100, and a default batch size of 32. Max pooling was performed on layers 2 and 4, along with a dropout of 0.25 on layers 2 to 4. The model has two dense layers with batch size 32 and a dropout of 0.5. It used ReLU as the activation function in the hidden layers and the sigmoid function for the dense layers [30]; the ReLU function is shown in Eq. (5):

$$\mathrm{ReLU}(x)=\begin{cases}0; & x<0\\ x; & x\ge 0\end{cases}$$
(5)
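As a sketch of how each pre-trained backbone can be adapted by replacing the output layers, the snippet below wires InceptionV3 to the two-class task; the global-pooling head and the frozen base are assumptions rather than the authors’ exact configuration.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Load ImageNet weights without the original 1000-class classification head.
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(100, 100, 3))
base.trainable = False  # keep the pre-trained features fixed during training

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # melanoma vs. nevus output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```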

5 Results and analysis

This section presents the analysis and performance of the suggested methodology on the ISIC dataset. All of the images were fed into both ensemble models, the first using equal weights and the second using different weights. The dataset was split into training and testing sets with an 80:20 ratio, and the results were compared using the following measures, whose formulas are presented in Eqs. (6)–(10):

$$\mathrm{Accuracy }=\frac{(TP+TN)}{\left(TP+FP+FN+TN\right)}$$
(6)
$$\mathrm{Precision }=\frac{(TP)}{\left(TP+FP\right)}$$
(7)
$$\mathrm{Recall }=\frac{(TP)}{\left(TP+FN\right)}$$
(8)
$$\mathrm{F}1-\mathrm{Score }=\frac{2 \times precision \times recall}{precision+ recall}$$
(9)
$$\mathrm{Specificity }=\frac{(TN)}{\left(TN+FP\right)}$$
(10)
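A minimal sketch computing Eqs. (6)–(10) from a binary confusion matrix; the label arrays are placeholders, and scikit-learn’s convention of treating label 1 as the positive class is an assumption.

```python
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + fp + fn + tn)           # Eq. (6)
    precision = tp / (tp + fp)                           # Eq. (7)
    recall = tp / (tp + fn)                              # Eq. (8); equals sensitivity
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (9)
    specificity = tn / (tn + fp)                         # Eq. (10)
    return accuracy, precision, recall, f1, specificity
```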

The performance of all six transfer learning networks, as well as the basic CNN model with self-constructed layers, is shown in Tables 3 and 4.

Table 3 Comparison of various CNN models on ISIC dataset
Table 4 Comparison of various CNN models on another ISIC dataset

All the models were given 11449 images in total as input from the first dataset, divided into two categories, 5106 MEL images and 6343 NEV images, each of size 100 × 100 in RGB configuration. In all the models, the inputs were first flattened, and the ‘Sigmoid’ activation function, the ‘BinaryCrossEntropy’ loss function, and the ‘Adam’ optimizer were used.

The first model is InceptionV3. It uses a CNN design from the Inception family that includes label smoothing and factorized 7 × 7 convolutions, with an auxiliary classifier used to propagate label information to the lower layers of the network. At best it gives an accuracy of 79.69%, precision of 81%, recall of 80%, F1 score of 79%, sensitivity of 65%, and specificity of 92%. The second model is MobileNet. It has a streamlined architecture with low latency that uses depth-wise separable convolutions, and at best achieves an accuracy of 84.54%, precision of 85%, recall of 85%, F1 score of 84%, sensitivity of 77%, and specificity of 91%. Xception obtained a precision of 86%, recall of 85%, F1 score of 85%, sensitivity of 75%, specificity of 94%, and an accuracy of 85.20%; for all input weights, the default pre-trained model was set to "xception". VGG16 achieved an accuracy of 87.68%, precision of 88%, recall of 88%, F1 score of 88%, sensitivity of 88%, and specificity of 88%; the default pre-trained model was set to "imagenet" for all input weights. The fifth model is the custom CNN. The ensemble model was fed uniformly with an input image size of 100 × 100 and a default batch size of 32 using a custom 4-layer deep CNN model with a kernel size of 3 × 3. Layers 2 and 4 were max pooled, and layers 2 to 4 had a dropout of 0.25. Two dense layers with a batch size of 32 and a dropout of 0.5 complete the model; ReLU was the chosen activation function for the hidden layers, and sigmoid was used for the dense layers. This model achieves 89.87% accuracy, 90% precision, 90% recall, 90% F1 score, 85% sensitivity, and 94% specificity. ResNet50 is the sixth model; its residual blocks, or skip connections, transform the design into a residual network. This model has a 90.61% accuracy, a 91% precision, a 91% recall, a 91% F1 score, an 88% sensitivity, and a 93% specificity. EfficientNetB4 is the seventh and final model. It has a CNN design and scaling strategy that uniformly scales all depth, width, and resolution dimensions using a compound coefficient. With the best attainable accuracy of 91.79%, precision of 92%, recall of 92%, F1 score of 92%, sensitivity of 90%, and specificity of 93%, it outperforms all other models.

The suggested methodology was also applied to another ISIC dataset (https://www.kaggle.com/datasets/jaiahuja/skin-cancer-detection) to assess the model’s robustness, given the rapid growth in melanoma-related mortality over recent decades. Since the suggested methodology distinguishes between two types of skin cancer, melanoma and nevus, only these classes were used to compare the two datasets. This collection contains 827 images divided into two categories: melanoma (454) and nevus (373). All of the CNN models used in the proposed methodology were trained and tested on this dataset as well. Table 4 illustrates the performance of the six pre-trained transfer learning models as well as the custom CNN on this dataset, and Fig. 5 presents comparative graphs of the metrics in Tables 3 and 4.

As can be seen from Table 4, ResNet50 was the best-performing algorithm, with an accuracy of 78.31%, precision of 78%, recall of 78%, sensitivity of 84%, and specificity of 70%. VGG16 comes second, followed by Xception and the custom CNN. The InceptionV3 and MobileNet algorithms have the lowest accuracy, averaging 55.42% and 67.46%, respectively. In terms of sensitivity and specificity, the EfficientNetB4 model has the highest specificity (79%) and the highest sensitivity (84%).

In the comparative graphs, the blue bars show the accuracy, orange bars the precision, grey bars the recall, yellow bars the F1-score, light blue the sensitivity, and green the specificity of the respective algorithms. Figure 5a, b are comparative graphs for the first and second ISIC datasets, respectively, showing the accuracy, precision, recall, F1-score, sensitivity, and specificity of the seven algorithms.

Fig. 5 a Comparative graph of algorithms used on the first ISIC dataset. b Comparative graph of algorithms used on the second ISIC dataset

Figure 6 shows the training (blue line) and testing (orange line) accuracy curves on the first ISIC dataset, which contains 5106 images in the MEL class and 6343 images in the NEV class. The input shape of the images was taken to be the same as the image size, with RGB configuration. After convergence at 76 epochs, EfficientNetB4 and ResNet50 achieve the highest accuracies of 91.79% and 90.61%, respectively, as shown in the figure. The InceptionV3 model has the lowest accuracy, followed by MobileNet, Xception, and VGG16. The best accuracy of the self-constructed CNN model was found to be around 90.04%.

Fig. 6 Model accuracy curves on the first ISIC dataset: a InceptionV3 training accuracy (in blue) with testing accuracy (in orange). b MobileNet training accuracy (in blue) with testing accuracy (in orange). c Xception training accuracy (in blue) with testing accuracy (in orange). d VGG16 training and testing accuracy (in blue and orange, respectively). e Custom CNN training accuracy (in blue) with testing accuracy (in orange). f ResNet50 training and testing accuracy (in blue and orange, respectively). g EfficientNetB4 training accuracy (in blue) with testing accuracy (in orange)

Figure 7 presents the loss curves, which show how poorly the model behaves as training progresses; the better the model’s accuracy, the less the loss curve deviates. From the curves it can be observed that in Fig. 7a, c, g the loss deviates similarly throughout the epochs, whereas in Fig. 7b, d, f the curves keep getting poorer as their height rises sharply, so the loss is greater in those cases. On analysing Table 4, it was seen that VGG16 and ResNet50 performed equally well on the other ISIC dataset. Figure 8 shows the model accuracy curves of all the CNN models mentioned in Table 4, and the corresponding model loss curves are shown in Fig. 9.

Fig. 7 Model loss curves on the first ISIC dataset: a Training loss (in yellow) and testing loss (in red) of InceptionV3. b Training loss (in yellow) and testing loss (in red) of MobileNet. c Training loss (in yellow) and testing loss (in red) of Xception. d Training loss (in yellow) and testing loss (in red) of VGG16. e Training loss (in yellow) and testing loss (in red) of the custom CNN. f Training loss (in yellow) and testing loss (in red) of ResNet50. g Training loss (in yellow) and testing loss (in red) of EfficientNetB4

Fig. 8 Model accuracy curves on another ISIC dataset: a InceptionV3 training (blue) and testing (orange) accuracy. b MobileNet training (blue) and testing (orange) accuracy. c Xception training (blue) and testing (orange) accuracy. d VGG16 training (blue) and testing (orange) accuracy. e Custom CNN training (blue) and testing (orange) accuracy. f ResNet50 training (blue) and testing (orange) accuracy. g EfficientNetB4 training (blue) and testing (orange) accuracy

Figure 8 shows the training (blue line) and testing (orange line) accuracy curves on the second ISIC dataset, which contains 454 images in the MEL class and 373 images in the NEV class. The input shape of the images was taken to be the same as the image size, with RGB configuration. After convergence at 76 epochs, the best-performing algorithm turned out to be ResNet50, with an accuracy of 78.31%, precision of 78%, recall of 78%, sensitivity of 84%, and specificity of 70%. The second best is VGG16, followed by Xception and the custom CNN. InceptionV3 and MobileNet show the lowest accuracies, close to 55.42% and 67.46%, respectively; the corresponding loss curves are shown in Fig. 9.

Fig. 9 Model loss curves on another ISIC dataset: a Training loss (in yellow) and testing loss (in red) of InceptionV3. b Training loss (in yellow) and testing loss (in red) of MobileNet. c Training loss (in yellow) and testing loss (in red) of Xception. d Training loss (in yellow) and testing loss (in red) of VGG16. e Training loss (in yellow) and testing loss (in red) of the custom CNN. f Training loss (in yellow) and testing loss (in red) of ResNet50. g Training loss (in yellow) and testing loss (in red) of EfficientNetB4

Table 5 shows the accuracy, precision, recall, and F1 score of the suggested average ensemble model and the weighted ensemble model. From Table 5(a) it is clear that the weighted ensemble model outperformed the average ensemble model using the weights listed in Table 1. The weighted ensemble model’s accuracy was 93.36%; its precision, recall, and F1-score were all 93%, while its sensitivity and specificity were 97%. Table 5(b) reveals that both models performed well on the dataset used to test their robustness, with the weighted ensemble model reaching an accuracy of 85.54% and a precision, recall, and F1-score of 86% each, with a sensitivity of 88% and a specificity of 82%.

Table 5 Performance metrics of proposed ensemble models

A confusion matrix (CM) is a tabular representation that summarizes the performance of a classification method. It is made up of four major numbers that define the classifier’s metric of measurement: TP, TN, FP, and FN. A true positive (TP) is an individual record correctly identified as a malignant sample; a true negative (TN) is a non-cancerous sample correctly predicted as such. A false positive (FP) occurs when the model incorrectly identifies a healthy patient as a malignant instance, and a false negative (FN) when a malignant case is recognized as a normal one [39]. Figure 10 displays the confusion matrices of the suggested average and weighted ensemble models on both datasets.

Fig. 10 a Confusion matrix of the average ensemble model on the first dataset. b Confusion matrix of the weighted ensemble model on the first dataset. c Confusion matrix of the average ensemble model on the second dataset. d Confusion matrix of the weighted ensemble model on the second dataset

Figure 10 depicts the confusion matrices of the average and weighted ensemble models on both datasets. The average accuracy of all the models was taken, and then the ensemble model was applied. The first dataset consists of 5106 MEL and 6343 NEV images, and the second of 454 MEL and 373 NEV images. The x-axis shows the predicted labels and the y-axis the true labels (Fig. 11).

Fig. 11 Misclassified samples from the proposed ensemble model. The samples are from the melanoma class but were incorrectly assigned to the nevus class

Model training run time refers to the duration required to train a machine learning model on a given dataset. It depends on factors such as model complexity, dataset size, hardware resources, and the optimization techniques used. Smaller models with limited data may train quickly, in minutes or hours, whereas larger models, such as deep neural networks trained on vast datasets, may need days, weeks, or even months. Parallel processing, distributed computing, and accelerators such as GPUs can significantly reduce training time. Efficient algorithms, transfer learning, and advances in optimization also play a crucial role in decreasing training run time, making model development more feasible and scalable (Table 6, Fig. 12).

Table 6 Performance metrics of proposed ensemble models
Fig. 12 Model-by-model ROC curves with their AUC values. The AUC represents the likelihood that a random positive (green) example is ranked above a random negative (red) example; it can take any value between 0 and 1

An ROC curve has been generated for each model in the proposed ensemble. The curves show that EfficientNetB4 gives the best predictions, with an AUC of about 0.917, whereas the remaining models have broadly similar AUC values, with InceptionV3 and MobileNet presenting the lowest, about 0.785 and 0.839, respectively. Overall, all seven models suit this dataset well, as the AUC values cluster between 0.8 and 0.9.
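A minimal sketch of how the per-model ROC curves and AUC values behind Fig. 12 can be produced; the model names and predicted-probability arrays are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc_curves(y_true, scores_by_model):
    """scores_by_model: dict mapping a model name to its predicted probabilities."""
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        auc = roc_auc_score(y_true, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```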

5.1 Analysis using state-of-the-art methods

The implementation of the proposed weighted ensemble model is compared and analysed against existing work. The authors of [4] worked with deep learning-based models using the PNASNet-5-Large, InceptionResNetV2, SENet154, and InceptionV4 neural networks; dermoscopic images were properly processed and enhanced before being fed into the network. The authors of [7] proposed a melanoma screening system that segmented and classified skin lesions as malignant or benign using a convolutional neural network (CNN) architecture, as well as other ML classification methodologies; for each test image, the segmentation technique creates a binary mask, which is then used to segment out the lesion. A multi-scale convolutional neural network was utilised in [8]; their method relied on an Inception-v3 network pre-trained on the ImageNet dataset and fine-tuned for skin lesion classification using two distinct scales of input images. Similarly, in [17] the authors proposed a set of procedures for segmenting skin lesions and assessing the observed area and surrounding skin tissue for melanoma detection. The authors of [11] proposed graphics processors for medical skin image analysis, using an ANN system that detects similar patterns through a set of components tasked with detecting the features of the object in the image and studying edges or contours to determine a final diagnosis; the collection includes 730 images of benign and malignant cases from the International Skin Imaging Collaboration’s (ISIC) MED-NODE project. The GAN-based data generation unit in [14] relied largely on the processing system for removing image occlusions and on the processing unit for populating limited lesion classes or, equivalently, constructing remote patients with pre-defined types of lesions. [15] suggested a progressive generative adversarial network that produced a substantially larger range of augmentations. Similarly, [16] looked into the prospect of using generative adversarial networks (GANs) to create realistic-looking dermoscopic images; the images were then used to supplement a deep CNN’s existing training set in order to improve its performance on the skin lesion classification task. [24] describes a deep learning system that uses two fully convolutional residual networks (FCRN) to achieve segmentation and coarse classification simultaneously; a lesion index calculation unit (LICU) refines the coarse classification findings by computing a distance heat map, and a basic CNN was presented for the dermoscopy image feature extraction task. The proposed methodology’s performance was examined against these state-of-the-art approaches; Table 7 depicts the comparative analysis.

Table 7 A comparison of the proposed methodology with the state-of-the-art methods

6 Discussion & conclusion

Skin cancer is defined as the abnormal proliferation of malignant cells in the epidermis, the skin’s outer layer, resulting from unrepaired DNA damage that causes mutations. Dermoscopy images must be manually evaluated by dermatologists, which takes time and is imprecise and subjective. The study’s main objective was to combine DL algorithms to detect skin cancer more effectively. The authors presented a weighted ensemble classifier that performs binary classification using an ensemble of seven deep learning neural networks, including InceptionV3, VGG16, Xception, ResNet50, and others. Applied to the first ISIC dataset, the suggested ensemble technique obtains a maximum accuracy of 93.36%, precision, recall, and F1-score of 93% each, and sensitivity and specificity of 97% each. Another ISIC dataset has been used to assess and analyse the efficacy of the proposed method; on it, the proposed weighted ensemble classifier’s accuracy was 85.54%.

In summary, a weighted ensemble technique is one in which the final output decision is based on the weighted sum of the predicted outputs of the classifiers. Under weighted-ensemble voting, each model is given a different weight, which is multiplied by the model’s predicted outcome to produce the total or average prediction. By summing the predicted probabilities for each class, the suggested classification model selects the one with the highest probability.

Future research will present a technique to evaluate the classifier’s performance by combining machine learning and deep learning methodologies. A more effective data purification technique will be devised to make the images occlusion-free. Finally, the classifier’s accuracy can be increased further by combining the proposed future method with a better data purification technique and a sizable, evenly distributed dataset.