
1 Introduction

Skin cancer is a prevalent and dangerous disease that requires highly accurate diagnosis for effective treatment. Melanoma, a type of skin cancer, has become increasingly common in recent decades and affects people of all ages. Although melanoma accounts for only 1% of skin cancers, it causes the majority of skin cancer deaths. Early detection of skin cancer is therefore crucial for effective treatment. Advances in technology, particularly in artificial intelligence, have led to practical applications in medicine and healthcare. Deep learning (DL) has been widely applied in many fields, including medical diagnosis and healthcare, robotics and automation, and intelligent assistance systems. Its high performance across a wide variety of tasks has made it a popular choice for solving specific problems. DL is particularly useful for image processing tasks, such as medical image analysis and diagnosis, because of its ability to learn and extract discriminative features. DL techniques have been shown to produce better results than traditional shallow learning approaches and can handle large datasets with many trainable parameters. However, a major challenge in training DL models is small and imbalanced datasets. Imbalance leads to biased classification models, with high performance on the majority categories and low performance on the minority categories. For example, in the ISIC 2018 dataset the NV category contains a large number of samples, while the other categories contain comparatively few; as a result, the NV category dominates the model during training and performance on the other categories is low. To address this issue, techniques such as data augmentation and the focal loss approach are used to improve performance. Data augmentation is a common technique for balancing a dataset by artificially increasing the number of samples in under-represented categories. However, it can lead to overfitting or introduce noisy samples into the dataset. Therefore, in this study we focus on rejecting uncertain samples, which may lead to incorrect diagnoses, in order to improve decision accuracy while maintaining a high sample coverage rate and reject accuracy.

2 Related Works

Several popular and well-known DL models are widely used for image classification and pattern recognition. Each has strengths and characteristics that make it suitable for particular datasets and application fields. GoogleNet [1] is known for its deep architecture with multiple layers, MobileNet [2] is designed to be lightweight and efficient for mobile devices, ResNet [3] and DenseNet [4] make it possible to train very deep neural networks and overcome the vanishing gradient problem, while EfficientNet [5] has been shown to be highly accurate and efficient for various image recognition tasks. These selected models have greatly improved the flexibility and accuracy of image recognition systems [6, 7]. In general, the appropriate model is selected for a specific dataset with the expectation that the system achieves higher accuracy without manual tuning or hand-crafted selection. This is particularly useful in applications where the dataset is changing or evolving and the classification system must adapt to new data. Overall, DL models have become more accessible and effective for a wide range of applications in image recognition and beyond. In industry, DL-based methods have been widely used in applications such as video surveillance [8]. Another line of work concerns hyperparameter optimization, which aims to find the best configuration of hyperparameters for a DL model, such as the learning rate, batch size, and number of filters. Random search selects combinations of hyperparameters at random and evaluates the performance of the resulting model, grid search evaluates combinations within a predefined range, and Bayesian optimization uses prior knowledge to guide the search [9,10,11]. These methods have been shown to be effective in finding optimal hyperparameters for DL models [12, 13]. There are thus several approaches to improving the performance of DL models for image recognition tasks: using state-of-the-art models, selecting models automatically based on the data, optimizing model structure and hyperparameters, and augmenting data to address imbalanced datasets [14, 15]. The choice depends on the specific problem and the available resources, and a combination of approaches is often necessary to reach higher accuracy.
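For illustration only, a minimal random-search loop over the hyperparameters mentioned above might look as follows; the search-space values and the train_and_evaluate helper are hypothetical placeholders, not part of the cited works [9,10,11].

```python
import random

# Hypothetical search space for the hyperparameters named above
# (learning rate, batch size, number of filters); values are illustrative.
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "num_filters": [32, 64, 128],
}

def random_search(train_and_evaluate, n_trials=20):
    """Sample random configurations, train/evaluate each, and keep the best.

    `train_and_evaluate(config)` is an assumed helper returning a validation score.
    """
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```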

In the field of medical image-based cancer diagnosis, dermoscopy is a microscopic skin-surface imaging technique. Numerous studies have demonstrated that DL models achieve high diagnostic performance compared to standard imaging and to dermatologists [16]. The paper [17] analyzed methods and experimental results from the ISIC Challenge 2018. The authors presented a two-stage method for segmenting lesion regions from medical images based on an optimized training method and applied several post-processing steps. The lesion images were acquired with a variety of dermatoscope types, from all anatomic sites, or from a historical sample of patients presented for skin cancer screening at several different institutions; each lesion image contains exactly one main lesion. The synthetic minority oversampling technique (SMOTE) [18] has inspired several approaches: it focuses on the minority-category samples before up-sampling, which better accounts for the uneven distribution of the samples. In another approach, the MC-SMOTE method [19] combines over-sampling of the minority categories with under-sampling of the majority categories, achieving higher classifier performance than under-sampling the majority categories alone. This method uniformly increases the number of minority-category samples using the k-means method and has been applied to practical problems such as wind turbine fault detection.
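As a plain illustration of SMOTE-style oversampling (not the exact MC-SMOTE procedure of [19], which additionally uses k-means clustering and majority under-sampling), a sketch with an assumed neighbourhood size k is given below.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        distances = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```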

Other recent developments in pattern recognition and classification are based on the use of attention mechanisms in DL models [20]. This approach allows classification models to focus on the most informative parts of the input image rather than treating the entire image as equally important. Attention mechanisms have been shown to outperform convolution-based DL models in various tasks such as pattern classification, object recognition, and image captioning.

In another direction, several research works report methods for eliminating uncertain samples [21,22,23]. These solutions inspired the solution proposed here for disease diagnosis, which improves the accuracy of medical image classification. The approach integrates a reject option that enables the network to reject input samples that are difficult to classify with high confidence. The authors argue that this can lead to better performance in real-world applications where the cost of misclassification is high. The reject option is implemented with a binary decision tree that operates on the output of the network: the tree takes the predicted class probabilities and other features, such as the maximum and minimum probabilities, and decides whether to reject the input sample or assign it to one of the predefined classes. These methods achieved state-of-the-art performance on several benchmark datasets and perform particularly well on imbalanced datasets.

3 Proposed Methodology

3.1 Overview Approach

This method aims to improve the performance of a DL model by optimizing its architecture and adding ambiguity rejection. The general processing architecture, illustrated in Fig. 1, includes three major stages that should be investigated and customized: feature extraction, a fully connected network for the classifier, and ambiguity rejection.

Fig. 1. General training flowchart of a DCNN-based classification architecture.

3.2 Feature Extraction and Classification

In the first stage, feature extraction, the DL model is adjusted by tuning the training parameters and refining the loss formulation. The approach was evaluated empirically using various convolutional neural network (CNN) backbones for feature extraction under different criteria. Our research does not focus on designing new deep learning architectures; instead, we use popular CNN models and customize the fully connected layers for multi-category classification. There are several ways to implement the feature extraction stage, such as using state-of-the-art backbone architectures with their pretrained parameters or constructing CNN architectures from scratch and searching for the best model. The output feature maps are used as input for the classification stage. Experimental results show the stability and efficiency of several predefined DCNN backbones, such as the DenseNet and MobileNet families.

In this paper, two popular and well-established CNN architecture families, DenseNet [4] and MobileNet [2], were investigated. The MobileNet family is known as a lightweight model family that is efficient for resource-limited devices. Two versions, MobileNet and MobileNetV3Large, were explored to compare their performance ratios. Transfer learning was applied from ImageNet-pretrained models to the ISIC 2018 dataset to fine-tune the network hyperparameters. In contrast, the DenseNet family is more accurate but heavier; two versions, DenseNet121 and DenseNet201, were used. Each DenseNet is transferred from the ImageNet-pretrained model without its last top layer, and the feature map is taken from its last layer, named "ReLU". These architectures and their trainable parameters are listed in Table 1; a minimal implementation sketch of this backbone stage follows the table.

Table 1. The list of backbones and their parameters
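The following sketch of the backbone stage assumes a TensorFlow/Keras implementation, an ImageNet-pretrained DenseNet121 without its top layer, a 224×224 input, and global average pooling over the final feature map; the exact input size and pooling are assumptions, not values stated above.

```python
import tensorflow as tf

def build_backbone(input_shape=(224, 224, 3)):
    """ImageNet-pretrained DenseNet121 used as a feature extractor (no top layer)."""
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Pool the last feature map into a feature vector for the classification stage.
    features = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    return tf.keras.Model(base.input, features, name="feature_extractor")
```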

In the classification stage, various approaches can be applied, such as a fully connected neural network (FCNN), support vector machines (SVM), or other machine learning methods. In this study, an FCNN is used for multi-category classification; it takes the feature maps from the feature extraction stage as input. To avoid overfitting, we add special layers to this neural network architecture, such as dropout layers. The optimal architecture was estimated by trial and error. The final architecture consists of two densely connected layers with 1,024 and 512 nodes, each followed by an activation layer. The activation output is passed to a dropout layer with a 50% drop probability. The final output layer has c nodes followed by a softmax activation function.
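A minimal Keras-style sketch of this classifier head, assuming ReLU activations and a generic feature dimension (both unspecified above), is:

```python
import tensorflow as tf

def build_classifier(feature_dim, c=7):
    """Two dense layers (1,024 and 512 nodes), each followed by an activation
    and 50% dropout, then a softmax output over c categories (c = 7 for ISIC 2018)."""
    inputs = tf.keras.Input(shape=(feature_dim,))
    x = tf.keras.layers.Dense(1024, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(c, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="classifier")
```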

3.3 Imbalanced Data Processing

As mentioned above, to address the imbalanced data issue, we investigated several solutions, including data augmentation (AU) and focusing on hard samples with the focal loss (FL) [24]. The AU technique is explored in this study as follows. Augmentation applies image processing operations, such as geometric and artificial color transformations, to increase the number of samples in the minority categories and to concentrate on misclassified samples. This helps address the data imbalance problem. The method is suitable for multi-skin-disease classification and effectively addresses the underfitting and overfitting caused by the imbalance between the major and minor categories. Image processing techniques such as color normalization and geometric transformations are applied to the training dataset; we used color processing and affine transformations such as rotation, flipping, skewing, zooming, and cropping. The augmented data were generated with random parameters within predefined ranges, and each new sample was created once and fixed for all methods. Our approach therefore differs from the image data generators in libraries such as TensorFlow and PyTorch, which generate new data from the original dataset at every epoch, so that the training data differ every time a model or method is trained. Such generators help to avoid overfitting, but they make it difficult to compare different methods because the generated training dataset changes on every run. The data augmentation method was used to balance the dataset across all categories with the expectation of improving the correct classification rates. The main drawback of this approach is that it produces a much larger training dataset than the original, which requires more powerful hardware and significantly increases computation time. The details of the parameters used to generate the dataset are presented in Table 2; a sketch of this offline augmentation follows the table.

Table 2. The details of parameters for data augmented processing.
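The following sketch of the offline augmentation assumes a Keras ImageDataGenerator and illustrative parameter ranges rather than the exact values of Table 2; the generated samples are created once and then kept fixed, as described above.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative transform ranges (rotation, zoom, skew, shift, flip, brightness);
# the ranges actually used in this work are those of Table 2.
augmenter = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.2,
    shear_range=0.2,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    brightness_range=(0.8, 1.2))

def augment_minority(images, n_per_image, seed=0):
    """Generate a fixed set of augmented copies of each minority-class image."""
    rng = np.random.default_rng(seed)
    augmented = []
    for img in images:
        for _ in range(n_per_image):
            augmented.append(
                augmenter.random_transform(img, seed=int(rng.integers(1_000_000))))
    return np.asarray(augmented)
```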

In this paper, we also investigate the weighting mechanism of FL [24], which affects the efficiency of the model for different data categories. This approach deals with the data imbalance problem without data augmentation. Unlike data augmentation, the loss function (LF) is applied directly to the multi-class classification, but it can be less effective because the performance metrics for this problem combine indicators such as one-versus-all accuracy, sensitivity/recall, and specificity. The training task aims to optimize the model parameters to achieve the lowest loss over the whole dataset, thereby increasing classification performance. However, this can lead to a seesaw problem in which the majority categories are more influential than the minority categories, lowering the weighted performance scores.
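A minimal multi-class focal loss sketch in the spirit of [24] is given below; the gamma and alpha values are the commonly used defaults, not values tuned in this work.

```python
import tensorflow as tf

def categorical_focal_loss(gamma=2.0, alpha=0.25):
    """Focal loss for one-hot targets: down-weights easy samples so that
    hard, often minority-class, samples contribute more to the gradient."""
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        cross_entropy = -y_true * tf.math.log(y_pred)
        modulating_factor = alpha * tf.pow(1.0 - y_pred, gamma)
        return tf.reduce_sum(modulating_factor * cross_entropy, axis=-1)
    return loss
```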

3.4 Ambiguity Rejection

Normally, a multi-class classification model can be defined by a set of probabilities P = {\({p}_{1},{p}_{2},..,{p}_{m}\)}, where \({p}_{i}\) is the predicted probability of the \({i}^{th}\) of the \(m\) categories, and the output of the classifier is defined as the function \(f(x)= argmax\left({p}_{i}\right)\), with \(i\in \){1,2,..,m}. When a per-class confidence-threshold ambiguity rejection module is used to reject the confusion region, the function \(f\left(x\right)\) is adjusted as in the formula below.

$$f\left(x\right)= \left\{\begin{array}{c}reject, if\, {p}_{i}\le {\delta }_{i},\forall i \in \{1, 2, .., m\}\\ argmax\left({p}_{i}\right), i\in \left\{1, 2, .., m\right\} \,\,otherwise\end{array}\right.$$
(1)

where \(\delta \) = {\({\delta }_{1},{\delta }_{2},..,{\delta }_{m}\)} denotes the set of confidence thresholds, in which \({\delta }_{i}\) is the threshold of the \({i}^{th}\) category (\({c}_{i}\)). The set \(\delta \) is usually obtained from a training sample so that the correctly classified accuracy on the test dataset is greater than or equal to a pre-set select accuracy, e.g., 95%.
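A direct sketch of the rejection rule of Eq. 1, using -1 as a hypothetical marker for rejected samples, is:

```python
import numpy as np

def classify_with_reject(probs, deltas):
    """Eq. 1: reject when every class probability is at or below its per-class
    threshold; otherwise return the argmax class."""
    probs, deltas = np.asarray(probs), np.asarray(deltas)
    if np.all(probs <= deltas):
        return -1  # rejected (ambiguous) sample
    return int(np.argmax(probs))
```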

In this study, we use the validation dataset to determine the threshold \({\delta }_{i}\) of class \({c}_{i}\). More specifically, using the classifier, we calculate the set of probabilities P for each sample in the validation dataset. For a given class \({c}_{i}\), we determine the potential thresholds (\({\delta }_{possible}\)), which are the unique values of the list of probabilities \({p}_{i}\). The most important question is how to choose the best threshold for class \({c}_{i}\). For a given threshold \({\delta }_{i}\in {\delta }_{possible}\) of class \({c}_{i}\), we determine the rejected samples using Eq. 1. Suppose there are n rejected samples, of which k are failures (samples correctly classified by our model). The probability of having more than k failures is ProbFailure(k,n). A given \({\delta }_{i}\) is acceptable when ProbFailure(k,n) is greater than 1-β, where β is a given significance level. For each acceptable \(\delta \), we calculate the select accuracy and coverage, and the threshold of class \({c}_{i}\) is the one with the highest select accuracy and coverage. In this research, ProbFailure(k,n) is estimated using the binomial cumulative distribution function in Eq. 2, given by the following formula.

$$binom.cdf\left(k,n,p\right)= \sum\nolimits_{i=0}^{k}\left(\genfrac{}{}{0pt}{}{n}{i}\right){p}^{i}{(1-p)}^{n-i}$$
(2)

where \(n\) denotes the number of rejected samples, \(k\) denotes the number of failures among the n rejected samples, and \(p\) denotes the probability that a given rejected sample is a failure. A rejected sample is assumed to be a failure at random, so \(p=0.5\).
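As a sketch, the acceptance test described above can be written with the SciPy binomial CDF of Eq. 2; the significance level beta = 0.05 is illustrative only, and the helper name is hypothetical.

```python
from scipy.stats import binom

def threshold_is_acceptable(k, n, beta=0.05):
    """n: rejected samples for a candidate threshold; k: how many of those the
    classifier would actually have classified correctly ("failures").
    The candidate is acceptable when the binomial estimate exceeds 1 - beta."""
    if n == 0:
        return False  # nothing was rejected, so there is nothing to test
    prob_failure = binom.cdf(k, n, 0.5)  # Eq. 2 with p = 0.5
    return prob_failure > 1.0 - beta
```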

4 Experimental Results and Analysis

4.1 Materials and Preprocessing

In this study, the ISIC 2018 [25, 26] skin cancer dataset is used to evaluate the proposed solution. Because this dataset is still used in a competition, the ground-truth labels of the testing images are not available. Therefore, the experiments and comparisons are based on the training and validation datasets. The original training set contains 10,015 samples, and 193 samples are provided for evaluation. The dataset consists of 7 categories: Melanoma (MEL), Melanocytic nevus (NV), Basal cell carcinoma (BCC), Actinic keratosis (AKIEC), Benign keratosis (BKL), Dermatofibroma (DF), and Vascular lesion (VASC). The image samples have a uniform resolution of 450×600 pixels. For evaluation, the original validation dataset is used as the validation1 dataset, and the original training dataset is split into 80% for training and 20% for evaluation as validation2. Details of the dataset used in this experiment are presented in Table 3.

Table 3. Details of the experimental dataset

4.2 Evaluation Metrics

To evaluate the performance of the studied methods on the feature extraction and classification task, we use popular effectiveness measures such as Recall (REC), Accuracy (ACC), Precision (PRE), Specificity (SPE), and F1. Note that the accuracy metric for multi-class classification differs from that of the binary classification problem. The accuracy is estimated in a one-versus-all manner: for each category, its samples are treated as positive samples and the samples of the remaining classes as negative samples, as in a binary classification problem. The accuracy score therefore differs between binary and multi-class classification, whereas the other metrics are the same as in binary classification. The effectiveness metrics are computed as follows:

$$ ACC_i = \frac{TP_i + TN_i }{Ns} $$
(3)
$$ ACC = \frac{1}{Ns}\sum_{i = 1}^c {n_i *ACC_i } $$
(4)
$$ REC = TP/(TP + FN) $$
(5)
$$ PRE = TP/(TP + FP) $$
(6)
$$ SPE = TN/(TN + FP) $$
(7)
$$ F_1 \, = \, TP/[TP + \frac{1}{2}(FP + FN)] $$
(8)

where Ns is the total number of samples in the dataset, \(Ns = TP_i + FP_i + FN_i + TN_i\), TPi and FPi are the numbers of true positive and false positive samples for the i-th category, respectively, and FNi and TNi are the numbers of false negative and true negative samples for the i-th category, respectively. The number of samples of the i-th class is ni. In an alternative approach, the accuracy of class c would be calculated as TPc divided by the total number of instances of class c; however, this measurement is identical to the recall ratio. Therefore, we use the formulation above to estimate the accuracy rate.
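For completeness, a sketch of these one-versus-all metrics computed from a confusion matrix (an implementation detail not specified above) is:

```python
import numpy as np

def one_vs_all_metrics(confusion):
    """confusion[i, j] counts samples of true class i predicted as class j.
    Returns per-class metrics (Eqs. 3, 5-8) and the weighted accuracy (Eq. 4)."""
    C = np.asarray(confusion, dtype=float)
    Ns = C.sum()
    per_class = []
    for i in range(C.shape[0]):
        tp = C[i, i]
        fn = C[i, :].sum() - tp
        fp = C[:, i].sum() - tp
        tn = Ns - tp - fn - fp
        per_class.append({
            "ACC": (tp + tn) / Ns,              # Eq. 3
            "REC": tp / (tp + fn),              # Eq. 5
            "PRE": tp / (tp + fp),              # Eq. 6
            "SPE": tn / (tn + fp),              # Eq. 7
            "F1": tp / (tp + 0.5 * (fp + fn)),  # Eq. 8
        })
    n_i = C.sum(axis=1)                         # per-class sample counts
    acc = float((n_i * np.array([m["ACC"] for m in per_class])).sum() / Ns)  # Eq. 4
    return per_class, acc
```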

4.3 Evaluation Results and Analysis

In this study, we experiment with and analyze the feature extraction and classification task using categorical cross entropy (CC), FL, and the AU method, and then apply ambiguity rejection to improve high-confidence disease diagnosis. Among the solutions for handling data imbalance, AU requires a higher computational cost for model training because it generates a significant number of new samples to balance the training dataset. We also customized two kinds of feature extraction backbones, the MobileNet and DenseNet families, which represent different approaches. The MobileNet backbone represents a small, compact architecture suitable for resource-limited computing systems, while the DenseNet backbone represents a densely connected network with a heavy trainable-parameter load. In general, MobileNets are lightweight architectures with a few million trainable parameters, yet they achieve high accuracy in different applications; their efficiency comes from depth-wise separable convolutions. The DenseNet architecture uses densely connected layers organized in dense blocks: layers with matching feature-map sizes are connected directly, so each layer obtains additional inputs from all preceding layers and passes its own feature maps to all subsequent layers. The experimental results on the evaluation datasets show that the DenseNet121 + FL method performs best on validation dataset1, with an 88.08% recall and a 94.18% accuracy rate, as shown in Table 5 in the appendix. Meanwhile, DenseNet201 + FL achieves the best result on validation dataset2 for all criteria, so the DenseNet family gives more stable results than the other methods (Fig. 2). The CC method gives the lowest result, with 76.68% recall at 88.77% accuracy. Overall, the DenseNet family combined with FL gives the best results on the ISIC 2018 dataset, as illustrated in Fig. 3.

Fig. 2. Experimental results on both evaluation datasets.

Fig. 3. Average evaluated results of the MobileNet and DenseNet families on both validation sets.

In the ambiguity rejection stage, we adjust the \(\delta \) set so that the select accuracy is high, around 95.0% (corresponding to a 5% error rate), to ensure an acceptable error rate in real-world applications and to compare the performance of methods. Validation dataset1 and validation dataset2 are used to determine the thresholds \({\delta }_{possible}\), with the aim of reaching the accepted select_recall rate at the highest coverage ratio for each category. Some experimental results are shown in Table 4. With ambiguity rejection at \(\delta =0.1\), the selected recall ratio reached about 96.25% at a 75% correct coverage ratio with DenseNet121 + AU, while MobileNetV3Large + CC achieved only 93.19% select recall at a 66.24% correct coverage ratio, as shown in Table 4 (a). The experimental results also show that with delta = 0.3, DenseNet201 + FL achieved the highest precision, with 94.91% select_recall at an 81.06% correct coverage ratio, while MobileNet + CC achieved the lowest accuracy, with 91.69% select_recall at an 80.93% correct coverage ratio, as shown in Table 4 (b). These results indicate that the CC loss function gives the lowest recall ratio in both the classification and the ambiguity rejection settings.

Table 4. Experimental results of ambiguity rejection on both evaluation datasets

5 Conclusions

In this article, we presented a new approach to improving medical image-based disease diagnosis by applying DL classification and rejecting ambiguous samples. Our approach concentrates on balancing the influence of each category relative to the others, rather than focusing on hard samples through the loss function or augmenting image data, with the expectation of a higher precision ratio. The CNN architectures were customized with fully connected layers and adapted to the ISIC dataset. Applying the ambiguity rejection stage to remove uncertain samples significantly improves accuracy. The solution was able to improve the diagnostic quality over the classification stage alone; for example, the recall rate improved from 85.63% to 96.25% at a 75% coverage rate with DenseNet121 + AU. The experimental results demonstrate that this solution achieves higher accuracy, but eliminating uncertain samples also introduces the problem of incomplete coverage in disease diagnosis.