1 Introduction

Skin cancer is one of the most common types of cancer in the USA; in 2022, an estimated 99,780 new invasive melanoma and 97,920 in situ melanoma cases were diagnosed, and 7650 people died from the disease [1, 2]. However, early diagnosis dramatically improves survival: in melanoma, the five-year survival rate is 99% for cases diagnosed early, compared with 68% when the melanoma has reached a nearby lymph node and 30% when it has spread to distant lymph nodes [3].

Factors leading to delayed or inaccurate diagnosis include a shortage of dermatologists [4], especially in rural areas [5], a gap that is filled by non-specialists such as primary care providers, who are inadequately trained to deal with complex and ambiguous dermatological conditions [6, 7] and often lack access to diagnostic aids. Given the increasing prevalence of skin cancer and the chronic lack of dermatological expertise, there is a critical need for computer-aided skin cancer diagnostic applications that offer an accurate, rapid early diagnosis.

Increased computational power and recent advances in artificial intelligence (AI) methods, such as deep learning, have enabled the development of deep neural networks whose skin cancer diagnosis performance is comparable, and sometimes superior, to that of dermatologists [8,9,10,11]. Although deep learning applications have expanded rapidly, their black-box nature and lack of robustness hinder their adoption in clinical settings: they do not explain their decision-making process, and perturbations to the input can dramatically degrade their performance and completely change the output [12,13,14].

In 2021, the European Commission proposed a set of rules to regulate the deployment of AI applications in the European market. Transparency is one of the requirements that AI applications, in particular high-risk applications such as medical services, must comply with [15]. In response, the AI community introduced the concept of explainable artificial intelligence (XAI), which aims to develop interpretability approaches that improve the transparency of the decision-making process and thereby enhance human understandability and trust [16].

Numerous approaches have been investigated to enhance the interpretability of deep learning applications [18,19,20,21], where interpretability can be incorporated either during diagnostic model training or after training. In dermatology, deep neural networks (DNNs) have been trained to be interpretable by mimicking dermatologists' diagnostic strategy; for example, a DNN was trained to perform hierarchical diagnosis and thus predict the lesion origin or malignancy before making the disease diagnosis [22]. After-training interpretability, on the other hand, integrates visualization tools with the developed model to rationalize its diagnosis. For example, in a skin lesion segmentation and classification task [23], a visualization map was integrated with the model to illustrate the image regions the model relied on most when making a prediction, giving the end-user confidence that the AI model was in fact analyzing the pathology in the provided image.

Concerning image modality, the number of dermoscopic image-based applications is substantially greater than the number utilizing clinical images. In a recent review of XAI articles on skin cancer diagnosis [13], of the 37 studies included, 27 analyzed dermoscopic images, four utilized clinical images, two used both modalities, and four employed histopathological images. Dermoscopic images are skin images captured by a specialist using a dedicated instrument (a dermatoscope) that provides a magnified view of the lesion [24]. Clinical images, in contrast, are skin images captured by a digital camera; obtaining them requires no specific expertise, and acquiring high-quality clinical images is facilitated by rapidly evolving smartphone cameras [25]. Given that the burden of early diagnosis in dermatology falls on general practitioners [6], who can rarely perform dermoscopy [26], clinical images, rather than dermoscopic ones, are more relevant for skin lesion screening, hence the need for related XAI research.

In this work, we develop an interpretable skin cancer early diagnosis approach using clinical images. The proposed approach aims to (1) provide low-cost, rapid screening that can accelerate skin cancer diagnosis and (2) assist general practitioners in providing a diagnosis, prompting early referrals to dermatologists. Our approach first incorporates domain knowledge, through a skin lesion taxonomy, into the design of a DNN that incrementally learns dermatological concepts. In addition, visual explanations of the diagnosis-making process are offered through advanced visualization maps to reveal the rationale behind the model's final diagnosis. More specifically, the contributions of this work are:

  1. Develop an incremental multi-output model that predicts the lesion origin (melanocytic or non-melanocytic), classifies the lesion malignancy (malignant or benign), and provides a disease diagnosis (melanoma, nevi, basal cell carcinoma, and seborrheic keratosis).

  2. Integrate two interpretability approaches to improve the transparency of the proposed model.

  3. Investigate different loss functions for training the incremental model.

  4. Implement several data balancing techniques.

  5. Evaluate the proposed model and compare it with similar models.

2 Background

In this section, we review recent works that have tackled the problem of interpretability in skin cancer diagnosis. The work discussed herein is grouped by image modality into dermoscopic and clinical image interpretability approaches.

2.1 Dermoscopic Image Interpretability

Image similarity is an interpretability approach that simulates how dermatologists diagnose new cases based on knowledge of similar past cases. Barata et al. [27] utilized a content-based image retrieval component when training a DNN model to diagnose melanoma, so the model could provide a diagnosis and retrieve similar images that justify it. Similarly, Codella et al. [28] utilized a hierarchy to group similar images when training a DNN to diagnose melanoma; the model's prediction was then accompanied by a set of similar images, retrieved using the similarity hierarchy, to justify the attained diagnosis.

The ABCD rule of dermoscopy is a well-established criterion in which asymmetry, border irregularity, color variation, and diameter are analyzed to detect melanoma [29]. Chowdhury et al. [30] developed a melanoma diagnosis machine learning model that employs ABCD features extracted using image transformations. In addition, a DNN consolidated with visual attention components was trained to diagnose melanoma and generate a visualization of the model's diagnosis process. The visualization of the DNN was found to be correlated with the output of the ABCD feature-based model, indicating that the DNN implicitly learned the ABCD features and that its results can therefore be trusted. Likewise, Stieler et al. [31] embedded the ABCD rule in the local interpretable model-agnostic explanations (LIME) [32] framework to develop an explainable melanoma diagnosis DNN model. LIME, a local surrogate model, was adapted such that its logic was replaced by the ABCD rule and trained on the predictions of the black-box model to visually explain the rationale behind the prediction.

Concept activation vector (CAV) is an interpretability method that evaluates the correlation between human-defined concepts and the model’s prediction. CAV is used as a deep learning model testing strategy that measures the importance of human-defined concepts in the results of the model [33]. Lucieri et al. [34] employed CAV to interpret the results of a deep learning model trained to diagnose melanoma, nevi, and seborrheic keratosis. They utilized several skin lesion features, such as lesion pigmentation, streaks, dots and globules, and blue-whitish veils, to represent the concepts for understanding the model’s diagnosis. Their work showed that there is a strong correlation between the model’s prediction and the explored skin lesion features which indicates that the model learned human understandable concepts.

Utilizing a skin lesion taxonomy is an interpretability approach that simulates dermatologists by dividing the diagnosis task into a hierarchy of subtasks [35] based on criteria such as lesion origin or malignancy before reaching the diagnosis. In 2019, Barata et al. [22] employed a two-level skin lesion hierarchy to develop two DNNs that classify an image as melanoma versus nevi and as seborrheic keratosis versus nevi. Each network was trained to learn one level of the hierarchy and then perform binary classification at the disease level, without considering the dependency between the hierarchy levels. In 2021, Barata et al. [36] extended the taxonomy presented in [22] to a three-level hierarchy implemented with a recurrent neural network trained on dermoscopic image features extracted by an encoder network.

Attribution maps, also known as saliency maps, are visualizations that outline the areas contributing to a diagnosis decision accordingly providing a visual explanation of the diagnosis-making process. The class activation map (CAM) [37] and gradient-weighted class activation maps (Grad-CAM) [38] are popular techniques that were widely employed as DNN visual explanation tools [23, 39,40,41,42,43].

2.2 Clinical Image Interpretability

Compared to dermoscopic images, clinical images have received much less attention with respect to interpretability; moreover, the XAI approaches employed have mainly focused on providing visual justifications of the DNN output using variants of attribution map techniques. In 2017, Ge et al. [44] adapted the classical CAM to work on the bilinear pooling feature map [45] to provide detailed visual explanations of the diagnosis process. Later, in 2018, Grad-CAM was implemented to improve the understandability of a DNN diagnosing 12 skin lesions [46]. Pfau et al. [47] implemented an aggregated global visualization approach based on Grad-CAM and the competitive gradient input method [48] to study how the DNN model adapts to skin image artifacts, such as ink.

Furthermore, CAM was used to visualize the regions of the image that contribute to the predicted labels of a DNN developed by Kawahara et al. [49]. The visualizations were utilized to interpret the prediction of the 7-point criteria associated with melanoma [50] and the diagnosis of five skin lesions. Another attribution map technique, integrated gradients [51], was deployed in [52], where a DNN was trained to diagnose 26 skin conditions. Integrated gradients demonstrated the model's learnability by highlighting the significant image pixels that led to the prediction. Additionally, the integrated gradients technique [51] was combined with SmoothGrad [53] to offer a visualization map averaged over a set of images, which was then used to study the correlation between the model's learnt features and human-labeled regions of interest in classifying skin lesions [54].

Having reviewed the recent interpretability work in skin cancer diagnosis, we observe that the work on clinical image interpretability is primarily based on providing visual explanations of the predictions. Therefore, we aim to develop an interpretable skin cancer diagnosis model, trained and validated on clinical images, that incorporates domain knowledge into model training and provides visual explanations of the predictions after training. Accordingly, we embed dermatology knowledge in our model training by implementing a well-established skin lesion taxonomy that mimics how dermatologists diagnose lesions based on the lesion's origin and malignancy. In addition, we utilize an attribution map technique to provide visual explanations of the predictions as an after-training interpretability approach.

It is worth mentioning that the taxonomy, as a source of dermatology knowledge, has been partially utilized in [22] where the authors used only two levels of the taxonomy and developed two separate models each providing a binary classification of the diseases (explained in Sec. 2.1). In addition, the taxonomy has been utilized in [36] where the full taxonomy was implemented using image encoding and a recurrent neural network. In our work, we implement the full taxonomy following a different approach, and we develop a single multi-output model that incrementally predicts the lesion origin, lesion malignancy, and the disease. In addition, we build the taxonomy using a convolutional neural network with a customized loss function to help the model learn the dependency between the taxonomy levels and hence improve the disease diagnosis.

3 Data and Taxonomy

The clinical images of a publicly available dataset containing 1011 skin lesion cases [55] were employed in our work. Originally, the dataset was used to predict the 7-point criteria linked to melanoma and to perform lesion diagnosis for five disease classes: basal cell carcinoma (BCC), nevi (NEV), melanoma (MEL), seborrheic keratosis (SK), and a miscellaneous class (MISC) that includes any other disease, such as dermatofibroma [49] (the number of images for each disease is listed in Table 1). In our work, we focused on four classes, BCC, NEV, MEL, and SK, to fit the three-level skin lesion taxonomy [36] shown in Fig. 1.

Table 1 Data description
Fig. 1 Skin lesion taxonomy

The first level of the taxonomy identifies the lesion origin as melanocytic or non-melanocytic, the second level groups lesions based on malignancy as malignant or benign, and the final level leads to the disease. The total number of images included in our work is 914 images, distributed as presented in Fig. 2 across the taxonomy levels.

Fig. 2 Study data distribution

Based on Fig. 2, it can be observed that the data is imbalanced at all levels. At the lesion origin level, there are 827 melanocytic and 87 non-melanocytic cases; at the malignancy level, there are 294 malignant and 620 benign cases; and at the disease level, there are 42 BCC, 252 MEL, 575 NEV, and 45 SK cases. The imbalance ratio of the data is calculated as per Eq. 1 [56]; the imbalance ratios for levels 1, 2, and 3 are 9.51, 2.11, and 13.69, respectively.

$$Imbalance\;Ratio=\frac{N_{maj}}{N_{min}}$$
(1)

where \({N}_{maj}\) and \({N}_{min}\) are the numbers of majority and minority class instances, respectively.
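
For illustration, the reported per-level ratios can be reproduced directly from the class counts in Fig. 2 with a short snippet (a minimal sketch, not part of the original pipeline):

```python
# Minimal sketch: recomputing the per-level imbalance ratios (Eq. 1)
# from the class counts reported in this section.
def imbalance_ratio(class_counts):
    """Ratio of the majority class size to the minority class size."""
    return max(class_counts) / min(class_counts)

levels = {
    "origin":     [827, 87],            # melanocytic, non-melanocytic
    "malignancy": [294, 620],           # malignant, benign
    "disease":    [42, 252, 575, 45],   # BCC, MEL, NEV, SK
}

for name, counts in levels.items():
    print(f"{name}: {imbalance_ratio(counts):.2f}")
# origin ~9.51, malignancy ~2.11, disease ~13.69
```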

4 Methods

In this work, we implemented two interpretability approaches to improve the transparency of skin cancer diagnosis. First, we adapted a DNN architecture to incrementally learn dermatological concepts and consequently increase the interpretability of the diagnosis-making process. Second, visual saliency maps were created to explain skin lesion predictions at the disease level. Since the utilized data is imbalanced, as discussed in Sect. 3, we also developed several data balancing approaches to address this problem.

4.1 Proposed Architecture

The Inception V3 architecture, introduced by Szegedy et al. [57], has performed well in skin lesion diagnosis [49, 58]; we therefore adapted Inception V3 to mimic dermatologists in incrementally diagnosing a lesion by developing a multi-output incremental diagnosis network. The main blocks of the Inception V3 network (Fig. 3) were retained to extract features from skin images, whereas the classification layers of the network were replaced with the incremental diagnosis block that incorporates domain knowledge represented by the skin lesion taxonomy.

Fig. 3 Proposed network architecture

During training, the image features (the output of the extraction block) were given as input to all diagnosis levels (Fig. 3). In addition, level 2 was provided with the lesion origin prediction and level 3 with both the origin and malignancy predictions, so the network was trained to predict the output of all levels. As a result, when making a diagnosis in the testing phase, the network could justify the final disease prediction based on the level 1 and 2 predictions.

To optimize the diagnosis model during training, a loss function was used to calculate the divergence of the prediction from the actual diagnosis and update the model accordingly. The categorical cross entropy (CE) loss function (Eq. 2) is one of the most commonly used loss functions [59]. However, the CE considers the loss for each diagnosis level separately, without considering the dependency between the levels. Accordingly, we adapted the CE loss to account for the skin lesion taxonomy levels, calculating the loss for each level while considering the previous one, yielding the taxonomy CE loss (TCE) shown in Eq. 3.

$${Loss}_{CE}\left(\widehat{y},y\right)=-\sum_{j=1}^{C}y_{j}\;\log\;\widehat{y}_{j}$$
(2)
$${Loss}_{TCE}\left(\widehat{y},y\right)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ij}\;\log\;\widehat{y}_{ij}$$
(3)

where \({y}_{ij}\) is the actual diagnosis, \({\widehat{y}}_{ij}\) is the predicted diagnosis, C is the number of classes in each level, and N is the number of taxonomy levels.
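
For concreteness, the sketch below shows one way Eq. 3 could be expressed in TensorFlow/Keras. As written, Eq. 3 averages the per-level cross entropy; the dependency between levels enters through the architecture, which feeds earlier predictions forward (Sect. 5.2). The level ordering and the toy tensors are illustrative assumptions, not the exact training code.

```python
import tensorflow as tf

def taxonomy_cross_entropy(y_true_levels, y_pred_levels):
    """Average categorical cross entropy over the taxonomy levels (Eq. 3).

    y_true_levels / y_pred_levels: lists with one (batch, classes) tensor
    per level, ordered as origin, malignancy, disease.
    """
    n_levels = len(y_true_levels)
    per_level = [
        tf.keras.losses.categorical_crossentropy(yt, yp)
        for yt, yp in zip(y_true_levels, y_pred_levels)
    ]
    return tf.add_n(per_level) / n_levels

# Toy check with one image: melanocytic, malignant, melanoma (one-hot labels).
y_true = [tf.constant([[1., 0.]]), tf.constant([[1., 0.]]),
          tf.constant([[0., 1., 0., 0.]])]
y_pred = [tf.constant([[0.9, 0.1]]), tf.constant([[0.7, 0.3]]),
          tf.constant([[0.1, 0.6, 0.2, 0.1]])]
print(taxonomy_cross_entropy(y_true, y_pred).numpy())
```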

4.2 Visual Saliency Maps

Visual saliency maps are visualizations that emphasize the pixels of an image that most influence a DNN's classification. Gradient-based visualization methods calculate the gradient of the classification score with respect to the input features [60]; the widely used Grad-CAM, for instance, backpropagates this gradient to the last convolutional layer to create a class-discriminative visualization map [60]. In our work, we utilized Grad-CAM++ [61], an extension of Grad-CAM that provides better visualizations, more accurate object localization, and detection of multiple occurrences of an object [61].
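
The following sketch computes the base Grad-CAM map with TensorFlow; Grad-CAM++ differs only in how the channel weights are derived (it uses higher-order gradient terms). The layer name, class index, and the assumption that the disease head is the model's last output are illustrative.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Plain Grad-CAM: gradient-weighted average of the last conv feature
    maps for one class ("mixed10" is the last mixed block of Inception V3).
    Grad-CAM++ replaces the simple gradient average with higher-order
    weighting for sharper, multi-instance maps."""
    grad_model = tf.keras.Model(
        model.inputs,
        # model.outputs[-1] assumes the disease head is the last output.
        [model.get_layer(conv_layer_name).output, model.outputs[-1]],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)              # (1, h, w, channels)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # per-channel weights
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized heat map
```

The returned map has the spatial resolution of the chosen convolutional layer and is typically upsampled and overlaid on the input image, as in Fig. 9.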

4.3 Data Balancing

Based on the data distribution in Fig. 2 and the imbalance ratios calculated in Sect. 3, there is a noticeable class imbalance at all taxonomy levels. As a result, the developed models are expected to be biased toward the majority classes [62]. Therefore, we implemented three techniques to address class imbalance at the algorithm and data levels. At the algorithm level, a class frequency-based weighted loss function [63] was used, in which each class is assigned a weight inversely proportional to the number of instances in that class; the minority classes thus receive higher weights than the majority classes. The class frequency-based weighted TCE loss implemented herein for skin lesion diagnosis with the taxonomy is defined in Eq. 4.

$${Weighted\;Loss}_{TCE}\left(\widehat{y},y\right)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}W_{ij}\;y_{ij}\;\log\;\widehat{y}_{ij}$$
(4)

where \({W}_{ij}\) is the inverse of the number of instances in level i class j and the weights are normalized over the number of classes to balance the loss [64].
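
A minimal sketch of this inverse-frequency weighting is given below; the exact normalization in [64] may differ in detail, and the disease-level counts are taken from Fig. 2.

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency weights, normalized so they sum to the number of
    classes in the level (one common normalization choice)."""
    counts = np.asarray(counts, dtype=float)
    w = 1.0 / counts
    return w * len(counts) / w.sum()

# Disease-level counts: BCC, MEL, NEV, SK. The minority classes
# (BCC and SK) receive the largest weights.
print(class_weights([42, 252, 575, 45]).round(2))
```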

At the data level, we implemented two data augmentation approaches: first, data transformation, in which geometric image transformations (i.e., flipping and rotation) were applied to increase the minority class size [65]; and second, data integration, in which, instead of artificially expanding the minority classes with transformations, we integrated real images from another source, DermNet NZ [66], to augment the minority classes.
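
For the transformation approach, a minimal offline augmentation sketch is shown below (images assumed scaled to [0, 1]; rotation is restricted to 90-degree multiples and the noise level is an illustrative assumption, since the exact parameters are not specified here).

```python
import numpy as np
import tensorflow as tf

def augment_minority(image, rng):
    """One transformed copy: random flips, rotation (multiples of 90 deg),
    and additive Gaussian noise, as applied to the SK and BCC images."""
    img = tf.image.random_flip_left_right(image)
    img = tf.image.random_flip_up_down(img)
    img = tf.image.rot90(img, k=int(rng.integers(0, 4)))
    noise = tf.random.normal(tf.shape(img), stddev=0.02)  # assumed noise level
    return tf.clip_by_value(img + noise, 0.0, 1.0)

rng = np.random.default_rng(42)
image = tf.random.uniform((299, 299, 3))                   # placeholder lesion image
copies = [augment_minority(image, rng) for _ in range(3)]  # three copies per image
```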

5 Empirical Framework

In this section, we describe the empirical setup designed to investigate the proposed methods, explain the model training configurations in detail, and finally clarify the evaluation metrics used to assess the developed models.

5.1 Empirical Setup

The empirical setup is designed to study the impact of the incremental architecture, the loss function, the data balancing, and the K-fold cross-validation on the disease diagnosis accuracy. As a result, our work incorporates four setups: (1) a baseline model that predicts the diseases directly to be compared with the incremental model that learns dermatological concepts based on a skin lesion taxonomy; (2) a model with the categorical CE loss to be compared with a model with the TCE loss; (3) three models with class frequency-based weighted loss function, data transformation, and data integration are developed for investigating data balancing; and (4) three models implementing K-fold cross-validation [67] (K = 5) for class frequency-based weighted loss function, data transformation, and data integration. In all setups, the significance of the difference in disease level accuracy across the developed models is measured using the p value of the Z-test for comparing two proportions with a 95% confidence level [68].
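
For reference, the two-proportion z-test used throughout the result comparisons can be computed as in the following sketch; the counts in the example are illustrative, not reported results.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided p value of the z-test comparing two accuracies, treated as
    proportions of correctly classified test images (pooled standard error)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Illustrative counts: two models evaluated on the same 183-image test set.
print(two_proportion_z_test(126, 183, 71, 183))
```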

In the third setup, data transformation and integration are applied to the minority classes. In data transformation, each training image belonging to the minority class diseases, SK and BCC, was replaced by three transformed copies created by random flipping, rotation, and added Gaussian noise. Consequently, the training data size increased to 652 images, while the validation and test data sizes did not change. In data integration, 42 SK images and 152 BCC images collected from DermNet NZ were consolidated with the study data to augment the minority classes. Unlike data transformation, integrating data from different sources necessitates redistributing the images across training, validation, and testing. As a result, the total number of images (1108) was split into 60% training (664), 20% validation (222), and 20% testing (222 images). Figure 4 illustrates the disease distribution of the original, transformed, and integrated training datasets.

Fig. 4 Training data class distribution

5.2 Model Training

The study data, 914 images, were split into 60% (548) training, 20% (183) validation, and 20% (183) testing, following the split provided by Kawahara et al. [49] for the 7-point criteria evaluation database [55] to allow a fair performance comparison. In the k-fold cross-validation setup, however, a five-fold split was applied to the combined training and validation data (731 images), while model testing was performed on the independent test set (183 images). The baseline and incremental architectures are built on the Inception V3 network [57] with transfer learning [69] to benefit from the weights obtained by training on ImageNet [70]. To adapt Inception V3 to the skin lesion diagnosis task, we replaced the classification layers of the network with a global average pooling layer and Softmax layers [71]. In the baseline model, a single Softmax layer with four output units was added to the network to directly perform disease diagnosis (MEL, NEV, BCC, SK).

In the incremental architecture, a Softmax layer taking the image features as input, with two output units representing the first level of the taxonomy (melanocytic or non-melanocytic), was added first. We then added a concatenation layer to combine the image features with the classification output of the first Softmax layer; its output was fed to a second Softmax layer with two output units responsible for the second taxonomy level, classifying the lesion as malignant or benign. Finally, another concatenation layer combined the image features with the classification outputs of the first and second Softmax layers, and a third Softmax layer with four output units used this combined input to perform the disease-level classification (MEL, NEV, BCC, SK).
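
A minimal Keras sketch of this incremental head is given below; the layer names are ours and implementation details not stated above (e.g., dropout) are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Feature extraction blocks of Inception V3, pre-trained on ImageNet.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))

features = layers.GlobalAveragePooling2D(name="image_features")(base.output)

# Level 1: lesion origin (melanocytic / non-melanocytic).
origin = layers.Dense(2, activation="softmax", name="origin")(features)

# Level 2: malignancy, fed with the image features plus the level-1 output.
malignancy_in = layers.Concatenate(name="features_plus_origin")([features, origin])
malignancy = layers.Dense(2, activation="softmax", name="malignancy")(malignancy_in)

# Level 3: disease, fed with the features plus both previous outputs.
disease_in = layers.Concatenate(name="features_plus_levels")(
    [features, origin, malignancy])
disease = layers.Dense(4, activation="softmax", name="disease")(disease_in)

model = tf.keras.Model(base.input, [origin, malignancy, disease])
```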

Online image augmentation, including flipping, rotation, zoom, and shift, was performed during training. To calibrate the pre-trained Inception V3 to the skin lesion diagnosis task, we unfroze the last two blocks of the network and trained them together with the skin lesion classification layers for 50 epochs with a batch size of 32. For model optimization, we used stochastic gradient descent with a learning rate of 0.001 and momentum of 0.9 for consistently decreasing loss, fast convergence, and reduced oscillation [72]. All experiments were implemented using Keras [73] for model training and optimization, while TensorFlow [74] was used to transform class labels from categorical to numerical values.
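
Continuing the architecture sketch above, the training configuration could be wired roughly as follows; the number of layers unfrozen and the augmentation ranges are approximations of the description above, and the per-head categorical cross entropy can be swapped for the weighted or taxonomy variants of Sects. 4.1 and 4.3.

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Freeze all but roughly the last two Inception blocks (approximate cut-off).
for layer in base.layers[:-60]:
    layer.trainable = False

model.compile(
    optimizer=SGD(learning_rate=0.001, momentum=0.9),
    loss={"origin": "categorical_crossentropy",
          "malignancy": "categorical_crossentropy",
          "disease": "categorical_crossentropy"},
    metrics=["accuracy"],
)

# Online augmentation during training (flips, rotation, zoom, shift);
# the ranges are illustrative assumptions.
augmenter = ImageDataGenerator(horizontal_flip=True, vertical_flip=True,
                               rotation_range=90, zoom_range=0.1,
                               width_shift_range=0.1, height_shift_range=0.1)
# model.fit(...) would then run for 50 epochs with batch size 32 using
# generators built from `augmenter` and the three per-level label sets.
```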

5.3 Evaluation

All developed models were evaluated in terms of accuracy, sensitivity (recall), specificity, precision, and F-score metrics [75] as per Eqs. 5–9.

$$Accuracy= \frac{TP+TN}{TP+FP+TN+FN}$$
(5)
$$Sensitivity/Recall= \frac{TP}{TP+FN}$$
(6)
$$Specificity= \frac{TN}{TN+FP}$$
(7)
$$Precision= \frac{TP}{TP+FP}$$
(8)
$$F-Score= \frac{2\times Recall\times Precision}{Recall+Precision}$$
(9)

In the equations above, TP is the true positives (count of correctly classified positive class), TN is the true negatives (count of correctly classified negative class), FP is the false positives (count of instances incorrectly classified as a positive class), and FN is the false negatives (count of instances incorrectly classified as a negative class). In level 1 of the taxonomy, the positive class is the melanocytic, and the negative class is the non-melanocytic. In level 2, the malignant class is the positive, and the benign is the negative class. In level 3, each disease represents a positive class, and all others are the negative class.
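
A small helper computing Eqs. 5–9 one-vs-rest from the confusion counts is sketched below; the example counts are hypothetical and chosen only to sum to the 183-image test set.

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. 5-9, computed one-vs-rest for a single positive class."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    f_score = (2 * sensitivity * precision / (sensitivity + precision)
               if (sensitivity + precision) else float("nan"))  # undefined (U)
    return accuracy, sensitivity, specificity, precision, f_score

# Hypothetical example: MEL as the positive class on a 183-image test set.
print(classification_metrics(tp=40, tn=110, fp=20, fn=13))
```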

6 Results

In the first empirical setup, summarized in Table 2, the baseline model has a higher overall disease accuracy (0.58) than the disease-level accuracy (0.39) of the incremental model (M1). However, the sensitivity of the baseline model indicates that it discriminates poorly between the diseases; indeed, all test images are classified as NEV because it is the majority class. In the incremental model, in contrast, we can see a slight improvement in discrimination at the disease level between MEL (sensitivity 0.86 and precision 0.31) and NEV (sensitivity 0.24 and precision 0.68) as a result of the knowledge gained from level 2 of the taxonomy. When observing the sensitivity and specificity of the baseline and M1 (Table 2), we find multiple occurrences of zero sensitivity with a specificity of one (or the opposite), indicating that both models suffer from the class imbalance, which biased them toward the majority classes.

Table 2 Setup-1 and setup-2 models: the F-score is undefined (U) when sensitivity and precision are zero

Turning to the second empirical setup, we study the impact of the taxonomy-based loss function (TCE) on the performance of the incremental architecture of M1. Table 2 compares the incremental model with the categorical CE (M1) and with the TCE (M2) loss functions. The TCE significantly improved the overall model performance, with the accuracy increasing from 0.66 to 0.77. Moreover, there is a considerable enhancement in malignancy prediction between the two models, where the accuracy increased from 0.68 to 0.73. Similarly, the accuracy of the disease prediction improved from 0.39 to 0.69. To measure the significance of the change in disease classification accuracy across the three models, the p value of the Z test for comparing the accuracies with a 95% confidence level is reported. Comparing the baseline with M1 and M2, p = 0.00038 and 0.0226, respectively; comparing M1 and M2, p < 0.00001. All p values are thus < 0.05, meaning that the differences in accuracy across the three models are significant.

As the model architecture does not impact the first level of the taxonomy, there is no noticeable improvement in lesion origin performance, and both the CE and TCE models are biased toward the majority (melanocytic, positive) class, achieving a sensitivity of 1.00 (M1) and 0.98 (M2) with zero specificity.

When observing the sensitivity and specificity of M1 and M2 at the malignancy prediction level, we can find a noticeable balance between the sensitivity (0.58) and specificity (0.81) in the model with TCE compared to the model with CE loss (sensitivity of 0.02 and specificity of 1). This indicates the increased ability of the model with TCE loss to differentiate between the malignant and benign cases although the classes are imbalanced. Similarly, on the disease level, the model with TCE loss shows more balance between NEV and MEL sensitivity and specificity compared to the CE loss model. Nevertheless, none of the models was able to diagnose BCC or SK.

In the third empirical setup, the data balancing approaches were implemented on top of the best-performing model from the second setup (M2), which uses the TCE loss function. The class frequency-based weighted loss, data transformation, and data integration balancing techniques (explained in Sect. 4.3 and designed as detailed in Sect. 5) were implemented, and the performance of the resulting models is summarized in Table 3. Data transformation and data integration improved disease classification accuracy compared to the weighted loss. The differences in accuracy between the weighted loss and data transformation and between the weighted loss and data integration are significant (p = 0.04036 and 0.02034, respectively). However, the difference in accuracy between data transformation and data integration is not significant (p = 0.86502).

Table 3 Setup-3 models: the F-score is undefined (U) when sensitivity and precision are zero

The model trained with data integration balancing is able to differentiate between the melanocytic (sensitivity 0.95) and non-melanocytic (specificity 0.61) classes at the lesion origin level; similarly, it differentiates between malignant (sensitivity 0.71) and benign (specificity 0.79) lesions, and at the disease level it diagnoses all diseases with higher precision than the other models.

Comparing the weighted loss with the data transformation model, the former considerably mitigated the effect of class imbalance, so the model started to correctly classify minority classes. This can be inferred from the specificity of 0.35 in lesion origin prediction, which reflects the model's ability to predict the non-melanocytic class, compared to zero specificity for the data transformation model (Table 3) and all other models (Table 2).

At the disease level, the sensitivities show that the weighted loss model successfully diagnosed a few SK cases but failed to diagnose the BCC images. Similarly, the data transformation model predicted a few BCC cases but failed to predict SK. It is important to note that the data integration model correctly diagnosed 24% of the SK cases compared to none for the data transformation model, although the number of SK images used to train the latter (81) exceeds the number used to train the former (52). This observation emphasizes the benefit of augmenting the training data with real images instead of artificially altered ones.

In the last experimental setup, we implemented a five-fold cross-validation strategy for all data balancing techniques. The training and validation data were combined and divided into five folds, with four folds used for training and one for validation, rotated iteratively. Accordingly, five models were developed for each data balancing technique, and the average performance along with the standard deviation is reported in Table 4.

Table 4 Setup-4 models: the five-fold cross-validation (CV) (score ± standard deviation)

Although data transformation and data integration improved disease classification accuracy compared to the weighted loss technique, the difference in accuracy between the weighted loss and data transformation is not significant (p = 0.1031), whereas the difference between the weighted loss and data integration is significant (p = 0.01046). Finally, the difference in accuracy between data transformation and data integration is not significant (p = 0.38978).

With respect to the change in accuracy for each data balancing technique with and without cross-validation (Tables 3 and 4): for the weighted loss, the disease accuracy increased insignificantly from 0.57 to 0.59 (p = 0.67448); for data transformation, the disease accuracy did not change (0.67); and for data integration, it increased insignificantly from 0.68 to 0.71 (p = 0.47152). It is worth mentioning that developing a model with five-fold cross-validation requires approximately five times the computational time of developing the same model without cross-validation. Accordingly, the choice between k-fold cross-validation and a train-validation split is a trade-off between accuracy and computational time.

To summarize the results of all empirical setups, we created a class-specific performance comparison based on the sensitivity and specificity of each model. In Fig. 5,  M1 represents the incremental architecture; M2 is the incremental architecture and the taxonomy-based loss function; M3 includes the incremental architecture, the taxonomy-based loss function, and minority class data integration; and M4 represents the incremental architecture, the taxonomy-based loss function, data integration, and five-fold cross-validation.

Fig. 5 Experiment summary

In lesion origin classification, all models were similarly able to predict the melanocytic class; however, only M3 and M4 were able to predict the non-melanocytic class. Thus, neither the incremental architecture nor the taxonomy-based loss function impacted the performance of lesion origin classification, which is fully consistent with the logic of the developed architecture.

Regarding malignancy classification, M1 contains the incremental architecture only, resulting in a model biased toward the benign class; M2, M3, and M4, in contrast, achieved a noticeable balance in detecting malignant and benign cases. Finally, at the disease level, M1 and M2 were able to distinguish between MEL and NEV, but the underrepresented diseases BCC and SK were not captured by these models. Given the high imbalance ratio (13.69) at this taxonomy level, the minority-class data integration implemented in M3 and M4 substantially improved the diagnosis of BCC and SK.

Moving to the interpretability of the developed models, examples of correctly diagnosed BCC, NEV, MEL, and SK cases are illustrated in Fig. 6a, b, c, and d, respectively. The lesion taxonomy clearly explains the rationale behind the final disease diagnosis. In addition, the presented probability of each disease group improves the transparency of the results and the confidence in them.

Fig. 6 Skin taxonomy interpretation of the results for BCC (a), NEV (b), MEL (c), and SK (d)

The lesion taxonomy is also beneficial in detecting misdiagnoses, as illustrated in Fig. 7, where a MEL lesion was misdiagnosed as NEV. This error can be detected from the malignancy prediction, since the model indicates a malignant lesion while NEV is not a malignant disease. Thus, the model's built-in interpretability facilitates understanding its output. However, there are cases where the taxonomy will not help detect the misdiagnosis, as shown in Fig. 8, a case of NEV misdiagnosed as MEL. The lesion is diagnosed correctly as melanocytic but misdiagnosed as malignant and as MEL; since all the predicted taxonomy paths are logically consistent, this misdiagnosis cannot be detected. Such errors are to be expected in any computer-aided diagnosis application.

Fig. 7 Misclassified MEL as NEV

Fig. 8 Worst-case error

Finally, we present in Fig. 9 the second interpretability approach, Grad-CAM++, which highlights the influential regions that contributed most to the disease classification. The generated maps are consistent with the model predictions: for the correctly classified cases in Fig. 9a, the disease region overlaps with the region highlighted by Grad-CAM++. In contrast, for the misdiagnosed cases in Fig. 9b, the regions the model used to make a diagnosis do not align well with the disease region. This supports the faithfulness of Grad-CAM++ in explaining the model's results.

Fig. 9 Grad-CAM++ interpretation of the correctly classified cases (a) and incorrectly classified cases (b)

7 Discussion

In this work, an interpretable skin cancer diagnosis approach utilizing clinical images to learn dermatological concepts was developed and evaluated. The results showed the significant impact of the incremental architecture, the taxonomy-based loss function, and the minority class data integration on improving the model’s accuracy and boosting its ability to differentiate between the classes in all taxonomy levels. In addition, the model demonstrated its ability to justify the diagnosis and discover disease misdiagnosis.

In comparison with prior works, Kawahara et al. [49] developed a DNN to diagnose skin lesions using the same dataset as our work (discussed in Sect. 3). Their model trained on clinical images had an average accuracy of 60%, compared to the 67% average disease accuracy achieved by our model implementing the incremental architecture, the taxonomy-based loss function, and data transformation without integrating images from DermNet NZ (Table 3). In addition, our model outperforms the classification approach proposed by Ngiam et al. [76] as implemented by Kawahara et al. [49] on the same clinical images; Ngiam's model had an accuracy of 58.2% compared to 67% for our model. In terms of interpretability, Kawahara et al. utilized CAM to visualize the dominant regions of the lesion images used for diagnosis [49].

Ge et al. [44] developed three DNNs trained and tested on 26,584 clinical images to predict 15 skin lesions (3 malignant and 12 benign conditions), achieving average accuracies of 52.2%, 54.1%, and 59.4%. In contrast, we concentrated on 4 diseases, as our focus is on skin cancer detection rather than common skin conditions. We employed 1108 images for developing our best-performing model (M4), which achieved an average disease-level accuracy of 71% and an all-levels average accuracy of 78%. With respect to interpretability, Ge et al. integrated CAM with bilinear feature pooling [77] to provide a detailed visual explanation of the important regions of skin lesion images.

Esteva et al. [58] trained an Inception V3 network on 129,450 clinical images to classify lesions as carcinoma versus SK and as melanoma versus nevi. Although the selection of the classification tasks was based on a skin lesion taxonomy, the authors did not incorporate the taxonomy into the models' implementation. In terms of accuracy, our best-performing model achieved 77% accuracy in malignancy detection and 71% accuracy in diagnosing the four diseases; in Esteva's work, the accuracy of classifying carcinomas versus SK was 72.1%, and that of classifying melanoma versus nevi was 55.4%. Regarding interpretability, Esteva et al. employed t-distributed stochastic neighbor embedding [78], which visualizes high-dimensional data, to inspect the learnt features of the DNN's last layer and thus understand the model's inference.

Although our approach outperforms prior works, it has some limitations. The disease-level accuracy could be further improved by incorporating more clinical images in model training. In addition, other loss functions that reflect the taxonomy could be developed to investigate their impact on classification accuracy. Finally, evaluation of the developed interpretability methods by general practitioners is needed to assess the impact of the implemented skin cancer diagnosis models on the performance and confidence of human users.

8 Conclusion

We presented an interpretable skin cancer diagnosis approach that employs a skin lesion taxonomy to incrementally learn dermatological knowledge using an adapted DNN architecture. Our models were trained on clinical images, as they are easily obtained by non-specialist healthcare providers. The empirical analyses showed that the implemented taxonomy is beneficial in improving classification accuracy, understanding the rationale behind the disease diagnosis, and discovering diagnosis errors. Moreover, we employed an advanced gradient-based class activation map method that provided consistent visual explanations of the diagnosis-making process. Our work is a step toward an interpretable, rapid skin cancer diagnostic tool that can assist general practitioners in making an early diagnosis. Further long-term, large-scale validation studies are nonetheless needed to understand the usability, interpretability, and accuracy of our proposed model when employed by general practitioners in clinical settings.