
1 Introduction

Skin cancer is the most common type of cancer worldwide, responsible for 64,000 fatalities in 2020 [16]. The majority of skin cancers can be treated if diagnosed early. However, visual inspection of skin malignancies during a health screening is prone to diagnostic errors, given the similarity between skin lesions and normal tissues [12]. Dermatoscopy is the most reliable imaging method for screening skin lesions in practice. It is a non-invasive technology that allows the dermatologist to acquire high-resolution images of the skin for better visualisation of the lesions, while also improving sensitivity (i.e. the accurate identification of cancerous lesions) and specificity (the correct classification of suspicious but non-cancerous lesions) compared with visual inspection alone. Nonetheless, dermatologists still confront hurdles in improving skin cancer detection, since manual assessment of dermatoscopic images is often complicated, error-prone, time-consuming, and subjective (i.e., it may lead to incorrect diagnostic outcomes) [12]. Thus, an automated and trustworthy computer-aided diagnostic (CAD) system for skin lesion classification has become an important tool to support dermatologists in reaching reliable diagnostic decisions.

Over the last decades, several Convolutional Neural Network (CNN) based methods have been presented, delivering better CAD systems that accurately identify melanoma and non-melanoma skin lesions. Deep neural networks are now being used to classify skin cancer at the dermatologist level. Examples include [9], which used GoogleNet's Inception v3 model and achieved \(72.1\%\) and \(55.4\%\) accuracy on three-class and nine-class tasks, respectively, on a Stanford Hospital private dataset. In [22], a fully convolutional residual network (FCRN) was proposed and evaluated on the IEEE International Symposium on Biomedical Imaging (ISBI) 2016 Skin Lesion Analysis Towards Melanoma Detection Challenge dataset. This model obtained \(1^{st}\) place on the challenge leaderboard, yielding an accuracy of \(85.5\%\). Moreover, an attention residual learning convolutional neural network (ARL-CNN) was introduced by [23] and evaluated on the ISBI 2017 dataset, achieving an average area under the curve (AUC) of \(91.7\%\).

Ensemble-based CNN models have also shown superior performance in medical image analysis [5, 6] and in skin lesion segmentation [15] and classification, as shown on the International Skin Imaging Collaboration (ISIC) 2018 [3] and 2019 [10] datasets and the HAM10000 dataset [1]. However, these methods require training several deep learning models to create the ensemble, which demands considerable computing power and is not suitable for real-time applications. In summary, most methods used for medical image classification, including lesion classification, are based on CNN models. However, it has been reported that while such models perform very well on individual datasets, cross-dataset generalisation is still considered a key challenge by the computer vision research community [8].

To this end, we aim to address some of the issues above using a single deep learning model to classify skin lesions accurately. We propose a vision transformer-based model, as such architectures have been shown to outperform CNNs on many image classification tasks [7, 14]. In this study, we use a bidirectional encoder representation from image transformers model to diagnose skin lesions. The rest of this article is organised as follows. Section 2 describes the materials and the bidirectional encoder representation from image transformers model in detail. The experimental findings of the CNN and transformer-based models are compared and examined in Sect. 3. Finally, Sect. 4 draws the research conclusions and suggests some future directions.

Fig. 1. The architecture of the proposed transformer model, TransSLC. The input image is split into image patches; a special mask embedding [M] replaces a random subset of the patches (blue patches in the figure). The patches are then fed to a backbone vision transformer, whose output is used for classification. (Color figure online)

2 Methods and Materials

2.1 Image Transformer

In this work, we propose a bidirectional encoder representation from image transformers motivated by BEIT [2]. Figure 1 provides a schematic diagram of the proposed method. Initially, the input \(224\times 224\) skin lesion image is split into a \(14\times 14\) array of image patches, with each patch measuring \(16\times 16\) pixels, as shown in the top-left corner of Fig. 1. BEIT proposes a masked image modelling (MIM) task to pretrain vision transformers and create the visual representation of the input patches. We therefore use block-wise masking followed by a linear flattening projection to obtain the patch embeddings. A special token [S] is added to the input sequence for regularisation purposes, and standard learnable 1D position embeddings are added to the patch embeddings. The resulting embedding vectors are fed into a vision transformer encoder, which serves as the backbone network of our model. The encoded representations of the image patches are the output vectors of the final layer of the transformer, which are fed into the classification head that classifies the input skin lesion image. The classification head consists of two layers: a global average pooling layer (used to aggregate the representations) and a softmax-based output layer that produces the classification over the distinct categories.
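To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture just described, using the dimensions stated in this paper (\(224\times 224\) input, a \(14\times 14\) grid of \(16\times 16\) patches, hidden size 768, and the encoder configuration from Sect. 2.2). The module names, the use of `nn.TransformerEncoder`, and the masking interface are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of the TransSLC forward pass, not the authors' code.
import torch
import torch.nn as nn

class TransSLCSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, hidden=768,
                 layers=12, heads=12, num_classes=7):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2       # 14 * 14 = 196
        # Linear projection of flattened patches (patch embedding).
        self.patch_embed = nn.Conv2d(3, hidden, kernel_size=patch_size,
                                     stride=patch_size)
        # Special [S] token and learnable mask embedding [M].
        self.s_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden))
        # Standard learnable 1D position embeddings.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=3072,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Classification head: global average pooling + linear/softmax output.
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x, mask=None):
        b = x.size(0)
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, 768)
        if mask is not None:                                      # (B, 196) bool
            patches = torch.where(mask.unsqueeze(-1),
                                  self.mask_token.expand_as(patches), patches)
        tokens = torch.cat([self.s_token.expand(b, -1, -1), patches], dim=1)
        encoded = self.encoder(tokens + self.pos_embed)
        pooled = encoded[:, 1:].mean(dim=1)       # global average pooling
        return self.head(pooled)                  # logits; softmax applied at eval

logits = TransSLCSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```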

2.2 Model Implementation Setup

As mentioned in the previous section, the proposed TransSLC model design is based on the BEIT model presented in [2]. In practice, we utilise a 12-layer transformer encoder with a hidden size of 768 and 12 attention heads. The feed-forward networks use an intermediate size of 3072. For our experiment, the input skin lesion image size is set to \(224\times 224\) resolution, with some patches of the \(14\times 14\) patch array randomly masked. We trained our proposed model for 50 epochs using the Adam optimiser [13] with parameters \(\beta _1 = 0.5\) and \(\beta _2 = 0.999\). The learning rate was set to 0.0001, and a batch size of 8 was used. To ensure a fair comparison with other CNN-based methods, we used the same experimental settings throughout. Experiments were carried out using Nvidia Tesla T4 16 GB Graphics Processing Unit (GPU) cards; running the experiment for 50 epochs took on average 24 h of training time for each of the models below.
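The stated training configuration can be reproduced roughly as in the sketch below; the timm model name, the `pretrained=False` choice, and the dummy data loader are assumptions for illustration only.

```python
# Hedged sketch of the stated training setup, not the authors' script.
import timm
import torch
from torch.utils.data import DataLoader, TensorDataset

model = timm.create_model('beit_base_patch16_224',   # 12 layers, 768 hidden,
                          pretrained=False,          # 12 heads, FFN size 3072
                          num_classes=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.5, 0.999))     # values from the text
criterion = torch.nn.CrossEntropyLoss()

# Dummy stand-in for the HAM10000 training loader (batch size 8).
train_loader = DataLoader(
    TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 7, (32,))),
    batch_size=8, shuffle=True)

for epoch in range(50):                              # 50 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```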

2.3 Model Evaluation

Standard metrics were used to evaluate the performance of the models in the experiments: accuracy, precision, recall, and F1 score. Definitions of these metrics are presented in Table 1.

Table 1. Evaluation metrics used to assess the models.
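As the table body is not reproduced here, the standard definitions of these metrics, in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are: \(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\); \(\text{Precision} = \frac{TP}{TP + FP}\); \(\text{Recall} = \frac{TP}{TP + FN}\); and \(\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\).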

2.4 Dataset

The public and commonly used HAM10000 dataset [1] was used for evaluation purposes. The dataset contains 10,015 images, labelled with a discrete set of seven categories: actinic keratoses and intraepithelial carcinoma (AKIEC), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanoma (MEL), melanocytic nevus (NV), and vascular lesions (VASC). As can be seen in Table 2, the class distribution is imbalanced: the NV class has 4693 training images, whereas the DF and VASC classes have only 80 and 99 images, respectively. This is a common problem in most medical datasets, as well as health-related data [20], where various data sampling methods and algorithmic modifications are employed to handle it [21]. For the purposes of this paper, we handled this problem using a simple data augmentation technique, comprising horizontal and vertical flipping, random cropping, and contrast-limited adaptive histogram equalisation (CLAHE) applied to the original RGB images with varying values to alter the contrast. To generate a range of contrast images, we set the CLAHE contrast-limit threshold between 1.00 and 2.00.
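A sketch of this augmentation pipeline is shown below using the albumentations library; the library choice, crop size, resize step, and application probabilities are assumptions, with only the CLAHE clip-limit range of 1.00 to 2.00 taken from the text.

```python
# Hedged sketch of the described augmentations (assumed library: albumentations).
import numpy as np
import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),                     # horizontal flipping
    A.VerticalFlip(p=0.5),                       # vertical flipping
    A.RandomCrop(height=384, width=384, p=0.5),  # assumed crop size
    A.CLAHE(clip_limit=(1.0, 2.0), p=0.5),       # contrast limits from the text
    A.Resize(height=224, width=224),             # model input size (Sect. 2.2)
])

# HAM10000 images are 600x450 RGB; a random stand-in is used here.
rgb_image = np.random.randint(0, 256, (450, 600, 3), dtype=np.uint8)
augmented = augment(image=rgb_image)['image']
```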

Table 2. The image distribution per class and splits of the HAM10000 dataset.
Table 3. Comparison of the performance (%) of the proposed transformers-based model against different CNN-based models in terms of the accuracy (AC), precision (PR), recall (RE), and F1 score (F1), respectively, on the test dataset.

3 Experimental Results

For comparison purposes with our proposed TransSLC model, we selected several state-of-the-art models: ResNet-101 [11], Inception-V3 [18], the hybrid Inception-ResNet-V2 [17], Xception [4], and EfficientNet-B7 [19]. These models are considered state-of-the-art and are commonly used in medical image analysis. As can be seen in Table 3, TransSLC achieved the top performance, reaching an accuracy of \(90.22\%\), precision of \(85.33\%\), recall of \(80.62\%\), and F1 score of \(82.53\%\). Among the selected CNN-based models, EfficientNet-B7 [19] achieved the best results, with an accuracy of \(88.18\%\), precision of \(83.66\%\), recall of \(78.64\%\), and F1 score of \(80.67\%\). Thus, our proposed model improves on the CNN-based EfficientNet-B7 [19] model by \(2.04\%\), \(1.67\%\), \(1.98\%\), and \(1.86\%\) in terms of accuracy, precision, recall, and F1 score, respectively.

Moreover, Fig. 2 shows the confusion matrices for the seven classes of the HAM10000 test set. Figure 2(a) shows that EfficientNet-B7 [19] produces some misclassifications on the test set, particularly for the MEL type, while Fig. 2(b) shows that the proposed TransSLC model is able to classify the skin lesion types correctly in most of the classes. The CNN-based EfficientNet-B7 [19] model performs well in detecting the AKIEC and BCC lesion types, scoring \(5\%\) and \(8\%\) higher than our proposed TransSLC model. For the BKL, DF, MEL, NV, and VASC lesion types, however, the EfficientNet-B7 [19] model performs worse, and it fails significantly on the MEL type, scoring \(15\%\) lower than our proposed model. This is a crucial flaw, as MEL (melanoma) is deadly for patients. CNN-based models therefore have considerable limitations when used in real-world clinical settings, whereas our proposed model is capable of overcoming this limitation and could potentially be deployed in a real clinical setting. Still, TransSLC has some limitations when classifying the MEL type, confusing this class with AKIEC, BCC, BKL, and NV in \(1\%\), \(1\%\), \(5\%\), and \(25\%\) of cases, respectively. Another drawback of the proposed transformer-based model is its large number of parameters, which requires substantial memory (computational capacity) to deploy.
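For reference, a row-normalised confusion matrix such as those in Fig. 2 can be produced as sketched below; `y_true` and `y_pred` are placeholders standing in for the test-set labels and model predictions.

```python
# Hedged sketch: per-class (row-normalised) confusion matrix with scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

classes = ['AKIEC', 'BCC', 'BKL', 'DF', 'MEL', 'NV', 'VASC']
rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, 500)   # placeholder test labels
y_pred = rng.integers(0, 7, 500)   # placeholder model predictions

cm = confusion_matrix(y_true, y_pred, normalize='true')  # rows sum to 1
ConfusionMatrixDisplay(cm, display_labels=classes).plot()
plt.show()
```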

Fig. 2. Confusion matrices of (a) the CNN-based EfficientNet-B7 model and (b) the proposed transformer-based model, TransSLC.

Fig. 3. ROC (receiver operating characteristic) curves of (a) the CNN-based EfficientNet-B7 model and (b) the proposed transformer-based TransSLC model.

Figure 3 compares the CNN-based EfficientNet-B7 and the proposed model using Receiver Operating Characteristic (ROC) curves. EfficientNet-B7 yields an area under the curve of \(98\%\) for the AKIEC class, which is \(2\%\) higher than the proposed model. For the DF, MEL, NV, and VASC classes, the areas achieved by TransSLC improve on the EfficientNet-B7 model by \(1\%\), \(2\%\), \(2\%\), and \(2\%\), respectively. The areas for the remaining BCC and BKL classes are the same for both models. The class-wise performance metrics of the proposed transformer-based TransSLC model are presented in Table 4. The proposed model yields accuracies of \(86.00\%\), \(78.90\%\), \(84.77\%\), \(89.47\%\), \(79.60\%\), \(93.70\%\), and \(84.8\%\) for the AKIEC, BCC, BKL, DF, MEL, NV, and VASC classes, respectively.
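The per-class areas reported above can be computed one-vs-rest, as in the hedged sketch below; `y_true` and `y_score` are placeholders for the test labels and the model's softmax outputs.

```python
# Hedged sketch: one-vs-rest ROC/AUC per class with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = ['AKIEC', 'BCC', 'BKL', 'DF', 'MEL', 'NV', 'VASC']
rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, 500)               # placeholder labels
y_score = rng.random((500, 7))
y_score /= y_score.sum(axis=1, keepdims=True)  # placeholder softmax outputs

y_bin = label_binarize(y_true, classes=list(range(7)))    # (N, 7) indicator
for i, name in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])   # one class vs rest
    print(f'{name}: AUC = {auc(fpr, tpr):.2f}')
```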

Table 4. Class-wise performance of the proposed transformer-based TransSLC model for the seven skin lesion classes, in terms of precision (PR), recall (RE), and F1 score (F1).
Fig. 4. Visualisation of the activation maps. For every column, we show an input image and the corresponding activation maps from the outputs of EfficientNet-B7 and the proposed TransSLC model.

The performance analysis of the ablation experiments alone is likely insufficient to assess the benefits and behaviour of the proposed model. Thus, in Fig. 4 we depict the activation maps of the CNN-based and transformer-based models. Notice that in the EfficientNet-B7 rows, the model classifies all these images correctly into the corresponding class, but its activations spread over broad regions of the input skin lesion images. More precisely, the skin lesion type can be confirmed from only certain lesion areas of the dermatoscopic image. The activation maps produced by the proposed transformer-based TransSLC model overlap remarkably well with the lesion regions alone, which could signify the presence of the lesion type. Finally, we can infer that a transformer-based model distinguishes between important and irrelevant characteristics of a skin lesion and learns the appropriate features for each given class.
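The text does not specify how the activation maps in Fig. 4 were generated; one model-agnostic possibility is a simple input-gradient saliency map, sketched below as an assumed stand-in (Grad-CAM for the CNN and attention rollout for the transformer are common alternatives).

```python
# Hedged sketch: input-gradient saliency for any classifier (assumed method,
# not necessarily the one used for Fig. 4).
import torch

def saliency_map(model, image, target_class):
    """image: (1, 3, H, W) tensor; returns an (H, W) heat map."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]   # logit of the class of interest
    score.backward()                        # gradient of score w.r.t. pixels
    return image.grad.abs().max(dim=1)[0].squeeze(0)  # max over channels

# Toy usage with a stand-in linear classifier.
toy_model = torch.nn.Sequential(torch.nn.Flatten(),
                                torch.nn.Linear(3 * 224 * 224, 7))
heat = saliency_map(toy_model, torch.randn(1, 3, 224, 224), target_class=4)
print(heat.shape)  # torch.Size([224, 224])
```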

4 Conclusion

In this paper, we presented TransSLC, a transformer-based model able to classify seven types of skin lesions. The proposed method was compared with five popular state-of-the-art CNN-based deep learning models using the public HAM10000 dataset. Our proposed model achieved an accuracy of \(90.22\%\), precision of \(85.33\%\), recall of \(80.62\%\), and F1 score of \(82.53\%\) on the test dataset. These results show that the transformer-based model outperforms traditional CNN-based models in classifying different types of skin lesions, which can enable new research in this domain. Future work will further explore the performance of transformer-based methods across other datasets, as well as carrying out cross-dataset evaluation to assess how well the model generalises.