1 Introduction

Globally, an estimated 132,000 new melanoma cases and 2 to 3 million new non-melanoma skin cancer cases occur each year, which shows that the incidence of skin cancer is rising sharply [1, 2]. The major cause is ultraviolet (UV) radiation, the component of sunlight that can damage the DNA of skin cells and lead to the uncontrolled growth of skin cells that results in skin cancer. The main reason that more UV radiation reaches the Earth's surface is the depletion of the ozone layer [3, 4]. The most common categories of skin lesions are squamous cell carcinoma, melanoma, basal cell carcinoma, benign keratosis, actinic keratosis, melanocytic nevi, vascular lesions, and dermatofibroma [5,6,7]. Melanoma is the most dangerous type of skin lesion and caused about 9000 deaths in the United States alone in 2017 [8]. If melanoma is diagnosed at an early stage, nearly 95% of cases can be cured; basal cell and squamous cell carcinomas in particular are highly curable [9].

Skin lesions are primarily examined manually with the naked eye, which requires magnified and illuminated images of the skin. Among several procedural techniques, the ABCD rule, the Menzies method, the 7-point checklist and the 3-point checklist are the most commonly used to detect melanoma in its early stages [10, 11]. Reports on the diagnostic accuracy of clinical dermatologists indicate about 80% for dermatologists with ten or more years of experience, whereas dermatologists with 3 to 5 years of experience reach only 62% [12]. This shows that years of experience with difficult cases play a major role in detecting skin lesions accurately. Applying machine learning techniques to dermoscopic images to classify malignant and benign lesions has become a popular task because of the ability of these techniques to detect patterns in digital images. Deep learning methods exhibit strong performance in the detection and classification of various diseases from medical images [5, 13].

Several studies have addressed the classification of skin lesions from dermoscopic images. Barata et al. [14] use global and local features for the detection of melanoma in dermoscopy images. They compare the effect of color and texture features on lesion classification and conclude that a combination of features leads to better performance. In the work of Codella et al. [15], a combination of a support vector machine (SVM), sparse coding techniques and deep learning is applied to the International Skin Imaging Collaboration (ISIC) dataset to classify dermoscopy images. Cıcero et al. [16] develop a convolutional network with transfer learning on a custom dataset of skin images to improve performance in the skin lesion classification task.

In recent years, deep learning has been applied to many medical image analysis problems. Among these, Esteva et al. [17] applied a pretrained CNN, GoogLeNet and Inception v3, for image classification. To tackle the difficulties of classifying skin lesions, Lopez et al. [18] presented a pretrained VGGNet with transfer learning and tested it on the ISIC dataset. In 2017, Krizhevsky et al. [19] applied a deep CNN to the large ImageNet LSVRC-2010 dataset, which contains 1000 different classes. Codella et al. [20] proposed an ensemble of a deep residual network and a fully convolutional network in combination with an SVM, hand-coded feature extractors and sparse coding to segment and detect melanoma cases on a dataset of the International Symposium on Biomedical Imaging (ISBI). To classify the ISBI 2017 dermoscopy image dataset into three classes, Harangi et al. [21] employ an ensemble technique that fuses the classification outputs of four different deep neural networks. Tan et al. [22] used a feature optimization technique based on Particle Swarm Optimization (PSO) to classify skin lesions as benign or malignant, with the Dermofit Image Library, PH2, and Dermnet datasets used for evaluation.

In addition to these works, Hekler et al. [23] recently implemented a deep learning network that classifies skin lesions into malignant melanoma and benign nevus and could assist humans in histopathologic melanoma diagnosis. To enhance the performance of skin lesion classification, dilated convolution is applied to four pretrained networks (VGG16, VGG19, MobileNet and InceptionV3) in [5], with transfer learning used to extract features from the images. Chaturvedi et al. [4] proposed transfer learning on a pretrained MobileNet and evaluated it on the HAM10000 dataset to classify lesions into seven classes. Pratiwi et al. [13] propose a CNN model for the detection of skin cancer from HAM10000 dermoscopy images. In the work of Khan et al. [24], an ensemble of pretrained ResNet-50 and ResNet-101 with transfer-learning-based feature extraction is employed for skin lesion classification, and the extracted features are fed to an SVM for classification. Even though several attempts have been made to classify skin lesions, the existing methods still lack generality in their classification capability and have not achieved better accuracy because of the complexity of the images themselves [4].

In this study, we propose an ensemble method that fuses two widely used pretrained deep convolutional neural networks, namely DenseNet and InceptionV3, which are pretrained on approximately 1.28 million images. According to previous works, these two networks outperform other architectures on the HAM10000 dataset [25] in most cases. We use a fine-tuning technique for feature extraction (discussed in detail in the methodology). The proposed model is trained and tested on the HAM10000 dataset [25], which consists of 10015 dermoscopy images. The rest of the paper is outlined as follows. Section 2 introduces the proposed method, including the dataset description, data pre-processing techniques, data augmentation and the proposed architecture. Section 3 presents the experimental findings. Finally, the discussion and conclusions are given in Sect. 4.

2 Proposed Methodology

In this section, we present the details of the proposed methodology, which includes the dataset used for training and evaluation, the data pre-processing and augmentation techniques, and the architecture of the proposed method.

2.1 Dataset

To train, validate and test the proposed model, we use a collection of dermatoscopic images, the Human Against Machine with 10000 training images (HAM10000) dataset, which is publicly available through the International Skin Imaging Collaboration (ISIC). The dataset contains 10015 dermatoscopic images gathered from different populations using a variety of modalities. The images are not equally distributed across lesion types: there are 6705 Melanocytic nevi (nv) images, 1113 Melanoma (mel) images, 1099 Benign keratosis (bkl) images, 514 Basal cell carcinoma (bcc) images, 327 Actinic keratosis (akiec) images, 142 Vascular (vasc) images and 115 Dermatofibroma (df) images. All images are stored at a resolution of 600 \(\times \) 450 pixels. The dataset is therefore heavily imbalanced: about two thirds of the images (6705 of 10015, or 67%) belong to a single lesion type, Melanocytic nevi. Figure 1 shows five sample images from each lesion type.

Fig. 1. Randomly selected sample images for each cancer type from the HAM10000 dataset.
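The class imbalance described in Sect. 2.1 can be quantified directly from the reported counts. The following is a minimal sketch using only the numbers given in the text; the helper names `class_shares` and `imbalance_ratio` are illustrative and not part of the original work.

```python
# Class counts as reported for HAM10000 in Sect. 2.1.
CLASS_COUNTS = {
    "nv": 6705,
    "mel": 1113,
    "bkl": 1099,
    "bcc": 514,
    "akiec": 327,
    "vasc": 142,
    "df": 115,
}


def class_shares(counts):
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Return the ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    shares = class_shares(CLASS_COUNTS)
    # Print classes from most to least frequent.
    for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{label:6s}{CLASS_COUNTS[label]:6d}  {share:6.1%}")
    print(f"largest/smallest ratio: {imbalance_ratio(CLASS_COUNTS):.1f}")
```

The largest class is roughly 58 times the size of the smallest one, which is why class-wise and weighted metrics are more informative here than accuracy alone.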

2.2 Data Pre-processing

All images used in this study have a resolution of 600 \(\times \) 450 pixels. To make the image size compatible with our models (DenseNet and InceptionV3), we downscale the images to 256 \(\times \) 192 pixels using the Keras ImageDataGenerator. The dataset is then normalized by dividing the pixel values of the images by 255.0. Finally, we divide the dataset into training (8111 images), validation (902 images) and test (1002 images) sets.
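The normalization and three-way split can be sketched as follows. The split sizes are those reported above; how the split was actually drawn is not stated, so the random (non-stratified) shuffle and the `seed` parameter here are assumptions, and `split_indices` and `normalize` are hypothetical helper names.

```python
import random


def split_indices(n_samples, n_val, n_test, seed=0):
    """Shuffle sample indices once and cut them into train/val/test lists."""
    if n_val + n_test >= n_samples:
        raise ValueError("validation and test sets leave no training data")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: plain random split
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def normalize(pixels):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]


# Sizes reported in Sect. 2.2: 8111 training, 902 validation, 1002 test.
train_idx, val_idx, test_idx = split_indices(10015, n_val=902, n_test=1002)
```

With such an imbalanced dataset, a stratified split that preserves class proportions in each subset would be a reasonable alternative to the plain shuffle used in this sketch.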

2.3 Data Augmentation

Deep learning algorithms require a large amount of data to perform well, but acquiring an adequate amount of data remains a main challenge in this area. One effective way to increase the dataset size is data augmentation, which enlarges the dataset without destroying the structure of the data. In our study, the first augmentation technique is a random rotation in the range of 0 to 60\(^{\circ }\). Another concern in image data preparation is that objects of interest may be off-centre in the image. To handle this problem, we apply width shifting and height shifting with a range of 0.2. Lastly, shear and random zoom operations are applied with a range of 0.2.
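The augmentation settings above map naturally onto keyword arguments of the Keras `ImageDataGenerator`. The sketch below only collects them in a dictionary so that it runs without Keras installed; the mapping to these particular arguments, the directory name `data/train` and the `(height, width)` ordering of `TARGET_SIZE` are assumptions, not details confirmed by the text.

```python
# Augmentation settings from Sect. 2.3, expressed as keyword arguments for
# keras.preprocessing.image.ImageDataGenerator (assumed mapping).
AUGMENTATION_KWARGS = {
    "rescale": 1.0 / 255.0,      # normalization from Sect. 2.2
    "rotation_range": 60,        # degrees; Keras samples from [-60, 60]
    "width_shift_range": 0.2,    # fraction of the image width
    "height_shift_range": 0.2,   # fraction of the image height
    "shear_range": 0.2,
    "zoom_range": 0.2,           # Keras samples the zoom from [0.8, 1.2]
}

# Keras expects (height, width); 256 x 192 is read here as width x height.
TARGET_SIZE = (192, 256)

# With TensorFlow installed, the settings could be used like this
# ("data/train" is a hypothetical directory with one sub-folder per class):
#
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   train_gen = ImageDataGenerator(**AUGMENTATION_KWARGS)
#   train_flow = train_gen.flow_from_directory(
#       "data/train", target_size=TARGET_SIZE,
#       batch_size=32, class_mode="categorical")
```

Note that `rotation_range=60` in Keras rotates in both directions, which is slightly broader than the 0 to 60 degree range stated in the text.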

2.4 Proposed Architecture

Recently, CNNs have become the state-of-the-art method for image classification because they achieve excellent performance on well-known datasets such as MNIST [26] and ImageNet [27]. There are several CNN architectures for image classification, such as AlexNet [19], GoogLeNet [28], ResNet [29], DenseNet [30], VGGNet [31], InceptionV3 [32] and others. In our work, we classify skin lesions using an ensemble method that combines two well-established deep learning architectures that have shown better accuracy in previous works, namely DenseNet and InceptionV3. These architectures are available as pretrained models that were initially trained on the ImageNet dataset, which contains around 1.28 million natural images in 1000 classes. We use the weights and biases of the pretrained models to initialize learning on our dataset, and then apply a fine-tuning technique to all the layers of the selected architectures. The details are presented in the following subsections.

The DenseNet Architecture: The DenseNet architecture proposed by Huang et al. [30] connects all layers directly to each other to optimize the flow of information between the layers. That is, each layer in the network receives information from all preceding layers and feeds its output to all subsequent layers. A concatenation operation is performed in every layer to merge the inputs from the previous layers. Equation 1 gives the output of the ith layer as a function of the feature maps of all preceding layers [33]. The connectivity pattern in each layer of a single dense block is illustrated in Fig. 2. DenseNet is one of the best-performing methods on the ImageNet dataset, where it achieves a top-1 accuracy of 0.773 and a top-5 accuracy of 0.936.

Fig. 2. A layout of a single dense block with 5 layers.

$$\begin{aligned} x_i = H_i(x_0, x_1, \ldots , x_{i-1}) \end{aligned} \tag{1}$$

Here, \(x_i\) is the output of the ith layer and \(H_i\) is the composite function that represents operations such as Batch Normalization (BN), rectified linear units (ReLU), and Convolution (Conv).
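Because every layer concatenates the outputs of all preceding layers, the number of input feature maps grows linearly inside a dense block. The sketch below illustrates this bookkeeping for Eq. 1; the function name is hypothetical, and the example values (64 input maps, 32 new maps per layer, 6 layers) are illustrative assumptions rather than settings reported in this paper.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Track feature-map counts through one dense block.

    Returns a pair: the number of input feature maps seen by each layer,
    and the number of feature maps leaving the block.
    """
    inputs_per_layer = []
    channels = in_channels
    for _ in range(num_layers):
        # Layer i receives x_0, ..., x_{i-1} concatenated along channels.
        inputs_per_layer.append(channels)
        # Its own output (growth_rate new maps) is appended for later layers.
        channels += growth_rate
    return inputs_per_layer, channels


layer_inputs, block_output = dense_block_channels(64, 32, 6)
print(layer_inputs)   # [64, 96, 128, 160, 192, 224]
print(block_output)   # 256
```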

In our work, we use the DenseNet variant named DenseNet-201, which has 4 dense blocks and 201 layers. Figure 3 shows a DenseNet architecture with three dense blocks. In each dense block, a composition layer sequentially performs BN, ReLU and a 3 \(\times \) 3 convolution. The convolution produces the output feature map from the concatenated inputs; for example, it transforms the input feature maps \(x_0, x_1, x_2\) into the output feature map \(x_3\) according to Eq. 1. Batch Normalization normalizes the input of each layer [34] in order to decrease the absolute difference between data and make the relative difference higher.

Fig. 3. A deep DenseNet with 3 dense blocks.

The other operation in the composition layer of the DenseNet architecture is the ReLU. Equation 2 describes how the ReLU operation works.

$$\begin{aligned} \mathrm{ReLU}(x)={\left\{ \begin{array}{ll} x, &{} \text {if }x>0,\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned} \tag{2}$$

The change in feature map size caused by the down-sampling layers in convolutional networks makes the concatenation operation difficult to perform. To facilitate down-sampling, the DenseNet architecture divides the whole network into several densely connected dense blocks. As shown in Fig. 3, transition layers between these dense blocks perform convolution and pooling operations. In this work, the transition layers include three operations, namely batch normalization, a 1 \(\times \) 1 convolution and a 2 \(\times \) 2 average pooling operation [30]. In addition, a bottleneck layer is placed within the dense blocks before each 3 \(\times \) 3 convolution layer. It consists of BN, ReLU and a 1 \(\times \) 1 convolution layer. The 1 \(\times \) 1 convolution makes the network computationally efficient by reducing the number of input feature maps to the 3 \(\times \) 3 convolution in the dense block. This layer makes DenseNet effective by reducing the complexity and size of the model. The main benefits of DenseNet, when compared to other methods, are presented below.

  • Only a few parameters: Since the feature maps from the preceding layers act as input to the current layer, many feature maps can be reused, so only a small number of convolution kernels needs to be learned.

  • Capability to reduce over-fitting: The dense connections in the DenseNet network build short paths from the first layers to the last layers, so the loss function provides additional guidance for each layer. Consequently, the dense connections guard against over-fitting, which makes the network a good choice for learning from small datasets.

  • Layers are deeper: Because all layers are linked directly to each other, the network can have a very deep architecture.
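The effect of the bottleneck layer on the size of the model can be illustrated by counting convolution weights with and without the 1 \(\times \) 1 convolution. This is a rough sketch that ignores biases and BN parameters; the bottleneck width of four times the growth rate follows the DenseNet paper [30] and is treated here as an assumption, as are the example channel counts.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def plain_layer_weights(c_in, growth_rate):
    """A 3x3 convolution applied directly to all concatenated inputs."""
    return conv_weights(3, c_in, growth_rate)


def bottleneck_layer_weights(c_in, growth_rate, width_factor=4):
    """A 1x1 convolution down to width_factor * growth_rate maps, then 3x3."""
    mid = width_factor * growth_rate
    return conv_weights(1, c_in, mid) + conv_weights(3, mid, growth_rate)


# Deep in a block, with many accumulated feature maps, the bottleneck helps.
print(plain_layer_weights(512, 32))       # 147456
print(bottleneck_layer_weights(512, 32))  # 102400

# Early in a block, with few input maps, it costs more than it saves.
print(plain_layer_weights(64, 32))        # 18432
print(bottleneck_layer_weights(64, 32))   # 45056
```

Under these assumptions the bottleneck reduces the weight count only once a layer receives more than about 7.2 times the growth rate in input feature maps, which is the situation in the later layers of each dense block.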

The InceptionV3 Architecture: Szegedy et al. [32] proposed InceptionV3 as an enhancement of the GoogLeNet [28] network. The major enhancement is a reduction in the number of parameters, obtained by concatenating convolutional filters of different sizes into a new filter. Consequently, the computational complexity of the model is decreased. Figure 4 illustrates the architecture of InceptionV3. This model achieves a top-5 error rate of 3.5% and a top-1 error rate of 17.3% on the ImageNet dataset.

Fig. 4. The architecture of InceptionV3.

In this network, the number of parameters is reduced by replacing convolution filters larger than 3 \(\times \) 3 (e.g. 5 \(\times \) 5 or 7 \(\times \) 7) with a sequence of 3 \(\times \) 3 convolution layers, because convolutions with large spatial filters are computationally expensive [32]. In addition, spatial factorization into asymmetric convolutions is applied, which means replacing an n \(\times \) n filter with two asymmetric filter layers, an n \(\times \) 1 filter followed by a 1 \(\times \) n filter. The InceptionV3 network has 42 layers, and the details of the network shown in Fig. 4 are presented in Table 1.

Table 1. The detailed outline of InceptionV3 architecture. The input size column also represents the output size of the previous layer.
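The two factorizations described for InceptionV3 can be compared by counting the weights per input-output channel pair. This is a small illustrative sketch; the function names are hypothetical, and the stacked variant assumes an odd filter size so that the stack of 3 \(\times \) 3 layers covers the same receptive field.

```python
def kernel_weights(height, width):
    """Weights of a single height x width filter for one channel pair."""
    return height * width


def stacked_3x3_weights(n):
    """Weights when an n x n filter is replaced by stacked 3x3 filters.

    Each 3x3 layer widens the receptive field by 2, so (n - 1) / 2 layers
    cover the same n x n region.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd integer >= 3")
    return ((n - 1) // 2) * kernel_weights(3, 3)


def asymmetric_weights(n):
    """Weights when an n x n filter is replaced by n x 1 followed by 1 x n."""
    return kernel_weights(n, 1) + kernel_weights(1, n)


for n in (5, 7):
    print(n, kernel_weights(n, n), stacked_3x3_weights(n), asymmetric_weights(n))
# 5 25 18 10
# 7 49 27 14
```

For a 5 \(\times \) 5 filter, two stacked 3 \(\times \) 3 layers use 18 instead of 25 weights per channel pair, a saving of 28%.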

Fine Tuned Ensemble: An ensemble of DNNs, which makes a decision by combining the predictions of multiple models, is a powerful technique for increasing image classification accuracy on a dataset that does not have a sufficient number of annotated images [21]. In our work, we explore an ensemble of two well-known pretrained CNNs, DenseNet and InceptionV3. First, the two networks are fine-tuned and trained on our dataset individually, and the best-performing model of each is saved. The fine-tuning technique is applied by freezing all the layers of the networks prior to the final fully connected layer. The final fully connected layer of each pretrained network is removed and replaced by a new fully connected layer with seven neurons, which equals the number of classes in the prediction task. Finally, classification is performed by fusing the saved models with an averaging technique. Averaging is an ensemble learning technique that forms the final prediction from the predictions of the individual models. It weights each model equally when computing the average and is used to reduce the variance of the final model [21]. Figure 5 presents the model architecture of the current work.

Fig. 5. The proposed ensemble architecture.
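The averaging step of the ensemble can be sketched in a few lines. The example assumes that each model outputs one probability vector per image (for instance from a softmax layer); the probability values below are made up for illustration and use three classes instead of seven for brevity.

```python
def average_ensemble(*model_probs):
    """Average per-sample class probabilities from several models.

    Each argument is a list of probability vectors, one per sample;
    every model is weighted equally.
    """
    n_models = len(model_probs)
    averaged = []
    for sample_preds in zip(*model_probs):
        averaged.append([sum(ps) / n_models for ps in zip(*sample_preds)])
    return averaged


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up outputs of two models for two images and three classes.
densenet_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
inception_probs = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]

ensemble_probs = average_ensemble(densenet_probs, inception_probs)
predictions = [argmax(p) for p in ensemble_probs]
print(predictions)  # [1, 2]
```

In this toy example the two models disagree on both images, and the averaged probabilities follow whichever model is more confident.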

3 Experimental Details

In this section, we present the details of the experimental setup, the evaluation metrics and results with a detailed discussion on the experimentation.

3.1 Experimental Setup

In the current study, we perform the classification task using two fine-tuned DNNs and then employ an ensemble technique. The proposed models are trained for 40 epochs with a batch size of 32 on our dataset. The models are evaluated using 902 sample images from the validation set and 1002 sample images from the test set. We use the Adam optimizer with a learning rate of 0.0001; the minimum learning rate is set to 0.1 \(\times \) 10\(^{-6}\). We use the top-1 accuracy, which is the standard performance measure in CNN studies [19]. First, we compare the classification accuracy obtained on the HAM10000 dataset considering

  • Fine-tuned DenseNet algorithm

  • Fine-tuned InceptionV3 algorithm and

  • An ensemble method of DenseNet and InceptionV3.

In addition, the best accuracy from our evaluation is compared with previous studies that report strong performance on the HAM10000 dataset. We implement the model for our computer-aided diagnosis (CAD) system in the Python programming language, on top of the Keras deep learning framework [35] with the TensorFlow [36] back-end. The training is performed on Google Colaboratory, which provides a single 12 GB NVIDIA Tesla K80 GPU and 12 GB of RAM.

3.2 Evaluation Metrics

The overall testing of the proposed model is performed on 1002 unseen test images. To assess the performance of the proposed model, we use several evaluation metrics, namely precision, recall, accuracy, and F1-score. Precision, recall, and F1-score are determined for each of the 7 classes. The weighted average, which is a good measure for unbalanced datasets, is also calculated for recall, precision and F1-score.
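The per-class metrics and their support-weighted averages can be computed from a confusion matrix as sketched below. The 3-class matrix is a made-up toy example, not data from this study, and the function names are illustrative.

```python
def per_class_metrics(cm):
    """Compute precision, recall, F1 and support for every class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    rows = []
    for c in range(n):
        tp = cm[c][c]
        support = sum(cm[c])                        # samples of true class c
        predicted = sum(cm[r][c] for r in range(n))  # samples predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"precision": precision, "recall": recall,
                     "f1": f1, "support": support})
    return rows


def weighted_average(rows, key):
    """Average a metric over classes, weighted by class support."""
    total = sum(r["support"] for r in rows)
    return sum(r[key] * r["support"] for r in rows) / total


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


# Toy, imbalanced 3-class confusion matrix (illustrative values only).
toy_cm = [[90, 5, 5],
          [4, 6, 0],
          [2, 0, 8]]
rows = per_class_metrics(toy_cm)
print(round(weighted_average(rows, "precision"), 4))
print(round(weighted_average(rows, "recall"), 4))
print(round(accuracy(toy_cm), 4))
```

The support-weighted recall always equals the overall accuracy, because both reduce to the number of correctly classified samples divided by the total number of samples.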

3.3 Experimental Results

Table 2. Precision, recall, and F1-score for each class due to DenseNet model.
Table 3. Precision, recall, and F1-score for each class due to Inceptionv3 model.

This part reports the evaluation results of the proposed method on the HAM10000 dataset numerically and graphically. Tables 2, 3 and 4 present the precision, recall, and F1-score for the seven classes obtained by DenseNet, InceptionV3, and the ensemble of the two models, respectively. For the individual models, DenseNet and InceptionV3, the Melanocytic nevi class, which has the largest number of test samples (678 out of 1002), scores the highest precision, recall, and F1-score. For the Melanocytic nevi class, DenseNet scores 95%, 96% and 95%, and InceptionV3 scores 94%, 95% and 95% for precision, recall, and F1-score respectively. The ensemble method scores the highest precision of 100% for Melanoma and Dermatofibroma, the highest recall of 96% for Melanocytic nevi, and a 96% F1-score for the melanoma class. Table 5 reports the accuracy, weighted precision, weighted recall and weighted F1-score of our ensemble network and the two fine-tuned networks on the test dataset. The weighted averages of precision, recall, and F1-score are 90%, 89% and 89% for DenseNet; 88%, 88% and 88% for InceptionV3; and 91%, 91% and 91% for the ensemble model. We have also computed the training-validation accuracy curves for the proposed method. The curves for the DenseNet and InceptionV3 models are shown in Fig. 6 and Fig. 7 respectively. The graphs show that the accuracy increases rapidly until the 25th epoch and converges thereafter for both models. Another evaluation tool applied in this study is the confusion matrix, which visualizes the classification performance in terms of correctly and wrongly classified test samples. The confusion matrices are presented in Figs. 8, 9 and 10 for the DenseNet, InceptionV3, and ensemble models respectively on the HAM10000 dataset with its seven classes. The diagonal of a confusion matrix from top-left to bottom-right contains the correctly classified samples, and all other cells represent wrongly classified samples. Finally, we compare the proposed method with existing state-of-the-art methods that are validated on the HAM10000 dataset, as shown in Table 6. The best result of the proposed architecture is shown in bold. The comparison in the table indicates that the proposed method performs better than the existing algorithms.
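How a confusion matrix is built and read can be sketched as follows. The labels below are made up for illustration (three classes, eight samples) and are not results from this study; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair.

    Rows correspond to true labels and columns to predicted labels.
    """
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def correct_and_wrong(cm):
    """Return (samples on the diagonal, samples off the diagonal)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct, total - correct


# Made-up labels for eight samples and three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
# [2, 1, 0]
# [0, 2, 0]
# [1, 0, 2]
print(correct_and_wrong(cm))  # (6, 2)
```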

Table 4. Precision, recall, and F1-score for each class due to Ensemble approach.
Table 5. The evaluation metrics(%) of the proposed methods.
Fig. 6. The classification accuracy curve due to DenseNet model.

Fig. 7. The classification accuracy curve due to Inceptionv3 model.

Table 6. The Comparative analysis of the proposed method with the existing methods.
Fig. 8. The confusion matrix due to DenseNet model.

Fig. 9. The confusion matrix due to Inceptionv3 model.

Fig. 10. The confusion matrix due to Ensemble model.

4 Discussion and Conclusions

The diagnosis and detection of skin cancer is a complex task, in part due to the imbalanced number of training samples. In the previous section, we presented the experimental results numerically and graphically using well-known metrics. It is evident from the experimental results (see Tables 2 and 3) that DenseNet and InceptionV3 score the highest precision, recall and F1-score on the Melanocytic nevi class, which has the largest number of samples. This class accounts for 678 of the 1002 test samples, about two thirds of the test dataset. In the ensemble model, no single class scores higher than the other classes on all the evaluation metrics, as indicated in Table 4. However, for almost all evaluation metrics and individual classes, the ensemble model achieves better performance than the DenseNet and InceptionV3 models. The classification rates of the individual classes show that almost all classes with many samples have better classification rates, whereas images from small classes are frequently misclassified.

In addition, as the multi-class classification report in Table 5 shows, the ensemble model outperforms the individual models in terms of accuracy, loss, and the weighted averages of precision, recall, and F1-score. It achieves an accuracy of 91%, a loss of 38.33% and an equal score of 91% for weighted precision, recall, and F1-score on the unseen test dataset. We have observed that combining two or more models through ensembling techniques can improve the prediction capability and generalization ability of a classification model.

The confusion matrix also gives a clear illustration by comparing the true label and the predicted label of each sample in the test set. Although most of the images are classified correctly, the high inter-class similarity and intra-class variability between images of some classes in the training data make it impossible to reach a high classification capability for every class. The comparison indicates that the proposed method achieves better performance in terms of precision, F1-score, and accuracy. In terms of recall, our model ranks second, behind Pratiwi et al. [13]. However, our model is much better than that model in terms of precision, and hence the proposed study achieves a better F1-score.

In summary, in the current work we have employed a new ensemble method for skin lesion classification using deep learning. In our proposed method, fine-tuning is applied to the DenseNet and InceptionV3 networks. These networks were pretrained on the ImageNet dataset, a large image dataset with 1000 different classes. For our task, we remove the last fully connected layer of each network and replace it with a new fully connected layer that is appropriate for our classification task on the HAM10000 dataset, which has 7 classes. After training the networks on our dataset separately, we combine their results using the averaging ensemble technique. Experiments on the test dataset yield an accuracy of 89.42%, 87.82% and 91% for the DenseNet, InceptionV3 and ensemble models respectively. From the experimental results, we observe that the fusion of the two methods performs better than the individual architectures. Moreover, the comparative study shows that the proposed method achieves better performance on most of the metrics when compared to the existing state-of-the-art methods. Although the proposed method has improved accuracy, it still needs improvement to tackle the overfitting problem and to increase the accuracy by using different regularization techniques.