1 Introduction

Skin cancer is one of the most common types of cancer that occurs due to abnormal growths in the skin cells [1]. According to the World Health Organization (WHO), skin cancer affects roughly three million individuals globally every year, resulting in thousands of deaths [2]. Regular skin examinations by expert dermatologists and awareness of changes in nevus or skin spots are essential for early diagnosis of potential skin cancer. This allows for the treatment of cancer cells before they spread to surrounding tissues. Additionally, when skin cancer is diagnosed early, the treatment process is both easier and less invasive [3].

Dermoscopy is a diagnostic technique that does not involve any invasive procedures and enables the visualization of skin lesions with higher magnification and improved clarity [4]. It is commonly used by dermatologists in the diagnosis of melanoma and other skin cancer types. However, this technique is known to be time-consuming, tiring, and prone to errors and variations in diagnosis among dermatologists [5]. Therefore, there is a demand for computer-aided diagnosis (CAD) systems to reduce diagnostic subjectivity and improve accuracy and consistency. These systems can also aid in the early detection and treatment of skin cancer by identifying early-stage skin lesions that may be missed by the naked eye [6].

The purpose of automated CAD systems is to categorize skin lesions as malignant or benign, sometimes further dividing these two classes into their respective sub-classes. However, the classification of skin lesions poses several challenges that can result in misdiagnosis. Some of these challenges include:

  1. Similar-looking lesions: Some skin lesions from different classes may have a similar appearance.

  2. Differences in skin characteristics: Different individuals may have different skin types and structures, causing skin lesions from the same class to look different.

  3. Variations in lesion stages: An early-stage lesion may have a different appearance from a later-stage lesion.

  4. Insufficient or inaccurate data: The data used for skin lesion classification may be insufficient (e.g., for a rare type of lesion) or inaccurate.

  5. Artifacts: Artifacts, including hair, skin lines, and blood vessels, may be present in dermoscopy images.

With the rapid progress of deep learning technology, it has become the preferred method for medical image analysis in computer vision [7, 8]. Compared to traditional classification methods, deep learning has exhibited greater robustness and superior generalization capability. Convolutional Neural Networks (CNNs) [9], among the most well-known deep learning models, excel at capturing spatial information and detecting local patterns, making them well suited to image analysis tasks, including skin lesion classification. However, as demands for higher performance and scalability grew, researchers explored new architectures such as Vision Transformers (ViTs) [10]. ViTs introduced self-attention, allowing models to capture global dependencies in input images, which led to remarkable improvements in image classification by addressing the aforementioned challenges of image datasets. In 2022, Liu et al. [11] introduced the ConvNeXt model, a pure CNN that modernizes the convolutional backbone with Transformer-inspired design choices, such as large-kernel depthwise convolutions, allowing it to capture local features together with broader contextual information. This architecture has been shown to surpass traditional transformers and even the successful ViT model, the Swin Transformer [12], while overcoming limitations on input size.

Ensemble methods have gained significant popularity in diverse medical image classification tasks [13, 14]. The classifiers with different architectures used in ensemble methods can capture image information at different levels, leading to more accurate decisions. To our knowledge, no existing study has classified skin lesions from dermoscopy images using ConvNeXt models; accordingly, this study is the first to utilize both individual ConvNeXt models and an ensemble learning technique for this task. The main contributions of this study are as follows:

  1. This is the pioneering study that applies ConvNeXt model architectures to dermoscopy images for the task of skin lesion classification.

  2. We conducted experiments without altering the existing structures of the ConvNeXt models (Tiny, Small, Base, Large) to enable effective transfer learning for eight-class skin lesion classification.

  3. We investigated the effect of ensemble learning, and the results demonstrated that the ensemble of different ConvNeXt models outperformed the individual models in the classification tasks.

  4. For both the individual models and the ensemble models, five-fold cross-validation and testing were performed to evaluate performance. The ensemble of all ConvNeXt models achieved an overall classification accuracy of 97.7%, surpassing both the individual models and state-of-the-art methods.

  5. To ensure the validity of this study, comparisons were made with state-of-the-art methods based on CNNs [15,16,17,18,19] and Vision Transformer (ViT) models [20]. These methods were selected because they are the most frequently compared approaches in the recent literature. Training and testing were conducted on the publicly available ISIC 2019 dataset, commonly used for skin lesion classification, allowing a fair comparison of the proposed approach against other state-of-the-art methods.

Based on our findings, this study highlights the potential of ConvNeXt models in accurately classifying skin lesions from dermoscopy images. Further research in this direction can contribute to the development of more effective and reliable automated systems for skin lesion analysis.

2 Related work

The initial studies on skin lesion classification in the literature treated lesion classification as a binary problem, where lesions were categorized as either malignant or benign. With the emergence of larger datasets [21,22,23] that include subtypes of malignant and benign lesions, recent studies have focused more on automated multi-class skin lesion classification. However, automated multi-class classification of skin lesions remains a challenging task due to the difficulties mentioned in Sect. 1 and the presence of multiple classes.

Deep learning has garnered significant attention in the field of medical image classification, including the classification of skin lesions, and extensive research has employed numerous deep learning approaches to tackle this task. Esteva et al. [24] utilized the GoogleNet Inception v3 model to train on a dataset consisting of 129,450 clinical images, encompassing 2,032 different diseases. The proposed model achieved performance comparable to that of all tested experts and demonstrated the ability of artificial intelligence to classify skin cancer at a level similar to dermatologists. Abbas and Celebi [25] proposed a new classification method named DermoDeep, which combines various visual features and deep neural network approaches to classify pigmented skin lesions. They evaluated the method on 2800 regions of interest (ROIs) and achieved an AUC of 0.96, with a sensitivity of 93% and specificity of 95%. Gessert et al. [17] proposed an ensemble of deep learning models comprising EfficientNets, SENet, and ResNeXt WSL, which were selected using a search strategy. They addressed the class imbalance issue with a loss balancing approach. The results showed that EfficientNet models performed well on the ISIC 2019 dataset. Furthermore, the automatic selection of the ensemble of SENet154 and ResNeXt models indicated that variability in network architectures yielded better results. Pacheco and Krohling [26] highlighted the potential for achieving improved performance by considering the demographic characteristics of the patient, rather than solely relying on the classification of skin lesions based on images. To this end, they proposed a new approach called MetaBlock, which uses the most relevant features and metadata. The results showed that the MetaBlock approach improved classification for all tested models. Kassem et al. [18] tested a modified GoogleNet model using a transfer learning approach on the ISIC 2019 dataset. The proposed model achieved the following classification metrics: accuracy of 94.92%, sensitivity of 79.8%, specificity of 97%, and precision of 80.36%. Molina-Molina et al. [15] presented an approach that combines deep learning features extracted from DenseNet-201 with 1D fractal signatures of texture-based features through transfer learning. The proposed method achieved an accuracy of 97.35%, sensitivity of 66.45%, and specificity of 97.85% on the ISIC 2019 dataset. Iqbal et al. [19] proposed a Deep Convolutional Neural Network (DCNN) model with fewer filters and parameters to improve efficacy and performance. The proposed model achieved an accuracy of 89.58%, sensitivity of 89.58%, and specificity of 97.57% on the ISIC 2019 dataset. Zhao et al. [16] presented a new skin lesion image classification approach based on SLA-StyleGAN, a skin-lesion-specific image augmentation method, using the DenseNet-201 architecture. Additionally, they introduced a novel loss function that aims to increase the distance between samples from different classes while reducing the distance between samples within the same class. Experimental results demonstrated that the proposed framework achieved a balanced multi-class accuracy of 93.64% on the ISIC 2019 dataset. Ayas [20] proposed the first vision transformer-based model for multi-class skin lesion image classification. The proposed Swin Transformer model achieved a sensitivity of 82.3%, specificity of 97.9%, accuracy of 97.2%, and balanced accuracy of 82.3% on the ISIC 2019 dataset.

Fig. 1 The architecture of the ConvNeXt-Tiny model. The downsample layers and ConvNeXt blocks are stacked in a 3:3:9:3 ratio across the four stages. GELU denotes the Gaussian Error Linear Unit. The output class names are abbreviated as AK: actinic keratosis, BCC: basal cell carcinoma, BKL: benign keratosis, DF: dermatofibroma, NV: melanocytic nevus, MEL: melanoma, SCC: squamous cell carcinoma, and VASC: vascular lesion

In this paper, we present the effectiveness of the ConvNeXt [11] model, which combines the strengths of CNNs and Transformers, in skin lesion classification. ConvNeXt is a pure CNN-based model proposed to match and exceed the performance of vision transformers. Unlike vision transformers, ConvNeXt does not rely on specialized modules such as shifted window attention or relative position biases, resulting in a more modern model that achieves performance, memory usage, and FLOPs (floating-point operations) comparable to the Swin Transformer [27]. To the best of the authors' knowledge, this is the first study to utilize the ConvNeXt model for multi-class skin lesion classification. The experimental results demonstrate that the proposed approach achieves better performance for both individual and ensemble models in terms of sensitivity, specificity, and accuracy metrics.

3 Methods

3.1 ConvNeXt

The ConvNeXt architecture [11], proposed by Liu et al. in 2022, aims to outperform ViTs. To achieve this goal, it modernizes the conventional ResNet design with choices inspired by attention-based classifiers. Motivated by the need to capture global dependencies and contextual information, the ConvNeXt architecture employs convolutions with large receptive fields as its fundamental building block. Additionally, as a pure CNN architecture, ConvNeXt outperforms the Swin Transformer, one of the most powerful transformer models, on the ImageNet-1K dataset [28]. The ConvNeXt architecture is shown in Fig. 1.

ConvNeXt has a structure very similar to ResNet-50, consisting of a head feature extraction layer, a middle section characterized by a bottleneck structure spanning four stages of different dimensions, and a high-dimensional feature classification layer. However, the interior of each layer and the stacking strategy have undergone several changes. First, the stacking ratio of the blocks has been revised from 3:4:6:3 to 3:3:9:3, mirroring the stage ratios of the Swin Transformer. Within each ConvNeXt block, there is a depth-wise convolution operation followed by \(1\times 1\) convolutions; the depth-wise convolution is a grouped convolution in which the number of groups equals the number of channels. Second, the bottleneck design has been inverted to the following sequence of operations: feature extraction with the depth-wise convolution, followed by dimension expansion, and finally dimension reduction. Third, the convolution kernel size has been enlarged from \(3\times 3\) to \(7\times 7\). Fourth, the Rectified Linear Unit (ReLU) activation function has been replaced with the Gaussian Error Linear Unit (GELU), and fewer activation functions are used. Finally, a notable change is the adoption of layer normalization instead of batch normalization, along with fewer normalization layers. These modifications, together with the new parameters, structures, and functions, have gradually improved the performance of ConvNeXt, even beyond ViTs such as the Swin Transformer.
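The following is a minimal PyTorch sketch of one ConvNeXt block written to illustrate the design choices listed above; it mirrors the structure of the original implementation [11] but omits details such as layer scale and stochastic depth, and the class name is ours.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block: 7x7 depthwise conv, LayerNorm,
    inverted bottleneck (expand 4x, then project back), single GELU."""
    def __init__(self, dim: int):
        super().__init__()
        # Depthwise convolution: groups == channels, large 7x7 kernel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # LayerNorm instead of BatchNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as Linear: expansion
        self.act = nn.GELU()                    # single GELU activation
        self.pwconv2 = nn.Linear(4 * dim, dim)  # 1x1 conv as Linear: reduction

    def forward(self, x):                       # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (N, H, W, C) for norm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return shortcut + x                     # residual connection
```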

Additionally, four versions of ConvNeXt are proposed, namely ConvNeXt-Tiny (T), ConvNeXt-Small (S), ConvNeXt-Base (B), and ConvNeXt-Large (L). These versions differ in the number of channels and blocks used at each stage, as shown in Table 1; a sketch of instantiating the pre-trained variants follows the table.

Table 1 The configurations of the four ConvNeXt model versions
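As an illustration, the four pre-trained variants can be instantiated via torchvision (version 0.13 or later); this is a plausible setup rather than the authors' exact code, with the classifier head replaced for the eight ISIC 2019 lesion classes.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # AK, BCC, BKL, DF, NV, MEL, SCC, VASC

def build_convnext(variant: str = "tiny") -> nn.Module:
    """Sketch: load an ImageNet-1K pre-trained ConvNeXt and swap the head."""
    builders = {
        "tiny":  (models.convnext_tiny,  models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1),
        "small": (models.convnext_small, models.ConvNeXt_Small_Weights.IMAGENET1K_V1),
        "base":  (models.convnext_base,  models.ConvNeXt_Base_Weights.IMAGENET1K_V1),
        "large": (models.convnext_large, models.ConvNeXt_Large_Weights.IMAGENET1K_V1),
    }
    builder, weights = builders[variant]
    model = builder(weights=weights)
    # torchvision's ConvNeXt classifier is (LayerNorm2d, Flatten, Linear)
    in_features = model.classifier[2].in_features
    model.classifier[2] = nn.Linear(in_features, NUM_CLASSES)  # new 8-class head
    return model
```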

3.2 The proposed ensemble of ConvNeXt classifiers

Ensemble learning is a powerful technique widely used in computer vision, where different classifiers are combined to enhance classification performance. By leveraging the diverse information captured by classifiers with different architectures, ensemble models have the potential to achieve higher accuracy compared to individual base learners. This approach is commonly employed in various medical image classification tasks [13, 14]. In this study, all versions of the ConvNeXt model (ConvNeXt-T, ConvNeXt-S, ConvNeXt-B, ConvNeXt-L) are selected as the base classifiers of the ensemble model.

Let \(x\) be an unseen test image of size \(w \times h\) pixels with \(c\) channels. To classify \(x\), we utilize the following approach. In the final decision step, each individual fine-tuned ConvNeXt classifier \(C_{i}\) in the ensemble \(C\) produces a confidence score for the input \(x\) belonging to class \(y\), as given in (1). We then select the class with the highest accumulated confidence as the label for \(x\), as given in (2).

$$\begin{aligned} P_{y}(x) = \sum _{C_{i}\in C}P_{C_{i},y}(x) \end{aligned}$$
(1)
$$\begin{aligned} C(x) = \mathop {\arg \max }\limits _{y\in Y}\ P_{y}(x) \end{aligned}$$
(2)
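A compact sketch of this decision rule in PyTorch, under the assumption that softmax outputs serve as the confidence scores \(P_{C_{i},y}(x)\) (the paper does not state the score function explicitly):

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Eqs. (1)-(2): sum per-model softmax scores, return the argmax class."""
    scores = None
    for model in models:                        # fine-tuned ConvNeXt classifiers
        model.eval()
        p = torch.softmax(model(x), dim=1)      # confidence scores P_{C_i,y}(x)
        scores = p if scores is None else scores + p   # Eq. (1)
    return scores.argmax(dim=1)                 # Eq. (2)
```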

4 Experimental setup and results

All experiments were conducted on a computer equipped with an Intel(R) Core(TM) i9-11900K 3.50 GHz CPU and an NVIDIA GeForce RTX 3080 12GB GPU. The ConvNeXt models were developed using the PyTorch deep learning library.

4.1 ISIC 2019 skin lesion classification dataset

The ISIC 2019 skin lesion dataset [21,22,23, 29, 30] is a dermatology dataset created by the International Skin Imaging Collaboration (ISIC) in 2019. It is specifically designed for skin cancer diagnosis and consists of 25,331 images belonging to 8 subcategories of benign and malignant skin lesions: actinic keratosis (AK), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanocytic nevus (NV), melanoma (MEL), squamous cell carcinoma (SCC), and vascular lesion (VASC). Figure 2 shows sample images of the dataset in each lesion category. The dataset does not provide ground truth labels for the test data. To make a fair comparison with state-of-the-art methods, we followed the same training/testing protocol presented in [20], dividing the available training data into training, validation, and test subsets with a split ratio of 70%, 10%, and 20%, respectively. We also applied 5-fold cross-validation, splitting the dataset into 5 folds while keeping the per-class image counts balanced across folds. This avoids problems such as all samples coming from one class or certain classes not being represented. Table 2 presents the number of training, validation, and test samples in each lesion category.
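For illustration, a stratified 5-fold split of this kind can be produced with scikit-learn; the function below is a sketch under the assumption that `image_paths` and `labels` are parallel sequences built from the ISIC 2019 training metadata (hypothetical inputs, not named in the paper).

```python
from sklearn.model_selection import StratifiedKFold

def make_folds(image_paths, labels, n_splits=5, seed=42):
    """Yield stratified (fold, train_idx, test_idx) triples that preserve
    per-class proportions, so rare classes (e.g., DF, VASC) appear in
    every fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (train_idx, test_idx) in enumerate(skf.split(image_paths, labels)):
        yield fold, train_idx, test_idx
```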

We employed data augmentation during training to enhance the model's generalization ability. The augmentation pipeline includes geometric transformations, such as random horizontal and vertical flips and random rotation, as well as color jitter transformations that adjust brightness, contrast, and saturation. Additionally, we resized all images to 224\(\times \)224 pixels to ensure consistent input dimensions during training.
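A sketch of such a pipeline with torchvision transforms; the rotation and jitter magnitudes, and the ImageNet normalization, are illustrative assumptions, as the paper does not report the exact values:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # uniform input size
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=30),            # assumed magnitude
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),  # (an assumption)
])
```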

Fig. 2 Some sample images of the dataset in each lesion category. AK: actinic keratosis, BCC: basal cell carcinoma, BKL: benign keratosis, DF: dermatofibroma, NV: melanocytic nevus, MEL: melanoma, SCC: squamous cell carcinoma, VASC: vascular lesion

4.2 Training details

The ISIC 2019 dataset exhibits class imbalance: the NV class contains over 12,000 images, whereas classes such as AK, DF, SCC, and VASC comprise between roughly 200 and 900 images each. Imbalanced datasets tend to bias the model towards the classes with more samples, which can increase false positives (FP) or false negatives (FN) depending on the direction of the imbalance. In this study, to mitigate the bias toward the NV class during training, a weighting scheme based on inverse class frequency is applied to the cross-entropy loss function. The weight value for each class, \(weight_{C_{i}}\), is calculated using Eq. (3).

$$\begin{aligned} weight_{C_{i}}=\frac{\sum _{j=1}^{k}N_{j}}{k\times N_{i}} \end{aligned}$$
(3)

where \(N_{i}\) denotes the number of images in the \(i\)th class and \(k\) denotes the number of classes.
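A minimal sketch of Eq. (3) feeding PyTorch's weighted cross-entropy loss; the per-class counts shown are the published totals for the full ISIC 2019 training set and would differ slightly in each cross-validation fold:

```python
import torch
import torch.nn as nn

# N_i per class in order AK, BCC, BKL, DF, NV, MEL, SCC, VASC
class_counts = torch.tensor(
    [867, 3323, 2624, 239, 12875, 4522, 628, 253], dtype=torch.float
)
# Eq. (3): weight_i = (sum_j N_j) / (k * N_i)
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```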

Table 2 The number of training, validation, and test samples in each lesion category

The training data in the ISIC 2019 dataset is not sufficient for training CNN-based architectures from scratch. Therefore, instead of training the ConvNeXt models from scratch, models pre-trained on the ImageNet-1K dataset were fine-tuned as skin lesion classifiers. During training, the AdamW optimization method was applied with a learning rate of 1e-5 and a weight decay of 1e-8. The class-weighted cross-entropy loss described above was used as the error function. The batch size was set to 8, and the number of epochs was set to 50 for all individual and ensemble models.
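A condensed sketch of this fine-tuning configuration, reusing `build_convnext` and `weights` from the earlier sketches; `train_loader` is a hypothetical DataLoader over the augmented training fold:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_convnext("tiny").to(device)                 # from earlier sketch
criterion = nn.CrossEntropyLoss(weight=weights.to(device))  # Eq. (3) weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-8)

for epoch in range(50):                                   # 50 epochs
    model.train()
    for images, targets in train_loader:                  # batch size of 8
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```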

4.3 Performance metrics

The classification performance of the models was evaluated using three widely used quantitative metrics: sensitivity, specificity, and accuracy. The study was treated as a multi-class (c) classification problem, where each test sample is assigned to one of the predefined classes \(Class_1\), \(Class_2\),..., \(Class_c\). The confusion matrix [31] is used to analyze the results of the multi-class classifier; it shows the relationship between the actual class values and the class values predicted by the classifier. The confusion matrix for the c-class problem can be expressed as a \(c\times c\) table where each cell \(x_{i,j}\), \((i = 1,..., c\) and \(j = 1,..., c)\), gives the number of instances for which the actual class is i and the predicted class is j. A binary confusion matrix is the special case with only two classes; hence, a \(c\times c\) confusion matrix can be represented as a set of c binary confusion matrices, one for each class. Table 3 represents a confusion matrix for a c-class problem.

Table 3 Confusion matrix used to calculate evaluation metrics

Sensitivity measures the ability of a classification model to correctly identify positive instances out of all actual positive instances whereas specificity measures the ability of a classification model to correctly identify negative instances out of all actual negative instances in a dataset. Accuracy measures the overall correctness of a classification model across all classes. The sensitivity, specificity, and accuracy metrics for class\(_i\) are formulated as follows:

$$\begin{aligned} Sensitivity_{\text {class}_i} = \frac{x_{ii}}{x_{ii}+\sum _{j\ne i}^{c} x_{ij}} \end{aligned}$$
(4)
$$\begin{aligned} Specificity_{\text {class}_i} = \frac{\sum _{j\ne i}^{c} \sum _{k\ne i}^{c} x_{jk}}{\sum _{j\ne i}^{c} \sum _{k\ne i}^{c} x_{jk} + \sum _{j\ne i}^{c} x_{ji}} \end{aligned}$$
(5)
$$\begin{aligned} Accuracy_{\text {class}_i} = \frac{x_{ii} + \sum _{j\ne i}^{c} \sum _{k\ne i}^{c} x_{jk}}{\sum _{j=1}^{c} \sum _{k=1}^{c} x_{jk}} \end{aligned}$$
(6)

Sensitivity, specificity, and accuracy are calculated separately for each class from the confusion matrix obtained. In this study, the class for which the classification performance is calculated was defined as the positive class, while all other classes were treated as negative. The overall classification performance was then obtained by averaging over the c classes.
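A sketch of this one-vs-rest computation with NumPy, assuming `cm[i, j]` counts samples of actual class \(i\) predicted as class \(j\), as defined above:

```python
import numpy as np

def macro_metrics(cm: np.ndarray):
    """Per-class sensitivity, specificity, accuracy per Eqs. (4)-(6),
    macro-averaged over the c classes."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp         # actual class i, predicted otherwise
    fp = cm.sum(axis=0) - tp         # other classes predicted as class i
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)     # Eq. (4)
    specificity = tn / (tn + fp)     # Eq. (5)
    accuracy = (tp + tn) / cm.sum()  # Eq. (6)
    return sensitivity.mean(), specificity.mean(), accuracy.mean()
```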

Fig. 3 Confusion matrices depicting the performance on the test set for the eight-class skin lesion classification, highlighting individual sub-versions of ConvNeXt (T: tiny, S: small, B: base, L: large) and the proposed ensemble model, with results from the fold-1 test set. The diagonal values represent the sensitivity values for each class

4.4 Results

First, the individual and ensemble performances of the four versions of the ConvNeXt model, as stated in Table 1, were analyzed for skin lesion classification. The mean and standard deviation results of each model obtained through 5-fold cross-validation are provided in Table 4. The classification performance was evaluated using the three metrics described in Sect. 4.3: accuracy, sensitivity, and specificity. The results show that increasing the model complexity in the order of Tiny, Small, Base, and Large improves the sensitivity from 80.5% to 81.3%. Furthermore, the transfer learning approach achieves an accuracy of over 96% for all ConvNeXt models. Additionally, the ensemble models increased the highest sensitivity obtained by the individual models from 81.3% to 84.2%. The proposed ConvNeXt T-S-B-L (overall) ensemble achieved the best values, with an accuracy of 97.7%, sensitivity of 84.2%, and specificity of 97.9%. Furthermore, the high average results and low standard deviations indicate that the models generally perform well and the results are consistent. These results demonstrate the effectiveness of both the individual and ensemble ConvNeXt architectures.

Table 4 Performance comparison of the individual and ensemble of ConvNeXt models

We also compared the effectiveness and robustness of the ConvNeXt models with state-of-the-art methods: Molina's method [15], Zhao's method [16], EfficientNets [17], Kassem's method [18], CSLNet [19], and Swin transformer-based models [20]. We chose these methods because they are the most frequently compared studies in the literature. For a fair comparison, we used the same configuration of the dataset as in [20]. We also applied 5-fold cross-validation to reduce the sample variability that may affect the performance of the models. Quantitatively, Table 5 summarizes the classification performance of the proposed method and the six state-of-the-art methods on the ISIC 2019 dataset. The symbol "-" denotes unreported results. The highest accuracy of 97.7% was achieved with the ensemble of all ConvNeXt models, and the corresponding sensitivity and specificity values are also high. Molina et al. [15] achieved an average sensitivity of 66.5% by using the entire dataset without performing a specific training-test split; however, they reported that the low sensitivity for classes like DF, SCC, and VASC was attributable to the limited number of images available for these classes. Zhao et al. [16] increased the sensitivity to 68.2% by incorporating various contributions. Gessert et al. [17] achieved their best sensitivity of 72.5% by using an additional dataset. Kassem et al. [18] addressed the imbalanced dataset problem and achieved a sensitivity of 79.8% by using only 191 images. Iqbal et al. [19] addressed the issue of class imbalance in the dataset and achieved an impressive sensitivity with their proposed CSLNet model; however, the classification accuracy of the model was considerably low. Ayas [20] obtained state-of-the-art classification results using different sub-versions of the Swin transformer. Table 5 further demonstrates that the proposed ConvNeXt models yield competitive results against the Swin Transformer models. These findings highlight that different models may be effective in different scenarios; in-depth analysis of such competitive results can clarify the strengths of each approach, guide future studies, and foster continued progress toward better performance.

Table 5 Performance comparison of the ConvNeXt models with state-of-the-art models

Figure 3 shows the confusion matrices obtained for the individual and ensemble models on the fold-1 test set. The diagonal values in the confusion matrices represent the ratio of correctly classified samples to the total number of samples in each class, i.e., the sensitivity for each class. As can be seen from Fig. 3, the ConvNeXt-T model achieved 68% sensitivity for the AK class on the fold-1 test set, which increased to 78% with the ensemble model. Similarly, classification of the MEL class reached 74% sensitivity with the individual ConvNeXt-T model, which also improved to 78% with the ensemble model. It is noteworthy that the individual performances of the models vary significantly across classes. For instance, the individual sensitivities of the Tiny, Small, Base, and Large models for the AK class are 68%, 76%, 73%, and 76%, respectively. The ensemble model achieves 78% sensitivity for the AK class and demonstrates higher values than the individual models for almost all other classes as well.

5 Conclusion

Automatic classification of skin lesions is a very challenging task due to various factors such as similar-looking lesions, diverse skin structures, variations in lesion stages, limited or inaccurate data, and artifacts present in dermoscopy images. In this study, we conducted an analysis and comparison of different versions of pre-trained and fine-tuned ConvNeXt models, i.e., Tiny, Small, Base, and Large, for skin lesion classification on the publicly available ISIC 2019 dataset. The true strength of our approach, however, lies in the ensemble model, which combines all four ConvNeXt models to produce more accurate results. The proposed ensemble model achieved an impressive overall classification accuracy of 97.7%, surpassing the performance of both the individual models and state-of-the-art methods. Furthermore, our proposed method yielded a sensitivity of 84.2% and a specificity of 97.9%, indicating its ability to accurately classify skin lesions from dermoscopy images. These results highlight the effectiveness of the ConvNeXt architecture and its ensemble in addressing the challenges associated with skin lesion classification. The successful application of ConvNeXt models in this study opens up possibilities for developing more robust and reliable automated systems for skin lesion analysis. Future research can explore further enhancements to the ConvNeXt architecture and ensemble learning techniques to improve the performance and generalizability of skin lesion classification systems. Ultimately, such advancements can contribute to early detection, timely treatment, and improved outcomes for patients with skin diseases.