
1 Introduction

According to the American Cancer Society, melanoma is the most dangerous type of skin cancer; its early diagnosis is essential for successful treatment and patient survival [1]. According to a study published by the Skin Cancer Foundation [3], late diagnosis of melanoma is a significant problem in many parts of the world, including Latin America, where lack of access to health services and limited awareness of skin cancer contribute to late diagnosis. In 2020, according to data from the GLOBOCAN project of the International Agency for Research on Cancer of the World Health Organization, the incidence of melanoma in Mexico was 2,051 cases with 773 deaths [4].

The diagnosis of melanoma is mainly made by visual inspection of skin lesions by highly trained dermatologists. Asymmetry, border, color, diameter, and lesion enlargement are the standard features that specialists consider. Another common way to diagnose cancer is a biopsy, a pathological examination that requires considerable time and resources to provide results. The Sierra Tarahumara is a mountain range that forms part of the Sierra Madre Occidental in Chihuahua, Mexico. This rural and remote area lacks sufficient pathologists and medical resources to diagnose and treat skin cancer. The lack of information and awareness about this type of cancer in these communities can delay care-seeking and lead to late diagnosis of the disease. This problem can seriously affect the population's health and lead to higher mortality and morbidity rates in the region. A comprehensive approach is needed to address the lack of access to pathology services, including actions to increase skin cancer awareness, improve the training of local doctors, and provide resources and technology for diagnostic testing and treatment.

Deep learning techniques, especially convolutional neural networks (CNNs), have been widely used in image recognition tasks to automatically classify specific patterns in images [11]. For skin cancer classification in particular, different CNN models have been proposed, achieving very accurate classification results [7, 16, 22].

Unfortunately, these systems have not yet been incorporated into daily clinical practice because most CNN models require Graphics Processing Units (GPUs), hardware that is uncommon in most hospitals. As an alternative to expensive hardware, TensorFlow (an open-source machine learning framework) offers a lightweight version named TensorFlow Lite (TFLite) [5]. TFLite is optimized for deploying deep learning models on mobile and embedded devices with limited computational resources. CNNs can thus be implemented on low-cost, low-power, portable, easy-to-use devices for classification and detection tasks. Training is performed on a GPU, but inference can be executed on mobile devices, a setup known as on-device inference.

This work presents a comparison of state-of-the-art CNN models for automatically classifying images as benign or malignant melanoma lesions. These models are trained and tested on two skin cancer datasets, demonstrating their robustness in different scenarios. The inference of the selected CNN model can be performed on a mobile device (on-device inference). The TFLite framework, in combination with Android Studio, allows us to convert the CNN model to a light version capable of working on low-cost, low-power devices. In this way, medical specialists with access to dermoscopy images can easily use this CNN and have the opportunity to diagnose suspicious cases early. Although this methodology has already been implemented in recent research, most proposals evaluate their models on a single dataset with few samples, achieve low performance, or perform the model inference on a server. Our proposal maintains accurate performance across two different datasets, demonstrating its robustness. We named our application SkinSight; it can be loaded on Android devices. Considering that most people have a smartphone, this tool could be used where highly specialized GPUs and/or personnel trained in cancer detection are difficult to find. It is worth mentioning that this paper aims to identify the best CNN configuration that achieves performance comparable both with state-of-the-art models trained and tested on GPUs and with those developed for portable devices.

2 Literature Review

The International Skin Imaging Collaboration (ISIC) is a global organization that maintains an online repository of dermoscopic and clinical images of skin lesions [2]. Its objective is to enable researchers from all over the world to work on the development of computer-aided systems to detect and diagnose melanoma and other skin cancers. With the advancement of computer vision algorithms based on deep learning models, different researchers have reported accurate results in classifying benign and malignant skin lesions. Cassidy et al. performed a benchmark study in [9] with images from the ISIC dataset and 19 state-of-the-art deep learning architectures. The VGG19, DenseNet121, and EfficientNetB2 architectures achieved the best area under the Receiver Operating Characteristic curve (AUC) results. Benyahia, Meftah, and Lezoray [8] also investigated the efficiency of 17 deep learning architectures and 24 machine learning classifiers using the ISIC dataset. They concluded that the DenseNet201 architecture combined with the Cubic SVM algorithm produces the best classification results.

Rehman et al. [25] use a modified pre-trained DenseNet201, stacking three convolutional layers at the end of the model, followed by global average pooling, batch normalization, and two dense layers. The authors used a contrast-stretching enhancement technique to improve the quality of the images, reporting an average accuracy of 95.5%. In [21], a ResNet101 architecture was adapted to classify benign and malignant skin cancer images. Two convolutional layers were added at the end of the model, followed by pooling and two fully connected layers. The authors reported an average accuracy of 90.67%.

All these previous works perform their training and testing on a specialized GPU, achieving state-of-the-art performance in skin lesion classification tasks. After analyzing their results in depth, we selected ResNet101, DenseNet201, and a CNN of the EfficientNet family for our experiments. The accurate results reported for these neural architectures and their reduced number of parameters make them ideal candidates for our research.

Figure 1 shows a block diagram of the process we follow to develop our SkinSight app. First, the different deep learning models are trained on TensorFlow with the appropriate datasets, and their performance is compared to select the most suitable model. Then, the selected CNN is converted to TensorFlow Lite. Next, Android Studio is set up for Android app development with the appropriate Android SDK and NDK components installed, the TensorFlow Lite dependencies are added, and the TFLite model is copied into the project. The TFLite interpreter is required to load the model in the project. A user interface is designed with the views and controls to interact with the model and display the prediction results appropriately. Then, an Android device is connected to the computer, and the app is built with Android Studio. Finally, SkinSight is tested with images to confirm that the CNN model works as required.

Fig. 1. Block diagram of deploying a CNN on a mobile device using TFLite and Android Studio.

The general methodology of training the CNN on a GPU and performing inference on a mobile device (to be used by the medical sector) has already been proposed in different research papers. In [19], a mobile app is presented that classifies skin diseases by severity based on the MobileNetV2 architecture. A dataset of 1,220 images is processed, achieving an accuracy of 94.32% in the classification task. In [14], a dataset of 2,358 images was classified as melanoma or benign using the InceptionV3 architecture. The accuracy reported by the authors is 81%. Dai et al. [10] presented an on-device inference app using 10,015 images. The accuracy achieved by the model was 75.2%. In [15], an augmented reality app is presented that classifies skin lesions to identify melanoma. The app continuously tracks the lesion, implementing different image pre-processing algorithms to remove hair and segment the lesion before analyzing the image with the CNN model. Their method achieved an accuracy of 78.8%. Kousis et al. [20] load a light version of a DenseNet169 network on a mobile Android device to classify images as benign or malignant. The DenseNet169 model achieved an accuracy of 91.10% on a dataset of 10,015 images. The authors mention that when testing their app in a real environment, it was necessary to transfer the image to a server for better performance. In [12], the MobileNetV2 architecture classifies skin lesion images considering three datasets. The overall accuracy reported when testing their proposal on a new dataset with the mobile app was 91.33%. Arani et al. [6] presented the Melanlysis app for detecting skin cancer based on the EfficientNetLite-0 architecture. The authors use only the dataset's dermoscopy images, achieving an accuracy of 94%. In [13], a lesion segmentation and classification method is presented based on a DenseNet201 model loaded on a mobile device. The classification task considers seven skin lesion classes, achieving an accuracy of 89%.

3 Methodology

3.1 Deep Learning Models

The ResNet (Residual Neural Network) architecture introduces the concept of residual or skip connections to address the vanishing gradient problem present in deep neural networks [17]. The residual blocks of the ResNet model contain convolutional and batch normalization layers and ReLU activation functions. The number of residual blocks defines the variant of the ResNet architecture. We selected ResNet101 for our experiments considering the results reported in [21].
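
For illustration, a minimal Keras sketch of the skip-connection idea follows. It shows a basic residual block rather than the exact bottleneck blocks (1x1, 3x3, 1x1 convolutions) that ResNet101 stacks, and the filter count is a placeholder:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic residual block: the input is added back to the output
    of the convolutional path through a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel count changes.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])  # the skip connection
    return layers.ReLU()(y)
```

Because gradients can flow through the identity path unchanged, very deep stacks of such blocks remain trainable.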

DenseNet, or Densely Connected Convolutional Network, uses the concept of dense blocks, in which each layer is connected to every subsequent layer within the block [18]. That is, the feature maps of all preceding layers are concatenated and passed as input to each subsequent layer within the dense block. To reduce the spatial dimensions and the number of channels between dense blocks, DenseNet defines transition layers. As with ResNet, DenseNet has different variants; in our experiments, DenseNet201 was selected according to the results in [25].
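
A minimal sketch of this dense connectivity and of a transition layer, with an illustrative layer count and growth rate rather than the exact DenseNet201 configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    """Each layer receives the concatenation of all previous feature
    maps in the block and contributes growth_rate new channels."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])  # dense connectivity
    return x

def transition_layer(x, compression=0.5):
    """Reduces the channel count and halves the spatial dimensions
    between consecutive dense blocks."""
    channels = int(x.shape[-1] * compression)
    x = layers.Conv2D(channels, 1)(x)
    return layers.AveragePooling2D(pool_size=2)(x)
```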

EfficientNet is a family of deep neural network architectures that use a neural architecture search method to uniformly scale the network's depth, width, and input image size. EfficientNetV2 [24] aims to optimize training speed and parameter efficiency. Regularization techniques are adaptively adjusted during training for different input image sizes, a mechanism the authors call progressive learning with adaptive regularization. Seven versions of EfficientNetV2 are implemented in TensorFlow. In our experiments, we selected the EfficientNetV2-S variant because it has almost the same number of parameters as DenseNet201.

To adapt these three CNN architectures to the skin cancer datasets, we consider two options. The first only adds global average pooling after the last convolutional layer of these architectures, followed by a fully connected layer. Inspired by [25], the second option adds three convolutional layers, global average pooling, and batch normalization, followed by fully connected layers with dropout. A transfer learning strategy was used to train these architectures: initially, only the extra layers were trained for ten epochs (freezing the layers of the pre-trained CNN architectures). Then, a fine-tuning stage unfreezes 20% of the CNN architecture, and a new training run is performed with a reduced learning rate.
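
As an illustration of the second option and of the two-stage training, the following Keras sketch uses DenseNet201 as the backbone. The filter counts, dense layer size, dropout rate, learning rates, stage-2 epoch count, and dataset objects (train_ds, val_ds) are assumptions for the example, not the exact values of our experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Pre-trained backbone without its original classification head.
base = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # stage 1: train only the new layers

# Second option: extra convolutions, GAP, and batch normalization,
# followed by fully connected layers with dropout.
x = layers.Conv2D(256, 3, padding="same", activation="relu")(base.output)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)  # benign vs. malignant
model = tf.keras.Model(base.input, output)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)

# Stage 2: fine-tune by unfreezing the last 20% of the backbone
# and training again with a reduced learning rate.
base.trainable = True
cutoff = int(len(base.layers) * 0.8)
for layer in base.layers[:cutoff]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```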

3.2 TensorFlow Lite (TFLite)

TensorFlow Lite (TFLite) [5] is a lightweight deep learning framework created by Google, specifically designed for deploying CNN models on mobile and embedded devices. TFLite optimizes the size and speed of the models without neglecting their performance. TFLite uses quantization methods to compress the deep learning model, representing the model parameters with fewer bits [23].
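
A minimal conversion sketch with post-training quantization; the model object and output filename are illustrative:

```python
import tensorflow as tf

# Convert the trained Keras model to the TFLite flat-buffer format.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Post-training quantization: model parameters are stored with fewer
# bits, shrinking the file and speeding up on-device inference.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("skinsight.tflite", "wb") as f:
    f.write(tflite_model)
```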

Once the model is converted to the TFLite format, the Android Studio integrated development environment (IDE) is used to load the CNN model onto the mobile device. The TFLite interpreter is in charge of running the model inference and producing the predictions. Deploying deep learning models on mobile devices is thus possible by combining TensorFlow, TFLite, and Android Studio.
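
The same interpreter API is also exposed in Python, which is convenient for verifying a converted model before embedding it in the app. A minimal sketch, assuming a 224x224x3 float32 input and the filename from the previous example:

```python
import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="skinsight.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference on one preprocessed image; shape and dtype must
# match the model's input tensor.
image = np.zeros(input_details[0]["shape"], dtype=np.float32)  # placeholder
interpreter.set_tensor(input_details[0]["index"], image)
interpreter.invoke()
probability = interpreter.get_tensor(output_details[0]["index"])[0][0]
print("malignant probability:", probability)
```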

4 Experimental Settings and Results

In our experiments, we use two datasets published on Kaggle that contain images from the ISIC challenges. Dataset one (DS1, Footnote 1) has 3,297 dermoscopic images: 1,800 are classified as benign and 1,497 as malignant. Kaggle provides a data partition where 80% of the data is separated for training and 20% for testing. In our experiments, the training data was re-partitioned into training and validation sets, for a final distribution of 60% for training, 20% for validation, and 20% for testing. The second dataset (DS2, Footnote 2) has 10,605 images; Kaggle defines 9,605 for training and 1,000 for testing. As with the previous dataset, the training data was re-partitioned to provide a validation set. The final split is 80% for training, 10% for validation, and 10% for testing.
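
As an illustration for DS1, the re-partition can be done with scikit-learn's train_test_split (the variable names are hypothetical, and stratifying by class to preserve the benign/malignant ratio is our assumption). Since the Kaggle training portion is 80% of the data, holding out a quarter of it yields the 60/20/20 split:

```python
from sklearn.model_selection import train_test_split

# kaggle_train_paths/labels hold the 80% training portion provided
# by Kaggle; 0.25 of that portion equals 20% of the full dataset.
train_paths, val_paths, train_labels, val_labels = train_test_split(
    kaggle_train_paths, kaggle_train_labels,
    test_size=0.25, stratify=kaggle_train_labels, random_state=42)
```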

The CNN models used in this work are trained on Google Colaboratory, a cloud-based platform with pre-installed libraries and dependencies; we use the TensorFlow library for training. Table 1 shows the classification accuracy of the different CNN architectures. The second column specifies whether the CNN includes the three extra convolutional layers, global average pooling, and batch normalization, followed by fully connected and dropout layers. The third column indicates the number of parameters of each CNN. The fourth and fifth columns indicate the accuracy percentage achieved by each CNN on each dataset.

Table 1. Accuracy results of the different neural architectures.

The accuracy results of the models are very similar. The best accuracy and the model with the fewest parameters are highlighted in bold. ResNet101 obtains the best classification results but is the CNN with the largest number of parameters. EfficientNetV2-S and DenseNet201 obtain comparable performance, but a reduced number of parameters is very important in our implementation because our objective is to deploy the CNN model in an Android application running on a mobile device. For this reason, we select the DenseNet201 model. Figure 2 shows the confusion matrices obtained with the DenseNet201 model on the two datasets.

Fig. 2. Confusion matrix results.

By visually inspecting the images of the datasets, we noticed that some of them are very difficult to classify as benign or malignant. Figure 3 shows some of these samples, which the DenseNet201 model classifies correctly despite their difficulty.

Fig. 3. Examples of difficult samples of the datasets.

Once the model was trained, it was converted to a light version with TFLite and loaded onto the mobile device using Android Studio. Figure 4 shows the final user interface designed for SkinSight with prediction results. SkinSight can load images from the smartphone gallery. With this option, we could select the testing images of DS1 and DS2 and confirm that the accuracy of the model is maintained in the light version, obtaining the same results reported in the confusion matrices of Fig. 2. Comparing these results with the models reported in Sect. 2, our accuracy is superior to most mobile apps. Only two of them achieved better results: the first considers only one dataset with few samples (1,220 images), and the second eliminates images not obtained with a dermoscope (the ISIC dataset includes images obtained with simple cameras, which are commonly misclassified).
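
A sketch of this kind of parity check, running the TFLite interpreter over a labeled test set (the preprocessing and variable names are illustrative; benign = 0, malignant = 1):

```python
import numpy as np
import tensorflow as tf

def tflite_accuracy(model_path, images, labels):
    """Classify every test image with the TFLite interpreter and
    compare the predictions against the ground-truth labels."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    correct = 0
    for image, label in zip(images, labels):
        batch = image[np.newaxis].astype(np.float32)  # add batch dimension
        interpreter.set_tensor(inp["index"], batch)
        interpreter.invoke()
        prediction = interpreter.get_tensor(out["index"])[0][0] > 0.5
        correct += int(prediction == label)
    return correct / len(labels)
```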

Fig. 4. Prediction results of the SkinSight app.

5 Conclusions

This paper presents the process we followed to design an Android app named SkinSight to detect melanoma automatically. First, we compared the performance of state-of-the-art CNN models trained and tested with images from two datasets of the ISIC challenge. The accuracy results obtained with EfficientNetV2-S, ResNet101, and DenseNet201 are very similar. However, considering that our objective is to develop a mobile app that medical personnel can use to diagnose suspicious cases early, we selected the CNN model with the fewest parameters. The combination of TensorFlow, TensorFlow Lite, and Android Studio offers a powerful solution for deploying deep learning models on mobile devices.

Recent models that surpass the results reported in this paper implement computationally costly pre-processing techniques to remove noise and artifacts from the images. Also, some of these publications stack more than five machine learning algorithms, yet the improvement is only 3% compared to our implementation. Considering that our SkinSight app is designed to be used by medical personnel with limited resources, we aimed for a balance between accurate classification results and a small number of model parameters. In this paper, we only tested SkinSight with images already analyzed by specialists. Because we want to bring this tool closer to the rural areas of our region, our next step is to work with local medical doctors and patients already diagnosed with this disease, testing the app in a real environment to identify how to handle different skin tones and factors not considered in the ISIC dataset.