
1 Introduction

1.1 Dermatological Asymmetry of Skin Lesions and Screening Methods

According to statistics compiled by the European Cancer Information System (ECIS) [1] and the American Cancer Society (ACS) [2], life-threatening melanoma can be completely cured if it is removed in the early stages [3, 4]. According to ECIS, the estimated risk of melanoma for 2018 varies from 38.2 new cases in Germany to 13.6 new cases in Iceland per 100,000 age- and gender-standardized population [1]. ACS reported in 2018 that the lifetime risk of developing cancer for Americans is 37.6% for females and 39.7% for males, while the melanoma risk is 1 in 42 for females and 1 in 27 for males [2]. It is therefore necessary to develop a quick and effective diagnostic method that minimizes the excision of benign lesions and increases the detection of melanoma. Dermatology experts use various screening methods such as the Three-Point Checklist of Dermoscopy (3PCLD) [5,6,7], the Seven-Point Checklist (7PCL), and the ABCD rule [9, 10]. All of them are considered effective in skin lesion assessment.

The 3PCLD methodology is based on the criterion of asymmetry in shape, hue, and structure distribution within the lesion, which takes a value of 0 for symmetry in two axes, 1 for symmetry in one axis, or 2 for asymmetry. In this method, the pigmented network and the blue-white veil are scored as either present or absent. Another screening method used in dermatology is the ABCDs of melanoma. In this rule the letters stand for asymmetry, border (not well-defined, irregular), color (more than one shade), and diameter (usually larger than 6 mm); the extended ABCDE version adds evolution (changing features over time). All of these features are characteristics of melanoma that general physicians or dermatologists check during diagnosis. Like 3PCLD, this method focuses on the asymmetry of the lesion [11,12,13], which is one of the common characteristics of skin damage that can be noticed visually. These examples show the importance of symmetry/asymmetry in various screening methods for detecting melanoma. In this paper, we present the results of applying CNNs to the problem of assessing the asymmetry of skin lesions in dermoscopic images.

Only a few publications address the symmetry/asymmetry of skin lesions using machine learning or AI methods. In [16], only shape asymmetry is discussed, and the authors tested several ML methods on the PH2 dataset. Using an SVM with a radial basis function kernel, they reported an accuracy of 95.8%, with true positive rates of 92.5% for asymmetric lesions, 95.7% for lesions symmetric in one axis, and 100% for symmetric lesions.

This paper presents the results of applying convolutional neural networks to the assessment of skin lesion asymmetry. The neural networks were based on available pre-trained networks: Xception (XN), VGG19 [14], and Inception-ResNet-v2 (IRN2). These networks provide promising results even with the relatively small but well-described PH2 dataset [15].

1.2 Dermatological Datasets

From the available databases, we have chosen the PH2 dataset [15] to conduct our research. This database consists of dermoscopic images obtained at the Dermatology Service of Hospital Pedro Hispano (Matosinhos, Portugal) under the same conditions through the Tuebinger Mole Analyzer system using a magnification of 20 times. Images in the dataset are 8-bit RGB color images with a resolution of 768 × 560 pixels.

This image database contains a total of 200 dermoscopic images of melanocytic lesions, including 80 common nevi, 80 atypical nevi, and 40 melanomas. The PH2 database includes medical annotation of all the images, namely medical segmentation of the lesion, clinical and histological diagnosis, and the assessment of several dermoscopic criteria (colors; pigment network; dots/globules; streaks; regression areas; blue-whitish veil) [8, 15, 16].

One alternative to the PH2 database is the ISIC Archive, which contains the largest publicly available collection of quality-controlled dermoscopic images of skin lesions [17]. The ISIC Archive contains over 24,000 dermoscopic images, which were collected from leading clinical centers internationally and acquired from a variety of devices within each center. However, the ISIC metadata does not provide information about the asymmetry of lesions. Other examples of dermatological datasets can be found in the Interactive Atlas of Dermoscopy and An Atlas of Surface Microscopy of Pigmented Skin Lesions [18, 19].

1.3 Pretrained Convolutional Neural Networks and Their Features

Pretrained convolutional neural networks differ in features that should be taken into account when choosing a network for a given problem. The most important characteristics are network accuracy, true positive and true negative rates, speed of classification, and size. Currently, several pretrained networks are available to choose from. The characteristics of the three chosen networks are given in Table 1. The network depth is defined as the largest number of sequential convolutional or fully connected layers on a path from the input layer to the output layer. The inputs to all networks are RGB images.

Table 1. Pretrained convolutional neural networks parameters

2 Data Preparation for the Research

2.1 Augmentation and Preparation of the Database

The PH2 database contains 117 fully symmetric, 31 symmetric in one axis, and 52 fully asymmetric images of skin lesions. To use this database in our research, we had to increase the number of images while minimizing any influence on the pixel distribution. To create new images, we used geometric transformations that do not change the asymmetry of shape, shade, and structure distribution, or other features present in both 3PCLD and 7PCL. We chose three rotations (by 90°, 180°, and 270°), mirroring about the vertical and the horizontal axis, and a 90° rotation of each mirrored image (Fig. 1). In total, we obtained seven transformations for each image that did not change the pixels, shape, or color distribution. These transformations allowed us to increase the PH2 database from 200 to 1600 images.
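The seven transformations can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation; the function names and the toy 2 × 3 grid are assumptions made for illustration, and a real image would hold (R, G, B) values in place of the integers.

```python
def rot90(img):
    """Rotate a 2-D grid of pixels (list of rows) by 90 degrees counterclockwise."""
    return [list(col) for col in zip(*img)][::-1]


def mirror_vertical_axis(img):
    """Mirror about the vertical axis (left-right flip)."""
    return [list(row[::-1]) for row in img]


def mirror_horizontal_axis(img):
    """Mirror about the horizontal axis (top-bottom flip)."""
    return [list(row) for row in img[::-1]]


def invariant_copies(img):
    """Return the original image together with its seven transformed copies."""
    r90 = rot90(img)
    r180 = rot90(r90)
    r270 = rot90(r180)
    mv = mirror_vertical_axis(img)
    mh = mirror_horizontal_axis(img)
    return {
        "original": [list(row) for row in img],
        "rot90": r90,
        "rot180": r180,
        "rot270": r270,
        "mirror_v": mv,
        "mirror_h": mh,
        "mirror_v_rot90": rot90(mv),  # 90-degree rotation after mirroring
        "mirror_h_rot90": rot90(mh),
    }


# Toy 2 x 3 "image" with distinct pixel values (hypothetical data).
toy = [[1, 2, 3],
       [4, 5, 6]]
copies = invariant_copies(toy)
```

Every copy contains exactly the same pixel values as the original, only rearranged, so the pixel distribution is unchanged while the number of images grows eightfold.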

To illustrate the idea of the authors' method, which classifies not only the original image but also its invariant copies, Table 2 lists the classification probabilities for an example image (IMD168 from the PH2 dataset) and its invariant copies, obtained with the VGG19 network trained on the PH2 images and their seven copies. The probability of the asymmetry class (column with value ‘0’) varies from 0.013 to 0.94. A similar spread of probabilities occurs for the other CNNs. It can be concluded that the same image and its invariant versions can yield opposite classification results because of the convolutional network operations.

Fig. 1. A sample of image IMD168 from the PH2 [15] dataset and its invariant augmentation.

Table 2. VGG19 CNN classification probability of the image IMD168 from the PH2 dataset.

For example, during a convolution, each of the eight images produces a different output for the same filters and weights. This follows from the convolution formula:

$$ I^{\prime}\left( {x,y} \right) = \sum\nolimits_{i = 0}^{n - 1} {\sum\nolimits_{j = 0}^{m - 1} {k\left( {i,j} \right)I\left( {x + i, y + j} \right)} } , $$

where I′ is the output, the kernel k is of size n by m, and the image I is of size N × M, with N ≥ n and M ≥ m. As shown in the final section, the classification probability of each of the invariant images can vary from 0.0 to 1.0.
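The dependence of the convolution output on image orientation can be checked with a small sketch. The code below evaluates the sum of k(i, j) · I(x + i, y + j) over a kernel with n × m entries at every valid position; the 3 × 3 image and the edge-detecting kernel are hypothetical values chosen for illustration, not data from the study.

```python
def correlate_valid(image, kernel):
    """Evaluate sum_i sum_j k(i, j) * I(x + i, y + j) at every valid (x, y)."""
    n, m = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(kernel[i][j] * image[x + i][y + j]
                for i in range(n) for j in range(m))
            for y in range(cols - m + 1)
        ]
        for x in range(rows - n + 1)
    ]


def rot90(img):
    """Rotate a 2-D grid by 90 degrees counterclockwise."""
    return [list(col) for col in zip(*img)][::-1]


# Hypothetical image with a bright bottom row, and a kernel that
# responds to horizontal edges only.
image = [[0, 0, 0],
         [0, 0, 0],
         [9, 9, 9]]
kernel = [[1, 1],
          [-1, -1]]

response_original = correlate_valid(image, kernel)        # strong response at the edge
response_rotated = correlate_valid(rot90(image), kernel)  # edge is now vertical
```

With a fixed kernel, the horizontal edge produces a strong response in the original image and none after rotation, which is why a trained network can assign different probabilities to the invariant copies of one image.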

The next step in preparing the database was to scale the images to the input sizes required by the selected networks (Table 1). First, we scaled the shorter dimension of the images (in our case, the height) to the input size, e.g. 224 px (see Table 1), using the Bicubic Sharper algorithm in Photoshop. Then, all images were cropped to a square shape. As a result, we obtained a set of images scaled to the sizes required by each of the networks, e.g. 224 × 224 px, see Table 1. The dataset prepared in this way contains 936 fully symmetric, 248 symmetric in one axis, and 416 fully asymmetric images of skin lesions; it meets the requirements of our research and could be used for the network tests.
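The geometry of this step can be sketched as follows. The helper name is hypothetical, and the centered crop is an assumption, since the text only states that the images were cropped to a square; the interpolation itself (Bicubic Sharper in Photoshop) is not reproduced here.

```python
def scale_and_crop_box(width, height, target):
    """Scale the shorter side to `target`, then return a centered square crop box.

    Returns ((scaled_width, scaled_height), (left, top, right, bottom)).
    The centered crop is an assumption made for this sketch.
    """
    short = min(width, height)
    new_w = round(width * target / short)
    new_h = round(height * target / short)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# PH2 images are 768 x 560 px; 224 px is the example input size from the text.
scaled_size, crop_box = scale_and_crop_box(768, 560, 224)
```

For a PH2 image this gives a scaled size of 307 × 224 px and a crop that removes roughly 41 px from each side of the width.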

2.2 CNN Network Setting and Configuration

We used pre-trained networks in our research because they have already been trained on the ImageNet database [20]. Such networks use their previously learned ability to extract informative features from natural images as a starting point for learning a new task. Since the last three layers of each pre-trained network are configured to classify 1000 classes, we kept all layers except the last three and replaced those three so that the networks would classify images into 3 classes. Following 3PCLD, the networks classified the images as symmetrical, symmetrical in one axis, or asymmetrical.

To achieve the highest classification rates, we conducted initial experiments testing the following parameter ranges for all three networks:

  • 30–60 epochs;

  • learning rate from 1e-4 to 1e-2.

After the initial experiments we chose:

  • VGG19 – learning rate 1e-4 and 40 epochs;

  • XN – learning rate 5e-4 and 30 epochs;

  • IRN2 – learning rate 5e-4 and 30 epochs.

The training time depends on the network, the number of training images, and the machine specification. On machine 1 (Set 1 in Sect. 2.3), training took approximately:

  • 18 min for VGG19;

  • 30 min for XN;

  • 60 min for IRN2.

2.3 Hardware Description

To ensure the credibility of the results, the research was conducted independently on two computers with the same operating system (Microsoft Windows 10 Pro) and different configurations:

  • Set 1. Processor: Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz (12 CPUs), Memory: 64 GB RAM, Graphics Card: NVIDIA GTX 1080Ti with 11 GB of Graphics RAM.

  • Set 2. Processor: Intel(R) Core(TM) i7-9700K CPU @ 3.60 GHz (8 CPUs), Memory: 16 GB RAM, Graphics Card: NVIDIA GeForce RTX 2070 with 8 GB of Graphics RAM.

On both machines, the research was conducted using Matlab 2019b with the up-to-date version of Deep Learning Toolbox™ (v. 12). Deep Learning Toolbox supports transfer learning with pretrained deep network models, see Table 1. The second hardware set was used to check whether the classification results depend on the hardware. The different configurations affected only the execution time of training the CNNs. On both machines, the calculated average accuracy, as well as its maximum and minimum, were close to each other, which indicates that the procedure is not hardware-dependent.

3 Research Method Description

The first step in our research method is database preparation. Two image sets were provided to each selected network: training and testing. Both sets were created by dividing the augmented PH2 dataset in the proportions 75% for training and validation and 25% for testing. The division was carried out so that each original image and its copies were in the same set. It was repeated 4 times so that each image was included in the test set. The training, validation, and testing cases were the same for the three chosen networks, although the image sizes were different. This allows us to assess and compare the results more thoroughly. In addition, to check whether enlarging the database with image copies obtained by rotation and mirroring gives better results, the tests were also carried out on the original PH2 database, which was divided into training and testing sets in the same way, see Table 3. All steps were repeated for the different image sizes required by the different networks, see Table 1.
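The division described above can be sketched as a grouped 4-fold split, in which an original image and all of its copies always land in the same set. This is only an illustration under stated assumptions: the image identifiers, the random shuffling, and the function name are hypothetical, and the further split of the 75% part into training and validation is not specified in the text and is omitted.

```python
import random


def grouped_splits(image_ids, n_folds=4, n_variants=8, seed=0):
    """Return n_folds (train, test) pairs of (image_id, variant_index) tuples.

    Each original image and its copies are kept in the same set, and every
    image appears in exactly one test set.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # shuffling is an assumption of this sketch
    folds = [ids[k::n_folds] for k in range(n_folds)]

    def expand(group):
        # Variant 0 stands for the original image, 1..7 for its copies.
        return [(image_id, v) for image_id in group for v in range(n_variants)]

    splits = []
    for k, test_ids in enumerate(folds):
        train_ids = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((expand(train_ids), expand(test_ids)))
    return splits


# 200 hypothetical identifiers standing in for the PH2 images.
ids = [f"img{i:03d}" for i in range(200)]
splits = grouped_splits(ids)
```

With 200 originals and eight variants each, every split holds 1200 images for training and validation and 400 images for testing.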

Table 3. The number of images in the original and augmented PH2 sets.

The networks were tested 5 times on each pair of training, validation, and testing sets. The resulting networks were saved for future testing and analysis of the results. For each CNN mentioned in Table 1, parameters such as the accuracy and the true positive rate were defined and calculated according to Eqs. (1)–(6). Next, their average values, together with the variance, minimum, and maximum, were calculated over the twenty-five runs (5 rounds × 5). Correct classification plus overestimation, i.e. Accuracy + Type I Error, was considered best for the purpose of our research: it is better for a screening method to overestimate the diagnosis than to underestimate it (Type II Error, false negative), because the final diagnosis of malignant melanoma is made after histopathological examination.

The confusion matrix parameters are defined as follows:

$$ ACC = \left( {TP + TN} \right)/N \tag{1} $$
$$ TPR = TP/\left( {TP + FN} \right) \tag{2} $$
$$ w.ACC = (TPR_{0} + TPR_{1} + TPR_{2} )/3 \tag{3} $$
$$ FPR = FP/\left( {FP + TN} \right) \tag{4} $$
$$ F1 = 2TP/\left( {2TP + FP + FN} \right) \tag{5} $$
$$ MCC = \left( {TP \cdot TN - FP \cdot FN} \right)/\sqrt {\left( {TP + FP} \right)\left( {TP + FN} \right)\left( {TN + FP} \right)\left( {TN + FN} \right)} \tag{6} $$

where:

  • N – the number of all cases;

  • true positive, TP – the number of positive cases correctly classified as positive;

  • true negative, TN – the number of negative cases correctly classified as negative;

  • false positive, FP – the number of negative cases wrongly classified as positive, also called Type I error;

  • false negative, FN – the number of positive cases wrongly classified as negative, also called Type II error;

  • accuracy, ACC; weighted accuracy, w.ACC;

  • true positive rate, TPR, also called Recall; TPRi stands for the true positive rate for the symmetry values i = 0, 1 and 2;

  • false positive rate, FPR; FPRi stands for the false positive rate for the symmetry values i = 0, 1 and 2;

  • F1 score, F1;

  • Matthews correlation coefficient, MCC.

In Table 5, the weighted F1 and MCC are calculated in the same way as the weighted accuracy, Eq. (3).
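As an illustration of how these quantities can be obtained for a three-class problem, the sketch below derives one-vs-rest counts from a confusion matrix. It assumes that rows hold the true class and columns the predicted class, that every class occurs at least once, and it uses the standard false positive rate FP/(FP + TN); the matrix values are invented for the example and are not results from this study.

```python
import math


def one_vs_rest(cm, c):
    """Return (TP, TN, FP, FN) for class c; rows = true class, columns = predicted."""
    n = sum(sum(row) for row in cm)
    tp = cm[c][c]
    fn = sum(cm[c]) - tp
    fp = sum(row[c] for row in cm) - tp
    tn = n - tp - fn - fp
    return tp, tn, fp, fn


def class_metrics(cm, c):
    """TPR, FPR, F1 and MCC of class c treated as the positive class."""
    tp, tn, fp, fn = one_vs_rest(cm, c)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def weighted_accuracy(cm):
    """Mean of the per-class true positive rates, as in Eq. (3)."""
    return sum(class_metrics(cm, c)["TPR"] for c in range(len(cm))) / len(cm)


# Invented 3 x 3 confusion matrix for 50 test cases (illustration only).
cm = [[8, 1, 1],
      [2, 5, 3],
      [1, 2, 27]]
m0 = class_metrics(cm, 0)
w_acc = weighted_accuracy(cm)
```

Because the weighted accuracy averages the per-class true positive rates, a small class such as the lesions symmetric in one axis influences the score as much as the largest class.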

4 Results

The research method described above allowed us to obtain 60 neural networks. Results from those networks were recorded and analyzed. The results were analyzed in three ways:

  1. T1 - networks tested on a subset of original images;

  2. T8 - networks tested on the original set and its seven copies;

  3. IDA - networks tested on the original set and its seven copies but in the worst-case scenario, i.e. if one of the 8 copies of the images has been recognized as asymmetric, all its copies have been classified as asymmetrical.

The advantage of the IDA procedure is an increased true positive rate (TPR) for the positive cases, i.e. the asymmetric ones. According to 3PCLD and the ABCD rule, asymmetric lesions are more likely to be malignant. On the other hand, this procedure increases the false positive rate (FPR) (see Table 4), which can be considered its biggest disadvantage. However, it finds more melanoma cases than the T1 or T8 methods. The IDA procedure is also used for blue-white veil classification by CNNs in [21].
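The worst-case rule can be written as a short sketch. The text defines only the asymmetric case; the fallback used here when no copy is predicted as asymmetric (the most frequent label among the copies) is an assumption of this example, as are the label strings and the function name.

```python
from collections import Counter

ASYMMETRIC = "asymmetric"


def ida_decision(copy_predictions):
    """Combine the predictions for an image and its seven copies.

    If any copy is predicted as asymmetric, the image is labeled asymmetric.
    Otherwise the most frequent label is returned (an assumption of this
    sketch; ties go to the label seen first).
    """
    if ASYMMETRIC in copy_predictions:
        return ASYMMETRIC
    return Counter(copy_predictions).most_common(1)[0][0]


# Hypothetical predictions for the eight variants of two images.
flagged = ida_decision(["symmetric"] * 7 + [ASYMMETRIC])
not_flagged = ida_decision(["symmetric"] * 5 + ["one_axis"] * 3)
```

A single asymmetric vote is enough to flag the lesion, which raises the true positive rate for asymmetry at the cost of more false positives.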

When comparing the networks, we also took into account classification characteristics such as weighted accuracy (w.ACC), F1 score, and Matthews correlation coefficient (MCC), see Table 5. Within each network, the results were similar regardless of the method used (T1, T8, IDA). However, the results differed between networks. The best results for these characteristics were obtained by the Xception (XN) network, with an accuracy score of 78.9%.

Table 4. The classification results for the asymmetry. The chosen confusion matrix factors are the true positive rate for full asymmetry (TPR0), for symmetry in one axis (TPR1), and for full symmetry (TPR2), and the false positive rate for full asymmetry (FPR0), for symmetry in one axis (FPR1), and for full symmetry (FPR2), with their average (AVG), variance (VAR), minimum (MIN) and maximum (MAX) values for the chosen CNN network.

Additionally, the area under the curve (AUC) was used to compare the networks. VGG19 turned out to be the best network, with an AUC of 0.9652. Figure 2 shows the best receiver operating characteristic (ROC) curves, i.e. those with the highest AUC.

From our research we have chosen the best CNN networks:

  • VGG19 - true positive rate for the asymmetry 84.62%, weighted accuracy 68.29%, F1 score 0.682 and Matthews correlation coefficient 0.581;

  • Xception - true positive rate for the asymmetry 92.31%, weighted accuracy 67.41%, F1 score 0.646 and Matthews correlation coefficient 0.533;

  • Inception-ResNet-v2 - true positive rate for the asymmetry 53.85%, weighted accuracy 51.57%, F1 score 0.528 and Matthews correlation coefficient 0.295.

Table 5. The classification results for the asymmetry. The chosen confusion matrix factors are the weighted accuracy (w.ACC) with its average (AVG), variance (VAR), minimum (MIN) and maximum (MAX) values, the weighted F1 score (w.F1), and the weighted Matthews correlation coefficient (w.MCC) for the chosen CNN network.

Fig. 2. The examples of the best receiver operating characteristic curve (ROC) with the highest value of the area under curve (AUC) for the three chosen CNNs.

5 Conclusions

Asymmetry plays an important role in the assessment of skin lesions, which is evident in dermatological diagnostic methods such as The Three-Point Checklist of Dermoscopy (3PCLD). Melanoma diagnosis based on asymmetry can also be made with the use of properly trained CNN networks. Such networks can serve as a helpful tool in the preliminary diagnosis of dangerous skin lesions.

In our research, we used three pretrained networks (Xception, VGG19, Inception-ResNet-v2) and trained them on our enlarged PH2 database. The method we developed (using different forms of augmentation) proved in many cases more effective than training the networks only on the original images, with results up to 20% higher. In the studies we achieved a maximum of 68.56% weighted accuracy, 92.31% true positive rate, and 66% false positive rate, with F1 = 0.74, MCC = 0.58, and AUC = 0.97.

In our related research [21, 22] in the field of dermatological image processing, the Invariant Dataset of Augmentation is used with the PH2 [15] and the Atlas of Dermoscopy (Derm7pt) [8, 18] datasets to increase classification measures such as the true positive rate, F1 score, and MCC obtained with CNNs, in comparison to the feature-based methods used in [12, 16, 23].