1 Introduction

The skin is the largest organ of the human body and accounts for 10% of the whole body. Because of its composition, facial skin is thinner than the skin of the rest of the body and contains fewer cells, which makes it more sensitive and more easily damaged. It is also the part of the body most exposed to solar radiation, and it is more sensitive to hormonal changes. All these factors make the skin of the face susceptible to several diseases. Facial skin problems are widespread and affect infants, children, adolescents, adults, and seniors [1]. Because these problems are easily visible, they can undermine a person's confidence. Temporary or permanent, benign or cancerous, genetic or acquired, facial skin diseases have different causes and symptoms. Some share similar signs, which makes diagnosis difficult and sometimes wrong. For this reason, the use of AI in dermatology is growing rapidly, especially to improve and accelerate diagnosis and thus reduce errors. Consequently, much research has been conducted to find the methods that predict diseases most accurately.

In the state of the art, the authors of [2] worked on the detection of acne and non-acne areas using spatial–temporal features. They used a supervised learning method to extract features from the image [2]. Khongsuwan et al. (2011) suggested an approach for acne detection using UV fluorescence images, considering that the bacteria in acne react to UV light. They applied the H-maxima transform to the image to find regional maxima and segment the acne. They obtained satisfactory efficiency, but the long exposure of the tested skin to UV light can damage it [3]. Later, Hamayun et al. [4] located acne using a template matching technique with an N-mean kernel. Their method is simple and fast but has some drawbacks in terms of performance [4]. Chang et al. [5] worked on an automated approach for detecting and recognizing facial skin defects using a support vector machine-based classifier.
Their method locates the face in the input image and then extracts the region of interest (ROI) to detect potential defects. These defects are then classified as normal skin, spots, or acne. Their method was efficient and achieved high accuracy [5]. A pixel-based method is proposed in [6]. The authors used RGB images to detect acne and used the Mahalanobis distance (MD) and Bayes’ method for segmentation. The advantage of this approach is that it preserves the pixel elements of the input [6]. Chantharaphaichit et al. [7] investigated an automated method based on detecting the circular shape of acne lesions and counting them. For this purpose, they used blob detection. They also used a Bayesian classifier after feature extraction to minimize misclassification. This method is efficient but is affected by many conditions, such as the shape of the acne and the lighting [7]. In 2017, Natchapol Kittigul proposed an automatic acne detection and quantification method. The acne is detected by speeded-up robust features (SURF) and classified using the K-nearest neighbors algorithm. This method achieves an accuracy of 73% [8]. In 2018, Amini et al. [9] developed a facial skin analysis mobile application to detect acne lesions and classify them into papules or pustules. The acquired frontal face image is calibrated and normalized, and the regions of interest are then specified to detect and classify acne. This method achieves an accuracy of 92% for the identification of acne and 98% for the classification into papules or pustules [9]. The same year, Shen et al. [10] worked on an automated technique for detecting facial acne vulgaris based on a CNN. They used a binary classifier to detect the ROI and then a seven-class classifier to assign the acne to one of the six types of the disease or to healthy skin [10]. Al-masni et al. [11] implemented a diagnostic system for multiple skin lesions that performs segmentation and classification using deep convolutional networks.
The skin lesion boundaries are first segmented from the images using a full resolution convolutional network and are then passed to convolutional neural network classifiers (Inception-v3, ResNet-50, Inception-ResNet-v2, and DenseNet-201) to be classified. The authors assessed their method on three datasets, ISIC 2016, 2017, and 2018, which include skin images of two, three, and seven diseases, respectively. The experiments show that the Inception-v3, ResNet-50, Inception-ResNet-v2, and DenseNet-201 classifiers predict the diseases with accuracies of 77.04%, 79.95%, 81.79%, and 81.27% for the two classes (benign and melanoma) of ISIC 2016; 81.29%, 81.57%, 81.34%, and 73.44% for the three classes (benign, seborrheic keratosis, and melanoma) of ISIC 2017; and 88.05%, 89.28%, 87.74%, and 88.70% for the seven classes (benign, seborrheic keratosis, basal cell carcinoma, actinic keratosis, dermatofibroma, vascular lesion, and melanoma) of ISIC 2018, respectively [11]. Evgin Goceri [12] developed an automated deep learning-based technique to classify dermatological disorders from color digital images into five classes: acne vulgaris, psoriasis, hemangioma, seborrheic dermatitis, and rosacea. The method consists of two phases: automated detection and extraction of lesions via a fully automated, updated extension of the automated detection of facial disorders (ADFD) technique, and classification of the lesions using a pre-trained DenseNet201 model. The proposed technique achieves a classification accuracy of 95.4% [12].

In this paper, the identification process is achieved through an adapted FSDNet. Many works have shown that CNNs are efficient for image classification. For this reason, we propose a CNN-based network that we call the facial skin diseases network (FSDNet). It is a fine-tuned VGG-16 model adapted to the identification of facial skin disorders. We modify the structure of the fully connected layer.
In particular, the proposed method can identify eight facial skin diseases. In addition, it can distinguish normal skin and no-face images.

Compared to the state of the art, this work makes two main contributions: (1) the identification process does not require extracting the ROI from face images, since the system is trained regardless of face pose, illumination, image resolution, etc., and (2) the number of detected pathologies is greater than that reported in the literature. Finally, high accuracy is reached.

2 Materials and methods

The proposed method consists of four steps: gathering and labeling facial skin disease images, image preprocessing, training the network, and finally identifying the diseases. The general flowchart of our approach is shown in Fig. 1. In the following subsections, we introduce CNNs and then describe each part of our method in detail, including the image preprocessing techniques and the proposed network, FSDNet: its architecture, its training, and the identification of the facial skin diseases.

Fig. 1 Block diagram of the suggested method. The images are first collected and labeled. They then go through preprocessing steps, namely resizing to fit the network and data augmentation to increase the size of the dataset, and are fed to the network, which is trained to finally identify the appropriate class

Fig. 2 General architecture of a CNN. It is composed of an input layer, a hidden layer made up of several convolutional and max pooling layers, and a fully connected layer, followed by a softmax layer

2.1 Preprocessing

Before the images are fed to the network, they undergo two preprocessing steps. First, the images are resized to \(224 \times 224\) so that they all have the same size and fit the network. Second, since deep learning models require large amounts of data, we use data augmentation to enlarge our dataset and increase the robustness of our model. Since our contribution is to identify the facial skin disease regardless of the face pose, the data are augmented by rotation and horizontal flip. The images are rotated by different angles (\(5^{\circ }\), \(10^{\circ }\), \(45^{\circ }\), \(90^{\circ }\), and \(270^{\circ }\)).
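The two augmentation operations can be illustrated on a tiny grayscale image stored as nested lists. This is a minimal sketch rather than the pipeline used in this work: the helper names `hflip` and `rot90` are hypothetical, and only quarter-turn rotations (\(90^{\circ }\) and \(270^{\circ }\)) are shown, because angles such as \(5^{\circ }\) or \(45^{\circ }\) require interpolation that an image library would normally provide.

```python
def hflip(img):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in img]


def rot90(img, times=1):
    """Rotate an image counterclockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        # Transpose, then reverse the row order: one counterclockwise quarter turn.
        img = [list(col) for col in zip(*img)][::-1]
    return img


# Toy 2x3 "image"; the pixel values are arbitrary.
image = [[1, 2, 3],
         [4, 5, 6]]

# Original, horizontal flip, 90-degree rotation, 270-degree rotation.
augmented = [image, hflip(image), rot90(image, 1), rot90(image, 3)]
```

Each source image yields several variants in this way, which is how a rotation-and-flip scheme multiplies the size of a dataset.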

2.2 Convolutional Neural Network

Convolutional neural networks (CNNs) are a class of deep neural networks specialized and widely used for image recognition [13]. They take 2D images as input. A CNN is composed of an input layer, an output layer, and a hidden part consisting of multiple convolutional layers, pooling layers, and fully connected (FC) layers (Fig. 2). The convolutional layer is the core of a CNN. It is based on the convolution operation and is formed of a series of kernel filters.

The most commonly used activation function is the rectified linear unit (ReLU). The aim of the convolution is to extract high-level features. The convolution between the input and the filters produces many feature maps, which are passed to the pooling layer as input. The pooling layer reduces the spatial size of the feature maps and hence helps to control overfitting. There are many pooling functions, but max pooling is the most popular one. The softmax layer, an FC layer, is the last layer in the network. It uses the softmax activation function and is responsible for making predictions [14]. There are many CNN architectures, including LeNet, AlexNet, VGGNet [15], GoogLeNet [16], and ResNet, which are used for many purposes such as object detection, segmentation, image captioning, image recognition, and image classification.
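The building blocks described above can be sketched in a few lines of plain Python. The example below is illustrative only: the function names, the \(5 \times 5\) input, and the \(2 \times 2\) kernel are assumptions chosen for readability, and a real CNN learns its kernel values and applies many kernels per layer.

```python
import math


def conv2d(img, kernel):
    """Valid 2D cross-correlation (what CNN layers call convolution), stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Rectified linear unit: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete windows at the border are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary toy input and kernel (not taken from the dataset).
image = [[1, 0, 2, 1, 0],
         [0, 1, 3, 1, 1],
         [2, 1, 0, 0, 1],
         [1, 0, 1, 2, 0],
         [0, 2, 1, 1, 3]]
edge_kernel = [[1, -1],
               [1, -1]]

# Convolution (5x5 -> 4x4), ReLU, then 2x2 max pooling (4x4 -> 2x2).
feature_map = max_pool(relu(conv2d(image, edge_kernel)))
probs = softmax([2.0, 1.0, 0.1])
```

The pooling step halves each spatial dimension of the feature map, which is the size reduction mentioned above.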

2.3 Facial skin diseases network

CNNs are well suited to feature learning and object classification because they extract features automatically from the images and learn them during training, which has made them widely used in image classification. In our approach, for identification purposes, we propose a fine-tuned VGG-16 model, which we call FSDNet (facial skin diseases network), with a neuronal architecture adapted to the identification of facial skin diseases.

Fine-tuning is a process that allows pre-trained networks to be used to identify classes they were not originally trained on. It consists of updating the architecture of the network by removing the FC layers at the end of the network and replacing them with new ones or with new types of layers, and then retraining the network to learn the new classes.

The pre-trained VGG-16 is organized in five blocks. Each block is a stack of convolutional layers, each followed by a rectified linear unit, and ends with a max pooling layer. The network ends with three FC layers, followed by a softmax layer and an output layer [15].

As mentioned previously, FSDNet is a fine-tuned version of VGG-16. First, the fully connected layers are removed from the initial model. We replace them with a global average pooling layer and a dropout layer, followed by a softmax classifier and an output layer. The architecture of FSDNet is presented in Fig. 3. The global average pooling layer speeds up the training and preserves the requisite features. The dropout layer, with a rate of 0.5, is used to prevent overfitting. The output layer is modified to a ten-dimensional output vector to match the number of predicted classes.
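The replacement head can be sketched in plain Python to show what each layer computes. This is an illustration under stated assumptions, not the trained network: the function names are hypothetical, the feature maps and weights are random toy values, only 4 channels are used for readability, and the dropout shown is the inverted variant that rescales the kept values.

```python
import math
import random


def global_average_pool(feature_maps):
    """Reduce each channel (a 2D map) to its mean, giving one value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def dropout(values, rate=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)  # dropout is inactive at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


def dense_softmax(values, weights, biases):
    """Fully connected layer followed by softmax; one weight row per class."""
    logits = [sum(w * v for w, v in zip(row, values)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


NUM_CLASSES = 10  # eight diseases, normal skin, and no-face
rng = random.Random(0)

# Toy stand-in for the output of the last convolutional block: 4 channels of 2x2.
feature_maps = [[[rng.random() for _ in range(2)] for _ in range(2)]
                for _ in range(4)]
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

pooled = global_average_pool(feature_maps)
probs = dense_softmax(dropout(pooled, training=False), weights, biases)
```

Because global average pooling keeps one value per channel, the classifier that follows needs far fewer weights than a fully connected layer applied to the flattened feature maps, which is consistent with the faster training noted above.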

The first four blocks are frozen so that their weights cannot be updated. We start training our network from the fifth block using a very small learning rate. The Adam optimizer [17] is used with a learning rate of 0.0001 to train and optimize the network.
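The freezing rule can be expressed as a simple selection over layer names. The sketch below assumes the layer naming convention of the Keras VGG16 model (`block1_conv1` through `block5_pool`); the helper `is_trainable` and the constant names are hypothetical.

```python
# Convolutional base of VGG-16: 13 convolutional and 5 pooling layers in five blocks.
# Names follow the Keras VGG16 naming convention (an assumption of this sketch).
VGG16_CONV_BASE = [
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
]

FROZEN_BLOCKS = {"block1", "block2", "block3", "block4"}
LEARNING_RATE = 1e-4  # value stated in the text for the Adam optimizer


def is_trainable(layer_name):
    """A layer is updated during fine-tuning only if its block is not frozen."""
    return layer_name.split("_")[0] not in FROZEN_BLOCKS


trainable_layers = [name for name in VGG16_CONV_BASE if is_trainable(name)]
```

In Keras, the same decision would typically be applied by setting the `trainable` attribute of each layer before compiling the model. Pooling layers have no weights, so only the three convolutional layers of the fifth block and the new classification head are actually updated.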

For training and identification purposes, we randomly split the images of our dataset into training and validation sets. Five split cases are considered. The network is trained for 10 epochs with a batch size of 16 images.

Fig. 3 Fine-tuning VGG-16 workflow. We keep the five convolutional blocks of VGG-16. We replace the fully connected layers of the main model with a global average pooling layer and a dropout layer. The output layer is modified to a ten-dimensional output vector with a softmax classifier

3 Results

The algorithm was implemented using Keras [17] with TensorFlow as backend. The system ran on an Intel® Core™ i7-7700HQ CPU at 2.8 GHz with a GTX 1060 GPU. The average training and validation times in each data split case are presented in Table 1.

Table 1 Training and validation times in the five data split cases

3.1 Database

Because no standard public dataset exists for the diseases identified in our approach, we created a database of balanced classes and labeled images collected from different sources to train and validate our network. It contains images of various resolutions showing males and females of different ages, in many face poses and under different illumination conditions.

It is initially composed of 2000 properly annotated images belonging to ten classes: eight facial skin pathologies (acne, actinic keratosis, angioedema, blepharitis, eczema, melasma, rosacea, and vitiligo) that are among the most widespread and affect all ages, genders, and races; a normal skin class; and a “no-face” class containing miscellaneous images (animals, things, food, etc.) (Fig. 4). Each class contains 200 images of 200 different persons. Since deep learning models require large amounts of data, the dataset is then augmented to 20000 images by rotating and flipping the initial images. The dataset is divided randomly into training and validation sets. The model was tested using another dataset composed of 20 images from each class, obtained from Dermweb [18].

Fig. 4 Images from our dataset referring to the ten classes: eight facial skin diseases (acne, actinic keratosis, angioedema, blepharitis, eczema, melasma, rosacea, and vitiligo), normal skin class, and no-face class including different images such as animals, cups, coins, and phones

3.2 Performance evaluation

The performance of FSDNet is studied in five split cases. Initially, we randomly assign 90% of the dataset to training and 10% to validation; we then use training-to-validation ratios of 80:20, 70:30, 60:40, and 50:50. To evaluate the classification results, we calculate several classification metrics: accuracy, precision, recall, and F1-score. These evaluation metrics are computed for the five split cases and shown in Table 2.
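The five split cases can be reproduced with a simple shuffled split. This is a minimal sketch: the function name `split_dataset`, the fixed seed, and the use of integer image identifiers are assumptions, and the shuffle is not stratified by class.

```python
import random


def split_dataset(items, train_fraction, seed=0):
    """Shuffle the items and cut them into training and validation subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


image_ids = range(20000)  # stand-in for the 20000 augmented images
for train_pct in (90, 80, 70, 60, 50):
    train, val = split_dataset(image_ids, train_pct / 100)
    print(f"{train_pct}:{100 - train_pct} -> "
          f"{len(train)} training, {len(val)} validation")
```

With 20000 images, the 90:10 case gives 18000 training and 2000 validation images, and the 50:50 case gives 10000 of each.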

The accuracy is the ratio of correct predictions to the total number of samples.

$$\begin{aligned} \ Accuracy = \frac{TP+TN}{TP+TN+FP+FN}. \end{aligned}$$
(1)

The precision is the proportion of correctly predicted positive samples among all samples predicted as positive.

$$\begin{aligned} \ Precision = \frac{TP}{TP+FP}. \end{aligned}$$
(2)

The recall, which reflects the ability of the model to detect positive samples, is the proportion of correctly predicted positive samples among all positive samples.

$$\begin{aligned} \ Recall = \frac{TP}{TP+FN}. \end{aligned}$$
(3)

The F1-score is the harmonic mean of precision and recall; it summarizes both in a single value and thus indicates the robustness and precision of the classifier.

$$\begin{aligned} \ F1 = \frac{TP}{TP+ \frac{1}{2}(FP+FN)} \end{aligned}$$
(4)

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
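Equations (1) to (4) translate directly into code. In the sketch below, the function name and the counts are illustrative assumptions and are not results from Table 2.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score as defined in Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))
    return accuracy, precision, recall, f1


# Made-up counts for one class, for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=90, tn=880, fp=10, fn=20)

# Eq. (4) is equivalent to the harmonic mean of precision and recall.
harmonic_mean = 2 * prec * rec / (prec + rec)
```

For these counts the accuracy is 0.97 and the precision is 0.90, while the recall (about 0.82) pulls the F1-score down to about 0.86.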

The identification performance is also measured by computing the confusion matrices for the five split cases, shown in Fig. 5.
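A confusion matrix of this kind, with true labels on the rows and predicted labels on the columns, can be built and read as follows. The function names and the three-class toy labels are assumptions used only for illustration.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_counts(matrix, k):
    """Return (TP, TN, FP, FN) for class k, treating it as the positive class."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # rest of row k: missed samples of class k
    fp = sum(row[k] for row in matrix) - tp  # rest of column k: wrongly predicted as k
    tn = sum(map(sum, matrix)) - tp - fn - fp
    return tp, tn, fp, fn


# Toy labels for three classes (not taken from the dataset).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
counts_class0 = per_class_counts(cm, 0)
```

The diagonal holds the correctly predicted samples of each class, and the per-class counts feed directly into the precision, recall, and F1-score formulas.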

The classification report presented in Table 2 and the confusion matrices show that our network gives good results in all cases. The highest values are obtained for the 90:10 split, because in this case the network is trained on more data, which improves the validation accuracy.

To test the efficiency of our approach, images from outside the dataset, with different resolutions, poses, and illumination, were given to FSDNet. As shown in Fig. 6, all the images are correctly identified with an accuracy of 100%, which shows that our model is robust and accurate. We study the effect of illumination, distance, and pose on the performance of our system in Figs. 7 and 8.

Table 2 Classification report

Fig. 5 Confusion matrices in the five data split cases. The vertical axis of the confusion matrix is the true label referring to the ten classes (acne, actinic keratosis, angioedema, blepharitis, eczema, melasma, rosacea, vitiligo, normal skin, and no face), and the horizontal axis is the predicted label. The diagonal of the matrix, in dark color, represents the number of correctly predicted samples in each class, and all the other numbers represent the wrongly predicted samples in each class

3.3 Comparative study

In Table 3, we compare our proposed model to the techniques mentioned in the state of the art. While the identification performance of our method is comparable to that of some previously proposed methods, our approach can identify eight facial skin diseases, whereas the other methods are restricted to the identification of acne. We can classify the images into ten classes (eight facial skin diseases, a normal class, and a no-face class), whereas the other methods handle at most seven classes, including types of acne and healthy skin. To train and evaluate our method, we created our own dataset composed of 20000 images referring to the different identified diseases, whereas the other methods use a small number of images.

Fig. 6 Test of FSDNet with images from outside the dataset. We give the network images referring to many classes such as acne, melasma, angioedema, eczema, vitiligo, and normal. All the images are correctly identified with high accuracy varying between 99% and 100%

Fig. 7 Images of eczema in four different brightness levels. The images are correctly identified but with different accuracies that decrease when the image becomes darker. (a) Original image, accuracy of 99.9%. (b) The brightness is modified with a factor of 0.8 and accuracy 99.1%. (c) The brightness is modified with a factor of 0.6 and accuracy 98.5%. (d) The brightness is modified with a factor of 0.5 and accuracy 95.1%

Fig. 8 Images of melasma in four different poses. They are all correctly predicted with accuracy (a) 98%, (b) 99%, (c) 90.7%, and (d) 93.6%

4 Discussion

In this paper, we investigated a new method for identifying facial skin diseases, intended to help dermatologists, through an adapted CNN-based network, FSDNet, trained on a dataset that we created. We developed an algorithm to identify eight facial skin diseases, normal skin, and no-face classes from images of different resolutions, poses, and illuminations.

The main contributions are, first, identifying the diseases without extraction of ROI from face images and, second, detecting more potential facial skin diseases compared to the ones in the literature with high accuracy.

Table 3 Comparison of our approach with other methods

We suggested a CNN-based network, FSDNet, which is a fine-tuned version of the VGG-16 model in which the neuronal architecture of the FC layer is modified. Because no standard public dataset exists for these diseases, we trained and validated our model with a dataset that we created, which includes 20000 labeled images referring to the ten classes identified in our approach. We evaluated the performance of our model by computing classification metrics: accuracy, precision, recall, and F1-score.

These metrics were studied in five split cases of the dataset. The data were divided randomly between training and validation. Finally, to test the robustness of the model's identification, we fed the network with images from outside the dataset. We investigated the effects of face pose, illumination, and distance from the camera on the accuracy of identification.

Table 2 presents the classification report for the five split cases. The metrics differ slightly, but the highest values are obtained for the 90:10 split because the network is trained on more data than in the other cases. The values of the four metrics are between 95% and 97%, which indicates that our model has high potential and that our classifier is robust and precise.

Figure 5 presents the confusion matrices of the dataset identification results in the five studied split cases. These matrices show that, despite some false predictions, the model correctly identified nearly 96% of the images in all split cases, which corroborates the results obtained in the classification report. We can observe that the network is well trained for all classes.

In Fig. 7, we study the effect of illumination on the accuracy of identification. We change the brightness of the images with different factors. One can observe that the network could identify the proper disease but with lower accuracy when the images become darker. The results are shown in Table 4.
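A brightness change by a factor can be sketched as a multiplicative scaling of the pixel values. The text does not state how the brightness was modified, so the scaling below, the helper name `adjust_brightness`, and the 8-bit toy image are assumptions made for illustration.

```python
def adjust_brightness(img, factor):
    """Scale 8-bit pixel values by `factor` and clip the result to [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


# Toy 2x2 grayscale image; the pixel values are arbitrary.
image = [[200, 120],
         [60, 255]]

# The four brightness levels studied in Fig. 7 (1.0 is the original image).
variants = {factor: adjust_brightness(image, factor)
            for factor in (1.0, 0.8, 0.6, 0.5)}
```

Factors below 1 darken the image; feeding each variant to the network and recording the predicted class probability gives the kind of comparison reported for eczema.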

Figure 8 shows the effect of the face pose on the performance of our system. The same image in four different poses is fed to the network. In the four cases, the image is identified as melasma with high accuracy, which shows that the model can perform the prediction regardless of the face pose. The performance of the system is, however, affected by the distance from the camera. We found that the accuracy of identification decreases when the distance is above 35 cm.

The model can be integrated into an acquisition system that acquires images from patients and directly determines their class. Beyond disease identification, the system could be extended to measure the severity of the disease.

Table 4 Study of brightness effect on eczema prediction accuracy

5 Conclusions

This paper investigates a diagnostic aid for facial skin disease classification based on deep CNNs. The aim of this research is to present a simple method that identifies more diseases than previously proposed methods, regardless of face pose, illumination, and resolution, and without extracting a region of interest from the images. The pathologies are classified by FSDNet, a fine-tuned version of VGG-16 that we adapted to facial skin disease classification. We collected a dataset of 20000 images to train and validate our model. Our experiments show that FSDNet achieves an accuracy of 97% and successfully identifies the class of the test facial skin images with an accuracy of 100%.