1 Introduction

Facial expressions are among the most important means by which humans convey emotions and intentions without saying a word. As a form of nonverbal communication they are especially important for the deaf community, since sign languages use facial expressions to convey both grammatical functions and emotions. Recognizing emotion from facial expressions has therefore become a well-researched area. In psychology, Ekman and Friesen (1971) identified six universal emotions (happiness, sadness, disgust, fear, surprise, and anger), each associated with a distinctive facial expression, and these expressions can be recognized automatically by computer vision algorithms.

Recently, automatic facial emotion recognition (AFER) has drawn the attention of the research community because of its numerous applications in various fields, including medicine (Jin et al., 2020; Leo et al., 2020), security systems (Yin et al., 2017), and education. In education, facial emotion recognition can be used to monitor students’ engagement in the classroom. Lasri et al. (2019) proposed a CNN architecture to recognize students’ facial emotions in a classroom environment. ELLaban et al. (2017) also proposed a real-time system for students’ facial expression recognition in the e-learning environment. Thomas and Jayagopi (2017) classified the level of students’ engagement by analyzing behavioral cues from their facial expressions, head movements, and gaze behavior. Aslan et al. (2019) presented a real-time student engagement system that helps instructors improve their classroom practices.

One of the main research questions in educational data mining (EDM) is how engaged deaf and hard-of-hearing students are in the classroom. Engagement is significant because it affects teaching and student achievement in different learning environments, such as classrooms and massive open online courses (MOOCs). A lack of engagement among deaf and hard-of-hearing students can have several causes: students who rely on lip-reading may not have time to process the preceding material when teachers do not pause before moving to a new topic; teachers who do not know sign language may have difficulty communicating with deaf or hard-of-hearing students; and some universities are unable to provide their deaf or hard-of-hearing students with assistive technology. To improve the effectiveness of the learning process, teachers can keep track of the engagement level of each student.

Deaf and hard-of-hearing students’ engagement can be evaluated using questionnaires or automated systems based on eye movement and facial emotion recognition (FER), supporting better teaching pedagogy and learning experiences. The major task of FER is to extract facial expression features from facial images and to recognize different facial expressions with a trained classifier. Classical FER techniques consist of three main steps: image preprocessing, feature extraction, and emotion recognition. In the preprocessing step, the face region is detected and cropped from the input image; noise reduction, scaling, resizing, and normalization are then applied to the face image. Feature extraction from the processed image is a significant stage that consists of finding various spatial and temporal features of the facial components. Finally, the input image is classified from the extracted features using either traditional machine learning (ML) methods or deep learning (DL) methods. Traditional ML methods detect the face region, extract hand-crafted features, and then classify the image using those features, whereas DL methods, especially convolutional neural networks (CNNs) and deep convolutional neural networks (DCNNs), perform the FER task by combining feature extraction and classification in a single end-to-end process. Other deep learning approaches rely on pre-trained DCNN networks, such as VGG-16 (Simonyan et al., 2015), VGG-19, Inception v3 (Szegedy et al., 2015), Xception (Chollet et al., 2017), ResNet-50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), DenseNet-169, and MobileNet (Howard et al., 2017), which shorten the long training process by reusing pre-trained weights.

To the best of our knowledge, no studies have been conducted to date on detecting deaf and hard-of-hearing students’ engagement using machine learning or deep learning algorithms. In the present study, we attempt to address this challenge for the first time by proposing an automatic system that evaluates deaf and hard-of-hearing students’ engagement from their facial expressions based on a deep convolutional neural network (DCNN) and transfer learning (TL). The facial images of students are obtained using a camera in the classroom. This system can help teachers observe the reaction of deaf or hard-of-hearing students to a particular topic during a lecture, adjust the teaching methodology according to students’ comprehension, and identify deaf or hard-of-hearing students who are not engaged and need academic support sessions, in order to improve classroom management and save time and resources. An ImageNet pre-trained VGG-16 model was employed in the proposed FER model by replacing its upper layers with new dense layers, dropout layers, and batch normalization layers. Next, we fine-tuned the VGG-16 model on two facial image datasets: the Japanese Female Facial Expression (JAFFE) dataset (Lyons et al., 1998) and the Karolinska Directed Emotional Faces (KDEF) dataset (Calvo & Lundqvist, 2008). Then, the performance of the proposed model is evaluated and compared to seven different pre-trained DCNN models (VGG-19, Inception v3, DenseNet-121, DenseNet-169, MobileNet, ResNet-50, Xception) on the JAFFE and KDEF datasets.

The overall objectives of this study can be outlined as follows:

  • We propose a novel system that detects the engagement of deaf and hard-of-hearing students from their facial emotions based on deep convolutional neural networks (DCNN) and transfer learning (TL).

  • We evaluate the performance of the facial emotion model by testing different model optimizers and several popular pre-trained DCNN models on the JAFFE and KDEF datasets.

  • Our system provides real-time result visualization to monitor deaf and hard-of-hearing students’ engagement.

The rest of this article is organized as follows: Section 2 presents the related work. Section 3 gives an overview of CNNs, the VGG-16 model, and transfer learning, followed by a description of the proposed method. Section 4 discusses the experimental results. Finally, Section 5 presents the conclusion and future extensions of our work.

2 Related work

2.1 Facial expression recognition

Several techniques applying machine learning and deep learning methods to analyze human facial expressions have been developed over the last few decades. Earlier works on facial emotion recognition relied on traditional machine learning methods such as support vector machines (SVM), k-nearest neighbors (KNN), and neural networks (NN) with different feature extraction techniques. Lee et al. (2012) used the contourlet transform (CT) for feature extraction and a regularized discriminant analysis-based boosting algorithm (RDAB) for classification; their approach was evaluated on the JAFFE dataset. Liew and Yairi (2015) examined five feature descriptors, including Gabor, Haar, local binary patterns (LBP), histogram of oriented gradients (HOG), and binary robust independent elementary features (BRIEF), with several classification algorithms such as SVM, KNN, linear discriminant analysis (LDA), and adaptive boosting (AdaBoost) on the extended Cohn-Kanade (CK+) (Lucey et al., 2010), Multimedia Understanding Group (MUG) (Aifanti et al., 2010), JAFFE, and frontal-image KDEF datasets. The authors identified HOG as the best feature descriptor and SVM as the best classifier. HOG and SVM were also used by Eng et al. (2019), who employed the JAFFE and the whole KDEF dataset to evaluate their method. Holder and Tapamo (2017) used the Scharr gradient operator, dimensionality reduction, and facial component extraction to improve gradient local ternary patterns (GLTP) for feature extraction, and then used SVM for classification on the CK+ and JAFFE datasets.

With the great success of deep learning for image classification, many researchers have employed deep learning-based methods for facial emotion recognition (FER). Zhao et al. (2015) used a deep belief network (DBN) for feature learning and extraction from facial expression images, followed by a multi-layer perceptron (MLP) for emotion classification on the JAFFE and CK+ datasets. Liu et al. (2014) proposed a boosted deep belief network (BDBN) framework that combines feature learning, feature selection, and emotion classification; the framework was evaluated on the CK+ and JAFFE datasets. Sari et al. (2021) proposed a standard CNN architecture with two convolutional-pooling layers for facial emotion recognition on the CK+, JAFFE, and KDEF datasets. Hamester et al. (2015) considered a multi-channel convolutional neural network (MCCNN) architecture evaluated on the JAFFE dataset: the first channel is a standard CNN, while the second uses pre-trained parameters obtained by a convolutional autoencoder (CAE) that learns Gabor-like filters; the two channels are merged in a fully connected layer followed by a logistic regression classifier. Jain et al. (2018) employed a hybrid CNN-RNN approach for facial emotion recognition, in which a CNN with six convolutional layers and two fully connected layers was used for feature extraction and an RNN classified facial emotions on the JAFFE dataset. However, to the best of our knowledge, no studies have been conducted to date on analyzing the facial expressions of deaf and hard-of-hearing students using machine learning or deep learning algorithms.

2.2 Engagement detection

Measuring students’ engagement in the classroom is a major concern for teachers, as it positively affects the quality of education and learning. Numerous methods have been applied to detect student engagement from educational data, facial emotion recognition, and head and eye movement. Ayouni et al. (2021) proposed a system that predicts students’ engagement levels (actively engaged, passively engaged, and not engaged) using a support vector machine (SVM), an artificial neural network, and a decision tree applied to recorded student activities; the system can alert the instructor via course messages or e-mail when a student’s engagement level is low. Shen et al. (2022) developed a framework for assessing students’ engagement level (great, not bad, and not so well) from their facial expressions in the e-learning environment, using an attentional convolutional network for facial expression recognition.

3 Methodology

To provide a better understanding of the proposed methodology, the following subsections give an overview of convolutional neural networks (CNN), the VGG-16 model, and transfer learning (TL), followed by the proposed approach for detecting the engagement of deaf and hard-of-hearing students from their facial expressions.

3.1 Overview of convolutional neural network (CNN), deep CNN models, and transfer learning (TL)

3.1.1 Convolutional neural network (CNN)

A convolutional neural network (CNN) is a deep learning architecture most commonly applied to recognizing visual patterns in images, requiring minimal pre-processing compared to traditional image classification algorithms. The generic CNN architecture, shown in Fig. 1, includes several types of layers: convolution layers, activation layers, pooling layers, and fully connected layers.

Fig. 1 Generic architecture of a convolutional neural network (CNN)

The convolution layer derives its name from the convolution operation, which preserves the spatial relationship between pixels by learning image features from small squares of input data. It uses a kernel that slides over the input image and computes a dot product with the overlapping local region, aggregating the results into a feature map. Equation (1) below gives the convolution of a 2D image h with a 2D kernel x:

$$\begin{array}{@{}rcl@{}} y[m,n]&=&h[m,n] * x[m,n] \\ &=& \sum\limits_{j=-\infty}^{\infty} \sum\limits_{i=-\infty}^{\infty} h[i,j]x[m-i,n-j] \end{array}$$
(1)

where m and n are the row and column indices of the output feature map.
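For illustration, the discrete convolution of Eq. (1) can be reproduced with a few lines of NumPy/SciPy; the toy image and kernel below are arbitrary examples chosen for this sketch, not values used in this work.

```python
import numpy as np
from scipy.signal import convolve2d

# Toy 5x5 grayscale "image" h and a 3x3 kernel x, illustrating
# the discrete 2-D convolution of Eq. (1).
h = np.array([[0, 0, 0, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0]], dtype=float)

x = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=float)

# mode='same' keeps the output the same size as h; convolve2d flips the
# kernel, which matches the x[m-i, n-j] indexing of Eq. (1).
y = convolve2d(h, x, mode='same')
print(y)
```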

Each convolutional layer is followed by a nonlinear activation layer, where the rectified linear unit (ReLU) function, defined in (2), is the most widely used.

$$\begin{array}{@{}rcl@{}} R(x)&=&\max(0,x)= \begin{cases} x, & x \geq 0\\ 0, & x < 0 \end{cases} \end{array}$$
(2)

Then, the pooling layer performs a downsampling operation on the feature maps obtained after the ReLU activation to reduce their dimensions while keeping the important information. Common pooling operations include max pooling, sum pooling, and average pooling. Finally, the pooled feature maps are flattened into a single column, and a fully connected layer is used to classify the image.
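As a concrete illustration of this generic pipeline, the following minimal Keras sketch stacks convolution with ReLU activation, max pooling, flattening, and fully connected layers; the filter counts and input size are illustrative choices, not the architecture proposed in this paper.

```python
from tensorflow.keras import layers, models

# Minimal sketch of the generic CNN of Fig. 1: convolution + ReLU,
# max pooling, flatten, and a fully connected classifier.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                  input_shape=(224, 224, 3)),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(7, activation='softmax'),  # one unit per emotion class
])
model.summary()
```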

3.1.2 Visual geometry group 16 (VGG-16)

With the development of deep learning, deep convolutional neural networks (DCNNs) are constructed by extending the basic CNN architecture in depth. DCNNs are considered powerful deep learning algorithms for computer vision tasks, as they allow automatic feature extraction from large datasets and can achieve better performance than standard neural networks. VGG-16 is a popular convolutional neural network architecture proposed by Simonyan et al. (2015) from Oxford University. To date, it is considered one of the best vision model architectures and was among the top-performing models in the ILSVRC 2014 (ImageNet) competition. The VGG-16 architecture, shown in Fig. 2, consists of 13 convolutional layers with 3×3 filters and stride 1, five max-pooling layers with stride 2, two fully connected layers with 4096 channels each, a third fully connected layer with 1000 channels, and a final softmax layer.

Fig. 2 VGG-16 network architecture

3.1.3 Transfer learning (TL)

Transfer learning (TL) is a machine learning method in which a model pre-trained on one task is reused on a new problem. The original trained model usually needs good generalization ability to adapt to unseen data. With transfer learning, training does not need to start from scratch for every new task: a model pre-trained on a huge dataset such as ImageNet is reused, and its learned weights can be employed for another task. Training deep convolutional neural network models from scratch may take weeks on very large datasets because they have many parameters to tune. Transfer learning is therefore very useful for this problem, as it saves training time and resources, which is the main motivation behind this research.
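In Keras, this reuse of pre-trained weights amounts to loading a model trained on ImageNet and freezing its convolutional base, as in the sketch below; this is a generic example of the transfer-learning pattern, not the authors' exact code.

```python
from tensorflow.keras.applications import VGG16

# Transfer-learning starting point: reuse the convolutional base of
# VGG-16 pre-trained on ImageNet instead of training from scratch.
# include_top=False drops the original 1000-class ImageNet classifier.
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))

# Freeze the pre-trained weights so that only new task-specific layers
# (added later) are updated during training.
base_model.trainable = False
base_model.summary()
```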

3.2 Proposed system architecture for emotion recognition and engagement detection

In this subsection, we introduce a novel system for real-time engagement detection from the facial expressions of deaf and hard-of-hearing students using a deep convolutional neural network (DCNN) and transfer learning (TL), as illustrated in Fig. 3. The system can be used in the classroom environment to help teachers understand the attention and engagement of deaf and hard-of-hearing students with the learning material. The students’ images, captured by a web camera, are automatically analysed by the system to evaluate their state of concentration from their facial expressions.

Fig. 3 Overall architecture of the proposed system for emotion recognition and engagement detection of deaf and hard-of-hearing students

3.2.1 Facial emotion recognition

In the facial emotion recognition phase, the pre-trained VGG-16 model described in Section 3.1.2 is used with transfer learning to identify the dominant emotion expressed by each deaf or hard-of-hearing student’s face at each moment. Fine-tuning is a commonly used transfer learning technique in which the weights of the pre-trained model serve as the initialization for a new task. It is much faster than training the whole model from scratch and can help reduce the risk of overfitting. There are three common ways to fine-tune a model: train the entire model, freeze some layers and train the others, or freeze the entire convolutional base.

First, we froze the convolutional blocks of the pre-trained VGG-16 model so that their weights are not updated during training. Then, we replaced the last dense layers with new dense layers to classify a facial image into one of seven basic emotion classes (fear, anger, happiness, surprise, sadness, neutral, and disgust). The channel sizes of the new dense layers are 1000 and 256, respectively. A rectified linear activation function (ReLU) follows each added fully connected layer, together with batch normalization layers and dropout layers with a probability of p = 0.5 to avoid overfitting. The output layer uses the softmax activation function, and categorical cross-entropy is used as the loss function for multi-class classification, as shown in (3) and (4).

$$\begin{array}{@{}rcl@{}} \sigma(\vec{z})_{i}&=&\frac{e^{z_{i}}}{\sum\limits_{j=1}^{C}e^{z_{j}}} \end{array}$$
(3)
$$\begin{array}{@{}rcl@{}} CE&=&-{\sum\limits_{i}^{C}}t_{i}\log{(f(s)_{i})} \end{array}$$
(4)

where C represents the number of different classes, the subscript i denotes the ith class, ti is the truth label, f(s)i is the softmax probability for the ith class, and zi is the predicted score for the ith class.
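The following Keras sketch reflects our reading of the fine-tuned architecture described above (frozen VGG-16 convolutional base, new dense layers of 1000 and 256 units with ReLU, batch normalization, dropout of 0.5, and a 7-way softmax output); the flattening step and the exact layer ordering are assumptions, not the authors' released code.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen ImageNet-pretrained convolutional base of VGG-16.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False  # conv blocks are not updated during fine-tuning

# New classification head: Dense(1000) and Dense(256) with ReLU, each
# followed by batch normalization and dropout (p = 0.5), then a
# 7-way softmax for the seven basic emotions.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(1000, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(7, activation='softmax'),
])
```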

Stochastic gradient descent (SGD) with Nesterov momentum, defined in (5), was used as the model optimizer with a learning rate of 0.01 and a momentum of 0.9 to obtain better performance and faster convergence. We set the batch size to 16 and the number of epochs to 150. Figure 4 summarizes the fine-tuned VGG-16 model, and Table 1 details the model’s hyperparameters.

$$\begin{array}{@{}rcl@{}} v_{t}&=&\gamma v_{t-1}+\eta\nabla_{\theta} J(\theta-\gamma v_{t-1}) \\ \theta&=&\theta-v_{t} \end{array}$$
(5)

where vt is the velocity, θ denotes the model parameters, η is the learning rate, and γ is the momentum coefficient, usually set to 0.9.
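Continuing the sketch above, the training configuration reported here can be expressed in Keras as follows; x_train, y_train, x_val, and y_val are placeholders for the pre-processed images and one-hot emotion labels described in Section 4, and the code is a hedged sketch rather than the exact training script.

```python
from tensorflow.keras.optimizers import SGD

# SGD with Nesterov momentum: learning rate 0.01, momentum 0.9 (Eq. (5)),
# categorical cross-entropy loss, batch size 16, 150 epochs.
optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,          # placeholder training data
                    validation_data=(x_val, y_val),
                    batch_size=16,
                    epochs=150)
```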

Fig. 4 Model summary of the proposed VGG-16 fine-tuning

Table 1 VGG-16 model’s hyperparameters

3.2.2 Engagement detection

The recognized facial emotions are used to detect the concentration level of deaf and hard-of-hearing students. The resulting concentration index (CI) is defined as the product of the dominant emotion probability (DEP) and the corresponding emotion weight (EW), as shown in (6).

$$CI = DEP \times EW$$
(6)

The emotion weight is a value between 0 and 1 that quantifies the degree of concentration associated with a facial emotion at a given time, as shown in Table 2.

Table 2 Emotion weight

Based on the obtained concentration index, the deaf or hard-of-hearing student’s level of engagement (highly engaged, nominally engaged, or not engaged) is determined according to Table 3. Teachers can use this information to adjust the lesson accordingly, and they can also receive a real-time engagement report for each deaf or hard-of-hearing student, which helps them assess how well the class is following the material.

Table 3 Engagement detection from concentration index (CI)
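The sketch below illustrates how Eq. (6) and the Table 3 mapping can be implemented in practice. Since the exact emotion weights and CI thresholds appear only in Tables 2 and 3, the numeric values used here are hypothetical placeholders, not the values adopted in this work.

```python
# Illustrative emotion weights in [0, 1] (placeholders for Table 2).
EMOTION_WEIGHTS = {
    'neutral': 0.9, 'happiness': 0.6, 'surprise': 0.6,
    'sadness': 0.3, 'fear': 0.3, 'anger': 0.25, 'disgust': 0.2,
}

def engagement_level(dominant_emotion, dominant_prob,
                     high_threshold=0.5, low_threshold=0.2):
    """Map the dominant emotion probability (DEP) to an engagement label."""
    ci = dominant_prob * EMOTION_WEIGHTS[dominant_emotion]  # Eq. (6)
    if ci >= high_threshold:        # thresholds are assumed, see Table 3
        return 'highly engaged', ci
    if ci >= low_threshold:
        return 'nominally engaged', ci
    return 'not engaged', ci

print(engagement_level('happiness', 0.92))
```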

4 Experimental results and discussion

This section evaluates the effectiveness of the proposed FER approach on two datasets. First, the datasets and the image pre-processing and augmentation steps are described. Then the experimental setup and the results of the proposed model are presented, followed by a comparison with prior studies.

4.1 Facial expression datasets

Two benchmark facial expression datasets were used to evaluate our proposed approach: the Japanese Female Facial Expression (JAFFE) and the Karolinska Directed Emotional Faces (KDEF) datasets. The images in both datasets are labeled with seven basic emotion classes: happiness, fear, sadness, neutral, disgust, anger, and surprise. A brief description of each dataset and the reasons for its selection are given below.

4.1.1 Japanese Female Facial Expression (JAFFE) dataset

The JAFFE dataset (Lyons et al., 1998) includes 213 grayscale facial expression images of 10 Japanese female models, taken at the psychology department of Kyushu University. Each model performed the seven basic emotions (30 angry, 29 disgust, 33 fear, 30 happiness, 31 sad, 30 surprise, and 30 neutral images in total), with 3 to 4 images per expression per model, as shown in Fig. 5. The images are in .tiff format with a resolution of 256×256 pixels. We chose this dataset to demonstrate that deep convolutional neural network (DCNN) models can be trained on a small dataset.

Fig. 5 Sample facial expression images from the JAFFE dataset

4.1.2 Karolinska Directed Emotional Faces (KDEF) dataset

The KDEF dataset (Calvo & Lundqvist, 2008) was created by the Karolinska Institute, Department of Clinical Neuroscience, Section of Psychology, Stockholm, Sweden. The dataset is a collection of 4900 images of 70 individuals (35 females and 35 males) displaying seven emotional states, each photographed twice from five different angles (full-left profile, half-left profile, straight, half-right profile, and full-right profile), as shown in Fig. 6. The images are in RGB format with a resolution of 562×762 pixels. Several criteria were applied to the selection of actors and the picture-taking procedure, such as ages between 20 and 30 years and the absence of facial hair, earrings, eyeglasses, and visible make-up during the photo session. Facial expression recognition on the KDEF dataset is challenging for profile views, especially full-left or full-right profiles, where only one side of the face with one ear and one eye is visible. Hence, we used the whole dataset in the present study to evaluate the performance of the proposed method in these challenging cases.

Fig. 6 Sample facial expression images from the KDEF dataset

4.2 Image pre-processing and augmentation

The following image pre-processing steps were applied to format the images before feeding them to the DCNN models. First, the Haar feature-based cascade classifier (Viola & Jones, 2001) from OpenCV (Bradski, 2000) is used to detect and crop the face from each image. It is an effective machine learning-based approach in which a cascade function is trained on many positive and negative images to recognize the face region. Second, the face regions were resized to 224×224 pixels, the default input dimension of the pre-trained DCNN models. Finally, we applied image augmentation techniques to avoid overfitting and improve generalization to unseen data: each training image is horizontally flipped and rotated by an angle between −10 and 10 degrees. No data augmentation was applied to the test set.
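A minimal sketch of this pre-processing and augmentation pipeline with OpenCV and Keras is shown below; the Haar-cascade detection settings are common defaults and are assumptions rather than the exact values used in our experiments.

```python
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Face detection and cropping with the OpenCV Haar cascade, followed by
# resizing to the 224x224 input size expected by the pre-trained models.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def crop_face(image_bgr):
    """Return the first detected face resized to 224x224, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)  # assumed settings
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return cv2.resize(image_bgr[y:y + h, x:x + w], (224, 224))

# Augmentation applied to training images only: horizontal flip and
# rotation in the range of -10 to 10 degrees. No augmentation for testing.
train_datagen = ImageDataGenerator(horizontal_flip=True, rotation_range=10)
test_datagen = ImageDataGenerator()
```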

4.3 Experimental setup

The facial emotion recognition model for deaf and hard-of-hearing students was written in the Python programming language using Keras (Chollet, 2015) with the TensorFlow backend (Abadi et al., 2016) for image classification and OpenCV for image pre-processing. The experiments were performed on a PC with an Nvidia GTX 1070 GPU, an Intel Core i7 CPU at 3.20 GHz, and 16 GB of RAM, running 64-bit Windows 10.

In this study, we used two modes to split the JAFFE and KDEF datasets into training and test sets: (i) an 80–20% split, in which 80% of the images are used for training and 20% for testing; and (ii) 10-fold cross-validation (CV), in which each dataset is randomly partitioned into ten parts, nine of which are used for training the model and the remaining one for testing.
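The two validation schemes can be set up with scikit-learn as sketched below, where X and y stand for the pre-processed face images and their emotion labels; the fixed random seed is an illustrative choice for reproducibility, not a value reported in this work.

```python
from sklearn.model_selection import train_test_split, KFold

# (i) 80-20% hold-out split: 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# (ii) 10-fold cross-validation: nine folds for training, one for testing.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # train the model on X[train_idx] and evaluate it on X[test_idx]
    pass
```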

4.4 Experimental analysis and comparison

In this section, the performance of our proposed model is evaluated using different optimization algorithms on two comprehensive benchmark facial expression datasets, JAFFE and KDEF, and the resulting optimized model is compared with different pre-trained DCNN models. The optimizers used in the present work are stochastic gradient descent (SGD) (Robbins & Monro, 1951), SGD with momentum (Qian, 1999), SGD with Nesterov momentum (Nesterov, 1983), Adam (Kingma & Ba, 2014), Adagrad (Duchi et al., 2011), and Adadelta (Zeiler, 2012). Figures 7 and 8 compare the evolution of the training loss over the epochs for the various optimization algorithms on the JAFFE and KDEF datasets, with a batch size of 16 and categorical cross-entropy as the loss function.

Fig. 7 Training loss comparison of all optimizers using the fine-tuned VGG-16 model on JAFFE dataset

Fig. 8 Training loss comparison of all optimizers using the fine-tuned VGG-16 model on KDEF dataset

Stochastic gradient descent (SGD) with Nesterov momentum achieved the best recognition accuracies of 97.7% and 86.33% on the JAFFE and KDEF datasets, respectively. Moreover, as observed in Figs. 7 and 8, SGD with Nesterov momentum has the lowest training loss among all optimization algorithms, whereas Adadelta has the highest. Table 4 presents the recognition accuracy of our proposed model with the different optimizers using the 80–20% split validation scheme.

Table 4 Recognition accuracy of our proposed model on JAFFE and KDEF datasets using different optimizers

Figures 9 and 10 show the accuracy and loss curves for the training and testing phases on the JAFFE and KDEF datasets, and Figs. 11 and 12 show the corresponding confusion matrices obtained with the 80–20% split validation scheme.

Fig. 9 Training and testing accuracy (a) and loss (b) of the fine-tuned VGG-16 model on JAFFE dataset

Fig. 10 Training and testing accuracy (a) and loss (b) of the fine-tuned VGG-16 model on KDEF dataset

Fig. 11 Confusion matrix of the fine-tuned VGG-16 model on JAFFE dataset

Fig. 12 Confusion matrix of the fine-tuned VGG-16 model on KDEF dataset

A specific set of performance metrics was used to provide further analysis of our approach: accuracy, precision, recall, and F1-score. The corresponding formulas are given in (7), (8), (9), and (10), where TP (resp. TN) stands for true positives (resp. negatives) and FP (resp. FN) for false positives (resp. negatives).

$$\begin{array}{@{}rcl@{}} Accuracy&=&\frac{TP+TN}{TP+TN+FP+FN} \end{array}$$
(7)
$$\begin{array}{@{}rcl@{}} Precision&=&\frac{TP}{TP+FP} \end{array}$$
(8)
$$\begin{array}{@{}rcl@{}} Recall&=&\frac{TP}{TP+FN} \end{array}$$
(9)
$$\begin{array}{@{}rcl@{}} F1&=&2\times\frac{(P\times R)}{(P+R)} \end{array}$$
(10)
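These metrics can be computed from the model predictions with scikit-learn, as in the sketch below; model, X_test, and y_test are placeholders for the trained classifier, the held-out test images, and their one-hot labels.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Predicted class = argmax of the softmax probabilities.
y_prob = model.predict(X_test)
y_pred = np.argmax(y_prob, axis=1)
y_true = np.argmax(y_test, axis=1)   # assuming one-hot test labels

print(accuracy_score(y_true, y_pred))                   # Eq. (7)
print(classification_report(y_true, y_pred, digits=4))  # precision, recall, F1
print(confusion_matrix(y_true, y_pred))                 # cf. Figs. 11 and 12
```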

Table 5 gives the overall performance of the proposed model on the JAFFE and KDEF datasets. The test accuracies of the proposed model on JAFFE are 97.7% and 98% using the 80–20% split and 10-fold CV schemes, respectively. Similarly, on the KDEF dataset, the test accuracies are 86.33% and 99% using the 80–20% split and 10-fold CV schemes, respectively. It should be noted that the KDEF dataset is much larger than JAFFE and contains multiple views of the face; hence, the recognition accuracy on JAFFE is higher than on KDEF under the 80–20% split.

Table 5 Precision, Recall, and F1-score of our proposed model on JAFFE and KDEF datasets

The proposed FER method was examined against eight different pre-trained DCNN models: VGG-16, VGG-19, Inception v3, DenseNet-121, DenseNet-169, MobileNet, ResNet-50, and Xception, as shown in Table 6. The experiments were performed with the two splitting modes for the JAFFE and KDEF datasets: 20% of the data held out for testing (i.e., 80% for training) and 10-fold cross-validation (CV). The results show that the VGG-16 model achieved the best classification accuracies of 97.7% and 86.33% on the JAFFE and KDEF datasets in the 20% test-data case, followed by VGG-19 with the same 97.7% accuracy on JAFFE and 83.40% on KDEF.

Table 6 Comparison of the accuracies with different pre-trained deep CNN models on JAFFE and KDEF datasets

ResNet-50 gave the worst results, with accuracies of 83.72% on JAFFE and 50.4% on KDEF. In the 10-fold cross-validation case, accuracy varied from 97% to 99% on the JAFFE dataset and from 81% to 99% on the KDEF dataset, and the VGG-16 model again achieved the best accuracies of 98% and 99% on JAFFE and KDEF, respectively.

4.5 Result visualization of facial emotion recognition and engagement detection of deaf and hard-of-hearing students

Our proposed model was tested on images of seven deaf and hard-of-hearing students (4 males and 3 females) from the Faculty of Sciences Rabat, Mohammed V University in Rabat, who participated in the experiment; three of them were wearing glasses and two were in a half-left profile view, as shown in Fig. 13. The figure shows that all faces were detected and marked with red rectangular outlines. Each emotion is displayed in black text, with a red bar representing its probability, and the dominant emotion label is displayed in red text at the top of each rectangle. The engagement level computed from the dominant emotion of each deaf and hard-of-hearing student is displayed in white text at the bottom of each rectangle. Of the seven faces, five were labeled “happy” and two were labeled “neutral”; six students were nominally engaged and one student was highly engaged. Figure 14 shows the distribution of engagement levels of the deaf and hard-of-hearing students in the classroom. According to these results, the proposed approach performs well in evaluating the facial expressions and engagement of deaf and hard-of-hearing students in a classroom environment.

Fig. 13 Facial emotion recognition and engagement detection of deaf and hard-of-hearing students

Fig. 14 Engagement level of the seven deaf and hard-of-hearing students in the class
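For completeness, the kind of overlay shown in Fig. 13 can be drawn with OpenCV roughly as follows; the colours, font, and offsets are approximations of the figure rather than the exact rendering code used in our system.

```python
import cv2

def annotate_face(frame, box, dominant_emotion, engagement):
    """Draw the per-face overlay: red box, emotion above, engagement below."""
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)   # red (BGR)
    cv2.putText(frame, dominant_emotion, (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)     # red text
    cv2.putText(frame, engagement, (x, y + h + 20),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)  # white text
    return frame
```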

4.6 Performance comparison with prior studies

In this section, the performance of the proposed FER method is compared with other state-of-the-art methods on the JAFFE and KDEF datasets. Table 7 lists the methods used in prior studies to recognize facial emotions, the publication year, the total number of samples in the two datasets, the data-splitting method, and the test accuracy. In the 10-fold CV case, the multi-layer perceptron (MLP) and deep belief network (DBN) approach (Zhao et al., 2015) achieved 90.95% accuracy on JAFFE, whereas the proposed approach achieved the best recognition accuracies of 98% on JAFFE and 99% on KDEF.

Table 7 Comparison between the proposed method and existing methods on JAFFE and KDEF datasets

For the 80–20% split, the proposed method achieves accuracies of 97.7% on JAFFE and 86.33% on KDEF, outperforming the CNN method of Sari et al. (2021), which reports 86.24% on JAFFE and 82.38% on KDEF. The proposed VGG-16 based transfer learning model outperformed the other state-of-the-art methods considered on both the JAFFE and KDEF datasets.

5 Conclusion

One of the biggest challenges in education is building a system that detects the engagement of deaf and hard-of-hearing students. In this work, we proposed a novel approach for evaluating deaf and hard-of-hearing students’ engagement from their facial emotions, captured by a camera in a classroom environment, using a VGG-16 based transfer learning model with a fine-tuning strategy. Our system produces three engagement levels: ‘highly engaged’, ‘nominally engaged’, and ‘not engaged’.

Our research compared the influence of different optimization algorithms on model performance and conducted a comparative analysis with eight different pre-trained DCNN models using two validation schemes, an 80–20% split and 10-fold CV, on the JAFFE and KDEF datasets. The VGG-16 model achieved the best classification accuracies of 97.7% and 86.33% on JAFFE and KDEF with 20% of the samples held out for testing; in the 10-fold CV case, it achieved the best classification accuracies of 98% and 99% on the JAFFE and KDEF datasets. Moreover, SGD with Nesterov momentum had the lowest training loss among the optimization algorithms compared. According to the obtained results, the proposed method outperformed other state-of-the-art methods and proved successful in recognizing deaf and hard-of-hearing students’ engagement based on facial emotions in a classroom environment.

The proposed system can help teachers adapt the teaching material to the engagement level of each deaf or hard-of-hearing student. We tested our system with seven deaf and hard-of-hearing students in a classroom environment, and the results reveal that it correctly identifies students’ engagement from their facial expressions. In future research, we will evaluate engagement using additional features, such as gaze behavior and body movements, to further improve classroom teaching for deaf and hard-of-hearing students.