1 Introduction

The driver's emotions have a significant influence on comfort and safety when driving [1]. Drivers' inability to regulate their emotions is one of the main factors reducing driving safety in the 1.24 million fatal road traffic accidents and 20–50 million non-fatal injuries that occur globally. The rapid advancement of intelligent vehicles calls for closer communication and cooperation between driver and automation to improve driving comfort, for which the driver's emotional state is a crucial input [2]. Understanding the driver's feelings is therefore essential for enhancing comfort and safety when operating intelligent cars.

Driver Emotion Recognition (DER) technology evaluates the emotional state of a driver from their facial expression. A DER system aims to improve the Human Machine Interface (HMI) in vehicles. Emotion recognition has been used in numerous applications, including social security, mental health monitoring, safe driving, and health care [3]. By detecting a driver's emotions and raising the driver's awareness of them, a DER system can help manage the driver's emotional state. Effective recognition of driver emotion is therefore essential for developing a better DER system.

Several techniques have been used in the literature to evaluate a driver's emotional condition [4]. A soft computing approach based on Fuzzy Rule-Based Systems (FBS) identifies mood and facial motion for driving assistance, with the fuzzy rules defined from an analysis of facial gesture variations [5]. DER systems have also been designed using the Local Binary Pattern (LBP) for texture-based assessment and face recognition; LBP is simple and efficient. Oriented FAST and Rotated BRIEF (ORB) is another approach: ORB serves as a quick and reliable feature detector that uses Binary Robust Independent Elementary Features (BRIEF) as the face feature descriptor, with a Support Vector Machine (SVM) acting as the classifier [6]. The electrocardiogram (ECG) is also an efficient tool for detecting and recognizing a person's emotion. DER has further been based on EEG data combined with the Self-Assessment Manikin (SAM) method [7]. When a person is exposed to a certain stimulus, the density and frequency of the stimulus are used to determine the subject's emotional state [9]. The driver's emotion can be recognized from these signals.

In a camera-based DER system, a camera installed within the vehicle monitors the driver's face continuously. The system uses a face detection algorithm to detect the driver's face in a natural setting [8]. Under realistic driving situations, the driver's face must be identified accurately despite varying lighting conditions, occlusions, and other passengers in the vehicle; this is the major problem faced by camera-based driver monitoring systems [9]. A DER system helps to alert the driver by raising an alarm. Future generations of vehicles should therefore include this extra safety feature to alert drivers to their emotional state, not only for road safety but also for human well-being.

The proposed method's primary contributions are listed below:

  • Driver emotion is recognized using a deep learning-based ShuffleNet V2 classifier and a KLT algorithm-based feature extraction method.

  • A face image dataset is collected and pre-processed using image resizing, a Gaussian filter, a median filter, histogram equalization, and a Wiener filter to filter noise and enhance image contrast.

  • Pre-processed images are segmented using a Region of Interest (ROI) to remove the unwanted portions of the facial images and reduce the complexity of the model.

  • Features are extracted from the segmented face regions using the Kanade-Lucas-Tomasi (KLT) algorithm so that the model is trained on distinctive facial features.

  • The ShuffleNet V2 classifier categorizes the drivers' emotions into seven distinct expressions: happy, surprise, sad, fear, anger, disgust, and neutral.

The remainder of the paper is organized as follows. Section 2 presents the literature review pertaining to DER systems. Section 3 describes the proposed design and technique. Section 4 contains the results and discussion. Section 5 concludes the paper.

2 Literature review

This section summarizes research in the field of emotion recognition. Depending on their past and current activities, people often show emotions such as happiness, neutrality, sadness, disgust, surprise, fear, and anger. A few of the current detection methods are reviewed below.

Du et al. [10] introduced the Convolution Bidirectional Long Short-term Memory Neural Network (CBLNN), a deep learning architecture for identifying the driver's emotional state. CBLNN was used to recognize emotions easily and accurately in real time, and it relies on a CNN to evaluate the face shape. It achieved good accuracy in identifying anger, sadness, happiness, and neutrality; however, its accuracy in detecting fear was significantly lower.

Wang et al. [11] combined several electrocardiogram (ECG) features to identify the driver's emotional state. The ECG features comprise three components: waveform, nonlinear characteristics, and time–frequency interval. An emotion detection technique was used while driving to determine whether a motorist was relaxed or nervous. The method achieved an accuracy of 91.34% for identifying the driver's calm and 92.89% for identifying tension. However, processing the ECG data requires multiple-evidence fusion in the nonlinear analysis.

Xia et al. [12] proposed a method for cross-dataset transfer driver expression recognition in a shared projection subspace (GD-LS-SS) that exploits global discriminative and local structural knowledge. GD-LS-SS uses graph topology knowledge to exploit the local geometrical structure of the data. A kernel-based GD-LS-SS uses the kernel trick to investigate the kernel projection, which increases identification accuracy and handles nonlinear cross-dataset transfer. However, GD-LS-SS leaves several challenges open: how to eliminate unfavorable transfer, how to choose the most important aspects of the face pictures, and how to transfer the key facial image attachment to SD.

Hu et al. [13] presented a deep learning architecture that combines a 3D conditional generative adversarial network with a two-level attention bidirectional long short-term memory network (3DcGAN-TLABiLSTM). The framework was tested on the public NTHU-DDD dataset. However, fatigue remains difficult to diagnose because of the significant intra-class variations in head position, facial expression, and lighting conditions.

Jeong et al. [14] designed a lightweight multilayer random forest (LMRF) model, a deep model formed by non-neural-network, layer-by-layer random forests. Even with fewer hyper-parameters, LMRF achieves performance comparable to a DNN, and it runs faster on a CPU. However, LMRF suffered performance degradation when more than three layers were used.

Shojaeilangari et al. [15] presented an Extreme Sparse Learning (ESL) approach to identify the emotions expressed by the human face. ESL jointly learns a dictionary and a nonlinear classification model. To achieve accurate classification with noisy signals and imperfect data acquired in natural situations, ESL combines the representational capability of sparse representation with the discriminative strength of the Extreme Learning Machine (ELM). Its disadvantages are the higher computational expense of extracting and classifying features and the need to optimize a large number of parameters.

Kim et al. [16] created a line-segment feature analysis (LFA) and convolutional recurrent neural network (CRNN) model for facial sentiment analysis, together with an image-based streaming method, PingPong256 (PP2). Real-time pictures gathered by imaging devices were secured through encryption and decryption with the PP2 algorithm. However, the LFA-CRNN model cannot leverage miniaturization technologies such as mobile edge computing systems.

Mohan et al. [17] suggested a two-branch deep convolutional neural network (DCNN). The first branch looks at geometric elements, including lines, edges, and curves, whereas the second branch looks at holistic traits. The DCNN methodology was reported to outperform state-of-the-art techniques across all datasets. A GF-based edge descriptor is used to obtain a small number of local features as input to the DCNN model.

Cui et al. [18] introduced a multi-task neural network called Multi-EmoNet, which can repair noisy images and classify human facial emotions under various settings. Compared to baseline networks, Multi-EmoNet obtains significantly better classification results on images with different levels of light. The multi-task network is a generic design that can be used for any noisy picture classification problem.

Madupu et al. [19] developed a Convolutional Neural Network (CNN) based automated face emotion categorization method using the Speeded Up Robust Features (SURF) feature. This approach had an accuracy rate of 91%. However, the dataset sample size for this approach was just 200.

The articles reviewed above raise several significant challenges for facial emotion detection. The accuracy of CBLNN in detecting fear is significantly lower [10]. Processing ECG data requires multiple-evidence fusion in the nonlinear analysis [11]. GD-LS-SS faced challenges in avoiding unfavorable transfer, selecting crucial aspects of the face picture, and transferring key facial image attachments to SD [12]. Fatigue remains difficult to diagnose because of large intra-class differences in head posture, expression, and illumination [13]. LMRF experienced performance degradation with more than three layers [14]. ESL has higher computational expenses for feature extraction and classification and requires the optimization of numerous parameters [15]. The CNN-SURF approach was evaluated on a dataset of only 200 samples [19].

3 Proposed methodology

Many road accidents are caused by the driver's unpleasant emotions. To help prevent such accidents, a Driver Emotion Recognition (DER) system based on the KLT algorithm and ShuffleNet V2 is proposed. Images from the CK_plus, FER_2013, TFEID, KMUFED, and KDEF datasets are taken as input. These datasets contain face images of both genders showing different emotions. First, the images are pre-processed using histogram equalization, a Wiener filter, a 2D Gaussian filter, and a 2D median filter for image enhancement and noise reduction, which improves recognition performance. Next, the pre-processed images are segmented with a Region of Interest (ROI) algorithm, in which a rectangular ROI is placed over the facial image to remove the unwanted portions. Features are then extracted from the segmented facial regions using the KLT algorithm, which yields scattered feature points in areas of the face with sufficient texture. Finally, the features extracted by the KLT algorithm are given to ShuffleNet V2 for training and for recognizing driver emotions. The ShuffleNet V2 classifier identifies seven distinct expressions: happy, surprise, sad, fear, anger, disgust, and neutral. Fig. 1 shows the entire process of the proposed DER system.

Fig. 1 Structure of the proposed approach

3.1 Preprocessing

Image preprocessing is an important phase in face image processing. Its goal is to improve picture quality and enhance image characteristics for later stages. General preprocessing includes noise reduction, color normalization, histogram equalization, and edge detection. In the proposed method, preprocessing resizes the image, removes noise, and enhances the image.

3.1.1 Step 1: Resizing

Resizing changes the dimensions of an image without cropping any part of it. It is needed to increase or reduce the total number of pixels in the image, so the pixel data is altered when an image is resized. Repeated resampling of an already resized image should be avoided; the size adjustment should be kept simple so that the image content is not altered. For instance, a picture of 700 by 700 pixels is shrunk to 256 by 256 pixels. The resized version of the original image is shown in Fig. 2.

Fig. 2 Resized diagram

3.1.2 Step 2: Noise removal

Noise reduction is the technique of eliminating or decreasing noise in a picture. It reduces the appearance of noise by smoothing the image while leaving the regions around contrast boundaries largely intact. A 2D median filter and a 2D Gaussian filter are the two techniques used for noise removal in the proposed system. These techniques remove most of the noise present in the image.

2D Gaussian Filter: The Gaussian filter is a 2D convolutional filter that removes noise and smooths the image. Its impulse response is a Gaussian function. The proposed method uses a 2D Gaussian filter to remove noise from the given set of images [20]. The Gaussian filter is given by Eq. (1),

$$\widehat{f}\left(x,y\right)=f\left(x,y\right)*g\left(x,y\right)$$
(1)

where the 2D input image is denoted by f, the 2D Gaussian kernel by g, the 2D output image by \(\widehat{f}\), and * denotes 2D convolution.
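As an illustration of Eq. (1), the following Python sketch convolves an image with a normalized Gaussian kernel. It is not the Matlab implementation used in this work; the kernel size, the value of sigma, the edge-replication padding, and the function names are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2D Gaussian kernel g (size is assumed odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_filter_2d(image, size=5, sigma=1.0):
    """Compute f_hat = f * g with edge-replicated borders."""
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    h, w = image.shape
    padded = np.pad(image.astype(float), pad, mode="edge")
    flipped = kernel[::-1, ::-1]  # flip so the sliding sum is a true convolution
    out = np.zeros((h, w), dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += flipped[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out
```

Because the kernel sums to one, a region of constant intensity is left unchanged, while isolated intensity spikes are spread over their neighbors.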

2D Median Filter: The median filter is a nonlinear digital filtering method used to remove noise. Because it can often preserve edges while eliminating noise, it is widely used in digital image processing. The proposed method employs a 2D median filter as an additional noise removal step. It works by replacing each pixel value in the image, pixel by pixel, with the median of the neighboring pixels. The pattern of neighbors is called a window, and it slides pixel by pixel across the image. The median value is found using Eq. (2) [21].

$$MF\left(i,j\right)={\text{Median}} \left({x}_{1}, {x}_{2}, \dots , {x}_{8}, {x}_{9}\right)$$
(2)

where MF (i, j) is the median of the neighboring pixel values \({x}_{1},\dots ,{x}_{9}\) in the window centered on the pixel, and (i, j) represents the pixel coordinates.
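A minimal Python sketch of Eq. (2) is given below, assuming a 3 × 3 window (nine values per pixel) and edge-replicated borders; the function name and the border handling are illustrative choices, not details taken from the original implementation.

```python
import numpy as np

def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighborhood."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    # Stack the nine shifted views so axis 0 holds the window values x1..x9.
    windows = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    return np.median(windows, axis=0)
```

A single outlier pixel (impulse noise) never reaches the middle of the sorted window, so it is removed entirely rather than blurred into its neighbors.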

3.1.3 Step 3: Image enhancement

Image enhancement emphasizes certain aspects of an image while reducing or eliminating unwanted information, for example by removing noise and adjusting levels to bring out the characteristics of the image. In the proposed approach, image enhancement is applied to the given image dataset using two techniques: histogram equalization and the Wiener filter.

Histogram Equalization: Histogram equalization is an image processing method used to improve the contrast of an image by stretching its intensity range. It usually raises the overall contrast of an image whose data are represented by a narrow range of intensity values, and it raises the contrast in areas that have low local contrast. Histogram equalization is therefore used in the proposed method to improve image quality. Equation (3) gives the enhanced image obtained by histogram equalization [22].

$$EI \left(i, j\right)={T}_{f}\left(x\right)= \left\{{T}_{f}(x\left(i,j\right))\mid \forall x(i,j)\in X\right\}$$
(3)

where i and j are the coordinates of the image, x(i, j) is the intensity of the input image X at (i, j), \({T}_{f}\)(x) is the transformation function, and EI (i, j) represents the enhanced image.
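One common choice for the transformation function is the scaled cumulative histogram of the image. The Python sketch below assumes that choice and an 8-bit grayscale input; the function name and the lookup-table formulation are illustrative and are not taken from the original implementation.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Map intensities through the scaled cumulative histogram (8-bit input assumed)."""
    img = np.asarray(image, dtype=np.uint8)
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = np.cumsum(hist)
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest level present
    denom = img.size - cdf_min
    if denom == 0:                      # constant image: nothing to stretch
        return img.copy()
    lut = np.round((cdf - cdf_min) / denom * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[img]                     # apply the transformation to every pixel
```

The darkest level present maps to 0 and the brightest to 255, so an image that occupied a narrow intensity band is stretched over the full range.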

Wiener Filtering: The Wiener filtering technique is used for image restoration. The Gaussian filter used for noise removal in the proposed method may slightly blur the image, so the Wiener filter is applied to restore image quality. In this approach, any noise present is assumed to be additive white Gaussian noise. Wiener filtering requires knowledge of the power spectra of the original picture and of the noise. The Wiener filter produces a linear estimate of the actual image and suppresses the Gaussian noise in the picture [23] according to Eq. (4),

$$J\left(i, j\right)=m+\frac{{s}^{2}-{\sigma }^{2}}{{s}^{2}}\left(I\left(i, j\right)-m\right)$$
(4)

where i and j stand for the row and column of the pixel, J stands for the output image's intensity, I stands for the input image's intensity, m and \({s}^{2}\) are the local mean and local variance of the input intensity around (i, j), and \({\sigma }^{2}\) represents the input noise variance.
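The sketch below illustrates a pixel-wise adaptive Wiener estimate in Python. It assumes the standard local-statistics form, in which each output pixel is the local mean plus a gain (local variance minus noise variance, divided by local variance) applied to the deviation from that mean. The 3 × 3 window, the clamping of the gain to the range [0, 1], and the function names are assumptions for the example.

```python
import numpy as np

def local_mean(image, size=3):
    """Box-filter mean over a size x size window with edge-replicated borders."""
    pad = size // 2
    h, w = image.shape
    padded = np.pad(image.astype(float), pad, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (size * size)

def adaptive_wiener(image, noise_var, size=3):
    """Shrink each pixel toward its local mean according to the local signal-to-noise level."""
    img = image.astype(float)
    m = local_mean(img, size)
    local_var = np.maximum(local_mean(img ** 2, size) - m ** 2, 0.0)
    # Gain is 0 where the window looks like pure noise and 1 where noise is negligible.
    numerator = np.maximum(local_var - noise_var, 0.0)
    denominator = np.maximum(np.maximum(local_var, noise_var), 1e-12)
    return m + (numerator / denominator) * (img - m)
```

In flat regions the estimate falls back to the local mean, while in textured regions (high local variance) the original pixel values are largely retained, which is why this filter blurs edges less than plain averaging.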

3.2 ROI extraction in segmentation

Segmentation divides a digital image into multiple segments, making the representation of the image simpler or more meaningful. The procedure yields a collection of segments that together cover the full image. In the proposed method, a rectangular ROI is extracted during segmentation.

A Region of Interest (ROI) is a designated area of an image intended for subsequent processing or analysis. In the proposed approach, the ROI is extracted in rectangular form, which isolates the targeted region with improved precision. This extracted region is the primary input for the subsequent procedures, and it is mostly employed for image partitioning during segmentation. Figure 3 shows the extraction of the rectangular ROI during the segmentation process.

Fig. 3 Image segmentation

3.3 Feature extraction using KLT algorithm

Feature extraction is a dimensionality reduction step that condenses large amounts of raw data into a smaller set of descriptive features. The extracted features describe the attributes of the image; they characterize the original data accurately and distinctively and are simple to compute. In the proposed method, the Kanade–Lucas–Tomasi (KLT) algorithm is used to obtain the features of the face image [24].

The KLT algorithm is used for tracking human faces or facial features across captured frames. It first determines the displacement of the tracked points from one frame to the next. From this displacement, the movement of the face can be computed and the facial feature points can be tracked. The KLT algorithm relies on the intensity information of the pixels. Equation (5) represents the KLT calculation,

assuming one image is captured at time t and the next image at time t + T.

$$I(x, y, t +T) =I(x- X, y- Y, t)$$
(5)

where x and y are the pixel coordinates in the first image and (X, Y) is the displacement of the tracked point between the two frames. Using this relation, facial landmarks are located in the proposed method.
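The displacement in Eq. (5) can be estimated with a single linearized Lucas-Kanade step, sketched below in Python. This is a simplified illustration under stated assumptions: one window covering the whole patch, a small sub-pixel shift, no pyramid or iteration, and a synthetic smooth pattern in place of a face image. The function and variable names are hypothetical.

```python
import numpy as np

def lucas_kanade_displacement(frame0, frame1):
    """Estimate (X, Y) such that frame1(x, y) is approximately frame0(x - X, y - Y)."""
    i0 = frame0.astype(float)
    i1 = frame1.astype(float)
    iy, ix = np.gradient(i0)            # derivatives along rows (y) and columns (x)
    it = i1 - i0                        # temporal intensity difference
    # Normal equations of the linearized brightness-constancy constraint.
    g = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = np.array([np.sum(ix * it), np.sum(iy * it)])
    return np.linalg.solve(g, -b)       # [X, Y]

# Synthetic demo: a smooth pattern shifted by X = 0.3 and Y = -0.2 pixels.
ys, xs = np.mgrid[0:64, 0:64].astype(float)

def synthetic_patch(x, y):
    return np.sin(0.2 * x) + np.cos(0.3 * y)

frame_t = synthetic_patch(xs, ys)
frame_t_plus_T = synthetic_patch(xs - 0.3, ys + 0.2)
estimate = lucas_kanade_displacement(frame_t, frame_t_plus_T)
```

The 2 × 2 matrix is well conditioned only where the patch has intensity variation in both directions, which is why KLT selects feature points in textured regions of the face.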

3.4 ShuffleNet v2

An essential first stage in the recognition process is facial feature point detection. In this work, a ShuffleNet V2K16 classifier is used to categorize the various facial expressions. ShuffleNet v2 is an enhanced version of ShuffleNet v1 that uses channel shuffling and follows four design objectives. It is reported to be faster and more accurate than ShuffleNet v1 and MobileNet v2 at comparable complexity. Grouped convolution lowers processing costs by operating on separate channel groups; however, it reduces the expressive power of the output features because it restricts the information flow across channel groups. Channel shuffling ensures that feature maps from different groups share information, so that inputs and outputs remain related across groups, without requiring additional computation. The use of channel shuffling in group convolution is shown in Fig. 4 [25].

Fig. 4 Group convolution with channel shuffling
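The channel shuffle operation can be sketched in a few lines of Python. The example assumes a feature tensor in (batch, channels, height, width) layout and uses NumPy rather than a deep learning framework; the function name is illustrative.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels so each output group mixes channels from every input group."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    x = x.reshape(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(0, 2, 1, 3, 4)                # swap group and per-group axes
    return x.reshape(n, c, h, w)                  # flatten back to one channel axis

# Six channels labelled 0..5, arranged in two groups: [0, 1, 2] and [3, 4, 5].
features = np.arange(6, dtype=float).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(features, groups=2)
channel_order = shuffled[0, :, 0, 0].astype(int).tolist()
```

The operation only reorders memory, so it adds no multiplications while letting the next grouped convolution see channels that originated in different groups.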

ShuffleNet V2 is a network design that splits the input feature map into two branches, each with half of the channels, in order to minimize the memory access cost (MAC). The left branch stays unaltered, while the right branch performs three convolutions with a stride of one, using both regular and depthwise separable convolutions. After the convolutions, the two branches are concatenated so that their features and channel counts are combined, and a channel shuffle transfers information between the groups. The concatenation widens the network, improves feature extraction, and doubles the number of channels without increasing the FLOPs. Shuffling the channels makes information sharing between channels possible, which lessens the network's processing burden.

Channel split and depthwise convolution are the two building blocks that ShuffleNet v2 uses when dividing the input features into two halves.

To reduce the memory access cost, a convolutional layer should have equal numbers of input and output channels. For instance, for a 1 × 1 convolution with input feature size \({c}_{i}\times h\times w\) and \({c}_{o}\) output channels, the FLOPs B and the MAC are

$$B=hw{c}_{i}{c}_{o}$$
(6)
$$MAC=hw\left({c}_{i}+{c}_{o}\right)+{c}_{i}{c}_{o}$$
(7)

By the arithmetic-geometric mean inequality, for a fixed B the following holds:

$$MAC \ge 2\sqrt{hwB}+\frac{B}{hw}$$
(8)

Equality holds when \({c}_{i}={c}_{o}\), indicating that the minimum MAC is achieved when the numbers of input and output channels are equal.
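The relationship between Eqs. (6) to (8) can be checked numerically. The Python sketch below compares two 1 × 1 convolutions with the same FLOPs B but different channel ratios; the feature-map size and channel counts are arbitrary example values, and the function name is hypothetical.

```python
import math

def conv1x1_cost(h, w, c_in, c_out):
    """Return (B, MAC, lower bound) for a 1x1 convolution, following Eqs. (6)-(8)."""
    flops = h * w * c_in * c_out                        # Eq. (6)
    mac = h * w * (c_in + c_out) + c_in * c_out         # Eq. (7)
    bound = 2.0 * math.sqrt(h * w * flops) + flops / (h * w)  # Eq. (8)
    return flops, mac, bound

# Same B in both cases (64 * 64 == 32 * 128), different channel balance.
balanced = conv1x1_cost(56, 56, 64, 64)
unbalanced = conv1x1_cost(56, 56, 32, 128)
```

With equal input and output channel counts the MAC coincides with the lower bound, whereas the unbalanced layer needs more memory accesses for the same amount of computation.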

Group convolution with g groups should be used sparingly, since excessive grouping raises the memory access cost:

$$B=\frac{hw{c}_{i}{c}_{o}}{g}$$
(9)
$$MAC= hw\left({c}_{i}+{c}_{o}\right)+\frac{{c}_{i}{c}_{o}}{g}$$
(10)
$$=hw{c}_{i}+\frac{Bg}{{c}_{i}}+\frac{B}{hw}$$
(11)

With the input shape \({c}_{i}\times h\times w\) and the FLOPs B held fixed, MAC increases as the group count g grows.
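The effect of the group count can be illustrated with Eqs. (9) to (11). In the Python sketch below, B is kept constant by scaling the number of output channels with g, which is one way to spend the computation saved by grouping; the sizes are arbitrary example values and the function name is hypothetical.

```python
import math

def group_conv_cost(h, w, c_in, c_out, groups):
    """Return (B, MAC) for a 1x1 group convolution, following Eqs. (9) and (10)."""
    flops = h * w * c_in * c_out // groups
    mac = h * w * (c_in + c_out) + c_in * c_out // groups
    return flops, mac

h = w = 28
c_in = 64
results = []
for g in (1, 2, 4, 8):
    c_out = 64 * g                      # widen the output so that B stays fixed
    flops, mac = group_conv_cost(h, w, c_in, c_out, g)
    # Eq. (11): the same MAC expressed through B, g, and the input shape.
    mac_eq11 = h * w * c_in + flops * g / c_in + flops / (h * w)
    results.append((g, flops, mac, mac_eq11))
```

Every entry in `results` has the same FLOPs, yet the MAC grows with g, which is the reason for using group convolution sparingly.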

Network fragmentation, such as the multi-branch structures of Inception-style designs, reduces the degree of parallel computation, so numerous multi-branch structures in the network architecture slow it down. Element-wise tensor operations such as the ReLU activation function and feature summation have few FLOPs but a significant MAC, so they should also be reduced.

4 Result and discussion

Emotion recognition matters to all vehicle users because road safety and human well-being depend strongly on the driver's current emotional state. The proposed DER system detects the driver's emotional state and raises an alarm when the driver is in an unpleasant emotional state. The ShuffleNet V2 classifier in the DER system classifies the different types of emotions. The proposed DER system was simulated in Matlab 2020b on a machine with 16 GB RAM, an Nvidia GeForce GTX 1650 GPU, and an Intel Core i5 CPU. The CNN classifier takes the image data as input and identifies the specific emotional state.

4.1 Dataset description

Five datasets are used in the proposed method: CK_plus [26], FER_2013 [27], TFEID [28], KMUFED [29], and KDEF [30]. CK_Plus is a complete set of action units and emotion-based expressions. FER_2013 consists of 48 × 48 pixel grayscale face pictures. The Karolinska Directed Emotional Faces (KDEF) dataset is a set of human facial expressions.

Images are first collected and pre-processed to eliminate noise. The pre-processed images are then segmented to extract the image region used for further processing. After segmentation, image features are extracted and fed into the CNN classifier. ShuffleNet V2 classifies the images and determines the emotional state. The pre-processing and feature extraction results for the CK_Plus dataset are shown in Table 1.

Table 1 CK_Plus dataset's pre-processing and feature extraction.

Classifier performance is assessed with the receiver operating characteristic (ROC) curve, which plots a classifier's true positive rate against its false positive rate. Figure 5 illustrates the confusion matrix and ROC plot for the CK_Plus dataset. An area under the ROC curve of 1 represents an excellent classifier, while a value of 0.5 represents a poor one. The proposed method plots the true positive rate against the false positive rate for the CK_Plus dataset. Because the ROC curve reaches 1, the proposed approach performs well. The ROC curve for CK_Plus is displayed in Fig. 5a.

Fig. 5 Confusion matrix plot and ROC plot for CK_Plus dataset
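For reference, a ROC curve for one class against the rest can be computed as in the Python sketch below. The labels and scores are hypothetical toy values, not results from this work, and the sketch assumes that all scores are distinct (ties are not merged).

```python
import numpy as np

def roc_points(labels, scores):
    """Return (FPR, TPR) arrays obtained by lowering the decision threshold step by step."""
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")      # highest score first
    sorted_labels = labels[order]
    tp = np.concatenate(([0], np.cumsum(sorted_labels)))
    fp = np.concatenate(([0], np.cumsum(1 - sorted_labels)))
    tpr = tp / labels.sum()
    fpr = fp / (len(labels) - labels.sum())
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical example: the two positives receive the two highest scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = area_under_curve(toy_fpr, toy_tpr)
```

When every positive sample is scored above every negative one, the curve passes through the top-left corner and the area equals 1; a classifier that ranks the samples in the opposite order gives an area of 0.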

The effectiveness of the classification is examined with the confusion matrix, which displays the accuracy for each class of the CK_Plus dataset. The dataset has seven classes: fear, anger, disgust, happy, sad, surprise, and neutral. Accuracy was 80.0% in class 0, 100.0% in class 1, 92.3% in class 2, 100.0% in class 4, and 92.1% in class 6. The overall accuracy on the CK_Plus dataset is 93.41%. The confusion matrix for the CK_Plus dataset is shown in Fig. 5b.
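Per-class and overall accuracy can be read from a confusion matrix as sketched below in Python. The sketch assumes that rows hold the true classes and columns the predicted classes, and that the per-class accuracy is the fraction of each true class predicted correctly; the three-class matrix contains hypothetical counts, not values from this work.

```python
import numpy as np

def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

def overall_accuracy(confusion):
    """Correct predictions (diagonal) divided by all predictions."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.trace(confusion) / confusion.sum())

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [[8, 2, 0],
                 [0, 10, 0],
                 [1, 0, 9]]
toy_per_class = per_class_accuracy(toy_confusion)
toy_overall = overall_accuracy(toy_confusion)
```

Because the overall accuracy weights classes by their sample counts, it can differ noticeably from the plain average of the per-class values when the classes are imbalanced.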

The FER_2013 picture dataset is the second dataset used in the proposed methodology. The images are first resized and then processed with Wiener filtering, histogram equalization, median filtering, and Gaussian filtering. The KLT method is then used to extract the image features. Table 2 shows the pre-processing and feature extraction for the FER_2013 dataset.

Table 2 Pre-processing and Feature extraction in FER_2013 dataset.

The ROC curve for the FER_2013 dataset plots the true positive rate against the false positive rate. With a ROC value of 1, the proposed method performs well on this dataset. The ROC curve for FER_2013 is shown in Fig. 6a. The accuracy rates for the seven classes of the FER_2013 dataset are 84.5, 68.1, 99.0, 91.4, 80.7, 94.0, and 84.0%, and the overall accuracy is 83.68%. The FER_2013 dataset's confusion matrix is displayed in Fig. 6b.

Fig. 6 Confusion matrix plot and ROC plot for FER_2013 dataset

The TFEID image dataset is the next dataset. The images are first resized and then processed with Gaussian filtering, median filtering, Wiener filtering, histogram equalization, and KLT feature extraction. Table 3 shows the steps of pre-processing and feature extraction for the TFEID dataset.

Table 3 TFEID dataset's pre-processing and extraction of features.

For the TFEID dataset, the true positive rate and false positive rate are displayed on a ROC curve. With this dataset, the proposed approach's ROC value is 1, indicating good performance. Figure 7a displays the ROC curve for the TFEID dataset. The TFEID dataset contains seven distinct classes, whose accuracy rates are 85.7, 77.8, 70.0, 87.5, 61.5, 100.0, and 100.0%. The TFEID dataset's confusion matrix is displayed in Fig. 7b.

Fig. 7 Confusion matrix plot and ROC plot for TFEID dataset

KMU_FED is a further image dataset. Its images are resized and then processed with Gaussian, median, and Wiener filters and histogram equalization. KLT feature extraction is then used to extract the image features. The pre-processing and feature extraction for the KMU_FED dataset are displayed in Table 4.

Table 4 Pre-processing and Feature extraction in KMU_FED dataset.

For the KMU_FED dataset, the true positive rate and false positive rate are displayed on the ROC curve. The proposed approach's ROC value for this dataset is 1, indicating that it performs well. Figure 8a displays the ROC curve for the KMU_FED dataset. The dataset has seven classes: fear, anger, disgust, happy, sad, surprise, and neutral. Accuracy was 97.6% for class 0, 100.0% for class 1, 100.0% for class 2, 92.3% for class 3, and 95.2% for class 5. The overall accuracy on the KMU_FED dataset is 98.18%. Figure 8b displays the KMU_FED dataset's confusion matrix.

Fig. 8 Confusion matrix plot and ROC plot for KMU_FED dataset

The KDEF dataset is the last dataset used in the proposed method. Its images undergo resizing, Gaussian filtering, Wiener filtering, histogram equalization, and feature extraction using KLT. Table 5 shows the pre-processing and feature extraction for the KDEF dataset.

Table 5 Pre-processing and Feature extraction in KDEF dataset

For the KDEF dataset, the true positive rate and false positive rate are displayed on a ROC graph. The proposed method's ROC value reaches 1, indicating good performance. The KDEF dataset's ROC curve is displayed in Fig. 9a. The KDEF dataset has seven distinct classes; the reported accuracy rates for these classes are 95.9, 98.6, 99.3, 97.8, 100.0, and 97.9%. The overall accuracy on this dataset is 98.47%. The KDEF dataset's confusion matrix is displayed in Fig. 9b.

Fig. 9 Confusion matrix plot and ROC plot for KDEF dataset

Accuracy measures how close the predictions are to the true values. Figure 10a compares the accuracy of the existing ResNet-101 method and the proposed ShuffleNet V2 approach over the five datasets. On the CK_Plus dataset, the proposed method achieves an accuracy of 0.94, whereas the existing method achieves 0.89. On the KDEF, TFEID, FER_2013, and KMU_FED datasets, the proposed approach achieves 0.99, 0.79, 0.99, and 0.99, compared to 0.80, 0.94, 0.95, and 0.70 for the existing method. Figure 10b depicts the sensitivity of the proposed ShuffleNet V2 technique and the existing ResNet-101 approach. For the FER_2013, TFEID, KDEF, KMU_FED, and CK_Plus datasets, the sensitivity of the proposed technique is 0.87, 0.85, 0.99, 0.99, and 0.79, respectively, whereas the existing technique achieves 0.80, 0.78, 0.90, 0.90, and 0.70, respectively. Figure 10c illustrates the specificity of the two methods. The proposed technique achieves a specificity of 0.98, 0.97, 0.99, 0.99, and 0.97 on the respective datasets, while the existing method achieves 0.92, 0.91, 0.88, 0.87, and 0.88. These results indicate that the proposed approach outperforms the existing one.

Fig. 10 Comparison of proposed ShuffleNet V2 and existing ResNet 101 approaches: (a) Accuracy, (b) Sensitivity, (c) Specificity

The precision of the proposed ShuffleNet V2 and existing ResNet101 techniques is displayed in Fig. 11a. For the datasets FER_2013, TFEID, KDEF, KMU_FED, and CK_Plus, the precision values of the proposed technique are, in order, 0.97, 0.88, 0.99, 0.99, and 0.84. For the same datasets, the existing method's precision values are 0.92, 0.82, 0.88, 0.88, and 0.81. The proposed method therefore yields a higher precision than the existing one. Next, the F1_Score is examined and illustrated in Fig. 11b. The existing method has F1_Score values of 0.81, 0.78, 0.88, 0.89, and 0.69 for the various datasets, respectively, whereas the proposed approach's F1_Score is 0.9 on CK_Plus, 0.85 on FER_2013, 0.98 on KDEF and KMU_FED, and 0.77 on TFEID. This indicates the superiority of the proposed technique over the existing one.

Fig. 11 Comparison of proposed ShuffleNet V2 and existing ResNet-101 approaches (a) Precision (b) F1_Score
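Precision and F1_Score, as compared in Fig. 11, follow from the same true-positive, false-positive, and false-negative counts. The sketch below assumes the standard definitions (F1 as the harmonic mean of precision and recall); the function names are hypothetical and the counts in the usage lines are made up.

```python
def precision(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for a single class, for illustration only.
print(precision(6, 2), recall(6, 4), f1_score(6, 2, 4))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is often reported alongside accuracy on imbalanced emotion classes.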

Figure 12a illustrates the False Positive Rate (FPR). The existing ResNet-101, VGG-12, VGG-16, and ResNet-50 techniques produce FPR values of 0.08, 0.03, 0.09, and 0.025, respectively, whereas the proposed model has an FPR of 0.005. Because a lower FPR is better, the proposed method outperforms the existing ones on this measure. The Kappa values of the proposed ShuffleNet V2 and the existing methods are displayed in Fig. 12b. The existing ResNet-101, VGG-12, VGG-16, and ResNet-50 techniques produce Kappa values of 0.83, 0.84, 0.81, and 0.79, respectively, whereas the proposed model has a Kappa value of 0.90. Matthews's correlation coefficient (MCC) is shown in Fig. 12c. The proposed method's MCC is 0.9, compared with 0.83 for ResNet-101, 0.82 for VGG-12, 0.88 for VGG-16, and 0.84 for ResNet-50.

Fig. 12 Comparison of proposed and existing approaches (a) FPR (b) Kappa (c) MCC
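The FPR, Kappa, and MCC values in Fig. 12 can be computed from a confusion matrix. The sketch below assumes Cohen's kappa and the multi-class generalization of MCC; the source does not specify which variants were used, and the function names and `toy_cm` are hypothetical.

```python
from math import sqrt
from typing import List


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def cohen_kappa(cm: List[List[int]]) -> float:
    """Agreement between true and predicted labels, corrected for chance."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]                        # true counts
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # predicted counts
    p_o = sum(cm[i][i] for i in range(k)) / n                       # observed agreement
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0


def multiclass_mcc(cm: List[List[int]]) -> float:
    """Multi-class Matthews correlation coefficient from a confusion matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    correct = sum(cm[i][i] for i in range(k))
    numerator = correct * n - sum(row_tot[i] * col_tot[i] for i in range(k))
    denom = sqrt((n * n - sum(c * c for c in col_tot)) *
                 (n * n - sum(r * r for r in row_tot)))
    return numerator / denom if denom else 0.0


# Toy 3-class confusion matrix (made-up numbers, for illustration only).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

print(false_positive_rate(2, 18), cohen_kappa(toy_cm), multiclass_mcc(toy_cm))
```

Both Kappa and MCC discount agreement that would occur by chance, so they are more conservative than raw accuracy when class frequencies are uneven.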

Figure 13a illustrates the Negative Predictive Value (NPV). The existing ResNet-101, VGG-12, VGG-16, and ResNet-50 techniques produce NPV values of 0.89, 0.87, 0.82, and 0.85, respectively, whereas the proposed model has an NPV of 0.91. Because a higher NPV is better, the proposed method outperforms the existing ones on this measure. The False Omission Rate (FOR) values of the proposed ShuffleNet V2 and the existing methods are displayed in Fig. 13b. The existing ResNet-101, VGG-12, VGG-16, and ResNet-50 techniques produce FOR values of 0.08, 0.13, 0.16, and 0.13, respectively, whereas the proposed model has a FOR of 0.07. The False Negative Rate (FNR) is shown in Fig. 13c. The proposed method's FNR is 0.03, compared with 0.1 for ResNet-101, 0.13 for VGG-12, 0.149 for VGG-16, and 0.17 for ResNet-50.

Fig. 13 Comparison of proposed and existing approaches (a) NPV (b) FOR (c) FNR
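For reference, the three measures in Fig. 13 have the following standard binary (one-vs-rest) definitions in terms of true positives (TP), true negatives (TN), and false negatives (FN). These are textbook definitions, stated here as an assumption about how the values were computed; the source does not give the formulas in this section.

```latex
\begin{align}
\mathrm{NPV} &= \frac{TN}{TN + FN}, \\
\mathrm{FOR} &= \frac{FN}{FN + TN} = 1 - \mathrm{NPV}, \\
\mathrm{FNR} &= \frac{FN}{FN + TP} = 1 - \mathrm{Sensitivity}.
\end{align}
```

A higher NPV is better, whereas lower FOR and FNR values are better.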

Figure 14a illustrates the False Discovery Rate (FDR). The existing ResNet-101, VGG-12, VGG-16, and ResNet-50 techniques produce FDR values of 0.15, 0.07, 0.142, and 0.125, respectively, whereas the proposed model has an FDR of 0.05. Because a lower FDR is better, the proposed method outperforms the existing ones on this measure. The informedness values of the proposed ShuffleNet V2 and the existing methods are displayed in Fig. 14b. The proposed method's informedness is 0.9, compared with 0.7 for ResNet-101, 0.82 for VGG-12, 0.73 for VGG-16, and 0.83 for ResNet-50.

Fig. 14 Comparison of proposed and existing approaches (a) FDR (b) Informedness
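The two measures in Fig. 14 can be sketched as follows, assuming the usual definitions: FDR as the share of positive predictions that are wrong, and informedness (Youden's J) as sensitivity plus specificity minus one. The function names are hypothetical and the counts are made up.

```python
def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR = FP / (FP + TP), i.e. 1 - precision."""
    return fp / (fp + tp) if (fp + tp) else 0.0


def informedness(tp: int, fp: int, fn: int, tn: int) -> float:
    """Youden's J: sensitivity + specificity - 1, ranging from -1 to 1."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens + spec - 1.0


# Made-up one-vs-rest counts for a single class, for illustration only.
print(false_discovery_rate(8, 2), informedness(8, 2, 2, 18))
```

An informedness of 0 corresponds to chance-level prediction, so values near 1 indicate that the classifier is informed about both the positive and the negative class.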

The ShuffleNet V2 classifier is used in the proposed DER system to classify the driver's various emotional states. Compared with the existing methods, the proposed ShuffleNet V2 methodology has higher accuracy, sensitivity, specificity, precision, F1_Score, Kappa, Matthews's correlation coefficient (MCC), Negative Predictive Value (NPV), and informedness, and lower False Positive Rate (FPR), False Omission Rate (FOR), False Negative Rate (FNR), and False Discovery Rate (FDR). Based on these results, Driver Emotion Recognition using the ShuffleNet V2 classifier is the most appropriate of the compared methods for identifying the driver's current emotional state.

Table 6 compares the proposed model with state-of-the-art methods for driver emotion recognition, namely CNN (InceptionV3-VGG16), MLCNN, MRE-CNN, and DLBP-DCT. These existing models are evaluated on the CK_Plus dataset to obtain their accuracy, precision, sensitivity, and F1_score. Comparing these values with the performance metrics of the proposed model shows that the proposed model is a better driver emotion recognition model than the existing ones.

Table 6 Comparison of State-of-the-Art methods for driver emotional recognition

5 Conclusion

Emotion-related human–machine systems are essential for intelligent automobiles, as driver emotions affect driving performance and contribute to traffic accidents. Detecting and recognizing driver emotions is emerging as a critical factor in improving driver safety. The absence of real-scenario datasets hinders current research in on-road driver facial expression detection, which is crucial for automotive human–machine systems. In this work, a ShuffleNet V2 based Driver Emotion Recognition (DER) system has been designed to recognize and classify different driver emotions. In preprocessing, image resizing, noise removal, smoothing, and brightness enhancement are applied to the FER_2013, TFEID, KMU_FED, CK_Plus, and KDEF datasets. ROI based segmentation and KLT based texture feature extraction are then performed to reduce complexity and improve the recognition performance of the model. The different emotions are classified by the ShuffleNet V2 classifier. The accuracy obtained by the DER system with the ShuffleNet V2 classifier was higher than that of the existing models, including ResNet-101, VGG-12, VGG-16, and ResNet-50. Accuracy, precision, sensitivity, F1-score, specificity, False Positive Rate (FPR), Kappa, and MCC are among the performance measures used to assess the proposed model, which achieves values of 0.99, 0.99, 0.90, 0.89, 0.99, 0.005, 0.90, and 0.90, respectively. Thus, the proposed method may be a useful alternative to the existing techniques for recognizing driver emotions. In future work, the designed driver emotion recognition model could be deployed in the automotive industry as a safety and security feature for future vehicles.
Likewise, this feature could be applied in other industries, such as the medical, construction, chemical, petroleum, and power engineering industries, as a monitoring system for workers in critical and high-security zones. The emotion recognition model could also be used in fields such as human–computer interaction (HCI), medical health, Internet education, security monitoring, psychological analysis, and the entertainment industry.