1 Introduction

Road traffic crashes are a global social problem whose importance cannot be overstated. Statistics show that fatigued driving is the primary cause of over 60% of traffic accidents [1, 2]. Fatigued driving occurs when prolonged driving induces physiological and psychological dysfunction that significantly impairs driving ability [3,4,5]. This condition affects several facets of the driver's cognitive processes, including attention, perception, reasoning, judgment, willpower, decision making, and reaction time. Driving while fatigued significantly increases the risk of road crashes, which makes early detection of this condition critical to road safety.

Drivers exhibit distinct physiological and psychological symptoms when they become fatigued [6]. By driving duration, fatigued driving can be classified as short-term or long-term. In short-term fatigued driving, drivers exhibit the following characteristics: (1) increased blink frequency, tiredness, and reduced attention to safety; (2) inaccurate and untimely gear shifting and a lack of focused attention; (3) inability to adjust driving behavior, such as acceleration, deceleration, and steering, to changing road conditions. Long-term fatigued driving is characterized by: (1) dry mouth, frequent yawning, head nodding, and difficulty keeping the head upright; (2) painful, dry eyes that struggle to stay open, drowsiness, and blurred vision; (3) low mood, slow reaction time, and impaired judgment [7,8,9].

At present, research on fatigue driving is mainly focused on the following three directions:

The first direction evaluates and detects the fatigued driving state from physiological signal characteristics. This approach mainly includes the use of electrocardiography (ECG) [10, 11] and photoplethysmography (PPG) to detect cardiac signals [12, 13], multi-channel electroencephalography (EEG) signals [14], surface electromyography (sEMG) to detect electromyographic (EMG) signals [15], and measurement of electrooculography (EOG) signals between the cornea (positively charged) and the retina (negatively charged) [16]. These methods of detecting fatigue from physiological signals have strong biological grounding and can achieve high detection accuracy. However, they require drivers to wear special instruments during measurement, which can greatly interfere with driving. In addition, professional detection instruments are generally very expensive, which makes them difficult to apply in practice.

The second direction is driving fatigue detection based on vehicle characteristics. It indirectly detects and judges signs of fatigue from vehicle behavior, such as steering wheel angle, driving speed, acceleration, trajectory, lane offset, the pressure exerted on the seat by the driver's body, and brake pedal pressure [17,18,19]. This vehicle- and driver-behavior-based method avoids direct physical contact with the driver and does not interfere with driving. However, weather and road conditions, vehicle models, and driver habits can significantly affect the accuracy of these methods, making them relatively less robust than other approaches.

The third direction is facial fatigue detection, which analyzes the eyes, mouth, facial expression, nose, and head posture [20,21,22]. Compared with traditional approaches, computer vision-based facial feature fatigue detection has several advantages, including non-contact, non-interfering operation and high detection accuracy, making it a new research hotspot in this area. Mbouna [9] used an SVM to classify alert and non-alert states. Zhao [23] identified fatigue expressions, extracted fatigue expression features, and classified them with a random subspace ensemble of SVMs with a polynomial kernel function. Ahmad [24] studied eye opening and closing together with head movement detection to detect fatigue, using the Viola-Jones method for face detection and the CART method on Haar features of the human eye ROI region. Ghoddoosian [25] used dlib to extract eye key points, computed EAR (Eye Aspect Ratio) values, and extracted blink features; a time window fed the blink features to an HM-LSTM network, which learned blink characteristics over time, and a combination of fully connected layers, regression units, and discretization then mapped the output to KSS values in three alertness states. Li [26] detected driver fatigue by measuring the duration of eye closure, blink frequency, and yawn frequency: YOLOv3-Tiny performs face detection, the dlib toolkit extracts feature vectors of the eyes and mouth, and an SVM classifier then evaluates the fatigue state from eye closure time, blink frequency, and yawn frequency.

The remainder of this paper is organized as follows: Section 2 describes the empirical fusion of the KSS values of multiple fatigue behaviors using the f1, f2, and f3 operators to establish the logical relationships among those behaviors, together with two KNN models used for real-time early- and late-fatigue estimation. Section 3 details the detection process and the determination of the KSS values. Section 4 discusses the experiments and results: analysis on a self-curated dataset and a simulated fatigue test on a real vehicle, both of which gave promising results compared with other algorithms. Finally, Section 5 presents concluding remarks.

2 Facial fatigue detection algorithm

There are several challenges associated with the practical application of fatigue detection technology. One is the requirement for real-time performance, which limits the choice of models: while deep learning models offer high accuracy, they are time-consuming during the inference phase, which makes them difficult to use as the basis for fatigue detection. Fatigue detection is also a classification task, but compared with object classification, its class boundaries are not well defined. Early fatigue detection typically focuses on one specific fatigue behavior, and building a multi-feature model requires unsupervised or supervised learning methods. However, constructing multiple fatigue behavior features is time-consuming, so the fusion analysis model must keep its processing time short. In addition, KSS labeling involves many subjective factors, which can lead to overfitting when training supervised models.

The method proposed in this paper is a visual monitoring technique that analyzes the driver's facial features in real time using a vehicle camera to determine the driver's fatigue state. The method uses a multi-feature empirical fusion model that accounts for the driver's personal characteristics and habits by determining an appropriate threshold. The model assigns a Karolinska Sleepiness Scale (KSS) score and a fatigue behavior weight, mapping the multidimensional combination of facial behaviors to a fatigue-related KSS score to assess the driver's fatigue state. The authors believe this approach provides an effective, non-invasive way to determine driver fatigue status, which can benefit the development of advanced driver assistance systems and reduce driver-related accidents.

Figure 1 shows the fatigue detection framework, which includes three operators (f1, f2, and f3) that establish the logical relationship between different fatigue behaviors by empirically fusing the KSS values. In addition, the framework includes two K-Nearest Neighbors (KNN) models—one for short-term fatigue detection and another for long-term fatigue detection. The use of both models ensures that the proposed method can detect both early and late fatigue in real time, making it applicable to real-world driving scenarios. The overall architecture of the framework represents a comprehensive and practical approach to driver fatigue detection.

Fig. 1 Detection framework of the facial fatigue detection algorithm

The steps of the fatigue detection algorithm are as follows (a minimal code sketch of the pipeline appears after the list):

1. Face detection: use the SCRFD-0.5GF+ model.
2. 68 facial key point detection: use the MobileNetV3-56++ model.
3. Head motion detection: use the EPnP algorithm to calculate the 3 rotational and 3 translational degrees of freedom of the head posture. The first-order difference of each degree of freedom is computed and compared against a threshold to detect nodding, normal movement, head rest, and forward or backward head tilt.
4. Head forward and backward motion detection: use the pinhole imaging principle to calculate the distance between the face and the camera. A first-order difference of the distance and a threshold judgment detect the forward and backward tilt motion.
5. Blink detection: use a head-pose-calibrated adaptive blink threshold (adaptive_EAR_threshold), with EAR and PERCLOS for two-stage blink detection.
6. Yawn detection: a yawn detection algorithm based on head posture uses MAR (Mouth Aspect Ratio) and FOM (Frequency of Occurrence of Mouth Opening) for two-stage yawn detection.
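To make the flow of these six steps concrete, the sketch below wires them together in Python. All of the detector objects (`face_det`, `landmark_det`, `head_est`, `blink_det`, `yawn_det`) are hypothetical stand-ins for the SCRFD-0.5GF+, MobileNetV3-56++, EPnP, EAR/PERCLOS, and MAR/FOM modules described in this section; this is a minimal sketch, not the authors' implementation.

```python
# Minimal orchestration of the six steps; all detector objects are
# hypothetical placeholders with duck-typed interfaces.
from dataclasses import dataclass

@dataclass
class FrameResult:
    head_state: str   # h1..h4: nod / lean / normal movement / static
    eye_state: str    # e1..e3: fast blink / slow blink / normal
    mouth_state: str  # m1..m2: yawning / normal

def process_frame(frame, face_det, landmark_det, head_est, blink_det, yawn_det):
    """Run one frame through the six detection steps and return behavior codes."""
    box = face_det.detect(frame)               # step 1: SCRFD-0.5GF+ face box
    if box is None:
        return None                            # no face found in this frame
    pts = landmark_det.predict(frame, box)     # step 2: 68 facial key points
    pose, dist = head_est.solve(pts, box)      # steps 3-4: EPnP pose + distance
    return FrameResult(
        head_state=head_est.classify(pose, dist),  # first-order diff vs. threshold
        eye_state=blink_det.update(pts, pose),     # step 5: EAR + PERCLOS
        mouth_state=yawn_det.update(pts),          # step 6: MAR + FOM
    )
```

The per-frame behavior codes produced here feed the KSS mapping and fusion stages described next.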

Table 1 shows the fatigue behavior codes and KSS value settings used in the proposed algorithm. The table lists three types of detection features: the yawning (m1) and normal (m2) states of the mouth; the fast blinking (e1), slow blinking (e2), and normal (e3) states of the eye; and the nodding (h1), forward/backward leaning (h2), normal movement (h3), and static (h4) states of the head posture. These behaviors are used to identify instances of driver fatigue and map them to the corresponding KSS values. The table thus provides a comprehensive picture of the driver's fatigue state and enables effective fatigue detection.

Table 1 Fatigue behavior codes

Given the objective of only detecting the driver's fatigue status, the proposed algorithm adopts a focused approach that reduces computational burden and increases efficiency. Specifically, to optimize resource utilization, we use a subset of the KSS sleepiness quantification table, with values ranging from 4 to 9, to assess driver fatigue, as detailed in Table 2.

Table 2 Karolinska Sleepiness Quantification Table [27]

There is a corresponding relationship between the observed object, the observed behavior, and the value of the fatigue level (KSS). Specifically, the mouth fatigue range is 4 to 7, the head fatigue range is 4 to 8, and the eye fatigue range is 4 to 9, as shown in Fig. 2.

Fig. 2 Relationship between object, behavior, and KSS
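To illustrate how behavior codes could map to KSS values in code, here is a minimal Python sketch. The specific per-code values below are placeholders chosen to lie within the ranges of Fig. 2 (Table 1's exact values are not reproduced in the text), so treat them as assumptions.

```python
# Placeholder mapping from behavior codes (Table 1) to KSS values, kept
# within the per-object ranges of Fig. 2 (mouth 4-7, head 4-8, eye 4-9).
BEHAVIOR_KSS = {
    "m1": 7, "m2": 4,                    # mouth: yawning / normal
    "e1": 6, "e2": 9, "e3": 4,           # eye: fast blink / slow blink / normal
    "h1": 8, "h2": 6, "h3": 4, "h4": 5,  # head: nod / lean / normal move / static
}

def kss_norm(code, kss_min=4, kss_max=9):
    """Normalize a code's KSS value to [0, 1] over the KSS subset used (4-9)."""
    return (BEHAVIOR_KSS[code] - kss_min) / (kss_max - kss_min)
```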

The empirical fusion of multiple fatigue behavior KSS values combines normalized empirical KSS values with normalized counts of fatigue behavior detections, as shown in Fig. 3. Three operators are defined: singleton (f1), mutual (f2), and activate/inhibit (f3). A cause-and-effect diagram of the fatigue behaviors, constructed from human experience, gives the three operators their specific meanings:

Fig. 3 KSS value fusion diagram of multiple fatigue behaviors

$$f_{1}=\alpha \times \mathrm{KSS\_norm}_{\mathrm{code}_{i}}\times \mathrm{count\_norm}_{\mathrm{code}_{i}}$$
(1)
$$f_{2}=\tanh\left(\beta \sum\nolimits_{j}\left(\mathrm{KSS\_norm}_{\mathrm{code}_{j}}\times \mathrm{count\_norm}_{\mathrm{code}_{j}}\right)\right)+\alpha \max_{j}\left(\mathrm{KSS\_norm}_{\mathrm{code}_{j}}\times \mathrm{count\_norm}_{\mathrm{code}_{j}}\right)$$
(2)
$$f_{3}=\tanh\left(\beta \sum\nolimits_{k}\left(\mathrm{KSS\_norm}_{\mathrm{code}_{k}}\times \mathrm{count\_norm}_{\mathrm{code}_{k}}\right)\right)$$
(3)
$$\mathrm{activate}=f_{3}=-\mathrm{inhibit}$$
(4)

The f1 operator is designed to detect three common signs of fatigue: blinking, yawning, and nodding. First, a high KSS value is assigned to determine the onset of fatigue, and the operator then calculates the frequency of these signs to estimate the level of subsequent fatigue.

The f2 operator focuses on identifying early fatigue signs such as head tilting forward/backward and rapid blinking. To estimate the maximum level of early fatigue, the operator combines the initially assigned KSS values with the detection counts, passes the weighted sum through a tanh activation function, and adds the maximum single-behavior term, as in Eq. (2).

The f3 operator plays a complementary role to the f1 and f2 operators. It not only triggers the f1 operator and amplifies the subsequent fatigue value, but also dampens the f2 operator to reduce early fatigue values and mitigate potential early fatigue misjudgments.
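The three operators translate directly into code. The sketch below is a minimal Python rendering of Eqs. (1)-(3); the values of α and β are not specified in this section, so the defaults here are assumptions.

```python
import math

def f1(kss_norm_i, count_norm_i, alpha=1.0):
    """Singleton operator, Eq. (1): scaled product of one behavior's
    normalized KSS value and normalized detection count."""
    return alpha * kss_norm_i * count_norm_i

def f2(terms, alpha=1.0, beta=1.0):
    """Mutual operator, Eq. (2). `terms` is a non-empty list of
    (kss_norm_j, count_norm_j) pairs for the early-fatigue behaviors."""
    products = [k * c for k, c in terms]
    return math.tanh(beta * sum(products)) + alpha * max(products)

def f3(terms, beta=1.0):
    """Activate/inhibit operator, Eq. (3); its negation gives the
    inhibit value of Eq. (4)."""
    return math.tanh(beta * sum(k * c for k, c in terms))
```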

The facial fatigue detection algorithm uses long- and short-term KNN models to learn the fatigue thresholds, taking the short-window KSS and long-window KSS sequences extracted from each video as training samples for two separate KNN models. To ensure efficient performance, the dataset is pre-processed and normalized for real-time use during early and late fatigue estimation.
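A minimal sketch of this two-window KNN stage is given below, assuming scikit-learn and a placeholder feature layout (the exact composition of the short- and long-window KSS feature vectors is not fixed in the text).

```python
# Sketch of the two-window KNN stage: one model is fitted on short-window
# features, another on long-window features; toy data stands in for both.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def fit_window_knn(X, y, k=5):
    """Normalize the window features, then fit one KNN model (done once
    each for the short-window and long-window training sets)."""
    scaler = StandardScaler().fit(X)
    knn = KNeighborsClassifier(n_neighbors=k).fit(scaler.transform(X), y)
    return scaler, knn

# Example: X_short holds per-video short-window features, y the fatigue labels.
X_short, y = np.random.rand(40, 6), np.random.randint(0, 2, 40)  # toy data
scaler_s, knn_short = fit_window_knn(X_short, y)
print(knn_short.predict(scaler_s.transform(X_short[:3])))
```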

3 Facial feature point detection

The KSS values are determined in several steps. First, face detection is performed using the SCRFD-0.5GF+ algorithm. Next, 68 facial key points are detected using the MobileNetV3-56++ algorithm. Subsequently, the EPnP algorithm is used to calculate the three rotational and three translational degrees of freedom of head movement. The pinhole imaging principle is applied to detect whether the head is moving forward or backward. Finally, two-stage slow-blink detection is performed using EAR and PERCLOS (Percentage of Eye Closure over Time), and two-stage yawn detection is performed using the MAR and FOM measures. These steps allow the KSS values to be quantified accurately during fatigue detection.

3.1 Face detection

The face detection method used in this paper is SCRFD-0.5GF+ [28]. It is a lightweight model well suited for deployment on edge devices with limited computational resources due to its small size and low computational cost. SCRFD-0.5GF+ uses a backbone network to extract features from the input image and predicts the position and category of objects through a series of convolutional layers. A feature pyramid network (FPN) is used to capture multi-scale features. The FPN architecture combines bottom-up and top-down pathways, aggregating feature maps from different levels of the backbone so that the model can effectively detect targets of different sizes and scales. Training samples are randomly cropped into square patches, and more training samples are allocated to smaller scales to improve detection via a sample and computation allocation mechanism, as shown in Fig. 4. The output values are used in the subsequent feature extraction and KSS value prediction steps. The "class" output represents the category of the driver's detected facial state, one of four states: awake, mildly fatigued, fatigued, and severely fatigued. The "box" output represents the region of interest (ROI) detected on the driver's face, i.e., the driver's facial region. The "mask" output represents the further segmentation and localization of the detected facial ROI, which is used to accurately extract facial features.

Fig. 4 SCRFD-0.5GF+ backbone network architecture
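For readers who want to try a stock SCRFD detector, the publicly released models in the insightface package offer a close approximation; note that the paper's SCRFD-0.5GF+ is a modified variant, so this sketch is only illustrative, and the input file name is hypothetical.

```python
# Approximate reproduction of the face-detection step with the public SCRFD
# models shipped in the insightface package (not the paper's modified model).
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(allowed_modules=["detection"])  # detection model only
app.prepare(ctx_id=0, det_size=(640, 640))         # 640 x 640, as in the tests

img = cv2.imread("driver_frame.jpg")               # hypothetical input frame
for face in app.get(img):
    x1, y1, x2, y2 = face.bbox.astype(int)         # the "box" output (face ROI)
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
```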

Several methods are compared for accuracy and efficiency on the validation set. The test images have a size of 640 × 640 and are evaluated using FaceBoxes, Mobile-0.5GF, SCRFD-0.5GF, and SCRFD-1GF. "# Params" and "# Flops" denote the number of parameters and the number of floating-point operations, respectively. Inference is benchmarked on an NVIDIA 2080Ti at 640 × 640 resolution. The test results are shown in Fig. 5 and Table 3.

Fig. 5 Accuracy of different methods on the validation set

Table 3 Comparison between SCRFD-0.5GF+ and other network structures [28]

3.2 Detection of face feature points

After obtaining the face bounding box from the improved SCRFD-0.5GF+ model, feature points are detected within the box. For this purpose, the lightweight MobileNetV3-56++ model is used to obtain the facial key points. MobileNetV3-56 [29, 30] is a lightweight neural network architecture designed for efficient image classification on mobile devices. An important innovation of MobileNetV3-56 is its use of "squeeze-and-excite" (SE) blocks, which enhance the capture of channel dependencies and adaptively recalibrate feature maps, improving model accuracy while keeping the number of parameters and the computational cost low. The model can locate key points from coarse to fine with only a few parameters.

The SE module is added to the MobileNetV3 block and the activation function is replaced, as shown in Fig. 6. Because different activation functions are used at different points, the figure labels them generically as NL (nonlinearity). Two main activation functions are used: ReLU and Hardswish. The final 1 × 1 dimension-reducing projection layer uses the linear activation function (f(x) = x).

Fig. 6 MobileNetV3-56 improvements
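The SE mechanism itself is compact. Below is a standard squeeze-and-excite block in PyTorch of the kind MobileNetV3 inserts into its bottlenecks; the reduction ratio of 4 and the hard-sigmoid gate follow the common MobileNetV3 convention and are assumptions, not the paper's exact configuration.

```python
# Standard squeeze-and-excite (SE) block: global pooling summarizes each
# channel, a small bottleneck MLP produces per-channel gates in [0, 1].
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global context
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),                        # gate in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(self.pool(x))             # excite: rescale channels

# Example: recalibrate a 16-channel feature map.
y = SqueezeExcite(16)(torch.randn(1, 16, 56, 56))
```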

Table 4 shows the architecture of MobileNetV3-56++. The Input column indicates the input size, and NBN in the operator column indicates the absence of batch normalization. The last conv2d 1 × 1 layer corresponds to a fully connected layer. Exp size is the dimension to which the first conv2d 1 × 1 layer in the bottleneck expands, and Out is the number of output channels of the bottleneck. SE indicates whether the SE module is used, and NL indicates which activation function is used: HS stands for Hardswish and RE for ReLU. Finally, s is the stride; when s = 2, the feature map height and width are halved.

Table 4 MobileNetV3-56++ body architecture

3.3 Mouth feature detection

The mouth feature detection method uses the MobileNetV3-56++ model to capture facial key points, extracts the mouth feature points, and then identifies the shape and motion of the lips. The two-stage yawn detection method uses MAR and FOM. MAR [31] is the mouth aspect ratio, which is useful for detecting mouth openings. FOM [31] is the frequency of open-mouth frames, i.e., the number of frames in which the mouth is open within a given time window. In the first stage, the distance between the upper and lower lips is divided by the distance between the left and right mouth corners to obtain the MAR value; once the MAR value exceeds a threshold, a yawn is preliminarily flagged. In the second stage, the FOM value is accumulated over a period of time; if it exceeds a threshold, the event is confirmed as a yawn. Using MAR and FOM in combination improves the accuracy and robustness of yawn detection. Figure 7 illustrates the complete mouth detection process.

Fig. 7 Mouth feature detection
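A minimal Python sketch of this two-stage check is given below. The landmark indices follow the common 68-point scheme; the MAR/FOM thresholds and window length are illustrative assumptions, not the paper's calibrated values.

```python
import numpy as np
from collections import deque

def mouth_aspect_ratio(pts):
    """MAR = inner-lip vertical opening / mouth width (68-point indices)."""
    vertical = np.linalg.norm(pts[62] - pts[66])    # upper vs. lower inner lip
    horizontal = np.linalg.norm(pts[60] - pts[64])  # left vs. right mouth corner
    return vertical / horizontal

class YawnDetector:
    """Stage 1: per-frame MAR threshold flags an open mouth. Stage 2: the
    count of open-mouth frames (FOM) over a sliding window confirms a yawn."""
    def __init__(self, mar_thresh=0.6, fom_thresh=45, window=90):
        self.mar_thresh, self.fom_thresh = mar_thresh, fom_thresh
        self.open_flags = deque(maxlen=window)

    def update(self, pts):
        self.open_flags.append(mouth_aspect_ratio(pts) > self.mar_thresh)
        return sum(self.open_flags) > self.fom_thresh  # True = yawn confirmed
```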

3.4 Eye feature detection

Eye feature detection based on a head-pose-calibrated adaptive blink threshold (adaptive_EAR_threshold) requires two-stage slow-blink detection using EAR (Eye Aspect Ratio) and PERCLOS (Percentage of Eye Closure over Time). EAR [32] is typically used to detect whether the eye is closed; it is calculated from the distances between eye landmarks, such as the eye corners and the points on the upper and lower eyelids. When the eye closes, the distances between these landmarks decrease, lowering the EAR value. PERCLOS [33], often used in applications such as drowsy driving and crew fatigue assessment, measures the ratio of the time the eye is closed to the total time. The adaptive_EAR_threshold adjusts the EAR threshold based on head pose calibration: because the apparent shape and position of the eyes change with head pose, the EAR threshold must adapt accordingly to ensure accurate detection.

Two-stage slow-blink detection divides the closed-eye state into two categories: fast blinks and slow blinks. A fast blink is a brief eye closure, while a slow blink is an eye closure that lasts noticeably longer. Dividing eye closures into these two categories allows the eye state to be detected and subsequently processed more accurately. Figure 8 illustrates the complete eye detection process.

Fig. 8 Eye feature detection
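The following sketch shows EAR-based closure detection with PERCLOS over a sliding window. The base threshold and the simple linear head-pose correction are assumptions standing in for the paper's calibrated adaptive_EAR_threshold, and the landmark indices use the common 68-point layout (36-41 left eye, 42-47 right eye).

```python
import numpy as np
from collections import deque

def eye_aspect_ratio(eye):
    """EAR for one eye from its six landmarks (standard six-point formula)."""
    a = np.linalg.norm(eye[1] - eye[5])   # upper vs. lower lid, first pair
    b = np.linalg.norm(eye[2] - eye[4])   # upper vs. lower lid, second pair
    c = np.linalg.norm(eye[0] - eye[3])   # eye-corner to eye-corner width
    return (a + b) / (2.0 * c)

class BlinkDetector:
    """Stage 1: per-frame closure via a pose-adjusted EAR threshold.
    Stage 2: PERCLOS, the closed-frame ratio over a sliding window."""
    def __init__(self, base_thresh=0.21, window=150):
        self.base_thresh = base_thresh
        self.closed = deque(maxlen=window)

    def update(self, pts, pitch_deg=0.0):
        thresh = self.base_thresh * (1.0 - 0.005 * abs(pitch_deg))  # crude pose term
        ear = (eye_aspect_ratio(pts[36:42]) + eye_aspect_ratio(pts[42:48])) / 2
        self.closed.append(ear < thresh)
        return sum(self.closed) / len(self.closed)  # PERCLOS in [0, 1]
```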

3.5 Head pose feature detection

Once SCRFD-0.5GF+ has framed the face, head pose detection is performed. Head detection can be divided into two parts, as shown in Fig. 9. First, head motion detection uses the EPnP algorithm to calculate the three rotational and three translational degrees of freedom of the head posture; nodding, normal head motion, head rest, and forward or backward tilt are detected by computing the first-order difference of each degree of freedom and comparing it against a threshold. Second, head forward and backward motion detection uses the pinhole imaging principle to calculate the distance between the face and the camera, obtains the rate of change of that distance by first-order differencing, and determines whether the head is moving forward or backward by comparing the rate of change with a threshold. During the head pose estimation phase, the EPnP algorithm [34] computes the 3 rotational and 3 translational degrees of freedom of the head pose from known 3D points and their corresponding 2D points.

Fig. 9 Head pose feature detection

During the head motion detection phase, the first-order difference is computed for each degree of freedom of the head, which gives the rate of change of each degree of freedom. Detection then judges the state of the head motion, including nodding, normal motion, head rest, and forward or backward head tilt, based on a comparison between the change rate and the threshold. By implementing these enhancements, the accuracy and robustness of head pose detection can be improved.
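The pose-estimation core of this section maps naturally onto OpenCV's EPnP solver. In the sketch below, the 3D face model points, the pinhole camera intrinsics, the assumed real face width, and the nod threshold are illustrative assumptions.

```python
import cv2
import numpy as np

MODEL_3D = np.array([                                  # generic 3D face model (mm)
    (0.0, 0.0, 0.0), (0.0, -330.0, -65.0),             # nose tip, chin
    (-225.0, 170.0, -135.0), (225.0, 170.0, -135.0),   # left/right eye corner
    (-150.0, -150.0, -125.0), (150.0, -150.0, -125.0), # left/right mouth corner
])

def head_pose(image_pts, frame_w, frame_h):
    """Solve rotation and translation with EPnP from six matched 2D points
    (float64 array of shape (6, 2)), using a pinhole camera approximation."""
    K = np.array([[frame_w, 0, frame_w / 2],
                  [0, frame_w, frame_h / 2],
                  [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, image_pts, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    return rvec, tvec

def face_distance(pixel_width, focal_px, real_width_mm=150.0):
    """Pinhole-model distance estimate used for forward/backward detection."""
    return focal_px * real_width_mm / pixel_width

def is_nod(prev_pitch_deg, pitch_deg, thresh_deg=8.0):
    """First-order difference of the pitch angle flags a nodding motion."""
    return abs(pitch_deg - prev_pitch_deg) > thresh_deg
```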

4 Experimental results

The experimental platform mainly consists of the central control unit, camera, horn and bus, and is installed in the experimental vehicle, as shown in Fig. 10.

Fig. 10 Construction and environment of the experimental platform

4.1 Data preparation

Facial feature-based detection was performed using a self-curated hybrid dataset containing four mental state categories: awake, mild fatigue, moderate fatigue, and severe fatigue. These four states reflect the stages from full wakefulness to severe fatigue, enabling the study to explore in depth how changes in fatigue level affect the detection algorithm's performance. The data are structured as follows. The web-collected portion provides 1671 sample images: 432 awake, 437 mild fatigue, 435 moderate fatigue, and 367 severe fatigue. These images come from different environments and scenarios and show varied facial features, which supports the model's generalization ability. The video-derived portion comes from a 60 fps video stream at a resolution of 780 × 580, which ensures image clarity and detail; it contributes 3104 sample images: 761 awake, 774 mild fatigue, 737 moderate fatigue, and 832 severe fatigue. Finally, 3602 sample images were taken from the public NTHU drowsy driver detection dataset and the Closed Eyes in the Wild (CEW) dataset: 834 awake, 954 mild fatigue, 862 moderate fatigue, and 952 severe fatigue. These images are sufficient in number and high in quality, providing a solid foundation for model training and validation. In total, there are 2027 awake, 2165 mild fatigue, 2034 moderate fatigue, and 2151 severe fatigue sample images; Fig. 11 shows examples from the dataset.

Fig. 11 Sample images from the dataset

4.2 Experimental analysis

The performance of the algorithm is evaluated by fivefold cross-validation and compared with traditional models: Random Forest (RF), Support Vector Machine (SVM), Radial Basis Function neural network (RBF), Bayesian Classification (BC), and Random Forest with Multi-feature Fusion (RFWF). The RF model uses an SVM-fused random forest algorithm, the SVM model uses the PSO-SVM algorithm, the RBF model uses the SOM algorithm, and the BC model uses a PCA-based Bayesian model. In addition, this study considers the runtime performance of the algorithm, i.e., the time consumed by a single identification pass. This aspect is essential because a fatigue detection system must determine the driver's state in real time.
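For reference, a fivefold cross-validation run of this kind takes only a few lines with scikit-learn; the features and labels below are toy placeholders standing in for the extracted facial-behavior features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 6)          # toy stand-in for behavior features
y = np.random.randint(0, 4, 200)    # 0 = awake .. 3 = severe fatigue
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(f"fold accuracies: {scores}, mean: {scores.mean():.4f}")
```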

Table 5 shows the results of the above models on the dataset. A0 indicates the average detection accuracy of the awake state, A1 indicates the average detection accuracy of the mild fatigue state, A2 indicates the average detection accuracy of the moderate fatigue state, A3 indicates the average detection accuracy of the severe fatigue state, and Av indicates the average detection accuracy of the four states.

Table 5 Test Results

The test results show that the algorithm proposed in this paper achieves high detection accuracies of 90.34%, 93.17%, 95.46%, and 99.67% for the four fatigue states. The average accuracy reaches 94.66%, which is 3.86% higher than the traditional RF model and at least 5% higher than the SVM, RBF, and BC models. In addition, the algorithm runs relatively fast thanks to the optimization and lightweight design of each detection step, and multi-feature parallel detection further improves computational efficiency. The test results are shown in Fig. 12, where the four states are clearly distinguished.

Fig. 12 Detection status diagram

4.3 Validation test

For safety reasons, the fatigue states were manually simulated. The dataset consisted of 900 sober driving samples (including 150 interference samples such as talking or rubbing the eyes), 650 mild fatigue samples, 455 moderate fatigue samples, and 550 severe fatigue samples, for a total of 2555 valid samples. Each fatigue sample lasted between 3 and 8 min. Using these artificial fatigue simulations, the proposed algorithm's accuracy was comprehensively evaluated, as shown in Table 6.

Table 6 Comprehensive evaluation of the proposed algorithm

The test results indicate that the algorithm detects fatigued driving behavior with a high degree of accuracy, with an average detection accuracy of 98.35%. However, the tests also revealed false and missed detections across the test videos, which may be attributable to variations in the duration and severity of fatigue.

To further verify the detection performance of the proposed algorithm, it is compared with current mainstream fatigue driving detection algorithms on the self-curated dataset; the experimental results are shown in Table 7. As the table shows, at lower computational cost and lighter model weight, the proposed algorithm achieves the highest mean average precision: its mAP is 1.6% higher than that of the lightweight EfficientDet-D2, while using fewer parameters and less computation. This is because the proposed method combines lightweight processing with deep extraction of facial information, and further strengthens the focus on category features and the connection of contextual information through feature mapping and a lightweight feature enhancement module. In summary, the proposed algorithm has strong overall detection performance.

Table 7 Validation of the mainstream algorithms on a self-curated dataset

5 Conclusions

This paper presents a comprehensive facial feature-based driver fatigue detection algorithm that integrates several innovative techniques to improve detection accuracy and reliability. The main features of the proposed algorithm are:

1. The multi-feature fusion approach not only detects typical fatigue indicators such as blinking and yawning, but also incorporates new fatigue indicators such as forward and backward head tilt, thereby improving the overall comprehensiveness and precision of detection.
2. By fusing and analyzing multiple fatigue-related features, the algorithm can more accurately detect a range of driver postures, resulting in improved overall detection accuracy and robustness.
3. The algorithm's ability to map facial movements to KSS scores enables real-time assessment of fatigue levels, improving the system's performance and accuracy in detecting driver drowsiness.
4. Decomposing fatigue videos into long and short KSS sequences, followed by early- and late-stage machine learning training, allows the algorithm to use the available training data more effectively, thereby improving its generalization ability and adaptability.

The proposed algorithm can effectively detect driver fatigue and provide timely warning signals, which is significant for promoting traffic safety and provides valuable insights for the future development of fatigue detection technology.