
1 Introduction

Driver fatigue resulting from sleep deprivation or sleep disorders is one of the major causes of traffic accidents on today's roads. Automatic detection of driver fatigue, early enough to warn the driver, could therefore save many lives and prevent much personal suffering. However, general standards for fatigue are unavailable. In this context, it is crucial to use new technologies to monitor drivers and measure their fatigue level throughout the driving process. The main fatigue measures developed in recent years are based on the running status of the vehicle, the driver's physiological signals, or the driver's visual features.

The running status of the vehicle, such as sudden changes of the steering wheel angle [1], unstable speed, and unusual lane-changing patterns, usually reflects the drowsiness of the driver. Since fatigue is a physiological state, driver drowsiness can also be monitored through biomedical signals such as pulse rate, heart rate, electromyography signals [2] and electroencephalogram (EEG) data [3]. Jung et al. monitored the driver's fatigue level by evaluating heart rate variability (HRV) sensed by an electrocardiogram (ECG) sensor with electrically conductive fabric electrodes embedded in the steering wheel of a car [4]. Nambiar et al. classified the power spectrum of the driver's HRV obtained from ECG signals to detect the onset of driver drowsiness [5]. Yang et al. detected driver drowsiness by observing blink patterns and other driving-relevant eye movements recorded by electrooculography [6]. Murugappan et al. analyzed the frequency bands of the EEG signal to classify the driver's drowsiness level with a subtractive fuzzy classifier [7]. However, the devices needed to acquire biomedical signals are usually expensive and intrusive, and the acquired signals are prone to degradation by noise in real applications.

With the popularity of low-cost cameras, videos of drivers can be obtained non-intrusively, and the drowsiness level of the driver can be determined from visual facial features. Different symptoms have been introduced for driver drowsiness monitoring, such as eye closure, blinking, yawning, and fixed gaze direction [8, 9]. Among the commonly used techniques for detecting and evaluating driver fatigue, PERCLOS [10] (percentage of eyelid closure over the pupil over time) correlates most closely with driver fatigue in driving-simulator studies: the driver is considered drowsy when the eyes are substantially closed (e.g., 80 % or more occluded) for more than 15 % of a unit time interval. Hong et al. [10] detected the human face with Adaboost classifiers [11], tracked the eyes in the video with the CAMSHIFT algorithm, estimated the eye states with a complexity function with a dynamic threshold, and determined the drowsiness state of the driver by PERCLOS. Hardeep et al. detected driver fatigue from the blinking and closure of the driver's eyes and issued a timely warning; this approach requires the camera to focus on the area around the driver's eyes to improve detection speed [12]. Coetzer et al. showed that, among artificial neural networks, support vector machines and AdaBoost-based classifiers, AdaBoost is the most suitable eye classifier for a real-world driver fatigue monitoring system [13]. Zhang et al. localized the eye region with an AdaBoost classifier; in [14], driver fatigue is detected by calculating PERCLOS based on eye closure determined by the lattice degree of nearness of the Fourier descriptor. Zhang et al. used drowsiness features extracted from the driver's eyes tracked in the video: six measures, namely PERCLOS, blink frequency, maximum closure duration, average opening level, opening velocity, and closing velocity, are combined using Fisher's linear discriminant functions to classify the drowsiness state [15]. More recently, driver drowsiness detection has been realized through facial expression analysis. Tadesse et al. analyzed the driver's facial expression through Hidden Markov Model (HMM) based dynamic modeling to detect drowsiness [16]. Ling et al. applied Gabor filters to facial images divided into multiple sub-blocks and learned a discriminative local structured dictionary via the Fisher discrimination dictionary learning method to obtain efficient drowsiness features [17]. Moreover, Anumas et al. implemented a system that combines facial features and biomedical signals using a fuzzy classifier to infer the level of driver fatigue [18].

The ultimate goal of this research is to develop a camera-based driver fatigue monitoring system centered on tracking the driver's eyes, since the eyes provide the most information about fatigue. The precision of facial feature extraction is important for assessing drowsiness, so it is important to explore innovative technologies for monitoring the driver's visual attention. For the test videos, we initialize the facial position using an Adaboost-based face detector. We then use the supervised descent method (SDM) [19] to accurately fit facial landmarks on the eyes, nose, mouth and eyebrows. These facial landmarks are used to initialize the landmarks in the next frame. When the face cannot be detected or the facial landmarks cannot be located correctly in the current frame, the face is re-detected with the Adaboost detector in the next frame. Our method determines visual fatigue from the detected facial landmarks. A minimal sketch of this detect-then-track loop is given below.
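For concreteness, the following Python sketch outlines the loop under stated assumptions: OpenCV's Haar cascade, an AdaBoost-based detector, stands in for our face detector, while sdm_fit and init_from_face_box are hypothetical placeholders for the SDM landmark fitting of Sect. 2 and for placing an initial (mean) shape inside the detected face box.

```python
import cv2

# OpenCV's Haar cascade serves as the AdaBoost-style face detector (assumption).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def init_from_face_box(box):
    """Hypothetical placeholder: place a mean face shape inside the detected box."""
    raise NotImplementedError

def sdm_fit(gray, init_landmarks):
    """Hypothetical placeholder for the SDM refinement of Sect. 2.
    Returns (landmarks, ok), where ok indicates a successful fit."""
    raise NotImplementedError

def track_landmarks(video_path):
    """Yield per-frame facial landmarks; re-detect the face whenever tracking fails."""
    cap = cv2.VideoCapture(video_path)
    landmarks = None                       # landmarks carried over from the last frame
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if landmarks is None:              # (re-)initialize from face detection
            faces = face_cascade.detectMultiScale(gray, 1.1, 5)
            if len(faces) == 0:
                continue                   # no face found: try the next frame
            landmarks = init_from_face_box(faces[0])
        landmarks, fit_ok = sdm_fit(gray, landmarks)
        if not fit_ok:
            landmarks = None               # force re-detection on the next frame
            continue
        yield landmarks                    # input to the fatigue analysis of Sect. 3
    cap.release()
```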

2 Facial Landmark Detection and Tracking by SDM

Given an image \( \mathbf{I} \), we assemble the coordinates of all facial landmarks, initialized from the Adaboost-based face detection step, into an initial coordinate vector \( \mathbf{t}_0 \). The locations of these landmarks are then refined by the SDM method [19]. We denote the true location of the facial landmarks by \( \mathbf{t}^{*} \) and the refined location by \( \mathbf{t}_0 + \Delta\mathbf{t} \). To achieve a representation robust against illumination changes, we use SIFT features at the landmarks. Let \( \mathbf{h}(\mathbf{t}) \) denote the SIFT operator applied at the landmarks \( \mathbf{t} \); e.g., \( \mathbf{h}(\mathbf{t}_0) \) is the SIFT feature extracted at \( \mathbf{t}_0 \) in image \( \mathbf{I} \). Let \( f(\mathbf{t}) \) denote the distance between the SIFT features extracted at \( \mathbf{t} \) and at \( \mathbf{t}^{*} \), i.e., \( f(\mathbf{t}) = \left\| \mathbf{h}(\mathbf{t}) - \mathbf{h}(\mathbf{t}^{*}) \right\|_2^2 \).

Supposing \( \mathbf{t}^{*} \) is known for both training and test images, the refinement of the facial landmarks is modeled as minimizing the following function over \( \Delta\mathbf{t} \):

$$ f(\mathbf{t}_0 + \Delta\mathbf{t}) = \left\| \mathbf{h}(\mathbf{t}_0 + \Delta\mathbf{t}) - \mathbf{h}(\mathbf{t}^{*}) \right\|_2^2 $$
(1)

By Taylor's theorem, \( f(\mathbf{t}_0 + \Delta\mathbf{t}) \) can be approximated as

$$ f(\mathbf{t}_0 + \Delta\mathbf{t}) \approx f(\mathbf{t}_0) + \mathbf{J}_f(\mathbf{t}_0)^{\mathrm{T}} \Delta\mathbf{t} + \frac{1}{2} \Delta\mathbf{t}^{\mathrm{T}} \mathbf{H}(\mathbf{t}_0) \Delta\mathbf{t} $$
(2)

where \( \mathbf{J}_f(\mathbf{t}_0) \) and \( \mathbf{H}(\mathbf{t}_0) \) are the Jacobian and Hessian matrices of the function \( f(\mathbf{t}) \) evaluated at \( \mathbf{t}_0 \); in the following we drop \( \mathbf{t}_0 \) from \( \mathbf{J}_f(\mathbf{t}_0) \) and \( \mathbf{H}(\mathbf{t}_0) \) to simplify the notation, and the superscript T denotes transposition. The refinement of the facial landmarks generally needs multiple iterations to converge to \( \mathbf{t}^{*} \); \( \Delta\mathbf{t}_i \) denotes the i-th update of \( \Delta\mathbf{t} \). The first update of \( \mathbf{t} \) is obtained by differentiating \( f(\mathbf{t}_0 + \Delta\mathbf{t}) \) with respect to \( \Delta\mathbf{t} \) and setting the derivative to zero, which yields Eq. (3).

$$ \Delta\mathbf{t}_1 = -\mathbf{H}^{-1} \mathbf{J}_f $$
(3)

By the chain rule, differentiating Eq. (1) with respect to \( \Delta\mathbf{t} \) gives \( \mathbf{J}_f = 2\mathbf{J}_h^{\mathrm{T}} (\mathbf{h}(\mathbf{t}_0) - \mathbf{h}(\mathbf{t}^{*})) \) [19], where \( \mathbf{J}_h \) is shorthand for the Jacobian matrix of the SIFT operator \( \mathbf{h}(\mathbf{t}) \) evaluated at \( \mathbf{t}_0 \). The update of Eq. (3) then becomes

$$ \Delta\mathbf{t}_1 = -2\mathbf{H}^{-1} \mathbf{J}_h^{\mathrm{T}} (\mathbf{h}(\mathbf{t}_0) - \mathbf{h}(\mathbf{t}^{*})) $$
(4)

With \( \Delta\mathbf{t}_1 \), a new facial landmark location \( \mathbf{t}_1 = \mathbf{t}_0 + \Delta\mathbf{t}_1 \) is obtained, where \( \mathbf{t}_1 \) represents the landmark location after the first iteration. Together with \( \mathbf{t}^{*} \), it is used to find the next update term for \( \Delta\mathbf{t} \). The i-th update of \( \Delta\mathbf{t} \) is given as follows.

$$ \Delta\mathbf{t}_i = \mathbf{t}_i - \mathbf{t}_{i-1} = -2\mathbf{H}^{-1} \mathbf{J}_h^{\mathrm{T}} (\mathbf{h}(\mathbf{t}_{i-1}) - \mathbf{h}(\mathbf{t}^{*})) $$
(5)

However, the true landmark location \( \mathbf{t}^{*} \) is known only for training images, not for test images, so Eqs. (4) and (5) cannot be used to refine the landmark locations on test data. In addition, the SIFT operator is not differentiable, so computing the Jacobian and Hessian matrices in Eqs. (4) and (5) by numerical approximation is very expensive.

To address this problem, the supervised descent method (SDM) is adopted in [19] to refine the landmark locations iteratively. Noting that, with \( \mathbf{R}_0 = -2\mathbf{H}^{-1}\mathbf{J}_h^{\mathrm{T}} \), the update in Eq. (4) can be regarded as projecting \( \mathbf{h}(\mathbf{t}_0) - \mathbf{h}(\mathbf{t}^{*}) \) onto the row vectors of \( \mathbf{R}_0 \), SDM models the update of \( \Delta\mathbf{t} \) as a linear regression that does not use \( \mathbf{t}^{*} \), which improves generalization to test images. \( \Delta\mathbf{t}_1 \) is modeled as

$$ \Delta\mathbf{t}_1 = \mathbf{R}_0 \mathbf{h}(\mathbf{t}_0) + \mathbf{b}_0 $$
(6)

where \( \mathbf{b}_0 \) is a bias term that approximates \( -\mathbf{R}_0 \mathbf{h}(\mathbf{t}^{*}) \); \( \mathbf{R}_0 \) and \( \mathbf{b}_0 \) are parameters learned from training images. Similarly, the i-th update of \( \Delta\mathbf{t} \) is modeled as Eq. (7) in [19].

$$ \mathbf{t}_i = \mathbf{t}_{i-1} + \Delta\mathbf{t}_i = \mathbf{t}_{i-1} + \mathbf{R}_{i-1} \mathbf{h}(\mathbf{t}_{i-1}) + \mathbf{b}_{i-1} $$
(7)

The parameters of the SDM method, namely the descent directions \( \{\mathbf{R}_i\} \) and biases \( \{\mathbf{b}_i\} \), are estimated from the training images. \( \mathbf{R}_0 \) and \( \mathbf{b}_0 \) are learned with a Monte Carlo sampling method; \( \mathbf{R}_i \) and \( \mathbf{b}_i \) for i ≥ 1 are learned iteratively. We refer the reader to [19] for details. The learned parameters are then used to infer the landmark locations in test images. The facial landmarks detected by SDM are illustrated in Fig. 1.

Fig. 1. Facial landmarks detected by SDM
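For illustration, the inference stage of Eq. (7) can be written compactly; this is a minimal sketch assuming the learned cascade parameters \( \{\mathbf{R}_i\} \), \( \{\mathbf{b}_i\} \) are available as NumPy arrays and the SIFT extractor \( \mathbf{h}(\cdot) \) as a callable.

```python
import numpy as np

def sdm_refine(h, t0, R_list, b_list):
    """Apply the learned SDM cascade of Eq. (7):
    t_i = t_{i-1} + R_{i-1} h(t_{i-1}) + b_{i-1}.

    h       : callable returning the stacked SIFT descriptors at landmarks t
    t0      : initial landmark coordinate vector (from face detection)
    R_list  : learned descent directions R_0 ... R_{k-1}
    b_list  : learned bias terms b_0 ... b_{k-1}
    """
    t = np.asarray(t0, dtype=float).copy()
    for R, b in zip(R_list, b_list):
        t = t + R @ h(t) + b      # one generic descent step; no t* needed
    return t
```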

3 Extraction and Definition of Visual Fatigue Rules

The facial appearance of a person who is fatigued, or at the onset of fatigue, is usually characterized by slowly closing eyes and frequent yawning. For fatigue detection, the facial features around the eyes and mouth therefore contain enough information to capture fatigue patterns, and we focus our analysis on these regions. The aspect ratios of the eyes and mouth are used as evaluation features, and the states of the eyes and mouth are estimated through empirically defined membership functions.

3.1 Acquisition and Processing of Feature Data

The SDM algorithm provides 49 facial landmarks (see Fig. 1), from which we calculate the aspect ratios of the eyes and mouth in each video frame. The facial landmarks around each eye are shown in Fig. 2; they yield the degree of eye openness. The height of each eye is represented by the distance between the upper and lower eyelids, and the width by the distance between the two eye corners. The aspect ratio of each eye is calculated by Eq. (8), and the degree of eye closure is determined by the smaller aspect ratio of the two eyes. The aspect ratio of the mouth, used to estimate mouth openness, is calculated in a similar way, as illustrated in Fig. 3: the height of the mouth is the distance between the upper and lower lips, and the width is the distance between the mouth corners.

Fig. 2. Illustration of the eye aspect ratio

Fig. 3. Illustration of the mouth aspect ratio

In real applications, our experiments show that a large rotation of the driver's head introduces erroneous information into fatigue recognition: when the rotation angle is too large, the driver's attention has likely left the road, which increases the risk of a catastrophic event, and the face usually cannot be detected. To handle this, we distinguish two cases. On the one hand, if the facial landmarks cannot be located by SDM for a continuous period of more than 2 s, the system raises an alarm to remind the driver to stay attentive. On the other hand, since short head turns happen occasionally during driving, we interpolate the aspect ratios missed in such cases. We then smooth the extracted features with a forgetting factor to suppress occasional short-term noise. As a result, we extract more complete and more reliable features by exploiting the temporal continuity of the face in the video. A sketch of the aspect-ratio computation and this smoothing follows Eq. (8).

$$ eyeRatio = \frac{{d\left( {a,a'} \right)}}{{d\left( {b,b'} \right)}} $$
(8)
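The following sketch implements Eq. (8) and the temporal post-processing described above, under the assumption that a, a′ are the upper and lower eyelid (or lip) landmarks and b, b′ the corners, following Figs. 2 and 3; the forgetting-factor value 0.8 is an assumed setting, as it is not stated above.

```python
import numpy as np

def aspect_ratio(a, a_prime, b, b_prime):
    """Eq. (8): d(a, a') / d(b, b'), i.e., height over width of the eye or mouth."""
    height = np.linalg.norm(np.asarray(a) - np.asarray(a_prime))
    width = np.linalg.norm(np.asarray(b) - np.asarray(b_prime))
    return height / width

def smooth_ratios(ratios, alpha=0.8):
    """Interpolate frames where the ratio is missing (None) and smooth the series
    with a forgetting factor alpha (0.8 is an assumed value)."""
    x = np.array([np.nan if r is None else r for r in ratios], dtype=float)
    idx = np.arange(len(x))
    valid = ~np.isnan(x)
    x = np.interp(idx, idx[valid], x[valid])   # fill short gaps (brief head turns)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):                 # exponential forgetting
        out[i] = alpha * out[i - 1] + (1 - alpha) * x[i]
    return out
```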

3.2 Metrics for Fatigue Description

We define metrics based on a membership function (MF), a PERCLOS-like measure, and the percentage of mouth yawning (PMY) to describe fatigue features. The details are given as follows.

(a) MF metric

A membership function represents a degree of truth as an extension of binary valuation; it maps the aspect ratios into the range [0, 1]. The function should be simple, convenient, fast, and effective. We build two membership functions based on the Gaussian distribution: a simple Gaussian curve and a two-sided composite of two different Gaussian curves, corresponding to MATLAB's gaussmf and gauss2mf functions. The membership functions of the eyes and mouth are shown in Fig. 4(a) and (b), respectively; a sketch of the two curve shapes is given below Fig. 4.

Fig. 4. The membership functions of eye and mouth features
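As an illustration, the two curve shapes can be reproduced in Python as follows; the semantics mirror MATLAB's gaussmf and gauss2mf, while any parameter values are assumptions, since Fig. 4 specifies the curves only graphically.

```python
import numpy as np

def gaussmf(x, sigma, c):
    """Simple Gaussian membership curve (MATLAB gaussmf semantics)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def gauss2mf(x, sigma1, c1, sigma2, c2):
    """Two-sided Gaussian (MATLAB gauss2mf semantics): left shoulder from the
    first curve, right shoulder from the second, flat at 1 for c1 <= x <= c2."""
    x = np.asarray(x, dtype=float)
    y = np.ones_like(x)
    left, right = x < c1, x > c2
    y[left] = np.exp(-((x[left] - c1) ** 2) / (2.0 * sigma1 ** 2))
    y[right] = np.exp(-((x[right] - c2) ** 2) / (2.0 * sigma2 ** 2))
    return y

# Example with assumed parameters: degree of eye closure from the eye aspect ratio.
eye_closure_degree = gaussmf(0.55, sigma=0.15, c=0.3)
```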

(b) PERCLOS-like and PMY metrics

The duration of eye closure reflects fatigue to some extent. PERCLOS [10] is a scientifically supported measure of drowsiness associated with slow eye closure, defined as the proportion of time that a subject's eyes are closed over a specified period. Under the P80 criterion, the driver is considered drowsy when the eyes are 80 % or more occluded for more than 15 % of a unit time interval. PERCLOS is defined in Eq. (9).

$$ \mathrm{PERCLOS} = \frac{\sum\limits_{i=0}^{k} P(i)}{N} \times 100\,\% $$
(9)

where \( N = fps \times t \); \( fps \) is the number of frames per second, \( t \) is the detection time, and \( N \) is the total number of frames in \( t \). \( P(i) \) is the number of frames of the i-th eye-closure event, and \( k \) is the number of eye-closure events in \( t \). In this paper, we take the aspect ratios as features instead of the eye area, and we hierarchically label the degrees of eye and mouth openness determined by the membership function, so we call it a PERCLOS-like feature.

When the eyes are open, the aspect ratio of the eye lies in the range [0.625, 1]; a value below 0.625 indicates that the eyes are closed, and the smaller the value, the higher the degree of eye closure. We obtain a hierarchical label with 4 grades based on Eq. (10), where grade 1 denotes the highest degree of eye closure. In addition, existing studies show that a normal eye closure lasts about 0.2 to 0.3 s, and that the risk of a traffic accident rises greatly if it lasts more than 0.5 s. We therefore define the duration rule of the PERCLOS-like feature as Eq. (11).

$$ eyeLabel = \left\{ \begin{array}{ll} 1 & \text{if } eyeRatio < 0.5 \\ 2 & \text{if } 0.5 \le eyeRatio < 0.7 \\ 3 & \text{if } 0.7 \le eyeRatio < 0.9 \\ 4 & \text{if } eyeRatio \ge 0.9 \\ \end{array} \right. $$
(10)
$$ P(i) = \left\{ \begin{array}{ll} f_s & \text{if } eyeLabel = 1 \text{ and } \Delta t \ge \tau_{eye} \\ 0 & \text{otherwise} \\ \end{array} \right. $$
(11)

where \( f_s \) is the number of frames of the i-th closure event, \( \Delta t \) is its duration, and \( \tau_{eye} = 0.5 \) s.

Similarly, studies have shown that the duration of a yawn is about 6 s. We set the membership function of the mouth as the two-sided Gaussian curve shown in Fig. 4(b). Moreover, if apparent yawning lasts more than 1.5 s, a traffic accident becomes much more likely, so we define the PMY metric. The mouth aspect ratio is likewise divided into 4 hierarchical grades, and the yawning duration of the PMY feature is defined in the same way as Eq. (11).

Based on the analysis of video frames, we consider it an eye closure if grade 1 persists for more than 0.5 s, as expressed in Eq. (11); similarly, we consider it a yawn if grade 1 persists for more than 1.5 s. The durations can then be computed.
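A compact sketch of the PERCLOS-like computation of Eqs. (9)–(11) is given below; it counts only closure (or yawning) runs that persist for at least the duration threshold, with the grade convention of Eq. (10).

```python
def perclos_like(labels, fps, closed_grades=(1,), tau_sec=0.5):
    """Eq. (9): percentage of the N = fps * t frames spent in closure events.
    A run of frames whose grade is in closed_grades counts (Eq. 11) only if it
    lasts at least tau_sec; labels is the per-frame grade sequence of Eq. (10)."""
    n = len(labels)
    min_len = int(round(tau_sec * fps))
    closed_frames, run = 0, 0
    for lab in list(labels) + [None]:      # sentinel flushes the final run
        if lab in closed_grades:
            run += 1
        else:
            if run >= min_len:
                closed_frames += run       # P(i): frames of the i-th event
            run = 0
    return 100.0 * closed_frames / n

# PMY is computed the same way on mouth grades, e.g. with tau_sec=1.5 for yawning.
```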

3.3 Driver Fatigue Computation Rules

At this stage, the visual features of eye closure and mouth opening are used for fatigue determination, so it is important to set reasonable warning thresholds. The thresholds in our system are obtained empirically by analyzing the features extracted from the videos. We use three rules to determine fatigue: (1) if the percentage of eye closure at grade 1 is more than 15 %, the driver is considered drowsy; (2) if the percentage of yawning at grade 1 is more than 10 %, the driver is considered drowsy; (3) if the percentage of eye closure at grades 1 and 2 is more than 18 % and the percentage of yawning at grades 1 and 2 is more than 13 %, the driver is considered drowsy.
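A minimal sketch of these three rules, with the thresholds stated above (all percentages computed over the evaluation window):

```python
def is_drowsy(eye_g1, yawn_g1, eye_g12, yawn_g12):
    """Apply the three fatigue rules of Sect. 3.3.

    eye_g1  : percentage of eye closure at grade 1
    yawn_g1 : percentage of yawning at grade 1
    eye_g12 : percentage of eye closure at grades 1 and 2
    yawn_g12: percentage of yawning at grades 1 and 2
    """
    if eye_g1 > 15.0:                          # rule 1: prolonged eye closure
        return True
    if yawn_g1 > 10.0:                         # rule 2: pronounced yawning
        return True
    return eye_g12 > 18.0 and yawn_g12 > 13.0  # rule 3: combined evidence
```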

4 Experimental Results and Analysis

In the experiments, we use 41 videos of subjects under sober and simulated drowsy conditions, including two real driving videos. In 25 of these videos the subjects are fatigued; in the other 16 they are not. The first frame of each sequence with the detected facial landmarks is shown in Fig. 5, and the parameters of each video are given in Table 1. The fatigue videos were captured with a Logitech camera. The algorithm is implemented on the MATLAB platform with camera input; since it only requires a video camera and is therefore non-intrusive, it is well suited to real-world driving conditions. The system is tested on an Intel(R) Core(TM) i3-3220 CPU with 8 GB RAM.

Fig. 5. The first frame of each sequence is shown with the detected key feature points marked in green (Color figure online)

Table 1. Comparison with fuzzy method

Since fatigue must be determined over a period of time, we use a sliding window that gathers the video frames of the last 30 s as one evaluation period. After processing the new data with the determination rules, the system outputs the driver's fatigue state. Face location and eye detection take about 140 ms per frame, so the results are output in a timely manner. Applied to the 41 videos, the proposed method performs well, with only two videos classified incorrectly. To evaluate the performance of the proposed method, we use the false alarm ratio (FAR) and the detection ratio (DR) as evaluation metrics. Both metrics are computed on 30 s video segments and averaged over each video, as shown in Table 1.
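The sliding-window evaluation can be sketched as follows, reusing the perclos_like and is_drowsy helpers from Sect. 3; the frame rate of 25 fps is an assumption, since the capture rate is not stated.

```python
from collections import deque

def monitor(per_frame_grades, fps=25, window_sec=30):
    """Slide a 30 s window over per-frame (eye, mouth) grade pairs and yield
    one drowsiness decision per frame once the window is full."""
    window = deque(maxlen=fps * window_sec)
    for grades in per_frame_grades:
        window.append(grades)
        if len(window) < window.maxlen:
            continue                        # wait until a full 30 s of data
        eyes = [e for e, m in window]
        mouths = [m for e, m in window]
        yield is_drowsy(
            eye_g1=perclos_like(eyes, fps, (1,), 0.5),
            yawn_g1=perclos_like(mouths, fps, (1,), 1.5),
            eye_g12=perclos_like(eyes, fps, (1, 2), 0.5),
            yawn_g12=perclos_like(mouths, fps, (1, 2), 1.5),
        )
```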

In [18], Anumas et al. implemented a system using facial features extracted by active shape models together with HRV data; the inference parameters were combined by a fuzzy classifier to obtain the level of driver fatigue. We compare our method with the method of [18] applied to the facial data; according to the results in Table 1, our system is slightly better.

In summary, a real-time machine-vision-based system is proposed for driver fatigue detection, which can detect fatigue and issue a warning early enough to help avoid an accident. The results are robust and promising even when the driver wears glasses. The experiments on the two real driving videos also demonstrate the robustness of our method, which outperforms the method of [18]. Nevertheless, the system requires further evaluation. The next improvement of the driver fatigue monitoring system will be more accurate thresholds, which can be obtained from more real-world driving videos.