1 Introduction

The term “drowsy” describes a person who has become exhausted or tired during working hours. Although drowsiness is usually regarded as commonplace, it becomes critical when a person is engaged in a job that demands sustained attention, such as operating heavy machinery in industry or driving a heavy or passenger vehicle [1, 2]. A drowsy driver endangers both pedestrians on the road and passengers in the vehicle. The 2018 annual report on road accidents in India, published by the Ministry of Road Transport & Highways, records 4,67,044 accidents across the states and Union Territories [3]. According to this report, 1,51,417 people lost their lives and 4,69,418 people were injured in these crashes. Of these accidents, 78% occurred because of a lapse in concentration or drowsiness on the part of the driver. The report also notes that the number of road crashes was 0.34% higher than in the previous year, that India accounted for 11% of road accident deaths worldwide, and that most of the victims were aged 18–45 years.

In addition, a small number of studies show that one in every 25 drivers becomes drowsy while driving [4, 5]. The report also reveals that crashes mostly occur close to the destination after a lengthy trip.

The road accident data discussed above indicate that driver behavior is one of the essential factors influencing road safety. For this reason, monitoring driver behavior for drowsiness recognition has drawn the attention of researchers.

A variety of techniques have been developed for monitoring drowsiness in the driver. These techniques are usually grouped into three classes: biological features, vehicle movement indicators, and behavioral features.

In the biological approach, the symptoms of drowsiness are captured through signals such as brain activity, heart rate, and nerve impulses. Although techniques based on biological features give significant results with high accuracy, they require extra equipment to be mounted on the driver’s body to acquire these signals. This equipment hinders everyday driving. Because the technique is invasive, it cannot be used for practical purposes [6].

The second category, vehicle movement-based drowsiness identification, relies on vehicle characteristics such as lateral position, steering wheel movement, bending angle, vehicle speed, pressure on the brake or accelerator pedal, lateral displacement, and other signs of vehicle movement [7]. The benefit of this method is that these features are simple to capture and informative.

Despite these benefits, such techniques have many restrictions, since they depend on driver expertise, type of vehicle, road shape, and driving situation. They also demand sufficient time to acquire these parameters. Therefore, these methods do not work well in the case of micro-sleep, in which the driver falls asleep for a short duration on a straight road without producing any fluctuation in the signals.

The problems that persist in the methods discussed above are addressed by the third category of drowsiness detection, the behavioral technique. This technique is non-invasive: unlike the biological techniques discussed above, no extra equipment needs to be attached to the driver’s body to capture the features. Consequently, the behavioral technique has become popular among researchers because of its lower setup cost and ease of use. It relies primarily on computer vision to acquire the features using only a camera [12,13,14]. With this in mind, we have employed this technique in our proposed framework.

Distinct facial cues such as head movement, pupil movement, and facial expression indicate different levels of alertness in the driver. Among these cues, eye state identification is a prominent one. Various systems have been developed to recognize signs of drowsiness based on eye state analysis, i.e., open or closed eye [15,16,17]. These works used ordinary cameras and did not involve any training data. Our system follows the same approach: it requires no training data and is capable of capturing drowsiness at an early stage in real time. In this way, it avoids the need for a good dataset for training and testing. Our proposed system is built on a novel algorithmic approach that meets current demands.

The key objective of our proposed work is to resolve the problem of occlusion, which has not been resolved by the available works, and thereby improve the accuracy of the proposed model. This work presents an extensive study of drowsiness detection frameworks and develops an optimized and reliable model with significant results.

This study is divided into five sections. Section 2 contains a literature review of previous work and its comparison to the proposed work. The proposed technique is detailed in Sect. 3. Section 4 presents the result analysis and discussion of the proposed work. Section 5 contains the conclusion and ideas for future work.

2 Literature survey

This section investigates existing works. We have incorporated the most influential works on drowsiness detection from various sources. We also examine a variety of publicly accessible national and international datasets.

We have investigated drowsiness detection methods based on temporal and spatial features that utilize computer vision and deep learning concepts, as shown in Table 1 [8,9,10,11].

Table 1 Comparison between previous work and proposed work

In the present work, we propose a novel algorithm for drowsiness detection using open eye state analysis. Our proposed work requires only a conventional camera, and the model does not require any training data. For experimentation, however, we have used the small, publicly available IMM face dataset [18]. In this way, our proposed system is easy to configure, secure, and cost-effective.

It is evident from the survey that every facial cue, such as head rotation, pupil movement, eye blinking, and yawning, has its own degree of significance in detecting a drowsy state [19]. Among these facial cues, eye state analysis is a prominent one that helps predict the state of alertness better. Several systems have been proposed to identify the level of drowsiness based on the investigation of eye states [20]. Of these existing systems, a few employ a four-step procedure: face recognition, localization of the eye region of interest (ROI), analysis of the different states of the eye, and prediction of the level of drowsiness [21]. These systems need only an ordinary camera and do not involve any training data for experimental purposes.

A few existing works skipped the face identification step [22, 23] and, as a result, achieved a higher speed. However, they had to depend on both training data and particular hardware such as illuminators, which can be a health risk because light falls directly on the eye.

Therefore, we have employed the dlib software to recognize the face, so that hardware such as illuminators is not needed. Our procedure involves a novel algorithmic approach to eye state analysis for drowsiness detection.

Numerous techniques have been proposed for face identification, based on NIR [24,25,26], deep neural networks [27, 28], pattern matching [9], Gabor filters [29], and skin color [30, 31]. Techniques based on skin color may be limited to a particular ethnic group or may perform worse under different lighting conditions.

Methods based on neural networks [27] give better face detection results but require more processing time because of network training, which does not suit the requirement of early drowsiness detection.

The existing framework [9] employed an iris-sclera pattern-based approach to detect open-eye drowsiness and performs well. However, it fails when the eye already suffers from a defect such as squinting, or when the driver looks to the extreme left or right. In such scenarios, the symmetry of the sclera and iris cannot reliably determine the state of the open eye, and the system starts mispredicting cases of drowsiness. Furthermore, this system could not resolve the issue of occluded frames or identify the eye region of interest under dim lighting conditions.

Although the model that employs Gabor filters [29] gives better accuracy, it needs more time to train and requires a suitable and sizeable dataset for training and testing, which is itself a challenging task.

Likewise, the existing framework [32], which implements the Viola-Jones technique based on Haar cascade features, gives superior results with a faster rate of face identification but requires a dataset for training as well as testing. Such a system is also not capable of handling the problem of occlusion.

Compared with previous works, our model contributes in the following ways:

  • We employ the most prominent features, i.e., pupil movement and eye aspect ratio, which reduces the processing time and enhances the accuracy.

  • We remove occluded frames by applying a criterion of occlusion at the pre-processing step, which further increases the accuracy.

  • Our system removes the need for a good dataset for training and testing the model, and hence reduces the processing time of the overall model.

  • We have created a reliable system that not only identifies drowsy states but can also capture the moment when the driver departs from routine driving.

3 Proposed methodology

It is evident from the techniques examined in Table 1 that none has offered a proper solution to the challenge of occlusion that arises from head rotation. In our framework, we resolve the issue of occlusion by removing the defective frames, using a computer vision criterion based on the separation of the pupil centers and the horizontal length of an eye. In this way, our proposed framework improves the accuracy of the overall system. The outline of our framework is shown in Fig. 1.

Fig. 1 Flow chart of proposed work

3.1 Dataset

After a rigorous literature survey, we found that most datasets were not wholly suitable for our experimental purpose. We therefore chose the rare IMM face dataset, which is well annotated and satisfies our experimental need, i.e., detection of drowsiness with open eyes, to a certain extent because all of its images and videos show open eyes. The difficulty with this dataset is that it does not contain any image or video in which the subject is wearing glasses. Therefore, we have included a few such images and videos from other online sources in order to maintain the versatility of the dataset.

To validate our proposed system, we carried out two experiments. In the first phase, we experimented only on the color images in the IMM face dataset [18]. This dataset contains 222 color images along with 18 grayscale images. These images were contributed by 37 different subjects, with six images per subject. The six images of each subject are labeled by facial expression (“normal,” “happy”), lighting conditions, and facial rotation, i.e., “30 degrees movement to the subject’s right,” “30 degrees movement to the person’s left,” and so on. The images have a resolution of 640 × 480, and the subjects do not wear glasses. In the second phase, we considered six video sequences of different subjects with various angles of head movement. These videos were captured under various lighting conditions with an ordinary camera at a resolution of 640 × 480 and a rate of 30 frames/sec.

To build a multipurpose dataset, we included persons of diverse cultures and skin colors in different environments. The images and videos were acquired at a variety of angles under different scenarios.

3.2 Facial feature extraction via Dlib

From a rigorous survey of the literature, we found that several techniques are available for extracting facial cues. Iqbal et al. [33] designed a framework based on a deep learning technique, the Visual Geometry Group network (VGGFace), for recognizing the face in an image. This framework has an arrangement analogous to a convolutional neural network (CNN) and filters the data in a hierarchical fashion. It captures the face with a high accuracy of 99.77% but takes more time to identify a face in the image, and thus cannot satisfy the criterion of early drowsiness detection. Likewise, Kambi et al. [34] proposed a model that addresses issues such as large variations in facial expression and pose by combining LBP and k-Nearest Neighbor (k-NN) techniques. However, this technique was not capable of capturing faces in low illumination conditions. To resolve these problems, we employ key-point-based face identification, in which, instead of recognizing the whole face, only the region of interest in the face is captured via built-in landmark points [35]. This reduces the overall processing time and hence satisfies the demand for early drowsiness detection with remarkable accuracy.

Thus, in our proposed work, we employ the Dlib facial landmark predictor, used together with the OpenCV library, to identify the region of interest in the face [36]. The Dlib package combines a Histogram of Oriented Gradients (HOG) feature descriptor for detecting the region of interest in the face with a linear Support Vector Machine (SVM) as the classifier [37]. The Dlib predictor provides 68 facial landmarks, which makes it easier to extract the various regions of interest (ROI) from the face. For our purpose, we take landmarks 37 to 48, which trace the pair of eyes, to compute the center of the pupil and the eye aspect ratio (EAR) used in the open-eye drowsiness condition, as shown in Fig. 2. These landmarks are extracted from each frame, and their x and y coordinates are then used to compute the EAR and the center of the pupil for further processing.

Fig. 2 ROI selection using 68 facial landmarks
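As a minimal sketch of the landmark selection described above, the following assumes the 68 landmarks are already available as a list of (x, y) pairs in the usual 68-point order. The function name `eye_landmarks` is illustrative and not part of our implementation; the only conversion performed is from the 1-based numbers used in the text (37 to 48) to 0-based list indices (36 to 47).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# 1-based landmark numbers as used in the text; list indices are one less.
RIGHT_EYE_1BASED = range(37, 43)  # landmarks 37-42
LEFT_EYE_1BASED = range(43, 49)   # landmarks 43-48


def eye_landmarks(points: Sequence[Point]) -> Tuple[List[Point], List[Point]]:
    """Split a 68-point landmark list into (right_eye, left_eye), six points each."""
    if len(points) != 68:
        raise ValueError("expected 68 facial landmarks")
    right = [points[i - 1] for i in RIGHT_EYE_1BASED]
    left = [points[i - 1] for i in LEFT_EYE_1BASED]
    return right, left
```

Each returned list holds the six points p1 to p6 of one eye, in landmark order.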

3.3 Pre-processing technique

During experimentation, we found it challenging to capture regions of interest such as the eyes and mouth under poor illumination, especially at night. We therefore apply histogram equalization, which spreads the intensity values more evenly across the frame [38, 39] and thus reduces the effect of uneven lighting in each frame. Thereafter, we apply gamma correction to increase contrast across the frame through a non-linear mapping between input and output intensity values [40].
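The two pre-processing steps can be sketched with NumPy as follows. This is an illustrative sketch rather than our exact implementation: it assumes an 8-bit grayscale frame, and both the gamma value and the convention out = 255 · (in / 255)^(1/γ) are assumptions chosen for the example.

```python
import numpy as np


def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale frame via its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest occupied level
    total = gray.size
    if total == cdf_min:               # constant frame: nothing to spread out
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]                   # map every pixel through the lookup table


def gamma_correct(gray: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Apply out = 255 * (in / 255) ** (1 / gamma); gamma > 1 brightens dark regions."""
    levels = np.arange(256) / 255.0
    lut = np.round(255.0 * levels ** (1.0 / gamma)).astype(np.uint8)
    return lut[gray]
```

A dim frame would be passed through `equalize_histogram` first and `gamma_correct` second, mirroring the order described above.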

3.4 Eye aspect ratio (EAR)

The eye aspect ratio is the ratio of the perpendicular (vertical) distance to the flat (horizontal) distance of an eye. In the Dlib software package, each eye is marked with six built-in facial landmarks, indexed here from p1 to p6, as shown in Fig. 3.

Fig. 3 Eye aspect ratio

These landmarks are used to calculate the eye aspect ratio. First, the perpendicular distance is determined from the coordinates of the landmark pairs on the upper lid (p2, p3) and lower lid (p5, p6). Then, the flat distance is computed from the coordinates of the pair of landmarks (p1, p4) using the Euclidean distance formula. The value of EAR is computed using Eq. (1) as follows:

$$\mathrm{EAR}= \frac{\Vert {P}_{2}- {P}_{6}\Vert + \Vert {P}_{3}- {P}_{5}\Vert }{2\Vert {P}_{1}- {P}_{4}\Vert }$$
(1)
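Eq. (1) translates directly into a few lines of Python. The sketch below assumes the six eye landmarks are supplied in the order p1 to p6 as (x, y) pairs; the function name `eye_aspect_ratio` is illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def eye_aspect_ratio(eye: Sequence[Point]) -> float:
    """Compute EAR from six landmarks (p1..p6) as in Eq. (1)."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)   # two lid-to-lid distances
    horizontal = math.dist(p1, p4)                     # corner-to-corner distance
    if horizontal == 0:
        raise ValueError("eye corners coincide; EAR is undefined")
    return vertical / (2.0 * horizontal)
```

For example, an eye that is 4 units wide with both lid-to-lid distances equal to 2 units gives EAR = (2 + 2) / (2 · 4) = 0.5.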

3.5 Centre of pupil

In the pre-processing step of our framework, occluded frames are discarded based on a certain condition. The center of the pupils is one of the essential parameters for establishing this condition, so we compute it first.

In the initial stage, we crop the eye region of interest (ROI) from each frame of a video by extracting the coordinates of the relevant facial landmarks. To localize the center of the pupil, we rely on a gradient-based observation, namely that the pupil region is darker than the neighbouring region. We apply a Gaussian blur filter of size 7 × 7 to reduce noise in each cropped eye ROI. We then find all the contours and pick the largest one by contour area, which is taken to be the pupil. The pupil is enclosed in a rectangular bounding box, and its center is computed from the position, width, and height of the bounding box, as given in Eqs. (2) and (3) and illustrated in Fig. 4:

Fig. 4 Centre of pupil detection

$${x}_{\mathrm{p}}=x+ w/2$$
(2)
$${y}_{\mathrm{p}}= y+ h/2$$
(3)

where (x, y) is the top-left corner, w is the width, and h is the height of the bounding box containing the pupil.

Here, we have used the boundingRect() function to obtain the (x, y, w, h) values of the bounding box.
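The selection of the pupil and Eqs. (2) and (3) can be sketched as follows. The sketch assumes that an earlier contour-detection step has already produced, for each contour, its area and its bounding box (x, y, w, h); the `(contour_area, bounding_box)` pairs and the function names are hypothetical and serve only to illustrate the computation.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), with (x, y) the top-left corner


def box_center(box: Box) -> Tuple[float, float]:
    """Center of a bounding box, following Eqs. (2) and (3)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0


def pupil_center(candidates: Iterable[Tuple[float, Box]]) -> Tuple[float, float]:
    """Pick the candidate with the largest contour area and return its box center.

    Each candidate is a hypothetical (contour_area, bounding_box) pair assumed
    to come from an earlier contour-detection step on the blurred eye ROI.
    """
    _, box = max(candidates, key=lambda c: c[0])
    return box_center(box)
```

The returned coordinates are relative to the cropped eye ROI and would need the ROI offset added to express them in frame coordinates.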

3.6 Condition of occlusion

The framework proposed by Yingyu et al. [36] states that a frame is occluded when half of the length of the line connecting the pupil centers becomes equal to the horizontal length of an eye, i.e., either the left or the right eye. We impose this condition to identify whether a frame is occluded at the initial stage of processing. Since we have already localized the centers of the pupils in Sect. 3.5, we can compute the length of the line connecting the pupil centers from their coordinates using the simple distance formula, as presented in Eqs. (4) and (5) and illustrated in Fig. 5:

Fig. 5 Computation of condition of occluded frame

$${D}_{\mathrm{LR}}=\sqrt{{({x}_{\mathrm{L}}-{x}_{\mathrm{R}})}^{2}+{({y}_{\mathrm{L}}-{y}_{\mathrm{R}})}^{2}}$$
(4)
$${L}_{\mathrm{h}}= \sqrt{{\left({x}_{2}-{x}_{1}\right)}^{2}+{({y}_{2}-{y}_{1})}^{2}}$$
(5)

where \(({x}_{\mathrm{L}}, {y}_{\mathrm{L}})\) and \(({x}_{\mathrm{R}}, {y}_{\mathrm{R}})\) are the pupil centers of the left and right eye, \(({x}_{1}, {y}_{1})\) and \(({x}_{2}, {y}_{2})\) are the coordinates of the landmarks of the right eye, \({D}_{\mathrm{LR}}\) is the length of the line connecting the pupil centers of the left and right eye, and \({L}_{\mathrm{h}}\) is the horizontal length of an eye.

Thereafter, we acquire the horizontal length of an eye by extracting the coordinates of landmarks 37 and 40 for the right eye, or 43 and 46 for the left eye, as shown in Fig. 5. We then check the condition of occlusion as per Algorithm 1:

figure a
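The occlusion check can be sketched in Python as follows. This is a hedged illustration, not a transcription of Algorithm 1: it computes both lengths as Euclidean distances and treats a frame as occluded once half of the inter-pupil length has fallen to the horizontal eye length or below. The "at or below" reading of the equality condition and the optional `tolerance` parameter (in pixels) are assumptions introduced for the example.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def is_occluded(
    pupil_left: Point,
    pupil_right: Point,
    eye_corner_a: Point,
    eye_corner_b: Point,
    tolerance: float = 0.0,
) -> bool:
    """Flag a frame as occluded when D_LR / 2 reaches the horizontal eye length L_h.

    eye_corner_a and eye_corner_b are the two corner landmarks of one eye
    (37 and 40, or 43 and 46, in the 1-based numbering of the text).
    `tolerance` is a hypothetical slack, not a value from the paper.
    """
    d_lr = math.dist(pupil_left, pupil_right)      # length of line joining pupil centers
    l_h = math.dist(eye_corner_a, eye_corner_b)    # horizontal length of the eye
    return d_lr / 2.0 <= l_h + tolerance
```

Frames for which the function returns True would be dropped before the EAR and pupil features are evaluated.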

The complete processing of the proposed model is summarized by the novel algorithmic approach given in Algorithm 2:

figure b

4 Result and discussion

To classify the driver’s state as alert or drowsy based on the open eye, our proposed model counts the number of frames over a particular duration. The stepwise processing of this complete concept is shown in Fig. 6.

Fig. 6 Speed of algorithm
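The frame-counting idea can be illustrated with a short sketch. It assumes that each frame has already been reduced to a single boolean flag indicating that the per-frame drowsiness cues are present; how that flag is derived, and the minimum run length `min_frames`, are assumptions made for the example and are not values taken from our experiments.

```python
from typing import Iterable, List


def drowsy_flags(frame_flags: Iterable[bool], min_frames: int) -> List[bool]:
    """Mark a frame as drowsy once the cue has persisted for min_frames consecutive frames."""
    if min_frames < 1:
        raise ValueError("min_frames must be at least 1")
    run = 0
    out: List[bool] = []
    for flagged in frame_flags:
        run = run + 1 if flagged else 0   # reset the counter when the cue disappears
        out.append(run >= min_frames)
    return out
```

At the video rate of 30 frames/sec, a hypothetical two-second window would correspond to `min_frames=60`.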

We report the results of our model in terms of accuracy, precision, and recall, based on the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) obtained from the confusion matrices shown in Fig. 9. Accuracy, precision, and recall are computed using Eqs. (6), (7), and (8) as follows:

$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(6)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(7)
$$\mathrm{Recall}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(8)
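Eqs. (6) to (8) can be evaluated with a small helper such as the one below. The function name is illustrative, and returning 0.0 for precision or recall when the corresponding denominator is zero is a convention chosen for the sketch.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, and recall from confusion-matrix counts, Eqs. (6)-(8)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

With made-up counts TP = 45, TN = 49, FP = 3, and FN = 3 (not the counts from Fig. 9), the accuracy is 94/100 = 0.94 and both precision and recall are 45/48 = 0.9375.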

Figure 7 shows the computing speed of each algorithm during the analysis of the videos of three subjects. The average speed of each algorithm can be estimated from this graph. As indicated in Table 2, we compared the performance of prior work with our proposed work in terms of accuracy, precision, and recall. Our approach achieves superior outcomes compared with existing models, particularly those based on temporal features, which are also used in our model.

Fig. 7 The complete processing of proposed work

Table 2 Performance analysis of few drowsiness detection techniques

The outcomes of our model demonstrate robust classification of the two driver states, alert and drowsy, with an improved accuracy of 94.2% compared with the available model [9], which reports an accuracy of 93%. From a rigorous review of model [9], we conclude that the symmetry concept used in that model cannot accurately determine the open state of the eyes when the eye already suffers from a defect such as squinting or when the driver looks to the extreme left or right for a long time. That system therefore produces a number of false drowsiness predictions, which degrades its overall performance. To resolve such cases of false prediction, we impose the condition of occlusion on each incoming frame and discard the offending frames at the pre-processing step of our model. In this way, our temporal feature-based model achieves an accuracy of 94.2%, against the 93% reported for the existing work [9]. Moreover, the existing work [9] could not detect drowsiness correctly in frames captured under dim lighting conditions, especially at night. In contrast, our framework applies histogram equalization and gamma correction to handle such frames, which further enhances the performance of the overall system. Sample images from the experimental results are shown in Fig. 8, and the result statistics in the form of confusion matrices are shown in Fig. 9.

Fig. 8 Instance images from the IMM database and online sources. The first column shows images with face detection, the second column shows images with the pupil center and iris circle, the third column shows inverse binary images of the detected eyes, and the fourth column shows the images after applying a Gaussian filter over the inverse binary image

Fig. 9 Confusion matrices for model

In our proposed work, we have employed a pure computer vision technique that captures temporal features of the eye (eye aspect ratio, center of pupil) and head (tip of the nose) to perform three levels of verification and identify the state of drowsiness correctly. Unlike the existing works [8, 11], our system does not require any training data, and hence its processing time is very low, approximately 31.74 ms/frame, whereas the existing works [8, 11] spend several hours on training and testing. This advantage matters because time plays a crucial role in drowsiness detection: detecting drowsiness at an early stage gives the driver enough time to recover from the drowsy state and avoid an impending accident.

5 Conclusion and future work

The practices used in our framework are based on temporal features of eye (eye aspect ratio, pupil movement) and head (tip of the nose) movements. We impose three levels of checks on these features through a novel algorithmic approach, which demonstrates the robustness of our framework. In contrast to existing work, we use a further novel algorithmic approach to reduce the problems caused by occluded frames at the pre-processing stage. A pure computer vision-based technique has been employed to improve accuracy and performance and to reduce processing time. The model does not require any training data, so the time otherwise spent on training is saved. The outcomes thus indicate that our model performs better than existing models.

For future work, additional features such as the posture of the hands and legs could be explored to build a more robust model. During experimentation, we found it difficult to localize the pupil’s center when the head was turned to an extreme position in either direction. Consequently, neural network-based techniques such as MT-CNN and Face-Net could be employed instead of the dlib-based technique to capture the face and other regions of interest. Although these techniques would capture the face and other ROIs with greater accuracy, they take more time to extract these features. It will therefore be essential to find an optimized approach that balances time against accuracy.

We also encountered a problem in setting the threshold value for the condition of occlusion, because eye dimensions differ from person to person. The data therefore need to be normalized to improve the overall performance of the model. In addition, we observed that our system could not recognize the driver’s eyes correctly behind sunglasses or under sudden changes in light, as may arise while the vehicle is on the road. Future work could address this issue with a more robust technique. The dataset we used was rare and satisfactory but not ideal, because it did not include subjects wearing glasses. For this reason, we supplemented it with video and image data from online sources. Future studies could therefore create their own dataset that is versatile and relevant for this task.