1 Introduction

Gaze detection determines the area at which a user is looking by using cameras and other sensors [16, 18, 67]. With recent advances in technology, driver assistance systems have also undergone a number of developments; examples include frontal collision warning, lane departure warning, and blind spot detection systems. Car accidents are mainly caused by human error, such as distraction and drowsiness [37,38,39]. Driver assistance systems mostly use sensors to detect external dangers such as frontal collisions and lane departures; therefore, distracted driving, such as not looking at the road ahead, and driver drowsiness are not monitored by these sensors. Various methods have been proposed to detect the driver's state [2, 14, 33, 55], and gaze detection is one of them. Most previous gaze detection studies were performed in indoor desktop environments [9, 19, 23, 30, 35, 48, 51, 56, 63, 68], and earlier studies of driver gaze detection considered only head movement without eye tracking [20, 43]. Although there has been research on driver gaze detection based on eye-tracking techniques [4, 20, 57, 60], its accuracy is affected by the various sitting positions and heights of drivers when initial driver calibration is not performed.

In general, single camera-based gaze detection requires initial user calibration. By having the user observe a few designated spots on a display at the outset, calibration maps the relationship between locations on the display and the locations of the pupil center, compensated by the corneal specular reflection (SR), in the captured eye image. User calibration is also needed to compensate for the kappa angle between the user's actual gaze direction and the pupil axis, and for the fact that eyeball size varies between people. In a vehicular environment, however, drivers find it difficult to perform initial calibration because spots to observe for this purpose cannot be marked inside the vehicle. Therefore, past research on gaze detection in vehicular environments has usually skipped the calibration step and tracked the driver's gaze based on head position, determined from the location of the iris center (instead of the pupil center) or a statistical 3D model of the human head, which limits the achievable gaze tracking accuracy in a car environment. With dual cameras, driver calibration can be omitted, but processing time and complexity increase because images from two cameras must be processed. In addition, the problem of the corneal SR disappearing from the eye image when the driver turns his or her head severely to look at a side-view mirror has not been addressed in previous research.

To solve these problems, we propose a single camera-based gaze tracking method for drivers in an actual vehicle environment that uses one-point calibration and exploits the medial canthus (MC) in addition to the pupil center to cope with severe rotation of the driver's head. Our research is novel in three ways compared to past work.

  • The driver's gaze is calculated by combining prior knowledge from average calibration on a desktop monitor with the offset obtained when each driver gazes at the near-infrared (NIR) illuminator at the initial step (one-point calibration).

  • Robust gaze tracking is achieved by tracking both the corneal SR and the MC, which solves the problem of the corneal SR being lost in the eye image. In addition, accurate parameters for calculating the gaze position are obtained based on a maximum entropy criterion.

  • Because there are few open databases for NIR light-based gaze tracking in a car environment, we have made our collected database open through [15] so that other researchers can compare their performance.

2 Related works

Previous gaze detection studies in vehicular environments can be broadly sorted into dual camera-based methods [3, 43, 61] and single camera-based methods. The latter can be further classified into 3D [57, 58, 60, 65] and 2D methods. The 2D gaze detection methods either use Purkinje images [4, 50] or detect facial feature points [20] to estimate gaze. One method based on Purkinje images employed the first Purkinje image, the reflection from the outer surface of the cornea, and the fourth Purkinje image, the reflection from the inner surface of the lens, to track gaze [4]. Another study [20] used facial feature points to find the iris and binarized it to estimate the area being gazed at. Regression-based methods have also been researched, including appearance-based gaze estimation via uncalibrated gaze pattern recovery and adaptive linear regression for appearance-based gaze estimation [27, 31]. In [25], Gosh et al. proposed a method for eye detection and tracking to monitor driver vigilance. In [24], a non-intrusive approach for drowsiness detection was proposed, which recognized only the driver's drowsiness status by classifying the eyes as open or closed. In [42], the authors proposed the SafeDrive system, which automatically determines driver phone use by leveraging built-in smartphone sensors to sense driving conditions. In [41], Health Driving was proposed, a smartphone-based system for detecting driving events and road conditions solely with a built-in smartphone acceleration sensor. In [40], the authors proposed Safe Walking, an Android smartphone-based system that detects the walking behavior of pedestrians by leveraging the sensors and front camera of smartphones. In [45], a method was proposed for generic human motion tracking based on annealed particle filtering and a Gaussian process dynamical model. In [11], a fusion formulation was proposed that combines low- and high-dimensional tracking methods into one framework. In [46], a method for complicated activity recognition was proposed, composed of two components: temporal pattern mining and adaptive multi-task learning. In [47], a method was proposed to identify temporal patterns among actions and use the identified patterns to represent activities for automated recognition based on a support vector machine (SVM) or k-nearest neighbor (kNN) classifier. In [44], an atomic activity-based probabilistic framework using Allen's interval relations was defined. However, none of these previous studies addressed driver gaze detection. Table 1 shows a comparative analysis of the proposed and previous methods.

Table 1 Comparison between the proposed method and past research

3 Proposed gaze detection method

3.1 Overview of proposed method

Figure 1 shows the flowchart of the proposed method. In step (1), a gaze detection system (see Section 3.2) that combines an NIR camera and an NIR illuminator is used to capture the facial image of a driver.

Fig. 1 Flowchart of proposed method

If the driver has yet to go through calibration, he or she looks once at the NIR illuminator in the gaze detection system to perform one-point calibration, as shown in step (3) (see Section 3.5). If the driver has already performed one-point calibration, facial feature points are detected from the captured facial image, and the pupil search region is defined based on the detected eye feature points, as shown in step (4) (see Section 3.3). Then, the center of the pupil is detected within the search region, and the MC points are detected from the facial feature points (steps (5) and (6)) (see Section 3.4). The SR search region is defined based on the detected pupil center, and the corneal SR is detected within this region (step (7)) (see Section 3.4). In the final step, the driver's gaze is calculated from the detected pupil center, MC, and corneal SR by combining the gaze direction obtained through the one-point calibration of step (3) with the average calibration information of multiple subjects acquired beforehand (step (8)) (see Section 3.5).

3.2 Proposed gaze detection system

The gaze detection system for vehicles developed for this study consists of an NIR camera and an NIR light-emitting diode (LED) illuminator, as shown in the upper part of Fig. 2; it is small (8.8 × 4.3 × 4.0 cm in width × height × depth). Because of its small size, it can be installed in the vicinity of the dashboard, as shown in the lower part of Fig. 2, and can continuously track the driver's gaze without obscuring the dashboard. Only when the steering wheel is rotated severely (by more than 60 or 70 degrees) can the NIR illuminator or camera be occluded by the wheel; however, this seldom occurs during actual driving. The NIR illuminator, consisting of six NIR LEDs, was placed to the left of the camera and helps capture the driver's facial image without being influenced by the changing light between night and day. Using NIR LEDs at a wavelength of 850 nm prevents uncomfortable dazzling of the driver. A universal serial bus (USB) Web camera [17] was used to capture NIR images of 1600 × 1200 pixels, and a zoom lens (focal length of 9 mm) was attached to the camera. An 850 nm band pass filter (BPF) was also mounted on the camera's lens to minimize interference from sunlight [1]. Power was supplied by a laptop computer through two USB lines, one connecting the camera and the other the illuminator. All data acquisition and testing were performed on a laptop computer with a 2.80 GHz CPU (Intel® Core™ i5-4200H) and 8 GB of RAM.
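For illustration, a minimal sketch of the camera capture configuration is shown below. It assumes a UVC-compliant USB camera at device index 0 and uses the current OpenCV C++ property names (the constants differ slightly in the OpenCV 2.4.x version used in this work), so it is not the exact acquisition code of the proposed system.

```cpp
// Illustrative capture setup only; the device index and property names are
// assumptions, not the exact configuration of the proposed system.
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::VideoCapture cap(0);                          // USB NIR web camera
    if (!cap.isOpened()) { std::cerr << "camera not found\n"; return 1; }
    cap.set(cv::CAP_PROP_FRAME_WIDTH, 1600);          // 1600 x 1200 NIR images
    cap.set(cv::CAP_PROP_FRAME_HEIGHT, 1200);

    cv::Mat frame;
    while (cap.read(frame)) {                         // the 850 nm BPF is optical,
        cv::imshow("NIR frame", frame);               // so no software step is needed
        if (cv::waitKey(1) == 27) break;              // ESC to quit
    }
    return 0;
}
```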

Fig. 2 Proposed gaze detection system in vehicular environment

3.3 Detecting facial feature points and defining pupil search region

Previous gaze detection studies used the bright and dark pupil method to detect the center of the user's pupil [18, 68]. For this to work, however, two separate illuminator groups need to be turned on and off in synchrony with the camera frames to generate both bright and dark pupils, which makes the system larger and therefore more difficult to mount in a vehicle. This method was primarily used in indoor desktop gaze detection environments; in a vehicular environment, where external sunlight flows in freely, bright pupils cannot be properly generated. Other studies detected bright pixels within the captured images to find the user's corneal SR [9, 30], but such methods are difficult to use in vehicular environments because sunlight can produce pixels brighter than the corneal SR. For these reasons, the proposed method first detects facial feature points and then designates the pupil search region based on these points before detecting the pupil center (step (4) of Fig. 1). Using the dlib facial feature point tracking method [13, 32], 68 feature points are automatically detected, as shown in Fig. 3. The pupil search regions are defined as in Fig. 3b (green boxes) based on detected points 36 to 41 and points 42 to 47. In our research, a rough search region is defined and the accurate pupil center is then detected based on the method of Fig. 4. Therefore, a slightly inaccurate search region caused by imperfect dlib tracking does not affect the accurate detection of the pupil center. However, if the rough search region is defined on a completely different facial feature, such as a nostril, due to a failure of the dlib method, it can cause pupil detection errors. To solve this problem, our system compares the positions of the eye centers detected by the dlib method in the previous and current frames. As explained in Section 4.5, the proposed method operates at approximately 35 frames per second (1000/28.6). Considering the time difference of 28.6 ms, the difference between the previous and current eye center positions cannot be large. Therefore, if the difference exceeds a threshold, our system determines that the eye center positions in the current frame are incorrect due to a failure of the dlib method and uses the positions from the previous frame to define the rough pupil search region in the current frame, as sketched below.
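A minimal sketch of this consistency check is given below, assuming a per-eye state kept between frames; the threshold value and function name are illustrative, not the exact values of our implementation.

```cpp
// Sketch of the frame-to-frame eye-center consistency check described above.
// If the eye center reported by the landmark detector jumps farther than a
// threshold between consecutive frames (about 28.6 ms apart), the previous
// center is reused to define the rough pupil search region.
#include <opencv2/core.hpp>
#include <cmath>

cv::Point2f stabilizeEyeCenter(const cv::Point2f& current,
                               cv::Point2f& previous, bool& hasPrevious,
                               float maxJumpPixels = 30.0f)   // assumed threshold
{
    if (hasPrevious) {
        float dx = current.x - previous.x;
        float dy = current.y - previous.y;
        if (std::hypot(dx, dy) > maxJumpPixels) {
            // Likely a landmark failure (e.g., the region jumped to the nostril);
            // keep the previous eye center for the current frame.
            return previous;
        }
    }
    previous = current;
    hasPrevious = true;
    return current;
}
```

Keeping a single previous position per eye is sufficient here because consecutive frames are only 28.6 ms apart, so genuine eye-center motion between frames is small.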

Fig. 3 Examples of a detected facial feature points and b the defined pupil search region in an image in a vehicular environment

Fig. 4 Flowchart of proposed method used to detect the center of pupil

3.4 Pupil, MC, and Corneal SR Detection

To estimate the area at which the driver is gazing using the pupil center corneal reflection (PCCR)-based gaze detection method, the center of the pupil and the geometric center [6] of the corneal SR have to be accurately detected. Figure 4 shows a flowchart of the pupil detection method based on image processing algorithms [26]; a detailed explanation of this method can be found in [30]. As stated above, unlike gaze detection in a general indoor desktop monitor environment, where subjects stare at a small monitor, the area drivers look at in a vehicular environment, including both side mirrors, is far more extensive and results in greater head and eye movements. This is why the corneal SR is sometimes detected in bright areas of the sclera or skin instead of the dark regions of the pupil or iris, or is lost entirely and undetectable. The corneal SR is generally used in the PCCR method as a reference point to compensate for head movements when estimating the driver's gaze [56, 68]. As already mentioned, however, the corneal SR, i.e., the reference point, can be lost when there is excessive head or eye movement in a vehicular environment; therefore, this study uses the MC as an additional reference point. As described in Section 3.3, points 39 and 42 of the detected facial feature points were designated as the MC locations used for gaze detection. These points were used instead of points 36 and 45 because, when the driver turns his or her head excessively, it is more likely that one of points 36 and 45 becomes unobservable in the captured image than one of points 39 and 42. In [29], a method was proposed for tracking the user's gaze position on a frontal-view home appliance as an interface for disabled people in an indoor environment. For eye tracking, they used a simple method of detecting the pupil center and corneal specular reflection (SR). However, our system is used in an outdoor car environment, where illumination changes are more severe than in [29]. Therefore, the eye region of interest is first defined by the eye features detected by the dlib facial feature point tracking method, as shown in Fig. 3, and the pupil center and corneal SR center are detected within this region (a simplified sketch of this detection step is given below). The performance of the dlib facial feature point tracking method is not affected by illumination changes in an external environment, as shown in Table 2. In addition, as shown in Fig. 4, our eye-tracking method includes a sophisticated procedure for reducing the pupil boundary detection error caused by a specular reflection (SR) hiding the boundary; for this purpose, steps (6)–(8) of our eye-tracking algorithm are newly proposed compared to [29]. In addition, unlike [36], our method also detects the medial canthus in order to cope with severe head rotation, in which the corneal SR disappears from the captured image.
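The listing below is a simplified sketch of detecting the pupil center and corneal SR inside a cropped grayscale eye region. The fixed thresholds and the 50 × 50 pixel SR search window are illustrative assumptions, and the boundary-correction steps (6)–(8) of Fig. 4 are omitted.

```cpp
// Simplified sketch: pupil = largest dark blob, corneal SR = small saturated
// blob near the pupil. Thresholds and window size are assumed values.
#include <opencv2/opencv.hpp>
#include <vector>

bool detectPupilAndSR(const cv::Mat& eyeGray, cv::Point2f& pupilCenter,
                      cv::Point2f& srCenter, bool& srFound)
{
    // Pupil: the largest dark blob in the eye region
    cv::Mat dark;
    cv::threshold(eyeGray, dark, 60, 255, cv::THRESH_BINARY_INV);     // assumed threshold
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(dark, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    int best = -1; double bestArea = 0.0;
    for (int i = 0; i < static_cast<int>(contours.size()); ++i) {
        double a = cv::contourArea(contours[i]);
        if (a > bestArea) { bestArea = a; best = i; }
    }
    if (best < 0) return false;
    cv::Moments m = cv::moments(contours[best]);
    if (m.m00 <= 0.0) return false;
    pupilCenter = cv::Point2f(static_cast<float>(m.m10 / m.m00),
                              static_cast<float>(m.m01 / m.m00));

    // Corneal SR: a small saturated blob searched in a window around the pupil
    cv::Mat bright;
    cv::threshold(eyeGray, bright, 230, 255, cv::THRESH_BINARY);      // assumed threshold
    cv::Rect srSearch(static_cast<int>(pupilCenter.x) - 25,
                      static_cast<int>(pupilCenter.y) - 25, 50, 50);
    srSearch &= cv::Rect(0, 0, eyeGray.cols, eyeGray.rows);
    srFound = false;
    if (srSearch.area() > 0) {
        cv::Moments ms = cv::moments(bright(srSearch), true);
        if (ms.m00 > 0.0) {
            srFound = true;
            srCenter = cv::Point2f(static_cast<float>(ms.m10 / ms.m00) + srSearch.x,
                                   static_cast<float>(ms.m01 / ms.m00) + srSearch.y);
        }
    }
    return true;
}
```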

Table 2 Error in the detection of center of pupil, corneal SR, and MC (unit: pixel)

3.5 Calculating gaze position by combining average calibration information and one-point calibration based on maximum entropy criterion

User calibration is the process of identifying the relationship between the monitor display and the region of the user's pupil movements, obtained by having users look at certain spots on the display (e.g., 4, 6, or 9 spots). User calibration is also needed to compensate for the kappa angle between where the user is actually looking and where the pupil is directed, and for the fact that people have eyeballs of varying size. In a vehicular environment, however, drivers find it difficult to perform initial calibration because the spots to look at for the purpose of calibration cannot be marked inside the vehicle. Therefore, this study uses the average calibration information obtained beforehand from 10 users, each of whom was asked to stare at nine pre-designated spots on a monitor (M1–M9 in Fig. 5) in an indoor desktop monitor environment. For fair comparison, these 10 people's data are not included in the data of the 26 people used for performance evaluation in Section 4. The details of user calibration are as follows.

Fig. 5 Relationship between each pupil subregion and the monitor subregion

As described in Section 3.4, the PCCR vector (the location of the pupil center compensated by that of the corneal SR) is calculated based on the detected locations of the pupil center and the corneal SR. Then, based on this, four geometric transform matrices (Matrices 1–4) are found using nine pupil centers (P1–P9), each compensated by the location of the corneal SR and acquired when the subjects stared at the nine spots (M1–M9) on the monitor plane, as illustrated in Fig. 5 and Eq. (1) [30, 35].

However, the relationship between a pupil subregion and a monitor subregion is actually nonlinear due to the nonlinear movement of the pupil center on the 3D eyeball, the camera parameters, and the user's position. Therefore, this relationship cannot be represented by the simple linear matrix of Eq. (1), and accurate parameters for this relationship are calculated based on a maximum entropy criterion, as shown in Eqs. (2)–(6). Detailed explanations are as follows. Figure 5 and Eq. (1) show an example of how such a geometric transform matrix is calculated:

$$ \left[\begin{array}{cccc}{M}_{x1}&{M}_{x2}&{M}_{x3}&{M}_{x4}\\ {M}_{y1}&{M}_{y2}&{M}_{y3}&{M}_{y4}\end{array}\right]=\left[\begin{array}{cccc}a&b&c&d\\ e&f&g&h\end{array}\right]\left[\begin{array}{cccc}{P}_{x1}&{P}_{x2}&{P}_{x3}&{P}_{x4}\\ {P}_{y1}&{P}_{y2}&{P}_{y3}&{P}_{y4}\\ {P}_{x1}{P}_{y1}&{P}_{x2}{P}_{y2}&{P}_{x3}{P}_{y3}&{P}_{x4}{P}_{y4}\\ 1&1&1&1\end{array}\right] $$
(1)

Where (Mxi, Myi) and (Pxi, Pyi) are the x and y coordinates of Mi and Pi, respectively. Then, we can obtain the following two functions.

$$ J\left(a,b,c,d\right)=1\Big/\Big[{\left({M}_{x1}-\left(a{P}_{x1}+b{P}_{y1}+c{P}_{x1}{P}_{y1}+d\right)\right)}^2+{\left({M}_{x2}-\left(a{P}_{x2}+b{P}_{y2}+c{P}_{x2}{P}_{y2}+d\right)\right)}^2+{\left({M}_{x3}-\left(a{P}_{x3}+b{P}_{y3}+c{P}_{x3}{P}_{y3}+d\right)\right)}^2+{\left({M}_{x4}-\left(a{P}_{x4}+b{P}_{y4}+c{P}_{x4}{P}_{y4}+d\right)\right)}^2+\mathrm{offset}_1\Big] $$
(2)
$$ J\left(e,f,g,h\right)=1\Big/\Big[{\left({M}_{y1}-\left(e{P}_{x1}+f{P}_{y1}+g{P}_{x1}{P}_{y1}+h\right)\right)}^2+{\left({M}_{y2}-\left(e{P}_{x2}+f{P}_{y2}+g{P}_{x2}{P}_{y2}+h\right)\right)}^2+{\left({M}_{y3}-\left(e{P}_{x3}+f{P}_{y3}+g{P}_{x3}{P}_{y3}+h\right)\right)}^2+{\left({M}_{y4}-\left(e{P}_{x4}+f{P}_{y4}+g{P}_{x4}{P}_{y4}+h\right)\right)}^2+\mathrm{offset}_2\Big] $$
(3)

The goal of our research is to obtain the optimal parameters (a, b, c, d) and (e, f, g, h) with which J(a, b, c, d) and J(e, f, g, h) are maximized, respectively. In Eqs. (2) and (3), offset1 and offset2 are terms that keep the denominators non-zero. From Eqs. (2) and (3), we obtain the probability density function (PDF) values of J(a, b, c, d) and J(e, f, g, h) as follows.

$$ P\left(a,b,c,d\right)=J\left(a,b,c,d\right)/\left(J\left(a,b,c,d\right)+J\left(e,f,g,h\right)\right) $$
(4)
$$ P\left(e,f,g,h\right)=J\left(e,f,g,h\right)/\left(J\left(a,b,c,d\right)+J\left(e,f,g,h\right)\right) $$
(5)

Then, based on the maximum entropy criterion [7], the goal is to obtain the optimal parameters (a, b, c, d) and (e, f, g, h) (such that P(a, b, c, d) and P(e, f, g, h) are not biased but evenly fitted) with which H(a, b, c, d, e, f, g, h) is maximized, as follows.

$$ H\left(a,b,c,d,e,f,g,h\right)=-P\left(a,b,c,d\right)\log \left(P\left(a,b,c,d\right)\right)-P\left(e,f,g,h\right)\log \left(P\left(e,f,g,h\right)\right) $$
(6)
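For clarity, the listing below sketches how the terms of Eqs. (2)–(6) can be evaluated for one candidate parameter set. The outer search that maximizes these objectives over (a, …, h) is not detailed in the text, so only the objective evaluation is shown here; the type and function names are illustrative.

```cpp
// Sketch of evaluating the objectives of Eqs. (2)-(6) for one candidate
// parameter set; the outer optimization over (a, ..., h) is not shown.
#include <array>
#include <cmath>

struct Corresp { double Px, Py, Mx, My; };   // PCCR (pupil) and monitor coordinates

// J of Eq. (2) (xAxis = true) or Eq. (3) (xAxis = false): the reciprocal of
// the sum of squared mapping residuals over the four correspondences.
double J(const std::array<Corresp, 4>& pts, double p0, double p1, double p2,
         double p3, bool xAxis, double offset = 1e-6)        // offset1 / offset2
{
    double sum = offset;                                      // keeps the denominator non-zero
    for (const auto& q : pts) {
        double pred   = p0 * q.Px + p1 * q.Py + p2 * q.Px * q.Py + p3;
        double target = xAxis ? q.Mx : q.My;
        sum += (target - pred) * (target - pred);
    }
    return 1.0 / sum;
}

// H of Eq. (6), built from the normalized terms of Eqs. (4) and (5).
double H(const std::array<Corresp, 4>& pts,
         double a, double b, double c, double d,
         double e, double f, double g, double h)
{
    double Jx = J(pts, a, b, c, d, true);
    double Jy = J(pts, e, f, g, h, false);
    double Px = Jx / (Jx + Jy);
    double Py = Jy / (Jx + Jy);
    return -Px * std::log(Px) - Py * std::log(Py);
}
```

H is largest when the two normalized terms are balanced, i.e., when neither the x- nor the y-mapping residual dominates, which is the "evenly fitted" condition described above.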

From this, we obtain the parameters (a, b, c, d) and (e, f, g, h), and the user's gaze position (Gx, Gy) can be obtained from the following equation using the extracted pupil center position (P’x, P’y).

$$ \left[\begin{array}{c}{G}_x\\ {G}_y\end{array}\right]=\left[\begin{array}{cccc}a&b&c&d\\ e&f&g&h\end{array}\right]\left[\begin{array}{c}{P^{\prime}}_x\\ {P^{\prime}}_y\\ {P^{\prime}}_x{P^{\prime}}_y\\ 1\end{array}\right] $$
(7)

The gaze (Gx, Gy) can thus be calculated from the pupil center (Px, Py), compensated by the location of the corneal SR, in a given image. Furthermore, when the corneal SR cannot be found in the captured image due to excessive head rotation, the gaze (Gx, Gy) is calculated from the pupil center (Px, Py) compensated by the location of the MC. Based on this, if the pupil center (compensated by the location of the corneal SR or the MC) is in pupil subregion 2 of Fig. 5, the parameters of Eq. (7) based on Matrix 2 are used to estimate the gaze; if the center is in pupil subregion 4, those based on Matrix 4 are used instead. One gaze value is determined from each image frame; for example, ten gaze values are determined from ten image frames. As stated above, the proposed method tracks the MC points (points 39 and 42 in Fig. 3a) at the same time so that the gaze can be calculated even when the corneal SR cannot be found in the captured image due to excessive head rotation; it also calculates four additional sets of parameters of Eq. (7), based on four geometric transform matrices, by using nine additional pupil centers (compensated by the location of the MC) acquired when the test subjects stared at the nine spots (M1–M9) on the monitor plane illustrated in Fig. 5. Instead of having the subjects stare at the nine spots on the monitor again, these nine pupil centers (compensated by the location of the MC) are calculated from the same images captured when the subjects looked at the nine spots to calculate the PCCR vector (the pupil center location compensated by that of the corneal SR). In other words, user calibration is accomplished with each user staring only once at the nine spots on an indoor desktop monitor. Hence, the method uses four sets of parameters of Eq. (7) based on the four geometric transform matrices derived from the corneal SR location and four additional sets derived from the MC location, for a total of eight sets of parameters of Eq. (7). The gaze points calculated using the pupil center compensated by the location of the corneal SR or the MC are less influenced by head movements. As mentioned above, the proposed method calculates the average of the pupil locations (each compensated by the locations of the corneal SR or the MCs) for each of the nine pre-designated spots (M1–M9 of Fig. 5) stared at beforehand by 10 users in an indoor desktop monitor environment, and uses this as user calibration information in the vehicular environment. Because this pupil information was obtained from general users in an indoor desktop monitor environment, however, the margin of error increases when it is used to calculate the gaze of a given vehicle driver. This is due to differences in each driver's kappa angle and sitting position. To resolve this issue, the 10 users (whose data are not included in the data of the 26 people used for performance evaluation in Section 4, for fair comparison) also looked at the NIR illuminator of the gaze detection system, installed at the bottom of the indoor desktop monitor, in advance. From this, we obtain the average gaze position (Gx_cal_indoor, Gy_cal_indoor) of Eq. (7) of the 10 users.

In addition, during testing (see the details in Section 4.1), each of the 21 test participants looks once (at the initial step) at the NIR illuminator of the gaze detection system (one-point calibration), attached to the dashboard as shown in Fig. 2. From this, we obtain the gaze position (Gx_cal_vehicle, Gy_cal_vehicle) of Eq. (7) when each participant performs one-point calibration. Our system then uses the difference between these two points ((Gx_cal_indoor, Gy_cal_indoor) and (Gx_cal_vehicle, Gy_cal_vehicle)) to compensate the gaze position (Gx, Gy) of Eq. (7) when each participant looks at one of the fifteen positions in the vehicle shown in Fig. 6. This reduces the gap in gaze direction caused by the different kappa angles and sitting positions of drivers, resulting in more accurate gaze detection in vehicles. Nevertheless, this gap cannot be completely removed. However, the distances between the gazed positions (shown in Fig. 6) are much larger than those on a desktop monitor because of the large viewing area in a car, which reduces the consequent gaze error in our experimental environment. The above-mentioned NIR illuminator uses 850 nm NIR LEDs, which emit weak light, rendering them appropriate as an indicator for one-point calibration. A brief sketch of this final gaze computation is given below.
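The sketch below follows the description above: Eq. (7) is applied to the PCCR vector (or the pupil-MC vector when the corneal SR is lost), and the result is shifted by the one-point-calibration offset. The variable names and the sign convention of the offset are illustrative assumptions, and the subregion-dependent choice among the stored parameter sets of Fig. 5 is omitted.

```cpp
// Sketch of the final gaze computation of Eq. (7) plus one-point calibration.
// Names and the offset sign convention are illustrative assumptions.
#include <opencv2/core.hpp>

struct GazeParams { double a, b, c, d, e, f, g, h; };   // one parameter set of Eq. (7)

cv::Point2d gazeFromEq7(const GazeParams& p, const cv::Point2d& v)   // v = (P'x, P'y)
{
    double Gx = p.a * v.x + p.b * v.y + p.c * v.x * v.y + p.d;
    double Gy = p.e * v.x + p.f * v.y + p.g * v.x * v.y + p.h;
    return cv::Point2d(Gx, Gy);
}

// One-point calibration: shift the gaze by the difference between the average
// indoor gaze at the illuminator and this driver's gaze at the in-vehicle
// illuminator (assumed sign convention).
cv::Point2d compensateOnePoint(const cv::Point2d& gaze,
                               const cv::Point2d& gazeCalIndoor,
                               const cv::Point2d& gazeCalVehicle)
{
    return gaze + (gazeCalIndoor - gazeCalVehicle);
}
```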

Fig. 6 Fifteen positions in the vehicle for measuring the accuracy of gaze detection

4 Experimental results

4.1 Experimental setups

Although there has been previous research on gaze detection in car environments [3, 4, 10, 20, 43, 57, 58, 61, 65], few open databases exist for NIR light-based gaze tracking of drivers in a car environment. Therefore, we collected our own database for NIR light-based gaze tracking of drivers in a car environment and made it open through [15]. The gaze tracking device described in Section 3.2 was used for collecting the database. As shown in Fig. 7, the database was collected from a total of 26 participants: 10 wearing nothing, 8 wearing only four kinds of glasses, 5 wearing only two kinds of sunglasses, and 3 wearing only a hat. In addition, the participants wearing nothing took various poses, including putting one hand to the cheek or using a mobile phone, as shown in Fig. 7e. Fifteen spots were designated to gaze at for the experiment, as shown in Fig. 6; each participant stared at each spot for about three seconds, and this procedure was repeated five times. Between iterations, the participants took a rest of about 2 min, and a total of about 12 or 13 min was taken for each participant. When the participants were staring at each spot, they were told to act normally, as if they were actually driving, and were not restrained to one position or given any special instructions to act in an unnatural manner. Due to the risk of car accidents, it is difficult to ask participants to stare at the 15 designated spots while actually driving. Therefore, this study obtained images at various locations (from roads in daylight to a parking garage) in a real vehicle (an SM5 New Impression by Renault Samsung [54]) with its power on but in park, in order to create an environment most similar to actual driving (including factors such as car vibration and external light). Moreover, to understand the influence of various kinds of external light on driver gaze detection, test data were acquired at different times of the day: in the morning, in the afternoon, and at night. All algorithms of our method were implemented in Microsoft Visual Studio 2013 C++, and the OpenCV (version 2.4.5) [49] and Boost (version 1.55.0) libraries were used. Figure 7 shows some examples of images captured with the gaze detection system developed for this study in a vehicular environment. The research in [36] uses more positions (18 positions) than our research (15 positions); the three additional positions are the position to the left of region 1, the position to the right of region 5, and the position above region 6 of Fig. 6. The camera resolution of their system [36] is low, the illuminator power is low, and the driver's pupil center cannot be detected. Therefore, they tracked the driver's gaze position only by measuring head rotation (not eye rotation) and collected experimental data in which drivers were instructed to rotate their heads intentionally and sufficiently. When a driver moves only the eyes to gaze at some position in the car without rotating the head (which is more often the case while driving), their method cannot detect the driver's gaze position. Unlike their approach, our NIR solution provides a clear pupil image for gaze detection by tracking eye movement in the car environment while allowing natural movement of the driver's head. In addition, gazing at the three additional positions (to the left of region 1, to the right of region 5, and above region 6 of Fig. 6) does not occur frequently while driving [28, 52, 53, 64]; previous studies did not use these three positions for experiments either [10, 21, 22, 60, 66]. Based on this, we did not use these three positions and performed the experiments with the 15 positions of Fig. 6.

Fig. 7 Examples of images captured by gaze detection system in vehicular environment. Driver a wearing nothing, b wearing hat, c wearing sunglasses, d wearing glasses, and e putting one hand to cheek or using mobile phone

4.2 Accuracy of detecting pupil center, corneal SR, and MC

Figure 8 below shows examples of the driver’s pupil center, corneal SR, and MC detected using the proposed method. The pupil center, corneal SR, and MC were accurately detected without the influence of external light, even though the drivers were moving their heads in various manners. Even when the corneal SR could not be found in the captured image due to a driver’s excessive head rotation, the pupil center and the MC (points 39 and 42 in Fig. 3a) were accurately detected as shown in Fig. 8b.

Fig. 8 Examples of detected points of pupil center, corneal SR, and MC: a when corneal SR exists in image, b when corneal SR disappears due to excessive rotation of driver's head. Red small circles, red big circle, and green point show the detected facial feature points, pupil boundary, and corneal SR, respectively

In our experiments, there was no case in which the pupil center, corneal SR center, or MC was not detected at all in the experimental images. Therefore, it is difficult to measure a false recognition rate, because there is no ground criterion for correct versus incorrect pupil detection. However, the detected positions do show errors compared to the accurate positions, so we measured the accuracy using the following method. Table 2 below shows the error in detecting the pupil center, the corneal SR, and the MC using the method proposed in this study. The error was below 3.5 pixels, indicating high detection accuracy. The detection error is calculated as the average Euclidean distance between the manually marked locations of the pupil center, the corneal SR, and the MC and the locations detected by this study's algorithm, as sketched below.
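This error metric can be summarized by the short sketch below (a straightforward mean Euclidean distance; the function name is illustrative).

```cpp
// Sketch of the Table 2 error metric: the mean Euclidean distance (in pixels)
// between manually marked and automatically detected point locations.
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

double meanPixelError(const std::vector<cv::Point2f>& detected,
                      const std::vector<cv::Point2f>& groundTruth)
{
    if (detected.empty() || detected.size() != groundTruth.size()) return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < detected.size(); ++i)
        sum += std::hypot(detected[i].x - groundTruth[i].x,
                          detected[i].y - groundTruth[i].y);
    return sum / detected.size();
}
```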

4.3 Accuracy of gaze detection

The subsequent experiment estimated the gaze detection accuracy for the 15 points shown in Fig. 6. Because no pixel locations exist in the vehicle of Fig. 6, unlike on a desktop monitor, the following method was used to measure the accuracy of gaze detection. In the first fold of evaluation, among the data of the 26 participants, the data of 5 people were randomly selected for training, and the data of the remaining 21 people were used for testing. In the second fold, among the data of the 21 people used for testing in the first fold, the data of 5 people were randomly selected for training, and the data of the remaining 21 people were used for testing. From this two-fold cross-validation, we obtained the average accuracy for performance evaluation. The same scheme was also used for measuring the detection errors of the pupil center, corneal SR, and MC in Table 2. With the training data of the 5 people, the average gaze value ((Gx, Gy) of Eq. (7) compensated by one-point calibration) for each of the 15 positions of Fig. 6 was determined as that position's reference gaze value. Using this method, 15 reference gaze values ((Gx1, Gy1), (Gx2, Gy2), …, (Gx15, Gy15) of Eq. (7) compensated by one-point calibration) were calculated for all 15 gaze positions of Fig. 6. Then, using the Euclidean distance between the gaze calculated from the testing data and the coordinates of these 15 reference gaze values, the position with the minimum distance was determined as the driver's gaze among the 15 positions in Fig. 6, as sketched below. Accuracy was measured using the strictly correct estimation rate (SCER) and the loosely correct estimation rate (LCER). The SCER counts an estimate as correct only if it matches the corresponding spot (gaze position), while the LCER also counts estimates of spots in the vicinity of the corresponding spot. Table 3 shows the gaze detection accuracies for all users. As seen in Table 3, the average SCER was approximately 79.3%, while the average LCER was approximately 96%. Table 4 presents the accuracies for users wearing nothing, glasses, sunglasses, and a hat, respectively. As shown in Table 4, the accuracies for users wearing nothing are the highest, whereas those for users wearing sunglasses are the lowest. The accuracies for users wearing sunglasses are the lowest because some sunglasses reduce the visibility of the eye region by attenuating the NIR light of our gaze detection system, as shown in Fig. 7c, which degrades the detection accuracy of the pupil and corneal SR centers and, consequently, the gaze detection accuracy. The accuracies for users wearing glasses are lower than those for users wearing nothing because the SR on the glasses surface can hide the pupil and corneal SR regions, which likewise degrades the detection of the pupil and corneal SR centers and the consequent gaze detection accuracy. The accuracies for users wearing a hat are similar to those for users wearing nothing, which shows that our method can detect the facial feature points and the pupil and corneal SR centers irrespective of the hat.
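A sketch of this evaluation protocol is shown below. The neighbour table encoding which spots of Fig. 6 count as "in the vicinity" for the LCER depends on the physical layout and is not enumerated in the text, so it is supplied by the caller here; all names are illustrative.

```cpp
// Sketch of the evaluation protocol: each test gaze value is assigned to the
// nearest of the 15 reference gaze values (minimum Euclidean distance). SCER
// counts only exact matches; LCER also counts spots listed as neighbours of
// the ground-truth spot.
#include <opencv2/core.hpp>
#include <cmath>
#include <limits>
#include <vector>

int nearestReference(const cv::Point2d& gaze,
                     const std::vector<cv::Point2d>& refs)   // 15 reference gaze values
{
    int best = -1;
    double bestDist = std::numeric_limits<double>::max();
    for (int i = 0; i < static_cast<int>(refs.size()); ++i) {
        double d = std::hypot(gaze.x - refs[i].x, gaze.y - refs[i].y);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;                                              // estimated gaze position index
}

void scoreSample(int estimated, int groundTruth,
                 const std::vector<std::vector<int>>& neighbours,  // vicinity of each spot
                 int& strictHits, int& looseHits)
{
    if (estimated == groundTruth) { ++strictHits; ++looseHits; return; }
    for (int n : neighbours[groundTruth])
        if (n == estimated) { ++looseHits; return; }
}
```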

Table 3 SCER and LCER of proposed method with all the users
Table 4 SCER and LCER of proposed method in case of users wearing nothing, glasses, sunglasses, and hat, respectively

Figure 9 below shows examples of images with incorrect estimation based on the SCER and LCER. As shown in Fig. 9a, incorrect estimation usually occurred when the driver was looking at point 11 in Fig. 6. This is because, when the corneal SR could not be found, the MC points (39 and 42 of Fig. 3a) among the detected facial feature points were used to calculate the gaze, and increased error in detecting the MC points in turn increased the gaze tracking error.

Fig. 9 Examples of incorrect estimation based on a SCER and b LCER

As shown in Fig. 9b, incorrect estimation usually occurred when the driver was looking at point 15 in Fig. 6. This is because the driver was looking far too downward, making the eyelid occlude the upper part of the pupil, which results in inaccurate detection of the pupil center. Further, the location of the MC (points 39 and 42 of Fig. 3a), which was used when the corneal SR could not be found in the captured eye image, was detected inaccurately as well, and these two errors (errors in detecting the pupil center as well as the location of the MCs) caused another error in detecting gaze, even when based on the LCER.

4.4 Comparisons with other methods

As the next experiment, we compared the accuracy of our method with that of a previous method [10], which used a deep learning approach based on a convolutional neural network (CNN) for estimating the gaze position in a car environment. For fair comparison, the same training data used for our method were used for CNN training, and the accuracies were measured with the same testing data used for our method. As with our method, accuracy was measured twice by exchanging the training and testing data (two-fold cross-validation), and the average of the two accuracies was taken as the final gaze detection accuracy. As shown in Tables 3 and 5, our method outperforms the previous method. One reason the accuracies of the previous method are lower than ours is that the number of target zones in our research (15 zones) is much larger than that in [10] (8 zones). In addition, the research in [10] did not consider cases in which a driver wore a hat, glasses, or sunglasses, or took various poses, including putting one hand to the cheek or using a mobile phone, as shown in Fig. 7.

Table 5 SCER and LCER by previous method [10]

In addition, Tables 6 and 7 compare the accuracies of a previous study [61] and a commercial gaze detection system, the Tobii EyeX system [62], with those of our method. Because user calibration information acquired beforehand in an indoor desktop monitor environment could not be used in a vehicular environment for the Tobii system, the system was attached at the same location in the vehicle as the gaze detection system used in this study (Fig. 2), and calibration was performed on the four spots where user calibration for the Tobii system was possible (points in the vicinity of 2, 3, 9, and 10 in Fig. 6). Calibration was performed only on these four spots because the system could not acquire user calibration information when there was excessive eye or head rotation. Because the Tobii system was designed for use with a small indoor desktop monitor, it yielded no gaze detection results for points 1, 4, 5, 6, 7, 8, 11, 14, and 15 in Fig. 6 when used in a vehicle, which has an extensive gazing area. Therefore, Table 7 shows the average SCER of the Tobii system for all points excluding those with no gaze detection results; that is, Table 7 reports the SCERs of our method and the Tobii system on the same gaze zones, excluding points 1, 4, 5, 6, 7, 8, 11, 14, and 15 in Fig. 6. As seen in Tables 6 and 7, the SCER of the proposed method was higher than those of the previous method [61] and the Tobii system. The accuracy of [61] is lower than ours because they detected the iris center rather than the pupil center for calculating the gaze position, whereas our method detects the pupil center, which yields better accuracy. The accuracy of the Tobii system [62] is lower than ours because it is designed for gaze tracking in a small indoor desktop monitor environment, which does not include the various illumination changes of the outdoors.

Table 6 Comparison of SCER between proposed method and a previous method (unit: %)
Table 7 Comparisons of SCER between proposed method and the Tobii system (unit: %)

Although there has been various previous research on gaze detection in car environments, most of the algorithms are not publicly available. We selected these three methods [10, 61, 62] for comparison because their experimental gazing regions are comparable to ours [61], we could implement their algorithm [10], and the actual gaze position could be produced [62]. When the number of target zones increases, the changes in eye and head movement in the captured image when the driver gazes at each zone decrease, and the classification complexity of each gaze zone increases, which reduces the final gaze detection accuracy. As the next experiment, we measured the accuracies of the methods in [29, 36], as shown in Table 8. By comparing Tables 3 and 8, we can see that our method outperforms the previous methods [29, 36]. This confirms that the proposed maximum entropy criterion-based calculation of the gaze position is superior to the geometric transform-based method of [29]. In addition, because the method of [36] uses only head rotation for measuring the driver's gaze position without considering eye movement, its accuracies are lower than ours.

Table 8 SCER and LCER by previous methods

For comparison with related works, we compared the accuracies of our method with those of previous methods [10, 36, 61, 62], as shown in Tables 5, 6, 7 and 8. Because none of these previous studies provide the code for their algorithms, we re-implemented their algorithms ourselves and performed the experiments with our datasets, as shown in Tables 5, 6, 7 and 8.

4.5 Processing time

Finally, Table 9 shows the processing time of the proposed method. Every algorithm was executed on the laptop computer described in Section 3.2, and the total processing time was approximately 28.6 ms, as shown in Table 9. This means that the proposed method can operate at a speed of approximately 35 frames per second (1000/28.6).

Table 9 Processing time of proposed method (unit: ms)

4.6 Experiments in driving environments

We performed additional experiments in real driving conditions. To measure the accuracy, the driver should gaze at the 15 designated spots, but it is difficult for a driver to stare at the spots while actually driving. Therefore, we collected data from participants sitting in the front passenger seat who gazed at 15 spots while the driver was actually driving the car, as shown in Fig. 10. Each spot was indicated by a small piece of yellow tape, which did not distract the driver's attention, as shown in Fig. 10a and b. In the previous experiments (Sections 4.1–4.5), our gaze tracking device was attached to the front of the dashboard (close to spot 2, as shown in Fig. 2). However, because there is no dashboard in front of a passenger and it is difficult to attach our device at spot 4 of Fig. 10a, we attached it at a position close to spot 14, as shown in Fig. 10a; this also allows us to show experimental results for a different placement of our gaze detection device in the car. Apart from this, the conditions of data acquisition were similar to those for the driver in the previous experiments (Sections 4.1–4.5). To check our method in various car environments, we used a different car (a Daewoo Lacetti Premiere by Chevrolet [12]) from that used in the experiments of Sections 4.1–4.5. The database was collected from a total of 10 participants: 3 wearing nothing, 3 wearing only three kinds of glasses, 2 wearing only two kinds of sunglasses, and 2 wearing only a hat. In addition, the participants wearing nothing took various poses, including putting one hand to the cheek or using a mobile phone. Each participant stared at each spot for about three seconds, and this procedure was repeated five times. Between iterations, the participants took a rest of about 2 min, and a total of about 12 or 13 min was taken for each participant. When the participants were staring at each spot, they were told to act normally, as if they were actually driving, and were not restrained to one position or given any special instructions to act in an unnatural manner. To understand the influence of various kinds of external light on driver gaze detection, test data were acquired at different times of the day (in the morning, in the afternoon, and at night) and were collected while driving on various roads. We also made this collected database open through [15] so that other researchers can compare the performance of their systems on our database. Table 10 shows the SCER and LCER of our gaze tracking method; they are similar to those of Table 3. From this, we can confirm the effectiveness of our gaze tracking method in real driving environments.

Fig. 10 Experiments in driving environments. a 15 gaze spots. b Experimental setup. c Examples of captured images (while gazing at spot 3 in the morning (left), spot 10 in the afternoon (middle), and spot 6 at night (right) of Fig. 6)

Table 10 SCER and LCER of proposed method in case of real driving environments

4.7 Discussion on the usefulness of the NIR illuminator

The usefulness of the NIR illuminator in our gaze detection system is as follows. First, the NIR illuminator of the gaze detection system helps capture the driver's facial image without being influenced by the changing light between night and day.

As shown in the nighttime image of Fig. 11a, a driver's face image captured without the NIR illuminator using our NIR camera-based gaze detection device is so dark that the pupil center cannot be detected for gaze detection. However, as shown in Fig. 11b, the pupil area is distinctive in an image captured at night with the NIR illuminator using our NIR camera-based device, which enables the pupil center to be detected. Although a visible light illuminator could be considered, it causes severe dazzling of the driver's eyes and cannot be used in a car environment. Without a visible light illuminator, even with a visible light camera, the pupil center cannot be detected because the image at night is too dark, as shown in Fig. 11c. Second, the center of the corneal SR is important for calculating the driver's gaze position because the position of the pupil center relative to the center of the corneal SR is used for calculating the gaze position. If only the pupil center is used for calculating the gaze position, its movement is affected by the driver's head movement, which degrades the gaze detection accuracy. Therefore, all previous gaze detection methods based on the PCCR vector used the center of the corneal SR to compensate for head movement when calculating the gaze position [8, 9, 30, 35, 56, 62, 67, 68]. In our gaze detection system, the NIR illuminator was placed to the left of the camera, as shown in Fig. 2, which generates the corneal SR in the eye image, as shown in Fig. 8a, and the driver's gaze position is calculated using both the corneal SR and the pupil center. Third, because a more accurate gaze position can be detected by using the pupil center instead of the iris center [8, 9, 30, 35, 56, 62, 67, 68], our research utilizes the pupil center. The color of the iris (the region between the pupil and the sclera) mainly depends on the amount of melanin in the anterior border layer [59]. The reflectance of the iris is also affected by the amount of melanin in the anterior border layer, and the reflectance of melanin, and thus of the iris, increases slightly as the illumination wavelength increases from visible light to NIR light [34]. That is, without NIR light, the reflectance of the iris is small and its brightness in the image decreases, which makes the iris dark, especially in the case of (dark) brown irises (e.g., Asian people), and makes it difficult to discriminate between the pupil and iris regions. With NIR light, the brightness of the iris in the image increases, which makes it easy to detect the pupil region, as shown in Fig. 8a. Therefore, all previous PCCR-based gaze detection methods used an NIR illuminator [8, 9, 30, 35, 56, 62, 67, 68]. In our system, using NIR LEDs at a wavelength of 850 nm prevents uncomfortable situations, such as the driver being blinded by the light while driving. An 850 nm band pass filter (BPF) was also mounted on the camera's lens to minimize interference from sunlight [1].

Fig. 11 Comparison of eye images at night a without NIR light, b with NIR light, and c with visible light camera

5 Conclusions

This study proposed a new gaze detection method for a car environment. The experiments showed that the gaze detection accuracy of the proposed method was higher than that of previous methods and of a commercial system, at a fast processing speed. In addition, the effectiveness of our gaze tracking method was demonstrated in real driving environments, which validated the proposed algorithm. Because our gaze tracking camera is positioned on the dashboard and captures the driver's face image from below (not from the front), more errors occur between vertical neighbors than between horizontal neighbors. In addition, the errors increase when the driver rotates his or her face severely to gaze at positions far from our gaze detection camera. These phenomena are also observed in the results of the previous methods [10] (Table 5) and [36]. These errors could be reduced by using an additional camera in an upper or side area, as in the previous method [61]. In future work, a method without any calibration will be researched using deep learning-based approaches. In addition, emotion recognition based on changes in gaze position will be researched.