
1 Introduction

Over the past decades, eye tracking systems have become a widely used tool in fields such as marketing research [1], psychological studies [2, 3] and human-computer interaction [4, 5]. Recently, eye tracking has also been applied in virtual reality and augmented reality devices for control [6] and panoramic rendering [7, 8]. However, commercially available head-mounted eye-tracking devices such as Tobii and Google Glass are expensive, so designing a head-mounted eye-tracking system with a simple hardware structure and low cost is of great significance to researchers in related fields.

Typically, the methods used in head-mounted eye tracking systems are divided into 2D and 3D approaches according to the eye movement features they use, and the 2D methods are simpler in hardware structure than the 3D methods. The 2D approaches use 2D eye movement features as input to construct a mapping model that yields the location of the gaze point. Takemura et al. [9] and Carlos et al. [10] used a camera and an infrared light source to obtain the pupil center and the corneal glint, forming a pupil center-corneal reflection (PCCR) vector that is fitted by a polynomial to obtain the gaze position. This is the most common approach for head-mounted eye tracking systems because of its good accuracy and relatively simple hardware structure, but it has poor accuracy at non-calibrated points. Arar et al. [11] used four fixed infrared lights to extract the pupil center as well as four glints, and obtained the gaze point location based on the geometric principle of cross-ratio invariance. Their method requires four infrared light sources to obtain the corneal glint positions, which makes the hardware structure complex and the system impractical. Moreover, the common disadvantage of the 2D methods is that the 2D features they use do not make full use of the information about changes in gaze direction, and they therefore have poor accuracy at non-calibrated points.

In contrast to the 2D methods, the 3D methods obtain the gaze direction directly from the structural characteristics of the eye and estimate the gaze point as the intersection of the gaze with the scene. However, most 3D methods rely on measurement information that must be calibrated in advance, such as light source positions, camera positions and screen positions, which is a great inconvenience in the use of eye tracking systems. Shih et al. [12] used two cameras and two light sources to compute the optical axis of the eye directly, avoiding the system calibration process and its calibration errors. Nevertheless, this method requires multiple calibrated cameras, and the positions of the light sources need to be set in advance. Roma et al. [13] constructed a 3D eye model by treating the pupil radius and the eyeball radius as known quantities and taking the direction of the line from the eyeball center to the pupil center as the visual axis. Their method ignores the physiological differences between users. Zhu et al. [14] used two cameras and two infrared light sources to calculate corneal and pupillary parameters and obtain the gaze direction. Their method has the same disadvantages as Shih's [12].

In summary, existing head-mounted eye tracking systems usually adopt either 2D methods that interpolate pupil center-corneal reflection vectors or 3D methods based on three-dimensional eye models. The disadvantage of the 2D methods is that the PCCR vector, as a feature, does not take full advantage of the information about changes in the line of sight, resulting in poor accuracy at non-calibrated points. The disadvantage of the 3D methods is that they usually require advance calibration of the positional relationship between the camera and the IR light sources, or a uniform eye model that ignores individual differences; moreover, they have complex hardware structures and high production costs. Hence, a lightweight head-mounted eye tracker is needed. Swirski et al. [15] proposed a method to recover the 3D eyeball from a monocular camera, but they evaluated the model only on synthetic eye images in a simulation environment, and its performance in a real eye-to-scene camera setup has never been quantified. To solve these problems of existing head-mounted eye tracking systems, this paper proposes a monocular, reflection-free, head-mounted 3D eye tracking system. Compared with existing methods, our method requires only one camera, does not use average physiological parameters of the eye, and improves accuracy at non-calibrated points.

The contributions of this work are threefold. First, an eye model is proposed that is applicable to real-time eye videos captured by an eye camera rather than only to synthetic images in a simulation environment. Second, a mapping model from 3D gaze direction vectors to the 2D plane is proposed, using gaze direction angles instead of the PCCR vector for interpolation. Experimental results show that the proposed method has better accuracy. Finally, this paper presents a low-cost head-mounted eye-tracking system with a simple hardware structure, which provides great convenience for research in related fields.

2 3D Eye Model

2.1 Computational Model of Eye Center

The model proposed by Swirski et al. [15] is based on two assumptions: (1) the apparent pupil contour in a 2D eye image is the perspective projection of a 3D pupil circle P that is tangent to an eyeball of fixed radius R; (2) the center of the eyeball is stationary over time. In their model, the gaze direction varies with the motion of the 3D pupil circle P over the eyeball surface. At each time point, the state of the eye model is determined by the eye center c and the 3D pupil circle P.

Given a set of N eye images recorded over a period of time, pupil contours are extracted from each image by an automatic pupil extraction algorithm [16,17,18], leading to sets of two-dimensional contour edges \( \varepsilon _i=\left\{ e_{ij},j=1,...,M_i \right\} \). First, the edges \( \varepsilon _i \) of the contour in each image are fitted to an ellipse \( l _i \). Next, assuming a pinhole camera model for perspective projection, the inverse projection (unprojection) of each pupil ellipse produces two 3D circles once an arbitrary radius r is fixed [19]. These two circles are denoted as:

$$\begin{aligned} \left( \boldsymbol{p}^{+}, \boldsymbol{n}^{+}, r\right) ,\left( \boldsymbol{p}^{-}, \boldsymbol{n}^{-}, r\right) \end{aligned}$$
(1)

where \( \boldsymbol{p}^{+} \) and \( \boldsymbol{p}^{-} \) denote the centers of the circles and \( \boldsymbol{n}^{+} \) and \( \boldsymbol{n}^{-} \) denote their normals. For the two circles obtained by unprojection of each pupil ellipse, Swirski et al. [15] remove the ambiguity by projecting the 3D vectors into the 2D image space, because the normals of the two circles are parallel in the image space:

$$\begin{aligned} \tilde{\boldsymbol{n}}_{i}^{+} \propto \tilde{\boldsymbol{n}}_{\boldsymbol{i}}^{-} \end{aligned}$$
(2)

Similarly, the line between \( \tilde{\boldsymbol{p}}_{i}^{+} \) and \( \tilde{\boldsymbol{p}}_{i}^{-} \) is parallel to \( \tilde{\boldsymbol{n}}_{i}^{\pm } \), so Eq. (3) can be derived:

$$\begin{aligned} \exists s, t \in R \cdot \tilde{\boldsymbol{p}}_{i}^{+}=\tilde{\boldsymbol{p}}_{i}^{-}+s \tilde{\boldsymbol{n}}_{i}^{+}=\tilde{\boldsymbol{p}}_{i}^{-}+t \tilde{\boldsymbol{n}}_{i}^{-} \end{aligned}$$
(3)

which means that either of the two circles can be chosen at this stage, and the projection of the eyeball center \( \tilde{\boldsymbol{c}} \) can be calculated as the intersection of the projected normal vectors. Because of numerical and measurement errors these lines almost never intersect at a single point, so the point with the smallest sum of squared distances to all lines is found by least squares:

$$\begin{aligned} \tilde{\boldsymbol{c}}=\left( \sum _{i}\left( \boldsymbol{I}-\tilde{\boldsymbol{n}}_{i} \tilde{\boldsymbol{n}}_{i}^{T}\right) \right) ^{-1} \cdot \left( \sum _{i}\left( \boldsymbol{I}-\tilde{\boldsymbol{n}}_{i} \tilde{\boldsymbol{n}}_{i}^{T}\right) \tilde{\boldsymbol{p}}_{i}\right) \end{aligned}$$
(4)
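As a concrete illustration, the closed-form solution of Eq. (4) can be written in a few lines of NumPy. The sketch below is ours (the function and variable names are not from the original work); it takes the projected circle centers and normals as rows of two arrays and returns the least-squares estimate of \( \tilde{\boldsymbol{c}} \).

```python
import numpy as np

def intersect_lines_2d(points, normals):
    """Least-squares 'intersection' of 2D lines (Eq. 4).

    Each line passes through points[i] with direction normals[i].
    Returns the point minimizing the sum of squared distances to all lines.
    """
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, n in zip(points, normals):
        P = np.eye(2) - np.outer(n, n)   # projector orthogonal to the line direction
        A += P
        b += P @ p
    return np.linalg.solve(A, b)         # projected eyeball center c_tilde
```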

The limitation of this approach is that the eye model was only fitted on synthetic image sequences, and the situation in real-time video is more complex than in synthetic images.

There are two differences between pupil detection on video frames captured by an eye camera and pupil detection on synthetic images: (1) the pupil outline in a synthetic image is distinct, while in a video frame it may be blurred due to motion blur; (2) the pupil contour in a synthetic image is complete, whereas in a video frame it may be incomplete due to blinking, eyelash occlusion, or excessive eye rotation. In such cases the pupil contour may be partially or even completely obscured, resulting in low accuracy of the fitted ellipse.

Swirski et al. [17] obtained the projection of the eyeball center by projecting the normal vectors of the circles into the image space and then solving for the intersection of the resulting cluster of lines using least squares. In their method all projected normal vectors are used to calculate \( \tilde{\boldsymbol{c}}\). However, when the eye rotates excessively or the pupil contour is incomplete, the distance between the normal line of the fitted ellipse and \( \tilde{\boldsymbol{c}} \) may be too large, as in Fig. 1. Thus, this paper proposes an optimization algorithm to calculate the position of the eye center \( \boldsymbol{c} \).

We can compute N lines from the N images by Eq. (3); their projected normal directions are denoted \( L^N \). Then M lines are randomly selected from \( L^N \) to obtain \( L^M \), and Eq. (4) is rewritten as Eq. (8) for this stage.

$$\begin{aligned} L^{N}=\left\{ \tilde{\boldsymbol{n}}_{i}, i=1, \ldots , N\right\} \end{aligned}$$
(5)
$$\begin{aligned} \{M\}={\text {random}}(\{N\}) \end{aligned}$$
(6)
$$\begin{aligned} L^{M}=\left\{ \tilde{\boldsymbol{n}}_{j}, j=1, \ldots , M\right\} \end{aligned}$$
(7)
$$\begin{aligned} \tilde{\boldsymbol{c}}_{m}=\left( \sum _{j}\left( \boldsymbol{I}-\tilde{\boldsymbol{n}}_{j} \tilde{\boldsymbol{n}}_{j}^{T}\right) \right) ^{-1} \cdot \left( \sum _{j}\left( \boldsymbol{I}-\tilde{\boldsymbol{n}}_{j} \tilde{\boldsymbol{n}}_{j}^{T}\right) \tilde{\boldsymbol{p}}_{j}\right) \end{aligned}$$
(8)

where \( \tilde{\boldsymbol{c}}_{m} \) is the coordinate of the eye center in the image space computed from one random subset. We then count the number of lines whose distance from \( \tilde{\boldsymbol{c}}_{m} \) is within a given threshold, repeat Eqs. (6)-(8), and select the candidate with the largest number of inlier lines among all results. We recompute the intersection point from those inlier lines and compare it with the previous result, iterating until the result no longer changes. Finally, we unproject \( \tilde{\boldsymbol{c}}_{m} \) to obtain the 3D eyeball center \( \boldsymbol{c} \) by fixing the z coordinate of \( \boldsymbol{c} \).
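The selection scheme above is essentially a RANSAC-style consensus step. The following sketch gives one possible reading of it; the sample size, inlier threshold and iteration counts are assumptions of ours rather than values reported in the paper, and the function reuses `intersect_lines_2d` from the previous sketch.

```python
import numpy as np

def point_line_distance(q, p, n):
    """Distance from point q to the line through p with unit direction n."""
    d = q - p
    return np.linalg.norm(d - (d @ n) * n)

def robust_eye_center_2d(points, normals, m=10, thresh=3.0, iters=100):
    """RANSAC-style estimate of the projected eyeball center (Eqs. 5-8)."""
    rng = np.random.default_rng(0)
    N = len(points)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    best_c, best_inliers = None, []
    for _ in range(iters):
        idx = rng.choice(N, size=min(m, N), replace=False)   # Eq. (6)
        c = intersect_lines_2d(points[idx], normals[idx])    # Eq. (8)
        inliers = [i for i in range(N)
                   if point_line_distance(c, points[i], normals[i]) < thresh]
        if len(inliers) > len(best_inliers):
            best_c, best_inliers = c, inliers
    if not best_inliers:                  # degenerate case: fall back to all lines
        best_inliers = list(range(N))
    for _ in range(20):                   # refit on the consensus set until stable
        best_c = intersect_lines_2d(points[best_inliers], normals[best_inliers])
        inliers = [i for i in range(N)
                   if point_line_distance(best_c, points[i], normals[i]) < thresh]
        if inliers == best_inliers:
            break
        best_inliers = inliers
    return best_c, best_inliers
```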

Fig. 1.
figure 1

The green point is the theoretical projection position of the center of the eye. The orange line is the normal vector projection when the eye is overrotated or the pupil contour is incomplete. (Color figure online)

2.2 Calculating the Radius of the Eye

Once we obtain the coordinates of eyeball center \( \boldsymbol{c} \), we note that the normal \( \boldsymbol{n}_i \) of each pupil has to point away from eyeball center \( \boldsymbol{c} \):

$$\begin{aligned} \boldsymbol{n}_{i} \cdot \left( \boldsymbol{p}_{i}-\boldsymbol{c}\right) >0 \end{aligned}$$
(9)

Therefore, when projected into the image space, \( \tilde{\boldsymbol{n}}_i \) has to point away from the projected center \( \tilde{\boldsymbol{c}} \):

$$\begin{aligned} \tilde{\boldsymbol{n}}_{i} \cdot \left( \tilde{\boldsymbol{p}}_{i}-\tilde{\boldsymbol{c}}\right) >0 \end{aligned}$$
(10)

The pupil is tangent to the eyeball under the assumptions of Sect. 2.1, so the eyeball radius R can be estimated once the correct pupil projections have been obtained. Since the unprojection of the pupil has a distance ambiguity, we cannot use \( \boldsymbol{p}_i \) directly to calculate R. Thus, we consider a candidate pupil center \( \hat{\boldsymbol{p}}_i \) different from \( \boldsymbol{p}_i \), namely another possible unprojection of \( \tilde{\boldsymbol{p}}_i \) at a different distance. This means that \( \hat{\boldsymbol{p}}_i \) lies on the line passing through the camera center and \( \boldsymbol{p}_i \); meanwhile, because the pupil circle is tangent to the eyeball, the line passing through \( \boldsymbol{c} \) parallel to \( \boldsymbol{n}_{i} \) must also pass through \( \hat{\boldsymbol{p}}_i \). The position of \( \hat{\boldsymbol{p}}_i \) is obtained by calculating the intersection of these two lines, as in Fig. 2. Since two lines hardly ever intersect exactly in space, the least squares method is used to calculate the intersection point.

We then obtain the eyeball radius R as the mean distance between \( \hat{\boldsymbol{p}}_i \) and \( \boldsymbol{c} \).

$$\begin{aligned} R=\frac{1}{M} \sum _{i=1}^{M}\left\| \hat{\boldsymbol{p}}_{i}-\boldsymbol{c}\right\| \end{aligned}$$
(11)
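The same least-squares construction extends directly to 3D, which gives one way to compute \( \hat{\boldsymbol{p}}_i \) and then R from Eq. (11). The sketch below is ours and assumes the camera center \( \boldsymbol{o} \) is the origin of the eye-camera coordinate system unless passed explicitly.

```python
import numpy as np

def intersect_lines_3d(points, directions):
    """Least-squares intersection of 3D lines (the 3D analogue of Eq. 4)."""
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, n in zip(points, directions):
        P = np.eye(3) - np.outer(n, n)
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

def eyeball_radius(c, pupil_centers, pupil_normals, o=np.zeros(3)):
    """Estimate R (Eq. 11) as the mean distance from c to the candidate centers.

    For each pupil, p_hat is the least-squares intersection of
      - the line through the camera center o and the unprojected center p_i,
      - the line through the eyeball center c with direction n_i.
    """
    p_hats = [intersect_lines_3d(np.array([o, c]), np.array([p - o, n]))
              for p, n in zip(pupil_centers, pupil_normals)]
    R = float(np.mean([np.linalg.norm(p_hat - c) for p_hat in p_hats]))
    return R, p_hats
```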
Fig. 2.
figure 2

We find \( \hat{\boldsymbol{p}}_{i} \) by intersecting the gaze line from the eyeball center (blue) with the line passing through \( \boldsymbol{p}_{i} \) and the camera center o (orange). (Color figure online)

2.3 Calculating Gaze Direction

Under the assumptions of Sect. 2.1, each pupil center lies on the surface of the eyeball and its projection is \( \tilde{\boldsymbol{p}}_i \). Due to the distance ambiguity, the \( \boldsymbol{p}_{i} \) obtained by unprojection hardly ever lies exactly on the eyeball surface. However, in the normal case, the line passing through the camera center and \( \boldsymbol{p}_i \) intersects the eyeball. Therefore, a new pupil center \( \boldsymbol{p}_{i}^{'} \) can be determined by intersecting the line through the camera center and \( \boldsymbol{p}_i \) with the eyeball \( \left( \boldsymbol{c},R \right) \). To calculate the position of this intersection, the magnitudes of \( d_1 \) and L are first computed.

$$\begin{aligned} \begin{aligned} d_{1}^{2}&=R^{2}-d_{2}^{2} \\&=R^{2}-\left( \Vert \boldsymbol{c}-\boldsymbol{o}\Vert ^{2}-L^{2}\right) \\&=R^{2}+L^{2}-\Vert \boldsymbol{c}-\boldsymbol{o}\Vert ^{2} \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} L=(\boldsymbol{c}-\boldsymbol{o}) \cdot \frac{\left( \boldsymbol{p}_{i}-\boldsymbol{o}\right) }{\left\| \boldsymbol{p}_{i}-\boldsymbol{o}\right\| } \end{aligned}$$
(13)

As can be seen from Fig. 3, the line through \( \boldsymbol{o} \) and \( \boldsymbol{p}_i \) normally has two intersections with the eyeball \( \left( \boldsymbol{c},R \right) \), and the closer intersection is chosen here.

$$\begin{aligned} d_{\min }=L-d_{1} \end{aligned}$$
(14)
$$\begin{aligned} \boldsymbol{p}_{i}^{\prime }=\boldsymbol{o}+d_{\min } \cdot \frac{\left( \boldsymbol{p}_{i}-\boldsymbol{o}\right) }{\left\| \boldsymbol{p}_{i}-\boldsymbol{o}\right\| } \end{aligned}$$
(15)

After obtaining the new pupil center position, \( \boldsymbol{n}_i \) is discarded in favor of \( \boldsymbol{n}_{i}^{'}=\boldsymbol{p}_{i}^{'}-\boldsymbol{c} \) as the gaze direction.
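A minimal sketch of Eqs. (12)-(15) follows, again with names of our own choosing and with the camera center \( \boldsymbol{o} \) defaulting to the origin; it returns None when the ray misses the eyeball, a case not discussed in the text.

```python
import numpy as np

def refine_pupil_and_gaze(p, c, R, o=np.zeros(3)):
    """Intersect the ray from o through p with the eyeball sphere (c, R) and
    return the refined pupil center p' (nearest intersection, Eqs. 14-15)
    together with the unit gaze direction n' = p' - c."""
    v = (p - o) / np.linalg.norm(p - o)             # unit ray direction
    L = (c - o) @ v                                 # Eq. (13)
    d1_sq = R**2 + L**2 - np.linalg.norm(c - o)**2  # Eq. (12)
    if d1_sq < 0:
        return None, None                           # ray misses the eyeball
    d_min = L - np.sqrt(d1_sq)                      # Eq. (14), nearer intersection
    p_new = o + d_min * v                           # Eq. (15)
    n_new = p_new - c
    return p_new, n_new / np.linalg.norm(n_new)
```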

Fig. 3.
figure 3

We use the estimated eyeball radius R to recalculate the spatial position of the pupil center \( \boldsymbol{p}_{i}^{\prime } \).

3 System Design and Implementation

For any head pose, as the eye looks at different positions, the pupil center appears at different positions in the eye image. Therefore, the movement of the pupil center is commonly used as a feature of gaze change. The most common approach, the PCCR method, uses the vector between the pupil center and the light spot reflected on the cornea to represent the gaze direction. Assuming that the head-mounted device remains fixed relative to the head, the position of the light source's reflection on the cornea is fixed: the spot does not move with eye movement, so the PCCR vector changes only with the movement of the pupil center. The pupil-corneal vector \( \left( x,y \right) \) in the eye camera image is then mapped to the pixel \( \left( X,Y \right) \) in the scene image or on the screen by an interpolation formula.

$$\begin{aligned} X=\sum _{k=0}^{n-1} a_{k} x^{i} y^{j}, i \in [0, k], j \in [0, k] \end{aligned}$$
(16)

This method is accurate when the user looks at the calibration points but less accurate at non-calibrated points. The reason is that gazing at a calibration point amounts to evaluating the interpolation formula at one of its nodes, whereas gazing at a non-calibrated point amounts to evaluating it away from the nodes. A deeper reason is that the chosen polynomial does not fit the correspondence between the gaze points and the PCCR vector well, or that no such polynomial relationship exists between the two. The usual solution is to increase the order of the polynomial to improve the fit, but this introduces more parameters, increases the complexity of the calibration procedure, and, when the order is high enough, may lead to the Runge phenomenon. To address this problem, this paper proposes a solution that improves the accuracy at non-calibrated points without increasing the number of polynomial parameters or the order. We argue that the PCCR vector in the eye image does not make full use of the information about changes in gaze direction. Therefore, we propose to use the gaze direction angles \( \left( \alpha ,\beta \right) \) instead of the PCCR vector \( \left( x,y \right) \) as the gaze feature. In the previous section we obtained the gaze direction vector \( \boldsymbol{n}_{i}^{'} \); here it only remains to transform this vector into the angles \( \left( \alpha ,\beta \right) \).

$$\begin{aligned} \boldsymbol{n}_{i}^{\prime }=\left( x_{\text{ gaze } }, y_{\text{ gaze } }, z_{\text{ gaze } }\right) \end{aligned}$$
(17)
$$\begin{aligned} \left\{ \begin{array}{l} \alpha =\arctan \left( \frac{\left| z_{\text{ gaze } }\right| }{x_{\text{ gaze } }}\right) , x_{\text{ gaze } }>0 \\ \alpha =\pi -\arctan \left( \frac{\left| z_{\text{ gaze } }\right| }{\left| x_{\text{ gaze } }\right| }\right) , x_{\text{ gaze } } \le 0 \end{array}\right. \end{aligned}$$
(18)
$$\begin{aligned} \left\{ \begin{array}{l} \beta =\arctan \left( \frac{\left| z_{\text{ gaze } }\right| }{y_{\text{ gaze } }}\right) , y_{\text{ gaze } }>0 \\ \beta =\pi -\arctan \left( \frac{\left| z_{\text{ gaze } }\right| }{\left| y_{\text{ gaze } }\right| }\right) , y_{\text{ gaze } } \le 0 \end{array}\right. \end{aligned}$$
(19)
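For completeness, a direct transcription of Eqs. (18)-(19) into code (a sketch of ours; as in the equations above, the boundary cases \( x_{\text{gaze}}=0 \) and \( y_{\text{gaze}}=0 \) would need an extra guard in practice):

```python
import numpy as np

def gaze_vector_to_angles(n_gaze):
    """Convert a gaze direction vector (x, y, z) into angles (alpha, beta)
    following the piecewise definitions of Eqs. (18)-(19)."""
    x, y, z = n_gaze
    if x > 0:
        alpha = np.arctan(abs(z) / x)               # Eq. (18), first branch
    else:
        alpha = np.pi - np.arctan(abs(z) / abs(x))  # Eq. (18), second branch
    if y > 0:
        beta = np.arctan(abs(z) / y)                # Eq. (19), first branch
    else:
        beta = np.pi - np.arctan(abs(z) / abs(y))   # Eq. (19), second branch
    return alpha, beta
```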

The mapping between the gaze direction angles \( \left( \alpha ,\beta \right) \) and the scene image coordinates \( \left( X,Y \right) \) is then modelled by a polynomial, so Eq. (16) can be rewritten as

$$\begin{aligned} X=\sum _{k=0}^{n-1} a_{k} \alpha ^{i} \beta ^{j}, i \in [0, k], j \in [0, k] \end{aligned}$$
(20)

A comparison of Eq. (20) with Eq. (16) shows that the proposed system uses the same number of parameters as the traditional pupil-corneal vector method. In contrast to the PCCR method, it uses the gaze direction angles instead of the PCCR vector, thereby making better use of the gaze variation information. The 3D-to-2D mapping model avoids advance calibration between the cameras and the headset and reduces the hardware requirements.

4 Experiments

We use a head-mounted eye-tracking device built in our laboratory, in which the image resolution of both the scene camera and the infrared eye camera is 640 \( \times \) 480 pixels and the acquisition frame rate is 60 FPS. The development environment is Qt Creator 4.7 + OpenCV 3.0. To ensure a fair comparison, both the PCCR method and the proposed method are tested on the same head-mounted device (Fig. 4).

Fig. 4.
figure 4

The head-mounted eye tracker built by our laboratory, with a scene camera and an eye camera.

4.1 Calibration

In the experiment, we use nine-point calibration for both our method and the PCCR method. The subjects then gazed at the calibrated and non-calibrated points, and the distributions of the corresponding estimated gaze points were collected. The polynomial used in the calibration process is the second-order polynomial proposed by Cerrolaza et al. [20]:

$$\begin{aligned} X=a_{0}+a_{1} x+a_{2} x^{2}+a_{3} y+a_{4} y^{2}+a_{5} x y \end{aligned}$$
(21)
$$\begin{aligned} Y=b_{0}+b_{1} x+b_{2} x^{2}+b_{3} y+b_{4} y^{2}+b_{5} x y \end{aligned}$$
(22)
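The coefficients of Eqs. (21)-(22) can be obtained from the nine calibration samples by ordinary least squares. The sketch below is ours; the only difference between the two compared methods is whether the feature passed in is the PCCR vector \( (x,y) \) or the gaze angles \( (\alpha ,\beta ) \).

```python
import numpy as np

def fit_second_order_mapping(features, targets):
    """Fit X = a0 + a1*x + a2*x^2 + a3*y + a4*y^2 + a5*x*y (Eqs. 21-22).

    features: (N, 2) array of (x, y) or (alpha, beta), one row per calibration point
    targets:  (N, 2) array of scene-image coordinates (X, Y)
    Returns the six coefficients for X and the six coefficients for Y.
    """
    x, y = features[:, 0], features[:, 1]
    A = np.column_stack([np.ones_like(x), x, x**2, y, y**2, x * y])
    coef_X, *_ = np.linalg.lstsq(A, targets[:, 0], rcond=None)
    coef_Y, *_ = np.linalg.lstsq(A, targets[:, 1], rcond=None)
    return coef_X, coef_Y

def apply_mapping(coef, x, y):
    """Evaluate the fitted polynomial at a new feature (x, y)."""
    return coef @ np.array([1.0, x, x**2, y, y**2, x * y])
```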

4.2 Data Collection

We invited 6 subjects to participate in this experiment. Each subject sat approximately 0.7 m in front of the screen and adjusted their head pose so that all calibration points on the screen were visible in the scene image; the head pose was kept fixed during calibration. First, the subject looked at the markers on the screen in turn, and the gaze angles or the pupil-corneal vector were recorded for each calibration point. Second, after calibration the subject looked at a set of dots on the calibration target, and for each dot we collected 20 consecutive frames of data. Finally, significant shifts caused by involuntary eye movements such as nystagmus were removed. The distributions of the results of our method and the PCCR method at the calibration points are shown in Fig. 5 and Fig. 6.

Fig. 5.
figure 5

Distribution of gaze calibration points (Ours).

Fig. 6.
figure 6

Distribution of gaze calibration points (PCCR).

To further evaluate the accuracy of the proposed method and the PCCR method at non-calibrated points, 16 test points different from the calibration points were fixed on the target. The distributions of the results of our method and the PCCR method at the test points are shown in Fig. 7 and Fig. 8.

Fig. 7.
figure 7

Distribution of gaze test points (Ours).

Fig. 8.
figure 8

Distribution of gaze test points (PCCR).

The crosses in Figs. 5, 6, 7 and 8 represent the calibration points on the calibration target, and the clusters of points represent the estimated gaze points collected during the experiment. Once the gaze point data have been collected, Eq. (23) is used to calculate the angular error.

$$\begin{aligned} \bar{\alpha }_{i}=\frac{1}{N}\sum _{j=1}^{N} \arctan \left( \sqrt{\left( x_{i j}-X_{i}\right) ^{2}+\left( y_{i j}-Y_{i}\right) ^{2}} / L\right) \end{aligned}$$
(23)

where N is the number of qualified samples, \( \left( x_{ij},y_{ij} \right) \) is the position of the j-th sample collected for the i-th gaze point, and \( \left( X_i,Y_i \right) \) is the position of the i-th reference gaze point. Figure 9 gives the angular error at each point when observing calibrated and non-calibrated points, for both our method and the PCCR method.
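The per-point error of Eq. (23) can be computed as in the sketch below (ours). We treat L as the eye-to-target distance expressed in the same units as the gaze-point coordinates; the paper does not state its unit, so this is an assumption.

```python
import numpy as np

def mean_angular_error(samples, reference, L):
    """Mean angular error for one gaze point (Eq. 23), in degrees.

    samples:   (N, 2) array of measured gaze positions (x_ij, y_ij)
    reference: (2,) array with the reference gaze point (X_i, Y_i)
    L:         eye-to-target distance, same units as the coordinates (assumed)
    """
    offsets = np.linalg.norm(samples - reference, axis=1)
    return float(np.degrees(np.mean(np.arctan(offsets / L))))
```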

Fig. 9.
figure 9

Error of calibration point and test point results. (a) is error of calibration point results; (b) is error of test point results.

According to the experimental data in Fig. 9, the errors of our method and the PCCR method are 0.56\(^{\circ }\) and 0.60\(^{\circ }\) respectively at the calibration points, and 0.63\(^{\circ }\) and 0.94\(^{\circ }\) respectively at the non-calibrated points. The two methods are thus close in accuracy at the calibration points, while our method is more accurate at the non-calibrated points. Figure 9 shows that using gaze direction angles instead of PCCR vectors as features makes better use of the information about gaze direction variation and improves the accuracy of the system in general and at non-calibrated points in particular. Although it uses only a single camera and no average eye parameters, our method achieves accuracy at the same level as other 3D methods (Table 1).

Table 1. Comparison of the results of different 3D methods.

5 Conclusion

Based on the features of the pupil's motion trajectory, we propose a single-camera head-mounted 3D eye tracking system. The number of cameras is reduced by analyzing the pupil motion trajectory to obtain the 3D gaze direction, and the mapping model from the gaze direction to the scene avoids advance calibration of the hardware structure. The results show that, on the same hardware, the proposed method achieves better accuracy at non-calibrated points than the PCCR method, while the complexity of the hardware structure is greatly reduced without sacrificing accuracy.