
1 Introduction

Eye-tracking equipment now comes in many types. The core function of an eye-tracking system is to measure eye rotation and synchronize it with the image viewed at that moment. These systems are generally divided into two types.

One is the contact type, which is usually driven by the electro-oculography (EOG) signal gathered with wearable electrode patches around the eye. The EOG signal conducts rapidly and is independent of the lighting and operating environment. Many researchers have used the EOG signal as an input interface to the computer [1,2,3]. The developed systems worked well in real time, but the interface could only distinguish a small number of discrete gaze directions, which is the main weakness of this type.

The other type of eye-tracking system is the non-contact equipment, which is the most popular and widely used because it is non-invasive. These systems usually use an infrared sensor or several cameras to recognize facial features or pupil rotation. Such non-contact eye-movement systems mainly include desktop applications for a digital screen and wearable applications for moving scenes, based on video captured by a micro camera mounted on glasses or a hat. The desktop application uses bar-like equipment with a built-in infrared sensor, for instance the Tobii and SMI desktop eye-tracking systems. Rózanowski and Murawski carried out further research that made the sensor more reliable in harsh environments and also improved the infrared recognition algorithm [4, 5]. Camera-based video recognition has also been studied. Hao and Lei obtained head and eye rotation data by recognizing facial features [6]. Because the eyes are very close to the front camera of a phone, it is easy to recognize the pupil features; Lee and Min obtained the eye-tracking path from the pupil data [7]. Going further, Panev and Manolova used a camera and a 3D depth sensor to determine head and eye rotation [8]. In the desktop applications mentioned above, the infrared sensor and the cameras must be fixed in front of the eyes for the algorithm to work, which leads to a narrow field of view limited to the screen. However, they can be used interactively in real time as a human-machine interface, which can even help the disabled play games [9]. In wearable applications, the eye-tracking system can be used in a wide field of view and the data are more accurate because the sensors are mounted on the glasses close to the eyes. There are also several novel ways to gather the data. Using specially designed marker points on the glasses and fixed capture cameras, Lin et al. obtained head-gesture data and built an eye-controlled human-machine interface [10]. In that research the movement of the eye itself was not considered, so the system only responded to head movement. Commercial wearable equipment such as the SMI eye-tracking glasses is well designed, but all wearable equipment shows what the user is staring at in the moving video captured on the glasses, not on a fixed screen. Such wearable equipment can therefore only be used for post-hoc analysis, which is the opposite of the desktop application.

In order to evaluate the user's SA in real time and tune the human-machine display interface in a train or airplane cab, we need to develop a new real-time eye-interaction system. The real-time capability of the desktop application and the wide usability of the wearable application must be integrated on top of current eye-tracking systems. With this method, the machine can know in real time which part of the display the user is staring at now and which part has not been noticed for a long time. This research gives the machine a real-time interface and makes it possible for the machine to change its display strategy in real time.

Since the concept of situation awareness was defined by Endsley in 1995 [11], researchers have proposed several methods to assess it. In particular, Endsley proposed the situation awareness global assessment technique (SAGAT), an objective measure of SA [12]. With our real-time interface, the machine can automatically answer the queries that SAGAT would otherwise pose during freezes in a simulation. When SA decreases to a warning threshold, the display can change its interface to highlight the overlooked information based on Itti's attention map [13]. According to the SEEV model [14], the machine will increase the saliency of the unnoticed part until the user sees it.

We hope this unique wearable equipment can serve as an input interface to the machine, enhancing the machine's perception of the user's situation awareness and strengthening its dynamic service abilities.

2 System Description

As mentioned above, the core function of the eye-tracking system is to determine head and eye rotation. With the eye-tracking glasses we obtain the eye movement, and with the motion capture device we obtain the movement of the head. By combining the data of these two devices through their software SDKs, our eye-interaction system can compute the head line (head gesture) and the visual line (which also accounts for eye movement). The system then uses the physical parameters of the screen and devices, such as length, height and field of view, to calculate the visual point on the digital screen. The algorithm proceeds from the root node of the body (the waist joint) to the eye point, the head line, the visual line, and finally the visual point on the digital display screen (Fig. 1 shows algorithm steps 1-5). Every screen needs to be calibrated with three points before use to define its plane position in space. Our eye-interaction system works in real time with a wide field of view, so it can be used as a novel human-machine interface.

Fig. 1. The spatial relationships between the key control points and the lines. The calculation sequence is from the root point to the eye point, the head line, the visual line, and finally to the visual point.

2.1 Root Point and Coordinate Definition

Usually the user of this system does not move: the user is standing or, more often, sitting in a chair, as is the case for drivers, pilots and system controllers. The screen locations are therefore fixed relative to the user during the period of use. To construct the algorithm, we set the waist joint of the user as the root point of the coordinate system, which does not move during use. Based on the root point, we set up a right-handed coordinate system in which the X axis is positive in the forward direction, the Y axis is positive to the left and the Z axis is positive upward (Fig. 2 shows the coordinate system).

Fig. 2. Definition of the key human body segments, the coordinate system and the rotation angles.

2.2 Eye Point and Head Line Calculation

Calculating the user's visual line requires the eye point (the root point of the visual line) and the direction vector of the line. Wearable eye-tracking glasses and a motion capture device are common devices in human analysis. Using the motion capture APIs, we can obtain the postures of the key joints in real time. Combined with the joint segment parameters, the position of the eye point relative to the root point can be calculated.

The sizes of the human body segments must be provided in advance. They comprise three segment lengths between the joints or key points, denoted L0, L1 and L2. Usually the motion capture device provides the postures as Roll (α), Pitch (β) and Yaw (γ). In particular, it must be established whether the rotation data gathered from the motion capture device are Euler angles in a dynamic (body-fixed) coordinate system, because the angle definitions differ greatly depending on whether the coordinate frame rotates at each step. Based on our tests, the FAB system used in this research as the motion capture device provides Euler angles in the dynamic coordinate system with the transform sequence Z-Y-X (Yaw-Pitch-Roll). The rotation angles are defined as positive by the right-hand rule.

Based on the 3D transformation method of advanced mechanism theory, we can build the rotation transform of the system as Eq. 1.

$$ \begin{aligned} R\,[x_{0}, y_{0}, z_{0}]^{T} &= [x_{1}, y_{1}, z_{1}]^{T} \\ R &= R_{Z}(\gamma)\,R_{Y}(\beta)\,R_{X}(\alpha) \end{aligned} $$
(1)
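
As an illustration, the following is a minimal numerical sketch of Eq. 1, assuming NumPy is available; the function names are ours and not part of any device SDK.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rotation(alpha, beta, gamma):
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll), the Z-Y-X sequence of Eq. 1."""
    return rot_z(gamma) @ rot_y(beta) @ rot_x(alpha)
```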

Furthermore, we obtain the eye point position (P_eye-root) by summing the key control point positions, and the head posture rotation matrix (R_eye-root) by multiplying the rotation matrices (Eq. 2).

$$ \begin{aligned} P_{eye\text{-}root} &= P_{eye\text{-}neck}(\gamma_{head},\beta_{head},\alpha_{head},L_{2}) + P_{neck\text{-}torso}(\gamma_{torso},\beta_{torso},\alpha_{torso},L_{1}) + P_{torso\text{-}root}(\gamma_{root},\beta_{root},\alpha_{root},L_{0}) \\ &= [x_{eye\text{-}root}, y_{eye\text{-}root}, z_{eye\text{-}root}]^{T} \\ R_{eye\text{-}root} &= R_{eye\text{-}neck}(\gamma_{head},\beta_{head},\alpha_{head})\,R_{neck\text{-}torso}(\gamma_{torso},\beta_{torso},\alpha_{torso})\,R_{torso\text{-}root}(\gamma_{root},\beta_{root},\alpha_{root}) \end{aligned} $$
(2)
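
A minimal sketch of Eq. 2 under stated assumptions: each segment is assumed to point straight up (+Z) in its rest pose, and the capture device is assumed to report each segment's absolute posture, matching the additive form of Eq. 2. Both are illustrative assumptions, not details of the FAB SDK; `rotation` is reused from the Eq. 1 sketch.

```python
import numpy as np

def eye_point_and_rotation(ang_root, ang_torso, ang_head, L0, L1, L2):
    """ang_* = (alpha, beta, gamma) in radians; L0..L2 = segment lengths."""
    up = np.array([0.0, 0.0, 1.0])                    # assumed rest-pose segment direction
    p_eye_root = (rotation(*ang_root)  @ (L0 * up)    # P_torso-root
                + rotation(*ang_torso) @ (L1 * up)    # P_neck-torso
                + rotation(*ang_head)  @ (L2 * up))   # P_eye-neck
    R_eye_root = (rotation(*ang_head) @ rotation(*ang_torso)
                  @ rotation(*ang_root))              # R_eye-root as in Eq. 2
    return p_eye_root, R_eye_root
```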

Some researchers used infrared sensors or cameras to obtain the head gesture [5,6,7,8]. However, such indirect measurement needs to recognize features of the user, which may introduce errors due to differences between users and also restricts the usable viewing field. Today's motion capture devices are tiny, wearable and wireless, and provide the head gesture directly, so they are a better choice than the other sensors.

2.3 Visual Line Calculation

To determine the visual line we need not only the head posture but also the eye movement, so wearable eye-tracking glasses are the device of choice for the eye movement data. With the help of the glasses' API, we obtain the relative visual point in the moving video captured by a micro camera on the glasses (Fig. 3).

Fig. 3. The spatial relationships between the head line and the visual line. The visual line is obtained by two rotation transformations (R_Z(γ), R_X(α)).

Based on the proportional relationship between the video resolution and the relative visual point, we obtain the elevation angle α_xy. Using pre-test data or the glasses' specification list, the video view angle γ' can be determined, and the polar angle γ_xy then follows from a proportional calculation. The direction vector of the visual line (Eq. 3) can thus be defined, and the parametric equation of the visual line (Eq. 4) can be written from the eye point. After this step, the position and orientation of the eyeball are recorded as 6-DOF data through the definition of the visual line.

$$ R_{vision} = R_{X}\!\left(\alpha_{xy} = \frac{V_{h}}{V_{w}}\right) R_{Y}(0)\, R_{Z}\!\left(\gamma_{xy} = \frac{V_{w}}{Re_{w}}\gamma'\right) R_{eye\text{-}root}\,[1,0,0]^{T} = [m_{vision}, n_{vision}, p_{vision}]^{T} $$
(3)
$$ \left\{ \begin{aligned} x &= \lambda\, m_{vision} + x_{eye\text{-}root} \\ y &= \lambda\, n_{vision} + y_{eye\text{-}root} \\ z &= \lambda\, p_{vision} + z_{eye\text{-}root} \end{aligned} \right. $$
(4)
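
The sketch below illustrates Eqs. 3-4 under assumptions: the gaze point reported by the glasses (in scene-video pixels) is mapped linearly to the polar and elevation angles inside the scene camera's view angles. The exact scaling factors here are illustrative and differ in detail from Eq. 3; `rot_x`, `rot_z` are reused from the Eq. 1 sketch.

```python
import numpy as np

def visual_line(vx, vy, re_w, re_h, fov_h, fov_v, R_eye_root, p_eye_root):
    """(vx, vy): gaze point in scene-video pixels; fov_*: camera view angles (rad)."""
    gamma_xy = (vx / re_w - 0.5) * fov_h    # horizontal (polar) offset angle
    alpha_xy = (0.5 - vy / re_h) * fov_v    # vertical (elevation) offset angle
    d = rot_x(alpha_xy) @ rot_z(gamma_xy) @ R_eye_root @ np.array([1.0, 0.0, 0.0])
    return p_eye_root, d                    # visual line: P(t) = p_eye_root + t*d  (Eq. 4)
```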

Note that β'_xy, the exact elevation angle shown above, is not used. It is difficult to measure because it requires the distance between the eyes and the screen. Some researchers used a 3D depth sensor to obtain this distance [8], but that solution shrinks the user's usable viewing field and brings a new device into the environment. Others used physical relationships to calculate the distance [10], which requires the user not to change the distance by moving the head during the test.

In our research we decided that no device other than the motion capture device and the eye-tracking glasses should be involved, which means we cannot measure the distance directly. Nevertheless, we can still calculate the screen distance mathematically in the next step.

2.4 Screen Plane Calculation

The digital screen is a plane in space whose position is unknown in the base coordinate system of the human. It is common knowledge that three points define a plane in space. To locate the screen plane, the system guides the user to focus on three calibration points at the vertices of the screen rectangle during the calibration process, and the three visual lines are recorded in the system as a set of line equations (Eq. 5) (Fig. 4).

Fig. 4. The key points SP0, SP1 and SP2 on the screen lie on the three calibrated visual lines recorded under the guidance of the system at start-up. The screen rectangles on planes 1-3 are possible solutions.

$$ \text{CLine}_{k},\ k = 0,1,2:\quad \left\{ \lambda_{k} \,\middle|\, m_{CL_{k}}, n_{CL_{k}}, p_{CL_{k}}, x_{CEP_{k}}, y_{CEP_{k}}, z_{CEP_{k}} \right\} $$
(5)

As mentioned above, there is no distance sensor in the system, so we cannot determine the points the user actually looked at on the screen; that is, we cannot obtain λ0, λ1 and λ2 in Eq. 5, which leads to the various possible planes shown in Fig. 4. This forced us to find another way to obtain the plane equation.

We must import some physical parameters of the screen, namely the width (W) and the height (H) of the digital display, to help identify the correct plane. In addition, the line \( \overrightarrow {{SP_{0} SP_{1} }} \) is perpendicular to the line \( \overrightarrow {{SP_{0} SP_{2} }} \). From these three restrictions we can build the equation set (Eq. 6).

$$ \left\{ \begin{aligned} &\left| \overrightarrow{SP_{0}SP_{1}} \right| = W \\ &\left| \overrightarrow{SP_{0}SP_{2}} \right| = H \\ &\overrightarrow{SP_{0}SP_{1}} \cdot \overrightarrow{SP_{0}SP_{2}} = 0 \end{aligned} \right. \qquad SP_{k} = [x_{CEP_{k}}, y_{CEP_{k}}, z_{CEP_{k}}]^{T} + \lambda_{k}\,[m_{CL_{k}}, n_{CL_{k}}, p_{CL_{k}}]^{T},\ k = 0,1,2 $$
(6)

In this equation set there are three unknown variables (λ0, λ1, λ2) and three equations. Geometric analysis shows that there is only one reasonable solution. Finding it by hand is rather difficult, but with the help of a computer we can use an existing mathematical software library to solve the equation set. Once λ0, λ1 and λ2 are found, we obtain the three key points SP0, SP1 and SP2 on the screen and the plane equation (Eq. 7).

$$ A\,(x - x_{SP_{0}}) + B\,(y - y_{SP_{0}}) + C\,(z - z_{SP_{0}}) = 0, \qquad [A, B, C]^{T} = \overrightarrow{SP_{0}SP_{1}} \times \overrightarrow{SP_{0}SP_{2}} $$
(7)
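
A minimal sketch of this calibration step, assuming SciPy is available; `cep[k]` and `d[k]` are NumPy arrays holding the calibration eye points and line directions of Eq. 5, and the function names are ours.

```python
import numpy as np
from scipy.optimize import fsolve

def calibrate_screen(cep, d, W, H):
    """cep, d: lists of three calibration eye points and line directions (Eq. 5)."""
    def residuals(lams):
        sp = [cep[k] + lams[k] * d[k] for k in range(3)]   # SP0, SP1, SP2
        e1, e2 = sp[1] - sp[0], sp[2] - sp[0]
        return [e1 @ e1 - W**2,      # |SP0SP1| = screen width
                e2 @ e2 - H**2,      # |SP0SP2| = screen height
                e1 @ e2]             # SP0SP1 perpendicular to SP0SP2
    lams = fsolve(residuals, np.ones(3))                   # positive initial guess
    sp = [cep[k] + lams[k] * d[k] for k in range(3)]
    normal = np.cross(sp[1] - sp[0], sp[2] - sp[0])        # [A, B, C] of Eq. 7
    return sp, normal
```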

As described, the plane equation is expressed in the base coordinate system at the root point. As long as the user does not move the waist position, the head and eyes can move within a wide viewing field during use, which is an improvement over current equipment.

2.5 Visual Point Calculation

Once the screen plane equation is available, we can calculate the intersection of the visual line and the plane, which is a point (x, y, z) in three-dimensional space (Eq. 8). As soon as we obtain the parameter λ, we have the visual point in space relative to the root point.

$$ \left\{ \begin{aligned} &\text{Plane: } \mathrm{P}(x_{SP_{0}}, y_{SP_{0}}, z_{SP_{0}} \mid A, B, C) = 0 \\ &\text{Visual line: } \mathrm{L}(\lambda \mid m_{vision}, n_{vision}, p_{vision}, x_{eye\text{-}root}, y_{eye\text{-}root}, z_{eye\text{-}root}) = 0 \end{aligned} \right. $$
(8)
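
A minimal sketch of the line-plane intersection of Eq. 8, under the conventions used above (`sp0` and `normal` come from the calibration sketch; names are ours).

```python
import numpy as np

def visual_point_3d(p_eye, d, sp0, normal):
    """Intersect the visual line P(t) = p_eye + t*d with the screen plane (Eq. 8)."""
    denom = np.dot(normal, d)
    if abs(denom) < 1e-9:
        return None                          # line is parallel to the screen plane
    lam = np.dot(normal, sp0 - p_eye) / denom
    return p_eye + lam * d                   # visual point V in root coordinates
```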

However, we need to display the point on the digital screen. That is, we need to transform the point from the three-dimensional space based on the root of the human to the two-dimensional plane based on the screen's reference point. Choosing the origin of the screen image resolution as the base control point, we build a new two-dimensional coordinate system (shown in Fig. 5).

Fig. 5. The screen coordinate system is based on the point SP0. SP1 and SP2 are vertices of the screen monitor. Screen point V is the virtual point at which the user is staring.

Using the cross product of vectors and some geometric methods, we obtain the physical position of screen point V (the visual point), described in units of length. Then, based on the proportional relationship between the physical width and height of the screen and its resolution, we find the exact virtual point (x_v, y_v) in resolution units on the screen and mark it with the software in real time. The scaling relationship is given in Eq. 9.

$$ x_{v} = \frac{Re_{w}}{\left|\overrightarrow{SP_{0}SP_{1}}\right|^{2}}\; \overrightarrow{SP_{0}V} \cdot \overrightarrow{SP_{0}SP_{1}}, \qquad y_{v} = \frac{Re_{h}}{\left|\overrightarrow{SP_{0}SP_{2}}\right|^{2}}\; \overrightarrow{SP_{0}V} \cdot \overrightarrow{SP_{0}SP_{2}} $$
(9)
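
A minimal sketch of this mapping, assuming the perpendicular screen edges from the calibration step: V is projected onto the edges SP0→SP1 and SP0→SP2 and scaled by the screen resolution (`res_w`, `res_h`); names are ours.

```python
import numpy as np

def to_pixels(V, sp0, sp1, sp2, res_w, res_h):
    """Map the 3D visual point V to screen pixel coordinates (x_v, y_v)."""
    e1, e2 = sp1 - sp0, sp2 - sp0                        # physical screen edges
    x_v = np.dot(V - sp0, e1) / np.dot(e1, e1) * res_w   # fraction along width * Re_w
    y_v = np.dot(V - sp0, e2) / np.dot(e2, e2) * res_h   # fraction along height * Re_h
    return x_v, y_v
```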

2.6 Data Synchronization

A wide-viewing-field eye-interaction system requires synchronizing all the controlled digital screens. A data synchronization module was developed to determine whether the computed visual point lies on one of the screens. It is a centralized design that processes all the data and records the eye-tracking path.

To indicate the visual point, the system sends commands to the sub-module of the corresponding screen and marks the point on its interface. With sufficient constraints, or with information imported in advance, the system can also synchronize that point with the control desk; for example, a button or indicator on the driver's desk can light up in a specific color to show that the user is staring at it. A minimal sketch of such a command message is given below.
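
The sketch assumes a simple UDP message format of our own design (not part of any vendor SDK): the central module sends the computed pixel position to the registered sub-module of the screen that contains it.

```python
import json
import socket

def send_visual_point(screen_id, x_v, y_v, subscribers):
    """subscribers: {screen_id: (host, port)} of registered screen sub-modules."""
    msg = json.dumps({"screen": screen_id, "x": x_v, "y": y_v}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(msg, subscribers[screen_id])
```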

3 Application

SA is a human concept that includes three levels (perception, comprehension and projection) [11]. The system is designed to obtain, in real time, the visual point the user is staring at, so with the designed eye-interaction system we can estimate the user's perception. The displayed information is divided into different importance levels (IL), and the importance scores of all information items sum to 100. When a piece of information changes on the display interface, the score is reduced by the corresponding value (ILc); when the user notices that information, the score recovers by the corresponding value (ILn). We set the starting score of 100 as the best SA perception (SA_P), and the current SA perception is defined in Eq. 10. When SA_P falls below the warning level (60 points), the system highlights information that has been out of focus for a long time and whose importance level is high.

$$ SA_{P}^{N+1} = SA_{P}^{N} - \sum\nolimits_{i} ILc_{i}^{N+1} + \sum\nolimits_{i} ILn_{i}^{N+1}, \qquad SA_{P}^{0} = 100 $$
(10)
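
A minimal sketch of the Eq. 10 bookkeeping; the data structures and the cap at the starting score are our own illustrative choices.

```python
WARNING_LEVEL = 60   # warning threshold from the text

def update_sa_p(sa_p, changed_items, noticed_items, il):
    """il: {item: importance level}; importance levels sum to 100 (Sect. 3)."""
    sa_p -= sum(il[i] for i in changed_items)   # information changed but not yet seen (ILc)
    sa_p += sum(il[i] for i in noticed_items)   # information noticed again (ILn)
    return min(sa_p, 100)                        # capped at the starting score (our assumption)
```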

The attention map can be calculated with Itti's model [13] while the user is staring at a point on the screen. We accumulate the saliency-based visual attention rate over the elapsed time in Eq. 11. \( \bar{S} \) is the average attention saliency rate during a task of duration t_e. The information areas of interest are A_0 to A_i, IL_A is the importance level of area A, and the function \( \delta (A,t) \) is the attention saliency intensity of area A at time t. According to the SEEV model [14], the user spends less effort to see a salient area, so a smaller \( \bar{S} \) is better.

$$ \bar{S} = \frac{1}{t_{e}} \sum\nolimits_{A = A_{0}}^{A_{i}} IL_{A} \int_{0}^{t_{e}} \delta(A,t)\,dt $$
(11)
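
A minimal sketch of Eq. 11 with discretely sampled saliency; the sampling scheme and names are illustrative assumptions.

```python
def mean_saliency_rate(delta, il, dt, t_e):
    """delta: {area: [saliency samples taken every dt seconds]}; il: {area: importance level}."""
    total = 0.0
    for area, samples in delta.items():
        total += il[area] * sum(samples) * dt   # rectangle-rule approximation of the integral
    return total / t_e                          # S-bar; lower is better under SEEV
```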

With dynamic saliency readjustment based on SA_P and \( \bar{S} \), the machine can enhance the human's perception and offer better service.

4 Further Work

The algorithm yields an exact mathematical solution for the virtual point. Its core difficulty is finding the root of the system of three quadratic equations in three unknowns (Eq. 6). Through spatial analysis we know there is only one reasonable solution, but there is also an unreasonable solution in the negative direction of the axis. Moreover, we can currently only use computer mathematics tools to find a numerical solution, which leaves some uncertainty in the solving process.

Furthermore, as is well known, solving for the root with a computer mathematics library is a form of numerical calculation, so the numerical stability of the equations must be considered. Solving for the root in real time also takes time, which introduces delay into the system. If there were a way to obtain an exact closed-form solution of the equation set, the problem would be solved satisfactorily.

In addition, the algorithm includes many steps, which may cause errors to accumulate, and the error analysis is a considerable amount of work. It may be possible to simplify the algorithm by assuming that the calibrated eye points are at the same position (CEP_0, CEP_1, CEP_2) (shown in Fig. 4). With this assumption, Eq. 6 would be simplified and easily solved.

For further simplification, we could also assume that the eye point stays at the same position, which would simplify Eq. 8 and greatly reduce the computational load of the algorithm. However, the error must be calculated to show whether such a simplified calculation model stays within an acceptable range.