
1 Introduction

The World Health Organization (WHO) estimated that 1 billion people around the world live with some form of disability [36]. Approximately 10 million people in the UK have disabilities associated with a neurological diagnosis. For a multitude of reasons, the number of people with profound disability stemming from neurological disorders is increasing, with a resulting impact on their quality of life and that of their caregivers. The cost of caring for people with neurological disability in Europe has been estimated at 795 billion Euro [17]. The value of assistive technologies in improving the quality of life of people with disability and in reducing carer strain is emphasised in a 2010 Royal College of Physicians report [30].

For many individuals with disability, access to a computer and/or communication aid may help mitigate the effect of communication impairments. Often this can be achieved through the identification of suitable access sites, e.g. the hand, foot, arm or head. Some patients, however, are so profoundly disabled that they are unable to talk and can only make small head movements and facial gestures such as an eye blink or eyebrow movement. In some cases there may not even be enough head movement to enable the use of an access device such as a head tracker like SmartNav [2], and so the only remaining access site may be small facial gestures. Although other options are available, e.g. the use of eye gaze, existing eye gaze systems such as MyTobii [3] are complex and expensive, and their set-up/configuration places a significant burden on both the user and the caregiver.

The motivation for the work reported in this paper is the need for low-cost, reliable head tracking with an automatic facial gesture recognition system to help severely disabled users access electronic assistive technologies. The objective is to develop a multi-modal head tracking system which uses facial gestures as a switching mechanism, thus enabling severely disabled patients whose control is restricted to small head movements and facial gestures to access a computer.

2 Background

Pistori [29] states that assistive devices using computer vision can have a great impact in increasing the digital inclusion of people with special needs. Computer vision can improve devices used for mobility (e.g. controlling motorised wheelchairs), sign language detection and head trackers. Similarly, Betke et al. [7] describe the advances made in the development of assistive software and how the use of emerging technology can lead to the creation of intelligent interfaces combining assistive technology and human-computer interaction (HCI). The CameraMouse [9] is used as an example of an interface system for different assistive devices, and software such as Midas Touch [6] and Dasher [34] is included to highlight the combined use of HCI and assistive devices.

Abascal et al. [4] highlighted some of the opportunities and challenges posed by designing human-computer interfaces suitable for disabled users. HCI can be used to design better interfaces that are accessible to people with disabilities and thus improve socialisation, provide better access to communication facilities and give greater control over their environment.

2.1 Device Evaluation

Fitts’ test [14] was developed in 1954 to model human movement. The results of the experiments showed that the rate of performance of the human motor system is approximately constant over a wide range of movement amplitudes. Mackenzie et al. [23] adapted Fitts’ Law for assessing HCI. This work was later embedded in an international standard for HCI, ISO 9241-9:2000 [18], providing guidelines for measuring users’ performance, comfort and effort. The performance of a device is measured by having the user perform tasks with it. There are six types of task: one-direction, multi-directional, dragging, free-hand tracing (drawing), free-hand input (handwriting), and grasp and park (homing/device switching). ISO 9241-9:2000 [18] requires that the input device be tested with at least two different values of the Index of Difficulty (ID), a measure of the difficulty of the task [5]. Douglas et al. [12] investigated the validity and practicality of the ISO framework using both the multi-directional and the one-direction Fitts’ tests for two devices, namely a touch-pad and a joystick.

2.2 Gesture Detection

In this paper, the interest is in processing video information to recognise blink and eyebrow movement gestures. The detected gestures can be used to emulate a mouse click or a switch action to access and control a computer/communication aid.

Grauman et al. [15] proposed two systems called BlinkLink and EyebrowClicker. The BlinkLink software tracked both the motion within the eye region and the eye region itself. EyebrowClicker tracked the eyebrow region and detected the rising and falling of the eyebrows. To initialise the location of the eye and eyebrow regions, the user has to perform the gestures, and the respective regions are detected by analysing the area of motion on the face. A template of each region is generated. To detect an eye blink, the correlation scores of the eye region against templates of the closed eye and the open eye were compared. For the eyebrow gesture, the distance between the eyes and the eyebrows is monitored to detect the rise and fall of the eyebrows. Blink detection had an overall success rate of around 95.6% and was tested on 15 healthy individuals and one person with Traumatic Brain Injury (TBI). EyebrowClicker had an overall success rate of 89% and was tested with six individuals, but the software had to be re-initialised twice during the data capture session because tracking of the eyebrows was lost. There has been no further published work on this system.

Malik et al. [25] proposed a blink detection method using histograms of Local Binary Patterns (LBP) [27]. A template of the open eye was generated from the average LBP histogram over a sample of 50 open-eye images. The LBP histograms of images of the eye region were compared against the template using the Kullback-Leibler Divergence (KLD); the KLD between two distributions is zero only if the distributions are identical. KLD was found to be robust against both the precision of the eye detection and variation in the window size of the detected eye region. The eye region is obtained using the Viola-Jones [33] algorithm implemented in OpenCV. The proposed algorithm was tested against the ZJU Eyeblink Database [28], resulting in a 99.2% blink detection rate. Missimer et al. [26] proposed a blink detection algorithm based on the analysis of the differences between three consecutive images. Blobs are generated by merging the two resulting difference images. Three points are used for tracking: the centre of the upper lip and the upper part of both eyebrows. Optical flow is used to track these three points. The eye templates are generated from the tracked points and used to train the system. The system is reported to have a success rate of 96.6% and was tested on 20 healthy individuals.
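To illustrate this kind of template-matching approach, the following is a minimal sketch (not Malik et al.'s implementation) that compares the LBP histogram of a detected eye patch against an open-eye template using the Kullback-Leibler divergence. The `scikit-image` and NumPy calls are standard, but the template histogram and the decision threshold are assumptions for illustration only.

```python
import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1          # common LBP neighbourhood: 8 samples, radius 1
N_BINS = P + 2       # number of codes produced by the "uniform" LBP variant

def lbp_histogram(gray_eye):
    """Normalised histogram of uniform LBP codes for a grayscale eye patch."""
    codes = local_binary_pattern(gray_eye, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=N_BINS, range=(0, N_BINS), density=True)
    return hist + 1e-10            # avoid zero bins before taking logarithms

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); zero only when p equals q."""
    return float(np.sum(p * np.log(p / q)))

def is_eye_closed(eye_patch, open_eye_template_hist, threshold=0.5):
    """Flag a closed eye when the patch diverges from the open-eye template.

    `open_eye_template_hist` would be the average LBP histogram of a set of
    open-eye images, as described above; `threshold` is a hypothetical value.
    """
    d = kl_divergence(lbp_histogram(eye_patch), open_eye_template_hist)
    return d > threshold
```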

Yunqi et al. [37] proposed an eye blink detection algorithm for use in a driver drowsiness warning system. The proposed system used Haar-like features [32] and AdaBoost to detect the face of the user. Some pre-processing was performed on the image, and an edge detection algorithm was used to find the eye corners, the iris and the upper eyelid for each eye. The curvature of the upper eyelid was compared with the line connecting the two eye corners; if most of the upper eyelid curvature lay under this line, the eye was considered closed. The algorithm was tested on images captured during a real driving session and 94% accuracy was obtained for eye state detection.

Zhang et al. [38] proposed a gaze-based assistive application for a smartphone to enable the user to communicate. The application can recognise six gestures from both eyes, namely look up, look down, look right, look left, look centre and eyes closed. The algorithm used OpenCV [8] and Dlib-ML [21]. Before using the device, calibration must be performed to create a template for each gesture. The template is created by having the user perform the gesture and capturing the image of the eye region while the action is performed. The algorithm detected the gestures with an average accuracy of 86%. The accuracy decreased to 80.4% for people wearing glasses, increased to 89.0% for people wearing contact lenses and increased to 89.7% for people without glasses.

In Val et al. [11], eye blinks are used to control a robot. An infra-red emitter and an optical sensor were used to detect the eye blink. The blinks are used to navigate the robotic assistive aid: for example, a right eye blink would cause the robot to turn right, a left eye blink would cause it to turn left, and a left blink followed by a right blink would cause the robot to stop. In Krolak et al. [22], the proposed method uses two active contour [20] models, one for each eye, to detect eye blinks. Haar-like features [32] are used to detect the face, and the location of the eyes is determined using the known geometrical proportions of the human face.

Tuisku et al. [31] evaluated a system called Face Interface. The system used voluntary gaze direction for moving the cursor around the screen and facial muscle activation for selecting objects on a computer screen. Face Interface used two different muscle activations: frowning and raising the eyebrows. A series of points was presented to the user, and the time to complete the tasks and the accuracy of the activation were used as performance measures. The pointing tasks were conducted using three target diameters (25, 30 and 40 mm), seven distances (60, 120, 180, 240, 260, 450 and 520 mm) and eight pointing angles (\(0^\circ \), \(45^\circ \), \(90^\circ \), \(135^\circ \), \(180^\circ \), \(225^\circ \), \(270^\circ \) and \(315^\circ \)). It was found that for distances between 60 mm and 260 mm, tasks performed using the eyebrow-raising selection technique were faster than those using the frowning technique. The overall times taken to complete the tasks were 2.4 s for the frowning technique and 1.6 s for the eyebrow-raising technique. The \(IP\) was 1.9 bits/s for the frowning technique and 5.4 bits/s for the eyebrow-raising technique.

The systems reported here were limited in that they would only work with frontal facial images and were not robust in coping with posture changes. The work reported here aims to address these shortcomings by making use of the depth data available from RGB-D sensors.

3 Materials and Methods

The systems evaluated in this work incorporate a camera and an algorithm for tracking the head movement and detecting eye blink or eyebrow movement facial gestures. The camera is either the Microsoft Kinect for Windows [1] sensor, which provides 3D (RGB-D) data, or a Logitech web camera, which provides only 2D (RGB) data. Raw data is extracted in the form of images and depth maps. The efficacy of head tracking and gesture recognition of the 3D vision-based system is compared with that of the 2D vision-based system using a modified Fitts’ test.

3.1 Device Evaluation

Fitts’ Test. Fitts originally proposed a method to model human hand movement in order to improve human-machine interactions [13]. Each task has an \(ID\) which is based on the size of the target and the distance of the target from the starting point. The \(ID\) represents the cognitive-motor challenge imposed on the human to accomplish the task and is measured in bits, as shown in Eq. (1).

$$\begin{aligned} ID = log_2(\frac{D}{W} +1) \end{aligned}$$
(1)

where \(D\) represents the distance from the starting point to the target and \(W\) is the width of the target.

$$\begin{aligned} MT = a+b\times ID \end{aligned}$$
(2)

The movement time \(MT\) is modelled as a linear function of \(ID\), where \(a\) is the y-intercept and \(b\) is the gradient of the line in Eq. (2). The Index of Performance (\(IP\)) of a device, in bits/second, is given in Eq. (3).

$$\begin{aligned} IP = \frac{1}{b} \end{aligned}$$
(3)

where \(b\) is the gradient of the line described in Eq. (2). A positive value of \(IP\) (i.e. a positive gradient \(b\)) indicates that the task takes longer to complete as the interaction becomes more challenging. Equation (4) is used to calculate the Effective Throughput (\(TP_e\)) in bits/second.

$$\begin{aligned} TP_e = \frac{ID_e}{MT} \end{aligned}$$
(4)

where \(MT\) is the mean movement time, in seconds, for all trials within the same condition. \(TP_e\) represents the overall efficiency of the device in facilitating interactions.

$$\begin{aligned} ID_e = log_2(\frac{D}{W_e} +1) \end{aligned}$$
(5)

\(ID_e\) is the effective index of difficulty, in bits, calculated from the distance \(D\) from the start location to the target and \(W_e\), the effective width of the target. \(W_e\) is calculated from the observed distribution of the target selection coordinates.

$$\begin{aligned} W_e = 4.133 \times SD \end{aligned}$$
(6)

where \(SD\) is the standard deviation of the selection coordinates [12].

Fitts’ experiments showed that the rate of performance of the human motor system is approximately constant over a wide range of movement amplitudes. Fitts’ Law [14, 23] states that \(MT\) should increase with an increase in \(ID\), i.e. as the difficulty of the task increases, the time taken to complete the task also increases. Fitts’ Law was adapted by Mackenzie et al. [23] to assess HCI devices, and the Fitts’ test was therefore considered an appropriate tool for assessing the performance of the head tracking and gesture recognition system.
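To make Eqs. (1)–(6) concrete, the following sketch computes the Fitts metrics from trial data. The formulas follow the equations above; the function and variable names (and the assumption that per-condition movement times and selection coordinates are available as arrays) are illustrative only.

```python
import numpy as np

def index_of_difficulty(D, W):
    """Eq. (1): ID in bits for target distance D and target width W."""
    return np.log2(D / W + 1)

def fit_fitts_model(ids, mean_mts):
    """Eq. (2): least-squares fit MT = a + b * ID.

    Returns the intercept a, gradient b, IP = 1/b (Eq. (3)) and R^2.
    """
    ids = np.asarray(ids, dtype=float)
    mean_mts = np.asarray(mean_mts, dtype=float)
    b, a = np.polyfit(ids, mean_mts, 1)          # gradient, intercept
    predicted = a + b * ids
    ss_res = np.sum((mean_mts - predicted) ** 2)
    ss_tot = np.sum((mean_mts - mean_mts.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, 1.0 / b, r_squared

def effective_throughput(D, selection_coords, movement_times):
    """Eqs. (4)-(6): TP_e from the spread of the selection end-points."""
    sd = np.std(selection_coords)                # spread about the target
    w_e = 4.133 * sd                             # Eq. (6)
    id_e = np.log2(D / w_e + 1)                  # Eq. (5)
    return id_e / np.mean(movement_times)        # Eq. (4)
```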

3.2 Gesture Detection

Fig. 1. Algorithm to detect blink and eyebrows movements.

Figure 1 shows an overview of the 3D head tracking and facial gesture recognition system. The facial gesture recognition system is the same for both the 2D vision system and the 3D Kinect system. Depth data is used only to filter the region of interest when processing the facial image: only objects within a metre of the 3D sensor are included in the region of interest, and all other background is removed before further processing.

The facial gesture recognition system uses the RGB data from the sensors. Facial areas of interest such as the head, the eye region, the left eye and the right eye are detected using a Haar cascade [32]. To detect a blink, closure of both eyes has to be detected for a period of 1 s or more, followed by a return to the open state. If closure of only one eye is detected, the system assumes there is no blink. Only the transition from open eyes to closed eyes and back to open eyes is recognised as a blink.
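A minimal OpenCV sketch of the cascade-based detection step might look like the following. The cascade files are those shipped with OpenCV; nesting the eye search inside the detected face rectangle and the detection parameters are assumptions, not details taken from the system described here.

```python
import cv2

# Haar cascades distributed with OpenCV (paths exposed via cv2.data.haarcascades)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(bgr_frame):
    """Return the first detected face rectangle and the eyes found inside it."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, []
    x, y, w, h = faces[0]
    face_roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_roi, scaleFactor=1.1, minNeighbors=5)
    return (x, y, w, h), eyes
```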

In the case of eyebrow detection, the two states of the eyebrows (raised, down) are monitored. In the raised state the eyebrow muscles are contracted to raise the eyebrows; in the down state the muscles are relaxed and the eyebrows revert to their original location. The eyebrow region is detected using the location above the eye region. The state of the eyebrows is initially set to down. To recognise an eyebrow movement, both eyebrows have to be raised for a period of 1 s or more and subsequently return to the down state. Only the transition from down to raised and back to down is recognised as a valid eyebrow movement.
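The timing logic described for both gestures (both features active for at least 1 s, then a return to the neutral state) can be expressed as a small state machine. The sketch below is one possible implementation; the 1 s hold time comes from the text, while the class, method names and per-frame inputs are assumptions.

```python
import time

class HoldGestureDetector:
    """Recognise a gesture only on the transition
    neutral -> held (>= hold_s with both features active) -> neutral."""

    def __init__(self, hold_s=1.0):
        self.hold_s = hold_s
        self.active_since = None    # time at which both features became active

    def update(self, left_active, right_active, now=None):
        """Feed per-frame states (e.g. both eyes closed, or both eyebrows raised).

        Returns True exactly once, when a valid gesture completes."""
        now = time.monotonic() if now is None else now
        both = left_active and right_active
        if both:
            if self.active_since is None:
                self.active_since = now      # start of the held phase
            return False
        held_long_enough = (self.active_since is not None
                            and now - self.active_since >= self.hold_s)
        self.active_since = None             # back to the neutral state
        return held_long_enough

# One detector per gesture, fed by the per-frame detection results
blink_detector = HoldGestureDetector(hold_s=1.0)
eyebrow_detector = HoldGestureDetector(hold_s=1.0)
```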

4 Experimentation

4.1 Setup

Fig. 2. Experimental set-up.

The participant was asked to perform a series of Fitts’ tests [14, 24]. Participants were allowed to repeat a gesture until the click action was detected, which increased the movement time. The Fitts’ test was used to evaluate two devices: a 2D vision-based head tracker using the Logitech web camera and a 3D head tracking system using the Kinect device. The experiment was performed using the two facial gestures (blink and eyebrow movement) as a switching mechanism. It has been reported that the spontaneous eye blink rate can vary from 20 to 30 blinks/min depending on the mental task the person is performing [19], and can decrease to about 11 blinks/min during visually demanding tasks [35]. Therefore, the intentional blink time threshold was set to 1000 ms to distinguish between intentional and unintentional facial gestures and to prevent spontaneous blinks from being detected. The activation time of the eyebrow movement switch was also set to 1000 ms (Fig. 2).

Fig. 3. Target locations (incorporating 8 distinct movement orientations).

The screen used was a 17 in. LCD monitor with a resolution of 1280 by 1024 pixels. A target is selected at random from a set of pre-designated locations, as shown in Fig. 3, and presented to the participant. The participant then has to move the cursor using head movement and select the target with the equivalent of a mouse click using the facial gesture being evaluated. Once a target has been selected, the participant has to move the cursor back to the central location on the screen and select the target there. This ensures that the same start point is used for each target selection. The choice of the stimulus target locations is based on earlier work by Guness et al., in which the points were configured to perform a range of selection tasks with 8 target directions/orientations [16].
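One way to generate such pre-designated locations is sketched below, assuming targets placed at a fixed distance from the screen centre at the eight orientations shown in Fig. 3. The distance value is a placeholder, not one of the values used in the study.

```python
import math
import random

SCREEN_W, SCREEN_H = 1280, 1024
CENTRE = (SCREEN_W // 2, SCREEN_H // 2)
ANGLES_DEG = (0, 45, 90, 135, 180, 225, 270, 315)   # 8 movement orientations

def target_positions(distance_px):
    """Targets at `distance_px` pixels from the screen centre, one per orientation."""
    cx, cy = CENTRE
    return [(round(cx + distance_px * math.cos(math.radians(a))),
             round(cy - distance_px * math.sin(math.radians(a))))
            for a in ANGLES_DEG]

# Example: present the targets in random order for one distance condition
targets = target_positions(distance_px=300)   # placeholder distance
random.shuffle(targets)
```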

4.2 Sensors

Two sensors were used. The first was a standard Logitech web camera. The web camera captured \(640\times 480\) pixel RGB images at a rate of 30 frames per second. The second sensor was the Kinect for Windows sensor [1]. The Kinect sensor consists of a structured light based depth sensor and an RGB sensor. The Kinect sensor operates at a 30 Hz rate and generates \(640\times 480\) depth and RGB images. The depth range of the Kinect sensor in default mode is 800 mm to 4000 mm and in near mode is 500 mm to 3000 mm. In this experiment the Kinect sensor operated in near mode. Both the web camera and the Kinect sensor were selected because they are relatively inexpensive devices that can be readily obtained.

4.3 Depth Data

The depth data obtained from the Kinect sensor is used to reduce the search area for the different Haar cascade detectors. This reduces the computational load, avoids background distractions such as other people, movements and changes in lighting, and therefore increases performance. A mask is created from the depth data, selecting only objects within 1000 mm of the sensor. The mask is applied to the colour image to remove all objects that are more than 1000 mm from the sensor.
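A minimal sketch of this depth-masking step using NumPy and OpenCV is given below, assuming a depth map in millimetres that is already registered (aligned) to the colour frame; the sensor-specific registration step is omitted.

```python
import cv2
import numpy as np

MAX_RANGE_MM = 1000   # keep only objects within 1 m of the sensor

def mask_background(bgr_frame, depth_mm):
    """Zero out every pixel whose depth is missing or beyond MAX_RANGE_MM."""
    mask = ((depth_mm > 0) & (depth_mm <= MAX_RANGE_MM)).astype(np.uint8) * 255
    # Apply the binary mask to the colour image; the background becomes black,
    # so subsequent Haar-cascade searches only see the foreground user.
    return cv2.bitwise_and(bgr_frame, bgr_frame, mask=mask)
```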

5 Results

The experiment was carried out with 21 healthy individuals, each of whom completed the tests with all 4 device/gesture combinations. The \(MT\) in the Fitts’ test is the time taken to move from the starting point to the target location and perform the task. To compare the devices and the effect of the facial gesture, the task was broken into two parts. Task 1 involves moving the cursor to the target location using movement of the head. Task 2 encapsulates Task 1 and additionally involves selecting the target by using one of the facial gestures as a switching mechanism.

Fig. 4. Fitts’ test result for Task 1 (movement to target).

In Fig. 4, the Kinect-eyebrows system has a lower \(MT\) than the Kinect-blink system for IDs greater than 1.9 bits. Overall for Task 1, the Kinect-eyebrows system has the lowest \(MT\), followed by the Kinect-blink and the webcam-blink systems, and finally the webcam-eyebrows system, which took the most time to complete (Fig. 5).

Fig. 5. Fitts’ test result for Task 2 (performing the facial gesture).

Table 1. Overall index of performance (\(IP\)) and effective throughput (\(TP_e\)) of tested devices

From Table 1, it can be seen that both the \(IP\) and the \(TP_e\) for moving the cursor to the designated target (Task 1) were better than those for the combination of moving and clicking (Task 2) using the different facial gestures, for all devices. This is to be expected, as the clicking/selection method affects the performance and efficiency of the system used. Both the \(IP\) and \(TP_e\) of the 3D Kinect system were also better than those of the 2D vision system. \(R^2\) is the coefficient of determination and measures how well the data fit the linear model [10]. The \(R^2\) values for Task 1 are higher than those for Task 2, indicating that Task 1 follows the linear model more closely than Task 2. Another interesting observation is that the blink gesture yields similar \(R^2\) values with both the web camera and the Kinect, whereas the \(R^2\) values of the eyebrow movement gesture are lower.

Table 2 presents the \(IP\) of the different devices when performing Task 1 and the combination of Task 1 followed by Task 2, for the different target orientations; Table 3 presents the corresponding \(TP_e\) values.

Table 2. \(IP\) of Task 1 and Task 2 in bits/second
Table 3. \(TP_e\) of Task 1 and Task 2 in bits/second

A one-way ANOVA was performed on the \(TP_e\) values for the different orientations and gestures for both Task 1 and Task 2. For the comparison by orientation, p < 0.01 for both Task 1 and Task 2 (p = 0.001 and p = 0.007 respectively), so there is a significant difference between the means of the different orientations, i.e. the \(TP_e\) differs with the orientation of the movement. For the comparison by gesture, only Task 2 had p < 0.01 (p = 0.0093). This indicates a significant difference between the mean \(TP_e\) values depending on the gesture being performed, pointing to a difference in the performance of the two facial gestures being investigated. The effect of the gesture on the mean \(TP_e\) is greater for Task 2 due to the increased challenge of both moving and selecting/clicking. There is insufficient evidence of any difference between the means of \(TP_e\) for the combination of orientation and gesture for either Task 1 or Task 2, which would indicate that, for the sample used, the gesture recognition might be invariant to the orientation of the task being performed.
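The orientation comparison could be reproduced with a standard one-way ANOVA, for example using SciPy as sketched below. The grouping of \(TP_e\) values by orientation and the data structure are assumptions; the data themselves are placeholders.

```python
from scipy import stats

def orientation_anova(tp_e_by_orientation):
    """One-way ANOVA of TP_e across the eight movement orientations.

    `tp_e_by_orientation` is assumed to map each orientation (0, 45, ..., 315)
    to a list of TP_e values in bits/s, one per participant/trial.
    """
    groups = list(tp_e_by_orientation.values())
    f_stat, p_value = stats.f_oneway(*groups)
    return f_stat, p_value

# A p-value below 0.01 would indicate a significant difference in mean TP_e
# between orientations, as reported above for both Task 1 and Task 2.
```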

6 Discussion

Using facial gestures as a switch is possible in real time, but the use of such gestures may cause a drop in the overall \(IP\) of the systems. The \(IP\) and \(TP_e\) values in Table 1 for the four different systems were obtained with participants successfully reaching and selecting all targets. As can be seen in the results for the overall \(IP\) (Table 1), the \(R^2\) value, which represents the goodness of fit of the fitted line, is greater than 0.7 for the Kinect 3D system, i.e. the line accounts for more than 70% of the variance. In contrast, the \(R^2\) of the webcam-eyebrows device is 0.21 and thus accounts for only 21% of the variance. This could indicate that the presence of outliers has a large influence on the fitted line and thus on the gradient. As the \(IP\) calculation in Eq. (3) is based on the inverse of the slope, it is also influenced by outliers at very low and very high indices of difficulty. It should be borne in mind that each of the points in Figs. 4 and 5 is obtained from the mean of data obtained from 21 users and 8 directions, giving 64 data points. In the presence of such outliers, relying on \(TP_e\) as a measure of performance might be better.

There is a decrease in the \(TP_e\) of all four devices after the switching action is included. The reduction in the \(TP_e\) of the 2D vision system is 45% and 44% for the blink and eyebrow devices respectively. Similarly, the decline in the \(TP_e\) of the 3D Kinect system is 32% and 35% for the blink and eyebrow devices respectively. The higher overall \(TP_e\) indicates that the Kinect system, utilising 3D information, results in better performance when the two tasks of moving and selecting are combined, and thus improves the ease of use of the system as a whole. It has also been shown that the \(TP_e\) values for Task 2 grouped by gesture come from different populations, with the eyebrow gesture having a higher mean \(TP_e\). There is no evidence to support a difference in performance based on sensor or device, which further suggests that the improvement in performance stems from the gesture detection algorithm.

In addition, the facial gesture detection rate affected the \(MT\) for the different devices. In this implementation of the Fitts’ test, a task was considered complete only when the switch was activated and the click action performed.

7 Conclusion

Both Kinect systems have lower \(MT\) and higher \(IP\) and \(TP_e\) than the webcam-based systems, showing that the introduction of depth data had a positive impact on the head tracking algorithm. This can be explained by the ability to discard unnecessary data at an early stage of processing using the depth information, thereby speeding up subsequent stages and creating a smoother experience for the users. In this work only the blink and eyebrow movement gestures have been examined; further work will be carried out on additional gestures such as mouth opening/closing and tongue movement. We now intend to conduct translational research with neurological patients.