Introduction

Interactive and autonomous agents may become common in everyday life, and we expect such agents to be able to communicate with people naturally. For smooth and natural communication, it is not sufficient to understand only the words and sentences that our partners say; we also have to speculate about their intentions. For this purpose, we use not only verbal information but also diverse, multimodal nonverbal information (for example, Von Raffler-Engel 1980; Burgoon 1994; Daibou 2001). If people did not speculate about their speaking partner’s intentions by using nonverbal information, they would have trouble understanding the meaning of even a simple interjection. In fact, Hayashi reported that utterances of the Japanese interjection “eh” with different prosodic information could be interpreted to mean different things (Hayashi 1998).

We often unconsciously express our intentions through nonverbal information: facial expressions, prosody, and gaze convey the speaker’s feelings, thoughts, and so on. It is important to understand such unconscious expressions of intention by using diverse nonverbal information. In addition, since we expect the results of our investigation to be applied to a robot or agent that interacts with people in everyday situations, we also need to investigate situations similar to actual communication.

To speculate on intentions such as deception, we focused on the unconscious expressions people make when they tell a lie. We experimentally determined whether “lies” in a situation similar to actual communication were discernible by using nonverbal information such as gaze, prosody, and facial expressions. One reason for focusing on lies is that telling a lie is a typical behavior in which we may unconsciously express our intentions. Another reason is that a lie can be objectively defined. Coleman and Kay (1981) defined a prototypical lie as having the following features: the speaker asserts something that is untrue; the speaker believes that it is untrue; and the speaker’s intention is to deceive. There are also ways to deceive people besides telling a prototypical lie; for example, people may be deceived when the truth is spoken as if it were a lie. However, we focused on the prototypical lie because it can be objectively defined. In this paper, a “lie” means a linguistic statement that is intentionally deceiving.

In our previous work (Ohmoto et al. 2005), we confirmed that we could identify a person who tells a lie through the synthetic use of multimodal nonverbal information. We conducted an experiment using a game in which the communication of players resembled actual communication. We manually classified the nonverbal information of every utterance during the game into lies or others. To do so, we focused on 13 variables: gaze (three variables), prosody (nine variables), and facial expression (one variable). We then carried out a discriminant analysis to classify utterances into “lies” or “others.” We conducted the same experiment with the same participants a month after the first experiment and found that the discrimination ratio reached 75–85% in each experiment. Moreover, the ratio was about 80% even when the discriminant function of the first experiment classified a data set of the second one. However, manually eliciting and analyzing nonverbal information from the recorded video was laborious, so it is necessary to reduce the cost of eliciting and analyzing nonverbal information. Moreover, a method of automatic measurement is required if a robot or agent is to judge its partner’s (i.e., a person’s) intentions automatically.

To measure during natural communication, a measuring system has to meet three conditions: (1) it can measure both the gaze direction and facial features; (2) it allows the user’s head position and orientation to move to some extent; and (3) it does not require putting markers on the face or manually making a model of the face before measurement. The first condition is necessary for eliciting diverse nonverbal information. The second and third conditions are necessary for natural communication; that is, natural communication is impossible if we cannot move our heads or have to wear markers on our faces.

A few gaze tracking systems satisfy (2) and (3). Some real-time systems that simultaneously measure head pose and gaze direction satisfy (1) and (2) (Matsumoto and Zelinsky 2000; Oka et al. 2005). However, no single system satisfies all three conditions. Although a combination of such systems could together satisfy all three conditions, a great deal of labor would be required to operate several systems at once.

For this reason, we made a real-time system that measures the gaze direction and facial features simultaneously without making a model of a face. We then used the measuring system to investigate a method of discriminating lies.

The purpose of this study is to confirm whether paying attention to multimodal nonverbal information is necessary for discriminating lies in situations similar to actual communication. For this purpose, we conducted an experiment using a game in which participants could tell a lie spontaneously and intentionally, if necessary.

The rest of the paper is organized as follows. Section 2 explains the details of the hardware and our algorithms for measuring facial features and detecting gaze. Section 3 describes the experiment that used the system for discriminating lies and discusses the remaining problems of our method. Section 4 concludes this study.

Face-gaze measuring system

We made a real-time system for measuring both gaze direction and facial features without having to use markers, restrict the user’s actions, or manually make a model of the face before measuring.

Hardware

We used a pair of NTSC cameras (SONY FCB-EX480C × 2) to capture facial images. The output video signals from the cameras were fed into a vision processing board (Euresys Picolo tetra), which captured face images at a resolution of 640 × 480 pixels at a sampling rate of 30 FPS. The system used an infrared light for detecting the pupil center.

Outline

The software for measuring facial feature points and gaze direction consists of four major parts: (1) face tracking, (2) facial feature detection, (3) head pose estimation, and (4) gaze detection. In the face tracking stage, the system searches for the eye positions in the whole 2D image by using the SSR algorithm (Kawato and Tetsutani 2004a, b). After that, the system starts facial feature detection in each 2D image. If the facial feature detection is not successful, the system regards the face as lost and jumps back to the face tracking stage. If the facial feature detection is successful, the system then estimates the head pose and detects the gaze direction; a 3D eye model is used to determine the 3D gaze vector. Finally, the system returns to the facial feature detection stage for the next frame.
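The control flow above can be summarized by the following minimal sketch. The four stage functions are assumed callables supplied by the caller (placeholders, not the actual implementation); only the loop structure reflects the description in the text.

```python
# Minimal sketch of the measurement loop; the stage functions passed in
# (track_face, detect_features, estimate_head_pose, detect_gaze) are
# hypothetical placeholders, not the authors' implementation.

def measurement_loop(read_frame, track_face, detect_features,
                     estimate_head_pose, detect_gaze):
    """Yield (head_pose, gaze_vector) per frame, re-entering face
    tracking whenever feature detection fails."""
    face_found = False
    while True:
        frame = read_frame()                   # stereo pair of 2D images
        if not face_found:
            face_found = track_face(frame)     # SSR-based eye search
            continue
        features = detect_features(frame)      # ten facial feature points
        if features is None:                   # face regarded as lost
            face_found = False
            continue
        pose = estimate_head_pose(features)    # virtual-spring fitting
        yield pose, detect_gaze(pose, features)
```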

Facial feature detection

In the facial feature detection stage, the system searches for ten feature points in the current frame: (1) the four corners of the eyes, (2) the two centers of the pupils, (3) the two centers of the nostrils, and (4) the two corners of the mouth. These features are detected in this order, by using the light and dark regions of the 2D images.

First, in a small region around each eye, the edges of the eyelids are detected as shown in Fig. 1a. Dark pixels are detected around the eye positions found in the face tracking stage. Two or more thresholds are prepared to reduce errors when detecting dark pixels. Pixels that have no other dark pixels nearby are deleted because such isolated pixels are not regarded as part of an eyelid.

Fig. 1 Detection sequence of facial feature points

Second, the dark pixels detected in Fig. 1a are fitted to quadratic curves by using the least squares method (Fig. 1b). These curves correspond to the upper and lower eyelids.

Third, the corners of the eyes are detected by searching a small region around the intersection of the eyelids’ curves (Fig. 1c).
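As a small illustration of the second and third steps, the sketch below fits the dark pixels assigned to each eyelid to a quadratic by ordinary least squares and takes the eye corners as the intersections of the two fitted curves; the variable names and the prior assignment of pixels to the upper or lower eyelid are assumptions.

```python
import numpy as np

def fit_eyelid(xs, ys):
    """Least-squares quadratic fit y = a*x**2 + b*x + c for one eyelid."""
    return np.polyfit(np.asarray(xs, float), np.asarray(ys, float), deg=2)

def eye_corner_xs(upper_coeffs, lower_coeffs):
    """X coordinates where the upper and lower eyelid curves intersect."""
    # Solve (a_u - a_l)x^2 + (b_u - b_l)x + (c_u - c_l) = 0
    return np.roots(np.asarray(upper_coeffs) - np.asarray(lower_coeffs))
```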

Fourth, the centers of the pupils are estimated by using the average brightness of the pixels in each region enclosed by the eyelids (Fig. 1d). The X coordinate of the pupil center is searched for as follows. The average brightness of the pixels is calculated for every X coordinate in the region enclosed by the eyelids. The averages form a brightness profile with a gradation, as shown in Fig. 1d, and the X coordinate of the pupil center is taken at the peak of this concentration profile. The Y coordinate of the pupil center is detected with a similar procedure.
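A rough sketch of this search is given below; `patch` and `mask` are assumed inputs (a grayscale patch around the eye and a Boolean mask of pixels inside the eyelid curves), and the pupil is assumed to be the darkest part of the profile.

```python
import numpy as np

def pupil_center(patch, mask):
    """Pupil center (x, y) from column/row brightness profiles inside the eyelids."""
    region = np.where(mask, patch.astype(float), np.nan)
    col_profile = np.nanmean(region, axis=0)   # average brightness per X
    row_profile = np.nanmean(region, axis=1)   # average brightness per Y
    # The pupil is assumed dark here; a bright-pupil (coaxial infrared)
    # setup would take the argmax instead.
    return int(np.nanargmin(col_profile)), int(np.nanargmin(row_profile))
```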

Fifth, the region of the nostrils is detected by searching the region shown in Fig. 1e. The search region is defined using the method of Yanagida et al. (2003). Dark regions, in which dark pixels are clustered, are detected within the search region; the nostrils are the two dark regions that lie side by side on a line. The centers of the nostrils are defined as the centers of these regions.

Finally, the mouth region is detected by searching the region shown in Fig. 1f. The mouth region is a large dark region under the nostrils. The corners of the mouth are on either side of the mouth region.

After the facial feature points are detected in each 2D image, the system performs stereo matching to calculate the 3D coordinates of each feature. A face model is then built from the 100 most recent sets of facial feature points.
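The paper does not detail its stereo computation; the following is a generic linear (DLT) triangulation sketch for recovering a 3D feature point from a matched pair of image coordinates, assuming calibrated 3×4 projection matrices `P_left` and `P_right`.

```python
import numpy as np

def triangulate(P_left, P_right, uv_left, uv_right):
    """3D point from one matched feature in two calibrated views (DLT)."""
    (ul, vl), (ur, vr) = uv_left, uv_right
    A = np.stack([
        ul * P_left[2] - P_left[0],
        vl * P_left[2] - P_left[1],
        ur * P_right[2] - P_right[0],
        vr * P_right[2] - P_right[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]            # homogeneous -> Euclidean 3D coordinates
```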

Head pose estimation

In the head pose estimation stage, the system calculates the 3D pose of the head by using the facial feature points, excluding the centers of the pupils and the corners of the mouth, because these features move frequently during communication. We adopted a simple gradient method using virtual springs to estimate the head pose. In a relative coordinate system whose origin is the midpoint of the nostrils, the face model is iteratively rotated so as to reduce the elastic energy of the springs connecting it to the measured facial feature points.
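A rough sketch of such a fit is shown below: the “elastic energy” is taken as the summed squared distance between the rotated model points and the measured points, and the rotation angles are updated by numerical gradient descent. The step size, iteration count, and finite-difference width are illustrative assumptions, not the authors’ parameters.

```python
import numpy as np

def rot(rx, ry, rz):
    """Rotation matrix from Euler angles (radians)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def estimate_pose(model_pts, measured_pts, step=1e-3, iters=200, h=1e-4):
    """model_pts, measured_pts: (N, 3) arrays in nostril-midpoint coordinates."""
    energy = lambda a: np.sum((model_pts @ rot(*a).T - measured_pts) ** 2)
    angles = np.zeros(3)
    for _ in range(iters):
        grad = np.array([(energy(angles + h * e) - energy(angles - h * e)) / (2 * h)
                         for e in np.eye(3)])
        angles -= step * grad                  # descend the spring energy
    return rot(*angles), energy(angles)
```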

Gaze detection

In the gaze detection stage, the eyeballs are regarded as spheres. The gaze direction is determined on the basis of the head pose and the centers of the pupils. The 3D eye model consists of the relative position of the center of the eyeball with respect to the head pose and the radius of the eyeball. The relative position of the eyeball center is defined as a 3D vector from the midpoint of the nostrils to the center of the eyeball. The radius of the eyeball takes a value of around 13 mm. These parameters are currently determined manually through a personal calibration in which the gaze point is known. The 3D position of the eyeball can then be determined from the pose of the head and the relative position of the center of the eyeball. Since the centers of the pupils are already detected, the system calculates the gaze direction from the relationship between the pupil center and the eyeball center. The two gaze directions are detected independently. However, each measurement is not sufficiently accurate, mainly because of the low resolution of the image: since the field of view of the camera is set to capture the whole face, the radius of the pupil is only about 15 pixels in a typical situation. It is therefore hard to determine the “gaze point” in the 3D scene by calculating the intersection of the detected gaze lines. Instead, the two vectors are currently averaged to produce a single gaze vector in order to reduce the effect of noise.
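The geometry is simple enough to sketch directly: the eyeball center is placed relative to the head pose, and the gaze is the unit vector from that center through the detected 3D pupil center, with the two eyes’ vectors averaged. Variable names below are illustrative.

```python
import numpy as np

def eyeball_center_world(R_head, nostril_midpoint, calibrated_offset):
    """Eyeball center in world coordinates from the head rotation and the
    personally calibrated offset vector (nostril midpoint -> eyeball center)."""
    return np.asarray(nostril_midpoint) + R_head @ np.asarray(calibrated_offset)

def gaze_vector(eyeball_center, pupil_center):
    v = np.asarray(pupil_center, float) - np.asarray(eyeball_center, float)
    return v / np.linalg.norm(v)

def combined_gaze(left_center, left_pupil, right_center, right_pupil):
    """Average of the two eyes' gaze vectors, renormalized to unit length."""
    g = gaze_vector(left_center, left_pupil) + gaze_vector(right_center, right_pupil)
    return g / np.linalg.norm(g)
```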

We used the personal calibration method of a previous study (Ohno et al. 2002). In this calibration method, the user gazes at at least two points on the screen. Once a calibration matrix has been calculated, the calibrated gaze vector of the user can be computed. Many existing gaze tracking systems need to be calibrated before every measurement session, even for frequent users. In addition, such calibration puts a heavy strain on users, since they must watch 5–20 calibration points.
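As a minimal sketch (not Ohno et al.’s exact formulation), a linear correction can be fitted by least squares from the measured gaze angles at a few known fixation points and then applied to later measurements; note that a full affine fit of this form needs at least three non-collinear points, whereas the method cited above works from two.

```python
import numpy as np

def fit_calibration(measured, truth):
    """measured, truth: (N, 2) arrays of (horizontal, vertical) gaze angles
    at known fixation points; returns a (3, 2) affine correction matrix."""
    M = np.hstack([measured, np.ones((len(measured), 1))])   # add bias column
    A, *_ = np.linalg.lstsq(M, truth, rcond=None)
    return A

def apply_calibration(A, measured):
    """Apply the fitted correction to new (N, 2) measured gaze angles."""
    return np.hstack([measured, np.ones((len(measured), 1))]) @ A
```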

Results

Snapshots obtained during tracking experiments using our system are shown in Fig. 2. The crosses (×) mark the feature points detected in the facial feature detection stage. Image (a) shows a detection result for a full face. Image (b) shows a result when the face is rotated. Image (c) shows a typical result when detection fails; detection fails when a facial feature point is hidden by a hand or another object. In addition, our detection algorithm may fail under certain lighting conditions.

Fig. 2 Results of facial feature detection

The whole process takes approximately 50 ms per frame, i.e., 20 FPS. In our previous work (Ohmoto et al. 2005), we could discriminate lies by using 15 FPS video data; this speed therefore allows our system to analyze data in the same way as in the previous work. The accuracy of the measurement of the facial feature points is approximately ±2 mm in translation and ±1° in head pose rotation. The accuracy of the gaze direction is approximately ±2°; it was evaluated through experiments in which two participants were asked to watch nine markers on a monitor.

Discriminating lies by using our system

We conducted experiments in which we discriminated lies by using our system. As explained in Sect. 1, we focused on a lie as an expression of deceptive intention, and we devised an automatic discrimination method based on multimodal nonverbal information.

Although it would be better to pay attention to all cues to deception, the number of known cues is quite large (DePaulo et al. 2003). In addition, lies comprised only 10% of the total number of utterances in our experiment. We therefore cannot pay attention to all deception cues in automatic discrimination. For this reason, we selected certain cues to deception in advance, drawing on related studies about lies.

Experimental setting

Figure 3 shows the setup of the experiment. Participants communicated with each other through half-mirror boxes, as in Fig. 3b. Three participants (players) formed a group, referred to as a “triad” in the rest of the paper. The monitor of each half-mirror box displayed shots of the other two participants’ faces taken by network cameras. The cameras were set inside the half-mirror boxes, behind the middle of each displayed participant’s face, so that the participants could make eye contact with each other. We used the half mirrors because our previous research indicated that gaze direction is important for discriminating lies (Ohmoto et al. 2005). The participants talked with each other through a microphone and speakers.

Fig. 3 Setup of the system of automatic discrimination of lies

The participants were asked to play a game of Indian poker repeatedly after being briefly instructed on the rules and strategies of the game. The other participants’ faces were displayed on the monitor of each half-mirror box, and the participants communicated through these monitors. The behavior and utterances of the participants during the games were recorded by our system and a voice recorder. The participants’ utterances and actions were not controlled; they were allowed to communicate freely. Each triad consisted of two graduate students, who were acquainted with each other, and the experimenter. The experimenter participated in the game because it was difficult for beginners alone to keep the communication flowing smoothly; the experimenter behaved like an ordinary player. We conducted this experiment again with the same participants a month after the first experiment.

Method of discriminating lies

Below, we explain the method of extracting variables from the elicited nonverbal information.

Utterances, the units of analysis, were extracted from the data recorded during the experiment. We call the duration from the start to the end of one utterance an “utterance unit.”

Multimodal nonverbal information in every utterance unit was elicited and recorded by using our system and a voice recorder. The elicited information comprised the gaze direction, the pitch and power of the voice, and the 3D positions of the upper and lower eyelids and of the corners of the mouth. The variables for discriminating lies (Table 1) were extracted from this information.

Table 1 Variables in our experiments

Below, we explain how to extract these variables from the elicited information. In our previous work, we manually carried out this procedure by watching recorded videos of the experiments.

The three variables in the gaze row of Table 1 were estimated from the gaze direction. “Ratio of gazing at conversation partner” is the proportion of the total time of an utterance unit during which a participant gazed at his/her conversation partner; the partner to whom a speaker is talking is judged from the pose of the speaker’s head. “Ratio of gazing at useful objects for communication” is the proportion of the total time of an utterance unit during which a participant gazed at objects holding useful information for the communication. For example, when a participant talked about objects on a table, the other participants’ faces and the objects on the table were regarded as useful objects. The experimenter himself judged whether an object was useful or not; in this experiment, the candidate useful objects were the faces and cards of the other participants. “Transitional ratio of gazing at the objects” is the number of gaze shifts divided by the total time of an utterance unit.
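These three variables amount to simple ratios over the frames of an utterance unit, as in the sketch below; the per-frame gaze labels, the label sets, and the frame rate are assumptions for illustration.

```python
def gaze_variables(samples, partner_labels, useful_labels, fps=20):
    """samples: per-frame label of the gazed-at object within one utterance unit
    (e.g. "partner", "card_A", None). Returns the three gaze variables."""
    n = len(samples)
    duration = n / fps                          # utterance-unit length in seconds
    at_partner = sum(s in partner_labels for s in samples)
    at_useful = sum(s in useful_labels for s in samples)
    shifts = sum(a != b for a, b in zip(samples, samples[1:]))
    return {
        "ratio_partner": at_partner / n,        # time share gazing at the partner
        "ratio_useful": at_useful / n,          # time share gazing at useful objects
        "shift_rate": shifts / duration,        # gaze shifts per second
    }
```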

The prosody variables are coded from the average pitch and power of an utterance, each average being assigned to one of three categories. The procedure for coding the pitch variable in the first half of an utterance is as follows. “Total pitch average in the first half” is the average first-half pitch over all utterances in the experiment, and “pitch SD in the first half” is the corresponding standard deviation. If the average pitch in the first half of an utterance is greater than (“total pitch average in the first half” + “pitch SD in the first half”), the pitch variable in the first half is coded as +1. If it is less than (“total pitch average in the first half” − “pitch SD in the first half”), the variable is coded as −1. Otherwise, it is coded as 0. The pitch variable in the second half is coded with a similar procedure, as are the power variables in the first and second halves. When the variable of the second half is higher than that of the first half, the change variable is set to +1; when it is lower, the change variable is set to −1; and when there is no change, it is set to 0 (prosody row in Table 1).

We adopted this coding to remove noise caused by the microphone direction and by changes in the distance between the microphone and the participants as they moved in their seats.
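The coding reduces to a three-level thresholding against the experiment-wide mean and standard deviation, plus a sign for the first-to-second-half change, as sketched below; whether the change is computed from the coded values or the raw averages is an assumption here.

```python
import numpy as np

def code_prosody(value, overall_mean, overall_sd):
    """Three-level code for one half-utterance average (pitch or power)."""
    if value > overall_mean + overall_sd:
        return +1
    if value < overall_mean - overall_sd:
        return -1
    return 0

def code_change(first, second):
    """+1 if the second half is higher than the first, -1 if lower, 0 if equal."""
    return int(np.sign(second - first))
```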

The 3D positions of the upper and lower eyelids and of the corners of the mouth are used to identify whether a smile is forced or not. Although it is difficult to pick up subtle changes in facial expression, we noticed that people often forced a smile while telling a lie, and it has been reported that in a forced smile there is a time difference between the start of the reaction of the eyes and that of the mouth (Nakamura 2000). Therefore, “the mouth reacting earlier than the eyelids” is regarded as the typical facial expression feature (facial feature row in Table 1). The value was set to 1 (true) when the mouth moved earlier than the eyes; otherwise, it was set to 0 (false). The variable was also set to 1 when only the corners of the mouth moved.
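In other words, the feature is a binary comparison of movement onset times within the utterance unit, roughly as follows; how movement onsets are detected from the 3D positions is left as an assumption.

```python
def forced_smile_feature(mouth_onset, eyelid_onset):
    """1 if the mouth corners start moving before the eyelids (or only the
    mouth corners move), else 0. Onsets are seconds from the utterance start,
    or None if that part did not move."""
    if mouth_onset is None:
        return 0                     # the mouth corners did not move at all
    if eyelid_onset is None:
        return 1                     # only the mouth corners moved
    return 1 if mouth_onset < eyelid_onset else 0
```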

We used a discriminant analysis to classify the utterances into lies and others. We classified utterances into two groups according to the definition of a lie explained in Sect. 1: one was “an utterance which is a lie” (hereafter, “lie utterance” for short) and the other was “other utterances.” We defined an “ambiguous utterance” as an utterance that is neither a truth nor a lie, for instance, an ambiguous statement or a noncommittal answer. “Ambiguous utterances” accounted for 10–20% of all utterances and were classified into “other utterances.”

Linear discriminant analysis was applied to the data sets of the variables in Table 1. The variables were selected as follows. First, the experimenter extracted pairs of variables with correlation coefficients of 0.8 or more and removed from each such pair the variable with the lower F value. Next, the remaining variables were reduced by backward elimination. The variables finally selected by this series of operations were regarded as the main variables contributing to the discrimination of whether an utterance was a lie or not.
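A sketch of this selection-plus-LDA procedure is given below, using scikit-learn as a stand-in for the statistics software actually used (which the paper does not name); `X` is the matrix of Table 1 variables per utterance and `y` marks lie utterances. The backward-elimination step is only indicated by a comment.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import f_classif

def select_and_fit(X, y, names, corr_thresh=0.8):
    """X: (n_utterances, n_variables); y: 1 for lie utterances, 0 otherwise."""
    F, _ = f_classif(X, y)                       # per-variable F values
    corr = np.corrcoef(X, rowvar=False)
    keep = list(range(X.shape[1]))
    # From each highly correlated pair, drop the variable with the lower F value
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if abs(corr[i, j]) >= corr_thresh:
                drop = i if F[i] < F[j] else j
                if drop in keep:
                    keep.remove(drop)
    # (Backward elimination over the remaining variables would follow here.)
    lda = LinearDiscriminantAnalysis().fit(X[:, keep], y)
    return lda, [names[k] for k in keep]
```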

Results

We conducted two experiments: experiment 1 and experiment 2. Three persons participated as a triad and the experiment was conducted in the following manner.

1. Each person of a triad sat in front of a half-mirror box.
2. The experimenter briefly provided instructions on the rules and strategies of the game.
3. The experimenter dealt a card to each participant through software.
4. Each participant communicated, through their monitor, in order to make the other participants quit, stay in the game, or change their cards.
5. Each participant showed his/her card to the others through software, after deciding whether to stay in or quit the game.
6. After the winner was decided, the losers paid points to the winner.

Steps 3–6 constituted one trial, which was repeated about 20 times. The experiment was conducted again with the same participants a month after the first experiment.

About 10% of the utterances contained measuring errors, which still left enough data to apply the above-mentioned method of discriminating lies. The results are shown in Table 2.

Table 2 Results of discriminant analysis

“Participant 1” means the participant measured in experiment 1, and “Participant 2” means the participant measured in experiment 2. “Discrimination ratio (first experiment)” refers to the first run of each experiment. “Discrimination ratio (second experiment)” refers to the run conducted again with the same participants a month after the first. “Discrimination ratio (using the discriminant function of a suitable situation)” is the ratio obtained when the discriminant function derived from the data of the first experiment was used to classify the data set of the second experiment as an unknown data set.

In all the results, both gaze and prosody variables were always included among the finally selected main variables, which implies that it is necessary to observe diverse nonverbal information. The finally selected variables and their contributions differed between the first experiment and the second one. If there had been no consistency in the contributions at all, we would not have been able to discriminate lies with relatively high discrimination ratios. However, the values of “Discrimination ratio (using the discriminant function of a suitable situation)” in Table 2 were higher than chance in spite of the month-long interval between the first and second experiments. This suggests that the discriminant functions were almost the same in the two experiments.

The average discrimination ratio in Table 2 was 72%. According to Miller and Stiff (1993), the proportion of correct answers was at most 70% when trained people judged whether an utterance was a lie or not. This result therefore shows that, by using diverse nonverbal information, we could discriminate lies almost as accurately as trained people do.

Table 3 shows the results of discriminating lies by using the variables of a single modality, i.e., either the three gaze variables or the six prosody variables. In some results in Table 3, the discrimination ratios of both “lie utterance” and “other utterance” were lower than those in Table 2. In other results, the discrimination ratio of one of the two classes was high while that of the other was low. Table 4 shows the results of applying the single-modality discriminant function derived from the data of the first experiment to the data of the second experiment as an unknown data set. The tendency found in Table 3 is also seen in Table 4. These results show that it is necessary to pay attention to multimodal nonverbal information.

Table 3 Results of discriminant analysis using variables of a single modality
Table 4 Discrimination ratios for the discriminant function of variables of a single modality

These results come from only a few experiments, which is not enough to establish the robustness of our method. In addition, the nonverbal information elicited by the system was sometimes missing, depending on the participants’ behavior during measurement. We are now conducting experiments to clarify the robustness of our method in this setup.

Discussion

Many researchers have studied deception detection using variables of a single modality. For example, Fukuda presented data suggesting that the temporal distribution of blinks during a dual-modality attention-focusing task can be useful in detecting deception (Fukuda 2000). Based on his results, he suspects that, when subjects are attending to visual stimuli, the presentation of relevant auditory information should lead the blink rate to peak after the processing of such stimuli. However, his method cannot be applied directly to actual communication, since actual communication is not controlled like a dual-modality attention-focusing task: many factors unrelated to the communication cause eye blinks. In the results shown in Table 2, the finally selected variables and their contributions differed between the first experiment and the second one, which shows that the expression of nonverbal information is changeable. Moreover, no facial expression variable appeared among the finally selected variables in Table 2, even though this variable had been selected in our previous work; this also shows that the expression of nonverbal information is changeable. Therefore, if we paid attention to only a single modality, we could not reliably detect deception in actual communication. In fact, the discrimination ratios in Table 3 were less consistent than those in Table 2, and the ratio of either “lie utterance” or “other utterance” was low. The results in Table 4 likewise show less consistency for the variables of a single modality between the first and second experiments than those in Table 2. Hence, we suggest that it is necessary to pay attention to multimodal nonverbal information when discriminating lies in communication.

To enable fully automatic lie discrimination, we have to automate three steps of our method: the first is the personal calibration in the gaze detection stage, the second is interpreting the situation, such as the context of the conversation and the relationships among the speakers, and the third is classifying utterances into lies or others.

The problem of personal calibration seems comparatively easy to solve. Our system needs personal calibration for accurate measurement, but the number of necessary calibration points is small (at least two). Therefore, we should be able to develop a personal calibration technique without active selection, as Ohno (2002) argued.

On the other hand, it is difficult to interpret a situation and to classify utterances since it is necessary to consider the meanings of utterances.

The results suggest that the selected variables changed with the situation and the individual. For example, participants’ behavior differed depending on whether they told a lie reactively or deliberately. It is thus important to interpret the situation in order to discriminate lies and general intentions. However, interpreting a situation requires understanding the meanings of utterances, which is difficult to do automatically. If an observer can speculate on another’s state of mind by using nonverbal information, he or she may be able to speculate on the situation indirectly.

Classifying utterances involves two different problems. The first is that people can deceive even without telling a lie. We focused on the prototypical lie, as mentioned in Sect. 1 (Coleman and Kay 1981), but there are ways to deceive people besides telling a prototypical lie; for example, people may be deceived when the truth is spoken as if it were a lie. Our method can probably discriminate this type of deception if appropriate data are given for building the discriminant function. In some cases, however, we cannot prepare such data, and our method can be applied only when an appropriate data set is available.

The second is how to classify utterances into “lies” or “others” automatically. We can classify utterances once their meanings are compared with the facts; however, it is difficult to do so automatically. This is an important topic for fully automatic discrimination of lies and general intentions.

Conclusions

We made a real-time system that measures gaze direction and facial features as a means of detecting lies. The system satisfies three conditions for measuring nonverbal information in natural communication: (1) it can measure both the gaze direction and facial features, (2) it allows the user’s head position and orientation to move to some extent, and (3) it obviates the need to put markers on the face or to make a model of the face manually before measuring. We used our system to detect lies in experiments in which participants could tell a lie spontaneously and intentionally, if necessary, in a situation similar to actual communication. We then suggested that paying attention to multimodal nonverbal information is necessary for discriminating lies in situations similar to actual communication. This research is the first step in developing a method of automatic lie detection using nonverbal information.

In the future, we would like to enable the system to carry out fully automatic detection of lies. For this purpose, we will have to solve the problems related to personal calibration in the gaze detection stage, interpreting the situation, and labeling utterances as “a lie” or “not a lie.”