
9.1 Introduction

Speech and image processing technologies have improved to the point that their targets now include natural human-human behaviors, which are made without awareness of interface devices. Examples of this direction include meeting capture [1] and conversation analysis [2]. We have conducted the CREST project, which focused on conversations in poster sessions, hereafter referred to as poster conversations [3, 4]. Poster sessions have become the norm in many academic conventions and open laboratories because of their flexible and interactive characteristics. In most cases, however, paper posters are still used, even in ICT areas. In some cases, digital devices such as LCDs and PC projectors are used, but they are not equipped with sensing devices. Currently, many lectures at academic events are recorded and distributed via the Internet, but recording of poster sessions is hardly ever done or even attempted.

Poster conversations have a mixture of the characteristics of lectures and meetings; typically, a presenter explains his/her work to a small audience using a poster, and the audience gives feedback in real time by nodding and verbal backchannels, and occasionally asks questions and makes comments. The conversations are interactive and also multi-modal in that the participants are standing and can move around, unlike in meetings. Another advantage of poster conversations is that we can easily arrange a data-collection setting which is controlled in terms of familiarity with the topics and the other participants and yet is “natural and real”.

The goal of the project is signal-level sensing and high-level analysis of human interactions. Specific tasks include face detection, eye-gaze detection, speech separation, and speaker diarization. These will realize a new indexing scheme for poster session archives. For example, after a long session of poster presentation, we often want a short review of the question-and-answer exchanges and the feedback from the audience.

Fig. 9.1

Overview of multi-modal interaction analysis

As opposed to the conventional “content-based” indexing approach, which focuses on the presenter’s speech through speech recognition and natural language analysis, we adopt an “interaction-oriented” approach which looks into the audience’s reactions. Specifically, we focus on non-linguistic behaviors such as backchannels, nodding and eye-gaze information, because the audience understands the key points of the presentation better than current machines do. An overview of the proposed scheme is depicted in Fig. 9.1.

We have designed and implemented a research platform for multi-modal sensing and analysis of poster conversations. From the audio channel, utterances as well as laughter and backchannels are detected. Eye-gaze and nodding are also detected using video and motion sensing devices. Special devices such as a motion-capturing system and eye-tracking recorders are used for ground-truth annotation, but only video cameras and distant microphones are used in the practical system.

We also investigate high-level indexing of which segments were attractive and/or difficult for the audience to follow. This will be useful in speech archives because people would be interested in listening to the points other people liked. However, estimating the interest and comprehension level is apparently difficult and largely subjective. Therefore, we turn to speech acts which are observable and presumably related to these mental states. One is prominent reactive tokens signaled by the audience, and the other is questions raised by them. Prediction of these speech acts from multi-modal behaviors is expected to approximate the estimation of the interest and comprehension level. The scheme is depicted in Fig. 9.2.

Fig. 9.2

Proposed scheme of multi-modal sensing and analysis

9.2 Overview of System and Corpus

9.2.1 Smart Posterboard System

We have designed and implemented a smart posterboard, which can record a poster session and sense human behaviors. Since it is not practical to ask every participant to wear special devices such as a head-set microphone and an eye-tracking recorder, or to set up devices attached to a particular room, all sensing devices are attached to the posterboard, which is actually a 65-in. LCD screen. Specifically, the digital posterboard is equipped with a 19-channel microphone array on the top and has six cameras and two Kinect sensors attached. The appearance of the smart posterboard is shown in Fig. 9.3. A more lightweight and portable system is realized by using only the Kinect sensors, which capture audio and video signals.

Fig. 9.3

Appearance of the smart posterboard

9.2.2 Multi-modal Corpus of Poster Conversations

We have recorded a number of poster conversations for multi-modal interaction analysis [3, 5]. In each session, one presenter (labeled “A”) prepared a poster on his/her own academic research, and there was an audience of two persons (labeled “B” and “C”), standing in front of the poster and listening to the presentation. Each poster was designed to introduce the research topics of the presenter to researchers or students in other fields. The audience subjects were not familiar with the presenter and had not heard the presentation before. The duration of each session was 20–30 min. Some presenters gave their presentation in two sessions, but to a different audience each time.

All speech data were segmented into IPUs (inter-pausal units) and sentence units with time and speaker labels, and were transcribed according to the guidelines of the Corpus of Spontaneous Japanese (CSJ) [6]. Fillers, laughter and verbal backchannels were also manually annotated. While fillers are usually followed by utterances by the same speaker, backchannels are uttered on their own.

For the ground-truth annotation, special multi-modal sensing devices such as a motion-capturing system were used, and every participant wore a wireless head-set microphone and an eye-tracking recorder or a magnetometric sensor. In the early phase of the project, eye-gaze information was derived from the eye-tracking recorder and the motion-capturing system by matching the gaze vector against the positions of the other participants and the poster, but their calibration and post-processing are very time-consuming. In the later phase of the project, the magnetometric sensors were therefore adopted to estimate head orientation instead of the precise eye-gaze.

9.2.3 Detection of Participants’ Eye-Gaze and Speech

This section explains the detection of the participants and their multi-modal feedback behaviors such as eye-gaze and speech using the smart posterboard (green lines in Fig. 9.2). It is realized with multi-modal information processing, as shown in Fig. 9.4, and is briefly explained in the following subsections.

Fig. 9.4

Process flow of multi-modal sensing

9.2.3.1 Face and Eye-Gaze Detection

Kinect sensors are used to detect the participants’ faces and their eye-gaze. As it is difficult to detect the eyeballs at the Kinect’s resolution, the eye-gaze is approximated by the head orientation. A preliminary analysis using the eye-tracking recorder showed that the difference between the actual eye-gaze and the head orientation is 10\(^{\circ }\) on average, but it is much smaller when the participants look at the poster. The process of face and head orientation detection is as follows [7]:

  1. Face detection

    Haar-like features are extracted from the color and ToF (Time-of-Flight) images to detect the face of the participants. Multiple persons can be detected simultaneously even if they move around.

  2. Head model estimation

    For each detected participant, a three-dimensional shape and colors of the head are extracted from the ToF image and the color image, respectively. Then, a head model is defined with the polygon and texture information.

  3. Head tracking

    Head tracking is realized by fitting the video image into the head model. A particle filter is adopted to track the three-dimensional position of the head and its three-dimensional orientation.

  4. Identification of eye-gaze object

    From the six-dimensional parameters, an eye-gaze vector is computed in the three-dimensional space. The object of the eye-gaze is determined by this vector and the position of the objects. In this study, the eye-gaze object is limited to the poster and other participants.

The entire process mentioned above can be run in real time by using a GPU for tracking each person.
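To make the last step concrete, the following is a minimal Python sketch of how the eye-gaze object could be identified from the tracked head pose. It is an illustration under assumptions: the function name, the 15\(^{\circ }\) acceptance threshold and the object coordinates are hypothetical, not part of the actual implementation.

```python
import numpy as np

def gaze_target(head_pos, head_dir, objects, max_angle_deg=15.0):
    """Pick the object closest to the estimated gaze ray.

    head_pos -- 3-D head position from the tracker
    head_dir -- head orientation vector (used as the gaze approximation)
    objects  -- dict mapping an object name to a representative 3-D point
    Returns the object whose direction deviates least from the gaze ray,
    or None if every deviation exceeds max_angle_deg.
    """
    head_dir = np.asarray(head_dir, dtype=float)
    head_dir = head_dir / np.linalg.norm(head_dir)
    best, best_angle = None, max_angle_deg
    for name, pos in objects.items():
        to_obj = np.asarray(pos, dtype=float) - head_pos
        to_obj = to_obj / np.linalg.norm(to_obj)
        angle = np.degrees(np.arccos(np.clip(head_dir @ to_obj, -1.0, 1.0)))
        if angle < best_angle:
            best, best_angle = name, angle
    return best

# Hypothetical frame: a participant looking toward the poster plane.
objects = {"poster": [0.0, 1.2, 1.5], "audience_B": [1.0, 1.6, 0.5]}
print(gaze_target(np.array([0.0, 1.6, 0.0]), [0.0, -0.25, 1.0], objects))
```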

9.2.3.2 Detection of Nodding

Nodding can be detected as a movement of the head, whose position is estimated in the above process. However, discriminating nods from noisy or unconscious movements is still difficult. Therefore, nodding is not used in most of this study.

9.2.3.3 Speech Separation and Speaker Diarization

Speech separation and enhancement are realized with the blind spatial subtraction array (BSSA), which consists of a delay-and-sum (DS) beamformer and a noise estimator based on independent component analysis (ICA) [8]. Here, the position information of the participants estimated by the image processing is used for beamforming and for initialization of the ICA filter estimation. This is one of the advantages of multi-modal signal processing. As the participants move around, the filter estimation is updated online.
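As a rough illustration of the front end only, the sketch below implements a plain frequency-domain delay-and-sum beamformer for one analysis frame; the ICA-based noise estimation of BSSA is omitted, and the linear array geometry, sampling rate and far-field assumption are simplifications made for the example.

```python
import numpy as np

def delay_and_sum(frame, mic_x, theta, fs=16000, c=343.0):
    """Frequency-domain delay-and-sum beamformer for one windowed frame.

    frame -- array (n_mics, frame_len), one frame per microphone
    mic_x -- microphone positions on a linear array [m]
    theta -- assumed source direction in radians (0 = broadside)
    """
    n_mics, frame_len = frame.shape
    tau = np.asarray(mic_x) * np.sin(theta) / c                 # propagation delays [s]
    spec = np.fft.rfft(frame, axis=1)                           # (n_mics, n_bins)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    align = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])  # delay compensation
    return np.fft.irfft(np.mean(align * spec, axis=0), n=frame_len)
```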

When the 19-channel microphone array is used, speech separation and enhancement can be performed with a high SNR, but not in real time. Using the Kinect sensor realizes real-time processing, but degrades the quality of speech.

Through this process, the audio input is separated into the presenter’s and the audience’s speech. Although speakers within the audience are not discriminated from each other, DOA (Direction of Arrival) estimation can be used to identify the speaker among the audience. In the baseline system, simple voice activity detection (VAD) is conducted on each of the two channels using power and spectral information in order to perform speaker diarization. We can use highly enhanced but distorted speech for VAD, while keeping moderately enhanced and intelligible speech for playback.
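The VAD itself can be as simple as an energy threshold on each separated channel. The sketch below is such a stand-in (the frame sizes, threshold and noise-floor estimate are arbitrary choices), not the actual detector used in the system.

```python
import numpy as np

def simple_vad(signal, fs=16000, frame_len=0.025, hop=0.010, thresh_db=15.0):
    """Mark frames whose log energy exceeds a crude noise floor by thresh_db."""
    n, h = int(frame_len * fs), int(hop * fs)
    frames = np.lib.stride_tricks.sliding_window_view(signal, n)[::h]
    log_e = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_floor = np.percentile(log_e, 10)      # assume the quietest 10 % is noise
    return log_e > noise_floor + thresh_db      # boolean speech/non-speech labels
```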

In Sect. 9.4, a more elaborate speaker diarization method is addressed by combining multi-channel audio input and eye-gaze information of the participants.

9.3 Prediction of Turn-Taking from Multi-modal Behaviors

Turn-taking in conversations is a natural behavior in human activities. Studies on turn-taking have conventionally focused on dyadic conversations between two persons. While a number of studies have analyzed turn-taking patterns [9–12], some studies investigated prediction mechanisms based on machine learning for a dialogue system to take or yield turns [13–16]. Some studies even attempt to evaluate the synchrony of dialogue [17, 18].

Recently, conversational analysis and modeling have been extended to multi-party interactions such as meetings and free conversations among more than two persons. Turn-taking in multi-party interactions is more complicated than in the dyadic dialogue case, in which a long pause suggests yielding the turn to the (only) partner. Predicting to whom the turn is yielded, or who will take the turn, is important for an intelligent conversational agent handling multiple partners [19, 20], as well as for an automated system that beamforms microphones or zooms cameras in on the speakers. Studies on computational modeling of turn-taking in multi-party interactions are still very limited. Laskowski et al. [21] presented a stochastic turn-taking model based on N-grams for the ICSI meeting corpus. Jokinen et al. [22] investigated the use of eye-gaze information for predicting turn-holding or turn-yielding in three-party conversations.

This section deals with turn-taking behaviors in poster sessions. Conversations in poster sessions differ from the meetings and free conversations addressed in the previous works in that presenters hold most of the turns, and thus the amount of utterances is very unbalanced. However, the segments of the audience’s questions and comments are more informative and should not be missed, so prediction of such events is important for online applications such as automated recording control and conversational agents. Therefore, the goal of this work is to predict turn-taking by the audience in poster conversations and, if that happens, which person in the audience will take the turn to speak.

We approach this problem by combining multi-modal information sources. While most of the aforementioned studies focused on prosodic features of the current speaker, it is widely known that eye-gaze information plays a significant role in turn-taking [23], and the works by Jokinen [22] and by Bohus [19] exploited that information in their modeling. The existence of posters, however, requires different modeling in poster conversations, as the eye-gaze of the participants is focused on the poster most of the time. This is also true of other kinds of interactions that use materials such as maps and computers. Several parameterizations of eye-gaze patterns, including the poster object, are investigated to find effective features related to turn-taking. Moreover, backchannel information such as nodding and verbal reactions by the audience is also incorporated.

In this study, four poster sessions are used. For the majority of the presenter’s (“A”) utterances (IPUs), the turn was held by the presenter himself/herself. The ratio of turn-taking by the audience (either “B” or “C”) is only 11.9 %. In this work, therefore, prediction of turn-taking is formulated as a detection problem rather than a classification problem. The evaluation measures should be the recall and precision of turn-taking by the audience, not the classification accuracy of turn-holding and turn-yielding by the presenter. This is consistent with the goal of the study.

9.3.1 Analysis on Eye-Gaze and Backchannel Features in Turn-Taking

First, the statistics of eye-gaze and backchannel events are investigated with respect to their relationship with turn-taking by the audience.

9.3.1.1 Distribution of Eye-Gaze

The object of the eye-gaze of all participants is identified at the end of the presenter’s utterances. The target object can be either the poster or the other participants. The statistics are shown in Fig. 9.5 in relation to the turn-taking events. It is observed that the presenter is more likely to gaze at a person in the audience right before yielding the turn to him/her. We can also see that the person who takes the turn is more likely to gaze at the presenter, but in this case the ratio of turn-yielding by the presenter is not higher than the average over the entire data set.

The duration of the eye-gaze is also measured. It is measured within the segment of 2.5 s before the end of the presenter’s utterances, because the majority of the IPUs are shorter than 2.5 s. The results are listed in Table 9.1 in relation to the turn-taking events. We can see that the presenter gazed at the person significantly longer right before yielding the turn to him/her than in the other cases. However, there is no significant difference in the duration of the eye-gaze by the audience according to the turn-taking events.

Fig. 9.5

Statistics of eye-gaze and its relationship with turn-taking (ratio)

Table 9.1 Duration of eye-gaze and its relationship with turn-taking (s)

9.3.1.2 Joint Eye-Gaze Events

Next, joint eye-gaze events by the presenter and the audience are defined as shown in Table 9.2. In this table, the notation “audience” is used, but these events are actually defined for each person in the audience. Thus, “Ii” means mutual gaze between the presenter and a particular person in the audience, and “Pp” means joint attention to the poster object.

Statistics of these events at the end of the presenter’s utterances are summarized in Table 9.3. Here, the counts of the events are summed over the two persons in the audience. They are classified according to the turn-taking events, and turn-taking by the audience is classified into two cases: the person involved in the eye-gaze event actually took the turn (self), or the other person took the turn (other). It is confirmed that the joint gaze at the poster is the most dominant event (around 80 %) in poster conversations. The mutual gaze (“Ii”) is expected to be related to turn-taking, but its frequency is not so high. The frequency of “Pi” is not high, either. The most potentially useful event is “Ip”, in which the presenter gazes at the person in the audience before giving the turn. This is consistent with the observation in the previous subsection.

Table 9.2 Definition of joint eye-gaze events by presenter and audience
Table 9.3 Statistics of joint eye-gaze events by presenter and audience in relation with turn-taking (ratio of occurrence frequency)

9.3.1.3 Dynamics of Eye-Gaze

In the analysis of the previous subsections, gaze information from the audience is not so clearly related to turn-taking. A person in the audience might send a signal to the presenter by gazing at him/her to indicate the wish to take a turn, but turn-taking actually happens when the presenter looks back at him/her. To confirm this, the dynamic patterns of the eye-gaze events are investigated with a window of 2.5 s over the 10 s before the end of the presenter’s utterances. As a result, we observe a tendency that the frequency and duration of “Ii” and “Ip” increase toward the end of the utterances, while “Pi” appears relatively longer in the segment 5 s before the end of the utterances. This indicates that “Pi” is followed by “Ii” or “Ip”, and suggests that bigram information of the eye-gaze events may be useful when a larger amount of data is available.

9.3.1.4 Backchannels

Verbal backchannels, typically “hai” in Japanese and “yeah” or “okay” in English, indicate that the listener understands what is being said. Nodding is regarded as a non-verbal backchannel, and it is more frequently observed in poster conversations than in simple spoken dialogue.

The occurrence frequencies of these events are counted within the segment of 2.5 s before the end of the presenter’s utterances. They are shown in Fig. 9.6 according to the joint eye-gaze events. It is observed that the person in the audience who takes the turn (= the turn-taker) made more backchannels, both verbal and non-verbal, and the tendency is more apparent in the particular eye-gaze events “Ii” and “Ip”, which are closely related to the turn-taking events.

Fig. 9.6

Statistics of backchannels and their relationship with turn-taking (occurrence frequency)

9.3.2 Prediction of Turn-Taking by Audience

Based on the analysis in the previous subsection, features for predicting turn-taking by the audience are parameterized. The prediction task is divided into two sub-tasks: detection of speaker change and identification of the next speaker. In the first sub-task, we predict whether the turn is yielded from the presenter to (someone in) the audience; if that happens, we then predict who in the audience takes the turn in the second sub-task. Note that these predictions are made at every end-point of the presenter’s utterances (IPUs) using only the information available prior to the speaker change or the utterance by the new speaker.

Prediction experiments were conducted based on machine learning using the data set in a cross-validation manner; one session is tested using the classifier trained on the other sessions, and this process is repeated while changing the training and testing sets.

9.3.2.1 Prediction of Speaker Change

For the first sub-task, prosodic features are adopted as a baseline, following previous works (e.g. [16, 22]). Specifically, F0 (mean, max, min, and range) and power (mean and max) of the presenter’s utterance are computed prior to the prediction point. Each feature is normalized per speaker by taking the z-score: the mean of the corresponding speaker is subtracted and the result is divided by that speaker’s standard deviation.

Backchannel features are defined by taking occurrence counts prior to the prediction point for each type (verbal backchannel and non-verbal nodding).

Eye-gaze features are defined as below:

  1. Eye-gaze object

    For the presenter, (P) poster or (I) audience;

    For (anybody in) the audience, (p) poster, (i) presenter, or (o) other person in the audience.

  2. Joint eye-gaze event: “Ii”, “Ip”, “Pi”, “Pp”

    These can happen simultaneously for multiple persons in the audience, but only one is chosen by the priority order listed above.

  3. Duration of item 1 above ((I) and (i))

    A maximum is taken over persons in the audience.

  4. Duration of item 2 above (except “Pp”)

Note that these parameters can be extended to any number of persons in the audience, although only two persons were present in this data set.

Support vector machines (SVMs) and a logistic regression (MaxEnt) model are used for machine learning, and they show comparable performance. The results with SVM are listed in Table 9.4. Here, recall, precision and F-measure are computed for speaker change, i.e. turn-taking by the audience. This case accounts for only 11.9 % of the data, and its prediction is a very challenging task, while we can easily obtain an accuracy of over 90 % for prediction of turn-holding by the presenter. We are particularly concerned with the recall of speaker change, considering the nature of the task and the application scenarios.
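A hedged sketch of this kind of setup with scikit-learn is shown below. The feature matrix, labels and session grouping are randomly generated placeholders, and the RBF kernel and class weighting are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

# X: one row per presenter IPU (prosodic, eye-gaze and backchannel features);
# y: 1 if the turn was taken by the audience, 0 if the presenter held it;
# groups: session id of each IPU, used for leave-one-session-out validation.
rng = np.random.default_rng(0)                  # stand-in data for illustration
X = rng.normal(size=(400, 12))
y = (rng.random(400) < 0.12).astype(int)        # ~12 % turn-taking, as in the corpus
groups = np.repeat(np.arange(4), 100)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
pred = cross_val_predict(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
p, r, f, _ = precision_recall_fscore_support(y, pred, average="binary",
                                             zero_division=0)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```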

Among the individual features, as shown in Table 9.4, the prosodic features obtain the best recall, while the eye-gaze features achieve the best precision and F-measure. In the table, the combination of all four kinds of eye-gaze parameterization listed above is adopted; however, using only one of them is sufficient, and there is no significant difference in performance among them. The combination of the prosodic and eye-gaze features is effective in improving both recall and precision. On the other hand, the backchannel features give the lowest performance, and their combination with the other features is not effective, resulting in degraded performance.

Table 9.4 Prediction result of speaker change

9.3.2.2 Prediction of Next Speaker

Predicting the next speaker in a multi-party conversation (before he/she actually speaks) is also a challenging task, and it has not been addressed in previous work. For this sub-task, the prosodic features of the current speaker are not usable because they carry no information suggesting to whom the turn will be yielded. Therefore, the backchannel and eye-gaze features described in the previous subsection are adopted, but they are computed for each individual person in the audience, instead of taking the maximum or selecting among them.

In this experiment, SVM performs slightly better than the logistic regression model, so the prediction accuracy obtained with SVM is listed in Table 9.5. As there are only two persons in the audience, random selection would give an accuracy of 50 %.

The simple eye-gaze features focused on the prediction point (1 and 2) obtain an accuracy only slightly better than the chance rate, but incorporating the duration information (3) significantly improves the accuracy. In this experiment, the backchannel features have some effect: the person who made more backchannels is more likely to take the turn. By combining all features, the accuracy reaches almost 70 %.

Table 9.5 Prediction result of the next speaker

9.4 Speaker Diarization with Backchannel Detection Using Eye-Gaze Information

In the previous section, it was shown that eye-gaze information is useful for predicting turn-taking. Based on this finding, we investigate a new scheme of speaker diarization. Speaker diarization is the process of identifying “who spoke when” in multi-party conversations. A number of diarization methods [24, 25] have been investigated based on acoustic information. In real multi-party conversations, the diarization performance is degraded by adverse acoustic conditions such as background noise and distant talking. To mitigate this problem, some studies have tried to incorporate multi-modal information such as motion and gesture [12, 25].

Although it is known that eye-gaze information can be used to predict participants’ utterances, it has not been integrated in speaker diarization tasks. This section addresses a multi-modal diarization method which integrates eye-gaze information with acoustic information. The proposed method extracts acoustic and eye-gaze features, which are integrated in a stochastic manner to detect utterances.

Furthermore, the diarization results are refined by detecting the audience’s backchannels. Backchannels are frequently observed in poster conversations and involve different eye-gaze behaviors, since they indicate that the listener does not take a turn. Detection of backchannels is realized with the same multi-modal scheme but with a differently trained model. By eliminating the detected backchannels and noise from the diarization result, we can easily access meaningful utterances such as questions and comments, while the backchannels themselves indicate the interaction level of the conversation.

In this study, eight poster sessions are used. Since utterances by the audience are infrequent, it is difficult to detect them accurately. Moreover, the audience’s backchannels account for about 40 % of their utterance duration.

9.4.1 Multi-modal Speaker Diarization

9.4.1.1 MUSIC Method Using Microphone Array

Conventional speaker diarization methods have used Mel-Frequency Cepstral Coefficients (MFCCs) and Directions Of Arrival (DOA) of sound sources [24]. An acoustic baseline method in this study is based on sound source localization using DOAs derived from the microphone array.

To estimate a DOA, we adopt the MUltiple SIgnal Classification (MUSIC) method [26], which can detect multiple DOAs simultaneously. The MUSIC spectrum \(M_{t}(\theta )\) is calculated based on the orthogonality between the steering vectors of the sound sources and the noise subspace of the input signal. Note that \(\theta \) is the angle between the microphone array and the target of estimation, and t represents a time frame. The MUSIC spectrum represents DOA likelihoods, and a large spectrum value suggests that a participant is speaking from that angle. To calculate the spectrum, the number of sound sources needs to be determined. In this study, the number of sound sources is predicted with an SVM using the eigenvalue distribution of the spatial correlation matrix [27].
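For reference, a minimal narrow-band MUSIC computation for a linear array might look like the following sketch; the single frequency bin, the array geometry and the angle grid are simplifying assumptions for illustration.

```python
import numpy as np

def music_spectrum(X, mic_x, n_sources, freq, c=343.0):
    """Narrow-band MUSIC pseudo-spectrum over a grid of angles.

    X         -- array (n_mics, n_snapshots) of STFT coefficients at one frequency bin
    mic_x     -- microphone positions on a linear array [m]
    n_sources -- assumed number of sources (predicted by an SVM in the text)
    freq      -- centre frequency of the bin [Hz]
    """
    mic_x = np.asarray(mic_x, dtype=float)
    R = X @ X.conj().T / X.shape[1]                  # spatial correlation matrix
    _, eigvec = np.linalg.eigh(R)                    # eigenvalues in ascending order
    En = eigvec[:, : X.shape[0] - n_sources]         # noise subspace
    spectrum = []
    for theta in np.deg2rad(np.arange(-90, 91)):
        a = np.exp(-2j * np.pi * freq * mic_x * np.sin(theta) / c)  # steering vector
        denom = np.real(a.conj() @ En @ En.conj().T @ a)
        spectrum.append(np.real(a.conj() @ a) / max(denom, 1e-12))
    return np.asarray(spectrum)                      # peaks indicate likely DOAs
```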

The proposed method incorporates eye-gaze information into speaker diarization. The method first extracts acoustic and eye-gaze features and computes a probability of speech activity from each of them, and then combines the two probabilities for a frame-wise decision. The process is conducted independently for every time frame t and each participant i.

The acoustic features are calculated based on the MUSIC spectrum. We can use the ith participant’s head location \(\theta _{i,t}\) tracked by the Kinect sensors. The possible location of the participant is constrained within a certain range (\(\pm \theta _B\)) around the detected location \(\theta _{i,t}\). The acoustic features of the ith participant in time frame t consist of the MUSIC spectrum within this range:

$$\begin{aligned} \mathbf {a}_{i,t} = \left[ M_{t} \left( \theta _{i,t} - \theta _B \right) , {\scriptstyle \cdots }, M_{t} \left( \theta _{i,t} \right) , {\scriptstyle \cdots } , M_{t} \left( \theta _{i,t}+ \theta _B \right) \right] ^T {.\,\,} \end{aligned}$$
(9.1)

9.4.1.2 Eye-Gaze Features

The eye-gaze features for the ith participant \(\mathbf {g}_{i,t}\) are the same as those used in Sect. 9.3.2.1, except that unigrams and bigrams of the eye-gaze objects and the joint eye-gaze events are added.

9.4.1.3 Integration of Acoustic and Eye-Gaze Information

The acoustic features \(\mathbf {a}_{i,t}\) are integrated with the eye-gaze features \(\mathbf {g}_{i,t}\) to detect the ith participant’s speech activity \(v_{i,t}\) in time frame t. Note that the speech activity \(v_{i,t}\) is binary: speaking (\(v_{i,t}=1\)) or not speaking (\(v_{i,t}=0\)). Here, linear interpolation is adopted to combine the probabilities independently computed from the two feature sets [25]:

$$\begin{aligned} f_{i,t}(\mathbf {a}_{i,t},\mathbf {g}_{i,t}) = \alpha ~ p (v_{i,t}=1 | \mathbf {a}_{i,t}) + ~ (1 - \alpha ) ~ p (v_{i,t}=1 | \mathbf {g}_{i,t} ) ~ {.\,\,} \end{aligned}$$
(9.2)

Here \(\alpha \in [0,1]\) is a weight coefficient. Each probability is computed by a logistic regression model. It is also possible to combine the two feature sets in the feature domain and directly compute a posterior probability \(p(v_{i,t}|\mathbf {a}_{i,t}, \mathbf {g}_{i,t})\). Compared with this joint model, the linear interpolation model has the merit that the training data do not have to be aligned between the acoustic and eye-gaze features, because the two discriminative models are independent. Furthermore, the weight coefficient \(\alpha \) can be determined appropriately according to the acoustic environment, characterized for example by the signal-to-noise ratio (SNR). Here, it is estimated using the entropy h of the acoustic posterior probability \(p (v_{i,t} | \mathbf {a}_{i,t})\) [28] as

$$\begin{aligned} \alpha = \alpha _c \cdot \frac{1-h}{1-h_c} ~ {,\,\,} \end{aligned}$$
(9.3)

where \(h_c\) and \(\alpha _c\) are the entropy and the ideal weight coefficient in a clean acoustic environment, respectively. When the estimated weight coefficient is larger than one or less than zero, it is clipped to one or zero, respectively. For online processing, the coefficient is updated periodically.
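A small sketch of this late fusion (Eqs. 9.2 and 9.3) is given below; the binary entropy is used for h, and the values of \(\alpha _c\) and \(h_c\) are placeholders rather than the values used in the experiments.

```python
import numpy as np

def fuse_probabilities(p_audio, p_gaze, alpha_c=1.0, h_c=0.1):
    """Late fusion of acoustic and eye-gaze speech-activity probabilities.

    p_audio, p_gaze -- posteriors P(v=1 | features) from the two logistic
                       regression models for one participant and time frame.
    """
    p = float(np.clip(p_audio, 1e-6, 1 - 1e-6))
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # binary entropy of p_audio
    alpha = alpha_c * (1.0 - h) / (1.0 - h_c)          # Eq. (9.3)
    alpha = float(np.clip(alpha, 0.0, 1.0))            # clip to [0, 1]
    return alpha * p_audio + (1.0 - alpha) * p_gaze    # Eq. (9.2)

# An uncertain acoustic posterior (noisy audio) shifts the weight to eye-gaze.
print(fuse_probabilities(0.55, 0.90))
```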

9.4.1.4 Speaker Diarization Experiment

Logistic regression models were trained separately for the presenter and the audience by cross-validation over the eight sessions. In order to evaluate the performance under ambient noise, audio data were prepared by superimposing diffuse noise recorded in a crowded place. The SNRs were set to 20, 15, 10, 5 and 0 dB. In real poster conversations held at academic conventions, the SNR is expected to be around 0 to 5 dB.

The multi-modal method is compared with other methods listed below:

  1. baseline MUSIC [29]

This method conducts peak tracking of the MUSIC spectrum and GMM-based clustering in the angle domain. Each cluster corresponds to one participant. This method does not use any cues from visual information.

  2. baseline + location constraint [30]

    This method also performs peak tracking of the MUSIC spectrum, and compares the detected peak with the estimated head location within the \(\pm \theta _B\) range. If this constraint is not met, the hypothesis is discarded.

  3. acoustic-only model

    This method fixes the weight coefficient \(\alpha \) to 1 in Eq. (9.2), and uses only the acoustic information.

As the evaluation measure, the Diarization Error Rate (DER) [31] is used in this experiment. The DER is computed from False Acceptance (FA), False Rejection (FR), and Speaker Error (SE) as below:

$$\begin{aligned} {DER} = \frac{\#{FA} + \#{FR} + \#{SE}}{\#{S}} {,\,\,} \end{aligned}$$
(9.4)

where \(\#{S}\) is the number of speech frames in the reference data.
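For concreteness, a frame-level computation of Eq. (9.4) could be sketched as follows, assuming per-frame reference and hypothesis speaker labels are available (0 denotes silence).

```python
import numpy as np

def diarization_error_rate(ref, hyp):
    """Frame-level DER (Eq. 9.4); 0 means silence, positive values are speaker ids."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    fa = np.sum((ref == 0) & (hyp > 0))                 # false acceptance
    fr = np.sum((ref > 0) & (hyp == 0))                 # false rejection
    se = np.sum((ref > 0) & (hyp > 0) & (ref != hyp))   # speaker error
    return (fa + fr + se) / np.sum(ref > 0)

ref = [0, 1, 1, 1, 2, 2, 0, 0]
hyp = [0, 1, 2, 1, 2, 0, 1, 0]
print(f"DER = {diarization_error_rate(ref, hyp):.0%}")  # 3 errors / 5 speech frames
```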

Table 9.6 lists the DERs for each SNR. The two baseline methods (baseline MUSIC and baseline + location constraint) show lower accuracy because they are rule-based and not robust against dynamic changes of the MUSIC spectrum and the participants’ locations. Compared with the acoustic-only model, the proposed multi-modal model achieves higher performance under noisy environments (SNR = 5, 0 dB). Thus, we can see the effect of the eye-gaze information under the noisy environments expected in real poster sessions.

The weight coefficient \(\alpha \) in Eq. (9.2) was also tuned manually with a step size of 0.1. In the clean environment (SNR = \(\infty \) dB), the optimal weight was 1.0. On the other hand, in the noisy environments (SNR = 5 and 0 dB), the optimal weights were 0.6 or 0.5. These results suggest that the weight of the eye-gaze features should be increased in noisy environments. The average DER with manual tuning is 11.78 %, which is slightly better than the result (12.92 %) with the automatic weight estimation (Eq. 9.3). Therefore, the automatic weight estimation works reasonably well according to the acoustic environment.

Table 9.6 Evaluation of speaker diarization (DER [%])

9.4.2 Detection of Backchannels

The diarization result includes backchannels and also falsely accepted noise, especially for the audience’s utterances. A post-processing model is introduced to detect and eliminate them and thereby highlight questions and comments by the audience, which are important for efficient review of poster conversations. There have been few works on the detection of backchannels, while many studies have been conducted on predicting the appropriate timing of backchannels [32–35].

Backchannels suggest that the current speaker can hold the turn and the listener does not take a turn. In that sense, the associated eye-gaze behaviors are different from those of turn-taking. Thus, a different model is trained on the eye-gaze behaviors to detect backchannels. Here, the multi-modal scheme formulated in the previous subsection is modified. The eye-gaze features and the multi-modal integration model are the same, but the acoustic features are re-designed. The multi-channel acoustic signals are enhanced for each participant by delay-and-sum beamforming. The enhanced signal is used to calculate the following acoustic features:

  1. The number of time frames of the utterance segment calculated from the diarization result

  2. MFCC parameters (12 MFCCs and 12 \(\varDelta \)MFCCs)

  3. Power (and \(\varDelta \)Power)

  4. Regression coefficients of the fundamental frequency (F0) and power at the end of the preceding utterance [34]

Logistic regression models are trained to detect three events: backchannels, utterances other than backchannels, and noise. For each utterance segment obtained from the speaker diarization, cumulative likelihoods are calculated by the three models and are normalized so that the three sum to one. Segments are then eliminated by thresholding the sum of the posterior probabilities of the backchannel and noise classes.
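This elimination step can be sketched as follows, assuming the three cumulative scores are already available for each segment; the column order and the threshold value are illustrative assumptions.

```python
import numpy as np

def filter_segments(segment_scores, threshold=0.5):
    """Keep segments that are not dominated by backchannels or noise.

    segment_scores -- array (n_segments, 3) of cumulative likelihoods from the
                      backchannel, substantive-utterance and noise models
                      (column order assumed: [backchannel, utterance, noise]).
    Returns a boolean mask of the segments to keep.
    """
    scores = np.asarray(segment_scores, dtype=float)
    post = scores / scores.sum(axis=1, keepdims=True)   # normalize to sum to one
    reject = post[:, 0] + post[:, 2]                     # P(backchannel) + P(noise)
    return reject < threshold

scores = [[0.7, 0.2, 0.1],      # mostly backchannel -> eliminated
          [0.1, 0.8, 0.1]]      # substantive utterance -> kept
print(filter_segments(scores))  # [False  True]
```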

The diarization result is post-processed with this additional model to eliminate backchannels and noise. The reference labels in this experiment regard backchannels as non-speech events.

The following methods are compared. They were applied after the multi-modal speaker diarization (last row of Table 9.6).

  1. thresholding with utterance duration

The threshold in this method is applied to the duration of each utterance segment, since the duration of backchannels is usually shorter than that of other utterances. This corresponds to using only the first feature listed above.

  2. acoustic-only model

    This method uses the acoustic features listed above.

  3. multi-modal model

    This method also uses the eye-gaze features in addition to the acoustic features.

Here, we focus on substantial utterances by the audience for efficient access to the recordings. Since utterances other than backchannels rarely overlap, we measured the Equal Error Rate (EER), at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). FAR and FRR are defined as:

$$\begin{aligned} {FAR} = \frac{\#{FA}}{\#{NS}} {,\,\,} ~~~ {FRR} = \frac{\#{FR}}{\#{S}} {,\,\,} \end{aligned}$$
(9.5)

where \(\#{NS}\) is the number of non-speech frames in the reference. EER is calculated by varying the threshold in speaker diarization.
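A straightforward way to estimate the EER from frame-level detection scores is sketched below; the operating-point search actually used in the experiment may differ.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return the error rate at the threshold where FAR and FRR are closest.

    scores -- detection score per frame (higher = more likely speech)
    labels -- 1 for speech frames, 0 for non-speech frames in the reference
    """
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    best = None
    for t in np.unique(scores):
        accept = scores >= t
        far = np.sum(accept & (labels == 0)) / max(np.sum(labels == 0), 1)
        frr = np.sum(~accept & (labels == 1)) / max(np.sum(labels == 1), 1)
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2.0)
    return best[1]

print(equal_error_rate([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0]))  # 0.0
```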

Table 9.7 lists the EERs for each SNR. Compared to the case without post-processing (no post-processing), the proposed multi-modal model significantly reduces the EERs. This shows the effectiveness of eliminating backchannels and noise after speaker diarization. The simple thresholding method (thresholding with utterance duration) reduces the EERs in noisy conditions, but degrades performance in clean conditions; it is difficult to detect backchannels from the utterance duration alone. The effect of the eye-gaze features is also confirmed under noisy environments (SNR = 5, 0 dB).

Table 9.7 Evaluation of audience’s speech detection (EER [%])

9.5 Detection of Hot Spots via Prominent Reactive Tokens of Audience

This section addresses high-level indexing of poster conversations based on their interactive characteristics. As opposed to the conventional content-based approach, which focuses on the presenter’s speech, we focus on the audience’s reactions, specifically the audience’s reactive tokens and laughter. By reactive tokens (Aizuchi in Japanese), we mean the listener’s short verbal responses, which express his/her state of mind during the conversation. We particularly focus on prominent non-lexical reactive tokens, such as “hu:n” and “he:” in Japanese and “wow” and “gosh” in English, which are not used for simple acknowledgment and are presumably related to the listener’s state of mind. These can be articulated with a variety of prosodic patterns; for example, they can be prolonged to an arbitrary length.

It is assumed that the audience signals their interest level with these kinds of non-lexical reactive tokens, and that detection of the audience’s interest level is useful for indexing speech archives, because people would be interested in listening to the points other people were interested in. It is also presumed that people would be interested in the funny spots where laughter occurred. In this work, the spots which induced (or elicited) laughter and non-lexical reactive tokens are defined as hot spots, and their automatic detection is investigated.

In this study, eight poster sessions are used.

9.5.1 Detection of Laughter and Reactive Tokens

Detection of laughter has been addressed by several studies [36–38]. Typically, a dedicated classifier such as a GMM or an SVM is prepared to discriminate laughter from speech. On the other hand, studies on detecting reactive tokens are limited. Ward [39] investigated prosodic patterns of reactive tokens, but did not conduct automatic detection. Other works [40, 41] focused on the distinction between affirmative answers (“yes”) and tokens used as backchannels. In Japanese, there is a variety of syllabic patterns in reactive tokens, including both lexical and non-lexical tokens.

A framework for acoustic event detection in audio recordings of conversations is designed based on a combination of BIC-based segmentation and GMM-based classification [42]. For each segment, GMM-based classification is applied. GMMs are prepared for five classes: male speech, female speech, noise, laughter, and reactive tokens. Laughter is detected with this GMM-based classification.
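A minimal sketch of such GMM-based classification with scikit-learn is shown below; the number of mixture components, the diagonal covariances and the randomly generated “features” are assumptions made for illustration, and real input would be MFCC-like frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["male_speech", "female_speech", "noise", "laughter", "reactive_token"]

def train_gmms(features_by_class, n_components=8):
    """Fit one GMM per acoustic-event class on per-frame feature vectors."""
    return {c: GaussianMixture(n_components, covariance_type="diag",
                               random_state=0).fit(f)
            for c, f in features_by_class.items()}

def classify_segment(gmms, segment_features):
    """Assign a BIC-derived segment to the class with the highest
    average log-likelihood over its frames."""
    scores = {c: g.score(segment_features) for c, g in gmms.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
gmms = train_gmms({c: rng.normal(loc=i, size=(200, 13))
                   for i, c in enumerate(CLASSES)})
print(classify_segment(gmms, rng.normal(loc=3, size=(50, 13))))  # -> "laughter"
```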

Reactive tokens are more difficult to detect, because they are very similar to normal speech in terms of acoustic characteristics. Thus, we incorporate two additional processes to verify the candidate reactive tokens hypothesized by the GMM-based classification. One is a filled-pause detector, which exploits the monotonousness of the spectral and pitch patterns [43]. The other is a speech recognition system, which is used to filter out the filled pauses included in its lexicon. In summary, reactive tokens are detected only when supported by all of the following three classifiers:

  • dedicated GMM

  • filled pause detector (to reject normal speech)

  • speech recognizer (to reject fillers)

The detection accuracy of laughter and reactive tokens is shown in Table 9.8 with the evaluation measures of recall, precision and F-measure. Here, the F-measure is defined with a double weight on precision, because there are a number of indistinct instances of laughter and reactive tokens, which are hard to detect (lowering recall) and not useful for indexing.

As shown in Table 9.8, the overall recall is not high, but we can detect most of the distinct events such as loud laughter and long reactive tokens. These distinct events are more related to the hot spots than subtle events. The frame-wise classification accuracy among the five GMM classes is 82.3 %.

Table 9.8 Detection accuracy of laughter and reactive tokens

9.5.2 Subjective Evaluation of Detected Hot Spots

Based on the detected laughter and reactive tokens, hot spots are defined to correspond to these two kinds of events. Specifically, hot spots are labeled on the utterances which induce (or elicit) the events. The segments are defined by utterance units, i.e. they are made of a couple of utterances, with a maximum duration determined by a threshold.

Subjective evaluations were conducted on the hot spots indexed in this manner. Four subjects, who had not attended the presentations nor listened to the recorded audio content, were asked to listen to each of the segmented hot spots in the original time sequence, and to answer the questionnaire below.

Q1: Do you understand the reason why the reactive token/laughter occurred?

Q2: Do you find this segment interesting/funny?

Q3: Do you think this segment is necessary or useful for listening to the content?

The result for Question 1 (percentage of “yes”), summarized in Table 9.9, suggests the ratio of appropriate hot spots, or the “precision” of the detected hot spots, because a third person verified that the spots naturally induced laughter or reactive tokens. The figures labeled “(oracle)” in Table 9.9 show the results when limited to the segments where laughter or reactive tokens were correctly detected. It is confirmed that a large majority of the detected spots are appropriate. There are more “false” detections for the segments accompanying laughter; laughter is often made socially, to relax the participants in the poster conversation.

The answers to Questions 2 and 3 are more subjective, but they suggest the usefulness of the hot spots. Only half of the spots associated with laughter were funny for the subjects (Q2), and they found 35 % of the spots not funny. This result suggests that whether something is felt to be funny largely depends on the person. We should also note that, by nature, there are not many funny parts in poster sessions.

On the other hand, more than 90 % of the spots associated with reactive tokens were found interesting (Q2) and useful or necessary (Q3) by the subjects. This result supports the effectiveness of the hot spots extracted based on the reactions of the audience.

Table 9.9 Ratio of appropriate hot spots among detected spots (“precision”)

9.5.3 Prosodic Analysis of Reactive Tokens

In the system described above, all non-lexical reactive tokens are detected without considering their syllabic and prosodic patterns. In this subsection, the syllabic and prosodic patterns of reactive tokens related to the interest level are investigated. Generally, prosodic features play an important role in conveying para-linguistic and non-verbal information. In previous works [40, 41], it was reported that prosodic features are useful in identifying reactive tokens. Ward [39] analyzed the pragmatic functions conveyed by the prosodic features of English non-lexical tokens.

An experiment was designed to identify the syllabic and prosodic patterns closely related to the interest level for the detection of hot spots. For this investigation, three syllabic patterns, “hu:N”, “he:” and “a:”, were selected. They are presumably related to the interest level and are also the most frequently observed in the corpus, apart from lexical tokens.

Duration, F0 (maximum and range) and power (maximum) are computed for each reactive token, and they are normalized per person: for each feature, the mean over that person’s tokens is subtracted from the feature values.

For each syllabic kind of reactive token and for each prosodic feature, the top-ten and bottom-ten samples, i.e. the samples with the largest/smallest values of the prosodic feature, were selected. For each of them, an audio segment was extracted to cover the reactive token and its preceding utterances. This process is similar to the hot spot detection described in the previous subsection, but it was done manually according to these criteria.

Then, five subjects listened to the audio segments and evaluated the audience’s state of mind. Twelve items were evaluated on a four-point scale (“strongly feel” to “do not feel”). Among them, two items are related to the interest level and two other items are related to the surprise level. Table 9.10 lists the combinations (marked by “*”) that show a statistically significant (\(p<0.05\)) difference between the top-ten and bottom-ten samples. It is observed that a prolonged “hu:N” indicates interest and surprise, while “a:” with higher pitch or larger power indicates interest. On the other hand, “he:” can be emphasized in any of the prosodic features to express interest and surprise.

Table 9.10 Significant combinations of syllabic and prosodic patterns of reactive tokens

Using this prosodic information will enhance the precision of the hot spot detection. Tokens with larger power and/or a longer duration are apparently easier to detect than indistinct tokens, and they are more related to the hot spots. This simple principle is consistent with the proposed scheme.

9.6 Prediction of Interest and Comprehension Level via Audience’s Questions from Multi-modal Behaviors

The feedback behaviors of an audience are important cues for analyzing presentation-style conversations. We can guess whether the audience is attracted to the presentation by observing their feedback behaviors. This characteristic is more prominent when the audience is smaller; the audience can give not only non-verbal feedback such as nodding, but also verbal backchannels. Eye-gaze behaviors also become more observable. In poster conversations, moreover, the audience can ask questions even during the presentation. By observing their reactions, particularly the quantity and quality of their questions and comments, we can guess whether the presentation is understood or liked by the audience.

In the previous section, it was shown that non-lexical reactive tokens are a good indicator of the audience’s interest level. The relationship between the audience’s turn-taking and their feedback behaviors, including backchannels and eye-gaze patterns, was also confirmed.

This section addresses estimation of the interest and comprehension level of the audience based on their multi-modal behaviors. As annotation of the interest and comprehension level is apparently difficult and largely subjective, we turn to speech acts which are observable and presumably related to these mental states. One is prominent reactive tokens signaled by the audience, and the other is questions raised by them. Moreover, the questions are classified into confirming questions and substantive questions. Prediction of these speech acts from the multi-modal behaviors is expected to approximate the estimation of the interest and comprehension level.

In this study, ten poster sessions are used. Each poster was designed to introduce the research topics of the presenter to researchers or students in other fields. It consists of four or eight components (hereafter called “slide topics”) on rather independent topics. This design is a bit different from typical posters presented at academic conferences, but it makes it straightforward to assess the interest and comprehension level of the audience for each slide topic. Usually, a poster conversation proceeds with an explanation of the slide topics one by one, followed by an overall QA and discussion phase. In the QA/discussion phase, it is difficult to annotate which topic the participants refer to. Therefore, only the conversation segments of the explanations of the slide topics are used.

In the ten sessions used in this study, there are 58 slide topics in total. Since two persons participated as an audience in each session, there are 116 slots (hereafter called “topic segments”) for which the interest and comprehension level should be estimated.

9.6.1 Definition of Interest and Comprehension Level

In order to obtain a gold-standard annotation, a natural way would be to ask every participant of the poster conversations about their interest and comprehension level for each slide topic after the session. However, this is not possible on a large scale, nor for previously recorded sessions. The questionnaire results may also be subjective, and their reliability is difficult to assess.

Therefore, we focus on observable speech acts which are closely related to the interest and comprehension level. In the previous section, we identified particular syllabic and prosodic patterns of reactive tokens (“he:”, “a:”, “fu:N” in Japanese, corresponding to “wow” in English) that signal the interest of the audience [44]. We refer to them as prominent reactive tokens.

We also know empirically that questions raised by the audience signal their interest; the audience asks more questions in order to know more and better when they are more attracted to the presentation. Furthermore, we can judge the comprehension level by examining the kind of questions; when the audience asks about something already explained, they must have had difficulty in understanding it.

9.6.1.1 Annotation of Question Type

Questions are classified into two types: confirming questions and substantive questions. Confirming questions are asked to make sure of the understanding of the current explanation, and thus they can be answered simply by “Yes” or “No”. Substantive questions, on the other hand, ask about what was not explained by the presenter, and thus they cannot be answered by “Yes” or “No” only; an additional explanation is needed. Substantive questions are occasionally comments, even if phrased in a question form.

9.6.1.2 Relationship Between Question Type and Interest and Comprehension Level

In four sessions, the audience subjects were asked to report their interest and comprehension level for each slide topic after the session. These reports are used for an analysis of the relationship between these gold-standard annotations and the observed questions.

Figure 9.7 shows the distributions of the interest and comprehension level for each question type. The interest level is quantized into five levels from 1 (not interested) to 5 (very interested), and the comprehension level is marked from 1 (did not understand) to 5 (fully understood). In the graph, a majority of the confirming questions (86 %) are associated with a low comprehension level (levels 1 and 2). We also see a general tendency that the occurrence of questions of either type is correlated with a higher interest level (levels 4 and 5).

From these observations and the previous finding, the following annotation scheme is adopted.

  • high interest level \(\leftarrow \) questions of any types and/or prominent reactive tokens.

  • low comprehension level \(\leftarrow \) confirming questions.

Detection of these states would be particularly useful in reviewing the poster sessions or improving the presentations.

Fig. 9.7

Distribution of interest and comprehension level according to question type

9.6.2 Relationship Between Multi-modal Behaviors and Questions

Next, the statistics of the backchannel and eye-gaze behaviors of the audience are investigated with respect to their relationship with the questions they ask.

9.6.2.1 Backchannels

It is assumed that listeners tend to make backchannels more frequently when they are attracted. In this analysis, non-lexical reactive tokens (e.g. “wow”) are excluded, since the prominent ones among them are used for the annotation, though their occurrence frequency is much smaller (less than 20 % of all backchannels) than that of the lexical tokens (e.g. “yeah” and “okay”).

Nodding is regarded as a non-verbal backchannel, and it is more frequently observed in poster conversations than in daily conversations. Our preliminary analysis showed, however, that there is no distinct tendency in the occurrence frequency of nodding, so it is not used.

The occurrence frequency of verbal backchannels, normalized by the number of the presenter’s utterances (sentence units), is counted within each topic segment. The statistics are listed according to the question type in Table 9.11. In the table, “entire” means the overall average computed over all topic segments of the data set. Since no questions were asked in more than half of the topic segments, the entire average is lower than the values in the other two columns. It is observed that the audience makes more backchannels when asking questions, especially substantive questions.

Table 9.11 Relationship of audience’s backchannel (count/utterance) and questions (by type)
Table 9.12 Relationship of audience’s eye-gaze at the presenter (count/utterance and duration ratio) and questions (by type)

9.6.2.2 Eye-Gaze at Presenter

The object and duration of the eye-gaze of all participants during the topic segments are identified prior to the audience’s questions. The target object can be either the poster or the other participants. In poster conversations, unlike daily conversations, the participants look at the poster most of the time. Therefore, eye-gaze at the other participants has a particular reason and effect. The analysis in Sect. 9.3 showed that eye-gaze information is related to turn-taking events; specifically, the presenter’s eye-gaze largely controls the turn-taking.

In this work, the eye-gaze of the audience is investigated with respect to its relationship with the questions they ask. In particular, the eye-gaze of each person in the audience at the presenter is counted. The average occurrence count (per presenter’s utterance) and the duration ratio within the topic segments are measured. Their statistics are listed in Table 9.12. We can see a significant decrease before confirming questions and a significant increase before substantive questions. A possible explanation is that the audience is focused on the poster, trying to understand the content, before asking confirming questions, while they want to attract the presenter’s attention before asking substantive questions.

In a more detailed sentence-by-sentence analysis, a gradual increase of the eye-gaze at the presenter is observed prior to substantive questions, while there is no such dynamic change in the case of confirming questions.

The results suggest that eye-gaze information is potentially useful for identifying the question type and also estimating the interest and comprehension level.

9.6.3 Prediction of Interest and Comprehension Level

Based on the analysis in the previous subsection, we have implemented and evaluated classifiers to predict the interest and comprehension level of the audience in each topic segment.

First, each of the audience’s behaviors needs to be parameterized. The features described in the previous subsection are used. The average count of backchannels per presenter’s utterance is computed. Eye-gaze at the presenter is parameterized as an occurrence count per presenter’s utterance and as the duration ratio within the topic segment.

Then, regarding the machine learning method for classification, a naive Bayes classifier is adopted, as the data set is not large enough to estimate extra parameters such as feature weights. For a given feature vector \(F=\{f_1,\dots ,f_d\}\), naive Bayes classification is done by

$$p(c|F) \propto p(c)\prod _{i} p(f_i|c)$$

where c is the class under consideration (“high interest level or not” and “low comprehension level or not”). For the computation of \(p(f_i|c)\), we adopt simple histogram quantization, in which feature values are classified into one of a few bins, instead of assuming a probability density function. This also circumvents the estimation of any model parameters. The feature bins are defined by simply splitting the histogram into 3 or 4 bins. Then, the relative occurrence frequency in each bin is converted into a probability.
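A compact sketch of this classifier is given below; the equal-frequency binning and the Laplace smoothing are implementation assumptions not specified in the text.

```python
import numpy as np

def quantize(values, n_bins=3):
    """Split a feature histogram into equal-frequency bins and return bin ids."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

def train_naive_bayes(X_bins, y, n_bins=3, smoothing=1.0):
    """Estimate p(c) and p(f_i = b | c) from histogram-quantized features.

    X_bins -- integer array (n_segments, n_features) of bin ids, e.g. the
              quantized backchannel count and eye-gaze count/duration ratio
    y      -- class label per topic segment (e.g. interesting or not)
    """
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    likelihood = {}
    for c in classes:
        Xc = X_bins[y == c]
        likelihood[c] = [(np.bincount(Xc[:, i], minlength=n_bins) + smoothing)
                         / (len(Xc) + n_bins * smoothing)
                         for i in range(X_bins.shape[1])]
    return prior, likelihood

def predict(prior, likelihood, x_bins):
    """Return the class maximizing p(c) * prod_i p(f_i | c) (in the log domain)."""
    scores = {c: np.log(prior[c]) +
                 sum(np.log(likelihood[c][i][b]) for i, b in enumerate(x_bins))
              for c in prior}
    return max(scores, key=scores.get)
```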

Experimental evaluations were done by cross-validation.

9.6.3.1 Prediction of Questions and Reactive Tokens for Interest Level Estimation

First, an experiment on estimating the interest level of the audience was conducted. The problem is formulated as predicting the topic segments in which questions and/or prominent reactive tokens are made by the audience. These topic segments are regarded as “interesting” to the person who made such speech acts.

The results with different sets of features are listed in Table 9.13. The F-measure is the harmonic mean of the recall and precision of the “interesting” segments, though the recall and precision are almost the same in this experiment. The accuracy is the ratio of correct outputs among all 116 topic segments. The chance-rate baseline obtained by regarding all segments as “interesting” is 49.1 %.

Incorporating the backchannel and eye-gaze features significantly improves the accuracy, and the combination of both features results in the best accuracy of over 70 %. It turned out that the two parameterizations of the eye-gaze feature (occurrence count and duration ratio) are redundant, because dropping one of them does not degrade the performance. Nevertheless, we confirm the multi-modal synergetic effect of the backchannel and eye-gaze information.

Table 9.13 Prediction result of topic segments involving questions and/or reactive tokens

9.6.3.2 Identification of Question Type for Comprehension Level Estimation

Next, an experiment on estimating the comprehension level of the audience was conducted. The problem is formulated as identifying whether a given question is a confirming question, which signals that the person has not understood the topic segment. Namely, such topic segments are regarded as “low comprehension (difficult to understand)” for the person who asked the confirming question.

The classification results of confirming questions versus substantive questions are listed in Table 9.14. In this task, the chance-rate baseline based on the prior statistic p(c) is 51.3 %.

All features have some effect in improving the accuracy, but the eye-gaze occurrence count alone achieves the best performance, and combining it with the other features gives no additional gain. This is explained by the large difference in its value between the question types, as shown in Table 9.12.

As the simple occurrence frequency of backchannels is not useful for this task, the syllabic and prosodic patterns of backchannels [45] should be investigated in the future.

Table 9.14 Identification result of confirming or substantive questions
Fig. 9.8
figure 8

Poster conversation browser

9.7 Poster Session Browser

Based on the results and findings of this study, a poster session browser has been designed and developed, as shown in Fig. 9.8. The browser visualizes each participant’s activities during the poster session, including speech utterances and eye-gaze at other participants. It also plays back the recorded audio and video based on the indices.

Along the timeline, each participant’s utterance segments are marked as a result of speaker diarization and backchannel detection. We can thus easily access substantial utterances from the audience, such as questions and comments. Eye-gaze events are also visualized so that we can estimate the interaction level of the conversation; for each person in the audience, the marked segments indicate when that person gazed at the presenter.

Below the timeline, a scaled-down timeline overview is shown to give users a view of the entire session. By clicking a segment on the timeline overview, users can jump directly to that area and see the corresponding conversation segment. Poster sessions generally last a long time, and presenters need to explain the same content repeatedly, while substantial utterances such as questions and comments from the audience are occasional but important. These functions allow users to efficiently access the substantial utterances without watching the entire video.

The browser will be helpful for the presenter to review the session afterwards, since the presenter can hardly remember all of the audience’s questions and comments during a long session. It will also be useful for the presenter’s colleagues or supervisor to see how many people came to the poster and whether they were interested in the presentation, and to quickly view what the audience said and how the presenter responded. In the future, the browser may be made publicly accessible so that viewers can see the other participants’ comments, but this requires permission from the participants as well as the session organizer.

Since the system is independent of the conversational content (e.g. audio, video, and utterance segments), users can easily adapt this tool to other conversational forms such as meetings and discussions. Details of the visualized data are described in a configuration CSV file, and various types of time-series multi-modal data can be displayed on the timeline by editing this file. The CSV file also specifies the display format of the browser, such as the colors of the segments on the timeline and the display positions of the visualized data. The system is designed as a Web application whose backend is implemented in Java and whose interface is implemented in HTML, CSS, and JavaScript; video and audio playback is realized with HTML5. The browser is lightweight and OS-independent.
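
The exact schema of the configuration file is not described here; the following Python sketch merely illustrates the idea, with hypothetical column names (participant, track, start, end, color) read using the standard csv module.

import csv

def load_timeline_config(path):
    """Read a hypothetical timeline configuration CSV into per-track segment lists.

    Assumed columns (illustrative only): participant, track, start, end, color
    e.g. "A,utterance,12.3,15.8,#3366cc" or "B,gaze_at_presenter,20.1,24.0,#cc3333"
    """
    tracks = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["participant"], row["track"])
            tracks.setdefault(key, []).append({
                "start": float(row["start"]),
                "end": float(row["end"]),
                "color": row["color"],
            })
    return tracks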

A simple evaluation of the browser interface was conducted by measuring the time needed to review the substantial exchanges in a poster session. One session was chosen from our corpus, and four subjects participated in the experiment. They were asked to answer twelve quizzes by browsing the recorded session. The quizzes were drawn from the questions uttered by the audience, and two possible answers were prepared for each; the subjects were asked to select the one actually given by the presenter. The quizzes were presented in temporally random order.

Table 9.15 shows the time the subjects took to complete all quizzes. All subjects answered all quizzes correctly in less than ten minutes, whereas the session itself lasted 29 min. On average, the reviewing time is approximately 28.1 % of the session duration. The browser with the speaker diarization result thus provides an effective interface for efficiently finding the substantial utterances in a session.

Table 9.15 Browsing time to complete quizzes; “reduction ratio” is measured against the session duration

9.8 Conclusions

We have conducted a multi-modal conversation analysis focused on poster sessions. Poster conversations are interactive, but often long and redundant; therefore, simply recording a session is not very useful.

The primary goal of the study was robust signal-level sensing of the participants, i.e. who came to the poster, and of their verbal feedback, i.e. what they said. This remains challenging given distant and low-resolution sensing devices. The combination of multi-modal information sources was investigated to enhance the performance.

First, multi-modal behaviors prior to turn-taking events were investigated. For predicting speaker change, or turn-taking by the audience, both the prosodic features of the presenter and the eye-gaze features of all participants are useful. The most relevant piece of eye-gaze information is the presenter’s gaze at the person to whom the turn is to be yielded.

Based on this finding, a multi-modal speaker diarization method was realized by integrating eye-gaze information with acoustic information. Moreover, the diarization result was enhanced by eliminating backchannels and falsely accepted noise. The stochastic multi-modal scheme improved the performance of speaker diarization, and the effect of eye-gaze information was confirmed in noisy environments.

The next step was high-level indexing of the audience’s interest and comprehension levels. The problem was formulated in terms of relevant speech acts, using the audience’s non-verbal feedback behaviors. Two approaches were presented in this work.

One is the indexing of hot spots based on the reactions of the audience, specifically laughter and non-lexical reactive tokens. Detection of laughter is relatively easy, but the detected spots are not necessarily funny or useful, because such judgments vary greatly among subjects. On the other hand, the spots associated with reactive tokens are consistently interesting and meaningful. Furthermore, specific prosodic patterns closely related to the interest level were identified.

The other approach is estimation of the interest and comprehension levels based on the audience’s feedback behaviors and speech acts such as questions and prominent reactive tokens. Specifically, estimation of the interest level was reduced to predicting the occurrence of questions and prominent reactive tokens, and estimation of the comprehension level was realized by classifying the question type.

To visualize these detected events and indices, a poster session browser has been developed. The browser will be useful for assessing the effect of these processes and for further improving them.