
The eventual choice leant heavily towards the provocative, though the supportive was not wholly ignored. Partners in the HUMAINE Database work package have developed the resources used in several significant projects on recognition of emotion, both unimodal (CEICES) and multimodal (Cowie et al.). However, the main systematic effort was directed towards establishing a corpus that summed up the challenges facing the community and drew together key resources that are potentially relevant to meeting them. The choice is grounded in HUMAINE’s core emphasis on considering emotion in a broad sense – ‘pervasive emotion’ – and engaging with the way it colours action and interaction. Until now, there has been no source to which the community could go to see and hear the forms that emotion takes in everyday action and interaction, and to look at the tools that might be relevant to describing it. The core aim of the HUMAINE Database was to provide that kind of source. The description in this chapter reflects that orientation. It does not set out to give technical specifications of the database contents. Instead it aims to convey the range of forms that emotion takes in the database and the ways that the descriptive resources address them. The database is not simply about creating impressions. It is also designed to let key technical questions be addressed. That aspect is taken up at the end of the chapter.

1 Overview and Structure

This section sets out some of the specific concerns underlying the database.

The emphasis is on multimodal data. Most of the material is audiovisual. The visual channel usually includes face, but some sources deliberately contain gesture and some include body posture or mode of action. Other parts include physiological data and choice of response to challenges.

The material was collected to show a wide emotional range (negative to positive, active to inert, deeply engaged to playful). The emotion is embedded in a range of actions and interactions, and a variety of contexts.

The material makes varying levels of allowance for the limits of contemporary signal processing. At one extreme is data from TV recordings, shot outdoors with ‘difficult’ camera angles and noisy audio; at the other is data derived from laboratory scenarios devised to minimise signal processing challenges.

The labels that are used are described in the preceding chapter of this part. They include labels describing emotional content, based on the psychological literature; labels for emotional signs in the relevant modalities (labels for face, speech and gesture have been developed); and context labels. The labels span a range of resolutions in time (whole passage to moment by moment).

There are close links between the recordings and the labels. On the one hand, labels were chosen to deal with the phenomena that were observed in the recordings that laid the groundwork for the database. On the other hand, material was generated and selected to reflect the range of possibilities that the labels indicated ought to be represented.

The database consists of two parts: (i) primary records and (ii) a structured labelled subset. The primary records consist of recordings, almost all audiovisual, in diverse emotion-rich scenarios. The structured labelled subset is a balanced and labelled set of emotional episodes (referred to as ‘clips’) selected from the primary records to represent a range of emotions and to demonstrate the application of a wide range of labels covering emotional content, context and signs. Most of the primary records and the whole of the labelled subset are available to the research community under ‘Conditions of Use Agreement’ (see Appendix 1 for details).
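For readers who want to work with the two-part structure programmatically, the sketch below shows one possible way to represent it. The field names and types are hypothetical, chosen for illustration; they are not taken from the HUMAINE documentation or distribution.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PrimaryRecord:
    """One recording in the primary records (hypothetical schema)."""
    record_id: str
    source: str              # e.g. "Belfast Naturalistic", "SAL", "EmoTABOO"
    collection: str          # "naturalistic" or "induced"
    modalities: List[str]    # e.g. ["audio", "video"]
    duration_s: float
    consent_cleared: bool    # availability depends on ethical clearance

@dataclass
class Clip:
    """A labelled emotional episode cut from a primary record."""
    clip_id: str
    parent_record: str                          # record_id it was cut from
    start_s: float
    end_s: float
    global_labels: Dict[str, str] = field(default_factory=dict)  # whole-clip descriptors
    trace_labels: Dict[str, list] = field(default_factory=dict)  # time-aligned traces
```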

The primary records consist of many hours of recordings and contain emotional episodes which have not been identified and selected for labelling. They are thus a resource which can be mined by other researchers. They are also a useful resource for observing how often emotionality of some level occurs across time. Some of the data is ‘naturalistic’ (in the sense that it has been collected in situations not under the researcher’s control, e.g. from film shot for television, etc.). Some is ‘laboratory’-induced (in the sense that emotionality has been induced in a controlled environment according to a purpose-specific method). The laboratory-induced data is a rich resource, not just for the data itself but also for the range of methods used to carry out the induction of emotion (developed specially for the HUMAINE project – for fuller details, see chapter “Issues in Data Collection”).

One of the principles behind the HUMAINE database was that the data should be available to the community in general. Given the particular nature of the data, this has meant that important issues of ethical clearance and consent had to be addressed. The ethical principles underpinning ethical clearance and consent are dealt with more fully in Part IV ‘Ethics and Good Practice’. Access to the database is via the HUMAINE Association portal (http://www.emotion-research.net).

2 The Total Data Set

The HUMAINE Database work package recorded or acquired a large body of material showing emotion as it appears and sounds in action and interaction. This section summarises the main kinds of material that have been collected as a result.

2.1 Summary of Primary Records

Table 1 provides a summary of the data types that make up the primary records. The recordings are usually either naturalistic or induced, although an emotional episode is selected from one professionally acted data set (GEMEP, see Baenziger and Scherer, 2007) for labelling (as a comparison). The GEMEP Data Set as a whole is not available as part of the HUMAINE database. The induction techniques that have been developed to induce much of the material are described in chapter “Issues in Data Collection” in this part. Appendix 2 describes in more detail the exact nature of the primary records – the material, technical information about the material including length, numbers of subjects, recording scenario and the conditions under which it is available. Appendix 3 describes in detail each episode or clip selected for labelling (to form part of the labelled subset) including a summary of the content of the emotional episode and descriptors of the emotion, context and modalities.

Table 1 Data types used in the HUMAINE Database

2.2 Illustration of Data Types

This section illustrates the nature of the data, pulling out its typical characteristics, theoretical interest and relevance and its strengths and weaknesses. It starts with the naturalistic data.

Figure 1 shows a typical frame taken from the Belfast Naturalistic Database (Douglas-Cowie et al., 2003). The subject is talking to an old friend about how she feels about her future son-in-law and expressing her positive feelings for him. The frame is typical of the data in that the emotion expressed is strong but not full blown. In terms of the emotion-related states described in the first chapter in this part, the emotion expressed in this particular example reflects a long-lasting feeling (‘attitude’) towards someone or something. In the Belfast Naturalistic Database, much of the data either expresses attitudes or is ‘established’ emotion (long-standing states that can be ‘triggered’ in a way that produces surges of overt emotion). In this example the subject is sedentary and the camera is fixed on the face, head and shoulders. The visual quality is not perfect. The induction technique captures a lot of speech in a dialogue situation and, because the old friend is also the researcher, she knows not to interrupt too much, thereby giving long stretches of uninterrupted speech for analysis. The text is given beside the picture with the interviewer’s comments in square brackets.

Fig. 1 Frame from Clip56b, Belfast Naturalistic Database

Figure 2 shows two frames, both from the Castaway Reality Television Data Set (Douglas-Cowie et al., 2007). The data set contains some intense emotion, and these frames illustrate that. The one on the left is of a subject in a positive state (after successful completion of a task) and the one on the right is of a subject in a negative state (after a task in which he thinks he has done badly). The material is important because it shows emotion in action as participants engage in a range of challenging activities both singly and in groups. It often shows shifting and complex emotions as a subject moves through a challenging activity and comes out at the end successfully or unsuccessfully. Because there is a lot of activity, there is a lot of movement by the participants and this gives rise to data that is not face-on or close-up, and so from an affective computing point of view, the data is quite challenging. The shots in Fig. 2 illustrate some of the more static material in the data. Most of the speech occurs in the one-to-one interactions with the team leader, but outdoor noises from the rest of the group tend to get in the way of good recordings. Nevertheless, pitch traces have been reliably extracted for the clips from this data that form part of the labelled subset.

Fig. 2 Frames from the Castaway Reality Television Data Set (from Clip5_2, left, and Clip 6_2, right)

Figure 3 shows two frames from the SAL (Sensitive Artificial Listener) induction technique (see “Issues in Data Collection”, Part III). SAL involves an artificial character with different emotional personalities (Poppy who is happy, Spike who is angry, Obadiah who is gloomy, Prudence who is sensible) engaging a subject in conversation. Each personality uses stock responses and phrases to pull the subject towards his/her mood. In Fig. 3 the subject is shown talking to the gloomy personality of the artificial listener (on the left) and then the happy personality of the artificial listener (on the right). The frames show reasonably natural data which demonstrates mild to moderate levels of emotion. The subject is sedentary and the camera is focused on the face and shoulders. There is no gesture, but the technique generates a lot of speech (see Table 2). The visual and auditory quality is good and SAL data has successfully been used to train an emotion recognition system (Ioannou et al., 2005).

Fig. 3 Subject in conversation with two different personalities of the Sensitive Artificial Listener (from Clip REllA2, left, and Clip REllB2, right)

Table 2 Text accompanying SAL clips REllA2 and REllB2 (SAL personality responses in square brackets)

The Belfast Activity/Spaghetti Data (see the chapter “Issues in Data Collection”) is represented in the next set of figures (Figs. 4 and 5). Two techniques were used to produce the emotion in action seen in these two figures.

Fig. 4 Sequenced frames from Belfast Activity Data at 0.56 s (left), 2.16 s (middle) and 3.00 s (right)

Fig. 5 Belfast Spaghetti Data (building up to climax, left, and in response to buzzer, right)

In the first (Activity Data), volunteers were recorded engaging in outdoor activities (e.g. mountain bike racing) in an effort to produce examples of full-blown emotion in action for which we would have consent and ethical clearance. This produced ‘provocative’ data, very dynamic, with subjects moving around. It also had a noisy sound track with affect bursts, but little speech. The data is demonstrated in Fig. 4, which shows sequenced frames of a subject watching one of the volunteers fall off a mountain bike in a 3-s episode. It demonstrates full-blown emotion and quite complex emotional shift. Some interesting work has been done on the data, comparing the speed of transition in facial movement with that in acted emotional data (Sneddon and McRorie, 2006).

In the second technique, a more controlled environment was used where certain kinds of ‘ground truth’ could be established. It is called the Spaghetti method because participants are asked to feel in boxes containing unpleasant objects (including spaghetti) and buzzers that go off as they feel around. Afterwards they record what they felt emotionally during the activity. Figure 5 shows a typical example. The first frame shows the subject in the build-up to the climax, where a buzzer sounds when the subject locates the object in the box. The second shows the subject at the moment when the buzzer sounds. The data is of good quality both auditorily and visually, although there is very little actual speech – the sound track consists mainly of exclamations. In the clip from which the frames in Fig. 5 are taken, the only words uttered are ‘Oh Jesus’ at the moment when the buzzer goes off.

Figures 6 and 7 represent a more recent move by the HUMAINE team towards experimentation with induced data that shows interaction between two subjects. This is particularly relevant to emotional synthesis. Figure 6 shows data from the Green Persuasive Data Set (see this chapter and the chapter “Issues in Data Collection”) where complex emotions are linked to varied cognitive states and interpersonal signals. In the Green Persuasive Data Set, one person tries to persuade another on a topic with multiple emotional overtones (adopting a ‘green’ lifestyle). Figure 7 shows data from the EmoTABOO Data Set (Zara et al., 2007) where the emphasis is on generating gesture and where subjects interact mainly through gesture and body movement to explain and understand a taboo or an unusual word known only to one of the subjects. The scenario produces a lot of amusement and embarrassment.

Fig. 6 Persuader and persuadee’s response in Green Persuasive Data Set (clips Ex2A and PT2a)

Fig. 7 Co-occurring frames from EmoTABOO, explainer on left and receiver on right

Figure 6 shows the persuader on the left and the person he is trying to persuade on the right. The frame on the right is taken from immediately after the persuader’s attempt to persuade the subject that cars are not needed in an environment-friendly ‘green’ world. She clearly disagrees. The data produces a lot of persuasive speech on the part of the persuader, less speech on the part of the persuadee but interesting facial responses. The text is given in Table 3. The quality both auditorily and visually is good.

Table 3 Exchanges between persuader and persuadee in clips Ex2A and PT2a

Figure 7 shows an interaction between the subject who is trying to explain a ‘secret’ word/concept (on the left) to the person on the right who does not know the secret word. The two frames are from exactly the same moment. The data is very rich in gesture and is of good quality. There is some speech but the emphasis is on gesture.

The final picture in this section comes from the DRIVAWORK corpus (Honig, 2007). This uses a simulated driving task and subjects are recorded relaxing, driving normally or driving under an additional task (mental arithmetic). The technique aims to elicit physiological data in the different states. Figure 8 shows a subject relaxed (left), driving (middle) and under task (right). The task is context specific. The speech takes the form of answers to mental arithmetic problems. Speech and face are very clear.

Fig. 8 Subject relaxed (left), driving normally (middle) and under task (right) taken from DRIVAWORK corpus

The Belfast Driving Simulator Data uses specially developed induction techniques to record subjects driving in a range of emotional states. The procedure consists of inducing subjects into a range of emotional states and then getting them to drive a variety of ‘routes’ designed to expose possible effects of emotion. Induction involves novel techniques designed to induce emotions robust enough to last through driving sessions lasting tens of minutes. Standard techniques are used to establish a basic mood, which is reinforced by discussions of topics that the participants have preidentified as emotive for them. The primary data is a record of the actions taken in the course of a driving session, coupled with physiological measures (ECG, GSR, skin temperature, breathing). It is supplemented by periodic self-ratings of emotional state. The data has not been video recorded. It is currently the topic of a Ph.D. and cannot be released until after completion of the Ph.D., but pilot work from the Ph.D. can be found at http://emotion-research.net/ws/wp5/edelle.ppt

3 The Labelled Subset

3.1 Aims of the Labelled Subset

The labelled subset is a balanced and labelled collection of extracts selected from the primary records. Each extract is selected to contain a relatively self-contained emotional episode and is referred to as a ‘clip’. The subset was chosen to represent the range and form of emotional life that people working in the field should be aware of and was labelled for both emotion and signs of emotion. Both the selection and the design of the labelling scheme were based on systematic criteria (see below) derived in part from the psychological literature on emotion and in part from experience with real data. The chapter “Issues in Data Labelling” in Part III gives further information on the principles and models which underpin the labelling scheme.

3.2 Size and Structure of the Labelled Subset

The labelled subset consists of 48 clips (between 3 s and 2 min in length) selected from the primary recordings. The labelled episodes are mounted on the ANVIL platform. Table 4 provides a summary of the numbers of clips selected from the range of data types to make up the labelled subset. Appendix 3 describes in detail each clip selected for labelling, including a summary of the content of the clip and descriptors of the emotion, context and modalities.

Table 4 Selection of clips for labelled subset

The selection of the final sample involved non-trivial issues. It proceeded at two levels.

The first was the selection of clips from within a whole recording. In the case of relatively intense emotional episodes, the extraction of a section/clip includes build-up to and movement away from an emotional nucleus/explosion – lead in and coda are part of identification of the state. In the case of less emotionally intense recordings, the basic criterion used to set the boundaries of clips is that ‘the emotional ratings based on the clip alone should be as good as ratings based on the maximum recording available’ (i.e. editing should not exclude information that is relevant to identifying the state involved).

The second stage of selection was deciding which clips should form the labelled subset of the HUMAINE Database. It is very easy to drift into using a single type of material which conceals how diverse emotion actually is. To counter that, 48 clips were deliberately selected to cover material showing emotion in action and interaction; in different contexts (static, dynamic, indoor, outdoor, monologue and dialogue); spanning a broad emotional space (positive and negative, active and passive) and all the major types of combination of emotion (consistent emotion, co-existent emotion, emotional transition over time); with a range of intensities; showing cues from gesture, face, voice, movement, action and words and representing different genders and cultures.
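As an illustration of how the selection criteria listed above might be checked mechanically, the sketch below counts how a set of clips covers combinations of the grid dimensions. The clip entries, dimension names and values are invented for illustration and only loosely mirror Table 5.

```python
from collections import Counter
from itertools import product

# Hypothetical clip metadata: the dimension names loosely mirror the selection
# grid (quadrant of emotion space, emotion mix, context), but the values and
# the clips themselves are invented for illustration.
clips = [
    {"id": "c01", "quadrant": "negative-active", "mix": "consistent", "context": "outdoor"},
    {"id": "c02", "quadrant": "positive-active", "mix": "co-existent", "context": "dialogue"},
    {"id": "c03", "quadrant": "negative-passive", "mix": "transition", "context": "monologue"},
]

def coverage(clips, dims=("quadrant", "mix", "context")):
    """Count how many clips fall into each combination of grid cells."""
    return Counter(tuple(c[d] for d in dims) for c in clips)

def uncovered_cells(clips, dims=("quadrant", "mix", "context")):
    """List combinations of observed values that no clip covers."""
    observed = [sorted({c[d] for c in clips}) for d in dims]
    counts = coverage(clips, dims)
    return [cell for cell in product(*observed) if cell not in counts]

if __name__ == "__main__":
    print(coverage(clips))
    print("uncovered:", len(uncovered_cells(clips)), "cells")
```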

Table 5 shows the framework which underpinned the final selection. Each clip was chosen to represent a particular combination of those main classes.

Table 5 Grid of distinctions underpinning selection of clips in labelled subset

The details of each of the 48 clips are given in Appendix 3. For each clip there is a short description of what is happening in the clip and the gender of the speaker as well as a description of quadrant, emotion mix, modality and context (constraints, goal and setting) as set out in Table 5. The clips are all available under Conditions of Use Agreement (Appendix 1).

The grid of distinctions underpinning the selection of clips is in some senses what one might expect – especially attention to emotional range, context and modality. But the history of emotional databases shows that this is not the norm. Very often databases are application focused and so contain only one type of data; where they are more open ended, they often do not prespecify criteria for selection from the raw records. This can lead to a slanted representation of the data. There are also some aspects of the grid that have not traditionally formed part of emotion databases. Specifying context is often not done, although the impact of context on the representation of an emotion can be large. And selection which takes into account the consistency of emotion and whether it is pure or co-existent with another emotion is a departure which is based on experience with real data. Experience with the Belfast Naturalistic Database and the EmoTV Database suggests that emotion mixing and emotion shifting are common (Devillers et al., 2006). Hence the grid is designed to capture clips which represent this feature.

3.3 Labelling of the Subset

A wide range of emotion labels and signs of emotion descriptors are attached to each clip. These are also available together with labelling manuals on the portal at http://www.emotion-research.net. The labelled data can be displayed on the ANVIL platform and the procedure for obtaining and using the software is explained on the portal. The emotion labelling has been done by six raters for all 48 clips and the data for all six labellers is available via the portal. The speech and language labelling has been done by one trained phonetician and is available for all 48 clips. Two clips are labelled for face and gesture; the labelling is available on the portal.
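Since the labelled data is distributed for display on the ANVIL platform, it may be useful to pull the annotations into other tools. The sketch below reads a generic ANVIL-style XML file; the tag and attribute names (track, el, start, end, attribute) are assumptions made for illustration, so the actual export format documented on the portal should be checked before use.

```python
import xml.etree.ElementTree as ET

def load_annotation_tracks(path):
    """Read time-stamped annotation elements from an ANVIL-style XML file.

    The tag and attribute names used here (track, el, start, end, attribute)
    are assumptions for illustration; check the actual ANVIL export before
    relying on this.
    """
    tree = ET.parse(path)
    tracks = {}
    for track in tree.getroot().iter("track"):
        name = track.get("name", "unnamed")
        tracks[name] = [
            {
                "start": float(el.get("start", "nan")),
                "end": float(el.get("end", "nan")),
                "values": {attr.get("name"): (attr.text or "").strip()
                           for attr in el.findall("attribute")},
            }
            for el in track.findall("el")
        ]
    return tracks
```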

The components for the labelling scheme and the principles behind it are described in full in the chapter “Issues in Data Labelling”. This section summarises how they have been put together and used in the labelled subset of the HUMAINE database.

3.4 Emotion Labels

Two levels of description are included.

At the first level, global labels are applied to an emotion episode or clip as a whole. Factors that do not vary rapidly (the person concerned, the context) are described here. This provides an index that can be used to identify clips that a particular user might want to consider. For instance, it will allow a user to find examples of the way anger is expressed in relatively formal interactions (which will not be the same as the way it is expressed on the football terraces).

Labelling at the second level is time aligned. This is done using ‘trace’-type programs (see the chapter “Issues in Data Labelling”). Each of the programs deals with a single aspect of emotion (e.g. its valence, its intensity, its genuineness). An observer traces his/her impression of that aspect continuously on a one-dimensional scale while he or she watches the clip being rated. The data from these programs is imported into ANVIL as a series of continuous time-aligned traces. The trace-type labelling captures perceived flow of emotion.
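Because each rater's trace is a one-dimensional signal sampled at its own instants, a common first step before comparing or importing traces is to resample them onto a fixed time grid. The sketch below does this with linear interpolation; the 0.1 s step and the example values are arbitrary choices for illustration, not part of the trace tools' specification.

```python
import numpy as np

def resample_trace(times_s, values, step_s=0.1, duration_s=None):
    """Resample an irregularly sampled one-dimensional trace onto a fixed grid.

    times_s and values would come from one rater's trace output; the 0.1 s
    step is an arbitrary choice for illustration, not a HUMAINE specification.
    """
    times_s = np.asarray(times_s, dtype=float)
    values = np.asarray(values, dtype=float)
    if duration_s is None:
        duration_s = float(times_s[-1])
    grid = np.arange(0.0, duration_s + step_s, step_s)
    return grid, np.interp(grid, times_s, values)

# Invented intensity trace that rises abruptly after about 2 s
t, v = resample_trace([0.0, 1.8, 2.0, 2.4, 5.0], [0.1, 0.1, 0.2, 0.8, 0.7])
```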

Tables 6 and 7 summarise the global and continuous ‘trace’ labels that are applied to each clip.

Table 6 Global emotion descriptors applied to HUMAINE database (open comment is also invited for each class)
Table 7 ‘Trace’ label descriptors

3.5 Sign Labels

Labels for signs of emotion are also attached to each clip in the labelled subset. There are labels for speech and language applied to all clips and labels for gesture and face applied to two of the clips. Table 8 summarises the descriptors used.

Table 8 Signs of emotion descriptors

3.6 The Labelled Subset: Illustrations and Issues

The labelled subset is a powerful demonstration of the range and diversity of emotional life and how we can begin to describe it. This section works through a number of examples to illustrate some of the diversity and complexity of data in the subset and to show how the labelling can capture what is going on in quite complex emotional episodes.

Example 1. For the first example we return to the Spaghetti Data and to the woman already featured in Fig. 5. Table 9 shows how one rater described this clip in global emotional terms. Figure 9 shows the display of continuous ‘trace’ emotion labelling for the same clip by the same rater, with the whole of the clip labelled using the Trace programs. The screenshot is taken from ANVIL and it shows the traces from one rater for this clip for intensity of emotion, acting, masking, activation and power/powerlessness. The screenshot conveys the net effect of putting traces together. The clip shows a participant feeling in a box and suddenly triggering a buzzer. She gives a gasp, then a linguistic exclamation. The top trace, emotional intensity, rises abruptly after the gasp. The rater does not judge that the response is acted, but there is a degree of masking at the beginning which breaks down abruptly at the unexpected event. Activation rises abruptly after a delay (during which the participant might be described as frozen).

Fig. 9 Continuous trace labelling for intensity, acting, masking and level of activation for Clip 14e by one rater (red vertical bar marks frame shown)

Fig. 10 Four representations of fear

Table 9 Global emotion labels for Spaghetti Data Clip 14e (from one rater)

Example 2. The second example illustrates how emotion can look in different contexts and time domains. All six raters attached the word ‘fear’ to the clips from which the examples below are taken (Fig. 10). The first clip shows a subject we have seen before watching a friend fall off a mountain bike in the Belfast Activity Data. The second shows a subject undertaking the Spaghetti task at the point at which she stopped the task and said she was too frightened to continue. The third clip shows a subject recalling touching snakes in a darkened hut a few minutes after the event. The fourth shows a subject recalling a terrifying incident a year after it happened. These clips suggest that a sensible database needs to show the variety of things that are called ‘fear’. The sample in the HUMAINE database is by no means complete, but it is a useful pointer to the variety of representations behind an emotion word.

Example 3. The third example focuses on portraying the complexity of emotions that can occur in naturalistic data, particularly the mixed nature of emotions (referred to as ‘co-existing’ in the HUMAINE coding scheme) and the way in which emotion fluctuates and shifts within short time periods. By comparison, acted data tends to portray emotion as consistent, pure and static over a period of time. Figure 11 shows a sequence of frames from the Belfast Naturalistic Database taken at intervals from a 10-s period in which the subject utters the words ‘It’s a boy. And the anger drained out of me that night. I felt it going. It was like a release.’

Fig. 11 Clip 56d Belfast Naturalistic Database. Frame (a) starts with the memory of the announcement by the daughter’s birth partner that the baby had been born and that it was a boy. The expression certainly seems to contain happiness. In frames (b), (c) and (d) she recalls the anger she had felt but at the same time recalls her move away from the anger: the frames are clearly a mix of complex feelings (anger, pain, sadness, escape) and emotional shift. Frames (e) and (f) describe release from the anger and the final two frames might best be described as a return to peace

The context for the sequence of frames is that the subject is remembering the birth of her grandson (her daughter’s child). She and her daughter had not got on very well together and she had been particularly angry at her daughter for getting pregnant. She has been describing the anger she felt but then moves to describe the moment of her grandson’s birth and the release from the anger she had been feeling at the moment she heard that her grandson had been born. Figure 11 shows the way in which the emotion shifts and blends as she recalls the incident. The sequence is typical of the type of data that comes from naturalistic settings. Work on EmoTV (Devillers et al., 2006), which unfortunately cannot be released for copyright reasons, makes similar points. One of the interesting things about this particular example is that the emotion expressed comes from recalling events, illustrating that recall can produce fairly intense emotion.

Example 4. This is a nice illustration of the need to consider a wide range of emotion-related states when classifying emotional behaviour. The opening chapter of this handbook discusses the theory behind these, and Table 6 lists those that are used in labelling the HUMAINE database. Figure 12 illustrates one of these states which is less commonly talked about – ‘suppressed’ emotion. The subject is shown talking to the angry personality of the Sensitive Artificial Listener in the SAL data (see above). What is happening is that the angry personality of SAL is trying to wind up the subject’s emotions into an angry state. All the raters of this clip attach the label ‘suppressed emotion’ to it. The words that they also all agree apply to the clip are politeness, tension, irritation, annoyance and anger. These are in line with the global label of ‘suppressed emotion’, indicating that the subject may have negative emotions but that he remains polite, keeping his emotions under control through a deliberate effort. The text (see under Fig. 12) indicates suppression of the emotions.

Fig. 12 An example of suppressed emotion from SAL Data as subject talks to the angry personality Spike. (Text: Like you’re not really annoying me and I don’t appreciate your attitude)

Examples 5 and 6 illustrate the labelling of signs of emotion and the richness of signs in the data. Figure 13 shows a frame from EmoTABOO with the array of gesture labels and FAPs attached. Figure 14 shows a frame from Castaway Reality TV. The episode is particularly rich in paralinguistic expression of emotion. The subject is asked what he misses most. He replies: ‘ I always get choked up … … family’ accompanied by long pausing, tremulous voice and nervous laughter.

Fig. 13 Gesture labelling and FAPS applied to EmoTABOO Data

Fig. 14 Paralinguistic expression of emotion in Castaway Reality TV Data Set

Example 7. The final figure, Fig. 15, shows some of the interesting and unexpected interactions between modalities. The subject in it appears to be smiling but the words indicate that she is actually in a state of shock. The subject (from the Castaway Reality TV Data Set) is recalling her encounter with snakes in the hut from which she has just emerged. She is saying ‘and the first thing I touched was the snake’s head … feel really shaky now.’ The bottom line in the figure is a trace from WordTrace by one rater. The word that the rater thought applied most to the episode was ‘fear’ and the trace is a trace of fear in the episode. The trace shows that fear is present and strong throughout the episode. There are many examples in the HUMAINE database where the expression on the face seems at odds with the emotion experienced.

Fig. 15 Conflicting signs from face and speech, Castaway Reality TV Data Set

4 Future Directions

The point of a database is to facilitate research, and the HUMAINE Database opens up a very large number of avenues for exploration.

At the most routine, the database provides evidence on the way a range of tools function and therefore provides a basis for evaluating them. The tools include everyday emotion categories, trace programs, descriptors for broad types of emotion and context. Evaluations include simple, formal procedures, such as tests of reliability. However, they also include others which are less clear-cut, but not less important: does the battery of descriptions tell us what we need to know, and if not, why not?
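As one concrete example of a simple, formal reliability test, the sketch below computes the mean pairwise Pearson correlation between raters' resampled traces for a clip. This is just one of many possible agreement measures and is not the evaluation procedure prescribed for the HUMAINE Database; the example values are invented.

```python
from itertools import combinations
import numpy as np

def mean_pairwise_correlation(traces):
    """Mean Pearson correlation over all pairs of raters' resampled traces.

    traces has shape (n_raters, n_samples). This is only one of many possible
    reliability measures, not a procedure prescribed for the HUMAINE Database.
    """
    traces = np.asarray(traces, dtype=float)
    pairs = combinations(range(traces.shape[0]), 2)
    corrs = [np.corrcoef(traces[i], traces[j])[0, 1] for i, j in pairs]
    return float(np.mean(corrs))

# Invented example: three raters tracing intensity for one clip
print(mean_pairwise_correlation([[0.1, 0.2, 0.8, 0.7],
                                 [0.0, 0.3, 0.7, 0.8],
                                 [0.2, 0.2, 0.9, 0.6]]))
```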

Related, but distinct, are questions about reduction. There are two obvious forms of question to consider. The first form is related to the concept of cover classes. It is concerned with establishing which labels can be merged without unacceptable loss of information. The second form is related to the concept of dimensions. There is a very large literature on the number of dimensions needed to represent a set of words. The HUMAINE Database opens up the possibility of asking how many dimensions are necessary to represent a set of samples of emotionally coloured behaviour.
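A rough way to pose the dimensionality question for samples rather than words is to summarise each clip as a feature vector (for instance, the mean and range of each trace) and see how much variance a few principal components capture. The sketch below does this with a plain SVD; the features and the random example data are stand-ins, not an analysis of the actual labelled subset.

```python
import numpy as np

def explained_variance_ratio(X):
    """PCA via SVD: proportion of variance captured by each component.

    X has one row per clip and one column per summary feature (for example
    the mean and range of each trace dimension); the choice of features is a
    modelling decision, not part of the HUMAINE labelling scheme.
    """
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Invented stand-in data: 48 clips summarised by 6 features
rng = np.random.default_rng(0)
print(explained_variance_ratio(rng.normal(size=(48, 6))))
```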

These questions are not statistically trivial. For example, they should ideally take account of the way labellings evolve over time. A simplification which seems fair in terms of a series of ‘snapshots’ may be a serious problem if it undercuts the ability to predict what will happen next. Standard statistical reduction techniques do not address that kind of problem.

The end target of that kind of work is an empirically validated set of labels. It is frustrating, but there is no way to reach that stage without generating labellings some of whose components will eventually be discarded. Hence, it is to be expected that some components of the HUMAINE scheme will be discarded in the process of analysis. Conversely, new components will presumably need to be added. Iterative adjustment is to be expected, but it needs a core to work round.

Benchmarking is another key application. It is a major problem that the area lacks standard tasks against which the performance of new algorithms can be tested. The database offers two kinds of benchmark – clips to be analysed and types of information to be recovered. The test is not confined to machine recognition. For instance, the material in the database offers a very interesting test for brain-scanning technologies. Capturing differences between responses to HUMAINE Database clips is a much more acute test than is capturing differences between responses to photographs from the standard Ekman collection.
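In practice a benchmark built on the database would pair each clip with the information to be recovered and score a system's output against the labels. The sketch below shows the bare shape of such a comparison; the clip identifiers, label values and accuracy metric are placeholders rather than a proposed standard task.

```python
def benchmark_accuracy(predictions, reference):
    """Fraction of clips on which a system's label matches the reference.

    The clip identifiers, label values and the use of plain accuracy are
    placeholders for illustration, not a proposed standard task.
    """
    shared = set(predictions) & set(reference)
    if not shared:
        return float("nan")
    hits = sum(predictions[c] == reference[c] for c in shared)
    return hits / len(shared)

# Invented clip ids and quadrant labels
reference = {"clip01": "negative-active", "clip02": "positive-passive"}
predictions = {"clip01": "negative-active", "clip02": "negative-passive"}
print(benchmark_accuracy(predictions, reference))   # 0.5
```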

The records lend themselves to a range of studies. The most routine is simply extending the labelled set. There is a very large body of primary records that remains unlabelled, and its value would be multiplied if labelled versions were available to the community.

A wide range of issues call out for more specific studies. Three will be singled out here. The first is relationships between modalities. It has been pointed out that impressionistically, audio and visual signals sometimes seem to point in very different directions. The data provides opportunities to explore that issue much more systematically. A natural starting point is simply to label a substantial body of material on the basis of audio records only, visual records only and verbal transcripts only. The second is temporal evolution of expressions. Examples like Fig. 10 make it clear that there are rapid, radical changes in moment-by-moment expression of emotion. Some theoretical frameworks predict that change should occur on that kind of timescale (Scherer and Ellgring, 2007). The HUMAINE Database contains a reservoir of naturalistic data that makes it possible to explore these ideas. The third is whether information is localised or distributed. It is not obvious whether information about a person’s emotion is, so to speak, smeared evenly over time or concentrated in a few revealing moments. The material in the database invites research on the topic.
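For the first of these studies, the comparison across modalities could start as simply as correlating per-clip ratings collected under audio-only, video-only and transcript-only presentation. The sketch below shows that computation on invented numbers; the ratings are not data from the HUMAINE Database.

```python
import numpy as np

# Invented per-clip valence ratings collected under three presentation
# conditions (audio only, video only, transcript only); the numbers are
# illustrative, not data from the HUMAINE Database.
ratings = {
    "audio":      np.array([0.6, -0.4, 0.1, 0.8, -0.7]),
    "video":      np.array([0.5, -0.1, 0.2, 0.7, -0.6]),
    "transcript": np.array([0.2, -0.5, 0.0, 0.3, -0.2]),
}

conditions = list(ratings)
for i, a in enumerate(conditions):
    for b in conditions[i + 1:]:
        r = np.corrcoef(ratings[a], ratings[b])[0, 1]
        print(f"{a} vs {b}: r = {r:.2f}")
```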

Beyond these, the database provides a kernel of primary records that help to clarify which kinds of extension make sense. It certainly does not make sense to collect new records at random. A considerable range of states and contexts are probably quite well covered in the data that now exists, and random inventions are quite likely to produce nothing but more stilted variations on the same theme. However, there are areas where information is quite clearly limited, and those are the areas where it makes sense to concentrate effort. An overwhelmingly obvious example is cross-cultural difference. A few HUMAINE techniques have been applied in substantially different cultures. The process needs to be extended radically to provide anything approaching a reasonable representation of the way culture affects the expression of emotion.

Perhaps most fundamental of all, the database invites a cumulative and collaborative attitude. The HUMAINE Database is a product of collaborations between several teams over a period of years. It will take a much larger scale of collaboration to accumulate a reservoir of data sufficient to understand the various ways in which various kinds and combinations of emotion can colour various actions and interactions.