1 Introduction

The ISO 24617-2 annotation standard [10, 11, 30] has been designed for the annotation of spoken, written and multimodal dialogue with information about the dialogue acts that make up a dialogue, with the aim of creating interoperable annotated resources. A dialogue act is a unit in the description of communicative behaviour that corresponds semantically to certain changes that the speaker wants to bring about in the information state of an addressee. ISO 24617-2 defines a dialogue act as

  1. (1)

    Communicative activity of a dialogue participant, interpreted as having a certain communicative function and semantic content.

The communicative function of a dialogue act, such as Propositional Question, Inform, Confirmation, Request, Apology, or Answer, specifies how the act’s semantic content changes the information state of an addressee upon understanding the speaker’s communicative behaviour.

In the annotation schemes that existed prior to the establishment of ISO 24617-2 and its immediate predecessor DIT\( {}^{++} \), such as DAMSL, MRDA, HCRC Map Task, Verbmobil, SWBD-DAMSL, and DIT,Footnote 1 dialogue act annotation consisted of segmenting a dialogue into certain grammatical units and marking up each unit with one or more communicative function labels. The ISO 24617-2 standard supports the annotation of dialogue acts in semantically more complete ways by additionally annotating the following aspects:

  • Dimensions The annotation scheme supports ‘multidimensional’ annotation, where multiple communicative functions may be assigned to dialogue segments; unlike DAMSL and other multidimensional schemes, the ISO scheme uses an explicitly defined notion of ‘dimension’, which corresponds to a certain type of semantic content.

  • Qualifiers are defined for expressing that a dialogue act is performed conditionally, with uncertainty, or with a particular sentiment.

  • Functional and feedback dependence relations link a dialogue act to other units in a dialogue, e.g. for indicating which question is answered by a given answer, or which utterance a speaker is providing feedback about.

  • Rhetorical relations may optionally be annotated to indicate, e.g. that one dialogue act motivates the performance of another dialogue act.

The following example illustrates the use of dimensions, communicative functions, qualifiers, dependence relations, and rhetorical relations (where "#fs1", "#fs2", and "#fs3" indicate the segments in P1’s and P2’s utterances that express a dialogue act—see Section 6.2.2 for more on segmentation).

  1. (2)

    1. P1: Is there an earlier connection?

    2. P2: Ehm,.. no, unfortunately there isn't.

    <diaml xmlns="http://www.iso.org/diaml/">
    <dialogueAct xml:id="da1" target="#fs1"
       sender="#p1" addressee="#p2"
       communicativeFunction="propositionalQuestion" dimension="task"/>
    <dialogueAct xml:id="da2" target="#fs2"
       sender="#p2" addressee="#p1"
       communicativeFunction="stalling" dimension="timeManagement"/>
    <dialogueAct xml:id="da3" target="#fs2"
       sender="#p2" addressee="#p1"
       communicativeFunction="turnTake" dimension="turnManagement"/>
    <dialogueAct xml:id="da4" target="#fs3"
       sender="#p2" addressee="#p1"
       communicativeFunction="answer" dimension="task" sentiment="regret"
       functionalDependence="#da1"/>
    </diaml>
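For readers who want to process DiAML-XML fragments like (2) programmatically, the following minimal Python sketch may be helpful; it uses the standard ElementTree library, and the function and variable names are our own, not part of the standard. It simply collects the attributes of every dialogueAct element, whether or not the elements are namespace-qualified.

import xml.etree.ElementTree as ET

def read_dialogue_acts(diaml_xml: str):
    """Collect the attributes of every dialogueAct element in a DiAML fragment."""
    root = ET.fromstring(diaml_xml)
    acts = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip the namespace part, if any
        if local_name == "dialogueAct":
            # Note: the xml:id attribute appears under its namespace-qualified key.
            acts.append(dict(elem.attrib))
    return acts

if __name__ == "__main__":
    fragment = '''<diaml xmlns="http://www.iso.org/diaml/">
      <dialogueAct xml:id="da1" target="#fs1" sender="#p1" addressee="#p2"
                   communicativeFunction="propositionalQuestion" dimension="task"/>
    </diaml>'''
    for act in read_dialogue_acts(fragment):
        print(act)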

The development of ISO 24617-2 was supported by annotation experiments in which preliminary versions of the scheme were tested for their usability by human annotators and by machine-learned annotation. After its establishment as an international standard in 2012, the standard has been applied in several further corpus annotation, collection, and re-annotation projects. This chapter describes the most substantial of these experiments and annotation efforts.

This chapter is organized as follows. Section 6.2 outlines the use of the ISO 24617-2 annotation scheme. Section 6.3 describes the results of experiments concerned with some of the special features of the annotation scheme. Section 6.4 presents several new and emerging corpora of dialogues, annotated with the ISO 24617-2 annotation scheme. Section 6.5 closes this chapter with concluding remarks and perspectives for future studies and applications using the ISO 24617-2 standard.

2 Annotating with ISO 24617-2

2.1 Features of the ISO 24617-2 Annotation Standard

Dimensions Utterances in dialogue often have more than one communicative function, as several authors have observed [3, 4, 8, 40, 46]. The following dialogue fragment illustrates this:

  1. (3)

    1. Anne: Henry, can you take us through these slides?

    2. Henry: Ehm… sure, just ordering my notes.

In the first utterance, Anne makes a request and assigns the next speaking turn to Henry. In the second utterance, Henry accepts the turn, stalls for time, accepts the request, and explains why he does not fulfill the request right away. The multidimensional DIT\( {}^{++} \) annotation scheme was designed to optimally support the annotation of multifunctional utterances [7]. This scheme is based on a well-founded notion of dimension, inspired by the observation that participation in a dialogue involves a range of communicative activities beyond those strictly related to performing the task or activity that motivates the dialogue. Dialogue participants also perform communicative activities such as giving and eliciting feedback, taking turns, stalling for time, and showing attention; moreover, they often perform several of these activities at the same time. The term ‘dimension’ refers to these various types of communicative activity.

The ISO 24617-2 annotation scheme inherits the following nine dimensions from the DIT\( {}^{++} \) scheme: (1) Task: dialogue acts that move forward the task or activity that motivates the dialogue; (2–3) Feedback, divided into Auto- and Allo-Feedback: acts providing or eliciting information about the processing of previous utterances by the current speaker or by the current addressee, respectively; (4) Turn Management: activities for obtaining, keeping, releasing, or assigning the right to speak; (5) Time Management: acts for managing the use of time in the interaction; (6) Discourse Structuring: dialogue acts dealing with topic management, opening and closing (sub-)dialogues, or otherwise structuring the dialogue; (7–8) Own- and Partner Communication Management: actions by which the speaker edits his own current contribution or a contribution of another speaker, respectively; (9) Social Obligations Management: dialogue acts for dealing with social conventions such as greeting, introducing oneself, apologizing, and thanking.
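Since the nine dimensions form a small, closed inventory, they can be encoded directly in software, for instance as a Python enumeration as sketched below. Only the value strings task, autoFeedback, turnManagement, and timeManagement actually occur in the DiAML examples of this chapter; the remaining strings follow the same camel-case convention and are assumptions.

from enum import Enum

class Dimension(Enum):
    """The nine ISO 24617-2 dimensions inherited from DIT++."""
    TASK = "task"
    AUTO_FEEDBACK = "autoFeedback"
    ALLO_FEEDBACK = "alloFeedback"                                  # assumed value string
    TURN_MANAGEMENT = "turnManagement"
    TIME_MANAGEMENT = "timeManagement"
    DISCOURSE_STRUCTURING = "discourseStructuring"                  # assumed value string
    OWN_COMMUNICATION_MANAGEMENT = "ownCommunicationManagement"     # assumed value string
    PARTNER_COMMUNICATION_MANAGEMENT = "partnerCommunicationManagement"  # assumed value string
    SOCIAL_OBLIGATIONS_MANAGEMENT = "socialObligationsManagement"   # assumed value string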

The ISO 24617-2 inventory of communicative functions consists of 56 of the 88 functions of the DIT\( {}^{++} \) taxonomy.Footnote 2 Some of these are specific to a particular dimension; for instance, Turn Take is specific to Turn Management, Stalling to Time Management, and Self-Correction to Own Communication Management. Other functions can be applied in any dimension; for example, You misunderstood me is an Inform in the Allo-Feedback dimension. All types of question, statement, and answer can be used in any dimension, and the same is true for commissive and directive functions, such as Offer, Suggest, and Request. These functions are called general-purpose functions, as opposed to dimension-specific functions. Table 6.1 lists the communicative functions defined in ISO 24617-2.

Table 6.1 ISO 24617-2 communicative functions

  • Qualifiers The different qualifiers defined in ISO 24617-2 are applicable to different classes of dialogue acts. Sentiment qualifiers are applicable to any dialogue act with a general-purpose function (GPF); conditionality qualifiers to dialogue acts with a commissive or directive function (Promise, Offer, Suggestion, Request, etc.); and certainty qualifiers to dialogue acts with an ‘information-providing’ function (Inform, Agreement, Disagreement, Correction, Answer, Confirm, Disconfirm).

  • Functional Dependence Relations are indispensable for the interpretation of dialogue acts that are responsive in nature, such as Answer, Confirmation, Disagreement, Accept Apology, and Decline Offer. The semantic content of these acts depends crucially on the content of the dialogue act that they respond to. Functional dependence relations connect occurrences of such dialogue acts to their ‘antecedent’ and correspond to links for marking up a segment not only as having the function of an answer, for example, but also indicating which question is answered.

  • Feedback Dependence Relations play a similar role for determining the semantic content of feedback acts, which is co-determined by the utterance(s) that the feedback is about. Feedback acts often refer to the immediately preceding utterance, but can also refer further back and to more than one utterance [39]. The ISO 24617-2 annotation scheme therefore includes links for marking up these dependences; an example occurs in (7).

  • Rhetorical Relations, which have been studied extensively for written texts, also occur in spoken dialogue, in two different ways, as illustrated in the following examples (where the participants talk about remote TV controls):

    1. (4)

      1. A: I can never find them.

      2. B: That’s because they don’t have a fixed location.

    2. (5)

      1. A: Where would you position the buttons?

      2. A: I think that has some impact on many things

In (4) the dialogue acts expressed by A’s and B’s utterances are related by a Cause relation between their respective semantic contents: the content of the second causes the content of the first. In (5), by contrast, the second dialogue act provides a reason for performing the first, so the causal relation holds between the two dialogue acts as wholes, rather than between their semantic contents. The annotation of a rhetorical relation is illustrated in example (8).

Unlike functional and feedback dependences, which are an integral part of dialogue acts with a responsive function and of feedback acts, respectively, rhetorical relations give additional information about the ways in which dialogue acts are semantically or pragmatically related.

2.2 Multidimensional Segmentation

Dialogues are often segmented into turns, defined as stretches of communicative behaviour produced by one speaker, bounded by periods of inactivity of that speaker. Such a segmentation is too coarse for accurate dialogue act annotation, as example (3) above illustrates. More accurate annotation is possible by using ‘functional segments’ as the units to which annotations are attached. Functional segments are defined as the minimal stretches of communicative behaviour that have a communicative function, ‘minimal’ in the sense of not containing material that does not contribute to their communicative function(s). Functional segments are mostly shorter than turns, may be discontinuous, may overlap, and may have parts contributed by different speakers. Functional segments by definition have at least one communicative function, and possibly several. An example of the use of functional segments is shown in (6), where we see the utterance The first train to the airport on Sunday is at…let me see… 6.16 in response to the question What time is the first train to the airport on Sunday? The response has parts with a communicative function in three different dimensions: Task, Auto-Feedback (expressed by the repetition in the second utterance), and Time Management; for each of these dimensions the relevant functional segment is shown in (6), and the DiAML annotation is given in (7).

  1. (6)

    C: What time is the first train to the airport on Sunday?

    I: The first train to the airport on Sunday is at…let me see… 6.16

    Auto-Feedback: fs2 The first train to the airport on Sunday

    Task: fs3 The first train to the airport on Sunday is at 6.16

    Time Man.: fs4 …let me see…

  2. (7)

    <diaml xmlns="http://www.iso.org/diaml/">
    <dialogueAct xml:id="da1" target="#fs1"
         sender="#p1" addressee="#p2"
         communicativeFunction="setQuestion" dimension="task"/>
    <dialogueAct xml:id="da2" target="#fs2"
         sender="#p2" addressee="#p1"
         communicativeFunction="autoPositive"
         dimension="autoFeedback" feedbackDependence="#fs1"/>
    <dialogueAct xml:id="da3" target="#fs3"
         sender="#p2" addressee="#p1"
         communicativeFunction="answer"
         dimension="task" functionalDependence="#da1"/>
    <dialogueAct xml:id="da4" target="#fs4"
         sender="#p2" addressee="#p1"
         communicativeFunction="stalling" dimension="timeManagement"/>
    </diaml>
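A data structure for functional segments must allow for discontinuity (fs3 in (6) skips over "…let me see…") and for overlap (fs2 is properly contained in fs3). The following Python sketch, with illustrative field names, anchors a segment to one or more token spans of the transcription; it is one possible implementation, not part of the standard.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FunctionalSegment:
    """A functional segment anchored to (possibly discontinuous) token spans.

    Spans are (start, end) token-index pairs into the dialogue transcription;
    a segment with more than one span is discontinuous, and segments of
    different dialogue acts may overlap (cf. fs2 and fs3 in example (6))."""
    ident: str                                    # e.g. "fs3"
    speaker: str                                  # e.g. "#p2"
    spans: List[Tuple[int, int]] = field(default_factory=list)

    def tokens(self, words: List[str]) -> List[str]:
        """Return the words covered by this segment, in order."""
        covered: List[str] = []
        for start, end in self.spans:
            covered.extend(words[start:end])
        return covered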

2.3 The Dialogue Act Markup Language (DiAML)

The ISO 24617-2 standard includes the specification of the Dialogue Act Markup Language (DiAML), designed in accordance with the ISO Linguistic Annotation Framework (ISO 24612 [31]), which draws a distinction between the concepts of annotation and representation. The term ‘annotation’ refers to the linguistic information that is added to segments of language data, independent of the format in which the information is represented; ‘representation’ refers to the format in which an annotation is rendered, independent of its content [28].

This distinction is implemented in the DiAML definition following the ISO Principles for Semantic Annotation (ISO 24617-6; see also [9]). The definition specifies, besides a class of XML-based representation structures, also a class of more abstract annotation structures with a formal semantics. These components are called the concrete and abstract syntax, respectively. Annotation structures are set-theoretical structures like pairs and triples, for which the concrete syntax defines an XML-based rendering. An annotation structure is a set of entity structures, which contain semantic information about a functional segment, and link structures, which describe semantic relations between functional segments. An entity structure contains the conceptual information of a single dialogue act, and specifies: (1) a sender; (2) one or more addressees; (3) possible other participants, like an audience or side-participants; (4) a communicative function; (5) a dimension; (6) possible qualifiers for sentiment,Footnote 3 conditionality or certainty; and (7) zero, one or more functional dependence relations or feedback dependence relations.
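The seven components of an entity structure map naturally onto a small record type; the following Python sketch is one possible rendering (the field names are not prescribed by the standard, and participants, segments, and dialogue acts are represented simply by their identifiers).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DialogueActEntity:
    """One entity structure of the DiAML abstract syntax (components 1-7)."""
    sender: str                                                      # (1) e.g. "#p2"
    addressees: List[str]                                            # (2) one or more addressees
    other_participants: List[str] = field(default_factory=list)      # (3) audience, side-participants
    communicative_function: str = ""                                 # (4) e.g. "answer"
    dimension: str = ""                                              # (5) e.g. "task"
    sentiment: Optional[str] = None                                  # (6) qualifiers
    conditionality: Optional[str] = None
    certainty: Optional[str] = None
    functional_dependences: List[str] = field(default_factory=list)  # (7) antecedent dialogue acts
    feedback_dependences: List[str] = field(default_factory=list)    #     utterances the feedback is about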

The concrete syntax, defined following the CASCADES method (see ISO 24617-6 [31] and [9]), has a unit corresponding to entity structures in the form of the XML element dialogueAct, as illustrated in (2). The question asked by participant P1 is represented by the dialogueAct element with identifier da1, which refers to the functional segment fs1 formed by P1’s utterance. Participant P2’s response consists of two functional segments. The first is the turn-initial Ehm,…, a multifunctional segment signalling that P2 takes the turn and stalls for time. The second functional segment contains the actual answer, which includes an expression of regret that is annotated by means of a qualifier, represented as the value of the sentiment attribute.

Functional dependence relations are components of a dialogueAct element, since they form part of a dialogue act viewed as a semantic unit. The same is true for feedback dependence relations as a component of a feedback act, as illustrated in example (7). Rhetorical relations, by contrast, do not play a role in determining the meaning of a dialogue act, but provide additional information about the semantic/pragmatic relations between dialogue acts. They are represented by means of rhetoricalLink elements, as shown in (8).

(8)

1. P4: Where would you position the buttons?

2. P4: I think that has some impact on many things

<diaml xmlns="http://www.iso.org/diaml/">
<dialogueAct xml:id="da1" target="#fs1"
 sender="#p4" addressee="#p3"
 communicativeFunction="setQuestion" dimension="task"/>
<dialogueAct xml:id="da2" target="#fs2"
 sender="#p4" addressee="#p3"
 communicativeFunction="inform" dimension="task"/>
<rhetoricalLink dact="#da2"
 rhetoRelatum="#da1" rhetoRel="cause"/>
</diaml>

3 Experiences in the Use of ISO 24617-2

3.1 Communicative Function Recognition

Multidimensional annotation with a rich inventory of dialogue act tags is often thought to be too difficult, both for human annotators and for automatic annotation, to give reliable results. In order to investigate this, Geertzen and Bunt [24] determined the inter-annotator agreement for assigning communicative functions in the ten dimensions of DIT\( {}^{++} \), nine of which are inherited by ISO 24617-2.

They observed that, when a hierarchically structured tag set is used, the popular standard kappa coefficient [17] is not an appropriate measure of agreement: assigning two different but hierarchically related tags, like Answer and Confirm, or Inform and Agreement, to the same functional segment does not reflect total disagreement, as the standard kappa would assume, but partial (dis)agreement, since a Confirm act is a particular kind of Answer, and an Agreement is a particular kind of Inform. Instead, they used Cohen’s weighted kappa coefficient [18] with a distance metric that takes the hierarchical structure of the tag set into account (see also [34]). The taxonomically weighted kappa is defined as follows:

  1. (9)

    \( \kappa_{tw} = 1 - \dfrac{\sum_{i,j}\left(1-\delta(i,j)\right)\, p_{o,ij}}{\sum_{i,j}\left(1-\delta(i,j)\right)\, p_{e,ij}} \)

where the distance metric \( \delta(i,j) \) measures disagreement and is a real number normalized to the range between 0 and 1 (\( p_{o,ij} \) and \( p_{e,ij} \) are the observed and expected probabilities, respectively). Table 6.2 shows standard and taxonomically weighted kappa scores per ISO 24617-2 dimension, averaged over all annotator pairs, for the DIAMOND corpus.Footnote 4
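As a computational illustration, the following Python sketch computes a hierarchy-aware weighted kappa for two annotators' label sequences, in the standard Cohen formulation with disagreement weights. The tag hierarchy and the distance function are deliberately simplistic and are not necessarily those used in [24]; the sketch only shows how hierarchically related tags receive partial credit.

from collections import Counter
from itertools import product
from typing import Dict, Sequence

def hierarchy_distance(tag_a: str, tag_b: str, parents: Dict[str, str]) -> float:
    """Illustrative taxonomic distance in [0, 1]: 0 for identical tags,
    0.5 if one tag specializes the other (e.g. Confirm is a kind of Answer),
    1 otherwise."""
    if tag_a == tag_b:
        return 0.0
    def ancestors(tag: str) -> set:
        chain = set()
        while tag in parents:
            tag = parents[tag]
            chain.add(tag)
        return chain
    if tag_a in ancestors(tag_b) or tag_b in ancestors(tag_a):
        return 0.5
    return 1.0

def weighted_kappa(labels_a: Sequence[str], labels_b: Sequence[str],
                   parents: Dict[str, str]) -> float:
    """Cohen's weighted kappa: 1 - sum(w_ij * p_obs_ij) / sum(w_ij * p_exp_ij),
    with the taxonomic distance as disagreement weight w_ij."""
    n = len(labels_a)
    tags = sorted(set(labels_a) | set(labels_b))
    observed = Counter(zip(labels_a, labels_b))
    marginal_a, marginal_b = Counter(labels_a), Counter(labels_b)
    numerator = denominator = 0.0
    for i, j in product(tags, repeat=2):
        w = hierarchy_distance(i, j, parents)
        numerator += w * (observed[(i, j)] / n)
        denominator += w * (marginal_a[i] / n) * (marginal_b[j] / n)
    return 1.0 - numerator / denominator

# Toy example: 'confirm' specializes 'answer', 'agreement' specializes 'inform'
parents = {"confirm": "answer", "agreement": "inform"}
a = ["answer", "inform", "confirm", "inform"]
b = ["confirm", "inform", "answer", "agreement"]
print(round(weighted_kappa(a, b, parents), 3))   # partial credit for related tags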

Table 6.2 Standard and weighted kappa-scores for annotator agreement in the annotation of communicative functions, per ISO 24617-2 dimension (adapted from [24])

The agreement scores indicate that human annotators can reliably use a rich, multidimensional annotation scheme like ISO 24617-2 or DIT\( {}^{++} \). The usability and reliability of an annotation scheme are not just a matter of the size or simplicity of the tag set, but rather of the conceptual clarity of the tags, their definitions, and the accompanying annotation guidelines.

3.2 Dimension Recognition

The notion of a dimension, as used in ISO 24617-2 and DIT\( {}^{++} \), is defined as follows:

  1. (10)

    A dimension is a class of dialogue acts concerned with one particular aspect of communication that a dialogue act can address independently from other dimensions [6].

Geertzen et al. [26] assessed the recognizability of dimensions by human annotators and by automatic means. Three annotators independently annotated dialogues from the DIAMOND and OVISFootnote 5 corpora with dimension tags. Table 6.3 presents agreement scores expressed in terms of Cohen’s kappa and tagging accuracy (comparing with a gold standard, see [26]). The table shows near perfect agreement between annotators, and moreover that accuracy is very high. Human annotators can apparently recognize the dimensions of the ISO 24617-2 standard almost perfectly.

Table 6.3 Inter-annotator agreement and tagging accuracy per dimension for the OVIS and DIAMOND corpora

To assess the machine learnability of dimension recognition, the rule induction algorithm Ripper was applied to data from the AMI, OVIS, and DIAMOND corpora. The features included in the data sets relate to prosody (minimum, maximum, mean, and standard deviation of pitch); energy; voicing; duration; occurrence of words (a bag-of-words vector); and dialogue history: tags of ten previous turns. Table 6.4 presents the scores obtained in tenfold cross-validation experiments. The results indicate that the dimensions of DIT\( {}^{++} \) and ISO 24617-2 are automatically recognizable with fairly high accuracy.
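To make the feature set-up concrete, the sketch below assembles simplified prosodic, duration, bag-of-words, and dialogue-history features into a single vector and trains an off-the-shelf classifier on toy data. Ripper itself is not available in scikit-learn, so a decision tree stands in for the rule inducer; the feature values and names are invented for illustration only.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def segment_features(words, pitch_values, duration, history):
    """A simplified version of the feature types listed above."""
    feats = {
        "pitch_min": min(pitch_values),
        "pitch_max": max(pitch_values),
        "pitch_mean": sum(pitch_values) / len(pitch_values),
        "duration": duration,
    }
    for w in words:                                  # bag-of-words features
        feats["word=" + w.lower()] = 1
    for i, tag in enumerate(history[-10:]):          # tags of up to ten previous turns
        feats["hist_%d=%s" % (i, tag)] = 1
    return feats

# Toy training data; the dimension labels are the prediction targets
train = [
    (segment_features(["ehm"], [110, 120], 0.4, ["task"]), "timeManagement"),
    (segment_features(["no", "there", "isn't"], [180, 210, 140], 1.1,
                      ["task", "timeManagement"]), "task"),
]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform([feats for feats, _ in train])
y = [label for _, label in train]
clf = DecisionTreeClassifier().fit(X, y)             # stand-in for the Ripper rule inducer
print(clf.predict(vec.transform([segment_features(["uhm"], [100, 115], 0.3, ["task"])])))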

Table 6.4 Automatic dimension recognition scores in terms of accuracy (in %), with baseline scores (BL, classifier based on the dimension tag of the previous utterance), for AMI, DIAMOND, and OVIS data sets

3.3 Machine-Learned Dialogue Act Recognition

Petukhova and Bunt (2011) investigated the automatic classification of dialogue acts for unsegmented spoken dialogue. Table 6.5 shows the results of the combined classification of dimension and communicative function, using three different ‘local’ classifiers that apply to local utterance features. The DERsc error-rate metric is based on the Dialog Act Error Rate (DER) defined by Zimmermann et al. [46], which considers a word to be correctly classified if it has been assigned the correct dialogue act type and lies in the correct segment. Table 6.6 shows the results for two-step classification (manual segmentation followed by communicative function classification), which can be seen to work better for all dimensions except the Task dimension (the most important one).
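The word-level error metric can be illustrated as follows: in this simplified Python reading of DER, a token counts as correct only if the hypothesis segment covering it has both the same boundaries and the same dialogue act type as the corresponding reference segment. This is an illustration of the idea, not a reimplementation of the DERsc variant used in Table 6.5.

from typing import List, Optional, Tuple

Segment = Tuple[int, int, str]   # (start_token, end_token, dialogue_act_type)

def dialogue_act_error_rate(reference: List[Segment], hypothesis: List[Segment],
                            n_tokens: int) -> float:
    """Fraction of tokens NOT covered by a hypothesis segment with the same
    boundaries and the same dialogue act type as in the reference."""
    def covering(segments: List[Segment], token: int) -> Optional[Segment]:
        for start, end, act in segments:
            if start <= token < end:
                return (start, end, act)
        return None
    errors = sum(1 for t in range(n_tokens)
                 if covering(reference, t) != covering(hypothesis, t))
    return errors / n_tokens

ref = [(0, 3, "setQuestion"), (3, 5, "stalling")]
hyp = [(0, 3, "setQuestion"), (3, 5, "autoPositive")]
print(dialogue_act_error_rate(ref, hyp, 5))   # 0.4: the last two tokens get the wrong act type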

Table 6.5 Overview of F- and DERsc-scores for joint segmentation and classification in each ISO 24617-2 dimension for Map Task data. Best scores in bold face
Table 6.6 Overview of F-scores on baseline (BL) and classifiers for two-step segmentation and classification tasks. Best scores in bold face

The fact that dialogue utterances are often multifunctional, having a communicative function in more than one dimension, makes dialogue act recognition a complex task. Splitting up the task may make it more manageable. A widely used strategy is to split a multi-class learning task into several binary learning tasks; learning multiple classes together, however, allows a learning algorithm to exploit interactions among classes. Petukhova and Bunt (2011) therefore split the task in such a way that a classifier learns either (1) a communicative function in isolation, or (2) a set of semantically related functions together, e.g. all information-seeking functions (all types of questions) or all information-providing functions (all types of answers and informs). In total, 64 classifiers were built for dialogue act recognition in AMI data and 43 for Map Task data.

Using local classifiers that produce all possible output predictions (‘hypotheses’) for a given input leads to some predictions being false, since a local classifier, unlike a human interpreter, never revisits a decision it has made. Decisions should preferably be based not only on local features of the input, but also on broader contextual information. Therefore, Petukhova and Bunt (2011) trained higher-level ‘global’ classifiers that take as input, along with features extracted locally from the input data, the partial output predicted so far by all local classifiers. (This technique is also called ‘meta-classification’ or ‘late fusion’.) Five previously predicted class labels were used, taking into account that the average length of a functional segment in the data is 4.4 tokens. This was found to result in a 10–15 % improvement. Some incorrect predictions are still made, since the decision is sometimes based on incorrect previous predictions.

A strategy to optimize the use of output hypotheses is to perform a global search in the output space looking for best predictions. This is not always the best strategy, however, since the highest-ranking predictions are not always correct in a given context. A possible solution is to postpone the decision until some (or all) future predictions have been made for the rest of the current segment. For training, the classifier then uses not only previous predictions as additional features, but also future predictions of local classifiers. This forces the classifier to not immediately select the highest-ranking predictions, but to also consider lower-ranking predictions that could be better in the context.
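Schematically, this two-level set-up can be pictured as follows: local per-token predictions are produced first and are then appended, for a window of previous and next tokens, to the feature vectors on which a second, 'global' classifier is trained. Classifier choice, feature dimensions, and data in the sketch below are purely illustrative and are not those of Petukhova and Bunt (2011).

import numpy as np
from sklearn.linear_model import LogisticRegression

def add_prediction_context(X, local_preds, window=5):
    """Extend each token's feature vector with the local classifier's predictions
    for up to `window` previous and `window` next tokens (padded with -1)."""
    n = len(local_preds)
    padded = np.full(n + 2 * window, -1)
    padded[window:window + n] = local_preds
    context = np.array([padded[i:i + 2 * window + 1] for i in range(n)])
    return np.hstack([X, context])

# Toy data: 20 tokens, 4 local features, a binary dialogue act class per token
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = np.array([0, 1] * 10)

local = LogisticRegression().fit(X, y)              # 'local' classifier on token features
local_preds = local.predict(X)                      # its (possibly noisy) predictions
X_global = add_prediction_context(X, local_preds, window=5)
global_clf = LogisticRegression().fit(X_global, y)  # 'global' classifier re-uses the predictions
print(global_clf.predict(X_global[:5]))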

Table 6.7 gives an overview of the global classification results based on added previous and next predictions of local classifiers. Both classifiers performed very well, outperforming the use of only local classifiers by a broad margin (cf. Table 6.5). It may be noted that the overall performance reported here is substantially better than the results of other approaches that have been reported in the literature. For instance, Reithinger and Klesen [43] report an average tagging accuracy of 74.7 % when applying techniques based on n-gram modelling to Verbmobil data; transformation-based learning applied to the same data achieved an accuracy of 75.1 % [44]. Hidden Markov Models used for dialogue act classification in the Switchboard corpus gave a tagging accuracy of 71 % [45]; and [33] report an accuracy of 73.8 % for the application to data from the OVIS corpus of a memory-based approach based on the k-nearest neighbour algorithm.

Table 6.7 Overview of F-scores and DERsc when global classifiers are used for AMI and Map Task data, based on added predictions of local classifiers for five previous and five next tokens. Best scores in bold face

Altogether, an incremental, token-based approach with global classifiers that exploit the outputs of local classifiers, applied to previous and subsequent tokens, results in excellent dialogue act recognition scores for unsegmented spoken dialogue. This can be seen as strong evidence for the machine learnability of the ISO 24617-2 annotation scheme.

3.4 Qualifier Recognition

The recognition of dialogue act qualifiers by human annotators was investigated by Petukhova [36]. The task in these experiments, involving four untrained annotators (undergraduate students), was to assign qualifier values to functional segments in pre-annotated dialogue fragments from the AMI corpus and the TRAINS corpus.Footnote 6

Table 6.8 shows that there are no systematic differences between annotators in assigning values for qualifier tags. They achieved moderate agreement (0.4 < κ < 0.6) on labelling certainty for the AMI data; the agreement for this category when labelling TRAINS dialogues is substantial (0.6 < κ < 0.8). The difference can be explained by the fact that AMI dialogues are more difficult to annotate for untrained annotators: AMI meetings are considerably more complex, as they are both multi-party and multi-modal. The best recognized category is conditionality, for which annotators achieved substantial to near perfect agreement (κ > 0.8).

Table 6.8 Cohen’s kappa scores for inter-annotator agreement on the assignment of qualifiers per annotator pair for AMI and TRAINS data

Inter-annotator agreement scores for certainty and sentiment were influenced negatively by the fact that one of the values that annotators could choose for these qualifiers was ‘neutral’; some annotators assigned this qualifier to every segment that did not clearly express a certainty or a sentiment, while others assigned a certainty or a sentiment qualifier only to those segments which they judged as expressing a particular sentiment or (un)certainty.

4 Annotated Corpora

4.1 The DBOX Corpus

In the European project DBOX,Footnote 7 which aims to develop interactive games based on spoken natural language human-computer dialogues, a corpus has been collected in a Wizard-of-Oz setting. A set of quiz games was designed where the Wizard holds the facts about a famous person’s life and the player’s task is to guess this person’s name by asking questions.

In total 338 dialogues were collected with a total duration of 16 h, comprising about 6000 speaking turns. The collected data has been transcribed and annotated using the ISO 24617-2 annotation scheme. Table 6.9 shows that inter-annotator agreement between two trained annotators ranged between 0.55 and 0.94 in terms of Cohen’s kappa for segmentation and between 0.55 and 1.00 for the annotation of dialogue acts in the various dimensions (see [38] for details). For relations between dialogue acts the agreements ranged from 0.66 to 0.88.

Table 6.9 Inter-annotator agreement on segmentation and annotation of communicative functions per ISO dimension and on annotation of relations of the ISO relation types

4.2 Youth Parliament Debate Data

As part of the FP 7 European project Metalogue,Footnote 8 data have been analysed from three sessions of the UK Youth Parliament (YP). The sessions are video-recorded and available on YouTube.Footnote 9 In these sessions, the YP members, aged 11–18, debate issues concerning sex education, university tuition fees, and job opportunities for young people.

The annotated corpus consists of 1388 functional segments from 35 speakers. Table 6.10 provides an overview of the relative frequencies of functional tags per ISO-dimension.

Table 6.10 Distribution of functional tags across ISO-dimensions in the UK YP corpus

Of the dialogue acts in the Task dimension, 41.4 % are Inform acts, which are often connected by rhetorical relations. For example:

  1. (11)

    D121: Let us be clear, sex education covers a wide range of issues

       affecting young people [Inform]

    D122: These include safe sex practices, STIs and legal issues

       surrounding consent and abuse [Inform Elaboration D121]

The ISO 24617-2 standard does not prescribe the use of any particular set of rhetorical relations; for the annotation of the DBOX corpus, a combination of the relation hierarchy used in the Penn Discourse Treebank (PDTB, [41]) and the taxonomy defined in [27] was used. Table 6.11 shows the distribution in the corpus of the rhetorical relations associated with Inform acts. The corpus is used for designing the Dialogue Manager module of the dialogue system that is built in the Metalogue project.

Table 6.11 Distribution of rhetorical relations associated with Inform acts in the corpus

4.3 The SWBD-ISO Corpus

Fang and collaborators made an effort to assign ISO 24617-2 annotations to the dialogues in the Switchboard Dialog Act (SWBD-DA) corpus (see Fang et al. [20, 22]).Footnote 10 This resource contains 1155 5-min conversations, orthographically transcribed into about 1.5 million word tokens. Each utterance in the corpus is segmented into ‘slash units’, defined as “maximally a sentence; slash units below the sentence level correspond to parts of the narrative which are not sentential but which the annotator interprets as complete” [35]. The corpus comprises 223,606 slash units, which are annotated with a communicative function tag from the SWBD-DAMSL annotation scheme, a variation of the DAMSL scheme defined specifically for this purpose [32]. See example (12), where ‘qy’ is the SWBD-DAMSL tag for yes/no questions and ‘utt1’ indicates the first slash unit within a turn.

  1. (12)

    qy A.1 utt1: { D Well, } { F uh, } does the company you work for test for drugs? /

In addition to this marking up of communicative functions, in-line markups are also used to mark ‘discourse markers’ such as { D Well, }, which often signal a rhetorical relation; filled pauses, like { F uh, }; restarts and repetitions, such as [I think, I think]; and some other types of ‘disfluencies’.

To assess the possibility of converting SWBD-DA annotations to ISO 24617-2 annotations, first a detailed comparison was made of the two sets of communicative functions, revealing 14 one-to-one correspondences and 26 many-to-one equivalences. These tags can thus be converted automatically to ISO tags, which accounts for 83.97 % of the SWBD-DAMSL tags in the corpus. Six SWBD-DAMSL function tags have a one-to-many correspondence with 26 ISO tags, corresponding to 5.74 % of the Switchboard corpus; about 30 % of these cases can be converted automatically to an ISO tag by taking the tagging of the preceding slash unit into account; for example, an utterance tagged ‘aa’ (i.e., Accept) following an offer should be assigned the ISO tag Accept Offer, while it should be assigned the ISO tag Accept Request when following a request. For those cases where such a contextual disambiguation does not help, manual annotation was performed (see Fang et al. [22]).Footnote 11
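The context-dependent part of the conversion can be pictured as a lookup that consults both the SWBD-DAMSL tag and the ISO function already assigned to the preceding slash unit. The mapping below is a tiny illustrative subset (only 'qy', 'qw', 'sd', and the 'aa' case discussed above, with camel-case ISO tag strings assumed); the actual mapping tables of Fang et al. [22] are far more extensive.

from typing import Optional

# One-to-one / many-to-one mappings (illustrative subset only)
DIRECT_MAP = {
    "qy": "propositionalQuestion",   # yes/no question
    "qw": "setQuestion",             # wh-question
    "sd": "inform",                  # statement (non-opinion)
}

# Context-dependent mappings: (SWBD-DAMSL tag, ISO function of preceding unit) -> ISO tag
CONTEXT_MAP = {
    ("aa", "offer"): "acceptOffer",
    ("aa", "request"): "acceptRequest",
}

def swbd_to_iso(tag: str, previous_iso: Optional[str]) -> Optional[str]:
    """Convert a SWBD-DAMSL tag to an ISO 24617-2 function tag, using the tag of
    the preceding slash unit for disambiguation; None means manual annotation."""
    if tag in DIRECT_MAP:
        return DIRECT_MAP[tag]
    return CONTEXT_MAP.get((tag, previous_iso))

print(swbd_to_iso("aa", "offer"))    # acceptOffer
print(swbd_to_iso("aa", "inform"))   # None: falls back to manual annotation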

Altogether, through combined automatic conversion and manual annotation, 200,605 utterances (89.71 % of the Switchboard corpus) were assigned ISO 24617-2 communicative function tags. Table 6.12 shows the distribution of function tags in the resulting ‘SWBD-ISO’ corpus.

Table 6.12 Distribution of ISO 24617-2 communicative function tags in the SWBD-ISO corpus

4.4 The DialogBank

In a recent initiative at Tilburg University a publicly available corpus has been created called the DialogBank, which consists of dialogues with gold standard annotations in DiAML according to the ISO 24617-2 standard. While recommending the use of XML for representing annotation structures as defined by the DiAML abstract syntax, the standard allows representations in other formats as long as these have the properties of being (1) complete, i.e. defining a rendering of any annotation structure defined by the abstract syntax, and (2) unambiguous, i.e. every representation encodes only one annotation structure. Representation formats that have these properties can be converted to and from the DiAML-XML format without loss of information. For some of the dialogues in the DialogBank, an alternative tabular representation format was defined that has these properties and that is more convenient for human readers (see Bunt et al. [14]).

The annotations include not only the multidimensional marking up of communicative functions and dimensions, but also of functional dependence relations; feedback dependence relations; rhetorical relations; and qualifiers for certainty, conditionality and sentiment.

The DialogBank currently contains dialogues taken from four English-language corpora: the HCRC Map Task, Switchboard, TRAINS, and DBOX corpora, and four Dutch-language corpora: the OVIS, DIAMOND, Dutch Map Task,Footnote 12 and SchipholFootnote 13 corpora. Addition is foreseen of dialogues from the AMI corpus, the YP corpus, and several other corpora.

4.4.1 Map Task and DBOX Dialogues

The Map Task and DBOX dialogues in the DialogBank were annotated using the ANVIL tool, in which a facility has been created to export annotations in the DiAML-XML reference format of ISO 24617-2 [13]. Example (14) in the Appendix shows the result for a very short dialogue fragment. This format is well suited for machine consumption, but rather inconvenient for human readers, for example when checking the correctness of annotations. The more compact tabular formats shown below are more attractive in that respect.

The DBOX application (quiz game dialogues) called for some small extensions to the ISO annotation scheme, which were made in accordance with the guidelines included in the ISO 24617-2 standard for extending the annotation scheme. Two additional dimensions were introduced: Task Management (also familiar from DAMSL), for dialogue acts where the rules of the game are discussed, and Contact Management, also familiar from DIT\( {}^{++} \), for dialogue acts where the participants establish, check, or end contact between them.

4.4.2 Switchboard Dialogues

The dialogues in the Switchboard corpus were originally represented in a 3-column tabular format where the leftmost column contains an identifier of the slash unit in the third column, and the middle column contains an SWBD-DAMSL function tag.Footnote 14 In constructing the SWBD-ISO corpus, all in-line markups of filled pauses were replaced by Stalling tags and in-line markups of restarts by SelfCorrection tags. The result looks as shown in (13).

(13)

sw01-0105-0001-A001-01   setQuestion      A.1 utt1: Jimmy, {D so } how do you get most of your news? /
sw01-0105-0002-B002-01   stalling         B.2 utt1: {D Well, } [ I kind of, +
                         selfCorrection   {F uh, } I ] watch the,
                         stalling         {F uh, } national news
                         answer           everyday, for one /
sw01-0105-0003-B002-02   answer           B.2 utt2: I also read one or two papers a day /
sw01-0105-0004-B002-03   selfCorrection
                         inform           B.2 utt3: {C and } [ I’m a, + I’m pretty much a ] news junkie /
sw01-0105-0005-B002-04   answer           B.2 utt4: {C and } I tune in to CNN a lot./
sw01-0105-0006-A003-01   autoPositive     A.3 utt1: {F Oh, } wow /

While convenient for human readers, this format is not optimal for computer processing. The numbering of speaker turns and slash units is redundant (and turns have no special status in the ISO standard), and the rightmost column contains a mixed bag of information types (speaker, turn number, slash unit number within turn, transcribed slash unit, and disfluency and other markups). It could be converted to an XML representation like DiAML-XML by interpreting the first column as the values of the xml:id attribute, the second as the values of the communicativeFunction attribute, and the third as the values of the sender and target attributes and the textual rendering of slash units. However, representations like (13) differ from DiAML-annotations in three fundamental respects: (1) slash units do not always correspond to functional segments, which in general form a more fine-grained way of segmenting a dialogue; (2) the use of in-line markups goes against the ISO requirement that annotations should be in stand-off form; and (3) annotations according to ISO 24617-2 contain more information than just communicative functions, in particular also dimensions, qualifiers, and dependence relations, which are semantically indispensable.
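The naive column-to-attribute mapping just described can be written down directly; the sketch below turns one row of the 3-column format into a dialogueAct element. The speaker extraction, the generated identifiers, and the omission of dimension information are simplifications, and the three differences noted above (segmentation granularity, in-line markups, missing dimensions and other information) are ignored.

from xml.sax.saxutils import quoteattr

def row_to_dialogue_act(segment_id: str, function: str, utterance: str) -> str:
    """Column 1 -> xml:id/target, column 2 -> communicativeFunction,
    column 3 -> sender (speaker letter) and the slash unit text."""
    speaker = utterance.split(".", 1)[0].strip()      # e.g. "A" from "A.1 utt1: ..."
    return ('<dialogueAct xml:id="da-%s" target="#%s" sender="#%s" '
            'communicativeFunction=%s/>'
            % (segment_id, segment_id, speaker.lower(), quoteattr(function)))

print(row_to_dialogue_act("sw01-0105-0001-A001-01", "setQuestion",
                          "A.1 utt1: Jimmy, {D so } how do you get most of your news? /"))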

These three differences are taken into account in the design of a tabular representation format, called ‘DiAML-TabSW’, which stays relatively close to the format of (13) and facilitates comparison between the SWBD-DAMSL and ISO annotation schemes. For incorporating annotated Switchboard dialogues into the DialogBank, first, existing annotated dialogues were re-segmented into functional segments, and the functional segments that do not correspond to a slash unit were newly annotated with ISO 24617-2 communicative function tags and dimension tags. Second, a copy was made of the slash unit transcriptions in which all in-line markups were interpreted in terms of communicative functions, rhetorical relations, or qualifiers whenever possible, and removed. Third, the functional segments were represented in stand-off fashion by referring to a file that contains segment definitions in terms of word tokens or time points. Finally, the annotations of functional segments were enriched with functional and feedback dependences, qualifiers, and rhetorical relations.

Figure 6.1 shows the resulting representation. The first four columns represent the annotations proper: (1) functional segment identifiers; (2) dialogue act identifiers; (3) dialogue acts; and (4) sender, with much of the information concentrated in the third column: dimension, communicative function, dependences (as in “Ta:answer (da2)”), qualifiers, and rhetorical relations. The fifth and sixth columns, containing functional segment texts and turn transcripts, have been added for the convenience of human readers and have no formal status.

Fig. 6.1 ISO 24617-2 annotation of the dialogue fragment in example (13), represented in DiAML-TabSW format (Ta = Task, TiM = Time Management, TuM = Turn Management, OCM = Own Communication Management, AuF = Auto-Feedback)

4.4.3 Other Annotated Dialogues and Their Representation

The dialogues in the DIAMOND corpus were originally annotated with the DIT\( {}^{++} \) annotation scheme, for which the DitAT annotation tool was developed [23]; this tool produces representations in a multi-column tabular format with a separate column for each dimension. For the inclusion of ISO 24617-2 versions of these annotations in the DialogBank, a new multi-column tabular format was defined, the ‘DiAML-MultiTab’ format, with one column identifying functional segments in stand-off fashion, as in the DiAML-TabSW format, one column indicating the speaker, and one column per dimension for representing communicative functions, qualifiers, dependence relations, and rhetorical relations. Figure 6.2 illustrates this format, which has been shown to be convertible to DiAML-XML and back without loss of information [14]. In the example, the columns corresponding to dimensions in which no communicative functions were marked up for this fragment have been suppressed.

Fig. 6.2 ISO 24617-2 annotation of a TRAINS dialogue fragment, represented in DiAML-MultiTab format

The DiAML-MultiTab format was used also for representing re-annotated dialogues from the OVIS and TRAINS corpora, and newly annotated Schiphol dialogues.

5 Conclusions and Perspectives

The ISO 24617-2 standard for dialogue annotation has as its main features a rich taxonomy of clearly defined communicative functions, including many functions from previously developed annotation schemes such as DAMSL, DIT\( {}^{++} \), and ICSI-MRDA; the distinction of nine dimensions, inherited from the DIT\( {}^{++} \) scheme; functional and feedback dependence relations that account for semantic dependences between dialogue acts; the use of qualifiers for expressing (un)certainty, conditionality, and sentiment; and rhetorical relations among dialogue acts. In this chapter, experiences and experiments were discussed that investigate how these features play out in human and automatic dialogue annotation.

New and emerging corpora were discussed that contain dialogues, annotated according to the ISO 24617-2 standard, notably the DBOX, YP, and DialogBank corpora. Such resources offer a promising basis for the study of human communication as well as for the design and training of modules in dialogue systems, such as recognizers of communicative functions in human interactive behaviour, and dialogue managers in speech-based or multimodal dialogue systems.