1 Introduction

Storytelling serves many different social functions, e.g. stories are used to persuade, share troubles, establish shared values, learn social behaviors, and entertain [24, 33]. Moreover, stories are often told conversationally through dialog [38, 39] where the telling of a story is shaped by the personality of both the teller and the listener. For example, extraverted friends actively engage one another in constructing the action of the story by peppering the storyteller with questions, and by asking the listener to guess what happened [38, 39]. Thus the same story can be told in many different ways, often achieving different effects [22, 40].

A system capable of telling a story and then retelling it in different settings to different audiences requires two components: (1) a deep representation of the story and (2) algorithms that render the story content as different discourse instantiations. A deep representation of the story’s content, often called the story or fabula, must specify the events, characters, and props of the story, as well as relations among them, including reactions of characters to story events. This is accomplished through est [31], a framework that bridges the story annotation tool scheherazade and a natural language generator (NLG).

The discourse representation is the surface rendering of the fabula, an instantiated expressive telling of a story as a stream of words, gestures or expressions [3, 8, 28, 29]. This paper presents m2d, a framework with algorithms that manipulate the story content to retell the story as a conversational dialog between two people. An example of the original, monologic, and dialogic telling of the Garden Story is shown in Fig. 1. Note that highlighted areas indicate the same information being presented differently at different stages.

Fig. 1. Garden story: original version and monologue/dialog generation. Highlighted areas indicate examples of the same information.

We build on the publicly available PersonaBank corpus, which provides us with the deep story representation and a lexico-syntactic representation of its monologic retelling [14]. PersonaBank consists of a corpus of monologic personal narratives from the ICWSM Spinn3r Corpus [6] that are annotated with a deep story representation called a story intention graph [13]. After annotation, the stories are run through the est system to generate corresponding deep linguistic structure representations. m2d then takes these representations as input and creates dialog with different character voices. We identify several stories by hand as good candidates for dialogic tellings because they describe events or experiences that two people could have experienced together.

Our primary hypothesis is H1: Dialogic tellings of stories are more engaging than monologic tellings. We also hypothesize that good dialog requires the use of narratological variations such as direct speech, first person, and focalization [14]. Moreover, once utterances are rendered as first-person with direct speech, then character voice becomes relevant, because it does not make sense for all the characters and the narrator to talk in the same voice. Thus our primary hypothesis H1 entails two additional hypotheses H2 and H3:

  • H2: Narratological variations such as direct speech, first person, and focalization will affect a reader’s engagement with a story.

  • H3: Personality-based variation is a key aspect of expressive variation in storytelling, both for narrators and story characters. Changes in narrator or character voice may affect empathy for particular characters, as well as engagement and memory for a story.

Our approach to creating different character voices is based on the Big Five theory of personality [1, 9]. It provides a useful level of abstraction (e.g., extraverted vs. introverted characters) that helps to generate language and to guide the integration of verbal and nonverbal behaviors [11, 16, 21].

To the best of our knowledge, our work is the first to develop and evaluate algorithms for automatically generating different dialogic tellings of a story from a deep story representation, and the first to evaluate the utility and effect of parameterizing the style of speaker voices (personality) while telling the story.

2 Background and Motivation

Stories can be told either as a dialog or as a monolog, and in many natural settings storytelling is conversational [4]. Hypothesis H1 posits that dialogic tellings of stories will be more engaging than monologic tellings. In storytelling and at least some educational settings, dialogs have cognitive advantages over monologs for learning and memory. Students learn better from a verbally interactive agent than from reading text, and they also learn better when they interact with the agent through a personalized dialog (whether spoken or written) than through a non-personalized monolog [20]. Our experiments compare different instances of the dialog, e.g. to test whether more realistic conversational exchanges affect whether people become immersed in the story and affected by it.

Previous work supports H2, claiming that direct, first-person speech increases stories’ drama and memorability [34, 37]. Even when a story is told as a monolog or with third-person narration, dialog is an essential part of storytelling: in one study of 7 books, between 40 % and 60 % of the sentences were dialog [7]. In general, narratives are mentally simulated by readers [35], but readers also enact a protagonist’s speech according to her speech style, reading more slowly for a slow-speaking protagonist and more quickly for a fast-speaking protagonist, both aloud and silently [43]. However, the speech simulation only occurred for direct quotation (e.g. She said “Yeah, it rained”), not indirect quotation (e.g. She said that it had rained). Only direct quotations activate voice-related parts of the brain [43]; they create a more vivid experience because they express enactments of previous events, whereas indirect quotations merely describe events [42].

Several previous studies also suggest H3, that personality-based variation is a key aspect of storytelling, both for narrators and story characters. Personality traits have been shown to affect how people tell stories as well as their choices of stories to tell [17]. People also spontaneously encode trait inferences from everyday life when experiencing narratives, and they derive trait-based explanations of characters’ behavior [30, 32]. Readers use these trait inferences to make predictions about story outcomes and prefer outcomes that are congruent with trait-based models [30]. The finding that the behavior of the storyteller is affected by the personality of both the teller and the listener also motivates our algorithms for monolog-to-dialog generation [38, 39]. Content allocation should be controlled by the personality of the storyteller (e.g. enabling extraverted agents to be more verbose than introverted agents).

Previous work on generation for fictional domains has typically combined story and discourse, focusing on the generation of story events and then using a direct text realization strategy to report those events [18]. This approach cannot support generation of different tellings of a story [23]. Previous work on generating textual dialog from monolog suggests the utility of adding extra interactive elements (dialog interaction) to storytelling, and offers some strategies for doing so [2, 25, 36]. In addition, expository or persuasive content rendered as dialog is more persuasive and memorable than the same content presented as monolog [26, 27, 41]. None of this previous work attempts to generate dialogic storytelling from original monologic content.

Fig. 2. m2d pipeline architecture.

3 M2D: Monolog-to-Dialog Generation

Figure 2 illustrates the architecture of m2d. The est framework produces a story annotated by scheherazade as a list of Deep Syntactic Structures (DsyntS). DsyntS, the input format for the surface realizer RealPro [12, 19], is a dependency-tree structure where each node contains the lexical information for the important words in a sentence. Each sentence in the story is represented as a DsyntS.
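As a rough illustration of this structure, the sketch below models a DsyntS-like node as a small Python class; the class, field, and relation names are our own simplification and do not reproduce RealPro’s actual input format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DSyntNode:
    """One node of a simplified deep-syntactic dependency tree.

    Each node stores a lexeme plus grammatical features; children are
    attached under labeled relations (I, II, ATTR, ...), loosely following
    deep-syntactic conventions. Purely illustrative.
    """
    lexeme: str                                     # base form, e.g. "run"
    pos: str = "verb"                               # part-of-speech class
    rel: Optional[str] = None                       # relation to the parent
    features: dict = field(default_factory=dict)    # e.g. {"tense": "past"}
    children: List["DSyntNode"] = field(default_factory=list)

# "The man ran to the big store." (cf. Fig. 3), as a toy tree:
sentence = DSyntNode("run", "verb", features={"tense": "past"}, children=[
    DSyntNode("man", "noun", rel="I", features={"article": "def"}),
    DSyntNode("to", "preposition", rel="ATTR", children=[
        DSyntNode("store", "noun", rel="II", features={"article": "def"}, children=[
            DSyntNode("big", "adjective", rel="ATTR"),
        ]),
    ]),
])
```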

m2d converts a story (as a list of DsyntS) into different versions of a two-speaker dialog using a parameterizable framework. The input parameters control, for each speaker, the allocation of content, the usage of questions of different forms, and the usage of various pragmatic markers (Table 1). We describe the m2d parameters in more detail below.

Table 1. Dialog conversion parameters
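As a hypothetical illustration of how such per-speaker parameters might be organized (the keys, groupings, and values below are ours, not the system’s actual parameter-file format):

```python
# Hypothetical per-dialog parameter set for m2d-style conversion.
dialog_params = {
    "content_allocation": 0.7,        # share of content given to speaker A (0..1)
    "speaker_A": {
        "questions": {"wh": 0.3, "yes_no": 0.2, "tag": 0.1},   # insertion probabilities
        "pragmatic_markers": {"emphasizers": 0.4, "acknowledgements": 0.3},
        "repetition": 0.1,            # chance of paraphrasing the other speaker
    },
    "speaker_B": {
        "questions": {"wh": 0.05, "yes_no": 0.05, "tag": 0.0},
        "pragmatic_markers": {"hedges": 0.4, "disfluencies": 0.2},
        "repetition": 0.3,
    },
}
```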

Content Allocation: We allocate the content of the original story between the two speakers using a content-allocation parameter that ranges from 0 to 1. A value of .5 means that the content is split equally between the two speakers. This is motivated by the fact that, for example, extraverted speakers typically provide more content than introverted speakers [16, 38].
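A minimal sketch of this allocation step, assuming an independent biased coin flip per sentence (the actual allocation policy may differ):

```python
import random

def allocate_content(sentences, alloc=0.5, seed=0):
    """Split a list of sentence structures between speakers A and B.

    `alloc` is the probability that any given sentence is assigned to
    speaker A; 0.5 yields a roughly even split, while higher values let A
    (e.g. an extraverted speaker) carry more of the story.
    """
    rng = random.Random(seed)
    return [("A" if rng.random() < alloc else "B", sent) for sent in sentences]

# Example: an extraverted speaker A receives roughly 70% of the content.
print(allocate_content(["s1", "s2", "s3", "s4"], alloc=0.7))
```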

Character and Property Database: We use the original source material for the story to infer information about actors, items, groups, and other properties of the story, using the information specified in the DsyntS. We create actor objects for each character and track changes in the actor states as the story proceeds, as well as changes in basic properties such as their body parts and possessions.
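A toy version of such an actor database might look as follows; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Actor:
    """Minimal record for one story character; fields are illustrative."""
    name: str
    states: list = field(default_factory=list)       # e.g. ["happy", "sad"]
    possessions: set = field(default_factory=set)    # props owned by the actor

class ActorDatabase:
    def __init__(self):
        self.actors = {}

    def get(self, name):
        return self.actors.setdefault(name, Actor(name))

    def update_state(self, name, state):
        # Record a state change as the story proceeds (most recent last).
        self.get(name).states.append(state)

db = ActorDatabase()
db.update_state("fox", "happy")
db.get("fox").possessions.add("grapes")
```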

Aggregation and Deaggregation: We break apart long DsyntS into smaller DsyntS, and then check where we can merge small and/or repetitious DsyntS. We believe that deaggregation will improve our dialogs’ overall clarity, while aggregation will make our content feel more connected [15].

Fig. 3. The DsyntS tree for The man ran to the big store.

Content Elaboration: In natural dialog, speakers often repeat or partially paraphrase each other, presenting the same content in multiple ways. This can be a key part of entrainment. Speakers may also ask each other questions, thereby setting up frames for interaction [38]. In our framework, this involves duplicating content in a single DsyntS by either (1) generating a question/answer pair from it and allocating the content across speakers, or (2) duplicating it and then generating paraphrases or repetitions across speakers. Questions are generated by performing a series of pruning operations based on the class of the selected node and its relationship with its parent and siblings. For example, if store in Fig. 3 is selected, we identify this node as the focus of the question. The class of a node indicates the rules our system must follow when making deletions. Since store is a noun, we prune away all of the attr siblings that modify it. Noticing that it is part of a prepositional phrase, we delete store along with the preposition to, generating The man ran where?.
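The pruning step for this example could be sketched as below, using a plain nested-dict tree rather than the actual DsyntS representation; only the noun-inside-prepositional-phrase rule is shown, and the function name is ours.

```python
import copy

def prune_to_where_question(tree, target="store"):
    """Sketch of pruning-based question generation over a nested-dict tree.

    Locates the target noun inside a prepositional phrase, drops the phrase
    (and with it the noun's modifiers), and substitutes "where", turning
    "The man ran to the big store." into "The man ran where?".
    """
    q = copy.deepcopy(tree)
    q["children"] = [
        {"lexeme": "where", "pos": "adverb"}
        if child["pos"] == "preposition"
        and any(g["lexeme"] == target for g in child.get("children", []))
        else child
        for child in q.get("children", [])
    ]
    return q

declarative = {
    "lexeme": "run", "pos": "verb", "children": [
        {"lexeme": "man", "pos": "noun"},
        {"lexeme": "to", "pos": "preposition", "children": [
            {"lexeme": "store", "pos": "noun", "children": [
                {"lexeme": "big", "pos": "adjective"}]},
        ]},
    ],
}
print(prune_to_where_question(declarative))
```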

Content Extrapolation: We make use of the deep underlying story representation and the actor database to make explicit those inferences that are not actually part of the original discourse. For example, the actor database tracks aspects of a character’s state. By using known antonyms of the adjective defining the current state, we can insert content for state changes, e.g. the alteration from the fox is happy to now, the fox is sad, where the fox is the actor and happiness is one of his states. This also allows us to introduce new dialogic interactions by having one speaker ask the other about the state of an actor, or by having one speaker say something incorrect that the second speaker can then contradict: The fox was hungry followed by No, he wasn’t hungry, he was just greedy.
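A minimal sketch of this extrapolation, assuming a hand-built antonym table in place of the deep story representation:

```python
# Toy antonym table; a real system might derive these from a lexical resource.
ANTONYMS = {"happy": "sad", "full": "hungry", "rich": "poor"}

def state_change_line(actor, old_state):
    """Make an implicit state change explicit, e.g. 'the fox is happy'
    followed later by 'now, the fox is sad'."""
    new_state = ANTONYMS.get(old_state)
    return None if new_state is None else f"Now, the {actor} is {new_state}."

def contradiction_turn(actor, wrong_state, right_state):
    """Give one speaker a deliberately wrong claim for the other to correct."""
    claim = f"The {actor} was {wrong_state}."
    correction = f"No, he wasn't {wrong_state}, he was just {right_state}."
    return claim, correction

print(state_change_line("fox", "happy"))
print(contradiction_turn("fox", "hungry", "greedy"))
```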

Pragmatic Markers: We can also insert pragmatic markers and tag questions as described in Table 1. Particular syntactic constraints are specified for each pragmatic marker that control whether the marker can be inserted at all [16]. The frequency and type of insertions are controlled by values in the input parameter file. Some parameters are grouped by default into sets whose members can be used interchangeably, such as downtoners or emphasizers. To provide more control over the variability of generated variants, specific markers that are by default unrelated can be packaged together and share a distributed frequency limit. Because of their simple nature and small number of constraints, pragmatic markers prove to be a reliable source of variation in the system’s output.
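The frequency-limited insertion could be approximated as follows; the marker groups, caps, and probabilities are illustrative, and the per-marker syntactic constraints are omitted.

```python
import random

# Illustrative marker groups with shared insertion budgets.
MARKER_GROUPS = {
    "hedges": {"markers": ["I guess,", "I mean,"], "max_per_dialog": 3},
    "emphasizers": {"markers": ["Basically,", "Actually,"], "max_per_dialog": 2},
    "acknowledgements": {"markers": ["Mhmm,", "Right,"], "max_per_dialog": 2},
}

def insert_markers(turns, enabled_groups, prob=0.3, seed=0):
    """Probabilistically prepend markers to dialog turns, respecting each
    group's frequency limit; at most one marker is added per turn."""
    rng = random.Random(seed)
    used = {g: 0 for g in enabled_groups}
    out = []
    for speaker, text in turns:
        for group in enabled_groups:
            spec = MARKER_GROUPS[group]
            if used[group] < spec["max_per_dialog"] and rng.random() < prob:
                text = f"{rng.choice(spec['markers'])} {text}"
                used[group] += 1
                break
        out.append((speaker, text))
    return out

print(insert_markers([("A", "the fox was hungry."), ("B", "he climbed the tree.")],
                     ["hedges", "acknowledgements"]))
```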

Lexical Choice: We can also replace a word with one of its synonyms. This can be driven simply by a desire for variability, or by lexical choice parameters such as word frequency or word length.
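For instance, a toy lexical-choice step (with an invented synonym table) might prefer shorter synonyms:

```python
# Toy synonym table keyed by lexeme; a real system might filter candidates
# by corpus frequency or word length rather than taking the first entry.
SYNONYMS = {"big": ["large", "huge"], "run": ["dash", "sprint"]}

def choose_lexeme(word, prefer_short=False):
    """Return a synonym for `word`, optionally preferring shorter candidates."""
    candidates = SYNONYMS.get(word, [word])
    return min(candidates, key=len) if prefer_short else candidates[0]

print(choose_lexeme("big", prefer_short=True))   # -> "huge"
```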

Morphosyntactic Postprocessing: The final postprocessing phase forms contractions and possessives and corrects known grammatical errors.
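A text-level sketch of a few such rules is shown below; the real pass is more extensive and may operate on the DsyntS before realization rather than on realized text.

```python
import re

# A few illustrative contraction rules; not the system's full rule set.
CONTRACTIONS = [
    (r"\bdo not\b", "don't"),
    (r"\bwas not\b", "wasn't"),
    (r"\bit is\b", "it's"),
]

def postprocess(text):
    """Form contractions and simple possessives on realized text."""
    for pattern, repl in CONTRACTIONS:
        text = re.sub(pattern, repl, text)
    # "the hat of the man" -> "the man's hat" (very rough, illustrative only)
    text = re.sub(r"the (\w+) of the (\w+)", r"the \2's \1", text)
    return text

print(postprocess("The fox said that it is not fair. He took the hat of the man."))
```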

The results of the m2d processor are then given as input to RealPro [12], an off-the-shelf surface text realizer. RealPro is responsible for enforcing English grammar rules, morphology, correct punctuation, and inserting functional words in order to produce natural and grammatical utterances.

4 Evaluation Experiments

We assume H1 on the basis of previous experimental work, and test H2 and H3. Our experiments aim to: (1) establish whether and to what degree the m2d engine produces natural dialogs; (2) determine how the use of different parameters affects the user’s engagement with the story and the user’s perceptions of the naturalness of the dialog; and (3) test whether users perceive personality differences that are generated using personality models inspired by previous work. All experimental participants are pre-qualified Amazon Mechanical Turk workers, which helps ensure that they provide detailed and thoughtful comments.

We test users’ perceptions of naturalness and engagement using two stories: the Garden story (Fig. 1) and the Squirrel story (Fig. 4). For each story, we generate three different dialogic versions with varying features:

  • m2d-est renders the output from est as a dialog by allocating the content equally to the two speakers. No variations of sentences are introduced.

  • m2d-basic consists of transformations required to produce a minimally natural dialog. First we apply pronominalization to replace nouns with their pronominal forms when telling the story in the third person. We then manipulate sentence length by breaking very long sentences into shorter ones, or by combining repetitious short sentences into one sentence. This is motivated by the fact that utterances in dialog tend to be less formal and use less complex syntactic structures [5]. The last transformation is morphosyntactic postprocessing as described in Sect. 3.

  • m2d-chatty adds interactive features to m2d-basic such as the insertion of pragmatic markers (acknowledgements, disfluencies, hedges) and question-answer generation (Table 1).
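For concreteness, the three variants can be summarized as parameter presets roughly like the following; the flag names are illustrative, not the system’s actual configuration keys.

```python
# Hypothetical parameter presets for the three experimental variants.
VARIANTS = {
    "m2d-est":    {"content_allocation": 0.5, "pronominalize": False,
                   "resize_sentences": False, "pragmatic_markers": False,
                   "questions": False},
    "m2d-basic":  {"content_allocation": 0.5, "pronominalize": True,
                   "resize_sentences": True, "pragmatic_markers": False,
                   "questions": False},
    "m2d-chatty": {"content_allocation": 0.5, "pronominalize": True,
                   "resize_sentences": True, "pragmatic_markers": True,
                   "questions": True},
}
```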

Fig. 4. Squirrel story: monolog/dialog generation.

Each pairwise comparison is a Human Intelligence Task (HIT; a question that needs an answer), yielding 6 different HITs. We used 5 annotators (Turkers) per HIT to rate the levels of engagement/naturalness on a scale of 1–5, followed by detailed comments justifying their ratings.

We create several subsets of features that work well together and recursively apply random feature insertion to create many different output generations. These subsets include the types of questions that can be asked, different speaker interactions, content polarity with repetition options, pragmatic markers, and lexical choice options. A restriction is imposed on each of the subgroups, indicating the maximum number of parameters that can be enabled from the associated subgroup. This results in different styles of speaker depending on which subset of features is chosen. A speaker who has a high number of questions along with hedge pragmatic markers will seem more inquisitive, while a speaker who just repeats what the other speaker says may appear to have less credibility than the other speaker. We plan to explore particular feature groupings in future work to identify specific dialogic features that create a strong perception of personality.
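One way the per-subgroup restriction might be enforced when sampling a feature set is sketched below; the group names and caps are illustrative.

```python
import random

# Illustrative feature subgroups with per-group caps on enabled parameters.
SUBGROUPS = {
    "questions": (["wh", "yes_no", "tag"], 2),
    "speaker_interaction": (["repeat_other", "contradict", "ask_to_guess"], 1),
    "pragmatic_markers": (["hedges", "emphasizers", "acknowledgements"], 2),
    "lexical_choice": (["synonyms", "short_words"], 1),
}

def sample_feature_set(seed=None):
    """Randomly enable features, respecting each subgroup's maximum."""
    rng = random.Random(seed)
    return {group: rng.sample(features, rng.randint(0, cap))
            for group, (features, cap) in SUBGROUPS.items()}

print(sample_feature_set(seed=42))
```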

4.1 M2D-EST vs. -Basic vs. -Chatty

The perceptions of engagement given different versions of the dialogic story are shown in Fig. 5. A paired t-test comparing m2d-chatty to m2d-est shows that increasing the number of appropriate features makes the dialog more engaging (p = .04, df = 9). However, there are no statistically significant differences between m2d-basic and m2d-est, or between m2d-basic and m2d-chatty. Comments by Turkers suggest that the m2d-chatty speakers have more personality because they use many different pragmatic markers, questions, and other dialogically oriented features.

Fig. 5. Mean scores for engagement.

Fig. 6. Mean scores for naturalness.

The perception of naturalness across the same set of dialogic stories is shown in Fig. 6. m2d-basic was rated higher than m2d-est, and a paired t-test shows that m2d-basic (inclusion of pronouns and aggregation/deaggregation) has a positive impact on the naturalness of a dialog (p = .0016, df = 8). m2d-basic is also preferred over m2d-chatty, where the use of pragmatic markers in m2d-chatty was often noted as unnatural.

4.2 Personality Models

A second experiment creates a version of m2d, called m2d-personality, and tests whether users perceive the personality that m2d-personality intends to manifest. We use 4 different stories from the PersonaBank corpus [13] and create introverted and extroverted personality models, partly by drawing on features from previous work on generating personality [16].

Table 2. Feature frequency for Extra. vs. Intro. Not all lexical instantiations of a feature are listed.

We use a number of new dialogic features in our personality models that increase the level of interactivity and entrainment, such as asking the other speaker questions or entraining on their vocabulary by repeating things that they have said. Content allocation is also controlled by the personality of the speaker, so that extraverted agents get to tell more of the content than introverted agents.
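For illustration, the two personality models can be thought of as parameter sets along the following lines; the actual feature frequencies are those summarized in Table 2, and the names and values below are simplified.

```python
# Hypothetical Big Five-inspired parameter models for the two voices.
PERSONALITY_MODELS = {
    "extravert": {
        "content_allocation": 0.7,         # tells more of the story
        "questions": 0.4,                  # frequently engages the other speaker
        "pragmatic_markers": ["emphasizers", "exclamations"],
        "repetition": 0.1,
    },
    "introvert": {
        "content_allocation": 0.3,
        "questions": 0.1,
        "pragmatic_markers": ["hedges", "disfluencies", "acknowledgements"],
        "repetition": 0.4,                 # tends to repeat what was already said
    },
}
```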

We generate two different versions of each dialog, one with an extroverted and one with an introverted speaker (Table 2). Each dialog also has one speaker who uses a default personality model, neither strongly introverted nor extraverted. This allows us to test whether the perception of the default personality model changes depending on the personality of the other speaker. We again created HITs on Mechanical Turk for each variation. The Turkers are asked to indicate which personality best describes the speaker from among extroverted, introverted, or none, and then to explain their choices with detailed comments. The results are shown in Table 3, where Turkers correctly identified the personality that m2d-personality aimed to manifest 88 % of the time.

Table 3. Personality Judgments
Table 4. Default Personality Judgments

Turkers’ comments noted the differential use of pragmatic markers, content allocation, asking questions, and vocabulary and punctuation. The extroverted character was viewed as more dominant, engaging, excited, and confident. These traits were tied to the features used: exclamation marks, questions asked, exchanges between speakers, and pragmatic markers (e.g., basically, actually).

The introverted character was generally seen as timid, hesitant, and keeping their thoughts to themselves. Turkers noticed that the introverted speaker was allocated less content, tended to repeat what had already been said, and used different pragmatic markers (e.g. kind of, I guess, Mhmm, Err...).

Table 4 shows Turker judgements for the speaker in each dialog who had a default personality model. In 53 % of the trials, our participants picked a personality other than “none” for the speaker that had the default personality. Moreover, in 88 % of these incorrect assignments, the personality assigned to the speaker was the opposite of the personality model assigned to the other speaker. These results imply that when multiple speakers are in a conversation, judgements of personality are relative to the other speaker. For example, an introvert seems more introverted in the presence of an extravert, or a default personality may seem introverted in the presence of an extravert.

5 Discussion and Future Work

We hypothesize that dialogic storytelling may produce more engagement in the listener, and that the capability to render a story as dialog will have many practical applications (e.g. combined with gestures [10]). We also hypothesize that expressing personality in storytelling will be useful, and the experiments presented here show how it is possible to do this. We described an initial system that can translate a monologic deep syntactic structure into many different dialogic renderings.

We evaluated different versions of our m2d system. The results indicate that the perceived level of engagement with a dialogic storytelling increases with the density of interactive features. Turkers commented that the use of pragmatic markers, proper pronominalization, questions, and other interactions between speakers added personality to the dialog, making it more engaging. In a second experiment, we directly tested whether Turkers perceive different speakers’ personalities in dialog. We compared an introverted speaker, an extroverted speaker, and a speaker with a default personality model. The results show that in 88 % of cases the reader correctly identified the personality model assigned to the speaker. They also show that the amount of content assigned to each speaker, as well as the choice of pragmatic markers, are strong indicators of personality. Pragmatic markers that emphasize speech or attempt to engage the other speaker are associated with extroverts, while softeners and disfluencies are associated with introverts. Other interactions, such as correcting false statements and asking questions, also contribute to the perception of an extroverted personality.

In addition, the perceived personality of the default personality speaker was affected by the personality of the other speaker. The default personality speaker was classified as having a personality 53 % of the time. In 88 % of these misclassifications, the personality assigned to the speaker was the opposite of the other speaker, suggesting that personality perception is relative in context.

While this experiment focused only on extroverted and introverted models, our framework contains other Big Five personality models that can be explored in the future. We plan to investigate: (1) the effect of varying feature density on the perception of a personality model, (2) how personality perception is relative in context, and (3) the interaction of particular types of content or dialog acts with perceptions of a storyteller’s character or personality. The pragmatic markers are seen as unnatural in some cases; our system currently inserts them probabilistically but does not make intelligent decisions about using them in pragmatically appropriate situations. We plan to add this capability in the future. In addition, we will explore new parameters that improve the naturalness and flow of the story.