Introduction

In this paper, I discuss how we can deepen our understanding of speech communication with special reference to our new Japanese corpus, the “My Funny Talk” corpus (Watashino Chotto Omoshiroi Hanashi in Japanese, henceforth MFT). Before introducing it in detail, I shall clarify our background and the objectives of this corpus.

The background of MFT is concerned with the progress of spoken language research.

The unique status of spoken language relative to written language has been acknowledged by several twentieth-century linguists. For example, Charles Hockett pursued features that distinguish human spoken language from other communication systems, including written language (Hockett 1960). Additionally, John Lyons called the priority of spoken language over written language "one of the cardinal principles of modern linguistics" (Lyons 1981). However, actual research on language has focused on the written form, and language education has followed this trend.Footnote 1 Consequently, there are very few textbooks based upon spoken language. This discrepancy between the ideal and the reality concerning the status of spoken language is gradually being mitigated by the development and spread of technology. Advances in the recording, editing, and release of audio-visual information enable us to develop and utilize various corpora as data for research and education on spoken language.Footnote 2

In the case of spoken language in Japan, the largest corpus is the Corpus of Spontaneous Japanese (CSJ), developed in 2004 by the National Institute for Japanese Language and Linguistics, the National Institute of Information and Communications Technology, and the Tokyo Institute of Technology. It includes utterances consisting of approximately 7.5 million words (660 hours), collected mainly from academic presentation speech and simulated public speaking (Maekawa 2004). More recently, smaller but more focused corpora have appeared, concentrating more on the dynamics of conversation than the CSJ does. For example, Mayumi Usami and others developed a new corpus of utterances consisting of about 800,000 words (70 hours) from a pragmatic perspective, based on their transcription system, the Basic Transcription System for Japanese (BTSJ) (Usami and Nakamata 2013). Another example is our first corpus (40 hours), KOBE Crest FLASH (KCF), developed by Nick Campbell and jointly released online in 2012. The most prominent feature of KCF is that it includes visual indications of the temporal distribution of utterances during natural dialogue, in addition to audio-text information (Fig. 1).

Fig. 1 Sample screen of KOBE Crest FLASH (http://www.speech-data.jp/taba/kobedata/)

At the center of Fig. 1, the reader can see two broken lines: a blue line (upper) and a pink line (lower). Each line reflects the temporal distribution of utterances, with the blue line representing one speaker's utterances, the pink line the other's, and the line breaks representing periods of silence. A very short line segment thus indicates that the speaker's utterance ends in a very short time, as in the case of backchannels. With such visual indications, KCF is an effective tool for researching the temporal aspects of dialogue structure, including utterance overlap and backchannel usage.
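The kind of temporal analysis these utterance timelines support can be sketched in a few lines of code. The snippet below is a minimal illustration, not part of the KCF toolset: given each speaker's utterance intervals in seconds, it finds stretches of simultaneous talk and flags very short utterances as backchannel candidates. The interval data and the 0.5-second threshold are invented for illustration.

```python
def overlaps(speaker_a, speaker_b):
    """Return (start, end) stretches where both speakers talk at once."""
    out = []
    for a_start, a_end in speaker_a:
        for b_start, b_end in speaker_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return out

def backchannel_candidates(utterances, max_len=0.5):
    """Very short utterances (e.g. 'un', 'hai') are backchannel candidates."""
    return [(s, e) for s, e in utterances if e - s <= max_len]

blue = [(0.0, 2.4), (3.0, 6.1)]              # one speaker's utterances
pink = [(2.2, 2.5), (5.8, 6.0), (6.5, 8.0)]  # the other speaker's utterances

print(overlaps(blue, pink))              # stretches of simultaneous talk
print(backchannel_candidates(pink))      # short responses by the pink speaker
```

A real analysis would read the intervals from the corpus transcripts, but the logic of detecting overlap and backchannel-length utterances is the same.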

While use of the CSJ requires a fee (¥50,000 for researchers and ¥25,000 for students), and the BTSJ corpus necessitates a legal contract, we are gradually releasing the KCF on the InternetFootnote 3 so that anyone can download the corpus free of charge. Although only a small part (two hours) has been released so far, interested readers can contact us to access the remaining portion of the corpus before its release. Sunakawa (2011) provides more detailed information about these corpora.

Our motivation for developing and releasing a further corpus (i.e., MFT) is partly concerned with the treatment of visual information. Although the KCF includes a visual indication of the temporal distribution of utterances, it, like its CSJ and BTSJ counterparts, does not include visual recordings of the conversation itself. Unlike these previous corpora, our new corpus MFT includes such audio-visual information on speech. However, our motivation for developing and releasing MFT is not so much concerned with the problem of multi-modality as with the issue of balance: an ideal corpus should be well balanced for the given research or educational purpose. In order to clarify the dietary habits of the Japanese, for example, it would not be effective to investigate 50 informants from Hokkaido and 50 informants from Okinawa and conclude that lamb and goat meat are popular among Japanese people. Before gathering data, we should prepare an inventory of dietary cultures in Japan and then select informants carefully on the basis of the population and age distribution of each dietary culture, rather than on a per capita basis of administrative districts. With this in mind, we cannot help but conclude that none of the present corpora are well balanced, as our current knowledge of variation in spoken Japanese is still very limited.

The way of speaking during conversation changes drastically in accordance with style, situation, speech act type, and even speaker type (Sadanobu 2011). This raises several questions: How many variants of speaking style does the Japanese language have? In what situations does each variant appear, and to what degree? How many speech-act types and speaker types does the Japanese language contain? All of these questions are very important if we seek to construct a well-balanced corpus. Unfortunately, many of them remain unanswered, which limits the effectiveness of quantitative research based on corpus data, as Emanuel Schegloff points out: "… and in many areas we are not yet (in my judgment) in a position to satisfy these prerequisites that allow the possibility that quantitative analysis will deliver what is wanted from it and what it promises" (Schegloff 1993: 103). Nevertheless, with the information and resources available at present, we can offer a description and clarification of variation in Japanese speech, and this is the main goal of our construction of the MFT, started in 2010. As far as we know, it is the first and only audio-visual corpus of Japanese speech, and of funny talks (i.e., talk with humorous content), released on the Internet. In the next section, we describe its specifications for achieving this objective.

Specifications of MFT

Our new corpus, My Funny Talk (MFT), is a collection of talks, each lasting only a few minutes, that were entered in a funny spoken-story tournament held annually in collaboration with the Research Center for Media and Culture of Kobe UniversityFootnote 4 and my "kakenhi" project,Footnote 5 and sponsored by the Media Culture Campaign Council.Footnote 6 As noted above, the primary goal of our construction of the MFT is to shed light on variation in Japanese speech. Moreover, the MFT serves at least three additional purposes, which are distinct but interconnected and fully compatible with one another. First, MFT is useful as research material on the modern Japanese folk storytellers' art ("wagei" in Japanese). Of course, existing materials on Japanese storytelling such as "koudan," "rakugo," and "joururi" are plentiful, but these are traditional data sources rather than modern ones. Although modern funny talks by professional comedians are quite common in TV programs, it would be too simplistic to equate them with folk storytellers' funny talks; beyond this, and most crucially, they are not usable because of copyright issues. Second, MFT is useful as educational material on "live" Japanese speech for Japanese learners and their teachers. Third, MFT serves as an experimental trial in collecting conversational data, described in detail later in this section. In order to achieve these various aims simultaneously, numerous precautions are taken throughout the construction of the MFT. Before entry, participants are told the tournament rules, and they attest, through documents prepared by our lawyer, that they waive their rights to privacy, publicity, and copyrighted works (e.g., phrases, poems, and songs that they hit upon and utter during their talk); this is necessary to release their talks online.

We received 17 talks for the first tournament (2010), 72 for the second (2011), 43 for the third (2012), and 44 for the fourth (2013). Although the total number of hours of data currently available is quite limited (about 10 hours), it already exhibits a surprising variety of Japanese speech, as shown in Sect. 3. The yearly variation in the number of talks resulted from trial and error. For the second tournament, we strengthened the advertising to increase the number of entries and thus obtain more speech variety. However, we later realized this was not beneficial to the research. The score of each participant was determined purely by Internet voting, without any manipulation, and when the number of entries is as large as in the second tournament, it becomes significantly more difficult to secure enough volunteer voters who actually click on, watch, and score the entered talks, rather than vote on the basis of familiarity with the speakers. We therefore decided to decrease the number of entries, which proved difficult, as most participants loved the tournament and were eager to participate again.

Talks entered into the tournament are not identical in form. Many of them are fragments of conversation between three or four housekeepers, some are elicited from a cocktail-party conversation among businessmen, and other presenters simply enter alone and talk by themselves. All talks are presented as audio-visual movie files with Japanese subtitles, which help Japanese learners understand the content. Seven talks from the first tournament lack this visual information because of technical issues. The remaining nine talks of the first tournament have English, French, and Chinese translations in addition to Japanese subtitles for learners of basic Japanese. Viewers can choose between two prepared presentation formats: one presents the translations in parallel, while the other has a language switcher on the upper right side (Fig. 2). We are presently planning to add these translations for entries in the other tournaments as well.Footnote 7

Fig. 2 Sample of movie file with parallel translations (No. 9, 2010) http://www.speech-data.jp/chotto/2010_sub/flash/2010009s4.html

As shown in Fig. 2, we give special treatment to personal names. We substitute the written characters of names with the symbol "**" so that viewers cannot tell who is being talked about in the clips. Likewise, we add a beep over personal names in the sound files in order to conceal the identities of the individuals being discussed. Besides these precautions, there are still other legal or ethical issues. A few talks judged as strongly insulting to a particular person were removed. During one talk, the speaker confessed to childhood shoplifting, and the screen was therefore blurred to prevent viewers from identifying him. This measure was taken even though the presenter himself assured us that he did not need such treatment.
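The masking step described above can be sketched as a simple pass over a time-aligned transcript: name tokens are replaced by "**" in the subtitle text, and their time spans are collected so that a beep can be mixed over them in the audio. The token format (text, start, end, is_name) is a hypothetical one for illustration, not the actual MFT annotation scheme.

```python
def anonymize(tokens):
    """tokens: list of (text, start_sec, end_sec, is_name) tuples.
    Returns the masked subtitle string and the spans to beep in the audio."""
    subtitle_parts, beep_spans = [], []
    for text, start, end, is_name in tokens:
        if is_name:
            subtitle_parts.append("**")        # mask the name in the subtitle
            beep_spans.append((start, end))    # beep over this span in the audio
        else:
            subtitle_parts.append(text)
    return "".join(subtitle_parts), beep_spans

tokens = [
    ("kinou ", 0.0, 0.6, False),
    ("Tanaka", 0.6, 1.1, True),    # a personal name: mask and beep
    ("-san ni atta", 1.1, 2.0, False),
]
subtitle, beeps = anonymize(tokens)
print(subtitle)   # kinou **-san ni atta
print(beeps)      # [(0.6, 1.1)]
```

Keeping the subtitle masking and the audio beep intervals derived from the same annotation ensures the two forms of anonymization stay in sync.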

We selected a tournament system, but our aim is not to classify native speakers into several groups, from fluent and eloquent speakers to unsophisticated, artless ones. As previously noted, participant scores were determined through Internet voting, but little attention was paid to these scores. Although we agree with Mark Durie (1999) that the notion of skill is important to grammar, we do not think that the skill of a native speaker is so easily determined. Rather, our aim in utilizing a tournament system derives mainly from its benefits in collecting and sharing corpora.

The collection of conversational data is becoming increasingly popular among Japanese linguists, but most of this data stems from conversations among students, because such data is the cheapest and easiest for university teachers to gather. However, students are not representative of speakers in general, and collecting non-students' conversational data can be expensive, time-consuming, and exhausting: we have to take the necessary yet tedious steps of asking for the informants' cooperation, paying them for the rights to their likeness, and transporting equipment to their homes or offices. For researchers, the tournament system was an experimental trial in shifting all these burdens onto the informants' side. At the beginning, we expected that, by offering a small amount of prize money, informants might create their own videos and send them to us at their own cost. What we found was that most applicants preferred visiting the university and having us record their videos. The tournament system has therefore not worked so well in this respect, but we maintain that the experiment is worth continuing.

Obstacles are not limited to data collection phases. Difficulties have been encountered in sharing corpora as well. A researcher might be reluctant to release his data lest his analysis should eventually prove to be wrong. Even if it is released, other researchers might not be interested, as they cannot understand it as deeply as they understand their own data because of differences in background. Alternatively, neither researchers nor L2 learners might enjoy viewing the data, simply because it is boring. We thought that brief, funny stories entered in a tournament might be an effective way to significantly reduce such obstacles. That is to say, we expected that in order to get the prize, applicants would do their best to make their stories funny, short, and easy to understand for any watcher without an extensive background in the story’s subject.

Examples of Speech Variations Examined via MFT

Regarding the content of the stories, all applicants discussed personal experiences, and thus far we have not received any table jokes like "A cruise ship founders on a reef, and a man just manages to swim some miles and crawl up on a desert island …" This strong tendency of Japanese people to prefer personal experiences to jokes and anecdotes as funny talk could be related to characteristics of Japanese culture, since it coincides with what Oshima (2011) found in Japanese written funny stories. Future investigation could develop this into a cross-linguistic study of the culture of spoken stories by conducting the same funny-story tournament in other societies and comparing the entries with those in this corpus.

The overall structure of narrations of personal experiences is another theme for further investigation. There are two different proposals concerning the ordering of the subparts of narratives (specifically, "evaluation" and "resolution"). One is given by William Labov (e.g., Labov and Waletzky 1967; Labov 1997), who places "evaluation" before "resolution," as in (1). The other is given by Senko K. Maynard (1989), who conversely puts "resolution" before "evaluation," as in (2). Although our preliminary work (Kaneda et al. 2013) indicates that Japanese narratives fit the latter better, further studies are required to make this point clear.

(1) Overall Structure of Narratives (William Labov)

1. Abstract "is an initial clause that reports the entire sequence of events of the narrative" [Labov 1997: 402.]

2. The orientation part serves "to orient the listener in respect to person, place, time, and behavioral situation" [Labov and Waletzky 1967: 32, underlines provided.]

3. Complication or complicating action means "the main body of narrative clauses usually comprises a series of events" [Labov and Waletzky 1967: 32.]

4. Evaluation "is defined by us as that part of the narrative which reveals the attitude of the narrator towards the narrative by emphasizing the relative importance of some narrative units as compared to others." [Labov and Waletzky 1967: 37.]

5. Result or resolution means "the set of complicating actions that follow the most reportable event" [Labov 1997: 414.]

6. Coda is "a functional device for returning the verbal perspective to the present moment" [Labov and Waletzky 1967: 39.]

(2) Overall Structure of Narratives (Senko K. Maynard)

1. Prefacing (obligatory): Expressions used to signal the transition from the current discourse into a narrative. Includes seven different categories; minimally one category must appear prior to the main body of the narrative, and categories 2 through 7 may appear elsewhere in the main body of the narrative.

2. Setting (obligatory if unknown to the listener): Specifics of the situation, such as the time and location where the event takes place, along with descriptions of the characters involved in the event.

3. Narrative Event (obligatory): Describes how the identified participants conduct or experience an event that is thought to be interesting to the story recipient. The description of the narrative event must minimally contain an event sequence consisting of two related, chronologically ordered actions.

4. Resolution (optional): The result of or the conclusion to the narrative event.

5. Evaluation/Reportability (optional): Refers to the point of the narrative (why it is told) as defined by Labov (1972).Footnote 8 Includes the relevance of the narrative to the story recipient, and both the mental and emotional reactions of the narrator with respect to the narrated event.

6. Ending Remarks (optional): Expressions that signal the end of the narrative frame and the shift of the framework from the narrative to another discourse unit.

[Maynard 1989: 117–118.]
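The contrast between the two orderings can be stated very compactly: given a narrative labeled as a sequence of subparts, check whether "evaluation" precedes "resolution" (Labov) or follows it (Maynard). The sketch below uses invented label names and an invented sample narrative; it is an illustration of how a coded corpus could be screened, not a description of our actual annotation.

```python
def ordering(segments):
    """Return 'Labov' if evaluation precedes resolution, 'Maynard' if it
    follows, or None if one of the two subparts is absent."""
    if "evaluation" not in segments or "resolution" not in segments:
        return None
    if segments.index("evaluation") < segments.index("resolution"):
        return "Labov"
    return "Maynard"

# A narrative coded with Maynard-style labels, resolution before evaluation:
japanese_style = ["prefacing", "setting", "narrative_event",
                  "resolution", "evaluation", "ending_remarks"]
print(ordering(japanese_style))   # Maynard
```

Run over a labeled corpus, such a check would give a first quantitative picture of which ordering Japanese narratives favor.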

In this paper we focus on the complementary effectiveness of this corpus for understanding the varieties of spoken Japanese. Although we cannot claim that stories entered in a tournament are as natural as everyday conversation in all respects, MFT abounds in speech variations scarcely addressed in hundreds of hours of standard spoken-language corpora. By using this corpus as a complement to previous corpora, we are better able to address actual variation in Japanese speech. We briefly provide two examples below.

Pitch

The first notable example is pitch. According to the traditional view, such as Amanuma et al. (1978), intonation cannot break the patterns of lexical accent in Japanese. However, recent research (Abe 1998; Sadanobu 2005a, 2013) concludes that lexical accent can be affected by intonation to such a degree that it completely loses its original shape. Sadanobu (2005a, 2013) argues that this happens especially when the intonation is connected with the speaker's strong attitudes or feelings.

In MFT, we find cases that support this position. Here is just one example (Fig. 3).

Fig. 3 Example of utterance where falling lexical accent is deleted by rising intonation (No. 38, 2011) http://www.speech-data.jp/chotto/2011/2011038.html

The phrase ki-ta, which means "came" in English, has a falling pitch accent: the first mora ki bears a high pitch, whereas the second mora ta bears a low pitch. In this excerpt, the speaker utters this phrase three times in total, as shown in (3) below.

(3)

03:34 ah, ki-ta-ze! "Oh, it has come!" (Direct speech of cry)

03:35 ki-ta-zo! "It has come!" (Direct speech of cry)

03:37 kaette-ki-ta "came back" (Representative speech of thought)

The pitch of the third utterance appears to preserve the falling accent in calm representative speech, although this is not entirely clear, since ki-ta here is not a main verb but a supplementary verb attached to kaette, meaning "go back." In contrast, the first and second utterances, the speaker's direct speech (i.e., cries of strong joy), do not preserve the lexical accent: they are uttered with a rising intonation, like yatta! ("I did it!").
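The accent/intonation contrast in (3) could be checked on F0 contours: a preserved falling accent shows higher pitch on the first mora than the second, while the rising intonation of the cries reverses this. The sketch below classifies a per-mora F0 sequence; the Hz values are invented for illustration, and real values would come from a pitch tracker run on the corpus audio.

```python
def contour_direction(f0_values, tolerance=5.0):
    """Classify a per-mora F0 sequence (Hz) as 'falling', 'rising', or
    'level', ignoring changes smaller than the tolerance."""
    delta = f0_values[-1] - f0_values[0]
    if delta > tolerance:
        return "rising"
    if delta < -tolerance:
        return "falling"
    return "level"

# ki-ta with the lexical falling accent preserved (calm representative speech):
print(contour_direction([210.0, 160.0]))   # falling
# ki-ta-ze! uttered as a cry of joy: accent overridden by rising intonation:
print(contour_direction([180.0, 230.0]))   # rising
```

Such a classifier is deliberately crude; it only makes the point that the accent-deletion phenomenon is measurable, not merely impressionistic.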

Phonation

The second and final example is phonation. MFT has plenty of utterances with various phonation types, including pressed (Fig. 4), whispery (Fig. 5), and rounded phonation.

Fig. 4 Example of pressed phonation utterance (No. 8, 2010) http://www.speech-data.jp/chotto/2010_sub/jwplayer/2010008s.html

Fig. 5 Example of whispery phonation utterance (No. 18, 2012) http://www.speech-data.jp/chotto/2012/2012018.html

Their meanings are culturally dependent. A pressed voice can convey the attitudinal meaning of kyoshuku, which connotes a psychological shrinking with feelings of apology and embarrassment to native Japanese speakers (Sadanobu 2004, 2005b); however, the same voice comes across as arrogant to French speakers (Shochi et al. 2005).

Rounded phonation with protruded lips (togarase in Japanese) has attracted much attention, especially in recent years. Through MRI (magnetic resonance imaging) experiments,Footnote 9 we are now identifying articulatory types of rounded phonation (Figs. 6 and 7), one of which is an adult phonation conveying the attitude of kyoshuku, similar to that seen in the case of pressed voice.

Fig. 6 MRI of "common" phonation

Fig. 7 MRI of rounded phonation

We can see this demonstrated in data elicited from MFT (Fig. 8).

Fig. 8 Example of rounded phonation utterance (No. 47, 2011) http://www.speech-data.jp/chotto/2011/2011047.html

Here, a female speaker discusses her past experience of joining a conversation with the general manager of her company and a visiting guest, who was the president of another company. She imitates these two people when she directly quotes their conversation. It is important to note here that the general manager has a higher status than the president, probably in accordance with the relative strengths of their companies.

What is interesting is that she speaks with a pressed and rounded phonation only when voicing the president. She uses this phonation four times in total, as shown in (4) below.

(4)

02:48 ah, haa, nandesuka? "Oh, yes, what is it?"

02:57 iyaa "Noo"

03:09 iyaa "Noo"

03:14 iyaiya "No, no"

The first example is an utterance showing the president's attentive listening to the general manager's speech. The three remaining utterances are negative answers to the general manager's questions. These four utterances begin not with rounded vowels, but with the vowel "a" in the first utterance and the vowel "i" in the remaining responses. Nevertheless, the speaker's mouth is rounded at the onset of all four utterances. By contrast, when she voices the general manager's speech beginning with "i" at 2:50, she does not round her lips. This rounded phonation is an adult phonation with the attitude of kyoshuku, as we saw above (Sadanobu and Hayashi 2016). However, a slight change in phonation drastically changes the impression into that of a child's complaint (Zhu and Sadanobu 2016). Do these two phonations differ in their articulatory way of rounding the lips, or do they share the rounding articulation and differ only in voice quality? The details require further investigation, but our main argument is that we can utilize MFT, which abounds in such utterances, as a supplement to deepen our understanding of speech variation in the Japanese language.

Remaining Issues and Future Directions

Lastly, we shall address two doubts sometimes cast on the usability of the MFT corpus, and clear them away by explaining our view of dialects in the study of spoken language and of the technology for synthesizing expressive speech. The first doubt is that most of the talks appear to be performed in Kansai dialects and few in other dialects. It is true that most speakers in the MFT corpus are from Kansai (Osaka and Kobe, inter alia), and we should investigate why. One possible reason is geographical: since the tournament applicants preferred visiting our university (Kobe University) and having us record their videos rather than editing and sending them to us, as described in Sect. 2, it is natural that they mostly reside near our university. In addition, there might be a cultural reason for the uneven distribution of speakers: Kansai people are generally less hesitant about relating funny stories. To verify this, it would be necessary to investigate, by questionnaire, feelings about relating funny stories in public, not only among the applicants but also among non-applicants who did not want to participate in the tournament. We readily acknowledge that observing funny talks entered into the tournament is not sufficient for investigating the modern Japanese folk art of storytelling. However, it does not follow that we should collect and present funny talks for each dialect. This point is deeply related to the question of how we should treat dialectal difference in the study of spoken language. Although it may have been a tradition of dialectal study to take as "pure" informants only a small group of people whose families have lived for generations in the area under investigation, and to exclude the many people who have moved there from other areas as "impure" informants, we do not aim to construct a "specimen case" of funny talks. By making the MFT corpus, we simply want to promote the study of spoken Japanese.
In spoken language, all speakers are interconnected "by air." Unlike in a specimen case, there is no wall dividing speakers of one dialect from speakers of the others; at any time, speakers of other dialects or "hybrid" speakers may join the conversation. In fact, we are all more or less "hybrids," and the conversations in which funny talks take place are more or less conversations between speakers of different dialects. We cannot dismiss funny talks by "hybrid" speakers as valueless; they are as valuable as those of "pure" informants.

The second doubt cast on the usability of the MFT corpus is that the speaking styles found in it may be merely rare and peripheral ones. This doubt rests on an incorrect presupposition: we cannot tell whether a speaking style is "merely rare and peripheral" before quantitative investigation, and quantitative investigation needs a well-balanced corpus, whose construction is the ultimate objective of the MFT corpus. In addition to this logical flaw, the doubt also reflects a limited understanding of the technology for synthesizing expressive speech. Speech-synthesis technology is expected to develop so as to aid people who have lost their voices through diseases such as ALS or through injury (Iida 2008). To meet this expectation, it is important to pay more attention to expressive speech, since the interlocutors of such people are first of all their family and close friends, and their conversations should therefore abound in expressions of various attitudes and emotions. Whereas the technology for synthesizing non-expressive Japanese read speech reached a considerably high level at the end of the twentieth century,Footnote 10 the technology for synthesizing expressive speech has not been developed as much. In our view, the main reason for this delay is not technological but linguistic. Once linguists succeed in describing all prosodic patterns and their attitudinal/emotional correlates, it will become possible to annotate every prosodic pattern in a corpus and consequently to synthesize expressive speech in the same way as non-expressive speech. That is to say, the development of speech-synthesis technology to aid people who have lost their voices depends greatly on the linguistic elucidation of various expressive prosodies. Although the current data size of MFT is rather small for a corpus, it is growing larger every year.
We hope that readers will freely enjoy, download, and utilize this corpus, the only audio-visual Japanese speech corpus released on the Internet, for teaching, learning, and research.