
1 Introduction: The ORD Corpus

Everyday spoken Russian has been the subject of scientific analysis since the works of E. Zemskaja, O. Sirotinina, O. Lapteva, N. Rozanova, M. Kitajgorodskaja, and other linguists. However, until recently there have not been enough linguistic resources of real-life spontaneous Russian speech. For example, the Spoken Speech Subcorpus of the Russian National Corpus contains no audio data at all and consists only of speech transcripts [1]. Another well-known resource, the Night Dream Stories corpus, contains both texts and thoroughly annotated speech recordings. However, this corpus is relatively small (about 2 h of recordings, 14000 words in transcripts) and covers a restricted number of spoken genres (mainly narratives) [2]. Other Russian speech corpora should also be mentioned: the RuSpeech corpus [3], an annotated corpus of Russian speech [4], a corpus of emotional Russian speech [5], and some other resources. However, all these corpora either contain no everyday speech recordings or are limited to just a few communicative situations (as is the case with the last one).

The only Russian corpus containing everyday speech recorded in natural and diverse communicative situations is the “One Day of Speech” (ORD) corpus, which originated at St. Petersburg State University. The recordings are made by volunteer participants who spend a whole day with a voice recorder switched on and record all their audible communication [6]. A similar methodology of long-term recording had earlier been used to collect data for the British National Corpus [7] and the JST ESP corpus in Japan [8].

Speech is transcribed and selectively annotated on different levels: phonetic, lexical, grammatical, and pragmatic. Transcription and most annotations are made in ELAN [9]. Phonetic annotation is made in Praat [10]. Quantitative data processing is carried out for the annotations on each level [11].

At present, the ORD corpus is one of the most representative collections of everyday spoken Russian. It contains more than 1000 h of recordings gathered from 110 main participants and hundreds of their interlocutors. The ORD volunteer participants represent various professional and status strata. The age of participants ranges from 18 to 77 years, with an average of 37 years. Speech transcripts comprise about 500000 words.

The ORD corpus provides rich authentic material for research in the phonetics and syntax of spoken Russian, and for solving applied linguistic problems in speech technologies (e.g., it may be used for adjusting and improving speech synthesis and recognition systems, and for forensic phonetics). In this paper we summarize the main achievements in phonetic and syntactic studies made on the basis of this corpus and outline some directions for further investigation.

2 Phonetic Studies

The initial goal of the ORD corpus was to conduct phonetic studies of Russian everyday speech. Nowadays, we may list the following phonetic aspects that are being successively investigated on the ORD data: temporal studies, study of speech reduction, phonetic realization of words and affixes, investigation of phonetic errors in speech production and mondegreens in speech perception, studies of rhythm structures and hesitation phenomena.

2.1 Temporal Studies

All multimedia annotations of the speech signal are made in the linguistic annotator ELAN; therefore, each annotated phenomenon (sound, morpheme, word, phrase, turn, etc.) refers to a particular segment in the corresponding sound file and has a particular duration. Thus, temporal study of elements is possible on all linguistic levels. For example, on the phonetic level we study speech rate, rhythmic patterns, temporal registers of Russian everyday speech, and other temporal phenomena.
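Because every annotation carries a start and an end time, durations and rates follow directly from the annotation boundaries. The following minimal sketch assumes that annotations have already been exported from ELAN into simple (start_ms, end_ms, syllable_count) tuples. The tuple layout, the function names, and the sample values are hypothetical and are not part of the ORD tooling.

```python
from typing import List, Tuple

# Hypothetical export format: (start_ms, end_ms, syllable_count) per utterance.
Annotation = Tuple[int, int, int]


def duration_ms(annotation: Annotation) -> int:
    """Duration of one annotated segment in milliseconds."""
    start_ms, end_ms, _ = annotation
    return end_ms - start_ms


def speech_rate(annotations: List[Annotation]) -> float:
    """Syllables per second over the annotated segments.

    Time between segments (pauses, other speakers) is not counted.
    """
    total_syllables = sum(a[2] for a in annotations)
    total_seconds = sum(duration_ms(a) for a in annotations) / 1000.0
    if total_seconds <= 0:
        raise ValueError("annotations must cover a positive duration")
    return total_syllables / total_seconds


# Invented sample values, for illustration only.
sample = [(0, 1000, 5), (1500, 2500, 6), (4000, 6000, 10)]
durations = [duration_ms(a) for a in sample]
rate = speech_rate(sample)
print(durations, round(rate, 2))
```

Whether pauses are included in the denominator changes the resulting tempo noticeably, so any comparison of rates should use the same convention throughout.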

In our phonetic studies, we first obtained the frequency distribution of utterance length in words. Based on the ORD data, the average utterance length for spoken Russian is 4.35 words (SD = 4.02). The most frequent utterances consist of a single word or a word-like particle (25.26 %). Two-word utterances make up 15.58 % of the whole data, and three-word utterances rank third with 12.45 %. Four-word utterances account for 10.98 %, five-word utterances for 8.74 %, etc. [12].

If utterance length is measured in syllables, it turns out that the most frequent Russian utterances consist of one or two syllables and represent 11.0 % and 11.7 % of all utterances respectively. Three-syllable utterances are ranked third (8.7 %), four-syllable utterances account for 7.7 %, five-syllable utterances for 6.88 %, and six-syllable utterances for 6.50 %. Utterances of 20 or more syllables each take up less than 1 % of the whole. Thus, more than half of all spoken communication consists of short utterances with a length of up to 6 syllables [12].
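The claim that utterances of up to 6 syllables make up more than half of the data can be checked by summing the shares quoted above. The short sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# Shares (%) of utterances by length in syllables, as quoted in the text.
share_by_syllables = {1: 11.0, 2: 11.7, 3: 8.7, 4: 7.7, 5: 6.88, 6: 6.50}


def cumulative_share(shares: dict, max_len: int) -> float:
    """Total share (%) of utterances whose length does not exceed max_len."""
    return sum(share for length, share in shares.items() if length <= max_len)


total = cumulative_share(share_by_syllables, 6)
print(f"{total:.2f} %")  # 52.48 %
```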

Therefore, the majority of Russian spoken interaction consists of utterances of one or a few words that contain one or a few syllables. The dependency of average utterance duration on utterance length in syllables is well described by the following linear function: y = 133.28x + 367.5, where x is the number of syllables and y is the average duration of such utterances in milliseconds [12].
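As an illustration, the linear function can be evaluated for a few utterance lengths. The sketch below uses only the two coefficients given above; the function names are hypothetical, and the per-syllable figure is simply y divided by x, a derived quantity rather than a result reported in [12].

```python
SLOPE_MS_PER_SYLLABLE = 133.28  # coefficient of x in the fitted line
INTERCEPT_MS = 367.5            # constant term of the fitted line


def predicted_duration_ms(syllables: int) -> float:
    """Average utterance duration (ms) predicted by y = 133.28x + 367.5."""
    if syllables < 1:
        raise ValueError("an utterance must contain at least one syllable")
    return SLOPE_MS_PER_SYLLABLE * syllables + INTERCEPT_MS


def implied_syllable_duration_ms(syllables: int) -> float:
    """Average per-syllable duration implied by the fit (y / x)."""
    return predicted_duration_ms(syllables) / syllables


for n in (1, 6, 15):
    print(n, round(predicted_duration_ms(n), 2),
          round(implied_syllable_duration_ms(n), 2))
```

Because of the constant term, the implied per-syllable duration falls as utterances get longer: about 501 ms for a one-syllable utterance and about 158 ms for a 15-syllable one.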

The average tempo of ORD informants is 5.31 syllables per second (syl/s), with individual values ranging from 3.6 syl/s for the slowest speaker to 6.7 syl/s for the fastest. These figures are on average higher than those reported, for example, for Norwegian (3.5–4.5 syl/s), standard northern Dutch (5.2 syl/s), or French (according to some data, 4.31 syl/s). However, they are significantly lower than in Spanish (7.81 syl/s) or Brazilian Portuguese (6.57 syl/s). The tempo of English speech in the UK (3.16–5.33 syl/s) and in the USA (3.1–5.4 syl/s) is approximately the same as that of Russian speech in Russia [13].

It was determined that there are several factors that influence speech tempo in Russian: gender (men speak faster than women), age (the older a speaker, the slower he or she speaks), level of language competence (the higher the competence level, the slower is the speech), and social role of speakers (speech is faster when communicating with friends than in work settings) [13, 14].

Finally, a hypothesis on the existence of two temporal registers of speech was proposed. (1) The “regular” (or dialogue) register is used for producing utterances whose length does not exceed 15 syllables. Its distinctive feature is a strong interrelation between the average syllable duration (syllable rate) of an utterance and its length in syllables: the average syllable duration is a function of utterance length in syllables, ranging from 450 ms to 150 ms. (2) The “speedy” (or monologue) register is used for producing longer utterances (exceeding 15 syllables). In contrast to the dialogue register, the average syllable duration in the “speedy” register does not depend on utterance length in syllables and is approximately 150 ms [12].
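The two-register hypothesis can be expressed as a simple threshold rule. The sketch below is illustrative only: the function names are hypothetical, the 15-syllable boundary and the 150 ms constant are taken from the description above, and no functional form is assumed for the “regular” register because the text gives only the range of its syllable durations.

```python
from typing import Optional

REGISTER_THRESHOLD_SYLLABLES = 15   # boundary between the two registers
SPEEDY_SYLLABLE_DURATION_MS = 150   # approximate constant for long utterances


def temporal_register(syllables: int) -> str:
    """Return the hypothesized register for an utterance of given length."""
    if syllables < 1:
        raise ValueError("an utterance must contain at least one syllable")
    if syllables <= REGISTER_THRESHOLD_SYLLABLES:
        return "regular"
    return "speedy"


def approx_duration_ms(syllables: int) -> Optional[int]:
    """Rough duration estimate, available only for the 'speedy' register.

    For the 'regular' register the syllable duration depends on utterance
    length, so no constant-rate estimate is attempted here.
    """
    if temporal_register(syllables) == "speedy":
        return syllables * SPEEDY_SYLLABLE_DURATION_MS
    return None


print(temporal_register(4), temporal_register(22), approx_duration_ms(22))
```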

2.2 Study of Reduction. Phonetic Realization of Words and Affixes

Reduced forms of different words and phrases, especially those most commonly used in everyday Russian, are analyzed. Many of these forms have already acquired corresponding written forms, which may be found in modern Russian literature and which are frequent in electronic communication. The analysis covers all relevant features of reduced forms: (1) pronunciation (i.e., detailed phonetic transcription), (2) spelling (common variants of written forms), (3) semantic features, and (4) pragmatic features. Based on the results, a multimedia dictionary of reduced forms in Russian is being created [15]. The study of spontaneous speech reduction may be used for building an authentic lexicon of word pronunciations.

In search for the correlation between grammatical meanings of morphemes and their phonetic realization, the real phonetic transcription of inflectional affixes for different speakers in various communication situations was obtained [16]. Statistical tables are drawn with correspondence of orthography, real phonetic transcription, and grammatical categories. These data provided the basis for the Audio dictionary of Russian inflectional affixes.

2.3 Studying the “Weak Points” in Speech Perception and Production

Lists of common mishearings (mondegreens, or incorrect attribution of words) made by expert linguists while transcribing the ORD recordings were compiled [17]. Based on these lists, we analyzed which perceptual elements of word forms are essential for their proper attribution. The following elements turned out to be invariant despite incorrect recognition of words: stressed syllable position, segmental parts of the stressed syllable (both consonants and vowels), number of syllables, the consonantal “skeleton” of the word form, and its initial or final segmental fragment.

The phonetic mistakes typically made by Russian speakers in everyday speech were analyzed as well. The most common mistakes are the following: incorrect stress position in words and phrases, the alignment of phonetic features in neighboring words, palatalization errors, and substitution of proper words by non-existent quasi-words that are phonetically similar to the prototype words. However, we should point out that phonetic mistakes occur less frequently than the other types of errors, which have also been analyzed. Cf.: phonetic errors – 17 %, lexical errors – 28 %, morphological errors – 31 %, and syntactic errors – 24 % [18].

2.4 Russian Speech Rhythm Studies

Empirical investigation of the ORD recordings has revealed a tendency towards the usage of symmetrical rhythmical structures built of isochronic or quasi-isochronic segments in Russian everyday speech [19]. The most typical are structures consisting of two, three, four, etc. quasi-isochronic segments. Moreover, it was observed that phenomenon of isochronism of speech rhythmic structures can simultaneously appear on different structural levels. We have a hypothesis that the lower quasi-isochronic level performs the role of some kind of inner metronome, which organizes our speech flow. It may change its tempo on borders of rhythmic groups. However, these structural levels do not correlate with linguistic levels. We may suggest that the distribution of linguistic units onto “isochronic boxes” is determined mainly by pragmatics: the more important the segment is the more “boxes” it may take [19].

The use of fillers and other discourse markers in spontaneous speech may be explained in many cases just by unconscious desire of speakers to reach this temporal pattern. The examples of such cases are given and explained in [20]. This hypothesis is to be tested on representative ORD data.

2.5 Hesitation Phenomena

Speech hesitations are a common feature of spontaneous speech production. According to the ORD data, both filled and silent hesitation pauses are among the most frequent elements of spoken Russian. They naturally occur in all types of speech and in the speech of all speakers. Different types of hesitations found in the corpus have been analyzed, and a classification model of hesitation phenomena was proposed. The most frequent hesitations are the following: silent pauses, stretching of sounds, interruptions, repetitions, filler words, other kinds of fillers, and paralinguistic actions. The new term “verbal hesitation” has been introduced to denote verbal fillers of a hesitation nature [21].

3 Syntactic Studies

Syntax is the part of grammar where the features of spoken language are most clearly revealed in a variety of ways (for example, cf. [22]). Syntactic studies of spontaneous speech are very important for ASR systems. Nowadays, the language models of most speech recognition systems are trained on corpora of written texts. However, written Russian and spoken Russian differ greatly from each other with respect to some fundamental syntactic features, as is shown below. Therefore, n-gram models that are built for written language cannot be efficiently applied to LVCSR tasks. This applies in particular to the recognition of spontaneous real-life speech. Speech transcriptions of the ORD corpus form a valuable resource for creating a language model of spoken Russian.

Several syntactic studies have already been made on the basis of the ORD corpus. One of them is a pilot study of verbal groups in Russian spontaneous speech. Based on a random sample of 550 verbal branches represented in a formal way, the following models of left- and right-branching subordination were found:

  1. Verbs without dependents (V): 13.64 %;
  2. Symmetrical verbal groups (1V1, 2V2, 3V3): 14.00 %;
  3. Generally left-branching verbal groups (1V, 2V, 3V, 4V, 5V, 2V1, 3V1, 3V2, 4V1): 59.27 %;
  4. Pure left-branching verbal groups (1V, 2V, 3V, 4V, 5V): 50.36 %;
  5. Generally right-branching verbal groups (V1, V2, V3, V4, V5, 1V2, 1V3, 1V4, 2V3): 12.90 %;
  6. Pure right-branching verbal groups (V1, V2, V3, V4, V5): 7.45 %.

We have calculated the averages characterizing left and right branching in verbal groups. The average left width of a branch equals 1.195, and the average right width equals 1.565. Their ratio (L/R) equals 1.309; therefore, we observe in spoken Russian an evident trend towards left-branching asymmetry.
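How such averages might be derived from the formal notation can be sketched as follows. The sketch assumes that in a code such as 2V1 the number before V is the left width and the number after V is the right width, with a missing number meaning zero. It also averages over all groups in the sample, whereas the averaging convention actually used in the study is not specified here. The function names and the sample are hypothetical.

```python
import re
from statistics import mean
from typing import Iterable, Tuple

# Assumed notation: optional digits, the letter V, optional digits.
_GROUP_PATTERN = re.compile(r"^(\d*)V(\d*)$")


def widths(code: str) -> Tuple[int, int]:
    """Parse a verbal-group code such as '2V1' into (left, right) widths."""
    match = _GROUP_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized verbal-group code: {code!r}")
    left = int(match.group(1) or 0)
    right = int(match.group(2) or 0)
    return left, right


def branching_ratio(codes: Iterable[str]) -> Tuple[float, float, float]:
    """Return (average left width, average right width, L/R ratio)."""
    pairs = [widths(code) for code in codes]
    avg_left = mean(left for left, _ in pairs)
    avg_right = mean(right for _, right in pairs)
    if avg_right == 0:
        raise ValueError("no right branches in the sample; L/R is undefined")
    return avg_left, avg_right, avg_left / avg_right


# Invented sample, for illustration only.
sample_codes = ["V", "1V", "2V1", "V1", "3V", "1V1"]
avg_left, avg_right, ratio = branching_ratio(sample_codes)
print(avg_left, avg_right, ratio)
```

A ratio above 1 indicates a preference for left branching, a ratio below 1 a preference for right branching, and a value close to 1 a roughly symmetrical distribution.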

These results are very different from data obtained from written texts. For example, it is shown in [23] that in written Russian texts the ratio of left-branching structures to right-branching ones is close to 1 (i.e., almost symmetrical), with a slight tendency towards right branching:

  Fiction: L/R = 0.974
  Scientific texts: L/R = 0.984
  Poetry: L/R = 0.983
  Spoken speech: L/R = 1.310

Therefore, written Russian in this aspect leans towards the mirror symmetry, while everyday spoken Russian is left asymmetrical [23].

Remarkably, the difference in branching preference between written and spoken Russian is even greater than that between different languages. We may thus claim that in this respect written and spoken Russian differ more from each other than written Russian and written English do.

The other syntactic studies made on the ORD data include the analysis of repetitions, interruptions, self-corrections, “plug-in” constructions, and the ways of reporting someone else’s speech [24]. Elements of meta-communication that are common for spontaneous speech and that depend on the type of communication and speaker’s characteristics have been studied as well.

4 Some Directions for Further Research

Recently, we have started a large sociolinguistic project that aims to analyze the special characteristics of everyday Russian used by different social groups, and to reveal how the language actually functions and what modifications it undergoes in contemporary society. Speech of the major social groups of a contemporary Russian city (defined by age, gender, profession, etc.) has to be analyzed on different linguistic levels with regard to social information about the speakers. In light of this task, it has become necessary to extend the volume of speech data gathered from particular social groups [25] and to adapt the corpus itself accordingly. Thus, “one day of speech” recordings are being continued in 2015.

The study of the speech of different social groups in the population of the second biggest Russian city, St. Petersburg, is to be conducted on phonetic, lexical, morphological, and syntactic levels. For example, the following parameters are to be analyzed on the phonetic level: (a) temporal characteristics of speech (overall speech tempo, duration of speech elements, typical rhythmic structures); (b) phonetic realization of frequently reduced forms, discourse markers, and fillers; and (c) prosodic models for particular types of utterances. Systematic studies of intonation have not previously been conducted on the ORD data.

As for syntactic studies, it is planned to carry out syntactic analysis for the following parameters: (a) linear word order of verbal and noun syntactic groups, (b) syntactic complexity of spoken utterances (e.g., height and width of linearized trees, left-branching structures vs. right-branching ones, syntactic discontinuity) [26], (c) specific syntactic phenomena of spontaneous speech (parcellation, ellipsis, breaks, incompleteness, self-correction, etc.), and (d) the usage of syntactic markers (prepositions, conjunctions, introductory words, etc.).

In addition, it is planned to conduct studies of paralinguistic phenomena and psycholinguistic studies (the dependency of speech characteristics on the speaker’s psychological type) on all linguistic levels.

In this review, we intentionally skipped the description of lexical and morphological studies made on the ORD data. These investigations are actively performed as well (for example, cf. [11, 16, 24]) and deserve a separate review.

The sociolinguistic extension of the corpus will make it possible to increase the volume of speech transcripts to 1 million words during the next 1.5 years. The ORD corpus will thus be a representative resource of everyday spoken Russian, suitable for solving problems in both theoretical and applied linguistics.