1 Introduction

Over the last 30 years, various large speech corpora have been developed throughout the world [1]. Well-known examples are TIMIT [2], Switchboard [3], Verbmobil [4], the Spoken Dutch Corpus [5] and the Corpus of Spontaneous Japanese [6]. At present, a number of medium-sized and large Russian speech corpora are available. The largest published corpus of Russian speech is the ORD (One Day of Speech) corpus, which is still under development [7]. It contains more than 1000 h of everyday speech and is partially annotated and transcribed. However, this corpus is not publicly available. The most thoroughly annotated publicly available corpus is currently PrACS-Russ (Prosodically Annotated Corpus of Spoken Russian), which contains over 4 h of monologue speech [8] and is available as part of the Russian National Corpus [9]. Other corpora containing well-annotated high-quality recordings are not publicly available. One of them is the Corpus of Professionally Read Speech (CORPRES), which contains over 30 h of speech recorded in a professional studio [10]. The corpus of monologues RuSpeech contains about 50 h of transcribed recordings produced by 220 speakers [11]. CoRuSS (Corpus of Russian Spontaneous Speech) is designed as a publicly available resource containing high-quality recordings of spontaneous speech with detailed prosodic transcription [12]. The recordings include dialogues between native Russian speakers; part of the material, at least 14 h of speech from 60 speakers, is annotated by expert linguists at the lexical and prosodic levels.

One of the main factors determining the usability of large speech corpora is the availability and accuracy of their annotations. For example, the TIMIT corpus is very popular in phonetic and speech technology studies because of its very accurate phonetic transcriptions. Broad phonetic transcriptions are often used, and sometimes even required, for tasks such as lexical pronunciation variation modelling for automatic speech recognition, unit selection for speech synthesis [10, 11, 13], automatic pronunciation training and assessment in Computer Assisted Language Learning [14] and general research on pronunciation variation [15]. Contemporary speech corpora are therefore usually provided with a broad phonetic transcription of at least part of their material. Time and money permitting, this transcription is produced with the help of expert phoneticians in order to ensure a more accurate representation of the material. Employing experts is known to be exceedingly time-consuming and expensive when they have to transcribe speech from scratch. That is why it is common practice to provide transcribers with an example transcription that they verify on the basis of their own perception of the speech signal [1].

Among the numerous approaches to producing transcriptions for text-to-speech (TTS) systems, the simplest is to use a small set of letter-to-sound rules to guess the pronunciation of any word. Each rule specifies a correspondence between letters and sounds. In some cases the letter’s context is used to determine which rule should be applied. However, pronunciation varies greatly in any language. A transcription made for a TTS system usually gives one ideal variant for the text. This variant can be predicted and then modified according to the acoustic and phonetic quality of the sounds, speaker characteristics and so on. In speech recognition tasks it is more important to have correct information not only about the phonemes but also about their exact acoustic characteristics and variation, at least those characteristics that can be predicted from the context beforehand. A grapheme-to-phoneme transcriber can use a dictionary-lookup approach, but a dictionary tells nothing about the sound changes at word and phrase boundaries. Therefore the transcription rules should use all available knowledge about the contextual variation of sounds in standard pronunciation, the phonetic changes and their frequency of occurrence in speech.

In this paper we present a reliable method for automatic transcription of Russian text into phonetic symbols. The system was used to model phonetic transcription for CoRuSS, the corpus of spontaneous Russian speech [12].

This paper is organised as follows. In Sect. 2, we introduce the design and main principles of the automatic transcriber. Section 3 sketches the problems of extending the rules. Section 4 presents the inclusion of the speech variability rules. In Sect. 5 we formulate our conclusions.

2 Design of the Automatic Phonetic Transcriber

The program was developed in Java (JDK 1.8). Each rule specifies a correspondence between phonetic symbols and letters. The letter’s context is used to determine which rule should be applied. We implemented these processes as context-dependent rules that model both within-word and cross-word contexts in which phones can be deleted, inserted or substituted with other phones.
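The context-dependent rule mechanism can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java implementation; the `Rule` class, the `apply_rules` function, the Latin transliteration and the two sample rules are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """One hypothetical context-dependent rewrite rule."""
    target: str                  # input symbol the rule applies to
    replacement: str             # output string ("" models deletion)
    left: Optional[str] = None   # allowed preceding symbols, None = any
    right: Optional[str] = None  # allowed following symbols, None = any


def apply_rules(symbols: str, rules: List[Rule]) -> str:
    """Rewrite each symbol with the first rule whose context matches.

    Contexts are checked against the input string; '#' stands for the
    word boundary on either side.
    """
    out = []
    for i, ch in enumerate(symbols):
        prev_ch = symbols[i - 1] if i > 0 else "#"
        next_ch = symbols[i + 1] if i + 1 < len(symbols) else "#"
        for rule in rules:
            if (rule.target == ch
                    and (rule.left is None or prev_ch in rule.left)
                    and (rule.right is None or next_ch in rule.right)):
                out.append(rule.replacement)
                break
        else:
            out.append(ch)  # no rule matched: keep the symbol unchanged
    return "".join(out)


# Illustrative rules only (not taken from the transcriber's rule set):
SAMPLE_RULES = [
    Rule(target="d", replacement="t", right="k"),           # substitution
    Rule(target="t", replacement="", left="s", right="n"),  # deletion
]

print(apply_rules("lodka", SAMPLE_RULES))     # lotka
print(apply_rules("chestnyj", SAMPLE_RULES))  # chesnyj
```

Insertion can be expressed in the same scheme by giving a rule a replacement that is longer than its target.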

The set of phonological and phonetic rules, which differ according to conditions, is based on phonetic knowledge obtained in experimental studies of a large amount of Russian speech material since the beginning of the previous century. There are 6 vowel phonemes and 36 consonant phonemes in standard (literary) Russian [16,17,18]. The transcriber has been developed following the principles proposed by S. Stepanova [19] and K. Shalonova [20, 21]. Moreover, the coarticulation and sound change processes of standard Russian (as of any other language) are constantly changing. In order to cover all this variation we decided to work not with separate letter-to-phoneme associations but with the characteristics of sound classes and the processes of assimilation, dissimilation, insertion and deletion of sounds. This gives us the opportunity to model allophonic variation that is not usually provided by other phonetic transcription systems. All exceptions are also taken into account.

For example, the Russian phoneme “c̆” has no voiced pair in the system. Among the allophones of “c̆” there are voiced and unvoiced variants. Therefore it is important for the transcriber to model correctly the exact variant which should be used in the transcription using the preceding and following letters.

The quality of vowel phonemes in Russian varies according to the word stress, the position in the phrase and the quality of the neighboring consonants before and after the vowel. To produce a correct result the transcriber needs information about the position of the word stress. It can process words with primary and secondary stress. The markers are “1” for primary word stress and “2” for secondary stress, and each is placed after the stressed vowel in the orthographic text. Our transcriber does not perform automatic stress detection in the orthographic text.
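Reading such stress markers from the input can be sketched as follows. This is an illustration in Python, not the authors' code: the function name `split_stress` and the marker table are assumptions, and the digits follow the convention described for the CoRuSS orthographic annotation (a digit placed directly after the stressed vowel).

```python
from typing import List, Tuple

VOWELS = set("аеёиоуыэюя")  # lowercase only: the annotation uses no capitals
# Assumed marker table; adjust it if a different digit convention is used.
STRESS_MARKS = {"1": "primary", "2": "secondary"}


def split_stress(word: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Separate stress digits from the letters of one annotated word.

    Returns the plain word and a list of (vowel_index, stress_type) pairs,
    where vowel_index is the position of the vowel in the plain word.
    """
    letters: List[str] = []
    stresses: List[Tuple[int, str]] = []
    for ch in word:
        if ch in STRESS_MARKS:
            # A marker is only valid directly after a vowel letter.
            if not letters or letters[-1] not in VOWELS:
                raise ValueError(f"stress mark {ch!r} does not follow a vowel")
            stresses.append((len(letters) - 1, STRESS_MARKS[ch]))
        else:
            letters.append(ch)
    return "".join(letters), stresses


plain, marks = split_stress("погуля1й")
print(plain, marks)
```

Keeping the stress positions separate from the letters lets the vowel rules look up whether a vowel is stressed, pre-stressed or post-stressed without re-parsing the text.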

There are more than 200 rules for vowel transformations that include all this information. Exceptions are also taken into account for vowel transformations by inserting them into the rules (Fig. 1).

Consonant variation depends on the quality of the neighboring sounds. There are different kinds of consonant assimilation in Russian, and it is usually regressive. Consonants become similar or different in palatalization, voicing, place of articulation and manner of articulation. Consonant insertion and deletion processes are also taken into account. There are more than 200 rules for consonant transformations, including special consonant sequences inside words (Fig. 2).
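Regressive voicing assimilation, one of the processes named above, can be sketched as follows. This is a simplified illustration in Python under stated assumptions: phones are single Latin characters, palatalization is ignored, only five voiced/voiceless pairs are listed, and the function name `assimilate_voicing` is hypothetical. The actual rule set of the transcriber is much richer.

```python
# Simplified obstruent inventory (an assumption for this sketch).
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}
VOICELESS_TO_VOICED = {v: k for k, v in VOICED_TO_VOICELESS.items()}
# Voiceless obstruents that trigger devoicing, including unpaired ones.
VOICELESS = set(VOICELESS_TO_VOICED) | {"x", "c"}


def assimilate_voicing(phones: str) -> str:
    """Make each obstruent agree in voicing with the obstruent after it.

    The string is scanned right to left so that a change propagates
    through a whole cluster. /v/ does not trigger voicing of a preceding
    voiceless obstruent in this sketch.
    """
    out = list(phones)
    for i in range(len(out) - 2, -1, -1):
        cur, nxt = out[i], out[i + 1]
        if nxt in VOICELESS and cur in VOICED_TO_VOICELESS:
            out[i] = VOICED_TO_VOICELESS[cur]
        elif (nxt in VOICED_TO_VOICELESS and nxt != "v"
              and cur in VOICELESS_TO_VOICED):
            out[i] = VOICELESS_TO_VOICED[cur]
    return "".join(out)


print(assimilate_voicing("lodka"))  # lotka
print(assimilate_voicing("sdat"))   # zdat
print(assimilate_voicing("tvoj"))   # tvoj
```

Because the function only looks at adjacent phones, it could be applied to a phone string that spans a word boundary without a pause, which is one way to cover the cross-word case.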

The resulting rule set comprises phonological and phonetic rules describing progressive and regressive voice assimilation and palatalisation, as well as more specific rules modelling pronunciation variation in high-frequency words. We tried to take into account all the possible modifications and sound changes that can happen within a word and at word boundaries. In addition, the transcriber processes pause markers and modifies the resulting transcription according to the position of the pause in the text and the pause type. There are several types of pauses: the end of a phrase, an intake of breath, a sudden hesitation, etc. Depending on the type of the sound, the transcriber decides whether a final noise consonant (obstruent) should be voiced or unvoiced (Fig. 3).

Fig. 1. Example of the grapheme-to-phoneme rules for vowels

Fig. 2. Example of the grapheme-to-phoneme rules for consonants

Fig. 3. Example of the grapheme-to-phoneme rules for consonant sequences

Fig. 4. Example of the Russian orthographic text for processing. ‘1’ is put after vowels to show the primary stress, ‘2’ is written after the vowels to show the secondary stress. The intonation markers are also included in the orthographic text. They show the intonation phrase borders and type of intonation

Fig. 5. Example of transcription. ‘0’ is put after vowels to show the primary stress, ‘8’ is written after the vowels to show the secondary stress

Processes at word boundaries in connected speech and sound transformations at the end of the phrase are also included in the program. If the processed text contains phrase boundary markers and information about pauses, speech breaks and intakes of breath, the program processes them automatically and determines the phonetic quality of the sounds at the boundaries according to Russian pronunciation norms (Figs. 4 and 5).
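One boundary effect of this kind, devoicing of a word-final obstruent before a pause, can be sketched as follows. This is a Python illustration under stated assumptions: the pause token `<p>`, the function name `devoice_before_pause`, the single-character Latin phone symbols and the reduced consonant table are all hypothetical, and the different pause types distinguished by the transcriber are collapsed into one.

```python
from typing import List

PAUSE = "<p>"  # hypothetical token standing for any pause marker
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}


def devoice_before_pause(tokens: List[str]) -> List[str]:
    """Devoice the last obstruent of a word followed by a pause.

    The end of the token list is treated like a pause. Words followed
    directly by another word are left unchanged here, so that cross-word
    assimilation rules can handle them separately.
    """
    out = []
    for i, tok in enumerate(tokens):
        at_boundary = i + 1 == len(tokens) or tokens[i + 1] == PAUSE
        if tok and tok != PAUSE and at_boundary and tok[-1] in DEVOICE:
            tok = tok[:-1] + DEVOICE[tok[-1]]
        out.append(tok)
    return out


print(devoice_before_pause(["sad", PAUSE, "gorod"]))
print(devoice_before_pause(["god", "byl"]))
```

A fuller version would carry the pause type on the token and select a different treatment for each type.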

3 Rules Extensions and Refinements

At first we aimed at approximating transcriptions made with a limited set of rules and symbols. Then we included rules for the pronunciation exceptions from the dictionary. The transcriber was developed to produce transcriptions for the CoRuSS corpus [12], which contains 30 h of high-quality recordings of spontaneous Russian speech. The recordings consist of dialogues between two speakers, monologues (speakers’ self-presentations) and readings of a short phonetically balanced text. Since the corpus is labeled for a wide range of linguistic-phonetic and prosodic information, it provides a basis for empirical studies of various spontaneous speech phenomena. It also allows these phenomena to be compared with the ones observed in prepared read speech. The corpus has orthographic and prosodic annotation for part of the material. The orthographic transcription of the recordings uses no capital letters or punctuation marks; the only exception is the question mark, which denotes question phrases. Each word was written in standard spelling regardless of whether it was pronounced properly, mispronounced or produced in a contracted form. The orthographic annotation also contains information about lexical stress: strong (primary) stress is marked with 1 after the vowel. The symbol 2 is used for vowels carrying secondary or weak stress and for the vowels /o/, /e/ with no qualitative reduction. The Russian grapheme ‘ё’ was never replaced by ‘е’ in this corpus.

The transcriber was tested manually. At first, different texts from the CoRuSS corpus [12] were processed and checked by expert phoneticians. The manually verified phonetic transcriptions were required to tune the transcription procedures and to evaluate their performance. We took into account very special cases of Russian pronunciation that occur in connected speech and cannot be derived from an orthographic dictionary containing only word transcriptions. In order to ensure the applicability of the transcription procedures in context, we optimised them with limited resources and minimal human effort using the statistics of sound changes in standard pronunciation from the real speech corpus CORPRES. Further additions and refinements to the rules could reduce the error rate still further.

4 Modeling Speech Variation

The resulting transcriptions were updated using the results of the manual segmentation and labelling of real speech made by expert phoneticians for the CORPRES speech corpus [10]. The material contains two levels of transcription: a manual phonetic transcription (the sounds actually pronounced by the speakers) and a rule-based phonetic transcription (automatically generated by another text transcriber for TTS and partially corrected by the experts). The ideal transcription in the CORPRES corpus did not contain phonetic variants within the pronunciation standard.

We counted the occurrence rate of different phonetic sequences in the same contexts for the ideal transcriptions in the CORPRES corpus and improved the rules by using either several variants of transcription or the most frequent one.

For example, the Russian word /pagul’a0j/ has several phonetic transcription variants that occur in standard pronunciation (Fig. 6):

  • [pəgul’a0i] - this variant occurred 0 times in the corpus (the dictionary standard).

  • [pogul’ai] - this variant occurred 3 times in the corpus.

  • [pugul’ai] - this variant occurred 5 times in the corpus.

Fig. 6. Example of transcription including the results of speech variability from the CORPRES. ‘0’ is put after vowels to show the primary stress, ‘8’ is written after the vowels to show the secondary stress

The example shows the variants of standard pronunciation and their frequency of occurrence in the phonetic transcription.
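The counting step can be sketched in a few lines of Python. The function names are hypothetical, and the observation list simply reproduces the counts reported for this word (3 and 5 occurrences, with the dictionary variant unobserved); how the actual system stores its statistics is not described in the paper.

```python
from collections import Counter
from typing import Dict, List


def variant_frequencies(observed: List[str]) -> Dict[str, float]:
    """Relative frequency of each observed pronunciation variant."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


def rank_variants(observed: List[str], keep: int = 1) -> List[str]:
    """Return the `keep` most frequent variants, most frequent first."""
    return [variant for variant, _ in Counter(observed).most_common(keep)]


# Observations reproducing the counts given for this example word.
observed = ["pogul'ai"] * 3 + ["pugul'ai"] * 5

print(rank_variants(observed))          # ["pugul'ai"]
print(rank_variants(observed, keep=2))  # ["pugul'ai", "pogul'ai"]
print(variant_frequencies(observed))
```

Setting `keep` to 1 corresponds to using only the most frequent variant in a rule, while a larger value corresponds to keeping several variants.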

5 Conclusions

The results have shown that our transcriber is reliable and can be used for speech technology tasks that require phonetic transcription of text, such as speech segmentation, text-to-speech synthesis and automatic speech recognition.

The transcriber could be adapted to the speaker as long as we know his/her speech peculiarities.

The automatic transcription can serve as an example for the human transcribers.

An ASR system or a speech alignment system can be provided with a precise phonetic transcription when the text to be recognised is available.