An evaluation of sentence selection methods on the different phone-sized units for constructing Indonesian speech corpus

Muljono; Harjoko, Agus; Winarsih, Nurul Anisa Sri; Supriyanto, Catur

doi:10.1007/s10772-019-09662-1

Published: 23 December 2019

Volume 23, pages 141–147, (2020)

International Journal of Speech Technology

Muljono ORCID: orcid.org/0000-0001-6059-3571¹,
Agus Harjoko²,
Nurul Anisa Sri Winarsih¹ &
…
Catur Supriyanto¹

Abstract

Collecting a phonetically balanced text corpus is an important step in developing automatic speech recognition and text-to-speech systems. Such a corpus should contain a small number of sentences yet cover all phonetic units, such as monophone, triphone, and pentaphone units. The least-to-most greedy algorithm (LTM + Greedy) and its variant have been proposed for selecting a minimum sentence set. The variant differs in its sentence scoring method, which affects the number of selected sentences. In this paper, we evaluate the sentence scoring methods of Zhang and of Suyanto within the LTM + Greedy algorithm. The scoring methods are applied to the triphone and pentaphone units of the sentence collection, since triphone and pentaphone units offer higher-quality synthesized speech than monophone units. The dataset consists of Indonesian sentences collected from holy book translations, news, novels, dialogs, monologues, and question sentences. In total, 115,489 sentences are used in the experiments. The experiments show that LTM + Greedy with Suyanto's scoring produces a smaller number of sentences that contain a large number of phone units.

1 Introduction

A high-quality speech corpus is crucial for developing automatic speech recognition (ASR) and text-to-speech (TTS) synthesis systems. Both ASR and TTS can be used to build machine translation systems: ASR converts speech into text, and TTS converts text into speech. ASR has also been implemented in various applications such as hands-free operation and control, automatic query answering, telephone interactive voice response systems, and automatic dictation (Alghamdi et al. 2007). A speech corpus contains audio files and their transcripts (the so-called text corpus) (Patel and Kopparapu 2015). The performance of ASR and TTS depends on a phonetically balanced text corpus (Abushariah et al. 2010, 2012). The corpus should contain a small number of sentences but cover all phonetic units.

A speech corpus can be constructed based on syllables or phonemes. The difference is that a syllable consists of a vowel and its accompanying consonants, whilst a phoneme is the smallest unit of sound. For example, the word 'me' (aku) is made up of one syllable but has two phonemes, i.e. [m] and [e]. Barkhoda et al. (2009) showed that a phoneme-based speech synthesizer sounds more natural than a syllable-based one. Therefore, we study the phoneme-based speech corpus. In speech synthesis, several phone-sized units are frequently used to produce high-quality synthesized speech, i.e. monophone, triphone, and pentaphone units. Longer phone-sized units yield more natural synthesized speech (Anushiya et al. 2013, 2015). In a study of Mandarin automatic speech recognition, Xu et al. (2018) showed that the triphone model achieved a higher recognition rate than the monophone model.

The issue addressed in this study is how to select a minimum sentence set that contains all phonetic units. The classical method for this problem is a greedy search algorithm (van Santen and Buchsbaum 1997). The algorithm selects sentences from the mother sentence set by scoring each sentence based on the units that are still uncovered. The standard greedy (SG) algorithm has been used to create speech corpora in many languages, such as Indonesian (Suyanto 2006), Bangla (Murtoza Habib et al. 2011), and Czech (Matoušek and Romportl 2006). The SG algorithm searches for the minimum sentence set by calculating the score of each sentence; in each iteration, the sentence with the highest score is selected. Sentence scoring is therefore an essential aspect of selecting the best sentences.

However, the SG algorithm has a high computational cost and generates a large number of sentences. To reduce the computational cost of SG, Zhang and Nakamura proposed the least-to-most greedy (LTM + Greedy) algorithm, which selects a sentence only from the subset of sentences that contain the unit with the lowest frequency (Zhang and Nakamura 2001). The objective of both the SG and LTM + Greedy algorithms is to generate a minimum sentence set that covers all phonetic units. The objective of this study is to compare sentence selection methods on different phone-sized units, specifically the triphone and pentaphone units.

The rest of this paper is organized as follows. Section 2 describes the Indonesian phonemes and reviews the related sentence selection algorithms. The methodology is introduced in Sect. 3. Section 4 gives the results and discussion. Section 5 concludes our work and outlines future work.

2 Indonesian text corpus

2.1 The Indonesian phoneme

Indonesia is located in Southeast Asia, with Jakarta as its capital city. Although Bahasa Indonesia is the national language of Indonesia, more than 195 million people in Indonesia speak it as a second language (Sakti et al. 2004). Many people in Indonesia speak a traditional language, such as Javanese, Sundanese, or Balinese (Muljono et al. 2016a). Bahasa Indonesia belongs to the Austronesian language family (O’Grady and Archibald 2000). It is spoken not only in Indonesia but also in Malaysia, Singapore, Southern Thailand, and Brunei (where it is called Bahasa Melayu). Bahasa Indonesia is also one of the minority languages in the Netherlands (Comrie 2009).

Bahasa Indonesia has 35 phonemes, consisting of 6 vowel phonemes, 4 diphthong phonemes, 24 consonant phonemes, and a silence (Suyanto 2006; Muljono et al. 2016b, c). A phoneme is the smallest sound unit. Determining the phonemes is the first step in building a text-to-speech (TTS) synthesis system. This step establishes the correct utterance and directly affects the design of the speech corpus. This study adds 2 consonant phonemes, namely ‘z’ and ‘x’, which are borrowed from other languages such as Arabic and English. Table 1 shows the Bahasa Indonesia phonemes.

Table 1 Bahasa Indonesia phonemes

Defining the phoneme units is important for a phonetically balanced text corpus (Murtoza Habib et al. 2011). The units are classified into monophones (one phoneme), diphones (two phonemes), triphones (three phonemes), and pentaphones (five phonemes). Diphones are further classified into right and left diphones: right diphones are taken from left to right of the sentence, and left diphones are taken from right to left. This study uses triphones and pentaphones because the waveform of a speech signal depends on its previous and next phonemes (Suyanto 2006). Examples of the phonetically balancing units are shown in Table 2.

Table 2 Examples of phonetically balancing units

2.2 Sentence selection algorithms

Greedy search (Zhang and Nakamura 2003) is the classical algorithm for selecting a minimum sentence set for a speech corpus. The standard greedy (SG) algorithm is detailed in Algorithm 1. First, we provide the mother sentence set (denoted as S) and the list of to-be-covered units (denoted as U), which is taken from the mother sentence set. In each iteration, the algorithm scores all sentences in S and selects the sentence with the highest score, $S_h$. The units contained in $S_h$ are then deleted from U. The iteration stops when U is empty.
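The loop described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: the function name `standard_greedy`, the dictionary-of-sets input, and the choice of scoring each sentence by its number of uncovered unit types are all assumptions made for the example.

```python
from typing import Dict, List, Set


def standard_greedy(sentence_units: Dict[str, Set[str]]) -> List[str]:
    """Select sentences until every unit of the mother set is covered.

    sentence_units maps a sentence id to the set of units it contains
    (hypothetical input format, chosen for this sketch).
    """
    # U: the to-be-covered unit list, taken from the mother sentence set S.
    uncovered: Set[str] = set().union(*sentence_units.values())
    remaining = dict(sentence_units)
    selected: List[str] = []

    while uncovered:
        # Score every remaining sentence in each iteration. The score used
        # here (number of uncovered unit types) is an assumed placeholder.
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        selected.append(best)
        # Delete the units contained in the selected sentence from U.
        uncovered -= remaining.pop(best)

    return selected


mother = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a"},
    "s4": {"d", "e"},
}
print(standard_greedy(mother))
```

Because every remaining sentence is rescored in every iteration, the cost grows with the size of the whole mother set, which is the overhead that the least-to-most variant tries to avoid.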

To reduce the computational cost of the SG algorithm, Zhang and Nakamura (2001) sorted the to-be-covered units by their frequency of appearance in ascending order; this method is called the least-to-most greedy algorithm (LTM + Greedy). Each uncovered unit has a subset of sentences that contain at least one token of that unit. LTM + Greedy is faster than SG because it searches for the best sentence only within this subset. LTM + Greedy is shown in Algorithm 2.
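A minimal sketch of the least-to-most ordering is given below. It is not the authors' Algorithm 2: the function name `ltm_greedy`, the input format, the tie-breaking by insertion order, and the default scoring function `count_uncovered` are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def count_uncovered(units: List[str], uncovered: Set[str]) -> float:
    """Assumed default score: number of uncovered unit types in a sentence."""
    return len(set(units) & uncovered)


def ltm_greedy(
    sentence_units: Dict[str, List[str]],
    score: Callable[[List[str], Set[str]], float] = count_uncovered,
) -> List[str]:
    """Least-to-most greedy selection over unit tokens per sentence."""
    # Frequency of appearance of every unit in the mother sentence set.
    freq = Counter(u for units in sentence_units.values() for u in units)

    # For each unit, the subset of sentences containing at least one token.
    subset: Dict[str, List[str]] = {}
    for sid, units in sentence_units.items():
        for u in set(units):
            subset.setdefault(u, []).append(sid)

    uncovered: Set[str] = set(freq)
    selected: List[str] = []

    # Visit the units from least to most frequent.
    for unit in sorted(freq, key=lambda u: freq[u]):
        if unit not in uncovered:
            continue  # already covered by an earlier selection
        # Only the sentences in this unit's subset are scored.
        best = max(subset[unit], key=lambda sid: score(sentence_units[sid], uncovered))
        selected.append(best)
        uncovered -= set(sentence_units[best])

    return selected


mother = {
    "s1": ["a", "b", "c"],
    "s2": ["c", "d"],
    "s3": ["a", "b"],
    "s4": ["d", "e", "a"],
}
print(ltm_greedy(mother))
```

In this toy set the rarest unit `e` occurs only in `s4`, so that sentence is selected first without scoring any other sentence.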

Similar research was carried out by Suyanto (2007), who proposed a modified sentence scoring for the LTM + Greedy algorithm. The sentence scoring of Zhang and Nakamura (2003) is given in Eq. (1), and the modified sentence scoring of Suyanto in Eq. (2). Suyanto addressed the issue that the scoring of Zhang and Nakamura (2003), which divides by the total number of unit tokens in the sentence, assigns long sentences a low score.

$$S_i=\frac{\text{types of uncovered units in sentence } i}{\text{total tokens of units in sentence } i} \tag{1}$$

$$S_i=\text{types of uncovered units in sentence } i \tag{2}$$
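The two scores can be compared directly on a toy example. The sketch below assumes that each sentence is represented as a list of unit tokens; the function names `score_zhang` and `score_suyanto` are labels chosen for this illustration only.

```python
from typing import Sequence, Set


def score_zhang(units: Sequence[str], uncovered: Set[str]) -> float:
    """Eq. (1): uncovered unit types divided by total unit tokens."""
    if not units:
        return 0.0
    return len(set(units) & uncovered) / len(units)


def score_suyanto(units: Sequence[str], uncovered: Set[str]) -> int:
    """Eq. (2): number of uncovered unit types in the sentence."""
    return len(set(units) & uncovered)


uncovered = {"a", "b", "c"}
short = ["a", "b"]                                # 2 tokens, 2 uncovered types
long = ["a", "b", "c", "x", "x", "x", "x", "x"]   # 8 tokens, 3 uncovered types

print(score_zhang(short, uncovered), score_zhang(long, uncovered))
print(score_suyanto(short, uncovered), score_suyanto(long, uncovered))
```

Under Eq. (1) the short sentence wins although it covers fewer new units, whereas under Eq. (2) the long sentence wins. This is the long-sentence penalty that the modified scoring removes.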

3 Methodology

Table 3 Statistics of the mother sentence set

3.1 Preprocessing steps

In this study, the raw text corpus was collected from several sources: holy book translations, news, novels, dialogs, monologues, and question sentences. The preprocessing steps applied to the raw text corpus are as follows:

1. Sentence segmentation: split the text into sentences at punctuation marks such as the full stop (.), question mark (?), exclamation mark (!), and quotation marks (’ or ”).
2. Number and symbol conversion: convert numbers and symbols to words. For example, 123 becomes seratus dua puluh tiga (one hundred and twenty-three) and the symbol $ becomes dolar (dollar). We also delete the hyphen, so that, for example, laki-laki (man) becomes laki laki.
3. Inspection of e: check every letter e and rewrite it according to its pronunciation. For example, meja (table) becomes m@ja.
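The first two steps can be sketched in a few lines. This is an illustration under simplifying assumptions, not the authors' pipeline: the names `number_to_words` and `preprocess` are hypothetical, only numbers from 0 to 999 are spelled out, only the $ symbol is converted, and step 3 is omitted because it needs pronunciation information for each word.

```python
import re
from typing import List

UNITS = ["nol", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan"]

# Split points from step 1: . ? ! and straight or curly quotation marks.
SPLIT_RE = re.compile("[.?!\"'\u2018\u2019\u201c\u201d]+")


def number_to_words(n: int) -> str:
    """Spell out 0..999 in Indonesian (larger numbers are out of scope)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return UNITS[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        return UNITS[tens] + " puluh" + (" " + UNITS[rest] if rest else "")
    hundreds, rest = divmod(n, 100)
    head = "seratus" if hundreds == 1 else UNITS[hundreds] + " ratus"
    return head + (" " + number_to_words(rest) if rest else "")


def _spell(match: "re.Match[str]") -> str:
    """Replace a digit run by words; leave unsupported numbers unchanged."""
    n = int(match.group())
    return f" {number_to_words(n)} " if n <= 999 else match.group()


def preprocess(text: str) -> List[str]:
    sentences = []
    for part in SPLIT_RE.split(text):        # step 1: sentence segmentation
        s = part.replace("$", " dolar ")     # step 2: symbol conversion
        s = re.sub(r"\d+", _spell, s)        # step 2: number conversion
        s = s.replace("-", " ")              # step 2: delete the hyphen
        s = " ".join(s.split())              # normalize whitespace
        if s:
            sentences.append(s)
    return sentences


print(preprocess("Harga buku itu $ 123. Anak laki-laki itu makan!"))
```

Splitting on every full stop would also break decimal numbers and abbreviations, so a production pipeline would need additional rules.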

Table 3 shows the detailed statistics of our mother sentence set. The preprocessing steps yield 115,489 sentences, which form the mother sentence set. The mother sentence set contains 6,225,794 triphone tokens and 5,741,062 pentaphone tokens, while the numbers of distinct triphones and distinct pentaphones are 13,501 and 214,868, respectively. The number of distinct units counts each unit once, without duplicates. For example, the sentence makan malam contains 11 triphones, i.e., [sil-m+a] [m-a+k] [a-k+a] [k-a+n] [a-n+sil] [n-sil+m] [sil-m+a] [m-a+l] [a-l+a] [l-a+m] [a-m+sil]. The triphone [sil-m+a] occurs twice, so the sentence contains 10 distinct triphones, i.e., [sil-m+a] [m-a+k] [a-k+a] [k-a+n] [a-n+sil] [n-sil+m] [m-a+l] [a-l+a] [l-a+m] [a-m+sil]. Examples of pentaphones can be seen in Table 2.
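The makan malam count can be reproduced with a short sketch. It assumes that every letter is one phoneme and that a silence (sil) is placed at the start, at the end, and between words; this holds for this example but not for Indonesian in general. The helper names `phonemes_of` and `triphones` are hypothetical.

```python
from collections import Counter
from typing import List


def phonemes_of(sentence: str) -> List[str]:
    """Naive phoneme sequence: one letter per phoneme, sil around each word."""
    seq = ["sil"]
    for word in sentence.split():
        seq.extend(word)   # assumption: each letter is exactly one phoneme
        seq.append("sil")
    return seq


def triphones(seq: List[str]) -> List[str]:
    """Slide a window of three phonemes and format it as [left-center+right]."""
    return [f"[{l}-{c}+{r}]" for l, c, r in zip(seq, seq[1:], seq[2:])]


tokens = triphones(phonemes_of("makan malam"))
counts = Counter(tokens)

print(len(tokens))          # number of triphone tokens
print(len(counts))          # number of distinct triphones
print(counts["[sil-m+a]"])  # occurrences of the repeated triphone
```

The same sliding window with a width of five would give pentaphone tokens, although the exact boundary handling used for the corpus statistics in Table 3 is not specified here.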

3.2 Experimental design

Two sentence scoring methods are applied to select the minimum sentence set from the mother sentence set. The first method was proposed by Zhang and Nakamura (2003), as shown in Eq. (1); the second was proposed by Suyanto (2007), as shown in Eq. (2). Both scoring methods are evaluated within the LTM + Greedy algorithm. All experiments were run on a machine with 16 GB of physical memory (RAM). Following the previous research (Zhang and Nakamura 2003), our experimental results are described from three points of view: the size of the generated sets, search analysis, and computation costs.

Table 4 Results of Algorithms

4 Experimental results

4.1 Size of the generated sets

Table 4 shows the results of the two methods. Both methods are applied to triphone and pentaphone units. In terms of the number of sentences, LTM + Greedy by Suyanto produces a slightly smaller set than LTM + Greedy by Zhang for both triphones and pentaphones. For triphones, LTM + Greedy by Suyanto and by Zhang generate 3,443 and 3,531 sentences, respectively. For pentaphones, they generate 35,816 and 36,798 sentences, respectively.

In terms of the number of words and the number of distinct words, LTM + Greedy by Zhang produces the smaller values for both triphones and pentaphones. For the average, maximum, and minimum number of phonemes per sentence, the two methods give almost the same values. Based on the sentences generated by both methods, LTM + Greedy by Suyanto is able to select a minimum sentence set that contains a large number of phonetic units.

Of the seven parameters listed in Table 4, the number of sentences and the number of phone units can generally be used as the performance measures. A method performs best when it generates the fewest sentences while retaining a large number of phonemes. The experiments show that LTM + Greedy by Suyanto generated the smallest sentence set for both triphone and pentaphone units.

4.2 Search analysis

Figures 1 and 2 are used to analyze the search performance of the two methods on the triphone and pentaphone units, respectively. The performance of the two methods is almost identical. The two methods tend to select sentences with the same number of tokens in each iteration.

4.3 Computation costs

We report the computational cost of each method in Table 5. The computational costs of LTM + Greedy by Zhang and by Suyanto do not differ much. Both methods are therefore comparably efficient, generating the minimum sentence set within seconds or minutes for triphone and pentaphone units.

Table 5 Computational Time

5 Conclusions and future works

This paper presents an evaluation of sentence selection methods on different phone-sized units, i.e. triphone and pentaphone units. The selected sentence set is useful for constructing a natural speech corpus. The sentences were collected from several sources in Bahasa Indonesia. The experimental results show that LTM + Greedy by Suyanto generates a smaller minimum sentence set than LTM + Greedy by Zhang for constructing an Indonesian text corpus. It not only produces a smaller number of sentences, but the selected set also contains a large number of phone units. In the future, the generated minimum sentence set can be used to develop a speech corpus for an Indonesian TTS synthesis system, which would allow us to evaluate how naturally the TTS system generates speech from text.

References

Abushariah, M. A. M., Ainon, R. N., Zainuddin, R., Elshafei, M., & Khalifa, O. O. (2010). Phonetically rich and balanced speech corpus for Arabic speaker-independent continuous automatic speech recognition systems. In 10th international conference on information sciences, signal processing and their applications (pp. 65–68).
Abushariah, M. A., Ainon, R. N., Zainuddin, R., Elshafei, M., & Khalifa, O. O. (2012). Phonetically rich and balanced text and speech corpora for Arabic language. Language Resources and Evaluation, 46(4), 601–634.
Alghamdi, M., Elshafei, M., & Al-Muhtaseb, H. (2007). Arabic broadcast news transcription system. International Journal of Speech Technology, 10(4), 183–195.
Anushiya Rachel, G., Lilly Christina, S., Sherlin Solomi, V., Ramani, B., Vijayalakshmi, P., & Nagarajan, T. (2013). Development and analysis of various phone-sized unit-based speech synthesizers. In International conference oriental COCOSDA held jointly with 2013 conference on asian spoken language research and evaluation (pp. 1–5).
Anushiya Rachel, G., Sherlin Solomi, V., Naveenkumar, K., Vijayalakshmi, P., & Nagarajan, T. (2015). A small-footprint context-independent HMM-based synthesizer for Tamil. International Journal of Speech Technology, 18(3), 405–418.
Barkhoda, W., ZahirAzami, B., Bahrampour, A., & Shahryari, O. (2009). A comparison between allophone, syllable, and diphone based TTS systems for kurdish language. In International conference oriental COCOSDA held jointly with 2013 conference on asian spoken language research and evaluation (O-COCOSDA/CASLRE) (pp. 557–562).
Comrie, B. (2009). The world’s major languages (2nd ed.). NY: Routledge.
Matoušek, J., & Romportl, J. (2006). On building phonetically and prosodically rich speech corpus for text-to-speech synthesis. In Proceedings of the second IASTED international conference on computational intelligence (pp. 1–6).
Muljono, Sumpeno, S., Arifianto, D., Aikawa, K., & Purnomo, M. H. (2016a). Developing an online self-learning system of Indonesian pronunciation for foreign learners. International Journal of Emerging Technologies in Learning, 11(4), 83–89.
Muljono, M., Sumpeno, S., Arifianto, D., Aikawa, K., & Purnomo, M. H. (2016b). Indonesian text to audio visual speech with animated talking head. International Review on Computers and Software, 11(3), 261–269.
Muljono, Winarsih, N. A., & Supriyanto, C. (2016c). Evaluation of classification methods for Indonesian text emotion detection. In International seminar on application for technology of information and communication (ISemantic) (pp. 130–133).
Murtoza Habib, S. M., Alam, F., Sultana, R., Absar Chowdhur, S., & Khan, M. (2011). Phonetically balanced Bangla speech corpus. In Conference on human language technology for development (pp. 87–93).
O’Grady, W., & Archibald, J. (2000). Contemporary linguistic analysis: an introduction. Pearson Canada (pp. 130–133).
Patel, C., & Kopparapu, S. K. (2015). A multi-criteria text selection approach for building a speech corpus. In International conference on text, speech, and dialogue (pp. 15–22). Cham: Springer.
Sakti, S., Arman, A. A., Nakamura, S., & Hutagaol, P. (2004). Indonesian speech recognition for hearing and speaking impaired people. In 8th international conference on spoken language processing (pp. 1037–1040).
Suyanto. (2006). Modified least-to-most greedy algorithm to search a minimum sentence set. TENCON (pp. 1–3).
Suyanto. (2007). An Indonesian phonetically balanced sentence set for collecting speech database. Jurnal Teknologi Industri, 11(1), 59–68.
van Santen, J. P. H., & Buchsbaum, A. L. (1997). Methods for optimal text selection. In Proceedings of Eurospeech (pp. 553–556). Rhodes, Greece.
Xu, J., Zhu, Y., Xu, P., & Ma, D. (2018). Agricultural price information acquisition using noise-robust Mandarin auto speech recognition. International Journal of Speech Technology, 21(3), 681–688.
Zhang, J., & Nakamura, S. (2001). Least-to-most ordered search for minimum sentence set for collecting speech database. In Proceedings of ASJ (pp. 145–146).
Zhang, J., & Nakamura, S. (2003). An efficient algorithm to search for a minimum sentence set for collecting speech database. Proceedings of ICPhS (pp. 3145–3148).

Acknowledgements

The research was funded by the Ministry of Research, Technology, & Higher Education of Indonesia through Post Doctoral Research Scheme 2018 (Grant No. 028/K6/KM/SP2H/PENELITIAN/2018).

Author information

Authors and Affiliations

Department of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang, Indonesia
Muljono, Nurul Anisa Sri Winarsih & Catur Supriyanto
Department of Computer Sciences and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
Agus Harjoko

Corresponding author

Correspondence to Muljono.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Muljono, Harjoko, A., Winarsih, N.A.S. et al. An evaluation of sentence selection methods on the different phone-sized units for constructing Indonesian speech corpus. Int J Speech Technol 23, 141–147 (2020). https://doi.org/10.1007/s10772-019-09662-1

Received: 03 November 2018
Accepted: 13 December 2019
Published: 23 December 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10772-019-09662-1
