1 Introduction

A high-quality speech corpus is crucial for developing automatic speech recognition (ASR) and text-to-speech (TTS) synthesis systems. ASR converts speech into text, and TTS converts text into speech; both can be used to develop a machine translation system. ASR has also been implemented in various applications such as hands-free operation and control, automatic query answering, telephone interactive voice response systems, and automatic dictation (Alghamdi et al. 2007). A speech corpus contains audio files and their transcripts (the so-called text corpus) (Patel and Kopparapu 2015). The performance of ASR and TTS depends on a phonetically balanced text corpus (Abushariah et al. 2010, 2012). The corpus should contain a small number of sentences while covering all phonetic units.

A speech corpus can be constructed based on syllables or phonemes. A syllable consists of a vowel and its accompanying consonants, whilst a phoneme is the smallest unit of sound. For example, the word ’me’ (aku) is made up of one syllable but two phonemes, i.e. [m] and [e]. The study of Barkhoda et al. (2009) showed that a phoneme-based speech synthesizer produced more natural speech than a syllable-based one. Therefore, we study the phoneme-based speech corpus. In speech synthesis, several phone-sized units are frequently used to produce high-quality synthesized speech, i.e. monophone, triphone, and pentaphone units. Longer phone-sized units yield more natural synthesized speech (Anushiya et al. 2013, 2015). In a study of Mandarin automatic speech recognition, Xu et al. (2018) showed that the triphone model produced a higher recognition rate than the monophone model.

The issue addressed in this study is how to select the minimum sentence set that contains all phonetic units. The classical method for handling this issue is the greedy search algorithm (van Santen and Buchsbaum 1997). The algorithm selects sentences from the mother sentence set by scoring each sentence based on the units that are not yet covered. The standard greedy (SG) algorithm has been used to create speech corpora in many languages, such as Indonesian (Suyanto 2006), Bangla (Murtoza Habib et al. 2011), and Czech (Matoušek and Romportl 2006). The SG algorithm searches for the minimum sentence set by calculating the score of each sentence; in each iteration, the sentence with the highest score is selected. Sentence scoring is therefore an essential part of selecting the best sentences.

However, the SG algorithm has a high computational cost and generates a large number of sentences. To reduce the computational cost of the SG algorithm, Zhang and Nakamura (2001) proposed the least-to-most greedy (LTM + Greedy) algorithm, which selects a sentence only from the subset of sentences that contain the unit with the lowest frequency. The objective of both the SG and LTM + Greedy algorithms is to generate the minimum sentence set that covers all phonetic units. The objective of this study is to compare sentence selection methods on different phone-sized units, specifically triphone and pentaphone units.

The rest of this paper is organized as follows: Sect. 2 reviews related work. The methodology is introduced in Sect. 3. Section 4 gives the results and discussion. Section 5 concludes our work and outlines future work.

2 Indonesian text corpus

2.1 The Indonesian phoneme

Indonesia is located in Southeast Asia, with Jakarta as its capital city. Although Bahasa Indonesia is the national language of Indonesia, more than 195 million people in Indonesia speak it as a second language (Sakti et al. 2004). Many people in Indonesia speak a traditional language, such as Javanese, Sundanese, or Balinese (Muljono et al. 2016a). Bahasa Indonesia belongs to the Austronesian language family (O’Grady and Archibald 2000). It is spoken not only in Indonesia but also in Malaysia, Singapore, Southern Thailand, and Brunei (where it is called Bahasa Melayu). In addition, Bahasa Indonesia is one of the minority languages in the Netherlands (Comrie 2009).

Bahasa Indonesia has 35 phonemes, consisting of 6 vowel phonemes, 4 diphthong phonemes, 24 consonant phonemes, and a silence (Suyanto 2006; Muljono et al. 2016b, c). A phoneme is the smallest sound unit. Determining the phonemes is the first step in building a text-to-speech (TTS) synthesis system. This step establishes the correct utterance and directly affects the design of the speech corpus. This study adds 2 consonant phonemes, namely ‘z’ and ‘x’. These additional phonemes are borrowed from other languages such as Arabic and English. Table 1 shows the Bahasa Indonesia phonemes.

Table 1 Bahasa Indonesia phonemes

Defining the phoneme units is important for a phonetically balanced text corpus (Murtoza Habib et al. 2011). The units are classified into monophones (one phoneme), diphones (two phonemes), triphones (three phonemes), and pentaphones (five phonemes). Diphones are further classified into right and left diphones. Right diphones are taken from left to right of the sentence, and left diphones are taken from right to left. This study uses triphones and pentaphones because the waveform of the speech signal for a phoneme depends on its preceding and following phonemes (Suyanto 2006). Examples of the phonetic balancing units are shown in Table 2.

Table 2 Examples of phonetically balancing units

2.2 Sentence selection algorithms

Greedy selection (Zhang and Nakamura 2003) is the classical algorithm for selecting the minimum sentence set for a speech corpus. The standard greedy (SG) algorithm is detailed in Algorithm 1. First, we provide the mother sentence set (denoted as S) and the list of to-be-covered units (denoted as U), which is taken from the mother sentence set. In each iteration, the algorithm scores all sentences in S and selects the sentence with the highest score, \(S_h\). The to-be-covered units contained in \(S_h\) are then deleted from U. The iteration stops when U is empty.
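The loop described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the function names, the dictionary-based data layout, and the toy units `a` to `d` are hypothetical, and the ratio score is only one possible scoring function.

```python
def score_ratio(units, uncovered):
    """Ratio score: types of uncovered units divided by total unit tokens."""
    if not units:
        return 0.0
    return len(set(units) & uncovered) / len(units)


def standard_greedy(sentences, score):
    """Select sentence ids until every unit of the mother set is covered.

    sentences: dict mapping a sentence id to its list of unit tokens (S).
    score: function (unit_tokens, uncovered_set) -> number.
    """
    # U: every distinct unit found in the mother sentence set.
    uncovered = set()
    for units in sentences.values():
        uncovered.update(units)

    remaining = dict(sentences)
    selected = []
    while uncovered:
        # Score all remaining sentences and keep the highest one (S_h).
        best = max(remaining, key=lambda sid: score(remaining[sid], uncovered))
        selected.append(best)
        # Delete the units contained in S_h from U.
        uncovered -= set(remaining.pop(best))
    return selected


# Toy mother sentence set; the letters stand in for phonetic units.
mother = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["c", "d"],
    "s4": ["d"],
}
selected = standard_greedy(mother, score_ratio)
covered = set().union(*(mother[sid] for sid in selected))
```

Every iteration rescores all remaining sentences, which is why the cost grows quickly with the size of the mother sentence set.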

Algorithm 1 Standard greedy (SG) algorithm
Algorithm 2 Least-to-most greedy (LTM + Greedy) algorithm

To reduce the computational cost of the SG algorithm, Zhang and Nakamura (2001) sorted the to-be-covered units by their frequency of appearance in ascending order; the method is called the least-to-most greedy algorithm (LTM + Greedy). Each uncovered unit has a subset of sentences that contain at least one token of that unit. LTM + Greedy is faster than SG because it searches for the best sentence only within this subset. LTM + Greedy is shown in Algorithm 2.
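A minimal Python sketch of this least-to-most procedure follows. It is an illustration under stated assumptions rather than the published Algorithm 2: the names `ltm_greedy` and `count_uncovered`, the inverted index, the alphabetical tie-break between equally frequent units, and the toy data are all ours.

```python
from collections import Counter, defaultdict


def count_uncovered(units, uncovered):
    """Score a sentence by the number of uncovered unit types it contains."""
    return len(set(units) & uncovered)


def ltm_greedy(sentences, score):
    """Least-to-most greedy selection.

    sentences: dict mapping a sentence id to its list of unit tokens.
    score: function (unit_tokens, uncovered_set) -> number.
    """
    freq = Counter()            # unit -> number of tokens in the mother set
    index = defaultdict(list)   # unit -> ids of the sentences that contain it
    for sid, units in sentences.items():
        freq.update(units)
        for unit in set(units):
            index[unit].append(sid)

    uncovered = set(freq)
    selected = []
    # Visit units from least to most frequent (ties broken alphabetically,
    # which is an assumption made here for reproducibility).
    for unit in sorted(freq, key=lambda u: (freq[u], u)):
        if unit not in uncovered:
            continue
        # Only the subset of sentences containing this unit is scored.
        best = max(index[unit], key=lambda sid: score(sentences[sid], uncovered))
        selected.append(best)
        uncovered -= set(sentences[best])
    return selected


# Toy mother sentence set; the letters stand in for phonetic units.
mother = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["c", "d"],
    "s4": ["d", "e"],
}
selected = ltm_greedy(mother, count_uncovered)
```

In this toy run the rare unit `e` is handled first, so only `s4` is scored in the first step; the saving over a full scan comes from these small candidate subsets.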

Similar research was carried out by Suyanto (2007), who proposed a modified sentence scoring for the LTM + Greedy algorithm. He addressed the issue that the sentence scoring by Zhang and Nakamura (2003), given in Eq. (1), assigns low scores to long sentences. The modified sentence scoring by Suyanto is presented in Eq. (2).

$$S_i=\frac{\text{Types of uncovered units in sentence } i}{\text{Total tokens of units in sentence } i} \tag{1}$$
$$S_i=\text{Types of uncovered units in sentence } i \tag{2}$$
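The difference between the two scores can be made concrete with a small Python sketch. The function names and the toy units are assumptions introduced for illustration only; the formulas follow Eq. (1) and Eq. (2).

```python
def score_zhang(units, uncovered):
    """Eq. (1): types of uncovered units divided by total unit tokens."""
    if not units:
        return 0.0
    return len(set(units) & uncovered) / len(units)


def score_suyanto(units, uncovered):
    """Eq. (2): types of uncovered units, with no length normalization."""
    return len(set(units) & uncovered)


# Toy data: letters stand in for phonetic units; "x" is already covered.
uncovered = {"a", "b", "c", "d"}
short_sentence = ["a", "b"]                      # 2 tokens, 2 uncovered types
long_sentence = ["a", "b", "c", "x", "x", "x"]   # 6 tokens, 3 uncovered types

zhang_short = score_zhang(short_sentence, uncovered)      # 2 / 2 = 1.0
zhang_long = score_zhang(long_sentence, uncovered)        # 3 / 6 = 0.5
suyanto_short = score_suyanto(short_sentence, uncovered)  # 2
suyanto_long = score_suyanto(long_sentence, uncovered)    # 3
```

Under Eq. (1) the long sentence loses to the short one although it covers more new units, whereas Eq. (2) ranks it first.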

3 Methodology

Table 3 Statistics of the mother sentence set

3.1 Preprocessing steps

In this study, the raw text corpus was collected from many sources, such as holy book translations, news, novels, dialogs, monologues, and question sentences. The preprocessing steps applied to the raw text corpus are as follows:

  1. Sentence segmentation: split the text into sentences at punctuation marks such as the full stop (.), question mark (?), exclamation mark (!), and quotation marks (’ or ”).

  2. Number and symbol conversion: convert numbers and symbols to words; for example, 123 becomes seratus dua puluh tiga (one hundred and twenty three) and the symbol $ becomes dolar (dollar). We also delete the hyphen symbol; for example, laki-laki (some men) becomes laki laki.

  3. Inspection of e: check every letter e and adjust it according to how it is read. For example, meja (table) becomes m@ja.
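The first two steps can be sketched in Python as follows. This is a simplified illustration, not the pipeline used in the study: all function names are hypothetical, the number speller only covers 0 to 999, the symbol table contains only the dollar sign, and the inspection of e is left out because it depends on the pronunciation of each word.

```python
import re

ONES = ["nol", "satu", "dua", "tiga", "empat", "lima",
        "enam", "tujuh", "delapan", "sembilan"]
SYMBOLS = {"$": "dolar"}  # assumption: only the symbol named in the text


def number_to_words(n):
    """Spell out 0..999 in Indonesian (larger numbers are outside this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 10:
        return ONES[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return ONES[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        head = ONES[tens] + " puluh"
        return head if rest == 0 else head + " " + ONES[rest]
    hundreds, rest = divmod(n, 100)
    head = "seratus" if hundreds == 1 else ONES[hundreds] + " ratus"
    return head if rest == 0 else head + " " + number_to_words(rest)


def split_sentences(text):
    """Step 1: split at . ? ! and at the closing quotation marks."""
    parts = re.split(r"[.?!\u2019\u201d]+", text)
    return [part.strip() for part in parts if part.strip()]


def normalize(sentence):
    """Step 2: convert symbols and numbers to words and delete hyphens."""
    for symbol, word in SYMBOLS.items():
        sentence = sentence.replace(symbol, f" {word} ")
    sentence = re.sub(r"\d+",
                      lambda m: f" {number_to_words(int(m.group()))} ",
                      sentence)
    sentence = sentence.replace("-", " ")
    return " ".join(sentence.split())  # collapse repeated spaces


sentences = split_sentences("Apa kabar? Baik. Terima kasih!")
converted = normalize("harga 123 $")
dehyphenated = normalize("laki-laki")
```

Splitting at every full stop would also break decimal numbers and abbreviations, so a production pipeline would need additional rules that this sketch does not attempt.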

Table 3 shows the detailed statistics of our mother sentence set. The preprocessing steps yield 115,489 sentences, which form the mother sentence set. The mother sentence set contains 6,225,794 triphone tokens and 5,741,062 pentaphone tokens. Meanwhile, the numbers of distinct triphones and distinct pentaphones are 13,501 and 214,868, respectively. The number of distinct triphones or pentaphones is the count without duplicates. For example, the sentence makan malam contains 11 triphones, i.e., [sil-m+a] [m-a+k] [a-k+a] [k-a+n] [a-n+sil] [n-sil+m] [sil-m+a] [m-a+l] [a-l+a] [l-a+m] [a-m+sil]. The triphone [sil-m+a] occurs twice. Thus, the number of distinct triphones is 10, i.e., [sil-m+a] [m-a+k] [a-k+a] [k-a+n] [a-n+sil] [n-sil+m] [m-a+l] [a-l+a] [l-a+m] [a-m+sil]. Examples of pentaphones can be seen in Table 2.
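The makan malam example can be reproduced with a short Python sketch. The helper names are hypothetical, and the letter-to-phoneme mapping is a simplifying assumption that happens to hold for this sentence but not for Indonesian digraphs such as ng or ny.

```python
from collections import Counter


def sentence_to_phonemes(sentence, sil="sil"):
    """Treat each letter as one phoneme, with silence at the sentence edges
    and between words (a simplification made for this example)."""
    phonemes = [sil]
    for word in sentence.split():
        phonemes.extend(word)   # one phoneme per letter
        phonemes.append(sil)
    return phonemes


def extract_units(phonemes, width):
    """Return every window of `width` consecutive phonemes
    (3 gives triphones, 5 gives pentaphones)."""
    return [tuple(phonemes[i:i + width])
            for i in range(len(phonemes) - width + 1)]


def format_triphone(unit):
    """Render a triphone in the [left-centre+right] notation used above."""
    left, centre, right = unit
    return f"[{left}-{centre}+{right}]"


phonemes = sentence_to_phonemes("makan malam")
triphones = extract_units(phonemes, 3)
pentaphones = extract_units(phonemes, 5)
counts = Counter(triphones)

print(len(triphones))                       # 11 triphone tokens
print(len(counts))                          # 10 distinct triphones
print(format_triphone(("sil", "m", "a")),
      counts[("sil", "m", "a")])            # [sil-m+a] 2
```

The same window function yields 9 pentaphone tokens for this sentence, since a sequence of 13 phonemes has 13 - 5 + 1 windows of width five.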

3.2 Experimental design

Two sentence scoring methods are applied to select the minimum sentence set from the mother sentence set. The first method was proposed by Zhang and Nakamura (2003), as shown in Eq. (1); the second was proposed by Suyanto (2007), as shown in Eq. (2). Both scoring methods are evaluated with the LTM + Greedy algorithm. All experiments were run on a machine with 16 GB of physical memory (RAM). Following previous research (Zhang and Nakamura 2003), our experimental results are described from three points of view: the size of the generated sets, search analysis, and computation costs.

Table 4 Results of Algorithms

4 Experimental results

Fig. 1 Number of Tokens of Different Sentences Scoring on LTM + Greedy Algorithm by Zhang and Nakamura (2003)

Fig. 2 Number of Tokens of Different Sentences Scoring on LTM + Greedy Algorithm by Suyanto (2007)

4.1 Size of the generated sets

Table 4 shows the results of the two methods. Both methods are applied to triphone and pentaphone units. In terms of the number of sentences, LTM + Greedy by Suyanto produces slightly fewer sentences than LTM + Greedy by Zhang for both triphones and pentaphones. For triphones, LTM + Greedy by Suyanto and by Zhang generate 3443 and 3531 sentences, respectively. For pentaphones, they generate 35,816 and 36,798 sentences, respectively.

In terms of the total number of words and the number of distinct words, LTM + Greedy by Zhang produces the smaller numbers for both triphones and pentaphones. For the average, maximum, and minimum number of phonemes per sentence, the two methods give almost the same values. Based on the sentences generated by both methods, LTM + Greedy by Suyanto is able to select a minimum sentence set that contains a large number of phonetic units.

Of the seven parameters listed in Table 4, the number of sentences and the number of phone units can generally be used as performance measures. A method performs best when it generates the minimum number of sentences while containing a large number of phonemes. The experiments show that LTM + Greedy by Suyanto generated the smallest sentence set for both triphone and pentaphone units.
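Size statistics of this kind can be computed from a selected sentence set with a few lines of Python. The sketch below is illustrative only: the function name, the nested-list data layout, and the choice to exclude silences from the phoneme counts are assumptions, and it covers only the sentence, word, and phoneme counts named in the text.

```python
def summarize(sentences):
    """Summarize a selected sentence set.

    sentences: list of sentences; each sentence is a list of words and
    each word is a list of phonemes (silences excluded by assumption).
    """
    if not sentences:
        raise ValueError("the sentence set is empty")
    words = [tuple(word) for sentence in sentences for word in sentence]
    phonemes_per_sentence = [sum(len(word) for word in sentence)
                             for sentence in sentences]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "distinct_words": len(set(words)),
        "avg_phonemes": sum(phonemes_per_sentence) / len(sentences),
        "max_phonemes": max(phonemes_per_sentence),
        "min_phonemes": min(phonemes_per_sentence),
    }


# Toy set of two sentences: "makan malam" and "makan",
# with one phoneme per letter for simplicity.
toy_set = [
    [list("makan"), list("malam")],
    [list("makan")],
]
stats = summarize(toy_set)
```

For the toy set, the word makan is counted twice in the word total but once among the distinct words.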

4.2 Search analysis

Figures 1 and 2 are used to analyze the search performance of the two methods on the triphone and pentaphone units. The performance of the two methods is almost the same: they tend to select sentences with the same number of tokens in each iteration.

4.3 Computation costs

We report the computational cost of each method in Table 5. The computational costs of LTM + Greedy by Zhang and by Suyanto do not differ much. This means that both methods perform equally well in this respect, generating the minimum sentence set in seconds or minutes for triphone and pentaphone units.

Table 5 Computational Time

5 Conclusions and future works

This paper presents an evaluation of sentence selection methods on different phone-sized units, i.e. triphone and pentaphone units. The selected sentence set is useful for constructing a natural speech corpus. The sentences were collected from several sources in Bahasa Indonesia. The experimental results show that LTM + Greedy by Suyanto generates a smaller sentence set than LTM + Greedy by Zhang for constructing an Indonesian text corpus. It not only produces fewer sentences but also contains a large number of phone units. In the future, the generated minimum sentence set can be applied to develop a speech corpus for an Indonesian TTS synthesis system, and we can then evaluate how naturally the TTS system generates speech from text.