1 Introduction

Humans communicate through speech, using a language that consists of sounds, words and grammar. In English, Hindi, French and most European languages, words comprise a sequence of distinctive units known as phonemes. However, several languages in the world are tonal, as Yip [21] points out. Tonal languages use tones to determine the meaning of the speech units.

Because many of these languages are spoken by small populations, a number of them have become extinct. Here, technology can play a crucial role in preventing extinction by providing Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) techniques for these languages. ASR is also needed for fostering economic growth and prosperity. For more flexible human-machine interaction, an ASR system for tonal languages is required. Several tone recognition techniques have been developed for various tonal languages [20] such as Mandarin, Thai, Vietnamese, Punjabi and Yoruba. However, to the best of our knowledge, no work has been done on automatic tone recognition for the Manipuri tonal language.

1.1 Manipuri

Manipuri, also known as Meiteilon/Meiteiron, is one of the scheduled languages of India, a Tibeto-Burman language spoken predominantly in Manipur, a northeastern state of India. It is also spoken by some people in other Indian states, such as Assam, Mizoram and Tripura, and in other countries such as Bangladesh and Myanmar. It is the official language of Manipur and has over 1.5 million speakers. Among the 29 different ethnic groups in Manipur, Manipuri is the only medium of communication [13]. Manipuri is a tonal language in which tone distinguishes the meaning of words. For speech recognition and pronunciation evaluation, the identification of tone in Manipuri is essential.

Manipuri has its own script, known as the Meitei/Meetei Mayek script. The script has 27 Mapung Mayek (main alphabets), 8 Lonsum Mayek (unreleased characters), 8 Cheitap Mayek (vowel signs), 3 Khudam Mayek (punctuation marks including diacritics) and Cheising Mayek for the numerals [3, 11].

1.2 Tones in Manipuri

Using pitch in a language to distinguish lexical or grammatical meaning is known as tone [2]. As mentioned before, Manipuri is a tone language [17]. It has a lexically significant, contrastive, but relative pitch on each syllable. There are two tones in Manipuri [6, 9, 11, 15, 19]:

  1. A level tone: unmarked

  2. A falling tone: marked by lum mayek, “\(\cdot \)”

Every syllable in Manipuri carries one of the two tones. The pitch (frequency) of the level tone is lower than the pitch of the falling tone; thus, some authors (e.g., Chelliah 1997 [5]) have termed the level and falling tones low and high, respectively [19]. The level tone is unmarked, while the falling tone is marked as /‵/ in the English representation. Furthermore, the lum mayek, or falling tone mark, “\(\cdot \)”, is written in Manipuri script just after the syllable that carries the falling tone.

2 Related Works

Internationally, intensive research has been carried out on tonal language speech recognition over the last three decades. Peng et al. (2021) [14] proposed a multi-scale model that gathers information at multiple resolutions to capture the attributes of tone variation. The experiment was performed on the Chinese National Hi-Tech Project 863 dataset, and their model achieved a tone error rate (TER) of 10.5%. Hao et al. (2019) proposed a framework based on deep neural networks for Mandarin tone recognition. The model uses both prosodic and articulatory features as the raw input data, and a 5-layer deep belief network is employed to generate high-level tone features. The 863 data corpus was used for the experiment, which achieved an average tone recognition accuracy of 83.03%. Nguyen et al. (2016) [16] investigated the effect of tone in a Vietnamese Large Vocabulary Continuous Speech Recognition system and built an acoustic model using tonal features. The experimental result showed a 19.25% improvement over the non-tonal phoneme system.

In India, for the Manipuri language, the doctoral thesis of Thoudam (1980) [15] devotes a chapter to Manipuri phonology. He suggested that there are only two distinctive tones in Manipuri, namely the falling tone and the level tone. Mahabir (1982) [4] also argued for two tones, falling and level, in his master’s thesis. Chelliah (1990) [18] studied level-ordered morphology and phonology in Manipuri and presented several phonological rules. Chelliah (1997) [5] explained the tone system in Manipuri. She presented a framework describing Manipuri as exhibiting a two-way tonal contrast between a low tone and a default high tone. The fundamental frequency contours were used as the phonetic representations of the underlying tone pattern in the experiment. Meiraba (2014) [12] claimed that the tone-bearing unit in Manipuri is the Rhyme of the syllable. The study attributed the relative simplicity of the Manipuri tone system to its rich inventory of consonants that can occur in the Coda position, and noted that the realisation of tonal contrast can be affected by the Coda consonants.

3 Motivation

After an exhaustive search, we found that India has a limited number of tonal languages (Mizo, Punjabi, Singpho, Manipuri, etc.) and that virtually no datasets are available for tonal analysis. There is therefore a critical need to develop a speech dataset of tonal contrast pairs in order to study the characteristics of the tonal variation that distinguishes words in the language. This motivates us to develop a tonal contrast word pair corpus for the Manipuri language and to study the tone information present in it, with the aim of developing robust ASR systems for Manipuri.

4 Creation of Tonal Contrast Word Pair Corpus

Fifty pairs of Manipuri tonal contrast words are collected from different sources [6, 8, 12, 15, 19]. The words are listed below in Fig. 2 with their respective meanings.

Fig. 1. Creation of ManiTo dataset.

Fig. 2. List of Manipuri tonal contrast word pairs with their respective meanings.

The data is collected from six people, three males and three females, aged between 21 and 45. All of them are native speakers. Three of them (two male and one female) work in the Linguistic Department of Manipur University, Imphal, and their recordings were made in the Audio, Visual, Language and Phonetic Laboratory Complex of Manipur University. The remaining three native local speakers were recorded in a quiet office environment. All 50 tonal contrast word pairs are recorded separately for each person, with five instances of each word and a short pause between the speech sounds. The steps of creating the dataset are shown in Fig. 1. The Cool Edit 2000 tool is used for recording the utterances. While recording, the following three parameters are set in Cool Edit 2000.

Sampling Rate: It is the number of samples per second to be captured by the microphone into the system. Sampling rate is set to 44,100 Hz.

Channel: Mono channel is selected. In mono, all audio signals are routed through a single audio channel.

Resolution: Each sample is represented using 16 bits.
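
As an illustration of how the three recording settings above could be checked on a saved file, the following is a minimal sketch using Python's standard wave module. The function name check_recording and the constant names are assumptions made for this example; they are not part of the original recording workflow, which used Cool Edit 2000.

```python
import wave

# Recording settings stated in the text.
EXPECTED_RATE_HZ = 44100       # sampling rate
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16 bits


def check_recording(path):
    """Return a list of mismatches between a WAV file and the stated settings.

    An empty list means the file matches all three settings.
    (Hypothetical helper, for illustration only.)
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sampling rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples")
    return problems
```

A file recorded with the settings above would yield an empty list, so the check could be run over every file of a corpus before analysis.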

4.1 Preprocessing

The recorded speech sounds are then analyzed and segmented manually, leaving about 1000 samples of silence at the beginning and end of each word, and saved in .wav format. Each wav file is named using the word, the tone (‘f’ for falling and ‘L’ for level), the instance number and the speaker ID.

For example, un_f_2_1.wav

Word: un    Tone: falling    Instance: 2    Speaker ID: 1
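
The naming convention can be decoded programmatically. The sketch below assumes that the four fields are always separated by underscores in the order shown in the example; the names Utterance, TONE_CODES and parse_filename are hypothetical and introduced only for illustration.

```python
from pathlib import Path
from typing import NamedTuple

# Tone codes as described in the text: 'f' for falling, 'L' for level.
TONE_CODES = {"f": "falling", "L": "level"}


class Utterance(NamedTuple):
    word: str
    tone: str
    instance: int
    speaker_id: int


def parse_filename(filename):
    """Split a name such as 'un_f_2_1.wav' into its four labelled fields."""
    stem = Path(filename).stem
    # Split from the right so that only the last three underscores are used;
    # a name with too few fields fails to unpack and raises ValueError.
    word, tone_code, instance, speaker = stem.rsplit("_", 3)
    if tone_code not in TONE_CODES:
        raise ValueError(f"unknown tone code {tone_code!r} in {filename!r}")
    return Utterance(word, TONE_CODES[tone_code], int(instance), int(speaker))
```

For the example above, parse_filename("un_f_2_1.wav") gives the word "un", the tone "falling", instance 2 and speaker ID 1.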

The corpus, ManiTo, consists of 3,000 manually labeled speech samples (50 pairs × 2 words × 5 instances × 6 speakers) with a total size of 273 MB. The recordings are carefully double-checked and stored.

Fig. 3. Waveform and spectrogram with overlaid pitch contour of falling tone “un” sound.

Fig. 4. Waveform and spectrogram with overlaid pitch contour of level tone “un” sound.

5 Experimental Analysis

Praat [1] is a tool that can analyze, synthesize, and manipulate speech data. Praat version 6.1.51 is used for the experiment, and the speech samples from the developed dataset are analyzed with it. In tone analysis, features that reflect the pitch contour are lexically significant, and the fundamental frequency, F0, acts as an indicator of tone. For the preliminary study on the ManiTo dataset, the pitch (F0) is extracted using Praat, which provides an accurate pitch analysis algorithm [7]. Figure 3 shows the analysis of the falling tone “un” sound. The blue line is the pitch listing of the speech. Similarly, Fig. 4 shows the analysis of the level tone “un” sound. From the two figures we can see that the pitch of the level tone is lower than that of the falling tone. Figure 5 compares the five utterances of tonal contrast Pair1, “un”, spoken by Speaker1. Figure 5a plots the pitch listing of the falling tone, Fig. 5b plots the pitch listing of the level tone, Fig. 5c shows the normalised falling tone, Fig. 5d shows the normalised level tone, and Fig. 5e compares the average pitch listing of the falling tone with that of the level tone. From these graphs we can initially infer that the pitch of the falling tone is higher than that of the level tone. Using parselmouth [10], a Python library for the Praat software, the mean F0, harmonics-to-noise ratio (HNR), jitter and shimmer are extracted, and analysis is being conducted on these features to distinguish the tones accurately.

Fig. 5. Pitch contour of (a) “un” five utterances (falling tone) (b) “un” five utterances (level tone) (c) normalised pitch “un” (falling) (d) normalised pitch “un” (level) (e) average pitch comparison of level and falling tone.
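
The F0 extraction step described in this section can be sketched with parselmouth as follows. This is a minimal illustration rather than the script used in the study: it assumes Praat's default pitch settings (the text does not state the settings used), and the helper names extract_f0, voiced_f0 and mean_f0 are hypothetical. Praat reports unvoiced frames as 0 Hz, so these are dropped before averaging.

```python
def extract_f0(wav_path):
    """Return the per-frame F0 values (Hz) of a recording via parselmouth.

    Assumption: Praat's default pitch analysis settings are used.
    parselmouth is imported here so the helpers below work without it.
    """
    import parselmouth

    sound = parselmouth.Sound(wav_path)
    pitch = sound.to_pitch()
    # Unvoiced frames appear as 0.0 in this array.
    return pitch.selected_array["frequency"]


def voiced_f0(f0_values):
    """Keep only voiced frames, i.e. frames with a positive F0."""
    return [float(f) for f in f0_values if f > 0]


def mean_f0(f0_values):
    """Mean F0 over voiced frames, or None if no frame is voiced."""
    voiced = voiced_f0(f0_values)
    if not voiced:
        return None
    return sum(voiced) / len(voiced)
```

Under these assumptions, comparing mean_f0(extract_f0(path)) for the falling and level recordings of the same word gives a simple numeric counterpart to the comparison in Fig. 5e.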

6 Conclusion and Future Work

A speech dataset containing tonal contrast pairs of the Manipuri language has been created. ManiTo, containing 3,000 samples of Manipuri tonal contrast words, was developed from data collected from 6 speakers. Preliminary analysis of the dataset is currently under way. It is found that the pitch of the falling tone word is higher than that of the level tone word, which suggests that the pitch value can be used to distinguish the tones in Manipuri. Further analysis on feature selection is being carried out to differentiate the tones accurately and to develop a robust model for tone recognition for the Manipuri language.