1 Introduction

Speech is the most natural and widely used mode of human interaction, yet differences in languages and accents can make it difficult to handle. In the current era, technology should be language-independent so that everyone can enjoy its benefits. As per a 2020 study on open voice data in Indian languages, only 10–12% of Indians are comfortable in English, i.e., approximately 1253 million Indians are comfortable only in their mother tongue [1]. India has almost 1369 classified languages, and more than 120 of them have over 10,000 speakers [2]. The leading companies in automatic speech recognition (ASR) [3, 4], such as Amazon (Alexa), Apple (Siri), Google (Assistant), and Microsoft (Cortana) [5, 6], are interested in working only on languages that are commercially beneficial. Google Home supports merely 13 languages worldwide, with Hindi the only Indian language among them, whereas Microsoft offers ASR systems for Hindi, Marathi, Gujarati, Telugu, and Tamil.

Presently India has approximately 1.18 billion mobile connections, of which 700 million belong to regular Internet users, and the number of smartphone users grows by 25 million per quarter [1]. ASR therefore needs to be implemented in all native languages, both to make internet content accessible and to preserve low-resource languages in this era of globalization. Some research has already been initiated to develop speech recognition technologies for resource-rich Indian languages. Kumar et al. [7] worked on a Hindi ASR system in noisy environments using hybrid feature extraction techniques such as perceptual linear prediction (PLP) and MFCC. Babhulgaonkar et al. [8] implemented generalized back-off and factored language models by combining linguistic features to predict upcoming words for a Hindi ASR system. Guchhait et al. [9] built a Kaldi-based ASR system for the Bengali language. In [10], the authors designed a Marathi speech recognizer using the HTK toolkit with 910 sentences from 6 speakers. An ASR system to automatically transcribe Telugu TV news was developed by Reddy et al. [11]. Lokesh et al. [12] implemented a Tamil ASR system based on a bidirectional recurrent neural network. An HMM-based speech and isolated word recognition system was developed by Ananthi et al. [13]; the research was carried out with data collected from 20 different speakers, and the system obtained 87.42% recognition accuracy. An isolated spoken digit recognition system for the Assamese language using HMM was implemented by Bharali et al. [14]; this digit recognition model used a speech corpus of 10 native Assamese speakers. In [15], an isolated spoken Marathi word recognition system using HMM was built on a small speech database of 20 phonetically rich Marathi words from ten native speakers, and the recognition system achieved a maximum accuracy of 86.50%. A convolutional neural network (CNN) based automatic speech recognition system for isolated words was designed by Slívová et al. [16]. A word-level recognition system for the Hindi language using the Kaldi toolkit was created by Sri et al. [17], in which monophone, triphone, and SGMM acoustic models were designed. A GMM and MFCC-based isolated spoken numeral recognition system for the Bengali language was presented by Paul et al. [18] with a corpus of 1000 audio samples; the system achieved 91.7% prediction efficiency. An isolated Arabic word recognition model based on the hidden Markov model (HMM) was proposed in [19].

With the help of ASR technology, a machine can recognize spoken language and translate it into text. According to UNESCO's Atlas of the World's Languages in Danger (2017), there are 33 endangered languages in Arunachal Pradesh, including Adi. The United Nations declared 2019 the International Year of Indigenous Languages to raise awareness of languages worldwide that are in jeopardy of disappearing.

As per Census 2011, a total of 248,834 Adi speakers are found in Arunachal Pradesh, mainly spread over the West, East, and Upper Siang districts [2]. The Adi language belongs to the Tibeto-Burman branch of the Sino-Tibetan language family [20].

The main challenges in working with the Adi language are:

  1. No speech data are available on the internet or any other digital media, as Adi is a very low-resource indigenous language of Arunachal Pradesh.

  2. Adi has adopted a modified Roman script for writing, which is still being developed; it is therefore a challenge to represent Adi words with proper phonetic transcripts.

  3. Most native Adi speakers cannot read the modified Roman script of Adi words, which makes data collection more complex.

  4. The Adi tribes are spread across different mountainous areas of Arunachal Pradesh, which makes the work more difficult.

Lalrempuii worked on the morphology of Adi [21]. In [22], spectral and formant studies of Adi consonants were carried out. A speech recognition system for the Adi language was proposed in [23]. In the present work, the authors endeavor to develop an automatic isolated word recognition system for the Adi language using the Gaussian mixture model-hidden Markov model (GMM-HMM) [24, 25] and SGMM [26].

The significant contributions of this research are:

  1. The corpus consists of 2088 unique isolated words of the Adi language, a collection that has never been attempted before.

  2. For the first time, this research presents the phonetic transcriptions of 2088 Adi words with proper phoneme sequences.

  3. The proposed Adi word recognition model may serve as a first step toward building a full ASR system for Adi.

  4. This work demonstrates strong recognition accuracy with monophone, three different triphone [27], and SGMM models built with the Kaldi toolkit.

  5. The authors investigated overall phone alignments and occurrences and calculated the WER separately for individual speakers with different models.

The paper is organized as follows: Section 2 describes the model building using Kaldi; the system configuration is discussed in Sect. 3; feature extraction is illustrated in Sect. 4. Section 5 explains language model creation; training and decoding are described in Sect. 6. After that, experimental results and discussion are given in Sect. 7. Section 8 concludes the current research work. Finally, Sect. 9 emphasizes future work of the present research.

2 Model building using Kaldi Toolkit

In this research, 16 consonants, 14 vowels (7 short and 7 long), 19 diphthongs, and one triphthong of the Adi language are listed [21]. The Adi consonants are shown in Table 1.

Table 1 List of Adi consonants

In Adi, vowel duration is extremely important, as the meaning of a word is determined by whether its vowel phonemes are short or long; in general, replacing a short vowel with the equivalent long vowel gives a word a different meaning. The five places of articulation present in Adi are alveolar, bilabial, glottal, palatal, and velar, and this under-resourced language has six manners of articulation: affricates, fricatives, glides, liquids, nasals, and stops. Table 2 shows the list of Adi vowels. Some diphthongs of this language are /aé/ [aə], /ai/ [ai], /ía/ [ɨa], /ao/ [aɔ], /oa/ [ɔa], and the sole triphthong is 'uai'. Dental, labio-dental fricative, aspirated, and retroflex sounds are absent in Adi. The fricatives [h] and [s] can be interchanged in Adi words without any change in meaning.

Table 2 List of Adi vowels

Kaldi is a highly efficient open-source speech recognition toolkit developed at Johns Hopkins University [28]. Figure 1 illustrates the structural design of the automatic word recognition model. Speech samples of Adi words were recorded from 21 native Adi speakers of Arunachal Pradesh. MFCC features are extracted from the recorded speech samples, and the extracted feature vectors are fed into the decoder, which decodes and classifies them using the acoustic model, the lexicon (phonetic transcripts of Adi), and the language model.

Fig. 1
figure 1

The architecture of the automatic word recognition model for Adi

Figure 2 describes the step-by-step process of the automatic word recognition system. Acoustic samples are recorded and stored in waveform audio (.wav) file format. In this work, the sampling rate of the audio data was fixed at 16 kHz with signed 16-bit PCM encoding on a mono channel, i.e., a bit rate of 256 kbps. The language data were prepared with 50 phonemes and the phonetic representation of each of the 2088 isolated Adi words. The SRILM toolkit was used to construct the language model. Finally, the ASR system recognizes isolated words of the Adi language using monophone, triphone, and SGMM acoustic models.

Fig. 2
figure 2

Step-by-step process of the word recognition model

Fig. 3
figure 3

The directory structure of the proposed model in Kaldi

Figure 3 shows the directory structure of our proposed model in the Kaldi toolkit, where the different directories and indispensable files were created. 'egs' is the most important directory of the Kaldi toolkit; under it, 'Adi_Words' and a few sub-directories were created to hold everything associated with the proposed model.
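For orientation, a typical layout of this kind is sketched below, following the standard Kaldi tutorial-style structure that the file names used in this recipe suggest; the sub-directory names and comments are assumptions of this sketch, and the actual tree shown in Fig. 3 is authoritative.

```
egs/Adi_Words/
    data/
        train/      wav.scp, text, utt2spk, spk2gender for the training speakers
        test/       the same files for the test speakers
        local/      corpus.txt, dictionary files, and language-model inputs
    conf/           mfcc.conf and other configuration files
    exp/            trained models and decoding results (mono, tri1, tri2, tri3, sgmm)
    local/          recipe-specific scripts
    steps/          symlink to the shared Kaldi training/decoding scripts
    utils/          symlink to the shared Kaldi utility scripts
```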

2.1 Speech dataset

The speech samples from native Adi speakers of Arunachal Pradesh were collected with the help of voice recording tools and were stored by the authors. The specifications of data for this research are as follows.

  (i) Bit rate: 256 kbps

  (ii) Encoding: PCM

  (iii) Channels: mono, 16-bit

  (iv) Sampling rate: 16 kHz

  (v) Format: waveform audio file (.wav)

In this isolated word recognition system, the speech corpus consists of recordings from 21 Adi speakers (9 male and 12 female) from Arunachal Pradesh. The corpus contains 2088 unique Adi words and 14,490 word utterances.

2.2 Acoustic data

The acoustic data section of the model is organized with a few crucial files: spk2gender, wav.scp, text (the utterance transcripts), utt2spk, and corpus.txt. 'spk2gender' contains the gender of every Adi speaker engaged in this ASR system. The location of the speech sample for every utterance is listed in 'wav.scp', and the utterance transcriptions are kept in the 'text' file. 'utt2spk' maps every utterance to its speaker, whereas the transcript of the whole corpus is collected in the corpus.txt file. The files needed to connect the acoustic data to the Kaldi toolkit are shown in Table 3.

Table 3 List of files required for the acoustic data
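To make the layout of these files concrete, the short Python sketch below writes wav.scp, text, utt2spk, spk2gender, and corpus.txt for a handful of utterances. The utterance IDs, speaker IDs, file paths, and transcripts are hypothetical placeholders rather than the identifiers actually used in this corpus.

```python
# Minimal sketch: writing the Kaldi acoustic-data files for a few utterances.
# All speaker/utterance IDs, paths, and transcripts below are illustrative only.
import os

data_dir, local_dir = "data/train", "data/local"   # assumed directory layout
os.makedirs(data_dir, exist_ok=True)
os.makedirs(local_dir, exist_ok=True)

# (utt_id, spk_id, gender, wav_path, transcript)
utterances = [
    ("001F_utt0001", "001F", "f", "recordings/001F/utt0001.wav", "sheela"),
    ("001F_utt0002", "001F", "f", "recordings/001F/utt0002.wav", "mangkom"),
    ("002M_utt0001", "002M", "m", "recordings/002M/utt0001.wav", "sheela"),
]

with open(f"{data_dir}/wav.scp", "w") as wav_scp, \
     open(f"{data_dir}/text", "w") as text, \
     open(f"{data_dir}/utt2spk", "w") as utt2spk, \
     open(f"{data_dir}/spk2gender", "w") as spk2gender, \
     open(f"{local_dir}/corpus.txt", "w") as corpus:
    genders = {}
    for utt_id, spk_id, gender, wav_path, transcript in sorted(utterances):
        wav_scp.write(f"{utt_id} {wav_path}\n")       # utterance-id -> audio location
        text.write(f"{utt_id} {transcript}\n")        # utterance-id -> transcription
        utt2spk.write(f"{utt_id} {spk_id}\n")         # utterance-id -> speaker-id
        corpus.write(f"{transcript}\n")               # transcripts collected for the LM
        genders[spk_id] = gender
    for spk_id, gender in sorted(genders.items()):
        spk2gender.write(f"{spk_id} {gender}\n")      # speaker-id -> m/f
```

Kaldi additionally expects these files to be sorted by utterance and speaker ID, which the sorted() calls above take care of.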

2.3 Language data

The language data, comprising the lexicon and the silence and non-silence phone information, are listed in Table 4. The files 'silence.txt' and 'nonsilence.txt' contain the silence and non-silence phones, respectively.

Table 4 List of files required for the language data
Table 5 Phoneme sequences of some Adi words in the lexicon

A total of 50 non-silence phonemes of the Adi language are listed for this research. Some of the phones in this word recognition system are a, a:, e, e:, i, i:, b, eu, k, l, m, n, ŋ, ɔ, p, s, t, z, ɔa, and aə. 'sil' and 'spn' are the two silence phones applied in this ASR model. The phonetic representation of some Adi words is given in Table 5.
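As an illustration of the language data layout, the sketch below writes a two-entry lexicon together with the silence and non-silence phone lists. The phone sequences for 'sheela' and 'mangkom' are taken from the worked examples in Sect. 6; the directory layout and the !SIL/<UNK> entries follow common Kaldi conventions and are assumptions of this sketch rather than details reported in the paper.

```python
# Minimal sketch of the language-data files: a two-entry lexicon plus the
# silence and non-silence phone lists. Only two lexicon entries are shown;
# the full lexicon would contain all 2088 Adi words.
import os

dict_dir = "data/local/dict"           # assumed dictionary directory
os.makedirs(dict_dir, exist_ok=True)

lexicon = {
    "sheela":  ["sh", "i", "l", "a:"],
    "mangkom": ["m", "a", "ŋ", "k", "o", "m"],
    # ... remaining entries of the 2088-word lexicon follow the same pattern
}
silence_phones = ["sil", "spn"]
nonsilence_phones = sorted({p for phones in lexicon.values() for p in phones})

with open(f"{dict_dir}/lexicon.txt", "w", encoding="utf-8") as f:
    f.write("!SIL sil\n<UNK> spn\n")               # conventional silence/unknown entries
    for word, phones in sorted(lexicon.items()):
        f.write(f"{word} {' '.join(phones)}\n")    # word followed by its phone sequence

# Table 4 refers to these files as silence.txt and nonsilence.txt; standard Kaldi
# dictionary directories usually call them silence_phones.txt / nonsilence_phones.txt.
with open(f"{dict_dir}/silence.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(silence_phones) + "\n")
with open(f"{dict_dir}/nonsilence.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(nonsilence_phones) + "\n")
```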

In this corpus of the Adi language, 21 native speakers uttered 2088 unique isolated words with 14,490 utterances of words. The speech files are stored in .wav format.

3 System configuration

The automatic word recognition system for the Adi language was developed on Ubuntu 20.04 LTS (a 64-bit operating system). The machine used is a Lenovo IdeaPad laptop with an Intel 10th Gen i5-10300H processor (2.5 GHz base to 4.5 GHz maximum, 4 cores, 8 MB cache), 8 GB of DDR4-3200 RAM, a 512 GB SSD, and a dedicated NVIDIA GTX 1650 GPU with 4 GB of GDDR6 memory.

4 Feature extraction

The prime objective of feature extraction is to compute feature vectors that represent the input speech as a sequence of observations. MFCC is an efficient and well-known speech feature extraction technique.

Fig. 4
figure 4

MFCC feature extraction

The steps of MFCC feature extraction are shown in the block diagram in Fig. 4. The MFCC features were extracted using a 25 ms Hamming window with a 10 ms frame shift. The recording sampling rate is 16 kHz, i.e., 16,000 samples for every second of audio, so a single window contains 400 samples, which are reduced to 13 cepstral coefficients. Another 13 delta and 13 delta-delta coefficients are then computed for every frame, i.e., a 39-dimensional MFCC feature vector is employed in this work. First, the analog speech sample is converted into a digital signal, followed by pre-emphasis. The motive behind splitting the speech utterances into frames of short duration is that speech is non-stationary and its temporal characteristics change quickly: every 10–100 ms, the vocal tract changes its shape to produce different sounds. By taking a small frame size, one can assume that the speech within a frame is stationary and that its characteristics do not vary much inside the frame. A 10 ms frame shift is selected to track the continuity of the audio signal and not miss sudden changes at the edges of the frames. Then the discrete Fourier transform (DFT) is applied to move from the time domain to the frequency domain. The Fourier transform output is squared to obtain the power spectrum and passed through a Mel filter bank. The logarithm of the Mel filter bank output is taken to suppress acoustic variations that are not significant for this ASR model. Lastly, the log Mel spectrum is converted back to the time (cepstral) domain with the discrete cosine transform (DCT) [29], and the cepstral coefficients (MFCCs) are obtained as the final output.
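Kaldi computes these features internally (compute-mfcc-feats followed by add-deltas). Purely as an illustration of the same 39-dimensional configuration, the following sketch recreates it with the librosa library; the use of librosa and the exact pre-emphasis coefficient are assumptions of this example, not part of the authors' pipeline.

```python
# Illustrative 39-dimensional MFCC extraction (13 static + 13 delta + 13 delta-delta),
# using librosa as a stand-in for Kaldi's internal feature pipeline.
import librosa
import numpy as np

def extract_mfcc_39(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)          # 16 kHz, as in the corpus
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # simple pre-emphasis filter
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=400,            # 25 ms window at 16 kHz
        hop_length=160,       # 10 ms frame shift
        window="hamming",
    )
    delta = librosa.feature.delta(mfcc)               # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)     # second-order derivatives
    return np.vstack([mfcc, delta, delta2]).T         # shape: (num_frames, 39)

# Example (hypothetical path):
# feats = extract_mfcc_39("recordings/001F/utt0001.wav")
# print(feats.shape)   # -> (num_frames, 39)
```

In a Kaldi recipe these window, shift, and coefficient settings would normally live in conf/mfcc.conf.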

5 Language model creation

The Kaldi toolkit employs a structure that relies on finite state transducers (FSTs). Language models in the typical ARPA format are converted into the .fst format using OpenFst, since the Kaldi toolkit requires the .fst file format to compile this ASR model. For the language model, the L.fst, G.fst, and L_disambig.fst files are created. L.fst, the phonetic dictionary FST, has phone symbols on its input and word symbols on its output and represents the lexicon in .fst format. G.fst is the language model (grammar) FST, and L_disambig.fst is the phonetic dictionary FST with disambiguation symbols. The SRI Language Modeling toolkit (SRILM) is used to implement a statistical language model for this speech recognition system. An ARPA trigram model file is built to store the n-gram probabilities used in this ASR system.
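A minimal sketch of this step is shown below. It assumes SRILM's ngram-count and Kaldi's arpa2fst binaries are on the PATH, and the file locations, smoothing option, and symbol table path are illustrative placeholders rather than the authors' exact recipe.

```python
# Sketch: building a trigram ARPA language model with SRILM and converting it
# to G.fst for Kaldi. Paths and options are illustrative assumptions.
import subprocess

# 1. Train a trigram LM on the corpus transcripts (one utterance per line).
subprocess.run(
    ["ngram-count", "-order", "3",
     "-text", "data/local/corpus.txt",     # assumed location of corpus.txt
     "-lm", "data/local/lm.arpa",
     "-wbdiscount"],                       # Witten-Bell smoothing
    check=True,
)

# 2. Convert the ARPA model into the grammar FST (G.fst) with Kaldi's arpa2fst.
subprocess.run(
    ["arpa2fst", "--disambig-symbol=#0",
     "--read-symbol-table=data/lang/words.txt",
     "data/local/lm.arpa", "data/lang/G.fst"],
    check=True,
)
```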

6 Training and decoding

The speech corpus comprises 14,490 word utterances spoken by 21 native Adi speakers with 50 non-silence phones. Word error rates (WERs) are computed to validate the ASR system for the different recognition models. Each utterance is divided into aligned segments for phone time marking, and each segment is mapped to the corresponding phoneme symbol in the sequence; in this way, each phoneme's probability density function (PDF) can be segmented and classified. All word utterances are split into individual units of speech known as phonemes, as shown in Table 5.

Fig. 5
figure 5

Monophone alignment for ‘sheela’

Fig. 6
figure 6

Triphone alignment for ‘mangkom’

In the monophone model, every phoneme is modeled separately, i.e., a single phone is considered independently and all neighboring phones are ignored during training. Figure 5 shows the waveform and phonetic mapping for the word 'sheela', whose phonetic transcription is 'sh-i-l-a:', giving the four non-silence phonemes 'sh', 'i', 'l', and 'a:'. The total utterance duration for 'sheela' is 387.37 ms; the phone 'sh' lasts 89.4 ms, 'i' 96.03 ms, and 'l' 84.67 ms, while the final phone 'a:' lasts slightly longer, 117.27 ms, as it is a long vowel. In the first experiment, a monophone model was trained on the speech data using these individual phone utterances.

A word utterance, however, does not depend only on its phone sequence; the phonemes to the left and right of each phoneme exert a strong influence. Modeling each phone together with its surrounding phonemes therefore improves the overall recognition efficiency of the system. The waveform and phonetic representation of the Adi word 'mangkom' ('instead' in English) are shown in Fig. 6; its phonetic transcription is 'm-a-ŋ-k-o-m'. The triphone model considers three successive phonemes, i.e., a phone together with its left and right neighbors, when modeling a specific phoneme. A total of 50 Adi phones are involved in this model; with three HMM states per phone, 150 (50 × 3) base HMM states are available, and decision trees built from the collected triphones cluster the context-dependent variants during training. The triphone stage comprises three training models: tri1, tri2, and tri3. The tri1 model uses MFCC features with delta and delta-delta (Δ + ΔΔ) features. Tri2 adds linear discriminant analysis (LDA) and the maximum likelihood linear transform (MLLT); LDA reduces the dimensionality of the feature space, and MLLT applies a global feature transform on top of the LDA features. In the tri3 model, speaker adaptive training (SAT) is combined with LDA and MLLT [30], and noise and speaker variability are normalized by estimating a feature transform for every speaker.

The SGMM is similar to a GMM system but requires all HMM states to share a common GMM structure with an equal number of Gaussians in every state. The SGMM arrangement has sub-states without any speaker adaptive framework. The advantage of SGMM is that it is more compact than the monophone and triphone models, since it has fewer parameters than the actual number of speech states; this benefits the recognition efficiency of the ASR system when training data are limited. A simple SGMM model with no sub-states already performs better than the best GMM system, and after adding sub-states the WER improves further.

A speech recognition system has two parts: acoustic modeling and decoding. The acoustic model transforms the audio information into phonetic attributes; its main goal is to recognize phonemes precisely and to output a posteriorgram, which represents the posterior probabilities of the phonemes at every speech frame. The decoder then identifies the most probable word for a specific speaker's utterance from these features. A decoding graph [28] is constructed as a weighted finite state transducer with the help of the lexicon, the acoustic model, and the language model to predict the most probable word. The decoding graph is written as HCLG = H ∘ C ∘ L ∘ G, where 'H' stands for the hidden Markov model, 'C' represents context dependency, 'L' denotes the lexicon, and 'G' designates the language model or grammar.
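The paper does not reproduce its run script; the sketch below shows how the five acoustic models and the decoding graph are commonly chained in a standard Kaldi recipe (monophone, delta triphone, LDA+MLLT, SAT, SGMM), calling the usual steps/ and utils/ scripts from Python. The directory names and the leaf, Gaussian, UBM, and sub-state counts are placeholders, not the settings used by the authors.

```python
# Sketch of a typical Kaldi training/decoding chain for the five models.
# Script names are the standard Kaldi recipe scripts; all paths and
# leaf/Gaussian/sub-state counts are illustrative placeholders.
import subprocess

def run(*cmd):
    subprocess.run(list(cmd), check=True)

train, lang = "data/train", "data/lang"

# Monophone model and alignment
run("steps/train_mono.sh", train, lang, "exp/mono")
run("steps/align_si.sh", train, lang, "exp/mono", "exp/mono_ali")

# tri1: triphones on MFCC + delta + delta-delta features
run("steps/train_deltas.sh", "2000", "11000", train, lang, "exp/mono_ali", "exp/tri1")
run("steps/align_si.sh", train, lang, "exp/tri1", "exp/tri1_ali")

# tri2: LDA + MLLT feature transforms
run("steps/train_lda_mllt.sh", "2500", "15000", train, lang, "exp/tri1_ali", "exp/tri2")
run("steps/align_si.sh", train, lang, "exp/tri2", "exp/tri2_ali")

# tri3: speaker adaptive training (SAT) on top of LDA+MLLT
run("steps/train_sat.sh", "2500", "15000", train, lang, "exp/tri2_ali", "exp/tri3")
run("steps/align_fmllr.sh", train, lang, "exp/tri3", "exp/tri3_ali")

# SGMM on the fMLLR-aligned data (requires a trained UBM)
run("steps/train_ubm.sh", "400", train, lang, "exp/tri3_ali", "exp/ubm")
run("steps/train_sgmm2.sh", "5000", "8000", train, lang, "exp/tri3_ali",
    "exp/ubm/final.ubm", "exp/sgmm")

# Build the HCLG decoding graph and decode the test set for one model (tri3)
run("utils/mkgraph.sh", lang, "exp/tri3", "exp/tri3/graph")
run("steps/decode_fmllr.sh", "exp/tri3/graph", "data/test", "exp/tri3/decode")
```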

7 Results and discussion

In this ASR system, a total of five models are used for isolated word recognition of the Adi language: the HMM-GMM monophone model, the HMM-GMM triphone models (tri1, tri2, tri3), and SGMM. The lengths and occurrences of all silence and non-silence phones are analyzed for all models. Table 6 illustrates that, for the monophone model, silence (sil) accounts for 97.7% of phone occurrences at utterance begin, while the optional silence 'sil' is seen only 49.71% of the time at utterance end. For the tri1, tri2, tri3, and SGMM models, silence accounts for 97.5, 97.7, 97.2, and 97.4% of occurrences, respectively, at utterance begin and 74.71, 73.54, 71.28, and 71.64%, respectively, at utterance end.

Table 6 Silence and non-silence phones occurrences for all models

7.1 Phone alignment analysis (tri3 model)

Phone alignment analysis of a speech recognition model is extremely important for understanding the actual silence and non-silence phone occurrences involved in the recognition model of a specific language. The optional silence 'sil' is seen only 71.28% of the time at utterance end in the tri3 model. Table 7 shows the phone alignments and phone occurrences at utterance begin and utterance end for the tri3 model. Phone durations are reported in frames as 'median, mean, 95th percentile'. In this model, silence (sil) accounts for 97.2% of phone occurrences at utterance begin, i.e., non-silence phones account for 2.8%; among these non-silence occurrences at utterance begin, the three phones a:_B, n_B, and s_B account for 0.7, 0.6, and 0.6%, respectively. At utterance end, silence (sil) accounts for 71.3% and non-silence phones for 28.7%; mainly the eight non-silence phones 'ŋ_E', 'e:_E', 'ɔ_E', 'a:_E', 'e_E', 'u_E', 'i_E', and 'a_E' are observed. Phone occurrences at utterance begin and at utterance end are exhibited in Figs. 7 and 8, respectively.

Table 7 Phones alignment and occurrences (Utterance begin/end) for tri3 model
Fig. 7
figure 7

Phone occurrences (utterance begin)

Fig. 8
figure 8

Phone occurrences (utterance end)

4140 Adi word utterances are considered for testing in this model. In Table 8, the overall phone alignments and occurrences are listed for the tri3 model. At decoding time, all 4140 Adi words are divided phonetically. Among the overall occurrences of each phone in the tri3 model, the silence 'sil' is observed only 7.5% of the time, with a duration of (59, 102.3, 351) frames in terms of 'median, mean, 95th percentile', while non-silence phones occur 92.5% of the time with a duration of (8, 13.1, 30) frames. Forty non-silence phone occurrences are illustrated in Table 8, and Fig. 9 shows three different phone categories: begin, end, and internal. Here a:_I (internal) has the highest overall occurrence at 6.3%, whereas u_E (end) and y_B (begin) have the smallest occurrences, 0.5% each. This type of analysis gives a clear idea of the occurrence of every phone in a particular speech dataset and can also provide a prediction of the phone generation probabilities of a specific language.

Table 8 Overall phones alignment and occurrences for tri3 model

The optional silence phone 'sil' occupies 38.7% of the overall frames. Limiting the statistics to the 62.1% of frames not covered by an utterance-begin/end phone, the optional silence sil occupies 5.7% of the frames. Utterance-internal optional silences (sil) comprise 2.2% of the utterance-internal phones, with durations (median, mean, 95th percentile) of (22, 34.1, 101) frames. In the word boundary analysis, silence can appear in five different ways in the speech samples: nonword (sil), begin (sil_B), end (sil_E), internal (sil_I), and singleton (sil_S). Every non-silence phone has four categories: begin, end, internal, and singleton; for example, the categories of the non-silence phone 'ŋ' are ŋ_B (begin), ŋ_E (end), ŋ_I (internal), and ŋ_S (singleton).
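Kaldi produces these boundary statistics with its alignment-analysis scripts (e.g., steps/diagnostic/analyze_alignments.sh). As a simplified stand-alone illustration, the sketch below counts which phones occur at utterance begin and end, assuming the per-utterance phone sequences have already been dumped to a text file (for instance with ali-to-phones and the phones.txt symbol table); the file name and format are assumptions of this example.

```python
# Simplified sketch: tallying utterance-begin and utterance-end phone occurrences.
# Assumes each line of 'phone_seqs.txt' reads "utt-id phone1 phone2 ...", where the
# phones already carry Kaldi's word-position suffixes (_B, _E, _I, _S).
from collections import Counter

begin_counts, end_counts = Counter(), Counter()
num_utts = 0

with open("phone_seqs.txt", encoding="utf-8") as f:    # assumed alignment dump
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue
        phones = fields[1:]              # drop the utterance id
        begin_counts[phones[0]] += 1     # first phone of the utterance
        end_counts[phones[-1]] += 1      # last phone of the utterance
        num_utts += 1

for title, counts in [("utterance begin", begin_counts), ("utterance end", end_counts)]:
    print(f"--- {title} ---")
    for phone, n in counts.most_common(10):
        print(f"{phone:8s} {100.0 * n / num_utts:5.1f}%")
```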

Fig. 9
figure 9

Overall occurrences of each phone

7.2 WER and WRA

This isolated word recognition system is trained with 15 Adi speakers (6 male and 9 female) and tested with 6 speakers (3 male and 3 female). The word error rate (WER) and word recognition accuracy (WRA) determine the efficiency of a speech recognition system. The WER is based on the minimum edit distance between the ASR output and the reference transcription. The word error rate is defined in Eqs. (1) and (2) as

$$WER=\frac{Deletions(D)+Substitutions(S)+Insertions(I)}{Total\;number\;of\;the\;words\;(N)}$$
(1)
$$WER\;(\%)=\frac{Deletions\;(D)+Substitutions\;(S)+Insertions\;(I)}{Deletions\;(D)+Substitutions\;(S)+Correct\;Words(C)}\times100$$
(2)

where N specifies the total number of words, D denotes the number of deletion errors, S the number of substitution errors, and I the number of insertion errors. The word recognition accuracy (WRA) is given in Eqs. (3) and (4) as

$$WRA=1-WER=\frac{N-D-S-I}{N}=\frac{C-I}{N}$$
(3)
$$WRA\;(\%)\;=100-WER\;(\%)$$
(4)
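Kaldi reports these quantities through its scoring scripts (compute-wer); purely for clarity, the self-contained sketch below computes D, S, and I with a Levenshtein alignment and then applies Eqs. (1)-(4). The two word lists in the example are hypothetical.

```python
# Sketch: WER and WRA as in Eqs. (1)-(4), computed from a minimal-edit-distance
# alignment between a reference word list and an ASR hypothesis word list.
def wer_counts(ref, hyp):
    """Return (deletions, substitutions, insertions) of a minimal alignment."""
    # dp[i][j] = (total_edits, D, S, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)                      # j insertions
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, i, 0, 0)                      # i deletions
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # exact match, no new error
            else:
                e, d, s, ins = dp[i - 1][j - 1]
                cand = [(e + 1, d, s + 1, ins)]      # substitution
                e, d, s, ins = dp[i - 1][j]
                cand.append((e + 1, d + 1, s, ins))  # deletion
                e, d, s, ins = dp[i][j - 1]
                cand.append((e + 1, d, s, ins + 1))  # insertion
                dp[i][j] = min(cand)                 # keep the cheapest path
    _, D, S, I = dp[len(ref)][len(hyp)]
    return D, S, I

# Hypothetical example: four reference words, one of which the recognizer drops.
ref = ["sheela", "mangkom", "sheela", "mangkom"]
hyp = ["sheela", "sheela", "mangkom"]
D, S, I = wer_counts(ref, hyp)
N = len(ref)
wer = 100.0 * (D + S + I) / N      # Eq. (2), since N = D + S + C
wra = 100.0 - wer                  # Eq. (4)
print(f"D={D} S={S} I={I} WER={wer:.2f}% WRA={wra:.2f}%")   # WER=25.00%, WRA=75.00%
```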

Table 9 and Fig. 10 list eight recognition results for each model. The monophone, tri1, tri2, tri3, and SGMM models have 33 different recognition outputs, and the 4140 word utterances of Adi are considered for every recognition result to analyze the performance of the different models. Table 9 shows eight different recognition outputs for each model; the best WERs and WRAs are shown in bold in the first row of each model. After the overall analysis, the monophone model was the least efficient, whereas SGMM showed the best recognition performance. The best recognition output of the monophone model had 92 insertion, 384 deletion, and 657 substitution errors, so only 3007 of the 4140 test words were correctly recognized. The SGMM model had 3 insertion, 177 deletion, and 149 substitution errors, i.e., 3811 of the 4140 word utterances were properly recognized.

Table 9 Performance analysis of different ASR models
Fig. 10
figure 10

Recognition accuracy of different ASR models

In this work, the authors used 10,350 word utterances (recordings of 15 speakers) for training and 4140 word utterances (recordings of 6 speakers) for testing, i.e., 71.43% of the data for training and 28.57% for testing. In the monophone model, the WRA is 72.63% and the WER is 27.37%. In the triphone models, the WER decreased, and the recognition accuracies of tri1, tri2, and tri3 reached 84.76, 87.54, and 91.06%, respectively. The SGMM model offered the lowest WER of 7.95%. Considering the recognition outputs of each model, Fig. 10 shows that SGMM has the highest recognition accuracy, followed by the tri3, tri2, tri1, and monophone models. Although SGMM offers the highest recognition accuracy overall, some recognition outputs of the tri3 model performed better than SGMM.

7.3 WER for individual speakers

Among the 21 speakers, the speech samples of six speakers were used for testing. The WER was calculated separately for different speakers with the different models. The WERs of the individual speakers are given in Tables 10, 11, 12, 13 and 14, considering substitution, insertion, and deletion errors along with the number of uttered words and correctly recognized words for the monophone, triphone, and SGMM models, whereas Figs. 11, 12, 13, 14 and 15 demonstrate the performance analysis of the individual speakers for the corresponding five models.

Table 10 WER analysis of Monophone model for individual speaker
Table 11 WER analysis of Tri-1 model for individual speaker
Table 12 WER analysis of Tri-2 model for individual speaker
Table 13 WER analysis of Tri-3 model for individual speaker
Table 14 WER analysis of SGMM for individual speaker
Fig. 11
figure 11

Monophone model performance analysis for individual speaker

Fig. 12
figure 12

Tri-1 model performance analysis for individual speaker

Fig. 13
figure 13

Tri-2 model performance analysis for individual speaker

Fig. 14
figure 14

Tri-3 model performance analysis for individual speaker

Fig. 15
figure 15

SGMM performance analysis for individual speaker

8 Conclusion

This research represents the first initiative to build an ASR system for 'Adi', a low-resource endangered tribal language of Arunachal Pradesh. In this work, an isolated word recognition system was developed with the Kaldi toolkit using speech samples from native Adi speakers. MFCC features were extracted from the speech samples and used in the ASR system. Out of the 14,490 word utterances in the dataset, covering 2088 unique Adi words, 10,350 word utterances (recordings of 15 speakers) were used for training and 4140 word utterances (recordings of 6 speakers) for testing, i.e., 71.43% of the dataset was used for training and 28.57% for testing. The monophone model achieved a WRA of 72.63% and a WER of 27.37%. In the triphone models, the WER decreased as the model complexity increased; the recognition accuracies achieved with the tri1, tri2, and tri3 models were 84.76, 87.54, and 91.06%, respectively. The SGMM model offered the lowest WER of 7.95%. It was further observed that the WER varies from speaker to speaker; the lowest WER of 0.58% was obtained for speaker '004F'.

9 Future scope

Future work on the current research may include testing the models with a larger speech corpus covering different age groups and dialects. Additionally, noisy data may be added to the present models to make this Adi ASR system more robust and efficient. The development of various ASR applications in the future will help preserve this low-resource tribal language in the digital era.