1 Introduction

Automatic speech recognition (ASR) is a technique for transcribing a spoken message and extracting linguistic information from an audio signal. ASR is used in domains such as teaching, interactive services, messaging, machine or robot control, quality control, data entry, remote access, and system detection [1,2,3,4,5]. Several toolkits have also been developed for speech recognition, such as the Hidden Markov Model Toolkit (HTK) [6], the Institute for Signal and Information Processing (ISIP) toolkit [7], CMU Sphinx [8,9,10], and Kaldi [11].

The researchers in [12] described the development of a Kannada speech recognition system using the Kaldi toolkit. Medennikov et al. [13] discussed the implementation of a Russian speech recognition system, also using Kaldi. The authors of [14] presented a technical overview of speech recognition systems based on Moroccan dialects, covering recent progress in feature extraction methods, performance evaluation, and speech classifiers. The authors of [15] created a Darija speech recognition system based on the CMU Sphinx tools with a combination of Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). Their highest accuracy, 96.27%, was obtained with 8 GMMs. Ameen et al. [16] explored several models for Arabic phoneme recognition, using different machine learning schemes: neural networks, recurrent neural networks, artificial neural networks (ANN), random forest, extreme gradient boosting (XGBoost), and long short-term memory. Their results indicate that the ANN method outperformed the others. Table 1 presents some studies of ASR systems based on the Amazigh language [17].

Table 1 ASR systems studies based on the Amazigh language

In this study, we propose a new method for integrating the Amazigh language into an isolated variant vocabulary speech recognition system built with the open-source Kaldi toolkit. We use this open-source platform to evaluate ASR performance while varying the HMMs, the number of Gaussian mixture models (GMMs), and the feature extraction techniques, in order to determine the values that give maximum performance.

Our paper is organized as follows. Section 1 is this introduction. Section 2 describes the Kaldi toolkit. Section 3 presents Hidden Markov Models (HMMs). Section 4 gives an overview of feature extraction methods. The proposed system architecture is detailed in Sect. 5. Experimental results are presented in Sect. 6. Finally, the conclusion is given in Sect. 7.

2 Kaldi toolkit

Kaldi is an open-source speech recognition toolkit written in C++ and released under the Apache License v2.0 [11]. Kaldi includes a large set of tools and programs for HMMs, decision trees, neural networks, data preprocessing, and feature extraction. Its internal structure is shown in Fig. 1 [11]. The modules of the Kaldi library depend on two external libraries: the linear algebra libraries (BLAS/LAPACK) and the library that provides finite state transducers (OpenFST). The decodable class bridges these two external libraries. This ASR toolkit is still actively updated and further developed by a large community.

Fig. 1 Kaldi toolkit [5]

3 Hidden Markov models (HMMs)

The HMM was introduced in the 1960s [28]. It is considered one of the most widely used methods for speech recognition modeling [29], and it is also used in computational molecular biology [30]. Figure 2 presents a three-state Hidden Markov Model topology [31].

Fig. 2 The 3-state HMM architecture
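As an illustration of the three-state topology, the following minimal Python sketch assumes a left-to-right HMM in which each state either repeats or advances to the next one, followed by a non-emitting exit. The transition probabilities and the function name `forward_likelihood` are illustrative assumptions, not values or code from this work, and the per-frame emission likelihoods (which a GMM would supply in practice) are passed in directly.

```python
# Illustrative three-state left-to-right HMM (probabilities are made up).
# Rows are emitting states 0, 1, 2; the last column is a non-emitting exit.
TRANSITIONS = [
    [0.6, 0.4, 0.0, 0.0],  # state 0: stay, or move to state 1
    [0.0, 0.7, 0.3, 0.0],  # state 1: stay, or move to state 2
    [0.0, 0.0, 0.8, 0.2],  # state 2: stay, or leave the model
]


def forward_likelihood(emission_probs, transitions=TRANSITIONS):
    """Forward algorithm for the topology above.

    emission_probs[t][s] is the likelihood of frame t under state s.
    Returns the likelihood of the frames for paths that start in
    state 0 and leave the model through the exit after the last frame.
    """
    n_states = len(transitions)
    alpha = [0.0] * n_states
    alpha[0] = emission_probs[0][0]  # every path starts in state 0
    for frame in emission_probs[1:]:
        alpha = [
            sum(alpha[i] * transitions[i][j] for i in range(n_states)) * frame[j]
            for j in range(n_states)
        ]
    return alpha[-1] * transitions[-1][n_states]
```

With this topology, an utterance needs at least three frames to traverse all states, so shorter inputs receive zero likelihood.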

4 Feature extraction methods

The feature extraction phase plays a crucial role in the performance of ASR systems. It extracts the characteristics of the audio signal that are relevant for identifying the linguistic content, while rejecting the other information contained in that signal. The feature extraction methods used in this work are:

  • Mel Frequency Cepstral Coefficients (MFCCs), which are widely employed in speech recognition [32]. The Mel value for a given frequency f (in Hz) is computed according to formula (1) [33]:

    $$Mel\left(f\right)=2595 {\mathrm{log}}_{10}\left(1+\frac{f}{700}\right)$$
    (1)
  • Perceptual Linear Prediction (PLP) [34].

  • Filter Bank Coefficients (FBANK) [35].
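Formula (1) can be evaluated directly. The sketch below is a minimal Python illustration; the function names `hz_to_mel` and `mel_to_hz` are our own, and the inverse mapping is not given in the text but is obtained by solving formula (1) for f.

```python
import math


def hz_to_mel(f_hz):
    """Formula (1): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of formula (1), derived algebraically (not from the text)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    for f in (0, 700, 1000, 8000):
        print(f, "Hz ->", round(hz_to_mel(f), 2), "Mel")
```

The mapping is close to linear at low frequencies and logarithmic at high frequencies; 1000 Hz maps to approximately 1000 Mel.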

5 The proposed system architecture

In this research, we propose a speech platform for the integration of the Amazigh language into an isolated variant vocabulary speech recognition system. The system is based on the open-source Kaldi toolkit, using an HMM approach with different numbers of GMMs. In addition, the MFCC, FBANK, and PLP feature extraction techniques are used. The proposed system architecture is presented in Fig. 3.

Fig. 3 Proposed system architecture

5.1 Corpus

The speech data consists of the first ten Amazigh digits (0–9), listed in Table 2, and ten Amazigh words, listed in Table 3. The corpus was collected from 30 native Moroccan Tarifit speakers, who were asked to pronounce each Amazigh word ten times. Each digit and word recording was played back and visualized to verify that the entire word was included in the speech signal, and only the correct recordings were kept in the database. The speech was recorded with a microphone using the WaveSurfer recording tool and saved in wave format as “.wav” files, at a sampling rate of 16 kHz with a resolution of 16 bits. Two disjoint sets of audio files were created, one for training and the other for testing.

Table 2 The Amazigh digits
Table 3 The used Amazigh words
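The recording format described above (16 kHz, 16 bits) can be checked automatically before training. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_corpus_format` and its defaults are illustrative assumptions and not part of the tooling used in this work.

```python
import wave


def matches_corpus_format(path, rate_hz=16000, sample_bits=16):
    """Return True if the WAV file has the expected sampling rate and bit depth."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate_hz
            and wav.getsampwidth() * 8 == sample_bits  # sample width is in bytes
        )
```

Such a check makes it possible to reject files recorded with different settings before they enter the training or test set.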

5.2 Acoustic model

The acoustic model allows the recognition of the phoneme sequences listed in the pronunciation dictionary. Three-state HMMs with a simple monophone model are trained to recognize the speech data. A sample of the Amazigh dictionary used is presented in Table 4.

Table 4 Sample of the used dictionary

5.3 Decoder

The decoder combines the predictions of the acoustic and language models to propose the most probable text transcription for a given utterance.
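This combination is commonly written as the following decision rule, which we give here as the standard statistical formulation of ASR rather than as a detail specific to this system. The symbols are our own notation: X denotes the sequence of acoustic feature vectors, W a candidate word sequence, P(X|W) the acoustic model score, and P(W) the language model score.

$$\widehat{W}=\underset{W}{\mathrm{arg\,max}}\;P\left(X|W\right)P\left(W\right)$$

For an isolated-word task, W ranges over the vocabulary entries, so the decoder selects the single word whose combined score is highest.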

In this paper, we focused on the integration of the Amazigh language into an isolated variant vocabulary speech recognition system based on Kaldi with the use of GMM-HMM.

6 Experimental results

In this section, we report two experiments. In the first, the system was trained and tested with the first ten Amazigh digits (0–9). In the second, the system was trained and tested with the ten most-used daily Amazigh isolated words, which present a typical syllabic structure and are considered good representative samples of the Amazigh language. All tests were performed on Ubuntu 16.04 LTS (64-bit operating system). In our experiments, the speech data is divided into 70% for training and 30% for testing (see Table 5). Different sets of training and testing parameters were used to design an efficient recognition system. We trained and tested the system using different GMM values and the MFCC, FBANK, and PLP feature extraction techniques.
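The 70%/30% partition into disjoint sets can be sketched as follows. This is a minimal Python illustration under our own assumptions: the text does not state whether the split was drawn per utterance or per speaker, so the sketch simply shuffles utterance identifiers with a fixed seed; the function name `split_train_test` is hypothetical.

```python
import random


def split_train_test(utterance_ids, train_fraction=0.7, seed=0):
    """Shuffle the identifiers reproducibly and cut them into two disjoint lists."""
    ids = sorted(utterance_ids)          # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)     # seeded shuffle, independent of global state
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    train, test = split_train_test(range(100))
    print(len(train), len(test))
```

Fixing the seed keeps the two sets identical across runs, so that recognition rates obtained with different GMM values and feature types are measured on the same test utterances.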

Figures 4, 5, and 6 present the results of the first experiment, where the system was trained and tested using the digits with MFCC, PLP, and FBANK coefficients, respectively, and GMM values ranging from 400 to 1000.

From Fig. 4 we can see that the most frequently recognized Amazigh digits using MFCCs are KRAD and SEMUS. With PLP coefficients, the most frequently recognized Amazigh digits are AMYA and KRAD (see Fig. 5). With FBANK coefficients, the most frequently recognized Amazigh digits are AMYA and SEMUS (see Fig. 6).

Fig. 4 The recognition rate of Amazigh digits as a function of the number of GMMs using MFCCs

Fig. 5 The recognition rate of Amazigh digits as a function of the number of GMMs using PLP

Fig. 6 The recognition rate of Amazigh digits as a function of the number of GMMs using FBANK

The system performance of the MFCC, PLP, and FBANK feature extraction methods with several GMM values is shown in Fig. 7. The best results were obtained with 400 GMMs for MFCCs and PLP.

Fig. 7 The recognition rate difference between MFCC, PLP, and FBANK as a function of the number of GMMs for Amazigh digits

Figures 8, 9, and 10 show the results of the second experiment. Higher recognition rates were attained especially with the words AFLLA and AWAR, for MFCC, PLP, and FBANK with 400, 600, 800, and 1000 GMMs.

Fig. 8 The recognition rate of Amazigh words as a function of the number of GMMs using MFCC

Fig. 9 The recognition rate of Amazigh words as a function of the number of GMMs using PLP

Fig. 10 The recognition rate of Amazigh words as a function of the number of GMMs using FBANK

Figure 11 shows the recognition rate difference between MFCC, PLP, and FBANK as a function of the total number of Gaussian distributions for Amazigh words. The system obtains its best performance when trained using FBANK with 400 GMMs.

Fig. 11 The recognition rate difference between MFCC, PLP, and FBANK as a function of the number of GMMs for Amazigh words

The results achieved in experiments 1 and 2 show that:

  • The best result was found with 400 GMMs.

  • FBANK coefficients performed better for Amazigh words, whereas MFCC coefficients performed better for Amazigh digits.

Considering the analysis of Amazigh words and digits, all words and digits that consist of two or three syllables achieve a higher recognition rate. For example:

  • The word “AFFLA” (2 syllables) has a recognition rate of 100% for 400, 600, 800, and 1000 GMMs with MFCC, PLP, and FBANK coefficients.

  • The word “AFOSI” (3 syllables) reaches its best recognition rate of 100% with FBANK coefficients and 800 GMMs.

  • The word “ALNDAD” (2 syllables) reaches its best recognition rate of 100% with MFCC coefficients and 600 or 800 GMMs.

  • The word “AMAGGWAJ” (3 syllables) reaches its best recognition rate of 100% with MFCC coefficients and 1000 GMMs, and with FBANK coefficients and 400 GMMs.

  • The word “ANAKMAR” (3 syllables) reaches its best recognition rate of 100% with FBANK coefficients and 600 or 800 GMMs.

  • The word “AWAR” (2 syllables) reaches its best recognition rate of 100% for 400, 600, 800, and 1000 GMMs with MFCC and FBANK coefficients, and for 400, 800, and 1000 GMMs with PLP coefficients.

  • The word “AZLMAD” (2 syllables) reaches its best recognition rate of 100% for 1000 GMMs with MFCC coefficients, for 400 and 1000 GMMs with PLP coefficients, and for 400 GMMs with FBANK coefficients.

  • The digit “AMYA” (2 syllables) reaches its best recognition rate of 100% for 400, 800, and 1000 GMMs with PLP coefficients, and for 400, 600, 800, and 1000 GMMs with FBANK coefficients.

  • The digit “KRAD” (2 syllables) reaches its best recognition rate of 100% with MFCC coefficients and 600 GMMs, and with PLP coefficients and 800 or 1000 GMMs.

  • The digit “SEMUS” (2 syllables) has a recognition rate of 100% for 400, 600, and 1000 GMMs with MFCC and FBANK coefficients.

The analysis of digits and words indicates that the misrecognized Amazigh words are monosyllabic ones such as TAM, TZA, DAR, and DAT. Our tests show that the most frequently recognized Amazigh commands are those composed of two or three syllables, while the frequently misrecognized commands are monosyllabic. This suggests that the number of syllables in an Amazigh command affects its recognition rate.

The first objective of this work was to use the Kaldi toolkit to create a speech recognition system for isolated Amazigh words and Amazigh digits (0–9). To compare toolkits in terms of recognition rate, we also trained HMM-GMM acoustic models with different numbers of Gaussians (8, 16, and 32 GMMs) and MFCC coefficients using both Kaldi and CMU Sphinx4. Two tests were performed. In the first test, the system was trained and tested on the first ten Amazigh digits (see Table 2). In the second test, the system was trained and tested on the 10 isolated Amazigh words (see Table 3). The speech audio files used in this work were divided into two disjoint sets, one for training and one for testing (see Table 5).

Table 5 Corpus characteristics

Figure 12 shows the recognition rate (%) of Amazigh digits as a function of the total number of Gaussian distributions (8, 16, and 32 GMMs). With the Kaldi toolkit, the rates achieved were 90.42% with 8 GMMs, 90.28% with 16 GMMs, and 90.85% with 32 GMMs. With CMU Sphinx4, the rates were 88.7%, 88.99%, and 86.56% for 8, 16, and 32 GMMs, respectively.

Fig. 12 The recognition rate (%) difference for Amazigh digits between KALDI and SPHINX4

Figure 13 shows the recognition rate (%) of Amazigh words with both Kaldi and CMU Sphinx4 as a function of the total number of Gaussian distributions (8, 16, and 32 GMMs). With Kaldi, the system correct rates were 87.66%, 89.33%, and 86.66% for 8, 16, and 32 GMMs, respectively. With CMU Sphinx4, the system correct rates were 86.66%, 86.99%, and 85.33% for the same settings.

Fig. 13 The recognition rate (%) difference for Amazigh words between KALDI and SPHINX4
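The gap between the two toolkits can be read off the rates reported above. The short Python sketch below only restates those reported values and subtracts them; the names `RATES`, `GMMS`, and `kaldi_gains` are illustrative.

```python
# Recognition rates (%) reported in the text for 8, 16, and 32 GMMs.
GMMS = [8, 16, 32]
RATES = {
    "digits": {"Kaldi": [90.42, 90.28, 90.85], "Sphinx4": [88.7, 88.99, 86.56]},
    "words": {"Kaldi": [87.66, 89.33, 86.66], "Sphinx4": [86.66, 86.99, 85.33]},
}


def kaldi_gains(task):
    """Kaldi rate minus Sphinx4 rate, in percentage points, per GMM setting."""
    kaldi, sphinx = RATES[task]["Kaldi"], RATES[task]["Sphinx4"]
    return [round(k - s, 2) for k, s in zip(kaldi, sphinx)]


if __name__ == "__main__":
    for task in RATES:
        print(task, dict(zip(GMMS, kaldi_gains(task))))
```

The differences are positive in every case: between 1.29 and 4.29 percentage points for digits, and between 1.00 and 2.34 percentage points for words.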

Based on the results of this experiment, Kaldi outperformed CMU Sphinx4 for every tested number of GMMs with the use of GMM-HMM. Table 6 compares our results with other works.

Table 6 A summary of some Kaldi ASR systems

7 Conclusion

In this study, we have presented a new approach for the integration of the less-resourced Amazigh language into an isolated variant vocabulary speech recognition system. The system was implemented with the Kaldi toolkit using 3-state HMMs with 400, 600, 800, and 1000 GMMs, together with the MFCC, FBANK, and PLP feature extraction techniques. Our system obtained its best performance of 93.96% when trained using MFCCs. In addition, a comparison between the Kaldi and CMU Sphinx4 toolkits was presented, and our results showed that Kaldi outperformed CMU Sphinx4 with the use of HMM-GMM.

In future work, we will focus on improving system performance by adopting hybrid and deep learning approaches.