1 Introduction

Automatic speech recognition (ASR) is a computer technique whose objective is to transcribe a speech signal into text. It is a still-emerging area that attracts the attention of the public as well as of many researchers, and it opens the way towards a new generation of man-machine interfaces. This importance is explained by the privileged position of speech as a vector of human information. The realization of an ASR system requires contributions from several research domains: signal processing, mathematical models, algorithms, etc. [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17].

Remarkable progress in the state of the art has made these systems increasingly efficient, with performance sufficient for use in many domains: assistance for the autonomous life of people, voice control (in industry, medicine, aviation, toys, and space), language learning and translation, indexing of large audiovisual databases, deep learning, etc. [1].

However, the speech signal is one of the most complex signals to characterize, which makes the task of an ASR system difficult. This complexity originates from the combination of several factors: the redundancy of the acoustic signal, the great inter- and intra-speaker variability, the coarticulation effects of continuous speech, and the recording conditions. To overcome these difficulties, many mathematical methods and models have been developed, including dynamic comparison (dynamic time warping), neural networks [21], support vector machines (SVM), and stochastic Markov models, in particular hidden Markov models (HMMs), which have become the standard solution to the problems of automatic speech recognition.

Given the importance of ASR, several free software packages have been developed; among the most famous are HTK [2], CMU Sphinx [3], Julius, and Kaldi [4]. In this research work, we propose an approach to building an Amazigh automatic speech recognition system based on CMU Sphinx-4, which relies on hidden Markov models (HMMs). Sphinx-4 is a flexible, modular, and pluggable framework designed to foster new innovations in core HMM recognition research, and it is used in this work precisely because of its high degree of flexibility and modularity [4].

The paper is organized as follows. Section 2 reviews related work, Sect. 3 gives a brief description of the Amazigh language, Sect. 4 presents the principle and theory of speech recognition, and Sect. 5 describes the CMU Sphinx framework. Section 6 details the steps taken to build the Amazigh speech recognition system and reports the experimental results, with the performance of the system evaluated in terms of word error rate (WER). Finally, Sect. 7 summarizes the conclusions and future work.

2 Related Works

This section presents some of the works reported in the literature that are similar to the presented one. Among the works providing ASR systems for other languages are those of Kumar and Aggarwal [15] and Abushariah et al. [16], discussed below.

El Ghazi et al. [7] presented a system for automatic speech recognition of Amazigh. They used hidden Markov models to model the phonetic units corresponding to words taken from the training base. The results obtained are very encouraging given the size of the training set and the number of speakers recorded. To demonstrate the flexibility of hidden Markov models, they compared the results obtained by HMMs with those obtained by dynamic programming.

Satori et al. [2,3,4,5,6,7,8,9,10,11,12,13,14] developed a speaker-independent continuous automatic Amazigh speech recognition system. The designed system is based on the Carnegie Mellon University Sphinx tools. In the training and testing phases, an in-house Amazigh_Alphadigits corpus was used. This corpus was collected in the framework of their work and consists of speech and transcriptions from 60 Berber Moroccan speakers (30 males and 30 females), native speakers of Tarifit Berber. The system achieved its best performance, 92.89%, when trained using 16 Gaussian mixture models.

Kumar and Aggarwal [15] built a connected-words speech recognition system for the Hindi language. The system was developed using the hidden Markov model toolkit (HTK), which uses hidden Markov models (HMMs) for recognition.

Abushariah et al. [16] proposed an efficient and effective framework for the design and development of a speaker-independent continuous automatic Arabic speech recognition system based on a phonetically rich and balanced speech corpus. The corpus contains a total of 415 sentences recorded by 40 Arabic native speakers (20 male and 20 female) from 11 different Arab countries representing the three major regions (Levant, Gulf, and Africa) of the Arab world. The proposed Arabic speech recognition system is based on the Carnegie Mellon University (CMU) Sphinx tools, and the Cambridge HTK tools were also used at some testing stages. The speech engine uses 3-emitting-state hidden Markov models (HMM) for tri-phone-based acoustic models. Based on experimental analysis of about 7 hours of training speech data, the acoustic model performs best using a continuous observation probability model with 16 Gaussian mixture distributions, with the state distributions tied to 500 senones. The language model contains both bi-grams and tri-grams. For similar speakers but different sentences, the system obtained word recognition accuracies of 92.67% and 93.88% and word error rates (WER) of 11.27% and 10.07%, with and without diacritical marks respectively. For different speakers with similar sentences, the system obtained word recognition accuracies of 95.92% and 96.29% and WERs of 5.78% and 5.45%, with and without diacritical marks respectively. For different speakers and different sentences, the system obtained word recognition accuracies of 89.08% and 90.23% and WERs of 15.59% and 14.44%, with and without diacritical marks respectively.

Al-Qatab and Ainon [20] implemented an Arabic automatic speech recognition engine using HTK. The engine recognizes both continuous speech and isolated words. The system uses an Arabic dictionary built manually from the speech sounds of 13 speakers, with a vocabulary of 33 words.

3 Amazigh Language

3.1 History

The Amazigh languages are a group of very closely related languages and dialects spoken in Morocco, Algeria, Tunisia, Libya, and the Siwa area of Egypt, as well as by large Amazigh communities in parts of Niger and Mali. In Morocco, for example, Amazigh is divided into three regional varieties: Tarifit in the North, Tamazight in Central and Southeast Morocco, and Tachelhit in the South-West and the High Atlas. Because of the difficulty of conducting a reliable language census, it is hard to assess the exact number of speakers of Berber languages in each country [6] (Table 1).

Table 1. Number of Amazigh speakers by country

This language has had a written tradition, on and off, for over 2000 years, although the tradition has been frequently disrupted by various invasions. It was first written in the Tifinagh alphabet, still used by the Tuareg; the oldest dated inscription is from about 200 BC. Later, between about 1000 AD and 1500 AD, it was written in the Arabic alphabet. Since the 20th century, it has often been written in the Latin alphabet, especially among the Kabylians.

3.2 Tifinagh

A modernized form of the Tifinagh alphabet was made official in Morocco in 2003, and a similar one is sparsely used in Algeria. The Amazigh Latin alphabet is preferred by Moroccan Amazigh writers and is still predominant in Algeria (although unofficially). Mali and Niger recognized the Amazigh Latin alphabet and customized it to the Tuareg phonological system. Nevertheless, Tifinagh is still used in parts of Mali and Niger. Both the Tifinagh and Latin scripts are increasingly being used in Morocco and parts of Algeria, while the Arabic script has been abandoned by Amazigh writers [7].

Only IRCAM has defined a precise alphabetical order, described by the relation below, where a < b means that a is sorted before b (Table 2):

Table 2. Official table of the Tifinagh alphabet as recommended by IRCAM

3.3 Phonetics

The graphic system of standard Amazigh proposed by IRCAM comprises:

  • 27 consonants, including the labials, the dentals, and the alveolars;

  • 2 semi-consonants, y and w;

  • 4 vowels: the full vowels a, i, and u, and the neutral vowel e.

4 Automatic Speech Recognition

Given a speech signal, current automatic speech recognition systems are based on a statistical approach [8], a formalization proposed by information theory.

Fundamentally, the problem of speech recognition can be stated as follows: from the acoustic observations X, the system looks for the word sequence W* that maximizes the following equation:

$$ W^{*} = \arg\max_{W} P(W \mid X) $$
(1)

After applying Bayes' theorem, this equation becomes:

$$ W^{*} = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} $$
(2)

P(X) is considered constant and can be removed from Eq. (2):

$$ W^{*} = \arg\max_{W} P(X \mid W)\,P(W) $$
(3)

where the term P(W) is estimated via the language model and P(X|W) corresponds to the probability given by the acoustic models. This type of approach makes it possible to integrate acoustic and linguistic information in the same decision process (Fig. 1).

Fig. 1. Steps involved in an ASR system.
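To make the decision rule of Eq. (3) concrete, the following is a minimal sketch of the log-domain argmax over a list of candidate word sequences. It is our illustration only: the Scorer interface and its two methods are hypothetical placeholders for a real acoustic model and language model.

```java
import java.util.List;

public class MapDecoder {

    // Hypothetical stand-ins for the real models:
    // acousticLogScore ~ log P(X | W), languageLogScore ~ log P(W).
    interface Scorer {
        double acousticLogScore(String hypothesis);
        double languageLogScore(String hypothesis);
    }

    // Returns the hypothesis maximizing log P(X|W) + log P(W);
    // P(X) is constant over hypotheses and therefore ignored (Eq. 3).
    static String decode(List<String> hypotheses, Scorer scorer) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String w : hypotheses) {
            double score = scorer.acousticLogScore(w) + scorer.languageLogScore(w);
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }
}
```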

4.1 Acoustic Analysis of the Signal of Speech

The relevant acoustic information of the speech signal is mainly in the bandwidth [50 Hz–8 kHz]. A signal parametrization system, also known as acoustic pre-processing, is required for signal shaping and calculation of coefficients. This step must be done carefully, as it contributes directly to the performance of the system.

The acoustic analysis is divided into three stages: analog filtering, analog-to-digital conversion, and the calculation of coefficients.
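As an illustration of the coefficient-calculation stage, the sketch below shows two standard front-end steps, pre-emphasis and framing. It is our sketch of conventional practice, not the exact front end used by Sphinx; the 0.97 coefficient and the frame sizes in the comments are customary values, not parameters taken from this system.

```java
public class FrontEnd {

    // Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high
    // frequencies before spectral analysis (alpha is typically 0.97).
    static double[] preEmphasize(double[] x, double alpha) {
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++) {
            y[n] = x[n] - alpha * x[n - 1];
        }
        return y;
    }

    // Framing: split the signal into overlapping frames of frameLen
    // samples, advancing by hopLen samples each time (e.g. 25 ms
    // frames with a 10 ms hop at the chosen sampling rate).
    static double[][] frame(double[] x, int frameLen, int hopLen) {
        int count = x.length < frameLen ? 0 : 1 + (x.length - frameLen) / hopLen;
        double[][] frames = new double[count][frameLen];
        for (int i = 0; i < count; i++) {
            System.arraycopy(x, i * hopLen, frames[i], 0, frameLen);
        }
        return frames;
    }
}
```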

4.2 Hidden Markov Model

The acoustic model is used to model the statistics of speech features for each speech unit, such as a phone or a word. The hidden Markov model (HMM) is the de facto standard of state-of-the-art acoustic modeling. It is a powerful statistical method for modeling observed data in a discrete-time series. An HMM is a structure formed by a group of states connected by transitions, each transition being specified by its transition probability. The word "hidden" indicates that the state sequence assumed to generate the output symbols is unknown. In speech recognition, state transitions are usually constrained to go from left to right or to repeat the same state [9]; this is called the left-to-right model, shown in Fig. 2.

Fig. 2. A left-to-right HMM model with three true states.
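For such a three-state left-to-right topology, only self-loops and forward transitions are allowed, so the transition matrix is upper triangular. A generic form (our illustration, with each row summing to 1) is:

$$ A = \begin{pmatrix} a_{11} & a_{12} & 0 \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{pmatrix}, \qquad \sum_{j} a_{ij} = 1 . $$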

Each state of the HMM is usually represented by a Gaussian Mixture Model (GMM) to model the distribution of feature vectors for the given state. A GMM is a weighted sum of M component Gaussian densities and is described by Eq. (4).

$$ P(x \mid \lambda) = \sum\nolimits_{i = 1}^{M} w_{i}\, g\left( x \mid \mu_{i}, \Sigma_{i} \right) $$
(4)
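To illustrate Eq. (4), the sketch below evaluates the likelihood of a feature vector under a diagonal-covariance GMM. The parameters (weights, means, variances) are hypothetical inputs, not values from the models trained in this work.

```java
public class Gmm {
    final double[] weights;     // w_i, non-negative and summing to 1
    final double[][] means;     // mu_i, one mean vector per component
    final double[][] variances; // diagonal of Sigma_i per component

    Gmm(double[] weights, double[][] means, double[][] variances) {
        this.weights = weights;
        this.means = means;
        this.variances = variances;
    }

    // Diagonal-covariance Gaussian density g(x | mu, Sigma),
    // accumulated in the log domain for numerical stability.
    static double gaussian(double[] x, double[] mu, double[] var) {
        double logDensity = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - mu[d];
            logDensity += -0.5 * Math.log(2.0 * Math.PI * var[d])
                    - 0.5 * diff * diff / var[d];
        }
        return Math.exp(logDensity);
    }

    // Eq. (4): P(x | lambda) = sum_i w_i * g(x | mu_i, Sigma_i)
    double likelihood(double[] x) {
        double p = 0.0;
        for (int i = 0; i < weights.length; i++) {
            p += weights[i] * gaussian(x, means[i], variances[i]);
        }
        return p;
    }
}
```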

5 CMU Sphinx

CMU Sphinx is a set of speech recognition development libraries and tools that can be linked into applications to speech-enable them [10]. It comprises a number of packages for different tasks and applications (a minimal usage sketch follows the list):

  • Pocketsphinx: a lightweight recognition library written in C.

  • Sphinxbase: the support library required by Pocketsphinx.

  • Sphinx4: a speech recognition decoder written in Java.

  • CMUclmtk: language model tools.

  • Sphinxtrain: the acoustic model training tool.

  • Sphinx3: a speech recognition decoder written in C.
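As the sketch promised above, recent Sphinx4 releases expose a high-level Java API alongside the XML configuration used in this work (see Sect. 6.5). The model and file paths below are hypothetical placeholders, not the actual paths of the Amazigh models built here.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscribeFile {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Hypothetical acoustic model, dictionary, and language model paths.
        configuration.setAcousticModelPath("file:models/amazigh/acoustic");
        configuration.setDictionaryPath("file:models/amazigh/tiftotal.dic");
        configuration.setLanguageModelPath("file:models/amazigh/tiftotal.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        try (InputStream stream = new FileInputStream("test.wav")) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            // Print the decoded hypothesis for each recognized utterance.
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```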

5.1 Sphinx Train

SphinxTrain [10] is the tool created by CMU for the development of acoustic models. It is a set of programs and documentation for building acoustic models for any language. Using SphinxTrain requires the installation of the following libraries and tools:

  • ActivePerl: used to run the SphinxTrain scripts, providing a Unix-like environment on Windows.

  • Microsoft Visual Studio: used to compile the C sources and produce the executables.

5.2 Architecture

The high-level architecture of Sphinx4 is relatively straightforward. As shown in Fig. 3, it consists of the front end, the decoder, a knowledge base, and the application [4].

Fig. 3. High-level architecture of Sphinx-4.

Front end: responsible for gathering, annotating, and processing the input data; it also extracts the features that the decoder reads. The annotations provided by the front end include the beginning and end of a data segment. Operations performed by the front end include pre-emphasis, noise cancellation, automatic gain control, end-pointing, Fourier analysis, mel-spectrum filtering, cepstral extraction, etc.

Knowledge base: provides the information the decoder needs to do its job. This information includes the acoustic model and the language model. The knowledge base can also receive feedback from the decoder, permitting it to modify itself dynamically based on successive search results. The modifications can include switching acoustic and/or language models as well as updating parameters, such as mean and variance transformations, for the acoustic models.

Decoder: performs the bulk of the work. It reads features from the front end, couples them with data from the knowledge base and feedback from the application, and performs a search to determine the most likely word sequences that could be represented by the series of features. The term "search space" describes the set of candidate word sequences and is dynamically updated by the decoder during the decoding process.

Application: may receive events from the decoder while the decoder is working on a search. These events allow the application to monitor the decoding progress, but also to affect the decoding process before it completes. Furthermore, the application can update the knowledge base at any time.

6 Experiments and Results

This section describes our experience in creating and developing an ASR system for the Amazigh language. The acoustic model is trained with SphinxTrain [4,5,6,7,8].

6.1 Corpus

Developing an ASR system for a new language such as Amazigh requires gathering a large corpus, given the statistical nature of the models (HMMs) generally used in automatic speech recognition. This is a very tedious task when no corpus exists, since the necessary resources (speech signal, lexicon, textual corpus, etc.) must then be collected from scratch. Our corpus covers the 33 letters of the Amazigh alphabet. Nine Moroccan speakers were invited to pronounce each letter ten times, so the corpus comprises ten repetitions of each letter by each speaker, i.e., 2970 audio files (33 letters × 10 repetitions × 9 speakers). The test database contains 330 audio files [4,5,6,7,8,9,10,11] (Table 3).

Table 3. Recording parameter used for the Preparation of the corpus.
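For SphinxTrain, such a corpus is typically described by a file-IDs list and a transcription file. The excerpt below is a hypothetical illustration of these two formats; the file names and utterance IDs are ours, not those of the actual corpus.

```
# tiftotal_train.fileids -- one audio file per line, without extension
speaker01/ya_01
speaker01/yab_01

# tiftotal_train.transcription -- one utterance per line, with its ID
<s> YA </s> (ya_01)
<s> YAB </s> (yab_01)
```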

6.2 Dictionary

A file with the .dic extension specifies the correspondence between the words of the transcription file and the phonemes used in the file with the .phone extension. The transcription uses the Latin script (Figs. 4 and 5).

Fig. 4. Extract of the file tiftotal.dic in the tiftotal application.

Fig. 5. Extract of the file tiftotal.phone, written using the Latin script.
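To illustrate the two formats (the entries below are hypothetical and do not reproduce the actual tiftotal files), the dictionary maps each word to its phoneme sequence, while the phone file lists one phoneme per line, including the conventional silence marker:

```
# tiftotal.dic (hypothetical excerpt)
YA     Y A
YAB    Y A B

# tiftotal.phone (hypothetical excerpt)
A
B
Y
SIL
```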

6.3 Language Model

The language model (statistical language model or grammar model) defines which words and word sequences may occur in an application. Each word in the language model must appear in the pronunciation dictionary.

There are several types of models for describing the language to be recognized: keyword lists, grammars, statistical language models, and phonetic statistical language models. The choice of a language model depends on the application [12] (Fig. 6).

Fig. 6. Steps to build a language model in ARPA n-gram format with CMUCLMTK.
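The resulting model is a plain-text ARPA file listing the n-grams with their base-10 log probabilities (and back-off weights). The fragment below is a hypothetical illustration for a small alphabet vocabulary; the probabilities are made up for the example.

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.5051 </s>
-1.5051 <s>  -0.3010
-0.7782 YA   -0.3010
-0.7782 YAB  -0.3010

\2-grams:
-0.3010 <s> YA
-0.3010 YA YAB

\end\
```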

6.4 Acoustic Model

In the context of Markovian ASR, the acoustic model is generally an HMM, typically a three-state left-to-right HMM called a Bakis model, with each state associated with a phoneme. For example, the acoustic model for the word YAB, which transcribes a letter of the alphabet and contains the phonemes Y, A, and B, can be represented by the HMM of Fig. 2.

Once a corpus is available, we can move on to the stage of creating the acoustic model. This is a laborious and difficult step, given the scarcity of relevant documentation on the steps of creating an acoustic model, even though CMU Sphinx has a relatively large community [13].

6.5 Results

An automatic speech recognition system such as Sphinx 4 uses two language-dependent elements: the acoustic model and the language model. In our application, we modified these two models as described previously.

Sphinx 4 must be configured using an XML file. The choice of algorithms, the extraction and comparison of feature vectors, and the other aspects important to the creation of an ASR system can thus be customized as required by the application.

The word error rate (WER) [14] has become the standard measure for evaluating the performance of speech recognition systems [12]. Given an original text and a recognized text of N words, let I be the number of inserted words, D the number of deleted words, and S the number of substituted words. The word error rate is then:

$$ \mathrm{WER} = \frac{I + D + S}{N} $$
(5)

WER is generally expressed as a percentage.
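As a sketch of how this measure can be computed (our illustration, not the evaluation code used in this work), the word-level edit distance between the reference and the hypothesis yields the minimal total of substitutions, insertions, and deletions:

```java
public class Wer {

    // WER as a percentage: the minimal number of word substitutions,
    // insertions, and deletions turning the reference into the
    // hypothesis, divided by the number of reference words N.
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int del = d[i - 1][j] + 1;
                int ins = d[i][j - 1] + 1;
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return 100.0 * d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // One deletion against a three-word reference: WER = 33.3%.
        System.out.println(wer("YA YAB YAD", "YA YAD"));
    }
}
```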

Number of States per HMM

To test the effect of the number of states per HMM on the quality of the acoustic models, the system was trained using two configurations, three or five states per HMM, knowing that the Sphinx-4 system accepts only these two configurations. Both models were then evaluated with the word error rate, WER (Table 4).

Table 4. Number of states per HMM.

These results show that the best performance was recorded with three states per HMM.

Number of Gaussian Probability Distributions

To test the effect of the number of Gaussian probability distributions on system performance, the system was trained and tested for Gaussian counts ranging from 1 to 256, for both the 3-state and 5-state HMM configurations (Tables 5 and 6).

Table 5. Number of Gaussian probability distributions with 3 states per HMM.
Table 6. Number of Gaussian probability distributions with 5 states per HMM.

The system obtained the best results when trained using 2 Gaussian mixtures and 3 states per HMM (Fig. 7).

Fig. 7. Evolution of WER as a function of the number of Gaussians, for both 3 and 5 states per HMM.

The presented work has been compared with similar existing works. El Ghazi et al. [7] presented a system for automatic speech recognition of Amazigh, using hidden Markov models to model the phonetic units corresponding to words taken from the training base. The database used for that work contains 2000 words, and the test database contains 330 audio files. The system gives a good performance of 90%, but the design is speaker-specific and uses a very small vocabulary. Satori et al. [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] proposed a spoken Arabic recognition system in which Arabic alphabets were investigated to form the ten Arabic digits (from zero to nine). The proposed system consists of two steps: Mel-frequency cepstral coefficient (MFCC) feature extraction, followed by classification and recognition conducted by CMU Sphinx4, a speaker-independent system based on hidden Markov models. The mean performance over three tests was between 83.33% and 96.67%. Although the system performance is good, the vocabulary size is small.

Our work is in line with these results from the literature: the system is highly efficient and produces 88% accuracy, given that the vocabulary size is relatively small.

7 Conclusion

In this paper, an automatic speech recognition system for the Amazigh language was developed. This system is based on the open-source CMU Sphinx-4 from Carnegie Mellon University. The important components of the ASR system are feature extraction, acoustic modeling, and pronunciation modeling using HMMs. The database for this research work contains 2970 words, and the system produces 88% accuracy.

The originality of our work lies in addressing the Amazigh language; this is considered among the first works treating this language using the open-source CMU Sphinx.

However, we consider this a preliminary work that will subsequently enable us to fulfill our basic objective: a speaker-independent recognition system for the Amazigh language. In the future, we plan to extend the application to a larger Amazigh vocabulary.