1 Introduction

Automatic speech recognition (ASR) is a computer technique whose objective is to transcribe a speech signal into text. It is a still-emerging area that attracts the attention of the public as well as of many researchers, and it opens the way towards a new generation of man-machine interfaces. This importance is explained by the privileged position of speech as a vector of human information. The realization of an ASR system requires contributions from several research domains: signal processing, mathematical models, algorithms, etc. [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17].

Remarkable progress in the state of the art has made these systems increasingly efficient, with performance sufficient for use in many domains: assistance for the autonomous life of people, voice control (in industry, medicine, aviation, toys, and space), language learning and translation, indexing of large audiovisual databases, deep learning, etc. [1].

However, the speech signal is one of the most complex signals to characterize, which makes the task of an ASR system difficult. This complexity originates from the combination of several factors: the redundancy of the acoustic signal, the great inter- and intra-speaker variability, the coarticulation effects of continuous speech, and the recording conditions. To overcome these difficulties, many mathematical methods and models have been developed, including dynamic comparison (dynamic time warping), neural networks [21], support vector machines (SVM), and stochastic Markov models, in particular hidden Markov models (HMMs), which have become the standard solution to the problems of automatic speech recognition.

Given the importance of ASR, several free software packages have been developed; among the most famous are HTK [2], CMU Sphinx [3], Julius, and Kaldi [4]. In this research work, we propose an approach to building an Amazigh automatic speech recognition system based on CMU Sphinx-4, which relies on hidden Markov models (HMMs). Sphinx-4 is a flexible, modular, and pluggable framework designed to foster new innovations in core HMM recognition research, and it is used in this work precisely because of its high degree of flexibility and modularity [4].

The paper is organized as follows. Section 2 reviews related work, Sect. 3 gives a brief description of the Amazigh language, Sect. 4 presents the principle and theory of speech recognition, and Sect. 5 describes the CMU Sphinx framework. Section 6 details the steps taken to build the Amazigh speech recognition system and reports the experimental results, with the performance of the system evaluated in terms of word error rate (WER). Finally, Sect. 7 summarizes the conclusions and future work.

2 Related Works

This section presents some of the works reported in the literature that are similar to the presented one. Among the works providing ASR systems for other languages are those of Kumar and Aggarwal [15] and Abushariah et al. [16], discussed below.

El Ghazi et al. [7] presented a system for automatic speech recognition of Amazigh. They used hidden Markov models to model the phonetic units corresponding to words taken from the training base. The results obtained are very encouraging given the size of the training set and the number of speakers recorded. To demonstrate the flexibility of hidden Markov models, they compared the results obtained by HMMs with those obtained by dynamic programming.

Satori et al. [2,3,4,5,6,7,8,9,10,11,12,13,14] developed a speaker-independent continuous automatic Amazigh speech recognition system. The designed system is based on the Carnegie Mellon University Sphinx tools. In the training and testing phases, an in-house Amazigh_Alphadigits corpus was used. This corpus was collected in the framework of their work and consists of speech and transcriptions from 60 Berber Moroccan speakers (30 males and 30 females), native speakers of Tarifit Berber. The system achieved its best performance, 92.89%, when trained using 16 Gaussian mixture models.

Kumar and Aggarwal [15] built a connected-words speech recognition system for the Hindi language. The system was developed using the hidden Markov model toolkit (HTK), which uses hidden Markov models (HMMs) for recognition.

Abushariah et al. [16] proposed an efficient and effective framework for the design and development of a speaker-independent continuous automatic Arabic speech recognition system based on a phonetically rich and balanced speech corpus. The corpus contains a total of 415 sentences recorded by 40 Arabic native speakers (20 male and 20 female) from 11 different Arab countries representing the three major regions (Levant, Gulf, and Africa) of the Arab world. The proposed Arabic speech recognition system is based on the Carnegie Mellon University (CMU) Sphinx tools, and the Cambridge HTK tools were also used at some testing stages. The speech engine uses 3-emitting-state hidden Markov models (HMM) for tri-phone-based acoustic models. Based on experimental analysis of about 7 hours of training speech data, the acoustic model performs best using a continuous observation probability model with 16 Gaussian mixture distributions, with the state distributions tied to 500 senones. The language model contains both bi-grams and tri-grams. For similar speakers but different sentences, the system obtained word recognition accuracies of 92.67% and 93.88% and word error rates (WER) of 11.27% and 10.07%, with and without diacritical marks respectively. For different speakers with similar sentences, the system obtained word recognition accuracies of 95.92% and 96.29% and WERs of 5.78% and 5.45%, with and without diacritical marks respectively. For different speakers and different sentences, the system obtained word recognition accuracies of 89.08% and 90.23% and WERs of 15.59% and 14.44%, with and without diacritical marks respectively.

Al-Qatab and Ainon [20] implemented an Arabic automatic speech recognition engine using HTK. The engine recognizes both continuous speech and isolated words. The system uses an Arabic dictionary built manually from the speech sounds of 13 speakers, with a vocabulary of 33 words.

3 Amazigh Language

3.1 History

The Amazigh languages are a group of very closely related languages and dialects spoken in Morocco, Algeria, Tunisia, Libya, and the Siwa area of Egypt, as well as by large Amazigh communities in parts of Niger and Mali. In Morocco, for example, Amazigh is divided into three regional varieties: Tarifit in the North, Tamazight in Central and Southeast Morocco, and Tachelhit in the South-West and the High Atlas. Because of the difficulty of conducting a reliable language census, it is hard to assess the exact number of speakers of Berber languages in each country [6] (Table 1).

Table 1. Number of Amazigh speakers by country

This language has had a written tradition, on and off, for over 2000 years, although the tradition has been frequently disrupted by various invasions. It was first written in the Tifinagh alphabet, still used by the Tuareg; the oldest dated inscription is from about 200 BC. Later, between about 1000 AD and 1500 AD, it was written in the Arabic alphabet. Since the 20th century, it has often been written in the Latin alphabet, especially among the Kabylians.

3.2 Tifinagh

A modernized form of the Tifinagh alphabet was made official in Morocco in 2003, and a similar one is sparsely used in Algeria. The Amazigh Latin alphabet is preferred by Moroccan Amazigh writers and is still predominant in Algeria (although unofficially). Mali and Niger recognized the Amazigh Latin alphabet and customized it to the Tuareg phonological system. Nevertheless, Tifinagh is still used in parts of Mali and Niger. Both the Tifinagh and Latin scripts are increasingly being used in Morocco and parts of Algeria, while the Arabic script has been abandoned by Amazigh writers [7].

Only IRCAM has defined a precise alphabetical order, described by the relation below, where a < b means that a is sorted before b (Table 2):

Table 2. Official table of the Tifinagh alphabet as recommended by IRCAM

3.3 Phonetics

The graphic system of standard Amazigh proposed by IRCAM comprises:

  • 27 consonants, including the labials, the dentals, and the alveolars;

  • 2 semi-consonants, y and w;

  • 4 vowels: the full vowels a, i, and u, and the neutral vowel e.

4 Automatic Speech Recognition

Given a speech signal, current automatic speech recognition systems are based on a statistical approach [8], a formalization proposed by information theory.

Fundamentally, the problem of speech recognition can be stated as follows: from the acoustic observations X, the system looks for the word sequence W* that maximizes the following equation:

$$ W^{*} = \arg\max_{W} P(W \mid X) $$
(1)

After applying Bayes' theorem, this equation becomes:

$$ W^{*} = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} $$
(2)

P(X) is considered constant and can be removed from Eq. (2):

$$ W^{*} = \arg\max_{W} P(X \mid W)\,P(W) $$
(3)

where the term P(W) is estimated via the language model and P(X|W) corresponds to the probability given by the acoustic models. This type of approach makes it possible to integrate acoustic and linguistic information in the same decision process (Fig. 1).

Fig. 1. Steps involved in an ASR system.
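To make the decision rule of Eq. (3) concrete, the following is a minimal sketch of the log-domain argmax over a list of candidate word sequences. It is our illustration only: the Scorer interface and its two methods are hypothetical placeholders for a real acoustic model and language model.

```java
import java.util.List;

public class MapDecoder {

    // Hypothetical stand-ins for the real models:
    // acousticLogScore ~ log P(X | W), languageLogScore ~ log P(W).
    interface Scorer {
        double acousticLogScore(String hypothesis);
        double languageLogScore(String hypothesis);
    }

    // Returns the hypothesis maximizing log P(X|W) + log P(W);
    // P(X) is constant over hypotheses and therefore ignored (Eq. 3).
    static String decode(List<String> hypotheses, Scorer scorer) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String w : hypotheses) {
            double score = scorer.acousticLogScore(w) + scorer.languageLogScore(w);
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }
}
```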

4.1 Acoustic Analysis of the Signal of Speech

The relevant acoustic information of the speech signal is mainly in the bandwidth [50 Hz–8 kHz]. A signal parametrization system, also known as acoustic pre-processing, is required for signal shaping and calculation of coefficients. This step must be done carefully, as it contributes directly to the performance of the system.

The acoustic analysis is divided into three stages: analog filtering, analog-to-digital conversion, and the calculation of coefficients.
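As an illustration of the coefficient-calculation stage, the sketch below shows two standard front-end steps, pre-emphasis and framing. It is our sketch of conventional practice, not the exact front end used by Sphinx; the 0.97 coefficient and the frame sizes in the comments are customary values, not parameters taken from this system.

```java
public class FrontEnd {

    // Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high
    // frequencies before spectral analysis (alpha is typically 0.97).
    static double[] preEmphasize(double[] x, double alpha) {
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++) {
            y[n] = x[n] - alpha * x[n - 1];
        }
        return y;
    }

    // Framing: split the signal into overlapping frames of frameLen
    // samples, advancing by hopLen samples each time (e.g. 25 ms
    // frames with a 10 ms hop at the chosen sampling rate).
    static double[][] frame(double[] x, int frameLen, int hopLen) {
        int count = x.length < frameLen ? 0 : 1 + (x.length - frameLen) / hopLen;
        double[][] frames = new double[count][frameLen];
        for (int i = 0; i < count; i++) {
            System.arraycopy(x, i * hopLen, frames[i], 0, frameLen);
        }
        return frames;
    }
}
```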

4.2 Hidden Markov Model

The acoustic model is used to model the statistics of speech features for each speech unit, such as a phone or a word. The hidden Markov model (HMM) is the de facto standard of state-of-the-art acoustic modeling. It is a powerful statistical method for modeling observed data in a discrete-time series. An HMM is a structure formed by a group of states connected by transitions, each transition being specified by its transition probability. The word "hidden" indicates that the state sequence assumed to generate the output symbols is unknown. In speech recognition, state transitions are usually constrained to go from left to right or to repeat the same state [9]; this is called the left-to-right model, shown in Fig. 2.

Fig. 2. A left-to-right HMM model with three true states.
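For such a three-state left-to-right topology, only self-loops and forward transitions are allowed, so the transition matrix is upper triangular. A generic form (our illustration, with each row summing to 1) is:

$$ A = \begin{pmatrix} a_{11} & a_{12} & 0 \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{pmatrix}, \qquad \sum_{j} a_{ij} = 1 . $$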

Each state of the HMM is usually represented by a Gaussian Mixture Model (GMM) to model the distribution of feature vectors for the given state. A GMM is a weighted sum of M component Gaussian densities and is described by Eq. (4).

$$ P(x \mid \lambda) = \sum\nolimits_{i = 1}^{M} w_{i}\, g\left( x \mid \mu_{i}, \Sigma_{i} \right) $$
(4)
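To illustrate Eq. (4), the sketch below evaluates the likelihood of a feature vector under a diagonal-covariance GMM. The parameters (weights, means, variances) are hypothetical inputs, not values from the models trained in this work.

```java
public class Gmm {
    final double[] weights;     // w_i, non-negative and summing to 1
    final double[][] means;     // mu_i, one mean vector per component
    final double[][] variances; // diagonal of Sigma_i per component

    Gmm(double[] weights, double[][] means, double[][] variances) {
        this.weights = weights;
        this.means = means;
        this.variances = variances;
    }

    // Diagonal-covariance Gaussian density g(x | mu, Sigma),
    // accumulated in the log domain for numerical stability.
    static double gaussian(double[] x, double[] mu, double[] var) {
        double logDensity = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - mu[d];
            logDensity += -0.5 * Math.log(2.0 * Math.PI * var[d])
                    - 0.5 * diff * diff / var[d];
        }
        return Math.exp(logDensity);
    }

    // Eq. (4): P(x | lambda) = sum_i w_i * g(x | mu_i, Sigma_i)
    double likelihood(double[] x) {
        double p = 0.0;
        for (int i = 0; i < weights.length; i++) {
            p += weights[i] * gaussian(x, means[i], variances[i]);
        }
        return p;
    }
}
```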

5 CMU Sphinx

CMU Sphinx is a set of speech recognition development libraries and tools that can be linked into applications to speech-enable them [10]. It comprises a number of packages for different tasks and applications (a minimal usage sketch follows the list):

  • Pocketsphinx: a lightweight recognition library written in C.

  • Sphinxbase: the support library required by Pocketsphinx.

  • Sphinx4: a speech recognition decoder written in Java.

  • CMUclmtk: language model tools.

  • Sphinxtrain: the acoustic model training tool.

  • Sphinx3: a speech recognition decoder written in C.
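As the sketch promised above, recent Sphinx4 releases expose a high-level Java API alongside the XML configuration used in this work (see Sect. 6.5). The model and file paths below are hypothetical placeholders, not the actual paths of the Amazigh models built here.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscribeFile {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Hypothetical acoustic model, dictionary, and language model paths.
        configuration.setAcousticModelPath("file:models/amazigh/acoustic");
        configuration.setDictionaryPath("file:models/amazigh/tiftotal.dic");
        configuration.setLanguageModelPath("file:models/amazigh/tiftotal.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        try (InputStream stream = new FileInputStream("test.wav")) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            // Print the decoded hypothesis for each recognized utterance.
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```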

5.1 Sphinx Train

SphinxTrain [10] is the tool created by CMU for the development of acoustic models. It is a set of programs and documentation for building acoustic models for any language. Using SphinxTrain requires the installation of the following libraries and tools:

  • ActivePerl: used to run the SphinxTrain scripts, providing a Unix-like environment on Windows.

  • Microsoft Visual Studio: used to compile the C sources and produce the executables.

5.2 Architecture

The high-level architecture of Sphinx4 is relatively straightforward. As shown in Fig. 3, it consists of the front end, the decoder, a knowledge base, and the application [4].

Fig. 3. High-level architecture of Sphinx-4.

Front end: responsible for gathering, annotating, and processing the input data; it also extracts the features that the decoder reads. The annotations provided by the front end include the beginning and end of a data segment. Operations performed by the front end include pre-emphasis, noise cancellation, automatic gain control, end-pointing, Fourier analysis, mel-spectrum filtering, cepstral extraction, etc.

Knowledge base: provides the information the decoder needs to do its job. This information includes the acoustic model and the language model. The knowledge base can also receive feedback from the decoder, permitting it to modify itself dynamically based on successive search results. The modifications can include switching acoustic and/or language models as well as updating parameters, such as mean and variance transformations, for the acoustic models.

Decoder: performs the bulk of the work. It reads features from the front end, couples them with data from the knowledge base and feedback from the application, and performs a search to determine the most likely word sequences that could be represented by the series of features. The term "search space" describes the set of candidate word sequences and is dynamically updated by the decoder during the decoding process.

Application: may receive events from the decoder while the decoder is working on a search. These events allow the application to monitor the decoding progress, but also to affect the decoding process before it completes. Furthermore, the application can update the knowledge base at any time.

6 Experiments and Results

This section describes our experience in creating and developing an ASR system for the Amazigh language. The acoustic model is trained with SphinxTrain [4,5,6,7,8].

6.1 Corpus

Developing an ASR system for a new language such as Amazigh requires gathering a large corpus, given the statistical nature of the models (HMMs) generally used in automatic speech recognition. This is a very tedious task when no corpus exists, since the necessary resources (speech signal, lexicon, textual corpus, etc.) must then be collected from scratch. Our corpus covers the 33 letters of the Amazigh alphabet. Nine Moroccan speakers were invited to pronounce each letter ten times, so the corpus comprises ten repetitions of each letter by each speaker, i.e., 2970 audio files (33 letters × 10 repetitions × 9 speakers). The test database contains 330 audio files [4,5,6,7,8,9,10,11] (Table 3).

Table 3. Recording parameter used for the Preparation of the corpus.
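For SphinxTrain, such a corpus is typically described by a file-IDs list and a transcription file. The excerpt below is a hypothetical illustration of these two formats; the file names and utterance IDs are ours, not those of the actual corpus.

```
# tiftotal_train.fileids -- one audio file per line, without extension
speaker01/ya_01
speaker01/yab_01

# tiftotal_train.transcription -- one utterance per line, with its ID
<s> YA </s> (ya_01)
<s> YAB </s> (yab_01)
```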

6.2 Dictionary

A file with the .dic extension specifies the correspondence between the words of the transcription file and the phonemes used in the file with the .phone extension. The transcription uses the Latin script (Figs. 4 and 5).

Fig. 4. Extract of the file tiftotal.dic in the tiftotal application.

Fig. 5. Extract of the file tiftotal.phone, written using the Latin script.
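To illustrate the two formats (the entries below are hypothetical and do not reproduce the actual tiftotal files), the dictionary maps each word to its phoneme sequence, while the phone file lists one phoneme per line, including the conventional silence marker:

```
# tiftotal.dic (hypothetical excerpt)
YA     Y A
YAB    Y A B

# tiftotal.phone (hypothetical excerpt)
A
B
Y
SIL
```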

6.3 Language Model

The language model (statistical language model or grammar model) defines which words and word sequences may occur in an application. Each word in the language model must appear in the pronunciation dictionary.

There are several types of models for describing the language to be recognized: keyword lists, grammars, statistical language models, and phonetic statistical language models. The choice of a language model depends on the application [12] (Fig. 6).

Fig. 6. Steps to build a language model in ARPA n-gram format with CMUCLMTK.
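The resulting model is a plain-text ARPA file listing the n-grams with their base-10 log probabilities (and back-off weights). The fragment below is a hypothetical illustration for a small alphabet vocabulary; the probabilities are made up for the example.

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.5051 </s>
-1.5051 <s>  -0.3010
-0.7782 YA   -0.3010
-0.7782 YAB  -0.3010

\2-grams:
-0.3010 <s> YA
-0.3010 YA YAB

\end\
```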

6.4 Acoustic Model

In the context of Markovian ASR, the acoustic model is generally an HMM, typically a three-state left-to-right HMM called a Bakis model, with each state associated with a phoneme. For example, the acoustic model for the word YAB, which transcribes a letter of the alphabet and contains the phonemes Y, A, and B, can be represented by the HMM of Fig. 2.

Once a corpus is available, we can move on to the stage of creating the acoustic model. This is a laborious and difficult step, given the scarcity of relevant documentation on the steps of creating an acoustic model, even though CMU Sphinx has a relatively large community [13].

6.5 Results

An automatic speech recognition system such as Sphinx 4 uses two language-dependent elements: the acoustic model and the language model. In our application, we modified these two models as described previously.

Sphinx 4 must be configured using an XML file. The choice of algorithms, the extraction and comparison of feature vectors, and the other aspects important to the creation of an ASR system can thus be customized as required by the application.

The word error rate (WER) [14] has become the standard measure for evaluating the performance of speech recognition systems [12]. Given an original text and a recognized text of N words, let I be the number of inserted words, D the number of deleted words, and S the number of substituted words. The word error rate is then:

$$ \mathrm{WER} = \frac{I + D + S}{N} $$
(5)

WER is generally expressed as a percentage.
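As a sketch of how this measure can be computed (our illustration, not the evaluation code used in this work), the word-level edit distance between the reference and the hypothesis yields the minimal total of substitutions, insertions, and deletions:

```java
public class Wer {

    // WER as a percentage: the minimal number of word substitutions,
    // insertions, and deletions turning the reference into the
    // hypothesis, divided by the number of reference words N.
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int del = d[i - 1][j] + 1;
                int ins = d[i][j - 1] + 1;
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return 100.0 * d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // One deletion against a three-word reference: WER = 33.3%.
        System.out.println(wer("YA YAB YAD", "YA YAD"));
    }
}
```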

Number of States per HMM

To test the effect of the number of states per HMM on the quality of the acoustic models, the system was trained using two configurations, three or five states per HMM, knowing that the Sphinx-4 system accepts only these two configurations. Both models were then evaluated with the word error rate, WER (Table 4).

Table 4. Number of states per HMM.

These results show that the best performance was recorded with three states per HMM.

Number of Gaussian Probability Distributions

To test the effect of the number of Gaussian probability distributions on system performance, the system was trained and tested for Gaussian counts ranging from 1 to 256, for both the 3-state and 5-state HMM configurations (Tables 5 and 6).

Table 5. Number of Gaussian probability distributions with 3 states per HMM.
Table 6. Number of Gaussian probability distributions with 5 states per HMM.

The system obtained the best results when trained using 2 Gaussian mixtures and 3 states per HMM (Fig. 7).

Fig. 7. Evolution of WER as a function of the number of Gaussians, for both 3 and 5 states per HMM.

The presented work has been compared with similar existing works. El Ghazi et al. [7] presented a system for automatic speech recognition of Amazigh, using hidden Markov models to model the phonetic units corresponding to words taken from the training base. The database used for that work contains 2000 words, and the test database contains 330 audio files. The system gives a good performance of 90%, but the design is speaker-specific and uses a very small vocabulary. Satori et al. [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] proposed a spoken Arabic recognition system in which Arabic alphabets were investigated to form the ten Arabic digits (from zero to nine). The proposed system consists of two steps: Mel-frequency cepstral coefficient (MFCC) feature extraction, followed by classification and recognition conducted by CMU Sphinx4, a speaker-independent system based on hidden Markov models. The mean performance over three tests was between 83.33% and 96.67%. Although the system performance is good, the vocabulary size is small.

Our work is in line with these results from the literature: the system is highly efficient and produces 88% accuracy, given that the vocabulary size is relatively small.

7 Conclusion

In this paper, an automatic speech recognition system for the Amazigh language was developed. This system is based on the open-source CMU Sphinx-4 from Carnegie Mellon University. The important components of the ASR system are feature extraction, acoustic modeling, and pronunciation modeling using HMMs. The database for this research work contains 2970 words, and the system produces 88% accuracy.

The originality of our work lies in addressing the Amazigh language; this is considered among the first works treating this language using the open-source CMU Sphinx.

However, we consider this a preliminary work that will subsequently enable us to fulfill our basic objective: a speaker-independent recognition system for the Amazigh language. In the future, we plan to extend the application to a larger Amazigh vocabulary.