1 Introduction

Automatic Speech Recognition (ASR) is the process of taking human speech as input and converting it to the corresponding sequence of words by means of an algorithm (Al-Zabibi 1990). ASR has attracted a huge amount of interest in recent decades, mainly due to the wide range of applications involving such technology. ASR can be used in industrial and civil areas such as hands-free operation, mobile voice applications, human-computer interaction, automatic translation and automated telephone services, and it can help handicapped people to control equipment or navigate the Internet. It is a promising technology that makes life easier (Haton et al. 2006). Recently, ASR systems have played important roles in several areas, and different ASR systems are found in the literature. ASR systems depend strongly on: the language spoken (English, French), the way speech is recognized (isolated words, continuous speech) and the speaker characteristics (speaker dependent, speaker independent). These categories of systems are used depending on the type of user application (Huang et al. 2001; Alotaibi and Shahshavari 1998; Satori et al. 2007, 2009; Abushariah et al. 2012).

Spoken alphabets and digits in different languages have been targeted by ASR researchers. A speaker-independent spoken English alphabet recognition system was developed by Cole et al. (1990). That system was trained on one token of each letter from 120 speakers. Its performance was 95 % when tested on a new set of 30 speakers, and increased to 96 % when tested on a second token of each letter from the original 120 speakers. An artificial neural network based speech recognition system was designed and tested on automatic Arabic digit recognition by Ajami Alotaibi (2005). The system was an isolated-word speech recognizer, implemented both in multi-speaker mode (i.e., the same set of speakers was used in the training and testing phases) and in speaker-independent mode (i.e., the speakers used for training differed from those used for testing). This recognition system achieved 99.5 % correct digit recognition in multi-speaker mode and 94.5 % in speaker-independent mode for clean speech. The Arabic speech recognition system developed by Hyassat and Zitar (2006) used the CMU Sphinx4 engine based on HMMs and obtained a word recognition rate of 92.21 % with about 35 min of training and 7 min of testing speech data. Their system was trained using different Gaussian mixture models, and the best performance was obtained with eight Gaussians. Recently, Silva et al. (2012) investigated digit recognition in Portuguese using Line Spectral Frequencies (LSF). They demonstrated that LSF gives better results than those obtained using Mel-Frequency Cepstral Coefficients (MFCC).

Developing ASR technology is an expensive process that requires a considerable amount of resources. That is why only a small fraction of the world's languages can benefit from this kind of technology and its related tools (Le and Besacier 2009). The Amazigh language is considered an African resource-poor, or less-resourced, language (Boukous 1995; Greenberg 1966). To the best of our knowledge, there are few speech recognition research works on less-resourced languages such as Amazigh. In this paper we describe our experience in designing an Amazigh speech recognition system based on HMMs.

The paper is organized as follows: Sect. 2 presents a brief description of the Amazigh language. In Sect. 3, we describe the Amazigh speech recognition system and our investigations to adapt the system to the Amazigh language. Section 4 presents the experimental results. Finally, in Sect. 5, we provide our conclusions and future directions.

2 Amazigh language

The Amazigh language, known as Berber or Tamazight, is a branch of the Hamito-Semitic (Afro-Asiatic) languages. It is spoken in a vast geographical area of North Africa: Amazigh covers the northern part of Africa, which extends from the Red Sea to the Canary Isles and from the Niger and Mali (Tuareg) in the Sahara to the Mediterranean Sea (Boukous 1995; Greenberg 1966; Amazigh Languages 2013; Galand 1988).

In Morocco, the Amazigh language is spoken by some 28 % of the population, grouped into three main regional varieties depending on the area and the communities: Tarifit, spoken in northern Morocco; Tamazight, in central and south-eastern Morocco; and Tachelhit, spoken in southern Morocco (Ouakrim 1995; Chaker 1984).

Since 2003, Tifinaghe-IRCAM has been the official graphic system for writing Amazigh in Morocco. This system contains the following (Outahajala and Zenkouar 2011; Boukous 2009; Fadoua and Siham 2012):

  • 27 consonants including: the labials (ⴼ, ⴱ, ⵎ), the dentals (ⵜ, ⴷ, ⵟ, ⴹ, ⵏ, ⵔ, ⵕ, ⵍ), the alveolars (ⵙ, ⵣ, ⵚ, ⵥ), the palatals (ⵛ, ⵊ), the velars (ⴽ, ⴳ), the labiovelars (ⴽⵯ, ⴳⵯ), the uvulars (ⵇ, ⵅ, ⵖ), the pharyngeals (ⵃ, ⵄ) and the laryngeal (ⵀ);
  • 2 semi-consonants: ⵢ and ⵡ;
  • 4 vowels: three full vowels ⴰ, ⵉ and ⵓ, and the neutral vowel ⴻ (or schwa), which has a rather special status in Amazigh phonology.

3 Amazigh speech recognition system

This section describes our experience in creating and developing an Amazigh voice recognition system using the CMU Sphinx tools (CMU Sphinx Open Source Speech Recognition Engines 2013). Figure 1 illustrates the main components that are usually found in a typical ASR system.

3.1 System overview

Fig. 1 Block diagram of the ASR system

All of our experiments, for both training and recognition, were based on the CMU Sphinx system, which is an HMM-based, speaker-independent, continuous-speech recognition system capable of handling large vocabularies (CMU Sphinx Open Source Speech Recognition Engines 2013; Huang 1989; Lee 1989). Our approach for modeling Amazigh sounds in the CMU Sphinx system consisted of generating and training acoustic and language models with Amazigh speech data (a minimal decoding sketch is given after Table 1). The dictionary adopted in the experiments was made up of 43 Amazigh words and their transcriptions. The allowed syllables in the Amazigh language are V, CV, VC, CVC, C, CC and CCC, where V indicates a vowel and C a consonant (Ridouane 2003). Table 1 presents the first ten Amazigh digits and 33 alphabets along with how to pronounce them, their syllable types, their transcriptions in English, Arabic and Amazigh scripts, and the number of syllables in every spoken word.

Table 1 The first ten digits and 33 Amazigh alphabets with their syllables and their transcriptions in English, Arabic and Amazigh letters
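
For illustration only, the sketch below shows how the three trained resources (acoustic model, language model, pronunciation dictionary) fit together at decoding time. It assumes the classic pocketsphinx Python bindings (pre-5.x API); the model paths and the WAV file name are hypothetical placeholders, not the files produced in this work.

```python
# Minimal decoding sketch, assuming the classic pocketsphinx-python API.
# All paths below are hypothetical placeholders.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/amazigh_hmm')   # trained acoustic model directory
config.set_string('-lm', 'model/amazigh.lm')     # n-gram language model
config.set_string('-dict', 'model/amazigh.dic')  # pronunciation dictionary

decoder = Decoder(config)
decoder.start_utt()
with open('test_word.wav', 'rb') as f:
    f.read(44)  # skip the RIFF/WAV header (16 kHz, 16-bit mono assumed)
    decoder.process_raw(f.read(), False, True)  # feed the full utterance at once
decoder.end_utt()

if decoder.hyp() is not None:
    print('Recognized word:', decoder.hyp().hypstr)
```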

3.2 Speech database preparation

The database Amazigh_Alphadigits was created in the framework of this work; it contains a corpus of speech and its transcriptions from 60 Berber Moroccan speakers. The corpus consists of the first ten Amazigh digits (0–9) and the 33 Amazigh alphabets collected from 60 Moroccan native speakers of Tarifit Berber (30 males and 30 females) aged between 12 and 45 years old. The audio files were generated by speakers pronouncing the digits in numerical order and the alphabets in alphabetical order, which makes the task of labeling the speech signals after segmentation easy. The sampling rate of the recordings is 16 kHz, with 16-bit resolution. Table 2 gives further technical details of the speech corpus.

Table 2 System parameters

During the recording sessions, speakers were asked to utter the 10 digits and 33 alphabets sequentially, starting with the digits followed by the alphabets. The audio recordings for a single speaker were saved into one “.wav” file, or sometimes up to four “.wav” files, depending on the number of sessions the speaker needed to finish recording, since saving every single utterance separately at recording time would have been too time-consuming.

Hence, the corpus consists of 10 repetitions of every digit and alphabet produced by each speaker, giving a total of 25,800 tokens. During the recording session, the waveform of each utterance was visually inspected to ensure that the entire word was included in the recorded signal (see Fig. 2). The larger “.wav” files were therefore manually segmented into smaller ones, each holding a single recording of a single word, and these “.wav” files were manually classified into the corresponding directories. Wrongly pronounced utterances were discarded and only correct utterances were kept in the database.

Fig. 2 Waveform of a recording session of the first ten Amazigh digits (speaker moel) before manual segmentation into separate single utterances

3.3 Training

Training is the process of learning the acoustic and language models that build the knowledge base used by the system. The knowledge base contains the Acoustic Model, the Language Model and the Pronunciation Dictionary.

3.3.1 Feature extraction

The purpose of this sub-system (see Fig. 1) is to extract the speech features, which play a crucial role in the performance of a speech recognition system.

As seen in Table 2, the parameters used in our system were a 16 kHz sampling rate with 16-bit samples, a 25.6 ms Hamming window with a 10 ms shift between consecutive (overlapping) frames, and Mel-Frequency Cepstral Coefficients (MFCC).
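
Outside Sphinx, this front-end can be sketched in a few lines. The example below mirrors the Table 2 parameters rather than reproducing SphinxTrain's exact implementation; it assumes the python_speech_features package and a placeholder WAV file name.

```python
# MFCC front-end sketch mirroring the parameters of Table 2;
# assumes the python_speech_features package (not used in the paper itself).
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

rate, signal = wavfile.read('utterance.wav')  # placeholder: 16 kHz, 16-bit mono
assert rate == 16000

# 25.6 ms Hamming window, 10 ms frame shift, 13 cepstral coefficients
feats = mfcc(signal, samplerate=rate, winlen=0.0256, winstep=0.010,
             numcep=13, winfunc=np.hamming)
d1 = delta(feats, 2)                   # first derivatives (delta)
d2 = delta(d1, 2)                      # second derivatives (delta-delta)
features = np.hstack([feats, d1, d2])  # 39-dimensional feature vectors
print(features.shape)                  # (number_of_frames, 39)
```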

3.3.2 Acoustic model

The acoustic model provides a mapping between the observed features of basic speech units (phonemes) produced by the front-end of the system and the Hidden Markov Models (HMMs) (Huang et al. 1990). In the HMM-based technique, words in the target vocabulary are modeled as sequences of phonemes, while each phoneme is modeled as a sequence of HMM states. The basic HMM model used in this work is a 5-state architecture for each Amazigh phoneme: three emitting states and two non-emitting entry and exit states, which join the HMM unit models together in the ASR engine, as shown in Fig. 3 (a small sketch of this topology is given after the figure). Each emitting state consists of Gaussian mixtures trained on 13-dimensional MFCC coefficients together with their delta and delta-delta vectors, which are extracted from the signal.

Fig. 3 The 5-state HMM model
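
As an illustration of this topology, a left-to-right transition matrix for one phoneme model might look as follows; the probabilities are arbitrary initial values for the sketch, not the trained ones.

```python
import numpy as np

# 5-state left-to-right HMM for one phoneme: state 0 is the non-emitting
# entry, states 1-3 are the emitting states, state 4 is the non-emitting
# exit that joins this model to the next one. Values are illustrative.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],   # self-loop or move right
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],   # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit: no outgoing transitions here
])
assert np.allclose(A[:4].sum(axis=1), 1.0)  # each active row is a distribution
```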

In this study, the acoustic model was trained using speech signals from the Amazigh_Alphadigits training database. Table 4 describes the database subsets used in the training. Every recording in the training corpus is transformed into a sequence of feature vectors. For each recording, a set of feature files is computed using the front-end provided by SphinxTrain (CMU lmtool 2013); a sketch of the corresponding control files is given after Table 4. In this stage, the engine looks into the phonetic dictionary (see Table 3), which maps every Amazigh word used to a sequence of phonemes. During the training, all 44 Amazigh phonemes are used by means of a phone list (Satori et al. 2009). The Amazigh phonemes are further refined into context-dependent (CD) tri-phones and added to the HMM set.

Table 3 The phonetic dictionary list used in the training
Table 4 Amazigh_Alphadigits database subsets description
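
For concreteness, the sketch below shows the conventional SphinxTrain control-file layout: a .fileids file listing one recording per line and a matching .transcription file giving the spoken word between silence markers. The utterance IDs, paths and words are illustrative, not the actual file names of this work.

```python
# Sketch of the control files SphinxTrain conventionally expects.
# Paths and words (YAN = 'one', SIN = 'two') are illustrative.
utterances = [('speaker01/yan_01', 'YAN'),
              ('speaker01/sin_01', 'SIN')]

with open('amazigh_train.fileids', 'w') as f_ids, \
     open('amazigh_train.transcription', 'w') as f_tr:
    for path, word in utterances:
        f_ids.write(path + '\n')                 # one recording per line
        # word between silence markers, followed by the utterance id
        f_tr.write('<s> %s </s> (%s)\n' % (word, path.split('/')[-1]))
```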

3.3.3 Language model

The n-gram language model used by the ASR system guides the search for the correct word sequence by predicting the likelihood of the nth word using the \(n-1\) preceding words. The common feasible n-gram models are uni-grams, bi-grams and tri-grams. Creating a language model consists of computing the word uni-gram counts, which are then converted into a task vocabulary with word frequencies, and then generating the bi-grams and tri-grams from the training text based on this vocabulary. In this work, the CMU-Cambridge statistical language modeling toolkit is used to generate the language model of our system (CMU lmtool 2013).
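
The counting step can be sketched in a few lines of plain Python; the toy training text below stands in for the corpus transcriptions, and the toolkit additionally applies smoothing, which this sketch omits.

```python
# Toy sketch of n-gram estimation: unigram and bigram maximum-likelihood
# counts, as performed (with smoothing) by the CMU-Cambridge toolkit.
from collections import Counter

text = [['<s>', 'YAN', 'SIN', '</s>'],   # toy transcriptions
        ['<s>', 'SIN', 'YAN', '</s>']]

unigrams, bigrams = Counter(), Counter()
for sent in text:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

# P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
p = {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}
print(p[('<s>', 'YAN')])   # 0.5 on this toy data
```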

3.3.4 Pronunciation dictionary

The pronunciation dictionary, also called the lexicon, contains all the Amazigh words we want to train, followed by their pronunciations. Table 3 shows the phonetic dictionary list used in the training of our system. The alternate transcriptions marked with a parenthesized index such as (2) stand for a second pronunciation. The pronunciation dictionary serves as an intermediary between the Acoustic Model and the Language Model.
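
A Sphinx-style dictionary of this kind is plain text with one pronunciation per line; the sketch below parses such a file, grouping alternate entries like “YAN(2)” under their base word. The sample entries and phone symbols are illustrative, not taken from Table 3.

```python
# Sketch of parsing a Sphinx-style pronunciation dictionary in which
# alternate pronunciations carry a parenthesized index, e.g. YAN(2).
import re
from collections import defaultdict

sample = """\
YAN Y A N
YAN(2) Y AE N
SIN S I N
"""  # illustrative entries, not the paper's actual dictionary

lexicon = defaultdict(list)
for line in sample.splitlines():
    word, phones = line.split(None, 1)
    base = re.sub(r'\(\d+\)$', '', word)   # strip the (2) variant marker
    lexicon[base].append(phones.split())

print(lexicon['YAN'])   # [['Y', 'A', 'N'], ['Y', 'AE', 'N']]
```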

4 Experimental results

In order to evaluate the performance of the system, we performed three experiments (Exp.1–Exp.3) and obtained the results shown in the tables below. The experiments involved training and testing the system using different subsets of the Amazigh_Alphadigits corpus. In the first experiment the system was trained and tested using only the Amazigh digits; the second worked on the Amazigh alphabets; and the third used the combination of the Amazigh digits and alphabets together. Table 4 shows the database subsets used in the three experiments with their descriptions, number of words, number of speakers and number of tokens.

In all experiments the corpus subsets were disjoint and partitioned into 70 % for training and 30 % for testing in order to ensure the speaker-independent aspect (a sketch of such a split is shown below). The system was also trained using different Gaussian mixture models, with 4, 8 and 16 Gaussian mixtures per model. Each of the 43 digits and alphabets was considered separately.
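
The speaker-independent split can be sketched as follows: speakers, not utterances, are partitioned 70/30, so that no test speaker is ever seen in training. The speaker IDs are hypothetical placeholders.

```python
# Sketch of a speaker-disjoint 70/30 split: partition the speaker IDs
# first so that training and test sets share no speaker.
import random

speakers = ['spk%02d' % i for i in range(1, 61)]   # 60 speakers
random.seed(0)
random.shuffle(speakers)

cut = int(0.7 * len(speakers))             # 42 training speakers
train_speakers = set(speakers[:cut])
test_speakers = set(speakers[cut:])        # 18 held-out test speakers
assert train_speakers.isdisjoint(test_speakers)
```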

For the first experiment, Table 5 shows the per-digit accuracy rate of the system in addition to the total accuracy rate over all digits. Based on the testing corpus subset, the system had to recognize 1,800 tokens for all 10 digits. System performances of 91.01, 91.50 and 92.89 % were found using 4, 8 and 16 Gaussian mixture distributions, respectively; thus 16 GMMs obtained the best recognition rate, 92.89 %.

Table 5 Amazigh digits recognition rate for different GMMs when trained with Amazigh_Alphadigits1 and tested with Amazigh_Alphadigits2 (Experiment 1)

Analysis of the digit recognition results shows that the most frequently misrecognized Amazigh digits are SA and KOZ. Both of these words are monosyllabic, and modelling such one-syllable words is more difficult than modelling words of two or more syllables.

In the second experiment, the system had to recognize 5,940 samples covering all 33 Amazigh alphabets. Table 6 shows the accuracy rate of the system. The performances are 87.90, 88.50 and 89.28 % using 4, 8 and 16 GMMs, respectively. In the case of the alphabets, too, the best result was found with 16 GMMs. The most frequently misrecognized Amazigh alphabets are YA and YO.

Table 6 Amazigh alphabets recognition rate for different GMMs when trained with Amazigh_Alphadigits1 and tested with Amazigh_Alphadigits2 (Experiment 2)

In the last experiment, all alphabets and digits were combined in order to use the maximum available dataset. In testing, the system was programmed to test a total of 7,740 alphabet and digit tokens. The system's correct rates in this case were 88.07, 88.88 and 89.07 % for 4, 8 and 16 GMMs, respectively. This confirms our previous observation that 16 GMMs perform better compared to 4 and 8 GMMs. It is also noted that the combined system's performance was better than for the alphabets alone but lower than for the digits alone (Tables 7, 8).

Table 7 Amazigh alphadigits recognition rate for different GMMs when trained with Amazigh_Alphadigits1 and tested with Amazigh_Alphadigits2 (Experiment 3)
Table 8 System overall recognition rate for experiments 1–3

5 Conclusion

In this paper, we investigated a speaker-independent alphadigit ASR system using a database of sounds corresponding to digits and alphabets spoken in the Moroccan Amazigh language. The system was implemented using the Carnegie Mellon University Sphinx tools based on HMMs. This work included creating the speech database Amazigh_Alphadigits, which consists of several subsets covering all digits and alphabets of the Berber language, used in the training and testing phases of the system. Recognition results show that our Amazigh ASR system is speaker independent and that its performance is comparable to the Arabic recognition results reported by Hyassat and Zitar (2006).