1 Introduction

Speech is the main means of human communication and the most natural way to exchange information. Therefore, many studies have been carried out over the past decades to design an ideal automatic speech recognition (ASR) system capable of understanding speech and sounds in real time under different conditions. However, this capability remains a demanding requirement for newly developed speech systems. The significant variability of speech cues, such as the absence of distinct boundaries between words or phonemes, and unwanted noise caused by the variability of speakers and their surroundings, such as speaking rate, speaking style, and accent, render this task more challenging [1, 2]. In addition, the degradation of speech recognition performance over IP networks has been one of the main challenges faced by network speech recognition (NSR) researchers. In NSR, a client–server architecture is used in which the recognizer is placed on the server side and the client relies on a standard speech encoder. The speech signal is encoded by a conventional speech codec and transmitted to the server for the decoding, feature extraction, and recognition phases [3]. The network dependency, coding, and transmission of data degrade the recognition performance because of data compression, transmission errors, or transcoding [4]. Table 1 presents automatic speech recognition performance based on VoIP codecs. Table 2 presents automatic speech recognition systems based on audio codecs and interactive voice response (IVR) technology.

Table 1 Automatic speech recognition performance based on VoIP codecs
Table 2 IVR-ASR systems performance based on VoIP audio codecs

In the following, we review some ASR systems based on hidden Markov model (HMM) and Gaussian mixture model (GMM) approaches.

Das et al. [5] designed a speech information system based on HMMs and mel-frequency cepstral coefficients (MFCCs); their best result was approximately 90%. Satori and El Haoussi [6] implemented an Amazigh speech recognition system covering digits and letters using the CMU Sphinx tools and achieved a performance of 92.8%. The authors in [7] presented an automatic speech recognition system for the Odia language built with the Kaldi toolkit; monophone and triphone models were investigated, and the Odia acoustic modeling was performed using HMMs and GMMs.

Voice signal quality plays a major role in improving the performance of a speech recognition system. In a network ASR system, the speech signal is encoded by a VoIP audio codec and then transmitted to the server for recognition. This process degrades the quality of the received speech, which in turn affects system performance. In this paper, we implement a network ASR system based on IVR and ASR technologies, in which a degradation of the recognition rates is observed once the IVR stage is integrated into the network ASR process. For this reason, we evaluate the performance of the VoIP-ASR system by varying its parameters, namely the audio codec, the number of HMM states, and the number of Gaussian mixtures, to determine the values that yield the best performance.

The remainder of the paper is organized as follows: Section 2 presents the IVR service and ASR technology. Section 3 presents the system preparation. Section 4 presents the system architecture. Section 5 presents the conducted experiments. Section 6 presents the results. Section 7 presents the comparisons. Section 8 concludes the paper.

2 IVR service and ASR technology

2.1 Audio codecs

Codecs are techniques for encoding or compressing analog voice signals into digital bitstreams and decoding them back into analog voice signals. Codecs differ in complexity, required bandwidth, and voice quality, where better voice quality requires more bandwidth. One problem that emerges in the delivery of high-quality speech is network performance. In this study, our IVR implementation is based on the SIP signaling protocol [8] and the RTP protocol [9] with the G.711 and GSM codecs, which are employed as VoIP parameters [10].

2.1.1 G.711 codec

G.711 [11] is a pulse code modulation (PCM) scheme that generates one 8-bit value every 125 μs, resulting in a 64 kb/s bitstream. Speech samples are encoded as 8 bits after logarithmic scaling. This audio codec has two variants: μ-law, used in North America and Japan, and A-law, used in Europe and the rest of the world. The A-law variant converts 13-bit linear PCM samples into 8-bit compressed PCM samples, with the decoder performing the inverse conversion, while the μ-law variant converts 14-bit linear PCM samples into 8-bit compressed PCM samples.
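To make the logarithmic companding concrete, the following Python sketch applies the continuous μ-law characteristic (μ = 255) followed by 8-bit quantization and expansion. It is only an illustration: the standardized G.711 coder uses a segmented, piecewise-linear approximation of this curve rather than the exact logarithm, and the sample values below are arbitrary.

```python
import numpy as np

MU = 255.0  # companding constant of the u-law variant of G.711


def ulaw_compress(x):
    """Map linear samples in [-1, 1] onto the u-law companding curve."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)


def ulaw_expand(y):
    """Invert the companding curve back to linear samples."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU


# Arbitrary linear samples, companded and then quantized to 8-bit code words,
# which is the step that fixes the 64 kb/s rate (8 bits x 8000 samples/s).
x = np.array([-0.9, -0.1, 0.0, 0.05, 0.8])
codes = np.round((ulaw_compress(x) + 1.0) / 2.0 * 255.0)
x_hat = ulaw_expand(codes / 255.0 * 2.0 - 1.0)  # reconstructed (lossy) samples
```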

2.1.2 GSM codec

The ETSI GSM 06.10 Full Rate (FR) codec is the first digital speech-coding standard used in the Global System for Mobile Communications digital mobile phone system, operating at a bitrate of 13 kb/s. This audio codec was introduced in 1987 and is based on the RPE-LTP (regular pulse excitation–long term prediction) linear predictive coding principle [12].

2.2 Automatic speech recognition

Automatic speech recognition (ASR) is defined as the independent, computer-driven transcription of spoken language into readable text [6]. Figure 1 shows a typical ASR architecture. Recently, researchers in our laboratory have targeted applications of automatic speech recognition for the Moroccan Amazigh language [13,14,15,16,17,18,19].

Fig. 1 ASR system architecture

2.3 MFCC feature extraction technique

The extraction of mel-frequency cepstral coefficients (MFCCs) [20] is based on a frame-by-frame analysis of the input speech, in which the signal is segmented into a sequence of frames. Each frame is transformed with a fast Fourier transform, and the resulting spectral parameters are warped onto the perceptual mel scale and then decorrelated. The output is a sequence of feature vectors describing the logarithmically compressed amplitude and simplified frequency content of each frame. Figure 2 details the principle of this cepstral analysis.
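As a minimal illustration of this front end, the sketch below computes 13 MFCCs per frame with the librosa library. This is not the front end used in our experiments (which rely on the CMU Sphinx tools); the file name, 25 ms window, and 10 ms frame shift are assumptions chosen to match common telephony settings.

```python
import librosa

# Load one utterance at the 8 kHz telephony sampling rate ("digit.wav" is a placeholder).
y, sr = librosa.load("digit.wav", sr=8000)

# 13 cepstral coefficients per frame; 200 samples = 25 ms window, 80 samples = 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=200, hop_length=80)
print(mfcc.shape)  # (13, number_of_frames)
```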

Fig. 2 The MFCC process [12]

3 System preparation

3.1 Database preparation

The utterances used to evaluate and compare the system performance were collected from 24 native Amazigh speakers aged between 14 and 40 years. The speech data were recorded in wave format at sampling rates of 8 and 16 kHz. The corpus consists of the 10 Amazigh spoken digits (0–9). Each digit was pronounced 10 times, and each repetition was stored in a separate data file containing one pronounced word. The selected digits and their transcription are shown in Table 3. Further technical details about our system are given in Table 4.

Table 3 Ten Amazigh digits with their English transcription
Table 4 System parameters

3.2 Files preparation

To prepare our acoustic model, we organized a set of input data and files processed with the SphinxTrain tool. The following list presents the input files and data.

  • Audio wave dataset

  • List of fillers

  • List of files for training and testing

  • Transcription for training and testing

  • Dictionary that determines the pronunciation of selected digits (Table 5)

  • Language model that gives a representation of the occurrence probability for each digit

Table 5 Dictionary file

The phonetic dictionary contains all expected digits together with possible variants of their pronunciation. Careful preparation of the input data and files plays a crucial role in designing a speech recognition system.

3.3 ASR parametrization

To evaluate the ASR system performance, several ASR configurations were prepared using different numbers of HMM states and GMMs. We prepared six acoustic models by setting the number of HMM states to 3 or 5 and the number of Gaussian densities to 8, 16, or 32. Table 6 presents the acoustic models used in our work.

Table 6 Prepared acoustic systems
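For clarity, the six configurations of Table 6 can be enumerated programmatically, as in the short Python sketch below. The naming scheme and the mapping to SphinxTrain configuration variables are assumptions given for illustration only.

```python
from itertools import product

HMM_STATES = [3, 5]          # states per phone HMM
GMM_DENSITIES = [8, 16, 32]  # Gaussian densities per state

for states, densities in product(HMM_STATES, GMM_DENSITIES):
    # Assumed to correspond to $CFG_STATESPERHMM and $CFG_FINAL_NUM_DENSITIES
    # in SphinxTrain's sphinx_train.cfg, which would be edited before each run.
    print(f"Amsystem{states}-{densities}: states={states}, densities={densities}")
```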

4 Telephony spoken system architecture

The telephony spoken system is an interactive system in which a dialog between the user and the system is realized. As shown in Fig. 3, the main modules of our telephony spoken system architecture are the IVR and the ASR.

Fig. 3 Model for establishing speech recognition via the Asterisk server

In the IVR part, the system receives voice input when the user interacts with the server through voice commands. The codec converts the analog waveforms into digital signals for transmission as IP packets over the network and then converts the digital signals back into analog waveforms. In our study, we focus on voice traffic coding using the G.711u and GSM speech codecs.

In the ASR part, the Amazigh speech recognition system receives the transferred voice data from the Asterisk server. The received data are modeled as a sequence of phonemes, and each phoneme is modeled as a sequence of HMM states. We used 3-state and 5-state HMM architectures for each phoneme, with one emitting state (or three emitting states) plus two non-emitting entry and exit states that join the HMM units in the ASR engine. Each emitting state contains Gaussian mixtures trained on 13-dimensional MFCC coefficients together with their delta and delta-delta vectors extracted from the signal. The feature distribution of each phone was modeled with 8, 16, or 32 GMMs. Table 7 presents the feature extraction parameterization.

Table 7 Feature extraction parameterization
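To show how the 13 MFCCs and their delta and delta-delta vectors form the final 39-dimensional observation, a minimal sketch follows; as before, librosa is used only for illustration and the file name is a placeholder.

```python
import numpy as np
import librosa

y, sr = librosa.load("digit.wav", sr=8000)                    # placeholder utterance
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=200, hop_length=80)          # static coefficients

delta = librosa.feature.delta(mfcc, order=1)                   # first-order derivatives
delta2 = librosa.feature.delta(mfcc, order=2)                  # second-order derivatives

features = np.vstack([mfcc, delta, delta2])                    # 39-dimensional vectors
print(features.shape[0])                                       # 39
```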

Our principal aim is to find a balance between an acceptable recognition rate and the choice of optimal parameters (HMMs, GMMs, and codecs). Figure 4 shows our process of speech recognition.

Fig. 4 Speech recognition process

5 Experiments

All phases of the system (training and recognition) are based on the CMU Sphinx toolkit, which relies on the HMM-GMM combination.

Our approach for modeling the encoded Amazigh sounds consisted of generating and training the acoustic models on unencoded voice and testing the system on encoded voice while varying the audio codec, the number of HMM states, and the number of GMMs.
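As an illustration of the recognition step, the minimal sketch below decodes one coded-decoded test utterance with the pocketsphinx Python bindings. The model, dictionary, language model, and audio file names are placeholders, and the exact binding API differs between pocketsphinx releases; the experiments themselves were run with the CMU Sphinx tools.

```python
from pocketsphinx import Decoder

# Placeholder paths for a trained Amazigh-digit acoustic model and resources.
config = Decoder.default_config()
config.set_string("-hmm", "models/amsystem3-16")   # acoustic model directory
config.set_string("-lm", "models/digits.lm")       # language model
config.set_string("-dict", "models/digits.dic")    # pronunciation dictionary
decoder = Decoder(config)

decoder.start_utt()
with open("test/coded_digit.raw", "rb") as f:       # 16-bit mono PCM after the codec round trip
    decoder.process_raw(f.read(), False, True)      # no_search=False, full_utt=True
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp else "<no hypothesis>")     # recognized digit
```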

Seventy percent of the database (collected audio) was used for training; the split was made by speaker to ensure speaker independence as well as the reliability and validity of our system. In the recognition phase, we tested the system with the remaining 30% of the database (data coded with the G.711 and GSM codecs); a sketch of this split is given after the setup list below. The experimental setup is as follows.

  • Software: our setup is based on the open-source Asterisk 1.6 server; the Ekiga client is used in the IVR part, the CMU Sphinx tools are used in the ASR part, and the operating system is Ubuntu 14.04 LTS.

  • Hardware: the hardware consists of a laptop with an Intel Core i3 CPU running at 2.4 GHz and 4 GB of RAM.
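The following Python sketch illustrates one way the 70/30 speaker-independent split described above could be realized. The speaker identifiers and the file-naming convention are assumptions introduced only for this illustration.

```python
import random

# 24 native speakers, as in the collected corpus; the IDs are hypothetical.
speakers = [f"spk{i:02d}" for i in range(1, 25)]
random.seed(0)
random.shuffle(speakers)

n_train = round(0.7 * len(speakers))            # 70% of speakers for training
train_speakers = set(speakers[:n_train])
test_speakers = set(speakers[n_train:])         # remaining speakers (about 30%) for testing


def split_of(wav_name: str) -> str:
    """Assign a file such as 'spk05_digit3_take7.wav' to a split by its speaker prefix."""
    speaker = wav_name.split("_")[0]
    return "train" if speaker in train_speakers else "test"
```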

6 Experimental results

This section presents the results of the proposed systems.

6.1 Case 1: Testing the unencoded data with unencoded trained models

Table 8 shows the accuracies of the system trained and tested with unencoded voice, using three and five HMM states combined with 8, 16, and 32 Gaussian mixture densities. The best result of 91.57% was obtained with the 3–16 HMM–GMM configuration, while the lowest result of 85.86% was obtained with the 5–32 configuration.

Table 8 System recognition rates based on unencoded data

Considering the individual word performance of the IVR-ASR system, the highest recognition rate is 92.86%, obtained for the words “krad,” “smmus,” “sdes,” and “tza” with Amsystem3-16. Based on this finding, we suggest that the number of syllables probably plays a positive role in increasing the accuracy. In contrast, the lowest-performing word with the Amsystem5-32 model is “yen.” A comparison of the results indicates that our findings are in accordance with those of [6].

6.2 Case 2: Testing the coded-decoded data (G.711 codec) with trained models

In this case, we keep the same trained acoustic models but replace the test corpus with an encoded audio test database. With 3 HMM states, the obtained results are 89.71%, 88.71%, and 87.86% for 8, 16, and 32 Gaussian mixtures, respectively. With 5 HMM states, the correct recognition rates were 88.28%, 87.86%, and 85.86% for 8, 16, and 32 GMMs, respectively. The highest recognition rate of 89.71% was achieved by the 3–8 HMM–GMM combination (Table 9). The experimental results show that there is a difference in speech recognition performance between the two categories (unencoded and G.711). The lowest recorded recognition rate is 85.86%, obtained by testing the system with Amsystem 5–32.

Table 9 System recognition rates based on the G711 codec

The analysis of the individual word performance showed that the best performance for the words “amya” and “tza” is achieved by the 3HMMs-8GMMs, 3HMMs-16GMMs, and 5HMMs-16GMMs combinations.

For the “krad” and “smmus” digits, the best accuracy is obtained by 3HMMs-8GMMs, 3HMMs-16GMMs, and 5HMMs-8GMMs.

The comparison of the ASR parameters between the first and second cases shows that for the unencoded voice, the best results are obtained with the Amsystem 3–16 trained model, whereas for the G.711-coded data, the best results are obtained with the Amsystem 3–8 trained model.

6.3 Case 3: Testing the coded-decoded data (GSM codec) with trained models

For the GSM case (Table 10), the obtained accuracy is lower than that of G.711. When the models were trained with unencoded speech and tested with GSM-decoded speech, the best and lowest recognition rates were 88.43% for Amsystem 3–8 and 84.57% for Amsystem 5–32, respectively. Considering the individual word performance, our findings show that the highest recognition rate is 91.40%, obtained for “krad” with the Amsystem 3–16 model. Table 10 shows the measured recognition rates for the GSM codec. The GSM-decoded signal causes a degradation of the speech recognition rates due to the distortions introduced into the cepstral representations.

Table 10 System recognition rates based on the GSM codec

7 Best-case comparison

In this section, we present the confusion matrices of our best accuracies, which were obtained with unencoded and G.711-decoded speech. The testing set includes 700 utterances from seven speakers. Table 11 compares the performance of our proposed method with some existing works in the same field.

Table 11 Comparison of our proposed method with some existing works

Table 12 shows the confusion matrix of the system based on the unencoded speech. The overall accuracy in this experiment is 91.57%. Table 13 presents the confusion matrix for the speech encoded with the G.711 audio codec. The overall performance with the G.711 codec was 89.71%, which is close to the overall performance achieved with the unencoded speech. However, the confusion matrices of the two experiments show important differences.

Table 12 Confusion matrix for the best recognition rates (unencoded voice)
Table 13 Confusion matrix for the best audio codec performance (G711)
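As a complement to Tables 12 and 13, the sketch below shows how such a digit confusion matrix and the overall accuracy can be computed from paired reference and hypothesis lists. The digit labels follow the words cited in the text and standard Amazigh digit names; the authoritative list is the one given in Table 3.

```python
from collections import Counter

# Assumed label set for the ten Amazigh digits 0-9 (see Table 3 for the reference list).
DIGITS = ["amya", "yen", "sin", "krad", "koz", "smmus", "sdes", "sa", "tam", "tza"]


def confusion_matrix(references, hypotheses):
    """Count (spoken digit, recognized digit) pairs for isolated-word test results."""
    counts = Counter(zip(references, hypotheses))
    return [[counts[(ref, hyp)] for hyp in DIGITS] for ref in DIGITS]


def accuracy(references, hypotheses):
    """Percentage of test utterances whose recognized digit matches the spoken one."""
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return 100.0 * correct / len(references)
```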

The analysis of the substituted words showed the following findings:

  • For the unencoded voice, the exchange errors involve two symmetrical substitutions that can be schematically represented [21] as SA ~ TZA, where the inclusion of SA would bias the matrix toward symmetry.

  • For the decoded voice, the number of substituted words increases, especially for the digits YEN, KOZ, SDES, and SA, all of which are monosyllabic.

Generally, the numbers of omitted and substituted words increase for the encoded voice. This behavior may be attributed to the ASR system being affected when the actual pronunciation differs from what the recognizer expects, or to the deviation of the pronounced consonants over the telephone channel, which is in accordance with [22].

8 Conclusions

In this paper, we have evaluated the performance of an interactive speech recognition system based on the ASR and IVR technologies. We searched for a balance between an acceptable recognition rate and the choice of optimal parameters (HMM states, GMMs, and codecs). The best system performance with respect to the IVR parameterization is observed for the G.711 codec. On the other hand, the best ASR parameterization for the combined system is three HMM states with 8 to 16 GMMs. Moreover, we observed that the number of substituted words increases for monosyllabic words in the case of encoded speech. Despite these results, certain limitations, such as background noise and speaking accent, influence the recognition rate of our proposed system.

In our future work, we will exploit the deep learning approach to improve the performance of the IVR-ASR system with a large voice database.