Speech Recognition System Based on OLLO French Corpus by Using MFCCs

Youcef, Braham Chaouche; Elemine, Yessaad Mohamed; Islam, Benmaiza; Farid, Bouttout

doi:10.1007/978-3-319-48929-2_25

Braham Chaouche Youcef⁴,
Yessaad Mohamed Elemine⁴,
Benmaiza Islam⁵ &
…
Bouttout Farid⁶

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 411))

Included in the following conference series:

International Conference on Electrical Engineering and Control Applications

1256 Accesses
3 Citations

Abstract

The automatic speech recognition is an area of active study since the early 1950s, and the latest technologies in the field of stochastic processes and the discovery of Hidden Markov Models have given a new direction for this area.

This paper describes an approach of speech recognition by using the Mel-Scale Frequency Cepstral Coefficients (MFCC) from speech recognition experiments done on OLLO French corpus by different features. Our work consists in finding the most appropriate choice for this task using the Mel-Scale Frequency Cepstral Coefficients (MFCC) extracted from speech signal.

To evaluate this analysis, we built an ASR reference system based on the modeling of phonemes by the HMM (Hidden Markov Models) associated with the GMM models (Gaussian Mixture Model) using the HTK tool. The implementation of this system was made using several experiments in order to choose the best parameters used in two main steps to build an ASR system, acoustic analysis and decoding. The experiments show that the choice of 25 Gaussian components provides a good compromise between recognition accuracy and computation time, and we found also that the best parameters leading to good recognition accuracy are MFCC_E_D_A coefficients with 92.5%.

In this paper the quality and testing of speaker recognition and gender recognition system is completed and analysed.

Access provided by CONRICYT-eBooks. Download conference paper PDF

A Review of Various Techniques Related to Feature Extraction and Classification for Speech Signal Analysis

Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers

Article 26 February 2019

Hindi speech recognition in noisy environment using hybrid technique

Article 01 January 2021

Keywords

1 Introduction

Speech is the most natural means of communication between humans. With the development of information technology and the massive use of computer. Man-Machine Dialogue (MMD) using the word as a means of communication has been an increased interest from both the scientific and the industrial community. Automatic speech recognition (ASR), the main component of the MMD system, is a central topic in the broader one of Natural Language Processing (NLP) domain. The general structure of HMM-based speech recognition system [1] consists of two phases: a learning phase whose goal is the construction of acoustic models (HMM models) and recognition phases which the most likely word being imposed. Generally, ASR systems use cepstral parameters called standard parameters as acoustic representation of the speech signal. Cepstral parameters currently the most successful are the MFCCs coefficients (Mel Frequency Cepstral Coefficients). The procedure of calculation of the coefficients is performed on several stages.

The rest of this paper is organized as follows. Section 2 gives a description of Mel-Frequency Cepestral Coefficients (MFCCs). Section 3 introduces a description about OLLO French corpus. The experiments and the results obtained are given in Sect. 4. Concluding remarks are given in Sect. 5.

2 The Mel-Frequency Cepstrum Coefficient (MFCC)

The Mel-frequency Cepstrum Coefficient (MFCC) technique is often used to create the fingerprint of the sound files. The MFCC are based on the known variation of the human ear’s critical bandwidth frequencies with filters spaced linearly at low frequencies and logarithmically at high frequencies used to capture the important characteristics of speech. Studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale [2]. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the Mel scale. The following formula is used to compute the Mels for a particular frequency:

$$ mel(f) = 2595 \times \log_{10} (1 + \frac{f}{700}) $$

(1)

A block diagram of the MFCC processes is shown in Fig. 1. The windowing block minimizes the discontinuities of the signal by narrowing the beginning and end of each frame to zero. The following step consists of applying DFT in order to convert each frame from the time domain to the frequency domain. Then, the signal is passed through the Mel filter bank spectrum to mimic human hearing. In the final step, the Cepstrum, the Mel-spectrum scale is converted back to standard frequency scale. This spectrum provides a good representation of the spectral properties of the signal which is key for representing and recognizing characteristics of the speaker.

3 Coprus

The OLLO French part of the database is a corpus spoken by 10 native French individuals who lived in Belgium. There are six men and four women in different ages [3]. Each person speaks 150 utterances, every utterance is spoken in 6 different style. Each utterance is spoken three times per person.

So each person contains 2700 (150*6*3) files. As we are performing dependant speaker recognition tasks, the set of the learning (L1) contains the first of three each logatome’s repetitions in the corpus. The remaining two repetitions are contained within the set of test (L2).

The allocation of the corpus is used to build our ASR system. In our experiments, the sampling rate is down to 8 kHz.

To validate our reference system for the phonetic transcription of the speech signal, we took the ACC recognition accuracy as an evaluation criterion, calculated by the formula:

$$ Acc = \left( {\frac{N - D - S - I}{N}} \right) \times 100\% $$

(2)

Where:

H::: Number of recognized words;
D::: Number of deleted words;
S::: Number of substituted words;
I::: Number of inserted words;
N::: Total number of words.

4 Experimental Results and Analyse

Experiments included comparative evaluations of the recognition results using the mel–cepstral using OLLO databases (Fig. 2).

We used 25 ms speech window with mel–cepstral features [4], due to specific decomposition structure. We also used the same overlapping rate of speech window with the value of 10 ms. Speech feature vector computation included calculation of log energies in the mel–scaled filterbanks [5].

In order to have an effective system of reference, we conducted an experiment to determine the best number of Gaussian mixture components per state and the best type of acoustic parameters MFCC. In this experiment, we set the number of states to three active states [6], then we took the following types of acoustic parameters: MFCC, MFCC_E, MFCC_D, MFCC_D_A and MFCC_E_D_A. In each type of parameters, we varied the number of Gaussian 1 to 25 to determine the best number of Gaussians to have the best accuracy. The basic system uses MFCC coefficients and energy, and the differential coefficients of the parameters.

The logarithm of the energy of the frame is added to the 12 cepstral coefficients to form a vector of 13 coefficients. Differential coefficients of the first and second order, calculated automatically by the tools of HTK [7], are optionally used with the static coefficients.

We tested the contribution of differential coefficients of the first and second order with 13 initial coefficients, working with vectors of dimensions d = 13 (12 MFCC; E), d = 26 (12 MFCC; E; 12 ∆MFCC; ∆E) and d = 39 (12 MFCC; E; 12 ∆MFCC; ∆E; 12 ∆∆MFCC; ∆∆E). The emission probability of each state of the HMM is represented by a linear combination of Gaussian G with diagonal covariance matrix. All other experimental conditions are those of the already described basic system.

Table 1 shows the accuracy for the five sets of experiments and the number of Gaussian probability density.

Table 1. Comparative accuracy of recognition between static/dynamic coefficients depending on the number of Gaussians.

Full size table

From Table 1, based on the parameterization MFCC_E_D_A typical system gives very good results. The good performance of the system based on MFCC parameters are due to the nature of their products based on human perception models. The results also show that increasing the number of Gaussian enables better modeling of the acoustic space. In the case of a word recognition system, we achieved the best number of Gaussians that allows stability accuracies of recognition (number of Gaussian equal to 25). It is therefore necessary to find a compromise between the recognition accuracy and the number of parameters. We note that beyond the 13th Gaussian, the system performance is not significantly improved (91.31 to 92.50).

Based on these results, we note the following points:

The combination of MFCC_E_D_A type gives the best recognition accuracy with 92.50 %. However, this combination of 39 parameters requires more resources and computing time. By combining WCC_D_A against type 21 parameters (or 18 parameters in the case of level 5) provides a more compact representation relative to that MFCC_E_D_A or MFCC_D_A (36 parameters), requiring less time and computing resources. Thus, the combination WCC_D_A provides a good compromise in terms of accuracy and computational resources.

5 Conclusion

Our study is to apply the Mel-Scale Frequency Cepstral Coefficients (MFCC) in the acoustic analysis of speech signal to an ASR task.

To accomplish this task and to evaluate the acoustic analysis, we built a reference ASR system based on HMM models. Acoustic analysis of this reference system is based on the extraction of MFCC parameters. This system is built under the platform HTK and evaluated on the basis of OLLO database.

The construction of the reference system calls for the type of acoustic parameters and the number of Gaussian components for each active state HMM. To this end, we conducted various experiments to determine these two parameters. The results showed us that the most relevant acoustic parameters are MFCC_E_D_A coefficients.

The results also showed that the choice of 25 Gaussian components provides a good compromise between recognition accuracy and computation time. In the present work, After a comparative study between the acoustic analysis based on the MFCC parameters, we found that the best parameters leading to good recognition accuracy are MFCC_E_D_A coefficients. However, the representation of the latter requires a dimension (39 parameters) larger than that based on the parameters.

References

Haton, J.-P.: Reconnaissance Automatique de la Parole. Paris (2006)
Google Scholar
Patel, K., Prasad, R.K.: Speech recognition and verification using MFCC & VQ. Int. J. Emerg. Sci. Eng. (IJESE) 1(7) (2013). ISSN: 2319–6378
Google Scholar
Huang, L., Zhang, X.: Speaker independent recognition on OLLO French corpus by using different features. In: 2010 First International Conference on Pervasive Computing, Signal Processing and Applications
Google Scholar
Modic, R.: Comparative wavelet and MFCC speech recognition experiments on the Slovenian and English SpeechDat2
Google Scholar
Nguyen, Q.C.: Reconnaissance de la Parole en Langue Vietnamienne. thèse de doctorat, Institut national polytechnique de Grenoble, Juin 2002
Google Scholar
Bakis, R.: Continuous speech recognition via centisecond acoustic states, 91th. Meeting of the Acoust.Soc, avril 1976
Google Scholar
Young, S., et al.: The HTK Book (for HTK Version 3.4), p. 198 (2006)
Google Scholar

Download references

Author information

Authors and Affiliations

LMSE Laboratory, Department of Electronics, University of Mohamed El Bachir El Ibrahimi, 34265, Bordj Bou Arréridj, Algeria
Braham Chaouche Youcef & Yessaad Mohamed Elemine
Laboratory of Spoken Communication and Signal Processing, Faculty of Electronics and Computer Sciences, USTHB, 16000, Algiers, Algeria
Benmaiza Islam
Laboratory of Signal Processing, Department of Electronics, University of Constantine, 25000, Constantine, Algeria
Bouttout Farid

Authors

Braham Chaouche Youcef
View author publications
You can also search for this author in PubMed Google Scholar
Yessaad Mohamed Elemine
View author publications
You can also search for this author in PubMed Google Scholar
Benmaiza Islam
View author publications
You can also search for this author in PubMed Google Scholar
Bouttout Farid
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Braham Chaouche Youcef .

Editor information

Editors and Affiliations

MIS (EA 4290), Université de Picardie Jules Verne, Amiens, France
Mohammed Chadli
Laboratoire d'Automatique et de Robotiqu, Université Abbes Laghrour, Khenchela, Khenchela, Algeria
Sofiane Bououden
VŠB-Technical University of Ostrava , Ostrava ba, Czech Republic
Ivan Zelinka

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Youcef, B.C., Elemine, Y.M., Islam, B., Farid, B. (2017). Speech Recognition System Based on OLLO French Corpus by Using MFCCs. In: Chadli, M., Bououden, S., Zelinka, I. (eds) Recent Advances in Electrical Engineering and Control Applications. ICEECA 2016. Lecture Notes in Electrical Engineering, vol 411. Springer, Cham. https://doi.org/10.1007/978-3-319-48929-2_25

Download citation

DOI: https://doi.org/10.1007/978-3-319-48929-2_25
Published: 02 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48928-5
Online ISBN: 978-3-319-48929-2
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics

Speech Recognition System Based on OLLO French Corpus by Using MFCCs

Abstract

Similar content being viewed by others

A Review of Various Techniques Related to Feature Extraction and Classification for Speech Signal Analysis

Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers

Hindi speech recognition in noisy environment using hybrid technique

Keywords

1 Introduction

2 The Mel-Frequency Cepstrum Coefficient (MFCC)

3 Coprus

4 Experimental Results and Analyse

5 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Speech Recognition System Based on OLLO French Corpus by Using MFCCs

Abstract

Similar content being viewed by others

A Review of Various Techniques Related to Feature Extraction and Classification for Speech Signal Analysis

Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers

Hindi speech recognition in noisy environment using hybrid technique

Keywords

1 Introduction

2 The Mel-Frequency Cepstrum Coefficient (MFCC)

3 Coprus

4 Experimental Results and Analyse

5 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation