1 Introduction

The recent progress in wireless applications such as speech recognition over mobile devices has led to the development of client–server recognition systems, also known as Distributed Speech Recognition (DSR) (Pearce 2000). In the DSR architecture, the front-end client is located in the terminal device and is connected over an error-protected data channel to a remote back-end recognition server. Despite many technological advances, speech recognition performance still needs improvement, particularly when the speech utterances are captured in highly noisy environments.

With regard to developing evaluation databases for DSR systems in multiple languages, the AURORA working group has been particularly active (Hirsch and Pearce 2000; Pearce 2001; AURORA 2006). Its evaluation scenarios, which include the AURORA-2, AURORA-3, AURORA-4, and AURORA-5 databases, have had a considerable impact on noisy speech recognition research. AURORA-2 is a small vocabulary evaluation of noisy connected digits from American English talkers; the task is speaker-independent recognition of isolated and connected digits, with and without added background noise. AURORA-3 consists of noisy small vocabulary speech recorded inside cars and serves to test the front-end in different languages, namely Finnish (Nokia 2000), Danish (Lindberg 2001), Spanish (Macho 2000), German (Netsch 2001), and Italian (Knoblich 2000). AURORA-4 provides a large vocabulary continuous speech recognition task, which aims to compare the effectiveness of different DSR front-end algorithms. AURORA-5 is similar to AURORA-2 but additionally covers the distortion caused by hands-free speech input inside a room. Furthermore, the AURORA tasks are distributed with Hidden Markov Model Toolkit (HTK) scripts, which make it easy to obtain the baseline performance for further speech recognition research.

Several evaluation frameworks modelled on AURORA-2 were developed by the Information Processing Society of Japan working group on noisy speech recognition, namely CENSREC-1/AURORA-2J, CENSREC-2, CENSREC-3, and CENSREC-4 (Nakamura et al. 2005; Fujimoto et al. 2006; Nishiura et al. 2008). The first version, AURORA-2J, contains Japanese noisy connected digit utterances and their associated HTK evaluation scripts. CENSREC-2 is a database for the evaluation of noisy continuous digit recognition whose data were recorded in real car driving environments. CENSREC-3 contains isolated-word utterances recorded in environments similar to those of CENSREC-2. The most recent database, CENSREC-4, is an evaluation framework for distant-talking connected digit utterances in various reverberation conditions.

Arabic is one of the most widely spoken languages in the world. It is currently ranked the fifth most widely used language, being the native language of more than 350 million people (World Bank 2016) as well as the liturgical language of over a billion Muslims around the world. In the Arab world today there are two forms of Arabic, Modern Standard Arabic (MSA) and Modern Colloquial Arabic (MCA). MSA, commonly known as the modern form of Classical (or Quranic) Arabic, is the official language of academic institutions, written and broadcast Arabic media, and civil services. The vernacular or colloquial form is the one most used when people speak about everyday topics. Moreover, MCA comprises a variety of forms from the different Arabic regions (e.g. the Middle East, North Africa, and Egypt).

The lack of Arabic language resources is one of the major issues confronting the Arabic speech research community. Among the most relevant corpora one can cite the Orientel project (Siemund et al. 2002), which covered a substantial data collection effort, in both MSA and MCA, ranging from the Mediterranean to the Middle East countries, including Turkey and Cyprus, as well as applications for mobile and multi-modal platforms. Abushariah et al. (2012) developed a large vocabulary speech database of MSA native speakers from 11 Arab countries, in which a total of 415 sentences were recorded by 40 speakers (20 male and 20 female). This database takes speaker variability such as gender, age, country, and region into account, so as to make it suitable for the design and development of Arabic continuous speech recognition systems.

Moreover, there exist data centers that provide relevant speech databases for both MSA and MCA. For example, the European Language Resources Association (ELRA) distributes the NEMLAR broadcast news speech Arabic corpus, which is composed of about 40 h of MSA recorded from four different radio stations (ELRA 2005). The Linguistic Data Consortium (LDC) has recently released a database that contains 590 h of Arabic speech recorded from 269 male and female speakers; these recordings were conducted by the speech group at King Saud University in different noise environments for read and spontaneous speech (LDC 2014). However, given the diversity and richness of the Arabic language, it remains difficult to derive from the existing databases a proper dataset for the problem at hand, as well as for testing industrial speech platforms.

Automatic recognition of spoken digits is essential in many DSR application areas for different languages. Compared to other commonly used languages, only a limited number of recent efforts have been devoted to building Arabic digit recognizers. Previous studies, using either Artificial Neural Networks (ANNs) or Hidden Markov Models (HMMs) as the recognition engine, include the works presented in (Alotaibi et al. 2003; Alotaibi 2005, 2008; Hyassat and Abu Zitar 2006; Amrouche et al. 2010; Ma and Zeng 2012; Hajj and Awad 2013). However, there is no common Arabic digits database for the evaluation and comparison of the proposed systems, principally because of differences in the types of features and noises used and in the testing methodologies.

The main objective of this work is to investigate Arabic spoken digits from the speech recognition point of view. We introduce an Arabic noisy speech speaker-independent isolated digits database and its evaluation scripts, named ARADIGITS-2. This database is specifically designed to evaluate the recognition performance of MSA digits in a DSR system. The data preparation (i.e. speech files, noises used, and text transcriptions) and the HTK evaluation scripts draw their inspiration from the AURORA-2 database (Hirsch and Pearce 2000). The spoken digit utterances were collected from 112 Algerian MSA native speakers (56 male and 56 female) and corrupted by additive noises at different Signal-to-Noise Ratio (SNR) levels. The European Telecommunications Standards Institute Advanced Front-End (ETSI-AFE) standard (ETSI 2007) is used for Mel Frequency Cepstral Coefficients (MFCCs) feature vector extraction and compression. The recognition task is performed using two training modes: clean (models are trained with clean data and tested with noisy data) and multi-condition (training is performed with both clean and noisy data).

For small vocabulary tasks such as digit recognition, the word is the most commonly used acoustic unit. In this work, a syllable-based recognition system is also designed. The use of the syllable unit is motivated by the polysyllabic nature of Arabic digits, which differ widely in the types and numbers of their syllables (Naveh-Benjamin and Ayres 1986; Ryding 2005). Compared to a language such as English, Arabic has about twice as many syllables per digit. For example, Naveh-Benjamin and Ayres (1986) report, for English, Spanish, Hebrew, and Arabic, mean numbers of syllables per word for the digits (0–6, 8, 9) of 1, 1.625, 1.875, and 2.25, respectively.

The motivation for building a DSR system on Algerian MSA rather than Algerian MCA is two-fold: (i) as in other Arabic countries, the divergence among the several Algerian MCA varieties makes the task of collecting data and designing a common recognition system very complex, since MCA digits are pronounced quite differently from place to place and from town to town; and (ii) in DSR services where the spoken digits entered at the front-end are highly important, such as bank account, credit card, and insurance identification, a recognition system based on MSA may guarantee more accurate performance.

The remainder of this paper is structured as follows. Section 2 presents a general overview of the ETSI DSR standards, the AURORA-2 database, and the HTK speech recognition toolkit. A detailed description of the ARADIGITS-2 data preparation and HTK parameterization is given in Sect. 3. The recognition performance obtained by empirically fine-tuning the relevant recognition parameters, for both word and syllable-like acoustic models, is presented in Sect. 4. Finally, Sect. 5 summarizes the conclusions of the presented work, as well as the further work that remains to be done.

2 DSR standards and AURORA-2 database

2.1 DSR standards

As depicted in Fig. 1, the main idea of DSR is to use a local front-end terminal from which the MFCC vectors are extracted and transmitted, through an error-protected data channel, to a remote back-end recognition server. Compared to traditional network-based automatic speech recognition, a DSR system provides specific benefits for mobile services, such as (i) acoustic noise compensation at the client side, (ii) low bit-rate transmission over the data channel, and (iii) improved recognition performance.

Fig. 1
figure 1

DSR system architecture

In the basic DSR standard, the ETSI Front-End (ETSI-FE) (ETSI 2003a), the speech features (i.e. MFCC components) are derived in the front-end from speech frames of 25 ms length with a 10 ms frame shift, using Hamming windowing. A Fourier transform is then performed, followed by a Mel filter bank with 23 bands in the frequency range from 64 Hz up to 4 kHz. The extracted features are the first 12 MFCCs (C1–C12) and two energy measures, the zeroth cepstral coefficient (C0) and the log energy (log E), for each frame.

The different blocks of the ETSI front-end MFCC extraction algorithm are illustrated in Fig. 2.

Fig. 2
figure 2

Block diagram of feature extraction algorithm (ETSI 2003a)

The abbreviations used in the blocks are listed below; a minimal code sketch of this processing chain follows the list:

  • ADC: Analog to Digital Converter

  • Offcom: Offset Compensation

  • Framing: frame length of 25 ms, with a frame shift of 10 ms

  • PE: Pre-emphasis Filter, with a factor of 0.97

  • Log E: Log Energy Computation

  • W: Hamming Windowing

  • FFT: Fast Fourier Transform (only magnitude components are considered)

  • MF: Mel filter bank with 23 frequency bands

  • LOG: Nonlinear Transformation

  • DCT: Discrete Cosine Transform
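
To make the processing chain concrete, the following minimal NumPy sketch walks through the same block order for 8 kHz input. It is only an illustrative approximation, not the normative ETSI implementation: the FFT size, the handling of offset compensation (approximated here by mean removal), and the exact mel filter layout are assumptions of this sketch, and no liftering or normalization constants are applied.

```python
import numpy as np

FS = 8000                        # sampling rate (Hz)
FRAME_LEN = int(0.025 * FS)      # 25 ms -> 200 samples
FRAME_SHIFT = int(0.010 * FS)    # 10 ms -> 80 samples
NFFT = 256                       # assumed FFT size for 8 kHz input
N_MEL = 23                       # number of mel filter bank channels
F_LOW, F_HIGH = 64.0, 4000.0     # filter bank frequency range (Hz)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    """Triangular filters equally spaced on the mel scale (simplified layout)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(F_LOW), hz_to_mel(F_HIGH), N_MEL + 2))
    bins = np.floor(edges / FS * NFFT).astype(int)
    fb = np.zeros((N_MEL, NFFT // 2 + 1))
    for i in range(N_MEL):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:ctr + 1] = np.linspace(0.0, 1.0, ctr - lo + 1)
        fb[i, ctr:hi + 1] = np.linspace(1.0, 0.0, hi - ctr + 1)
    return fb

def dct_13(logmel):
    """DCT of the 23 log filter bank outputs, keeping C0..C12."""
    j = np.arange(N_MEL)
    k = np.arange(13)[:, None]
    return np.cos(np.pi * k * (j + 0.5) / N_MEL) @ logmel

def front_end(signal):
    """Return per-frame [C1..C12, C0, log E] vectors following the block order of Fig. 2."""
    x = signal - np.mean(signal)                                  # Offcom, approximated by mean removal
    fb = mel_filterbank()
    feats = []
    for start in range(0, len(x) - FRAME_LEN + 1, FRAME_SHIFT):   # Framing
        frame = x[start:start + FRAME_LEN]
        log_e = np.log(max(np.sum(frame ** 2), 1e-10))            # Log E branch
        pe = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # PE, factor 0.97
        win = pe * np.hamming(FRAME_LEN)                          # W
        mag = np.abs(np.fft.rfft(win, NFFT))                      # FFT (magnitude only)
        logmel = np.log(np.maximum(fb @ mag, 1e-10))              # MF + LOG
        cep = dct_13(logmel)                                      # DCT -> C0..C12
        feats.append(np.concatenate([cep[1:], cep[:1], [log_e]]))
    return np.array(feats)

feats = front_end(np.random.randn(FS))    # one second of dummy speech
print(feats.shape)                        # (number of frames, 14)
```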

In the compression task (i.e. source coding), the 14-dimensional feature vector [C1, C2, ..., C12, C0, log E] is split into seven sub-vectors, each of which is quantized with its own 2-dimensional vector quantizer. The resulting compression bit-rate is 4400 bps, rising to 4800 bps when the overhead and error protection bits are included (i.e. channel coding). At the back-end, delta and delta–delta coefficients, or time derivatives, are estimated and appended to the 13 static features [C1, C2, ..., C12, C0 or log E] to obtain a 39-element feature vector.

In some DSR applications, for example human-assisted dictation, machine and human recognition are mixed in the same application, so it may be necessary to reconstruct the speech signal at the back-end. The ETSI Extended Front-End (ETSI-EFE) standard (ETSI 2003b) provides additional parameters, such as the voicing class and the fundamental frequency, which are extracted at the front-end and allow the speech signal to be reconstructed at the back-end. Transmitting these additional components correspondingly increases the compression bit-rate.

An advanced front-end feature extraction and compression algorithm (ETSI-AFE) (ETSI 2007) has also been published by ETSI for robust speech recognition. The standardized AFE provides considerable improvements in recognition performance in the presence of background noise. In the feature extraction part of the ETSI-AFE standard, noise reduction based on Wiener filtering theory is performed first. MFCC coefficients and the log energy are then computed from the de-noised signal, and blind equalization is applied to the cepstral features. Voice activity detection (VAD) for non-speech frame dropping is also implemented in the front-end feature extractor; the VAD flag is used to exclude non-speech frames from the recognition task.

On the server side, unlike the conventional ETSI-FE standard, where the cepstral derivatives are computed by the HTK recognition engine using a centered finite difference approximation (Young et al. 2006), ETSI-AFE includes additional scripts that compute these coefficients using a polynomial approximation (more details are given in Sect. 3.2). Also, on the ETSI-AFE back-end side, the energy coefficients C0 and log E are both used in the recognition task by employing the following combination:

$${{C}_{comb}}=\alpha {{C}_{0}}+\beta \log E,$$
(1)

where \(\alpha\) and \(\beta\) are set to 0.6/23 and 0.4, respectively (ETSI 2007).
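
As a worked example, Eq. (1) amounts to the following per-frame combination (a direct transcription of the equation; the variable names are ours):

```python
ALPHA, BETA = 0.6 / 23.0, 0.4          # weights specified in ETSI (2007)

def combined_energy(c0, log_e):
    """Combined energy feature C_comb of Eq. (1)."""
    return ALPHA * c0 + BETA * log_e
```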

2.2 AURORA-2 database and HTK toolkit

The original high quality TIDigits database (Leonard 1984) is the speech source of the AURORA-2 database, which consists of an isolated and connected digits task. It provides speech samples and scripts to perform speaker-independent speech recognition experiments in clean and noisy conditions. The data were prepared by down-sampling from the original 20 kHz sampling frequency to 8 kHz with an ideal low-pass filter. Additional filtering is applied with two standard frequency characteristics: G.712 (ITU-T 1996) and the Modified Intermediate Reference System (MIRS) (ITU-T 1992).

AURORA-2 contains eight types of realistic additive noises with stationary and non-stationary segments (suburban train (subway), babble, car, exhibition hall, restaurant, street, airport, and train station) at different SNR levels (clean, 20, 15, 10, 5, 0, and −5 dB). The database contains two training sets of 8440 utterances each (clean and multi-condition) and three test sets (set A, set B, and set C).

The clean training set is filtered with the G.712 characteristic without any added noise. For the multi-condition training set, the same utterances, after filtering with the G.712 characteristic, are equally split into 20 subsets, which are corrupted by four noises (subway, babble, car, and exhibition hall) at five different SNR levels (clean, 20, 15, 10, and 5 dB).

The first test set, test set A, consists of 28,028 utterances filtered with the G.712 characteristic using four different noises, namely subway, babble, car, and exhibition hall. In total, this test set comprises 28 subsets, where the noises are added at seven different SNR levels (clean, 20, 15, 10, 5, 0, and −5 dB). Test set A contains the same noises as those used in the multi-condition training set, which leads to a high match between training and test data. The second test set, test set B, is created in the same way as test set A (i.e. the same clean utterances filtered with the G.712 characteristic) but using four different noises: restaurant, street, airport, and train station. The third test set, test set C, contains 14,014 utterances distributed into 14 subsets, where two types of noise are considered, subway and street. In test set C, speech and noises are first filtered with the MIRS characteristic (in order to simulate the frequency response of the terminal device), and then the noises are added at different SNR levels (clean, 20, 15, 10, 5, 0, and −5 dB). A full description of the AURORA-2 database is given in Hirsch and Pearce (2000).

The HTK toolkit is principally designed for building HMM-based speech processing tools, in particular recognizers. It consists of a set of library modules and tools in C source code, available from http://htk.eng.cam.ac.uk/. The tools provide sophisticated facilities for speech analysis, HMM training, testing, and results analysis. Two major processing stages are involved: first, the HTK training tools are used to estimate the parameters of a set of HMMs from training utterances and their associated transcriptions; second, unknown utterances are transcribed using the HTK recognition engine.

In the AURORA-2 task, the model set contains 11 whole-word HMMs (the digits 0 to 9 and “oh”), which are linear left-to-right models with no skips over states, also known as the Bakis topology (Bakis 1976). Two silence models are defined, “sil” (silence) and “sp” (short pause). The “sil” model has three emitting states with six mixtures per state, while the “sp” model has only a single state. Each word model has 16 states with three Gaussian mixtures per state (in the HTK structure, two non-emitting dummy states are added at the beginning and at the end of the state sequence). Each Gaussian component is initialized with the global means and variances of the acoustic coefficients.
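
To illustrate this topology, the sketch below builds the transition matrix of a left-to-right, no-skip (Bakis) whole-word model with 16 emitting states plus the two non-emitting HTK dummy states; the 0.5 self-loop probability is only an illustrative initial value, not a trained one.

```python
import numpy as np

def bakis_transitions(num_emitting=16, stay_prob=0.5):
    """Left-to-right transition matrix with no skips and HTK-style entry/exit dummy states."""
    n = num_emitting + 2                   # two non-emitting states around the emitting ones
    a = np.zeros((n, n))
    a[0, 1] = 1.0                          # entry state always moves to the first emitting state
    for i in range(1, num_emitting + 1):
        a[i, i] = stay_prob                # self-loop
        a[i, i + 1] = 1.0 - stay_prob      # forward transition to the next state only
    return a                               # the exit state (last row) has no outgoing transitions

print(bakis_transitions(16).shape)         # (18, 18) for a 16-state digit model
```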

The overall Word Accuracy Rate (WAR) of recognition experiments conducted on the AURORA-2 task using the ETSI-AFE standard is summarized in Tables 1 and 2 (Hirsch and Pearce 2006). These experiments were performed in both training modes, with and without (i.e. baseline) MFCC compression, and without the optional VAD parameter.

Table 1 Overall WAR (%) for the AURORA-2 baseline
Table 2 Overall WAR (%) for the AURORA-2 including AFE encoder

3 ARADIGITS-2 data description and system parameterization

As outlined in the previous section, the aim of the present work is to develop an Arabic small vocabulary isolated digits database for DSR applications. The spoken words used in the ARADIGITS-2 database are the Arabic digits zero through nine, which are polysyllabic (except for the monosyllabic digit “zero”). Arabic digits include almost all types of MSA syllables, which are limited to the following combinations of consonants /C/ and vowels /V/ (Ryding 2005):

  • Full form pronunciation syllables: consists of (i) short or weak syllables /CV/ (consonant–short vowel) and (ii) long or strong syllables /CVV/ (consonant–long vowel) or /CVC/ (consonant–short vowel–consonant).

  • Additional pause form pronunciation syllables: consists of super-strong syllables /CVVC/ (consonant–long vowel–consonant) or /CVCC/ (consonant–short vowel–consonant–consonant).

The corresponding pronunciations of the ten MSA digits, as well as the number and types of syllables used, are given in Table 3.

Table 3 Pronunciations of Arabic digits (Alotaibi 2003)

3.1 Data preparation

The speech utterances used in ARADIGITS-2 are taken from the original ARADIGITS spoken Arabic isolated digits database, which was initially used in the recognition system evaluation proposed by Amrouche et al. (2010). ARADIGITS is a set of Arabic digit utterances (0 to 9) pronounced in MSA, collected from 120 Algerian native speakers while considering speaker variability such as gender, age (from 18 to 50), and speaking style. The speech utterances were recorded in a large and very quiet auditorium (with a capacity of up to 1800 people) using a high quality microphone. At the time of the recording, the room was largely empty, with an overall environmental Sound Pressure Level (SPL) below 35 dB. A total of 3600 speech utterances were originally recorded at a 22.050 kHz sampling frequency and then down-sampled to 16 kHz.

From the original 3600 utterances of ARADIGITS we selected, and manually verified the transcriptions of, a total of 2704 utterances spoken by 112 speakers (56 males and 56 females). Some utterances were discarded because of poor recording quality or mispronunciations. The wave files were down-sampled from 16 to 8 kHz and stored as 16-bit little-endian samples; these tasks were performed with a Praat script (Boersma and Weenink 2015). The training and test sets were selected so as to balance speakers and genders (equally distributed). The training set contains a total of 1840 utterances (4052 syllables) produced by 34 males and 34 females. The test set contains 864 utterances (1906 syllables) pronounced by the remaining speakers (22 males and 22 females). The occurrence frequency of each digit, shown in Table 4, indicates that the digits are balanced, with approximately the same number of utterances per digit.

Table 4 Distribution of ARADIGITS-2 original clean speech data

Following the same approach as the AURORA-2 database, ARADIGITS-2 consists of two training sets (for the clean and multi-condition training modes) and three test sets, test set A, test set B, and test set C, where each set is split into four sub-sets. The speech signals are filtered with the two standard frequency characteristics G.712 and/or MIRS (ITU-T 1992; ITU-T 1996), and then corrupted by eight different noises at different SNR levels (20, 15, 10, 5, 0, and −5 dB). The noises, taken from the AURORA-2 database so as to cover the variability of environmental conditions, are suburban train, babble, car, exhibition hall, restaurant, street, airport, and train station. The open-source Filtering and Noise Adding Tool (FaNT) (Hirsch 2005) is used to filter the speech signals with the appropriate filter characteristic as defined by the International Telecommunication Union (ITU) standards. This tool also adds a noise signal to clean speech at a desired SNR level, which can be expressed as:

$$\text{SNR}\left( \text{dB} \right)=10\log_{10}\left( \frac{{{E}_{S}}}{{{E}_{N}}} \right),$$
(2)

where E_S and E_N represent the total energy of the speech signal and of the noise signal, respectively. The speech energy is estimated using the ITU-T P.56 voltmeter function from the ITU-T Software Tool Library (Neto 1999).
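
Conceptually, the scaling that FaNT performs can be sketched as below: the noise is rescaled so that Eq. (2) yields the requested SNR. For simplicity this sketch uses the total signal energies, whereas FaNT relies on the ITU-T P.56 active speech level, so it only approximates the behaviour of the actual tool.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` so that Eq. (2) gives `snr_db`."""
    if len(noise) < len(speech):                              # loop the noise if it is shorter
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)].astype(float)
    e_s = np.sum(speech.astype(float) ** 2)                   # total speech energy E_S
    e_n = np.sum(noise ** 2)                                  # total noise energy E_N
    gain = np.sqrt(e_s / (e_n * 10.0 ** (snr_db / 10.0)))     # makes the new E_S/E_N match the target SNR
    return speech + gain * noise

# Example: corrupt a clean utterance with babble noise at 10 dB SNR
# noisy = add_noise_at_snr(clean_utterance, babble_noise, 10.0)
```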

In total, each ARADIGITS-2 test set consists of 6048 utterances, and each training set of 1840 utterances. Figures 3 and 4 show detailed block schemes describing how the training set and each of the test sets are divided among the different noises and noise levels. However, unlike AURORA-2, test set C of ARADIGITS-2 includes two additional sub-sets with noise environments that contain non-stationary segments, namely babble and restaurant.

Fig. 3
figure 3

Training dataset in ARADIGITS-2 (clean and multi-condition modes)

Fig. 4
figure 4

Test dataset in ARADIGITS-2 (test set A, B, and C)

3.2 Time derivatives estimation

In speech recognition systems there are generally two ways to approximate the time derivatives, or dynamic features. The first exploits the discrete-time representation of the cepstral coefficients and simply uses a first- or second-order finite difference (as HTK does). However, this approximation is intrinsically noisy, and the cepstral sequence usually cannot be expressed in a form suitable for differentiation (Soong 1988; Rabiner 1993). The second, more appropriate, way approximates the derivatives by fitting an orthogonal polynomial to each cepstral coefficient trajectory over a finite-length window (Furui 1981, 1986; Rabiner 1993).

As defined in the ETSI-AFE standard, the first and second derivative coefficients, ΔĈ_i and ΔΔĈ_i respectively, for each feature component C_i (i.e. C1–C12 and C_comb) are computed over a 9-frame window (i.e. the derivative window, the interval over which the derivatives are estimated) using the following weighted sums (ETSI 2007):

$$\Delta {\hat{C}}_i = \sum\limits_{t = - M}^M \Delta {\hat w}_t {C_i}\left( t \right),\quad M = 4,$$
(3)
$$\Delta \Delta {\hat C}_i = \sum\limits_{t = - M}^M \Delta \Delta {\hat w}_t {C_i}\left( t \right), \quad M = 4,$$
(4)

where t is the frame time index. The respective weighting coefficients are set as follows:

$$\begin{aligned} \Delta \hat{w} &= [-1,\ -0.75,\ -0.5,\ -0.25,\ 0,\ 0.25,\ 0.5,\ 0.75,\ 1], \\ \Delta \Delta \hat{w} &= [1,\ 0.25,\ -0.285714,\ -0.607143,\ -0.714286,\ -0.607143,\ -0.285714,\ 0.25,\ 1]. \end{aligned}$$

From the weighting coefficient values and from (3) and (4), one may deduce that the first and second derivatives are obtained by fitting the cepstral trajectory with a second-order polynomial over a window of 2M + 1 frames with M = 4. For a given M, the weighting coefficients Δw_t and ΔΔw_t at time t are given by the following formulas (Rabiner 1993):

$$\Delta {{w}_{t}}=\frac{t}{{{T}_{M}}},$$
(5)
$$\Delta \Delta {{w}_{t}}=\frac{2\left( {{T}_{M}}-(2M+1){{t}^{2}} \right)}{T_{M}^{2}-(2M+1)\sum\limits_{t=-M}^{M}{{{t}^{4}}}},$$
(6)

where

$${{T}_{M}}=\sum\limits_{t=-M}^{M}{{{t}^{2}}}.$$
(7)

However, in ETSI-AFE, the weights are normalized by the maximum weight, so that the largest weight equals 1. Thus, the normalized weights at time t, Δŵ_t and ΔΔŵ_t, can be expressed as:

$$\Delta \hat{w}_{t} = \frac{{\Delta w_{t} }}{{\max (\Delta w)}},$$
(8)
$$\Delta \Delta \hat{w}_{t} = \frac{{\Delta \Delta w_{t} }}{{\max (\Delta \Delta w)}}.$$
(9)
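
The sketch below computes the weights of Eqs. (5)–(9) for M = 4, reproducing the two normalized vectors listed above, and applies them as in Eqs. (3) and (4). The replicate-padding of the edge frames is an assumption of this sketch rather than part of the standard.

```python
import numpy as np

def derivative_weights(M=4):
    """Normalized first and second derivative weights, Eqs. (5)-(9)."""
    t = np.arange(-M, M + 1)
    T_M = np.sum(t ** 2)                                          # Eq. (7)
    w1 = t / T_M                                                  # Eq. (5)
    num = 2.0 * (T_M - (2 * M + 1) * t ** 2)
    den = T_M ** 2 - (2 * M + 1) * np.sum(t ** 4)
    w2 = num / den                                                # Eq. (6)
    return w1 / np.max(w1), w2 / np.max(w2)                       # Eqs. (8) and (9)

def append_derivatives(static, M=4):
    """Append deltas and delta-deltas (Eqs. (3)-(4)) to a (frames, coeffs) feature matrix."""
    w1, w2 = derivative_weights(M)
    padded = np.pad(static, ((M, M), (0, 0)), mode="edge")        # replicate edge frames
    stacked = np.stack([padded[i:i + len(static)] for i in range(2 * M + 1)])
    delta = np.tensordot(w1, stacked, axes=1)                     # Eq. (3)
    delta2 = np.tensordot(w2, stacked, axes=1)                    # Eq. (4)
    return np.concatenate([static, delta, delta2], axis=1)        # 13 static -> 39 features

print(np.round(derivative_weights()[0], 6))   # [-1. -0.75 -0.5 -0.25 0. 0.25 0.5 0.75 1.]
print(np.round(derivative_weights()[1], 6))   # [ 1. 0.25 -0.285714 ... -0.285714 0.25 1.]
```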

3.3 HTK parameterization

MFCC coefficients are extracted for each speech utterance with the client front-end ETSI-AFE extraction algorithm. The final feature vector consists of the first 12 MFCC coefficients (C1–C12) and the combination of C0 and log E (C_comb). Our choice of ETSI-AFE for MFCC extraction is motivated by its noise reduction stage, which considerably reduces the effect of background noise. It is also expected that ETSI-AFE will give better recognition performance than the conventional ETSI-FE for all languages (Pearce 2001).

The HTK speech recognition engine, software version 3.4 (Young et al. 2006), is used to evaluate the recognition performance on ARADIGITS-2. The digits are modelled in two ways: (i) as whole-word models, i.e. the recognition unit comprises the whole word, and (ii) as syllable models, where each word is mapped onto its syllable representation. The number of units is 10 (the Arabic digits) for the word model and 19 for the syllable model (from Table 3, the 19 syllables are obtained after eliminating the redundant ones).

The silence model is the same as in AURORA-2; it consists of three states with a transition structure and six Gaussian mixtures per state (see Fig. 5). However, the inter-word short pause “sp” model is not used, since the database contains only isolated digits. The MFCC extraction is followed by the computation of the global means and variances; an HMM for each unit is then created, initialized with these global means and variances. The embedded Baum–Welch re-estimation scheme is used for HMM parameter training. In the testing process, the forward/backward algorithm is performed, and the most probable pronunciation is assigned to each word from the transcription file.

Fig. 5
figure 5

Three states silence model architecture (with two dummy states)

The ARADIGITS-2 data analysis and HTK parameters are given in Table 5, except for the derivative window length, the number of emitting states per digit model, and the number of Gaussian mixtures per state. The optimal configuration of these parameters is the subject of the next section.

Table 5 Speech utterance analysis and HTK configuration

4 System testing and evaluation

Recognition experiments are conducted in speaker-independent mode, meaning that no speaker appears in both the training and the test set. The baseline system uses a 39-component feature vector consisting of the 13 static coefficients (C1–C12, C_comb) and the corresponding first and second derivative parameters (i.e. the HTK parameter kind MFCC_E_D_A). The recognition performance is measured in terms of the WAR given by the following formula:

$$\text{WAR}=1-\frac{S+D+I}{N},$$
(10)

where N is the total number of words in the test set, S is the number of substitution errors, D is the number of deletion errors, and I is the number of insertion errors (Young et al. 2006). Note that the ARADIGITS-2 results are reported following the convention used in the AURORA-2 database, where the overall WAR is calculated over the SNRs from 20 dB down to 0 dB.
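
For reference, Eq. (10), expressed as a percentage, can be computed directly from the error counts (as typically reported by HTK's scoring tool); the numbers in the example below are made up for illustration only.

```python
def word_accuracy_rate(n_words, substitutions, deletions, insertions):
    """Word Accuracy Rate of Eq. (10), as a percentage."""
    return 100.0 * (1.0 - (substitutions + deletions + insertions) / float(n_words))

# Illustrative (made-up) counts: 864 test words, 20 substitutions, 5 deletions, 3 insertions
print(round(word_accuracy_rate(864, 20, 5, 3), 2))   # 96.76
```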

Initially, we conducted an experiment on ARADIGITS-2 with the same model parameters as AURORA-2, i.e. 16 states and three Gaussian mixtures for each recognition unit (syllable or word). The derivatives are computed over a 9-frame window using formulas (3) and (4). We note that when syllables are used as acoustic models, they are concatenated to form word models according to a syllable pronunciation dictionary (generated from the pronunciations in Table 3). Table 6 shows the overall baseline performance for both word- and syllable-based recognition with the clean and multi-condition training modes.

Table 6 Overall WAR (%) for the ARADIGITS-2 using the same AURORA-2 topology parameters

In the following subsections, a series of experiments studies the effects of the derivative window length, the number of states, and the number of Gaussian mixtures per state, and these parameters are fine-tuned on ARADIGITS-2. The optimization selects, for each parameter variation, the setting that yields the best recognition performance.

4.1 Effects of derivative window length

The first and second derivative window length is initially chosen based on the results of the recognition experiments, in the two training modes, summarized in Figs. 6 and 7. Here the derivative window length is varied from 5 to 21 frames while keeping the same AURORA-2 HMM parameters. Note that the first and second derivative windows are given the same length.

Fig. 6
figure 6

WAR (%) versus derivative window length for ARADIGITS-2 (word model)

Fig. 7
figure 7

WAR (%) versus derivative window length for ARADIGITS-2 (syllable model)

A comparison of the recognition results indicates that, in general, the maximum accuracy rates are obtained when an 11-frame window is used to calculate the derivatives. However, for syllable-based models trained with clean speech, the maximum performance is achieved with a 17-frame window, and the recognition rate is degraded compared to the word-level unit. This may be because the chosen model parameters are inappropriate. Furthermore, studies on the effects of the derivative window, e.g. Applebaum and Hanson (1991) and Lee et al. (1996), suggest that derivative features estimated over long intervals help most when the mismatch between training and test is greatest.

The same experiment was carried out on the AURORA-2 database, in order to show how Arabic compares to English in digit recognition. The results in Figs. 6 and 8 (i.e. for the same acoustic unit) indicate that, for English digits, the maximum accuracy rates are obtained with a 9-frame window. Compared to English digits, more frames may thus be needed to accurately represent the transitional information from one phoneme to another in Arabic digits. Furthermore, as mentioned in Furui (1986), one possible cause of the variation in the optimal window length is the difference between languages; however, this remains to be investigated.

Fig. 8
figure 8

WAR (%) versus derivative window length for AURORA-2 (word model)

4.2 Effects of number of states and Gaussian mixtures per state

We now study the effects of the number of states and mixtures for Arabic digits. We first adopt an 11-frame derivative window for both HMM models (word and syllable) and both training conditions. After finding the optimal numbers of states and mixtures, the derivative window length will be fine-tuned once more. The overall WAR obtained with the 11-frame window is shown in Table 7.

Table 7 Overall WAR (%) for the ARADIGITS-2 using 11-frame derivative window

The number of states to use in each unit model and the number of mixture densities per state depend strongly on both the amount of training data and the vocabulary words (Rabiner et al. 1989). In this work, the number of states is globally optimized, that is, every acoustic model has the same number of states. This implies that the models will work best when they represent units with the same number of phonemes (Rabiner 1989). The optimization procedure varies the number of states of each model (4, 8, 12, 16) and the number of Gaussian mixtures per state (3, 6, 9, …, 27). Other optimal combinations might be found by further increasing the number of states or mixtures, but at a relatively higher computational cost.

During the training process, the mixture densities are increased gradually as follows. We start with a single Gaussian density per state for all acoustic models, collect statistics from the training data, and estimate the model parameters by applying three re-estimation iterations. The mixture densities are then increased and the parameters re-estimated with three further iterations. This process is repeated until the desired number of mixtures is reached. Finally, seven further re-estimation iterations are performed (Hirsch and Pearce 2000).
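
The re-estimation schedule described above can be written out as follows; the intermediate mixture counts are passed in explicitly because the exact increment between steps is not specified here, so they are an assumption of this sketch.

```python
def training_schedule(intermediate_mixtures, target_mixtures, final_iterations=7):
    """List of (mixtures per state, Baum-Welch iterations) steps described in the text."""
    schedule = [(1, 3)]                                    # start with one Gaussian per state
    for m in list(intermediate_mixtures) + [target_mixtures]:
        schedule.append((m, 3))                            # split up the mixtures, then 3 iterations
    schedule.append((target_mixtures, final_iterations))   # finish with 7 further iterations
    return schedule

# Example: grow to 6 mixtures per state via an intermediate step of 3
print(training_schedule([3], 6))       # [(1, 3), (3, 3), (6, 3), (6, 7)]
```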

To illustrate the effects of varying the numbers of states and mixtures, Figs. 9, 10, 11, and 12 show the overall WAR in the clean and multi-condition training modes. The best overall accuracies are achieved with 16 states and either 3 or 6 Gaussian mixtures per state. However, the maximum accuracy reached with the syllable model remains degraded in the clean training mode. We show later how the accuracy of the syllable model is improved through a number of further investigations.

Fig. 9
figure 9

WAR (%) versus number of mixtures and states in clean training (word model)

Fig. 10
figure 10

WAR (%) versus number of mixtures and states in multi-condition training (word model)

Fig. 11
figure 11

WAR (%) versus number of mixtures and states in clean training (syllable model)

Fig. 12
figure 12

WAR (%) versus number of mixtures and states in multi-condition training (syllable model)

4.3 Fitting of models parameters

Previously, various tests were conducted to show the effects of the derivative window length and of the HMM topology parameters (i.e. the number of states and the number of Gaussian mixtures). The optimal combination of initial estimates is an 11-frame derivative window, 16 states, and three or six Gaussian mixtures per state. However, further investigation is needed to optimize the parameters further, and to provide a more reliable assessment of the syllable- and word-based recognition systems.

Table 8 details the results of Table 7 by analysing the WAR against SNR level in clean training. The accuracies are analysed for high SNRs, i.e. (10–20) dB, and low SNRs, i.e. (0–5) dB. The use of the syllable-like unit can lead to low recognition rates at low SNRs, owing to its weak robustness to high noise levels. Under low noise (i.e. high SNRs), however, the results for the two acoustic units are rather similar.

Table 8 Overall WAR (%) for high and low SNR levels, in clean training mode

More precisely, to assess the effects of noise on recognition performance, Tables 9, 10, 11 and 12 display detailed confusion information among the digits in the different training modes, for a set of highly noisy utterances extracted from test set A. The confusion matrices reveal the “zero” model as the cause of most confusions in the clean training mode. However, the confusion caused by the “zero” model in the word-based system is relatively less important than in the syllable-based system. From the results obtained with multi-condition training, we expect the digit “zero” to be more susceptible to noisy environments than the other digits.

Table 9 Confusion matrix for word model (clean training)
Table 10 Confusion matrix for word model (multi-condition training)
Table 11 Confusion matrix for syllable model (clean training)
Table 12 Confusion matrix for syllable model (multi-condition training)

Furthermore, when the word unit is used, the digit “zero” can be discriminated easily even when its segments are masked by noise, since it is the only monosyllabic digit and has the shortest duration. When the syllable model is adopted, however, recognizing the class “zero” becomes more difficult, since all the models are monosyllabic and discrimination by segment duration is weaker. In other words, the use of an acoustic unit with a longer duration facilitates the joint exploitation of temporal and spectral variations (Gish and Ng 1996; Ganapathiraju et al. 2001).

4.3.1 Fixing the “zero” model

In the Arabic language, the word for “zero”, pronounced “sěfr”, is a monosyllabic word of type /CVCC/. This type of syllable is less frequent and occurs only in word-final position or in monosyllabic words (Al-Zabibi 1990). The digit “sěfr” consists mainly of unvoiced phonemes and is usually the shortest of all the digits. It starts with the relatively long consonant /s/ (unvoiced fricative), followed by the short vowel /i/, and ends with two consonants, /f/ (unvoiced fricative) and /r/ (voiced lateral) (Alotaibi 2005).

The digit “zero” is thus dominated by unvoiced phonemes. This kind of speech is generally masked more easily by background noise, because it is often acoustically noise-like and its energy is usually much weaker than that of voiced speech (Hu and Wang 2008). We therefore expect that more mixture densities will be needed to capture the large amount of variation in the feature space of this digit. Increasing the acoustic resolution through sub-word modelling has proved effective in improving recognition performance in previous studies, e.g. Lee et al. (1990).

Various experiments were conducted to assess the effect of increasing only the number of mixtures of the digit “zero” (3, 6, …, 36 mixtures). As shown in Figs. 13 and 14, the performance degradation due to high noise levels is compensated by increasing the densities of the “zero” model. The optimal numbers of mixtures, for which the best overall accuracies are achieved when the remaining models are trained with either three or six mixtures, are given in Table 13.

Fig. 13
figure 13

WAR (%) by varying number of mixtures of “zero” model in clean training

Fig. 14
figure 14

WAR (%) by varying number of mixtures of “zero” model in multi-condition training

Table 13 Optimal number of mixtures for each recognition unit in ARADIGITS-2

We conducted a second experiment with the newly optimized mixtures on the same low-SNR test utterances as before, in order to observe the effect on the confusion among digits. Tables 14, 15, 16 and 17 show that using the optimal densities for the digit “zero” increases the number of correctly recognized elements on the diagonal of the confusion matrix. It can also be noticed that the overall accuracy of the system using the syllable as acoustic unit now outperforms that of the word unit.

Table 14 Confusion matrix for word model after fixing “zero” model (clean training)
Table 15 Confusion matrix for word model after fixing “zero” model (multi-condition training)
Table 16 Confusion matrix for syllable model after fixing “zero” model (clean training)
Table 17 Confusion matrix for syllable model after fixing “zero” model (multi-condition training)

Furthermore, the previous results (see Fig. 11) show that there is no improvement in system performance when the mixtures of all syllables are increased together, which explains why only the mixtures of the “zero” model are increased here.

4.3.2 Optimizing the derivative window length

We performed a further experiment to determine the optimal derivative window length, based on the maximum achieved performance. In this experiment, the recognition systems use the optimal mixtures obtained for both acoustic units (see Table 13). Comparing the results in Figs. 15 and 16 with the previous results in Figs. 6 and 7 shows that the optimal derivative window changes only for the syllable model in clean training, where the window length is reduced from 17 to 13 frames. The other results, in Table 18, show that the best overall recognition performance is obtained with the syllable model in both training conditions under the new optimized configuration.

Fig. 15
figure 15

WAR (%) versus derivative window length after fixing “zero” model (word-based recognition)

Fig. 16
figure 16

WAR (%) versus derivative window length after fixing “zero” model (syllable-based recognition)

Table 18 Overall WAR (%) for the ARADIGITS-2 after fixing “zero” model

4.4 Detailed experimental results using syllable unit

A series of tests was conducted to find an optimal configuration for the ARADIGITS-2 task, and we showed that using the syllable unit with appropriate parameters leads to the best performance. Tables 19, 20, 21 and 22 summarize detailed ARADIGITS-2 recognition performance for the different noise types and SNR levels. The HMM models are trained with the following optimized parameters: (i) each syllable model has 16 states with three mixtures per state, except the “zero” model, which has 27 and 12 mixtures per state for clean and multi-condition training, respectively (see Table 13), and (ii) 13-frame and 11-frame windows are used to estimate the derivatives for clean and multi-condition training, respectively (see Fig. 16). As in the AURORA-2 database, the silence model has three states with six mixtures per state. The recognition performance is evaluated both for uncompressed MFCCs (baseline) and for MFCCs compressed with the ETSI-AFE encoder at 4400 bps.

Table 19 Detailed WAR (%) for the ARADIGITS-2 baseline in clean training
Table 20 Detailed WAR (%) for the ARADIGITS-2 baseline in multi-condition training
Table 21 Detailed WAR (%) for the ARADIGITS-2 including AFE encoder in clean training
Table 22 Detailed WAR (%) for the ARADIGITS-2 including AFE encoder in multi-condition training

From Tables 19 and 20, an expected overall improvement can be observed for test set A compared to test set B in the multi-condition training mode, since test set A contains the same noises as those used in multi-condition training. The results also show degraded recognition performance for test set C; this degradation is due to the MIRS filtering (i.e. the convolutional distortion it introduces). Compared to the clean training mode, we notice a graceful degradation of the performance on clean test utterances in the multi-condition training mode, which can be interpreted as an improvement on noisy speech at the cost of sacrificing some recognition performance on clean speech (Cui and Gong 2007). It can also be seen that the noises containing non-stationary segments, such as babble, restaurant, airport, and train station (Hirsch and Pearce 2000), do not reduce the performance considerably with respect to the other noises.

The results in Tables 21 and 22 show a graceful degradation of the recognition performance when the quantization codebooks of the ETSI-AFE encoder are used with multi-condition training (a relative degradation of 0.21%). The recognition performance could be further improved if the quantization codebooks were re-estimated with new MFCC vectors extracted from an Arabic corpus. We should point out that these results correspond to HMM models trained on uncompressed features, which means that there is a mismatch between training and test data.

5 Conclusion

In this paper we presented ARADIGITS-2, an HMM-based speaker-independent Arabic digits speech recognition database. Designed within an experimental framework, ARADIGITS-2 provides a performance evaluation of MSA isolated digits in noisy environments at various SNR levels. We generally followed the same methodology as the AURORA-2 database, which is widely used for the evaluation of noise-robust DSR systems. ARADIGITS-2 aims to make available a corpus of Arabic speech data that allows researchers and developers to evaluate their algorithms, as well as to build DSR applications that need Arabic digits as input data.

Although the word unit is the one most frequently used in building digit recognition engines, we also adopted the syllable unit in building ARADIGITS-2. The use of the syllable unit is motivated by the polysyllabic nature of Arabic digits compared to other languages such as English. To improve recognizer performance, a series of experiments was conducted to find the optimal combination of acoustic unit parameters, especially for the monosyllabic Arabic digit “zero”. The parameters of interest are the number of states per model, the number of Gaussian mixtures per state, and the derivative window length. We found that the syllable-like unit fits better than the word-like unit: the recognition performance using the syllable unit exceeds that of the word unit by an overall WAR of 0.44 and 0.58% for the clean and multi-condition training modes, respectively. However, a more effective configuration may remain to be found.

The final results obtained with the syllable unit are promising. They correspond to both the uncompressed and the compressed features of the ETSI-AFE DSR standard, with overall recognition performance in the clean and multi-condition modes of (88.08, 95.94%) for uncompressed MFCCs and (88.29, 95.73%) for compressed MFCCs, respectively.

In addition, since the ETSI-AFE standard has been tested on a range of languages, ARADIGITS-2 now makes it possible to test this standard on the Arabic language. Our future work will focus on extending ARADIGITS-2 to a large vocabulary Arabic continuous speech recognition database for DSR, by constructing a mixture of word-, syllable-, and phoneme-based acoustic units.