
1 Introduction

In this chapter we describe techniques to build a high-performance speech recognizer for Arabic and related languages. The key insights are derived from our experience in the DARPA GALE program, a five-year program devoted to advancing the state of the art in Arabic speech recognition and translation. The most important lesson is that general speech recognition techniques also work very well on Arabic. An example is the issue of vowelization: short vowels are often not transcribed in Arabic, Hebrew, and other Semitic languages. Semi-automatic vowelization procedures, specifically designed for the language, can improve the pronunciation lexicon. However, we can also simply ignore the problem at the lexicon level and compensate for the resulting pronunciation mismatch with discriminative training of the acoustic models. While we focus on Arabic in this chapter, we expect that the vast majority of the issues we address here will carry over to other Semitic languages. We have tested the approaches discussed in this chapter only on Arabic, as it is the Semitic language with the most resources. Our experimental results demonstrate that such language-independent techniques can solve language-specific issues, at least to a large extent. Another example is morphology, where we show that a combination of language-independent techniques (an efficient decoder to handle large vocabularies, and exponential language models) and language-specific techniques (a neural network language model that uses morphological and syntactic features) leads to good results. For these reasons we describe in the text both language-independent and language-specific techniques. We also describe a full-fledged LVCSR system for Arabic that makes the best use of all these techniques, and we demonstrate how this system can be used to bootstrap systems for related Arabic dialects and Semitic languages.

1.1 Automatic Speech Recognition

Fig. 13.1 Block diagram of an automatic speech recognition system

Modern speech recognition systems use a statistical pattern recognition approach to the problem of transforming speech signals to text. This approach is data-driven: to build a speech recognition system, practitioners collect speech and text data that are representative of a desired domain (e.g., news broadcasts or telephone conversations), use the collected data to build statistical models of speech signals and text strings in the target domain, and then employ a search procedure to find the best word string corresponding to a given speech signal, where the statistical models provide an objective function that is optimized by the search process. A high-level block diagram of such a speech recognition system is given in Fig. 13.1.

More precisely, in the statistical framework, the problem of speech recognition is cast as

$$\displaystyle{ \widehat{\mathbf{W}} =\mathop{\mathrm{argmax}}_{\mathbf{W}}P(\mathbf{W}\vert \mathbf{X};\varTheta ) }$$
(13.1)

where W is a word sequence, \(\widehat{\mathbf{W}}\) is the optimal word sequence, X is a sequence of acoustic feature vectors, and Θ denotes model parameters.

Solving this problem directly is challenging, because it requires the integration of knowledge from multiple sources and at different time scales. Instead, the problem is broken down by applying Bayes’ rule and ignoring terms that do not affect the optimization, as follows:

$$\displaystyle{ \widehat{\mathbf{W}} =\mathop{\mathrm{argmax}}_{\mathbf{W}}P(\mathbf{W}\vert \mathbf{X};\varTheta ) }$$
(13.2a)
$$\displaystyle{ \widehat{\mathbf{W}} =\mathop{\mathrm{argmax}}_{\mathbf{W}}\frac{P(\mathbf{X}\vert \mathbf{W};\varTheta )\,P(\mathbf{W};\varTheta )}{P(\mathbf{X};\varTheta )} }$$
(13.2b)
$$\displaystyle{ \widehat{\mathbf{W}} =\mathop{\mathrm{argmax}}_{\mathbf{W}}P(\mathbf{X}\vert \mathbf{W};\varTheta )\,P(\mathbf{W};\varTheta ) }$$
(13.2c)

Referring back to Fig. 13.1, we can identify different components of a speech recognition system with different elements of Eq. (13.2). The feature extraction module (Sect. 13.2.1) computes sequences of acoustic feature vectors, X, from audio input. The acoustic model (Sect. 13.2.1) computes P(X | W; Θ): the probability of the observed sequence of acoustic feature vectors, X, given a hypothesized sequence of words, W. The language model (Sect. 13.3.1) computes P(W; Θ), the prior probability of a hypothesized sequence of words. The search process (Sect. 13.3.2) corresponds to the argmax operator.
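For illustration, the decomposition in Eq. (13.2c) is usually applied in the log domain. The toy sketch below scores an explicit list of candidate word sequences and therefore stands in for the real search process; the function names are placeholders, not part of any actual decoder.

def best_hypothesis(hypotheses, acoustic_logprob, lm_logprob):
    """Select the word sequence W maximizing
    log P(X|W) + log P(W), i.e. Eq. (13.2c) in the log domain.
    `hypotheses` is an iterable of candidate word sequences; the two
    callbacks stand in for the acoustic model and the language model."""
    return max(hypotheses, key=lambda w: acoustic_logprob(w) + lm_logprob(w))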

1.2 Introduction to Arabic: A Speech Recognition Perspective

An excellent introduction to the Arabic language in the context of ASR can be found in Kirchhoff et al. [27]. Here we describe only a couple of special characteristics of Arabic that may not be familiar to non-Arabic speakers: vowelization and morphology.

  1. Vowelization

     Short vowels and other diacritics are typically not present in modern written Arabic text. Thus a written word can be ambiguous in its meaning and pronunciation. An Arabic speaker can resolve the ambiguity using prior knowledge and various contextual cues, such as syntactic and semantic information. Although automatic diacritization of Arabic has received considerable attention from NLP researchers, the proposed approaches are still error-prone, especially on non-formal texts. When designing an ASR system, we therefore need to consider whether to represent words in the vowelized or unvowelized form for the language model. Another consideration is whether the pronunciation model or dictionary should contain information derived from the diacritics, such as short vowels.

  2. Morphology

     Arabic is a morphologically rich language. Most Arabic words are built from a basic word form (the root or stem), to which affixes can be attached to form whole words. An Arabic white-space-delimited word may therefore be composed of zero or more prefixes, followed by a stem and zero or more suffixes. Because of the rich morphology, the Arabic vocabulary for an ASR system can become very large, on the order of one million words, compared with English, which typically has a vocabulary on the order of one hundred thousand words. The large vocabulary exacerbates the problem of data sparsity for language model estimation.

Arabic is the only Semitic language for which we have a sufficient amount of data, thanks to the DARPA GALE program; we have therefore decided to focus on developing and testing our ASR techniques only on Arabic. We will describe how our design decisions helped us to overcome some of the Arabic-specific issues using new, sophisticated language-dependent and language-independent modeling methods and good engineering.

It is important to note that the majority of the modeling techniques and algorithms outlined in this chapter have already been tested on a set of languages including English, Mandarin, Turkish, and Arabic. While we only discuss these ASR approaches for Arabic in this chapter, there is no reason to believe that they would not work for other Semitic languages. The fundamental reason is that we were able to tackle language-specific problems with language-independent techniques. As an illustration, we also discuss in this chapter the similarities between the two Semitic languages Arabic and Hebrew. We speculate that the language-dependent techniques used to address the challenges of Arabic ASR, such as the morphological richness of the language and diacritization, would also work for Hebrew.

1.3 Overview

This chapter is organized as follows. In the first two sections, we describe the two major components for state-of-the-art LVCSR: the acoustic model and the language model. For each model, we distinguish between language-independent and language-specific techniques. The language-specific techniques for the acoustic models include vowelization and modeling of dialects in decision trees. The language-specific parts of the language model include a neural network model that incorporates morphological and syntactic features. In Sect. 13.4, we describe how all these techniques are used to build a full-fledged LVCSR system that achieves error rates below 10 % on an Arabic broadcast news task. In Sect. 13.5, we describe techniques that allow us to port Modern Standard Arabic (MSA) models to other dialects, in our case to Levantine. We describe a dialect recognizer and how this can be used to identify relevant training subsets for both the acoustic and language model. We describe a decoding technique that allows us to use a set of dialect-specific models simultaneously during run-time for improved recognition performance. Section 13.6 describes the various data sets we used for system training, development, and evaluation.

2 Acoustic Modeling

2.1 Language-Independent Techniques

2.1.1 Feature Extraction

The goal of the feature extraction module is to compute a representation of the audio input that preserves as much information about its linguistic content as possible, while suppressing variability due to other phenomena such as speaker characteristics or the acoustic environment. Moreover, this representation should be compact (typically generating about 40 parameters for every 10 ms of audio) and should have statistical properties that are compatible with the Gaussian mixture models most often used for acoustic modeling. A very common form of feature extraction is Mel frequency cepstral coefficients (MFCCs) [16], and most other approaches to feature extraction, such as perceptual linear prediction (PLP) coefficients, employ similar steps to MFCCs, so we describe their computation in detail below.

The steps for computing MFCCs are as follows.

  1. The short-time fast Fourier transform (FFT) is used to compute an initial time-frequency representation of the signal. The signal is segmented into overlapping frames that are usually 20–25 ms in duration, with one frame produced every 10 ms; each frame is windowed, and a power spectrum is computed for the windowed signal.

  2. The power spectral coefficients are binned together using a bank of triangular filters that have constant bandwidth and spacing on the Mel frequency scale, a perceptual frequency scale with higher resolution at low frequencies and lower resolution at high frequencies. This reduces the variability of the speech features without severely impacting the representation of phonetic information. The filter bank usually contains 18–64 filters, depending on the task, while the original power spectrum has 128–512 points, so significant data reduction takes place here.

  3. The dynamic range of the features is reduced by taking the logarithm. This operation also means that the features can be made less dependent on the frequency response of the channel and on some speaker characteristics by removing the mean of the features over a sliding window, on an utterance-by-utterance basis, or for all utterances attributed to a given speaker.

  4. The features are then decorrelated and smoothed by taking a low-order discrete cosine transform (DCT). Depending on the task, 13–24 DCT coefficients are retained. A minimal implementation of these four steps is sketched after this list.
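The sketch below illustrates the four steps with plain NumPy. It is a minimal, illustrative implementation: the function names, window type, and default settings (16 kHz audio, 25 ms frames, 24 filters, 13 cepstra) are our own choices and not a description of any particular toolkit.

import numpy as np

def mel(f_hz):
    """Convert frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_inv(m):
    """Convert Mel values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_ms=25, shift_ms=10,
         n_filters=24, n_ceps=13, n_fft=512):
    """Compute MFCC features: framing, windowing, power spectrum,
    Mel filter bank, log compression, and DCT."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)

    # Step 1: short-time power spectrum of overlapping, windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift:i * shift + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2      # (n_frames, n_fft//2+1)

    # Step 2: triangular filters equally spaced on the Mel scale.
    mel_points = np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_inv(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        lo, ce, hi = bin_edges[j], bin_edges[j + 1], bin_edges[j + 2]
        fbank[j, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fbank[j, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    fb_energies = power @ fbank.T

    # Step 3: log compression of the filter-bank energies.
    log_fb = np.log(np.maximum(fb_energies, 1e-10))

    # Step 4: decorrelate with a low-order DCT and keep the first coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return log_fb @ dct.T                                    # (n_frames, n_ceps)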

In order to remove the effect of channel distortions, the cepstral coefficients are normalized so that they have zero mean and unit variance on a per utterance or a per speaker basis. The final feature stream includes the local temporal characteristics of the speech signal because these convey important phonetic information. Temporal context across frames can be incorporated by computing speed and acceleration coefficients (or delta and delta-delta coefficients) from the neighboring frames within a window of typically ±4 frames. These dynamic coefficients are appended to the static cepstra to form the final 39-dimensional feature vector. A more modern approach is to replace this ad-hoc heuristic with a linear projection matrix that maps a vector of consecutive cepstral frames to a lower-dimensional space. The projection is estimated to maximize the phonetic separability in the resulting subspace. The feature vectors thus obtained are typically modelled with diagonal covariance Gaussians. In order to make the diagonal covariance assumption more valid, the feature space is rotated by means of a global semi-tied covariance transform. This sequence of processing steps is illustrated in Fig. 13.2.
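As a concrete illustration of the normalization and dynamic features just described, the following sketch applies per-utterance cepstral mean and variance normalization and appends delta and delta-delta coefficients. The regression window (k = 2, so the delta-deltas draw on roughly ±4 frames of context) and the function names are illustrative choices, not a fixed standard.

import numpy as np

def cmvn(feats):
    """Per-utterance cepstral mean and variance normalization."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8
    return (feats - mean) / std

def deltas(feats, k=2):
    """Regression-based dynamic coefficients over a +/-k frame window."""
    padded = np.pad(feats, ((k, k), (0, 0)), mode="edge")
    num = sum(t * (padded[k + t:len(feats) + k + t] - padded[k - t:len(feats) + k - t])
              for t in range(1, k + 1))
    return num / (2 * sum(t * t for t in range(1, k + 1)))

def add_dynamic_features(static):
    """Append delta and delta-delta coefficients to the static cepstra."""
    d = deltas(static)
    dd = deltas(d)
    return np.hstack([static, d, dd])   # e.g. 13 static -> 39-dimensional vectors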

Fig. 13.2 Frontend pipeline

Fig. 13.3 Graphical model representation of a hidden Markov model

2.1.2 Acoustic Modeling

Speech recognition systems model speech acoustics using hidden Markov models (HMMs), which are generative models of the speech production process. A graphical model representation of an HMM is given in Fig. 13.3. An HMM assumes that a sequence of T acoustic observations \(\mathbf{X} = \mathbf{x}_{1},\mathbf{x}_{2},\mathop{\ldots },\mathbf{x}_{T}\) is generated by an underlying discrete-time stochastic process that is characterized by a single, discrete state \(q_{t}\) taking on values from an alphabet of N possible states. The state of the generative process is hidden, indicated by the shading in Fig. 13.3. The HMM also makes two important conditional independence assumptions. The first is that the state of the generating process at time t, \(q_{t}\), is conditionally independent of all earlier states and observations, given the state at the previous time step, \(q_{t-1}\). The second assumption is that the acoustic observation at time t, \(\mathbf{x}_{t}\), is conditionally independent of all other states and observations, given \(q_{t}\). These conditional independence assumptions are denoted by the directed edges in Fig. 13.3.

An HMM is specified by four elements: an alphabet of N discrete states for the hidden, generative process; a prior distribution over \(q_{0}\), the initial state of the generative process; \(P(q_{t}\vert q_{t-1})\), an N × N matrix of state transition probabilities; and \(\{P(\mathbf{x}_{t}\vert q_{t})\}\), a family of state-conditional probability distributions over the acoustic features. In most large-vocabulary speech recognition systems, the distribution of initial states is uniform over legal initial states. Similarly, the state transition matrix only allows a limited number of possible state transitions, but the distribution over allowable transitions is taken as uniform. What remains are the methods used to define the state alphabet, which will be described in detail later in this section, and \(\{P(\mathbf{x}_{t}\vert q_{t})\}\), the observation probability distributions.

The standard approach to modeling the acoustic observations is to use Gaussian mixture models (GMMs)

$$\displaystyle{ P(\mathbf{x}_{t}\vert q_{j}) =\sum _{m=1}^{M}w_{mj}\,{(2\pi )}^{-\frac{k}{2} }\vert \boldsymbol{\varSigma }_{mj}{\vert }^{-\frac{1}{2} }\,{e}^{-\frac{1}{2} {(\mathbf{x}_{t}-\boldsymbol{\mu }_{mj})}^{T}\boldsymbol{\varSigma }_{mj}^{-1}(\mathbf{x}_{t}-\boldsymbol{\mu }_{mj})} }$$
(13.3)

where state \(q_{j}\) has M k-dimensional Gaussian mixture components with means \(\boldsymbol{\mu }_{mj}\) and covariance matrices \(\boldsymbol{\varSigma }_{mj}\), as well as mixture weights \(w_{mj}\) such that \(\sum _{m}w_{mj} = 1\). Note that in most cases the covariance matrices are constrained to be diagonal. GMMs are a useful model for state-conditional observation distributions because they can model in a generic manner variability in the speech signal due to various factors, because their parameters can be estimated efficiently using the EM algorithm, and because their mathematical structure allows for forms of speaker and environmental adaptation based on linear regression.
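A minimal sketch of evaluating Eq. (13.3) in the log domain for the diagonal-covariance case is given below. The log-sum-exp trick used to avoid underflow is a standard numerical device and our own addition, not something prescribed by the chapter.

import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log of Eq. (13.3) for one feature vector x under a diagonal-covariance GMM.

    weights   : (M,)   mixture weights, summing to one
    means     : (M, k) component means
    variances : (M, k) diagonal covariance entries
    """
    k = x.shape[0]
    # Per-component log densities: log w_m + log N(x; mu_m, Sigma_m).
    log_det = np.sum(np.log(variances), axis=1)                # log |Sigma_m|
    mahala = np.sum((x - means) ** 2 / variances, axis=1)      # (x-mu)^T Sigma^-1 (x-mu)
    log_comp = (np.log(weights)
                - 0.5 * (k * np.log(2 * np.pi) + log_det + mahala))
    # Log-sum-exp over components for numerical stability.
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))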

To understand how the HMM state alphabet is defined, it is necessary to understand how words are modeled in large-vocabulary speech recognition systems. Speech recognition systems work with a finite but large vocabulary (tens to hundreds of thousands) of words that may be recognized. Words are modeled as sequences of more basic units: either phonemes or graphemes. Phonemes are the basic sound units of a language, and are thus a very natural compositional unit for word modeling. However, using phonemes entails a significant amount of human effort in the design of the speech recognition dictionary: somebody has to produce one or more phonetic pronunciations for every word in the dictionary. This cost has driven the use of graphemic dictionaries, in which words are modeled directly as sequences of letters. While this approach works well for some languages, such as Finnish, that have essentially phonetic spellings of words, it works less well for other languages. For example, consider the pair of letters “GH” in English, which can sound like “f” as in the word “enough,” like a hard “g” as in the word “ghost,” or can be silent as in the word “right.”

Fig. 13.4 A typical 3-state, left-to-right HMM, drawn as a finite-state machine. The beginning (b), middle (m), and end (e) of a basic speech unit are modeled separately to characterize the temporal structure of the unit

To characterize the temporal structure of the basic speech units (phonemes or graphemes), each unit is modeled with multiple states. A popular choice for modeling is illustrated in Fig. 13.4, where the HMM topology is strictly left-to-right with self loops on the states. The 3-state model shown can have separate observation distributions for the beginning, middle, and end of the unit. If different units have different durations, these may be represented by allocating a greater or fewer number of states to different units.

A final, key ingredient in modeling of speech acoustics is the concept of context dependence. While phonemes are considered to be the basic units of speech, corresponding to specific articulatory gestures, the nature of the human speech apparatus is such that the acoustics of a phoneme are strongly and systematically influenced by the context in which it appears. Thus, the “AE” sound in the word “man” will often be somewhat nasalized because it is produced between two nasal consonants, while the “AE” in “rap” will not be nasalized. Context-dependence can also help with the ambiguity in going from spelling to sound in graphemic systems. Returning to the “GH” example from above, we know for English that a “GH” that occurs at the end of a word and is preceded by the letters “OU” is likely to be pronounced “f”.

Although it produces more detailed acoustic models, context-dependence requires some form of parameter sharing to ensure that there is sufficient data to train all the models. Consider, for example, triphone models that represent each phone in the context of the preceding and following phone. In a naive implementation that represents each triphone individually, a phone alphabet of 40 phones would induce a triphone alphabet of 40³ = 64,000 triphones. If phones are modeled with 3-state, left-to-right HMMs as shown above, this would lead to 3 × 64,000 = 192,000 different models. Due to phonotactic constraints, some of these models would not occur, but even ignoring those, the model set would be too large to be practical.

The standard solution to this explosion in the number of models is to cluster the model set using decision trees. Given an alignment of some training data, defined as a labeling of each frame with an HMM state (e.g., the middle state of “AE”), all samples sharing the same label can be collected together and a decision tree can be grown that attempts to split the samples into clusters of similar samples at the leaves of the tree. The questions that are asked to perform the splits are questions about the context of the samples: the identities of the phonetic or graphemic units to the left and right; the membership of the neighboring units in classes such as “vowels,” “nasals,” or “stops;” and whether or not a word boundary occurs at some position to the left or right. A popular splitting criterion is data likelihood under a single, diagonal-covariance Gaussian distribution. In this case, the decision trees can be trained efficiently by accumulating single-Gaussian sufficient statistics for each context-dependent state, and then growing the decision trees. Once a forest of decision trees has been grown for all units, they are pruned to the desired size. Typically, a few thousand to ten thousand context-dependent states will be defined for a large-vocabulary speech recognition system.
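The splitting criterion described above can be computed directly from the accumulated statistics. The sketch below, with illustrative function and variable names, scores one candidate question by the gain in single-Gaussian log-likelihood; at each node, the question with the largest gain is selected.

import numpy as np

def split_gain(stats_yes, stats_no):
    """Likelihood gain from splitting a node's data into two children,
    under a single diagonal-covariance Gaussian per node.

    Each stats argument is a tuple (n, sum_x, sum_x2) of zeroth-, first-,
    and second-order sufficient statistics accumulated over the frames
    answering the candidate question with yes / no."""
    def log_lik(n, sum_x, sum_x2):
        # Up to additive terms that cancel in the gain, the log-likelihood of
        # n frames under their ML Gaussian is -0.5 * n * sum(log variance).
        mean = sum_x / n
        var = np.maximum(sum_x2 / n - mean ** 2, 1e-6)
        return -0.5 * n * np.sum(np.log(var))
    n_y, sx_y, sxx_y = stats_yes
    n_n, sx_n, sxx_n = stats_no
    parent = (n_y + n_n, sx_y + sx_n, sxx_y + sxx_n)
    return log_lik(*stats_yes) + log_lik(*stats_no) - log_lik(*parent)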

The training process for speech recognition systems is usually iterative, beginning with simple models having few parameters, and moving to more complex models having a larger number of parameters. In the case of a new task, where there are no existing models that are adequately matched to it, speech recognition training begins with a flat start procedure. In a flat start, very simple models with no context-dependence and only a single Gaussian per state are initialized directly from the reference transcripts as follows. First, each transcript is converted from a string of words to a string of phones by looking up the words in the dictionary. If a word has multiple pronunciations, a pronunciation is selected at random. This produces a sequence of N phones. Next, the corresponding sequence of acoustic features is divided into N equal-length segments, and sufficient statistics for each model are accumulated from its segments. Once the models are initialized, they are refined by running the EM algorithm, and the number of Gaussian mixture components per state is gradually increased.
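The flat-start initialization amounts to a uniform segmentation of each utterance. A simplified sketch is shown below; the data structures (a dictionary mapping words to pronunciations, one feature matrix per utterance) and the state-naming scheme are assumptions made for illustration only.

import random
import numpy as np

def flat_start_stats(utterances, dictionary, n_states_per_phone=3):
    """Accumulate single-Gaussian sufficient statistics for a flat start.

    `utterances` yields (features, words) pairs and `dictionary` maps a word
    to a list of pronunciations (phone sequences); both are assumed formats.
    Each utterance's frames are divided evenly across the HMM states of its
    (randomly chosen) pronunciations, as described above."""
    stats = {}   # state name -> [frame count, feature sum, squared-feature sum]
    for feats, words in utterances:
        states = []
        for w in words:
            phones = random.choice(dictionary[w])        # pick one pronunciation
            states += [f"{p}-{s}" for p in phones for s in range(n_states_per_phone)]
        bounds = np.linspace(0, len(feats), len(states) + 1).astype(int)
        for state, lo, hi in zip(states, bounds[:-1], bounds[1:]):
            seg = feats[lo:hi]
            if state not in stats:
                dim = feats.shape[1]
                stats[state] = [0.0, np.zeros(dim), np.zeros(dim)]
            stats[state][0] += len(seg)
            stats[state][1] += seg.sum(axis=0)
            stats[state][2] += (seg ** 2).sum(axis=0)
    return stats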

The models that are produced by a flat start are coarse, and usually have poor transcription accuracy; however, they are sufficient to perform forced alignment of the training data. In the forced alignment procedure, the reference word transcripts are expanded into a graph that allows for the insertion of silence between words and the use of the different pronunciation variants in the dictionary, and then the best path through the graph given an existing set of models and the acoustic features for the utterance is found using dynamic programming. The result of this procedure is an alignment of the training data in which every frame is labeled with an HMM state. Given an alignment, it is possible to train context-dependent models, as described above. Typically, for a new task, several context-dependent models of increasing size (in terms of the number of context-dependent HMM states and the total number of Gaussians) will be trained in succession, each relying on a forced alignment from the previous model.

For ASR systems, however, we are ultimately interested in recognition accuracy, and we aim to train the acoustic model discriminatively so as to achieve the lowest word error rate on unseen test data. Directly optimizing the word error rate is hard because it is not differentiable. Alternative approaches optimize smooth objective functions related to word error rate (WER), such as the minimum classification error (MCE), maximum mutual information (MMI), and minimum phone error (MPE) criteria. Discriminative training can be applied either to the model parameters (Gaussian means and variances) or to the feature vectors. The latter is done by computing a transformation called feature-space MPE (fMPE) that provides time-dependent offsets to the regular feature vectors. The offsets are obtained by a linear projection from a high-dimensional space of Gaussian posteriors which is trained so as to enhance the discrimination between correct and incorrect word sequences. Currently, the most effective objective function for model and feature-space discriminative training is called boosted MMI and is inspired by large-margin classification techniques.

2.1.3 Speaker Adaptation

Speaker adaptation aims to compensate for the acoustic mismatch between training and testing environments and plays an important role in modern ASR systems. System performance is improved by conducting speaker adaptation during training as well as at test time, using speaker-specific data. Speaker normalization techniques operating in the feature domain aim at producing a canonical feature space by eliminating as much of the inter-speaker variability as possible. Examples of such techniques are vocal tract length normalization (VTLN), where the goal is to warp the frequency axis to match the vocal tract length of a reference speaker, and feature-space maximum likelihood linear regression (fMLLR), which applies an affine transform to the features so as to maximize the likelihood under the current model. The model-based counterpart of fMLLR, called MLLR, computes a linear transform of the Gaussian means so as to maximize the likelihood of the adaptation data under the transformed model.

2.2 Vowelization

One challenge in Arabic speech recognition is that there is a systematic mismatch between written and spoken Arabic. With the exception of texts for beginning readers and important religious texts such as the Qur’an, written Arabic omits eight diacritics that denote short vowels, nunation, consonant doubling, and the absence of a vowel:

  1. fatha /a/,
  2. kasra /i/,
  3. damma /u/,
  4. fathatayn (word-ending /an/),
  5. kasratayn (word-ending /in/),
  6. dammatayn (word-ending /un/),
  7. shadda (consonant doubling), and
  8. sukun (no vowel).

There are two approaches to handling this mismatch between the acoustics and the transcripts. In the “unvowelized” approach, words are modeled graphemically, in terms of their letter sequences, and the acoustics corresponding to the unwritten diacritics are implicitly modeled by the Gaussian mixtures in the acoustic model. In the “vowelized” approach, words are modeled phonemically, in terms of their sound sequences, and the correct vowelization of transcribed words is inferred during training. Note that even when vowelized models are used, the word error rate calculation is based on unvowelized references. Diacritics are typically not orthographically represented in Arabic texts, and diacritization is generally not necessary to make a transcript readable by Arabic-literate readers; thus, Arabic ASR systems typically do not output fully diacritized transcripts. The vowelized forms are therefore mapped back to unvowelized forms in scoring, which is also the NIST scoring scheme. In addition, the machine translation systems we currently use require unvowelized input. An excellent discussion of the Arabic language and automatic speech recognition can be found in Kirchhoff et al. [27].
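Mapping vowelized hypotheses back to unvowelized forms amounts to stripping diacritic symbols. The sketch below assumes a Buckwalter-style transliteration in which the eight diacritics listed above are written as a, i, u, F, K, N, ~, and o; these symbol choices are an assumption made for illustration.

# Assumed Buckwalter-style symbols for the eight diacritics listed above:
# a/i/u = short vowels, F/K/N = nunation, ~ = shadda, o = sukun.
DIACRITICS = set("aiuFKN~o")

def unvowelize(word):
    """Map a vowelized (diacritized) form back to its unvowelized spelling,
    as done when scoring vowelized hypotheses against unvowelized references."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

# Example: unvowelize("Aabawohu") == "Abwh"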

One of the biggest challenges in building vowelized models is initialization: how to obtain a first set of vowelized models when only unvowelized transcripts are available. One approach is to have experts in Arabic manually vowelize a small training set [33]. The obvious disadvantage is that this process is quite labor intensive, which motivates researchers to explore automated methods [2]. Following the recipe in [2], we discuss our bootstrap procedure and some issues related to scaling up to large vocabularies.

2.2.1 Pronunciation Dictionaries

The words in the vocabulary of both the vowelized and unvowelized systems are assumed to be the same, and they do not contain any diacritics, just as they appear in most written text. In the unvowelized system, the pronunciation of a word is modeled by the sequence of letters in the word. For example, there is a single unvowelized pronunciation of the word Abwh.

Abwh(01)   A b w h

The short vowels are not explicitly modeled, and it is assumed that speech associated with the short vowels will be implicitly modeled by the adjacent phones. In other words, short vowels are not present in our phoneme set; acoustically, they are modeled as part of the surrounding consonant acoustic models.

In the vowelized system, however, short vowels are explicitly modeled in both training and decoding. We use the Buckwalter Morphological Analyzer (Version 2.0) [7] and the Arabic Treebank to generate vowelized variants of each word. The pronunciation of each variant is modeled as the sequence of letters in the diacritized word, including the short vowels. For shadda (consonant doubling), an additional consonant is added, and for sukun (no vowel), nothing is added. For example, there are four vowelized pronunciations of the written word Abwh.

Abwh (deny/refuse + they + it/him)         A a b a w o h u
Abwh (desire/aspire + they + it/him)       A a b b u w h u
Abwh (father + its/it)                     A a b u w h u
Abwh (reluctant/unwilling + his/its)       A b u w h u

The vowelized training dictionary has 243,368 vowelized pronunciations, covering a word list of 64,496 words. The vowelization rate is about 95 %. For the remaining 5 % of words that are not covered, we back off to unvowelized forms.
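The resulting lexicon combines vowelized pronunciations with a graphemic back-off, along the lines of the sketch below. The function and the shape of its inputs are illustrative assumptions; the actual dictionaries are built with the Buckwalter analyzer and the Arabic Treebank as described above.

def build_lexicon(vocabulary, vowelized_prons):
    """Combine vowelized pronunciations with a graphemic back-off.

    `vowelized_prons` is assumed to map an (unvowelized) word to the list of
    phone sequences derived from its diacritized analyses; any word it does
    not cover backs off to its plain letter sequence, as in the unvowelized
    system. Also returns the vowelization rate of the vocabulary."""
    lexicon, covered = {}, 0
    for word in vocabulary:
        prons = vowelized_prons.get(word)
        if prons:
            lexicon[word] = prons
            covered += 1
        else:
            lexicon[word] = [list(word)]          # graphemic back-off
    rate = 100.0 * covered / max(len(vocabulary), 1)
    return lexicon, rate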

In the following subsections, we focus on the vowelized system. Since written transcripts of audio data do not usually contain short vowel information, how does one train the initial acoustic model? One could use a small amount of data with manually vowelized transcripts to bootstrap the acoustic model. Alternatively, one could perform flat-start training.

Table 13.1 Comparison of different initialization methods for vowelized models

2.2.2 Flat-Start Training vs. Manual Transcripts

Our flat-start training procedure initializes context-independent HMMs by distributing the data equally across the HMM state sequence. We start with one Gaussian per state, and increase the number of parameters using mixture splitting interleaved within 30 forward/backward iterations. Now, the problem is that we have 3.8 vowelized pronunciations per word on average, but distributing the data requires a linear state graph for the initialization step. To overcome this problem, in the first iteration of training we randomly select pronunciation variants. All subsequent training iterations operate on the full state graph representing all possible pronunciations.

We compare this approach to manually vowelized transcripts where the correct pronunciation variant is given. BBN distributed 10 h of manually vowelized development data (BNAD-05, BNAT-05) that we used to bootstrap vowelized models. These models are then used to compute alignments for the standard training set (FBIS + TDT4). A new system is then built using fixed alignment training, followed by a few forward/backward iterations to refine the models. The error rates in Table 13.1 suggest that manually vowelized transcripts are not necessary. The fully automatic procedure is only 0.2 % worse. We opted for the fully automatic procedure in all our experiments, including the evaluation system.

2.2.3 Short Models for Short Vowels

We noticed that the vowelized system performed poorly on broadcast conversational speech. It appeared that the speaking rate is much faster than in broadcast news, and that the vowelized state graph is too large to be traversed with the available speech frames. The acoustic models do not permit state skipping. One solution is to model the three short vowels with a shorter, 2-state HMM topology. The results are shown in Table 13.2. The improvements on RT-04 (broadcast news) are relatively small; however, there is a 1.5 % absolute improvement on BCAD-05 (broadcast conversations).

Table 13.2 Comparison of different HMM topologies for short vowels

2.2.4 Vowelization Coverage of the Test Vocabulary

As mentioned before, we back off to unvowelized forms for those words not covered by Buckwalter and Arabic Treebank. The coverage for the training dictionary is pretty high at 95 %. On the other hand, for a test vocabulary of 589k words, the vowelization rate is only 72.6 %. The question is whether it is necessary to manually vowelize the missing words, or whether we can get around that by backing off to the unvowelized pronunciations. One way to test this – without actually providing vowelized forms for the missing words – is to look at the OOV/WER ratio. The assumption is that the ratio is the same for a vowelized and an unvowelized system if the dictionary of the vowelized system does not pose any problems. More precisely, if we increase the vocabulary and we get the same error reduction for the vowelized system, then, most likely, there is no fundamental problem with the vowelized pronunciation dictionary.

Table 13.3 OOV/WER ratio for an unvowelized system on RT-04

For the unvowelized system, when increasing the vocabulary from 129k to 589k words, we reduce the OOV rate from 2.9 % to 0.8 %, and we reduce the error rate by 1.3 % (Table 13.3). For the vowelized system, we see a similar error reduction of 1.5 % for the same vocabulary increase (Table 13.4). The system has almost 2 million vowelized pronunciations for a vocabulary of 589k words. The vowelization rate is about 72.6 %; in other words, 27.4 % of our list of 589k words are unvowelized in our dictionary. Under the assumption that we can expect the same OOV/WER ratio for both the vowelized and unvowelized system, the results in Tables 13.3 and 13.4 suggest that the back-off strategy to the unvowelized forms is valid for our vowelized system.

Table 13.4 OOV/WER ratio for a vowelized system on RT-04

2.2.5 Pronunciation Probabilities

Our decoding dictionary has about 3.3 pronunciations per word on average. Therefore, estimating pronunciation probabilities is essential to improve discrimination between the vowelized forms. We estimated the pronunciation probabilities by counting the variants in the 2,330-h training set.
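A sketch of this estimation is given below: unigram pronunciation probabilities are simply the relative frequencies of the aligned pronunciation variants per word. The input format and the small probability floor are our own illustrative choices, not part of the system description.

from collections import Counter, defaultdict

def pronunciation_probabilities(aligned_tokens, floor=1e-4):
    """Unigram pronunciation probabilities from aligned training data.

    `aligned_tokens` is assumed to yield (word, pronunciation_variant) pairs,
    e.g. taken from a forced alignment of the training set. Counts are turned
    into relative frequencies per word, with a small floor (an assumption) so
    that rare variants do not receive zero probability."""
    counts = defaultdict(Counter)
    for word, variant in aligned_tokens:
        counts[word][variant] += 1
    probs = {}
    for word, variant_counts in counts.items():
        total = sum(variant_counts.values())
        probs[word] = {v: max(c / total, floor) for v, c in variant_counts.items()}
    return probs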

Table 13.5 Effect of pronunciation probabilities on WER
Table 13.6 Comparison of vowelized and unvowelized models at various adaptation passes, no pronunciation probabilities. RT-04 test set

The setup consists of ML models, and includes all the adaptation steps (VTLN, FMLLR, MLLR). The test sets are RT-04 (Broadcast News) and BCAD-05 (Broadcast Conversations). Adding pronunciation probabilities gives consistent improvements between 0.9 % and 1.1 % on all test sets (Table 13.5). Also, pronunciation probabilities are crucial for vowelized models; they almost double the error reduction from vowelization. We investigated several smoothing techniques and longer word contexts, but did not see any further improvements over simple unigram pronunciation probabilities.

2.2.6 Vowelization, Adaptation, and Discriminative Training

In this section, we summarize additional experiments with vowelized models. One interesting observation is that the gain from vowelization decreases significantly when more refined adaptation and training techniques are used. Table 13.6 shows the error rate of unvowelized and vowelized models with and without adaptation. The relative improvement from vowelization is more than 15 % at the speaker-independent level. However, after applying several normalization and adaptation techniques, the gain drops to 7.7 % at the MLLR level.

An even more drastic reduction of the vowelization gain is observed after discriminative training (Table 13.7). The vowelized setup includes the 2-state HMMs for short vowels and pronunciation probabilities. While discriminative training reduces the error rate by 4.9 % for the unvowelized setup, we observed an error reduction of only 3.4 % for the vowelized models.

Table 13.7 Comparison of vowelized and unvowelized models before and after discriminative training. Vowelized system uses pronunciation probabilities and 2-state HMMs for short vowels. Both systems use speaker adaptation: VTLN, FMLLR, and MLLR. DEV-07 test set

It seems that standard adaptation and discriminative techniques can at least partially compensate for the invalid model assumption of ignoring short vowels, and the improvements from vowelization are subsequently reduced when using better adaptation and training techniques. Thus, well-engineered speech recognition systems using only language-independent techniques can perform almost as well as systems using language-specific methods.

2.3 Modeling of Arabic Dialects in Decision Trees

As shown in Table 13.28, the training data comes from a variety of sources, and there are systematic variations in the data, including dialects (Modern Standard Arabic, Egyptian Arabic, Gulf Arabic, etc.), broadcasters (Al Jazeera, Al Arabiya, LBC, etc.), and programs (Al Jazeera morning news, Al Jazeera news bulletin, etc.).

The problem we face is how to build acoustic models on diverse training data. Simply adding more training data from a variety of sources does not always improve system performance. To demonstrate this, we built two acoustic models with identical configurations. In one case, the model was trained on GALE data (2,330 h, including unsupervised BN-03), while in the second case we added 500 h of TRANSTAC data to the training set. TRANSTAC data contains Iraqi Arabic (Nadji spoken Arabic and Mesopotamian Arabic). Both GALE and TRANSTAC data are Arabic data, but they represent very different dialects and styles. The model trained on additional TRANSTAC data is 0.4 % worse than our GALE model (see Table 13.8).

Table 13.8 Acoustic models trained without and with additional TRANSTAC data

Both acoustic models used 400k Gaussians, a comparatively large number that should be able to capture the variability of different data sources. Furthermore, we did not find that increasing the number of Gaussians could offset the reduced performance caused by the addition of unmatched training data.

Because adding more training data will not always improve performance, we need to find which part of the training data is relevant. Training separate models for each category or dialect requires sufficient training data for each category, significantly more human and computational effort, and algorithms that reliably cluster training and test data into meaningful categories. We propose instead to model dialects directly in the decision trees: a data-driven method for building dialect-specific models without splitting the data into different categories.

Similar approaches using additional information in decision trees can be found in the literature. Reichl and Chou [37] used gender information to specialize decision trees for acoustic modeling. Speaking rate, SNR, and dialects were used in [19], and questions about the speaking mode were used in [45] to build models for hyperarticulated speech.

2.3.1 Decision Trees with Dialect Questions

We extend our regular decision tree algorithm by including non-phonetic questions about dialects. The question set contains our normal phonetic context questions, as well as dynamic questions regarding dialect. Dialect questions compete with phonetic context questions to split nodes in the tree. If dialect information is irrelevant for some phones, it will simply be pruned away during tree training.

The training of dialect-specific trees and Gaussian mixture models is straightforward. The decision tree is grown in a top-down clustering procedure. At each node, all valid questions are evaluated, and the question with the best increase in likelihood is selected. We use single Gaussians with diagonal covariances as node models. Statistics for each unclustered context-dependent HMM state are generated in one pass over the training data prior to the tree growing. Dialect information is added during HMM construction by tagging phones with additional information. The additional storage cost this entails is quite significant: the number of unique, unclustered context-dependent HMM states increases roughly in proportion to the number of different tags used.

The questions used for tree clustering are written in conjunctive normal form. Literals are basic questions about the phone class or tag for a given position. This allows for more complex questions such as Is the left context a vowel and is the channel Al Jazeera (−1 = Vowel && 0 = AlJazeera). Similarly, more complex questions on the source may be composed. For example, to ask for Al Jazeera, one would write (0 = AlJazeeraMorning or 0 = AlJazeeraAfternoon). The idea is to allow a broad range of possible questions, and to let the clustering procedure select the relevant questions based on the training data. The questions cover all the channel and dialect information available from the audio filenames.
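One possible encoding of such questions is sketched below: a question is a list of clauses (a conjunction), and each clause is a list of (position, class) literals (a disjunction). The class and tag names are illustrative, not the actual question inventory.

# A question is a conjunction of clauses; each clause is a disjunction of
# (position, class) literals testing the phone class or tag at that context
# position. Names such as "Vowel" or "AlJazeeraMorning" are illustrative.
def matches(question, context):
    """`context` maps a relative position (e.g. -1, 0, +2) to the set of
    classes/tags that hold for the phone at that position."""
    return all(any(cls in context.get(pos, set()) for pos, cls in clause)
               for clause in question)

# "Is the left context a vowel and is the channel Al Jazeera?"
q1 = [[(-1, "Vowel")], [(0, "AlJazeeraMorning"), (0, "AlJazeeraAfternoon")]]
example_context = {-1: {"Vowel", "A"}, 0: {"b", "AlJazeeraMorning"}, 1: {"w"}}
assert matches(q1, example_context)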

We used this technique to build a dialect-dependent decision tree for the acoustic models trained on the combined GALE and TRANSTAC data. We generated a tree with 15,865 nodes, and 8,000 HMM states. Approximately 44 % of the states depend on the dialect tag, so the remaining 56 % of the models are shared between two very different data sources: GALE and TRANSTAC.

2.3.2 Building Static Decoding Graphs for Dynamic Trees

Training dialect-specific models is relatively easy; however, decoding with such models is more complicated if a statically compiled decoding graph is used. The problem is that the decision tree contains dynamic questions that can be answered only at run-time, and not when the graph is compiled. The solution for this problem is to separate the decision tree into two parts: a static part containing only phonetic questions, and a dynamic part for the dialect questions. The decision tree is reordered such that no dynamic question occurs above a static question. The static part of the decision tree can be compiled into a decoding graph as usual, while the dynamic part of the tree is replaced by a set of virtual leaves. The decoder maintains a lookup table that transforms each virtual leaf to a corresponding dialect-specific leaf at run-time.

Fig. 13.5 Decision tree with dialect questions. Each question is of the format ContextPosition = Class. Leaves are marked with HMM states A-b-0, … (phone /A/, begin state)

Fig. 13.6 Decision tree after one transformation step. The original root 0 = MSA was replaced by the left child −1 = Vowel, and a new copy of 0 = MSA was created

Fig. 13.7 Reordered decision tree. Dynamic questions are replaced by virtual leaves

In the following we explain how to reorder the decision tree such that dynamic questions do not occur above static questions. Figures 13.5–13.7 illustrate the process. In this example, the root of the tree is marked with the question 0 = MSA. If the center phone is marked as Modern Standard Arabic (MSA), the right branch is chosen; otherwise, the left branch.

The reordering algorithm consists of a sequence of transformation steps. In each step, a dynamic question node is pushed down one level by replacing it with one of its children nodes. Figure 13.6 shows the tree after applying one transformation step. In each transformation step, the selected dynamic question node is duplicated together with one of its branches. In this example, the right branch starting with +2 = SIL is duplicated. The node to be moved up (promoted) ( −1 = Vowel in Fig. 13.6) is separated from its subtrees and moved up one level, with the duplicated subtrees becoming its children. The subtrees that were originally children of the promoted node become its grandchildren. The resulting tree is equivalent to the original tree, but the dynamic question is now one level deeper.
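The sketch below gives one way to implement this reordering on a binary tree. The Node class, the set of questions treated as dynamic, and the recursion structure are our own reconstruction for illustration; the actual implementation may differ.

import copy

DYNAMIC_QUESTIONS = {"0=MSA", "0=AlJazeera"}   # illustrative dialect/channel tags

class Node:
    """Binary decision-tree node; `question` is None at a leaf."""
    def __init__(self, question=None, no=None, yes=None, state=None):
        self.question, self.no, self.yes, self.state = question, no, yes, state

def is_dynamic(question):
    # Placeholder test: only the listed tag questions are answered at run-time.
    return question in DYNAMIC_QUESTIONS

def reorder(node):
    """Push dynamic questions below static ones; returns the new subtree root."""
    if node is None or node.question is None:
        return node
    node.no, node.yes = reorder(node.no), reorder(node.yes)
    if not is_dynamic(node.question):
        return node
    for static_child, other, static_on_no in ((node.no, node.yes, True),
                                              (node.yes, node.no, False)):
        if static_child.question is None or is_dynamic(static_child.question):
            continue
        # Promote the static child: duplicate the dynamic node (together with
        # the other branch) once for each outcome of the promoted question.
        def dyn(subtree):
            return Node(node.question,
                        no=subtree if static_on_no else copy.deepcopy(other),
                        yes=copy.deepcopy(other) if static_on_no else subtree)
        return reorder(Node(static_child.question,
                            no=dyn(static_child.no), yes=dyn(static_child.yes)))
    return node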

Table 13.9 Reordered tree statistics for GALE + TRANSTAC setup

The reordering procedure terminates when no transformation step can be applied. In the worst case, this procedure causes the decision tree’s size to grow exponentially; however, in practice we observe only moderate growth. Table 13.9 summarizes the tree statistics after reordering. The number of internal nodes grew by only a factor of 2, which is easily managed for decoding graph construction.

The last step is to separate the static and dynamic tree parts. The dynamic question nodes are replaced by virtual leaves for the graph construction. The virtual leaves correspond to lookup tables that map virtual leaves to physical HMM states at run-time. The static decoding graph can now be constructed using the tree with virtual leaves. At run-time, dialect information is available, and virtual leaves can be mapped to the corresponding physical HMM states for acoustic score computation.

2.3.3 Experiments

We use our vowelized Arabic model as a baseline in our experiments. The vocabulary has about 737,000 words, and 2.5 million pronunciations. The language model is a 4-gram LM with 55M n-grams. Speaker adaptation includes VTLN and FMLLR. All models are ML trained. In addition to the GALE EVAL-06 and DEV-07 test sets, we also used a TRANSTAC test set comprising 2 h of audio. The dialect labels are derived from the audio file names. The file names encode TV channel and program information.

Table 13.10 Comparison of regular tree and tree with dialect questions. GALE DEV-07 test set
Table 13.11 Comparison of regular tree and tree with dialect questions. TRANSTAC test set
Table 13.12 Comparison of regular tree and tree with dialect questions with unsupervised training data. GALE EVAL-06 and DEV-07 test sets

In the first experiment we train four acoustic models. Each model has 8,000 states and 400,000 Gaussians. The models are trained on either 2,330 h of GALE data or on the GALE data plus 500 h of TRANSTAC data. For each training set, we train one model using the regular (phonetic context only) question set and one using phonetic and dialect questions. The test set is DEV-07. The results are summarized in Table 13.10. For the GALE model, we see an improvement of 0.6 % WER. The improvement for the GALE + TRANSTAC training set is slightly higher, 0.9 %. The results suggest that the decision tree with dialect questions can better cope with diverse, and potentially mismatched, training data.

In the second experiment (Table 13.11), we use the same set of acoustic models as before, but the vocabulary and language model are now TRANSTAC-specific and the test set is drawn from TRANSTAC data. Adding TRANSTAC training data improves the error rate from 35.9 to 25.9 %. Adding the dialect-specific questions to the tree-building process improves the error rate by an additional 1.2 %. We did not test the dialect tree trained on GALE data only, since that tree does not contain any TRANSTAC-related questions.

In the final experiment, we use a large amount of unsupervised training data from our internal TALES data collection. The acoustic model was trained on 2,330 h of GALE data plus 5,600 h of unsupervised training data. The results are shown in Table 13.12. The dialect-dependent decision tree reduces the error rate by 0.6 % on EVAL-06 and 0.8 % on DEV-07. While adding more unsupervised training data does not help if large amounts of supervised training data are available, we observe that the dialect tree is able to compensate for adding “unuseful” data (Table 13.12).

3 Language Modeling

3.1 Language-Independent Techniques for Language Modeling

3.1.1 Base N-Grams

The standard speech recognition model for word sequences is the n-gram language model. To see how an n-gram model is derived, consider expanding the joint probability of a sequence of M words in terms of word probabilities conditioned on word histories:

$$\displaystyle{ P(w_{M},w_{M-1},\mathop{\ldots },w_{3},w_{2},w_{1}) = P(w_{1}) \times P(w_{2}\vert w_{1}) \times P(w_{3}\vert w_{2},w_{1}) \times \cdots \times P(w_{M}\vert w_{M-1},\mathop{\ldots },w_{3},w_{2},w_{1}) }$$
(13.4)

Given that the words are drawn from a dictionary containing tens to hundreds of thousands of words, it is clear that the probability of observing a given sequence of words becomes vanishingly small as the length of the sequence increases. The solution is to make a Markov assumption that the probability of a word is conditionally independent of previous words, given a history of fixed length h. That is,

$$\displaystyle{ P(w_{M},w_{M-1},\mathop{\ldots },w_{3},w_{2},w_{1}) \approx P(w_{1}) \times P(w_{2}\vert w_{1}) \times P(w_{3}\vert w_{2},w_{1}) \times \cdots \times P(w_{M}\vert w_{M-1},w_{M-2}) }$$
(13.5)

A model that makes a first-order Markov assumption, conditioning words only on their immediate predecessors, is called a bigram model, because it deals with pairs of words. Likewise, a model that makes a second-order Markov assumption is called a trigram model because it deals with word triplets: two-word histories and the predicted word. Equation (13.5) illustrates a trigram model.

N-gram language models are trained by collecting a large amount of text and simply counting the number of times each word occurs with each history. However, even with the n-gram assumption, there is still a problem with data sparsity: a model based only on counting the occurrences of words and histories in some training corpus will assign zero probability to legal word sequences that the speech recognition system should be able to produce. To cope with this, various forms of smoothing are used on language models that reassign some probability mass from observed events to unobserved events. The models described below generally use modified Kneser–Ney smoothing [1, 13].
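As a simple, concrete example of n-gram estimation with smoothing, the sketch below trains a bigram model with absolute discounting and a unigram back-off. This is a simplified stand-in for the modified Kneser–Ney smoothing used in the chapter, intended only to show how probability mass is redistributed from observed to unobserved events.

from collections import Counter, defaultdict

def train_bigram(sentences, discount=0.75):
    """Bigram LM with absolute discounting, interpolated with a unigram model."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words)
        for h, w in zip(words[:-1], words[1:]):
            bigrams[h][w] += 1
    total = sum(unigrams.values())

    def prob(w, h):
        p_uni = unigrams[w] / total                        # back-off distribution
        counts = bigrams.get(h)
        if not counts:
            return p_uni
        n_h = sum(counts.values())
        lam = discount * len(counts) / n_h                 # mass reserved for unseen words
        return max(counts[w] - discount, 0.0) / n_h + lam * p_uni

    return prob

# Usage: p = train_bigram([["the", "cat"], ["the", "dog"]]); p("cat", "the")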

3.1.2 Model M

Model M [11] is a class-based n-gram model; its basic form is as follows:

$$\displaystyle{ p(w_{1}\cdots w_{l}) =\prod _{ j=1}^{l+1}p(c_{ j}\vert c_{j-2}c_{j-1},w_{j-2}w_{j-1})p(w_{j}\vert w_{j-2}w_{j-1}c_{j}) }$$
(13.6)

It is composed of two submodels, a model predicting classes and a model predicting words, both of which are exponential models. An exponential model \(p_{\varLambda }(y\vert x)\) is a model with a set of features \(\{f_{i}(x,y)\}\) and an equal number of parameters \(\varLambda =\{\lambda _{i}\}\), where

$$\displaystyle{ p_{\varLambda }(y\vert x) = \frac{\exp (\sum _{i}\lambda _{i}f_{i}(x,y))} {Z_{\varLambda }(x)} }$$
(13.7)

and where \(Z_{\varLambda }(x)\) is a normalization factor defined as

$$\displaystyle{ Z_{\varLambda }(x) =\sum _{y^{\prime}}\exp (\sum _{i=1}^{F}\lambda _{ i}f_{i}(x,y^{\prime})) }$$
(13.8)

Let \(p_{\mbox{ ng}}(y\vert \omega )\) denote an exponential n-gram model, where we have a feature \(f_{\omega ^{\prime}}\) for each suffix ω′ of each ωy occurring in the training set; this feature has the value 1 if ω′ occurs in the current event and 0 otherwise. For example, the model \(p_{\mbox{ ng}}(w_{j}\vert w_{j-1}c_{j})\) has a feature \(f_{\omega }\) for each n-gram ω in the training set of the form \(w_{j}\), \(c_{j}w_{j}\), or \(w_{j-1}c_{j}w_{j}\). Let \(p_{\mbox{ ng}}(y\vert \omega _{1},\omega _{2})\) denote a model containing all features in \(p_{\mbox{ ng}}(y\vert \omega _{1})\) and \(p_{\mbox{ ng}}(y\vert \omega _{2})\). Then, the distributions in Eq. (13.6) are defined as follows for the trigram version of Model M:

$$\displaystyle{ p(c_{j}\vert c_{j-2}c_{j-1},w_{j-2}w_{j-1}) \equiv p_{\mbox{ ng}}(c_{j}\vert c_{j-2}c_{j-1},w_{j-2}w_{j-1}) }$$
(13.9)
$$\displaystyle{ p(w_{j}\vert w_{j-2}w_{j-1}c_{j}) \equiv p_{\mbox{ ng}}(w_{j}\vert w_{j-2}w_{j-1}c_{j}) }$$
(13.10)

To smooth or regularize Model M, it has been found that \(\ell_{1} +\ell_{ 2}^{2}\) regularization works well; i.e., the parameters Λ = {λ i } of the model are chosen to minimize

$$\displaystyle{ \mathcal{O}_{\ell_{1}+\ell_{2}^{2}}(\varLambda ) =\log \mathrm{ PP}_{\mbox{ train}} + \frac{\alpha } {C_{\mbox{ tot}}}\sum _{i}\vert \lambda _{i}\vert + \frac{1} {{2\sigma }^{2}C_{\mbox{ tot}}}\sum _{i}\lambda _{i}^{2} }$$
(13.11)

where \(\mathrm{PP}_{\mbox{ train}}\) denotes training-set perplexity and \(C_{\mbox{ tot}}\) is the number of words in the training data. The values α and σ are regularization hyperparameters, and the values \((\alpha = 0.5,{\sigma }^{2} = 6)\) have been found to give good performance for a wide range of operating points. A variant of iterative scaling can be used to find the parameter values that optimize Eq. (13.11).

3.1.3 Neural Network Language Model

The neural network language model (NNLM) [3, 17, 44] uses a continuous representation of words, combined with a neural network for probability estimation. The model size increases linearly with the number of context features, in contrast to exponential growth for regular n-gram models. Details of our implementation, speed-up techniques, as well as the probability normalization and optimal NN configuration, are described in [18].

The basic idea behind neural network language modeling is to project words into a continuous space and let a neural network learn the prediction in that continuous space, where the model estimation task is presumably easier than in the original discrete space [3, 17, 44]. The continuous-space projections, or feature vectors, of the preceding words (or context features) make up the input to the neural network, which then produces a probability distribution over a given vocabulary. The feature vectors are randomly initialized and are subsequently learned, along with the parameters of the neural network, so as to maximize the likelihood of the training data. The model achieves generalization by assigning to an unseen word sequence a probability close to that of a “similar” word string seen in the training data. The similarity is defined as being close in the multi-dimensional feature space. Since the probability function is a smooth function of the feature vectors, a small change in the features leads to only a small change in the probability.

To compute the conditional probability \(P(y\vert x_{1},x_{2},\cdots \,,x_{m})\), where \(x_{i} \in V_{i}\) (input vocabulary) and \(y \in V_{o}\) (output vocabulary), the model operates as follows. First, for every \(x_{i}\), i = 1, ⋯ , m, the corresponding feature vector (continuous-space projection) is found. This is simply a table lookup operation that associates a real vector of fixed dimension d with each \(x_{i}\). Second, these m vectors are concatenated to form a vector of size m ⋅ d. Finally, this vector is processed by the neural network, which produces a probability distribution \(P(\cdot \vert x_{1},x_{2},\cdots \,,x_{m})\) over vocabulary \(V_{o}\) at its output.

Note that the input and output vocabularies V i and V o are independent of each other and can be completely different. Training is achieved by searching for parameters Φ of the neural network and the values of feature vectors that maximize the penalized log-likelihood of the training corpus:

$$\displaystyle{ L = \frac{1} {T}\sum _{t}logP({y}^{t}\vert x_{ 1}^{t},\ldots,x_{ m}^{t};\varPhi ) - R(\varPhi ) }$$
(13.12)

where superscript t denotes the tth event in the training data, T is the training data size, and R(Φ) is a regularization term, which in our case is a multiple of the squared L2 norm of the hidden and output layer weights.

Fig. 13.8 The neural network architecture

The model architecture is given in Fig. 13.8 [3, 17, 44]. The neural network is fully connected and contains one hidden layer. The operations of the input and hidden layers are given by:

$$\displaystyle\begin{array}{rcl} & \mathbf{f} = (f_{1},\ldots,f_{d\cdot m}) = (\mathbf{f}(x_{1}),\mathbf{f}(x_{2}),\cdots \,,\mathbf{f}(x_{m}))& {}\\ & g_{k} =\tanh (\sum _{j}f_{j}L_{\mathit{kj}} + B_{k}^{1})\qquad \quad k = 1,2,\ldots,h & {}\\ \end{array}$$

where \(\mathbf{f}(x)\) is the d-dimensional feature vector for token x. The weights and biases of the hidden layer are denoted by L kj and \(B_{k}^{1}\) respectively, and h is the number of hidden units.

At the output layer of the network we have:

$$\displaystyle{z_{k}=\sum _{j}g_{j}S_{kj}+B_{k}^{2}\qquad k=1,2,\ldots,\vert V _{ o}\vert }$$
$$\displaystyle{ p_{k} = \frac{{e}^{z_{k}}} {\sum _{j}{e}^{z_{j}}}\qquad \qquad \quad k = 1,2,\ldots,\vert V _{o}\vert }$$
(13.13)

with the weights and biases of the output layer denoted by S kj and \(B_{k}^{2}\) respectively. The softmax layer (Eq. (13.13)) ensures that the outputs are valid probabilities and provides a suitable framework for learning a probability distribution.

The kth output of the neural network, corresponding to the kth item y k of the output vocabulary, is the desired conditional probability: \(p_{k} = P({y}^{t} = y_{k}\vert x_{1}^{t},\ldots,x_{m}^{t})\).

The neural network weights and biases, as well as the input feature vectors, are learned simultaneously using stochastic gradient descent via the back-propagation algorithm, with the objective function given in Eq. (13.12).
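For concreteness, the sketch below implements the forward pass just described (embedding lookup, tanh hidden layer, softmax output as in Eq. (13.13)) with NumPy. The layer sizes, the random initialization, and the class interface are illustrative assumptions and not the configuration reported in [18].

import numpy as np

class NNLM:
    """Minimal feed-forward NNLM: embedding lookup, one tanh hidden layer,
    and a softmax output, following the hidden-layer and softmax equations
    above. Shapes and initialization are illustrative only."""
    def __init__(self, vocab_in, vocab_out, m=3, d=120, h=500, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.01, (vocab_in, d))        # feature vectors
        self.L = rng.normal(0, 0.01, (h, m * d))           # hidden-layer weights
        self.b1 = np.zeros(h)
        self.S = rng.normal(0, 0.01, (vocab_out, h))       # output-layer weights
        self.b2 = np.zeros(vocab_out)

    def predict(self, context_ids):
        """P(. | x_1, ..., x_m) for one history of m input-token ids."""
        f = self.E[context_ids].reshape(-1)                # concatenate lookups
        g = np.tanh(self.L @ f + self.b1)                  # hidden layer
        z = self.S @ g + self.b2
        z -= z.max()                                       # numerically stable softmax
        p = np.exp(z)
        return p / p.sum()

# Example: a 4-gram NNLM predicts the next token from 3 context tokens.
# model = NNLM(vocab_in=10000, vocab_out=10000)   # hypothetical vocabulary sizes
# probs = model.predict([17, 254, 903])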

Given the large vocabulary of the Arabic speech recognition system, data sparsity is an important problem for conventional n-gram LMs. Our experience is that NNLM significantly reduces perplexity as well as word error rate in speech recognition. Results will be presented later together with those of an NNLM that incorporates syntactic features.

3.2 Language-Specific Techniques for Language Modeling

In this section, we describe a language model that incorporates morphological and syntactic features for Arabic speech recognition [29, 30]. This method is language specific in the sense that certain language-specific resources are used, for example, an Arabic parser.

With a conventional n-gram language model, the number of parameters can potentially grow exponentially with the context length. Given a fixed training set, as n is increased, the number of unique n-grams that can be reliably estimated is reduced. Hence in practice, a context no longer than three words (corresponding to a 4-gram LM) is used.

This problem is exacerbated by large vocabularies for morphologically rich languages like Arabic. Whereas the vocabulary of an English speech recognition system typically has under 100k words, our Arabic system has about 800k words. The idea of using rich morphology information in Arabic language modeling has been explored by several researchers. The most common idea has been to use segments, which are the result of breaking an inflected word into parts, for better generalization when estimating the probabilities of n-gram events [28]. As an example, the white-space delimited word tqAblhm (she met them) is segmented into three morphs: prefix t (she), followed by stem qAbl (met) and suffix hm (them).

Compared to a regular word n-gram model, a segmented word n-gram model has a reduced context. To model longer-span dependencies, one may consider context features extracted from a syntactic parse tree such as head word information used in the Structured Language Model (SLM) [10]. Syntactic features can be useful in any language. Here is an example in English:

The girl who lives down the street searched the bushes in our neighbor’s backyard for her lost kitten.

Through a parse tree, one can relate the words girl, searched, and for. Such long-span relationships cannot be captured with a 4-gram language model. Various types of syntactic features such as head words, non-terminal labels, and part-of-speech tags, have been used in a discriminative LM framework as well as in other types of models [15].

Using many types of context features (morphological and syntactic) is difficult with traditional back-off methods. Learning the dependencies in such a long context is hard even with models such as factored language models [28], due to the large number of links that need to be explored. On the other hand, the neural network language model (NNLM) [3, 17, 44] is very well suited to modeling such context features. The NNLM uses a continuous representation of words, combined with a neural network for probability estimation. The model size increases linearly with the number of context features, in contrast to the exponential growth of regular n-gram models. Another advantage is that no back-off order over the context features needs to be defined: the model converges to the same solution no matter how the context features are ordered, as long as the ordering is consistent.

The following text processing steps are used to extract morphological and syntactic features for context modeling with an NNLM. Arabic sentences are processed to segment words into (hypothesized) prefixes, stems, and suffixes, which become the tokens for further processing. In particular, we use Arabic Treebank (ATB) segmentation, a light segmentation adapted to the task of manually writing parse trees in the ATB corpus [32]. After segmentation, each sentence is parsed, and syntactic features are extracted. The context features used by the NNLM include segmented words (morphs) as well as syntactic features such as exposed head words and their non-terminal labels.

Table 13.13 shows the WER results of using NNLMs, with word features only or with morphological and syntactic features. The results are given for an evaluation set EVAL08U, with a breakdown for the broadcast news (BN) and broadcast conversations (BC) portions. An NNLM with word features reduced the WER by about 3 % relative, from 9.4 to 9.1 %. Using morphological and syntactic features further reduced the WER by 5 % relative, from 9.1 to 8.6 %. It is seen that syntactic features are helping both BN and BC. Specifically, through the use of syntactic features, for BN, the WER improved by 4.6 % (6.5–6.2 %) and for BC, the WER improved by 5.0 % (12.1–11.5 %).

Table 13.13 WER results of NNLM using word and syntactic features

Although the modeling methodology behind the syntax NNLM is language independent, when a new language or dialect is encountered, certain resources such as a segmenter and parser may have to be developed or adapted. In our experiments, even though the syntax NNLM was trained on only 12 million words of data, two orders of magnitude less than the text corpora of over 1 billion words used to train the n-gram LM, it was still able to provide significant improvements to the WER. For a new language with little data available to train the n-gram LM, the syntax LM is likely to help even more.

3.2.1 Search

The search space for large-vocabulary speech recognition is both enormous and complex. It is enormous because the number of hypotheses to consider grows exponentially with the length of the hypothesized word strings and because the number of different words that can be recognized is typically in the tens to hundreds of thousands. The search space is complex because there is significant structure induced by the language model, the dictionary, and the decision trees used for HMM state clustering. One approach to search, the static approach, represents the components of the search space (the language model, dictionary, and decision trees) as weighted finite-state transducers (WFSTs), and then uses standard algorithms such as WFST composition, determinization, and minimization to pre-compile a decoding graph that represents the search space. The search process then simply reads in this graph and performs dynamic programming search on it, given a sequence of acoustic feature vectors, to perform speech recognition. Such static decoders can be very efficient and can be implemented with very little code because all of the complexity is pushed into the precompilation process. However, the size of the language models that can be used with static decoders is limited by what models can successfully be precompiled into decoding graphs. An alternative approach, dynamic search, constructs the relevant portion of the search space on the fly. Such decoders can use much larger language models, but are also significantly more complicated to write.
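As a toy illustration of the static approach (not IBM's decoder), the sketch below represents a tiny precompiled decoding graph as arcs labeled with an acoustic-model symbol, an output word, and a graph cost, and runs a plain dynamic programming (Viterbi-style) search over it given per-frame acoustic costs. The graph, labels, and costs are all fabricated.

```python
# Toy static decoding: dynamic programming over a precompiled weighted graph.
import math

# arcs[state] -> list of (am_label, word, graph_cost, next_state)
arcs = {
    0: [("s_ih", "<eps>", 0.1, 1), ("s_ax", "<eps>", 0.3, 2)],
    1: [("s_t",  "sit",   0.2, 3)],
    2: [("s_t",  "sat",   0.2, 3)],
    3: [],
}
final_states = {3}

def decode(acoustic_costs):
    """acoustic_costs: list of dicts, one per frame: {am_label: -log p(frame|label)}."""
    best = {0: (0.0, [])}                       # state -> (cost, word sequence)
    for frame in acoustic_costs:
        new_best = {}
        for state, (cost, words) in best.items():
            for am_label, word, gcost, nxt in arcs[state]:
                c = cost + gcost + frame.get(am_label, math.inf)
                w = words + ([word] if word != "<eps>" else [])
                if nxt not in new_best or c < new_best[nxt][0]:
                    new_best[nxt] = (c, w)
        best = new_best
    finals = {s: v for s, v in best.items() if s in final_states}
    return min(finals.values(), key=lambda v: v[0]) if finals else None

print(decode([{"s_ih": 0.5, "s_ax": 1.5}, {"s_t": 0.2}]))  # (1.0, ['sit'])
```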

4 IBM GALE 2011 System Description

In this section we describe IBM’s 2011 transcription system for Arabic broadcasts, which was fielded in the GALE Phase 5 machine translation evaluation. Like most systems fielded in competitive evaluations, our system relies upon multiple passes of decoding, acoustic model adaptation, language model rescoring, and system combination to achieve the lowest possible word error rate.

4.1 Acoustic Models

We use an acoustic training set composed of approximately 1,800 h of transcribed Arabic broadcasts provided by the Linguistic Data Consortium (LDC) for the GALE evaluations.

Unless otherwise specified, all our acoustic models use 40-dimensional features computed by an LDA projection of a supervector composed from 9 successive frames of 13-dimensional, mean- and variance-normalized PLP features, followed by diagonalization with a global semi-tied covariance transform [20]. The models use pentaphone cross-word context with a “virtual” word-boundary phone symbol that occupies a position in the context description but does not generate an acoustic observation. Speaker-adapted systems are trained using VTLN and fMLLR. All the models use variable frame rate processing [14].
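The following NumPy fragment sketches this front-end computation: 9 successive frames of 13-dimensional, mean- and variance-normalized PLP features are spliced into a 117-dimensional supervector and projected to 40 dimensions. The random matrix stands in for the trained LDA/STC transform, and the per-utterance normalization is a simplification.

```python
# Schematic frame splicing and 40-dim projection of normalized PLP features.
import numpy as np

rng = np.random.default_rng(0)
plp = rng.normal(size=(200, 13))                 # utterance: 200 frames of PLP
plp = (plp - plp.mean(0)) / plp.std(0)           # mean/variance normalization

lda_stc = rng.normal(size=(40, 9 * 13))          # placeholder for the LDA+STC matrix

def splice_and_project(feats, context=4):
    """Stack +/-context frames around each frame and apply the 40-dim projection."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    spliced = np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])
    return spliced @ lda_stc.T                   # (T, 40)

print(splice_and_project(plp).shape)             # (200, 40)
```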

Given that short vowels and other diacritic markers are typically not orthographically represented in Arabic texts, we have a number of choices for building pronunciation dictionaries: (1) unvowelized (graphemic) dictionaries in which the short vowels and diacritics are ignored, (2) vowelized dictionaries that use the Buckwalter morphological analyzer [7] to generate possible vowelized pronunciations, and (3) vowelized dictionaries that use the output of a morphological analysis and disambiguation tool (MADA) [23], in which the assignment of diacritic markers is based on the textual context of each word (to distinguish word senses and grammatical functions). Our 2011 transcription system uses the acoustic models described below.

  • SI A speaker-independent, unvowelized acoustic model trained using model-space boosted maximum mutual information [35]. The PLP features for this system are only mean-normalized. The SI model comprises 3k states and 151k Gaussians.

  • U A speaker-adapted, unvowelized acoustic model trained using both feature- and model-space BMMI. The U model comprises 5k states and 803k Gaussians.

  • SGMM A speaker-adapted, Buckwalter vowelized subspace Gaussian mixture model [36, 43] trained with feature- and model-space versions of a discriminative criterion based on both the minimum phone error (MPE) [34] and BMMI criteria. The SGMM model comprises 6k states and 150M Gaussians that are represented using an efficient subspace tying scheme.

  • V A speaker-adapted, Buckwalter vowelized acoustic model trained using the feature-space BMMI and model-space MPE criteria. The changes in this model compared to all the other models are: (1) the “virtual” word boundary phones are replaced with word-begin and word-end tags, (2) it uses a dual decision tree that specifies 10k different Gaussian mixture models but 50k context-dependent states, (3) it uses a single, global decision tree, and (4) it expands the phonetic context on which a state can be conditioned to ±3 phones within words. This model has 801k Gaussians.

  • BS A speaker-adapted, unvowelized acoustic model using Bayesian sensing HMMs where the acoustic feature vectors are modeled by a set of state-dependent basis vectors and by time-dependent sensing weights [40]. The Bayesian formulation comes from assuming state-dependent Gaussian priors for the weights and from using marginal likelihood functions obtained by integrating out the weights. The marginal likelihood is Gaussian with a factor analyzed covariance matrix with the basis providing a low-rank correction to the diagonal covariance of the reconstruction error [42]. The details of this model are given in Sect. 13.4.1.

  • M A speaker-adapted, MADA vowelized system with an architecture similar to V. The details of this model are given in Sect. 13.4.1.

  • NNU, NNM Speaker-adapted acoustic models which use neural network features. They were built using either the unvowelized lexicon (NNU) or the MADA one (NNM). Section 13.4.1 describes these models in more detail.

4.1.1 Bayesian Sensing HMMs (BS)

4.1.1.1 Model Description

Here, we briefly describe the main concepts behind Bayesian sensing hidden Markov models [40]. The state-dependent generative model for the D-dimensional acoustic feature vectors x t is assumed to be

$$\displaystyle{ \mathbf{x}_{t} =\varPhi _{i}\mathbf{w}_{t} + \boldsymbol{\epsilon }_{t} }$$
(13.14)

where \(\varPhi _{i} = [\boldsymbol{\phi }_{i1},\ldots,\boldsymbol{\phi }_{\mathit{iN}}]\) is the basis (or dictionary) for state i and \(\mathbf{w}_{t} = {[w_{t1},\ldots,w_{\mathit{tN}}]}^{T}\) is a time-dependent weight vector. The following additional assumptions are made: (1) when conditioned on state i, the reconstruction error is zero-mean Gaussian distributed with precision matrix R i , i.e. \(\boldsymbol{\epsilon }_{t}\vert s_{t} = i \sim \mathcal{N}(\mathbf{0},R_{i}^{-1})\) and (2) the state-conditional prior for w t is also zero-mean Gaussian with precision matrix A i , that is \(\mathbf{w}_{t}\vert s_{t} = i \sim \mathcal{N}(\mathbf{0},A_{i}^{-1})\). It can be shown that, under these assumptions, the marginal state likelihood p(x t  | s t  = i) is also zero-mean Gaussian with the factor analyzed covariance matrix [42]

$$\displaystyle{ S_{i}\stackrel{\varDelta }{=}R_{i}^{-1} +\varPhi _{ i}A_{i}^{-1}\varPhi _{ i}^{T} }$$
(13.15)

In summary, the state-dependent distributions are fully characterized by the parameters \(\{\varPhi _{i},R_{i},A_{i}\}\). In [40], we discuss the estimation of these parameters according to a maximum likelihood type II criterion, whereas in [41] we derive parameter updates under a maximum mutual information objective function.
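A small NumPy sketch of the resulting likelihood computation is given below: the marginal state likelihood is a zero-mean Gaussian whose covariance is the factor-analyzed matrix of Eq. (13.15). The dimensions and the random parameters are purely illustrative.

```python
# Marginal state likelihood of a Bayesian sensing HMM (Eq. (13.15)).
import numpy as np

rng = np.random.default_rng(0)
D, N = 40, 64                                    # feature dim, basis size (assumed)

Phi = rng.normal(size=(D, N))                    # state-dependent basis
R_inv = np.diag(rng.uniform(0.5, 2.0, D))        # error covariance  R_i^{-1}
A_inv = np.diag(rng.uniform(0.5, 2.0, N))        # weight-prior covariance A_i^{-1}

S = R_inv + Phi @ A_inv @ Phi.T                  # Eq. (13.15)

def log_marginal(x):
    """log p(x | s = i) for the zero-mean Gaussian with covariance S."""
    _, logdet = np.linalg.slogdet(S)
    quad = x @ np.linalg.solve(S, x)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

print(log_marginal(rng.normal(size=D)))
```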

4.1.1.2 Automatic Relevance Determination

For diagonal \(A_{i} =\mathrm{ diag}(\alpha _{i1},\ldots,\alpha _{\mathit{iN}})\), the estimated precision matrix values α ij encode the relevance of the basis vectors \(\boldsymbol{\phi }_{\mathit{ij}}\) for the dictionary representation of x t . This means that one can use the trained α ij for controlling model complexity. One can first train a large model and then prune it to a smaller size by discarding the basis vectors which correspond to the largest precision values of the sensing weights.
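A minimal sketch of this pruning step, with fabricated basis and precision values and an assumed 50 % retention ratio, is given below.

```python
# ARD-style pruning: drop basis vectors with the largest sensing-weight precisions.
import numpy as np

rng = np.random.default_rng(0)
D, N = 40, 64
Phi = rng.normal(size=(D, N))                    # state basis (toy values)
alpha = rng.uniform(0.5, 2.0, N)                 # trained precisions of the sensing weights

keep_ratio = 0.5                                 # e.g. retain 50 % of the basis (assumed)
keep = np.argsort(alpha)[: int(keep_ratio * N)]  # smallest precision = most relevant

Phi_pruned = Phi[:, keep]
print(Phi.shape, "->", Phi_pruned.shape)         # (40, 64) -> (40, 32)
```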

4.1.1.3 Initialization and Training

We first train a large acoustic model with 5,000 context-dependent HMM states and 2.8 million diagonal covariance Gaussians using maximum likelihood in a discriminative FMMI feature space. The means of the GMM for each state are then clustered using k-means. The initial bases are formed by the clustered means. The resulting number of mixture components for the Bayesian sensing models after the clustering step was 417k. The precision matrices for the sensing weights and the reconstruction errors are assumed to be diagonal and are initialized to the identity matrix.

The models are trained with six iterations of maximum likelihood type II estimation. Next, we discard 50 % of the basis vectors corresponding to the largest precision values of the sensing weights and retrain the pruned model for two additional ML type II iterations. We then generate numerator and denominator lattices with the pruned models and perform four iterations of boosted MMI training of the model parameters as described in [41]. The effect of pruning and discriminative training is discussed in more detail in [42].

4.1.2 MADA-Based Acoustic Model (M)

This acoustic model is similar to the V model. It uses a global tree, word position tags, and a large phonetic context of ±3. While the MADA-based model uses approximately the same number of Gaussians, its decision tree uses only one level, keeping the number of HMM states at 10,000. Since the MADA-based model uses a smaller phone set than the Buckwalter vowelized models, we were able to reuse the vowelized alignments and avoid the flat-start procedure. In this section we describe the strategy used for constructing the training and decoding pronunciation dictionaries, which is the main difference between this system and the V system. Both pronunciation dictionaries are generated following [5] with some slight modifications.

4.1.2.1 Training Pronunciation Dictionary

Here we describe an automatic approach to building a pronunciation dictionary that covers all words in the orthographic transcripts of the training data. First, for each utterance transcript, we run MADA to disambiguate each word based on its context in the transcript. MADA outputs all possible fully-diacritized morphological analyses for each word, ranked by their confidence, the MADA confidence score. We thus obtain a fully-diacritized orthographic transcription for training. Second, we map the highest-ranked diacritization of each word to a set of pronunciations, which we obtain from the 15 pronunciation rules described in [5]. Since MADA may not always rank the best analysis as its top choice, we also run the pronunciation rules on the second best choice returned by MADA when the difference between the top two choices is less than an empirically determined threshold (in our implementation, 0.2). The IBM system is flexible enough to allow specifying multiple diacritized word options at the (training) transcript level: a sentence can be a sequence of fully diacritized word pairs as opposed to a sequence of single words. This whole process gives us fully disambiguated and diacritized training transcripts with one or two diacritized options per word.
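The selection logic can be sketched as follows. The MADA analyses and the pron_rules() function below are fabricated placeholders (the real tool returns ranked, fully diacritized analyses with confidence scores, and the real rules are the 15 rules of [5]); only the thresholding of the top two choices follows the procedure described above.

```python
# Sketch: choose MADA analyses for the training dictionary and map them to pronunciations.
THRESHOLD = 0.2                                  # empirically chosen threshold from the text

def pron_rules(diacritized_word):
    """Placeholder for the pronunciation rules: map a diacritized form to phone strings."""
    return {"/".join(diacritized_word)}          # not the real rules

def training_prons(mada_analyses):
    """mada_analyses: list of (diacritized_form, confidence), best first."""
    chosen = [mada_analyses[0]]
    if len(mada_analyses) > 1:
        (_, c1), (_, c2) = mada_analyses[0], mada_analyses[1]
        if c1 - c2 < THRESHOLD:                  # second choice is close enough to keep
            chosen.append(mada_analyses[1])
    prons = set()
    for form, _ in chosen:
        prons |= pron_rules(form)
    return prons

# Fabricated example: two analyses of one undiacritized word.
print(training_prons([("qaAbal", 0.55), ("qaAbil", 0.45)]))
```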

4.1.2.2 Decoding Pronunciation Dictionary

For building the decoding dictionary we run MADA on the transcripts of the speech training data as well as on the Arabic Gigaword corpus. In this dictionary, all pronunciations produced (by the pronunciation rules) for all diacritized word instances (from MADA first and second choices) of the same undiacritized form are mapped to the undiacritized and normalized word form. Word normalization here refers to removing diacritic markers and replacing the Buckwalter-normalized Hamzat-Wasl ({), <, and > with the letter ‘A’. Note that it is standard to produce undiacritized transcripts when recognizing MSA; diacritization is generally not necessary to make the transcript readable by Arabic-literate readers. Therefore, entries in the decoding pronunciation dictionary need only consist of undiacritized words mapped to a set of phonetically-represented diacritizations.

A pronunciation confidence score is calculated for each pronunciation. We compute a score s for a pronunciation p as the average of the MADA confidence scores of the MADA analyses of the word instances from which this pronunciation was generated. We compute this score for each pronunciation of a normalized undiacritized word, and let m be the maximum of these scores. The final pronunciation confidence score for p is then − log 10(s∕m), so that the best pronunciation receives a penalty of 0 when chosen by the ASR decoder. This dictionary has about 3.6 pronunciations per word when using the first and second MADA choices.
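The scoring can be summarized by the short sketch below, which averages the MADA confidences per pronunciation and converts them to − log 10(s∕m) penalties; the word instances and confidence values are fabricated.

```python
# Pronunciation confidence scores: average MADA confidence, then -log10(s/m) penalties.
import math
from collections import defaultdict

def pron_costs(instances):
    """instances: list of (pronunciation, mada_confidence) for one normalized word."""
    scores = defaultdict(list)
    for pron, conf in instances:
        scores[pron].append(conf)
    avg = {p: sum(c) / len(c) for p, c in scores.items()}   # score s per pronunciation
    m = max(avg.values())                                    # best score for this word
    return {p: -math.log10(s / m) for p, s in avg.items()}   # best pronunciation -> 0

# Fabricated example with two pronunciations of one word.
print(pron_costs([("k i t A b", 0.9), ("k i t A b", 0.8), ("k u t u b", 0.4)]))
# {'k i t A b': 0.0, 'k u t u b': ~0.33}
```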

4.1.3 Neural Network Acoustic Models (NNU and NNM)

The neural network feature extraction module uses two feature streams computed from mean and variance normalized, VTLN log Mel spectrograms, and is trained in a piecewise manner, in which (1) a state posterior estimator is trained for each stream, (2) the unnormalized log-posteriors from all streams are summed together to combine the streams, and (3) features for recognition are computed from the bottleneck layer of an autoencoder network.

One stream, the lowpass stream, is computed by filtering the spectrograms with a temporal lowpass filter, while the other stream, the bandpass stream, is computed by filtering the spectrograms with a temporal bandpass filter. Both filters are 19-point FIR filters. The lowpass filter has a cutoff frequency of 24 Hz. The bandpass filter has a differentiator-like (gain proportional to frequency) response from 0 to 16 Hz and a high-pass cutoff frequency of 27 Hz. The posterior estimators for each stream compute the probabilities of 141 context-independent HMM states given an acoustic input composed from 19 frames of 40-dimensional, filtered spectrograms. They have two 2048-unit hidden layers, use softsign nonlinearities [21] between layers, and use a softmax nonlinearity at the output. The softsign nonlinearity is y = x∕(1 + | x | ). Initial training optimizes the frame-level cross-entropy criterion. After convergence, the estimators are further refined to discriminate between state sequences using the minimum phone error criterion [24, 34].

Stream combination is performed by discarding the softmax output layer for each stream posterior estimator, and summing the resulting outputs, which may be interpreted as unnormalized log-posterior probabilities. We then train another neural network, containing a 40-dimensional bottleneck layer, as an autoencoder, and use the trained network to reduce the dimensionality of the neural network features. The original autoencoder network has a first hidden layer of 76 units, a second hidden layer of 40 units, a linear output layer, and uses softsign nonlinearities. The training criterion for the autoencoder is the cross-entropy between the normalized posteriors generated by processing the autoencoder input and output vectors through a softmax nonlinearity. Once the autoencoder is trained, the second layer of softsign nonlinearities and the weights that expand from the 40-dimensional bottleneck layer back to the 141-dimensional output are removed. Details of this NN architecture are described in [39].
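The following compressed NumPy sketch shows the shape of this pipeline: two stream posterior estimators with softsign hidden layers produce unnormalized log-posteriors over the 141 CI states, the streams are summed, and the encoder half of the bottleneck autoencoder yields the 40-dimensional recognition feature. All weights are random placeholders for the trained networks, and training itself (cross-entropy followed by MPE) is omitted.

```python
# Two-stream neural-network feature extraction with a bottleneck autoencoder (sketch).
import numpy as np

rng = np.random.default_rng(0)
softsign = lambda x: x / (1.0 + np.abs(x))       # y = x / (1 + |x|)

def posterior_net(dim_in, dim_hid=2048, dim_out=141):
    """Two softsign hidden layers; the softmax output layer is dropped for combination."""
    w1, w2, w3 = (rng.normal(scale=0.01, size=s)
                  for s in [(dim_hid, dim_in), (dim_hid, dim_hid), (dim_out, dim_hid)])
    return lambda x: w3 @ softsign(w2 @ softsign(w1 @ x))   # unnormalized log-posteriors

dim_in = 19 * 40                                 # 19 frames of 40-dim filtered spectrogram
lowpass_net, bandpass_net = posterior_net(dim_in), posterior_net(dim_in)

x_low, x_band = rng.normal(size=dim_in), rng.normal(size=dim_in)
combined = lowpass_net(x_low) + bandpass_net(x_band)        # stream combination by summation

# Encoder half of the bottleneck autoencoder: 141 -> 76 (softsign) -> 40 (linear).
w_a = rng.normal(scale=0.01, size=(76, 141))
w_b = rng.normal(scale=0.01, size=(40, 76))
bottleneck_feature = w_b @ softsign(w_a @ combined)         # 40-dim recognition feature
print(bottleneck_feature.shape)                             # (40,)
```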

Once the features are computed, the remaining acoustic modeling steps are conventional, using 600k 40-dimensional Gaussians modeling 10k quinphone context-dependent states, where we do both feature- and model-space discriminative training using the BMMI criterion. Two acoustic models were trained using the neural-net features: one (NNM) used a MADA-vowelized lexicon, while the other (NNU) used an unvowelized lexicon. Note that the posterior estimators used in feature extraction were trained with MADA-vowelized alignments.

4.2 Language Models

For training language models we use a collection of 1.6 billion words, which we divide into 20 parts based on the source. The two most important components are the broadcast news (BN) and broadcast conversation (BC) acoustic transcripts (7.5 million words each) corresponding to the 1,800 h of speech transcribed by LDC for the GALE program. We use a vocabulary of 795,000 words, which is based on all available corpora and is designed to completely cover the acoustic transcripts. To build the baseline language model, we train a 4-gram model with modified Kneser–Ney smoothing [13] for each source, and then linearly interpolate the 20 component models with the interpolation weights chosen to optimize perplexity on a held-out set. We combine all 20 components into one language model and apply entropy pruning [48]. By varying the pruning thresholds we create (1) a 913 million n-gram LM (no pruning) to be used for lattice rescoring (Base) and (2) a 7 million n-gram LM to be used for the construction of static, finite-state decoding graphs.
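The interpolation-weight optimization can be illustrated with the usual EM procedure for mixture weights, sketched below on fabricated per-source probabilities; a real setup would query the 20 per-source 4-gram models on the held-out events and would be followed by entropy pruning.

```python
# EM estimation of linear interpolation weights for 20 component LMs (sketch).
import numpy as np

rng = np.random.default_rng(0)
S, N = 20, 1000                                  # sources, held-out events
# p[n, s] = probability that source LM s assigns to held-out event n (fabricated).
p = rng.uniform(1e-6, 1e-2, size=(N, S))

w = np.full(S, 1.0 / S)                          # start from uniform weights
for _ in range(50):                              # EM iterations for the mixture weights
    post = w * p                                 # responsibilities (unnormalized)
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)                        # re-estimated interpolation weights

mix = p @ w
print("held-out perplexity:", np.exp(-np.log(mix).mean()))
```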

In addition to the baseline language models described above, we investigated various other techniques which differ in either the features they employ or the modeling strategy they use. These are described below.

  • ModelM A class-based exponential model [11]. Compared to the models used in the previous evaluation, we use a new enhanced word classing [12]. The bigram mutual information clustering method used to derive word classes in the original Model M framework is less than optimal due to mismatches between the classing objective function and the actual LM, so the new method attempts to address this discrepancy. Key features of the new method include: (a) a class-based model that includes word n-gram features to better mimic the nature of the actual language modeling, (b) a novel technique for estimating the likelihood of unseen data for the clustering model, and (c) n-gram clustering compared to bigram clustering in the original method. We build Model M models with improved classing on 7 of the corpora with the highest interpolation weights in the baseline model.

  • WordNN A 6-gram neural network language model using word features. Compared to the model used in the P4 evaluation [25], we train on more data (44 million words of data from BN, BC and Archive). We also enlarge the neural network architecture (increased the feature vector dimension from 30 to 120 and the number of hidden units from 100 to 800) and normalize the models. We create a new LM for lattice rescoring by interpolating this model with the 7 ModelM models and Base, with the interpolation weights optimized on the held-out set. In the previous evaluation we did not get an improvement by interpolating the WordNN model with model M models, but the changes made this year result in significant improvements.

  • SyntaxNN A neural network language model using syntactic and morphological features [29]. The syntactic features include exposed head words and their non-terminal labels, both before and after the predicted word. For this neural network model we used the same training data and the same neural network architecture as the one described for WordNN. This language model is used for n-best rescoring.

  • DLM A discriminative language model trained using the minimum Bayes risk (MBR) criterion [31]. Unlike the other LMs, a DLM is trained on patterns of confusion or errors made by the speech recognizer. Our potential features consist of unigram, bigram, and trigram morphs, and we used the perceptron algorithm to select a small set of useful features. With the selected features, we trained the DLM using an MBR-based algorithm, which minimizes the expected loss, calculated using the word error information and posterior probabilities of the N-best hypotheses of all the training sentences. To prepare data for DLM training, we used a Phase 3 unvowelized recognizer trained on 1,500 h of acoustic data to decode an unseen 300 h set. This unseen training set was provided in Phase 4, but adding this data to acoustic or language model training did not improve the system, so it is an ideal set for DLM training. During the evaluation, a single MBR DLM thus trained was used to rescore the N-best hypotheses from all the systems. Although there is a mismatch between training and test conditions, improvements were still observed. Details of these experiments as well as post-eval experiments are presented in [31].

Having many diverse language models, the challenge is to be able to combine them while achieving additive gains, and Sect. 13.4.4 describes our strategy.

4.3 System Combination

We employ three different techniques for system combination. The first technique is cross-adaptation (×), where the fMLLR and MLLR transforms required by a speaker-adapted acoustic model are computed using transcripts from some other, different speaker-adapted acoustic model. The second technique is tree-array combination (+), a form of multi-stream acoustic modeling in which the acoustic scores are computed as a weighted sum of scores from two or more models that can have different decision trees [47]. The only requirement for the tree-array combination is that the individual models are built using the same pronunciation dictionary. The third technique is hypothesis combination using the nbest-rover [50] tool from the SRILM toolkit [49]. In all these combination strategies, the choice of systems to combine was based on performance on a variety of development sets.

4.4 System Architecture

IBM’s 2011 GALE Arabic transcription system is a sequence of multiple passes of decoding, acoustic model adaptation, language model rescoring, and system combination steps. In this section, we show how all the models described in the previous sections are combined to generate the final transcripts. We report results on several data sets: DEV’07 (2.5 h); DEV’09 (2.8 h); EVAL’09, the unsequestered portion of the GALE Phase 4 evaluation set (4.2 h); and EVAL’11, the GALE Phase 5 evaluation set (3 h). EVAL’11 is unseen data on which no tuning was done. In our 2011 evaluation system we have the following steps.

  1. Cluster the audio segments into hypothesized speakers.

  2. Decode with the SI model.

  3. Compute VTLN warp factors per speaker using transcripts from (2).

  4. Decode using the U model cross-adapted on SI.

  5. Decode using the SGMM model cross-adapted on (4).

  6. Compute best frame rates per utterance using transcripts from (5).

  7. Decode using the SGMM model cross-adapted on U and frame rates from (6).

  8. (a) Using the U model and transcripts from (5), compute fMLLR and MLLR transforms.

     (b) Using the BS model and transcripts from (5), compute fMLLR and MLLR transforms.

     (c) Using the NNU model and transcripts from (5), compute fMLLR and MLLR transforms.

     (d) Decode and produce lattices using a tree-array combination of the U model with transforms from (8a), the BS model with transforms from (8b) and NNU with transforms from (8c).

  9. (a) Using the M model and transcripts from (5), compute fMLLR and MLLR transforms.

     (b) Using the NNM model and transcripts from (5), compute fMLLR and MLLR transforms.

     (c) Decode and produce lattices using a tree-array combination of the M model with transforms from (9a) and the NNM model with transforms from (9b).

  10. Decode and produce lattices using the V model, frame rates from (6), and fMLLR and MLLR transforms computed using transcripts from (5).

  11. Using an interpolation of Base, 7 ModelM and one WordNN language models:

      (a) Rescore lattices from (8d), extract 50-best hypotheses.

      (b) Rescore lattices from (9c), extract 50-best hypotheses.

      (c) Rescore lattices from (10), extract 50-best hypotheses.

  12. Parse the 50-best lists from (11) and score them with a SyntaxNN language model; produce new language model scores for each hypothesis.

  13. Score the 50-best lists from (11) with a discriminative language model and produce new language model scores for each hypothesis.

  14. Combine acoustic scores, language model scores from (11), syntax LM scores from (12) and discriminative LM scores from (13) using simplex:

      (a) Add the new scores to the hypotheses from (11a).

      (b) Add the new scores to the hypotheses from (11b).

      (c) Add the new scores to the hypotheses from (11c).

  15. Combine the hypotheses from (14a), (14b), and (14c) using the nbest-rover tool from the SRILM toolkit [49]. This constitutes the final output (Table 13.14).

Table 13.14 Word error rates for the final three combined models before and after adding LM rescoring passes

Table 13.15 Word error rates for different LM rescoring steps on the (U + BS + NNU) ×SGMM ×U.vfr (8d) set of lattices

Table 13.15 shows the word error rates obtained after adding new language models either for lattice or n-best rescoring for the (8d) system. It can be seen that each additional rescoring pass improves the performance, and that the total improvement from language modeling rescoring is 1.1–1.2 % absolute on all the sets. Similar improvements have been obtained on the other two systems that are part of the final system combination ((9c) and (10)).

5 From MSA to Dialects

One of the key challenges in Arabic speech recognition research is how to handle the differences between Arabic dialects. Most recent work on Arabic ASR has addressed the problem of recognizing Modern Standard Arabic (MSA); little work has focused on dialectal Arabic [26, 51]. Arabic dialects differ from MSA and from each other along many dimensions of the linguistic spectrum: morphologically, lexically, syntactically, and phonologically. What makes dialectal Arabic particularly challenging is the lack of a well-defined spelling system, of resources (i.e., acoustic and LM training data), and of tools (such as morphological analyzers and disambiguation tools).

In this section, we report a series of experiments on how to progress from Modern Standard Arabic (MSA) to Levantine ASR, in the context of the DARPA GALE program. While our GALE models achieve very low error rates on MSA, error rates roughly double when decoding dialectal data. We use a state-of-the-art Arabic dialect recognition system to automatically identify Levantine and MSA subsets in mixed speech covering a variety of dialects as well as MSA. Training separate models on these subsets, we show a significant reduction in word error rate over using the entire data set to train one system for both dialects. During decoding, we use a tree array structure to mix Levantine and MSA models automatically, using the posterior probabilities of the dialect classifier as soft weights. This technique allows us to mix these models without sacrificing performance for either variety. Furthermore, using the output of the initial acoustic-based dialect recognition system, we show that we can bootstrap a text-based dialect classifier and use it to identify relevant text data for building Levantine language models.

5.1 Dialect Identification

As mentioned above, we are interested in building Levantine-specific models using the available GALE data. Recall that this data contains a mix of dialects in addition to MSA and that this data has no specific dialect annotations. To build a Levantine-specific ASR system, we need dialect annotations for each utterance since Arabic speakers, in broadcast conversations (BC), tend to code mix/switch between MSA and their native dialects across utterances and even within the same utterance. In this work, we build a dialect recognition system to identify dialects at the utterance level.

Biadsy et al. [4, 6] have previously shown that a dialect recognition approach that relies on the hypothesis that certain phones are realized differently across dialects achieves state-of-the-art performance for multiple dialect and accent tasks (including Arabic). We make use of this system (described next) to annotate some of our Arabic GALE data.

5.1.1 Phone Recognizer and Front-End

The dialect recognition approach makes use of phone hypotheses. Therefore, we first build a triphone context-dependent phone recognizer. The phone recognizer is trained on MSA using 50 h of GALE speech data of broadcast news and conversations with a total of 20,000 Gaussians. We use one acoustic model for silence, one for non-vocal noise and another to model vocal noise. We utilize a unigram phone model trained on MSA to avoid bias for any particular dialect. We also use FMLLR adaptation using the top CD-phone sequence hypothesis. Our phone inventory includes 34 phones, 6 vowels and 28 consonants.

5.1.2 Phone GMM-UBM and Phonetic Representation

The first step in the dialect recognition approach is to build a ‘universal’ acoustic model for each context-independent phone type. In particular, we first extract acoustic features (40-dimensional feature vectors after CMVN and FMLLR) aligned to each phone instance in the training data (a mix of dialects). Afterwards, using the frames aligned to the same phone type (in all training utterances), we train a Gaussian Mixture Model (GMM) with 100 diagonal-covariance Gaussian components for this phone type, employing the EM algorithm. We therefore build 34 GMMs. Each phone GMM can be viewed as a GMM-Universal Background Model (GMM-UBM) for that phone type, since it models the general realization of that phone across dialect classes [38]. We call these GMMs phone GMM-UBMs.

Each phone type in a given utterance (U) is represented with a single MAP (Maximum A-Posteriori) adapted GMM. Specifically, we first obtain the acoustic frames aligned to every phone instance of the same phone type in U. Then these frames are used to MAP adapt the means of the corresponding phone GMM-UBM using a relevance factor of r = 0.1. The resulting GMM of phone type ϕ is called the adapted phone-GMM (f ϕ ). The intuition here is that f ϕ ‘summarizes’ the variable number of acoustic frames of all the phone instances of a phone-type ϕ in a new distribution specific to ϕ in U [6].
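The sketch below illustrates both steps for one phone type using scikit-learn: a 100-component diagonal-covariance phone GMM-UBM is trained on pooled frames, and its means are MAP-adapted to the frames of one utterance with the standard relevance-factor update and r = 0.1. The data is random; real inputs would be the 40-dimensional CMVN/FMLLR frames aligned to that phone.

```python
# Phone GMM-UBM training and MAP mean adaptation (relevance factor r = 0.1).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Phone GMM-UBM: 100 diagonal-covariance Gaussians over all training frames
# aligned to this phone type (across dialects).
ubm_frames = rng.normal(size=(5000, 40))
ubm = GaussianMixture(n_components=100, covariance_type="diag",
                      max_iter=20, random_state=0).fit(ubm_frames)

def map_adapt_means(ubm, frames, r=0.1):
    """Return the MAP-adapted means f_phi for one phone type in one utterance."""
    post = ubm.predict_proba(frames)             # (T, 100) responsibilities
    n = post.sum(axis=0)                         # soft counts per Gaussian
    first = post.T @ frames                      # first-order statistics
    xbar = first / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + r))[:, None]               # data-vs-prior balance per Gaussian
    return alpha * xbar + (1.0 - alpha) * ubm.means_

utt_frames = rng.normal(size=(120, 40))          # frames of phone phi in utterance U
f_phi = map_adapt_means(ubm, utt_frames)
print(f_phi.shape)                               # (100, 40)
```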

5.1.3 A Phone-Type-Based SVM Kernel

Now, each utterance U can be represented as a set S U of adapted phone-GMMs, each of which corresponds to one phone type. Therefore, the size of S U is at most the size of the phone inventory ( | Φ | ). Let \(S_{U_{a}} =\{ f_{\phi }\}_{\phi \in \varPhi }\) and \(S_{U_{b}} =\{ g_{\phi }\}_{\phi \in \varPhi }\) be the adapted phone-GMM sets of utterances U a and U b , respectively. Using the kernel function in Eq. (13.16), designed by Biadsy et al. [6], which employs the KL-divergence-based upper-bound kernel (13.17) proposed by Campbell et al. [9], we train a binary SVM classifier for each pair of dialects. This kernel function compares the ‘general’ realization of the same phone types across a pair of utterances.

$$\displaystyle{ K(S_{U_{a}},S_{U_{b}}) =\sum _{\phi \in \varPhi }K_{\phi }(f^{\prime}_{\phi },g^{\prime}_{\phi }) }$$
(13.16)

where f′ ϕ is the same as f ϕ except that the corresponding Gaussian mean vectors of the phone GMM-UBM (of phone type ϕ) are subtracted from its Gaussian mean vectors; g′ ϕ is obtained similarly from g ϕ . The subtraction forces zero contributions from Gaussians that are not affected by the MAP adaptation. And,

$$\displaystyle{ K_{\phi }(f_{\phi },g_{\phi }) =\sum _{i}{(\sqrt{\omega _{\phi,i}}\varSigma _{\phi,i}^{-\frac{1} {2} }\mu _{i}^{f})}^{T}(\sqrt{\omega _{\phi,i}}\varSigma _{\phi,i}^{-\frac{1} {2} }\mu _{i}^{g}) }$$
(13.17)

where, ω ϕ, i and Σ ϕ, i respectively are the weight and diagonal covariance matrix of Gaussian i of the phone GMM-UBM of phone-type ϕ; \(\mu _{i}^{f}\) and \(\mu _{i}^{g}\) are the mean vectors of Gaussian i of the adapted phone-GMMs f ϕ and g ϕ , respectively.

It is interesting to note that, for (13.16), when K ϕ is a linear kernel, such as the one in (13.17), each utterance \(S_{U_{x}}\) can be represented as a single vector. This vector, say W x , is formed by stacking the mean vectors of the adapted phone-GMM (after scaling by \(\sqrt{\omega _{\phi }}\varSigma _{\phi }^{-\frac{1} {2} }\) and subtracting the corresponding μ ϕ ) in some (arbitrary) fixed order, and zero mean vectors for phone types not in U x . This representation allows the kernel in (13.16) to be written as in (13.18). This vector representation can be viewed as the ‘phonetic fingerprint’ of the speaker. It should be noted that, in this vector, the phones constrain which Gaussians can be affected by the MAP adaptation (allowing comparison under linguistic constraints realized by the phone recognizer), whereas in the GMM-supervector approach [8], in theory, any Gaussian can be affected by any frame of any phone.

$$\displaystyle{ K(S_{U_{a}},S_{U_{b}}) = W_{a}^{T}W_{ b} }$$
(13.18)
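The following NumPy sketch builds such a supervector for a toy configuration (34 phone types, 100 Gaussians per phone, 40-dimensional features, random UBM parameters and 'adapted' means) and evaluates the kernel of Eq. (13.18) as a dot product; phone types absent from an utterance contribute zero blocks, as described above.

```python
# 'Phonetic fingerprint' supervectors and the linear kernel of Eq. (13.18).
import numpy as np

rng = np.random.default_rng(0)
n_phones, n_gauss, dim = 34, 100, 40

# Per-phone UBM parameters (weights, diagonal covariances, means) -- random placeholders.
w = rng.dirichlet(np.ones(n_gauss), size=n_phones)           # (34, 100)
var = rng.uniform(0.5, 2.0, size=(n_phones, n_gauss, dim))   # diagonal Sigma
mu_ubm = rng.normal(size=(n_phones, n_gauss, dim))

def supervector(adapted_means, present):
    """adapted_means: (34, 100, 40); present[phi]=False -> zero block for that phone."""
    scale = np.sqrt(w)[..., None] / np.sqrt(var)              # sqrt(w) * Sigma^(-1/2)
    shifted = scale * (adapted_means - mu_ubm)                # subtract UBM means
    shifted[~present] = 0.0                                   # phones absent from the utterance
    return shifted.reshape(-1)

mu_a = mu_ubm + 0.1 * rng.normal(size=mu_ubm.shape)           # fake adapted phone-GMM means
mu_b = mu_ubm + 0.1 * rng.normal(size=mu_ubm.shape)
present_a = rng.random(n_phones) > 0.2
present_b = rng.random(n_phones) > 0.2

K_ab = supervector(mu_a, present_a) @ supervector(mu_b, present_b)   # Eq. (13.18)
print(K_ab)
```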

5.2 ASR and Dialect ID Data Selection

As noted above, the GALE data is not annotated based on dialects. Moreover, to the best of our knowledge, there is no Arabic dialect corpus of similar domain and/or acoustic condition as broadcast conversations. Fortunately, there are telephone conversation corpora available from the LDC for four Arabic dialects (Egyptian, Levantine, Gulf, and Iraqi). To address the acoustic recording and domain issues we build two systems.

In our first system, we train our dialect recognition system on dialect data taken from spontaneous telephone conversations from the following Appen corpora: Iraqi Arabic (478 speakers), Gulf (976), and Levantine (985). For Egyptian, we use the 280 speakers in CallHome Egyptian and its supplement. Using the kernel-based approach described above, we train a binary SVM classifier for each pair of dialects on 30 s cuts of 80 % of the speakers (of each corpus). Each cut consists of consecutive speech segments totaling 30 s in length (after removing silence). Multiple cuts are extracted from each speaker. As a result, we obtain six binary classifiers for the four broad Arabic dialects.

To label utterances with dialect ID tags, we need a single four-way classifier that assigns the dialect of the speaker to one of the four dialects. To build such a classifier, we first run the six SVM binary classifiers on the remaining 20 % held-out speakers. Every binary SVM classifier provides a posterior probability P(C 1 | x) that a test sample x belongs to class C 1. We use these posteriors as features to train a four-way logistic regression on the 20 % set. The 10-fold cross-validation accuracy of this classifier is 93.3 %; the F-measure of each dialect class is shown in Table 13.16.

Table 13.16 F-Measure for each dialect class using the four-way classifier with 30 s cuts
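A scikit-learn sketch of this fusion step is given below: six pairwise SVMs with probability outputs are trained, their posteriors are stacked into a six-dimensional feature vector, and a four-way logistic regression is trained on top. The data is synthetic, and for brevity the fusion is fit on the same samples rather than on held-out speakers as in our actual setup.

```python
# Fusing six pairwise SVM posteriors with a four-way logistic regression (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))                   # stand-in utterance vectors
y = rng.integers(0, 4, size=400)                 # 0..3: Egyptian/Levantine/Gulf/Iraqi

pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]   # 6 dialect pairs
svms = []
for i, j in pairs:                               # one binary SVM per dialect pair
    mask = (y == i) | (y == j)
    svms.append(SVC(kernel="linear", probability=True).fit(X[mask], y[mask]))

def posterior_features(X):
    """Stack P(C1 | x) from each of the six binary classifiers."""
    return np.column_stack([svm.predict_proba(X)[:, 0] for svm in svms])

fusion = LogisticRegression(max_iter=1000).fit(posterior_features(X), y)
print(fusion.score(posterior_features(X), y))    # training accuracy of the fusion
```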

We run this system to annotate a portion of our GALE BC data (after downsampling to 8 kHz). The dialect recognition system classified 54 h of speech as Levantine with relatively high confidence. Since the dialect ID system is trained on telephone conversations as opposed to broadcast conversations, we asked the LDC to validate/filter the output of the system. We find that about 36 h out of the 54 h are tagged as “mostly Levantine”, a 10 h set contains code switching between MSA and Levantine at the utterance level, and an 8 h set contains either other dialects or MSA. Recall that our system is not trained to identify MSA.

We extract a 4 h test set (LEV_4h) to be used for reporting results in all the Levantine ASR experiments. From the remaining 32 h we extract all the utterances longer than 20 s, which yields approximately 10 h of data (LEV_10). Part of the transcripts released by LDC for the GALE program have “non-MSA” annotations. This allows us to select a 40 h MSA corpus by choosing speakers whose utterances have no such markings. From this set we select 4 h for our MSA ASR experiments (MSA_4h). From the remaining data, we further select a 10 h set with utterances longer than 20 s (MSA_10).

5.3 Dialect Identification on GALE Data

Given that now we have gold standard BC MSA and Levantine data (MSA_10 and LEV_10), we can train another dialect recognition system to distinguish MSA vs. Levantine for BC acoustic conditions. We divide LEV_10 into 9 h for training and 1 h for testing our dialect recognition system. Similarly MSA_10 is divided into 9 h for training and 1 h for testing. Note that this amount of acoustic data is typically not sufficient to train dialect identification systems; however, we are interested in making use of the rest of the data for other experiments.

As described in Sect. 13.5.1, the dialect identification system needs a phone decoder; therefore we carry out a number of experiments to find the best strategy for building it. We train three MADA-vowelized (i.e., truly phonetic) triphone acoustic models in which we vary the number of Gaussians and the number of states, using either ML or discriminative training. First, we test these models for word recognition with our unpruned 4-gram LM. Table 13.17 shows the word error rates on the DEV-07 set.

Table 13.17 MADA AM used for dialect ID, WER test

In the next test, we use the triphone models to decode phone sequences with different phone language models. For each phone decoder we build a dialect classification system using the SVM-Kernel approach described in Sect. 13.5.1. We train the models on 9 h of Levantine data and 9 h of MSA data, and evaluate the results on a test set which contains 1 h of Levantine and 1 h of MSA data. Table 13.18 shows the dialect classification rates for the different acoustic model and phone language model combinations. Based on these results we decided to use the smallest, simplest model (50k Gaussians ML model with unigram phone language model) for the subsequent experiments.

Table 13.18 Dialect classification performance

5.4 Acoustic Modeling Experiments

5.4.1 Comparing Vowelizations

We select a 300-h subset from our entire GALE training set and train speaker-adaptive acoustic models for all three lexical setups. The decoding setup includes VTLN, FMLLR, and MLLR, and we use an unpruned 4-gram LM with a 795k vocabulary. First, we test the models on one of our standard GALE development sets, DEV-07, shown in Table 13.19. Pronunciation probabilities are used for both the Buckwalter and MADA systems. Buckwalter and MADA vowelizations perform similarly, while the unvowelized models are 2.7 % worse at the ML level. However, the difference is only 1 % after discriminative training. This indicates that discriminative training of context-dependent (CD) GMM models is able to compensate for the lack of (knowledge-based) pronunciation modeling to a large degree.

Table 13.19 300 h AM tested on DEV-07

In the next comparison, we test the models on a newly defined MSA test set. The motivation for this set is that we want to use the same methodology for defining/selecting a test set for both Levantine and MSA, so that we can analyze the difficulty of Levantine compared to MSA under exactly the same conditions; we are essentially reducing effects related to how and from where the test sets are chosen. DEV-07, for example, is a test set defined by LDC which consists of mostly very clean broadcast news data, which is very likely the reason behind our very low error rates. The MSA_4h test set is selected randomly from broadcast conversations of our training set and labeled as MSA by our dialect classifier. The reason to select the data from broadcast conversations is to match the conditions of the Levantine test set: all of the Levantine data comes from BC as well. The error rates on this MSA test set (Table 13.20) are almost twice as high as the error rates on DEV-07 (Table 13.19), although both are non-dialectal (MSA) test data. We also see that all three models perform at a similar level (21.2–21.8 %) after discriminative training.

Table 13.20 300 h AM tested on MSA_4h

We now compare the models on Levantine data (LEV_4h). Recall that this Levantine test set is part of the GALE corpus, identified automatically by our dialect classifier and manually verified by LDC (see Sect. 13.5.2). The same methodology for selecting the test data is used for MSA_4h and LEV_4h, and both test sets are excluded from the training of the acoustic and language models. Looking at Tables 13.20 and 13.21, we observe two main points:

  1. The error rate for Levantine is almost twice as high as for MSA (39.7 vs. 21.8 %). We compare the Levantine error rate to MSA_4h and not to DEV-07; this allows us to attribute the increase in error rate to dialect and not to other effects (how the test set was chosen and how carefully the transcripts were done).

  2. Another interesting observation is that the unvowelized models perform best on Levantine (39.4 vs. 40.8 and 42.1 %). We speculate that this is because the Buckwalter analyzer, MADA, and the pronunciation rules are designed for MSA and do not work properly for Levantine words. A dialect-specific morphological analyzer would very likely improve results, but it is unclear whether it would significantly reduce the error rate on Levantine, given that the unvowelized models perform comparably well on MSA data (Table 13.20).

Table 13.21 300 h AM tested on LEV_4h

5.4.2 Selecting Dialect Data from the 300-h Training Subset

We now run the dialect recognition system on our 300-h subset of the GALE training corpus. Out of this training set, we obtain about 37 h labeled as Levantine. This is not sufficient to train a set of acoustic models. One option is to use a deep MLLR regression tree or MAP training. In our experience MLLR works well for limited domain adaptation data, but will not be able to fully utilize a large amount of domain adaptation data. While MAP works better with more adaptation data, it is difficult to use it in combination with feature space discriminative training.

Instead, we use a form of training with weighted statistics. The advantage is that all components of the model (including decision trees) are trained at all training stages (ML, DT) with the new domain data. In our case we have additional information in the form of a dialect posterior probability for each utterance from the dialect classifier. We use this posterior to weight the statistics of each utterance during ML and discriminative training. The modified statistics are computed as shown in formula (13.19).

$$\displaystyle{ E(x) =\sum _{i}P(\mathit{dialect}\vert x_{i})\,x_{i},\qquad E({x}^{2}) =\sum _{i}P(\mathit{dialect}\vert x_{i})\,x_{i}^{2} }$$
(13.19)
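The sketch below accumulates such posterior-weighted zeroth-, first-, and second-order statistics over a set of fabricated utterances and re-estimates a diagonal Gaussian from them, which is the essence of Eq. (13.19); in the actual system the statistics are, of course, accumulated per context-dependent state and Gaussian.

```python
# Posterior-weighted sufficient statistics for training on dialect-relevant data.
import numpy as np

rng = np.random.default_rng(0)
utterances = [rng.normal(size=(rng.integers(50, 200), 40)) for _ in range(10)]
p_dialect = rng.uniform(size=10)                 # P(dialect | utterance) from the classifier

count = 0.0
first = np.zeros(40)                             # E(x)   accumulator
second = np.zeros(40)                            # E(x^2) accumulator
for x, p in zip(utterances, p_dialect):
    count += p * len(x)
    first += p * x.sum(axis=0)
    second += p * (x ** 2).sum(axis=0)

mean = first / count
var = second / count - mean ** 2                 # diagonal ML estimate from weighted stats
print(mean.shape, var.shape)
```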

Table 13.22 shows a comparison of different weighting schemes. In the first row, we simply train on all 300 h regardless of whether they are Levantine or MSA. This model gives us an error rate of 48.2 %. In the second row, we train only on the selected Levantine subset of 37 h. The error rate is slightly higher, 48.3 %, due to the lack of training data. In the third row, we train on the same 300 h, but weight the statistics of each utterance individually by the posterior score of the dialect classifier. This smooths the models, avoids overtraining, and yields a 2.9 % error reduction.

Table 13.22 Comparing weighting schemes of training statistics on LEV_4h, 300 h setup, unvowelized ML models

We now apply the soft-weighting scheme to all vowelization setups and compare the models both after ML and after fBMMI + BMMI training in Table 13.23. The improvement from focusing on Levantine training data can be seen by comparing Table 13.21 with Table 13.23. For example, for the unvowelized models, we obtain a 2.9 % absolute error reduction at the ML level, and 1.3 % after discriminative training. Note that we do not add training data; rather, we find relevant subsets that match our target dialect.

Table 13.23 300 h AM tested on LEV_4h

5.4.3 Tree Array Combination

When we focus the training on Levantine, we can expect the model to perform worse on MSA data. In fact, the error rate increases from 12.7 to 15.1 % on DEV-07 when we use the Levantine models (Table 13.24). Our toolkit allows us to combine models with different decision trees into one single decoding graph [46]. This enables us to combine different acoustic models in one decoding pass on the fly, without making a hard model selection. The combined acoustic score is the weighted sum of the log likelihoods of the combined models. In our case, we combine the MSA and LEV unvowelized models. The results are in Table 13.24. The first two rows represent the extreme cases where either the MSA or the LEV model is used exclusively. In the third row, we weight both models equally, with constant weights for all utterances. The error rate on DEV-07 is 13.3 %, 0.6 % higher than when just using the MSA model, but much better than when using the LEV models only (15.1 %). On the other hand, we get a small improvement on the Levantine test set (from 38.4 to 38.2 %); this is a system combination effect. We have used tree arrays in the past as an alternative to ROVER or cross-adaptation, for example in our latest GALE evaluation. In the last row of Table 13.24 we use the posterior of the dialect classifier as a soft weight for model combination on a per-utterance basis. This automatic strategy gives us an error rate that is close to the optimal performance of a manually selected model.

Table 13.24 Tree array combination of general models with Levantine models in one decoding pass, 300-h unvowelized fBMMI + BMMI setup
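The per-utterance soft weighting used in the last row of Table 13.24 amounts to the small computation sketched below, where the two log-likelihood functions are dummies standing in for the MSA and Levantine acoustic models and the dialect posterior comes from the classifier of Sect. 13.5.1.

```python
# Soft tree-array combination: dialect posterior weights the per-model log-likelihoods.
import numpy as np

rng = np.random.default_rng(0)

def loglik_msa(frame, state):                    # placeholders for the two
    return -0.5 * np.sum((frame - 0.0) ** 2)     # acoustic models' scores

def loglik_lev(frame, state):
    return -0.5 * np.sum((frame - 0.2) ** 2)

def combined_score(frame, state, p_lev):
    """p_lev: P(Levantine | utterance) from the dialect classifier."""
    return p_lev * loglik_lev(frame, state) + (1.0 - p_lev) * loglik_msa(frame, state)

frame = rng.normal(size=40)
print(combined_score(frame, state=1234, p_lev=0.85))
```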

5.4.4 Selecting Dialect Data from the 1,800-h Training Set

The full GALE training corpus consists of about 1,800 h. Similar to the previous experiments, but now focusing exclusively on the unvowelized models, we generate dialect labels for the entire training corpus. The dialect recognition system identified about 237 h of the GALE corpus (or 13 %) as Levantine. In Table 13.25, we compare different weighting schemes for the Levantine data. In contrast to the 300 h setup (Table 13.22), the best error rate is now achieved by training exclusively on the 237 h of Levantine data rather than by using the dialect scores to weight the statistics. The reason is simply that the amount of Levantine training data is now large enough to train acoustic models, so we no longer need to add data as in the previous experiments, where we had only 37 h of Levantine data.

Table 13.25 Comparing weighting schemes of training statistics on LEV_4h, 1,800-h setup, unvowelized ML models

After discriminative training (fBMMI + BMMI) of the 237 h unvowelized Levantine models, the error rate goes down to 36.3 %. In other words, we can lower the error rate by almost 10 % relative by focusing on relevant subsets of the training data. Combined with the tree array decoding technique, which allows us to use both Levantine and MSA models in one decoding pass, the dialect classifier lets the engine handle both dialectal and non-dialectal utterances at the same time.

5.5 Dialect ID Based on Text Only

The experiments described in Sect. 13.5.4 demonstrate that the acoustic training data contains relevant dialect subsets which, when detected, can improve the acoustic models. In this section, we report on a similar strategy for language modeling, but now we build a dialect classifier based on text only; no audio data is used. First, we build a Kneser–Ney smoothed 3-gram Levantine LM on the 2 million words corresponding to the transcripts of the 237 h of Levantine acoustic training data (identified automatically). Similarly, we build an MSA language model from all the utterances which are classified as MSA with more than 95 % probability by the dialect annotator. The text dialect classifier simply checks the log-likelihood ratio of the two LMs on a given utterance. Table 13.26 shows that we can predict the dialect reliably even when only text data is available.

Table 13.26 Text only dialect classification using Levantine and MSA language models
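A toy version of this classifier is sketched below. For brevity it uses unigram models with add-one smoothing and a few fabricated words in Buckwalter transliteration in place of the 3-gram LMs trained on the Levantine and MSA corpora; the decision rule, the sign of the log-likelihood ratio, is the same.

```python
# Text-only dialect classification by log-likelihood ratio of two LMs (toy version).
import math
from collections import Counter

lev_corpus = "Em hyk bdw bdk ylly blbnAn".split()          # fabricated training text
msa_corpus = "Alr}ys qAl An AlHkwmp stjtmE gdA".split()

def unigram_lm(corpus):
    counts, total = Counter(corpus), len(corpus)
    vocab = set(lev_corpus) | set(msa_corpus)
    return lambda w: math.log((counts[w] + 1) / (total + len(vocab)))  # add-one smoothing

lev_lm, msa_lm = unigram_lm(lev_corpus), unigram_lm(msa_corpus)

def classify(utterance):
    llr = sum(lev_lm(w) - msa_lm(w) for w in utterance.split())
    return ("Levantine" if llr > 0 else "MSA"), llr

print(classify("hyk bdk"))                                  # ('Levantine', ...)
```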

5.5.1 Levantine LM

Our GALE language models are trained on a collection of 1.6 billion words, which we divide into 20 parts based on the source. We train a 4-gram model with modified Kneser–Ney smoothing [13] for each source, and then linearly interpolate the 20 component models with the interpolation weights chosen to optimize perplexity on a held-out set. In order to build a Levantine language model, we run the text dialect annotator described above on each of the 20 text sources and build 4-gram language models on the 20 dialectal subparts. The new 20 dialect language models are interpolated with the 20 original ones. We optimize the interpolation weights of the 40 language models on a Levantine held-out set. Table 13.27 shows the improvements obtained by adding dialect data to the original language model. Note that the improvement from adding dialect language models is less than the one obtained from dialect acoustic models. One reason for this is the fact that the initial dialect data is selected from the BC part of the training data, and the BC language model has a high weight in the baseline interpolated LM.

Table 13.27 LM rescoring with Levantine LM

5.5.2 Finding Levantine Words

We can identify dialectal words by computing how often each word occurs in the Levantine corpus vs. the MSA one. After sorting the count ratios, we find the following words ranked at the top of the list: Em, hyk, bdw, bdk, ylly, blbnAn, which are in fact Levantine words. Note that identifying dialectal words can be useful for building better pronunciation dictionaries for dialects as well as for machine translation.
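The count-ratio test can be sketched as follows; the counts are fabricated, and the smoothing constant is an assumption to avoid division by zero.

```python
# Ranking words by their Levantine-to-MSA count ratio.
from collections import Counter

lev_counts = Counter({"hyk": 350, "bdk": 280, "Em": 300, "Aljys$": 40})
msa_counts = Counter({"hyk": 3, "bdk": 2, "Em": 5, "Aljys$": 90})

def dialect_score(word, smoothing=1.0):
    return (lev_counts[word] + smoothing) / (msa_counts[word] + smoothing)

ranked = sorted(set(lev_counts) | set(msa_counts), key=dialect_score, reverse=True)
print(ranked)      # ['bdk', 'hyk', 'Em', 'Aljys$'] -- dialectal words first
```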

6 Resources

As explained in Sect. 13.1.1, a speech recognizer consists of models representing the underlying acoustic and language characteristics. In order to build a speech recognizer, language-specific data for training these models is required. The Linguistic Data Consortium (www.ldc.upenn.edu) is a good starting point to look for publicly available data resources. In this section we describe the available data for Arabic, most of which was collected as part of DARPA programs.

6.1 Acoustic Training Data

We used the following corpora for acoustic model training in various experiments presented here:

  • 85 h of FBIS and TDT-4 audio with transcripts provided by BBN,

  • 51 h of transcribed GALE data provided by the LDC for the GALE Phase 1 (P1) evaluation,

  • 700 h of transcribed GALE data provided by LDC for the Phase 2 (P2) and Phase 2.5 (P2.5) evaluations,

  • 5.6 h of BN data with manually vowelized transcripts provided by BBN (BNAT-05),

  • 5.1 h of BN data with manually vowelized transcripts provided by BBN (BNAD-05),

  • 500 h of transcribed Iraqi data (TRANSTAC),

  • 1,800 h of untranscribed audio from the EARS BN-03 corpus, and

  • 10,000 h of untranscribed audio collected at IBM Research (TALES).

BNAT-05 and BNAD-05 were used only for a comparison of flat-start training to manual data for the initialization of vowelized models. The TRANSTAC data was used only for experiments on dialect modeling. The TALES data was used only for experiments on large-scale unsupervised training, including tests of dialect models. The evaluation system described in Sect. 13.4 was trained only on data available to all GALE program participants: FBIS, TDT-4, GALE P1 and P2, and EARS BN-03.

Arabic broadcast news and conversations are highly variable, coming from many different broadcasters and containing a mixture of Modern Standard Arabic (MSA) and dialectal Arabic. To illustrate this variability, Table 13.28 provides a breakdown by source of a sample comprising 130 h of GALE transcribed data, 750 h of untranscribed EARS BN-03 data, and 5,600 h of untranscribed TALES data. Note that only the predominant sources in the sample are listed; there is also a long tail of additional sources represented by smaller amounts of audio.

Table 13.28 Source variability in Arabic broadcast training data

6.2 Training Data for Language Modeling

We used the following resources for language modeling:

  • Transcripts of the audio data released by LDC (7M words),

  • The Arabic Gigaword corpus (500M words),

  • News group and web log data collected by LDC (22M words),

  • Web transcripts for broadcast conversations collected by CMU/ISL (100M words), and

  • Web text data collected by Cambridge University and CMU (200M words).

Automatic transcripts from the unsupervised training corpus were not used for language modeling.

6.3 Vowelization Resources

Besides the training data, there are other resources that are helpful, though not strictly essential, for building Arabic systems. Many research groups use these tools in their Arabic LVCSR systems to generate vowelized pronunciation lexica. In Sect. 13.2.2 we demonstrate how these tools can be used to generate vowelized forms for unvowelized corpora and improve speech recognition performance; a short sketch following the list below illustrates how a vowelized form can be mapped to a pronunciation.

  1. Buckwalter's morphological analyzer

    The morphological analyzer [7] can be used to decompose (unvowelized) words into morphemes and provide a list of vowelizations. It is also available from LDC.

  2. MADA (Morphological Analysis and Disambiguation for Arabic)

    This tool [22] can be used to rank and refine the vowelized variants by performing a sentence-wide analysis. It takes the analyses produced by Buckwalter's morphological analyzer and reranks them with an SVM classifier based on morphological and n-gram features.

  3. Arabic Treebank

    The Arabic Treebank (www.ircs.upenn.edu/arabic) consists of part-of-speech, morphological, and syntactic annotations for a corpus collected as part of the DARPA TIDES program. The corpus can be used to find vowelized forms for pronunciation lexica.
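
As a concrete illustration of how such vowelized forms feed into a pronunciation lexicon, the following sketch maps a fully vowelized word in Buckwalter transliteration to a phone sequence. The symbol-to-phone table is deliberately simplified and covers only a subset of the Buckwalter symbols; it is illustrative only and not the grapheme-to-phone rules used in the systems described in this chapter.

```python
# Simplified, partial mapping from Buckwalter symbols to illustrative phones.
BW_TO_PHONE = {
    'b': 'b', 't': 't', 'v': 'th', 'j': 'j', 'H': 'H', 'x': 'x',
    'd': 'd', 'r': 'r', 'z': 'z', 's': 's', '$': 'sh', 'S': 'S',
    'D': 'D', 'T': 'T', 'E': 'E', 'g': 'gh', 'f': 'f', 'q': 'q',
    'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'h': 'h', 'w': 'w',
    'y': 'y', 'A': 'aa', 'a': 'a', 'i': 'i', 'u': 'u',
}

def vowelized_to_pronunciation(bw_word):
    """Map a fully vowelized Buckwalter string to a phone sequence.

    '~' (shadda) doubles the preceding phone, 'o' (sukun) marks the absence
    of a vowel and is dropped; symbols outside the table are skipped."""
    phones = []
    for symbol in bw_word:
        if symbol == '~' and phones:
            phones.append(phones[-1])      # gemination
        elif symbol == 'o':
            continue                       # sukun: no vowel
        elif symbol in BW_TO_PHONE:
            phones.append(BW_TO_PHONE[symbol])
    return phones

# e.g. vowelized_to_pronunciation('kitAb') -> ['k', 'i', 't', 'aa', 'b']
```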

7 Comparing Arabic and Hebrew ASR

The goal of this section is to describe the challenges involved in building an ASR system for modern Hebrew. We hypothesize that most of the techniques used for building an Arabic ASR will carry over to Hebrew as well.

Hebrew is a morphologically rich language; therefore, as in Arabic, the vocabulary size can become very large if explicit morphological modeling is not employed, which affects both the lexicon and the language model. This is likely to lead to higher language model perplexity, a larger decoder search space, and larger memory requirements. We have seen in this chapter how to handle these issues for Arabic, and the same techniques are likely to carry over. The diacritization phenomenon, called Niqqud in Hebrew, is almost identical to that of Arabic: short vowels and consonant alternation markers are written as diacritic marks above, below, or inside letters. As in Arabic, most modern Hebrew texts are written with almost no Niqqud whatsoever, and the same word with different Niqqud typically differs in meaning, part of speech, and/or pronunciation.

As in Arabic, there is almost a one-to-one mapping between fully diacritized Hebrew words and their pronunciations using simple pronunciation rules. Therefore, automatic morphological analysis and disambiguation techniques may be useful for building a “vowelized” Hebrew system, as we have seen for Arabic. We also hypothesize that a completely “unvowelized” system will perform relatively well for Hebrew: ignoring the Niqqud entirely may still yield a reasonably accurate system, particularly when discriminative training is used. Unlike Arabic, modern Hebrew has no notion of regional dialects, which suggests that building a Hebrew ASR system is far easier than building a unified Arabic ASR system: Hebrew has a single well-defined spelling system, a single vocabulary, a single morphological and syntactic system, and a single phonetic inventory.

8 Summary

In this chapter we described both language-independent and language-specific techniques for building high-performance LVCSR systems. The discussion is based on our experience in the 5-year DARPA GALE program, which aimed at improving Arabic speech recognition and translation. Our main findings are:

  1. Current LVCSR technology works fairly robustly on a set of different languages. Ninety percent of the improvements in our system come from basic technology that works across languages. It is important to get the basic techniques for acoustic and language modeling right before addressing language-specific issues.

  2. The lack of diacritics in written texts is one of the more important language-specific issues. Short vowels are spoken but not written, creating a mismatch between transcripts and audio. The first solution is to use semi-automatic vowelization procedures such as Buckwalter's morphological analyzer. These tools can be used to add vowelized pronunciation variants to the lexicon. Flat-start training over pronunciation graphs can then initialize the acoustic models, and iterative system building leads to very good vowelized acoustic models.

  3. Discriminative training is able to compensate for modeling errors. Instead of trying to model short vowels explicitly, we let the acoustic models learn to deal with modeling errors in the pronunciation lexicon. Long-span phonetic context allows the decision trees to create states corresponding to effects related to missing short vowels. Discriminative training (feature- and model-space boosted MMI or MPE) trains acoustic models in a way that corrects modeling errors (the standard formulation of the boosted MMI criterion is reproduced after this list for reference). Experimental results show that unvowelized models are very competitive once speaker adaptation and discriminative training are added.

  4. The rich morphology of the Arabic language is another important issue for LVCSR. It creates two problems: many words are not covered by regular-sized vocabularies, and language modeling suffers from data sparsity. One solution is to build a morpheme-based vocabulary and language model. This approach yields very low OOV rates while maintaining regular-sized vocabularies. The downside is a postprocessing step that converts the recognition output back to words; because the mapping is usually ambiguous, this step introduces additional errors (a sketch of this word-reconstruction step is given after this summary list).

     A more elegant approach, in our opinion, is to rely on careful engineering. A well-written LVCSR decoder can easily work with very large vocabularies. We can then use word-based vocabularies and simply increase the vocabulary to millions of words. No postprocessing of the recognition output is required, and the OOV rates on standard test sets are below 0.5 %. Class-based exponential language models such as Model-M have proved to be very capable.

  5. Increased data sets make language-specific techniques less important. During the course of the GALE program the amount of training data increased from 100 h to more than 1,300 h. We observed that improvements from vowelization were reduced significantly when more training data became available.

  6. Languages such as Arabic cover a wide variety of dialects. We presented several methods for helping LVCSR systems cope with dialects. One approach is to make acoustic models dialect-specific: adding dialect questions to the decision tree is a data-driven way to generate HMM states specific to certain dialects. Another approach is to bootstrap a dialect-specific system by leveraging MSA models; the dialect recognition system allows us to focus on relevant training subsets. While specialized models for Levantine perform poorly on MSA, the tree array decoding procedure allows us to mix both models without sacrificing performance. We also showed that a text-only dialect classifier performs as well as a dialect classifier requiring audio data, which enables us to find relevant LM text data. Another application is pronunciation modeling, where the text-based dialect classifier can provide us with candidate words that occur only in Levantine.

  7. Diversity of acoustic models is good for system combination. Instead of relying on either vowelized or unvowelized models, we use both. The combination can be done via cross-adaptation, confusion network combination, or other methods. System combination makes LVCSR technology more robust across a variety of test sets and improves the error rate significantly.
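
For reference, the boosted MMI criterion mentioned in item 3, in its commonly used formulation from the literature (the exact variant implemented in any particular system may differ), maximizes

$$\displaystyle{ \mathcal{F}_{\mathrm{bMMI}}(\varTheta ) =\sum _{r}\log \frac{p(\mathbf{X}_{r}\vert \mathbf{W}_{r};\varTheta )^{\kappa }\,P(\mathbf{W}_{r})}{\sum _{\mathbf{W}}p(\mathbf{X}_{r}\vert \mathbf{W};\varTheta )^{\kappa }\,P(\mathbf{W})\,{e}^{-b\,A(\mathbf{W},\mathbf{W}_{r})}} }$$

where \(\mathbf{X}_{r}\) and \(\mathbf{W}_{r}\) are the r-th training utterance and its reference transcript, κ is an acoustic scaling factor, b is the boosting factor, and \(A(\mathbf{W},\mathbf{W}_{r})\) is a phone- or state-level accuracy of hypothesis W against the reference; setting b = 0 recovers standard MMI.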
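
To illustrate the word-reconstruction step mentioned in item 4, the following minimal sketch rejoins a morpheme-level hypothesis into words. It assumes a hypothetical marking convention in which prefixes end with '+' (e.g. 'Al+') and suffixes begin with '+' (e.g. '+hm'); actual segmentation schemes differ, and the inverse mapping is in general ambiguous, which is precisely the source of the extra errors.

```python
def morphemes_to_words(tokens):
    """Rejoin a morpheme-level recognition hypothesis into words.

    Assumes prefixes are written with a trailing '+' and suffixes with a
    leading '+' (a hypothetical convention used only for this sketch)."""
    words = []
    pending_prefix = ''
    for tok in tokens:
        if tok.endswith('+'):                 # prefix: hold it for the next stem
            pending_prefix += tok[:-1]
        elif tok.startswith('+') and words:   # suffix: glue to the previous word
            words[-1] += tok[1:]
        else:                                 # stem: attach any pending prefixes
            words.append(pending_prefix + tok.lstrip('+'))
            pending_prefix = ''
    if pending_prefix:                        # dangling prefix at utterance end
        words.append(pending_prefix)
    return ' '.join(words)

# e.g. morphemes_to_words(['w+', 'Al+', 'ktAb']) -> 'wAlktAb'
#      morphemes_to_words(['ktAb', '+hm'])       -> 'ktAbhm'
```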