
15.1 Introduction

Sparse representation techniques for machine learning applications have become increasingly popular in recent years [1, 2]. Since it is not obvious how to represent speech as a sparse signal, sparse representations have only recently received attention from the speech community [3], where they were proposed originally as a way to enforce exemplar-based representations. Exemplar-based approaches have also found a place in modern speech recognition [4] as an alternative way of modeling observed data. Recent advances in computing power and improvements in machine learning algorithms have made such techniques successful on increasingly complex speech tasks. The goal of exemplar-based modeling is to establish a generalization from the set of observed data such that accurate inference (classification, decision, recognition) can be made about the data yet to be observed, the “unseen” data. This approach selects a subset of exemplars from the training data to build a local model for every test sample, in contrast with the standard approach, which uses all available training data to build a model before the test sample is seen.

Exemplar-based methods, including k-nearest neighbors (kNN) [1], support vector machines (SVMs) and sparse representations (SRs) [3], utilize the details of actual training examples when making a classification decision. Since the number of training examples in speech tasks can be very large, such methods commonly use a small number of training examples to characterize a test vector, that is, a sparse representation. This approach stands in contrast to such standard regression methods as ridge regression [5], nearest subspace [6], and nearest line [6] techniques, which utilize information about all training examples when characterizing a test vector.

An SR classifier can be defined as follows. A dictionary \(H=[h_1, h_2, \ldots , h_N]\) is constructed from individual examples of training data, where each \(h_i \in Re^m\) is a feature vector belonging to a specific class. \(H\) is an over-complete dictionary, in that the number of examples \(N\) is much greater than the dimension of each \(h_i\) (that is, \(m \ll N\)). To reconstruct a signal \(y\) from \(H\), SR solves \(y \approx H\beta \) for \(\beta \), but imposes a sparseness condition on \(\beta \), meaning that only a small number of examples from \(H\) is required to describe \(y\). A classification decision can be made by looking at the values of the \(\beta \) coefficients for columns in \(H\) belonging to the same class.

The goal of this chapter is to explain how sparse optimization methods can be exploited in speech, to show how sparse representations can be constructed for classification and recognition tasks, and to give an overview of the results obtained using sparse representations.

15.1.1 Chapter Organization

The remainder of the chapter is organized as follows. The second section deals with the mathematical aspects of sparse optimization. We describe two SR methods: approximate Bayesian compressive sensing (ABCS) [7] and convex hull extended Baum-Welch (CHEBW) [8]. We also discuss their relation to the Extended Baum-Welch (EBW) optimization framework [9].

The third section is concerned with a variety of sparseness techniques employing different types of regularization [2, 3]. Following [10], we explore what type of sparseness regularization should be employed. Typically, sparseness methods such as LASSO [11] and Bayesian compressive sensing (BCS) [12] use an \(l_1\) sparseness constraint. Other possibilities include the Elastic Net [13], which uses a combination of an \(l_1\) and an \(l_2\) (Gaussian prior) constraint, and ABCS [3], which uses an \(l_1^2\) constraint, known as a semi-Gaussian prior. We analyze the differences in the sparseness objectives of these methods and compare their performance for phonetic classification on TIMIT.

In the fourth section, we explore the application of ABCS to the phoneme classification task on TIMIT. The benefit of this Bayesian approach is that it allows us to build compressive sensing (CS) on top of other Bayesian classifiers, for example a Gaussian mixture model (GMM). Following [3], we show that the CS technique attains an accuracy of \(80.01\,\%\), outperforming the GMM, kNN, and SVM methods.

In the fifth section, we describe a novel exemplar-based technique for classification problems, in which for every new test sample the classification model is re-estimated from a subset of relevant samples of the training data. We formulate the exemplar-based classification paradigm as an SR problem and explore the use of convex hull constraints to enforce both regularization and sparsity. Finally, we utilize the EBW optimization technique to solve the SR problem and apply the proposed methodology to the TIMIT phonetic classification task, showing statistically significant improvements over common classification methods.

In the sixth section, following [14], we explore the use of exemplar-based SR to map test features into the linear span of training examples. Given these new SR features, we train a Hidden Markov Model (HMM) and perform recognition. On the TIMIT corpus, we show that applying the SR features on top of our best discriminatively trained system yields a reduction in phonetic error rate (PER) from \(19.9\,\%\) to \(19.2\,\%\). In fact, after applying model adaptation we reduce the PER further to \(\mathbf{{19.0}}\,\%\), which was the best result on TIMIT reported in 2011. Furthermore, on a large vocabulary 50-h broadcast news task, we achieve a reduction in word error rate (WER) of \(\mathbf{{0.3}}\,\%\).

In the seventh section, following [15], we discuss using SRs to create a new set of sparse representation phone identification features (\(S_{pif}\)). We describe the \(S_{pif}\) features for both small and large vocabulary tasks. On the TIMIT corpus [16], we show that the use of SR in conjunction with our best context-dependent (CD) HMM system allows for a \(0.7\,\%\) absolute reduction in phonetic error rate (PER), to \(23.8\,\%\). Furthermore, on a 50-h Broadcast News task [17], we achieve a reduction in word error rate (WER) of \(0.9-17.8\,\%\), using the SR features on top of our best discriminatively trained HMM system.

In the eighth section we describe how one can improve sparse exemplar modeling for speech tasks via enhancing exemplar-based posteriors.

15.2 Sparse Optimization

Recent studies have shown that sparse signals can be recovered accurately using fewer observations than the Nyquist/Shannon sampling principle would imply. The emergent theory that brought this insight to light is known as compressive sensing (CS) [22, 23]. Problems of reconstructing signals from compressive sensing data can be represented in several equivalent ways. One such formulation is the following optimization problem:

$$\begin{aligned} \min _\beta \parallel y-H\beta \parallel _2 \quad {\text {subject to}} \;\; \parallel \beta \parallel _1 \le \epsilon , \end{aligned}$$
(15.1)

where \(y\) is an \(m\)-dimensional vector, \(\beta \) is an \(N\)-dimensional vector, and \(H\) is an \(m \times N\) matrix. The parameter \(\epsilon \) controls the sparsity of the recovered solution. Provided \(H\) satisfies certain properties, the signal \(\beta \) can be reconstructed even when the number of observations \(m\) is much less than the dimension \(N\) of the ambient space in which \(\beta \) resides. In fact, the required number of observations \(m\) is related more strongly to the number of nonzeros in \(\beta \).

This formulation can be generalized to handle other types of sparse and regularized optimization. We can write

$$\begin{aligned} \min _{\beta } \, f(\beta ) \quad \text {subject to} \;\; \phi (\beta ) \le \epsilon , \end{aligned}$$
(15.2)

where \(f\) and \(\phi \) are typically convex functions mapping \(\mathbb R ^n\) to \(\mathbb R \). Usually, \(f\) is a loss function or a likelihood-based objective, while the regularization function \(\phi \) is nonsmooth and chosen so as to induce the desired type of structure in \(\beta \). As noted above, the popular choice \(\phi (\beta ) = \Vert \beta \Vert _1\) induces sparsity in \(\beta \). An alternative to (15.2) is the following weighted formulation:

$$\begin{aligned} \min _{\beta } \, f(\beta ) + \lambda \phi (\beta ), \end{aligned}$$
(15.3)

for some parameter \(\lambda \ge 0\). It can be shown that (15.2) and (15.3) are equivalent: Under certain assumptions on \(f\) and \(\phi \), the solution of (15.2) for some value of \(\epsilon >0\) is identical to the solution of (15.3) for some value of \(\lambda \ge 0\), and vice versa.

We can generalize the formulations (15.2) and (15.3) further by considering nonconvex loss functions \(f\) and regularization functions \(\phi \), and by adding explicit constraints on the values of \(\beta \). Nonconvex functions \(f\) arise, for example, in deep belief networks, in which the outputs are highly nonconvex functions of the parameters in the network. Nonconvex regularizers \(\phi \) such as SCAD and MCP are sometimes used to avoid the biasing effects associated with convex penalties. Explicit constraints such as nonnegativity (\(\beta \ge 0\)) and simplex constraints (\(\beta \ge 0\) and \(\sum \nolimits _{i=1}^n \beta _i=1\)) are common in many settings.

Many algorithms have been proposed to solve (15.2) and (15.3); most exploit the particular structure of \(f\) and \(\phi \) in various applications. One general approach that has been applied successfully in several settings is the prox-linear approach, in which \(f\) in (15.3) is replaced by a linear approximation together with a prox-term that discourages the new iterate \(\beta ^{k+1}\) from moving too far from the current iterate \(\beta ^k\). The subproblem to be solved at each iteration is:

$$\begin{aligned} \beta ^{k+1} = \arg \min _{\beta } \, \nabla f(\beta ^k)^T(\beta -\beta ^k) + \frac{1}{2\alpha _k} \Vert \beta - \beta ^k \Vert _2^2 + \lambda \phi (\beta ), \end{aligned}$$
(15.4)

where \(\alpha _k\) is a positive parameter that plays the role of a line-search parameter. If the new iterate does not give satisfactory descent in the objective function of (15.3), we can decrease \(\alpha _k\) and recompute a more conservative alternative value of \(\beta ^{k+1}\), repeating as necessary.

The approach based on (15.4) is potentially useful when (a) the gradient \(\nabla f(\cdot )\) can be computed at reasonable cost and (b) the subproblem (15.4) can be solved efficiently. Both situations typically hold in compressed sensing, under the formulation (15.3) with \(f(\beta ) = \Vert H \beta - y \Vert _2^2\) and \(\phi (\cdot ) = \Vert \cdot \Vert _1\). In this situation, the solution of (15.4) can be computed in \(O(n)\) operations.
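
A minimal sketch of the prox-linear iteration (15.4) for this compressed-sensing instance is given below. For the \(l_1\) regularizer the subproblem reduces to componentwise soft-thresholding; the variable names, the fixed step size, and the iteration count are illustrative assumptions rather than part of the chapter.

```python
import numpy as np

def soft_threshold(v, tau):
    # Closed-form solution of (15.4) for the l1 regularizer:
    # shrink each component toward zero by tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_linear_l1(H, y, lam, step=None, iters=200):
    """Sketch of the prox-linear (ISTA-style) iteration for
    min_beta ||H beta - y||_2^2 + lam * ||beta||_1."""
    beta = np.zeros(H.shape[1])
    if step is None:
        # Any step below 1 / L, where L = 2 * ||H||_2^2 is the Lipschitz
        # constant of the gradient of f, guarantees descent.
        step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)
    for _ in range(iters):
        grad = 2.0 * H.T @ (H @ beta - y)            # gradient of f at beta^k
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

In a practical implementation the step size would be adapted, as described above, rather than held fixed.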

In the remainder of this chapter, we consider two fundamental methods for sparse optimization: an extended Baum-Welch (EBW) method (which can be expressed via a line-search \(\mathcal A \)-function (LSAF)) and an Approximate Bayesian Compressive Sensing (ABCS) algorithm, which is also closely related to EBW. The LSAF derivation is closely related to the prox-linear approach described above; in fact, the \(\mathcal A \)-function can be thought of as a generalization of the simple quadratic approximation to \(f\) that is used in (15.4).

Both EBW and ABCS have been applied to speech classification and recognition problems, as we discuss in subsequent sections.

15.2.1 An EBW Compressed Sensing Algorithm

The Extended Baum-Welch (EBW) technique was introduced initially for estimating the discrete probability parameters of the multinomial distribution functions of HMMs under the Maximum Mutual Information (MMI) discriminative objective function for speech recognition [24]. Later, in [25], EBW was extended to estimating the parameters of Gaussian Mixture Models (GMMs) of HMMs under the MMI discriminative objective for speech recognition problems. In [9] the EBW technique was generalized to the novel Line Search A-functions (LSAF) optimization technique. A simple geometric proof was provided to show that LSAF recursions result in a growth transformation (that is, the value of the original function increases for the new parameter values). In [26] it was shown that the discrete version of EBW, invented more than 24 years ago, can also be represented using \(\mathcal A \)-functions. This connection allowed a convergence proof for discrete EBW to be developed [26].

15.2.2 Line Search A-Functions

Let \(f(x): \mathcal U \subset \mathbb R ^n \rightarrow \mathbb R \) be a real valued differentiable function in an open subset \(\mathcal U \). Let \(\mathbf A _f=\mathbf A _f(x,y): \mathbb R ^n \times \mathbb R ^n \rightarrow \mathbb R \) be twice differentiable in \(x \in \mathcal U \) for each \(y \in \mathcal U \). We define \(\mathbf A _f\) as an \(\mathcal A \)-function for \(f\) if the following properties hold.

  1.

    \(\mathbf A _f(x,y)\) is a strictly convex or strictly concave function of \(x\) for any \(y \in \mathcal U \). (Recall that a twice differentiable function is strictly convex or strictly concave over some domain if its Hessian is positive or negative definite in that domain, respectively.)

  2.

    Hyperplanes tangent to manifolds defined by \(z=g_y(x)=\mathbf A _f(x, y)\) and \(z=f(x)\) at any \(x=y \in \mathcal U \) are parallel to each other, that is,

    $$\begin{aligned} \nabla _x \mathbf A _f(x,y)|_{x=y} = \nabla _x f(x) \end{aligned}$$
    (15.5)

It was shown in [9] that a general optimization technique can be constructed based on \(\mathcal A \)-functions. We formulated a growth transformation such that the next step in the parameter update that increases \(f(x)\) is obtained as a linear combination of the current parameter values and the value \(\tilde{x}\) that optimizes the \(\mathcal A \)-function, that is, for which \(\nabla _x \mathbf A _f(x,y)|_{x=\tilde{x}}=0\). More precisely, we stated that the \(\mathcal A \)-function gives a set of iterative update rules with the following “growth” property: let \(x_0\) be some point in \(\mathcal U \) and let \(\mathcal U \ni \tilde{x}_0 \ne x_0\) be a solution of \(\nabla _x A( x, x_0)|_{x=\tilde{x}_0} = 0\). Defining

$$\begin{aligned} x_1=x(\alpha ) = \alpha \tilde{x}_0 + (1-\alpha ) x_0, \end{aligned}$$
(15.6)

we have for sufficiently small \(|\alpha | \ne 0\) that \(f(x(\alpha )) > f(x_0)\), where \(\alpha > 0\) if \(A(x,x_0)\) is concave and \(\alpha < 0\) if \(A(x,x_0)\) is convex. The technique of generating \(\tilde{x}\) in this way and performing the line search is termed “Line Search A-Function” (LSAF).

15.2.3 Discrete EBW

Here we show that discrete EBW can be described using the LSAF framework. Our description is limited to the case of a single distribution, but the technique generalizes readily to several distributions.

Let the simplex \(\mathcal S \) be defined as

$$\begin{aligned} \mathcal S :=\{\beta : \beta \in \mathbb R ^n, \beta _i \ge 0, i=1,{\ldots }n, \sum \beta _i=1 \}, \end{aligned}$$

and suppose that \(f: \mathbb R ^n \rightarrow \mathbb R \) is a differentiable function on some subset \(X \subset \mathcal S \). We wish to solve the following maximization problem for a function \(f(\beta )\):

$$\begin{aligned} \max \, f(\beta )\;\;\text {subject to}\;\;\beta \in \mathcal S . \end{aligned}$$
(15.7)

Let \(\beta ^k \in X\) and define \(a_i^k := \frac{\partial f(\beta ^k)}{\partial \beta ^k_i}, i=1,{\ldots }n\). For any \(D \in \mathbb R \) and \(\beta ^k \in \mathbb R ^n\) such that \(\sum \nolimits _{j=1}^n a_j^k \beta _j^k + D \ne 0\), we define a recursion \(T_D: \mathbb R ^n \rightarrow \mathbb R ^n\) as follows:

$$\begin{aligned} \beta _i ^{k+1} = T_D (\beta ^k) = \frac{ a_i^k \beta _i^k + D \beta _i^k}{\sum _{j=1}^n a_j^k \beta _j^k + D}. \end{aligned}$$
(15.8)

It was shown in [27] that for sufficiently large \(D\), we have \(f(\beta ^{k+1}) > f(\beta ^k)\), unless \(\beta ^{k+1} = \beta ^k\).
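
A minimal sketch of the recursion (15.8) is shown below; the toy objective on the simplex, the gradient expression, and the value of \(D\) are illustrative assumptions of ours.

```python
import numpy as np

def ebw_step(beta, grad, D):
    # One discrete EBW update (15.8); beta stays on the simplex
    # as long as grad_i + D > 0 for all i.
    num = (grad + D) * beta
    return num / num.sum()

# Toy objective on the simplex: f(beta) = sum_i c_i * log(beta_i),
# whose maximizer over the simplex is beta_i = c_i.
c = np.array([0.5, 0.3, 0.2])
beta = np.full(3, 1.0 / 3.0)
for _ in range(100):
    grad = c / beta                         # partial derivatives of f
    beta = ebw_step(beta, grad, D=10.0)     # D large enough that grad_i + D > 0
print(np.round(beta, 3))                    # approaches [0.5, 0.3, 0.2]
```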

An \(\mathcal A \)-function \(\mathbb A _f\) for the function \(f\) in (15.7) that is differentiable in some compact neighborhood \(\mathcal U \subset X\) of a point \(\beta _0 \in \mathcal S \) is given as:

$$\begin{aligned} \mathbb A _f(\beta _0, \beta ) = \sum (c_i+\beta _{0i}D) \log \beta _i, \end{aligned}$$
(15.9)

where \(c_i=c_i(\beta _0) = \beta _{0i}\frac{\partial f (\beta ) }{\partial \beta _i }|_{\beta =\beta _0} = \beta _{0i} a_i(\beta _0)\) and \(D\) is any number such that \(a_i(\beta ) + D > 0\) for all \(i\) and any \(\beta \in \mathcal U \). (Existence of such a \(D\) is guaranteed by the differentiability of \(f\) in \(\mathcal U \) and the compactness of \(\mathcal U \).) To show that the function \(\mathbb A _f(\beta _0, \beta )\) in (15.9) is an \(\mathcal A \)-function, one needs to check (15.5) as follows. Substitute \(\beta _n=1-\sum \nolimits _{i=1}^{n-1}\beta _i\) in (15.7) and (15.9), that is, consider the functions \(g(\beta ') = f(\beta _1,\ldots ,\beta _{n-1}, 1-\sum \nolimits _1^{n-1}\beta _i )\) and \(\mathbb A _g(\beta _0, \beta ' ) = \mathbb A _f(\beta _0, \{\beta _1,\ldots , \beta _{n-1}, 1-\sum \nolimits _1^{n-1} \beta _j\})\), where \(\beta '=\{\beta _1, \ldots ,\beta _{n-1}\}\). We have

$$\begin{aligned} \frac{\partial \mathbb A _g(\beta _0, \beta ')}{\partial \beta _i} \Big |_{\beta '=\beta '_0}&= \frac{c_i+D\beta _{0i}}{\beta _{0i}} - \frac{c_n+D\beta _{0n}}{1-\sum _{j=1}^{n-1}\beta _{0j}} \\&= \bigl (a_i(\beta _0)+D\bigr ) - \bigl (a_n(\beta _0)+D\bigr ) = a_i(\beta _0)-a_n(\beta _0) = \frac{\partial g(\beta ')}{\partial \beta _i} \Big |_{\beta '=\beta '_0}, \end{aligned}$$

where \(\beta '_0=\{\beta _{01},\ldots ,\beta _{0,n-1}\}\) and \(\beta _{0n}=1-\sum \nolimits _{j=1}^{n-1}\beta _{0j}\).

It can be shown that adding a quadratic penalty \(C \beta ^T \beta \) to the objective function \(f(\beta )\) is equivalent to substituting \(D+2C\) for \(D\) in the discrete EBW recursion (15.8). Moreover, for sufficiently large \(C\), the function \(f(\beta ) + C \beta ^T\beta \) is convex on the simplex \(\mathcal S \), and a convex function achieves its maximum at an extreme point of \(\mathcal S \), that is, at a sparse vertex. This implies that for sufficiently large \(D\), the EBW recursion enforces a sparse solution.

Discrete EBW methods can be applied to optimization of objective functions with fractional norm constraints, as suggested in [28]. We have

$$\begin{aligned} \max \, f(\{\beta _i\})\quad \text {subject to} \;\; \Vert \beta \Vert _q = 1 \;\;\text {and} \;\;\beta _i \ge 0 , \; i=1,2,\ldots ,n, \end{aligned}$$
(15.10)

where \(\parallel \beta \parallel _q := (\sum \beta _i^q)^{1/q}\). Setting

$$\begin{aligned} \gamma _i = \beta _i^{q}, \quad g(\{\gamma _i\}) = f(\{\beta _i\}), \end{aligned}$$
(15.11)

transforms the problem (15.10) into a discrete EBW problem for which the recursion (15.8) could be applied. In [26], this optimization method with fractional norm constraints was applied to TIMIT classification tasks.

15.2.4 An ABCS Compressed Sensing Algorithm

Following [29], we describe the approximate Bayesian CS (ABCS) method. The key idea behind this algorithm is an approximate sparseness-promoting prior that is a kind of hybrid between Gaussian and Laplace distributions. ABCS is a variant of the algorithms in [30] and [31]. In what follows we gradually develop this underlying concept, together with a few others that form the core of the method.

15.2.4.1 Bayesian Estimation

The Bayesian estimation methodology provides a convenient representation for dealing with complex observation models. In this work, however, we restrict ourselves to the conventional linear model used in CS theory

$$\begin{aligned} y_k = H \beta + n_k \end{aligned}$$
(15.12)

where \(y_k\), \(H \in \mathbb R ^{m \times N}\), and \(n_k\) denote the \(k\)th \(\mathbb R ^m\)-valued observation, a fixed sensing matrix, and the observation noise of which the pdf \(p(n_k)\) is known, respectively. The sought-after random parameter (the signal) \(\beta \) is a \(\mathbb R ^N\)-valued vector for which the prior pdf \(p(\beta )\) is given. Following this, the complete statistics of \(\beta \) conditioned on the entire observation set consisting of \(k\) elements, \(\fancyscript{Y}_k = [y_1, \ldots , y_k]\) can be sequentially computed via the Bayesian recursion

$$\begin{aligned} p(\beta \mid \fancyscript{Y}_k) = \frac{p(y_k \mid \beta ) p(\beta \mid \fancyscript{Y}_{k-1})}{\int p(y_k \mid \beta ) p(\beta \mid \fancyscript{Y}_{k-1}) d \beta } \end{aligned}$$
(15.13)

where the likelihood \(p(y_k \mid \beta ) = p_{n_k}(y_k - H \beta )\). One can rarely obtain a closed-form analytic expression of the posterior pdf (15.13), so approximation techniques are often used. One well-known example in which (15.13) does admit a closed-form solution is given by the following theorem, which plays a fundamental role in this work. (This is a well known result in estimation theory which is revisited here for completeness.)

Theorem 1

(Gaussian pdf Update). Assume that \(p(\beta \mid \fancyscript{Y}_{k-1})\) is a Gaussian pdf of which the first two statistical moments are given by \(\hat{\beta }_{k-1} \in \mathbb R ^n\) and \(P_{k-1} \in \mathbb R ^{n \times n}\), that is \(p(\beta \mid \fancyscript{Y}_{k-1}) = \mathcal N (\beta \mid \hat{\beta }_{k-1}, P_{k-1})\). Assume also that the observation \(y_k\) satisfies the linear model (15.12) where \(n_k\) is a \(\mathbb R ^m\)-valued zero-mean Gaussian random variable \(n_k \sim \mathcal N (0, R)\) that is statistically independent of \(\beta \). Then the Bayesian recursion (15.13) yields \(p(\beta \mid \fancyscript{Y}_k) = \mathcal N (\beta \mid \hat{\beta }_k, P_k)\) where

$$\begin{aligned} \hat{\beta }_k = \hat{\beta }_{k-1} + P_{k-1} H^T \left( H P_{k-1} H^T + R \right) ^{-1}\left[ y_k - H \hat{\beta }_{k-1}\right] \end{aligned}$$
(15.14a)
$$\begin{aligned} P_k = \left[ I - P_{k-1} H^T \left( H P_{k-1} H^T + R \right) ^{-1} H \right] P_{k-1} \end{aligned}$$
(15.14b)

The initial values of the above quantities are set according to the Gaussian prior \(p(\beta ) = \mathcal N (\beta \mid \hat{\beta }_0, P_0)\).

The proof of this statement can be found in [29]. Note that the quantity \(P_k\) in Theorem 1 is the estimation error covariance, i.e.,

$$\begin{aligned} P_k := E\left[ (\beta - \hat{\beta }_k)(\beta - \hat{\beta }_k)^T \mid \fancyscript{Y}_k\right] \end{aligned}$$

where \(\beta - \hat{\beta }_k\) is the estimation error of the unbiased estimator \(\hat{\beta }_k\).
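
A minimal sketch of the update (15.14a) and (15.14b) is given below; the argument shapes are stated in the comments, and the function is a generic implementation of Theorem 1 rather than code from the chapter.

```python
import numpy as np

def gaussian_update(beta_prev, P_prev, H, y, R):
    """One step of the recursion in Theorem 1 (Eqs. 15.14a-b): condition a
    Gaussian prior N(beta_prev, P_prev) on the observation y = H beta + n,
    with n ~ N(0, R).  Shapes: beta_prev (N,), P_prev (N, N), H (m, N),
    y (m,), R (m, m)."""
    S = H @ P_prev @ H.T + R                            # innovation covariance
    K = P_prev @ H.T @ np.linalg.inv(S)                 # gain
    beta_new = beta_prev + K @ (y - H @ beta_prev)      # (15.14a)
    P_new = (np.eye(len(beta_prev)) - K @ H) @ P_prev   # (15.14b)
    return beta_new, P_new
```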

15.2.5 Sparseness-Promoting Semi-Gaussian Priors

Compressed sensing was embedded in the framework of Bayesian estimation by utilizing sparseness promoting priors such as Laplace and Cauchy [32]. Here we consider a different type of prior that facilitates the application of the closed-form recursion of Theorem 1. The sparseness-promoting prior used here is termed “semi-Gaussian” (SG) owing to its form

$$\begin{aligned} p(\beta ) = c \exp \left( -\frac{1}{2} \frac{\parallel \beta \parallel _1^2}{\sigma ^2}\right) . \end{aligned}$$
(15.15)

The use of an SG prior can be motivated by comparing the SG constraint \(\parallel \beta \parallel ^2_1=(\sum \nolimits _i|\beta _i|)^2\) with the Laplacian constraint \(\parallel \beta \parallel _1=\sum \nolimits _i|\beta _i|\). The SG density is proportional to \(p_{semi-gauss} \propto \exp (-\parallel \beta \parallel _1^2)\) and the Laplacian density to \(p_{laplace} \propto \exp (-\parallel \beta \parallel _1)\). When \(\parallel \beta \parallel _1 < 1\), it is straightforward to see that \(p_{semi-gauss}>p_{laplace}\); when \(\parallel \beta \parallel _1 =1\), the two densities coincide; and when \(\parallel \beta \parallel _1 >1\), \(p_{semi-gauss}<p_{laplace}\). Therefore the semi-Gaussian density is more concentrated than the Laplacian density in the convex region \(\parallel \beta \parallel _1 < 1\). Given a sparseness constraint \(\parallel \beta \parallel _q\), as the fractional norm \(q\) goes to 0 the density becomes concentrated along the coordinate axes, and solving for \(\beta \) becomes a non-convex optimization problem whose reconstructed signal has the least mean-squared error (MSE). Intuitively, we expect the solution obtained with the semi-Gaussian prior to behave more like this non-convex solution.

Fig. 15.1 Laplace, semi-Gaussian, and Gaussian pdfs in \(\mathbb R ^2\)

This observation is further illustrated in Fig. 15.1, in which the level maps are shown for Laplace, semi-Gaussian, and Gaussian pdfs in the 2-dimensional case. The embedding of the prior (15.15) within the Gaussian variant of the Bayesian recursion in Theorem 1 is not straightforward. This follows from the fact that the restrictions under which Theorem 1 is derived involve a purely Gaussian prior and a likelihood pdf that is based on a deterministic sensing matrix \(H\),

$$\begin{aligned} p(y_k \mid \beta ) \propto \exp \left( -\frac{1}{2} (y_k - H \beta )^T R^{-1} (y_k - H \beta )\right) . \end{aligned}$$
(15.16)

Theorem 1 provides an exact recursion for computing the Gaussian posterior based exclusively on the factors composing the above likelihood: the observation \(y_k\), the sensing matrix \(H\) and the observation noise covariance \(R\). This fact has motivated the following approach which allows enforcing an approximate semi-Gaussian prior without changing the fundamental structure of the underlying update equations as obtained in Theorem 1.

15.2.6 Approximate Semi-Gaussian Prior

We introduce a state-dependent matrix \(\hat{H} \in \mathbb R ^{1 \times N}\) whose entries are \(\hat{H}^i = \mathrm sign (\beta ^i)\), \(i=1,2,\ldots ,N\) (that is, \(\hat{H}^i=+1\) and \(\hat{H}^i=-1\) for \(\beta ^i > 0\) and \(\beta ^i < 0\), respectively). The semi-Gaussian prior can be expressed based on (15.16) while replacing \(H\) and \(R\) with \(\hat{H}\) and \(\sigma \), respectively, and assuming a fictitious observation \(y=0\), that is

$$\begin{aligned} p(\beta ) = p(y = 0 \mid \beta , \hat{H}, \sigma ) \propto \exp \left( -\frac{1}{2} \frac{(0 - \hat{H} \beta )^2}{\sigma ^2}\right) \end{aligned}$$
(15.17)

The only difficulty in using (15.14a) for enforcing the semi-Gaussian prior (15.17) is the dependency of \(\hat{H}\) on \(\beta \). We recall that Theorem 1 relies on a deterministic (though possibly varying) \(H\), as opposed to the state-dependent matrix in (15.17). This problem can be alleviated by letting

$$\begin{aligned} \hat{H}^i = \mathrm{sign }(\hat{\beta }^i_k), \quad i=1,2,\ldots ,N, \end{aligned}$$
(15.18)

that is, by substituting the conditional mean instead of the actual \(\beta \). This modification renders \(\hat{H}\) a \(\fancyscript{Y}_k\)-measurable quantity, as it depends on \(\hat{\beta }_k\) which is a function of the entire observation set. This fact clearly does not affect the expressions in Theorem 1 as the derivations are conditioned on \(\fancyscript{Y}_k\) (see [29]). Applying this approximation facilitates the implementation of Theorem 1 based on the likelihood (15.17). Hence, an additional processing stage is needed to apply the approximate sparseness-promoting prior:

$$\begin{aligned} \hat{\beta }_{k+1} = \left[ I - \frac{P_{k} \hat{H}^T \hat{H}}{\hat{H} P_{k} \hat{H}^T + \sigma ^2} \right] \hat{\beta }_{k} \end{aligned}$$
(15.19a)
$$\begin{aligned} P_{k+1} = \left[ I - \frac{P_{k} \hat{H}^T \hat{H}}{\hat{H} P_{k} \hat{H}^T + \sigma ^2}\right] P_{k}. \end{aligned}$$
(15.19b)

This stage is implemented after the usual processing of the observation set \(\fancyscript{Y}_k\) (see (15.14)), where the initial covariance is taken as \(P_0 \rightarrow \infty \).
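
A minimal sketch of this additional stage, (15.19a) and (15.19b), is given below; the number of sign-update iterations and the value of \(\sigma \) are illustrative assumptions of ours.

```python
import numpy as np

def abcs_semi_gaussian_stage(beta, P, sigma, iters=10):
    """Sketch of (15.19a)-(15.19b): enforce the approximate semi-Gaussian
    prior using Hhat_i = sign(beta_i) and a fictitious observation y = 0."""
    N = len(beta)
    for _ in range(iters):
        Hhat = np.sign(beta).reshape(1, N)               # (15.18)
        denom = (Hhat @ P @ Hhat.T).item() + sigma ** 2
        M = np.eye(N) - (P @ Hhat.T @ Hhat) / denom
        beta = M @ beta                                  # (15.19a)
        P = M @ P                                        # (15.19b)
    return beta, P
```

As noted later in this section, in practice \(P\) can be computed once and then held fixed with little loss in accuracy.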

At this point, a natural question is raised concerning the validity of the approximation suggested above. The following theorem, proved in [29], bounds the discrepancy between the exact posterior which uses the semi-Gaussian prior (15.15) and the approximate posterior in terms of the estimation error covariance \(\hat{P}_k\).

Theorem 2

Denote by \(\hat{p}(\beta \mid \fancyscript{Y}_k)\) the Gaussian posterior pdf obtained by using the approximate semi-Gaussian prior technique, and let \(p(\beta \mid \fancyscript{Y}_k)\) be the posterior pdf obtained by using the exact semi-Gaussian prior (15.15). Then

$$\begin{aligned} \mathrm{KL }\left( \hat{p}(\beta \mid \fancyscript{Y}_k) \parallel p(\beta \mid \fancyscript{Y}_k)\right) = \mathcal O \left( \sigma ^{-2} \max \left\{ \mathrm{Tr }(\hat{P}_k), \mathrm{Tr }(\hat{P}_k)^{1/2}\right\} \right) , \end{aligned}$$
(15.20)

where \(\mathrm{KL }\) and \(\mathrm{Tr }\) denote the Kullback-Leibler divergence and the matrix trace operator, respectively.

In practical applications to speech classification and recognition tasks, it was observed that the classification and recognition accuracy is not affected if one computes the term \(P_k\) in (15.19b) only once and then fixes it for all subsequent iterations. This trick provides a significant speed-up with no appreciable degradation in accuracy.

15.2.7 ABCS Representations via LSAF

We recall the \(\ell _1\)-constrained problem (15.1), modified slightly by the use of a weighted data-fitting term

$$\begin{aligned} \min \parallel y - H \beta \parallel _R^2 \quad \text {subject to} \;\; \parallel \beta \parallel _1 \le \epsilon . \end{aligned}$$

In many practical applications it is useful to add an \(l_2\) regularization term to this formulation, to yield

$$\begin{aligned} \min \parallel y - H \beta \parallel _R^2 + \parallel \beta - \beta _0 \parallel ^2_{P_0}\quad \text {subject to} \;\; \parallel \beta \parallel _1 \le \epsilon . \end{aligned}$$

Using \(\parallel y - H \beta \parallel _R^2 + \parallel \beta - \beta _0 \parallel ^2_{P_0} =\, \parallel \beta - \beta _1 \parallel ^2_{P_1}\) we can represent this problem as

$$\begin{aligned} \min \parallel \beta - \beta _1 \parallel ^2_{P_1} \quad \text {subject to} \;\; \parallel \beta \parallel _1 \le \epsilon , \end{aligned}$$

where \(P_1\) is assumed to be positive-definite. We can now represent this problem in the penalized form

$$\begin{aligned} \min \, F(\beta ) :=\, \parallel \beta - \beta _1 \parallel ^2_{P_1} + \parallel \beta \parallel ^i_1 /\sigma ^2, \end{aligned}$$
(15.21)

and define the \(\mathcal A \)-function as:

$$\begin{aligned} \mathbb{A }(\beta , \beta ^*) =\, \parallel \beta - \beta ^* \parallel ^2_{P_1}+ \{\mathrm{sign }(\beta ^*) \beta \}^i /\sigma ^2, \end{aligned}$$
(15.22)

where \(i=1\) (Laplacian) or \(i=2\) (squared \(l_1\) norm). In [26] we show that \(\mathbb A (\beta , \beta ^*)\) is an \(\mathcal A \)-function of \(F(\beta )\). According to the definition of the \(\mathcal A \)-function, we consider \(\mathbb A (\beta , \beta ^*)\) and \(F(\beta )\) in an open domain where they are both differentiable and construct an update of the parameters when the extremum of \(\mathbb A (\beta , \beta ^*)\) belongs to this domain. Our open domain excludes the origin \(\beta = 0\); if some coordinates of \(\beta \) approach 0, we can remove them by reducing the dimension of the problem. Using LSAF, we have the recursion:

$$\begin{aligned} \beta _k = \alpha \tilde{\beta }_{k-1}+(1-\alpha )\beta _{k-1}. \end{aligned}$$

The ABCS algorithm corresponds to the squared \(l_1\) norm (\(i=2\)); an analysis of various regularization penalties for speech classification problems is given in Sect. 15.3. The ABCS method obtains a solution of (15.21) through the recursion \( \tilde{\beta }_{k-1} = \arg \max _\beta \mathbb A (\beta , \beta _{k-1})\). Numerical experiments show that for a suitable choice of \(\alpha \), the LSAF iterate \(\beta _k\) converges to a solution of (15.21) more rapidly than the iterate obtained through the ABCS recursion alone, so one can expect LSAF with an appropriate choice of \(\alpha \) to be more efficient than ABCS.

15.3 An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification

Following [10], we describe and compare a variety of sparseness techniques that employ different types of regularization and have been explored for speech tasks [2, 3]. First, we describe the main framework behind exemplar-based classification. Then we give a brief description of the TIMIT corpus. Next we discuss how sparseness can be useful in classification tasks. Finally, we compare the performance of different sparseness methods for classification.

15.3.1 Classification Based on Exemplars

The goal of classification is to use training data from \(k\) different classes to determine the best class to assign to a test vector \(y\). First, let us consider taking all \(n_i\) training examples from class \(i\) and concatenating them as columns of a matrix \(H_i\), in other words \(H_i=[x_{i,1}, x_{i,2}, \ldots , x_{i,n_i}]\in \mathbb{R }^{m\times {n_i}}\), where \(x\in \mathbb{R }^m\) represents a feature vector from the training set of class \(i\) with dimension \(m\). Given sufficient training examples from class \(i\), [6] shows that a test sample \(y\) from the same class can be represented as a linear combination of the entries in \(H_i\) weighted by \(\beta \), that is:

$$\begin{aligned} y=\beta _{i,1}x_{i,1}+\beta _{i,2}x_{i,2}+\ldots +\beta _{i,n_i}x_{i,n_i} \end{aligned}$$
(15.23)

However, since the class membership of \(y\) is unknown, we define a matrix \(H\) to include training examples from all \(k\) classes in the training set, in other words the columns of \(H\) are defined as \(H=[H_1, H_2, \ldots , H_k]=[x_{1,1}, x_{1,2}, \ldots , x_{k,n_k}]\in \mathbb{R }^{m\times {N}}\). Here \(N\) is the total number of all training examples from all classes. We can then write test vector \(y\) as a linear combination of all training examples, in other words \(y=H\beta \). We can solve this linear system for \(\beta \) and use information about \(\beta \) to make a classification decision. Specifically, large entries of \(\beta \) should correspond to the entries in \(H\) with the same class as \(y\). Thus, one proposed classification decision approach [3] is to compute the \(l_2\) norm for all \(\beta \) entries within a specific class, and choose the class with the largest \(l_2\) norm support.
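
A minimal sketch of this decision rule is given below; it assumes that \(\beta \) has already been obtained by one of the SR solvers discussed in this chapter and that the class of each column of \(H\) is available as a label array (both assumptions of ours).

```python
import numpy as np

def classify_by_support(beta, column_labels, classes):
    # Choose the class whose columns in H carry the largest l2 norm
    # of the corresponding beta entries.
    scores = {c: np.linalg.norm(beta[column_labels == c]) for c in classes}
    return max(scores, key=scores.get)
```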

15.3.2 Exemplar-Based Methods

Various types of exemplar-based classifiers can be cast in the framework of representing the test vector \(y\) as a linear combination of training examples \(H\), subject to a constraint on \(\beta \). Below, we review a few popular techniques that are based on the following optimization problem for various values of \(q\) and \(\alpha \)

$$\begin{aligned} \min _\beta \parallel y-H\beta \parallel _2 \;\;\;\mathtt{s.t. }\;\;\; \parallel \beta \parallel _q^\alpha \le \epsilon \end{aligned}$$
(15.24)
  1.

    Ridge regression (RR) methods [5] use information about all training examples in \(H\) to make a classification decision about \(y\), in contrast to a nearest-neighbor (NN) approach to exemplar-based classification, which uses information about just 1 training example. Specifically, the RR method looks to project \(y\) into the linear space of all training examples and solves for the \(\beta \) which minimizes (15.24) for \(q=2, \alpha =2\). The term \(\parallel \beta \parallel _2^2 \le \epsilon \) is an \(l_2\) norm on \(\beta \) (i.e. a Gaussian constraint) but does not enforce any sparseness.

  2.

    Sparse representations: like RR methods, sparse representation (SR) techniques (e.g., [3, 6]) project \(y\) into the linear span of examples in \(H\), but constrain \(\beta \) to be sparse. Specifically, SR methods solve for \(\beta \) by minimizing (15.24), given various settings for \(\alpha \) and \(q\). For example, in a probabilistic setting \(q = 1\), \(\alpha = 1\) leads to a Laplacian constraint, whereas \(q = 1\), \(\alpha = 2\) leads to a semi-Gaussian constraint. The remainder of this section is focused on comparing the RR method to various SR methods with different types of regularization.

15.3.3 Description of TIMIT

We analyze the behavior of various exemplar-based methods on the TIMIT [16] corpus. The corpus contains over 6,300 phonetically rich utterances divided into three sets, namely the training, development, and core test set. For testing purposes, the standard practice is to collapse the 48 trained labels into a smaller set of 39 labels. All methods are tuned on the development set and all experiments are reported on the core test set.

The complete experimental setup, as well as the features used for classification, are similar to [3]. First, we represent each frame in our signal by a 40-dimensional discriminatively trained feature-space Boosted Maximum Mutual Information (fBMMI) feature. We split each phonetic segment into thirds, take the average of the frame-level features within each third, and splice the three averages together to form a 120-dimensional vector; this captures the time dynamics of each segment. Then, for each segment, the segmental feature vectors to the left and right of the segment are joined with it, and a Linear Discriminant Analysis (LDA) transform is applied to project the 200-dimensional feature vector down to 40 dimensions.

Similar to [3], we find a neighborhood of closest points to \(y\) in the training set using a kd-tree. These \(k\) neighbors become the entries of \(H\). We explore classification performance for different sizes of \(H\). In what follows, we explore the following two questions, using TIMIT to provide experimental results to support our framework.

  • Why and when is sparseness important for exemplar-based methods?

  • If sparseness is used, what type of regularization constraint should be utilized?

15.3.4 Why Sparse Representations?

We will motivate the difference between the RR and SR methods further with the following example. Let us consider a \(2 \times 7\) matrix

$$\begin{aligned} H=[h_1,h_2,h_3,h_4, h_5, h_6, h_7]= \left[ \begin{array}{ccccccc} 0.2 &{} 0.1 &{} 0.4 &{} 0.3 &{} -0.6 &{} 0.6 &{} -0.6\\ 0.2 &{} 0.3 &{} 0.35 &{} 0.3 &{} 0.1 &{} 0.3 &{} 0.4\end{array} \right] , \end{aligned}$$

where the first three columns \(h_1, h_2, h_3\) are “training” examples that belong to class \(C_1\) and the last four columns are “training” examples that belong to class \(C_2\). Assume also that the vector \(y=[0.29;0.29]\) is a “test” point that belongs to class \(C_1\). Solving (15.24) with \(q=2, \alpha =2\) (i.e., the RR method) produces the vector \(\beta \approx [0.12; 0.15; 0.21; 0.18; -0.05; 0.22; 0.08]\), and the best class is \(C_2\). However, using the SR method in (15.24) (for example, the ABCS method with a SG constraint as explained in Sect. 15.2) produces a vector \(\beta \approx [0.00; 0.01; 0.77; 0.00; 0.00; 0.00; 0.03]\) with the support located at the third entry in \(H\). In this case, \(C_1\) is identified as the correct class. Thus, by using a subset of examples in \(H\), the classification decisions for SR and RR can be vastly different, particularly in the presence of outliers.
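
The ridge side of this toy example can be checked with a few lines of numpy; the tiny regularizer and the explicit label array below are illustrative choices of ours, and reproducing the sparse solution would require one of the SR solvers described in Sect. 15.2.

```python
import numpy as np

H = np.array([[0.2, 0.1, 0.4,  0.3, -0.6, 0.6, -0.6],
              [0.2, 0.3, 0.35, 0.3,  0.1, 0.3,  0.4]])
y = np.array([0.29, 0.29])
labels = np.array([1, 1, 1, 2, 2, 2, 2])        # class of each column of H

# Lightly regularized least-squares (RR-like) solution
beta_rr = H.T @ np.linalg.solve(H @ H.T + 1e-6 * np.eye(2), y)
print(np.round(beta_rr, 2))   # approx [ 0.12  0.15  0.21  0.18 -0.05  0.22  0.08]

# Decision rule: class with the largest per-class l2 norm of beta
scores = {c: np.linalg.norm(beta_rr[labels == c]) for c in (1, 2)}
print(max(scores, key=scores.get))              # picks Class 2, as described above
```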

Fig. 15.2 Error for RR and SR methods for varied \(H\)

To analyze the behavior of the SR and RR methods in a practical speech example, we explore phonetic classification on TIMIT as the size of \(H\) is varied from \(1\) to \(10{,}000\). A plot of the error rate for the two methods for varied \(H\) is shown in Fig. 15.2. For this figure, we again used the ABCS SR method. First, notice that as the size of \(H\) increases up to \(1{,}000\), the error rates of both RR and SR decrease, showing the benefit of including multiple training examples when making a classification decision. Also notice that in this regime there is no difference in error between the RR and SR techniques, suggesting that sparseness does not provide any extra benefit. However, as the size of \(H\) increases past \(1{,}000\) and there are more training examples per class, the SR method performs better than the RR method, demonstrating the advantage of using sparseness to select only a few examples in \(H\) to explain \(y\), rather than using all examples in \(H\).

15.3.5 What Type of Regularization?

Now that we have motivated the use of regularization, in this section we analyze its different forms. As illustrated by (15.24), a representation can be found by solving for the \(\beta \) which minimizes the residual error \(\parallel y-H\beta \parallel _2\), subject to a regularization constraint \(\parallel \beta \parallel _q^\alpha \le \epsilon \) on \(\beta \). There are four common types of regularization on \(\beta \).

  1.

    If \(q=2\) and \(\alpha =2\), then the regularization becomes \(\parallel \beta \parallel _2 \le \epsilon \). This constraint can be modeled as a Gaussian prior. Common techniques which impose an \(l_2\) constraint on \(\beta \) include Ridge Regression [5]. The effect of the \(l_2\) norm is to spread the values of the entries in \(\beta \) equally. Therefore the optimization problem (15.24) for \(q= 2\) tries to find a balance between keeping the residual \(\parallel y - H\beta \parallel _2\) small and keeping the entries of \(\beta \) small but typically non-zero.

  2.

    If \(q=1\) and \(\alpha =1\), then the regularization becomes \(\parallel \beta \parallel _1 \le \epsilon \). This constraint can be modeled as a Laplacian prior. Common techniques which impose an \(l_1\) constraint on \(\beta \) include LASSO [11] and Bayesian Compressive Sensing (BCS) [12]. The Lasso problem can be formulated as follows:

    $$\begin{aligned} \min _\beta \parallel y-H\beta \parallel _2 +\lambda \parallel \beta \parallel _1, \end{aligned}$$
    (15.25)

    as in (15.3), where \(\lambda \) controls the weight of the \(l_1\) norm. The Least Angle Regression (LARS) algorithm [33] solves LASSO through a forward stepwise regression, computing point estimates of \(\beta \) at each step. Whereas the \(l_2\) norm spreads the values of the entries in \(\beta \) and keeps them from vanishing, the \(l_1\) norm tries to enforce sparsity in \(\beta \) while keeping the residual \(\parallel y - H \beta \parallel _2\) small.

    Bayesian Compressive Sensing [12] can be formulated in a fashion similar to (15.25). BCS introduces a probabilistic framework to estimate the sparseness parameters required for signal recovery. This technique limits the effort required to tune the sparseness constraint and also provides complete statistics for the estimate of \(\beta \).

  3.

    Many techniques impose a combination of \(l_1\) and \(l_2\) constraints on \(\beta \). The most popular of these is the Elastic Net [13], which solves

    $$\begin{aligned} \min _\beta \parallel y-H\beta \parallel _2 + \lambda _1\parallel \beta \parallel _1 +\lambda _2\parallel \beta \parallel ^2_2. \end{aligned}$$
    (15.26)

    Here \(\lambda _1\) and \(\lambda _2\) are weights controlling the \(l_1\) and \(l_2\) constraints, respectively. In the Elastic Net formulation the \(l_1\) term enforces sparsity of the solution, whereas the \(l_2\) penalty ensures democracy among groups of correlated variables; the second term also has a smoothing effect that stabilizes the obtained solution.

  4.

    The previously described ABCS method uses a semi-Gaussian prior and solves for \(\beta \) in a Bayesian framework; it essentially solves the problem below (a small sketch contrasting several of these regularizers follows this list).

    $$\begin{aligned} \min _\beta \parallel y-H\beta \parallel _2 + \lambda _1(\beta - \beta _0)^TP_0^{-1}(\beta - \beta _0) + \lambda _2\parallel \beta \parallel _1^2. \end{aligned}$$
    (15.27)
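
A minimal sketch contrasting three of these regularizers on synthetic data is shown below; the use of scikit-learn's off-the-shelf solvers, the data sizes, and the penalty weights are our own illustrative choices, and ABCS is omitted because it has no off-the-shelf implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
H = rng.standard_normal((40, 200))          # m = 40 dimensions, N = 200 exemplars
beta_true = np.zeros(200)
beta_true[:5] = rng.standard_normal(5)      # only 5 exemplars are truly active
y = H @ beta_true + 0.01 * rng.standard_normal(40)

models = {
    "ridge (l2)":          Ridge(alpha=0.1, fit_intercept=False),
    "lasso (l1)":          Lasso(alpha=0.01, fit_intercept=False, max_iter=10000),
    "elastic net (l1+l2)": ElasticNet(alpha=0.01, l1_ratio=0.5,
                                      fit_intercept=False, max_iter=10000),
}
for name, model in models.items():
    model.fit(H, y)
    nnz = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(f"{name}: {nnz} nonzero coefficients")  # l2 dense, l1 sparse, mix in between
```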

15.3.5.1 Visualization of Sparsity

We analyze the difference in the \(\beta \) coefficients for different sparseness methods. For a randomly selected classification frame \(y\) in TIMIT and an \(H\) of size 200, we solve (15.24) for \(\beta \). Figure 15.3 plots the sorted 200 \(\beta \) coefficients for four techniques employing different regularizations, namely Ridge Regression, LASSO, Elastic Net, and ABCS. The plot shows that the \(\beta \) coefficients for the RR method are the least sparse, as we would expect, while the LASSO technique yields the sparsest \(\beta \) values. The sparsity of the Elastic Net and ABCS methods lies in between RR and LASSO, with ABCS being sparser than the Elastic Net because the semi-Gaussian constraint in ABCS enforces more sparsity than the \(l_1\) constraint in the Elastic Net.

Fig. 15.3 Plot of \(\beta \) for different regularization constraints

15.3.5.2 TIMIT Results

Table 15.1 shows the results comparing various sparseness methods on TIMIT for a size of \(H=200\). As one can see from the table, the three methods which combine a sparseness constraint with an \(l_2\) norm, namely ABCS, Elastic Net and CSP, all achieve statistically the same accuracy. The two methods which use only the \(l_1\) norm, namely BCS and LASSO, have slightly lower accuracy, showing the decrease in accuracy when a high degree of sparseness is enforced. Thus, it appears that using a sparsity constraint on \(\beta \), coupled with an \(l_2\) norm, does not force unnecessary sparseness and offers the best performance.

15.4 ABCS for Classification

In this section we follow [3] and describe the application of ABCS to TIMIT classification tasks. We perform classification as described in Sect. 15.3.1, solving (15.24) with \(q=1\) and \(\alpha = 2\) via (15.14a), (15.14b), (15.19a), and (15.19b). We compute the \(l_2\) norm of the \(\beta \) entries within each class and choose the class with the largest \(l_2\) norm support. Pooling all training data from all classes into \(H\) would make the number of columns of \(H\) very large (greater than 100,000 for TIMIT) and would make solving for \(\beta \) intractable. Therefore, to reduce \(N\) and make the ABCS problem tractable, for each \(y\) we find a neighborhood of the closest points to \(y\) in the training set using a kd-tree [35]. These \(k\) neighbors become the entries of \(H\); \(k\) is chosen to be large enough to ensure that \(\beta \) is sparse and that the training examples are not all drawn from the same class.
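
A minimal sketch of this neighborhood selection is shown below; the use of SciPy's cKDTree and the default \(k=200\) are our own illustrative choices, consistent with the dictionary sizes reported in the experiments.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_local_dictionary(y, train_feats, train_labels, k=200):
    """Select the k training exemplars closest to y and stack them as the
    columns of H, keeping their class labels for the decision rule."""
    tree = cKDTree(train_feats)        # train_feats: (N_train, m) array
    _, idx = tree.query(y, k=k)        # indices of the k nearest exemplars
    H = train_feats[idx].T             # H is m x k
    return H, train_labels[idx]
```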

Table 15.1 Accuracies for different sparseness methods

Constants \(P_0\) and \(\beta _0\) must be chosen to initialize the ABCS algorithm. Recall that the entries of \(\beta _0\) and the diagonal elements of \(P_0\) each correspond to a specific class. We choose \(\beta _0\) to be \(0\), since we do not have a confident initial estimate of \(\beta \) and we assume it is sparse around \(0\) in any case. We initialize a diagonal \(P_0\) in which the entries corresponding to a particular class are proportional to the GMM posterior for that class. The intuition behind this is that the larger the initial \(P_0\) entries for a class, the more weight ABCS gives to the examples in \(H\) belonging to that class. Therefore, the GMM posterior picks out the most likely supports, and ABCS provides an additional step that uses the actual training data to refine these supports.

15.4.1 Nonlinear Compressive Sensing

The traditional CS implementation represents \(y\) as a linear combination of the samples in \(H\). Many pattern recognition algorithms, such as SVMs [36], have shown that better performance can be achieved by nonlinearly mapping the feature set to a higher-dimensional space. After this mapping, a weight vector \(w\) is found which projects all dimensions within a particular feature vector to a single dimension where the different classes are linearly separable. We can think of this weight vector \(w\) as selecting some linear combination of dimensions within a feature vector to make it linearly separable. The goal of CS, however, is to find a linear combination of actual features, not of dimensions within a feature vector. Therefore, we introduce nonlinearity into CS by constructing \(H\) such that its entries are themselves nonlinear functions of the original features. For example, one such nonlinearity is to square all the elements within \(H\). That is, if we define \(H_{lin} = [x_{1,1}, x_{1,2}, \ldots , x_{k,n_k}]\), then \(H^2\) is defined as \(H^2 = [x^2_{1,1}, x^2_{1,2}, \ldots , x^2_{k,n_k}]\), and similarly \(H^3\) would take the cubes of each of the \(x\) entries. We could also take products between different \(x_i\), as in \(H_{inner}=[x_{1,1}x_{1,2}, x_{1,1}x_{1,3}, \ldots , x_{k,n_k-1} x_{k,n_k}]\). We then take a specific nonlinear \(H_{nonlin}\) and combine it with the linear \(H_{lin}\) to form a new \(H_{tot} = [H_{lin}, H_{nonlin}]\), and use ABCS to solve for \(\beta \). In Sect. 15.4.2.1, we discuss the performance of the ABCS algorithm for different choices of nonlinear \(H\).
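
A minimal sketch of this construction is given below; it forms elementwise powers of the exemplar columns and appends them to \(H_{lin}\). In the experiments reported later only a random subset of columns is used for the nonlinear part, a detail omitted here for brevity.

```python
import numpy as np

def build_nonlinear_dictionary(H_lin, order=3):
    """Sketch of H_tot = [H_lin, H^2, ..., H^order]: elementwise powers of
    the exemplar columns are appended to the linear dictionary."""
    blocks = [H_lin ** p for p in range(1, order + 1)]   # H^1 is H_lin itself
    return np.hstack(blocks)

# Example: order=3 corresponds to the CS-H_lin H^2 H^3 configuration.
```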

15.4.2 Experiments

Classification experiments are conducted on the TIMIT [16] acoustic-phonetic corpus, as described in Sect. 15.3.3. First, we analyze the performance of the CS classifier for different choices of linear and nonlinear \(H\), as described in Sect. 15.4.1. Next, we compare the performance of CS with three other standard classifiers used on this task, namely a Gaussian Mixture Model (GMM), a Support Vector Machine (SVM) [36], and a k-nearest neighbors (kNN) classifier [35]. The parameters of each classifier were optimized for each feature set on the development set. Specifically, we found that modeling each phone as a 16-component GMM was appropriate. The kernel type and the parameters within this kernel were optimized for the SVM. In addition, the number \(k\) of closest neighbors for kNN was also tuned. Finally, for CS, the size of \(H_{lin}\) was optimized to be 200 examples from the kd-tree; to compute \(H_{nonlin}\), 100 columns were randomly chosen from \(H_{lin}\) for each type of nonlinear \(H\).

Table 15.2 Accuracy for different \(H\) using MFCC features

15.4.2.1 Performance for Different \(H\)

Table 15.2 shows the accuracy on the development set for different choices of \(H\) using Mel-frequency cepstral coefficient (MFCC) features. Notice that the nonlinear CS-\(H_{lin}H^2\) method offers improvements over the linear CS-\(H_{lin}\) method. Taking \(H_{lin}H^2H^3\) offers additional improvements, though overtraining occurs when higher-order features past \(H^3\) are used. Furthermore, there is very little difference between squaring individual entries of \(H\) (i.e., \(H_{lin}H^2\)) and taking products between different entries of \(H\) (i.e., \(H_{lin}H_{inner}\)). While not shown here, similar trends were also observed for fBMMI features. Since the CS-\(H_{lin}H^2H^3\) method offers the best performance of the CS methods, we report the results for this classifier in subsequent sections.

Table 15.3 Accuracy for different classifiers on TIMIT testcore set

15.4.2.2 Comparing Different Classifiers

Table 15.3 compares the performance of the CS classifier with the GMM, kNN and SVM methods for both MFCC and fBMMI features. Classifiers whose performance is not statistically significantly different from that of the CS classifier, as confirmed by McNemar's test, are indicated by ‘\(=\)’. First, notice that when MFCC features are used, CS outperforms both the kNN and GMM methods, and offers similar performance to the SVM. When discriminative features are used, the GMM technique closely matches the SVM, though CS is able to provide further gains over these two methods. This is one of the benefits of CS: it is a discriminative, non-parametric classifier built on top of the GMM.

15.4.2.3 Analysis of Results

To better understand the gains achieved by the CS classifier, Fig. 15.4 plots the relative difference in error rates within six broad phonetic classes (BPCs) for CS compared to the three other methods. First, notice that CS offers improvements over the GMM in all BPCs, again confirming the benefit of a non-parametric discriminative classifier built on top of the GMM. Second, while the SVM technique offers improvements over the CS method in the vowel/semi-vowel class, the CS method significantly outperforms the SVM in the weak fricative, stop and closure classes. Finally, the CS method offers slight improvements over the kNN method in the nasal, strong fricative and stop classes, while kNN offers slight improvements in the vowel, weak fricative and closure classes. Thus, with the exception of the GMM, the gains from CS do not come from outperforming the kNN and SVM techniques within all BPCs, but only within certain BPCs.

15.5 A Convex Hull Approach to Sparse Representations

A typical SR formulation such as (15.24) does not constrain \(\beta \) to be positive and normalized, which can result in the scaled training points \(\beta _i h_i\) that make up \(H\beta \) being reflected and scaled. While this can be desirable when data variability exists, allowing too much flexibility when data variability is minimal can reduce the discrimination between classes. Driven by this intuition, below we present two examples where data variability is minimal, and demonstrate how SRs manipulate the feature space, leading to classification errors.

First, consider two clusters in a 2-dimensional space as shown in Fig. 15.5a, with sample points \(\{a_1, a_2, \ldots ,a_6\}\) belonging to Class 1 and \(\{b_1, b_2, \ldots ,b_6\}\) belonging to Class 2. Assume that the points \(a_i\) and \(b_i\) are concatenated into a matrix \(H=[h_1, h_2, \ldots , h_{12}] = [a_1, \ldots , a_6, b_1, \ldots b_6]\), with a specific entry denoted by \(h_i \in H\). In a typical SR problem, given a new point \(y\) indicated in Fig. 15.5a, we project \(y\) into the linear span of the training examples in \(H\) by trying to solve:

$$\begin{aligned} \arg \min \parallel \beta \parallel _0 \;\;\;\mathtt{ s.t. }\;\;\; y=H\beta =\sum _{i=1}^{12} h_i \beta _i \end{aligned}$$
(15.28)

As shown in Fig. 15.5a, the best solution will be obtained by setting all \(\beta _i=0\) except for \(\beta _8=-1\), corresponding to the weight on point \(b_2\). At this point \(|\beta |_0\) takes the lowest value of 1 and \(y=-b_2\), meaning it is assigned to Class 2. The SR method misclassifies point \(y\), as it is clearly in Class 1, because it puts no constraints on the \(\beta \) values. Specifically, in this case, the issue arises from the possibility of \(\beta \) entries taking negative values.

Fig. 15.4 Relative difference in error rates between CS and other methods

Fig. 15.5 a Reflective issue with negative \(\beta \), b Scaling issue with unnormalized \(\beta \)

Second, consider two clusters in a 2-dimensional space as shown in Fig. 15.5b, with sample points belonging to Classes 1 and 2. Again, we try to find the best representation for the test point \(y\) by solving (15.28). The best solution will be obtained by setting all \(\beta _i=0\) except for \(\beta _5=0.5\). At this value, \(\parallel \beta \parallel _0\) takes the lowest possible value of 1 and \(y=0.5\times a_5\). This leads to a wrong classification decision, as \(y\) clearly is a point in Class 2. The misclassification is due to having no constraint on the \(\beta \) elements; specifically, in this case the issue arises because the \(\beta \) values are totally independent and there is no normalization criterion to enforce dependency between them. If we enforce \(\beta \) to be positive and normalized, then the training points \(h_i \in H\) form a convex hull. Mathematically speaking, the convex hull of the training points in \(H\) is the set of all convex combinations of finite subsets of points from \(H\), in other words all points of the form \(\sum \nolimits _{i=1}^n h_i\beta _i \), where \(n\) is an arbitrary number and the \(\beta _i\) components are non-negative and sum to 1.

Since many classification techniques can be sensitive to outliers, we examine the sensitivity of our convex hull SR method. Consider the two clusters shown in Fig. 15.6, with sample points in Classes 1 and 2. Again, given the point \(y\), we try to find the best representation for \(y\) by solving (15.28), where now we solve with a convex hull approach, placing extra positivity and normalization constraints on \(\beta \).

Fig. 15.6 Outliers effect

As shown in Fig. 15.6, if we project \(y\) onto the convex hulls of Class 1 and Class 2, the distance from \(y\) to the convex hull of Class 1 (indicated by \(r_1\)) is less than the distance from \(y\) to the convex hull of Class 2 (i.e. \(r_2\)). This leads to a wrong classification decision as \(y\) clearly is a point in Class 2. The misclassification is due to the effect of outliers \(a_1\) and \(a_4\), which create an inappropriate convex hull for Class 1.

However, all-data methods, such as GMMs, are much less susceptible to outliers, since a model for a class is built by estimating the mean and variance of the training examples belonging to that class. Thus, if we combine the distance between \(y\) and its projections onto the convex hulls of Class 1 and Class 2 with the distance between these projections and the means \(m_1\) and \(m_2\) of Classes 1 and 2 (indicated by \(q_1\) and \(q_2\), respectively), then the test point \(y\) is classified correctly. Combining purely exemplar-based distances (\(r_i\)) with GMM-based distances (\(q_i\)), which are less susceptible to outliers, therefore provides a more robust measure.

15.5.1 Convex Hull Formulation

In our sparse representations convex hull (SR-CH) formulation, we first seek to project the test point \(y\) into the convex hull of \(H\). After \(y\) is projected into the convex hull of \(H\), we compute how far this projection (which we call \(H\beta \)) is from the Gaussian means of all classes in \(H\). The full convex hull formulation, which finds the optimal \(\beta \) minimizing both the exemplar-based and GMM-based distances, is given in (15.29) [8]. Here \(N_{classes}\) represents the number of unique classes in \(H\), and \(\parallel H\beta - \mu _t\parallel _2^2\) is the distance from \(H\beta \) to the mean \(\mu _t\) of class \(t\):

$$\begin{aligned} \arg \min _{\beta } \parallel y-H\beta \parallel _2^2 + \sum _{t=1}^{N_{classes}}\parallel H\beta - \mu _t\parallel _2^2 \;\;\;\mathtt{ s.t. }\;\;\; \sum _i \beta _i =1 \; \mathtt{and } \; \beta _i \ge 0 \end{aligned}$$
(15.29)

In our work, we associate these distance measures with probabilities. Specifically, we assume that \(y\) satisfies a linear model as \(y=H\beta +\zeta \) with observation noise \(\zeta \sim N(0,R)\). This allows us to represent the distance between \(y\) and \(H\beta \) using the term \(p(y|\beta )\)

$$\begin{aligned} p(y|\beta ) \propto \exp (-1/2(y-H\beta )^TR^{-1}(y - H \beta )) \end{aligned}$$
(15.30)

which we will refer to as the exemplar-based term.

We also explore a probabilistic representation for the \(\sum \nolimits _{t=1}^{N_{classes}}\parallel H\beta - \mu _t\parallel _2^2\) term. Specifically, we define the GMM-based term \(p_M(\beta )\) by measuring how well our projection of \(y\) onto the convex hull of \(H\), as represented by \(H\beta \), is explained by each of the \(N_{classes}\) GMM models. We score \(H\beta \) against the GMM of each class and sum the scores over all classes. More formally, in log-space,

$$\begin{aligned} \log p_M(\beta )= \sum _{t=1}^{N_{classes}} \log p(H\beta |GMM_t) \end{aligned}$$
(15.31)

where \(p(H\beta |GMM_t)\) indicates the score from GMM \(t\). Given the exemplar-based term \(p(y|\beta )\) and GMM-based term \(p_M(\beta )\), the total objective function we would like to maximize is given in the log-space by

$$\begin{aligned} \max _{\beta } F(\beta ) = \{\log p(y|\beta ) + \log p_M(\beta )\} \;\;\;\mathtt{ s.t. }\;\;\; \sum _i \beta _i =1 \;\;\;\mathtt{ and }\;\;\; \beta _i \ge 0 \end{aligned}$$
(15.32)
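To make the objective concrete, the following minimal sketch evaluates (15.32) for a candidate \(\beta \). It is an illustration only, not the implementation of [8]: the per-class GMM scorers passed in as `class_loglik`, the noise covariance `R`, and the dropped normalization constants are all assumptions of this sketch.

```python
import numpy as np

def sr_ch_objective(beta, y, H, R, class_loglik):
    """Evaluate the SR-CH objective (15.32) for a candidate beta.

    beta         : (N,) non-negative weights summing to 1
    y            : (m,) test vector, H : (m, N) exemplar dictionary
    R            : (m, m) observation-noise covariance of Eq. (15.30)
    class_loglik : list of callables x -> log p(x | GMM_t), one per class
                   (placeholders for whatever GMM implementation is used)
    """
    proj = H @ beta                                           # projection into the convex hull
    resid = y - proj
    log_exemplar = -0.5 * resid @ np.linalg.solve(R, resid)   # log of Eq. (15.30), up to a constant
    log_gmm = sum(g(proj) for g in class_loglik)              # Eq. (15.31)
    return log_exemplar + log_gmm
```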

Equation (15.32) can be solved using a variety of optimization methods. We use a technique widely employed in speech recognition, namely the Extended Baum-Welch (EBW) transformations [24], to solve this problem. In [37], it is shown that the EBW optimization technique can be used to maximize differentiable objective functions under constraints of the form given in (15.32) (see also Sect. 15.2.3 and the recursion (15.8)). In [8], we provide a closed-form solution for \(\beta _i^k\) given the exemplar-based term (15.30) and the GMM-based term (15.31).

The parameter \(D\) in (15.8) controls the growth of the objective function. We start by setting \(D\) to a small value to allow a large jump in the objective function. However, if for a specific choice of \(D\) the objective function value decreases when estimating \(\beta ^k\), i.e. \(F(\beta ^k) < F(\beta ^{k-1})\), or one of the \(\beta _i^k\) components becomes negative, then we double the value of \(D\) and use it to estimate a new \(\beta ^k\) in (15.8). We continue to increase \(D\) until the objective function grows and all \(\beta _i\) components are positive. This strategy for setting \(D\) is similar to other applications in speech where the EBW transformations are used [38]. The process of iteratively estimating \(\beta \) continues until there is very little change in the objective function value. A minimal sketch of this procedure is given below.
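The sketch assumes the standard growth-transform form of the EBW recursion for simplex-constrained problems in place of the exact closed-form update of [8]; `F` and `grad_F` are placeholders for the objective (15.32) and its gradient (they could, for instance, be built around the objective sketch above).

```python
import numpy as np

def growth_transform_step(beta, grad, D):
    # Assumed EBW/growth-transform form of the recursion (15.8):
    # beta_i <- beta_i * (dF/dbeta_i + D) / sum_j beta_j * (dF/dbeta_j + D)
    w = beta * (grad + D)
    return w / w.sum()

def sr_ch_solve(F, grad_F, n, D0=0.1, tol=1e-6, max_iter=200):
    """Iteratively maximize F(beta) over {beta >= 0, sum(beta) = 1}.

    F, grad_F : callables for the objective (15.32) and its gradient
    n         : number of columns of the dictionary H
    """
    beta = np.full(n, 1.0 / n)            # uniform initialization
    F_old = F(beta)
    for _ in range(max_iter):
        D = D0                            # start from a small D ...
        beta_new = growth_transform_step(beta, grad_F(beta), D)
        while (np.any(beta_new < 0) or F(beta_new) < F_old) and D < 1e12:
            D *= 2.0                      # ... and double it until F grows
            beta_new = growth_transform_step(beta, grad_F(beta), D)
        if abs(F(beta_new) - F_old) < tol:
            break                         # little change in the objective
        beta, F_old = beta_new, F(beta_new)
    return beta
```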

15.5.2 Convex Hull Classification Rule

Because we solve for the \(\beta \) which maximizes the objective function (15.32), it is natural to also explore a classification rule which defines the best class as the one that maximizes this objective function. Using (15.32) with the exemplar-based term (15.30) and the GMM-based term (15.31), the objective-function-linked classification rule for the best class \(t^*\) is given by

$$\begin{aligned} t^* = \arg \max _t \{\log p(y|\delta _t(\beta ))+ \log p(H\delta _t(\beta )|GMM_t)\} \end{aligned}$$
(15.33)

where \(\delta _t(\beta )\) is a vector which is only non-zero for entries of \(\beta \) corresponding to class \(t\).

15.5.3 Experiments

We compare the performance of our SR-CH method to other standard classifiers used on the TIMIT task, including GMM, SVM, kNN and the ABCS sparse representation method. For the GMM, we explored training both with a maximum likelihood objective function and with the discriminative BMMI objective function [38]. The parameters of each classifier were optimized for each feature set on the development set. In particular, we compare SR-CH against ABCS; note that for the ABCS classification rule, the best class is defined as the one with the maximum \(l_2\) norm of its \(\beta \) entries.

15.5.3.1 Algorithmic Behavior

As discussed in Sect. 15.5.1, for an appropriate choice of \(D\), the objective function of the SR-CH method is guaranteed to increase on each iteration. To observe this behavior experimentally on TIMIT, we chose a random test phone segment \(y\) and solved \(y=H\beta \) using the SR-CH algorithm. Figure 15.7 plots the value of the objective function at each iteration. Notice that the objective function increases rapidly until about iteration 30 and then increases more slowly, experimentally confirming the growth guarantee.

Fig. 15.7 Left Iterations versus objective function. Right Iterations versus sparsity

Table 15.4 Accuracy of sparse representation methods
Table 15.5 SR-CH accuracy, TIMIT development set

We also analyze the sparsity behavior of the SR-CH method. For a randomly chosen test segment \(y\), Fig. 15.7 plots the sparsity level (defined as the number of non-zero \(\beta \) coefficients) at each iteration of the SR-CH algorithm. Notice that as the number of iterations increases, the sparsity level continues to decrease and eventually approaches 20. Intuitively, the normalization and positivity constraints on \(\beta \) in the convex hull formulation allow for this sparse solution: all \(\beta \) coefficients are positive and they are constrained to sum to 1 (i.e., \(\sum \nolimits _i \beta _i=1\)). Given that the initial \(\beta \) values are chosen to be uniform, and that we seek a \(\beta \) which maximizes (15.32), naturally only a few \(\beta \) elements come to dominate while most \(\beta \) values evolve to be close to zero.

15.5.3.2 Comparison with ABCS

To explore the effect of the constraints on \(\beta \) in the convex hull framework, we compare SR-CH to ABCS, an SR method that places no positivity or normalization constraints on \(\beta \). To analyze the different \(\beta \) constraints in the SR-CH and ABCS methods fairly, we compare both methods using only their exemplar terms, since the GMM-based terms of the two methods differ. Table 15.4 shows that the SR-CH method offers improvements over ABCS on the fBMMI feature set, experimentally demonstrating that constraining the \(\beta \) values to be positive and normalized, and not allowing data in \(H\) to be reflected and arbitrarily scaled, improves classification accuracy.

15.5.3.3 GMM-Based Term

In this section we analyze the behavior of SR-CH when using the exemplar term only versus including the additional model-based term given in (15.31). Table 15.5 shows the classification accuracy on the development set with the fBMMI features. Notice that including the additional \(H\beta \) GMM modeling term on top of the exemplar-based term offers a slight improvement in classification accuracy.

15.5.3.4 Comparison with Other Techniques

Table 15.6 compares the classification accuracy of the SR-CH method on the TIMIT core test set with that of other common classification methods. Note that for ABCS, the best numbers for this method, which include both the exemplar and GMM-based terms, are reported. Results are provided for the fBMMI and SA+fBMMI feature sets. Notice that SR-CH outperforms the GMM, kNN and SVM classifiers. In addition, enforcing \(\beta \) to be positive allows for improvements over ABCS. A McNemar's significance test indicates that the SR-CH result is statistically significantly different from the other classifiers at a \(95\,\%\) confidence level. The classification accuracy of \(82.87\,\%\) achieved in [8] was, as of 2011, the best number reported on the TIMIT phone classification task when discriminative features are used, beating the previous best single-classifier number of \(82.3\,\%\) reported in [39]. Finally, when using SA+fBMMI features, the SR-CH method achieves an accuracy of over \(85\,\%\).

15.5.3.5 Accuracy Versus Size of Dictionary

One disadvantage of many exemplar-based methods is that accuracy deteriorates significantly as the number of training exemplars used to make a classification decision increases. For the kNN method, this corresponds to increasing the number of training examples from each class used during voting; for SR methods, it is equivalent to growing the size of \(H\). Parametric classification approaches such as GMMs do not suffer from such a degradation in performance as the amount of training data increases.

Table 15.6 Classification accuracy, TIMIT core test set

Figure 15.8 shows the classification error versus the number of training exemplars (i.e., the size of \(H\)) for different classification methods. Note that the GMM method is trained on all of the training data and is shown here only as a reference. In addition, since the feature vectors in \(H\) have dimension 120, and our SR methods assume \(H\) is over-complete, we only report results for SR methods when the number of examples in \(H\) is larger than 120.

Fig. 15.8 Classification error vs. size of \(H\)

First, observe that the error rates of the two purely exemplar-based methods, namely kNN and ABCS with no model term, increase exponentially as the size of \(H\) grows. The SR-CH exemplar-only methodology, however, is much more robust to an increased size of \(H\), demonstrating the value of the convex hull regularization constraints. Including the extra GMM term in the SR-CH method improves the accuracy slightly. However, the SR-CH method still performs poorly compared with the ABCS technique that uses the GMM-based term. One explanation for this behavior is that the GMM term for ABCS captures the probability of the data \(y\) given the GMM model, so the accuracy of the ABCS method eventually approaches the GMM accuracy, whereas in SR-CH we capture the probability of \(H\beta \) given the GMM. This is one drawback of SR-CH compared with ABCS for large \(H\) that we hope to address in the future.

15.6 Sparse Representation Features

In this section, we explore the use of a sparse representation exemplar-based technique [14] to create a new set of features, while utilizing the benefits of HMMs to efficiently compare scores across frames. This is in contrast to previous exemplar-based methods, which try to utilize the decision scores from the exemplar-based classifiers themselves to generate probabilities [1, 2]. In our SR approach, given a test vector \(y\) and a set of exemplars \(h_i\) from the training set, which we put into a dictionary \(H = [h_1; h_2; \ldots ; h_n]\), we represent \(y\) as a linear combination of training examples by solving \(y=H\beta \) subject to a sparseness constraint on \(\beta \). The feature \(H\beta \) can be thought of as mapping the test sample \(y\) back into the linear span of the training examples in \(H\). We will show that the frame classification accuracy is higher for the SR method than for a GMM, showing that the \(H\beta \) representation not only moves test features closer to training, but also moves these features closer to the correct class. Given this new set of \(H\beta \) features, we train an HMM on them and perform recognition.

A speech signal is defined by a series of feature vectors, \(Y=\{y^1, y^2, \ldots, y^n\}\), for example Mel-Scale Frequency Cepstral Coefficients (MFCCs). For every test sample \(y^t \in Y\), we choose an appropriate \(H^t\) and then solve \(y^t=H^t\beta ^t\) for \(\beta ^t\) via ABCS. Given this \(\beta ^t\), a corresponding \(H^t\beta ^t\) vector is formed, so that a series of \(H\beta \) vectors, \(\{H^1\beta ^1, H^2\beta ^2, \ldots, H^n\beta ^n\}\), is created, one per frame. The sparse representation features are created for both training and test data. An HMM is then trained on this new set of features and recognition is performed in this new feature space. The sketch below illustrates the per-frame procedure.
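This is only a minimal sketch, under the assumption that a dictionary-seeding routine and a sparse solver (such as ABCS) are available as black boxes; both callables are placeholders rather than parts of the method as published.

```python
import numpy as np

def hbeta_features(Y, seed_dictionary, solve_sr):
    """Create one H*beta feature per frame, as described above.

    Y               : (T, d) array of frame features (e.g. fBMMI or MFCC)
    seed_dictionary : callable y -> H, the (d, N) per-frame dictionary
                      (seeded e.g. by kNN or Gaussian alignments, Sect. 15.6.2)
    solve_sr        : callable (y, H) -> beta, a sparse solver such as ABCS
    """
    feats = []
    for y in Y:
        H = seed_dictionary(y)      # frame-dependent dictionary
        beta = solve_sr(y, H)       # sparse weights
        feats.append(H @ beta)      # map y back into the span of training data
    return np.stack(feats)
```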

15.6.1 Measure of Quality

We can measure how well \(y\) assigns itself to the different classes in \(H\) by looking at the residual error between \(y\) and the \(H\beta \) entries corresponding to a specific class [6]. Ideally, all nonzero entries of \(\beta \) should correspond to entries in \(H\) with the same class as \(y\), and the residual error will be smallest within this class. More specifically, let us define a selector \(\delta _i(\beta ) \in \mathbb{R}^N\) as a vector whose entries are zero except for those in \(\beta \) corresponding to class \(i\). We then compute the residual error for class \(i\) as \(\parallel y-H\delta _i(\beta ) \parallel _2\). The best class for \(y\) is the class with the smallest residual error. Mathematically, the best class \(i^*\) is defined as

$$\begin{aligned} i^*=\arg \min _i \parallel y-H\delta _i(\beta ) \parallel _2. \end{aligned}$$
(15.34)
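A direct reading of this rule, assuming each column of \(H\) carries an integer class label, might look as follows (illustrative only):

```python
import numpy as np

def classify_by_residual(y, H, beta, labels):
    """Residual-based decision of Eq. (15.34).

    labels : length-N array giving the class of each column of H.
    """
    best_class, best_err = None, np.inf
    for c in np.unique(labels):
        delta = np.where(labels == c, beta, 0.0)   # selector delta_c(beta)
        err = np.linalg.norm(y - H @ delta)        # class-wise residual
        if err < best_err:
            best_class, best_err = c, err
    return best_class
```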

15.6.2 Choices of Dictionary \(H\)

Success of the sparse representation features depends heavily on a good choice of \(H\). Pooling together all training data from all classes into \(H\) would make the number of columns of \(H\) very large (typically millions of frames) and would make solving for \(\beta \) intractable. Therefore, in this section we discuss various methodologies for selecting \(H\) from a large sample set. Recall that \(H\) is selected for each frame \(y\), and then \(\beta \) is found using ABCS, in order to create an \(H\beta \) feature for each frame.

  • Seeding \(\mathbf{{H}}\) from Nearest Neighbors: For each \(y\), we find a neighborhood of closest points to \(y\) in the training set; these \(k\) neighbors become the entries of \(H\). We refer the reader to [3] for a discussion on choosing the number of neighbors \(k\) for SRs. A set of \(H\beta \) features is created for both training and test, but \(H\) is always seeded with data from the training set. To avoid overtraining the \(H\beta \) features on the training set, when creating \(H\beta \) features for training frames we require that the selected samples come from speakers other than the speaker of frame \(y\). While this kNN approach is computationally feasible on small-vocabulary tasks, it can be computationally expensive for large vocabulary tasks. To address this, we discuss other choices for seeding \(H\) below, tailored to large vocabulary applications (a sketch of the kNN seeding appears after this list).

  • Using a Trigram Language Model: Typically only a small subset of Gaussians needs to be evaluated at a given frame, and thus training data belonging to this small subset can be used to seed \(H\). To determine these Gaussians at each frame, we decode the data using a trigram language model (LM) and find the best aligned Gaussian at each frame. For each such Gaussian, we compute the four other closest Gaussians, where closeness is defined by finding Gaussian pairs with the smallest Euclidean distance between their means. After we find the top five Gaussians at a specific frame, we seed \(H\) with the training data aligning to these top five Gaussians. Since this still typically amounts to thousands of training samples in \(H\), we must sample further; our sampling method is discussed in Sect. 15.6.3. We also compare seeding \(H\) using the top 10 Gaussians rather than the top five.

  • Using a Unigram Language Model: One problem with using a trigram LM is that this decode is produced by the very baseline system we are trying to improve upon. Therefore, seeding \(H\) with frames related to the top aligned Gaussian essentially projects \(y\) back down to the same Gaussian which initially identified it. To increase the variability between the Gaussians used to seed \(H\) and the best aligned Gaussian from the trigram LM decode, we explore using a unigram LM to find the best aligned Gaussian at each frame. Again, given the best aligned Gaussian, the four closest Gaussians are found, and data from these five Gaussians is used to seed \(H\).

  • Using no Language Model Information: To further weaken the effect of the LM, we explore seeding \(H\) using only acoustic information. Namely, at each frame we find the top five scoring Gaussians. \(H\) is seeded with training data aligning to these Gaussians.

  • Enforcing Unique Phonemes: Another problem with seeding \(H\) by finding the five closest Gaussians relative to the best aligned Gaussian is that all of these Gaussians could come from the same phoneme (e.g. phoneme “AA"). Therefore, we explore finding the five closest Gaussians relative to the best aligned one such that the phoneme identities of these Gaussians are unique (e.g. “AA", “AE", “AW", etc.). \(H\) is then seeded from frames aligning to these five Gaussians.

  • Using Gaussian Means: The above approaches of seeding \(H\) use actual examples from the training set, which is computationally expensive. To address this, we investigate seeding \(H\) from Gaussian means. Namely, at each frame we use a trigram LM to find the best aligned Gaussian. Then we find the 499 closest Gaussians to this top Gaussian, and use the means from these 500 Gaussians to seed \(H\).
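As referenced in the nearest-neighbor item above, a minimal sketch of kNN-based seeding of \(H\) might look as follows; the helper name, the Euclidean distance, and the default value of \(k\) are assumptions of this sketch rather than values prescribed in the text.

```python
import numpy as np

def seed_H_knn(y, train_feats, train_speakers, k=200, exclude_speaker=None):
    """Seed H with the k training frames closest to y (hypothetical helper).

    train_feats     : (M, d) pool of training frames
    train_speakers  : (M,) speaker id of each training frame
    exclude_speaker : when building H*beta features for training frames,
                      frames from this speaker are skipped to avoid overtraining
    """
    keep = np.arange(len(train_feats))
    if exclude_speaker is not None:
        keep = keep[train_speakers != exclude_speaker]
    dists = np.linalg.norm(train_feats[keep] - y, axis=1)   # Euclidean distance to y
    nearest = keep[np.argsort(dists)[:k]]
    return train_feats[nearest].T                           # columns of H
```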

15.6.3 Choice of Sampling

As discussed above, if we seed \(H\) using all training data belonging to specific Gaussians, this amounts to thousands of training examples in \(H\). We explore two different approaches to sampling a subset of this data for seeding \(H\).

  • Random Sampling: For each Gaussian we want to select training data from, we randomly sample \(N\) training examples from the total set of training frames aligned to this Gaussian. This process is repeated for each of the closest five Gaussians, and we reduce \(N\) as the “closeness” decreases. For example, for the closest five Gaussians, the number of data points \(N\) chosen from each Gaussian is 200, 100, 100, 50 and 50, respectively (a sketch of this sampling appears after this list).

  • Sampling Based on Cosine Similarity: While random sampling offers a relatively quick way to select a subset of training examples, it does not guarantee that we select “good” examples from a Gaussian, i.e. examples that are actually close to frame \(y\). Alternatively, we explore splitting the training points aligned to a Gaussian into bins that are \(1\sigma \), \(2\sigma \), etc. away from the mean of the Gaussian. Here \(\sigma \) is chosen as the total number of training points aligned to this Gaussian divided by the number of samples \(N\) we want to draw from it. Then, within each \(\sigma \) bin, we find the training point that is closest to the test point \(y\) in terms of cosine similarity. This is repeated for all \(1\sigma \), \(2\sigma \), etc. bins. Again, the number of samples taken from each Gaussian is reduced as “closeness” decreases.
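A minimal sketch of the random-sampling strategy, using the example counts 200/100/100/50/50 from the text (the helper itself is hypothetical), is:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frames(frames_per_gaussian, counts=(200, 100, 100, 50, 50)):
    """Randomly sample a decreasing number of frames from the closest Gaussians.

    frames_per_gaussian : list of (M_g, d) arrays of training frames,
                          ordered from closest to farthest Gaussian.
    """
    cols = []
    for frames, n in zip(frames_per_gaussian, counts):
        idx = rng.choice(len(frames), size=min(n, len(frames)), replace=False)
        cols.append(frames[idx])
    return np.concatenate(cols).T    # columns of H
```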

15.6.4 Experiments

The small vocabulary recognition experiments in this chapter are conducted on the TIMIT phonetic corpus [16]. Similar to [40], acoustic models are trained on the training set, and results are reported on the core test set. The initial acoustic features are 13-dimensional MFCC features. The large vocabulary experiments are conducted on an English broadcast news transcription task [17]. The acoustic model is trained on 50 h of data from the 1996 and 1997 English Broadcast News Speech Corpora. Results are reported on 3 h of the EARS Dev-04f set. The initial acoustic features are 19-dimensional PLP features.

Both small and large vocabulary experiments utilize the following recipe for training acoustic models [40]. First, a set of context-independent (CI) HMMs is trained, either using information from the phonetic transcription (TIMIT) or from flat-start (broadcast news). The CI models are then used to bootstrap the training of a set of context-dependent (CD) triphone models. In this step, at each frame, a series of consecutive frames surrounding that frame are joined together and a Linear Discriminant Analysis (LDA) transform is applied to project the feature vector down to 40 dimensions. Next, vocal tract length normalization (VTLN) and feature-space Maximum Likelihood Linear Regression (fMLLR) are used to map the features into a canonical speaker space. Then, a set of discriminatively trained features and models is created using the boosted Maximum Mutual Information (BMMI) criterion. Finally, the set of models is adapted using MLLR.

Fig. 15.9 \(\beta \) coefficients on TIMIT and broadcast news

We create the \(H\beta \) features from a set of fBMMI features. We choose this level because these features offer the highest frame accuracy relative to LDA, VTLN, or fMLLR features, allowing us to further improve on the accuracy with the \(H\beta \) features. A set of \(H\beta \) features is created at each frame from the fBMMI features for both training and test. A new ML HMM is then trained on these features and used for recognition. Since the \(H\beta \) features are a linear combination of the discriminatively trained fBMMI features, we argue that some discrimination can be lost. Therefore, we explore applying another fBMMI transformation to the \(H\beta \) features before applying model-space discriminative training and MLLR.

In what follows we present results using \(H\beta \) features on both small and large vocabulary tasks.

15.6.5 Sparsity Analysis

We first analyze the \(\beta \) coefficients obtained by solving \(y=H\beta \) using ABCS [3]. For two randomly selected frames \(y\), Fig. 15.9 shows the \(\beta \) coefficients corresponding to 200 entries in \(H\) for TIMIT and 500 entries for broadcast news. Notice that for both datasets the \(\beta \) entries are quite sparse, illustrating that only a few samples in \(H\) are used to characterize \(y\). As [6] discusses, this sparsity can be thought of as a form of discrimination, since certain examples in \(H\) are selected as “good” while “bad” examples are jointly assigned zero weight. We have seen advantages of the SR approach for classification compared to a GMM, even on top of discriminatively trained \(y\) features [3], and we re-confirm this behavior in Sect. 15.6.6. The benefit of SRs on top of discriminatively trained fBMMI features, coupled with the exemplar-based nature of SRs, motivates us to further explore their behavior for recognition tasks.

15.6.6 TIMIT Results

15.6.6.1 Frame Accuracy

The success of \(H\beta \) relies first on the \(\beta \) vectors giving large support to correct classes and small support to incorrect classes (as demonstrated by Fig. 15.9) when solving \(y = H\beta \) at each frame. Thus, the classification accuracy per frame, computed using (15.34), should ideally be high. Table 15.7 shows the frame accuracy for the GMM and SR methods.

Table 15.7 Frame accuracy on the TIMIT core test set
Table 15.8 PER on TIMIT

Notice that the SR technique offers significant improvements over the GMM method, again confirming the benefit of exemplar-based classifiers.

15.6.6.2 Error Rate for \(H\beta \) Features

Table 15.8 shows the recognition performance of the \(H\beta \) features on TIMIT. Owing to the small vocabulary nature of TIMIT, we only explore seeding \(H\) from nearest neighbors. Notice that creating the \(H\beta \) features in the fBMMI space offers a \(0.7\,\%\) absolute improvement in PER. Given the small vocabulary nature of TIMIT, no gain was found from applying another fBMMI transform to either the baseline or the \(H\beta \) features. After applying BMMI and MLLR to both feature sets, the \(H\beta \) features offer a \(0.5\,\%\) absolute improvement in PER over the baseline system. This shows that using exemplar-based SRs to produce \(H\beta \) features not only moves test features closer to training, but also moves the feature vectors closer to the correct class, resulting in a decrease in PER.

Table 15.9 WER of \(H\beta \) features for different \(H\)

15.6.7 Broadcast News Results

15.6.7.1 Selection of \(H\)

Table 15.9 shows the WER of the \(H\beta \) features for the different choices of \(H\) discussed in Sect. 15.6.2. Note that the baseline fBMMI system has a WER of \(21.1\,\%\). The following can be observed:

  • There is little difference in WER between random sampling and sampling by cosine similarity. For speed, we use random sampling for the remaining \(H\) selection experiments.

  • There is little difference between using 5 and 10 Gaussians.

  • Seeding \(H\) using nearest neighbors is worse than using the trigram LM. On broadcast news, we find that kNN has a lower frame accuracy than a GMM, a result similarly observed in the literature for large vocabulary corpora [1]. This lower frame accuracy translates into a higher WER when \(H\) is seeded with nearest neighbors.

  • Seeding \(H\) from Gaussians with unique phoneme identities introduces too much phoneme-class variability into the \(H\beta \) feature, also leading to a higher WER.

  • Using a unigram LM to reduce the link between the Gaussians used to seed \(H\) and the best aligned Gaussian from the trigram LM decode offers a slight improvement in WER over the trigram LM.

  • Utilizing no LM information results in a very high WER.

  • Using Gaussian means to seed \(H\) reduces the computation to create \(H\beta \) without a large increase in WER.

15.6.7.2 WER for \(H\beta \) Features

Table 15.10 shows the performance of the \(H\beta \) features on the broadcast news task. Creating the \(H\beta \) features in the fBMMI space gives a WER of \(21.1\,\%\), which is comparable to the baseline system. However, after applying an fBMMI transform to the \(H\beta \) features, we achieve a WER of \(20.2\,\%\), a \(0.2\,\%\) absolute improvement over applying another fBMMI transform to the original fBMMI features. Finally, after applying BMMI and MLLR to both feature sets, the \(H\beta \) features reach a WER of \(18.7\,\%\), a \(0.3\,\%\) absolute improvement over the baseline system. This demonstrates again that using information about actual training examples to produce features that are mapped closer to the training data and have a higher frame accuracy than GMMs improves accuracy for large vocabulary tasks as well.

15.7 SR Phone Identification Features (\(S_{pif}\))

In this section, we review the use of SR for classification and use this framework to create our \(S_{pif}\) features. Let us first describe how we can use \(\beta \) to create a set of \(S_{pif}\) vectors. First, define a matrix \(H_{phnid}=[p_{1,1}, p_{1,2}, \ldots , p_{w,n_w}] \in \mathbb{R}^{r \times N}\), which has the same number of columns \(N\) as the original \(H\), but a different number of rows \(r\). Recall that each \(x_{i,j} \in H\) has a corresponding class label \(i\). We define each \(p_{i,j} \in H_{phnid}\), corresponding to feature vector \(x_{i,j} \in H\), to be a vector with zeros everywhere except at the index \(i\) corresponding to the class of \(x_{i,j}\). Figure 15.10 shows the \(H_{phnid}\) corresponding to \(H\), where each \(p_{i,j}\) becomes a phone identification vector with a value of \(1\) at the position of the class of \(x_{i,j}\). Here \(r\), the dimension of each \(p_{i,j}\), is equal to the total number of classes.

Table 15.10 WER on broadcast news
Fig. 15.10 \(H_{phnid}\) corresponding to \(H\)

Once \(\beta \) is found by solving \(y=H\beta \), we use this same \(\beta \) to select important classes within the new dictionary \(H_{phnid}\). Specifically, we define a new feature vector \(S_{pif}=H_{phnid}\beta ^2\), where each element of \(\beta \) is squared, i.e., \(\beta ^2=\{\beta _i^2\}\). We use \(\beta ^2\) because this is similar to the \(\parallel \delta _i(\beta ) \parallel _2\) classification rule given by (15.34). Each row \(i\) of the \(S_{pif}\) vector roughly represents the \(l_2\) norm of the \(\beta \) entries for class \(i\).

A speech signal is defined by a series of feature vectors, \(Y=\{y^1, y^2, \ldots, y^n\}\), for example Mel-Scale Frequency Cepstral Coefficients (MFCCs). For every test sample \(y^t \in Y\), we solve \(y^t=H^t\beta ^t\) to compute a \(\beta ^t\), and a corresponding \(S_{pif}^t\) vector is formed from it. Since \(\beta ^t\) at each sample represents a weighting of the entries in \(H^t\) that best represent the test vector \(y^t\), it is difficult to compare \(\beta ^t\) values, and hence the \(S_{pif}^t\) vectors, across frames. Therefore, to ensure that the values can be compared across samples, the \(S_{pif}^t\) vectors are normalized at each sample: the new \(\bar{S}_{pif}^t\) at sample \(t\) is computed as \(\bar{S}_{pif}^t=\frac{S_{pif}^t}{\parallel S_{pif}^t \parallel _1}\). A series of vectors \(\{\bar{S}_{pif}^1, \bar{S}_{pif}^2, \ldots, \bar{S}_{pif}^n\}\) is thus created and used for recognition. A sketch of this construction is given below.
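The sketch builds one normalized \(S_{pif}\) vector from \(\beta \), assuming integer class labels for the columns of \(H\); the small constant in the normalization is an assumption of the sketch to guard against an all-zero \(\beta \).

```python
import numpy as np

def spif_vector(beta, labels, n_classes):
    """Build a normalized S_pif vector for one frame.

    beta      : (N,) sparse weights from solving y = H beta
    labels    : (N,) integer class index of each column of H (rows of H_phnid)
    n_classes : r, the total number of classes
    """
    H_phnid = np.zeros((n_classes, len(beta)))
    H_phnid[labels, np.arange(len(beta))] = 1.0   # one-hot phone-id columns
    s = H_phnid @ (beta ** 2)                     # S_pif = H_phnid beta^2
    return s / (np.sum(np.abs(s)) + 1e-12)        # l1 normalization per frame
```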

15.7.1 Construction of Dictionary \(H\)

Success of SRs depends on a good choice of \(H\). In [14], various methods for seeding \(H\) from a large sample set were explored. Below we summarize the main techniques used in this work to select \(H\).

15.7.1.1 Seeding \(H\) from Nearest Neighbors

For each \(y\), we find a neighborhood of closest points to \(y\) from all examples in the training set. These \(k\) neighbors become the entries of \(H\). While this approach works well on small-vocabulary tasks, it is computationally expensive for large data sets.

15.7.1.2 Using a Language Model

In speech recognition, when an utterance is scored using a set of HMMs (with output distributions given by Gaussians), typically only a small subset of these Gaussians needs to be evaluated at a given frame, which allows a large improvement in speed without a reduction in accuracy [41]. Using this fact, we use training data belonging to a small subset of Gaussians to seed \(H\). To determine these Gaussians at each frame, we decode the data using a language model (LM) and find the best aligned Gaussian at each frame. For each such Gaussian, we compute the four other closest Gaussians. After we find the top five Gaussians at a specific frame, we seed \(H\) with the training data aligning to these top five Gaussians. We explore using both trigram and unigram LMs to obtain the top Gaussians.

15.7.1.3 Using a Lattice

Seeding \(H\) as suggested above amounts to finding the best \(H\) at the frame level. However, the goal of speech recognition is to recognize words, and therefore we also explore seeding \(H\) using information related to competing word hypotheses. Specifically, we create a lattice of competing word hypotheses and obtain the top Gaussians at each frame from the Gaussian alignments of the lattice. The closest Gaussians to the best Gaussian are then found, and data from these five Gaussians is used to seed \(H\).

15.7.2 Reducing Sharpness Estimation Error

As described in Sect. 15.7.1, for computational efficiency, \(S_{pif}\) features are created by first pre-selecting a small amount of data for the dictionary \(H\). This implies that only a few classes are present in \(H\) and only a few \(S_{pif}\) posteriors are non-zero, something we will define as feature sharpness. Feature sharpness by itself is advantageous: for example, if we were able to correctly predict the right class at each frame and capture this in \(S_{pif}\), the WER would be close to zero. However, because we are limited by the amount of data that can be used to seed \(H\), incorrect classes may have their probabilities boosted over correct classes, something we will refer to as sharpness estimation error. In this section, we explore various techniques to smooth out the sharp \(S_{pif}\) features and reduce this estimation error.

15.7.2.1 Choice of Class Identification

The \(S_{pif}\) vectors are defined based on the class labels in \(H\). We explore two choices of class labels in this work. First, we explore using monophone class labels. Second, we investigate labeling classes in \(H\) by a set of context-independent (CI) triphones. While using triphones increases the dimension of the \(S_{pif}\) vector, the elements of the vector are less sharp, since the \(\beta \) values for a specific monophone are more likely to be distributed across the three different triphones of that monophone.

15.7.2.2 Posterior Combination

Another technique to reduce feature sharpness is to combine the \(S_{pif}\) posteriors with posteriors coming from an HMM system, a technique often explored when posteriors are created using neural networks [17]. Specifically, let us define \(h^j(y_t)\) as the output distribution for observation \(y_t\) and state \(j\) of an HMM system. In addition, define \(S_{pif}^j(y_t)\) as the \(S_{pif}\) posterior corresponding to state \(j\). Note that the number of \(S_{pif}\) posteriors can be smaller than the number of HMM states, so the same \(S_{pif}\) posterior may map to multiple HMM states; for example, the \(S_{pif}\) posterior corresponding to phone “aa” could map to HMM states “aa-b-0”, “aa-m-0”, etc. Given the HMM and \(S_{pif}\) posteriors, the final output distribution \(b^j(y_t)\) is given by Eq. 15.35, where \(\lambda \) is a weight on the \(S_{pif}\) posterior stream, selected on a held-out set.

$$\begin{aligned} b^j(y_t)=h^j(y_t)+\lambda S_{pif}^j(y_t) \end{aligned}$$
(15.35)
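As an illustration of (15.35), the sketch below combines the two streams for a single frame; the state-to-class mapping and the default weight value are placeholders (the weight \(\lambda \) is tuned on a held-out set, as noted above).

```python
import numpy as np

def combine_posteriors(hmm_scores, spif, state_to_class, lam=0.5):
    """Stream combination of Eq. (15.35) for one frame.

    hmm_scores     : (n_states,) HMM output scores h^j(y_t)
    spif           : (n_classes,) S_pif posteriors for the same frame
    state_to_class : (n_states,) index mapping each HMM state j to its
                     S_pif class (e.g. all "aa-*" states map to phone "aa")
    """
    return hmm_scores + lam * spif[state_to_class]
```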

15.7.2.3 \(S_{pif}\) Feature Combination

As we will show in Sect. 15.7.5, \(S_{pif}\) features created using different methodologies for selecting \(H\) offer complementary information. For example, \(S_{pif}\) features created when \(H\) is seeded with a lattice have a higher frame accuracy and incorporate more sequence information than when \(H\) is seeded using a unigram or trigram LM. However, \(S_{pif}\) features created from lattice information are much sharper than features created with a unigram or trigram LM. Thus, we explore combining different \(S_{pif}\) features. Denoting by \(S_{pif}^{tri}\), \(S_{pif}^{uni}\) and \(S_{pif}^{lat}\) the features created from the three different \(H\) selection methodologies, we combine them to produce a new \(S_{pif}^{comb}\) feature as given by Eq. 15.36. The weights \(\{\alpha , \beta , \gamma \}\) are chosen on a held-out set with the constraint that \(\alpha +\beta +\gamma =1\).

$$\begin{aligned} S_{pif}^{comb}=\alpha S_{pif}^{tri} + \beta S_{pif}^{uni} + \gamma S_{pif}^{lat} \end{aligned}$$
(15.36)

15.7.3 Experiments

The small vocabulary recognition experiments are conducted on TIMIT [16]. Similar to [14], acoustic models are trained on the training set, and results are reported on the core test set. The initial acoustic features are 13-dimensional MFCC features. The large vocabulary experiments are conducted on an English broadcast news transcription task [17]. The acoustic model is trained on 50 h of data from the 1996 and 1997 English Broadcast News Speech Corpora. Results are reported on 3 h of the EARS Dev-04f set. The initial acoustic features are 19-dimensional PLP features.

Both corpora utilize the following recipe for training. First, a set of CI HMMs are trained, either using information from the phonetic transcription (TIMIT) or from flat-start (Broadcast News). The CI models are then used to bootstrap the training of a set of CD triphone models. In this step, given an initial set of MFCC or PLP features, a set of LDA features are created. After the features are speaker adapted, a set of discriminatively trained features and models are created using the boosted Maximum Mutual Information (BMMI) criterion. Finally, models are adapted via MLLR.

On TIMIT, we explore creating \(S_{pif}\) features from both LDA and fBMMI features, while for broadcast news we only create \(S_{pif}\) features after the fBMMI stage. The initial LDA/fBMMI features are used for both \(y\) and \(H\) to solve \(y=H\beta \) and create the \(S_{pif}\) features at each frame. In this work, we explore the ABCS method. Once the series of \(S_{pif}\) vectors is created, an HMM is built on the training features.

15.7.4 TIMIT Results

15.7.4.1 Frame Accuracy

The success of \(S_{pif}\) relies first on the classification accuracy per frame, computed using Eq. 15.34, being high. Table 15.11 shows the classification accuracy for the GMM and SR methods for both the LDA and fBMMI feature spaces. Notice that the SR technique offers significant improvements over the GMM method.

15.7.4.2 Recognition Results: Class Identification

Table 15.12 shows the phonetic error rate (PER) at the CD level for the different class identification choices. Since only a kNN is used to seed \(H\) on TIMIT, we call this feature \(S_{pif}^{knn}\). We also list results for other CD ML-trained systems reported in the literature on TIMIT. Notice that smoothing out the sharpness error of the \(S_{pif}\) features by using triphones rather than monophones results in a decrease in error rate. The \(S_{pif}\)-triphone features outperform the LDA features and also offer the best result of all methods on TIMIT at the CD level for ML-trained systems.

Table 15.11 Frame accuracy on the TIMIT core test set
Table 15.12 PER on TIMIT core test set—CD ML trained systems
Table 15.13 PER on TIMIT core test set—fMMI level

We further explore \(S_{pif}\) features created after the fBMMI stage. Table 15.13 shows that the performance is now worse than the fBMMI system. Because the fBMMI features are already discriminative in nature and offer good class separability, \(S_{pif}\) features created in this space are too sharp, explaining the increase in PER.

15.7.4.3 Recognition Results: Posterior Combination

We explore reducing feature sharpness by combining \(S_{pif}\) posteriors with HMM posteriors, as shown in Table 15.14. We observe that on TIMIT, combining posteriors from the two different feature streams has virtually no impact on recognition accuracy compared to the baseline fBMMI system, indicating that there is little complementarity between the two systems. Because no gains were observed with posterior combination, further \(S_{pif}\) feature combination was not explored.

15.7.5 Broadcast News

In this section we explore the \(S_{pif}\) features on Broadcast News.

15.7.5.1 Recognition Results: Choice of \(H\) and Class Identity

Table 15.15 shows the frame accuracy and WER on broadcast news for different choices of \(H\) and class identity. We also quantify the sharpness estimation error of the different \(S_{pif}\) methods. We define the “sharpness” of an \(S_{pif}\) vector by calculating the entropy of the non-zero probabilities of the feature: the sharper the \(S_{pif}\) feature, the lower the entropy. A very sharp \(S_{pif}\) feature that emphasizes the incorrect class for a frame will lead to a classification error. Therefore, we measure sharpness error by the average entropy over all misclassified \(S_{pif}\) frames. Note that sharpness is only measured for monophone \(S_{pif}\) features; using triphone \(S_{pif}\) features smooths out the class probabilities since the feature dimension is increased, but it is difficult to quantitatively compare feature sharpness for the monophone and triphone \(S_{pif}\) features since the phone label sets and dimensions of the two features differ.

Table 15.14 PER on TIMIT core test set—posterior combination
Table 15.15 WER on broadcast news, class identification
Table 15.16 WER on broadcast news, oracle results

First, notice the trend between frame accuracy and entropy in Table 15.15. \(S_{pif}^{uni}\) features have a low frame accuracy and hence a high WER. While \(S_{pif}^{lat}\) features have a very high frame accuracy, they have a higher entropy on misclassified frames compared to \(S_{pif}^{tri}\) and \(S_{pif}^{uni}\), and hence also have a high WER. \(S_{pif}^{tri}\) features created from a trigram LM offer the best tradeoff between feature sharpness and accuracy, and achieve a WER close to the baseline. However, if feature sharpness is reduced by using triphone \(S_{pif}^{tri}\) features, we now see on a word recognition task that the WER increases slightly.

15.7.5.2 Oracle Results of Reducing Estimation Error

We motivate the need for reducing sharpness error with the following oracle experiment. Given the \(S_{pif}^{tri}\)-monophone features, \(x\,\%\) of the misclassified frames are corrected to have a probability of 1 at the correct phone index and 0 elsewhere. Table 15.16 shows the results when \(1\,\%\), \(3\,\%\), and \(5\,\%\) of the misclassified \(S_{pif}\) features are corrected. Notice that just by correcting a small percentage of misclassified features, the WER reduces significantly. This motivates us to explore different techniques to reduce \(S_{pif}\) sharpness in the next section.

15.7.5.3 Recognition Results: Posterior and \(S_{pif}\) Combination

In this section, we explore reducing sharpness through posterior and \(S_{pif}\) combination. Table 15.17 shows the baseline results for the fBMMI and \(S_{pif}\)-monophone features, at \(18.7\,\%\) and \(19.5\,\%\) respectively. The frame accuracies and entropies of misclassified frames for the various \(S_{pif}\) combination features are also listed. Note that the frame accuracy is only reported on the \(S_{pif}\) feature and does not include the frame accuracy after posterior combination.

Table 15.17 WER on broadcast news, posterior and \(S_{pif}\) combination

First, notice that through posterior combination we reduce the WER by \(0.5\,\%\) absolute, from \(18.7\,\%\) to \(18.2\,\%\), showing the complementarity between the fBMMI and \(S_{pif}\) feature spaces. Second, by doing additional \(S_{pif}\) feature combination, we are able to increase the frame accuracy from \(70.3\,\%\) to \(76.3\,\%\) without a reduction in \(S_{pif}\) entropy, which increases slightly from 2.27 to 2.29. This results in a further decrease in WER of \(0.4\,\%\) absolute, from \(18.2\,\%\) to \(17.8\,\%\), indicating the importance of reducing feature sharpness, particularly for misclassified \(S_{pif}\) frames.

15.8 Enhancing Exemplar-Based Posteriors for Speech Recognition Tasks

When errors occur in exemplar modeling, wrong classes have their probabilities over-emphasized, something we refer to as feature or posterior sharpness. In general, a more desirable methodology for enhancing the posteriors is one that simultaneously improves the frame accuracy and reduces this erratic sharpness across frames. Given that through a NN transformation we have enhanced the posteriors by improving the frame error rate, we now explore a technique to smooth the posteriors. Specifically, we use an approach similar to the tied mixture approach [20], in which new posteriors are modeled as a tied mixture of the NN posteriors. Given a feature \(o_t\) and a set of NN posterior scores \(p(s_i|o_t)\) for all classes \(i \in L\), we can estimate the posterior for state \(s_j\) as

$$\begin{aligned} p(s_j|o_t) = \sum _{i=1}^L p(s_i|o_t)p(s_j|o_t,s_i) \end{aligned}$$
(15.37)

As in the tied mixture approach [20], a tying is invoked such that the term \(p(s_j|o_t,s_i)\) for a given \(i\) is independent of \(o_t\), which reduces (15.37) to

$$\begin{aligned} p(s_j|o_t) = \sum _{i=1}^L p(s_i|o_t)p(s_j|s_i) \end{aligned}$$
(15.38)

where \(p(s_j|s_i)\) is a set of mixing coefficients. Mixing NN posteriors from different classes helps to smooth over sharp posterior distributions [20].

In this section we look to learn a set of mixing coefficients \(p(s_j|s_i)\) to mix state based posteriors from different states. More formally, we will refer to the \(NN-S_{pif}\) posteriors \(p(s_i|o_t)\) as \(a\). If we assume there are \(L\) states, then the posterior probability \(a_t(l)\) at time \(t\) for state \(l\) satisfies the following properties:

$$\begin{aligned} a_t(l) \ge 0 \;\;\;\mathtt{and }\;\;\; \sum _{l=1}^L a_t(l) = 1 \end{aligned}$$
(15.39)

Given state \(l\) and the set of NN posteriors indexed by \(k \in \{1, \ldots , L \}\) for this state, we define the mixing coefficient \(p(s_j|s_i)\) as \(b(l,k)\), which satisfies the following properties:

$$\begin{aligned} b(l,k) \ge 0 \;\;\;\mathtt{and }\;\;\; \sum _{k=1}^L b(l,k) = 1 \end{aligned}$$
(15.40)

Our objective is to learn a set of mixing coefficients \(b(l,k)\) via maximum likelihood. Here, we maximize an objective function which linearly interpolates the original posteriors \(a\), similar to the tied mixture approach [20]. Specifically, consider all frames aligned to a state \(l\), from \(t=1\) to \(T_l\). We define the mixed posterior for a specific frame \(t\) as

$$\begin{aligned} c_{t}(l) =\sum _{k=1}^L b(l,k) a_t(k) \end{aligned}$$
(15.41)

It is easy to see that \(c_{t}(l)\) satisfies the posterior properties in (15.39) and is itself a posterior. The objective function formed from this posterior over all training frames aligned to state \(l\) is given by

$$\begin{aligned} f_l(b) =\prod _{t=1}^{T_l} c_{t}(l) =\prod _{t=1}^{T_l} (\sum _{k=1}^L b(l,k) a_t(k)) \end{aligned}$$
(15.42)

Because (15.42) is a polynomial with positive coefficients, the Baum-Welch update equation can be used to iteratively solve for \(b(l,k)\) which maximizes the above objective function. The recursive update equation for \(b(l,k)\) is given by

$$\begin{aligned} {b}(l,k) :=\frac{b(l,k) \nabla _{b(l,k)}f_l(b)}{\sum _{j=1}^L b(l,j)\nabla _{b(l,j)}f_l(b)} \end{aligned}$$
(15.43)

Here the gradient of the objective function \(f_l(b)\) is

$$\begin{aligned} \nabla _{b(l,k)}f_l(b) =\sum _{t=1}^{T_l} f_l(b) \frac{a_{t}(k)}{\sum _{i=1}^L b(l,i) a_{t}(i)} \end{aligned}$$
(15.44)

Substituting the gradient (15.44) into the update formula (15.43) yields the following update for \(b(l,k)\)

$$\begin{aligned} {b}(l,k) :=\frac{1}{T_l}\sum _{t=1}^{T_l}\frac{b(l,k) a_t(k)}{\sum _{i=1}^L b(l,i)a_{t}(i)} \end{aligned}$$
(15.45)

This equation shows that the mixing coefficients \(b(l,k)\) learned for state \(l\) effectively take a linearly weighted average of posterior coefficients \(a\) over all training frames aligned to state \(l\).

Note that (15.45) assumes an initial value of \(b(l,k)\). We initialize \(b(l,k)\) uniformly to \(1/L\), where \(L\) is the number of states, and iteratively update \(b(l,k)\) using (15.45) until the change in the objective function value between iterations falls below a specified threshold. A minimal sketch of this procedure is given below.
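The sketch estimates the mixing coefficients for a single state \(l\); as an assumption of this sketch, the log of the objective (15.42) is used as the stopping criterion, purely for numerical convenience.

```python
import numpy as np

def learn_mixing_coeffs(a, tol=1e-6, max_iter=100):
    """Learn b(l, k) for one state l by iterating the update (15.45).

    a : (T_l, L) array of NN posteriors a_t(k) for the frames aligned to l.
    """
    T_l, L = a.shape
    b = np.full(L, 1.0 / L)                   # uniform initialization
    prev_obj = -np.inf
    for _ in range(max_iter):
        denom = a @ b                         # sum_i b(l,i) a_t(i), one value per frame
        obj = np.sum(np.log(denom))           # log of the product objective (15.42)
        if obj - prev_obj < tol:
            break                             # little change between iterations
        prev_obj = obj
        b = np.mean((a * b) / denom[:, None], axis=0)   # Eq. (15.45)
    return b
```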

Fig. 15.11 Mixing coefficient examples

Once \(b(l,k)\) is learned, given state \(l\) and the \(NN-S_{pif}\) posteriors (denoted by \(a\)), a new posterior for state \(l\) is computed by taking a weighted average of the NN posteriors with the mixing coefficients. This new posterior, denoted \(NN-S_{pif}-Post^{(l)}\) for state \(l\), is given by

$$\begin{aligned} NN-S_{pif}-Post^{(l)} = \sum _{k=1}^L b(l,k) a_t(k) \end{aligned}$$
(15.46)

Figure 15.11 plots the mixing coefficients \(b(l,k)\) for states \(l=100\), \(500\), \(1000\), and \(1500\). We observe that for all of these states the non-zero mixing coefficients are clustered together, and thus come from context-dependent states which are similar to each other, for example states which map to the same monophone.

15.8.1 Results

The following experiments were conducted as described in Sect. 15.7.3.

15.8.1.1 Using \(S_{pif}\) Features As Output Probabilities

First, we explore the performance of the \(S_{pif}\) posteriors when used directly as output probabilities in an HMM system. Table 15.18 shows that the performance of the \(S_{pif}\) posteriors is worse than that of the baseline GMM/HMM system trained on fBMMI features, illustrating the problem with exemplar-based posterior features which are not learned through a discriminative process linked to WER. Furthermore, combining \(S_{pif}\) and GMM posteriors in tandem does not offer improvements over the baseline GMM/HMM system.

Table 15.18 PER on TIMIT core test set, \(S_{pif}\) features
Table 15.19 PER on TIMIT core test set, NN enhancement

15.8.1.2 Enhancing Using Neural Networks

Second, we explore the performance of training a NN with \(S_{pif}\) features as input and then using the \(NN-S_{pif}\) probabilities as output probabilities in an HMM system. Table 15.19 shows that the \(NN-S_{pif}\) features offer a \(1.3\,\%\) absolute reduction in PER over using the \(S_{pif}\) features alone. This illustrates the importance of enhancing the \(S_{pif}\) posteriors with a NN to create a set of posteriors better aligned with the PER objective in speech. Furthermore, the PER of \(19.0\,\%\) is better than that of the GMM/HMM system trained with fBMMI features [21], as well as that of a NN trained with fBMMI features [44]. This demonstrates the benefit of exemplar-based features over standard speech features (i.e. fBMMI).

Table 15.20 PER on TIMIT core test set, posterior smoothing
Fig. 15.12 Error rates within 6 BPCs for various methods

15.8.1.3 Smoothing with Posterior Modeling

Finally, we explore smoothing the \(NN-S_{pif}\) posteriors through tied mixtures, as discussed in this section. Again, the mixed posteriors \(NN-S_{pif}-Post\) are used as output probabilities in an HMM system. Table 15.20 shows that with posterior modeling we obtain a small improvement of \(0.3\,\%\) absolute over the \(NN-S_{pif}\) posteriors, illustrating the value of reducing posterior sharpness through tied-mixture smoothing.

15.8.1.4 Error Analysis

Figure 15.12 shows the breakdown of error rates for the GMM/HMM, \(NN-S_{pif}\) and \(NN-S_{pif}-Post\) methods within six broad phonetic classes (BPCs), namely vowels/semivowels, nasals, strong fricatives, weak fricatives, stops and closures/silence. Here the error rate was calculated by counting the number of insertions, deletions and substitutions that occur for all phonemes within a particular BPC. The \(NN-S_{pif}\) method offers improvements over the GMM/HMM system in all classes except nasals and closures. Furthermore, we can see that the gains with the \(NN-S_{pif}-Post\) method come from better modeling of the vowel, weak fricative and closure classes.