
1 Introduction

An ASR system finds wide application in real-time environments with high ambient noise levels. The performance of an ASR system is acceptable in clean environments; however, it degrades in the presence of noise. There is therefore a strong need to address noise robustness [13].

Much of the research in speech recognition aims first at robust feature extraction and second at building a robust classifier [4]. The popular MFCC features, based on the short-time Fourier transform (STFT) and power spectrum estimation, do not give a good representation of noisy speech. Features based on the STFT have uniform resolution over the time–frequency plane, which makes it difficult to detect sudden bursts in a slowly varying signal or the highly non-stationary parts of the speech signal. The more recent wavelet packet approach, which segments the frequency axis and is uniformly translated in time, has been proposed as an alternative. Wavelet coefficients provide flexible and efficient manipulation of a speech signal localized in the time–frequency plane and are an alternative to MFCCs [5–7]. The perceptual wavelet filter bank is built to approximate the critical band responses of the human ear. Wavelet packets decompose the data evenly into all bins, whereas perceptual wavelet packets (PWPs) decompose only the critical bins [8].

In this paper, we propose a wavelet domain front end for an ASR that combines speech enhancement and feature extraction. The proposed method combines time-adapted wavelet domain hybrid speech enhancement using the Teager energy operator (TEO) with dynamic perceptual wavelet packet (PWP) features applied to a hidden Markov model (HMM)–based classifier.

The rest of the paper is organized as follows. Section 2 introduces a block diagram of the proposed wavelet domain front end of the ASR and provides a detailed description of each constituent part. Section 3 describes the recognizer and the toolkit used. Section 4 evaluates the performance of the proposed system under different noise levels. The conclusion is presented in Sect. 5.

2 Proposed Wavelet Domain Front End of ASR

Figure 1 shows the proposed noise robust wavelet domain front end of an ASR. The noisy speech input wave files are sampled at 16 kHz and segmented into overlapping frames of 24 ms duration with a frame shift of 10 ms.
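For reference, a minimal framing sketch in NumPy is given below; the function name and interface are illustrative and not taken from the paper.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=24, shift_ms=10):
    """Split a speech signal into overlapping frames (24 ms window, 10 ms shift)."""
    frame_len = int(fs * frame_ms / 1000)      # 384 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)          # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift:i * shift + frame_len] for i in range(n_frames)])
```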

Fig. 1 Block diagram of proposed wavelet domain front end of ASR

2.1 Speech Enhancement

In the proposed method, the wavelet packet transform (WPT) is applied to each input frame. The resulting coefficients are then subjected to Teager energy approximation [9, 10], where the threshold is adapted with respect to the voiced/unvoiced segments of the speech data. A hybrid thresholding process is adopted as a compromise between conventional hard and soft thresholding, preserving edges while reducing noise.

2.1.1 Wavelet Packet Analysis

For a j-level WP transform, the noisy speech signal \( y(n) \) with frame length N is decomposed into \( 2^{j} \) sub-bands. The mth WP coefficient of the kth sub-band is expressed as follows:

$$ W_{k,m}^{j} = {\text{WPT}}\left\{ y(n),j \right\} $$
(1)

where \( n = 1, \ldots, N \), \( m = 1, \ldots, N/2^{j} \) and \( k = 1, \ldots, 2^{j} \).
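The paper does not tie the decomposition to a particular toolkit; as an illustration, a j-level wavelet packet decomposition of one frame can be obtained with the PyWavelets package (the db4 mother wavelet and 3-level default here are assumptions):

```python
import numpy as np
import pywt

def wpt_subbands(frame, wavelet="db4", level=3):
    """j-level wavelet packet decomposition of a noisy frame.

    Returns an array of shape (2**level, N / 2**level): one row of
    coefficients W_{k,m} per sub-band k, ordered by frequency.
    """
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="periodization", maxlevel=level)
    nodes = wp.get_level(level, order="freq")     # the 2**level terminal nodes
    return np.array([node.data for node in nodes])
```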

2.1.2 Teager Energy Operator on Wavelet Coefficients

  • The Teager energy approximation for each WPT sub-band is computed:

    $$ {\text{TEO}}_{i,k} = Y_{i,k}^{2} - Y_{i,k - 1} Y_{i,k + 1} $$
    (2)
  • The TEO coefficients are smoothed in order to reduce the sensitivity to noise:

    $$ M_{i,k} = {\text{TEO}}_{i,k} *H_{p} $$
    (3)
  • The TEO coefficients are normalized:

    $$ M^{\prime}_{i,k} = \left[ \frac{M_{i,k}}{\max \left( M_{i,k} \right)} \right] $$
    (4)
  • The time-scale adaptive threshold based on BayesShrink is computed for each sub-band k (see the sketch below):

    $$ \lambda_{i,k} = \lambda_{i} \left( 1 - M^{\prime}_{i,k} \right) $$
    (5)
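A minimal sketch of steps (2)–(5) follows. The moving-average smoothing filter H_p and the BayesShrink-style per-band threshold λ_i are assumptions; the paper does not fix either in detail.

```python
import numpy as np

def teager_energy(w):
    """Eq. (2): TEO_k = w_k**2 - w_{k-1} * w_{k+1} for one sub-band."""
    e = np.empty_like(w)
    e[1:-1] = w[1:-1] ** 2 - w[:-2] * w[2:]
    e[0], e[-1] = e[1], e[-2]                      # replicate the end points
    return np.abs(e)

def adaptive_thresholds(subbands, smooth_len=5):
    """Per-coefficient thresholds lambda_{i,k} = lambda_i * (1 - M'_{i,k})."""
    h_p = np.ones(smooth_len) / smooth_len          # assumed smoothing filter H_p
    thresholds = []
    for w in subbands:
        m = np.convolve(teager_energy(w), h_p, mode="same")   # Eq. (3)
        m_norm = m / (np.max(m) + 1e-12)                      # Eq. (4)
        sigma = np.median(np.abs(w)) / 0.6745                 # noise estimate (assumption)
        sigma_x = np.sqrt(max(np.var(w) - sigma ** 2, 1e-12))
        lam_i = sigma ** 2 / sigma_x                          # BayesShrink threshold
        thresholds.append(lam_i * (1.0 - m_norm))             # Eq. (5)
    return np.array(thresholds)
```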

2.1.3 Denoising by Thresholding

Denoising of the wavelet packet coefficients is performed by thresholding; that is, coefficients that fall below a specific value are shrunk toward zero and the remaining coefficients are retained. Different thresholding techniques have been proposed; the two most popular thresholding functions used in speech enhancement systems are the hard and the soft thresholding functions [11–13].

Hard thresholding is given by

$$ T_{s} (\lambda ,w_{k}) = \begin{cases} w_{k} & {\text{if}}\;\left| w_{k} \right| > \lambda \\ 0 & {\text{if}}\;\left| w_{k} \right| \le \lambda \end{cases} $$
(6)

Soft thresholding is given by

$$ T_{s} (\lambda ,w_{k}) = \begin{cases} \text{sgn} (w_{k}) \left( \left| w_{k} \right| - \lambda \right) & {\text{if}}\;\left| w_{k} \right| > \lambda \\ 0 & {\text{if}}\;\left| w_{k} \right| \le \lambda \end{cases} $$
(7)

where \( w_{k} \) represents the wavelet coefficients and \( \lambda \) the threshold value.

Hard thresholding is best at preserving edges but worst at denoising, while soft thresholding is best at reducing noise but worst at preserving edges. In order to both reduce noise and preserve edges, a hybrid thresholding is used.

Hybrid thresholding is given as follows:

$$ T_{s} (\lambda ,w_{k}) = \begin{cases} w_{k} \cdot \dfrac{\left| w_{k} \right|^{\alpha} - \lambda^{\alpha}}{\left| w_{k} \right|^{\alpha}} & {\text{if}}\;\left| w_{k} \right| > \lambda \\ 0 & {\text{if}}\;\left| w_{k} \right| \le \lambda \end{cases} $$
(8)

With careful tuning of the parameter α for a particular signal, the best denoising effect can be achieved within this thresholding framework.
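The three rules (6)–(8) can be written compactly as follows (a sketch; the default α = 2 is an assumption, and λ may be a scalar or the per-coefficient threshold of Eq. (5)):

```python
import numpy as np

def hard_threshold(w, lam):
    """Eq. (6): keep coefficients above the threshold, zero the rest."""
    return np.where(np.abs(w) > lam, w, 0.0)

def soft_threshold(w, lam):
    """Eq. (7): shrink the surviving coefficients toward zero by lambda."""
    return np.where(np.abs(w) > lam, np.sign(w) * (np.abs(w) - lam), 0.0)

def hybrid_threshold(w, lam, alpha=2.0):
    """Eq. (8): close to hard thresholding for large |w|, close to soft
    thresholding just above the threshold; alpha controls the trade-off."""
    mag = np.abs(w)
    scale = (mag ** alpha - lam ** alpha) / np.maximum(mag ** alpha, 1e-12)
    return np.where(mag > lam, w * scale, 0.0)
```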

The enhanced speech is then reconstructed using the inverse WP transform

$$ x^{\prime}(n) = {\text{WPT}}^{-1} \left\{ W_{k}^{\prime\, j} ,j \right\} $$
(9)

2.2 Dynamic Perceptual Wavelet Packet Feature Extraction

Wavelet coefficients provide flexible and efficient manipulation of a speech signal localized in the time–frequency plane [5–7]. The perceptual wavelet filter bank is built to approximate the critical band responses of the human ear. Wavelet packets decompose the data evenly into all bins, but PWPs decompose only the critical bins [5, 8, 14]. The size of the decomposition tree is directly related to the number of critical bins. The decomposition is implemented by an efficient 7-level tree structure, depicted in Fig. 2. The PWP transform is used to decompose the speech frame \( nx(n) \) into several frequency bands that approximate the critical bands. The terminal nodes of the tree represent a non-uniform filter bank.

Fig. 2 Tree structure of PWPT

The PWP coefficients for the sub-bands are generated as follows:

$$ w_{j,i}(k) = {\text{PWPT}}\left( nx(n) \right) $$
(10)

where

  • n = 1, 2, 3, …, L (L is the frame length),

  • j = 0, 1, 2, …, 7 (j is the number of levels),

  • i = 1, 2, 3, …, \( 2^{j}-1 \) (i is the sub-band index in each level j).
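As an illustration of this idea (not the exact tree of the paper), the sketch below builds a wavelet packet tree with PyWavelets and keeps only a chosen subset of terminal nodes as the non-uniform filter bank. The node list and the db4 wavelet are placeholders; the actual 24-node, 7-level tree with a Daubechies 45 prototype follows Fig. 2.

```python
import numpy as np
import pywt

# Placeholder terminal-node paths ('a' = low-pass branch, 'd' = high-pass branch).
# The actual critical-band nodes of the 7-level tree are given in Fig. 2.
PWP_NODES = ["aaaa", "aaad", "aad", "ad", "da", "dd"]

def pwp_features(frame, wavelet="db4", nodes=PWP_NODES):
    """Log energy of each perceptual wavelet packet sub-band (one value per node)."""
    max_depth = max(len(path) for path in nodes)
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="periodization", maxlevel=max_depth)
    return np.array([np.log(np.sum(wp[path].data ** 2) + 1e-12) for path in nodes])
```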

The static PWP coefficients are made more robust by computing the delta and the acceleration coefficients.
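The delta and acceleration (delta–delta) coefficients can be obtained with the standard regression formula, as in the sketch below (the regression window width of 2 frames is an assumption):

```python
import numpy as np

def deltas(feats, width=2):
    """Regression deltas over a (frames x dims) feature matrix."""
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, width + 1))
    d = np.zeros_like(feats)
    for t in range(1, width + 1):
        d += t * (pad[width + t: width + t + len(feats)] -
                  pad[width - t: width - t + len(feats)])
    return d / denom

def add_dynamics(static):
    """Stack static, delta and delta-delta features (e.g. 13 -> 39 dimensions)."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```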

3 Speech Recognition

HMMs are the most successful statistical models for the classification of speech. An HMM is a stochastic approach that models the given problem as a 'doubly stochastic process' [15–17].

An N-state HMM is defined by the parameter set \( \lambda = \left\{ \pi_{i} ,a_{ij} ,b_{i} (x),\; i,j = 1, \ldots, N \right\} \), where

  • \( \pi_{i} \): initial state probability for state i,

  • \( a_{ij} \): transition probability from state i to state j,

  • \( b_{i}(x) \): state observation probability density function (pdf), usually modeled by a mixture of Gaussian densities.

A five-state continuous density HMM is used as the statistical model for classification of speech signals. The hidden Markov model toolkit (HTK) is a portable toolkit for building and manipulating HMMs optimized for speech recognition [18].
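HTK is the toolkit used in this work; purely as an illustration of the classification step, the sketch below uses the hmmlearn package to train one five-state HMM with Gaussian-mixture state pdfs per word and picks the model with the highest log-likelihood (the number of mixtures and the ergodic topology that hmmlearn uses by default are assumptions, not the paper's HTK setup):

```python
import numpy as np
from hmmlearn import hmm

def train_word_models(train_data, n_states=5, n_mix=2):
    """train_data maps a word label to a list of (frames x dims) feature arrays."""
    models = {}
    for word, utterances in train_data.items():
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognize(models, utterance):
    """Return the word whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda word: models[word].score(utterance))
```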

4 Experimental Setup and Results

To evaluate the performance of the proposed method, recognition experiments were carried out using the TIMIT database. The data were digitized at a sampling rate of 16 kHz with 16-bit quantization. The data set consists of 200 training sentences from six male and female speakers, with test sentences randomly picked from the training data. To simulate various noisy conditions, the test sentences were corrupted by additive white Gaussian noise at SNRs from 40 to 0 dB. The baseline recognition system was implemented with the HTK toolkit using continuous density HMM models.

In the proposed feature extraction method, the perceptual wavelet features are extracted using MATLAB and written into HTK format using the htkwrite function from the voicebox MATLAB package. The 24 perceptual wavelet filter banks were constructed by trial and error; the proposed tree gives excellent results with the Daubechies 45 prototype. Using a Hamming-windowed analysis frame length of 20 ms, the resulting 13-dimensional features plus their delta and delta–delta features, i.e., 39-dimensional features in total, were used for speech recognition.
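The paper uses the voicebox htkwrite routine in MATLAB. For readers working in Python, a rough equivalent that writes a feature matrix in the standard HTK binary layout (12-byte big-endian header followed by float32 data, parameter kind USER = 9) could look like the sketch below; the function name is illustrative.

```python
import struct
import numpy as np

def write_htk(path, feats, frame_shift_s=0.01, parm_kind=9):
    """Write a (frames x dims) float matrix as an HTK feature file."""
    feats = np.asarray(feats, dtype=">f4")            # HTK stores big-endian float32
    n_samples, n_dims = feats.shape
    samp_period = int(round(frame_shift_s * 1e7))     # frame shift in 100 ns units
    samp_size = n_dims * 4                            # bytes per feature vector
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_samples, samp_period, samp_size, parm_kind))
        f.write(feats.tobytes())
```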

Table 1 shows that, compared to standard wavelet thresholding methods, the time-adapted wavelet-based hybrid thresholding using the TEO proposed in this paper achieves better enhancement for SNR levels ranging from −10 to +10 dB. The proposed wavelet domain front-end features with an HMM classifier–based ASR achieve better recognition than the conventional ASR with MFCC features and HMM, as shown in Table 2. The ability of the PWPs to decompose the critical bins during feature extraction helps provide robust features for the ASR.

Table 1 Input/output SNR values obtained with different thresholding methods
Table 2 Comparison of recognition accuracy for different features at various noise levels

5 Conclusions

A wavelet domain front end for an ASR that combines enhancement and feature extraction under different noise levels has been presented. The proposed time-adapted wavelet-based hybrid thresholding using the TEO outperforms conventional wavelet-based denoising schemes, as shown in Table 1. The dynamic PWP features, which decompose only the critical sub-bands, applied to an HMM-based classifier together with the proposed enhancement algorithm, give better recognition than the conventional MFCC features with an HMM, as shown in Table 2. The ASR can be further improved with a hybrid classifier and extended to a larger database.