3.1 Introduction

Voice over Internet Protocol (VoIP) is a pioneering technology in modern communication that delivers voice calls over packet-switched networks such as broadband Internet connections. It uses the existing data communication infrastructure to carry voice packets. The current challenges faced by VoIP include relatively high bandwidth requirements, traffic congestion leading to propagation delay, network delay variation, and excessive delay [1]. To mitigate these limitations, an efficient speech compression technique is desired: the speech signal must be compressed to conserve bandwidth, which is a scarce resource.

In recent years, wavelet transforms and their applications have been studied extensively in the field of signal processing. The wavelet transform provides excellent resolution in both the frequency and the time domain [2], and it represents the signal with very high precision and limited storage requirements [3]. A wavelet is defined as a waveform of limited duration with zero average value. The multi-resolution capability of the wavelet provides dilated and translated versions of the wavelet [4]. The resolution of the analysis is determined by the scaling function, and the analysis is performed by the mother wavelet function Ψ(k) [3]. The wavelet transform is calculated by convolving the original signal s(k) with the mother wavelet function Ψ(k), as follows [5]:

$$ \begin{aligned} {\text{W}}_{\Psi } (m,n) & = \int\limits_{ - \infty }^{\infty } {s(k)} \,\Psi ^{\prime }_{m,n} \left( k \right){\text{d}}k \\ & = \frac{1}{\sqrt{m}}\int\limits_{ - \infty }^{\infty } {s(k)\,\Psi \left( {\frac{k - n}{m}} \right)} {\text{d}}k \\ \end{aligned} $$
(3.1)

where s(k) is the original signal, ‘m’ is the scaling factor, ‘n’ is the translation parameter, and Ψ(k) is the mother wavelet. The wavelet function is given by

$$ \Psi _{m,n} \left( k \right) = \frac{1}{\sqrt{m}}\Psi \left( {\frac{k - n}{m}} \right) $$
(3.2)
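The integral in (3.1) can be approximated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the Haar function as the mother wavelet and a midpoint Riemann sum over a finite interval, and the names `haar_mother`, `cwt_coefficient`, and `unit_step` are illustrative only.

```python
import math


def haar_mother(x):
    """Assumed mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0


def cwt_coefficient(s, m, n, k_min, k_max, steps=10000):
    """Approximate W(m, n) = (1/sqrt(m)) * integral of s(k) * psi((k - n)/m) dk.

    The integral is replaced by a midpoint Riemann sum on [k_min, k_max].
    """
    dk = (k_max - k_min) / steps
    total = 0.0
    for i in range(steps):
        k = k_min + (i + 0.5) * dk  # midpoint of the i-th sub-interval
        total += s(k) * haar_mother((k - n) / m)
    return total * dk / math.sqrt(m)


def unit_step(k):
    """Illustrative test signal: jumps from 0 to 1 at k = 1."""
    return 1.0 if k >= 1.0 else 0.0


# Scale m = 2 and translation n = 0 place the wavelet on [0, 2),
# so the jump at k = 1 sits exactly where the wavelet changes sign.
w_step = cwt_coefficient(unit_step, m=2.0, n=0.0, k_min=0.0, k_max=2.0)
w_flat = cwt_coefficient(lambda k: 1.0, m=2.0, n=0.0, k_min=0.0, k_max=2.0)
```

Because the wavelet has zero average, the constant signal yields a coefficient of zero, whereas the step produces a nonzero response at the scale and translation that straddle the discontinuity.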

The discrete version of the continuous wavelet transform (CWT) uses a dyadic grid with translation n = p and scale \( m = 2^{j} \); the corresponding wavelet is defined by

$$ \Psi_{j,p} (x) = 2^{j / 2}\,\Psi ( 2^{j} x - p ) $$
(3.3)

Similarly, the scaling function is defined as follows:

$$ \phi_{j,p} (x) = 2^{j / 2}\,\phi ( 2^{j} x - p ) $$
(3.4)

The original function f(x) can be reconstructed from the scaling and wavelet functions as follows [6]:

$$ f (x ) = \sum\limits_{p = - \infty }^{\infty } {c_{p} \phi_{p} (x ) } + \sum\limits_{p = - \infty }^{\infty } {d_{j,p}\Psi _{j,p} (x ) } $$
(3.5)

where \( c_{p} \) are the average coefficients and \( d_{j,p} \) are the detail coefficients.
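As a small worked illustration of this expansion, assume a single level (j = 0, p = 0) and the Haar pair, for which the scaling function equals 1 on [0, 1) and the wavelet equals +1 on [0, 1/2) and −1 on [1/2, 1). The function values 4 and 2 below are chosen purely for illustration. For f(x) = 4 on [0, 1/2) and f(x) = 2 on [1/2, 1):

$$ \begin{aligned} c_{0} & = \int_{0}^{1} f(x)\,\phi (x)\,{\text{d}}x = 4 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{2} = 3 \\ d_{0,0} & = \int_{0}^{1} f(x)\,\Psi (x)\,{\text{d}}x = 4 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{2} = 1 \\ f(x) & = 3\,\phi (x) + 1\,\Psi (x) = \left\{ {\begin{array}{*{20}l} {3 + 1 = 4,} & {0 \le x < \tfrac{1}{2}} \\ {3 - 1 = 2,} & {\tfrac{1}{2} \le x < 1} \\ \end{array} } \right. \end{aligned} $$

The average coefficient carries the mean level of the function, and the detail coefficient carries the deviation from that mean on each half of the interval.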

3.2 Various Wavelet Families

In this paper, we evaluate the following wavelet families: Haar, Daubechies, the discrete approximation of the Meyer wavelet (dmey), and Coiflets. Each of these families is described below.

(a) Haar Wavelet (Haar)

The Haar wavelet is the simplest possible wavelet [7]. For a signal represented by \( 2^{t} \) values, the transform recursively computes pairwise differences and forwards the pairwise sums to the next level, resulting in \( 2^{t} - 1 \) differences and one total sum. The Haar wavelet is not continuous. The wavelet function Ψ(x) is defined as follows

$$ \Psi \left( x \right) = \left\{ {\begin{array}{*{20}l} 1 & {0 \le x < \frac{1}{2}} \\ { - 1} & {\frac{1}{2} \le x < 1} \\ 0 & {{\text{Otherwise}}} \\ \end{array} } \right. $$
(3.6)

The scaling function is defined as follows

$$ \Phi \left( x \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {0 \le x < 1} \hfill \\ {0,} \hfill & {\text{Otherwise}} \hfill \\ \end{array} } \right. $$
(3.7)
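The recursive sum-and-difference procedure described for the Haar transform can be sketched as follows. This is an illustrative, unnormalized version (no division by 2 or by the square root of 2), and the function name `haar_sum_difference` and the sample values are assumptions made for the example.

```python
def haar_sum_difference(values):
    """Split a length-2^t sequence into per-level pairwise differences and one total.

    At each level, adjacent pairs are replaced by their sum, which is forwarded
    to the next level, and their difference, which is stored.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    differences = []
    current = list(values)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        sums = [current[i] + current[i + 1] for i in pairs]
        diffs = [current[i] - current[i + 1] for i in pairs]
        differences.append(diffs)  # finest level first
        current = sums
    return current[0], differences


# Eight values (t = 3) give 4 + 2 + 1 = 7 differences and one total.
total, differences = haar_sum_difference([4, 2, 5, 7, 1, 1, 3, 9])
difference_count = sum(len(level) for level in differences)
```

For the eight-sample input, the procedure stores seven differences across three levels, and the single forwarded value is the sum of all samples.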

(b) Daubechies Wavelet

Daubechies wavelets are orthogonal wavelets that have the largest number of vanishing moments for a given support width and are commonly used for signal analysis. Their scaling and wavelet functions have no closed-form expression [8]. The index N denotes the number of filter coefficients, and the number of vanishing moments is N/2 [3].
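The relation between N and the number of vanishing moments can be checked numerically on the filter coefficients. The sketch below assumes the well-known four-tap Daubechies scaling filter (N = 4, so N/2 = 2 vanishing moments) and derives the wavelet filter by the usual alternating-sign reversal; the helper name `discrete_moment` is illustrative.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# Four-tap Daubechies scaling (low-pass) filter, N = 4.
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]

# Wavelet (high-pass) filter: reverse H and alternate the signs.
G = [(-1) ** k * H[len(H) - 1 - k] for k in range(len(H))]


def discrete_moment(filt, order):
    """Return sum_k k**order * filt[k], the discrete moment of a filter."""
    return sum((k ** order) * c for k, c in enumerate(filt))


moment0 = discrete_moment(G, 0)  # vanishes
moment1 = discrete_moment(G, 1)  # vanishes
moment2 = discrete_moment(G, 2)  # does not vanish
energy = sum(c * c for c in H)   # unit energy for an orthonormal filter
```

The zeroth and first moments of the wavelet filter vanish while the second does not, which matches N/2 = 2 vanishing moments for N = 4.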

(c) Discrete Approximation of Meyer Wavelet (dmey)

The discrete format of the Meyer wavelet function is defined as follows

$$ G_{\text{o}} \left( {{\text{e}}^{j\omega } } \right) = \sqrt 2 \sum\limits_{k} {\Phi (2\omega + 4k\pi )} $$
(3.8)

Given the basis function ‘Φ’, DTFT techniques are employed to obtain the scale coefficients [9].

(d) Coiflet Wavelet

Coiflets are wavelets whose scaling functions also have vanishing moments. The wavelet is nearly symmetric and has N/3 vanishing moments, and the scaling function has N/3 − 1 vanishing moments [3]. If the number of taps is N = 6p, then 2p vanishing-moment conditions are imposed on the wavelet function, 2p − 1 on the scaling function, and the remaining conditions enforce normalization and orthogonality.

Thus, the conditions imposed are as follows [10]:

$$ \int \phi \left( k \right){\text{d}}k = 1 $$
(3.9)
$$ \int \phi \left( k \right)\phi (k - l){\text{d}}k = \delta_{0,l} $$
(3.10)
$$ \int k^{n}\,\Psi (k){\text{d}}k = 0\quad {\text{for}}\;n = 0, 1, 2, \ldots, 2p - 1 $$
(3.11)
$$ \int k^{n}\,\phi (k){\text{d}}k = 0\quad {\text{for}}\;n = 1, 2, \ldots, 2p - 1 $$
(3.12)

3.3 Speech Signal Processing Using Wavelet Transform

Speech compression by wavelet transform begins with the choice of a particular wavelet function, which is governed by the speech quality requirements of the codec. The objective of the processing is to maximize the signal quality and minimize the variance of the reconstruction error [11]. Wavelets decompose a signal into components in different frequency bands, called resolutions. Compression is achieved by reconstructing the signal from a limited set of approximation coefficients and some detail coefficients. This is done by thresholding, wherein coefficients falling below a threshold value are set to zero [11]. The signal is then reconstructed by performing the inverse wavelet transform on the coefficients that are above the threshold. Generally, 5-level decomposition is adequate for speech signals [12]. Figure 3.1 [13] shows the process of speech compression using the wavelet transform.
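The decompose, threshold, and reconstruct steps can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the codec evaluated in this chapter: it uses an orthonormal Haar transform, hard thresholding of the detail coefficients only, and a synthetic sine wave in place of speech. All function names and the threshold value are hypothetical.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)


def haar_decompose(signal, levels):
    """Return [approx, detail_L, ..., detail_1] for an orthonormal Haar DWT."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("length must be divisible by 2**levels")
        pairs = range(0, len(approx), 2)
        a = [(approx[i] + approx[i + 1]) * INV_SQRT2 for i in pairs]
        d = [(approx[i] - approx[i + 1]) * INV_SQRT2 for i in pairs]
        details.insert(0, d)  # coarsest detail band ends up first
        approx = a
    return [approx] + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose, working from the coarsest band to the finest."""
    approx = list(coeffs[0])
    for detail in coeffs[1:]:
        finer = []
        for a, d in zip(approx, detail):
            finer.append((a + d) * INV_SQRT2)
            finer.append((a - d) * INV_SQRT2)
        approx = finer
    return approx


def hard_threshold(coeffs, threshold):
    """Zero detail coefficients with magnitude below threshold (assumed rule)."""
    kept = [list(coeffs[0])]  # approximation band is kept unchanged
    for band in coeffs[1:]:
        kept.append([c if abs(c) >= threshold else 0.0 for c in band])
    return kept


def compress(signal, levels=5, threshold=0.1):
    """Return the reconstructed signal and the number of nonzero coefficients."""
    kept = hard_threshold(haar_decompose(signal, levels), threshold)
    nonzero = sum(1 for band in kept for c in band if c != 0.0)
    return haar_reconstruct(kept), nonzero


samples = [math.sin(2.0 * math.pi * i / 64.0) for i in range(64)]
lossless, _ = compress(samples, levels=5, threshold=0.0)
lossy, nonzero_count = compress(samples, levels=5, threshold=0.1)
```

With a threshold of zero the round trip reproduces the input up to rounding error; raising the threshold discards small detail coefficients, which is where the compression comes from.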

Fig. 3.1 Speech compression using DWT

3.4 Performance Evaluation Parameters

The speech codecs based on the wavelet families defined above are implemented in MATLAB for simulation. The acceptability of the wavelet-based speech codec for VoIP applications is gauged by subjective testing using the mean opinion score (MOS), wherein the original signal and the reconstructed signal are presented to a listener, who then provides a rating between 1 and 5, where 5 is the excellent grade [14]. Further, the codec is evaluated by objective testing of the speech samples. The tests compare performance in terms of compression ratio (CR), SNR, NRMSE [12, 13], and retained signal energy (RSE). The expressions for these parameters are given below.

$$ {\text{CR}} = \frac{{{\text{Length}}\,{\text{of}}\,\left( {o\left( k \right)} \right)}}{{{\text{Length}}\,{\text{of}}\,\left( {p\left( k \right)} \right)}} $$
(3.13)

where o(k) is the input signal and p(k) is the reconstructed signal.

$$ {\text{SNR}} = 10\log_{10} \left( {\frac{{\sigma_{x}^{2} }}{{\sigma_{e}^{2} }}} \right) $$
(3.14)

where \( \sigma_{x}^{2} \) is the mean square of the input signal and \( \sigma_{e}^{2} \) is the mean square difference between the input and reconstructed signals.
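A direct computation of the SNR in (3.14) might look as follows. This is a sketch assuming both signals are equal-length sequences of samples; the function name `snr_db` and the sample values are illustrative.

```python
import math


def snr_db(original, reconstructed):
    """SNR in dB: 10*log10(mean square of input / mean square of error)."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(original)
    signal_power = sum(x * x for x in original) / n
    error_power = sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / n
    if error_power == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_power / error_power)


# An error amplitude of one tenth of the signal amplitude corresponds to 20 dB.
example_snr = snr_db([1.0, -1.0, 1.0, -1.0], [0.9, -0.9, 0.9, -0.9])
```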

Normalized root mean square error (NRMSE) is given by

$$ {\text{NRMSE}} = \sqrt {\frac{{\sum\nolimits_{n} {(o\left( n \right) - p\left( n \right))^{2} } }}{{\sum\nolimits_{n} {(o\left( n \right) - \mu_{o} )^{2} } }}} $$
(3.15)

where o(n) is the original input signal, p(n) is the reconstructed signal, and \( \mu_{o} \) is the mean of the original signal.
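The NRMSE can be computed as sketched below, assuming that both the numerator and the denominator are summed over all samples n. The function name `nrmse` and the sample values are illustrative.

```python
import math


def nrmse(original, reconstructed):
    """Root of (sum of squared errors) / (sum of squared deviations from the mean)."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("signals must be non-empty and of equal length")
    mean = sum(original) / len(original)
    error_energy = sum((o - p) ** 2 for o, p in zip(original, reconstructed))
    spread_energy = sum((o - mean) ** 2 for o in original)
    if spread_energy == 0.0:
        raise ValueError("original signal is constant; NRMSE is undefined")
    return math.sqrt(error_energy / spread_energy)


original = [1.0, 2.0, 3.0, 4.0]
perfect = nrmse(original, original)               # no error at all
mean_only = nrmse(original, [2.5, 2.5, 2.5, 2.5])  # reconstruction is just the mean
```

A value of 0 indicates perfect reconstruction, while a value of 1 means the reconstruction is no better than replacing every sample with the mean of the original signal.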

Retained signal energy (RSE) [15] is defined as follows

$$ {\text{RSE}}\,\left( \% \right) = \frac{{\left\| {p(n)} \right\|^{2} }}{{\left\| {o\left( n \right)} \right\|^{ 2} }} \times 100 $$
(3.16)

where ||o(n)|| is the norm of the original signal and ||p(n)|| is the norm of the reconstructed signal.
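A sketch of the retained-energy computation is given below. It assumes the convention that the percentage expresses the energy of the reconstructed signal relative to the energy of the original signal, with the squared norm taken as the sum of squared samples; the function name `retained_energy_percent` and the sample values are illustrative.

```python
def retained_energy_percent(original, reconstructed):
    """Energy of the reconstruction as a percentage of the original energy."""
    original_energy = sum(x * x for x in original)
    reconstructed_energy = sum(x * x for x in reconstructed)
    if original_energy == 0.0:
        raise ValueError("original signal has zero energy")
    return 100.0 * reconstructed_energy / original_energy


# Dropping the second sample of [3, 4] keeps 9 of the 25 units of energy.
example_rse = retained_energy_percent([3.0, 4.0], [3.0, 0.0])
```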

3.5 Results

A discrete wavelet transform-based codec is simulated in MATLAB, following the wavelet speech compression principle described above. The test sentences presented in Table 3.1 are processed with each of the four wavelet families, viz. Haar, Daubechies, dmey, and Coiflet.

Table 3.1 Details of sample sentences used in the experiment

The speech signal is decomposed into approximation and detail coefficients over 5 levels. A global threshold value is applied to the resulting coefficients. The quality of the reconstructed signal is measured in terms of MOS, SNR, RSE, and compression ratio.

The results are shown in the following figures. Figure 3.2 shows the comparison of the wavelets in terms of the MOS, Fig. 3.3 compares the wavelets in terms of the compression ratio, Fig. 3.4 compares the wavelets in terms of SNR, and Fig. 3.5 shows the comparison in terms of the RSE %.

Fig. 3.2 Comparison of MOS of wavelets

Fig. 3.3 Comparison of compression ratio of wavelets

Fig. 3.4 Comparison of SNR of wavelets

Fig. 3.5 Comparison of retained energy

The figures show that the dmey wavelet provides the best results in terms of % energy retention and compression ratio, followed by the Daubechies family. Further, wavelets of the Daubechies family perform better in terms of the MOS and SNR of the signal.

3.6 Conclusions

Wavelet-based speech coding, in general, offers a good degree of compression of the speech signal, and the degree of compression can be varied easily. The Haar wavelet transform is the most straightforward and fastest transform for speech compression; however, due to its discontinuity, it is not well suited to speech signals. The Daubechies wavelet has shown its superiority over the other wavelet families for speech compression in terms of parameters such as % compression and SNR value, and hence it is used extensively in various speech processing applications. Further, it can be inferred from the above results that the average MOS of the wavelet-based speech codec is in the range of 3.9–4.5, which is near toll quality; hence, these codecs compare well with the speech codecs currently deployed in VoIP applications. The results further reveal that the performance of the codec under study remains unaffected by a change in language or speaker.