1 Introduction

The conjugate-structure algebraic-code-excited linear prediction (CS-ACELP) speech coder belongs to the class of hybrid coders, also called analysis-by-synthesis coders (Wu and Yang 2006). Hybrid coders offer an attractive trade-off between waveform coders and vocoders, delivering satisfactory speech quality at a moderate transmission bit rate. Source coding techniques based on CELP (code-excited linear prediction), ACELP (algebraic CELP) and CS-ACELP (conjugate-structure ACELP) remain an active area of research worldwide.

The speech encoder takes 16-bit linear PCM as its input, either from the audio part of the mobile station or from the network side. The 64 kbit/s data should be converted to 16-bit linear PCM before encoding, and the 16-bit linear PCM should be converted back to the appropriate format after reconstruction. CS-ACELP describes the detailed mapping from input blocks in 16-bit linear PCM format, made up of 120 past, 80 present and 40 future speech samples (240 samples in total), to encoded blocks of 80 bits, and from these to output blocks of speech samples (ITU-T Recommendation 2007). The input is sampled at 16,000 samples/s, processed for coding at 8000 samples/s, and interpolated back to 16,000 samples/s after reconstruction. Because each 10 ms frame is encoded into 80 bits, the average bit rate is 8 kbps.

This paper is organized as follows. Section 2 introduces the excitation sequence of the CS-ACELP based speech codec. Section 3 reviews the excitation codebook structure of the legacy 8 kbps and the extended 11.8 kbps CS-ACELP based speech codecs. Section 4 describes the proposed modification of the search engine used to determine the optimized codevector. Section 5 presents the subjective and objective quality assessment parameters. Section 6 gives a comparative performance evaluation of the CS-ACELP speech coders using graphs and tables. Concluding remarks are given in Sect. 7.

2 CS-ACELP speech codec excitation sequence generation

The detailed flow of the excitation sequence of the CS-ACELP based speech codec is shown in Fig. 1. The input speech frame of 80 samples is first pre-processed: signal scaling reduces overflows in the fixed-point implementation, and high-pass filtering removes undesired low-frequency components. LPC (linear predictive coding) analysis, a short-term analysis of the speech signal, is then performed over these 80 samples to determine ten LPC coefficients per frame. The per-frame LPC coefficients and the original segmented speech samples are passed through a weighting filter that accounts for the perception of the human auditory system. Noise located in lower-energy regions is generally much more annoying when the speech is reconstructed with the short-term synthesis filter. The error weighting filter shifts this noise from lower-energy regions to higher-energy regions so that it is masked when the speech signal is reconstructed. The weighted error output is subtracted from the output of the impulse response of the filter formed by combining the short-term synthesis filter and the error weighting filter. Initially, this impulse response is generated by feeding 40 zero input samples to the combined filter to flush everything out of the system. The pitch contribution is removed from the combined-filter output by subtracting from it the adaptive codebook contribution passed through the impulse response. To do this, the optimum open-loop pitch and its gain are first determined from the adaptive codebook, and the result is forwarded to the impulse response to generate the filtered adaptive codevector. The resulting long-term residual, which carries no pitch information, is then used in a further subtraction with the codevector from the stochastic codebook, combined with its sign, to remove the contribution of the excitation. The output of this subtraction, also called the short-term residual, is fed back as the weighted error input to model the excitation structure. The bit allocation for the legacy CS-ACELP coder working at 8 kbps is shown in Table 1.

Fig. 1 Detailed flow of excitation sequence of CS-ACELP based speech coder

Table 1 Bit allocation of the 8 kbit/s CS-ACELP algorithm (10 ms frame; ITU-T Recommendation 2007)

3 Excitation codebook structure of legacy 8 kbps and extended 11.8 kbps CS-ACELP based speech codec

The ITU-T G.729 CS-ACELP speech coder has 40 pulse positions arranged in a four-track excitation codebook structure, as depicted in Table 2. The first three tracks have 8 positions each and the final track has 16 positions. In the conventional 8 kbps fixed codebook search procedure, the final codevector is determined with 8192 searches in the full search approach (Salami et al. 1998), 1440 searches in the focused search approach (12), and 320 searches in the depth-first search approach (Adoul and Laflamme 1997) with exhaustive recursive searches.

Table 2 Fixed codebook excitation structure of legacy 8 kbps CS-ACELP based speech coder (ITU-T Recommendation 2007)

The number of searches is large because the final track has 16 pulse positions, which adds searches over the alternate eight positions of the final track each time the search procedure starts with a different track. Since each of the first three tracks has eight positions, 3 bits are needed to code a pulse position in these tracks, and 4 bits are needed for the final track with its 16 positions. In addition to the four pulse positions from the four tracks, the signs of the corresponding pulses must also be transmitted. In total, 13 bits are required to transmit the positions of the final best codevector and 4 bits to transmit the signs of the selected pulses per subframe.
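The bit counts above follow directly from the track sizes. The short sketch below reproduces that arithmetic; the names `TRACK_SIZES`, `position_bits` and `full_search_count` are illustrative only, and treating the full search count as the product of the track sizes is an assumption that happens to agree with the 8192 searches quoted for the full search approach.

```python
# Track sizes of the legacy 8 kbps fixed codebook, as stated in the text:
# three tracks with 8 positions and one track with 16 positions.
TRACK_SIZES = [8, 8, 8, 16]


def position_bits(track_sizes):
    """Bits needed to index one pulse position in each track."""
    return [(size - 1).bit_length() for size in track_sizes]


def full_search_count(track_sizes):
    """Number of position combinations with one pulse per track (assumption)."""
    count = 1
    for size in track_sizes:
        count *= size
    return count


bits = position_bits(TRACK_SIZES)
print(bits, sum(bits))                 # position bits per track and in total
print(len(TRACK_SIZES))                # one sign bit per pulse
print(full_search_count(TRACK_SIZES))  # combinations visited by a full search
```

With the track sizes above this gives 3 + 3 + 3 + 4 = 13 position bits, 4 sign bits and 8192 combinations.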

The excitation structure of G.729 (8 kbps) is replaced by the forward-mode excitation structure of the extended CS-ACELP 11.8 kbps speech codec, which has five tracks, each carrying two non-zero pulses of amplitude ±1. The excitation structure of the standard extended 11.8 kbps CS-ACELP speech coder is shown in Table 3.

Table 3 Fixed codebook excitation structure of standard 11.8 kbps CS-ACELP based Speech Coder in forward LP mode (ITU-T Recommendation 2007)

The ITU-T standardized extended G.729E (11.8 kbps) coder uses two fixed codebook excitation structures of different sizes: a forward-mode excitation codebook (35 bits) and a backward-mode excitation codebook (44 bits) (ITU-T Recommendation 2007). In the proposed analysis, the 35-bit fixed excitation codebook structure is used at both the transmitter and the receiver. The excitation codevector is determined over a subframe of 40 samples. The algebraic codebook search is performed with the least significant pulse replacement procedure (Bernard 2005). The initial fixed codevector is determined by choosing the pulse positions with the largest magnitude of the correlation vector d(n) (Eq. 2). The search starts by arranging the coefficients of d(n) according to the pulse positions of the conjugate excitation codebook structure shown in Table 3. In each track, the first of the two pulse positions is the position with the maximum value among the eight positions of that track, and the second is determined by finding the Euclidean distance from this first maximum. The next-stage codevector is obtained by replacing each pulse position of the initial codevector with the other pulse positions of its track, one at a time. The combination of pulse positions that maximizes Qk (Eq. 1) is declared the next-stage codevector. Twelve Qk values are computed per track, and the final best codevector is the one with the maximum Qk among the resulting 60 Qk values (Bernard 2005):

$$\max_{k} Q_{k} = \max_{k} \frac{C_{k}^{2}}{E_{k}} = \max_{k} \frac{(d^{t} c_{k})^{2}}{c_{k}^{t} \Phi c_{k}} = \max_{k} \frac{\left( \sum_{j=0}^{M-1} s_{j} d(m_{j}) \right)^{2}}{\sum_{j=0}^{M-1} \phi(m_{j}, m_{j}) + 2 \sum_{i=0}^{M-2} \sum_{j=i+1}^{M-1} s_{i} s_{j} \phi(m_{i}, m_{j})}$$
(1)

Here, M is the number of tracks in a subframe, and sj and mj denote the sign and position of the jth pulse.

Here ck denotes the kth codebook vector and t denotes transposition. The correlation vector d and the matrix Φ are defined as (ITU-T Recommendation 2007):

$$d(n) = \sum_{i=n}^{M-1} x_{2}(i)\, h(i-n), \quad n = 0, \ldots, M-1$$
(2)
$$\phi(i,j) = \sum_{n=j}^{M-1} h(n-i)\, h(n-j), \quad i = 0, \ldots, M-1, \quad j = i, \ldots, M-1$$
(3)

In Eqs. 2 and 3, M is the total number of pulse positions in a subframe, x2(n) is the target signal for the fixed codebook search, and h(n) is the impulse response of the linear predictive synthesis filter (ITU-T Recommendation 2007).
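As a minimal sketch of Eqs. 2 and 3, the correlation vector and the correlation matrix can be computed with plain nested sums. The function names `correlation_vector` and `phi_matrix` are hypothetical, and filling the lower triangle of the matrix by symmetry is an assumption made here for convenience; Eq. 3 itself only defines the entries with j ≥ i.

```python
def correlation_vector(x2, h):
    """d(n) = sum_{i=n}^{M-1} x2(i) * h(i - n), for n = 0..M-1 (Eq. 2)."""
    M = len(x2)
    return [sum(x2[i] * h[i - n] for i in range(n, M)) for n in range(M)]


def phi_matrix(h):
    """phi(i, j) = sum_{n=j}^{M-1} h(n - i) * h(n - j), for j >= i (Eq. 3)."""
    M = len(h)
    phi = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i, M):
            value = sum(h[n - i] * h[n - j] for n in range(j, M))
            phi[i][j] = value
            phi[j][i] = value  # assumption: mirror the upper triangle
    return phi


# Toy example with M = 3 instead of the 40-sample subframe.
x2 = [1.0, 2.0, 3.0]
h = [1.0, 0.5, 0.25]
print(correlation_vector(x2, h))
print(phi_matrix(h))
```

In the coder itself M would be 40, and x2 and h would come from the target signal and the weighted synthesis filter rather than from toy values.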

The numerator and denominator of Eq. 1 are given by (ITU-T Recommendation 2007):

$$C = \sum_{i=0}^{N_{p}-1} \mathrm{sign}\{d(m_{i})\}\, d(m_{i})$$
(4)
$$E = \sum_{i=0}^{N_{p}-1} \phi(m_{i}, m_{i}) + 2 \sum_{i=0}^{N_{p}-2} \sum_{j=i+1}^{N_{p}-1} \mathrm{sign}\{d(m_{i})\}\, \mathrm{sign}\{d(m_{j})\}\, \phi(m_{i}, m_{j})$$
(5)

Here, Np is the number of pulses in a subframe and mi denotes the position of the ith pulse.
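The following sketch evaluates the numerator C, the denominator E and the criterion Q = C²/E for one candidate set of pulse positions. It assumes that each pulse sign is taken from the sign of d at the pulse position, that a zero correlation is given a positive sign, and that `phi` is a full symmetric matrix; the names `numerator_denominator` and `criterion` are illustrative and not part of the Recommendation.

```python
def sign(value):
    """Sign of a correlation value; zero is treated as positive (assumption)."""
    return 1.0 if value >= 0 else -1.0


def numerator_denominator(positions, d, phi):
    """Return (C, E) of Eqs. 4 and 5 for the given pulse positions."""
    signs = [sign(d[m]) for m in positions]
    c = sum(s * d[m] for s, m in zip(signs, positions))
    e = sum(phi[m][m] for m in positions)
    for i in range(len(positions) - 1):
        for j in range(i + 1, len(positions)):
            e += 2.0 * signs[i] * signs[j] * phi[positions[i]][positions[j]]
    return c, e


def criterion(positions, d, phi):
    """Q = C^2 / E, the quantity maximized in Eq. 1."""
    c, e = numerator_denominator(positions, d, phi)
    return c * c / e


# Toy values for a 3-sample "subframe" with two pulses at positions 0 and 1.
d = [2.75, -3.5, 3.0]
phi = [[1.3125, 0.625, 0.25],
       [0.625, 1.25, 0.5],
       [0.25, 0.5, 1.0]]
print(numerator_denominator([0, 1], d, phi))
print(criterion([0, 1], d, phi))
```

A search procedure would call `criterion` once per candidate codevector and keep the candidate with the largest value.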

The final codevector contains the best two pulse positions per track, that is, the positions that maximize Eq. 1. This requires 6 bits per track, since each track has eight pulse positions. Of the two non-zero pulses of amplitude ±1 in a track, the sign of only one is transmitted; the sign of the other is derived at the decoder from the received one, as explained below.

Let the signs and positions of the two non-zero pulses be s1, s2, p1 and p2. In all iterations, the smaller of the two selected positions is assigned to p1 and the other to p2. If p1 ≤ p2, then s2 = s1; otherwise s2 is different from s1. When the final codeword is formed, if both signs are equal then p1 is set to the smaller of the two positions; otherwise p1 is set to the larger position and p2 to the smaller position (ITU-T Recommendation 2007).
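The ordering rule above can be illustrated with a small round-trip sketch. The functions `encode_pair` and `decode_pair` are hypothetical helpers, not taken from the Recommendation, and the sketch assumes that two pulses with opposite signs never share the same position, since that case cannot be represented by the rule.

```python
def encode_pair(pulse_a, pulse_b):
    """Order two pulses so that only one sign has to be sent.

    Each pulse is a (position, sign) tuple with sign +1 or -1.
    Returns (p1, p2, s1); s2 is implied by the order of p1 and p2.
    """
    low, high = sorted([pulse_a, pulse_b])  # sorted by position
    if low[1] == high[1]:
        # Equal signs: p1 is the smaller position.
        return low[0], high[0], low[1]
    # Different signs: p1 is the larger position.
    return high[0], low[0], high[1]


def decode_pair(p1, p2, s1):
    """Recover both pulses from the transmitted positions and one sign."""
    s2 = s1 if p1 <= p2 else -s1
    return (p1, s1), (p2, s2)


for pair in [((5, +1), (20, +1)), ((5, +1), (20, -1)), ((30, -1), (10, +1))]:
    decoded = decode_pair(*encode_pair(*pair))
    print(pair, "->", encode_pair(*pair), "->", decoded)
```

In each case the decoded pulses match the original pair, although their order may be swapped.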

4 Proposed modification in the search engine of the excitation codebook structure for determining optimised codevector

The proposed modification applies a codebook partition approach to the search procedure that determines the final codevector. First, the excitation codebook is partitioned into two equal parts along its structure dimension. The standardized fixed excitation codebook has five tracks with a total of 40 positions. The proposed approach exploits the fact that the number of positions in each track is even. Partitioning the codebook structure allows each individual position to be coded as one of four combinations instead of eight. This modification requires a total of 20 bits for the final excitation codevector and 5 sign bits per subframe for transmission through the channel.

4.1 Efficient transmission of position and sign magnitude of non-zero pulses with reduced number of bits

In the proposed codebook partition approach, the positions in each of the two partitions are assigned the four combinations 00, 01, 10 and 11. The least significant pulse replacement approach yields two non-zero pulses per track. Because the partitioning approach codes each position with 2 bits instead of 3, the two selected non-zero pulses may come either from the first partition of the excitation codebook or from the second partition. The labels are assigned in ascending order (00, 01, 10, 11) in partition 1 of the codebook, while in partition 2 the label assignment runs in the order 11, 10, 01, 00.
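The label assignment can be sketched as a mapping from the index of a position within its track (0 to 7) to a partition number and a 2-bit label. The function name `partition_label` and the use of a within-track index rather than an absolute sample position are assumptions made for illustration.

```python
def partition_label(index_in_track):
    """Map a within-track position index (0..7) to (partition, 2-bit label).

    Partition 1 holds indices 0..3 with labels 00, 01, 10, 11.
    Partition 2 holds indices 4..7 with labels 11, 10, 01, 00.
    """
    if not 0 <= index_in_track <= 7:
        raise ValueError("a track has eight positions, indexed 0..7")
    if index_in_track < 4:
        return 1, format(index_in_track, "02b")
    return 2, format(7 - index_in_track, "02b")


for index in range(8):
    print(index, partition_label(index))
```

Under this mapping, positions listed in ascending order produce ascending labels in partition 1 and descending labels in partition 2.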

Since the codebook is partitioned into two equal parts, there are three possibilities for where the two non-zero pulses are located. In the first, both non-zero pulses lie in partition 1; in the second, both lie in partition 2; in the third, the two pulses lie in different partitions. The two non-zero pulses of amplitude ±1 are transmitted along with their positions in the respective track.

Suppose the two non-zero pulses are located at positions 0 and 5, both belonging to partition 1 of track 1, with signs +1 and −1. According to the label assignment, these two pulses have the binary labels 00 and 01. The corresponding sign bits, 1 and 0 in this case, are transmitted using 2 bits. The sign bits for all five tracks are transmitted according to the three possibilities described above. If the two non-zero pulses are from the same partition, the sign bits are transmitted in a 2-in-1 cell; this also covers possibility 2. If the non-zero pulses are from different partitions, as in possibility 3, the signs of the two non-zero pulses are transmitted in a 1-in-1 cell.

At the receiver, the cell format is identified first. If the format is a 2-in-1 cell, the two non-zero pulses are known to lie either both in partition 1 or both in partition 2. To determine the partition, the order of the respective binary labels is examined. If the labels in the 4-bit codeword are in ascending order, the signs are placed at the corresponding binary locations in partition 1; if they are in descending order, they are placed in partition 2. Non-zero pulses belonging to different partitions are identified at the receiver by the 1-in-1 cell format and are placed at their respective binary labels, one in each partition. The proposed modification in the transmission of the position and sign of the excitation codevector saves 10 bits per frame in the transmission of the excitation codevector index. Because only one excitation codebook is used at both the transmitter and the receiver, no switching bit is required; the standard 11.8 kbps CS-ACELP based speech codec needs one because two different codebook structures are used at the encoder and decoder. The bit allocation of the proposed 10.6 kbps CS-ACELP based speech codec is shown in Table 4.
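For the 2-in-1 cell case, the partition decision at the receiver can be sketched as a comparison of the two received labels. The helper `decode_same_partition` is hypothetical; it returns within-track indices (0 to 7) and treats two equal labels as an error, because the text does not say how that case is signalled.

```python
def decode_same_partition(label_a, label_b):
    """Recover (partition, within-track indices) from two 2-bit labels.

    Ascending labels indicate partition 1, descending labels partition 2.
    """
    a, b = int(label_a, 2), int(label_b, 2)
    if a < b:
        return 1, (a, b)          # partition 1: index equals label
    if a > b:
        return 2, (7 - a, 7 - b)  # partition 2: labels run 11, 10, 01, 00
    raise ValueError("equal labels carry no partition information")


print(decode_same_partition("00", "01"))  # pulses in partition 1
print(decode_same_partition("11", "10"))  # pulses in partition 2
```

The first call corresponds to the example with labels 00 and 01 in partition 1; the second places the pulses at the first two positions of partition 2.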

Table 4 Bit allocation of the proposed 10.6 kbit/s CS-ACELP algorithm (10 ms frame)

4.2 Reduction in the number of searches of the excitation codevector of legacy 8 kbps CS-ACELP coder

With the proposed modification in the transmission of the excitation codeword and its sign, the excitation search can also be restricted to finding the initial codevector only. In the proposed modification, the initial codevector is determined by the positions of the first and second maxima of the correlation vector d(n) (Eq. 2) in each track. If this initial codevector, made up of ten non-zero pulses, is taken as the final codevector, no searches are required; the modified search procedure of the excitation codebook structure had already reduced the number of searches to 60 from the 320 of the depth-first search approach. Figures 2 and 3 show that using the initial codevector as the final codevector yields good-quality speech at the decoder, with satisfactory subjective and objective results compared with the legacy 8 kbps CS-ACELP speech coder. The results are compared only with the 8 kbps CS-ACELP speech coder because the number of searches is reduced to 0. The proposed transmission scheme requires 4 bits for position and 2 bits for sign per track, as there are two non-zero pulses per track. The CS-ACELP 11.8 kbps recommendation requires 70 bits per frame, 60 bits for the codeword and 10 bits for sign transmission, whereas the proposed modification requires 50 bits for the final codevector and 10 bits for sign transmission. The proposed modification of the CS-ACELP coder working at 8 kbps thus reduces the number of bits required to transmit the excitation codevector compared with the extended CS-ACELP (11.8 kbps) coder, and it also removes the complexity of the two different excitation codebook structures required at the transmitter and receiver of CS-ACELP working at 11.8 kbps.
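A minimal sketch of the zero-search initial codevector is given below. It assumes an interleaved track layout in which track t contains the positions t, t + 5, ..., t + 35 (Table 3 is not reproduced here, so this layout is an assumption), ranks positions by the magnitude of d(n), and gives each pulse the sign of d at its position. The name `initial_codevector` is illustrative.

```python
NUM_TRACKS = 5   # tracks in the 11.8 kbps forward-mode structure
SUBFRAME = 40    # samples per subframe


def initial_codevector(d):
    """Pick the two largest-magnitude positions of d(n) in every track.

    Returns a list of ten (position, sign) pulses.
    """
    if len(d) != SUBFRAME:
        raise ValueError("d must hold one value per subframe sample")
    pulses = []
    for track in range(NUM_TRACKS):
        # Assumed layout: track t owns positions t, t+5, ..., t+35.
        positions = range(track, SUBFRAME, NUM_TRACKS)
        best_two = sorted(positions, key=lambda n: abs(d[n]), reverse=True)[:2]
        for n in best_two:
            pulses.append((n, 1 if d[n] >= 0 else -1))
    return pulses


d = [0.0] * SUBFRAME
d[0], d[35], d[1], d[6] = 3.0, -5.0, 1.0, 2.0
print(initial_codevector(d))
```

Because no candidate replacement is evaluated, the cost is one ranking per track instead of repeated evaluations of Eq. 1.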

5 Subjective and objective measures

To evaluate the overall performance of the proposed modification, both objective and subjective quality assessment parameters are used. The subjective evaluation uses MOS analysis, whereas the objective evaluation is divided into waveform-based, spectral-based and perceptual-based analysis.

5.1 Subjective measures

The subjective measure used is the MOS (mean opinion score), which rates the quality of the compressed speech at the output of the decoder. Thirty randomly selected listeners were asked to judge the overall quality of the output speech, played in a noiseless environment through high-quality headphones, and to rate it according to the options listed in Table 5.

Table 5 Mean opinion score (MOS) ratings (Ninad and Kosta 2012)

5.2 Objective measures

Different objective measures are used to compare the decoded speech quality of the proposed speech codec with that of the legacy speech codec, and also to compare both with the proposed 10.6 kbps speech codec when the initial codevector is used as the final codevector. Objective measures are classified into waveform-based, spectral-based, perceptual-based and composite measures (Ninad and Kosta 2012).

5.2.1 Waveform based analysis

The following quality assessment parameters are evaluated in this category.

  1. Absolute error (ABS) is defined as

    $$\mathrm{Abserr} = \sum \left| S_{i} - S_{o} \right|$$
    (6)
  2. Mean square error (MSE) is expressed as

    $$\mathrm{MSE} = \sum \frac{\left( S_{i} - S_{o} \right)^{2}}{N}$$
    (7)
  3. Root mean square error (RMSE) is expressed as

    $$\mathrm{RMSE} = \sqrt{\sum \frac{\left( S_{i} - S_{o} \right)^{2}}{N}}$$
    (8)
  4. Signal-to-noise ratio (SNR) is given as

    $$\mathrm{SNR} = 10 \log_{10} \frac{\sum \left| S_{i} \right|^{2}}{\sum \left| S_{i} - S_{o} \right|^{2}}$$
    (9)

    where Si = input signal, So = decoded output signal and N = total number of frames.

  5. Segmental SNR is given as

    $$\mathrm{SNR}_{\mathrm{SEG}} = \frac{1}{M} \sum_{j=0}^{M-1} 10 \log_{10} \left[ \frac{\sum_{n = m_{j} - N + 1}^{m_{j}} s^{2}(n)}{\sum_{n = m_{j} - N + 1}^{m_{j}} \left[ s(n) - \hat{s}(n) \right]^{2}} \right]$$
    (10)

    where s(n) = input signal, \(\hat{s}(n)\) = decoded signal, N = segment length, M = number of segments and mj = end of the current segment.
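The waveform-based measures of Eqs. 6 to 10 can be sketched in a few lines. In this sketch the sums run over samples and N in the MSE is taken as the number of samples, which is an assumption; the function names are illustrative, and segments with zero error are not handled, so the toy signals are chosen to avoid them.

```python
import math


def abs_error(s_in, s_out):
    """Eq. 6: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(s_in, s_out))


def mse(s_in, s_out):
    """Eq. 7: mean square error (N taken as the number of samples)."""
    return sum((a - b) ** 2 for a, b in zip(s_in, s_out)) / len(s_in)


def rmse(s_in, s_out):
    """Eq. 8: root mean square error."""
    return math.sqrt(mse(s_in, s_out))


def snr_db(s_in, s_out):
    """Eq. 9: signal-to-noise ratio in dB."""
    signal = sum(a * a for a in s_in)
    noise = sum((a - b) ** 2 for a, b in zip(s_in, s_out))
    return 10.0 * math.log10(signal / noise)


def seg_snr_db(s_in, s_out, seg_len):
    """Eq. 10: mean of the per-segment SNR over whole segments."""
    n_seg = len(s_in) // seg_len
    total = 0.0
    for j in range(n_seg):
        lo, hi = j * seg_len, (j + 1) * seg_len
        total += snr_db(s_in[lo:hi], s_out[lo:hi])
    return total / n_seg


s_in = [1.0, 2.0, 3.0, 4.0]
s_out = [0.0, 2.0, 2.0, 4.0]
print(abs_error(s_in, s_out), mse(s_in, s_out), rmse(s_in, s_out))
print(snr_db(s_in, s_out), seg_snr_db(s_in, s_out, seg_len=2))
```

Practical implementations usually clamp the per-segment SNR to a fixed range before averaging; that step is left out here because the text does not specify it.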

5.2.2 Perceptual based analysis

  1. Perceptual evaluation of speech quality (PESQ)

    The PESQ algorithm uses psychoacoustic and cognitive models. Using a synchronization scheme, it first time-aligns the original and degraded speech signals, since misalignment could result in a false quality score. The cognitive model then performs complex non-linear calculations to compute the speech-quality degradation, represented by the disturbance metric between the psychophysical representations of the reference and degraded speech samples. PESQ is designed to analyze specific parameters of audio, including time warping, variable delays, transcoding and noise. The PESQ score is computed as a linear combination of the average disturbance value Dind and the average asymmetrical disturbance value Aind as follows (Ninad and Kosta 2012):

    $$\mathrm{PESQ} = a_{0} + a_{1} D_{\mathrm{ind}} + a_{2} A_{\mathrm{ind}}$$
    (11)

    where a0, a1 and a2 are calculated using multiple linear regression analysis.

5.2.3 Spectral based analysis

The following parameters are used for spectral-based analysis (Ninad and Kosta 2012).

  1. Log likelihood ratio (LLR) is defined by the following equation,

    $$d_{LLR}\left( \vec{a}_{p}, \vec{a}_{c} \right) = \log \left( \frac{\vec{a}_{p} R_{c} \vec{a}_{p}^{T}}{\vec{a}_{c} R_{c} \vec{a}_{c}^{T}} \right)$$
    (12)

    where \(\vec{a}_{c}\) is the LPC vector of the original speech signal frame, \(\vec{a}_{p}\) is the LPC vector of the decoded speech signal frame, and Rc is the autocorrelation matrix of the original speech signal.

  2. The Itakura-Saito (IS) distance measure is defined as,

    $$d_{IS}\left( \vec{a}_{p}, \vec{a}_{c} \right) = \frac{\sigma_{c}^{2}}{\sigma_{p}^{2}} \left( \frac{\vec{a}_{p} R_{c} \vec{a}_{p}^{T}}{\vec{a}_{c} R_{c} \vec{a}_{c}^{T}} \right) + \log \left( \frac{\sigma_{c}^{2}}{\sigma_{p}^{2}} \right) - 1$$
    (13)

    where \(\sigma_{c}^{2}\) and \(\sigma_{p}^{2}\) are the LPC gains of the original and decoded signals, respectively (Ninad and Kosta 2012). The IS value is limited to the range 0 to 100.

  3. Cepstrum distance (CEP)

    The cepstrum distance provides an estimate of the distance between two log spectra. The cepstrum coefficients can be obtained recursively from the LPC coefficients using the expression:

    $$c(m) = a_{m} + \sum_{k=1}^{m-1} \frac{k}{m}\, c(k)\, a_{m-k}, \quad 1 \le m \le p$$
    (14)

    where am are the LPC coefficients and p denotes the order of the LPC analysis. An objective measure based on the cepstrum coefficients can be computed with the following expression (Ninad and Kosta 2012):

    $$d_{CEP}\left( \vec{c}_{c}, \vec{c}_{p} \right) = \frac{10}{\log 10} \sqrt{2 \sum_{k=1}^{p} \left[ c_{c}(k) - c_{p}(k) \right]^{2}}$$
    (15)

    where \(\vec{c}_{c}\) and \(\vec{c}_{p}\) are the cepstrum coefficient vectors of the original and recovered signals. The cepstrum distance is limited to the range 0 to 10.

  4. Frequency weighted segmental SNR (fwSNRseg)

    $$\mathrm{fwSNRseg} = \frac{10}{M} \sum_{m=0}^{M-1} \frac{\sum_{j=1}^{K} W(j,m) \log_{10} \frac{\left| X(j,m) \right|^{2}}{\left( \left| X(j,m) \right| - \left| \hat{X}(j,m) \right| \right)^{2}}}{\sum_{j=1}^{K} W(j,m)}$$
    (16)

    where W(j,m) is the weight placed on the jth frequency band, K is the number of bands, M is the total number of frames in the signal, \(\left|X(j,m)\right|\) is the weighted original signal spectrum in the jth frequency band at the mth frame, and \(\left|\hat{X}(j,m)\right|\) is the weighted decoded signal spectrum in the same band.

  5. Weighted spectral slope (WSS) distance is defined as,

    $$d_{\mathrm{WSS}} = \frac{1}{M} \sum_{m=0}^{M-1} \frac{\sum_{j=1}^{K} W(j,m) \left( s_{c}(j,m) - s_{p}(j,m) \right)^{2}}{\sum_{j=1}^{K} W(j,m)}$$
    (17)

    In each frequency band, the weighted spectral slope distance calculates the weighted difference between the spectral slopes. The spectral slope is calculated as the difference between adjacent spectral magnitudes in decibels. \(s_{c}(j,m)\) and \(s_{p}(j,m)\) are the spectral slopes of the jth frequency band at frame m of the original and decoded speech signals, with a total of 25 bands (Ninad and Kosta 2012).
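Of the spectral measures above, the cepstrum distance is the easiest to sketch without a full LPC front end. The code below follows the recursion of Eq. 14 and the distance of Eq. 15 as written, with the LPC coefficients supplied directly. The function names are hypothetical, and implementing the stated 0 to 10 limit as a simple clamp is an assumption.

```python
import math


def lpc_to_cepstrum(a):
    """Eq. 14: cepstrum c(1..p) from LPC coefficients a(1..p).

    a[0] holds a_1, a[1] holds a_2, and so on.
    """
    p = len(a)
    c = [0.0] * (p + 1)  # c[0] is unused so that c[m] matches c(m)
    for m in range(1, p + 1):
        acc = a[m - 1]
        for k in range(1, m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]


def cepstral_distance(c_ref, c_dec, low=0.0, high=10.0):
    """Eq. 15, clamped to [low, high] (the clamp is an assumption)."""
    total = sum((x - y) ** 2 for x, y in zip(c_ref, c_dec))
    distance = (10.0 / math.log(10.0)) * math.sqrt(2.0 * total)
    return min(max(distance, low), high)


c_ref = lpc_to_cepstrum([0.5, 0.2])
c_dec = lpc_to_cepstrum([0.4, 0.2])
print(c_ref, c_dec)
print(cepstral_distance(c_ref, c_dec))
```

In an evaluation pipeline the distance would be computed per frame and then averaged over all frames of the file.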

5.2.4 Composite measures

Unlike the simple objective measures, composite measures combine several objective measures into a new one. A composite measure is a linear combination of existing objective measures whose weights are obtained by linear regression analysis. Three composite measures are used here: Csig, for signal distortion, is a linear combination of the PESQ, LLR and WSS measures; Cbak, for background noise distortion, is a linear combination of the PESQ, segSNR and WSS measures; and Covl, for overall speech quality, is a linear combination of the WSS, LLR and PESQ measures.

The multiple linear regression expressions for these three composite measures are shown below (ITU-T Recommendation 2003; Falk and Chan 2006; Salmela and Mattila 2004; Grundlehner et al. 2005).

$$C_{\mathrm{sig}} = 3.903 - 1.029 \cdot \mathrm{LLR} + 0.603 \cdot \mathrm{PESQ} - 0.009 \cdot \mathrm{WSS}$$
(18)
$$C_{\mathrm{bak}} = 1.634 + 0.478 \cdot \mathrm{PESQ} - 0.007 \cdot \mathrm{WSS} + 0.063 \cdot \mathrm{segSNR}$$
(19)
$$C_{\mathrm{ovl}} = 1.594 + 0.805 \cdot \mathrm{PESQ} - 0.512 \cdot \mathrm{LLR} - 0.007 \cdot \mathrm{WSS}$$
(20)
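Equations 18 to 20 translate directly into code. The sketch below uses the coefficients exactly as printed above; the function name `composite_measures` and the input values in the example are purely illustrative and are not results from this paper.

```python
def composite_measures(pesq, llr, wss, seg_snr):
    """Return (C_sig, C_bak, C_ovl) using the coefficients of Eqs. 18-20."""
    c_sig = 3.903 - 1.029 * llr + 0.603 * pesq - 0.009 * wss
    c_bak = 1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * seg_snr
    c_ovl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss
    return c_sig, c_bak, c_ovl


# Made-up inputs, only to show how the function is called.
print(composite_measures(pesq=3.0, llr=0.5, wss=40.0, seg_snr=5.0))
```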

6 Simulation of proposed algorithm based on MATLAB

Both the legacy CS-ACELP coder working at 8 kbps and the proposed CS-ACELP coder working at 10.6 kbps are implemented in MATLAB, and the performance of both coders is evaluated using different subjective and objective measures. The four-track excitation structure of the legacy CS-ACELP coder working at 8 kbps is replaced by a five-track excitation codebook structure with two non-zero pulses per track instead of one. The proposed 10.6 kbps CS-ACELP coder frees 12 bits per frame out of the 118 bits of the 11.8 kbps CS-ACELP coder, which can be used for steganographic data transmission or for better error concealment at the channel coding level. For the analysis of the subjective and objective parameters, five different wave files were chosen (Footnote 1). Each wav file is sampled at 16 kHz and coded with 16 bits per sample.
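The relation between bits per frame and bit rate used throughout this paper is simple arithmetic for a 10 ms frame. The helper name `bit_rate_kbps` is illustrative.

```python
FRAME_MS = 10  # frame length in milliseconds, as in Tables 1 and 4


def bit_rate_kbps(bits_per_frame, frame_ms=FRAME_MS):
    """Bits per millisecond equals kilobits per second."""
    return bits_per_frame / frame_ms


for bits in (80, 118, 118 - 12):
    print(bits, "bits/frame ->", bit_rate_kbps(bits), "kbps")
```

Removing 12 bits from the 118-bit frame leaves 106 bits, which corresponds to the 10.6 kbps of the proposed coder.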

6.1 Result obtained for MOS analysis

MOS analysis is performed for twenty different wave files taken from the VoxForge speech corpus database (Footnote 2). Thirty randomly selected subjects judged the quality of the speech in a noise-free environment using high-fidelity headphones. The listeners scored all decoded speech files, both the CS-ACELP 8 kbps decoded speech and the proposed CS-ACELP 10.6 kbps decoded speech. The MOS scores for the twenty wav files (Footnote 3) are shown in Fig. 2. Figure 2 shows that the MOS score of the proposed CS-ACELP 10.6 kbps speech coder is higher than that of the legacy CS-ACELP 8 kbps speech coder. Figure 2 also shows the decoded speech quality obtained when the initial codevector is used as the final codevector, with no search in the search engine of the excitation codebook, and compares it with the proposed 10.6 kbps CS-ACELP speech coder with 60 searches and with the legacy CS-ACELP 8 kbps speech coder, which requires a different number of searches for each search approach. The decoded speech quality obtained with the initial codevector as the final codevector is better than that of the different search methods of the legacy 8 kbps speech coder, but lower than that of the proposed approach with 60 searches.

Fig. 2 MOS score comparison between proposed 10.6 kbps, proposed 10.6 kbps with no search complexity and legacy 8 kbps CS-ACELP based speech coder

6.2 Result obtained for objective analysis

Objective analysis based on the perceptual evaluation of speech quality (PESQ) is performed on 20 different wave files taken from the VoxForge speech corpus database (Footnote 4). The results for the objective quality assessment parameters, classified into waveform-based, perceptual-based and spectral-based analysis, are given in Tables 6, 7 and 8 for seven different wave files taken from the VoxForge speech corpus database (Footnote 5). The results for the other classified objective quality assessment parameters are shown in Table 6 for the legacy CS-ACELP 8 kbps speech coder, the proposed CS-ACELP 10.6 kbps speech coder, and the proposed 10.6 kbps CS-ACELP based speech codec with the initial codevector used as the final codevector. A large amount of distortion can be observed when the legacy CS-ACELP speech coder and the proposed CS-ACELP speech coder are compared. The results of all quality assessment parameters are quite fair for the proposed coder compared with the legacy coder (Fig. 3).

Fig. 3 PESQ score comparison between proposed 10.6 kbps, proposed 10.6 kbps with no search complexity and legacy 8 kbps CS-ACELP based speech coder

Table 6 Waveform based analysis
Table 7 Perceptual based analysis
Table 8 Spectral based objective evaluation

6.3 Computing population mean for result analysis of subjective and objective parameters

The performance of the proposed coder is evaluated using the objective quality assessment parameter PESQ and the subjective quality assessment parameter MOS. Tables 6, 7 and 8 show that the results for the different classes of objective and subjective parameters are quite satisfactory. To evaluate the consistency of the proposed coder's performance, the population mean for a 95% confidence interval is calculated from the subjective and objective results obtained for seven different wave files in Sects. 6.1 and 6.2.

As shown in Morgan et al. (2017), the range of the confidence interval is calculated from the mean, the standard deviation, the sample size, and either the standardized normal distribution values or the critical values of the t distribution for the chosen confidence level. The sample size has a direct relationship with the confidence interval, and when the sample size is small, the critical values of the t distribution are used for the confidence interval measurement (Morgan et al. 2017).

The confidence interval is calculated as (Morgan et al. 2017):

$$\text{Population mean} = \text{sample mean} \pm \text{sample error}$$
$$\mu = \bar{x} \pm \frac{t \cdot s_{x}}{\sqrt{n}}$$
(21)

where µ is the population mean, which defines the 95 or 99% confidence interval range, \(\bar{x}\) is the sample mean, t is taken from the table of critical values of the t distribution (Morgan et al. 2017), \(s_{x}\) is the standard deviation and n is the sample size. The confidence interval calculations for 95 and 99% are shown in sections A and B.

6.3.1 Calculation of population mean of different confidence intervals for the proposed 10.6 kbps CS-ACELP based speech codec

6.3.1.1 Calculation of population mean based on objective speech quality evaluation parameter PESQ

The following notation is used in the calculation below:

\(x_i\) = the seven PESQ values from Table 6 for the proposed 10.6 kbps CS-ACELP based speech codec, used as the seven samples that serve as input to the population mean calculation,

\(\bar{x}\) = sample mean,

n = sample size, which is 7 as per the total number of samples.

$${s_x}=\text{standard~deviation}=\sqrt{\frac{\sum\nolimits_{i=1}^{n} {({x_i} - \bar{x})}^2}{n - 1}}$$
(22)

From the calculations, the value of the standard deviation [Eq. (22)] is \({\text{s}}_{\text{x}}\) = 0.064.
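The standard deviation formula of Eq. (22) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_std` is hypothetical, and the list `demo` holds arbitrary numbers chosen for the example, not the PESQ scores of the paper.

```python
import math
import statistics


def sample_std(samples):
    """Sample standard deviation with the (n - 1) denominator, as in Eq. (22)."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    x_bar = sum(samples) / n                      # sample mean
    squared_deviations = sum((x - x_bar) ** 2 for x in samples)
    return math.sqrt(squared_deviations / (n - 1))


# Arbitrary illustrative numbers (n = 7), NOT the PESQ scores from the paper.
demo = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0]

# The hand-written formula should agree with the standard library routine.
print(sample_std(demo), statistics.stdev(demo))
```

The division by n − 1 rather than n matches the 6 degrees of freedom that are later used to read the critical t value for seven samples.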


(A) Population mean for 95% confidence interval for PESQ samples. The population mean is calculated from Eq. (21), in which the value of t is taken as 2.447 from the table of critical values of the t distribution (Morgan et al. 2017), considering 6 (n − 1) degrees of freedom (Morgan et al. 2017) for the seven samples.

From Eq. (21), the range of the population mean for the 95% confidence interval is found to be (2.976, 3.094).
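As a rough check of the arithmetic behind Eq. (21), the sketch below recomputes the PESQ interval from the summary statistics quoted in the text (standard deviation 0.064, n = 7, t = 2.447). The sample mean is not stated explicitly, so it is assumed here to be the midpoint of the reported interval; the function name `t_interval` is hypothetical.

```python
import math


def t_interval(sample_mean, sample_std, n, t_crit):
    """Return (low, high) = sample_mean -/+ t_crit * sample_std / sqrt(n), as in Eq. (21)."""
    half_width = t_crit * sample_std / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


T_CRIT_95_DF6 = 2.447    # critical t value quoted in the text (6 degrees of freedom)
S_X = 0.064              # standard deviation of the PESQ samples quoted in the text
N = 7                    # number of wave files
# Assumption: the sample mean is taken as the midpoint of the reported interval.
SAMPLE_MEAN = (2.976 + 3.094) / 2

low, high = t_interval(SAMPLE_MEAN, S_X, N, T_CRIT_95_DF6)
print(f"95% interval: ({low:.3f}, {high:.3f})")

# The one PESQ value reported as lying outside the interval.
print("3.1768 above upper bound:", 3.1768 > high)
```

With these inputs the half-width is about 0.059, which agrees with the reported range to three decimal places.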

From the calculation in Table 9 and the confidence interval calculation, it is observed that six of the seven sample values (PESQ values) lie within the range of the population mean for the 95% confidence interval. The remaining sample, with a PESQ value of 3.1768 for Rai0005.wav, lies above the upper limit (3.094) of that range.

Table 9 Calculation of standard deviation for the samples of PESQ

6.3.1.2 Calculation of population mean based on subjective speech quality evaluation parameter MOS

From the calculations, the value of the standard deviation [Eq. (22)] is \({\text{s}}_{\text{x}}\) = 0.073.

(B) Population mean for 95% confidence interval for MOS samples. The population mean is calculated from Eq. (21), in which the value of t is taken as 2.447 from the table of critical values of the t distribution (Morgan et al. 2017), considering 6 (n − 1) degrees of freedom (Morgan et al. 2017) for the seven samples.

From Eq. (21), the range of the population mean for the 95% confidence interval is found to be (3.321, 3.457).
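The arithmetic behind this range can be written out as follows. The sample mean is not quoted in the text, so the value 3.389 used here is an assumption, taken as the midpoint of the reported interval; \(\bar{x}\) denotes the sample mean.

$$\frac{t \cdot s_x}{\sqrt{n}}=\frac{2.447 \times 0.073}{\sqrt{7}} \approx \frac{0.1786}{2.6458} \approx 0.068$$
$$\mu =\bar{x} \pm \frac{t \cdot s_x}{\sqrt{n}} \approx 3.389 \pm 0.068 \;\Rightarrow\; (3.321,\ 3.457)$$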

From the calculation in Table 10 and the confidence interval calculation, it is observed that six of the seven sample values (MOS values) lie within the range of the population mean for the 95% confidence interval. The remaining sample, with a MOS value of 3.5468 for Rai0005.wav, lies above the upper limit (3.457) of that range.

Table 10 Calculation of standard deviation for the samples of MOS

6.4 Searching complexity analysis using simulation delay

In order to compute the searching complexity in terms of simulation delay for a given program in MATLAB, the time required to reconstruct and recover the wave file is measured using the two commands ‘tic’ and ‘toc’ (Ninad and Kosta 2012). The execution time is lowest for the proposed 10.6 kbps CS-ACELP based speech codec when the initial codevector is taken as the final codevector, with an average of 5.02 s over all wave files. The execution time for the proposed 10.6 kbps codec with the standard search is 5.85 s (average over all wave files), and that of the legacy 8 kbps CS-ACELP based speech codec is 5.30 s. These values are calculated over all of the wave files mentioned above. The analysis indicates that the simulation time increases with the bit rate when the standard exhaustive search procedure for the excitation codevector is used, which is the case for the proposed 10.6 kbps speech codec compared with the legacy 8 kbps CS-ACELP based speech codec. The crucial comparison is between the execution time required when the initial codevector is taken as the final codevector, which requires no exhaustive search, and the execution time required by the legacy speech coder. The analysis shows that the average time required when the initial codevector is taken as the final codevector is less than the time required to find the final codevector in the legacy speech coder.
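The measurement described above can be sketched in Python as an analogue of the MATLAB ‘tic’/‘toc’ pair. This is not the authors' script: the helper names `timed`, `average_runtime` and `relative_change` are hypothetical, and only the three average times (5.02 s, 5.85 s and 5.30 s) come from the text.

```python
import time
from statistics import mean


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds), like tic ... toc."""
    start = time.perf_counter()                 # tic
    result = func(*args)
    elapsed = time.perf_counter() - start       # toc
    return result, elapsed


def average_runtime(func, inputs):
    """Average wall-clock time of func over a collection of inputs (e.g. wave files)."""
    return mean(timed(func, item)[1] for item in inputs)


def relative_change(value, baseline):
    """Percentage change of value with respect to baseline."""
    return 100.0 * (value - baseline) / baseline


# Average execution times quoted in the text, in seconds.
reported = {
    "proposed 10.6 kbps, initial codevector as final": 5.02,
    "proposed 10.6 kbps, standard search": 5.85,
    "legacy 8 kbps": 5.30,
}

baseline = reported["legacy 8 kbps"]
for name, seconds in reported.items():
    print(f"{name}: {seconds:.2f} s ({relative_change(seconds, baseline):+.1f}% vs legacy)")
```

In practice `average_runtime` would be called with the encode/decode routine and the list of wave files; the loop at the end only expresses the quoted averages relative to the legacy coder.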

7 Discussion and concluding remarks

CS-ACELP (G.729) is extensively used in VoIP applications, which are among today's fastest-emerging applications on smartphones, because of its low bit rate requirement for transmitting hybrid traffic through the communication channel.

The basic aim behind the implementation of the proposed 10.6 kbps CS-ACELP speech coder is to reduce the search complexity of the excitation codebook by introducing an excitation codebook structure that requires fewer searches than the legacy coder. Coding complexity is also reduced in comparison with the standard CS-ACELP 11.8 kbps speech coder by using the same excitation codebook structure at both the transmitter and the receiver, whereas the standard coder uses excitation codebook structures with different bit allocations at the transmitter and the receiver. The searching complexity of the excitation codevector is further reduced by taking the initial codevector as the final codevector.

The results of the subjective and objective analysis of the proposed CS-ACELP 10.6 kbps speech coder are fairly good compared to the legacy CS-ACELP 8 kbps speech coder. The proposed CS-ACELP 10.6 kbps speech coder requires 60 bits per frame to transmit the final excitation codevector through the channel, while the 11.8 kbps excitation codebook structure requires 70 bits per frame. The proposed coder is therefore a better trade-off between the two legacy/standard CS-ACELP speech coders (8 and 11.8 kbps): it reduces the number of searches needed to determine the final optimized excitation codevector compared to the legacy CS-ACELP 8 kbps speech coder, and it transmits fewer bits for coding the excitation codevector than the standard CS-ACELP 11.8 kbps speech coder.
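The bit budget in this comparison can be put in context with a short calculation. The sketch assumes that all three coders use the 10 ms frame described in the Introduction (80 present speech samples processed at 8000 samples/s, encoded as 80 bits at 8 kbps); the excitation bit counts of 60 and 70 are the ones quoted above, and the function name `bits_per_frame` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000    # coding rate stated in the Introduction
FRAME_SAMPLES = 80       # present speech samples per frame (10 ms), assumed for all coders


def bits_per_frame(bit_rate_bps):
    """Total number of coded bits per frame for a given bit rate."""
    return bit_rate_bps * FRAME_SAMPLES // SAMPLE_RATE_HZ


# (bit rate in bit/s, excitation bits per frame quoted in the text)
coders = {
    "proposed 10.6 kbps": (10600, 60),
    "standard 11.8 kbps": (11800, 70),
}

print("legacy 8 kbps:", bits_per_frame(8000), "bits per frame")
for name, (rate, excitation_bits) in coders.items():
    total = bits_per_frame(rate)
    share = 100.0 * excitation_bits / total
    print(f"{name}: {total} bits per frame, {excitation_bits} for excitation ({share:.1f}%)")
```

Under this assumption the proposed coder carries 106 bits per frame and the 11.8 kbps coder 118 bits per frame, so the 10-bit difference in excitation coding accounts for most of the gap between the two rates.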

The efficiency of the proposed algorithm is also evaluated with the population mean for a 95% confidence interval (CI95), using the results of the objective and subjective quality assessment parameters PESQ and MOS for different wave files from a standard speech corpus database as input to the proposed CS-ACELP based speech coder. It is observed that the range calculated for the population mean at the 95% confidence level, with seven samples as individual inputs for the PESQ and MOS analysis, includes most of the sample values of the subjective or objective quality assessment results that were used as inputs for the calculation. This observation supports the consistency of the proposed algorithm across the PESQ and MOS values of different wave files. It is also concluded that the samples lying outside the 95% confidence interval are those with values greater than the upper limit of the interval, which indicates good quality of the decoded output speech according to the PESQ and MOS ratings (Table 5).