1 Introduction

Most traditional telephone networks carry only a narrowband (NB) signal that is band-limited to 300–3400 Hz. Human speech, however, contains frequencies far beyond the NB range, so transmission through these networks produces muffled, poor-quality telephony speech. Wideband (WB) speech transmission in the range of 50–7000 Hz is desirable for better speech quality. The changes to the network infrastructure required to support WB speech services are expensive and time-consuming (Jax & Vary, 2006), which is a major hurdle for the transmission of high-quality speech in telephone networks. It is therefore important to enable WB speech transmission using speech bandwidth extension (SBE) techniques that enhance the quality of speech (Jax, 2002).

Artificial bandwidth extension (ABE) is one of several SBE methods that improve the quality and intelligibility of telephony speech (Jax, 2002). In this approach, the out-of-band information, i.e., the frequencies below 300 Hz and above 3400 Hz, is estimated from the NB signal. Most ABE techniques estimate an excitation signal and a spectral envelope (SPENV), which are then used to regenerate the out-of-band signal. Different approaches for extending the excitation signal are presented in Jax (2002) and Prasad & Kishore Kumar (2016). Different techniques for estimating the WB SPENV are presented in Prasad & Kishore Kumar (2016), Zhen-Hua et al. (2018), Bong-Ki et al. (2018), Abel and Fingscheidt (2017), and Yingwue et al. (2016). In Yuanjie Dong et al. (2020), a time–frequency network with channel attention and non-local modules is used for SBE. Latent representation learning for ABE using a conditional variational auto-encoder is presented in Pramod Bachhav et al. (2019) to enhance the speech quality. A time-domain multi-scale fusion neural network approach for improving the performance of SBE is presented in Xiang Hao et al. (2020). SBE using a conditional generative adversarial network with discriminative training is introduced in Jonas Sautter et al. (2019) to reconstruct a better-quality speech signal. An audio signal of better quality is regenerated in Mathieu and Felix (2020) using audio bandwidth extension aided by a dilated convolutional neural network. In Kyoungjin and Joon et al. (2020), a deep neural network ensemble approach for reducing artificial noise in SBE is introduced. A waveform-based method for SBE that uses a deep three-way split summation FFTNet architecture is proposed in Berthy Feng et al. (2019). In Johannes and Tim (2019), a time-domain ABE technique towards the low-frequency band, based on sinusoidal synthesis of missing harmonics, is presented to enhance the quality of the reconstructed WB signal. A WaveNet-based model conditioned on a log-mel spectrogram representation of NB speech is proposed in Archit Gupta et al. (2019) to reconstruct a better-quality speech signal. However, traditional ABE methods struggle to reconstruct high-quality WB speech under all conditions (Jax & Vary, 2002).

Compared to ABE techniques, WB speech of higher quality is reconstructed when the out-of-band information is transmitted by hiding it in the NB signal using data hiding methods (Jax & Vary, 2006). Several SBE techniques using data hiding have been proposed in the literature. The technique proposed in Jax and Vary (2006) embeds the encoded SPENV parameters of the lost speech frequency components within the NB signal, and a better-quality WB signal is reconstructed at the receiver using the embedded information. Chen & Leung (2007) reconstruct a better-quality WB signal than Chen & Leung (2005) by efficiently encoding the spectral envelope parameters (SPEVPS) using phonetic classification. In Geiser and Vary (2013), the pitch-scaled frequencies of the out-of-band signal are embedded into the unused frequencies of traditional telephony speech to enhance the quality of the reconstructed WB speech. A WB signal of better quality is regenerated in Bhatt and Kosta (2015), Geiser and Vary (2007), and Kosta (2016) using joint source coding and data hiding. A high-quality WB signal is reconstructed in Prasad and Kishore Kumar (2017) and Sunil Kumar and Kishore Kumar (2019) using various frequency-domain data hiding techniques. In Chen & Leung (2007), the quality of the reconstructed WB speech is enhanced by restoring the hidden audible components of the out-of-band signal. In Chen et al. (2013), the SPEVPS of the out-of-band signal are embedded into the NB signal bitstream to improve the quality of the reconstructed WB speech. A WB signal of better quality is regenerated in Sagi and Malah (2007) using a quantization-based watermarking technique.

SBE techniques with data hiding are expected to deliver both a high-quality composite narrowband (CNB) signal and a high-quality reconstructed wideband (RWB) signal. These techniques must also be able to handle quantization and channel noises. Nevertheless, most of the traditional techniques fail to provide high-quality CNB and RWB signals (Bhatt & Kosta, 2015; Chen & Leung, 2007; Chen et al., 2013; Geiser & Vary, 2007, 2013; Jax & Vary, 2006; Kosta, 2016; Prasad & Kishore Kumar, 2017; Sagi & Malah, 2007; Sunil Kumar & Kishore Kumar, 2019), and they are less robust to channel and quantization noises. Thus, it is essential to develop a novel SBE technique using data hiding that improves the quality of the CNB and RWB signals and is more robust to channel and quantization noises.

The audio steganography technique presented in Kanhe and Aghila (2016) uses discrete cosine transform (DCT)-based data hiding to insert a secret message signal into the DCT coefficients of a host speech signal without degrading the perceptual quality of the host signal. It was shown that this approach produces a stego signal that is indistinguishable from the host signal, while the secret message signal is reliably recovered at the receiver end without any degradation in quality.

In this paper, a novel SBE algorithm using the DCT-based data hiding technique (Kanhe & Aghila, 2016) is proposed to embed the parameters of the lost speech frequency components within the DCT coefficients of the NB signal. These hidden parameters are retrieved at the receiver side to produce a better-quality WB signal by combining the NB signal with the missing speech signal that was transmitted through the DCT coefficients. The proposed scheme uses the real missing speech information instead of an estimate of it, which makes the reconstruction of the WB speech more accurate than with conventional ABE methods. Furthermore, the proposed method is compatible with conventional NB terminal equipment, e.g., a plain ordinary telephone set. In other words, conventional NB receivers can still access the NB speech properly without additional hardware, while a customized receiver can extract the embedded information and provide a WB signal of much better quality.

The telephone network channel introduces channel and quantization noises. The SBE techniques proposed in Chen and Leung (2007), Geiser et al. (2005), and Jax and Vary (2006) consider only the quantization noise and ignore the channel noise. Both quantization noise and channel noise effects are considered in this paper. The spread spectrum technique (Hasan et al. 1998) is used in this work for retrieving the embedded information, as it is reported to be more robust against quantization and channel noises. In particular, each parameter to be inserted is spread by multiplying it with a particular spreading sequence. The embedded information is then formed by adding the spread signals. Owing to the orthogonality among the spreading sequences, the embedded information is retrieved reliably by using a correlator.

To minimize the interference caused by the other embedded components, spreading sequences with low cross-correlations are preferred. Hadamard codes have optimal cross-correlation performance, i.e., they are orthogonal to each other, whereas m-sequences, Gold codes, and Kasami codes have varying cross-correlation properties (Dianan & Jabbari, 1998; Goldsmith, 2006). Because Hadamard codes are well known for their optimal cross-correlation performance, they are employed in this work to minimize the interference caused by the other embedded components.
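As an illustration of the orthogonality property discussed above, the following minimal sketch builds Hadamard codes with the Sylvester construction and checks their zero-lag cross-correlations. The function names `hadamard` and `cross_correlation` are hypothetical, and the code length of 16 is taken from the spreading sequence length reported in Sect. 3.2.3; the paper does not state which construction was actually used.

```python
def hadamard(order):
    """Return the order x order Sylvester-Hadamard matrix (order = power of two)."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    rows = [[1]]
    while len(rows) < order:
        # H_2n = [[H_n, H_n], [H_n, -H_n]]
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def cross_correlation(a, b):
    """Zero-lag cross-correlation (inner product) of two equal-length codes."""
    return sum(x * y for x, y in zip(a, b))


# Assumption: 16 codes of length 16, matching the sequence length in Sect. 3.2.3.
codes = hadamard(16)
```

Every pair of distinct rows has a cross-correlation of exactly zero, while each row correlated with itself gives the code length, which is the property the correlator at the receiver relies on.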

The paper is organized as follows. In Sect. 2, the DCT-based data hiding (DCTBDH) method for SBE is introduced. The subjective and objective analyses are discussed in Sect. 3. Finally, conclusions are summarized in Sect. 4.

2 DCT-based data hiding technique for SBE

2.1 Transmitter

The proposed transmitter is shown in Fig. 1. A WB speech signal \({S}_{wb}(n)\) is sampled at a frequency of 16 kHz. This signal is split into a low band signal using a low pass filter (LPF) and a high band signal using a high pass filter (HPF). The LPF extracts the speech information present between 0 and 4 kHz (the low band signal), while the HPF extracts the speech information present between 4 and 8 kHz (the high band signal). The LPF output is decimated by a factor of two to produce an NB signal \({S}_{nb}\left(n\right).\) The high band signal is likewise decimated to produce an upper band (UB) signal \({S}_{eb}\left(n\right)\). Therefore, the sampling frequency of both \({S}_{nb}\left(n\right)\) and \({S}_{eb}\left(n\right)\) is 8 kHz.
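The band-splitting step can be sketched as follows. This is only an illustration under stated assumptions: the filter type (a 101-tap Hamming-windowed sinc), the spectral-inversion high-pass, and all function names are hypothetical choices, since the paper does not specify the LPF and HPF designs.

```python
import math


def lowpass_taps(num_taps=101, cutoff=0.25):
    """Hamming-windowed sinc low-pass; cutoff in cycles/sample (0.25 = 4 kHz at 16 kHz)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def highpass_taps(num_taps=101, cutoff=0.25):
    """High-pass by spectral inversion: delayed impulse minus low-pass (odd num_taps)."""
    taps = [-h for h in lowpass_taps(num_taps, cutoff)]
    taps[(num_taps - 1) // 2] += 1.0
    return taps


def fir_filter(signal, taps):
    """Direct-form FIR filtering with zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def split_bands(wb):
    """Return (nb, eb): low and high bands of a 16 kHz signal, each decimated to 8 kHz."""
    low = fir_filter(wb, lowpass_taps())
    high = fir_filter(wb, highpass_taps())
    return low[::2], high[::2]  # keep every second sample
```

With this sketch, a 1 kHz tone ends up almost entirely in the NB output and a 6 kHz tone almost entirely in the UB output (where decimation folds it down to 2 kHz).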

Fig. 1 Proposed transmitter

To embed \({S}_{eb}\left(n\right)\) into \({S}_{nb}\left(n\right)\) imperceptibly, the number of parameters that represent \({S}_{eb}(n)\) is minimized. Linear predictive analysis (Hanzo et al., 2001), which is based on the source-filter model of speech production, is employed for this purpose. The linear predictive coefficients (LPC) define the all-pole (autoregressive) synthesis filter of this model. The LPC that correspond to the spectral envelope of \({S}_{eb}\left(n\right)\) are denoted as \({b}_{i}\left(i=1,\dots ,10\right)\), where \(i\) indexes the coefficients of the tenth-order filter, and are found by using the Levinson-Durbin algorithm. A small variation in the LPC results in substantial distortions when reconstructing \({S}_{eb}\left(n\right)\); hence the LPC are converted into line spectral frequencies (LSF) (Hanzo et al., 2001). The gain of \({S}_{eb}(n)\), denoted by \({G}_{r}\), also has to be embedded, since the synthesized UB speech has to be scaled to an appropriate energy to avoid over-estimation (Nilsson & Kleijn, 2001). Thus, the representation vector of \({S}_{eb}(n)\) is formed by combining the LSF and the gain, i.e., \(R=\left[{LSF}_{1},{LSF}_{2},\dots ,{LSF}_{10},{G}_{r}\right]\).
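For readers unfamiliar with the Levinson-Durbin algorithm, a minimal sketch is given here. The function names, the sign convention (the predictor is written as \(x(n)\approx \sum_{k}a_{k}x(n-k)\)), and the early exit for degenerate frames are assumptions made for illustration; the paper does not state its own conventions, and the LPC-to-LSF conversion is not shown.

```python
def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order=10):
    """Return (coeffs, error) such that x[n] ~ sum_k coeffs[k-1] * x[n-k]."""
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate frame (e.g. silence): stop early
        # Reflection coefficient for stage i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k  # remaining prediction error energy
    return a[1:], error
```

For a first-order autoregressive correlation sequence such as `[1.0, 0.5, 0.25]`, a second-order analysis returns coefficients `[0.5, 0.0]`, i.e., the recursion correctly finds that only one coefficient is needed.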

The excitation parameters of \({S}_{eb}\left(n\right)\) are not embedded, in order to reduce the number of parameters to be hidden. This is possible because the ear is not very sensitive to distortions of the excitation signal above the NB range (Jax & Vary, 2003). Thus, estimating the excitation of \({S}_{eb}\left(n\right)\) at the receiver from \({S}_{nb}(n)\) is sufficient for good reconstruction performance.

Let \({D}_{i}\) denote the \(i\)-th parameter of \(R\), and let \({D}_{io}\) denote one particular parameter among them. Every parameter of \(R\) to be embedded is spread by multiplying it with a particular spreading sequence, i.e., \({D}_{i}\cdot {\vec{p}}^{\,i},1\le i\le Q\). The hidden data is then produced by adding all of these spread vectors and is given by

$$V\left(g\right)=\sum_{i=1}^{Q}{D}_{i}{p}^{i}\left(g\right)$$
(1)

where \({p}^{i}\left(g\right)\) is the \({g}^{th}\) element of \({\vec{p}}^{\,i}\). The DCT is then applied to the NB signal \({S}_{nb}(n)\) and can be expressed as

$${S}_{nb}\left(k\right)=w\left(k\right)\sum_{n=0}^{N-1}{S}_{nb}\left(n\right)\mathrm{cos}\frac{\left(2n+1\right)k\pi }{2N},\quad k=0,1,\dots ,N-1$$
(2)

where

$$w\left(k\right)=\begin{cases}\sqrt{\frac{1}{N}}, & k=0\\ \sqrt{\frac{2}{N}}, & \text{otherwise}\end{cases}$$

The last 16 DCT coefficients are replaced by \(V\left(g\right)\), resulting in a CNB signal spectrum (Kanhe & Aghila, 2016). To transform the CNB signal spectrum back to a time-domain representation, the inverse discrete cosine transform (IDCT) is applied to it, which can be expressed as

$${S}_{nb}\left(n\right)=\sum_{k=0}^{N-1}w\left(k\right){S}_{nb}\left(k\right)\mathrm{cos}\frac{\left(2n+1\right)k\pi }{2N},\quad n=0,1,\dots ,N-1$$
(3)

Thus, a CNB signal \({S}_{nb}^{1}(n)\) is produced so that it can be communicated to the receiver on a telephone network channel.
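The embedding step of Eqs. (2) and (3) can be sketched as follows. This is a direct, unoptimized evaluation of the orthonormal DCT pair; the function names `dct`, `idct`, `embed`, and `extract` and the frame length of 160 samples used in the usage note are assumptions for illustration, not the authors' implementation.

```python
import math


def _weight(k, n):
    """Scaling factor w(k) of the orthonormal DCT."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x):
    """Orthonormal DCT of a frame, evaluated directly from Eq. (2)."""
    n = len(x)
    return [_weight(k, n) * sum(x[t] * math.cos((2 * t + 1) * k * math.pi / (2 * n))
                                for t in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct(), evaluated directly from Eq. (3)."""
    n = len(c)
    return [sum(_weight(k, n) * c[k] * math.cos((2 * t + 1) * k * math.pi / (2 * n))
                for k in range(n))
            for t in range(n)]


def embed(nb_frame, hidden):
    """Replace the last len(hidden) DCT coefficients and return the CNB frame."""
    coeffs = dct(nb_frame)
    coeffs[-len(hidden):] = hidden
    return idct(coeffs)


def extract(cnb_frame, count):
    """Read back the last `count` DCT coefficients of a CNB frame."""
    return dct(cnb_frame)[-count:]
```

For example, embedding 16 values into a 160-sample frame with `embed(frame, hidden)` and then calling `extract(cnb, 16)` returns the hidden values up to floating-point rounding when the channel is noiseless, because the transform pair is orthonormal.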

A synchronization sequence like 111….0.11 is added after every frame of the CNB signal to achieve frame synchronization (ETSI ES 201 2000) between the transmitter and receiver. The arrival of a new frame of the CNB signal at the receiver is indicated by the reception of a synchronization sequence.

2.2 Receiver

The proposed receiver is shown in Fig. 2. The CNB signal received through a telephone network channel is noisy. Let the received signal be \({\hat{S}}_{nb}^{1}\left(n\right)={S}_{nb}^{1}\left(n\right)+e\), where \(e\) represents the combination of channel and quantization noises. A conventional phone terminal treats \({\hat{S}}_{nb}^{1}(n)\) as an ordinary NB signal. The NB signal quality is not noticeably degraded, since the perceived difference between \({S}_{nb}^{1}(n)\) and \({\hat{S}}_{nb}^{1}(n)\) is very small. To retrieve the embedded data, the DCT is applied to the received CNB signal to obtain its DCT coefficients.

Fig. 2 Proposed receiver

The spread parameters are then obtained from the last 16 DCT coefficients (Kanhe & Aghila, 2016), and a correlator is used to de-spread them. Let \({\tilde{D}}_{io}\) denote the particular parameter to be retrieved. The correlation can be expressed as

$${\tilde{D}}_{io}=\frac{1}{Q}\sum_{g=1}^{Q}{p}^{io}\left(g\right)\tilde{V}\left(g\right)$$
(4)

where \(\tilde{V}\left(g\right)\) represents the noisy \(V\left(g\right)\) and is given by

$$\tilde{V}\left(g\right)=\sum_{i=1}^{Q}{\hat{D} }_{i}{p}^{i}\left(g\right)+\overline{e} \left(g\right)$$
(5)

Substituting Eq. (5) into Eq. (4) gives

$${\tilde{D}}_{io}=\frac{1}{Q}\sum_{g=1}^{Q}{p}^{io}(g)\left(\sum_{i=1}^{Q}{\hat{D} }_{i}{p}^{i}\left(g\right) +\overline{e} \left(g\right)\right)$$
$$=\frac{1}{Q}\sum_{g=1}^{Q}{p}^{io}\left(g\right)\times \left({\hat{D} }_{io}{p}^{io}\left(g\right)+\sum_{i\ne io}{\hat{D} }_{i}{p}^{i}\left(g\right)+\overline{e} \left(g\right)\right)$$
$$={\hat{D} }_{io}+\frac{1}{Q}\sum_{g=1}^{Q}\sum_{i\ne io}{\hat{D} }_{i}{p}^{i}\left(g\right){p}^{io}\left(g\right) +\frac{1}{Q}\sum_{g=1}^{Q}{p}^{io}\left(g\right)\overline{e} \left(g\right)$$
(6)

The spreading sequences are orthogonal, i.e.,

$$\sum_{g=1}^{Q}{p}^{i}\left(g\right){p}^{io}\left(g\right)=0$$

where i ≠ io. Therefore

$$\sum_{g=1}^{Q}\sum_{i\ne io}{\hat{D} }_{i}{p}^{i}\left(g\right){p}^{io}\left(g\right)=\sum_{i\ne io}{\hat{D} }_{i}\sum_{g=1}^{Q}{p}^{i}\left(g\right){p}^{io}\left(g\right) =0$$
(7)

Also, since \({p}^{io}\left(g\right)\) and \(\overline{e} \left(g\right)\) are uncorrelated,

$$\frac{1}{Q}\sum_{g=1}^{Q}{p}^{io}\left(g\right)\overline{e} \left(g\right)=0$$
(8)

when \(Q\to \infty\). Substituting Eqs. (7) and (8) into Eq. (6) yields

$${\tilde{D}}_{io}={\hat{D} }_{io}$$
(9)
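The spreading of Eq. (1) and the correlator described in this derivation can be illustrated with a short numerical sketch. The assumptions here are loud ones: 11 parameters (ten LSF plus the gain) are spread with Sylvester-Hadamard codes of length 16, the parameter values are made up, and the noise is zero-mean Gaussian with a standard deviation of 0.01; none of these numerical choices, nor the function names, come from the paper.

```python
import random


def hadamard(order):
    """Sylvester-Hadamard matrix with +1/-1 entries (order = power of two)."""
    rows = [[1]]
    while len(rows) < order:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def spread(params, codes):
    """V(g) = sum_i D_i * p_i(g), as in Eq. (1)."""
    length = len(codes[0])
    return [sum(d * code[g] for d, code in zip(params, codes)) for g in range(length)]


def despread(received, codes):
    """Correlate the received vector with each code and normalize by its length."""
    length = len(received)
    return [sum(v * c for v, c in zip(received, code)) / length for code in codes]


codes = hadamard(16)[:11]                      # one code per parameter (assumed)
params = [0.05 * (i + 1) for i in range(11)]   # made-up parameter values
clean = spread(params, codes)

rng = random.Random(0)
noise = [rng.gauss(0.0, 0.01) for _ in clean]  # stand-in for channel noise
noisy = [v + e for v, e in zip(clean, noise)]

recovered_clean = despread(clean, codes)
recovered_noisy = despread(noisy, codes)
```

Without noise the parameters are recovered exactly (up to rounding) because the cross terms vanish. With noise, each estimate is off by the correlation between its code and the noise, which can never exceed the largest noise sample in magnitude and shrinks on average as the code length grows.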

This shows that the parameters representing \({\hat{S}}_{eb}\left(n\right)\) can be effectively recovered using the spread spectrum (SS) technique; the LPC are then obtained from the LSF. Meanwhile, the NB residual signal is obtained by inverse filtering \({\hat{S}}_{nb}^{1}(n)\) using its own LPC, and the UB excitation signal is obtained by extending this NB residual signal. The embedded UB signal \({\hat{S}}_{eb}(n)\) is synthesized by exciting the synthesis filter described by the recovered LPC with the UB excitation signal. The received CNB and reconstructed UB signals are sampled at 8 kHz and are then interpolated by a factor of two; \({S}_{eb}^{1}(n)\) denotes the interpolated \({\hat{S}}_{eb}(n)\). The interpolated CNB signal \({S}_{nb}^{11}(n)\) and the interpolated UB signal \({S}_{eb}^{1}(n)\) are added to reproduce a WB signal \({S}_{wb}^{1}(n)\) of good quality.

3 Experimental results

The speech utterances used for the performance evaluation of the traditional and proposed SBE techniques were obtained from the TIMIT database (Garofalo et al., 2013). The evaluations used thirty different speech utterances spoken by thirty female and male speakers. The performance of the methods was assessed with both subjective and objective measures. Each speech signal was split into frames 20 ms long, with an overlap of 10 ms between consecutive frames, and each frame was processed individually. Existing SBE algorithms, namely data hiding (Chen & Leung, 2005), phonetic classification (Chen & Leung, 2007), audio watermark (Chen et al., 2013), steganographic WB telephony (Bhatt & Kosta, 2015), magnitude spectrum data hiding (Prasad & Kishore Kumar, 2017), and transform domain data hiding (Sunil Kumar & Kishore Kumar, 2019), were compared with the proposed method. Additive white Gaussian noise (AWGN) and µ-law channel models were used for the analysis.
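The framing described above can be sketched as follows. The function name `split_frames` is hypothetical, and the default sampling rate of 8 kHz is an assumption (it matches the NB and UB signals; the paper does not say at which rate the framing is applied). Trailing samples that do not fill a whole frame are simply dropped in this sketch.

```python
def split_frames(signal, sample_rate=8000, frame_ms=20, hop_ms=10):
    """Split a signal into 20 ms frames that overlap by 10 ms (defaults assumed)."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 8 kHz
    hop = sample_rate * hop_ms // 1000           # 80 samples at 8 kHz
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Usage: 100 ms of dummy samples at 8 kHz gives 9 overlapping frames.
frames = split_frames(list(range(800)))
```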

3.1 Subjective listening test results

The perceptual transparency was assessed with the mean opinion score (MOS) test (Chen & Leung, 2005, 2007). A subjective comparison between the WB, CNB, NB, and RWB signals was also carried out (Prasad & Kishore Kumar, 2017). Each participant listened to the speech utterances through headphones in a silent chamber, and the participants' opinions of the speech sounds were recorded on a predefined scale. Thirty persons participated in the tests.

3.1.1 Perceptual transparency

The CNB and NB signals should sound alike. The perceptual transparency was assessed with the MOS test: when comparing the CNB and NB signals, the listener gives a decision in terms of MOS as given in Table 1. The average MOS values of the traditional (Bhatt & Kosta, 2015; Chen & Leung, 2005, 2007; Chen et al., 2013; Prasad & Kishore Kumar, 2017; Sunil Kumar & Kishore Kumar, 2019) and the proposed techniques are given in Table 2. The proposed technique gave a MOS value of 3.99, which indicates better perceptual transparency than the traditional techniques. This value is close to the standard MOS value of 4, which indicates that the CNB and NB signals were nearly identical.

Table 1 MOS
Table 2 Result of MOS

3.1.2 Subjective comparisons between WB, NB, CNB, and RWB speech samples

A listening test was conducted to compare the performance of the proposed and conventional methods (Bhatt & Kosta, 2015; Chen & Leung, 2005, 2007; Chen et al., 2013; Prasad & Kishore Kumar, 2017; Sunil Kumar & Kishore Kumar, 2019). Here, the WB, NB, CNB, and RWB signals were labelled I, II, III, and IV, respectively. Participants were asked to make pairwise comparisons between the samples and to state whether the first sample was superior to, inferior to, or equal to the second. The responses obtained after comparing I, II, and III with the other signals are tabulated in Tables 3, 4, and 5, respectively. The Arabic numerals in the tables indicate the number of participants with a specific preference. Table 3 shows that the WB signal is superior to the NB and CNB signals of both the traditional and the proposed methods, and that the RWB signal quality of the proposed method is far superior to that of the traditional methods. Thus, the speech quality was enhanced by the proposed technique. Compared to the traditional methods, Table 4 shows that the RWB signal of the proposed method is superior to the NB signal. Table 4 also shows a clear improvement in perceptual transparency over the conventional methods: the quality of the CNB signal is almost identical to that of the NB signal, so the data embedding performed in the proposed method has very little impact on perception. Compared to the conventional methods, Table 5 shows that the RWB speech of the proposed technique is better than the CNB speech. Thus, the proposed method is demonstrated to produce a much better-quality speech signal than the conventional methods.

Table 3 Subjective comparison test results between I, II, III, and IV
Table 4 Subjective comparison test results between II, III, and IV
Table 5 Subjective comparison test results between III and IV

3.2 Objective quality evaluations

The database which was used in subjective listening tests was also used in evaluating objective measures. The perceptual transparency was assessed with the narrowband-perceptual evaluation of speech quality (NB-PESQ) measure (ITU-T, 2001). RWB speech quality was evaluated with the Log Spectral Distortion (LSD) (Chen & Leung, 2005, 2007) and wideband-perceptual evaluation of speech quality (WB-PESQ) measures (ITU-T, 2005). The robustness of hidden data against quantization and channel noises was evaluated with the help of a mean square error (MSE) measure (Chen & Leung, 2007).

3.2.1 RWB speech quality

The quality of RWB speech is evaluated using LSD measure and is calculated using the formula

$$LSD =\frac{1}{2\pi }{\int }_{-\pi }^{\pi }{\left(20{\mathrm{log}}_{10}\frac{{g}_{p}}{\left|{a}_{s}\left({e}^{jw}\right)\right|}-20{\mathrm{log}}_{10}\frac{{\hat{g} }_{p}}{\left|{\hat{a} }_{s}\left({e}^{jw}\right)\right|}\right)}^{2}dw$$
(10)

where \({g}_{p}\) is the gain of the UB signal, \(\frac{1}{{a}_{s}\left({e}^{jw}\right)}\) is the SPENV of the UB signal, \({\hat{g} }_{p}\) is the gain of the reconstructed UB signal, and \(\frac{1}{{\hat{a} }_{s}\left({e}^{jw}\right)}\) is the SPENV of the reconstructed UB signal. The lower the LSD, the better the quality of the RWB signal. The LSD values of the conventional (Bhatt & Kosta, 2015; Chen & Leung, 2005, 2007; Chen et al., 2013; Prasad & Kishore Kumar, 2017; Sunil Kumar & Kishore Kumar, 2019) and proposed techniques with a µ-law channel model are presented in Table 6, which shows that the RWB signal quality of the proposed technique is far superior to that of the conventional techniques. In addition, the proposed technique offers an LSD of 2.2248, indicating that the quality of its RWB speech is almost equal to that of the original WB speech. These LSD values support the good RWB signal performance of the proposed technique already found in the subjective tests. With the AWGN channel model, the proposed technique offers an LSD of 2.35.
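A numerical evaluation of Eq. (10) can be sketched as follows, approximating the integral by a midpoint sum over a uniform frequency grid. The polynomial convention \(A(z)=1-\sum_{k}a_{k}{z}^{-k}\), the grid size of 512 points, and the function names are assumptions made for this illustration.

```python
import cmath
import math


def envelope_db(gain, lpc, omega):
    """20*log10(gain / |A(e^{jw})|), with A(z) = 1 - sum_k lpc[k-1] z^{-k} (assumed)."""
    a = 1.0 - sum(c * cmath.exp(-1j * omega * (k + 1)) for k, c in enumerate(lpc))
    return 20.0 * math.log10(gain / abs(a))


def lsd(gain, lpc, gain_hat, lpc_hat, points=512):
    """Midpoint-rule approximation of Eq. (10) on a uniform grid over [-pi, pi]."""
    total = 0.0
    for m in range(points):
        omega = -math.pi + 2.0 * math.pi * (m + 0.5) / points
        diff = envelope_db(gain, lpc, omega) - envelope_db(gain_hat, lpc_hat, omega)
        total += diff * diff
    # (1 / 2pi) * sum * (2pi / points) reduces to the mean over the grid
    return total / points
```

Identical envelopes give an LSD of zero, and a pure gain mismatch by a factor of two gives \({\left(20{\mathrm{log}}_{10}2\right)}^{2}\), which is a convenient sanity check for any implementation of this measure.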

Table 6 LSD test results

3.2.2 Perceptual transparency

Perceptual transparency is evaluated by providing the NB and CNB signals as inputs and comparing them to rate the speech quality. The NB-PESQ value ranges between −0.5 and 4.5, where a higher value indicates better quality. The average NB-PESQ values of the conventional (Bhatt & Kosta, 2015; Chen & Leung, 2005, 2007; Chen et al., 2013; Prasad & Kishore Kumar, 2017; Sunil Kumar & Kishore Kumar, 2019) and proposed methods are tabulated in Table 7. The proposed technique gave an NB-PESQ value of 4.47, which indicates better perceptual transparency than the traditional techniques, as already confirmed by the subjective listening tests.

Table 7 NB-PESQ test results

In Fig. 3, the upper plot a depicts the spectrogram of the NB speech \({S}_{nb}\left(n\right)\), whereas the lower plot b depicts the spectrogram of the CNB signal \({S}_{nb}^{1}\left(n\right)\). It is clear from the figure that \({S}_{nb}\left(n\right)\) and \({S}_{nb}^{1}\left(n\right)\) are almost indistinguishable.

Fig. 3 Spectrograms from top to bottom: a Composite NB speech, b NB speech

3.2.3 Robustness of embedded information

AWGN with an SNR between 15 and 35 dB (Keiser and Strange 2017) is added to the CNB signal. The robustness of the proposed technique is evaluated using the MSE, which is calculated as

$$MSE=\frac{1}{N}\sum_{n=0}^{N-1}{\left({S}_{wb}^{1}\left(n\right)-{S}_{wb}(n)\right)}^{2}$$
(11)

where the RWB signal is represented by \({S}_{wb}^{1}\left(n\right)\) and the original WB signal is represented by \({S}_{wb}(n)\). The spreading sequence length is 16. The smaller the MSE, the better the quality of the RWB signal. Over the SNR range of 15 to 35 dB, the proposed technique gave MSE values below \(7.7083\times {10}^{-4}\), indicating that the RWB signal quality obtained by the proposed technique is excellent. After adding quantization noise (μ-law) to \({S}_{nb}^{1}\left(n\right)\), the proposed technique gave an MSE value of \(5.78\times {10}^{-4}\), which again indicates that the RWB signal quality obtained by the proposed technique is excellent.
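The two channel models and the MSE of Eq. (11) can be sketched as follows. The µ-law parameters (µ = 255 with 256 uniform levels in the companded domain, signal normalized to the range −1 to 1) and the function names are assumptions for illustration; the paper does not give the exact channel settings.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the signal-to-noise ratio is snr_db."""
    power = sum(x * x for x in signal) / len(signal)
    sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, sigma) for x in signal]


def mu_law_channel(signal, mu=255.0, levels=256):
    """Compress with mu-law, quantize uniformly to `levels` steps, then expand."""
    out = []
    for x in signal:
        x = max(-1.0, min(1.0, x))  # assumes samples normalized to [-1, 1]
        y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
        q = round((y + 1.0) / 2.0 * (levels - 1)) / (levels - 1) * 2.0 - 1.0
        out.append(math.copysign(math.expm1(abs(q) * math.log1p(mu)) / mu, q))
    return out


def mse(reference, estimate):
    """Mean square error between two equal-length signals, as in Eq. (11)."""
    return sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
```

In an experiment along the lines described here, the CNB signal would be passed through `add_awgn` or `mu_law_channel` before the receiver, and `mse` would then compare the RWB signal with the original WB signal.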

3.2.4 WB speech quality

The quality of the RWB speech is evaluated by providing the WB and RWB signals as inputs and comparing them to rate the speech quality. The average WB-PESQ values of the conventional (Bhatt & Kosta, 2015; Chen & Leung, 2005, 2007; Chen et al., 2013; Prasad & Kishore Kumar, 2017; Sunil Kumar & Kishore Kumar, 2019) and proposed methods are shown in Table 8. A WB-PESQ value of 4.45 confirms that the RWB signal quality obtained by the proposed technique is better than that of the traditional techniques, as already found in the subjective listening tests. Thus, the speech quality was improved by using the proposed technique.

Table 8 WB-PESQ test results

4 Conclusions

In this paper, SBE utilizing the DCTBDH technique for extending the bandwidth of the existing NB telephone networks has been proposed. The spread spectral envelope parameters of the UB signal are embedded within the DCT coefficients of the NB signal at the transmitter. The embedded information is extracted at the receiver end to reconstruct the WB signal of good quality.

The spread spectrum technique is employed to increase the robustness of the embedded UB information to quantization and channel noises: the spectral envelope parameters are multiplied by spreading sequences and the results are added together to form the embedded information, which can then be reliably recovered by using a correlator. The robustness of the proposed method to quantization and channel noises is confirmed by the MSE test. The MOS and NB-PESQ values obtained for the proposed method indicate that it embeds the UB information more transparently than the conventional methods. The subjective listening, LSD, and WB-PESQ tests show that the proposed technique enhances the RWB signal quality over the conventional techniques. The proposed method is thus demonstrated to produce a much better-quality speech signal than the conventional techniques, and it is suitable for extending the bandwidth of existing telephone networks without making changes to them.