1 Introduction

The public switched telephone network uses speech codecs for low-latency communication. It is desirable to utilize wide bandwidth (WB) speech codecs, such as the standardized AMR-WB [17], because they improve quality significantly. However, the legacy analog telephone system supports only narrow bandwidth (NB) speech codecs, such as G.711 [14] and G.729 [18]. NB speech signals are limited to the bandwidth of 0.3–3.4 kHz, resulting in a perceived reduction in quality compared to WB speech signals with a bandwidth of 0.3–7 kHz [40]. Renovating the legacy analog telephone system also requires considerable effort on both the sender and receiver sides [22]. To enhance the quality of NB speech signals, speech enhancement approaches have therefore received much attention.

A speech bandwidth extension (BWE) method is a speech enhancement approach that reconstructs the missing upper bandwidth (UB) spectrum of 3.4–7 kHz using a source-filter model of speech production [21]. The source-filter model represents a speech signal as a convolution of an excitation signal and a spectral envelope [24]. A UB excitation signal is generated from the existing NB excitation signal by frequency shifting [25] or noise modulation [35]. A UB spectral envelope is estimated using codebooks [38] based on statistical models, such as Gaussian mixture models [34] and neural networks [1]. However, BWE methods face a fundamental performance limitation in reconstructing the missing UB spectrum. Jax et al. showed that the maximum achievable performance of UB spectral envelope estimation depends on the mutual information (MI) between NB and UB spectra, and that this MI is too low to reconstruct the missing UB spectrum reliably [20]. This is because an NB spectrum has a one-to-many relationship with UB spectra [2]. Nilsson et al. also demonstrated that the MI of consonants, especially fricatives, is lower than that of vowels [26].

A solution to this performance limitation is to transmit side information about the missing UB spectrum. However, transmitting both side information and an NB speech signal may cause high-latency communication due to the increased amount of information. Researchers have therefore devised BWE methods that transmit side information using speech steganography without increasing the amount of information [4,5,6,7, 12, 28,29,30, 37, 39]. Speech steganography embeds side information into a hidden channel of the NB speech signal to generate a composite narrow bandwidth (CNB) speech signal. When the receiver side supports speech steganography, the missing UB spectrum can be reconstructed using the side information extracted from the CNB speech signal. Otherwise, the received CNB speech signal is used directly as an NB speech signal. BWE methods using speech steganography therefore need to minimize the quality decline of the CNB speech signal caused by embedding side information.

A BWE method using bitstream data hiding embeds side information into the bitstream of the encoded NB speech signal [7]. BWE methods using a joint coding technique likewise incorporate the embedding of side information into the encoding of NB speech signals [28, 39]. Although the quality of the CNB speech signal is equivalent to that of the NB speech signal, these methods work only with a specific NB speech codec. To support various NB speech codecs, BWE methods using signal-domain speech steganography have been devised, which embed side information into an NB speech signal before encoding [4,5,6, 12, 29, 30, 37]. Various feature vectors have been utilized as side information to reconstruct the missing UB spectrum.

Prasad et al. adopted code excited linear prediction (CELP) parameters as feature vectors [30]. Since CELP parameters are generally defined for NB speech signals, this method must specify dedicated CELP parameters for UB speech signals. Methods using part of the UB power spectrum have also been devised [6, 12]. The most straightforward approach is to utilize a relative gain between NB and UB excitation signals together with line spectral frequencies (LSF) representing a UB spectral envelope [4, 5, 29, 37]. The relative gain is required to avoid overestimating the power of the UB speech signal [3, 27]. The sender side converts the feature vectors into a binary signal using a codebook. The binary signal is then embedded into an NB speech signal using signal-domain speech steganography based on some transform domain.

Chen et al. embedded a binary signal into an NB speech signal using dither quantization in the time domain [4, 5]. Although dither quantization requires little processing time, bit errors occur in the binary signal due to artifacts such as speech codecs and noise in the telephone system. Sagi et al. embedded a binary signal using the scalar Costa scheme in the discrete Hartley transform (DHT) domain [37]. While this scheme is robust against artifacts, the capacity of the embedded binary signal depends on the power of the NB speech signal.

Prasad et al. proposed a BWE method using transform-domain data hiding (TDDH) based on the discrete Fourier transform (DFT) domain [29]. TDDH is a robust signal-domain speech steganography scheme that converts a binary signal into a hidden vector and embeds it into the magnitude spectrum in the high-frequency bandwidth of 3.4–4 kHz, independently of the power of the NB speech signal. Since human hearing is relatively insensitive to the distortion that TDDH introduces into the magnitude spectrum in the high-frequency bandwidth [31, 32], the quality of the CNB speech signal is almost equivalent to that of the NB speech signal. However, the hidden vector may contain negative values, whereas the magnitude spectrum accepts only nonnegative values. An offset must therefore be added when embedding a hidden vector into the magnitude spectrum, which degrades the quality of the CNB speech signal. Besides, the UB speech signal is generated without overlapping, which results in discontinuities between frames. It is also difficult to reproduce slight changes in UB sound pressure when a single relative gain is calculated per frame.

In this paper, we propose a BWE method using TDDH based on the DHT domain. The proposed method has two advantages. First, we avoid setting the offset by embedding the hidden vector into the amplitude spectrum of the NB speech signal in the DHT domain, where negative values are accepted. Furthermore, the conventional method [29] embeds a common hidden vector into the bandwidths of 3.4–4 kHz and 4–4.6 kHz in the DFT domain because of the symmetry at the Nyquist frequency, whereas the proposed method embeds different hidden vectors into these bandwidths because the DHT domain is asymmetric. Embedding the hidden vector into this wider bandwidth improves the robustness against artifacts. Second, the proposed method generates the UB speech signal with overlapping to avoid discontinuities between frames. Here, relative gains are calculated in sub-frames to reproduce slight changes in UB sound pressure over time.

This paper is organized as follows: In Sect. 2, we present a method of generating a CNB speech signal. Section 3 describes a BWE method using side information extracted from the CNB speech signal. We analyze the performance of TDDH based on the DHT domain in Sect. 4. Subjective listening tests and objective measures for the proposed method are discussed in Sect. 5. Finally, conclusions are given in Sect. 6.

Fig. 1 Block diagram of the CNB speech signal generation

2 Composite Narrow Bandwidth Speech Signal Generation

Figure 1 shows a block diagram of the CNB speech signal generation. First, an input speech signal is separated into NB and UB speech signals using a band-pass filter with cutoff frequencies of 0.3 kHz and 3.4 kHz and a high-pass filter with a cutoff frequency of 3.4 kHz. To reduce redundancy, the UB speech signal is frequency-shifted and down-sampled. Feature vectors of LSF and relative gains are extracted from the frequency-shifted and down-sampled UB speech signal and then quantized into a binary signal using two codebooks. The binary signal is converted into a hidden vector using the spread spectrum scheme with a pseudo-noise (PN) sequence to enhance the robustness against artifacts. Finally, the hidden vector is embedded into the amplitude spectrum in the high-frequency bandwidth of 3.4–4.6 kHz using TDDH based on the DHT domain, and a CNB speech signal is generated by the inverse DHT (iDHT).
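This band-splitting front end can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the paper does not specify the filter family, order, or frequency-shifting technique, so the sixth-order Butterworth design, the analytic-signal downshift, and the function name split_bands are ours.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, decimate

def split_bands(x, fs=16000):
    """Split a WB speech signal into an NB part (0.3-3.4 kHz) and a UB part
    (3.4-7 kHz), then shift the UB part down to baseband and down-sample it."""
    # Band-pass filter for the NB speech signal (0.3-3.4 kHz)
    sos_nb = butter(6, [300, 3400], btype='bandpass', fs=fs, output='sos')
    x_nb = sosfiltfilt(sos_nb, x)
    # High-pass filter for the UB speech signal (above 3.4 kHz)
    sos_ub = butter(6, 3400, btype='highpass', fs=fs, output='sos')
    x_ub = sosfiltfilt(sos_ub, x)
    # Frequency shift: the analytic signal keeps only positive frequencies,
    # so multiplying by exp(-j*2*pi*3400*t) moves 3.4-7 kHz down to 0-3.6 kHz
    n = np.arange(len(x_ub))
    x_ub_shift = np.real(hilbert(x_ub) * np.exp(-2j * np.pi * 3400 * n / fs))
    # Down-sample the shifted UB signal from 16 kHz to 8 kHz
    x_ub_down = decimate(x_ub_shift, 2)
    return x_nb, x_ub_down
```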

Fig. 2 Sub-frame definition for the relative gain calculation

The proposed method calculates the LSF of the UB speech signal without overlapping. First, autoregressive (AR) coefficients in the l-th frame \(a_{l}(j)\) \((j=1, \ldots ,J)\) are given by solving the following equation using the Levinson–Durbin algorithm:

$$\begin{aligned} \sum _{j=1}^{J} a_{l}(j) r_l(|q-j|) = r_l(q), \ q=1,\ldots ,J, \end{aligned}$$
(1)

where \(r_l(q)\) and J denote a modified autocorrelation coefficient and the order of the AR coefficients, respectively. The AR coefficients are then converted into LSF \(F_{l}(j)\) to suppress the quantization error [13].
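For reference, Eq. (1) can be solved with the standard Levinson–Durbin recursion. The following Python sketch is illustrative; the convention that r holds \(r_l(0), \ldots , r_l(J)\) and the function name are our assumptions.

```python
import numpy as np

def levinson_durbin(r, J):
    """Solve Eq. (1) for the AR coefficients a_l(1..J), given the
    autocorrelation coefficients r[0..J]."""
    a = np.zeros(J + 1)               # a[j] corresponds to a_l(j); a[0] unused
    e = r[0]                          # prediction error energy
    for q in range(1, J + 1):
        # Reflection coefficient for order q
        k = (r[q] - np.dot(a[1:q], r[q - 1:0:-1])) / e
        a_prev = a.copy()
        a[q] = k
        for j in range(1, q):         # update the lower-order coefficients
            a[j] = a_prev[j] - k * a_prev[q - j]
        e *= (1.0 - k * k)            # shrink the error energy
    return a[1:], e
```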

Next, the proposed method calculates a relative gain between the NB and UB excitation signals in each sub-frame. Figure 2 shows the sub-frame definition for the relative gain calculation; in this paper, we consider that the UB speech signal is generated with 75% overlapping at the receiver side. Let \(x_{l, m}(n)\) (\(n=0, \ldots , N-1\)) denote a speech signal in the m-th sub-frame, where N denotes the number of frame samples. An excitation signal \(u_{l, m}(n)\) is defined as:

$$\begin{aligned} u_{l,m}(n)= & {} x_{l,m}(n) - \sum _{j=1}^{J} {a}_l(j) x_{l,m}(|n-j|). \end{aligned}$$
(2)

Let \(u_{l, m}^{\mathrm{NB}}(n)\) and \(u_{l, m}^{\mathrm{UB}}(n)\) denote the NB and UB excitation signals, respectively. The proposed method calculates a relative gain \(G_{l,m}\) as follows:

$$\begin{aligned} G_{l,m}= & {} 20 \log _{10} \left\{ \sum _{n=0}^{N-1} \left( u_{l, m}^{\mathrm{UB}} (n) \right) ^2 \right\} - 20 \log _{10} \left\{ \sum _{n=0}^{N-1} \left( u_{l, m}^{\mathrm{NB}} (n) \right) ^2 \right\} . \end{aligned}$$
(3)

Finally, we obtain the feature vectors for the LSF \(\mathbf{C}^\mathrm{F}_{l}=[F_{l}(1), \ldots , F_{l}(J)]^{\mathrm{T}}\) and the relative gains \(\mathbf{C}^{\mathrm{G}}_{l} = [G_{l, 1}, \ G_{l, 2}, \ G_{l, 3}, \ G_{l, 4}]^{\mathrm{T}}\), where \([\cdot ]^{\mathrm{T}}\) denotes the transpose operation.
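The per-sub-frame calculation of Eqs. (2)–(3) can be sketched as follows, keeping the \(|n-j|\) indexing of Eq. (2). Pairing each band with its own AR coefficients is our assumption; the paper leaves this implicit.

```python
import numpy as np

def excitation(x, a):
    """Excitation signal of Eq. (2): the residual of the LP analysis filter,
    with the |n - j| indexing used in the paper."""
    n = np.arange(len(x))
    u = x.astype(float)
    for j in range(1, len(a) + 1):
        u = u - a[j - 1] * x[np.abs(n - j)]
    return u

def relative_gain(x_nb, x_ub, a_nb, a_ub):
    """Relative gain of Eq. (3) between the NB and UB excitation signals
    of one sub-frame, in dB."""
    u_nb = excitation(x_nb, a_nb)
    u_ub = excitation(x_ub, a_ub)
    return (20 * np.log10(np.sum(u_ub ** 2))
            - 20 * np.log10(np.sum(u_nb ** 2)))
```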

When the feature vectors are grouped and converted into a single binary signal using one codebook, the length of the binary signal needs to be increased to suppress the quantization error [23]. Nevertheless, as the length of the binary signal increases, the quality decline of the CNB speech signal due to TDDH becomes more serious [32]. Hence, the proposed method converts each feature vector into a binary signal separately using its own codebook, as is done in G.729 [18]. Let \(2^{N^{\mathrm{F}}}\) and \(2^{N^\mathrm{G}}\) \((N^{\mathrm{F}}, \ N^{\mathrm{G}} > 0)\) denote the sizes of the codebooks for the LSF and the relative gains, respectively. The feature vectors are quantized with \(N^{\mathrm{F}}\) and \(N^{\mathrm{G}}\) binary digits, respectively. In this paper, we obtain the codebooks by the Linde–Buzo–Gray training algorithm [23]. The proposed method generates a binary signal \( b_{l}(i) (\in \{1, \ -1\}, i=0, 1, \ldots , N^{\mathrm{S}}-1)\) by combining these binary signals, where \(N^{\mathrm{S}}\) denotes the total bit length such that \(N^{\mathrm{S}} = N^\mathrm{F} + N^{\mathrm{G}}\). In addition, a synchronization sequence such as \(111\ldots 11\) is prepared to accomplish frame synchronization between the sender and receiver sides [10].

To enhance the robustness against artifacts, the proposed method converts the binary signal into a hidden vector using the spread spectrum scheme [8]. Let P denote the length of the bandwidth in which the hidden vector is embedded in the amplitude spectrum. With a PN sequence \( Q(p, \ i) (\in \{1, \ -1\}, p=0, 1, \ldots , P-1)\), we obtain the hidden vector

$$\begin{aligned} E_l(p) = \beta \cdot \sum _{i=0}^{N^{\mathrm{S}}-1} Q(p, \ i) b_{l}(i), \end{aligned}$$
(4)

where \(\beta (>0)\) denotes the strength of TDDH. The larger \(\beta \) is, the more robust the hidden vector is against artifacts, but the more severe the quality decline of the CNB speech signal due to TDDH becomes, and vice versa. Also, as shown in Sect. 4, we set \(\beta \) to a positive value to avoid inverting the binary signal extracted from the CNB speech signal. In this paper, we empirically fix \(\beta =0.01\). Besides, we generate the PN sequence using Hadamard codes [9].
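A minimal sketch of the spreading step of Eq. (4) follows. Note that scipy.linalg.hadamard only constructs Hadamard matrices whose order is a power of two, so this sketch assumes such a P; the paper's codes for other lengths (e.g., P = 24 in Sect. 5) would need a different construction, such as a Paley-type one.

```python
import numpy as np
from scipy.linalg import hadamard

def spread(b, P, beta=0.01):
    """Spread-spectrum encoding of Eq. (4): map the +/-1 binary signal b of
    length N_S to a hidden vector of length P using a Hadamard-based PN
    sequence Q(p, i)."""
    N_S = len(b)
    Q = hadamard(P)[:, :N_S]     # P x N_S PN matrix with entries in {+1, -1}
    return beta * (Q @ np.asarray(b)), Q

# Example: spread a 12-bit binary signal into a 16-bin hidden vector
E, Q = spread(np.random.choice([1, -1], size=12), P=16)
```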

The proposed method embeds the hidden vector into the amplitude spectrum in the high-frequency bandwidth of 3.4–4.6 kHz without overlapping. Let \({x}_l^{\mathrm{NB}}(n)\) denote an NB speech signal. We define the amplitude spectrum \(H_l(k) (k=0, 1, \ldots , N-1)\) as

$$\begin{aligned} H_l(k)= & {} \sum ^{N-1}_{n=0} {x}^{\mathrm{NB}}_l(n) \text{ cas } \left( \frac{2 \pi nk}{N}\right) , \end{aligned}$$
(5)

with

$$\begin{aligned} \text{ cas }(t)= & {} \text{ cos }(t) + \text{ sin }(t). \end{aligned}$$
(6)

The proposed method then embeds the hidden vector into the amplitude spectrum as follows:

$$\begin{aligned} H'_l(k)= & {} \left\{ \begin{array}{lll} H_l(k), &{} k = 0, \ldots , (N-P)/2-1\\ E_l\bigl (k-(N-P)/2\bigr ), &{} k = (N-P)/2, \ldots , (N+P)/2-1 \\ H_l(k), &{} k = (N+P)/2, \ldots , N-1 \end{array} \right. . \end{aligned}$$
(7)

Finally, we obtain a CNB speech signal \(y_l(n)\) as follows:

$$\begin{aligned} y_l(n)= & {} \frac{1}{N} \sum ^{N-1}_{k=0} {H'}_{l}(k) \text{ cas } \left( \frac{2 \pi nk}{N}\right) . \end{aligned}$$
(8)
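Equations (5)–(8) are compact to implement because the DHT can be computed from the FFT and is self-inverse up to a factor of 1/N. The following Python sketch embeds a hidden vector into one frame; the function names are illustrative.

```python
import numpy as np

def dht(x):
    """DHT of Eqs. (5)-(6) via the FFT: H(k) = Re{X(k)} - Im{X(k)}."""
    X = np.fft.fft(x)
    return X.real - X.imag

def embed(x_nb, E):
    """Embed the hidden vector E into the mid-band bins of the amplitude
    spectrum (Eq. (7)) and return the CNB speech frame (Eq. (8))."""
    N, P = len(x_nb), len(E)
    H = dht(x_nb)
    H[(N - P) // 2:(N + P) // 2] = E   # bins covering 3.4-4.6 kHz
    return dht(H) / N                  # iDHT: the DHT is its own inverse / N
```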
Fig. 3 Block diagram of the BWE method using side information extracted from the CNB speech signal

3 Speech Bandwidth Extension Using Side Information Extracted from CNB Speech Signal

Figure 3 shows a block diagram of the BWE method using side information extracted from a CNB speech signal. First, the hidden vector is extracted from the amplitude spectrum of the received CNB speech signal. An NB speech signal is then generated by the iDHT from the amplitude spectrum after the hidden vector has been removed. The hidden vector is converted into a binary signal using the PN sequence. The feature vectors for the LSF and the relative gains are retrieved from the binary signal using the codebooks. A UB excitation signal is generated from the NB excitation signal and the retrieved relative gains, and a UB spectral envelope is obtained from the retrieved LSF. The proposed method calculates the UB speech signal with overlapping to avoid discontinuities between frames. Finally, the frame-shifted and up-sampled UB speech signal is added to the up-sampled NB speech signal to generate a WB speech signal.

Let \({\hat{H}}_{l}(k)\) be the amplitude spectrum of the received CNB speech signal. A hidden vector \({\hat{E}}_{l}(p)\) is extracted as follows:

$$\begin{aligned} {\hat{E}}_{l}(p)= & {} {\hat{H}}_{l} \bigl ((N-P)/2+p \bigr ). \end{aligned}$$
(9)

The extracted hidden vector is then converted into a binary signal

$$\begin{aligned} {\hat{b}}_{l}(i) = \text{ sgn } \left[ \sum _{p=0}^{P-1} Q(p, \ i) {\hat{E}}_{l}(p) \right] , \end{aligned}$$
(10)

where \(\text{ sgn }[\cdot ]\) denotes the sign function. The retrieved feature vectors for the LSF \(\hat{\mathbf{C}}^{\mathrm{F}}_{l}=[{\hat{F}}_{l}(1), \ldots , {\hat{F}}_{l}(J)]^{\mathrm {T}}\) and the relative gains \(\hat{\mathbf{C}}^{\mathrm{G}}_{l} = [{\hat{G}}_{l, 1}, \ {\hat{G}}_{l, 2}, \ {\hat{G}}_{l, 3}, \ {\hat{G}}_{l, 4}]^{\mathrm{T}}\) are obtained from the binary signal using the codebooks. The proposed method also reuses the CNB speech signal as an NB speech signal. Let \(\hat{H'}_l(k)\) denote the amplitude spectrum from which the hidden vector has been removed:

$$\begin{aligned} \hat{H'}_l(k)= & {} \left\{ \begin{array}{lll} {\hat{H}}_l(k), &{}\quad k = 0, \ldots , (N-P)/2-1\\ 0, &{}\quad k = (N-P)/2, \ldots , (N+P)/2-1 \\ {\hat{H}}_l(k), &{}\quad k = (N+P)/2, \ldots , N-1 \end{array} \right. . \end{aligned}$$
(11)

An NB speech signal is calculated from the amplitude spectrum using the iDHT as follows:

$$\begin{aligned} {\hat{x}}^{\mathrm{NB}}_l(n)= & {} \frac{1}{N} \sum ^{N-1}_{k=0} {\hat{H'}}_{l}(k) \text{ cas } \left( \frac{2 \pi nk}{N}\right) . \end{aligned}$$
(12)
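A matching receiver-side sketch of Eqs. (9)–(12) follows; Q is the same P × N^S PN matrix used by the sender, and the dht helper is as in the sketch of Sect. 2.

```python
import numpy as np

def dht(x):
    """DHT via the FFT, as in the sender-side sketch."""
    X = np.fft.fft(x)
    return X.real - X.imag

def extract(y_cnb, Q, P):
    """Extract the hidden vector (Eq. (9)), despread it to a +/-1 binary
    signal (Eq. (10)), zero the embedding band (Eq. (11)), and recover an
    NB speech frame by the iDHT (Eq. (12))."""
    N = len(y_cnb)
    H = dht(y_cnb)
    lo, hi = (N - P) // 2, (N + P) // 2
    E_hat = H[lo:hi].copy()        # Eq. (9)
    b_hat = np.sign(Q.T @ E_hat)   # Eq. (10)
    H[lo:hi] = 0.0                 # Eq. (11)
    x_nb_hat = dht(H) / N          # Eq. (12)
    return b_hat, x_nb_hat
```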

With \(\hat{\mathbf{C}}^{\mathrm{G}}_{l}\), the proposed method generates a UB excitation signal \(\hat{{u}}^{\mathrm{UB}}_{l, m}(n)\) as follows:

$$\begin{aligned} \hat{{u}}^{\mathrm{UB}}_{l, m}(n)= & {} \sqrt{{10}^{\frac{{\hat{G}}_{l, m}}{20}}} \hat{{u}}^{\mathrm{NB}}_{l,m}(n), \end{aligned}$$
(13)

where \(\hat{{u}}^{\mathrm{NB}}_{l, m}(n)\) denotes an NB excitation signal. Let \({{\hat{a}}}_{l}(j) \) denote AR coefficients converted from \({\hat{F}}_{l}(j)\). We generate a UB speech signal

$$\begin{aligned} {\hat{x}}^{\mathrm{UB}}_{l, m}(n)= & {} {\hat{u}}^{\mathrm{UB}}_{l, m}(n) + \sum _{j=1}^{J} {{\hat{a}}}_{l}(j) {\hat{x}}^{\mathrm{UB}}_{l,m}(|n-j|). \end{aligned}$$
(14)

Finally, we obtain a WB speech signal by adding the up-sampled NB speech signal to the frame-shifted and up-sampled UB speech signal with overlapping.
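The synthesis of Eqs. (13)–(14) for one sub-frame can be sketched as follows. The literal \(|n-j|\) indexing of Eq. (14) is kept, so taps with \(n < j\) read from the (initially zero) output buffer.

```python
import numpy as np

def synthesize_ub(u_nb, G_hat, a_hat):
    """Scale the NB excitation by the retrieved relative gain (Eq. (13)) and
    run the AR synthesis recursion of Eq. (14) for one sub-frame."""
    u_ub = np.sqrt(10 ** (G_hat / 20)) * u_nb          # Eq. (13)
    N, J = len(u_ub), len(a_hat)
    x_ub = np.zeros(N)
    for n in range(N):                                 # Eq. (14)
        x_ub[n] = u_ub[n] + sum(a_hat[j - 1] * x_ub[abs(n - j)]
                                for j in range(1, J + 1))
    return x_ub
```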

4 Performance Analysis for Transform-Domain Data Hiding Based on Discrete Hartley Transform Domain

This section discusses the performance analysis of TDDH based on the DHT domain. We assume that the CNB speech signal suffers from artifacts modeled as additive white Gaussian noise (AWGN). The received CNB speech signal \({\hat{y}}_{l}(n)\) is written as:

$$\begin{aligned} {\hat{y}}_l(n) = y_l(n) + g(n), \end{aligned}$$
(15)

where g(n) denotes white Gaussian noise. In the DHT domain, Eq. (15) is interpreted as:

$$\begin{aligned} {\hat{H}}_l(k) = H'_l(k) + J(k), \end{aligned}$$
(16)

where J(k) denotes the amplitude spectrum of g(n). By substituting Eqs. (7) and (16) into Eq. (9), the following relation is obtained:

$$\begin{aligned} {\hat{E}}_l(p)= & {} E_l(p) + J'(p), \end{aligned}$$
(17)

with

$$\begin{aligned} J'(p) =J \bigl ((N-P)/2+p \bigr ). \end{aligned}$$
(18)

By substituting Eqs. (4) and (17) into Eq. (10), we obtain

$$\begin{aligned} {\hat{b}}_{l}(i)= & {} \text{ sgn } \left[ \sum _{p=0}^{P-1} Q(p, \ i) \left( E_{l}(p)+ J'(p) \right) \right] \nonumber \\= & {} \text{ sgn } \left[ \sum _{p=0}^{P-1} \left( \beta Q(p, \ i)Q(p, \ i) b_{l}(i)\right. \right. \nonumber \\&\left. \left. + \sum _{i' \ne i} \beta Q(p, \ i)Q(p, \ i') b_{l}(i') + Q(p, \ i)J'(p) \right) \right] . \end{aligned}$$
(19)

Note that a PN sequence is orthogonal such that \(\sum _{p=0}^{P-1} Q(p, \ i)Q(p, \ i') = P \cdot \delta _{i-i'}\), where \(\delta _{i}\) is the Kronecker delta. Equation (19) is thus rewritten as

$$\begin{aligned} {\hat{b}}_{l}(i)= & {} \text{ sgn } \left[ \beta P b_{l}(i) + \sum _{p=0}^{P-1} Q(p, \ i)J'(p) \right] . \end{aligned}$$
(20)

In a clean environment with \(J'(p) = 0\), we have \({\hat{b}}_{l}(i) = \text{ sgn } [ \beta P b_{l}(i) ]\). Because \(P>0\), this reduces to \({\hat{b}}_{l}(i) = \text{ sgn } [ \beta b_{l}(i) ]\). If \(\beta <0\), we have \({\hat{b}}_{l}(i) \ne b_{l}(i)\). Hence, we set \(\beta >0\). A bit error occurs when \(\text{ sgn } [ b_{l}(i) ] \cdot \text{ sgn } \Bigl [ \sum _{p=0}^{P-1} Q(p, \ i)J'(p) \Bigr ] = -1\) and \(|\beta P b_{l}(i)| \le |\sum _{p=0}^{P-1} Q(p, \ i)J'(p)|\).

We define a variable \({\hat{d}}_{l}(i)\) from Eq. (20) as

$$\begin{aligned} {\hat{d}}_{l}(i)= & {} \beta P b_{l}(i) + \sum _{p=0}^{P-1} Q(p, \ i)J'(p). \end{aligned}$$
(21)

According to the central limit theorem [11], the conditional probability distribution \(f({\hat{d}}_{l}(i) \ | \ b_l(i)) \) is given as:

$$\begin{aligned} f({\hat{d}}_{l}(i) \ | \ b_l(i)=1)= & {} \frac{1}{\sqrt{2 \pi \sigma _\mathrm{Q}^2}} e^{-\frac{({\hat{d}}_l(i)-\beta P)^2}{2 \sigma _\mathrm{Q}^2}}, \end{aligned}$$
(22)
$$\begin{aligned} f({\hat{d}}_{l}(i) \ | \ b_l(i)=-1)= & {} \frac{1}{\sqrt{2 \pi \sigma _\mathrm{Q}^2}} e^{-\frac{({\hat{d}}_l(i)+\beta P)^2}{2 \sigma _\mathrm{Q}^2}}, \end{aligned}$$
(23)

where \(\sigma _\mathrm{Q}^2\) denotes the variance of the variable \(\sum _{p=0}^{P-1} Q(p, \ i)J'(p)\). We then transform \(\sigma _\mathrm{Q}^2\) as

$$\begin{aligned} \sigma _\mathrm{Q}^2= & {} {\text{ E } \left[ { \left( \sum _{p=0}^{P-1} Q(p, \ i)J'(p) \right) }^2 \right] }\nonumber \\= & {} \text{ E } \left[ \sum _{p=0}^{P-1}\sum _{p'=0}^{P-1} Q(p, \ i)Q(p', \ i)J'(p)J'(p') \right] \nonumber \\= & {} \text{ E } \left[ \sum _{p=0}^{P-1}\sum _{p'=0}^{P-1} N^{\mathrm{S}} \cdot \delta _{p-p'}J'(p)J'(p') \right] \nonumber \\= & {} N^{\mathrm{S}} \cdot \sum _{p=0}^{P-1} \text{ E } \left[ J'(p)^2 \right] \nonumber \\= & {} N^{\mathrm{S}} P \sigma _\mathrm{J}^2 \end{aligned}$$
(24)

where \(\sigma _\mathrm{J}^2\) denotes the variance of \(J'(p)\). In the case of \({\hat{d}}_{l}(i)>0\), \({\hat{b}}_{l}(i)=1\) from Eq. (20). The conditional probability \(p({\hat{b}}_{l}(i)=1 \ | \ b_l(i)=-1 )\) is thus given as:

$$\begin{aligned} p({\hat{b}}_{l}(i)=1 \ | \ b_l(i)=-1 )= & {} \int _0^\infty f({\hat{d}}_{l}(i) \ | \ b_l(i)=-1) \ \text{ d }{\hat{d}}_{l}(i)\nonumber \\= & {} \frac{1}{\sqrt{2 \pi \sigma _\mathrm{Q}^2}} \int _0^\infty e^{-\frac{({\hat{d}}_l(i)+\beta P)^2}{2 \sigma _\mathrm{Q}^2}} \ \text{ d }{\hat{d}}_{l}(i)\nonumber \\= & {} \frac{1}{2} \text{ erfc } \left( \sqrt{\frac{{\beta }^2 P^2}{2 \sigma _\mathrm{Q}^2}} \right) \nonumber \\= & {} \frac{1}{2} \text{ erfc } \left( \sqrt{\frac{{\beta }^2 P}{2 N^{\mathrm{S}} \sigma _\mathrm{J}^2}} \right) , \end{aligned}$$
(25)

where \(\text{ erfc }(q) = \frac{2}{\sqrt{\pi }} \int _q^\infty e^{-t^2} \text{ d }t\) denotes the complementary error function. Similarly, the conditional probability \(p({\hat{b}}_{l}(i)=-1 \ | \ b_l(i)=1 )\) is given as:

$$\begin{aligned} p({\hat{b}}_{l}(i)=-1 \ | \ b_l(i)=1 )= & {} \frac{1}{2} \text{ erfc } \left( \sqrt{\frac{{\beta }^2 P}{2 N^{\mathrm{S}} \sigma _\mathrm{J}^2}} \right) . \end{aligned}$$
(26)

We assume that the prior probabilities are equiprobable, i.e., \(p(b_l(i)=1) = p(b_l(i)=-1) = 1/2\), as is commonly assumed in bit error calculations [29, 41]. Based on Eqs. (25) and (26), we calculate the bit error probability

$$\begin{aligned} e_l= & {} p\bigl ({\hat{b}}_{l}(i)=-1 \ | \ b_l(i)=1 \bigr ) \cdot p\bigl (b_l(i)=1\bigr )\nonumber \\&\ + p\bigl ({\hat{b}}_{l}(i)=1 \ | \ b_l(i)=-1 \bigr ) \cdot p\bigl (b_l(i)=-1\bigr )\nonumber \\= & {} \frac{1}{2} \text{ erfc } \left( \sqrt{\frac{{\beta }^2 P}{2 N^{\mathrm{S}} \sigma _\mathrm{J}^2}} \right) . \end{aligned}$$
(27)

We find that \(e_l\) decreases as P increases. That is, TDDH based on the DHT domain improves the robustness against artifacts by embedding the hidden vector into the amplitude spectrum over a wider high-frequency bandwidth.
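Equation (27) is straightforward to evaluate numerically, as in the following sketch; the noise variance used here is an illustrative value, not one taken from the experiments in Sect. 5.

```python
import numpy as np
from scipy.special import erfc

def bit_error_probability(beta, P, N_S, sigma_J2):
    """Bit error probability e_l of Eq. (27) for TDDH under AWGN."""
    return 0.5 * erfc(np.sqrt(beta ** 2 * P / (2 * N_S * sigma_J2)))

# Doubling the embedding bandwidth from P = 12 (DFT) to P = 24 (DHT)
# lowers e_l at every noise level (beta = 0.01, N_S = 12 as in Sect. 5).
for P in (12, 24):
    print(P, bit_error_probability(beta=0.01, P=P, N_S=12, sigma_J2=1e-6))
```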

5 Subjective Listening Tests and Objective Measures

This section describes subjective listening tests and objective measures for the proposed method. First, we verified the quality difference between the NB and CNB speech signals. Second, we evaluated the quality of a generated WB speech signal whose missing UB spectrum was reconstructed using side information extracted from the CNB speech signal. Besides, we verified the robustness against artifacts. In this paper, we assumed noise environments of AWGN at several signal-to-noise ratio (SNR) levels, with and without the speech codecs G.711 [14] and G.729 [18].

We used speech datasets taken from the English speech corpus PTDB-TUG [33] and the Japanese speech corpus ASJ-JNAS [19]. Speech samples taken from PTDB-TUG were used for training the codebooks. One hundred speech samples taken from ASJ-JNAS were used for the performance analysis tests. The sampling rate for speech signals was 16 kHz. We adopted 10th-order LSF \((J=10)\) to represent a UB spectral envelope using the Hamming window. Also, the number of frame samples was \(N=160\) (20 ms). The feature vectors were converted into a binary signal of \(N^{\mathrm{S}} = 12\), where the proposed method utilized two codebooks with \(N^{\mathrm{F}} = 8\) and \(N^{\mathrm{G}} = 4\).

The proposed BWE method using TDDH based on the DHT domain with relative gains (BWE-HG) is compared with three other methods: a BWE method using TDDH based on the DFT domain with a single relative gain (BWE-F) [29], a BWE method using TDDH based on the DHT domain with a single relative gain (BWE-H), and a BWE method using TDDH based on the DFT domain with relative gains (BWE-FG). BWE-F and BWE-FG converted the binary signal into a hidden vector of \(P=12\) and embedded it into the magnitude spectrum in the high-frequency bandwidth of 3.4–4 kHz, where the common hidden vector was also embedded in the bandwidth of 4–4.6 kHz because of the symmetry at the Nyquist frequency. Here, an offset was required to embed a hidden vector with negative values into the magnitude spectrum with nonnegative values. For BWE-F and BWE-FG, Eqs. (4) and (10) are rewritten as:

$$\begin{aligned} E_l(p) = \beta \cdot \sum _{i=0}^{N^{\mathrm{S}}-1} Q(p, \ i) b_{l}(i) + \beta P, \end{aligned}$$
(28)

and

$$\begin{aligned} {\hat{b}}_{l}(i) = \text{ sgn } \left[ \sum _{p=0}^{P-1} Q(p, \ i) \left( {\hat{E}}_{l}(p) - \beta P \right) \right] , \end{aligned}$$
(29)

respectively. BWE-H and BWE-HG converted the binary signal into a hidden vector of \(P=24\) and embedded it into the amplitude spectrum in the high-frequency bandwidth of 3.4–4.6 kHz. While BWE-F and BWE-H calculated a relative gain in each frame and generated the UB speech signal without overlapping, BWE-FG and BWE-HG calculated relative gains in sub-frames and generated the UB speech signal with 75% overlapping. Here, BWE-F and BWE-H grouped the feature vectors as \(\mathbf{C'}_{l}=[F_{l}(1), \ldots , F_{l}(J), G_{l, 1}]^{\mathrm{T}}\) and converted them into a binary signal of \(N^{\mathrm{S}}=12\).

In the subjective listening tests, nine Japanese listeners between the ages of 22 and 24 participated; all had normal hearing and no prior training. Each listener listened to test speech samples generated from two male and two female speakers through headphones (MDR-7506) in a quiet room. A degradation category rating test [15] was employed to evaluate a CNB speech signal in comparison with an NB speech signal based on the degradation mean opinion score (DMOS) in Table 1. An absolute category rating test [15] was also employed to evaluate the generated WB speech signals based on the mean opinion score (MOS) in Table 2.

Table 1 Category for DMOS
Table 2 Category for MOS

For objective quality measurement, we evaluated the perceptual transparency of the CNB speech signal using NB-PESQ [16], which returns a score from −0.5 to 4.5. We also evaluated the perceptual similarity between the original and generated WB speech signals using the log-spectral distance (LSD) [36]. In this paper, we define the LSD as:

$$\begin{aligned} \mathrm{LSD} = \frac{1}{L} \sum _{l=0}^{L-1} \sqrt{\frac{1}{|{\mathcal {K}}|}\sum _{k \in {\mathcal {K}}} \left[ 10 \log _{10}\frac{P_l(k)}{{\hat{P}}_l(k)} \right] ^2}, \end{aligned}$$
(30)

where \(P_{l}(k)\) and \({\hat{P}}_{l}(k)\) denote the power spectra of the original and generated WB speech signals, respectively. Also, \({\mathcal {K}}\) is the set of frequency indices in the bandwidth to be analyzed, and L denotes the number of analyzed frames. We utilized a Hamming window of 32 ms and analyzed the bandwidth of 3.4–7 kHz.
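A sketch of Eq. (30) follows. The paper does not state the analysis hop, so non-overlapping 32 ms frames (512 samples at 16 kHz) and a small constant against division by zero are our assumptions.

```python
import numpy as np

def lsd(x_ref, x_gen, fs=16000, win_len=512, band=(3400, 7000), eps=1e-12):
    """Log-spectral distance of Eq. (30) over the 3.4-7 kHz band with a
    32 ms Hamming window."""
    w = np.hamming(win_len)
    freqs = np.fft.rfftfreq(win_len, 1 / fs)
    K = (freqs >= band[0]) & (freqs <= band[1])   # analyzed frequency indices
    dists = []
    for s in range(0, len(x_ref) - win_len + 1, win_len):
        P_ref = np.abs(np.fft.rfft(w * x_ref[s:s + win_len])) ** 2
        P_gen = np.abs(np.fft.rfft(w * x_gen[s:s + win_len])) ** 2
        d = 10 * np.log10((P_ref[K] + eps) / (P_gen[K] + eps))
        dists.append(np.sqrt(np.mean(d ** 2)))    # per-frame distance
    return float(np.mean(dists))                  # average over L frames
```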

Fig. 4 Sound spectrograms of NB and CNB speech signals. a NB speech signal. b CNB speech signal for BWE-F. c CNB speech signal for BWE-FG. d CNB speech signal for BWE-H. e CNB speech signal for BWE-HG

The first experiment verified the quality difference between the NB and CNB speech signals. Figure 4 shows sound spectrograms of the NB and CNB speech signals. Compared to the NB speech signal, the CNB speech signals exhibited spectral distortion in the high-frequency bandwidth of 3.4–4 kHz due to TDDH. It can also be seen that BWE-F and BWE-FG had more serious spectral distortion because of the offset. Table 3 shows the results of DMOS and NB-PESQ. NB-PESQ for the CNB speech signals using TDDH based on the DHT domain (BWE-H and BWE-HG) was more than 0.30 points higher than that for the DFT domain (BWE-F and BWE-FG) because no offset is needed. Also, DMOS for BWE-HG was over 3.60 because the proposed method avoided discontinuities between frames by generating the UB speech signal with 75% overlapping. The proposed method therefore suppressed the quality decline due to TDDH in comparison with the conventional method [29].

Table 3 Subjective and objective quality assessments of CNB speech signals
Fig. 5 Sound spectrograms of original and generated WB speech signals. a Original WB speech signal. b Generated WB speech signal for BWE-F. c Generated WB speech signal for BWE-H. d Generated WB speech signal for BWE-FG. e Generated WB speech signal for BWE-HG

The second experiment verified the quality of the generated WB speech signals. Figure 5 shows sound spectrograms of the original and generated WB speech signals. It can be seen that the missing UB spectrum was reconstructed successfully in the generated WB speech signals. Also, since BWE-F and BWE-H (and likewise BWE-FG and BWE-HG) reconstructed the UB spectrum using a common feature vector, the sound spectrograms of each pair of generated WB speech signals were identical. Compared to the methods generating a UB speech signal with a single relative gain (BWE-F and BWE-H), the methods generating a UB speech signal with relative gains in sub-frames (BWE-FG and BWE-HG) reconstructed the missing UB spectrum more accurately. Figure 6 shows the UB sound pressure changes of the original and generated UB speech signals. It can be seen that the UB sound pressure change of the UB speech signal generated with relative gains was similar to that of the original UB speech signal. We evaluated the distance of the UB sound pressure change between the original and generated UB speech signals by the root mean squared error (RMSE). While the RMSE for the methods generating a UB speech signal with a single relative gain was 12.33 dB, the RMSE for the methods using relative gains in sub-frames was 6.18 dB. Therefore, the proposed method reproduces slight UB sound pressure changes. Table 4 shows the results of MOS and LSD. The LSD for BWE-HG was 0.15 dB lower than that for BWE-F. Also, the MOS for BWE-HG was over 3.00, which was 1.12 points higher than that of the NB speech signal. These results show that the proposed method enhanced the quality of the NB speech signal more efficiently.

Fig. 6 UB sound pressure change over time for the original UB speech signal (blue line), the UB speech signal generated with a single relative gain (red line), and the UB speech signal generated with relative gains in sub-frames (yellow line)

Table 4 Subjective and objective quality assessments for generated WB speech signals
Table 5 Bit error rate of extracted binary signal under simulation environments at several SNR levels [%]

We also verified the robustness against artifacts. Table 5 shows the bit error rate of the extracted binary signal under noise environments at several SNR levels. Without speech codecs, the binary signal was extracted successfully from the CNB speech signal in a clean environment. With speech codecs, bit errors occurred even in a clean environment. In particular, G.729 compresses the NB speech signal more than G.711 by quantizing feature vectors, so it was not easy to extract the binary signal accurately from the decoded CNB speech signal. Also, the bit error rate increased as the SNR level decreased. In comparison with the methods using TDDH based on the DFT domain (BWE-F and BWE-FG), the methods using TDDH based on the DHT domain (BWE-H and BWE-HG) achieved a lower bit error rate. These results confirm that the robustness depends on the length of the bandwidth into which the hidden vector is embedded, as shown by Eq. (27): TDDH based on the DFT domain embedded the hidden vector into the magnitude spectrum in the high-frequency bandwidth of 3.4–4 kHz, whereas TDDH based on the DHT domain embedded it into the amplitude spectrum in the high-frequency bandwidth of 3.4–4.6 kHz. Therefore, the proposed method improved the robustness against artifacts.

Finally, we discuss the processing time. The system was implemented in MATLAB on a 3.60 GHz Intel Core i7 processor. The length of the original WB speech signal was 2.22 s. To generate a CNB speech signal, BWE-F, BWE-H, BWE-FG, and BWE-HG took 0.150 s, 0.122 s, 0.151 s, and 0.123 s, respectively. To generate a WB speech signal, BWE-F, BWE-H, BWE-FG, and BWE-HG took 0.120 s, 0.042 s, 0.127 s, and 0.042 s, respectively. The methods using relative gains took slightly longer because the relative gains were calculated in several sub-frames and the UB speech signal was generated with 75% overlapping. Nevertheless, the total processing time of the proposed method was shorter than the length of the original WB speech signal, so the proposed method works with low latency, as does the conventional method.

6 Conclusion

In this paper, we proposed a BWE method using TDDH based on the DHT domain. Subjective listening tests and objective measures showed that the proposed method generates a CNB speech signal without an offset, thereby suppressing the quality difference between the NB and CNB speech signals. Also, the proposed method generates a UB speech signal with overlapping, using relative gains to represent slight UB sound pressure changes over time, and enhances the quality of the NB speech signal by reconstructing the missing UB spectrum. Furthermore, the bit error rate in noise environments was suppressed by embedding the binary signal into the amplitude spectrum over a wider high-frequency bandwidth. In the future, we will work on speech steganography robust against speech codecs with high compression ratios, such as G.729. The code of the proposed method is available at https://github.com/Yuya-Hosoda/Works.