1 Introduction

High-fidelity voice communications preserve the quality of message signals. In the Global System for Mobile communications (GSM), message signals are typically sampled at the rate of 8000 samples/sec [1]. According to the Nyquist criterion, the transmission bandwidth in GSM is therefore narrow, i.e., limited to 0–4 kHz. Hence, frequencies in human speech signals above the transmission bandwidth are suppressed. As a result, the naturalness, clarity, and pleasantness of the received signals deteriorate. Therefore, digital signal processing techniques have been developed that improve the signal quality by extending its bandwidth. More specifically, a narrowband (NB) telephone signal sampled at 8 kHz is processed to recover the frequency components above 4 kHz that are present in the original wideband (WB) signal sampled at 16 kHz. For this, the high-band (HB) information present in the 4–8 kHz range is extracted from the wideband (0–8 kHz) signal. The extracted high-band information is further used in the bandwidth extension of the narrowband signal at the receiver end. This process is called artificial bandwidth extension (ABE) for a stationary narrowband signal. A general ABE process is shown in Fig. 1 for the case of a stationary signal.

Fig. 1

A basic block diagram depicting the process to produce the narrowband signal and artificial bandwidth extension of a stationary narrowband signal

Figure 1 consists of the transmitter setup and the receiver setup. The transmitter setup generates the narrowband signal sampled at 8 kHz. The conventional transmitter setup has a low pass filter (LPF) followed by a downsampler with a downsampling factor of 2 (\(\downarrow 2\)). The narrowband signal \(S_{NB}[n]\) sampled at 8 kHz is the output signal of the transmitter setup. The receiver setup synthesizes the wideband signal. It consists of four processes: narrowband information extraction, high-band information estimation, a resampling process, and a bandwidth extension process. In Fig. 1, \(\uparrow 2\) represents an upsampler with an upsampling factor of 2, \(S^\prime _{NB}[n^\prime ]\) denotes the narrowband signal resampled to 16 kHz, and \(S^\prime _{HB}[n^\prime ]\) denotes the estimated high-band signal sampled at 16 kHz. The bandwidth extension process is applied to the received narrowband signal \(S_{NB}[n]\) to estimate the missing high-band signal at the receiver side. It uses high-band information, which is estimated by a machine learning model from given narrowband information/features; the machine learning model is trained offline, and the narrowband features are extracted from the narrowband signal. In the resampling process, the resampled narrowband signal \(S^\prime _{NB}[n^\prime ]\) is obtained by passing the narrowband signal \(S_{NB}[n]\) through the upsampler (\(\uparrow 2\)) followed by the low pass filter. The wideband signal is estimated by adding the estimated high-band signal \(S^\prime _{HB}[n^\prime ]\) and the resampled narrowband signal \(S^\prime _{NB}[n^\prime ]\).
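
For illustration, the conventional narrowband generation and receiver-side resampling of Fig. 1 can be sketched in a few lines of Python. This is a minimal sketch, not the exact GSM processing chain; the filter length (101 taps) and the use of scipy.signal.firwin are assumptions made for the example.

```python
import numpy as np
from scipy import signal

def to_narrowband(s_wb):
    """Conventional transmitter of Fig. 1: low pass filter at ~4 kHz, then downsample by 2."""
    lpf = signal.firwin(numtaps=101, cutoff=0.5)   # cutoff = 0.5 x Nyquist = 4 kHz at fs = 16 kHz
    s_lp = signal.lfilter(lpf, 1.0, s_wb)
    return s_lp[::2]                               # S_NB[n] at 8 kHz

def resample_narrowband(s_nb):
    """Receiver-side resampling: upsample by 2, then low pass filter to remove the image."""
    up = np.zeros(2 * len(s_nb))
    up[::2] = s_nb                                 # zero insertion (upsampler, factor 2)
    lpf = signal.firwin(numtaps=101, cutoff=0.5)
    return 2.0 * signal.lfilter(lpf, 1.0, up)      # S'_NB[n'] at 16 kHz (gain 2 compensates zeros)

# Estimated wideband signal = resampled narrowband + estimated high band (placeholder s_hb_hat):
# s_wb_hat = resample_narrowband(to_narrowband(s_wb)) + s_hb_hat
```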

Many approaches have been proposed for ABE based upon the source-filter model. In this model, the speech signal is segregated into two parts: a speech production filter (SPF), acting as the vocal tract filter, and an excitation signal, taken as the residual signal [52]. The excitation signal is passed through the speech production filter to produce the speech signal. The excitation signal can be either white noise for unvoiced speech or a quasi-periodic impulse train for voiced speech. The magnitude spectrum of the excitation signal is flat in both cases: white noise and quasi-periodic impulse train. Thus, the vocal tract filter shapes the spectral envelope of the speech signal. The spectral envelope can be accurately modeled using a signal model containing poles (resonances) as well as zeros (anti-resonances) [38]. The spectral envelope and excitation of the high-band signal are estimated using an extrapolation process applied to the narrowband signal together with some extra information [17, 18, 32, 46, 47]. In existing methods for ABE, the spectral envelopes of the high-band signal and narrowband signal can be represented by linear prediction coefficients (LPC) [6], line spectral frequencies (LSF) [35], linear frequency cepstral coefficients (Cepstrum) [2], and Mel frequency cepstral coefficients (MFCC) [44, 53] features. These features capture the pole (formant) information present in the speech spectrum. Further, the high-band excitation can be estimated in several different ways, e.g., bandpass-envelope modulated Gaussian noise (BP-MGN) [47], the harmonic noise model (HNM) [56], spectrum folding [17, 37], pitch adaptive modulation [28], full-wave rectification [18], and spectral translation [18, 28, 37]. Another method has been proposed, which is based on the temporal envelope model [30]. It uses the temporal envelope and fine structure of the sub-bands for synthesizing the high-band speech signal. Some approaches are developed without using any modeling; such approaches use the magnitude spectrum to synthesize the high-band information. A joint dictionary training model has been proposed, which utilizes the sparsity of the spectrogram [50]. Log spectra of the wideband signal are directly used to represent the narrowband and high-band information for ABE [11, 34]. In [3], the Cepstrum feature is used to represent the high-band information for ABE. In [8], the CQT (constant-Q transform) feature is used for ABE, but the dimension of this feature is kept high.
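
As a concrete illustration of the source-filter decomposition, the sketch below computes an order-10 all-pole spectral envelope and the corresponding excitation (residual) for one windowed speech frame. It is a minimal numpy/scipy example using the autocorrelation method with the Levinson–Durbin recursion; the frame variable and the model order are placeholders.

```python
import numpy as np
from scipy import signal

def lpc_autocorr(x, order=10):
    """LPC coefficients A(z) = 1 + a1 z^-1 + ... + ap z^-p (autocorrelation / Levinson-Durbin)."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

# frame: one windowed speech frame (placeholder)
# a = lpc_autocorr(frame, order=10)
# excitation = signal.lfilter(a, [1.0], frame)   # inverse filtering gives the residual
# w, h = signal.freqz([1.0], a, worN=512)        # |1/A(e^jw)| is the all-pole spectral envelope
```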

According to [38], the speech production filter can be accurately represented by a pole–zero model. Many existing methods use an all-pole model, which may not be sufficient to represent the spectral envelope of speech portions like fricatives, nasals, laterals, and the burst interval of stop consonants due to the presence of valleys in the frequency response of the SPF [38]. In our work, the pole–zero model (we also call it the signal model) is used to represent the spectral envelope of the wideband signal [38]. Moreover, existing methods focus on the estimation of the high-band (HB) signal only, as the narrowband signal \(S_{NB}\) is available at the receiver side. At the transmitter side, the original wideband signal is passed through a near-ideal low pass filter (LPF) prior to the downsampler to produce the narrowband signal. This decomposition of narrowband and high-band information at the transmitter is a common technique used in many ABE works (see [2, 35, 58]), including our work reported in [21]. On account of this decomposition, two challenges arise for the effective ABE of the narrowband speech signal: (i) a weak conditional dependence between narrowband and wideband information, specifically for the unvoiced frames of speech, and (ii) the need to adjust the energy levels between the estimated high-band and the retained narrowband speech signals [44, 58]. In different unvoiced frames of speech, the narrowband information is almost the same, while the high-band information varies. Therefore, it is difficult to estimate the respective high-band information for the given narrowband information of an unvoiced frame. To tackle these challenges, a new ABE framework is proposed in this work. The proposed work differs from the existing works in two aspects. First, the narrowband signal generated at the transmitter is no longer perfect: the low pass filter prior to the downsampler is dropped (a similar approach has been used in [20]), so the narrowband signal includes aliasing distortion, which helps in estimating the high-band information of unvoiced speech. The transmitted aliased narrowband signals may have less intelligibility, but they are hypothesized to establish a better conditional dependence between narrowband and wideband information. This is because the high-band information is reflected in the narrowband region after downsampling, which yields more variation among the narrowband features of unvoiced speech and, in turn, a better conditional dependence between the narrowband features and the proposed wideband features for unvoiced speech. Second, the interpolation filter for the speech signal is estimated by using \(H^\infty \) optimization/filtering, which is recommended in the literature (especially in control) to handle variations in system models (in our case, the pole–zero model or signal model) [51]. This has been used in [7, 60] for the reconstruction of orchestral music signals by using a single pole–zero model. However, a single model is not sufficient for a non-stationary signal (orchestral music or speech [36, 41]). Due to the non-stationary nature of speech signals, a frame-based approach (short-time processing) is applied, which would require storing additional information about the interpolation filters along with their corresponding narrowband details. For this reason, machine learning models are designed and used to estimate the wideband information [2, 9, 10, 17, 27, 28, 32, 35, 58, 59]. In this work, this problem is solved by using two machine learning models, the Gaussian mixture model [15] and the feed-forward DNN [24].

2 A Proposed Setup for Artificial Bandwidth Extension of Speech Signals

This section discusses the proposed artificial bandwidth extension framework for the narrowband signal sampled at 8 kHz. Figure 2 shows an outline of the proposed ABE framework. It includes the windowing and framing processes, the setups used at the transmitter side and receiver side (explained in Sect. 2.1), the processes to obtain the wideband feature and narrowband feature for bandwidth extension (explained in Sect. 2.2), the estimation of the wideband feature (explained in Sect. 2.3), and the synthesis of the wideband signal (explained in Sect. 2.4). The windowing and framing processes are performed to obtain stationary frames/signals from non-stationary speech signals [36]. This is done by using a Hamming window of 25 ms duration with 50% overlap between adjoining frames. Each subblock of Fig. 2 is further explained in the forthcoming subsections.
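
A minimal sketch of the framing and windowing step is given below (25 ms Hamming frames with 50% overlap); the function name and signature are chosen only for this example.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, overlap=0.5):
    """Split a speech signal into Hamming-windowed frames with the given overlap."""
    frame_len = int(fs * frame_ms / 1000)        # 400 samples at 16 kHz, 200 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))         # 50% overlap between adjoining frames
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    return frames, win, hop
```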

Fig. 2

Block diagram consists of training of a machine learning model and extension of the narrowband signal

2.1 Setups Used at the Transmitter Side and Receiver Side

This section discusses the transmitter and receiver setups. These setups are combined and drawn in Fig. 3.

Fig. 3

Generation of the narrowband signal and reconstruction of the stationary wideband speech frame

The transmitter (Tx) produces the narrowband signal at its output. A wideband speech frame is downsampled by a factor of 2 at the transmitter side, which yields the narrowband speech frame \(y_d\) shown in Fig. 3. This narrowband generation process introduces distortion (aliasing) into the narrowband speech frame. Hence, our work focuses on estimating the full wideband (0–8 kHz) signal at the receiver side. The receiver setup has three processes: narrowband information extraction, wideband information extraction, and the bandwidth extension process (see Fig. 3). These processes are used at the receiver side for estimating wideband speech frames. The bandwidth extension process is applied to the narrowband speech frame at the receiver side, as shown in Fig. 4. In Fig. 4, \(y_d\) is upsampled by a factor of 2 and subsequently passed through an interpolation filter K, which leads to an estimated wideband speech frame \({\widehat{y}}\). The interpolation filter (K) contains the wideband information of a signal.
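
The following sketch summarizes the proposed transmitter (direct downsampling without a prior LPF) and the receiver-side bandwidth extension of Fig. 4; it assumes the interpolation filter K is available as rational filter coefficients (b_K, a_K), which in practice come from the \(H^\infty \) design of Sect. 2.1.

```python
import numpy as np
from scipy import signal

def transmitter_direct(y):
    """Proposed transmitter: downsample the wideband frame by 2 without a prior LPF,
    deliberately allowing the 4-8 kHz band to alias into the narrowband frame."""
    return y[::2]                              # y_d

def bandwidth_extend(y_d, b_K, a_K=(1.0,)):
    """Receiver (Fig. 4): upsample y_d by 2 and pass it through the interpolation filter K."""
    up = np.zeros(2 * len(y_d))
    up[::2] = y_d                              # zero insertion (upsampler, factor 2)
    return signal.lfilter(b_K, a_K, up)        # estimated wideband frame y_hat
```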

Fig. 4

Bandwidth extension process for a stationary speech frame

Designing the filter K is the core of this work. For designing the filter K, an error system is made by combining the wideband speech frame, narrowband generation process, and bandwidth extension process, as shown in Fig. 5.

Fig. 5

Error system setup for the reconstruction of a stationary speech frame

The synthesis filter K is designed by minimizing the reconstruction error using a suitable norm.

In Fig. 5, \(e=y-{\widehat{y}}\), where y and \({\widehat{y}}\) denote the original/true wideband speech frame and the estimated wideband speech frame, respectively.

Every discrete-time stationary speech signal can be represented by a linear discrete time-invariant (LDTI) system driven by white noise for unvoiced speech or by an impulse train for voiced speech [38]. Hence, pole–zero information about the original wideband speech frame y is extracted in the form of a signal model F, taken as the speech production filter, which reflects the signal properties. In other words, the signal model F represents the spectral envelope information of the wideband speech frame. A modified error system containing the signal model F is given in Fig. 6.

Fig. 6

Proposed architecture of error system for reconstructing a stationary speech frame

In Fig. 6, y is the output of system F driven by an input \(w_d\) with known features. The transfer function of F is represented by F(z). It is further assumed that F(z) is a stable and strictly proper rational transfer function. F can be represented in the z-domain as

$$\begin{aligned} F(z)=\mathbf{C}(z\mathbf{I}-\mathbf{A})^{-1}\mathbf{B}, \end{aligned}$$

where \(\mathbf{A, B, C}\) are constant real matrices of appropriate dimensions. The signal model F(z) is computed by the standard Prony's-method-based function available in MATLAB [39, 40]. The obtained model is causal but may be unstable. To make it stable, those poles of the model lying outside the unit circle are reflected inside by reciprocating their magnitudes without altering their phase [38]. Note that the magnitude spectrum of F(z) remains the same; however, the phase spectrum changes. This stabilizing process does not noticeably affect the perception of a speech signal because the human auditory system is less sensitive to phase information [38].
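
The pole-reflection stabilization described above can be sketched as follows; the function assumes F(z) is given as numerator/denominator coefficient arrays (b, a) and reflects every pole outside the unit circle to the reciprocal of its magnitude while keeping its angle. The magnitude response is preserved up to a constant gain, which can be compensated if needed.

```python
import numpy as np

def stabilize(b, a):
    """Reflect poles of F(z) = B(z)/A(z) lying outside the unit circle back inside."""
    poles = np.roots(a)
    outside = np.abs(poles) > 1.0
    # p -> 1/conj(p): reciprocal magnitude, same angle (phase of the pole unchanged)
    poles[outside] = poles[outside] / (np.abs(poles[outside]) ** 2)
    a_stable = np.real(np.poly(poles)) * a[0]
    return b, a_stable
```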

2.1.1 Performance Index

The \(H^\infty \) system norm is used to minimize the reconstruction error because this norm is robust to small modeling errors [51]. The \(H^\infty \)-norm of a system \({\mathcal {G}}\) with input \({\mathcal {X}} \in l^2 ({\mathbb {Z}},{\mathbb {R}}^n)\) and output \({\mathcal {Y}} \in l^2({\mathbb {Z}},{\mathbb {R}}^m)\) is defined as (see, e.g., [13, 51, 60])

$$\begin{aligned} ||{\mathcal {G}}||_\infty&:=\sup _{{\mathcal {X}} \ne 0} \frac{||{\mathcal {Y}}||_2 }{||{\mathcal {X}}||_2}. \end{aligned}$$
(1)
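
For a stable single-input single-output LTI system, the \(H^\infty \) norm in (1) reduces to the peak magnitude of the frequency response, which can be approximated on a dense frequency grid; the sketch below assumes the system is given by rational coefficients (b, a).

```python
import numpy as np
from scipy import signal

def hinf_norm(b, a, n_grid=8192):
    """Approximate ||G||_inf = sup_w |G(e^{jw})| for a stable SISO system G(z) = B(z)/A(z)."""
    _, h = signal.freqz(b, a, worN=n_grid)
    return np.max(np.abs(h))
```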

2.1.2 Problem Formulation

To design optimal K(z), the following optimization problem is solved.

Problem 1

Given a stable and causal F(z), design a stable and causal interpolation filter \(K_\mathrm{opt}\) defined as

$$\begin{aligned} K_\mathrm{opt}:=&\mathop {\mathrm{arg\,min}}_{K}\big ( ||{\mathbb {T}}||_\infty \big ), \end{aligned}$$
(2)

where \({\mathbb {T}}:=F - K(\uparrow 2)(\downarrow 2) F\). \({\mathbb {T}}\) maps \(w_d\) to e (see Fig. 6).

As mentioned earlier, the non-stationary behavior of speech signals introduces some uncertainty in the estimation of the signal model F(z). In such a case, \(H^\infty \)-norm optimization provides a solution that is robust against small modeling errors in F(z) [51]. The solution of Problem 1 is explained in “Appendix A.1.” It computes the optimal IIR filter \(K_\mathrm{opt}\). Henceforth, \(K_\mathrm{opt}\) is denoted by K.

2.2 Speech-Specific Wideband and Narrowband Features

The strategy explained in Sect. 2.1 is used for extending the bandwidth of a narrowband speech frame, and interpolation filters are obtained for all speech frames. The interpolation filter K has an infinite impulse response (IIR). Practically, the IIR filter K cannot be modeled directly by machine learning techniques. Therefore, this filter is converted into an approximate finite impulse response (FIR) interpolation filter by truncating its Taylor series at the origin. The number of terms in the FIR interpolation filter is chosen empirically as 21, which is explained in Sect. 3.1. This FIR filter response is taken as the wideband feature \(\mathbf{Y}_K\) in this work.
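
Assuming that the truncation amounts to keeping the first 21 samples of the impulse response of K (a common reading of truncating the series expansion), the conversion can be sketched as follows; b_K and a_K are the rational coefficients of K obtained from the \(H^\infty \) design.

```python
import numpy as np
from scipy import signal

def iir_to_fir(b_K, a_K, n_taps=21):
    """Approximate the IIR interpolation filter K by an n_taps-tap FIR filter."""
    impulse = np.zeros(n_taps)
    impulse[0] = 1.0
    return signal.lfilter(b_K, a_K, impulse)   # wideband feature Y_K (length 21)
```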

Only narrowband information is available at the receiver side, while the interpolation filter is needed in the bandwidth extension process. For estimating the interpolation filter, a model is trained offline using the interpolation filter information and the corresponding narrowband information (narrowband feature). This pre-trained model is then used to estimate the filter information for a given narrowband feature (see Sect. 2.3). The narrowband information (narrowband feature) is represented in four different ways, i.e., linear prediction coefficients (LPC) [5], line spectral frequencies (LSF) [25], linear frequency cepstral coefficients (Cepstrum) [2], and Mel frequency cepstral coefficients (MFCC) [44, 53]. These parameters are computed from the narrowband speech frame. The dimension of the narrowband feature is fixed to 10.
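
As an example of the narrowband feature computation, the sketch below converts an order-10 LPC polynomial (obtained from a narrowband frame, e.g., with the lpc_autocorr sketch given earlier) into the 10 line spectral frequencies; it is the standard LPC-to-LSF conversion via the symmetric and antisymmetric polynomials.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC polynomial a = [1, a1, ..., ap] into p line spectral frequencies (radians)."""
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]            # symmetric polynomial   P(z) = A(z) + z^-(p+1) A(1/z)
    Q = a_ext - a_ext[::-1]            # antisymmetric polynomial Q(z) = A(z) - z^-(p+1) A(1/z)
    lsf = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        lsf.extend(ang[(ang > 1e-9) & (ang < np.pi - 1e-9)])   # one angle per conjugate pair
    return np.sort(np.array(lsf))      # 10 LSFs for an order-10 LPC polynomial
```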

2.3 Modeling and Mapping

This section gives details of the machine learning models used in this paper. The machine learning models are used to estimate the FIR interpolation filter from the narrowband feature. For this purpose, a model is trained using the narrowband and wideband features. In our work, machine learning models such as GMM and DNN are used, which are explained in “Appendixes A.2 and A.3,” respectively.
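
For concreteness, one common way to realize such a GMM-based mapping is to fit a joint GMM on stacked narrowband/wideband feature vectors and use the minimum mean square error (MMSE) regression for prediction. The sketch below follows this standard construction with scikit-learn and scipy; it is not claimed to match the exact formulation in Appendix A.2, and the mixture count and regularization value are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def train_joint_gmm(X, Y, n_mix=128):
    """Fit a joint GMM on stacked [narrowband, wideband] feature vectors."""
    Z = np.hstack([X, Y])                                   # shape (n_frames, dx + dy)
    return GaussianMixture(n_components=n_mix, covariance_type='full',
                           reg_covar=1e-6, max_iter=200).fit(Z)

def gmm_regress(gmm, x, dx):
    """MMSE estimate E[y | x] of the wideband feature under the joint GMM."""
    mu, Sig, w = gmm.means_, gmm.covariances_, gmm.weights_
    # Posterior probability of each mixture given the narrowband feature x
    px = np.array([w[k] * multivariate_normal.pdf(x, mu[k, :dx], Sig[k, :dx, :dx])
                   for k in range(len(w))])
    px /= px.sum()
    y_hat = np.zeros(mu.shape[1] - dx)
    for k in range(len(w)):
        cond = mu[k, dx:] + Sig[k, dx:, :dx] @ np.linalg.solve(Sig[k, :dx, :dx], x - mu[k, :dx])
        y_hat += px[k] * cond
    return y_hat
```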

2.4 Wideband Signal Estimation

The entire flow for the training of a machine learning model and the extension of the narrowband signal is shown in Fig. 2, which is used for ABE of speech signals. It can be broadly divided into two principal blocks: training and extension. In the training block, windowing of the wideband signal is performed first. Two parallel processes are then performed on the windowed wideband signal. One process is the computation of the signal model and the subsequent extraction of the wideband feature \(\mathbf{Y}_K\) (see Sects. 2.1 and 2.2). The other performs the downsampling of the wideband speech frame and the subsequent extraction of the narrowband feature \(\mathbf{X}\) (see Sect. 2.2). The narrowband and wideband features are modeled by a GMM or DNN (see Sect. 2.3). In the extension block, the first step is the windowing of the narrowband signal and the subsequent extraction of the narrowband feature \({\tilde{\mathbf{X }}}\). Further, \({\tilde{\mathbf{X }}}\) is mapped to the wideband feature \({\tilde{\mathbf{Y }}}_K\) by using the pre-trained model. The windowed narrowband signal is upsampled by a factor of 2 and then passed through the interpolation filter K(z), which is obtained from the estimated wideband feature \({\tilde{\mathbf{Y }}}_K\). The resulting signal is multiplied by the reciprocal of the Hamming window to estimate the wideband speech frame. Further, the overlapped portion of two adjacent frames is estimated by averaging the overlapped parts of the estimated wideband speech frames. In other words, the weighted overlap-add method (WOLA) is applied [16, 57].
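
A sketch of the frame recombination at the end of the extension block is given below; it assumes the estimated wideband frames are stacked in an array, that win is the Hamming window at the wideband frame length, and that hop is the 50% frame shift. It implements the reconstruction described above: the analysis window is undone and the overlapping portions of adjacent frames are averaged.

```python
import numpy as np

def recombine_frames(frames_hat, win, hop):
    """Undo the analysis window and average the overlapped portions of adjacent frames."""
    frame_len = frames_hat.shape[1]
    out = np.zeros(hop * (len(frames_hat) - 1) + frame_len)
    count = np.zeros_like(out)
    for i, f in enumerate(frames_hat):
        start = i * hop
        out[start:start + frame_len] += f / np.maximum(win, 1e-3)   # reciprocal of the window
        count[start:start + frame_len] += 1.0
    return out / np.maximum(count, 1.0)                             # average overlapped samples
```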

3 Experimental Analysis and Results

This section describes the speech signals, which are taken from the TIMIT database [61] and the RSR15 database [33]. Both datasets contain speech files recorded at a sampling rate of 16 kHz. The TIMIT database consists of two different sets: a test set and a training set. The training set is used for training the machine learning models, and the TIMIT test set is taken as a validation set. A new test set is made from speech files taken from the RSR15 dataset and is used for testing the machine learning models. This new test set contains speech files uttered by 4 female and 3 male speakers. Using a test set from a different database leads to more generalized results.

Section 3.1 has the mathematical formulations of objective measures used for evaluating the proposed method. In Sect. 3.2, the objective measures are analyzed for deciding the dimension of the wideband feature. Further, the proposed method is evaluated using the GMM model in Sect. 3.2.1 and DNN topology in Sect. 3.2.2. In Sect. 3.2.3, the proposed method is compared with the existing methods. In Sect. 3.3, the subjective measure is discussed.

3.1 Objective Measures

In this work, several standard objective speech quality measures such as MSE (mean square error) [43], SDR (signal to distortion ratio) [23], LLR (log likelihood ratio) [36, 49], LSD (log spectral distance) [3], MOS-LQO (mean opinion score listening quality objective) estimated from PESQ (perceptual evaluation of speech quality) [26, 49], and STOI (short-time objective intelligibility) [54] are computed for performance analysis. The mathematical formulations of the objective measures are as follows.

$$\begin{aligned} \text {MSE}= \frac{\sum _{i=1}^{L}(s(i)-{\tilde{s}}(i))^2}{L}. \end{aligned}$$
(3)

L is the signal length, s is the original wideband signal, and \({\tilde{s}}\) is the reconstructed wideband signal.

$$\begin{aligned} \text {SDR(dB)}=10 \log _{10} \frac{\sum _{i=1}^{L}s(i)^2}{\sum _{i=1}^{L}(s(i)-{\tilde{s}}(i))^2}. \end{aligned}$$
(4)

Parameters in (4) are the same as defined in (3).

$$\begin{aligned} \text {LLR}=\frac{1}{M} \sum _{i=1}^M \log _{10} \left( \frac{\vec{a}_{i,p}^{\,T}\, R_{i,c}\, \vec{a}_{i,p}}{\vec{a}_{i,c}^{\,T}\, R_{i,c}\, \vec{a}_{i,c}}\right) , \end{aligned}$$
(5)

where M is the number of frames, \(\vec{a}_{i,c}\) and \(\vec{a}_{i,p}\) are the LPC vectors of the original ith speech frame and the reconstructed ith speech frame, respectively, and \(R_{i,c}\) is the autocorrelation matrix of the original ith speech frame.

$$\begin{aligned} \text {LSD}=\frac{1}{M}\sum _{i=1}^M \sqrt{\frac{1}{N}\sum _{j=1}^N \big (20\log _{10}|{X}(i,j)|-20\log _{10}|{\tilde{X}}(i,j)|\big )^2}, \end{aligned}$$
(6)

where \(|X(i,j)|\) and \(|{\tilde{X}}(i,j)|\) are the FFT magnitudes at the jth frequency bin of the ith frame of the original and reconstructed speech signals, respectively. M and N denote the number of frames and the number of frequency bins, respectively.

$$\begin{aligned} \text {MOS-LQO}=a+\frac{b}{1+\exp (c\,p+d)}, \end{aligned}$$
(7)

where \(a=0.999\), \(b=4.999-a\), \(c=-1.4945\), \(d=4.6607\), and p is the raw PESQ score.

These measures fall into two major categories, based on the time and frequency domains. MSE and SDR assess performance in the time domain, while LLR and LSD convey information about the frequency domain. The MOS-LQO and STOI measures assess quality jointly in both the time and frequency domains. The LLR, SDR, and PESQ measures are computed with the help of a composite tool downloaded from the author's website, and MOS-LQO is estimated from PESQ [22, 26].
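
For reference, the time-domain measures and the LSD of (3), (4), (6), and the MOS-LQO mapping of (7) can be computed as sketched below (numpy only); the LLR additionally requires the per-frame LPC vectors and autocorrelation matrices of (5) and is omitted here, and the raw PESQ score is assumed to be available from an external tool.

```python
import numpy as np

def mse(s, s_hat):
    return np.mean((s - s_hat) ** 2)                                     # Eq. (3)

def sdr_db(s, s_hat):
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))    # Eq. (4)

def lsd_db(S, S_hat, eps=1e-12):
    """Eq. (6); S and S_hat are magnitude spectrograms of shape (frames, bins)."""
    d = 20.0 * np.log10(np.abs(S) + eps) - 20.0 * np.log10(np.abs(S_hat) + eps)
    return np.mean(np.sqrt(np.mean(d ** 2, axis=1)))

def mos_lqo(pesq, a=0.999, c=-1.4945, d=4.6607):
    return a + (4.999 - a) / (1.0 + np.exp(c * pesq + d))                # Eq. (7)
```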

3.2 Objective Analysis

Initially, the performance of the enhanced speech signals is analyzed. The narrowband speech signal is enhanced by applying the interpolation filter K to the upsampled narrowband signal, using the oracle filter K directly in the architecture shown in Fig. 4. The objective measures are listed in Table 1 for the wideband speech signals estimated by the output of an upsampler with upsampling factor 2 (\(y_{d,u}\)) and by the output of the interpolation filter K (\({\widehat{y}}\)).

Table 1 Performance comparison, on speech files taken from the validation set, of speech signals enhanced by an upsampler with upsampling factor 2 alone (without applying filter K) and by the oracle interpolation filter K in Fig. 4

As seen in Table 1, the interpolation filter K improves all the objective measures significantly.

Moreover, filter K has an infinite impulse response. It is transformed into an approximate FIR filter by using the Taylor series truncation method. To decide the length of the truncated FIR filter, the objective measures are computed on some speech files taken from the validation set for varying filter lengths.

Table 2 Performance evaluation on the speech files taken from the validation set when the truncated FIR filter K (oracle K) is used directly in Fig. 4 for ABE

As shown in Table 2, the objective measures improve as the number of terms in the FIR filter increases, but only gradually beyond length 21. Hence, the filter length is set to 21. Then, the pre-trained GMM and DNN models are obtained using the training data. The performance on the test set is then analyzed using these pre-trained models, as described in the following subsections.

Moreover, the objective measures are analyzed separately for the voiced speech and unvoiced speech of the test set. To this end, the speech signals are segregated into two fundamental parts, voiced speech and unvoiced speech, by a glottal activity detection (GAD) method [4, 42]. It is a well-known fact that the narrowband region contains higher energy than the high-band region for voiced speech and vice versa for unvoiced speech [36]. Our proposed strategy considers the recovery of the full wideband: the information present in the narrowband region is distorted because of aliasing, while the information present in the high-band region is lost when the wideband signal is converted into the narrowband signal. As a result, both unvoiced speech and voiced speech are affected in our transmitter setup. The main benefit of direct downsampling is the better estimation of the wideband feature for a given narrowband feature of unvoiced speech. This is because the high-band information is reflected in the narrowband region after downsampling, which yields more variation among the narrowband features for unvoiced speech and, in turn, a better conditional dependence between the narrowband features and the proposed wideband features for unvoiced speech. Later, the performance is analyzed for the voiced speech and unvoiced speech separately.

3.2.1 Performance Evaluation Using Gaussian Mixture Model

The GMM-based regression technique is used to estimate the interpolation filter (wideband feature) for a given narrowband feature. A GMM with 128 mixtures is trained using the narrowband features and the proposed wideband features. Further, the performance of the proposed approach using the GMM is evaluated on the test set for four types of narrowband features: LSF, LPC, Cepstrum, and MFCC, as reported in Table 3.

Table 3 Performance evaluation by using 128 GMMs on the test set

The objective measures are analyzed for these narrowband features. The LSF narrowband feature leads to the best performance in comparison with the other narrowband features.

Furthermore, objective measures are tabulated in Table 4 for the voiced speech and Table 5 for the unvoiced speech extracted from speech signals belonging to the test set by considering the four types of narrowband feature representations.

Table 4 Performance evaluation by using 128 GMMs for voiced speech extracted from the speech signals belonging to the test set
Table 5 Performance evaluation by using 128 GMMs for unvoiced speech extracted from the speech signals belonging to the test set

The MSE and SDR measures produced by using the LSF narrowband feature are closest to their respective values obtained by using the oracle FIR filter K (\(\mathbf{Y_K}\)) directly for the voiced speech. The remaining objective measures produced by using the MFCC narrowband feature show the smallest difference from their respective values obtained by using \(\mathbf{Y_K}\) directly for the voiced speech. For the unvoiced speech, the Cepstrum narrowband feature yields the lowest MSE, and the LSF narrowband feature produces better values for the remaining objective measures.

3.2.2 Performance Evaluation Using Deep Neural Network

DNN topology is used to estimate the interpolation filter coefficients. Some preliminary experiments are performed to decide the parameter values for the DNN topology with the narrowband feature fixed. An optimal DNN architecture is designed after optimizing its parameters over the fixed LSF narrowband feature representation. The AdaMax (adaptive moment estimation based on the infinity norm) [31] optimizer is used to update the weights of the network, and \(L_2\) regularization is applied empirically [24]. Hyper-parameters such as the mini-batch size, number of epochs, learning rate \(\alpha \), and decay rates \(\beta _1\) (for the first-moment estimate) and \(\beta _2\) (for the second-moment estimate) are tuned over a broad range and set to 200, 50, 0.01, 0.9, and 0.999, respectively. Mean and variance normalization (MVN) is applied to the features. Also, batch normalization is applied before the activation function in each hidden layer. The ReLU activation function is used in the hidden layers, and the linear activation function is used in the output layer. The performances of different DNN topologies on the validation set are tabulated in Table 6.
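
A sketch of the selected DNN topology (four hidden layers of 256 units, batch normalization before the activation, linear output, AdaMax with the stated learning rate and decay rates) is given below using tf.keras; the L2 weight-decay value and the use of an MSE loss are assumptions made for the example, and mean/variance normalization of the features is assumed to be done beforehand.

```python
import tensorflow as tf

def build_abe_dnn(in_dim=10, out_dim=21, n_hidden=4, n_units=256, weight_decay=1e-4):
    """Feed-forward DNN mapping the 10-dim narrowband feature to the 21-tap FIR wideband feature."""
    layers = []
    for i in range(n_hidden):
        kwargs = {'input_shape': (in_dim,)} if i == 0 else {}
        layers.append(tf.keras.layers.Dense(
            n_units, kernel_regularizer=tf.keras.regularizers.l2(weight_decay), **kwargs))
        layers.append(tf.keras.layers.BatchNormalization())   # batch norm before the activation
        layers.append(tf.keras.layers.Activation('relu'))     # ELU/tanh/softplus are also tried
    layers.append(tf.keras.layers.Dense(out_dim, activation='linear'))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.01,
                                                       beta_1=0.9, beta_2=0.999),
                  loss='mse')
    return model

# model = build_abe_dnn()
# model.fit(X_train, Y_train, batch_size=50, epochs=50, validation_data=(X_val, Y_val))
```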

Table 6 Performance evaluation on the validation set for different DNN topologies with varying numbers of hidden layers (\(N_{HL}\)) and units (\(N_U\)), ReLU activation in the hidden layers, linear activation in the output layer, the LSF narrowband feature, and the \(\textit{AdaMax}\) optimizer

The best overall performance on the validation set is obtained with four hidden layers of 256 units each. Next, this architecture is retrained with different mini-batch sizes, and as a result, the mini-batch size is set to 50. These parameters are used in designing the optimal DNN architecture.

Moreover, different DNN models are trained with other activation functions in the hidden layers such as ELU, tanh, and softplus. Performance on the test set is analyzed for all the DNN architectures, as shown in Table 7.

Table 7 Performance evaluation on the test set for the DNN models designed using different activation functions such as ReLU, ELU, tanh, softplus used in hidden layers, and linear in the output layer; Number of hidden layers \((N_{HL})=4\); Number of units \((N_U)= 256\) in each hidden layer

The analysis shows that the LPC narrowband feature yields better MOS-LQO, MSE, and STOI than the other narrowband features. On the other hand, the rest of the objective measures are, in the majority of cases, better for the LSF narrowband feature. Among all the activation functions, the softplus function yields the best performance in the majority of cases with the LSF narrowband feature. Furthermore, Tables 8 and 9 give the objective measures computed for the voiced speech and unvoiced speech taken from the test set, respectively, with different activation functions and different narrowband feature definitions.

Table 8 Performance evaluation for voiced speech extracted from speech signals belonging to the test set for the DNN models designed using different activation functions such as ReLU, ELU, tanh, softplus used in hidden layers, and fixed linear activation function in the output layer; Number of hidden layers \((N_{HL})=4\); Number of units \((N_U)= 256\) in each hidden layer
Table 9 Performance evaluation for the unvoiced speech extracted from speech files belonging to the test set for the DNN models with considering different activation functions such as ReLU, ELU, tanh, softplus used in hidden layers, and fixed linear activation function in the output layer; Number of hidden layers \((N_{HL})=4\); Number of units \((N_U)= 256\) in each hidden layer

The LSF narrowband feature, among all the narrowband features, yields the best performance for both voiced and unvoiced speech. For the voiced speech, the LSF narrowband feature yields the best SDR, LLR, and LSD with the ELU, ReLU, and softplus activation functions, respectively, while the LPC narrowband feature yields the best MSE and MOS-LQO with the ReLU and tanh functions, respectively. For the unvoiced speech, the LSF narrowband feature with the ELU, tanh, and softplus functions yields the SDR, LLR, and LSD closest to their respective values obtained by using the oracle \(\mathbf{Y}_K\) directly. The DNN model designed using the ELU activation function and the Cepstrum narrowband feature yields the best MSE for the unvoiced speech, and the DNN model designed using the softplus activation function and the LPC narrowband feature yields the best MOS-LQO for the unvoiced speech.

3.2.3 Comparisons

Our proposed method is compared with the existing methods based on the conventional source-filter model, wherein the excitation signal is extended in two different ways: spectrum folding [17, 37, 58] and spectral translation [37, 44]. Experimental conditions such as the datasets, the dimensions of the narrowband and wideband features, the windowing, and the DNN model are kept the same. The LSF features are used to represent both the narrowband feature and the wideband feature. These methods also require a gain factor, which is calculated following [58] for spectrum folding and [44] for spectral translation. The cepstral domain method is also compared, in which the narrowband feature is the narrowband magnitude spectrum and the wideband feature is represented by cepstral coefficients [3].

Moreover, these techniques are implemented by using the low pass filter for generating the narrowband signal. Here, the low pass filter is a non-causal FIR filter defined in [1]. The cut-off frequency of the LPF is 3660 Hz, and the length of this filter is 118. The non-causality of this filter introduces a delay in transmission.

Table 10 A comparison of the objective measures computed on the test set speech files for different methods

As seen in Table 10, the proposed method improves all the objective measures except MOS-LQO and STOI when compared with the existing methods. The existing methods obtain better MOS-LQO and STOI values, which may be due to the availability of the original narrowband information: in the existing methods, the narrowband signal is generated by using the low pass filter, so the narrowband information is not altered.

Next, the spectrograms of the speech signals estimated by the proposed, spectrum folding, spectral translation, and cepstral domain methods using the same DNN model are shown in Fig. 7. As viewed in Fig. 7, the spectrograms of the speech signals extended by the existing methods differ from the original spectrogram around 4 kHz more than that of the proposed method. This happens because of the energy-level adjustment issue around 4 kHz in the existing methods. Around 0.77 s and 0.9 s in Fig. 7, the high-band information estimated by the proposed method is closer to the original high-band information than that estimated by the existing methods. However, around 7–8 kHz and 0.40–0.55 s in Fig. 7, the proposed method overestimates the high-band information relative to the original, compared with the existing methods.

Fig. 7

Spectrograms of a the original wideband signal and b–e the wideband signals reconstructed by the proposed, spectrum folding, spectral translation, and cepstral domain methods, respectively

3.3 Subjective Listening Test

Subjective assessment is done according to ITU-T P.800 [48, Annex E] for examining the speech quality. This is done for the extended speech signals obtained by the proposed method, the spectrum folding method, the spectral translation method, and the cepstral domain method using the DNN model with the softplus activation function. The wideband speech files extended by the proposed method are rated with respect to the wideband speech files extended by the existing methods. Ten pairs of extended speech signals belonging to the test set are randomly chosen for each comparison, i.e., 60 files in total. Then, twelve listeners were asked to rate each pair on a scale from \(-3\) (much worse) to 3 (much better). The ages of these listeners are between 23 and 32 years. These listeners do not have any hearing impairment and understand English well. They were permitted to listen to the speech files more than once. Further, the 95% confidence interval (CI) and p values are computed for measuring statistical significance. The comparison mean opinion score (CMOS), 95% confidence interval (CI), and p values are listed in Table 11.

Table 11 Subjective assessment on artificially extended speech files belonging to the test set by the proposed method with respect to the existing methods

Our proposed method improves the CMOS significantly, by 1.80, 0.96, and 1.59 points in comparison with the spectrum folding, spectral translation, and cepstral domain methods, respectively. Unvoiced phonemes are perceived better in the speech files extended by the proposed method than in those extended by the existing methods. For reference, speech files for all the conditions are provided and can be accessed using the link given in Footnote 1.

4 Conclusion

A new framework, which capitalizes on artificially introduced non-ideality in the narrowband signal, is proposed for the artificial bandwidth extension of speech signals. In our proposed framework, the transmitter setup differs from the existing setup, which mainly helps in identifying the high-frequency components of unvoiced speech. The discrete interpolation filter is obtained from a signal model with the help of \(H^\infty \) optimization. The obtained rational, stable, and causal interpolation filter is converted empirically into an FIR filter, and this FIR filter is taken as the wideband feature. Experiments are performed by considering four types of narrowband features: LSF, LPC, MFCC, and Cepstrum. The estimation of the wideband feature for a given narrowband feature is carried out by two different modeling techniques, GMM and DNN, with several topologies. Performance is analyzed on the test set speech files taken from the RSR15 database by computing the standard objective measures (SDR, MSE, MOS-LQO, LLR, STOI, and LSD) and by a subjective listening test. The objective measures are also analyzed separately for voiced and unvoiced speech. The proposed method gives better results than the existing methods using the DNN model, except for the MOS-LQO and STOI objective measures. In the listening test, the proposed method achieves a higher CMOS than the existing methods.