1 Introduction

In areas such as speech recognition and speaker identification, speech is the principal mode of communication, both between humans and between humans and computers [1]. Today's speech communication technologies are severely affected by numerous kinds of interference, which hamper direct listening and cause inaccurate information transfer [2]. As a result, the enhancement of degraded speech has been one of the primary research undertakings in the speech processing field during the last few decades, with the goal of achieving nearly transparent communication in applications such as cell phones. The major purpose of speech enhancement is to reduce distortions and to improve the perceptual characteristics of speech, namely clarity and intelligibility [3, 4]. In some cases these two characteristics are unrelated. For example, a speaker's exceptionally clear speech in a foreign language may sound excellent to a listener yet carry zero intelligibility. Thus, high-clarity speech may have little intelligibility, whereas low-clarity speech may have a great deal of it [5].

Speech enhancement systems are classified by the number of microphones used to collect the speech data: single, dual, or multiple. Even though multi-microphone speech enhancement outperforms single-microphone enhancement in terms of noise reduction [1], single-microphone speech enhancement remains a prevalent research theme because of its simplicity in design and processing. Single-microphone speech enhancement receives noisy data from only one microphone, with no extra knowledge about the degradation or the clean speech. It is demonstrated in [6] that the short-term spectral magnitude (STSM) is more significant than the phase shift for speech clarity and intelligibility.

Boll [7] pioneered spectral subtraction, a widely used single-microphone speech enhancement technique relying on the computation of the STSM. The method's main advantages are i) its simplicity (only a noise spectrum estimate is required) and ii) its adaptability through adjustment of the subtraction parameter. Despite its capacity to reduce background degradation, spectral subtraction inserts musical sound into the enhanced speech. This musical sound, perceived as twittering, degrades the perceptual clarity of speech recordings; if it is too prominent, it may even be more disturbing than the interference before enhancement.

In speech enhancement, the presence of musical sound is a key issue. Over the previous decades, several approaches have been developed to improve the classical spectral subtraction method, counteracting musical sounds and enhancing speech clarity in noisy environments [8, 9]. To make musical sounds inaudible, over-subtraction and spectral flooring were recommended in [8]. In [9], a multiband model in the frequency domain is recommended for improving speech.

This study investigates iterative-processed multiband speech enhancement (IP-MBSE) as a post-processing approach for musical sound suppression in enhanced speech recordings. A multiband spectral subtraction (MBSS) step converts the additive background degradation into an unpleasant musical sound. In IP-MBSE, the output of the MBSS processing step is used as the input for the following iteration. The musical sound is re-estimated at every iteration, and the over-subtraction of spectral data is performed individually in each subband. This procedure is repeated only a few times. The enhanced speech reflects a tradeoff among degradation suppression, speech distortion, and residual musical sound.

The rest of the paper is arranged as follows: The fundamentals of spectral subtraction [7], spectral over-subtraction (SOS) [8], and multiband spectral subtraction (MBSS) [9] are covered in Section 2. The proposed method for musical sound suppression, iterative-processed multiband speech enhancement (IP-MBSE), is described in Section 3. The performance evaluation and experimental findings are presented in Section 4. The conclusion is given in Section 5.

2 The fundamentals of spectral subtraction

Spectral subtraction is a cost-effective method for removing degradation from degraded speech. The technique, proposed by Boll [7], can be utilized for both speech enhancement and speech recognition.

In real-world conditions, additive noise degrades the speech signal [3, 7]. This background degradation is uncorrelated with the clean speech and is known as additive noise. Degradations can be either stationary (for instance, white Gaussian noise, WGN) or non-stationary (for instance, colored noise). A speech signal degraded by such noise is referred to as "noisy speech". The noisy signal can be represented mathematically as the sum of clean speech and degradation [3, 7]:

$$y\left[n\right]=s\left[n\right]+d\left[n\right],$$
(1)

where y[n], s[n], and d[n] are the nth samples of the noisy speech, clean speech, and background degradation, respectively. Because the speech signal is non-stationary, it is usually broken into short frames, which can be treated as stationary, and processed with the short-term Fourier transform (STFT). With Yw(ω), Sw(ω), and Dw(ω) denoting the STFTs of the corresponding signals, (1) may now be expressed as [6, 7]

$${Y}_w\left(\omega \right)={S}_w\left(\omega \right)+{D}_w\left(\omega \right)$$
(2)

The spectral subtraction method has two stages. In the first stage, an average noise spectrum estimate is subtracted from the noisy speech spectrum; this is referred to as the elementary subtraction step. In the second stage, several modifications, including half-wave rectification (HWR), are made to reduce the signal level in the silent zones and to lessen musical sound and speech distortion. Because phase distortion is not noticed by the human ear, the phase of the noisy speech is kept unchanged throughout the process [6]. As a result, neglecting the phase-shift information, the short-term spectral magnitude (STSM) of the noisy speech is the sum of the STSMs of the clean speech and the noise, and (2) can be represented as

$$\left|{Y}_w\left(\omega \right)\right|=\left|{S}_w\left(\omega \right)\right|+\left|{D}_w\left(\omega \right)\right|$$
(3)

Here

$${Y}_w\left(\omega \right)=\kern0.5em \left|{Y}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right),$$
$${S}_w\left(\omega \right)=\kern0.5em \left|{S}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right),$$

\({D}_w\left(\omega \right)=\left|{D}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right)\), where φy(ω) is the phase shift of the noisy signal. The power spectrum of the noisy speech is obtained as the product of Yw(ω) and its conjugate \({Y}_w^{\ast}\left(\omega \right)\). As a result, (2) becomes

$${\left|{Y}_w\left(\omega \right)\right|}^2={\left|{S}_w\left(\omega \right)\right|}^2+{\left|{D}_w\left(\omega \right)\right|}^2+{S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)+{S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)$$
(4)

where \({D}_w^{\ast}\left(\omega \right)\) and \({S}_w^{\ast}\left(\omega \right)\) are the conjugates of Dw(ω) and Sw(ω), and |Yw(ω)|2, |Sw(ω)|2, and |Dw(ω)|2 denote the noisy, clean speech, and noise power spectra, respectively. The terms |Dw(ω)|2, \({S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)\), and \({S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)\) in (4) cannot be obtained directly, so they are approximated as E{|Dw(ω)|2}, \(E\left\{{S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)\right\}\), and \(E\left\{{S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)\right\}\), where E{·} is the ensemble-averaging operator. The terms \(E\left\{{S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)\right\}\) and \(E\left\{{S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)\right\}\) reduce to zero when the additive noise is regarded as zero-mean and orthogonal to the speech [3]. As a result, (4) can be rephrased as

$${\left|{\hat{S}}_w\left(\omega \right)\right|}^2={\left|{Y}_w\left(\omega \right)\right|}^2-E\left\{{\left|{D}_w\left(\omega \right)\right|}^2\right\}={\left|{Y}_w\left(\omega \right)\right|}^2-{\left|{\hat{D}}_w\left(\omega \right)\right|}^2$$
(5)

where \({\left|{\hat{S}}_w\left(\omega \right)\right|}^2\) and |Yw(ω)|2 are the short-term power spectra of the processed and the noisy speech, respectively. The average noise power, \({\left|{\hat{D}}_w\left(\omega \right)\right|}^2\), is calculated and updated during speech pauses using a voice activity detector (VAD) [7]:

$${\left|{\hat{D}}_w\left(\omega \right)\right|}^2=\frac{1}{M}\sum\nolimits_{i=0}^{M-1}{\left|{Y}_{SP_i}\left(\omega \right)\right|}^2$$
(6)

where M denotes the number of consecutive speech-pause frames and \({Y}_{SP_i}\left(\omega \right)\) is the spectrum of the ith pause frame.
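As an illustration, the noise estimate of (6) can be computed from the STFT of the noisy signal in a few lines of numpy. The following is a minimal sketch that assumes the first frames of the recording are speech-free, which stands in for a proper VAD; the function name is hypothetical:

```python
import numpy as np

def estimate_noise_psd(noisy_stft, n_pause_frames=20):
    """Average the power spectra of the first few frames, assumed to be
    speech-free (a simple stand-in for a VAD), as in (6)."""
    pause = noisy_stft[:, :n_pause_frames]        # frequency bins x pause frames
    return np.mean(np.abs(pause) ** 2, axis=1)    # estimated noise power per bin
```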

The spectral subtraction method assumes that the speech signal has been corrupted by additive white Gaussian noise (WGN) with a flat spectrum, meaning that the degradation affects the signal evenly across the spectrum. The subtraction step must be performed with caution to minimize speech distortion. Due to erroneous estimation of the noise spectrum, the spectrum obtained after the subtraction operation may contain negative values. Because a power spectrum cannot be negative, half-wave rectification (HWR, setting the negative regions to zero) or full-wave rectification (FWR, taking the absolute value) is applied. HWR is widely used, but it introduces distracting sounds into the estimated speech; FWR prevents such irritating sounds but is less effective at degradation suppression. As a result, the spectral subtraction equation is given by

$${\left|{\hat{S}}_w\left(\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_w\left(\omega \right)\right|}^2-{\left|{\hat{D}}_w\left(\omega \right)\right|}^2, & \textrm{if}\ {\left|{Y}_w\left(\omega \right)\right|}^2>{\left|{\hat{D}}_w\left(\omega \right)\right|}^2\\ 0, & \textrm{else}\end{array}\right.$$
(7)

Because human perception is phase insensitive [6], the enhanced speech spectrum may be combined with the phase of the degraded speech, and the estimated speech can then be reconstructed from the inverse STFT (ISTFT) of the enhanced spectrum using the overlap-add (OLA) approach, which can be represented as

$${\hat{s}}_w\left[n\right]=\textrm{ISTFT}\ \left\{\left|{\hat{S}}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right)\right\}$$
(8)
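Putting (5)-(8) together, a minimal single-channel spectral subtraction can be sketched as follows. The function name and parameter defaults are illustrative assumptions, and the noise estimate again assumes the leading frames are speech-free:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(y, fs=8000, n_fft=256, n_pause=20):
    """Power spectral subtraction per (5)-(8): estimate the noise power,
    subtract it, half-wave rectify, and resynthesize with the noisy phase."""
    _, _, Y = stft(y, fs, window='hamming', nperseg=n_fft, noverlap=n_fft // 2)
    noise_psd = np.mean(np.abs(Y[:, :n_pause]) ** 2, axis=1)     # (6)
    power = np.abs(Y) ** 2 - noise_psd[:, None]                  # (5)
    power = np.maximum(power, 0.0)                               # HWR, (7)
    S_hat = np.sqrt(power) * np.exp(1j * np.angle(Y))            # keep noisy phase
    _, s_hat = istft(S_hat, fs, window='hamming',
                     nperseg=n_fft, noverlap=n_fft // 2)         # ISTFT + OLA, (8)
    return s_hat
```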

The disadvantage of spectral subtraction is its strong dependence on accurate noise estimation: according to (5), the method's efficacy is governed by the noise estimate, which in turn is constrained by the performance of the speech/pause detector. Musical sound and speech distortion are the two primary problems that arise when the noise estimate is inaccurate. The spectral over-subtraction of Berouti [8] is a variation of magnitude spectral subtraction [7].

2.1 Spectral over-subtraction (SOS)

To lessen musical sound and distortion, a modified spectral subtraction is presented in [8]. In addition to the spectral subtraction of [7], this method uses an over-subtraction factor and a noise spectral floor parameter [8]. The subtraction rule is as follows:

$${\left|{\hat{S}}_w\left(\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_w\left(\omega \right)\right|}^2-\alpha {\left|{\hat{D}}_w\left(\omega \right)\right|}^2, & \textrm{if}\ \frac{{\left|{\hat{D}}_w\left(\omega \right)\right|}^2}{{\left|{Y}_w\left(\omega \right)\right|}^2}<\frac{1}{\alpha +\beta }\\ \beta {\left|{\hat{D}}_w\left(\omega \right)\right|}^2, & \textrm{else}\end{array}\right.$$
(9)

with α ≥ 1 and 0 ≤ β ≪ 1

The spectral floor prevents the resulting spectrum from falling below a predetermined minimum level instead of being set to zero, while the over-subtraction factor controls how much noise power is subtracted from the noisy speech power in each frame. The over-subtraction factor is determined by the a posteriori segmental SNR and can be computed as

$$\alpha ={\alpha}_0+\left(\textrm{SNR}\right)\left(\frac{\alpha_{\textrm{min}}-{\alpha}_0}{{\textrm{SNR}}_{\textrm{max}}}\right)$$
(10)

This approach assumes that noise has a uniform influence on the speech spectrum, and the over-subtraction factor removes an overestimate of the noise from the noisy spectrum. Different combinations of the over-subtraction factor α and the spectral floor parameter β therefore produce a tradeoff between the amount of leftover noise and the level of perceived musical sound, balancing speech distortion against musical sound removal. When β is set to a high value, only a small amount of musical sound is audible; when β is set to a low value, the leftover noise is greatly reduced, but the musical sound becomes quite annoying. Accordingly, α is set as per (10) and β = 0.03.
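For concreteness, (9) and (10) can be applied per frame as in the sketch below; α0 = 4, αmin = 1, and SNRmax = 20 dB are illustrative settings, not values prescribed by [8]:

```python
import numpy as np

def over_subtraction_factor(snr_db, alpha0=4.0, alpha_min=1.0, snr_max=20.0):
    """Over-subtraction factor per (10), clipped to [alpha_min, alpha0]."""
    alpha = alpha0 + snr_db * (alpha_min - alpha0) / snr_max
    return float(np.clip(alpha, alpha_min, alpha0))

def sos_frame(noisy_power, noise_psd, alpha, beta=0.03):
    """Spectral over-subtraction with a spectral floor, per (9). The test
    |D|^2/|Y|^2 < 1/(alpha + beta) is equivalent to requiring the subtracted
    spectrum to stay above the floor beta * |D|^2."""
    subtracted = noisy_power - alpha * noise_psd
    floor = beta * noise_psd
    return np.where(subtracted > floor, subtracted, floor)
```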

Although this method reduces the perceived musical sound, background noise remains and the enhanced speech is distorted.

2.2 Multiband spectral subtraction (MBSS)

In the real world, degradations affect different regions of the speech spectrum differently. A linearly spaced multiband extension of SOS is presented in [9]. In this scheme the noisy spectrum is divided into K (K = 4) non-overlapping, evenly spaced frequency subbands, and spectral over-subtraction is applied independently in each subband. The multiband spectral subtraction (MBSS) scheme re-adjusts the over-subtraction factor for each subband. The estimate of the clean speech spectrum in the ith subband is thus calculated as

$${\left|{\hat{S}}_i\left(\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_i\left(\omega \right)\right|}^2-{\alpha}_i{\delta}_i{\left|{\hat{D}}_i\left(\omega \right)\right|}^2, & \textrm{if}\ {\left|{\hat{S}}_i\left(\omega \right)\right|}^2>\beta {\left|{Y}_i\left(\omega \right)\right|}^2\\ \beta {\left|{Y}_i\left(\omega \right)\right|}^2, & \textrm{else}\end{array}\right.$$
(11)

where \({k}_i<\omega <{k}_{i+1}\).

The start and end limits of the ith subband are represented by \({k}_i\) and \({k}_{i+1}\). The subband-specific over-subtraction factor \({\alpha}_i\) is a function of the segmental SNR (SegSNR) and allows some control over the noise subtraction level in each subband. The SegSNRi is computed from the spectral components of each subband i as

$${\textrm{SegSNR}}_i\ \left(\textrm{dB}\right)=10\;{\log}_{10}\left(\frac{\sum_{\omega ={k}_i}^{k_{i+1}}{\left|{Y}_i\left(\omega \right)\right|}^2}{\sum_{\omega ={k}_i}^{k_{i+1}}{\left|{\hat{D}}_i\left(\omega \right)\right|}^2}\right)$$
(12)

Figure 1 depicts the four subbands with their estimated SegSNR [9]. The noisy speech spectrum is divided into four frequency subbands: 60 Hz ~ 1 kHz (Subband 1), 1 kHz ~ 2 kHz (Subband 2), 2 kHz ~ 3 kHz (Subband 3), and 3 kHz ~ 4 kHz (Subband 4). The figure shows that the SegSNR of the low-frequency band (Subband 1) is significantly higher than that of the high-frequency subband (Subband 4) [9].

Fig. 1. SegSNR of four linearly spaced frequency subbands of degraded speech

The factor \({\delta}_i\) is a subband subtraction factor that may be set independently for each frequency subband to tailor the noise removal procedure, giving additional control over the noise subtraction level in each subband. Because the majority of the speech energy lies below 1 kHz, the values of δi are estimated empirically [9] and varied as needed:

$${\delta}_i=\left\{\begin{array}{ll}1, & {f}_i\le 1\ \textrm{kHz}\\ 2.5, & 1\ \textrm{kHz}<{f}_i\le \frac{f_s}{2}-2\ \textrm{kHz}\\ 1.5, & {f}_i>\frac{f_s}{2}-2\ \textrm{kHz}\end{array}\right.$$
(13)

where fi is the upper frequency limit of the ith subband and fs is the sampling frequency. Because the lower frequencies contain the majority of the speech energy, choosing lower values of δi for the lower subbands minimizes speech distortion. Both the αi and δi factors can be adjusted per subband for different speech situations to boost speech clarity.
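The per-subband processing of (11)-(13) for a single frame might look like the following sketch; the linear band splitting, the β value, and the α settings are illustrative assumptions rather than the exact configuration of [9]:

```python
import numpy as np

def mbss_frame(noisy_power, noise_psd, fs=8000, n_bands=4, beta=0.002):
    """One frame of multiband over-subtraction over linearly spaced subbands:
    SegSNR per (12), delta per (13), alpha per (10), spectral floor per (11)."""
    n_bins = len(noisy_power)
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)
    out = np.empty_like(noisy_power)
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        seg_snr = 10 * np.log10(np.sum(noisy_power[lo:hi]) /
                                (np.sum(noise_psd[lo:hi]) + 1e-12))    # (12)
        alpha = np.clip(4.0 + seg_snr * (1.0 - 4.0) / 20.0, 1.0, 4.0)  # (10)
        f_hi = hi * fs / (2.0 * n_bins)       # upper edge of band i in Hz
        if f_hi <= 1000:                      # (13)
            delta = 1.0
        elif f_hi <= fs / 2 - 2000:
            delta = 2.5
        else:
            delta = 1.5
        sub = noisy_power[lo:hi] - alpha * delta * noise_psd[lo:hi]    # (11)
        floor = beta * noisy_power[lo:hi]
        out[lo:hi] = np.where(sub > floor, sub, floor)
    return out
```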

Because real-world noise is highly random, the MBSS, which is tuned for WGN reduction, still requires improvement. Nevertheless, MBSS outperforms both the spectral subtraction method [7] and SOS [8].

3 Iterative-processed multiband speech enhancement (IP-MBSE)

The multiband spectral subtraction (MBSS) processing step converts the additive background noise into an annoying leftover sound with a musical structure. This paper proposes an iterative-processed multiband speech enhancement (IP-MBSE) post-processing method for suppressing musical sound in enhanced speech recordings. In the suggested method, the output of the MBSS processing step is fed into the subsequent iteration, which re-estimates the noise spectrum and performs spectral over-subtraction in each subband separately. By repeatedly applying the enhanced speech to the input and executing the operation, the proposed method reduces the musical sound further. This procedure is iterated only a few times, because a higher iteration number distorts the signal, while a lower iteration number retains the musical sound in the estimated speech.

Figure 2 depicts the block diagram of iterative-processed multiband speech enhancement (IP-MBSE). The estimated speech is repeatedly fed back as input to improve the speech and eliminate musical sounds. As shown in Fig. 2, the additive background noise transforms into a musical sound after the first step of conventional MBSS. Assume the input signal is y[n] and the enhanced speech obtained after the MBSS step is \(\hat{s}\left[n\right]\). The MBSS reduces the additive noise significantly, but this noise reduction is accompanied by an annoying musical-structure sound in the enhanced speech \(\hat{s}\left[n\right]\). In IP-MBSE, the noise remaining in each subband is re-estimated at each iteration and fed to the following iteration phase. The final enhanced speech signal is thus obtained after a finite number of iterations.

Fig. 2. Block diagram of iterative-processed multiband speech enhancement (IP-MBSE)

The iterative technique is inspired by Wiener filtering, a classical noise reduction method [10,11,12]. If the noise estimation and MBSS procedures are regarded as filtering steps, the filter's output is employed not just for filter design but also for the iteration that follows. This filter can be adaptively renewed by re-estimating the leftover sound, enhancing speech clarity and intelligibility.

The noisy speech at the mth iteration step, where m represents the iteration count, is expressed as

$$y\left[m,n\right]=s\left[m,n\right]+d\left[m,n\right]$$
(14)

where y[m, n], s[m, n], and d[m, n] are the nth samples at the mth iteration of the degraded speech, clean speech, and interference, respectively. The mth MBSS iteration step is calculated as

$${\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_i\left(m,\omega \right)\right|}^2-{\alpha}_i{\delta}_i{\left|{\hat{D}}_i\left(m,\omega \right)\right|}^2, & \textrm{if}\ {\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2>\beta {\left|{Y}_i\left(m,\omega \right)\right|}^2\\ \beta {\left|{Y}_i\left(m,\omega \right)\right|}^2, & \textrm{else}\end{array}\right.$$
(15)

where \({k}_i<\omega <{k}_{i+1}\), and the spectrum passed to the next iteration is

$${\left|{Y}_i\left(m+1,\omega \right)\right|}^2={\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2$$
(16)

where \({\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2\), |Yi(m, ω)|2, and \({\left|{\hat{D}}_i\left(m,\omega \right)\right|}^2\) represent the estimated speech, degraded speech, and estimated noise power in the ith subband, respectively, at the mth iteration step. After the mth iteration, the output \({\hat{S}}_i\left(m,\omega \right)\) is used as the input to the (m + 1)th iteration, i.e.,

$$y\left[m+1,n\right]=\hat{s}\left[m,n\right]$$
(17)

In IP-MBSE, the noise spectrum at each iteration is estimated from the noise component that remains after the preceding iteration's processing. This leftover noise component is the part of y[m + 1, n] that the MBSS could not suppress at the mth iteration. Because each MBSS processing step reduces the amount of noise, increasing the number of iterations reduces the quantity of leftover noise.
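A sketch of the whole iterative loop, reusing the hypothetical mbss_frame() above, is given below; the fixed iteration count, the STFT settings, and the pause-frame noise re-estimation are assumptions for illustration:

```python
import numpy as np
from scipy.signal import stft, istft

def ip_mbse(y, fs=8000, n_iter=3, n_fft=256, n_pause=20):
    """IP-MBSE loop per (14)-(17): each pass re-estimates the residual noise
    of the previous output and applies MBSS to it again."""
    x = y
    for _ in range(n_iter):                                        # iterations m
        _, _, X = stft(x, fs, window='hamming', nperseg=n_fft,
                       noverlap=n_fft // 2)
        noise_psd = np.mean(np.abs(X[:, :n_pause]) ** 2, axis=1)   # re-estimate
        P = np.abs(X) ** 2
        P_hat = np.stack([mbss_frame(P[:, j], noise_psd, fs)       # (15)
                          for j in range(P.shape[1])], axis=1)
        S = np.sqrt(P_hat) * np.exp(1j * np.angle(X))
        _, x = istft(S, fs, window='hamming', nperseg=n_fft,
                     noverlap=n_fft // 2)          # output -> next input, (17)
    return x
```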

The number of iterations is a significant aspect of IP-MBSE and affects speech enhancement performance [12]. The SegSNR at the end of each iteration grows as the iterations increase, and because the over-subtraction factor depends on the SegSNR, it increases as well. Figure 3 depicts the relationship between the iteration number and the mean value of the over-subtraction factor: the greater the number of iterations, the better the speech enhancement performance, with less musical sound.

Fig. 3. Relation between the iteration number and the over-subtraction factor mean value

4 Evaluation of performance and experimental results

This section presents the experimental findings and performance evaluation of the suggested methodology, IP-MBSE, and its comparison with the conventional MBSS scheme. Noisy speech samples (sampled at 8 kHz) were taken from the NOIZEUS corpus speech database [13] for the simulations. Four distinct utterances (three male speakers and one female speaker) were employed for the experiment.

Background noises have varied time-frequency distributions and affect speech signals differently. For the performance assessment of IP-MBSE, the utterances are degraded with seven different real-world noises and white Gaussian noise at SNR levels ranging from 0 to 15 dB. The real-world noises are those of cars, trains, restaurants, babble, airports, streets, and exhibitions.

For the experimental work, the noisy utterance is separated into frames of 256 samples with 50% overlap, and a Hamming window is applied to each frame. The noise estimate is updated by averaging over the pause frames (20 frames). The noise power spectral density is calculated with a smoothing factor of 0.9.
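For reference, the analysis settings described above can be collected in a small configuration, as sketched below; the dictionary keys are hypothetical names, not identifiers from the paper:

```python
# Illustrative analysis configuration matching the experimental setup.
PARAMS = {
    "fs": 8000,              # NOIZEUS sampling rate (Hz)
    "frame_len": 256,        # samples per frame
    "overlap": 0.5,          # 50% frame overlap
    "window": "hamming",     # analysis window
    "n_pause_frames": 20,    # frames averaged for the noise estimate
    "noise_smoothing": 0.9,  # noise PSD smoothing factor
}
```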

The number of iterations has a large impact on IP-MBSE's speech enhancement performance. To investigate this connection, Fig. 3 depicts the relationship between the iteration number and the mean over-subtraction factor (α). It is observed that α increases with the iterations, implying that a higher number of iterations yields better speech enhancement performance with less musical sound. Nevertheless, the waveforms and spectrograms in Figs. 4, 5, 6, 7, 8, 9, and 10 show that increasing the iteration number reduces the speech component somewhat while effectively suppressing the musical sound. As a result, for speech degraded by car noise, we fix the number of iterations at 2 to 3 while leaving the other parameters the same as in the reference MBSS step.

Fig. 4. Speech spectrograms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) (left) speech degraded by Car, Train, Babble, Restaurant, Airport, Street, Exhibition, and White noise, respectively (5 dB SNR); (c) (right) corresponding enhanced speech

Fig. 5. Temporal waveforms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) (left) speech degraded by Car, Train, Babble, Restaurant, Airport, Street, Exhibition, and White noise, respectively (5 dB SNR); (c) (right) corresponding enhanced speech

Fig. 6. Temporal waveforms and speech spectrograms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 5 dB SNR); (c) speech enhanced by MBSS (PESQ = 1.78); and (d) speech enhanced by IP-MBSE (PESQ = 1.92)

Fig. 7. Temporal waveforms and speech spectrograms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.03); and (d) speech enhanced by IP-MBSE (PESQ = 2.15)

Fig. 8. Temporal waveforms and speech spectrograms of sp6 utterance, "Men strive but seldom get rich", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.16); and (d) speech enhanced by IP-MBSE (PESQ = 2.27)

Fig. 9. Temporal waveforms and speech spectrograms of sp10 utterance, "The sky that morning was clear and bright blue", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.26); and (d) speech enhanced by IP-MBSE (PESQ = 2.46)

Fig. 10. Temporal waveforms and speech spectrograms of sp12 utterance, "The drip of the rain made a pleasant sound", by a female speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.01); and (d) speech enhanced by IP-MBSE (PESQ = 2.26)

Both objective and subjective indicators have been used to assess IP-MBSE performance. SNR, SegSNR, and PESQ are objective metrics, while MOS and spectrograms are subjective metrics.

4.1 Objective evaluation

a) Signal-to-Noise Ratio (SNR): This is calculated by dividing an utterance's total signal energy by its total noise energy. The SNR of the enhanced signal is evaluated using the equation below.

$$\textrm{SNR}=10\;{\log}_{10}\left(\frac{\sum_{n=1}^L{s}^2\left[n\right]}{\sum_{n=1}^L{\left\{s\left[n\right]-\hat{s}\left[n\right]\right\}}^2}\right)$$
(18)

where n is the sample index and L is the number of samples; s[n] and \(\hat{s}\left[n\right]\) denote the clean and the enhanced speech, respectively. The summation runs over the length of the signal.

b) Segmental Signal-to-Noise Ratio (SegSNR): The average signal-to-noise energy ratio per frame is known as the SegSNR, and it may be written as:

$$\textrm{SegSNR}=\frac{1}{M}\sum\nolimits_{m=0}^{M-1}10\;{\log}_{10}\left(\frac{\sum_{n={N}_m}^{N_m+N-1}{s}^2\left[n\right]}{\sum_{n={N}_m}^{N_m+N-1}{\left\{s\left[n\right]-\hat{s}\left[n\right]\right\}}^2}\right)$$
(19)

where M denotes the number of frames in the signal and N the number of samples per frame. The SegSNR correlates better with perceptual clarity than the global SNR; a greater SegSNR indicates less distortion. A minimal sketch of both measures is given after this list.

c) Perceptual Evaluation of Speech Quality (PESQ): The ITU-T recommends the PESQ for speech clarity assessment because it is an objective evaluation that predicts the subjective opinion score of a degraded speech sample [14]. In several testing situations, the PESQ has been found to correlate highly with subjective tests [14].
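The two waveform-domain measures, (18) and (19), are straightforward to compute. The sketch below assumes time-aligned clean and enhanced signals of equal length; PESQ, being a standardized model, is typically computed with an existing implementation rather than from scratch:

```python
import numpy as np

def snr_db(clean, enhanced):
    """Global SNR per (18)."""
    err = clean - enhanced
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))

def seg_snr_db(clean, enhanced, frame_len=256):
    """Segmental SNR per (19): mean of the per-frame SNRs."""
    vals = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = s - enhanced[start:start + frame_len]
        vals.append(10 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12)))
    return float(np.mean(vals))
```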

4.2 Subjective evaluation – Mean Opinion Score (MOS)

A subjective evaluation is based on listeners' judgments. The listening tests for our experimental review were conducted with five participants wearing headphones in a confined room. For each test signal, each listener assigns a score ranging from 1 to 4.5, reflecting their overall impression of the clarity of the speech, including musical sound, background noise, and speech distortion. The tests used a scale corresponding to the MOS scale described in [3]. For each speaker, the clean and noisy speech is played first, and each test signal is then played and repeated twice.

Table 1 compares IP-MBSE to the standard MBSS in terms of global SNR [dB] and SegSNR [dB] at various SNR levels. For the various types of noise, the SNR and SegSNR values of IP-MBSE are superior to those of MBSS.

Table 1. IP-MBSE objective evaluation and comparison in terms of SNR [dB] and SegSNR [dB]

The PESQ and MOS scores of IP-MBSE versus MBSS are shown in Table 2. IP-MBSE outperforms traditional MBSS on the PESQ test for all noises except train and airport noise, and the speech enhanced by IP-MBSE also exceeds MBSS on the MOS measure.

Table 2. The outcome of a noise reduction speech quality test

The temporal waveforms and spectrograms of the clean, noisy, and enhanced speech signals are shown in Figs. 4, 5, 6, 7, 8, 9, and 10. As seen in these figures, IP-MBSE reduces the musical structure of the leftover noise more than MBSS. As a result, the speech enhanced by IP-MBSE is more pleasant to listen to, and the residual sound has a white character with acceptable distortion. This supports the results of the SNR, SegSNR, and PESQ tests (Table 1), as well as the listening tests (Table 2).

5 Conclusion

In this paper, we investigated an iterative-processed multiband speech enhancement (IP-MBSE) method for the suppression of annoying musical sounds. In the proposed technique, the output of multiband spectral subtraction (MBSS) is fed back as input in subsequent iterations. The iteration number is crucial in IP-MBSE, because a higher number distorts the signal while a lower number retains the musical sound in the estimated speech; as a result, only a few iterations are carried out. Compared with the conventional MBSS, IP-MBSE is found to outperform MBSS at low SNRs.