1 Introduction

In areas such as speech recognition and speaker identification, speech is the principal mode of communication, both between humans and between humans and computers [1]. Today's speech communication technologies are severely affected by numerous kinds of interference, which hamper direct listening and cause inaccurate information transfer [2]. As a result, the enhancement of degraded speech has been one of the primary research undertakings in the speech processing field during the last few decades, with the goal of achieving nearly transparent communication in applications such as cell phones. The major purpose of speech enhancement is to reduce distortions and to improve the perceptual characteristics of speech, namely clarity and intelligibility [3, 4]. In some cases these two characteristics are unrelated. For example, a speaker's exceptionally clear speech in a foreign language may sound excellent to a listener yet carry zero intelligibility. Thus, high-clarity speech may have little intelligibility, whereas low-clarity speech may have a great deal of it [5].

Speech enhancement systems are classified by the number of microphones used to collect the speech data: single, dual, or multiple. Even though multi-microphone speech enhancement outperforms single-microphone enhancement in terms of noise reduction [1], single-microphone speech enhancement remains a prevalent research theme because of its simplicity in design and processing. Single-microphone speech enhancement receives noisy data from only one microphone, with no extra knowledge about the degradation or the clean speech. It is demonstrated in [6] that the short-term spectral magnitude (STSM) is more significant than the phase shift for speech clarity and intelligibility.

Boll [7] pioneered spectral subtraction, a widely used single-microphone speech enhancement technique relying on the computation of the STSM. The method's main advantages are i) its simplicity (only a noise spectrum estimate is required) and ii) its adaptability through adjustment of the subtraction parameter. Despite its capacity to reduce background degradation, spectral subtraction inserts musical sound into the enhanced speech. This musical sound, perceived as twittering, degrades the perceptual clarity of speech recordings; if it is too prominent, it may even be more disturbing than the interference before enhancement.

In speech enhancement, the presence of musical sound is a key issue. Over the previous decades, several approaches have been developed to improve the classical spectral subtraction method, counteracting musical sounds and enhancing speech clarity in noisy environments [8, 9]. To make musical sounds inaudible, over-subtraction and spectral flooring were recommended in [8]. In [9], a multiband model in the frequency domain is recommended for improving speech.

This study investigates iterative-processed multiband speech enhancement (IP-MBSE) as a post-processing approach for musical sound suppression in enhanced speech recordings. A multiband spectral subtraction (MBSS) step converts the additive background degradation into an unpleasant musical sound. In IP-MBSE, the output of the MBSS processing step is used as the input for the following iteration. The musical sound is re-estimated at every iteration, and the over-subtraction of spectral data is performed individually in each subband. This procedure is repeated only a few times. The enhanced speech reflects a tradeoff among degradation suppression, speech distortion, and residual musical sound.

The rest of the paper is arranged as follows: The fundamentals of spectral subtraction [7], spectral over-subtraction (SOS) [8], and multiband spectral subtraction (MBSS) [9] are covered in Section 2. The proposed method for musical sound suppression, iterative-processed multiband speech enhancement (IP-MBSE), is described in Section 3. The performance evaluation and experimental findings are presented in Section 4. The conclusion is given in Section 5.

2 The fundamentals of spectral subtraction

Spectral subtraction is a cost-effective method for removing degradation from degraded speech. The technique, proposed by Boll [7], can be utilized for both speech enhancement and speech recognition.

In real-world conditions, additive noise degrades the speech signal [3, 7]. This background degradation is uncorrelated with the clean speech and is known as additive noise. Degradations can be either stationary (for instance, white Gaussian noise, WGN) or non-stationary (for instance, colored noise). A speech signal degraded by such noise is referred to as "noisy speech". The noisy signal can be represented mathematically as the sum of clean speech and degradation [3, 7]:

$$y\left[n\right]=s\left[n\right]+d\left[n\right],$$
(1)

where y[n], s[n], and d[n] are the nth samples of the noisy speech, clean speech, and background degradation, respectively. Because the speech signal is non-stationary, it is usually broken into short frames, which can be treated as stationary, and processed with the short-term Fourier transform (STFT). With Yw(ω), Sw(ω), and Dw(ω) denoting the STFTs of the corresponding signals, (1) may now be expressed as [6, 7]

$${Y}_w\left(\omega \right)={S}_w\left(\omega \right)+{D}_w\left(\omega \right)$$
(2)

The spectral subtraction method has two stages. In the first stage, an average noise spectrum estimate is subtracted from the noisy speech spectrum; this is referred to as the elementary subtraction step. In the second stage, several modifications, including half-wave rectification (HWR), are made to reduce the signal level in the silent zones and to lessen musical sound and speech distortion. Because phase distortion is not noticed by the human ear, the phase of the noisy speech is kept unchanged throughout the process [6]. As a result, neglecting the phase-shift information, the short-term spectral magnitude (STSM) of the noisy speech is the sum of the STSMs of the clean speech and the noise, and (2) can be represented as

$$\left|{Y}_w\left(\omega \right)\right|=\left|{S}_w\left(\omega \right)\right|+\left|{D}_w\left(\omega \right)\right|$$
(3)

Here

$${Y}_w\left(\omega \right)=\kern0.5em \left|{Y}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right),$$
$${S}_w\left(\omega \right)=\kern0.5em \left|{S}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right),$$

\({D}_w\left(\omega \right)=\left|{D}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right)\), where φy(ω) is the phase shift of the noisy signal. The power spectrum of the noisy speech is obtained as the product of Yw(ω) and its conjugate \({Y}_w^{\ast}\left(\omega \right)\). As a result, (2) becomes

$${\left|{Y}_w\left(\omega \right)\right|}^2={\left|{S}_w\left(\omega \right)\right|}^2+{\left|{D}_w\left(\omega \right)\right|}^2+{S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)+{S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)$$
(4)

where \({D}_w^{\ast}\left(\omega \right)\) and \({S}_w^{\ast}\left(\omega \right)\) are the conjugates of Dw(ω) and Sw(ω), and |Yw(ω)|2, |Sw(ω)|2, and |Dw(ω)|2 denote the noisy, clean speech, and noise power spectra, respectively. The terms |Dw(ω)|2, \({S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)\), and \({S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)\) in (4) cannot be obtained directly, so they are approximated as E{|Dw(ω)|2}, \(E\left\{{S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)\right\}\), and \(E\left\{{S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)\right\}\), where E{·} is the ensemble-averaging operator. The terms \(E\left\{{S}_w\left(\omega \right){D}_w^{\ast}\left(\omega \right)\right\}\) and \(E\left\{{S}_w^{\ast}\left(\omega \right){D}_w\left(\omega \right)\right\}\) reduce to zero when the additive noise is regarded as zero-mean and orthogonal to the speech [3]. As a result, (4) can be rephrased as

$${\left|{\hat{S}}_w\left(\omega \right)\right|}^2={\left|{Y}_w\left(\omega \right)\right|}^2-E\left\{{\left|{D}_w\left(\omega \right)\right|}^2\right\}={\left|{Y}_w\left(\omega \right)\right|}^2-{\left|{\hat{D}}_w\left(\omega \right)\right|}^2$$
(5)

where \({\left|{\hat{S}}_w\left(\omega \right)\right|}^2\) and |Yw(ω)|2 are the short-term power spectra of the processed and the noisy speech, respectively. The average noise power, \({\left|{\hat{D}}_w\left(\omega \right)\right|}^2\), is calculated and updated during speech pauses using a voice activity detector (VAD) [7]:

$${\left|{\hat{D}}_w\left(\omega \right)\right|}^2=\frac{1}{M}\sum\nolimits_{i=0}^{M-1}{\left|{Y}_{SP_i}\left(\omega \right)\right|}^2$$
(6)

where M denotes the number of consecutive speech-pause frames and \({Y}_{SP_i}\left(\omega \right)\) is the spectrum of the ith pause frame.
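As an illustration, the noise estimate of (6) can be computed from the STFT of the noisy signal in a few lines of numpy. The following is a minimal sketch that assumes the first frames of the recording are speech-free, which stands in for a proper VAD; the function name is hypothetical:

```python
import numpy as np

def estimate_noise_psd(noisy_stft, n_pause_frames=20):
    """Average the power spectra of the first few frames, assumed to be
    speech-free (a simple stand-in for a VAD), as in (6)."""
    pause = noisy_stft[:, :n_pause_frames]        # frequency bins x pause frames
    return np.mean(np.abs(pause) ** 2, axis=1)    # estimated noise power per bin
```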

The spectral subtraction method assumes that the speech signal has been corrupted by additive white Gaussian noise (WGN) with a flat spectrum, meaning that the degradation affects the signal evenly across the spectrum. The subtraction step must be performed with caution to minimize speech distortion. Due to erroneous estimation of the noise spectrum, the spectrum obtained after the subtraction operation may contain negative values. Because a power spectrum cannot be negative, half-wave rectification (HWR, setting the negative regions to zero) or full-wave rectification (FWR, taking the absolute value) is applied. HWR is widely used, but it introduces distracting sounds into the estimated speech; FWR prevents such irritating sounds but is less effective at degradation suppression. As a result, the spectral subtraction equation is given by

$${\left|{\hat{S}}_w\left(\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_w\left(\omega \right)\right|}^2-{\left|{\hat{D}}_w\left(\omega \right)\right|}^2, & \textrm{if}\ {\left|{Y}_w\left(\omega \right)\right|}^2>{\left|{\hat{D}}_w\left(\omega \right)\right|}^2\\ 0, & \textrm{else}\end{array}\right.$$
(7)

Because human perception is phase insensitive [6], the enhanced speech spectrum may be combined with the phase of the degraded speech, and the estimated speech can then be reconstructed from the inverse STFT (ISTFT) of the enhanced spectrum using the overlap-add (OLA) approach, which can be represented as

$${\hat{s}}_w\left[n\right]=\textrm{ISTFT}\ \left\{\left|{\hat{S}}_w\left(\omega \right)\right|\exp \left(j{\varphi}_y\left(\omega \right)\right)\right\}$$
(8)
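Putting (5)-(8) together, a minimal single-channel spectral subtraction can be sketched as follows. The function name and parameter defaults are illustrative assumptions, and the noise estimate again assumes the leading frames are speech-free:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(y, fs=8000, n_fft=256, n_pause=20):
    """Power spectral subtraction per (5)-(8): estimate the noise power,
    subtract it, half-wave rectify, and resynthesize with the noisy phase."""
    _, _, Y = stft(y, fs, window='hamming', nperseg=n_fft, noverlap=n_fft // 2)
    noise_psd = np.mean(np.abs(Y[:, :n_pause]) ** 2, axis=1)     # (6)
    power = np.abs(Y) ** 2 - noise_psd[:, None]                  # (5)
    power = np.maximum(power, 0.0)                               # HWR, (7)
    S_hat = np.sqrt(power) * np.exp(1j * np.angle(Y))            # keep noisy phase
    _, s_hat = istft(S_hat, fs, window='hamming',
                     nperseg=n_fft, noverlap=n_fft // 2)         # ISTFT + OLA, (8)
    return s_hat
```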

The disadvantage of spectral subtraction is its strong dependence on accurate noise estimation: according to (5), the method's efficacy is governed by the noise estimate, which in turn is constrained by the performance of the speech/pause detector. Musical sound and speech distortion are the two primary problems that arise when the noise estimate is inaccurate. The spectral over-subtraction of Berouti [8] is a variation of magnitude spectral subtraction [7].

2.1 Spectral over-subtraction (SOS)

To lessen musical sound and distortion, a modified spectral subtraction is presented in [8]. In addition to the spectral subtraction of [7], this method uses an over-subtraction factor and a noise spectral floor parameter [8]. The subtraction rule is as follows:

$${\left|{\hat{S}}_w\left(\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_w\left(\omega \right)\right|}^2-\alpha {\left|{\hat{D}}_w\left(\omega \right)\right|}^2, & \textrm{if}\ \frac{{\left|{\hat{D}}_w\left(\omega \right)\right|}^2}{{\left|{Y}_w\left(\omega \right)\right|}^2}<\frac{1}{\alpha +\beta }\\ \beta {\left|{\hat{D}}_w\left(\omega \right)\right|}^2, & \textrm{else}\end{array}\right.$$
(9)

with α ≥ 1 and 0 ≤ β ≪ 1

The spectral floor prevents the resulting spectrum from falling below a predetermined minimum level instead of being set to zero, while the over-subtraction factor controls how much noise power is subtracted from the noisy speech power in each frame. The over-subtraction factor is determined by the a posteriori segmental SNR and can be computed as

$$\alpha ={\alpha}_0+\left(\textrm{SNR}\right)\left(\frac{\alpha_{\textrm{min}}-{\alpha}_0}{{\textrm{SNR}}_{\textrm{max}}}\right)$$
(10)

This approach assumes that noise has a uniform influence on the speech spectrum, and the over-subtraction factor removes an overestimate of the noise from the noisy spectrum. Different combinations of the over-subtraction factor α and the spectral floor parameter β therefore produce a tradeoff between the amount of leftover noise and the level of perceived musical sound, balancing speech distortion against musical sound removal. When β is set to a high value, only a small amount of musical sound is audible; when β is set to a low value, the leftover noise is greatly reduced, but the musical sound becomes quite annoying. Accordingly, α is set as per (10) and β = 0.03.
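For concreteness, (9) and (10) can be applied per frame as in the sketch below; α0 = 4, αmin = 1, and SNRmax = 20 dB are illustrative settings, not values prescribed by [8]:

```python
import numpy as np

def over_subtraction_factor(snr_db, alpha0=4.0, alpha_min=1.0, snr_max=20.0):
    """Over-subtraction factor per (10), clipped to [alpha_min, alpha0]."""
    alpha = alpha0 + snr_db * (alpha_min - alpha0) / snr_max
    return float(np.clip(alpha, alpha_min, alpha0))

def sos_frame(noisy_power, noise_psd, alpha, beta=0.03):
    """Spectral over-subtraction with a spectral floor, per (9). The test
    |D|^2/|Y|^2 < 1/(alpha + beta) is equivalent to requiring the subtracted
    spectrum to stay above the floor beta * |D|^2."""
    subtracted = noisy_power - alpha * noise_psd
    floor = beta * noise_psd
    return np.where(subtracted > floor, subtracted, floor)
```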

Although this method reduces the perceived musical sound, background noise remains and the enhanced speech is distorted.

2.2 Multiband spectral subtraction (MBSS)

In the real world, degradations affect different regions of the speech spectrum differently. A linearly spaced multiband extension of SOS is presented in [9]. In this scheme the noisy spectrum is divided into K (K = 4) non-overlapping, evenly spaced frequency subbands, and spectral over-subtraction is applied independently in each subband. The multiband spectral subtraction (MBSS) scheme re-adjusts the over-subtraction factor for each subband. The estimate of the clean speech spectrum in the ith subband is thus calculated as

$${\left|{\hat{S}}_i\left(\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_i\left(\omega \right)\right|}^2-{\alpha}_i{\delta}_i{\left|{\hat{D}}_i\left(\omega \right)\right|}^2, & \textrm{if}\ {\left|{\hat{S}}_i\left(\omega \right)\right|}^2>\beta {\left|{Y}_i\left(\omega \right)\right|}^2\\ \beta {\left|{Y}_i\left(\omega \right)\right|}^2, & \textrm{else}\end{array}\right.$$
(11)

where \({k}_i<\omega <{k}_{i+1}\).

The start and end limits of the ith subband are represented by \({k}_i\) and \({k}_{i+1}\). The subband-specific over-subtraction factor \({\alpha}_i\) is a function of the segmental SNR (SegSNR) and allows some control over the noise subtraction level in each subband. The SegSNRi is computed from the spectral components of each subband i as

$${\textrm{SegSNR}}_i\ \left(\textrm{dB}\right)=10\;{\log}_{10}\left(\frac{\sum_{\omega ={k}_i}^{k_{i+1}}{\left|{Y}_i\left(\omega \right)\right|}^2}{\sum_{\omega ={k}_i}^{k_{i+1}}{\left|{\hat{D}}_i\left(\omega \right)\right|}^2}\right)$$
(12)

Figure 1 depicts the four subbands with their estimated SegSNR [9]. The noisy speech spectrum is divided into four frequency subbands: 60 Hz ~ 1 kHz (Subband 1), 1 kHz ~ 2 kHz (Subband 2), 2 kHz ~ 3 kHz (Subband 3), and 3 kHz ~ 4 kHz (Subband 4). The figure shows that the SegSNR of the low-frequency band (Subband 1) is significantly higher than that of the high-frequency subband (Subband 4) [9].

Fig. 1. SegSNR of four linearly spaced frequency subbands of degraded speech

The factor \({\delta}_i\) is a subband subtraction factor that may be set independently for each frequency subband to tailor the noise removal procedure, giving additional control over the noise subtraction level in each subband. Because the majority of the speech energy lies below 1 kHz, the values of δi are estimated empirically [9] and varied as needed:

$${\delta}_i=\left\{\begin{array}{ll}1, & {f}_i\le 1\ \textrm{kHz}\\ 2.5, & 1\ \textrm{kHz}<{f}_i\le \frac{f_s}{2}-2\ \textrm{kHz}\\ 1.5, & {f}_i>\frac{f_s}{2}-2\ \textrm{kHz}\end{array}\right.$$
(13)

where fi is the upper frequency limit of the ith subband and fs is the sampling frequency. Because the lower frequencies contain the majority of the speech energy, choosing lower values of δi for the lower subbands minimizes speech distortion. Both the αi and δi factors can be adjusted per subband for different speech situations to boost speech clarity.
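The per-subband processing of (11)-(13) for a single frame might look like the following sketch; the linear band splitting, the β value, and the α settings are illustrative assumptions rather than the exact configuration of [9]:

```python
import numpy as np

def mbss_frame(noisy_power, noise_psd, fs=8000, n_bands=4, beta=0.002):
    """One frame of multiband over-subtraction over linearly spaced subbands:
    SegSNR per (12), delta per (13), alpha per (10), spectral floor per (11)."""
    n_bins = len(noisy_power)
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)
    out = np.empty_like(noisy_power)
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        seg_snr = 10 * np.log10(np.sum(noisy_power[lo:hi]) /
                                (np.sum(noise_psd[lo:hi]) + 1e-12))    # (12)
        alpha = np.clip(4.0 + seg_snr * (1.0 - 4.0) / 20.0, 1.0, 4.0)  # (10)
        f_hi = hi * fs / (2.0 * n_bins)       # upper edge of band i in Hz
        if f_hi <= 1000:                      # (13)
            delta = 1.0
        elif f_hi <= fs / 2 - 2000:
            delta = 2.5
        else:
            delta = 1.5
        sub = noisy_power[lo:hi] - alpha * delta * noise_psd[lo:hi]    # (11)
        floor = beta * noisy_power[lo:hi]
        out[lo:hi] = np.where(sub > floor, sub, floor)
    return out
```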

Because real-world noise is highly random, the MBSS, which is tuned for WGN reduction, still requires improvement. Nevertheless, MBSS outperforms both the spectral subtraction method [7] and SOS [8].

3 Iterative-processed multiband speech enhancement (IP-MBSE)

The multiband spectral subtraction (MBSS) processing step converts the additive background noise into an annoying leftover sound with a musical structure. This paper proposes an iterative-processed multiband speech enhancement (IP-MBSE) post-processing method for suppressing musical sound in enhanced speech recordings. In the suggested method, the output of the MBSS processing step is fed into the subsequent iteration, which re-estimates the noise spectrum and performs spectral over-subtraction in each subband separately. By repeatedly applying the enhanced speech to the input and executing the operation, the proposed method reduces the musical sound further. This procedure is iterated only a few times, because a higher iteration number distorts the signal, while a lower iteration number retains the musical sound in the estimated speech.

Figure 2 depicts the block diagram of iterative-processed multiband speech enhancement (IP-MBSE). The estimated speech is repeatedly fed back as input to improve the speech and eliminate musical sounds. As shown in Fig. 2, the additive background noise transforms into a musical sound after the first step of conventional MBSS. Assume the input signal is y[n] and the enhanced speech obtained after the MBSS step is \(\hat{s}\left[n\right]\). The MBSS reduces the additive noise significantly, but this noise reduction is accompanied by an annoying musical-structure sound in the enhanced speech \(\hat{s}\left[n\right]\). In IP-MBSE, the noise remaining in each subband is re-estimated at each iteration and fed to the following iteration phase. The final enhanced speech signal is thus obtained after a finite number of iterations.

Fig. 2. Block diagram of iterative-processed multiband speech enhancement (IP-MBSE)

The iterative technique is inspired by Wiener filtering, a classical noise reduction method [10,11,12]. If the noise estimation and MBSS procedures are regarded as filtering steps, the filter's output is employed not just for filter design but also for the iteration that follows. This filter can be adaptively renewed by re-estimating the leftover sound, enhancing speech clarity and intelligibility.

The noisy speech at the mth iteration step, where m represents the iteration count, is expressed as

$$y\left[m,n\right]=s\left[m,n\right]+d\left[m,n\right]$$
(14)

where y[m, n], s[m, n], and d[m, n] are the nth samples at the mth iteration of the degraded speech, clean speech, and interference, respectively. The mth MBSS iteration step is calculated as

$${\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2=\left\{\begin{array}{ll}{\left|{Y}_i\left(m,\omega \right)\right|}^2-{\alpha}_i{\delta}_i{\left|{\hat{D}}_i\left(m,\omega \right)\right|}^2, & \textrm{if}\ {\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2>\beta {\left|{Y}_i\left(m,\omega \right)\right|}^2\\ \beta {\left|{Y}_i\left(m,\omega \right)\right|}^2, & \textrm{else}\end{array}\right.$$
(15)

where \({k}_i<\omega <{k}_{i+1}\), and the spectrum passed to the next iteration is

$${\left|{Y}_i\left(m+1,\omega \right)\right|}^2={\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2$$
(16)

where \({\left|{\hat{S}}_i\left(m,\omega \right)\right|}^2\), |Yi(m, ω)|2, and \({\left|{\hat{D}}_i\left(m,\omega \right)\right|}^2\) represent the estimated speech, degraded speech, and estimated noise power in the ith subband, respectively, at the mth iteration step. After the mth iteration, the output \({\hat{S}}_i\left(m,\omega \right)\) is used as the input to the (m + 1)th iteration, i.e.,

$$y\left[m+1,n\right]=\hat{s}\left[m,n\right]$$
(17)

In IP-MBSE, the noise spectrum at each iteration is estimated from the noise component that remains after the preceding iteration's processing. This leftover noise component is the part of y[m + 1, n] that the MBSS could not suppress at the mth iteration. Because each MBSS processing step reduces the amount of noise, increasing the number of iterations reduces the quantity of leftover noise.
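A sketch of the whole iterative loop, reusing the hypothetical mbss_frame() above, is given below; the fixed iteration count, the STFT settings, and the pause-frame noise re-estimation are assumptions for illustration:

```python
import numpy as np
from scipy.signal import stft, istft

def ip_mbse(y, fs=8000, n_iter=3, n_fft=256, n_pause=20):
    """IP-MBSE loop per (14)-(17): each pass re-estimates the residual noise
    of the previous output and applies MBSS to it again."""
    x = y
    for _ in range(n_iter):                                        # iterations m
        _, _, X = stft(x, fs, window='hamming', nperseg=n_fft,
                       noverlap=n_fft // 2)
        noise_psd = np.mean(np.abs(X[:, :n_pause]) ** 2, axis=1)   # re-estimate
        P = np.abs(X) ** 2
        P_hat = np.stack([mbss_frame(P[:, j], noise_psd, fs)       # (15)
                          for j in range(P.shape[1])], axis=1)
        S = np.sqrt(P_hat) * np.exp(1j * np.angle(X))
        _, x = istft(S, fs, window='hamming', nperseg=n_fft,
                     noverlap=n_fft // 2)          # output -> next input, (17)
    return x
```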

The number of iterations is a significant aspect of IP-MBSE and affects speech enhancement performance [12]. The SegSNR at the end of each iteration grows as the iterations increase, and because the over-subtraction factor depends on the SegSNR, it increases as well. Figure 3 depicts the relationship between the iteration number and the mean value of the over-subtraction factor: the greater the number of iterations, the better the speech enhancement performance, with less musical sound.

Fig. 3. Relation between the iteration number and the over-subtraction factor mean value

4 Evaluation of performance and experimental results

This section presents the experimental findings and performance evaluation of the suggested methodology, IP-MBSE, and its comparison with the conventional MBSS scheme. Noisy speech samples (sampled at 8 kHz) were taken from the NOIZEUS corpus speech database [13] for the simulations. Four distinct utterances (three male speakers and one female speaker) were employed for the experiment.

Background noises have varied time-frequency distributions and affect speech signals differently. For the performance assessment of IP-MBSE, the utterances are degraded with seven different real-world noises and white Gaussian noise at SNR levels ranging from 0 to 15 dB. The real-world noises are those of cars, trains, restaurants, babble, airports, streets, and exhibitions.

For the experimental work, the noisy utterance is separated into frames of 256 samples with 50% overlap, and a Hamming window is applied to each frame. The noise estimate is updated by averaging over the pause frames (20 frames). The noise power spectral density is calculated with a smoothing factor of 0.9.
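For reference, the analysis settings described above can be collected in a small configuration, as sketched below; the dictionary keys are hypothetical names, not identifiers from the paper:

```python
# Illustrative analysis configuration matching the experimental setup.
PARAMS = {
    "fs": 8000,              # NOIZEUS sampling rate (Hz)
    "frame_len": 256,        # samples per frame
    "overlap": 0.5,          # 50% frame overlap
    "window": "hamming",     # analysis window
    "n_pause_frames": 20,    # frames averaged for the noise estimate
    "noise_smoothing": 0.9,  # noise PSD smoothing factor
}
```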

The number of iterations has a large impact on IP-MBSE's speech enhancement performance. To investigate this connection, Fig. 3 depicts the relationship between the iteration number and the mean over-subtraction factor (α). It is observed that α increases with the iterations, implying that a higher number of iterations yields better speech enhancement performance with less musical sound. Nevertheless, the waveforms and spectrograms in Figs. 4, 5, 6, 7, 8, 9, and 10 show that increasing the iteration number reduces the speech component somewhat while effectively suppressing the musical sound. As a result, for speech degraded by car noise, we fix the number of iterations at 2 to 3 while leaving the other parameters the same as in the reference MBSS step.

Fig. 4. Speech spectrograms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) (left) speech degraded by Car, Train, Babble, Restaurant, Airport, Street, Exhibition, and White noise, respectively (5 dB SNR); (c) (right) corresponding enhanced speech

Fig. 5. Temporal waveforms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) (left) speech degraded by Car, Train, Babble, Restaurant, Airport, Street, Exhibition, and White noise, respectively (5 dB SNR); (c) (right) corresponding enhanced speech

Fig. 6. Temporal waveforms and speech spectrograms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 5 dB SNR); (c) speech enhanced by MBSS (PESQ = 1.78); and (d) speech enhanced by IP-MBSE (PESQ = 1.92)

Fig. 7. Temporal waveforms and speech spectrograms of sp1 utterance, "The birch canoe slid on the smooth planks", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.03); and (d) speech enhanced by IP-MBSE (PESQ = 2.15)

Fig. 8. Temporal waveforms and speech spectrograms of sp6 utterance, "Men strive but seldom get rich", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.16); and (d) speech enhanced by IP-MBSE (PESQ = 2.27)

Fig. 9. Temporal waveforms and speech spectrograms of sp10 utterance, "The sky that morning was clear and bright blue", by a male speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.26); and (d) speech enhanced by IP-MBSE (PESQ = 2.46)

Fig. 10. Temporal waveforms and speech spectrograms of sp12 utterance, "The drip of the rain made a pleasant sound", by a female speaker from the NOIZEUS corpus: (a) clean speech; (b) noisy speech (degraded by Car noise at 10 dB SNR); (c) speech enhanced by MBSS (PESQ = 2.01); and (d) speech enhanced by IP-MBSE (PESQ = 2.26)

Both objective and subjective indicators have been used to assess IP-MBSE performance. SNR, SegSNR, and PESQ are objective metrics, while MOS and spectrograms are subjective metrics.

4.1 Objective evaluation

a) Signal-to-Noise Ratio (SNR): This is calculated by dividing an utterance's total signal energy by its total noise energy. The SNR of the enhanced signal is evaluated using the equation below.

$$\textrm{SNR}=10\;{\log}_{10}\left(\frac{\sum_{n=1}^L{s}^2\left[n\right]}{\sum_{n=1}^L{\left\{s\left[n\right]-\hat{s}\left[n\right]\right\}}^2}\right)$$
(18)

where n is the sample index and L is the number of samples; s[n] and \(\hat{s}\left[n\right]\) denote the clean and the enhanced speech, respectively. The summation runs over the length of the signal.

b) Segmental Signal-to-Noise Ratio (SegSNR): The average signal-to-noise energy ratio per frame is known as the SegSNR, and it may be written as:

$$\textrm{SegSNR}=\frac{1}{M}\sum\nolimits_{m=0}^{M-1}10\;{\log}_{10}\left(\frac{\sum_{n={N}_m}^{N_m+N-1}{s}^2\left[n\right]}{\sum_{n={N}_m}^{N_m+N-1}{\left\{s\left[n\right]-\hat{s}\left[n\right]\right\}}^2}\right)$$
(19)

where M denotes the number of frames in the signal and N the number of samples per frame. The SegSNR correlates better with perceptual clarity than the global SNR; a greater SegSNR indicates less distortion. A minimal sketch of both measures is given after this list.

c) Perceptual Evaluation of Speech Quality (PESQ): The ITU-T recommends the PESQ for speech clarity assessment because it is an objective evaluation that predicts the subjective opinion score of a degraded speech sample [14]. In several testing situations, the PESQ has been found to correlate highly with subjective tests [14].
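The two waveform-domain measures, (18) and (19), are straightforward to compute. The sketch below assumes time-aligned clean and enhanced signals of equal length; PESQ, being a standardized model, is typically computed with an existing implementation rather than from scratch:

```python
import numpy as np

def snr_db(clean, enhanced):
    """Global SNR per (18)."""
    err = clean - enhanced
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))

def seg_snr_db(clean, enhanced, frame_len=256):
    """Segmental SNR per (19): mean of the per-frame SNRs."""
    vals = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = s - enhanced[start:start + frame_len]
        vals.append(10 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12)))
    return float(np.mean(vals))
```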

4.2 Subjective evaluation – Mean Opinion Score (MOS)

A subjective evaluation is based on listeners' judgments. The listening tests for our experimental review were conducted with five participants wearing headphones in a confined room. For each test signal, each listener assigns a score ranging from 1 to 4.5, reflecting their overall impression of the clarity of the speech, including musical sound, background noise, and speech distortion. The tests used a scale corresponding to the MOS scale described in [3]. For each speaker, the clean and noisy speech is played first, and each test signal is then played and repeated twice.

Table 1 compares IP-MBSE to the standard MBSS in terms of global SNR [dB] and SegSNR [dB] at various SNR levels. For the various types of noise, the SNR and SegSNR values of IP-MBSE are superior to those of MBSS.

Table 1. IP-MBSE objective evaluation and comparison in terms of SNR [dB] and SegSNR [dB]

The PESQ and MOS scores of IP-MBSE versus MBSS are shown in Table 2. IP-MBSE outperforms traditional MBSS on the PESQ test for all noises except train and airport noise, and the speech enhanced by IP-MBSE also exceeds MBSS on the MOS measure.

Table 2. The outcome of a noise reduction speech quality test

The temporal waveforms and spectrograms of the clean, noisy, and enhanced speech signals are shown in Figs. 4, 5, 6, 7, 8, 9, and 10. As seen in these figures, IP-MBSE reduces the musical structure of the leftover noise more than MBSS. As a result, the speech enhanced by IP-MBSE is more pleasant to listen to, and the residual sound has a white character with acceptable distortion. This supports the results of the SNR, SegSNR, and PESQ tests (Table 1), as well as the listening tests (Table 2).

5 Conclusion

In this paper, we investigated an iterative-processed multiband speech enhancement (IP-MBSE) method for the suppression of annoying musical sounds. In the proposed technique, the output of multiband spectral subtraction (MBSS) is fed back as input in subsequent iterations. The iteration number is crucial in IP-MBSE, because a higher number distorts the signal while a lower number retains the musical sound in the estimated speech; as a result, only a few iterations are carried out. Compared with the conventional MBSS, IP-MBSE is found to outperform MBSS at low SNRs.