1 Introduction

We live in environments filled with noise and disturbance, so signals of interest, speech in particular, are usually accompanied by unwanted noise that hinders processing of the signal in its original form. Noise and interference affect human–human and human–machine communication in many fields: they degrade speech intelligibility and quality, speaker identification, and speech recognition [1,2,3]. Noise is generated everywhere, and its characteristics may be known or unknown. Removing noise and interference from corrupted speech by means of signal processing methods is one task of speech processing, whose main categories are coding, enhancement, recognition, and synthesis of speech. Digital signal processing techniques are required to make voice communication comfortable, natural, and practical [4]. Speech communication applications that require noise reduction algorithms include answering machines, hands-free communication, hearing aids, local and long-distance telecommunications, mobile and car phones, multiparty conferencing, noisy manufacturing environments and cockpits, teleconferencing systems, and voice over Internet Protocol (VoIP).

The word noise normally describes an undesirable signal that hinders the analysis, processing, transmission, and reception of the desired acoustic signal. To represent noise and suppress its impact, it is useful to divide it into four subclasses. Additive noise is interference from various sources that is added to the signal as it passes through the communication channel. Interfering signals arise when multiple speakers talk at the same time. Reverberation is sound that persists after the source has produced it, mainly because of multipath propagation. Echo is a sound reflection that reaches the listener after a delay and arises mainly from coupling between microphones and loudspeakers. The corresponding speech signal processing techniques are noise reduction or speech enhancement, source and speaker separation, speech de-reverberation, and echo cancelation and suppression [5].

The signal captured by a microphone is usually the clean speech signal superimposed with undesirable noise, and the main challenge is the background noise that degrades the signal of interest. The foremost aim of noise suppression algorithms is therefore to recover the clean speech from the superimposed signal, with the following goals: enhancing the perceptual quality of the noise-corrupted speech; improving objective performance criteria such as intelligibility and signal-to-noise ratio (S/N or SNR); and making other speech processing applications, including echo suppression and cancelation, speech coding, speech recognition, and speech synthesis, more robust to noise [6].

Unwanted background noise in the acoustic signal severely impairs the speaker identification–verification (SIV) process and reduces the recognition rate. Speech enhancement systems are therefore usually placed in front of SIV systems to improve their results, as depicted in Fig. 1.

Fig. 1 Speech enhancement system

2 Methodology Employed

Various speech enhancement methods are used to reduce noise in speech signals, among which the spectral subtractive method is popular and commonly used in real-world applications [1, 5]. Other traditional speech enhancement algorithms include statistical model-based methods, subspace methods, and binary masking. Spectral subtraction in the Fourier transform domain removes an estimate of the noise magnitude spectrum from the magnitude spectrum of the noise-corrupted speech obtained via the Fourier transform, giving the enhanced speech signal as output [6]. Work on noise reduction began with two patents by Schroeder [7, 8], who proposed an analog implementation of the spectral magnitude subtraction algorithm. Boll [9] later described the spectral subtraction algorithm in the digital domain. Lim and Oppenheim [10], in a milestone paper, treated the noise suppression problem by reviewing and comparing the existing algorithms. Their work explained the usefulness of suppressing noise in a noise-corrupted signal for improving its intelligibility and quality.

Noise reduction poses numerous challenges. For both the single-channel case, where the signal is recorded by one microphone, and the multichannel case, where it is recorded by more than one microphone, an optimal solution must remove as much undesirable noise as possible without degrading the quality and intelligibility of the speech signal. The proposed work combines Fourier transform decomposition of the noise-corrupted signal with spectral subtraction to enhance the speech signal and thereby improve the speaker identification–verification process.

2.1 Segmentation and Framing of Speech Signal

A speech signal is not strictly stationary, but it is typically considered quasi-stationary over short periods of time, because the glottal system and its features do not change instantly [11]. In particular, within the distinct sound units of a language, called phonemes, the characteristics of speech remain essentially unchanged over a short period of approximately 5–100 ms. Traditional signal processing techniques can therefore be applied over such short time spans. Speech is normally processed in very short, overlapping windows, referred to as frames, each of which is analyzed and processed separately. Thus, a speech signal that is treated as stationary over windows of, say, 20 ms is segmented into frames of 20 ms, corresponding to \(N\) samples given as

$$N = t_{fs} f_{s}$$
(1)

where \(t_{fs}\) is the frame step time and \(f_{s}\) is the sampling frequency of the signal.

Figure 2 depicts the segmentation of the speech signal into short window frames. The frames overlap: the first part of each frame overlaps the previous frame, and the remaining part overlaps the next frame. The frame step time \(t_{\text{fs}}\) is the time between the start times of consecutive frames.

Fig. 2 Segmentation of speech into frames

The time from the start of the next frame to the end of the current frame is referred to as the overlap time \(t_{\text{o}}\). It follows that the frame length \(t_{\text{fl}}\) is

$$t_{\text{fl}} = t_{\text{fs}} + t_{\text{o}}$$
(2)

Thus, the window is of length \(t_{\text{fl}}\), which corresponds to \(t_{\text{fl}} f_{\text{s}}\) samples.

In this method, frames are 25 ms long and the audio file is sampled at 16 kHz, so each frame is 0.025 s × 16,000 samples/s = 400 samples long. We use an overlap of 50%, that is, 200 samples. The first frame therefore starts at sample 0, the second at sample 200, the third at sample 400, and so on, indicated by frame1, frame2, and frame3 in the figure.
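The frame arithmetic above can be checked with a short sketch. The helper name `frame_starts` and its parameters are illustrative and do not come from the original program; the sketch keeps only frames that fit entirely inside the signal, which is one of several possible conventions.

```python
import numpy as np

def frame_starts(num_samples, fs=16000, frame_len_s=0.025, overlap=0.5):
    """Return (frame_len, step, starts) for fixed-length overlapping frames."""
    frame_len = int(round(frame_len_s * fs))        # t_fl * f_s
    step = int(round(frame_len * (1.0 - overlap)))  # t_fs * f_s
    # Keep only frames that fit entirely inside the signal.
    starts = np.arange(0, num_samples - frame_len + 1, step)
    return frame_len, step, starts

# One second of audio at 16 kHz.
frame_len, step, starts = frame_starts(num_samples=16000)
print(frame_len, step, starts[:3].tolist())  # 400 200 [0, 200, 400]
```

With a 50% overlap the step equals half the frame length, so the overlap time \(t_{\text{o}}\) and the step time \(t_{\text{fs}}\) are both 12.5 ms here.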

2.2 Decomposition in Fourier Transform Domain

Given the quasi-stationary nature of speech, the analysis is carried out on short segments (frames) by applying the short-time Fourier transform (STFT) to each segment, which yields a Fourier spectrum for each frame [12]. The noise-corrupted input signal is modeled as the sum of clean speech and additive noise:

$$y\left[ \eta \right] = x\left[ \eta \right] + s\left[ \eta \right]$$
(3)

where \(y\left[ \eta \right]\), \(x\left[ \eta \right]\), and \(s\left[ \eta \right]\) are the sampled noise-corrupted signal, clean signal, and additive noise, respectively, and \(\eta\) is the discrete time index. The additive noise is assumed to be zero-mean and uncorrelated with the speech signal [13].

The STFT of the noise-corrupted signal \(y\left[ \eta \right]\) is then given by

$$Y\left( {\eta ,\varpi } \right) = \mathop \sum \limits_{l = - \infty }^{\infty } y\left( l \right)w\left( {\eta - l} \right)e^{ - j2\pi \varpi l/N}$$
(4)

where \(\varpi\) is the discrete frequency index, \(N\) is the frame duration (in samples), \(l\) is the summation index over samples, and \(w\left( \eta \right)\) is the analysis window function. In speech processing, a Hamming window with a duration of typically 20–40 ms is usually used [14]. Windowing is required because the analysis operates on finite blocks of samples, which introduces discontinuities at the frame boundaries. The window tapers each frame smoothly toward its ends, which reduces these discontinuities so that each frame joins the start of the next one smoothly [15].
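As a minimal sketch of the frame-wise analysis, the following code windows each frame with a Hamming window and takes an \(N\)-point DFT. The function name `stft_frames` and the 1 kHz test tone are assumptions made for illustration. The DFT of each frame is referenced to the frame start rather than to absolute time as in Eq. 4, which changes only the phase of each bin, not its magnitude.

```python
import numpy as np

def stft_frames(y, frame_len=400, step=200):
    """Windowed short-time spectra: one row per frame, one column per bin."""
    window = np.hamming(frame_len)                    # analysis window w
    starts = range(0, len(y) - frame_len + 1, step)
    frames = np.stack([y[s:s + frame_len] * window for s in starts])
    return np.fft.fft(frames, axis=1)                 # N-point DFT of each frame

fs = 16000
t = np.arange(fs) / fs
y = np.sin(2 * np.pi * 1000 * t)      # 1 kHz test tone, a stand-in for speech
Y = stft_frames(y)

# Strongest bin of the first frame, searched over the positive frequencies.
peak_bin = int(np.argmax(np.abs(Y[0, :200])))
print(Y.shape, peak_bin * fs / 400)   # (79, 400) 1000.0
```

Each bin is \(f_{s}/N\) = 40 Hz wide for 400-sample frames at 16 kHz, so the tone falls exactly on bin 25.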

2.3 Reconstruction of the Signal

To construct the enhanced signal \(x\left( \eta \right)\), the inverse STFT is applied to the modified speech spectrum, followed by least-squares overlap-add synthesis:

$$x\left( \eta \right) = \frac{1}{{W_{\theta } \left( \eta \right)}}\mathop \sum \limits_{l = - \infty }^{\infty } \left[ {\left( {\frac{1}{N} \mathop \sum \limits_{\varpi = 0}^{N - 1} Y\left( {l,\varpi } \right)e^{{\frac{j2\pi \eta \varpi }{N}}} } \right)w_{\tau } \left( {l - \eta } \right)} \right]$$
(5)

where \(w_{\tau } \left( \eta \right)\) is the synthesis window function and \(W_{\theta } \left( \eta \right)\) is given by

$$W_{\theta } \left( \eta \right) = \mathop \sum \limits_{l = - \infty }^{\infty } w_{\tau }^{2} \left( {l - \eta } \right)$$
(6)

Usually, the synthesis window is a Hanning window, given by

$$w_{\tau } \left( \eta \right) = \left\{ {\begin{array}{*{20}l} {0.5 - 0.5\cos \left( {\frac{{2\pi \left( {\eta + 0.5} \right)}}{N}} \right),} \hfill & {0 \le \eta < N} \hfill \\ {0,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.$$
(7)
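The synthesis of Eqs. 5 to 7 can be sketched as follows. The function names are illustrative, and the sketch assumes that the same window of Eq. 7 is used for both analysis and synthesis, so that dividing by the summed squared window of Eq. 6 returns the input exactly when the spectra are left unmodified. A different analysis window would need a matching normalisation.

```python
import numpy as np

def modified_hann(n):
    """Hann window sampled at half-sample offsets, as in Eq. 7."""
    k = np.arange(n)
    return 0.5 - 0.5 * np.cos(2 * np.pi * (k + 0.5) / n)

def analysis(y, window, step):
    """Windowed DFT of each frame that fits inside the signal."""
    n = len(window)
    starts = range(0, len(y) - n + 1, step)
    return np.stack([np.fft.fft(y[s:s + n] * window) for s in starts])

def synthesis(spectra, window, step):
    """Overlap-add of windowed inverse DFTs, divided by the summed squared window."""
    n = len(window)
    length = (len(spectra) - 1) * step + n
    out = np.zeros(length)
    norm = np.zeros(length)                 # W_theta of Eq. 6
    for i, spec in enumerate(spectra):
        s = i * step
        out[s:s + n] += np.real(np.fft.ifft(spec)) * window
        norm[s:s + n] += window ** 2
    return out / np.maximum(norm, 1e-12)    # guard against division by zero

rng = np.random.default_rng(0)
y = rng.standard_normal(4000)               # arbitrary test signal
w = modified_hann(400)
x = synthesis(analysis(y, w, 200), w, 200)
max_err = np.max(np.abs(x - y[:len(x)]))
print(max_err)                              # small: unmodified spectra give back y
```

The half-sample offset in Eq. 7 keeps the window strictly positive at both ends, so the normalisation term never vanishes inside a frame.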

2.4 Spectral Subtractive Principle

The spectral subtractive principle is a practical and useful method for suppressing ambient noise in a signal. It restores the magnitude spectrum of a signal observed in background noise by subtracting an estimate of the average noise spectrum, obtained from the Fourier transform, from the spectrum of the noise-corrupted signal. When a signal passes through a communication channel contaminated by noise, only the corrupted signal is available at the receiver. In such circumstances, the local average effect of the noise on the signal spectrum is considered [16]. Additive noise raises the mean and the variance of the magnitude spectrum of a signal, as depicted in Fig. 3.

Fig. 3 Impact of noise on signal pertaining to time and frequency domain

Because the characteristics of speech vary with time, the signal is analyzed frame by frame. Applying the short-time Fourier transform (STFT) of Eq. 4 to the signal model of Eq. 3, and using the linearity of the transform, gives

$$Y\left( {\eta ,\varpi } \right) = X\left( {\eta ,\varpi } \right) + S\left( {\eta ,\varpi } \right).$$
(8)

Assuming that the speech signal and the background noise are independent, the power spectrum of the corrupted signal \(y\left[ \eta \right]\) can be written without cross terms as

$$\left| {Y\left( \varpi \right)} \right|^{2} = \left| {X\left( \varpi \right)} \right|^{2} + \left| {S\left( \varpi \right)} \right|^{2}$$
(9)

To obtain the spectrum of the enhanced signal, an estimate of the noise power spectrum, \(\left| {\dot{S}\left( \varpi \right)} \right|^{2}\), is subtracted from the power spectrum of the noisy input signal:

$$\left| {\dot{X}\left( \varpi \right)} \right|^{2} = \left| {Y\left( \varpi \right)} \right|^{2} - \left| {\dot{S}\left( \varpi \right)} \right|^{2}$$
(10)

Spectral subtraction can also be realized as a filter: the enhanced spectrum is the product of the noisy speech spectrum and a spectral subtractive filter (SSF),

$$\left| {\dot{X}\left( \varpi \right)} \right|^{2} = \left( {1 - \frac{{\left| {\dot{S}\left( \varpi \right)} \right|^{2} }}{{\left| {Y\left( \varpi \right)} \right|^{2} }}} \right)\left| {Y\left( \varpi \right)} \right|^{2}$$
(11)
$$\left| {\dot{X}\left( \varpi \right)} \right|^{2} = \hat{H}^{2} \left( \varpi \right)\left| {Y\left( \varpi \right)} \right|^{2}$$
(12)

where \(\hat{H}\left( \varpi \right)\) is the gain function of the spectral subtractive filter (SSF). The SSF is a zero-phase filter whose magnitude lies in the range \(0 \le \hat{H}\left( \varpi \right) \le 1\) and is given by

$$\hat{H}\left( \varpi \right) = \left\{ {\max \left( {0,1 - \frac{{\left| {\dot{S}\left( \varpi \right)} \right|^{2} }}{{\left| {Y\left( \varpi \right)} \right|^{2} }}} \right)} \right\}^{1/2}$$
(13)
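The gain of Eq. 13 can be evaluated per frequency bin as in the sketch below. The function name `ss_gain`, the small constant `eps` that guards against division by zero, and the three example power values are assumptions introduced for illustration.

```python
import numpy as np

def ss_gain(noisy_power, noise_power, eps=1e-12):
    """Spectral-subtraction gain of Eq. 13, clipped so that 0 <= H <= 1."""
    ratio = noise_power / np.maximum(noisy_power, eps)
    return np.sqrt(np.maximum(0.0, 1.0 - ratio))

noisy_power = np.array([4.0, 1.0, 0.5])   # |Y|^2 in three bins (made-up values)
noise_power = np.array([1.0, 1.0, 1.0])   # estimated noise power in the same bins
H = ss_gain(noisy_power, noise_power)
enhanced_power = H ** 2 * noisy_power     # Eq. 12

print(H)               # about 0.866 in the first bin, 0 in the other two
print(enhanced_power)  # about 3 in the first bin, 0 in the other two
```

In the third bin the noise estimate exceeds the observed power, so the subtraction of Eq. 10 would be negative; the maximum in Eq. 13 sets the gain to zero there instead.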

To reconstruct the signal, the phase spectrum of the speech is also needed. The usual approach is to use the phase of the noise-degraded speech as the estimate of the phase of the clean signal. The estimate of the speech spectrum for a short frame is thus expressed as

$$\dot{X}\left( \varpi \right) = \left| {\dot{X}\left( \varpi \right)} \right|e^{j\angle Y\left( \varpi \right)}$$
(14)
$$\dot{X}\left( \varpi \right) = \hat{H}\left( \varpi \right) Y\left( \varpi \right)$$
(15)

From this, it follows that an approximate time-domain waveform of the speech can be reconstructed using the inverse Fourier transform. The sequence of steps in speech enhancement with the Fourier transform and spectral subtraction approach is depicted by the flow diagram in Fig. 4.

Fig. 4 Enhancement of noise corrupted speech signal approach
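Equations 14 and 15 describe the same operation, because the gain is real and non-negative and therefore leaves the phase of the noisy spectrum unchanged. The short check below illustrates this on a random stand-in spectrum; the noise power of 0.5 per bin is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # stand-in noisy spectrum
noise_power = np.full(8, 0.5)                             # assumed noise estimate

# Gain of Eq. 13 for every bin.
H = np.sqrt(np.maximum(0.0, 1.0 - noise_power / np.abs(Y) ** 2))

X_mag = H * np.abs(Y)                        # enhanced magnitude
X_polar = X_mag * np.exp(1j * np.angle(Y))   # Eq. 14: enhanced magnitude, noisy phase
X_gain = H * Y                               # Eq. 15: real gain applied to Y

print(np.allclose(X_polar, X_gain))          # True
```

In practice the form of Eq. 15 is the simpler one to apply, since it avoids computing the phase explicitly.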

3 Program Code

figure a

4 Experimental Results

The main aim of speech enhancement is to suppress the noise in corrupted speech so as to improve signal intelligibility and quality. Signal quality is a subjective performance measure that evaluates how good the speech sounds and thus covers characteristics such as naturalness and roughness of noise, whereas intelligibility is an objective performance measure that determines how much of the signal is understood.

The experiment is conducted on two speakers, one male voice and one female voice, taken from the coded speech database of the ITU-T P-series recommendations [17]. This database comprises sentences of varying duration uttered in different languages and accents. The sentences are corrupted by additive noise at different signal-to-noise ratios (S/N or SNR), and the speech enhancement method is applied as a pre-processing step for speaker verification and speech recognition systems. The results of the spectral subtractive method for the female voice, corrupted and enhanced, are depicted in Figs. 5, 6, 7, and 8. With the algorithm applied, noise is shown to be removed from the signal, resulting in an understandable speech signal.

Fig. 5 Audio wave of corrupted female voice

Fig. 6 Audio wave of enhanced female voice

Fig. 7 Spectrum representation of corrupted female voice

Fig. 8 Spectrum representation of enhanced female voice
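Corrupting a clean utterance with additive noise at a chosen SNR, as described for the experiment, can be sketched as follows. The text does not specify the type of noise, so white Gaussian noise is an assumption, and the sine wave stands in for a speech recording. The function names are illustrative.

```python
import numpy as np

def add_noise_at_snr(clean, target_db, rng):
    """Add white Gaussian noise scaled so the mixture has the requested SNR in dB."""
    signal_power = np.mean(clean ** 2)
    noise = rng.standard_normal(len(clean))
    noise_power = signal_power / (10 ** (target_db / 10))
    noise *= np.sqrt(noise_power / np.mean(noise ** 2))   # rescale to the target power
    return clean + noise

def snr_db(clean, noisy):
    """SNR in dB, treating the difference between noisy and clean as noise."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((noisy - clean) ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # stand-in for speech
measured = {}
for target in (0, 5, 10):
    noisy = add_noise_at_snr(clean, target, rng)
    measured[target] = snr_db(clean, noisy)
    print(target, measured[target])
```

The same `snr_db` function can be applied to an enhanced signal to quantify the improvement, provided that the clean reference is available.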

5 Conclusion

In this paper, a speech enhancement procedure based on the Fourier transform domain and the spectral subtractive principle has been presented that suppresses the noise associated with a speech signal. The method is applied before speech recognition and speaker identification systems to lessen the undesirable impact of noise and interference on speech, improving speech quality and intelligibility. The experimental results show the waveform of the speech signal with and without noise, together with the spectra of both signals. The waveforms show the removal of noise from the female voice, yielding a clean voice free from ambient noise.

6 Future Considerations

In future work, we will implement time-domain speech enhancement for real-time systems using a fully convolutional neural network known as the temporal convolutional neural network (TCNN), a hybrid deep learning approach. The model will be trained in a speaker- and noise-unconstrained manner and will have few trainable parameters. This will explore deep neural network architectures for time-domain speech enhancement. The research will further cover additional speech processing tasks, including speech de-reverberation, echo suppression and cancelation, source separation, and speaker separation, using the TCNN model to improve the SNR, quality, and intelligibility of the speech signal.