
1 Introduction

Background noise produced by single or multiple sound sources is always present in the environment, and in engineering practice it has a significant impact on acoustic signal processing. However, human and mammalian auditory systems are less susceptible to noise than computational systems, because they process the sensory signals of the auditory periphery with high-level neuronal structures that form biological neural networks.

Humans can focus attention on a particular acoustic source, e.g. a speaker’s voice, despite the variability of the sound environment. This allows verbal communication in noisy environments, including conditions of multi-talker babble noise. This phenomenon of auditory perception is widely known as the “cocktail party effect”, highlighted by E. C. Cherry back in 1953. It represents a remarkable hearing ability that enables extracting the required acoustic source in the presence of varied background noises. In psychoacoustics, this ability is associated with auditory scene analysis (ASA) [1]. ASA addresses the problem of recognizing acoustic scenes, events, or sources through the perceptual mechanisms of the auditory system. The principles of ASA underlie various biologically relevant computational studies, which examined practical acoustic signal processing systems and used one or two microphone recordings in the experimental setup. These studies are known as computational auditory scene analysis (CASA) [2].

However, there are conditions under which technical systems for automatic speech signal processing and acoustic scene analysis may have an advantage over biological auditory systems. This advantage stems from the fact that the microphone sensors used in the setup can be placed arbitrarily and positioned optimally in space. Besides, unlike monaural or binaural hearing, a technical system can have multiple microphone sensors and channels, which improves the quality of results through information redundancy and appropriate signal processing [3,4,5]. Thus, if there are no restrictions on the number and relative position of microphones, the limitations of CASA can be circumvented.

To create machine hearing and audition systems, it is advisable to combine the advantages of auditory signal processing with these technical capabilities. Auditory peripheral coding of an input acoustic signal in the form of neural responses provides a representation that is robust against background noise [6] due to neural phase-locking [7]. Furthermore, representing and parametrizing speech signals with the responses of auditory nerve (AN) fibers provides noise-robust features for automatic speech recognition, outperforming common mel-frequency cepstral coefficients under certain noise conditions [8,9,10,11].

Our study aims to develop a signal processing algorithm that represents noisy vowel phonemes as simulated AN responses and reduces the noise in these representations. The approach described in the present paper imitates some features of biological neural processing. It employs a computational model of the auditory periphery and an artificial neural network for blind separation of AN responses. Three types of AN fibers with different spontaneous rates were considered for the signal and noise mixtures.

2 Simulation of the AN Responses

A physiologically motivated computer model of the auditory periphery by R. Meddis [9] was used to obtain neural responses of auditory nerve fibers. In response to the input speech signal, this model simulates the temporal fine structure of AN firing for three types of fibers: low spontaneous rate (LSR)—less than 0.5 spikes/s, middle spontaneous rate (MSR)—0.5–18 spikes/s, and high spontaneous rate (HSR)—18–250 spikes/s [12]. In the present study, the model was set to generate a probabilistic firing rate pattern.

The model requires a digitized speech signal in the WAV format, sampled at 44.1 kHz. The sound pressure level of the input signal was adjusted to 60 dB to match the preferred listening level for conversational speech. The signal then passes through a processing cascade that simulates the functions of the outer, middle, and inner ear. The nonlinear mechanical behavior of the basilar membrane is modeled by a dual-resonance nonlinear filterbank (DRNL) [13]. Each segment of the basilar membrane responds most strongly to a specific frequency of the acoustic stimulus, defined as the best frequency (BF) for sounds near threshold. The DRNL thus decomposes the signal into 41 frequency bands, logarithmically spaced from 250 to 8,000 Hz, corresponding to BFs in the range most significant for speech. The subsequent processing stages simulate stereocilia movement, inner hair cell transduction, synaptic exocytosis, and AN firing.
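For reference, the 41 logarithmically spaced best frequencies described above can be reproduced with a short numerical sketch (illustrative code, not part of the original model implementation):

```python
import numpy as np

# 41 best frequencies, logarithmically spaced from 250 Hz to 8000 Hz,
# matching the DRNL filterbank configuration described above.
best_frequencies = np.logspace(np.log10(250.0), np.log10(8000.0), num=41)

print(best_frequencies[:3])   # lowest channels, near 250 Hz
print(best_frequencies[-3:])  # highest channels, near 8000 Hz
```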

At the output, the auditory periphery model generates a signal encoded by the average firing rate of the auditory nerve fibers. The present study compares the results for the three types of AN fibers mentioned above. Figure 1 shows a sequence of five English long vowels – clean (first column) and with additive white Gaussian noise at 0 dB SNR (second column). Each vowel sound lasts 300 ms. The figure illustrates AN responses for LSR, MSR, and HSR fibers, respectively. The responses in every BF channel of the obtained AN firing probability pattern are smoothed using a 20 ms Hann window with a 10 ms frame shift to extract feature data. Thus, each input signal is represented by its own multivariate data matrix of 41 spectral features and the corresponding number of time frames.
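A minimal sketch of this feature extraction step, assuming the model output is a matrix of per-channel firing probabilities sampled at the signal rate (function and parameter names are ours):

```python
import numpy as np

def smooth_an_pattern(firing_prob, fs=44100, win_ms=20.0, shift_ms=10.0):
    """Smooth each BF channel with a Hann window and a fixed frame shift.

    firing_prob : array of shape (n_channels, n_samples), e.g. (41, T)
    Returns an array of shape (n_channels, n_frames).
    """
    win = int(fs * win_ms / 1000.0)
    shift = int(fs * shift_ms / 1000.0)
    hann = np.hanning(win)
    hann /= hann.sum()                       # normalize to a weighted average
    n_frames = 1 + (firing_prob.shape[1] - win) // shift
    frames = np.empty((firing_prob.shape[0], n_frames))
    for k in range(n_frames):
        seg = firing_prob[:, k * shift:k * shift + win]
        frames[:, k] = seg @ hann            # windowed average per channel
    return frames
```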

Fig. 1. AN firing probability patterns for three nerve fiber types for the vowel sequence: first column – clean speech signal, second column – signal corrupted by AWGN at 0 dB SNR.

3 Blind Signal Processing for AN Responses

Let us suppose that the sources of background noise are localized in the environment around the target signal source. In that case, all sound signals are received by all sensors, but with different intensities, thus forming a linear mixture. Using the information about the intensity differences of the signals at the various sensors, we can solve the noise reduction problem for the target signal as a problem of blind source separation (BSS) [14]. Let us assume two mutually independent sound sources. The first source corresponds to the speech signal represented by a sequence of vowel phonemes. The second source is localized background environmental noise. In this case, the blind noise reduction problem [15] comes down to the task of independent component analysis (ICA).

The paper suggests using blind signal processing to solve the noise reduction problem for a speech signal at the level of the auditory periphery. Unlike the mammalian auditory system, however, a technical system allows arbitrary placement of multiple sensors in space. Therefore, the intensity of the signals may vary significantly, depending on the positions of the sources relative to the sensors. Every sensor is represented by an auditory periphery model that encodes information in the form of stationary AN firing probability patterns. In this case, the mixing model of the speech signal and the background noise remains unknown, and source separation is based only on the AN responses at the different sensors.

Let us consider a case where the two aforementioned sound sources are separated with the use of two biologically relevant sensors. The mixing model is a transformation of two AN output signals by a non-singular mixing matrix \(\mathbf {H}\), whose dimension depends on the number of mixed sound sources. If the mixed sources differ significantly in amplitude, or if the locations of the sensors are chosen poorly, the mixing matrix is ill-conditioned. For stable operation of the separation algorithm, it is advisable to decorrelate the signal mixture in advance. The use of a decorrelation matrix \(\mathbf {Q}\) makes it possible to present the mixed signals in such a way that their correlation matrix is the identity: \(\mathbf {R}_{x_1 x_1}=E\left\{ \mathbf {x}_1 \mathbf {x}_{1}^{T} \right\} =\mathbf {I}\). The mixing matrix then takes the form \(\mathbf {A}=\mathbf {Q}\mathbf {H}\), where \(\mathbf {H}\) is the original unknown mixing matrix.
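Such a decorrelation (whitening) matrix \(\mathbf {Q}\) can be obtained, for example, from the eigendecomposition of the sample covariance matrix, as in the following sketch (our illustrative code, not the authors' implementation):

```python
import numpy as np

def whiten(x):
    """Whiten a mixture matrix x of shape (n_sensors, n_samples)
    so that the covariance of the output is approximately the identity."""
    x_centered = x - x.mean(axis=1, keepdims=True)
    cov = np.cov(x_centered)                                   # sample covariance R_xx
    eigvals, eigvecs = np.linalg.eigh(cov)
    Q = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T  # decorrelation matrix
    x_white = Q @ x_centered                                   # E{x1 x1^T} ~ I
    return x_white, Q
```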

At the output of the auditory periphery model built into each sensor, a mixture of sources is formed. Some of the signals represent the neural responses to the target signal, and others represent the responses to the noise caused by the sound environment: \(\mathbf {s}\left( t \right) =\left[ s_{1}\left( t \right) ,s_{2}\left( t \right) \right] ^{T}\). In addition to the mixed speech signal and noise, the formed mixture can be affected by the intrinsic noise of the system elements, \(\mathbf {v}\left( t \right) =\left[ v_{1}\left( t \right) ,v_{2}\left( t \right) \right] ^{T}\). The result of the conversion is the observed and measured signal \(\mathbf {x}\left( t \right) =\mathbf {A}\mathbf {s}\left( t \right) +\mathbf {v}\left( t \right) \). The task is reduced to the search for the separation matrix \(\mathbf {W}\) of the observed signal vector \(\mathbf {x}\left( t \right) \) by means of an artificial neural network. The matrix \(\mathbf {W}\) should be such that the estimate \(\mathbf {y}\left( t \right) \) of the unknown signal vector \(\mathbf {s}\left( t \right) \) is obtained by applying the separation matrix to the measured signal: \(\mathbf {y}\left( t \right) =\mathbf {W}\mathbf {x}\left( t \right) \). In other words, the BSS task for the AN firing rate pattern is reduced to estimating the original signal by searching for the inverse mixing operator.
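For illustration, the two-sensor observation model can be sketched as follows, with a hypothetical well-conditioned mixing matrix and placeholder sources chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# s: two source patterns (speech-driven and noise-driven AN responses),
# flattened to vectors of equal length n.
n = 1000
s = rng.random((2, n))                  # placeholder source patterns
A = np.array([[1.0, 0.6],               # illustrative well-conditioned mixing matrix
              [0.4, 1.0]])
v = 0.01 * rng.standard_normal((2, n))  # intrinsic sensor noise v(t)
x = A @ s + v                           # observed two-sensor mixture x(t) = A s(t) + v(t)
```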

We separate the stationary AN response patterns into components attributable to signal or noise, assuming the independence of the neural responses to these two factors. The independence condition is determined by the minimum of mutual information between the neural responses to signal and noise. The two-dimensional AN output signals \(\mathbf {X}_1\left( t \right) ,\mathbf {X}_2\left( t \right) \) are transformed into vectors by row-wise concatenation: \(x_{n\left( i-1 \right) + j}=X_{i,j}\), where \(n\) is the number of columns. Therefore, the goal of source separation is to minimize the Kullback-Leibler divergence between two distributions – the probability density function (PDF) \(f\left( \mathbf {y},\mathbf {W} \right) \), which depends on the coefficients of the matrix \(\mathbf {W}\), and the factorial distribution:

$$\begin{aligned} f_{1}\left( \mathbf {y},\mathbf {W} \right) =\prod _{i=1}^{m}f_{1,i}\left( y_{i},\mathbf {W} \right) \; . \end{aligned}$$
(1)
$$\begin{aligned} D_{f||f_{1}}\left( \mathbf {W} \right) =-h\left( \mathbf {y} \right) +\sum _{i=1}^{m}h_{1}\left( y_{i} \right) \; . \end{aligned}$$
(2)

where \(h\left( \mathbf {y} \right) \) is the entropy at the output of the separator and \(h_{1}\left( y_{i} \right) \) is the entropy of the i-th element of the output vector. For BSS, we used an approximation of the probability density \(f_{1,i}\left( y_i \right) \) obtained by truncating the Gram-Charlier expansion:

$$\begin{aligned} f_{1,i}\left( y_{i} \right) \approx N\left( y_{i} \right) \left[ 1+\frac{\kappa _{i,3}}{3!}H_{3}\left( y_{i} \right) +\frac{\kappa _{i,4}}{4!} H_{4}\left( y_{i} \right) +\frac{\kappa _{i,6}+10\kappa _{i,3}^{2}}{6!}H_{6}\left( y_{i} \right) \right] \; . \end{aligned}$$
(3)

where \(\kappa _{i,k}\) is the k-th order cumulant of the variable \(y_{i}\); \(H_{3}\left( y_{i} \right) =y_{i}^{3}-3y_{i}\), \(H_{4}\left( y_{i} \right) =y_{i}^{4}-6y_{i}^{2}+3\), \(H_{6}\left( y_{i} \right) =y_{i}^{6}-15y_{i}^{4}+45y_{i}^{2}-15\) are Hermite polynomials; \(N\left( y_{i} \right) =\frac{1}{\sqrt{2\pi }}\exp \left( \frac{-y_{i}^{2}}{2} \right) \) is the standard normal PDF. The weight update rule for adapting the separation matrix is:

$$\begin{aligned} \mathbf {W}\left( n+1 \right) =\mathbf {W}\left( n \right) +\mu \left( n \right) \left[ \mathbf {I}-\varphi \left( \mathbf {y}\left( n \right) \right) \mathbf {y}^{T}\left( n \right) \right] \mathbf {W}^{-T}\left( n \right) \; . \end{aligned}$$
(4)

where \(\mu \left( n \right) \) is the convergence rate parameter and \(\varphi \left( \mathbf {y}\left( n \right) \right) =\left[ \varphi \left( y_{1}\left( n \right) \right) ,\varphi \left( y_{2}\left( n \right) \right) \right] ^{T}\) is a vector of activation functions whose form changes in the course of adaptation, since their values depend on the observed outputs \(y_{i} \left( n \right) \).
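For clarity, one iteration of rule (4) over a batch of observations can be sketched as below; the cubic activation function is only a common illustrative choice, since in the described algorithm the activation adapts to the observed output statistics:

```python
import numpy as np

def ica_step(W, x_batch, mu=0.01, phi=lambda y: y ** 3):
    """One gradient step of rule (4) for a batch of observations.

    W       : current separation matrix, shape (m, m)
    x_batch : observed mixtures, shape (m, n_samples)
    phi     : activation function applied elementwise to y
    """
    y = W @ x_batch
    m, n = y.shape
    phi_y_yT = (phi(y) @ y.T) / n                         # empirical average of phi(y) y^T
    grad = (np.eye(m) - phi_y_yT) @ np.linalg.inv(W).T    # [I - phi(y) y^T] W^{-T}
    return W + mu * grad
```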

4 Results and Discussion

4.1 Experimental Setup

This study addresses the problem of blind noise reduction. A series of computational experiments was conducted in which noise with different spectral power distributions was removed from the signal. The study aimed to investigate the distortion of the stationary AN firing probability pattern caused by noise and, in the first stage, included different kinds of colored noise: white, pink, red, blue, and violet. White Gaussian noise is a widespread noise model in robustness studies. In application areas, it is also important to remove pink (or flicker) noise, whose power spectral density is inversely proportional to frequency. The spectral density of red noise decreases in proportion to the square of the frequency. The spectral density of blue noise mirrors that of pink noise, i.e. it increases with increasing frequency; blue noise was synthesized using spectrum inversion. The spectral density of violet noise is inverted with respect to the red noise spectrum.
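Colored noise of these kinds can be synthesized, for example, by shaping the spectrum of white Gaussian noise with a power-law envelope; the sketch below follows the standard spectral exponents and is not the exact generation procedure used in the experiments:

```python
import numpy as np

def colored_noise(n_samples, exponent, rng=None):
    """Generate noise whose power spectral density follows f**exponent.

    exponent: 0 white, -1 pink, -2 red (Brownian), +1 blue, +2 violet.
    """
    rng = rng or np.random.default_rng()
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                       # avoid division by zero at DC
    spectrum *= freqs ** (exponent / 2.0)     # amplitude scales as f**(exponent/2)
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.std(noise)              # normalize to unit variance
```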

In the second stage of the blind noise reduction experiments, mixtures with real-world environmental noise were considered, covering eight categories of urban acoustic scenes: airport, travelling by bus, travelling by underground metro, travelling by tram, a street with a medium level of traffic, a public square, a metro station, and an indoor shopping mall. These environmental noise categories were taken from the TUT Urban Acoustic Scenes 2018 dataset [16] of the DCASE Challenge.

The experimental setup of our study can be summarized as follows. In accordance with the problem statement, we used two sensors and two sound sources. The first source was a clean speech signal. The second source was interference represented by one of the aforementioned noise types. In the first stage of computational experiments with colored noise interference, the speech signal was a sequence of English long vowels, each represented by a multi-frequency complex tone synthesized as a sum of the first five formant frequencies – a speech-like model sound. The sound mixture had a duration of 1.5 s. In the second stage, the vowel sequence was pronounced several times by a male speaker, and the sound mixture with real-world noise interference had a duration of 10 s.
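A minimal sketch of such a speech-like model sound, assuming hypothetical formant frequencies for a single vowel (the actual formant values used in the study are not reproduced here):

```python
import numpy as np

fs = 44100
duration = 0.3                                  # 300 ms per vowel, as in Fig. 1
t = np.arange(int(fs * duration)) / fs

# Hypothetical formant frequencies (Hz) for an /a:/-like vowel
formants = [730, 1090, 2440, 3400, 4200]
tone = sum(np.sin(2 * np.pi * f * t) for f in formants)
tone /= np.max(np.abs(tone))                    # normalize amplitude
```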

Each sensor received a sound mixture of the two sources, with different mixing parameters specified in the mixing matrix; in this way, a certain spatial location of each sound source was simulated. The auditory peripheral representation of each sound mixture was then modelled in the form of an AN average firing rate probability pattern. The output data matrices served as inputs for the FastICA algorithm of blind source separation [17]. The resulting data matrices describe the unmixed patterns of the corresponding sound sources. Finally, the effect of increasing the number of sensors from 2 to 8 on blind noise reduction quality was considered.
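In terms of a widely used implementation, the unmixing step can be sketched with scikit-learn's FastICA (an illustrative call assuming a recent scikit-learn version; the paper does not specify a particular toolbox or its settings):

```python
import numpy as np
from sklearn.decomposition import FastICA

def unmix(X_mix, n_sources=2):
    """Blindly unmix AN response patterns.

    X_mix : observed sensor patterns flattened to shape (n_samples, n_sensors),
            as expected by FastICA.
    """
    ica = FastICA(n_components=n_sources, whiten="unit-variance", random_state=0)
    Y = ica.fit_transform(X_mix)      # estimated sources, shape (n_samples, n_sources)
    return Y, ica.mixing_             # unmixed patterns and estimated mixing matrix
```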

4.2 Blind Noise Reduction Evaluation

The noise reduction performance for simulated AN fiber response patterns was evaluated using signal-to-noise ratio (SNR) and noise intensity measurements. SNR estimates the ratio of target signal power to background noise power. For the denoised AN fiber response pattern Y, SNR is defined as follows:

$$\begin{aligned} SNR_{y}=10\log _{10}\left( \frac{\left\| X\right\| ^2}{\left\| Y-X\right\| ^2}\right) \; . \end{aligned}$$
(5)

where X is the response pattern for the clean vowel sequence. SNR was also calculated for the response patterns of the initial signal mixtures \(S_{1}\) and \(S_{2}\). The tables below present SNR estimates for the different spontaneous rate types of AN fibers and colored noise interferences. Table 1 gives the initial SNR for the sound mixtures at the two sensors determined by the mixing matrix, averaged over the vowel sequence. Table 3 gives the SNR values for the unmixed AN response pattern of the vowel sequence – the result of blind noise reduction.

Table 1. Initial SNR/dB for a mixture on two sensors by spontaneous rate
Table 2. Initial noise intensity for a mixture on two sensors by spontaneous rate

The noise intensity in each of the 41 BF bands of the AN firing probability pattern can be approximated by the standard deviation. For the resultant response pattern Y, it is defined as follows:

$$\begin{aligned} \sigma _{y}=\sqrt{\frac{1}{T}\sum _{t=1}^{T}\left( y(t)-\left[ \frac{1}{T}\sum _{t=1}^{T}y(t)\right] \right) ^2} \; . \end{aligned}$$
(6)

where T is the total number of samples. Table 2 shows the initial noise intensity values for the AN response pattern of the two-sensor sound mixture. Table 4 lists the resulting noise intensity values for the output AN response pattern obtained with the blind noise reduction algorithm.
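Both evaluation measures can be computed directly from the response pattern matrices, as in the following sketch (array names are ours):

```python
import numpy as np

def snr_db(X_clean, Y_denoised):
    """SNR per Eq. (5): clean pattern power over residual error power."""
    return 10 * np.log10(np.linalg.norm(X_clean) ** 2 /
                         np.linalg.norm(Y_denoised - X_clean) ** 2)

def noise_intensity(Y):
    """Standard deviation per BF band, Eq. (6); Y has shape (41, T)."""
    return Y.std(axis=1)
```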

Table 3. Resultant SNR/dB for mixture with colored noise
Table 4. Resultant noise intensity for mixture with colored noise
Table 5. Results of two-sensor blind noise reduction (SNR/dB) for HSR AN fibers: mixture with real-world environmental noise

As can be seen, the results of noise reduction are most representative for HSR AN fibers; for MSR and LSR fibers, the performance of blind noise reduction with colored noises was poorer. The AN response pattern turned out to be less sensitive to red noise, while violet noise caused the most distortion. Let us now consider the second stage of the computational experiments. Table 5 summarizes the blind noise reduction results in terms of SNR for the vowel sequence mixed with real-world environmental noises from the eight categories of urban acoustic scenes. These noises largely overlap with the speech range, so their removal is a challenging task. The locations of the sound sources with respect to the sensors were set by the mixing matrix so that the average SNR of the sound mixtures was approximately 0 dB. As the obtained results show, the approach improved the average SNR by 7 dB, which is a good result for an initial study.

Fig. 2. Blind noise reduction performance for vowel sequence represented by HSR AN fibers, depending on the number of sensors: left panel – mixture with colored noises, right panel – mixture with real-world environmental noises.

As mentioned in the introduction, technical systems for speech signal processing allow the use of multiple microphone sensors. Therefore, we also evaluated blind noise reduction performance with an increasing number of sensors for the HSR AN fiber response pattern. As seen from Fig. 2, performance improved with the number of sensors for all considered types of interference, both for colored noise models and for real-world environmental noises.

5 Conclusions

The paper has suggested an approach to enhancing the intelligibility of noisy speech by processing the signals of the auditory periphery. We have considered the task of designing a blind noise reduction system that uses information about the sound sources received by biologically relevant sensors distributed in space. The sensors simulate the processes of encoding information at the AN level of the auditory periphery. The speech signal, represented by a sequence of English long vowels, was separated from noise by means of independent component analysis of stationary AN firing probability patterns.

Two stages of computational studies were carried out: the first stage involved colored noise models, and the second dealt with background noises of real-world acoustic scenes. The quality of noise reduction largely depends on the mutual position of the sound sources and the sensors; in our case, arbitrary positions were chosen, modelled by a well-conditioned mixing matrix. The suggested approach improved the SNR of the stationary AN firing activity pattern for both colored and real-world noises. Moreover, increasing the number of sensors improved the quality of blind noise reduction.

A further increase in SNR can be achieved by optimizing the number and relative placement of sensors in a given acoustic environment. Further elaboration of the approach will involve methods of blind signal extraction and real-time processing of dynamic AN firing activity patterns. The developed methodology can be used as a pre-processing stage in machine hearing and biologically inspired speech signal classification systems, such as [6, 11, 18]. Ultimately, it can become part of a new generation of neurocomputer interfaces and find use in cochlear implants [19].