
1 Introduction

In recent years, robust multi-channel Automatic Speech Recognition (ASR) has been a major focus of research, which has led to large improvements in transcription accuracy [1]. These gains are mainly due to the development of novel neural network (NN) architectures [2, 3] and the combination of NNs with well-known speech enhancement techniques such as statistical beamforming [4, 5] and dereverberation [6]. However, realistic application environments often still present a challenge to ASR systems because of overlapped speech and moving speakers [7].

Recently, several promising approaches for source separation [8,9,10] and source extraction [11,12,13,14] in the presence of multiple simultaneously active speakers have been presented. This contribution focuses on source extraction, where one is interested in only one of the speakers in a mixture.

Different techniques have been proposed to identify the target speaker. In the so-called SpeakerBeam (SB) approach, the target speaker is identified by an enrollment utterance, also called adaptation utterance (AU), which the speaker has to provide in advance and from which his/her spectral characteristics are obtained [11, 13]. This information is then used to guide a neural network for mask estimation to focus on the target speaker.

The desired speaker can also be identified by the speaker’s position as in [14], where a neural network uses oracle information of the target speaker location to focus on a specific source, assuming the speaker does not move. In [12] a beamforming vector is estimated on a keyword preceding the user’s command. While this setting may be appropriate for operating a digital home assistant, in many other application scenarios, such as a meeting, it would be very inconvenient if utterances had to start with a keyword to identify and locate the target speaker. Additionally, a fixed beamformer estimated on an AU or a keyword cannot capture changes in the speaker position or in the noise statistics.

In this contribution we are concerned with the extraction of a target speaker from multi-talker speech. We would like to take advantage of the spatial diversity present in the speech mixture while facing the problem that the spatial characteristics of the target speaker may change. To be specific, we allow speakers to change their position from one utterance to the next. The proposed system is based on the SpeakerBeam concept developed in [11], which we extend to a block-online source extraction system. We assume that an AU has been recorded for each speaker in advance, when no competing speakers are present. This AU is used to estimate a beamforming vector, which is applied to the AU itself to improve the extraction of the speaker embedding vector, which captures the target speaker’s spectral characteristics. It is further used to enhance the distorted input signal of the neural network, thereby emphasizing all signal components originating from the position the target speaker occupied during the AU. To cope with subsequent changes in the speaker position, the beamformer coefficients are recursively updated.

Spatial features have proven very effective in enhancing the performance of neural network supported acoustic beamforming [15,16,17]. It is, however, unclear to which extent they are also useful if speaker positions change. We therefore test the effectiveness of these features by comparing results for stationary speakers with results for speaker position changes between utterances. It will be shown that spatial features computed on the speech mixture remain effective.

The paper is structured as follows: In Sect. 2 a short overview of the system is presented, where Sect. 2.1 focuses on the beamforming vector estimation and Sect. 2.2 explains the neural network structure used for mask estimation. In Sect. 3 the systems are evaluated on the databases presented in Sect. 3.1. Final conclusions are drawn in Sect. 4.

2 System Overview

We assume a multi-channel signal captured by D microphones. In the short-time Fourier transform (STFT) domain the overlapped speech \(\mathbf {Y}\) and the adaptation utterance \(\mathbf {A}\) can be expressed as

$$\begin{aligned} \mathbf {Y}(t,f)&= \mathbf {X}_i(t,f) + \sum _{j\ne i} \mathbf {X}_{j}(t, f) + \mathbf {N}(t, f) \end{aligned}$$
(1)
$$\begin{aligned} \mathbf {A}(t,f)&= U(t,f) + \mathbf {N}(t, f). \end{aligned}$$
(2)

Here, \(\mathbf {Y}(t, f)\), \(\mathbf {N}(t, f)\) and \(\mathbf {X}_k(t, f)\) are the STFT coefficient vectors of the speech mixture, of the noise and of the k-th source image at the microphones. \(\mathbf {A}(t, f)\) represents the distorted and \(U(t, f)\) the clean AU. The time and frequency indices t and f will be dropped wherever possible without sacrificing clarity.
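For concreteness, the following minimal NumPy sketch builds synthetic STFT tensors according to the signal model of Eqs. (1) and (2); the tensor shapes, the random placeholder signals and all variable names are our own illustrative assumptions and not part of the original system.

```python
import numpy as np

D, T, F = 6, 100, 257                       # microphones, frames, frequency bins (assumed)
rng = np.random.default_rng(0)

def random_stft(shape):
    """Complex Gaussian coefficients standing in for real STFT signals."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Source images X_k at the microphones, one per speaker, each of shape (D, T, F)
X = [random_stft((D, T, F)) for _ in range(2)]
N = 0.1 * random_stft((D, T, F))            # noise image

# Eq. (1): observed mixture with speaker i = 0 as the target
Y = X[0] + X[1] + N

# Eq. (2): the adaptation utterance is recorded without competing speakers
U = random_stft((D, T, F))                  # clean AU at the microphones
A = U + 0.1 * random_stft((D, T, F))
```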

2.1 Beamforming

Speech enhancement is done using the well-known Minimum Variance Distortionless Response (MVDR) beamformer, which minimizes the noise power while leaving signals originating from the target direction undistorted, by optimizing the cost function [18]:

$$\begin{aligned} \mathbf {F}_\text {MVDR} = \underset{\mathbf {F}}{\text {argmin}} \; \mathbf {F}^\mathsf {H}\varvec{\varPhi }_{\mathbf {NN}}\mathbf {F} \quad \text {s.t.} \quad \mathbf {F}^\mathsf {H}\tilde{\mathbf {H}}=1, \end{aligned}$$
(3)

where \(\tilde{\mathbf {H}}=[1,...,\tilde{H}_D]^\mathsf {T}\) is the target speaker acoustic transfer function (ATF) normalized to a reference microphone, which is called relative transfer function (RTF), and \(\varvec{\varPhi }_{\mathbf {NN}}\) is the noise spatial correlation matrix (SCM).

We employ the solution of the MVDR cost function in the form presented in [19]:

$$\begin{aligned} \mathbf {F}_\text {MVDR} = \frac{\varvec{\varPhi }_{\mathbf {NN}}^{-1}\varvec{\varPhi }_{\mathbf {XX}}}{\text {tr}\left\{ \varvec{\varPhi }_{\mathbf {NN}}^{-1} \varvec{\varPhi }_{\mathbf {XX}} \right\} } \mathbf {u}, \end{aligned}$$
(4)

where \(\mathbf {u}\) is a unit vector pointing to the reference microphone, \(\text {tr}\{\cdot \}\) is the trace operator and \(\varvec{\varPhi }_{\mathbf {XX}}\) is the target speech SCM. Here, the target speech SCM is forced to follow the rank-1 approximation [20] by using:

$$\begin{aligned} \tilde{\varvec{\varPhi }}_{\mathbf {XX}} = \mathbf {a}\mathbf {a}^H \cdot \text {tr}\{\varvec{\varPhi }_{\mathbf {XX}}\}/\text {tr}\{\mathbf {a}\mathbf {a}^H\} \end{aligned}$$
(5)

with \(\mathbf {a}=\varvec{\varPhi }_{\mathbf {NN}}\mathcal {P}\left\{ \varvec{\varPhi }^{-1}_\mathbf {NN}\varvec{\varPhi }_\mathbf {XX}\right\} \), where \(\mathcal {P}\left\{ \cdot \right\} \) denotes the principal component of its matrix argument. Both the noise and the target speaker SCMs are estimated using speech and noise masks \(M_\nu \), where \(\nu \in \{\mathbf {X},\mathbf {N}\}\). In the case of block-wise estimation, a recursive update of the SCMs is applied [21]:

$$\begin{aligned} \varvec{\varPhi }_{\nu \nu }(nN) = \beta _{\nu } \varvec{\varPhi }_{\nu \nu }((n-1)N) + (1-\beta _{\nu }) \widehat{\varvec{\varPhi }}_{\nu \nu }(nN), \end{aligned}$$
(6)

with n as the block-index, \(\beta _\nu \) as the forgetting factor and

$$\begin{aligned} \widehat{\varvec{\varPhi }}_{\nu \nu }(nN) = \frac{1}{\sum _{l = 0}^{N-1} M_{\nu }(nN-l)} \sum _{l = 0}^{N-1} M_{\nu }(nN-l) \mathbf {Y}(nN-l) \mathbf {Y}^{H}(nN-l). \end{aligned}$$
(7)

In the offline (batch) case, \(\varvec{\varPhi }_{\nu \nu }(nN)\) is estimated on the whole utterance, i.e., \(\beta _{\nu }=0\) and N is set to the number of frames in the utterance.
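A minimal NumPy sketch of Eqs. (4)–(7) for a single frequency bin is given below. The eigen-decomposition used to extract the principal component, the small regularization constant and all variable names are our own illustrative choices under the stated equations, not the authors' implementation.

```python
import numpy as np

def mask_weighted_scm(Y_block, mask):
    """Eq. (7): mask-weighted SCM estimate over one block.
    Y_block: (N, D) STFT vectors, mask: (N,) mask values in [0, 1]."""
    scm = np.einsum('t,td,te->de', mask, Y_block, Y_block.conj())
    return scm / (mask.sum() + 1e-10)

def recursive_scm(Phi_prev, Phi_block, beta):
    """Eq. (6): recursive SCM update with forgetting factor beta."""
    return beta * Phi_prev + (1.0 - beta) * Phi_block

def rank1_mvdr(Phi_XX, Phi_NN, ref=0):
    """Eqs. (4) and (5): MVDR vector with rank-1 target SCM approximation."""
    B = np.linalg.solve(Phi_NN, Phi_XX)                 # Phi_NN^{-1} Phi_XX
    eigvals, eigvecs = np.linalg.eig(B)
    p = eigvecs[:, np.argmax(np.abs(eigvals))]          # principal component P{.}
    a = Phi_NN @ p
    aaH = np.outer(a, a.conj())
    Phi_XX_r1 = aaH * np.trace(Phi_XX) / np.trace(aaH)  # Eq. (5)
    num = np.linalg.solve(Phi_NN, Phi_XX_r1)            # Phi_NN^{-1} Phi_XX_r1
    u = np.zeros(Phi_NN.shape[0]); u[ref] = 1.0
    return (num / np.trace(num)) @ u                    # Eq. (4)
```

The beamformer output for a single frame is then obtained as \(\mathbf {F}^\mathsf {H}\mathbf {Y}\), i.e. `np.vdot(F_mvdr, Y_tf)` in the notation of the sketch.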

Equation (6) requires an initialization. The noise SCM is initialized either by assuming white noise, which results in a diagonal matrix, or by the model of a diffuse noise field:

$$\begin{aligned} \varvec{\varPhi }_{\mathrm {diff}}(f)= \varphi _\mathbf {N}\cdot \mathrm {sinc}\left( 2\pi f \cdot \frac{F_\text {max}}{F} \cdot \mathbf {d}/c\right) , \end{aligned}$$
(8)

where \(\mathbf {d}\) is the matrix of pairwise distances between the microphones, c is the speed of sound, \(F_\text {max}\) the Nyquist frequency, F the number of frequency bins, and \(\varphi _\mathbf {N}\) the noise power.
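A short sketch of Eq. (8) for one frequency bin is shown below. The conversion of the bin index to a frequency in Hz via \(F_\text {max}/F\) follows the equation above, while the unnormalized sinc convention \(\mathrm {sinc}(x)=\sin (x)/x\) and the example array geometry are our own assumptions.

```python
import numpy as np

def diffuse_scm(f_bin, dist, f_max=4000.0, n_bins=257, noise_power=1.0, c=343.0):
    """Eq. (8): diffuse noise SCM for one frequency bin.
    dist: (D, D) matrix of pairwise microphone distances in metres."""
    x = 2.0 * np.pi * f_bin * (f_max / n_bins) * dist / c
    return noise_power * np.sinc(x / np.pi)     # np.sinc(t) = sin(pi*t)/(pi*t)

# Hypothetical planar array with four microphones (positions in metres)
pos = np.array([[0.00, 0.00], [0.05, 0.00], [0.00, 0.035], [0.05, 0.035]])
dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
Phi_diff = diffuse_scm(f_bin=100, dist=dist)    # (4, 4) coherence-scaled SCM
```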

The target speech SCM may either be initialized using the RTF of the speaker position and the rank-1 approximation \(\tilde{\varvec{\varPhi }}_{\mathbf {XX}} = \varphi _\mathbf {X} \tilde{\mathbf {H}}\tilde{\mathbf {H}}^\mathsf {H}\) with \(\varphi _\mathbf {X}\) as the speech power, or using the SCM of the AU.

For comparison purposes, a second speech enhancement method employing non-adaptive beamforming is used; it will be referred to as FixedBF in the following. A set of MVDR beamforming coefficient vectors is precomputed, assuming concentrated sources at fixed, predefined positions and a diffuse noise field, as described in [22]. The predefined positions are placed on a circle around the array with an angular spacing of \({10}^{\circ }\), a radius of 1.5 m and a height of 0.4 m relative to the array, resulting in 36 positions. During the AU, acoustic source localization is performed using the Steered Response Power - Normalized Arithmetic Mean (SRP-NAM) algorithm described in [23], and the beamforming vector corresponding to the estimated position is selected for source extraction.

2.2 Mask Estimation

In this section we describe the mask estimation required for SCM updates given in Eq. (6). It is a modified version of the SB source extraction network introduced in [11].

Fig. 1. System overview of the presented spatial speaker extractor.

The neural network for mask estimation can be split into three parts: a recurrent neural network (RNN) layer, followed by an adaptation layer and a classification stage consisting of two feed-forward layers (FFs). In the adaptation layer, one larger feed-forward layer is split into several sub-layers. The outputs of these sub-layers are combined prior to the application of the non-linearity \(\sigma \), using weights \(\alpha \):

$$\begin{aligned} h^{(\ell )}_k = \sigma \left( \sum _{j=1}^{N^{(\ell -1)}} h^{(\ell -1)}_j \sum _{m=1}^M \alpha _m W_{mjk} \right) , \qquad k = 1, \ldots , N^{(\ell )} \end{aligned}$$
(9)

where \(h^{(\ell - 1)}_j\) is the output of the j-th node in the preceding, \((\ell -1)\)-st, layer, and \(h^{(\ell )}_k\) the output of the k-th node in the \(\ell \)-th layer. \(N^{(\ell )}\) is the number of nodes in layer \(\ell \), \(W_{mjk}\) are the learnable weight matrix coefficients, where m indicates the sub-layer, and M is the number of adaptation weights. Here, \(\varvec{\alpha }=[\alpha _1, ..., \alpha _M]^T\) is provided by an Auxiliary Network (AUX), which takes the AU as input. This enables the mask estimator (ME) to focus on the speaker who was present during the AU.
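The adaptation layer of Eq. (9) amounts to an \(\alpha \)-weighted combination of the sub-layer weight matrices followed by a single affine map and non-linearity. The NumPy sketch below illustrates this; the tanh non-linearity, the omitted bias term and the reduced layer sizes are illustrative assumptions (the actual network uses 1024 units split into 30 sub-layers, cf. Sect. 3).

```python
import numpy as np

def adaptation_layer(h_prev, W, alpha, sigma=np.tanh):
    """Eq. (9): combine M sub-layers with adaptation weights alpha.
    h_prev: (N_prev,) output of the preceding layer
    W:      (M, N_prev, N_out) learnable sub-layer weights W_{mjk}
    alpha:  (M,) weights provided by the auxiliary network."""
    W_eff = np.einsum('m,mjk->jk', alpha, W)    # sum_m alpha_m W_m
    return sigma(h_prev @ W_eff)

# Toy example with reduced layer sizes
rng = np.random.default_rng(1)
M, N_prev, N_out = 30, 256, 256
h_prev = rng.standard_normal(N_prev)
W = 0.01 * rng.standard_normal((M, N_prev, N_out))
alpha = rng.random(M)
h_next = adaptation_layer(h_prev, W, alpha)     # shape (256,)
```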

The SB approach shows a degradation in performance when applied in scenarios with overlapping speakers of similar spectral characteristics, as is observed for speakers of the same gender. To alleviate this problem, spatial information is employed, assuming that the target speaker uttered the AU and his/her contribution to the speech mixture \(\mathbf {Y}\) from the same position in the room. First, both the AU and the distorted signal \(\mathbf {Y}\) are enhanced using a beamformer estimated from the SCM calculated on the AU as described above. Additionally, spatial features as described in [16] are extracted from both the AU and \(\mathbf {Y}\):

$$\begin{aligned} \mathrm {cosIPD}(t,f,p,q) = \cos \left( \angle y_{t,f,p} - \angle y_{t,f,q}\right) , \end{aligned}$$
(10)
$$\begin{aligned} \mathrm {sinIPD}(t,f,p,q) = \sin \left( \angle y_{t,f,p} - \angle y_{t,f,q}\right) , \end{aligned}$$
(11)

where p and q are channel indices and \(\angle \) is the phase operator. In the case of more than two channels, all combinations of channel pairs are employed. However, at the output of the auxiliary network, mean pooling over the channel pairs is carried out to allow a more robust estimation in case of defective channels.
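A sketch of the spatial feature computation of Eqs. (10) and (11), including the enumeration of all channel pairs, is given below; the tensor layout and placeholder data are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def ipd_features(Y):
    """Eqs. (10) and (11): cos/sin inter-channel phase differences.
    Y: complex STFT tensor of shape (D, T, F).
    Returns two arrays of shape (n_pairs, T, F), one entry per channel pair (p, q)."""
    phase = np.angle(Y)
    pairs = list(combinations(range(Y.shape[0]), 2))
    diff = np.stack([phase[p] - phase[q] for p, q in pairs])
    return np.cos(diff), np.sin(diff)

# Example: a 6-channel mixture yields 15 channel pairs
rng = np.random.default_rng(2)
Y = rng.standard_normal((6, 50, 257)) + 1j * rng.standard_normal((6, 50, 257))
cos_ipd, sin_ipd = ipd_features(Y)              # each of shape (15, 50, 257)
```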

Furthermore, a beamformer is estimated on the AU. This beamformer, called “initial beamformer” in the following, is used to enhance the AU and the mixed speech to compute enhanced features.

To summarize, three sets of features are input to the AUX and mask estimation network: first, log-spectral features computed from the observed microphone signals, second, enhanced log-spectral features obtained after applying the initial beamformer to the microphone signals, and third, the aforementioned spatial features.

A block diagram of the presented system is depicted in Fig. 1.

Both the features computed from the initial beamformer and the spatial features computed on the AU are informative only under the assumption that both the speech of the target speaker in the speech mixture and the AU originate from the same position in the room. Therefore, a system dependent on these features will probably fail in a moving speaker scenario. However, the spatial information computed from the speech mixture can still be beneficial to extract the target speech, in particular if the competing speaker has similar spectral characteristics.

We propose to use a block-online recursive mask estimation system as depicted in Fig. 2. The initial beamformer estimated on the AU is used to enhance the first block of input frames, which in turn are used to update the SCMs and estimate a new beamforming vector. This new beamforming vector then replaces the initial beamformer coefficients to compute the above-mentioned set of enhanced features on the next block of frames. By this recursive update, the enhanced feature set remains able to capture valid information in the presence of speaker movement or changes in the noise statistics.
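The following sketch outlines this block-online processing loop for a single frequency bin, reusing the helper functions `mask_weighted_scm`, `recursive_scm` and `rank1_mvdr` from the sketch in Sect. 2.1; the stubbed mask-estimation callable, the choice to enhance each block with the newly updated beamformer and the parameter values are assumptions for illustration.

```python
import numpy as np

def block_online_extraction(Y, F_init, Phi_XX, Phi_NN, estimate_masks,
                            block_len=5, beta_x=0.95, beta_n=0.95):
    """Block-online target extraction loop (single frequency bin, illustrative only).
    Y:              (T, D) mixture STFT vectors
    F_init:         initial beamformer estimated on the AU
    estimate_masks: callable (Y_block, enhanced_block) -> (speech_mask, noise_mask),
                    standing in for the neural mask estimator."""
    F = F_init
    out = np.empty(Y.shape[0], dtype=complex)
    for start in range(0, Y.shape[0], block_len):
        Y_blk = Y[start:start + block_len]
        # Enhanced features for this block use the beamformer of the previous block
        enhanced = Y_blk @ F.conj()
        M_x, M_n = estimate_masks(Y_blk, enhanced)
        # Eqs. (6)/(7): recursive SCM updates
        Phi_XX = recursive_scm(Phi_XX, mask_weighted_scm(Y_blk, M_x), beta_x)
        Phi_NN = recursive_scm(Phi_NN, mask_weighted_scm(Y_blk, M_n), beta_n)
        # Eqs. (4)/(5): the new beamformer replaces the previous one
        F = rank1_mvdr(Phi_XX, Phi_NN)
        out[start:start + block_len] = Y_blk @ F.conj()
    return out
```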

Fig. 2. System overview of the spatial speaker extractor, reusing the estimated beamforming vector as the initial beamformer for the next block of frames.

3 Experiments

The presented systems are compared using four evaluation metrics: the signal-to-distortion ratio (SDR) following the implementation presented in [24]; an “invasive” SDR (InvSDR) [25], for which the speech and the distortion are processed separately by the beamformer and the SDR is computed as the power ratio of the resulting two outputs; the intelligibility measure STOI [26]; and the perceptual speech quality metric PESQ [27]. All systems are evaluated in terms of their gain compared to the signal at a reference microphone prior to enhancement. Additionally, the systems are evaluated in terms of the Word Error Rate (WER) of a subsequent ASR system.

All signals are recorded at or resampled to 8 kHz. For the STFT computation, a 512-point FFT is used with a Hann window and a 75% overlap, resulting in 257 frequency bins per time frame. The ME consists of an LSTM layer with 1024 units, two feed-forward layers with 1024 units each and one output layer. The first feed-forward layer is split into 30 sub-layers for the SB approach. The auxiliary network has two feed-forward layers of 50 units each and an output layer of 30 units, as in [11]. Finally, for the block-online estimation we use a block size of \(N=5\) frames, corresponding to 80 ms.
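As a sanity check of the stated front-end parameters, the snippet below reproduces the frame geometry (8 kHz, 512-point FFT, Hann window, 75% overlap, 5-frame blocks of 80 ms) with `scipy.signal.stft`; the use of SciPy and the placeholder signal are our own assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 8000                  # sampling rate in Hz
n_fft = 512                # 512-point FFT -> 257 frequency bins
hop = n_fft // 4           # 75% overlap -> 128 samples = 16 ms frame shift
block_frames = 5           # block-online block size: 5 * 16 ms = 80 ms

x = np.random.randn(8 * fs)                     # placeholder single-channel signal (8 s)
f, t, X = stft(x, fs=fs, window='hann', nperseg=n_fft, noverlap=n_fft - hop)
assert X.shape[0] == 257                        # frequency bins per frame
```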

Fig. 3. Sketch of one of the meeting rooms in which the impulse responses and noises were recorded. Room size approx. \({4\,\mathrm{m} \times 6\,\mathrm{m}}\). Drawn true to scale.

3.1 Database Description

We evaluate the proposed source extraction system on two databases. The first is the one described in [28], which consists of 30000 training, 500 development and 1500 evaluation examples. Each example is created by randomly choosing two utterances from the Wall Street Journal (WSJ) database and convolving the signals with six-channel room impulse responses (RIRs) with reverberation times \(T_{60}\in [{20}\,\text {ms},{500}\,\text {ms}]\) simulated by the image method [29]. The shorter of the generated multi-channel signals is padded with zeros such that it falls at an arbitrary position within the duration of the longer signal. The observation then consists of the sum of both utterances, to which white Gaussian noise with a Signal-to-Noise Ratio (SNR) of 15 to 25 dB is added. The speaker sets of the training, development and evaluation sets are mutually exclusive; we therefore characterize the database as open. For the AU we convolve a second utterance spoken by the target speaker with the same RIR and add white Gaussian noise. This database will be referred to as RirSim and is used for all parameter tuning and network training.
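The sketch below illustrates how one RirSim-style example could be assembled (two source signals convolved with multi-channel RIRs, the shorter image placed at a random offset inside the longer one, and white Gaussian noise added at a random SNR between 15 and 25 dB); the function names, the padding convention and the SNR definition over all channels are our own assumptions, not the database-generation code of [28].

```python
import numpy as np
from scipy.signal import fftconvolve

def make_mixture(s1, s2, rir1, rir2, rng, snr_range=(15.0, 25.0)):
    """Assemble one two-speaker multi-channel mixture (illustrative only).
    s1, s2: single-channel source signals; rir1, rir2: (D, L) room impulse responses."""
    x1 = np.stack([fftconvolve(s1, h) for h in rir1])   # (D, T1) image of speaker 1
    x2 = np.stack([fftconvolve(s2, h) for h in rir2])   # (D, T2) image of speaker 2
    T = max(x1.shape[1], x2.shape[1])

    def place(x):
        # Zero-pad to length T; the shorter image gets a random offset, the longer offset 0
        out = np.zeros((x.shape[0], T))
        offset = rng.integers(0, T - x.shape[1] + 1)
        out[:, offset:offset + x.shape[1]] = x
        return out

    mix = place(x1) + place(x2)
    snr_db = rng.uniform(*snr_range)
    noise_std = np.sqrt(np.mean(mix ** 2) / 10 ** (snr_db / 10))
    return mix + noise_std * rng.standard_normal(mix.shape)
```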

The second database is created similarly to the one described above; however, the RIRs and the noise are replaced by real signals recorded in a conference scenario. The real RIRs and noises were recorded using a flat 8-channel Microelectromechanical systems (MEMS) microphone array, \({7\,\mathrm{cm} \times 10\,\mathrm{cm}}\) in size and of elliptic shape. The recordings took place in two different meeting rooms with reverberation times of \(T_\mathrm {60}\approx {1}\,\mathrm{s}\) at the premises of voice INTER connect GmbH in Dresden. Figure 3 shows the floor plan of one of these rooms. The microphone array was flush-mounted at the center of the meeting room table in both cases. The table height is 0.73 m. Impulse responses for ten different lateral speaker positions per room were recorded using a coaxial loudspeaker at an assumed human speaker’s mouth height of 1.15 m. The speaker positions for the depicted room, together with their directions of view, are shown as squares with arrows in Fig. 3. Four different types of typical meeting room noise sources (air-conditioning, paper shuffling, projector, typing noises) were recorded using the microphone array. The database thus created will be called RirReal.

Table 1. Gains of the beamformer output compared to the signal at a reference microphone w.r.t. different performance measures, and word error rate for different feature sets of the speaker extraction system on RirSim.

3.2 ASR Backend

The ASR backend uses the wide residual network structure proposed in [30] with logarithmic mel filterbank input features and two Long Short-Term Memory (LSTM) layers. This acoustic model is combined with a trigram language model from the WSJ baseline script provided by the KALDI toolkit [31]. All hyper-parameters were taken from [30]. The same neural acoustic model, trained on the artificially reverberated WSJ utterances of RirSim, is used for both databases. The network is trained on alignments extracted with an HMM model trained in KALDI. The decoding is performed without language model rescoring.

3.3 Source Extraction in Static Speaker Scenario

In Table 1 the performance of different feature sets for the extraction systems described above is compared on the RirSim database. All systems use the log-spectral magnitude of the observation. As additional features, we compare the log-spectral magnitude of the observation enhanced with an initial beamforming vector estimated on the AU, spatial features according to Eqs. (10) and (11), or both the spatial features and the enhanced signals. For offline methods, both the beamforming vector and the mask estimation are carried out in batch mode on the whole utterance.

All described feature sets achieve better results than the original SpeakerBeam system, whose performance is given in the first results row of Table 1. Even the online system with the additional features achieves better results than the original offline SpeakerBeam system. Therefore, we conclude that using spatial information is beneficial for our source extraction system in the case of static speakers. An in-depth evaluation of the described features for static speakers is presented in [17].

3.4 Source Extraction in the Presence of a Speaker Position Change

To simulate a change in speaker position, we divided the WSJ database into pairs of utterances, where the first utterance is convolved with the same set of RIRs as the AU and the second with a different set of RIRs, while the competing speaker in the speech mixture and his/her position in the room are kept fixed in both utterances.

The change of the target speaker position calls for adaptive beamforming. We thus expect the online beamformer to outperform the offline beamformer.

While the target speaker position in the first of the two utterances coincides with the one present during the AU, this no longer holds for the second. This renders the spatial information gained from the AUX incorrect. Table 2 displays the extraction results achieved with different features for online and offline systems. Note that neither the Acoustic Model (AM) nor the ME is retrained on the new RIRs and noise.

Table 2. Gains of the beamformer output compared to the signal at a reference microphone w.r.t. different performance measures, and word error rate for a non-stationary speaker on RirReal. Here, Position (Pos.) 0 denotes the first speaker position, which is equal to the position during the AU, whereas Position 1 indicates a change in position. “only ME” indicates that the additional spatial features are used as input to the mask estimation network only.

Using spatial features during mask estimation but not in the AUX improves the extraction in case of changes in the target speaker position, as can be seen in the entry with “only ME” in the column “spatial”. Similarly, it can be concluded that it is beneficial to update the initial beamforming vector for each block of frames, see the entry with \(\mathbf {F}(\ell -1)\) in the column “enhanced”.

Additionally, the results confirm that the extraction achieved with a recursively updated beamforming vector is only slightly impaired by the change in speaker position, whereas a fixed beamformer estimated once for the concatenated utterances suffers significantly from it. This is especially true for the fixed beamforming vector estimated on the AU, since no information about the competing speaker is included in the noise SCM estimation.

To emphasize the benefits of recursive beamformer adaptation, the cosine distance between the recursively estimated beamforming vector and an oracle offline beamformer is depicted in Fig. 4. Here, the coefficients of the offline beamformer have been obtained separately on the first and the second utterance using the oracle speech and noise images at the microphones. The displayed tracking curves are averaged over multiple utterances.
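The cosine distance used in Fig. 4 is not spelled out in the text; a common definition for complex-valued beamforming vectors, used here as an assumption, is one minus the magnitude of the normalized Hermitian inner product, which is invariant to the scaling and global phase of the vectors.

```python
import numpy as np

def cosine_distance(f1, f2, eps=1e-10):
    """Assumed definition: 1 - |f1^H f2| / (||f1|| ||f2||)."""
    return 1.0 - np.abs(np.vdot(f1, f2)) / (np.linalg.norm(f1) * np.linalg.norm(f2) + eps)
```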

Fig. 4. Cosine distance between the block-online beamforming vector and an oracle offline beamforming vector calculated on the speech and noise images, averaged over 500 utterances. The speaker position changes at frame #600.

The figure showcases the ability of the online beamforming vector to adapt to a change in speaker position. Furthermore, the recursive update is largely insensitive to the choice of the forgetting factor \(\beta _\nu \).

4 Conclusion

This paper offers a thorough investigation of speaker extraction systems guided by an AU in the presence of speaker position changes. We showcased the benefits of recursively updating the beamforming vector and investigated the usefulness of spatial features when the target speaker position changes. While the spatial characteristics of the target speaker extracted from the adaptation utterance become outdated, the use of spatial features for mask estimation to extract a target speaker from a speech mixture remains beneficial. This can be attributed to the fact that they allow the system to separate speakers based on their spatial diversity, thus not relying solely on different spectro-temporal properties of the speakers.