
1 Introduction

Sound source localization and tracking (SSLT) refers to the problem of estimating the location from which a sound signal originates with respect to the microphone array geometry. It plays an important role in teleconferencing systems and in social robot applications. In a teleconference scenario, a camera capable of automatic steering can be deployed to focus on the speaker given the estimated speaker position [22, 29]. In addition, source localization often serves as a preprocessing step before the enhancement of an acoustic signal from a particular location [20]. In the domain of social robotics, localization enables the robot to concentrate on a subject of interest or to be aware of where other sound events are coming from.

Multiple microphones are, in general, required to achieve SSLT. Different microphone array configurations have been used in the recent literature, e.g., binaural microphones [5], linear arrays [39], circular arrays [9], and distributed microphone arrays [13, 25]. The source position is estimated by exploiting the range differences from the source to the microphones. Although various algorithms have been developed over recent decades for SSLT applications, room reverberation, background noise, and sound interference remain key challenges in realistic environments. In the context of room acoustics, the microphones capture not only the direct-path propagation component of the source signal but also multipath components due to reflections at the room boundaries. The multipath components, together with the background noise, distort the time delay information contained in the microphone signals and degrade localization performance. In addition, one is often interested in localizing and tracking a desired source (e.g., a human speech source) in the presence of sound interferers (e.g., fan noise, air-conditioner noise) that commonly exist in a room environment. These interferers may distract the system, which then localizes the interferers rather than the desired source.

The organization of this chapter is as follows: in Sect. 3.2, the mathematical formulation of the SSLT problem is introduced and conventional localization and tracking methods are reviewed. In Sect. 3.3, a proposed method that addresses the problem of speech source tracking in the presence of sound interference is discussed; it exploits the speech harmonicity feature so as to ensure that only speech components are used for tracking. The integration of SSLT into a social robot application is discussed in Sect. 3.4. Finally, possible future research directions and conclusions are presented in Sects. 3.5 and 3.6, respectively.

2 Overview of Sound Source Localization and Tracking Algorithms

SSLT algorithms can be classified into two categories: localization approaches and tracking approaches. A localization approach assumes independence between successive audio frames and estimates the source location from each data frame independently. A tracking approach exploits the consistency between successive frames by assuming that the source is stationary or moving slowly. In this section, the mathematical formulation of both approaches is discussed.

Fig. 3.1 Signal propagation model

2.1 Mathematical Formulation of Sound Source Localization

The SSLT problem is illustrated in Fig. 3.1. The speech signal s(n) radiates away from the source position and propagates to the microphones. The received signals contain not only the direct-path component but also multipath components caused by reflections from the room boundaries. Within a short time frame, the channel from the source to the ith microphone can be modeled as a linear time-invariant system represented by a channel impulse response \(h_i(n)\). The ith microphone received signal can thus be formulated as [3]

$$\begin{aligned} y_i(n) = s(n) *h_i(n) + v_i(n),~~~i=1,2,\ldots ,M, \end{aligned}$$
(3.1)

where \(*\) is the convolution operator, \(v_i(n)\) is the additive noise, and M is the number of microphones. In order to infer the signal delay information, the impulse response \(h_i(n)\) can be further decomposed into a direct-path component and a multipath component. The microphone received signal can thus be rewritten as

$$\begin{aligned} y_i(n) = a_i s(n-\tau _i) + s(n) *h'_i(n) + v_i(n),~~~i=1,2,\ldots ,M, \end{aligned}$$
(3.2)

where \(0\le a_i \le 1\) is the attenuation factor due to propagation, \(\tau _i\) is the direct-path time delay from the source to the ith microphone, and \(h'_i(n)\) denotes the remaining impulse response which is defined as the difference between the original response and the direct-path component. In (3.2), the time delay \(\tau _i\) is dependent on the source position with respect to the microphone array. However, direct estimation of \(\tau _i\) is not achievable since SSLT is a passive localization problem. Most of the algorithms exploit the relative time delay information among microphones and one such algorithm is introduced in the following section.
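
The signal model in (3.2) is easy to reproduce numerically. The following Python/NumPy sketch simulates only the direct-path term \(a_i s(n-\tau _i) + v_i(n)\), under an assumed 1/distance attenuation and a rounded integer-sample delay; the multipath term \(s(n) *h'_i(n)\) would require convolution with a room impulse response (e.g., from the image method used later in this chapter). All names and parameter values here are illustrative, not taken from the chapter.

```python
import numpy as np

C = 343.0    # speed of sound in m/s (assumed)
FS = 16000   # sampling rate in Hz (assumed)

def simulate_direct_path(s, src_pos, mic_pos, snr_db=20.0, rng=None):
    """Simulate only the direct-path term of (3.2):
    y_i(n) = a_i s(n - tau_i) + v_i(n).
    The multipath term s(n) * h'_i(n) is omitted for simplicity.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mics = np.atleast_2d(np.asarray(mic_pos, dtype=float))
    src = np.asarray(src_pos, dtype=float)
    channels = []
    for m in mics:
        dist = np.linalg.norm(src - m)
        tau = int(round(dist / C * FS))          # direct-path delay in samples
        a = 1.0 / max(dist, 1e-3)                # attenuation factor a_i
        y = np.zeros(len(s) + tau)
        y[tau:] = a * s                          # a_i s(n - tau_i)
        v = rng.normal(0.0, np.sqrt(np.var(a * s) / 10 ** (snr_db / 10)), len(y))
        channels.append(y + v)                   # add the noise v_i(n)
    n_min = min(len(y) for y in channels)
    return np.stack([y[:n_min] for y in channels])   # (M, n) received signals
```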

2.2 Sound Source Localization Using Beamforming-Based Approach

Given the microphone received signal \(y_i(n)\), localization is usually performed using each data frame defined as

$$\begin{aligned} \mathbf {y}_i(k)=[y_i(kN)~y_i(kN+1)~\ldots ~y_i(kN+N-1)], \end{aligned}$$
(3.3)

where N is the frame length and k is the frame index. Beamforming is one of the most widely used approaches for sound source localization. In principle, the beamformer computes the spatial power spectrum over the whole region of interest and takes the location of the highest power as the source position estimate (see Fig. 3.2 for an example). The family of beamforming techniques includes steered response power (SRP) [8, 10], minimum variance distortionless response [34], linearly constrained minimum variance [11, 34], etc.

Fig. 3.2 The power spectrum when SNR \(=\) 20 dB and \(T_{60} = 150\) ms. The ground truth of the source position is denoted by the circular dot, which is plotted on top of the spectrum for clarity of presentation

The SRP beamformer gained popularity due to its simplicity. Considering M microphones, the SRP function defines the power

$$\begin{aligned} \mathcal {P}_k(\mathbf {r}') = \sum _{\omega _l \in \varOmega } \left| \sum _{i=1}^M W_i(k,\omega _l)Y_i(k,\omega _l)e^{j\omega _l \Vert \mathbf {r}' - \mathbf {r}^{\mathrm {m}}_i \Vert _2/c}\right| ^2 \end{aligned}$$
(3.4)

corresponding to the current steered location \(\mathbf {r}'\) at time frame k, where \(\mathbf {r}'=[x'~y']^T\) is the steered location in the region of interest, \(W_i(k,\omega _l)\) is a weighting function, \(Y_i(k,\omega _l)\) is the short-time Fourier transform of the ith microphone received signal defined as \(Y_i(k,\omega _l)=\mathcal {F}(\mathbf {y}_i(k))\), \(\omega _l\) is the angular frequency of the lth bin index, c is the speed of sound, \(\mathbf {r}^\mathrm {m}_i\) is the position of the ith microphone, \(\Vert \mathbf {r}' - \mathbf {r}^{\mathrm {m}}_i \Vert _2\) is the distance from the steered location to the ith microphone position, and \(\varOmega \) is the frequency range of interest over which the computation is carried out. In (3.4), the SRP first computes the time delay from the steered location \(\mathbf {r}'\) to each microphone. The corresponding power is then calculated by time-aligning the signals in the frequency domain according to these delays and summing over all microphones. The weighting function \(W_i(k,\omega _l)\) is important in the power calculation. While different weighting functions can be used [24], the phase transform (PHAT) given as

$$\begin{aligned} W_{i}^{\mathrm {PHAT}}(k,\omega _l) = \frac{1}{| Y_i(k,\omega _l) |} \end{aligned}$$
(3.5)

remains one of the most commonly used weighting schemes. The corresponding beamformer is therefore named SRP-PHAT. Substituting (3.5) into (3.4) shows that the PHAT weighting is independent of the source energy, so the computed SRP response depends only on the phase delay.

Furthermore, by steering the beamformer across the whole region of interest, one can obtain the power spectrum as shown in Fig. 3.2. Estimating the source position is therefore achieved by searching for the location that corresponds to the maximum power, i.e.,

$$\begin{aligned} \widehat{\mathbf {r}}_k = \arg \max _{\mathbf {r}' \in \mathcal {D}}\mathcal {P}_k(\mathbf {r}') , \end{aligned}$$
(3.6)

where \(\mathcal {D}=\{ (x,y)\,|\,x_{\mathrm {min}}\le x \le x_{\mathrm {max}},~y_{\mathrm {min}}\le y \le y_{\mathrm {max}} \}\) is the considered search domain.
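
As a concrete illustration of (3.4)–(3.6), the sketch below evaluates the SRP-PHAT map over a grid of candidate locations and returns the argmax position. It is a direct, unoptimized transcription of the equations; the function names, the frequency range, and the numerical guard on the PHAT denominator are our assumptions.

```python
import numpy as np

def srp_phat(frames, mic_pos, grid, fs=16000, c=343.0, f_max=2000.0):
    """SRP-PHAT power map (3.4)-(3.5) and argmax location (3.6).

    frames : (M, N) array holding one windowed time frame per microphone
    mic_pos: (M, 2) microphone positions; grid: (G, 2) candidate positions
    """
    M, N = frames.shape
    Y = np.fft.rfft(frames, axis=1)                      # Y_i(k, w_l)
    omega = 2 * np.pi * np.fft.rfftfreq(N, d=1.0 / fs)   # angular frequencies
    band = omega <= 2 * np.pi * f_max                    # frequency range Omega
    WY = Y / np.maximum(np.abs(Y), 1e-12)                # PHAT: Y_i / |Y_i|
    power = np.empty(len(grid))
    for g, r in enumerate(grid):
        d = np.linalg.norm(mic_pos - r, axis=1)          # ||r' - r_i^m||_2
        steer = np.exp(1j * np.outer(d / c, omega))      # e^{j w_l d_i / c}
        summed = (WY * steer)[:, band].sum(axis=0)       # sum over microphones
        power[g] = np.sum(np.abs(summed) ** 2)           # sum over w_l in Omega
    return grid[np.argmax(power)], power                 # (3.6) and the map
```

A search grid over \(\mathcal {D}\) can be built with, e.g., np.mgrid over the room extent and reshaped to a (G, 2) array; a finer grid trades computation for spatial resolution, which motivates the complexity discussion below.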

It has been shown in [7] that the beamforming method achieves higher spatial resolution than other localization methods, such as those based on time-difference-of-arrival estimation [3]. However, one drawback is the high computational complexity of scanning the region of interest. Some researchers adopt grids of different resolutions to reduce the computational burden [12]. In addition, a recently proposed method integrates the energy in each discrete grid cell to achieve better performance [4].

2.3 Sound Source Tracking Using Particle Filter-Based Approach

The localization algorithm discussed in Sect. 3.2.2 estimates the source position from each microphone data frame \(\mathbf {y}_i(k)\) independently. The performance degrades as background noise and reverberation increase, since some of the data frames then suffer from signal distortion and cannot provide reliable location estimates. However, if we assume that the source is stationary or moving slowly relative to the convergence rate of the tracking algorithm, one way to improve the performance is to exploit the temporal consistency of location measurements across successive frames.

We now consider successive data frames \(\{\mathbf {y}_i(k)|~k=1, 2, \ldots , K\}\) where k is the frame index, and K is the total number of audio frames. The aim is to estimate the source positions over all the time frames, leading to a source tracking problem. We first define the state variable as \(\varvec{\alpha }_k=[x_k~y_k~\dot{x}_k~\dot{y}_k]^T\) at frame index k, where \(x_k\) and \(y_k\) correspond to the source position while \(\dot{x}_k\) and \(\dot{y}_k\) are the source velocities in x and y direction, respectively. Similarly, the measurement variable \(\mathbf {z}_k = [\widehat{x}_k~~\widehat{y}_k]^T\) is defined. This measurement vector can be obtained from the SRP location estimate by evaluating (3.4)–(3.6) for the kth time frame data. Therefore, the state-space model can be written as

$$\begin{aligned} \varvec{\alpha }_k&= \mathcal {G}(\varvec{\alpha }_{k-1},\mathbf {u}_k), \end{aligned}$$
(3.7a)
$$\begin{aligned} \mathbf {z}_k&= \mathcal {H}(\varvec{\alpha }_k,\mathbf {w}_k), \end{aligned}$$
(3.7b)

where \(\mathcal {G}(\cdot )\) is the process function defining the time evolution of the state, \(\mathbf {u}_k\) is the process noise, \(\mathcal {H}(\cdot )\) is the measurement equation defining the mapping from \(\varvec{\alpha }_k\) to \(\mathbf {z}_k\), and \(\mathbf {w}_k\) is the measurement noise.

To formulate \(\mathcal {G}(\cdot )\) in (3.7a), the Langevin process model has been widely used as it provides a realistic model to simulate human source motion [13, 25, 31, 35, 37]. This model can be described using

$$\begin{aligned} \varvec{\alpha }_k = \begin{bmatrix} 1&0&aT&0\\0&1&0&aT\\0&0&a&0\\0&0&0&a\end{bmatrix} \varvec{\alpha }_{k-1}+\begin{bmatrix} bT&0\\0&bT\\b&0\\0&b \end{bmatrix}\mathbf {u}_k , \end{aligned}$$
(3.8)

where \(\mathbf {u}_k \sim \mathcal {N}(\varvec{\mu },\varvec{\Sigma })\) is the noise vector following a Gaussian distribution, T is the time interval between consecutive frames, and \(\varvec{\mu }=[0~0]^T\) and \(\varvec{\Sigma }=\mathbf {I}_{2\times 2}\) are the mean vector and covariance matrix, respectively. In addition, the model parameters are defined as \(a = \exp (-\beta T)\) and \(b = \bar{v}\sqrt{1-a^2}\), where \(\bar{v}=0.8\,\mathrm {m/s}\) is the steady-state velocity and \(\beta =10\,\mathrm {Hz}\) is the rate constant [25]. To formulate \(\mathcal {H}(\cdot )\) in (3.7b), we note that \(\mathbf {z}_k\) is defined as the two-dimensional location estimate obtained from the SRP and hence, we can express

$$\begin{aligned} \mathbf {z}_k = \begin{bmatrix} 1&0&0&0\\0&1&0&0\end{bmatrix} \varvec{\alpha }_k + \mathbf {w}_k, \end{aligned}$$
(3.9)

where \(\mathbf {w}_k\) represents the measurement error.
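
The state-space pair (3.8)–(3.9) translates directly into code. The following minimal sketch builds the Langevin transition and noise-gain matrices using the constants quoted above and draws one state propagation; it is an illustration under those stated values, not the authors' implementation.

```python
import numpy as np

BETA = 10.0   # rate constant beta (Hz), from the text
V_BAR = 0.8   # steady-state velocity (m/s), from the text

def langevin_matrices(T):
    """Transition matrix F and noise gain B of the Langevin model (3.8)."""
    a = np.exp(-BETA * T)
    b = V_BAR * np.sqrt(1.0 - a ** 2)
    F = np.array([[1.0, 0.0, a * T, 0.0],
                  [0.0, 1.0, 0.0, a * T],
                  [0.0, 0.0, a,   0.0],
                  [0.0, 0.0, 0.0, a]])
    B = np.array([[b * T, 0.0],
                  [0.0, b * T],
                  [b,   0.0],
                  [0.0, b]])
    return F, B

def propagate(alpha_prev, T, rng):
    """One draw from the process model (3.7a)/(3.8)."""
    F, B = langevin_matrices(T)
    u = rng.standard_normal(2)          # u_k ~ N([0 0]^T, I)
    return F @ alpha_prev + B @ u

# Measurement model (3.9): z_k extracts the position from the state
H_MEAS = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
```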

The process of sound source tracking is performed in a probabilistic manner. Statistically, the posterior probability density function (pdf) \(\mathrm {Pr}(\varvec{\alpha }_k|\mathbf {z}_{1:k})\) denotes the probability of state \(\varvec{\alpha }_k\) conditioned on the measurements up to time k, and the measurement likelihood \(\mathrm {Pr}(\mathbf {z}_k|\varvec{\alpha }_k)\) represents the probability of attaining measurement \(\mathbf {z}_k\) conditioned on the state. Considering consecutive data frames, the sound source tracking problem can therefore be formulated as follows: for each frame index k, given \(\mathrm {Pr}(\varvec{\alpha }_{k-1}|\mathbf {z}_{1:k-1})\) at the previous time frame, the objective is to estimate \(\mathrm {Pr}(\varvec{\alpha }_k|\mathbf {z}_{1:k})\) using the source motion model \(\mathcal {G}(\cdot )\) and the new measurement \(\mathbf {z}_k\).

While Kalman filtering has been proposed for source tracking [15, 18], the particle filter (PF) framework [1, 17] is deemed a better approach for the SSLT problem since it requires neither linearity nor Gaussianity in the state-space formulation. The PF was first introduced to SSLT in [35] and has since gained great popularity [13, 14, 25, 27, 31, 37].

In the PF framework, the posterior density \(\mathrm {Pr}(\varvec{\alpha }_{k}|\mathbf {z}_{1:k})\) is approximated by a set of particles of the state space with associated weights \(\{(\varvec{\alpha }_k^{(p)},w_k^{(p)})\}_{p=1}^{N_{p}}\), i.e.,

$$\begin{aligned} \mathrm {Pr}(\varvec{\alpha }_{k}|\mathbf {z}_{1:k})=\sum _{p=1}^{N_{p}} w_k^{(p)}\delta (\varvec{\alpha }_{k}-\varvec{\alpha }_k^{(p)}), \end{aligned}$$
(3.10)

where \(p=1,\ldots ,{N_{p}}\) denotes the particle index, \(\varvec{\alpha }_k^{(p)}\) is the pth particle of the state space, \(w_k^{(p)}\) is its associated weight, and \(\delta (\cdot )\) is the Dirac delta function. Bootstrap PF-based sound source tracking proceeds as follows: suppose that, at time \(k-1\), the set \(\{(\varvec{\alpha }_{k-1}^{(p)},w_{k-1}^{(p)})\}_{p=1}^{N_{p}}\) approximates the posterior density \(\mathrm {Pr}(\varvec{\alpha }_{k-1}|\mathbf {z}_{1:k-1})\). The set \(\{(\varvec{\alpha }_k^{(p)},w_k^{(p)})\}_{p=1}^{N_{p}}\) at time index k, corresponding to \(\mathrm {Pr}(\varvec{\alpha }_{k}|\mathbf {z}_{1:k})\), is then obtained by a propagation step

$$\begin{aligned} \varvec{\alpha }_k^{(p)} = \mathcal {G}(\varvec{\alpha }_{k-1}^{(p)},\mathbf {u}_k), \end{aligned}$$
(3.11)

followed by an update step,

$$\begin{aligned} w_k^{(p)} \propto w_{k-1}^{(p)} \mathrm {Pr}(\mathbf {z}_k|\varvec{\alpha }_k^{(p)}). \end{aligned}$$
(3.12)

Computation of \(\mathrm {Pr}(\mathbf {z}_k|\varvec{\alpha }_k^{(p)})\) is required in (3.12), and a pseudo-likelihood approach has been proposed [25, 37] to avoid the computational load of determining the SRP maximum corresponding to the source location measurement. In this formulation, the SRP map itself is used as an approximation of \(\mathrm {Pr}(\mathbf {z}_k|\varvec{\alpha }_k^{(p)})\). To some extent, the SRP can define the probability of the source being located at the steered positions within the room, as it corresponds to the energy originating from those positions. The pseudo-likelihood approach defines the likelihood as

$$\begin{aligned} \mathrm {Pr}(\mathbf {z}_k|\varvec{\alpha }_k) = \left\{ \begin{array}{l} \mathcal {P}_k^\gamma (\varvec{\ell }_k), \text{ for } \text{ voiced } \text{ frame }\\ \mathcal {U}_{\mathcal {D}}(\varvec{\ell }_k), \text{ for } \text{ unvoiced } \text{ frame } \end{array} \right. , \end{aligned}$$
(3.13)

where \(\gamma =2\) is a control parameter to regulate the SRP function for source tracking [25], \(\mathcal {U}_{\mathcal {D}}(\cdot )\) is the uniform pdf over the considered enclosure domain \(\mathcal {D}\), and \(\varvec{\ell }_k\) denotes the first two elements of \(\varvec{\alpha }_k\).

In practice, due to the proportionality in (3.12), the weights are normalized using

$$\begin{aligned} w_k^{(p)} \Leftarrow \frac{w_k^{(p)}}{\sum _{i=1}^{{N_{p}}} w_k^{(i)} }, \end{aligned}$$
(3.14)

where \(\Leftarrow \) denotes the assignment of a new value to the variable. In addition, the PF usually includes a resampling stage, which prevents the degeneracy phenomenon in which, after a few iterations, the majority of the particles possess negligible weights, wasting computation [1]. Finally, the state estimate at time frame index k is given as

$$\begin{aligned} \widehat{\varvec{\alpha }}_k = \sum _{p=1}^{{N_{p}}} w_k^{(p)} \varvec{\alpha }_k^{(p)}, \end{aligned}$$
(3.15)

and the first two elements of \(\widehat{\varvec{\alpha }}_k\) represent the position estimate from the tracking framework. A summary of the bootstrap PF-based sound source tracking algorithm can be found in Table 3.1.

Table 3.1 Summary of the bootstrap PF
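
For concreteness, one iteration of the bootstrap PF summarized in Table 3.1 can be sketched as follows, reusing the propagate function from the Langevin sketch above. The srp_value callable stands in for the SRP map evaluated at a particle position, i.e., the pseudo likelihood of (3.13); the systematic resampling scheme and the effective-sample-size test are common choices we assume here rather than prescriptions from the text.

```python
import numpy as np

def bootstrap_pf_step(particles, weights, srp_value, voiced, T, rng,
                      gamma=2.0, n_thr=37.5, uniform_pdf=1.0):
    """One frame of a bootstrap PF in the style of Table 3.1.

    particles : (Np, 4) state particles; weights : (Np,) normalized weights
    srp_value : callable mapping an (x, y) position to the SRP response
    """
    n_p = len(particles)
    # Propagation step (3.11): one Langevin draw per particle
    particles = np.array([propagate(p, T, rng) for p in particles])
    # Update step (3.12) with the pseudo likelihood (3.13)
    if voiced:
        lik = np.array([srp_value(p[:2]) ** gamma for p in particles])
    else:
        lik = np.full(n_p, uniform_pdf)
    weights = weights * lik
    weights = weights / weights.sum()                 # normalization (3.14)
    # Systematic resampling when the effective sample size drops too low
    if 1.0 / np.sum(weights ** 2) < n_thr:
        edges = (rng.random() + np.arange(n_p)) / n_p
        idx = np.searchsorted(np.cumsum(weights), edges)
        particles, weights = particles[idx], np.full(n_p, 1.0 / n_p)
    alpha_hat = weights @ particles                   # state estimate (3.15)
    return particles, weights, alpha_hat
```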

3 Proposed Robust Speech Source Tracking

In Sect. 3.2, several approaches for localizing and tracking a stationary or moving source were discussed. Significant progress has been made in recent decades toward robust SSLT in various adverse environments. However, localizing or tracking a speech source in the presence of sound interference is still an open problem. This is particularly important in robotic applications, since robots are expected to keep interacting with a human user in a noisy environment. It is also important to note that sound interference may be nonstationary and unpredictable in nature. Take an office room, for instance: fan noise, air-conditioner noise, or a telephone ring may occur at different positions. Existing methods are, in general, unable to distinguish between the desired speech source and interferers, and their performance may degrade when such interferers are present.

In this section, a speech source tracking method that is robust to interferers is introduced [38]. The proposed method incorporates a well-known frequency-domain speech feature known as harmonicity. We first compare the speech spectrogram with those of some typical sound interferers in Sect. 3.3.1 and illustrate the speech harmonic feature. Details of the proposed method are introduced in Sect. 3.3.2. In Sect. 3.3.3, simulations are conducted to evaluate the performance of the proposed method in the presence of interference, noise, and reverberation.

Fig. 3.3 Spectrograms of different signals. a Speech signal spectrogram. b Fan noise spectrogram. c Power drill noise spectrogram. d Telephone ring noise spectrogram

3.1 The Harmonic Structure in the Speech Spectrogram

Figure 3.3 shows the spectrogram of a typical speech signal obtained from the TIMIT database [16] together with those of different sound interferers obtained from the NOISEX-92 database [33]. The speech spectrogram, shown in Fig. 3.3a, exhibits several harmonics (dark curves) at integer multiples of the pitch frequency. The pitch frequency represents the frequency of the vocal cord vibration, which normally ranges from 100 to 300 Hz depending on whether the voice is male or female [6]. This spectrogram indicates that the speech energy is dominant on these harmonics. Figure 3.3b shows the spectrogram of a recorded fan noise, where the energy is concentrated below \(2\,\mathrm {kHz}\). The spectrogram of a recorded power drill noise, shown in Fig. 3.3c, indicates a similar energy distribution in the low frequency range, although high-energy spectral lines appear at approximately 1.5, 2, and 2.2 kHz. These dominant frequencies may be caused by mechanical rotation or vibration. It is useful to note that neither of these two sounds exhibits a regular harmonic structure. The telephone ring, shown in Fig. 3.3d, does exhibit a regular harmonic structure caused by the presence of a single tone. However, its harmonics differ from those of the speech signal due to a difference in pitch frequency.

In the following, we therefore assume that the sound interference either does not share the same harmonic bands as speech, due to a different pitch frequency, or does not possess any harmonic structure at all. The key objective of the proposed method is to estimate the harmonic bands corresponding to the speech components and to emphasize these bands, as they provide a high signal-to-interference ratio (SIR). Other frequency regions are not used for tracking, as they are contaminated by the sound interferers.

3.2 Speech Source Tracking in the Presence of Sound Interference

In the conventional sound source tracking framework, as introduced in Sect. 3.2.3, particles are propagated according to the source dynamic model before being weighted by the measurement likelihood. The particle weights are computed using a pseudo likelihood derived from SRP-PHAT measurements [13, 25, 37]. While this technique can achieve good tracking performance, the performance may degrade significantly in the presence of interference, because SRP-PHAT is, in general, unable to discriminate between the speech source and acoustic interference. Any acoustic interference therefore produces a dominant peak at the interferer's position, and the particles are likely to propagate toward that location, away from the speech source (see Fig. 3.7a). The performance of these algorithms degrades significantly at low SIR, causing the SSLT to lose track of the speech source.

To mitigate this degradation in performance, we exploit speech harmonicity such that the measurement likelihood is predominantly weighted by the speech signal as opposed to the interferers. The overall framework of the proposed method is as follows: (1) a prior source position is estimated using the assumed source dynamic model; (2) a beamformer is then applied to enhance the source signal from the prior estimated position in order to extract the speech feature; (3) the reliable harmonic bands are estimated from the enhanced signal; and (4) the new measurement likelihood is derived by emphasizing these high-SIR harmonic bands while discarding the other frequency regions.

3.2.1 Prior Prediction

In general, a clean source signal is often required in order to extract the corresponding speech features. However, due to the presence of interference and background noise, obtaining such a clean signal is challenging. To improve the feature extraction performance, we propose a speech signal enhancement stage consisting of prior source position prediction followed by a beamformer. Considering the Langevin source dynamic model introduced in Sect. 3.2.3, for time frame index k, the prior source state can be estimated using (3.7a) and (3.8) as

$$\begin{aligned} \widehat{\varvec{\alpha }}_k^- = \mathcal {G}(\widehat{\varvec{\alpha }}_{k-1}^+,\mathbf {u}_k) , \end{aligned}$$
(3.16)

given the state estimate at the previous frame. Here, \(\widehat{\varvec{\alpha }}_{k-1}^+\) is the posterior state estimate at time frame index \(k-1\). The prior source location estimate

$$\begin{aligned} \widehat{\mathbf {r}}_k^- = [\widehat{x}_k^-~~\widehat{y}_k^-]^T , \end{aligned}$$
(3.17)

corresponds to the first two elements in \(\widehat{\varvec{\alpha }}_k^-\). Note that this prior estimate is based only on the assumed source motion. Its objective is to allow the beamformer to enhance the signal from this preliminary estimated source position. The feature-directed measurements, as will be described in the subsequent sections, will further refine the state estimate.

3.2.2 Feature Extraction

After obtaining a prior estimate of the source position at each iteration, a beamformer can be employed to enhance the signal from that particular position. Note that while the beamformer served as a localization technique in Sect. 3.2.2, beamforming was originally developed to enhance the signal from a known source position while suppressing interference and noise [34]. Various beamformers can be applied once a prior source location has been estimated. We consider, for example, the delay-and-sum beamformer [23] due to its simplicity, although other beamformers such as those presented in [21, 32] may be used to enhance the speech signal. The delay-and-sum beamformer output for the prior estimated source location \(\widehat{\mathbf {r}}_k^-\) is given as

$$\begin{aligned} S(\omega _l,\widehat{\mathbf {r}}_k^-) = \sum _{i=1}^M \varPhi \left( D_i(\widehat{\mathbf {r}}_k^-)\right) Y_i(k,\omega _l)e^{j\omega _l D_i(\widehat{\mathbf {r}}_k^-)/c}, \end{aligned}$$
(3.18)

where i is the microphone index, M is the number of microphones, and \(Y_i(k,\omega _l)\) is the frequency-domain received signal from the ith microphone at kth frame. The variable \(\omega _l\) is the angular frequency of lth frequency bin, c is the speed of sound, \(D_i(\widehat{\mathbf {r}}_k^-) = \Vert \widehat{\mathbf {r}}_k^- -\mathbf {r}^{\mathrm {m}}_i\Vert _2\) is the distance from the prior estimated source position to the ith microphone, and \(\varPhi (\cdot )\) is a monotonic function that weighs the ith microphone signal according to the source-sensor distance. In our simulations, we found that \(\varPhi \left( D_i(\widehat{\mathbf {r}}^-_k)\right) =1/D_i(\widehat{\mathbf {r}}^-_k)\) performs well as it emphasizes the signal from the microphone that is closer to the source.
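
The beamformer of (3.18) amounts to a phase-aligned, distance-weighted sum of the microphone spectra. A minimal sketch, assuming the same frame and array conventions as the earlier SRP example:

```python
import numpy as np

def delay_and_sum(frames, mic_pos, r_prior, fs=16000, c=343.0):
    """Frequency-domain delay-and-sum beamformer of (3.18), steered to the
    prior location estimate r_prior, with Phi(D_i) = 1/D_i channel weights."""
    M, N = frames.shape
    Y = np.fft.rfft(frames, axis=1)                       # Y_i(k, w_l)
    omega = 2 * np.pi * np.fft.rfftfreq(N, d=1.0 / fs)
    D = np.linalg.norm(mic_pos - r_prior, axis=1)         # D_i(r_k^-)
    phi = 1.0 / np.maximum(D, 1e-3)                       # distance weighting
    steer = np.exp(1j * np.outer(D / c, omega))           # e^{j w_l D_i / c}
    return (phi[:, None] * Y * steer).sum(axis=0)         # S(w_l, r_k^-)
```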

Fig. 3.4 Spectrogram and selected harmonic bands indicated in blue lines. a Clean speech. b Power-drill interference. c Reference microphone received signal and its selected harmonic bands (in blue). d Beamformer enhanced signal and its selected harmonic bands (in blue)

Figure 3.4 shows the signal enhancement result for a \(6\,\mathrm {s}\) speech signal in the presence of a power drill interferer at SIR \(=\) 5 dB and white Gaussian noise at a signal-to-noise ratio (SNR) of 15 dB. These results were generated using the method of images [26] with \(T_{60}=200\) ms, with eight microphones placed 0.5 m away from the room perimeter (see Fig. 3.7). Figure 3.4a shows the spectrogram of the original speech signal, where a clear harmonic structure can be found. Figure 3.4b shows the power drill interference spectrogram, where no harmonic structure is present. In general, the source signal received by a single reference microphone is distorted, especially when the interferer is close to the microphone, as shown in Fig. 3.4c. Extracting speech harmonics from this received signal is therefore challenging. The beamformer enhanced signal, shown in Fig. 3.4d, is indeed clearer than the microphone received signal: the speech harmonics are dominant across the whole spectrogram, although some interference energy leakage is visible. The beamformer enhanced signal will be used for feature extraction in the next step.

To extract the speech harmonics from a noisy spectrum, we use the multi-band excitation (MBE) fit method [2, 19]. As indicated in Fig. 3.5, the MBE model expresses a voiced frame in the frequency domain as the product of the spectral envelope \(H(\omega )\) and the excitation spectrum \(E(\omega ,\omega _{\mathrm {p}})\), given by [19]

Fig. 3.5 MBE model for a speech signal. A voiced frame can be modeled as the product of the spectral envelope \(H(\omega )\) and the excitation spectrum \(E(\omega ,\omega _{\mathrm {p}})\) in the frequency domain

$$\begin{aligned} S_{\mathrm {spch}}(\omega ) = H(\omega )E(\omega ,\omega _{\mathrm {p}}), \end{aligned}$$
(3.19)

where \(\omega _{\mathrm {p}}\) is the pitch frequency, such that

$$\begin{aligned} E(\omega ,\omega _{\mathrm {p}}) = \sum _{q=1}^Q \varPsi (\omega -q\omega _{\mathrm {p}}), \end{aligned}$$
(3.20)

where q is the harmonic index, Q is the number of harmonics, and \(\varPsi (\omega )\) is the Fourier transform of the Hamming window.

We now consider extracting the harmonic information from the beamformer enhanced signal \(S(\omega ,\widehat{\mathbf {r}}_k^-)\) via MBE model fitting. The harmonic information \(\omega _{\mathrm {p}}\) and \(H(\omega )\) can be estimated via minimization of the fitting error between \(S(\omega ,\widehat{\mathbf {r}}_k^-)\) and the MBE modeled signal

$$\begin{aligned} \varepsilon (\omega _{\mathrm {p}})&= \int _{0}^{2\pi }{|S(\omega ,\widehat{\mathbf {r}}_k^-) - S_{\mathrm {spch}}(\omega )|^2d\omega } \nonumber \\&= \int _{0}^{2\pi }{|S(\omega ,\widehat{\mathbf {r}}_k^-) - H(\omega )E(\omega ,\omega _{\mathrm {p}})|^2d\omega }, \end{aligned}$$
(3.21)

where \(S(\omega ,\widehat{\mathbf {r}}_k^-)\) has been defined in (3.18).

In practice, the above process is computed in the discrete frequency domain, where \(\omega _l=2\pi l/L\) denotes the angular frequency of the lth frequency bin, L is the number of frequency bins, and \(\omega _{\mathrm {p}}\) is now evaluated on the discrete angular frequencies. In order to solve the nonlinear minimization problem in (3.21), the whole spectrum is further decomposed into several harmonic bands. The qth harmonic band spans the interval \([a_q,~b_q]\), where the lower and upper limits are defined as \(a_q=\lceil (q-0.5)\omega _{\mathrm {p}} \rfloor \) and \(b_q=\lceil (q+0.5)\omega _{\mathrm {p}} \rfloor \), respectively, and \(\lceil \cdot \rfloor \) denotes selection of the nearest frequency bin. The variable \(H(\omega )\) is also decoupled into a complex amplitude \(H_q\) for each harmonic band q, so that the fitting error for each harmonic band is

$$\begin{aligned} \varepsilon _q(\omega _{\mathrm {p}})=\sum _{\omega _l=a_q}^{b_q}|S(\omega _l,\widehat{\mathbf {r}}_k^-)-H_qE(\omega _l,\omega _{\mathrm {p}})|^2 , \end{aligned}$$
(3.22)

and the total error in (3.21) becomes

$$\begin{aligned} \varepsilon (\omega _{\mathrm {p}})=\sum _{q=1}^Q \varepsilon _q(\omega _{\mathrm {p}}). \end{aligned}$$
(3.23)

We note that there is a subtle difference between (3.23) and (3.21); in (3.23) we only sum over the Q harmonic bands of interest, while in (3.21) the whole spectrum is integrated.

The harmonic information is thus represented by two parameters: the pitch frequency \(\omega _{\mathrm {p}}\) and the complex amplitudes \(H_q\) of all harmonic bands. The variable \(H_q\) is obtained by setting the derivative of (3.22) to zero, giving

$$\begin{aligned} H_q=\frac{\displaystyle \sum _{\omega _l=a_q}^{b_q}S(\omega _l,\widehat{\mathbf {r}}_k^-)E^*(\omega _l,\omega _{\mathrm {p}})}{\displaystyle \sum _{\omega _l=a_q}^{b_q}|E(\omega _l,\omega _{\mathrm {p}})|^2} , \end{aligned}$$
(3.24)

where \(*\) denotes the conjugate operation. The pitch frequency \(\omega _{\mathrm {p}}\) can then be estimated by the following steps: each fitting error \(\varepsilon _q(\omega _{\mathrm {p}})\) is evaluated using the optimal value of \(H_q\) obtained in (3.24); the total error function in (3.23) is then computed for all candidate pitch frequencies \(\omega _{\mathrm {p}}\) of interest; finally, the global minimum of \(\varepsilon (\omega _{\mathrm {p}})\) is located and the corresponding \(\omega _{\mathrm {p}}\) is taken as the estimated speech pitch \(\widehat{\omega }_{\mathrm {p}}\).
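
The whole estimation loop of (3.22)–(3.24) is a one-dimensional grid search: for each candidate pitch, solve every band's \(H_q\) in closed form, accumulate the band errors, and keep the minimizer. A compact sketch follows; sampling \(\varPsi \) by the magnitude of the Hamming-window DFT at integer bin offsets, and expressing candidate pitches as DFT bin indices, are simplifications we assume for illustration. (At 16 kHz with 512-sample frames, 100–300 Hz corresponds roughly to pitch bins 3–10.)

```python
import numpy as np

def mbe_pitch_fit(S, N, pitch_bins, Q=12):
    """Grid-search sketch of the MBE fit (3.22)-(3.24).

    S          : one-sided spectrum of one frame (e.g. delay-and-sum output)
    N          : frame length used for the DFT
    pitch_bins : integer candidate pitch frequencies, in DFT bin units
    """
    psi = np.abs(np.fft.rfft(np.hamming(N)))     # Psi: window spectrum
    best_err, best_lp, best_H = np.inf, None, None
    for lp in pitch_bins:
        total_err, H = 0.0, []
        for q in range(1, Q + 1):
            a_q = int(round((q - 0.5) * lp))     # band limits [a_q, b_q]
            b_q = int(round((q + 0.5) * lp))
            if b_q >= len(S):
                break
            bins = np.arange(a_q, b_q + 1)
            E = psi[np.abs(bins - q * lp)]       # E(w_l, w_p) within band q
            Hq = np.vdot(E, S[bins]) / np.sum(E ** 2)   # closed form (3.24)
            total_err += np.sum(np.abs(S[bins] - Hq * E) ** 2)  # (3.22)
            H.append(Hq)
        if H and total_err < best_err:
            best_err, best_lp, best_H = total_err, lp, np.array(H)
    return best_lp, best_H, best_err   # pitch estimate, amplitudes, error
```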

3.2.3 Feature-Directed Particle Weight Update

To obtain the feature-directed particle weight update, the most reliable harmonic bands must be determined and selected for the likelihood computation. Two criteria are proposed to assess the reliability of the harmonic bands: (1) the normalized fitting error and (2) the normalized harmonic energy.

First, the normalized fitting error [2] is defined, for each harmonic band, as a measure of how well that frequency band is fitted by the speech harmonic model. It is computed as

$$\begin{aligned} \bar{\varepsilon }_q=\frac{\varepsilon _q(\widehat{\omega }_{\mathrm {p}})}{\displaystyle \sum _{\omega _l=a_q}^{b_q}|S(\omega _l,\widehat{\mathbf {r}}_k^-)|^2} , \end{aligned}$$
(3.25)

where the fitting error \(\varepsilon _q(\widehat{\omega }_{\mathrm {p}})\) is computed by substituting the estimated pitch frequency \(\widehat{\omega }_{\mathrm {p}}\) into (3.22). The fitting error is normalized by the energy of each corresponding harmonic band.

Second, the normalized harmonic energy, defined as the ratio of the energy distributed on that harmonic band to the total harmonic energy, i.e.,

$$\begin{aligned} P_q=\frac{\displaystyle \sum _{\omega _l=a_q}^{b_q}|H_q E(\omega _l,\widehat{\omega }_{\mathrm {p}})|^2}{\displaystyle \sum _{q'=1}^Q\sum _{\omega _l=a_{q'}}^{b_{q'}}|H_{q'} E(\omega _l,\widehat{\omega }_{\mathrm {p}})|^2}, \end{aligned}$$
(3.26)

is computed. As the energy of the speech signal is expected to be concentrated in the harmonic structure, harmonic bands with low \(\bar{\varepsilon }_q\) and high \(P_q\) are most likely to retain the speech components, while other regions are expected to contain the interference signal. We therefore set two harmonic-band thresholds \(\zeta \) and \(\eta \) for selecting the reliable (speech) harmonic bands such that

$$ \begin{aligned} {G_q(\omega _l)}&=\left\{ \begin{array}{l} \varPsi (\omega _l-q\widehat{\omega }_{\mathrm {p}}), \text{ if } \bar{\varepsilon }_q \le \zeta ~ \& ~P_q\ge \eta ,~\omega _l \in [a_q,b_q] \\ 0, \text{ otherwise } \end{array} \right. , \end{aligned}$$
(3.27a)
$$\begin{aligned} G(\omega _l)&=\sum _{q=1}^Q G_q(\omega _l) . \end{aligned}$$
(3.27b)

Equation (3.27a) indicates that only the harmonic bands that satisfy both thresholds are selected; the other frequency bands are discarded. Equation (3.27b) indicates that the selection process is carried out over all frequency bands of interest. The sum of the selected harmonic bands is denoted by \(G(\omega _l)\).
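
The selection rule of (3.25)–(3.27) can then be sketched as below, operating on the amplitudes returned by the MBE fit above. Treating the band energy as the magnitude-squared of the fitted harmonic is our reading of (3.26), and the default thresholds are the values used later in the simulations.

```python
import numpy as np

def select_harmonic_bands(S, H, lp, N, zeta=0.6, eta=0.03):
    """Reliability test (3.25)-(3.27): keep harmonic band q only when its
    normalized fitting error is at most zeta and its normalized harmonic
    energy is at least eta. Returns the mask G(w_l), which is Psi-shaped on
    the kept bands and zero elsewhere."""
    psi = np.abs(np.fft.rfft(np.hamming(N)))
    G = np.zeros(len(S))
    bands, errs, energies = [], [], []
    for q, Hq in enumerate(H, start=1):
        a_q, b_q = int(round((q - 0.5) * lp)), int(round((q + 0.5) * lp))
        if b_q >= len(S):
            break
        bins = np.arange(a_q, b_q + 1)
        E = psi[np.abs(bins - q * lp)]
        fit = Hq * E
        errs.append(np.sum(np.abs(S[bins] - fit) ** 2)
                    / np.sum(np.abs(S[bins]) ** 2))      # (3.25)
        energies.append(np.sum(np.abs(fit) ** 2))        # band energy
        bands.append((bins, E))
    if not bands:
        return G
    P = np.array(energies) / np.sum(energies)            # (3.26)
    for (bins, E), e_bar, p_q in zip(bands, errs, P):
        if e_bar <= zeta and p_q >= eta:                 # thresholds (3.27a)
            G[bins] = E                                  # Psi(w_l - q w_p)
    return G                                             # mask of (3.27b)
```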

Fig. 3.6 MBE fitting result. a Clean speech and MBE fit. b Beamformer output, MBE fit, and \(G(\omega )\) in the presence of a power drill signal

Figure 3.6 shows the extraction of speech harmonics from a 32 \(\mathrm {ms}\) frame. Figure 3.6a shows the MBE fitting result, computed using (3.22)–(3.24), for the case of clean speech where no interferer is present. We note that the MBE approximation, shown by the dotted line, is capable of estimating the harmonics of clean speech. Figure 3.6b shows the result when a power-drill signal is added to the speech signal at an SIR of 5 dB. The beamformer output \(S(\omega _l,\widehat{\mathbf {r}}_k^-)\), shown by the solid line, therefore consists of spectral components corresponding to the power drill, at 400 and 1500 Hz, and to the speech signal. Comparing Fig. 3.6a, b, we note that the MBE fit in Fig. 3.6b estimates the speech harmonics with reasonable accuracy, albeit with some distortion. The estimated reliable speech harmonic bands are indicated by \(G(\omega _l)\) and denoted by the bold lines (normalized to 0 dB for clarity).

The extraction discussed above considers a single data frame. By iterating the procedure over all frames, \(G(\omega _l)\) in (3.27b) can be extended to \(G(k,\omega _l)\), which denotes the selected harmonic bands at the kth frame. The selected harmonics over all frames are shown in Fig. 3.4d, where a 6 \(\mathrm {s}\) speech signal in the presence of power-drill interference is considered. We note that, using the beamformer and the MBE fit, the speech harmonic bands can be estimated as indicated by the dark lines in the spectrogram.

With \(G(k,\omega _l)\), the new SRP function \(\mathcal {P}_k(\varvec{\ell })\) with weight \(W_i(k,\omega _l)\) is given as

$$\begin{aligned} \mathcal {P}_k(\varvec{\ell })&= \sum _{\omega _l \in \varOmega } \left| \sum _{i=1}^M W_i(k,\omega _l)Y_i(k,\omega _l)e^{j\omega _l D_i(\varvec{\ell })/c}\right| ^2, \end{aligned}$$
(3.28a)
$$\begin{aligned} W_i(k,\omega _l)&= \frac{G(k,\omega _l)}{|Y_i(k,\omega _l)|}, \end{aligned}$$
(3.28b)

where \(\varOmega \) is the frequency range over which the SRP function is evaluated. Similar to the pseudo-likelihood method [25, 37], this SRP function is used to define the measurement likelihood in the PF framework,

$$\begin{aligned} \mathrm {Pr}(\mathbf {z}_k|\varvec{\alpha }_k) = \left\{ \begin{array}{l} \mathcal {P}_k^\gamma (\varvec{\ell }_k), \text{ for } \text{ voiced } \text{ frame }\\ \mathcal {U}_{\mathcal {D}}(\varvec{\ell }_k), \text{ for } \text{ unvoiced } \text{ frame } \end{array} \right. , \end{aligned}$$
(3.29)

where \(\gamma =2\) is a control parameter to regulate the SRP function for source tracking [25], and \(\mathcal {U}_{\mathcal {D}}(\cdot )\) is the uniform pdf over the considered enclosure domain \(\mathcal {D}\) defined in (3.6). This likelihood function is then used to update the particle weights. The proposed SSLT framework is summarized in Table 3.2.
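
In code, the only change relative to plain SRP-PHAT is the weighting: the mask \(G(k,\omega _l)\) gates the PHAT normalization, as in the sketch below, after which the weighted spectra feed the same SRP evaluation and PF update as before. The function name and the numerical guard are our assumptions.

```python
import numpy as np

def feature_directed_weights(Y, G):
    """Weighting of (3.28b): PHAT normalization gated by the harmonic-band
    mask G(k, w_l), so only the selected speech bands contribute to the SRP.
    Y is the (M, L) stack of microphone spectra for frame k."""
    return G[None, :] / np.maximum(np.abs(Y), 1e-12)
```

This weight simply replaces the plain \(1/|Y_i(k,\omega _l)|\) term in the earlier srp_phat sketch; the tracking loop itself is unchanged apart from using (3.29) as the measurement likelihood.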

Table 3.2 Summary of the proposed algorithm

3.3 Simulation Results

Simulations were conducted using synthetic impulse responses generated by the method of images [26]. The dimensions of the room were \(5\,\mathrm {m} \times 5\,\mathrm {m} \times 2.5\,\mathrm {m}\), and the reverberation time \(T_{60}\) ranged from 200 to \(300\,\mathrm {ms}\). Eight microphones were distributed \(0.5\,\mathrm {m}\) away from the perimeter of the room (see Fig. 3.7). An \(8\,\mathrm {s}\) male speech signal sampled at 16 \(\mathrm {kHz}\) from the TIMIT database [16] was used as the source signal. A power drill (PD) signal and a recorded telephone ring (TR) signal obtained from the NOISEX-92 database [33] were used as interferers. White Gaussian noise at 15 dB SNR was added to the microphone signals. The source speed was set at approximately \(0.6\,\mathrm {m/s}\). The speech source positions were estimated using a frame size of 512 samples with \(N_{p}=100\) particles. We also used an effective sample size threshold \(N_\mathrm {{thr}} = 37.5\) and harmonic-band thresholds \(\zeta = 0.6\) and \(\eta = 0.03\). A total of 12 harmonic bands (\(Q=12\)) were considered. The proposed method is compared with the conventional tracking method that uses SRP-PHAT as the pseudo likelihood [25]. Both methods were evaluated over \(0\le \varOmega \le 2\,\mathrm {kHz}\), within which, for the proposed algorithm, the speech pitch frequency was estimated between 100 and \(300\,\mathrm {Hz}\) using (3.22)–(3.24). In this chapter, we quantify the performance using the average tracking error across all audio frames, i.e.,

$$\begin{aligned} \bar{e}=\frac{1}{K}\sum _{k=1}^{K} ||\widehat{\mathbf {r}}_k^+ - \mathbf {r}_k||_2, \end{aligned}$$
(3.30)

where \(\widehat{\mathbf {r}}_k^+\) is the posterior position estimate at the kth frame, \(\mathbf {r}_k\) is the true source position, \(||\cdot ||_2\) is the \(\ell _2\) norm, and K is the number of frames.

Fig. 3.7 Comparison of tracking results when TR is present at SIR \(= -3\) dB, \(T_{60} = 250\,\mathrm {ms}\). a Conventional SRP-PHAT tracking method. b Proposed tracking method

Figure 3.7 compares the tracking results for \(T_{60}=250\,\mathrm {ms}\) in the presence of a telephone ring at \(-3\) dB SIR. Figure 3.7a shows that the tracking performance of the conventional SRP-PHAT approach is adversely affected by the interferer. Due to the high SRP-PHAT measurement likelihood in the interferer region, the particles become "trapped" once they propagate there, in this case the region near the telephone ring. The SRP-PHAT method has an average error of 0.58 \(\mathrm {m}\), indicating that it does not converge to the speech source trajectory. On the other hand, Fig. 3.7b shows the tracking performance of the proposed method, which is considerably less affected by the presence of the telephone ring, achieving an average error of 0.12 \(\mathrm {m}\).

Fig. 3.8 Comparison of tracking results when both PD and TR are present at SIR \(= 3\) dB and 0 dB, respectively, \(T_{60} = 250\,\mathrm {ms}\). a Conventional SRP-PHAT tracking method. b Proposed tracking method

Figure 3.8 shows the tracking results when both the power drill and the telephone ring are present at SIRs of 3 and 0 dB, respectively, with \(T_{60}=250\,\mathrm {ms}\). Again, Fig. 3.8a shows the conventional SRP-PHAT approach losing track of the speech source; the particles are "trapped" in the region near the power drill, leading to an average error of 0.61 \(\mathrm {m}\). On the other hand, the proposed method, shown in Fig. 3.8b, retains its robustness with an average error of 0.13 \(\mathrm {m}\).

Table 3.3 Comparison of mean tracking error \(\bar{e}\) between the SRP-PHAT tracking method and the proposed tracking method

Table 3.3 shows the average tracking error for various test conditions. The source trajectory and interference positions remain the same as in the previous setup. These results show that the proposed algorithm achieves better accuracy than the SRP-PHAT method. For instance, in the presence of the power drill at 3 dB SIR, the SRP-PHAT method exhibits a large tracking error of 0.56 \(\mathrm {m}\) when \(T_{60}=0.2\,\mathrm {s}\), whereas the proposed method achieves an error of 0.11 \(\mathrm {m}\), an 80 % reduction over the SRP-PHAT method. Furthermore, the proposed method maintains its robustness in localization and tracking in the presence of two interferers, while the SRP-PHAT approach suffers from large tracking errors under low-SIR conditions. However, it is also observed that the performance of the proposed algorithm degrades modestly as the reverberation time increases; the proposed method may fail under adverse environments, as indicated for \(T_{60}=0.3\,\mathrm {s}\) with PD and TR present at SIRs of 3 and \(-6\) dB, respectively.

Fig. 3.9 Comparison of tracking results when PD is present at SIR \(=\) 3 dB, \(T_{60} = 200\,\mathrm {ms}\). a Conventional SRP-PHAT tracking method. b Proposed tracking method

Fig. 3.10 Comparison of tracking results when both PD and TR are present at SIR \(=\) 3 dB and 0 dB, respectively, \(T_{60} = 200\,\mathrm {ms}\). a Conventional SRP-PHAT tracking method. b Proposed tracking method

Different source trajectories and interference configurations were also examined in Figs. 3.9 and 3.10. As before, these results show that the conventional SRP-PHAT approach is likely to be affected by interferers, while the proposed approach retains its robustness; the particles propagate closely along the source trajectory.

Fig. 3.11 Comparison of mean tracking error versus reverberation time \(T_{60}\). a Power drill present at SIR \(=\) 0 dB. b Telephone ring present at SIR \(= -5\) dB

Fig. 3.12 Integration setup with the social robot system

Figure 3.11 shows the performance of both algorithms under different reverberation conditions. Figure 3.11a shows the results when the power drill is present at SIR \(=\) 0 dB. The SRP-PHAT tracking method, indicated by the dashed line, consistently yields tracking errors of more than \(1\,\mathrm {m}\). The proposed SRP-MBE tracking method, shown by the solid line, yields errors of less than \(0.3\,\mathrm {m}\) when \(T_{60}\) is below \(0.35\,\mathrm {s}\); however, its performance deteriorates rather significantly when \(T_{60}\) exceeds \(0.4\,\mathrm {s}\). A similar conclusion can be drawn from Fig. 3.11b, where the telephone ring is present at SIR \(= -5\) dB: the SRP-PHAT tracking method consistently yields high tracking errors of more than \(0.5\,\mathrm {m}\), while SRP-MBE deteriorates when \(T_{60}\) exceeds \(0.3\,\mathrm {s}\).

4 Integration with Social Robot

Sound source localization and tracking were investigated in the previous sections. In this section, we describe a system in which the SSLT module has been integrated with a social robot and a virtual human. Figure 3.12 shows the demo setup of a social robot system in the BeingThere Center, Nanyang Technological University. Microphones are arranged linearly at known positions. The SSLT module estimates the position of a speaker within the room and delivers the position information through I2P connections to the server, so that the other modules (e.g., the head controller module) have access to the sound position information. Either the virtual human or the social robot is then able to turn its head toward the person speaking in the room. By focusing on the speaker, the interaction between the robot and its users is improved. The sound position information can also be combined with the face detection module, which allows the robot to be aware of all users while focusing on the active speaker.

5 Future Avenues

This research has focused on SSLT problems in the meeting room environment, which will remain a research focus in the near future. The following are some possible directions for future research:

  1.

    Improving the performance of SRP-MBE in reverberant environments. The performance of the proposed SRP-MBE tracking algorithm degrades as the reverberation time increases, because the harmonic bands are corrupted by strong reverberation. How to recover or extract the time delay information from the degraded harmonic bands requires further investigation.

  2.

    Tracking a time-varying number of sources. In recent years, tracking a time-varying number of sources has gained much interest in the research community [14, 28, 30]. In a typical environment, multiple speakers may talk at the same time, resulting in overlapping speech signals; in addition, some speakers may fall silent after talking for a while. This practical situation requires an advanced probabilistic model, such as the random finite set [28, 36], to be incorporated into the particle filter framework to achieve multiple speaker tracking. It also requires a mechanism to detect and initialize newborn targets and to remove inactive targets from the state at the appropriate time instants [14].

6 Conclusions

In this chapter, we first reviewed the SSLT problem in a meeting room environment for teleconferencing purposes, where the challenges include room reverberation, background noise, and sound interference. After reviewing some of the existing methods, a proposed SSLT framework was discussed for tracking a speech source in the presence of sound interference. This method estimates the speech harmonic bands and uses them for localization and tracking. By emphasizing only these harmonic bands, a more speech-sensitive measurement likelihood is obtained, resulting in better weight updates for the particles. Simulation results show that the proposed method achieves lower tracking error than the conventional SRP-PHAT method in the presence of multiple interferers.