Estimating the directions of moving narrowband sources with a non-stationary (moving) array is an active research area. The most advanced techniques originate from the maximum likelihood (ML) and expectation-maximization (EM) methods; they take the form of recursive extended Kalman filters and rely on built-in source-movement models. For non-stationary speech sources, nonparametric modeling of the source movements can be employed. Such models assume no mathematical form for the signal, whereas parametric approaches posit a mathematical model that defines the signal form and is used to estimate it [1].

Consider the problem of localizing \( q \) speech sources using an array of \( n \) passive sensors. To obtain the signal model, the sources are assumed to generate a wave-field that travels through space and is sampled by the sensor array. The array aperture is the space occupied by the array, usually measured in units of the signal wavelength. For omnidirectional point sources in the far field of the array, the only parameter that distinguishes the signals impinging on the sensors from a source is a time delay, expressed as an angle of arrival (AOA), or direction of arrival (DOA).

Consider a uniform linear array of \( n \) identical sensors uniformly spaced on a line, receiving \( q \) (\( q < n \)) narrowband signals impinging from unknown time-varying directions \( \{ \theta_{1} , \ldots ,\theta_{q} \} \). The \( n \times 1 \) output vector of the array at discrete time \( t \) is modeled as [2,3,4,5,6,7]:

$$ {\mathbf{r}}(t) = {\mathbf{A}}(t){\mathbf{s}}(t) + {\mathbf{e}}(t) $$
(3.1)

where the \( n \times q \) time-varying direction matrix is given by:

$$ {\mathbf{A}}(t) = \left[ {\mathbf{a}}(\theta_{1} (t)),{\mathbf{a}}(\theta_{2} (t)), \ldots ,{\mathbf{a}}(\theta_{q} (t)) \right] $$
(3.2)

The \( {\mathbf{A}}(t) \) matrix is composed of the source direction vectors, known as steering vectors, which are defined as follows:

$$ {\mathbf{a}}(\theta_{i} ) = \left[ 1,\exp \left( - j\frac{2\pi }{\lambda }d\sin \theta_{i} \right), \ldots ,\exp \left( - j(n - 1)\frac{2\pi }{\lambda }d\sin \theta_{i} \right) \right]^{T} $$
(3.3)

where \( i = 1, \ldots, q \); \( \lambda \) is the wavelength, defined as the distance traveled by the harmonic carrier signal in one period; \( d \) is the inter-element spacing; \( {\mathbf{s}}(t) \) is a \( q \times 1 \) vector of the source waveforms; and \( {\mathbf{e}}(t) \) is an \( n \times 1 \) vector of sensor noise, modeled as white zero-mean Gaussian random noise with variance \( \sigma^{2} \), satisfying:

$$ E\{ {\mathbf{e}}(t)\} = 0,\quad E\{ {\mathbf{e}}(t){\mathbf{e}}(t)^{H} \} = \sigma^{2} {\mathbf{I}},\quad E\{ {\mathbf{e}}(t){\mathbf{e}}(t)^{T} \} = 0 $$
(3.4)

where \( E\{ \cdot \} \) denotes the expectation, \( (\cdot)^{H} \) the Hermitian (conjugate) transpose, and \( (\cdot)^{T} \) the transpose.

With this model in place, the source localization problem becomes a time-invariant or time-varying parameter estimation problem.
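
For illustration, the following minimal sketch (in Python with NumPy) generates synthetic array data according to Eqs. (3.1)–(3.4) for a uniform linear array with half-wavelength spacing and two stationary far-field sources; all parameter values and function names are illustrative assumptions, not part of the original model.

```python
import numpy as np

def steering_vector(theta_deg, n, d_over_lambda=0.5):
    """Steering vector of Eq. (3.3) for a uniform linear array (DOA in degrees)."""
    theta = np.deg2rad(theta_deg)
    k = np.arange(n)                                   # sensor index 0 .. n-1
    return np.exp(-1j * 2 * np.pi * d_over_lambda * k * np.sin(theta))

rng = np.random.default_rng(0)
n, N = 8, 200                                          # sensors, snapshots
thetas = [-20.0, 35.0]                                 # assumed true DOAs (degrees)
A = np.column_stack([steering_vector(t, n) for t in thetas])      # Eq. (3.2)

# zero-mean source waveforms and circular white Gaussian noise, Eq. (3.4)
S = (rng.standard_normal((len(thetas), N))
     + 1j * rng.standard_normal((len(thetas), N))) / np.sqrt(2)
sigma2 = 0.1
E = np.sqrt(sigma2 / 2) * (rng.standard_normal((n, N))
                           + 1j * rng.standard_normal((n, N)))

R = A @ S + E                                          # array output, Eq. (3.1)
R_hat = (R @ R.conj().T) / N                           # sample covariance used later
```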

3.1 Direction of Arrival Estimation Techniques

In many cases, the receiver cannot determine which direction a speech signal will arrive from. Accordingly, the DOA estimation step becomes essential before beamforming. The array-based DOA estimation techniques can be broadly divided into conventional, subspace-based, maximum likelihood, and integrated techniques [8].

3.1.1 Conventional Beamformer for DOAE

Conventional beamformer approaches are based on the concepts of beamforming and null steering and do not exploit the structure of the received signal model or the statistical properties of the signals and noise. These beamformers electronically steer beams toward all candidate directions and look for peaks in the output power. The delay-and-sum methods are the classical beamformers for DOA estimation. However, they suffer from poor resolution: the width of the main beam and the height of the side lobes limit their effectiveness when signals arrive from multiple sources. Capon’s minimum variance method tries to overcome the poor resolution of the delay-and-sum technique. Nevertheless, Capon’s method has several disadvantages, namely (i) it fails in the presence of other signals that are correlated with the signal of interest (SOI), and (ii) it is expensive for large arrays, since it requires the computation of a matrix inverse.

The classical beamforming-based methods have fundamental resolution limitations, which arise because the structure of the input data model is neglected. Generally, these conventional methods need a large number of elements to achieve high resolution. A conventional beamformer such as the delay-and-sum beamformer selects the phases to steer the array in a particular direction, known as the look direction [3]. It was introduced as a natural extension of standard Fourier-based spectral analysis to sensor array data. Modeling the beamformer as a finite impulse response spatial filter, its output for the signal model is given by:

$$ y(t) = \sum\limits_{k = 1}^{n} \omega_{k}^{ * } r_{k} (t) = {\mathbf{W}}^{H} {\mathbf{r}}(t),\quad {\mathbf{W}} = (\omega_{1} , \ldots ,\omega_{n} )^{T} $$
(3.5)

Assume that the waveform \( {\mathbf{s}}(t) \) and the noise \( {\mathbf{e}}(t) \) are zero-mean independent random processes. Given \( N \) samples \( {\mathbf{r}}(t) \), the output power is measured by:

$$ P({\mathbf{W}}) = \frac{1}{N}\sum\limits_{t} {\left| {y(t)} \right|}^{2} = {\mathbf{W}}^{H} {\hat{\mathbf{R}}}_{r} {\mathbf{W}} \, $$
(3.6)

The goal is to design a spatial filter that suppresses the noise component while preserving the waveform \( {\mathbf{s}}(t) \). This can be formulated as the following optimization problem [4]:

$$ \mathop {\hbox{min} }\limits_{{\mathbf{W}}} \left\| {\mathbf{W}} \right\|^{2} \;{\text{subject}}\;{\text{to}}\;{\mathbf{W}}^{H} {\mathbf{a}}(\theta ) = 1 $$
(3.7)

The optimal weight vector for the spatial filter is expressed by:

$$ {\mathbf{W}} = {\mathbf{a}}(\theta )/n $$
(3.8)

Substituting these weights, the output power of the spatial filter as a function of \( \theta \) is obtained as:

$$ P_{conv} (\theta ) = \frac{1}{{n^{2} }}{\mathbf{a}}^{H} (\theta ){\hat{\mathbf{R}}}_{r} {\mathbf{a}}(\theta ) \, $$
(3.9)

To find the unknown DOA, the power \( P_{conv} (\theta ) \) is maximized over \( \theta \) using the following expression:

$$ \hat{\theta } = \arg (\mathop {\hbox{max} }\limits_{\theta } P_{conv} (\theta )) $$
(3.10)
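
A minimal sketch of this conventional beamformer is given below; it reuses the steering_vector() function, the sample covariance R_hat, and the array size n from the earlier sketch, and the grid resolution and single-peak search are illustrative assumptions.

```python
import numpy as np

def conventional_spectrum(R_hat, n, grid_deg):
    """Delay-and-sum spatial spectrum of Eq. (3.9) evaluated on a DOA grid."""
    p = np.empty(len(grid_deg))
    for i, th in enumerate(grid_deg):
        a = steering_vector(th, n)
        p[i] = np.real(a.conj() @ R_hat @ a) / n**2       # Eq. (3.9)
    return p

grid = np.linspace(-90, 90, 721)                          # candidate DOAs (degrees)
P_conv = conventional_spectrum(R_hat, n, grid)
theta_hat = grid[np.argmax(P_conv)]                       # Eq. (3.10), single dominant peak
```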

Several methods, including Capon’s beamformer, MUSIC, and other subspace-based methods, build on this power-function analysis. Alternatively, one can start from the observation model and form the following residual function:

$$ I(\theta ,s) = \frac{1}{N}\sum\limits_{t} {\left\| {{\mathbf{r}}(t) - {\mathbf{a}}(\theta )s(t)} \right\|}^{2} \, $$
(3.11)

The quadratic residual function gives the maximum likelihood estimates of the angle \( \theta \) and the waveform \( {\mathbf{s}}(t) \), provided that the random errors are Gaussian [9]. This beamformer handles stationary sources only, so other beamformers are needed for non-stationary sources.

3.1.2 Subspace DOA Estimation Methods

Several super-resolution approaches are available for estimating the DOA of signals received by an array, including MUSIC and ESPRIT. All rely on eigenvalue decomposition. The difference between these algorithms lies in how the decomposed information is used to determine the DOA.

The MUSIC algorithm is a high-resolution technique that provides information about the number of incident signals, the signal DOAs, the noise power, etc. It can resolve closely spaced signals that cannot be detected by Capon’s method. In the MUSIC algorithm, an exhaustive search is performed for the steering vectors that are orthogonal to the noise subspace. Various modifications of the MUSIC algorithm have been proposed to decrease the computational complexity and increase the resolution performance. These modified versions include (i) the Root-MUSIC algorithm, which is based on polynomial rooting and provides higher resolution; it reduces the number of calculations by avoiding an exhaustive search [10], but is applicable only to a uniformly spaced linear array. In addition, (ii) the cyclic MUSIC algorithm is a signal-selective direction-finding method that exploits the spectral coherence of the received signal as well as the spatial coherence to improve the performance of the conventional MUSIC algorithm. By exploiting spectral correlation along with MUSIC, it is possible to resolve signals spaced more closely than the resolution threshold of the array when only one of them is the SOI [11].
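
As an illustration, a minimal sketch of the standard spectral MUSIC pseudospectrum under the same ULA model is shown below; it reuses steering_vector(), R_hat, n, and grid from the earlier sketches and assumes the number of sources q is known. The peak picking at the end is deliberately coarse.

```python
import numpy as np

def music_spectrum(R_hat, n, q, grid_deg):
    """Spectral MUSIC: project candidate steering vectors on the noise subspace."""
    eigval, eigvec = np.linalg.eigh(R_hat)        # eigenvalues in ascending order
    En = eigvec[:, : n - q]                       # noise-subspace eigenvectors
    p = np.empty(len(grid_deg))
    for i, th in enumerate(grid_deg):
        a = steering_vector(th, n)
        p[i] = 1.0 / np.real(a.conj() @ En @ En.conj().T @ a)
    return p

P_music = music_spectrum(R_hat, n, q=2, grid_deg=grid)
doa_music = grid[np.argsort(P_music)[-2:]]        # coarse pick of the two largest values
```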

The ESPRIT algorithm is another subspace-based DOA estimation technique; it reduces the computational and storage requirements of MUSIC and does not involve an exhaustive search through all possible steering vectors to estimate the DOA. Unlike MUSIC, ESPRIT does not require prior knowledge of the array manifold vectors.

3.1.3 Maximum Likelihood Techniques

Maximum likelihood (ML) techniques are effective for DOA estimation, although they are computationally intensive. ML methods outperform subspace-based methods, especially in low SNR conditions or when only a small number of signal samples is available [12]. Additionally, ML techniques handle correlated signal conditions better than subspace-based techniques.

Assume \( n \) observations \( x_{1}, x_{2}, \ldots, x_{n} \) drawn from a distribution with unknown probability density function \( f_{0}(\cdot) \). The joint density function of all observations can be written as:

$$ f(x_{1} ,x_{2} , \ldots ,x_{n} {\mid }\theta ) = f(x_{1} {\mid }\theta ) \times f(x_{2} {\mid }\theta ) \times \cdots \times f(x_{n} {\mid }\theta ) $$
(3.12)

Viewed as a function of the parameter \( \theta \), Eq. (3.12) is called the likelihood, which is given by:

$$ {\mathcal{L}}(\theta {\kern 1pt} ;{\kern 1pt} x_{1} , \ldots ,x_{n} ) = f(x_{1} ,x_{2} , \ldots ,x_{n} {\mid }\theta ) = \prod\limits_{i = 1}^{n} f (x_{i} {\mid }\theta ) $$
(3.13)

In practice, the following log-likelihood function is employed:

$$ \ln {\mathcal{L}}(\theta ;x_{1} , \ldots ,x_{n} ) = \sum\limits_{i = 1}^{n} {\ln } f(x_{i} {\mid }\theta ) $$
(3.14)

The average log-likelihood estimator is then expressed as follows:

$$ \hat{\ell } = \frac{1}{n}\ln {\mathcal{L}} $$
(3.15)

Under the model, \( \hat{\ell } \) estimates the expected log-likelihood of a single observation.

To estimate the DOA with the ML estimator, the value \( \theta_{0} \) representing the required DOA is determined by finding the angle that maximizes \( \hat{\ell }(\theta ;x) \). This defines the maximum likelihood estimator (MLE) of \( \theta_{0} \), provided a maximum exists:

$$ \{ \hat{\theta }_{\text{MLE}} \} \subseteq \{ \mathop {\arg \max }\limits_{\theta \in \Theta } \;\hat{\ell }(\theta ;x_{1} , \ldots ,x_{n} )\} $$
(3.16)

This expression represents the estimated DOA of the speakers in microphone array problems.
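
The following minimal sketch illustrates the generic grid-search MLE of Eqs. (3.14)–(3.16) for a simple scalar example (i.i.d. Gaussian samples with an unknown location parameter); it is not the array DOA estimator itself, only an illustration of maximizing the average log-likelihood, and all numbers are illustrative.

```python
import numpy as np

# Illustrative only: i.i.d. Gaussian samples with unknown location parameter theta,
# estimated by maximizing the average log-likelihood of Eqs. (3.14)-(3.16) over a grid.
rng = np.random.default_rng(1)
theta_true, sigma = 12.5, 2.0
x = theta_true + sigma * rng.standard_normal(500)

def avg_log_likelihood(theta, x, sigma):
    # Eq. (3.15): (1/n) * sum of per-sample Gaussian log-densities
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - theta) ** 2 / (2 * sigma**2))

grid_theta = np.linspace(0, 25, 2501)
ll = np.array([avg_log_likelihood(t, x, sigma) for t in grid_theta])
theta_mle = grid_theta[np.argmax(ll)]             # Eq. (3.16)
```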

3.1.4 Local Polynomial Approximation Beamformer

Recently, an efficient beamforming technique using local polynomial approximation (LPA) has been developed and adapted to different array geometries. This nonparametric LPA beamformer was first applied to DOA estimation by Katkovnik and Gershman [13], and was later generalized, modified, and developed by Ashour et al. [2] and Elkamchouchi et al. [3].

Localizing and tracking multiple narrowband moving sources with a passive array are fundamental problems in communication, sonar, radar, and microphone arrays. Conventional beamforming and high-resolution subspace techniques were established to exploit the benefits of temporal integration of array data for stationary arrays and sources [14]. For non-stationary moving sources, only rather short series of observations can be used for beamforming and estimation.

Conventional approaches fail, or their performance deteriorates, in moving-source scenarios. The LPA beamformer is quite different from the ML and the conventional beamformers [15]. In the standard ML formulation, the source steering vectors are assumed to be time-invariant, which leads to different forms of the beamforming functions. The computational complexity of LPA beamforming is \( M \) times higher than that of the conventional beamforming algorithm, where \( M \) is the number of points in the angular velocity domain. Typically, the LPA is a sliding-window polynomial filtering (transform). The linear first-degree LPA treats the discrete-time 1D signal as sampled from an underlying continuous function within the selected window and uses a loss function [16]. For DOA estimation of moving speech sources, the LPA beamformer estimates the time-varying DOA \( \hat{\theta }(t) \) from a finite number \( N \) of array observations \( {\mathbf{r}}(t) \). The speech source motion within the observation interval is modeled using a Taylor series [17]:

$$ \begin{aligned} \theta (t + kT) & = \theta (t) + \theta^{(1)} (t)(kT) + \frac{{\theta^{(2)} (t)}}{2}(kT)^{2} + \frac{{\theta^{(3)} (t)}}{6}(kT)^{3} + \cdots \\ & = c_{0} + c_{1} kT + c_{2} (kT)^{2} + c_{3} (kT)^{3} + \cdots \\ \end{aligned} $$
(3.17)

where \( T \) is the sampling interval, and the parameters \( c_{0} \) and \( c_{1} \) serve as estimates of the angle \( \theta (t) \) and its derivative \( \theta^{(1)} (t) \). The source trajectories are considered arbitrary functions of time belonging to the nonparametric class \( F_{\alpha } \) of piecewise continuous \( \alpha \)-differentiable functions, given by:

$$ F_{\alpha } = \left\{ \theta (t):\; \left| {\theta^{(\alpha )} (t)} \right| \le L_{\alpha } ,\; \theta^{(\alpha )} (t) = \frac{d^{\alpha } \theta (t)}{dt^{\alpha }} \right\} $$
(3.18)

For \( \alpha = 0 \), \( \left| {\theta (t)} \right| \) is simply bounded by the value \( L_{0} \). For \( \alpha = 1 \) and 2, the velocity (first derivative) and the acceleration (second derivative) of \( \theta (t) \) exist for almost every time instant, and the absolute values of these derivatives are bounded above by \( L_{1} \) and \( L_{2} \), respectively. The word “nonparametric” indicates that nothing is assumed about a parametric form of \( \theta (t) \) [18]. Source localization is ensured by a sliding weight function (window) \( \omega_{h} \) that discounts observations outside a neighborhood of the center \( t \) of the approximation.

Different kinds of windows can be used; for example, the rectangular window assigns equal weights to all observations in the window, while nonrectangular windows, such as triangular or quadratic ones, usually assign higher weights to observations closer to the center \( t \).

The window function can be expressed by:

$$ \omega_{h} (kT) = \frac{T}{h}\,\omega \left(\frac{kT}{h}\right) $$
(3.19)

where \( \omega (v) \) is a real symmetric function \( [\omega (v) = \omega ( - v)] \) satisfying the following conventional properties:

$$ \omega (v) \ge 0,\quad \omega (0) = \max_{v} \omega (v),\quad \int_{ - \infty }^{\infty } \omega (v)\, dv = 1 $$
(3.20)

where the scaling parameter \( h \) determines the window length.
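
A minimal sketch of discrete window weights consistent with Eqs. (3.19)–(3.20), for rectangular and triangular choices of \( \omega(v) \), could look as follows; the function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def lpa_window(K, T, h, kind="triangular"):
    """Discrete window weights w_h(kT) of Eq. (3.19) for k = -K, ..., K."""
    k = np.arange(-K, K + 1)
    v = k * T / h
    if kind == "rectangular":
        w = np.where(np.abs(v) <= 0.5, 1.0, 0.0)     # w(v) = 1 on [-1/2, 1/2]
    else:
        w = np.maximum(1.0 - np.abs(v), 0.0)         # triangular: peak at v = 0
    return (T / h) * w

# Example: 51 weights for a 25 ms half-width window sampled every 1 ms.
w = lpa_window(K=25, T=1e-3, h=25e-3)
```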

For the linear 1D LPA beamformer, assume a sufficiently short window so that the third- and higher-order terms in Eq. (3.17) are negligible; the following model is then obtained:

$$ \theta (t + kT) = c_{0} + c_{1} kT $$
(3.21)

where \( c_{0} = \theta (t) \) and \( c_{1} = \theta^{(1)} (t) \) represent the instantaneous source DOA and angular velocity, respectively. The problem is thus to find the estimate \( {\hat{\mathbf{c}}} \) of the vector \( {\mathbf{c}} = (c_{0} ,c_{1} )^{T} \) for each speech source of interest from a finite number of non-stationary array observations.

The LPA loss function, based on the weighted least squares approach for estimating the angle and its derivative, is given by [6]:

$$ G(t,{\mathbf{c}}) = \frac{1}{\sum\limits_{k} \omega_{h} (kT)}\sum\limits_{k} \omega_{h} (kT)\left\| {\mathbf{r}}(t + kT) - {\mathbf{a}}({\mathbf{c}},kT)s(t + kT) \right\|^{2} $$
(3.22)

where \( {\mathbf{a}}({\mathbf{c}},kT) = {\mathbf{a}}(c_{0} + c_{1} kT) \), \( {\mathbf{e}}(t + kT) = {\mathbf{r}}(t + kT) - {\mathbf{a}}({\mathbf{c}},kT)s(t + kT) \) is the residual of fitting the sensor array output \( {\mathbf{r}}(t + kT) \) by the corresponding steering-vector output \( {\mathbf{a}}({\mathbf{c}},kT)s(t + kT) \), and \( \omega_{h} (kT) \) is the window.

The minimization of \( G(t,{\mathbf{c}}) \) with respect to the unknown deterministic waveform \( s(t + kT) \) requires:

$$ \frac{\partial G}{\partial s^{*} (t + kT)} = \frac{ - \omega_{h} (kT)}{\sum\limits_{k} \omega_{h} (kT)}\,{\mathbf{a}}^{H} ({\mathbf{c}},kT)\left\{ {\mathbf{r}}(t + kT) - {\mathbf{a}}({\mathbf{c}},kT)s(t + kT) \right\} = 0 $$
(3.23)

Therefore, the estimate of the waveform \( s(t + kT) \, \) is given by:

$$ \hat{s}(t + kT) = \frac{{{\mathbf{a}}^{H} ({\mathbf{c}},kT){\mathbf{r}}(t + kT)}}{n} $$
(3.24)

where the property \( {\mathbf{a}}^{H} ({\mathbf{c}},kT){\mathbf{a}}({\mathbf{c}},kT) = n \) is exploited. Substituting Eq. (3.24) into Eq. (3.22) yields:

$$ G(t,{\mathbf{c}}) = \frac{1}{\sum\limits_{k} \omega_{h} (kT)}\sum\limits_{k} \omega_{h} (kT)\left\{ {\mathbf{r}}^{H} (t + kT){\mathbf{r}}(t + kT) - \frac{\left| {\mathbf{a}}^{H} ({\mathbf{c}},kT){\mathbf{r}}(t + kT) \right|^{2}}{n} \right\} $$
(3.25)

This function should be minimized over the vector parameter \( {\mathbf{c}} \). Since only the second term depends on \( {\mathbf{c}} \), minimizing \( G(t,{\mathbf{c}}) \) is equivalent to maximizing the LPA beamformer function, given by:

$$ P(t,{\mathbf{c}}) = \frac{1}{{n\sum\limits_{k} {\omega_{h} (kT)} }}\sum\limits_{k} {\omega_{h} (kT)\left| {{\mathbf{a}}^{H} ({\mathbf{c}},kT){\mathbf{r}}(t + kT)} \right|^{2} } $$
(3.26)

This function is independent of the nature of \( s(t) \); thus, an unknown transmitted signal does not affect this term in the algorithm. The maximization of \( P(t,{\mathbf{c}}) \) can be performed using a two-dimensional (2D) search over \( c_{0} \) and \( c_{1} \). This LPA beamformer can be extended to multiple-source situations by direct superposition of the individual responses to each source.

At the time instant \( t \), the estimates of \( \theta (t) \) and \( \theta^{(1)} (t) \, \) as well as the value of the waveform \( s(t) \) constitute a solution of the optimization problem:

$$ (\hat{\theta }(t),\hat{\theta }^{(1)} (t),\hat{s}(t)) = \arg \mathop {\hbox{min} }\limits_{{\mathbf{c}},s(t)} G(t,{\mathbf{c}}) $$
(3.27)

or

$$ (\hat{\theta }(t),\hat{\theta }^{(1)} (t)) = \arg \mathop {\hbox{max} }\limits_{{\mathbf{c}}} P(t,{\mathbf{c}}) $$
(3.28)

From the preceding procedure, the DOA and the angular velocity of the moving speaker can be estimated accurately using the LPA beamformer technique.
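
A minimal sketch of the 2D grid search of Eqs. (3.26)–(3.28) is given below; it reuses steering_vector() and lpa_window() from the earlier sketches, assumes DOAs in degrees and angular velocities in degrees per second, and is intended only to illustrate the structure of the search, not an optimized implementation.

```python
import numpy as np

def lpa_beamformer(R_window, n, T, w, c0_grid, c1_grid):
    """2D grid search over (c0, c1) for the LPA beamformer function of Eq. (3.26).

    R_window : (n, 2K+1) array of snapshots r(t + kT), k = -K, ..., K
    w        : window weights w_h(kT) of matching length (see lpa_window above)
    """
    K = (R_window.shape[1] - 1) // 2
    k_idx = np.arange(-K, K + 1)
    P = np.zeros((len(c0_grid), len(c1_grid)))
    for i, c0 in enumerate(c0_grid):
        for j, c1 in enumerate(c1_grid):
            acc = 0.0
            for m, k in enumerate(k_idx):
                a = steering_vector(c0 + c1 * k * T, n)   # a(c, kT) = a(c0 + c1*kT), Eq. (3.21)
                acc += w[m] * np.abs(a.conj() @ R_window[:, m]) ** 2
            P[i, j] = acc / (n * w.sum())
    return P

# Illustrative usage, assuming R_window holds the snapshots around time t:
# c0_grid = np.linspace(-90, 90, 361)          # candidate DOAs (degrees)
# c1_grid = np.linspace(-20, 20, 41)           # candidate angular velocities (deg/s)
# P = lpa_beamformer(R_window, n, T=1e-3,
#                    w=lpa_window(K=25, T=1e-3, h=25e-3),
#                    c0_grid=c0_grid, c1_grid=c1_grid)
# i_max, j_max = np.unravel_index(np.argmax(P), P.shape)   # Eq. (3.28)
# theta_hat, theta_dot_hat = c0_grid[i_max], c1_grid[j_max]
```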

3.2 Optimization Algorithms in DOAE

Numerous algorithms have been developed in recent years for near-field source localization. Several are based on subspace approaches, while others use evolutionary computing methods. In practice, the near-field case arises in many situations, including microphone arrays for speech enhancement, seismic exploration, underwater source localization, and ultrasonic imaging. For near-field source localization, several approaches have been proposed, such as the 2D MUSIC technique, high-order spectra (HOS) based algorithms, the weighted linear prediction technique, the ESPRIT technique, and the ML technique [19, 20]. However, in practical conditions these procedures are computationally heavy, some require extra computations to pair parameters when multiple sources are present, and the localization of closely spaced sources suffers from poor estimation at low signal-to-noise ratio (SNR).

Optimization algorithms and evolutionary computing methods, such as Particle Swarm Optimization (PSO), the Genetic Algorithm (GA), Differential Evolution (DE), and Genetic Programming (GP), have recently proved their significance [21,22,23,24,25,26,27]. These methods act as powerful global optimizers that avoid being trapped in local minima. In addition, they can be hybridized to provide reliable and effective optimized solutions.

Using an antenna array for DOAE from the received signal is a critical topic in sonar, radar, communication systems, and microphone array systems. Traditional DOAE techniques, including ML, MUSIC, root-MUSIC, and ESPRIT, have been used. Recently, ML estimation using particle swarm optimization (PSO) [28], a genetic algorithm (GA)-based technique [29], and an evolutionary programming (EP)-based method [30] have been developed. Choi [31] implemented a new DOAE scheme using PSO-based SPECC (Signal Parameter Extraction via Component Cancellation), where the optimization method supports the extraction of the amplitudes and incident angles of the signal sources that impinge on the sensor array. Sheikh et al. [32] employed a differential evolution algorithm for range and DOA estimation of near-field narrowband sources that impinge on a uniform linear array (ULA). During the optimization steps, the mean square error (MSE) is used as the fitness function, and the results of DE are compared with those of the Genetic Algorithm (GA).
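
As a simple illustration of how an evolutionary optimizer can replace an exhaustive grid search in DOAE, the sketch below uses SciPy's differential evolution to maximize the conventional spatial spectrum of Eq. (3.9); this is not the algorithm of [31] or [32], and it reuses steering_vector(), R_hat, and n from the earlier sketches.

```python
import numpy as np
from scipy.optimize import differential_evolution

def negative_spectrum(theta, R_hat, n):
    """Negative delay-and-sum spectrum (Eq. (3.9)); minimized by the optimizer."""
    a = steering_vector(float(theta[0]), n)
    return -np.real(a.conj() @ R_hat @ a) / n**2

result = differential_evolution(negative_spectrum,
                                bounds=[(-90.0, 90.0)],   # DOA search interval (degrees)
                                args=(R_hat, n),
                                seed=0, tol=1e-8)
theta_de = result.x[0]                                    # DOA estimate from DE
```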

3.3 Time of Arrival Estimation Techniques

Speaker localization is concerned with locating the speaker position in a given space according to the sound signals received by the MA. This process supports several real-world applications, including speech recognition, video conferencing [33], speech acquisition [34], hands-free voice communication [35], and acoustic surveillance devices that require high-quality captured speech from the speakers [36]. The acoustic environment is degraded by background/additive noise and by distortion due to reverberation of the speech signal. To overcome this problem, the speech is recorded using a set of microphones, which requires localization and tracking of the moving speaker. Once the real speaker position is known, the MA can be electronically steered for high-quality speech acquisition. Moreover, speaker localization is vital in the multi-speaker scenario. The time difference of arrival (TDOA) localization scheme computes the time-delay estimate between each microphone pair and the source using several techniques, including the generalized cross-correlation with maximum likelihood weighting (GCC-ML), the generalized cross-correlation with phase transform (GCC-PHAT), the Hilbert envelope of the LP residual, and the linear prediction (LP) residual [37, 38].

TDOA is considered the superior technique for computing the time-delay estimate between each pair of microphones and the source. It is essential to obtain a decent time-delay estimate even when the signals are corrupted by reverberation and noise. For time-delay estimation (TDE), the spectral features of the speech signals are processed, and these features are affected by the degradations due to noise and reverberation [39, 40].
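
A minimal, self-contained sketch of GCC-PHAT time-delay estimation for one microphone pair is given below; the FFT length, the interpolation-free peak picking, and the sign convention (which depends on channel ordering) are simplifying assumptions.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the relative delay between two microphone channels with GCC-PHAT weighting."""
    nfft = 1 << (len(x1) + len(x2) - 1).bit_length()      # FFT length (next power of two)
    X1, X2 = np.fft.rfft(x1, nfft), np.fft.rfft(x2, nfft)
    G = X1 * np.conj(X2)
    G /= np.abs(G) + 1e-12                                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(G, nfft)
    max_shift = nfft // 2 if max_tau is None else min(int(fs * max_tau), nfft // 2)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                                      # delay in seconds

# Illustrative usage for a pair of microphones spaced d metres apart (speed of sound ~343 m/s):
# tau_hat = gcc_phat(mic1, mic2, fs, max_tau=d / 343.0)
```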