An information based feedback control for audio-motor binaural localization

Bustamante, Gabriel; Danès, Patrick; Forgue, Thomas; Podlubne, Ariel; Manhès, Jérôme

doi:10.1007/s10514-017-9639-8

Published: 15 June 2017

Volume 42, pages 477–490, (2018)
Autonomous Robots

Gabriel Bustamante¹,
Patrick Danès ORCID: orcid.org/0000-0002-3697-8583¹,
Thomas Forgue¹,
Ariel Podlubne¹ &
…
Jérôme Manhès¹


Abstract

In static scenarios, binaural sound localization is fundamentally limited by front-back ambiguity and distance non-observability. Over the past few years, “active” schemes have been shown to overcome these shortcomings, by combining spatial binaural cues with the motor commands of the sensor. In this context, given a Gaussian prior on the relative position to a source, this paper determines an admissible motion of a binaural head which leads, on average, to the one-step-ahead most informative audio-motor localization. To this aim, a constrained optimization problem is set up, which consists in maximizing the entropy of the next predicted measurement probability density function over a cylindric admissible set. The method is appraised through geometrical arguments, and validated in simulations and on real-life robotic experiments.


1 Introduction

The advent of auditory robots has led to the emergence of binaural audio-motor localization schemes which, by combining binaural perception and motor commands, can disambiguate front from back and recover source range (Cooke et al. 2007; Nakadai et al. 2000). Some of these can cope with a moving and intermittent sound source (Portello et al. 2012). However, the question remains how to drive a binaural head so as to maximize the spatial information on a source extracted from the sensorimotor flow.

In Robotics, Simultaneous Localization and Mapping (SLAM) techniques have been extended to make robots move in order to improve their knowledge about the environment (Thrun et al. 2005). Control policies could be found by maximizing information criteria related to the robot situation, e.g., by determining the direction of maximum local information improvement. Shannon entropy or mutual information have often been used (Bourgault et al. 2002), as well as the Fisher information matrix (FIM) (Feder et al. 1999). It has been shown that a mapping robot guided by a mutual information based controller can be “attracted” towards unexplored areas (Julian 2013). Similar strategies have been used to coordinate multiple sensor platforms (Grocholsky et al. 2003). Information-theoretic controllers can address different objectives such as the control of a robot-mounted camera to optimize depth estimation (Forster et al. 2014), or the selection of sensor parameters (e.g., zoom or attitude) for scene analysis (Denzler and Brown 2002; Sommerlade and Reid 2008).

In the bearings-only tracking problem, optimum observer actions can be determined by maximizing a cost functional involving FIM determinants (Cadre and Laurent-Michel 1999). When the problem is the reduction of the mean square tracking error, the minimization of the posterior Cramér-Rao lower bound—i.e., the inverse of the Bayesian extension of the FIM—has been addressed (Ristic and Arulampalam 2003).

In robot audition, the problem of auditory scenes exploration has also been investigated (Martinson and Schultz 2009). A mobile robot has been equipped with a microphone array to localize sound sources and estimate its own position in a known geometric map (Sasaki et al. 2010). Motion planning based on audio situation has been proposed to improve speech recognition by a monaural robot (Kumon et al. 2010). In Martinson et al. (2011), sound source localization was improved by optimizing the position of microphones deployed in the environment. Recently, a robot equipped with a microphone array was controlled to locate a sound source by minimizing a criterion based on the entropy of an occupancy grid used to represent the source position belief (Vincent et al. 2015).

Given a prior knowledge on the relative position of a static sound source with respect to a binaural head, this paper deals with the determination of an admissible finite motion of the sensor which leads, on average, to the minimum uncertainty in the one-step-ahead localization. It is organized as follows. First, the three-stage approach to binaural active localization (Bustamante et al. 2015) which has motivated this work is recalled (Sect. 2). Then, a constrained optimization problem is defined, so as to get the next best position of the sensor (Sect. 3). A numerical solution scheme is proposed. Further, useful insights into the geometry of the problem are provided (Sect. 4), when the exploration is guided by directional cues such as the interaural time difference (ITD) between two microphones placed antipodally on a spherical binaural head. Evaluations are then conducted in simulation and on a binaural robotic platform (Sect. 5). Therein, a comparison is made with some open-loop motion policies. Conclusions and prospects end the paper.

2 A three-stage framework to active binaural localization

This work took place within the EU FET Two!Ears project (www.twoears.eu) whose aim was to develop a computational model of auditory perception and experience in humans. Listeners are regarded as multi-modal agents that develop their concept of the world by active, exploratory, interaction, and, in the course of this process, interpret percepts, collect knowledge and develop concepts accordingly. To enable this, the Two!Ears model includes not only bottom-up—signal-driven—processing but also top-down—hypothesis-driven—feedbacks. Some of these feedbacks come from the cognitive level, e.g., the context-dependent adjustment of bottom-up processing parameters, or the hypothesis-driven activation of specific low-level processing procedures. Other feedbacks operate at the sensorimotor level—with no cognition in between—at much shorter time scales, e.g., “turn-to-reflex” exploratory movements to dispel localization ambiguities.

Such sensorimotor feedbacks for single-source active binaural localization can be addressed through the three-stage framework depicted in Fig. 1. Stage A implements the maximum likelihood estimation of the source azimuth and the information-theoretic detection of its activity from the short-term channel-time-frequency decomposition of the binaural stream (Portello et al. 2013). Stage B assimilates these azimuths over time and combines them with the motor commands into a stochastic filter, leading to the posterior probability density function (or “belief”) of the head-to-source relative position (Portello et al. 2014b). Stage C is the topic of this paper. It consists in a feedback controller which, on the basis of the output from Stage B, can move the head so as to improve the quality of the localization. Stage A has been extended to the multiple-source case (Portello et al. 2014a), and Stage B can cope with a moving and/or intermittent source (Portello et al. 2012), but this is not considered here.

First an overview of Stages A and B is proposed. Then, the paper focuses on Stage C.

2.1 Terminology

A binaural head is fitted with the left and right microphones $R_1$ and $R_2$. A frame $\mathcal {F} = (O,\overrightarrow{x_R},\overrightarrow{y_R},\overrightarrow{z_R})$ is attached to it, with ${\overrightarrow{R_1O} = \overrightarrow{OR_2}}$ (Fig. 2). $R_1,R_2$ and the pointwise emitter E lie on a common horizontal plane defined by $(O,\overrightarrow{y_R},\overrightarrow{z_R})$, where $\overrightarrow{y_R} = \frac{\overrightarrow{R_2R_1}}{\Vert \overrightarrow{R_2R_1}\Vert }$ supports the interaural axis and $\overrightarrow{z_R}$ is oriented towards the front direction. So, $\overrightarrow{x_R}$ is vertical and points downwards. The radius of a sphere approximating the head is denoted by a.

Throughout the paper, geometric vectors are denoted with arrows. Scalar, vector or matrix variables are written in normal font. Whether they are deterministic or stochastic can be inferred directly from the context.

2.2 Stage A: short-term extraction of directional cues

The interaural transfer function is assumed known over an adequate range of source azimuths and frequencies. The source signal and sensor noises are modeled as jointly Gaussian, zero-mean, individually and jointly “locally stationary” random processes (Portello et al. 2013). Then, on the basis of the channel-time-frequency decomposition $z_k$ of the binaural signal on a sliding window ending at time k, the short-term maximum likelihood estimate $\hat{\theta }_k$ of the source azimuth $\theta _k$ comes as the argmax of a “pseudo likelihood” $p(z_k|\theta _k)$. This pseudo likelihood is obtained from the genuine likelihood of the unknown variables by replacing the spectral parameters of the source with their most likely values, expressed as a function of the azimuth, by means of a notable separation property.

2.3 Stage B: combination with motor commands

A discrete-time stochastic state space equation is set up, relating the motor commands to the head-to-source position ${x_k = (e_y,e_z)^T}$ to be estimated (Fig. 2). A theoretically sound Gaussian mixture square-root unscented Kalman filter (GMsrUKF) is defined so as to incorporate the above pseudo likelihood $p(z_k|\theta _k)$, where $\theta _k$ comes as a static function of $x_k$, and compute a Gaussian mixture approximation of the posterior probability density function (pdf), or “belief”,

$$\begin{aligned} p(x_k|{z}_{1:k}) = \sum _{i=1}^{I_k}w_k^i{\mathcal {N}}\left( x_k;{\hat{x}}_{k|k}^i,P_{k|k}^i\right) , \end{aligned}$$

(1)

where $(w_k^i,{\hat{x}}_{k|k}^i,P_{k|k}^i)$ are the weight, mean and covariance of each hypothesis (Portello et al. 2014b). Empirical tests show that self-initialization as well as posterior covariance consistency are generally ensured, so that front and back are disambiguated, and both range and azimuth are faithfully recovered.

2.4 Stage C: problem statement

Let $\mathcal {F}_k = (O_k,\overrightarrow{x_R}_k,\overrightarrow{y_R}_k,\overrightarrow{z_R}_k)$ and ${X}_k = ({e_x}, {e_y},{e_z})^T$ be the frame $\mathcal {F}$ at time k and the Cartesian coordinates of the—static—source in $\mathcal {F}_k$. If between times k and $k+1$ the sensor undergoes the translation $T_y \overrightarrow{y_R}_k + T_z \overrightarrow{z_R}_k$ followed by the rotation of angle $\phi \triangleq \widehat{(\overrightarrow{z_{R}}_k,\overrightarrow{z_{R}}_{k+1})}$ around $\overrightarrow{x_R}_k$, then the vector ${X}_{k+1}$ of the source coordinates in $\mathcal {F}_{k+1} = (O_{k+1},\overrightarrow{x_R}_{k+1},\overrightarrow{y_R}_{k+1},\overrightarrow{z_R}_{k+1})$ writes as

$$\begin{aligned} {{X}_{k+1}} = R^T(\phi ){{X}_k} - R^T(\phi ){{T}} + w_k, \end{aligned}$$

(2)

with $T = (0,T_y,T_z)^T$, $R(\phi )$ the rotation matrix corresponding to $\phi $, and $w_k$ the dynamic noise (if present).
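The noise-free part of (2) can be sketched in a few lines of Python, restricted to the planar coordinates $(e_y,e_z)$ since the translation has no component along $\overrightarrow{x_R}$ and the rotation is about that axis. The function names and the sign convention chosen for the planar block of $R(\phi )$ are assumptions made for illustration only; the source does not spell out the entries of $R(\phi )$.

```python
import math

def rotation_yz(phi):
    """Planar (e_y, e_z) block of R(phi). The sign convention is an assumption."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s],
            [s, c]]

def predict_source_position(x_k, translation, phi):
    """Noise-free planar version of Eq. (2): x_{k+1} = R^T(phi) (x_k - T)."""
    d_y = x_k[0] - translation[0]
    d_z = x_k[1] - translation[1]
    r = rotation_yz(phi)
    # Multiply by the transpose of r (rows and columns swapped).
    e_y = r[0][0] * d_y + r[1][0] * d_z
    e_z = r[0][1] * d_y + r[1][1] * d_z
    return (e_y, e_z)
```

A pure rotation leaves the head-to-source distance unchanged, and translating the head onto the source brings the relative position to the origin, which gives two quick sanity checks on the sketch.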

A belief on the sensor-to-source position $x_k = ({e_y},{e_z})^T$ at time k is given in terms of the 2D Gaussian pdf ${\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$, with ${\hat{x}}_{k|k}$ the estimate of $x_k$ and $P_{k|k}$ the associated error covariance matrix. The problem consists in determining the motion $({T},\phi )$ of the sensor which best improves, on average, the next localization of the sound source. First, a metric is described which relates the belief on the state at time k and the rigid motion applied over $[k;k+1]$ to the expected information obtained after a measurement update at time $k+1$. The exploration is assumed to be guided by a scalar closed-form observation model such as

$$\begin{aligned} z_{k} = l(x_{k}) + v_{k} = \bar{l}(\theta _k)+v_k,\ z_k \in \mathbb {R},\ v_k \sim {\mathcal {N}}(0,R_k), \end{aligned}$$

(3)

with $v_k$ the measurement noise and $R_k$ its (co)variance. In the above, $z_k$ is assumed to be a directional cue, in that it solely depends on the source relative azimuth ${\theta _k = -\mathrm {atan2}(e_y,e_z)}$. Note that by convention, $\theta _k$ is 0 along $\overrightarrow{z_R}$ and increases clockwise. Assuming a farfield sound source, $\bar{l}(\theta _k)$ in (3) could express the time delay of arrival (TDOA) between two microphones in free field. Unless otherwise stated, in the sequel $\bar{l}(\theta _k)$ stands for the Woodworth-Schlosberg farfield approximation of the ITD between two antipodal microphones placed on a spherical head. That is, (Aaronson and Hartmann 2014)

$$\begin{aligned} \bar{l}\left( \theta _k\right)&= \frac{a}{c}\left( \theta _k+\sin \left( \theta _k\right) \right) \ \text {for}\, {|\theta _k|\in \left[ 0,\frac{\pi }{2}\right] },\nonumber \\ \bar{l}\left( \theta _k\right)&= \frac{a}{c}\left( \pi -\theta _k+\sin \left( \theta _k\right) \right) \ \text {for}\, {\theta _k \in \left[ \frac{\pi }{2},\pi \right] },\nonumber \\ \bar{l}\left( \theta _k\right)&= \frac{a}{c}\left( - \pi -\theta _k+\sin \left( \theta _k\right) \right) \ \text {for}\, {\theta _k\in \left[ -\pi , -\frac{\pi }{2}\right] }, \end{aligned}$$

(4)

with c the velocity of sound.
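As an illustration, the azimuth convention and the piecewise model (4) can be written as the short Python sketch below. The default head radius `a = 0.09` m and sound velocity `c = 343.0` m/s are placeholder values assumed here, not values taken from the paper, and the function names are hypothetical.

```python
import math

def azimuth(e_y, e_z):
    """Relative azimuth theta = -atan2(e_y, e_z): zero along z_R."""
    return -math.atan2(e_y, e_z)

def woodworth_itd(theta, a=0.09, c=343.0):
    """Woodworth-Schlosberg farfield ITD of Eq. (4), theta in [-pi, pi].

    a (head radius, m) and c (sound velocity, m/s) are assumed values.
    """
    scale = a / c
    if abs(theta) <= math.pi / 2:
        return scale * (theta + math.sin(theta))
    if theta > 0:
        return scale * (math.pi - theta + math.sin(theta))
    return scale * (-math.pi - theta + math.sin(theta))
```

With this model, the azimuths $\theta $ and $\pi -\theta $ yield the same ITD for $\theta \in [0,\frac{\pi }{2}]$, which is the front-back ambiguity that a static binaural sensor cannot resolve.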

3 Feedback control of the binaural sensor

On the basis of Stages A and B introduced above, the main topic of the paper is now addressed, namely, the development of Stage C. The information based feedback control is first stated, then turned into a constrained optimization problem. A geometric interpretation is discussed. A numerical solution is obtained by means of the projected gradient algorithm.

3.1 Information-theoretic constrained optimization

Let x, y be continuous random variables with joint and marginal pdfs p(x, y) and p(x), p(y). The differential entropy

$$\begin{aligned} \textstyle h({x}) = -\int p(x) \log p(x) \mathrm {d}x \end{aligned}$$

(5)

and the mutual information (nonnegative by definition)

$$\begin{aligned} \textstyle I({{x}},{{y}}) = \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\mathrm {d}x\mathrm {d}y \end{aligned}$$

(6)

respectively embody the uncertainty in x and measure the amount of information that x contains about y (Cover and Thomas 1991).

When conditioned on the event that a random variable z takes a given value, they will henceforth be denoted by h(x|z), h(y|z) and I(x, y|z). The Bayes rule underlying the measurement update stage relates the next filtered state pdf $p(x_{k+1}|z_{1:k+1})$, the next predicted state pdf $p(x_{k+1}|z_{1:k})$, the observation model $p(z_{k+1}|x_{k+1})$ and the next predicted measurement pdf $p(z_{k+1}|z_{1:k})$. Consequently, entropies and mutual information of these distributions can be connected with an entropy update rule of the same kind as in Manyika (1993). The expectation

$$\begin{aligned} \int -\log p\left( x_{k+1}|z_{1:k+1}\right) \, p\left( x_{k+1},z_{k+1}|z_{1:k}\right) \mathrm {d}x_{k+1}\mathrm {d}z_{k+1}, \end{aligned}$$

(7)

of $-\log p(x_{k+1}|z_{1:k+1})$ conditioned on $z_{1:k}$, which is also equal to $\mathsf {E}_{z_{k+1}|z_{1:k}}{}\bigl \{h({x}_{k+1}|{z}_{1:k+1})\bigr \}$, satisfies (Bustamante et al. 2016)

$$\begin{aligned} \mathsf {E}_{z_{k+1}|z_{1:k}} \{h\left( {x}_{k+1}|{z}_{1:k+1}\right) \}&= h\left( x_{k+1}|z_{1:k}\right) - I,\nonumber \\ \mathsf {E}_{x_{k+1}|z_{1:k}} \{h\left( {z}_{k+1}|{x}_{k+1}\right) \}&= h\left( z_{k+1}|z_{1:k}\right) - I,\\ I&=I(x_{k+1},z_{k+1}|z_{1:k}),\nonumber \end{aligned}$$

(8)

with I the conditional mutual information of the next state and measurement. Due to the nonnegativity of I, $\mathsf {E}_{z_{k+1}|z_{1:k}}{}\{h({x}_{k+1}|{z}_{1:k+1})\} \le h(x_{k+1}|z_{1:k})$ holds, which highlights the information gain brought by the measurement update.

Between times k and $k+1$, (linear) Kalman time update equations turn the Gaussian belief $p(x_k|z_{1:k}) = {\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$ into the next predicted state pdf $p(x_{k+1}|z_{1:k}) = {\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k},P_{k+1|k})$. Then, (nonlinear) Kalman measurement update equations incorporate the measurement $z_{k+1}$ so as to compute a Gaussian approximation $p(x_{k+1}|z_{1:k+1}) \approx {\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k+1},P_{k+1|k+1})$ of the next filtered state pdf. They involve a Gaussian approximation $p(z_{k+1}|z_{1:k}) \approx {\mathcal {N}}(z_{k+1};{\hat{z}}_{k+1|k},S_{k+1|k})$ of the next predicted measurement pdf. Let |.| denote the determinant of a matrix. If $w_k$ is negligible in (2), then $h(x_{k+1}|z_{1:k}) = \frac{1}{2}\log [(2\pi e)^{n_x} |P_{k+1|k}|]$, with ${n_x = 2}$, is also equal to $\frac{1}{2}\log [(2\pi e)^{n_x} |P_{k|k}|]$, because the sensor undergoes a rigid motion (Footnote 1). It therefore does not depend on the control variables $(T,\phi )$. In addition, $h(z_{k+1}|x_{k+1}) = \frac{1}{2}\log [(2\pi e)^{n_z} |R_{k+1}|]$, with ${n_z=1}$, is also independent of $(T,\phi )$. Besides, neither $h(x_{k+1}|z_{1:k+1}) = \frac{1}{2}\log [(2\pi e)^{n_x} |P_{k+1|k+1}|]$ nor $h(z_{k+1}|z_{1:k}) = \frac{1}{2}\log [(2\pi e)^{n_z} |S_{k+1|k}|]$ depends on the measurement $z_{k+1}$. Consequently, the following rule is in effect.

Theorem 1

Finding the next best sensor position which minimizes the entropy $h(x_{k+1}|z_{1:k+1})$ of the next filtered state pdf—which is also its expected value w.r.t. $z_{k+1}$—is equivalent to maximizing the mutual information $I(x_{k+1},z_{k+1}|z_{1:k})$ of the next predicted state and measurement, or to maximizing the entropy $h(z_{k+1}|z_{1:k})$ of the next predicted measurement pdf. In other words, the optimum rigid body motion $({T}^*,\phi ^*)$ to be applied to the sensor is the solution of

$$\begin{aligned} (\mathcal {P}) {\left\{ \begin{array}{ll} \begin{aligned} ({T}^*,\phi ^*) &{}= \arg \min _{({T},\phi ) \in \mathcal {T}\times \mathcal {R}} h\left( {x}_{k+1}|{z}_{1:k+1}\right) \\ &{}= \arg \max _{({T},\phi ) \in \mathcal {T}\times \mathcal {R}} I(x_{k+1},z_{k+1}|z_{1:k})\\ &{}= \arg \max _{({T},\phi ) \in \mathcal {T}\times \mathcal {R}} h\left( {z}_{k+1}|{z}_{1:k}\right) , \end{aligned} \end{array}\right. } \end{aligned}$$

(9)

where $\mathcal {T}$ and $\mathcal {R}$ respectively term the sets of admissible translations and rotations.
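The equivalence of the three criteria in (9) can be made explicit by combining (8) with the Gaussian entropies given before Theorem 1. The following worked step is only a sketch: it assumes that the Gaussian approximations of the predicted and filtered pdfs hold and that $w_k$ is negligible in (2).

$$\begin{aligned} I(x_{k+1},z_{k+1}|z_{1:k})&= h\left( z_{k+1}|z_{1:k}\right) - h\left( z_{k+1}|x_{k+1}\right) = \frac{1}{2}\log \frac{S_{k+1|k}}{R_{k+1}},\\ h\left( x_{k+1}|z_{1:k+1}\right)&= h\left( x_{k+1}|z_{1:k}\right) - I = \frac{1}{2}\log \left[ (2\pi e)^{2} |P_{k|k}|\right] - \frac{1}{2}\log \frac{S_{k+1|k}}{R_{k+1}}. \end{aligned}$$

Since $|P_{k|k}|$ and $R_{k+1}$ do not depend on $(T,\phi )$, the motion acts on all three quantities only through $S_{k+1|k}$: increasing $S_{k+1|k}$ increases the mutual information and decreases the entropy of the next filtered state pdf by the same amount.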

3.2 Interpretation

A fundamental feature of the observation model (3) is that the spread of the measurement noise, i.e., the (co)variance $R_k$, is assumed known and constant, independently of the value of the hidden state vector. If the exploration is guided by TDOAs/ITDs, then this assumption is valid since the standard deviation of the noise associated with their extraction is typically a fraction of the audio sampling period.

From (3), the loci of the sensor-to-source positions x corresponding to given values of the measurement z in the absence of noise—or “iso-z loci”—are radial lines rigidly linked to frame $\mathcal {F}$ and passing through O. For TDOA/ITD measurements, because of the nonlinearity of the measurement equation—see for instance (4)—these lines are not uniformly distributed along the azimuths. They are more concentrated along the direction of $\overrightarrow{z_R}$ which defines the auditive fovea, while they are sparser around the interaural axis $\overrightarrow{y_R}$. Given a belief ${\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$ on the head-to-source position at time k, Fig. 3 sketches the 2D Gaussian approximation of the next filtered state pdf ${\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k+1},P_{k+1|k+1})$ after applying various rigid motions $({T},\phi )$ to the sensor. All the involved normal distributions are depicted by related 99%-probability confidence ellipses. Importantly, if the dynamic noise is neglected in (2), then the next predicted state pdf ${\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k},P_{k+1|k})$ is basically described by the same ellipse as for the initial belief, but “viewed” from the sensor once it has completed its motion. Besides, (3) implies that the pdf of the head-to-source position deduced from the sole measurement $z_{k+1}$ can be described by a 99%-probability confidence cone tapering to the apex $O_{k+1}$. For a given variance $R_k$ of the measurement noise, the extent of this cone on each side of the iso-z locus corresponding to the genuine azimuth of the source is all the more important as the iso-z loci are sparse. The measurement update fuses these two last pdfs so as to get the next belief ${\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k+1},P_{k+1|k+1})$. Qualitatively, the fusion is all the more efficient as the overlap of the respective confidence ellipse and cone occurs around the modes of the pdfs and has a limited spatial extent.

From the initial configuration depicted in Fig. 3a, the head first undergoes a pure rotational motion so that the auditive fovea (supported by $\overrightarrow{z_R}_{k+1}$) becomes oriented towards the major axis of the confidence ellipse associated with the next predicted state pdf (Fig. 3b). In Fig. 3c, a translation is applied so as to drive $O_{k+1}$ onto the line supported by the minor axis of that ellipse, and a subsequent rotation makes $\overrightarrow{y_R}_{k+1}$ point towards its center. Last, in Fig. 3d, the auditive fovea $\overrightarrow{z_R}_{k+1}$ is driven towards the minor axis of that ellipse.

Equation (9) in Theorem 1 states that the next best sensor position must maximize the (determinant of the) (co)variance $S_{k+1|k}$ of the next predicted measurement pdf $p(z_{k+1}|z_{1:k})$. Here, the scalar value of $S_{k+1|k}$ is heuristically related to the number of iso-$z_{k+1}$ loci intersecting the confidence ellipse associated with the next predicted state pdf. The more iso-$z_{k+1}$ loci intersect that ellipse, the higher $S_{k+1|k}$ is.

As aforementioned, the confidence cone describing the spatial uncertainty on the head-to-source position due to the noisy measurement is wide if the source lies along the interaural axis (Fig. 3c). In this case, a small number of iso-z loci intersect the confidence ellipse associated to the predicted state pdf, so that the measurement update cannot significantly improve the information in the next filtered state pdf. When the auditive fovea is oriented towards the confidence ellipse, the confidence cone is narrower, so the measurement update is more efficient (Fig. 3b, d). The variance $S_{k+1|k}$ is also higher than in the above case. Further, if the fovea points to the minor axis of the confidence ellipse, then the measurement update is improved (Fig. 3d).

Importantly, the closer the sensor gets to the source, the smaller the spatial uncertainty on the head-to-source position given a TDOA/ITD measurement. Then, a greater number of iso-z loci cross the confidence ellipse associated with the predicted state pdf, so that the predicted measurement variance $S_{k+1|k}$ increases, which is beneficial.

3.3 Numerical solution

In view of the above, starting from the head-to-source position belief ${\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$ at time k, the desired optimum finite translations $T_y^*,T_z^*$ and rotation $\phi ^*$ maximize the log-determinant (as $z \in \mathbb {R}$, just the log) of the (co)variance of the next predicted measurement pdf $p(z_{k+1}|z_{1:k})$, i.e., maximize ${F}_{k}(T_y,T_z,\phi ) = \log S_{k+1|k}$ with $F_k:\mathbb {R}^3 \rightarrow \mathbb {R}$. Then, the optimization problem $(\mathcal {P})$ defined in (9) can be stated as

$$\begin{aligned} (\mathcal {P}) {\left\{ \begin{array}{ll} \displaystyle (T_y^*,T_z^*,\phi ^*) = \arg \max _{(T_y,T_z,\phi ) \in \mathcal {T}\times \mathcal {R}} F_k (T_y,T_z,\phi ) \end{array}\right. } \end{aligned}$$

(10)

with $\mathcal {T} = \{ (T_y,T_z)\in \mathbb {R}\times \mathbb {R} \left| \right. {T_y}^2 + {T_z}^2 \le r^2_{max} \}$ and $\mathcal {R} = \{ \phi \in \mathbb {R} \left| \right. |\phi | \le \phi _{max} \}$ the sets of admissible translations and rotations. $\mathcal {T}\times \mathcal {R}$ thus constitutes a cylinder volume (Fig. 4a). The height of the cylinder represents the admissible rotations while horizontal sections stand for the feasible translations given a fixed rotation.
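To make the criterion concrete, the sketch below evaluates $F_k(T_y,T_z,\phi ) = \log S_{k+1|k}$ for a candidate motion by propagating sigma points of the belief through the noise-free motion (2) and the ITD model (4). It is an illustration under several assumptions that are not taken from the paper: a basic unscented transform with a single parameter `kappa`, placeholder values for the head radius and the sound velocity, an assumed sign convention for the planar rotation, and hypothetical function names.

```python
import math

def woodworth_itd(theta, a=0.09, c=343.0):
    """ITD model of Eq. (4); a (m) and c (m/s) are assumed values."""
    scale = a / c
    if abs(theta) <= math.pi / 2:
        return scale * (theta + math.sin(theta))
    if theta > 0:
        return scale * (math.pi - theta + math.sin(theta))
    return scale * (-math.pi - theta + math.sin(theta))

def log_predicted_measurement_variance(x_hat, P, motion, R, kappa=1.0):
    """Approximate F_k = log S_{k+1|k} for motion = (T_y, T_z, phi)."""
    t_y, t_z, phi = motion
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    # Lower Cholesky factor of the 2x2 covariance, P = L L^T.
    l11 = math.sqrt(P[0][0])
    l21 = P[1][0] / l11
    l22 = math.sqrt(P[1][1] - l21 * l21)
    n = 2
    spread = math.sqrt(n + kappa)
    # Sigma-point offsets: the mean, then plus/minus each column of L.
    offsets = [(0.0, 0.0),
               (spread * l11, spread * l21), (-spread * l11, -spread * l21),
               (0.0, spread * l22), (0.0, -spread * l22)]
    weights = [kappa / (n + kappa)] + [0.5 / (n + kappa)] * 4
    z = []
    for d_y, d_z in offsets:
        # Sigma point in frame F_k, then rigid motion (2) without noise.
        p_y = x_hat[0] + d_y - t_y
        p_z = x_hat[1] + d_z - t_z
        e_y = cos_p * p_y + sin_p * p_z
        e_z = -sin_p * p_y + cos_p * p_z
        z.append(woodworth_itd(-math.atan2(e_y, e_z)))
    z_mean = sum(w * zi for w, zi in zip(weights, z))
    S = sum(w * (zi - z_mean) ** 2 for w, zi in zip(weights, z)) + R
    return math.log(S)
```

For the belief mean $(1,1.5)^T$ used in Sect. 4.2 with an assumed circular covariance, a small translation towards the source yields a larger value of the criterion than no motion, in line with the discussion of Sect. 3.2.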

Though $F_k$ has no closed form, an approximation of its gradient around a defined translation and rotation ${U} = (T_y,T_z,\phi )^T$ can be derived by means of successive first order Taylor expansions and the Unscented transform (Julier and Uhlmann 2004). This approximation writes as

$$\begin{aligned} F_k\left( {U} + {du}\right) = F_k\left( {U}\right) + {\nabla F_k}\left( {U}\right) ^{T} {du}, \end{aligned}$$

(11)

with ${du} = (dT_{y},dT_{z},d \phi )^{T}$ the infinitesimal motion vector applied around U and ${\nabla F_k}({{U}})$ the gradient of $F_k$ evaluated at U, which points to the direction of steepest ascent of $F_k$ around U. A derivation of ${\nabla F_k}({{U}})$ is proposed in “Appendix 1”.

The projected gradient algorithm is then used to solve $(\mathcal {P})$ numerically. At each iteration, the decision variable $U = (T_y,T_z,\phi )^T$ is first updated by a conventional gradient ascent step, and the result is then projected onto the closed convex set $\mathcal {T}\times \mathcal {R}$ by means of the projection operator $\pi _{\mathcal {T}\times \mathcal {R}}(.)$ defined as

$$\begin{aligned} \pi _{\mathcal {T}\times \mathcal {R}}({U})&\triangleq \arg \min _x \left\{ ||{U}-x||_2 , x \in (\mathcal {T}\times \mathcal {R}) \right\} . \end{aligned}$$

(12)

This leads to Algorithm 1.
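A minimal Python sketch of a projected gradient ascent over the cylinder $\mathcal {T}\times \mathcal {R}$ is given below. It is not a transcription of Algorithm 1: the fixed step size, the iteration count, the central finite-difference gradient (used here in place of the gradient derived in “Appendix 1”) and all function names are assumptions made for illustration. Because the admissible set is a product of a disk and an interval, the Euclidean projection (12) separates into a radial rescaling of $(T_y,T_z)$ and a clipping of $\phi $.

```python
import math

def project_onto_cylinder(u, r_max, phi_max):
    """Euclidean projection of u = (T_y, T_z, phi) onto the set T x R."""
    t_y, t_z, phi = u
    norm = math.hypot(t_y, t_z)
    if norm > r_max:
        # Pull the translation back onto the disk of radius r_max.
        t_y, t_z = t_y * r_max / norm, t_z * r_max / norm
    # Clip the rotation to [-phi_max, phi_max].
    phi = max(-phi_max, min(phi_max, phi))
    return (t_y, t_z, phi)

def numerical_gradient(f, u, eps=1e-6):
    """Central finite differences; a stand-in for the analytic gradient."""
    grad = []
    for i in range(3):
        up, um = list(u), list(u)
        up[i] += eps
        um[i] -= eps
        grad.append((f(up) - f(um)) / (2.0 * eps))
    return grad

def projected_gradient_ascent(f, u0, r_max, phi_max, step=0.1, iters=200):
    """Gradient ascent step followed by projection, repeated iters times."""
    u = project_onto_cylinder(u0, r_max, phi_max)
    for _ in range(iters):
        g = numerical_gradient(f, u)
        candidate = tuple(ui + step * gi for ui, gi in zip(u, g))
        u = project_onto_cylinder(candidate, r_max, phi_max)
    return u
```

In practice `f` would be an evaluation of $F_k$, and the step size would need tuning since the translation and rotation entries of U have different units. When the unconstrained maximizer lies outside the cylinder, the iterates settle on its boundary, which matches the observation made in Sect. 4.1.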

4 Geometrical insights

In this section, the geometry of the maximization problem ($\mathcal {P}$) is depicted.

4.1 Overview

Given a belief ${\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$ on the sensor-to-source position at time k, it is interesting to consider the level sets of the criterion $F_k(T_y,T_z,\phi )$ w.r.t. the translation and rotation variables $T_y,T_z,\phi $. The gradient vectors of $F_k(T_y,T_z,\phi )$ are orthogonal to these surfaces and highlight the directions of steepest ascent. Restricting to horizontal sections of the admissible cylindrical set indexed by values $\phi _0$ of the rotation variable can ease the analysis. The contour lines of $F(T_y,T_z,\phi _0)$ w.r.t. $T_y,T_z$ can be observed, as well as the 2-dimensional “local” gradient vectors—which are just obtained by setting the third entry of the genuine 3-dimensional gradient vectors to 0 (Fig. 5).

For the instances of the problem $(\mathcal {P})$ considered in Sects. 4.2 and 4.3 below, the optimum solution(s) have been observed to lie on the external surface of $\mathcal {T}\times \mathcal {R}$ in all considered scenarios (this fact has not been proved analytically). So, the contour lines of the criterion $F_k$ constrained to the cylinder surface will also be displayed. To this aim, the following bivariate function is introduced

$$\begin{aligned} \tilde{F}_k(\alpha ,\phi ) = [F_k \circ g](\alpha ,\phi )\\ \text {with}\ \begin{array}[t]{@{}rcl@{}} g: \, \mathbb {R}^2 &{} \rightarrow &{} \mathbb {R}^3 \\ \Bigl ({\begin{matrix} \alpha \\ \\ \phi \end{matrix}}\Bigr ) &{} \mapsto &{} \Bigl ({\begin{matrix} r_{max} \sin (\alpha )\\ r_{max} \cos (\alpha )\\ \phi \end{matrix}}\Bigr ), \end{array}\nonumber \end{aligned}$$

(13)

where $(\alpha , \phi )$ references the position onto the cylinder surface (Fig. 4a, b).

4.2 Iso-entropy contour lines for ITD based exploration

When $\bar{l}(\theta _k)$ in (3) stands for the Woodworth-Schlosberg farfield approximation (4) of the ITD between two antipodal microphones placed on a spherical head, the iso-$z_k$ loci are similar to those depicted in Fig. 3.

The contour lines of $F_k(T_y,T_z,\phi _0)$ are plotted on Fig. 6a–c w.r.t. $T_y,T_z$ for various subsequent rotations $\phi _0$ of the head, given an initial frame $\mathcal {F}_k$ and a confidence ellipse describing the belief ${\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$, where $\hat{x}_{k|k} = (1,1.5)^T$. The set of admissible translations is also displayed, as well as the constrained local maximum on the slice of the admissible set defined by $\phi _0$.

In Fig. 6a, the sensor undergoes a pure translation followed by no rotation. The contour lines of the criterion appear to be distorted (i.e., the gradient of the criterion is subject to important local variations) whenever the translation is either ${T} = (1,.)^T$ or ${T} = (.,1.5)^T$. By referring to the intuitive arguments from Sect. 3.2, one can show that for ${T} = (1,.)^T$ (resp. ${T} = (.,1.5)^T$), the distortion is explained by the fact that $O_{k+1}$ lies on the major axis of the confidence ellipse associated with the next predicted state pdf ${\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k},P_{k+1|k})$ (resp. the interaural axis $\overrightarrow{y_R}_{k+1}$ is aligned with the minor axis of this ellipse). For each such restricted value of T, the head must get closer to the source to reach a given value of the information criterion than if a neighboring unrestricted translation were applied.

Subsequent rotations of the head by $\phi _0 = +30^\circ $ or $\phi _0 = -30^\circ $ turn Fig. 6a into Fig. 6b or Fig. 6c, respectively. The contour lines are changed, and so is the maximum restricted to the slice defined by $\phi _0$. It is more interesting to apply a rotation of $-30^\circ $ than $+30^\circ $, because the obtained optimum for $\phi _0 = -30^\circ $ lies on a contour line with higher value (and thus warmer color). Noticeably, the first distortions explained in the above paragraph for a null rotation remain, while the second ones are just rotated by $\phi _0$. Also, as the step size between the indices of two consecutive contour lines is constant, and as these contour lines are not regularly spaced, the closer the sensor gets to the source, the higher is the increase in the information criterion $F_k$.

To get some insight into the maximum value of $F_k(T_y,T_z,\phi )$ on the cylindrical surface of the admissible set, the function $\tilde{F}_k(\alpha ,\phi )$ has then been evaluated for the same initial belief. It appears that its maximum is located at $\phi ^* = -48^{\circ }$ (Fig. 6d).

In some cases, e.g., ${\hat{x}}_{k|k} = (0,1.5)^T$, the problem $(\mathcal {P})$ has several optima, see Fig. 7a, b.

4.3 Iso-entropy contour lines for azimuth based exploration

This section considers the following observation model

$$\begin{aligned} z_{k} = \theta _{k} + v_{k},\ z_k \in \mathbb {R},\ v_k \sim {\mathcal {N}}(0,R_k). \end{aligned}$$

(14)

Note that observing azimuth measurements contaminated with constant-variance noise is unrealistic in practice. Indeed, when extracting azimuth measurements from the binaural stream, the closer the sound source is to the front axis (resp. to the interaural axis), the smaller (resp. the bigger) the associated uncertainty is. Nevertheless, this case has been included because it enables a verification of some intuitive features.

The iso-z loci corresponding to equispaced values of the azimuth measurements are equiangular radial lines passing through O. The confidence cones associated with any measured azimuth then have the same width, as they are just rotated images of each other. So, given a belief on the source position evenly spread around its genuine location, the assimilation of such an azimuth measurement intuitively brings the same information whether the sensor remains static or whether it moves on a circle centered on the source, regardless of its orientation.

The analysis of the contour lines of $F_k(T_y,T_z,\phi _0)$ w.r.t. $T_y,T_z$ shows that they do not depend on the rotation $\phi _0$ (Fig. 6e, f). Consequently, the contour lines of $\tilde{F}_k(\alpha ,\phi )$ w.r.t. $\alpha ,\phi $ are vertical (Fig. 6h). Nonetheless, the contour lines are still distorted for ${T} = (1,.)^T$ in Fig. 6e, f for the same reasons as those explained in Sect. 4.2. These distortions vanish when the confidence ellipse associated with the initial belief is circular (Fig. 6g), and the contour lines become concentric. In this case, the only way to increase the gained information on the source location is to get closer to it, which is in agreement with the above intuition.
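These two observations can be checked with a first-order sketch. The linearization of (14) around the predicted mean used below is an EKF-style approximation introduced here only for illustration (the criterion in this paper is evaluated through the unscented transform), and the symbols $H$, $\rho $ and $\sigma $ are notation assumed for this example. With $\theta = -\mathrm {atan2}(e_y,e_z)$ and $(e_y,e_z)^T = {\hat{x}}_{k+1|k}$,

$$\begin{aligned} H&= \frac{\partial \theta }{\partial x} = \frac{1}{\rho ^2}\left( -e_z,\ e_y\right) ,\quad \rho ^2 = e_y^2+e_z^2,\\ S_{k+1|k}&\approx H P_{k+1|k} H^T + R_{k+1}. \end{aligned}$$

A rotation of the head by $\phi _0$ turns $P_{k+1|k}$ into $R^T(\phi _0)P_{k+1|k}R(\phi _0)$ and $H$ into $HR(\phi _0)$, so that $H P_{k+1|k} H^T$ is unchanged, which is consistent with contour lines that do not depend on $\phi _0$. For a circular confidence ellipse, $P_{k+1|k} = \sigma ^2 I$ gives $S_{k+1|k} \approx \sigma ^2/\rho ^2 + R_{k+1}$, which depends on the motion only through the distance $\rho $ to the source and thus yields concentric contour lines.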

5 Evaluation of the algorithm

The whole three-stage scheme has been implemented both on a simulated and on a real KEMAR binaural head-and-torso-simulator (HATS) from G.R.A.S.$^\circledR$ (kemar.us) endowed with omnidirectional planar motion, i.e., with two translational and one rotational degrees of freedom. This section reports the assessment of the obtained audio-motor localization, depending on whether the binaural head undergoes the active motion developed in this paper or other kinds of open-loop movements.

For the sake of simplicity, the binaural head and the robot supporting it move every $T_s = 1\text {s}$, then stop in order to acquire binaural signals, perform their short-term analysis (Stage A, Sect. 2.2) and update the belief on the source position (Stage B, Sect. 2.3). To drive the exploration, Stage C relies on the Woodworth-Schlosberg measurement equation. The next best position of the robot then comes from the solution of $(\mathcal {P})$ (Sect. 3.3).

The quality of the short-term azimuth estimation in Stage A critically affects the behavior of the whole binaural active localization. Therefore, in both simulated and live experiments, a non-intermittent white noise signal filtered by a $1\,\text {kHz}$ bandwidth band-pass filter with $1\,\text {kHz}$ central frequency has been selected for the sound source, as it endows the azimuth pseudo-likelihood with modes much sharper than with speech sources for instance (Portello et al. 2013). Various ways to cope with intermittent sources in Stages A or B have been proposed in Portello et al. (2012, 2014b), but they have not been implemented here. The movements of the binaural sensor have been limited in translation and rotation by $r \le r_{max} = 0.1\text {m}$ and $|\phi | \le \phi _{max} = 15^{\circ }$.
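
The two bounds above define a cylindric admissible set for the motion $(T_y,T_z,\phi )$: a disc of radius $r_{max}$ for the translation and an interval of half-width $\phi _{max}$ for the rotation. As a minimal sketch, one simple way to keep a candidate motion inside this set is a Euclidean projection onto it. The function name and the use of a projection are assumptions made for illustration; whether the solver of Sect. 3.3 handles the constraints this way is not stated here.

```python
import numpy as np

R_MAX = 0.1                    # maximum translation norm [m], from the text
PHI_MAX = np.deg2rad(15.0)     # maximum rotation magnitude [rad], from the text

def project_onto_admissible_set(Ty, Tz, phi, r_max=R_MAX, phi_max=PHI_MAX):
    """Euclidean projection of (Ty, Tz, phi) onto the cylinder
    {Ty^2 + Tz^2 <= r_max^2, |phi| <= phi_max} (hypothetical helper)."""
    r = np.hypot(Ty, Tz)
    if r > r_max:
        # Shrink the translation radially onto the disc boundary.
        Ty, Tz = Ty * r_max / r, Tz * r_max / r
    # Saturate the rotation independently of the translation.
    phi = float(np.clip(phi, -phi_max, phi_max))
    return Ty, Tz, phi
```

A motion that already satisfies both bounds is returned unchanged, so the helper can be applied after every update of an iterative search.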

5.1 Simulations with audio spatialization

The online rendering of realistic binaural signals caused by a static sound source has first been simulated in an anechoic environment. When the sensor moves, those binaural signals are synthesized by using a database of Head Related Impulse Responses (HRIRs) suited to the used KEMAR HATS. This database as well as a binaural simulator are publicly available at the URLs www.twoears.eu and docs.twoears.eu/en/latest/binsim/.

The sound source is initialized at the position ${X = (1,2)^T}$ in the robot frame $\mathcal {F}_0$ at time $k=0$. To simplify the notation in the legends of the next plots, this frame is denoted as $\mathcal {F}_0 = (O,\overrightarrow{x}_0,\overrightarrow{y}_0,\overrightarrow{z}_0)$.

Various motions of the sensor have been simulated: the proposed active strategy, a translation along the interaural axis, a circular movement such that the front direction of the head stays tangent to its trajectory, and a random movement (Fig. 8a). During the first five seconds of all the scenarios, the same rotational movement is applied to the sensor in order to disambiguate front and back, so that at ${t=5\,\text {s}}$ the Gaussian mixture belief can be better approximated by a single Gaussian pdf. The common progress of the audio-motor localization from initial time ${t=0\,\text {s}}$ to ${t = 5\,\text {s}}$ is displayed on Fig. 8c, d. Then, each specific movement is applied from time ${t=6\,\text {s}}$ until the end.

It can be observed that the active motion translates the sensor and rotates its fovea towards the estimated position of the sound source. By computing the Gaussian moment-matched approximation of every state belief $\sum _{i=1}^{I_k}w_k^i{\mathcal {N}}(x_k;{\hat{x}}_{k|k}^i,P_{k|k}^i)$, the entropy $h(x_{k}|z_{1:k})$ has been evaluated for the different strategies (Fig. 8b). In terms of localization efficiency, the active motion strategy clearly outperforms the random and translation open-loop movements.
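
The entropy evaluation described above can be sketched with the standard moment-matching formulas for a Gaussian mixture and the closed-form differential entropy of a Gaussian, $h = \frac{1}{2}\log \left( (2\pi e)^n |P|\right) $ (in nats). The function names and the numerical values used in the usage note are illustrative assumptions; the actual mixtures produced by the filter are not reproduced here.

```python
import numpy as np

def moment_match(weights, means, covs):
    """Mean and covariance of the single Gaussian matching the first two
    moments of the mixture sum_i w_i N(mu_i, P_i)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalise the weights
    mu = np.asarray(means, dtype=float)          # shape (I, n)
    P = np.asarray(covs, dtype=float)            # shape (I, n, n)
    mean = w @ mu
    d = mu - mean                                # spread of the component means
    cov = np.einsum('i,ijk->jk', w, P) + np.einsum('i,ij,ik->jk', w, d, d)
    return mean, cov

def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian with covariance cov."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)
```

For instance, two equally weighted hypotheses at $(1,2)^T$ and $(1,-2)^T$ with covariance $0.01\,I$ (a front-back ambiguity) yield a matched Gaussian centered at $(1,0)^T$ with covariance ${\mathrm{diag}}(0.01,4.01)$, whose entropy is much larger than that of either hypothesis alone.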

In view of the closeness of the entropies of the passive circular and active motions at each time ${t \in [6\,\text {s},9\,\text {s}]}$ and at ${t = 17\,\text {s}}$, the confidence ellipses of the respective beliefs have similar sizes. However, they may have distinct centers and/or orientations, see for instance Fig. 8e, f.

Between ${t=9\,\text {s}}$ and ${t=17\,\text {s}}$, the entropies of the posterior state pdfs obtained for the circular motion are the smallest. This does not contradict the fundamental property that, between any two consecutive times, the active strategy finds the translation and rotation of the head leading to the maximum decrease of the entropy. In fact, this is an interesting example where the sequence of N one-step-ahead optimum motions does not constitute an N-step-ahead optimum motion.

By Sect. 3.2, if at ${t=17\,\text {s}}$ the head starts from Fig. 8e—where its (blue) interaural axis is close to the confidence ellipse—and keeps moving on a circular path tangent to its front axis with no dynamic noise, then only little information can be brought by the measurement. In the simulated experiment, the entropy even increases because the information loss implied by a noisy dynamics cannot be compensated by the little information gain brought by the measurement. If at ${t=17\,\text {s}}$ the head starts from Fig. 8f—where its (red) front axis intersects the confidence ellipse—then there plausibly exists an admissible translation and rotation which can further decrease the entropy, even if little noise affects the dynamics.

5.2 Live experiments on a binaural robot

A KEMAR HATS has been mounted on a NEOBOTIX MP-L655 nonholonomic mobile robot. In order to make the head omnidirectional, i.e., to endow it with two translational and one rotational degrees of freedom, the neck of the HATS has been equipped with a homemade controllable azimuth degree of freedom (Fig. 1). Its software architecture is based on the ROS middleware. Real time components such as binaural audio stream server or three-stage active localization have been synthesized by means of the GenoM3 module generator (Mallet et al. 2010). The supervision task which manages the program, plots and saves the results, is performed by a $^\circledR $ client. The experiments have been conducted in an open-space $15\,\text {m}\times 5\,\text {m}\times 8\,\text {m}$ area delimited by dividing walls made of resin, so that reverberation effects were limited.

The results of the audio-motor localization for several motion strategies as well as the genuine position of the source measured by a real-time motion capture system—with ${\pm 0.1\,\text {mm}}$ accuracy—are displayed on Fig. 9. A translation along the interaural axis reduces the uncertainty on the distance to the source but cannot disambiguate front from back. A pure rotation (not shown on Fig. 9 due to space reasons) resolves the front-back ambiguity but cannot recover the source range. The active motion drives the head in the same way as before. The entropy of the moment-matched approximation of the state posterior pdf is reported on Fig. 10. A circular motion is also implemented, leading to a behavior quite similar to Sect. 5.1.

The whole three-stage framework runs in $5\,\text {ms}$ on an i7 quad-core laptop @$2.8\,\text {GHz}$ with $16\,\text {GB}$ RAM. Further videos are available on http://homepages.laas.fr/danes/AR2016.

6 Conclusion and prospects

Given a Gaussian belief on the relative position of a sound source with respect to a binaural head, a method has been proposed to determine the admissible planar motion of the sensor which leads to the one-step-ahead best audio-motor localization. It internally relies on a measure of the information brought by the incorporation of TDOA/ITD observations after the sensor has moved. Any other measurement variable could guide the exploration, provided that it is related to the hidden relative position by a closed-form equation similar to (3). Experiments have been conducted on a soundscape rendering simulator and on a mobile binaural robot, where the approach constitutes the “feedback control” component of a three-stage framework to active binaural localization.

Though the “one-step-ahead” statement of the synthesis of the active motion of the sensor is greatly simplified, several issues remain open even in this context. An immediate problem concerns the gap between the Gaussian prior required by the method and the multimodal Gaussian mixture (1) provided by the estimation technique (GMsrUKF) used in Sect. 2.3. This is especially significant at the first localization times, as the combination of the short-term azimuths extracted from the binaural stream and of the sensor motion does not yet enable front-back disambiguation or range recovery. Several elementary options can be envisaged to get around this problem: (a) keep the most probable hypothesis of the belief provided by the GMsrUKF; (b) turn the genuine multimodal belief into its Gaussian moment-matched approximation; (c) keep the most probable “branch” of the genuine Gaussian mixture (i.e., the set of contiguous hypotheses with similar azimuths) and compute its moment-matched approximation; (d) at the early times, apply elementary translation and rotation movements to the head so as to reduce the number of hypotheses in the Gaussian mixture. In this paper, (b) and (d) were jointly used. A more involved alternative is to avoid trading the Gaussian mixture belief for a single Gaussian distribution. As the differential entropy of a Gaussian mixture density cannot be evaluated analytically, two solutions can be envisaged: (i) use an alternative measure of information which can be expressed in closed form for a Gaussian mixture distribution and supports a rule similar to (8); (ii) approximate the differential entropy of a Gaussian mixture wherever needed by a closed-form formula whose accuracy-complexity balance can be handled. These topics are the subject of current research.

Though the proposed strategy does find the translation and rotation optimizing a one-step-ahead criterion, the sequence of N such motions may be outperformed by another sequence of admissible displacements as explained in Sect. 5.1. This is why current research also focuses on multi-step methods, where the objective is to find a sequence of the robot commands $u^{\star } = \{u_k,u_{k+1},\ldots ,u_{k+N}\}$ which improves the localization after N-steps. For instance, this optimum N-step sequence may be obtained by expressing the differential entropy of the belief at $k+N$ and minimizing its expected value over the next measurements $z_{k+1},\ldots ,z_{k+N}$, in the vein of Deutsch et al. (2004).

A thorough evaluation of Stage A in several kinds of acoustic environments is in process. It will be followed by the evaluation of the whole localization framework, including the audio-motor localization Stage B and information-based feedback control Stage C. Finally, the integration of the proposed active localization framework within a comprehensive computational model of human auditory perception, like the one developed in Two!Ears, requires further investigation. Active localization has been viewed as a sensorimotor function operating on short time ranges, i.e., a low-level “reflexive behavior”. So, its interaction with upper-level long-term cognitive processes needs to be refined. Among the important issues is a kind of exploration-exploitation dilemma: when and how must a cognitive process decide between exploring (i.e., parameterizing and triggering an active localization reflexive behavior in order to gather information) and launching an elaborate reasoning on the basis of its current knowledge?

Notes

Consider again the dynamic equation (2) with no dynamic noise, and assume that the posterior covariance $\overline{P}_{k|k}$ of the full state $X_k$ (defined in $\mathbb {R}^3$) is $\overline{P}_{k|k} = {{\mathrm{diag}}}(0,P_{k|k})$. As the vector $R^T(\phi ){T}$ is constant, the next “full” predicted covariance $\overline{P}_{k+1|k}$ writes as $\overline{P}_{k+1|k} = R^T(\phi )\overline{P}_{k|k}R(\phi )$, with $R(\phi ) = {{\mathrm{diag}}}(1,r(\phi ))$, and ${|R(\phi )|{}={}|r(\phi )|{}={}1}$. Consequently, $\overline{P}_{k+1|k} = {{\mathrm{diag}}}(0,P_{k+1|k})$ with ${|P_{k+1|k}|=|r^T(\phi )P_{k|k}r(\phi )|=|P_{k|k}|}$.
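
The determinant identity in this note can be checked numerically. The sketch below assumes that $r(\phi )$ is the usual counter-clockwise planar rotation matrix; the sign convention actually used in (2) is not restated here, and the identity holds for either convention since $|r(\phi )| = 1$. The function name and the covariance values are illustrative.

```python
import numpy as np

def rotate_covariance(P, phi):
    """Return r(phi)^T P r(phi), with r(phi) assumed to be the standard
    2x2 counter-clockwise rotation matrix."""
    c, s = np.cos(phi), np.sin(phi)
    r = np.array([[c, -s], [s, c]])
    return r.T @ P @ r

# Illustrative posterior covariance (not taken from the experiments).
P_post = np.array([[0.05, 0.01], [0.01, 0.03]])
P_pred = rotate_covariance(P_post, np.deg2rad(15.0))
# The two determinants agree up to rounding, so the Gaussian entropy is unchanged.
det_gap = abs(np.linalg.det(P_pred) - np.linalg.det(P_post))
```

Since the entropy of a Gaussian depends on the covariance only through its determinant, a pure rotation with no dynamic noise leaves the entropy of the predicted state unchanged.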

References

Aaronson, N., & Hartmann, W. (2014). Testing, correcting, and extending the Woodworth model for interaural time difference. The Journal of the Acoustical Society of America, 135, 817–823.
Bourgault, F., Makarenko, A., Williams, S., Grocholsky, B., Durrant-Whyte, H. (2002). Information based adaptive robotic exploration. In IEEE/RSJ international conference on intelligent robots and systems, (IROS’2002), Lausanne, Switzerland.
Bustamante, G., Danès, P., Forgue, T., Podlubne, A. (2016). Towards information-based feedback control for binaural active localization. In IEEE international conference on acoustics, speech, and signal processing (ICASSP’2016), Shanghai, China.
Bustamante, G., Portello, A., Danès, P. (2015). A three-stage framework to active source localization from a binaural head. In IEEE international conference on acoustics, speech, and signal processing (ICASSP’2015), Brisbane, Australia.
Cooke, M., Lu, Y., Lu, Y., Horaud, R. (2007). Active hearing, active speaking. In International symposium on auditory and audiological research (ISAAR’07), Marienlyst, Helsingør, Denmark.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Denzler, J., & Brown, C. (2002). Information theoretic sensor data selection for active object recognition and state estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 145–157.
Deutsch, B., Zobel, M., Denzler, J., & Niemann, H. (2004). Multi-step entropy based sensor control for visual object tracking. Pattern Recognition, 3175, 359–366.
Feder, H., Leonard, J., & Smith, C. (1999). Adaptive mobile robot navigation and mapping. The International Journal of Robotics Research, 18(7), 650–668.
Forster, C., Pizzoli, M., Scaramuzza, D. (2014). Appearance-based active, monocular, dense reconstruction for micro aerial vehicles. In Proceedings of robotics, science and systems, Berkeley, USA.
Grocholsky, B., Makarenko, A., Durrant-Whyte, H. (2003). Information-theoretic coordinated control of multiple sensor platforms. In IEEE international conference on robotics and automation, (ICRA’03), Taipei, Taiwan.
Julian, B. (2013). Mutual information-based gradient-ascent control for distributed robotics. PhD thesis, Massachusetts Institute of Technology.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3), 401–422. doi:10.1109/JPROC.2003.823141.
Kumon, M., Fukushima, K., Kunimatsu, S., Ishitobi, M. (2010). Motion planning based on simultaneous perturbation stochastic approximation for mobile auditory robots. In IEEE/RSJ international conference on intelligent robots and systems (IROS’2010), Taipei, Taiwan.
Le Cadre, J. P., & Laurent-Michel, S. (1999). Optimizing the receiver maneuvers for bearings-only tracking. Automatica, 35(4), 591–606.
Mallet, A., Pasteur, C., Herrb, M., Lemaignan, S., Ingrand, F. (2010). GenoM3: Building middleware-independent robotic components. In IEEE international conference on robotics and automation, (ICRA’2010), Anchorage, Alaska.
Manyika, J. (1993). An information-theoretic approach to data fusion and sensor management. PhD thesis, University of Oxford.
Martinson, E., Apker, T., Bugajska, M. (2011). Optimizing a reconfigurable robotic microphone array. In IEEE/RSJ international conference on intelligent robots and systems (IROS’2011), San Francisco, California.
Martinson, E., & Schultz, A. (2009). Discovery of sound sources by an autonomous mobile robot. Autonomous Robots, 27, 221–237.
Nakadai, K., Lourens, T., Okuno, H., Kitano, H. (2000). Active audition for humanoid. In National conference on artificial intelligence (AAAI’2000). Austin, TX.
Portello, A., Bustamante, G., Danès, P., Mifsud, A. (2014a). Localization of multiple sources from a binaural head in a known noisy environment. In IEEE/RSJ international conference on intelligent robots and systems (IROS’2014), Chicago, IL.
Portello, A., Bustamante, G., Danès, P., Piat, J., Manhès, J. (2014b). Active localization of an intermittent sound source from a moving binaural sensor. In Forum Acusticum (FA’2014), Krakow, Poland.
Portello, A., Danès, P., Argentieri, S. (2012). Active binaural localization of intermittent moving sources in the presence of false measurements. In IEEE/RSJ international conference on intelligent robots and systems (IROS’2012).
Portello, A., Danès, P., Argentieri, S., Pledel, S. (2013). HRTF-based source azimuth estimation and activity detection from a binaural sensor. In IEEE/RSJ international conference on intelligent robots and systems (IROS’2013), Tokyo, Japan.
Ristic, B., & Arulampalam, M. (2003). Tracking a manoeuvring target using angle-only measurements: Algorithms and performance. Signal Processing, 83(6), 1223–1238.
Sasaki, Y., Thompson, S., Kaneyoshi, M., Kagami, S. (2010). Map-generation and identification of multiple sound sources from robot in motion. In IEEE/RSJ international conference on intelligent robots and systems (IROS’2010), Taipei, Taiwan (pp. 437–443).
Sommerlade, E., Reid, I. (2008). Information-theoretic active scene exploration. In IEEE conference on computer vision and pattern recognition, (CVPR’2008), Anchorage, Alaska.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: The MIT Press.
Vincent, E., Sini, A., Charpillet, F. (2015). Audio source localization by optimal control of a mobile robot. In IEEE international conference on acoustics, speech and signal processing (ICASSP’2015), Brisbane, Australia.

Acknowledgements

The authors would like to thank Matthieu Herrb, Anthony Mallet, and Xavier Dollat for their invaluable help.

Author information

Authors and Affiliations

LAAS-CNRS, Université de Toulouse, CNRS, UPS, Toulouse, France
Gabriel Bustamante, Patrick Danès, Thomas Forgue, Ariel Podlubne & Jérôme Manhès

Corresponding author

Correspondence to Patrick Danès.

Additional information

This work was partially supported by EU FET Grant Two!Ears, ICT-618075, www.twoears.eu.

This is one of several papers published in Autonomous Robots comprising the Special Issue on Active Perception.

Appendix

Consider the posterior state pdf $p(x_k|z_{1:k})$ of the sensor-to-source position at time k, and ${\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})$ the approximate Gaussian belief. This pdf can be mapped into the 1D Gaussian approximation ${\mathcal {N}}(z_{k+1};\hat{z}_{k+1|k},S_{k+1|k})$ of the predicted measurement pdf $p(z_{k+1}|z_{1:k})$, by using the unscented transform. The aim is then to maximize the variance $S_{k+1|k}$ so as to increase the entropy $h(z_{k+1}|z_{1:k})$. This involves the composition of several functions.

First the sigma-points $\left\{ X_{i}^{-}\right\} $ corresponding to ${p(x_{k}|z_{1:k}) = {\mathcal {N}}(x_k;{\hat{x}}_{k|k},P_{k|k})}$ are computed from the posterior mean ${\hat{x}}_{k|k}$ of the state vector at time k and the Cholesky decomposition $P_{k|k} = L_{k|k}L_{k|k}^T$ of the posterior covariance:

$$\begin{aligned} \left\{ X_{i}^{-}\right\} = {\text {Sigma}}\_{\text {points}} \left( {\hat{x}}_{k|k},L_{k|k} \right) \end{aligned}$$

(15)

The sigma-points $\left\{ X_{i}^{+}\right\} $ of the next predicted state pdf $p(x_{k+1}|z_{1:k}) = {\mathcal {N}}(x_{k+1};{\hat{x}}_{k+1|k},P_{k+1|k})$ can be obtained by applying the translation and rotation to each sigma-point in the set $\left\{ X_{i}^{-}\right\} $. Note that (2) is defined as a function of $(T_y,T_z,\phi )$, so that

$$\begin{aligned} \forall i,\ X_{i}^{+} = \Phi _{X_{i}^-}(T_y,T_z,\phi ). \end{aligned}$$

(16)

Then the set of sigma-points $\left\{ Z_{i}^+\right\} $ of the predicted measurement pdf $p(z_{k+1}|z_{1:k}) = {\mathcal {N}}(z_{k+1};\hat{z}_{k+1|k},S_{k+1|k})$ can be obtained from $\left\{ X_{i}^+\right\} $ defined in (16) by:

$$\begin{aligned} \forall i,\ Z_{i}^+ = l\left( -\mathrm {atan2}\left( X_{i}^+(1),X_{i}^+(2) \right) \right) , \end{aligned}$$

(17)

with $X_{i}^+(1)$ and $X_{i}^+(2)$ the components of $X_{i}^+$, and $l(\cdot )$ the measurement equation used to guide the exploration. Finally the mean $\hat{z}_{k+1|k}$ and variance $S_{k+1|k}$ of $p(z_{k+1}|z_{1:k})$ are computed by

$$\begin{aligned} \hat{z}_{k+1|k}= & {} \sum _i w_m^{i}Z_{i}^+\end{aligned}$$

(18)

$$\begin{aligned} S_{k+1|k}= & {} \sum _i w_c^{i}\left( Z_{i}^+ - \hat{z}_{k+1|k}\right) ^2, \end{aligned}$$

(19)

with $\left\{ w_m^i\right\} $ and $\left\{ w_c^i\right\} $ the classic weights of the unscented transform.
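
The chain (15)–(19) can be sketched as follows. Several choices below are assumptions, since the corresponding definitions are not restated in this appendix: the motion model is taken as $x_{k+1} = r^T(\phi )(x_k - T)$ with $r(\phi )$ the standard planar rotation matrix, the weights are the basic symmetric ones with parameter $\kappa $ (identical for mean and variance), and $l(\cdot )$ defaults to the identity, i.e., the azimuth model (14). All function names are hypothetical, and azimuth wrap-around near $\pm \pi $ is not handled.

```python
import numpy as np

def sigma_points(x_hat, L, kappa=1.0):
    """Symmetric sigma-points and weights from the mean and Cholesky factor (15)."""
    n = x_hat.size
    scale = np.sqrt(n + kappa)
    pts = [x_hat]
    for j in range(n):
        pts.append(x_hat + scale * L[:, j])
        pts.append(x_hat - scale * L[:, j])
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w

def move(x, T, phi):
    """Assumed rigid motion: translate by T, then express in the rotated frame (16)."""
    c, s = np.cos(phi), np.sin(phi)
    r = np.array([[c, -s], [s, c]])
    return r.T @ (x - T)

def predicted_measurement_moments(x_hat, P, T, phi, l=lambda theta: theta):
    """Mean and variance of the predicted measurement, Eqs. (17)-(19)."""
    L = np.linalg.cholesky(P)                    # P = L L^T, L lower triangular
    X_minus, w = sigma_points(x_hat, L)
    X_plus = np.array([move(x, T, phi) for x in X_minus])
    # X(1), X(2) in the text are the first and second components (0-based here).
    Z = np.array([l(-np.arctan2(x[0], x[1])) for x in X_plus])
    z_hat = np.sum(w * Z)
    S = np.sum(w * (Z - z_hat) ** 2)
    return z_hat, S
```

With the identity measurement function, a pure rotation shifts every sigma-point azimuth by the same angle and leaves $S_{k+1|k}$ unchanged, whereas a translation towards the source increases it. This matches the behavior reported for the azimuth model in Sect. 4.3.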

The log of the variance $S_{k+1|k}$ comes as a function of the finite translation and rotation, i.e., $\log S_{k+1|k} = {F}_{k}(T_y,T_z,\phi )$. However the maximum of this function is not analytically tractable. Its gradient around ${U} = (T_y,T_z,\phi )$ is then computed as follows.

The first-order Taylor expansions of the functions $\Phi _{X_{i}^-}$, $\mathrm {atan2}$, l, and $\log $ are composed around U with infinitesimal translations and rotation ${du} = (dT_y, dT_z, d\phi )^T$:

$$\begin{aligned}&\Phi _{X_{i}^-}({U} + {du}) = \Phi _{X_i^-}({U}) + {J{\Phi _{X_i^-}}}({U}) \, {du} \nonumber \\&\mathrm {atan2}(u,v){}={}\mathrm {atan2}(u_0,v_0){}+{}{{\nabla }^T\mathrm {atan2}}(u_0,v_0) \, \begin{pmatrix} u - u_0\\ v-v_0 \end{pmatrix} \nonumber \\&l(w) = l(w_0) + l'(w_0)(w-w_0) \nonumber \\&\log (r) = \log (r_0) + \frac{1}{r_0}(r-r_0) \end{aligned}$$

(20)

with ${\nabla }$ the gradient operator and $J{\Phi _{X_i^-}}({U})$ the Jacobian of $\Phi _{X_i^-}$ at U. The result of the composition, denoted $Z_{i}(dT_y,dT_z,d\phi )$, is then used to retrieve the mean and the variance with (18) and (19). Finally, the first-order Taylor expansion of $F_k$ around U is obtained, highlighting the gradient ${\nabla }F_k$:

$$\begin{aligned} F_k\left( {{U} + {du}}\right) = F_k\left( {U}\right) + {\nabla }^TF_k ({U}) \, {du}. \end{aligned}$$

(21)
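
The expansion (21) can be checked numerically. The sketch below does not reproduce the analytic composition of Taylor expansions described above: it swaps in central finite differences to approximate ${\nabla }F_k$, which is a different and cruder technique, useful only as a sanity check. It also assumes the motion model $x_{k+1} = r^T(\phi )(x_k - T)$, the basic symmetric unscented weights, and the identity measurement function of the azimuth model (14). All names and numerical values are illustrative.

```python
import numpy as np

def log_pred_variance(U, x_hat, P, kappa=1.0):
    """F_k(T_y, T_z, phi) = log S_{k+1|k} via the unscented transform
    (assumed motion model and weights, azimuth measurement)."""
    Ty, Tz, phi = U
    n = x_hat.size
    L = np.linalg.cholesky(P)
    offsets = np.sqrt(n + kappa) * L.T           # rows = scaled columns of L
    X = np.vstack([x_hat, x_hat + offsets, x_hat - offsets])
    w = np.full(2 * n + 1, 0.5 / (n + kappa))
    w[0] = kappa / (n + kappa)
    c, s = np.cos(phi), np.sin(phi)
    r = np.array([[c, -s], [s, c]])
    Xp = (X - np.array([Ty, Tz])) @ r            # row-wise r^T (x - T)
    Z = -np.arctan2(Xp[:, 0], Xp[:, 1])
    z_hat = w @ Z
    return np.log(w @ (Z - z_hat) ** 2)

def numerical_gradient(F, U, h=1e-6):
    """Central finite-difference approximation of the gradient of F at U."""
    U = np.asarray(U, dtype=float)
    g = np.zeros_like(U)
    for j in range(U.size):
        e = np.zeros_like(U)
        e[j] = h
        g[j] = (F(U + e) - F(U - e)) / (2.0 * h)
    return g

x_hat = np.array([0.3, 1.5])
P = np.array([[0.05, 0.01], [0.01, 0.03]])
F = lambda U: log_pred_variance(U, x_hat, P)
U0 = np.zeros(3)
grad = numerical_gradient(F, U0)
du = np.array([0.01, 0.02, 0.05])
taylor_gap = abs(F(U0 + du) - (F(U0) + grad @ du))   # second-order remainder
```

Under these assumptions the rotation component of the gradient is numerically zero, and the first-order prediction of (21) differs from the exact value only by a small second-order remainder for a small ${du}$.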

About this article

Cite this article

Bustamante, G., Danès, P., Forgue, T. et al. An information based feedback control for audio-motor binaural localization. Auton Robot 42, 477–490 (2018). https://doi.org/10.1007/s10514-017-9639-8

Received: 01 March 2016
Accepted: 30 May 2017
Published: 15 June 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s10514-017-9639-8
