Introduction

There are only a few examples of neuronal circuits for which we have a relatively complete understanding of how multiple levels interact, from single-cell biophysics through circuit anatomy to behavior. Most of these exemplary brain structures are sensory circuits such as the retina, where the response properties of ON and OFF bipolar cells can be explained by specific channel properties, and where it is relatively clear how ganglion cells process bipolar cell signals and how all this relates to specific aspects of visual perception [2].

A further example of a mechanistically well understood neural circuit function seems to be the binaural coincidence detector neurons in the medial superior olive (of mammals) or in the nucleus laminaris (of birds and reptiles), which generate a neural representation of acoustic space based on their sensitivity to interaural time differences (ITDs). Almost every basic neurobiology textbook refers to this model system and connects it to the theoretical model by Jeffress [13] (Fig. 1). Jeffress postulated that coincidence detector neurons would be aligned along incoming axons, and that their geometric neighborhood relations would introduce gradually increasing axonal delays. If each neuron receives inputs from both hemispheres (ears) with slightly different axonal delays, the cells thereby establish a neuronal map in which the activity of each neuron encodes a different specific azimuthal direction (ITD). The Jeffress model is simple and seductively powerful since it solves all algorithmic problems of azimuthal sound localization: (i) It offers both a mechanism (neuronal coincidence detection and axonal delays) and an easily readable code (a systematic neuronal map or “labeled line”). (ii) The code is invariant against changes of all obvious acoustic parameters (intensity, frequency). (iii) The Jeffress model provides clear, experimentally testable predictions. The latter point is usually not part of the textbooks. Although in birds there is clear evidence for elements of the Jeffress model, such as a topographic map [40] and graded axonal delays [27], in mammals most of Jeffress’ predictions could not be confirmed or have even been disproved.
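The labeled-line readout described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a model taken from the literature: the cosine-shaped tuning, the number of cells, and all function names are hypothetical, and the ±160 µs range is borrowed from the example in Fig. 1.

```python
import math

def jeffress_delays(n_cells=9, max_itd_us=160.0):
    """Internal delay differences (µs), evenly spaced across the assumed ITD range."""
    step = 2 * max_itd_us / (n_cells - 1)
    return [-max_itd_us + i * step for i in range(n_cells)]

def response(itd_us, delay_us, freq_hz):
    """Idealized coincidence rate in [0, 1] for a pure tone (assumed cosine tuning).

    The rate is maximal when the stimulus ITD is exactly compensated by the
    cell's internal delay difference.
    """
    phase = 2 * math.pi * freq_hz * (itd_us - delay_us) * 1e-6
    return 0.5 * (1 + math.cos(phase))

def decode_itd(itd_us, freq_hz, delays):
    """Labeled-line readout: report the internal delay of the most active cell."""
    rates = [response(itd_us, d, freq_hz) for d in delays]
    return delays[rates.index(max(rates))]

delays = jeffress_delays()
for freq_hz in (300.0, 600.0, 1000.0):
    # The same cell wins at every frequency, because its label is a pure delay.
    print(freq_hz, decode_itd(120.0, freq_hz, delays))
```

In this sketch the decoded ITD does not depend on the tone frequency, which is exactly the prediction of the Jeffress model that the physiological data discussed below call into question.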

Fig. 1

The Jeffress model and its predictions. (a top) Jeffress proposed a neural circuit consisting of coincidence detector neurons (colored discs) that are driven by axons (red) from the left and right ear. The neurons are spatially arranged along the axons such that the interaural time differences (ITDs) are gradually mapped to the location of the neuron (color scale from green to orange). (a bottom) The latency difference between the inputs of the two ears determines the ITD at which a cell fires maximally. The orange cell, for example, fires maximally at a best ITD of 120 µs (vertical line). b The Jeffress model predicts that the best ITD of a cell is independent of stimulus frequency, since it solely depends on a difference of axonal conduction times. c The product of best ITD and stimulus frequency is called best IPD (interaural phase difference) and, as a function of frequency, is represented by a straight line through the origin. The slope of this line corresponds to the best ITD. d The Jeffress model predicts that all ITDs have to be uniformly distributed within the physiological range (blue box in a; here ±160 µs) since every cell’s activity signals exactly one ITD. © Courtesy of Benedikt Grothe (a) and Christian Leibold (b-d)

However, models are particularly useful if some of their predictions turn out to be wrong. In particular, the discrepancies between Jeffress’ predictions and experiments have generated a wealth of new hypotheses over the past years (and decades) that have led to many new fundamental findings far beyond the specific neural circuit. In this paper, we summarize some evidence against the Jeffress model being implemented as a neural circuit in mammals, present current alternative ideas, and summarize to what extent they have been experimentally tested and confirmed.

The binaural coincidence detection circuit in mammals

First, we describe the underlying neuroanatomy (Fig. 2), since there is by now little doubt that already the anatomy does not fit the circuit proposed by Jeffress. The coincidence detector neurons of mammals are situated in the superior olivary complex (SOC), which is a collection of nuclei in the ascending auditory pathway of the mammalian brainstem. In mammalian species with good low-frequency hearing (< 2000 Hz), transverse brainstem sections show a clearly visible layer of bipolar neurons with one dendrite pointing in the medial direction and the other dendrite pointing laterally. These neurons form the medial superior olive (MSO), and their individual activities resolve ITDs with a precision of about 30 µs. To be able to do so, they have specific biophysical properties that we describe later. MSO cells receive two types of direct synaptic input and at least one modulatory input. The well-known bilateral excitatory inputs are conveyed via the axons of the spherical bushy cells, which are glutamatergic neurons of the ventral cochlear nucleus (CN). Spherical bushy cells in the ipsilateral CN project onto the lateral MSO dendrites, whereas contralateral spherical bushy cells project onto medial MSO dendrites. 3-D reconstructions of individual axons (e.g., in cat) revealed the absence of axonal delay lines [15]. Only few projections of spherical bushy cells terminate on the MSO cell bodies. The cell bodies are instead predominantly contacted by glycinergic inhibitory synaptic boutons [14], which are inconsistent with Jeffress’ model and arise from the lateral and medial nuclei of the trapezoid body (LNTB, MNTB). Since LNTB neurons are innervated by globular bushy cells of the ipsilateral CN and MNTB neurons are innervated by contralateral globular bushy cell axons, MSO neurons receive not only excitatory but also inhibitory inputs from both hemispheres.
Globular bushy cells, their axons and synapses, as well as their target neurons in LNTB and MNTB exhibit distinct structural-anatomical features that facilitate fast and precise action potential firing and synaptic transmission [38, 45]. In particular, the calyx of Held synapse is well known for enabling a 1:1 transformation of its excitatory (glutamatergic) input into an inhibitory (glycinergic) output. How the four groups of synaptic MSO inputs interact at high temporal precision has been the subject of a very controversial debate that has already lasted for two decades [3, 30, 33].

Fig. 2

The mammalian neuronal microcircuit. a The MSO consists of a layer of bipolar neurons with a high density of glycinergic receptors (yellow staining) at the soma (from Kapfer et al. [14]). b The MSO (yellow) receives four inputs from the ascending auditory pathway, two glutamatergic ones (red) from the spherical bushy cells of the ventral cochlear nucleus (VCN; green) and two glycinergic ones (blue) from LNTB (ipsilateral) and MNTB (contralateral). Both LNTB and MNTB are driven by glutamatergic inputs from VCN globular bushy cells. In addition, MSO neurons receive modulatory feedback from the SPN, which acts presynaptically on GABA-B receptors. © Courtesy of Benedikt Grothe

Only recently, a further GABAergic MSO input has been described, which arises from a disynaptic feedback loop [39]. MSO neurons not only project auditory information to the midbrain but also send collaterals into a neighboring GABAergic group of neurons which is called the superior paraolivary nucleus (SPN). GABAergic SPN outputs project back to the MSO where they activate presynaptic GABA-B receptors. Activation of these receptors reduces the release of transmitters at both the excitatory and the inhibitory MSO inputs. Thus, negative feedback induces a relative reduction of the MSO output, if it was active shortly before.

Characteristic phases

From the physiological perspective, the characteristic phase (CP) is the central quantity that most fundamentally challenges the Jeffress model. Roughly speaking, the CP measures the frequency-dependent discrepancy of the axonal delay. In the plain Jeffress model, neuronal firing is solely determined by the difference of the axonal delays from the right and the left ear. A delay difference is a frequency-independent quantity, and thus the ITD at which the cell responds most (called “best ITD”) must be independent of the frequency. However, all measurements in mammals so far [26, 31, 46] reveal a frequency-dependent best ITD approximately following the relation:

Best ITD = CP/frequency + CD.

Here, the constant CD is called characteristic delay and corresponds to the frequency-independent portion of the delay difference (Fig. 3). The distribution of CPs varies from species to species but is always sufficiently broad that it cannot be explained merely by noise. In gerbils, for example, CPs are virtually uniformly distributed between −1/4 and +1/4 cycles [31].
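The relation above can be made concrete with a small Python sketch. Multiplying best ITD by frequency gives best IPD = CP + CD × frequency, so CP and CD can be read off as intercept and slope of a straight-line fit. The function names and the example values (CP = 0.125 cycles, CD = 100 µs) are hypothetical and chosen only for illustration; phase wrapping and measurement noise are ignored.

```python
def best_itd_s(freq_hz, cp_cycles, cd_s):
    """Best ITD (s) from the relation: best ITD = CP / frequency + CD."""
    return cp_cycles / freq_hz + cd_s

def best_ipd_cycles(freq_hz, cp_cycles, cd_s):
    """Best IPD (cycles) = best ITD * frequency = CP + CD * frequency."""
    return freq_hz * best_itd_s(freq_hz, cp_cycles, cd_s)

def fit_cp_cd(freqs_hz, ipds_cycles):
    """Ordinary least-squares line through (frequency, best IPD).

    Returns (CP, CD): the intercept in cycles and the slope in seconds.
    """
    n = len(freqs_hz)
    mean_f = sum(freqs_hz) / n
    mean_p = sum(ipds_cycles) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs_hz)
    sxy = sum((f - mean_f) * (p - mean_p) for f, p in zip(freqs_hz, ipds_cycles))
    cd_s = sxy / sxx
    cp_cycles = mean_p - cd_s * mean_f
    return cp_cycles, cd_s

freqs = [400.0, 600.0, 800.0, 1000.0]
# Hypothetical cell with CP = 0.125 cycles and CD = 100 µs.
ipds = [best_ipd_cycles(f, 0.125, 100e-6) for f in freqs]
cp_fit, cd_fit = fit_cp_cd(freqs, ipds)
print(cp_fit, cd_fit)
# A Jeffress-type cell (CP = 0) has the same best ITD at every frequency.
print([best_itd_s(f, 0.0, 100e-6) for f in freqs])
```

With a nonzero CP, the best ITD of the hypothetical cell shifts with frequency, whereas setting CP to zero recovers the frequency-independent best ITD of the plain Jeffress model.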

Fig. 3

Characteristic phases, characteristic delays. a Firing rates of an exemplary gerbil MSO neuron as a function of ITD for different frequencies (colors) of a pure tone stimulus. b In contrast to Jeffress’ prediction (Fig. 1), best ITDs (vertical lines in a) vary with stimulus frequency. c Best IPD as a function of stimulus frequency is a linear function with a non-vanishing intercept on the best IPD axis. The slope of this line is called characteristic delay (CD), the intercept on the best IPD axis is called characteristic phase (CP). d, e Distributions of CPs and CDs in a population of 48 MSO cells from Pecka et al. [31]. f CPs and CDs are negatively correlated, reflecting the contralateral (positive) bias of best ITD. To better compare cells with different best frequencies (BF), CDs were multiplicatively scaled with BF. All graphs stem from data of Michael Pecka [31]. © Courtesy of Christian Leibold

But if there are non-vanishing CPs, how do they come about? What is their function? Are they learned? Are they plastic? What is their evolutionary origin? From today’s perspective, the answers to all these questions are not fully known, and many doubts remain despite many exciting models and a wealth of excellent data. In this sense, binaural coincidence detection is still an open and largely controversial neurobiological topic. However, and in contrast to other neural microcircuits, the functional relevance of the binaural neural circuit (processing of ITDs) is unquestioned, and thus it constitutes an excellent model system that links all levels of neuroscience research from psychophysics to cellular aspects and even evolution. In the next sections we summarize the fundamental findings and point towards important open hypotheses.

Problem: neuronal representation

Given the existence of non-Jeffress-like characteristic phases, there are currently two main ideas concerning the neuronal representation of the azimuthal position of a sound source, both providing alternatives to the labeled line code. The historically first one is the hemispheric difference code [26, 37], which assumes that the (potentially normalized) difference of the total neuronal activities of the MSOs of both hemispheres has a mathematically unique relation to the ITD and thereby provides a complete neuronal representation of the stimulus position (Fig. 4). The second idea may be referred to as a population pattern code [5, 7], in which the population vector of MSO cells’ firing rates constitutes a high-dimensional signature of the ITD that can be read by current pattern detection algorithms. Obviously, the hemispheric difference code is contained within the class of population pattern codes.

Fig. 4

Population codes of sound source location. Top: Summed population response (ipsilateral “minus” contralateral hemisphere) as a function of ITD in a simulated population of 48 MSO cells with CPs, CDs, and best frequencies from Fig. 3d-f. Colors correspond to different stimulus frequencies (blue: 600 Hz, red: 1000 Hz). Grey vertical lines indicate the physiological ITD ranges that are determined by the inter-ear distance. For ITDs above 250 µs the population code does not provide a unique relation to ITDs and also loses frequency invariance. Bottom: Hypothesis on the population code for ITDs above 250 µs: Rate sum of all ipsilateral MSO cells (from top) vs. rate sum of an “LSO” population with identical best IPDs as the MSO population but random CPs > 0.25 cycles. Dots of the same ITD (colors) are approximately located on straight lines (black dashed lines) that could be easily decoded by a readout structure (e.g., a perceptron). © Courtesy of Christian Leibold

Both ideas are going to be presented in greater detail next.

i) Hemispheric difference model (2-channel model)

Most interestingly, the origin of this model dates back to even before Jeffress published his model: von Békésy, in his fundamental “Zur Theorie des Hörens; über das Richtungshören bei einer Zeitdifferenz oder Lautstärkenungleichheit der beiderseitigen Schallwirkungen”, suggested already in 1930 that the increased psychophysical localization acuity at 0 degrees azimuth (in front) could be explained by more neurons changing their activity state around this direction. This means that at 0 degrees, neurons should be most sensitive to ITD changes and thus cannot have their maximum firing rate there. The idea was revived in the 2000s, when McAlpine et al. [26] showed that in guinea pigs most ITD-sensitive cells in the midbrain seem to have their maximum firing rate outside the physiological ITD interval that is determined by the size of the animal’s head, namely on the contralateral side. Instead, single cells showed their largest rate changes at 0 degrees. Similar data existed before in cat [46] and were published around the same time for gerbils [3].

An obvious problem of the hemispheric difference code results from the periodicity of the ITD-sensitive responses. For a pure tone stimulus with frequency f, a cell cannot distinguish between ITDs that differ by multiples of the period 1/f. This also holds approximately true for band-pass noise inputs that are generated by cochlear filtering. If the period 1/f is in the same range as the physiological ITD interval, the neuronal responses become non-monotonic and the unique relation between ITD and hemispheric difference breaks down (Fig. 4). This is especially a problem for large heads (large physiological ITD range; for comparison: 250 µs for gerbils, 500 µs for cats, 1.4 ms for humans) and high frequencies. The plain hemispheric difference model thus only offers a solution for small mammals with low-frequency hearing, such as gerbils. Optimal coding theories can be applied to animals with larger heads and predict that the hemispheric difference code must then become more complex and involve more than two channels [7, 11], particularly if one also takes into account the low-frequency neurons of the lateral superior olive (LSO) that have CPs larger than 0.25 cycles (Fig. 4).
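The breakdown of uniqueness can be illustrated with a toy calculation in Python. The sketch assumes one idealized channel per hemisphere with cosine tuning and mirror-symmetric best IPDs of ±0.125 cycles; these choices and all function names are hypothetical. The values 250 µs, 500 µs and 1.4 ms are the ones quoted above, and treating them as the maximal ITD magnitude is a further assumption.

```python
import math

def channel_rate(itd_s, freq_hz, best_ipd_cycles):
    """Idealized pure-tone tuning curve in [0, 1] (assumed cosine shape)."""
    return 0.5 * (1 + math.cos(2 * math.pi * (freq_hz * itd_s - best_ipd_cycles)))

def hemispheric_difference(itd_s, freq_hz, best_ipd_cycles=0.125):
    """Rate of one hemispheric channel minus its mirror-image counterpart."""
    return (channel_rate(itd_s, freq_hz, best_ipd_cycles)
            - channel_rate(itd_s, freq_hz, -best_ipd_cycles))

def is_unique_code(freq_hz, max_itd_us, n_samples=201):
    """True if the difference rises strictly monotonically over [-max, +max]."""
    max_itd_s = max_itd_us * 1e-6
    itds = [-max_itd_s + 2 * max_itd_s * i / (n_samples - 1) for i in range(n_samples)]
    diffs = [hemispheric_difference(t, freq_hz) for t in itds]
    return all(later > earlier for earlier, later in zip(diffs, diffs[1:]))

for species, max_itd_us in (("gerbil", 250.0), ("cat", 500.0), ("human", 1400.0)):
    print(species, is_unique_code(300.0, max_itd_us), is_unique_code(600.0, max_itd_us))
```

In this toy model the difference signal is proportional to sin(2π f ITD), so it stays monotonic only while the maximal ITD is below a quarter of the stimulus period, which is why larger heads and higher frequencies become problematic.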

ii) Population pattern code

The main idea underlying a population pattern code is to make no a priori assumptions on how the activity of the cells in both hemispheres is read out, but instead to train pattern detection algorithms such that they are able to detect and read out ITD-specific traces [5, 7, 23]. There is indeed evidence that ITD information is contained even beyond the classical tuning curves, such as in position-dependent temporal activity patterns [17, 18, 28] and in sound-structure-dependent influences on single-cell spatial selectivity [9], all of which could be exploited by a general pattern detector. This approach is particularly elegant, since it can use the variability within the cell population (such as different best ITDs) to improve the population code and thereby resolve the problem of non-monotonic hemispheric activity for large heads at higher frequencies. However, this idea has the disadvantage that the code itself remains very abstract and thus gives little insight into how the downstream structures might use it. In other words, the population pattern code is very general and includes a huge collection of possible codes (including the hemispheric difference code). Strictly speaking, it is therefore almost non-falsifiable and difficult to grasp both theoretically and experimentally.
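As a minimal sketch of this idea, the following Python example uses a nearest-template decoder as a simple stand-in for a trained pattern detector. The five cells, their CP and CD values, the cosine tuning, and the candidate grid are all hypothetical; the example is noise-free and only meant to show that heterogeneous tuning can disambiguate ITDs within one stimulus period.

```python
import math

# Hypothetical cells: (characteristic phase in cycles, characteristic delay in s).
CELLS = [(0.125, 50e-6), (0.20, -30e-6), (-0.125, -50e-6), (0.05, 120e-6), (-0.20, 30e-6)]

def population_pattern(itd_s, freq_hz, cells=CELLS):
    """Vector of idealized firing rates, one entry per cell (assumed cosine tuning)."""
    pattern = []
    for cp_cycles, cd_s in cells:
        best_ipd = cp_cycles + cd_s * freq_hz  # best IPD = CP + CD * frequency
        phase = 2 * math.pi * (freq_hz * itd_s - best_ipd)
        pattern.append(0.5 * (1 + math.cos(phase)))
    return pattern

def decode_itd_us(observed, freq_hz, candidates_us=range(-600, 601, 10)):
    """Return the candidate ITD (µs) whose template is closest to the observed pattern."""
    def squared_distance(itd_us):
        template = population_pattern(itd_us * 1e-6, freq_hz)
        return sum((o - t) ** 2 for o, t in zip(observed, template))
    return min(candidates_us, key=squared_distance)

# 500 µs at 600 Hz lies beyond a quarter period, where a single difference
# signal of two mirror-symmetric channels would no longer be monotonic.
observed = population_pattern(500e-6, 600.0)
print(decode_itd_us(observed, 600.0))
```

The decoder makes no assumption about hemispheres or channels, which illustrates both the flexibility of the approach and why the resulting code says little about how downstream structures actually read it.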

The current dispute also points to a general problem of all coding theories, viz., that the objective function which the code satisfies is generally unknown. Even in a seemingly obvious case like estimating sound source azimuth, the minimization of the localization error may not be the most important goal. Instead, source separation or the speed of the estimator might be more relevant, or some form of readability of the code for downstream areas that use spatial information in a flexible and context-dependent way [19].

Problem: psychophysics

A further indication contradicting the classical Jeffress code arises from psychoacoustic adaptation experiments [6, 39, 43]. The basic idea underlying these studies is that in a sparse code like the Jeffress labeled line, in which neurons encode only few stimulus positions, the presence of a sustained sound from one direction adapts only those few neurons that are responsible for this direction. Perception of sounds from other positions should thus not be affected. Contrary to this prediction, all experiments (and also physiological measurements) show strong influences of the adapter sounds on the perception of test sounds in a whole hemisphere and beyond. Since MSO neurons preferentially respond to sounds in the contralateral hemisphere, stimuli in the contralateral hemisphere suppress responses to stimuli occurring within 100 ms to several seconds later in the same hemisphere. This leads to predictable localization errors of the test sounds [39] and can be mechanistically explained by the feedback from SPN. Whether and how the psychophysical adaptation experiments constrain non-sparse coding hypotheses has not yet been investigated.

Problem: cell physiology and microcircuit

Psychophysically, the temporal acuity of binaural coincidence is about 10 µs “just noticeable difference in ITD.” The value is roughly the same in humans and gerbils [12, 22, 29]. The geometrical distance between the ears, which is much smaller in gerbils, is therefore likely to explain most of the species difference in localization acuity. The underlying neuronal properties are thus most probably similar. As a result of these high demands on temporal precision, MSO neurons have very special properties. To be able to detect coincidences on a microsecond time scale, the cellular memory (i.e., the membrane time constant) must be extraordinarily short. Indeed, adult MSO neurons have such small time constants of only about 300 µs in vitro (corresponding to input resistances of 5 MΩ and below; [4, 35]). The membrane conductances necessary to achieve such values result from the expression of at least two active channels that are partially open at rest: an HCN1 channel [1, 16] and a fast potassium (Kv1) channel [25, 41]. The expression of both channels increases during ontogeny, and the input resistance thereby drops drastically from about 40 MΩ at P14 to 5 MΩ at P60 [35]. The resulting speed of the neurons, however, comes with the problem that they become less excitable, and thus it is not surprising that the action potentials of MSO cells have somatic amplitudes that cannot be distinguished from synaptic potentials. In fact, the afterhyperpolarization following an action potential is mostly the best indicator of suprathreshold excitation [4]. The strong synaptic conductances additionally complicate the generation of action potentials. So how then can MSO cells generate action potentials in the first place? A theoretical study [20] shows that this is nevertheless possible, but probably sometimes at larger distances from the soma (at the first or second node of Ranvier). An experimental verification of this prediction is still missing.
These findings also suggest that extracellularly measured firing rates in vivo generally do not reflect somatic activation but result from axonal current dipoles [46].

Another particularly controversial open question is the mechanistic nature of the CPs and CDs. In principle, the MSO cells themselves could be asymmetric, i.e., the two dendrites collecting ipsi- and contralateral inputs could process them differently because of morphological (dendrite diameters/lengths) or physiological (different channel compositions) differences. So far, however, systematic asymmetries of MSO neurons have not been found, particularly concerning dendritic anatomy [32], and thus it seems plausible to assume that the synaptic inputs to the MSO cells (i.e., the microcircuitry) are asymmetric.

Already several decades ago, Schroeder [34] postulated that CPs may arise from a stereausis effect owing to a difference in the locations of cochlear origin (characteristic frequencies) of the ipsi- and contralateral inputs. The difference in cochlear locations would correspond to different phases of the cochlear travelling wave, and their difference would then account for the CP. It is actually a very plausible scenario that, for a single cell, the bilateral inputs stem from locations with slightly different characteristic frequencies and that these differences contribute to CP. The stereausis model, however, further requires that the binaural difference in characteristic frequency be systematic (for all cells in all mammalian species) to be able to explain the contralateral bias of best ITDs in the population in vivo. Experimental evidence for such a systematic bias is missing. Admittedly, it is difficult to map differences in bilateral characteristic frequency at the single-cell level, although extensive analysis of monaural MSO responses in gerbils provided no support for the stereausis idea [31]. Thus, not fully implausible but experimentally hard to grasp, the stereausis model remains with us as a potential candidate to explain part of the CPs.

A second model that explains the origin of CPs as well as the contralateral bias in best ITDs is based on fast phase-locked inhibition. As already mentioned above, MSO cells receive two inhibitory inputs, one from the left and one from the right ear, in addition to their bilateral excitatory inputs. Blocking glycinergic inhibition in vivo revealed a shift of the best ITD toward midline (0 degrees azimuth), and thus inhibition interferes with the temporal coincidence mechanism [3]. In vitro, synaptic inhibitory currents at the MSO have been found with time constants between 1 and 2 ms [4, 24]. Thus, inhibitory currents are slower than excitatory currents, but still in a range that resolves frequencies below 1 kHz and is able to shift the best ITD of MSO neurons [30]. Inhibitory inputs from the ipsilateral ear arrive at the MSO via the LNTB, inhibition from the contralateral ear arrives via the MNTB. It was predicted theoretically that asymmetries in the amplitudes and latencies of these two pathways in general result in CPs and CDs at the MSO [21]. Nevertheless, the idea that fast glycinergic inhibition plays an important role in binaural coincidence detection remains controversial [42]. A major argument against an involvement of inhibition is that, because of its limited synaptic kinetics, it cannot explain the measured characteristic delays at relatively higher frequencies (800 to 1500 Hz) [21, 30, 33] and therefore requires further mechanisms. Proper measurements of synaptic currents in vivo have so far been impossible for technical and anatomical reasons, and thus this conflict awaits its final experimental resolution. Conversely, for frequencies around 500 Hz the fast inhibition model can doubtlessly explain the experimentally measured range of ITD-sensitive responses [21, 30].

In our opinion, all (known and yet unknown) cellular and anatomical asymmetries are able to contribute to characteristic phases and delays in mammals, including Jeffress-type delays (although they are not arranged as topographic labeled lines). However, direct inhibition has a special role in that (a) its influence has been shown experimentally (Brand et al. [3]) and (b) it is a plausible and efficient mechanism for plastic modifications [36], which is not the case for cochlear innervation patterns (stereausis) and also not for cellular-anatomical asymmetries. Plasticity of inhibition, in contrast, is able to adapt the cell on short time scales to stimulus or response history and to dynamically change the population code accordingly. Why this is useful, which objectives are served by such adaptation, and what effects this adaptation has on the population code are so far unclear. At least, the psychophysical work by Getzmann [6] suggests that adaptation may be used to improve source separation (not localization acuity) in the more active hemisphere.

Discussion from an evolutionary perspective

A problem that has received little attention in the current debate is the obvious difference between birds and some reptiles on the one hand, whose axonal delay lines and space maps seem to largely fit Jeffress’ ideas, and mammals on the other hand, which have fast inhibitory inputs and use a still unresolved coding strategy. However, as so often in biology, things become easier when we see them from a phylogenetic perspective [8]. There are two main observations to be stressed.

1. The fossil record, which is particularly extensive for early mammals and their ancestors, makes it very clear: The common Carboniferous ancestors of birds and mammals did not have tympanic ears and thus could not hear airborne sound. Hence, it is difficult to imagine that they would have been able to make use of an extremely precise coincidence circuit like the MSO or nucleus laminaris. In fact, tympanic ears arose about 100 million years after the amniote tetrapods separated, independently in reptiles (and their bird descendants) and early mammals. The tympanic ears also exhibit a fundamental difference: Reptiles and birds had (and still have) a single middle ear bone and internally coupled ears, which limits the transducible frequency range to low frequencies and makes use of interference between the two ears to effectively increase ITDs. In the Triassic they also were comparatively large animals that could make use of large ITDs. Mammals, in contrast, were very tiny (with maximal ITDs below 50 µs), had uncoupled ears, and had three ear bones (contrary to the implausible myth that two bones developed later, which is disproved by all fossil evidence). The presence of these three middle ear bones provided them with good high-frequency hearing already early on. In fact, all terrestrial mammals hear well above 10 kHz, and the frequency range above 20 kHz, called “ultrasound”, is well audible for all small and even for almost all large mammals.

2. Living in an acoustic low-frequency world, reptiles/birds depend on ITDs, and their representation might ideally be adjusted to the visual map since most of these animals have been diurnal. For small mammals, ITDs play a relatively minor role or none at all, since with high-frequency hearing available their head shadows yield massive interaural intensity differences (IIDs) in this frequency range. IIDs are processed by all mammals by means of a simple subtraction in the lateral superior olive (LSO), where excitatory spherical bushy cell activity from the ipsilateral side interacts with glycinergic inhibition from the contralateral side. The latter stems from the MNTB and thus is the same as the contralateral inhibitory input that impinges on the MSO. The result of this excitatory-inhibitory integration is a population rate code that is strikingly similar to the one in the MSO. Not too surprisingly, this high-frequency population code of space is not 1:1 aligned to any visual map in mammals, which have been small and nocturnal for millions of years. Moreover, even a functionally similar GABAergic feedback via presynaptic GABA-B receptors is present in the LSO and has the same effect on the population code as described for the MSO [8].

Because of these evolutionary aspects, we believe that IID processing is the primary mechanism of sound localization in mammals and that ITDs became useful only later on, because of increased head sizes. They were not processed completely differently, but by using similar principles. Moreover, the LSO and its components already generate some rudimentary epiphenomenal ITD sensitivity (although less precise than the MSO) and thus could have served as a template for building an MSO [10]. However, it is still unclear how MSO and LSO are developmentally and phylogenetically connected. At least on a conceptual level, they seem to share mechanisms and coding principles. A deep understanding of their joint evolutionary origin would thus be of fundamental importance to further enhance our understanding of sound localization in mammals.