Introduction

Imagine being at a party with loud music playing and people cheering and chatting here and there. An old friend is telling you about his latest experiments, and only by closely watching his lips do you manage to understand the details. This well-known, and in fact well-investigated, situation nicely illustrates how we often rely on combined sensory input to correctly perceive our environment. Indeed, it is the combination of sensory information that is important for an authentic and coherent perception (Adrian 1949; Stein and Meredith 1993). Many studies under controlled laboratory conditions describe how multisensory input can facilitate behavior by speeding reaction times (Hershenson 1962; Gielen et al. 1983), improving detection of faint stimuli (Frens and Van Opstal 1995; Driver and Spence 1998; McDonald et al. 2000; Vroomen and de Gelder 2000), or even changing the quality of the sensory percept, as in illusions such as the ventriloquist, the McGurk, or the parchment-skin illusion (Howard and Templeton 1966; McGurk and MacDonald 1976; Jousmaki and Hari 1998; Shams et al. 2000; Guest et al. 2002). And concerning the above example of the cocktail party, a classic study found that watching the speaker provides a hearing improvement equivalent to raising the sound intensity by about 15–20 dB (Sumby and Pollack 1954). Research on how our brain merges evidence from different modalities is key to understanding how sensory percepts arise and, as recent findings suggest, might change our view of the organization of sensory processing.

Given the manifold impact of sensory integration on perception and behavior, much work is devoted to the questions of where and how this occurs in the brain. Earlier studies found little evidence for cross-modal interactions at early stages of processing and promoted a hierarchical view, suggesting that sensory information converges only in higher association areas and specialized subcortical structures (Jones and Powell 1970; Felleman and Van Essen 1991; Stein and Meredith 1993). These association areas include the superior temporal sulcus, the intra-parietal sulcus and regions in the frontal lobe (Fig. 1), and abundant functional and anatomical studies support cross-modal interactions in these regions (Benevento et al. 1977; Hyvarinen and Shelepin 1979; Bruce et al. 1981; Rizzolatti et al. 1981; Hikosaka et al. 1988; Graziano et al. 1994, 1999; Cusick et al. 1995; Fogassi et al. 1996; Seltzer et al. 1996; Duhamel et al. 1998; Banati et al. 2000; Calvert et al. 2000; Fuster et al. 2000; Macaluso et al. 2000; Bremmer et al. 2002; Beauchamp et al. 2004; van Atteveldt et al. 2004; Barraclough et al. 2005; Saito et al. 2005; Schlack et al. 2005; Sugihara et al. 2006; Avillac et al. 2007).

Fig. 1
figure 1

Association areas implicated in sensory integration. The (subcortical) superior colliculus (sc) is shown in light colors, and the dashed gray lines indicate regions where sulci were “opened”. See text for a list of references reporting sensory integration in these areas

The observation that these association areas cover only a small portion of the cortical mantle (see Fig. 1) suggests that either most of sensory cortex is indeed devoted to the processing of a single modality, or that the hierarchical picture misses some of the areas involved in cross-modal processing. Indeed, accumulating evidence challenges this view and suggests that areas hitherto regarded as unisensory can be modulated by stimulation of several senses (Foxe and Schroeder 2005; Ghazanfar and Schroeder 2006). Here we review this evidence using the auditory cortex as a model system. However, before diving into the evidence, it is helpful to consider the criteria that are frequently employed to identify sensory integration.

Functional criteria for sensory integration

Multisensory integration is a frequently used term that is often left without an exact definition. As a result, some researchers understand it as a higher-level cognitive operation that merges different sensory evidence into a coherent percept, while others refer to specific response properties of neuronal activity. Concerning the study of neuronal responses, the term sensory integration is heavily influenced by pioneering studies in the superior colliculus, a subcortical convergence zone for sensory information (Stein and Meredith 1993). This structure is involved in orienting the eyes towards salient points in the sensory environment, and a series of studies carefully scrutinized how neurons in the superior colliculus respond to auditory, visual and somatosensory cues. In this context, sensory (or cross-modal) convergence can be defined as occurring if (for a given neuron) a response can be elicited by stimuli from different sensory modalities presented in isolation, or if the activity elicited by one stimulus can be modulated (enhanced or depressed) by a stimulus from another modality. Such a response modulation is also called a cross-modal interaction, as the activities elicited by the different stimuli interact to collectively determine the neuron's response to the combined stimulus. Neurons that show cross-modal convergence or interactions are defined as multisensory neurons, as their responses can be affected by several sensory modalities. Based on the study of such neurons' response properties, a number of principles for sensory integration were derived.
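In single-neuron studies, such interactions are commonly quantified as the percent change of the multisensory response relative to the strongest unisensory response (the classic enhancement index of the superior colliculus literature). The following minimal sketch, with illustrative numbers of our own choosing, shows the computation:

```python
def enhancement_index(resp_multi, resp_best_uni):
    """Percent change of the multisensory response relative to the
    strongest unisensory response: positive values indicate cross-modal
    enhancement, negative values indicate depression."""
    return 100.0 * (resp_multi - resp_best_uni) / resp_best_uni

# Example: 12 spikes/s to the combined stimulus vs. 8 spikes/s to the
# better of the two unisensory stimuli -> +50% enhancement.
print(enhancement_index(12.0, 8.0))   # 50.0
print(enhancement_index(6.0, 8.0))    # -25.0 (response depression)
```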

A first principle pertains to the spatial arrangement of sensory stimuli. Neurons in the superior colliculus usually respond to stimulation only within a restricted spatial region. For example, visual responses are limited to stimuli within a restricted region of the visual field and auditory responses are limited to sounds originating from a range of directions. For multisensory neurons, the receptive fields of the different modalities usually overlap and only stimuli falling within this overlap lead to an enhanced response; stimuli falling outside the overlap often cause response depression (the principle of spatial coincidence) (Stein 1998).

A second principle posits that the sensitivity of neurons to cross-modal enhancement depends on the relative timing of both stimuli (the principle of temporal coincidence) (Stein and Wallace 1996). Only stimuli that occur in close temporal proximity cause response enhancement; stimuli that are well separated in time elicit their normal unisensory responses. Together with the principle of spatial coincidence, this implies that cross-modal interactions are specific to stimuli that could plausibly originate from the same source.

A third principle suggests that the strength of response modulation depends on the efficacies of the unisensory stimuli in driving the neuron (the principle of inverse effectiveness). Stimuli that by themselves elicit strong responses usually cause little cross-modal interaction, while stimuli that elicit weak responses can cause strong interactions when presented simultaneously (Stein and Wallace 1996; Perrault et al. 2003; Stanford et al. 2005; Stanford and Stein 2007). Importantly, this principle suggests a direct link between neuronal activity and the behavioral benefits of sensory integration. At the level of behavior, the benefit of combining sensory evidence is highest when each sense alone provides only little information about the environment. Assuming that stronger responses also convey more information about the stimulus, this translates into stronger response enhancement for weak neuronal responses.
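Inverse effectiveness follows naturally if one assumes a saturating neuronal response function. The sketch below is our illustration of that idea, using a hypothetical saturating nonlinearity with arbitrary parameters; it is not a model taken from the studies cited above:

```python
def response(drive, r_max=50.0, half=1.0, n=2.0):
    """Firing rate for a given stimulus drive (saturating nonlinearity)."""
    return r_max * drive**n / (drive**n + half**n)

# Treat the multisensory condition as doubling the effective drive and
# compare the relative gain for weak versus strong unisensory stimuli.
for drive_uni in [0.2, 0.5, 1.0, 2.0]:
    r_uni = response(drive_uni)
    r_multi = response(2 * drive_uni)
    gain_pct = 100 * (r_multi - r_uni) / r_uni
    print(f"unisensory drive {drive_uni:.1f}: enhancement {gain_pct:6.1f}%")

# Weak stimuli sit on the steep part of the curve and show large relative
# enhancement; strong stimuli operate near saturation and gain little.
```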

These three principles have turned into a set of criteria that are often applied in order to decide whether a particular effect indeed reflects sensory integration. Although these principles were derived from the activity of individual neurons in the superior colliculus, they are often applied to other measures of neuronal activity, such as functional imaging (Calvert 2001; Beauchamp 2005; Laurienti et al. 2005), and in other regions of the brain.

Sensory convergence in “unisensory” cortical areas: functional evidence

In contrast to the hierarchical picture of sensory processing, several studies suggest that areas classically regarded as unisensory also show patterns of sensory convergence and integration. This phenomenon is often described as “early sensory integration”, as it proposes that cross-modal effects occur early in time during the response and in areas that are generally regarded as lower (early) in the sensory hierarchy (Schroeder et al. 2004; Foxe and Schroeder 2005; Ghazanfar and Schroeder 2006). It is worth noting that several decades ago some studies already suggested cross-modal interactions in lower sensory areas, but these were overshadowed by the mass of studies suggesting otherwise (Lomo and Mollica 1959; Murata et al. 1965; Bental et al. 1968; Spinelli et al. 1968; Morrell 1972; Fishman and Michael 1973; Vaudano et al. 1991).

Electro- and magnetoencephalography studies (EEG and MEG) reported changes of evoked responses over sensory areas that occurred shortly after stimulus onset as a result of combining stimuli of different modalities. For example, one study reported enhancement of auditory evoked responses when an additional somatosensory stimulus was applied to a hand (Murray et al. 2005). This cross-modal enhancement reached significance after 50 ms, suggesting that the effect occurs already during the first feed-forward sweep of processing. Similar observations were made for a range of combinations of the different modalities (Giard and Peronnet 1999; Fort et al. 2002a, 2002b; Molholm et al. 2002), and in addition, EEG studies suggested neuronal correlates of well-known multisensory illusions over classical auditory areas, e.g., for the McGurk effect (Colin et al. 2002; Mottonen et al. 2002). However, the coarse spatial resolution of these methods leaves doubts about the localization of such effects, calling for methodologies with better spatial resolution.

Functional imaging of the blood-oxygenation level-dependent response (fMRI-BOLD) provided good insight into which areas of the brain might be part of an early multisensory network (Calvert 2001). Prominent examples come from auditory cortex. For this system, several studies revealed that visual stimuli modulate (usually enhance) auditory activity (Pekkola et al. 2005a; Lehmann et al. 2006) and, to a certain degree, might also activate auditory cortex by themselves (Calvert et al. 1997; Bernstein et al. 2002). Many of these studies relied on audio-visual speech or communication signals (Calvert et al. 1999; van Atteveldt et al. 2004), suggesting that this class of stimuli might engage circuits that are particularly prone to cross-modal influences. However, similar cross-modal effects were also reported for combinations of auditory and somatosensory modalities (Foxe et al. 2002; Schurmann et al. 2006).

Several imaging studies proposed that cross-modal influences on auditory cortex occur at the earliest stages, possibly even in primary auditory cortex (Calvert et al. 1997). Yet, to fully support such claims, one needs to faithfully localize individual auditory fields in the same subjects that show cross-modal influences. For auditory cortex this can be a problem, as many of the auditory fields are rather small and vary in position across subjects (Hackett et al. 1998; Kaas and Hackett 2000); group-averaging techniques in particular can easily “blur” across distinct functional areas (Rademacher et al. 1993; Crivello et al. 2002; Desai et al. 2005). To overcome these limitations, we employed high-resolution imaging of the macaque monkey (Logothetis et al. 1999), in combination with a recently developed approach to localize individual fields in auditory cortex (Petkov et al. 2006) (Fig. 2a, b). This technique yields a functional parcellation of auditory cortex into its distinct functional fields, in much the same way as retinotopic mapping is used to obtain a map of the different visual areas (Engel et al. 1994).
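As a rough illustration of how such a frequency-preference map can be computed, each voxel can be assigned the localizer frequency that drives it most strongly. The sketch below is our simplified stand-in with random data; array shapes and smoothing parameters are hypothetical and do not reproduce the actual analysis of Petkov et al. (2006):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

tone_freqs_khz = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])  # six localizer tones

# responses: one activation estimate per frequency and voxel (here: random
# stand-in data on a 64 x 64 imaging slice).
responses = np.random.rand(len(tone_freqs_khz), 64, 64)

best_freq_idx = responses.argmax(axis=0)        # preferred frequency per voxel
best_freq_map = tone_freqs_khz[best_freq_idx]   # tonotopic map in kHz

# Smoothing in log-frequency space makes the high-low-high frequency
# gradients that separate neighboring fields easier to see.
smoothed_log_map = gaussian_filter(np.log2(best_freq_map), sigma=1.5)
```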

Fig. 2
figure 2

Functional imaging of sensory convergence in monkey auditory cortex. a Functional images were acquired parallel to the temporal plane, in order to maximize resolution and signal-to-noise over the auditory regions. b To functionally localize many of the known auditory fields, a functional parcellation was obtained using different localizer sounds such as pure tones and band-passed noise (Petkov et al. 2006). Left panel: voxels significantly preferring low or high sound frequencies. Middle panel: a smoothed frequency-preference map obtained using six different sound frequencies. Right panel: sketch of the functional parcellation of auditory cortex. Prominent regions are the core, which receives strong driving projections from the thalamus; the belt, which encompasses many secondary regions; and the parabelt, which mostly comprises auditory association cortex. A1: primary auditory cortex; CM: caudo-medial field; CL: caudo-lateral field; MM: medio-medial field in the medial belt. c Visual stimulation enhances auditory activations in the caudal field (Kayser et al. 2007). Middle panel: activation map for naturalistic sounds, with a superimposed functional parcellation from this animal. Left panel: time courses for two regions in auditory cortex (see arrows). The upper example shows similar responses to auditory and audio-visual stimulation; the lower example shows stronger (enhanced) responses in the audio-visual condition. Right panel: summary across many experiments with alert and anaesthetized animals; shaded fields consistently exhibited audio-visual enhancement. d Touch stimulation enhances auditory activations in the caudal belt (Kayser et al. 2005). Left panel: example map showing a discrete region with auditory-somatosensory response enhancement (blue), and activation to tone stimuli (red) used for functional localization of primary fields. Right panel: summary across experiments with anaesthetized animals; shaded fields consistently exhibited audio-tactile enhancement

Combining visual and somatosensory stimuli with various sounds, we were able to reproduce previous findings that visual and somatosensory stimulation can enhance auditory activations within restricted regions of auditory cortex (Kayser et al. 2005, 2007) (Fig. 2c, d). Our results clearly demonstrate that these cross-modal influences occur only at the caudal end and mostly in the auditory belt and parabelt (secondary and association cortex). The functional parcellation of auditory cortex allowed us to localize these cross-modal interactions to the caudo-medial and caudo-lateral fields (CM, CL), portions of the medial belt (MM) and the caudal parabelt. To test the functional criteria of sensory integration, we manipulated the temporal alignment of acoustic and touch stimulation, and altered the effectiveness of the auditory stimulus. The results demonstrated that both the principle of temporal coincidence and the principle of inverse effectiveness are met in this paradigm, supporting the notion that these effects show the typical patterns of sensory integration. Combined with the human imaging studies, our findings provide strong support for the notion that early auditory cortical areas are indeed susceptible to cross-modal influences. To answer the questions of where this influence originates and how it manifests at the level of individual neurons, complementary studies using anatomical and electrophysiological methods are required.

Anatomical evidence for early convergence

The functional evidence for early cross-modal interactions is paralleled by increasing anatomical evidence for multiple sources of these effects. In short, all types of anatomical connections, regardless of whether they are of feed-forward, lateral or feed-back type, have the potential to provide cross-modal inputs to early sensory cortices. Best studied are feed-back projections from classical association areas, which reach down to primary and secondary auditory cortex (Barnes and Pandya 1992; Hackett et al. 1998; Romanski et al. 1999; Smiley et al. 2007), to primary and secondary visual areas (Falchier et al. 2002) and to somatosensory cortex (Cappe and Barone 2005).

Most notably, recent studies demonstrated cross-connections between different sensory streams and found projections from auditory areas to primary and secondary visual cortex in the macaque monkey (Falchier et al. 2002; Rockland and Ojima 2003). While some of these projections arise from the auditory core (primary auditory cortex), demonstrating direct cross-connections between primary sensory cortical areas, most of them originate in the auditory parabelt (auditory association areas). Overall these projections are sparse, with sometimes only a dozen neurons labeled per slice, and not much is known about their specific targets. Yet, they show signs of functional specificity, prominently targeting the peripheral representation and the lower visual hemifield. Although such specificity seems unexpected, it might be related to species-specific habits, such as the manipulation of objects in the hands in primates. Similar projections from visual to auditory cortex were recently demonstrated in the ferret, where primary auditory cortex receives considerable projections from higher visual areas and also weaker projections from primary visual cortex (Bizley et al. 2006) (see Budinger et al. 2006 for similar results in the Mongolian gerbil). Along the same lines, a recent study described possible routes for somatosensory input to auditory cortex (Smiley et al. 2007). Smiley and colleagues found that the caudal auditory belt receives input from the granular insula, the retroinsula, as well as higher somatosensory areas in the parietal lobe, suggesting that lateral input from higher somatosensory processing stages is surprisingly prominent in the auditory belt.

In addition to these cortico-cortical connections, there is a range of subcortical nuclei that could relay cross-modal signals to sensory cortices. Many of the intralaminar nuclei (e.g., the suprageniculate or the limitans nucleus), the koniocellular matrix neurons and forebrain structures project diffusely to the sensory cortices (Morel and Kaas 1992; Pandya et al. 1994; Jones 1998; Zaborszky and Duque 2000; Zaborszky 2002; Budinger et al. 2006; de la Mothe et al. 2006). Again the caudal auditory cortex can serve as a good model system, and a recent study delineated how several multisensory thalamic structures could send somatosensory signals to the auditory belt (Hackett et al. 2007). Beyond this, the thalamus also provides more complex means for the interaction of different sensory streams. For example, different sensory streams can cross-talk via thalamic nuclei such as the thalamic reticular nucleus (TRN), and recent studies provided interesting insights into how this structure might facilitate interactions between sensory streams and with association areas in the prefrontal cortex (Crabtree et al. 1998; Crabtree and Isaac 2002; Zikopoulos and Barbas 2006): both somatosensory and motor-related thalamic nuclei were found to send projections to, and receive projections from, overlapping regions in the TRN, allowing them to modulate each other. Although not directly demonstrated so far, such intra-thalamic pathways could also link different sensory modalities, allowing different sensory streams to interact via thalamo-cortical loops.

Although anatomical studies revealed a number of candidate routes for cross-modal input to early sensory cortices, there is no clear relationship between a specific connection and a specific functional finding. At present, the lack of understanding makes it hard to incorporate the early cross-modal interactions into current frameworks of sensory processing, and each of the functional studies reviewed above points to several of these connections as a presumptive source. It could well be the case that different types of cross-modal interactions in a given sensory area are mediated by distinct connections; this for example seems to be the case for visual and somatosensory inputs to auditory areas (Schroeder and Foxe 2002). However, the present knowledge about the anatomical projections that mediate early cross-modal interactions is not conclusive enough to advance our understanding about their function.

Electrophysiological studies of early cross-modal interactions

The evidence from functional imaging studies is supported by a growing body of electrophysiological data. For example, local field potential responses to audio-visual communication signals are enhanced when a sound is complemented by a visual stimulus (Ghazanfar et al. 2005). In this study, conspecific vocalizations were paired with a video showing the animal producing this vocalization, in analogy to human studies using audio-visual speech (Fig. 3a). The cross-modal interaction was very prominent, including more than 70% of the recording sites in the auditory core (primary auditory cortex) and nearly 90% in the auditory belt (secondary auditory cortex). In addition, the interaction was found to be sensitive to the temporal alignment of the auditory and visual components, in agreement with the principle of temporal coincidence.

Fig. 3
figure 3

Visual stimuli modulate neuronal activity in auditory cortex. a Combining auditory and visual conspecific vocalization signals enhances auditory population responses in the auditory belt. Top panel: example of an audio–visual movie showing a vocalizing macaque monkey (producing a coo vocalization). Lower panels: local field potentials recorded from two sites. The upper example shows response enhancement, the lower shows response depression; neither showed a significant response to visual stimulation alone. Adapted with permission from Ghazanfar et al. (2005). b Recordings of multi-unit (AMUA) and single-unit responses to different naturalistic audio–visual movies in the caudal belt. No example shows a clear visual response, but all show multisensory response depression (Kayser and Logothetis, unpublished data)

To probe in more detail whether and how individual auditory neurons can be modulated by visual stimuli, we adapted our fMRI paradigm to electrophysiological experiments. Recording in different caudal fields, we could not find a clear impact of a purely visual stimulus on neuronal responses. However, the visual stimulus clearly modulated the auditory responses of many neurons, although often in a subtle way. Remarkably, for most neurons this audio–visual interaction resulted in a decrease of the response (Fig. 3b).

Such a depression of auditory responses by visual stimulation was also observed in many fields of the ferret auditory cortex (Bizley et al. 2006). Using simple stimuli such as light flashes and noise bursts, and recording in anaesthetized animals, the authors found that even in A1 more than 15% of the neurons showed cross-modal effects, and this fraction increased further for higher auditory fields. The audio–visual interaction depended on the relative timing of both stimuli, and some neurons were responsive only to restricted areas of the visual field, providing evidence for the principles of spatial and temporal coincidence.

Physiological studies also convincingly revealed somatosensory input to auditory cortex. While a subset of neurons in the caudo-medial field responds to cutaneous stimulation of the head and neck (Fu et al. 2003), the influence of the somatosensory stimulus is even more compelling at the level of subthreshold activity (Schroeder et al. 2001, 2003; Schroeder and Foxe 2002). Recording current source densities and multi-unit activity, Lakatos and colleagues delineated the mechanism by which a somatosensory stimulus enhances auditory responses (Lakatos et al. 2007) (Fig. 4). While the somatosensory stimulus by itself does not increase neuronal firing rates, it resets the phase of the ongoing slow-wave activity. This phase reset ensures that a simultaneous auditory stimulus arrives at the phase of optimal excitability. In this way, an auditory stimulus that is paired with a simultaneous somatosensory stimulus elicits stronger neuronal responses than an auditory stimulus presented in isolation. In addition, this effect is spatially specific with respect to the hand receiving the somatosensory stimulus, and depends on the efficacy of the auditory stimulus, in agreement with the principles of inverse effectiveness and spatial coincidence. Most impressively, however, the authors found a strong relationship between the timing of auditory and somatosensory stimulation: only when the auditory stimulus was synchronous with, or followed the somatosensory stimulus after a multiple of a certain oscillation cycle, was the response enhanced (Fig. 4). It should be noted that these data were obtained from primary auditory cortex, the first stage of auditory processing in the cortex. All in all, electrophysiological studies are discovering a growing number of stimulation paradigms in which early cortical sensory areas are modulated by cross-modal input. They thereby provide detailed means to understand the neuronal basis of the cross-modal interactions frequently observed in human imaging studies.
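The logic of this phase-reset mechanism can be caricatured in a few lines. In the sketch below (our illustration, with a cosine standing in for ongoing excitability and an idealized 30 ms gamma cycle), an auditory input arriving at multiples of the cycle after the reset lands at peak excitability, while intermediate delays land at the trough:

```python
import math

def excitability(soa_ms, cycle_ms=30.0):
    """Excitability encountered by an auditory input arriving soa_ms after
    a somatosensory-induced phase reset (+1: peak, -1: trough)."""
    return math.cos(2.0 * math.pi * soa_ms / cycle_ms)

for soa in [0, 15, 30, 45, 60]:
    e = excitability(soa)
    print(f"SOA {soa:3d} ms -> excitability {e:+.2f} "
          f"({'enhanced' if e > 0 else 'suppressed'})")

# SOAs of 0, 30 and 60 ms (multiples of the cycle) hit peak excitability;
# 15 and 45 ms hit the trough, mirroring the alternation in Fig. 4d.
```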

Fig. 4
figure 4

Somatosensory stimulation enhances activity in auditory cortex. a Time course of the multi-unit response. The response to the combined auditory-somatosensory stimulus is stronger than the arithmetic sum of the auditory and somatosensory responses, indicating multisensory response enhancement. b Demonstration of the principle of inverse effectiveness. For auditory stimuli that are only weakly effective in eliciting a response (low sound intensities), there is a significant enhancement of the response when the somatosensory stimulus is added. c Spatial specificity of the audio–somatosensory interaction. Combining the (binaural) auditory stimulus with a somatosensory stimulus on the side ipsilateral to the recording location leads to response depression, while a somatosensory stimulus on the contralateral side leads to response enhancement. d Temporal profile of response enhancement. While simultaneous presentation of auditory and somatosensory stimuli leads to response enhancement, altering the stimulus onset asynchrony (SOA) has a variable effect. Response enhancement occurs when the SOA reaches multiples of a gamma (30 ms) or theta (140 ms) oscillation cycle, while response depression occurs for intermediate values. Adapted with permission from Lakatos et al. (2007)

Complementary evidence from imaging and electrophysiology

While functional imaging and electrophysiological studies nicely complement each other in terms of spatial and temporal resolution, they sometimes provide conflicting results when applied to cross-modal paradigms. Most functional imaging studies report cross-modal enhancement, in line with the classical thinking about sensory integration. For example, visual modulation of primary auditory cortex was found to enhance auditory BOLD responses (see Fig. 2c or Calvert et al. 1997; Calvert 2001; Pekkola et al. 2005b; Lehmann et al. 2006). At the level of neuronal activity, however, the evidence is more diverse: while local field potentials and current source densities seem to be enhanced (Schroeder and Foxe 2002; Ghazanfar et al. 2005), the firing rates of many individual neurons show suppression (see Fig. 3b and Bizley et al. 2006). This conflicting evidence might result from slight differences in the individual paradigms, but it might also reflect the different sources of the respective signals.

The fMRI-BOLD signal reflects neuronal activity only indirectly, via neurovascular coupling and neurotransmission-triggered changes in blood flow and blood oxygenation. Given our current knowledge, the BOLD signal reflects the metabolic correlate of the aggregate synaptic activity in a local patch of cortex (Logothetis et al. 2001; Logothetis and Wandell 2004; Lauritzen 2005). As a result, the BOLD signal could reflect the sum of both excitatory and inhibitory synaptic activity, while the response of an individual neuron reflects the imbalance between its excitatory and inhibitory afferents. For example, cross-modal interactions could involve inhibitory interneurons, which would result in an enhancement of local synaptic activity but might decrease the activity of the pyramidal neurons towards which prototypical neurophysiological experiments are biased (Towe and Harding 1970; Logothetis and Wandell 2004). As a result, fMRI-BOLD studies (and also measurements of local field potentials or current source densities) might detect cross-modal enhancement, while single-unit recordings would find response depression.
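A toy calculation makes this dissociation explicit. Under the simplifying assumption (ours, not a claim from the literature) that population signals such as BOLD or LFP scale with total synaptic activity E + I, whereas pyramidal output tracks the balance E - I, a cross-modal input that recruits mostly inhibition enhances the former while suppressing the latter:

```python
# Synaptic drive (arbitrary units) onto a local circuit.
E_aud, I_aud = 10.0, 4.0     # auditory stimulus alone
dE_vis, dI_vis = 1.0, 3.0    # additional drive recruited by a visual stimulus

pop_alone = E_aud + I_aud                       # 14.0
pop_both = (E_aud + dE_vis) + (I_aud + dI_vis)  # 18.0 -> population enhancement
spk_alone = E_aud - I_aud                       # 6.0
spk_both = (E_aud + dE_vis) - (I_aud + dI_vis)  # 4.0 -> spiking suppression

print(f"population signal: {pop_alone} -> {pop_both}")
print(f"net pyramidal drive: {spk_alone} -> {spk_both}")
```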

While this scenario is speculative, recent data provide compelling evidence that inhibitory circuits might be important for mediating cortical cross-modal interactions (Dehner et al. 2004; Meredith et al. 2006). The cat ectosylvian cortex contains separate regions dominated by either the auditory (region FAES) or the somatosensory modality (region SIV) that are connected via direct anatomical projections. Yet, electrical stimulation of the auditory field (FAES) by itself does not elicit responses in the somatosensory field (SIV). Only when neurons in SIV are driven by stimulation of their somatosensory receptive fields does electrical stimulation of FAES lead to a reduction of their firing rates. Hence, this type of cross-modal effect is only visible when neurons are driven by their dominant modality, and proper controls confirmed the inhibitory (GABAergic) nature of these cross-modal interactions (Dehner et al. 2004; Meredith et al. 2006). Notably, the interaction observed in these studies increased monotonically with the strength of electrical FAES stimulation, quite in contrast to what might be expected from the principle of inverse effectiveness (Dehner et al. 2004).

Cross-modal interactions in cortex and superior colliculus

Given the growing number of studies providing evidence for cross-modal interactions in early sensory cortices, there remains little doubt as to their existence. Yet, current results clearly show that, when considering individual neurons, these effects can be rather subtle and are often only detectable when the right stimulus is chosen. As a result, such cross-modal interactions are best detected at the level of population analyses, or in direct population responses such as multi-unit activity and local field potentials. In addition, cross-modal interactions can be restricted to particular areas within a sensory system (e.g., Kayser et al. 2005, 2007), which makes them hard to detect unless the right region is sampled or spatially resolved imaging techniques are used. However, within those cortical regions exhibiting cross-modal interactions, their frequency can be very high. For example, Lakatos and coworkers reported auditory-somatosensory interactions at nearly every recording site in A1 (Lakatos et al. 2007), and Ghazanfar and colleagues found audio–visual interactions at 70% of the sites in the same area (Ghazanfar et al. 2005). These numbers are much larger than the roughly 30% of recording sites showing multisensory responses in the (monkey) superior colliculus (Stein 1998). Altogether, this suggests a number of differences between the patterns of sensory integration found in cortex and in the superior colliculus.

Though much of our understanding of how individual neurons merge sensory information is derived from studies of the superior colliculus, it is a rather specialized subcortical structure involved in motor planning and detecting salient events (Krauzlis et al. 2004). Neurons in this structure accumulate evidence across space, time and modalities, and co-localized multi-modal features belonging to the same object reinforce each other to attract attention. This is reflected in the typical response enhancement seen for stimuli originating from the same source (Alvarado et al. 2007). Yet, the function of the superior colliculus differs from that typically attributed to sensory cortices.

Sensory cortical areas are engaged in analyzing and integrating multiple features of the same modality in order to form a representation of the sensory environment. For this, they synthesize information about sensory objects from the different features that lie within their receptive fields. Such intra-modal combinations often result in responses that reach about the average response of both features presented in isolation (Riesenhuber and Poggio 2000). Hence, cortical neurons rarely show supra-linear effects when several features of the dominant modality are presented simultaneously within their receptive fields.

From these observations we infer two differences that might be critical for a better understanding of the cross-modal interactions in cortex and superior colliculus. First, virtually all neurons in sensory cortex are dominated by the respective modality, suggesting that cross-modal influences in cortex are of a more modulatory kind, i.e., they impose small modulations on the sensory-evoked activity. This is in contrast to the classical multisensory structures, where many multisensory neurons can be driven by several modalities and multisensory interactions sometimes change firing rates by more than an order of magnitude. And second, while multisensory neurons form only a subset of the neurons in the superior colliculus (Stein 1998) or the higher temporal and frontal association cortices (Benevento et al. 1977; Hyvarinen and Shelepin 1979; Bruce et al. 1981; Rizzolatti et al. 1981; Hikosaka et al. 1988; Graziano et al. 1994), cross-modal interactions seem to be rather widespread within the sensory cortices. Hence, while in the superior colliculus only a subset of the neurons shows strong cross-modal interactions, many neurons in sensory cortex seem to be weakly modulated by cross-modal input. This difference might not only hint at the neuronal mechanisms that mediate the respective cross-modal interactions but should also allow further insights into their function.

Conclusion

Many studies reporting cross-modal interactions in early sensory cortices propose that the observed effects constitute a correlate of sensory integration. This claim is largely supported by the finding that many of these effects obey the traditional criteria imposed on sensory integration and show specificity to the temporal or spatial alignment of the stimuli. Yet, it is questionable whether such a simple characterization merits the term sensory integration, especially given the lack of evidence that any of these effects aid behavior or increase the sensory system's ability to scrutinize its environment.

Behavioral studies of sensory integration, for example those described in the introduction, usually differentiate two stages of sensory combination and integration (Ernst and Bulthoff 2004): first, sensory combination increases the information about the environment by merging non-redundant information provided by the different senses; second, sensory integration reduces the uncertainty in the internal representation of the stimulus, which then improves the behavioral reaction. With regard to the early integration effects, none of these points has been tested so far. To merit the term early integration, it needs to be verified that sensory representations indeed gain information about the environment, or become more reliable, by receiving cross-modal inputs. Granted, such studies are not easy, but they are of great importance for understanding the relevance of these cross-modal interactions for sensory perception and behavior.
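To make the benefit of integration concrete, the maximum-likelihood view associated with this framework (Ernst and Bulthoff 2004) can be summarized as follows; this is our sketch of the standard model, assuming two unbiased unisensory estimates with independent Gaussian noise:

```latex
\hat{s}_{AV} = w_A \hat{s}_A + w_V \hat{s}_V, \qquad
w_i = \frac{1/\sigma_i^2}{1/\sigma_A^2 + 1/\sigma_V^2}, \qquad
\sigma_{AV}^2 = \frac{\sigma_A^2\,\sigma_V^2}{\sigma_A^2 + \sigma_V^2}
\le \min\!\left(\sigma_A^2, \sigma_V^2\right)
```

The variance of the combined estimate is never larger than that of the more reliable sense alone; demonstrating an analogous reliability gain in neuronal representations is what the term early integration would ultimately require.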

Despite this skepticism, we believe that it makes sense for the brain to provide early sensory pathways with information about stimuli impinging on the other senses. One way to think about this is to imagine engineering a complex sensory device. Of course, one could split the system into different, separate modules, each handling and analyzing the input provided by one of the sensors; only a final stage would then merge the sensory pictures provided by the different modules. In many cases this might work and result in a uniform sensory “whole”. In some cases, however, each module might come up with a distinct solution and the final merging might be impossible, e.g., when the visual system sees only a house but the auditory system hears a dog. To prevent such mismatching sensory percepts, one could introduce a mechanism that ensures consistent processing in the different sensory modules, and that selectively enhances the signal-to-noise ratio for objects common to the different sensory modalities; of course, some (or the same) mechanism first needs to decide whether a given object provided common input to more than a single sense. Such a mechanism might use predictions generated by higher stages to bias the ongoing processing in each sensory module via some sort of contextual or feedback signal, as sketched below.
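The following toy sketch (entirely our illustration; all names and numbers are hypothetical) shows what such a consistency signal could look like: each module reports evidence for candidate objects, and a feedback stage boosts the signal for objects reported by more than one sense:

```python
def consistency_feedback(visual, auditory, gain=0.5):
    """Boost evidence for objects reported by both modules by a fixed gain."""
    shared = visual.keys() & auditory.keys()
    boost = lambda d: {obj: v * (1 + gain) if obj in shared else v
                       for obj, v in d.items()}
    return boost(visual), boost(auditory)

visual = {"dog": 0.3, "house": 0.6}   # weak visual evidence for a dog
auditory = {"dog": 0.7}               # clear barking

print(consistency_feedback(visual, auditory))
# "dog" is enhanced in both modules (cross-modally consistent),
# while "house" is left untouched.
```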

The idea that feedback facilitates processing in lower sensory areas by using predictions generated in higher areas is not novel (Ullman 1995; Rao and Ballard 1999). Especially the visual system, with its rather well-known connectivity, has served as a model for studying the contribution of feedback to sensory processing (Lamme and Roelfsema 2000; Martin 2002). Given the larger receptive fields at higher stages, feedback is thought to reflect global signals that are integrated into the local processing of lower areas, i.e., it allows combining local fine-scale analysis with large-scale scene integration (Bullier 2001; Sillito et al. 2006). In addition, lateral projections from neurons at the same processing stage also provide contextual modulation of the ongoing processing (Levitt and Lund 1997; Somers et al. 1998; Moore et al. 1999). In an analogous way, a consistency signal from multisensory areas could modulate the processing in each sensory stream to ensure a consistent multi-modal picture of the environment. Interestingly, studies of contextual modulation in primary sensory areas suggest that the modulatory influence depends on the efficacy of the primary stimulus driving the neurons: contextual modulation often enhances responses to weak stimuli but suppresses responses to highly salient inputs (Levitt and Lund 1997; Somers et al. 1998; Moore et al. 1999), quite reminiscent of the principle of inverse effectiveness. Clearly, these ideas are still vague, but speculations like this will be necessary to advance our understanding of what cross-modal modulations actually do in early sensory cortices.