Introduction

Crossmodal influences on basic visual tasks have been extensively documented in recent years, with evidence spanning a wide range of methods, animal species and experimental paradigms (see Shams and Kim 2010; Vroomen and Keetels 2010 for reviews). Here, we focus on auditory–visual interactions that lead to enhancement of visual performance in humans. Among possible modality combinations, we consider audio-visual interactions specifically because they have been investigated most comprehensively and are commonly linked to the widespread assumption that crossmodal integration confers an adaptive advantage to organisms (Lewkowicz and Kraebel 2004; Bahrick et al. 2004). We use the term enhancement in a broad sense, to describe situations where a sound can cause faster and/or more accurate and/or more precise perception of a visual event, compared to when there is no concurrent sound. Sound-driven enhancements of vision include reports of decreased response latencies to visual targets (Miller 1982; Corneil et al. 2002), lowered detection thresholds (Caclin et al. 2011; Frassinetti et al. 2002; Gleiss and Kayser 2013; Jaekl and Harris 2009; Jaekl and Soto-Faraco 2010; Noesselt et al. 2010), decreased visual search times (Van der Burg et al. 2008), increased brightness judgments (Stein et al. 1996), increased perceived duration of brief visual stimuli (Walker and Scott 1981; Vroomen and de Gelder 2000; Van Wassenhove et al. 2008), faster motion detection (Meyer et al. 2005) and increased visual saliency (Noesselt et al. 2008). Enhancement is only one of several possible outcomes of multisensory integration and is distinguished from multisensory phenomena that confer what may be considered performance detriments or illusions (e.g. Shams et al. 2000; Sinnett et al. 2008; Thurlow and Jack 1973) or changes in information content (e.g. McGurk and MacDonald 1976). The hypothesis laid out here may well apply to these manifestations of multisensory integration arising from inter-sensory conflict, but they fall beyond the scope of the present article. We focus, instead, on crossmodally induced enhancements demonstrated in basic visual judgment tasks, because they are often used to underscore direct multisensory interactions occurring at relatively short latencies and in hierarchically early stages of processing. Such phenomena have typically been linked to physiological interactions in subcortical or primary sensory areas—defined as ‘early’, sensory-based interaction (Driver and Noesselt 2008; Stein and Stanford 2008; Shams and Kim 2010).

Perhaps surprisingly, there is little agreement on the interpretation of this common example of multisensory interaction, namely sound-driven enhancement of vision in human behavioural paradigms. For example, studies supporting such enhancement include a number of psychophysical audio-visual investigations involving subjective brightness ratings (Stein et al. 1996), along with those using visual detection tasks (Frassinetti et al. 2002; Bolognini et al. 2005; Manjarrez et al. 2007; Andersen and Mamassian 2008; Caclin et al. 2011). Such enhancements have often been measured using paradigms designed to isolate sensory-based signal combination from higher-level influences (e.g. decision, attentive state—see Ngo and Spence 2012). Sensory-level interactions are consistent with known early, low-level physiological processes (Meredith and Stein 1983; Wilkinson et al. 1996; Wallace et al. 1998; Molholm et al. 2002; Lehmann and Murray 2005; Kayser et al. 2005; Lakatos et al. 2007; Driver and Noesselt 2008; Clemo et al. 2012) and have sometimes been related to the discovery of direct (i.e. monosynaptic) cortico-cortical connections between sensory areas in anatomical studies (Falchier et al. 2002; Rockland and Ojima 2003; Cappe and Barone 2005; Smiley and Falchier 2009; Meredith et al. 2009; also see Lewis and Noppeney 2010 for fMRI-based support). Yet, a considerable number of other psychophysical studies have failed to support this early, sensory-based interpretation of sound-driven enhancement of vision (e.g. Meyer and Wuerger 2001; Marks et al. 2003; Odgaard et al. 2003; Alais and Burr 2004; Schnupp et al. 2005; Lippert et al. 2007; see also Kayser and Logothetis 2007). These studies argue convincingly instead for various alternative explanations of enhancement effects, based on other known processes such as attentional orienting, reduction in temporal uncertainty or biases at the level of decision/response (for a relevant discussion see: De Gelder and Bertelson 2003). This category includes simple alerting (see de Boer-Schellekens et al. 2013) based on unspecific subcortical–cortical interactions related to fast changes in arousal (Sturm and Willmes 2001; Maravita and Iriki 2004). For example, findings of very fast and spatially unspecific crossmodal enhancement have been attributed to such phenomena (Murray et al. 2005). Such an account, however, does not explain enhancements found when auditory stimuli follow visual target onsets (Miller 1986; Andersen and Mamassian 2008; Leone and McCourt 2013), or when enhancements are based on crossmodal correspondences in specific attribute values such as spatial frequency (Pérez-Bellido et al. 2013—see below).

Thus, sensory-level effects are not consistently confirmed and appear to arise only under certain conditions. What conditions are common to the psychophysical experiments that do support sensory-level enhancement? We believe the answer may be integral to demonstrating audio-visual enhancement and may, in part, already be present in the existing literature.

Audio-visual enhancement and visual pathways

We contend that sensory interactions facilitating perceptual enhancement do occur, and that inconsistencies in the conclusions of previous studies—sensory-level audio-visual enhancement versus alternative explanations—can arise, in part, from the characteristics of the different neural mechanisms underlying the very visual processes that are putatively enhanced by sound. In particular, we reason that early, sensory-level crossmodal influences in a variety of psychophysical tasks can depend largely on the differential involvement of specialized processing channels existing at low-level stages of visual processing (for reviews see: Livingstone and Hubel 1988; Merigan and Maunsell 1993). For example, contrast thresholds (Shapley 1990) and reaction times to visual onsets can be determined by the early magnocellular division of the visual system (M-system) (Breitmeyer 1975). These M-system properties contrast with those of the early parvocellular division (P-system), which is more efficient at processing chromatic information, high spatial frequencies and higher contrasts. The P-system is thought to subserve colour and form/pattern vision leading to object recognition and figure–ground segregation (Livingstone and Hubel 1988; Merigan 1989; Roe et al. 2012). In natural circumstances, both parvocellular and magnocellular pathways are stimulated by objects and events in the visual world, and there is extensive interaction between these pathways at various stages of cortical processing (Maunsell 1992; Schroeder et al. 1998; Saalmann et al. 2007; Nassi and Callaway 2009). Despite the importance of this division in visual processing, its broad mapping onto putative ‘dorsal’ and ‘ventral’ pathway functioning and its well-known impact in visual psychophysics, such visual properties are rarely considered explicitly in crossmodal investigations. Here, we expand on previous empirical work relating audio-visual benefit to visual pathways (Jaekl and Soto-Faraco 2010; Pérez-Bellido et al. 2013) and postulate that some discrepancies in previous findings regarding audio-visual enhancement may be resolved by considering the relative involvement and effectiveness of processing within these two visual pathways in the different experimental paradigms. Specifically, the investigations we discuss, both those confirming and those failing to confirm sensory-level crossmodal interaction, are likely to depend critically on the effectiveness of M-pathway processing.

Magnocellular-based audio-visual interactions

Auditory and visual neural responses are combined into crossmodal signals at various processing stages in different cortical and subcortical areas. An example often cited in the multisensory literature is the superior colliculus (SC), a subcortical structure supporting crossmodal sensory integration. The SC plays an integral role in controlling and executing orienting responses towards novel or behaviourally relevant stimuli—namely saccadic orienting (Lee et al. 1988; Roucoux et al. 1980). In mammals, audio-visual interaction in the SC occurs primarily in neurons within its intermediate and deep layers, which receive input from both auditory and visual modalities (May 2006) as well as from higher, extrastriate areas (see Boehnke and Munoz 2008). Importantly, the primary visual afferent to the SC consists of input from the magnocellular layers of the lateral geniculate nucleus (Berson and McIlwain 1982; Schiller et al. 1979), via primary visual cortex (V1) and direct connections from retinal ganglion cells (Garey and Powell 1968; Garey et al. 1968). This visual input subserves detection, localization and attentional orienting (Shen et al. 2011), and is mostly sensitive to transient, low spatial frequency and low-contrast stimulation (Kaplan and Shapley 1986; Plainis and Murray 2005; Schneider and Kastner 2005).

Indeed, evidence for auditory interaction with M-pathway signals in the SC is apparent in the temporal pattern of incoming signals. Auditory transduction typically occurs at shorter latencies than visual transduction (Fain 2003). Similarly, auditory SC response latencies [typically 10–44 ms—Meredith et al. 1987; Wise and Irvine 1983 (cat studies); 14 ms in primates—Wallace et al. 1996] precede visual response latencies (typically 40–70 ms in primates—Bell et al. 2006; also see Boehnke and Munoz 2008), and physiological response enhancement in the SC is consistent with overlapping discharge periods of auditory and visual responses (Meredith et al. 1987). Congruent with these physiological findings, behavioural response latencies to audio-visual stimuli have been shown to be significantly shorter than those obtained in unisensory visual conditions, as measured by manual and saccadic reaction times (Bernstein et al. 1969; Diederich and Colonius 2004; Gielen et al. 1983; Goldring et al. 1996; Harrington and Peck 1998; Hughes et al. 1994; Miller 1982; Perrott et al. 1990; Pérez-Bellido et al. 2013). Audio-visual interaction conferring such reaction time enhancement has been modelled to conform to SC response patterns (Corneil et al. 2002). Such findings imply a major role of magnocellular input in setting the sensitivity of these SC layers, as manifested in their visual response characteristics.
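A standard behavioural test for deciding whether such RT gains reflect genuine integration rather than mere statistical facilitation is Miller's (1982) race-model inequality. The following is a minimal sketch of the test on simulated data; the reaction time distributions and parameters are hypothetical placeholders, not values from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical reaction times (ms): illustrative distributions only.
rt_a = rng.normal(240, 40, 1000)   # auditory-only trials
rt_v = rng.normal(280, 45, 1000)   # visual-only trials
rt_av = rng.normal(225, 35, 1000)  # audio-visual trials

def ecdf(rts, times):
    # Empirical cumulative RT distribution F(t) evaluated at each time point.
    return (rts[:, None] <= times[None, :]).mean(axis=0)

t = np.linspace(150, 400, 26)
# Race-model bound (Miller 1982): F_AV(t) <= F_A(t) + F_V(t).
bound = np.minimum(ecdf(rt_a, t) + ecdf(rt_v, t), 1.0)
violation = ecdf(rt_av, t) - bound
# Positive values mean the AV distribution beats every race of independent
# unisensory processes, i.e. evidence for genuine signal integration.
print("max race-model violation:", violation.max())
```

Violations of the bound at fast response times are the usual psychophysical signature taken to indicate sensory-level audio-visual interaction rather than a race between independent channels.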

Audio-visual interactions in the SC depend spatially on activity patterns across receptive fields and have accordingly been found to occur most strongly for spatially aligned audio-visual components (Meredith and Stein 1996; Gepshtein et al. 2005; Meyer et al. 2005; but see Spence 2013). However, Stein et al. (1996) and Fiebelkorn et al. (2011) found audio-visual brightness enhancements for spatially discordant stimuli, suggesting such enhancement might instead result from some degree of interaction occurring at a cortical level (Lakatos et al. 2005; Schroeder and Lakatos 2009; see also Romei et al. 2012 for EEG data in humans). In agreement with Stein et al. (1996) and Fiebelkorn et al. (2011), we hypothesize that interactions between auditory responses and early visual cortical responses may contribute to enhancement for spatially disparate stimuli. At early cortical stages, audio-visual correspondences are complicated by the longer response latencies in V1 relative to A1 (V1 latency, 41–55 ms: Clark and Hillyard 1996; Foxe and Simpson 2002; Foxe and Schroeder 2005; A1 latency, 9–15 ms: Celesia 1976; Clark and Hillyard 1996; Molholm et al. 2002). Specifically, fast, contrast-sensitive magnocellular responses (Cleland et al. 1971; Cleland and Levick 1973), with their higher temporal resolution (Kulikowski and Tolhurst 1973; Kaplan and Shapley 1982), may be optimal candidates for the efficient selection of early cortical crossmodal associations underlying contrast enhancement, congruent with psychophysical findings (Jaekl and Soto-Faraco 2010; Pérez-Bellido et al. 2013).

Psychophysical interpretations of sound-induced enhancement of vision

Given the above, investigations that set out to determine behavioural enhancements of vision by sound are often likely to depend on the effective engagement of early, magnocellular processing. It is therefore notable that these studies have frequently utilized visual stimuli not explicitly designed to optimally engage the M-pathway. For example, such investigations commonly use abrupt visual stimuli well above detection threshold. Such stimuli engage the M-system's sensitivity to transient onsets, but they may leave little room for signal enhancement at the level of perceptual influence. Indeed, visual contrast response gain in the lateral geniculate nucleus can be more than an order of magnitude greater in magnocellular layers than in parvocellular layers (Kaplan and Shapley 1986). Specifically, contrast gains computed from Michaelis–Menten saturation functions show that, for achromatic stimulation at Michelson contrast values between 0 and 1, magnocellular cells have gain values (impulses per second/% change in contrast) typically between 5 and 8, whereas parvocellular cells are relatively insensitive, with values typically between 0.15 and 0.5 (Kaplan and Shapley 1986; also see Pokorny 2011). Graded responses within the early magnocellular system therefore occur only within a narrow contrast range relative to the mean luminance of the display: abrupt visual stimuli of relatively high contrast can easily saturate early magnocellular response levels, leaving contrast discrimination to be primarily determined by activation patterns in the P-system (see Pokorny 2011 for a review). Thus, although higher-contrast stimuli elicit a large magnocellular response, they easily saturate M-pathway response levels and may reduce the likelihood of multisensory-based improvement in contrast enhancement paradigms for which the level of magnocellular activation plays an integral role. That is, enhancements here are more likely to occur when additional auditory stimulation can boost a relatively weak magnocellular response above the threshold required for detection or discrimination, rather than when responses to stimuli already detectable relative to the adapted background are at most weakly modulated by sound.
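To make the asymmetry concrete, the sketch below evaluates a Michaelis–Menten contrast–response function with illustrative parameters chosen only to fall within the gain ranges cited above (they are not fitted values from Kaplan and Shapley 1986). The local slope shows how little response headroom an M-like cell retains at high pedestal contrasts.

```python
import numpy as np

def mm_response(contrast_pct, rmax, c50):
    # Michaelis-Menten saturation: R(c) = Rmax * c / (c + c50), with
    # contrast gain (initial slope) = Rmax / c50 impulses/s per % contrast.
    return rmax * contrast_pct / (contrast_pct + c50)

c = np.linspace(0.0, 100.0, 101)            # Michelson contrast, in percent
r_m = mm_response(c, rmax=100.0, c50=15.0)  # 'M-like': gain ~6.7, saturates early
r_p = mm_response(c, rmax=90.0, c50=300.0)  # 'P-like': gain ~0.3, near-linear

# Local slope dR/dc: the response increment a small contrast boost
# (e.g. a putative sound-driven gain change) would add at each pedestal.
dr_m, dr_p = np.gradient(r_m, c), np.gradient(r_p, c)
print(f"M slope at 2 % vs 50 % contrast: {dr_m[2]:.2f} vs {dr_m[50]:.2f}")
print(f"P slope at 2 % vs 50 % contrast: {dr_p[2]:.2f} vs {dr_p[50]:.2f}")
```

On these illustrative parameters, the M-like slope collapses by more than an order of magnitude between low and mid contrasts, whereas the P-like slope barely changes, which is the saturation argument made in the paragraph above.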

For example, Marks et al. (2003) and Odgaard et al. (2003) used visual stimuli in a brightness comparison task with dark-adapted participants, for which the lowest luminance level was one just-noticeable difference above the 79 % luminance detection threshold, and found no crossmodal enhancement. At this level of detection performance, additional crossmodal stimulation provided by a concurrent sound may not yield measurable brightness enhancement in a comparison task relative to threshold levels (Wilkinson et al. 1996). Additionally, Caclin et al. (2011), using a criterion-free detection paradigm, found no audio-visual improvement in detecting foveal, 11.4 cycle-per-degree Gabor patches. According to prior physiological and psychophysical literature, these stimuli were unlikely to optimally engage magnocellular responses (Kulikowski and Tolhurst 1973; Legge 1978; Wilson 1980; Tootell et al. 1988; Livingstone and Hubel 1988; Leonova et al. 2003), although enhancement was observed in a subset of participants with relatively weak performance in the unimodal, visual-only condition.
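For reference, a 79 % point is the convergence level of a standard three-down/one-up transformed staircase, which steps down only after three consecutive detections and therefore converges where p(detect)^3 = 0.5, i.e. p ≈ 0.794. The simulation below illustrates this with a hypothetical observer; whether the cited studies used this exact procedure is not specified here.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_detect(level, thresh=10.0, slope=1.5):
    # Hypothetical psychometric function for a simulated yes/no observer
    # (false alarms ignored for simplicity).
    return 1.0 - np.exp(-(level / thresh) ** slope)

level, step, run = 15.0, 1.0, 0
history = []
for _ in range(400):
    if rng.random() < p_detect(level):   # stimulus detected
        run += 1
        if run == 3:                     # three hits in a row -> step down
            level -= step
            run = 0
    else:                                # any miss -> step up
        level += step
        run = 0
    history.append(level)

# The 3-down/1-up rule converges where p(detect)**3 = 0.5, i.e. p ~ 0.794.
print("level tracking the 79 % threshold:", np.mean(history[100:]))
```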

Noesselt et al. (2010) found consistent sensory-level detection advantages attributable to audio-visual integration. The visual stimuli in this study consisted of Gabor patches calibrated to low contrasts, at 55 and 65 % detection threshold levels, and the effect was obtained only at the lower contrast level. These findings are in agreement with Stein et al. (1996), who used subjective brightness ratings in a comparison task (but see Odgaard et al. 2003). Using a more direct analysis involving a steady/pulsed-pedestal paradigm specifically designed to segregate M- and P-based contrast selectivity (see the schematic sketch below), Jaekl and Soto-Faraco (2010) showed that sensory-level audio-visual contrast enhancement of near-threshold stimuli occurs under conditions selectively favouring magnocellular sensitivity to transient, low spatial frequency stimulation. Additionally, Pérez-Bellido et al. (2013) found that sound-induced visual enhancement of RTs could be psychophysically dissociated into separate components. One component of the RT enhancement resulted from interactions occurring at post-sensory stages of processing (i.e. uncertainty reduction, speeding of motor reactions by alerting) and affected reaction times across the entire range of visual spatial frequencies tested, whereas a sensory-based audio-visual RT benefit occurred selectively for low-frequency visual transients configured for optimal magnocellular sensitivity.
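The logic of the steady/pulsed-pedestal manipulation can be made explicit with the schematic stimulus schedules below; the luminance values and timings are hypothetical, not those used by Jaekl and Soto-Faraco (2010).

```python
import numpy as np

fps = 100                                  # frames per second (illustrative)
t = np.arange(0, 2.0, 1.0 / fps)           # one 2 s trial
bg, ped, inc = 20.0, 28.0, 2.0             # cd/m^2; hypothetical values
test = (t >= 1.0) & (t < 1.05)             # 50 ms test interval

# Steady pedestal: the pedestal is continuously visible, the observer
# adapts to it, and only the brief test increment is transient, so
# discrimination tracks the contrast-sensitive M pathway.
steady = np.full_like(t, ped)
steady[test] += inc

# Pulsed pedestal: pedestal and test appear together against the
# background, a large transient contrast step that saturates the M
# response and forces discrimination onto the P pathway.
pulsed = np.full_like(t, bg)
pulsed[test] = ped + inc
```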

Importantly, such sensory-specific interactions conferring enhancement are in line with the principle of inverse effectiveness (Meredith and Stein 1983), a defining principle of sensory integration which implies that relatively weak stimulus intensities lead to stronger crossmodal interaction. This principle is congruent with the findings of Stein et al. (1996) and Noesselt et al. (2010), who reported stronger brightness enhancement at lower stimulus intensities. However, inverse effectiveness alone cannot account for the crossmodal contrast enhancements observed by Jaekl and Soto-Faraco (2010) and Pérez-Bellido et al. (2013), which were found only for low, rather than high, spatial frequency stimuli.
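Inverse effectiveness is commonly quantified with the multisensory enhancement index of Meredith and Stein (1983). The sketch below applies it to hypothetical weak and strong responses (illustrative values only, not taken from the studies cited above) to show how a comparable absolute multisensory gain translates into a far larger proportional enhancement when the unisensory responses are weak.

```python
def enhancement_index(av, a, v):
    # Multisensory enhancement relative to the best unisensory response
    # (Meredith and Stein 1983): ME = 100 * (AV - max(A, V)) / max(A, V).
    best = max(a, v)
    return 100.0 * (av - best) / best

# Hypothetical responses (e.g. spikes/s) for weak vs strong stimulation.
print(enhancement_index(av=12.0, a=4.0, v=5.0))    # weak stimuli: +140 %
print(enhancement_index(av=60.0, a=40.0, v=55.0))  # strong stimuli: ~+9 %
```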

Notably, audio-visual improvement for low-contrast stimuli occurs preferentially for transient rather than sustained inputs (Van der Burg et al. 2010; Werner and Noppeney 2011). Transient inputs are defined by changes both from ‘off’ to ‘on’ and from ‘on’ to ‘off’ states, and are congruently signalled by brief visual responses throughout several stages of the visual system, including responses in subcortical regions (Cleland et al. 1971; Cleland and Levick 1973; Maunsell et al. 1999) as well as primary visual cortex (Horiguchi et al. 2009). In line with these physiological findings, Andersen and Mamassian (2008) demonstrated that, for audio-visual stimuli, crossmodal transient synchrony was sufficient to elicit sensory enhancements in a luminance change detection paradigm. Additionally, Van der Burg et al. (2010) found that target detection in visual search can be enhanced by sound when auditory and visual stimuli are presented transiently. Conversely, their study also revealed that sustained but temporally correlated signals were ineffective at improving visual search, demonstrating that a precise temporal representation of the stimuli is necessary for multisensory integration in these detection paradigms (see also Zannoli et al. 2012). Altogether, these results highlight the importance of optimal magnocellular sensitivity to relatively high temporal frequencies for producing sound-induced enhancement in visual detection tasks.
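The notion of a transient used here can be captured by a minimal 'transient channel' that responds to the rectified temporal derivative of its input, firing at both onset and offset but remaining silent during a sustained plateau. A schematic sketch, with all parameters illustrative:

```python
import numpy as np

t = np.arange(0, 1.0, 0.001)                         # 1 s sampled at 1 kHz
luminance = ((t >= 0.3) & (t < 0.6)).astype(float)   # a 300 ms 'on' pulse

# Minimal transient channel: the rectified temporal derivative of the
# input. It responds at both the off->on and on->off steps but stays
# silent during the sustained plateau, the signature attributed above
# to M-type responses.
transient = np.abs(np.diff(luminance, prepend=luminance[0]))
print(np.flatnonzero(transient))                     # -> [300 600] (ms)
```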

Influences above and beyond early sensory-level interactions have also been convincingly demonstrated. Such influences include those putatively arising from reductions in temporal (Lippert et al. 2007) and/or spatial uncertainty (McDonald et al. 2000; Frassinetti et al. 2002; Bolognini et al. 2005) by means of attentional orienting, and those promoted by crossmodally induced biases in decision-level processes. In addition, audio-visual integration can also modulate visual perception at processing stages where visual signals are more integrated across processing streams (e.g. Werner and Noppeney 2010), and in tasks for which effective parvocellular (rather than magnocellular) involvement may be critical. These modulations can subserve ventral stream processing, functioning to separate figure from ground (Roe et al. 2012) and aid object perception (Kourtzi and Connor 2011). Such audio-visual interactions are supported by demonstrations of the influence of sound in brain areas that receive parvocellular input and contribute to object-related tasks, for example during object naming or categorization (Colombo and Gross 1994; Bookheimer et al. 1998; Tranel et al. 2003). Psychophysical investigations aimed specifically at demonstrating auditory–parvocellular interaction at a sensory level have revealed that non-informative sounds can attenuate the effectiveness of metacontrast masking and influence orientation judgments of high-frequency Gabor patches (Jaekl and Harris 2009). Both of these tasks were designed specifically so that performance depended on the effectiveness of parvocellular processing. Importantly, these paradigms differ in objective from visual detection and reaction time tasks that exploit functional aspects of relatively early M-pathway processing.

Conclusion and future directions

We have focused on the discrepancy between studies confirming and failing to confirm early, sensory-based crossmodal influences in basic visual tasks. Our contention is that such inconsistencies may at least partly be resolved by considering the major anatomical and functional division within the early visual system between the magno- and parvocellular pathways, which broadly maps onto putative dorsal and ventral functions. Specifically, we have emphasized studies whose tasks concern primarily M-pathway functions—early crossmodal combinatorial processes influencing basic behaviours such as fast reactions, luminance detection and contrast enhancement—which can depend on the effectiveness of early, transient magnocellular signals to indicate the presence and location of a near-threshold object or event. If crossmodal influences are to manifest in these tasks, they are most likely to occur when stimuli are appropriately optimized for magnocellular sensitivity—broadly defined in terms of low-contrast, low spatial frequency, transient stimuli. It is interesting that this apparently simple principle has rarely been considered in previous work on sensory interaction. We maintain that such consideration is important for future studies concerning audio-visual enhancement, especially those involving saccadic reaction time measurements, stimulus detection and paradigms concerning contrast sensitivity. Carefully designed experiments that measure strictly sensory-level interactions (e.g. unbiased by spatial and/or temporal cueing), conducted with these considerations in mind, may most effectively determine the nature of crossmodal enhancement.