Introduction

The cognitive representations of many abstract concepts (such as numbers and time) might have a spatial format. For example, a well-known phenomenon that derives from the literature on numerical cognition, but that is not restricted to numbers, is the Spatial–Numerical Association of Response Codes (SNARC). When participants are asked to judge whether a number is odd or even, by pressing a left-sided or a right-sided button, reaction times (RTs) are faster when participants respond to relatively larger numbers (e.g., 9) with the right-sided button than with the left-sided one, whereas the opposite result is observed for relatively smaller numbers (e.g., Experiment 1; Dehaene, Bossini, & Giraux, 1993). The interpretation of the SNARC is grounded on the fact that number magnitude representation might resemble the form of a mental number line (MNL), which is spatially oriented from left-to-right—at least in left-to-right reading cultures (e.g., see Shaki, Fischer, & Petrusic, 2009). Along the MNL, smaller numbers are represented on the left side and larger numbers on the right side of the imagined space.

Spatial associations of response codes are not restricted to numbers. These associations have been observed with various types of serial-ordered information such as letters of the alphabet, months of the year, and days of the week (Gevers, Reynvoet, & Fias, 2003, 2004). This implies that mapping sets of ordered elements onto spatial positions may take place in different domains. A similar effect has been observed when participants have to estimate time: left-sided responses to early-onset timing are faster than those to late-onset timing, whereas right-sided responses to late-onset timing are faster than those to early-onset timing [i.e., the Spatial–Temporal Association of Response Codes (STEARC) effect; Ishihara, Keller, Rossetti, & Prinz, 2008; Vallesi, Binns, & Shallice, 2008].

Similarly, pitch is classified in many languages using terms denoting spatial attributes, such as “high” and “low” (e.g., “high” or “low” tones). These terms suggest the presence of a relation—though implicit—between pitch-height and space. Indeed, from the early studies on the relation between pitch-height and space, an association between pitch-height and the vertical dimension has been shown. For instance, Pratt (1930) was the first to report an association between perceived location and pitch (i.e., in terms of lower pitch-height and higher pitch-height) in a scale oriented in the vertical dimension (from the floor to the ceiling). Participants were asked to listen to different pitches and then they were asked to judge, on a numbered scale from the floor to the ceiling, the perceived position of the pitches sounded by hidden loudspeakers placed along the vertical dimension. The results showed that lower pitches were judged lower in the scale, whereas higher pitches were judged higher in the scale, suggesting the existence of a relationship between space and pitch representation along the vertical axis.

In several studies, the explicit and implicit relation between pitch-height and space has been examined in depth (e.g., Lidji, Kolinsky, Lochy, & Morais, 2007; Nishimura & Yokosawa, 2009; Rusconi, Kwan, Giordano, Umiltà, & Butterworth, 2006). In this regard, auditory pitch can be considered in terms of magnitude/order information, yielding an association between pitch-height and space that is similar to that between numbers and space. Rusconi et al. (2006) were the first to report an association between lower response keys and lower pitch-heights, and between upper response keys and higher pitch-heights, in auditory frequency discrimination; this effect is known as the Spatial–Musical Association of Response Codes [SMARC; Rusconi et al., 2006; but see also the proposed Spatial–Pitch Association of Response Codes (SPARC) effect by Lidji et al., 2007].

The nature of the stimulus–response (S–R) associations is still a matter of debate. Spatial–numerical associations have been well established in the horizontal dimension of space (for a review, see Fischer & Shaki, 2014), and have recently been extended also to the vertical space (e.g., Grade, Lefèvre, & Pesenti, 2013; Hartmann, Gashaj, Stahnke, & Mast, 2014; Hartmann, Grabherr, & Mast, 2012; Hartmann, Mast, & Fischer, 2015; Holmes & Lourenco, 2012; Ito & Hatta, 2004; Loetscher, Bockisch, Nicholls, & Brugger, 2010). In the case of numbers, the horizontal association has been attributed to reading and writing habits, as well as finger counting habits (Gobel, Shaki, & Fischer, 2011; Zebian, 2005; but see Rugani, Vallortigara, Priftis, & Regolin, 2015), whereas the vertical association might reflect the experience that “more” usually corresponds to higher space (Hartmann et al., 2014; Holmes & Lourenco, 2012).

In the previous studies, however, upper and lower responses were not defined with respect to the true vertical axis, because the response keys were placed in front of the participants in an antero-posterior direction (i.e., along the sagittal axis; Gevers, Verguts, Reynvoet, Caessens, & Fias, 2006; Ito & Hatta, 2004; Müller & Schwarz, 2007; Shaki & Fischer, 2012). Thus, the “vertical” SNARC effect has only been tested in the radial dimension, rather than with a “true” upper and lower response arrangement. Only two studies used truly vertical response positions: Holmes and Lourenco (2012, Experiment 2) and Hartmann et al. (2014, Experiment 1). Holmes and Lourenco (2012) did not find evidence of a SNARC effect in an implicit task (the participants performed a parity judgment task); they only found a vertical SNARC effect when a small-down and large-up spatial–numerical association was required in an explicit task (i.e., participants were instructed to think of numbers as floors in a building). This result raised questions about whether the absence of a spontaneous vertical SNARC effect resulted from differences between the classical SNARC paradigm and the experimental setting (the authors used a touch screen requiring arm movements to respond) or whether it reflects that there is no reliable association between number magnitudes and vertical response positions. In contrast, Hartmann et al. (2014, Experiment 1) asked participants to categorize even and odd numbers (implicit association task) by pressing two vertically separated response pads with their hands. The results revealed a vertical SNARC effect, as found in previous studies, but in this case by testing participants in the veridical vertical condition.

With regard to the SMARC effect, in non-musicians, the association lower pitch-left and upper pitch-right appears to occur only when participants have to process pitch-height explicitly (Lidji et al., 2007; Rusconi et al., 2006), but it has been also reported when pitch-height was irrelevant for the task if a reference tone was provided (Cho, Bae, & Proctor, 2012).

Pitch-height representation might be different, however, from the representations of other stimuli that are characterized by a single dimension, which varies on a single continuum, such as physical size, luminance, angles, time, or numerical values (Cohen Kadosh & Henik, 2006; Cohen Kadosh et al., 2005; Fias, Lammertyn, Reynvoet, Dupont, & Orban, 2003; Ishihara et al., 2008). The auditory human system is highly evolved for processing natural, complex auditory signals effectively. Pitch-height is one of the primary auditory sensations and plays a defining role in music, speech, and auditory scene analysis. Although the main physical correlate of pitch-height is acoustic periodicity, or repetition rate, there are many interactions that confound the relations between the physical stimulus and the perception of pitch-height, such as the effects of other acoustic parameters on pitch judgments, and the complex interactions between perceptual auditory organization and pitch-height (Oxenham, 2012). Since the first published papers on this topic (Lidji et al., 2007; Rusconi et al., 2006), the SMARC effect has been reported in several studies, through Stimulus–Response Compatibility (SRC) paradigms in which simple tones (i.e., pure tones) were used as stimuli. Natural tones, however, are often complex harmonic tones. Complex tones can be described as the sum of several simple sine waves, each with different frequency, amplitude, and phase. In particular, in harmonic complex tones, the frequency of each frequency component is an integer number of times that of the lowest frequency component of the tone, the so-called fundamental frequency (F0). When humans listen to a complex harmonic tone, the sensation of pitch-height coincides with that evoked by F0; in other words, a pure tone of a given frequency and a complex harmonic tone of identical F0 evoke the same pitch-height sensation. The remaining harmonics of the tone determine the tone’s timbre. 
Timbre is a multidimensional concept and brightness is one of its dimensions (McAdams, 2012). Timbres can be scaled for brightness, depending on how low- and high-frequency harmonics are represented in the tone’s spectrum. The spectral centroid (i.e., the mean of the frequency components of a tone weighted by their relative amplitude) is often used to index the timbre’s brightness (Grey & Gordon, 1978; McAdams, Winsberg, Donnadieu, De Soete, & Krimphoff, 1995). Brightness (i.e., spectral centroid) and pitch (i.e., F0) interact in the perception of musical sounds (Melara & Marks, 1990). For example, McDermott, Lehr, and Oxenham (2008) showed that listeners were able to recognize familiar melodies when the pitch information was removed (melodies were made by fixed pitch tones), but the tone’s brightness reproduced the contour of the original melody. Russo and Thompson (2005) showed that listeners judged the size of a two-tone musical interval as larger when pitch and brightness of the tones composing the interval varied coherently (i.e., the high-pitched tone was associated with a bright timbre) rather than when the tones varied incoherently (e.g., the high-pitched tone was associated with a dull timbre). Notably, the sound’s brightness determines one of the so-called “useful dimensions of sensitivity” (i.e., size; Gibson, 1963). In fact, the sound’s brightness is often the cue revealing the size of the object that is behind a sound: small objects produce bright sounds, whereas large objects produce dull sounds (Giordano, Rocchesso, & McAdams, 2010; Grassi, 2005; Grassi, Pastore, & Lemaitre, 2013). In previous studies, pitch-height and brightness were not manipulated independently (e.g., Lidji et al., 2007; Rusconi et al., 2006): because pure tones have only one frequency component, pitch-height and brightness always varied coherently.
Therefore, it is impossible to know if the SMARC effect observed in previous studies could be due to the perceived sensation of pitch-height, the perceived sensation of brightness, or both.

The aim of the present study was to disentangle the role that pitch-height and brightness play in forming the spatially oriented representations of tones in the SMARC effect. We asked participants to judge, as fast as possible, whether the tone they were listening to was (or was not) modulated in amplitude (i.e., vibrato). We performed an experiment consisting of three different experimental conditions. In a first condition, the tones’ pitches spanned from low to high coherently with the tones’ brightness that also spanned from dull to bright (Fig. 1, pitch-height varied, brightness varied). In a second condition, the tones’ pitches spanned from low to high but the tones’ brightness was fixed (Fig. 1, pitch-height varied, brightness fixed). Finally, in a third condition, the tones’ pitch was fixed, whereas the tones’ brightness spanned from dull to bright (Fig. 1, pitch-height fixed, brightness varied). The tone’s brightness was modified by manipulating the tone’s spectral centroid.

Fig. 1

Spectra of the tones presented in the three experimental conditions. Each graph shows the amplitude of the tone’s harmonics as a function of their frequency. The gray-filled symbols highlight the harmonic at fundamental frequency (i.e., the tone’s pitch-height), whereas the white symbols represent the upper harmonics. The black arrows indicate the value of the spectral centroid (i.e., the tone’s brightness). Note how in the condition “pitch-height varied, brightness varied” pitch-height and brightness covaried coherently; in the condition “pitch-height varied, brightness fixed” the pitch-height increased log-linearly, whereas the brightness was fixed; in the condition “pitch-height fixed, brightness varied” the pitch-height was fixed, whereas the brightness increased log-linearly

The SMARC effect in the radial condition has been consistently found irrespective of the task relevance of pitch-height and of musical proficiency (see Rusconi et al., 2006, Experiment 1). Although consistent results have been obtained on the horizontal SMARC effect when the pitch-height of tones was task relevant, mixed results have been reported across experiments in which the pitch-height was irrelevant to the task (Rusconi et al., 2006, Experiments 2 and 3). Given the previous results, we decided to test musically untrained participants (i.e., non-musicians) in the implicit condition only, to avoid any possible influence of the explicit task in creating spatial mappings/representations of pitch-height.

Materials and methods

Participants

Twenty-four neurologically healthy, right-handed students of the University of Padova (age: M 24.1 years, SD 6.9; 6 males) were recruited for the study. All participants were non-musicians (i.e., they neither played a musical instrument nor were able to read music) and had no history of primary or secondary auditory problems. All participants gave their informed consent to participate in the study. They were tested in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Apparatus

We implemented the experiment in E-Prime 2.0 (Psychology Software Tools, Inc). The software ran on a Pentium IV computer. Sounds were synthesized by means of freely distributed MATLAB toolboxes (i.e., MLP, Grassi & Soranzo, 2009; PSYCHOACOUSTICS, Soranzo & Grassi, 2014) and presented to the participants at a sample rate of 44.1 kHz and a resolution of 16 bits. The output of the soundcard (M-AUDIO Fast Track Pro) was delivered to both channels of a pair of Sennheiser HD 280 pro headphones. The experiment was conducted in a silent room (noise below 35 dBA at the listener’s ear). During the experiments, the sounds’ pressure at the listener’s ear was ~60 dB SPL. The participants performed a two-choice RT task by pressing two buttons (i.e., the left- or the right-end button) of a Serial Response Box™ (SRBox, Psychology Software Tools, Inc).

Stimuli

Pitch-height varied, brightness varied

The sounds were eight 1000-ms-long complex tones (four steady in amplitude and four sinusoidally modulated in amplitude) including the first nine harmonics of a 169.6-Hz (or 226.4, 302.2, 404.4) fundamental frequency (see Fig. 1, pitch-height varied, brightness varied). The musical interval dividing the tones was a perfect fourth (e.g., the tonal distance between C and F). All the tone’s harmonics had identical amplitude and were added in phase. In steady tones, the tone’s envelope was gated on and off with two 10-ms raised cosine ramps, but it was otherwise constant in level. In the amplitude-modulated tones, the tones’ envelope was modulated in amplitude with a sinusoidal modulator (modulation rate 25 Hz) and with a modulation depth of −7.5 dB. Modulation depth is here expressed as 20 log10(m), where m is a modulation index that ranges from 0.0 (no modulation) to 1.0 (max modulation).
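The stimulus description above can be illustrated with a short sketch. It is not the MATLAB code used in the study: it is a plain-Python approximation in which the sine starting phase, the application of the onset/offset ramps to the modulated tones as well, and the final peak normalization are all assumptions. The function names (`modulation_index`, `complex_tone`) are hypothetical.

```python
import math

FS = 44100             # sample rate in Hz (from the Apparatus section)
DURATION_S = 1.0       # tone duration in seconds
N_HARMONICS = 9        # first nine harmonics, equal amplitude
RAMP_S = 0.010         # 10-ms raised-cosine onset/offset ramps
MOD_RATE_HZ = 25.0     # amplitude-modulation rate
MOD_DEPTH_DB = -7.5    # modulation depth, expressed as 20*log10(m)
FUNDAMENTALS_HZ = (169.6, 226.4, 302.2, 404.4)  # values reported in the text


def modulation_index(depth_db):
    """Invert depth_db = 20*log10(m) to recover the modulation index m."""
    return 10 ** (depth_db / 20)


def complex_tone(f0, modulated=False):
    """Return one tone as a list of samples scaled to a peak of 1.0."""
    n = int(FS * DURATION_S)
    n_ramp = int(FS * RAMP_S)
    m = modulation_index(MOD_DEPTH_DB) if modulated else 0.0
    samples = []
    for i in range(n):
        t = i / FS
        # Equal-amplitude harmonics, all starting in sine phase (assumption).
        s = sum(math.sin(2 * math.pi * f0 * h * t)
                for h in range(1, N_HARMONICS + 1))
        # Sinusoidal amplitude modulation; m = 0 leaves the tone steady.
        s *= 1.0 + m * math.sin(2 * math.pi * MOD_RATE_HZ * t)
        # Raised-cosine gate over the first and last n_ramp samples.
        edge = min(i, n - 1 - i)
        if edge < n_ramp:
            s *= 0.5 * (1.0 - math.cos(math.pi * edge / n_ramp))
        samples.append(s)
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples]  # normalization is an assumption


steady = complex_tone(FUNDAMENTALS_HZ[0])
vibrato = complex_tone(FUNDAMENTALS_HZ[0], modulated=True)
```

With a depth of −7.5 dB the modulation index is about 0.42, so the envelope of the modulated tone swings between roughly 0.58 and 1.42 times its unmodulated level.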

Pitch-height varied, brightness fixed

The sounds were identical to those used in the condition “pitch-height varied, brightness varied” except that the amplitude of the harmonics was log-linearly modulated in such a way that all sounds had an identical spectral centroid of 1308.0 Hz. That is, in the tones with a low fundamental frequency, the lower harmonics were attenuated and the higher harmonics were amplified (Fig. 1, pitch-height varied, brightness fixed). On the contrary, for the tones with a high fundamental frequency, the opposite settings were used.
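How a spectral tilt can hold the centroid constant may be easier to see in code. The sketch below computes the amplitude-weighted spectral centroid and searches by bisection for a tilt that reaches a target centroid. The exact form of the "log-linear" modulation is not specified in the text, so the tilt used here (a fixed number of decibels per harmonic) is an assumption, and all function names are hypothetical.

```python
N_HARMONICS = 9


def spectral_centroid(f0, amplitudes):
    """Mean harmonic frequency weighted by relative amplitude."""
    freqs = [f0 * h for h in range(1, len(amplitudes) + 1)]
    return sum(f * a for f, a in zip(freqs, amplitudes)) / sum(amplitudes)


def tilted_amplitudes(slope_db, n=N_HARMONICS):
    """Assumed tilt: level changes by slope_db decibels per harmonic."""
    return [10 ** (slope_db * k / 20) for k in range(n)]


def slope_for_centroid(f0, target_hz, lo=-40.0, hi=40.0, tol=1e-9):
    """Bisection search for the tilt that yields the target centroid.

    The centroid grows monotonically with the tilt, so bisection applies.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if spectral_centroid(f0, tilted_amplitudes(mid)) < target_hz:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


TARGET_HZ = 1308.0  # fixed centroid reported in the text
for f0 in (169.6, 226.4, 302.2, 404.4):
    slope = slope_for_centroid(f0, TARGET_HZ)
    centroid = spectral_centroid(f0, tilted_amplitudes(slope))
    print(f"F0 {f0:6.1f} Hz: tilt {slope:+.2f} dB/harmonic, "
          f"centroid {centroid:.1f} Hz")
```

Under this assumption the low-F0 tones need a positive tilt (upper harmonics boosted) and the high-F0 tones a negative one, which matches the direction described in the text. With nine equal-amplitude harmonics the centroid is simply five times F0, so 1308.0 Hz corresponds to the untilted spectrum of a 261.6-Hz tone.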

Pitch-height fixed, brightness varied

The sounds were eight 1000-ms-long complex tones (four steady in amplitude and four sinusoidally modulated in amplitude) including the first nine harmonics of a 261.6-Hz fundamental frequency. The amplitude of the harmonics was log-linearly modulated so that the spectral centroid of each tone was separated from the spectral centroid of the successive/preceding tone by a perfect fourth (e.g., the musical interval dividing the notes A and D). The spectral centroids of the four tones were, respectively, 848.1, 1132.1, 1511.2, and 2017.2 Hz, and the fundamental frequency was identical for all tones (Fig. 1, pitch-height fixed, brightness varied).

Procedure

The participants were seated on a chair in front of a table, wearing the headphones. During the experiment, the participants were blindfolded to avoid any influence of visual stimuli on their performance and to keep their attention on the auditory stimuli. The SRBox was fixed on the table in front of each participant. The hands of the participants were positioned holding the SRBox, so that they could press the right-end button with their right thumb and the left-end button with their left thumb (Fig. 2). The surface of the left-end button was smooth, whereas that of the right-end button was rough. The stimuli were presented in two consecutive sessions. In one session, the SRBox was placed horizontally in front of the participant (i.e., horizontal condition), with the SRBox’s sagittal midline aligned with the participant’s sagittal midline (Fig. 2a). In the other session, the SRBox was placed vertically in front of the participant (i.e., vertical condition), with the buttons facing the participant, and the sagittal midline of the buttons aligned with the participant’s sagittal midline (Fig. 2b). One sound per trial was presented to the participants for each response axis (i.e., vertical and horizontal).

Fig. 2

Schematic representation of the SRBox position for the two sessions: a horizontal axis; b vertical axis. The SRBox was fixed on a table by means of an ad hoc apparatus made of wood and clamps

Each trial began with the presentation of the 1000-ms-long tone. The participants were required to decide whether the presented tone was vibrato or not and to respond, by pressing the corresponding button, as quickly and accurately as they could. The participants had to respond within 1500 ms from the onset of the tone. Immediately after the participant’s response, or after 1500 ms from the beginning of the trial without any response, there was a 300-ms-long, silent time interval. Then, acoustic feedback about their performance was delivered to the participants. The feedback consisted of the word “yes”, if the response was correct, of the word “no”, if the response was wrong, and of the word “null”, if no response was recorded. The tones were presented to the participants with an inter-trial interval (ITI) of 2700 ms. The ITI was calculated from the participant’s response, or from the end of the 1500-ms response window after the beginning of the trial.

For each axis (i.e., horizontal and vertical), two blocks of stimuli were presented. In one block, the participants were instructed to respond with the rough button to the non-vibrato tones and with the smooth button to the vibrato tones, and vice versa in the other block. Each of the 8 tones was presented 5 times, for a total of 40 trials presented in a random order (20 trials with the non-vibrato tones, and 20 trials with the vibrato tones). To become familiar with the stimulus–response association, the participants performed 8 training trials, in which all the tones were randomly presented.

The stimuli were presented in counterbalanced order among the participants with respect to the sound (pitch-height varied, brightness varied; pitch-height varied, brightness fixed; pitch-height fixed, brightness varied), the axis (horizontal, vertical), and the stimulus/hand association (left, right). The participants performed 12 blocks, altogether.

Results

First, all the incorrect trials and the trials without response were discarded from further analysis (5.5 %). Then, for each participant and for each condition (pitch: height, brightness, height and brightness; axis: horizontal, vertical; tones: 1–4; hand: right, left), we used the modified recursive procedure with moving criterion described in Van Selst and Jolicoeur (1994) to reject outliers (1.5 %). RTs to modulated and non-modulated sounds are reported in "Appendix". Thereafter, the dependent variable was calculated as the difference between the mean RT of the left thumb and the mean RT of the right thumb (Δ-RT; left minus right), measured in milliseconds. Negative values indicated faster responses with the left thumb than with the right thumb. In contrast, positive values indicated faster responses with the right thumb than with the left thumb.

Finally, for each participant, we calculated the beta coefficient (β) of the regression using the Sound factor as predictor and the Δ-RT as the dependent variable. We derived the βs for both the horizontal and vertical axes, separately for each sound condition. We used an alpha level of 0.05 for all statistical tests.
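The per-participant slope computation can be sketched as follows. The sketch assumes that the Sound factor is coded as tone rank 1 to 4, that Δ-RT is taken as left minus right (as in the figure captions), and that the slope is an unstandardized least-squares coefficient; the text does not say whether the reported βs were standardized. The RT values are invented for illustration only, and the function names are hypothetical.

```python
def delta_rt(mean_rt_left, mean_rt_right):
    """Per-tone RT difference in ms, left thumb minus right thumb."""
    return [left - right for left, right in zip(mean_rt_left, mean_rt_right)]


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Invented mean RTs (ms) for one participant, ordered from tone 1 to tone 4.
tone_rank = [1, 2, 3, 4]
rt_left = [500, 510, 520, 530]    # left thumb slows down as tones rise
rt_right = [520, 510, 500, 490]   # right thumb speeds up as tones rise

d_rt = delta_rt(rt_left, rt_right)   # [-20, 0, 20, 40]
slope = ols_slope(tone_rank, d_rt)   # 20.0 ms per tone step
```

A positive slope means that the right-thumb advantage grows as the tones rise in pitch-height or brightness, which is the direction of a SMARC effect under this coding.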

We first ran a two-way analysis of variance for repeated measures, with axis (horizontal, vertical) and sound (pitch-height varied, brightness varied; pitch-height varied, brightness fixed; pitch-height fixed, brightness varied) as factors. The main effect of axis was not significant, F(1, 23) = 3.57, p = .072. The main effect of sound was significant, F(2, 46) = 3.42, p = .041. Finally, the interaction axis by sound manipulation was also significant, F(2, 46) = 3.71, p = .032.

To follow up the significant interaction, we performed t tests versus zero (using βs as the dependent variable) to investigate whether there was a significant linear SMARC effect as a function of axis, separately for each sound condition. Finally, we performed t tests for repeated measures to test whether there was a difference in performance between the horizontal and vertical axes.
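As an illustration of the test against zero, the sketch below computes a one-sample t statistic and Cohen's d from a list of slopes. The slopes are invented, the function name is hypothetical, and the effect-size formula is an assumption: the text does not state how d was computed, although the reported values are numerically consistent with d = t/√n (e.g., 5.17/√24 ≈ 1.06). The p value is left out because it needs the t distribution, for which a library routine such as SciPy's `ttest_1samp` would normally be used.

```python
import math
import statistics


def one_sample_t(values, mu=0.0):
    """Return (t, cohen_d, df) for a one-sample t test against mu."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)             # sample SD (n - 1 denominator)
    t = (mean - mu) / (sd / math.sqrt(n))
    d = (mean - mu) / sd                      # equals t / sqrt(n)
    return t, d, n - 1


# Invented slopes for four hypothetical participants.
betas = [0.2, 0.4, 0.6, 0.8]
t, d, df = one_sample_t(betas)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")

# Consistency check against a t value reported in the text (n = 24).
print(round(5.17 / math.sqrt(24), 2))
```

The same function applied to the per-participant differences between the horizontal and vertical slopes gives the repeated-measures comparison, since a paired t test is a one-sample test on the differences.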

Pitch-height varied, brightness varied

The t tests versus zero performed on the βs revealed a significant SMARC effect when participants used the SRBox in the vertical condition (M 0.57, SEM 0.08; t(23) = 5.17, p < .001, d = 1.06), but not in the horizontal condition (M 0.05, SEM 0.11; t(23) = 0.81, p = .428, d = 0.16). The direct comparison between the two conditions was significant, t(23) = −3.07, p = .005, d = −0.63 (see Fig. 3).

Fig. 3

The mean RT difference (left minus right) for the horizontal and vertical conditions. The bars represent the confidence intervals (95 %)

Pitch-height varied, brightness fixed

The t tests versus zero were not significant either in the horizontal condition (M −0.09, SEM 0.12; t(23) = −0.6, p = .554, d = −0.12) or in the vertical condition (M 0.11, SEM 0.12; t(23) = 1.66, p = .11, d = 0.34). The horizontal versus vertical comparison was not significant either, t(23) = −1.84, p = .079, d = −0.38 (see Fig. 4).

Fig. 4

The mean RT difference (left minus right) for the horizontal and vertical conditions. The bars represent the confidence intervals (95 %)

Pitch-height fixed, brightness varied

The t tests versus zero performed on the βs were not significant either in the horizontal condition [M 0.2, SEM 0.12; t(23) = 1.83, p = .08, d = 0.37] or in the vertical condition [M 0.13, SEM 0.13; t(23) = 1.05, p = .303, d = 0.22]. The horizontal versus vertical comparison was not significant either, t(23) = 0.82, p = .423, d = 0.17 (see Fig. 5).

Fig. 5

The mean RT difference (left minus right) for the horizontal and vertical conditions. The bars represent the confidence intervals (95 %)

Analyses of RTs in the vertical conditions

To assess whether the βs in the vertical conditions were significantly different from each other, for the vertical conditions only we ran a one-way ANOVA for repeated measures with sound condition (pitch-height varied, brightness varied; pitch-height varied, brightness fixed; pitch-height fixed, brightness varied) as factor. The effect of sound was significant, F(2, 46) = 4.952, p = .011. Pairwise comparisons with Bonferroni correction showed that βs of the vertical condition “pitch-height varied, brightness varied” were significantly different with respect to the condition “pitch-height varied, brightness fixed” and the condition “pitch-height fixed, brightness varied”, p = .033 and p = .043, respectively. The difference between the condition “pitch-height varied, brightness fixed” and the condition “pitch-height fixed, brightness varied” was not significant (p > .05).

Visual inspection of Fig. 5 suggests that the fourth tone may behave differently from the other tones; if so, a SMARC effect might be observed if only the first three tones were included in the analysis. Therefore, for the vertical axis of the condition “pitch-height fixed, brightness varied” only, we ran a one-way ANOVA for repeated measures with tones (848.1, 1132.1, 1511.2, 2017.2 Hz) as factor. The effect of tones was not significant, F(3, 69) = 1.242, p = .301. Pairwise comparisons with Bonferroni correction showed no significant difference among the tones (all p > .05), showing that responses to the fourth tone of this condition were not significantly different from those to the other tones.

Discussion

The aim of the present study was to disentangle the role that pitch-height and brightness play in forming the spatially oriented representations of tones in the SMARC effect, in both the horizontal and vertical dimensions. In three experimental conditions, participants were asked to judge, as quickly as possible, whether the tone they were listening to was (or was not) modulated in amplitude (i.e., vibrato). Participants were required to respond along both the horizontal and the vertical axes. In the condition “pitch-height varied, brightness varied”, pitch-height and brightness varied coherently; in the condition “pitch-height varied, brightness fixed”, pitch-height varied log-linearly whereas brightness was fixed; in the condition “pitch-height fixed, brightness varied”, pitch-height was fixed whereas brightness varied log-linearly. We observed the SMARC effect only in the condition “pitch-height varied, brightness varied”, in which pitch-height and brightness varied coherently, and only in the vertical condition. In contrast, the results did not show any SMARC effect either when the pitch-height varied log-linearly whereas the brightness was fixed or when the pitch-height was fixed whereas the brightness varied log-linearly. These results suggest that when harmonic complex tones are presented to participants, the association between pitch-height and space emerges only under specific conditions, depending on the characteristics of the tones (e.g., pitch and brightness). The results also suggest that a change in pitch-height alone is not sufficient to elicit the SMARC effect, and neither is a change in brightness alone. Rather, complex tones (such as the harmonic complex tones used in the present experiment) are processed as a whole, so that a coherent change in both pitch-height and brightness is necessary to observe the SMARC effect.
Perhaps, because the pitch and the brightness of tones tend to vary coherently in nature (McAdams, 2012), the SMARC effect can be observed when both manipulations vary coherently.

Moreover, we found a SMARC effect only in the vertical dimension. This result is in line with those of previous studies (e.g., Rusconi et al., 2006) in which, by testing non-musicians, the SMARC effect was only found in the vertical condition, when the tones were processed implicitly. These results suggest that the association between tones and space occurs along the vertical dimension, rather than on the horizontal dimension. We believe that previous studies on the SMARC effect suffered from bias in testing the radial space instead of a veridical, vertical space because of the involvement of different types of space. We would also like to argue that the present study has some important methodological differences with respect to the previous studies. First, in our study, participants carried out all three experimental conditions blindfolded. In the previous studies (Lidji et al., 2007; Rusconi et al., 2006), the participants fixated a cross at the center of a computer screen; this could affect the participants’ performance because of the presence of visuo-perceptual distractors. Second, in previous studies, participants responded by pressing the Q and the P keys of a QWERTY keyboard in the horizontal condition, and the spacebar and “6” keys in the vertical condition. Spacebar and “6”, however, are not veridical lower and veridical upper response keys. In fact, in this case the verticality is only metaphorical, as already noticed by Rusconi et al. (2006, footnote 1, p. 118). Although Vu, Proctor, and Pick (2000) showed no significant interactions with respect to the keyboard’s orientation, indicating that the effects obtained when both vertical and horizontal compatibility are varied do not depend on that orientation, it is not desirable to assume that there are no differences between the vertical and the radial dimensions; in fact, these spatial dimensions can be subserved by two different mechanisms of space processing.
Indeed, we know from neuropsychological studies that the processing of the perceptual space (visual, auditory, tactile) is subserved by different neural circuits. For instance, Adair, Williamson, Jacobs, Na, and Heilman (1995) showed that spatial neglect occurring after bilateral hemisphere lesions can disrupt responses along space defined by the vertical and radial axes, separately. That is, different reference frames could be dissociated. Moreover, Geldmacher and Heilman (1994) showed that the position of stimuli in the visual space is an important determinant of spatial attention, evaluating the relative contributions of spatial and visual field factors in bisection of line segments placed in the radial position. Geldmacher and Heilman showed that processing of radial space and upper visual field was dissociated, suggesting that visual field position influences the allocation of spatial attention in the radial axis of space. Finally, Cappelletti, Freeman, and Cipolotti (2007) showed that three of the neglect patients they tested with a mental number bisection task revealed an upward bias for vertically oriented number lines. The remaining two neglect patients they tested did not show any bias in processing vertical lines. The results of Cappelletti et al. suggested that horizontal and vertical space representation can be associated or dissociated among neglect patients and that, at least partially, independent mechanisms may be involved in processing horizontal and vertical mental number lines.

A third difference of the present study is the use of a SRBox. Although computer keyboards and mice are frequently used to measure RTs, the accuracy of these instruments is lower compared to dedicated devices (e.g., SRBox). In many cases, the keyboard error variance is negligible compared with the variance of RTs in many research areas. There is some doubt, however, as to whether the keyboard error variance can safely be ignored (e.g., see Shimizu, 2002). Moreover, in the present study the shape of the SRBox was useful to detect RTs in the vertical position because it allowed a good placement of the hands. In contrast, it was not possible to place the keyboard vertically in a comfortable way.

One could argue that the results of the present study might be mediated by the perceived urgency to react to some stimuli. In particular, given that in the vertical condition the right hand was placed at the top of the SRBox and the left hand at the bottom, some effect of muscular tension could have induced an urgency effect in reacting to the sound stimuli. In fact, previous studies have often found that the perceived urgency to respond or react to a sound is influenced by the frequency, the amplitude, and the temporal parameters of the sound (e.g., see Haas & Edworthy, 1996). Moreover, spectral bands with higher central frequencies tend to be judged as more urgent than those with lower central frequencies, suggesting a relationship between spectral bands and perceived urgency (Gordon, Russo, & MacDonald, 2013, Experiments 1 and 2). The present results, however, do not seem to be influenced by urgency, for two main reasons. First, we found different effects among the sound conditions, suggesting that the SMARC effect we found was not related to the position of the hands. Second, we presented the stimuli in counterbalanced order among participants with respect to the sound condition (pitch-height varied, brightness varied; pitch-height varied, brightness fixed; pitch-height fixed, brightness varied), the response axis (horizontal, vertical), and the stimulus/hand association (left-rough/right-rough). By counterbalancing the order, we avoided the influence of the urgency effect on reactions to the auditory stimuli.

One question could be raised about the vertical and horizontal placements of the hands in the present study. Because of the constraints of the vertical condition, the upper button was always pressed with the right hand and the lower button with the left hand. Therefore, one may argue that four factors are necessary to elicit the SMARC effect: pitch variation, brightness variation, a vertical response axis, and a specific assignment of the left and right hand. It could be argued that the role (if any) of the latter factor cannot be disentangled by the present paradigm. In response to this observation, we note that when hands are crossed along the horizontal plane, RTs are longer than when the hands are uncrossed (Berlucchi, Tassinari, & Aglioti, 1994). Therefore, when hands are placed along the horizontal plane there are two distinct conditions (i.e., crossed versus uncrossed hands). In contrast, when hands are placed along the vertical plane it is not possible to obtain equivalent, distinct conditions, because there is no way to cross the hands in the vertical condition. For this reason, we decided to keep the same position of the hands (uncrossed) in both the vertical and the horizontal conditions, counterbalancing the side of response. We placed the right hand above (and the left hand below) because it has been shown that, in stimulus–response compatibility (SRC) tasks, SRC effects also occur when stimulus and response arrays are orthogonal and there is no spatial correspondence between stimuli and responses. When a stimulus set is vertically arranged and a response set is horizontally arranged, an up-right/down-left S–R mapping yields an advantage over the opposite mapping (see Cho & Proctor, 2003). Moreover, Wood, Nuerk, and Willmes (2006) showed that the spatial context has an influence on the SNARC effect and that hand-based associations (not only representational associations) are relevant for the SNARC effect. By maintaining the hands in the same position, we avoided these possible confounding effects on RTs.

The results of the present study open new questions about the presence of an association between tones and spatial representations, suggesting the existence of a complex phenomenon that should be investigated in more depth. In nature, tones are composed of different sound waves, and the human auditory system evolved to process these complex sounds. How does auditory experience shape spatial representations? Since the first published paper on the SMARC effect (Rusconi et al., 2006), several authors have reported this phenomenon in studies with SRC paradigms in which pure tones were used as stimuli. Musical tones, however, are complex acoustical stimuli, and this complexity could have shaped auditory space through the processing of several sine waves that differ in frequency, amplitude, and phase. When humans listen to a harmonic complex tone, the sensation of pitch coincides with that evoked by F0 (i.e., a pure tone of a given frequency and a complex tone of identical F0 evoke the same pitch-height). It should be noted, however, that people have some difficulty ignoring changes in timbre (i.e., they are influenced by the irrelevant changes in timbre) when they are required to make judgments on pitch-height (Borchert, Micheyl, & Oxenham, 2011). In addition, Parise, Knorre, and Ernst (2014) recently suggested that pitch borrows its spatial connotation from the statistics of natural auditory scenes. Parise et al. (2014) recorded a large set of natural sounds from the environment, analyzed the elevation-dependent filtering of the outer ear, and measured frequency-dependent biases in human sound localization. They found that auditory scene statistics reveal a clear mapping between frequency and elevation, providing the basis for the spatial connotation of pitch. In any case, more studies are needed to clarify the relationship between the frequency content of a sound and the way the sound is (or is not) spatially represented.

Testing expert musicians, for example, would be relevant to further explore the SMARC effect with complex tones. We speculate that in the condition “pitch-height varied, brightness varied” the SMARC effect would be present on both the vertical and the horizontal axes (see Rusconi et al., 2006, Experiment 2, for similar results), suggesting a possible role of the instrument played (e.g., the piano; see Lidji et al., 2007). If we hypothesize that pitch-height is the main factor eliciting the SMARC effect in trained musicians, and given that in the condition “pitch-height varied, brightness fixed” only the pitch-height varies (through the manipulation of the harmonics of the respective upper and lower fundamental frequencies), we would expect a SMARC effect in both the vertical and the horizontal conditions (the same as in the “pitch-height varied, brightness varied” condition). Finally, under the same hypothesis, in the condition “pitch-height fixed, brightness varied” we would not expect to find the SMARC effect (i.e., no significant difference between the vertical and the horizontal conditions).

Some studies on number representation showed that the direction of the mental number line is shaped by reading and writing habits (Berch, Foley, Hill, & Ryan, 1999; Dehaene et al., 1993; Gobel et al., 2011; Shaki et al., 2009; Zebian, 2005). The same results were found in the vertical dimension (Hartmann et al., 2014; Holmes & Lourenco, 2012): participants were faster to respond to smaller numbers with the bottom hand and to larger numbers with the top hand. One could ask whether language could be responsible for the SMARC effect too. It is unclear whether linguistic terms such as “high” and “low” give rise to (or merely reflect) this spatial association. The space–pitch association can be observed in pre-schoolers (Nava, Grassi, & Turati, 2015) and newborns (Walker et al., 2010; however, see Lewkowicz & Minar, 2014). Infants may acquire the implicit, culturally determined associations of their caregivers (Nygaard, Herold, & Namy, 2009; Stern, Spieker, & MacKain, 1982). These findings suggest that pitch-height mappings may not depend on language. Moreover, mappings between auditory pitch and visual brightness have recently been documented in chimpanzees (Ludwig, Adachi, & Matsuzawa, 2011). Parkinson, Kohler, Sievers, and Wheatley (2012) tested a remote population in Cambodia (the Kreung tribe) who do not use spatial language to describe pitch. Participants viewed shapes rising or falling in space while hearing sounds either rising or falling in pitch, and they were required to report the auditory change. Associations between pitch and vertical position were found (faster RTs with compatible high position/high pitch and low position/low pitch), similar to those already shown in Western cultures, where pitch is described in terms of spatial height. These results suggest that the association between the vertical visual domain and auditory pitch can arise independently of language.

In the present study, we manipulated pitch and brightness. Even though Western cultures have linguistic terms to describe pitch (e.g., “high” and “low”), brightness has no linguistic markers. We found, however, an association of pitch/brightness with vertical space but not with horizontal space. We did not find an association between space and tones when these two auditory components varied separately. Moreover, we did not find an association between pitch-height and brightness in the horizontal axis, suggesting that pitch-height appears to be mapped onto vertical space in the absence of shared verbal labels. If there were a linguistic effect, we would also have found an effect in the pitch variation condition with brightness fixed. This was not the case.

It seems, then, that auditory pitch is linked to a corresponding mental representation of space (i.e., vertical) that is able to interact with motor responses, and this association also seems to be more congruent than the association of pitch-height with horizontal space (see also Lega, Cattaneo, Merabet, Vecchi, & Cucchi, 2014, for other sensory modalities). The height of tonal representation is not unique and seems to depend on deep signal analysis. Future investigations of the SMARC and SMARC-related phenomena, carried out with non-musicians but also with expert musicians to investigate the influence of musical expertise, should take into account the composite characteristics of musical tones, considering the sophisticated processing of sounds performed by the human auditory system. We suggest that tones must have specific acoustic characteristics to be spatially mapped; otherwise, the spatial mapping of the tone is not elicited. In fact, we found the SMARC effect using complex tones only when pitch-height and brightness varied coherently. Moreover, we found this association on the vertical dimension only, suggesting that the horizontal dimension could be influenced by other cultural factors (e.g., see Fischer, Riello, Giordano, & Rusconi, 2013).