Introduction

Music is defined as a set of sounds with specific attributes (such as frequency and timbre) that are presented temporally as patterns, following rules that can be adjusted to create different sensations based on cultural and stylistic categories. Time is evident in music at different scales: the fastest oscillations in frequency form the backbone of timbre, periodicities produce rhythms, and contours support melodies. Temporal features are, therefore, of particular interest in the analysis of auditory stimuli, particularly music.

Music perception is based on the detection and analysis of acoustic events, including their duration and position in time. The particular organization of the time intervals between one sound and the next creates the perception of a rhythm or pattern; rhythm is therefore a perceived musical structure, and its perception depends on the a priori existence of an internal time frame (i.e., meter). By contrast, beat perception seems to be a basic and innate phenomenon that does not depend on prior learning of metric structures, but rather on the salience and regularity of the pulse inherent to the acoustic signal. However, the perception of an underlying beat can be modulated by the stimulus itself, namely the temporal structure of acoustic events and their accents, but also by listener factors, such as his or her preferred tempo range or fluctuations in attention.
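
As a concrete illustration of the idea that rhythm emerges from the temporal organization of acoustic events, the following minimal sketch detects sound onsets from an amplitude envelope and computes the inter-onset intervals that would define a perceived rhythmic pattern. It is a hypothetical toy example (the onset detector and threshold are our own simplifications), not an analysis taken from any of the studies discussed here.

```python
import numpy as np

def inter_onset_intervals(signal, fs, threshold=0.2):
    """Estimate inter-onset intervals (IOIs) from a mono audio signal.

    A very rough onset detector: rectify the signal, smooth it to obtain an
    amplitude envelope, and mark the times where the envelope rises above a
    fixed threshold. The IOIs (in seconds) between successive onsets are the
    raw material of a perceived rhythm.
    """
    envelope = np.abs(signal)
    # Smooth with a ~10 ms moving average to remove fine structure.
    win = max(1, int(0.010 * fs))
    envelope = np.convolve(envelope, np.ones(win) / win, mode="same")
    above = envelope > threshold * envelope.max()
    # Onsets are samples where the envelope crosses the threshold upward.
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) / fs
    return np.diff(onsets)

# Example: three clicks at 0.0 s, 0.5 s and 1.25 s -> IOIs of 0.5 s and 0.75 s.
fs = 8000
sig = np.zeros(2 * fs)
for t in (0.0, 0.5, 1.25):
    i = int(t * fs)
    sig[i:i + 80] = np.sin(2 * np.pi * 1000 * np.arange(80) / fs)
print(inter_onset_intervals(sig, fs))  # approximately [0.5, 0.75]
```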

In this chapter we review the literature on music perception. First, a brief review of the flow of information within the auditory cortex is presented. Next, we compare the perception of music to that of speech, as these two acoustic categories share important traits in terms of their communicative functions, evolution, and temporal and spectral characteristics, while highlighting their differences in temporality and acoustic patterns. Particular aspects of musical parameters are addressed elsewhere in this book (see, for example, the previous chapter on beat induction).

The Auditory Cortices

The auditory cortices have been characterized from different perspectives, all of which reveal a detailed subdivision that can be seen in the cat, the ferret, the macaque, the chimpanzee and in humans, with more than ten areas identified [1, 2]. Each of these areas has distinct functional, cytological and neurochemical features [2–4], and their afferents come from different thalamic nuclei (e.g., the dorsal, ventral and medial portions of the medial geniculate complex, and the pulvinar and posterior nuclei). All of these aspects, together with their different neurochemical profiles, give each area a distinctive role in auditory processing, so we will make some brief remarks on their anatomical and functional organization.

Anatomical Organization of the Auditory Cortices

The primary auditory cortex, or core region [Brodmann’s Area (BA) 41], is located on the dorsal surface of the superior temporal gyrus (STG), covered by the frontoparietal operculum. The core is surrounded postero-laterally by the belt (BA 42 and possibly BA 52), and antero-laterally by the parabelt region (corresponding to BA 22), the latter two regions being considered secondary and tertiary auditory cortices, respectively (Fig. 1).

Fig. 1

Schematic representing the distribution of the auditory cortex in the human and the macaque monkey. Upper panel: lateral views of the macaque monkey (left) and human (right) brains; lower panel: dorsolateral view showing the location of the auditory cortex on the lower bank of the lateral sulcus. Primary auditory areas (core) are shown in dark gray; belt (yellow) and parabelt (green) areas are also colored. Left: macaque monkey map by Hackett et al. [5]; right: human distribution by Brodmann [6]. Schematics are not to scale. STG superior temporal gyrus, STS superior temporal sulcus, LS lateral sulcus, CS central sulcus. Modified from Hackett [6]

In turn, these three regions have been subdivided into about 12 areas, using functional and anatomical criteria obtained mainly from non-human primates [1, 7, 8], but also from post mortem studies in humans, which have allowed a precise cytological characterization of three sub-areas of the primary auditory cortex (T1.0, T1.1 and T1.2, according to Morosan et al. [3]) within Heschl’s gyrus [4, 9].

The core, belt and parabelt, strictly speaking, constitute the auditory cortex, because they are direct targets of the acoustic radiation that emerges from the medial geniculate complex (MGC), although each region has a specific pattern of thalamic afferents: projections from the ventral portion of the MGC (MGv) are distributed mainly to BA 41 (the core), the belt and parabelt regions are contacted mainly by the dorsal portion (MGd), and all regions are reached by fibers emerging from the medial subdivision (MGm). Thus, each cortical region receives a unique blend of fibers from several thalamic nuclei, and each thalamic nucleus provides a distinct variety of information to its cortical targets [6, 10, 11]. The distinction among these three regions is based not only on their thalamic afferents but also on the cytoarchitecture of each region, which presents a precise pattern of cellular distribution (e.g., a dense concentration of small granular cells in layers II and IV in the core, compared with the less granular appearance, larger pyramidal cells in layer IIIc and thinner layer IV of the parabelt) [3, 4]. Furthermore, it should be mentioned that the intrinsic connectivity (i.e., the local circuits of each cortical region) provides pathways for communication among neurons within and between the cell columns that constitute functional units [12, 13].

Seldon [14] described the cytoarchitecture and the axonal and dendritic distributions in the human auditory cortex in an attempt to establish the morphological correlates of speech perception, distinguishing different patterns of columnar organization between primary and secondary regions. In addition, clear inter-hemispheric differences were noted, such as the enlargement of the left planum temporale (Wernicke’s area), different sizes of neuronal columns and of the intervals between them, and different fractional volumes of neuropil. Thus, anatomical features are important factors to consider when making functional inferences about a particular region, since neither the intrinsic connectivity, the thalamic afferents nor the cytoarchitectural organization is identical across regions [13–15].

Functional Distribution of the Auditory Cortex

Tonotopy (i.e., the spatial arrangement of structures devoted to particular acoustic frequencies) is the functional characteristic most commonly used to classify the primary auditory cortex (A1) or core. The ordered arrangement of the frequency representation from the cochlea to A1 allows the identification of tonotopic cortical maps. In non-human primates these maps extend into the belt (albeit in a less precise pattern) [8, 16]. In humans, tonotopic maps show a high-to-low (posterior-to-anterior) frequency gradient, repeated in a mirror array, and are located along Heschl’s gyrus (even when the gyrus is bifurcated) [16]. The auditory cortex is not limited to decomposing the frequencies of a complex acoustic stimulus; it is also sensitive to the spectral profile of the stimulus, as suggested by the increased activation of A1 during stimulation with harmonic tones in comparison with pure tones [17].
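
To make the harmonic-versus-pure-tone contrast concrete, a pure tone contains energy at a single frequency, whereas a harmonic tone distributes energy across integer multiples of the fundamental while evoking the same pitch. The sketch below is an illustrative construction (the amplitudes and number of harmonics are arbitrary choices, not the stimuli used in the cited study).

```python
import numpy as np

fs = 44100                     # sampling rate (Hz)
dur = 1.0                      # duration (s)
f0 = 220.0                     # fundamental frequency (Hz)
t = np.arange(int(fs * dur)) / fs

# Pure tone: all energy at a single frequency f0.
pure = np.sin(2 * np.pi * f0 * t)

# Harmonic tone: energy at f0, 2*f0, 3*f0, ... with decreasing amplitude.
# Listeners hear the same pitch (f0) but a richer timbre / spectral profile.
n_harmonics = 8
harmonic = sum((1.0 / k) * np.sin(2 * np.pi * k * f0 * t)
               for k in range(1, n_harmonics + 1))
harmonic /= np.max(np.abs(harmonic))   # normalize to avoid clipping

# The magnitude spectra show one peak for the pure tone and eight for the
# harmonic tone, even though both are heard at 220 Hz.
spectrum_pure = np.abs(np.fft.rfft(pure))
spectrum_harm = np.abs(np.fft.rfft(harmonic))
```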

Functional assessment of the surrounding auditory cortices (belt and parabelt) is more difficult, because the linear representation of the stimulus that is evident in A1 breaks down in these regions. Moreover, the total area of the human belt and parabelt is approximately 9.6 times larger than that of their equivalents in the macaque brain (in contrast to the core region, which covers a greater cortical surface in the macaque). These inter-species differences have been proposed as cornerstones of language development in humans [8] and highlight the poor suitability of animal models for the study of inherently human traits, such as music and speech. However, different studies have reported that regions adjacent to the core and belt share tonotopic gradients. Woods et al. [8] evaluated functional magnetic resonance imaging (fMRI) activations on the cortical surface of the STG in response to attended and non-attended tones of different frequency, location, and intensity in humans. They reported that the core regions presented mirror-symmetric tonotopic organization and showed greater sensitivity to sound properties than belt fields, which in turn showed greater modulation by attention-driven processes. These data indicate that the belt region is probably involved in analyses prone to modulation by other factors, such as attentional resources, as evidenced by its greater activation during tasks requiring auditory recognition [8] or during stimulation with more behaviorally relevant sounds [18].

Manipulation of acoustic characteristics such as amplitude or frequency generates little or no modulation of activity in BA 22 or parabelt (divided into rostral and caudal parabelt; RP and CP, respectively), suggesting that its topographic organization is not related to the physical properties of the stimulus (as it is in the core and some portions of the belt) [19, 20]. Activation of the parabelt has been associated with verbal processing, semantic integration and the formation of “auditory faces”, among other processes [21]. The parabelt also shows a positive correlation between its activation level and the spectral and temporal complexity of the stimuli, with differences between the right and left hemispheres: temporal modulations, for example, produce increased activation of the parabelt in the left hemisphere, while spectral modulations do so in the right hemisphere [22]. All these data suggest that higher-level auditory areas combine information obtained previously (e.g., temporal and spectral) to form a unified representation of what is being heard [23].
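
Temporal and spectral modulations can be quantified with a modulation spectrum: a two-dimensional Fourier transform of the log-spectrogram, whose axes are temporal modulation rate (Hz) and spectral modulation density (cycles per frequency channel, or cycles per octave if the channels are log-spaced). The sketch below shows one common way to compute such a representation; it is only meant to illustrate the two dimensions referred to in the text and is not the analysis pipeline of the cited studies.

```python
import numpy as np
from scipy.signal import spectrogram

def modulation_spectrum(signal, fs):
    """Compute a simple spectro-temporal modulation spectrum.

    1. Short-time Fourier transform -> spectrogram S(f, t).
    2. Log-compress the power (rough loudness scaling).
    3. 2D FFT of the log-spectrogram: one axis corresponds to temporal
       modulations (Hz), the other to spectral modulations (cycles per
       frequency channel).
    """
    f, t, S = spectrogram(signal, fs=fs, nperseg=512, noverlap=384)
    log_S = np.log(S + 1e-10)
    mod = np.abs(np.fft.fftshift(np.fft.fft2(log_S - log_S.mean())))
    frame_rate = 1.0 / (t[1] - t[0])
    temporal_axis = np.fft.fftshift(np.fft.fftfreq(log_S.shape[1], d=1.0 / frame_rate))
    spectral_axis = np.fft.fftshift(np.fft.fftfreq(log_S.shape[0], d=1.0))
    return temporal_axis, spectral_axis, mod

# Example: a 4 Hz amplitude-modulated tone concentrates energy near
# +/- 4 Hz along the temporal-modulation axis.
fs = 16000
t = np.arange(2 * fs) / fs
am_tone = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
tm_axis, sm_axis, M = modulation_spectrum(am_tone, fs)
```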

Information Flow Within the Auditory Cortex

Kaas and Hackett [7] described a hierarchical organization in the primate auditory cortex using invasive electrophysiological methods, and an analogous hierarchical organization can be inferred from anatomical and functional data obtained with fMRI in humans [8, 20, 24]. Most studies attempting to assess complex aspects of auditory processing have focused on speech and language due to their relevance to humans. Okada et al. [25] evaluated the sensitivity to acoustic variation within intelligible versus unintelligible speech, and found that core regions exhibited higher levels of sensitivity to acoustic features, whereas downstream auditory regions, in both the anterior superior temporal sulcus (aSTS) and the posterior superior temporal sulcus (pSTS), showed greater sensitivity to speech regardless of its intelligibility, and less sensitivity to acoustic variation.

Beyond the strictly auditory regions, there are other auditory-responsive areas involved in higher-order processing that also receive sensory information from other systems (e.g., the STS also receives visual and somatosensory input). For example, speech-processing and voice-selective areas have been demonstrated in the upper bank of the STS [26–28].

Zatorre and Schönwiesner [29], while studying the involuntary capture of auditory attention, observed a temporal and spatial flow of information that depended on the characteristics of the acoustic stimuli. They showed that primary and secondary cortices respond to acoustic temporal manipulations in different ways: primary areas were involved in the detection of acoustic changes, whereas secondary areas extracted the details of such changes; a subsequent activation (with a lag of ∼50 ms) in the mid-ventrolateral prefrontal cortex was associated with memory-based decisions and with the novelty value of the acoustic change (regardless of the magnitude of this change) [30]. A similar result was reported by Patterson et al. [31], from an fMRI experiment that involved spectrally matched sounds producing no pitch, a fixed pitch, or a melody, designed to identify the main stages of melody processing in the auditory pathway. Based on their results, they suggested the following information flow during melody processing: (1) extraction of time-interval information (neural firing patterns in the auditory nerve) and construction of time-interval histograms (likely within the brain stem and thalamus); (2) determination of the pitch value and its salience from the interval histograms (probably occurring in lateral Heschl’s gyrus); and (3) identification of pitch changes in discrete steps and tracking of changes in a melody (in regions beyond the auditory cortex in the STG and/or lateral planum polare (PP)).
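
Stages (1) and (2) of this proposed flow can be approximated computationally: the all-order time-interval histogram of auditory-nerve firing is closely related to the autocorrelation of the stimulus, and the dominant interval (the largest autocorrelation lag within the pitch range) predicts the perceived pitch, even for complexes missing the fundamental. The sketch below illustrates this classical idea; it is a toy model that assumes an idealized signal-level autocorrelation rather than real neural firing patterns.

```python
import numpy as np

def autocorrelation_pitch(signal, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch as the dominant lag of the autocorrelation function.

    The autocorrelation plays the role of a summary time-interval histogram:
    its largest peak within the plausible pitch range gives the most common
    interval, whose reciprocal is the pitch (in Hz).
    """
    signal = signal - signal.mean()
    acf = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(fs / fmax)           # shortest period considered
    lag_max = int(fs / fmin)           # longest period considered
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return fs / best_lag

# Missing-fundamental example: harmonics 3-6 of 200 Hz (600-1200 Hz), with no
# energy at 200 Hz, are still heard (and estimated here) as a 200 Hz pitch.
fs = 16000
t = np.arange(fs // 2) / fs
complex_tone = sum(np.sin(2 * np.pi * k * 200 * t) for k in range(3, 7))
print(round(autocorrelation_pitch(complex_tone, fs)))  # ~200
```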

Popescu et al. [32] also studied information flow, but from the standpoint of rhythm; by changing the rhythmic features of a musical motif, they found widely distributed neural networks engaged during music perception. They reported activations, soon after stimulus onset, within and around the primary and secondary auditory cortices, but also in SM1 (the primary somatomotor area), the supplementary motor area (SMA) and the premotor area (PMA). These data suggest an important role for the motor cortex in music perception, and more precisely in the perception of the temporal patterns embodied in musical rhythm, leading the authors to propose two interrelated subsystems: one mediating the auditory input and an internal rhythm-generator subsystem (see Chapters 5.2 and 5.3).

The above-mentioned studies show that the perception of sound stimuli is a distributed process that follows a hierarchical order which, in the case of complex sounds such as music or speech, includes regions within and beyond the auditory cortices (e.g., premotor and supplementary motor areas, frontal regions). We must consider, however, that complex sounds are built from simple elements (intensity, frequency, onset) that combine into patterns as a function of time. The location and intensity of cortical activations elicited by complex acoustic stimuli depend strongly on the time scale of the stimulus itself, which can range from only a few milliseconds to the entire contour of a melody [32, 33]. Some of these data are supported by lesion and psychophysical studies of higher-order temporal processing (the analysis of sound sequences such as patterns of segmented sounds or music), which suggest that deficits in such processing are produced by temporal lobe lesions involving superior temporal areas beyond the primary auditory cortex [34, 35].

Music and speech are two examples of complex acoustic stimuli with great relevance to our species, given their role as information carriers. While the neural mechanisms required for their perception may be shared to a great extent [36], certain pathologies that affect one domain but not the other suggest a certain degree of independence [37, 38]. In the remaining sections we will mention several studies suggesting that selectivity for musical sounds exists, with evidence of cortical regions preferentially sensitive to musical stimuli over other types of complex sounds.

The Musical Auditory Cortex

In brain-music research, one of the most studied topics is pitch perception, and it has been reported that lesions encroaching on the right Heschl’s gyrus result in deficits in the perception of the pitch of spectrally complex stimuli with no energy at the fundamental [39]. Hemispheric differences were further demonstrated in an experiment by Zatorre and Belin [40], who found distinct areas of the auditory cortex in each hemisphere that respond to distinct acoustic parameters: the anterior auditory region of the right hemisphere showed a greater response to spectral than to temporal variation; a symmetrical area of the left hemisphere showed the reverse pattern; finally, a region within the right superior temporal sulcus also showed a significant response to spectral modulations but no change in response to temporal manipulations. In brief, cortical activity in specific areas of the left hemisphere was modulated by temporal manipulations, while spectral variations modulated the activity of right-hemispheric cortical structures (Fig. 2). With these data the authors support the hypothesis of right-hemisphere dominance for music perception, especially for pitch processing, alongside the putative role of the left hemisphere in temporal processing.

Fig. 2

Top panel: MRI images superimposed with functional activation assessed through positron-emission tomography. The upper left image shows a horizontal view through Heschl’s gyrus (z = 9 mm in MNI standard space); this region shows more activation in the temporal modulation conditions than in the spectral conditions. The upper right image corresponds to a horizontal view locating the anterior superior temporal region (z = −6 mm), which shows more activation in response to spectral manipulations than to temporal conditions. Bottom panel: error bars showing the percentage difference in cerebral blood flow between the temporal and spectral conditions. In this figure, the right hemisphere is presented on the right side of the image. Modified from Zatorre and Belin [40]

With this evidence as context, the following question arises: How is spectral information processed when it also carries linguistic information, as is the case in tonal languages? Current theories answer this question in two ways: (a) according to the cue-specific hypothesis (which holds that interhemispheric asymmetry is determined by low-level acoustic features of the stimulus), linguistically relevant pitch patterns would depend on processing carried out in right-hemisphere networks; and (b) according to the domain-specific model (which states that low-level acoustic features are not relevant for predicting hemispheric lateralization), the analysis of speech is carried out by a dedicated system engaging higher-order abstract processing mechanisms, primarily in the left hemisphere. Both proposals assume a hemispheric specialization for specific basic aspects of sound: the right hemisphere is more sensitive to slow temporal acoustic patterns (contour), while the left hemisphere has a finer temporal resolution (phonemes). Hemispheric specialization is also evident in higher-order analyses, as evidenced by the left-hemisphere dominance for speech perception. In tonal languages, where variations in pitch create differences in word meaning, it has been demonstrated that tone perception is lateralized to the left hemisphere in native speakers. In experiments where Mandarin speakers were asked to discriminate Mandarin tones and low-pass filtered homologous pitch patterns, there was increased activity in left inferior frontal regions for both speech and non-speech stimuli, whereas English-speaking listeners exhibited activation in homologous areas of the right hemisphere [41, 42]. The proposed conclusion was that pitch processing is lateralized to the left hemisphere only when the pitch patterns are phonologically significant to the listener; otherwise, the right hemisphere emerges as dominant and is involved in the extraction of the long-term variations of the stimulus.
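
The “low-pass filtered homologous pitch patterns” used in these studies are non-speech controls that preserve the slow pitch contour of an utterance while removing the spectral detail that carries phonetic information. A minimal way to create such a control is sketched below, assuming a simple Butterworth low-pass filter with a cutoff of a few hundred hertz; the exact filter and cutoff used in the cited studies may differ.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_homologue(speech, fs, cutoff_hz=300.0, order=4):
    """Return a non-speech control that keeps only slow pitch/contour cues.

    Low-pass filtering at a few hundred Hz removes most formant and phonetic
    detail while preserving the fundamental-frequency contour, yielding a
    'homologous pitch pattern' that is no longer intelligible as speech.
    """
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, speech)   # zero-phase filtering, no time shift

# Usage (hypothetical): load a recorded syllable, build its non-speech control.
# speech, fs = soundfile.read("syllable.wav")   # assumed external recording
# control = lowpass_homologue(speech, fs)
```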

Gandour et al. [43] also demonstrated that the left hemisphere appears to be dominant in processing contrastive phonetic features of the listener’s native language, showing fronto-parietal activation patterns for spectral and temporal cues even during the non-speech conditions. However, when the acoustic stimuli are no longer perceived as speech, the language-specific effects disappear, regardless of the neural mechanisms underlying lower-level processing of spectral and temporal cues, showing that hemispheric specialization is sensitive to higher-order information about the linguistic status of the auditory signal.

Following this line of thought, Rogalsky et al. [44] explored the relation between music and language processing in the brain, using a paradigm of stimulation with linguistic and melodic stimuli that were modified in rate (i.e., 30 % faster or slower than their normal rate). This experiment evaluated whether the temporal envelope of a stimulus (a feature that, according to several studies, plays a major role in speech perception) can elicit domain-specific activity, highlighting the regions modulated by the rate manipulations. They found some overlap in the activation patterns for speech and music restricted to early stages of processing, but not in higher-order regions (e.g., anterior temporal cortex or Broca’s area). Perhaps the most important result was that there was no overlap between the regions whose activity correlated with the modulation rate of sentences (i.e., anterior and middle portions of the superior temporal lobes, bilaterally) and those whose activity correlated with the modulation rate of melodies (dorsomedial regions of the anterior temporal lobe, primarily in the right hemisphere). This experiment attempted to isolate regions sensitive to rate modulation (a higher-order aspect of processing), finding that music and speech are processed largely within distinct cortical networks. As the authors acknowledged, it is important not to conclude from the apparently lateralized pattern for music processing that the right hemisphere preferentially processes musical stimuli (as is often assumed), because the lateralization effect described was due to the comparison of the activation patterns for music versus speech.

Selectivity for Music and Musicianship

Our group conducted an experiment to evaluate music perception, with the main objective of determining whether there are specific temporal regions that respond preferentially to musical stimuli (novel melodies with different timbres and emotional charge), as compared to other complex acoustic stimuli including speech and non-linguistic human vocalizations, monkey vocalizations and environmental sounds. With this paradigm, we tried to evaluate the cortical responses associated with music perception within an ecological context, using complex sounds without any kind of experimental manipulation (i.e., as we normally hear them in everyday life). Our intention was to elicit the activation of cortical regions involved in the perception of music without disturbing any of its parameters, and then to compare it with the activity elicited by other types of complex stimuli, particularly speech. Finally, we wanted to assess whether these hypothesized music-selective regions are modulated by prior musical training, considering that previous studies have revealed that specific musical abilities can modify not only the distribution of the functional networks but also the neuroanatomical characteristics associated with their processing [45–47]. To achieve our goals, we included individuals with and without formal musical training (the groups did not differ in terms of age or gender). This group comparison allowed us to look for differences in music processing based on the individual history of interactions (ontogeny) and to explore how experience can modify the overall processing of auditory stimuli. We used a paradigm of acoustic stimulation in an fMRI experiment, which included two main categories: (a) human vocal sounds, comprising non-linguistic vocalizations (e.g., yawns, laughs and screams) and speech (sentences in several languages); and (b) musical stimuli, excerpts of novel musical passages played on piano and violin. We found a functional segregation when we compared the cortical activity associated with the processing of any type of human vocalization versus the activity generated by musical sounds. The contrast of music versus human vocalizations revealed a discrete bilateral area located in the anterior portion of the STG (Fig. 3a; light and dark blue colors) that responded significantly more to music than to human voices; this region was located within Brodmann’s area 22, but extended into a more rostral portion named the planum polare. Notably, these differences remained significant even when comparing only violin versus speech, two stimuli with very similar spectro-temporal acoustic characteristics (Fig. 3a, bar graph). The regions activated during the perception of speech or non-verbal vocalizations (i.e., the opposite contrast) coincide with those reported in the literature: bilateral activation of the lateral STG and middle temporal gyrus (MTG), predominantly in the left hemisphere, where the cluster extended to the edge of the STG and STS, and other regions such as the hippocampus, the amygdala, and the inferior frontal gyrus (Fig. 3a, warm colors).

Fig. 3

Music-selective cortical regions. Voxels with significant activation (corrected cluster p < 0.05) are overlaid on the MNI-152 atlas, in radiological convention. (a) Music-sensitive regions (blue colors). Coronal and sagittal views (left and right, respectively); the clusters in light blue (music > human vocalizations [speech + non-linguistic vocalizations]) and dark blue (music > speech) show no overlap with the cluster in orange (human vocalizations > music); magnification of the sagittal view showing part of Heschl’s gyrus (HG) and the planum polare (PP). Bar plot showing BOLD signal change for each of the stimuli, obtained from the peak of maximal activation of the contrast testing for music > human vocalizations (speech and non-linguistic vocalizations). Error bars show the standard error. (b) Sagittal and axial views (left and right, respectively), showing the results from contrasts testing music > human vocalizations in musicians > non-musicians. Differential BOLD activity of the right planum temporale (green color), elicited by music or human vocalizations, was present only in musicians; the blue cluster (PP) is shown for reference. (c) Individual statistical maps from the analyses for music > human vocalizations (red color; p < 0.01 uncorrected), overlaid on T1-weighted images of 4 representative musicians (upper panel) and 4 representative non-musicians (lower panel). R right, L left

One way to interpret these results is to consider the planum polare as a relay in the processing stream for musical stimuli (and perhaps for other acoustically rich complex sounds) that receives information from the core and belt regions (among other association areas) and integrates complex acoustic attributes, serving as an integration stage required for the analysis of diverse features of the stimulus. Indeed, previous results demonstrate that this region co-participates with frontal regions in tasks involving pitch and melodic discrimination [48–50].

One of the strongest arguments for questioning the selectivity observed in the planum polare is to attribute the differences observed in the activity patterns for each sound category (e.g., speech or music) to differences in the spectro-temporal properties of the acoustic stimuli. However, Schönwiesner and colleagues [22, 51, 52] have shown that manipulations of spectro-temporal patterns along the time dimension are not sufficient to explain the activation of tertiary or higher-order cortices. Using complex broadband stimuli with a drifting sinusoidal spectral envelope (dynamic ripples), they measured spectro-temporal modulation transfer functions (MTFs) in the auditory cortex, finding that dynamic ripples elicited strong responses in primary and secondary cortices (on and around Heschl’s gyrus), but not in higher-order auditory cortices (e.g., the posterior superior temporal gyrus, the planum temporale or the STS). They argued that the lack of activity in higher areas may be due to two important characteristics of dynamic ripples: (1) their low acoustic complexity, i.e., higher-order areas might integrate information across the modulation spectrum rather than behaving as units with simple, summing MTFs; and (2) their lack of behavioral significance, arguing that higher auditory areas do not faithfully represent the physical properties of sounds but rather the relation between a sound and its behavioral implications [52]. Another finding that supports selectivity for music was observed in musicians, since only their group showed modulation of the planum temporale: whereas musicians presented similar activation for music and human vocalizations, non-musicians showed higher activity in response to human vocalizations (as compared to music).
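
Dynamic ripples are broadband stimuli whose log-spectral envelope is a sinusoid that drifts over time, characterized by a ripple density (cycles per octave) and a ripple velocity (Hz). A minimal synthesis sketch following this standard formulation is shown below; the parameter values are arbitrary illustrations, not those of the cited experiments.

```python
import numpy as np

def dynamic_ripple(fs=44100, dur=1.0, f_low=250.0, n_octaves=5, n_tones=100,
                   velocity=4.0, density=1.0, depth=0.9):
    """Synthesize a dynamic ripple: many log-spaced tones whose amplitudes
    follow a sinusoidal spectral envelope drifting in time.

    velocity : ripple velocity in Hz (temporal modulation rate)
    density  : ripple density in cycles/octave (spectral modulation)
    depth    : modulation depth (0..1)
    """
    t = np.arange(int(fs * dur)) / fs
    x = np.linspace(0.0, n_octaves, n_tones)       # position in octaves above f_low
    freqs = f_low * 2.0 ** x
    rng = np.random.default_rng(0)
    phases = rng.uniform(0, 2 * np.pi, n_tones)    # random carrier phases
    sig = np.zeros_like(t)
    for fi, xi, ph in zip(freqs, x, phases):
        # Time-varying amplitude of this component: drifting sinusoidal envelope.
        env = 1.0 + depth * np.sin(2 * np.pi * (velocity * t + density * xi))
        sig += env * np.sin(2 * np.pi * fi * t + ph)
    return sig / np.max(np.abs(sig))

ripple = dynamic_ripple(velocity=8.0, density=2.0)   # 8 Hz, 2 cycles/octave
```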

Subject-level analyses of our fMRI data revealed that bilateral activation of the planum polare during music listening was more prevalent in the group of musicians (27/28) than in non-musicians (13/25) (Fig. 3c). We discussed above that functional asymmetry in music processing postulates the right hemisphere as dominant, but in this experiment this asymmetry was modified in musicians, who showed no difference in BOLD signal modulation between the left and the right planum polare during music perception. Even though musicians and non-musicians likely share the same neural substrates for music processing (both perceive and distinguish what is and what is not music), musicians may recruit similar resources in both hemispheres, while non-musicians do so in an asymmetric fashion, suggesting a functional specialization related to musicianship.

Conclusions

The evidence presented in this chapter indicates that cortical responses to music are distributed and sophisticated; each area of the auditory cortex reveals its specialization according to its stage in the flow of information. Several studies provide data regarding a music-processing network that differs from the network associated with speech perception [44, 50, 53]; this means that the particular attributes of these two complex stimuli are processed by specialized networks, which are sensitive to the spectral and temporal patterns that distinguish each sound category. As a summary of the information flow, we can say that the right primary auditory cortex is more sensitive to the changes in frequency and timing that characterize music; that belt regions (besides presenting extensions of the tonotopic maps) begin to exhibit singularities, such as increased activation during directed attention to sounds and a preference for harmonic tones, among others; and that the parabelt region is involved in more complex processes, exhibiting preference or selectivity for acoustic elements inherent in music, or activating together with frontal regions during tasks involving the discrimination of tones and melodies [48, 54, 55]. Furthermore, it can be concluded that the networks involved in the perception of music show some specificity, which may become evident through plasticity processes such as training (i.e., through the history of the interaction between the listener and the stimulus) [44, 55–57]. These concepts will serve to develop more advanced and integrative models for the comprehension of music and speech processing.