5.1 Introduction

One of the most remarkable abilities of the human brain is to create. Creativity is the cornerstone of human culture, and is the core cognitive capacity that has enabled music throughout history. Much of the act of creating new music, such as in music composition, is an effortful process that requires prolonged persistence, motivation, and dedication. However, other aspects of musical creativity, such as musical improvisation, seem spontaneous and automatic, and appear to depend on states of flow that seize the improviser as they encounter musical ideas and produce novel musical output seemingly in real time. How is this real-time creativity possible: how does the brain tackle the problem of musical improvisation, and how does it accomplish this feat? Can improvisation be learned, and if so, how?

In this chapter, I begin with an introduction to the field of the Cognitive Neuroscience of Music. This includes a methodological overview of the tools and techniques commonly used to investigate the key components of music, the relative strengths and limitations of each technique, and their general findings. From this overview, I turn to the core question addressed in this chapter: how does the human brain accomplish musical improvisation? This question requires multiple answers, especially considering that musical improvisation is a complex system that necessarily involves many simultaneous excitations and inhibitions across the brain over time. Thus, I take a multi-level approach, discussing different levels of evidence in sequence. I introduce a hierarchical model of musical improvisation as a complex system, which affords multiple levels of description. From there I turn to each level and review the psychological and neuroscientific evidence in some detail. This is followed by a call to action for what might be missing, or still to be addressed, in the current state of the psychology and neuroscience of musical improvisation.

5.2 Cognitive Neuroscience of Music

Cognitive Neuroscience is, fundamentally, the biological study of the mind. As its parent fields are Psychology and Biology, Cognitive Neuroscience necessarily inherits the overarching questions of these fields, in that its main goal is to link brain structure and function to behavior. The field has made great strides in advancing knowledge, mainly by using emerging technologies to sense and to record brain activity with ever-increasing spatiotemporal precision and accuracy. As part of these advancements, there has recently been much interest in real-time brain processes. As music necessarily unfolds over time, listening to and playing music can be thought of as a quintessential real-time brain process.

Magnetic Resonance Imaging (MRI) is a technique that can provide insight into the structure as well as the function of the human brain. Structural images include anatomical images and Diffusion Tensor Imaging (DTI). These techniques are powerful when applied together because of their respective foci: anatomical images are best at characterizing grey matter, whereas DTI is best at imaging white matter. Grey matter includes the cell bodies of neurons and the dendrites that provide input to them; anatomical images of grey matter yield important measures such as the thickness of the cortex, the surface area of specific structures, and the volume of cortical and subcortical structures. In contrast, DTI is highly tuned toward white matter, which contains the axonal connections between neurons. Because DTI images the structural connections in the brain, it is an especially useful technique for studies where the hypothesis pertains directly to brain connectivity.

In contrast to the structural imaging techniques, functional techniques are important as they yield time-series data, which can then be assessed against EEG, MEG, and other more time-sensitive measures. Functional MRI (fMRI) studies include task and resting-state studies. Task fMRI studies typically involve the subject engaging in a variety of mental operations in the scanner, and comparisons between brain activations during experimental tasks and control tasks can yield regions and patterns of interest, which may be related to specific mental operations. For example, Limb et al. [34] compared task activations during the spontaneous generation of new musical sequences on an MRI-compatible keyboard against task activations during a control task of playing a musical scale. The control task does not involve spontaneous music generation but nevertheless does involve interacting with the same keyboard; thus, the motoric component and some aspects of the auditory component are accounted for in the task-fMRI design. By subtracting the brain activity during the control task from the brain activity during the experimental task, researchers can obtain activation patterns that are driven uniquely by the aspect that distinguishes the experimental task from the control task; i.e., the cognitive operations specific to the experimental task. For example, by subtracting brain activity during scale-playing from brain activity during the spontaneous generation of new musical sequences, it should be possible to isolate the effect of spontaneity while subtracting away the effects of audiation and motor movement, since the two latter operations are present in both the experimental and control tasks whereas spontaneity is a characteristic of the experimental task only. This design assumes a purely linear relationship between the cognitive operations engaged by fMRI tasks and brain activity; that is, it relies on the assumption that cognitive components are additive in their effects on brain activity, and it is known as the cognitive subtraction methodology. Results from this widely used cognitive subtraction methodology have been influential; for instance, in showing isolated activity in superior temporal lobe areas (the auditory cortex) during auditory tasks. In addition, task-related fMRI has shown attention-related modulations of stimulus-driven processing. For example, when attention directed specifically to auditory stimuli is compared to attention directed to visual stimuli, there is additional involvement of regions neighboring the auditory cortex [57]. This suggests that the extent of activations due to a predefined perceptual or cognitive operation can be modulated by the conditions of the task. In the study of musical improvisation, activations during the spontaneous generation of musical notes, contrasted against the motoric performance of overlearned sequences (such as musical scales), show activity in distributed regions throughout the brain, with the largest number of significantly active clusters within the frontal lobe [34]. This pattern of distributed clusters of activity may suggest that the experimental task of musical improvisation, which clearly involves multiple cognitive and perceptual operations, differs from a control task in multiple significant ways. The same results may also imply that different individuals use different strategies or sets of cognitive operations to approach the task of musical improvisation.
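
To make the subtraction logic concrete, the following minimal sketch (in Python, with simulated numbers) contrasts two voxel-wise activation maps. It is only an illustration of the arithmetic, not the general-linear-model pipeline with hemodynamic modeling and statistical thresholding that studies such as Limb et al. [34] actually use; all variable names and the threshold are hypothetical.

import numpy as np

# Illustrative cognitive subtraction: voxel-wise activation estimates for an
# improvisation task and a scale-playing control task (simulated data).
rng = np.random.default_rng(0)
n_voxels = 10_000
beta_improvise = rng.normal(0.2, 1.0, n_voxels)   # placeholder "experimental" map
beta_scale = rng.normal(0.0, 1.0, n_voxels)       # placeholder "control" map

# The contrast attributes surviving voxels to the cognitive component that
# differs between tasks (here, spontaneous generation).
contrast = beta_improvise - beta_scale
candidate_voxels = np.flatnonzero(contrast > 2.0)  # arbitrary threshold
print(f"{candidate_voxels.size} voxels exceed the (arbitrary) contrast threshold")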

5.3 Intrinsic Networks of the Brain

Although powerful, the standard model of task-related functional MRI has its own limitations. The predominant model of fMRI work involves cognitive subtraction, but there are caveats to cognitive subtraction as a standard model. In particular, abundant research in the past decade has shown that negative activations may be just as crucial as positive activations for subserving behavior, especially complex behavior such as musical improvisation. Evidence for this comes from the finding that specific regions of the human brain show correlated activity over time. Furthermore, across a variety of tasks in fMRI, some regions of the brain are positively correlated with each other in activity, whereas other regions are negatively correlated (anticorrelated). The most consistent set of regions that are anticorrelated with task-related activity includes the medial prefrontal cortex, the posterior cingulate cortex, and some regions in the temporal lobe. This network of regions is together known as the Default Mode Network [17]. The Default Mode Network has received intense interest and attention in Cognitive Neuroscience in recent years, partly because it appears to be active when the mind is not actively pursuing a task. Instead, the Default Mode Network is active during mind-wandering, or during stimulus-independent thought [12, 51]. This set of regions is also deactivated during effortful cognitive tasks, such as attention and memory tasks. The latter set of tasks activates the Executive Control Network and the Dorsal Attention Network, which are centered around the frontal and parietal cortices.

Mind-wandering is also thought to precede creativity [2]. Thus, these anticorrelated networks are useful in thinking about music and improvisation because musical improvisation can be thought of as a real-time creative act. Indeed, cognitive neuroscientists have found activity in the Default Mode Network during improvisation that is in flexible exchange with the brain’s Executive Control Network [3]. A review of the neuroscience of musical improvisation suggests that multiple large-scale brain networks are involved in improvisation; specifically, a network of prefrontal brain regions is commonly activated, including areas in the motor system (premotor and pre-supplementary motor area), areas in the language network (inferior frontal gyrus), and importantly areas in the classic Default Mode Network (medial prefrontal cortex) and Executive Control Network [4, 15, 47, 58]. Following up on these general findings, Belden et al. [6] asked whether findings from task fMRI studies might extend to differences in intrinsic brain networks as assessed by resting-state functional connectivity. Seeding the Default Mode Network and the Executive Control Network in resting-state fMRI, this study compared three groups with different musical training: improvising musicians, classical musicians, and controls with minimal musical training. Improvising musicians showed the highest functional connectivity from the Executive Control Network, whereas both classical and improvising musicians showed higher Default Mode Network connectivity than musically untrained controls. Interestingly, the primary visual network also showed higher functional connectivity to both the Default Mode and Executive Control networks in improvising musicians. The pattern of results that emerges from this study suggests that while classical (non-improvising) musicians had higher connectivity within specific networks, improvising musicians showed higher connectivity between networks, and both musically trained groups showed higher resting-state connectivity overall compared to musically untrained controls. This distinction between within-network and between-network connectivity may recur in creativity research more generally, as more research in the network neuroscience of creativity shows that individuals who are more creative in laboratory tasks tend to simultaneously engage multiple networks (in particular the Default Mode, Executive Control, and Salience Networks) that are usually anticorrelated in their activity [5].
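
As a methodological aside, the seed-based functional connectivity used in such resting-state comparisons reduces to correlating the time series of a seed region with the time series of every other region. The following is a minimal sketch with simulated data; the region labels and numbers are placeholders, not the networks or values reported in the studies above.

import numpy as np

# Simulated resting-state time series: timepoints x regions.
rng = np.random.default_rng(1)
n_timepoints, n_regions = 200, 50
ts = rng.standard_normal((n_timepoints, n_regions))

# A fake "seed" (e.g., a Default Mode Network node) that shares signal
# with region 0 only, for illustration.
seed = ts[:, 0] + 0.5 * rng.standard_normal(n_timepoints)

# Seed-to-region connectivity: Pearson correlation per region. Positive
# values indicate coupling with the seed; negative values, anticorrelation.
connectivity = np.array([np.corrcoef(seed, ts[:, r])[0, 1] for r in range(n_regions)])
print(np.round(connectivity[:5], 2))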

While fMRI is a highly effective method for localizing multiple brain networks at once, one shortcoming of fMRI is that its temporal resolution is quite low, on the order of several seconds. This is because of the inherent properties of fMRI: the technique makes use of the fact that neural activity is coupled with changes in the oxygenation level of blood flow. Thus, by measuring fluctuations in blood flow (known as the hemodynamic response), we can extract the Blood Oxygenation Level-Dependent (BOLD) response as the signal of interest in fMRI. Although changes in the activity of neuronal populations are causally related to the BOLD signal, the BOLD response requires several seconds to reach its peak after the corresponding neural activity. This delay, known as the hemodynamic lag, places limitations on the temporal resolution of fMRI. Since music is an art form that necessarily unfolds over time, temporal resolution is at least as important as spatial resolution for our understanding of the neural circuitry that enables musical functions.
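
The sluggishness of the BOLD response can be appreciated by convolving a brief neural event with a canonical hemodynamic response function. The sketch below uses a common double-gamma approximation of the HRF; the parameters are a standard textbook choice, used here purely for illustration.

import numpy as np
from scipy.stats import gamma

dt = 0.1                                   # seconds per sample
t = np.arange(0, 30, dt)

# Double-gamma approximation of the canonical HRF: a peak around 5-6 s
# followed by a small undershoot.
hrf = gamma.pdf(t, 6) - (1 / 6) * gamma.pdf(t, 16)
hrf /= hrf.max()

# A brief burst of neural activity at t = 2 s.
neural = np.zeros_like(t)
neural[int(2 / dt)] = 1.0

# The predicted BOLD signal is the convolution of neural activity with the HRF.
bold = np.convolve(neural, hrf)[: len(t)] * dt
print(f"Neural event at 2.0 s; predicted BOLD peak at ~{t[bold.argmax()]:.1f} s")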

5.4 Temporally Precise Indices of Brain Activity in Music

In contrast to fMRI, Electroencephalography (EEG) and Magnetoencephalography (MEG) are methods that enable examining the function of the brain with much higher temporal resolution. EEG relies on the electrical properties of brain activity, and MEG relies on the magnetic fields that are generated as a result of electrical brain activity. When groups of neurons fire, they generate electrical changes known as local field potentials. These local field potentials can be recorded as dipoles, which are separations in electrical charge. These dipoles propagate throughout the head and are recordable on the surface of the scalp using electrodes. The source of a dipole can be estimated by taking the second spatial derivative of the electrical potential across different recording sites on the scalp, thus giving a more precise location of the source of electrical activity, which can be interpreted as the neural generator of that activity. Because there are multiple geometric nonlinearities in the mapping between the head shape and the brain, such as those due to cerebrospinal fluid as well as the structure of the brain tissue, the mapping between the source of the dipole and the scalp-recorded potentials is often ambiguous. This places constraints on the spatial acuity of EEG, especially in areas of the brain that are near the inside folds (sulci) of the cortex, and in areas that are deep inside the head beneath the level of the cortex (i.e., subcortical structures). Thus, the spatial resolution of EEG is relatively low. On the other hand, the temporal resolution of EEG is high, as it relies on a direct measure of electrical brain activity itself. Established methods in EEG research have capitalized on this temporal resolution to obtain fine-grained time-domain and frequency-domain readouts of brain activity during music processing. One established EEG method is the Event-Related Potential (ERP) technique. ERPs rely on recordings of EEG during repeated presentations of the same stimulus, or the same category of stimuli, while precisely tagging the time window during which each stimulus is presented. By averaging the EEG time windows across all stimulus presentations, the randomly distributed noise sources in the EEG are averaged out, whereas the neural signal associated with the neural processing of the stimulus is preserved and effectively amplified relative to the noise. This increase in the signal-to-noise ratio of the EEG, as a result of time-locked averaging across many trials, yields an electrical potential that is specifically related to the neural processing of the event of interest (this trial-averaging logic is sketched in code below). The ERP technique has been influential in Cognitive Neuroscience of Music research, as it is sensitive to cognitive, top-down effects as well as to the bottom-up processing of sensory stimuli. One well-known ERP is the P300, a positive waveform around 300 ms after the stimulus, which is observed when a subject detects and responds to a task-relevant stimulus. Another well-known ERP is the Mismatch Negativity (MMN), a negative waveform around 200 ms after any stimulus mismatch (i.e., a slightly unexpected stimulus that is different from other stimuli in an ongoing stream of events). An ERP that is closely related to the MMN is the Early Right Anterior Negativity (ERAN), a negative waveform that is largest over the right hemisphere and occurs around 200 ms after the onset of an unexpected musical chord. Since the ERAN is elicited whenever unexpected chords are presented, it has come to be known as a neural marker of musical syntax [29].
The ERAN is elicited, albeit with a smaller amplitude, even when attention is directed away from music, such as when musical chord progressions are played in the background during a demanding visually presented cognitive task [38]. This finding of attention-dependent modulation of the ERAN suggests that the neural processes for syntactic structure in music are partially automatic; in other words, we process musical syntax without needing to allocate attention toward it, but attention certainly enhances the syntax processing. This puts musical syntax in the same category as linguistic syntax, which is also indexed by an early waveform, but one that is left-lateralized: the Early Left Anterior Negativity (ELAN) is usually observed after an unexpected word category, such as when a word of an unexpected syntactic category appears within a sentence [18]. The hemispheric asymmetry between the ELAN for language and the ERAN for music may seem to lend support to the popular idea that language is left-lateralized whereas music is right-lateralized; however, in practice the lateralization of these ERP results is much more variable, and depends on low-level acoustical properties of the stimuli as well as on higher-level task demands and/or instructions to the subject.
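
The trial-averaging logic referred to above can be sketched in a few lines of Python. The data here are simulated, and the sampling rate, epoch window, and onset times are arbitrary; real ERP pipelines add filtering, artifact rejection, and baseline correction.

import numpy as np

rng = np.random.default_rng(2)
fs = 250                                   # sampling rate (Hz)
eeg = rng.standard_normal(60 * fs)         # 60 s of simulated single-channel EEG
onsets = np.arange(fs, 55 * fs, fs)        # one simulated stimulus per second

# Epoch from -200 ms to +800 ms around each stimulus onset.
pre, post = int(0.2 * fs), int(0.8 * fs)
epochs = np.stack([eeg[t0 - pre: t0 + post] for t0 in onsets])

# Averaging across trials cancels activity that is not time-locked to the
# stimulus, leaving the event-related potential.
erp = epochs.mean(axis=0)
times_ms = np.arange(-pre, post) / fs * 1000
print(epochs.shape, erp.shape, times_ms[0], times_ms[-1])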

While the ERAN and ELAN are relatively early cortical ERP waveforms, other ERPs occur at later latencies during musical and linguistic processing. These mid- and late-latency waveforms include the N400 and the P600. The N400 is a negative waveform, largest at central-parietal sites, that peaks around 400 ms after the onset of a semantic incongruity [32], whereas the P600 is a positive waveform, largest at parietal sites, that peaks around 600 ms. Typically, the N400 is observed in response to semantic incongruity: a semantically unexpected sentence such as “I take my coffee with cream and socks” elicits an N400 after the unexpected word “socks.” In contrast, the P600 is elicited during sentences that require syntactic integration, such as when a sentence contains an ambiguous syntactic structure that the brain must resolve [31]. A classic example of this comes from “garden path” sentences, e.g., “The horse raced past the barn fell.” In this example, the phrase “raced past the barn” is initially interpreted as the main verb phrase (i.e., describing the action of the horse). However, when “fell” is presented, the brain has to re-interpret “raced past the barn” as a reduced relative clause that modifies the horse rather than describing its main action. This reinterpretation of the garden path sentence is known to elicit the P600 waveform [31]. Because the P600 is similar to the P3 in its waveform shape and topography, it has been suggested that these two effects arise from the same neural generators [11]. In music, a similar late positive effect has been observed in response to unexpected melodic changes that require the reinterpretation of chord structure [8, 55]. This suggests some overlap between the neural resources for processing language and music, especially for the syntactic structure of music and language.

The P600 or P3 effect is also observed in music cognition experiments when participants are asked to attend and respond specifically to chord structure. In one study, Koelsch et al. [30] presented participants with chord progressions that ended either in highly expected chords (tonic endings) or in unexpected chords (Neapolitan chord endings), and asked participants to press a button in response to the unexpected chords [29]. This attentive listening condition was compared against an unattended condition in which the same chord progressions were played in the background while participants did another task. ERPs showed a large ERAN during both attended and unattended conditions; however, in the attended condition the ERAN was followed by the P3, which was not observed in the unattended condition [30]. This pattern of results was partially replicated by Loui et al. [38], who also observed that the ERAN was elicited during attended and unattended listening, but with a larger amplitude during attended listening. In Loui et al.’s extension [38], the researchers further added an amplitude change to some of the chords, and participants’ task during the attended condition was to respond to the amplitude change. The P3 was observed only in response to the amplitude change, but not in response to the syntactically unexpected chord. This finding shows that the P3 is elicited in response to any feature of sounds to which attention is directed, and not to harmony or musical syntax per se. On the other hand, the ERAN is more specifically elicited in response to violations in musical expectancy, especially violations of chord structure. Musical training can also affect the neural processing of musical syntax, as indexed by these ERPs. In a study comparing improvising (mostly jazz-trained) musicians and non-improvising (mostly classical) musicians against non-musicians, Przysinda et al. [59] collected preference ratings for highly expected, slightly unexpected, and very unexpected chord progressions while EEG was recorded. Behavioral ratings showed that while non-musicians and non-improvising musicians preferred the highly expected chord progressions and disliked the very unexpected chord progressions, improvising musicians preferred the slightly unexpected chord progressions, and did not dislike the very unexpected chord progressions as much as the other two groups. ERPs in response to unexpected chords showed a larger ERAN among improvising musicians. Although all three groups showed the P3, it was largest among the improvising musicians. However, while the P3 returned to baseline in both the improvising musicians and the non-musicians, a late positivity persisted at around 800 ms after stimulus onset in the non-improvising (classical) group only. This late-latency positive effect suggests a rumination, or error-related processing, that lingers in the minds of the classical musicians even after the improvisers have returned to baseline. Taken together, the double dissociation between early processing in improvising musicians and late-latency processing in classical musicians highlights the effects of different genres of musical training: while classical training emphasizes accuracy and precision, training in improvisation emphasizes sensitivity to expectancy and engagement with unexpected events.
Methodologically, it is useful to have improvising musicians, classical musicians, and non-musicians as multiple control groups, as it enables a direct comparison between different types of musical training. Even though the Cognitive Neuroscience of Music has seen a surge of interest in recent years, with many studies being done on the effects of musical training, the vast majority of these studies have examined Western classical training as the dominant genre, with few studies at all on non-classical forms of training. By examining the effects of jazz improvisation training, this study makes a first attempt at quantifying the effects of such training on various neural and cognitive outcomes.

In another in-depth study on Western music improvisation, Goldman et al. [20] recorded EEG from musicians with varying levels of improvisation experience while they listened to an oddball paradigm, in which standard stimuli are interspersed with occasional deviant stimuli. Standard stimuli were chords, interspersed with occasional deviant chords that either did or did not serve a similar harmonic function. Participants with more improvisation experience showed larger N2 and P3 components while processing the functionally deviant chords, suggesting that the ability to engage in creative improvisation is related to differences in knowledge organization.

While ERPs provide accurate time-domain information about brain activity in response to stimuli and cognitive processes, frequency-domain information can also be informative, especially since abundant research in recent years has identified periodic oscillations at multiple frequencies as a fundamental feature of brain activity. Oscillatory activity is especially important in music and speech, as evidence shows that these complex auditory stimuli can engage phase-locked activity in the brain via the process of neural entrainment. When a listener hears acoustic stimuli with energy at certain frequencies, brain activity also shows power at the corresponding frequencies. These oscillations at specific frequencies do not only reflect passive processing of the stimulus by the auditory system; in some cases they also reflect the active parsing of stimulus streams. This is demonstrated clearly in an MEG study on the cortical tracking of speech: when meaningful Chinese speech was presented at a syllabic rate of 4 Hz, with noun or verb phrases at 2 Hz and sentences at 1 Hz, only Chinese listeners showed corresponding activity at 1 and 2 Hz. In contrast, English speakers who did not understand Chinese showed activity only at 4 Hz [14]. This finding suggests that cortical tracking of linguistic stimuli reflects comprehension, rather than merely the perceptual processing of acoustic stimuli. Findings such as this one show how informative it can be to take a frequency-based approach instead of a time-domain approach in analyzing and interpreting EEG and MEG data.

The frequency-based approach is especially informative when there are frequency-based hypotheses motivating the collection of the EEG and MEG data. In music, for example, one commonality across musical experiences is rhythm, which is a pattern of repeated durations over time. Rhythm is closely related to beat, which is the steady pulse that a listener extracts from the rhythm. The ability to extract a beat from an ongoing pattern of sounds is crucial to musical ability, and thus much of the literature in the Cognitive Neuroscience of Music perception and cognition is dedicated to the perception of rhythm and beat. Since the beat is tied to a specific frequency (e.g., a 1 Hz beat corresponds to 60 beats per minute), a frequency-domain analysis makes the problem of beat perception relatively tractable. Nozaradan et al. [54] recorded EEG from human subjects while they listened to rhythmic patterns that were designed to elicit a sense of beat and meter. They found that the EEG showed peaks of oscillatory activity at the rhythmic frequencies, with peaks observed at beat frequencies even when the acoustic energy at those frequencies was not necessarily strongest [54]. This observation of oscillatory brain activity at a beat frequency that is weak or absent in the acoustic stimulus has also been termed the “missing pulse” phenomenon [69], borrowing from the terminology of the “missing fundamental” phenomenon, in which pitch can be perceived virtually even without acoustic energy at the corresponding fundamental frequency [70]. Thus, rhythmic brain activity appears to be an index of how the mind interprets the stimuli, instead of being a faithful mirror of the stimuli themselves. In other words, these findings highlight the importance of oscillatory brain activity in coding not only for the bottom-up processing of rhythmic stimuli, but also for top-down brain mechanisms such as attention and memory that are involved in the experience of music.
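
The frequency-tagging logic behind these steady-state analyses can be sketched as follows: take the spectrum of EEG recorded during rhythmic listening and look for peaks at the beat and meter frequencies. The signal below is simulated (a 2 Hz beat component buried in noise); actual analyses epoch, average, and noise-correct the spectra.

import numpy as np

rng = np.random.default_rng(3)
fs = 250                                       # sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)                   # 60 s of simulated EEG
beat_hz = 2.0                                  # 2 Hz beat = 120 beats per minute

# Simulated EEG: a small beat-frequency component buried in noise.
eeg = 0.8 * np.sin(2 * np.pi * beat_hz * t) + rng.standard_normal(t.size)

# Amplitude spectrum; the peak should sit at the beat frequency.
spectrum = np.abs(np.fft.rfft(eeg)) / t.size
freqs = np.fft.rfftfreq(t.size, 1 / fs)
peak_hz = freqs[spectrum[1:].argmax() + 1]     # skip the DC bin
print(f"Largest spectral peak at ~{peak_hz:.2f} Hz (the beat frequency)")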

5.5 Attention Toward Moments in Time

The idea that attention can fluctuate rhythmically over time, especially in sync with music and speech stimuli, is formalized in the Dynamic Attending Theory [33], a framework that describes the brain as a set of internal oscillations, known as attending rhythms, that can entrain (i.e., tune in) to external events and focus on expected points in time. This model is especially appropriate as a formal description of the brain during musical experience: as the materials of music (pitch, harmony, melody, rhythm) unfold over time, these same musical materials can reasonably be expected to guide our attention over time as well. Ongoing research is aimed at understanding how the different musical materials guide attention, and how the ability and strategies used to attend to these musical features change across the human lifespan [16].
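
As a rough illustration of the entrainment idea at the heart of Dynamic Attending Theory (a simplified caricature, not the formal oscillator model of [33]), one can let an internal oscillator adjust its phase and period toward external event onsets so that attention comes to peak at expected points in time; the coupling constants below are arbitrary.

def entrain(onsets, period=0.6, phase=0.0, eta_phase=0.5, eta_period=0.1):
    """Adapt an internal oscillator to a sequence of onset times (in seconds)."""
    expected = phase + period                  # next expected onset time
    for t in onsets:
        error = t - expected                   # timing prediction error
        phase = expected + eta_phase * error   # nudge phase toward the event
        period += eta_period * error           # slowly adapt the period
        expected = phase + period              # update the next expectation
    return period, expected

# A beat slightly faster (0.55 s period) than the oscillator's initial 0.6 s period.
onsets = [0.55 * k for k in range(1, 9)]
period, next_expected = entrain(onsets)
print(f"Adapted period = {period:.3f} s; next expected onset = {next_expected:.2f} s")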

The ability to sustain attention is also related to the degree of engagement one feels toward a piece of music. For example, if a piece of music urges a person to move, then the act of engaging in movement likely helps to sustain attention over time in a rhythmically oscillatory manner. The pleasurable urge to move to music, also known as “groove,” has become an interesting topic that intersects the study of neural oscillations and entrainment, rhythm and beat, and pleasure and reward. In an attempt to understand how rhythmic patterns might activate the motor system, Stupacher et al. [68] stimulated the motor cortex using Transcranial Magnetic Stimulation (TMS) and measured the resultant Motor Evoked Potentials (MEPs) as a measure of the excitability of the motor system. Music that was rated as high-groove elicited larger MEPs, suggesting greater excitability of the motor system. This finding fits well with fMRI studies showing activity in the motor system during the perception of musical rhythm and beat [22]. Furthermore, people with motor disorders, specifically Parkinson’s Disease, are less able than controls to extract the beat from musical rhythms, as shown by a reduced beat-based advantage in discriminating musical rhythms [23]. This provides converging evidence that the motor system is important for beat perception.

While the engagement of the motor system explains the urge to move that one feels when listening to music, the link between pleasure and the urge to move is less clear. Insight comes from examining the link between groove and syncopation, the shift of acoustic events from metrically strong to metrically weak beats. The authors of one such study [73] compared pleasure ratings of drum-beats with varying levels of syncopation, and showed that medium degrees of syncopation yielded the highest desire to move and the highest pleasure ratings. Since syncopation is in itself a violation of rhythmic expectations, this preference for a medium level of syncopation is broadly consistent with an inverse u-shaped relationship between expectation and pleasure, which has long been hypothesized in music [52], and also in the psychology and biology of aesthetics more generally [7].

The ability to hone one’s expectations toward events at recurrent moments in time is likely a precursor to the tendency to synchronize with others. Interpersonal synchrony is a fascinating topic of recent investigations, especially since the tendency to synchronize with others seems to be tied to the sense of group identity and group affiliation. The effect of interpersonal synchrony on group affiliation was tested in a study in which pairs of participants were asked to rate how affiliated they felt toward each other after tapping either to synchronous or to asynchronous metronomes [26]. As hypothesized, ratings of interpersonal affiliation were higher for participants after tapping to synchronous metronomes. This suggests that the ability to entrain to another individual’s rhythmic behaviors is closely tied to social behavior, in particular affiliative behavior.

The link between synchronizing with other people and cooperative behavior appears to be established relatively early in life. This link was tested in 14-month-old infants, who were placed in a standard cooperation task after a rhythmic task. In the rhythmic task, the infants were bounced either in-synchrony or out-of-synchrony with an experimenter. Following this bouncing task, the infants were placed in a room with the experimenter, who “accidentally” dropped an object. Cooperation was measured by whether the infant helped by picking up and handing the dropped object to the experimenter. Infants who had bounced in synchrony with the experimenter were more likely to help the experimenter afterwards. Interestingly, cooperative behavior was observed even when the rhythmic bouncing was done out-of-phase (but still in-synchrony), suggesting that rhythmic synchrony, rather than the similarity or symmetry of movement, was what drove the cooperative behavior [13]. While this measure of cooperative behavior is important and informative as a measure of prosociality (i.e., behavior that benefits society as a whole), and may thus have optimistic implications for the effects of music on social behavior, it turns out that infants were only more likely to help individuals who had bounced with them; this helping behavior did not extend to other individuals who did not participate in the bouncing [71]. This suggests that rhythmic synchrony in music may help affiliation for the in-group only, rather than promoting altruistic behavior in general. Nevertheless, this cooperative behavior extends to older children (aged 4), who were more likely to cooperate after swinging rhythmically with a partner [61], suggesting that the link between rhythmic entrainment and cooperation is likely a stable relationship across the lifespan.

5.6 Prediction and Reward

Given that rhythm directs attention and expectation (as posited by the Dynamic Attending Theory) and relates to enjoyment and group affiliation, it is not surprising that the brain processes expectation in time in the same way that it processes other forms of pleasure and enjoyment. The currency with which the brain processes reward is dopamine, a neurotransmitter that is released when animals undergo hedonic experiences such as eating and sex [60]. The dopaminergic system in the brain includes several way stations: the substantia nigra, ventral tegmental area, nucleus accumbens, caudate, putamen, and orbitofrontal cortex. In particular, the nucleus accumbens, caudate, and putamen are known as the striatum. Together these regions tend to be active when processing extremely pleasurable rewards, such as when winning in gambling tasks [60]. Interestingly, even events that are themselves not rewarding, but that signal rewards (i.e., they provide a cue toward rewards), also activate cells in the dopaminergic system. Furthermore, events that signal rewards but are then followed by the lack of reward result in a decrease in the activity of dopamine neurons [67]. Thus, the dopaminergic system is thought to code the difference between the prediction of a reward and the actual experience of the reward; this difference is known as the reward prediction error (formalized below). Importantly, these same dopaminergic areas are active during the experience of intensely pleasurable moments in music [9, 10]. In a study that specifically related music to activity in the dopaminergic system, Salimpoor et al. [66] injected participants with radioactively labeled raclopride, a ligand that binds to dopamine receptors and whose binding is reduced when endogenous dopamine is released, and conducted Positron Emission Tomography (PET) scanning combined with functional MRI while people listened to pieces of music that they had selected as being intensely pleasurable to them. This combined PET and fMRI approach was especially useful because it allowed the researchers to simultaneously localize brain activity and establish its link to the dopaminergic system. Results from this study showed a peak of activity in the caudate during the anticipation of intensely pleasurable moments in music, and a peak of activity in the nucleus accumbens during the experience of the intensely pleasurable moment. This finding is exciting for Neuroscience generally and for Music Cognition researchers specifically, because it shows a differentiation between anticipation and experience. Both are important components of the experience of pleasure, but this distinction is especially important for music, which is fundamentally an interplay between expectation and experience. In further work, the same researchers showed that the auditory cortices are coupled in activity with the nucleus accumbens, and that this coupling is stronger during the experience of strongly preferred music [65], thus linking the experience of reward to activity in the auditory system. More evidence for the role of the reward system in music listening comes from musical anhedonia, a recently described condition characterized by a specific lack of sensitivity to musical reward. People with musical anhedonia feel no desire to listen to music, despite normal hedonic responses in other domains, such as visual art, food, and monetary reward [50].
Because of this fascinating dissociation between music and monetary reward, musical anhedonia can be considered a model system for examining the link between music and reward, and for examining what it is that makes music special within our culture. In a diffusion tensor imaging study on a striking case of musical anhedonia, Loui et al. [46] showed that, compared to a large group of controls, the musically anhedonic individual had less structural connectivity between auditory areas in the temporal lobe and reward areas in the nucleus accumbens. Looking at individual differences in white matter connectivity across a group of individuals who vary in musical reward responses, Martinez-Molina et al. [49] showed that structural connectivity between auditory areas and the insula, an area important for emotional and interoceptive functions, was correlated with individual differences in self-reported musical reward. Similar differences were seen when comparing people who frequently experience chills, or strong emotional responses, when listening to music with those who rarely experience chills during music listening [63]. Furthermore, when listening to music that a general population rated as rewarding, musical anhedonics showed no functional connectivity between auditory areas and the nucleus accumbens, providing further support for a disconnection between auditory and reward regions in the brain [48]. Reviewing these findings, Belfi and Loui [1] propose an anatomical model for the coupling between the auditory and reward systems, and posit a distinction between multiple types of predictions, only some of which may become rewarding.
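
The reward prediction error introduced above is often written compactly in temporal-difference form, a general formulation from reinforcement learning rather than anything specific to the musical studies cited here:

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$,

where $r_t$ is the reward received at time $t$, $V(s)$ is the learned estimate of expected future reward in state $s$, and $\gamma$ is a discount factor. Dopaminergic firing increases when $\delta_t$ is positive (an outcome better than predicted) and dips below baseline when a predicted reward fails to arrive ($\delta_t$ negative).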

The importance of prediction and expectation in the musical experience is the theme of Meyer’s seminal work Emotion and Meaning in Music [52], and is also the topic of Huron’s theory of musical expectation as laid out in his volume Sweet Anticipation [27]. Huron articulates the ITPRA model, which conceptualizes five phases of experiencing music (and indeed any experience that is time-dependent): Imagination, Tension, Prediction, Reaction, and Appraisal. This five-stage model is useful for thinking about the experience of music, partly because it separates the short-term, or immediate, predictions and responses surrounding an event from the longer-term buildup of expectations in the Imagination phase and the follow-up experiences of the Appraisal phase. Computational modeling studies have also begun to quantify the dynamics of information processing in music, and to relate these dynamics to brain activity. In particular, the Information Dynamics of Music (IDyOM) model simulates musical expectation using information-theoretic measures, making predictions for how much uncertainty there is at each moment in time and for how surprising each event is on a note-by-note basis [56]. Training this model on corpora of real music, and testing it against human ratings, shows an inverse u-shaped relationship between expectation and preference [19]. Using these computational tools coupled with neuroscientific methods, the field can begin to relate activity in the dopaminergic system to musical expectation in realistic music-listening contexts.
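
The note-by-note quantities that such models compute, surprisal (information content) and entropy, can be illustrated with a toy bigram model; the miniature corpus and melody below are invented for illustration, and IDyOM itself uses far richer variable-order context models [56].

import numpy as np
from collections import defaultdict

# Toy corpus of melodies (note names as strings), invented for illustration.
corpus = ["C D E C", "C D E G", "C E G E", "C D C D"]

# Count bigram transitions: how often each note follows each context note.
counts = defaultdict(lambda: defaultdict(int))
for melody in corpus:
    notes = melody.split()
    for prev, nxt in zip(notes, notes[1:]):
        counts[prev][nxt] += 1

def predictive_distribution(prev):
    total = sum(counts[prev].values())
    return {note: c / total for note, c in counts[prev].items()}

# Surprisal (-log2 probability) and entropy of the prediction, note by note.
melody = "C D E G".split()
for prev, nxt in zip(melody, melody[1:]):
    dist = predictive_distribution(prev)
    surprisal = -np.log2(dist.get(nxt, 1e-6))
    entropy = -sum(p * np.log2(p) for p in dist.values())
    print(f"{prev} -> {nxt}: surprisal {surprisal:.2f} bits, uncertainty {entropy:.2f} bits")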

Regarding music and reward, one persistent puzzle is that music, unlike food or sex, has no clear evolutionary advantage, yet music consistently ranks among life’s greatest pleasures. Why do we find pleasure in experiences that are not necessary for keeping ourselves alive, or for keeping our progeny alive? The answer must come from the interaction between music and inherent properties of the cognitive system. The ability to learn is fundamental to the cognitive system, and is evolutionarily advantageous as it enables organisms to adapt flexibly to their environment. As the ability to form correct predictions is likely adaptive, and as predictions must themselves be learned, successful prediction and learning may acquire reward value in and of themselves. Thus, the relationship between learning and reward is an active area of research in music and in neuroscience more generally.

5.7 Music and Language Learning

The literature on learning is also heavily influenced by work on language acquisition, which has shown that infants as young as eight months of age are able to learn the transitional probabilities of events (e.g., syllables) within an acoustic stream [64]. These findings provide evidence for a statistical learning mechanism within our cognitive system that can learn expectations from exposure, in much the same way that infants learn the grammatical structure of linguistic sounds in their environment even without explicit instruction. Music is likely learned in a similar manner. Evidence for the learning of musical structures via passive exposure comes from abundant findings in reaction time, subjective ratings, and neuroimaging and electrophysiological studies showing that even people with no explicit musical training have knowledge of musical structure. For example, the ERAN is observed in response to unexpected musical chords even among non-musicians [29]. When asked to rate their preference for different chord progressions, both musicians and non-musicians rated unexpected chord progressions as less preferred [42], although jazz and classical musicians differed in their preference for unexpected chord progressions [59]. Reaction times are also slower when non-musicians are presented with unexpected musical chords, even when their task is independent of chord structure [42], suggesting that we bring implicitly learned expectations into our experience of music.
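
The transitional-probability computation at the heart of these statistical learning findings can be sketched with a toy syllable stream; the stream and "words" below are made up in the spirit of such designs, not the actual stimuli of [64].

from collections import defaultdict

# A toy continuous syllable stream built by concatenating three "words"
# (bidaku, padoti, golabu), in the spirit of statistical learning designs.
stream = "bidaku" + "padoti" + "golabu" + "bidaku" + "golabu" + "padoti" + "bidaku"
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]

# Count how often each syllable follows each other syllable.
pair_counts = defaultdict(lambda: defaultdict(int))
for cur, nxt in zip(syllables, syllables[1:]):
    pair_counts[cur][nxt] += 1

def transition_prob(cur, nxt):
    total = sum(pair_counts[cur].values())
    return pair_counts[cur][nxt] / total if total else 0.0

# Within-word transitions (bi -> da) have higher probability than
# across-word transitions (ku -> pa), which is the cue to word boundaries.
print(transition_prob("bi", "da"), transition_prob("ku", "pa"))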

While most would agree that these expectations are implicitly learned, much remains unknown about how this musical knowledge is acquired, and about the extent to which these learned expectations interact with acoustic properties of the musical stimulus (e.g., consonance and dissonance). These questions are challenging to address because the vast majority of participants (at least participants that most music cognition labs can access) have already acquired an internal template or representation of the musical structures of their culture. In other words, common-practice Western musical structure is overlearned in most brains. In order to observe the human brain as it is learning musical structure for the first time [72], one can study younger populations to trace the developmental trajectory of musical knowledge, one can take a cross-cultural approach comparing the constraints of cognition across the musical systems of different cultures (Jacoby and McDermott [28]), or one can create a new musical system in order to test how the human brain responds de novo. In an attempt to create a new musical system, Loui et al. [39, 44] turned to the Bohlen-Pierce scale, which differs from existing musical systems of the world in its mathematical ratios. While musical systems of the world commonly recur around the octave, which is a 2:1 ratio in frequency, the Bohlen-Pierce scale recurs around the tritave, which is a 3:1 ratio. The equal-tempered Bohlen-Pierce scale has 13 logarithmically even divisions of the 3:1 frequency ratio (in contrast to the 12 logarithmically even divisions of the 2:1 ratio in the equal-tempered Western scale). Using melodies generated from this system, these studies showed that participants can rapidly learn melodies after exposure. Furthermore, participants could generalize what they had learned after exposure to a sufficiently large set of melodies, by choosing new melodies that followed the same grammatical structure. In addition, participants rated melodies that they had heard repeatedly during exposure as more preferable. Because this approach offers a systematic way to test the learning and liking of new music, the finding that humans rapidly identify and prefer grammatical structure in a new musical scale has optimistic implications for creating new music. EEG recordings made during the course of one hour of listening to the new musical system showed an ERAN in response to infrequent chords in the Bohlen-Pierce scale after around 20 min of listening [44]. Together, these results show that humans can rapidly learn to form expectations for new music by being sensitive to the frequencies and probabilities of events in the acoustic environment; in that regard, music learning is analogous to language learning [45]. Further follow-up studies showed that the input makes a significant difference in what is learned and preferred: the larger the set of exposure items, the more people learned; on the other hand, the more times each melody was repeated, the more people preferred those specific melodies [43]. Additionally, the ability to learn grammatical structure was found to be correlated with the white matter connections of the arcuate fasciculus, a superhighway that connects regions important for auditory perception and sound production, as shown in diffusion tensor imaging studies [40].
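
The tuning arithmetic behind the Bohlen-Pierce scale is simple to verify; the sketch below compares its 13 logarithmically equal divisions of the 3:1 tritave with the familiar 12 equal divisions of the 2:1 octave (the reference frequency of 220 Hz is arbitrary).

# Equal-tempered Western scale: 12 equal divisions of a 2:1 octave.
# Equal-tempered Bohlen-Pierce scale: 13 equal divisions of a 3:1 tritave.
base = 220.0  # Hz, arbitrary reference pitch

western = [base * 2 ** (n / 12) for n in range(13)]        # 12 steps up to the octave
bohlen_pierce = [base * 3 ** (n / 13) for n in range(14)]  # 13 steps up to the tritave

print("Western step ratio:       ", round(2 ** (1 / 12), 4))
print("Bohlen-Pierce step ratio: ", round(3 ** (1 / 13), 4))
print("Octave top:", round(western[-1], 1), "Hz; tritave top:", round(bohlen_pierce[-1], 1), "Hz")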

This same white matter pathway is implicated in multiple auditory-motor functions. For example, people who are tone-deaf, who have trouble perceiving and producing small differences in pitch (e.g., less than a semitone), have less white matter connectivity in the arcuate fasciculus [37]. These individuals also have difficulty with the statistical learning of new musical systems, using the same tasks described above [41]. Furthermore, people with tone-deafness are unaware of their own pitch production: when presented with pairs of pitches and asked to reproduce them by humming and to tell them apart by pitch height, people who are tone-deaf show a striking mismatch between their production and their perception, frequently singing a pitch interval that is different from what they report hearing [39]. This perception-production mismatch also points to a more general distinction between different pathways, or streams, in the auditory system. These ideas about separable streams in audition are partly inspired by analogous findings in the visual system [21]: for example, people with lesions in their visual cortex, who are blind and have no conscious awareness of what they see, are nevertheless able to scale the size of their grip as they reach toward objects in front of them. These types of findings provide support for dual-stream pathways in the visual system. In an analogous sense, the auditory system is posited to have multiple, separable pathways for processing “where” (location information) and “what” (identity information) [62]. These functional distinctions are also posited specifically for speech processing [25] and for musical functions such as singing [35]. Strong support for dual-stream pathways in audition comes from a diffusion tensor imaging study which showed that the superior branch of the arcuate fasciculus is less connected in people who are tone-deaf [37]. Furthermore, people with musical training have larger volume in the arcuate fasciculus [24], and a week of performing an auditory-motor task with musical cues (a motorically controlled form of musical training) resulted in increased integrity of the right arcuate fasciculus [53].

The same statistical learning mechanisms that are involved in language acquisition and music learning may be involved in creativity as well. Although creativity and statistical learning are commonly thought of as separate lines of research, Zioga et al. [74] tested the relationship between learning and creativity. They trained participants on an artificial musical grammar, using methods similar to those presented above for the Bohlen-Pierce scale. After each training session, participants created their own musical compositions, which were later evaluated by human experts. Results showed that the individuals who were better learners, as defined by the statistical learning task, also generated more creative new melodies as judged by the experts [74]. These results are among the first to link statistical learning to creativity. Future studies are needed to examine the extent to which similar neural substrates underlie learning and creativity.

5.8 Conclusions: Creativity at Multiple Levels

Taken together, it is clear that multiple cognitive and neural systems contribute to the brain’s ability to improvise creative musical output. Musical improvisation can be thought of as a combination of multiple levels of cognitive and neural computations [36]. At the highest level, musical improvisation serves the computational goal of using musical knowledge to generate auditory-motor patterns that are rewarding. This is accomplished by cognitive algorithms that involve statistical learning mechanisms which help to shape the learned musical structures, including melodies, harmonies, and rhythms. The cognitive algorithms also include idea generation and evaluation, and flexibly switching between the perception of auditory targets and the motor production of those targets. The underlying brain networks that implement these cognitive mechanisms include the auditory-motor, executive control, default mode, and reward networks. Future work may further delineate the relationship between learning and creativity, and between social information (e.g., visual and auditory information that comes from partners in group musical improvisation) and the prediction and reward systems. This social information likely involves neural entrainment of auditory and motor areas, which refine predictions and thus generate rewards. By studying the spatial and temporal dynamics of human brains as they engage in cognitive operations that are linked to musical improvisation, future work may design new biologically informed musical partners that aid improvisation with intellectually and aesthetically satisfying outcomes.