Introduction

The ability to understand, appreciate, and even create music is an inherent aptitude for many people. Understanding how the human brain perceives and engages with music can therefore provide valuable knowledge for optimizing the integration of music into daily life. In recent decades, researchers have provided substantial evidence regarding the neural mechanisms behind music perception. Music is a complex event taking the form of sequentially varying tones and rhythm, sometimes from multiple sound sources (e.g., instruments). Despite the intricate nature of music, many individuals exhibit the capacity to understand, recognize, learn, and recall music efficiently, irrespective of their training backgrounds, ages, or educational levels (Bigand & Poulin-Charronnat, 2006; Trehub, 2006; Trehub et al., 1984). However, how the memory for a piece of music is formed and developed in the brain with such efficiency is understudied. Moreover, there are currently surprisingly limited formal empirical data on the connections between music memory and other types of memory, especially concerning the neural systems supporting each. One consequence is the debate over whether the music memory system is distinct from those supporting other types of long-term memory.

One influential piece of evidence contributing to this debate comes from lesion cases demonstrating that damage to the medial temporal lobe, an area known for encoding and consolidating general declarative memory, did not prevent learning of novel music (Esfahani-Bayerl et al., 2019). However, given music’s intricate and multifaceted structure, it is important to consider that music memory may comprise hierarchical components, potentially reliant on diverse brain centers that individually support distinct memory processes. Thus, depending on how novelty is controlled and how memory is assessed, different impairments may or may not be expected in response to a lesion. Indeed, in his review, Altenmüller proposed that as music representations grow in complexity, progressing from pure auditory presentation to the incorporation of multimodal elements like visual performance, the implicated brain areas “graduate” from purely lower-level processing in sensory areas to higher-level prefrontal areas (E. O. Altenmüller, 2001).

To seek support for or against such a hypothesis of hierarchical neural networks for music memory, this review synthesizes current knowledge on the brain’s contributions to music memory. This involves categorizing music memory into different levels of information processing and representation. The goal is to outline the brain networks underpinning different music components and phases of music memory and to discuss the connection between music memory’s neural mechanisms and the underlying areas’ roles in other memory domains. Notably, we delve into the role of the medial temporal lobe, a well-known declarative memory hub, in music memory formation and retrieval. One of our main hypotheses is that, instead of there being a “unique” music memory system independent of the medial temporal lobe, music memory is upheld by a “syntax processing network” (syntax network), comprising mainly the auditory cortex and frontal-parietal regions, alongside a “contextual associates network” (context network), situated mainly in subcortical limbic and reward regions. Within the hierarchical representation of a music composition, we propose that the syntax network preprocesses and encodes the first level of musical pattern/structure, while the context network supports comprehension of higher levels of music meaning and emotion (“music semantics,” a term from the music literature; shown in Fig. 1). Notably, this taxonomy reveals overlaps between the neural networks involved in music memory and those identified by decades of systematic neuroscience studies of nonmusic memory. Given society’s widespread interest in using music as an environmental tool to help humans memorize other events (e.g., from its use in advertising to therapeutic settings), establishing connections between the neuroscience of memory and the neuroscience of music becomes imperative.

Fig. 1

Anatomical summary of the two main music-memory neural networks

This figure shows the two major brain networks proposed to support music memory. In the left hemisphere, it shows the inferior frontal gyrus, superior temporal gyrus, and inferior parietal lobule, which together operate as the “music syntax network” that supports memory of music syntactical structure and provides a first-level understanding of music sequential patterns. Separately, in the right hemisphere, the figure shows the limbic system (hippocampus and amygdala) and the striatum, within which the ventral striatum plays an important role in reward feedback. These three areas together support the memory of associated emotional/episodic contexts evoked by and surrounding the music itself. As described in the text, the two types of memories supported by these networks might interact such that the “syntax network” provides a first-level structural understanding of the music pattern for further aesthetic/higher-level analysis of the music, such as its emotion, processed by the limbic and reward system. Conversely, the limbic and reward system adds emotional and contextual elements to the music itself and potentially makes the music structure more memorable. The two networks are depicted on separate hemispheres only for ease of visualization; no lateralization of these networks is implied.

*Note that the word “semantics” is used in both the memory and music literatures with slightly different meanings. In the music literature, the “semantics” of music stands for the higher-level information it conveys (Koelsch, 2009), in a way similar to language. Semantic memory, on the other hand, refers to memory for a collection of general/nonepisodic knowledge. In this review, we will typically use the term “music semantics” in the manner of the music literature, to refer to the emotional and aesthetic meaning of music. (Color figure online)

Overview: Hierarchical music memory

Music encompasses various traditional memory categories, including semantic memory and episodic memory (Tulving, 1972), making it a multifaceted event. Music learning initiates with sound processing. As summarized in detail by several review chapters (Ginsborg, 2004; Jäncke, 2019), initial learning takes place at a sensory level, during which the brain forms separate sensory memories for streams of tones and temporal intervals (the “meter pattern”). To perceive music as a “whole” integrated event, tonal sequences are chunked into melodies, at which point listeners are able to perceive the auditory input as music rather than noise or speech. The ability to distinguish music from other auditory patterns is inherent (Deutsch, 1969; Trainor et al., 2004; Trehub, 1987; Winkler et al., 2009). This ability is commonly attributed to motor and sensory areas in the brain, including but not limited to the auditory cortex (BA 41/42, superior temporal gyrus), presupplementary motor area (pre-SMA), and others (Gaab et al., 2003; Groussard et al., 2010; Jäncke et al., 2003; Koelsch et al., 2009; Kunert et al., 2015). These areas, which serve as first-level processors for diverse auditory inputs, are commonly implicated in most music-listening tasks. However, this review’s primary focus is on the mechanisms of remembering a piece of music (and its associates), and therefore it will not extensively elaborate on these important sensory systems.

Beyond sensory processing, knowledge and memory of a specific piece of music are hierarchical (Ginsborg, 2004): from being able to recognize a familiar composition, to being able to sing along with (retrieve) the parallel lyrics, to knowing the associated title, composer, or the attached story and personal experiences with the music. However, many studies have defined long-term memory (LTM) for music in different ways, yielding variability in the reported neural correlates. Some literature has tried to categorize music LTM using Tulving’s traditional model and suggested that the memory of a piece of music can be implicit (e.g., procedural memory such as playing an instrument), semantic (lexicon knowledge), or episodic (e.g., “when” or “where”; Jäncke, 2019). But it has become clear that it is very difficult to test the differences between these forms of memory expressed by the brain in response to music in an experimental setting. For instance, music can be so powerful in cueing associated events/emotions that, when retrieving the “semantic” knowledge of a piece of music (recognizing whether it has been heard before), the possibility of simultaneously retrieving the episodes paired with it cannot be excluded. Similarly, it is also difficult to tease apart the implicit and semantic (explicit) components of “knowing a piece of music.” For instance, learning to perform a piece of music (such as playing it on an instrument) involves both explicit and implicit learning—instrument performance is a type of implicit procedural memory, but it also requires semantic understanding of the music structure and tonal lexicon. Thus, instead of using a traditional categorical memory model from psychology to explain music memory, here we propose an alternative perspective: scholars can more reliably delineate music properties and their neural correlates by evaluating music LTM at two levels, according to the nature of the music components being learned/memorized.

The first level is music syntactical structure memory—memory for the syntactical pattern and structure of the music itself. A large proportion of existing neuroscience studies on music memory concentrate on this level, investigating memory of melody+rhythm patterns by posing questions such as “Can you recognize this melody?” or “Which ending tone corresponds to the previously learned music?” (Halpern & O’Connor, 2000; Sikka et al., 2015). We label the second level contextual associates memory—incorporating nonstructural elements that still contribute to the extended memory representation of the music, such as facts about the piece (e.g., its title), the associated lyrics, the paired episodic memory traces, and the emotions elicited by, or surrounding experiences with, the composition. These associative elements provide a higher level of semantic/aesthetic meaning, enriching the understanding and memory of the music and carrying autobiographical significance.

This framework is important to a hierarchical understanding of music memory because how listeners process and encode the music structure is sometimes associated with how they understand the attached semantic meaning or autobiographical significance—for example, any story or emotion that the tonal relationships of the music express (Krumhansl, 2002; Meyer, 2008). We hypothesize that this hierarchical aspect is mirrored in the organization of the neural correlates—music structure/lexicon processing and encoding primarily occurs in “syntax processing” regions (NB, such processing is still inherently tied to memory, much like linguistic syntax processing), while associations between music and contexts are likely accomplished by broader regions, such as the hippocampus, a region important for binding events and building declarative relational associations (Olsen et al., 2012). Adopting this perspective on music memory, rather than traditional memory distinctions, might explain the seemingly disparate neural correlates of music memory in the literature, given that the music literature has used highly variable paradigms and focused on various stages of memory processing.

The following section of this review will categorize and summarize the various task types used in the past, giving an overview of how different brain regions appear to support remembering different components of music, from initial encoding/preprocessing to long-term memory retrieval. It is worth noting that some stages of music memory have only been minimally addressed in the literature. This is a challenge for making inferences about what the neural correlates signify, but it is also an opportunity for future research to focus on those types of designs and fill gaps in our understanding.

Music syntactical structure memory

The essence of music structure memory—A rule-governed system

Encoding music structure in memory is highly dependent on a syntax knowledge system which combines variations over multiple attributes/dimensions (e.g., pitch, meter) across time into an integrated event (Krumhansl, 1991; Levitin & Tirovolas, 2009). Music theorists term this multidimensional processing the hierarchy of music—a stable sound structure that determines when an auditory sequence qualifies as “tonal” (Krumhansl, 1991; Leonard, 1956; Narmour, 1983). This hierarchical structure differentiates music from noise, making music memorable, recognizable, and meaningful in a manner similar to language.

This syntax processing is inherently a memory-related phenomenon—as with language, we understand music according to the mappings between what we hear and the prior compositions and rules stored in our memory. Interestingly, most people appear to possess an innate familiarity with certain music syntax. For example, infants prefer consonant over dissonant music pieces, as adults do (Trainor & Heinmiller, 1998). Similarly, the ability to recognize whether adjacent tones deviate from “tonality” has been observed as early as Day 2 or 3 postpartum (Stefanics et al., 2009). Infants can also detect violations in temporal patterns. During initial exposure, infants could segment tones based on temporal intervals using a Gestalt-principle strategy, much as adults do (Hannon & Trehub, 2005; Trehub, 1987). The inherent detection of dissonance, whether tonal or temporal, is occasionally referred to as sensory/tonal dissonance and may be linked directly to the stimulation of early sensory structures such as the basilar membrane (Johnson-Laird et al., 2012; Tillmann et al., 2014).

Beyond such seemingly innate rule processing, people continue to learn and update their memory of music syntax through experience, which in turn can facilitate the encoding of new music and analysis of the hierarchical structure of a novel piece. People are faster at learning a piece of music from their own culture (Demorest et al., 2010), suggesting utilization of a “learned syntax” schema when learning music of familiar styles. This “learned syntax” could develop during early stages of life—one study found that infants as young as 4 months old already showed a preference for the rhythmic style of their own culture (Soley & Hannon, 2010). An EEG study also found that children display greater prediction-error signals when exposed to musical input that violates their cultural musical syntax (Jentschke et al., 2008). Behavioral and modeling studies suggest that novel music learning always involves an interaction between old and new syntax processing. That is, from memory, listeners generate predictions during music listening based on prior knowledge of music syntax, and any “surprising” or “prediction-violating” tones prompt listeners to update their probability estimates of the tonal relationships, resulting in updated music syntax and more effective encoding of new music pieces (Deutsch & Feroe, 1981; Leman, 2012; Pearce & Wiggins, 2012). Because syntax processing is grounded in long-term memory, studies investigating the brain areas that support memory of music structure should not focus only on how a specific piece of music is learned. It is also important to consider how the syntax associated with a composition is learned and updated, especially in terms of how the “old syntax” from prior music experience interacts with “novel music.”
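To make this prediction-and-update account concrete, the sketch below (in Python) implements a minimal first-order transition model over tone labels: each incoming tone receives a surprisal value under the currently learned statistics, and the statistics are then updated by that exposure. This is only a toy analogue of the probabilistic models cited above (e.g., Pearce & Wiggins, 2012), not a reimplementation of any of them; the tone alphabet, smoothing constant, and example melody are arbitrary illustrative choices.

import math
from collections import defaultdict

class ToneTransitionModel:
    """Toy first-order model of tone-to-tone transition probabilities."""

    def __init__(self, alphabet, smoothing=1.0):
        self.alphabet = list(alphabet)   # e.g., scale degrees or pitch classes
        self.smoothing = smoothing       # additive smoothing for unseen transitions
        self.counts = defaultdict(lambda: defaultdict(float))

    def probability(self, prev_tone, tone):
        row = self.counts[prev_tone]
        total = sum(row.values()) + self.smoothing * len(self.alphabet)
        return (row[tone] + self.smoothing) / total

    def surprisal(self, prev_tone, tone):
        # Surprisal in bits: high when a tone violates the learned "syntax."
        return -math.log2(self.probability(prev_tone, tone))

    def update(self, prev_tone, tone):
        # Listening experience updates the stored statistics (the "learned syntax").
        self.counts[prev_tone][tone] += 1.0

def listen(model, melody):
    """Return the surprisal of each tone while incrementally updating the model."""
    surprisals = []
    for prev_tone, tone in zip(melody, melody[1:]):
        surprisals.append(model.surprisal(prev_tone, tone))
        model.update(prev_tone, tone)
    return surprisals

# Hypothetical usage: repeated exposure to a regular pattern lowers surprisal,
# while the deviant final tone produces a spike (a "prediction violation").
model = ToneTransitionModel(alphabet=["C", "D", "E", "F", "G", "A", "B"])
melody = ["C", "E", "G", "C", "E", "G", "C", "E", "G", "C", "E", "B"]
print([round(s, 2) for s in listen(model, melody)])

In this toy run, the surprisal of repeated transitions falls across exposures, while the final unexpected tone yields a large value—the behavioral signature ascribed above to “prediction-violating” events.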

Given the extensive and intricate range of topics addressed regarding music structure memory in this review, we have included Table 1 as a concise point of reference. This table summarizes the common tasks employed in prior studies concerning each stage of music structure memory, along with the primary brain areas frequently reported in the literature. For a visual representation, please refer to the corresponding section on the left side of Fig. 2, which presents a brain map of music structure memory.

Table 1 Music structure memory and its neural correlates. Summary of common tasks used in the literature related to music structure memory, and the main brain areas reported
Fig. 2

Detailed visualization of brain regions supporting music memory formation. This figure shows the major brain regions that (1) support each stage of music structure memory, from initial preprocessing to retrieval, and (2) support the contextual associations, including episodic, semantic, and emotional associations. Except for the inferior frontal gyrus, which is recruited during all stages of both music memory types, the figure shows that music structure memory involves predominantly neocortical areas, whereas the contexts associated with the music are supported by the limbic system and subcortical reward circuitry. Note. This is not a formal meta-analysis result; the highlighted brain areas are those most frequently reported throughout the literature, modest variations in specific coordinates notwithstanding. IFG = inferior frontal gyrus; STG = superior temporal gyrus; ACC = anterior cingulate cortex; OFC = orbitofrontal cortex; BG = basal ganglia; IPL = inferior parietal lobule; SFG = superior frontal gyrus; SPL = superior parietal lobule; MTL = medial temporal lobe; DMN = default mode network; mPFC = medial prefrontal cortex; DLPFC = dorsolateral prefrontal cortex; OFG = orbitofrontal gyri. (Color figure online)

Initial syntactical analysis for memory encoding—Preprocessing

Before delving into the literature on memory encoding per se, it is important to emphasize that the initial learning of novel music involves a comparison between auditory input and the schema of music syntax—an abstracted memory and rule set for the anticipated structure of compositions. Examining brain activity during novel music listening and comparing syntax-violating compositions with syntax-consistent ones can reveal the neural basis of encoding novel music structure and how preexisting schemas support this process.

A common paradigm for such comparisons is the chord progression paradigm. Such tasks manipulate the chord structure of a musical sequence so that it violates a specific harmonic progression syntax (e.g., the Western tonality system) at certain points. Subjects either simply listen to the deviant sequence or select from provided chords to make the music more consonant (consonant music: a combination of tones that is musically pleasing and melodious). Using this paradigm, EEG data show error-related potentials when listeners encounter irregular tones that violate the schema to which the composition has been mapped through prior learning. These include the early right anterior negativity (ERAN), an ERP effect occurring around 200 ms after a syntactically irregular chord (Kalda & Minati, 2012; Koelsch et al., 2005; Koelsch et al., 2007; Pagès-Portabella & Toro, 2020), and the mismatch negativity (MMN), a fronto-central negativity evoked most strongly by an “error” tone (Lappe et al., 2013; Rohrmeier & Koelsch, 2012). A recent study using multiple source analysis identified Broca’s area, along with its right homolog, as the dominant source of the ERAN. The primary auditory cortex, particularly the superior temporal gyrus (STG), emerged as the dominant cortical source of the MMN during the processing of harmonically irregular chords (Villarreal et al., 2011). Moreover, an intracranial EEG study provided more direct spatial measures, pinpointing the superior temporal gyrus (STG) and inferior frontal gyrus (IFG), including Broca’s area, as sources of syntactical error detection signals for both music and language processing (Sammler et al., 2009). fMRI studies of listening to irregular chords revealed strong activity in these areas (STG and IFG), plus the premotor cortex, prefrontal cortex, and emotion- and long-term-memory-related areas including the orbitofrontal cortex, amygdala, hippocampus, and anterior cingulate gyrus (Bravo et al., 2020; Koelsch, 2006, 2009; Koelsch et al., 2008; Koelsch et al., 2005a; Kunert et al., 2015).

An emergent theme from such work is the sensitivity of these neural correlates, especially the STG and IFG, to the detection of syntax-violating musical sequences. Interestingly, however, some of this sensitivity might be innate. A recent study used fMRI in 1–3-day-old newborns while they passively listened to tonal music versus altered dissonant music and showed increased hemodynamic responses in the right auditory cortex and left inferior frontal cortex to dissonant music excerpts (Perani et al., 2010). Thus, it is reasonable to postulate that during initial music structure processing, detecting deviations in tonal relationships or comparing the perceived musical pattern with stored syntax begins in the primary auditory cortex (mainly the STG) and the IFG, which are sensitive even with minimal or no prior learning. These areas have long been found to relate to language processing, including grammatical and phonological processing (Binder et al., 1996; Koelsch et al., 2005b; Mesgarani et al., 2014), and their early sensitivity to the basic structure of auditory sequences may explain why they also support learning new music pieces according to whether they follow already-learned syntax/styles.

Chord progression tasks mainly focus on the effects of violating musical tonality. The above results suggest that the syntactical relationships between tones are processed in a brain system similar to that for language (Kunert et al., 2015). However, fewer neuroscience investigations have explored other musical structural components, such as temporal relationships or metric processing. This gap is noteworthy because the temporal intervals between tones are pivotal in defining a piece of music’s unique identity. Indeed, keeping the order of tones the same but changing the temporal intervals between them can make one song be perceived as another. In this sense, remembering the structure of a piece of music also requires encoding the temporal pattern in addition to the tonal pattern. Although understudied, some research teams have provided important insights into how this dimension of music structural encoding is supported during early stages, using rhythmic progression designs in which participants heard music with irregular temporal intervals. fMRI results comparing listening to regular versus randomly timed rhythmic patterns of musical tones revealed differential activation in the left inferior parietal lobule (IPL) and frontal operculum, but not in superior temporal areas (the auditory cortex; Limb et al., 2006). In subcortical areas, the basal ganglia were found to be robustly activated by regular beats in comparison with irregular rhythm sequences (Grahn, 2009). Another group used magnetoencephalography (MEG) to compare passive listening to melodic and rhythmic deviations and, interestingly, identified the inferior parietal lobule (IPL) as involved in processing temporally irregular music but not tonal irregularities (Lappe et al., 2013). The same group later used EEG and asked participants to detect melodic and rhythmic errors during music listening. Results indicated that rhythmic deviations activated inferior and superior parietal areas, along with the supplementary motor area, whereas melodic errors recruited the tonal syntax areas, primarily the STG and IFG (Lappe et al., 2016). Although limited, these data are important as they indicate a distinction in the neural substrates for metric and melodic rule processing and analysis at early stages of music memory encoding: while the former relies on motion-related areas such as the parietal cortex and striatum, the latter may predominantly recruit the superior temporal cortex and inferior frontal cortex. A few other paradigms investigating temporal processing using same–different rhythm discriminations (Kuck et al., 2003; Thaut et al., 2014) and beat perception/reproduction tasks (Chen et al., 2008; Grahn & Brett, 2007; Konoike et al., 2012) suggest a parietal-motor-cerebellum network for rhythm processing, encoding, and representation. Outside the music domain, these areas, including the IPL, pre-SMA, and cerebellum, are closely associated with time perception (Assmus et al., 2003; Koch et al., 2007; Lee et al., 2007; Livesey et al., 2007). In particular, the IPL has been associated with the integration of space and time. This potentially forms a core mechanism underlying rhythm learning in music processing, as music memory formation relies on integrating tonal and temporal relationships.

In summary, the initial stage of encoding a novel music piece can engage a dynamic interplay between perception/processing and preexisting knowledge/memory of syntax. Both melodic and rhythmic structure processing studies demonstrate a cortical network for such syntax-based initial musical processing. This neocortical cluster (depicted in Figs. 1 and 2) supports the unfolding and understanding of continuous musical input over time. This, as detailed above, serves as a foundational element for the hierarchical encoding of music as an integrated memory. This system operates early in the perception of musical input and provides initial support for listeners to scan and “preprocess” music—organizing the component music representations to be ready for subsequent encoding of the music into long-term memory. Concurrently, the process of contrasting established syntax schemas with newly perceived musical sequences can foster continued updates to the syntactical representation. Following this progression, our next section will delve into evidence on the mechanisms underlying the acquisition of novel syntax.

Memory encoding—Music and syntax learning

The fidelity (precision and detail) of memory encoding in the brain—the process of storing information into long-term memory for later retrieval—correlates strongly with whether the memory can be successfully recalled later (Hasselmo, 2006). While memory encoding for music structure has been explored using behavioral measures, limited neuroimaging data are available on the underlying neural correlates. Here, we consolidate findings from three types of encoding-related tasks commonly used to uncover the brain mechanisms of music structure encoding in memory.

One exciting approach is the cross-culture music-learning design, which asks how humans process and learn music from different cultural styles. These studies can reveal mechanisms for both the encoding of single novel music pieces and the encoding of new syntactical regularities by comparing how people learn new music from familiar and unfamiliar cultures, particularly in light of behavioral results showing that humans are better at encoding, recognizing, and understanding music from their own cultures (Demorest et al., 2016; Drake & El Heni, 2003; Morrison et al., 2008; Soley & Hannon, 2010). Results from this line of research imply that a proportion of the regularity present in the structure of outside-culture music is not already represented in the listener’s long-term memory; this provides a window into structure learning that is more de novo when encoding a particular piece than is possible with the commonly used within-culture music stimuli. In the brain, compared with culturally familiar music, learning unfamiliar-style music prompted increased activity in the angular gyrus, middle frontal gyrus, insula, cerebellum, and paracingulate cortex (Demorest et al., 2010; Nan et al., 2008). In particular, the angular gyrus has been linked to novel rule-learning functions such as language syntax processing as well as mathematical rule learning (Bemis & Pylkkänen, 2013; Pyke et al., 2015; Seghier, 2013). Encoding both culturally familiar and unfamiliar music recruited activity in the STG, IFG (Nan et al., 2008, reported the left hemisphere while Demorest et al., 2010, reported the right hemisphere), and planum temporale, an area posterior to the auditory cortex (Demorest et al., 2010; Morrison et al., 2003; Nan et al., 2008). Interestingly, the similar engagement of STG and IFG in learning both familiar and unfamiliar cultural music suggests that these regions primarily facilitate the learning of specific music sequences based on existing syntax knowledge. By contrast, a broader network including the angular gyrus/IPL appears to be involved when grappling with the challenge of learning new music styles and regularities, which demands concurrent analysis and encoding of their unfamiliar structural principles.

Because this is a collection of areas reproduced across a number of studies sharing this syntax-learning and syntax-use approach, to simplify the text we will refer to this cluster of areas as the “music syntax network” in what follows (as shown in Fig. 1), noting that it appears to be consistently organized around inferior frontal and superior temporal areas, supplemented by the inferior parietal cortex (IPL; the IPL also supports syntax processing/learning for music structure, as discussed in the following sections). Within this music syntax network, the STG has been argued by some to correlate with the ability to learn new music in relation to prior knowledge. Morrison et al. (2003) compared musicians with untrained controls as they listened to music from their own culture versus an unfamiliar culture. Behaviorally, musicians were better than nonmusicians at encoding music from their own culture, and this group showed higher activity in the STG and midfrontal regions during encoding (Morrison et al., 2003). This implies training-related plasticity in the STG and a role for this region in supporting the acquisition of novel music based on established regularity.

A second type of experimental design relating to music structure memory encoding uses a “new syntax learning paradigm” and focuses on the function of “statistical learning” in music. Statistical learning is an important cognitive function reflecting rule extraction from repeated exposure to a sequence. This concept has been widely studied in language research—for example, how infants learn and integrate linguistic regularities (Romberg & Saffran, 2010). Similarly, investigating statistical learning of musical sequences might reveal how listeners learn a specific pattern of music structure with minimal prior knowledge. Such studies created artificial (novel) grammars for tonal sequences and exposed listeners to them. During the learning task, one MEG study revealed substantial oscillatory entrainment between temporo-frontal areas that positively correlated with learning performance (Moser et al., 2021). Using diffusion tensor imaging (DTI), another study identified a positive correlation between pitch-related grammar learning performance and the tract volume between the right IFG and right middle temporal gyrus, again suggesting an important role of temporal-frontal coordination in rule learning for music (Loui et al., 2011). They also reported that higher white matter volume in the supramarginal gyrus, one subdivision of the IPL (the other being the angular gyrus), correlated with individual differences in learning performance. Interestingly, one lesion study suggested that damage to the left IFG did not prevent successful grammar learning of pitch sequences (Jarret et al., 2019)—taken together with the Loui et al. (2011) result mentioned above, this implies that the IFG’s role in encoding entirely new musical syntax may be right-lateralized. Meanwhile, another fMRI study, conducted during recognition of newly learned sequences after statistical learning, suggested left IFG activation in distinguishing learned tonal sequences from random sequences. Overall, these findings suggest a pivotal role for the IFG in musical syntax learning.

An earlier section of this review emphasized that temporal sequence structure is another core component of music—however, only a few studies have focused on how the brain encodes novel musical rhythm patterns. One paper suggested that the right IFG, bilateral supramarginal gyrus, and planum temporale contribute to extracting rhythm patterns from novel rhythmic music sequences by extracting and summarizing the interval ratios between tones (Notter et al., 2019). This indicates that the “music syntax network” not only processes tonal syntax for new learning but might serve as a general musical rule-processing center for both the tonal and temporal relationships of sounds. Most other rhythm pattern learning tasks used nonmusic stimuli (e.g., visual stimuli presented at a specific beat) and have underscored the involvement of time-processing areas such as the basal ganglia and cerebellum in rhythm encoding (Janata & Grafton, 2003; Ramnani & Passingham, 2001; Sakai et al., 2004). The basal ganglia and cerebellum have also been observed to be active in music-listening tasks, especially when the music is rhythmically regular and leads to motor reactions such as tapping or dancing (Molinari et al., 2007; Zatorre et al., 2007). However, interpreting the contributions of these subcortical areas from nonmusic rhythm studies is challenging due to the nature of their task designs, which often involve motor responses to the stimuli. More investigation is needed to address whether these subcortical regions are critical areas supporting novel musical rhythm pattern learning, independent of sometimes implicit motor responding aligned with the rhythm.

The two types of designs discussed above, which emphasize tone and rhythm structure, typically leave open the question of how the brain encodes a specific song or melody as a unit that can be explicitly recalled later. Unfortunately, relatively few studies have used neuroimaging during the encoding stage of novel music learning together with a postlearning declarative recall task. One fMRI study presented novel music with different repeated motifs embedded while participants performed a “phrase segmentation task,” which asked listeners to rely on their intuition and press buttons to indicate boundaries between music phrases while listening. Motifs with more repetition and exposure, which should lead to stronger memory, elicited increased activity in the SMA, basal ganglia, hippocampus, IFG, and cerebellum (Burunat et al., 2014). This suggests potential involvement of these areas in encoding specific melodic sequences. An MEG study presented 68 participants with a highly structured classical prelude repeatedly and later tested their memory for the music. During encoding, the auditory cortex, insula, hippocampus, and basal ganglia were engaged in the task (Bonetti et al., 2021). Notably, musical experts exhibited stronger activity in the STG and insula, aligning with the syntactical learning findings mentioned above and suggesting training-related plasticity in these areas (mainly the auditory cortex) for music encoding functions. Although such studies are infrequent, they suggest that the “music syntax network,” including the IFG and the auditory cortex, along with the insula, hippocampus, and basal ganglia, might collectively support the encoding of specific music pieces.

In summary, this overview of studies relating to music structure memory encoding reveals a network that spans the neocortex (centered in the IFG, STG, insula, and IPL), the inner cortex (the hippocampus), and the subcortex (cerebellum and basal ganglia). If we juxtapose this with the emphasis on primary auditory cortex in the data on basic musical input processing discussed in the previous section, a pattern emerges: music memory encoding draws upon downstream and higher-order areas that deal with sequential learning and integration. Interestingly, the reader will note the limited mention of the hippocampus, a crucial region in memory research. While it has a well-established role in sequential learning and declarative memory encoding in the nonmusic literature (Shapiro & Eichenbaum, 1999; Wallenstein et al., 1998), its involvement in music structure memory encoding is rarely reported. From the existing literature, the hippocampus appears to come into play only in specific instances, such as encoding a particular melody that was listened to repeatedly during the task. We provide further discussion of these observations and the potentially interesting conditions for hippocampal involvement in music structure memory in a later section.

Memory retrieval—Recognition memory tasks

Many of the extant “music memory” neuroscience studies have primarily focused on the retrieval stage of memory. Among these, a commonly used approach involves recognition tasks in which participants listen to compositions and determine whether they have heard a particular piece before. Recognizing melody has been associated with a wide range of brain regions across the neocortex (Jäncke, 2019; Peretz & Zatorre, 2005). Notably, a meta-analysis examined studies that compared listening to unfamiliar versus familiar music and found that the superior frontal gyrus (SFG) exhibited the most significant involvement during listening to familiar music (Freitas et al., 2018). However, this paper also emphasized the absence of consistent activations in any area across studies. There are many possible reasons, including inconsistencies in the music styles, task designs, and participant demographics of the selected studies. Moreover, the nature of the memory retrieval being tested can vary in its demands depending on the task method and difficulty. For instance, the number of recognition cues provided or the degree of similarity between target compositions and lures might influence the retrieval process. Some debates within the field of nonmusic recognition memory might provide hints: for example, evidence suggests distinct neural substrates for familiarity (“recognize it”) and recollection (“recall it”) (Duarte et al., 2004; Yonelinas, 2002). In a similar vein, examining the method of each music recognition task and categorizing tasks by retrieval level might offer a more nuanced understanding of the variable neural correlates during music structure memory retrieval.

One useful way to break down these studies is by stimulus selection: many music recognition studies used stimuli drawn from popular listening charts. These studies measured brain activity while participants listened to and recognized famous music selected from such databases. The music clips were typically pretested and validated for familiarity through pilot tests. In this way, these studies used music that was (presumably) frequently listened to, and decisions could be made on the basis of a sense of familiarity in the recognition paradigm. This process might require less effort than recalling the music from memory without any cue (Yonelinas, 2002). Recognizing familiar (famous) music, as opposed to unfamiliar nonfamous music, showed significant activation in the STG and IFG, as well as in the superior frontal gyrus (SFG), superior parietal lobule (SPL), orbitofrontal cortex (OFC), insula, anterior cingulate gyrus (ACC), and parahippocampal gyrus (E. Altenmüller et al., 2014; Freitas et al., 2018; Jacobsen et al., 2015; Klostermann et al., 2009; Satoh et al., 2006; Sikka et al., 2015). Some of these areas falling outside the syntax network (i.e., IFG, STG, and IPL) are commonly observed in nonmusic recognition and familiarity tasks (Aggleton et al., 2011; Haxby et al., 1996; Morita et al., 2014; O’Connor et al., 2010). For example, in item recognition, superior parietal regions seem to support a sense of familiarity, while more ventral lateral parietal and temporal regions, which were not found in these familiar–unfamiliar music recognition contrasts, are more involved in memory recollection (Yonelinas, 2002). Areas such as the insula, anterior cingulate, and superior frontal regions are not as consistently implicated in episodic recognition tasks, though they do seem to relate to emotion-related recognition tasks such as facial recognition (Campbell et al., 2015; Haxby et al., 1996; Morita et al., 2014) and, as implied here, music recognition. Given that most music carries some degree of affective valence, one study focusing on how the familiarity of music modulated induced emotion showed higher activity in emotion-memory-related areas such as the cingulate cortex, thalamus, amygdala, and hippocampus (Pereira et al., 2011). This suggests an association between “knowing a piece of music” and “liking the music,” and points to reward-emotion circuitry behind music long-term memory. This connection between music, emotion, and memory is a topic of further discussion in later sections on music contextual memory.

Overall, compared with episodic memory familiarity, which many studies argue relies more on medial temporal lobe cortex (the perirhinal cortex in particular; Yonelinas, 2002), most probes of music structural familiarity reveal a frontal-parietal cortical network of activity. The “music syntax network” consistently emerges as being associated with familiar music recognition, as it was during encoding (Demorest et al., 2010; Ford et al., 2016; Gagnepain et al., 2017; Groussard et al., 2010; Jacobsen et al., 2015; Pereira et al., 2011; Plailly et al., 2007; Sikka et al., 2015). Considering the need to process music syntactical structure in order to successfully recognize music as “old,” it remains unclear, and a matter for continued research, whether this network is primarily involved in fundamental levels of musical stimulus processing across retrieval and encoding, or whether it serves a distinctive functional role in music retrieval compared with its role in encoding.

A few studies of the neural systems supporting music recognition have used less well-known music and implemented an encoding or familiarization phase before the memory recognition task. While famous music might easily cue a sense of familiarity, in these tasks using novel music, recognition might require more in-depth processing and effortful comparison between to-be-recognized stimuli and old memory. Consequently, such tasks might engage brain areas that support recollection of old memories. By contrasting correct recognitions (hits) with misses, one study revealed increased activity in the right IFG and left cerebellum, indicating a correlation with successful retrieval of music structure LTM (E. Altenmüller et al., 2014). Another study, contrasting hits versus correct rejections (old vs. new), revealed more activity in the left IFG and the hippocampus for hits, implying their role in representing and retrieving old memory (Watanabe et al., 2008). Comparing these two studies raises the question: is there any functional difference between left and right IFG contributions to music retrieval? One study used a cross-culture design in which participants learned music of their own culture as well as music from an unfamiliar culture. During the postlearning recognition task, the researchers found more right IFG activation during recognition of music from the familiar culture and more left IFG activation when recognizing music from the unfamiliar culture (Demorest et al., 2010). This might imply that the right IFG is more involved in remembering and recognizing patterns from existing/well-learned music syntax, while the left IFG contributes to general regular-sequence processing and retrieval. Studies comparing lyrical incongruency (e.g., adjacent words in lyrics that do not make sense together) and melodic incongruency also revealed left-only IFG activity in both conditions, suggesting that this area serves as a general sequence-rule processor for both music and language (Koelsch et al., 2000; Peretz, 2002).
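For readers less familiar with the recognition contrasts referred to above (hits vs. misses, hits vs. correct rejections), the short sketch below shows how such trial types are conventionally derived from an old/new recognition test; the resulting labels are what define the behavioral or neuroimaging contrasts. The data layout and function names are our own illustrative choices and are not taken from the cited studies.

def classify_trials(is_old, said_old):
    """Label each recognition trial as hit, miss, false alarm, or correct rejection."""
    labels = []
    for old, resp in zip(is_old, said_old):
        if old and resp:
            labels.append("hit")                # previously heard piece recognized
        elif old and not resp:
            labels.append("miss")               # previously heard piece not recognized
        elif not old and resp:
            labels.append("false_alarm")        # novel piece incorrectly endorsed
        else:
            labels.append("correct_rejection")  # novel piece correctly rejected
    return labels

def contrast_trials(labels, condition_a, condition_b):
    """Return trial indices entering each side of a contrast (e.g., hits vs. misses)."""
    a = [i for i, lab in enumerate(labels) if lab == condition_a]
    b = [i for i, lab in enumerate(labels) if lab == condition_b]
    return a, b

# Hypothetical usage with ten music clips (True = studied/old) and a listener's old/new responses.
is_old   = [True, True, True, False, False, True, False, True, False, True]
said_old = [True, False, True, False, True,  True, False, True, False, False]
labels = classify_trials(is_old, said_old)
hits, misses = contrast_trials(labels, "hit", "miss")
print(labels)
print("hits:", hits, "misses:", misses)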

Besides the retrieval/recognition of music structure, there is another behavioral phenomenon tied to music structure memory recognition: the mere exposure effect. The mere exposure effect shows that prior exposure to music can enhance an individual’s preference for the composition (Fang et al., 2007; Peretz et al., 1998). Various studies of this effect used repeated passive music-listening paradigms, tracking how increases in preference for the compositions are associated with recognition memory for the music (Green et al., 2012; Samson & Peretz, 2005; Wang & Chang, 2004). Neuroimaging data from these paradigms showed increases in dorsolateral prefrontal cortex (DLPFC), IFG, and IPL activity (Green et al., 2012) as listeners increased their preference for the music as a function of repeated exposure. This again implicates a frontal-parietal cortical network in music recognition, even when the measure is the unattended development of preference associated with that familiarity. In addition to the IFG and IPL, which support music syntax processing, the DLPFC is known to support working memory (Barbey et al., 2013; Levy & Goldman-Rakic, 2000)—a general but critical memory function during continuous music encoding. Beyond the areas mentioned above, the medial temporal lobe might also be critical for certain types of music learning and recognition. Evidence from medial temporal lobe (MTL) lesions showed that right MTL damage resulted in a failure of the mere exposure effect for music (Samson & Peretz, 2005). These patients also displayed impaired explicit recognition of the music itself, as measured after the listening paradigm. This suggests that the MTL might be necessary for successful music encoding and future explicit recognition, even though this area is less commonly identified in neuroimaging studies of music recognition.

In summary, music structure memory retrieval is associated with multiple brain networks. The main network supporting music recognition includes the “syntax network” as well as neighboring frontal-parietal regions. Table 1 highlights the major papers contributing to this understanding, noting the varied task types used in those references. Figure 2 synthesizes the correlates of music memory formation from Table 1 into a visualization of the networks in brain space, which helps the reader see similarities and changes in recruited networks across different phases of processing music memory. These areas may support recognizing the regular sequential pattern of the music stimuli by comparing it with prior knowledge of the syntax. Recognizing a piece of previously heard music also activates emotion-related areas such as the orbitofrontal regions and the basal ganglia, potentially due to a positive correlation between musical reward and familiarity (Salimpoor et al., 2015). One remaining debate is whether the hippocampus and surrounding medial temporal lobe are essential for music structure memory retrieval. Although the larger proportion of studies did not report hippocampal involvement during famous or newly learned music recognition tasks, some MTL lesion data show a failure to explicitly recognize music (Samson & Peretz, 2005), and a few other memory recognition tasks using more complicated designs (such as music boundary detection tasks) actually did report hippocampal correlations with successful recognition (Burunat et al., 2014, 2018; Knösche et al., 2005). When associating hippocampal activity with overall memory performance (overall recognition rate), one study found a positive correlation, suggesting that the hippocampus might still relate to overall memory ability for music structure (Watanabe et al., 2008). These observations lead to the question of whether the hippocampus and its neighboring MTL cortical areas contribute to a higher level of memory retrieval in music. Given the MTL’s essential role in nonmusic declarative memory functions, it is of considerable interest to delve into the finer details of whether and how the hippocampus and surrounding MTL contribute to music memory.

Medial temporal lobe and music structure memory

The medial temporal lobe plays an essential role in successfully learning new declarative information. As the reader will also see in later sections of this review, the MTL is intimately tied to the contextual and emotional dimensions of music memory. However, as is evident in the sections above, it is still under debate whether the MTL is necessary for music structure learning per se. Although several studies reported individuals with bilateral medial temporal lobe lesions being able to learn music (Cavaco et al., 2012; Finke et al., 2012; Valtonen et al., 2014), there are potential alternative explanations. Among these studies, two cases measured music memory via instrument learning and claimed intact music memory in MTL lesion patients because the patients could successfully learn to play and read new music. The ability to play an instrument is associated strongly with procedural or skill memory, which is supported by brain areas outside the medial temporal lobe (Squire, 1992); this is echoed by a long literature showing new motor skill learning in MTL lesion patients (Corkin, 1968; Squire & Zola-Morgan, 1991). Another study aimed at a more explicit component of music structure memory using a recognition task and showed intact recognition ability for newly learned music in an MTL-lesioned cellist, concluding that new music learning was independent of the MTL (Finke et al., 2012). However, this study came with an interesting caveat: the patient was a lifelong cellist, and his music training history could provide an elevated music memory ability compared with an average person, bolstered by mechanisms for greater neural plasticity in cortex for information that can be related to prior knowledge/old memory (van Kesteren et al., 2012, 2014)—in his case, new music compositions—plasticity that potentially resists the functional consequences of MTL lesions. We speculate that these “schema”-grounded cortical learning enhancements may also explain why there are more cases of intact music memory in AD patients who had a music training history (Baird & Samson, 2009). To eliminate this possibility, the same group that studied the cellist recently published another case of intact recognition of newly learned music in a patient with bilateral hippocampal lesions. Although they identified this patient as a musical layperson because he had played instruments only in late childhood and adolescence, in their musical processing and perception task this patient actually outperformed the control group, suggesting a higher-than-average music ability. In this vein, aging studies have shown that people with nominal amounts of music training early in life, even if they had not played instruments for years, showed training-driven neural plasticity relating to sound processing (White-Schwoch et al., 2013). In healthy-population studies, evidence of hippocampal morphological plasticity was also found in music-trained groups, reflected in higher grey matter volumes (Herdener et al., 2010; Teki et al., 2012). Such training-induced neural changes might correlate with better music processing and encoding ability in musicians compared with nonmusicians (Burunat et al., 2018). Thus, lesion cases of patients with previous music training, although suggestive that music memory ability persists despite MTL damage (compared with other forms of memory such as visual episodic memory), may be insufficient to conclude that music structure learning functions are truly independent of MTL involvement.

It should also be stressed that other lesion studies support the MTL’s functional importance for music learning. Multiple case studies of patients with bilateral or unilateral MTL lesions or Alzheimer’s disease show some degree of impairment in recognizing newly learned music (Bartlett et al., 1995; Cowles et al., 2003; Samson & Peretz, 2005; Samson & Zatorre, 1992), already familiar music (Bartlett et al., 1995; Huijgen et al., 2015; Peretz, 1993, 1996), or single sounds that were recently presented (Squire et al., 2001), and even in implicit preference development via the mere exposure effect (Samson & Peretz, 2005). Many of these studies emphasized that the individuals retained normal online music structure perception/processing ability while losing long-term memory for the music they had processed (Peretz, 1993, 1996). Some patients could learn to perform music right after perception but failed to consolidate it (Cowles et al., 2003); others failed to hold music in short-term declarative memory, as in a music timbre comparison test (Samson & Zatorre, 1994). All such data suggest that memory encoding and consolidation for music structure may in fact be impoverished by MTL damage, aligning with findings on the same MTL functions in nonmusic domains (LaBar & Phelps, 1998; Shapiro & Eichenbaum, 1999).

When looking at healthy populations, an important consideration is that music is highly regular and predictable based on prior knowledge, even for “novel” pieces, and thus music learning can be adequately supported by the “old syntax” network reviewed above, centered on frontal areas that have also been shown to store and retrieve nonmusic old memories (Preston & Eichenbaum, 2013; van Kesteren et al., 2010). It also bears noting that MRI considerations such as signal-to-noise ratio, statistical thresholds, and activity cluster-size cutoffs can contribute to false negatives in studies that are not designed specifically to test for MTL involvement; given the correlational nature of neuroimaging data, such null reports cannot establish a lack of causality for the MTL in music memory. As a result, the scarcity of reports indicating MTL involvement in declarative music memory does not necessarily imply that the MTL plays no role in supporting music structure memory.

A major reason for expecting that it does is that both human and animal studies have shown that the MTL makes core contributions to sequential learning, particularly through its functions in time and temporal context processing (Eichenbaum, 2014; MacDonald et al., 2011) and its ability to bind multidimensional components of memories together (Gheysen & Fias, 2012; Olsen et al., 2012; Wallenstein et al., 1998). In fact, bilateral hippocampal damage can lead to failures in the statistical learning of sequential regularities (Covington et al., 2018; Schapiro et al., 2014). On the complementary side, there is a rich and growing literature on nonmusic memory that delineates mechanisms by which frontal-cortical memory systems circumvent the hippocampus and form new semantic associations when the new experience maps strongly onto prior knowledge (van Kesteren et al., 2012, 2014). Inspired by the examples in which the MTL is implicated in music learning, and because music is typically a multidimensional sequence of information, requiring the ability to bind multiple elements together and to learn statistical regularities, we draw on such perspectives from the neuroscience of memory to speculate the following: the hippocampus might collaborate with the “syntax network” during initial learning of musical structure regularity to support higher-order sequence segmentation and relational binding of temporally distinct and distant elements (as it does in the nonmusic memory literature). The experiences an individual has with music structures (through lifetime listening and exposure, formally studying an instrument, or familiarity with the individual song or structurally similar compositions) lower the burden of new and continued learning on the hippocampus. This perspective can help reconcile the disparate findings mentioned earlier. It also emphasizes the importance of studies that control for various sources of direct familiarity and more schematic/thematic familiarity with music stimuli, especially when investigating hippocampal involvement.

In the interest of promoting future research in this direction, here we further unpack this idea. During initial learning of truly novel music, one important step is to extract a hierarchical structure from a continuous auditory input. Analogous to a verbal story, in which words within a sentence are more correlated with each other than with words outside the sentence, and adjacent sentences are more semantically related than distant ones, learning a long and continuous music sequence requires the listener to construct a hierarchical map by segmenting the ongoing music into bars (small segments of music that hold a few beats), phrases, periods, and sections. This segmentation is based on the statistical relationships between tones as defined by musical syntax. During this process, tones are grouped into regular patterns, and the patterns are segmented/grouped and ordered—this is the initial encoding of the music as an integrated piece that holds structural regularity. In the nonmusic memory literature, hippocampal function shows strong associations with hierarchical regularity learning, supporting cluster segmentation and temporal boundary detection during ongoing sequence encoding (Gupta et al., 2012; Hsieh et al., 2014; Schapiro et al., 2016). Some evidence suggests that this same function exists in music sequence encoding. One study asked participants to listen to an 8-minute-long piece of music with repeated motifs embedded. The participants performed a segmentation task using motifs as cues and showed significant activation in both the hippocampus and IFG during music boundary identification (Burunat et al., 2014). The authors also observed strong functional connectivity between the hippocampus and cerebellum during motif repetition, when the most salient regular pattern was detected and encoded, suggesting a role for these two regions in sequence regularity detection. In a follow-up study, the same group conducted a similar music boundary segmentation task and again found strong functional connectivity between the hippocampus and cerebellum that positively correlated with section segmentation performance (Burunat et al., 2018). This aligns with evidence from visual serial sequence learning tasks showing that a hippocampal-cerebellar network contributes to spatio-temporal sequence encoding and prediction ability (Onuki et al., 2015). Similarly, another group designed a unique acoustic serial sequence learning paradigm in which listeners needed to detect and learn regular acoustic patterns hidden within distracting irregular sequences (Jablonowski et al., 2018). They observed a positive correlation between the amount of learned acoustic sequence and left hippocampal activation, implying this area’s primary involvement in detecting and encoding sequential regularities during a series of musical inputs. Besides detecting sequences, the ability to segment ongoing sequences also enables better extraction and understanding of the regular patterns, especially when initially encoding a novel music sequence. One MEG study focusing on music phrase perception recorded brain activity during listening to continuous sequences of music phrases and, using source localization, reported major activity from the limbic system and posterior MTL at phrase boundaries (Knösche et al., 2005), suggesting that this brain cluster is important in dividing ongoing sequences into different groups.
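To make the proposed segmentation computation concrete, the sketch below illustrates one common way such boundary detection is formalized: positions where the incoming tone is poorly predicted by the listener's learned statistics (i.e., positions of high surprisal) are treated as candidate phrase boundaries. This is a schematic illustration of the general idea only, not the analysis used in the studies cited; the surprisal values and the threshold rule are invented for the example (per-tone surprisal could come from a predictive model such as the toy one sketched earlier).

def detect_boundaries(surprisals, threshold=None):
    """Mark candidate phrase boundaries where per-tone surprisal peaks above a threshold.

    surprisals[i] is how unexpected tone i was given the learned statistics
    (e.g., as computed by a predictive model such as the toy one sketched earlier).
    """
    if threshold is None:
        # Simple adaptive threshold: mean plus one standard deviation.
        mean = sum(surprisals) / len(surprisals)
        std = (sum((s - mean) ** 2 for s in surprisals) / len(surprisals)) ** 0.5
        threshold = mean + std
    return [i for i, s in enumerate(surprisals) if s > threshold]

# Hypothetical surprisal trace: high at the very start (nothing learned yet),
# low within two repetitions of a motif, and spiking where a new phrase begins.
surprisal_trace = [2.8, 2.5, 1.2, 0.9, 0.8, 0.7, 0.6, 0.6, 4.1, 1.5, 1.0, 0.9]
print("candidate boundaries before tones:", detect_boundaries(surprisal_trace))

On this invented trace the detector flags the opening of the piece and the onset of the new, less predictable phrase—the kind of boundary placement that the segmentation account above attributes to hippocampal-cortical interaction.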

In addition to regularity detection and segmentation, the hippocampus might also support associations between temporally “distant” elements during initial music structure learning. In nonmusic domains, the hippocampus supports both dissociation and binding in memory between temporally distant experiences—for example, building associations between two events that happened at different times (Wallenstein et al., 1998). In music learning, an overall composition involves higher order associations between multiple musical subevents over a protracted time scale, and such hippocampal binding functions may be particularly necessary when the composition is long. For example, an orchestral classical piece often contains more than one movement, and movements can differ substantially in tempo, key, and emotion. In this case, musical syntactic schemas might be insufficient for relating and binding two very different movements together (because they do not follow the same structure). One study tested a similar idea, probing the learning of distinct musical sequences as members of the same group, and showed that lesions in the left MTL led to failure in this higher order music pair learning, supporting such a role in binding distant music sequences into memory (Wilson & Saling, 2008). Overall, although the data are limited, the extant literature implies that the hippocampus might be more involved during initial music encoding, when tasks require truly novel and explicit sequence structure detection as well as binding segments of music into higher order relational memories; these results align with the nonmusic memory literature and together suggest that the hippocampus may have an essential role in encoding the temporal structure of sequences across domains, particularly through its functions of sequential element segmentation and association. Many studies that fail to show hippocampal involvement may lack control over a sufficient level of novelty, sequential structural complexity, and/or explicit memory demands—but more studies are needed to put this explanation to the test and truly reconcile the mixed findings in the literature.

Music contextual associates memory

When asked to recall a piece of music, we often find ourselves retrieving more than just its musical structure. Music-associated information such as the title, lyrics, emotional theme, or even the autobiographical episodes connected to it can come to mind, sometimes involuntarily. Humans love to listen to music and to play it at many important life events, such as weddings. This can be attributed in part to the deeper semantic meaning conveyed by music. The story the music conveys and the emotion it induces connect listeners to the music, leading to the encoding of self-related emotional episodic memories. Unfortunately, these contextual associate elements of music memory have rarely been systematically discussed from either a behavioral or a neural perspective. This review has therefore saved this aspect of music memory for last, as it may be a particularly fruitful future direction for music memory research—especially for those interested in using music to aid life-related behaviors such as modulating memory and emotion. Here we summarize the limited current knowledge of the neural correlates supporting the formation and retrieval of music contextual associates memory. The right side of Fig. 2 provides a visual aid for the main brain areas discussed in this section.

Music-lyrics binding

Music with lyrics plays the predominant role in the modern music production market, and for most of us much of the everyday music we hear and sing has lyrics. Research has shown that texts are more memorable when they are embedded within a piece of music (Palisson et al., 2015; Simmons-Stern et al., 2010). Advertisers have leveraged this phenomenon by pairing their slogans with melodies to capitalize on the memorability of the combination (Yalch, 1991). One cognitive hypothesis stemming from this phenomenon suggests that when texts become an additional dimension of the music sequence, the processing of the texts switches from the language system to the music system. One piece of evidence is that patients with memory deficits who struggled to retrieve verbal material encoded or cued using speech alone showed significant improvement when the texts were learned and retrieved using music. This finding suggests that memorizing lyrics (texts) involves dissociable processors when they are presented alone versus as a part of music (Simmons-Stern et al., 2010). A possible cognitive mechanism relates to the extremely regular temporal structure of music, which might improve word chunking and modulate attention to verbal structure by providing a temporal scaffold (Conway et al., 2009; Wallace, 1994). Alonso and colleagues conducted multiple neuroimaging tasks to explore neural differences between learning music sung with lyrics versus learning music and lyrics presented separately at the same time. They found that the left IFG, left motor cortex, and bilateral middle temporal gyrus (MTG) contributed to the unified encoding condition, whereas other areas, including the right hippocampus, left caudate and cerebellum circuits, and right IFG, were recruited during the separate presentation condition (Alonso et al., 2014, 2016). This suggests that when lyrics are learned as a part of the music, fewer areas are needed for encoding them, whereas when lyrics are presented as a separate sequence, additional brain areas are recruited for higher order relational binding of the sequences, such as the hippocampus.

When lyrics are encoded as a part of music, do the same areas that process melodic sequences support encoding the lyrics? Behavioral and neuroscientific evidence offers mixed insights on this question. From a behavioral standpoint, although tunes and texts show a bidirectional effect—such that one can cue the other during retrieval (Peretz et al., 2004; Peynircioğlu et al., 2008)—some studies suggest that learning the two might compete (Racette & Peretz, 2007), implying that encoding the lyrical and melodic dimensions of music might share (and compete for) the same neural mechanism. On the other hand, one ERP study recorded responses while people listened to a song with manipulations of semantic (lyric) congruency and/or musical syntactic congruency. The results showed dissociable neural responses to lyrical semantic violations versus musical syntactic violations—even though the two were played at the same time as an integrated event. This supports a view of independence of lyric and tune processing when listening to sung music (Besson et al., 1998). An earlier lesion study also supported a dissociation, highlighting that text recognition might be associated with left temporal lobe structures, whereas melodic coding or recognition depended on temporal lobe structures in both hemispheres (Samson & Zatorre, 1991).

Overall, the available evidence suggests that text encoding recruits fewer brain areas when the text becomes a contextual associate of music. As a semantic/contextual component of the music itself, lyric text is processed mainly by the music syntax network, but the MTL may be important for bridging linguistic associates of music, depending on how they are presented.

Music as context: Episodic memory binding

Music is powerful in that it can become a part of our personal memories; sometimes merely hearing it again can effortlessly trigger recollections of long-past personal experiences (H. Baumgartner, 1992). Music can act as a strong memory retrieval cue, and multiple prior studies have successfully applied music to help both healthy individuals and memory-impaired groups, such as patients with mild Alzheimer’s disease, recall autobiographical memories (Baird & Samson, 2014; Cady et al., 2008; El Haj et al., 2012; Irish et al., 2006). Regarding the cognitive mechanisms behind this, some have argued that music adds a strong emotional response to personal experience and that people tend to encode and retrieve emotional memories more efficiently (Belfi et al., 2016; Eschrich et al., 2008; Jäncke, 2008). Others have suggested that during retrieval, music listening activates top-down access to personal memories by providing a predominant social context (Cady et al., 2008). We currently lack neural evidence for either hypothesis, and few studies have investigated the brain mechanisms of music-associated personal episodic (autobiographical) memory. In particular, we found no neuroimaging study focusing on the encoding phase of autobiographical memory with music played in the background. Future researchers interested in this question could compare episodic memory encoded with music versus a control condition (e.g., nonmusic sound played in the background); by asking how neural differences correlate with encoding performance across conditions, we might gain a better understanding of how music-associated episodic memory formation benefits neurally from the “music system.”

The medial prefrontal cortex (mPFC) appears to be a core area supporting the retrieval of personal memories during music listening—individuals with damage to this area showed selective impairments in recalling music-evoked autobiographical memories but not face-evoked memories (notably, both cues are emotion-related; Belfi et al., 2016). In the general population, studies have shown increased mPFC activity during autobiographical memory recall while familiar music is played (Ford et al., 2011; Janata, 2009). Additionally, during music listening, the mPFC has been found to support tracking of melodic dynamics—for example, a topography of tonal space, such as the different keys in Western music, was found in the rostromedial prefrontal cortex (Janata et al., 2002). Beyond its role in the music listening experience, the mPFC has a broader function in supporting retrieval of general declarative memory and facilitating the formation of associations between memories (Euston et al., 2012). In this context, the mPFC is positioned to potentially connect music structure memory with episodic memory. Taken together, the mPFC might specifically support music-evoked autobiographical memories by not only gathering the personal memories but also assembling the structural elements of the music with the episodic content paired with them, in order to form and re-experience an integrated episodic/affective memory (Janata, 2009).

The MTL/hippocampus is a core area for general episodic memory retrieval (Eldridge et al., 2000). Although Janata (2009) did not observe significant MTL engagement during retrieval of music-evoked autobiographical memories (MEAMs), some evidence has indeed hinted at its involvement in this process—the hippocampus might especially contribute to the retrieval of episodic details and specificity. Ford et al. (2011) found a positive correlation between MTL activation and the level of specificity of the autobiographical memory being retrieved by listeners, suggesting a role in recruiting details of the episodes paired with the music. Kubit and Janata (2018) compared conditions in which participants attended to the music’s familiarity versus directed their attention to the associated personal memories and observed a dual network separately supporting these two retrieval processes. They identified the MTL plus the default mode network (DMN), a coupling known to contribute to episodic memory retrieval (Huijbers et al., 2011), as most engaged during the personal-memory-attending condition, whereas a frontal-parietal network, with especially IFG and pre-SMA activation, supported music-attending recall (Kubit & Janata, 2018). The latter finding aligns with the earlier discussion that music structural memory retrieval is frontal-parietal focused.

In a MEAMs study comparing young and older adults, young people tended to retrieve more details of their personal memories than older adults; in the brain, the young showed more MTL plus DMN activation, whereas the older group mainly recruited the dorsomedial PFC (Ford et al., 2016). Behavioral studies suggest not only that young people tend to retrieve more specific events corresponding to music (Ford et al., 2014; Schulkind et al., 1999) but also that music training might strengthen specific event–music binding. During music listening, musicians tended to recall more personal memories and to relate the music to themselves (Fauvel et al., 2013; Groussard et al., 2010). MRI data revealed more hippocampal and amygdala recruitment in musicians, and the authors postulated that the MTL’s involvement was related to its function in memory imagery and emotional contextualization during music listening (Alluri et al., 2015; Frühholz et al., 2014). Comparing musicians with nonmusicians during familiar music listening, Groussard et al. (2010) found more anterior hippocampal, entorhinal, and parahippocampal involvement in musicians and suggested that the hippocampus and other MTL areas engage in episodic recollection during familiar music listening. Other relevant studies compared listening to self-selected music with unfamiliar or recently learned music and found hippocampal, parahippocampal, and cerebellar activity only during self-selected familiar music listening (Thaut et al., 2020; Wu et al., 2019), indicating a role for these regions in recollecting self-related emotions and memories triggered by self-selected music.

Lastly, the hippocampus might also interact with the frontal cortex to retrieve autobiographical details as a part of the music’s context. Connectivity between these two areas has been highlighted in other episodic memory retrieval tasks, such as spatial retrieval (Preston & Eichenbaum, 2013). In a familiar music listening task, researchers used dynamic causal modeling to test the direction of interactions between brain areas modulated by different levels of music familiarity. They observed (1) left hippocampal activation during listening to familiar music only in musicians, suggesting more episodic recruitment in this group, and (2) top-down connectivity projecting from the left IFG to the left hippocampus, modulated by music familiarity (Gagnepain et al., 2017). This suggests that during music-associated memory retrieval, hippocampal recruitment is modulated online by the retrieval of episodic memory traces, whereas the IFG might support identification of the music’s structure and project this familiarity information into the hippocampus, driving it to retrieve episodic memory (shown in Fig. 1). These findings provide evidence for connectivity and communication between the two networks separately supporting “music syntactical structure memory” and “music contextual associates memory,” which together produce a vivid and self-related experience of music memory retrieval.

Emotion and reward

One remarkable aspect of music is its ability to express and evoke emotion. Emotion is usually an inseparable component of memory, not only because it colors and defines the subjective perception of the memory but also because of its power to enhance memorability and recall (Kensinger & Schacter, 2006; Tyng et al., 2017). Extensive research has explored music emotion in terms of how associations between different types of tonal structure and emotions develop, both behaviorally and neurally. Here, we discuss how the memory of emotion is encoded as well as how the evoked emotion interacts with the memory of the music itself, both its structural memory and its other contextual associative memory.

The emotions induced by music are highly relevant to listeners’ ability to anticipate the music’s structural elements. Musical syntax knowledge allows listeners to predict the ongoing music, and by comparing reality with prediction, listeners generate reward-related emotions (Tillmann et al., 2014). For example, the feeling of pleasantness depends on whether the music follows syntactic rules, and in many studies researchers created unpleasant music by violating musical regularity (Blood et al., 1999; Koelsch et al., 2006; Tillmann et al., 2014). When comparing manipulated music with the original regular music using fMRI, several studies observed significant changes in limbic and reward systems including, but not limited to, the hippocampus, amygdala, parahippocampal cortex, ventral striatum (mainly the nucleus accumbens, NAc), insula, and orbitofrontal cortex (Blood et al., 1999; Brown et al., 2004; Koelsch et al., 2006; Mueller et al., 2015). Among these areas, the ventral striatum, representing the reward system, seems pivotal in comparing expectation with reality, which may lead to dopamine release and thereby assign rewarding and emotional value to the music stimuli (Salimpoor et al., 2011, 2013). Another study observed that the hippocampus exhibited activity similar to that of the ventral striatum while participants listened to manipulated music—both correlated with the music’s pleasantness (Mueller et al., 2015). This hippocampal involvement might contribute to stronger encoding of pleasant music, regulated by the hippocampal-striatal pathway (Lisman & Grace, 2005) via dopaminergic (Salimpoor et al., 2013) and serotonergic mechanisms (Evers & Suhr, 2000). Both neurotransmitters play important roles in general learning and memory processes (Meneses & Liy-Salmeron, 2012; Wise, 2004). This music-emotion effect on memory strength might even extend to nonmusic memory. For example, musical pleasure and hedonia have been shown to improve encoding of verbal episodic memory with music in the background, potentially supported by an interaction between the ventral striatum/NAc and the hippocampus whereby dopamine release strengthens hippocampal long-term potentiation, as suggested by the authors (Cardona et al., 2020).
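As an illustrative framing only (borrowed from standard reinforcement-learning accounts rather than from the studies reviewed here), this comparison of expectation with reality can be summarized as a prediction error, $\delta = r - \hat{r}$, where $\hat{r}$ is the value the listener’s syntactic knowledge assigns to the expected continuation and $r$ is the value of what is actually heard; striatal dopamine signals are widely thought to track such discrepancies, which would link syntactic expectancy to the rewarding feeling of musical pleasantness.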

In addition to the reward system, the amygdala and hippocampus together also help distinguish and respond to pleasant versus unpleasant music, manipulated through syntax violations (Koelsch et al., 2006)—both areas show significantly increased responses to unpleasant music. One lesion study found that temporal lobe epilepsy patients with amygdala and hippocampal damage could not recognize sad music; moreover, they did not show better memory for emotional than for neutral music, suggesting a role for these areas in encoding the contextual emotional aspects of music memory (Samson et al., 2009). The amygdala and hippocampus also interact with the “music syntax network” in perceiving and measuring uncertainty and surprise during musical sequence processing, as suggested by a modeling study (Cheung et al., 2019). The authors pointed out that the amygdala–hippocampal interaction was essential in determining the valence of the music, whereas the NAc modulated the motivation for perceiving and predicting subsequent information. While the amygdala and hippocampus respond particularly to unpleasant and surprising music, music that follows prediction activates areas related to syntax processing, including Heschl’s gyrus (part of the auditory cortex) and the left IFG (Koelsch et al., 2006). Music that is extremely satisfying and intensely pleasurable may also activate frontal areas such as the orbitofrontal cortex (Blood & Zatorre, 2001). Overall, the reward system, together with the limbic system, might support the “music syntax network” in encoding music sequences by providing rewarding feedback.

In addition to the emotional responses modulated by music structure or syntax, the emotion conveyed by music can be more complicated, extending beyond a simple dichotomy of pleasantness and unpleasantness (Leonard, 1956). The perception of music emotion relates to memory in that the emotion adds semantic content to the music—it becomes a part of the story a piece of music conveys. First, this emotional semantic component strongly correlates with the memorability of the music itself. For example, music excerpts with stronger emotional valence and arousal are better remembered (Ferreri & Rodriguez-Fornells, 2017). The MTL/limbic system plays the major role in perceiving and identifying emotions expressed by the music, especially complex emotions such as pride and guilt (Frühholz et al., 2014). The hippocampus, in particular, can support temporal integration of this complex emotional semantic information and provide a contextual memory association (Fig. 1), which can further strengthen the memory of the music itself (Frühholz et al., 2014). Relating to its role in episodic memory encoding, the hippocampus supports self-related memory associations, such as scenario imagery, during music-evoked emotional experiences (Alluri et al., 2015). For example, musicians tend to imagine themselves in a personal experience during music listening and show more hippocampal activation (Groussard et al., 2010). Amygdala and anterior hippocampal activation were also observed when emotional music was paired with neutral film clips but not during music listening alone (Eldar et al., 2007), suggesting that the presence of a visual context supplies the semantic/contextual association and that these areas might support encoding this association during music memory formation. Another study observed similar neural activity when people listened to music with their eyes closed but not open, possibly reflecting semantic imagery processing during music encoding (Frühholz et al., 2014; Lerner et al., 2009).

In a meta-analysis, Koelsch (2020) discussed the role of the hippocampus in music emotion perception and experience. He suggested that the right hippocampus supports the perception of attachment-related emotions associated with social bonding, such as empathy, whereas the left hippocampus reacts only to negative musical emotion related to the familiarity and expectedness of the music (Koelsch, 2020). This adds evidence to a frequent debate in the music literature about the differences between emotion perception and (felt) emotional responses to music (Gabrielsson, 2001). While evoked pleasantness or unpleasantness correlates with whether the music’s structural pattern follows our syntactic knowledge, perceived emotion relates more to understanding and mirroring the semantic meaning embedded within the music’s tonal variation. Beyond the hippocampus, the IFG (the music syntax area) might also support understanding of higher level emotional/semantic meaning. In nonmusic tasks, the IFG has been found to be involved in empathy and social perception, such as facial emotion perception, via a mirror neuron mechanism (Jabbi & Keysers, 2008; Kaplan & Iacoboni, 2006; Press et al., 2012). Mirror neurons enable an individual to understand the social meaning (such as intention and emotion) of communication signals perceived from others (Kilner et al., 2009). Several music studies identified this mechanism, reflected in left IFG activity during music listening and while watching music performances, and concluded that the IFG is also engaged in understanding the semantic intention of the music, allowing listeners to better perceive and experience the emotion it induces (D’Ausilio, 2009; Molnar-Szakacs & Overy, 2006).

The emotion of music not only adds semantic meaning to the music; it also provides contextual associations for nonmusic information encoded with music. For example, emotional music played in the background has been shown to enhance visual encoding, including memory for pictures and faces (T. Baumgartner et al., 2006; Jolij & Meurs, 2011; Proverbio et al., 2015). Beyond visual memory, pleasant music also benefits verbal episodic memory encoding and recollection (Cardona et al., 2020). One mechanism behind this might be phase synchronization, which has been linked to memory formation in both humans and rats (Fell & Axmacher, 2011). Increased frontal theta power has been observed during consonant music listening compared with dissonant music (Sammler et al., 2007). Increased synchronization between the right temporal and right frontal cortex was also found to correlate with the degree of pleasantness listeners reported for background music (Ara & Marco-Pallarés, 2020). Memory retrieval and working memory are tightly related to such frontal oscillations (Schack et al., 2005); thus, pleasant music listening might promote memory for information learned in a musical context via such a mechanism. Future studies aiming to use music as a memory aid should consider further investigating this emotional power. For example, some suggest the famous but debated “Mozart effect” (that listening to certain music can improve cognitive functions including attention and memory) might relate to the arousal and emotion evoked by background music (Schellenberg & Weiss, 2013).

In summary, identifying and experiencing emotion during music listening can enhance both music structure memory and associated contextual memory. Likely by modulating communication among the “music syntax network,” the reward system, and the MTL (Fig. 1), emotion might promote deeper encoding of episodic-like contextual associations during music memory formation. These effects, which warrant further study, could provide a powerful means for music memory to modulate other forms of memory in applied settings.

Conclusions

Decades of music cognition studies have provided comprehensive evidence for the types of mechanisms involved in music processing. However, there is still a lack of consistent definitions of what constitutes music memory and a lack of consistent data on how it is formed in the brain. Importantly, few prior studies have investigated how the association between music memory and general long-term memory functions is mediated. This review proposes a restructuring of how music memory is conceptualized for the purpose of future neuroscience research. We argue that music memory is a multidimensional and hierarchical event, within which different components might rely on different neural networks. By virtue of music’s complexity, these do not fall cleanly along the traditional boundaries between forms of memory drawn in other literatures. The review presents neuroscience results organized according to (1) memory formation and (2) memory retrieval for (a) music syntactical memory (music structure memory) and (b) memory for associated nonmusic elements (music contextual associates memory). We believe it is clear that while auditory processing cortices play a central role in music-listening memory tasks, this network expands to other areas across tasks, with the components recruited being modulated by the level of memory complexity.

As much of the music literature has pointed out, music is a powerful event that can require less subcortical involvement during memory encoding and retrieval. Its uniquely rule-governed features may tax a more restricted network during structure preprocessing, supported mainly by the inferior frontal gyrus and auditory cortex, the same brain areas associated with language comprehension. For music structure memory encoding, we suggest there is sufficient evidence for a neural circuit shared with other forms of memory: beyond the “syntax network” (IFG, auditory cortex, mainly the STG, and IPL), this level of music learning also involves a circuit that includes the medial temporal lobe and subcortical structures such as the basal ganglia and cerebellum, which generally support statistical learning, time perception/encoding, and sequential learning for nonmusic events. Importantly, although some lesion literature implies that music structure memory is independent of the hippocampus, this review found numerous datapoints suggesting that music syntactical/structural memory can rely on the same medial temporal lobe mechanisms that support declarative sequence memories of other forms. Corresponding to its role as a hierarchical sequence associator in nonmusic memory paradigms (McKenzie et al., 2014), the hippocampus may help bind components from different dimensions and across time to form an integrated representation of the music. We argue this may be especially true during the initial stage of novel sequential pattern detection and encoding, while noting why it is very uncommon for a music composition to be truly novel to the brain’s memory systems. In particular, one reason for the evidence of “hippocampal independence” suggested by many music memory studies could be that music learning typically benefits from learned syntax and our “prior knowledge” schematic memory system, which different memory literatures hold to be supported by frontal areas and their interactions with cortex outside the MTL.

On the other hand, it is surprisingly rare for neuroimaging studies to have examined music memory at the level of semantic declarative memory—the story it conveys—or in terms of how it becomes associated with information from other domains (emotions, episodic experience). We provide an overview of the current knowledge of neural correlates for memory of music paired with lyrics, music-associated episodic events, and memory for the paired emotion. Overall, evidence suggests there is a dispersed network for this level of memory beyond the core “syntax network” that enables us to encode and retrieve music contextual associates. Areas including the limbic system and reward circuitry become involved when memory for music demands higher levels of semantic, emotional, and autobiographical processing. Importantly, the semantic/contextual associations formed by these subcortical areas can conversely enhance the encoding and retrieval of music structural memory itself. As a result, one important take-home from the literature is that music memory is special, not only because it is associated with a schematic syntactical system that supports highly efficient and often effortless learning but also because of the interactions between its syntactical and semantic representations: the higher-level semantic and emotional meaning of music provides a scaffold for personal and contextual associations for listeners. This interaction is reflected in the brain putatively through a complex cortical–subcortical collaboration (Fig. 1). Admittedly, this conclusion is derived from a somewhat limited body of neuroscience results. We hope that future research will focus on understanding this interaction, in part because it might be further applied to help people better encode and retrieve memories of other aspects of their lives (a potential already being explored in critical cases such as Alzheimer’s disease). By bridging the gap between mechanisms for music processing and nonmusic long-term memory formation, we might be able to design “music listening experiences” that better benefit our daily lives.