
Introduction

In music, as in several other domains, events occur over time. The way events are structured in time, both in music and in other domains, allows the brain to anticipate the timing of events and, in doing so, to optimize processing of events that occur at expected moments in time (Nobre & van Ede, 2018). In addition, temporal expectations (i.e., predictions about “when” an event will happen) can guide our movements. This is of particular interest when considering musical rhythm, where temporal expectations allow us to dance and make music together (Honing, 2012; Leow & Grahn, 2014), and may play a role in our enjoyment of music (Fiveash et al., 2023). Temporal expectations can be formed based on different information in the environment, such as the contingency between a cue and a temporal interval, or the passage of time itself (Nobre & van Ede, 2018). Such interval-based predictions, as well as foreperiod effects, are discussed in depth elsewhere (Buhusi & Meck, 2005; Ng & Penney, 2014). Here, we focus on temporal expectations as present in rhythm, which denotes the temporal structure of a sequence of multiple events.

Figure 1 shows a schematic overview of rhythmic structure. A rhythmic sequence of seven sounds is depicted, with temporal intervals of various lengths separating the sounds, forming a rhythmic pattern of shorter and longer temporal intervals. In music, importantly, in addition to structure in the form of a rhythmic pattern (Fig. 1B), rhythm often induces the perception of a regular pulse or beat (Bouwer et al., 2021; Nobre & van Ede, 2018). The beat (Fig. 1C) is a perceived, regularly recurring salient moment in time (Cooper & Meyer, 1960) that we can tap and dance to. In musical rhythm, the beat often coincides with an event, but a beat can also coincide with silence (see shaded area in Fig. 1): listeners can perceive a beat even in the absence of cues to this regularity in the rhythmic signal and can persist in perceiving a beat in the presence of conflicting rhythmic information (Honing & Bouwer, 2019; Longuet-Higgins & Lee, 1984). The beat is often embedded in a hierarchical structure of multiple perceived levels of temporal regularity. At a higher level, we can hear regularity in the form of regular stronger and weaker beats (often referred to as meter, as in a waltz, which has a strong-weak-weak pattern of beats), and at a lower level, we can perceive regular subdivisions of the beat. Together, these regularities create a hierarchical pattern of salience known as a metrical structure (Fig. 1D). We can perceive temporal regularity with a period roughly between 200 and 2000 ms (London, 2002, 2012). Within this range, we have a clear preference for beats with a period of around 600 ms, or 100 beats per minute (Fraisse, 1982), and while listeners can to some extent guide the level of regularity they attend to most (Drake et al., 2000), the regularity closest to the preferred rate is often considered most salient (i.e., the beat).

Fig. 1

Schematic overview of structure in rhythm. Rhythm can be conceptualized as a sequence of events in time. Panel A depicts an example rhythm in common music notation. In Panels B, C, and D, sounds are depicted as vertical bars. On top of perceiving the rhythmic pattern formed by the temporal structure of the sounds (i.e., the succession of longer and shorter intervals in time, Panel B), we can perceive a regular beat, here depicted as black events (Panel C). Several nested hierarchical levels of regularity make up a metrical structure (Panel D), with differences in salience between strong beats (depicted in black), weak beats (depicted in dark gray), and subdivisions of the beat (in light gray and white). The metrical interpretation is represented as a metrical tree, with the length of the branches representing the theoretical metric salience of a specific position in the sequence. Note that the third beat (shaded area) coincides with silence: this is a “loud rest” or syncopation, with a missing event on a perceived beat

In this chapter, we first discuss the processes underlying the perception of a regular beat, and possible considerations for designing stimuli that induce beat perception. Next, we discuss how beat perception can be studied using event-related potentials (ERPs), and we give an overview of studies probing beat perception with ERPs in human adults, human newborns, and nonhuman animals. The current chapter updates a previous overview on this topic (Honing et al., 2014). Note that we focus on perceptual aspects of beat perception. For a discussion of how beat perception relates to movement, the motor system, and motor entrainment, see overviews on this topic elsewhere (Cannon & Patel, 2021; Damm et al., 2019; Merchant et al., 2015; Repp & Su, 2013).

Mechanisms of Beat Perception

Entrainment as a Mechanism for Beat Perception

The perception of a regular beat and the temporal expectations we form in response to a beat are often explained within the framework of entrainment (Henry & Herrmann, 2014; Obleser & Kayser, 2019): the synchronization of an internal regularity to the regularity in an external stimulus. From a psychological perspective, entrainment has been described by Dynamic Attending Theory (DAT) (Jones, 2009; Large & Jones, 1999). DAT proposes that internal fluctuations in attentional energy, termed attending rhythms, elicit expectations about when future events will occur. The internal fluctuations in attentional energy can adapt their phase and period to an external rhythm, leading to alignment of peaks in attentional energy with metrically strong positions (i.e., peaks in attentional energy fall on the beat). At moments of heightened attentional energy, events are expected to occur, and processing of events is enhanced (Haegens & Zion Golumbic, 2018). The attending rhythms are thought to be self-sustaining and can occur at multiple nested levels, tracking events with different periods simultaneously (Drake et al., 2000; Large & Jones, 1999). These features of the Dynamic Attending model correspond, respectively, to the stability of our metrical percept and the perception of multiple hierarchical levels of regularity (Large, 2008). Behavioral support for DAT comes from studies showing a processing advantage on the beat (i.e., in phase with an external regularity) for perceiving temporal intervals (Large & Jones, 1999), pitch (Jones et al., 2002), intensity changes (Bouwer et al., 2020; Bouwer & Honing, 2015), and phonemes (Quené & Port, 2005). The processing advantage persists after a rhythmic sequence ends, in line with the supposed self-sustaining nature of the attending rhythms (Hickok et al., 2015; Saberi & Hickok, 2022b). Note, however, that this persistent behavioral facilitation of events in phase with a regularity, outlasting the rhythmic stimulation, has not always been replicated, which has spurred discussion on the automaticity and ubiquity of entrainment (Bauer et al., 2015; Bouwer, 2022; Lin et al., 2021; Saberi & Hickok, 2022a, b; Sun et al., 2021). Several explanations for these discrepant findings have been suggested, including the presence of large individual differences in the strength of entrainment (Bauer et al., 2015; Saberi & Hickok, 2022b; Sun et al., 2021), the dependence of entrainment on uncertainty in the auditory input (Saberi & Hickok, 2022b), and the dependence of entrainment on the rate of the rhythmic signal (Pesnot Lerousseau et al., 2021; Saberi & Hickok, 2022b).
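
The core computational idea of DAT, an internal oscillation that adapts its phase and period to external events, can be illustrated with a deliberately simplified sketch. The Python snippet below is a toy error-correction oscillator, not the oscillator model of Large and Jones (1999); the update rules and the correction gains alpha and beta are illustrative assumptions.

```python
import numpy as np

def entrain(onsets, period=0.6, alpha=0.5, beta=0.1):
    """Toy phase- and period-correcting oscillator: each prediction (a 'peak
    in attentional energy') is pulled toward the observed onsets."""
    last = onsets[0]                      # the first onset initializes the phase
    predictions = []
    for onset in onsets[1:]:
        predicted = last + period         # expected time of the next event
        error = onset - predicted         # asynchrony between event and expectation
        period += beta * error            # period correction: adapt the internal rate
        last = predicted + alpha * error  # phase correction: shift toward the event
        predictions.append(predicted)
    return np.array(predictions)

# A sequence drifting from a 600 ms to a 550 ms inter-onset interval:
# prediction errors shrink as the oscillator entrains.
onsets = np.cumsum([0.0, 0.60, 0.60, 0.58, 0.56, 0.55, 0.55, 0.55])
print(np.round(onsets[1:] - entrain(onsets), 3))
```

Because the period variable itself is updated, such an oscillator retains the entrained rate when stimulation stops, a simple analogue of the self-sustaining attending rhythms described above.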

At the neural level, entrainment may be implemented by the alignment of low-frequency oscillations (in the delta range, 0.5–4 Hz) to external regularity (Haegens & Zion Golumbic, 2018; Henry & Herrmann, 2014; Obleser & Kayser, 2019; Rimmele et al., 2018), leading to heightened neural sensitivity at expected time points (Haegens & Zion Golumbic, 2018; Henry & Herrmann, 2014), akin to the peaks in attentional energy described by DAT. In line with the self-sustaining nature of the attending rhythms described by DAT, neural oscillations can also retain their alignment to a rhythmic sequence after sensory stimulation has stopped (Bouwer et al., 2023; Kösem et al., 2018; van Bree et al., 2021). Entrainment has mainly been studied in the context of regular, periodic stimulation, to explain the prediction of regular, isochronous beats (i.e., predictions that are equally spaced in time). Recently, however, models of entrainment have also been used to explain predictions for non-isochronous rhythmic patterns (i.e., predictions for successions of short and long intervals that are not necessarily of equal length), in the context of irregular meters as found in Balkan music (Tichko & Large, 2019). For the purposes of this chapter, importantly, entrainment theories, both at the psychological and neural level, predict that processing is enhanced for events that are in phase with the entraining signal (Haegens & Zion Golumbic, 2018).

Predictive Processing as a Mechanism for Beat Perception

The perception of a beat is a bidirectional process: not only can a varying musical rhythm induce the perception of a regular beat (hence also referred to as “beat induction” (Honing, 2012)), but a regular beat can also influence the perception of the very same rhythm that induces it. Hence beat perception can be seen as an interaction between bottom-up and top-down sensory and cognitive processes (Desain & Honing, 1999), and as such fits well within the framework of predictive processing (Koelsch et al., 2019; Vuust & Witek, 2014). Within this framework, the perceived metrical structure provides a representation within which incoming sounds are interpreted. This representation is constantly updated based on the incoming sensory information. The relation between the events in the music and the perceived metrical structure thus is a flexible one, in which the perceived metrical structure is both inferred from the music and has an influence on how we perceive the music (Desain & Honing, 2003; Grube & Griffiths, 2009). Of importance to the current chapter, within predictive processing models, it is often assumed that sensory processing for expected events is attenuated (Friston, 2005). Thus, entrainment and predictive processing accounts of rhythm perception make somewhat different predictions about the underlying mechanisms of beat perception (Bouwer & Honing, 2015; Palmer & Demos, 2022), be it synchronization of an internal regularity with an external one, or creation of a hierarchical mental representation of the beat regularity. Now that we have considered the possible mechanisms underlying the perception of a beat, in the next section we consider aspects of rhythmic stimuli that may induce a perceived beat.

Beat from the Bottom Up: Considerations for Stimulus Design

Inducing a Beat from a Rhythmic Sequence

The simplest rhythmic stimulus that may induce the perception of a regular beat is an isochronous sequence (i.e., a sequence with identical intervals between tones, like a metronome, see Fig. 2, example 1A). To probe beat perception, responses to sounds in such sequences have been compared to responses to sounds in sequences with irregular, jittered timing (Fig. 2, example 1B), with the premise that while an isochronous sequence can elicit a perceived beat, a jittered sequence cannot. Thus, a difference in responses to events in such regular and jittered sequences may be ascribed to the presence of a perceived beat in the former but not the latter. In addition to the regularity at the beat level, we can also perceive metrical structure in isochronous sequences, even if all sounds are identical (i.e., an equitone sequence). Listeners have been shown to perceive events in odd positions (Fig. 2, example 1A, black shades) as more salient than events in even positions (Fig. 2, example 1A, gray shades), in line with odd positions representing metrically accented and even positions representing metrically unaccented events (Brochard et al., 2003; Potter et al., 2009). This phenomenon is termed subjective accenting and is reminiscent of perceiving “tick-tock” when listening to a clock instead of “tick-tick” (i.e., we hear a stronger “tick” and a weaker “tock” even if all “ticks” are in fact physically identical). Note that subjective accenting may depend on the rate of the sequence, with listeners shifting the number of notes perceived as one group (or one beat) depending on the tempo. Interestingly, while humans prefer a beat at a rate around 600 ms, subjective accenting seems to favor rates that are slower, suggesting that it is akin to the alternation of strong and weak beats (i.e., meter), rather than events on and off the beat (Bååth, 2015; Poudrier, 2020). In addition, subjective accenting is a highly variable effect that does not always occur in all listeners (Criscuolo et al., 2023).
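
To make the isochronous/jittered contrast concrete, the snippet below sketches one way to generate onset times for equitone sequences like those in Fig. 2, examples 1A and 1B. The interval and jitter values are illustrative placeholders, not parameters taken from any of the cited studies.

```python
import numpy as np

def onset_times(n_events=60, ioi=0.6, jitter=0.0, seed=0):
    """Onset times (in s) of an equitone sequence. jitter=0.0 yields an
    isochronous sequence (cf. Fig. 2, example 1A); jitter>0 draws each
    inter-onset interval uniformly from ioi +/- jitter (cf. example 1B)."""
    rng = np.random.default_rng(seed)
    iois = ioi + rng.uniform(-jitter, jitter, size=n_events - 1)
    return np.concatenate(([0.0], np.cumsum(iois)))

isochronous = onset_times()            # 600 ms IOI, within the preferred tempo range
jittered = onset_times(jitter=0.15)    # IOIs between 450 and 750 ms, same mean rate
```

Keeping the mean inter-onset interval identical across the two sequences ensures that any difference in responses reflects temporal structure rather than overall event rate.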

Fig. 2

Examples of rhythmic sequences used to study beat perception. Rhythms consist of sound events, represented here by vertical bars. Dashed vertical lines represent the perceived beats. Long vertical bars and bars of intermediate length (for example 2) represent (frequent) standard sounds. The shortest vertical bars represent (infrequent) deviant sounds, such as an unexpected decrease in loudness, which are used to elicit a specific series of ERPs (see ERPs in Response to Expectancy Violations). The tree structure underneath the example rhythms depicts the (theoretical) perceived metrical structure for the rhythms that induce a beat (the A examples). 1) The simplest stimulus to study beat perception is arguably an isochronous sequence (1A), with a rate within the range of human preferred tempo. Responses to deviant and standard events in such a sequence can be compared to responses to the same events in jittered sequences (1B; see, for example, Schwartze et al., 2011, 2013; Teki et al., 2011). Some studies have also compared responses to deviants in odd (black) and even (gray) positions in isochronous sequences (1A), to study subjective accenting (Brochard et al., 2003; Potter et al., 2009). 2) A rhythm with alternating loud (long white vertical bars) and soft (intermediate white vertical bars) sounds is thought to induce a regular duple beat when it has isochronous timing (2A). Beat perception can be probed by comparing responses to events on the beat (black) and off the beat (gray), either for deviant or standard sounds. Care must be taken to compare events that are acoustically identical and occur in an acoustically identical context. To control for sequential learning, the difference between responses on and off the beat in an isochronously timed sequence (2A) can be compared to the same contrast in a sequence with jittered timing and the same statistical structure (2B), which is thought to induce sequential learning, but not a beat (Bouwer et al., 2016; Háden et al., 2024; Honing et al., 2018). Note that here, it is also possible to (subjectively) perceive strong and weak beats (i.e., yet another level of regularity). For the sake of simplicity, this is not depicted in this figure. 3) When a rhythm has non-isochronous timing, a beat can be induced by temporal grouping accents. When an (accented) event mostly occurs at regular intervals, listeners will infer a regular beat (3A). Here, to control for grouping, as in example 2, responses in a sequence with regular accents and integer-ratio durations (3A) can be contrasted with responses in a sequence with the same grouping structure, but irregular accents and non-integer-ratio durations (3B), which is thought not to induce a perceived beat (Bouwer et al., 2020; Grahn & Brett, 2007). As for example 2, for the sake of simplicity, only one level of regularity (the beat) is depicted here

While listeners can thus perceive beat and meter in isochronous, equitone sequences, in natural rhythm, the beat usually needs to be inferred from a varying rhythmic signal. Moreover, the regularity listeners perceive need not even be physically present in the rhythmic signal; in fact, rhythm sometimes does not contain regularity at the beat rate at all (Tal et al., 2017). To infer a metrical structure from music with a varying rhythmic structure, we often make use of accents. In a sequence of events, an accent is a more salient event because it differs from other, non-accented events along some auditory dimension (Ellis & Jones, 2009). When accents exhibit regularity in time, we can infer a regular beat from them. Accented tones are then usually perceived as on the beat or, at a higher level, as coinciding with a strong rather than a weak beat (Lerdahl & Jackendoff, 1983). Loudness accents may allow listeners to infer a beat from a rhythm (Bouwer et al., 2018), and pitch accents have also been shown to play a role in perceiving the beat (Ellis & Jones, 2009; Hannon et al., 2004). Indeed, spectral information may even be more informative for the brain to entrain to than the sound envelope of a rhythm (i.e., changes in loudness and onsets) (Weineck et al., 2022). It is very likely that in natural music, many sound features can contribute to an accent structure and our perception of the beat, including not only loudness and pitch but also timbre. In line with this, the use of ecologically valid stimuli may enhance the perception of a beat (Bolger et al., 2013; Tierney & Kraus, 2013). Example 2A in Fig. 2 depicts a rhythm that mostly consists of alternating loud (long vertical white bars) and softer (intermediate vertical white bars) tones. Such a pattern would induce a duple beat through loudness accents, with some events falling on the beat (black shades) and some events falling off the beat (gray shades).

Accents can also arise from the perceptual grouping of rhythmic events in time, even when sounds are acoustically identical. First, when an onset is isolated in time relative to other onsets, it sounds like an accent. Second, when two onsets are grouped together, the second onset sounds accented. Finally, for groups of three or more onsets, the first and/or last tone of the group will be perceived as an accent (Povel & Essens, 1985). Such temporal accents may drive the perception of a beat in a bottom-up manner. Recordings from midbrain neurons in rodents have shown increased firing rates for events on the beat compared to events off the beat in rhythms with purely temporal accents, consistent with the idea that increased responses to tones that are salient based on temporal grouping may drive human beat perception (Rajendran et al., 2017, 2020). Example 3A in Fig. 2 shows a rhythm in which the beat is elicited by temporal accents. Here, a beat can be perceived through temporal accents that are regularly spaced, with an (accented) event always coinciding with perceived beat times. Note that to perceive a beat in this type of rhythm, not only the regular spacing of accents but also the presence of intervals with integer-ratio durations is important (Grahn & Brett, 2007; Jacoby & McDermott, 2017). For example, if the regularity of the beat is present with a period of 600 ms (i.e., 100 beats per minute), it is beneficial to the perception of the beat if a rhythm contains temporal intervals of 150, 300, and 450 ms, which are all related to the beat interval at integer ratios (in this case ratios of 4/1, 2/1, and 4/3). Both for example 2A and example 3A, we would expect differential responses to events on and off the beat, as events on the beat are more expected if a beat is perceived.
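
The integer-ratio relations in the worked example above can be spelled out in a few lines; the helper below is hypothetical and simply expresses each interval as an exact fraction of the beat period.

```python
from fractions import Fraction

def ratios_to_beat(beat_ms, intervals_ms):
    """Express each interval's relation to the beat period as an exact ratio."""
    return [Fraction(beat_ms, iv) for iv in intervals_ms]

# Beat period of 600 ms with intervals of 150, 300, and 450 ms:
print(ratios_to_beat(600, [150, 300, 450]))  # [Fraction(4, 1), Fraction(2, 1), Fraction(4, 3)]
```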

In addition to bottom-up influences on a perceived beat from accents and the temporal structure of a rhythm, listeners can impose different metrical structures on rhythmic sequences if instructed to do so (Iversen et al., 2009; Nozaradan et al., 2011), and cultural background and experience may affect the beat we perceive (Gerry et al., 2010; Hannon & Trehub, 2005; Jacoby & McDermott, 2017; Kaplan et al., 2022; Yates et al., 2016).

Dissociating Beat Perception from Duration-Based Temporal Expectations

Importantly, one challenge in beat perception research is to dissociate responses to a regular beat from responses to other types of structure in the rhythm, such as duration-based temporal structure, ordinal structure, low-level acoustic differences, and temporal grouping. First, we can, in addition to hearing a beat, perceive temporal structure in predictable single durations and predictable rhythmic patterns, be it by learning the contingency between a cue and a specific temporal duration, by learning a sequence of absolute intervals (i.e., the time intervals between two events), or by learning a rhythmic pattern in the form of relative durations (i.e., the ratios between consecutive inter-onset intervals) (Bouwer et al., 2020, 2023; Breska & Deouell, 2017; Morillon et al., 2016; Nobre & van Ede, 2018). Neuroimaging work suggests that specific networks are dedicated to perceiving absolute and relative durations, respectively. While a network comprising the cerebellum and the inferior olive is involved in absolute, duration-based timing, a different network, including the basal ganglia and the supplementary motor area (SMA), is active for relative or beat-based timing (Teki et al., 2011). It is still unclear how the perception of absolute durations, relative durations, rhythmic patterns, and metrical structure are related, with some suggesting that the underlying mechanism for pattern and beat perception is similar (Cannon, 2021; Cannon & Patel, 2021) and some suggesting separate mechanisms (Bouwer et al., 2020, 2023). Hence, when studying beat perception, it is important to take into account possible overlap between temporal structure based on a beat, and temporal structure based on patterns and absolute durations (Bouwer et al., 2021). This may be a challenge for studies relying on isochrony to study beat perception (e.g., Fig. 2, example 1), as the temporal structure in an isochronous sequence can be described both in terms of its regularity and in terms of the repetition of a single interval (Bouwer et al., 2021; Keele et al., 1989). To account for this, the use of more complex stimuli, with at least one level of hierarchy (e.g., some events on the beat and some events off the beat, as in examples 2 and 3), may be instrumental.

Dissociating Beat Perception from Ordinal Structure

When a beat is elicited by accents in otherwise isochronous sequences to induce two levels of a metrical hierarchy (Fig. 2, example 2A), one challenge in probing beat perception is that in strongly beat-inducing sequences, the accents themselves also introduce ordinal structure. For instance, in example 2A, a listener may infer that a soft sound is always followed by a louder sound and that a loud sound is followed by a soft sound in most cases (i.e., loud and soft sounds mostly alternate). Thus, listeners may learn the ordinal, statistical structure of a sequence (Conway & Christiansen, 2001), something humans are capable of at a young age (Saffran et al., 1999). To account for such ordinal structure in probing beat perception, one approach is to not just compare responses to sounds in metrically strong and weak positions (i.e., on and off the beat) in isochronous sequences (Fig. 2, example 2A), in which this difference is affected by both statistical learning and the perceived beat, but to contrast this difference with the corresponding difference in jittered sequences (Fig. 2, example 2B), in which the difference is affected by statistical learning alone (Bouwer et al., 2016). This difference-of-differences logic is sketched below.
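
The sketch expresses the control on hypothetical mean amplitudes; the dictionary layout and the values are illustrative and are not the analysis code or data of Bouwer et al. (2016).

```python
def beat_contrast(amp):
    """Difference-of-differences: the on- vs. off-beat difference in jittered
    sequences indexes statistical learning alone; any surplus difference in
    isochronous sequences is attributed to the perceived beat."""
    learning = amp['jittered_on'] - amp['jittered_off']
    total = amp['isochronous_on'] - amp['isochronous_off']
    return total - learning

# Hypothetical mean deviant-minus-standard amplitudes (in microvolts):
amplitudes = {'isochronous_on': -2.1, 'isochronous_off': -1.2,
              'jittered_on': -1.6, 'jittered_off': -1.1}
print(round(beat_contrast(amplitudes), 2))  # -0.4: on-beat negativity beyond learning
```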

Dissociating Beat Perception from Low-Level Acoustics and Grouping

In natural music, a beat is often induced by creating accents on the beat (similar to example 2A in Fig. 2). Because accented sounds by definition need to stand out from non-accented sounds, this often means that tones on the beat sound different from tones that are not on the beat. Similarly, the acoustic context (e.g., the tone preceding the tone of interest) of weak and strong metrical positions is not identical. Such acoustic differences may lead to differences in low-level perceptual features like masking, and may affect sensory responses (Bouwer et al., 2014; Honing et al., 2014; Winkler et al., 2013). To account for this, stimuli should ideally be designed such that different metrical positions can be probed with identical acoustic properties. To this end, example 2A contains occasional events in offbeat positions that have a loud sound, making them identical to sounds on the beat. This makes it possible to probe sounds on and off the beat with identical acoustic properties and context (Bouwer et al., 2016; Háden et al., 2024; Honing et al., 2018).

Finally, temporal grouping may be instrumental in inducing a beat in non-isochronous rhythms (Fig. 2, example 3A). That is, grouping of events may lead to perceived accents, which may then be used by a listener to abstract a beat structure from a non-isochronous sequence (Povel & Essens, 1985). However, it must be considered that differences in salience due to perceptual grouping may lead to differences in neural responses, regardless of the presence of a beat (Andreou et al., 2015). To account for this, a strategy similar to that described above for ordinal structure may be followed, whereby responses to events in a non-isochronous rhythm with regularly spaced accents, which is thought to induce a beat (e.g., Fig. 2, example 3A), are compared to responses to events in a rhythm without regularly spaced accents, but with an identical grouping structure (e.g., Fig. 2, example 3B), which is thought not to induce a beat (Bouwer et al., 2020).

To summarize, in musical rhythm, humans often perceive nested, hierarchical levels of regularity known as a metrical structure, with the most salient level of regularity representing the beat. The perception of beat and meter has been explained by entrainment theories and theories of predictive processing, which make slightly different predictions for how the perceived metrical structure affects the processing of events in a rhythm (e.g., entrainment should lead to enhanced processing of events in strong metrical positions, while predictions should lead to attenuation of events that are expected) (Bouwer & Honing, 2015; Lange, 2013). A metrical structure can be inferred from a rhythm through accents in various forms. Importantly, in studying beat perception, the perception of metrical structure needs to be dissociated from other types of structure present in rhythm, such as duration-based temporal structure, ordinal structure, low-level acoustic variability, and grouping. Hence, a simple comparison of responses on and off the beat is often not enough to infer something about beat perception, as events on and off the beat often differ in many more characteristics than just their metrical position.

Measuring Beat Perception with Event-Related Potentials (ERPs)

Some of the main questions regarding beat perception concern whether beat perception is innate (or spontaneously developing) and/or species-specific (Honing, 2018; Honing et al., 2014). Testing human newborns and nonhuman animals to answer these questions requires a method that is noninvasive and does not require an overt response from the participant. EEG is well suited for this task and has the temporal resolution to track the perception of a beat over time. Several approaches exist for probing beat perception with EEG. Analyses of EEG responses in the frequency domain may directly probe the entrainment of low-frequency neural oscillations to an external regularity (Nozaradan, 2014; Tal et al., 2017), but also need to account for possible methodological pitfalls (Novembre & Iannetti, 2018; Zoefel et al., 2018). Here, we focus on the well-studied approach of analyzing event-related potentials (ERPs), and we discuss several recent studies that have used ERPs to probe beat perception in human adults, newborns, and nonhuman primates.
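
Although the remainder of this chapter focuses on ERPs, the frequency-domain logic mentioned above can be illustrated briefly: compute the amplitude spectrum of the EEG and read it out at beat- and meter-related frequencies. The sketch below is a bare-bones version under assumed stimulus frequencies; it omits the noise correction and other safeguards discussed in the cited methodological critiques.

```python
import numpy as np

def amplitude_at(signal, sfreq, freqs=(1.25, 2.5, 5.0)):
    """Single-sided amplitude spectrum of a steady-state EEG segment, read out
    at meter-, beat-, and event-related frequencies (here assuming a 5 Hz
    event rate with binary subdivisions; illustrative values only)."""
    n = len(signal)
    spectrum = np.abs(np.fft.rfft(signal)) * 2 / n
    bins = np.fft.rfftfreq(n, d=1.0 / sfreq)
    return {f: spectrum[np.argmin(np.abs(bins - f))] for f in freqs}
```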

Auditory ERPs

ERPs are hypothesized to reflect the sensory and cognitive processing in the central nervous system associated with particular (auditory) events (Luck, 2005). ERPs are isolated from the EEG signal by averaging the signal in response to many trials containing the event of interest. Through this averaging procedure, any activity that is not time-locked to the event is averaged out, leaving the response specific to the event of interest: the ERP. While ERPs do not provide a direct functional association with the underlying neural processes, there are several advantages to the technique, such as the ability to record temporally fine-grained and covert responses not observable in behavior. Also, several ERP components have been well studied and documented, not only in human adults but also in newborns and nonhuman animals. Some of these components, used in testing beat perception, are elicited with an oddball paradigm.
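
The averaging logic just described is simple enough to sketch directly. The function below epochs a continuous single-channel recording around event onsets, baseline-corrects each epoch, and averages; it is a minimal illustration, not a replacement for a full pipeline (e.g., MNE-Python) with filtering, artifact rejection, and multichannel handling.

```python
import numpy as np

def erp(eeg, events, sfreq=500.0, tmin=-0.1, tmax=0.5):
    """Average event-locked epochs of a 1-D EEG trace; `events` holds event
    onsets in samples. Activity not time-locked to the events averages out."""
    n0, n1 = int(tmin * sfreq), int(tmax * sfreq)
    epochs = []
    for t in events:
        segment = eeg[t + n0 : t + n1]                 # tmin..tmax around the event
        epochs.append(segment - segment[:-n0].mean())  # subtract pre-stimulus baseline
    return np.mean(epochs, axis=0)

# Synthetic demo: an evoked dip around 100 ms post-onset, buried in noise.
rng = np.random.default_rng(1)
eeg = rng.normal(0.0, 5.0, size=60_000)                # 2 min of "EEG" at 500 Hz
events = np.arange(1_000, 58_000, 400)                 # one event every 800 ms
for t in events:
    eeg[t + 40 : t + 60] -= 3.0                        # inject an N1-like deflection
evoked = erp(eeg, events)                              # 300 samples: -100 to +500 ms
```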

ERPs in Response to Expectancy Violations

An auditory oddball paradigm consists of a frequently recurring sequence of stimuli (standards), in which a stimulus is infrequently changed (deviant) in some feature (e.g., pitch, intensity, or timing). The deviant stimulus thus violates the expectations that are established by the standard stimuli. Depending on the task of the subject, a deviant stimulus elicits a series of ERP components reflecting different stages and mechanisms of processing. The mismatch negativity (MMN) is a negative ERP component elicited between 100 and 200 ms after the deviant stimulus. The MMN is thought to reflect automatic deviance detection through a memory-template matching process (Kujala et al., 2007; Näätänen et al., 2007), and can be elicited by expectancy violations in sound features such as pitch, duration, or timbre (Winkler, 2007; Winkler & Czigler, 2012), abstract rules (Paavilainen et al., 2007), or stimulus omissions (Yabe et al., 1997). The N2b is a component similar to the MMN in latency, polarity, and function, but it is only elicited when the deviant is attended and relevant to the task (Schröger & Wolff, 1998). At around 300 ms after the deviant stimulus, a positive component can occur, known as the P3a, which reflects attention switching and orientation toward the deviant stimulus. For task-relevant deviants, this component can overlap with the slightly later P3b, reflecting match/mismatch with a working memory representation (S. H. Patel & Azzam, 2005; Polich, 2007). The latency and amplitude of the MMN, N2b, P3a, and P3b are sensitive to the relative magnitude of the expectancy violation (Comerchero & Polich, 1999; Fitzgerald & Picton, 1983; Rinne et al., 2006; Schröger & Winkler, 1995) and correspond to discrimination performance in behavioral tasks (Novitski et al., 2004). These properties are exploited when probing beat perception with ERPs. Moreover, ERP responses to expectancy violations, most notably the MMN, have been recorded in comatose patients (Näätänen et al., 2007), sleeping newborns (Alho et al., 1992), and anesthetized animals (Csépe et al., 1987), making ERP research an ideal instrument for interspecies comparisons and for testing the innateness of beat perception.
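
As a concrete illustration of the paradigm's structure, the snippet below generates a standard/deviant sequence with a fixed deviant probability and a minimum number of standards between successive deviants, a common design constraint; all parameter values are illustrative.

```python
import numpy as np

def oddball_sequence(n_trials=600, p_deviant=0.1, min_standards=3, seed=2):
    """0 = standard, 1 = deviant. At least `min_standards` standards separate
    successive deviants, so each deviant follows a re-established regularity."""
    rng = np.random.default_rng(seed)
    seq = np.zeros(n_trials, dtype=int)
    last_deviant = -(min_standards + 1)
    for i in range(n_trials):
        if i - last_deviant > min_standards and rng.random() < p_deviant:
            seq[i] = 1
            last_deviant = i
    return seq

seq = oddball_sequence()
print(seq.mean())  # realized deviant proportion, close to (but below) p_deviant
```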

ERPs in Response to Frequent Stimuli

While the abovementioned ERPs are elicited by expectancy violations, any sound will elicit a succession of obligatory responses, regardless of whether the sound is frequent or infrequent. Hence, in addition to using responses to expectancy violations to probe beat perception, we can also compare responses to frequent sounds (standards). In the current chapter, we focus on two early sensory responses (as studied in humans): the P1 and the N1. The auditory P1 (sometimes termed P50, as it typically peaks at about 50 ms post-stimulus onset) and N1 (sometimes termed N100, as it typically peaks around 100 ms post-stimulus onset) components are thought to be generated in auditory cortices, and are sensitive to stimulus features such as loudness and pitch change, as well as to presentation rate (Näätänen & Picton, 1987; Picton et al., 1974; Winkler et al., 2013). In addition, the N1 has been shown to be affected by both attention and expectations, including temporal expectations (Lange, 2013; Picton & Hillyard, 1974), making it a potentially informative component to study in the context of musical rhythm.

Using ERPs to Probe Beat Perception

The general idea of using ERPs to probe beat perception is that an event on the beat is perceived differently from an event not on the beat due to the metrical expectations of the listener, and thus that two physically identical events in different metrical positions should yield different brain responses. More specifically, ERP responses elicited by expectancy violations (e.g., MMN, N2b, P3a, P3b) are typically larger for more unexpected events. If a beat is perceived, we form strong expectations for events to occur on the beat (Honing & Bouwer, 2019). Hence, ERPs in response to expectancy violations that interfere with the perceived beat (e.g., a deviant softer sound on the beat, depicted by the short vertical bars in black shades in Fig. 2, examples 2A and 3A) should be larger than ERPs in response to violations that do not interfere with a perceived beat, either because they are in line with the metrical structure (e.g., a deviant softer sound in an offbeat position, depicted by the short vertical bars in gray shades in Fig. 2, examples 2A and 3A) or because no beat is perceived (e.g., a deviant softer sound in a jittered sequence, depicted by the short vertical bars in Fig. 2, examples 2B and 3B). In addition, several components of the obligatory auditory-evoked potential (e.g., the P1 and N1 responses) are smaller for expected than unexpected sounds, in line with predictive processing accounts that predict attenuation of the expected sensory input (Lange, 2013). Hence, in the presence of a perceived metrical structure, events in weak metrical positions (e.g., standard sounds off the beat, depicted by the long vertical bars in gray shades in Fig. 2, examples 2A and 3A) are less expected than events in strong metrical positions (e.g., standard sounds on the beat, depicted by the long vertical bars in black shades in Fig. 2, examples 2A and 3A) and may therefore elicit stronger responses.

ERP responses to expectancy violations and P1 and N1 responses can also be affected by attention. The N2b and P3b only occur when a stimulus is task-relevant (Polich, 2007; Schröger & Wolff, 1998), while the MMN can be modulated by attention (Haroush et al., 2010), and can even be completely eliminated when deviations in attended and unattended auditory streams vie for feature-specific processing resources (Sussman, 2007). Since Dynamic Attending Theory predicts enhanced processing in metrically strong positions due to a peak in attentional energy, we may expect that ERP components in response to expectancy violations are affected by metrical position due to differences in attention, with larger responses to events that coincide with peaks in attention (i.e., in strong metrical positions). At the same time, the N1 has been shown to be enhanced by attention; hence, the response to events in strong metrical positions may be larger than the response to events in weak metrical positions (Haegens & Zion Golumbic, 2018).

Note that several mechanisms may thus affect ERPs to rhythm in different ways, and sometimes even in opposite directions, with larger responses to events in strong metrical positions due to attention effects, and smaller responses due to the effects of expectations (Lange, 2013). Also, in most cases, an implicit assumption made by studies using oddball designs is that expectations for when a sound will occur are coupled with expectations for the sound itself (“what”). In other words, in the studies below, when an expectation is violated, it is almost always the expectations for a certain sound (“what”) that are violated, and not the expectation for sound timing itself. Whether expectations for timing can be formed at all without any expectation for sound identity is a subject of debate and outside the scope of this chapter (Clarke, 2005; Gibson, 1975; Morillon et al., 2016).

Importantly, when using ERPs to probe beat perception, the ERPs are not a direct index of the processes involved in the perception of a beat. Rather, researchers use ERP components that have been extensively studied over the years and are known to be affected by attention and expectations. Since the main mechanisms underlying beat perception have been associated with the processes of attending and expectancy (see Mechanisms of Beat Perception), ERPs can be used to index the strength of beat perception by indexing the strength of attention and expectations. The ERPs themselves thus do not reflect the perceived beat but, rather, are modulated by it. We will now turn to a discussion of research that has used ERPs to probe beat perception in human adults, newborns, and nonhuman primates.

Probing Beat Perception in Human Adults with ERP Responses to Expectancy Violations

Comparing Isochronous to Jittered Sequences

As described above, the simplest way of probing beat perception is by comparing responses to infrequent sounds embedded within an isochronous, presumably beat-inducing sequence (Fig. 2, example 1A) with responses to infrequent sounds within a sequence with jittered timing (Fig. 2, example 1B). Typically, P3 responses are larger, and sometimes earlier, for deviants in isochronous than in jittered sequences (Lange, 2009; Rimmele et al., 2011; Schmidt-Kassow et al., 2009; Schwartze et al., 2011), in line with stronger expectations being formed about the occurrence of events in isochronous sequences. This effect is somewhat attenuated in cerebellar patients (Kotz et al., 2014) and children with developmental coordination disorder (Chang et al., 2021), and has been related to movement, both in healthy adults and in patients with Parkinson's disease (Conradi et al., 2016; Lei et al., 2019), confirming a role for motor networks in the formation of temporal expectations. Results for earlier responses are somewhat mixed, but larger N2b responses to deviants in isochronous than in jittered sequences have been observed (Kotz et al., 2014; Rimmele et al., 2011). This effect seems to be attention-dependent, though numerically, the same effect can be seen for the MMN in unattended conditions (Schwartze et al., 2011).

However, of note, because an isochronous stimulus is used, it is unclear whether these results are due to beat perception or rather to differences in learning single intervals. Interestingly, one study in the visual domain found that the effects of temporal expectations on the P3 did not differ between isochronous sequences and cue-based expectations (Breska & Deouell, 2017). Also, similar P3 effects can be observed for sequences with grouping structure but no temporal regularity (Schmidt-Kassow et al., 2009). Thus, the contrast between responses to isochronous and jittered sequences likely reflects a combination of beat perception and the perception of other types of regularity.

Comparing Responses to Strong and Weak Beats

To account for the presence of duration-based temporal processing, one option is to add an extra hierarchical level to the metrical structure and examine differences between metrical positions. One example of how deviant detection can show the presence of metrical perception comes from studies examining subjective rhythmization (Brochard et al., 2003; Potter et al., 2009). In these studies, participants were presented with an isochronous series of tones. They were hypothesized to perceive the tones in odd positions as stronger than tones in even positions, due to an imposed duple metrical structure. Infrequently, a softer tone was introduced, either in odd or in even positions (Fig. 2, example 1A). These deviants elicited an N2b and a P3b. The P3b to deviants in odd positions had a larger amplitude than the P3b to deviants in even positions, showing that the deviants were indeed detected better, or perceived as more violating, on a strong beat than on a weak beat (Brochard et al., 2003; Potter et al., 2009). In a related study, physical accents in the form of tones with longer durations were used to induce a duple or triple meter, and as in the subjective rhythmization studies, the P3 response to softer target tones was larger in strong than in weak metrical positions. Here, a similar effect was found for the earlier N2b component, albeit only in the duple meter condition (Abecasis et al., 2005).

Note that in these studies, the isochronous sequence on which a structure was imposed was at a rate close to the preferred tempo for humans, and as such, the difference between odd and even positions can be interpreted more as meter (i.e., strong and weak beats) than as beat (i.e., on the beat and off the beat). It is unclear whether these results are based on listeners imposing temporal regularity at the level of the meter or on listeners imposing a hierarchical grouping structure, with groups of two or three events. To examine this, one strategy could be to contrast the difference between responses to strong and weak beats in an isochronous sequence (e.g., example 1A in Fig. 2) with the same difference in a jittered sequence (e.g., example 1B in Fig. 2).

Of note, in the studies looking at subjective rhythmization, the rhythmic sequences were always task-relevant, and the ERP components of interest were the attention-dependent P3b and N2b. One other study examined meter processing under unattended conditions by using drum rhythms with occasionally omitted sounds on strong and weak beats, and found a latency difference for the MMN dependent on meter (i.e., shorter latency for strong-beat than for weak-beat violations; Ladinig et al., 2009, 2011). However, these findings could not be replicated in a larger sample (Bouwer et al., 2014), suggesting that meter processing may require attention. Additionally, meter processing may be affected by musical training (Nave-Blodgett et al., 2021).

Comparing Responses on and off the Beat

At the level of the beat, several studies have used oddball paradigms to study the difference in responses on and off the beat. For drum rhythms with infrequent omissions, MMN responses were larger for omissions on the beat than off the beat, even when the sequences were not attended (Bouwer et al., 2014). Similarly, the MMN was larger for intensity decrements in odd than in even positions for isochronous sequences at a rate that corresponded to twice the preferred rate of humans (i.e., the isochronous sequence was at the level of subdivisions of the beat, with odd events on the beat and even events off the beat; Bouwer & Honing, 2015). Interestingly, for isochronous sequences without acoustic cues to the hierarchical metrical structure, this effect was larger for Western listeners than for bicultural listeners who are familiar with sub-Saharan African music (Haumann et al., 2018), indicative of an effect of experience. Indeed, another study found that deviance responses to omissions on and off the beat were related to musical training (Silva & Castro, 2019), and differences may also be due to innate variability in strategies for temporal processing (Snyder et al., 2010).

Note that for the abovementioned studies, deviants consisted of softer sounds or omissions. Both entrainment and predictive processing accounts of beat perception would predict these deviants to be more salient in strong metrical positions, either because more processing resources are focused on those points in time, or because listeners form strong expectations for louder sounds on the beat (Bouwer & Honing, 2015). Results from studies using intensity increments as deviants may be more in line with the latter explanation, as these consistently found larger ERP responses to unexpected increments off the beat than on the beat (Abecasis et al., 2009; Bouwer & Honing, 2015; Geiser et al., 2010). In these studies, however, no jittered control sequences (e.g., the B examples in Fig. 2) were used, leaving open the possibility that the rhythmic stimuli may have induced a temporal grouping structure. Of note, behaviorally, listeners show the effects of grouping even for non-isochronous rhythmic sequences with a timing structure that does not easily induce a beat (Bouwer et al., 2020).

Importantly, the stimuli used by Bouwer et al. (2014) contained multiple types of structure in addition to the beat. This study used drum rhythms with alternating bass drum, snare drum, and hihat sounds. While omissions on the beat always followed a hihat sound, omissions off the beat followed a bass drum sound, an order of events that was overall more likely in the sequences. Hence, the observed effects could be due to statistical learning (i.e., learning the transitional probabilities between consecutive sounds), rather than beat perception (Bouwer et al., 2016). In a follow-up study, this was accounted for by using jittered sequences as a control condition (e.g., Fig. 2, example 2B). Here, the difference in ERP responses (MMN, N2b, and P3a) to intensity decrements on and off the beat was larger in the isochronous sequences (Fig. 2, example 2A) than in the jittered sequences (Fig. 2, example 2B), which the authors took as evidence for beat perception. This effect came on top of the statistical learning that was evident from the difference between responses to deviants on and off the beat in the jittered sequences, and the effect was present regardless of attention to the sequences (Bouwer et al., 2016). The results from this study are depicted in Fig. 3A. One open question is whether the differences in responses could potentially be due to better statistical learning in isochronous than in jittered sequences, as temporal expectations have been shown to affect statistical learning (Tsogli et al., 2022).

Fig. 3

ERP results from studies probing beat perception in human adults, newborns, and nonhuman primates. (a) Difference waves (i.e., the difference between ERPs to deviant and standard sounds) for infrequent intensity decrements presented within isochronous and jittered sequences, either on the beat or off the beat (Fig. 2, example 2). For human adults, N2b, MMN, and P3 responses were larger on the beat (black) than off the beat (gray), and this difference was more pronounced in isochronous (solid lines) than in jittered sequences (dashed lines), suggestive of beat perception (Bouwer et al., 2016). (b) In newborns, similar to adults, the MMR was largest for deviants on the beat in isochronous sequences, providing evidence for the presence of beat processing (Háden et al., 2024). Note that the latency and morphology of the newborn MMR are very different from the MMN found in adults. (c) In two nonhuman primates presented with the same paradigm (Fig. 2, example 2), the MMR was larger for deviants presented within isochronous sequences (solid lines) than for deviants presented within jittered sequences (dashed lines). However, here, the difference between the responses to deviants on and off the beat was not larger in the isochronous than in the jittered condition, suggesting that while the animals were capable of perceiving the temporal regularity of the isochronous sequences, they did not represent the full metrical structure including the beat (Honing et al., 2018). Note that as for newborns, the morphology of the ERPs and the latency of the MMR are different from those commonly found in human adults (see also Table 1), and highly variable between individuals

Finally, several studies have found larger ERP responses for deviations from the rhythmic surface structure than for deviations from the hierarchical metrical (and ordinal) structure, both for rhythms consisting of drum sounds indicating the metrical structure (Vuust et al., 2005, 2009) and for rhythms with temporal accents (Edalati et al., 2021; Geiser et al., 2009). This shows that absolute temporal expectations can greatly influence ERP responses to rhythm, and highlights the importance of controlling for differences in the surface structure of rhythm.

To summarize, a large collection of studies has now demonstrated differences in responses to deviant sounds on and off the beat for several ERP components, including the MMN, N2b, P3a, and P3b, often without attention directed to a rhythmic stimulus, and with musically untrained listeners. However, it remains a challenge to design stimuli that allow this effect to be readily ascribed to beat perception.

Probing Beat Perception in Human Adults by Looking at the Auditory P1 and N1 Response

Comparing Isochronous to Jittered Sequences

A large body of research has shown smaller sensory responses to sounds in isochronous than in jittered sequences, both for the N1 component (Foldal et al., 2020; Kotz et al., 2014; Lange, 2009, 2010; Makov & Zion Golumbic, 2020; Schwartze et al., 2013; Schwartze & Kotz, 2015; van Atteveldt et al., 2015) and the P1 component (Brinkmann et al., 2021; Rimmele et al., 2011; Schwartze et al., 2013, 2015; Schwartze & Kotz, 2015). This is in line with the attenuation of expected sounds as predicted by predictive processing accounts of temporal expectations. This effect was shown to be largely independent of attention (Makov & Zion Golumbic, 2020; Schwartze et al., 2013). While the use of isochronous sequences without hierarchical structure prohibits strong conclusions about the involvement of beat-based perception, of note, this effect was diminished in patients with basal ganglia lesions (Schwartze et al., 2015), but not in patients with cerebellar lesions (Kotz et al., 2014). As the basal ganglia, but not the cerebellum, are specifically involved in beat-based perception (Grahn, 2009; Merchant et al., 2015), this may suggest that for isochronous sequences, temporal expectations rely at least to some extent on beat perception.

Comparing Responses in Different Metrical Positions

While early sensory responses are usually attenuated by the presence of temporal predictability in isochronous sequences, interestingly, studies comparing early sensory responses on and off the beat have found opposite results, with larger responses for events on the beat. This was found for the N1 response for rhythms with temporal accents indicating the beat (Abecasis et al., 2009) and melodies with pitch structure indicating the beat (Fitzroy & Sanders, 2015), and for the P1 response for isochronous sequences at a fast rate (i.e., with odd tones on the beat and even tones off the beat; Bouwer & Honing, 2015) and for real music (Tierney & Kraus, 2013). Similarly, the N1 response was found to be larger for events on a strong beat than for events on a weak beat in two studies with isochronous sequences on which listeners were instructed to impose a duple, triple, or quadruple meter (Fitzroy & Sanders, 2021; Schaefer et al., 2010). Thus, while the putative beat perception in isochronous sequences leads to attenuated responses compared to jittered sequences without a beat, at the same time, beat perception seems to enhance responses on the beat when compared to responses off the beat.

There are several explanations for this discrepancy. First, the effects of attention and prediction on early sensory processing are thought to be opposite, with the former leading to enhancement and the latter to attenuation (Lange, 2013). Possibly, the balance in the extent to which attentional processes and predictive processes related to beat perception are present differs depending on the type of sequence used. Another possibility is that whereas the contrast between isochronous and jittered sequences taps into processes associated with temporal expectations, the contrast between different metrical positions taps into processes associated with hierarchical perception and grouping. Evidence for this idea comes from two studies that manipulated beat perception (i.e., the temporal regularity of the signal) while controlling for the grouping structure of non-isochronous rhythms (akin to Fig. 2, examples 3A and 3B). In both these studies, sensory responses were attenuated for events on the beat in sequences with regularly spaced accents (i.e., with a beat; Fig. 2, example 3A) as compared to sequences with irregular accents (i.e., without a beat; Fig. 2, example 3B) (Bouwer et al., 2020; Schirmer et al., 2021), even in the absence of attention (Bouwer et al., 2020). In one of these studies, ERPs and behavioral responses were measured separately for events on the beat and off the beat, both in sequences with and without a beat. Of note, while the ERPs yielded no significant difference between events on and off the beat, behaviorally, there was an advantage for events on the beat, even for sequences without a regular beat, indicative of possible grouping effects (Bouwer et al., 2020).

To summarize, temporal expectations generally seem to lead to attenuation of the P1 and N1 component of the auditory-evoked potential, irrespective of the task relevance of a rhythmic sequence. In contrast, metrical structure may lead to enhancement of these components for metrically strong as compared to weak positions. This discrepancy may be explained by dissociating between temporal expectations, including beat perception, and expectations based on grouping and hierarchical structure.

In general, both studies examining responses to infrequent sounds (i.e., probing expectancy violations with oddball paradigms) and studies examining early sensory responses to frequent sounds have found consistent differences in ERP responses depending on the presence of a regular beat. In many studies, these effects were shown to be independent of task relevance, and the effects were present in participants without specific musical training. These properties make ERPs an interesting candidate for probing beat perception in human newborns and nonhuman primates (Honing et al., 2014), to which we will turn in the next sections.

Measuring ERPs in Human Newborns

Birth is a special moment for research, as it is the first time that the infant's nervous system is easily accessible to electrophysiological measurements, and a starting point of development with unfiltered auditory input (Lecanuet, 1996). However, the auditory system develops from the second trimester of pregnancy (Moore & Linthicum, 2007) and shows signs of discrimination of sounds even in utero (Huotilainen et al., 2005). Hence, birth cannot be taken as a sharp boundary between innate and learned abilities, although there is some evidence separating these abilities in preterm infants (Mahmoudzadeh et al., 2017). Due to the extremely rapid development of the auditory system during the first year, recordings from newborns are not only noisier than recordings from adults but also qualitatively different, lacking adult-obligatory components such as the P1 and N1 (Eggermont & Ponton, 2003). MMN-like ERP responses in newborns were first measured by Alho et al. (1992). It is not yet clear whether the infants' responses are identical or only analogous to the adult MMN responses (Háden et al., 2016). Based on the ERP correlates of deviant-standard discrimination, we can assume that auditory information that leads to discrimination in adults is also processed in the infant brain. However, further processing steps are unclear. In response to oddball designs, ERPs of both negative and positive polarity have been found, with latencies ranging from about 80 ms up to 500 ms (Virtala et al., 2022). With these caveats in mind, in the discussion below we will refer to these ERP responses found in newborns and young infants as mismatch responses (MMR).

Measuring ERPs in newborns is a technical and analytical challenge, not only because of the inherent noisiness of the signal but also due to the limited recording time usually available, the altered state of the infants, who are mostly asleep throughout the recording, and the limited number of channels used in newborn recordings. Fortunately, the use of high-density (64+) electrode nets has become widespread, several preprocessing pipelines aim to address noise in recordings (Debnath et al., 2020; Fló et al., 2022; Gabard-Durnam et al., 2018; Kumaravel et al., 2022), and templates for more accurate source reconstruction have become available (O'Reilly et al., 2021). These advances allow for more fine-grained analyses of infantile auditory processing and better comparison with adult results. Such advances can also motivate the replication of basic results in the field.

Several abilities that underlie music perception appear to be functional already at birth. Newborns are able to separate two sound streams based on sound frequency (Winkler et al., 2003) and to detect pattern repetitions, which they incorporate into their model of the auditory scene (Stefanics et al., 2007). Most important to beat perception is the ability to process temporal relations. The presentation of a stimulus earlier or later than expected in an isochronous sequence elicits an MMR in 10-month-old infants (Brannon et al., 2004), at least for large time intervals (500–1500 ms). Newborns are also sensitive to shorter changes (60–100 ms) in stimulus length (Čeponiené et al., 2002; Cheour et al., 2002), and 6-month-old infants detect even shorter gaps (4–16 ms) inserted in tones (Trainor et al., 2001, 2003), showing the remarkable temporal resolution of the auditory system. Háden et al. (2012) showed that newborns are sensitive to changes in the presentation rate of the stimulation, can detect the beginning of sound trains, and react to the omission of expected stimuli. Furthermore, there are indications that newborns can learn hierarchical rules (Moser et al., 2020) and can integrate contextual information in their predictions about future events over both shorter (Háden et al., 2015) and longer time periods (Todd et al., 2022). Some of the abilities that reflect the general organization of temporal pattern processing in the brain may be present even before term birth: preterm newborn infants were shown to exhibit an MMR to earlier-than-expected tones in a non-isochronous rhythmic pattern in duple meter (Edalati et al., 2022). Thus, the infant brain, even preterm, can detect rhythmic pattern violations. Dynamic causal modeling (DCM) of the MMR revealed extensive top-down and bottom-up connections between the auditory cortices and temporal structures on both sides, and right frontal areas (Edalati et al., 2022), similar to the network found in adults (Phillips et al., 2015). Taken together, these results indicate that investigating phenomena reliant on temporal processing (e.g., beat and meter perception) is viable in (newborn) infants.

Using MMR to Probe Beat Perception in Human Newborns

Several studies to date have examined beat perception in newborns using the MMR as an index of temporal expectations. One study examined the processing of unexpected sounds within natural language that was presented either spoken, sung, or rhythmically recited to a strong beat at about 2 Hz, as intended for a nursery rhyme (Suppanen et al., 2019). Deviants in the form of changes in words, vowels, sound intensity, or pitch were introduced on stressed syllables (i.e., on the beat). An MMR to vowel and word changes was elicited only in the rhythmic nursery rhyme condition. The enhancement of the MMR in a rhythmic context is reminiscent of the larger oddball responses to deviants in isochronous than in jittered sequences found in adults (Schwartze et al., 2011). Interestingly, only vowel and word changes elicited an MMR, but not intensity and pitch changes. This may have been due to the collation of responses for all intensity and pitch changes, including intensity increments as well as decrements, which in adults can lead to opposite results (Bouwer & Honing, 2015). However, these results may also underline the importance of context, in this case linguistic, on the processing of acoustic deviants, and they raise the question of whether the processing of linguistic stimuli may be privileged even at birth (Thiede et al., 2019).
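The collation concern can be illustrated with a toy calculation: if increments and decrements evoke deflections of opposite polarity, averaging across the two deviant types largely cancels the response. The waveforms and amplitudes below are made-up numbers for illustration and do not come from the cited studies.

```python
# Toy illustration of why collapsing intensity increments and decrements can
# mask an MMR: responses of opposite polarity average out to near zero.
# All amplitudes (in microvolts) and latencies are invented for illustration.
import numpy as np

t = np.linspace(0, 0.5, 251)  # 0-500 ms post-stimulus, 2 ms steps
mmr_increment = -2.0 * np.exp(-((t - 0.15) ** 2) / 0.002)  # negative deflection
mmr_decrement = +1.8 * np.exp(-((t - 0.15) ** 2) / 0.002)  # positive deflection

collapsed = (mmr_increment + mmr_decrement) / 2  # average over deviant types
print(f"peak increment MMR: {mmr_increment.min():.2f} uV")
print(f"peak decrement MMR: {mmr_decrement.max():.2f} uV")
print(f"peak collapsed MMR: {collapsed.min():.2f} uV")  # much smaller
```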

Two studies have looked at differences in the MMR on and off the beat in newborns. First, Winkler et al. (2009) tested whether newborns can extract a regular beat from a varying rhythmic stimulus, using a paradigm previously used in adults to probe meter perception (Ladinig et al., 2009, 2011). Newborns were presented with a drum pattern in duple meter, in which sounds on the first beat (i.e., the strongest metrical position in the pattern) were occasionally omitted. The response to these omissions was compared with the response to omissions off the beat, and with the response to omissions in a control sequence consisting of patterns in which sounds on the first beat were always omitted. The ERP responses to the omissions on the beat differed significantly from responses to patterns without omissions, omissions off the beat, and omissions in the control sequence. The results were interpreted as evidence that newborns are able to detect a beat. However, the omissions on the beat differed from the omissions off the beat in multiple ways, including differences in acoustic context and differences in the transitional probabilities of the omitted sounds. Therefore, the results of this study may have indexed not just beat perception but also low-level acoustic differences between conditions and sequential learning (Bouwer et al., 2014).

To control for these possible confounds, a subsequent study (Háden et al., 2024) used a paradigm previously used to probe beat perception in human adults (Bouwer et al., 2016) and nonhuman primates (Honing et al., 2018). Newborns were presented with a drum rhythm with alternating accented and unaccented sounds that induces a beat (or duple meter) when presented with isochronous timing, but not when presented with randomly jittered timing (Fig. 2, example 2). Infrequently, softer sounds were introduced as deviants, falling either on or off the beat. Deviants were always preceded and followed by identical sounds, to control for the effects of acoustic context on ERPs (see Fig. 2, example 2). Results showed a clear difference in MMR amplitude between metrical positions in the isochronous sequence, but not in the equivalent jittered sequence (Fig. 3b). However, this paradigm did not show effects of statistical learning (i.e., a difference in responses on and off the beat for the jittered sequences), despite previous evidence that this ability is functional in newborns (Bosseler et al., 2016) and the presence of this effect in adults using the same paradigm (Bouwer et al., 2016). Despite the qualitative differences between an adult MMN and the newborn MMR, these results provide converging evidence that beat processing is present in newborn infants, even when controlling for acoustic context and statistical learning. A simplified sketch of how such sequences can be constructed is given below.
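The sketch contrasts the two timing conditions: alternating accented and unaccented sounds with either isochronous or jittered onsets, and occasional soft deviants on or off the beat. Interval durations, jitter range, and deviant proportion are illustrative assumptions, not the published parameters, and the published controls on the sounds surrounding each deviant are omitted here.

```python
# Sketch of sequence construction for a beat-induction paradigm of this type.
# All numeric values are illustrative, not those of the cited studies.
import numpy as np

rng = np.random.default_rng(0)

def make_sequence(n_events=200, ioi=0.25, jitter=False, p_deviant=0.1):
    """Return event onsets (s), accent pattern, and deviant flags."""
    iois = np.full(n_events, ioi)
    if jitter:
        # Randomly perturb each interval so no regular beat can be induced.
        iois += rng.uniform(-0.08, 0.08, size=n_events)
    onsets = np.cumsum(iois)
    # Every other sound is accented; accented positions fall "on the beat"
    # in the isochronous condition.
    accented = np.arange(n_events) % 2 == 0
    # Infrequent softer sounds serve as deviants, on or off the beat.
    deviant = rng.random(n_events) < p_deviant
    return onsets, accented, deviant

onsets_iso, accented, deviant = make_sequence(jitter=False)  # beat-inducing
onsets_jit, _, _ = make_sequence(jitter=True)                # control: no beat
```

Comparing MMRs to deviants at accented versus unaccented positions in the isochronous sequence, relative to the same contrast in the jittered sequence, is what isolates beat-related processing from acoustic and sequential factors.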

Measuring ERPs in Nonhuman Primates

There is considerable discussion about whether beat perception is species-specific (Fitch, 2015; Ravignani, 2018; Wilson & Cook, 2016). Evidence in support of beat perception in a select number of species comes from experiments that test motor entrainment to a beat through overt behavior (A. D. Patel, 2021). However, if synchronized movement to sound or music is not observed in a species, this is not evidence for the absence of beat perception: it could well be that certain animals are simply unable to synchronize their movements to a varying rhythm, while they can nevertheless perceive a beat. Moreover, with behavioral methods that rely on overt motor responses, it is difficult to separate the contributions of perception and action. Electrophysiological measures, such as ERPs, do not require an overt response and as such provide an attractive alternative for probing beat perception in animals (Honing et al., 2018).

While most animal studies have used implanted electrodes to record electroencephalograms (EEG) (Javitt et al., 1994; Laughlin et al., 1999; Pincze et al., 2001), noninvasive electrophysiological techniques such as scalp-recorded evoked potentials (EPs) and event-related potentials (ERPs) are considered an attractive alternative. Besides being a mandatory requirement for studying some nonhuman primates, such as chimpanzees (Fukushima et al., 2010; Hirata et al., 2013), these methods allow for a direct comparison between human and nonhuman primates. As such, they have contributed to establishing animal models of the human brain and human brain disorders (Gil-da-Costa et al., 2013; Godlove et al., 2011), to a better understanding of the neural mechanisms underlying the generation of human EP/ERP components (Fishman & Steinschneider, 2012), and to delineating cross-species commonalities and differences in brain functions, including rhythm perception and cognition (Fukushima et al., 2010; Hirata et al., 2013; Itoh et al., 2015; Reinhart et al., 2012; Ueno et al., 2008, 2009). The ERP components most relevant for comparative primate studies of rhythm perception are summarized in Table 1.

Table 1 Homologies between rhesus monkey, chimpanzee, and human cortical auditory-evoked potentials (ERPs). Time range in ms; alternative naming in square brackets

Since the discovery of the MMN component, researchers have tried to find analogous processes in animal models (Shiramatsu & Takahashi, 2021; Woodman, 2011) and to integrate deviance detection and predictive processing into a general framework of auditory perception (Näätänen et al., 2010). A wide range of electrophysiological methods, from scalp electrodes to single-cell recordings, have been used in animal models. These methods highlight different phenomena at varying spatial and temporal resolutions. The most important difference is that scalp and epidural recordings may yield components similar to the human MMN (i.e., electric responses generated by large brain areas), whereas local field potential, multiunit activity, and single-cell recordings operate at a lower spatial scale and reflect stimulus-specific adaptation (SSA) (Nelken & Ulanovsky, 2007). SSA has many properties in common with the MMN; both can be observed in similar paradigms, and it is still debated whether SSA reflects the cellular-level activity underlying the MMN. However, this discussion is beyond the scope of the current chapter.

Using epidural recordings, MMN-like responses (from here on referred to as MMR) have been shown in different species, including rats (Nakamura et al., 2011), cats (Csépe et al., 1987; Pincze et al., 2001, 2002), and macaque monkeys (Javitt et al., 1992, 1994). In most of these studies, frequency and amplitude violations were used. In rats, deviance detection was shown both for a temporal feature, sound duration (Nakamura et al., 2011), and for an abstract feature, namely melodic contour (Ruusuvirta et al., 2007). Recordings from scalp electrodes showed an MMR in mice (Umbricht et al., 2005) and in a single chimpanzee (Ueno et al., 2008). While not all attempts at recording an MMR from animals have been successful, it seems that an MMR can be reliably elicited in some animal models (Harms et al., 2016; Schall et al., 2015; Shiramatsu & Takahashi, 2021) and thus can be used to study auditory processing in nonhuman primates.

Using MMR to Probe Beat Perception in Nonhuman Primates

Honing et al. (2012) demonstrated, for the first time, that an MMR can be recorded from the scalp in rhesus monkeys (Macaca mulatta), both for pitch deviants and for unexpected omissions. Ueno et al. (2008) used a similar method in a chimpanzee (Pan troglodytes), and Gil-da-Costa et al. (2013) compared MMR measurements in humans and macaques (Macaca fascicularis). Together, these results support the idea that a mismatch response can be used as an index of the detection of expectancy violations in an auditory signal in both humans and nonhuman primates. A follow-up study, using stimuli and an experimental paradigm identical to those used to study beat perception in human adults (Bouwer et al., 2016) and newborns (Háden et al., 2024), confirmed that rhesus monkeys are sensitive to the isochrony of a rhythmic sequence, but not to its induced beat (Honing et al., 2018). Results from the two monkeys in this study are depicted in Fig. 3c. These findings are in line with the hypothesis that beat perception is at least partly species-specific. Note that while rhesus monkeys continue to be an important animal model for the human brain, and results in monkeys and humans are often compared (Gil-da-Costa et al., 2013), we have to be cautious in directly comparing ERP signals from humans and nonhuman animals because of obvious differences in neural architecture.

Behaviorally, contrary to what was previously thought (Zarco et al., 2009), macaques do show the ability to tap predictively to a metronome and to adjust their tapping tempo to tempo changes in the entraining stimulus, when provided with sufficient feedback and reward (Gámez et al., 2018). In addition, and consistent with these behavioral results, it was shown that during isochronous tapping, the medial premotor cortex in monkeys indexes time intervals in a relative and predictive manner (Betancourt et al., 2023; Gámez et al., 2019). Note, however, that processing isochrony is not the same as beat perception and may be subserved by a different mechanism (Bouwer et al., 2021; Honing et al., 2018). For an overview of time encoding in the primate medial premotor cortex, see Merchant et al., this volume.

Overall, the observed differences between humans and monkeys provide support for the gradual audiomotor evolution (GAE) hypothesis (Honing et al., 2018; Honing & Merchant, 2014; Merchant & Honing, 2014). This hypothesis suggests that beat-based timing is more developed in humans than in apes and monkeys, and that it evolved through a gradual chain of anatomical and functional changes to the interval-based mechanism, generating an additional beat-based mechanism. More specifically, the integration of sensorimotor information throughout the medial cortico-basal ganglia-thalamic (mCBGT) circuit and other brain areas during the perception or execution of single intervals is similar in human and nonhuman primates, but differs for the processing of multiple intervals (Merchant & Honing, 2014). While the mCBGT circuit was shown to be involved in beat-based mechanisms in brain imaging studies as well (e.g., Teki et al., 2011), direct projections from the medial premotor cortex (MPC) to the primary auditory cortex (A1) via the inferior parietal lobe (IPL), which is involved in sensory and cognitive functions such as attention and spatial sense, may underpin beat-based timing as found in humans, and possibly apes (Merchant & Honing, 2014; Proksch et al., 2020).

Probing beat perception and isochrony perception in animals is still in its infancy (Bouwer et al., 2021; Henry et al., 2021; Wilson & Cook, 2016). But it appears, at least within the primate lineage, that beat perception has evolved gradually, peaking in humans and present only in limited form in chimpanzees (Hattori & Tomonaga, 2020), bonobos (Large & Gray, 2015), macaques (Honing et al., 2018), and other nonhuman primates (Raimondi et al., 2023).

While beat perception can be argued to be fundamental to the capacity for music (i.e., musicality; Honing, 2012), it remains difficult to trace this skill in the animal world. Among the few species that have been studied, it appears to be mostly vocal learners that are sensitive to a regular pulse (the beat) in a varying rhythmic stimulus such as music. Seminal examples are a sulphur-crested cockatoo (A. D. Patel et al., 2009) and a gray parrot (Schachner et al., 2009) that were capable of synchronizing to the beat of human music and, importantly, of maintaining synchrony when the same music was played at a different tempo. The observation that this behavior was initially shown only in vocal learning species gave rise to the vocal learning and rhythmic synchronization (VLRS) hypothesis (A. D. Patel, 2006, 2021), which suggests that our ability to move in time with an auditory beat in a precise, predictive, and tempo-flexible manner originated in the neural circuitry for complex vocal learning. This hypothesis is an alternative to the GAE hypothesis discussed earlier.

However, the GAE and VLRS hypotheses differ in several ways (see also Proksch et al., 2020). First, the GAE hypothesis does not claim that the neural circuit engaged in rhythmic entrainment is deeply linked to vocal perception, production, and learning, even if some overlap between the circuits exists. Furthermore, since the cortico-basal ganglia-thalamic (CBGT) circuit has been implicated in beat-based mechanisms in imaging studies, we suggest that the reverberant flow of audiomotor information looping across the anterior prefrontal CBGT circuits may be the underpinning of human rhythmic entrainment. Lastly, the GAE hypothesis suggests that the integration of sensorimotor information throughout the mCBGT circuit and other brain areas during the perception or execution of single intervals is similar in human and nonhuman primates.

In addition, a recent counterexample to the VLRS hypothesis is a California sea lion (Zalophus californianus; not considered a vocal learner) that is able to synchronize head movements to a variety of musical fragments and to generalize this behavior across different tempi (Cook et al., 2013; Rouse et al., 2016). Overall, it seems that perceiving a beat in a complex stimulus (i.e., music) and being able to synchronize to it is not restricted to humans, may well be more widespread than previously thought, and is not restricted to vocal learners per se (Bouwer et al., 2021; ten Cate & Honing, 2023; Wilson & Cook, 2016).

Discussion and Conclusion

In this chapter, we have shown how ERPs can be used to probe the perception of a regular beat in rhythm. Measuring ERPs is relatively straightforward, can be realized in populations that are difficult to study behaviorally (like infants and monkeys), and is a well-researched method. However, several challenges remain, for beat perception research in general and for ERP studies in particular.

First, as we have stressed throughout this chapter, musical rhythm contains many types of structure, including not only temporal structure, both in terms of a regular beat and absolute temporal intervals, but also grouping, ordinal structure, and hierarchy. The beat can be considered the most prominent periodicity in a rhythmic signal (Fiveash et al., 2022), and beat perception has been considered as the ability to flexibly extract a regular temporal structure from rhythm (Penhune & Zatorre, 2019). Such definitions of the beat clearly involve the temporal aspect of rhythm, and specifically the temporal periodicity associated with beat-based perception. For many studies targeting beat perception with ERPs, it is not completely clear whether influences of absolute timing, grouping, ordinal structure, and hierarchical structure can be ruled out, as these structural aspects of rhythm often covary with the temporal regularity that is the beat, and are often even necessary to induce a beat.
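To make the notion of "the most prominent periodicity" concrete, the sketch below estimates a beat period from a symbolic rhythm by autocorrelating its onset train and restricting candidate periods to the 200–2000 ms range in which humans perceive regularity. The rhythm and all parameters are hypothetical, and actual models of beat induction are considerably more sophisticated.

```python
# Simplified sketch: find the most prominent periodicity in an onset train by
# autocorrelation. The example rhythm below is hypothetical.
import numpy as np

fs = 1000  # 1 ms resolution
onsets_ms = [0, 500, 750, 1000, 1500]  # events on the beat plus a subdivision
signal = np.zeros(2000)
signal[onsets_ms] = 1.0

# Autocorrelation at non-negative lags: high values mark recurring intervals.
ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]

lo, hi = 200, 2000                # plausible beat periods in ms (London, 2002)
hi = min(hi, len(ac) - 1)
beat_period = lo + int(np.argmax(ac[lo:hi]))
print(f"most prominent periodicity: {beat_period} ms")  # 500 ms here
```

Note that such a calculation captures only the temporal regularity of a beat; grouping, ordinal structure, and hierarchy, which typically covary with it, are exactly what this kind of measure cannot separate out.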

Related to this, some have suggested that the perception of hierarchical metrical structure is different from the perception of a beat or pulse as temporal regularity (Fitch, 2013; Silva & Castro, 2019). The idea that meter processing is indeed more about hierarchical structure, or the alternation of stressed and unstressed events, than about temporal regularity is in line with models of meter in language, where the meter does not necessarily adhere to a temporal regularity. In language, learning the alternation of stressed (i.e., salient) and unstressed sounds is vital to processing (Henrich et al., 2014; Henrich & Scharinger, 2022; Magne et al., 2016), and the hierarchical structure that arises from such nontemporal structure is often termed meter. This is, however, at odds with models of beat perception that consider beat and meter to be interrelated, with meter perception relying on similar (oscillatory) mechanisms as beat perception (i.e., meter in this interpretation is just another level of regularity within a structure of multiple nested levels of regularity) (Drake et al., 2000; Large, 2008). The relationship between the different aspects of rhythm perception, and specifically the relationship between beat perception and hierarchy perception, remains an important topic for future studies.

One disadvantage of using ERPs to study beat perception is that what is probed is not the mechanism of beat perception itself, but rather the effect a perceived beat has on the sensory processing of incoming information, be it expected or unexpected tones. Combining results from ERP studies with results from studies that directly probe the underlying mechanisms of beat perception, for example, by examining low-frequency neural oscillations in response to rhythm (Lenc et al., 2021), will provide more insight in this regard (the logic of such a frequency-based analysis is sketched below). Also, the studies discussed in this chapter mostly deal with purely perceptual effects of beat perception. While some studies have used ERPs to study motor synchronization to a beat (Andrea-Penna et al., 2020; Conradi et al., 2016; Lei et al., 2019; Mathias et al., 2020; Schwartze & Kotz, 2015), given the tight coupling between beat perception and movement, this remains an interesting topic for future work. Ultimately, combining different methods and paradigms will allow us to build a more coherent picture of the perception of beat and meter, and to address its apparent innateness as well as its domain and species specificity. All in all, this research will contribute to a better understanding of the fundamental role that beat and meter perception play in music.
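As a closing illustration of the frequency-based alternative mentioned above, the sketch below computes an EEG amplitude spectrum and compares the amplitude at an assumed beat frequency with that at neighboring frequency bins. It runs on simulated data with invented parameters and is a schematic of the general approach, not a replication of any cited analysis.

```python
# Schematic frequency-tagging analysis: is there elevated EEG amplitude at the
# beat frequency? Simulated data; all parameters are illustrative assumptions.
import numpy as np

fs = 500.0                 # sampling rate (Hz)
dur = 60.0                 # one minute of (simulated) EEG
t = np.arange(0, dur, 1 / fs)
beat_freq = 2.0            # assumed beat at 2 Hz (120 BPM)

# Simulated signal: a weak 2 Hz component buried in broadband noise.
noise = np.random.default_rng(1).normal(0, 5, t.size)
eeg = 0.5 * np.sin(2 * np.pi * beat_freq * t) + noise

spectrum = np.abs(np.fft.rfft(eeg)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

idx = np.argmin(np.abs(freqs - beat_freq))  # bin closest to the beat frequency
# Noise estimate: mean amplitude at surrounding bins, excluding the target.
neighbors = np.r_[spectrum[idx - 10:idx - 2], spectrum[idx + 3:idx + 11]]
snr = spectrum[idx] / neighbors.mean()
print(f"amplitude at {beat_freq} Hz: {spectrum[idx]:.3f}, SNR vs neighbors: {snr:.1f}")
```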