1 Presumed Dissociation of Music and Language

Music is among the most ancient forms of art in human history (Masataka, 2010). The universality of music suggests that it has served some functional role in increasing the survival odds of humankind. However, the specific survival advantages conferred by music remain elusive. Existing theoretical and empirical studies show that music enhances group cohesion and aids emotional regulation (Tarr et al., 2014). In contrast, some theorists see no survival benefit in music, a view epitomized by Steven Pinker’s famous characterization of music as “auditory cheesecake” (Pinker, 1997).

There has been a long-standing debate over whether music and language processing share the same processing stages and recruit the same neural regions. Recent studies challenge a simple dichotomy between music and language processing by showing activation of overlapping neural regions by music and language (Koelsch et al., 2002; Yu et al., 2017; but see Peretz et al., 2015) and impaired processing of linguistic prosody in amusic patients (Thompson et al., 2012). Thus, linguistic ability is not as encapsulated as presumed and shares part of its processing stages with music.

Interestingly, it has been reported that long-term musical training enhances language processing at various processing stages (Zhao & Kuhl, 2016), including fundamental frequency (f0) extraction at the brainstem (Wong et al., 2007). Therefore, it seems only natural to postulate that the machinery recruited in music processing facilitated the evolution of language and aids language acquisition in human infants (Masataka, 2008).

In this chapter, we first discuss the commonalities in the processing stages recruited by music and language. Among the perceptual functions presumably recruited in both the musical and linguistic domains, a substantial amount of knowledge has accumulated about the perception of frequency structure and rhythmic pattern. Thus, in the second section of this chapter, we provide an overview of empirical data on the developmental course of these perceptual functions in the musical and linguistic domains in human infants. In the last section, we propose a hypothesis regarding the evolutionary roots of the musicality that is utilized in language acquisition by human infants today.

2 Commonality Between Music and Language

Language can transmit messages that are far more semantically complex than music. However, the sound stream of language shares many acoustic characteristics with musical sounds. First, vowel sounds have spectral structures similar to those of musical chords. When vowels are pronounced, the power at different frequencies is modified by resonance in the vocal tract, and consequently, some frequencies become prominent in the vocal sound. Such peaks in spectral power within a few frequency ranges, called formants, define the perceived vowel category. Thus, vowel categorization requires analysis of the relative relationship between formant frequencies. A similar ability of relative pitch perception is indispensable in music perception because the perceptual quality of a musical chord is likewise determined by the relationship between the pitches of the notes played simultaneously.
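
To make this concrete, the following minimal Python sketch weights the harmonics of a speaker’s f0 with two formant envelopes and recovers the formants as the strongest spectral peaks. It is illustrative only: the formant centres of roughly 700 and 1200 Hz are textbook-style values for an /a/-like vowel, not values taken from the studies cited here.

```python
import numpy as np

f0 = 120.0                                     # speaker's fundamental (Hz)
harmonics = f0 * np.arange(1, 40)              # harmonic frequencies (Hz)

def formant_envelope(f, centre, bandwidth):
    """Gaussian weighting standing in for a vocal-tract resonance."""
    return np.exp(-0.5 * ((f - centre) / bandwidth) ** 2)

# Two illustrative formants (roughly /a/-like): F1 ~ 700 Hz, F2 ~ 1200 Hz.
amp = (formant_envelope(harmonics, 700.0, 90.0) +
       formant_envelope(harmonics, 1200.0, 100.0))

# The two strongest spectral regions sit near the formant centres; their
# relationship, not the absolute harmonic frequencies, identifies the vowel.
strongest = np.sort(harmonics[np.argsort(amp)[-2:]])
print(strongest)    # [ 720. 1200.]
```

A different speaker (a different f0) shifts which harmonics carry the peaks, but the relationship between the formant regions, and hence the vowel, stays the same.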

Second, both music and language are characterized by hierarchical structure (Jackendoff & Lerdahl, 2006). Language is governed by grammatical rules that define the ordering of words in a hierarchical manner. Similarly, identical sound sequences are played repeatedly within a musical score and are grouped hierarchically into larger motifs. Thus, perceptual grouping, together with statistical analysis of repeated sequences, is valuable for analysing the global structure underlying the sound stream in both music and language.

Third, pitch contour, the temporal course of pitch change, is essential in processing both musical and linguistic sounds. Pitch contour creates a musical melody that induces vivid emotional reactions and even conveys semantic information (Koelsch et al., 2004). Likewise, linguistic prosody plays an important role in conveying emotional and semantic information (Doi et al., 2013). The human linguistic system interprets identical sentences differently depending on their pitch contour, as in the case of pitch change at the end of declarative sentences versus yes-no questions. Further, pitch contour sometimes marks a structurally important location, e.g., a phrasal boundary, in the linguistic stream.

Fourth, both music and language have a culturally transmitted rhythmic structure. Folk music in several regions has unique metric structures that differ from those seen in other regions of the world. Likewise, some researchers claim that languages can be grouped into families according to their rhythmic structure (Nazzi & Ramus, 2003). Culture-specific patterns of auditory rhythm exert such strong influences on the perceptual system through postnatal exposure that adults have difficulty detecting slight changes in, and reproducing, rhythmic patterns with culturally unfamiliar metric structures (Hannon & Trehub, 2005a; Collier & Wright, 1995).

3 Contributions of Musicality to Human Language Acquisition

Given the close similarity between music and language described above, neural mechanisms recruited in musical processing could help in processing linguistic materials as well. Infants at the prelinguistic stage are faced with the task of analysing sound streams and grasping their underlying structure without any prior knowledge (but see Chomsky, 1965). The primary tenet of the prosodic bootstrapping hypothesis of language acquisition is that prosodic information, such as lexical rhythm and melodic contour in the linguistic sound stream, scaffolds infants’ analysis of the mother tongue. Prosodic cues are loosely related to grammatical structure. Therefore, prosodic information, a collection of the musical properties of language, can help infants learn their mother tongue.

Studies on auditory perception in the fetus raise the appealing possibility that prosodic bootstrapping of language acquisition starts during the prenatal period. Fetuses are exposed to environmental sounds in the womb. However, the abdominal wall filters out the high-frequency components of external sounds, and consequently, phonemic information is almost lost. The salient information remaining in the filtered sound heard in the womb is mainly the prosodic pattern of language. In DeCasper et al. (1994), mothers read a target rhyme aloud each day for 4 weeks during pregnancy. The researchers then measured fetal heart rate while the fetus heard the target rhyme and an unfamiliar rhyme. Heart rate decelerated more markedly during the familiar rhyme, indicating that fetuses can learn the prosodic pattern of their mother’s speech.

In this section, we discuss how musical functions could assist language acquisition by reviewing empirical evidence on the developmental course of these abilities. We focus especially on the perception of rhythm and frequency structure and examine whether there are any parallels between the developmental courses of these abilities in the musical and linguistic domains.

3.1 Perception of Frequency Structure

3.1.1 Musical Chord and Formant Perception

In musical chord perception, the auditory system must analyse the relationship among the f0 of simultaneously played musical notes. A one-semitone difference in a single note of a chord changes its perceived quality, from major to minor or vice versa. Further, adherence to certain mathematical rules, roughly, simple integer ratios among note frequencies, creates a consonant chord, while violation of them results in a dissonant one.
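
As a worked illustration of these relationships, the sketch below uses the standard twelve-tone equal temperament formula f = f_root · 2^(n/12); the C4 root is an arbitrary choice, and none of the values come from the studies cited here.

```python
import numpy as np

def equal_tempered(root_hz, semitones):
    """Frequencies of notes `semitones` above the root (12-TET)."""
    return root_hz * 2.0 ** (np.asarray(semitones) / 12.0)

root = 261.63                              # C4, an arbitrary root
major = equal_tempered(root, [0, 4, 7])    # C-E-G
minor = equal_tempered(root, [0, 3, 7])    # C-Eb-G: middle note one semitone down

print(np.round(major, 1))                  # [261.6 329.6 392. ]
print(np.round(minor, 1))                  # [261.6 311.1 392. ]

# Consonance tracks simple integer ratios: a just-intoned major triad is
# 4:5:6, and the equal-tempered version comes close to it.
print(np.round(major / major[0], 3))       # [1.    1.26  1.498] ~ [1, 1.25, 1.5]
```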

There has been a debate over the innateness of chord processing ability, especially whether the preference for consonance over dissonance is acquired through cultural assimilation. Empirical studies generally favour the innateness of consonant chord preference (Masataka, 2006). A functional magnetic resonance imaging (fMRI) study revealed differential activation of the newborn’s brain in response to consonant and dissonant chords (Perani et al., 2010). An event-related potential (ERP) study indicated that newborns could discriminate between consonant and dissonant chords as well as between major and minor chords (Virtala et al., 2013). These findings indicate that the infant brain is innately endowed with the capacity to analyse the pitch relationship of multiple sounds played simultaneously.

The formant structure of vowel sounds bears some resemblance to the spectral structure of musical chords. In vowel categorization, the auditory system must analyse the peak frequencies of at least three formants simultaneously. Since f0 and vocal-tract shape differ with gender, age, and height, the formant frequencies of identical vowel sounds differ among individuals. Despite this, identical vowels uttered by different individuals are perceived as such, a phenomenon called speaker invariance. Thus, what counts in vowel perception is the analysis of the relationship among formant frequencies rather than the absolute value of each formant frequency.

Behavioural studies on infant vowel categorization have shown that even neonates can discriminate between different vowel sounds (Cheour-Luhtanen et al., 1995). Vowel categorization is tuned by postnatal exposure to the mother tongue. Interestingly, some studies even argue that the tuning process starts in the womb (Moon et al., 2013). Therefore, as in the case of musical chord perception, the nascent ability of formant perception functions from the very initial stage of development.

3.1.2 Melody and Pitch Contour Perception

Melodic contour is created by the temporal sequence of the f0 of musical notes. A melody sounds the same when its notes are transposed or played in a different key (Mottron et al., 2009). This illustrates that the primary determinant of musical melody is not the absolute value of pitch but the contour of the temporal sequence of relative pitches that unfolds as the music is played.
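
The following minimal sketch illustrates this point: representing a melody as semitone (MIDI-style) pitch numbers, its contour, the signs of successive pitch changes, is unchanged by transposition. The five-note tune is hypothetical.

```python
import numpy as np

melody = np.array([60, 62, 64, 62, 60])    # a hypothetical five-note tune
transposed = melody + 5                    # same tune, a perfect fourth higher

def contour(pitches):
    """Sign of successive pitch changes: +1 up, -1 down, 0 repeat."""
    return np.sign(np.diff(pitches))

print(contour(melody))        # [ 1  1 -1 -1]
print(contour(transposed))    # identical: contour survives transposition
print(np.diff(melody))        # the exact intervals are unchanged as well
```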

Studies on the early development of melodic perception have shown that infants as young as 5–10 months old can detect violations of melodic contour irrespective of transposition (Trehub & Hannon, 2006). Plantinga and Trainor (2009) showed that even 2-month-olds could discriminate melodies of different songs, although the musical materials used in this study were not well controlled for low-level auditory features.

It is widely acknowledged that infants use pitch contour as a cue to analyse grammatical structure. One example is the use of lexical stress in word segmentation (Jusczyk, 1999). Lexical stress is characterized by high pitch, long duration, large amplitude, and vowel quality. Among these acoustic cues, the human auditory system is quite sensitive to pitch change; humans can discriminate the lexical stress patterns (trochaic or iambic) of nonsense syllables based on the pitch cue alone (Hoeschele & Fitch, 2016). Infants as young as 8 months can use lexical stress cues for segmenting sound streams into words (Jusczyk, 1999). In English-speaking environments, infants usually segment sound streams into units with strong-weak (trochaic) stress patterns. However, the 7-month-olds in Thiessen and Saffran’s (2007) study switched from this stress-based strategy and started extracting words with weak-strong (iambic) stress patterns after repeated exposure to word sequences with the iambic stress pattern. Thus, infants can flexibly modify their stress-based word-segmentation strategy through experience.

Another well-studied example is infants’ use of pitch contour information as a cue for boundary detection. In speech, boundaries between phrases and clauses are often marked by pitch change and lengthened syllables followed by a brief pause, a pattern observed cross-linguistically. Infants as young as 6 months have been shown to rely heavily on prosodic patterns for boundary detection (Seidl, 2007). Wellmann et al. (2012) investigated which acoustic characteristics 8-month-old infants rely on in boundary detection. Neither pitch change nor long duration alone was sufficient for infants to detect a phrase boundary. However, the combination of long duration and pitch change enabled infants to find boundaries without pause cues. Similar results have been obtained in 6-month-olds as well (Seidl, 2007).

These studies indicate the primary importance of pitch contour in enabling human infants to segment linguistic sound into grammatical units. Interestingly, they reveal that the ability to use pitch contour in sound segmentation emerges around 6–10 months (Thiessen & Saffran, 2007; Seidl, 2007; Wellmann et al., 2012), which roughly corresponds to the age at which infants acquire the ability to process melodic contour (Trehub & Hannon, 2006). Comparisons of the developmental courses of pitch-change perception in musical and linguistic materials also support a domain-general pattern of development (Chen et al., 2017). Such coincidence alone should not be deemed definitive evidence, but it is quite conceivable that the maturation of identical neural mechanisms underlies the development of the ability to process pitch contour in both musical and linguistic materials.

Infants’ sensitivity to pitch contour is effectively exploited in parental vocalizations directed towards infants. When talking to infants, adults modify their manner of speech so that it differs from speech used with adults (Kuhl, 2007; Masataka, 2003). Such infant-directed speech is mainly characterized by high pitch and exaggerated intonation (Doi, 2020). A high-pitched voice is effective in grabbing an infant’s attention, possibly due to its emotional connotation (Corbeil et al., 2013). Likewise, exaggerated intonation makes it easier for infants to extract pitch contour, which assists them in word segmentation and boundary detection. Thus, the domain-general ability of relative pitch perception and the social input provided in infant-directed speech interactively scaffold infants’ language acquisition (Doi, 2020; Sulpizio et al., 2018).

3.2 Rhythm Perception

Rhythm perception can be both objective and subjective (Iversen et al., 2009). A strong beat is often marked by large sound amplitude. At the same time, people sometimes perceive strong and weak beats in the repetition of monotonous sounds without any acoustic markers.

Rhythmicity in synchronized bodily movement is observed as early as the neonatal stage and is thought to reflect the activity of a central pattern generator. As for the perception of musical rhythm, an ERP study by Winkler et al. (2009) showed evidence of beat perception in neonates. In their study, neonates were exposed to a sequence of percussion sounds with a hierarchical metric structure. They were repeatedly presented with standard sounds in which a sound was omitted at a weak-beat location. Within the sequence of standard sounds, the target sound, in which a sound was omitted at the strong-beat (downbeat) location, was presented infrequently. The researchers focused on a newborn homologue of mismatch negativity (MMN). MMN is an ERP component usually elicited by a deviant auditory sound presented infrequently within a sequence of standard sounds, and it is elicited even when the subject is not attending to the sounds. Thus, MMN is accepted as a reliable indicator of the ability to discriminate deviant from standard sounds at the pre-attentive perceptual stage. The main finding of Winkler et al. (2009) was that the target sound elicited MMN in neonates, indicating that even neonates can discriminate weak and strong beats.
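
The logic of this stimulus design can be sketched as follows. This is a schematic reconstruction, not the actual stimuli of Winkler et al. (2009): the four-position bar, the choice of omitted weak-beat position, and the roughly 10% deviant rate are illustrative assumptions.

```python
import random

# Positions 0-3 form one bar; position 0 is the downbeat (strong beat).
def standard_bar():
    bar = [1, 1, 1, 1]    # 1 = percussion sound present at this position
    bar[2] = 0            # standard: a sound is omitted at a weak beat
    return bar

def deviant_bar():
    bar = [1, 1, 1, 1]
    bar[0] = 0            # deviant: the downbeat itself is omitted
    return bar

# Rare deviants (~10%) embedded in a stream of standards: the usual
# oddball arrangement for eliciting MMN.
stream = [deviant_bar() if random.random() < 0.1 else standard_bar()
          for _ in range(200)]
print(stream[:3])
```

Only a listener who represents the metric hierarchy treats the downbeat omission as deviant, since both bar types contain exactly one omitted sound.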

The innate ability of beat perception prepares infants to process rhythmic structure in incoming auditory information. This ability is shaped further by postnatal exposure to environmental sounds. Folk music of Eastern European countries such as Bulgaria and Macedonia has complex metrical structures that differ from the simple metric structure of Western music. Twelve-month-olds reared in the United States, who probably had almost no experience of hearing music with complex metres, could not detect slight changes in the complex metres of Eastern European folk music (Hannon & Trehub, 2005b), while 6-month-old infants could (Hannon & Trehub, 2005a). In phoneme perception, the neural system is plastically tuned through postnatal experience to process the phoneme categories of the mother tongue, while losing the ability to discriminate phoneme categories that do not exist in it (Kuhl et al., 2011). Likewise, infants lose their ability to discriminate unfamiliar types of faces, e.g. faces of different species and unfamiliar races (Pascalis et al., 2002). The studies by Hannon and Trehub (2005a, b) indicate that a similar process of perceptual tuning is at work in the development of musical rhythm; infants lose their ability to process unfamiliar rhythmic structures through postnatal exposure to their musical culture.

Linguists have raised the possibility that languages can be classified into several families according to their rhythmic structure, e.g. stress-timed and syllable-timed languages (Nazzi & Ramus, 2003). In stress-timed languages, such as English, the timing of successive stressed locations in utterances is kept fairly constant, while in syllable-timed languages, such as French and Italian, syllables are uttered with constant timing. Nazzi et al. (1998) tested whether neonates could discriminate unfamiliar, low-pass filtered foreign languages. Low-pass filtering eliminates the high-frequency components of linguistic sound. Consequently, phonemic information is almost lost in low-pass filtered language, which makes it impossible for infants to use phoneme distributions as a clue for discriminating two languages. Interestingly, the neonates in this study could discriminate unfamiliar foreign languages from different rhythmic families but not from the same rhythmic family. Thus, infants can detect rhythmic structure in language as well as in music from the neonatal stage.
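
The kind of filtering involved can be sketched in a few lines. This is a minimal illustration; the 400 Hz cutoff is an assumed, typical value for such stimuli rather than the one used by Nazzi et al. (1998), and the toy signal stands in for real speech.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(x, fs, cutoff_hz=400.0, order=4):
    """Zero-phase Butterworth low-pass; the cutoff is an assumed value."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)

fs = 16000
t = np.arange(fs) / fs
# Toy "speech": a 200 Hz prosodic (f0) component plus a 1500 Hz
# formant-range component carrying phonemic detail.
speech = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

filtered = lowpass(speech, fs)
# The 200 Hz component survives almost intact while the 1500 Hz component
# is strongly attenuated: prosody passes through, phonemic detail does not.
```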

Nazzi and Ramus (2003) proposed the rhythm bootstrapping hypothesis, according to which human infants rely on the rhythmic structure of language as a clue for segmenting grammatical units. Adults who speak syllable-timed languages use segmentation strategies in online language processing that differ from those of individuals whose mother tongue is a stress-timed language (Cutler et al., 1986). Therefore, speakers of both syllable- and stress-timed languages adopt a word segmentation strategy suited to their mother tongue. Lexical stress often signals the onset of a word in English, but stress is less likely to mark a word boundary in syllable-timed languages. Such language-specificity in word-segmentation strategies presents infants with the problem of deciding which acoustic cue to rely on in word segmentation. The rhythm bootstrapping hypothesis suggests that the rhythmic structure of language gives infants a clue to discover the most efficient word segmentation strategy, syllable- or stress-based.

Most evidence for rhythm bootstrapping comes from studies of infants in English-speaking environments. The aforementioned studies of infants learning English show that infants as young as 8 months segment words by lexical stress (Jusczyk, 1999; Thiessen & Saffran, 2007). Further, when the cues of transition probability and lexical stress are incongruent, 9-month-old infants treat lexical stress as the primary cue over transition probability in word segmentation (Thiessen & Saffran, 2007). As for syllable-timed languages, Nazzi et al. (2006) reported that French infants segmented words using a syllable-based strategy. These results, together with the early emergence of the domain-general ability to perceive rhythmic structure (Nazzi et al., 1998; Winkler et al., 2009), provide partial support for the rhythm bootstrapping hypothesis. It remains elusive why human infants, who are innately endowed with the ability of rhythm processing (Winkler et al., 2009), do not show evidence of language-specific segmentation strategies until around 8 months. One possible reason is that infants do not learn to associate perceived rhythmic structure with a word segmentation strategy until around this age.
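
The transition-probability cue mentioned above can be made concrete with a small sketch in the style of statistical-learning experiments; the two-syllable nonsense words are hypothetical, and TP(x→y) is simply the probability that syllable y follows syllable x in the stream.

```python
import random
from collections import Counter

# Hypothetical two-syllable nonsense words concatenated in random order.
words = [("bi", "da"), ("ku", "po"), ("tu", "mi")]
stream = [syll for _ in range(300) for syll in random.choice(words)]

pairs = Counter(zip(stream, stream[1:]))
firsts = Counter(stream[:-1])

def tp(x, y):
    """Transition probability P(next = y | current = x) in the stream."""
    return pairs[(x, y)] / firsts[x]

print(tp("bi", "da"))    # 1.0: within-word transitions are fully predictable
print(tp("da", "ku"))    # ~0.33: TP dips at word boundaries
```

Troughs in transition probability mark likely word boundaries; the Thiessen and Saffran (2007) finding is that, when this statistical cue conflicts with lexical stress, 9-month-olds follow the stress cue.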

4 Evolutionary Roots of Musicality and Its Relationship with Language

Perceptual systems recruited in music processing can be used to analyse the linguistic stream and grasp its grammatical structure. The parallel development of corresponding functions in the musical and linguistic domains points to the possibility that the maturation of domain-general functions recruited in both music and language processing assists language acquisition in human infants. This line of reasoning further raises the possibility that the evolution of musical functions prepared the basis for the evolution of language in humans.

Burgeoning abilities, or precursors, of music processing can be seen in birds, rodents, and non-human primates as well (Doi, 2020). However, their musicality does not match that of humans in its refinement. Considering this, the gap in musicality between humans and non-human species might constitute part of the reason why only humans have a sophisticated capacity for language processing and speech communication. This section reviews existing findings on musical abilities in non-human species and discusses the evolution of language from the perspective of the phylogenetic roots of musicality.

4.1 Frequency Structure Perception

4.1.1 Analysis of Musical Chords and Vowel Formants

Chord perception and vowel categorization require the ability to grasp the relationships of peak frequencies. Interestingly, behavioural and electrophysiological studies have revealed close similarity in musical chord perception between non-human primates and humans (Izumi, 2000). For example, Fishman et al. (2001) measured electrophysiological responses in neurons of the primary auditory region in macaque monkeys and humans to dissonant and consonant chords. This study revealed that neurons in homologous regions in macaque monkeys’ and humans’ brains represent dissonance levels of musical chords.

Instrumental and vocal sounds contain harmonics of f0; here, a harmonic means a sound whose frequency is an integer multiple of f0. When a sound composed of harmonics of the same f0 is presented, one perceives a sound with that f0, even when the sound lacks a spectral peak at f0. This phenomenon, called missing fundamental, is deemed an expression of the superb ability of the human auditory system to analyse harmonic structure, and human infants as young as 3 months old show signs of missing fundamental perception (He & Trainor, 2009). Bendor and Wang (2005) measured the activation of frequency-sensitive neurons, neurons that are activated by sounds of specific frequencies, in the auditory cortex of marmosets. They found a set of neurons that were activated both by a pure tone at f0 and by a sound composed of harmonics of f0 without a spectral peak at f0. These findings indicate that marmosets possess the ability to analyse the fundamental frequency of complex harmonic sounds. Behavioural studies have revealed the ability to perceive missing fundamentals in other species as well (Cynx & Shapiro, 1986; Heffner & Whitfield, 1976).
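
A minimal sketch of the phenomenon: a tone built only from harmonics 2–5 of a 200 Hz fundamental contains no energy at 200 Hz, yet its waveform still repeats at the period of f0, which a simple autocorrelation analysis (one standard model of periodicity pitch, not the analysis used in the cited studies) recovers.

```python
import numpy as np

fs, f0 = 16000, 200.0
t = np.arange(int(0.1 * fs)) / fs

# Harmonics 2-5 of f0 only: the spectrum has no energy at 200 Hz itself.
tone = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 6))

# Yet the waveform repeats every 1/f0 seconds, and autocorrelation finds
# that period, recovering the "missing" fundamental.
ac = np.correlate(tone, tone, mode="full")[len(tone) - 1:]
search_from = int(fs / 400)                  # skip the trivial zero-lag peak
lag = np.argmax(ac[search_from:]) + search_from
print(fs / lag)    # ~200.0 Hz
```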

These electrophysiological and behavioural studies (Bendor & Wang, 2005; Fishman et al., 2001) indicate that the basic functions for analysing the relationships of peak frequencies are phylogenetically old. Considering the spectral similarity between vowel formants and musical chords, it is possible that non-human species also possess the basic ability of vowel categorization. Direct evidence for vowel categorization has been obtained in several species (Hienz et al., 1981, 1996; Ohms et al., 2010). Among these species, songbirds show the most prominent resemblance to humans in their ability to categorize phonemes. Ohms et al. (2010) found that zebra finches could learn to discriminate vowels and generalize this discrimination to vowel sounds uttered by opposite-sex speakers despite the difference in the absolute height of formant frequencies. Thus, the findings of Ohms et al. (2010) support the view that zebra finches can analyse the relative relationship of formant frequencies in a manner closely similar to that of humans.

4.1.2 Analysis of Pitch Contour

In ecological settings, animals transmit many messages by modifying their vocalizations either voluntarily or involuntarily. A well-known example is the innate association between arousal level and high-pitched voice. In a highly aroused state, vocal folds vibrate at a higher frequency, which generates voices with higher f0 in many mammalian species including humans (Bachorowski, 1999; Filippi et al., 2019; Kamiloğlu et al., 2020). In addition to the absolute height of f0, a substantial number of studies have revealed that context-dependent messages and emotional states are encoded in the pitch contour in animal vocalizations (Briefer, 2012; Filippi et al., 2019). Therefore, the ability to analyse pitch contour must have been essential for survival.

In the perception of pitch contour, the neural system must extract the temporal course of pitch change irrespective of absolute pitch. The independence of absolute pitch and pitch contour perception is well illustrated by a phenomenon called octave generalization: melodies sound the same when transposed by an octave. The absolute pitches of all musical notes change after transposition. Despite this, the human auditory system perceives an identical melody after transposition, which indicates strong reliance on pitch contour, or relative pitch change, in melody processing (Mottron et al., 2009). A study by Wright et al. (2000) tested octave generalization in rhesus monkeys. Their main finding was that the monkeys showed signs of octave generalization; the monkeys perceived an identical tune played in different octaves to be the same when tonal music was used as the musical material. Thus, similar to humans, rhesus monkeys also rely heavily on pitch contour rather than absolute pitch in perceiving sound sequences.

In songbirds, Spierings and ten Cate (2014) revealed that zebra finches weight prosodic patterns more heavily than structural cues in discriminating multisyllabic sequences. Among the three stress cues manipulated, i.e., pitch contour, amplitude, and duration, pitch was the most salient to the songbirds, indicating a basic ability to discriminate rising and falling pitch contours in this species. The same group also showed that zebra finches could learn to discriminate trochaic and iambic stress patterns by the pitch cue alone, but not by duration or amplitude cues (Spierings et al., 2017), confirming the prominence of pitch contour in stress detection in zebra finches.

The number of laboratory studies reporting pitch contour perception in non-human species is relatively small compared to studies on musical chord and formant perception (Fishman et al., 2001; Hienz et al., 1981, 1996; Izumi, 2000; Ohms et al., 2010). This could be because the analysis of pitch contour is actually more difficult for non-human species than chord and formant perception; analysing the temporal course of pitch change requires additional functions, such as working memory, beyond the analysis of pitch relationships among simultaneously played sounds. At the same time, the prevalent use of pitch contour in the wild (Briefer, 2012; Filippi et al., 2019) suggests that a lack of ecological validity in laboratory settings might have prevented researchers from finding signs of pitch contour perception in the laboratory (see Hoeschele et al., 2014, for a similar discussion).

4.2 Rhythm Perception

Rhythm perception is closely linked to motoric functions. Indeed, human fMRI studies have revealed activation of motoric regions, such as the premotor area and basal ganglia, during rhythm and beat perception (Grahn & Brett, 2007). When listening to music in the environment, humans spontaneously make bodily movements in tune with the perceived beat. Spontaneous entrainment to rhythm is ubiquitous in humans but relatively rare among non-human species (Patel et al., 2009; Schachner et al., 2009). Even in those species showing signs of rhythmic entrainment, the temporal precision of their movement is far lower than that of humans (Hattori & Tomonaga, 2020; Patel et al., 2009).

Lack of spontaneous entrainment to rhythm does not necessarily imply a lack of ability to produce and perceive rhythmic patterns. The production of rhythmic patterns is often tested with the synchronization-continuation task (SCT). In the SCT, an animal is first required to make bodily movements, usually tapping movements, in synchrony with an auditory or visual stimulus appearing at constant intervals (synchronization phase). In the continuation phase, the external stimulation is eliminated and the subject must continue making the movements at the same pace as in the synchronization phase.

An electrophysiological study using the SCT paradigm found subgroups of neurons representing action timing in the medial premotor cortex of rhesus monkeys (Merchant et al., 2011). The firing rate of one group of neurons changed according to the time elapsed since the last action, while the other group represented the time remaining until the next action. Thus, there are several systems for timekeeping in the rhesus monkey brain, enabling both externally and internally paced actions.

However, the ability to make rhythmic movements in rhesus monkeys is not the same as in humans. Zarco et al. (2009) compared behaviour in the SCT between rhesus monkeys and humans. After training, rhesus monkeys learned to make paced movements, but detailed analysis revealed substantial differences in the pattern of action timing between the two species. First, humans acted synchronously with or slightly ahead of the external stimulation, indicating anticipatory preparation of timed action. In contrast, the action timing of rhesus monkeys lagged behind the external stimulation, though it was faster than in a serial reaction time task. Second, in rhesus monkeys the variance of action timing at long intervals was drastically larger in the continuation phase than in the synchronization phase, whereas no such trend was found in humans. These findings raise the possibility that the mechanism for generating internally timed movements in rhesus monkeys is qualitatively different from that in humans.
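
The two measures behind these comparisons can be made explicit with a small sketch; the tap times here are simulated, with means and spreads that are assumptions chosen only to mimic the direction of the reported effects, not values from Zarco et al. (2009).

```python
import numpy as np

rng = np.random.default_rng(0)
stim = np.arange(10) * 0.5                     # pacing stimulus every 500 ms

# Simulated tap times (s) for the synchronization phase.
taps_human = stim + rng.normal(-0.03, 0.02, stim.size)    # slightly ahead
taps_monkey = stim + rng.normal(0.25, 0.04, stim.size)    # lagging behind

def mean_asynchrony(taps, stim):
    """Mean tap-minus-stimulus asynchrony: negative = anticipatory."""
    return float(np.mean(taps - stim))

def interval_variability(taps):
    """SD of produced inter-tap intervals (the continuation-phase measure)."""
    return float(np.std(np.diff(taps)))

print(mean_asynchrony(taps_human, stim))     # < 0: anticipatory, human-like
print(mean_asynchrony(taps_monkey, stim))    # > 0: reactive, monkey-like
```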

Regarding the purely perceptual aspect of rhythm processing, Honing et al. (2012) measured the MMN homologue elicited by deviant stimuli in rhesus macaques using the same paradigm as Winkler et al. (2009). As explained above, a deviant sound sequence in which the sound is omitted at the downbeat location elicits MMN in human newborns. However, the same stimuli did not reliably elicit MMN-like responses in rhesus macaques, indicating a lack of beat perception in this species.

Though several avian species show rhythmic patterns in their vocalizations, the results on these species’ ability to perceive rhythmic structure are mixed (ten Cate & Spierings, 2019). European starlings are reported to be capable of discriminating rhythmic from arrhythmic patterns (Hulse et al., 1984). Pigeons are reported to discriminate different metric structures only under severely limited conditions (Hagmann & Cook, 2010).

5 Evolutionary Roots of Language from the Perspectives of Musicality

Many species, including songbirds and non-human primates, show human-like abilities of frequency structure perception. Some species can apply this ability to materials taken from human language (Hoeschele & Fitch, 2016; Ohms et al., 2010; Spierings & ten Cate, 2014). Further, cotton-top tamarins and rats are reported to be capable of discriminating unfamiliar languages based on prosodic cues (Ramus et al., 2000; Toro et al., 2003). Considering these findings, it seems likely that phylogenetically old functions are recruited, in a domain-general manner, in both music processing and the prosodic bootstrapping of language acquisition in human infants.

In contrast to the cross-species prevalence of frequency structure perception (Briefer, 2012; Filippi et al., 2019; Fishman et al., 2001; Wright et al., 2000), few species show the ability to perceive and produce rhythmic patterns. To obtain a full picture of the evolutionary roots of prosodic bootstrapping in language acquisition, it must be clarified how humans alone acquired sophisticated rhythmic abilities. This is still a contentious field of debate, but we summarize below a tentative scenario based on existing evidence.

Schachner et al. (2009) analysed the characteristics of species showing signs of rhythmic entrainment and argued that most species capable of rhythmic entrainment are also vocal learners/imitators, species that possess the ability to mimic and reproduce environmental sounds and conspecific vocalizations (Egnor & Hauser, 2004). Because both rhythmic entrainment and vocal learning require a linkage between auditory and motor systems, Patel proposed that increased functional and anatomical linkages between motor and auditory regions underlie the evolution of both vocal learning and rhythmic entrainment (Patel, 2014; Patel & Iversen, 2014). Songbirds, primates, dolphins, elephants, and bats use vocal imitation for authenticating group membership, territorial defence, adjustment of social relationships, and courtship behaviour (Coleman et al., 2007; Doi, 2020). Thus, refinement of vocal learning/imitation, and hence the development of auditory-motor coupling, conferred clear adaptive benefits on these species.

Ethnographic research on existing hunter-gatherer societies indicates the primary role of vocal imitation in cooperative behaviour (Boyd, 2018; Lewis, 2014). Accumulating and sharing knowledge about the surrounding environment among group members increases the odds of survival. Initiated males of the Bakaya Pygmy group achieve this by narrating their experience and knowledge through multimodal channels, including sophisticated imitation of environmental sounds as well as gestures and facial expressions (Lewis, 2014). Therefore, representing external entities by mimetic sound might have served as an efficient tool of communication in the genus Homo throughout evolutionary history (Boyd, 2018). The adaptive benefit of sound-mimicking ability must have led to closer and stronger coupling between motor and auditory regions in the human brain.

Though somewhat speculative, auditory-motor coupling may have paved the way for the emergence of fine-motor control of the vocal apparatus required in speech communication (Kearney & Guenther, 2019). The evolution of language made auditory-motor coupling even more valuable for humans, further strengthening the anatomical and functional association between these neural regions. In other words, auditory-motor coupling underlying the refinement of vocal learning/imitation prepared the basis for the evolution of speech communication. Thereafter, the survival benefit of language and speech communication in turn strengthened this coupling further.

The evolution of tight auditory-motor coupling (Rauschecker & Scott, 2009) was thus driven first by the survival benefit of vocal learning/imitation and then by that of speech communication. As a by-product of this process, humans acquired a superb ability for rhythm processing. Interestingly, neuroimaging studies indicate that motoric regions as well as auditory cortices contribute to the perception of rhythmic (Grahn & Brett, 2007) and prosodic information (Brown & Martinez, 2007; Reiterer et al., 2007; Belyk et al., 2016). Further, several studies have revealed an association between the strength of auditory-motor coupling and linguistic processing ability (Yu et al., 2017). Thus, the human brain seems to have found a way to utilize strong auditory-motor coupling to process acoustic information and hence bootstrap language acquisition during infancy.

6 Conclusion

Human infants are faced with the daunting task of analysing the underlying structure of linguistic sound streams. Lacking any language-specific knowledge, infants must find linguistic sound almost indistinguishable from music. Therefore, it is natural to think that infants first apply their abilities for musical processing to linguistic materials.

Indeed, existing studies point to the possibility that domain-general abilities of frequency structure and rhythm perception contribute to language acquisition during early infancy. Infants start applying these abilities to analyse linguistic sound streams at almost the same time as the corresponding abilities emerge in the musical domain. Such cross-domain similarity in developmental course provides partial support for the view that domain-general functions assist language acquisition. However, this view should be validated empirically by future studies investigating whether identical neural structures are recruited in processing musical and linguistic materials in prelinguistic infants.

The abilities of musical chord and pitch contour perception can be seen in many non-human species. Further, these species can apply such perceptual functions to analyse linguistic materials as well. In contrast, animals do not match humans in their ability to perceive and produce rhythmic patterns. Considering this, what separates humans from non-human species in terms of language evolution seems to be the emergence of a tight coupling between auditory and motor regions that engendered both speech communication and refined rhythmic ability.

The emergence of language made auditory-motor coupling even more beneficial for humans. This strong auditory-motor coupling has raised musical and linguistic abilities to even higher levels, thereby enabling infants to use these domain-general functions to analyse linguistic sound streams in the process of language acquisition.