Introduction

Research on precursors to speech has revealed vocal capabilities and inclinations within the first year of human life not seen in any other primate at any age (Griebel et al., 2016; Griebel & Oller, 2014; Oller et al., 2019b). From the first months of life, the human infant produces prelinguistic vocalizations (termed “protophones”) that lay foundations for speech, even though they do not yet constitute speech. Infants go through stages of protophone development before they utter their first word: 1) phonation without systematic supraglottal articulation, yielding primarily vowel-like sounds or “vocants” (0–2 months); 2) primitive articulation or “gooing” where vocants are combined with tongue movements that often involve contact with structures in the back of the mouth (1–4 months); 3) vocal play or expansion, where playful vocalization produces a variety of sounds such as squealing, growling, and raspberries (3–8 months); and 4) canonical babbling with well-formed syllables, such as [ba], [da], or [na] beginning as early as 5 months but not later than 10 months in typical development (Kent, 2022; Stark, 1980).

Recent work involving home day-long recordings of both typically developing (TD) and prematurely born infants collected using Language Environment Analysis (LENA) recorders has revealed that infants produce protophones far more often than previously expected, at rates of four to five per waking minute in the first months after full-term birth and continuing throughout the first year (Oller et al., 2019a, 2019b). Oller et al. (2019a) reported that even the prematurely born infants still in neonatal intensive care began protophone production at high rates soon after extubation, which allowed them to breathe on their own. As many as 90% of protophones are spontaneously produced, not directed to or elicited by any caregiver (Long et al., 2020; Oller et al., 2021). Babies seem to be building a vocal capability actively: rather than merely responding to caregiver elicitation, they explore vocalizations and find ways to incorporate new sounds into a repertoire, a pattern reminiscent of the agent-like activity of infants and children proposed in “construction grammar” (Croft, 2001; Elbers, 1997; Tomasello, 2003).

The present paper focuses on vocal play, an understudied yet potentially crucial phenomenon, defined as a vocal pattern in which infants engage in bouts of practice-like protophone production (Stark, 1980, 1981). Vocal play is often focused on a small number of recognizable vocal categories that manifest different phonatory patterns: vocants (or vowel-like sounds, usually with normal or “modal” phonation), squeals (with high pitch and often falsetto or “loft” phonation), and growls (often with low pitch and fry or harsh phonation) (Buder et al., 2013, 2018). Anecdotally, parents often report that vocal play across these categories does not appear to be randomly sequenced but rather tends to occur in clusters of particular types (Oller, 2000). For example, an infant may produce vocants (the most common phonatory category) for a period of several minutes and then abruptly produce several growls within a similar time period, returning a few minutes later to producing vocants, or switching to squeals.

The pattern of non-random occurrence of these protophone types in vocal play suggests endogenously generated practice, unlike any vocal pattern that has ever been reported to occur in our ape relatives. Infant protophone production appears to constitute endogenous development of categories. While they do not constitute speech sounds, protophones reveal an ability that all speech requires, namely the ability to create vocal categories beyond the innate human repertoire of vocal signals, which includes crying, shrieking, and laughing (Oller et al., 2016).

Prior Research on Vocal Play in Typical Development

The notion of vocal play in human infants began to attain modern currency with longitudinal research by Stark (Stark, 1981; Stark et al., 1975). She suggested that infants from 3–5 months of age tend to engage in long bouts of vocal activity and coined the term “vocal play” to refer to it. Other researchers conducting longitudinal research on vocal development also observed that infants produced playful, repetitive protophones and that parents confirmed vocal play occurred at home (Laufer & Horii, 1977; Oller, 1978, 1980). Infants appeared to expand and consolidate their repertoires of vocal types during the vocal play stage, practicing the most salient sounds, namely vocants, squeals, and growls. Notably, there was very little quantification of the rate of occurrence or the pattern of repetition or clustering in infant production of the vocal types that were recognized by the investigators. Instead, research tended to focus on acoustic and articulatory descriptions of infant vocalizations (Bauer & Kent, 1987; Holmgren et al., 1986; Kent, 1981; Robb & Saxman, 1988), the phonetic and acoustic definition of canonical babbling (Oller, 1980, 1981, 1986), and parent-infant vocal interaction (Beebe et al., 1988; Bloom, 1988; Kaye & Fogel, 1980; Stern et al., 1975).

More recent research on this topic has quantified, in small samples of infants, the production of particular vocal types in concentrated periods or in non-random repetitions of individual vocal types (Oller et al., 2007; also see Supporting Information in Oller, 2013). By far the most extensive empirical study of vocal play clustering tracked 130 TD infants longitudinally with all-day recordings across their first year of life (Yoo et al., in press). Human coding of randomly selected five-minute segments from home recordings showed that TD infants demonstrated a strong tendency to engage in non-random clustering of squeals and growls with regard to the most frequently occurring category of protophones, the vocants (Yoo et al., in press). Specifically, more than 60% of TD infants in this study showed significant clustering patterns of squeals or growls on average across the first year, suggesting that TD infants engage in systematic production of the three vocal categories from their first months.

Early Vocal Production in Autism

The study of early vocal production in both TD and clinical populations may supply an empirical basis for recognizing clinically meaningful differences. Infant vocalizations have been widely studied in infant siblings of autistic children (see Footnote 1), given that differences in vocal development may be leveraged as early emerging behavioral markers that could facilitate early detection of autism. However, within this body of literature, more attention has been focused on the frequency or rate of vocalizations (volubility) than on how vocalizations of different types are sequenced or distributed when infants produce them. Findings have been mixed with regard to whether infants with elevated likelihood for autism (EL) show different volubility than infants with typical likelihood for autism (TL). No difference was reported in the frequency of vocalization during a five-minute free-play session between EL and TL infants (Northrup & Iverson, 2015) or in percentage of time engaged in vocalizations in videotaped first birthday parties in autistic infants compared to TD infants (Osterling et al., 2002). Other studies have reported fewer or lower rates of speech-like vocalizations (see Footnote 2) in EL infants compared to TL infants (Chenausky et al., 2017; Patten et al., 2014; Paul et al., 2011; Warlaumont et al., 2014). Swanson et al. (2018) used automated processing developed by the LENA Foundation to identify infant vocalizations and found a greater number of vocalizations in EL infants compared to TL infants, driven by a subgroup of “hypervocal” infants. A related finding from Plate et al. (2021) revealed that the majority of the “hypervocal” EL infants from the Swanson et al. (2018) paper did not develop autism at 24 months, and no significant difference was found in overall vocalization rate between EL infants with a confirmed diagnosis of autism and TD infants at 6, 12, or 24 months. Most studies on this topic have involved manual coding of vocalizations from brief (< 30-min) samples of audio or video recordings (Chenausky et al., 2017; Northrup & Iverson, 2015; Patten et al., 2014; Paul et al., 2011; Plate et al., 2021) or automated detection of vocalizations from day-long recordings (Swanson et al., 2018; Warlaumont et al., 2014).

To our knowledge, no study has examined patterns of infant vocal play as we define it here in autistic infants using either laboratory or naturalistic day-long home recordings. Theoretically, given the empirical evidence of clustering in TD infants (Yoo et al., in press), a finding that autistic infants also show a similar pattern of non-random occurrence of protophone types would hint at robustness of self-organization in vocal category development (Oller et al., 2016) and would highlight infants’ endogenous inclinations to engage in vocal exploration. Clinically, describing the pattern of vocal practice in autistic children may help us understand how autism interacts with early vocal development and could open doors to identifying additional early vocal markers of autism.

The Present Study

The present study represents the first effort to compare patterns of vocal clustering in TD and autistic infants. Our study addresses the first year of life using human coding of randomly sampled segments from day-long recordings. The study has the following objectives. First, we test the extent to which infants tended to produce vocal categories (squeals, vocants, and growls) in clusters rather than randomly distributing them across day-long recordings, focusing on possible differences in the extent of clustering across the TD and autistic groups. Second, we report developmental patterns of vocal clustering across six age categories in the first year in both groups. Third, we explore the extent to which clustering in vocal play correlated with later language outcomes and/or clinical features of autism from diagnostic evaluations at 2 years. The relationship between the extent of vocal clustering and later outcomes may unfold in two ways. On the one hand, evidence of vocal clustering in autistic infants may be positively correlated with later language outcomes, suggesting that vocal practice may support language development generally. On the other hand, given that restricted and repetitive patterns of behavior (RRBs) are a core feature of autism (American Psychiatric Association, 2013), a high degree of vocal clustering in early infancy may reflect a high degree of repetitive vocalization, be a marker for early repetitive behaviors and stereotypies, and correlate with the extent of RRBs at 2 years.

Methods

The institutional review boards of Emory University (Emory IRB00059383 and IRB00097674) and the University of Memphis (IRB #2143) approved the procedures used in this study. All families provided written consent prior to their participation in the longitudinal project from which the data in the present study were derived.

Participants

A total of 147 infants (103 TD infants and 44 autistic infants) were selected based on confirmed outcomes at 3 years from a larger database of > 300 infants who participated in a longitudinal sibling study of speech and language development across the first three years of life at the Marcus Autism Center (MAC) in Atlanta, Georgia. Newborn infants were recruited as having either elevated likelihood (EL) for autism (having at least one older biological sibling with a confirmed autism diagnosis) or typical likelihood (TL) for autism (no familial history of autism in 1st, 2nd, or 3rd degree relatives). Exclusion criteria included birth complications (e.g., born preterm < 34 weeks, prenatal or perinatal complications), evidence of a disorder influencing speech perception or production (e.g., hearing loss, cleft palate), genetic conditions including those associated with autism (e.g., Fragile X syndrome, tuberous sclerosis), and other medical conditions such as non-febrile seizure disorders or conditions requiring tube feeding or ventilation.

Recruitment was conducted blind to sex or socioeconomic status (SES). Home language recordings from infants were selected for human coding to balance sex and SES distributions to the extent possible given male-to-female ratios of autism and the demographics of the Atlanta metropolitan area. The proportion of male vs. female participants did not differ significantly between groups (χ² = 1.42, p = 0.23). However, a significant difference was detected for SES backgrounds between groups: the TD group included more infants from high SES backgrounds than the autistic group (χ² = 21.71, p < 0.001). Thus, we conducted between-group analyses in stratified groups of SES to evaluate the potential effect of SES on vocal clustering patterns. No significant differences emerged between the outcome groups in either the low-SES or high-SES groups. The pattern in both SES groups mirrored the overall results (see Supplementary Material S3 for details). Demographic information for all participants is presented in Table 1.

Table 1 Participant demographics

All infants received a full diagnostic characterization at 2 and 3 years of age, including administration of the Autism Diagnostic Observation Schedule, 2nd Edition (ADOS-2; Lord et al., 2012) and assessment of cognitive and language outcomes using the Mullen Scales of Early Learning (MSEL; Mullen, 1995). At each assessment, diagnostic impressions were assigned by two different senior-level clinicians, blind to prior autism likelihood status and diagnoses. Confirmatory diagnoses were reached through consensus using clinicians’ diagnostic impressions and ADOS-2 diagnostic classifications. Only infants who were initially recruited as TL and later confirmed as having no clinical features were included in the TD group for this study. All infants later diagnosed with autism regardless of autism likelihood status were included in the autistic group. Language, cognitive, and autism severity characteristics for all participants are presented in Table 2.

Table 2 Language and cognitive characteristics of participants

Day-Long Recording Procedures

Caregivers completed day-long recordings approximately once a month between 0 and 36 months. The present study involved human coding of recording data between 0 and 13 months. Audio recordings were completed using LENA recorders. On average, families of TD infants completed 9.1 recordings (range: 4–13) across the ages studied and families of autistic infants completed 8.1 recordings (range: 3–12), with an average recording time of approximately 11 hours per day.

Data included 1293 all-day recordings (357 for autistic infants, 936 for TD infants). Human coding on 21 randomly selected five-minute segments from each recording yielded 27,153 segments for data analysis. For age-related analyses, recording age was rounded and data were grouped as follows: 0–2 months, 3–4 months, 5–6 months, 7–8 months, 9–10 months, and 11–13 months, with cutoffs at 2.5, 4.5, 6.5, 8.5, and 10.5 months respectively.
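The grouping rule above can be sketched in a few lines of Python. This is only an illustration of the stated cutoffs, not the authors' code: the names `CUTOFFS`, `LABELS`, and `age_group` are hypothetical, and the handling of an age that falls exactly on a cutoff (assigned here to the younger group) is an assumption, since the text does not specify it.

```python
from bisect import bisect_left

# Cutoffs (in months) separating adjacent age groups, as stated in the text.
CUTOFFS = [2.5, 4.5, 6.5, 8.5, 10.5]
LABELS = ["0-2", "3-4", "5-6", "7-8", "9-10", "11-13"]


def age_group(age_months: float) -> str:
    """Map a recording age in months to one of the six age intervals.

    ASSUMPTION: an age exactly equal to a cutoff goes to the younger group.
    """
    return LABELS[bisect_left(CUTOFFS, age_months)]


# Segment count implied by the text: 21 coded segments per recording.
total_segments = 1293 * 21  # 27,153

for age in (1.0, 2.6, 6.5, 12.9):
    print(age, "->", age_group(age))
```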

Coding Procedures and Categories

Thirty-six English-speaking graduate students in the University of Memphis Origin of Language Laboratories (OLL) were trained for human coding of infant vocalizations. Coding was conducted using Action Analysis, Coding, and Training (AACT; Delgado et al., 2010), which allows the use of keystroke and mouse-selected coding for vocal categories in real time. Coders were blind to diagnostic and demographic information. Coding was conducted at the utterance level based on a “breath-group criterion” in accord with which phonation begins an utterance and inhalation ends it (Lynch et al., 1995). Vegetative sounds (e.g., coughing, sneezing, burping, and effort grunts) were not coded. The three primary protophones (vocants, squeals, and growls) accounted for approximately 85% of all utterances across the first year and are the focus of the analyses for this study. Cries, whimpers, and laughs (about 13% of coded utterances) were excluded from analysis.

In the present study, the clustering of squeals and growls is referenced to vocants, the apparent default category for protophones, since they constituted about 75% of utterances, while growls were about 7% and squeals about 5%. The real-time method may limit coding of the non-default phonatory categories (squeal and growl) to particularly salient instances of squeal and growl, a limitation that we think corresponds to perceptions of parents, who (like real-time coders) have only one opportunity to judge each infant utterance.

Detailed definitions for squeals, vocants, and growls are provided in the Supplementary Material S1. The three categories are often easy to recognize as sharply distinct, but many infant utterances involve considerable ambiguity; they may shift from one vocal regime to another within utterance, presumably because infants are exploring vocalization. Thus, coders were required to make a forced choice among the three categories in real time based on the most auditorily salient feature of the entire vocalization. The key point here is that ambiguity is an inherent aspect of the judgments of protophones because they are exploratory and often yield fuzzy or uncertain impressions, an important issue to consider in assessment of coder agreement, which we address below.

Analyses of Clustering Patterns

We conducted Fisher’s exact tests (Fisher, 1934; Freeman & Halton, 1951) to evaluate the null hypothesis that the proportion of squeals with respect to vocants was the same, or growls with respect to vocants was the same, across the five-minute segments in each recording that were included for analysis, at the 95% confidence level. Fisher’s exact test is an appropriate test of independence in the analysis of contingency tables. It is robust with regard to small or uneven n’s (as often occurred in individual five-minute segments in our data) and is consequently particularly suitable compared to other non-parametric tests of independence of categorical variables, such as chi-square. In Supplementary Material S2 we tabulate raw data from example recordings to help readers understand the Fisher’s exact test results intuitively.

Segments with high rates of crying or whimpering and segments where the infant was asleep were deemed incompatible with vocal play. Consequently, we excluded 1) segments in which the infant was judged by the coder to be asleep throughout the five minutes (as indicated in a short questionnaire completed after coding each segment; n = 7235 segments), and 2) segments in which ≥ 5 cry or whimper utterances occurred (n = 2795 segments). After applying the exclusion criteria, we analyzed the remaining 17,137 segments. The proportion of segments excluded due to crying or sleep did not differ significantly between the TD and autistic groups.
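As a minimal sketch of the two exclusion criteria, the filter below keeps a segment only if the infant was not asleep throughout and fewer than five cry or whimper utterances were coded. The `Segment` record and its field names are hypothetical stand-ins for whatever the coding software exports, and the example segments are made up.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical fields; the real coded data may be structured differently.
    asleep_throughout: bool  # coder's judgment from the post-coding questionnaire
    cry_whimper_count: int   # cry + whimper utterances in the five-minute segment


CRY_THRESHOLD = 5  # segments with >= 5 cry/whimper utterances are excluded


def is_analyzable(seg: Segment) -> bool:
    """Return True if the segment passes both exclusion criteria."""
    return (not seg.asleep_throughout) and seg.cry_whimper_count < CRY_THRESHOLD


# Made-up example segments.
segments = [
    Segment(False, 0),  # kept
    Segment(True, 0),   # excluded: asleep throughout
    Segment(False, 5),  # excluded: five cry/whimper utterances
    Segment(False, 4),  # kept
]
kept = [s for s in segments if is_analyzable(s)]
print(len(kept), "of", len(segments), "segments retained")
```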

Fisher’s exact tests were conducted on all available recordings (n = 1293 recordings), once for squeals vs. vocants and once for growls vs. vocants. A significant Fisher’s exact test for the available comparisons within a recording would signify that the protophone types were distributed in a non-random way, indicating clustering. In other words, either squeals or growls or both were being produced significantly more often in particular segments than in other segments of the same recording. Prior to the analyses, we evaluated all recordings for the possibility that there were no squeals or growls throughout the recording. If a recording did not contain any coded squeals, it was designated as “not analyzable” by the Fisher’s exact test for the squeals vs. vocants comparison. Likewise, the growls vs. vocants comparison was not feasible if a recording did not contain any coded growls. There were 141 recordings with no coded squeals, 140 with no coded growls, and 47 with neither squeals nor growls. It is important to note that the recordings that did not contain squeals or growls were not excluded from the final data but rather were included in the denominator when calculating the percentage of recordings that showed significant clustering patterns of squeals or growls with respect to vocants.
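To make the logic of the per-recording test concrete, the sketch below treats one recording as a segments-by-2 contingency table (squeal or growl count vs. vocant count in each retained segment) and estimates the Fisher-Freeman-Halton p-value by Monte Carlo permutation. It is an approximation written for illustration, not the exact computation or the software used in the study; the function names and the example counts are hypothetical. With row and column totals fixed, a table's probability is proportional to the inverse product of its cell factorials, so the sum of log-factorials serves as the ordering statistic.

```python
import math
import random


def table_stat(rows):
    """Sum of log-factorials of all cells.

    With fixed margins, a larger value means a less probable table.
    """
    return sum(math.lgamma(a + 1) + math.lgamma(b + 1) for a, b in rows)


def fisher_mc(rows, n_sim=5000, seed=0):
    """Monte Carlo estimate of the Fisher-Freeman-Halton p-value.

    rows: list of (target_count, vocant_count) pairs, one per segment.
    Utterance labels are shuffled across segments, which keeps both the
    segment totals and the category totals fixed.
    """
    rng = random.Random(seed)
    sizes = [a + b for a, b in rows]
    labels = [1] * sum(a for a, _ in rows) + [0] * sum(b for _, b in rows)
    observed = table_stat(rows)
    extreme = 0
    for _ in range(n_sim):
        rng.shuffle(labels)
        sim, start = [], 0
        for n in sizes:
            k = sum(labels[start:start + n])
            sim.append((k, n - k))
            start += n
        # Count simulated tables at least as improbable as the observed one.
        if table_stat(sim) >= observed - 1e-9:
            extreme += 1
    return (extreme + 1) / (n_sim + 1)


# Made-up recordings: squeals concentrated in two segments vs. spread evenly.
clustered = [(6, 2), (0, 10), (0, 12), (5, 3), (0, 9)]
even = [(1, 9), (1, 10), (1, 9), (1, 10), (1, 9)]
p_clustered = fisher_mc(clustered)
p_even = fisher_mc(even)
print(p_clustered < 0.05, p_even < 0.05)
```

A recording like `clustered` would count toward the numerator of the percentage of recordings with significant clustering, whereas `even` would not; a recording with no coded squeals at all would be "not analyzable" for this comparison yet would still count in the denominator.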

To answer the main research question regarding the extent of clustering of squeals and growls with respect to vocants in autistic infants as compared to TD infants, the proportion of recordings that showed significant clustering patterns of squeals and growls (with all available recordings represented in the denominator) was first calculated for each infant and then summarized for the TD and the autistic groups. We also calculated the proportion of recordings that showed significant clustering patterns of either squeals or growls, the numerator being the number of recordings where the Fisher’s exact test for either squeals vs vocants or growls vs vocants was significant. Given that the distribution of proportion data was not normal, we report group means and bootstrapped 95% confidence intervals calculated across 5000 resampling iterations. Between-group differences were examined using permutation testing with 5000 resamples. To examine the developmental trend of the extent of clustering of squeals and growls, data were averaged for all the recordings of each infant within each age interval prior to computing means and bootstrapped confidence intervals for each age interval and group.
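The group summary and comparison described above can be sketched as follows. This is a generic percentile bootstrap and a two-sided permutation test on the difference in group means, which is one common way to implement such analyses; the study's exact procedure (for example, the bootstrap variant or the test statistic) is not specified here, so those choices are assumptions. The per-infant proportions are invented for illustration.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def permutation_p(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-infant proportions of recordings with significant clustering.
td = [0.30, 0.40, 0.50, 0.45, 0.35, 0.40, 0.38, 0.42]
autistic = [0.50, 0.40, 0.45, 0.55, 0.35, 0.50]

td_lo, td_hi = bootstrap_ci(td)
p_value = permutation_p(td, autistic)
print(f"TD mean = {fmean(td):.2f}, 95% CI = [{td_lo:.2f}, {td_hi:.2f}]")
print(f"permutation p = {p_value:.3f}")
```

Resampling is done at the level of infants rather than recordings, matching the description that proportions were first calculated for each infant and then summarized by group.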

Finally, correlation analyses were conducted in the autistic group using the proportion of recordings that showed significant clustering patterns of squeals, growls, or either squeals or growls, and outcome measures at 2 years. The outcome measures include (1) ADOS Overall Total scores, (2) ADOS Social Affect scores, (3) ADOS Restricted and Repetitive Behavior (RRB) scores, (4) Mullen Scales of Early Learning (MSEL; Mullen, 1995) Receptive Language t-scores, and (5) MSEL Expressive Language t-scores.

Intercoder Agreement

We examined coder agreement across 289 five-minute segments, including 21 segments randomly selected from nine agreement recordings of both TD and autistic infants. The agreement recordings came from the same longitudinal studies but were not included in the actual data reported under Results. All 21 segments of some recordings were coded independently by the 36 coders who provided data for the results reported below. Additional coders who had been trained the same way also coded all 21 segments of some or all of the nine recordings. The number of individuals who coded each of the nine recordings varied from 20 to 48 (mean = 32.1 coders per recording).

As expected, coders varied in their counts of the three protophone types, indicating predictable disagreement among coders about categorizations of squeals, growls, and vocants. The coefficient of variation (CoV, ratio of standard deviation to mean) across coders provides a measure of the degree of coder variation. The mean CoV across the nine recordings for vocants was 0.21, while squeals and growls showed notably more variation (CoV = 0.59 and 0.96 respectively). The data thus show substantial differences among coders in their judgments about vocants, squeals, and growls. Acoustically and auditorily analyzed examples offer perspective on why such disagreement is to be expected, namely that many utterances of infants involve a mixture of the phonatory regimes that characterize the three protophone types. These mixtures of regimes put forced-choice coders in the position of having to decide which among the phonatory regimes within any utterance is most salient, and often there is no absolute correct answer.
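The coefficient of variation used above is simply the standard deviation of the coders' counts divided by their mean. The sketch below uses the sample standard deviation, which is an assumption (the text does not say whether the sample or population form was used), and the counts are invented for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(counts):
    """Ratio of the (sample) standard deviation to the mean of coder counts."""
    return stdev(counts) / mean(counts)


# Hypothetical counts of each protophone type reported by five coders
# for the same recording.
vocant_counts = [200, 180, 230, 210, 190]
growl_counts = [5, 20, 12, 2, 30]

cov_vocant = coefficient_of_variation(vocant_counts)
cov_growl = coefficient_of_variation(growl_counts)
print(f"vocants CoV = {cov_vocant:.2f}, growls CoV = {cov_growl:.2f}")
```

Because rarer categories have small means, the same absolute disagreement between coders produces a much larger CoV for squeals and growls than for vocants.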

We conducted a permutation test (Good, 2005) on all of the nine agreement recordings to determine whether coder variation had any significant effect on the dependent variables of interest (percentage of recordings that showed significant clustering patterns of squeals, growls, or either squeals or growls). We resampled the 48 coders without replacement 5000 times and calculated the proportion of recordings that showed significant clustering patterns of squeals or growls with respect to vocants to determine the empirical distribution of our dependent variables under the null hypothesis that coder identity had no effect. The test indicated that the null hypothesis could not be rejected at the 95% confidence level, and there was therefore no evidence that coder differences significantly skewed the results. Detailed results and elaboration of the permutation tests are provided in Supplementary Material S4.

Results

Clustering of Protophones in TD and Autistic Groups

Results indicated a considerable tendency for infants in both outcome groups to show significant clustering patterns. For squeals, 39% of recordings from TD infants showed significant clustering with respect to vocants (95% CI = [0.36, 0.44]), and 46% of recordings from autistic infants showed significant clustering (95% CI = [0.40, 0.53]). The percentage of recordings that showed significant clustering did not differ significantly across the TD and autistic groups (p = 0.14). With respect to growls, 39% of recordings from TD infants showed significant clustering (95% CI = [0.34, 0.44]) and 43% of recordings from autistic infants showed significant clustering (95% CI = [0.35, 0.51]). Again, no significant difference was found in the percentage of recordings that showed significant clustering of growls across the TD and autistic groups (p = 0.41).

When calculating the percentages, the number of excluded recordings that were classified as non-analyzable based on the absence of coded squeals or growls was included in the denominators. Consequently, the percentages reflect clustering of squeals and growls across all recordings available for all the infants. Since only 21 segments (105 min) were selected from each all-day recording for coding (about 1/6 of the available recording time), these values necessarily underestimate the percentage of days the infants actually engaged in some clustering of squeals and/or growls. The data thus supply a conservative estimate of the amount of clustering of squeals and growls.

Evaluation of whether individual infants showed significant clustering of either squeals or growls in each recording revealed that 66% of recordings from autistic infants and 61% of recordings from TD infants showed a significant result for one or the other. Thus, the findings suggest infants usually showed some discernible clustering patterns for protophones.

Clustering Patterns of Protophones Across Ages

The data by age for each vocal type comparison showed clear evidence of clustering of squeals and growls with respect to vocants across all age groups in both the TD and autistic groups. Figure 1 provides a summary of the developmental patterns, indicating that clustering of squeals and growls occurred from the first months of life and at all the age intervals. For TD infants, the percentage of recordings that showed significant evidence of clustering ranged from 26 to 49% for squeals and from 33 to 43% for growls. For autistic infants, the percentage of recordings that showed significant evidence of clustering of squeals was slightly higher (although not significantly so) than for TD infants, ranging from 37 to 57%. The percentage of recordings that showed significant evidence of clustering of growls ranged from 33 to 43% for autistic infants.

Fig. 1 Developmental Patterns of Vocal Clustering across Six Age Intervals in the First Year. (a) Values were averaged for all the recordings of each infant within each age interval prior to computing means. (b) Error bars represent bootstrapped 95% confidence intervals calculated across 5000 resampling iterations.

Interestingly, for both groups, the greatest amount of clustering did not fall within the 3–4, 5–6, or 7–8 month intervals, the ages that were thought to constitute the vocal play stage in the early stage models of vocal development (Kent, 2022; Nathani et al., 2006; Oller, 1978, 2000; Stark, 1980). The clustering of squeals showed a tendency to increase toward the middle of the year and peaked at around 9–10 months for both groups. The clustering for growls appeared to be more stable across age than for squeals, showing the highest values at the first and the final intervals of the year. The last panel of Fig. 1 displays the percentage of recordings that showed significant clustering of either squeals or growls at each of the age intervals for both groups. Significant clustering of either squeals or growls occurred in 49 to 68% of the recordings from TD infants and in 57 to 69% of the recordings from autistic infants. A slight increasing trend for clustering patterns was observed for both groups. Notably, all the age intervals beyond 5 months demonstrated significant clustering for at least 60% of the recordings. There was no age interval, not even the 0–2 month interval, where significant clustering was absent.

Correlation Between the Extent of Vocal Clustering and Outcomes at 2 Years

Correlation analyses revealed that the proportion of recordings with significant clustering of squeals, growls, or either squeals or growls did not correlate significantly with any outcome measures at 2 years. The very low correlation coefficients for the extent of vocal clustering and MSEL expressive or receptive language ranged from −0.13 to 0.1 (all p values > 0.5). Likewise, the correlation coefficients for the extent of vocal clustering and ADOS-2 scores ranged from −0.2 to 0.09 (all p values > 0.3). Detailed results are presented in Table 3.

Table 3 Correlations between the extent of squeal or growl clustering and outcomes at 2 years

Discussion

To our knowledge, this study represents the first quantitative assessment of the tendency of autistic infants to produce vocal categories in clusters across the first year of life. We compared infants’ vocal play patterns for squeals, vocants, and growls across 1293 longitudinal day-long recordings from 103 TD infants and 44 autistic infants. On average, for each infant, 117 five-minute segments (~9.75 hours) across their first year of life were human-coded to examine these vocal play patterns. Given that vocants dominate infants’ early vocalizations, constituting about 80% of all protophones (Oller et al., 2021), and demonstrate primarily modal phonation in the usual pitch range of each infant, the present work focused on the clustering of protophones with non-modal phonation (squeals and growls) in relation to vocants.

Every infant in the study, regardless of diagnostic outcome, showed significant clustering of either squeals or growls in at least one of their recordings. For both TD and autistic infants, more than 60% of recordings on average showed significant clustering for squeals or growls. These findings highlight the robustness of the systematic production of vocal categories across the first year of life. The similarity of the clustering patterns in the TD and autistic groups suggests that vocal category formation through active infant vocal exploration is a deep human tendency.

It has long been speculated that vocal exploration is critically important in establishing foundations for language (Oller, 1980; Stark, 1980), but now there is a basis to speculate further: it appears that vocal category formation as manifested in clustering may be critically important to the development of foundations for language. The high volubility of exploratory vocalization may form a circumstance in which category formation emerges simply because infants listen to themselves and notice the differences in phonatory patterns they are able to produce. The apparent practice that accompanies this recognition would seem to fit with modern theories emphasizing the inherent nature of curiosity and active “seeking” for information about the world, certainly in humans, but presumably in many species (Gottlieb & Oudeyer, 2018; Panksepp, 2009). Curiosity and the exploration that is motivated by it are thought by some to be the primary drivers of learning (Panksepp & Biven, 2012).

Because significant clustering of both squeals and growls was evident at the very first age interval for both TD and autistic infants, it seems clear that active vocal exploration occurred from very early in life. Not only did protophones show the tendency to cluster from the first months in the present study, but another study revealed that protophones also demonstrate functional flexibility from at least the first three months (Jhang & Oller, 2017). The cited work showed that each of the three vocal categories can express both neutral and negative emotional states on different occasions of use. Additional work showed that such flexibility of usage extends to include positive as well as neutral and negative affective states once smiling is established (by about 3 months), after which the full range of affectively flexible expression occurs for all three phonatory categories at all subsequent infant ages throughout the first year (Oller et al., 2013). The infants’ tendency to produce protophones in clusters and to use them flexibly in terms of affective expression suggests infants achieve both stable form and flexible function in the first months. Taken together, these findings reiterate the special role protophones serve as precursors to oral language.

It may be thought surprising that no strong relation was revealed between vocal clustering in the autistic group and autism severity or language outcomes. We initially speculated that vocal clustering might be a manifestation of RRBs in some autistic children. The lack of a significant correlation should not, however, be taken as the final word on the topic, because again, the coding system allowed only three categories of phonation to be indicated, a limitation that may have obscured possible relations between vocal clustering and later outcomes. It is also important to note that our study’s focus on the first year of life and our focus only on human coding of three particular vocal categories may have caused us to miss other features of vocalization that may differ between autistic and TD individuals. Prior work has suggested that prosodic differences may indeed accompany speech in autistic children and adults. For example, both variability of pitch and long-term spectra of speech have been found to differ between autistic and TD children 4–6 years of age (Bonneh et al., 2011). A recent meta-analysis involving 39 studies also concluded that autistic individuals had higher mean pitch, larger range and variability of pitch, and greater voice duration than TD controls (Asghari et al., 2021). Sheinkopf et al. (2000) reported that autistic preschoolers produced significantly more vocalizations with atypical vocal quality (defined as vocalizations produced as squeals, growls, or yells) than children with developmental delay. Interestingly, atypical vocal quality ratio in autistic children did not correlate significantly with joint attention measures in Sheinkopf et al. (2000), a finding that may correspond to the lack of a significant correlation between vocal clustering and social communication measured by the ADOS-2 Social Affect score in the present study. Future longitudinal studies are warranted to further illuminate the relationship between early vocalization and autism outcomes.

Limitations

While real-time coding is an efficient way to code a large number of day-long recordings, it provides only a simplified sketch of infant vocalization. Individual utterances of infants at any age can include characteristics of any one or all of the phonatory categories, in addition to involving multiple vocal regimes not generally addressed in defining the three primary categories. To simplify the task, coders were trained to make each judgment based on the most auditorily salient characteristic of each utterance. Real-time coding may be less precise or reliable than repeat-listening coding (but see Willadsen et al., 2020), yet there is an important sense in which real-time coding may be preferable, because it resembles the recognition of infant vocalizations by caregivers, who interact with infants in real time and, like our coders, have only one opportunity to interpret each vocalization. The findings of the study seem to confirm common observations from caregivers that infants produce various vocal types in clusters (Stark, 1978; Oller, 2000), a pattern that suggests infants may be practicing these vocal categories. The very fact that parents express the opinion that infants are practicing vocally when they repeat sounds adds weight to the biological argument that infant vocal activity must provide reliable evidence to caregivers of infants’ growing capacity for vocal communication (Oller, 2000).

Another limitation is that all EL infants recruited to our sample had an older autistic sibling, but we did not control for the number of older siblings in either diagnostic group. Given the endogenous nature of infant vocalizations, it could be that having an older sibling would have little or no influence on infant vocal practice patterns, but findings from this study may not generalize well to autistic infants from single-incidence families or families with multiple older children.

Conclusions

This study is, as far as we know, the first to examine patterns of clustering in vocal production in autistic infants quantitatively. Three main findings emerged. First, both TD and autistic infants showed clustering patterns of protophones in their first year, suggesting that all the infants in this study appeared to be practicing vocal categories. This main finding suggests that vocal clustering practice is a robust feature of and an important precursor for language development. Second, the clustering of vocal categories occurred in both groups from the first month and showed an increasing trend of clustering of squeals and a stable trend of clustering of growls. Lastly, the extent of clustering of vocal categories in autistic infants in the first year showed extremely low and non-significant correlations with autism severity or language outcomes at 2 years.

Overall, the results provide support for the idea that vocal play and clustering of vocal types may be a fundamental property of human development, laying a deep foundation for later speech and language development. It is hard to imagine how one could learn to talk without the ability to form vocal categories. The fact that our results did not reveal a significant reduction in vocal play clustering among the autistic infants suggests further that the clustering tendency may be so important to human development that it has evolved to resist developmental differences associated with emerging autism, at least in the first year of life. We caution, however, that our study is focused on phonatory development only. There is much room remaining for comparative evaluation of vocal capacities and inclinations across autism and typical development.