The ability to use cues from multiple senses in concert (i.e., multisensory integration) is a fundamental aspect of brain function and a critical developmental milestone for infants, who must learn to perceive complex, multimodal events in ways that are meaningful and relevant [1, 2]. In general, multisensory integration is accomplished by the detection of intersensory redundancy, the spatially coordinated and/or temporally synchronized presentation of the same information across two or more sense modalities [3]. From this perspective, audio-visual events that occur within close temporal proximity are automatically integrated if they fall within a specific range called the audio-visual temporal binding window (i.e., intersensory temporal contiguity window; see Lewkowicz 2000 [4]). Conceptually, the temporal binding window is a measure of sensitivity to audio-visual temporal synchrony, quantified as the maximum amount of time that auditory and visual sensory inputs can be physically separated and still be perceived as unitary or synchronous.

Although infants are sensitive to audio-visual synchrony relations from birth [5, 6], the size of the audio-visual temporal binding window has been found to be much larger in infants than in adults [7, 8]. As children acquire perceptual experience with synchronous events, their sensitivity to audio-visual synchrony improves and their temporal binding windows grow smaller [9]. However, there is evidence to suggest that this process is disrupted in individuals with autism spectrum disorder (ASD; [10]). Since very little work in this regard has focused on infancy, the primary purpose of our study was to examine and compare sensitivity to audio-visual asynchrony among infants at elevated likelihood of developing ASD and their typically developing (TD) counterparts.

Multisensory integration and autism spectrum disorder

Across multiple studies using a variety of paradigms, children with ASD have shown impaired perception of audio-visual relations, especially for social events. For instance, whereas TD children preferentially looked towards a synchronous (vs. asynchronous) audio-visual display of a woman speaking, children with ASD exhibited no clear preference, suggesting they may not have discriminated between the stimuli [11]. Consistent with this interpretation, older children with ASD have been less accurate than TD children in judging whether auditory and visual speech cues were temporally aligned [12, 13] and have exhibited larger temporal binding windows for audio-visual speech [12–14]. In general, research suggests that children with ASD are less sensitive to the temporal synchrony of auditory and visual cues within speech, a powerful source of intersensory redundancy for TD infants [5].

Importantly, when non-social events (e.g., a hammer tapping a nail, a bouncing ball) are depicted, the link between ASD and multisensory integration is less clear: some studies have demonstrated disparate performance in ASD, both enhanced and reduced [15–18], while in other studies, individuals with ASD perform relatively similarly to TD individuals [19–22]. Of note, in TD individuals, the audio-visual temporal binding window is typically larger for social (e.g., speech) than non-social (e.g., flashes, beeps) events [7, 23–26], suggesting that sensory integration in these contexts reflects relatively distinct processes.

Relatedly, both children and adults with ASD exhibit difficulty perceiving the McGurk illusion [27], in which simultaneously presented (but incongruent) auditory and visual speech cues (e.g., visual “ga” and auditory “ba”) are fused to generate a novel, illusory percept (e.g., “da” or “tha”) [21, 28–30]. Thus, whereas TD individuals apparently integrate the incongruent auditory and visual speech cues in this situation, individuals with ASD do not, suggesting that they may generally rely more on auditory than visual cues to perceive multimodal events. Interestingly, and consistent with this interpretation, children with ASD are reportedly more susceptible to a visual flash-beep illusion, in which multiple beeps paired with a single flashing light produce an illusion of having seen multiple light flashes [17]. Thus, when the event is non-social in nature, children with ASD are not necessarily impaired in their capacity for multisensory integration and may even be more likely than TD children to integrate the cues.

Despite a wealth of research on this topic in children and adults [11, 12, 14, 17, 19, 31, 32], very few studies have examined multisensory processing in children younger than 24 months who are at elevated likelihood of developing ASD. Given the vast literature on multisensory integration in TD infants (see Lewkowicz, 2000, 2014 [4, 33] for reviews), this represents a notable gap in the empirical literature. By examining sensitivity to audio-visual asynchrony among both TD infants and those at elevated likelihood of developing ASD, the current study addresses whether an extended audio-visual temporal binding window is present in infancy as a function of elevated likelihood of developing ASD, for both social and non-social events.

Implications for language development

Examining these associations in infancy is critical because language develops rapidly across the second year of life and infants’ sensitivity to audio-visual temporal synchrony may contribute to this process. According to the intersensory redundancy hypothesis, multimodal events in which sensory cues are temporally synchronized are highly salient and serve to recruit selective attention and organize perceptual learning in early development [34, 35]. In particular, selective attention to the source of redundant sensory information is thought to allow for “intermodal learning” that captures the salient perceptual dimensions of the cultural world [36]. As a prime example, attending to the visibly moving lips of a speaking face (a source of highly redundant sensory information) allows infants to perceptually integrate faces and voices into a coherent whole rather than as a series of disjointed inputs [4, 36], which may be critical for the perception of other information conveyed by faces, such as speaker identity, emotion, and language group [37–39].

More importantly, however, selective attention to the mouth region of a speaking face may be critical for understanding speech as an intentional action that can be performed by the self. Shortly after they begin to babble (8–10 months), TD infants begin attending more to the mouth (vs. eye) region of a speaking face [40–43], which may support their emerging ability to imitate the lip movements associated with speech sounds. Additionally, the detection of audio-visual temporal synchrony is likely critical for establishing relations between spoken words and their visible referents [44, 45]. Thus, by promoting attention to the most important features of an event, an infant’s sensitivity to audio-visual synchrony may serve a foundational role in the acquisition of language, which is often delayed or disrupted in individuals with ASD [46, 47].

Although no previous study has examined whether sensitivity to audio-visual synchrony is associated with language development in infants at elevated likelihood of developing ASD, there is some support for this idea. Stevenson et al. [48], for instance, found that the size of the audio-visual temporal binding window in children with ASD was related to their speech perception, and this relation was further mediated by their ability to integrate social stimuli like the McGurk Effect. Relatedly, Righi et al. [49] showed that the ability of preschool children with ASD to match synchronous speech with corresponding lip movements predicted both their expressive and receptive language abilities. Finally, Bahrick et al. [50] reported that accuracy of intersensory matching of faces and voices is positively associated with language competence in TD children; more recently, these findings have been extended to infants [51, 52]. Thus, considering that children diagnosed with ASD often exhibit delays in language [12, 14, 17, 30, 32, 48, 53], it is likely that infants at elevated risk for developing the disorder would exhibit impairments in this domain.

Current study

In sum, the literature suggests that (1) the audio-visual temporal binding window for social stimuli is larger in children who are diagnosed with ASD, but that (2) there are relatively few studies examining the audio-visual temporal binding window in infants less than 24 months. Additionally, (3) although it has never been examined in infants at elevated likelihood of developing ASD, the audio-visual temporal binding window for speech likely impacts subsequent language production. Thus, the primary aim of the current study was to examine and compare the sensitivity to audio-visual synchrony of speech cues among infants at elevated likelihood of developing ASD and in their TD counterparts. Additionally, given the theoretical significance of audio-visual sensory integration for the development of expressive language, we also aimed to examine the associations between infants’ sensitivity to audio-visual synchrony and language production.

To assess the size of infants’ temporal binding windows, we used the habituation/dishabituation procedure, which is well-established in infants [54–56]. In general, this procedure uses looking time to measure infants’ attention to a repeated stimulus and their subsequent ability to discriminate the repeated stimulus from a novel stimulus. Research has indicated that the number of trials required for habituation is indicative of stimulus encoding and that individual differences are related to later cognitive abilities, such as IQ [56]. Thus, consistent with previous studies [5], before presenting infants with the asynchronous test stimuli, we habituated them to the synchronous stimuli. A speaking face served as the social event and a bouncing ball served as the nonsocial event. Finally, to probe whether a significant relation exists between the audio-visual temporal binding window for a social stimulus and language production, we assessed the size of infants’ productive vocabulary between 17 and 30 months.

In general, we expected that infants at elevated likelihood of developing ASD would exhibit reduced sensitivity to audio-visual synchrony when viewing the social stimulus and, hence, that their temporal binding window for the speaking face stimulus would be larger than that of TD infants. Given conflicting evidence about the performance of children with ASD when presented with non-social stimuli, we did not expect to find significant group differences in the size of the audio-visual temporal binding window for the nonsocial (bouncing ball) stimulus. Finally, we expected that TD infants would have larger vocabularies than infants at elevated likelihood of developing ASD and that a significant positive relation between the social audio-visual temporal binding window and language production would emerge in both groups.

Method

Participants

Two groups of infants between 4 and 24 months of age were tested. One group was composed of 35 infants at elevated likelihood of developing ASD (M = 12.90 months, SD = 5.49, 51% female) and the other of 53 TD infants (M = 10.60 months, SD = 5.10, 51% female). Approximately half of the infants at elevated likelihood of developing ASD were between 1 and 2 years of age (N = 17); a similar number of TD infants were between 1 and 2 years of age (N = 18). However, there were nearly twice as many TD infants less than 12 months of age (N = 35) as there were infants at elevated likelihood of developing ASD. Thus, because the TD group was slightly younger on average than the group at elevated likelihood of developing ASD, t(69) = − 2.00, p = 0.05, age was considered as a factor in the analyses.

Consistent with previous studies, infants were considered at elevated likelihood of developing ASD if they had an older sibling with a confirmed diagnosis of ASD, were born < 36 weeks gestation, or had a birth weight < 2000 g [57–61]. The premature infants ranged from 27 to 36 weeks gestation, and corrected gestational age was accounted for (by using the expected due date as the date of birth to calculate age at the time of the visit). TD infants had no family history of autism, were full-term at birth, had a birth weight of 2000 g or higher, and had a 5-min APGAR score of 7 or higher. All infants were healthy at the time of testing and had no recent history of eye or ear infection. We tested an additional 12 infants but did not include their data due to fussiness (n = 6), parental interference (n = 2), or equipment failure (n = 4). When infants were between 17 and 30 months, their parents were recontacted to fill out a vocabulary assessment.

Procedures

Infants completed two separate habituation/dishabituation procedures, one that presented a social event (speaking face) followed by one that presented a nonsocial event (bouncing ball); both were presented on a 24-inch Dell computer screen. Testing took place in a quiet, dimly lit room. Infants either sat in a child seat or on their caregiver’s lap (about 50 cm from the computer screen); in the latter case, caregivers wore headphones that played white noise. The speaking face procedure was administered first because it was the primary measure of interest. Between procedures, infants were given a 10-min break during which they were taken out of the test room and encouraged to play with their caregiver. All procedures and materials were approved by the Institutional Review Board and informed consent was obtained prior to data collection.

The speaking face event (see Lewkowicz 2010, [9]) consisted of a woman wearing a neutral expression looking directly into the camera while producing the speech syllable /ba/. The woman opened her mouth, articulated the syllable /ba/, and then closed her mouth every 4 s. The woman’s face spanned roughly 1/3 of the computer screen, subtending approximately 19° of visual angle in height and 28° of visual angle in width. An audible /ba/ was synchronous with her lip movements and was presented at 65 dB, A-scale. The bouncing ball event (see Lewkowicz 1996 [8], Minar and Lewkowicz 2018 [62]) consisted of a moving red ball that made an impact sound when it hit the upper/lower bounds of the computer screen. This stimulus was created in Adobe After Effects (Adobe Systems, San Jose, CA). The ball was 2 inches in diameter and subtended approximately 6° of visual angle in height and width. The ball moved at a rate of 10 cm/s (with a 50 ms pause at each endpoint) and was presented in front of a 12 × 16 grid of small white dots against a black background. The change in direction at the upper and lower bounds of the screen was synchronous with the sound of a wooden spoon hitting an empty plastic container, which was presented at 65 dB, A-scale.

Measures

Habituation/dishabituation

Each habituation trial began when infants attended to the stimulus screen and ended when they disengaged from the screen for a period of 1 s or when their look duration exceeded the maximum trial length of 60 s [38, 55, 63]. Infants were repeatedly shown the stimulus until their mean looking across the last three habituation trials had decreased by 50% relative to the first three habituation trials; once this occurred, infants were considered habituated to the stimulus [56]. To ensure the appropriate number of habituation trials was administered to each infant, look duration was assessed live by trained coders using a peephole. Observers recorded fixation on an event recorder by noting when the infants’ eyes were oriented towards the stimulus. For both event conditions, the number of trials required for the infant to achieve the habituation criterion was calculated.
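As an illustration, the habituation criterion just described reduces to a simple comparison of two three-trial means. The following Python sketch is hypothetical (it is not the software used in the study); the function name and example values are our own:

```python
def reached_habituation(look_durations, criterion=0.50):
    """Return True once mean looking across the last three trials
    has dropped by `criterion` (here 50%) relative to the mean of
    the first three trials.

    `look_durations` is the list of per-trial look times (s) so far.
    """
    if len(look_durations) < 6:  # need two non-overlapping 3-trial windows
        return False
    first3 = sum(look_durations[:3]) / 3.0
    last3 = sum(look_durations[-3:]) / 3.0
    return last3 <= (1.0 - criterion) * first3

# Example: looking declines from ~30 s to ~10 s across trials
looks = [32.0, 28.5, 30.1, 18.4, 12.2, 9.8, 8.1]
print(reached_habituation(looks))  # True: mean of last three is well under half of ~30 s
```

In practice the check would be applied after every trial, with trials continuing until it returns True.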

Once infants were habituated, five test trials depicting the same events were presented at increasing levels of audio-visual asynchrony (333 ms, 500 ms, 666 ms, 833 ms, and 1000 ms), with the sound always preceding its corresponding visual event. The trial in which infants exhibited their longest look duration was taken to indicate the size of their audio-visual temporal binding window for each stimulus condition.
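Under this operationalization, the binding-window estimate is simply the asynchrony offset of the trial with the longest look (an argmax over the five test trials). A minimal sketch, with hypothetical names and illustrative look durations:

```python
# Asynchrony offsets (ms) for the five test trials, in presentation order
OFFSETS_MS = [333, 500, 666, 833, 1000]

def binding_window(test_looks):
    """Return the offset (ms) of the test trial with the longest look.

    `test_looks` holds one look duration (s) per test trial,
    aligned index-for-index with OFFSETS_MS.
    """
    longest_idx = max(range(len(test_looks)), key=lambda i: test_looks[i])
    return OFFSETS_MS[longest_idx]

# An infant whose looking peaks at the 500 ms trial
print(binding_window([6.2, 14.8, 9.1, 7.4, 5.0]))  # 500
```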

Vocabulary Production

Toddlers’ vocabulary production was assessed between 17 and 30 months using the Toddler form (part IA, “Words Children Use”, 680 words) of the MacArthur-Bates Communicative Development Inventory (CDI), a standard vocabulary checklist suitable for children within this age range [64]. CDI data are available for 35 TD infants (M = 20.85 months, SD = 3.39) and 18 infants at elevated likelihood of developing ASD (M = 21.05 months, SD = 3.35).

Results

Habituation/dishabituation

All infants were successfully habituated to the stimuli as indicated by a 50% reduction in looking for both conditions. A paired samples t-test revealed that infants required a greater number of trials to achieve habituation criterion for bouncing ball (M = 8.93) than for speaking face (M = 7.50), t(67) = 4.13, p < 0.01, but there were no other main effects of condition on habituation performance.

Group differences in study variables and correlations with age are displayed in Table 1. For speaking face, there is a significant main effect of group on the initial look duration during the pretest, F(1) = 6.47, p < 0.05, such that TD infants looked significantly longer (M = 36 s) than infants at elevated likelihood of developing ASD (M = 26 s) on average. However, there are no significant group differences with respect to looking during the other phases of habituation or the number of trials required to achieve the habituation criterion. For bouncing ball, there are no significant main effects of risk group on habituation performance.

Table 1 Mean look duration across habituation phases by risk group and condition

For speaking face, age is significantly negatively correlated with the average look duration across the last three habituation trials and with the number of trials required to achieve habituation, indicating that older infants habituated faster and looked less at the stimulus by the end of the procedure than younger infants did. For bouncing ball, age is significantly negatively correlated with the average look duration across the first three habituation trials but not with looking during the pretest or the last three habituation trials.

On average, infants’ maximum look during test trials for speaking face (M = 18.15 s) was more than twice as long as their average look duration across the last three habituation trials (7.67 s), representing a 137% increase in looking. A similar proportion was observed for bouncing ball (122% increase). A set of paired samples t-tests revealed that these differences in looking were significant across both groups and both conditions (Face/TD: M = 8.70, t(48) = 7.04, p < 0.01; Face/ASD: M = 13.10, t(32) = 5.20, p < 0.01; Ball/TD: M = 8.59, t(47) = 5.54, p < 0.01; Ball/ASD: M = 11.10, t(24) = 3.91, p < 0.01), further suggesting that infants were successfully dishabituated to the asynchronous test stimuli at some point. The proportions of infants in each group who looked longest during each test trial (333 ms, 500 ms, 666 ms, 833 ms, 1000 ms) are displayed in Figs. 1a and b. For the speaking face, most TD infants looked longest during the 500 ms trial whereas most of the infants at elevated likelihood of developing ASD looked longest during the 666 or 833 ms trials. The pattern for bouncing ball is less clear: around a third of infants in both groups looked longest during the 666 ms trial, but more of the at-risk infants than TD infants looked longest during the 1000 ms trial. The millisecond asynchrony offset of the trial containing the maximum look duration was taken as the audio-visual temporal binding window.

Fig. 1
figure 1

a Proportion of infants who looked longest during each test trial (speaking face). b Proportion of infants who looked longest during each test trial (bouncing ball)

Across groups, the temporal binding window is significantly smaller for speaking face than for bouncing ball, F(1, 53) = 4.48, p < 0.05, η2 = 0.03. Across conditions, the temporal binding window is significantly smaller for TD infants than for infants at elevated likelihood of developing ASD, F(1, 53) = 7.67, p < 0.01, η2 = 0.04. There was no significant condition × group interaction in this regard, F(1, 65) = 0.01, p = 0.92. However, because not all participants completed both tests, separate independent samples t-tests were also conducted. This analysis revealed that TD infants exhibited a significantly smaller temporal binding window than at-risk infants for the social event, t(80) = − 2.68, p = 0.01, Cohen’s d = 0.59. However, there were no significant group differences in this regard for bouncing ball. A set of paired samples t-tests further suggested that the main effect of condition (Speaking Face < Bouncing Ball) was only significant among TD infants, t = − 2.13, p < 0.05. Among the at-risk infants, the size of the temporal binding window for Speaking Face was not significantly smaller than that for Bouncing Ball, t = − 1.29, p = 0.22. Thus, unlike TD infants, at-risk infants did not show a significantly smaller temporal binding window for the social event than for the nonsocial event (see Fig. 2).

Fig. 2
figure 2

Average temporal binding window size by group and condition. Note: **p < 0.01, *p < 0.05; TD, typically developing; EL, at elevated likelihood of developing ASD

Sensitivity to audio-visual synchrony and language production

Finally, to explore the implications of multisensory processing for language development, we examined whether the size of the audio-visual temporal binding window was associated with vocabulary production. First, because infant age at the time of language testing (M = 22.43 months, SD = 6.04) is significantly positively correlated with the CDI vocabulary score (r = 0.84), a residualized score (controlling for infant age) was calculated; this variable is normally distributed (M = 0, SD = 116.59, Range = 636.36, Skew = 0.37, Kurtosis = 1.22). Subsequently, the temporal binding window was modeled as a continuous predictor of vocabulary with infant group (TD vs. elevated likelihood of developing ASD) as a categorical moderator; separate models were conducted for the Speaking Face and Bouncing Ball conditions.
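For readers who want the mechanics, the residualization step amounts to regressing the raw CDI score on age (ordinary least squares with an intercept) and retaining the residuals. The sketch below is illustrative only; the function name and data values are hypothetical, not the study’s software or data:

```python
def residualize(scores, ages):
    """Regress `scores` on `ages` (OLS with intercept) and return residuals."""
    n = len(scores)
    mean_a = sum(ages) / n
    mean_s = sum(scores) / n
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(ages, scores))
    var = sum((a - mean_a) ** 2 for a in ages)
    slope = cov / var
    intercept = mean_s - slope * mean_a
    return [s - (intercept + slope * a) for s, a in zip(scores, ages)]

# Illustrative data: vocabulary rising with age in months
ages = [17, 19, 22, 25, 28, 30]
raw_cdi = [40, 110, 230, 320, 480, 560]
resid = residualize(raw_cdi, ages)
# By construction, residuals are uncorrelated with age and sum to ~0
print(max(abs(r) for r in resid))
```

The residuals then serve as the age-corrected vocabulary outcome in the moderation models.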

The overall model for Speaking Face was significant, F(4, 48) = 2.99, p < 0.05, R2 = 0.13. A main effect of group was observed, as well as a significant interaction between group and the size of the temporal binding window. To probe the interaction, a simple slopes analysis was conducted. Among TD infants, the slope is not significant (p = 0.34), but among infants at elevated likelihood of developing ASD, the size of the temporal binding window is significantly positively associated with vocabulary (B = 0.30, p < 0.05; see Fig. 3). A similar interaction effect was observed for Bouncing Ball, but the model did not exceed the significance threshold, F(4, 48) = 2.16, p = 0.09.

Fig. 3
figure 3

Interaction of temporal binding window and infant risk group on vocabulary production. Note: vocabulary is residualized for age

Discussion

In early development, the perception of temporal synchrony within audio-visual speech is thought to support language acquisition and may be disrupted in infants at elevated likelihood of developing ASD [65]. Despite a large body of research on multisensory integration in TD infants (for reviews see Lewkowicz 2000 [4] or Bahrick and Lickliter 2012 [66]) and children diagnosed with ASD [67–71], few if any studies have investigated multisensory processing in infants at elevated likelihood of developing ASD. This work is crucial for identifying early markers of the disorder because features of atypical development consistent with a broader autism phenotype are often detectable before 1 year of age [59, 72–74]. Thus, filling a critical gap in the literature, the present study examined and compared sensitivity to temporal asynchrony within audio-visual events among infants at elevated likelihood of developing ASD and their TD counterparts.

In general, our primary hypothesis was supported: infants at elevated likelihood of developing ASD had a significantly larger audio-visual temporal binding window for a social event than TD infants, although the size of the temporal binding window for the non-social event (Bouncing Ball) did not significantly differ between groups. This suggests that TD infants were more sensitive to the asynchrony between auditory and visual speech cues than infants at elevated likelihood of developing ASD, and it is consistent with previous research showing larger audio-visual temporal binding windows for social, but not non-social, events in older children with ASD compared to TD children [17, 18, 30]. Additionally, for TD infants, the audio-visual temporal binding window for the social event was significantly smaller than for the non-social event, whereas for infants at elevated likelihood of developing ASD, performance in these conditions was relatively similar. In general, these findings suggest that infants who are at elevated likelihood of developing ASD may process multisensory information for social events less efficiently than TD infants [10, 75, 76].

Given that language delays are common among children with ASD, we expected that the size of the audio-visual temporal binding window (especially for the social event) would help explain variation in early productive vocabulary. However, our hypothesis was not supported. There was no significant relation between the temporal binding window for Speaking Face and language for TD infants; for infants at elevated likelihood of developing ASD, however, a significant positive association was observed, such that a larger productive vocabulary was observed among those with a wider temporal binding window for Speaking Face. This is not consistent with the idea that a smaller temporal binding window reflects greater sensitivity to audio-visual asynchrony and a better ability to integrate auditory and visual cues. It may suggest, however, that infants at elevated likelihood of developing ASD rely on different strategies for acquiring language. The size of the temporal binding window for the non-social event (Bouncing Ball) was not significantly associated with language production in either group.

In general, our findings are consistent with prior literature supporting a distinction between social and nonsocial information processing. Indeed, a number of studies have identified perceptual abnormalities in children with ASD during audio-visual tasks involving human faces and voices, but not during tasks involving nonhuman stimuli, when compared to their TD counterparts [11, 21, 77–79]. For instance, relative to TD children, those with ASD have significantly more difficulty visually orienting to social stimuli such as their name being called, but only slightly more difficulty orienting to non-social stimuli such as a musical toy [80]. Perceptual difficulties with social stimuli in individuals with ASD are further highlighted by their impaired performance on the McGurk illusion [27, 81–83]. Individuals with ASD perceive the McGurk illusion less often than their peers without ASD, often relying instead on the auditory modality to the exclusion of the visual information [12, 21, 32]. Thus, our findings agree with previous research conducted with older children (already diagnosed with ASD) in suggesting a specific deficit in multisensory integration for social information and further suggest that this deficit is already present in infancy.

Importantly, our findings are also consistent with Smith et al. [31] and a number of eye-tracking studies [40, 41, 84–88] in which TD children selectively attended to the lip movements of talking faces while children with ASD showed reduced attention to this region [89–91]. That is, one potential reason why the infants at elevated likelihood of developing ASD in our study took longer to become dishabituated in the speaking face condition is that they were not looking at the mouth. By focusing attention on the speaker’s mouth, infants may be able to understand the action of speech in relation to their own body; it may allow them to integrate what they are hearing with what they are seeing in order to reproduce the action themselves (i.e., imitation). Although no previous study has examined whether enhanced attention to a speaker’s mouth in infancy is associated with greater vocal imitation, it has been shown to predict greater language production at 24 months [43]. More recently, Habayeb et al. [92] reported that mouth-looking in 1- to 2-year-old infants was significantly associated with greater expressive language and that infants at elevated likelihood of developing ASD looked significantly less at the mouth than TD infants. Thus, regardless of why, if infants at elevated likelihood of developing ASD do not selectively attend to the mouth region of a speaking face to the same degree as TD infants, their expressive language development may be impaired.

Neurobiological processes may contribute to the impairments in audio-visual speech processing in ASD. Some studies, for instance, have suggested that the left temporal cortex fails to become specialized for speech processing in individuals with ASD [93], but how this might relate specifically to infants’ audio-visual temporal synchrony is unclear. The mirror neuron system (MNS), in addition, has been implicated in language development [94, 95] and is thought to be disrupted in autism [96]. In this system, neurons in the sensorimotor cortex that fire when an action is performed also fire when that same action is observed (i.e., performed by another person). In this way, the MNS is thought to be important for action understanding and “self-other mapping” [97] or the “translation of seeing and hearing into doing” [98]. In developmental EEG studies, greater MNS activation during action observation in infants is associated with better subsequent imitation of the action [99, 100]. Considering that speech is an action that can be observed and imitated, it seems possible that infants could exhibit MNS activation during audio-visual speech and that it could play a role in the development of expressive language.

Although empirical work on the MNS in infants has largely focused on the perception of manual, object-directed actions (e.g., tapping a block), there is increasing evidence that the MNS is responsive to other types of actions, including communicative gestures (e.g., pointing) and facial movements [101–103]. Given the evidence that TD infants begin attending more to the mouth region of a speaking face towards the end of the first year [41], and that greater attention to the mouth region at this time is associated with better language development [43, 104], it seems highly possible that MNS function plays a role in this process. Whether MNS activation during audio-visual speech is a cause or consequence of heightened visual attention to the speaker’s lips is unclear, but it may be disrupted in infants at elevated likelihood of developing ASD. Thus, an important direction for future research involves replication of the present study in conjunction with eye-tracking and neuroimaging methods (e.g., EEG).

Conclusion

This study is not without its limitations. In addition to the fact that we did not incorporate eye-tracking technology into the assessments, we had a fairly heterogeneous group of infants at elevated likelihood of developing ASD, some having been born preterm or at low birth weight and others having an older sibling with a diagnosis. Although each of these criteria has been associated with elevated likelihood of developing ASD, they may involve very different etiological pathways. Thus, our inability to control for the type of risk factor in our analyses is a limitation. A related limitation is that we were unable to provide information about which infants were ultimately diagnosed with ASD. Finally, although the effect of age on vocabulary was accounted for in our statistical analyses, it is a limitation of the study that vocabulary was not assessed at the same age for all participants.

In summary, results of the current study suggest that the early characteristics of ASD in infants at elevated likelihood of developing the condition also include sensory integration difficulties, specifically with regard to sensitivity to audio-visual synchrony. While this notion is still speculative, our findings contribute to a growing body of literature indicating that sub-clinical autistic behaviors may be present in children who might not yet fulfill all the clinical criteria for an ASD diagnosis. Although additional research is needed to understand the link between audio-visual sensory integration and language development in both TD and at-risk infants, this study represents an important first step towards understanding the nature of attention deficits that contribute to ASD and further suggests that problems in multisensory integration may be present in infants at elevated likelihood of developing ASD long before a clinical diagnosis is usually made.