Introduction

Gestures are an important aspect of human communication. Humans begin gesturing as early as 10 months of age (Bates, 1979), and these early gestures serve as important precursors to later spoken language. A substantial body of research has demonstrated that co-speech gesture enhances communication (Driskell & Radtke, 2003; Goldin-Meadow, 1999; Hostetter, 2011) and facilitates learning and memory (Beattie & Shovelton, 2001; Breckinridge Church et al., 2007; Cook et al., 2010; Thompson et al., 1998). Nonetheless, many questions remain about when and how these benefits occur (for recent reviews, see Dargue et al., 2019; Kandana Arachchige et al., 2021). Among other examples, current evidence conflicts about the role of gesture meaningfulness in memory enhancement (Feyereisen, 2006; Kartalkanat & Göksun, 2020; Levantinou & Navarretta, 2016; So et al., 2012; Straube et al., 2014), the circumstances under which incongruent gestures impair understanding of language (Habets et al., 2011; Wu & Coulson, 2007b), and the extent to which gestures impact early versus late stages of language processing (Kelly et al., 2004; Wu & Coulson, 2005, 2007a).

Research on how gesture impacts cognition can inform understanding of autism spectrum disorder (ASD), a developmental disorder marked by deficits in social communication for which diagnostic criteria include difficulties with nonverbal communication, including but not limited to the understanding and use of gestures (American Psychiatric Association, 2013). For children and adults alike, the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2; Lord et al., 2012) directs clinicians to assess the quality and quantity of gesture use during social interaction. Compared to their typically developing peers, autistic infants and toddlers show delays and decreases in gesture production—particularly deictic (i.e., pointing) gestures (Manwaring et al., 2018; Özçalışkan et al., 2016) and those with a communicative function (e.g., gestures directing attention to oneself or to an object to express a shared interest; Mishra et al., 2020; Watson et al., 2013). The frequency of gesture production is generally comparable between adolescents and adults with and without autism (de Marchena & Eigsti, 2010; Lambrechts et al., 2014a), but there are nuanced differences in the timing and motion of gestures. Adults with autism produce slower movements and more pauses between gestures compared to non-autistic adults, and such temporal features are associated with measures of motor cognition and social awareness (Trujillo et al., 2021). In some instances, autistic adults alter nonverbal behaviors to present as more neurotypical (i.e., masking; Cook et al., 2022) and tend to rely on gesture production to signal conversational turns to a greater extent than their non-autistic peers (de Marchena et al., 2019).

Research findings regarding gesture comprehension in autistic populations are inconsistent (Dimitrova et al., 2017; Lambrechts et al., 2014b; Clements & Chawarska, 2010; de Marchena & Eigsti, 2010; Silverman et al., 2017; Fourie et al., 2020; Hubbard et al., 2012). Some studies report impaired gesture comprehension in children and adults with ASD, as evidenced by less accurate matching of meaningful gestures to appropriate pictures, words, or objects (Cossu et al., 2012). Atypical comprehension and integration of gesture with speech may be particularly impacted when nonverbal cues are communicative (e.g., pointing versus grasping; Aldaqre et al., 2016) or pro-social (e.g., pointing to share an experience versus pointing to request a desired item; Clements & Chawarska, 2010). However, other studies using similar tasks have found no comprehension differences (Dimitrova et al., 2017; Adornetti et al., 2019).

Many factors, such as participant age, symptom type and severity, and task demands, may account for these and other discrepancies. However, heterogeneity in the stimuli used across studies likely contributes substantially to conflicting findings. Agostini et al. (2019) outlined several sources of stimulus-based variability that may impact the results of gesture comprehension studies, including the overall comprehensibility of the gestures, the types of gesture (e.g., pantomimes versus emblems), and whether the videos or images depict the actor's entire face and body, the torso only, or just the hands. Other important considerations for studies pertaining to ASD may be the complexity of the movement involved in the gesture (Trujillo et al., 2021) and whether the actor's face is visible (Dawson et al., 2005), given evidence that people with ASD spend less time than people without ASD viewing the faces of actors when observing and attempting to imitate gestures (Vivanti et al., 2008).

We have developed a database of 162 well-characterized videos of hand gestures ranging in iconicity, normed for comprehensibility by groups of participants with and without a self-reported diagnosis of ASD. During the norming process, adult participants from both diagnostic groups were asked to: (1) provide a numerical rating of the meaningfulness of each gesture, and (2) regardless of meaningfulness, provide a one-word label that best describes the gesture. The inclusion of ambiguous as well as meaningful gestures makes this stimulus set well suited for research on the role of semantics in gesture comprehension and/or imitation (e.g., "dual-route" theories of imitation; Stieglitz Ham et al., 2011; Bartolo et al., 2001). All gestures were edited to be of uniform duration. The gesture stimuli are silent, and the actor's face and mouth are obscured from view, giving other researchers the option of embedding audio. The normed database of gestures is publicly available via the Open Science Framework (OSF) repository, https://osf.io/a3pg7/.

In addition to producing a stimulus database for future research, the present study is the first to our knowledge to examine gesture naming and perceived meaningfulness of iconic and ambiguous gestures in people with and without ASD. Importantly, open-ended naming tasks may be sensitive to differences in gesture comprehension that are too subtle to impact performance on traditional matching tasks. Thus, along with the numerical meaningfulness ratings and the set of verbal responses given by each group, we also provide overall and group-specific measures for each gesture of: (1) the number of distinct responses generated during the naming task; (2) the information-theoretic measure of entropy, which is sensitive to the level of competition or dominance among possible responses; and (3) the mean semantic distance between each response and every other response, based on corpus-derived semantic metrics. Together, these measures permit comparisons of response heterogeneity at both the lexical level (the number of distinct responses generated) and the semantic level (the extent to which the responses generated are similar in meaning).

Finally, to further enhance the utility of this resource, we include: (1) measures that quantify the overall amount of movement involved in the gesture, and (2) the age-of-acquisition (based on existing norms; Kuperman et al., 2012) of the labels given by each participant to each gesture. These ancillary measures will benefit researchers wishing to control for motion variability and/or tailor their use of the stimulus set for child or adolescent participants.

Methods

Participants

Participants were recruited on Prolific (http://www.prolific.co), a recruitment platform for online research. Prolific’s stand-alone demographic profiles and screening tools were used to selectively advertise the study to participants who identified as being between the ages of 18 and 40, living in the United States, and native and primary speakers of English. An additional screener was used to advertise to a specific number of participants whose profile endorsed having received a formal clinical diagnosis of autism spectrum disorder as either a child or an adult (ASD group), as well as an equal number of participants who self-reported that they had never received an autism diagnosis (non-autistic or NA group). Importantly, the nature of Prolific’s screening capabilities is such that participants are unable to know why any specific study has been made available to them, and autism was not mentioned as an inclusion criterion in our study description. These procedures are in line with empirically supported best practices for minimizing instances of participants misrepresenting themselves to gain access to the study or receive compensation, which is a concern for online research (Chandler & Paolacci, 2017). A detailed description of our recruitment and screening processes is provided in the Supplemental Methods.

Three versions of the task were available, each containing a different set of gestures. Participants were allowed to complete more than one version but were not allowed to complete the same version more than once. Overall, we received 182 submissions from 128 individuals who reported having a diagnosis of autism and 137 submissions from 129 individuals who reported not having a diagnosis. Submissions were excluded from analyses if they were incomplete, included low-effort responses, or if the participant provided demographic information that conflicted with their Prolific profile (see Supplemental Methods for details). This process resulted in the exclusion of 17 submissions from 15 participants in the NA group and 62 submissions from 45 participants in the ASD group (Tables S1 and S2), leaving a total of 120 submissions per group (40 for each task version) that came from 114 and 83 unique NA and ASD participants respectively. Table 1 depicts demographic information and group comparisons for age, gender, race, and level of education. The only significant difference in any of these metrics with respect to autism status was race, which was driven by a higher proportion of Asian participants in the NA group.

Table 1 Participant characteristics

This study was approved by the Louisiana State University Institutional Review Board and participants provided informed consent prior to participating.

Stimuli

A set of 162 hand gestures was filmed, without sound, on a 13-inch MacBook Pro using Photo Booth (Apple Inc.). An actor, seated on a chair, was visible only from the neck to the waist. Of the 162 gestures, 108 were categorized as iconic gestures and the remaining 54 as nonsense gestures. We use the term "iconic" to refer to gestures that tend to evoke specific, identifiable action concepts when viewed in isolation (e.g., in the absence of concurrent speech or other disambiguating context), and "nonsense" to refer to gestures that are perceived as relatively meaningless or ambiguous in isolation (see Fig. 1 for examples). These classifications were made based on norming data from a pilot study that we conducted, which is described in detail in Supplemental Methods. It is important to note that, both during piloting and in the data presented here, a wide range of meaningfulness values is evident within each category. Thus, these labels should be viewed as relative rather than absolute, and we include them primarily as informative descriptors for researchers who wish to use this stimulus set. Meaningfulness ratings and free-response labels are available on OSF for all gestures.

Fig. 1

Stills of example gesture video stimuli and stimulus timing. Note. A An iconic gesture intended to represent clapping or applause, B a nonsense gesture with no intended meaning, and C an iconic gesture intended to convey the word insert or into. Each gesture video was edited into an 8.5-s clip. Gesture video clips began with the actor’s hands resting in her lap for 2.5 s. Each gesture lasted precisely 2.5 s and the clip ended with 3.5 s of stillness.

Videos were edited into 8.5-s clips using Adobe Premiere Pro (Adobe Inc.). As shown in Fig. 1, the timing of every video clip was standardized, such that the gesture was initiated 2.5 s into the clip, lasted 2.5 s, and was followed by 3.5 s of stillness. To ensure that every gesture lasted precisely 2.5 s, some of the gestures were slightly sped up or slowed down from the original recorded speed. Each video clip was exported to .mp4 format at 540-pixel resolution using H.264 (AVC) compression. The frame rate was 24 frames per second, for a total of 204 frames per video. Gestures were randomly divided into three groups for norming, each consisting of 36 iconic gestures and between 17 and 19 nonsense gestures.

Procedure

After enrolling in the study on Prolific, participants were directed to Qualtrics (Qualtrics, Provo, UT) to complete the task. After reading the task instructions, participants were presented with one gesture video at a time. For each video, participants were asked to respond on a 0–4 scale to the question, "How meaningful did you find this gesture?", with anchor points at 0 (not meaningful), 2 (somewhat meaningful), and 4 (completely meaningful). Next, participants were asked, "If you had to choose one word to describe this gesture, what would it be?" and were instructed to limit their response to one word. Four attention-check trials were also shown. During these trials, the actor was present and seated in the same position without movement, and a text overlay that appeared two seconds into the video read "ATTENTION CHECK" and instructed the participant to assign a meaningfulness rating of 4 to that trial.

Participants could watch each video more than once, and responses were untimed. At the end of the study, participants were asked to complete a brief optional demographics form, which included questions about age, gender, education, native language, American Sign Language (ASL) fluency, handedness, and ASD diagnosis. The mean time taken to complete the study was 30 minutes.

Free response lemmatization

Prior to statistical analysis, we lemmatized all 12,960 free responses using a combination of manual and automated procedures. Lemmatization is a process in which words are reduced to their base vocabulary form by removing verb tense, pluralization, and other inflections (Jongejan & Dalianis, 2009). Lemmatizing words improves precision in text analysis because different variations of the same base word (e.g., breaking, broke, broken, breaks) can be treated as the same response (e.g., break).

Prior to lemmatization, misspelled responses were corrected using the spell-check function in Excel. All atypical responses were documented and reviewed by two experimenters. Homonyms were corrected when it was clear that the wrong word had been provided (e.g., peddle was changed to pedal when given in response to the gesture for pedaling). When participants provided more than one response, only the first word was retained (e.g., if a participant entered drive/steer, we kept drive). Responses potentially perceived as offensive (e.g., curse words or sexual content; n = 12), non-words (n = 3), and responses that exceeded two words (n = 16) were excluded from analyses. We also excluded responses that appeared to be typos (n = 1), responses indicating technical problems (n = 1), "NA" entries (n = 11), and entries left blank (n = 85). Responses indicating a lack of effort (e.g., "don't know", "unsure", "meaningless") were excluded (n = 132) unless they were given a meaningfulness rating of 3 or 4 or were made in response to the gestures for balancing, teetering, guessing, or shrugging (n = 76 retained); in the latter cases, responses such as "don't know" or "unsure" were appropriate to the corresponding gesture and were included in analyses. In total, these procedures resulted in the exclusion of 261 of the 12,960 responses, or 2.01% of the data. All spelling corrections and excluded responses are documented and available on our OSF page.

We lemmatized the data using the hash_lemmas lemmatization dictionary contained in the "lexicon" R package (Rinker, 2018a). This dictionary includes over 40,000 words and their corresponding lemma forms. We applied this dictionary to all free-response data using the "textstem" R package (Rinker, 2018b), an automated text-regularization program that changes words to their lemma forms. The output of the program was then reviewed by two experimenters. One change in the lemmatized data was not appropriate (feel was changed to fee) and was corrected accordingly. We also accepted cheers as a salutation, rather than the lemma cheer, because this response matched the corresponding gesture.
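For illustration, the core of the automated step can be reproduced in a few lines of R. The following is a minimal sketch, assuming the cleaned responses are stored in a character vector; the example words are illustrative, and the manual review described above is represented only by a comment.

```r
# Minimal sketch of the automated lemmatization step.
# Assumes the "textstem" and "lexicon" packages are installed.
library(textstem)

responses <- c("breaking", "broke", "broken", "breaks", "cheers")

# lemmatize_words() maps each word to its lemma; by default it draws on
# the lemmatization dictionary distributed with the lexicon package.
lemmas <- lemmatize_words(responses)
# "break" "break" "break" "break" "cheer"

# As described above, the automated output was then manually reviewed
# (e.g., "cheers" was restored when it matched the gesture's meaning).
```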

Calculation of ancillary measures

Motion tracking

Motion tracking was applied to each video using the software OpenPose (Cao et al., 2018), a motion-tracking program that uses a deep neural network trained to identify human body poses in videos. We used OpenPose's "body25" model (which is recommended in the context of stimulus control; Trettenbrein & Zaccarella, 2021) to extract the x- and y-coordinates of keypoints on each wrist, elbow, and shoulder, as well as on the top and bottom of the torso, for each frame of each video (see Supplemental Fig. 1 for an illustration). We then used the R package "OpenPoseR" (Trettenbrein & Zaccarella, 2021) to quantify the amount of bodily motion between adjacent frames. This process involves first computing the velocity of the individual keypoints along the x and y axes, and then computing the Euclidean norm of the sums of the velocity vectors (ENSVV), which yields a single value per frame representing the total change in motion relative to the prior frame.

Prior to calculating velocity, the motion-tracking data were cleaned using OpenPoseR's file_clean function. This function identifies instances in which the model failed to fit a particular point or did so with low confidence (cutoff = 0.3) and imputes these values with the mean of the preceding and following frames. ENSVVs that were extreme outliers (values > 8000) were also imputed. Finally, a third-order Kolmogorov–Zurbenko filter (span = 5) was applied to each gesture's sequence of ENSVVs to remove remaining high-frequency jitter due to sampling error. Filtering was implemented using the R package "kza" (Close et al., 2020).
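To make the ENSVV computation concrete, the sketch below implements it directly from cleaned keypoint coordinates rather than through OpenPoseR (whose internals we do not reproduce here). The array shape and values are hypothetical, and the outlier imputation is a simple placeholder for the procedure described above.

```r
# Sketch of the ENSVV calculation and KZ smoothing, assuming `xy` is a
# frames x keypoints x 2 array of cleaned x/y pixel coordinates.
library(kza)

ensvv <- function(xy) {
  n_frames <- dim(xy)[1]
  out <- numeric(n_frames - 1)
  for (f in 2:n_frames) {
    vel <- xy[f, , ] - xy[f - 1, , ]  # per-keypoint velocity vs. prior frame
    v_sum <- colSums(vel)             # sum of x-velocities and of y-velocities
    out[f - 1] <- sqrt(sum(v_sum^2))  # Euclidean norm of the summed vector
  }
  out
}

# Hypothetical data: 204 frames, 8 keypoints
xy <- array(rnorm(204 * 8 * 2, mean = 300, sd = 5), dim = c(204, 8, 2))
raw <- ensvv(xy)

# Impute extreme outliers (placeholder), then apply a third-order
# Kolmogorov-Zurbenko filter with span = 5
raw[raw > 8000] <- mean(raw[raw <= 8000])
smoothed <- kz(raw, m = 5, k = 3)
```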

Frame-by-frame ENSVVs for each video are available on OSF, as are the mean values for each video during the pre-gesture interval (seconds 0–2.5, frames 2–60), the gesture interval (seconds 2.5–5, frames 61–120), and the post-gesture interval (seconds 5–8.5, frames 121–204). Overall, the mean per-frame ENSVV for iconic gestures was 89.61 during the pre-gesture interval (SD = 17.55), 402.82 during the gesture interval (SD = 110.87), and 81.56 during the post-gesture interval (SD = 12.66). For nonsense gestures, the corresponding values were 91.29 (SD = 21.10), 476.72 (SD = 132.58), and 81.12 (SD = 17.04).

Response age-of-acquisition

To estimate the age of acquisition (AoA) of the lemmatized versions of the words used to describe the gestures, we turned to a published set of norms for 30,121 English content words collected from people residing in the U.S. (Kuperman et al., 2012). It is important to note that the age at which a word that is used to describe a gesture is learned may differ somewhat from the age at which that word’s meaning can be clearly conveyed by a specific gesture. For example, while the word “anger” has an AoA of six, older children and adults may be able to infer anger from a wider range of nonverbal cues than 6-year-olds. Nonetheless, given that ASD is a developmental disorder and often studied in children or adolescents, we include this measure as a starting point for researchers interested in identifying subsets of the gestures that may be appropriate for young research participants.
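As a sketch of how these values can be attached to the response data, the R snippet below joins lemmatized labels to the published norms. The file name is illustrative, and the column names (Word, Rating.Mean) reflect our understanding of the publicly distributed version of the Kuperman et al. (2012) norms; both should be verified against the downloaded file.

```r
# Join lemmatized responses to the Kuperman et al. (2012) AoA norms.
# File path and data frame contents are illustrative.
aoa_norms <- read.csv("kuperman_aoa_norms.csv", stringsAsFactors = FALSE)

responses <- data.frame(
  gesture = c("inserting", "inserting", "clapping"),
  lemma   = c("insert", "deposit", "applaud")
)

# Look up each lemma's mean AoA rating; lemmas absent from the norms get NA
responses$aoa <- aoa_norms$Rating.Mean[match(responses$lemma, aoa_norms$Word)]

# Mean AoA per gesture, ignoring responses without published ratings
aggregate(aoa ~ gesture, data = responses, FUN = mean, na.rm = TRUE)
```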

AoA ratings were available for 12,408 of the 12,960 (96%) responses in the dataset, with a mean of 76.7 rated responses per gesture (SD = 5.36, range = 34–80). Per gesture, the number of responses with available AoAs did not differ between the ASD group (M = 38.24, SD = 2.71) and the NA group (M = 38.36, SD = 2.85), t(161) = 1.08, p = 0.28, Cohen's d = 0.04, nor did the mean AoA of the responses given to the gestures by each group (ASD: M = 5.77, SD = 0.86; NA: M = 5.75, SD = 0.93), t(161) = 0.53, p = 0.60, Cohen's d = 0.02. Overall, the mean AoA for responses given to iconic gestures was 5.70 (SD = 0.93, range = 3.55–7.98); for nonsense gestures, the mean AoA was 5.86 (SD = 0.70, range = 4.30–7.56).

Analytic strategy

Statistical analyses served two main purposes. The first was to validate differences in perceived meaningfulness between the 108 gestures categorized a priori as "iconic" and the 54 gestures categorized as "nonsense". In addition to higher average meaningfulness ratings for the iconic relative to the nonsense gestures, we would also expect more consensus among raters in the free responses produced for iconic gestures, which should manifest as lower response diversity (i.e., fewer distinct words used to name each gesture), lower response entropy (greater dominance of some responses over others), and higher semantic similarity among the responses to a given gesture. Semantic similarity scores were calculated using Global Vectors for Word Representation (GloVe; Pennington et al., 2014), an unsupervised learning algorithm that extracts word vector representations from co-occurrence probabilities in natural language corpora. Procedures for calculating response diversity, entropy, and semantic similarity are described below.

Our second analysis goal was to compare the response profiles of participants with and without a diagnosis of ASD. Because group differences may vary at different levels of ambiguity, each variable of interest (meaningfulness, response diversity, response entropy, and response set semantic similarity) was analyzed using 2 (Gesture Category: Iconic vs. Nonsense) × 2 (Group: ASD vs. NA) mixed-factor ANOVAs. In addition, we computed across-gesture correlation coefficients between the values obtained for each variable from the ASD and NA samples, separately for the iconic and nonsense gestures. This set of analyses examined the extent to which relative differences in the measures of interest among gestures were similar between the groups.
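For readers who wish to reproduce this analytic structure, the sketch below shows one way to specify the mixed-factor ANOVA (gesture as the unit of analysis, with category varying between gestures and group within gestures) using the afex package; the data frame and column names are illustrative, not the authors' actual analysis code.

```r
# Sketch of the 2 (category, between-gestures) x 2 (group, within-gestures)
# mixed-factor ANOVA and the across-gesture correlations. `dat` is assumed
# to be in long format with one row per gesture-by-group combination:
#   gesture_id, category ("iconic"/"nonsense"), group ("ASD"/"NA"),
#   and the dependent variable (here, mean meaningfulness).
library(afex)

fit <- aov_ez(
  id      = "gesture_id",
  dv      = "meaningfulness",
  data    = dat,
  between = "category",
  within  = "group"
)
summary(fit)

# Across-gesture correlation between groups, computed separately by category
# (`wide_dat` holds one row per gesture, with one column per rater group)
iconic <- subset(wide_dat, category == "iconic")
cor.test(iconic$meaningfulness_ASD, iconic$meaningfulness_NA)
```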

Response diversity

A response diversity score was calculated for each gesture within each group by dividing the number of unique responses by the total number of responses provided. This calculation considered only responses that appeared in the GloVe corpus.
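In code, the calculation reduces to a single line; the example labels below are hypothetical.

```r
# Response diversity: unique responses / total responses for one
# gesture-group combination (labels restricted to the GloVe vocabulary)
labels <- c("draw", "sketch", "draw", "doodle", "draw")
diversity <- length(unique(labels)) / length(labels)  # 3 / 5 = 0.6
```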

Response entropy

A score was calculated for each gesture within each group based on the information-theoretic measure of entropy (H; Shannon, 1948). For a gesture that was assigned a total of R unique meanings, entropy is calculated using the following formula, in which \(p_i\) is the proportion of participants who produced the i-th unique meaning:

$$H = -\sum_{i=1}^{R} p_i \log_2\left(p_i\right)$$

This measure provides an index of how predictable the label elicited by a specific gesture is, taking into account both the number and the distribution of responses. A higher entropy score indicates a relatively even distribution across the set of responses, whereas lower entropy scores occur when some responses are more dominant than others. As a hypothetical example, if the labels "draw", "sketch", "scribble", and "doodle" were each produced 25% of the time in response to a given gesture, the response entropy for that gesture would be 2.0. By contrast, if the "draw" label was produced by 75% of raters, "sketch" by 15%, "scribble" by 8%, and "doodle" by 2%, the resulting entropy value would be 1.13, reflecting the dominance of some responses over others. Only responses present in the GloVe corpus contributed to entropy scores.
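The worked example above can be verified directly:

```r
# Shannon entropy (base 2) over a vector of response proportions
entropy <- function(p) -sum(p * log2(p))

entropy(rep(0.25, 4))               # 2.00: four equally common labels
entropy(c(0.75, 0.15, 0.08, 0.02))  # 1.13: "draw" dominates
```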

Response semantic similarity

A mean semantic similarity score was calculated for the set of unique responses given to each gesture within each group. Pairwise similarity values were calculated from a 300-dimensional GloVe embedding. Using the "sim2" function of the "text2vec" R package (Selivanov et al., 2022), the cosine similarity was calculated for each pair of unique lemmatized responses provided for each gesture, based on their corresponding GloVe vectors. These pairwise values, which range from −1 to 1, were averaged to yield a single measure of response semantic similarity for each gesture-group combination. Higher similarity values indicate that the responses provided by participants tended to have closely related meanings, whereas lower similarity values identify gestures that elicited less closely related responses. For example, a gesture that received "yell", "shout", and "scream" as responses would receive a response similarity score of 0.61, whereas the response set "draw", "shake", and "tickle" would yield a score of 0.10.
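The following sketch illustrates the pairwise step, assuming `glove` is a matrix of 300-dimensional word vectors with words as row names (loading the pretrained embedding is omitted here):

```r
# Mean pairwise cosine similarity among one gesture's unique responses
library(text2vec)

labels <- c("yell", "shout", "scream")
vecs <- glove[labels, , drop = FALSE]   # one 300-d vector per response

# sim2() returns the full pairwise cosine-similarity matrix
sims <- sim2(vecs, method = "cosine", norm = "l2")

# Average the unique off-diagonal pairs for the gesture's similarity score
mean(sims[lower.tri(sims)])
```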

Results

The means and standard errors of each measure of interest (meaningfulness, response diversity, response entropy, and response semantic similarity) are presented in Table 2 and depicted in Fig. 2, subdivided by group and gesture type. Across-gesture correlations are shown in Fig. 3.

Table 2 Means and standard errors of gesture norming metrics
Fig. 2

Means and distributions of gesture norming metrics. Note. Violin plots depicting the means and distributions of A meaningfulness ratings, B response diversity, C response entropy, and D response semantic similarity, subdivided by gesture type (Iconic, Nonsense) and rater group (ASD, NA). Small black dots represent individual gestures and y-axis placement denotes the mean value assigned to that gesture by ASD and NA raters. Thin grey lines connect each gesture’s value from the ASD group with its corresponding value from the NA group. The black diamonds outlined in white represent condition means, and the white lines represent 95% confidence intervals

Fig. 3

Across-gesture correlations for gesture norming metrics between ASD and NA groups. Note. Scatterplots depicting across-gesture correlations between the ASD and NA groups for A meaningfulness ratings, B response diversity, C response entropy, and D response semantic similarity, subdivided by gesture type (Iconic: green circles, Nonsense: yellow triangles). Each data point represents an individual gesture and denotes the mean value for each measure from the ASD group (x-axis) and NA group (y-axis)

Meaningfulness ratings

As expected, a significant main effect of gesture category on meaningfulness ratings emerged, F(1, 160) = 91.27, p < 0.001, \({\eta }_{p}^{2}\) = 0.36, such that gestures categorized a priori as iconic were rated as more meaningful than those categorized as nonsense (Fig. 2a). Meaningfulness ratings did not differ between groups, F(1, 160) = 2.49, p = 0.12, \({\eta }_{p}^{2}\) = 0.02. However, there was a significant gesture category \(\times\) group interaction, F(1, 160) = 6.31, p = 0.01, \({\eta }_{p}^{2}\) = 0.04. Follow-up t tests revealed that participants in the NA group found nonsense gestures less meaningful than did those in the ASD group, t(53) = 2.67, p = 0.01, Cohen's d = 0.15, whereas no group difference was present for iconic gesture meaningfulness, t(107) = 0.17, p = 0.86, Cohen's d = 0.01. Note that the group difference for nonsense gesture meaningfulness should be interpreted with caution due to the negligible effect size.

Figure 3a depicts the across-gesture correlations between the average meaningfulness rating assigned to each gesture by the ASD group and the rating assigned by the NA group. These ratings were very strongly correlated for both iconic gestures, r(106) = 0.94, p < 0.001, and nonsense gestures, r(52) = 0.91, p < 0.001, indicating strong agreement between groups about the relative meaningfulness of certain gestures over others.

Response diversity

Main effects of both gesture category, F(1, 160) = 55.24, p < 0.001, \({\eta }_{p}^{2}\) = 0.26, and group, F(1, 160) = 30.90, p < 0.001, \({\eta }_{p}^{2}\) = 0.17, were present for response diversity (Fig. 2b). Specifically, nonsense gestures elicited a greater proportion of unique responses than did iconic gestures for both groups, and the response sets obtained from participants with ASD contained more unique labels on average than those obtained from participants without ASD. The interaction was non-significant, F(1, 160) = 2.79, p = 0.10, \({\eta }_{p}^{2}\) = 0.02.

Figure 3b depicts the correlations between the response diversity values obtained from the ASD and NA groups for each gesture. As with meaningfulness ratings, response diversity values were highly correlated across gestures for both gesture types: r(106) = 0.86, p < 0.001, for iconic gestures and r(52) = 0.81, p < 0.001, for nonsense gestures.

Response entropy

Analyses of response entropy revealed main effects of both gesture category, F(1, 160) = 48.36, p < 0.001, \({\eta }_{p}^{2}\) = 0.23, and group, F(1, 160) = 27.77, p < 0.001, \({\eta }_{p}^{2}\) = 0.15 (Fig. 2c). Specifically, nonsense gestures elicited higher response entropy than iconic gestures, and entropy was higher for response sets obtained from participants with ASD than for those obtained from participants without ASD. These effects were qualified by a significant interaction, F(1, 160) = 4.42, p = 0.04, \({\eta }_{p}^{2}\) = 0.03. Follow-up t tests revealed that the difference in entropy between the ASD and NA response sets was significant for iconic gestures, t(107) = 5.51, p < 0.001, Cohen's d = 0.23, but not for nonsense gestures, t(53) = 1.33, p = 0.19, Cohen's d = 0.11.

As shown in Fig. 3c, strong and significant across-gesture correlations were present between the response entropy values from the ASD and NA groups for both iconic gestures, r(106) = 0.91, p < 0.001, and nonsense gestures, r(52) = 0.81, p < 0.001.

Response semantic similarity

A significant main effect of gesture category was present for response semantic similarity, F(1, 160) = 12.09, p = 0.001, \({\eta }_{p}^{2}\) = 0.07 (Fig. 2d). This effect reflected greater similarity among labels assigned to iconic relative to nonsense gestures. The main effect of group was not significant, F(1, 160) = 0.17, p = 0.68, \({\eta }_{p}^{2}\) = 0.00, nor was the interaction, F(1, 160) = 0.24, p = 0.63, \({\eta }_{p}^{2}\) = 0.00.

Figure 3d depicts the correlation between the semantic similarity of the responses produced by the ASD group and the NA group for each gesture. The correlation was significant for both iconic gestures, r(106) = 0.54, p < 0.001, and nonsense gestures, r(52) = 0.70, p < 0.001.

Discussion

The goal of the present study was to provide a database of high-quality, well-characterized videos ranging in meaningfulness for use in gesture research. To our knowledge, only two sets of silent iconic gesture videos are currently available (Ortega & Özyürek, 2020; van Nispen et al., 2017), along with one set that includes iconic gestures, emblems, and meaningless gestures (Agostini et al., 2019). Several features distinguish the present database from these existing resources and complement them. First, each gesture video was carefully edited to be temporally uniform and precise, meaning that the videos are equal in length, with the same amount of time before, during, and after each gesture. This uniformity is beneficial for methods, such as event-related potentials (ERPs), that require precise attention to timing. Second, unlike in other databases, actor visibility is limited to the neck down, which required us to avoid gestures that incorporate the face or head (e.g., "applying lipstick"; Agostini et al., 2019). This lack of face visibility eliminates influences from social cues such as facial expression or eye gaze and allows users of the videos to embed auditory speech without creating incongruity with the actor's mouth and facial movements.

Third, and perhaps most notably, each video was interpreted and rated for meaningfulness by groups of participants with and without a diagnosis of ASD. Research into the extent and nature of gesture-processing difficulties in autistic children and adults has produced mixed results. Resolving these inconsistencies is of particular importance given that "deficits in understanding and use of gesture" is currently mentioned in the DSM-5 as one way to fulfill the diagnostic criterion of difficulty with nonverbal communicative behavior (American Psychiatric Association, 2013). One potential source of variability in research outcomes is heterogeneity in the stimuli used across studies (see Agostini et al., 2019, and Kandana Arachchige et al., 2021, for similar arguments). Well-characterized, openly available stimulus sets like the present database promise to accelerate research in this area by making it easier for multiple research teams to use the same stimuli, and by providing stimuli for which people with and without ASD were equally represented in the norming process. We have also included ancillary measures, such as motion quantification and lemma age-of-acquisition, that may be particularly relevant to autistic populations.

The present study is the first to our knowledge to compare gesture naming and perceived gesture meaningfulness between people with and without ASD. The data offer several insights. First, use of the meaningfulness scale was highly similar between the groups. For gestures categorized a priori as iconic, no group differences were found in perceived meaningfulness, and the elevated meaningfulness ratings that the ASD group gave the nonsense gestures had a negligible effect size. Second, consensus meaningfulness ratings from the two groups were very strongly correlated across gestures, as were measures of response diversity, response entropy, and response semantic similarity. This pattern of results suggests that the cues used to evaluate gestures for meaningfulness are comparable across groups, at least for stimuli such as these that contain only hand-based cues to meaning.

Group differences were present in the patterns of one-word names provided as stimulus labels. Relative to their non-autistic counterparts, autistic participants provided more unique gesture labels (i.e., greater response diversity) overall. Gesture labels provided by autistic participants also had higher entropy scores, indicating a relative lack of dominance of certain responses over others. That said, this greater variability in word choice among autistic relative to non-autistic participants does not necessarily mean that the concepts evoked by the gestures were different or more variable among autistic participants. Indeed, although the number of unique responses was larger for the ASD group, analyses of the semantic similarity values among these responses revealed no group differences. As an illustrative example, for the iconic gesture inserting (Fig. 1c), the labels insert, enclose, and into were given by participants in both groups. Autistic participants provided more than twice as many unique responses for inserting as non-autistic participants, including deposit, install, and sheath. However, the meanings of these words are highly similar both to one another and to the gesture's intended meaning.

It is helpful to consider the above results alongside research on forms of "unconventional language use" that have been documented in the autistic population, some of which are characterized by atypical word choices (for reviews, see Luyster et al., 2022, and Naigles & Tek, 2017). Dunn et al. (1996) found that children with ASD provided less prototypical examples of category members than both neurotypical and language-delayed children in a category fluency task in which they were asked to name animals and vehicles. This and similar findings have led to suggestions that ASD may be characterized by reduced lexical or semantic organization, sometimes combined with an advanced command of less frequent word forms (Naigles & Tek, 2017; Hilvert et al., 2019). Other forms of unconventional language use in ASD are social/pragmatic in nature, such as a tendency for autistic individuals to offer information that is more specific, technical, or detailed than is needed within a given discourse context (i.e., "pedantic speech"; De Villiers et al., 2007; Ghaziuddin & Gerstein, 1996). Accordingly, it is possible that the greater diversity of word choices provided by the ASD group during the naming task reflected different strategies evoked by our prompt ("If you had to choose one word to describe this gesture, what would it be?"). Non-autistic participants may have relied on lexical selection heuristics concerning communication efficiency and word accessibility (Koranda et al., 2022), whereas autistic raters may have been more likely to provide the most specific word that came to mind. Future studies could test this possibility by varying the instructions given to participants (e.g., "What is the very first word that this gesture brings to mind?") and/or by using a debrief questionnaire to probe participants' decision-making processes.

The online format of the current study presents a few limitations. No formal assessments of autism, language ability, or cognitive ability were conducted, and we relied on self-report to categorize participants as either autistic or not autistic. Although we took several empirically supported precautions to minimize deception on the part of participants, some risk of intentional or unintentional misclassification remains. This caveat is particularly relevant to the null results presented here, including the null result regarding group differences in semantic similarity among gesture labels. This and other results implying a lack of group differences (e.g., the high across-gesture correlations among meaningfulness ratings) should be interpreted with caution until additional research is conducted in which diagnosis status and associated cognitive and communication differences are formally validated.

It is also important to note that the autistic individuals who participated in this study likely do not represent the full spectrum of ability and disability that can be associated with ASD. For example, all participants had the language and other cognitive skills necessary to follow instructions and complete the task, as well as to create a profile on Prolific. Accordingly, the generalizability of our results may be constrained to individuals on the autism spectrum who have lower support needs. We took steps to make our study accessible, such as using a self-paced and repetitive trial format and allowing the videos to be viewed multiple times. Our task also passed all Web Content Accessibility Guidelines (WCAG) checks with one exception: our optional demographics questionnaire included a matrix response table, which can be difficult for users with cognitive disability or low vision to complete. Nonetheless, future research that goes further in offering accommodations—for example, by including the option for a caregiver to assist with navigating the online platform, understanding instructions, and/or entering responses—would be beneficial to test the generalizability of these results and to determine the suitability of our stimulus database for research that involves autistic people with higher support needs.

The fact that we recruited only American-English speakers residing in the United States places an additional constraint on generalizability. While our meaningful gestures were rated as highly iconic, the stimulus set contains a few gestures that could be appropriately described as emblems (McNeill, 1985). Emblematic gestures are socially learned and often culture-specific, such as holding a thumbs-up to convey approval or satisfaction (Agostini et al., 2019). Differences among languages may also influence how a gesture is interpreted. For example, Ortega and Özyürek (2020) point out that in Dutch, an action and its accompanying tool are often incorporated into one word, and provide the examples "knippen, 'to cut with scissors'" and "snijden, 'to cut with a knife'" (p. 56). Thus, the gestures included in this stimulus set may not be linguistically or culturally appropriate for populations outside the U.S. or for those who primarily speak a language other than English.

Despite these limitations, our stimulus set is well positioned to accelerate research on a variety of topics related to gesture processing, both with and without reference to ASD. Indeed, in a recent review, Kandana Arachchige et al. (2021) identified multiple areas of inconsistency across studies of how iconic gestures are integrated with speech that may stem from stimulus differences. For example, Habets et al. (2011) found evidence that incongruent gestures impair comprehension, but only when they are presented concurrently with speech or at stimulus onset asynchronies (SOAs) of less than 360 ms. By contrast, Kelly et al. (2004) and Wu and Coulson (2007a) reported impairing effects of incongruent gestures at SOAs of 800 ms and >1000 ms, respectively. Kandana Arachchige et al. (2021) proposed that differences in gesture ambiguity across studies may account for these discrepant findings, with asynchronous gestures primarily impacting language comprehension when they are relatively unambiguous. These authors also raised the possibility that speech-gesture integration impairment in ASD may differ in magnitude depending on whether the gestures provide information that is redundant with versus complementary to the information provided by the speech (see also Dimitrova et al., 2017; Perrault et al., 2019). The detailed information about gesture meaningfulness and interpretation provided in the present database can aid future research efforts to directly test these and other theoretically and clinically meaningful questions related to gesture processing.