Introduction

Speakers often produce co-speech gestures, or unplanned, fluent hand movements made while talking. Previous research suggests that co-speech gestures aid the communicative process, but whom these gestures help—the speaker during production, the listener during perception, or both—has been much debated. One view claims that gestures convey information that augments the information in the speech signal and thus enhances listener comprehension (e.g., Beattie and Shovelton 1999; Kendon 1983). In a meta-analysis of 63 studies examining the effect of gesture on speech comprehension, Hostetter (2011) concluded that there was a modest but significant effect of seeing co-speech gestures on message comprehension when those gestures were (1) spatial or motor related or (2) non-redundant with co-speech language. Based on these results, Hostetter claims that “Gestures do benefit comprehension, and this benefit is independent of any benefits gestures may have for a speaker’s production” (p. 311).

An alternative view is that gestures serve little communicative function for the listener but rather aid the speaker in formulating messages (e.g., Krauss 1998). In particular, co-speech gestures have been found to aid both lexical retrieval (Butterworth and Hadar 1989; Hadar et al. 2001; Krauss 1998; Pine et al. 2007; Rauscher et al. 1996) and the conceptual planning of utterances (Melinger and Kita 2007; Hostetter et al. 2007; Alibali et al. 2000). In addition to aiding the structuring of speech, gestures facilitate the maintenance of spatial representations in working memory (Wesp et al. 2001), such that individuals gesture more often when descriptions feature spatial components (for a review, see Hostetter 2011). According to this view, the primary function of co-speech gestures is to enhance speech fluency.

Interestingly, not all gesture types facilitate descriptions of spatial information. Spatial descriptions tend to feature representational gestures, which comprise both iconic gestures, which resemble their referents in form, and deictic gestures, which indicate a location or path (McNeill 1992). These representational gestures reflect spatial thinking and are more likely to co-occur with spatial words than with non-spatial words (Alibali 2005; Beattie and Shovelton 1999; Krauss 1998). Representational gestures also aid lexical retrieval when linguistic information has visuo-spatial components (Krauss and Hadar 1999; McNeill 1992) and enhance a speaker’s spatial memory (Morsella and Krauss 2004).

The utility of representational gestures also increases with task difficulty such that when visuo-spatial information is more complex, speakers’ production of representational gestures increases. Hostetter et al. (2007), for example, asked participants to describe the shape of dot arrays (e.g., “The top three dots form a triangle”). When arrays lacked guiding lines that simplified the task, speakers produced more representational gestures. These and similar findings provide evidence for the information packaging hypothesis (Kita 2000), which claims that gestures help speakers organize complex, visuo-spatial information into packages, or units, suitable for speaking. Speakers produce more representational gestures when the visuo-spatial information cannot be easily verbalized. Representational gestures can thus be conceptualized as a mode of thinking that helps translate complex visuo-spatial information into linguistic output.

Goldin-Meadow and Alibali (2013) have further suggested that representational gestures are, at least in part, intentionally used by speakers to help communicate salient information. There is circumstantial evidence for this. First, speakers gesture more when they can see one another than when they cannot (Bavelas et al. 2008; Mol et al. 2011). Second, speakers curtail certain gesture types when their audience cannot see them. Specifically, when interlocutors cannot see one another, the frequency of beat gestures, which convey no semantic information but bear a rhythmic relationship to the accompanying speech, remains high, while the frequency of representational gestures decreases (Alibali et al. 2001; Clark and Krych 2004).

Although it is important to consider how representational gestures may be intentionally used to communicate information, the fact that speakers may deploy these gestures intentionally does not imply that producing more of them achieves that end. Considering that representational gestures are produced more often when speakers describe complex visuo-spatial information (Hostetter et al. 2007), the presence of these gestures may instead signal accompanying speech that is uninformative or disfluent.

Current Studies

In the current studies we address the utility of co-speech hand gestures, with a particular focus on the role of specific types of gestures during the production and comprehension of messages. We asked three questions: (1) What specific types of gestures do speakers produce in spatial tasks of varying complexity (Experiment 1)? (2) To what extent do the different types of gestures produced in Experiment 1 facilitate listener comprehension (Experiment 2)? (3) To what extent does it matter if listeners see the gestures that accompany the spoken message? In Experiment 1, speakers described simple and complex visuo-spatial stimuli (apartment layouts), and in Experiment 2, participants either watched (audio + video condition) or only listened to (audio-only condition) those descriptions and attempted to draw the respective layouts.

What types of gestures will speakers produce when describing complex versus simple apartments in Experiment 1? Given previous evidence that speakers produce more representational gestures when the visuo-spatial information cannot be easily verbalized (e.g., Hostetter et al. 2007), we predicted that speakers should produce more representational gestures when describing complex relative to simple apartments.

Experiment 1

Method

Participants

Sixteen female native English speakers (Barnard College undergraduates between 18 and 22 years old) participated in Experiment 1 in our research lab for course credit. Participants were recruited using web-based sign-up procedures approved by the institutional review board at Columbia University. One speaker was removed from our analysis because she never gestured (Experiment 1, N = 15). For clarity, participants in Experiment 1 will be referred to as “speakers.”

Materials

Apartment layouts served as stimuli (for examples, see “Appendix”). Apartments were drawn digitally on 9 × 10 grids and shared particular features, including at least one bedroom and one bathroom, a front door, one kitchen, and one living room. Some layouts were made more complex by increasing the number of bedrooms, bathrooms, and hallways. In a pilot study, lab members attempted to describe the layouts and noted which they found most complex. This procedure yielded the layouts we designated “complex” and those we designated “simple.”

Procedure

After providing informed consent, speakers sat in an armless chair facing a digital camcorder that recorded their movements and wore a head-mounted microphone that recorded their speech. Speakers were told that they would describe three apartment layouts in English so that future participants could recreate the apartments by following their descriptions. Instructions to speakers were ambiguous about whether future participants would view the videos or only hear the audio component of their descriptions (Footnote 1).

Speakers were given one layout at a time and had unlimited time to memorize each one. When speakers finished studying a layout, they handed it to the experimenter, who sat behind them. Speakers were instructed to describe the apartment from memory, directing their speech to the camcorder. This procedure was repeated for each layout, with the order of simple and complex apartments counterbalanced across participants.

Coding of Gestures

Transcription and gesture coding were completed by a native English speaker (CYT). During gesture coding, the videos were reviewed both at their original speed and frame by frame (30 frames per second) using Final Cut Express video editing software.

Coding of the gestures occurred in three passes (Duncan 2002; McNeill 1992). First, apartment descriptions were transcribed and segmented into short utterances reflecting the grammatical structure of the speech. Most segments consisted of a verb and its associated modifiers (e.g., “If you walk in through the front door”). Second, gesture onsets and offsets were identified based on changes in hand movement, shape, and position. Gesture onsets and offsets occurred both within and across word boundaries (McNeill 1992) and were noted as such in the transcription. Gestures were defined as all hand movements produced, excluding those irrelevant to the task (e.g., scratching one’s arm). Finally, each gesture was labeled as a particular type using McNeill’s (1992) classification system (for definitions and examples, see Table 1). In this pass, we identified (1) iconic, (2) deictic, (3) metaphoric, and (4) beat gestures. Because many of our speakers produced a single gesture type that represented both iconic and deictic information, we identified an additional gesture type, (5) the combined iconic–deictic gesture. To establish the reliability of our coding scheme, a second coder completed the final pass by labeling each gesture in a random sample of apartment descriptions. To measure agreement in gesture labels between the two coders, we calculated Cohen’s Kappa coefficient (Cohen 1960), which corrects the percentage agreement between two coders for agreement expected by chance (Kappa = 0.83 for gesture identification).
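For reference, Cohen’s Kappa is computed from the observed proportion of agreement between the coders, $p_o$, and the proportion of agreement expected by chance, $p_e$:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

A Kappa of 0 indicates chance-level agreement and a Kappa of 1 indicates perfect agreement, so our value of 0.83 reflects agreement well beyond chance.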

Table 1 Definitions and samples of gestures produced in Experiment 1
Table 2 Descriptive statistics for number of gestures used (Experiment 1) and listener accuracy (Experiment 2) by complexity of apartment layout

Statistical Approach

Did speakers use more of a specific type of gesture when describing simple compared to complex apartments? Using the MIXED procedure of SAS (SAS Institute 2001), we constructed a series of multilevel models (one for each type of gesture) to predict how frequently speakers used a given gesture type when describing apartments that were either complex or simple, controlling for the total word count of that description. Multilevel modeling allowed us to take full advantage of our repeated-measures design: we estimated intercepts unique to each speaker, effectively accounting for each speaker’s baseline frequency of gesturing. Because coefficient estimates are unstandardized in multilevel modeling, we report each estimate (labeled b), its standard error, and the 95 % confidence interval for the estimate.
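For concreteness, a minimal sketch of one such model in PROC MIXED follows. This is our reconstruction under stated assumptions, not the original analysis code; the dataset and variable names (exp1, n_iconic_deictic, complexity, word_count, speaker) are hypothetical stand-ins.

```sas
/* Sketch of one Experiment 1 model (hypothetical variable names):
   predicts the count of one gesture type from layout complexity,
   controlling for description word count, with a random intercept
   for each speaker. */
proc mixed data=exp1 covtest;
  class speaker complexity;
  model n_iconic_deictic = complexity word_count / solution cl;
  random intercept / subject=speaker; /* speaker-specific baseline gesture rate */
run;
```

Here solution cl prints the unstandardized fixed-effect estimates with their 95 % confidence limits, and covtest requests z-tests of the covariance parameters, corresponding to the z statistics reported for the random speaker intercept below.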

Results and Discussion

Did speakers use more of a specific type of gesture when describing simple compared to complex apartments? We did not see a significant difference in the speakers’ use of iconic gestures, F(1, 14) = 0.12, p = 0.74, deictic gestures, F(1, 14) = 0.05, p = 0.82, metaphoric gestures, F(1, 14) = 0.08, p = 0.78, or beat gestures, F(1, 14) = 0.14, p = 0.71, during descriptions of simple versus complex layouts.

Although manipulating apartment layout complexity did not affect speakers’ use of iconic, deictic, metaphoric, or beat gestures, speakers’ use of the combined iconic–deictic gesture did vary with the complexity of the layout they described. As shown in Table 3, speakers used more iconic–deictic gestures when describing a complex versus a simple apartment, b = 2.29 (0.71), F(1, 14) = 10.47, p = 0.006, indicating that speakers used, on average, 2.29 more iconic–deictic gestures when describing a complex versus a simple apartment (95 % CI 0.76, 3.83).

Table 3 Number of iconic–deictic gestures used, by apartment complexity, in Experiment 1

In addition to estimating the fixed effect of apartment complexity on gesture rate (speakers used more iconic–deictic gestures when describing complex apartments), we assessed between-speaker variance in gesture rate. The second-to-last line of Table 3, labeled “Speaker Intercept,” shows the estimate of the random effect of speaker. Results show significant heterogeneity in speakers’ use of the iconic–deictic gesture, z = 2.38, p = 0.037, indicating that the unique intercepts for each speaker differed significantly from one another. Put another way, in addition to using more iconic–deictic gestures when describing complex apartments, speakers varied significantly among themselves in their use of iconic–deictic gestures.

Our original prediction that speakers would use more representational gestures when describing complex apartments was partially confirmed: speakers used more of the combined iconic–deictic gesture, but not of the other types of representational gestures. Why might this have been the case? Although we did not analyze the content of speakers’ verbal output, it is possible that while describing a more complex apartment layout, speakers were more inclined to use semantically complex motion phrases (e.g., “and if you turn left”), leading them to generate iconic–deictic gestures, which convey both an object or action and a direction or location. This interpretation would lend support to the information packaging hypothesis, which states that the production of representational gestures helps speakers organize visuo-spatial information and occurs more often as descriptions increase in complexity (Kita 2000). Alternatively, the increase in iconic–deictic gestures may reflect the greater spatial memory demand placed on the speaker when describing complex apartments. This interpretation would replicate previous work manipulating the complexity of visuo-spatial information (e.g., Hostetter et al. 2007). Unfortunately, our current data cannot determine whether the increase in iconic–deictic gestures was due to increased demands on speakers’ verbal output, on their memory, or on both. Regardless of the cause, however, our data do suggest that higher numbers of iconic–deictic gestures may reflect difficulty on the speaker’s part in producing a fluent description. Experiment 1 on its own does not address to what extent these gestures are useful in message comprehension. In Experiment 2, we examine how variation in gesture use affects the accuracy with which listeners are able to draw apartments matching the layouts speakers described.

Experiment 2

We originally predicted that speakers in Experiment 1 would produce more representational gestures when describing complex apartment layouts than when describing simple ones, but found this to be true only for the combined iconic–deictic gesture. We therefore modify our predictions for Experiment 2 to concern iconic–deictic gestures rather than the more general category of representational gestures. If iconic–deictic gesture production increased for complex apartment layouts because speakers found those layouts more challenging to describe (as predicted by the information packaging hypothesis) or harder to remember (as suggested by Hostetter et al. 2007), then speakers who used more iconic–deictic gestures should have produced descriptions that were harder to follow. We therefore predicted that a higher frequency of iconic–deictic gestures would reduce message comprehension regardless of whether listeners saw the descriptions or only heard them.

Method

Participants

One hundred fifty-eight native speakers of English (Columbia University undergraduates, 70 % female, mean age = 20.9 years, SD = 5.3) participated in Experiment 2 for course credit. Participants were recruited using web-based sign-up procedures approved by the institutional review board at Columbia University. Participants in Experiment 2 will be referred to as “listeners.”

Stimuli

The 45 English apartment descriptions generated by the 15 speakers in Experiment 1 served as stimuli. Video recordings were converted using QuickTime, and audio files were converted using Sound Studio. Listeners were presented with either the audio + video files or the audio-only files.

Procedure

Listeners were randomly assigned to the audio + video condition (N = 72) or the audio-only condition (N = 86). Listeners viewed or heard each apartment description twice and were given a clipboard containing blank grids on which to draw the apartment layouts described. To provide listeners with a starting point, the location of the front door was marked on each blank grid. The descriptions were presented over Sennheiser HD 280 Pro circumaural headphones connected to Macintosh G3 computers with 14-in. displays. Each listener was presented with six apartment descriptions in random order: three in English and three in Spanish. For the purposes of this paper, we describe only the data from the three layouts presented in English.

Layout Accuracy

Listeners’ drawings from Experiment 2 were scored for accuracy relative to the original layouts that speakers were asked to describe in Experiment 1. Points were awarded to each drawing based on several objective criteria: (1) Did the drawing include all rooms in the original layout? (2) Were the rooms placed in the correct locations? (3) Were the relative sizes of the rooms correct? Two additional criteria were used to deduct points: (1) Did the listener rotate the layout (e.g., such that all rooms on the right appeared on the left)? (2) Did the listener fail to use the full grid area? Because the number of rooms differed across complex and simple layouts, percent correct was calculated for each drawing relative to the particular layout it was meant to depict. As summarized in the second half of Table 2, listeners’ accuracy scores varied widely, with drawings of simple layouts earning higher scores than drawings of complex layouts (simple: M = 77.9 %, SD = 19.2 %; complex: M = 67.9 %, SD = 20.3 %).

To establish the reliability of our layout accuracy-coding scheme, a second coder applied the same criteria to a random sample of apartment drawings. Because the rating system required few subjective judgments, internal consistency between the two coders was very high (Cronbach’s α = 0.95).

Statistical Approach

Six multilevel models were built to predict the accuracy of apartment drawings, with accuracy expressed in percentage points. The first model examined how drawing accuracy was affected by (1) the condition of the listener (audio + video vs. audio-only) and (2) whether a complex or simple apartment was being described, while (3) controlling for the word count of the description. We capitalized on the repeated-observations design of our study by treating variability due to speakers and listeners as random effects, allowing us to estimate intercepts unique to each speaker and each listener.

In our second series of multilevel models we had two goals: first, to determine whether listeners drew more accurate apartment layouts when the speaker used more or fewer gestures (e.g., did listeners make worse drawings when speakers used more iconic–deictic gestures?), and second, to determine whether it mattered if listeners saw those gestures (e.g., was there a significant interaction between a speaker’s number of iconic–deictic gestures and the condition of the listener?). Building on the structure of our first model, we built a series of five models (one for each gesture type). Each model estimated the effect of the number of times a speaker used (4) iconic, (5) deictic, (6) metaphoric, (7) beat, or (8) iconic–deictic gestures on the accuracy of listeners’ apartment drawings.

In these models, we also examined our second research goal: to assess whether being able to see specific gesture types helped listeners draw more accurate apartments. Consequently, for each of the five models, we added (9) an interaction term for the number of times a speaker used a given gesture by the condition of the listener (e.g., in the model testing the effect of iconic–deictic gestures, we included an interaction term for the number of iconic–deictic gestures a speaker used by the condition of the listener) (Footnote 2).

Finally, given that we were interested in the unique effect of a given gesture type rather than in a speaker’s overall rate of gesturing, we controlled for (10) the frequency of all other gestures the speaker used during that description. Because we were interested in the correspondence between the original layout and the drawing, we did not control for the accuracy of the specific description listeners heard or saw. All models were built using the MIXED procedure of SAS (SAS Institute 2001). Below, we report unstandardized coefficient estimates of the change in percentage points (b), their standard errors, and the 95 % confidence intervals for these estimates.
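A minimal sketch of one of these models (the iconic–deictic model) in PROC MIXED follows; as before, this is our reconstruction rather than the original analysis code, and the dataset and variable names (exp2, accuracy, condition, complexity, n_iconic_deictic, n_other_gestures, speaker, listener) are hypothetical stand-ins.

```sas
/* Sketch of one Experiment 2 model (hypothetical variable names):
   predicts drawing accuracy (0-100) from listener condition, layout
   complexity, description word count, one gesture type, its
   interaction with condition, and the frequency of all other
   gestures, with crossed random intercepts for speakers and
   listeners. */
proc mixed data=exp2 covtest;
  class speaker listener condition complexity;
  model accuracy = condition complexity word_count
                   n_iconic_deictic n_iconic_deictic*condition
                   n_other_gestures / solution cl;
  random intercept / subject=speaker;  /* between-speaker variance  */
  random intercept / subject=listener; /* between-listener variance */
run;
```

The two RANDOM statements give each speaker and each listener a unique intercept; the gesture-by-condition interaction tests whether the effect of a gesture type depends on whether listeners could see it.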

Results and Discussion

Did listeners draw more accurate apartment layouts when they could see the speakers’ descriptions? As summarized in Table 4, our first multilevel model revealed that the accuracy of layout drawings was comparable for listeners in the audio + video and audio-only conditions, F(1, 154) = 1.04, p = 0.31, suggesting that visual access alone did not help listeners draw more accurate layouts. Unsurprisingly, drawings of simple apartments were scored as more accurate than drawings of complex apartments, b = −10.42 (1.96), F(1, 154) = 28.39, p < 0.001: drawings of simple layouts scored, on average, 10 percentage points higher than drawings of complex layouts (95 % CI −14.28, −6.56).

Table 4 Listeners’ apartment drawing accuracy (from 0 to 100), as predicted by the condition of the listener, in Experiment 2

Random effects were tested to assess between-speaker and between-listener variance in apartment layout accuracy. As shown in Table 4, under “Speaker Intercept,” there was little heterogeneity between speakers, z = 1.24, p = 0.11. Put another way, our sample of 15 speakers did not contain participants who produced uniformly “better” or “worse” descriptions across all the layouts described: each speaker produced descriptions that allowed listeners to draw accurate (and inaccurate) apartment layouts.

Where we did see heterogeneity was in the random effect of listeners. Our model estimated a unique intercept for each listener when predicting the accuracy of his or her apartment drawings. In Table 4, under “Listener Intercept,” we see a great deal of heterogeneity between listeners, z = 3.02, p = 0.0012, meaning that the listeners’ unique intercepts differed significantly from one another. Put another way, regardless of which descriptions they heard or saw, some listeners simply drew more accurate layouts than others. The pattern of between-speaker and between-listener heterogeneity was the same for all models in Experiment 2, so we do not discuss it further.

Although listeners in the audio + video condition did not draw more accurate apartments, it is possible that the number of gestures produced during descriptions influenced the accuracy of apartment drawings. In our second series of multilevel models, we tested whether the number of times a speaker used a given gesture affected the accuracy of listeners’ drawings. As summarized in Table 5, we found a negative association between the number of iconic–deictic gestures used in a description and the accuracy of the apartment drawings listeners made, b = −0.78 (0.39), F(1, 151) = 4.10, p = 0.045 (95 % CI −1.54, −0.019): when speakers used more iconic–deictic gestures, listeners drew less accurate apartments, although each additional iconic–deictic gesture predicted a decrease in accuracy of only 0.78 percentage points. Nonetheless, the negative effect of speakers’ use of iconic–deictic gestures becomes meaningful in context. Imagine comparing two speakers describing a complex apartment, one whose iconic–deictic gesture use is one standard deviation above the mean and one whose use is one standard deviation below it (Table 2: M = 4.26 ± 3.60). The speaker one SD above the mean would make just under eight iconic–deictic gestures; the speaker one SD below the mean would make fewer than one. Given that each iconic–deictic gesture predicts a decrease in drawing accuracy of 0.78 percentage points, this roughly seven-gesture difference translates to a difference of more than five percentage points in accuracy (e.g., the difference between a drawing scored 85 % accurate and one scored 80 % accurate).
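Spelled out, the arithmetic behind this comparison, using the Table 2 values and the estimate b = −0.78, is:

$$\begin{aligned} +1\ \mathrm{SD} &: 4.26 + 3.60 = 7.86 \text{ gestures}\\ -1\ \mathrm{SD} &: 4.26 - 3.60 = 0.66 \text{ gestures}\\ \text{difference} &: 7.86 - 0.66 = 7.20 \text{ gestures}\\ \text{predicted accuracy gap} &: 7.20 \times 0.78 \approx 5.6 \text{ percentage points} \end{aligned}$$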

Table 5 Listeners’ apartment drawing accuracy (from 0 to 100), as predicted by speakers’ use of iconic–deictic gestures

This series of models also addressed our second research goal: to assess whether it mattered if listeners saw speakers’ particular gestures. Specifically, was the interaction between a speaker’s iconic–deictic gesture use and a listener’s condition significant? Table 5 reveals that speakers’ use of iconic–deictic gestures influenced listeners regardless of whether the listeners saw those gestures: the interaction term for the speaker’s use of iconic–deictic gestures by the listener’s condition was not significant, F(1, 151) = 0.27, p = 0.60. Taken together with our finding from Experiment 1 that speakers used more iconic–deictic gestures when describing a complex apartment, these results suggest that speakers use iconic–deictic gestures when visuo-spatial memory demands or speaking demands are high.

Our second series of multilevel models, which assessed the influence of each specific gesture type on apartment-drawing accuracy, also revealed a positive association between the number of beat gestures a speaker made during an apartment description and the accuracy of the subsequent drawings. As summarized in Table 6, listeners drew more accurate apartment layouts when exposed to descriptions that featured more beat gestures, b = 0.65 (0.30), F(1, 151) = 4.75, p = 0.031 (95 % CI 0.062, 1.24). This association suggests that speakers who used more beat gestures provided descriptions that enabled listeners to make more accurate apartment drawings. As with iconic–deictic gestures, the change per gesture is quite small, only 0.65 percentage points. But if we again compare a speaker one SD above versus one SD below the mean in use of beat gestures when describing a complex apartment (Table 2: M = 4.19 ± 4.69), we are looking at a difference of over eight beat gestures, which translates to a difference in predicted accuracy of over five percentage points. As with the negative effect of iconic–deictic gestures, the positive effect of speakers’ use of beat gestures was not influenced by the condition of the listener: the interaction term for the speaker’s use of beats by listener condition was not significant, F(1, 151) = 0.08, p = 0.77, meaning that it did not matter whether listeners saw the beat gestures or only heard the descriptions they accompanied.

Table 6 Listeners’ apartment drawing accuracy (from 0 to 100), as predicted by speakers’ use of beat gestures

Our second series of models revealed that no other gesture type was associated with the accuracy of apartment drawings: speakers’ use of iconic, F(1, 151) = 0.59, p = 0.44, deictic, F(1, 151) = 1.16, p = 0.28, or metaphoric, F(1, 151) = 0.06, p = 0.81, gestures did not affect drawing accuracy. As with iconic–deictic and beat gestures, the interaction of listener condition with a speaker’s frequency of iconic, F(1, 151) = 0.10, p = 0.75, deictic, F(1, 151) = 0.11, p = 0.75, or metaphoric, F(1, 151) = 0.56, p = 0.45, gestures was not significant (Footnote 3).

General Discussion

In Experiment 1, we explored how speakers use various types of gesture when describing simple and complex apartment layouts. Results suggest that the more complex an apartment layout, the more iconic–deictic gestures speakers used. In Experiment 2, we asked listeners to draw apartment layouts based on the descriptions generated in Experiment 1. Being able to see a video of the apartment descriptions did not improve the accuracy of listeners’ drawings. Interestingly, regardless of whether a simple or complex apartment was being described, and regardless of whether listeners could see the speaker’s gestures, descriptions that featured more iconic–deictic gestures led to less accurate apartment drawings.

Why does gesture production during an apartment description affect the accuracy of apartments drawn from that description? Our findings suggest that the use of iconic–deictic gestures reflects task difficulty: speakers used them more when layouts were complex (Experiment 1), and descriptions featuring more iconic–deictic gestures resulted in less accurate drawings (Experiment 2). Interestingly, it was the presence of these gestures in the description, not whether listeners saw them, that affected the accuracy of subsequent drawings. This leads us to speculate about why speakers use iconic–deictic gestures. First, the finding that speakers produced more iconic–deictic gestures when describing complex apartments aligns with the information packaging hypothesis (Kita 2000). According to this view, representational gestures aid in the planning of a verbal message: they occur most frequently when speakers must organize spatial information into units suitable for verbal expression. Given the difficulty of organizing the relative location, size, and type of rooms in each complex apartment layout, speakers might have produced more iconic–deictic gestures to facilitate verbal descriptions of these features. Second, even if iconic–deictic gestures were used to aid spatial recall in this task, as previous work would suggest (de Ruiter 1998; Morsella and Krauss 2004; Wesp et al. 2001), our findings do not indicate the extent to which they were successfully recruited to aid spatial memory. Studies in which hand mobility is restricted suggest that, had participants not been able to produce these gestures, their descriptions might have been even less successful (Morsella and Krauss 2004). In contrast with theories implying that gesture facilitates spatial memory (de Ruiter 1998; Morsella and Krauss 2004; Wesp et al. 2001), the information packaging hypothesis makes explicit claims about the role of gesture in lessening speaking, rather than memory, demands. Although the present data cannot fully disentangle these two accounts, they do support the view that representational gestures, specifically iconic–deictic gestures, reflect task difficulty.

There is an alternative explanation for why higher frequency of iconic–deictic gestures in descriptions predicted worse apartment drawings. Perhaps, rather than the production of iconic–deictic gestures reflecting poor memory of an apartment layout, the difference in layout drawing performance lies in the decoding of what these iconic–deictic gestures often represented: the description of motion. By this account, it may be easier to recreate an apartment layout from a description that simply states where the rooms are located than from a description in which the speaker is navigating in an imagined space.

Although the focus of this work is the role of representational hand gestures, we were surprised to find that apartment layout descriptions featuring more beat gestures resulted in better drawings. Unlike representational gestures, beat gestures do not convey semantic content (McNeill 1992). Although the role of beat gestures is not well understood (Casasanto 2013), there is some evidence that beats play a role in speech prosody and may aid subsequent comprehension. Speech segments accompanied by a beat gesture are acoustically differentiated: for example, Krahmer and Swerts (2007) found that words occurring along with beat gestures had high-frequency formants, thereby enhancing the acoustic prominence of those words for the listener. Taken together with our findings from Experiment 2, it is possible that beat gestures either help speakers produce comprehensible speech or at least reflect the speaker’s communicative clarity. Not mutually exclusive with this view is the possibility that the production of beat gestures reflects semantic fluency (Hostetter and Alibali 2007). Given claims that manual movements with no direct relation to the semantic content of accompanying speech can activate speech-related brain regions and facilitate lexical access (Ravizza 2003), beat gestures may operate in a similar manner. The link between beat gestures and shifts in the acoustic features or semantic fluency of the apartment descriptions might explain why beat gestures did not need to be seen in Experiment 2 to have a positive impact on subsequent apartment drawings.

Finally, there are limitations to our approach. In the audio-only condition, we removed listeners’ access to all visual information about the speaker, including his or her mouth movements, which listeners benefit from being able to see (Jesse and Massaro 2010). Why did listeners in our video condition not benefit from seeing speakers’ mouth movements? Perhaps the video recordings, which were framed to show the speakers’ head and torso, did not provide sufficiently clear images of the mouth. In any case, being able to see the gestures or the mouth movements did not facilitate comprehension on the part of the listeners.

The role of co-speech gesture may vary by situational context and is affected by the highly social nature of communication. Speakers constantly adapt how they speak to the person they are addressing. This has been shown at a linguistic level (Pickering and Garrod 2004), at a paralinguistic level (Pardo 2006), and in gesture. For example, the use of representational and beat gestures increases when mothers speak to language-learning infants (Iverson et al. 1999), when foreign language teachers speak to language learners (Allen 2000), and when interlocutors have less common ground (Jacobs and Garnham 2007). Moreover, speakers adapt their gestures to match the idiosyncrasies of the person with whom they are communicating, dyadically synchronizing their gestures during a conversation (Bavelas et al. 1988; Kimbara 2006; Wallbott 1995). Some authors (e.g., Goldin-Meadow 2003) have even suggested that gestures may play a role in promoting positive affect between speakers. Further still, negotiation researchers have found that those who interact face-to-face are more likely to behave cooperatively (Purdy et al. 2000) and to build dyadic rapport (Drolet and Morris 2000) than those who negotiate over the phone. Taken together, these findings emphasize the importance of considering the nature of the communicative environment when interpreting the role of gesture.

The present findings directly compare the communicative effects of different types of gestures and imply that, even within the category of representational gestures, the role of gesture varies as a function of gesture type. Although gestures may play an important social role in communication, our findings suggest that the production of representational gestures does not necessarily facilitate message comprehension, particularly when the presence of those gestures reflects high spatial memory or speaking demands.