Introduction

The relationship between speech and gesture has attracted considerable interest across a wide range of disciplines. Linguists, psychologists and cognitive scientists have proposed a critical role for manual gesture in the development and evolution of human language (Arbib 2005, 2008; Gentilucci and Corballis 2006; Rizzolatti and Arbib 1998; Tomasello et al. 2005). The idea underlying this proposal is that gesture and spoken language belong to a single communication system, because they are linked to the same mental processes despite differing in expression modality (Kendon 2004; McNeill 1992). An alternative view holds that gestures and spoken language constitute two distinct communication systems, with gesture serving as an auxiliary support when verbal expression is temporarily disrupted or word retrieval is difficult (Hadar et al. 1998; Krauss and Hadar 1999; Levelt et al. 1985).

People frequently use their hands to communicate. Such hand movements can either accompany speech, adding information to the communicated meaning, or be so conventionalized in their form that they are effortlessly understood on their own (Burling 1999; Gunter and Bach 2004). The first category includes the so-called iconic gestures, or gesticulations, which depict salient visual and spatial properties of the object of conversation (e.g. size and shape, representational gesticulation) or of the action described by the sentence. When produced in isolation, gesticulations are ambiguous in meaning (McNeill 2005). The second category comprises emblems, or symbolic gestures, whose meaning is clear even in isolation, as they can directly convey meaning in the absence of speech (Goldin-Meadow 1999, 2003; McNeill 2005; Andric et al. 2013). The majority of studies investigating the relation between speech and gesture have focused on iconic gestures rather than emblems (Andric et al. 2013). However, emblems are particularly interesting because of their unique capacity to fully replace spoken language and its semantic content.

Is this resemblance between emblems and speech reflected in an overlapping communication system? Behavioral and neuroimaging evidence supports this hypothesis. In a previous study, when healthy individuals performed a symbolic gesture and simultaneously pronounced a word of corresponding meaning, the gesture kinematics and the voice spectra of the word changed relative to performing the gesture or pronouncing the word alone (Bernardis and Gentilucci 2006). In contrast, no reciprocal influence between gesture kinematics and voice spectra was observed when (a) the meanings of gesture and word were incongruent with each other, (b) the word was pronounced simultaneously with a meaningless gesture and, finally, (c) the gesture was performed simultaneously with a pseudo-word (Barbieri et al. 2009). These data support the key role of semantic congruence/incongruence in speech and gesture integration. In a TMS study, Gentilucci et al. (2006) highlighted the key role of Broca’s area in the integration of emblems with speech: repetitive TMS applied to this region induced a temporary interference when words were pronounced in response to emblematic gestures. More specifically, the increase in voice spectra described during simultaneous speech and gesture production (Bernardis and Gentilucci 2006) was temporarily disrupted. The large overlap of the networks sustaining emblem and speech processing was recently demonstrated in an fMRI study by Andric et al. (2013), which, despite the different stimulus modalities, revealed through a conjunction analysis converging activation of the bilateral superior temporal sulcus (STS), as well as of lateral temporal and frontal cortices.

While rTMS and fMRI studies have spatially localized the network common to gesture and language processing, their low temporal resolution does not allow a fine-grained description of its temporal dynamics. For this purpose, event-related potentials (ERPs) represent an excellent tool. All previous ERP studies on gesture and speech focused on the N400 component (Ozyürek et al. 2007; Wu and Coulson 2005; Kelly et al. 2004), a negative-going waveform in the average ERP that reaches its peak amplitude approximately 400 ms after stimulus onset. The N400 has a broad scalp distribution, with maximal amplitudes at midline central or parietal sites and noticeably smaller amplitudes at prefrontal and lateral frontal sites (Duncan et al. 2009). This component was first reported by Kutas and Hillyard (1980) in a comparison of sentence-final words that formed predictable completions with words that were semantically incongruent, the latter eliciting a larger negative wave. Iconic gestures presented with a sentence also elicit an N400 component, whose amplitude is larger for gestures that are semantically anomalous or less expected (Kelly et al. 2004; Ozyürek et al. 2007).

A large body of literature has addressed the functional role of the N400, attempting to map this effect onto specific cognitive operations. Within the feed-forward chain leading from perception to semantic integration with contextual information, the N400 has been proposed to reflect anomalies at one or more of these stages. The first interpretation, by Kutas and Hillyard (1980), suggests that the larger N400 amplitude elicited by semantically anomalous stimuli reflects the semantic reprocessing that readers perform on senseless sentences. Subsequently, Brown and Hagoort (1993) proposed that this ERP component marks the semantic integration of a word with the working context, with semantically anomalous stimuli requiring a greater integration effort. More recently, the N400 effect has also been proposed to reflect facilitated activation of features of the long-term memory representation associated with a lexical item (Kutas and Federmeier 2011; Federmeier et al. 2007; Lau et al. 2009). According to this view, the differential modulation of the N400 effect arises because predictable words in context are easier to access from memory.

Given that contextual factors tend to simultaneously facilitate both lexical access and semantic integration, further studies investigated the localization of the N400 effect in order to determine which neurocognitive operations it reflects. To our knowledge, few EEG studies have investigated N400 source localization. Caldara et al. (2004) reported the involvement of the left Inferior Frontal Gyrus (IFG) during the integration of verbal material and the recognition of face identity. In a review, Friederici (2004) proposed the Superior Temporal Sulcus (STS) and left frontal areas as N400 generators for lexical-semantic material. Finally, Khateb et al. (2010) analyzed N400 effects in a phonological task (rhyme detection), a semantic task (related or unrelated words) and an image categorization task. Source localization showed the involvement of the middle/superior temporal gyrus in all tasks, suggesting that these areas index the same cognitive content despite differences in representational format and type of mismatch. Considering fMRI and intracranial studies (Nobre and McCarthy 1995; McCarthy et al. 1995), three regions appear to be mainly involved in N400 generation: the left posterior temporal cortex, the left anterior temporal cortex and the left inferior frontal cortex (see Lau et al. 2008 for a review). Focusing on studies specifically dealing with gestures (see Andric and Small 2012), Willems et al. (2007) used fMRI to investigate the neural network involved in the integration of semantic information from speech and iconic gestures. The results showed that premotor areas (BA6) were specifically modulated by gesture information “mismatching” the language context. Moreover, an increase in the load of integrating both verbal and gestural information into the prior speech context activated Broca’s area and adjacent cortex (left inferior frontal cortex, BA 45/47). The authors interpreted these results as suggesting that these regions are involved in the interaction between speech and gestures. Other studies (Holle et al. 2008, 2010) demonstrated that the left posterior superior temporal sulcus (STSp) is a site where the integration of iconic gesture and speech takes place. In contrast, posterior middle temporal gyrus (MTGp) and anterior superior temporal cortex (STa) responses may be tuned to interpret meaning, including when it is conveyed by gesture. In response to emblems, MTGp activity has been found in the left (Lui et al. 2008; Villarreal et al. 2008) and right (Nakamura et al. 2004) hemispheres, as well as bilaterally (Lotze et al. 2006; Xu et al. 2009). STa activation has been reported both for co-speech gestures (Skipper et al. 2009; Green et al. 2009; Straube et al. 2011) and for emblem processing (Lotze et al. 2006).

Starting from these premises, the aim of the present study was to address how observing an emblem can serve as a semantic prime for a subsequent word, at both the behavioral and neurophysiological levels. Using high-density EEG recordings, we characterized the temporal dynamics and spatial distribution of the interplay between speech and symbolic gestures, which could be semantically congruent or incongruent. Alongside the electrophysiological investigation, we carried out a behavioral experiment using the same stimuli in an independent sample, in order to validate the semantic congruence/incongruence of the gesture-verb pairs and to measure the efficacy of the gestures in priming the congruence judgment on the subsequently presented verb.

Unlike previous studies (Holle and Gunter 2007; Ozyürek et al. 2007; Wu and Coulson 2005, 2007a, 2007b; Kelly et al. 2004), in which a co-speech gesture was associated with verbal material, we presented a symbolic gesture, rather than a word, as the prime. This choice allowed us to verify whether symbolic gestures behave like verbal material. We assessed the presence of an N400 component and its modulation by the congruence of gesture and word meaning. Since the production of symbolic gestures and words are reciprocally influenced only when they express congruent meanings (Barbieri et al. 2009; Bernardis and Gentilucci 2006), we expected different neurophysiological responses to congruent versus incongruent stimuli. In addition, advanced source localization techniques allowed us to identify the regions contributing most to the response, and to determine whether some areas are differentially involved in processing incongruent stimuli.

Methods

Behavioral Experiment

Participants

A sample of twenty-one (15 females) native Italian speakers participated in the behavioral study. They had normal or corrected-to-normal vision and no history of neurological or psychiatric disorder. The mean age of the volunteers was 25.1 ± 3.75 years. All were right-handed, as ascertained by the Edinburgh Handedness Inventory (Oldfield 1971). All participants were naïve as to the purpose of the study. The study was approved by the Ethics Committee of the Medical Faculty at the University of Parma.

Stimuli, Apparatus and Procedure

Stimuli included 20 video-clips (see Online Resource 1) in which an actor performed a symbolic gesture. In each video the actor faced the video-camera and performed the gesture with the right hand, without producing any facial expression. No audio accompanied the video presentation. In addition, 40 Italian verbs were selected: 20 were congruent with the meaning of the gestures, while the remaining 20 expressed an opposite meaning. Each word (font, CALIBRI; size, 72 pts) was displayed in the center of the screen in the imperative form. Table 1 reports the congruent and incongruent words for each presented gesture. The words presented after the gestures were matched both in length (number of characters) and in frequency, estimated from the COLFIS database for the Italian language (Laudanna et al. 1995). An unpaired two-tailed t test comparing the verbs of the congruent and incongruent conditions showed no significant difference for either variable (length: mean ± standard error for congruent 6.25 ± 0.4, incongruent 6.15 ± 0.4, p = 0.82; frequency: congruent 922.6 ± 457.8, incongruent 542.3 ± 232.2, p = 0.72).
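For concreteness, the matching check reduces to two unpaired t tests; the minimal MATLAB sketch below reproduces it (variable names holding the per-verb values are hypothetical).

```matlab
% Minimal sketch of the stimulus-matching check (hypothetical variable names;
% ttest2 is MATLAB's unpaired two-tailed t test).
% lenC/lenI: character counts; freqC/freqI: COLFIS frequencies (20 verbs each)
[~, pLen]  = ttest2(lenC,  lenI);    % reported p = 0.82
[~, pFreq] = ttest2(freqC, freqI);   % reported p = 0.72
```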

Table 1 Stimuli List

In the Congruent condition, the verbs were semantically congruent with the previously presented gestures. For example, the gesture Stop, i.e. the open hand held still with the palm facing the observer and the arm extended, is congruent with the verb “stop”. In the Incongruent condition, the same gesture is followed by the verb “come in”, which conveys an opposite meaning.

Symbolic gestures were selected on the basis of previous literature (Poggi 2002; Kendon 2004; Barbieri et al. 2009; Ferri et al. 2014; De Stefani et al. 2013; Papeo and Rumiati 2013; Campione et al. 2014), while verbs congruent with each gesture were selected so that their meaning matched that of the gesture as closely as possible. Incongruent verbs, in turn, were chosen to express the opposite meaning of the congruent ones.

We used Matlab 6.5 (The MathWorks Inc., Natick, MA, USA) both for visual stimulus presentation and for response time (RT) collection. Participants sat in front of a black table. A 19-inch PC monitor was placed on the table, 70 cm from the participant’s forehead, set at a spatial resolution of 1,024 × 768 pixels and a refresh rate of 60 Hz. Participants placed their right index and middle fingers on the PC keyboard. Visual stimuli were presented centrally, subtending a horizontal visual angle smaller than 10 degrees. Each trial began with the presentation of a white fixation cross (700 ms) in the middle of the screen, followed by the gesture video-clip (2,000 ms), the printed verb on a dark grey background and, finally, a dark grey background serving as the inter-trial interval (1,500 ms). The verb disappeared as soon as participants responded by pressing a button; the maximum response time was 3,000 ms. RTs were measured from verb onset. A total of 40 trials was administered, 20 per condition (Congruent and Incongruent, one repetition for each gesture-verb pair). Participants had to decide whether the verb meaning was congruent with the meaning of the previously presented gesture. They responded by pressing keys with their right index and middle fingers, with key assignment counterbalanced across participants.
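The RT-collection logic can be illustrated with the following minimal MATLAB sketch; it uses Psychtoolbox-style timing calls (GetSecs, KbCheck) purely as an assumption for illustration, since the original presentation scripts are not reported, and the key names are hypothetical.

```matlab
% Illustrative response-collection loop (Psychtoolbox-style; hypothetical,
% not the authors' original script). Assumes the verb has just been drawn.
keyA = KbName('n');  keyB = KbName('m');   % hypothetical response keys
verbOnset = GetSecs;                       % time stamp of verb onset
rt = NaN;
while GetSecs - verbOnset < 3.0            % 3,000 ms response deadline
    [keyIsDown, secs, keyCode] = KbCheck;
    if keyIsDown && (keyCode(keyA) || keyCode(keyB))
        rt = secs - verbOnset;             % RT measured from verb onset
        break                              % verb disappears upon response
    end
end
```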

EEG Experiment

Participants

Twenty-one (8 males) native Italian speakers participated in the EEG experiment. They had normal or corrected-to-normal vision and no history of neurological or psychiatric disorder. The mean age of the volunteers was 26.2 ± 2.4 years. All were right-handed, as ascertained by the Edinburgh Handedness Inventory (Oldfield 1971). They gave written informed consent for their participation. Ethical approval was obtained from the Ethics Committee of the Medical Faculty at the University of Parma.

Stimuli and Procedure

The same set of stimuli was used for the EEG experiment. Each trial (see Fig. 1) started with the presentation of a fixation cross on the PC monitor (700 ms), followed by the gesture video-clip (2,000 ms), the printed verb on a dark grey background (1,000 ms) and, finally, the dark grey background alone, serving as the inter-trial interval (random duration, range 2,000–3,000 ms). A total of 160 trials was administered, 80 per condition (Congruent and Incongruent, four repetitions for each gesture-verb pair), for a total recording time of about 20 min. Visual stimuli were presented using E-Prime software (http://www.pstnet.com). Participants were comfortably seated 70 cm away from a 19-inch monitor, on which stimuli were presented centrally, subtending a horizontal visual angle smaller than 10 degrees.

Fig. 1

Experimental design. The trial timeline, common to both the behavioral and the EEG experiment, is shown. The fixation cross is followed by a video clip of an actor making a symbolic gesture (the “victory” gesture is shown as an example). The subsequent verb could be congruent (“vinci”, i.e. “win”) or incongruent (“perdi”, i.e. “lose”). The inter-trial duration was jittered between 2,000 and 3,000 ms

Participants were instructed to perform a semantic decision task on the presented stimuli: they had to evaluate whether a gesture-verb pair was congruent or incongruent. Only when a question mark appeared on the screen 500 ms after verb offset (10 % of trials, randomly distributed) were they required to state aloud the congruence/incongruence of the stimulus meaning in the last observed trial. To avoid eye movements during stimulus presentation, participants were required to fixate a cross presented centrally at the beginning of each trial.

EEG Recording

Continuous EEG was acquired using a 128-channel Geodesic EEG System (Electrical Geodesics, Inc., Eugene, OR, USA) with the HydroCel Geodesic Sensor Net, which arrays the sensors (AgCl-coated electrodes) in a geodesic pattern over the surface of the head and includes 19 contacts at the equivalent 10–20 system locations. Consistent positioning was achieved by aligning the Sensor Net with skull landmarks (nasion, vertex, and pre-auricular points). Thanks to the high-input-impedance amplifiers (Net Amps 300), low-noise EEG was obtained with sensor-skin impedances maintained below 50 kΩ. The signal was digitized at a 250 Hz sampling rate (0.01 Hz high-pass filter) and recorded with a vertex reference.

EEG Data Analysis

EEG data were analyzed off-line by means of NetStation software (Electrical Geodesics, Inc., Eugene, OR, USA) and in-house MATLAB scripts (The MathWorks, Natick, MA). Continuous recordings were segmented into epochs of 1,100 ms, each including the 100 ms preceding and the 1,000 ms following verb onset. For artifact detection and removal, each participant’s epoch file was imported into BrainVision Analyzer software (BrainProducts, Munich, Germany) and analyzed by means of Independent Component Analysis (ICA; Delorme and Makeig 2004). ICA tends to focus on the portion of the data expressing most of the power (in the case of EEG, low-frequency oscillations). Since very low frequencies can reflect skin conductance changes that might confound the ICA demixing, a high-pass filter (1 Hz) was applied to each recording before the ICA decomposition. All components whose topography, power spectrum and time course were related to eye blinks, saccades, or cardiac and muscular artifacts were excluded (on average 8.4 ± 6.1 components out of 64, i.e. about 13 %; see Jung et al. 2000). The resulting IC weights were then applied to more modestly filtered data (0.5–70 Hz), in accordance with the observation that such a high-pass filter preserves most of the signal of interest in the case of slow ERPs (Kappenman and Luck 2010) like the N400 addressed in this study. The resulting epoch files were further visually inspected to exclude remaining bad trials (about 5 % of trials removed) and re-referenced to the average signal of all electrodes located above the axial plane passing through the fronto-polar and occipital electrodes.
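The logic of this two-pass filtering/ICA procedure can be sketched in EEGLAB-style MATLAB as follows; since the authors used BrainVision Analyzer, the function calls, file names, component indices and channel selection below are illustrative assumptions, not their pipeline.

```matlab
% Hypothetical EEGLAB-style sketch: ICA estimated on a 1 Hz high-pass copy,
% weights applied to 0.5-70 Hz data, artifact ICs removed, subset re-reference.
EEG    = pop_loadset('filename', 'sub01_epochs.set'); % epochs, -100..1000 ms
EEGica = pop_eegfiltnew(EEG, 1, []);          % 1 Hz high-pass copy, for ICA only
EEGica = pop_runica(EEGica, 'icatype', 'runica');
EEG    = pop_eegfiltnew(EEG, 0.5, 70);        % more modest 0.5-70 Hz band-pass
EEG.icaweights  = EEGica.icaweights;          % copy the unmixing weights back
EEG.icasphere   = EEGica.icasphere;
EEG.icawinv     = EEGica.icawinv;
EEG.icachansind = EEGica.icachansind;
badICs = [2 5 11];                            % e.g. blink/saccade/ECG ICs (visual inspection)
EEG    = pop_subcomp(EEG, badICs);            % remove artifactual components
upperChans = 1:110;                           % placeholder: electrodes above the axial plane
EEG    = pop_reref(EEG, upperChans);          % re-reference to their average
```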

The outermost belt of electrodes of the sensor net was discarded, because these channels are more prone to residual muscular artifacts. This procedure is commonly used when high-density sensor nets are employed (see Avanzini et al. 2013; Berchio et al. 2013; Rihs et al. 2013; Britz and Michel 2010). Finally, the ERP for each subject and condition was computed on the remaining 110 electrodes, and group-averaged ERPs were calculated for both the Congruent and the Incongruent condition (see Fig. 2).

Fig. 2

Statistical assessment of the N400 time interval. a Superimposition of the grand averages for the Congruent (red line) and Incongruent (black line) conditions. All 110 electrodes are displayed in a butterfly visualization. The X-axis reports the time relative to word presentation (indicated by a tick mark), ranging from −100 to 660 ms. The Y-axis indicates the ERP amplitude, ranging from −2 to 2 µV. b Results of a point-wise two-tailed paired t test applied to the GFP values. Only t values significant at p < 0.05 after non-parametric multiple-comparison correction were retained. Note that the statistical significance lies in the interval 428–484 ms, hereafter defined as the N400 time interval. c The t-map for the comparison between the Congruent and Incongruent conditions in the N400 interval. Yellow dots represent the recorded electrodes; marked dots correspond to electrodes showing a significantly (p < 0.01) more negative amplitude for the Incongruent versus the Congruent condition. d ERP traces for the Congruent (red line) and Incongruent (black line) conditions at the electrode marked in bold in the adjacent topographic maps. The black bar indicates the N400 time interval. e Cohen’s d, estimated for all electrodes in the N400 interval and plotted topographically on the scalp map. Note that all the significant electrodes identified in (c) show at least a “moderate” effect size (d > 0.5) (Color figure online)

To identify the time windows where the processing of the two conditions differs, two different and somewhat complementary approaches were employed: global field power (GFP) and microstate analysis.

Global Field Power is a measure of field strength, computed at each time point as the spatial standard deviation of the potential field (see Lehmann and Skrandies 1980). A direct comparison of its values can reveal the time intervals in which two conditions differ in terms of electric field strength. A two-tailed point-wise paired t test was thus applied to the GFP of the Congruent and Incongruent conditions. To reduce the false-positive rate, a non-parametric correction for multiple comparisons was applied using a permutation test based on the t-statistic (Maris and Oostenveld 2007; Groppe et al. 2011). Significant adjacent time points were grouped into periods of interest, and for each of them the average t-map and the corresponding effect size were computed to depict the topographical distribution of the effect. To this aim, Cohen’s d was computed with the Measures of Effect Size Toolbox (Hentschke and Stüttgen 2011) as the ratio between the difference of the means and the pooled standard deviation, and its values were evaluated according to the general rules of thumb provided by Cohen (1988). However, since this procedure considers only time samples that are individually significant, it may appear circular; these panels should therefore be regarded as a topographical counterpart of the GFP analysis rather than as an independent analysis.
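For concreteness, the GFP comparison can be written in a few lines of MATLAB; the sketch below assumes single-subject ERPs stored as [electrodes × time × subjects] arrays (hypothetical variable names) and uses a standard sign-flipping permutation scheme with the maximum statistic, in the spirit of Maris and Oostenveld (2007).

```matlab
% GFP(t) = spatial SD of the potential field at time t (Lehmann & Skrandies 1980)
[nCh, nT, nS] = size(erpC);              % erpC/erpI: hypothetical data arrays
gfpC = squeeze(std(erpC, 1, 1));         % [time x subjects], SD across electrodes
gfpI = squeeze(std(erpI, 1, 1));
d    = gfpI - gfpC;                      % paired differences, [time x subjects]
[~, ~, ~, st] = ttest(d');               % point-wise paired t test (per column)
tObs = st.tstat;
% Permutation correction: randomly flip the sign of each subject's difference
% and keep the maximum |t| over time to build the null distribution.
nPerm = 5000;  tMax = zeros(nPerm, 1);
for p = 1:nPerm
    flips = 2 * (rand(1, nS) > 0.5) - 1; % +1/-1 per subject
    [~, ~, ~, sp] = ttest((d .* repmat(flips, nT, 1))');
    tMax(p) = max(abs(sp.tstat));
end
sig = abs(tObs) > prctile(tMax, 95);     % corrected significance, p < 0.05
% Cohen's d per electrode in the significant window (pooled-SD version)
win = find(sig);
aC  = squeeze(mean(erpC(:, win, :), 2)); % [electrodes x subjects]
aI  = squeeze(mean(erpI(:, win, :), 2));
dCohen = (mean(aI, 2) - mean(aC, 2)) ./ sqrt((var(aI, 0, 2) + var(aC, 0, 2)) / 2);
```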

A significant difference in GFP values cannot by itself reveal whether it arises from a stronger involvement of the same brain generators (the same areas differentially modulated by the two experimental conditions) or from the involvement of two different generator patterns, reflected in different scalp maps and hence different GFP values.

For this reason, the grand-averaged ERPs were processed using a space-oriented brain electric field analysis. This method is based on the notion of functional brain microstates introduced in the 1980s by Lehmann (1987), and on the observation that electric brain activity does not vary randomly over time after stimulus onset; rather, some brain topographies remain stable for tens to hundreds of milliseconds (Michel et al. 1999). Each stable brain topography (a microstate) is sustained by a specific brain network and reflects a specific functional brain state (Lehmann 1987; Michel et al. 1999). The procedure implemented here to identify the periods of topographic stability within and between experimental conditions is a modified agglomerative hierarchical clustering (e.g. Murray et al. 2008) termed “AAHC” (Atomize and Agglomerate Hierarchical Clustering), applied to the group-averaged ERPs. This cluster analysis is reference-free and, since normalized maps are compared, insensitive to amplitude modulations of the same scalp potential field across conditions. The output is a set of template maps that describe the group-averaged ERPs. The number of microstates explaining most of the variance of the data set was determined by a modified Krzanowski-Lai (K-L) criterion (Murray et al. 2008). These procedures are well established and have been adopted in many previous studies characterizing the electrical responses to different experimental conditions (see for examples Guthrie and Buchwald 1991; Murray et al. 2002, 2004; Britz et al. 2009).
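To make the clustering logic concrete, the following MATLAB sketch implements a deliberately simplified (and inefficient but transparent) AAHC on a group-averaged ERP; real implementations are more refined, and the target number of clusters would be chosen by the K-L criterion rather than fixed as it is here.

```matlab
% Simplified AAHC sketch. 'erp' = [electrodes x time] group-averaged ERP
% (e.g. the two conditions concatenated in time); hypothetical input.
V = erp ./ repmat(sqrt(sum(erp.^2, 1)), size(erp, 1), 1);  % unit-norm maps
labels  = 1:size(V, 2);          % "atomize": each time frame starts as a cluster
targetK = 11;                    % fixed here; K-L criterion in the real analysis
while numel(unique(labels)) > targetK
    ids = unique(labels);  K = numel(ids);
    T = zeros(size(V, 1), K);  gev = zeros(1, K);
    for k = 1:K                  % cluster template = 1st singular vector
        M = V(:, labels == ids(k));            % (polarity-invariant)
        [U, ~, ~] = svd(M, 'econ');
        T(:, k) = U(:, 1);
        gev(k) = sum((T(:, k)' * M).^2);       % variance explained by cluster k
    end
    [~, worst] = min(gev);       % atomize the weakest cluster...
    for t = find(labels == ids(worst))         % ...and reassign its maps
        c = abs(T' * V(:, t));   % spatial correlation with each template
        c(worst) = -Inf;         % the dissolved cluster is no longer available
        [~, best] = max(c);
        labels(t) = ids(best);
    end
end
```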

To statistically assess the validity of the microstate results, we applied a fitting procedure based on the spatial correlation between single-subject ERPs and the template maps (Murray et al. 2008; see also Brandeis et al. 1995). For each subject and condition, the amount of time characterized by each template was obtained within a specific time window of interest. A repeated-measures ANOVA with MAP and CONDITION as factors was then performed, with the aim of validating at the single-subject level the differences highlighted by the microstate segmentation. When significant effects were found, post hoc analyses (p < 0.05) were conducted with Bonferroni correction (n = 4, as only within-map or within-condition post hoc comparisons were evaluated). Microstate segmentation and back-fitting, performed across time and experimental conditions, allowed us to determine whether and when the conditions engaged distinct scalp potential configurations, which in turn imply different intracranial generators (Murray et al. 2008).
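In a minimal form, back-fitting amounts to labeling each single-subject time frame with the template it spatially correlates with best; the sketch below (hypothetical variable names) computes the percentage of a window assigned to each of two competing templates for one subject and condition.

```matlab
% Back-fitting sketch. subjERP = [electrodes x time], average-referenced;
% T9, T10 = unit-norm template maps; win = indices of the time frames of interest.
X = subjERP(:, win);
X = X ./ repmat(sqrt(sum(X.^2, 1)), size(X, 1), 1);  % normalize each map
c9  = abs(T9'  * X);             % polarity-invariant spatial correlation
c10 = abs(T10' * X);
pctT10 = 100 * mean(c10 > c9);   % % of time frames won by Template 10
pctT9  = 100 - pctT10;           % these percentages enter the MAP x CONDITION ANOVA
```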

Once the time windows showing different maps across conditions had been identified, we computed a distributed inverse solution using the local autoregressive average (LAURA) model. This model reconstructs brain electric activity at each point of a 3D grid of solution points; each solution point is considered a possible location of a current source, so no a priori assumption is made on the number of dipoles in the brain. The computation provides a unique configuration of activity over the solution points that explains the surface measurements. Since an infinite number of current-source distributions within this 3D grid can lead to exactly the same scalp potential map, the inverse problem is highly underdetermined, and additional assumptions are needed to identify the ‘optimal’ or ‘most likely’ solution. LAURA incorporates biophysical laws as constraints driving the calculation of a unique solution, and can deal with multiple simultaneously active sources. The solution space was computed on a locally spherical head model with anatomical constraints (LSMAC model, Brunet et al. 2011) and comprised 3001 solution points equidistantly distributed within the brain structures of the Montreal Neurological Institute (MNI 152) average brain.
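In general terms, LAURA belongs to the family of regularized distributed linear inverse solutions; the generic weighted minimum-norm formulation of this family (LAURA's specific ingredient, not spelled out here, is a weighting operator W derived from a local autoregressive model with coefficients decaying with distance) reads:

```latex
\Phi = K J + \varepsilon, \qquad
\hat{J} = \arg\min_{J}\Bigl\{ \lVert \Phi - K J \rVert^{2}
        + \lambda \,\lVert W J \rVert^{2} \Bigr\}
       = (W^{\top}W)^{-1} K^{\top}
         \bigl( K (W^{\top}W)^{-1} K^{\top} + \lambda I \bigr)^{-1} \Phi
```

Here Φ collects the scalp potentials, K is the lead field linking the 3001 solution points to the electrodes, J the unknown current densities, and λ the regularization parameter.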

The inverse solution was computed for each template map returned by the microstate segmentation whose duration exceeded 20 ms. In this way, we obtained for each condition a sequence of cortical activations over time. For each source localization, the best fit (i.e. the voxel exhibiting the maximum current density within a given cortical region) was identified. In addition, a t test in the source space was computed on the average current density during the significant GFP time interval, so as to reveal the cortical sources differentially active in congruent and incongruent trials.

Results

Behavioral Experiment

Congruent and incongruent conditions were analyzed separately. First, the accuracy of the categorization was assessed in order to verify the validity of the stimulus selection. Participants responded correctly to 92.86 % of congruent trials and 91.90 % of incongruent ones, demonstrating that the two verb groups were reliably perceived as congruent and incongruent, respectively. A t test on the two categories showed no statistical difference (t(20) = −0.722, p = 0.47). Subsequently, in order to evaluate the semantic priming effect at the behavioral level, we discarded incorrectly answered trials (7.62 % of all administered trials: 3.57 % in the congruent condition and 4.05 % in the incongruent one) and calculated the mean RT of the correct responses in each condition. Responses outside the individual mean ± two standard deviations were treated as outliers and discarded (4.1 % of the correct trials). The t test on RTs revealed faster responses in congruent trials (t(20) = −5.205, p < 0.001; η² = 0.575): RTs for congruent trials were about 160 ms shorter than for incongruent ones (1,063 vs. 1,226 ms).
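For clarity, the trimming rule corresponds to the following minimal MATLAB sketch (the variable rt, holding one participant's correct-trial RTs in ms, is hypothetical).

```matlab
mu = mean(rt);  sd = std(rt);
keep   = rt > mu - 2*sd & rt < mu + 2*sd;   % within individual mean +/- 2 SD
meanRT = mean(rt(keep));                    % per-condition mean of retained RTs
```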

EEG Experiment

The participants were able to judge the congruence/incongruence between gesture and verb. The mean percentages of correct responses in the Congruent and Incongruent conditions were 99.4 and 97 %, respectively.

The grand-averaged ERPs (Fig. 2, panel A) revealed a common scalp pattern for the processing of both the Congruent and the Incongruent condition. A first peak, occurring 116 ms after verb onset, showed a large positivity over occipital regions, while a reversed pattern was detected at the second peak (212 ms), with the positivity spread over anterior regions. The third peak occurred at 272 ms and presented a posterior, right-lateralized positivity. The last peak (448 ms) was characterized by a negativity around the vertex electrode (Cz) and by a positivity over bilateral occipito-temporal regions.

The t test performed sequentially on the GFP values at all time frames (see Fig. 2, panel B) revealed no difference between the two conditions in the early stages of processing (until about 400 ms after verb onset). In contrast, in the time window ranging from 428 to 484 ms, the GFP was larger for the Incongruent than for the Congruent condition. Panels C and E of Fig. 2 report the t-map (Congruent minus Incongruent) and the effect-size map, computed as Cohen’s d and plotted on the scalp by means of EEGLAB functions (Delorme and Makeig 2004). These panels show that the largest difference between the Congruent and Incongruent conditions is expressed bilaterally at centro-parietal sites, which present both a significant difference at the scalp level (asterisks indicate electrodes with a significant difference in average amplitude, p < 0.01) and the highest effect-size values.

To characterize the spatio-temporal dynamics underlying the amplitude difference between the Congruent and Incongruent conditions, we performed a topographic pattern (microstate) analysis based on AAHC, which allowed us to disentangle whether this effect was due to a stronger recruitment of the same cortical generators or to the presence of distinct topographic distributions. Figure 3 depicts the results of the microstate segmentation of the two conditions (panel B), as well as the scalp template maps of all microstates (panel A). The K-L criterion indicated 11 as the optimal number of microstates, explaining 90.4 % of the variance of the dataset. Two analogous microstate sequences characterized the first 390 ms of both conditions, confirming a common early processing stage after verb onset. In the subsequent time window (392–520 ms), the segmentation revealed different microstates for the two conditions: Template 9 (392–450 ms) was assigned to the Incongruent condition, while Template 10 (392–524 ms) was assigned to the Congruent one. Importantly, this difference in microstate sequence coincides with the time window identified by the GFP analysis, suggesting that different topographies (and thus different cortical generators) could underlie the strength difference characterizing the reported N400 effect. While the whole N400 duration was characterized by the same scalp topography (Template 10) in the Congruent condition, in the Incongruent condition Template 10 (450–536 ms) occurred only after Template 9. Afterwards, the temporal sequence of microstates was again similar between the two conditions.

Fig. 3

Source localization of the N400 effect during the Congruent and Incongruent conditions. a and b depict the results of the microstate segmentation for the Congruent and Incongruent conditions. The template maps of each microstate are shown in (a), underlined in the same color as the corresponding microstate in (b). The X-axis reports the time relative to word presentation and is aligned with the upper panel. The Y-axis indicates the GFP amplitude, ranging from 0 to 1 µV². The N400 time window (380–480 ms) is mainly covered by microstate 10 in the congruent condition and by microstates 9–10 in the incongruent one. Subsequently, both conditions present the occurrence of microstate 11. In (c) the source localization of these three microstates is reported on a set of axial slices from the MNI152 brain template. Each localization is shown in a square colored according to the corresponding microstate. The color code for the current density in the brain space ranges from 0 to 0.0003 A²/m². d Depicts the statistical comparison (t test) performed in the source space during the N400 time interval. The activations relative to the incongruent condition (left MTG/STS and parietal areas) are reported in blue, the one relative to the congruent condition (ATL) in red (Color figure online)

The selectivity of Template 9 for the Incongruent condition was assessed by means of a back-fitting procedure applied to the two conditions in the 428–484 ms interval, i.e. the period showing the significant GFP modulation. The repeated-measures ANOVA indicated a significant MAP × CONDITION interaction (F(1, 20) = 12.134, p = 0.0023, η² = 0.142). Post hoc analyses revealed a significant prevalence of Template 9 in the Incongruent relative to the Congruent condition (mean percentage of time frames assigned to Template 9: 54.46 vs. 24.11 %, respectively, p < 0.05) and the inverse relationship for Template 10 (mean percentages: 45.54 vs. 75.89 %, p < 0.05).

The source localization performed on the topographic maps returned two different generator patterns for Template 9 and Template 10 (Fig. 3, panel C). Template 10, which covered the entire duration of the N400 interval in the Congruent condition, mainly activated the left anterior temporal lobe (best fit at MNI [−40, 8, −30]); a weaker activity was also detected in a medial cluster located under the vertex (medial frontal lobe, best fit at MNI [−4, −20, 60]). Conversely, Template 9, which described the first part of the N400 evoked by the Incongruent condition, presented, in addition to a similar left anterior temporal region (best fit at MNI [−45, 6, 34]), a marked activation of the left posterior middle temporal gyrus (best fit at MNI [−50, −50, −13]). As in Template 10, a medial frontal cluster was present in this microstate as well (best fit at MNI [−4, −15, 60]), but with a larger activation. The second part of the N400 time window in the Incongruent condition was labeled as Template 10, thus sharing the cortical generators described for the Congruent condition. After the N400 effect, both conditions presented an EEG activity labeled as Template 11, lasting about 80 ms. The source localization of this template showed, in addition to bilateral anterior temporal regions (best fit: right [40, 10, −30], left [−40, 10, −30]), a left inferior frontal activation (best fit at [−33, 32, −18]).

Finally, the t test computed in the source space substantially confirmed the previous findings, i.e. a greater involvement of posterior temporal regions for incongruent stimuli and of anterior temporal areas for congruent ones. A third significant cluster, mostly active during incongruent trials, was located in left parietal regions, spreading over BA40 and BA7 (Fig. 3, panel D).

Discussion

In the present study, a semantic priming paradigm was used to investigate how the observation of a symbolic gesture may influence the processing of a subsequent word whose meaning is congruent or incongruent with the gesture. To this aim, two studies were conducted. In the behavioral study, response times in the congruent condition were shorter than in the incongruent one. It must be noted that the response times recorded in our behavioral experiment encompass the time needed (1) to achieve the semantic comprehension of the verb and (2) to perform a semantic congruence judgment on the just-observed gesture-verb pair; the judgment can take place only after semantic comprehension of the verb has occurred. Even if we cannot disentangle whether the facilitation concerns the semantic comprehension of the verb or the semantic congruence judgment, we observed that congruence between gesture and word speeds up the entire process, whereas, when the word meaning is incongruent with the gesture, the lack of a priming effect leads to longer processing times. The ERP study showed a larger N400 amplitude in case of incongruence between gesture and word. Source localization applied to the N400 time interval revealed the left MTG and ATL as the areas mainly involved in N400 generation; in particular, MTG was specifically activated in the Incongruent condition.

Although the N400 response is typically associated with the integration of words into a sentence, it can be elicited by most meaningful stimuli (see Lau et al. 2008 for a review), including faces (Barrett and Rugg 1989), pictures (Barrett and Rugg 1990; Ganis et al. 1996), isolated words (Bentin et al. 1985), and pseudo-words (Rugg and Nagy 1987). Generally, these studies compared the average ERP amplitude in a time window either centered on the latency of the ERP peak or defined a priori according to the previous literature. In these time intervals, the majority of these studies reported an amplitude modulation mostly involving the centro-parietal region. Fewer studies localized the N400 effect by applying an inverse solution to the scalp data, identifying the time interval of interest either from the previous literature (Ren et al. 2009; Baetens et al. 2011; Kim et al. 2012) or from a statistical analysis performed on the scalp electric data (Simos et al. 1997; Caldara et al. 2004; Proverbio et al. 2009; Yang et al. 2010).

As pointed out in the Introduction, a large body of literature has addressed the N400 effect during the interaction of gestures with speech, but most of this research focused on co-speech gestures rather than emblems. However, the properties of emblems make them the best candidate for investigating, by means of a semantic priming paradigm, the interplay between two independent communication forms. To our knowledge, no previous ERP study has characterized the spatiotemporal dynamics of how an emblem influences language processing. Our electrophysiological results demonstrate that symbolic gestures, too, interact with language through different processes, depending on the semantic relation between the two communication forms.

Microstate segmentation and source localization allowed us to locate, in both time and space, the networks underlying these different processes. In the time window of the N400 effect, different generator patterns were identified for the congruent and incongruent conditions. Activation of the left posterior temporal cortex (STSp and MTGp) was dominant when gesture and word were incongruent. While in the congruent condition the gesture served as a semantic prime, allowing a rapid grasp of the word meaning, in the incongruent condition the word meaning needed to be represented and stored separately from the gesture. These findings are in line with the fMRI study by Willems et al. (2009), which found greater activity in left STSp and MTGp when speech was accompanied by incongruent pantomimes. The left posterior temporal cortex is well known to be involved in the long-term storage of lexical-semantic representations (Hickok and Poeppel 2007; Lau et al. 2008), and many studies have also associated its activity with the recognition of word meaning (Binder et al. 1997; Chao et al. 1999; Gold et al. 2006). While a vast literature has highlighted the involvement of this area in the storage of lexical representations, which aspect of these representations is stored remains an open question. Some studies proposed that this region stores conceptual features associated with lexical representations (Binder et al. 2009), while others suggested that lexical representations are stored in MTG and interface with a semantic network distributed across several brain areas (Hickok and Poeppel 2007). An additional cluster, mostly covering the left parietal cortices, was selectively activated by the incongruent condition. Even if we cannot be definitive about the functional role of this activation, we may speculate that it reflects the modal counterpart of the amodal lexical/semantic representation taking place in MTG/STS when a verb incongruent with the gesture (and thus not primed) is read. This speculation is in line with a large body of literature demonstrating the involvement of several parietal regions during emblem observation (Nakamura et al. 2004; Lotze et al. 2006; Villarreal et al. 2008). In addition, as proposed by Andric and Small (2012), when a gesture is incongruent with speech, the parietal activation could reflect the attempt at a more detailed processing of the hand action, needed to reconcile the divergent spoken and manual information.

Whereas posterior temporal areas were activated mainly by semantically incongruent stimuli, the left anterior temporal lobe (ATL) was recruited in both conditions during the whole N400 time period. Our data support the hypothesis that ATL activity is not modulated by the semantic relation between gesture and language, but rather plays a role in the interplay of these multimodal inputs. Studies on semantic dementia revealed that such patients present a progressive and selective degradation of semantic representations regardless of stimulus modality, leading to the postulate that ATL activity is not modality-dependent (Warrington 1975; Bozeat et al. 2000; Rogers et al. 2004; Coccia et al. 2004). In the same vein, repetitive TMS studies on healthy individuals (Pobric et al. 2007; Lambon Ralph et al. 2009) reported a slowing selective for semantic tasks when stimulation was delivered over the lateral ATL. Taken together, these data suggest that ATL may constitute a semantic hub bringing together modality-specific information to form an amodal semantic representation (see Visser et al. 2010 for a meta-analysis of 164 neuroimaging studies). Concerning the relation between the N400 and ATL, previous intracranial studies reported that this region contributes to the generation of this component (Nobre and McCarthy 1995; McCarthy et al. 1995). Conversely, neuroimaging studies addressing this issue have been inconclusive (see Lau et al. 2008), as measurements from anterior temporal areas often suffer from signal loss due to the “susceptibility artifact” (Jezzard et al. 2001; Devlin et al. 2000). Using EEG and source localization, we were able not only to observe that ATL sustains most of the N400 activity, but also to describe the temporal relation of this activity with that of other cortical regions.

The subsequent time window (Template 11, up to 600 ms after word onset) was characterized by the same map in both conditions. Its localization indicated the activation of the left inferior frontal cortex and of the anterior temporal cortex bilaterally. An extensive literature has addressed the role of IFG in the processing of speech, gesture and their integration (see Andric and Small 2012 for an exhaustive review). In addition, fMRI studies reported that IFG is more active when unrelated stimuli are administered. For example, Willems et al. (2007) reported a BOLD activation of the pars triangularis of the IFG selectively modulated by the observation of the incongruent relative to the congruent condition (speech-iconic gestures). The involvement of IFG in fMRI studies addressing the N400 signal was mostly revealed by experimental designs using long time intervals between the onset of the first stimulus (prime) and the onset of the second stimulus (target), usually referred to as the stimulus-onset asynchrony (SOA). Long SOAs (see Table 2 in Lau et al. 2008) elicited greater IFG responses, a result ascribed to a strategic priming effect, while short SOAs recruit the IFG less, as automatic priming would take place. In our study no fixed SOA was used, as dynamic stimuli can convey their meaning with different timing; however, judging from the timing of the most communicative frame of the video, the average SOA was larger than 600 ms. In relation to these mismatching results (long SOA but lack of IFG activation during the N400 time window), it must be noted that (a) given the low temporal resolution of fMRI, such studies cannot determine whether the differential activation of IFG lies in a time window around 400 ms after target presentation (thus contributing to the generation of the N400 effect) or rather at a later stage, reflected in the modulation of late EEG components; and (b) fMRI and EEG do not always provide parallel results, both for general tasks (van Petten and Luka 2006) and for tasks known to elicit an N400 effect (Geukes et al. 2013). The IFG involvement in the abovementioned fMRI studies could be related to an increase in the amplitude of the late positivity that follows the N400 response, sometimes called the post-N400 positivity (van Petten and Luka 2006; Federmeier et al. 2007). In line with this view, our findings do not support a key role of IFG in the neurocognitive processes underlying the N400 effect (see also Brouwer and Hoeks 2013 for similar conclusions).

Different interpretations have been proposed for the functional role of the N400 in semantic processing. The two major accounts propose that it reflects either the semantic integration of a critical stimulus with the working context (integration view) or the facilitated activation of features of the long-term memory representation associated with a lexical item (facilitated access). The first hypothesis is in line with a conceptual-storage role of MTG: the input stimulus is processed and transformed into its meaning, which is first stored (MTG) and subsequently integrated with the prime (ATL). Conversely, a more lexical view of MTG favors the facilitated-access hypothesis, whereby this activation may pre-activate a network of long-term memory representations associated with the lexical item. However, a simple semantic priming task does not allow us to adjudicate between these two views, as contextual factors tend to facilitate both lexical access and semantic integration simultaneously; both accounts may therefore underlie the N400 effect reported here.

In summary, departing from the previous literature on gesture-language interaction, we addressed within the N400 framework two communication forms that are expressed in different modalities and fully autonomous in their meaning. We provided evidence of how the observation of an emblem influences language perception, and of how this interplay is mainly instantiated by different portions of the temporal cortex, demonstrating that emblems, even when presented in isolation, carry an autonomous semantic content capable of interacting with verbal language.