Introduction

Throughout human history, spoken language has undergone constant changes. This process has not only led to the formation of different kinds of languages, but also to the development of a vast number of different spoken regional and/or ethnic varieties world-wide (e.g., dialects and even subcultural jargons; Aitchison 2001). However, only a limited number of standard written language norms exist (Chambers and Trudgill 2002; Greenfield 1972). As a consequence, the way words are chosen and articulated in a certain dialect may differ strongly from the (written) standard language equivalent. Even though such linguistic differences may influence mapping between written and spoken language during literacy acquisition (Terry et al. 2012, 2010), little is known about how dialect-specific differences in spoken language are reflected at the neural level. In the current study, we investigate neural processing related to familiarity with dialect-specific pronunciation and lexicality by measuring event-related potentials (ERPs) in preliterate children speaking one of two German language varieties, where one language variety corresponds more strongly to the German written norm than the other. The central objective of this study is thus to determine to what extent speaking a dialect impacts phonological and semantic processing at the neural level, in a dialect versus standard language contrast.

Studies using temporally and/or spatially sensitive neuroimaging techniques have brought forth information on general neural mechanisms involved in language processing (e.g., Hickok and Poeppel 2007; Price 2012; Vigneau et al. 2006, 2011) and have helped researchers to better understand the mechanisms of higher-order processing of phonological and (lexico-)semantic material (Friederici and Weissenborn 2007; Ganushchak et al. 2011; Rabovsky and McRae 2014). In terms of phonetic speech perception, studies using electroencephalography (EEG) have provided strong evidence for the fact that language-specific influences impact neural processing (e.g., Conrey et al. 2005; Näätänen 2001; Näätänen et al. 1997; Winkler et al. 1999). Specifically, these studies illustrate that neural response patterns to native phonological speech sound contrasts differ from patterns found for non-native variants. Early exposure to a specific linguistic environment thus seems to impact the development of one’s mother tongue and the phonetic inventory associated with it (Peltola et al. 2003).

At the level of semantic processing, EEG research has examined the brain’s response to congruity of expectancy (or the lack of it) by systematically presenting congruous and incongruous material. A wide variety of experimental methods has been used, ranging from paradigms entailing entire sentence structures to simple prime-target pairings (e.g., priming by sentences: Hagoort 2008; Kutas and Hillyard 1980a, b; McCallum et al. 1984; Röder et al. 2000; Schulz et al. 2008; van Berkum et al. 1999; written words as primes and targets: Khateb et al. 2007, 2010; Landi and Perfetti 2007; images as primes and targets: Barrett and Rugg 1990; Luck and Hillyard 1994; or spoken word-image pairings as prime-target dyads: Friedrich and Friederici 2004, 2006; Henderson et al. 2011). All these studies explored the modulation of the N400 ERP component. In particular, the negative-going N400 represents the difference ERP between congruous and incongruous conditions which occurs approximately 250–500 ms post-stimulus onset and peaks mainly around 400 ms with a wide-spread centro-parietal scalp distribution in adults (e.g., Federmeier and Kutas 2002; Kutas and Federmeier 2000, 2011; Nigam et al. 1992). Similar N400 topographies have also been detected in children (Friedrich and Friederici 2004, 2006; Juottonen et al. 1996), but they seem to be more widely distributed over the scalp, to display higher ERP amplitudes, and to have longer peak latencies and/or a slight temporal delay, as a result of still ongoing neural maturation processes (Atchley et al. 2006; Byrne et al. 1999; Hahne et al. 2004; Henderson et al. 2011; Holcomb et al. 1985).

Regarding its function, the N400 is linked to semantic context in an inverted fashion: N400 amplitude increases with greater semantic context violations (Dunn et al. 1998; Khateb et al. 2010). For example, in N400 priming studies semantically fitting prime-target pairings elicit only weak N400 deflections as a result of ongoing neural processing. However, non-fitting prime-target pairs trigger stronger N400 amplitudes in response to a substantial violation of expectancy and/or because the critical word requires more effort for semantic integration within a specific context (Barrett and Rugg 1990; Friedrich and Friederici 2004, 2005; Kutas and Federmeier 2011). By employing a pairwise ‘spoken word-colorful image’ paradigm, Friedrich and Friederici (2004) demonstrated that neural responses to spoken words that did not match with a simultaneously presented image (real object names or even pronounceable pseudowords) elicited a more negative-going waveform than did audio–visually matching conditions. In such a manner, the N400 seems not only to reflect ongoing neural processing in response to stimuli bearing potentially meaningful information (Kutas and Federmeier 2000), but further reveals the intensity with which the presented stimulus overlaps with concepts stored in the mental lexicon (Nigam et al. 1992).

In a study examining semantic mismatch detection within a sentence reading task, where sentence-final words were either semantically matching or anomalous, the centrally-located N400 effect was paired with an earlier, slightly left-lateralized posterior negativity between 240 and 320 ms after stimulus presentation, called an early N400 effect (Schulz et al. 2008). A more fronto-centrally located ERP preceding the N400 effect is also reported in the literature and is often referred to as a phonological mapping negativity (PMN). A PMN component typically occurs approximately 200–350 ms after stimulus presentation at fronto-central electrode sites in adults (Connolly and Phillips 1994; Connolly et al. 2001) and children (e.g., Bonte and Blomert 2004; Connolly et al. 1995; Desroches et al. 2009). Functionally, the PMN is affiliated with the phonological stage of auditory word processing and is sensitive to phonological constraints during semantic processing (Connolly et al. 1990). As such, the fronto-central PMN likely represents an autonomous neural process reflecting a different level of stimulus processing than the N400 (Connolly et al. 1995, 2001), while its relation to the early N400 effect remains unclear.

The N400 is often followed by a later occurring positive deflection known as a late positive complex (LPC) (e.g., Conrey et al. 2005; Dunn et al. 1998; Fitzpatrick and Indefrey 2014; Grieder et al. 2012; Juottonen et al. 1996; McCallum et al. 1984) or as a post-N400-positivity (PNP) (e.g., DeLong et al. 2014; Thornhill and Van Petten 2012), peaking between 500 and 900 ms at parietal scalp locations in adults (Coulson et al. 2005) and 600–1100 ms in children (e.g., Schulz et al. 2008). Although the function of the LPC is still not clearly determined, some researchers link it to the detection and/or repair of faulty sentence structures or the detection of ill-formed words (e.g., Fitzpatrick and Indefrey 2014; Kutas and Federmeier 2011). Others suggest that the LPC reflects processes involved in perceptual awareness in terms of congruity judgment (e.g., Buchwald et al. 1994; Conrey et al. 2005; Daltrozzo et al. 2012; Juottonen et al. 1996; Kutas and Hillyard 1980b; McCallum et al. 1984). Moreover, the LPC has also been linked to (long-term) semantic memory and classification effects (e.g., Coulson et al. 2005; Curran et al. 1993; Röder et al. 2000).

The multitude of studies on language-related semantic processing has shown that N400 effects seem to occur similarly across different languages (e.g., English: Barrett and Rugg 1990; German: Friedrich and Friederici 2005; Dutch: Brown and Hagoort 1993). Moreover, a few studies have ventured into the domain of bilingual language processing (e.g., Hahne et al. 2004; Moreno et al. 2002; Moreno and Kutas 2005; Weber-Fox and Neville 1996), addressing the question of how bilinguals manage to select the appropriate word in an intended language context and whether mechanisms for language-specific word choice modulate electrophysiological responses during semantic congruity detection (e.g., Kutas et al. 2009). Research on bilingual adults has shown that semantic anomaly detection, as indexed by the N400, is equally successful in an individual’s first (L1) and second (L2) language context (Hahne et al. 2004; Kutas et al. 2009; Moreno and Kutas 2005), as semantically anomalous contexts triggered larger negative deflections than congruous ones. However, L2 language proficiency (Proverbio et al. 2002) and age of L2 acquisition (Weber-Fox and Neville 1996) seem to modulate N400 amplitude and latency, with the N400 being smaller and occasionally later for the less proficient language form. In light of these findings, the question arises whether differential mechanisms for language-specific speech perception can also be found within alternative variations of a single language, i.e., across dialects.

In one of the few ERP studies investigating neural measures for cross-dialectal speech perception within an incongruity detection N400 paradigm, Martin et al. (2015) examined differences in neural responses during the processing of British English (BE) and American English (AE) vocabulary and pronunciation in adult native speakers of British English. They uncovered two prominent effects: First, non-native vocabulary (e.g., holiday in BE versus vacation in AE) elicited larger negative N400 deflections. Second, facilitation effects for speech stimulus processing, i.e., reduced N400 amplitudes, were found whenever the target word was spoken in the corresponding dialect (e.g., holiday with BE accent versus AE accent). Martin et al. (2015) were thus able to establish that word integration is strongly dependent on dialect-based familiarity and seems to rely on context-specific information. Accordingly, in the case of adult speakers, integration of prior knowledge on cross-dialectal lexical variations seems to influence vocabulary processing as speech unfolds. Moreover, in a recent ERP study, Lanwermeyer et al. (2016) examined how dialect-specific competencies influenced cross-dialectal comprehension in adult native speakers of the Central Bavarian dialect (German dialect). Participants listened to sentences where sentence-final words were either native or non-native to their dialect and were asked to rate these sentences according to context goodness. Results revealed that ERPs for incomprehension and incongruity both triggered a biphasic N400-LPC pattern, indicating that lexeme mismatch and lexeme unfamiliarity seem to evoke similar effects for semantic anomaly detection (Kutas and Federmeier 2011).

The studies mentioned above address the issue of how dialects influence speech processing at the neural level in adult speakers and how familiarity with dialectal variants of a specific language modulates neural processing in terms of semantic integration. However, the question of how such mechanisms occur for dialect speakers with only limited—or even no prior—knowledge of the diverging language variety still remains unanswered. This issue is particularly important as it attempts to elucidate neural processing that occurs in dialect-speaking children before they are exposed to the normative influences of learning the standard language variety in school.

In the following we will address this issue by contrasting two groups of pre-school children who grow up speaking one of two varieties of German (either Standard German (StG) or Swiss German (CHG) dialect). Examining the CHG versus StG language variety situation has several advantages: Although German constitutes the standard language employed both in the German speaking part of Switzerland and in Germany, there are fundamental country-specific differences in how German and its varieties are implemented. In the German speaking part of Switzerland a diglossic language situation exists (Ferguson 1959). The term ‘diglossia’ describes a language situation where more than one language variety (e.g., High versus Low variety) is used in a given society, but where no social group employs the High variety for colloquial conversation (Saiegh-Haddad 2012). Specifically, CHG is the primary language variety spoken in everyday life by German-speaking Swiss and diverges considerably from spoken and written StG in terms of phonology, vocabulary and syntax (see Supplementary Material S1 for a more detailed description of linguistic differences between CHG and StG; or see e.g., Fleischer and Schmid 2006). However, news broadcasts and official governmental reports are mostly communicated in StG. As such, StG co-occurs in moderation alongside the native CHG variant in Switzerland. In contrast, in Germany, most spoken dialects nowadays closely resemble linguistic approximations of the standard StG language variety (Elspass 2007) and thus differ less from the standard StG equivalent than CHG. Furthermore, speaking CHG dialect in Switzerland is not confounded by the factor of socioeconomic status. Accordingly, a comparison of CHG and StG seems to be highly suited to deduce differences occurring solely due to language variety specific influences. 

Based on the Swiss diglossic language situation and the fact that the Swiss educational system requires school to be taught in StG only from the elementary level on, it is assumed that native CHG dialect speaking children have relatively little exposure to StG before school enrollment. Yet several German television and radio programs for children exist that are spoken in StG. Therefore, it is possible that CHG native children may become at least to some degree familiarized with spoken StG even before school. However, as such contact is neither structured nor directly monitored, CHG native kindergarten-aged children likely do not develop high StG skills before formal instruction in school.

The main goal of this study is thus to determine to what extent an individual’s dialect-specific background influences semantic processing at the neural level. We were interested in whether or not neural processing differences occurred in young speakers of a given German language variety whenever they encountered vocabulary and/or pronunciation variants corresponding to their native language variety. By means of EEG, we recorded neural processes that occurred in response to a ‘spoken word-image’ paradigm and manipulated ‘word-image dyad congruity’ for dialect-specific differences in vocabulary and phonology (i.e., pronunciation). In doing so, we sought to explore to what extent familiarity (or the lack of it) with dialect-specific word variants influenced the decoding and activation of semantic information at the neural level, and whether dialect-based pronunciation variants affected semantic information processing. We furthermore incorporated an audio-visual mismatch control contrast independent of the listener’s dialectal background to investigate effects purely due to semantic incongruity detection and to determine whether the employed ‘spoken word-image’ paradigm elicits neural response patterns similar to those found in other studies using similar semantic anomaly detection tasks (e.g., Friedrich and Friederici 2004, 2006).

Our main hypothesis is that familiarity with a dialect impacts neural processing of semantic information. We thus anticipate that neural processing mechanisms dealing with unfamiliar dialect-specific vocabulary should be comparable to the processing of semantically incongruous stimulus material and/or to ERPs elicited by paradigms involving pseudowords (e.g., Domahs et al. 2009; Friedrich and Friederici 2004, 2006; Kutas and Federmeier 2011). Accordingly, unfamiliar dialect-specific vocabulary should elicit larger N400 amplitudes as compared to familiar dialect-specific word variants, (1) because of a violation of stimulus expectancy and (2) because lexical integration requires more effort. However, we do not expect to find any N400 effects whenever images are paired with unfamiliar dialect-specific pronunciations that only differ in terms of word-initial vowel duration but not in lexicality itself, because slight phonemic variations likely still trigger the correct mental concept (e.g., Brunellière et al. 2009; Lanwermeyer et al. 2016). Furthermore, we expect to find an ERP preceding the N400, i.e., an early N400 effect or, alternatively, a PMN for early phonological stimulus processing. Similar evidence of dialect-based processing differences should also be detectable in the ERP following the N400, i.e., in the LPC. Specifically, we anticipate that word stimuli pronounced in the alternative (non-native) dialect or as an unfamiliar word will trigger higher-order control mechanisms for discrepancy detection (e.g., congruity judgment) similar to late ERP effects found for sudden physical stimulus changes in semantic mismatch paradigms (e.g., by altering a speaker’s voice, McCallum et al. 1984; or by manipulating the font size of visual stimuli, Kutas and Hillyard 1980a).

Materials and Methods

Participants and Study Design

In the main analysis of this study, we examined 35 native CHG [18 boys, 17 girls; mean age: 6.55 years (SD: ± 0.37 years)] kindergarten-aged children living around Zurich, Switzerland and 18 same-aged native StG [7 boys, 11 girls; mean age: 6.57 years (SD: ± 0.32 years)] children living in Magdeburg, Germany. One additional CHG native child was excluded due to low accuracy values in an attention monitoring task embedded in the ERP experiment (mean audio and visual accuracy < 50%), and data from one additional StG native child were omitted due to low scores in the non-verbal IQ test (IQ < 80). All subjects reported normal or corrected-to-normal vision and normal hearing, and had no prior history of neurological diseases or psychiatric disorders.

Data collection took place shortly before the summer break after which the children were to be enrolled into 1st grade of primary school. Testing was conducted in labs at the Department of Psychology at the University of Zurich (Switzerland), in examination rooms at the Otto-von-Guericke University Magdeburg (Germany), or in vacant rooms of selected day-care facilities in the city of Magdeburg. The audio-visual EEG experiment reported here belonged to a series of several short experiments investigating dialect-based differences for phonological, semantic and syntactic processing at the neural level in young children (total duration ca. 3 h). In addition to the EEG session, we also conducted a ca. 2.5 h long behavioral examination session to measure precursor abilities for reading and spelling: phonological awareness skills (TEPHOBE; Mayer 2011), upper- and lower-case letter knowledge, as well as an IQ score (three subtests of the HAWIK: digit span forwards/backwards, matrix reasoning, and block design; Petermann and Petermann 2010). However, we will not provide results for the behavioral scores in this study. A long break during which the child was able to recuperate and consume a snack was mandatory after ca. 1 h of testing in either examination session.

In order to investigate language-, development- and health-specific factors, an extensive questionnaire was sent home to the children’s parents and/or primary caregivers prior to the first examination date. Of main concern was that none of the children had any developmental impairments and that either only CHG or only StG was the language variety learned from birth. The questions pertaining to German language variety exposure required the parents to specify in detail which language variety they and their child spoke natively and to indicate whether and, if yes, to what extent their child was exposed to a non-native German language variety.

Written informed consent was obtained from each participant and the attending caregiver prior to each examination session. Participants were compensated with a book voucher worth 40 CHF for their participation. Additionally, each child received a small gift (toy, candy, colorful pencils etc.) after each of the two experimental sessions (behavioral test battery and EEG recording). In the months following the data collection, all parents received a brief written report regarding their child’s performance in the behavioral session. The study protocol was approved by the ethics committee of the Faculty of Arts and Social Sciences at the University of Zurich.

Stimuli

In the audio-visual EEG experiment, we employed spoken object names and pictures as stimuli (i.e., toys, animals, clothing, food or simple objects). For each of the three experimental contrasts in our semantic congruity detection task (dialect-independent contrast, CHG versus StG vocabulary contrast, and CHG versus StG pronunciation contrast), we employed 13 words that were controlled for syllable number, word frequency, and phoneme distribution. Word frequency was tested using the online German Children’s Book Corpus ChildLex developed by the Max Planck Institute for Human Development Berlin, Germany (internet query portal: http://alpha.dlexdb.de/query/childlex/childlex1/typ/list/, last visited January 13th, 2016). For the full list of words and their frequencies see Table 1.

Table 1 List of words used for spoken stimuli and corresponding word frequency indices (asserted by ChildLex)

For the dialect-independent control contrast, words were chosen that are pronounced the same in CHG and StG. In the CHG versus StG vocabulary contrast, each object had a different name in CHG and StG. In the CHG versus StG pronunciation contrast, words were used that have a short vowel in the word-initial syllable in CHG, but a long vowel in StG. During experiment development, a Swiss kindergarten teacher checked whether children were familiar with the specific object names spoken in their native language variety before school enrolment, and the pictures were tested in a small pilot study to confirm that young children would associate the pictures with the intended words. However, we did not examine CHG and StG picture naming in each of the examined kindergarten-aged children due to time constraints.

All spoken word stimuli were recorded in a sound-proof recording cabin at the Phonetics Lab of the University of Zurich. Stimuli were spoken by a native CHG professional speaker who was educated in StG pronunciation and lived in Berlin, Germany. By employing the speech editing tool PRAAT (version 5.3.23, Boersma and Weenink 2014), speech stimuli were equalized for duration (600 ms), intensity (70 dB) and pitch (250 and 180 Hz for two-syllable words, 250 Hz for one-syllable words). The visual stimulus material consisted of simple black line illustrations on a white background and depicted the spoken word stimuli used in this experiment. The images were clearly identifiable by young children.

Procedure

Neural measures were acquired using a 128-electrode mobile EEG recording system (HydroCel Geodesic Sensor Net (GSN 300) developed by Electrical Geodesics Incorporated, EGI, Eugene, OR). The mobility of this system allowed us to bring it from Zurich to Magdeburg in order to ensure identical recording and presentation conditions in both locations. During the audio-visual EEG experiment which was run with E-Prime Software (Version 2.0.8.90 Psychology Software Tools, Pittsburgh, PA), participants were seated ca. 80 cm away from a laptop computer screen on which the experiment was presented. Loudspeakers were placed both left and right from the monitor in order to provide binaural auditory stimulation. During EEG recording, the participating children performed a passive viewing/listening task with audiovisual matching or non-matching word-picture pairs. To ensure that equal attention was directed towards the auditory and visual domain, we embedded a rare target detection task in the experiment: The children were required to indicate target images or target sounds by mouse-click [i.e., detection of different colorful images of the cartoon figures ‘the beagle boys’ or unique noise sounds (e.g., door slamming shut, glass window breaking, coins chinking)] that were occasionally interspersed between word-picture trials (10.3% of all trials were targets). We chose to include such a task into our experimental procedure, as several studies have shown that N400 amplitudes seem to be modulated by attention (Brown and Hagoort 1993; Chwilla et al. 1995).

Each spoken word-image pairing was nested in between a 90 ms long pre- and 500 ms long post-stimulus fixation cross presentation. Duration of each auditory stimulus was set at 600 ms, whereas black and white images were presented for 1500 ms to minimize interference by a visual offset response while the auditory stimulus was being processed (Henderson et al. 2011).

In total, we presented 468 audio-visual pairings over two blocks (plus 54 pseudo-randomly interspersed targets to control for attention). In each block, 3 runs containing 6 mini-runs of 13 stimulus pairs were shown in a pseudo-randomized order. Each mini-run contained stimulus pairs of the same condition (dialect-independent matching, dialect-independent nonmatching, CHG-specific words, StG specific words, CHG specific vowel pronunciation, StG specific vowel pronunciation). This blocked condition approach was chosen to maximize condition effects and to avoid confusing the young children. Overall, 78 trials were shown per condition. After each run, a short break was held to restore the participants’ attention focus.
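The trial counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original experiment code. The constant and function names are hypothetical, and the even split of pairings across the six conditions is inferred from the reported 78 trials per condition.

```python
# Design parameters taken from the Procedure description.
BLOCKS = 2
RUNS_PER_BLOCK = 3
MINI_RUNS_PER_RUN = 6       # each mini-run holds pairs of a single condition
PAIRS_PER_MINI_RUN = 13
CONDITIONS = 6
TARGETS = 54                # interspersed attention targets


def trial_counts():
    """Return (total pairings, pairings per condition, share of target trials)."""
    pairings = BLOCKS * RUNS_PER_BLOCK * MINI_RUNS_PER_RUN * PAIRS_PER_MINI_RUN
    # Assumption: pairings are spread evenly over the six conditions.
    per_condition = pairings // CONDITIONS
    target_share = TARGETS / (pairings + TARGETS)
    return pairings, per_condition, target_share


pairings, per_condition, target_share = trial_counts()
print(pairings, per_condition, f"{100 * target_share:.1f}%")  # 468 78 10.3%
```

The resulting target share matches the 10.3% of target trials reported for the attention task.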

EEG Recording and Pre-processing

The 128-channel EEG was recorded against the Cz reference at a sampling rate of 500 Hz, with high-pass (0.1 Hz) and low-pass (100 Hz) filter settings. Impedances were kept below 50 kΩ. Offline, the EEG was processed using Brain Vision Analyzer software (version 2.0.4.368, Brain Products GmbH, Gilching, D). The continuous EEG was digitally filtered (0.3–30 Hz) and corrected for blinks and eye movement artifacts using an independent component analysis (ICA; Makeig et al. 2004). Channels with extensive artifacts were spline-interpolated and the EEG was transformed to the average reference (Lehmann and Skrandies 1980). After segmentation (from 150 ms before to 1500 ms after stimulus onset), artifact-free segments (within ±100 µV) were averaged separately for each condition. Finally, the ERPs were corrected for a constant delay of 42 ms. This delay resulted from a constant 24 ms sound release delay as revealed by a timing test (using Event Timing Tester, Version 2.0; EGI) and an 18 ms constant delay from the anti-aliasing filter of the amplifier (for details see advisory notice regarding anti-alias filter effects on EEG timing for Net Amps 300 amplifiers dated November 26, 2014, Electrical Geodesics Inc.; cf. Pegado et al. 2014).
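As an illustration of two of the steps above, the sketch below converts the combined timing delay into samples at the reported sampling rate and applies a simple amplitude criterion to a segment. It is an assumption-laden example rather than the Brain Vision Analyzer procedure: the function names are hypothetical, and the ±100 µV criterion is assumed here to be an absolute-amplitude threshold applied to every channel and sample.

```python
SAMPLING_RATE_HZ = 500
SOUND_RELEASE_DELAY_MS = 24   # from the timing test
ANTI_ALIAS_DELAY_MS = 18      # from the amplifier's anti-aliasing filter
THRESHOLD_UV = 100.0


def ms_to_samples(ms, rate_hz=SAMPLING_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(ms * rate_hz / 1000)


def is_artifact_free(segment, threshold_uv=THRESHOLD_UV):
    """segment: list of channels, each a list of voltages in microvolts.

    Assumption: a segment is kept only if no sample on any channel
    exceeds the threshold in absolute value.
    """
    return all(abs(v) <= threshold_uv for channel in segment for v in channel)


total_delay_ms = SOUND_RELEASE_DELAY_MS + ANTI_ALIAS_DELAY_MS
delay_samples = ms_to_samples(total_delay_ms)
print(total_delay_ms, delay_samples)  # 42 21
```

At 500 Hz, one sample spans 2 ms, so the 42 ms correction corresponds to a shift of 21 samples.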

The individuals’ global field power (GFP) and grand means were computed for all three conditions. Difference ERPs and topographic difference maps (t-maps) were calculated as follows: incongruous dialect-independent words minus congruous dialect-independent words, CHG words minus StG words, and, words with CHG vowel pronunciation minus words with StG vowel pronunciation. Note, congruity in the dialect-independent contrast was the same for both the StG and CHG groups, while congruity changed depending on dialect familiarity for the other two contrasts. Thus, congruity effects should show a polarity reversal between groups for the latter two contrasts.
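For readers unfamiliar with GFP, the following sketch uses the common definition of GFP as the spatial standard deviation of the potentials across all electrodes at one time point (cf. Lehmann and Skrandies 1980), together with a difference map computed as one condition minus another. The function names are hypothetical and the example values are illustrative only, not data from this study.

```python
import math


def average_reference(potentials):
    """Re-reference one scalp map (one value per electrode) to its mean."""
    mean = sum(potentials) / len(potentials)
    return [v - mean for v in potentials]


def global_field_power(potentials):
    """Spatial standard deviation across electrodes at one time point."""
    centered = average_reference(potentials)
    return math.sqrt(sum(v * v for v in centered) / len(centered))


def difference_map(map_a, map_b):
    """Electrode-wise difference, e.g. incongruous minus congruous."""
    return [a - b for a, b in zip(map_a, map_b)]


# Illustrative four-electrode maps (microvolts), not real data.
incongruous = [3.0, 1.0, -1.0, -3.0]
congruous = [1.0, 1.0, -1.0, -1.0]
diff = difference_map(incongruous, congruous)
print(diff, global_field_power(diff))  # [2.0, 0.0, 0.0, -2.0] 1.4142135623730951
```

A flat map (identical values at all electrodes) yields a GFP of zero, which is why GFP peaks are useful markers of time points with a strong, spatially structured response.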

Statistical Analyses

For the statistical ERP analysis two methods were employed: First, we used a data-driven approach using a Topographic analysis of variance (TANOVA) to identify significant within- and between-subject effects as well as their interactions across all time points. TANOVA was computed using Randomization Graphical User interface (RAGU) software (Koenig et al. 2011). Secondly, we employed a more theory-driven approach by focusing on peaks and electrodes that corresponded with previous findings of semantic anomaly detection (e.g., Kutas and Federmeier 2011).

TANOVA analysis allowed us to detect word-picture mismatch effects, group main effects, and differential mismatch effects between groups (interaction) without having to pre-define a subset of electrodes or time frames (cf. Fig. 2) (e.g., Grieder et al. 2012). Specifically, we conducted separate point-to-point TANOVAs for each of the three contrasts (dialect-independent, CHG vs. StG vocabulary, CHG vs. StG pronunciation). Each TANOVA contained a group factor (CHG vs. StG group) and a within-subject factor (dialect-independent congruous versus dialect-independent incongruous; CHG versus StG vocabulary; CHG versus StG pronunciation). TANOVAs were computed on non-normalized (raw) maps for each time-point in the ERP (−150 to 1500 ms) and determined systematic differences between the factors by administering a non-parametric randomization test on the GFP of the difference maps (e.g., Grieder et al. 2012; Holmes et al. 1996; Jost et al. 2014; Lehmann and Skrandies 1980; Murray et al. 2008; Strik et al. 1998). A similar time point-wise analysis approach has been employed in several previous studies to investigate processing differences that exist for two separate conditions across specific time segments (e.g., Jost et al. 2014; Maurer et al. 2010, 2003, 2008) or to examine temporal changes occurring in training studies (e.g., Oelhafen et al. 2013; Stein et al. 2006). However, raw map differences identified by TANOVA can either stem from differences in map strength (although both maps show similar topographies) or from topographic differences (despite the occurrence of similar GFP; Jost et al. 2014; Maurer et al. 2010). To account for false positive results, we ran a maximum duration test (with an alpha level of p < .05) which controlled for multiple comparisons across the analyses and compared the identified significant time frames with the expected time frames that would occur under the null hypothesis (Grieder et al. 2012; Koenig et al. 2011).
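The logic of a randomization test on the GFP of a difference map can be sketched for the simplest case, a within-subject condition effect at a single time point. This is a conceptual illustration under stated assumptions, not the RAGU implementation: the function names are hypothetical, condition labels are shuffled by randomly flipping the sign of each subject's difference map, and the group factor, the interaction, and the duration test are not covered.

```python
import math
import random


def gfp(potentials):
    """Spatial standard deviation across electrodes."""
    mean = sum(potentials) / len(potentials)
    return math.sqrt(sum((v - mean) ** 2 for v in potentials) / len(potentials))


def mean_map(maps):
    """Electrode-wise mean over subjects."""
    n = len(maps)
    return [sum(m[e] for m in maps) / n for e in range(len(maps[0]))]


def randomization_p(diff_maps, n_permutations=1000, seed=0):
    """diff_maps: one difference map (condition A minus B) per subject.

    Swapping the two condition labels within a subject is equivalent to
    flipping the sign of that subject's difference map.
    """
    rng = random.Random(seed)
    observed = gfp(mean_map(diff_maps))
    count = 0
    for _ in range(n_permutations):
        flipped = []
        for m in diff_maps:
            sign = rng.choice((1, -1))
            flipped.append([sign * v for v in m])
        if gfp(mean_map(flipped)) >= observed:
            count += 1
    # Add-one correction so the estimated p value is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Illustrative data: eight subjects sharing the same difference topography.
maps = [[2.0, -2.0, 1.0, -1.0] for _ in range(8)]
p_value = randomization_p(maps)
```

When all subjects show the same difference topography, almost every random relabeling weakens the grand-mean difference map, so only a small fraction of permutations reach the observed GFP and the p value is small.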

In the theory-driven analysis, we selected a cluster of centro-parietal electrodes, which in previous research on semantic anomaly detection has shown the largest effects for N400 and LPC elicitation (see Kutas and Federmeier 2011 for review). As such, we averaged the voltage values at 17 centro-parietal electrodes with Pz as the center surrounded by two concentric circles [E53/E54/E55/E60/E61/E62/E66/E67/E71/E72/E76/E77/E78/E79/E84/E85/E86, corresponding to P3/CpZ/Pz/PO3/P1/POz/O2/P2/PO8/PO4 positions (Luu and Ferree 2000)]. Given developmental differences in ERP latency and the lack of similar paradigms with children of the same age, we could not derive the latencies of the time windows of interest from previous studies. Instead, we used GFP values of the difference ERPs from the dialect-independent mismatch condition to identify significant ERP peaks related to mismatch responses in our data (e.g., Hauk et al. 2006). Given that our main interests were the effects in the CHG versus StG vocabulary and the CHG versus StG pronunciation contrasts, using the time windows from the dialect-independent condition ensured that time window selection was not biased by the effects of interest. As such, we identified three GFP peaks and determined short time windows of ±20 ms during which we ran the further analyses on mean values across the time window (e.g., Brem et al. 2009; Hauk et al. 2006). The first time window occurred at 268–308 ms, similar to the latency of a PMN or an early N400 component (e.g., Bonte and Blomert 2004; Connolly and Phillips 1994; Schulz et al. 2008; called early N400 hereafter). The second occurred at 454–494 ms, which corresponds closely to the N400 latency found in adults and children (e.g., Kutas and Federmeier 2011). The third time window occurred at 900–940 ms, which matches the latency reported for the LPC component (e.g., Juottonen et al. 1996).
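The window and cluster averaging described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study. The peak latencies of 288, 474 and 920 ms are inferred here as the midpoints of the reported windows, the data layout (a dictionary mapping channel names to voltage lists) is hypothetical, and the mapping from time to sample index assumes the 500 Hz sampling rate and a segment starting at −150 ms.

```python
SAMPLING_RATE_HZ = 500
EPOCH_START_MS = -150
HALF_WIDTH_MS = 20

CLUSTER = ["E53", "E54", "E55", "E60", "E61", "E62", "E66", "E67", "E71",
           "E72", "E76", "E77", "E78", "E79", "E84", "E85", "E86"]

# Assumption: GFP peak latencies taken as midpoints of the reported windows.
PEAKS_MS = {"early N400": 288, "N400": 474, "LPC": 920}


def window_around(peak_ms, half_width_ms=HALF_WIDTH_MS):
    """Return the (start, end) of the analysis window in milliseconds."""
    return peak_ms - half_width_ms, peak_ms + half_width_ms


def sample_index(time_ms):
    """Index of the sample at time_ms within a segment starting at -150 ms."""
    return round((time_ms - EPOCH_START_MS) * SAMPLING_RATE_HZ / 1000)


def mean_amplitude(erp, cluster, window_ms):
    """Mean voltage across the cluster channels and all samples in the window.

    erp: dict mapping channel name to a list of voltages (hypothetical layout).
    """
    first, last = sample_index(window_ms[0]), sample_index(window_ms[1])
    values = [v for ch in cluster for v in erp[ch][first:last + 1]]
    return sum(values) / len(values)


WINDOWS = {name: window_around(peak) for name, peak in PEAKS_MS.items()}
print(WINDOWS)
# {'early N400': (268, 308), 'N400': (454, 494), 'LPC': (900, 940)}
```

Each subject and condition then contributes one mean value per window, which serves as the dependent variable in the subsequent ANOVAs.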

For each experimental contrast, we computed a 2 × 2 ANOVA for repeated measures with the between-subjects factor language variety group (CHG natives versus StG natives) and the within-subject factor congruity (congruity versus incongruity or CHG versus StG word or pronunciation variant), analogous to the methodology described in the RAGU-based TANOVA analysis, using the mean values (across electrode cluster and time window) for each of the three time windows of interest (early N400, N400, LPC). Significant main and interaction effects as well as trends will be described in the “Results” section. However, only significant results (p < .05) will be discussed in detail in the “Discussion” section.
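For a design with one two-level between-subjects factor and one two-level within-subject factor, two of the three F tests reduce to squared pooled-variance t statistics: the interaction is a two-sample test on the within-subject difference scores, and the group main effect is a two-sample test on the per-child averages. The sketch below illustrates this equivalence; the function names and data layout are assumptions, and the within-subject main effect is omitted because its value depends on how unequal group sizes are weighted.

```python
import math

def pooled_t_squared(sample_a, sample_b):
    """Squared pooled-variance t statistic for two independent samples, plus error df."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    ss = sum((x - ma) ** 2 for x in sample_a) + sum((x - mb) ** 2 for x in sample_b)
    df = na + nb - 2
    pooled_var = ss / df
    t = (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t * t, df

def mixed_2x2_f(group_a, group_b):
    """F values for the interaction and the group main effect.

    group_a, group_b: lists of (condition_1, condition_2) mean amplitudes,
    one tuple per child.
    """
    diff_a = [c1 - c2 for c1, c2 in group_a]
    diff_b = [c1 - c2 for c1, c2 in group_b]
    avg_a = [(c1 + c2) / 2 for c1, c2 in group_a]
    avg_b = [(c1 + c2) / 2 for c1, c2 in group_b]
    f_interaction, df_error = pooled_t_squared(diff_a, diff_b)
    f_group, _ = pooled_t_squared(avg_a, avg_b)
    return {"interaction": f_interaction, "group": f_group, "df_error": df_error}
```

With group sizes of 35 and 18 children, as implied by the t (34) and t (17) tests in the Results, the error degrees of freedom are 35 + 18 − 2 = 51, matching the reported F (1,51) values.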

Behavioral Attention Task

In addition to the ERP analysis, we investigated attention task compliance by analyzing response accuracy and reaction times for auditory and visual target detection. Accuracy and reaction time scores were analyzed using a repeated measures 2 × 2 ANOVA with the between-subject factor language variety group (CHG natives versus StG natives) and the within-subject factor modality (auditory versus visual) across all three experimental tasks. Behavioral results mainly served for participant exclusion and thus will not be discussed in detail in later sections.

Results

Behavioral Attention Task

Overall, auditory and visual target detection accuracy was very high (>90%) for both language variety groups over all three experimental contrasts, and no language variety group (CHG natives versus StG natives) or modality-specific (auditory versus visual) main effects could be found (modality, F (1,51) = 2.326, p = .133; language variety group, F (1,51) = 1.706, p = .197). However, there was a significant language variety group × modality interaction (F (1,51) = 4.902, p = .031) indicating that CHG native children seemed to respond slightly more accurately to visually presented targets, whereas StG native children performed slightly better at auditory target stimulus detection. Separate group contrasts for the auditory and visual modality resulted in a nonsignificant effect [t (51) = −1.283, p = .205] for auditory target detection and a trend for visual target detection [t (51) = 1.866, p = .068]. Furthermore, group-wise comparisons for visual versus auditory target detection accuracy were non-significant in both groups [in CHG natives: t (34) = 1.531, p = .135, and in StG natives: t (17) = −1.400, p = .180]. Regarding target detection reaction times over all 3 experimental contrasts (determined by mouse click speed to beagle boy sounds or images), we found a significant main effect for modality [F (1,51) = 5.314, p = .025], but neither language variety group [F (1,51) = 0.122, p = .728] nor the language variety group × modality interaction revealed a significant effect [F (1,51) = 0.403, p = .529]. In particular, mean response time for visual targets was 823 ms (±112 ms) in CHG native children and 843 ms (±125 ms) in StG native children, whereas mean response time for auditory target detection was 798 ms (±118 ms) in CHG native children and 799 ms (±107 ms) in StG native children. Our results thus showed that, on average, children in both groups detected auditory target stimuli more quickly than visual ones.

Event-Related Potentials

Data-Driven TANOVA Analysis

Here we determined the differential time-course of (within-subject, between-subject and/or interaction) effects for all three experimental contrasts (dialect-independent contrast, CHG versus StG vocabulary contrast and CHG versus StG pronunciation contrast) (cf. Fig. 2). In the following we will report the results separately for each contrast.
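The significant time windows reported for each contrast are contiguous runs of time points at which the TANOVA p-value falls below the alpha level and whose duration passes the duration test. A minimal sketch of this extraction step is given below; the function name and the fixed `min_duration_ms` parameter are assumptions, since in the actual analysis the duration criterion is derived from the randomization data rather than set by hand.

```python
def significant_windows(times_ms, p_values, alpha=0.05, min_duration_ms=0):
    """Return (start, end) pairs for contiguous runs of p < alpha.

    Runs shorter than min_duration_ms are discarded.
    """
    windows = []
    start = None
    end = None
    for t, p in zip(times_ms, p_values):
        if p < alpha:
            if start is None:
                start = t  # a new run begins at this time point
            end = t
        elif start is not None:
            windows.append((start, end))  # the run ended at the previous point
            start = None
    if start is not None:
        windows.append((start, end))  # run extends to the end of the epoch
    return [(s, e) for s, e in windows if e - s >= min_duration_ms]
```

Applied to a p-value time series sampled across the epoch, this yields windows of the same form as those listed in Fig. 2.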

Dialect-Independent Contrast

The TANOVA values of the difference ERP revealed three time windows in which the ERP maps for the semantically congruous and incongruous audio–visual pairings differed from each other, indicating a significant effect for the within-subject factor congruity (p < .05; cf. Fig. 2, above; T1: 214–348 ms, T2: 370–596 ms, T3: 652–1194 ms). However, no significant effects were found for language variety group or for the language variety group × congruity interaction, demonstrating that both language variety groups processed the matching and mismatching audio-visual pairings similarly. The difference t-map for the early segment indicated a strong posterior negativity for both groups similar to the temporal and topographic aspects of an early N400 component (Schulz et al. 2008). The significant congruity effect found at ca. 400 ms strongly resembled an N400 effect in terms of topography and time of occurrence (Friedrich and Friederici 2004; Kutas and Federmeier 2011). The last segment (after 600 ms) revealed a strong centro-parietal positivity, similar to an LPC (Juottonen et al. 1996). See Fig. 3 (left) for group-specific topographic difference maps for each significant temporal segment.

CHG versus StG Vocabulary Contrast

Here we identified two significant language variety group × congruity interaction effects with the TANOVA analysis (cf. Fig. 2, middle), once at 468–656 ms post-stimulus presentation and then around 962–1100 ms. Topographic inspection of the difference ERP (CHG words minus StG words) for the earlier interaction effect revealed a strong centro-parietal negativity with frontal positivity for StG-dialect speakers whenever the unfamiliar CHG word variant was presented and thus indicated an N400-specific effect. This N400 effect also presented itself in CHG native children, however with an inverted polarity (as a result of the ERP calculation). The later segment exposed a posterior positivity in StG natives and a posterior negativity in CHG natives. Accordingly, when accounting for the calculation-based inverted polarity in CHG natives, both groups revealed an LPC-specific topography in response to the presentation of the unfamiliar word variant. See Fig. 3 (middle) for group-specific topographic difference maps for each significant temporal segment.
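The calculation-based polarity inversion follows from the direction of the subtraction: the difference ERP is always computed as CHG minus StG, but the unfamiliar variant is the CHG word for StG natives and the StG word for CHG natives. The following minimal sketch re-expresses a difference amplitude as unfamiliar minus familiar; the function name and the amplitude values are hypothetical.

```python
def unfamiliar_minus_familiar(chg_minus_stg, native_variety):
    """Re-express a 'CHG minus StG' difference amplitude as 'unfamiliar minus familiar'."""
    if native_variety == "StG":
        # CHG words are the unfamiliar variant, so the sign is already correct.
        return chg_minus_stg
    if native_variety == "CHG":
        # StG words are the unfamiliar variant, so the subtraction must be reversed.
        return -chg_minus_stg
    raise ValueError("native_variety must be 'CHG' or 'StG'")

# Hypothetical centro-parietal difference amplitudes (in microvolts) in the N400 window:
stg_effect = unfamiliar_minus_familiar(-1.5, "StG")
chg_effect = unfamiliar_minus_familiar(1.5, "CHG")
# Both are negative after recoding, i.e. a more negative response to the unfamiliar word.
```

After this recoding, opposite-signed difference maps in the two groups correspond to the same direction of effect with respect to familiarity.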

CHG versus StG Pronunciation Contrast

TANOVA analysis of the difference ERP (CHG vowel words minus StG vowel words) indicated a single significant time segment for the language variety group × congruity interaction which occurred at 764–856 ms post-stimulus onset. A closer look at the scalp topography revealed a posterior positivity in the StG native children’s group, i.e., an LPC, driven by a larger positivity for the unfamiliar CHG word pronunciation than for the familiar StG word pronunciation. In the CHG native group we were able to determine a similar LPC effect, however, with an inverted polarity which again derived from the manner of calculation. See Fig. 3 (right) for group-specific topographic difference maps.

Theory-Driven ERP Peak-Specific Analysis

In the following we will report the results for each of the three experimental (semantic dialect-independent, CHG versus StG vocabulary and CHG versus StG pronunciation) contrasts using the three time segments identified by GFP peaks in the dialect-independent semantic mismatch contrast. See Fig. 4 for an illustration of the ERP curves regarding mean values across the centro-parietal electrode cluster for each experimental contrast, separated for the CHG-specific and the StG-specific language variety group. All theory-driven ANOVAs use mean amplitude values averaged across the centro-parietal electrode cluster and averaged across the time window of interest. Additional waveforms at different electrode sites (left, midline, right) can be found in the Supplementary Material (SM Figs. 1, 2, 3).

Fig. 1

Simultaneous audio–visual presentation of congruent or incongruent spoken word-image pairings and additional attention task. a Control condition with same semantic (mis-)match for both CHG and StG dialectal varieties. b Language-variety specific vocabulary difference CHG versus StG. c Vowel duration differences for CHG versus StG (e.g., "Nasä" [nazə] vs. "Nase" [naːzə])

Fig. 2

Significant TANOVA time-windows obtained with RAGU for each experimental condition indicated in dark-grey coloring (x-axis: time course; y-axis: level of p value for differences between conditions, groups or interactions): a control condition: (1) 214–348 ms, (2) 370–596 ms, (3) 652–1194 ms; b vocabulary condition: (1) 468–656 ms, (2) 962–1100 ms; c vowel duration condition: (1) 764–856 ms

Fig. 3

Topographic ERP difference maps for mean values across the segments of interest indicated by the TANOVA (data-driven analysis). Left control condition displaying three significant effects for congruity: (1) 214–348 ms post-stimulus onset, (2) 370–596 ms, (3) 652–1194 ms. Middle vocabulary condition revealing two significant language-variety group × congruity interactions: (1) 468–656 ms, (2) 962–1100 ms. Right dialect-based vowel duration condition revealing a single significant language-variety group × congruity interaction: (1) 764–856 ms

Fig. 4

ERP curves for theory-driven analysis at the centro-parietal electrode cluster. Grey segments indicate the three peaks determined by the GFP/RMS values in the control condition at 268–308 ms (early N400), 454–494 ms (N400) and 900–940 ms (LPC). Upper panel represents ERP waves for CHG natives and lower panel shows ERP waves for StG-native children. Left control condition: both language-variety groups show more negative N400 and more positive LPC amplitudes for the mismatching condition. Middle vocabulary specific condition. CHG- and StG-native children show inverted N400-LPC ERP effects based on dialect familiarity. Right Dialect-based vowel duration specific condition. No wave-specific N400 ERP difference for both language-variety groups and no visible interaction effects

Dialect-Independent Contrast

We first computed a repeated measures ANOVA on mean amplitudes (across centro-parietal electrodes and time window) in the early N400 time window (268–308 ms) using the within-subject factor congruity (matching versus mismatching audio–visual pairs) and the between-subject factor language variety group (CHG native versus StG native speakers). Results revealed no significant main effect for congruity or for the congruity x language variety group interaction for the early negative-going ERP (congruity, F (1,51) = 2.153, p = .148; congruity x language variety group interaction, F (1,51) = 0.314, p = .578). However, there was a trend-like main effect for language variety group (F (1,51) = 3.523, p = .066), indicating more negative ERP amplitudes in the congruent as well as the incongruent condition for the CHG native group.

Regarding mean amplitudes (across centro-parietal electrodes and time window) in the N400 time window (454–494 ms), we identified a significant main effect for congruity (F (1,51) = 25.816, p < .001) but not for language variety group (F (1,51) = 2.153, p = .148) or for the congruity × language variety group interaction (F (1,51) = 0.314, p = .578). Results demonstrated that StG and CHG natives showed similar neural patterns, which differed between the congruous and the incongruous audio-visual pairings. Additional group-wise paired t-tests revealed a significant N400 effect which rode on a positivity, as amplitudes of the audio–visual mismatch condition were less positive in comparison to the matching condition in both groups (CHG: t (34) = −3.818, p < .001; StG: t (17) = −3.311, p < .005).

A significant effect for congruity was also determined for the LPC time window (900–940 ms) (F (1,51) = 21.629, p < .001) when using mean amplitudes (across centro-parietal electrodes and time window). And again, the main effect of language variety group was non-significant [F (1,51) = 0.464, p = .499]. Additional post-hoc t-tests identified a significantly larger positivity for the LPC in the mismatching condition than for the matching condition in both language variety groups (CHG: t (34) = 2.952, p < .01; StG: t (17) = 3.049, p < .01). Furthermore, there was also a trend-like interaction effect (congruity × language variety group interaction, F (1,51) = 3.876, p = .054), which was driven by the fact that StG native children showed a more pronounced difference ERP than CHG native children, as the ERP for the incongruous audio–visual pairing was more positive and the ERP for audio–visual congruity was more negative (cf. Fig. 4).

CHG versus StG Vocabulary Contrast

In the early N400 time window (268–308 ms), repeated measures ANOVA on mean amplitude values (across centro-parietal electrodes and time window) revealed no significant main effects or interaction effect (congruity, i.e., CHG versus StG word variant, F (1,51) = 0.009, p = .924; congruity × language variety group interaction, F (1,51) = 0.105, p = .748). However, we found a slight trend for language variety group (F (1,51) = 3.506, p = .067) indicating that ERP amplitudes for the CHG word variant as well as the StG word variant were slightly larger in the StG native group (cf. Fig. 4 for details).

In the N400 time segment (454–494 ms), we identified a significant interaction effect for congruity × language variety group [F (1,51) = 4.191, p < .05] using mean amplitudes (across centro-parietal electrodes and time window). Results revealed that StG native children displayed a stronger negative deflection when images were paired with the unfamiliar spoken CHG words compared with the pairing with familiar StG words. Similarly, CHG native children displayed a stronger negativity for audio–visual stimulus pairs with the unfamiliar StG words than with familiar CHG words. Accordingly, both groups revealed a distinct dialect-based N400 incongruity effect for dialectal word variants non-correspondent to their native dialect when paired with the corresponding image. However, none of the main effects were significant [congruity, F (1,51) = 0.041, p = .840; language variety group, F (1,51) < 0.001, p = .989].

Repeated measures ANOVA on mean amplitudes (across centro-parietal electrodes and time window) for the LPC time segment (900–940 ms) showed no significant interaction or main effects [congruity, F (1,51) = 0.290, p = .592; language variety group, F (1,51) = 0.275, p = .602; congruity × language variety group, F (1,51) = 1.690, p = .199].

CHG versus StG Pronunciation Contrast

Repeated measures ANOVA on mean amplitudes (across centro-parietal electrodes and time window) revealed no significant main effects or any significant interactions in any of the analyzed time segments (early N400 time window (268–308 ms): congruity, F (1,51) = 1.557, p = .218, language variety group, F (1,51) = 1.252, p = .268, congruity × language variety group interaction, F (1,51) = 0.445, p = .508; N400 time window (454–494 ms): congruity, F (1,51) = 0.386, p = .537, language variety group, F (1,51) = 0.610, p = .438, congruity × language variety group interaction, F (1,51) = 0.066, p = .798; LPC time window (900–940 ms): congruity, F (1,51) = 2.311, p = .135, language variety group, F (1,51) = 0.251, p = .619, congruity × language variety group, F (1,51) = 0.016, p = .899).

Discussion

The main goal of the present study was to investigate how differences in neural processing are related to dialect-specific familiarity with vocabulary and pronunciation in a group of CHG versus StG native speaking kindergarten-aged children. To this end, we used ‘spoken word-picture’ pairs that were either congruent or incongruent. In one contrast, incongruity was the same for both language variety groups (control contrast), while in two other contrasts congruity depended on the language variety background—once defined by language variety-specific words (CHG vs. StG vocabulary contrast), once defined by language variety-specific pronunciation (CHG vs. StG pronunciation contrast). Additionally, we employed a target detection task to ensure that attention was equally directed to the visual and the auditory domain.

Converging results from theory- and data-driven ERP analyses revealed similar incongruity effects across both language variety groups in the control contrast for the N400 and LPC effects, but incongruity × language variety group interactions in the dialect-based vocabulary and pronunciation contrasts. While the incongruity × language variety group interactions were found for both the N400 and LPC effects in the vocabulary contrast, an interaction was only found for the LPC effect in the vowel length contrast. In the following, we first briefly discuss the behavioral attention task and then discuss the ERP effects reported here in more detail.

Behavioral Attention Task

Our behavioral attention task, which involved monitoring of auditory and visually presented target stimuli, revealed an overall very high auditory and visual target detection accuracy (>90%) for both language variety groups over all three experimental contrasts. However, there was a weak, but significant language variety group × modality interaction for accuracy. In particular, CHG native children displayed a slightly higher response accuracy for visually presented targets and StG native children were slightly more successful in auditory target detection. We speculate that even though we closely controlled pronunciation of the stimuli by employing the same professional speaker who was an expert in both CHG and StG, subtle pronunciation cues in the stimuli may have led the two groups of children to pay slightly more or less attention to the pictures versus spoken words. Importantly, however, the critical N400 interaction between incongruity and language variety group was not affected by this weak attentional bias, which we tested by adding the behavioral accuracy difference between visual and auditory conditions as a covariate into the analysis (with covariate: p = .057; without covariate: p = .046).

N400

In the dialect-independent control contrast, incongruent audio–visual pairings revealed less positivity than congruent pairings between 400 and 600 ms at centro-parietal locations, resulting in a negativity in the difference ERP. As expected, this effect was similar across both dialectal groups, given that the words used in the control contrast occur both in CHG and StG. Timing and topography of this effect strongly correspond with N400 effects reported previously in children (Henderson et al. 2011; Schulz et al. 2008). Our data suggest that the visually presented stimuli act as primes and activate a specific lexical representation that exists in the viewer’s mental lexicon (Aitchison 2001). However, if this mental representation does not overlap with the word the participant heard, then this non-correspondence seems to trigger a semantic mismatch. In such a manner, the N400 seems to be linked directly with the lexical appropriateness and the linguistic certainty of the stimuli provided (Samuel and Larraza 2015).

While many previous N400 studies use sequential priming paradigms with stimulus pairs or sentences (e.g., Duta et al. 2012; Klintfors et al. 2011; Kutas 1993; Nigam et al. 1992), the current results provide converging evidence that similar N400 effects can be obtained with word-picture pairing paradigms, where audio-visual stimuli are presented simultaneously (Friedrich and Friederici 2004; Henderson et al. 2011). Compared to previous studies, however, the timing of the N400 effect in the current study (400–600 ms) was later than in older children (300–500 ms; Henderson et al. 2011), but earlier than in infants (400–800 ms; Friedrich and Friederici 2004), which is in agreement with a latency shift in development.

In the CHG versus StG vocabulary contrast, we found a congruity × language variety group interaction that was essentially driven by the fact that StG native children displayed a larger negative ERP deflection for the CHG word variants, while the CHG native children showed a more extensive negativity in response to words spoken in StG. The difference ERP for audio-visual pairings with unfamiliar compared to familiar dialect-specific vocabulary occurred after 450 ms post-stimulus presentation with a centro-parietal topography in both children’s groups. Thus, timing and topography correspond with the results obtained in the dialect-independent mismatch condition and suggest the presence of an N400 effect that is sensitive to the specific vocabulary used in an individual’s native dialect. The processing of unfamiliar words thus seems to require more extensive semantic processing than is needed for the processing of familiar words. This finding coincides with the fact that the N400 amplitude is highly determined by expectancy in a given context, and, that target stimuli that diverge from a primed context elicit large N400 amplitudes (Friedrich and Friederici 2005; Juottonen et al. 1996). Similar effects have been previously reported for semantic mismatch paradigms involving pseudowords (e.g., Domahs et al. 2009; Friedrich and Friederici 2005). Thus, unfamiliar dialect-specific vocabulary seems to disrupt expectancy in a similar manner as the one determined for semantic incongruity detection free from language-specific influences.

In the CHG versus StG pronunciation contrast, however, results from both types of analyses showed no significant effect around 400–600 ms post-stimulus presentation (neither main effects, nor interactions). Accordingly, there was no significant N400 effect detectable for words pronounced in CHG in StG native children, nor was there one for words spoken with the StG specific vowel duration in CHG native children. As such, our data suggest that, although there was a duration difference between the CHG and StG specific pronunciation, none of the children had any difficulty matching the auditorily presented stimuli to the corresponding image, irrespective of whether a word variant encompassing a short or a long dialect-specific vowel was presented. Taking into account the fact that the N400 component not only reflects semantic incongruity detection but also expresses the degree to which a presented word triggers lexical-semantic activation (Sebastian-Gallés et al. 2006), we hypothesize that words spoken in CHG as well as StG activated the same lexical entries, whenever they were presented simultaneously with the corresponding image. This effect may further have been reinforced by the repeated presentation of the audio-visual matching and mismatching pairs in our experimental paradigm and may have led to the “learning” of the unfamiliar dialect-specific pronunciation variants. Similar adaptation effects have been reported in studies examining processing of unfamiliar accents at the behavioral level (Goslin et al. 2012).

LPC

In the dialect-independent contrast, incongruent audio-visual pairings revealed more positivity than congruent pairings between 600 and 1200 ms at centro-parietal locations, resulting in a positivity in the difference ERP. As anticipated, this effect was similar across both dialectal groups, because the word stimuli used in the control contrast are represented both in CHG and StG vocabulary. The centro-parietal positivity corresponded strongly topographically and temporally with late effects for semantic mismatch detection previously reported by Schulz et al. (2008) in 11-year-old children. In their experiment, Schulz et al. (2008) were able to determine a relatively long-lasting positivity that began shortly after 600 ms. In contrast to our simultaneously presented audio-visual mismatch paradigm, however, Schulz et al. (2008) employed whole sentences encompassing semantically corresponding and non-corresponding sentence-final words. Nevertheless, the stronger positivity in response to mismatching audio-visual pairing determined in our study most probably reflects neural processes associated with memory retrieval and congruity judgment, similar to the processes that are triggered during the processing of incongruous sentence endings (e.g., Daltrozzo et al. 2012; Juottonen et al. 1996; Schulz et al. 2008). To our knowledge none of the studies investigating semantic mismatch detection using ‘spoken word-image’ pairings have previously specifically mentioned findings on LPC effects following N400 elicitation (e.g., Friedrich and Friederici 2005; Henderson et al. 2011; Kornilov et al. 2015). Our results thus provide evidence that later occurring effects for congruity judgment based on memory reveal equally large LPC effects in ‘spoken word-image’ paradigms.

In the CHG versus StG vocabulary contrast, we found a significant congruity × language variety group interaction (occurring after 900 ms) indicating that CHG native children revealed a larger centro-parietal positivity in response to the StG specific vocabulary than when they heard familiar CHG specific words matched with the corresponding picture items. In turn, StG native children displayed a larger positivity at centro-parietal electrodes for CHG specific vocabulary. In both groups this late positivity corresponded temporally and topographically to the LPC that we found in the dialect-independent control contrast. The unfamiliar word variants thus seem to require more neural involvement for congruity judgment than familiar words, and this is likely linked to the notion that non-native words do not (directly) activate the corresponding lexical representations stored in memory (Kuhl 2000).

In the CHG versus StG pronunciation contrast, converging results revealed a language variety group × congruity interaction that occurred after 700 ms post-stimulus presentation and was located at centro-parietal electrode sites. Again, CHG native children revealed a stronger late positivity, i.e., an LPC, in response to words with StG specific pronunciation, whereas StG native children showed a stronger positivity for words with CHG specific pronunciation. Although the LPC occurred slightly earlier in the CHG versus StG pronunciation contrast than in the dialect-independent contrast, results provide evidence for the fact that unfamiliar pronunciation variants activated stronger neural processing mechanisms in later time instances whenever the heard word encompassed a vowel variant that deviated from expectancy. In accordance with the fact that the LPC is dedicated to the processing of input and its comparison with representations stored in long-term memory (i.e., mental lexicon), words with unfamiliar dialect-based pronunciations required additional processing if they did not directly correspond to the native prototype. One reason why the LPC occurred earlier in the CHG versus StG pronunciation contrast than in the dialect-independent contrast may be the absence of the N400 component. The lack of phonological overlap together with the correct semantic context likely yielded earlier visibility of the LPC component.

Early N400

Converging results for neural responses occurring prior to the N400 effect are less conclusive. In the dialect-independent contrast, only the data-driven TANOVA analysis determined a mismatch effect which occurred ca. 100 ms prior to the N400-specific peak. Topographic inspection revealed that both groups displayed a strong posterior negativity resulting from the difference ERP for dialect-independent mismatching audio-visual pairings after 250 ms and this negativity occurred slightly left-lateralized. The topography of this effect thus diverged from effects previously reported for the phonological mapping negativity (PMN) with its more fronto-central distribution (Connolly and Phillips 1994; Connolly et al. 2001; Desroches et al. 2009; Kornilov et al. 2015). The topographic distribution rather resembled an early N400 effect as reported previously by Schulz et al. (2008), who investigated semantic incongruity detection during sentence reading in children in elementary school. Given the pattern of results in the present study, one possible interpretation could be that the early N400 effect is most pronounced if the incongruous word is identified as a familiar word that is incongruous with the context. In contrast to previous N400 studies, this early N400 effect might have been detected in the current study and in the Schulz et al. (2008) study because of the application of a topographic analysis approach that included all electrodes of the ERP map.

In the CHG versus StG vocabulary contrast converging results revealed no pre-N400 effect. This finding suggests that an early N400 effect is pronounced for familiar but mismatching words, but is reduced for unfamiliar words, as is the case for this contrast. It remains open, whether the absence of an early N400 effect is related to the lack of lexical familiarity, or whether such an effect may be concealed by additional neural processing due to the unfamiliar word that is not part of the hearer’s mental lexical representation.

Furthermore, we also did not detect any pre-N400 effects for the CHG versus StG pronunciation contrast. However, this was not very surprising as the auditorily presented words in this experimental contrast did indeed correspond phonologically and semantically with the presented image in both the CHG and StG specific condition, although they encompassed a dialect-specific word-initial vowel-duration difference. This result is in line with previous findings investigating ‘spoken word-image’ mismatch detection for words that shared a word-initial phoneme (e.g., “luck” vs. “luggage”). In these studies, word-initial phoneme correspondence resulted in the absence of a PMN but did indeed elicit an N400 effect if the word was semantically inappropriate (Desroches et al. 2009; Kornilov et al. 2015). Moreover, it also indicates that no additional contextual pre-processing took place for non-native stimulus integration (e.g., Connolly and Phillips 1994).

Limitations and Outlook

To our knowledge, this study is one of the first to examine neural processing mechanisms of semantic mismatch detection in terms of dialect familiarity in young children. Accordingly, no literature exists as to when and where neural responses to unfamiliar dialect-specific vocabulary and pronunciation will occur in the brain. In order to overcome this difficulty we employed a twofold analysis methodology. However, additional research is necessary to pinpoint these mechanisms on a larger scale. Furthermore, our audio–visual stimulus pairings were presented in a block-wise manner for each of the three experimental contrasts, making it possible that children may have anticipated whether a sequence of matching or mismatching audio–visual pairings was presented, or whether the presented pictures were paired with familiar or unfamiliar vocabulary or pronunciation after observing the first pairing in a block. This may have facilitated learning of audio–visual pairings, especially in the unfamiliar dialect-based contrasts. Yet the strong N400-LPC effect we detected in the CHG versus StG vocabulary contrast suggests otherwise. Likewise, the larger LPCs in the CHG versus StG pronunciation contrast indicate a strong involvement of neural processing mechanisms for congruity judgment in pairings with unfamiliar word-initial vowel pronunciation, providing evidence that children did not become accustomed to the unfamiliar word pronunciation. A further limitation is that we did not specifically test for StG vocabulary knowledge in CHG native children prior to or after running the experiment. Although both children’s groups showed a high predisposition towards stimuli spoken in their native German language variety, it remains unclear whether the N400-LPC effects stemmed directly from reduced auditory stimulus expectancy or rather from general lack of knowledge of the non-native vocabulary.
In a future study, it would be important to additionally collect data on the active production level of the non-native vocabulary, e.g., by examining picture naming abilities of previously seen as well as new (but equally difficult) stimuli. Such additional testing would provide better insights into how well, for example, CHG native children have acquired StG language knowledge in the Swiss diglossic language situation before being enrolled in school. Moreover, we found no robust early N400 effects in either the dialect-independent contrast or the CHG versus StG vocabulary contrast, where spoken word stimuli could have elicited such an effect. In the dialect-independent contrast, converging results suggest that an early N400 effect may exist. However, due to stimulus-specific constraints it was not possible to explicitly match all auditory stimuli to the extent that each mismatching or unfamiliar word contrast encompassed a word-initial phoneme different from the expected onset. In a future study it would be beneficial to control for such an effect.

In sum, our findings contribute to improving the understanding of how an individual’s dialect may influence the decoding and activation of semantic information at the neural level. However, the study also leaves some questions unanswered that would be interesting to address in future research. As we have shown, neural processing mechanisms for semantic incongruity effects in (pre-literate) kindergarten-aged children differ depending on their mother tongue dialect. How, then, are these mechanisms reflected after children have received formal instruction in school, e.g., at the end of elementary school? Furthermore, as phonological vowel-length variations unfamiliar in the child’s native dialect triggered larger LPC amplitudes and thus required additional neural processes for congruity judgment, it would be reasonable to investigate whether phonological processing mechanisms are influenced at the behavioral level as well. Such research may provide crucial insights into processes that unfold in young dialect-speaking children when they learn to read and write in the corresponding standard language form (e.g., CHG native children learning to read and spell in StG), especially because literacy skills are strongly linked to phoneme-grapheme mapping strategies (Snowling 1980).

Conclusion

The present study extends previous research on semantic mismatch detection in simultaneously presented ‘spoken word-image’ pairings to the question of how dialect-specific vocabulary and pronunciation impact semantic processing at the neural level, contingent on one’s native language variety background. The control contrast, in which the match–mismatch status of spoken word-image pairings was not affected by language background, revealed robust N400 and LPC effects in both groups, thereby demonstrating the feasibility of the paradigm and, at the same time, providing a reference for the temporal and topographic characteristics of the mismatch effects in kindergarten-aged children. While both dialect-specific contrasts revealed an LPC mismatch effect that depended on language background, a language-dependent N400 mismatch effect was found only for the vocabulary contrast, not for the pronunciation contrast. This suggests that lexico-semantic access, as indicated by the N400 effect, is robust against slight pronunciation variations of words, such as the shortening or lengthening of a vowel in one language variety compared to another. This may be the case because speech perception needs to deal with variability within and between speakers in general. Still, the presence of an LPC effect in the absence of an N400 effect in the pronunciation contrast suggests that some late evaluation or control processes take place, even though the matching lexico-semantic representation seems to have been retrieved beforehand.

The CHG native children in the current study were tested shortly before entering school, where they will learn StG as part of their literacy acquisition. An interesting question for future studies is thus whether robustness towards pronunciation violations is predictive of how well a standard language variety, or a foreign language, can be learned for oral and literate communication. Finally, while semantic processing seems not to have been affected by the vowel length changes in the current experiment, an open question is whether semantic processing is also robust towards other variations in pronunciation. The degree to which semantic processing is robust against different types of pronunciation variations is relevant for how children growing up in a diglossic language context need to adjust to the standard language while learning to read.