Introduction

Languages differ in how they express meaning, but these differences are constrained by the biology of the human brain. Thus, many accounts of processing propose that every language is processed in fundamentally the same ways. For instance, all languages appear to communicate information at a similar rate1 and have a similar distance between hierarchically dependent linguistic units2. Likewise, all languages utilize the same brain networks for comprehension and production3, with the same amount of time required to access and retrieve word meaning4,5,6. Language users also rely on common processing heuristics to build sentence structure in real time7,8, including assumptions about which sentence elements are probable agents9,10. It follows, then, that behavioral and neural metrics of sentence processing should yield similar results no matter the language.

Nonetheless, models of language processing must account for the full extent of crosslinguistic variability, and many so-called universals do not hold for all languages11. For example, processing of verb-final sentences relies on distinct profiles of working memory and predictive parsing12,13, and comprehension of these structures varies across languages14. Additionally, different linguistic features lead to comprehenders prioritizing different units of information15, such as languages with flexible word order enlisting different processing resources from languages with fixed order16. Suffice it to say, studying diverse languages is part of appreciating the extent of variation in sentence processing.

In the present study, we considered verb-final sentences in Mandarin, a language unusual for having flexible word order—a feature normally associated with highly inflected languages—despite virtually no morphological inflection17,18, making Mandarin a “typologically hybrid” language19. For example, the two sentences 孩子苹果吃掉了 “child apple ate” and 苹果孩子吃掉了 “apple child ate” both mean that the child ate the apple, despite the word order being reversed. Even basic terms like “subject” and “object” are not always appropriate for describing Mandarin sentences20,21. These grammatical properties make it challenging to create an “unambiguously ungrammatical” Mandarin sentence22, leading some to describe Mandarin as a “semantics-based” as opposed to a “syntax-based” language23. These features make Mandarin an ideal language for testing assumptions about meaning and structure in sentence processing models.

The divide between syntax and semantics has long been a focus of sentence processing research4,24,25. At the syntax-semantics interface is argument structure, or the grammatical roles that verbs assign to sentence elements26,27. Although many categories of roles have been described, there is consensus that the human mind distinguishes between the doer, or agent, and the receiver, or patient, of a verb’s action28,29,30. We use the terms agent and patient to represent the concept of proto-agent and proto-patient, two roles argued to be psychologically real across languages28. Languages differ in how agent and patient roles are expressed, and grammatical information is often redundant with semantic knowledge of stereotypical argument roles31. It is thus productive to pit semantics against syntax to characterize how they are prioritized in different languages for argument structure interpretation.

Mandarin has another feature important for argument structure: the coverbs BA and BEI. These coverbs can occur in verb-final sentences to disambiguate agent and patient roles. BA assigns agent status to its preceding noun, as in 孩子把苹果吃掉了 “child BA apple ate”, resulting in subject-object-verb word order. BEI assigns patient status to its preceding noun, as in 苹果被孩子吃掉了 “apple BEI child ate”, resulting in object-subject-verb word order. These word orders are also possible without the coverbs BA and BEI, as already indicated, but may be pragmatically restricted32. There has been debate about the syntactic categories of BA and BEI33,34, but for simplicity we refer to them as coverbs35. Although BA and BEI are both common in verb-final clauses, they have differences in structure and usage. BA must be followed by a noun phrase and is limited in which verbs it can be used with, and BEI is typically analyzed as a passive construction and can be followed directly by a verb36,37. Despite these differences, BA and BEI are powerful cues for argument structure assignment and each assigns a different interpretation in verb-final sentences.

In the present study of Mandarin sentence processing, we used electroencephalography (EEG) and behavioral measures to systematically compare the impact of syntactic (coverbs and word order) and semantic information (animacy and event knowledge). Below, we first introduce an impactful model of crosslinguistic sentence processing, the Competition Model, and then turn to the specific case of role reversals and existing accounts that consider Mandarin data. We then summarize the critical elements of our experimental design and relevant predictions.

Competition Model

For experimental comparison of crosslinguistic differences, an impactful framework is the Competition Model38, a key reference for language processing researchers for the last 40 years. As its name suggests, the Competition Model describes processing as a competitive arena where fundamental units of information called “cues” compete to shape decisions in comprehension and production39. Much of the research in the framework of the Competition Model has targeted argument structure processing, employing a forced-choice task for agent selection with orthogonal cue comparison across languages to assess the relative strength and validity of specific cues in offline judgments38. Agent selection differs among languages depending on how often certain cues are present (cue availability) and how often these cues correctly indicate the agent (cue reliability); in the case of competition among cues, the primary cue driving argument structure assignment is said to have greater cue strength40. Mandarin speakers have been shown to rely on the following cues in order of decreasing strength: BEI, animacy, word order, and BA, although BA may be as strong as word order in verb-final sentences32.
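
The availability and reliability measures described above can be made concrete with a small sketch. The toy corpus, the record layout, and the function name `cue_statistics` are hypothetical illustrations, not the authors' materials. The product of availability and reliability is the quantity usually called cue validity in the Competition Model literature.

```python
from typing import Iterable, Optional, Tuple

# Each record is (noun the cue points to as agent, noun that is the agent).
# The first element is None when the cue is absent from the sentence.
Record = Tuple[Optional[int], int]


def cue_statistics(records: Iterable[Record]) -> Tuple[float, float, float]:
    """Return (availability, reliability, validity) for one cue."""
    records = list(records)
    total = len(records)
    # Sentences in which the cue is present at all.
    present = [(cue, agent) for cue, agent in records if cue is not None]
    # Sentences in which the present cue points to the true agent.
    correct = sum(1 for cue, agent in present if cue == agent)
    availability = len(present) / total if total else 0.0
    reliability = correct / len(present) if present else 0.0
    return availability, reliability, availability * reliability


# Hypothetical toy corpus of ten sentences (assumption, for illustration only).
toy_corpus = [
    (1, 1), (1, 1), (2, 1), (None, 1), (None, 2),
    (2, 2), (1, 1), (None, 1), (1, 2), (2, 2),
]

availability, reliability, validity = cue_statistics(toy_corpus)
print(f"availability={availability:.2f} "
      f"reliability={reliability:.2f} validity={validity:.2f}")
```

In this toy corpus the cue is present in seven of ten sentences and points to the true agent in five of those seven, so a cue can be frequent yet only moderately trustworthy.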

Role Reversals and EEG Experiments

Sentences where the stereotypical arguments of agent and patient are swapped without grammatical violations are known as role reversals, such as “the child the apple ate.” Over the past twenty years, researchers have studied role reversals with event-related potentials (ERPs), with specific focus on the N400 and P600 components. Traditionally, N400s were linked to semantic processing4 and P600s to syntactic processing25,41. However, this functional distinction has become more nuanced, especially when implausible role reversals were shown to elicit “semantic P600 effects” without modulation of the N40042,43,44.

Different explanations for semantic P600 effects have been proposed, including parallel syntactic and semantic processing streams45,46,47 and classifying the N400 and P600 as indexing retrieval and integration instead of semantics and syntax48,49. However, few explanations have considered the potential for different languages to require different processing profiles for sentence comprehension. Given its unusual linguistic properties, Mandarin is a fruitful target for testing crosslinguistic validity of sentence processing models.

We are aware of two processing models that have explicitly considered Mandarin to study argument structure in role reversals. The first is the extended Argument Dependency Model (eADM), which, like the Competition Model, sets crosslinguistic diversity at the center of its model architecture16,50,51. According to the eADM, processing of sentence elements is divided in separate streams for nouns and verbs, and nouns are assigned a proto-agent or proto-patient role as argument structure is built iteratively. When parsers encounter difficulty in initial stages of relational and plausibility processing, an N400 effect is elicited; P600 effects, however, are limited to subsequent well-formedness and repair computations. Crucially, the computations for processing nouns and verbs depend on language-specific patterns of cue weighting, and the model predicts that role reversals only elicit semantic P600 effects without N400 modulation in languages that have rigid word order16,52. Conversely, role reversals for a language with flexible word order like Mandarin will elicit N400 effects with or without P600 modulation16. These predictions arise from revisions of earlier models to better account for conflict between syntactic and thematic processing streams52,53.

The second model that has considered Mandarin is the Bag of Arguments account, which encompasses multiple studies over the past decade54,55,56,57 that together inform an overarching processing model of argument structure58. The Bag of Arguments model centers around the N400 as an index of prediction during real-time sentence processing, and role reversals are expected to modulate N400 amplitude only if (1) the verb and its arguments are combinable and highly predictable54, (2) arguments are in the same clause as the verb55,57, and (3) parsers have more than 800 ms to utilize structural role information to predict the verb56. If these conditions are met, then role reversals should elicit N400 effects, with or without modulation of the P60056,58. The eADM and Bag of Arguments accounts stemmed from different motivations, with the eADM focused on crosslinguistic differences in argument structure and plausibility computations, while the Bag of Arguments model is concerned with the timing of verb prediction based on preceding arguments. The eADM also makes predictions for ERPs prior to the verb59,60. Accordingly, the models are not incompatible, and both can inform interpretation of the present study.

Present Study

The present study had two primary aims. First, we sought to improve characterization of cue weighting in Mandarin using an experiment with a balanced design and monolingual Mandarin-speaking participants. While prior descriptions of the pattern of Mandarin cue weighting have been impactful23,32,61, there has been a recent call to use updated methodological approaches to evaluate these findings’ replicability62. To appreciate the iterative nature of argument structure assignment, we also compared ERPs at pre-verb sentence positions. Second, we used ERPs to test processing of role reversals, which have been widely reported to elicit semantic P600 effects48. While most accounts assume that role reversals are processed similarly across languages48,54, there is evidence that different languages process these structures differently16. Prior experiments have tested Mandarin role reversals, but there are conflicting reports of N400s16, semantic P600s54,56, or no ERP modulation58. These studies have not reported controlling for participants’ bilingual language knowledge, despite findings showing sentence comprehension can be impacted by second language knowledge23,61.

Accordingly, we designed stimuli by systematically manipulating semantic and syntactic cues: Reversibility, Agent Animacy, Order, and Structure. For Reversibility, there were reversible sentences where either noun was equally plausible as agent and irreversible sentences where only one noun was a plausible agent. To maximize the plausibility difference between reversible and irreversible sentences, we manipulated Agent Animacy such that irreversible sentences had contrasting animacy between the two nouns and reversible sentences had two nouns with shared animacy status. Unlike many prior studies23,32,61, we included plausible inanimate agents, thus dissociating animacy and plausibility. The stimuli designed with the above semantic constraints were then crossed with Order, whether a given plausible agent was in first or second position in the sentence, and Structure, whether a sentence used BA, BEI, or noun-noun-verb (NNV) structure without a coverb. The experimental conditions are summarized in Table 1 for the semantic variables and Table 2 for the syntactic variables.

Table 1 Examples of sentence stimuli with semantic manipulations. All reversible sentences had two nouns with shared animacy status, while all irreversible sentences had two nouns with opposing animacy status. For each of the four conditions in the table, we used a unique set of verbs and noun pairs. All nouns and verbs were two syllables and all sentences ended with the aspect particle LE.
Table 2 Examples of sentence stimuli with syntactic manipulations. Each sentence created according to the semantic cues of Reversibility and Agent Animacy was paired with the cues of Order and Structure to display to participants in the experiment. Due to Mandarin’s flexible word order in NNV structures without a coverb, either Order of the plausible noun should result in the plausible interpretation. The example given is for a sentence with a plausible animate agent, but the same procedure was applied for all sentence materials. Note that Order and Structure were manipulated for reversible sentences as well, but in those cases there was no semantic cue for informing agent selection, and as such all arrangements were potentially plausible.

Predictions for behavioral and ERP results were informed by three sentence processing models considering Mandarin data, as previously summarized. In contrast to previous Competition Model reports of a linear ranking of cue strength23,32,61, we expected that by dissociating animacy from plausibility, animacy would not be the strongest cue. For role reversal sentences, we expected that if Mandarin plausibility comprehension comprises language-specific mechanisms, then there would be an N400 response, with or without P600 modulation, as predicted by the eADM16. Alternatively, if Mandarin comprehension relies on mechanisms identical to other languages, then there should be a semantic P600 response without modulation of the N400, as found in prior studies42,43,44,54,56. Crucially, we note that each of the three models has different aims and proposed mechanisms underlying their proposed processing architecture, and thus the corresponding predictions in the context of the current study (overviewed in Table 3) are not strict tests of the validity of any single model over another. In fact, the models are sufficiently distinct in their targeted explanations that direct comparison may be inappropriate. Instead, the present study can give support, nuance, or reconsideration for specific aspects of each model.

Table 3 Predictions from three sentence processing models. Refer to the text for clarification on where our predictions diverged from the models.

Note that towards our first aim of characterizing cue weighting, we further considered ERPs prior to the verb, at the sentence-initial noun and at the coverbs. Previous studies have reported greater N400 components for inanimate nouns in comparison to animate nouns at sentence-initial position59,60. According to the eADM, this is attributed to a preference for sentence-initial arguments to be agents (or subjects)10, and thus non-ideal inanimate agents elicit greater N400 amplitude than their animate counterparts59,60. Per this explanation, we should observe an N400 effect for sentence-initial inanimate nouns. These predictions are summarized in Table 3.

Results

Behavioral Results

Reversible Sentences

Participants’ interpretation of reversible sentences showed two key effects related to our predictions. First, BA and BEI were the strongest cues for agent assignment (BA: β = 16.96, SE = 5.83, Z = 8.23, p < 0.001; BEI: β = 0.06, SE = 0.02, Z = −7.73, p < 0.001), which confirmed previous behavioral results for verb-final sentences23,32. In contrast to prior reports, however, our participants had no inherent word order preference for agent selection in NNV sentences (β = 1.25, SE = 0.24, Z = 1.20, p = 0.232). Alongside these simple effects, BA and BEI interacted with Agent Animacy, with the coverb cue being slightly weaker when indicating an inanimate agent (BA: β = 1.24, SE = 0.12, Z = 2.18, p = 0.03; BEI: β = 0.65, SE = 0.06, Z = −4.66, p < 0.001). Model results are visualized in Fig. 1a.
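
The reported coefficients are easier to read with a worked calculation. The sketch below assumes that each β is an odds ratio (an exponentiated logistic coefficient) and that each SE is a delta-method standard error on the odds-ratio scale. This is an inference from the numbers, not something stated in the text, and the function names are hypothetical. Under that assumption, the Z statistics can be approximately recovered from β and SE.

```python
import math


def z_from_odds_ratio(odds_ratio: float, se_odds_ratio: float) -> float:
    """Approximate Wald Z from an odds ratio and its delta-method SE.

    Assumes SE(odds ratio) = odds ratio * SE(log odds), so that
    SE(log odds) = SE(odds ratio) / odds ratio.
    """
    log_odds = math.log(odds_ratio)
    se_log_odds = se_odds_ratio / odds_ratio
    return log_odds / se_log_odds


def two_tailed_p(z: float) -> float:
    """Two-tailed p-value for a standard normal Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values copied from the reversible-sentence results above.
z_ba = z_from_odds_ratio(16.96, 5.83)     # reported Z = 8.23
z_order = z_from_odds_ratio(1.25, 0.24)   # reported Z = 1.20
p_order = two_tailed_p(1.20)              # reported p = 0.232

print(f"BA: Z = {z_ba:.2f}")
print(f"NNV word order: Z = {z_order:.2f}, p = {p_order:.3f}")
```

On this reading, a β above 1 (as for BA) means higher odds of choosing the first noun as agent, and a β below 1 (as for BEI) means lower odds. Small mismatches with the reported values are expected because the published β and SE are rounded.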

Figure 1

Results of agent assignment task. (a) Model predictions for interaction between Agent Animacy and Structure for first agent noun selection in reversible sentences. Y-axis shows probability of selecting the first noun as agent. Error bars show 95% confidence intervals; (b) model predictions for interaction among Order, Agent Animacy, and Structure for first agent noun selection in irreversible sentences. Y-axis shows probability of selecting the first noun as agent. Error bars show 95% confidence intervals. (c) Individual differences in Coverb and Plausibility scores. Dotted line shows the line y = x where Coverb and Plausibility Scores are equal in value. Discrete labels for agent assignment strategy are shown here, but subsequent use of scores for analysis was done with Difference Score as a continuous variable.

Irreversible Sentences

Just as in reversible sentences, BA and BEI were the strongest cues for agent assignment in irreversible sentences (BA: β = 18.21, SE = 7.00, Z = 7.55, p < 0.001; BEI: β = 0.07, SE = 0.03, Z = −6.21, p < 0.001). Unlike reversible sentences, Order had a strong effect on agent assignment in the absence of coverbs, demonstrating that irreversible sentences had only one plausible interpretation (β = 11.16, SE = 2.67, Z = 10.09, p < 0.001). Structure and Order further interacted (BA: β = 0.72, SE = 0.13, Z = −1.86, p = 0.063; BEI: β = 0.42, SE = 0.06, Z = −6.52, p < 0.001), such that the effect of Order was considerably weaker for BA and BEI structures than for NNV sentences but not entirely absent. Essentially, when the positioning of the plausible agent agreed with the coverb cue, participants’ agent assignment was influenced more than by either cue alone (BA: β = 64.44, SE = 34.90, Z = 7.69, p < 0.001; BEI: β = 0.05, SE = 0.09, Z = 6.24, p < 0.001).

The model further revealed a two-way interaction between Agent Animacy and Order and a three-way interaction among the cues of Structure, Order, and Agent Animacy. In the two-way interaction, the effect of Order was stronger for plausible animate agents than for plausible inanimate agents (β = 1.69, SE = 0.13, Z = 6.73, p < 0.001). We note that this effect is similar in magnitude to the effect of Agent Animacy for reversible sentences. The three-way interaction was only significant for BEI sentences (β = 0.69, SE = 0.08, Z = −3.12, p = 0.002), where, unlike for BA or NNV, there was minimal difference in the effect of Order between plausible animate agents and plausible inanimate agents (animate: β = 29.5, SE = 15.8, Z = 6.30, p < 0.001; inanimate: β = 15.9, SE = 8.13, Z = 5.41, p < 0.001). This effect was most apparent when comparing BA and BEI role reversal sentences. While BA was more successful at driving role reversal interpretations when the plausible agent was inanimate (animate: 67% (SE = 11.4) probability first noun agent selection; inanimate: 80% (SE = 8.4)), BEI role reversal interpretations were unaffected by Agent Animacy (animate: 29% (SE = 7.9); inanimate: 31% (SE = 8.3)). Model results are visualized in Fig. 1b.

Individual Differences

While the behavioral models showed group-level effects, further data exploration showed that not all participants used the same comprehension strategies. To quantify individual differences, we calculated individual scores for reliance on coverbs in reversible sentences (Coverb Score) and Order in the absence of a coverb in irreversible sentences (Plausibility Score). These scores respectively represent individuals’ reliance on coverbs and plausibility when there are no competing cues for agent assignment. The maximum score value of 1 indicates a given participant always used the corresponding cue, whereas a score of 0 indicates the cue was always disregarded.

From visual inspection of individuals’ Coverb and Plausibility Scores, we identified three discrete strategies: using only plausibility, only coverbs, or both plausibility and coverbs. To create a continuous metric, we calculated a Difference Score by subtracting the Coverb Score from the Plausibility Score, where individual scores ranged from −0.98 (most coverb-driven) to 0.79 (most plausibility-driven). For visualization, we labeled participants with a Difference Score between −0.33 and 0.33 as having a balanced strategy (i.e., the middle third of the values range), where they gave approximately equal weight to the two cues of plausibility and coverb, as depicted in Fig. 1c. This Difference Score further predicted participant reaction times, with more plausibility-driven participants taking longer to respond to reversible NNV sentences (see Supplementary Materials).
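
The scoring steps above can be sketched in a few lines. The sketch assumes that each score is the simple proportion of cue-consistent responses; the exact formula is not given in the text, so this is an illustration only. The function names and the example responses are hypothetical.

```python
from typing import List


def proportion(cue_consistent: List[bool]) -> float:
    """Proportion of trials on which the response followed the cue."""
    return sum(cue_consistent) / len(cue_consistent)


def difference_score(coverb_trials: List[bool],
                     plausibility_trials: List[bool]) -> float:
    """Plausibility Score minus Coverb Score, as described in the text.

    coverb_trials: reversible BA/BEI sentences (True = followed the coverb).
    plausibility_trials: irreversible NNV sentences (True = chose the
    plausible agent).
    """
    return proportion(plausibility_trials) - proportion(coverb_trials)


def strategy_label(diff: float, cutoff: float = 0.33) -> str:
    """Discrete label used for visualization only (cutoff from the text)."""
    if diff < -cutoff:
        return "coverb-driven"
    if diff > cutoff:
        return "plausibility-driven"
    return "balanced"


# Hypothetical participant: follows the coverb on 9 of 10 reversible trials
# and chooses the plausible agent on 2 of 10 irreversible NNV trials.
coverb_trials = [True] * 9 + [False]
plausibility_trials = [True] * 2 + [False] * 8

diff = difference_score(coverb_trials, plausibility_trials)
label = strategy_label(diff)
print(f"Difference Score = {diff:.2f} ({label})")
```

Keeping the Difference Score continuous for analysis, as the text describes, avoids treating participants just either side of the ±0.33 cutoff as categorically different.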

ERP Results

Noun One

Visual inspection of ERPs between animate and inanimate nouns in initial sentence position did not reveal a substantial difference in N400 amplitude. A mixed effects model for midline electrodes showed no significant effect of noun one animacy (β = 0.08 µV, SE = 0.20, Z = 0.40, p = 0.69, model results reported in Supplementary Materials).

Coverb

ERPs at the second word position in NNV, BA, and BEI sentences (i.e., noun two in NNV sentences versus the coverbs) showed strong variation in P200 amplitude, as can be seen in Fig. 2. Direct comparison between ERPs showed that BA elicited a smaller P200 than BEI and noun two in NNV sentences. Noun two further elicited a sizeable N400 component, consistent with word class effects63. The smaller P200 for BA was not predicted a priori, but given the striking visual difference in the ERP waveforms, we analyzed single trial amplitudes in the P200 and N400 time windows.

Figure 2

ERPs for the effect of Structure at the second word position. The 200-ms pre-onset baseline interval is indicated with a gray rectangle. Scalp maps show BEI minus BA (top) and noun two minus BA (bottom) for the P200 and N400 time windows.

The average P200 amplitude for BA sentences was significantly smaller than for NNV sentences at electrodes Fz and Cz (ps < 0.01). With correction for multiple comparisons, P200 amplitude for BEI did not differ significantly from that for BA or noun two in NNV sentences. Average N400 amplitude for noun two in NNV sentences was larger than for BA and BEI (ps ≤ 0.01).

Role Reversals

Visual inspection of role reversal results showed that implausible sentences elicited a broadly distributed, centro-parietal N400 between 300 and 500 ms and a sustained, localized frontal positivity around 800 ms, as depicted in Fig. 3. We further compared ERPs broken down by Structure and Agent Animacy (see Supplementary Materials). BA role reversals showed a larger and broader N400 effect with a sustained frontal positivity, and a later, broad positivity beginning around 700 ms in both posterior and frontal locations. BEI role reversals showed a smaller, more localized N400 effect with a sustained frontal negativity, and a central-posterior right-lateralized positivity in the late P600 time window also beginning around 700 ms. For plausible animate agents, role reversals showed a broad N400 effect, and role reversals with plausible inanimate agents appeared to show a localized N400 effect followed by a broadly distributed positivity. However, these interactions should be interpreted cautiously due to relatively low power for statistical inference between subconditions.

Figure 3

ERPs in response to role reversals. (a) ERPs are averaged across Agent Animacy and Structure. Scalp maps show reversal minus plausibility (averaged across other factors) for the N400 time window from 300 to 500 ms and the P600 time window from 500 to 900 ms. The 200-ms pre-onset baseline interval is indicated with a gray rectangle. (b) Model predictions for three-way interaction among Structure, Agent Animacy, and Plausibility. Note that while the three-way interactions were significant for both time windows, not all post-hoc pairwise comparisons were significant. The simple effect of Plausibility was significant for the N400 time window and not for the P600 time window.
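
As an illustration of how amplitudes in the two time windows named in the caption could be extracted from a single epoch, the following sketch subtracts the mean of a baseline interval and averages the voltage within a window. It assumes the baseline is the 200 ms immediately preceding word onset, as in the Fig. 2 caption. The function names and the toy epoch are hypothetical and do not reproduce the authors' preprocessing pipeline.

```python
from typing import Sequence, Tuple


def mean_amplitude(samples: Sequence[float], times_ms: Sequence[float],
                   start_ms: float, end_ms: float) -> float:
    """Mean voltage of samples with start_ms <= time < end_ms."""
    window = [v for v, t in zip(samples, times_ms) if start_ms <= t < end_ms]
    if not window:
        raise ValueError("no samples fall inside the requested window")
    return sum(window) / len(window)


def baseline_corrected_mean(samples: Sequence[float],
                            times_ms: Sequence[float],
                            window_ms: Tuple[float, float],
                            baseline_ms: Tuple[float, float] = (-200.0, 0.0)
                            ) -> float:
    """Window mean minus the mean of the pre-onset baseline interval."""
    baseline = mean_amplitude(samples, times_ms, *baseline_ms)
    return mean_amplitude(samples, times_ms, *window_ms) - baseline


# Toy epoch sampled every 100 ms from -200 to 900 ms (values in microvolts).
times_ms = [float(t) for t in range(-200, 1000, 100)]
epoch = [1.0, 1.0, 1.0, 1.0, 2.0, -1.0, -1.0, 3.0, 3.0, 3.0, 3.0, 0.0]

n400 = baseline_corrected_mean(epoch, times_ms, (300.0, 500.0))
p600 = baseline_corrected_mean(epoch, times_ms, (500.0, 900.0))
print(f"N400 window: {n400:.2f} µV, P600 window: {p600:.2f} µV")
```

The half-open window convention used here (start included, end excluded) is a design choice of the sketch, made so that the sample at 500 ms is counted in only one window.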

N400

Despite the appearance of an N400 effect for Plausibility in the ERP waveforms, our initial model results indicated this effect was only marginally significant (β = −0.51 µV, SE = 0.32, Z = −1.61, p = 0.106). Based on our contrast coding, the coefficient for Plausibility represented the effect at the reference level of Pz; however, the model including only Pz showed a significant effect of Plausibility (β = −0.92 µV, SE = 0.35, Z = −2.65, p = 0.008). We determined that the discrepancy between these model values was due to our limiting the output to three-way interactions (Voltage ~ Structure*Plausibility*Agent Animacy + Structure*Plausibility*Electrode + Structure*Agent Animacy*Electrode + Plausibility*Agent Animacy*Electrode). For a full model including all possible interactions (Voltage ~ Structure*Plausibility*Agent Animacy*Electrode), the corresponding coefficients were identical and the effect of Plausibility was significant. These results are challenging to reconcile and could lead to opposite conclusions based on sometimes arbitrary decisions. We bring up this difficulty because such challenges are faced by many, if not all, users of mixed effects models and the field continues to develop standards for best practice64.
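
To make the difference between the two model formulas explicit, the following sketch expands the `*` crossing operator into its main effects and interactions and compares the resulting fixed-effect terms. It is a plain-Python illustration, not a call into any modeling library. The helper names are hypothetical, and "Agent Animacy" is written as `AgentAnimacy` for convenience.

```python
from itertools import combinations
from typing import FrozenSet, Set

Term = FrozenSet[str]


def expand_crossing(crossing: str) -> Set[Term]:
    """Expand 'A*B*C' into all main effects and interactions of A, B, C."""
    names = [name.strip() for name in crossing.split("*")]
    return {
        frozenset(combo)
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    }


def expand_formula(right_hand_side: str) -> Set[Term]:
    """Union of the expanded terms of every '+'-separated crossing."""
    terms: Set[Term] = set()
    for part in right_hand_side.split("+"):
        terms |= expand_crossing(part)
    return terms


restricted = expand_formula(
    "Structure*Plausibility*AgentAnimacy"
    " + Structure*Plausibility*Electrode"
    " + Structure*AgentAnimacy*Electrode"
    " + Plausibility*AgentAnimacy*Electrode"
)
full = expand_formula("Structure*Plausibility*AgentAnimacy*Electrode")

missing = full - restricted
print(len(restricted), len(full))
print([":".join(sorted(term)) for term in missing])
```

In terms of fixed effects, the two specifications differ only in the four-way interaction. The sketch does not by itself explain why the Plausibility coefficient changed, since that also depends on the contrast coding described above.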

To address these issues, we report here the results of the midline model with a reduced random structure for item variability excluding a random slope for Plausibility to increase power65, with other models reported in Supplementary Materials. The present model revealed a main effect of Plausibility (β = −0.46 µV, SE = 0.18, Z = 2.55, p = 0.011), demonstrating that role reversals elicited an N400 effect averaged across Structure and Agent Animacy. There was a further three-way interaction among Structure, Agent Animacy, and Plausibility (β = 0.39 µV, SE = 0.18, Z = 2.16, p = 0.03). Post-hoc pairwise comparisons showed that the largest N400 effect was for implausible BEI sentences with animate agents (e.g., the reversal sentence 仆人被镜子擦亮了 “servant BEI mirror polished”; β = −1.39 µV, SE = 0.35, Z = 3.99, p < 0.001). Described qualitatively from the model predictions in Fig. 3b, implausible BA role reversals elicited a numerically greater N400 than plausible BA sentences regardless of Agent Animacy status, while implausible role reversals with BEI sentences elicited a greater N400 for animate agents but not for inanimate agents.

P600

Although the ERP visualization suggested a small frontal positivity in the P600 time window, the statistical model did not reveal a significant main effect of Plausibility. Note that for consistency with the N400 analysis, we also excluded the random slope for Plausibility in the item random structure. The only significant model coefficient was a three-way interaction among Agent Animacy, Structure, and Plausibility (β = 0.62 µV, SE = 0.23, Z = 2.66, p = 0.008). Post-hoc pairwise comparison showed that for BEI sentences, ERP amplitudes to reversals with plausible animate agents were significantly more positive than their congruent counterparts (β = 1.67 µV, SE = 0.46, Z = 3.63, p < 0.01). Nonetheless, we note that this frontal positivity is not a typical semantic P600 distribution56,66.

Discussion

We investigated argument structure comprehension in verb-final sentences in Mandarin, a language with flexible word order and virtually no inflection. We analyzed monolingual, Mandarin native speakers’ behavioral and EEG data to capture a cohesive picture of real-time cue competition. Our behavioral results demonstrated that 1) word order was not used to assign argument structure in the absence of other cues; 2) the coverbs BA and BEI were the strongest cues for agent assignment but were differently impacted by Agent Animacy in the case of role reversals; and 3) participants showed individual differences in their reliance on semantic and syntactic cues. Our EEG results showed that 1) sentence-initial noun animacy did not impact N400 amplitude; 2) BA elicited a reduced P200 amplitude relative to BEI and nouns; and 3) the disambiguating verb in role reversal sentences elicited an N400 effect without a subsequent semantic P600. To our knowledge, this is the first time a forced agent-assignment task has been used in an EEG experiment. A key advantage of this task is that it directly reveals a reader’s sentence interpretation and resembles aspects of natural language processing, where individuals must understand the relation of a verb to its arguments. Additionally, all experimental sentences were, in principle, grammatical Mandarin structures, thus minimizing acceptability judgments.

With respect to prior findings, our behavioral results showed a distinct cue weighting profile for Mandarin. First, Agent Animacy was not the most important cue, with participants accepting both animate and inanimate plausible agents. We note that prior experiments often confounded animacy with plausibility23,32,61,67, thus overestimating the role of animacy in driving sentence interpretation. Second, in reversible sentences without coverbs, word order did not drive agent assignment, challenging the idea that there is an inherent preference for object-subject-verb32,68 or subject-object-verb67 word order. Our findings support the idea that verb-final Mandarin sentences where word order is the only cue may be ambiguous69. In contrast to English, where pre-verb and post-verb positioning of arguments reliably signals structural roles15,61, we suggest that only post-verb positioning is reliable in Mandarin10,17.

Our behavioral results further challenge previous descriptions of cue weighting as mere linear ranking. Instead, multiple cues were weighted to varying extents in different contexts. Consequently, a particular profile of cues could result in different interpretations, meaning any given interpretation is best described probabilistically in terms of the available cues, as in Fig. 1. Animacy is a good example: while it had no simple effect for driving agent assignment, Agent Animacy interacted with other cues to subtly affect comprehension. While the coverb BA was consistently a stronger cue for agent assignment when its preceding noun was animate, this was not the case for the coverb BEI, for which interpretations of role reversal sentences were not impacted by Agent Animacy. This divergence hints at nuanced processing demands between BA and BEI70. We comment further on animacy below in conjunction with the ERP results.

While the group-level findings demonstrate an overarching pattern for Mandarin cue weighting, we further found individual differences in interpretation strategy. Although most participants used both plausibility (i.e., Order in irreversible sentences) and coverbs to make their interpretations, a subset of participants relied on one cue while largely ignoring the other. This response pattern further impacted reaction times (see Supplementary Materials), with plausibility-driven participants taking longer to respond to reversible NNV sentences than their coverb-driven counterparts. While individuals varied in comprehension strategy, group averages reflected core characteristics of the cue weighting profile of Mandarin, with individual variation occurring within the confines of these characteristics. For instance, while some individuals disregarded the coverb cue in favor of plausibility, no participant interpreted BA as if it were BEI, or vice versa.

In conjunction with the behavioral results, our ERP findings provide insights about online, incremental parsing decisions. We considered three sentence time windows: the first noun, the second word, and the verb. At the first noun, we found no effect of animacy on N400 amplitude, unlike previous reports in English59,60, which alongside the behavioral findings indicates that the cue of animacy alone is not sufficient to drive assignment of argument structure. Instead, animacy can interact with other cues, and inanimate nouns can be preferentially interpreted as agents given certain semantic contexts. Importantly, the potential of inanimate nouns to be agents is likely impacted by multiple factors, including concreteness71, situational relationships72, motor and social cognition73, and whether the nouns correspond to places74 or natural forces75. These factors were not controlled in the present experiment, so it is possible that the perceived “agentiveness” of the inanimate nouns in the present study was relatively high, thus leading to the particular profile of agent interpretation and the lack of a sentence-initial N400 amplitude difference from animate nouns.

At the second word position, we observed a striking decrease in P200 amplitude for BA in comparison to BEI and nouns. Although not predicted, the P200 effect is compelling evidence for differences between BA and BEI in cognitive demand prior to the verb76. Upon encountering BA, most participants likely assigned agent status to the previous noun; when reading BEI, participants had to wait for the upcoming noun to complete agent assignment. We note that this P200 effect cannot be due to the visual simplicity of BA in relation to other characters77,78; if this were the case, then BA (把) and BEI (被) both should have P200 amplitudes smaller than the more visually complex, two-character nouns. There has also been a report of a similar P200 difference in the auditory domain79. This effect may stem from a difference between assigning agent versus patient arguments to an iteratively constructed sentence structure during online processing. To test this hypothesis, future studies should examine single-argument sentences with BA and BEI (e.g., 把苹果吃掉了 “BA apple ate”) and change the experimental task from agent identification to patient identification. We note that this manipulation is only possible with flexible word order for pre-verbal arguments and could thus represent processing specific to languages with frequent verb-final structures, although it may be limited to the current experimental task.

At the position of the verb, we found that role reversals elicited an N400 effect followed by a local frontal positivity. In contrast to reports of semantic P600 effects to role reversals45,46,48,54, our findings indicate that Mandarin role reversal anomalies were detected via relatively early, automatic semantic processing and meaning retrieval mechanisms. Our finding of a frontal positivity contrasts with predictions for semantic P600s, given that frontal positivities are often dissociated from typically posterior P600 effects56,66. This lack of a semantic P600 effect is not the first such report16,58, suggesting it may be prudent to reconsider the label “semantic P600”, just as early labels of the N400 and P600 components as indexing processing of semantics5,80 and syntax81 were misleading. If P600 responses to role reversals are primarily driven by task16,58, this component may be understood better as a member of the P300 family82, in which case prior reports of semantic P600 effects were likely driven by acceptability judgment tasks. To appreciate the interplay of the N400 and P600 components, computational models like the retrieval-integration account49,83,84 and noisy channel models85,86 are well equipped to explain and predict ERP responses to role reversals, although the present study suggests that crosslinguistic differences cannot be discounted. One way to integrate crosslinguistic differences may be via a language-specific filter based on cue reliability and prominence, with quantitative and qualitative differences among languages and language experience39,87.

Our ERP results may be partially explained by existing models, as outlined in Table 3. The N400 effect can be, in principle, consistent with both the eADM and the Bag of Arguments account, although there are some discrepancies with prior reports. According to the eADM, and consistent with the present study’s findings, Mandarin role reversals modulate the N400 because of the language’s flexible word order, which requires greater weighting of plausibility cues at early processing stages16. A prior eADM study of Mandarin found N400 modulation only for BEI reversals and not BA reversals16, which could conceivably stem from task or modality differences, as well as the present experiment’s systematic comparison of multiple cues. For the Bag of Arguments account, the model can be adjusted to explain the present observation of N400 modulation with 750 ms stimulus onset asynchrony (SOA). While experiments have demonstrated that 600 ms is insufficient56,58 and 800 ms is sufficient58 for argument role assignment to constrain verb prediction as reflected in modulation of the N400 component, there is not a functional explanation for the necessary time to complete the associated computations. One possibility is a mechanism proposed by the Memory, Unification, and Control model88, where semantic information is integrated in processing cycles. Combining the structural roles of two sentence elements to aid in verb prediction could require two complete processing cycles, perhaps corresponding to twice the N400 latency, which could also be consistent with the present study’s SOA of 750 ms.

Neither the eADM nor the Bag of Arguments account specifically predicted the present experiment’s lack of a P600 effect, although both research groups have highlighted task as driving P600 modulation16,58. While eADM experiments have not reported P600 effects for Mandarin role reversals, the model does not explain why role reversals for some sequence-independent languages elicit just an N400 effect (Mandarin and Turkish) and others elicit a biphasic N400-P600 effect (German and Icelandic)16. For the Bag of Arguments account, most studies have reported a semantic P600 effect54,55,56, although recent experiments with fewer task demands found no P600 modulation58, and the model itself does not make explicit predictions for P600 modulation. Additionally, in contrast to the present experiment, Bag of Arguments studies of Mandarin have manipulated cloze probability and combinability in specifically BA verb-final sentences with animate agents54,56,58. If our nouns and verbs were less related to each other than stimuli in previous studies, or the presence of an explicit task with multiple competing cues in BA, NNV, and BEI sentences impacted participants’ parsing strategies, these differences could explain the inconsistent findings. Task and stimuli differences have been shown to affect N400 and P600 responses to role reversals59,60, and acceptability tasks are especially linked with P600 effects82,87,89. While some researchers describe the N400 as less sensitive to task modulation than the P60060,89,90,91, it should be noted that the N400 can show nuanced variation depending on task and context, potentially because of component overlap92. Because the present study is the first ERP experiment using a forced-choice agent assignment task, we cannot discount the potential for task to play a role in driving our effects.

The present study updates our understanding of Mandarin sentence processing and the incremental, online processing of argument structure assignment. In contrast to previous studies describing Mandarin cue weighting as a simple linear ranking23,32,61, we used logit models to more accurately portray a probability-based profile of Mandarin62, capturing the gradient nature of cues for sentence comprehension93. Our results are consistent with models proposing crosslinguistic differences in core processing steps16,87,94 and provide new information for the timing of argument role information in comprehension of verb-final sentences58. We cannot discount the impact of task and stimuli differences in driving some of our findings, especially for the ERP effects that have smaller effect sizes than the behavioral results. Verifying the extent of crosslinguistic differences will require systematic comparison of diverse languages and sentence types beyond role reversals. Even so, inconsistent ERP findings for Mandarin role reversals, including N400 effects16, P600 and N400-P600 effects54,56, and null effects58 suggest that this language may merit special consideration in neurocognitive models of sentence processing. Of broader implication, recent advances in machine learning have led to successful decoding of sentences from blood-oxygen-level-dependent data95,96, indicating the potential for rapid advancement in analyzing neuroimaging data. As the current study shows, basic tenets of syntactic and semantic processing diverge among languages, and thus decoding nuanced sentence meaning from brain data may require precise targeting to language-specific features.

Methods

Participants

In total, 39 native Mandarin speakers participated in the study. Of these 39, four were excluded from analysis due to technical problems during experiment delivery and one was excluded due to failure to stay attentive during the experimental session, resulting in 34 datasets (19 to 25 years old, mean age = 22, SD = 1.9, 19 female). Additionally, six participants did not see the BA reversal subcondition (approximately 30 sentences in total) due to an error in experiment delivery. We opted to include their data in all analyses, as mixed effects models (see Data Analysis section below) can appropriately handle missing data97.

All participants were recruited via online advertisement and word of mouth in Nanjing and tested at Nanjing Normal University. All participants were right-handed based on the Edinburgh Handedness Inventory98 (average score = 83), had normal vision or wore corrective lenses, and did not have any history of neurological disorders. Participants gave written informed consent and were compensated 150 RMB for their time.

To ensure that Mandarin processing was not influenced by other language experience, we limited recruitment to participants who primarily communicated in Mandarin and had limited knowledge of English and other languages, including Chinese languages and dialects. Because English is a required subject in Chinese primary, secondary, and tertiary schools99, all participants had some previous exposure to English. To minimize the influence of English on processing, we restricted recruitment to only those who self-reported an English level of 3 or below on a scale from 1 to 6 (1 being no knowledge of English, 6 being nativelike), who did not use English on a regular basis, and who were at or below the College English Test Level 4, which is typically below communicative competence100. If participants had exposure to a dialect other than standard Mandarin, this was restricted to Northern dialects (e.g., Nanjing, Xuzhou, Nantong, Shandong, Hebei), which are classified as belonging to the Mandarin dialect family and are mutually intelligible17. As an exception, three participants in the present study had knowledge of a Chinese language outside of the Mandarin dialect family (Wu, Gan, and Xiang), but they had minimal exposure to these languages in their adult life and primarily used Mandarin.

Participants further completed a detailed language background and usage questionnaire, from which we report summary values in Table 4. To further evaluate their language knowledge, participants also completed a LexTALE lexical decision task in English101 and Mandarin102. Self-reported proficiency values represent a mean of three separate values for reading, writing, and listening. Exposure percentages represent the self-reported average percent of exposure time from birth to the present. Participants reported percentages in approximately three-year increments throughout their lives, which we then averaged to create an aggregate estimate of lifetime language exposure. Note that the dialect exposure numbers primarily reflect Mandarin dialects (e.g., Nanjing, Nantong, and Xuzhou dialects), which are mutually intelligible with standard Mandarin. Usage percentages represent the average of self-reports of percent of time a language is used in different social contexts, including at school, at the workplace, speaking with friends, and general reading.

Table 4 Participants’ language experience and proficiency.

Materials

We created verb-final sentences with two noun arguments across the two levels of Reversibility (reversible, irreversible) and Agent Animacy (animate, inanimate). Crossing these two factors resulted in four conditions: reversible animate agent, reversible inanimate agent, irreversible animate agent, and irreversible inanimate agent (as summarized in Table 1). To maximize ambiguity in reversible sentences, we chose nouns that shared the same animacy status. We selected 30 transitive verbs each for the reversible inanimate and irreversible inanimate conditions and 31 transitive verbs each for the reversible animate and irreversible animate conditions, resulting in 122 unique verbs. To minimize repetitions of sentence materials during the experiment, we selected two noun pairs (noun pair one and noun pair two) for each verb, such that each pair combined with the verb to meet the requirements of the corresponding condition (e.g., reversible with animate agent: 老板技工举报了 “boss technician denounced”; 证人被告举报了 “witness defendant denounced”). These steps resulted in a total of 244 unique noun pairs. Within these parameters, we further controlled for frequency (using subtitle frequencies103) and number of strokes. Frequency values and number of strokes are reported in Table 5. The full sentence materials are reported in the Supplementary Materials.

Table 5 Controlled variables for sentence materials.

In designing the sentence materials, we endeavored to select ideal agent-patient verbs and nouns, but a minority of verbs were closer to the experiencer verbs category (e.g., 强化 “strengthen”, 冻死 “freeze”, 安慰 “comfort”). This variability in our stimuli may affect our results, as experiencer verbs used in role reversal sentences have been shown to elicit biphasic N400-P600 effects in English59,60. However, we note that the majority of our stimuli use agentive verbs, and the designation of experiencer or agentive verb does not neatly extend across languages. In the case of Mandarin, the structure of verb complements means that there can be a mismatch between the functional and structural roles of arguments104. For these cases, the framework of proto-agent and proto-patient still applies28.

Prior to running the EEG experiment, we created an offline questionnaire with our sentence materials to collect agent assignment judgments and acceptability ratings from native Mandarin speakers. Note that these questionnaires did not include coverbs, which naturally resulted in decreased acceptability in the absence of a conversational context. Although including BA or BEI in these sentences would increase their naturalness, we wanted to understand how our sentences were comprehended at a purely semantic level. Specifically, we sought to confirm that the sentences met a minimum level of acceptability in the NNV structure without a coverb, that acceptability did not differ systematically between conditions, and that our irreversible sentences had a clear semantic direction. These results are summarized in Table 6.

Table 6 Pretest results for sentence materials. Acceptability Rating is based on a scale from 1 to 5, where 1 indicated “completely unacceptable” and 5 indicated “completely acceptable”. Agent Preference is an average value where each sentence item was presented to participants in both possible orders. A value closer to 1 or 0 indicates a strong preference for one of the nouns, while a value closer to 0.5 indicates no preference for either noun.

To make our list of stimuli for running the experiment, we next crossed our factors Reversibility and Agent Animacy with Structure (NNV, BA, and BEI) and Order (first and second, representing position of the plausible noun). Note that for reversible sentences, one of the orders was arbitrarily assigned as first so that Order could still be tested and controlled for these items. We assembled ordered lists for presenting sentences to participants. Each of the 122 verbs was used three times, once for each level of Structure, resulting in 366 total sentences. To minimize the effects of repetition, we used the two noun pairs for each verb, so that a given noun repeated at most once. For example, the two noun pairs 喜鹊鸟笼/老鼠箱子 “magpie birdcage/mouse box” and the verb 困住 “trap” might appear in the experiment as follows: 喜鹊被鸟笼困住了。 “magpie BEI birdcage trapped”; 喜鹊把鸟笼困住了。 “magpie BA birdcage trapped”; 箱子老鼠困住了。 “box mouse trapped”. Note that each sentence ended with the aspect particle LE and a period.
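
The counts above follow from simple arithmetic, which the following minimal Python sketch makes explicit. It is an illustration only, not the script used to build the stimulus lists; the function names `design_counts` and `build_sentence` are hypothetical, and the per-condition verb counts are taken from the description of the sentence materials.

```python
# Illustrative sketch only: names are hypothetical, counts come from the text.
VERBS_PER_CONDITION = {
    ("reversible", "animate"): 31,
    ("irreversible", "animate"): 31,
    ("reversible", "inanimate"): 30,
    ("irreversible", "inanimate"): 30,
}
STRUCTURES = ("NNV", "BA", "BEI")
NOUN_PAIRS_PER_VERB = 2
COVERBS = {"NNV": [], "BA": ["把"], "BEI": ["被"]}


def design_counts():
    """Return the number of verbs, noun pairs, and sentences in the design."""
    n_verbs = sum(VERBS_PER_CONDITION.values())
    return {
        "verbs": n_verbs,
        "noun_pairs": n_verbs * NOUN_PAIRS_PER_VERB,
        "sentences": n_verbs * len(STRUCTURES),
    }


def build_sentence(noun1, noun2, verb, structure):
    """Return the word sequence: noun, optional coverb, noun, verb, LE."""
    return [noun1, *COVERBS[structure], noun2, verb, "了"]


print(design_counts())
print("".join(build_sentence("喜鹊", "鸟笼", "困住", "BEI")) + "。")
```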

To pseudorandomize our stimuli, we used the program Mix105, constraining the randomization such that each level of Structure could occur a maximum of two times consecutively and repetitions of a given verb were separated by a minimum of 90 trials. Due to the design of the sentence materials, there was an equal probability of the first or second noun being animate or inanimate and actor or undergoer, so there was no way for participants to develop strategies to predict the role of the nouns until they saw BEI or BA and the final verb.
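
The two ordering constraints can be checked mechanically. The sketch below is a hypothetical validator, not part of Mix and not the authors' code: `check_order` is an invented name, and it assumes that the 90-trial minimum is measured as the difference between the list positions of two occurrences of the same verb.

```python
def check_order(trials, max_run=2, min_verb_gap=90):
    """Check a presentation order against the two constraints.

    trials: list of (verb, structure) tuples in presentation order.
    Returns a list of violation messages (empty when the order is valid).
    Assumption: the verb gap is the difference between trial positions.
    """
    problems = []
    run_length = 0
    last_position = {}
    for i, (verb, structure) in enumerate(trials):
        # Count how many times in a row the current Structure has occurred.
        if i > 0 and structure == trials[i - 1][1]:
            run_length += 1
        else:
            run_length = 1
        if run_length > max_run:
            problems.append(f"trial {i}: {structure} occurs {run_length} times in a row")
        # Check the distance to the previous occurrence of the same verb.
        if verb in last_position and i - last_position[verb] < min_verb_gap:
            gap = i - last_position[verb]
            problems.append(f"trial {i}: verb {verb!r} repeats after {gap} trials")
        last_position[verb] = i
    return problems


# A toy order that violates both constraints.
print(check_order([("v1", "BA"), ("v2", "BA"), ("v1", "BA")]))
```

A validator of this kind only confirms that a candidate list satisfies the constraints; generating such a list is a separate search problem.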

Procedure

All parts of the experiment were approved by the McGill Faculty of Medicine Institutional Review Board following the guidelines of the Canadian Tri-Council Policy Statement and by the School of Foreign Languages and Cultures at Nanjing Normal University (南京师范大学外国语学院). After reviewing and signing the consent form, participants sat in a sound-attenuated booth. All stimuli were presented with Presentation® software (Version 17.2, Neurobehavioral Systems, Inc., Berkeley, CA, www.neurobs.com) using the Windows XP operating system.

Sentences were presented visually word-by-word, with each word shown for 650 ms and followed by a 100 ms inter-stimulus interval (i.e., stimulus onset asynchrony (SOA) = 750 ms). Each trial began with a cue for the participants to blink (“( −)”) for 2000 ms, then a fixation cross (“ + ”) for 500 ms. Each sentence ended with the particle LE appearing for 650 ms, followed by a prompt for the agent assignment task. For the task prompt, the two nouns from the preceding sentence appeared on the screen, with the first noun on the left and the second noun on the right, with “ !!!!! ” between them. Participants were asked to choose which of the two nouns was the agent (施事) by pressing the A or L key, corresponding to the left or right noun, respectively. The experiment was divided into four blocks of approximately 90 sentences each, with a scheduled break between each block. Participants could also pause at any response prompt to rest before continuing. The total experiment time, including preparation of the cap and cleanup, lasted from two to three hours.
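
As a worked example of the presentation timing, the sketch below computes word onsets and sentence duration from the values given above. The function names are hypothetical, and the code assumes that LE is counted as a separate word and that no inter-stimulus interval follows it. Under that assumption, adding a 200 ms baseline yields 3100 ms for NNV sentences (four words) and 3850 ms for BA and BEI sentences (five words), which matches the whole-sentence epoch spans reported under EEG Recording and Preprocessing.

```python
WORD_MS = 650   # duration of each word on screen
ISI_MS = 100    # blank interval after each word
SOA_MS = WORD_MS + ISI_MS  # 750 ms from one word onset to the next


def word_onsets(n_words):
    """Onset of each word in ms, relative to the onset of the first word."""
    return [i * SOA_MS for i in range(n_words)]


def sentence_duration(n_words):
    """Time from first-word onset to final-word offset.

    Assumption: the final word (LE) is not followed by an ISI.
    """
    return (n_words - 1) * SOA_MS + WORD_MS


# NNV: noun, noun, verb, LE; BA and BEI: noun, coverb, noun, verb, LE.
for label, n_words in (("NNV", 4), ("BA/BEI", 5)):
    print(label, word_onsets(n_words), sentence_duration(n_words) + 200)
```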

EEG Recording and Preprocessing

Participants’ EEG was recorded from 32 Ag/AgCl electrodes mounted on an elastic cap according to the international 10–20 system (EASYCAP, Herrsching, Germany). To monitor vertical and horizontal eye movement, one electrode was positioned below the right eye and another to the left of the left eye. Electrode impedance was kept below 10 kΩ. The recordings were amplified online with a bandpass filter of 0.05–100 Hz, referenced online to electrode FCz, and digitized at a sampling rate of 500 Hz.

We used EEGLAB (v2019.1) and ERPLAB (v7.0.0) to preprocess the data. The EEG signal was downsampled to 250 Hz and re-referenced to the average of the linked mastoids (TP9 and TP10). We then used a high-pass filter at a cutoff of 0.1 Hz (FIR filter, Kaiser window, Kaiser beta = 4.89856, transition bandwidth = 0.2, filter order = 3934) and a low-pass filter at a cutoff of 30 Hz (FIR filter, Kaiser window, transition bandwidth = 10, Kaiser beta = 4.89856, filter order = 80). To correct eye movement artifacts, we decomposed the data using independent component analysis (ICA, runica algorithm in EEGLAB, with the option 'extended', 1). Note that exclusively for the ICA decomposition, we used data that was high-pass filtered at a cutoff of 0.5 Hz (FIR filter, Kaiser window, Kaiser beta = 4.89856, transition bandwidth = 0.2, filter order = 3934) because using a higher high-pass filter on data for ICA decomposition improves the signal-to-noise ratio106. The data were cleaned automatically prior to ICA using the pop_rejcont function (epochlength 2, overlap 1, freqlimit 1–25, threshold 10, taper hamming). The final ICs were then copied to the data filtered at 0.1 and 30 Hz for analysis. We removed a maximum of two ICs per participant, one IC each for vertical and horizontal eye movement.

The signal was then segmented into epochs from − 200 to 1000 ms around each critical word (first noun, coverb, second noun, and verb), with pre-stimulus 200 ms baseline correction. To appreciate the ERP changes across the entire sentence, we additionally created whole-sentence epochs with 200 ms pre-onset baselines; including the baseline interval, these epochs spanned 3100 ms for NNV sentences and 3850 ms for BA and BEI sentences. Based on visual inspection of role-reversal sentences in the whole-sentence epochs, we determined that there were important ERP differences occurring before verb onset, which could make a pre-onset baseline problematic107. To minimize the impact of baseline differences at the verb, we re-epoched the verb time window with a post-stimulus-onset 200 ms baseline correction (baseline interval of 0 to 200 ms), with the assumption that early components in this interval should have minimal difference between conditions. Results from this post-stimulus-onset 200 ms baseline correction are reported in the text because we believe this more accurately reflects verb-linked activity; results from the original pre-stimulus 200 ms baseline correction are reported in Supplementary Materials.
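
The difference between the two baseline choices can be illustrated with a small Python sketch. It is a simplified stand-in for the ERPLAB routines, not the authors' code: `baseline_correct` and `ms_to_index` are hypothetical names, the epoch is a plain list of voltages for one channel, and the baseline interval is treated as half-open (start inclusive, end exclusive), which is an assumption.

```python
SRATE_HZ = 250         # sampling rate after downsampling
EPOCH_START_MS = -200  # epochs run from -200 to 1000 ms


def ms_to_index(t_ms):
    """Convert a latency in ms to a sample index within the epoch."""
    return round((t_ms - EPOCH_START_MS) * SRATE_HZ / 1000)


def baseline_correct(epoch_uv, start_ms, end_ms):
    """Subtract the mean voltage in [start_ms, end_ms) from every sample."""
    i0, i1 = ms_to_index(start_ms), ms_to_index(end_ms)
    baseline = sum(epoch_uv[i0:i1]) / (i1 - i0)
    return [v - baseline for v in epoch_uv]


# Toy epoch: a pre-onset offset of 1 µV that grows after word onset.
epoch = [1.0] * 50 + [3.0] * 50 + [5.0] * 200
pre = baseline_correct(epoch, -200, 0)   # pre-stimulus baseline
post = baseline_correct(epoch, 0, 200)   # post-stimulus-onset baseline
print(pre[-1], post[-1])
```

In the toy epoch, the same late voltage reads as 4 µV or 2 µV depending on the baseline, which is why differences that arise before verb onset matter for the choice of interval.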

Artifact rejection was performed across epochs with a moving window threshold of 80 µV (window size = 500 ms). During review of the artifact rejection process, we determined that electrodes Fp1 and Fp2 were exceptionally noisy across participants and excluded them from analysis. For select subjects whose automatic rejection resulted in more than 15% of trials being rejected, epoched data were manually inspected to recover additional trials and to assess the overall quality of individual datasets. This review resulted in all 34 subjects being included for final analysis (i.e., all individuals had fewer than 15% rejected trials).
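
A moving-window criterion of this kind can be sketched as follows. This is a simplified single-channel illustration rather than the ERPLAB implementation: the function names are hypothetical, the threshold is assumed to apply to the peak-to-peak amplitude within each window, and the 100 ms window step is an assumption, since the step size is not reported.

```python
def exceeds_moving_window(signal_uv, srate_hz=250, window_ms=500,
                          step_ms=100, threshold_uv=80.0):
    """Return True if any window has a peak-to-peak amplitude above threshold.

    Assumptions: peak-to-peak criterion; step_ms is not given in the text.
    """
    window = int(window_ms * srate_hz / 1000)
    step = max(1, int(step_ms * srate_hz / 1000))
    last_start = max(0, len(signal_uv) - window)
    for start in range(0, last_start + 1, step):
        segment = signal_uv[start:start + window]
        if max(segment) - min(segment) > threshold_uv:
            return True
    return False


def rejection_rate(epochs, **kwargs):
    """Proportion of epochs flagged by the moving-window criterion."""
    flagged = sum(exceeds_moving_window(e, **kwargs) for e in epochs)
    return flagged / len(epochs)


clean = [0.0] * 300
spike = [0.0] * 300
spike[150] = 100.0                             # abrupt 100 µV deflection
drift = [i * 120 / 299 for i in range(300)]    # slow 120 µV drift over the epoch
print(rejection_rate([clean, spike, drift, clean]))
```

The slow drift is not flagged even though its total range exceeds 80 µV, because the range within any single 500 ms window stays below the threshold; a participant-level rejection rate from such a function could then be compared against the 15% review criterion.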

Data Analysis

We analyzed responses and reaction times for the agent assignment task and ERPs time-locked to the onset of target words in sentences. The four factors manipulated in the sentence materials were included in the analysis of each of these measures. The factor Structure comprised three levels: NNV, BA, and BEI. Structure was treatment coded with NNV as the reference level, so that the effect of each coverb was evaluated relative to sentences with no coverb. The factors Reversibility (reversible and irreversible), Agent Animacy (animate and inanimate) and Order (first and second, denoting position of the plausible noun in irreversible sentences) were sum coded. Unless otherwise noted, all models used these factors and this contrast coding.
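
To make the two coding schemes concrete, the sketch below builds the corresponding contrast columns in Python. It illustrates the general technique and is not taken from the analysis scripts; the function names are hypothetical, and the choice of which level receives +1 under sum coding (here the first listed level, as in R's default `contr.sum`) is an assumption.

```python
def treatment_coding(levels, reference):
    """One dummy column per non-reference level; the reference is all zeros."""
    others = [level for level in levels if level != reference]
    return {level: [1 if level == o else 0 for o in others] for level in levels}


def sum_coding(levels):
    """Sum (deviation) coding: the last level is coded -1 on every column."""
    coded = levels[:-1]
    table = {}
    for level in levels:
        if level == levels[-1]:
            table[level] = [-1] * len(coded)
        else:
            table[level] = [1 if level == c else 0 for c in coded]
    return table


print(treatment_coding(["NNV", "BA", "BEI"], reference="NNV"))
print(sum_coding(["animate", "inanimate"]))
```

With treatment coding, the BA and BEI coefficients are each compared against NNV; with sum coding, the coefficient for a two-level factor is half the difference between its levels, and the intercept reflects the mean across them.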

All data analysis was done using R version 4.02108. To account for variability across items and within participants, we computed mixed effects models using the glmer function from package lme4 version 1.1-23109, including the optimizer = ‘bobyqa’ parameter. Model coefficients were calculated by maximum likelihood estimates using the Laplace approximation. In the case of the binary data from the agent assignment task, we added the argument family = ‘binomial’ to fit a logistic mixed effects model. To ensure that model effects were interpretable, we limited fixed and random effects to a maximum of three-way interactions, even if there were possible higher order interactions. For random effects structures, all factors with possible variability within items or participants were included in the maximal possible structure. Note that because Agent Animacy and Reversibility did not vary across individual items, these factors were not included in the random structure for item.

All p-values were calculated with the Satterthwaite approximation based on Wald Z-scores in the lmerTest package version 3.1-2110. To construct the maximum possible model and random structure, we used the buildmer package version 1.8111 with the direction = ‘order’ parameter, which adds effects to the model in order of their contribution to log-likelihood. We then again used buildmer to do stepwise removal of model variables with the direction = ‘backward’ parameter to maximize log-likelihood score. These optimized models are reported in the text to maximize power and minimize overfitting112, but the maximal models are reported for reference in the supplementary materials113.

Significant interactions were followed up with post-hoc tests for pairwise comparisons using the emmeans package version 1.4.8114 and Tukey method for adjustment of p-values to correct for multiple comparisons. Interactions were visualized with the emmip function from the emmeans package. Note that while p-values are reported for model results, our inferences and interpretations were not limited to significance testing; instead, we further considered our hypotheses and predictions, effect sizes, and the limitations of data quantity and quality115.

For ease of interpretability, model results are reported with graphical depictions of coefficients and confidence intervals generated by the plot_model function from the sjPlot package version 2.8.9116; full model outputs, including random effects, are reported in the supplementary materials in tables generated from the tab_model function from the sjPlot package. For simplicity, model results reported in the text are limited to significant effects or effects that were related to initial predictions. For data arrangement and general plotting, we used the tidyverse package117, with final figure adjustment performed using the software Inkscape version 0.92118.

Agent Assignment

Binary agent assignment responses were analyzed with logistic mixed effects models. Because irreversible sentences had a single plausible interpretation, while reversible sentences had two plausible interpretations, we analyzed reversible and irreversible sentences separately. This allowed us to better understand the effect of plausibility on agent assignment, while limiting model coefficients to a maximum of three-way interactions. For both reversible and irreversible sentences, the maximum specified model included the fixed effects of Structure, Agent Animacy, and Order, random slopes and intercepts for Structure, Agent Animacy, and Order by participant, and random slopes and intercepts for Structure and Order by item. We report coefficients, confidence intervals, and p-values on the odds ratios scale, but original tests were performed on the log odds scale. For interpretability, interactions are illustrated on the probability scale.
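
The relationship among the three scales can be shown with a short Python sketch. The conversions are standard properties of logistic models; the function names are hypothetical, the Wald-type interval is an assumption about how intervals were formed, and the coefficient and standard error are invented numbers used only for illustration.

```python
import math


def odds_ratio(log_odds_coef):
    """Exponentiate a log-odds coefficient to obtain an odds ratio."""
    return math.exp(log_odds_coef)


def probability(log_odds):
    """Inverse logit: convert a log-odds prediction to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio_ci(coef, se, z=1.96):
    """Wald-type interval computed on the log-odds scale, then exponentiated."""
    return (math.exp(coef - z * se), math.exp(coef + z * se))


# Hypothetical coefficient and standard error, for illustration only.
coef, se = 1.2, 0.3
print(odds_ratio(coef), odds_ratio_ci(coef, se), probability(coef))
```

Because the interval is built on the log-odds scale and then transformed, it is asymmetric around the odds ratio, and an odds ratio of 1 corresponds to a log-odds coefficient of 0.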

Reaction Times

Reaction times for agent assignment responses were analyzed with linear mixed effects models. Reaction times were first cleaned to exclude response times above 10 s or below 100 ms. We then cleaned reaction times by condition, retaining only those values within 1.5 standard deviations for each subcondition of Structure, Reversibility, and Animacy119. These steps resulted in excluding 12.4% of trials from further analysis; we note that some of the excluded trials included instances when participants took breaks before responding. Reaction times were then natural log transformed to ensure that we met assumptions of distribution normality for analysis120. Note that we also analyzed the raw reaction time values121 and results were similar to those found for the log-transformed data; these results are reported in supplementary materials for transparency122. The maximum specified model included the fixed effects of Structure, Reversibility, Agent Animacy, and Order, random slopes and intercepts for Structure, Reversibility, Agent Animacy, and Order by participant, and random slopes and intercepts for Structure and Order by item. As an additional step, we also ran a model with the additional factor of Difference Score (the difference between a participant’s reliance on plausibility cues and their reliance on coverb cues), which is introduced in the section Individual Differences in Cue Weighting for Agent Assignment. Recent work has demonstrated the importance of individual differences in psychology and language research123,124, and including Difference Score in the model explained additional variability in the data. We report coefficients, confidence intervals, and p-values on the log-transformed scale, but model predictions were back-transformed to milliseconds for interpretability. Note that all reaction time results are reported in Supplementary Materials.
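
The cleaning steps can be sketched in Python as follows. This is an illustrative reconstruction, not the analysis script: `clean_rts` and the dictionary keys are hypothetical, and the code assumes that the 1.5 standard deviation criterion is applied around each subcondition mean after the absolute cutoffs, pooling over participants.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def clean_rts(trials, low_ms=100, high_ms=10_000, sd_cutoff=1.5):
    """Apply absolute cutoffs, then a per-condition SD cutoff, then log-transform.

    trials: list of dicts with keys 'rt_ms' and 'condition'.
    Assumption: the SD criterion is centered on the condition mean.
    """
    kept = [t for t in trials if low_ms <= t["rt_ms"] <= high_ms]
    by_condition = defaultdict(list)
    for t in kept:
        by_condition[t["condition"]].append(t["rt_ms"])
    cleaned = []
    for t in kept:
        rts = by_condition[t["condition"]]
        if len(rts) > 1:
            center, spread = mean(rts), stdev(rts)
            if abs(t["rt_ms"] - center) > sd_cutoff * spread:
                continue  # outlier relative to its own subcondition
        cleaned.append({**t, "log_rt": math.log(t["rt_ms"])})
    return cleaned


condition = ("BA", "reversible", "animate")
raw = [500, 520, 480, 510, 490, 2000, 50, 20000]
trials = [{"rt_ms": rt, "condition": condition} for rt in raw]
print([t["rt_ms"] for t in clean_rts(trials)])
```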

ERPs

As noted above, ERPs were analyzed at the first noun, coverb, second noun, and verb position of the sentence. Condition averages were calculated for each subject and then grand average ERPs were calculated for each condition. These grand average ERPs by condition were used for visual inspection and are represented in all ERP figures in the present study. Statistical models, however, were all based on average amplitudes for specific time windows in single trial epochs.

For the first noun of the sentence, referred to hereafter as noun one, we analyzed average amplitude in the N400 time window from 300 to 500 ms. This time window analysis was planned a priori based on reports of greater N400 effects for inanimate nouns than for animate nouns59,60. At the second-word position of the sentence, there was either another noun (noun two, in the case of NNV sentences), the coverb BA, or the coverb BEI. At this sentence position, we analyzed average amplitudes in the P200 (100 to 300 ms) and N400 (300 to 500 ms) time windows. The P200 time window was selected for analysis after visual observation of large differences in the ERPs between sentence structure types. The N400 time window was analyzed as a validation step to confirm expectations that nouns elicited larger N400 amplitudes than coverbs, thus giving more weight to the unexpected differences in P200 amplitude. At the verb position of the sentence, we analyzed the N400 (300 to 500 ms) and P600 (700 to 900 ms).
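
The single-trial measure for each window is a mean voltage over the corresponding samples, which can be sketched as below. The function and constant names are hypothetical, the sampling rate and epoch start are taken from the preprocessing description, and treating each window as half-open (start inclusive, end exclusive) is an assumption.

```python
SRATE_HZ = 250
EPOCH_START_MS = -200
WINDOWS_MS = {"P200": (100, 300), "N400": (300, 500), "P600": (700, 900)}


def mean_amplitude(epoch_uv, component):
    """Mean voltage of one single-trial epoch within a component's time window."""
    start_ms, end_ms = WINDOWS_MS[component]
    i0 = round((start_ms - EPOCH_START_MS) * SRATE_HZ / 1000)
    i1 = round((end_ms - EPOCH_START_MS) * SRATE_HZ / 1000)
    segment = epoch_uv[i0:i1]
    return sum(segment) / len(segment)


# Toy epoch (-200 to 1000 ms) with a -4 µV deflection from 300 to 500 ms.
epoch = [0.0] * 300
for i in range(125, 175):
    epoch[i] = -4.0
print({name: mean_amplitude(epoch, name) for name in WINDOWS_MS})
```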

For each time window analyzed, we used single trial average amplitude to calculate linear mixed effects models. We first ran models on midline electrodes, including the factor Electrode (Fz, Cz, Pz, Oz), to confirm the presence of effects, and then over all other electrodes on the scalp excluding the midline, with the additional levels of Anteriority (frontal, central, and posterior) and Laterality (right, left). N400 and P600 effects, the primary components investigated in the present study, typically present with a posterior distribution on the scalp125; with this in mind, we treatment coded the factors Electrode and Anteriority with the reference levels of Pz and posterior, respectively. In contrast, P200 effects typically have a frontal distribution78, so for models in the P200 time window, we exceptionally used the reference levels of Fz and frontal for Electrode and Anteriority, respectively. Note that for the model specifications below, the factor Electrode was substituted by Anteriority and Laterality for the models over non-midline electrodes.

For noun one, the maximum specified model included the fixed effects of noun one Animacy (animate, inanimate) and Electrode, with random slopes and intercepts included for noun one Animacy and Electrode by participant and by item. For noun two and coverb, the maximum specified model included the fixed effects of Structure (NNV, BA, BEI) and Electrode, with random slopes and intercepts included for Structure and Electrode by participant and by item.

At the verb, only unambiguous role reversal sentences were analyzed (contrasting plausible vs implausible sentences), which limited trials to irreversible BA and BEI sentences (see Table 2). For clarity with respect to predictions about role reversal effects, the factor Order was recoded in terms of Plausibility; BA sentences with the plausible noun in first position were coded as plausible, while BEI sentences with the plausible noun in first position were coded as implausible, with the same logic applied for sentences with the plausible noun in second position. Plausibility was treatment coded with plausible as the reference level. Because there were only two levels of Structure (BA and BEI) with neither level more suited as a reference, Structure was sum coded for this analysis. The maximum specified model for the verb included the fixed effects of Structure, Agent Animacy, Plausibility, and Electrode. Random slopes and intercepts were included for Structure, Agent Animacy, Plausibility, and Electrode by participant, and random slopes and intercepts for Structure, Plausibility, and Electrode by item. Additionally at the verb, we ran models at individual midline electrodes as confirmation for the effects across midline electrodes. These models are reported in supplementary materials.
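
The recoding of Order into Plausibility amounts to a small lookup, sketched here in Python. The function name is hypothetical; the logic follows the description above, in which BA marks the first noun as agent and BEI marks the second noun as agent.

```python
def plausibility(structure, order):
    """Recode Order (position of the plausible noun) as sentence plausibility.

    structure: 'BA' or 'BEI'; order: 'first' or 'second'.
    """
    if structure not in ("BA", "BEI") or order not in ("first", "second"):
        raise ValueError("only irreversible BA and BEI sentences are recoded")
    agent_is_first_noun = structure == "BA"
    plausible_noun_is_first = order == "first"
    if agent_is_first_noun == plausible_noun_is_first:
        return "plausible"
    return "implausible"


for structure in ("BA", "BEI"):
    for order in ("first", "second"):
        print(structure, order, plausibility(structure, order))
```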

Because the components of interest in the present study (N400, P600, P200) are typically maximal at or near midline electrodes125, we primarily report in the text results from models at the midline; results from lateral electrodes are reported in the text if they show effects beyond the models at midline electrodes. Note that full models for lateral electrodes excluding the midline are reported in the Supplementary Materials. Additionally, simple effects of the topographical factors Electrode, Anteriority, and Laterality, or interactions involving only these factors, are not reported or discussed in the text because they are not related to the experimental manipulations. Lastly, for models at the verb, we discuss in the text only those effects that included Plausibility because this is the only factor for which we had predictions.

All final analyses were performed on single trial average amplitudes, but average ERPs were calculated by condition for plotting purposes. All figures showing ERP voltage against time and scalp maps reflect these average ERPs and were plotted using the R package ERPscope126.