Introduction

Word frequency (WF), the number of times a word occurs regardless of context, has long played a central role in developing and evaluating models of visual word recognition and reading. However, a pioneering study by Adelman et al. (2006) found that much of the variance previously attributed to WF is better explained by the diversity of contexts in which a word occurs (for review, see Caldwell-Harris, 2021). Adelman et al. (2006) operationalized this as contextual diversity (CD), the proportion of texts in a corpus in which a word occurs. When other dimensions that affect lexical processing were controlled, CD but not WF affected naming and lexical decision times (Adelman et al., 2006; Adelman & Brown, 2008; Jones et al., 2017).
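The distinction between the two measures can be illustrated with a minimal sketch. The function name and the toy corpus below are hypothetical and are not taken from Adelman et al. (2006); the sketch only assumes that a corpus is available as a list of tokenized texts.

```python
from collections import Counter


def wf_and_cd(documents):
    """Return (word_frequency, contextual_diversity) for a tokenized corpus.

    documents: list of token lists, one list per text (context).
    WF counts every occurrence of a word; CD is the proportion of texts
    in which the word occurs at least once.
    """
    word_frequency = Counter()
    document_count = Counter()
    for tokens in documents:
        word_frequency.update(tokens)
        document_count.update(set(tokens))  # each text counts at most once
    n_docs = len(documents)
    contextual_diversity = {w: document_count[w] / n_docs for w in document_count}
    return dict(word_frequency), contextual_diversity


# Hypothetical toy corpus of three "texts".
toy_corpus = [
    ["tea", "tea", "tea", "cup"],
    ["cup", "table"],
    ["cup", "tea"],
]
wf, cd = wf_and_cd(toy_corpus)
```

In this toy corpus "tea" occurs more often than "cup" (4 vs. 3 tokens) but in fewer texts (2 of 3 vs. 3 of 3), so the two measures rank the words differently.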

Adelman et al.’s work was motivated by memory research, where repeated exposure has minimal effects when an item is repeated in the same context (Verkoeijen et al., 2004). If lexical memory follows the same principles, words that occur in more diverse contexts will be better learned and retrieved (see Jones et al., 2012, for a related, learning-based account).

Recently, we proposed a “context constructivist” framework that assumes that (1) lexical representations store fine-grained, contextualized statistical information about word distributions; and (2) these representations are used to actively construct and update a context model that informs expectations about upcoming words in that context (Chen et al., in preparation; Yan et al., 2018). Thus, lexical retrieval is optimized to reflect “need probability” (Anderson & Schooler, 1991), the probability that a word will be encountered in the upcoming text or discourse.Footnote 1

With contextualized word representations, people can form expectations about what words are likely to be encountered in the current task/context. In a specific context, words that are more frequent within that context will be more expected. CD and WF are highly correlated; however, WF is not directly incorporated into lexical representations, and thus is not accessible or easily computed (it would have to be computed by summing the frequency of a word across the range of contexts in which it occurs, weighted by the probability of these contexts). By contrast, the number of distinct contexts in which a word occurs would be more accessible: as CD increases, words are likely to have a larger and more varied set of semantic associations (Adelman et al., 2006; Hoffman et al., 2011), and therefore degree of semantic activation would be a good proxy for need probability.
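The computation described in the parenthesis above can be written out as a rough sketch. The notation is an assumption introduced here for illustration: C(w) is the set of contexts in which word w occurs, P(c) is the probability of context c, and f(w | c) is the frequency of w within c.

```latex
\mathrm{WF}(w) \;\propto\; \sum_{c \in C(w)} P(c)\, f(w \mid c)
```

On this reading, recovering WF requires a weighted sum over every stored context, whereas CD depends only on the size of C(w).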

Because WF and CD are proxies for the same underlying factor, when other variables that affect lexical processing are controlled, it would be surprising to find different effects of both WF and CD. Indeed, in an important eye-tracking study, Plummer et al. (2014) found CD but not WF effects across multiple fixation measures. Because CD and WF cannot be manipulated factorially (high-CD [HCD] words are typically high-WF [HWF] words), they introduced a three-condition design comprising a contrast/control condition; an HCD condition, matched for WF with the contrast condition but with higher CD; and a low-WF (LWF) condition, matched for CD with the contrast condition but with lower WF. They found CD effects (HCD vs. the contrast condition) but not WF effects (LWF vs. the contrast condition); for example, HCD words had shorter first fixation durations (FFD). The same CD-dominant pattern for three-condition designs has been found in eye-tracking studies for words and characters in sentences in Chinese (Chen, Huang et al., 2017a; Chen, Zhao et al., 2017b), for lexical decisions with young readers in Portuguese (Perea et al., 2013), and for character decision in Chinese (Huang et al., 2021).

Crucially, our account makes novel predictions about CD and WF when contextual constraint increases. First, CD effects should decrease as contextual constraint increases. Second, in strongly constraining contexts with three-condition designs, WF but not CD should affect reading times. We have confirmed these predictions in a three-condition eye-tracking experiment in Chinese and in an analysis of a corpus of eye-tracking data for natural texts in English (Chen et al., 2025; Yan et al., 2018).Footnote 2

In contrast to behavioral studies with words in isolation, which consistently find effects of CD but not WF, a recent ERP study by Vergara-Martinez et al. (2017) found dissociable effects: Both CD and WF evoked negativities in the 225- to 325-ms time window. However, high CD words elicited larger negativity than low CD words in the anterior region, whereas low-frequency words evoked larger negativity than high-frequency words in the anterior-central region.

The ERP study by Vergara-Martinez et al. is important in clarifying the locus of the CD effect, showing that CD effects have a semantic origin. However, two aspects of the results are noteworthy. First, while CD but not WF affected response times, the 13-ms CD effect is smaller than those observed in previous behavioral studies (e.g., 53 ms in Perea et al., 2013; 65 ms in Plummer et al., 2014).Footnote 3 This raises questions about the strength of the CD manipulation. Second, Vergara-Martinez et al. argue that because facilitatory effects are found for both increased CD and increased WF, different effects might be masked in behavioral measures but might be dissociable with a measure like ERP. While this is true in principle, it does not explain why behavioral effects of WF are not found when CD is controlled. Moreover, it is not clear why larger anterior negativity for higher CD words would map onto faster response times, whereas larger anterior-central negativity for lower WF words would not map onto a response time difference. These observations highlight the importance of replicating the results, especially if the replication showed stronger behavioral effects, which would ensure that the manipulation of CD was robust.

We examined CD and WF effects for characters in Chinese. Characters are the basic orthographic/morphemic unit in Chinese, which minimizes structural complexities associated with morphology and, to some extent, those associated with orthographic consistency and spelling-to-sound mapping (Adelman et al. noted that WF is more strongly correlated with structural factors of word form than CD is). While behavioral and neural studies show different patterns of lexical processing in Chinese and in alphabetic language systems (e.g., Cao et al., 2013; Kim et al., 2016; Zhou & Marslen-Wilson, 2000), behavioral studies using characters in Chinese find the same pattern of CD effects as is found in English and in Portuguese.

Separate neural patterns for WF and CD in a language with a very different orthography would provide compelling support for Vergara-Martinez et al.’s conclusions. Moreover, it would provide strong evidence against any approach, such as ours, in which contextual variability measures and WF are proxies for the same underlying dimension. On the other hand, if we do not find different effects of WF and CD, the results would be consistent with that hypothesis, and importantly, it would pave the way for contextual manipulations that could provide a strong test of the unified hypothesis, a point we return to in the General discussion.

We used stimuli drawn from a corpus of Chinese characters used in films (Cai & Brysbaert, 2010) and manipulated character frequency (CF) and CD simultaneously. As in Vergara-Martinez et al., we used a three-condition design.

We predicted that, compared with the control condition with the same CF but lower CD, character decision times would be faster for the HCD characters, with no effect of CF. As we noted earlier, degree of semantic activation would be a good proxy for need probability. Higher CD characters are likely to be semantically richer than lower CD characters (Adelman et al., 2006; Hoffman et al., 2013; Vergara-Martinez et al., 2017). Therefore, we predicted that HCD characters would induce a larger N400 or late positive component (LPC) than characters in the control condition. The LPC is a positive component occurring at approximately 500 ms after stimulus onset, with the largest scalp distribution over the posterior region. Although it was initially discussed in relation to syntactic and structural processing, more recent findings demonstrate that the LPC is also sensitive to semantic context (for review, see Aurnhammer et al., 2023). Semantic richness effects, which often result in N400 effects for words in alphabetic languages (e.g., Müller et al., 2010; Rabovsky et al., 2012; Vergara-Martinez et al., 2017), are also realized as effects on the LPC component in Korean and Chinese (Ding et al., 2017; Kwon et al., 2012). In these studies, larger N400 or LPC amplitude is often reported for words with many semantic associates or features than for those with few semantic associates or features. If there are effects of both CD and CF, we should also see ERP differences in the LCF condition compared to the control condition, even if (as expected) there are no behavioral effects between these conditions.

Methods

Participants

Twenty-nine students participated in the study (15 females, 14 males, age range 21–26 years, mean age 23.72 years). Participants were right-handed, native Mandarin Chinese speakers with normal or corrected-to-normal vision, and no history of neurological or language impairments. Participants were paid for their participation and signed informed consent prior to the experiment.

Materials

Characters were selected from the SUBTLEX-CH-CHR database (Cai & Brysbaert, 2010). The database provides CF based on the number of occurrences in 33 million words, and CD based on the proportion of films in which a character appears in a corpus of 6,243 films. Both CF and CD were log-transformed. We chose this corpus because frequencies based on this database explain more of the variance in word and character reading than frequencies based on written texts (Cai & Brysbaert, 2010).

We selected 150 single monomorphemic characters from the database, with 50 characters for each condition (Fig. 1B). Characters in the HCD condition have similar CF to the control group (t(98) = -1.662, p = 0.10), but they have higher CD (t(98) = -16.433, p < 0.001). Characters in the LCF condition have lower CF than the control group (t(98) = 25.485, p < 0.001), but they have similar CD (t(98) = -1.645, p = 0.103).
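The 98 degrees of freedom reported above are what an independent-samples t-test yields for two groups of 50 items. A minimal sketch of such a test follows, assuming a pooled-variance (Student's) test; the text does not state which variant was used, and the function name and example values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's independent-samples t statistic with pooled variance.

    Returns (t, df), where df = n_a + n_b - 2.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical log-transformed CD values for two small item sets
# (not the values from the study).
control_cd = [2.1, 2.3, 2.2, 2.4, 2.0]
hcd_cd = [3.1, 3.3, 3.0, 3.2, 3.4]
t_value, degrees_of_freedom = pooled_t(control_cd, hcd_cd)
```

A negative t here means the first group has the lower mean, which matches the sign of the CD comparison reported above if the control group is entered first.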

Fig. 1

A An example of a Chinese word and its characters. The word (槐花, /huai2 hua1/, the sophora flower) consists of two characters: 槐 (/huai2/, sophora) and 花 (/hua1/, flower). Characters are pronounceable and convey meaning. B Examples of characters (upper) and pseudo-characters (lower) in different conditions

Chinese characters are composed of a series of strokes, and those strokes are often combined to form sub-character units called “radicals” (Taft et al., 1999; Yan et al., 2012). Characters vary in the number of strokes and number of radicals, both of which affect the recognition of characters (Ding et al., 2004; Feldman & Siok, 1997, 1999; Taft et al., 1999; Taft & Zhu, 1997). Therefore, across conditions, characters were matched for number of strokes (ts < -0.099, ps > 0.529), radicals (ts < 1.003, ps > 0.171), orthographic neighborhood size (ts < 1.126, ps > 0.263), and semantic polysemy (ts < -0.880, ps > 0.163). We also controlled for phonological consistency (Hsu et al., 2009; Lee et al., 2005, 2015) and regularity (Cai et al., 2012). Phonological consistency (ts < 1.548, ps > 0.127) and regularity (χ²s < 0.31, ps > 0.58) were matched across conditions for phonograms. Regularity and consistency are phonological properties of phonograms (Hsu et al., 2009; Lee et al., 2005; Yum & Law, 2019). Regularity is defined as whether the pronunciation of a phonogram is identical to that of its phonetic radical, regardless of tone. Consistency is defined as the degree to which a phonetic radical is a reliable cue to the sound of the phonogram containing it. It was calculated by dividing the number of orthographic neighbors with the same pronunciation by the total number of orthographic neighbors.
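The consistency ratio defined above can be sketched as follows. The function name and the pinyin strings are hypothetical, and the sketch assumes that "same pronunciation" means the same pronunciation as the target phonogram with pronunciations compared as plain strings; whether tone is included is not specified in the text.

```python
def phonological_consistency(target_pronunciation, neighbor_pronunciations):
    """Proportion of orthographic neighbors that share the target's pronunciation."""
    if not neighbor_pronunciations:
        raise ValueError("at least one orthographic neighbor is required")
    same = sum(1 for p in neighbor_pronunciations if p == target_pronunciation)
    return same / len(neighbor_pronunciations)


# Hypothetical example: three of four neighbors share the target's pronunciation.
consistency = phonological_consistency("hua", ["hua", "hua", "ye", "hua"])
```

The value ranges from 0 (no neighbor shares the pronunciation) to 1 (all neighbors do).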

Twenty-six participants rated concreteness, familiarity, imageability, age of acquisition, valence, arousal, and dominance of each character on 7-point scales. These variables did not differ significantly across conditions (ts < 1.472, ps > 0.147). The detailed values for each condition are presented in Table 1.

Table 1 Characteristics of the target characters in each group

One hundred and fifty pseudo-characters were generated by randomly combining radicals from the original characters; all followed standard orthographic patterns. Using a 7-point scale, 20 students who did not participate in the EEG experiment rated whether the pseudo-characters looked like real characters. There was no significant difference among conditions (ts < 1.36, ps > 0.18).

Procedure

Participants were seated in a sound-attenuating, electrically shielded chamber, approximately 65 cm from a computer screen. Following previous studies (e.g., Huang et al., 2021; Zhao et al., 2010), each trial began with a fixation cross in the center of the screen with a random duration (M = 1,250 ms, range = 1,000–1,500 ms). A character was then presented for 200 ms, followed by a blank screen for 2,500 ms. There were six blocks, with each block containing 50 trials. Block order was counterbalanced across participants. Stimuli were displayed in a pseudo-randomized order, with the constraint that stimuli from the same condition did not appear in more than three consecutive trials.
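One simple way to obtain an order that satisfies the run-length constraint is to reshuffle until no condition repeats more than three times in a row. The sketch below illustrates that idea only; it is not the procedure used in the experiment (which is not described), and the function names, condition labels, and item identifiers are hypothetical.

```python
import random


def longest_run(labels):
    """Length of the longest run of identical consecutive labels."""
    longest = current = 0
    previous = object()  # sentinel that never equals a real label
    for label in labels:
        current = current + 1 if label == previous else 1
        longest = max(longest, current)
        previous = label
    return longest


def pseudo_randomize(trials, max_run=3, seed=None, max_attempts=10_000):
    """Shuffle (condition, stimulus) pairs until no condition exceeds max_run in a row."""
    rng = random.Random(seed)
    order = list(trials)
    for _ in range(max_attempts):
        rng.shuffle(order)
        if longest_run([condition for condition, _ in order]) <= max_run:
            return order
    raise RuntimeError("no valid order found; relax max_run or raise max_attempts")


# Hypothetical block: four condition labels with five placeholder items each.
conditions = ["control", "HCD", "LCF", "pseudo"]
block = [(c, f"{c}_{i}") for c in conditions for i in range(5)]
ordered_block = pseudo_randomize(block, seed=1)
```

Rejection sampling of this kind is easy to verify but can become slow when one condition makes up a large share of the trials; a constructive algorithm would be preferable in that case.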

Participants performed a character decision task, pressing the “D” or “K” key as accurately and quickly as possible. Assignment of “character” and “pseudo-character” to keys was counterbalanced across participants. The E-Prime software package (Psychology Software Tools, Pittsburgh, PA, USA) was used for stimulus presentation and response collection. Response time (RT) was measured from stimulus onset to the participant’s response. The experiment began with a practice session of 20 trials to familiarize participants with the procedure. The entire experiment lasted about 1 h.

EEG recordings

EEG was continuously recorded by a SynAmp amplifier from 64 Ag/AgCl electrodes mounted on an elastic cap and positioned according to the international 10–20 system. EEG was referenced online to the left mastoid, and then re-referenced offline to the algebraic average of the left and right mastoids. Vertical electro-oculogram (EOG) was recorded from electrodes located above and below the orbital regions of the left eye. Horizontal EOG was recorded from electrodes located at the outer canthus of each eye. EEG data were digitized at a rate of 1,000 Hz, with a 400-Hz high cut-off filter and a 0.05-Hz low cut-off filter. Electrode impedances were kept below 5 kΩ throughout the experiment.

Behavioral data analysis

Planned comparisons used linear mixed-effects models for character decision times and mixed logit models for accuracy, fitted with the lme4 package (Bates et al., 2015) in R (R Development Core Team, 2014). The models included fixed effects (conditions) and the maximal random effects structure justified by the data that would converge, with by-participants and by-items random intercepts and slopes (Barr et al., 2013; Jaeger, 2008; Matuschek et al., 2017).Footnote 4 The lmerTest package was used for significance testing. For linear mixed-effects models, we estimated p values using the Satterthwaite approximation for degrees of freedom (Kuznetsova et al., 2017).

EEG data analysis

EEG data were analyzed using MATLAB scripts based on the EEGLAB toolbox (Delorme & Makeig, 2004). A digital bandpass filter between 0.1 and 30 Hz was applied offline. Ocular artifacts were removed via independent component analysis, and other types of EEG artifacts were rejected automatically with a criterion of ± 75 μV and manually through visual inspection. Data were segmented from 200 ms before to 800 ms after the onset of the targets, with baseline correction from 200 ms to 0 ms preceding target onset. Incorrectly answered trials were excluded from further analysis. On average, 7.3% of trials were rejected, and 46.28 ± 2.89, 47.07 ± 2.80 and 45.72 ± 4.71 trials were included in the control, HCD and LCF conditions, respectively, with no significant difference in number of trials remaining across conditions (ts < 1.56, ps > 0.13).
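The segmentation, baseline correction, and amplitude criterion described above can be sketched for a single channel as follows. This is an illustration in plain Python, not the EEGLAB-based MATLAB scripts used in the study; the function names are hypothetical, and applying the ± 75 μV criterion after baseline correction is an assumption.

```python
from statistics import mean

SAMPLING_RATE_HZ = 1000        # digitization rate given in the recording description
EPOCH_START_MS = -200          # epoch begins 200 ms before target onset
EPOCH_END_MS = 800             # epoch ends 800 ms after target onset
REJECTION_THRESHOLD_UV = 75.0  # automatic rejection criterion


def ms_to_samples(ms):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(ms * SAMPLING_RATE_HZ / 1000))


def extract_epoch(channel, onset_sample):
    """Cut one single-channel epoch and subtract the mean of the pre-stimulus baseline."""
    start = onset_sample + ms_to_samples(EPOCH_START_MS)
    end = onset_sample + ms_to_samples(EPOCH_END_MS)
    if start < 0 or end > len(channel):
        raise ValueError("epoch extends beyond the recording")
    segment = channel[start:end]
    baseline = mean(segment[: ms_to_samples(-EPOCH_START_MS)])
    return [value - baseline for value in segment]


def is_artifact(epoch, threshold=REJECTION_THRESHOLD_UV):
    """Flag an epoch if any sample exceeds the absolute amplitude threshold."""
    return any(abs(value) > threshold for value in epoch)


# Synthetic single-channel signals in microvolts (not real data).
clean_channel = [5.0] * 2000
noisy_channel = [5.0] * 2000
noisy_channel[1600] = 105.0  # one large deflection

clean_epoch = extract_epoch(clean_channel, onset_sample=500)
noisy_epoch = extract_epoch(noisy_channel, onset_sample=1200)
```

At a 1,000-Hz sampling rate each epoch contains 1,000 samples, and the first 200 samples form the baseline.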

Based on visual inspection and previous research (e.g., Lartseva et al., 2014), statistical analyses were performed on the mean amplitude between 400 and 600 ms. The midline and lateral electrodes were analyzed separately. The midline analysis had two factors: character type (LCF/HCD group and Control group) and region (anterior (Fz, FCz), central (Cz, CPz), and posterior (Pz, POz)). The lateral analysis had three factors: character type, hemisphere (left and right), and region (anterior, central, and posterior). Lateral electrodes were organized into six regions of interest (ROIs): left anterior (F1, F3, F5, FC1, FC3, FC5), left central (C1, C3, C5, CP1, CP3, CP5), left posterior (P1, P3, P5, PO3, PO5, PO7), right anterior (F2, F4, F6, FC2, FC4, FC6), right central (C2, C4, C6, CP2, CP4, CP6), and right posterior (P2, P4, P6, PO4, PO6, PO8).
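The ROI layout above can be expressed as a lookup table, with the dependent measure computed as the mean amplitude in the 400- to 600-ms window averaged over the electrodes of an ROI. The sketch below is illustrative only: the function names are hypothetical, the window is treated as half-open (600 ms excluded), and epochs are assumed to start 200 ms before onset and to be sampled at 1,000 Hz.

```python
from statistics import mean

MIDLINE_ROIS = {
    "anterior": ["Fz", "FCz"],
    "central": ["Cz", "CPz"],
    "posterior": ["Pz", "POz"],
}

LATERAL_ROIS = {
    ("left", "anterior"): ["F1", "F3", "F5", "FC1", "FC3", "FC5"],
    ("left", "central"): ["C1", "C3", "C5", "CP1", "CP3", "CP5"],
    ("left", "posterior"): ["P1", "P3", "P5", "PO3", "PO5", "PO7"],
    ("right", "anterior"): ["F2", "F4", "F6", "FC2", "FC4", "FC6"],
    ("right", "central"): ["C2", "C4", "C6", "CP2", "CP4", "CP6"],
    ("right", "posterior"): ["P2", "P4", "P6", "PO4", "PO6", "PO8"],
}


def window_mean(epoch, start_ms=400, end_ms=600, epoch_start_ms=-200, sampling_rate_hz=1000):
    """Mean amplitude of one epoch within [start_ms, end_ms)."""
    first = (start_ms - epoch_start_ms) * sampling_rate_hz // 1000
    last = (end_ms - epoch_start_ms) * sampling_rate_hz // 1000
    return mean(epoch[first:last])


def roi_amplitude(epochs_by_electrode, electrodes):
    """Average the window mean across all electrodes in an ROI."""
    return mean([window_mean(epochs_by_electrode[name]) for name in electrodes])


# Synthetic epochs: 2 microvolts between 400 and 600 ms, 0 elsewhere (not real data).
synthetic_epoch = [0.0] * 600 + [2.0] * 200 + [0.0] * 200
epochs = {name: synthetic_epoch for name in MIDLINE_ROIS["posterior"]}
posterior_amplitude = roi_amplitude(epochs, MIDLINE_ROIS["posterior"])
```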

We used linear mixed-effects models to analyze the item-based amplitude of the ERP in the time window of 400 to 600 ms. The model included fixed effects (e.g., condition, region, hemisphere) and the maximal random effects structure that would converge, as justified by the data with by-participants and by-items random intercepts and slopes (Barr et al., 2013; Matuschek et al., 2017).Footnote 5 Post hoc pairwise comparisons were conducted using the emmeans package with Tukey corrections (Lenth et al., 2018).

Results

Behavioral results

Mean RTs and accuracy rates are presented in Table 2. The average accuracy rates were 95.93% (SE = 0.64%) in the control group, 94.69% (SE = 0.95%) in the LCF condition, and 98.00% (SE = 0.42%) in the HCD condition. Mixed logit models showed no significant effects of CF or CD on error rates (|β|s < 0.92, |z|s < 1.90, ps > 0.05).

Table 2 Mean character decision times and average accuracy rates for characters in lexical decision task

Mean character decision times were 730.61 ms (SE = 5.34 ms) in the control group, 733.20 ms (SE = 5.40 ms) in the LCF condition, and 686.39 ms (SE = 4.09 ms) in the HCD condition (see Fig. 2). As predicted, the CD effect (control group vs. HCD group) was significant, β = -48.00, SE = 12.38, t = -3.88, p < 0.001, whereas the CF effect (control group vs. LCF group) was not, β = 4.73, SE = 13.85, t = 0.34, p = 0.73.

Fig. 2

Mean response times for characters in the lexical decision task. The control group comprises characters with higher character frequency (CF) and lower contextual diversity (CD), the LCF group comprises characters with a similar CD to the control group but lower CF, and the HCD group comprises characters with a similar CF to the control group but higher CD. Bigger dots represent the mean response time for each condition, and small dots represent individual means for each condition. Error bars represent the bootstrapped 95% confidence interval of the mean

ERP results

The grand average ERP, time-locked to the onsets of critical characters, is displayed in Fig. 3. Between 400 and 600 ms, there was a main effect of CD in both the midline electrodes (F = 7.45, p = 0.008) and the lateral electrodes (F = 8.44, p = 0.005). High-CD characters evoked a larger late positive component (LPC) than the control condition (see Fig. 4). The CD × region interaction was significant (see Fig. 5), F = 3.20, p = 0.04. Simple effect analyses showed that the effect of CD was largest at the posterior sites (β = 0.75, SE = 0.22, z = 3.48, p < 0.001), followed by the central region (β = 0.59, SE = 0.22, z = 2.73, p = 0.006), and did not reach significance at the anterior region (β = 0.26, SE = 0.22, z = 1.21, p = 0.23). The CD × hemisphere interaction was marginally significant, F = 2.90, p = 0.09. We further performed a Bayes factor model comparison using the R package “BayesFactor” (Morey & Rouder, 2018). The Bayes factor is the ratio of the marginal likelihoods of two competing models. It has advantages over other model comparison methods such as likelihood ratio tests (Baele et al., 2013). Adding the interaction between CD and hemisphere to the model yielded a Bayes factor of only 0.094, providing no evidence for the potential interaction effect (Jeffreys, 1998).
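For readers less familiar with Bayes factors, the reported value can be unpacked as follows. The notation is an assumption introduced here: M_1 denotes the model that includes the CD × hemisphere interaction, M_0 the model without it, and D the data; the reported factor is taken to express M_1 relative to M_0.

```latex
\mathrm{BF}_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0)}, \qquad
\mathrm{BF}_{10} = 0.094 \;\Rightarrow\; \mathrm{BF}_{01} = \frac{1}{0.094} \approx 10.6
```

Under that reading, the data are roughly ten times more likely under the model without the interaction than under the model with it.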

Fig. 3

Grand average event-related potential (ERP) in response to the target characters from nine representative electrodes over the -200- to 800-ms time window. The control group comprises characters with higher character frequency (CF) and lower contextual diversity (CD), the LCF group comprises characters with a similar CD to the control group but lower CF, and the HCD group comprises characters with a similar CF to the control group but higher CD. The onset of the critical character is aligned to zero on the timeline. Analysis windows are shown by the gray-shaded areas

Fig. 4

Topographical distributions of the contextual diversity (CD) effect and the character frequency (CF) effect in the 400- to 600-ms time window. Mean amplitude differences were calculated across the event-related potential responses to three conditions

Fig. 5

Mean amplitude of the event-related potentials in the 400- to 600-ms window elicited by the control group (high-frequency, low-CD characters) and the HCD group (high-frequency, high-CD characters) in each region. Black dots represent mean amplitude for each condition and error bars represent the bootstrapped 95% confidence interval of the mean. Small dots represent individual mean amplitude for each condition

As shown in Figs. 3 and 4, no main effect of CF was observed in the midline analysis (F = 0.67, p = 0.42) or in the lateral analysis (F = 1.73, p = 0.19). The interaction between CF and hemisphere was marginally significant (F = 2.91, p = 0.09); however, the Bayes factor shows that adding the interaction between CF and hemisphere to the model only improved it by a factor of 0.08, which is extremely weak evidence for the model that includes the CF × hemisphere interaction. No other interaction with CF was observed, Fs < 0.59, ps > 0.71. A supplemental regression analysis also showed a significant effect of CD but not CF (see Fig. 6).Footnote 6

Fig. 6

The amplitude of event-related potentials in the 400- to 600-ms time window as a function of the contextual diversity values (Log10 transformed). Red dots (Fig. 6a) and yellow dots (Fig. 6b) represent individual amplitude for each item in the midline and lateral analyses, respectively. Colored shaded regions represent 95% confidence intervals on the slopes

Discussion

We manipulated CF and CD for Chinese characters using a character decision task while measuring ERPs. With CF controlled, character decision times were faster for higher CD characters compared to a control condition, whereas there were no effects of CF, with the magnitude of the CD effects consistent with previous behavioral studies.

ERPs were sensitive to CD but not frequency. The LPC, a late positive component that likely reflects degree of semantic activation (Chen et al., 2016; Juottonen et al., 1996; Zou et al., 2019), and which is sensitive to linguistic context (Aurnhammer et al., 2023), was larger for higher CD characters compared to lower CD, matched-frequency controls. Importantly, the CD effect obtained in the present study cannot be explained in terms of other semantic variables (e.g., concreteness, imageability) or emotional variables (e.g., valence, arousal), as the experimental characters were matched on these factors (see Table 1). Compared to low CD characters, contextual information is richer and more available for high CD characters, resulting in a larger LPC amplitude. Notably, in previous ERP studies using Chinese words or characters, which manipulated word and character frequency but not CD, frequency effects were also reflected in the LPC (e.g., Guo et al., 2004; Ye et al., 2019; Yum & Law, 2019; Zhang et al., 2006). Moreover, the direction and the central-posterior distribution of the CD effects resemble the results obtained in other ERP studies that manipulated factors related to context (e.g., Kwon et al., 2012).

There are similarities and differences between our findings and those of Vergara-Martinez et al. (2017). The most important similarity is that the LPC locus of the CD effects supports Vergara-Martinez et al.’s conclusion that the ERP effects of CD are “the result of larger semantic networks that become temporally active for words that appear in many contexts” (Vergara-Martinez et al., 2017, p. 467).

There are two notable differences. First, Vergara-Martinez et al. found CD effects on N400, whereas in our study CD affected LPC, a later component. This difference is not surprising. While frequency effects on N400 have been observed in Chinese, frequency consistently affects LPC, which follows N400 and is sensitive to semantic and contextual variables. In behavioral studies where both CD and WF were manipulated, character and lexical decision times (the current study and Huang et al., 2021) and reading times (Chen, Huang et al., 2017a; Chen, Zhao et al., 2017b) to Chinese words and characters were slower than those to English (Plummer et al., 2014), Spanish (Vergara-Martínez et al., 2017), and Portuguese (Perea et al., 2013). The different time-course of the CD effects likely reflects slower access of semantic/lexical information in Chinese compared to alphabetic languages, with the time course of the LPC consistent with character-decision times (for review, see Li et al., 2022).

The second, and most important, difference is that Vergara-Martinez et al. found ERP effects of both CD and WF, with CD and WF effects differing in their direction and distribution, whereas we found effects of character CD but not frequency, which is consistent with results using behavioral measures. Further research will be needed to determine whether this difference can be attributed to properties of alphabetic compared to character-based orthographies, or to some other aspect of the materials, for example, structural characteristics of word forms that are correlated with WF but not CD (Adelman et al., 2006; Vergara-Martinez et al., 2017). One promising approach would be to use three-condition designs in which context manipulations result in either CD or WF effects, depending upon the strength of the contextual constraint (see note 2 for an example).

The results are consistent with our context constructivist account, in which both CD and WF effects reflect the need probability (e.g., predictability) of a word. On this account, lexical representations store only context-contingent frequencies. Thus, token frequency is not easily accessible/computable. However, the range of contexts in which a word will occur (which is correlated with semantic richness) is accessible and thus a good proxy for need probability for words in isolation or weakly constraining contexts. Because WF and CD are both proxies for need probability, we do not predict dissociable effects of these two variables in three-condition designs in which other variables that affect lexical access, many of which are correlated with WF, are factored out. In ERP studies, depending on the time course of semantic effects, CD should be reflected in components sensitive to richness of context, such as N400 or LPC.

The results are also consistent with two proposals that do not incorporate need probability. The first is the “context availability model” (Holcomb et al., 1999; Schwanenflugel et al., 1988; Schwanenflugel & Shoben, 1983), which is often used to explain the concreteness effect. This model argues that comprehension is heavily reliant on contextual information provided by either the preceding context or the comprehender’s mental knowledge. In the absence of context, lexical decisions are faster for high-CD characters because of the increased availability of related contextual information, which also results in a larger LPC amplitude. However, the context constructivist model differs from the context availability model in making specific claims about how context is incorporated into lexical representations and in predicting word frequency effects in constrained contexts.

Our approach differs from Adelman et al. (2006) and Jones et al. (2017) in that it incorporates context into lexical representations and assumes that need probability underlies both CD and WF effects. Our approach makes novel predictions about how WF and CD effects will be modulated by contextual constraint, which can be manipulated in three-condition designs. We suggest that neural-imaging studies adopting this approach would be a fruitful avenue for understanding the neural basis of CD and WF effects, including whether they are dissociable.