1 Introduction: The Corpus’s Role in Sentence Processing

An important topic in linguistic research concerns the interface issue, namely, how a language system interacts with computation, expressive content, and articulation. Two dimensions of linguistic processing—language comprehension and language production—are particularly important. Language comprehension revolves around how the mind perceives and interprets linguistic signals, whereas language production entails how the mind generates linguistic codes for articulation. These dual facets of language processing serve as the foundation for the development of various theories concerning language.

While it might appear evident that there should be a connection between language comprehension and language production, the precise nature of this connection remains less clear. A conventional model like the Speech Chain (Denes and Pinson 1993) sees comprehension and production as two sides of the same coin. Language production corresponds to the speaker (i.e., encoding) aspect of the chain while language comprehension corresponds to the listener (i.e., decoding) aspect of the chain. Such models typically assume a symmetrical relation between comprehension and production, with these two aspects linked through shared linguistic representations. Accordingly, if a linguistic expression is difficult to encode, it is also taken to be difficult to decode. The complexity of linguistic representations and language users’ experience with language comprehension and language production can both account for the symmetrical processing effects in comprehension and production. Linguistic materials that are more complex are expected to be harder to interpret and produce (Ferreira 1991; Gibson and Warren 2004). Similarly, less frequently encountered/produced expressions are expected to be more demanding to understand (Reali and Christiansen 2007).

The Production-Distribution-Comprehension (PDC) model (MacDonald 2013) represents a significant endeavor to directly bridge the realms of sentence production and sentence comprehension. According to the PDC model, the distributional regularities in corpora provide valuable insights into the mechanisms at play during utterance planning. This involves organizing information based on processing ease, with a tendency to reuse recently employed structures. Distributional regularities can also be used to predict how utterances may unfold (Hale 2001, 2006; Levy 2008). Distributional regularities from corpora therefore serve as an important resource for making inferences about grammar. On the one hand, corpus data can be seen as a snapshot of collective language production, revealing what structures and expressions are favored in a given context. On the other hand, corpus data illuminates the probabilistic underpinnings of grammar based on which parsing decisions are made.

2 Processing Relative Clauses

Taking relative clauses (RCs) as an example, a common finding in English is that subject-extracted relative clauses (SRCs) like (23.1) below are easier to process than object-extracted relative clauses (ORCs) like (23.2) both for comprehension and for production (Gibson et al. 2005; King and Just 1991; Traxler et al. 2002; see Lin and Bever 2006 and O’Grady 2011 for a typological overview). Multiple factors contrasting SRCs and ORCs can account for the processing advantage of (23.1) over (23.2), including, for instance, the shorter distance between the head and the gap in SRCs compared with that in ORCs (Gibson 1998) and the canonical thematic order of Noun-Verb-Noun (NVN) or Agent-Verb-Patient found in SRCs but not in ORCs (Bever 1970; Lin 2014, 2015).

(23.1) The harpisti who [gapi] knows the composer received good reviews.

(23.2) The harpisti who the composer knows [gapi] received good reviews.

The processing advantage of SRCs is predicted based on the formal property of the linguistic material, namely, a shorter filler-gap distance and the canonicity of word orders found in SRCs. Intriguingly, this processing asymmetry is also consistent with the distributional dominance of SRCs in corpora. Roland et al. (2007), for instance, reported that ORCs are less frequent than SRCs in English written corpora. Considering production, distribution, and comprehension, therefore, RCs in English show a rather consistent pattern; that is, SRCs exhibit higher frequency and are generally easier to process compared to ORCs.

The underlying reasons for this correlation, however, remain a subject of debate, given the presence of multiple factors that can make similar predictions. One potential scenario considers production as the foundation for distributional dominance and, consequently, ease of comprehension. In this view, due to factors like locality and word order canonicity, planning the production of an SRC is inherently more straightforward than that of an ORC. Consequently, SRCs tend to appear more frequently in corpora. As language users encounter SRCs more often, they become more adept at both producing and comprehending them, creating a self-reinforcing cycle. Another plausible scenario involves inferring from frequency distribution that SRCs serve a more functional role in discourse than ORCs. Given their higher frequency of use, SRCs are not only easier to produce or reuse but are also more likely to be expected and comprehended by language users. Several other explanations could account for this correlation, but the linked observations in comprehension, production, and corpus distribution have yet to definitively establish the causal relationships among them.

This chapter will report the distributional frequencies of Chinese relative clauses in the Sinica Treebank 3.0 (http://turing.iis.sinica.edu.tw/treesearch/; Chen et al. 1996, 2003) and discuss these distributions in light of their significance in sentence processing. In recent years, researchers have increasingly focused on the processing of head-final relative clauses, where RCs appear before the head nouns they modify. Chinese, in particular, has garnered attention in sentence processing research. While the basic word order of Chinese is Subject-Verb-Object (SVO) as it is in English, the noun phrase (NP) structure in Chinese is head-final. The embedded clause in a Chinese NP appears before the noun it modifies. Owing to this typological particularity, SRCs and ORCs in Chinese present different filler-gap relations from those in English. Specifically, Chinese RCs feature gaps that precede fillers in terms of linear order, and SRCs entail longer dependency distances compared to ORCs, as shown in (23.3) and (23.4). Furthermore, ORCs, but not SRCs, adhere to the canonical NVN order in Chinese. These considerations related to locality and word order suggest a processing advantage for ORCs over SRCs, in contrast to the observations in English.

(23.3) [gapi]認識作曲家的豎琴家i獲得好評。

 [gapi]__renshi__zuoqujia__de__shuqinjiai__huode__haoping

 [gapi]__know__composer__DE__harpisti__win__good.review

The harpisti who [gapi] knows the composer received good reviews.

(23.4) 作曲家認識 [gapi]的豎琴家i獲得好評。

 zuoqujia__renshi__[gapi]__de__shuqinjiai__huode__haoping

 composer__know__[gapi]__DE__harpisti__win__good.review

The harpisti who the composer knows [gapi] received good reviews.

Head-final relative clauses like those in Chinese therefore offer an intriguing testing ground for teasing apart the comprehension and production factors that are confounded in head-initial RCs. While locality and word order canonicity both predict easier comprehension of SRCs in English, they predict easier comprehension of ORCs in Chinese. Unlike in English, however, the distribution of relative clauses in Chinese corpora does not align with these processing predictions. Frequency distributions have quite consistently indicated higher occurrence of SRCs than ORCs in the corpora (e.g., Wu et al. 2011), thus predicting an SRC advantage. In fact, research on Chinese RC processing has yielded mixed results. In terms of comprehension, some studies have reported that SRCs are easier (Chen et al. 2012; Jäger et al. 2015; Lin and Bever 2006), while others have reported that ORCs are easier (Gibson and Wu 2013; Hsiao and Gibson 2003; Lin 2014; Lin and Garnsey 2011; Packard et al. 2011; Qiao et al. 2012; Sung et al. 2016). In terms of RC production, SRCs have been found to take a shorter time to initiate than ORCs (Lin 2013).

The dominance of SRCs in corpora is in line with the SRC advantage in sentence planning (Lin 2013) and in some comprehension studies (Chen et al. 2012; Jäger et al. 2015; Lin and Bever 2006) but in conflict with the ORC advantage in other comprehension studies (Gibson and Wu 2013; Hsiao and Gibson 2003; Lin 2014; Lin and Garnsey 2011; Packard et al. 2011; Qiao et al. 2012; Sung et al. 2016). In light of this, our study aims to delve deeper into the distributions of Chinese RCs while considering their relevance to critical issues in RC processing. Subsequent sections will dissect the corpus data extracted from the Sinica Treebank and explore the intricate connections between sentence comprehension, sentence production, and linguistic representation.

3 Distributional Regularities of Chinese Relative Clauses in the Sinica Treebank

Chinese relative clauses were extracted from the Sinica Treebank 3.0, which is based on the Sinica Corpus (http://asbc.iis.sinica.edu.tw/; Chen et al. 1996), a balanced corpus of contemporary Chinese texts produced between 1981 and 2007 (Huang et al. 2017). The Sinica Treebank 3.0 is composed of 361,834 words automatically parsed into 61,087 syntactic trees, which were manually checked and corrected before public release. Our corpus searches targeted NPs that contained prenominal modifier phrases headed by 的 de where the prenominal modifier contained a clause, a verb phrase (VP), or a verb. A sample tree diagram is provided in Fig. 23.1.

Fig. 23.1 Example of a Sinica tree structure of a relative clause, with depths in parentheses and phrasal nodes in boxes. The tree expands as follows: VP → Head:VC2, aspect:Di, goal:NP; goal:NP → predication:VP‧的, Head:Nab; predication:VP‧的 → head:VP, Head:DE; head:VP → location:NP, standard:PP, Head:VA11; location:NP → property:Nab, Head:Ncda; standard:PP → Head:P58, DUMMY:NP.

Our search yielded 3081 tokens, which were manually coded based on various syntactic and semantic properties of the head nouns, the prenominal clauses, and the location of complex NPs in the matrix clauses. The coding process was carried out and reviewed by native speakers of Standard Chinese (i.e., Mandarin), including both authors and several linguists. The coding guidelines were established by the first author. Cases where de served as a genitive marker (e.g., 人性的黑暗面 renxing de heianmian “the dark side of human nature”) or appeared as part of an idiom (e.g., 所謂的 suowei de “so-called”) as well as cases that contained incomplete RC fragments were excluded from further analysis (N = 106, 3% of all tokens). As a result, 2975 RCs were retained for subsequent analyses.

In addition to manually coding the syntactic and semantic properties of the RCs, we extracted the parts-of-speech (POS) tags of the embedded verbs based on verb classification in the Sinica Corpus and measured the syntactic complexity of the embedded clauses based on several metrics. These metrics included (a) the length of the prenominal RCs in terms of the number of characters and number of words, (b) the syntactic depth of the prenominal clauses in terms of the number of syntactic layers, and (c) syntactic complexity in terms of the number of phrasal nodes in the prenominal clauses. We will use Fig. 23.1 above to illustrate these measures.

The number of syllables or characters is the most straightforward measure. In Fig. 23.1, the prenominal clause contains seven characters/syllables, including the relativizer de. In Standard Chinese, the number of syllables/characters is almost equivalent to the number of morphemes. Phonological lengths thus quite closely reflect the amount of lexical content. The number of words (six in Fig. 23.1) is based on word segmentation in the Sinica Corpus. The number of layers (or depth) of a prenominal clause indicates how deep the clause is, which is measured by the number of edges on the path from the head (VP‧的 in Fig. 23.1) to its deepest word (Head:Naa 風). Note that we counted from the head node of the RC (VP‧的), not the head node of the whole tree fragment (VP), so in Fig. 23.1, the number of edges on the path is four. Tokens where more than one RC was found were excluded from this analysis. An additional measure of syntactic complexity is the number of phrasal nodes, whereby all non-terminal (non-leaf) nodes are counted. In the tree in Fig. 23.1, the embedded clause has four phrasal nodes—head:VP, location:NP, standard:PP, and DUMMY:NP. These phrasal nodes are roughly equivalent to the constituents in the sentence, which we believe are a good indicator of RC complexity.

The RCs were classified into six distinct types, with a primary focus on how the head nouns are reconstructed in the embedded clauses. Head nouns can be modified by clauses that are devoid of missing arguments. These RCs are gapless and are integrated with the head nouns as clausal complements (see Sect. 23.3.1). In most cases, the embedded clause contains a missing argument, with which the head noun is identified. A complete clause can be reconstructed by interpreting the missing argument as being coreferential with the head noun. In these instances, a filler-gap dependency exists between the head and the missing argument. We considered five subtypes where the head holds a dependency with an NP in the subordinate clause. In possessive RCs, the head is coreferential with the possessor argument of an embedded NP. In descriptive RCs, the head serves as the NP that the descriptive RC predicates on. The remaining three subtypes of RCs contain more obvious missing arguments in the embedded clause. In passive RCs, the head noun is coreferential with the missing subject NP of the embedded passive clause. In SRCs, the head noun is coreferential with the subject NP in the embedded clause. Finally, in ORCs, the head noun is coreferential with the object NP in the embedded clause. Table 23.1 provides definitions for the six types of RCs, each of which will be introduced in more detail. Furthermore, their respective distributions in the corpus will be discussed in subsequent sections:

Table 23.1 Definitions of the relative clause types

Figure 23.2 presents the percentage distribution of the different types of RCs. The majority (87%) of the RCs fell within two types of gapped RCs, SRCs (53%) and ORCs (34%), with SRCs outnumbering ORCs. The embedded clauses clearly showed the tendency of having missing subject or object arguments that were coreferential with the head nouns.

Fig. 23.2 Percentage distribution of relative clauses. Bar values: gapless RC 5%, possessive RC 1%, descriptive RC 4%, passive RC 3%, SRC 53%, ORC 34%.

To get an initial glimpse of the complexity of the prenominal clauses, Table 23.2 shows the clausal lengths in terms of syllables/characters and words, the syntactic depths, and the syntactic complexity of the six types of RCs. The overall pattern was consistent across all four metrics (ps < 0.05, paired comparisons with Tukey correction). Descriptive RCs were the shortest and least complex, while passive RCs were the longest and most complex. SRCs were longer and more complex than ORCs.

Table 23.2 Length and complexity of relative clauses

Given that the syntactic category of the embedded verb plays an important role in selecting arguments, we further extracted the POS of the main verbs in the embedded clauses based on verb classification in the Sinica Corpus (Huang et al. 2017). The distribution of verb classes in the different RC types is presented in Table 23.3. The following sections will further discuss the POS properties of the different RC types using the information in Table 23.3.

Table 23.3 POS distributions of the embedded verbs of relative clauses (the most common categories are boldfaced)

3.1 Gapless Relative Clauses and Possessive Relative Clauses

Both gapless RCs, exemplified in (23.5) to (23.7) below, and possessive RCs, as illustrated in (23.8), present themselves as complete clauses without obvious missing arguments or gaps. This section will distinguish these two types of RCs and compare their distributions in the corpus. Gapless RCs encompass three distinct types of compositional relations between the head noun and the embedded clauses. When the head noun functions as a relational noun (e.g., “time” and “space”), it takes an event argument and the prenominal clause fulfills the event argument requirement of the relational noun and serves as a clausal complement of the head noun. RCs like (23.5) are commonly referred to as gapless relative clauses (Tsai 1997; Zhang 2008) or adjunct relative clauses (Lin 2018) in the literature. Gapless relative clauses also encompass sloppy relative clauses like (23.6), where the head noun is coerced into a relational noun, and it becomes integrated with a clausal complement to arrive at a sense of aboutness, akin to the function of “of” in English (Cheng and Sybesma 2005). Additionally, appositive relative clauses, exemplified by (23.7), fall under the category of gapless RCs. Together, gapless relative clauses accounted for approximately 5% of the relative clauses found in the Sinica Treebank.

(23.5) 七十萬人居住的以色列境內各阿拉伯城鎮

 qishiwan__ren__juzhu__de__yiselie__jingnei__ge__alabo__chengzhen

 700,000__people__live__DE__Israel__inside__each__Arabic__city

the Arabic cities inside Israel where 700,000 people live

(23.6) 昨日盤面拉高出貨的味道濃厚

 zuori__panmian__lagao__chuhuo__de__weidao__nonghou

 yesterday__stock.index__rise__sell__DE__taste__strong

The feel of stocks rising and being sold was strong yesterday.

(23.7) 民不與官鬥的道理

 min__bu__yu__guan__dou__de__daoli

 civilian__not__with__government.officials__fight__DE__principle

the principle that civilians should not fight against government officials

In contrast, in some gapless prenominal clauses, the head noun is non-relational and does not take the entire embedded clause as its complement or argument. Instead, the head noun forms a possessive association with a nominal argument located within the embedded clause. These RCs are classified as possessive RCs, as shown in (23.8) below. In these instances, the head noun is interpreted as the possessor argument of an embedded inalienable noun (e.g., shencai “figure” and shou “hand”) (following Lin 2011). Possessive RCs constituted only 1% of the relative clauses extracted from the Sinica Treebank.

(23.8) 一位身材i魁梧、手i持鐵椎的大力士i

 yi__wei__shencaii__kuiwu__shoui__chi__tiechui__de__dalishii

 one__CL__figurei__stout__handi__hold__hammer__DE__strong.guyi

a strong guy whose figure is stout and whose hand holds a hammer

Distinctive reading patterns have been observed in gapless relative clauses like those in (23.5) to (23.7) and possessive relative clauses like (23.8) (Lin 2018) owing to the head nouns holding different dependency relations with the embedded clauses. Since the entire gapless RC is integrated with the adjunctive relational head noun, the complexity and frequency of the prenominal clause influence the processing difficulty of the complex NP. Conversely, the comprehension of possessive RCs is sensitive to the structural position of the dependent noun (possessee) in the prenominal clause. Dependent nouns located at subject positions as seen in (23.8) are generally easier to comprehend than those at lower syntactic positions such as objects. Gapless and possessive relative clauses are otherwise comparable in terms of prenominal clause lengths and syntactic complexity, and the lengths of the head nouns. All instances of possessive RCs found in our study involved an inalienable noun located in the subject position like in (23.8).

Furthermore, the animacies of the head nouns were distinctive between the two types of RCs. The majority (97%) of the head nouns in the gapless RCs were non-human relational nouns, while 53% of the head nouns in the possessive RCs were human possessors. Comparing the main verbs in gapless RCs and those in possessive RCs, it was observed that over half (52%) of the main verbs in the possessive RCs were stative intransitive verbs (VH), suggesting that possessive RCs mainly serve the function of describing the individual-level properties of the human head nouns.

3.2 Subject and Object Relative Clauses: Matrix Position, Animacy, and Complexity

The most common relative clauses are those where the head noun is interpreted as a key argument of the main verb in the embedded clause. These relative clauses typically contain a missing argument that is coreferential with the head noun. The highest grammatical functions in the Keenan-Comrie Accessibility Hierarchy (Keenan and Comrie 1977) shown in (23.9) below, namely, the subject and the object, are also the positions most frequently relativized in Chinese. These two types of relative clauses (not including descriptive SRCs and passive SRCs) account for 87% of the relative clauses in the Sinica Treebank.

(23.9) Keenan-Comrie Accessibility Hierarchy (1977: 66):

 subject > direct object > indirect object > oblique NP > genitive NP > object of

 comparison

Our study classified RCs that involved subject extraction into three subtypes: subject relative clauses that contain a missing subject argument (53%) like in (23.10) below, RCs that contain a passive structure (3%) like in (23.11), and prenominal modifiers that involve descriptive predicates (4%) like in (23.12). A typical RC that involved the extraction of a noun from an object position (34%) is exemplified by (23.13) below.

(23.10) [gapi]唱歌的小河i

 [gapi]__changge__de__xiaohe i

 [gapi]__sing__DE__river i

the river that sings

(23.11) [gapi]被列為觀光區的原住民部落i

 [gapi]__bei__liewei__guanguangqu__de__yuanzhumin__buluoi

 [gapi]__BEI__designate.as__tourist.district__DE__aboriginal__sitei

the aboriginal sites that have been designated as tourist districts

(23.12) [gapi]年輕的一代i

 [gapi]__nianqing__de__yi__daii

 [gapi]__young__DE__one__generationi

the young generation

(23.13) 人類共同追求[gapi]的目標i

 renlei__gongtong__zhuiqiu__[gapi]__de__mubiaoi

 mankind__together__pursue__[gapi]__DE__goali

the goal that all mankind pursues together

Passive relative clauses, with a word order like that in (23.14) below and an additional functional head such as 被 bei, 受 shou, 為 wei, 由 you, 遭 zao, etc., are distinct from SRCs and ORCs. Notably, in the so-called “short passives”, the agent NP may be absent, and the head noun typically assumes the role of the theme or patient NP of the embedded verb. Due to these distinctions, we have categorized passive RCs separately and will discuss their distributional properties in Sect. 23.3.3.

(23.14) [gapi]__bei/zao/shou__(Agent.NP)__Verb__DE__Patient.NPi

Given that stative verbs in Chinese are typically predicated of subject NPs, as in (23.15) below, descriptive RCs can be regarded as RCs that involve subject extraction. However, they also diverge quite significantly from the typical gapped relatives like SRCs and ORCs, which entail the relativization of a key argument of the embedded verb. Based on the information provided in Table 23.2, descriptive relative clauses were notably shorter (averaging 4.65 characters) and simpler (averaging 1.47 phrasal nodes) than RCs that involved extractions from subject or object positions. They can thus be taken as simple predicates that are integrated with the head nouns without having to involve a structure-based filler-gap dependency, much like gapless relative clauses and adjectives in English. Notably, the embedded verbs in these descriptive RCs were mainly stative intransitive verbs (57% being VH verbs).

(23.15) 這些孩子還很年輕

 zhe__xie__haizi__hai__hen__nianqing

 this__CL__kids__still__very__young

These kids are still young.

We will now turn to the distributional properties of RCs that involve the extraction of subject and object arguments. As introduced, SRCs and ORCs are among the most commonly studied sentence structures. Of the relative clauses extracted from the Sinica Treebank, SRCs (53%) appeared more frequently than ORCs (34%), which is consistent with findings in other languages and in other studies on the Chinese language. Table 23.2 also shows that SRCs were longer and more complex than ORCs. Sentence comprehension studies on Chinese RCs have yielded a mix of SRC advantages and ORC advantages, as reviewed in Sect. 23.2. The corpus distributions suggest that Chinese language users may, on the whole, be more experienced with SRCs than ORCs.

One important discourse function of RCs is to reference information already present in the background and present the focused NP for predication. The RC’s position in the matrix clause therefore plays a pivotal role for understanding the discourse functions. Typically, the subject position of a sentence imparts grounding information shared by interlocutors whereas the object position provides new and focused information. Sentence processing research has revealed that, overall, Chinese RCs are more frequently expected in the subject position (Lin 2012). Table 23.4 summarizes the findings of Wu et al. (2011), who extracted 1218 relative clauses from the first 1000 files in the Chinese Treebank 5.0 (Xue et al. 2005), and compares them with the distributions in our study based on the Sinica Treebank.

Table 23.4 Distribution of relative clauses as a function of extraction types and position in matrix clauses

The general distributions were similar in both studies, with RCs appearing more often in the subject positions of matrix clauses than in the object positions. Furthermore, there were more SRCs than ORCs in both positions. However, our study differs from Wu et al. (2011) in that the SRCs in our study were more inclined to modify matrix subject NPs, while the ORCs tended to modify matrix object NPs. This contrast was even more pronounced when we differentiated between matrix clauses that contained the presentative copula shi and those that did not, as shown in Table 23.5.

Table 23.5 Distribution of relative clauses as a function of extraction types, position in matrix clauses, and existence of shi in matrix predicates

Sentences containing non-shi predicates presented a stronger tendency for an SRC to modify a subject NP (41% vs. 23%) and for an ORC to modify an object NP (22% vs. 15%). This interplay between the presence of the presentative copula shi and the distribution of RCs in matrix clauses underscores the importance of differentiating sentences containing shi and those that do not when studying RC positions. It also implies that the grammatical function of the head noun in the RC interacts with its function in the matrix clause. When considering sentences without shi, it becomes apparent that head nouns tend to fulfill the same grammatical functions in both the subordinate and matrix clauses. This observation can be explained by two plausible accounts. First, in terms of production, it may be more efficient to maintain consistent grammatical functions in both the embedded clause and the matrix clause. Second, this distribution also aligns with the general semantic tendency that NPs in the subject position tend to be human and those in the object position tend to be non-human entities, as proposed by Traxler et al. (2002). The humanness/animacy factor can lead to the tendency for the heads of SRCs to be human nouns, which are also preferably located in the subject position of the matrix clause. On the other hand, the heads of ORCs are more likely to be inanimate and preferably located in the object position of the matrix clause.

To delve deeper into these two accounts, we further conducted an analysis of the animacy distribution of the head nouns in relation to the types of grammatical extractions (SRC vs. ORC) and their matrix positions (Subject vs. Object). In terms of animacy and humanness, the head nouns were classified into five categories, as shown in Table 23.6. We focused on the distribution of inanimate NPs (58%) and human NPs (37%) because these two categories accounted for the majority (95%) of the data.

Table 23.6 Examples and distribution of head noun animacy/humanness

Figure 23.3 presents the percentage distribution of SRCs and ORCs (N = 997, where the head noun is either inanimate or human) as a function of head noun animacy/humanness and matrix positions, excluding the matrix sentences that contained shi.

Fig. 23.3 Percentage distribution of head noun animacy/humanness, RC matrix positions, and RC types. Bar values (inanimate, human): S SRC 11.1%, 30.3%; S ORC 12.6%, 1.5%; O SRC 14.4%, 9.2%; O ORC 19.3%, 1.5%.

The distribution percentages shown in Fig. 23.3 affirm the overall animacy/humanness asymmetry in terms of grammatical positions, which has been observed across languages (Fox and Thompson 1990). Specifically, subject positions are more likely to be occupied by human nouns and object positions are more likely to be occupied by inanimate nouns. This asymmetry was also evident in RC extraction types, as both SRCs and ORCs showed distinctive animacy preferences.

As shown in Fig. 23.3, in the matrix subject position, while SRCs mainly modified human NPs, only very few ORCs modified human NPs. In the matrix object position, the proportion of inanimate NPs increased for both SRCs and ORCs and the proportion of human NPs decreased, especially in SRCs. The animacy preference within the matrix clauses and that within the embedded clauses presented an intriguing interaction, resulting in a competition between the two levels of grammatical functions based on their animacy preferences. In the matrix subject position, the animacy preference of the embedded RC type determined the tendency, while in the matrix object position, that of the matrix clause determined the tendency. In both matrix positions, ORCs modified inanimate head NPs more frequently than SRCs, while SRCs featured a higher proportion of human head nouns than ORCs only in the matrix subject position.
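The interaction described above can be made concrete by converting the percentages reported for Fig. 23.3 into the share of human head nouns within each condition. The following Python sketch is an illustrative calculation only. The dictionary layout and function name are assumptions, as is the reading of the figure labels as matrix position followed by RC type (e.g., S SRC as an SRC modifying a matrix subject).

```python
# Percentages reported for Fig. 23.3, stored as (inanimate, human).
# Keys are (matrix position, RC type); this reading of the labels is assumed.
FIG_23_3 = {
    ("subject", "SRC"): (11.1, 30.3),
    ("subject", "ORC"): (12.6, 1.5),
    ("object", "SRC"): (14.4, 9.2),
    ("object", "ORC"): (19.3, 1.5),
}


def human_share(cell):
    """Proportion of human head nouns within one position-by-type cell."""
    inanimate, human = cell
    return human / (inanimate + human)


for (position, rc_type), cell in FIG_23_3.items():
    print(f"matrix {position}, {rc_type}: {human_share(cell):.1%} human heads")
```

On these figures, human heads make up roughly 73% of SRCs but only about 11% of ORCs in the matrix subject position, and roughly 39% of SRCs versus about 7% of ORCs in the matrix object position.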

Regarding the POS of the embedded verbs (see Table 23.3), in both SRCs and ORCs, transitive action verbs (VC) were the most common. The different verb classes were fairly evenly distributed in SRCs but more skewed toward transitive verbs that required an object argument in ORCs. Compared with ORCs, SRCs had more intransitive action verbs like pao “to run” (VA: 15%) that required only one subject argument, classification verbs like xing “to be named as” (VG: 8%), and stative verbs that required only one object argument like daibiao “to stand for” (VJ: 17%).

Finally, ORCs in Standard Chinese are known to sometimes appear with the particle suo located before the main verb, as in (23.16) below, which is associated with greater formality and literary style. Among the RCs in our study, 164 RCs (5.5%) featured the particle suo. ORCs with suo were longer than those without suo in the embedded clauses (9.7 vs. 6.8 characters, t = 5.92, p < 0.001), which is consistent with the notion that constituent length serves as an indicator of formality in Standard Chinese, with longer constituents generally conveying a higher degree of formality.

(23.16) 專家所具備的投資能力比一般人高。

 zhuanjia__suo__jubei__de__touzi__nengli__bi__yiban__ren__gao

 expert__SUO__have__DE__invest__ability__compare__regular__person__high

The ability to invest that experts have is higher than that of regular people.

Comparing ORCs with suo and ORCs without suo in terms of whether they modified a matrix subject NP or a matrix object NP (Fig. 23.4) yielded a noteworthy observation. While ORCs without suo tended to appear in the matrix object position in sentences that did not involve shi, ORCs with suo were equally distributed across subject and object positions. This finding suggests that the enhanced formality associated with suo in an ORC overrides the animacy propensity that was discussed above and leads to a more balanced appearance of an ORC in the subject and object positions of matrix clauses.

Fig. 23.4
A quadruple bar graph indicates the values of S V O S, S V O O, Shi S, and Shi O. The data from left to right are as follows. Without suo, 125, 198, 124, and 107. With suo, 30, 30, 17, and 16.

Distribution of RC matrix positions as a function of the existence of suo in ORCs (numbers indicate instances)
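As a minimal sketch, the counts reported in Fig. 23.4 can be converted into the share of ORCs that occupy the matrix object position, which makes the contrast between ORCs with and without suo easier to read. The dictionary keys (e.g., `SVO-S`, `shi-O`) and the function name `object_share` are labels introduced here for illustration and are not part of the original analysis.

```python
# Counts transcribed from Fig. 23.4 (instances of ORCs per matrix position).
# Key names are illustrative labels, not the chapter's own coding scheme.
counts = {
    "without suo": {"SVO-S": 125, "SVO-O": 198, "shi-S": 124, "shi-O": 107},
    "with suo": {"SVO-S": 30, "SVO-O": 30, "shi-S": 17, "shi-O": 16},
}


def object_share(row, subj_key, obj_key):
    """Share of object-position tokens among subject + object tokens."""
    return row[obj_key] / (row[subj_key] + row[obj_key])


# Compute the object share separately for SVO sentences and shi sentences.
shares = {
    label: {
        "SVO": object_share(row, "SVO-S", "SVO-O"),
        "shi": object_share(row, "shi-S", "shi-O"),
    }
    for label, row in counts.items()
}

for label, by_type in shares.items():
    for sentence_type, share in by_type.items():
        print(f"{label:12s} {sentence_type}: {share:.1%} in object position")
```

Under these counts, ORCs without suo appear in object position in roughly 61% of the SVO cases, whereas ORCs with suo are split evenly (30 vs. 30). The sketch only restates the figure as proportions and does not test whether the difference is statistically reliable.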

3.3 Passive Relative Clauses

Due to the increased syntactic complexity associated with an additional functional head (e.g., bei), passive RCs are longer and more complex than SRCs and ORCs (see Table 23.1). Passive RCs, as exemplified in (11), stand between SRCs and ORCs as a third category that involves the relativization of a key argument associated with the embedded verb. In terms of thematic content, passive RCs are similar to ORCs as it is the patient NP of the embedded clause that is relativized. In terms of the grammatical position of the relativized gap, a passive RC is more similar to an SRC, where the gap is located in the subject position.

We looked at the position of passive RCs in matrix clauses and found that, like ORCs, the majority (69%) of passive RCs were located at the matrix object position. Further exploring the animacy distribution of the head nouns in SRCs, ORCs, and passive RCs, as shown in Table 23.7 below, based on the coding scheme in Table 23.6, we found that passive RCs were more similar to ORCs, with the head noun more likely to be an inanimate NP, though the tendency of having an inanimate head noun was not as strong as that of ORCs. These observations suggest that relativized patient NPs tend to be inanimate nouns. Moreover, ORCs and passive RCs were similar in terms of their thematic content and animacy preferences.

Table 23.7 Animacy distribution of SRCs, ORCs, and passive RCs

On the other hand, there appeared to be more human head nouns in passive RCs (31%) than in ORCs (7%), suggesting that a human patient noun is more likely to be relativized if it appears in the subject position of a passive clause than if it appears in the object position of an SVO clause. This finding suggests that passivization promotes the salience of a patient NP for relativization.

Turning now to the interplay between the animacy of the head noun and the position of the complex NP in the matrix clause in Fig. 23.5, the head nouns of passive RCs were predominantly human NPs in matrix subject positions (i.e., S-PassiveRC) but inanimate NPs in matrix object positions (i.e., O-PassiveRC). This distribution again confirms that passive RCs fall between SRCs and ORCs: the animacy of their head nouns mirrors that of SRCs in the matrix subject position but aligns more closely with that of ORCs in the matrix object position.

Fig. 23.5
A stacked bar graph indicates the percentages of inanimate and human. The data from left to right are as follows. S S R C, 27% and 73%. S O R C, 89% and 11%. S Passive R C, 27% and 73%. O S R C, 61% and 39%. O O R C, 93% and 7%. O Passive R C, 80% and 20%.

Percentage distribution of animacy/humanness, RC matrix positions, and RC types
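The claim that passive RCs pattern with SRCs in subject position but with ORCs in object position can be checked directly against the percentages in Fig. 23.5. The following is a minimal sketch that compares the share of human head nouns in passive RCs with that of SRCs and ORCs in each matrix position. The function name `closer_type` and the use of a simple absolute difference are assumptions made here for illustration, not the chapter's own method.

```python
# Percentage of human head nouns per condition, transcribed from Fig. 23.5.
# "S" = matrix subject position, "O" = matrix object position.
human_share = {
    "S": {"SRC": 73, "ORC": 11, "PassiveRC": 73},
    "O": {"SRC": 39, "ORC": 7, "PassiveRC": 20},
}


def closer_type(position):
    """Return the RC type whose human share is nearest to passive RCs."""
    row = human_share[position]
    # Absolute difference (in percentage points) from the passive RC share.
    gaps = {rc: abs(row["PassiveRC"] - row[rc]) for rc in ("SRC", "ORC")}
    return min(gaps, key=gaps.get), gaps


for position in ("S", "O"):
    nearest, gaps = closer_type(position)
    print(f"{position}-PassiveRC is closest to {position}-{nearest}: {gaps}")
```

In subject position the passive RC share is identical to that of SRCs, while in object position it lies 13 points from ORCs and 19 points from SRCs, in line with the description above.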

The POS distribution of the embedded verbs in passive RCs was similar to that of ORCs. Unlike SRCs, passive RCs did not contain any intransitive action verbs (VA) and had more instances of classification verbs like chengwei “to call” (VG) serving as the main verb. In terms of thematic ordering, passive RCs presented the canonical order of Agent-Verb-Patient, similar to that of an ORC. Lin (2015) compared the reading patterns of passive RCs, RCs that involved the disposal marker ba, as shown in (23.17) below, and normal SRCs. The study’s findings indicated that passive RCs exhibited the shortest reading times. This outcome underscores the importance of thematic ordering in processing relative clauses.

(23.17) [gapi]__ba__Patient.NP__Verb__DE__Agent.NPi

4 Classifier Position in Relative Clauses

One important function of RCs in discourse is to serve the restrictive function; that is, RCs help bring attention to particular referents already present in the background knowledge. One well-known proposal about how restrictiveness is expressed in Standard Chinese focuses on the position of the determiner-classifier phrase in relation to the relative clause (Chao 1968). When a relative clause precedes a determiner-classifier phrase, as in (18a) below, it is considered restrictive because the pre-determiner-classifier position is an edge-position that marks focus. When a relative clause appears after a determiner-classifier phrase, as in (18b), it lacks the focus marking and can be interpreted either as restrictive or non-restrictive (Lin 2012).

(23.18a) 他在台北拇指山下許的那個願

 ta__zai__taibei__muzhishan__xuxia__de__na__ge__yuan

 he__at__Taipei__Mt.Muzhi__make__DE__that__CL__wish

the wish that he made on Mt. Muzhi in Taipei

(23.18b) 這場可能贏的球

 zhe__chang__keneng__ying__de__qiu

 this__CL__likely__win__DE__ball.game

the ball game that (I am) likely to win

Over the years, this proposal has sparked controversy. One way to test Chao’s (1968) proposal is to examine whether the position of determiner-classifier (CL) phrases interacts with the matrix positions of complex NPs since restrictive relative clauses are more likely to appear in subject positions to ground referents (Gibson et al. 2005). Following this logic, we expected to find more occurrences of RCs that appeared before classifier phrases than RCs that appeared after classifier phrases in subject positions.

Focusing on the matrix positions of RCs that co-occurred with classifier phrases (N = 174) and distinguishing sentences that contained shi from those that did not, we found that RCs were generally more likely to appear after CL phrases, except when they appeared in the subject position of a sentence containing shi (see Fig. 23.6).

Fig. 23.6
A double bar graph indicates the percentages of C L R C and R C C L. The data from left to right are as follows. S V O S, 73% and 27%. S V O O, 72% and 28%. N shi N S, 17% and 83%. N shi N O, 74% and 26%.

Position of RCs in relation to CL phrases as a function of matrix positions

The finding that RCs were, overall, more likely to appear after classifiers suggests that the post-classifier position (i.e., CL-RC) is an unmarked position for RCs. The more frequent occurrence of RCs in the pre-classifier position when complex NPs appeared in the matrix subject position of a sentence with shi suggests that (i) the subjects in sentences with shi are preferred for grounding referents and (ii) RCs appearing in pre-classifier positions are indeed more likely to be used in a restrictive sense. These findings are consistent with Chao’s (1968) proposal and further specify that grounding most likely happens in the subject position of a sentence containing shi.

Figure 23.7 further breaks down the distributions in Fig. 23.6 as a function of RC types (SRCs vs. ORCs) and shows an overall trend where ORCs appeared in the marked pre-classifier position more often than SRCs did.

Fig. 23.7
A stacked bar graph indicates the percentages of R C C L and C L R C. The data from left to right are as follows. S S R C S V O, 21% and 79%. S O R C, 50% and 50%. O S R C, 21% and 79%. O O R C, 40% and 60%. S S R C N shi N, 67% and 33%. S O R C, 100% and 0%. O S R C, 11% and 89%. O O R C, 63% and 38%.

Position of RCs and CL phrases as a function of matrix positions and RC types

To understand this finding, we schematically sketched the linear sequencing of classifier phrases in relation to RCs in (19) below:

(23.19a) CL-SRC: CL [ _ V N1 de] N2

(23.19b) CL-ORC: CL [ N1 V _ de] N2

(23.19c) SRC-CL: [ _ V N1 de] CL N2

(23.19d) ORC-CL: [ N1 V _ de] CL N2

Our observation in Fig. 23.7 was that (19d) appeared more often relative to (19b) than (19c) did relative to (19a). Assuming that a decision was made between (19a) and (19c) as well as between (19b) and (19d), this suggests that language users may have attempted to avoid the potential classifier-noun clash in the CL-ORC condition in (19b) by moving the ORC to a pre-classifier position. Wu (2011) similarly found fewer than 5% of classifier phrases before ORCs in the corpus and interpreted this as a production strategy to avoid ambiguity between the classifier and the first noun in the relative clause. Interestingly, passive RCs, in which no semantic clash arises after the classifier, displayed the same pattern as SRCs in preferring the unmarked post-classifier position (88% vs. 12%), further supporting the notion of ambiguity avoidance in ORCs.
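The adjacency argument behind the classifier-noun clash can be made concrete with a small sketch. It assumes, for simplicity, that gaps are silent and that each schema in (19) contains only the overt elements shown there. The names `RC_BODY`, `surface_string`, and `classifier_meets_embedded_noun` are hypothetical helpers introduced for illustration.

```python
# Overt (pronounced) elements of each RC type in (19); the gap is silent.
RC_BODY = {
    "SRC": ["V", "N1", "de"],  # [ _ V N1 de ]
    "ORC": ["N1", "V", "de"],  # [ N1 V _ de ]
}


def surface_string(rc_type, rc_first):
    """Return the overt word order for one schema in (19)."""
    rc = RC_BODY[rc_type]
    if rc_first:
        return rc + ["CL", "N2"]  # RC-CL order, as in (19c) and (19d)
    return ["CL"] + rc + ["N2"]  # CL-RC order, as in (19a) and (19b)


def classifier_meets_embedded_noun(tokens):
    """True if the word right after CL is N1 rather than the head noun N2."""
    position = tokens.index("CL")
    return tokens[position + 1] == "N1"


for rc_type in ("SRC", "ORC"):
    for rc_first in (False, True):
        tokens = surface_string(rc_type, rc_first)
        label = f"{rc_type}-CL" if rc_first else f"CL-{rc_type}"
        clash = classifier_meets_embedded_noun(tokens)
        print(f"{label:7s} {' '.join(tokens):16s} potential clash: {clash}")
```

Of the four orders, only CL-ORC places the classifier immediately before the embedded noun N1, which is the configuration that a pre-classifier ORC avoids. The sketch models linear adjacency only and says nothing about whether the classifier and N1 are actually semantically incompatible in a given sentence.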

Being prenominal, Chinese RCs often present a challenge for comprehension because they can initially be taken as a matrix clause (Lin and Bever 2011). Sentence comprehension research has used pre-RC classifiers as a cue for marking constituent boundaries. In (20) below, because the classifier 塊 kuai and the following pronominal 他 ta “he” cannot form a local constituent, a phrasal boundary must be created between the two, signaling ta as the beginning of the embedded clause. This boundary has been employed as a cue that may indicate the beginning of an embedded clause for sentence comprehension (e.g., Lin 2018). Based on the production data from the corpus, however, ORCs rarely appeared after determiner-classifier phrases.

(23.20) 一塊他喜歡[gapi]的石頭i

 yi__kuai__ta__xihuan__[gapi]__de__shitoui

 one__CL__he__like__[gapi]__DE__rocki

a rock that he likes

5 Headless Relative Clauses

The head nouns of RCs can be left empty, as in (21) below, when they can be easily reconstructed from context or are of generic nature. Among the collected tokens, 303 RCs (10%) were headless. The majority (95%) of headless RCs were either SRCs or ORCs. Interestingly, in contrast to the overall distribution of RC types where we found more SRCs than ORCs (see Fig. 23.1), headless RCs were more often ORCs (58%) than SRCs (37%) (see Table 23.8). This pattern suggests that head nouns that are coreferential with the object of the embedded clause are more likely to be omitted given their lower salience in discourse. These omitted head nouns were more likely to be inanimate (65%) than human (30%).

(23.21) 所以真正賺錢的都是這些廠商

 suoyi__zhenzheng__zhuanqian__de__dou__shi__zhe__xie__changshang

 therefore__really__make.profit__DE__DOU__SHI__this__CL__merchant

Therefore, those who really make a profit are the merchants.

Table 23.8 Distribution of headless RCs as a function of RC types

Focusing on SRCs and ORCs, headless SRCs (20%) were more likely to appear as an independent topicalized NP than headless ORCs (10%). In matrix clauses, headless SRCs appeared more often in the subject position, which is consistent with the overall preference for SRCs to appear in the matrix subject position (see Fig. 23.8). Interestingly, headless ORCs did not show the same preference for the matrix object position. For sentences with shi, in particular, headless ORCs were more likely to appear in the subject position. The tendency for a headless RC to appear in the subject position of a sentence with shi suggests that headless RCs are mainly used for grounding referents that already exist in the background.

Fig. 23.8
A quintuple bar graph indicates the percentages of S V O subject, S V O object, N shi N subject, N shi N object, and N P. The data from left to right are as follows. S R C: 39%, 1%, 36%, 4%, and 20%. O R C: 7%, 8%, 61%, 13%, and 10%.

Position of headless RCs as a function of matrix positions
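To see the overall subject versus object tendency of headless RCs, the percentages in Fig. 23.8 can be collapsed across sentence types. The sketch below does this with plain addition. The key names and the `collapse` helper are illustrative assumptions, and the totals inherit the rounding of the published percentages.

```python
# Percentages transcribed from Fig. 23.8 (headless RCs by matrix position).
# "NP" is the category for RCs that stand as an independent NP.
headless = {
    "SRC": {"SVO-S": 39, "SVO-O": 1, "shi-S": 36, "shi-O": 4, "NP": 20},
    "ORC": {"SVO-S": 7, "SVO-O": 8, "shi-S": 61, "shi-O": 13, "NP": 10},
}


def collapse(row):
    """Sum subject and object percentages across SVO and shi sentences."""
    return {
        "subject": row["SVO-S"] + row["shi-S"],
        "object": row["SVO-O"] + row["shi-O"],
        "independent NP": row["NP"],
    }


collapsed = {rc_type: collapse(row) for rc_type, row in headless.items()}

for rc_type, totals in collapsed.items():
    print(rc_type, totals)
```

Collapsed in this way, both headless SRCs and headless ORCs appear mostly in subject position, and for ORCs that tendency is carried almost entirely by sentences with shi.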

6 Concluding Remarks

This chapter discussed the comprehension of Chinese relative clauses in relation to their distributional properties in the Sinica Treebank. The data were analyzed along structural dimensions, such as the length and complexity of relative clauses and their positions in matrix clauses; along semantic dimensions, namely the animacy of head nouns; and with respect to the position of classifier phrases in relation to relative clauses. These corpus data, which serve as a snapshot of collective sentence production, have contributed to our understanding of the relation between production and comprehension. Moreover, they have raised intriguing questions about sentence processing for further exploration.