Describing the Grammatical Knowledge of Chinese Words for Natural Language Processing

Bai, Xiaojing

doi:10.1007/978-3-031-38913-9_5


Part of the book series: Text, Speech and Language Technology (TLTB, volume 49)


Abstract

The grammar of a language tells people how smaller linguistic units combine to form larger ones. This chapter will introduce a grammatical knowledge base of Chinese words, which was developed for natural language processing, assisting the automatic analysis and generation of Chinese sentences. The overall design of the knowledge base, the classification of words, and the formal description of their grammatical functions will be outlined, and some semantic issues will be discussed. From a computational perspective, the knowledge base offers new insights into the long-running introspection and exploration of grammar issues in Chinese.


1 Introduction

The grammar of a language tells people how smaller linguistic units, which have sound and meaning, combine to form larger ones. These linguistic units mainly include morphemes, words, phrases, and sentences. The grammar of the Chinese language has been studied for more than 2000 years (Gong 龚千炎 2000). In general, the purpose of this study has been to discover the regularities in linguistic facts that help people express ideas in Chinese appropriately, clearly, and precisely. Computational approaches were adopted in the mid-twentieth century to process the Chinese language (Zong et al. 宗成庆等 2009) to aid human-human and human-machine communication, and new goals have been set for research on the grammar of Chinese (Yu 俞士汶 2000). Language models have been built for computers to analyze and generate natural languages, taking either rationalist or empiricist positions (Church 2011; Feng 冯志伟 2008; Zong 宗成庆 2008). Under the rationalist position in particular, the grammatical knowledge of Chinese has been described in machine-readable dictionaries (lexicons) and rule banks and marked up in corpus data.

For quite a long time in the history of natural language processing (NLP), rule systems have been widely used to capture grammatical knowledge. A rule system, with its high level of abstraction, can describe the syntagmatic relation between words from different word classes. The richness and complexity of language, however, make it impossible for a rule system to cover all the syntagmatic relations between individual words as their properties tend to vary significantly. The statistical approach, which has been used to investigate the co-occurrence between words in large corpora, is a promising alternative. There are constraints on this alternative approach, namely, the amount of computing power and the availability of large corpora with rich annotation, though dramatic progress has been made in these two aspects over the past 20 years (Hirschberg and Manning 2015).

This chapter will introduce an intermediate approach between rule systems and the statistical approach, which was adopted to describe the grammatical knowledge of Chinese words in the Grammatical Knowledge Base (GKB) of Contemporary Chinese developed by the Institute of Computational Linguistics (ICL) at Peking University. The knowledge base is independent of any specific NLP system and even of any particular computational theory or algorithm. This general-purpose knowledge base stores the basic facts about the grammatical functions of commonly used Chinese words. A specific application system typically accesses only a subset of the abundant grammatical knowledge stored in the GKB. It is also possible, and sometimes necessary, to add new entries and properties to adapt the GKB to a particular system. The knowledge base has supported a wide range of NLP tasks and applications since the 1980s. From a computational perspective, the GKB offers new insights into the long-running exploration of grammar issues in Chinese (Yu et al. 俞士汶等 2011).

This chapter will present a brief introduction to the formal description of the grammatical knowledge of Chinese words in the GKB, including the overall design of the knowledge base, the classification of words, and the description of their grammatical functions. Semantic considerations for the GKB will also be discussed, given the significance of semantics to grammar in a broad sense. The conclusion will highlight the features of the GKB and its implications for both theoretical linguistic studies and natural language processing.

2 Overall Design of the Knowledge Base

The GKB is a large-scale electronic dictionary that was developed for natural language processing. As a language knowledge base in NLP systems, the GKB facilitates the automatic analysis and generation of language. It features the classification of words and the description of their grammatical properties in a fine-grained way. There are approximately 80,000 entries of words with distinctive functions and meanings. The grammatical properties of each word are recorded as attribute-value pairs, describing how the word combines with other words. The knowledge base contains more than 3,600,000 attribute values in total.

2.1 Databases

The GKB is organized as 34 relational databases, which makes it easy to manage the data, eliminate redundancy, and convert the stored grammatical knowledge of Chinese words to other formal mechanisms, such as complex feature structures. As shown in Fig. 5.1, a general database keeps a record of all the selected words and their shared properties. There are 26 databases that store different classes of words and their properties, respectively. Within each database, one row represents a single entry, and its properties are described by the values in the corresponding columns that represent grammatical attributes. Sample entries in the GKB are shown in Table 5.1.

Fig. 5.1 (figure not reproduced) A circular classification chart in which the general database is divided into databases of nouns, quantifiers, verbs, adjectives, and pronouns. Verbs are further divided into databases of verbs taking nominal objects, verbs taking predicative objects, and separable verbs; pronouns are further divided into databases of three types of pronouns.

Table 5.1 Sample entries in the GKB


Table 5.1 lists sample entries taken from the database of verbs, where, for example, the value of the attribute 外内 “taking a true object” suggests the transitivity of a verb. Verbs and pronouns are further divided, and the typical properties of each subclass are described in a separate database. Therefore, for a verb taking nominal objects, such as 拜访 baifang “to visit,” there is actually an entry in each of the three databases (i.e., the general database, the database of verbs, and the database of verbs taking nominal objects). The three entries can be linked if required to provide more information about the verb.
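The linking of the three entries can be pictured as a join over relational tables. The following is only a minimal sketch using an in-memory SQLite database: the table names, column names, the tag `v`, and the stored values are all hypothetical placeholders chosen for illustration, not the actual GKB schema or data.

```python
import sqlite3

# Hypothetical schema and placeholder values; not the real GKB layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE general (word TEXT, pos TEXT, pinyin TEXT);
CREATE TABLE verbs (word TEXT, true_object TEXT);
CREATE TABLE verbs_nominal_object (word TEXT, patient TEXT);
INSERT INTO general VALUES ('拜访', 'v', 'baifang');
INSERT INTO verbs VALUES ('拜访', 'yes');
INSERT INTO verbs_nominal_object VALUES ('拜访', 'yes');
""")

def linked_entry(word):
    """Join the three entries of a verb that takes nominal objects."""
    return conn.execute(
        """SELECT g.word, g.pos, g.pinyin, v.true_object, o.patient
           FROM general AS g
           JOIN verbs AS v ON v.word = g.word
           JOIN verbs_nominal_object AS o ON o.word = g.word
           WHERE g.word = ?""",
        (word,),
    ).fetchone()
```

Calling `linked_entry("拜访")` returns one combined row, while a word missing from any of the three tables yields `None`. A real system would need a key that also distinguishes homographs and homonyms, which this sketch ignores.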

Theoretically, classification and property description are equivalent methods of distinguishing words. Suppose n attributes are designated for a set of words (where n ≥ 1 and each attribute value is 1 or 0); there will be at most 2ⁿ distinct subsets of the words. Conversely, to divide these words into N distinct subsets (where N ≥ 2), at least [log₂(N − 1) + 1] attributes are needed, where the square brackets denote taking the integer part, that is, rounding down (Yu et al. 俞士汶等 2011). For example, N = 4 requires [log₂ 3 + 1] = 2 attributes, and two binary attributes indeed distinguish at most 2² = 4 subsets.
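The two bounds can be checked numerically. The sketch below reads the square brackets as taking the integer part (rounding down), an assumption under which the formula agrees with the 2ⁿ count; the function names are illustrative.

```python
def max_subsets(n):
    """Upper bound on the subsets distinguishable by n binary attributes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 ** n

def min_attributes(N):
    """Integer part of log2(N - 1) + 1, for N >= 2.

    For a positive integer k, k.bit_length() equals floor(log2(k)) + 1,
    so the bound is computed exactly without floating-point logarithms.
    """
    if N < 2:
        raise ValueError("N must be at least 2")
    return (N - 1).bit_length()
```

For instance, `min_attributes(26)` gives 5, and five binary attributes allow up to `max_subsets(5)` = 32 subsets, which is enough for 26.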

Word classification and property description have their own strengths and weaknesses. The former helps to sort a multitude of words quickly, but it is difficult to define an infallible classification scheme, which is particularly true in the case of Chinese. The latter is time-consuming but captures linguistic expertise at a finer granularity. In the GKB, both methods were adopted to complement each other.

2.2 Selection of Words

The GKB includes words listed in important reference books of Chinese grammar compiled for humans, but more typically, high-frequency words were carefully selected from corpus data. All GKB entries satisfy the definitions of words by the National Standard of Contemporary Chinese Word Segmentation for Information Processing (GB 13715). Closed classes, such as locative particles, pronouns, prepositions, conjunctions, auxiliaries, modal particles, and interjections, are all included. With open classes, such as nouns, verbs, adjectives, numerals, time words, location words, stative words, classifying words, onomatopoeic words, and fixed expressions like Chinese idioms, idiomatic expressions, and abbreviations, representativeness and frequency of the candidates were considered. However, words typically used during a limited period of time were not selected despite their possible high frequency, such as 士大夫 shidafu “scholar-official” and 臭老九 choulaojiu “a derogatory label for intellectuals.” Likewise, words from classical Chinese, dialects, and special technical fields were excluded.

Detailed consideration has been given to reduplicated forms to ensure as much coverage as possible while avoiding redundancy. The questions examined include whether the basic form (e.g., 亮晶 liangjing) and its reduplicated form (e.g., 亮晶晶 liangjingjing “glittering”) are both words, whether the basic form (e.g., 往 wang “toward”) and its reduplicated form (e.g., 往往 wangwang “often”) mean the same thing, and whether the basic form (e.g., 大方 dafang “generous”) and its reduplicated form (e.g., 大大方方 dadafangfang “generous”) belong to the same word class.

To increase its coverage of Chinese words, the GKB includes as many word components as possible, which are limited in number but highly productive in word formation. Prefixes and suffixes are included as word components, such as 老 lao “used before surnames to indicate seniority,” 超 chao “super-,” 准 zhun “quasi-,” 们 men “used after a personal pronoun or a noun to show plural number,” 者 zhe “used after a noun phrase to indicate a person doing the stated work or following the stated doctrine,” 化 hua “-ify,” etc. There is also a table in the GKB for morphemes that are not considered words when standing alone, such as 脐 qi in 肚脐 duqi “navel,” 贝 bei in 贝壳 beike “shell,” 冬 dong in 冬天 dongtian “winter,” etc.

3 Classification of Words

In Chinese language processing, grammatical units mainly include characters, words, phrases, and sentences. No consensus has been reached so far on the classification of Chinese words. From the perspective of language engineering, a viable classification scheme has been applied to the GKB, which is mainly based on Zhu’s grammatical theory (Zhu 朱德熙 1982, 1983), and word classes have thereby been defined. Accordingly, all GKB entries have been classified, and a corpus of more than 10 million Chinese characters has been segmented and POS-tagged to assess the viability of the classification system.

3.1 Basic Word Classes

There are 18 basic classes of Chinese words in the GKB, as listed in Table 5.2. Among them, nouns, time words, location words, locative particles, numerals, and quantifiers are called nominals; verbs, adjectives, and stative words are predicates; and pronouns are divided between nominals and predicates. Moreover, nominals, predicates, classifying words, and adverbs are referred to as content words, while prepositions, conjunctions, auxiliaries, and sentence-final particles are called function words. Distinct from these, there are also onomatopoeic words and interjections. In addition, there are eight types of non-lexical items: prefixes, suffixes, morphemes, non-morpheme characters, Chinese idioms, idiomatic expressions, abbreviations, and punctuation marks (Yu et al. 俞士汶等 2003).

Table 5.2 The classification system of Chinese words and their part-of-speech (POS) tags in the GKB


3.2 Purpose of Word Classification

Word classes help to describe how words combine to form larger syntactic structures (i.e., phrases and sentences). There are two kinds of relations involved: the syntagmatic relation, where words combine to form a certain syntactic structure, and the paradigmatic relation, where words can replace each other in a certain syntactic position. As illustrated in the following examples, words from the same class, such as 爸爸 baba “father” and 学校 xuexiao “school,” bear a paradigmatic relation, meaning that they can take the same syntactic position w1, acting as the subject, in a syntactic structure, as shown in (5.1) and (5.2) below:

w1__w2__w3__w4__w5__w6__w7__w8
(5.1) 爸爸__昨天__买__了__两__本__新__书
baba__zuotian__mai__le__liang__ben__xin__shu
father__yesterday__bought__u__two__q__new__books
My father bought two new books yesterday
(5.2) 学校__去年__增添__了__三__台__先进__设备
xuexiao__qunian__zengtian__le__san__tai__xianjin__shebei
school__last-year__got__u__three__pieces__advanced__equipment
The school got three more pieces of advanced equipment last year

Not considered an urgent task in human-oriented linguistic studies, the classification of words and the tagging of word classes are indispensable in computational linguistics and natural language processing.

In rule-based parsing with a context-free grammar (CFG), for instance, each leaf node, or terminal, of a parse tree produced by the CFG rules represents a word class in the context-free language. A grammatical sentence is simply a legitimate tree that is derivable from the production rules, and syntactic parsing starts with rewriting words by their corresponding word classes. The classification of words is in fact the description of their fundamental grammatical properties, which are the most important clues for natural language analysis and generation.
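The rewriting step can be sketched as a lexicon lookup applied to examples (5.1) and (5.2). The tags used here (n, t, v, u, m, q, a) are assumed for illustration, since the tag inventory of Table 5.2 is not reproduced in this text, and the toy lexicon is not GKB data.

```python
# Toy lexicon: word -> word class tag (tags are assumptions, see lead-in).
LEXICON = {
    "爸爸": "n", "学校": "n", "书": "n", "设备": "n",
    "昨天": "t", "去年": "t",
    "买": "v", "增添": "v",
    "了": "u",
    "两": "m", "三": "m",
    "本": "q", "台": "q",
    "新": "a", "先进": "a",
}

def to_classes(words):
    """Rewrite each word by its word class, the first step of parsing."""
    return [LEXICON[w] for w in words]

S1 = "爸爸 昨天 买 了 两 本 新 书".split()
S2 = "学校 去年 增添 了 三 台 先进 设备".split()
```

Both sentences are rewritten to the same class sequence, so a single set of production rules over word classes would cover them, whereas rules stated over individual words would not.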

In statistical parsing with N-grams, for instance, data sparseness can be endemic when the probabilities of word sequences are computed. Alternatively, when the probabilities of word class sequences are computed, the classification of words is required as well. With real-world applications like document retrieval and information extraction, where deep syntactic analysis may not be required, word segmentation and POS-tagging will also contribute to higher accuracy, and word class definitions in this sense are also essential.
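The effect of word classes on data sparseness can be illustrated with bigram counts over the two sentences in (5.1) and (5.2). This is a toy sketch rather than a language model; the tags and the lexicon are the same illustrative assumptions as above.

```python
from collections import Counter

# Toy lexicon with assumed tags; not GKB data.
LEXICON = {
    "爸爸": "n", "学校": "n", "书": "n", "设备": "n",
    "昨天": "t", "去年": "t", "买": "v", "增添": "v", "了": "u",
    "两": "m", "三": "m", "本": "q", "台": "q", "新": "a", "先进": "a",
}
SENTENCES = [
    "爸爸 昨天 买 了 两 本 新 书".split(),
    "学校 去年 增添 了 三 台 先进 设备".split(),
]

def bigram_counts(sequences):
    """Count adjacent pairs in each sequence."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

word_bigrams = bigram_counts(SENTENCES)
class_bigrams = bigram_counts([[LEXICON[w] for w in s] for s in SENTENCES])
```

Here every word bigram is observed only once, while each class bigram is observed twice, so the class-level counts are denser and generalize across the two sentences.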

3.3 Word Class Definitions by Grammatical Functions

Theoretically speaking, a word can be classified according to its grammatical functions, which are generally taken as the role and the distribution of the word in the syntactic structure: (1) what role it plays as a syntactic constituent and (2) which words or word classes it collocates with. In the GKB, the specification of grammatical functions constitutes a solid ground for the proper classification of words. Here are some of the functions specified for adjectives:

(a) Acting as the predicate in a subject-predicate construction but taking no true object. In (5.3), 安静 anjing “quiet” acts as the predicate and takes no object. The word sometimes takes an object, as in (5.4), but the numeral-quantifier phrase 两天 liang tian “two days” is not a true object.
(5.3) 教室_安静
jiaoshi_anjing
classroom_quiet
The classroom is quiet
(5.4) 他_安静_了_两_天
ta_anjing_le_liang_tian
he_quiet_u_two_days
He remained quiet for two days
(b) Taking a modifying degree adverb like 很 hen “very,” 挺 ting “pretty,” or 特别 tebie “particularly” as in 很长 hen chang “very long,” 挺安静 ting anjing “pretty quiet,” and 特别雄伟 tebie xiongwei “particularly magnificent.”
(c) Acting as the complement in a predicate-complement construction, such as 干净 ganjing “clean” in 洗干净 xi ganjing “to wash clean” and 结实 jieshi “tight” in 捆得结实 kun de jieshi “to be fastened tight.”

Word classes can thus be distinguished broadly. For instance, nouns cannot perform functions (b) and (c), and they do not perform function (a) in most cases. Similarly, some functions of nouns cannot be performed by adjectives. However, as word classes in Chinese are often multifunctional, functions can be shared by different classes and the distinction between classes is then obscured. In the GKB, therefore, the probability distribution of grammatical functions has been carefully considered and examined.

On the one hand, although a certain word class may be able to play different syntactic roles, the probability of its playing each role differs, as can be observed in a large corpus of real texts. Nouns in Chinese, for example, may function as the subject, the object, the attributive, or even the predicate in a nominal-predicate sentence. In real texts, however, a noun is mostly used as the subject, the object, or the head in a nominal phrase, but seldom as the predicate. Similarly, verbs and adjectives can be used as the subject, the object, or the predicate, but in real texts, they are mostly used as the predicate.

On the other hand, probability also varies when a specific syntactic role is played by different word classes respectively. In a subject-predicate construction, the position of the subject is mainly taken by a noun and that of the predicate by a verb; in a predicate-object construction, the predicate is often a verb and the object a noun; and in a predicate-complement construction, the predicate is often a verb and the complement an adjective or a stative word.

When a word class is defined in the GKB, its grammatical functions with predominant distributions are identified first, and the selectional preference for the word class to play certain syntactic roles is often specified as well. In the case of adjectives, the predominant functions do not include “acting as the modifier of a noun,” because (1) an adjective alone does not modify nouns very often, as it usually needs to be combined with 的 de to play such a role, and (2) it is quite common that the modifier of a noun is a noun or even a verb. With such a simplified model of complicated grammatical phenomena, words in the GKB are classified, and the grammatical functions of each word can then be inferred from its class.

Words in the same class may share some similarity in meaning, but it does not follow that words expressing the same meaning can perform the same grammatical function. For instance, 战争 zhanzheng “war” and 打仗 dazhang “to go to war” are related semantically but their grammatical functions diverge significantly. Likewise, 红 hong “red” and 红色 hongse “red” denote the same color; whereas the former can serve as the predicate in a sentence and the latter as the subject or object, the grammatical function of these words cannot be reversed. The meaning of a word, though not the main criterion for its classification, is important in the GKB, more details of which will be discussed in Sect. 5.5.

3.4 Multi-class Words, Homographs, and Homonyms

According to word class definitions, a word can be put into a specific class. However, part-of-speech ambiguity may arise when one word has different grammatical functions typical of more than one word class and is thus taken as belonging to two or more classes. The words 共同 gongtong and 定期 dingqi are adverbs in (5.5a) and (5.6a), respectively, but are classifying words in (5.5b) and (5.6b), respectively. These words are treated as multi-class words in the GKB, and they have their own entries in the databases of adverbs and classifying words, respectively.

(5.5a) 共同_完成_一_些_任务
gongtong_wancheng_yi_xie_renwu
together_accomplish_one_q_task
to accomplish some tasks together
(5.5b) 我们_的_共同_愿望
women_de_gongtong_yuanwang
we_u_common_aspirations
our common aspirations
(5.6a) 定期_检查_机器
dingqi_jiancha_jiqi
regularly_check_machine
to check the machine regularly
(5.6b) 一_笔_定期_存款
yi_bi_dingqi_cunkuan
one_q_fixed_deposit
a fixed deposit

Homographs are words that share the same form but have different pronunciations. They may belong to the same word class, such as 和 huo “to mix” in 和稀泥 huo xini “blur the line between right and wrong” and 和 he “to tie” in 和一盘棋 he yi pan qi “to tie in a game of chess,” both of which are verbs. However, there is another homograph 和 he “and,” which is a conjunction.

When two or more words share both the same form and the same pronunciation, they are called homonyms. They may belong to the same word class, such as 抄 chao “to copy” in 抄稿子 chao gaozi “to copy a draft” and 抄 chao “to take” in 抄近道 chao jindao “to take a shortcut.” But in many cases, they belong to different word classes, which gives rise to part-of-speech ambiguity in natural language processing. For instance, 花 hua is a verb in 花时间 hua shijian “to spend time,” but it is a noun in 石榴花 shiliu hua “pomegranate flower.”

Both homographs and homonyms are distinguishable in meaning, as can be seen in the examples above. In contrast, the semantic distinction between the adverb 共同 gongtong and the classifying word 共同 gongtong, as well as that between the adverb 定期 dingqi and the classifying word 定期 dingqi, is hardly discernible. Despite these differences, the GKB treats homographs and homonyms as multi-class words in a broad sense. Each homograph or homonym is stored as a separate entry and is classified mainly according to its grammatical functions. In the GKB, the database of each word class has a column for the attribute 兼类 “multi-class word,” the value of which indicates the other word classes that an entry may belong to, be it a multi-class word in a strict sense or in a broad sense.

4 Description of Grammatical Properties

Word class definitions constitute the criteria by which words can be properly classified. As has been discussed previously, the complexity and ambiguity of linguistic phenomena make it extremely hard to carry through some of the “strict” criteria prescribed by linguists, such as “adverbs are function words that only serve as the adverbial modifier” (Zhu 朱德熙 1982), which would definitely exclude 很 hen “very” and 极 ji “extremely” because the two words, though commonly recognized as adverbs, can also be used as a complement, as in 舒服得很 shufu de hen “very comfortable” and 痛快极了 tongkuai ji le “extremely happy.”

Consequently, the predominant functions of adverbs have been considered, resulting in a more practical definition that “adverbs are function words mainly used as the adverbial modifier,” which allows hen and ji to be included. The less strict criteria, however, give rise to a new problem: words that are grouped in a particular class can display inconsistent grammatical properties. As a partial but practical solution to this dilemma, a more delicate approach has been adopted for the GKB to describe the grammatical attributes of each word, with its word class being only one of them, and to provide more detailed information about Chinese words for NLP tasks and application systems.

4.1 Selection of Grammatical Attributes

The grammatical attributes described in the GKB were selected with respect to the special requirements of NLP tasks, as this knowledge base was originally designed to assist computers in analyzing and generating Chinese sentences. More specifically, grammatical attributes help resolve ambiguities that either are intrinsic to natural languages or arise when natural languages are analyzed by computers, and they also help computers generate fluent Chinese sentences. When analyzing Chinese sentences, computers may rely on different grammatical theories and algorithms, but they generally follow four basic steps: (1) segment the sequence of Chinese characters into words; (2) add a POS tag to each word; (3) combine words to form phrases and then sentences; and (4) identify the syntactic or semantic role of each word or phrase in a phrase or sentence. The GKB seeks to provide as much information as possible for each of these steps.
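Steps (1) and (2) can be sketched with a lexicon lookup. The segmentation below uses forward maximum matching, a simple dictionary-based technique chosen here only for illustration; the source does not state which algorithm any particular system uses, and the lexicon, tags, and `max_len` parameter are assumptions.

```python
# Toy lexicon with assumed tags; not GKB data.
LEXICON = {
    "爸爸": "n", "昨天": "t", "买": "v", "了": "u",
    "两": "m", "本": "q", "新": "a", "书": "n",
}

def segment(text, lexicon, max_len=4):
    """Step 1: forward maximum matching over a character sequence."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

def tag(words, lexicon):
    """Step 2: attach a POS tag to each word; '?' marks unknown words."""
    return [(w, lexicon.get(w, "?")) for w in words]
```

Applied to the character string of example (5.1), `segment` recovers the eight words and `tag` pairs each with its class. Steps (3) and (4) would then draw on the syntactic and semantic attributes described in the following subsections.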

4.1.1 Morphological Attributes

The Chinese language, though much less inflectional than English or Russian, has some classes of words that can form new words through reduplication and affixation. In the database of nouns, for instance, there is a column for the attribute 重叠 “reduplicated form.” For single-character nouns like 人 ren “person” and 家 jia “family,” the value of this attribute is NN, indicating that their reduplicated forms are 人人 renren “everyone” and 家家 jiajia “every family.” For double-character nouns like 方面 fangmian “aspect” and 风雨 fengyu “hardships,” the value is AABB, indicating that their reduplicated forms are 方方面面 fangfangmianmian “every aspect” and 风风雨雨 fengfengyuyu “all kinds of hardships,” respectively.
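A value of the attribute 重叠 can be read as a template for generating the reduplicated form. The sketch below handles only the two values mentioned above, NN and AABB; the function name and the error handling are illustrative assumptions.

```python
def reduplicate(word, pattern):
    """Build the reduplicated form of a noun from its pattern value."""
    if pattern == "NN" and len(word) == 1:
        # Single-character noun: repeat the whole word.
        return word * 2
    if pattern == "AABB" and len(word) == 2:
        # Double-character noun: repeat each character in turn.
        first, second = word
        return first * 2 + second * 2
    raise ValueError(f"unsupported pattern {pattern!r} for {word!r}")
```

For example, `reduplicate("人", "NN")` yields 人人 and `reduplicate("方面", "AABB")` yields 方方面面. Run in reverse, the same templates could help an analyzer recognize a reduplicated form that has no entry of its own.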

Both prefixes and suffixes are used to form words. In the database of prefixes, there is a column for the attribute 后接词性 “POS of the word that the prefix is added to” and another column for the attribute 结构词性 “POS of the word to be formed.” Similarly, the attributes 前接词性 “POS of the word that the suffix is added to” and 结构词性 “POS of the word to be formed” are specified in the database of suffixes. In addition, the general database has a column for the attribute 单合 “simple/compound” to distinguish between simple and compound words. These attributes are selected to assist in the detection of unknown words.

4.1.2 Syntactic Attributes

Syntactic attributes, which constitute the bulk of the grammatical attributes in the GKB, describe whether and how a word can be combined with other words or word classes to form syntactic structures and what syntactic role the word can play therein. In the database of adjectives, for instance, there is a column for the attribute 很 “very.” The value of this attribute is 否 “no” if an adjective cannot be modified by the degree adverb 很 hen “very”; otherwise, the corresponding field is left blank. In the database of verbs, the same attribute is described, indicating whether a verb can be modified by hen. Interestingly, the values of this attribute help to further distinguish between verbs describing mental activities, which are assumed to be able to take hen as the modifier. It is appropriate to say 很爱 hen ai “to love … very much,” 很喜欢 hen xihuan “to like … very much,” and 很想念 hen xiangnian “to miss … very much,” but the modifier does not go well with 盘算 pansuan “to figure,” which is also a verb for mental activities.

In many cases, the syntactic attributes may suggest the syntactic role that a word plays in certain syntactic structures. For example, if an adjective takes the degree modifier hen, this implies that an adverbial-head structure can be formed, with the adjective as the head. In the database of nouns, the attribute 前名 “preceded by a noun” helps to describe whether an attributive-head structure can be formed by a noun as the head and its preceding noun as the modifier.

In each database, there are also attributes that explicitly describe whether a word can play certain syntactic roles. For instance, the attribute 宾语 “object” in the database of nouns specifies not only whether a noun alone can be an object but also whether the noun needs to take an attributive modifier to play such a role. The value of this attribute for the noun 方面 fangmian “aspect” is 定 “attributive modifier,” as the word itself cannot be the object of a predicate-object construction. Instead, it takes the attributive modifier 各个 gege “each” as in 兼顾各个方面 jiangu gege fangmian “to give consideration to each aspect.”

4.1.3 Semantic Attributes

In a broad sense, grammatical studies involve syntax, semantics, and pragmatics. The attributes described in the GKB are mainly morphological and syntactic, but some semantic attributes are included as well. Each entry in the GKB has a field for its sense and another field for its sample usages as a reference for human users. Other attributes are described to facilitate computer processing.

The database of time words includes the semantic attribute 时态 “tense,” the values of which can be 过 “past” when a word refers to past time, such as 从前 congqian “before” and 昨天 zuotian “yesterday,” and 未 “future” when a word refers to future time, such as 将来 jianglai “future” and 明天 mingtian “tomorrow.”

More semantic attributes can be found in the databases of verbs. With verbs taking nominal objects, for instance, there are different columns for 受事 “patient,” 结果 “result,” 与事 “beneficiary,” 工具 “instrument,” 方式 “manner,” 处所 “location,” 时间 “time,” 目的 “purpose,” 原因 “reason,” 致使 “cause,” 施事 “agent,” etc., specifying the possible semantic roles that the nominal object of a verb can play.

4.1.4 Collocation

Two words may co-occur very often in a sentence but do not combine to form a syntactic construction. For example, the preposition 在 zai “at” collocates significantly with the locative particles 上 shang “on top of, above,” 下 xia “under, below,” 中 zhong “in, in the center of,” and 里 li “in, inside” to form different patterns, which allows the insertion of other words to form phrases like 在理论上 zai lilun shang “in theory,” 在他的帮助下 zai ta de bangzhu xia “with his help,” 在群众中 zai qunzhong zhong “among the masses,” and 在女儿的房间里 zai nüer de fangjian li “in the daughter’s room.” Knowledge about collocations like these will help the analysis and generation of Chinese sentences. Therefore, the database of prepositions includes the attributes 后照应词 “collocate” and 后照应类 “POS of collocate.” Similar attributes can be found in the databases of locative particles, adverbs, conjunctions, and auxiliaries.
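How such collocation knowledge might be used in analysis can be sketched as frame matching over a segmented sentence. The table below is a hypothetical stand-in for the attribute 后照应词, the segmentations are assumed, and pairing a preposition with the nearest following collocate is a deliberate simplification.

```python
# Hypothetical collocation table modelled on the attribute 后照应词.
COLLOCATES = {"在": {"上", "下", "中", "里"}}

def find_frames(words):
    """Return (preposition, inserted words, collocate) for each frame."""
    frames = []
    for i, word in enumerate(words):
        partners = COLLOCATES.get(word)
        if not partners:
            continue
        # Pair the preposition with the nearest following collocate.
        for j in range(i + 1, len(words)):
            if words[j] in partners:
                frames.append((word, words[i + 1:j], words[j]))
                break
    return frames
```

For the segmented phrase 在 / 他 / 的 / 帮助 / 下, the function returns the frame 在 … 下 together with the inserted words, which a parser could then treat as one prepositional phrase.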

4.2 Data Redundancy

As grammatical attributes are specified for each database in the GKB, the problem of data redundancy has been carefully considered. For instance, nouns like 中国 zhongguo “China,” 学校 xuexiao “school,” 图书馆 tushuguan “library,” and 财政部 caizhengbu “Ministry of Finance” can also be used as location words, which means that they can be the object of 在 zai “at,” 到 dao “to,” and 往 wang “to.” To minimize redundancy, the attribute 处所 “location” has been added to the database of nouns, the value of which suggests that a noun can also be a location word. Otherwise, there would be two separate entries in two databases for each such word, and such words are numerous in Chinese.

There is, however, a trade-off between data redundancy and computational cost. For example, the general database has a column to record the number of homographs for an entry, which seems redundant as the number can be computed automatically whenever queried. However, the number is stored in this column to reduce query execution time. In the task of ambiguity resolution, the number in this column tells the computer immediately whether or not to end the search for all the possible homographs of a word. Similar considerations have also been given to other attributes in different databases.
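The benefit of storing the count can be sketched as an early exit from a scan. In this toy example the stored number is assumed to be the total number of entries sharing a written form; the entries, counts, and the tag `c` for conjunctions are illustrative values, not GKB data.

```python
# Toy entries; "count" stands in for the stored number of homographs.
ENTRIES = [
    {"word": "和", "pinyin": "huo", "pos": "v", "count": 3},
    {"word": "花", "pinyin": "hua", "pos": "v", "count": 2},
    {"word": "和", "pinyin": "he", "pos": "v", "count": 3},
    {"word": "和", "pinyin": "he", "pos": "c", "count": 3},
    {"word": "花", "pinyin": "hua", "pos": "n", "count": 2},
    {"word": "抄", "pinyin": "chao", "pos": "v", "count": 2},
    {"word": "抄", "pinyin": "chao", "pos": "v", "count": 2},
]

def find_entries(form, entries):
    """Collect entries for a form, stopping once all have been found."""
    found, examined = [], 0
    for entry in entries:
        examined += 1
        if entry["word"] != form:
            continue
        found.append(entry)
        if len(found) == entry["count"]:
            break  # the stored count says no further entries exist
    return found, examined
```

With these toy data, the search for 和 stops after four of the seven entries, whereas a search without the stored count would have to scan all of them to be sure none was missed.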

4.3 Value Types

A relational database organizes data into a table of rows and columns. In the GKB, one row in a database constitutes a word entry. For each entry, there are different columns for the values of different attributes, respectively. The values can be one of two data types—numeric data and character string data. Numeric values are found only in the general database for attributes like 字数 “number of characters,” 同字词 “number of homographs,” 音节数 “number of syllables,” 同音调 “number of homophones,” 使用频度 “frequency,” etc. Attribute values in the GKB are mostly character strings, of which there are four kinds.

Some attributes can have one of two possible values, which is the most common case. For example, in the database of verbs, there is a column for the attribute 很 “very,” the value of which can be 很 “very” or null depending on whether a verb can take degree adverbs like 很 hen “very,” 极 ji “extremely,” 极其 jiqi “extremely,” 非常 feichang “very,” and 太 tai “so” as its modifier. There are columns named 系词 “copula,” 助动词 “auxiliary verb,” 趋向动词 “directional verb,” 形式动词 “dummy verb,” and so on for verbs, the values of which can be 是 shi “yes” or 否 fou “no” depending on whether a verb belongs to those subclasses. Values of this kind are similar to but more evident than the logical type and are thus adopted to ease the input and validation of grammatical knowledge by human annotators.

Some attributes can have one of many possible values. For example, in the database of pronouns, there is a column for the attribute 子类 “subclass,” the value of which can be 人 “personal pronoun,” 指 “demonstrative pronoun,” or 疑 “interrogative pronoun.” In the database of personal pronouns, the value of the attribute 人称 “person” can be 一 “first person,” 二 “second person,” 三 “third person,” or null. Values of this kind vary considerably in length.

Theoretically, nonatomic values should be eliminated in relational databases. In the GKB, as in the common practice of database management, not all values are atomic. In the database of nouns, for example, there are columns for attributes like 度量 “measure quantifier,” 容器量词 “container quantifier,” 形量词 “shape quantifier,” 不定量词 “indefinite quantifier,” etc. The values of these attributes for the noun 白糖 baitang “sugar” can be either atomic or nonatomic, listing its possible collocates respectively: 克 ke “gram,” 千克 qianke “kilogram,” 公斤 gongjin “kilogram,” and 吨 dun “ton” as measure quantifiers; 瓶 ping “bottle,” 袋 dai “sack,” and 包 bao “bag” as container quantifiers; 撮 cuo “pinch” as a shape quantifier; and 些 xie “some” and 点 dian “a little” as indefinite quantifiers.
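The following sketch shows one way a nonatomic value could be unpacked into atomic rows when a strictly normalized form is needed. The comma separator and the helper names `split_values` and `normalize` are assumptions; how the GKB actually delimits multiple collocates in a cell is not specified here.

```python
# The 白糖 entry from the text, with collocates joined by an assumed "," separator.
entry = {
    "词语": "白糖",
    "度量": "克,千克,公斤,吨",
    "容器量词": "瓶,袋,包",
    "形量词": "撮",
    "不定量词": "些,点",
}

def split_values(cell, sep=","):
    """Split a possibly nonatomic cell into a list of atomic values."""
    if not cell:
        return []
    return [value for value in cell.split(sep) if value]

def normalize(entry, key="词语"):
    """Expand one entry into (word, attribute, value) rows, one value per row."""
    return [
        (entry[key], attribute, value)
        for attribute, cell in entry.items()
        if attribute != key
        for value in split_values(cell)
    ]

for row in normalize(entry):
    print(row)
```

Keeping the collocates in one cell keeps each word on a single row for annotators, while the expanded rows suit joins and lookups.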

Two attributes are added to each database in the GKB, the values of which are character strings specifying the sense of each entry and its sample usages. Originally set to assist human annotators, these values can also be used in natural language processing, such as word sense disambiguation.

5 Semantic Considerations in the GKB

Semantic concerns are indispensable to a repertoire of grammatical knowledge. Although its focus is on morphology and syntax, the GKB also gives careful consideration to the semantic information of Chinese words.

5.1 Word Entries Distinguished by Their Meanings

As mentioned in Sect. 5.3.4, multi-class words, homographs, and homonyms are treated the same way in the GKB, which helps to solve problems caused by shared word forms in machine translation, spelling correction, speech recognition, speech synthesis, and many other NLP tasks. In the case of homographs and homonyms, particularly, word meanings are the main consideration for setting entries. In other words, a word form is represented as different entries if it can be used to refer to different things, ideas, activities, etc., such as 和 huo/he, 抄 chao, and 花 hua.

5.2 Semantic Properties Described for Word Entries

As mentioned in Sect. 5.4.1, some properties described in the GKB are straightforwardly semantic, such as the attribute 格标 “case marker” in the database of prepositions and the database of verbs taking nominal objects. For the preposition 被 bei, a case marker for the semantic role of agent, the attribute value is 施 “agent”; for 把 ba, a case marker for the semantic role of patient, the attribute value is 受 “patient”; and for 用 yong, a case marker for the semantic role of tool, the attribute value is 工 “tool.”
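As a minimal sketch, the three case markers above can be read as a lookup from a preposition to the semantic role of the noun phrase it introduces. The dictionary and function names are hypothetical; only the three prepositions and their attribute values come from the description above.

```python
# Values of the attribute 格标 "case marker" for the three prepositions in the text.
CASE_MARKER = {"被": "施", "把": "受", "用": "工"}

# English glosses of the attribute values.
ROLE_GLOSS = {"施": "agent", "受": "patient", "工": "tool"}

def role_of(preposition):
    """Return the semantic role marked by a preposition, or None if it is not listed."""
    value = CASE_MARKER.get(preposition)
    return ROLE_GLOSS.get(value)

print(role_of("被"), role_of("把"), role_of("用"))
```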

5.3 Grammatical Properties Distinguished Based on Semantic Clues

Some grammatical properties are distinguished with regard to not only the syntactic behaviors but also the semantic clues of a word or the words in its context. For instance, there is a column 兼语 “pivotal” for all verb entries in the GKB, the value of which suggests that a verb can take the position of v1 in the pivotal construction “v1 + n + v2.” To decide this, however, semantic knowledge is required. For the word sequence to be a pivotal construction, n should be a noun denoting the agent of the action specified by v2, where the role “agent” is a semantic category. The semantic clue thus helps to confirm that the verb 选 xuan “to elect” in (5.7) below can take the position of v1, as 他 ta “he/him” is the agent of 当 dang “to act as.” In contrast, the verb 帮 bang “to help” in (5.8) below cannot take the position of v1, as 他 ta “he/him” is not the agent of 洗 xi “to wash,” and the sequence therefore forms a serial verb construction:

(5.7) 选_他_当_班长
xuan_ta_dang_banzhang
elect_him_as_monitor
to elect him as the monitor
(5.8) 帮_他_洗_衣服
bang_ta_xi_yifu
help_him_wash_clothes
to help him wash the clothes
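How a parser might consult the 兼语 “pivotal” column for these two examples can be sketched as follows. The Boolean encoding of the column and the function name `analyze` are assumptions, and the two-way decision is a deliberate simplification: the values for 选 and 帮 follow the discussion above, and a real system would examine far more than v1.

```python
# Assumed Boolean encoding of the 兼语 "pivotal" column for the two verbs in the text.
PIVOTAL = {"选": True, "帮": False}

def analyze(sequence):
    """Label a 'v1 + n + v2 ...' sequence using only the pivotal attribute of v1."""
    tokens = sequence.split("_")
    v1 = tokens[0]
    if PIVOTAL.get(v1, False):
        return "pivotal construction"
    return "serial verb construction"

print(analyze("选_他_当_班长"))
print(analyze("帮_他_洗_衣服"))
```

The semantic judgment about the agent of v2 is made once, by the annotator who fills in the column, so that the parser only needs a lookup.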

6 Conclusion

It is evident that language resources play an increasingly important role in the progress of computational linguistics and natural language processing, but the development of language resources is also evidently tedious, difficult, and time-consuming. The GKB started as an electronic dictionary in the 1980s, and it has taken three decades for the dictionary to develop into the knowledge base it is today, during which linguists and computer scientists have gone hand in hand to unravel the grammatical properties of Chinese words and describe them as computational attributes that can be used by NLP systems. The knowledge base relies heavily on the expert knowledge of linguists to set the guidelines and to carry them out in the selection of words, the definitions of word classes and grammatical attributes, the classification of words, and the descriptions of attribute values.

The knowledge base differs tremendously from traditional printed dictionaries in form, content, and size, a difference largely motivated by the computational perspective of computer scientists, or more precisely, computational linguists. Its formal descriptions allow the GKB to assist in natural language processing as a repository of grammatical knowledge, the data structure of which enables easy conversion between different knowledge representations. The classification system and all word entries have been validated and optimized with corpus data of different sizes in automatic processing tasks such as word segmentation, POS-tagging, phrase boundary detection, and phrase type tagging. As a working component of an NLP system, the GKB is a huge repository with high accuracy, appropriate granularity, and optimal resource cost. All of these features may add a computational perspective to the introspective studies of Chinese, from which the complicated grammatical knowledge of Chinese words is observed, distinguished, and described.

References

Church, Kenneth. 2011. A pendulum swung too far. Linguistic Issues in Language Technology 6(5):1–27.
Feng, Zhiwei 冯志伟. 2008. Preface II 序二. In Statistical natural language processing 统计自然语言处理, Chengqing Zong 宗成庆, 5–18. Beijing: Tsinghua University Press.
Gong, Qianyan 龚千炎. 2000. A review of studies on the grammar of Chinese 汉语语法研究的回顾. In An introduction to studies on grammar 汉语研究入门, ed. Qingzhu Ma 马庆株, 69–87. Beijing: The Commercial Press.
Hirschberg, Julia, and Christopher D. Manning. 2015. Advances in natural language processing. Science 349(6245):261–266.
Yu, Shiwen 俞士汶. 2000. Natural language understanding and studies on grammar 自然语言理解与语法研究的回顾. In An introduction to studies on grammar 汉语研究入门, ed. Qingzhu Ma 马庆株, 240–251. Beijing: The Commercial Press.
Yu, Shiwen, Xuefeng Zhu, Hui Wang, Huarui Zhang, Yunyun Zhang, Dexi Zhu, Jianming Lu, and Rui Guo 俞士汶, 朱学锋, 王慧, 张化瑞, 张芸芸, 朱德熙, 陆俭明, 郭锐. 2003. The grammatical knowledge-base of contemporary Chinese—A complete specification (2nd ed.) 现代汉语语法信息词典详解(第二版). Beijing: Tsinghua University Press.
Yu, Shiwen, Zhifang Sui, and Xuefeng Zhu 俞士汶, 穗志方, 朱学锋. 2011. The comprehensive language knowledge base and its prospect 综合型语言知识库及其前景. Journal of Chinese Information Processing 中文信息学报 25(6):12–20.
Zhu, Dexi 朱德熙. 1982. Lecture notes on grammar 语法讲义. Beijing: The Commercial Press.
Zhu, Dexi 朱德熙. 1983. Questions and answers on grammar 语法问答. Beijing: The Commercial Press.
Zong, Chengqing 宗成庆. 2008. Statistical natural language processing 统计自然语言处理. Beijing: Tsinghua University Press.
Zong, Chengqing, Youqi Cao, and Shiwen Yu 宗成庆, 曹右琦, 俞士汶. 2009. Sixty years of Chinese information processing 中文信息处理 60 年. Applied Linguistics 语言文字应用 4:53–61.

Author information

Authors and Affiliations

Language Centre, Tsinghua University, Beijing, China
Xiaojing Bai

Corresponding author

Correspondence to Xiaojing Bai.

Editor information

Editors and Affiliations

Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Chu-Ren Huang
Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan
Shu-Kai Hsieh
School of Electronic Information and Artificial Intelligence, Leshan Normal University, Leshan City, Sichuan, China
Peng Jin

About this chapter

Cite this chapter

Bai, X. (2023). Describing the Grammatical Knowledge of Chinese Words for Natural Language Processing. In: Huang, CR., Hsieh, SK., Jin, P. (eds) Chinese Language Resources. Text, Speech and Language Technology, vol 49. Springer, Cham. https://doi.org/10.1007/978-3-031-38913-9_5

DOI: https://doi.org/10.1007/978-3-031-38913-9_5
Published: 19 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-38912-2
Online ISBN: 978-3-031-38913-9
eBook Packages: Education (R0)

(5.3) 教室_安静
jiaoshi_anjing
classroom_quiet
The classroom is quiet
(5.4) 他_安静_了_两_天
ta_anjing_le_liang_tian
he_quiet_u_two_days
He remained quiet for two days