1 Introduction

A noun compound (NC) is a sequence of two or more nouns that functions as a single noun (Downing 1977). NCs are very frequent in written English, including press and technical materials, newswires, and fictional prose. In other languages, such as Chinese, NCs are also abundant in texts since the compounding of nouns is the most common way of naming new things. The syntax and semantics of noun compounds have remained an active research field in linguistics, as part of the broader research on multiword expressions (MWEs). As a well-established subtask of language understanding, the interpretation of noun compounds involves uncovering the underlying semantic relations encoded by the constituent nouns. For example, 爱情故事 aiqing gushi “love story” can be paraphrased as 讲述爱情的故事 jiangshu aiqing de gushi “a story that tells about love,” and 别墅女人 bieshu nvren “villa woman” means 住在别墅的女人 zhuzai bieshu de nvren “a woman living in a villa.” Understanding the semantic relations within noun compounds is helpful for many tasks, such as machine translation, information retrieval, and question answering, among others.

In this chapter, we will focus on the semantic interpretation of Chinese noun compounds. The remainder is organized as follows: Section 27.2 will describe related work, while Sect. 27.3 will present a novel taxonomy of Chinese noun compounds based on the transparency of the compounds. In Sect. 27.4, a method for predicting the semantic relations of novel NCs based on word similarity will be introduced. Section 27.5 will illustrate how to interpret noun compounds using verbal paraphrasing, while Sect. 27.6 will offer the conclusion and future work.

2 Previous Studies

In theoretical linguistics, there are contradictory views regarding the semantic interpretation of NCs. Most linguists describe the semantics of noun compounds via a set of abstract relations, as represented in the work of Levi (1978), who presented nine recoverable deletable predicates (RDPs)—be, cause, have, make, use, about, for, from, and in—that are universal and primitive in generating noun compounds, and Warren (1978), who proposed a four-level hierarchical taxonomy derived from the Brown Corpus. Following this tradition, some scholars in the computational field have focused on the taxonomies of noun compounds. Ó Séaghdha (2007) proposed six semantic relations—be, have, in, actor, inst(rument), and about—and each relation was subdivided into subcategories. For example, have is subdivided into the possession, condition-experiencer, property-object, part-whole, and group-member subcategories. Tratz and Hovy (2010) presented a large, fine-grained taxonomy of 43 noun compound relations, which were notably tested by Amazon’s Mechanical Turk service. However, there is still no consensus as to which set of relations binds nouns in a noun compound.

Overall, the semantic relations proposed by different scholars have ranged from general to more specific, with the general ones aiming for broad-coverage analysis of unrestricted text and the specific ones aiming for specialized applications in some domains. In this line of research, the semantic interpretation of NCs is viewed as a multiclass classification problem, where the predefined semantic relations are the categories to be assigned. However, the approach of abstract relations is problematic in several ways. As Nakov and Hearst (2013) pointed out, it is unclear which relation inventory is best, as relations capture only part of the semantics and multiple relations are possible. For example, Wei (2012) assumed that 中国电影 zhongguo dianying “Chinese movies” is classified into the categories of location and content.

Considering these drawbacks, other researchers have used verbal paraphrasing to interpret noun compounds (Girju et al. 2005; Nakov and Hearst 2006; Nakov 2008; Ó Séaghdha 2008). Finn (1980) interpreted “salt water” with “dissolved in.” Butnariu and Veale (2008) summarized eight relational possibilities; for example, “headache pill” might be paraphrased as “headache-inducing pill,” “headache prevention pill,” “pill for treating headaches,” “pill that causes headaches,” “pill that is prescribed for headaches,” and “pill that prevents headaches.” With these verbs, the paraphrases are more specific than the abstract relations. Following this view, the SemEval 2010 task 9 “Noun Compound Interpretation Using Paraphrasing Verbs and Prepositions” and SemEval 2013 task 4 “Free Paraphrases of Noun Compounds” both intended to promote a paraphrase-based approach to this problem.

Accordingly, there are two ways to interpret noun compounds in Chinese. Theoretically, there have been some achievements in the analysis of semantic relations, while very little work on the automatic semantic interpretation of Chinese NCs has been done. Zhao et al. (2007) focused on a subset of Chinese NCs in which the head word is a verb nominalization, such as 血液循环 xueye xunhuan “blood circulation,” and four coarse-grained semantic roles were proposed for the classification of noun modifiers in compound nominalization. Their study took a static approach in which the interpretation was viewed as a classification problem. As for the second line of research, Wang (2010) and Wang et al. (2014) adopted a bottom-up strategy to capture the verbs of noun compounds and provided four types of paraphrase patterns. As Wei (2012) pointed out, these four types are not specific enough to give proper interpretations. Instead, Wei (2012) classified the noun compounds into eight major types and 346 subcategories, which proved to be fine-grained.

3 Taxonomy of Chinese Noun Compounds

Whether using abstract relations or verbal paraphrasing, there are still some noun compounds that are not interpretable. We hypothesized that this is due to the lack of consideration of the decomposable possibilities and the semantic transparency of noun compounds. Taking the noun compound 夫妻肺片 fuqi feipian “pork lungs in chili sauce” as an example, it is not decomposable; that is, the meaning of the compound is not simply the combination of the literal meanings of the parts. Levi (1978) proposed a transparency scale for noun compounds, as shown in Table 27.1.

Table 27.1 Levi’s (1978) transparency scale for noun compounds

In Table 27.1, Levi (1978) summarized five types of noun compounds based on semantic transparency, each type showing a different interpretation pattern of the noun compounds. For example, “orange peel” is simply the combination of “orange” and “peel.” However, “grammar school” cannot be combined literally because there is a hidden verb in this compound, as in “grammar teaching school.” In contrast, the other types cannot be combined literally or be interpreted by hidden verbs. For instance, “ladybird” is not a kind of bird but a kind of bug, “Coccinellidae,” and “honeymoon” has nothing to do with “honey” or “moon” but instead refers to the vacation that brides and grooms take to celebrate their marriage. The type “partly idiomatic” is special because its verbs are not easy to recover. For example, it is not acceptable to say “flea selling market” for a market selling small commodities.

In light of Levi’s (1978) transparency scale and Nunberg et al.’s (1994) claims on idioms, we collected 428 noun-noun compounds (N1-N2) and classified them into the four categories shown in Table 27.2.

Table 27.2 Basic types of noun compounds

As Table 27.2 shows, the first three types are decomposable at the syntagmatic level, but the last one is non-decomposable. Initially, we decided that non-decomposable idioms should be analyzed as a whole unit both syntactically and semantically, and since the other types were decomposable, they could be divided into N1 and N2. However, the semantic relations of these types are different in terms of semantic transparency. Therefore, we proposed a novel taxonomy of Chinese noun compounds based on semantic transparency. Table 27.3 summarizes 11 subcategories of noun compounds based on their semantic relations.

Table 27.3 Semantic relations of NCs

To interpret the noun compounds in Table 27.3, we created different interpretation patterns with different conditions. Category 1 corresponds to the noun compounds of type a in Table 27.2, which can be interpreted as the literal meanings of the parts, for example, 机组人员 jizu renyuan “crew members” in the paraphrased 属于机组的人员 shuyu jizu de renyuan “the members that belong to the crew.”

In categories 2 to 5, these four types correspond to both type a and type b, since the meaning of the compounds can be interpreted by the fixed pattern of the components and can also be predicted by hidden verbs. For instance, the paraphrased verb of the compound 雅典奥运会 yadian aoyunhui “Athens Olympics” could be 举办 juban “to hold,” and thus the paraphrased sentence would be 在雅典举办的奥运会 zai yadian juban de aoyunhui “The Olympic Games that were held in Athens.” As for 爱情故事 aiqing gushi “love story,” it could be paraphrased as 关于爱情的故事 guanyu aiqing de gushi “the story about love” and 讲述爱情的故事 jiangshu aiqing de gushi “the story telling about love.”

Moreover, categories 6 to 9 correspond to type b, in which the hidden verb must be revealed. In this group, the qualia roles of the head noun are different for each type. For example, the qualia role in category 6 is agentive because “material” usually relates to the make relation, and the relation of “patient” in category 7 relates more to telic roles, which are interpreted as the function of N1. For example, 围棋高手 weiqi gaoshou “chess master” could be paraphrased as 下围棋的高手 xia weiqi de gaoshou “the masters of playing chess.” Here, 下 xia “to play” is the telic role of 围棋 weiqi “chess.”

The last two categories correspond to type c and the non-decomposable idioms, respectively. Noun compounds in category 10 should be interpreted as having a metaphoric meaning, and thus they cannot be interpreted by hidden verbs alone. Taking 试管婴儿 shiguan yinger “test tube babies” as an example, the compound cannot be paraphrased literally with expressions like 在试管里孕育的婴儿 zai shiguan li yunyu de yinger “the babies that are fertilized in test tubes.” The word 试管 shiguan “test tubes” has the metonymic meaning of 试管孕育技术 shiguan yunyu jishu “in vitro fertilization.” Therefore, the meaning of the compound has to be inferred as 用试管技术孕育的婴儿 yong shiguan jishu yunyu de yinger “the babies that are fertilized by the technique of using test tubes.” The idioms in the last category are not decomposable at all and should be treated as a whole unit. For example, 夫妻肺片 fuqi feipian “pork lungs in chili sauce” refers only to the name of the dish.

4 Interpretation Based on Word Similarity

Kim and Baldwin (2005) introduced a method for interpreting novel English noun compounds with semantic relations using WordNet::Similarity. Based on the taxonomy above, we proposed a method that uses word similarity to predict the semantic relations of novel Chinese NCs. Given an NC in the testing data, we calculated the similarities between its component nouns and the corresponding nouns of the NCs in the training data to acquire the semantic relation, which was our first strategy.

4.1 Word Similarity Measures

HowNet-based similarity. HowNet is a commonsense knowledge base of interconceptual relations and inter-attribute relations of concepts as connoted in lexicons of Chinese and their English equivalents (Dong and Dong 2005). As a knowledge base, the knowledge structured by HowNet is represented by a graph rather than a tree, and it is devoted to demonstrating the general and specific properties of concepts. For every word sense ci (i.e., concept), its definition is composed of a set of sememes and corresponding relations. For instance, the Chinese word 学校 “school” is defined as follows in Fig. 27.1.

Fig. 27.1 Definition of the Chinese word 学校 “school” in HowNet. The entry lists the terms school, institute place, teach, study, and education, each paired with its Chinese equivalent.

HowNet allows users to measure the semantic similarity and relatedness between a pair of concepts based on the overlap of their sememes. In our study, we adopted the similarity measure provided by Liu and Li (2002) to compute the similarity of two nouns.

Cilin-based similarity. Cilin is a Chinese thesaurus that defines and describes “concepts” and reveals their relations using Synset. The semantic category of words (i.e., concepts) is encoded by a five-layer tree, as shown in Fig. 27.2.

Fig. 27.2 Examples in Cilin (seven lines of Chinese entries)

The similarity of two words in Cilin is measured by the distance in the tree. Formally, it is defined using Formula (27.1):

$$ {\mathrm{sim}}_{\mathrm{cilin}}\left({w}_1,{w}_2\right)=1-\frac{\mathrm{pathlen}\left({w}_1,{w}_2\right)}{\mathrm{pathlen}\left({w}_1,\mathrm{Root}\right)+\mathrm{pathlen}\left({w}_2,\mathrm{Root}\right)} $$
(27.1)

where pathlen(w1, w2) is the minimum path length of (w1, w2) to their common parent node, and Root represents the root of the tree.
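Formula (27.1) can be illustrated with a minimal Python sketch. It assumes that each word is represented by its sequence of category labels from the layer below Root down to the word's own category, and that pathlen(w1, w2) counts the edges from both words up to their common parent node; both the encoding and the labels in the example are illustrative placeholders, not real Cilin codes.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading category labels shared by two root-to-word paths."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def sim_cilin(path1: Sequence[str], path2: Sequence[str]) -> float:
    """Formula (27.1) over root-to-word category paths (assumed encoding)."""
    d1, d2 = len(path1), len(path2)            # pathlen(w, Root) for each word
    if d1 + d2 == 0:
        raise ValueError("paths must not both be empty")
    shared = common_prefix_len(path1, path2)   # depth of the common parent node
    pathlen = (d1 - shared) + (d2 - shared)    # assumed: edges from both words up to it
    return 1.0 - pathlen / (d1 + d2)


# Hypothetical five-layer paths; the labels are placeholders, not real Cilin codes.
W1 = ("A", "a", "01", "A", "01")
W2 = ("A", "a", "01", "B", "02")
print(sim_cilin(W1, W2))  # the paths share three layers: 1 - 4/10
```

Under these assumptions, two words in the same lowest-level category receive a similarity of 1, and two words that diverge directly below Root receive a similarity of 0.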

4.2 Method

The similarity between NCs (t1, t2) and (n1, n2) was calculated by the similarities of the component nouns. Formally, the similarity of each NC pair was defined using Formula (27.2):

$$ \mathrm{Sim}\left(\left({t}_1,{t}_2\right),\left({n}_1,{n}_2\right)\right)=\frac{\left(\alpha {S}_1+{S}_1\right)\times \left(\left(1-\alpha \right){S}_2+{S}_2\right)}{2} $$
(27.2)

where S1 is the modifier similarity (i.e., Sim(t1, n1)) and S2 is the head similarity (i.e., Sim(t2, n2)), while α ∈ [0, 1] is the weighting factor that balances the contributions of the modifier and head.
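As a small worked example of Formula (27.2), the following Python sketch computes the NC-pair score from given component similarities. The function name and the input values are illustrative assumptions; only the arithmetic follows the formula.

```python
def nc_similarity(s1: float, s2: float, alpha: float = 0.5) -> float:
    """Formula (27.2): combine modifier similarity s1 and head similarity s2."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return ((alpha * s1 + s1) * ((1.0 - alpha) * s2 + s2)) / 2.0


# Illustrative values: S1 = 0.8, S2 = 0.6, alpha = 0.5
# (0.4 + 0.8) * (0.3 + 0.6) / 2 = 1.2 * 0.9 / 2
print(nc_similarity(0.8, 0.6, alpha=0.5))
```

With α = 0.5 the formula reduces to 1.125 × S1 × S2, so the score can slightly exceed 1 when both component similarities are close to 1. This does not affect the method, since the scores are only compared with one another.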

For each test NC, we calculated its similarity to every NC in the training data. Then, we chose the NC in the training data that had the highest similarity and labeled the test NC with the semantic relation associated with that training NC. Formally, the semantic relation of the test NC (t1, t2) was determined using Formula (27.3):

$$ \mathrm{Relation}\left({t}_1,{t}_2\right)=\mathrm{Relation}\left({n}_{i1},{n}_{i2}\right),\ \mathrm{where}\ i=\underset{i}{\mathrm{argmax}}\ \mathrm{Sim}\left(\left({t}_1,{t}_2\right),\left({n}_{i1},{n}_{i2}\right)\right) $$
(27.3)

Figure 27.3 shows the complete procedure of our method, while Fig. 27.4 illustrates in detail how we calculated the similarities between a test NC (t1, t2) and the NCs in the training data.

Fig. 27.3 The procedure of our method: the similarities between the test NC and the training NCs are calculated, the maximum is found, and the test NC is tagged.

Fig. 27.4 Detailed similarities between the test NC and training NCs: the test NC (t1, t2) is compared with each training NC and its semantic relation class, giving the noun-noun similarities S1 and S2.

As can be seen, the test NC is associated with a total number of “m” similarities, where “m” is the number of NCs in the training data. Then, the semantic relation of the test NC was determined by the training instance with the highest similarity.
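The procedure can be sketched in Python as a nearest-neighbor lookup. This is a minimal illustration, not the original implementation: the word-similarity function is passed in as a parameter, the scores in `TOY_SCORES` and the second relation label are invented for the example, and skipping training NCs whose words have no similarity score is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

NC = Tuple[str, str]                              # (modifier, head)
WordSim = Callable[[str, str], Optional[float]]   # None if a word is not covered


def nc_similarity(s1: float, s2: float, alpha: float = 0.5) -> float:
    """Formula (27.2)."""
    return ((alpha * s1 + s1) * ((1.0 - alpha) * s2 + s2)) / 2.0


def predict_relation(test_nc: NC,
                     training: Sequence[Tuple[NC, str]],
                     word_sim: WordSim,
                     alpha: float = 0.5) -> Optional[str]:
    """Formula (27.3): return the relation of the most similar training NC."""
    best_label, best_score = None, float("-inf")
    for (n1, n2), relation in training:
        s1 = word_sim(test_nc[0], n1)   # modifier similarity
        s2 = word_sim(test_nc[1], n2)   # head similarity
        if s1 is None or s2 is None:    # assumed handling of uncovered words
            continue
        score = nc_similarity(s1, s2, alpha)
        if score > best_score:
            best_label, best_score = relation, score
    return best_label                   # None means the test NC stays unlabeled


# Invented similarity scores, for illustration only.
TOY_SCORES = {
    ("布料", "黄金"): 0.6, ("玩具", "首饰"): 0.5,
    ("布料", "出租车"): 0.1, ("玩具", "司机"): 0.1,
}


def toy_word_sim(a: str, b: str) -> Optional[float]:
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))


TRAINING = [(("黄金", "首饰"), "material"),
            (("出租车", "司机"), "hypothetical-other")]  # second label is a placeholder

print(predict_relation(("布料", "玩具"), TRAINING, toy_word_sim))
```

In this sketch a test NC whose words are missing from the similarity resource is returned as `None`, that is, it stays unlabeled.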

4.3 Experiments and Evaluation

We retrieved two-word Chinese NCs from the People’s Daily of 1998 and 2000, which were segmented and POS tagged (Yu et al. 2002). After excluding proper nouns and coordinate constructions, we obtained 1483 NCs for our experiment. The semantic relations of all the NCs were judged by two annotators who had majored in linguistics. Overall, we used 978 NCs for the training data and 505 NCs for the testing data.

We experimented with the two similarity methods introduced above, assuming that the contribution of the head and modifier noun was equal (α = 0.5). Table 27.4 shows the experimental results. Note that the HowNet and Cilin similarities were based on dictionary-based methods. Thus, if the test word did not appear in HowNet or Cilin, our method could not tag the test NC (i.e., unlabeled data) because of the lack of similarities. The performances of HowNet and Cilin similarity were very close, and they each classified 35% of the NCs correctly.

Table 27.4 Accuracy based on HowNet and Cilin similarity

Table 27.5 lists some test NCs and the most similar NCs found in the training data. As can be seen, our method provided reasonable interpretations, which is very useful in understanding novel NCs. For instance, if a reader did not know the meaning of the novel NC 网络医生 wangluo yisheng “network doctor,” our method provided NCs such as 出租车司机 chuzuche siji “taxi driver,” which were easy to understand. Our method could also help a reader to predict the semantic relation of two nouns. Taking 布料玩具 buliao wanju “cloth toy” and 黄金首饰 huangjin shoushi “gold jewelry” as an example, they both shared the semantic relation of “material,” and thus their similarity was very high. With our method, a reader could therefore infer the relation of an unfamiliar compound from that of a similar, more frequently used one.

Table 27.5 The most similar NCs based on the two similarity measures

5 Interpretation Using Verbal Paraphrasing

In linguistic theories, it has been proven that verbs play an important role in the process of noun compound derivation. In this section, we will present a simple and unsupervised approach for characterizing the semantic relations held in two-word Chinese NCs. What is especially novel about this approach is that NCs are interpreted in terms of verbal phrases, rather than by a set of concrete verbs. This is a richer and more flexible paraphrasing model in the sense that one semantic relation can be expressed by different verbal phrases.

5.1 Acquisition of Paraphrasing Verbs

In English, popular approaches to the acquisition of paraphrasing verbs have searched for snippets that have both nouns as endpoints as well as collected verbs from intervening materials. For example, Nakov and Hearst (2006) used the phrase “noun2 THAT * noun1” for Google queries and extracted verbs between THAT and noun1 from the returned pages. However, there are neither inflections nor clear form markers in Chinese, such as the complementizers that indicate relative clauses, which is why it is difficult to acquire Chinese verbs using explicit clues.

Semantic relations between words should be expressed through certain syntactic forms and structures. The semantic relations held between nouns and verbs are directly expressed by “Verb-Object” and “Subject-Verb” structures, in which the noun acts as the subject or object of the verb. For example, the Verb-Object structure 切割钻石 qiege zuanshi “cut the diamond” shows that 钻石 zuanshi “diamond” is a solid substance that can be cut. Thus, we aimed to acquire concept-related verbs for the nouns using the two grammatical relations above. A large-scale corpus with phrase-structure annotation would be necessary for this task. However, such resources in Chinese are limited, resulting in a lack of coverage of the acquired verbs. Therefore, we adopted a backward strategy that extracted the verbs from specific grammatical relations (i.e., Subject-Verb and Verb-Object) in terms of collocation using Chinese Word Sketch (CWS).

Chinese Word Sketch. CWS is a combination of the Chinese Gigaword Corpus and the corpus management tool in Sketch Engine (Kilgarriff et al. 2004; Huang et al. 2005). The Chinese Gigaword Corpus (second edition) is a comprehensive archive of newswire text data in Chinese containing about 1.4 billion Chinese characters. All the texts have been segmented and POS tagged automatically. We included all the data in our study. The main functionality of Sketch Engine includes KWIC displays, co-occurrence statistics, grammatical relations, and word sketches, which provide grammatical descriptions of a word in terms of corpus collocations. For nouns, the grammatical description includes nine relations: “A_Modifier/N_Modifier/Modifies,” “Subject_of,” “Object_of,” “And/Or,” and “Possession/Possessor.” All the collocations were formalized as triples of the form (Rel, Word1, Word2), where Rel is a relation, Word1 is the keyword of a query, and Word2 is the collocate involved with respect to the relation in question.

We used a two-step procedure to acquire the verbs that were related to the compound “n1 n2.” First, we collected the collocations with “Subject_of” and “Object_of” relations using n1 and n2 as the keywords of the queries, respectively. We chose only the top 200 words with the highest salience for each relation. Thus, we obtained two sets of collocating verbs denoted as VerbSet1 and VerbSet2 for n1 and n2. Then, we found the intersection of VerbSet1 and VerbSet2, which provided the final paraphrasing verbs. Table 27.6 shows an example of the procedure.

Table 27.6 Verbs acquired for the noun compound 钻石戒指 zuanshi jiezhi “diamond ring”
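The two-step procedure can be sketched in Python as follows. The sketch assumes that the collocates of each noun have already been exported as (verb, salience) pairs per relation; it does not call any Sketch Engine interface, and the verbs and salience values in the toy data are invented for illustration rather than taken from CWS.

```python
from typing import Dict, List, Set, Tuple

Collocates = Dict[str, List[Tuple[str, float]]]  # relation -> [(verb, salience)]


def verb_set(collocates: Collocates, k: int = 200) -> Set[str]:
    """Step 1: keep the k most salient verbs for each relation of one noun."""
    verbs: Set[str] = set()
    for relation in ("Subject_of", "Object_of"):
        ranked = sorted(collocates.get(relation, []),
                        key=lambda pair: pair[1], reverse=True)
        verbs.update(verb for verb, _ in ranked[:k])
    return verbs


def paraphrasing_verbs(n1: Collocates, n2: Collocates, k: int = 200) -> Set[str]:
    """Step 2: intersect VerbSet1 (for n1) and VerbSet2 (for n2)."""
    return verb_set(n1, k) & verb_set(n2, k)


# Invented collocates for 钻石 "diamond" and 戒指 "ring" (illustration only).
DIAMOND: Collocates = {
    "Object_of": [("切割", 9.0), ("镶嵌", 8.0), ("购买", 5.0)],
    "Subject_of": [("闪耀", 4.0)],
}
RING: Collocates = {
    "Object_of": [("佩戴", 9.5), ("购买", 7.0), ("镶嵌", 6.0)],
    "Subject_of": [],
}

print(sorted(paraphrasing_verbs(DIAMOND, RING)))
```

Because the intersection keeps only verbs that collocate with both nouns, a smaller cutoff `k` makes the result more selective but can leave a compound without any paraphrasing verb.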

We used this method for 电影公司 dianying gongsi “film company” and 啤酒公司 pijiu gongsi “beer company,” which have the same head. The paraphrasing verbs are shown in Table 27.7, which shows that the two similar compounds have very few common verbs. Fine-grained semantic distinctions were captured with our approach.

Table 27.7 Examples of paraphrasing verbs for 电影公司 dianying gongsi “film company” and 啤酒公司 pijiu gongsi “beer company”

5.2 Generating Verbal Paraphrases

Yuan (1995) proposed four typical Chinese syntactic patterns for the recovery of the implied predicates, as shown in Table 27.8. In our approach, we used those patterns to generate verbal paraphrases for a compound based on the acquired paraphrasing verbs. Using those patterns, we generated as many verbal paraphrases as possible; however, many of them did not make sense. We therefore filtered out the inappropriate paraphrases via search engines.

Table 27.8 Patterns used to generate verbal paraphrases

5.3 Filtering Verbal Paraphrases

The goal of this process was to remove the noise (i.e., inappropriate paraphrases) and retain the most reasonable verbal paraphrases by assigning a higher rank to them. For this purpose, we validated these paraphrases by finding evidence in a large corpus. The greater the evidence, the more appropriate a given paraphrase should be.

The notion of “Web as a corpus” has been widely accepted by researchers. Keller and Lapata (2003) applied web counts to a wide variety of NLP tasks involving syntax and semantics and demonstrated that realistic NLP tasks can benefit from web counts. In our approach, we viewed all the candidate paraphrases as queries, and all queries were submitted to the search engines and performed as exact matches. Thus, we obtained the web counts of the paraphrases. For each noun compound, the paraphrases were ranked by descending order of web counts. The paraphrases with a higher ranking were considered more reasonable than those with a lower ranking.
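Candidate generation and ranking by web counts can be sketched in Python as follows. The three templates are hypothetical and are modeled on example paraphrases quoted in this chapter (such as 讲述爱情的故事 and 在雅典举办的奥运会); they are not the four patterns of Table 27.8. The hit-count function is a stand-in supplied by the caller, and the counts in `FAKE_COUNTS` are invented, since no real search-engine interface is assumed here.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical templates, not the patterns of Table 27.8.
TEMPLATES = ("{v}{n1}的{n2}", "在{n1}{v}的{n2}", "用{n1}{v}的{n2}")


def generate_paraphrases(n1: str, n2: str, verbs: Iterable[str],
                         templates: Sequence[str] = TEMPLATES) -> List[str]:
    """Fill every template with every acquired paraphrasing verb."""
    return [t.format(v=v, n1=n1, n2=n2) for v in sorted(verbs) for t in templates]


def rank_by_counts(candidates: Iterable[str],
                   hit_count: Callable[[str], int]) -> List[Tuple[str, int]]:
    """Rank candidates by descending exact-match count."""
    scored = [(c, hit_count(c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented counts standing in for exact-match web counts.
FAKE_COUNTS = {"讲述爱情的故事": 1200, "用爱情讲述的故事": 3}


def fake_hit_count(query: str) -> int:
    return FAKE_COUNTS.get(query, 0)


CANDIDATES = generate_paraphrases("爱情", "故事", ["讲述"])
RANKED = rank_by_counts(CANDIDATES, fake_hit_count)
for paraphrase, count in RANKED:
    print(paraphrase, count)
```

Passing the count function as a parameter keeps the ranking logic independent of any particular search engine, so counts from two engines can be compared by calling `rank_by_counts` twice.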

Baidu (www.baidu.com) and Google (www.google.com) were the most popular search engines for Chinese search, so we used both. We conducted separate experiments based on the web counts obtained from each search engine. The numbers of hits from Baidu and Google were not identical, which resulted in some differences in the ranking. Table 27.9 shows the top five paraphrases for 钻石戒指 zuanshi jiezhi “diamond ring” based on Baidu and Google, respectively (incorrect phrases are in italics).

Table 27.9 Top five verbal paraphrases ranked by Baidu and Google

5.4 Experiments and Evaluation

We randomly selected 391 Chinese noun compounds from the newswire corpus People’s Daily to test our approach. For each compound, the top 10 candidate paraphrases were collected. All the paraphrases were judged by three human subjects. They were asked to make binary judgments (yes or no) for each paraphrase, that is, whether the paraphrase expressed a meaning similar to that of the compound. If more than two subjects labeled the paraphrase yes, it was viewed as correct. We defined the accuracy of the compounds using Formula (27.4):

$$ \mathrm{Accuracy}=\frac{\mathrm{the}\ \mathrm{number}\ \mathrm{of}\ \mathrm{compounds}\ \mathrm{with}\ \mathrm{correct}\ \mathrm{interpretation}}{\mathrm{the}\ \mathrm{total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{compounds}}\times 100\% $$
(27.4)

Table 27.10 shows the accuracy rates for the top n candidate paraphrases, where n equals 1, 3, 5, and 10. As shown, the performances based on Google and Baidu were very similar. Our method provided correct interpretations for almost 70% of the compounds when only the topmost paraphrase was given, and accuracy increased with the number of candidate paraphrases.

Table 27.10 Accuracy based on Google and Baidu
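Formula (27.4) for the top n candidates can be sketched in Python as follows. The sketch assumes that a compound counts as correctly interpreted when at least one of its top n ranked paraphrases was judged correct; this reading and the toy judgments are assumptions made for illustration.

```python
from typing import Sequence


def accuracy_at_n(judgments: Sequence[Sequence[bool]], n: int) -> float:
    """Formula (27.4) in percent, using only the top n paraphrases per compound.

    judgments[i][j] is True if the j-th ranked paraphrase of compound i
    was judged correct.
    """
    if not judgments:
        raise ValueError("at least one compound is required")
    hits = sum(1 for ranked in judgments if any(ranked[:n]))
    return 100.0 * hits / len(judgments)


# Toy judgments for four compounds with three ranked paraphrases each.
TOY_JUDGMENTS = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]

for n in (1, 2, 3):
    print(n, accuracy_at_n(TOY_JUDGMENTS, n))
```

Under this reading the accuracy can never decrease as n grows, which is consistent with the trend described for Table 27.10.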

6 Conclusion

In this chapter, we presented our preliminary research on the interpretation of Chinese noun compounds using two different strategies. For the abstract relation strategy, we proposed a novel taxonomy of Chinese noun compounds based on the transparency of the compounds. Then, we proposed a method for interpreting Chinese NCs based on word similarity. Our experimental results showed that word similarity provided useful information in solving interpretation problems. In the future, we plan to use corpus-based similarity methods such as word2vec to solve the out-of-vocabulary (OOV) problem. Moreover, a voting strategy over several highly similar NCs could be used to determine the semantic relations of the test NCs, since we currently chose only the single NC with the highest similarity.

For the verbal paraphrasing strategy, we proposed the simple dynamic approach of using paraphrasing verbs, which could be useful in many NLP tasks. This approach not only provided possible interpretations of noun compounds but also captured interesting fine-grained semantic differences of similar noun compounds. In the future, we plan to acquire more verbs using web data, such as the Google 5-gram web index. We also plan to expand the paraphrasing patterns. Finally, we are also very interested in applying the methods proposed here to information retrieval.