1 Introduction

Recognizing textual entailment (RTE) (Dagan et al. 2006) has become a major research topic in natural language processing (NLP) in the past decade (Watanabe et al. 2013a). Given a pair of statements, text (\(T\)) and hypothesis (\(H\)), the most basic format of an RTE task is to determine whether \(H\) is true when \(T\) is true; namely, whether or not \(T\) entails \(H\). A more challenging format is to determine whether \(T\) and \(H\) are contradictory statements (Dagan et al. 2009). More recently, in PASCAL RTE-6, NTCIR-10 RITE-2, and NTCIR-11 RITE-VAL, researchers investigated and evaluated methods for identifying statements in a collection, e.g., a corpus like Wikipedia, that are relevant to a given statement \(T\), where relevancy covers entailment, paraphrase, and contradiction.

The RTE tasks are relevant and applicable to many NLP applications, including knowledge management (Tsujii 2012). If a statement entails another in a collection of statements, then one may not need to consider both statements to produce a concise summary of the collection, so recognizing entailments is useful for automatic text summarization (Lloret et al. 2008; Tatar et al. 2009). Similar reasoning applies to the use of entailment recognition in question answering systems (de Salvo et al. 2005): when a question entails another, the recorded answer to the previous question may be useful for answering the new question. RTE can also be useful for judging the correctness of students’ descriptive answers in assessment tasks. It is rare for students to respond to questions with statements that are exactly the same as the instructors’ standard answers. It is also not practical to expect instructors to list all possible ways in which students may answer a question. In such cases, recognizing paraphrase relationships between students’ and instructors’ answers becomes instrumental (Nielsen et al. 2009). We have also applied RTE techniques to enable computers to take reading comprehension tests that are designed for middle school students (Huang et al. 2013).

Dagan et al. (2009) provided an overview of the approaches for RTE. Treating RTE as a classification task is an obvious option, where different systems consider various factors to make the final decisions. Due to the availability of training data in RTE activities, machine learning-based approaches are common. Researchers design methods that utilize different levels of linguistic information, including syntactic and semantic information, in the given statement pairs to judge their relationships. Transformation-based methods offer interesting alternatives for the RTE tasks. If a statement can be transformed into another via either syntactic rewriting (Bar-Haim et al. 2008; Stern et al. 2011; Shibata et al. 2013) or logical inference procedures (Chambers et al. 2007; Takesue and Ninomiya 2013; Wang et al. 2013; Watanabe et al. 2013b), then the statements may be highly related. In addition to using the information conveyed by the given statements, external information such as common sense knowledge and ontologies about problem domains can strengthen the basis on which entailment decisions are made (de Salvo et al. 2005; Stern et al. 2010).

The first corresponding event of PASCAL RTE for Japanese and Chinese took place in NTCIR-9, and was named RITE as the acronym for “Recognizing Inference in Text” (Shima et al. 2012). NTCIR-10 continued to host RITE-2 for Japanese and Chinese, and had, respectively, ten and nine teams participating in the traditional and simplified Chinese subtasks (Watanabe et al. 2013a). All of these participants considered different combinations of linguistic information as features to determine the entailment relationships of statement pairs. Most of them employed support vector machines as the classifiers.

There were different subtasks in NTCIR-9 RITE and NTCIR-10 RITE-2. The binary classification (BC) subtask required participants to judge whether or not \(T\) entails \(H\). In this paper, we will focus only on the BC subtasks in the NTCIR RITE tasks, as we believe that the BC subtask is the most fundamental subtask of them all.

In NTCIR-10 RITE-2, the best performing team in the BC subtask for traditional Chinese (CT) adopted a voting mechanism (Shih et al. 2013). The best performing team in the BC subtask for simplified Chinese (CS) employed an alignment-based strategy (Wang et al. 2013). We (Huang and Liu 2013) trained heuristic functions to achieve second best performance in the BC subtasks for both CT and CS. The best team outperformed us in the BC subtask for CT by only 0.7 % in the F1 measure. Chang et al. (2013) embraced decision trees as the classifier but did not achieve an impressive performance.

For obvious reasons, all participating systems in NTCIR-10 RITE-2 used some forms of linguistic features to make decisions. As may be expected, different systems considered different sets of features and applied them in different ways. We computed lexical, syntactic, and semantic information about the statement pairs to judge their entailment relationships. The linguistic features were computed with public tools and machine-readable dictionaries, including the Extended HowNet (Chen et al. 2010). Preprocessing steps for the statements included conversion between simplified and traditional Chinese, Chinese segmentation, and conversion of Chinese number formats. We employed such linguistic information as (1) words shared by both statements; (2) synonyms, antonyms, and negation words; (3) information about the named entities of the statement pairs; and (4) similarity between parse trees and dependency structures.

The performance of our approaches was sufficiently robust that we achieved the second best scores in both the CT and CS subtasks. Since each participating team could submit results of three different configurations, we experimented with models built by training heuristic functions and support vector machines (SVMs). Our best results were achieved by the trained heuristic functions, which placed second in the BC subtasks for both CT and CS. Our SVM-based models achieved the third best score in the BC subtask for CT, but dropped to 12th position in the BC subtask for CS.

We have extended our work after participation in NTCIR-10 RITE-2. We ran grid searches of larger scales to find the best combinations of parameters and features for the classification models. In general, conducting the grid searches helped us build better models. However, the experimental results also provide interesting and seemingly perplexing material for further discussion in the paper. We also tested our systems with the test data for the BC subtasks of NTCIR-9 RITE, and found that we were able to achieve better performance than the best performer in NTCIR-9 RITE tasks.

We explain the preprocessing of the text material and extraction of their linguistic features in Sect. 2, examine the constructions of the heuristics-based and machine learning-based classifiers in Sect. 3, present and discuss the experimental results in Sect. 4, review and deliberate on some additional observations in Sect. 5, and wrap up this paper in Sect. 6.

2 Major system components

In this section, we describe components of our running systems, including the preprocessing steps and the extraction of fundamental linguistic features.

2.1 Preprocessing

In this subsection, we explain the preprocessing steps: traditional-to-simplified Chinese conversion, numeric format conversion, and Chinese segmentation.

2.1.1 Traditional-to-simplified Chinese conversion

We relied on Stanford NLP tools to do Chinese segmentation and named-entity recognition. As those tools were designed to perform better for simplified Chinese, we had to convert traditional Chinese into simplified Chinese. We converted words between their traditional and simplified forms of Chinese with an automatic procedure which relied on a tool in Microsoft Word. We did not design or invent a conversion dictionary of our own, and the quality of conversion depended solely on Microsoft Word.

There are two major methods for converting between traditional and simplified Chinese text. The simpler option is just to do character-to-character conversion, e.g., changing “[figure a]” to “[figure b]”. A more sophisticated and better conversion is to do word-to-word conversion, changing this sample statement to “[figure c]”. The latter conversion yields the simplified Chinese words “[figure d]”, “[figure e]”, and “[figure f]” that are used in the training of the Stanford tools, so it is more likely to lead to better system performance. Microsoft Word offers the second type of conversion as far as it can, and we understand that Microsoft Word might not convert all traditional Chinese words perfectly to their simplified counterparts; e.g., the result of converting “[figure g]” is “[figure h]”, whereas “[figure i]” is the preferred conversion. However, Microsoft Word is a good and accessible current choice.

2.1.2 Numeric format conversion

There are multiple ways for people to write numbers in English text, e.g., sixteen vs. 16. In Chinese, there are at least three ways to write numbers in text, e.g., “3”, “[figure j]”, and “[figure k]” for the number 3. There are also specific characters that express particular numbers, e.g., “[figure l]” and “[figure m]” for 20 and 30, respectively. In addition, there are simplified ways to express relatively small numbers, e.g., “[figure n]” for 32 but “[figure o]” for 12. In the latter case, “[figure p]” is more formal but is rarely used.

To streamline our handling of numbers in Chinese statements, we employed regular expressions to capture specific strings and convert them to Arabic numerals. The conversions need special care for some extraordinary instances. For instance, one may not want to convert “[figure q]” to “[figure r]” or convert “[figure s]” to “[figure t]”.
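As an illustration, the following is a minimal sketch of such a regex-based conversion (not our actual implementation); the character mappings cover only simple numerals, and, as noted above, a naive pattern would still need extra rules for instances that should not be converted.

```python
import re

# Basic numeral characters; 廿 and 卅 are the compact forms of 20 and 30.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9, "廿": 20, "卅": 30}
UNITS = {"十": 10, "百": 100, "千": 1000}
NUMERAL_RE = re.compile("[零一二三四五六七八九十百千廿卅]+")

def chinese_to_arabic(numeral):
    """Convert a simple Chinese numeral string to an integer, e.g., 三十二 -> 32, 十二 -> 12."""
    total, current = 0, 0
    for ch in numeral:
        if ch in DIGITS:
            current += DIGITS[ch]
        elif ch in UNITS:
            total += (current or 1) * UNITS[ch]  # a bare unit implies a leading 1, so 十二 is 12
            current = 0
    return total + current

def normalize_numbers(text):
    """Replace Chinese numerals captured by the regular expression with Arabic numerals."""
    return NUMERAL_RE.sub(lambda m: str(chinese_to_arabic(m.group())), text)
```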

2.1.3 Chinese string segmentation

We employed the Stanford Word Segmenter (Chang et al. 2008) to segment Chinese character strings into word tokens. Unlike most alphabetical languages in which words are separated by spaces, Chinese text strings do not have delimiters between words. In fact, Chinese text did not use punctuation marks until modern times. In the field of natural language processing, converting a Chinese string into a sequence of Chinese words is called segmentation (or tokenization) of Chinese.

A major challenge of Chinese segmentation is that different segmentations of a given Chinese string can represent very different meanings of the original string. We can segment the string “[figure u]” in two different ways: {“[figure v]”, “[figure w]”, “[figure x]”} or {“[figure y]”, “[figure z]”, “[figure aa]”, “[figure ab]”}. Adopting the former segmentation, the translation of the original Chinese string is “how many more years can one do research”. Adopting the latter leads to “how many more years can the graduate student survive”. To most native speakers of Chinese, the former segmentation is much more natural, but the latter is not unacceptable. In the 2012 Bakeoff for Chinese segmentation, the best performing system reached an F1 measure slightly shy of 95 % (Duan et al. 2012).
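As an illustration of this step, the following sketch uses the open-source jieba package as a stand-in for the Stanford Word Segmenter that we actually used; the example sentence is hypothetical.

```python
import jieba  # open-source segmenter, used here only as a stand-in for the Stanford Word Segmenter

def segment(text):
    """Segment a Chinese character string into a list of word tokens."""
    return list(jieba.cut(text))

# Hypothetical ambiguous string ("to study the origin of life" vs. a reading involving "graduate student");
# different segmenters, or different modes of the same segmenter, may tokenize it differently.
print(segment("研究生命的起源"))
```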

2.2 Lexical semantics

2.2.1 Lexical resources and computation for Chinese synonyms

The number of words shared by statement pairs is the most commonly used feature to judge entailment. Identifying words that are shared literally is a direct way to compute word overlaps. Indeed, in previous RTE and RITE events, organizers provided baseline systems which calculated character overlaps to determine entailment (Bar-Haim et al. 2006; Stern et al. 2011).

In practice, people may express the same or very similar ideas with synonyms and near synonyms, so their identification is also very important. The following statements are very close in meaning though they do not use exactly the same words.

  1. (1)

    Tamara is reluctant to raise this question.

  2. (2)

    Tamara hesitates to ask this question.

Translating this pair into Chinese will also show the importance of identifying synonyms.

  1. (3)

    Tamara [figure ac]

  2. (4)

    Tamara [figure ad]

The literature offers abundant methods to compute synonyms for English, particularly those that compute the similarity between words based on WordNet (Budanitsky and Hirst 2006). In contrast, we have yet to find a good way to compute synonyms for Chinese.

To compute synonyms for a given word, we rely on both existing lexicons and computing methods. We acquired a dictionary for synonyms and antonyms from the Ministry of Education (MOE) of Taiwan. This MOE dictionary lists 16,005 synonyms and 8625 antonyms.

We could employ the Extended HowNet (E-HowNet), which can be considered an extended WordNet for Mandarin Chinese, to look up synonyms of Chinese words. The 2012 version of E-HowNet contains 88,079 traditional Chinese words and provides synonyms of Chinese words, so we could use its synonym lists directly. E-HowNet lists 38 synonymous words that carry the concept of “hesitate”. In this particular case, we would be able to tell that “[figure af]” in statement (3) and “[figure ag]” in statement (4) are synonymous according to the list in E-HowNet. However, “[figure ah]” in statement (3) does not belong to the synonym list of “[figure aj]” in statement (4). “[figure ak]” is similar to “raise” in English: one can raise a question or a concern, so “raise” alone does not necessarily relate to asking questions.

We could also use the definitions for words in E-HowNet to estimate the relatedness between two Chinese words by their taxonomical relations and semantic relations (Chuang et al. 2012; Chen 2013; Huang and Liu 2013). In this work, we converted the definition of a word into a “definition tree”, e.g., Fig. 1, according to the taxonomy in E-HowNet. Each node represents a primitive unit, a function word, or a semantic role. Considering each internal node in a definition tree as a root, we built a collection of subtrees of the definition tree. In Fig. 1, there are 15 nodes.

Fig. 1 A definition tree for “[figure al]” (metallurgy)

The DICE coefficient between the collections of subtrees of two definition trees is used to measure the degree of relatedness of two definitions. Given two collections, e.g., \(X\) and \(Y\), the DICE coefficient is defined in Eq. (1), where \(\vert X\vert \) is the number of elements in \(X\).

$$\begin{aligned} \mathrm{DICE}(X,\;Y)=\frac{2\left| {X\cap Y} \right| }{\vert X\vert +\vert Y\vert } \end{aligned}$$
(1)

Due to the definition, a DICE coefficient must fall in the range of [0, 1]. Two definitions are considered synonymous if their DICE coefficient is larger than a threshold, for which we chose to use 0.88 based on a small-scale experiment.
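The following is a minimal sketch of this computation, assuming that each E-HowNet definition is represented as a nested tuple of the form (label, child, ...) and that subtrees are compared by exact structural equality; it is an illustration of the idea rather than our actual implementation.

```python
def subtrees(tree):
    """Collect the subtrees rooted at every node of a nested-tuple definition tree."""
    label, *children = tree
    collected = [tree]
    for child in children:
        collected.extend(subtrees(child))
    return collected

def dice(x, y):
    """DICE coefficient of Eq. (1) between two collections, treated here as sets."""
    x, y = set(x), set(y)
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

def definitions_synonymous(tree_a, tree_b, threshold=0.88):
    """Two definitions are taken as synonymous when their subtree DICE exceeds the threshold."""
    return dice(subtrees(tree_a), subtrees(tree_b)) > threshold
```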

Computing Chinese synonyms only with information in dictionaries is an imperfect method. Chinese text contains out-of-vocabulary (OOV) words a lot more frequently than English text. For these OOV words, dictionary-based methods cannot always help.

2.2.2 Chinese antonyms and negation words

We consider two ways to express opposite meanings. The first is antonyms, e.g., “good” vs. “bad”; and the second is through negation words, e.g., “good” and “not good”.

We relied on the lists of antonyms provided by the MOE dictionary (cf. Sect. 2.2.1). Since there are only 8625 words in the antonym lists in the dictionary, we can handle only a very small number of antonyms at this moment.

We created a list of negation words based on our own judgment. This list includes “[figure am]”, “[figure an]”, “[figure ao]”, “[figure ap]”, and “[figure aq]”. Note that we consider “[figure ar]”, “[figure as]”, “[figure at]”, and “[figure au]” to be negation words only when they appear as individual words after segmentation. Hence, we will handle words like “[figure av]” correctly. This list allows us to find that statements (5) and (6) (used in NTCIR-10 RITE-2) have opposite meanings.

  1. (5)
    figure aw
  2. (6)
    figure ax

We could also handle other negation words like “[figure ay]”, “[figure az]”, “[figure ba]”, and “[figure bb]”. However, this heuristic list is as yet unable to handle all possible Chinese negation words correctly. A more complex word like “[figure bc]” would need special attention in our system: a direct application of our heuristic list would treat this word as containing two negations, but the word is not really related to negation.
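The following is a minimal sketch of the negation check; the word list shown is a hypothetical stand-in (our actual list items are rendered as figures above), and counting only standalone tokens after segmentation reflects the safeguard described in this subsection.

```python
# Hypothetical negation list; the hand-crafted list used in our system is shown as figures above.
NEGATION_WORDS = {"不", "没", "没有", "未", "非"}

def count_negations(tokens):
    """Count negation words among segmented tokens; characters embedded in longer words are not counted."""
    return sum(1 for token in tokens if token in NEGATION_WORDS)
```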

2.2.3 Named entity and verb recognition

Among the parts of speech in almost all languages, nouns and verbs are the essential parts for understanding the core meanings of sentences. Information about named entities such as persons, locations, organizations, and times is crucial for inferring relationships between statements. A software tool for named entity recognition (NER) not only annotates words in a sentence as nouns but also subcategorizes them as persons, locations, organization names, and time specifications. Although current technologies for NER do not offer perfect performance, being able to carry out NER even partially paves a way to handle typical questions regarding the five Ws (What, When, Where, Why, Who). We employed S-MSRSeg, a tool for named entity recognition developed by Microsoft Research (Gao et al. 2005).

Verbs provide information about the actions or states that a given sentence describes. Recognizing verbs for a sentence pair is thus useful. We employed the Stanford parser (Levy and Manning 2003) to do the tagging of parts of speech. Although it is possible to consider sub-categorization of verbs, we did not do so in the current study.

2.3 Syntactic features

We parsed the Chinese statements with the Stanford parser (Levy and Manning 2003) to obtain the parse trees and the part-of-speech (POS) tags for words. A parse tree of a sentence reveals important information about the meaning of the sentence. At this moment, we used the parsing results to do two types of comparisons. The first was to compare the similarity between the parse trees of \(T\) and \(H\) with the same method (the DICE coefficient) that we used to compare the definition trees of different senses, as explained in Sect. 2.2.1. We also compared the collections of POS tags of the two sentences, particularly the tags for verbs.

Based on our experience, the Stanford parser works better for simplified Chinese than for traditional Chinese. Hence, we converted statements of traditional Chinese into simplified Chinese before the parsing step in our procedures (cf. Sect. 2.1.1).

We noticed that the Stanford parser did not always produce the best or even correct parse trees for the given statements. The parser ranked candidate parse trees with probabilistic models, and produced the trees with leading scores. Although we could request more than one parse tree for a given statement, we chose to use only the top-ranked tree for computational efficiency of our systems.

2.4 Semantic features

It is preferable to employ higher level information about statement pairs to judge their entailment relationships. After considering information available at the lexical and syntactic levels, semantic features immediately came to mind. However, there are multiple ways to define and represent sentential semantics. Frame semantics is a conceivable choice (Fillmore 1976; Burchardt et al. 2009), for instance. In this work, we explored an application of dependency structures (Chang et al. 2009).

Linguists consider the context of words a very important factor to define meaning. “You shall know a word by the company it keeps” (Firth 1957) or similar arguments (e.g., Firth 1935; Harris 1954) are commonly cited in courses on linguistics. “One sense per discourse, one sense per collocation” (Yarowsky 1995) appears in the literature in computational linguistics very frequently. For this reason, using vector space models to capture contextual information has become one of the standard approaches in both natural language processing and information retrieval.

In our work, we explored an application of dependency structures to capturing the contextual information in a sentence. There are different ways to apply the dependency structures for inferring entailment relationships, and we note that Day et al. also employed the tree-edit distances of dependency structures in NTCIR-10 RITE-2 (Day et al. 2013).

We illustrate our methods with a short English example, “We consider dependency structures for inferring textual entailment”, to make the example more easily understandable to non-Chinese speakers. We list the typed and collapsed dependencies of this statement below. A dependency relation is expressed in the format of relation-name (governor, dependent), where both governor and dependent are words appended with their positions in the sentence.

  • nsubj(consider-2, We-1)

  • root(ROOT-0, consider-2)

  • amod(structures-4, dependency-3)

  • dobj(consider-2, structures-4)

  • prepc_for(consider-2, inferring-6)

  • amod(entailment-8, textual-7)

  • dobj(inferring-6, entailment-8)

We can ignore the root node and build a matrix to encode the direct relationships between words, as shown in Table 1. The column headings show the governors, and the row headings show the dependents. A cell is 1 if there is a relationship from the dependent to the governor. Hence, ignoring the relation name, the cell (We, consider) is 1 because of nsubj(consider-2, We-1). Notice that the matrix is not symmetric because of the different functions of words in the relationships.

Table 1 Matrix form for encoding dependency structures

The matrix, denoted by \(R\), encodes the holistic relationships between words in a statement, and can be considered a way to represent the context of words in a given statement. There are many similar applications of such matrices in computer science, e.g., for modeling connectivity between web pages (Page et al. 1998) and for modeling traffic networks (Liu and Pai 2006).

As \(R\) encodes only the direct relationships between words, we can compute the powers of \(R\) to explore the indirect relationships between the words. For example, a “1” in the second power of \(R\), \(R^{2}\), shows that there is a one-step indirect relationship between two words. If we compute the second power of the matrix in Table 1, we will find that the cell with “dependency” as the row heading and with “consider” in the column heading is 1—suggesting the idea of “consider dependency” in the statement. When we compute higher powers of \(R\), we will find fewer “1”s in the matrices because there are fewer word pairs with very remote indirect relationships.

Based on such observations, we explored the possibility of encoding the sentential context with the union of the powers of \(R\) for a statement. In the reported experiments in this paper, we chose to compute the XR matrix, defined in Eq. (2), for a given statement. A cell in XR will be 1 if the cell at the corresponding positions in any of the first five powers of \(R\) is 1.

$$\begin{aligned} XR = R \cup R^2\cup R^3\cup R^4\cup R^5 \end{aligned}$$
(2)
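The following is a minimal numpy sketch of this construction, assuming the dependencies are supplied as (governor, dependent) pairs like the list above; boolean matrix products stand in for the powers of \(R\) in Eq. (2).

```python
import numpy as np

def dependency_matrix(dependencies, words):
    """Build R: rows are dependents, columns are governors, as in Table 1."""
    index = {w: i for i, w in enumerate(words)}
    R = np.zeros((len(words), len(words)), dtype=bool)
    for governor, dependent in dependencies:
        R[index[dependent], index[governor]] = True
    return R

def xr_matrix(R, max_power=5):
    """XR = R ∪ R^2 ∪ ... ∪ R^5 (Eq. 2), computed with boolean matrix products."""
    xr = R.copy()
    power = R.copy()
    for _ in range(max_power - 1):
        power = (power.astype(int) @ R.astype(int)) > 0  # next power of R, kept boolean
        xr |= power
    return xr

# The running example, with ROOT dropped and word positions omitted.
words = ["We", "consider", "dependency", "structures", "inferring", "textual", "entailment"]
deps = [("consider", "We"), ("structures", "dependency"), ("consider", "structures"),
        ("consider", "inferring"), ("entailment", "textual"), ("inferring", "entailment")]
XR = xr_matrix(dependency_matrix(deps, words))
```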

3 Classification methods

Although machine learning algorithms are the most obvious choice for classification problems, including the recognition of textual entailment (Dagan et al. 2009), the training data available at NTCIR-10 RITE-2 was not large enough to make us comfortable relying on this intuitive avenue alone. Hence, in addition to applying support vector machines, we also devised our own parameterized heuristic functions to make classification decisions. The parameters are tuned with the training data, so, technically, our first approach can still be considered a machine learning-based method.

3.1 Trained heuristic functions

We explain the individual factors that we considered in our heuristic function in the following subsections.

3.1.1 Word overlap

Character overlap was used in the baseline systems in previous RTE (Bar-Haim et al. 2006) and RITE evaluations (Stern et al. 2011). Perhaps, for this reason, word overlap may be the most common feature used by participating teams in these events.

Since our goal is to judge whether \(T\) entails \(H\), we would like to know the proportion of words in \(H\) that also appear in \(T\). In addition, we consider word overlap rather than character overlap. The difference is important because Chinese words consist of Chinese characters: some words contain just one character, but most contain multiple characters. Hence, we must segment the given statements to compute their word overlap. The word overlap between \(T\) and \(H\) is defined in Eq. (3), where \(W(T)\) and \(W(H)\), respectively, denote the bags of words of \(T\) and \(H\) after the segmentation step (cf. Sect. 2.1.3). We borrow the symbol for set intersection, \(\cap \), to indicate the common part of two bags of words. We represent the size of a bag of words by surrounding the notation for the bag with vertical bars, e.g., \(\vert W(T)\vert \).

$$\begin{aligned} \mathrm{WOL}(T,H)=\frac{\left| {W(T)\cap W(H)} \right| }{\left| {W(H)} \right| } \end{aligned}$$
(3)
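The following is a minimal sketch of Eq. (3), assuming the statements have already been segmented into token lists; the Counter intersection implements the bag-of-words overlap.

```python
from collections import Counter

def word_overlap(t_tokens, h_tokens):
    """WOL(T, H) of Eq. (3): the fraction of H's word tokens that also appear in T."""
    overlap = Counter(t_tokens) & Counter(h_tokens)  # multiset intersection
    return sum(overlap.values()) / len(h_tokens)
```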

Assume that we segment the statements in sentences (5) and (6) to obtain (7) and (8), respectively. Their word overlap will be 6/7, and their character overlap will be 14/16.

  1. (7)
    figure bd
  2. (8)
    figure be

3.1.2 Missing named entities

The intuition is: if some named entities in \(H\) are missing in \(T\), then it may be less likely for \(T\) to entail \(H\). Hence, we measure the missing named entities (MNE) in Eq. (4), where count(\(T\), \(H\)) is the number of named entities that appear in \(H\) but not in \(T\). Namely, let NE(\(T\)) and NE(\(H\)), respectively, denote the collections of named entities in \(T\) and \(H\); count(\(T\),\(H\)) is then the number of elements in NE\((H)\backslash (\mathrm{NE}(T)\cap \mathrm{NE}(H))\).

$$\begin{aligned} \mathrm{MNE}(T,H)=\alpha \times \mathrm{count}(T,H) \end{aligned}$$
(4)

The value of \(\alpha \) would be selected with a training procedure.

3.1.3 Imbalanced negations

The following statement pair appeared in the development set of NTCIR-10 RITE-2.

  1. (9)
    figure bf
  2. (10)
    figure bg

These statements convey opposite meanings because of the negation word “[figure bh]”. Hence, we consider a penalty term for imbalanced negations, IN(\(T\),\(H\)) in Eq. (5), based on the numbers of negation words (cf. Sect. 2.2.2) in both \(T\) and \(H\), where \(\vert \mathrm{NEG}(T)\vert \) and \(\vert \mathrm{NEG}(H)\vert \) are the numbers of negation words in \(T\) and \(H\), respectively.

$$\begin{aligned} \mathrm{IN}(T,H)=\left\{ {{\begin{array}{l@{\quad }l} {\beta ,} &{} {\vert \mathrm{NEG}(T)\vert \ne \vert \mathrm{NEG}(H)\vert } \\ {0,} &{} {\text{ otherwise }} \\ \end{array} }} \right. \end{aligned}$$
(5)

The value of \(\beta \) would be selected with a training procedure.

3.1.4 Occurrence of antonyms

As an extension of the consideration of negation words, the occurrence in \(T\) of antonyms of words in \(H\) indicates that the statement pair is unlikely to form an entailment. Hence, we considered the following factor in our heuristic function.

$$\begin{aligned} \mathrm{OA}(T,H)=\left\{ {{\begin{array}{l@{\quad }l} {\gamma ,} &{} {\{t \mid t\in W(T),\;h\in W(H),\;t \text{ is } \text{ an } \text{ antonym } \text{ of } h\}\ne \emptyset } \\ {1,} &{} {\text{ otherwise }} \\ \end{array} }} \right. \end{aligned}$$
(6)

The value of \(\gamma \) was in the range of [1, 2] and would be selected with a training procedure.

3.1.5 An integrated heuristic decision function

Putting Eqs. (3), (4), (5), and (6) together, we have the following score function for whether \(T\) entails \(H\).

$$\begin{aligned} s(T,H)=\frac{\mathrm{WOL}(T,H)-\mathrm{MNE}(T,H)-\mathrm{IN}(T,H)}{\mathrm{OA}(T,H)} \end{aligned}$$
(7)

Relying on intuitive hunches, we subtract or divide by the scores for the negative factors; we admit that these arrangements are neither scientific nor normative.

In some cases, the order of named entities influences the entailment relationships. The following statement pair shows an extreme example.

  1. (11)
    figure bi
  2. (12)
    figure bj

Their word overlap is perfect, but they express almost opposite information. Because of such observations, we also considered the order of named entities (ONE) in our heuristics when \(s(T,H)\) is large enough.

We define a penalty term for the order of named entities in Eq. (8).

$$\begin{aligned} \mathrm{ONE}(T,H)=\left\{ {{\begin{array}{l@{\quad }l} {\delta ^\tau ,} &{} {s(T,H)\ge \lambda } \\ {1,} &{} {\mathrm{otherwise}} \\ \end{array} }} \right. \end{aligned}$$
(8)

The value of \(\tau \) is the number of pairs of named entities in \(T\) and \(H\) that appear in different orders. In sentences (11) and (12), the named entities “[figure bk]” and “[figure bl]” appear in different orders, so \(\tau \) will be one in this instance. The values of \(\delta \), which is in the range of [1, 2], and \(\lambda \) would be selected with the training data.

Integrating Eqs. (7) and (8), we obtain the following heuristic decision function, which we used in the NTCIR-10 RITE-2 task. If Score(\(T\), \(H\)) exceeds a chosen threshold, \(E\), we determine that \(T\) entails \(H\).

$$\begin{aligned} \mathrm{Score}(T,H)=\frac{s(T,H)}{\mathrm{ONE}(T,H)} \end{aligned}$$
(9)
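The following sketch assembles Eqs. (3) through (9) into one decision procedure. It is an illustration under simplifying assumptions rather than our system's actual code: the parameter values come from Table 4 (not reproduced here), the antonym lookup is a plain set of word pairs, and the count of reordered named-entity pairs is implemented as an inversion count.

```python
from collections import Counter
from itertools import combinations

def reordered_entity_pairs(t_entities, h_entities):
    """tau of Eq. (8): pairs of shared named entities whose relative order differs in T and H."""
    shared = [e for e in h_entities if e in t_entities]
    t_pos = {e: t_entities.index(e) for e in shared}
    h_pos = {e: h_entities.index(e) for e in shared}
    return sum(1 for a, b in combinations(shared, 2)
               if (t_pos[a] - t_pos[b]) * (h_pos[a] - h_pos[b]) < 0)

def score(t, h, p, antonyms):
    """Score(T, H) of Eq. (9). `t` and `h` hold pre-extracted tokens, named entities (in order),
    and negation counts; `p` holds alpha, beta, gamma, delta, lambda_, and E (cf. Table 4);
    `antonyms` is a set of frozenset word pairs, e.g., built from the MOE antonym lists."""
    overlap = Counter(t["tokens"]) & Counter(h["tokens"])
    wol = sum(overlap.values()) / len(h["tokens"])                     # Eq. (3)
    mne = p["alpha"] * len(set(h["entities"]) - set(t["entities"]))    # Eq. (4)
    in_pen = p["beta"] if t["neg_count"] != h["neg_count"] else 0.0    # Eq. (5)
    has_antonym = any(frozenset((wt, wh)) in antonyms
                      for wt in t["tokens"] for wh in h["tokens"])
    oa = p["gamma"] if has_antonym else 1.0                            # Eq. (6)
    s = (wol - mne - in_pen) / oa                                      # Eq. (7)
    tau = reordered_entity_pairs(t["entities"], h["entities"])
    one = p["delta"] ** tau if s >= p["lambda_"] else 1.0              # Eq. (8)
    return s / one                                                     # Eq. (9)

# T is judged to entail H when score(t, h, p, antonyms) > p["E"].
```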

3.1.6 A brief critical review

In the previous subsections, we introduced the individual terms of the final heuristic decision function in Eq. (9). We tried to justify the influence of each individual term with isolated observations, so each term may look reasonable. Unfortunately, real-world statements can be complex, and may demand a deeper understanding of the statement pairs to determine whether or not their entailment relationships hold.

Consider the following statement pair which appeared in the NTCIR-10 RITE-2 development set.

  1. (13)
    figure bm
  2. (14)
    figure bn

We have a pair of antonyms, i.e., “[figure bo]” and “[figure bp]”. We also observe that a pair of named entities appears in reversed order in the statement pair, i.e., “[figure bq]” and “[figure br]”. The existence of antonyms and the reversed order of a named-entity pair were each considered a negative factor against the holding of an entailment relationship in the previous subsections, where we discussed them separately. However, in this case, when both negative factors occur, they cancel each other out, and this statement pair can be considered a pair of paraphrased statements. As a consequence, our heuristic function fails to work for them.

Despite such practical challenges, Eq. (9) is indeed the decision function that we employed to achieve the second positions in the BC subtasks for both TC and SC in NTCIR-10 RITE-2. We will provide details about its performance shortly.

3.2 Machine learning methods

We considered more features when we ran experiments that employed techniques of support vector machines, decision trees, and linearly weighted models.

3.2.1 The candidate features

We considered 17 candidate features that are listed in Table 2, where we use \(X\) to denote a sentence and \(x\) to denote a word in \(X\) in the following definitions.

  1. “Num” and “Bool”, respectively, denote “numeric” and “Boolean” in the Type column.

  2. \(W(X)\): the collection of words of a sentence \(X\) (after segmentation).

  3. \(S\cap T\): the collection of elements that appear in both collection \(S\) and collection \(T\).

  4. \(\vert S\vert \): the number of elements in the collection \(S\).

  5. NE\((X)\): the collection of named entities in a sentence \(X\).

  6. ANT\((x)\): the collection of antonyms of a word \(x\).

  7. NEG\((X)\): the collection of negation words in a sentence \(X\).

  8. SYN\((x)\): the collection of synonyms or near synonyms of \(x\).

  9. VERB\((X)\): the collection of POS tags of the verbs in \(X\) (cf. Sect. 2.3).

  10. XR\((X)\): the XR matrix of \(X\) (cf. Sect. 2.4).

Many of the features listed in Table 2 are derivations of the basic features that we discussed in Sects. 2.2, 2.3, and 2.4. Others were selected for similar rationales, so we do not repeat the same reasoning, and explain their derivations only briefly below.

F1::

This is the word overlap discussed in Sect. 3.1.1.

F2::

This is the count(\(T,H\)) in Sect. 3.1.2.

F3::

This feature is defined in Sect. 3.1.5.

F4::

This feature is similar to the word overlap that we discussed in Sect. 3.1.1, except that we consider the antonyms here.

F5::

This feature measures whether \(T\) and \(H\) have the same number of negation words (cf. Sect. 3.1.3).

F6::

We consider the number of words in \(T\) and \(H\). These are typical features for all RITE and RTE systems (Shima et al. 2012).

F7::

We examine whether \(T\) is longer than \(H\). This is also a typical feature for RITE and RTE systems (Shima et al. 2012).

F8::

This is Eq. (1) (cf. Sect. 2.2.1).

F9::

This feature calculates the overlap of the named entities that we discussed in Sect. 2.2.3.

F10::

These features record the quantities of named entities in \(T\) and \(H\) (cf. Sect. 2.2.3).

F11::

These features record the quantities of negation words in \(T\) and \(H\) (cf. Sect. 2.2.2).

F12::

This feature calculates the overlap of negation words in \(T\) and \(H\) (cf. Sect. 2.2.2).

F13::

This feature records the overlap of synonyms in \(T\) and \(H\) (cf. Sect. 2.2.1).

F14::

This feature records the proportion of synonyms in \(H\) (cf. Sect. 2.2.1).

F15::

Mimicking the principle of calculating word overlaps, this feature records the proportion of common verbs in \(T\) and \(H\) (cf. Sect. 2.2.3).

F16::

This feature was discussed in Sect. 2.4.

F17::

Analogous to the principle of computing word counts in \(T\) and \(H\) (F6), these features record the number of verbs in \(T\) and \(H\).

Table 2 List of candidate features for machine learning-based classifiers

Notice that we would consider counts for both \(T\) and \(H\) when we adopted F6, F10, F11, and F17. These counts are numeric features for the statements, and we thought it would be unreasonable to consider just the count for an individual statement in the statement pairs.

3.2.2 The classifiers: SVMs, decision trees, and linearly weighted models

We employed the libSVM library for SVMs (Chang and Lin 2011) and Weka for decision trees and linearly weighted functions for classification (Witten et al. 2011).

We used the radial basis function as the kernel function in libSVM, and tuned the parameters with standard methods recommended by Chang and Lin (2011). The values of the features were also normalized as recommended.
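We used libSVM directly; as an illustrative stand-in, the scikit-learn sketch below shows the same recipe of feature normalization, an RBF kernel, and a grid search over C and gamma (the exponential grids shown here are the commonly recommended defaults, not necessarily the exact grid we searched).

```python
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# X: one row of Table 2 features per statement pair; y: 1 for "Y" (entailment), 0 for "N".
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [2.0 ** k for k in range(-5, 16, 2)],
              "svc__gamma": [2.0 ** k for k in range(-15, 4, 2)]}
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy")
# search.fit(X_train, y_train)          # X_train and y_train are assumed to be prepared elsewhere
# predictions = search.predict(X_test)
```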

We utilized the packages for learning decision trees and linearly weighted models with the default settings in Weka, and did not attempt to change the parameters of the packages.

When using the linearly weighted functions to judge the entailment relationship of a statement pair, we computed the score of the statement pair with a linearly weighted function. This function considered the features that are listed in Table 2. A statement pair whose score was larger than 0.5 was considered to have an entailment relationship. We let the learning package find the coefficients that would optimize the classification results. In essence, this procedure of using linearly weighted functions is quite similar to our using heuristic functions in Sect. 3.1.

4 Empirical evaluations

We applied the aforementioned features and classification methods to participate in the BC subtask of the NTCIR-10 RITE-2 task, and achieved the second positions for both traditional Chinese (TC) and simplified Chinese (SC). Since the winning teams of the TC and SC tracks were different, we have good reason to believe that our system is relatively more robust in its performance.

In this long section, we provide information about the data sources in Sect. 4.1, and explain the methods for typical RITE evaluations in Sect. 4.2. The results of our participation in NTCIR-10 are reported in Sect. 4.3. Due to time constraints, we did not choose the parameters for our heuristic functions (cf. Sect. 3.1) systematically when we participated in the evaluation tasks. We have extended our work afterwards, and the results are presented in Sects. 4.4 through 4.7. The purpose of conducting these new experiments was to check how different approaches and different data sets influenced the observed results. Some additional discussions about the results are provided in Sect. 5.

4.1 Data sources

By participating in the NTCIR-10 RITE-2 task, we obtained a development data set for training purposes and a test data set for formal runs. We could also download the test data set for NTCIR-9 RITE. Table 3 shows the statistics of the provided data for RITE tasks.

Table 3 Quantities of statement pairs in the RITE and RITE-2 data sets.

The development data set contains pairs of statements that are annotated with the correct answers as to whether or not the first statement entails the second. We list a positive pair (with a “Y” label) and a negative pair (with an “N” label) below. In Table 3, we show the number of “Y” pairs and “N” pairs.

figure bs

4.2 Evaluation metrics

We use the evaluation metrics adopted by the NTCIR-10 RITE-2 tasks. They are standard definitions of accuracy, precision rate, recall rate, and the F1 measure (Watanabe et al. 2013a).

Accuracy is the proportion of the correct classifications among all predicted classifications. Y-precision is the proportion of true Y pairs among all pairs that are classified as Y. Y-recall is the proportion of true Y pairs among all pairs that are actually Y. N-precision is the proportion of true N pairs among all pairs that are classified as N. N-recall is the proportion of true N pairs among all pairs that are actually N. The F1 measure of a category is 2 × precision × recall divided by the sum of precision and recall. MacroF1 is the average of the F1 measures of the Y category and the N category.
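The following is a minimal sketch of these metrics, computed directly from gold and predicted labels ("Y"/"N").

```python
def evaluate(gold, predicted):
    """Accuracy, per-class precision/recall/F1 for labels 'Y' and 'N', and MacroF1."""
    scores = {"accuracy": sum(g == p for g, p in zip(gold, predicted)) / len(gold)}
    f1_values = []
    for label in ("Y", "N"):
        tp = sum(g == p == label for g, p in zip(gold, predicted))
        predicted_as = sum(p == label for p in predicted)
        actually = sum(g == label for g in gold)
        precision = tp / predicted_as if predicted_as else 0.0
        recall = tp / actually if actually else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "F1": f1}
        f1_values.append(f1)
    scores["MacroF1"] = sum(f1_values) / len(f1_values)
    return scores
```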

4.3 NTCIR-10 RITE-2 evaluation task

Table 4 shows the parameters that we used for our heuristic functions to participate in NTCIR-10 RITE-2. The meanings of \(E\), \(\alpha \), \(\beta \), \(\gamma \), \(\lambda \), and \(\delta \) were explained in Sect. 3.1. We chose the values of these parameters based on observed results of some experiments that we conducted with the development set. A statement pair, \(T\) and \(H\), whose score(\(T\),\(H)\), defined in Eq. (9), exceeded the value of \(E\) would be considered to have the entailment relationship. The aim at our training stage was to optimize accuracy.

Table 4 Parameters for the heuristic decision functions

At the time when we submitted our results, we wanted to study the effects of considering synonyms in computing word overlap. Hence, we submitted two runs of classifications that were obtained by two procedures that differed only in whether synonyms were considered as overlapped words. The formal run that was obtained when we considered synonyms was MIG-2, and the formal run that intentionally ignored synonyms was MIG-1.

When we had to submit the results for formal runs, we had just begun to try machine learning-based models. At that moment, we only tried SVMs and decision trees with a specific set of features. We employed F1, F2, F6, F7, F8, F10, F11, F12, F13, and F14 for TC (cf. Table 2), and F1, F2, F6, F7, F8, F9, and F10 for SC. Using the 10-fold cross-validation on the development set with SVM models, we observed 71.46 % in accuracy for TC and 75.55 % for SC, and we submitted a run with these configurations. The results obtained with such SVM models were coded MIG-3.

Table 5 lists the results of MIG-1, MIG-2, and MIG-3, along with the results of the best performing team, IASL-2 (Shih et al. 2013), for TC. Table 6 lists the results of MIG-1, MIG-2, and MIG-3, along with the results of the best performing team, bcNLP-3 (Wang et al. 2013), for SC. We do not show percentage signs in Tables 5 and 6 and all the remaining tables to save space.

Table 5 Partial results of BC subtask for TC in NTCIR-10 RITE-2
Table 6 Partial results of BC subtask for SC in NTCIR-10 RITE-2

The performance values of IASL-2 and MIG-2 are really close to each other in Table 5. In contrast, although MIG-2 achieved the second best performance for SC, there were big gaps between the performance values of bcNLP-3 and MIG-2 in Table 6.

The MacroF1 values in Tables 5 and 6 indicate that considering synonyms in calculating word overlap helped MIG-2 to perform better than MIG-1 in the evaluation of both TC and SC.

Many may be disappointed that the SVM-based models did not achieve the best performance. None of the leading teams, including IASL-2, bcNLP-3, and MIG-2, used SVMs. The best performing systems that used SVMs were MIG-3 in Table 5 for TC and CYUT-3 (Wu et al. 2013) in Table 6 for SC; both achieved third place. IMTKU-1 (Day et al. 2013) used SVM-based models as well, and performed similarly to MIG-3 in the TC subtask. We suspect that the relatively small size of the available training data, listed in Table 3, may have contributed to this phenomenon. We will discuss this issue further in Sect. 5.2.

4.4 More experiments for the heuristic functions

We relied on limited experimental results to select the combinations of the parameters for the heuristic function, and chose to use the combination listed in Table 4. After NTCIR-10, we had the opportunity to run a more exhaustive grid search for the parameters.

Using the settings in Table 4 as seeds, we chose a range for each of the parameters, and ran experiments on all possible combinations of the parameters with the development set of NTCIR-10 RITE-2. The ranges and increments for all parameters are listed in Table 7. Notice that the ranges contain the values listed in Table 4. Although the selections of the ranges and increments remained arbitrary, the searched region was quite large: it contained more than 317 million parameter combinations for each of TC and SC, so we ran more than 634 million experiments in total with the development set.

Table 7 Ranges and increments for the grid search
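The following is a minimal sketch of this exhaustive grid search; the parameter ranges are assumed to be supplied as (start, stop, step) triples mirroring Table 7, and dev_accuracy is a stand-in for running the heuristic classifier of Sect. 3.1 on the development pairs.

```python
import itertools
import numpy as np

def grid_search(param_ranges, dev_pairs, dev_accuracy):
    """Try every parameter combination in the grid and keep the one with the best accuracy.
    `param_ranges` maps each parameter name to (start, stop, step); `dev_accuracy(params, dev_pairs)`
    runs the heuristic classifier with those parameters and returns its accuracy."""
    names = list(param_ranges)
    grids = [np.arange(start, stop + step / 2, step)
             for start, stop, step in param_ranges.values()]
    best_params, best_accuracy = None, -1.0
    for values in itertools.product(*grids):
        params = dict(zip(names, values))
        accuracy = dev_accuracy(params, dev_pairs)
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy
```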

Within the region described in Table 7, we found some combinations of these parameters that would help us achieve higher accuracies than those listed in Table 4. Table 8 lists such new settings for TC, and Table 9 lists such new settings for SC. If we had used an exhaustive search for the parameters, we would have used the combinations in Tables 8 and 9 to participate in NTCIR-10 RITE-2, rather than using the combinations listed in Table 4. Note that we intentionally repeated the settings listed in Table 4 in Table 8, i.e., C6, and in Table 9, i.e., C12, to facilitate comparison between results.

Table 8 Best combinations of parameters for TC
Table 9 Best combinations of parameters for SC

We used the settings in Table 8 to run experiments on the TC test set of NTCIR-10 RITE-2. Recall that when we submitted our classification results for formal runs, MIG-1 did not consider synonyms for counting word overlap, but MIG-2 did. We did not consider synonyms in experiments to obtain the results in Table 10, and considered synonyms to obtain Table 11.

Table 10 Using settings in Table 8 but no synonyms for TC in NTCIR-10 RITE-2
Table 11 Using settings in Table 8 and synonyms for TC in NTCIR-10 RITE-2

Comparing the MacroF1 values in Table 10 with that of MIG-1 in Table 5, we find that using any of the five new settings would help us achieve better MacroF1 scores, but only marginally. Comparing the MacroF1 values in Table 11 with that of MIG-2 in Table 5, we see that using three of the five new settings would help us improve the MacroF1 scores. Using two of these new settings, i.e., C2 and C4, would actually help us achieve the best MacroF1 in formal runs. Nevertheless, we note that the improvements were not very significant.

We used the settings in Table 9 to run experiments on the SC test set of NTCIR-10 RITE-2. Tables 12 and 13, respectively, list the results of not considering and considering synonyms. Although the new settings achieved better accuracies for the development data than the settings listed in Table 4, they did not provide better performance for the test data.

Table 12 Using settings in Table 9 but no synonyms for SC in NTCIR-10 RITE-2
Table 13 Using settings in Table 9 and synonyms for SC in NTCIR-10 RITE-2

Considering synonyms in computing word overlaps would lead to better performance for TC subtasks in NTCIR-10 RITE-2. The corresponding MacroF1 values in Table 11 are better than those in Table 10. In contrast, considering synonyms did not lead to consistent improvements in MacroF1 scores for SC subtasks in NTCIR-10 RITE-2. The corresponding MacroF1 scores in Table 13 are not necessarily higher than those in Table 12.

4.5 More experiments for the machine learning-based models

In Sect. 4.3, we reported results of using an SVM model with a set of features that were chosen based on some small-scale experiments. Since the size of the training data is not large and we listed only 17 candidate features in Table 2, it is feasible to try all possible combinations of the 17 features with a classification model to pinpoint the combination that produces the best classification results for the training data. The number of such experiments is 2\(^{17}\), which is 131,072.
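The following is a minimal sketch of the feature-subset enumeration, assuming a feature matrix with the 17 columns of Table 2 and a cross_val_accuracy routine that wraps 10-fold cross-validation for the chosen classifier.

```python
import itertools

def best_feature_subset(X, y, cross_val_accuracy):
    """Enumerate every non-empty subset of the candidate features and return the subset
    with the highest cross-validation accuracy; `X` has one column per Table 2 feature."""
    n_features = X.shape[1]
    best_subset, best_accuracy = None, -1.0
    for size in range(1, n_features + 1):
        for subset in itertools.combinations(range(n_features), size):
            accuracy = cross_val_accuracy(X[:, list(subset)], y)
            if accuracy > best_accuracy:
                best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy
```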

We actually executed just such a brute-force search for SVMs, decision trees, and linearly weighted functions with the TC and SC development set of NTCIR-10 RITE-2. Tables 14 and 15 list the selected sets of features along with the accuracies observed in the 10-fold cross-validation learning processes, where SVM, DT, and LM, respectively, denote SVMs, decision trees, and linearly weighted models. Recall that, in Sect. 4.3, the selected feature set for SVMs led to 71.46 % in accuracy for TC and 75.55 % for SC at training time, both of which are not very different from their counterparts in Tables 14 and 15.

Table 14 Feature selection with the TC development set of NTCIR-10 RITE-2
Table 15 Feature selection with the SC development set of NTCIR-10 RITE-2

Comparing Tables 14 and 15, we can see that the best combination of features varies with the language and the nature of the classifiers.

Having identified the best features for different classifiers with the development dataset, we ran the classifiers based on linear models on the test dataset of NTCIR-10 RITE-2. Tables 16 and 17 list the results for TC and SC, respectively. A comparison between the MacroF1 scores in Table 16 and the MacroF1 of MIG-3 in Table 5 shows that none of the classifiers that used the new feature sets outperformed the SVM model which we used in the TC subtask in NTCIR-10 RITE-2. In contrast, the MacroF1 scores in Table 17 are significantly better than the MacroF1 of MIG-3 in Table 6. Nevertheless, even after such improvements, these new results would not be good enough to be listed among the top five results for the SC subtask in NTCIR-10 RITE-2.

Table 16 Results of using LM and the best feature sets for TC test set of NTCIR-10 RITE-2
Table 17 Results of using LM and the best feature sets for SC test set of NTCIR-10 RITE-2

4.6 Evaluations with NTCIR-9 RITE test data

We reused the classification models that we trained with the NTCIR-10 RITE-2 development dataset to predict the entailment of the test data for NTCIR-9 RITE. According to Shima et al. (2012), the best accuracy scores achieved by participating systems for TC and SC were 66.11 and 77.64 %, respectively.

We used our heuristic functions with the settings listed in Table 8 to predict the entailment relationships of the TC test dataset of NTCIR-9 RITE. Again, we ran two sets of experiments, differing in whether or not synonyms were used in computing word overlap, and the results are listed in Tables 18 and 19.

Table 18 Using settings in Table 8 but no synonyms for TC in NTCIR-9 RITE
Table 19 Using settings in Table 8 and synonyms for TC in NTCIR-9 RITE

The data in Tables 18 and 19 show that our accuracy scores were better than the best score achieved by the systems which participated in the TC evaluation task of NTCIR-9. However, we also observed that considering synonyms in TC experiments for NTCIR-9 actually decreased the performance of our systems.

We also used our heuristic functions with the settings listed in Table 9 to predict the entailment relationships of the SC test dataset of NTCIR-9 RITE. Analogously, we ran two sets of experiments, differing in whether or not synonyms were used in computing word overlap, and the results are listed in Tables 20 and 21.

Table 20 Using settings in Table 9 but no synonyms for SC in NTCIR-9 RITE
Table 21 Using settings in Table 9 and synonyms for SC in NTCIR-9 RITE

The statistics in Tables 20 and 21 show that our accuracy scores were not as good as the best score achieved by the systems which participated in the SC subtask of NTCIR-9 RITE. Similar to what we observed in Tables 18 and 19, considering synonyms in SC experiments for NTCIR-9 RITE brought down the performance of our systems.

We used the linear model-based classifier with the best feature sets (cf. Tables 14, 15 in Sect. 4.5) to predict the entailment relationships for the test dataset of NTCIR-9 RITE. Tables 22 and 23 show the results for TC and SC, respectively. Once again, the accuracies for TC were better than that of the best performing team which actually participated in NTCIR-9 RITE. Moreover, the accuracy achieved by LM-12 was also slightly better than the best accuracy for SC in NTCIR-9 RITE.

Table 22 Results of using the best feature sets for TC test set of NTCIR-9 RITE
Table 23 Results of using the best feature sets for SC test set of NTCIR-9 RITE

4.7 Effects of syntactic and semantic information

In order to study the effects of considering parse trees (F8 in Table 2) and the dependency structures (F16 in Table 2), we intentionally removed F8 and F16 from LM-5 and LM-6 in Table 14 and LM-11 and LM-12 in Table 15. We used LM-5A, LM-6A, LM-11A, and LM-12A to denote these new settings. Table 24 lists the MacroF1 and accuracy scores when we used LM-5A, LM-6A, LM-11A, and LM-12A with linearly weighted models to predict entailment.

Table 24 Effects of considering syntactic and semantic information indecisive

Although we hoped that considering higher level linguistic information could make a significant contribution to the scores, the data do not support our hypothesis decisively. Most of the time, considering F8 and F16 made the classification results only marginally better for simplified Chinese. The effects of considering F8 and F16 were quite inconsistent for the traditional Chinese test data, as indicated by the left side of Table 24.

5 Additional discussions

In this section, we discuss some issues that involve observations obtained in multiple experiments. More specifically, we discuss the implication that was suggested by the experiments reported in Sect. 4. Although one might expect that some approaches should have achieved better performance than others, such expectations might not be realized in the current study. We investigate the issues and elaborate on possible reasons for the gap between the actual results and expected outcomes in this section.

5.1 Y-precision, Y-recall, N-precision, and N-recall

Although we have focused mostly on the effects of using different methods and features on the achieved MacroF1 and accuracy scores, the values of the Y-precision, Y-recall, N-precision, and N-recall are informative for the design of algorithms.

It should be noted that, when handling the statement pairs of simplified Chinese, our methods had high values in Y-recall and N-precision and low values in N-recall in Sects. 4.4 and 4.5. After training, our methods showed a tendency to judge statement pairs as entailed. We suspect that this phenomenon may have resulted from the imbalanced proportions of Y-pairs and N-pairs in the development set (cf. Table 3).

5.2 Performance of SVM-based systems

Indeed, it is not surprising that the quality of the training data influenced the performance of the trained models. The amount of data available for training may also have affected the performance of teams which adopted support vector machines (SVMs) as their classifiers. Table 25 shows some statistics of the performance of all of the teams which participated in the BC subtask for both simplified and traditional Chinese in NTCIR-10 RITE-2. Since each team could submit up to three runs of their systems, a team would have as many results as the runs they submitted. The “MacroF1” and “Acc.” columns show the highest MacroF1 and accuracy achieved by the teams.

Table 25 Performance statistics of teams which participated in both SC and TC subtasks in NTCIR-10 RITE-2

Among the seven teams, only IASL (Shih et al. 2013) did not use SVMs, and MIG (Huang and Liu 2013) used SVMs in one of their three runs. The other five teams used SVMs as their classifiers, and only CYUT (Wu et al. 2013) achieved better performance in simplified Chinese than in traditional Chinese. Although MIG’s best performance in simplified Chinese is better than its best performance in traditional Chinese, as shown in Table 25, MIG’s performance in simplified Chinese is actually poorer than its performance in traditional Chinese when MIG used an SVM-based classifier (cf. MIG-3 in Tables 5, 6).

5.3 Effects of specific features on experiments with real test data

Comparing the experimental results discussed in Sects. 4.3, 4.4, and 4.5, we found that, overall, using systematic ways to search for parameters and features offered us more chances to achieve better performance than relying on results of intuitively selected experiments to build an inference system.

We have also attempted to compare many experimental results that were influenced by whether or not we considered synonyms in computing word overlap in Sect. 4. The following statement pair from NTCIR-10 RITE-2 provides an example of the need to consider synonyms: one needs to recognize the synonymous relationship between “[figure bt]” and “[figure bu]” to correctly handle this pair.

  1. (15)
    figure bv
  2. (16)
    figure bw

Nevertheless, experimental results showed that considering synonyms helped improve our performance only in the TC experiments in NTCIR-10 RITE-2. Similar results were not observed in the other experiments that we reported in Sects. 4.4 and 4.6. This may be because the test data did not include many instances that really needed synonyms for correct judgments; it may also have been caused by our imperfect identification of synonymous relationships between Chinese words, which remains a very challenging problem.

The entailment relationship between a statement pair may hold for a wide variety of reasons and their combinations, and the organizers of evaluation tasks try to cover as many different types of entailment relationships as possible in the datasets (Dagan et al. 2009; Shima et al. 2012; Watanabe et al. 2013a). As a consequence, the overall performance might not improve instantly due to the consideration of just one specific factor. Researchers have studied the correlation between datasets and the performance of systems (Lin et al. 2015). Hence, it may not be easy to single out and justify the exact contribution of a specific feature with real test data.

The same phenomenon occurred again when we tried to examine the effects of considering syntactic and semantic information to judge entailment relationships with experiments reported in Sect. 4.7.

5.4 World knowledge and subjective judgments

In the real world, we may not be able to judge whether one statement entails another solely by linguistic information (Vanderwende et al. 2006; Dagan et al. 2009). This is particularly true when world knowledge, connotation and subjective judgments are involved. Following are some statement pairs that were used in NTCIR-10 RITE-2.

Knowledge about the conversion between “[figure bx]” (meter) and “[figure by]” (centimeter) is required to judge whether (17) entails (18).

  1. (17)
    figure bz
  2. (18)
    figure ca

The standard answer to the statement pair (19) and (20) is yes, probably because the annotator believed that something that is “[figure cb]” (highest) must also be “[figure cc]” (high). However, this may not always be true, just as the best performer in a contest might not really achieve a very high score.

  1. (19)
    figure cd
  2. (20)
    figure ce

6 Concluding remarks

The main goal of this paper is not to provide a comprehensive survey of studies on textual entailment. Rather, we provide empirical experience obtained from experiments with real test data in NTCIR-9 RITE and NTCIR-10 RITE-2. For additional survey articles that we have not discussed, readers may refer to Androutsopoulos and Malakasiotis (2010) and Watanabe et al. (2012).

In this paper, we presented the linguistic features and the computational models which we used to achieve second positions in the BC subtask for both simplified and traditional Chinese in NTCIR-10 RITE-2. Significantly extended investigations were carried out, reported, and analyzed to share our empirical experience in textual entailment based on the real data used in NTCIR-9 RITE and NTCIR-10 RITE-2. More experiments, including experiments on English test data used in PASCAL RTE-1 and RTE-2, are available in Huang (2013).

Based on the experience and discussions reported in this paper, we believe that more work on true natural language understanding is needed to achieve better performance in textual entailment recognition. For future work, we are exploring the possibility of applying techniques of textual entailment for answering questions in reading comprehension tests that are designed for language learners (Huang et al. 2013). When computers can do the reading comprehension tests reasonably well, they might also explain the answers to students and serve as a learning companion.