1 Introduction

Near-synonym sets represent groups of words with similar meanings, which can be derived from existing lexical ontologies such as WordNet (Fellbaum 1998), EuroWordNet (Rodríguez et al. 1998), and Chinese WordNet (Huang et al. 2008). These are useful knowledge resources for computer-assisted language learning (CALL) (Cheng 2004; Inkpen and Hirst 2006; Inkpen 2007; Ouyang et al. 2009; Wu et al. 2010) and natural language processing (NLP) applications such as information retrieval (IR) (Moldovan and Mihalcea 2000; Navigli and Velardi 2003; Shiri and Revie 2006; Bhogal et al. 2007; Yu et al. 2009) and (near-)duplicate detection for text summarization (Vanderwende et al. 2007). For example, in composing a text, near-synonyms can be used to automatically suggest alternatives so that the same word is not repeated when suitable alternatives are available in the near-synonym set (Inkpen and Hirst 2006). In information retrieval, systems can perform query expansion to improve recall, for example, by recognizing that the weapon sense of “arm” corresponds to the weapon senses of “weapon” and “arsenal”.

Although the words in a near-synonym set have similar meanings, they are not necessarily interchangeable in practical use due to their specific usage and collocational constraints (Wible et al. 2003). Consider the following examples.

  • (E1) {strong, powerful} coffee

  • (E2) ghastly {error, mistake}

  • (E3) {bridge, overpass, tunnel} under the bay.

Examples (E1) and (E2) both illustrate collocational constraints in the given contexts. In (E1), the word “strong” in the near-synonym set {strong, powerful} is more suitable than “powerful” for the given context “coffee,” since “powerful coffee” is an anti-collocation (Pearce 2001). Similarly, in (E2), “mistake” is more suitable than “error” because “ghastly mistake” is a collocation and “ghastly error” is an anti-collocation (Inkpen 2007). In (E3), the near-synonym set {bridge, overpass, tunnel} represents the meaning of a physical structure that connects separate places by traversing an obstacle. Suppose that the original word is “tunnel” in the context “under the bay”. The word “tunnel” cannot be substituted by the other words in the same set because all such substitutions are semantically implausible (Yu et al. 2010). These examples indicate that near-synonyms may have different usages in different contexts, and such differences are not easily captured by second-language learners. Therefore, we develop a computer-assisted near-synonym learning system to assist Chinese English-as-a-Second-Language (ESL) learners in better understanding the usages of various English near-synonyms and using them appropriately in different contexts.

This chapter introduces the use of NLP techniques such as automatic near-synonym choice (Edmonds 1997; Gardiner and Dras 2007; Inkpen 2007; Islam and Inkpen 2010; Wang and Hirst 2010; Yu et al. 2010; Yu and Chien 2013; Yu et al. 2016) to verify whether near-synonyms match the given contexts. The problem of automatic near-synonym choice has been formulated as a “fill-in-the-blank” (FITB) task, as shown in Fig. 1. Given a near-synonym set and a sentence containing one of the near-synonyms, the near-synonym is first removed from the sentence to form a lexical gap. The goal is to predict an answer (best near-synonym) that can fill the gap from the given near-synonym set (including the original word). The systems can then be evaluated by examining their ability to restore the original word with the best near-synonym.

Fig. 1
figure 1

Example of FITB evaluation for automatic near-synonym choice

2 Automatic Near-Synonym Choice

Among many approaches to automatic near-synonym choice, Edmonds’ pioneering study used a lexical co-occurrence network to determine the near-synonym that is most typical or expected in a given context (Edmonds 1997). Other proposed methods can generally be categorized as unsupervised (Gardiner and Dras 2007; Inkpen 2007; Islam and Inkpen 2010; Yu et al. 2010) and supervised methods (Wang and Hirst 2010; Yu and Chien 2013; Yu et al. 2016).

2.1 Unsupervised Methods

In the unsupervised methods, the pointwise mutual information (PMI) (Gardiner and Dras 2007; Inkpen 2007) and n-gram-based methods (Islam and Inkpen 2010; Yu et al. 2010) are the two major methods.

2.1.1 PMI-Based Method

The PMI is used to measure the strength of co-occurrence between a near-synonym and each individual word appearing in its context. A higher mutual information score indicates that the near-synonym fits well in the given context, and thus is more likely to be the correct answer. The pointwise mutual information (Church and Hanks 1990) between two words x and y is defined as

$$ {\text{PMI}}(x,y) = \log_{2} \frac{P(x,y)}{P(x)P(y)}, $$
(1)

where \( P(x,y) = C(x,y)/N \) denotes the probability that x and y co-occur; \( C(x,y) \) is the number of times x and y co-occur in the corpus; and N is the total number of words in the corpus. Similarly, \( P(x) = C(x)/N \), where C(x) is the number of times x occurs in the corpus, and \( P(y) = C(y)/N \), where C(y) is the number of times y occurs in the corpus. Therefore, Eq. 1 can be rewritten as

$$ {\text{PMI}}(x,y) = \log_{2} \frac{C(x,y) \cdot N}{C(x) \cdot C(y)}. $$
(2)

The frequency counts C(·) in Eq. 2 can be retrieved from a large corpus such as the Waterloo terabyte corpus (Inkpen 2007) or the Web 1T 5-gram corpus (Gardiner and Dras 2007).

Given a sentence s with a gap, \( s = \ldots w_{1} \ldots w_{\ell} \ldots w_{\ell + 1} \ldots w_{2\ell} \ldots \), the PMI score for a near-synonym NSj to fill the gap is computed from the words around the gap, defined as

$$ {\text{PMI}}(NS_{j},s) = \sum\limits_{i = 1}^{2\ell} {{\text{PMI}}(NS_{j},w_{i})}, $$
(3)

where \( \ell \) is a window size representing \( \ell \) words to the left and right of the gap. Finally, the near-synonym with the highest score is considered to be the answer.
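As an illustration, the following sketch shows how Eqs. 2 and 3 can be combined to score each candidate near-synonym for a gapped sentence. The counts and function names are hypothetical; real systems retrieve C(·) from Web-scale corpora such as those cited above.

```python
import math
from collections import Counter

def pmi(x, y, unigram, cooc, n_words):
    """PMI of Eq. 2, computed from raw frequency counts."""
    c_xy = cooc.get((x, y), 0)
    if c_xy == 0 or unigram[x] == 0 or unigram[y] == 0:
        return 0.0  # treat unseen pairs as contributing no evidence
    return math.log2(c_xy * n_words / (unigram[x] * unigram[y]))

def pmi_score(candidate, context_words, unigram, cooc, n_words):
    """Eq. 3: sum PMI over the window of words around the gap."""
    return sum(pmi(candidate, w, unigram, cooc, n_words) for w in context_words)

def choose_near_synonym(near_synonyms, left, right, unigram, cooc, n_words, window=2):
    """Return the near-synonym with the highest PMI score for the gap."""
    context = left[-window:] + right[:window]
    return max(near_synonyms,
               key=lambda ns: pmi_score(ns, context, unigram, cooc, n_words))

# Toy counts; a real system would use a corpus such as the Web 1T 5-gram corpus.
unigram = Counter({"error": 1200, "mistake": 1500, "ghastly": 90})
cooc = {("mistake", "ghastly"): 25, ("error", "ghastly"): 2}
print(choose_near_synonym(["error", "mistake"],
                          left=["a", "ghastly"], right=["in", "judgment"],
                          unigram=unigram, cooc=cooc, n_words=1_000_000))
# -> "mistake"
```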

2.1.2 5-Gram Language Model

N-grams can capture contiguous word associations within given contexts. Assume a sentence \( s = \ldots w_{i - 4} w_{i - 3} w_{i - 2} w_{i - 1} w_{i} w_{i + 1} w_{i + 2} w_{i + 3} w_{i + 4} \ldots \), where wi represents a near-synonym in a set. In computing the 5-gram score for each near-synonym, only the five product terms \( P(w_{i} \left| {w_{i - 4}^{i - 1}} \right.) \), \( P(w_{i + 1} \left| {w_{i - 3}^{i}} \right.) \), \( P(w_{i + 2} \left| {w_{i - 2}^{i + 1}} \right.) \), \( P(w_{i + 3} \left| {w_{i - 1}^{i + 2}} \right.) \), and \( P(w_{i + 4} \left| {w_{i}^{i + 3}} \right.) \) are considered (Islam and Inkpen 2010). The other terms are excluded because they do not contain the near-synonym and thus take the same values for every candidate. Accordingly, the 5-gram language model (n = 5) with a smoothing method can be defined as

$$ \begin{aligned} P(s) & = \prod\limits_{i = 1}^{5} {P(w_{i} \left| {w_{i - n + 1}^{i - 1}} \right.)} \\ & = \prod\limits_{i = 1}^{5} {\frac{{C(w_{i - n + 1}^{i}) + (1 + \alpha_{n})M(w_{i - n + 1}^{i - 1})P(w_{i} \left| {w_{i - n + 2}^{i - 1}} \right.)}}{{C(w_{i - n + 1}^{i - 1}) + \alpha_{n} M(w_{i - n + 1}^{i - 1})}}} \\ \end{aligned} $$
(4)

where \( M(w_{i - n + 1}^{i - 1}) \) denotes a missing count used in the smoothing method, defined as

$$ M(w_{i - n + 1}^{i - 1}) = C(w_{i - n + 1}^{i - 1}) - \sum\limits_{{w_{i}}} {C(w_{i - n + 1}^{i})} $$
(5)

where C(·) denotes an n-gram frequency, which can be retrieved from a large corpus such as the Web 1T 5-gram corpus (Islam and Inkpen 2010). The 5-gram language model is implemented as a back-off model. That is, if the frequency of a higher order n-gram is zero, its lower order n-grams are considered; conversely, if the frequency of a higher order n-gram is not zero, the lower order n-grams are not included in the computation. As with the PMI-based method, the near-synonym with the highest score is taken as the answer.
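For illustration, the sketch below scores a candidate with a simplified back-off estimate rather than the exact missing-count smoothing of Eqs. 4 and 5; the `counts` structure (a dict mapping n-gram order to frequency tables) and the function names are assumptions, not part of the cited systems.

```python
def ngram_prob(ngram, counts, order):
    """Back-off estimate: use the highest-order n-gram with a nonzero count.
    This is a simplification of the smoothed model in Eqs. 4 and 5."""
    for k in range(order, 1, -1):
        sub, hist = ngram[-k:], ngram[-k:-1]
        if counts[k].get(sub, 0) > 0 and counts[k - 1].get(hist, 0) > 0:
            return counts[k][sub] / counts[k - 1][hist]
    return 1e-9  # floor probability for unseen events

def five_gram_score(tokens, gap_index, candidate, counts, order=5):
    """Multiply the probabilities of the five 5-grams whose windows contain
    the filled gap (positions i-4 .. i+4 around the candidate)."""
    filled = tokens[:gap_index] + [candidate] + tokens[gap_index + 1:]
    score = 1.0
    for start in range(gap_index - order + 1, gap_index + 1):
        window = tuple(filled[start:start + order]) if start >= 0 else ()
        if len(window) == order:
            score *= ngram_prob(window, counts, order)
    return score
```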

2.2 Supervised Methods

Supervised methods usually approach near-synonym choice as a classification task in which each near-synonym in a near-synonym set represents a class, and the features used for classification are the words that occur in the contexts of the near-synonyms. The near-synonyms and their context words are represented using vector-based representations, which are frequently used in distributional models of lexical semantics (Harris 1954; Lin 1998; Roussinov and Zhao 2003; Weeds et al. 2004). Based on this representation, a co-occurrence matrix of the near-synonyms and their context words can be built from the training data, i.e., a collection of sentences containing the near-synonyms. Each entry in the matrix represents the co-occurrence frequency of a context word and a near-synonym. Different context analysis techniques such as latent semantic analysis (LSA) and independent component analysis (ICA) can then be applied to the context matrix to identify useful context features that contribute to the classification task (Wang and Hirst 2010; Yu and Chien 2013).

2.2.1 Latent Semantic Analysis (LSA)

LSA is a technique for analyzing the relationships between words and documents and has been widely used in many application domains such as information retrieval (Landauer et al. 1998), latent topic discovery (Cribbin 2011), and document clustering (Wei et al. 2008). For automatic near-synonym choice, LSA is used as a context analysis technique to identify useful latent context features for the near-synonyms through indirect associations between words and sentences.

The first step in LSA is to build a word-by-document matrix for near-synonyms and their context words (Wang and Hirst 2010). In addition to documents, for our task, other text units such as sentences or 5-grams could also be used to build the matrix because these text units also contain contextual information for near-synonyms. Figure 2 shows a sample matrix X built using the sentence as the unit. The columns in \( {\mathbf{X}}_{v \times d} \) represent d sentences containing the near-synonyms in a near-synonym set and the rows represent v distinct words occurring in the near-synonyms’ contexts in the corpus. Singular value decomposition (SVD) (Golub and Van Loan 1996) is then used to decompose the matrix \( {\mathbf{X}}_{v \times d} \) into three matrices as follows:

Fig. 2
figure 2

Illustrative example of singular value decomposition for latent semantic analysis

$$ {\mathbf{X}}_{v \times d} = {\mathbf{U}}_{v \times n} {\boldsymbol{\Sigma}}_{n \times n} {\mathbf{V}}_{n \times d}^{T}, $$
(6)

where U and V, respectively, consist of sets of latent vectors of words and sentences, \( {\boldsymbol{\Sigma}} \) is a diagonal matrix of singular values, and \( n = \min (v,d) \) denotes the dimensionality of the latent semantic space. Additionally, each element in U represents the weight of a context word, and the higher weighted words are the useful context features for the near-synonyms. By selecting the largest k1 (≤ n) singular values together with the first k1 columns of U and V, the near-synonyms and their context words can be represented in a low-dimensional latent semantic space. The original matrix can also be reconstructed with the reduced dimensions, as shown in Eq. 7

$$ \widehat{{\mathbf{X}}}_{v \times d} = {\mathbf{U}}_{{v \times k_{1}}} {\boldsymbol{\Sigma}}_{{k_{1} \times k_{1}}} {\mathbf{V}}_{{k_{1} \times d}}^{T}, $$
(7)

where \( \widehat{{\mathbf{X}}} \) represents the reconstructed matrix.
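A minimal numpy sketch of the truncated SVD step in Eqs. 6 and 7, using a tiny made-up word-by-sentence matrix; the helper name lsa_reconstruct is illustrative, not from the cited work.

```python
import numpy as np

def lsa_reconstruct(X, k1):
    """Keep the k1 largest singular values (Eq. 6) and reconstruct the
    word-by-sentence matrix in the reduced latent space (Eq. 7)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U * diag(s) * Vt
    U_k, s_k, Vt_k = U[:, :k1], s[:k1], Vt[:k1, :]
    X_hat = U_k @ np.diag(s_k) @ Vt_k
    return X_hat, U_k, s_k

# Toy matrix: v = 4 context words, d = 3 sentences containing the near-synonyms.
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])
X_hat, U_k, s_k = lsa_reconstruct(X, k1=2)  # zero entries may become nonzero real values
```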

In SVM training and testing, each input sentence with a lexical gap is first transformed into the latent semantic representation as follows:

$$ {\hat{\mathbf{t}}}_{{k_{1} \times 1}} = {\boldsymbol{\Sigma}}_{{k_{1} \times k_{1}}}^{- 1} {\mathbf{U}}_{{k_{1} \times v}}^{T} {\mathbf{t}}_{v \times 1}, $$
(8)

where \( {\mathbf{t}}_{v \times 1} \) denotes the vector representation of an input sentence and \( {\hat{\mathbf{t}}}_{{k_{1} \times 1}} \) denotes the transformed vector in the latent semantic space. Each transformed training vector is then appended with the correct answer (the near-synonym removed from the sentence) to form a (k1 + 1)-dimensional vector for SVM training.
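The following sketch (with randomly generated stand-in data) illustrates the projection of Eq. 8 and the subsequent SVM training; scikit-learn's SVC is used here only for convenience and is not necessarily the SVM implementation used in the cited work.

```python
import numpy as np
from sklearn.svm import SVC

def lsa_transform(t, U_k, s_k):
    """Eq. 8: project a v-dimensional context vector into the k1-dimensional
    latent space, t_hat = Sigma^{-1} U^T t."""
    return (U_k.T @ t) / s_k

# Stand-in training data: a v x d context matrix and the near-synonym (class
# label) that was removed from each of the d training sentences.
rng = np.random.default_rng(0)
X_train = rng.poisson(1.0, size=(4, 50)).astype(float)
y_train = rng.choice(["give", "provide"], size=50)

U, s, _ = np.linalg.svd(X_train, full_matrices=False)
U_k, s_k = U[:, :2], s[:2]                       # keep k1 = 2 dimensions

Z_train = np.stack([lsa_transform(X_train[:, j], U_k, s_k)
                    for j in range(X_train.shape[1])])
clf = SVC(kernel="linear").fit(Z_train, y_train)

# A test sentence with a lexical gap is transformed the same way, then classified.
t_test = np.array([1., 0., 2., 0.])
print(clf.predict([lsa_transform(t_test, U_k, s_k)]))
```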

The strength of LSA lies in discovering latent context features for near-synonyms using SVD. Consider the example shown in Fig. 3. The original matrix, as shown in Fig. 3a, is built using five training sentences containing two different near-synonyms NSi and NSj. Suppose that the words w1 and w2 are useful features for NSi, and w3 and w4 are useful for NSj, but w4 is a latent feature because it does not frequently occur in the context of NSj. After applying SVD, the latent features can be identified by replacing the zero entries in the original matrix with nonzero real values through the indirect associations between words and sentences. For instance, w4 originally does not occur in s3 and s4, but it does co-occur with w3 in the matrix (e.g., in s5), which means that w4 might also occur in the sentences where w3 occurs (e.g., s3 and s4). Therefore, the zero entries (w4, s3) and (w4, s4) are replaced with nonzero values through the indirect association between w3 and w4 in s5, as shown in Fig. 3b. This helps identify a useful latent feature w4 for NSj. However, identifying latent features through indirect associations cannot avoid feature overlap when different near-synonyms share common words in their contexts, which is likely because near-synonyms usually have similar contexts. For instance, in Fig. 3a, w1, which is useful for NSi, also occurs in the context of NSj (e.g., s4). Through the indirect association between w1 and w3 in s4, the frequency of w1 increases in the context of NSj because it may also occur in the sentences where w3 occurs (e.g., s3 and s5), as shown in Fig. 3b. Therefore, when all word features are accommodated in a low-dimensional space reduced by SVD, term overlap may occur between the latent vectors. As indicated in Fig. 3c, the two sample latent vectors, which contribute to two different near-synonyms, share a common feature w1. Classifiers trained on such latent vectors with term overlap may be less able to distinguish among near-synonyms.

Fig. 3
figure 3

Comparison of LSA and ICA for feature representation

2.2.2 Independent Component Analysis (ICA)

ICA is a technique for extracting independent components from a mixture of signals and has been successfully applied to the blind source separation problem (Hyvärinen et al. 2001; Lee 1998). Recent studies have shown that ICA can also be applied to other domains such as text processing (Kolenda and Hansen 2000; Rapp 2004; Sevillano et al. 2004). In contrast to LSA, ICA extracts independent components by minimizing the term dependence of the context matrix. Therefore, LSA and ICA can be considered complementary: LSA can be used to discover latent features that do not frequently occur in the contexts of near-synonyms, and ICA can be used to further minimize the dependence of the latent features such that overlapping features are removed, as presented in Fig. 3d. Based on this complementarity, the ICA-based framework can analyze the LSA output to discover more useful latent features for different near-synonyms while also minimizing the dependence between them. The discriminant power of classifiers can thus be improved by training them on independent components with minimal term overlap. The ICA model can be formally described as

$$ {\mathbf{X}} = {\mathbf{AS}}, $$
(9)

where X denotes the observed mixture signals, A denotes a mixing matrix, and S denotes the independent components. The goal of ICA is to estimate both A and S. Once the mixing matrix A is estimated, the demixing matrix can be obtained by \( {\mathbf{W}} = {\mathbf{A}}^{- 1} \), and Eq. 9 can be rewritten as

$$ {\mathbf{S}} = {\mathbf{WX}}. $$
(10)

That is, the observed mixture signals can be separated into independent components using the demixing matrix.
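A small synthetic demonstration of Eqs. 9 and 10, using scikit-learn's FastICA (the cited work uses the original FastICA package; this is only an illustration). Note that scikit-learn expects a samples-by-features layout, i.e., the transpose of the X in Eq. 9.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S_true = rng.laplace(size=(200, 2))            # two independent source signals
A_true = np.array([[1.0, 0.5],
                   [0.3, 1.0]])                # mixing matrix A
X = S_true @ A_true.T                          # observed mixtures (X = AS in Eq. 9)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                   # estimated independent components (Eq. 10)
W_est = ica.components_                        # estimated demixing matrix W
A_est = ica.mixing_                            # estimated mixing matrix A
```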

For our problem, the context matrix can be considered a mixture of signals because it consists of the contexts of different near-synonyms. ICA is therefore used herein to estimate the demixing matrix so that the mixed contexts can be separated into independent components for each near-synonym. Figure 4 shows the ICA-based framework combining LSA and ICA.

Fig. 4
figure 4

ICA-based framework for near-synonym choice

LSA decomposition and reconstruction

In the training phase, the original context matrix \( {\mathbf{X}}_{v \times d} \) is first decomposed by SVD using Eq. 6, and then reconstructed with reduced dimensions using Eq. 7. Useful latent features that do not frequently occur in the original matrix can thus be discovered in this step.

ICA decomposition and demixing

To further minimize term dependence in deriving the independent components, the reconstructed matrix \( \widehat{{\mathbf{X}}}_{v \times d} \) is passed to ICA to estimate the demixing matrix. ICA accomplishes this by decomposing \( \widehat{{\mathbf{X}}}_{v \times d} \) using Eq. 11. Figure 5 shows an example of the decomposition.

$$ \widehat{{\mathbf{X}}}_{v \times d} = {\mathbf{A}}_{{v \times k_{2}}} {\mathbf{S}}_{{k_{2} \times d}}. $$
(11)
Fig. 5
figure 5

Illustrative example of ICA decomposition

Based on this decomposition, the demixing matrix can be obtained by \( {\mathbf{W}}_{{k_{2} \times v}} = {\mathbf{A}}_{{v \times k_{2}}}^{- 1} \), where k2 (≤ n) is the number of independent components. Similar to the matrix \( {\mathbf{U}} \) in LSA, each element in \( {\mathbf{W}} \) also represents the weight of a context word, and the higher weighted words are useful context features for the near-synonyms. Therefore, the demixing matrix contains useful context features with minimized term dependence for different near-synonyms.

Once estimated, the demixing matrix is used to separate \( \widehat{{\mathbf{X}}}_{v \times d} \) to derive the independent components as follows:

$$ {\mathbf{S}}_{{k_{2} \times d}} = {\mathbf{W}}_{{k_{2} \times v}} \widehat{{\mathbf{X}}}_{v \times d}. $$
(12)

Each column vector in \( {\mathbf{S}}_{{k_{2} \times d}} \) is then appended with the correct answer for SVM training. Similarly, as shown in Fig. 4, each test instance \( {\mathbf{t}}_{v \times 1} \) in the testing phase is also transformed by the demixing matrix and then classified with the trained SVM model.
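Putting the two steps together, the sketch below outlines the training and testing phases of Fig. 4 under simplifying assumptions (centering and whitening bookkeeping is omitted, and the function names are illustrative).

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC

def lsa_ica_train(X, y, k1, k2):
    """Training phase: truncated SVD reconstruction (Eq. 7), ICA demixing of the
    reconstructed matrix (Eqs. 11 and 12), then SVM training on the independent
    components with the class labels y (the removed near-synonyms)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = U[:, :k1] @ np.diag(s[:k1]) @ Vt[:k1, :]   # LSA reconstruction

    ica = FastICA(n_components=k2, random_state=0)
    S = ica.fit_transform(X_hat.T)     # columns of X_hat treated as samples
    W = ica.components_                # demixing matrix (k2 x v)

    clf = SVC(kernel="linear").fit(S, y)
    return clf, W

def lsa_ica_predict(clf, W, t):
    """Testing phase: transform a test context vector with the demixing matrix
    (as in Eq. 12), then classify it with the trained SVM."""
    return clf.predict([W @ t])[0]
```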

3 Experimental Results

This section presents the evaluation results of different methods for near-synonym choice. Section 3.1 describes the experiment setup, including experimental data, implementation details of methods used, and the evaluation metric. Section 3.2 investigates the selection of optimal parameters for LSA and ICA-based methods. Section 3.3 compares the results obtained by the various methods. Section 3.4 presents a detailed analysis to examine the effect of term overlap on classification performance.

3.1 Experiment Setup

3.1.1 Data

As shown in Table 1, seven English and seven Chinese near-synonym sets were used for evaluation. For Chinese near-synonym choice evaluation, two test corpora were used: the Chinese News Corpus (CNC) and the Sinica Corpus (SC), both released by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). The test examples were collected from the two corpora by selecting sentences containing the near-synonyms in the seven Chinese near-synonym sets. A total of 36,427 (CNC) and 26,504 (SC) sentences were collected; 20% of the sentences from each corpus were randomly selected as a development set for parameter tuning of the LSA and ICA-based methods, and the remaining 80% were used as the test set for performance evaluation. In addition, the classifiers (described in the next section) were built from the Chinese Web 5-gram corpus released by the Linguistic Data Consortium (LDC). For English near-synonym choice evaluation, the Web 1T 5-gram corpus released by LDC was used for both classifier training and evaluation using cross-validation.

Table 1 English and Chinese near-synonym sets

3.1.2 Classifiers

The classifiers involved in this experiment included PMI, the 5-gram language model, cosine measure, LSA, and ICA-based methods (including stand-alone ICA and a combination of LSA and ICA). The implementation details for each classifier are as follows:

  • PMI: Given a near-synonym set and a test example with a gap, the PMI scores for each near-synonym were calculated using Eq. 3, where the size of the context window \( \ell \) was set to 2. The frequency counts were retrieved from the Web 1T 5-gram corpus and Chinese Web 5-gram corpus, respectively, for English and Chinese near-synonym choice evaluation.

  • 5 GRAM: The 5-gram scores for each near-synonym in a test example were calculated using Eq. 4. The frequency counts for n-grams were retrieved by querying Google (as in Yu et al. 2010) for English near-synonym choice evaluation, and from the Chinese Web 5-gram corpus for Chinese evaluation.

  • COS: Given a near-synonym set, all 5-grams containing the near-synonyms in the set were first extracted from the training data (i.e., from the Web 1T 5-gram corpus for English near-synonyms and from the Chinese Web 5-gram corpus for Chinese near-synonyms). The near-synonyms were then removed from the 5-grams, and the remaining context words were used to build a word-by-class matrix for the near-synonym set. The best near-synonym for each test example was then predicted by comparing the cosine similarity between the test example and each near-synonym column in the matrix (see the scoring sketch after this list).

  • LSA: Whereas COS used all 5-grams to build the context matrix, for efficiency we randomly selected only 20,000 5-grams to build the word-by-document (5-gram) matrix for each English and Chinese near-synonym set. The number of 5-grams for each near-synonym in the matrix was selected in proportion to its frequency in the corpus. Once the matrix was built, each training instance (i.e., each column vector of the matrix) was transformed into the latent space using Eq. 8. The correct answer (the near-synonym removed from the training instance) was then appended to the corresponding transformed vector to form a (k1 + 1)-dimensional vector for SVM training.

  • ICA: Stand-alone ICA was implemented using Eqs. 9 and 10. The input matrix was the same as that of LSA, and the demixing matrix was estimated using the FastICA package (Hyvärinen 1999). An SVM classifier was then trained using the independent components obtained using Eq. 10 with the correct answers appended.

  • LSA + ICA: The combination of LSA and ICA was implemented by taking the LSA result as input to estimate the demixing matrix in ICA, as shown in Eq. 11. An SVM classifier was then trained using the independent components obtained from Eq. 12 with the correct answers appended. For LSA, ICA, and LSA + ICA, the best near-synonym for each test example was predicted using the trained SVM models.
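As referenced in the COS item above, a minimal sketch of the cosine scoring step is given here; the word-by-class matrix is assumed to hold context-word frequencies accumulated per near-synonym, and the names are illustrative.

```python
import numpy as np

def cos_choose(test_vector, class_matrix, near_synonyms):
    """Return the near-synonym whose column in the word-by-class matrix is most
    similar (by cosine) to the test example's context vector."""
    sims = []
    for j in range(len(near_synonyms)):
        col = class_matrix[:, j]
        denom = np.linalg.norm(test_vector) * np.linalg.norm(col)
        sims.append(float(test_vector @ col) / denom if denom > 0 else 0.0)
    return near_synonyms[int(np.argmax(sims))]
```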

3.1.3 Evaluation Metric

In testing, this experiment followed the FITB evaluation procedure (Fig. 1) to remove the near-synonyms from the test samples. The answers proposed by each classifier are the near-synonyms with the highest score. The correct answers are the near-synonyms originally in the gap of the test samples. Performance is determined by accuracy, which is defined as the number of correct answers made by each classifier, divided by the total number of test examples.
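A compact sketch of this FITB evaluation loop; the data layout (token list, gap index, near-synonym set) and the classifier callback are assumptions for illustration.

```python
def fitb_accuracy(test_examples, choose):
    """Remove the original near-synonym, ask the classifier to fill the gap, and
    count how often the original word is restored (accuracy)."""
    correct = 0
    for tokens, gap, ns_set in test_examples:
        answer = tokens[gap]                               # the original word
        gapped = tokens[:gap] + ["<GAP>"] + tokens[gap + 1:]
        if choose(gapped, gap, ns_set) == answer:
            correct += 1
    return correct / len(test_examples)
```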

3.2 Evaluation of LSA and ICA-Based Methods

Experiments were conducted to compare the performance of LSA, ICA, and LSA + ICA using different settings of the parameters k1 and k2, which respectively represent the dimensionality of the latent semantic space and the number of independent components. The optimal settings of the two parameters were tuned by maximizing the classification accuracy on the development set. Figure 6 shows the classification accuracy of LSA, ICA, and LSA + ICA with different settings of dimensionality (k1 or k2). The accuracies were obtained by averaging over the seven Chinese near-synonym sets. For LSA, the x-axis represents different values of k1, with performance increasing with k1 up to 2000. For ICA, the x-axis represents different values of k2, with optimal performance at k2 = 500. For LSA + ICA, the performance was tuned by varying both k1 and k2. Due to the large number of combinations of k1 and k2, Fig. 6 plots, for each value of k1 on the x-axis, only the best accuracy for LSA + ICA, determined by increasing k2 in increments of 100 and selecting the highest accuracy among the different settings of k2. For instance, the accuracy of LSA + ICA plotted at k1 = 500 was the highest accuracy, achieved at k2 = 500; this accuracy (i.e., that reached at k1 = 500 and k2 = 500) was also the overall optimal performance of LSA + ICA.
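A sketch of this tuning procedure is shown below; the callback dev_accuracy(k1, k2), which trains LSA + ICA with the given dimensionalities and returns development-set accuracy, is assumed, as is the restriction k2 ≤ k1.

```python
def tune_lsa_ica(dev_accuracy, k1_values, k2_step=100):
    """For each k1, sweep k2 in increments of k2_step and keep the best
    development-set accuracy together with the k2 that achieved it."""
    best = {}
    for k1 in k1_values:
        scores = [(dev_accuracy(k1, k2), k2)
                  for k2 in range(k2_step, k1 + 1, k2_step)]
        best[k1] = max(scores)   # (best accuracy, best k2) for this k1
    return best
```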

Fig. 6
figure 6

Classification accuracy of LSA, ICA, and LSA + ICA on the development set, as a function of dimensionality

Fig. 7
figure 7

Examples of latent vectors, selected from \( {\mathbf{U}}_{v \times k}^{{}} \), and independent components, selected from \( {\mathbf{W}}_{v \times k}^{T} \), for the near-synonyms “give” and “provide”. The weights shown in this figure are the absolute values of the weights in the latent vectors and independent components

The results presented in Fig. 6 show that LSA + ICA improved the performance of LSA for all dimensionalities because the degree of term overlap in LSA was reduced by ICA. In addition, the performance difference between LSA + ICA and LSA was greater when the dimensionality was smaller, indicating that ICA was more effective in reducing the degree of term overlap in a low-dimensional space. More detailed analysis of the relationship between the term overlap and the classification performance is discussed in Sect. 3.4. The best settings of the parameters were used in the following experiments.

3.3 Comparative Results

This section reports the classification accuracy of supervised and unsupervised methods including PMI, 5GRAM, COS, LSA, ICA, and LSA + ICA. Table 2 shows the comparative results for the Chinese corpora including Chinese News Corpus (CNC), Sinica Corpus (SC), and both (ALL). The binomial exact test (Howell 2007) was used to determine whether the performance difference was statistically significant.

Table 2 Classification accuracy for Chinese corpora

For the two unsupervised methods, 5GRAM outperformed PMI on both test corpora. One possible reason is that the 5-gram language model can capture contiguous word associations in a given context, whereas PMI treats the words in the context independently. In addition, all supervised methods (i.e., COS, LSA, ICA, and LSA + ICA) achieved better performance than the two unsupervised methods on both test corpora. Among the supervised methods, COS provided the baseline results since it did not use any context analysis technique. As indicated in Table 2, COS yielded higher average accuracy than the best unsupervised method (i.e., 5GRAM) by 2.55 and 7.83% on CNC and SC, respectively, and by 4.79% on ALL. Once context analysis techniques were employed, LSA, ICA, and LSA + ICA significantly improved on COS, indicating that context analysis is a useful technique for near-synonym choice. LSA improved the average accuracy of COS by 7.52, 4.29, and 6.16%, respectively, on CNC, SC, and ALL. The improvement mainly came from the discovery of useful latent features from the contexts of the near-synonyms. ICA also outperformed COS, with the improvement mainly coming from the discovery of independent components by minimizing the feature dependence among near-synonyms. Combining LSA and ICA further improved the performance of both because LSA + ICA can not only discover useful latent features for different near-synonyms but also minimize the dependence between them, thus improving the classifiers’ ability to distinguish between near-synonyms.

For the evaluation of English near-synonym choice, 20,000 5-grams for each English near-synonym set were randomly selected from the Web 1T 5-gram corpus (Web 1T), and the near-synonyms in the 5-grams were removed for the purpose of FITB evaluation. Ten-fold cross-validation was then used to determine the classification accuracy of the methods, with a t-test to determine statistical significance. In addition, for PMI, frequency counts were retrieved from the whole Web 1T 5-gram corpus, and those for 5GRAM were retrieved by querying Google (as in Yu et al. 2010). Table 3 presents the comparative results of the various methods, showing that LSA + ICA improved the performance of both stand-alone LSA and ICA.

Table 3 Classification accuracy for the English corpus

3.4 Term Overlap Analysis

This section investigates the effect of term overlap on classification performance. Term overlap in LSA and the ICA-based methods can be estimated from their respective matrices \( {\mathbf{U}}_{v \times k} \) and \( {\mathbf{W}}_{v \times k}^{T} \). Each column of \( {\mathbf{U}}_{v \times k} \) and \( {\mathbf{W}}_{v \times k}^{T} \) represents a latent vector or independent component of v words, and each element in the vector is a word weight representing its relevance to the corresponding latent vector or independent component. Therefore, the meaning of each latent vector or independent component can be characterized by its higher weighted words. Figure 7 shows two sample latent vectors for LSA and two independent components for LSA + ICA.

The upper part of Fig. 7 shows some of the context words and their weights in the two latent vectors, where latent vector #1 can be characterized by friend, opinion, and chance, which are useful features for identifying the near-synonym “give,” and latent vector #2 can be characterized by protection, information, and increase, which are useful for identifying the near-synonym “provide”. Although the two latent vectors contained useful context features for their respective near-synonyms, these features still overlapped somewhat between the latent vectors, as marked by the dashed rectangles. The overlapping features, especially those with higher weights, may reduce the classifier’s ability to distinguish between the near-synonyms. The lower part of Fig. 7 shows the two independent components for the near-synonyms “give” and “provide”. As indicated, the term overlap between the two independent components was relatively low.

To formally compare the degree of term overlap of LSA and the ICA-based methods, we used a measure, overlap@n, to calculate the degree of overlap of the top n ranked words among the latent vectors in \( {\mathbf{U}}_{v \times k} \) and the independent components in \( {\mathbf{W}}_{v \times k}^{T} \). First, the words in each latent vector (or independent component) were ranked in descending order of their weights. The top n words were then selected to form a word set. Let \( s_{i}^{n} \) and \( s_{j}^{n} \) be the two word sets of the top n words for any two latent vectors (or independent components). The degree of term overlap between them was calculated as the size of the intersection of the two word sets divided by n, i.e., \( \left| {s_{i}^{n} \cap s_{j}^{n}} \right| / n \). The degree of term overlap for a whole matrix, namely \( overlap_{{{\mathbf{U}}_{v \times k}}} \) (or \( overlap_{{{\mathbf{W}}_{v \times k}^{T}}} \)), is then the average of the degrees of term overlap between all pairs of latent vectors (or independent components) in \( {\mathbf{U}}_{v \times k} \) (or \( {\mathbf{W}}_{v \times k}^{T} \)). That is,

$$ overlap_{{{\mathbf{U}}_{v \times k}^{{}}}} @n = \frac{1}{{C_{2}^{k}}}\sum\limits_{i = 1}^{k} {\sum\limits_{j = i + 1}^{k} {\frac{{\left| {s_{i}^{n} \cap s_{j}^{n}} \right|}}{n}}}, $$
(13)

where \( C_{2}^{k} \) denotes the number of combinations of any two vectors in \( {\mathbf{U}}_{v \times k} \) (or \( {\mathbf{W}}_{v \times k}^{T} \)). Figure 8 presents a sample matrix for computing the degree of term overlap, consisting of three vectors of the top five terms. The overlap@5 values between the vector pairs (1, 2), (1, 3), and (2, 3) were 1/5, 2/5, and 3/5, respectively, yielding an average of 2/5 as the overlap@5 for the matrix.
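A direct sketch of Eq. 13 over the columns of U (or W^T), ranking words by the absolute value of their weights as in Fig. 7; numpy is assumed and the function name is illustrative.

```python
import numpy as np

def overlap_at_n(M, n):
    """Eq. 13: average pairwise overlap of the top-n weighted words across the
    k column vectors (latent vectors or independent components) of a v x k matrix."""
    k = M.shape[1]
    top = [set(np.argsort(-np.abs(M[:, j]))[:n]) for j in range(k)]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(len(top[i] & top[j]) / n for i, j in pairs) / len(pairs)
```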

Fig. 8
figure 8

Example of computing the degree of term overlap

By averaging the degree of term overlap over the matrices corresponding to the near-synonym sets, we obtain the degree of term overlap of LSA, ICA, and LSA + ICA, respectively defined as overlapLSA@n, overlapICA@n, and overlapLSA+ICA@n. Figure 9 shows the degree of term overlap for LSA, ICA, and LSA + ICA averaged over the seven Chinese near-synonym sets against various dimensionality values. The results show that ICA achieved the lowest degree of term overlap for both overlap@10 and overlap@50. In addition, combining LSA and ICA reduced the degree of term overlap relative to using LSA alone, especially for smaller dimensionalities. As indicated in Fig. 9, the differences between overlapLSA+ICA@10 and overlapLSA@10, and between overlapLSA+ICA@50 and overlapLSA@50, increased as the dimensionality decreased, mainly because the features discovered by LSA were not easily separable in a lower dimensional space. In this circumstance, incorporating ICA can more effectively reduce the degree of term overlap. As the dimensionality increased, the difference between LSA and LSA + ICA gradually decreased, mainly because the increase in dimensionality itself decreased the degree of term overlap in LSA, so ICA produced only a limited further reduction. As indicated, both overlapLSA+ICA@10 and overlapLSA+ICA@50 changed only slightly once the dimensionality exceeded 1000.

Fig. 9
figure 9

Degree of term overlap of LSA, ICA, and LSA + ICA, as a function of dimensionality

To further analyze the relationship between term overlap and classification performance, Fig. 10 compares the classification accuracy of LSA, ICA, and LSA + ICA from Fig. 6 with overlapLSA@50, overlapICA@50, and overlapLSA+ICA@50 from Fig. 9. Comparing the degree of term overlap and the classification performance of LSA + ICA and LSA shows that reducing the degree of term overlap improved classification performance. At small dimensionalities, the performance of LSA was low due to the high degree of term overlap. Combining LSA and ICA at this stage yielded a greater performance improvement because LSA + ICA effectively reduced the degree of term overlap in LSA such that useful context features could be separated into different independent components according to their contribution to different near-synonyms. Increasing the dimensionality improved the performance of LSA owing to the reduced degree of term overlap. Meanwhile, the performance of LSA + ICA was not similarly improved, since the reduction in the degree of term overlap was small, resulting in a gradual decrease in the performance difference between LSA + ICA and LSA. Another observation is that LSA + ICA also outperformed ICA, despite ICA having a lower degree of term overlap than LSA + ICA. This is mainly because LSA + ICA can discover more useful context features than ICA while also minimizing feature dependence.

Fig. 10
figure 10

Comparison of the classification accuracy and the degree of term overlap of LSA, ICA, and LSA + ICA, as a function of dimensionality

4 Applications

Based on the contextual information provided by PMI and n-gram statistics, we implemented a prototype system with two functions, contextual statistics and near-synonym choice, both of which interact with learners.

4.1 Contextual Statistics

This function provides the contextual information retrieved by PMI and n-gram statistics. The prototype system features a total of 21 English near-synonyms grouped into seven near-synonym sets, as shown in Table 1. Figure 11 shows a screenshot of the interface for contextual information lookup. Once a near-synonym set is selected, the 100 top-ranked context words and n-grams are retrieved for each near-synonym in the set. For PMI, both proportion-based PMI scores and co-occurrence frequencies between near-synonyms and their context words are presented. For n-gram, the 100 top-ranked n-grams with their frequencies are presented. Through this function, learners can identify the most frequently co-occurring and most discriminative words and n-grams for different near-synonyms.

Fig. 11
figure 11

Screenshot of contextual statistics

4.2 Near-Synonym Choice

This function assists learners in determining suitable near-synonyms when they are not familiar with the various usages of the near-synonyms in a given context. Learners can specify a near-synonym set and then input a sentence with “*” representing any near-synonym in the set. The system replaces “*” with each near-synonym and then retrieves the contextual information around “*” using PMI and n-gram statistics. Figure 12 shows a sample sentence (the original word substance has been replaced with *) along with the contextual information retrieved by the system. For PMI, at most five context words (the window size) before and after “*” are included to compute proportion-based PMI scores for each near-synonym. In addition, the sum of all PMI scores for each near-synonym is presented to facilitate learner decisions. For n-gram, the frequencies of the n-grams (n = 2–5) containing each near-synonym are retrieved. In the example shown in Fig. 12, learners can learn useful word pairs such as (substance, matter) and n-grams such as “substance of the matter,” thus learning to discriminate between substance, material, and stuff.
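The sketch below illustrates how such a query could be processed: each near-synonym is substituted for the “*” placeholder and scored with the contextual statistics described above. The scoring callback score_fn is an assumption standing in for the PMI or n-gram back end.

```python
def suggest(sentence, near_synonyms, score_fn):
    """Replace the "*" placeholder with each candidate near-synonym, score the
    filled sentence with the contextual statistics (e.g., a PMI sum or n-gram
    frequencies), and return the candidates ranked from best to worst."""
    tokens = sentence.split()
    gap = tokens.index("*")
    ranked = [(score_fn(tokens[:gap] + [ns] + tokens[gap + 1:], gap), ns)
              for ns in near_synonyms]
    return [ns for _, ns in sorted(ranked, reverse=True)]
```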

Fig. 12
figure 12

Screenshot of near-synonym choice

5 Conclusion

A framework that incorporates LSA and ICA for near-synonym choice is presented. Both LSA and ICA are used to analyze the contexts of near-synonyms. LSA is used to discover useful latent contextual features, while ICA is used to estimate the independent components with minimal dependence between the features. Experiments compared several supervised and unsupervised methods on both Chinese and English corpora. Results show that the ICA-based methods can reduce the degree of term overlap to improve the classifiers’ ability to distinguish among near-synonyms, thus yielding higher classification accuracy.

Future work will focus on improving classification performance by combining multiple features such as predicate-argument structure and named entities occurring in the context of near-synonyms. In addition, current near-synonym choice evaluation is carried out on several preselected near-synonym sets. To make the near-synonym choice task more practical (similar to all-words word sense disambiguation), all-words near-synonym choice evaluation could also be designed and implemented to verify whether every word in a text fits the context well.