1 Introduction

Near-synonym sets represent groups of words with similar meanings, which can be derived from existing lexical ontologies such as WordNet (Fellbaum 1998), EuroWordNet (Rodríguez et al. 1998), and Chinese WordNet (Huang et al. 2008). These are useful knowledge resources for computer-assisted language learning (CALL) (Cheng 2004; Inkpen and Hirst 2006; Inkpen 2007; Ouyang et al. 2009; Wu et al. 2010) and natural language processing (NLP) applications such as information retrieval (IR) (Moldovan and Mihalcea 2000; Navigli and Velardi 2003; Shiri and Revie 2006; Bhogal et al. 2007; Yu et al. 2009) and (near-)duplicate detection for text summarization (Vanderwende et al. 2007). For example, in composing a text, near-synonyms can be used to automatically suggest alternatives so that the same word is not repeated when suitable alternatives are available in the near-synonym set (Inkpen and Hirst 2006). In information retrieval, systems can perform query expansion to improve recall, for example, by recognizing that the weapon sense of “arm” corresponds to the weapon senses of “weapon” and “arsenal”.

Although the words in a near-synonym set have similar meanings, they are not necessarily interchangeable in practical use due to their specific usage and collocational constraints (Wible et al. 2003). Consider the following examples.

  • (E1) {strong, powerful} coffee

  • (E2) ghastly {error, mistake}

  • (E3) {bridge, overpass, tunnel} under the bay.

Examples (E1) and (E2) both illustrate collocational constraints in the given contexts. In (E1), the word “strong” in the near-synonym set {strong, powerful} is more suitable than “powerful” for the given context “coffee,” since “powerful coffee” is an anti-collocation (Pearce 2001). Similarly, in (E2), “mistake” is more suitable than “error” because “ghastly mistake” is a collocation and “ghastly error” is an anti-collocation (Inkpen 2007). In (E3), the near-synonym set {bridge, overpass, tunnel} represents the meaning of a physical structure that connects separate places by traversing an obstacle. Suppose that the original word is “tunnel” in the context “under the bay”. The word “tunnel” cannot be substituted by the other words in the same set because all such substitutions are semantically implausible (Yu et al. 2010). These examples indicate that near-synonyms may have different usages in different contexts, and such differences are not easily captured by second-language learners. Therefore, we develop a computer-assisted near-synonym learning system to assist Chinese English-as-a-Second-Language (ESL) learners in better understanding the usages of various English near-synonyms and using them appropriately in different contexts.

This chapter introduces the use of NLP techniques such as automatic near-synonym choice (Edmonds 1997; Gardiner and Dras 2007; Inkpen 2007; Islam and Inkpen 2010; Wang and Hirst 2010; Yu et al. 2010; Yu and Chien 2013; Yu et al. 2016) to verify whether near-synonyms match the given contexts. The problem of automatic near-synonym choice has been formulated as a “fill-in-the-blank” (FITB) task, as shown in Fig. 1. Given a near-synonym set and a sentence containing one of the near-synonyms, the near-synonym is first removed from the sentence to form a lexical gap. The goal is to predict an answer (best near-synonym) that can fill the gap from the given near-synonym set (including the original word). The systems can then be evaluated by examining their ability to restore the original word with the best near-synonym.

Fig. 1
figure 1

Example of FITB evaluation for automatic near-synonym choice

2 Automatic Near-Synonym Choice

Among many approaches to automatic near-synonym choice, Edmonds’ pioneering study used a lexical co-occurrence network to determine the near-synonym that is most typical or expected in a given context (Edmonds 1997). Other proposed methods can generally be categorized as unsupervised (Gardiner and Dras 2007; Inkpen 2007; Islam and Inkpen 2010; Yu et al. 2010) and supervised methods (Wang and Hirst 2010; Yu and Chien 2013; Yu et al. 2016).

2.1 Unsupervised Methods

In the unsupervised methods, the pointwise mutual information (PMI) (Gardiner and Dras 2007; Inkpen 2007) and n-gram-based methods (Islam and Inkpen 2010; Yu et al. 2010) are the two major methods.

2.1.1 PMI-Based Method

The PMI is used to measure the strength of co-occurrence between a near-synonym and each individual word appearing in its context. A higher mutual information score indicates that the near-synonym fits well in the given context, and thus is more likely to be the correct answer. The pointwise mutual information (Church and Hanks 1990) between two words x and y is defined as

$$ {\text{PMI}}(x,y) = \log_{2} \frac{P(x,y)}{P(x)P(y)}, $$
(1)

where \( P(x,y) = C(x,y)/N \) denotes the probability that x and y co-occur; \( C(x,y) \) is the number of times x and y co-occur in the corpus; and N is the total number of words in the corpus. Similarly, \( P(x) = C(x)/N \), where C(x) is the number of times x occurs in the corpus, and \( P(y) = C(y)/N \), where C(y) is the number of times y occurs in the corpus. Therefore, Eq. 1 can be rewritten as

$$ {\text{PMI}}(x,y) = \log_{2} \frac{C(x,y) \cdot N}{C(x) \cdot C(y)}. $$
(2)

The frequency counts C(·) in Eq. 2 can be retrieved from a large corpus such as the Waterloo terabyte corpus (Inkpen 2007) or the Web 1T 5-gram corpus (Gardiner and Dras 2007).

Given a sentence s with a gap, \( s = \ldots w_{1} \ldots w_{\ell} \ldots w_{\ell + 1} \ldots w_{2\ell} \ldots \), the PMI score for a near-synonym NSj to fill the gap is computed from the words around the gap, defined as

$$ {\text{PMI}}(NS_{j},s) = \sum\limits_{i = 1}^{2\ell} {{\text{PMI}}(NS_{j},w_{i})}, $$
(3)

where \( \ell \) is a window size representing \( \ell \) words to the left and right of the gap. Finally, the near-synonym with the highest score is considered to be the answer.
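As an illustration, the following sketch shows how Eqs. 2 and 3 can be combined to score each candidate near-synonym for a gapped sentence. The counts and function names are hypothetical; real systems retrieve C(·) from Web-scale corpora such as those cited above.

```python
import math
from collections import Counter

def pmi(x, y, unigram, cooc, n_words):
    """PMI of Eq. 2, computed from raw frequency counts."""
    c_xy = cooc.get((x, y), 0)
    if c_xy == 0 or unigram[x] == 0 or unigram[y] == 0:
        return 0.0  # treat unseen pairs as contributing no evidence
    return math.log2(c_xy * n_words / (unigram[x] * unigram[y]))

def pmi_score(candidate, context_words, unigram, cooc, n_words):
    """Eq. 3: sum PMI over the window of words around the gap."""
    return sum(pmi(candidate, w, unigram, cooc, n_words) for w in context_words)

def choose_near_synonym(near_synonyms, left, right, unigram, cooc, n_words, window=2):
    """Return the near-synonym with the highest PMI score for the gap."""
    context = left[-window:] + right[:window]
    return max(near_synonyms,
               key=lambda ns: pmi_score(ns, context, unigram, cooc, n_words))

# Toy counts; a real system would use a corpus such as the Web 1T 5-gram corpus.
unigram = Counter({"error": 1200, "mistake": 1500, "ghastly": 90})
cooc = {("mistake", "ghastly"): 25, ("error", "ghastly"): 2}
print(choose_near_synonym(["error", "mistake"],
                          left=["a", "ghastly"], right=["in", "judgment"],
                          unigram=unigram, cooc=cooc, n_words=1_000_000))
# -> "mistake"
```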

2.1.2 5-Gram Language Model

N-grams can capture contiguous word associations within given contexts. Assume a sentence \( s = \ldots w_{i - 4} w_{i - 3} w_{i - 2} w_{i - 1} w_{i} w_{i + 1} w_{i + 2} w_{i + 3} w_{i + 4} \ldots \), where wi represents a near-synonym in a set. In computing the 5-gram score for each near-synonym, only the five product terms \( P(w_{i} \left| {w_{i - 4}^{i - 1}} \right.) \), \( P(w_{i + 1} \left| {w_{i - 3}^{i}} \right.) \), \( P(w_{i + 2} \left| {w_{i - 2}^{i + 1}} \right.) \), \( P(w_{i + 3} \left| {w_{i - 1}^{i + 2}} \right.) \), and \( P(w_{i + 4} \left| {w_{i}^{i + 3}} \right.) \) are considered (Islam and Inkpen 2010). The other terms are excluded because they do not contain the near-synonym and thus take the same values for every candidate. Accordingly, the 5-gram language model (n = 5) with a smoothing method can be defined as

$$ \begin{aligned} P(s) & = \prod\limits_{i = 1}^{5} {P(w_{i} \left| {w_{i - n + 1}^{i - 1}} \right.)} \\ & = \prod\limits_{i = 1}^{5} {\frac{{C(w_{i - n + 1}^{i}) + (1 + \alpha_{n})M(w_{i - n + 1}^{i - 1})P(w_{i} \left| {w_{i - n + 2}^{i - 1}} \right.)}}{{C(w_{i - n + 1}^{i - 1}) + \alpha_{n} M(w_{i - n + 1}^{i - 1})}}} \\ \end{aligned} $$
(4)

where \( M(w_{i - n + 1}^{i - 1}) \) denotes a missing count used in the smoothing method, defined as

$$ M(w_{i - n + 1}^{i - 1}) = C(w_{i - n + 1}^{i - 1}) - \sum\limits_{{w_{i}}} {C(w_{i - n + 1}^{i})} $$
(5)

where C(·) denotes an n-gram frequency, which can be retrieved from a large corpus such as the Web 1T 5-gram corpus (Islam and Inkpen 2010). The 5-gram language model is implemented as a back-off model. That is, if the frequency of a higher order n-gram is zero, its lower order n-grams are considered; conversely, if the frequency of a higher order n-gram is not zero, the lower order n-grams are not included in the computation. As with the PMI-based method, the near-synonym with the highest score is taken as the answer.
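For illustration, the sketch below scores a candidate with a simplified back-off estimate rather than the exact missing-count smoothing of Eqs. 4 and 5; the `counts` structure (a dict mapping n-gram order to frequency tables) and the function names are assumptions, not part of the cited systems.

```python
def ngram_prob(ngram, counts, order):
    """Back-off estimate: use the highest-order n-gram with a nonzero count.
    This is a simplification of the smoothed model in Eqs. 4 and 5."""
    for k in range(order, 1, -1):
        sub, hist = ngram[-k:], ngram[-k:-1]
        if counts[k].get(sub, 0) > 0 and counts[k - 1].get(hist, 0) > 0:
            return counts[k][sub] / counts[k - 1][hist]
    return 1e-9  # floor probability for unseen events

def five_gram_score(tokens, gap_index, candidate, counts, order=5):
    """Multiply the probabilities of the five 5-grams whose windows contain
    the filled gap (positions i-4 .. i+4 around the candidate)."""
    filled = tokens[:gap_index] + [candidate] + tokens[gap_index + 1:]
    score = 1.0
    for start in range(gap_index - order + 1, gap_index + 1):
        window = tuple(filled[start:start + order]) if start >= 0 else ()
        if len(window) == order:
            score *= ngram_prob(window, counts, order)
    return score
```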

2.2 Supervised Methods

Supervised methods usually approach near-synonym choice as a classification task in which each near-synonym in a near-synonym set represents a class, and the features used for classification are the words that occur in the contexts of the near-synonyms. The near-synonyms and their context words are represented using vector-based representations, which are frequently used in distributional models of lexical semantics (Harris 1954; Lin 1998; Roussinov and Zhao 2003; Weeds et al. 2004). Based on this representation, a co-occurrence matrix of the near-synonyms and their context words can be built from the training data, i.e., a collection of sentences containing the near-synonyms. Each entry in the matrix represents the co-occurrence frequency of a context word and a near-synonym. Different context analysis techniques such as latent semantic analysis (LSA) and independent component analysis (ICA) can then be applied to the context matrix to identify useful context features that contribute to the classification task (Wang and Hirst 2010; Yu and Chien 2013).

2.2.1 Latent Semantic Analysis (LSA)

LSA is a technique for analyzing the relationships between words and documents and has been widely used in many application domains such as information retrieval (Landauer et al. 1998), latent topic discovery (Cribbin 2011), and document clustering (Wei et al. 2008). For automatic near-synonym choice, LSA is used as a context analysis technique to identify useful latent context features for the near-synonyms through indirect associations between words and sentences.

The first step in LSA is to build a word-by-document matrix for near-synonyms and their context words (Wang and Hirst 2010). In addition to documents, for our task, other text units such as sentences or 5-grams could also be used to build the matrix because these text units also contain contextual information for near-synonyms. Figure 2 shows a sample matrix X built using the sentence as the unit. The columns in \( {\mathbf{X}}_{v \times d} \) represent d sentences containing the near-synonyms in a near-synonym set and the rows represent v distinct words occurring in the near-synonyms’ contexts in the corpus. Singular value decomposition (SVD) (Golub and Van Loan 1996) is then used to decompose the matrix \( {\mathbf{X}}_{v \times d} \) into three matrices as follows:

Fig. 2
figure 2

Illustrative example of singular value decomposition for latent semantic analysis

$$ {\mathbf{X}}_{v \times d} = {\mathbf{U}}_{v \times n} {\boldsymbol{\Sigma}}_{n \times n} {\mathbf{V}}_{n \times d}^{T}, $$
(6)

where U and V, respectively, consist of sets of latent vectors of words and sentences, \( {\boldsymbol{\Sigma}} \) is a diagonal matrix of singular values, and \( n = \min (v,d) \) denotes the dimensionality of the latent semantic space. Additionally, each element in U represents the weight of a context word, and the higher weighted words are the useful context features for the near-synonyms. By selecting the largest k1 (≤ n) singular values together with the first k1 columns of U and V, the near-synonyms and their context words can be represented in a low-dimensional latent semantic space. The original matrix can also be reconstructed with the reduced dimensions, as shown in Eq. 7

$$ \widehat{{\mathbf{X}}}_{v \times d} = {\mathbf{U}}_{{v \times k_{1}}} {\boldsymbol{\Sigma}}_{{k_{1} \times k_{1}}} {\mathbf{V}}_{{k_{1} \times d}}^{T}, $$
(7)

where \( \widehat{{\mathbf{X}}} \) represents the reconstructed matrix.
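A minimal numpy sketch of the truncated SVD step in Eqs. 6 and 7, using a tiny made-up word-by-sentence matrix; the helper name lsa_reconstruct is illustrative, not from the cited work.

```python
import numpy as np

def lsa_reconstruct(X, k1):
    """Keep the k1 largest singular values (Eq. 6) and reconstruct the
    word-by-sentence matrix in the reduced latent space (Eq. 7)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U * diag(s) * Vt
    U_k, s_k, Vt_k = U[:, :k1], s[:k1], Vt[:k1, :]
    X_hat = U_k @ np.diag(s_k) @ Vt_k
    return X_hat, U_k, s_k

# Toy matrix: v = 4 context words, d = 3 sentences containing the near-synonyms.
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])
X_hat, U_k, s_k = lsa_reconstruct(X, k1=2)  # zero entries may become nonzero real values
```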

In SVM training and testing, each input sentence with a lexical gap is first transformed into the latent semantic representation as follows:

$$ {\hat{\mathbf{t}}}_{{k_{1} \times 1}} = {\boldsymbol{\Sigma}}_{{k_{1} \times k_{1}}}^{- 1} {\mathbf{U}}_{{k_{1} \times v}}^{T} {\mathbf{t}}_{v \times 1}, $$
(8)

where \( {\mathbf{t}}_{v \times 1} \) denotes the vector representation of an input sentence and \( {\hat{\mathbf{t}}}_{{k_{1} \times 1}} \) denotes the transformed vector in the latent semantic space. Each transformed training vector is then appended with the correct answer (the near-synonym removed from the sentence) to form a (k1 + 1)-dimensional vector for SVM training.
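The following sketch (with randomly generated stand-in data) illustrates the projection of Eq. 8 and the subsequent SVM training; scikit-learn's SVC is used here only for convenience and is not necessarily the SVM implementation used in the cited work.

```python
import numpy as np
from sklearn.svm import SVC

def lsa_transform(t, U_k, s_k):
    """Eq. 8: project a v-dimensional context vector into the k1-dimensional
    latent space, t_hat = Sigma^{-1} U^T t."""
    return (U_k.T @ t) / s_k

# Stand-in training data: a v x d context matrix and the near-synonym (class
# label) that was removed from each of the d training sentences.
rng = np.random.default_rng(0)
X_train = rng.poisson(1.0, size=(4, 50)).astype(float)
y_train = rng.choice(["give", "provide"], size=50)

U, s, _ = np.linalg.svd(X_train, full_matrices=False)
U_k, s_k = U[:, :2], s[:2]                       # keep k1 = 2 dimensions

Z_train = np.stack([lsa_transform(X_train[:, j], U_k, s_k)
                    for j in range(X_train.shape[1])])
clf = SVC(kernel="linear").fit(Z_train, y_train)

# A test sentence with a lexical gap is transformed the same way, then classified.
t_test = np.array([1., 0., 2., 0.])
print(clf.predict([lsa_transform(t_test, U_k, s_k)]))
```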

The strength of LSA lies in discovering latent context features for near-synonyms using SVD. Consider the example shown in Fig. 3. The original matrix, as shown in Fig. 3a, is built using five training sentences containing two different near-synonyms NSi and NSj. Suppose that the words w1 and w2 are useful features for NSi, and w3 and w4 are useful for NSj, but w4 is a latent feature because it does not frequently occur in the context of NSj. After applying SVD, the latent features can be identified by replacing the zero entries in the original matrix with nonzero real values through the indirect associations between words and sentences. For instance, w4 originally does not occur in s3 and s4, but it does co-occur with w3 in the matrix (e.g., in s5), which means that w4 might also occur in the sentences where w3 occurs (e.g., s3 and s4). Therefore, the zero entries (w4, s3) and (w4, s4) are replaced with nonzero values through the indirect association between w3 and w4 in s5, as shown in Fig. 3b. This helps identify a useful latent feature w4 for NSj. However, identifying latent features through indirect associations cannot avoid feature overlap when different near-synonyms share common words in their contexts, which is likely because near-synonyms usually have similar contexts. For instance, in Fig. 3a, w1, which is useful for NSi, also occurs in the context of NSj (e.g., s4). Through the indirect association between w1 and w3 in s4, the frequency of w1 increases in the context of NSj because it may also occur in the sentences where w3 occurs (e.g., s3 and s5), as shown in Fig. 3b. Therefore, when all word features are accommodated in a low-dimensional space reduced by SVD, term overlap may occur between the latent vectors. As indicated in Fig. 3c, the two sample latent vectors, which contribute to two different near-synonyms, share a common feature w1. Classifiers trained on such latent vectors with term overlap may be less able to distinguish among near-synonyms.

Fig. 3
figure 3

Comparison of LSA and ICA for feature representation

2.2.2 Independent Component Analysis (ICA)

ICA is a technique for extracting independent components from a mixture of signals and has been successfully applied to the blind source separation problem (Hyvärinen et al. 2001; Lee 1998). Recent studies have shown that ICA can also be applied to other domains such as text processing (Kolenda and Hansen 2000; Rapp 2004; Sevillano et al. 2004). In contrast to LSA, ICA extracts independent components by minimizing the term dependence of the context matrix. Therefore, LSA and ICA can be considered complementary: LSA can be used to discover latent features that do not frequently occur in the contexts of near-synonyms, and ICA can be used to further minimize the dependence of the latent features such that overlapping features are removed, as presented in Fig. 3d. Based on this complementarity, the ICA-based framework can analyze the LSA output to discover more useful latent features for different near-synonyms while also minimizing the dependence between them. The discriminant power of classifiers can thus be improved by training them on independent components with minimal term overlap. The ICA model can be formally described as

$$ {\mathbf{X}} = {\mathbf{AS}}, $$
(9)

where X denotes the observed mixture signals, A denotes a mixing matrix, and S denotes the independent components. The goal of ICA is to estimate both A and S. Once the mixing matrix A is estimated, the demixing matrix can be obtained by \( {\mathbf{W}} = {\mathbf{A}}^{- 1} \), and Eq. 9 can be rewritten as

$$ {\mathbf{S}} = {\mathbf{WX}}. $$
(10)

That is, the observed mixture signals can be separated into independent components using the demixing matrix.
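A small synthetic demonstration of Eqs. 9 and 10, using scikit-learn's FastICA (the cited work uses the original FastICA package; this is only an illustration). Note that scikit-learn expects a samples-by-features layout, i.e., the transpose of the X in Eq. 9.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S_true = rng.laplace(size=(200, 2))            # two independent source signals
A_true = np.array([[1.0, 0.5],
                   [0.3, 1.0]])                # mixing matrix A
X = S_true @ A_true.T                          # observed mixtures (X = AS in Eq. 9)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                   # estimated independent components (Eq. 10)
W_est = ica.components_                        # estimated demixing matrix W
A_est = ica.mixing_                            # estimated mixing matrix A
```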

For our problem, the context matrix can be considered a mixture of signals because it consists of the contexts of different near-synonyms. ICA is therefore used herein to estimate the demixing matrix so that the mixed contexts can be separated into independent components for each near-synonym. Figure 4 shows the ICA-based framework combining LSA and ICA.

Fig. 4
figure 4

ICA-based framework for near-synonym choice

LSA decomposition and reconstruction

In the training phase, the original context matrix \( {\mathbf{X}}_{v \times d} \) is first decomposed by SVD using Eq. 6, and then reconstructed with reduced dimensions using Eq. 7. Useful latent features that do not frequently occur in the original matrix can thus be discovered in this step.

ICA decomposition and demixing

To further minimize term dependence in deriving the independent components, the reconstructed matrix \( \widehat{{\mathbf{X}}}_{v \times d} \) is passed to ICA to estimate the demixing matrix. ICA accomplishes this by decomposing \( \widehat{{\mathbf{X}}}_{v \times d} \) using Eq. 11. Figure 5 shows an example of the decomposition.

$$ \widehat{{\mathbf{X}}}_{v \times d} = {\mathbf{A}}_{{v \times k_{2}}} {\mathbf{S}}_{{k_{2} \times d}}. $$
(11)
Fig. 5
figure 5

Illustrative example of ICA decomposition

Based on this decomposition, the demixing matrix can be obtained by \( {\mathbf{W}}_{{k_{2} \times v}} = {\mathbf{A}}_{{v \times k_{2}}}^{- 1} \), where k2 (≤ n) is the number of independent components. Similar to the matrix \( {\mathbf{U}} \) in LSA, each element in \( {\mathbf{W}} \) also represents the weight of a context word, and the higher weighted words are useful context features for the near-synonyms. Therefore, the demixing matrix contains useful context features with minimized term dependence for different near-synonyms.

Once estimated, the demixing matrix is used to separate \( \widehat{{\mathbf{X}}}_{v \times d} \) to derive the independent components as follows:

$$ {\mathbf{S}}_{{k_{2} \times d}} = {\mathbf{W}}_{{k_{2} \times v}} \widehat{{\mathbf{X}}}_{v \times d}. $$
(12)

Each column vector in \( {\mathbf{S}}_{{k_{2} \times d}} \) is then appended with the correct answer for SVM training. Similarly, as shown in Fig. 4, each test instance \( {\mathbf{t}}_{v \times 1} \) in the testing phase is also transformed by the demixing matrix and then classified with the trained SVM model.
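Putting the two steps together, the sketch below outlines the training and testing phases of Fig. 4 under simplifying assumptions (centering and whitening bookkeeping is omitted, and the function names are illustrative).

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC

def lsa_ica_train(X, y, k1, k2):
    """Training phase: truncated SVD reconstruction (Eq. 7), ICA demixing of the
    reconstructed matrix (Eqs. 11 and 12), then SVM training on the independent
    components with the class labels y (the removed near-synonyms)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = U[:, :k1] @ np.diag(s[:k1]) @ Vt[:k1, :]   # LSA reconstruction

    ica = FastICA(n_components=k2, random_state=0)
    S = ica.fit_transform(X_hat.T)     # columns of X_hat treated as samples
    W = ica.components_                # demixing matrix (k2 x v)

    clf = SVC(kernel="linear").fit(S, y)
    return clf, W

def lsa_ica_predict(clf, W, t):
    """Testing phase: transform a test context vector with the demixing matrix
    (as in Eq. 12), then classify it with the trained SVM."""
    return clf.predict([W @ t])[0]
```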

3 Experimental Results

This section presents the evaluation results of different methods for near-synonym choice. Section 3.1 describes the experiment setup, including experimental data, implementation details of methods used, and the evaluation metric. Section 3.2 investigates the selection of optimal parameters for LSA and ICA-based methods. Section 3.3 compares the results obtained by the various methods. Section 3.4 presents a detailed analysis to examine the effect of term overlap on classification performance.

3.1 Experiment Setup

3.1.1 Data

As shown in Table 1, seven English and seven Chinese near-synonym sets were used for evaluation. For Chinese near-synonym choice evaluation, two test corpora were used: the Chinese News Corpus (CNC) and the Sinica Corpus (SC), both released by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). The test examples were collected from the two corpora by selecting sentences containing the near-synonyms in the seven Chinese near-synonym sets. A total of 36,427 (CNC) and 26,504 (SC) sentences were collected; 20% of the sentences from each corpus were randomly selected as a development set for parameter tuning of the LSA and ICA-based methods, and the remaining 80% were used as the test set for performance evaluation. In addition, the classifiers (described in the next section) were built from the Chinese Web 5-gram corpus released by the Linguistic Data Consortium (LDC). For English near-synonym choice evaluation, the Web 1T 5-gram corpus released by LDC was used for both classifier training and evaluation using cross-validation.

Table 1 English and Chinese near-synonym sets

3.1.2 Classifiers

The classifiers involved in this experiment included PMI, the 5-gram language model, cosine measure, LSA, and ICA-based methods (including stand-alone ICA and a combination of LSA and ICA). The implementation details for each classifier are as follows:

  • PMI: Given a near-synonym set and a test example with a gap, the PMI scores for each near-synonym were calculated using Eq. 3, where the size of the context window \( \ell \) was set to 2. The frequency counts were retrieved from the Web 1T 5-gram corpus and Chinese Web 5-gram corpus, respectively, for English and Chinese near-synonym choice evaluation.

  • 5 GRAM: The 5-gram scores for each near-synonym in a test example were calculated using Eq. 4. The frequency counts for n-grams were retrieved by querying Google (as in Yu et al. 2010) for English near-synonym choice evaluation, and from the Chinese Web 5-gram corpus for Chinese evaluation.

  • COS: Given a near-synonym set, all 5-grams containing the near-synonyms in the set were first extracted from the training data (i.e., from the Web 1T 5-gram corpus for English near-synonyms and from the Chinese Web 5-gram corpus for Chinese near-synonyms). The near-synonyms were then removed from the 5-grams, and the remaining context words were used to build a word-by-class matrix for the near-synonym set. The best near-synonym for each test example was then predicted by comparing the cosine similarity between the test example and each near-synonym column in the matrix (see the scoring sketch after this list).

  • LSA: Whereas COS used all 5-grams to build the context matrix, for efficiency we randomly selected only 20,000 5-grams to build the word-by-document (5-gram) matrix for each English and Chinese near-synonym set. The number of 5-grams for each near-synonym in the matrix was selected in proportion to its frequency in the corpus. Once the matrix was built, each training instance (i.e., each column vector of the matrix) was transformed into the latent space using Eq. 8. The correct answer (the near-synonym removed from the training instance) was then appended to the corresponding transformed vector to form a (k1 + 1)-dimensional vector for SVM training.

  • ICA: Stand-alone ICA was implemented using Eqs. 9 and 10. The input matrix was the same as that of LSA, and the demixing matrix was estimated using the FastICA package (Hyvärinen 1999). An SVM classifier was then trained using the independent components obtained using Eq. 10 with the correct answers appended.

  • LSA + ICA: The combination of LSA and ICA was implemented by taking the LSA result as input to estimate the demixing matrix in ICA, as shown in Eq. 11. An SVM classifier was then trained using the independent components obtained from Eq. 12 with the correct answers appended. For LSA, ICA, and LSA + ICA, the best near-synonym for each test example was predicted using the trained SVM models.
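As referenced in the COS item above, a minimal sketch of the cosine scoring step is given here; the word-by-class matrix is assumed to hold context-word frequencies accumulated per near-synonym, and the names are illustrative.

```python
import numpy as np

def cos_choose(test_vector, class_matrix, near_synonyms):
    """Return the near-synonym whose column in the word-by-class matrix is most
    similar (by cosine) to the test example's context vector."""
    sims = []
    for j in range(len(near_synonyms)):
        col = class_matrix[:, j]
        denom = np.linalg.norm(test_vector) * np.linalg.norm(col)
        sims.append(float(test_vector @ col) / denom if denom > 0 else 0.0)
    return near_synonyms[int(np.argmax(sims))]
```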

3.1.3 Evaluation Metric

In testing, this experiment followed the FITB evaluation procedure (Fig. 1) to remove the near-synonyms from the test samples. The answers proposed by each classifier are the near-synonyms with the highest score. The correct answers are the near-synonyms originally in the gap of the test samples. Performance is determined by accuracy, which is defined as the number of correct answers made by each classifier, divided by the total number of test examples.
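A compact sketch of this FITB evaluation loop; the data layout (token list, gap index, near-synonym set) and the classifier callback are assumptions for illustration.

```python
def fitb_accuracy(test_examples, choose):
    """Remove the original near-synonym, ask the classifier to fill the gap, and
    count how often the original word is restored (accuracy)."""
    correct = 0
    for tokens, gap, ns_set in test_examples:
        answer = tokens[gap]                               # the original word
        gapped = tokens[:gap] + ["<GAP>"] + tokens[gap + 1:]
        if choose(gapped, gap, ns_set) == answer:
            correct += 1
    return correct / len(test_examples)
```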

3.2 Evaluation of LSA and ICA-Based Methods

Experiments were conducted to compare the performance of LSA, ICA, and LSA + ICA using different settings of the parameters k1 and k2, which respectively represent the dimensionality of the latent semantic space and the number of independent components. The optimal settings of the two parameters were tuned by maximizing the classification accuracy on the development set. Figure 6 shows the classification accuracy of LSA, ICA, and LSA + ICA with different settings of dimensionality (k1 or k2). The accuracies were obtained by averaging over the seven Chinese near-synonym sets. For LSA, the x-axis represents different values of k1, with performance increasing with k1 up to 2000. For ICA, the x-axis represents different values of k2, with optimal performance at k2 = 500. For LSA + ICA, the performance was tuned by varying both k1 and k2. Due to the large number of combinations of k1 and k2, Fig. 6 plots, for each value of k1 on the x-axis, only the best accuracy for LSA + ICA, determined by increasing k2 in increments of 100 and selecting the highest accuracy among the different settings of k2. For instance, the accuracy of LSA + ICA plotted at k1 = 500 was the highest accuracy, achieved at k2 = 500; this accuracy (i.e., that reached at k1 = 500 and k2 = 500) was also the overall optimal performance of LSA + ICA.
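A sketch of this tuning procedure is shown below; the callback dev_accuracy(k1, k2), which trains LSA + ICA with the given dimensionalities and returns development-set accuracy, is assumed, as is the restriction k2 ≤ k1.

```python
def tune_lsa_ica(dev_accuracy, k1_values, k2_step=100):
    """For each k1, sweep k2 in increments of k2_step and keep the best
    development-set accuracy together with the k2 that achieved it."""
    best = {}
    for k1 in k1_values:
        scores = [(dev_accuracy(k1, k2), k2)
                  for k2 in range(k2_step, k1 + 1, k2_step)]
        best[k1] = max(scores)   # (best accuracy, best k2) for this k1
    return best
```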

Fig. 6
figure 6

Classification accuracy of LSA, ICA, and LSA + ICA on the development set, as a function of dimensionality

Fig. 7
figure 7

Examples of latent vectors, selected from \( {\mathbf{U}}_{v \times k}^{{}} \), and independent components, selected from \( {\mathbf{W}}_{v \times k}^{T} \), for the near-synonyms “give” and “provide”. The weights shown in this figure are the absolute values of the weights in the latent vectors and independent components

The results presented in Fig. 6 show that LSA + ICA improved the performance of LSA for all dimensionalities because the degree of term overlap in LSA was reduced by ICA. In addition, the performance difference between LSA + ICA and LSA was greater when the dimensionality was smaller, indicating that ICA was more effective in reducing the degree of term overlap in a low-dimensional space. More detailed analysis of the relationship between the term overlap and the classification performance is discussed in Sect. 3.4. The best settings of the parameters were used in the following experiments.

3.3 Comparative Results

This section reports the classification accuracy of supervised and unsupervised methods including PMI, 5GRAM, COS, LSA, ICA, and LSA + ICA. Table 2 shows the comparative results for the Chinese corpora including Chinese News Corpus (CNC), Sinica Corpus (SC), and both (ALL). The binomial exact test (Howell 2007) was used to determine whether the performance difference was statistically significant.

Table 2 Classification accuracy for Chinese corpora

For the two unsupervised methods, 5GRAM outperformed PMI on both test corpora. One possible reason is that the 5-gram language model can capture contiguous word associations in a given context, whereas PMI treats the words in the context independently. In addition, all supervised methods (i.e., COS, LSA, ICA, and LSA + ICA) achieved better performance than the two unsupervised methods on both test corpora. Among the supervised methods, COS provided the baseline results since it did not use any context analysis technique. As indicated in Table 2, COS yielded higher average accuracy than the best unsupervised method (i.e., 5GRAM) by 2.55 and 7.83% on CNC and SC, respectively, and by 4.79% on ALL. Once context analysis techniques were employed, LSA, ICA, and LSA + ICA significantly improved on COS, indicating that context analysis is a useful technique for near-synonym choice. LSA improved the average accuracy of COS by 7.52, 4.29, and 6.16%, respectively, on CNC, SC, and ALL. The improvement mainly came from the discovery of useful latent features from the contexts of the near-synonyms. ICA also outperformed COS, with the improvement mainly coming from the discovery of independent components by minimizing the feature dependence among near-synonyms. Combining LSA and ICA further improved the performance of both because LSA + ICA can not only discover useful latent features for different near-synonyms but also minimize the dependence between them, thus improving the classifiers’ ability to distinguish between near-synonyms.

For the evaluation of English near-synonym choice, 20,000 5-grams for each English near-synonym set were randomly selected from the Web 1T 5-gram corpus (Web 1T), and the near-synonyms in the 5-grams were removed for the purpose of FITB evaluation. Ten-fold cross-validation was then used to determine the classification accuracy of the methods, with a t-test to determine statistical significance. In addition, for PMI, frequency counts were retrieved from the whole Web 1T 5-gram corpus, and those for 5GRAM were retrieved by querying Google (as in Yu et al. 2010). Table 3 presents the comparative results of the various methods, showing that LSA + ICA improved the performance of both stand-alone LSA and ICA.

Table 3 Classification accuracy for the English corpus

3.4 Term Overlap Analysis

This section investigates the effect of term overlap on classification performance. Term overlap in LSA and the ICA-based methods can be estimated from their respective matrices \( {\mathbf{U}}_{v \times k} \) and \( {\mathbf{W}}_{v \times k}^{T} \). Each column of \( {\mathbf{U}}_{v \times k} \) and \( {\mathbf{W}}_{v \times k}^{T} \) represents a latent vector or independent component of v words, and each element in the vector is a word weight representing its relevance to the corresponding latent vector or independent component. Therefore, the meaning of each latent vector or independent component can be characterized by its higher weighted words. Figure 7 shows two sample latent vectors for LSA and two independent components for LSA + ICA.

The upper part of Fig. 7 shows some of the context words and their weights in the two latent vectors, where latent vector #1 can be characterized by friend, opinion, and chance, which are useful features for identifying the near-synonym “give,” and latent vector #2 can be characterized by protection, information, and increase, which are useful for identifying the near-synonym “provide”. Although the two latent vectors contained useful context features for their respective near-synonyms, these features still overlapped somewhat between the latent vectors, as marked by the dashed rectangles. The overlapping features, especially those with higher weights, may reduce the classifier’s ability to distinguish between the near-synonyms. The lower part of Fig. 7 shows the two independent components for the near-synonyms “give” and “provide”. As indicated, the term overlap between the two independent components was relatively low.

To formally compare the degree of term overlap of LSA and the ICA-based methods, we used a measure, overlap@n, to calculate the degree of overlap of the top n ranked words among the latent vectors in \( {\mathbf{U}}_{v \times k} \) and the independent components in \( {\mathbf{W}}_{v \times k}^{T} \). First, the words in each latent vector (or independent component) were ranked in descending order of their weights. The top n words were then selected to form a word set. Let \( s_{i}^{n} \) and \( s_{j}^{n} \) be the two word sets of the top n words for any two latent vectors (or independent components). The degree of term overlap between them was calculated as the size of the intersection of the two word sets divided by n, i.e., \( \left| {s_{i}^{n} \cap s_{j}^{n}} \right| / n \). The degree of term overlap for a whole matrix, namely \( overlap_{{{\mathbf{U}}_{v \times k}}} \) (or \( overlap_{{{\mathbf{W}}_{v \times k}^{T}}} \)), is then the average of the degrees of term overlap between all pairs of latent vectors (or independent components) in \( {\mathbf{U}}_{v \times k} \) (or \( {\mathbf{W}}_{v \times k}^{T} \)). That is,

$$ overlap_{{{\mathbf{U}}_{v \times k}^{{}}}} @n = \frac{1}{{C_{2}^{k}}}\sum\limits_{i = 1}^{k} {\sum\limits_{j = i + 1}^{k} {\frac{{\left| {s_{i}^{n} \cap s_{j}^{n}} \right|}}{n}}}, $$
(13)

where \( C_{2}^{k} \) denotes the number of combinations of any two vectors in \( {\mathbf{U}}_{v \times k} \) (or \( {\mathbf{W}}_{v \times k}^{T} \)). Figure 8 presents a sample matrix for computing the degree of term overlap, consisting of three vectors of the top five terms. The overlap@5 values between the vector pairs (1, 2), (1, 3), and (2, 3) were 1/5, 2/5, and 3/5, respectively, yielding an average of 2/5 as the overlap@5 for the matrix.
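A direct sketch of Eq. 13 over the columns of U (or W^T), ranking words by the absolute value of their weights as in Fig. 7; numpy is assumed and the function name is illustrative.

```python
import numpy as np

def overlap_at_n(M, n):
    """Eq. 13: average pairwise overlap of the top-n weighted words across the
    k column vectors (latent vectors or independent components) of a v x k matrix."""
    k = M.shape[1]
    top = [set(np.argsort(-np.abs(M[:, j]))[:n]) for j in range(k)]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(len(top[i] & top[j]) / n for i, j in pairs) / len(pairs)
```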

Fig. 8
figure 8

Example of computing the degree of term overlap

By averaging the degree of term overlap over the matrices corresponding to the near-synonym sets, we obtain the degree of term overlap of LSA, ICA, and LSA + ICA, respectively defined as overlapLSA@n, overlapICA@n, and overlapLSA+ICA@n. Figure 9 shows the degree of term overlap for LSA, ICA, and LSA + ICA averaged over the seven Chinese near-synonym sets against various dimensionality values. The results show that ICA achieved the lowest degree of term overlap for both overlap@10 and overlap@50. In addition, combining LSA and ICA reduced the degree of term overlap relative to using LSA alone, especially for smaller dimensionalities. As indicated in Fig. 9, the differences between overlapLSA+ICA@10 and overlapLSA@10, and between overlapLSA+ICA@50 and overlapLSA@50, increased as the dimensionality decreased, mainly because the features discovered by LSA were not easily separable in a lower dimensional space. In this circumstance, incorporating ICA can more effectively reduce the degree of term overlap. As the dimensionality increased, the difference between LSA and LSA + ICA gradually decreased, mainly because the increase in dimensionality itself decreased the degree of term overlap in LSA, so ICA produced only a limited further reduction. As indicated, both overlapLSA+ICA@10 and overlapLSA+ICA@50 changed only slightly once the dimensionality exceeded 1000.

Fig. 9
figure 9

Degree of term overlap of LSA, ICA, and LSA + ICA, as a function of dimensionality

To further analyze the relationship between term overlap and classification performance, Fig. 10 compares the classification accuracy of LSA, ICA, and LSA + ICA from Fig. 6 with overlapLSA@50, overlapICA@50, and overlapLSA+ICA@50 from Fig. 9. Comparing the degree of term overlap and the classification performance of LSA + ICA and LSA shows that reducing the degree of term overlap improved classification performance. At small dimensionalities, the performance of LSA was low due to the high degree of term overlap. Combining LSA and ICA at this stage yielded a greater performance improvement because LSA + ICA effectively reduced the degree of term overlap in LSA such that useful context features could be separated into different independent components according to their contribution to different near-synonyms. Increasing the dimensionality improved the performance of LSA owing to the reduced degree of term overlap. Meanwhile, the performance of LSA + ICA was not similarly improved, since the reduction in the degree of term overlap was small, resulting in a gradual decrease in the performance difference between LSA + ICA and LSA. Another observation is that LSA + ICA also outperformed ICA, despite ICA having a lower degree of term overlap than LSA + ICA. This is mainly because LSA + ICA can discover more useful context features than ICA while also minimizing feature dependence.

Fig. 10
figure 10

Comparison of the classification accuracy and the degree of term overlap of LSA, ICA, and LSA + ICA, as a function of dimensionality

4 Applications

Based on the contextual information provided by PMI and n-gram statistics, we implemented a prototype system with two functions, contextual statistics and near-synonym choice, both of which interact with learners.

4.1 Contextual Statistics

This function provides the contextual information retrieved by PMI and n-gram statistics. The prototype system features a total of 21 English near-synonyms grouped into seven near-synonym sets, as shown in Table 1. Figure 11 shows a screenshot of the interface for contextual information lookup. Once a near-synonym set is selected, the 100 top-ranked context words and n-grams are retrieved for each near-synonym in the set. For PMI, both proportion-based PMI scores and co-occurrence frequencies between near-synonyms and their context words are presented. For n-gram, the 100 top-ranked n-grams with their frequencies are presented. Through this function, learners can identify the most frequently co-occurring and most discriminative words and n-grams for different near-synonyms.

Fig. 11
figure 11

Screenshot of contextual statistics

4.2 Near-Synonym Choice

This function assists learners in determining suitable near-synonyms when they are not familiar with the various usages of the near-synonyms in a given context. Learners can specify a near-synonym set and then input a sentence with “*” representing any near-synonym in the set. The system replaces “*” with each near-synonym and then retrieves the contextual information around “*” using PMI and n-gram statistics. Figure 12 shows a sample sentence (the original word substance has been replaced with *) along with the contextual information retrieved by the system. For PMI, at most five context words (the window size) before and after “*” are included to compute proportion-based PMI scores for each near-synonym. In addition, the sum of all PMI scores for each near-synonym is presented to facilitate learner decisions. For n-gram, the frequencies of the n-grams (n = 2–5) containing each near-synonym are retrieved. In the example shown in Fig. 12, learners can learn useful word pairs such as (substance, matter) and n-grams such as “substance of the matter,” thus learning to discriminate between substance, material, and stuff.
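The sketch below illustrates how such a query could be processed: each near-synonym is substituted for the “*” placeholder and scored with the contextual statistics described above. The scoring callback score_fn is an assumption standing in for the PMI or n-gram back end.

```python
def suggest(sentence, near_synonyms, score_fn):
    """Replace the "*" placeholder with each candidate near-synonym, score the
    filled sentence with the contextual statistics (e.g., a PMI sum or n-gram
    frequencies), and return the candidates ranked from best to worst."""
    tokens = sentence.split()
    gap = tokens.index("*")
    ranked = [(score_fn(tokens[:gap] + [ns] + tokens[gap + 1:], gap), ns)
              for ns in near_synonyms]
    return [ns for _, ns in sorted(ranked, reverse=True)]
```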

Fig. 12
figure 12

Screenshot of near-synonym choice

5 Conclusion

A framework that incorporates LSA and ICA for near-synonym choice is presented. Both LSA and ICA are used to analyze the contexts of near-synonyms. LSA is used to discover useful latent contextual features, while ICA is used to estimate the independent components with minimal dependence between the features. Experiments compared several supervised and unsupervised methods on both Chinese and English corpora. Results show that the ICA-based methods can reduce the degree of term overlap to improve the classifiers’ ability to distinguish among near-synonyms, thus yielding higher classification accuracy.

Future work will focus on improving classification performance by combining multiple features such as predicate-argument structure and named entities occurring in the context of near-synonyms. In addition, current near-synonym choice evaluation is carried out on several preselected near-synonym sets. To make the near-synonym choice task more practical (similar to all-words word sense disambiguation), all-words near-synonym choice evaluation could also be designed and implemented to verify whether every word in a text fits the context well.