
1 Introduction

Identifying the semantic relations that exist between two words (or entities) is one of the fundamental steps in many natural language processing (NLP) tasks. For example, to detect word analogies between pairs of words [2,3,4] such as (water, pipe) and (electricity, wire), we must first identify the relation that exists between the two words in each word pair (in this case, flows in). In relational information retrieval [5], given a query "x is to y as z is to?", we would like to retrieve entities that have a semantic relationship with z similar to that between x and y. For example, given the relational search query "Bill Gates is to Microsoft as Steve Jobs is to?", a relational search engine is expected to return the result Apple Inc.

Despite the wide application of relations in NLP systems, it remains a challenging task for humans to come up with representative features for identifying the semantic relation between two given words. In our previous example, the relationship between Bill Gates and Microsoft is complex, as Bill Gates is a founder, a lead developer of many products, and a former CEO of Microsoft. In order for a human to suggest representative features for identifying a relationship given only an entity-pair instance, he/she must not only be familiar with the individual entities, but also know the different relations that could exist between those entities. Therefore, more automated methods for representing relations using descriptive features are necessary.

A popular strategy for representing the relation between two words is to extract lexical or syntactic patterns from the co-occurrence contexts of those words [6]. The extracted lexical patterns can then be used to measure the relational similarity between two word-pairs using a similarity measure defined over the distributions of patterns. Although surface patterns have been used successfully to represent the semantic relations between two words, they suffer from data sparseness. The co-occurrences of two words with a specific pattern can be sparse even in a large corpus, requiring some form of dimensionality reduction in practice [7]. It is also a computationally expensive method because we must consider co-occurrences between surface patterns and all pairs of words. The number of pairwise combinations between words grows quadratically with the number of words, and we require a continuously growing set of surface patterns to cover the relations that exist between the two words in each of those word-pairs.

To overcome the above-mentioned issues in the holistic approach, Turney [1, 8] proposed the Dual Space approach, where the relation between two words is composed using features related to the individual words. Specifically, he used nouns and verbs as features for describing the domain and function spaces, respectively. The proposal to use verbs as a proxy for the functional attributes of words that are likely to contribute towards semantic relations is based on linguistic intuition. Although this intuition is justified by the experimental results, the question "can we learn descriptors of semantic relations from labeled data?" remains unanswered.

We address this question by proposing a method for ranking lexical descriptors for representing the semantic relations that exist between two words. Given a set of word-pairs for a particular relation type, we model the problem of extracting descriptive features as a linear classification problem. Specifically, we train a linear SVM to discriminate between positive (analogous) and randomly generated pseudo-negative (non-analogous) word-pairs using features associated with the individual words. The weights learnt by the classifier for the features can then be used as a ranking score for selecting the most representative features for a particular semantic relation. Experimental results on a benchmark dataset for relation classification show that the proposed feature selection method outperforms several competitive baselines and previously proposed heuristics.

The paper is organized as follows: in Sect. 2 we discuss related work on feature selection in NLP. The methodology adopted in this work is presented in Sects. 3 and 4. The dataset used in this research and the experimental results are discussed in Sect. 5. Finally, we conclude the paper and discuss possible future work.

2 Related Work

Identifying an appropriate feature space for NLP tasks is a problem that has been studied widely in the literature. The most popular and effective methods are based on matrix factorisation, such as Non-Negative Matrix Factorization (NMF), Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). Essentially, these methods transform high-dimensional distributional representations into a low-dimensional latent space. For word-level representation, Latent Semantic Analysis (LSA) relies on SVD to represent a word in a vector space using only the top (300 or more) dimensions, capturing the meaning of words in the low-dimensional latent space [9]. For word-pair representation, Latent Relational Analysis (LRA) is a method proposed by Turney [10] for measuring the similarity in the semantic relations between two pairs of words. In LRA, SVD is applied to a pair-pattern matrix to obtain a latent feature space. Although LRA achieves a satisfactory result on the 374 SAT questions (\(56.1\%\)), factorising such a huge matrix is a complex and time-consuming process (requiring 9 days to run).

On the other hand, many feature selection methods have been proposed in the literature. Selecting important features using a classification approach has been used for different NLP tasks such as sentiment analysis [11] and text classification [12, 13]. Given a number of examples for a specific task, a linear classifier is able to recover the features that are relevant for separating the examples into classes. For example, in text classification documents are represented by the words in the vocabulary, a representation that suffers from the curse of dimensionality. A linear classifier generates coefficients for the features in this space, which can be used to rank the most informative words that help in separating documents into categories.

For sentence-level similarity, Ji and Eisenstein [14] apply a data-driven approach for weighting the features in the paraphrase classification task. Based on a supervised (labelled) dataset, they propose a new weighting metric for features in order to identify the features that are decisive for sentence semantics. The weighting metric uses KL divergence to weight the distributional features in the sentence co-occurrence matrix before the decomposition process. They report a significant improvement on sentence similarity in comparison with other work.

Another approach to selecting a subset of informative features is based on mutual information. The PMI statistical weighting method has been applied to feature selection for document categorisation [15, 16]. It calculates the amount of information that a feature carries about a specific category. Xu et al. [15] show that MI is not an efficient approach for selecting relevant features for text classification compared with other well-known approaches such as Document Frequency (DF) and Information Gain (IG).

While considerable effort has been spent on feature selection for many NLP tasks, little attention has been directed to the relational similarity between two pairs of words. Turney [1] heuristically identifies a space for semantic relations, called the function space, which consists of verb patterns. For example, for the analogy (word, language), (note, music), word and note share the same function, e.g. the function of building units (vocabularies). Similarly, language and music share the same function, the function of communication. To the best of our knowledge, there is no prior work on data-driven feature selection methods for the relational similarity task. This paper contributes towards addressing that issue.

3 Relational Similarity in Feature Space

Let us consider a feature x in some feature space \(\mathcal {S}\). We do not impose any constraints on the type of features here, and the proposed method can handle any type of features that can be used to represent a word such as other words that co-occur with a target word in the corpus (lexical features), or their syntactic categories such as part-of-speech (POS) (syntactic features). The feature space \(\mathcal {S}\) is defined as the set containing all features we extract for all target words. We represent the salience of x in \(\mathcal {S}\) by the discriminative weight \(w(x, \mathcal {S}) \in \mathbb {R}\). For example, if x is a representative feature of \(\mathcal {S}\), then it will have a high \(w(x, \mathcal {S})\). The concept of a discriminative weight can be seen as a feature selection method. If a particular feature is not a good representative of the space, then it will receive a small (ideally zero) weight, thereby effectively pruning out the feature from the space.

Given the above setting, the task of discovering relational feature spaces can be modelled as a problem of computing the discriminative weights for features. We use \(\phi (A)\) to denote the set of non-zero features that co-occur with the word A. The salience \(f(A, x, \mathcal {S})\) of x as a feature of A in \(\mathcal {S}\) is defined as:

$$\begin{aligned} f(A,x,\mathcal {S}) = h(A, x) \times w(x, \mathcal {S}) \end{aligned}$$
(1)

Here, \(h(A,x) \ge 0\) is the strength of association between A and x, and can be computed using any non-negative feature co-occurrence measure. In our experiments we use positive pointwise mutual information (PPMI) computed using corpus counts as h(A, x).
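As an illustration of one possible choice for h, the sketch below computes PPMI from a small word-by-context co-occurrence count matrix; the counts are invented for the example and are not taken from our corpus.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI for a word-by-context co-occurrence count matrix."""
    total = counts.sum()
    p_wx = counts / total                       # joint probabilities P(w, x)
    p_w = p_wx.sum(axis=1, keepdims=True)       # word marginals P(w)
    p_x = p_wx.sum(axis=0, keepdims=True)       # context marginals P(x)
    with np.errstate(divide="ignore"):          # log(0) -> -inf, clipped below
        pmi = np.log(p_wx / (p_w * p_x))
    return np.maximum(pmi, 0.0)                 # keep only positive associations

# Rows: words, columns: context features (made-up counts).
counts = np.array([[10.0, 0.0, 2.0],
                   [ 3.0, 7.0, 1.0]])
print(ppmi(counts))
```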

Equation (1) is analogous to the tf-idf score used in information retrieval in the sense that h(A, x) corresponds to the term-frequency (tf) (i.e. how significant is the presence of x as a feature of A), and \(w(x, \mathcal {S})\) corresponds to the document-frequency (df) (i.e. what is the importance of x as a feature in the space \(\mathcal {S}\)). The similarity, \(\text {sim}_{\mathcal {S}}(A, C)\), between two words A and C in \(\mathcal {S}\) can then be defined as in (2), which is the sum of pointwise products over the intersection of the feature sets \(\phi (A)\) and \(\phi (C)\).

$$\begin{aligned} \text {sim}_\mathcal {S}(A, C) = \sum _{x \in \phi (A) \cap \phi (C)} f(A,x,\mathcal {S}) f(C, x, \mathcal {S}) \end{aligned}$$
(2)

Moreover, by substituting (1) in (2) we get:

$$\begin{aligned} \text {sim}_\mathcal {S}(A, C) = \sum _{x \in \phi (A) \cap \phi (C)} h(A,x)h(C,x){w(x,\mathcal {S})}^2 \end{aligned}$$
(3)

Following the proposal by [1], we can then compute the relational similarity, \(\text {sim}_{\text {rel}}((A, B), (C, D))\), between two word-pairs (A, B) and (C, D) as the geometric mean of their functional similarities:

$$\begin{aligned} \text {sim}_{\text {rel}}((A, B), (C, D)) = \sqrt{\text {sim}_{\mathcal {S}}(A, C) \times \text {sim}_{\mathcal {S}}(B, D)} \end{aligned}$$
(4)
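To make the composition in (1)–(4) concrete, the following sketch computes the relational similarity between two word-pairs from sparse feature representations; the association strengths h and discriminative weights w are hypothetical values, not outputs of our experiments.

```python
import math

def sim_space(phi_a, phi_c, w):
    """sim_S(A, C) as in Eq. (3): sum over shared features of h(A,x) h(C,x) w(x,S)^2."""
    shared = set(phi_a) & set(phi_c)
    return sum(phi_a[x] * phi_c[x] * w.get(x, 0.0) ** 2 for x in shared)

def sim_rel(pair1, pair2, w):
    """sim_rel((A,B),(C,D)) as in Eq. (4): geometric mean of the two similarities."""
    (phi_a, phi_b), (phi_c, phi_d) = pair1, pair2
    return math.sqrt(sim_space(phi_a, phi_c, w) * sim_space(phi_b, phi_d, w))

# Hypothetical sparse feature vectors (feature -> h value) and weights w(x, S).
phi = {
    "water":       {"flows": 2.1, "drink": 1.5},
    "pipe":        {"through": 1.2, "metal": 0.8},
    "electricity": {"flows": 1.9, "power": 1.1},
    "wire":        {"through": 1.0, "metal": 0.7},
}
w = {"flows": 1.0, "through": 0.9, "metal": 0.2, "drink": 0.1, "power": 0.1}

print(sim_rel((phi["water"], phi["pipe"]), (phi["electricity"], phi["wire"]), w))
```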

4 Learning Feature Weights

The relational similarity measure described in Sect. 3 depends on the feature space \(\mathcal {S}\) via the discriminative weights \(w(x, \mathcal {S})\) assigned to each feature x. Therefore, our goal of discovering a representative feature space from data can be seen as the problem of learning \(w(x, \mathcal {S})\). We propose a supervised classification-based approach for computing discriminative weights using a labeled dataset.

Let us consider a labeled dataset consisting of word-pairs (A, B) and (C, D) annotated with \(l=1\) (i.e. the two word-pairs are analogous) or \(l=0\) (otherwise). Here, \(l \in \{0, 1\}\) denotes the class label. From (3) and (4), we see that for two analogous word-pairs, (A, B) and (C, D), their relational similarity increases if the two products h(A, x)h(C, x) and h(B, x)h(D, x) increase. Following this observation, we define a feature x to occur in an instance consisting of the word-pairs (A, B) and (C, D) iff:

$$\begin{aligned} (x \in \phi (A) \cap \phi (C)) \vee (x \in \phi (B) \cap \phi (D)) \end{aligned}$$
(5)

4.1 Linear Classifier Method for Relational Feature Ranking

For the proposed classification-based approach, each positive instance ((A, B), (C, D)) or negative instance \(((A',B'), (C',D'))\) has a corresponding feature vector in \(\mathcal {S}\), such that the entry for x in the positive instance ((A, B), (C, D)) is defined as follows:

$$\begin{aligned} g(((A,B),(C,D)),x)=\mathcal {I}[x \in \phi (A) \cap \phi (C)]+\mathcal {I}[x \in \phi (B) \cap \phi (D)] \end{aligned}$$
(6)

Here, g(((A, B), (C, D)), x) denotes the value of feature x in the feature vector representing the instance ((A, B), (C, D)), and \(\mathcal {I}\) is the indicator function, which returns 1 if the evaluated expression is true, and 0 otherwise. Negative instances are represented likewise. We train a linear-SVM binary classifier to learn a weight for each feature in the feature space. \(w(x,\mathcal {S})\) can be interpreted as the confidence of the feature as an indicator of the strength of the analogy (relational similarity) between (A, B) and (C, D). The absolute value of a feature's weight can be considered a measure of the importance of that feature when discriminating the two classes in a binary linear classifier. Therefore, we rank the features in the space according to the absolute values of the weights \(|w(x,\mathcal {S})|\). Only a linear-kernel classifier explicitly associates weights with individual features; therefore, this approach is restricted to the linear kernel. In the case of non-linear kernels, such as polynomial kernels that can be expanded prior to learning into all feature combinations considered in the kernel computation, we could still apply this technique to identify salient feature combinations. However, we limit the discussion in this paper to finding relational feature spaces consisting of individual features and defer the study of salient feature combinations for relational similarity measurement to future work.
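A minimal sketch of this ranking step using scikit-learn's LinearSVC; the instance matrix is assumed to have been built with the indicator-sum values of (6), and the toy data and variable names are illustrative rather than taken from our implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Each row is one instance ((A,B),(C,D)); column x holds g(((A,B),(C,D)), x)
# from Eq. (6), so entries lie in {0, 1, 2}. Labels: 1 = analogous, 0 = pseudo-negative.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 2, 1],
              [0, 1, 1, 2]])
y = np.array([1, 1, 0, 0])

clf = LinearSVC(C=1.0)            # C is tuned by cross-validation in practice
clf.fit(X, y)

# The learnt coefficients play the role of w(x, S); features are ranked by |w(x, S)|.
weights = clf.coef_.ravel()
ranking = np.argsort(-np.abs(weights))
print("feature ranking (most discriminative first):", ranking)
```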

The proposed method is compared against several baseline methods, namely KL divergence- and PMI-based ranking, in addition to random selection and the heuristic verb space. The KL and PMI methods also require labelled data, as does the proposed classification-based approach.

4.2 KL Divergence-Based Ranking Approach

We consider the KL divergence-based weighting approach proposed by [14] to compute \(w(x,\mathcal {S})\) for relational similarity measurement. For this purpose, we consider two distributions for each feature x in \(\mathcal {S}\), namely p(x) and q(x), where p(x) is computed over analogous instances ((A, B), (C, D)), while q(x) is taken over the unrelated word-pairs \(((A',B'), (C',D'))\). \(p(x) = P(x \in \phi (A)|x \in \phi (C), l=1 \) or \( x \in \phi (B)|x \in \phi (D), l=1)\). Similarly, \(q(x) = P(x \in \phi (A')|x \in \phi (C'), l=0 \) or \( x \in \phi (B')|x \in \phi (D'), l=0)\).

Specifically, we compute the probability p(x) of a feature x being an indicator of the analogous class as follows:

$$\begin{aligned} p(x) = \frac{1}{Z_p(x)} \sum _{(A,B),(C,D) \in \mathcal {N}_{+}} g(((A,B),(C,D)),x) \end{aligned}$$
(7)

Here, \(\mathcal {N}_{+}\) is the set of positive word-pair instances, and the normalisation coefficient \(Z_p(x)\) satisfies \(\sum _{x \in \mathcal {S}} p(x) = 1\). Likewise, we can compute q(x), the probability of a feature x being an indicator of the negative (relationally dissimilar) class, using the feature occurrences in negative instances \(((A',B'), (C',D'))\) as follows:

$$\begin{aligned} q(x) = \frac{1}{Z_q(x)} \sum _{(A',B'),(C',D') \in \mathcal {N}_{-}} g(((A',B'),(C',D')),x) \end{aligned}$$
(8)

Here, \(\mathcal {N}_{-}\) is the set of negative word-pair instances, and the normalization coefficient \(Z_q(x)\) satisfies \(\sum _x q(x) = 1\). Having computed p(x) and q(x), we then compute \(w(x,\mathcal {S})\) as the KL divergence between the two distributions:

$$\begin{aligned} w(x, \mathcal {S}) = p(x)\log \left( \frac{p(x)}{q(x)} \right) . \end{aligned}$$
(9)
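The weights in (7)–(9) can be computed directly from aggregated feature counts; in this sketch the sums of g(·, x) over positive and negative instances are assumed to be given as dictionaries, and the counts are made up.

```python
import math
from collections import Counter

def kl_weights(pos_counts, neg_counts, eps=1e-12):
    """w(x, S) = p(x) log(p(x) / q(x)) following Eqs. (7)-(9).

    pos_counts / neg_counts map each feature x to the sum of g(., x) over
    positive / negative instances; eps guards against division by zero.
    """
    z_p = sum(pos_counts.values())              # normalisation for p(x)
    z_q = sum(neg_counts.values())              # normalisation for q(x)
    weights = {}
    for x in set(pos_counts) | set(neg_counts):
        p = pos_counts.get(x, 0) / z_p
        q = max(neg_counts.get(x, 0) / z_q, eps)
        weights[x] = p * math.log(p / q) if p > 0 else 0.0
    return weights

# Hypothetical aggregated counts over training instances.
pos = Counter({"flows": 8, "through": 5, "metal": 1})
neg = Counter({"flows": 1, "through": 2, "metal": 4})
print(kl_weights(pos, neg))
```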

4.3 PMI Ranking Approach

PMI is used to weight a feature x such that:

$$\begin{aligned} w(x,\mathcal {S})=\text {PMI}(x,\mathcal {N}_{+})-\text {PMI}(x,\mathcal {N}_{-}) \end{aligned}$$
(10)

where PMI\((x,\mathcal {N}_{+})\) measures the association of a feature x with analogous word-pairs, whereas PMI\((x,\mathcal {N}_{-})\) indicates the co-occurrence of a feature with relationally dissimilar pairs. PMI is computed as follows:

$$\begin{aligned} \text {PMI}(x,\mathcal {N}_{+})&=\log \left( \frac{h(x,\mathcal {N}_{+})}{h(x,\mathcal {N}) |\mathcal {N}_{+}|} |\mathcal {N}| \right) \\ \mathcal {N}&= \mathcal {N}_{+} \cup \mathcal {N}_{-} \nonumber \end{aligned}$$
(11)

Here \(\mathcal {N}\) is the union of the sets of positive and negative word-pair instances, and \(h(x,\mathcal {N}_{+})\) is the sum over all analogous pairs:

$$\begin{aligned} \sum _{(A,B),(C,D) \in \mathcal {N}_{+}} g(((A,B),(C,D)),x) \end{aligned}$$

Similarly, \(h(x,\mathcal {N}_{-})\) is calculated considering negative instances in the dataset.
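Analogously, (10)–(11) contrast a feature's association with the positive and negative instance sets; the sketch below reuses hypothetical aggregated counts and treats the instance-set sizes \(|\mathcal {N}_{+}|\) and \(|\mathcal {N}_{-}|\) as given.

```python
import math
from collections import Counter

def pmi_weights(pos_counts, neg_counts, n_pos, n_neg, eps=1e-12):
    """w(x, S) = PMI(x, N+) - PMI(x, N-) following Eqs. (10)-(11).

    pos_counts / neg_counts: feature -> h(x, N+) / h(x, N-), the sums of
    g(., x) over positive / negative instances; n_pos / n_neg: |N+|, |N-|.
    """
    n_all = n_pos + n_neg                        # |N| = |N+ union N-|
    weights = {}
    for x in set(pos_counts) | set(neg_counts):
        h_all = pos_counts.get(x, 0) + neg_counts.get(x, 0)   # h(x, N)
        pmi_pos = math.log(max(pos_counts.get(x, 0), eps) * n_all / (h_all * n_pos))
        pmi_neg = math.log(max(neg_counts.get(x, 0), eps) * n_all / (h_all * n_neg))
        weights[x] = pmi_pos - pmi_neg
    return weights

pos = Counter({"flows": 8, "through": 5, "metal": 1})
neg = Counter({"flows": 1, "through": 2, "metal": 4})
print(pmi_weights(pos, neg, n_pos=10, n_neg=10))
```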

We rank the features according to the absolute values of the weights assigned by each of the described methods in order to define the representative space for measuring relational similarity. After reducing the word representations to the top-ranked feature space, the relational similarity between two given word-pairs is computed as follows:

$$\begin{aligned} \text {sim}_{\text {rel}}((A, B), (C, D)) = \sqrt{\text {sim}(A, C) \times \text {sim}(B, D)} \end{aligned}$$
(12)

Here, sim is the cosine similarity between the corresponding word vectors, defined as follows:

$$\begin{aligned} \text {sim}({{\varvec{x}}},{{\varvec{y}}})=\frac{{{\varvec{x}}}^\top {{\varvec{y}}}}{\left\| {{\varvec{x}}} \right\| \left\| {{\varvec{y}}} \right\| } \end{aligned}$$
(13)

We experimented using both unnormalised word embeddings as well as \(\ell _{2}\) normalised word representations. We found that \(\ell _{2}\) normalised word representations perform better than the unnormalised version in most configurations. Consequently, we report results obtained only with the \(\ell _{2}\) normalised word representations in the remainder of the paper.
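Putting (12) and (13) together, the snippet below reduces dense word vectors to the top-ranked dimensions, \(\ell _{2}\)-normalises them and combines the two cosine similarities; the vectors and the ranking are illustrative, and negative cosines are clipped to zero so that the square root is defined (an assumption of this sketch).

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors, as in Eq. (13)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def sim_rel_topk(vec_a, vec_b, vec_c, vec_d, ranking, k):
    """Relational similarity of (A, B) and (C, D) over the top-k feature space (Eq. 12)."""
    idx = ranking[:k]                                            # keep the k best features
    a, b, c, d = (v[idx] for v in (vec_a, vec_b, vec_c, vec_d))
    a, b, c, d = (v / np.linalg.norm(v) for v in (a, b, c, d))   # l2-normalise
    s_ac = max(cosine(a, c), 0.0)                                # clip negatives (assumption)
    s_bd = max(cosine(b, d), 0.0)
    return np.sqrt(s_ac * s_bd)

# Illustrative dense PPMI-style vectors and a hypothetical feature ranking.
rng = np.random.default_rng(0)
vecs = {w: rng.random(6) for w in ("water", "pipe", "electricity", "wire")}
ranking = np.array([3, 0, 5, 1, 2, 4])
print(sim_rel_topk(vecs["water"], vecs["pipe"], vecs["electricity"], vecs["wire"], ranking, k=3))
```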

5 Experimental Design

5.1 Dataset

The above-mentioned feature selection methods require a labelled dataset of word-pairs for each relation type, which we generate using the following procedure. We used the dataset proposed by Vylomova et al. [17], which consists of triples \(\langle w_1,w_2,r \rangle \), where words \(w_1\) and \(w_2\) are connected by a relation r. This dataset covers 15 relation types; we include the relation types for which there is a sufficient number of pairs to generate our dataset. Consequently, 7 semantic relation types and their subcategories are considered in this study, as presented in Table 1.

Table 1. Statistics of the dataset used in this study.

For each relation, we hold out some word-pairs for testing the methods; in total we have 367 testing pairs distributed among the relations. We generate positive training instances by pairing word-pairs that have the same relation type (considering sub-relations), resulting in 7,187 positive instances. Next, we randomly pair a word-pair from a relation r with a word-pair from a relation \(r'\) such that \(r \ne r'\) to create a pseudo-negative training dataset with approximately the same number of instances as the positive training dataset (i.e., 7,000).
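A sketch of this instance-generation procedure is given below; the triples are illustrative stand-ins for entries of the Vylomova et al. dataset, and the relation names are made up.

```python
import random
from itertools import combinations
from collections import defaultdict

random.seed(0)

# Hypothetical <w1, w2, r> triples standing in for the real dataset.
triples = [("water", "pipe", "flows-in"), ("electricity", "wire", "flows-in"),
           ("dog", "animal", "hypernym"), ("car", "vehicle", "hypernym"),
           ("wheel", "car", "meronym"), ("leaf", "tree", "meronym")]

by_relation = defaultdict(list)
for w1, w2, r in triples:
    by_relation[r].append((w1, w2))

# Positive instances: all pairs of word-pairs that share the same relation.
positives = [(p, q) for pairs in by_relation.values()
             for p, q in combinations(pairs, 2)]

# Pseudo-negative instances: word-pairs drawn from two different relations,
# generated until they roughly match the number of positives.
relations = list(by_relation)
negatives = []
while len(negatives) < len(positives):
    r1, r2 = random.sample(relations, 2)
    negatives.append((random.choice(by_relation[r1]), random.choice(by_relation[r2])))

print(len(positives), "positive /", len(negatives), "pseudo-negative instances")
```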

5.2 Evaluation Measures

During evaluation, we consider the problem of classifying a given pair of words \((w_1,w_2)\) into a specific relation r from a predefined set of relations \(\mathcal {R}\), according to the relation that exists between \(w_1\) and \(w_2\). We measure the relational similarity between a given pair and all the remaining pairs in the testing data. Then, we perform 1-NN relation classification: if the 1-NN has the same relation label as the target pair, we consider it a correct match. Macro-averaged classification accuracy is used as the evaluation measure. We use the PPMI matrix from Turney [18], which contains PPMI values between a word and unigrams from the left and right contexts of that word in a corpus. The total number of features extracted (\(|\mathcal {S}|\)) is 139,246.
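The 1-NN evaluation protocol can be summarised as in the sketch below; sim_rel is assumed to be any of the relational similarity measures above, and test_pairs and labels are the held-out word-pairs and their relation types.

```python
from collections import defaultdict

def macro_accuracy(test_pairs, labels, sim_rel):
    """1-NN relation classification with macro-averaged accuracy.

    test_pairs: list of word-pairs; labels: their relation types;
    sim_rel: callable returning the relational similarity of two word-pairs.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for i, pair in enumerate(test_pairs):
        # Nearest neighbour among all *other* test pairs.
        nn = max((j for j in range(len(test_pairs)) if j != i),
                 key=lambda j: sim_rel(pair, test_pairs[j]))
        total[labels[i]] += 1
        correct[labels[i]] += int(labels[nn] == labels[i])
    # Average the per-relation accuracies (macro-average).
    return sum(correct[r] / total[r] for r in total) / len(total)
```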

5.3 Results

For the classification method, we train a linear SVM using the scikit-learn library. We use 5-fold cross-validation to find the optimal value of the penalty parameter C of the error term. Following Turney [1], we used verbs as \(\mathcal {S}\) to evaluate the performance of the functional space for measuring relational similarity. We used the NLTK POS tagger to identify verbs in the feature space; the verb space identified by the POS tagger contains 12k verbs.

In Table 2, we compare the feature weighting methods discussed in Sect. 4 for the different semantic relation types in the evaluation dataset (listed in Table 1). The accuracies for the SVM-based, KL, PMI and random ranking methods are reported for the top 1k features. For the verb space, the results indicate the performance of the 12k verbs in the feature space. The classification-based weighting approach and the verb space perform equally well for the hypernym relation. For the meronym, event and attribute relation types, the proposed linear SVM outperforms the other feature ranking methods. The KL divergence-based method performs well compared with the other methods for the cause-purpose and space-time relations. Across the relation types compared in Table 2, the classification-based weighting method reports the highest macro-averaged accuracy of all the methods. The fact that the proposed method improves the performance for many relation types in the relational classification task empirically justifies our proposal of a data-driven approach to feature selection for relational similarity measurement.

Table 2. Accuracy per relation type for the top 1000 ranked features.

We also evaluate which of the ranking methods places the relational features at the top of the weighted feature list. Figure 1 shows the micro-averaged accuracies of the top-ranked features selected by the different methods; the verb space is not included in this comparison as it is not a ranking method for feature selection. We start by evaluating the top-ranked feature, subsequently adding 10 more features at a time. The random baseline randomly selects a subset of features from \(\mathcal {S}\). As shown in Fig. 1, the top-weighted features selected by the proposed linear SVM-based approach outperform those of all other methods for relational similarity measurement. The proposed method statistically significantly outperforms (according to a McNemar test with p < 0.05) all other methods in ranking the most informative features at the top of the list. This indicates that the features that are effective for measuring relational similarity are indeed ranked at the top by the proposed method. In addition, our results show that it is possible to maintain relational classification accuracy while using only a small subset of the features (the top 100 features). The KL divergence-based ranking method follows the classification approach in ranking the best features for relational similarity. However, the PMI method performs poorly, giving accuracies comparable to the random feature selection method. PMI is known to assign higher values to rare features, thereby preferring them; we believe this may be an issue when selecting features for representing word-pairs.

Fig. 1. Cumulative evaluation of feature weighting methods.

6 Conclusion

We proposed the first method for discovering a discriminative feature space for measuring relational similarity from data. The relational classification results show that using labeled data to train a linear classifier for feature selection can improve the feature space for relational similarity measurement. The proposed method outperforms the KL and PMI methods for discovering a relational feature space. Using PMI to discover relational features is shown to perform relatively poorly, a finding consistent with previous work on text classification [15]. In addition, the classification-based weighting method performs better for many relation types than the functional verb space. Future research could further improve the feature space for the relational similarity task by combining the verb space with the data-driven discovered features.