
1 Introduction

Among various applications of WordNet [4], the task of modeling semantic similarity between words has attracted considerable attention over the last two decades. WordNet-based semantic similarity measures, ranging from simple path-length dependent functions [14, 26] and measures that exploit the notion of the least common subsumer [36] to those that utilize information content computed over corpora [10, 16, 27], have been proposed in the literature. These measures have been evaluated within the task of word sense disambiguation [21] and incorporated into natural language processing and information extraction systems [2, 31]. Despite a wide range of applications, the issue of using other wordnets in place of Princeton WordNet as resources for modeling similarity among words appears not to have gained the same level of attention. Our aim is to use PolNet [35] and PlWordNet [17] to model the semantic similarity of Polish nouns. Since we have not found a software package for measuring semantic similarity that could be easily adapted to make use of both Polish wordnets (cf. Sect. 2), we decided to implement our own. Therefore, the goal of this paper is twofold. First, we present WSim: a new tool for determining degrees of semantic similarity using measures computed over WordNet-like databases. Second, we report the preliminary results of the use of WordNet-based similarity measures to model the similarity of Polish nouns. This is, to the best of our knowledge, the first attempt to apply two wordnets developed for the same language in a shared application-oriented task.

The paper is a revised version of [13]. It presents new, unpublished results of supervised models of semantic similarity built on the values of wordnet-based measures (cf. Sect. 6). Furthermore, it reproduces the experiments from the original paper using a Polish Wikipedia dump from November 20, 2017 instead of the February 6, 2014 dump used previously. Lastly, more detailed results on measuring semantic similarity between English counterparts of Polish nouns are provided (cf. Tables 5 and 9).

2 Related Work

WordNet::Similarity [23] is a widely-cited software package that implements a range of WordNet-based semantic similarity measures. This package has become a de facto standard tool for computing similarity scores using WordNet and serves as a reference point for other implementations (e.g., [24]). Unfortunately, WordNet::Similarity operates only on Princeton WordNet and is not able to load wordnets that do not conform to the internal storage format of the wn program distributed with Princeton WordNet [32]. The same restriction applies to the Python interface to WordNet provided by the NLTK toolkit [1]. In addition to Princeton WordNet, WS4J, the Java reimplementation of WordNet::Similarity by Shima [29], can load the Japanese WordNet [9]. PolNet is not distributed in a Princeton WordNet-conformant form, and we have not found any tool that could be used to convert it to this format without a vast amount of preprocessing.

A major advance in terms of interoperability is the WordnetTools library [24], which can load any wordnet that is stored in a file conforming to the Wordnet-LMF format [30]. However, at the time of writing, neither PolNet nor PlWordNet had been released in this format. WordnetTools also accepts files in the Global WordNet Grid format [6], but we were unable to load into it the DEBVisDic [8] conformant XML file, which is part of the PlWordNet distribution.

Since exact replication of results using different software packages is not easy to achieve (see [24, sect. 5.3]), we did not want to use separate tools for computing values of similarity measures for the two wordnets (e.g. NLTK for PlWordNet and WordnetTools for PolNet). Therefore, we decided to reimplement WordNet-based semantic similarity measures on top of the WQuery suite [11, 12], which is able to load both PlWordNet and PolNet. An additional advantage of this approach is the ability to modify the similarity measures by revising the concise expressions of the WQuery language (cf. Sect. 4) instead of the Java code of WordnetTools, which, in the case of any changes, would require recompilation. Furthermore, since WQuery (version 0.10) can load wordnets stored in Wordnet-LMF, DEBVisDic [8], and the Princeton WordNet internal format, we acquired the ability to make direct comparisons between the values of similarity measures computed for the lexical databases stored in all of the aforementioned formats.

3 WSim

As mentioned in the previous section, WSim is built on the WQuery suite. Therefore, before computing the values of similarity measures, the wordnet must be converted into the WQuery database format using the wcompile command from the WQuery toolkit. Since both PlWordNet and PolNet are available as XML files compatible with the DEBVisDic editor [8], the -t deb option must be passed to the command:

figure a

With the wordnet in the WQuery format, the similarity of pairs of words (or word senses) can be computed by passing them to the standard input of the wsim command, separated by tab characters.

figure b

By default, wsim determines the similarity of a pair of words by inverting the value of the shortest path length in the hypernymy hierarchy linking the synsets containing the given words; thus for the pair samochód (Eng. car) and rower (Eng. bicycle) the similarity determined with PolNet is

figure c
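As an illustration of this default behaviour, the path measure can be sketched in plain Python. The toy hypernymy graph and synset names below are hypothetical, not taken from PolNet or PlWordNet; the similarity is the inverse of the shortest-path length in the hypernymy hierarchy, as described above.

```python
from collections import deque

# Toy hypernymy graph (synset -> list of hypernyms); illustrative only,
# not taken from PolNet or PlWordNet.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "bicycle": ["wheeled_vehicle"],
    "motor_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["vehicle"],
    "vehicle": [],
}

def shortest_path_length(graph, a, b):
    """Shortest path between two synsets, following hypernymy links
    in either direction (up from one synset, down to the other)."""
    adj = {}
    for node, parents in graph.items():
        for p in parents:
            adj.setdefault(node, set()).add(p)
            adj.setdefault(p, set()).add(node)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the synsets are not connected

def path_similarity(graph, a, b):
    """Inverted length of the shortest path (the default wsim measure);
    assumes two distinct synsets, so the path length is at least 1."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1.0 / dist
```

On this toy graph the pair car/bicycle is linked by a path of three hypernymy links (car, motor_vehicle, wheeled_vehicle, bicycle), so the measure evaluates to 1/3.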

WSim implements six semantic similarity measures:

  1. inverted length of the shortest path,
  2. Wu-Palmer [36],
  3. Resnik [27],
  4. Jiang-Conrath [10],
  5. Leacock-Chodorow [14],
  6. Lin [16].

Following [23], we denote these measures by path, wup, res, jcn, lch, and lin, respectively. They can be selected by passing the -m option to the wsim command. For instance, to compute the Wu-Palmer measure, the command

figure d

must be executed. In the case of information-content-dependent measures [10, 16, 27], word (or sense) counts can be submitted in a file passed as an argument of the -c option, e.g.

figure e

If the counts are distributed along with a wordnet (as is the case for Princeton WordNet), the -c option can be skipped.

figure f

4 Implementation of Measures

The similarity measures are implemented in WSim as functions formulated in the WQuery language [11]. Every function that ends with the _measure suffix is interpreted as a similarity measure and is available through the -m option of the wsim command. For every pair of senses read from the input, the wsim command determines their corresponding synsets and passes them to the function indicated by the argument of the -m option. In the case of pairs of words, wsim returns the maximum of the similarity values computed for every pair of senses of the submitted words.

Let us consider the Wu-Palmer measure as an example. The measure is given by the following formula (cf. [2, 36]):

$$\frac{2*dep(lcs(l,r))}{dist(l, lcs(l,r)) + dist(r, lcs(l,r)) + 2*dep(lcs(l,r))} $$

where l and r are synsets, lcs(l, r) denotes the least common subsumer of l and r, dist denotes the distance between two synsets in the hypernymy hierarchy, and dep returns the distance of a synset from the hypernymy root. The Wu-Palmer measure has the following implementation in WQuery:

figure g
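The same formula can also be sketched outside WQuery. The Python sketch below works on a hypothetical toy hypernymy graph (not a real wordnet) and assumes every synset can reach a hypernymy root, i.e. a synset with no hypernyms.

```python
# Toy hypernymy graph (synset -> list of hypernyms); illustrative only.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "bicycle": ["wheeled_vehicle"],
    "motor_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["vehicle"],
    "vehicle": [],
}

def ancestor_distances(graph, node):
    """Map each subsumer of `node` (including itself) to its shortest
    hypernym-path distance from `node` (the dist function)."""
    dist, frontier = {node: 0}, [node]
    while frontier:
        nxt = []
        for n in frontier:
            for p in graph.get(n, ()):
                if p not in dist or dist[n] + 1 < dist[p]:
                    dist[p] = dist[n] + 1
                    nxt.append(p)
        frontier = nxt
    return dist

def depth(graph, node):
    """dep(): distance of a synset from the hypernymy root,
    i.e. from a synset that has no hypernyms."""
    dist = ancestor_distances(graph, node)
    return min(d for n, d in dist.items() if not graph.get(n))

def wu_palmer(graph, l, r):
    """2*dep(lcs) / (dist(l, lcs) + dist(r, lcs) + 2*dep(lcs))."""
    dl, dr = ancestor_distances(graph, l), ancestor_distances(graph, r)
    # least common subsumer: the deepest shared ancestor
    lcs = max(set(dl) & set(dr), key=lambda n: depth(graph, n))
    dep = depth(graph, lcs)
    return 2 * dep / (dl[lcs] + dr[lcs] + 2 * dep)
```

For car/bicycle the least common subsumer is wheeled_vehicle (depth 1, at distances 2 and 1 from the two synsets), so the measure evaluates to 2·1/(2 + 1 + 2·1) = 0.4.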

We will not discuss WQuery in detail. In order to follow the examples, it is enough to understand that arithmetic expressions, variable assignments (:=), and function calls (f(...)) are interpreted in a manner similar to that of scripting languages such as Python. The arguments are passed to a function in the %A variable and return values are passed using the emit statement. The main advantage of using WQuery in place of a generic scripting language to implement similarity measures is the ability to use regular expressions over the semantic relation names to denote paths in the wordnet graph. In the case of wup_measure, the sub-function lcs_dist, which computes the distance from a synset to its least common subsumer, determines the paths from a synset %s to its subsumer %lcs via the regular expression

$$ {\texttt {\%s.hypernym*.\%lcs}} $$

that traverses zero or more times through the hypernym relation from the synset %s to its subsumer %lcs. The root_dist function that computes the distance from a synset to the hypernymy root uses the expression

$$ {\texttt {\%A.hypernym*[empty(hypernym)]}} $$

to denote the paths from a synset %A through zero or more hypernymy links to the synsets that do not have hypernyms. We present the complete code implementing these functions below.

figure h

The lcs_by_depth function, which is also called by wup_measure, is a built-in function of WQuery that determines the least common subsumers of synsets.
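The effect of the two regular expressions above can be imitated in an ordinary scripting language by computing the transitive closure of the hypernym relation. The following rough Python equivalent (operating on a hypothetical toy graph, not on WQuery itself) illustrates what the expressions denote:

```python
# Toy hypernymy graph (synset -> list of hypernyms); illustrative only.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["vehicle"],
    "vehicle": [],
}

def hypernym_closure(graph, node):
    """Rough equivalent of %s.hypernym*: the synset itself plus every
    synset reachable through zero or more hypernym links."""
    seen, stack = {node}, [node]
    while stack:
        for p in graph.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def hypernymy_roots(graph, node):
    """Rough equivalent of %A.hypernym*[empty(hypernym)]: the reachable
    synsets that have no hypernyms of their own."""
    return {n for n in hypernym_closure(graph, node) if not graph.get(n)}
```

The single WQuery expression thus replaces an explicit graph traversal, which is the main reason the measure definitions stay concise.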

The similarity functions are loaded into WSim at the beginning of execution from a designated directory. Thus, given a correspondence between arguments of wsim and function names and the ability to address arbitrary paths in the wordnet graph using the WQuery language, the user can easily experiment with definitions of new measures. For instance, the user can consider a meronymy-based variant of the path measure by providing the following function to wsim:

figure i

5 Semantic Similarity Computation Using Polish Wordnets

Given a tool that accepts lexical databases stored in the DEBVisDic editor compatible format, we can compute the values of similarity measures for both Polish wordnets and compare them to the human similarity ratings. In the case of English, the Rubenstein and Goodenough dataset of 65 human-rated noun pairs [28] and its 30-pair subset from Miller and Charles [19] are often used for the purpose of evaluating similarity measures (e.g., [2, 22, 27]). Paliwoda-Pękosz and Lula [20], who translated this dataset into Polish and had it rated, also report the performance of several similarity measures on 39 pairs of the translated nouns covered by version 0.95 of PlWordNet. We refer hereafter to this dataset as PL39 and to the Rubenstein and Goodenough dataset as RG65. For the purpose of our analysis we use version 2.2 of PlWordNet [17] and version 3.0 of PolNet [35]. Furthermore, in order to determine the values of measures that utilize information content (i.e. Resnik, Jiang-Conrath, and Lin), we use word frequencies derived from Polish Wikipedia.

PlWordNet 2.2 and PolNet 3.0 cover 38 and 26 pairs of nouns from the PL39 dataset, respectively. The correlation coefficients between the values of the similarity measures and the human ratings of the 26 noun pairs common to both wordnets are given in Table 1. It can be seen that, regardless of the correlation type, the Lin measure performs best. The same measure achieves the best results in the case of all 38 word pairs covered by PlWordNet (cf. Table 2). We report the pairs of words from PL39 and the corresponding values of the Lin measure in Table 3. For the purpose of comparison, we also computed the correlation coefficients between the human ratings of the RG65 word pairs and similarity measure scores determined using version 3.0 of WordNet. The results obtained for the 26 word pairs from RG65 that are English counterparts of the PL39 word pairs common to both Polish wordnets are given in columns 2 and 3 of Table 5. Columns 4 and 5 present the results for the 38 pairs of words from RG65 that are counterparts of all PL39 word pairs occurring in PlWordNet. In the case of WordNet, the Leacock-Chodorow measure results in the highest Pearson’s correlation, and the Jiang-Conrath and path measures achieve the highest value of Spearman’s correlation coefficient among the analyzed similarity functions.

Table 1. Correlation coefficients between the human ratings and similarity measure scores determined for PL39 word pairs that occur in both Polish wordnets.
Table 2. Correlation coefficients between the human ratings and similarity measure scores determined for all PL39 word pairs that occur in PlWordNet.
Table 3. The values of similarity measures determined for the PL39 word pairs.
Table 4. Correlation coefficients between PlWordNet- and PolNet-based measures.

Given the correlation coefficients for a fixed measure and the same corpus, it is tempting to compare the differences between the two wordnets with respect to the results on the same dataset. However, it must be noted that although the correlation coefficients between human ratings for the 26 nouns from PL39 and measure values induced from PolNet are generally higher than the corresponding coefficients derived for PlWordNet, the results are difficult to interpret due to the small size of the dataset and are not significant at the \(\alpha =0.05\) level according to Meng, Rosenthal, and Rubin’s z-test as implemented by Diedenhofen [3].

Table 5. Correlation coefficients between the human ratings and similarity measure scores determined for the RG65 word pairs which are counterparts of the PL39 pairs.

6 Supervised Similarity Models

Given the values of semantic similarity measures computed for PL39 word pairs, we decided to determine whether supervised models of similarity can be built on the measured values. We developed a range of regression models using the wordnet-based similarity measures as explanatory variables and the similarity score from PL39 as the response variable. The regression methods we considered are: linear regression (lr), neural networks (nn), regression trees (rt), random forests (rf), and \(\epsilon \)-support vector regression (svr). We used the R environment [25] with the stats, nnet [34], rpart [33], randomForest [15], and e1071 [18] packages to develop and evaluate the regression models. For the neural network architecture, we chose a multilayer perceptron with one hidden layer and performed a grid search with 5-fold cross-validation on the training set to determine the number of neurons in the hidden layer. In the case of random forests, we performed a grid search with 5-fold cross-validation to determine the number of trees and the minimum size of the terminal nodes. For the support vector regression, we examined linear, polynomial, radial basis, and sigmoid kernels and performed a grid search for values of the C, \(\gamma \), and \(\epsilon \) parameters (Table 4).

Table 6. Correlation coefficients between the human ratings and supervised model scores determined for PL39 word pairs that occur in both Polish wordnets.

The models were evaluated using the leave-one-out cross-validation technique (i.e. the similarity of a given pair of words is predicted using the model trained on the similarity scores measured for the other pairs). Table 6 presents the correlation coefficients between the human ratings of the PL39 dataset word pairs and the similarity scores determined by the supervised models built on the values of wordnet-based similarity measures. For this experiment, we restricted the dataset to the 26 word pairs from PL39 that occur in both Polish wordnets. Table 7 reports the results obtained for models built on the 38 noun pairs from PL39 that occur in PlWordNet. It can be seen that in both settings the Lin measure outperforms the supervised models, with the sole exception of the random forest model built from the similarity measures determined using PlWordNet for the dataset restricted to the 26 common word pairs. Furthermore, even in this exceptional case, the difference between the correlation coefficients determined for the Lin measure and the random forest model is not significant at the \(\alpha =0.05\) level according to Meng, Rosenthal, and Rubin’s z-test. Similar results can be observed for the word pairs from RG65 that are English counterparts of the pairs of nouns from PL39 (Table 8). The Leacock-Chodorow measure outperforms the supervised models with respect to Pearson’s correlation, whereas the Jiang-Conrath and path measures outperform the supervised models with respect to Spearman’s rank correlation. This suggests that, in the case of a small dataset, it is worth choosing one of the wordnet-based similarity measures instead of trying to build a supervised regression model on top of them.
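The leave-one-out protocol used here can be sketched as follows. The measure values and ratings below are synthetic placeholders (not PL39 data), and plain least-squares regression stands in for the R models described in the previous paragraph; the point is only the evaluation loop, in which each pair is predicted by a model that never saw it during training.

```python
import numpy as np

# Synthetic stand-ins: each row holds the values of two wordnet-based
# measures for one word pair; y holds the human similarity ratings.
X = np.array([[0.20, 0.90], [0.40, 0.30], [0.60, 0.80],
              [0.80, 0.10], [1.00, 0.50], [0.30, 0.70]])
y = np.array([0.55, 0.35, 0.75, 0.40, 0.85, 0.50])

def loocv_predictions(X, y):
    """Leave-one-out cross-validation for ordinary least squares:
    the i-th pair is predicted by a model fitted on all other pairs."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        # Design matrix with an intercept column.
        A = np.column_stack([np.ones(n - 1), X[mask]])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        preds[i] = np.concatenate(([1.0], X[i])) @ coef
    return preds
```

The resulting vector of held-out predictions can then be correlated with the human ratings, which is how the coefficients in Tables 6, 7, and 8 were obtained.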

Table 7. Correlation coefficients between the human ratings and supervised model scores determined for all PL39 word pairs that occur in PlWordNet.
Table 8. Correlation coefficients between the human ratings and supervised model scores determined for the RG65 word pairs which are counterparts of the PL39 pairs.
Table 9. The values of similarity measures computed for the RG65 word pairs which are counterparts of the PL39 pairs.

7 Conclusion

We presented a new framework for computing semantic similarity using wordnet-based measures. The main advantages of our tool are its compatibility with various wordnet database formats and the ability to implement new measures using an embedded query language. The framework was employed to model the semantic similarity of nouns using measures derived from two Polish wordnets, PlWordNet and PolNet. The results must be considered preliminary due to the small size of the dataset used for the purpose of evaluation. Nevertheless, this is the first attempt to use both Polish wordnets within the context of a shared task.

In the future, we plan to extend the framework with additional measures (e.g., [7]). We also intend to create a larger evaluation set that will cover the content of PolNet more extensively.