Abstract
The paper describes a new framework for computing the semantic similarity of words and concepts using WordNet-like databases. The main advantage of the presented approach is the ability to implement similarity measures as concise expressions in the embedded query language. The preliminary results of the use of the framework to model the semantic similarity of Polish nouns are reported.
1 Introduction
Among various applications of WordNet [4], the task of modeling semantic similarity between words has attracted considerable attention over the last two decades. WordNet-based semantic similarity measures, ranging from simple path-length dependent functions [14, 26] and measures that exploit the notion of the least common subsumer (Note 1) [36] to those that utilize information content computed over corpora [10, 16, 27], have been proposed in the literature. These measures have been evaluated within the task of word sense disambiguation [21] and incorporated into natural language processing and information extraction systems [2, 31]. Despite this wide range of applications, the issue of using other wordnets in place of Princeton WordNet as resources for modeling similarity among words appears not to have gained the same level of attention. Our aim is to use PolNet [35] and PlWordNet [17] to model the semantic similarity of Polish nouns. Since we have not found a software package for measuring semantic similarity that could be easily adapted to make use of both Polish wordnets (cf. Sect. 2), we decided to implement our own. Therefore, the goal of this paper is twofold. First, we present WSim: a new tool for determining degrees of semantic similarity using measures computed over WordNet-like databases (Note 2). Second, we report the preliminary results of the use of WordNet-based similarity measures to model the similarity of Polish nouns. This is, to the best of our knowledge, the first attempt to apply two wordnets developed for the same language in a shared application-oriented task.
The paper is a revised version of [13]. It presents new, unpublished results of supervised models of semantic similarity built on the values of wordnet-based measures (cf. Sect. 6). Furthermore, it reproduces the experiments from the original paper using a Polish Wikipedia dump from November 20, 2017 instead of the February 6, 2014 dump used previously. Lastly, more detailed results on measuring semantic similarity between English counterparts of Polish nouns are provided (cf. Tables 5 and 9).
2 Related Work
WordNet::Similarity [23] is a widely-cited software package that implements a range of WordNet-based semantic similarity measures. This package has become a de facto standard tool for computing similarity scores using WordNet and serves as a reference point for other implementations (e.g., [24]). Unfortunately, WordNet::Similarity operates only on Princeton WordNet and is not able to load wordnets that do not conform to the internal storage format of the wn program distributed with Princeton WordNet [32]. The same restriction applies to the Python interface to WordNet provided by the NLTK toolkit [1]. In addition to Princeton WordNet, the Java reimplementation of WordNet::Similarity by Shima [29], called WS4J, can load the Japanese WordNet [9]. PolNet is not distributed in the Princeton WordNet conformant form and we have not found any tool that could be used to convert it to this format without a vast amount of preprocessing.
A major advance in terms of interoperability is the WordnetTools library [24], which can load any wordnet that is stored in a file conforming to the Wordnet-LMF format [30]. However, at the time of writing, neither PolNet nor PlWordNet had been released in this format. WordnetTools also accepts files in the Global WordNet Grid format [6], but we were unable to load the DEBVisDic-conformant [8] XML file that is part of the PlWordNet distribution.
Since exact replication of results using different software packages is not easy to achieve (see [24, sect. 5.3]), we did not want to use separate tools for computing the values of similarity measures for the two wordnets (e.g., NLTK for PlWordNet and WordnetTools for PolNet). Therefore, we decided to reimplement WordNet-based semantic similarity measures on top of the WQuery suite [11, 12], which is able to load both PlWordNet and PolNet. An additional advantage of this approach is the ability to modify the similarity measures by revising the concise expressions of the WQuery language (cf. Sect. 4) instead of the Java code of WordnetTools, which would require recompilation after any change. Furthermore, since WQuery (version 0.10) can load wordnets stored in Wordnet-LMF, DEBVisDic [8], and the Princeton WordNet internal format (Note 3), we acquired the ability to make direct comparisons between the values of similarity measures computed for lexical databases stored in all of the aforementioned formats.
3 WSim
As mentioned in the previous section, WSim is built on the WQuery suite. Therefore, before computing the values of similarity measures, the wordnet must be converted into the WQuery database format using the wcompile command (Note 4) from the WQuery toolkit. Since both PlWordNet and PolNet are available as XML files compatible with the DEBVisDic editor [8], the -t deb option must be passed to the command.
With the wordnet in the WQuery format, the similarity of pairs of words (or word senses) can be computed by passing them to the standard input of the wsim command, separated by tab characters.
By default, wsim determines the similarity of a pair of words by inverting the length of the shortest path in the hypernymy hierarchy that links the synsets containing the given words; the similarity of the pair samochód (Eng. car) and rower (Eng. bicycle), for instance, is determined in this way with PolNet.
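The computation behind this default measure can be sketched in Python. The toy hypernymy graph below stands in for PolNet, and the 1/(length + 1) convention (counting nodes rather than edges) follows WordNet::Similarity; both are assumptions for illustration, not the actual WSim code.

```python
from collections import deque

def shortest_path_length(graph, a, b):
    """BFS over an undirected view of the hypernymy graph."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == b:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the synsets are not connected

def path_similarity(graph, a, b):
    length = shortest_path_length(graph, a, b)
    return None if length is None else 1.0 / (length + 1)

# Toy hierarchy: both "car" and "bicycle" are vehicles.
hypernyms = {"car": ["vehicle"], "bicycle": ["vehicle"], "vehicle": ["artifact"]}
graph = {}
for syn, hypers in hypernyms.items():
    for h in hypers:
        graph.setdefault(syn, []).append(h)
        graph.setdefault(h, []).append(syn)

path_similarity(graph, "car", "bicycle")  # two edges via "vehicle", so 1/3
```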
WSim implements six semantic similarity measures:
1. inverted length of the shortest path,
2. Wu-Palmer [36],
3. Resnik [27],
4. Jiang-Conrath [10],
5. Leacock-Chodorow [14],
6. Lin [16].
Following [23], we denote these measures by path, wup, res, jcn, lch, and lin, respectively. They can be selected by passing the -m option to the wsim command; for instance, passing wup as the argument of -m computes the Wu-Palmer measure. In the case of the information content dependent measures [10, 16, 27], word (or sense) counts can be submitted in a file passed as the argument of the -c option. If the counts are distributed along with a wordnet (as is the case for Princeton WordNet), the -c option can be skipped.
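The three information content dependent measures can be illustrated with a self-contained sketch. The taxonomy and counts below are invented; the formulas IC(c) = -log p(c), res = IC(lcs), lin = 2·IC(lcs)/(IC(a) + IC(b)), and the Jiang-Conrath distance IC(a) + IC(b) - 2·IC(lcs) follow [10, 16, 27].

```python
import math

# Toy taxonomy and word counts (all numbers are invented for illustration).
hypernym = {"car": "vehicle", "bicycle": "vehicle", "vehicle": "entity"}
counts = {"car": 50, "bicycle": 30, "vehicle": 10, "entity": 10}
total = sum(counts.values())

def ancestors(concept):
    out = []
    while concept in hypernym:
        concept = hypernym[concept]
        out.append(concept)
    return out

def ic(concept):
    # A concept's probability mass includes every concept it subsumes.
    mass = sum(n for c, n in counts.items()
               if c == concept or concept in ancestors(c))
    return -math.log(mass / total)

def lcs(a, b):
    # First concept on b's chain to the root that also lies on a's chain.
    chain = [a] + ancestors(a)
    for c in [b] + ancestors(b):
        if c in chain:
            return c
    return None

def res(a, b):
    return ic(lcs(a, b))

def lin(a, b):
    return 2 * ic(lcs(a, b)) / (ic(a) + ic(b))

def jcn_dist(a, b):
    # Jiang-Conrath distance; similarity tools typically report an inverse.
    return ic(a) + ic(b) - 2 * ic(lcs(a, b))
```

Note that the root ("entity" here) subsumes every count, so its information content is zero, which is why IC-based measures assign no similarity credit to subsumption by the root alone.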
4 Implementation of Measures
The similarity measures are implemented in WSim as functions formulated in the WQuery language [11]. Every function that ends with the _measure suffix is interpreted as a similarity measure and is available through the -m option of the wsim command. For every pair of senses read from the input, the wsim command determines their corresponding synsets and passes them to the function indicated by the argument of the -m option. In the case of pairs of words, wsim returns the maximum of the similarity values computed for every pair of senses of the submitted words.
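The sense-to-word aggregation step described above can be restated as a short sketch (an illustrative stand-in, not the actual wsim code; the sense labels and scores are invented):

```python
from itertools import product

def word_similarity(senses_a, senses_b, sense_sim):
    """Word-level score as the maximum over all sense pairs,
    mirroring how wsim aggregates sense-level similarities."""
    scores = [sense_sim(a, b) for a, b in product(senses_a, senses_b)]
    return max(scores) if scores else None

# Toy sense-level measure keyed on (sense, sense) pairs.
toy_scores = {("bank#1", "river#1"): 0.2, ("bank#2", "river#1"): 0.7}
sim = lambda a, b: toy_scores.get((a, b), 0.0)
```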
Let us consider the Wu-Palmer measure as an example. The measure is given by the following formula (cf. [2, 36]):

$$wup(l, r) = \frac{2 \cdot dep(lcs(l, r))}{dist(l, lcs(l, r)) + dist(r, lcs(l, r)) + 2 \cdot dep(lcs(l, r))}$$

where l and r are synsets, lcs(l, r) denotes the least common subsumer of l and r, dist denotes the distance between two synsets in the hypernymy hierarchy, and dep returns the distance of a synset from the hypernymy root. The Wu-Palmer measure has the following implementation in WQuery:
We will not discuss WQuery in detail (Note 5). In order to follow the examples it is enough to understand that arithmetic expressions, variable assignments (:=), and function calls (f(...)) are interpreted in a manner similar to that of scripting languages such as Python. The arguments are passed to a function in the %A variable and return values are passed using the emit statement. The main advantage of using WQuery in place of a generic scripting language to implement similarity measures is the ability to use regular expressions over the semantic relation names to denote paths in the wordnet graph. In the case of wup_measure, the sub-function lcs_dist, which computes the distance from a synset to its least common subsumer, determines the paths from a synset %s to its subsumer %lcs via a regular expression that traverses the hypernym relation zero or more times. The root_dist function, which computes the distance from a synset to the hypernymy root, uses an analogous expression to denote the paths from a synset %A through zero or more hypernymy links to the synsets that do not have hypernyms (Note 6). We present the complete code implementing these functions below.
The lcs_by_depth function, which is also called by wup_measure, is a built-in function of WQuery that determines the least common subsumers of synsets.
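As a language-neutral restatement of the Wu-Palmer formula (a minimal Python sketch, not the WQuery listing), the score reduces to a one-liner once the depth of the least common subsumer and the distances of both synsets to it are known:

```python
def wup(dep_lcs, dist_l, dist_r):
    """Wu-Palmer score from the quantities in the formula: the depth of
    the least common subsumer (dep_lcs) and the hypernymy distances of
    the left and right synsets to that subsumer (dist_l, dist_r)."""
    return 2 * dep_lcs / (dist_l + dist_r + 2 * dep_lcs)
```

For synsets that coincide with their least common subsumer (both distances zero), the score is exactly 1, which matches the intuition that identical synsets are maximally similar.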
The similarity functions are loaded into WSim at the beginning of execution from a designated directory. Thus, given a correspondence between arguments of wsim and function names and the ability to address arbitrary paths in the wordnet graph using the WQuery language, the user can easily experiment with definitions of new measures. For instance, the user can consider a meronymy-based variant of the path measure by providing the following function to wsim:
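As a rough analogue of such a user-defined variant (the real definition would be a WQuery function; the toy data and names here are hypothetical), a path-style measure can be parameterized by the relation graph it traverses:

```python
from collections import deque

def bfs_length(graph, a, b):
    """Undirected shortest-path length in an arbitrary relation graph."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def make_path_measure(graph):
    """Build a path-style measure over the given relation graph,
    analogous to swapping hypernym for meronym in a WQuery function."""
    def measure(a, b):
        length = bfs_length(graph, a, b)
        return None if length is None else 1.0 / (length + 1)
    return measure

# Toy meronymy: wheels are parts of both cars and bicycles.
meronyms = {"car": ["wheel"], "bicycle": ["wheel"]}
graph = {}
for whole, parts in meronyms.items():
    for p in parts:
        graph.setdefault(whole, []).append(p)
        graph.setdefault(p, []).append(whole)

mero_path = make_path_measure(graph)
```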
5 Semantic Similarity Computation Using Polish Wordnets
Given a tool that accepts lexical databases stored in the DEBVisDic editor compatible format, we can compute the values of similarity measures for both Polish wordnets and compare them to the human similarity ratings. In the case of English, the Rubenstein and Goodenough dataset of 65 human-rated noun pairs [28] and its 30-pair subset from Miller and Charles [19] are often used for the purpose of evaluating similarity measures (e.g., [2, 22, 27]). Paliwoda-Pękosz and Lula [20], who translated this dataset into Polish and had it rated, also report the performance of several similarity measures on 39 pairs of the translated nouns covered by version 0.95 of PlWordNet. We refer hereafter to this dataset as PL39 and to the Rubenstein and Goodenough dataset as RG65. For the purpose of our analysis we use version 2.2 of PlWordNet [17] and version 3.0 of PolNet [35]. Furthermore, in order to determine the values of measures that utilize information content (i.e. Resnik, Jiang-Conrath, and Lin), we use word frequencies derived from Polish Wikipedia.
PlWordNet 2.2 and PolNet 3.0 cover 38 and 26 pairs of nouns from the PL39 dataset, respectively. The correlation coefficients between the values of the similarity measures and the human ratings of the 26 noun pairs common to both wordnets are given in Table 1. It can be seen that, regardless of the correlation type, the Lin measure performs best. The same measure achieves the best results in the case of all 38 word pairs covered by PlWordNet (cf. Table 2). We report the pairs of words from PL39 and the corresponding values of the Lin measure in Table 3 (Note 7). For the purpose of comparison, we also computed the correlation coefficients between the human ratings of the RG65 word pairs and similarity measure scores determined using version 3.0 of WordNet. The results obtained for the 26 word pairs from RG65 that are English counterparts of the PL39 word pairs common to both Polish wordnets are given in columns 2 and 3 of Table 5. Columns 4 and 5 present the results for the 38 pairs of words from RG65 that are counterparts of all PL39 word pairs occurring in PlWordNet. In the case of WordNet, the Leacock-Chodorow measure results in the highest Pearson's correlation, and the Jiang-Conrath and path measures achieve the highest value of Spearman's correlation coefficient among the analyzed similarity functions.
Given the correlation coefficients for a fixed measure and the same corpus (Note 8), it is tempting to compare the two wordnets with respect to their results on the same dataset. However, it must be noted that although the correlation coefficients between the human ratings for the 26 nouns from PL39 and the measure values induced from PolNet are generally higher (Note 9) than the corresponding coefficients derived for PlWordNet, the results are difficult to interpret due to the small size of the dataset and are not significant at the \(\alpha =0.05\) level according to the z-test of Meng, Rosenthal, and Rubin as implemented by Diedenhofen [3].
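Both correlation types used in the evaluation can be computed from scratch; the sketch below uses invented ratings for five word pairs (the actual experiments used the full PL39 data and the cocor package for the significance test):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rank correlation: rank transform (the toy data below
    has no ties), then Pearson on the ranks."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(xs), ranks(ys))

# Invented human ratings and measure scores for five word pairs.
human = [0.5, 1.2, 2.9, 3.4, 3.9]
measure = [0.10, 0.21, 0.33, 0.50, 0.62]
```

Since the toy scores are monotone in the ratings, Spearman's coefficient is exactly 1 while Pearson's stays below 1, which shows why the two coefficients can rank measures differently.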
6 Supervised Similarity Models
Given the values of semantic similarity measures computed for PL39 word pairs, we decided to determine whether supervised models of similarity can be built on the measured values. We developed a range of regression models using the wordnet-based similarity measures as explanatory variables and the similarity score from PL39 as the response variable. The regression methods we considered are: linear regression (lr), neural networks (nn), regression trees (rt), random forests (rf), and \(\epsilon \)-support vector regression (svr). We used the R environment [25] with the stats, nnet [34], rpart [33], randomForest [15], and e1071 [18] packages to develop and evaluate the regression models. For the neural network architecture, we chose a multilayer perceptron with one hidden layer and performed a grid search with 5-fold cross-validation on the training set to determine the number of neurons in the hidden layer. In the case of random forests, we performed a grid search with 5-fold cross-validation to determine the number of trees and the minimum size of the terminal nodes. For the support vector regression, we examined linear, polynomial (Note 10), radial basis, and sigmoid kernels and performed a grid search for the values of the C, \(\gamma \), and \(\epsilon \) parameters (Table 4).
The models were evaluated using the leave-one-out cross-validation technique (i.e. the similarity of a given pair of words is predicted using the model trained on the similarity scores measured for the other pairs). Table 6 presents the correlation coefficients between the human ratings of the PL39 dataset word pairs and the similarity scores determined by the supervised models built on the values of wordnet-based similarity measures. For this experiment, we restricted the dataset to the 26 word pairs from PL39 that occur in both Polish wordnets. Table 7 reports the results obtained for models built on the 38 noun pairs from PL39 that occur in PlWordNet. It can be seen that in both settings the Lin measure outperforms the supervised models, with the sole exception of the random forest model built from the similarity measures determined using PlWordNet for the dataset restricted to the 26 common word pairs. Furthermore, even in this exceptional case, the difference between the correlation coefficients determined for the Lin measure and the random forest model is not significant at the \(\alpha =0.05\) level according to the z-test of Meng, Rosenthal, and Rubin. Similar results can be observed for the word pairs from RG65 that are English counterparts of the pairs of nouns from PL39 (Table 8). The Leacock-Chodorow measure outperforms the supervised models with respect to Pearson's correlation, whereas the Jiang-Conrath and path measures outperform the supervised models with respect to Spearman's rank correlation. This suggests that, in the case of a small dataset, it is worth choosing one of the wordnet-based similarity measures instead of trying to build a supervised regression model on top of them.
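The evaluation protocol can be sketched generically; a trivial mean predictor stands in for the lr/nn/rt/rf/svr models, and all names are illustrative:

```python
def loocv_predictions(X, y, fit, predict):
    """Leave-one-out cross-validation: each item's similarity is predicted
    by a model trained on all the remaining items."""
    preds = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        preds.append(predict(model, X[i]))
    return preds

# Stand-in "model" that predicts the training mean (the real experiments
# used the regression methods listed above instead).
fit = lambda X, y: sum(y) / len(y)
predict = lambda model, x: model

preds = loocv_predictions([[0], [0], [0]], [1.0, 2.0, 3.0], fit, predict)
```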
7 Conclusion
We presented a new framework for computing semantic similarity using wordnet-based measures. The main advantages of our tool are its compatibility with various wordnet database formats and the ability to implement new measures using an embedded query language. The framework was employed to model the semantic similarity of nouns using measures derived from two Polish wordnets, PlWordNet and PolNet. The results must be considered preliminary due to the small size of the dataset used for the purpose of evaluation. Nevertheless, this is the first attempt to use both Polish wordnets within the context of a shared task.
In the future, we plan to extend the framework with additional measures (e.g., [7]). We also intend to create a larger evaluation set that will cover the content of PolNet more extensively.
Notes
1. A joint transitive hypernym of two synsets such that no other joint transitive hypernym of these synsets is placed below it within the hypernymy hierarchy.
2. Databases that are organized similarly to WordNet [4], called wordnets in the rest of the paper.
3. Through the JWI library [5].
4. We assume in the following examples that all commands are invoked in the Linux shell environment.
5. Interested readers can consult [11].
6. The synsets satisfying the condition empty(hypernym).
7. The pair środek dnia/południe is omitted in Table 3, since środek dnia occurs in neither PlWordNet 2.2 nor PolNet 3.0.
8. In the case of information content-based measures.
9. With the exception of Pearson's correlation coefficient for the Jiang-Conrath measure.
10. Polynomial kernels of degrees 2 and 3 were considered.
References
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media, Sebastopol (2009)
Budanitsky, A., Hirst, G.: Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist. 32(1), 13–47 (2006)
Diedenhofen, B.: cocor: Comparing correlations (Version 1.0-0) (2013). http://r.birkdiedenhofen.de/pckg/cocor/
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA (1998)
Finlayson, M.A.: Java libraries for accessing the princeton wordnet: comparison and evaluation. In: Proceedings of the 7th Global Wordnet Conference, Tartu, Estonia, pp. 78–85 (2014)
Global WordNet Association: Global WordNet Grid (2012). http://globalwordnet.org/global-wordnet-grid/. Accessed 20 Sept 2015
Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and correction of malapropisms, chap. 13, pp. 305–332. In: Fellbaum [4] (1998)
Horak, A., Pala, K., Rambousek, A., Povolny, M.: DEBVisDic - first version of new client-server wordnet browsing and editing tool. In: Sojka, P., et al. (eds.) Proceedings of the Third International WordNet Conference - GWC 2006. Masaryk University, Brno, Czech Republic (2005)
Isahara, H., Bond, F., Uchimoto, K., Utiyama, M., Kanzaki, K.: Development of the Japanese WordNet. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco, 26 May–1 June 2008, European Language Resources Association (2008)
Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of 10th International Conference on Research in Computational Linguistics, ROCLING 1997 (1997)
Kubis, M.: A query language for wordnet-like lexical databases. In: Pan, J.-S., Chen, S.-M., Nguyen, N.T. (eds.) ACIIDS 2012. LNCS (LNAI), vol. 7198, pp. 436–445. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28493-9_46
Kubis, M.: A tool for transforming wordnet-like databases. In: Vetulani, Z., Mariani, J. (eds.) LTC 2011. LNCS (LNAI), vol. 8387, pp. 343–355. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08958-4_28
Kubis, M.: A semantic similarity measurement tool for WordNet-like databases. In: Vetulani, Z., Mariani, J. (eds.) Proceedings of the 7th Language and Technology Conference, pp. 150–154. Fundacja Uniwersytetu im. Adama Mickiewicza, Poznań, Poland, November 2015
Leacock, C., Chodorow, M.: Combining local context and wordnet similarity for word sense identification, chap. 11, pp. 265–283. In: Fellbaum [4] (1998)
Liaw, A., Wiener, M.: Classification and Regression by randomForest. R News 2(3), 18–22 (2002). http://CRAN.R-project.org/doc/Rnews/
Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning, ICML 1998, pp. 296–304. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1998)
Maziarz, M., Piasecki, M., Szpakowicz, S.: Approaching plWordNet 2.0. In: Proceedings of the 6th Global Wordnet Conference. Matsue, Japan, January 2012
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F.: e1071: Misc functions of the department of statistics (e1071), TU Wien, R package version 1.6-3 (2014). http://CRAN.R-project.org/package=e1071
Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Lang. Cognit. Process. 6(1), 1–28 (1991)
Paliwoda-Pękosz, G., Lula, P.: Measures of semantic relatedness based on wordnet. In: International Workshop For Ph.D. Students. Brno, Czech Republic (2009). ISBN: 978-80-214-3980-1
Patwardhan, S., Banerjee, S., Pedersen, T.: Using measures of semantic relatedness for word sense disambiguation. In: Gelbukh, A. (ed.) CICLing 2003. LNCS, vol. 2588, pp. 241–257. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36456-0_24
Pedersen, T.: Information content measures of semantic similarity perform better without sense-tagged text. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT 2010, pp. 329–332. Association for Computational Linguistics, Stroudsburg, PA, USA (2010)
Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity: Measuring the Relatedness of Concepts. In: Demonstration Papers at HLT-NAACL 2004, pp. 38–41, HLT-NAACL-Demonstrations 2004, Association for Computational Linguistics, Stroudsburg, PA, USA (2004). http://dl.acm.org/citation.cfm?id=1614025.1614037
Postma, M., Vossen, P.: What implementation and translation teach us: the case of semantic similarity measures in wordnets. In: Orav, H., Fellbaum, C., Vossen, P. (eds.) Proceedings of the Seventh Global Wordnet Conference, Tartu, Estonia, pp. 133–141 (2014)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2014). http://www.R-project.org/
Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern. 19(1), 17–30 (1989)
Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI 1995, pp. 448–453. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1995)
Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Commun. ACM 8(10), 627–633 (1965)
Shima, H.: ws4j - WordNet Similarity for Java (2015). https://code.google.com/p/ws4j/. Accessed 28 Aug 2015
Soria, C., Monachini, M., Vossen, P.: Wordnet-LMF: fleshing out a standardized format for wordnet interoperability. In: Proceedings of the 2009 International Workshop on Intercultural Collaboration, pp. 139–146. ACM, New York, USA (2009)
Stevenson, M., Greenwood, M.A.: A semantic approach to IE pattern induction. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 2005, pp. 379–386. Association for Computational Linguistics, Stroudsburg, PA, USA (2005)
Tengi, R.I.: Design and Implementation of the WordNet Lexical Database and Searching Software, chap. 4, pp. 105–127. In: Fellbaum [4] (1998)
Therneau, T., Atkinson, B., Ripley, B.: rpart: recursive partitioning and regression trees, R package version 4.1-8 (2014). http://CRAN.R-project.org/package=rpart
Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S. Springer, New York (2002). https://doi.org/10.1007/978-0-387-21706-2. http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0
Vetulani, Z., Kubis, M., Obrębski, T.: PolNet - Polish WordNet: Data and Tools. In: Calzolari, N., et al. (eds.) Proceedings of the Seventh International Conference on Language Resources and Evaluation, ELRA, Valletta, Malta, May 2010
Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, ACL 1994, pp. 133–138. Association for Computational Linguistics, Stroudsburg, PA, USA (1994). https://doi.org/10.3115/981732.981751
© 2018 Springer International Publishing AG, part of Springer Nature
Kubis, M. (2018). A Semantic Similarity Measurement Tool for WordNet-Like Databases. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science(), vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_12