Abstract
A method for automatically extracting words with similar meanings is presented, based on the analysis of word distribution in large monolingual text corpora. It involves compiling matrices of word co-occurrences and reducing the dimensionality of the semantic space by means of a singular value decomposition. In this way, problems of data sparseness are reduced and a generalization effect is achieved that considerably improves the results. The method is largely language-independent and has been applied to corpora of English, French, German, and Russian; the resulting thesauri are freely available. The English thesaurus has been evaluated by comparing it with experimental results obtained from test persons who were asked to judge word similarities. According to this evaluation, the machine-generated results come close to native speakers' performance.
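The pipeline outlined in the abstract (co-occurrence counting, dimensionality reduction by singular value decomposition, similarity lookup) can be illustrated with a minimal sketch. This is not the author's implementation: the toy corpus, the window size, the number of retained dimensions, the use of raw counts without any weighting, the cosine measure, and the function names `cooccurrence_matrix`, `reduce_dimensions`, and `most_similar` are all assumptions made for illustration.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word pair occurs within `window` positions (assumed setting)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts

def reduce_dimensions(counts, k):
    """Keep the k largest singular dimensions of the co-occurrence matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]  # one k-dimensional vector per word

def most_similar(word, vocab, vectors, top_n=3):
    """Rank the other words by cosine similarity to `word` in the reduced space."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:top_n]

# Hypothetical toy corpus; a real thesaurus would need a large corpus.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "the", "mouse"],
    ["the", "dog", "chases", "the", "cat"],
]
vocab, counts = cooccurrence_matrix(sentences, window=2)
vectors = reduce_dimensions(counts, k=3)
neighbours = most_similar("cat", vocab, vectors)
print(neighbours)
```

On a corpus this small the ranking carries little meaning; the sketch only shows how the three steps fit together. Comparing words in the reduced space rather than on the raw count rows is what lets words that rarely share exact neighbours still end up close, which is one way to read the generalization effect the abstract mentions.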
Cite this article
Rapp, R. The automatic generation of thesauri of related words for English, French, German, and Russian. Int J Speech Technol 11, 147 (2008). https://doi.org/10.1007/s10772-009-9043-7
DOI: https://doi.org/10.1007/s10772-009-9043-7