The automatic generation of thesauri of related words for English, French, German, and Russian

Rapp, Reinhard

doi:10.1007/s10772-009-9043-7

Published: 29 October 2009

Volume 11, article number 147 (2008)

International Journal of Speech Technology

Abstract

A method for the automatic extraction of words with similar meanings is presented, based on the analysis of word distributions in large monolingual text corpora. It involves compiling matrices of word co-occurrences and reducing the dimensionality of the semantic space by means of a singular value decomposition. In this way, problems of data sparseness are reduced and a generalization effect is achieved that considerably improves the results. The method is largely language-independent and has been applied to corpora of English, French, German, and Russian; the resulting thesauri are freely available. The English thesaurus has been evaluated by comparing it with experimental results obtained from test persons who were asked to judge word similarities. According to this evaluation, the machine-generated results come close to native speakers’ performance.
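The pipeline outlined in the abstract (co-occurrence counts, dimensionality reduction by singular value decomposition, then similarity lookup) can be illustrated with a minimal Python sketch. This is not the author's implementation: the toy corpus, the window size of 2, the use of raw counts without any association weighting, the choice of k = 2 dimensions, the cosine similarity measure, and the function names `cooccurrence_matrix`, `reduce_dimensions`, and `most_similar` are all assumptions made for illustration.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Count how often word pairs occur within `window` tokens of each other."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts


def reduce_dimensions(counts, k):
    """Project each word onto the k largest singular dimensions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def most_similar(word, vocab, vectors, top_n=3):
    """Rank all other words by cosine similarity to `word`."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != word]
    return ranked[:top_n]


# Hypothetical toy corpus; a real run would use a large monolingual corpus.
sentences = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the dog ate the food".split(),
    "the cat ate the food".split(),
    "a car drove down the road".split(),
    "a truck drove down the road".split(),
]

vocab, counts = cooccurrence_matrix(sentences, window=2)
vectors = reduce_dimensions(counts, k=2)
print(most_similar("cat", vocab, vectors))
```

With a corpus this small the ranking carries little meaning; the sketch only shows how the steps fit together. The reduced vectors let words be compared even when they never share an exact context word, which is the generalization effect the abstract attributes to the decomposition.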

Author information

Authors and Affiliations

University of Tarragona, Tarragona, Spain
Reinhard Rapp

Corresponding author

Correspondence to Reinhard Rapp.

About this article

Rapp, R. The automatic generation of thesauri of related words for English, French, German, and Russian. Int J Speech Technol 11, 147 (2008). https://doi.org/10.1007/s10772-009-9043-7

Received: 15 July 2009
Accepted: 08 October 2009
Published: 29 October 2009
DOI: https://doi.org/10.1007/s10772-009-9043-7
