Abstract
This work addresses cross-language high-similarity and near-duplicate search: given a document, the task is to identify a highly similar one in a large cross-language collection of documents. We propose a concept-based similarity model for the problem that is light in both computation and memory. We evaluate the model on three corpora of different nature and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. We compare our model with two state-of-the-art models and find that, although the proposed model is very generic, it produces competitive results and is notably stable and consistent across the corpora.
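To make the general idea of concept-based cross-language similarity concrete, the following is a minimal sketch, not the model evaluated in the paper. It assumes each document is mapped to a vector of thesaurus concept identifiers by exact term matching and that vectors are compared with cosine similarity; the abstract does not specify either choice. The thesaurus entries, concept ids, and function names (`TOY_THESAURUS`, `concept_vector`, `cosine`, `most_similar`) are hypothetical and are not taken from Eurovoc.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: concept id -> descriptor terms per language.
# Ids and terms are invented for illustration; they are NOT Eurovoc entries.
TOY_THESAURUS = {
    "c1": {"en": ["fishery", "fishing"], "de": ["fischerei"]},
    "c2": {"en": ["agriculture"], "de": ["landwirtschaft"]},
    "c3": {"en": ["trade"], "de": ["handel"]},
}


def concept_vector(text, lang):
    """Map a text to a language-independent bag of concept ids.

    Assumption: whitespace tokenisation and exact term matching.
    """
    counts = Counter(text.lower().split())
    vec = Counter()
    for concept_id, terms in TOY_THESAURUS.items():
        freq = sum(counts[t] for t in terms.get(lang, []))
        if freq:
            vec[concept_id] = freq
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, query_lang, collection, collection_lang):
    """Return (score, index) of the best match in a non-empty collection."""
    q_vec = concept_vector(query, query_lang)
    scores = [
        (cosine(q_vec, concept_vector(doc, collection_lang)), i)
        for i, doc in enumerate(collection)
    ]
    return max(scores)


if __name__ == "__main__":
    german_docs = ["landwirtschaft und handel", "fischerei und handel"]
    score, index = most_similar("fishing and trade", "en", german_docs, "de")
    print(index, round(score, 3))
```

Because both documents are reduced to vectors over a shared, small concept vocabulary, the comparison needs no translation step and the vectors stay short, which is one way a model of this kind can remain light in computation and memory.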
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gupta, P., Barrón-Cedeño, A., Rosso, P. (2012). Cross-Language High Similarity Search Using a Conceptual Thesaurus. In: Catarci, T., Forner, P., Hiemstra, D., Peñas, A., Santucci, G. (eds) Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. CLEF 2012. Lecture Notes in Computer Science, vol 7488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33247-0_8
DOI: https://doi.org/10.1007/978-3-642-33247-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33246-3
Online ISBN: 978-3-642-33247-0
eBook Packages: Computer Science, Computer Science (R0)