Adaptive Algorithm for Plagiarism Detection: The Best-Performing Approach at PAN 2014 Text Alignment Competition

Sanchez-Perez, Miguel A.; Gelbukh, Alexander; Sidorov, Grigori

doi:10.1007/978-3-319-24027-5_42

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9283)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask of the plagiarism detection competition at PAN 2014, which resulted in the best-performing system at the PAN 2014 competition and outperforms the best-performing system of the PAN 2013 competition by the cumulative evaluation measure Plagdet. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to consider stopwords without increasing the rate of false positives. We introduce a recursive algorithm to extend the ranges of matching sentences to maximal length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. Our system is available as open source.
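
The sentence similarity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' released system: it assumes that the "tf-idf-like" scheme treats each sentence as a document (so rarity is counted over sentences), that similarity is measured by cosine, and that a pair counts as a match above a threshold. The function names, the pooled weighting over both documents, and the threshold value 0.3 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(sentence):
    # Lowercase whitespace tokenization; stopwords are deliberately kept.
    tokens = (tok.strip(PUNCT).lower() for tok in sentence.split())
    return [tok for tok in tokens if tok]


def sentence_vectors(sentences):
    # Weight = term frequency * log(N / sf), where sf is the number of
    # sentences containing the term (an assumed tf-idf-like scheme).
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    sf = Counter()
    for bag in bags:
        sf.update(bag.keys())
    return [{t: tf * math.log(n / sf[t]) for t, tf in bag.items()} for bag in bags]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_seeds(susp_sentences, src_sentences, threshold=0.3):
    # Hypothetical helper: return (i, j) index pairs of sentences whose
    # cosine similarity reaches the (assumed) threshold.
    vectors = sentence_vectors(susp_sentences + src_sentences)
    susp = vectors[:len(susp_sentences)]
    src = vectors[len(susp_sentences):]
    return [(i, j)
            for i, u in enumerate(susp)
            for j, v in enumerate(src)
            if cosine(u, v) >= threshold]


susp = ["The cat sat on the mat.", "The weather is nice today."]
src = ["The dog barked at the mailman.", "The cat sat on the mat."]
seeds = find_seeds(susp, src)
```

In this toy input the word "the" occurs in every sentence, so its weight is log(1) = 0 and it cannot create a match on its own. This is one way a weighting scheme can keep stopwords in the vocabulary without raising the false positive rate. The later stages named in the abstract (extending matched sentences to maximal passages and resolving overlapping cases) are not covered by this sketch.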

Author information

Authors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico
Miguel A. Sanchez-Perez, Alexander Gelbukh & Grigori Sidorov

Corresponding author

Correspondence to Miguel A. Sanchez-Perez.

Editor information

Editors and Affiliations

Institut de Recherche en Informatique de Toulouse, Toulouse, France
Josiane Mothe
Department of Computer Science, University of Neuchatel, Neuchâtel, Switzerland
Jacques Savoy
Faculteit der Geesteswetenschappen, Universiteit Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Institut de Recherche en Informatique de Toulouse, Toulouse, France
Karen Pinel-Sauvagnat
School of Computing, Dublin City University, Dublin, Ireland
Gareth Jones
LIA - CERI, Université d'Avignon et des Pays de Vaucluse, Avignon, France
Eric San Juan
Department of Information Engineering, University of Padua, Padua, Italy
Linda Cappellato
Department of Information Engineering (DEI), University of Padua, Padova, Italy
Nicola Ferro

About this paper

Cite this paper

Sanchez-Perez, M.A., Gelbukh, A., Sidorov, G. (2015). Adaptive Algorithm for Plagiarism Detection: The Best-Performing Approach at PAN 2014 Text Alignment Competition. In: Mothe, J., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2015. Lecture Notes in Computer Science, vol 9283. Springer, Cham. https://doi.org/10.1007/978-3-319-24027-5_42

DOI: https://doi.org/10.1007/978-3-319-24027-5_42
Published: 20 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24026-8
Online ISBN: 978-3-319-24027-5
eBook Packages: Computer Science, Computer Science (R0)
