Abstract
In professional writing, plagiarism is a fraud and a breach of academic ethics. A similarity check is one of the preliminary steps taken to curb plagiarism, but most similarity-check tools fail to detect text embedded in images. This loophole lets an author place plagiarized text in a document as an image and so bypass the software. To close it, we propose an approach, intended as a supplement to existing software, that extracts machine-readable text from the images in scientific documents. The approach accepts a document in portable document format (PDF) and uses the metadata retrieved from it to automatically detect and localize the images it contains. The text in those images is then converted to electronic form using optical character recognition (OCR). The proposed framework is validated on a document in scientific article format that contains a block of text and flow diagrams in the form of images. The results indicate that the proposed method correctly detects and localizes the text in the images and returns it in machine-readable form.
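The two stages described in the abstract (locate the images in the PDF, then run OCR on each one) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an unencrypted PDF and scans the raw file for image XObjects, reading the /Width, /Height and /Filter entries of each image dictionary. Only JPEG (/DCTDecode) streams are handed to OCR, because their bytes are usable without further decoding. The names `find_images`, `extract_text_from_pdf_images` and `tesseract_ocr` are hypothetical, and the OCR step is passed in as a callable so the control flow can be checked without an OCR engine installed.

```python
import re
from typing import Callable, Dict, List

# An indirect PDF object: "<num> <gen> obj ... endobj".
_OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# Inside a stream object: dictionary, "stream", raw bytes, "endstream".
_STREAM_RE = re.compile(rb"(.*?)stream\r?\n(.*)\r?\nendstream", re.DOTALL)


def find_images(pdf_bytes: bytes) -> List[Dict]:
    """Return one record per image XObject found in the raw PDF bytes.

    Simplification: a regex scan, not a full PDF parser. It ignores
    encryption, inline images and stale objects left by incremental saves.
    """
    images = []
    for match in _OBJ_RE.finditer(pdf_bytes):
        stream = _STREAM_RE.match(match.group(3))
        if stream is None:
            continue  # not a stream object, so it cannot be an image
        header, data = stream.groups()
        if not re.search(rb"/Subtype\s*/Image\b", header):
            continue  # a stream, but not an image (e.g. page content)
        width = re.search(rb"/Width\s+(\d+)", header)
        height = re.search(rb"/Height\s+(\d+)", header)
        filt = re.search(rb"/Filter\s*\[?\s*/(\w+)", header)
        images.append({
            "object": int(match.group(1)),
            "width": int(width.group(1)) if width else None,
            "height": int(height.group(1)) if height else None,
            "filter": filt.group(1).decode("ascii") if filt else None,
            "data": data,
        })
    return images


def extract_text_from_pdf_images(
    pdf_bytes: bytes, ocr: Callable[[bytes], str]
) -> List[Dict]:
    """Run the supplied OCR callable on every JPEG image in the PDF."""
    results = []
    for image in find_images(pdf_bytes):
        if image["filter"] != "DCTDecode":
            continue  # other encodings would need decoding first
        text = ocr(image["data"]).strip()
        results.append({"object": image["object"], "text": text})
    return results


def tesseract_ocr(jpeg_bytes: bytes) -> str:
    """One possible OCR backend; assumes Pillow, pytesseract and the
    Tesseract binary are installed. Imported lazily so the functions
    above work without them."""
    import io
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(io.BytesIO(jpeg_bytes)))
```

With those packages installed, a call such as `extract_text_from_pdf_images(open("paper.pdf", "rb").read(), tesseract_ocr)` (the file name is a placeholder) would return the recognized text per image object, which could then be appended to the document text before it is submitted to a similarity checker.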
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sharmeen, A., Agarwal, N., Suman, A., Agarwal, S., Kumar, K. (2022). Process Design to Self-extract Text from Images for Similarity Check. In: Mohanty, M.N., Das, S. (eds) Advances in Intelligent Computing and Communication. Lecture Notes in Networks and Systems, vol 430. Springer, Singapore. https://doi.org/10.1007/978-981-19-0825-5_11
DOI: https://doi.org/10.1007/978-981-19-0825-5_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0824-8
Online ISBN: 978-981-19-0825-5
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)