Abstract
Extracting semantic similarity and relevant information from a corpus is one of the more elusive tasks in text mining because of unstructured data, uneven patterns, multiple resolutions, concealed meaning, and other ambiguities. Semantic similarity analysis focuses on word sense, which is determined by the arrangement of the context words and the other words in the sentence within a given window size. One hurdle in extracting exact semantic similarity from paraphrased statements is corpus length. A longer corpus contains more words and therefore has a better chance of matching any query statement, which gives rise to the over-penalization problem. Length normalization can alleviate this problem. The objective of this study is to capture semantic similarity and pertinent information more efficiently by increasing term frequency saturation and the impact of document normalization while penalizing less. The study introduces a novel method, the Perfect Matching Algorithm (PMA), which reduces over-penalization of the context corpus by applying length normalization that takes into account the lengths of both the query and the context documents.
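The abstract does not give the PMA formula, so the sketch below does not attempt to reproduce it. It uses the standard Okapi BM25 weighting as a stand-in to illustrate the two ideas the abstract names: term frequency saturation (controlled here by `k1`) and document length normalization (controlled by `b`). The function names, the default values `k1=1.2` and `b=0.75`, and the tokenized-list corpus layout are assumptions made for illustration only.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term-frequency component (a stand-in, not the paper's PMA)."""
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the weight grows with tf but stays below k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def idf(n_docs, doc_freq):
    """Smoothed inverse document frequency; always positive."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    `corpus` is a list of token lists and is assumed to be non-empty.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query_tokens):
        tf = doc_tokens.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        doc_freq = sum(1 for d in corpus if term in d)
        score += idf(n_docs, doc_freq) * bm25_term_weight(
            tf, len(doc_tokens), avg_len, k1, b
        )
    return score
```

With the same term count, a document twice the average length receives a smaller weight than an average-length one, which is the penalty the abstract refers to. Lowering `b` weakens that penalty, and raising `k1` delays saturation.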
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Varghese, N., Punithavalli, M. (2022). Semantic Similarity Extraction on Corpora Using Natural Language Processing Techniques and Text Analytics Algorithms. In: Mathur, G., Bundele, M., Lalwani, M., Paprzycki, M. (eds) Proceedings of 2nd International Conference on Artificial Intelligence: Advances and Applications. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-6332-1_16
DOI: https://doi.org/10.1007/978-981-16-6332-1_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6331-4
Online ISBN: 978-981-16-6332-1
eBook Packages: Intelligent Technologies and Robotics (R0)