Comparison of Different Similarity Methods for Text Categorization

  • Conference paper
Innovations in Data Analytics (ICIDA 2022)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1442)

Abstract

Incorporating semantic information into a similarity metric increases its effectiveness and yields results that lend themselves to human interpretation. Similarity computed from a text’s words alone gives less accurate results. This study presents three approaches, each of which builds feature vectors that incorporate semantic information and computes similarities between them. The three methods, Latent Semantic Analysis (LSA) using Word2Vec, Explicit Semantic Analysis using Bag-of-Words, and Soft Cosine Similarity using TF-IDF, draw on textual data and knowledge-based methodologies. The technique produces easily readable documents that can be used in different information retrieval systems. When measuring similarity between short news texts, LSA using Word2Vec vectors outperformed the other two methods.
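
To make the contrast between word-only and semantically informed similarity concrete, the following is a minimal sketch of soft cosine similarity over TF-IDF vectors, one of the three methods named in the abstract. It is not the authors' implementation: the three toy documents, the smoothed IDF formula, and the hand-set term similarities in `related` are all assumptions made for illustration. In practice the term similarities would come from word embeddings or a knowledge base.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant) so every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def similarity_matrix(vocab, related):
    """Build a symmetric term-term similarity matrix with 1.0 on the diagonal."""
    index = {t: i for i, t in enumerate(vocab)}
    n = len(vocab)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (u, v), score in related.items():
        if u in index and v in index:
            sim[index[u]][index[v]] = score
            sim[index[v]][index[u]] = score
    return sim


def soft_dot(a, b, sim):
    """Generalized dot product: sum over i, j of sim[i][j] * a[i] * b[j]."""
    return sum(
        sim[i][j] * a[i] * b[j]
        for i in range(len(a))
        for j in range(len(b))
    )


def soft_cosine(a, b, sim):
    """Soft cosine similarity; with an identity matrix it is the plain cosine."""
    denom = math.sqrt(soft_dot(a, a, sim)) * math.sqrt(soft_dot(b, b, sim))
    return soft_dot(a, b, sim) / denom if denom else 0.0


# Toy corpus (hypothetical): two related news snippets and one unrelated one.
docs = [
    ["car", "crash", "highway"],
    ["automobile", "accident", "highway"],
    ["election", "vote", "result"],
]
vocab, vectors = tfidf_vectors(docs)

# Hand-set term similarities (hypothetical values, not taken from the paper).
related = {("car", "automobile"): 0.9, ("crash", "accident"): 0.8}
sim = similarity_matrix(vocab, related)
identity = similarity_matrix(vocab, {})

plain = soft_cosine(vectors[0], vectors[1], identity)  # words only
soft = soft_cosine(vectors[0], vectors[1], sim)        # with term similarity
print(f"plain cosine: {plain:.3f}, soft cosine: {soft:.3f}")
```

With the identity matrix, the first two documents are linked only by the shared word "highway". Once the matrix records that "car" and "automobile" are related, the score rises, which is the effect the abstract attributes to adding semantic information.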

Corresponding author

Correspondence to Ulligaddala Srinivasarao.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Srinivasarao, U., Karthikeyan, R., Bilal, M.J., Hariharan, S. (2023). Comparison of Different Similarity Methods for Text Categorization. In: Bhattacharya, A., Dutta, S., Dutta, P., Piuri, V. (eds) Innovations in Data Analytics. ICIDA 2022. Advances in Intelligent Systems and Computing, vol 1442. Springer, Singapore. https://doi.org/10.1007/978-981-99-0550-8_39
