Kannada Document Classification Using Unicode Term Encoding Over Vector Space

  • Conference paper
Recent Advances in Artificial Intelligence and Data Engineering

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1386)

Abstract

Today, there is great demand for extracting useful information and actionable insights from large volumes of raw textual data. Processing regional-language text, particularly for low-resource languages, is challenging. For Indian Regional Languages (IRL) in particular, text processing is much needed for applications of natural language processing (NLP) and text analytics, yet text document classification for IRL remains largely unexplored. Kannada is one of the official Indian regional languages. In this paper, a new benchmark dataset of Kannada documents is created and analyzed using machine learning algorithms. The paper proposes Kannada document classification based on explicit Unicode term encoding over the vector space model. Two statistical measures, term frequency (TF) and term frequency-inverse document frequency (TF-IDF), are used to classify Kannada documents with K-NN and SVM classifiers. The linear SVM classifier outperforms the K-NN classifier, achieving 98.67% mean accuracy over K-fold experiments.
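The abstract does not spell out how terms are encoded or weighted, so the following is only a minimal sketch of the general idea under stated assumptions: each whitespace-separated term is represented by its sequence of Unicode code points, documents become TF or TF-IDF vectors over those encoded terms, and a query is labeled by its nearest training document under cosine similarity (a 1-NN stand-in for K-NN; the SVM classifier is not shown). The function names, the smoothed IDF variant, and the toy documents are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def encode_term(term: str) -> str:
    """Represent a term by its Unicode code points (assumed encoding scheme)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in term)


def tokenize(text: str) -> list[str]:
    """Split on whitespace (an assumption) and encode every term."""
    return [encode_term(tok) for tok in text.split()]


def tf_vector(tokens: list[str]) -> dict[str, float]:
    """Term frequency: count of each term divided by document length."""
    counts = Counter(tokens)
    return {term: c / len(tokens) for term, c in counts.items()}


def idf_table(docs: list[list[str]]) -> dict[str, float]:
    """Inverse document frequency, log(N / df) + 1 (one common variant)."""
    df = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(len(docs) / d) + 1.0 for term, d in df.items()}


def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Weight each TF value by IDF; terms unseen in training get weight 0."""
    return {term: w * idf.get(term, 0.0) for term, w in tf_vector(tokens).items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(query: dict[str, float], train: list) -> str:
    """1-NN: return the label of the most similar training vector."""
    return max(train, key=lambda pair: cosine(query, pair[0]))[1]


# Hypothetical toy corpus (not the paper's dataset).
corpus = [
    ("ಕ್ರಿಕೆಟ್ ಪಂದ್ಯ ತಂಡ", "sports"),      # cricket, match, team
    ("ಚುನಾವಣೆ ಸರ್ಕಾರ ಮತ", "politics"),   # election, government, vote
]
tokenized = [tokenize(text) for text, _ in corpus]
idf = idf_table(tokenized)
train_vectors = [
    (tfidf_vector(tokens, idf), label)
    for tokens, (_, label) in zip(tokenized, corpus)
]

query_vector = tfidf_vector(tokenize("ತಂಡ ಪಂದ್ಯ"), idf)  # team, match
predicted = nearest_label(query_vector, train_vectors)
print(predicted)  # prints "sports"
```

Swapping `tfidf_vector` for `tf_vector` gives the plain TF representation mentioned in the abstract. Sparse dictionaries are used here only to keep the sketch dependency-free; a real experiment would more likely use a dense or sparse matrix library.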

Acknowledgements

This work is supported by the Vision Group on Science and Technology (VGST), Department of IT, BT and Science and Technology, Government of Karnataka, India [File No.: VGST/2019-20/GRD No.:850/397].

Corresponding author

Correspondence to R. Kasturi Rangan.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Kasturi Rangan, R., Harish, B.S. (2022). Kannada Document Classification Using Unicode Term Encoding Over Vector Space. In: Shetty D., P., Shetty, S. (eds) Recent Advances in Artificial Intelligence and Data Engineering. Advances in Intelligent Systems and Computing, vol 1386. Springer, Singapore. https://doi.org/10.1007/978-981-16-3342-3_31
