Kannada Document Classification Using Unicode Term Encoding Over Vector Space

Kasturi Rangan, R.; Harish, B. S.

doi:10.1007/978-981-16-3342-3_31

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1386)

Abstract

Today, there is great demand for extracting useful information and deriving actionable insights from large volumes of raw textual data. Processing regional-language text, particularly for low-resource languages, is challenging. For Indian Regional Languages (IRL) in particular, text processing is in high demand for applications of natural language processing (NLP) and text analytics, yet text document classification for IRL remains largely unexplored. Kannada is one of the official Indian regional languages. In this paper, a new benchmark dataset of Kannada documents is created and analyzed using machine learning algorithms. The paper proposes Kannada document classification based on an explicit Unicode term encoding, using the vector space model. Two statistical measures, term frequency (TF) and term frequency-inverse document frequency (TF-IDF), are used to classify Kannada documents with K-NN and SVM classifiers. The linear SVM classifier outperforms the K-NN classifier, classifying the Kannada documents with 98.67% mean accuracy over K-fold experiments.
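
As a rough illustration of the kind of pipeline the abstract describes, the Python sketch below tokenizes Kannada text by its Unicode block, encodes each term as a string of code points, and builds TF and TF-IDF vectors. The abstract does not specify the encoding scheme, so the function names (`tokenize`, `encode_term`, `tfidf_vectors`, `cosine`), the hexadecimal term format, the unsmoothed IDF formula, and the sample sentences are all assumptions made for illustration, not the authors' implementation. Cosine similarity is included only as one plausible distance for a K-NN classifier.

```python
import math
import re
from collections import Counter

# The Kannada script occupies the Unicode block U+0C80..U+0CFF.
KANNADA_TOKEN = re.compile(r"[\u0C80-\u0CFF]+")


def tokenize(text):
    """Return maximal runs of Kannada-block characters as terms."""
    return KANNADA_TOKEN.findall(text)


def encode_term(term):
    """Hypothetical explicit encoding: hex code points joined by '-'."""
    return "-".join(f"{ord(ch):04X}" for ch in term)


def term_frequencies(docs):
    """Raw TF: one Counter of encoded terms per document."""
    return [Counter(encode_term(t) for t in tokenize(d)) for d in docs]


def tfidf_vectors(docs):
    """TF-IDF with idf = log(N / df); an assumed, unsmoothed variant."""
    tfs = term_frequencies(docs)
    n_docs = len(docs)
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per term
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tfs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Made-up sample documents; the first two share one term.
    docs = ["ಕನ್ನಡ ಭಾಷೆ", "ಕನ್ನಡ ಸಾಹಿತ್ಯ", "ಕ್ರಿಕೆಟ್ ಪಂದ್ಯ"]
    vectors = tfidf_vectors(docs)
    print(cosine(vectors[0], vectors[1]))  # shared term, so above zero
    print(cosine(vectors[0], vectors[2]))  # no shared terms
```

In practice the resulting vectors would be passed to a classifier such as K-NN or a linear SVM and evaluated with K-fold cross-validation, as the abstract reports; those steps are omitted here to keep the sketch dependency-free.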

Acknowledgements

This work is supported by the Vision Group on Science and Technology (VGST), Department of IT, BT and Science and Technology, Government of Karnataka, India [File No.: VGST/2019-20/GRD No.:850/397].

Author information

Authors and Affiliations

Department of Information Science & Engineering, Vidyavardhaka College of Engineering, Mysuru, India
R. Kasturi Rangan
Department of Information Science & Engineering, JSS Science and Technology University, Mysuru, India
B. S. Harish

Corresponding author

Correspondence to R. Kasturi Rangan.

Editor information

Editors and Affiliations

Department of Mathematical and Computational Sciences, National Institute of Technology Karnataka (NITK), Mangalore, India
Pushparaj Shetty D.
Department of MCA, NMAM Institute of Technology, Karkala, India
Surendra Shetty

About this paper

Cite this paper

Kasturi Rangan, R., Harish, B.S. (2022). Kannada Document Classification Using Unicode Term Encoding Over Vector Space. In: Shetty D., P., Shetty, S. (eds) Recent Advances in Artificial Intelligence and Data Engineering. Advances in Intelligent Systems and Computing, vol 1386. Springer, Singapore. https://doi.org/10.1007/978-981-16-3342-3_31

DOI: https://doi.org/10.1007/978-981-16-3342-3_31
Published: 01 November 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3341-6
Online ISBN: 978-981-16-3342-3
eBook Packages: Intelligent Technologies and Robotics (R0)
