Abstract
Today, there is a great demand for extracting useful information and ability to take actionable insights from heaps of raw textual data. The processing of regional language texts, notably with low resources is challenging. Especially for Indian Regional Languages (IRL), language text processing is in huge requirement with regard to the applications of natural language processing (NLP) and text analytics. As a part of regional language processing and text analytics text documents classification for IRL are yet to be explored. Kannada is one of the official Indian regional languages. In this paper, the new benchmark Kannada document’s dataset is created and analyzed using machine learning algorithms. This paper proposes an explicit Unicode term encoding based Kannada document classification, using the vector space model. Both term frequency (TF) and term frequency-inverse document frequency (TF-IDF), statistical measures are used for classification of Kannada documents using K-NN and SVM classifiers. SVM (linear) classifier performs better than the K-NN classifier in classifying the Kannada documents with 98.67% mean accuracy over K-Fold experiments.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
A. Dhar, N.S. Dash, K. Roy, Classification of bangla text documents based on inverse class frequency, in 2018 3rd International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU) (IEEE, 2018), pp 1–6
A. Dhar, N. Dash, K. Roy, Classification of text documents through distance measurement: An experiment with multi-domain bangla text documents, in 2017 3rd International Conference on Advances in Computing, Communication & Automation (ICACCA)(Fall) (IEEE, 2017), pp 1–6
S. Mohanty, P. Santi, R. Mishra, R. Mohapatra, S. Swain, Semantic based text classification using wordnets: Indian language perspective, in Proceedings of the 3th International Global WordNet Conference, South Jeju Island, Korea, (Citeseer, 2006), pp. 321–324
M. Tummalapalli, M. Chinnakotla, R. Mamidi, Towards better sentence classification for morphologically rich languages, in Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing
A. Dhar, N.S. Dash, K. Roy, Categorization of bangla web text documents based on tf-idf-icf text analysis scheme, in Annual Convention of the Computer Society of India (Springer, 2018), pp. 477–484
P.K. Panigrahi, N. Bele, A review of recent advances in text mining of Indian languages. Int. J. Bus. Inf. Syst. 23(2), 175–193 (2016)
S.A. Narhari, R. Shedge, Text categorization of Marathi documents using modified lingo, in 2017 International Conference on Advances in Computing, Communication and Control (ICAC3) (IEEE, 2017), pp. 1–5
A. Dhar, N.S. Dash, K. Roy, An innovative method of feature extraction for text classification using part classifier, in International Conference on Information, Communication and Computing Technology (Springer, 2018), pp. 131–138
L. Wang, Support vector machines: Theory and applications, vol. 177. (Springer Science & Business Media, 2005)
M. Tummalapalli, R. Mamidi, Syllables for sentence classification in morphologically rich languages, in Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation (2018)
K. Rajan, V. Ramalingam, M. Ganesan, S. Palanivel, B. Palaniappan, Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Syst. Appl. 36(8), 10914–10918 (2009)
R. Jayashree, K. Srikantamurthy, B.S. Anami, Sentence level text classification in the Kannada language—A classifier’s perspective. Int. J. Comput. Vis. Robot. 5(3), 254–270 (2015)
Puri, S., Singh, S.P., An efficient Hindi text classification model using SVM, in Computing and Network Sustainability (Springer, 2019), pp. 227–237
B.S. Harish, D.S. Guru, S. Manjunath, Representation and classification of text documents: A brief review. IJCA, Spec. Issue RTIPPR 2, 110–119 (2010)
J.J. Webster, C. Kit, Tokenization as the initial phase in nlp, in COLING 1992 Volume 4: The 15th International Conference on Computational Linguistics (1992)
M. Revanasiddappa, B. Harish, A new feature selection method based on intuitionistic fuzzy entropy to categorize text documents. IJIMAI 5(3), 106–117 (2018)
G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
T. Tokunaga, I. Makoto, Text categorization based on weighted inverse document frequency, in Special Interest Groups and Information Process Society of Japan (SIG-IPSJ, Citeseer (1994)
S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit (O’Reilly Media, Inc., 2009)
Project N (2020) https://www.nltk.org/_modules/nltk/tokenize/regexp.html. Last updated on 13 Apr 2020
Acknowledgements
This work is supported by Vision Group on Science and Technology (VGST), Department of IT, BT and Science and Technology, Government of Karnataka, India [File No.: VGST/2019-20/GRD No.:850/397].
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kasturi Rangan, R., Harish, B.S. (2022). Kannada Document Classification Using Unicode Term Encoding Over Vector Space. In: Shetty D., P., Shetty, S. (eds) Recent Advances in Artificial Intelligence and Data Engineering. Advances in Intelligent Systems and Computing, vol 1386. Springer, Singapore. https://doi.org/10.1007/978-981-16-3342-3_31
Download citation
DOI: https://doi.org/10.1007/978-981-16-3342-3_31
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3341-6
Online ISBN: 978-981-16-3342-3
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)