Abstract
Marathi is one of the prominent languages of India and is spoken predominantly by the people of Maharashtra. Over the past decade, use of the language on online platforms has increased tremendously. However, research on Natural Language Processing (NLP) approaches for Marathi text has not received much attention. Marathi is a morphologically rich language and is written in a variant of the Devanagari script. This work aims to provide a comprehensive overview of available resources and models for Marathi text classification. We evaluate CNN, LSTM, ULMFiT, and BERT-based models on two publicly available Marathi text classification datasets and present a comparative analysis. The pre-trained Marathi FastText word embeddings from Facebook and IndicNLP are used in conjunction with the word-based models. We show that basic single-layer models based on CNN and LSTM, coupled with FastText embeddings, perform on par with the BERT-based models on the available datasets. We hope our paper aids focused research and experiments in the area of Marathi NLP.
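The abstract gives no architectural details, so the following is only a minimal sketch of what a generic single-layer CNN text classifier over pre-trained word vectors computes at inference time: embedding lookup, one convolution with ReLU, max-over-time pooling, and a softmax output layer. Everything concrete here is an assumption made for illustration and not the authors' configuration: the function names, the toy vocabulary, the tiny dimensions, and the random untrained weights that stand in for real FastText vectors and learned parameters.

```python
import math
import random


def conv1d_max_pool(embedded, filters, bias):
    """Slide each filter over the token sequence, apply ReLU, max-pool over time."""
    width = len(filters[0])  # number of consecutive tokens each filter covers
    pooled = []
    for filt, b in zip(filters, bias):
        best = 0.0  # ReLU floor; also the result when the text is shorter than the filter
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            act = b + sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, filters, bias, dense, dense_bias):
    """Embedding lookup -> convolution + pooling -> dense layer -> class probabilities."""
    dim = len(next(iter(embeddings.values())))
    # Assumption: unknown words map to a zero vector. Real FastText models
    # would instead compose a vector from character n-grams.
    embedded = [embeddings.get(tok, [0.0] * dim) for tok in tokens]
    features = conv1d_max_pool(embedded, filters, bias)
    logits = [
        b + sum(w * f for w, f in zip(row, features))
        for row, b in zip(dense, dense_bias)
    ]
    return softmax(logits)


# Hypothetical toy setup: random vectors stand in for pre-trained embeddings.
rng = random.Random(0)
DIM, NUM_FILTERS, WIDTH, NUM_CLASSES = 4, 3, 2, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


vocab = ["मराठी", "भाषा", "क्रिकेट", "सामना"]
embeddings = {word: [rng.uniform(-0.5, 0.5) for _ in range(DIM)] for word in vocab}
filters = [rand_matrix(WIDTH, DIM) for _ in range(NUM_FILTERS)]
bias = [0.0] * NUM_FILTERS
dense = rand_matrix(NUM_CLASSES, NUM_FILTERS)
dense_bias = [0.0] * NUM_CLASSES

probs = classify(["मराठी", "भाषा", "क्रिकेट"], embeddings, filters, bias, dense, dense_bias)
print(probs)  # untrained weights, so the probabilities carry no meaning
```

In a trained model, the embedding table would be loaded from the pre-trained vectors and the filter and dense weights would be learned on the classification dataset; the forward pass itself has the same shape as this sketch.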
Acknowledgements
This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kulkarni, A., Mandhane, M., Likhitkar, M., Kshirsagar, G., Jagdale, J., Joshi, R. (2022). Experimental Evaluation of Deep Learning Models for Marathi Text Classification. In: Gunjan, V.K., Zurada, J.M. (eds) Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Lecture Notes in Networks and Systems, vol 237. Springer, Singapore. https://doi.org/10.1007/978-981-16-6407-6_53
DOI: https://doi.org/10.1007/978-981-16-6407-6_53
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6406-9
Online ISBN: 978-981-16-6407-6
eBook Packages: Intelligent Technologies and Robotics (R0)