Abstract

Marathi is one of the prominent languages of India and is predominantly spoken by the people of Maharashtra. Over the past decade, use of the language on online platforms has increased tremendously. However, research on Natural Language Processing (NLP) approaches for Marathi text has not received much attention. Marathi is a morphologically rich language and is written in a variant of the Devanagari script. This work aims to provide a comprehensive overview of the available resources and models for Marathi text classification. We evaluate CNN, LSTM, ULMFiT, and BERT-based models on two publicly available Marathi text classification datasets and present a comparative analysis. The pre-trained Marathi FastText word embeddings from Facebook and IndicNLP are used in conjunction with the word-based models. We show that basic single-layer models based on CNN and LSTM, coupled with FastText embeddings, perform on par with the BERT-based models on the available datasets. We hope our paper aids focused research and experiments in the area of Marathi NLP.

Acknowledgements

This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kulkarni, A., Mandhane, M., Likhitkar, M., Kshirsagar, G., Jagdale, J., Joshi, R. (2022). Experimental Evaluation of Deep Learning Models for Marathi Text Classification. In: Gunjan, V.K., Zurada, J.M. (eds) Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Lecture Notes in Networks and Systems, vol 237. Springer, Singapore. https://doi.org/10.1007/978-981-16-6407-6_53
