A Trigram HMM-Based POS Tagger for Indian Languages

Sarkar, Kamal; Gayen, Vivekananda

doi:10.1007/978-3-642-35314-7_24

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 199)

Abstract

We present a trigram HMM-based (Hidden Markov Model) part-of-speech (POS) tagger for Indian languages. It accepts raw text in an Indian language (typed in the corresponding language font) and produces POS-tagged output. We implement the trigram POS tagger from scratch, based on the second-order Hidden Markov Model (HMM). To handle unknown words, we introduce a prefix analysis method and a word-type analysis method, which are combined with the well-known suffix analysis method to predict probable tags. Although the system has been tested on data for four Indian languages, namely Bengali, Hindi, Marathi and Telugu, it can easily be ported to a new language simply by replacing the training file with POS-tagged data for that language. The trigram POS tagger is compared against a bigram POS tagger defined as the baseline.
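As a rough illustration of what a second-order (trigram) HMM tagger involves, the sketch below trains tag-trigram and emission counts from tagged sentences and decodes with the Viterbi algorithm over tag-pair states. It is not the authors' implementation. The function names (`train`, `viterbi`), the `<s>`/`</s>` boundary markers, the English toy corpus, and the add-one smoothing are all assumptions made for this example. In particular, add-one smoothing stands in for the prefix, suffix, and word-type analysis that the abstract describes for unknown words.

```python
import math
from collections import Counter

START, STOP = "<s>", "</s>"  # hypothetical boundary markers


def train(tagged_sentences):
    """Count tag trigrams, their bigram contexts, and (tag, word) emissions."""
    tri, bi, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        tags = [START, START] + [t for _, t in sent] + [STOP]
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bi[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
    return tri, bi, emit, tag_count, vocab


def viterbi(words, model):
    """Return the most probable tag sequence under a trigram HMM."""
    tri, bi, emit, tag_count, vocab = model
    if not words:
        return []
    tags = sorted(tag_count)
    n_outcomes = len(tags) + 1   # every tag plus STOP
    vocab_size = len(vocab) + 1  # one extra slot for unknown words

    def log_q(u, v, t):
        # Add-one smoothed transition P(t | u, v); an assumption of this sketch.
        return math.log((tri[(u, v, t)] + 1) / (bi[(u, v)] + n_outcomes))

    def log_e(t, word):
        # Add-one smoothed emission P(word | t); replaces the paper's
        # unknown-word analysis, which is not reproduced here.
        return math.log((emit[(t, word)] + 1) / (tag_count[t] + vocab_size))

    # best[(u, v)]: best log-probability of any path whose last two tags are u, v
    best = {(START, START): 0.0}
    back = []  # back[i][(v, t)] = tag two positions before word i
    for word in words:
        new_best, pointers = {}, {}
        for (u, v), score in best.items():
            for t in tags:
                s = score + log_q(u, v, t) + log_e(t, word)
                if (v, t) not in new_best or s > new_best[(v, t)]:
                    new_best[(v, t)] = s
                    pointers[(v, t)] = u
        best = new_best
        back.append(pointers)

    # Pick the final tag pair, including the transition to STOP.
    u, v = max(best, key=lambda st: best[st] + log_q(st[0], st[1], STOP))
    seq = [u, v]
    for i in range(len(words) - 1, 1, -1):
        seq.insert(0, back[i][(seq[0], seq[1])])
    return seq[-len(words):]


corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(viterbi(["the", "cat", "barks"], model))
```

Because the model is built entirely from the tagged training sentences, retraining on a different corpus is all that is needed to switch languages in this sketch, which mirrors the portability claim in the abstract. A bigram baseline would differ only in conditioning each tag on one previous tag instead of two.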

Author information

Authors and Affiliations

Computer Science & Engineering Department, Jadavpur University, Kolkata, 700 032, India
Kamal Sarkar & Vivekananda Gayen

Corresponding author

Correspondence to Kamal Sarkar.

Editor information

Editors and Affiliations

Dept of Computer Science Engineering, Anil Neerukonda Institute of Technology and Sciences, Vishakapatnam, India
Suresh Chandra Satapathy
AI Lab, University of Hyderabad, Hyderabad, India
Siba K. Udgata
Bhubaneswar Engineering College, Bhubaneswar, India
Bhabendra Narayan Biswal

About this paper

Cite this paper

Sarkar, K., Gayen, V. (2013). A Trigram HMM-Based POS Tagger for Indian Languages. In: Satapathy, S., Udgata, S., Biswal, B. (eds) Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA). Advances in Intelligent Systems and Computing, vol 199. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35314-7_24

DOI: https://doi.org/10.1007/978-3-642-35314-7_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35313-0
Online ISBN: 978-3-642-35314-7
eBook Packages: Engineering, Engineering (R0)
