Analysis of Part of Speech Tags in Language Identification of Code-Mixed Text

Ansari, Mohd Zeeshan; Khan, Shazia; Amani, Tamsil; Hamid, Aman; Rizvi, Syed

doi:10.1007/978-981-15-0222-4_39

Part of the book series: Algorithms for Intelligent Systems (AIS)

Abstract

Language identification is the task of detecting the language in which a text is written. The problem becomes challenging when the writer does not use the native script of a language. Such text is common on social media, where writers mix English with their native language(s). Social media users in India often write in code-mixed Hindi–English. In this work, we treat word-level language identification as a classification problem: identifying the language of each word written in Roman script. We employ POS tags in a transliteration-based approach to prepare a Hindi–English code-mixed corpus. We evaluate the corpus over itself and observe notable results.
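To make the framing of word-level language identification as a classification problem concrete, the following is a minimal sketch and not the method used in the chapter. The abstract does not name the classifier or features, so the character n-gram Naive Bayes model, the names `char_ngrams` and `WordLanguageClassifier`, and the toy word list are all assumptions made for illustration. POS tags and transliteration are not modelled here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a lowercased word padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class WordLanguageClassifier:
    """Multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def fit(self, words, labels):
        self.ngram_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for word, label in zip(words, labels):
            self.ngram_counts[label].update(char_ngrams(word))
        # Shared n-gram vocabulary, used in the smoothing denominator.
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, word):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for gram in char_ngrams(word):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-made training words (hypothetical; not the chapter's corpus).
TRAIN = [
    ("the", "en"), ("this", "en"), ("what", "en"), ("friend", "en"),
    ("school", "en"), ("going", "en"),
    ("hai", "hi"), ("nahi", "hi"), ("kya", "hi"), ("mera", "hi"),
    ("dost", "hi"), ("raha", "hi"),
]

words, labels = zip(*TRAIN)
model = WordLanguageClassifier().fit(words, labels)

# Label each Roman-script token of a code-mixed sentence independently.
sentence = "mera friend school ja raha hai"
tagged = [(token, model.predict(token)) for token in sentence.split()]
```

Each token is labelled on its own, which is what "word-level" means in this setting. A fuller system would presumably add context features, such as the POS tags the abstract mentions, but how the authors combine them is not stated on this page.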

Author information

Authors and Affiliations

Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India
Mohd Zeeshan Ansari, Shazia Khan, Tamsil Amani, Aman Hamid & Syed Rizvi

Corresponding author

Correspondence to Mohd Zeeshan Ansari.

Editor information

Editors and Affiliations

Rajasthan Technical University, Kota, Rajasthan, India
Harish Sharma
University of Southern Denmark, Odense, Denmark
Kannan Govindan
Amity University, Jaipur, Rajasthan, India
Ramesh C. Poonia
Amity University, Jaipur, Rajasthan, India
Sandeep Kumar
University of Bahrain, Zallaq, Bahrain
Wael M. El-Medany

Cite this paper

Ansari, M.Z., Khan, S., Amani, T., Hamid, A., Rizvi, S. (2020). Analysis of Part of Speech Tags in Language Identification of Code-Mixed Text. In: Sharma, H., Govindan, K., Poonia, R., Kumar, S., El-Medany, W. (eds) Advances in Computing and Intelligent Systems. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-15-0222-4_39

DOI: https://doi.org/10.1007/978-981-15-0222-4_39
Published: 03 January 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0221-7
Online ISBN: 978-981-15-0222-4
eBook Packages: Intelligent Technologies and Robotics (R0)
