Abstract
Language identification is the task of detecting the language in which a text is written. The problem becomes challenging when the writer does not use the native script of the language. Such text is typically produced on social media, where writers mix English with their native language(s). Social media users in India frequently write in code-mixed Hindi–English. In this work, we treat word-level language identification as a classification problem: identifying the language of a word written in Roman script. We employ POS tags in a transliteration-based approach to prepare the Hindi–English code-mixed corpus. We evaluate the corpus over itself and observe notable results.
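To make the word-level framing concrete, the following is a minimal sketch of language identification as per-word classification over Romanized tokens. It is not the method of this chapter: it swaps in a plain naive Bayes model over character bigrams and does not use POS tags or transliteration. The class name `WordLanguageClassifier`, the helper `char_ngrams`, the labels `en`/`hi`, and the tiny training lexicon are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class WordLanguageClassifier:
    """Toy naive Bayes classifier over character n-grams (illustrative only)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = {}        # label -> Counter of n-grams
        self.word_counts = Counter()  # label -> number of training words
        self.vocab = set()            # all n-grams seen in training

    def fit(self, labelled_words):
        """Count n-grams per label from (word, label) pairs."""
        for word, label in labelled_words:
            grams = char_ngrams(word, self.n)
            self.ngram_counts.setdefault(label, Counter()).update(grams)
            self.word_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, word):
        """Return the label with the highest add-one-smoothed log score."""
        total_words = sum(self.word_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            score = math.log(self.word_counts[label] / total_words)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_ngrams(word, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def tag_sentence(self, sentence):
        """Label every whitespace-separated token independently."""
        return [(token, self.predict(token)) for token in sentence.split()]


# Hypothetical toy lexicon; a real system would train on an annotated corpus.
TRAIN = [
    ("the", "en"), ("is", "en"), ("very", "en"), ("good", "en"),
    ("movie", "en"), ("this", "en"), ("friend", "en"), ("today", "en"),
    ("hai", "hi"), ("bahut", "hi"), ("accha", "hi"), ("nahi", "hi"),
    ("kya", "hi"), ("mera", "hi"), ("dost", "hi"), ("aaj", "hi"),
]

if __name__ == "__main__":
    clf = WordLanguageClassifier().fit(TRAIN)
    for token, label in clf.tag_sentence("yeh movie bahut good hai"):
        print(token, label)
```

Each token is labelled independently here, so the sketch ignores the sentence context that POS tags or neighbouring words could supply; with a lexicon this small the predictions are only indicative.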
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ansari, M.Z., Khan, S., Amani, T., Hamid, A., Rizvi, S. (2020). Analysis of Part of Speech Tags in Language Identification of Code-Mixed Text. In: Sharma, H., Govindan, K., Poonia, R., Kumar, S., El-Medany, W. (eds) Advances in Computing and Intelligent Systems. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-15-0222-4_39
DOI: https://doi.org/10.1007/978-981-15-0222-4_39
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0221-7
Online ISBN: 978-981-15-0222-4
eBook Packages: Intelligent Technologies and Robotics (R0)