Usefulness of Graphemes in Word-Level Language Identification in Code-Mixed Text

Jain, Shreya; Agarwal, Kanika

doi:10.1007/978-981-16-4807-6_17

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 302)

Abstract

Language Identification (LI) is a crucial part of many text-processing pipelines, as most techniques presume that the language of the input text is known. Document-level language identification is regarded as an almost solved problem in some application areas, but language detectors fail in social media environments because of code-switching, word borrowing across languages, and phonetic typing, which imply that LI in code-mixed text must be carried out at the word level. Hence, this work focuses on identifying languages at the word level in multilingual environments such as social media. One of the major concerns in these environments is phonetic typing, which can be taken into account by incorporating graphemic features into our model. Character n-grams take every combination of co-occurring characters into account, resulting in a large model size, whereas graphemic features consider only those character combinations that have some underlying linguistic significance. For example, the graphemes ‘kh’ and ‘gh’ occur far more often in languages like Hindi and Urdu than in English. In the dataset of Sarma et al. (Word level language identification in Assamese-Bengali-Hindi-English code-mixed social media text, pp. 261–266), we observed that a larger share of graphemes (53.46%) is exclusive to a particular language than of bigrams (21.38%) or trigrams (39.43%). This work presents a detailed analysis and comparison, on several metrics, of character n-gram and grapheme-based features: we run experiments in which grapheme-based features replace the character n-gram features in various popular methods that originally used only character n-grams. Through this set of experiments and our analysis, we show the usefulness of graphemes in word-level LI.
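The contrast between character n-grams and graphemic features, and the notion of a feature being "exclusive" to one language, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not give the grapheme inventory, the segmentation procedure, or the exact exclusivity computation, so the `GRAPHEMES` set, the greedy longest-match segmenter, the function names, and the toy word list are all hypothetical and are not taken from the chapter or from the Sarma et al. dataset.

```python
from collections import defaultdict

# Hypothetical grapheme inventory for romanized text (an assumption;
# the chapter's actual inventory is not given in the abstract).
GRAPHEMES = {"kh", "gh", "ch", "sh", "th", "aa", "ee", "oo"}


def char_ngrams(word, n):
    """Return every contiguous character n-gram of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def graphemes(word, inventory=GRAPHEMES):
    """Greedy longest-match segmentation; unmatched characters stand alone."""
    max_len = max(len(g) for g in inventory)
    units, i = [], 0
    while i < len(word):
        # Try the longest multi-character grapheme first.
        for size in range(min(max_len, len(word) - i), 1, -1):
            if word[i:i + size] in inventory:
                units.append(word[i:i + size])
                i += size
                break
        else:
            # No multi-character grapheme starts here: emit one character.
            units.append(word[i])
            i += 1
    return units


def exclusive_share(labelled_words, extract):
    """Fraction of distinct features that occur in exactly one language."""
    langs = defaultdict(set)
    for word, lang in labelled_words:
        for feat in extract(word):
            langs[feat].add(lang)
    if not langs:
        return 0.0
    return sum(len(s) == 1 for s in langs.values()) / len(langs)


# Toy, made-up sample; not drawn from the Sarma et al. dataset.
sample = [("khana", "hi"), ("ghar", "hi"), ("house", "en"), ("share", "en")]
bigram_share = exclusive_share(sample, lambda w: char_ngrams(w, 2))
grapheme_share = exclusive_share(sample, graphemes)
```

On real data, `exclusive_share` would be computed once per feature type (bigrams, trigrams, graphemes) over the labelled vocabulary, which is one plausible way to arrive at percentages of the kind quoted in the abstract. The toy sample is far too small for its shares to mean anything.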

References

Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., Weiss, D.: A fast, compact, accurate model for language identification of codemixed text. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 328–337, Brussels, Belgium. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1030. https://www.aclweb.org/anthology/D18-1030
Bali, K., Sharma, J., Choudhury, M., Vyas, Y.: I am borrowing ya mixing? An analysis of English-Hindi code mixing in Facebook. In: Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116–126, Doha, Qatar. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/W14-3914. https://www.aclweb.org/anthology/W14-3914
Rijhwani, S., Sequiera, R., Choudhury, M., Bali, K., Maddila, C. S.: Estimating code-switching on Twitter with a novel generalized word-level language detection technique. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1971–1982, Vancouver, Canada. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1180. https://www.aclweb.org/anthology/P17-1180
Wikipedia. n-gram—wikipedia, the free encyclopedia (2004). https://en.wikipedia.org/wiki/N-gram. [Online; Accessed 22 July 2004]
Urban Dictionary (2011). https://dictionary.reference.com/browse/grapheme. [Online; Accessed 22 July 2011]
Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.: Hierarchical character-word models for language identification, pp. 84–93 (2016). https://doi.org/10.18653/v1/W16-6212
Singh, K., Sen, I., Kumaraguru, P.: A Twitter Corpus Hindi English Code Mixed Dataset for POS Tagging. Workshop on Natural Language Processing for Social Media (Social NLP 2018)
Killer, M., Stüker, S., Schultz, T.: Grapheme based speech recognition. Proc. Eurospeech, 04 (2009)
Baldwin, T., Lui, M.: Language identification: The long and the short of the matter. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 229–237, Los Angeles, California. Association for Computational Linguistics (2010). https://www.aclweb.org/anthology/N10-1027
Murthy, K., Kumar, G.: Language identification from small text samples. J. Quant. Linguist. 13, 57–80 (2006). https://doi.org/10.1080/09296170500500694
Church, K.: Stress assignment in letter to sound rules for speech synthesis. In: 23rd Annual Meeting of the Association for Computational Linguistics, pp. 246–253, Chicago, Illinois, USA. Association for Computational Linguistics (1985). https://doi.org/10.3115/981210.981240. https://www.aclweb.org/anthology/P85-1030
Yang, X., Liang, W.: An n-gram-and-wikipedia joint approach to natural language identification. In: 2010 4th International Universal Communication Symposium, pp. 332–339 (2010). https://doi.org/10.1109/IUCS.2010.5666010
Lui, M., Baldwin, T.: Langid.py: An off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25–30, Jeju Island, Korea. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/P12-3005
Tromp, E., Pechenizkiy, M.: Graph-based n-gram language identification on short texts. In: Proceedings of Benelearn, pp. 27–34 (2011)
Moodley, A.: Language Identification With Decision Trees: Identification Of Individual Words In The South African Languages. Ph.D. thesis (2016)
Malmasi, S., Zampieri, M.: Arabic dialect identification using iVectors and ASR transcripts. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 178–183, Valencia, Spain. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-1222. https://www.aclweb.org/anthology/W17-1222
Jhamtani, H., Kumar, B., Raychoudhury, V.: Word-level language identification in bi-lingual code-switched texts (2014)
Barman, U., Das, A., Wagner, J., Foster, J.: Code-mixing: A challenge for language identification in the language of social media. In: Proceedings of the First Workshop on Computational Approaches to Code-Switching (2014)
Samih, Y., Maharjan, S., Attia, M., Kallmeyer, L., Solorio, T.: Multilingual code-switching identification via LSTM recurrent neural networks. In: Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 50–59, Austin, Texas. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/W16-5806. https://www.aclweb.org/anthology/W16-5806
Yeong, Y.-L., Tan, T.-P.: Applying grapheme, word, and syllable information for language identification in code switching sentences (2011). https://doi.org/10.1109/IALP.2011.34
Oh, J.-H., Choi, K.-S.: An ensemble of grapheme and phoneme for machine transliteration, vol. 3651, pp. 450–461 (2005). https://doi.org/10.1007/11562214_40
Banerjee, S., Choudhury, M., Chakma, K., Naskar, S.K., Das, A., Bandyopadhyay, S., Rosso, P.: MSIR@FIRE: A comprehensive report from 2013 to 2016. SN Comput. Sci. 1(1), 55 (2020). https://doi.org/10.1007/s42979-019-0058-0
Dongen, N.: Analysis and prediction of Dutch-English code-switching in Dutch social media messages (2017)
Bartlett, S., Kondrak, G., Cherry, C.: On the syllabification of phonemes. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 308–316, Boulder, Colorado. Association for Computational Linguistics (2009). https://www.aclweb.org/anthology/N09-1035
Sarma, N., Singh, S. R., Goswami, D.: Word level language identification in Assamese-Bengali-Hindi-English code-mixed social media text. In: 2018 International Conference on Asian Language Processing (IALP), pp. 261–266 (2018)
Rathod, P., Dhore, M.L., Dhore, R.: Hindi and Marathi to English machine transliteration using SVM (2013)
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J.: Speech emotion recognition using spectrogram and phoneme embedding, pp. 3688–3692 (2018). https://doi.org/10.21437/Interspeech.2018-1811
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). ISSN 1573-0565. https://doi.org/10.1023/A:1022627411411

Author information

Authors and Affiliations

Indian Institute of Technology Guwahati, Guwahati, India
Shreya Jain & Kanika Agarwal

Editor information

Editors and Affiliations

Department of Computer Science and Information Technology, Institute of Technical Education and Research (ITER), Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
Jyoti Prakash Sahoo
School of Information Technology and Engineering (SITE), Vellore Institute of Technology, Vellore, India
Asis Kumar Tripathy
Center for Forensic Science, School of Mathematical and Physical Science, University of Technology Sydney, Broadway, NSW, Australia
Manoranjan Mohanty
Department of Computer Science and Information Engineering (CSIE), Providence University, Taichung, Taiwan
Kuan-Ching Li
Department of Computer Science and Information Technology, Institute of Technical Education and Research (ITER), Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, India
Ajit Kumar Nayak

Cite this paper

Jain, S., Agarwal, K. (2022). Usefulness of Graphemes in Word-Level Language Identification in Code-Mixed Text. In: Sahoo, J.P., Tripathy, A.K., Mohanty, M., Li, KC., Nayak, A.K. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 302. Springer, Singapore. https://doi.org/10.1007/978-981-16-4807-6_17

DOI: https://doi.org/10.1007/978-981-16-4807-6_17
Published: 01 January 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4806-9
Online ISBN: 978-981-16-4807-6
eBook Packages: Intelligent Technologies and Robotics (R0)
