Abstract
Language Identification (LI) is a crucial part of many text-processing pipelines, as most techniques presume that the language of the input text is known. Document-level LI is regarded as an almost solved problem in some application areas, but language detectors fail in social media environments because of code-switching, word borrowing across languages, and phonetic typing, which imply that LI in code-mixed text must be carried out at the word level. Hence, this work focuses on identifying languages at the word level in multilingual environments such as social media. One of the major concerns in these environments is phonetic typing, which can be addressed by incorporating graphemic features into the model. Character n-grams take every combination of co-occurring characters into account, resulting in a large model size, whereas graphemic features consider only those character combinations that have some underlying linguistic significance. For example, the graphemes ‘kh’ and ‘gh’ occur far more often in languages like Hindi and Urdu than in English. In the dataset of Sarma et al. (Word level language identification in assamese-bengali-hindi-english code-mixed social media text, pp. 261–266), we observed that a larger share of graphemes (53.46%) is exclusive to a particular language than of bigrams (21.38%) or trigrams (39.43%). This work presents a detailed analysis and comparison, across several metrics, of character n-gram and grapheme-based features: we run experiments in which grapheme-based features replace the character n-gram features in various popular methods that originally used only character n-grams. Through this set of experiments and our analysis, we show the usefulness of graphemes in the field of word-level LI.
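The contrast between character n-grams and graphemes, and the idea of a feature being exclusive to one language, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grapheme inventory `MULTI_GRAPHEMES`, the greedy segmentation rule, the helper names, and the toy word lists are all assumptions made for illustration, since the abstract does not specify how graphemes are extracted.

```python
from collections import defaultdict

# Hypothetical inventory of two-letter graphemes (illustration only;
# the inventory actually used in the paper is not given in the abstract).
MULTI_GRAPHEMES = ("kh", "gh", "sh", "ch", "th", "aa", "ee", "oo")


def char_ngrams(word, n):
    """Return every contiguous character n-gram of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def graphemes(word, inventory=MULTI_GRAPHEMES):
    """Greedy left-to-right segmentation (an assumed rule): take a known
    two-letter grapheme when one starts here, otherwise a single letter."""
    units, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in inventory:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def exclusive_fraction(words_by_lang, extract):
    """Fraction of distinct features that occur in exactly one language."""
    langs_per_feature = defaultdict(set)
    for lang, words in words_by_lang.items():
        for word in words:
            for feat in extract(word):
                langs_per_feature[feat].add(lang)
    if not langs_per_feature:
        return 0.0
    exclusive = sum(1 for langs in langs_per_feature.values() if len(langs) == 1)
    return exclusive / len(langs_per_feature)


# Toy word lists, invented for this example (not taken from any dataset).
toy = {"hi": ["khana", "ghar"], "en": ["house", "food"]}
print(char_ngrams("khana", 2))  # ['kh', 'ha', 'an', 'na']
print(graphemes("khana"))       # ['kh', 'a', 'n', 'a']
print(exclusive_fraction(toy, lambda w: char_ngrams(w, 2)))
print(exclusive_fraction(toy, graphemes))
```

A bigram extractor emits every adjacent pair, including linguistically arbitrary ones such as 'ha' and 'an', whereas the grapheme segmenter emits a multi-letter unit only when it belongs to the inventory. The exclusivity values on such a tiny toy lexicon are not meaningful and should not be compared with the percentages reported above.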
References
Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., Weiss, D.: A fast, compact, accurate model for language identification of codemixed text. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 328–337, Brussels, Belgium. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1030. https://www.aclweb.org/anthology/D18-1030
Bali, K., Sharma, J., Choudhury, M., Vyas, Y.: I am borrowing ya mixing? An analysis of English-Hindi code mixing in Facebook. In: Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116–126, Doha, Qatar. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/W14-3914. https://www.aclweb.org/anthology/W14-3914
Rijhwani, S., Sequiera, R., Choudhury, M., Bali, K., Maddila, C. S.: Estimating code-switching on twitter with a novel generalized word-level language detection technique. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1971–1982, Vancouver, Canada. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1180. https://www.aclweb.org/anthology/P17-1180
Wikipedia. n-gram—wikipedia, the free encyclopedia (2004). https://en.wikipedia.org/wiki/N-gram. [Online; Accessed 22 July 2004]
Urban Dictionary (2011). https://dictionary.reference.com/browse/grapheme. [Online; Accessed 22 July 2011]
Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.: Hierarchical character-word models for language identification, pp. 84–93 (2016). https://doi.org/10.18653/v1/W16-6212
Singh, K., Sen, I., Kumaraguru, P.: A Twitter Corpus Hindi English Code Mixed Dataset for POS Tagging. Workshop on Natural Language Processing for Social Media (Social NLP 2018)
Killer, M., Stüker, S., Schultz, T.: Grapheme based speech recognition. Proc. Eurospeech, 04 (2009)
Baldwin, T., Lui, M.: Language identification: The long and the short of the matter. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 229–237, Los Angeles, California. Association for Computational Linguistics (2010). https://www.aclweb.org/anthology/N10-1027
Murthy, K., Kumar, G.: Language identification from small text samples. J. Quant. Linguis. 13, 57–80 (2006). https://doi.org/10.1080/09296170500500694
Church, K.: Stress assignment in letter to sound rules for speech synthesis. In: 23rd Annual Meeting of the Association for Computational Linguistics, pp. 246–253, Chicago, Illinois, USA. Association for Computational Linguistics (1985). https://doi.org/10.3115/981210.981240. https://www.aclweb.org/anthology/P85-1030
Yang, X., Liang, W.: An n-gram-and-wikipedia joint approach to natural language identification. In: 2010 4th International Universal Communication Symposium, pp. 332–339 (2010). https://doi.org/10.1109/IUCS.2010.5666010
Lui, M., Baldwin, T.: Langid.py: An off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25–30, Jeju Island, Korea. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/P12-3005
Tromp, E., Pechenizkiy, M.: Graph-based n-gram language identification on short texts. In: Proceedings of Benelearn, pp. 27–34 (2011)
Moodley, A.: Language Identification With Decision Trees: Identification Of Individual Words In The South African Languages. Ph.D. thesis (2016)
Malmasi, S., Zampieri, M.: Arabic dialect identification using iVectors and ASR transcripts. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 178–183, Valencia, Spain. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-1222. https://www.aclweb.org/anthology/W17-1222
Jhamtani, H., Kumar, B., Raychoudhury, V.: Word-level language identification in bi-lingual code-switched texts (2014)
Barman, U., Das, A., Wagner, J., Foster, J.: Code-mixing: A challenge for language identification in the language of social media. In: Proceedings of the First Workshop on Computational Approaches to Code-Switching (2014)
Samih, Y., Maharjan, S., Attia, M., Kallmeyer, L., Solorio, T.: Multilingual code-switching identification via LSTM recurrent neural networks. In: Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 50–59, Austin, Texas. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/W16-5806. https://www.aclweb.org/anthology/W16-5806
Yeong, Y.-L., Tan, T.-P.: Applying grapheme, word, and syllable information for language identification in code switching sentences (2011). https://doi.org/10.1109/IALP.2011.34
Oh, J.-H., Choi, K.-S.: An ensemble of grapheme and phoneme for machine transliteration, vol. 3651, pp. 450–461 (2005). https://doi.org/10.1007/1156221440
Banerjee, S., Choudhury, M., Chakma, K., Naskar, S.K., Das, A., Bandyopadhyay, S., Rosso, P.: Msir@fire: A comprehensive report from 2013 to 2016. SN Comput. Sci. 1(1), 55 (2020). https://doi.org/10.1007/s42979-019-0058-0
Dongen, N.: Analysis and prediction of dutch-english code-switching in dutch social media messages (2017)
Bartlett, S., Kondrak, G., Cherry, C.: On the syllabification of phonemes. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 308–316, Boulder, Colorado. Association for Computational Linguistics (2009). https://www.aclweb.org/anthology/N09-1035
Sarma, N., Singh, S. R., Goswami, D.: Word level language identification in assamese-bengali-hindi-english code-mixed social media text. In: 2018 International Conference on Asian Language Processing (IALP), pp. 261–266 (2018)
Rathod, P., Dhore, M.L., Dhore, R.: Hindi and Marathi to English machine transliteration using SVM (2013)
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J.: Speech emotion recognition using spectrogram and phoneme embedding, pp. 3688–3692 (2018). https://doi.org/10.21437/Interspeech.2018-1811
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). ISSN 1573-0565. https://doi.org/10.1023/A:1022627411411
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jain, S., Agarwal, K. (2022). Usefulness of Graphemes in Word-Level Language Identification in Code-Mixed Text. In: Sahoo, J.P., Tripathy, A.K., Mohanty, M., Li, KC., Nayak, A.K. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 302. Springer, Singapore. https://doi.org/10.1007/978-981-16-4807-6_17
DOI: https://doi.org/10.1007/978-981-16-4807-6_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4806-9
Online ISBN: 978-981-16-4807-6
eBook Packages: Intelligent Technologies and Robotics (R0)