A New Methodology for Language Identification in Social Media Code-Mixed Text

Gupta, Yogesh; Raghuwanshi, Ghanshyam; Tripathi, Aprna

doi:10.1007/978-981-15-3383-9_22

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1141)

Included in the following conference series:

International Conference on Advanced Machine Learning Technologies and Applications

Abstract

Transliteration is currently an active research area in Natural Language Processing. Transliteration transfers a word from one language to another and is mostly used on cross-language platforms. People generally use code-mixed language to share their views on social media such as Twitter and WhatsApp. In code-mixed text, one language is written in the script of another, so identifying the language of each word is essential for processing such text. This paper therefore implements a deep learning model based on Bidirectional Long Short-Term Memory (BLSTM) for Indian social media text. The model identifies the language of origin of each word in a sequence based on the words that come before it. The proposed model achieves higher accuracy with word embeddings than with character embeddings.
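
As a rough illustration of the kind of tagger the abstract describes, the following Python sketch runs an untrained bidirectional LSTM over a short romanized Hindi-English sentence and produces one label distribution per word. It is not the authors' implementation: the sentence, the two-label tag set, the layer sizes, the random weights and word vectors, and all function names are assumptions made for this example, and bias terms and training are omitted for brevity.

```python
import math
import random

random.seed(0)

LABELS = ["hi", "en"]  # hypothetical tag set: Hindi, English
EMB, HID = 4, 3        # toy embedding and hidden sizes (assumptions)

def rand_matrix(rows, cols):
    """Random weights stand in for parameters that would normally be trained."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm():
    # One weight matrix per gate (input, forget, output, candidate),
    # each applied to the concatenation [input; previous hidden state].
    return {gate: rand_matrix(HID, EMB + HID) for gate in "ifoc"}

def lstm_run(params, inputs):
    """Run a single-direction LSTM and return the hidden state at every step."""
    h, c, outputs = [0.0] * HID, [0.0] * HID, []
    for x in inputs:
        z = x + h  # list concatenation of input vector and hidden state
        i = [sigmoid(a) for a in matvec(params["i"], z)]
        f = [sigmoid(a) for a in matvec(params["f"], z)]
        o = [sigmoid(a) for a in matvec(params["o"], z)]
        g = [math.tanh(a) for a in matvec(params["c"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tag_sentence(words, embeddings, fwd, bwd, out_w):
    """Return one probability distribution over LABELS for each word."""
    xs = [embeddings[w] for w in words]
    h_fwd = lstm_run(fwd, xs)              # left-to-right pass
    h_bwd = lstm_run(bwd, xs[::-1])[::-1]  # right-to-left pass, re-aligned
    return [softmax(matvec(out_w, hf + hb)) for hf, hb in zip(h_fwd, h_bwd)]

# Made-up code-mixed sentence; word vectors are random placeholders.
words = "yeh movie bahut achhi thi".split()
embeddings = {w: [random.uniform(-1, 1) for _ in range(EMB)] for w in words}
fwd, bwd = make_lstm(), make_lstm()
out_w = rand_matrix(len(LABELS), 2 * HID)

probs = tag_sentence(words, embeddings, fwd, bwd, out_w)
for w, p in zip(words, probs):
    print(w, LABELS[p.index(max(p))], [round(v, 3) for v in p])
```

Because the weights are random, the printed labels carry no meaning. In a trained system the word vectors and LSTM weights would be learned from labeled code-mixed data, and the word vectors could be replaced by character-level encodings to compare the two input representations.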

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur, India
Yogesh Gupta
Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, India
Ghanshyam Raghuwanshi
Department of Computer Engineering and Applications, GLA University, Mathura, India
Aprna Tripathi

Corresponding author

Correspondence to Yogesh Gupta.

Editor information

Editors and Affiliations

Faculty of Computer and Information, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
Roheet Bhatnagar
Faculty of Science, Helwan University, Helwan, Egypt
Ashraf Darwish

About this paper

Cite this paper

Gupta, Y., Raghuwanshi, G., Tripathi, A. (2021). A New Methodology for Language Identification in Social Media Code-Mixed Text. In: Hassanien, A., Bhatnagar, R., Darwish, A. (eds) Advanced Machine Learning Technologies and Applications. AMLTA 2020. Advances in Intelligent Systems and Computing, vol 1141. Springer, Singapore. https://doi.org/10.1007/978-981-15-3383-9_22

DOI: https://doi.org/10.1007/978-981-15-3383-9_22
Published: 26 May 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3382-2
Online ISBN: 978-981-15-3383-9
eBook Packages: Intelligent Technologies and Robotics (R0)
