Abstract
Language identification determines the language in which a given document is written. Categorizing content by language can improve search results over multilingual documents. In this work, we classify each line of text into a particular language, focusing on short phrases of 2–6 words across 15 Indian languages. The system detects whether a given document is multilingual and identifies the Indian languages it contains. The approach combines an n-gram technique with a list of short distinctive words. The n-gram model is language independent, whereas the short-word method requires less computation. The results show the effectiveness of our approach on synthetic data.
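The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a rank-order ("out-of-place") distance over character n-gram profiles, a common choice for n-gram language identification, and it tries the short-word lookup first because that step is cheaper. The training sentences, the word lists, the restriction to two languages, and all function names (`ngram_profile`, `out_of_place`, `identify_line`, `languages_in_document`) are hypothetical and chosen only for illustration. Hindi and Marathi are used because they share the Devanagari script, so the script alone cannot separate them.

```python
from collections import Counter

TOP_K = 300  # profile size; also the penalty for an n-gram missing from a profile

# Hypothetical toy data, for illustration only (not the paper's corpus or word lists).
TRAINING_TEXT = {
    "hindi": "यह एक किताब है और वह एक कलम है",
    "marathi": "हे एक पुस्तक आहे आणि ती एक लेखणी आहे",
}
SHORT_WORDS = {
    "hindi": {"है", "और", "में"},
    "marathi": {"आहे", "आणि", "नाही"},
}


def ngram_profile(text, n_max=3):
    """Map each character n-gram (n = 1..n_max) to its frequency rank."""
    counts = Counter()
    for word in text.split():
        padded = f" {word} "  # mark word boundaries with spaces
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = counts.most_common(TOP_K)
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(line_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language cost TOP_K."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else TOP_K
        for gram, rank in line_profile.items()
    )


PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING_TEXT.items()}


def identify_line(line):
    """Try the short-word lookup first, then fall back to n-gram distance."""
    tokens = line.split()
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in SHORT_WORDS.items()
    }
    best = max(hits.values())
    winners = [lang for lang, count in hits.items() if count == best]
    if best > 0 and len(winners) == 1:
        return winners[0]
    line_profile = ngram_profile(line)
    return min(PROFILES, key=lambda lang: out_of_place(line_profile, PROFILES[lang]))


def languages_in_document(document):
    """Classify every non-empty line and return the set of languages found."""
    return {identify_line(line) for line in document.splitlines() if line.strip()}
```

Under these assumptions, a document counts as multilingual when `languages_in_document` returns more than one label. A line containing a listed short word is resolved without building an n-gram profile, while a line with no listed word, or with an equal number of hits for several languages, falls through to the n-gram comparison.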
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bhaskaran, S., Paul, G., Gupta, D., Amudha, J. (2021). Indian Language Identification for Short Text. In: Gao, XZ., Tiwari, S., Trivedi, M., Mishra, K. (eds) Advances in Computational Intelligence and Communication Technology. Advances in Intelligent Systems and Computing, vol 1086. Springer, Singapore. https://doi.org/10.1007/978-981-15-1275-9_5
DOI: https://doi.org/10.1007/978-981-15-1275-9_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1274-2
Online ISBN: 978-981-15-1275-9
eBook Packages: Intelligent Technologies and Robotics (R0)