Indian Language Identification for Short Text

Bhaskaran, Sreebha; Paul, Geetika; Gupta, Deepa; Amudha, J.

doi:10.1007/978-981-15-1275-9_5

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1086)

Abstract

Language identification determines the language in which a given document is written. For a multilingual document, it categorizes the contents by language and can lead to better search results. In this work, we classify each line of text into a particular language, focusing on short phrases of 2–6 words in 15 Indian languages. The system detects whether a given document is multilingual and identifies the Indian languages it contains. The approach combines an n-gram technique with a list of short distinctive words. The n-gram model is language independent, whereas the short-word method requires less computation. The results show the effectiveness of our approach on synthetic data.
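The combination described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the rank-order n-gram profile comparison, the fixed penalty for unseen n-grams, the rule that the short-word list is tried first, the two toy training sentences, and all names (char_ngrams, build_profile, out_of_place, identify_line, identify_document, TOP_K) are assumptions made for illustration only.

```python
from collections import Counter

TOP_K = 300  # assumed profile size; also the penalty for an unseen n-gram


def char_ngrams(text, n_min=1, n_max=3):
    """Yield code-point n-grams of every token, padded with '_' at both ends."""
    for token in text.split():
        padded = f"_{token}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, top_k=TOP_K):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, missing_penalty=TOP_K):
    """Sum rank differences; n-grams unknown to the language cost a fixed penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else missing_penalty
        for gram, rank in doc_profile.items()
    )


def identify_line(line, lang_profiles, short_words):
    """Try the cheap short-word lookup first, then fall back to n-gram profiles."""
    tokens = set(line.split())
    hits = {lang: len(tokens & words) for lang, words in short_words.items()}
    best = max(hits.values(), default=0)
    winners = [lang for lang, count in hits.items() if count == best]
    if best > 0 and len(winners) == 1:
        return winners[0]
    doc_profile = build_profile(line)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


def identify_document(text, lang_profiles, short_words):
    """Label every non-empty line of a document with a language."""
    return [identify_line(line, lang_profiles, short_words)
            for line in text.splitlines() if line.strip()]


# Toy data, invented for this sketch; a real system would train on large corpora.
TRAIN = {
    "hindi": "यह एक किताब है और वह एक कलम है",
    "tamil": "இது ஒரு புத்தகம் மற்றும் அது ஒரு பேனா",
}
SHORT_WORDS = {
    "hindi": {"है", "और", "एक"},
    "tamil": {"ஒரு", "அது", "இது"},
}
PROFILES = {lang: build_profile(sample) for lang, sample in TRAIN.items()}

labels = identify_document("यह कलम है\nபுத்தகம் பேனா", PROFILES, SHORT_WORDS)
print(labels, "multilingual:", len(set(labels)) > 1)
```

In this sketch the short-word lookup is a set intersection per line, which reflects why such a method needs little computation, while the n-gram fallback uses no language-specific rules. Languages that share a script, such as Hindi and Marathi, would be much harder to separate than this two-script toy example suggests.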

Author information

Authors and Affiliations

Department of Computer Science & Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru, India
Sreebha Bhaskaran, Geetika Paul, Deepa Gupta & J. Amudha

Corresponding author

Correspondence to Sreebha Bhaskaran.

Editor information

Editors and Affiliations

School of Computing, University of Eastern Finland, Kuopio, Finland
Xiao-Zhi Gao
Computer Science Engineering Department, ABES Engineering College, Delhi, India
Shailesh Tiwari
Department of Computer Science and Engineering, National Institute of Technology Agartala, Agartala, Tripura, India
Munesh C. Trivedi
Motilal Nehru National Institute of Technology, Allahabad, Uttar Pradesh, India
Krishn K. Mishra

About this paper

Cite this paper

Bhaskaran, S., Paul, G., Gupta, D., Amudha, J. (2021). Indian Language Identification for Short Text. In: Gao, XZ., Tiwari, S., Trivedi, M., Mishra, K. (eds) Advances in Computational Intelligence and Communication Technology. Advances in Intelligent Systems and Computing, vol 1086. Springer, Singapore. https://doi.org/10.1007/978-981-15-1275-9_5

DOI: https://doi.org/10.1007/978-981-15-1275-9_5
Published: 19 June 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1274-2
Online ISBN: 978-981-15-1275-9
eBook Packages: Intelligent Technologies and Robotics (R0)
