Letter Based Text Scoring Method for Language Identification

Takcı, Hidayet; Soğukpınar, İbrahim

doi:10.1007/978-3-540-30198-1_29

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3261)

Included in the following conference series:

International Conference on Advances in Information Systems

Abstract

In recent years, the volume of text documents on the internet, intranets, digital libraries and newsgroups has grown unexpectedly fast. Obtaining useful information and meaningful patterns from these documents is an important issue. Identifying the language of these text documents is an important problem that many researchers have studied. These studies have generally used words (terms) for language identification and have followed different approaches, such as linguistic and statistical ones. In this work, a Letter Based Text Scoring Method is proposed for language identification. The method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document, and the text scores are calculated from the letter distribution of the new text document. Besides its acceptable accuracy, the proposed method is easier and faster than methods based on short terms or n-grams.
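The abstract does not state the scoring formula, so the following is only a minimal sketch of the general idea of scoring a text against per-language letter distributions. It assumes cosine similarity between letter-frequency vectors as the score, which may differ from the authors' actual method. The function names and the tiny `TOY_SAMPLES` training texts are hypothetical and are far too small for real use.

```python
import math
from collections import Counter

# Hypothetical toy training texts; real profiles would need large corpora.
TOY_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the fox sleeps under the tree",
    "turkish": "bugün hava çok güzel ve ben arkadaşlarımla birlikte şehir "
               "dışına gitmek istiyorum çünkü güneş parlıyor",
}


def letter_distribution(text):
    """Return the relative frequency of each alphabetic character in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()}


def build_profiles(samples):
    """Build one letter-distribution profile per language."""
    return {lang: letter_distribution(text) for lang, text in samples.items()}


def score_text(text, profile):
    """Score a text against a profile (assumed score: cosine similarity)."""
    dist = letter_distribution(text)
    dot = sum(freq * profile.get(ch, 0.0) for ch, freq in dist.items())
    norm_text = math.sqrt(sum(f * f for f in dist.values()))
    norm_profile = math.sqrt(sum(f * f for f in profile.values()))
    if norm_text == 0.0 or norm_profile == 0.0:
        return 0.0
    return dot / (norm_text * norm_profile)


def identify_language(text, profiles):
    """Return the best-scoring language and the full score table."""
    scores = {lang: score_text(text, p) for lang, p in profiles.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    profiles = build_profiles(TOY_SAMPLES)
    best, scores = identify_language("güneşli bir günde şehirde yürümek", profiles)
    print(best, scores)
```

Because each text is reduced to one small vector of letter frequencies, scoring needs only a single pass over the characters, which is consistent with the abstract's claim that a letter-based approach is simpler than word or n-gram features.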

Author information

Authors and Affiliations

Gebze Institute of Technology, 41400, Gebze/Kocaeli
Hidayet Takcı & İbrahim Soğukpınar

Editor information

Editors and Affiliations

Dokuz Eylül University, İzmir, Turkey
Tatyana Yakhno

About this paper

Cite this paper

Takcı, H., Soğukpınar, İ. (2004). Letter Based Text Scoring Method for Language Identification. In: Yakhno, T. (ed.) Advances in Information Systems. ADVIS 2004. Lecture Notes in Computer Science, vol 3261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30198-1_29

DOI: https://doi.org/10.1007/978-3-540-30198-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23478-4
Online ISBN: 978-3-540-30198-1
eBook Packages: Computer Science, Computer Science (R0)
