Abstract
The volume of text documents on the internet, intranets, digital libraries, and newsgroups has grown rapidly in recent years. Extracting useful information and meaningful patterns from these documents is an important task. Identifying the language of a text document is one such problem, and many researchers have studied it. Most of these studies use words (terms) for language identification and follow either linguistic or statistical approaches. This work proposes a Letter Based Text Scoring Method for language identification. The method is based on the letter distributions of texts: a score is calculated for each new text document from its letter distribution, and the score is used to identify the language of the document. In addition to its acceptable accuracy, the proposed method is simpler and faster than methods based on short terms and n-grams.
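The abstract does not state the scoring formula, so the following Python sketch only illustrates the general idea of scoring a text against per-language letter distributions. The function names (`letter_distribution`, `score`, `identify`), the scoring rule (the average profile frequency of the letters in the text), and the toy training samples are all assumptions made for illustration. They are not the method as published.

```python
from collections import Counter


def letter_distribution(text):
    """Return the relative frequency of each alphabetic character in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def score(text, profile):
    """Hypothetical scoring rule: mean profile frequency over the text's letters.

    Letters that never occur in the language profile contribute zero.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(profile.get(ch, 0.0) for ch in letters) / len(letters)


def identify(text, profiles):
    """Return the language whose letter profile gives the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))


if __name__ == "__main__":
    # Toy training samples (assumed); real profiles would need far larger corpora.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog and runs away",
        "turkish": "pijamalı hasta yağız şoföre çabucak güvendi ve yola çıktı",
    }
    profiles = {lang: letter_distribution(s) for lang, s in samples.items()}
    print(identify("şoför çabucak yola çıktı", profiles))
```

Because each profile is a small table of letter frequencies, scoring a document takes a single pass over its characters. This may be one reason a letter-based approach can be faster than word or n-gram matching, although the sketch makes no claim about the accuracy of this particular scoring rule.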
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takcı, H., Soğukpınar, İ. (2004). Letter Based Text Scoring Method for Language Identification. In: Yakhno, T. (eds) Advances in Information Systems. ADVIS 2004. Lecture Notes in Computer Science, vol 3261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30198-1_29
DOI: https://doi.org/10.1007/978-3-540-30198-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23478-4
Online ISBN: 978-3-540-30198-1
eBook Packages: Computer Science, Computer Science (R0)