Abstract
Identification of script in multi-lingual documents is essential for many language-dependent applications such as machine translation and optical character recognition. Techniques for script identification generally require large areas for operation so that sufficient information is available. Such an assumption does not hold in the Indian context, as words of two different scripts are interspersed in most documents. In this paper, techniques to identify the script of a word are discussed. Two different approaches have been proposed and tested. The first method divides words into three distinct spatial zones and uses the spatial spread of a word in the upper and lower zones, together with the character density, to identify the script. The second technique analyzes the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Words with various font styles and sizes have been used to test the proposed algorithms, and the results obtained are quite encouraging.
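The second technique can be illustrated with a minimal sketch of directional energy features computed from a small Gabor filter bank. This is not the authors' implementation: the abstract does not state the filter parameters, so the frequencies, orientations, Gaussian width, kernel size, and all function names below are assumptions chosen only for illustration.

```python
import numpy as np


def gabor_kernel(frequency, theta, sigma=4.0, size=21):
    """Real (even-symmetric) Gabor kernel: a Gaussian-windowed cosine grating.

    sigma and size are illustrative assumptions, not values from the paper.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Coordinate along the direction in which the grating oscillates.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * frequency * x_rot)
    # Subtract the mean so that uniform regions produce no response.
    return kernel - kernel.mean()


def filter_energy(image, kernel):
    """Sum of squared responses of a full linear convolution, computed via FFT."""
    shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(image, shape) * np.fft.rfft2(kernel, shape)
    response = np.fft.irfft2(spectrum, shape)
    return float(np.sum(response ** 2))


def directional_energy(word_image,
                       frequencies=(0.1, 0.2),
                       orientations_deg=(0, 45, 90, 135)):
    """Energy per (frequency, orientation) pair, normalized to sum to 1.

    The frequency and orientation grids are assumed values.
    """
    image = np.asarray(word_image, dtype=float)
    energies = np.array([
        [filter_energy(image, gabor_kernel(f, np.deg2rad(t)))
         for t in orientations_deg]
        for f in frequencies
    ])
    total = energies.sum()
    return energies / total if total > 0 else energies


# Synthetic "word" made of vertical strokes repeating every 10 pixels.
strokes = np.zeros((32, 64))
strokes[:, ::10] = 1.0
features = directional_energy(strokes)
```

In this sketch an orientation of 0 degrees means the filter oscillates along the horizontal axis, so it responds most strongly to vertical strokes. The normalized feature array could then be flattened and passed to any classifier to decide the script of the word.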
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dhanya, D., Ramakrishnan, A.G. (2002). Script Identification in Printed Bilingual Documents. In: Lopresti, D., Hu, J., Kashi, R. (eds) Document Analysis Systems V. DAS 2002. Lecture Notes in Computer Science, vol 2423. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45869-7_2
DOI: https://doi.org/10.1007/3-540-45869-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44068-0
Online ISBN: 978-3-540-45869-2
eBook Packages: Springer Book Archive