1 Introduction

Optical character recognition (OCR) is a very important task in the field of automated document analysis; it is broadly defined as the process of recognizing either printed or handwritten text from document images and converting it into electronic form. To date, many algorithms have been presented in the literature to perform this task for a specific language/script. Almost all existing works on OCR implicitly assume that the script type of the document to be processed is known beforehand. In a multilingual environment, document processing systems relying on OCR would therefore need human intervention to select the appropriate OCR package, which is clearly inefficient and undesirable. At the same time, it is practically impossible to design a single OCR system that can recognize a reasonable number of scripts. Hence, it is generally necessary to identify the script type before feeding the document image to the corresponding OCR engine.

Script identification from handwritten documents is challenging for the following reasons: (1) pre-processing difficulties such as ruling lines, skew, and noise; (2) complexity in feature selection due to the large set of symbols; and (3) sensitivity of the scheme to the huge variations in handwriting styles. In general, script identification can be achieved at three levels, namely page level, text line level, and word level. The ability to reliably identify the script type from the least amount of textual data is essential when dealing with document pages that contain text words of different scripts. However, identification at word level is not always feasible because the few characters present in a single word may not be sufficiently informative. On the other hand, identification at page level can be overly complicated and laborious. A good compromise is therefore to perform script identification at the text line level.

In the context of Indic scripts, most of the published methodologies [16] on text line-level script identification have considered printed text documents. A few research works [7, 8] address handwritten text lines. Despite these research contributions, it can be noticed that most of the work has been done for bilingual or trilingual settings. This is a serious limitation in a multilingual country like India, where people residing in different regions use different scripts. In the Indian context, a script recognition system should therefore be able to identify a larger number of Indic scripts. This has motivated us to take up the challenge of identifying the script type of handwritten text lines written in five Indic scripts, namely Bangla, Devnagari, Malayalam, Tamil, and Telugu, along with the Roman script. We have included the Roman script in our work because it is used for official work in almost every state of India.

2 Proposed Work

The proposed work is based on a simple observation: every script/language consists of a finite set of characters, each with a distinct visual appearance, and these appearances serve as useful visual clues for recognizing the script. This has inspired us to use texture-based GLCM features for identifying handwritten scripts.

2.1 Gray Level Co-occurrence Matrix (GLCM)

GLCM estimates second-order statistical properties of an image, which consider the relationship among pixels or groups of pixels. Haralick [9] suggested the use of GLCM, which has since become one of the best known and most widely used texture features. The method is based on the joint probability distribution of pairs of gray levels. The GLCM records how often each gray level occurs at a pixel located at a fixed geometric position relative to another pixel, as a function of the gray level [10]. Mathematically, for a given image I of size M × N and a displacement vector \( d\left( {d_{x} ,d_{y} } \right) \), the GLCM is defined as a square matrix P of size L × L, where L is the number of gray levels (0, 1, …, L − 1) in the image.

$$ P\left( {i,j} \right) = \mathop \sum \limits_{x = 1}^{M} \mathop \sum \limits_{y = 1}^{N} \left\{ {\begin{array}{ll} 1 & {{\text{if}}\;I\left( {x,y} \right) = i\;{\text{and}}\;I\left( {x + d_{x} ,y + d_{y} } \right) = j} \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right. $$
(1)

Here, i and j are intensity values of the image I, x and y are spatial pixel positions in I, and the offset \( \left( {d_{x} ,d_{y} } \right) \) depends on the direction θ and the distance d for which the matrix is computed. Thus, \( P\left( {i,j} \right) \) counts the number of times \( I\left( {x,y} \right) = i \) and \( I\left( {x + d_{x} ,y + d_{y} } \right) = j \) occur in image I. Figure 1 illustrates the co-occurrence matrices along four directions \( \left( {0^\circ ,\,45^\circ ,\,90^\circ \,{\text{and}}\,135^\circ } \right) \) for a sample image represented with two gray-tone values, 0 and 1. For this purpose, we have considered neighboring pixels at distances \( d = 1 \) and \( d = 2 \) along the four possible directions.
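The construction of Eq. (1) can be sketched as follows; this is a minimal illustration not given in the paper, using a small hypothetical binary image:

```python
import numpy as np

def glcm(img, dx, dy, levels=2):
    """Co-occurrence matrix per Eq. (1): P[i, j] counts pixel pairs
    where img[x, y] == i and img[x + dx, y + dy] == j."""
    M, N = img.shape
    P = np.zeros((levels, levels), dtype=np.int64)
    for x in range(M):
        for y in range(N):
            u, v = x + dx, y + dy
            if 0 <= u < M and 0 <= v < N:   # keep the pair inside the image
                P[img[x, y], img[u, v]] += 1
    return P

# Hypothetical 4 x 4 binary image with gray tones 0 and 1
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 1, 1, 0],
                [1, 1, 0, 0]])

# Horizontal neighbors (theta = 0 deg, d = 1) correspond to offset (0, 1)
print(glcm(img, dx=0, dy=1))
```

Each direction θ and distance d maps to one offset \( (d_{x}, d_{y}) \), so the eight matrices used later (two distances × four directions) are obtained by varying these two arguments.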

Fig. 1

Illustration of co-occurrence matrices along four directions (0°, 45°, 90° and 135°) for extracting texture features, where (a) d = 1 and (b) d = 2

A set of 10 features based on GLCM has been extracted; these are described below in detail.

Energy

Energy, also known as uniformity, measures image homogeneity. It is the sum of squares of the entries in the GLCM. Energy is high when the image is very homogeneous, i.e., when pixels are very similar. Energy [11] is calculated as

$$ Energy = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} P^{2} \left( {i,j} \right) $$
(2)

Entropy

Entropy indicates the amount of information in the image, for instance the amount needed for image compression, and measures the loss of information in a transmitted signal [9]. Entropy is calculated as

$$ Entropy = - \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right) \cdot \;\log P\left( {i,j} \right) $$
(3)
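Equations (2) and (3) can be computed directly from a co-occurrence matrix; a minimal sketch (not part of the paper), assuming P is first normalized to joint probabilities:

```python
import numpy as np

def energy(P):
    p = P / P.sum()           # normalize counts to joint probabilities
    return np.sum(p ** 2)     # Eq. (2): sum of squared entries

def entropy(P):
    p = P / P.sum()
    nz = p[p > 0]             # skip zero entries, taking 0 * log 0 := 0
    return -np.sum(nz * np.log(nz))   # Eq. (3)

# Example counts for a 2 x 2 GLCM of a binary image
P = np.array([[3, 3], [2, 4]], dtype=float)
print(energy(P))   # approaches 1.0 for a perfectly homogeneous image
print(entropy(P))
```

Note that energy is maximal (1.0) when all probability mass sits in one cell, while entropy is maximal when mass is spread uniformly; the two measures are therefore complementary.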

Inertia

Inertia [12] measures the amount of local variation; a large amount of variation gives a large inertia. It is defined as

$$ Inertia = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} \left( {i - j} \right)^{2} \cdot P\left( {i,j} \right) $$
(4)

Autocorrelation

Autocorrelation [9] measures the linear spatial relationship between the sizes of texture primitives. The autocorrelation-based approach to texture analysis estimates the concentration of intensity values over all or part of an image, represented as a feature vector. It is defined as

$$ Autocorrelation = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} i \cdot j \cdot P\left( {i,j} \right) $$
(5)

Covariance

Covariance [12] measures the strength of correlation between the distributions of pairs of adjacent pixels. If a change in one pixel-pair distribution corresponds with a change in the other, i.e., the distributions tend to show similar behavior, the covariance is positive; otherwise it is negative. It can be defined as

$$ Covariance = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} \left( {i - M_{x} } \right)\left( {j - M_{y} } \right) \cdot P\left( {i,j} \right) $$
(6)
$$ M_{x} = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} i \cdot P\left( {i,j} \right) $$
(7)
$$ M_{y} = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} j \cdot P\left( {i,j} \right) $$
(8)
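Equations (6)–(8) can be sketched together; this illustration is not from the paper and assumes P has been normalized to probabilities:

```python
import numpy as np

def covariance(P):
    """Covariance per Eq. (6), using the means M_x, M_y of Eqs. (7)-(8)."""
    p = P / P.sum()                  # joint probabilities
    i, j = np.indices(p.shape)       # row index i, column index j
    mx = np.sum(i * p)               # Eq. (7): mean of i
    my = np.sum(j * p)               # Eq. (8): mean of j
    return np.sum((i - mx) * (j - my) * p)   # Eq. (6)

# Example counts for a 2 x 2 GLCM of a binary image
P = np.array([[3, 3], [2, 4]], dtype=float)
print(covariance(P))
```

A positive value here indicates that neighboring pixels tend to share the same gray tone, which is what the covariance feature is meant to capture.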

Contrast

Contrast [11] is a difference moment of the GLCM. It measures the amount of local variation present in the image and can be thought of as reflecting the dependency of gray levels of neighboring pixels [9]. It is defined as

$$ Contrast = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right) \cdot |i - j|^{k} , \quad k \in {\mathbb{Z}} $$
(9)

In the present work, the value k = 4 proved to be optimal.
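A minimal sketch of Eq. (9), not given in the paper; note that inertia, Eq. (4), is the special case k = 2:

```python
import numpy as np

def contrast(P, k=4):
    """Contrast per Eq. (9); k = 4 is the value reported optimal here."""
    p = P / P.sum()                      # normalize to probabilities
    i, j = np.indices(p.shape)
    return np.sum(p * np.abs(i - j) ** k)

# Example counts for a 2 x 2 GLCM of a binary image
P = np.array([[3, 3], [2, 4]], dtype=float)
print(contrast(P, k=4))
```

For 2 × 2 matrices the distinction between values of k disappears, since \( |i - j| \) is either 0 or 1 and contrast reduces to the total off-diagonal mass; the choice of k therefore matters only for gray-scale inputs with more than two levels.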

Local Homogeneity

Local homogeneity [11] measures the closeness of gray levels in the spatial distribution over the image. A homogeneously textured image comprises a limited range of gray levels, so its GLCM has a few entries with relatively high probability. It is defined as

$$ Local\,Homogeneity = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} \frac{1}{{1 + \left( {i - j} \right)^{2} }}P\left( {i,j} \right) $$
(10)

Cluster Shade

Cluster shade [11] measures the skewness of the co-occurrence matrix, in other words, its lack of symmetry. It is defined as

$$ Cluster\,Shade = \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} \left( {i - M_{x} + j - M_{y} } \right)^{3} \cdot\,P\left( {i,j} \right) $$
(11)

Cluster Prominence

Cluster prominence [11] also measures the skewness of the co-occurrence matrix. When cluster prominence is high, the image is asymmetric; when it is low, there is a peak in the co-occurrence matrix around the mean values, implying little variation in gray levels. It can be defined as

$$ Cluster\,Prominence = \mathop \sum \limits_{i = 0}^{L - 1} \mathop \sum \limits_{j = 0}^{L - 1} \left( {i - M_{x} + j - M_{y} } \right)^{4} \cdot P\left( {i,j} \right) $$
(12)

Information Measure of Correlation

Correlation measures the linear dependency of gray levels of neighboring pixels. The information measure of correlation quantifies this dependency through entropy-based statistics computed from the GLCM and its marginal distributions [10], and can be defined as

$$ Information\,Measure\,of\,Correlation = \frac{{ - \mathop \sum \nolimits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \nolimits_{j = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right)\log P\left( {i,j} \right) - H_{xy} }}{{\hbox{max} \left( {H_{x} ,H_{y} } \right)}} $$
(13)

where,

$$ H_{xy} = - \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right) \cdot \log \left( {\mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right) \cdot \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right)} \right) $$
(14)
$$ H_{x} = - \mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} \left\{ {\mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right) \cdot \log \mathop \sum \limits_{j = 0}^{L - 1} P\left( {i,j} \right)} \right\} $$
(15)
$$ H_{y} = - \mathop \sum \limits_{j = 0}^{{{\text{L}} - 1}} \left\{ {\mathop \sum \limits_{i = 0}^{{{\text{L}} - 1}} P\left( {i,j} \right) \cdot \log \mathop \sum \limits_{i = 0}^{L - 1} P(i,j)} \right\} $$
(16)
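Equations (13)–(16) fit together as follows; this sketch is not from the paper and assumes P is normalized to probabilities, with the convention 0 · log 0 := 0:

```python
import numpy as np

def info_measure_of_correlation(P):
    """Eq. (13), using H_xy of Eq. (14) and H_x, H_y of Eqs. (15)-(16)."""
    p = P / P.sum()
    px = p.sum(axis=1)                   # marginal distribution of i
    py = p.sum(axis=0)                   # marginal distribution of j

    def H(v):                            # Shannon entropy, skipping zeros
        v = v[v > 0]
        return -np.sum(v * np.log(v))

    joint_entropy = H(p.ravel())         # first term of the numerator
    mask = p > 0
    h_xy = -np.sum(p[mask] * np.log(np.outer(px, py)[mask]))  # Eq. (14)
    return (joint_entropy - h_xy) / max(H(px), H(py))         # Eq. (13)

# For a matrix whose rows and columns are statistically independent,
# the joint entropy equals H_xy and the measure is exactly 0
P_indep = np.outer([0.5, 0.5], [0.25, 0.75])
print(info_measure_of_correlation(P_indep))
```

Since the joint entropy never exceeds \( H_{xy} \), the measure is non-positive, reaching 0 only when the gray levels of neighboring pixels are independent.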

Since the document images from which the features are estimated are binary, no gray-level quantization is needed. With only two gray levels, each matrix is of size 2 × 2, and due to the diagonal symmetry property it can be fully described with only three unique parameters [11]. For each of the 10 measures defined above, the values \( d \in \left\{ {1,2} \right\} \) and \( \theta \in \left\{ {0^\circ ,45^\circ ,90^\circ ,135^\circ } \right\} \) lead to a total of 80 (10 × 8) features using GLCM.
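The assembly of the full feature vector can be sketched as follows; this is an illustrative outline, not the paper's implementation, and shows only two of the ten measures (the remaining eight from Eqs. (4)–(13) would be appended in the same way):

```python
import numpy as np

# Unit offsets (dx, dy) realizing theta = 0, 45, 90 and 135 degrees
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]

def glcm(img, dx, dy, levels=2):
    M, N = img.shape
    P = np.zeros((levels, levels))
    for x in range(M):
        for y in range(N):
            u, v = x + dx, y + dy
            if 0 <= u < M and 0 <= v < N:
                P[img[x, y], img[u, v]] += 1
    return P / P.sum()                  # normalized per Eq. (1)

def measures(p):
    # Energy, Eq. (2), and entropy, Eq. (3); the other eight measures
    # would be computed from p here as well.
    nz = p[p > 0]
    return [np.sum(p ** 2), -np.sum(nz * np.log(nz))]

def feature_vector(img):
    vec = []
    for d in (1, 2):                    # two distances
        for dx, dy in OFFSETS:          # four directions
            vec.extend(measures(glcm(img, d * dx, d * dy)))
    return np.array(vec)                # length 10 x 8 = 80 with all measures

# Random stand-in for a binarized text-line image
line = (np.random.default_rng(0).random((32, 128)) > 0.5).astype(int)
print(feature_vector(line).shape)       # (16,) here; (80,) with all ten
```

One feature vector of this form is extracted per text line and then passed to the classifiers described in the next section.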

3 Experimental Results and Discussion

The performance of the proposed approach has been tested on a dataset of 600 text lines written in six handwritten scripts, namely Bangla, Devnagari, Malayalam, Tamil, Telugu and Roman. Each script contributes an equal number of text lines, which are extracted from the document images using the piecewise water flow technique described in [13]. The extracted text lines are stored as grayscale images. Noise is removed using a Gaussian filter [12]. Text lines are then binarized using the well-known Otsu global thresholding approach [14]. The proposed approach is then applied to the preprocessed text lines and evaluated using seven well-known classifiers (with the help of the software tool Weka [15]), namely Naïve Bayes, Bayes Net, MLP, Support Vector Machine (SVM), Random Forest, Bagging and MultiClass Classifier. The recognition performances and the corresponding scores achieved at the 95 % confidence level are shown in Table 1.

Table 1 Recognition performances of the proposed script identification technique using seven well-known classifiers

The accuracy achieved by the MLP classifier (as evident from Table 1) suggests that it can perform even better if tuned comprehensively with different parameters. For this purpose, we have used 3-fold, 5-fold and 7-fold cross-validation schemes with varied epoch sizes of the MLP classifier (see Table 2). From the table, it is observed that with 3-fold cross-validation and an epoch size of 1,500, the best identification accuracy achieved is 95.67 %.
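The evaluation protocol can be outlined as below. The paper uses Weka; this sketch substitutes scikit-learn's MLP as a stand-in, with random placeholder data where the 80-dimensional GLCM vectors and script labels would go:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: X would hold the 80-dimensional GLCM feature vectors
# of the 600 text lines, and y their script labels (6 classes).
rng = np.random.default_rng(0)
X = rng.random((60, 80))
y = rng.integers(0, 6, size=60)

clf = MLPClassifier(max_iter=1500, random_state=0)  # cap training at 1,500 epochs
scores = cross_val_score(clf, X, y, cv=3)           # 3-fold cross-validation
print(scores.mean())                                # mean fold accuracy
```

With the real feature vectors in place of the random data, this is the setup under which the 95.67 % figure would be measured; on random labels the mean accuracy naturally stays near chance level.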

Table 2 Recognition accuracies of script identification technique for different folds and different number of epochs of MLP classifier

Though Table 2 shows encouraging results, some misclassifications occurred during the experimentation. The main reasons are the presence of noise and poor segmentation of text lines due to skew, punctuation symbols, etc. in the document images. The structural resemblance between the character sets of scripts like Bangla and Devnagari causes similarity in adjacent pixel distributions, which leads to their mutual misclassification. For a similar reason, the Telugu and Malayalam scripts are misclassified as each other. Figure 2 shows some instances of misclassification by the present technique.

Fig. 2

Sample text lines written in (a) Bangla, (b) Devnagari, (c) Malayalam, (d) Tamil, and (e) Telugu scripts, misclassified by the present technique as Devnagari, Bangla, Telugu, Roman, and Malayalam scripts, respectively

4 Conclusion

Based on the observation that humans can classify dissimilar scripts at a glance, we have investigated the possibility of identifying scripts using only global analysis. In this paper, a texture-based approach for text line-level script identification from handwritten documents is presented. An 80-element feature set based on GLCM is applied to distinguish six popular scripts used in India. Experiments are performed on text lines extracted from the document images, and an overall classification rate of 95.67 % is achieved. Although the present technique is evaluated on a limited dataset, the results are encouraging. The work presented here is a step towards building a general multi-script OCR system for the Indian subcontinent that can work for all official Indic scripts.