1 Introduction

A script is the graphic form used by a writing system. The languages used in human society are typeset with different scripts. A script may be used by only one language, or it may be shared by several languages, with or without variations [1]. India has 23 constitutionally recognized languages [2], viz., Assamese, Bengali, Bodo, Dogri, Kannada, Hindi, Sindhi, Nepali, Urdu, Punjabi, Marathi, Gujarati, Oriya, Sanskrit, Tamil, Telugu, Malayalam, Kashmiri, Manipuri, Konkani, Maithili, Santali and English. The 12 major scripts used to write these languages are: Bangla, Devanagari, Gujarati, Gurumukhi, Manipuri, Malayalam, Oriya, Tamil, Telugu, Kannada, Roman and Urdu. Among these, only Urdu is written from right to left, whereas the rest are written from left to right. The first 10 scripts, which originated from the ancient Brahmi script, are also known as Indic scripts.

In general, an OCR system is designed to recognize only one particular script, and for this reason it is not viable to build a single OCR engine that recognizes a wide variety of scripts/languages. Hence, one can think of maintaining a pool of OCR engines, each corresponding to a different script, in a multilingual environment. However, this requires prior knowledge of the script used to write the document, a need that can be met by designing an automatic script recognition module for the multi-script scenario.

An automatic script identification module can be used to sort or search relevant information when the domain is multilingual/multi-script, and it also helps to index/categorize document images on the basis of their script type. When a script is used to write only one language, script recognition can also be regarded as language recognition. Otherwise, script recognition is the first classification step, followed by language identification among the languages that share a common script.

Different methodologies have been reported in the literature for accomplishing this task, sometimes with a high degree of accuracy. A comprehensive survey of script recognition techniques was prepared by Singh et al. [1], with emphasis on script identification for both printed and handwritten Indic scripts. Script identification for printed documents has been reported at page level [3–5], text-line level [6–10], and word level [11–16]. In contrast, only a few works address handwritten script identification at page level [17], text-line level [18, 19], and word level [20–22]. Singh et al. [17] developed a page-level script identification technique for handwritten document pages using the Gray Level Co-occurrence Matrix (GLCM). The technique was evaluated on four scripts, namely Bangla, Devanagari, Telugu, and Roman, and identified 91.48% of the scripts correctly using an MLP classifier. Hangarge et al. [18] described a set of 13 spatial spread features for three scripts, namely English, Devanagari and Urdu, extracted using morphological filters. Experiments were carried out with a k-NN classifier by varying the number of neighbors (k = 3, 5, 7, 9, 11, 13, 15), and the performance was optimal for k = 3. The overall recognition accuracies were 88.67% and 99.2% for the tri-script and bi-script cases respectively. Singh et al. [19] proposed a texture-based approach for script identification at text-line level for six handwritten scripts, namely Bangla, Devanagari, Malayalam, Tamil, Telugu and Roman; an accuracy of 95.67% was achieved using 3-fold cross validation with an MLP classifier. Roy et al. [20] described a scheme for word-wise identification of handwritten Roman and Oriya scripts for Indian postal automation using water-reservoir and topological features; the overall accuracies on the test dataset were 99.6% and 97.69% respectively. Sarkar et al. [21] presented a system that identified the scripts of handwritten words in document images written in Bangla or Devanagari mixed with Roman, using eight holistic features; recognition performances of 99.29% and 98.43% were achieved on the Bangla-English and Devanagari-English test sets respectively. Singh et al. [22] reported a technique that recognized the scripts of handwritten words from document pages written in Devanagari mixed with Roman. A set of 39 distinctive topological and convex hull based features was designed for this purpose, and an overall script identification accuracy of 99.54% was achieved. However, a major limitation of the above works is that they consider only a few Indic scripts. This has been a major motivation for developing a robust handwritten script identification technique covering all the official Indic scripts along with the Roman script.

2 Challenges Related to Handwritten Script Identification

There are some unique challenges that must be addressed in the domain of handwritten script recognition. Among many, two basic problems are inter-writer variability and intra-writer variability. Inter-writer variability encompasses the variations seen among different writers, i.e., different writers will invariably have different writing styles. In contrast, intra-writer variability reflects the fact that the same writer tends to write the same textual content in different ways depending upon his/her frame of mind. The challenge in this regard is to create a writer-independent script identification system that can adapt to these variations the way humans do. Another major challenge is the problem of constrained versus unconstrained handwriting. Constrained handwriting refers to handwritten text that conforms to a pre-defined writing constraint, e.g. all the text words in a document image being discrete and non-touching, whereas unconstrained handwriting means the document image may contain discrete handwriting, cursive handwriting or a mixture of both, with no restriction on the writers while they write. Apart from this, the difficulties inherent in recognizing handwritten scripts pose far greater challenges than their printed counterparts. Similarity among different scripts is quite common when the documents are handwritten, and the styles of writing are far more diverse than printed fonts. Also, problems such as the existence of ruling lines, noise, skew, ink quality, age of the document, etc. are commonly seen in handwritten documents. As mentioned earlier, script identification can be performed at page level, text-line level or word level. Identifying scripts at page level can be too convoluted and protracted due to its large computational complexity. On the other hand, identifying words written in different scripts from only a few characters is definitely challenging, because the number of characters present in a single word may not provide the amount of discriminative information required for identification. Therefore, considering the complexities of the scripts, it is preferable to identify the scripts at text-line level rather than at page or word level.

3 Data Collection and Pre-processing

There is no standard database of handwritten (or printed) Indic script documents available in the public domain that could be used for this experimentation. Hence, we have prepared an in-house database of handwritten documents. Different educated people were requested to write a few text-lines of their choice on A4-size pages. The handwritten text-lines cover the 12 official scripts of India mentioned earlier. It is to be noted that the writers involved in the data collection drive belong to different professions. The pages were scanned at 300 dpi resolution and saved as gray-tone images. The noisy pixels therein, if any, are removed by a Gaussian filter [23]. It is worth mentioning that the inter-word and intra-word spaces are very non-uniform in the handwritten text-lines. Numerals of any script that may be present in the text document are not considered in the present work. A text-line whose width is at least 50% of the page width is considered here. A sample snapshot of text-line images written in the 12 different scripts is shown in Fig. 1. Otsu's global thresholding approach [24] is used to convert them into two-tone images (0 and 1), where the label '1' represents the object and '0' the background (a preprocessing sketch is given below). However, the dots and punctuation marks appearing in the text-lines have not been eliminated, since these may also contribute to understanding the text in a meaningful way. Finally, 2400 handwritten text-line images are prepared, with precisely 200 text-lines per script.
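The following minimal Python sketch illustrates this preprocessing step (Gaussian denoising followed by Otsu binarization). The 5 × 5 kernel size and the file name are illustrative assumptions, not parameters reported in the text.

```python
# Illustrative preprocessing sketch (not the authors' exact pipeline): a scanned
# gray-tone text-line image is denoised with a Gaussian filter and binarized with
# Otsu's global threshold so that '1' marks object (ink) pixels and '0' background.
# The 5x5 kernel size is an assumption; the text does not specify it.
import cv2
import numpy as np

def preprocess_text_line(path: str) -> np.ndarray:
    """Return a two-tone (0/1) image from a 300-dpi gray-scale scan."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)            # gray-tone image
    denoised = cv2.GaussianBlur(gray, (5, 5), 0)             # remove speckle noise
    # Otsu picks the threshold automatically; INV maps dark ink to foreground
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return (binary // 255).astype(np.uint8)                  # 1 = object, 0 = background

if __name__ == "__main__":
    # 'bangla_line_001.png' is a hypothetical file name used only for illustration
    two_tone = preprocess_text_line("bangla_line_001.png")
    print(two_tone.shape, two_tone.max(), two_tone.min())
```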

Fig. 1 Sample text-line images written in: a Bangla, b Devanagari, c Gujarati, d Gurumukhi, e Kannada, f Malayalam, g Manipuri, h Oriya, i Tamil, j Telugu, k Urdu, and l Roman scripts respectively

4 Proposed Work

Every script/language, consisting of a finite set of characters, has a distinct visual appearance, which serves as a useful visual clue for recognizing the script. The current research is inspired by this simple observation of human beings, which also motivates researchers to design different texture-based features. Usually, texture features are designed to capture the granularity and repetitive patterns of local regions within an image. Some well-known texture features relying on the GLCM and Gabor filter banks consider multiple scales and orientations for feature extraction, which in turn incurs a high computational cost. The conventional statistical textural features utilized in this paper are the Neighborhood Gray-Tone Difference Matrix (NGTDM) and the Gray-Level Run Length Matrix (GLRLM). These features are described in the following subsections.

4.1 Neighborhood Gray-Tone Difference Matrix (NGTDM)

The NGTDM [25] defines texture measures that correlate well with the human perception of texture. It characterizes texture using neighborhood intensity differences, which helps to describe local features. The NGTDM is based on the differences between each pixel and the neighboring pixels in the surrounding region. It is basically a column vector of L elements, populated by computing the difference between the intensity value of each pixel and the mean intensity calculated over a square window centered at that pixel. Suppose the image intensity level \(f\left( {x,y} \right)\) at location \(\left( {x,y} \right)\) is i, \(i = 0,1,2, \ldots , L - 1\). The mean intensity value of the window centered at \(\left( {x,y} \right)\), excluding the pixel itself, can be written as:

$$\bar{f}_{i} = \frac{1}{W - 1}\mathop \sum \limits_{m = - K}^{K} \mathop \sum \limits_{n = - K}^{K} f\left( {x + m,y + n} \right),\quad \left( {m,n} \right) \ne \left( {0,0} \right)$$
(1)

where K determines the neighborhood size, the window spanning \(\left( {2K + 1} \right) \times \left( {2K + 1} \right)\) pixels with \(W = \left( {2K + 1} \right)^{2}\). The i-th entry of the gray-tone difference matrix is given by:

$$g\left( i \right) = \mathop \sum \limits_{x = 0}^{M - 1} \mathop \sum \limits_{y = 0}^{N - 1} \left| {i - \bar{f}_{i} } \right|$$
(2)

where the summation runs only over those pixels whose intensity value is i; otherwise, \(g\left( i \right) = 0\).
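The construction of the gray-tone difference vector in Eqs. (1)–(2) can be sketched as follows. The sketch assumes the image has already been quantized to L gray levels (0..L−1) and simply skips pixels within K of the border, which may differ slightly from the exact border handling used by the authors.

```python
# A minimal NumPy sketch of the NGTDM construction of Eqs. (1)-(2). The
# neighborhood mean excludes the centre pixel (hence the division by W - 1).
import numpy as np

def ngtdm(img: np.ndarray, L: int, K: int = 1):
    """Return (g, N_i, n): the gray-tone difference vector g(i), the pixel
    counts N_i per gray level, and the number n of interior pixels used."""
    M, N = img.shape
    W = (2 * K + 1) ** 2
    g = np.zeros(L)
    N_i = np.zeros(L)
    n = 0
    for x in range(K, M - K):
        for y in range(K, N - K):
            i = int(img[x, y])
            window = img[x - K:x + K + 1, y - K:y + K + 1].astype(float)
            # mean gray tone of the neighbourhood, excluding the centre pixel
            f_bar = (window.sum() - img[x, y]) / (W - 1)
            g[i] += abs(i - f_bar)        # Eq. (2)
            N_i[i] += 1
            n += 1
    return g, N_i, n
```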

Five different features, described below, are derived from the NGTDM to quantitatively estimate the following perceptual texture properties:

  • Coarseness. It determines the presence of texture in an image and is measured by the size of the primitives that form the texture. Generally, a coarse texture comprises large primitives, typified by a high degree of gray-level uniformity in the neighborhood, whereas a fine texture is composed of small primitives and is characterized by a high degree of local gray-level variation.

    $$F_{cos} = \left( {\epsilon + \mathop \sum \limits_{i = 0}^{L - 1} p_{i} g\left( i \right)} \right)^{ - 1}$$
    (3)

    where \(\epsilon\) is a small number which prevents the coarseness coefficient from becoming infinite, and \(p_{i}\) is the estimated probability of occurrence of intensity value i such that

    $$p_{i} = N_{i} /n$$
    (4)

    with \(N_{i}\) denoting the number of pixels having level i, and \(n = \left( {N - K} \right)\left( {M - K} \right)\).

  • Contrast. It quantifies how clearly the different primitives in a texture can be differentiated. A well-contrasted image contains primitives that are clearly visible and distinguishable. Among the factors that influence contrast, the gray-level range, the ratio of white to black pixels and the frequency of gray-level intensity changes are important.

    $$F_{con} = \left[ {\frac{1}{{N_{t} \left( {N_{t} - 1} \right)}}\mathop \sum \limits_{i = 0}^{L - 1} \mathop \sum \limits_{j = 0}^{L - 1} p_{i} p_{j} \left( {i - j} \right)^{2} } \right]\left[ {\frac{1}{n}\mathop \sum \limits_{i = 0}^{L - 1} g(i)} \right]$$
    (5)

    where \(N_{t}\) denotes the total number of distinct gray-levels present in the image.
  • Busyness. It measures the rate of intensity change from a pixel to its neighborhood. If the intensity changes are rapid and abrupt, the texture is called busy, whereas if they are slow and gradual, the texture is called non-busy. Busyness is related to the spatial frequency of intensity changes in an image and is also affected by the amplitude of those changes.

    $$F_{bus} = \frac{{\mathop \sum \nolimits_{i = 0}^{L - 1} p_{i} g\left( i \right)}}{{\mathop \sum \nolimits_{i = 0}^{L - 1} \mathop \sum \nolimits_{j = 0}^{L - 1} \left| {ip_{i} - jp_{j} } \right|}}\forall p_{i} \ne 0,\,\,p_{j} \ne 0$$
    (6)
  • Complexity. This reflects the visual information content of a texture. A texture is said to be complex when its information content is high, which depends on the number of distinct primitives and their average intensity values. Complexity is computed as the sum of normalized differences between pairs of intensity values, weighted by the sums of the corresponding NGTDM entries. Mathematically, it can be written as:

    $$F_{com} = \mathop \sum \limits_{i = 0}^{L - 1} \mathop \sum \limits_{j = 0}^{L - 1} \frac{{\left| {i - j} \right|}}{{n\left( {p_{i} + p_{j} } \right)}}\left[ {p_{i} g\left( i \right) + p_{j} g\left( j \right)} \right]\,\forall p_{i} \ne 0,\,\,p_{j} \ne 0$$
    (7)
  • Texture Strength. Strength integrates and summarizes the notions of busyness and coarseness. An image with a strong texture is composed of easily definable and clearly visible elements. It can be expressed as:

    $$F_{str} = \frac{{\mathop \sum \nolimits_{i = 0}^{L - 1} \mathop \sum \nolimits_{j = 0}^{L - 1} \left( {p_{i} + p_{j} } \right)\left( {i - j} \right)^{2} }}{{\epsilon + \mathop \sum \nolimits_{i = 0}^{L - 1} g(i)}}\,\forall p_{i} \ne 0,\,\,p_{j} \ne 0$$
    (8)

For feature extraction, each text-line image, written in any of the scripts, is first divided into 4 sub-images using a 2-level quad-tree decomposition approach, and the five features are then computed from each sub-image. Two distances \(d = 1\) and \(d = 2\) are used in the feature computation, corresponding to neighborhood sizes of \(3 \times 3\) and \(5 \times 5\) respectively. Thus, a feature vector of size 40 (F1-F40) is extracted from each text-line image using the NGTDM (a computation sketch is given below). In the computation of \(F_{cos}\) and \(F_{str}\), the value of \(\epsilon\) is taken as \(10^{ - 7}\).
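A corresponding sketch of the five perceptual measures of Eqs. (3)–(8) and of the 40-dimensional vector assembly is given below, building on the ngtdm() sketch above. The split into four quadrants is one simple reading of the four sub-images mentioned in the text, and the gray-level count L is an assumption.

```python
# Sketch of the five NGTDM measures (Eqs. 3-8) and of the 40-dim vector F1-F40.
# eps = 1e-7 follows the value stated in the text; the routine assumes at least
# two distinct gray levels are present in each sub-image.
import numpy as np

def ngtdm_features(g, N_i, n, eps: float = 1e-7):
    L = len(g)
    p = N_i / n                                    # Eq. (4)
    present = p > 0                                # gray levels actually occurring
    levels = np.arange(L, dtype=float)
    Nt = present.sum()                             # number of distinct gray levels

    coarseness = 1.0 / (eps + np.sum(p * g))                            # Eq. (3)
    ii, jj = np.meshgrid(levels[present], levels[present], indexing="ij")
    pi, pj = np.meshgrid(p[present], p[present], indexing="ij")
    gi, gj = np.meshgrid(g[present], g[present], indexing="ij")
    contrast = (np.sum(pi * pj * (ii - jj) ** 2) / (Nt * (Nt - 1))) \
               * (np.sum(g) / n)                                        # Eq. (5)
    busyness = np.sum(p * g) / np.sum(np.abs(ii * pi - jj * pj))        # Eq. (6)
    complexity = np.sum(np.abs(ii - jj) / (n * (pi + pj))
                        * (pi * gi + pj * gj))                          # Eq. (7)
    strength = np.sum((pi + pj) * (ii - jj) ** 2) / (eps + np.sum(g))   # Eq. (8)
    return [coarseness, contrast, busyness, complexity, strength]

def ngtdm_vector(img: np.ndarray, L: int = 256) -> np.ndarray:
    """Assemble the 40-dim vector: 4 sub-images x 2 distances x 5 features."""
    M, N = img.shape
    subs = [img[:M // 2, :N // 2], img[:M // 2, N // 2:],
            img[M // 2:, :N // 2], img[M // 2:, N // 2:]]
    feats = []
    for sub in subs:
        for K in (1, 2):                            # d = 1 -> 3x3, d = 2 -> 5x5
            feats.extend(ngtdm_features(*ngtdm(sub, L, K)))
    return np.asarray(feats)                        # F1-F40
```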

4.2 Gray-Level Run Length Matrix (GLRLM)

The use of a run-length matrix for texture feature extraction was proposed by Galloway [26]. For a given image of size \(M \times N\), the run-length matrix \(p\left( {i,\,j} \right)\) is defined as the number of runs of pixels having gray-level i and run length j (a construction sketch is given below). Galloway derived the following five features from this matrix:
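A minimal sketch of this construction for the horizontal direction (θ = 0°) is given below; the quantization of the image to L levels is assumed, and the three other directions (45°, 90°, 135°) can be handled analogously by traversing columns and diagonals.

```python
# Sketch of the run-length matrix of Galloway [26] for theta = 0 degrees:
# p[i, j] counts runs of pixels with gray level i and run length j+1.
import numpy as np

def run_length_matrix_0deg(img: np.ndarray, L: int) -> np.ndarray:
    M, N = img.shape
    p = np.zeros((L, N), dtype=int)      # rows: gray levels, columns: run lengths 1..N
    for row in img:
        run_val, run_len = int(row[0]), 1
        for v in row[1:]:
            if int(v) == run_val:
                run_len += 1
            else:
                p[run_val, run_len - 1] += 1
                run_val, run_len = int(v), 1
        p[run_val, run_len - 1] += 1     # close the final run of the row
    return p
```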

  • Short Run Emphasis (SRE):

    $$SRE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} \frac{{p\left( {i,j} \right)}}{{j^{2} }} = \frac{1}{{n_{r} }}\mathop \sum \limits_{j = 1}^{N} \frac{{p_{r} \left( j \right)}}{{j^{2} }}$$
    (9)
  • Long Run Emphasis (LRE):

    $$LRE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} p\left( {i,j} \right) \cdot j^{2} = \frac{1}{{n_{r} }}\mathop \sum \limits_{j = 1}^{N} p_{r} \left( j \right) \cdot j^{2}$$
    (10)
  • Gray-Level Non-uniformity (GLN):

    $$GLN = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \left( {\mathop \sum \limits_{j = 1}^{N} p(i,j)} \right)^{2} = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \left[ {p_{g} (i)} \right]^{2}$$
    (11)
  • Run Length Non-uniformity (RLN):

    $$RLN = \frac{1}{{n_{r} }}\mathop \sum \limits_{j = 1}^{N} \left( {\mathop \sum \limits_{i = 1}^{M} p(i,j)} \right)^{2} = \frac{1}{{n_{r} }}\mathop \sum \limits_{j = 1}^{N} \left[ {p_{r} (j)} \right]^{2}$$
    (12)
  • Run Percentage (RP):

    $$RP = \frac{{n_{r} }}{{n_{p} }}$$
    (13)

    In the above equations, \(n_{r}\) is the total number of runs and \(n_{p}\) is the number of pixels in the image. It may be noticed that most of these features are functions of \(p_{r} (j)\) only and do not use the gray-level information contained in \(p_{g} (i)\). Chu et al. [27] therefore proposed two features to capture the gray-level information in the matrix.

  • Low Gray-Level Run Emphasis (LGRE):

    $$LGRE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} \frac{{p\left( {i,j} \right)}}{{i^{2} }} = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \frac{{p_{g} \left( i \right)}}{{i^{2} }}$$
    (14)
  • High Gray-Level Run Emphasis (HGRE):

    $$HGRE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} p\left( {i,j} \right) \cdot i^{2} = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} p_{g} \left( i \right) \cdot i^{2}$$
    (15)

    Further, Dasarathy et al. [28] described four more feature estimation functions based on a combined statistical measure of gray-level and run length, as follows:

  • Short Run Low Gray-Level Emphasis (SRLGE):

    $$SRLGE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} \frac{{p\left( {i,j} \right)}}{{i^{2} \cdot j^{2} }}$$
    (16)
  • Short Run High Gray-Level Emphasis (SRHGE):

    $$SRHGE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} \frac{{p\left( {i,j} \right) \cdot i^{2} }}{{j^{2} }}$$
    (17)
  • Long Run Low Gray-Level Emphasis (LRLGE):

    $$LRLGE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} \frac{{p\left( {i,j} \right) \cdot j^{2} }}{{i^{2} }}$$
    (18)
  • Long Run High Gray-Level Emphasis (LRHGE):

    $$LRHGE = \frac{1}{{n_{r} }}\mathop \sum \limits_{i = 1}^{M} \mathop \sum \limits_{j = 1}^{N} p\left( {i,j} \right) \cdot i^{2} \cdot j^{2}$$
    (19)

These features are all based on intuitive reasoning and attempt to capture apparent properties of the run-length distribution. For each of the 11 measures defined above, run-length matrices are computed along the four directions \(\theta \in \left\{ {0^\circ , 45^\circ , 90^\circ , 135^\circ } \right\}\), leading to a total of 44 features (F41-F84) from the GLRLM (a computation sketch is given below). Finally, a set of 84 (i.e. 40 + 44) statistical textural features is extracted using NGTDM and GLRLM together for text-line level classification of the twelve handwritten scripts.
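The eleven run-length measures of Eqs. (9)–(19) can be sketched as follows, operating on a run-length matrix such as the one built by run_length_matrix_0deg() above. The 1-based gray-level index mirrors the summation limits of the equations (level 0 maps to index 1).

```python
# Sketch of the eleven run-length measures (Eqs. 9-19). Repeating this for the
# four directions yields the 44 GLRLM features (F41-F84) used in this work.
import numpy as np

def glrlm_features(p: np.ndarray, n_pixels: int) -> list:
    M, N = p.shape
    i = np.arange(1, M + 1, dtype=float)[:, None]   # gray-level index (1-based)
    j = np.arange(1, N + 1, dtype=float)[None, :]   # run-length index (1-based)
    n_r = p.sum()                                   # total number of runs
    p_g = p.sum(axis=1)                             # gray-level sums  p_g(i)
    p_r = p.sum(axis=0)                             # run-length sums  p_r(j)

    return [
        np.sum(p / j ** 2) / n_r,                   # SRE   (9)
        np.sum(p * j ** 2) / n_r,                   # LRE   (10)
        np.sum(p_g ** 2) / n_r,                     # GLN   (11)
        np.sum(p_r ** 2) / n_r,                     # RLN   (12)
        n_r / n_pixels,                             # RP    (13)
        np.sum(p / i ** 2) / n_r,                   # LGRE  (14)
        np.sum(p * i ** 2) / n_r,                   # HGRE  (15)
        np.sum(p / (i ** 2 * j ** 2)) / n_r,        # SRLGE (16)
        np.sum(p * i ** 2 / j ** 2) / n_r,          # SRHGE (17)
        np.sum(p * j ** 2 / i ** 2) / n_r,          # LRLGE (18)
        np.sum(p * i ** 2 * j ** 2) / n_r,          # LRHGE (19)
    ]
```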

5 Experimental Evaluation and Discussion

The performance of the present script identification scheme is evaluated on the dataset of 2400 preprocessed text-line images described in Sect. 3. Of the 200 text-line images of each script, 135 are used for training and the remaining 65 for testing. Seven well-known classifiers, namely Naïve Bayes, Bayes Net, MLP, Support Vector Machine (SVM), Random Forest, Bagging and MultiClass Classifier, are applied in order to select the classifier best suited to the present experimental setup (an illustrative training sketch is given below). The recognition performances and their corresponding scores achieved at the 95% confidence level are shown in Table 1.
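The sketch below illustrates such an evaluation using scikit-learn's MLPClassifier as a stand-in for the MLP used here. The CSV file name, the hidden-layer size and the feature standardization step are assumptions for illustration, not parameters reported in this work.

```python
# Illustrative training/evaluation sketch: 84-dim feature vectors are assumed to
# be stored in a (hypothetical) CSV with 84 feature columns followed by a script
# label, one row per text-line image.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

data = np.genfromtxt("script_features.csv", delimiter=",", dtype=str)
X, y = data[:, :84].astype(float), data[:, 84]

# 135 training / 65 test text-lines per script, mirroring the split in the text
train_idx, test_idx = [], []
for script in np.unique(y):
    idx = np.where(y == script)[0]
    train_idx.extend(idx[:135])
    test_idx.extend(idx[135:200])

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000,
                                    random_state=1))
model.fit(X[train_idx], y[train_idx])
pred = model.predict(X[test_idx])
print("Identification accuracy: %.2f%%" % (100 * accuracy_score(y[test_idx], pred)))
```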

Table 1 Recognition performances of the proposed script identification technique using seven well-known classifiers (best case is shaded in gray and styled in bold)

As observed from Table 1, the MLP classifier produces the highest identification accuracy of 97.69%. In the present work, a detailed error analysis of the MLP classifier is also carried out with respect to some well-known statistical parameters, namely Kappa statistics, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), True Positive Rate (TPR), False Positive Rate (FPR), Precision, Recall, F-measure, Matthews Correlation Coefficient (MCC) and Area Under ROC (AUC). The values of the Kappa statistic, MAE and RMSE of the MLP classifier for the present technique are found to be 0.9748, 0.0056 and 0.0557 respectively. Table 2 provides a statistical performance analysis of the remaining parameters for each of the aforementioned scripts.

Table 2 Statistical performance measures along with their respective means (shaded in gray and styled in bold) achieved by the proposed technique for twelve handwritten scripts

Though Table 2 shows impressive results, some misclassifications have been observed during the experimentation. The main reasons are: (a) the presence of speckle noise, (b) the existence of multi-skewed words in some text-lines, and (c) the occurrence of irregular spaces within text words, punctuation symbols, etc. in the text-line images. The structural resemblance among the character sets of some of the Matra-based scripts, such as Devanagari and Gurumukhi, and non-Matra-based scripts, such as Kannada and Telugu as well as Malayalam and Tamil, causes similarity in the contiguous pixel distribution, which in turn leads to misclassification among them. Figure 2 shows some samples of misclassified text-line images.

Fig. 2 Sample text-line images written in a Bangla, b Devanagari, c Gurumukhi, d Kannada, e Telugu, f Malayalam and g Tamil misclassified as Gujarati, Gurumukhi, Devanagari, Telugu, Kannada, Tamil and Malayalam scripts respectively

6 Conclusion and Future Work

We have proposed a robust method for handwritten script identification at text-line level covering all the official scripts of India. The main intention of this paper is to facilitate multilingual handwritten OCR and script-based retrieval of offline handwritten documents. A set of 84 features is extracted using the combination of NGTDM and GLRLM. The NGTDM captures information about spatial changes in intensity, obtained by examining the difference between the gray tone of each image pixel and the gray tones of its neighbors. The GLRLM, in turn, preserves much of the texture information in run-length matrices and thereby carries strong discriminatory information. Experimental results show that an accuracy of 97.69% is achieved using the MLP classifier, which is quite acceptable given the complexities and shape variations of the scripts under consideration. This work is the first of its kind considering the number of official scripts taken into account. Our future endeavor will be to extend this technique to perform script identification on handwritten document images containing a larger number of Indian languages. As the key features used in this technique are mainly texture based, the technique should also be applicable to recognizing non-Indic scripts in any multi-script environment. We will also focus on increasing the size of the text-line script database to incorporate larger variations of writing styles from writers of varied backgrounds, which, in turn, would make our technique writer-independent.