An approach to the script discrimination in the Slavic documents

Brodić, Darko; Milivojević, Zoran N.; Maluckov, Čedomir A.

doi:10.1007/s00500-014-1435-1

Methodologies and Application
Published: 28 August 2014

Volume 19, pages 2655–2665, (2015)
Soft Computing

Darko Brodić¹,
Zoran N. Milivojević² &
Čedomir A. Maluckov¹

Abstract

The paper deals with the problem of the script discrimination in old Slavic printed documents. Therefore, an algorithm for script classification and identification is proposed. It creates coded text from initial document. Then, the coded text is subjected to statistical analysis. As a result, the texture feature extraction is carried out. Obtained texture features are used as criteria for script classification and identification. The proposed method is tested on the samples of old Slavic printed documents written in Glagolitic, Cyrillic and Latin script.

1 Introduction

Optical character recognition (OCR) is a computer-based system that recognizes printed characters by scanning the original text document image. Basically, it consists of the following stages: (1) preprocessing, (2) feature extraction with classification and (3) post-processing (Ghosh et al. 2010).

Script recognition is the part of OCR that belongs to the feature extraction and classification stage, and it is a very important part of document image analysis (Ghosh et al. 2010). Different methods have been developed for the script recognition task. They are classified as global and local methods.

Global methods consider wider blocks in document images, which are subjected to statistical and frequency-domain analysis (Joshi et al. 2007). For these methods to be effective, the document image blocks have to be normalized. Furthermore, the image should be free of noise and of high quality (Busch et al. 2006).

Local methods separate small pieces of text as connected components, which contain characters, words or lines. Then, different features, for example the black pixel runs, are analyzed (Pal and Chaudhury 2002). Local methods are suitable for low-quality and noisy documents; however, they are computationally expensive.

Textures are the image features that can be described according to their spatial, frequency and perceptual properties (Del Bimbo 2001; Tolambiya et al. 2010). An effective representation of textures can be based on statistical and structural properties of brightness patterns (Yang and Purves 2004). Texture can be measured by taking into account the spatial arrangement of gray-level primitives (Haralick 1979). Hence, the major statistical method used in texture analysis is based on the definition of the joint probability distributions of pairs of pixels (Valkealahti and Oja 1998). Texture analysis can be very helpful in cases where image objects are characterized more by their texture than by the intensity (Zhang and Tan 2002; Bharati et al. 2004; Eleyan and Demirel 2011).

This paper proposes a script recognition module. The objects of research are Slavic documents, chosen because they can be written in three different scripts: Latin, Glagolitic and Cyrillic. Differentiating these scripts is a challenging task, and a new approach is introduced herein. Our approach unites local and global methods. First, it treats the individual characters in the text, in the manner of local methods: it maps each character into the corresponding script type according to its position in the text line. This way, the number of variables is significantly reduced. The result of this step is the coded text (Brodić et al. 2013). Second, the script type distribution of the coded text is analyzed, and four script features are extracted. Then, the coded text is subjected to textural analysis: the gray-level co-occurrence matrix (GLCM) is computed and used to extract the texture features (Haralick et al. 1973) needed for classification. Textural analysis is a typical step of global methods. Finally, a discrimination function is established from the combined feature set, giving the criteria for script discrimination and identification.

The paper is organized as follows. Section 2 addresses all aspects concerning the proposed algorithm; it includes the coding, script distribution and co-occurrence analysis, feature extraction and establishment of criteria for script discrimination. Section 3 defines the custom oriented database of documents written in different scripts. Then, it explains the experiment that evaluates the proposed algorithm. Section 4 gives the results of the experiment and discusses them. Section 5 concludes the paper.

2 Proposed algorithm

The proposed algorithm consists of the following stages: (1) coding, (2) statistical analysis, and (3) determination of criteria for script discrimination. However, it can be divided into many sub-stages. Figure 1 illustrates the structural diagram of the algorithm flow.

2.1 Coding

Each text line can be split into three vertical zones: (1) upper, (2) middle and (3) lower (Zramdini and Ingold 1998; Chaudhuri et al. 2002). The letters of a given script occupy different zones of the text line. Base letters (B), like the letter a, occupy the middle zone only; ascender letters (A), like the letter h, spread over the middle and upper zones; descender letters (D), like the letter g, occupy the middle and lower zones. A few letters, like the capital letter Lj (in the Serbian or Croatian Latin alphabet), span all three zones and are classified as full letters (F). Table 1 shows the script type classification.

Table 1 Script type classification

This way, the letters of the Serbian or Croatian Latin alphabet, the Serbian Cyrillic alphabet and the Croatian Glagolitic alphabet are mapped into the elements of the identification set $I$:

$$I=\{\hbox {B},\hbox {A},\hbox {D},\hbox {F}\}. \tag{1}$$

Furthermore, set $I$ is coded to set $C$ to effectively perform the statistical analysis (Brodić et al. 2013, 2014):

$$C=\{0,1,2,3\}, \tag{2}$$

where B $\rightarrow $ 0, A $\rightarrow $ 1, D $\rightarrow $ 2, and F $\rightarrow $ 3. Table 2 shows the Latin, Glagolitic and Cyrillic letters as well as their codes according to Table 1 (slightly adapted to the current Croatian or Serbian language) (Brodić et al. 2014).

Table 2 Coding of Slavic alphabets

The proposed algorithm replaces all letters from a certain script with the equivalent member of the set $C$ by coding as in Table 2. This way, an initial text is converted into the coded text.
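The coding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the zone sets `ASCENDERS` and `DESCENDERS` are hypothetical and cover basic lowercase Latin letters only, capitals are treated as ascenders, and digraphs such as Lj (the full type F) and letters with diacritics are not handled. The actual assignment used by the authors is the one in Table 2, which is not reproduced here.

```python
# Hypothetical zone sets for basic lowercase Latin letters.
# These are assumptions for illustration, not the paper's Table 2.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

# Script type to code mapping, following Eq. (2): B->0, A->1, D->2, F->3.
CODE = {"B": 0, "A": 1, "D": 2, "F": 3}


def script_type(ch):
    """Return 'B', 'A' or 'D' for a basic Latin letter, None otherwise."""
    if not (ch.isascii() and ch.isalpha()):
        return None  # spaces, punctuation and non-ASCII letters are skipped
    if ch.isupper() or ch in ASCENDERS:
        return "A"   # assumption: capitals occupy the middle and upper zones
    if ch in DESCENDERS:
        return "D"
    return "B"


def encode(text):
    """Convert a text into its coded form, a list of codes from the set C."""
    types = (script_type(ch) for ch in text)
    return [CODE[t] for t in types if t is not None]


print(encode("hag"))  # [1, 0, 2]
```

The example agrees with the letters named in the text: h is an ascender (1), a is a base letter (0) and g is a descender (2).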

2.2 Statistical analysis

Statistical analysis is divided into two parts: script type distribution analysis (Brodić et al. 2013) and co-occurrence analysis (Brodić et al. 2013, 2014). In both, the input is the coded text. Figure 2 illustrates the same text written in different Slavic scripts along with their coding.

2.2.1 Script type distribution

First, the script type distribution of coded text is analyzed. As a result, four script features are extracted. Table 3 shows these features, which are obtained from the same text written in different Slavic scripts (see Fig. 2 for reference).

Table 3 Comparison of the script type distributions between scripts

The script type distribution of Latin, Glagolitic and Cyrillic script is given in Fig. 3a–c, respectively.

Glagolitic script has the highest share of the base script type, followed by Cyrillic script, while Latin script has the smallest. Latin script has the highest share of the ascender script type, Glagolitic script has a slightly smaller one, and Cyrillic script has a considerably lower one. Cyrillic script has the highest share of the descender script type, whereas Glagolitic and Latin scripts have substantially lower shares. Latin and Cyrillic scripts have similar shares of the full script type, while Glagolitic script has few or no full letters.
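A minimal sketch of how the four script type features could be computed from a coded text is given below. It assumes that the features are relative frequencies of the codes 0 to 3; the paper's tables may report them in another form (for example counts or percentages), and the function name `script_type_distribution` is illustrative.

```python
from collections import Counter


def script_type_distribution(coded):
    """Return the relative frequency of codes 0..3 (B, A, D, F) in a coded text."""
    if not coded:
        raise ValueError("coded text is empty")
    counts = Counter(coded)
    total = len(coded)
    # One value per script type, in the order B, A, D, F.
    return [counts.get(code, 0) / total for code in range(4)]


# Toy coded text: two base letters, one ascender, one descender, no full letter.
print(script_type_distribution([0, 0, 1, 2]))  # [0.5, 0.25, 0.25, 0.0]
```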

2.2.2 Co-occurrence analysis

Next, the coded text is subjected to co-occurrence analysis (Haralick et al. 1973; Clausi 2002) to extract the texture features. This approach generates texture features of an image according to calculated co-occurrence probabilities. These probabilities represent the conditional joint probabilities of all pair-wise combinations of gray levels in the spatial window of interest (WOI). The WOI is determined by the inter-pixel distance ($d$) and orientation ($\theta $) (Haralick et al. 1973; Clausi 2002). Figure 4 shows an illustration of the WOI.

The following parameters are considered to describe the image with GLCM: (1) the number of gray levels, (2) the orientation angle ($\theta $) and (3) the length of displacement ($d$). In our case, the codes are considered as the gray levels.

The method starts in the top left corner and counts the occurrences of each reference pixel to neighbor pixel relationship. This way, each element ($i$, $j$) of the GLCM gives the number of times the gray levels $i$ and $j$ appear as a sequence of two pixels located at a defined distance $d$ along a chosen direction $\theta $, i.e., a reference pixel with value $i$ and its neighbor with value $j$. The GLCM for an image $I$ with $M$ rows and $N$ columns is parameterized by the offset ($\Delta x$, $\Delta y$) as (Eleyan and Demirel 2011):

$$P(i,j)=\sum \limits _{x=1}^M \,\sum \limits _{y=1}^N \begin{cases} 1, & \hbox {if}\;I(x,y)=i\;\hbox {and}\;I(x+\Delta x,y+\Delta y)=j,\\ 0, & \hbox {otherwise.} \end{cases} \tag{3}$$

The offset ($\Delta x$, $\Delta y$) encodes the pixel displacement $d$ and the orientation $\theta $ at which the GLCM is calculated. In our case, the input is the coded text, given as a 1D image. Accordingly, the feasible values of the parameters are narrowed to $d = 1$ and $\theta = 0^{\circ }$. The number of gray levels $G$ of the coded text is 4 (codes from 0 to 3) (Brodić et al. 2013, 2014).

The normalized probability version of the GLCM is given as:

$$C(i,j)=P(i,j)/\sum \limits _{i,j}^G {P(i,j)}. \tag{4}$$

To characterize different scripts, the same text written with different scripts (see Fig. 2 for reference) is subjected to the co-occurrence analysis. Figure 5 shows the normalized probability GLCM for each script.
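For a 1D coded text with $d = 1$ and $\theta = 0^{\circ }$, Eqs. (3) and (4) reduce to counting ordered pairs of adjacent codes and dividing by the total number of pairs. The following sketch illustrates this; the function names `glcm_1d` and `normalize` are illustrative, and the matrix is kept non-symmetric because Eq. (3) counts ordered pairs.

```python
def glcm_1d(coded, levels=4):
    """Count ordered pairs of adjacent codes, as in Eq. (3) with d = 1, theta = 0."""
    P = [[0] * levels for _ in range(levels)]
    for ref, neighbor in zip(coded, coded[1:]):
        P[ref][neighbor] += 1  # reference code is the row, its right neighbor the column
    return P


def normalize(P):
    """Divide every element by the sum of all elements, as in Eq. (4)."""
    total = sum(sum(row) for row in P)
    if total == 0:
        raise ValueError("GLCM is empty; the coded text needs at least two codes")
    return [[value / total for value in row] for row in P]


# Toy coded text with the adjacent pairs (0,1), (1,0), (0,1).
P = glcm_1d([0, 1, 0, 1])
C = normalize(P)
print(P[0][1], P[1][0])  # 2 1
```

The elements of the normalized matrix sum to 1, so they can be read as co-occurrence probabilities of adjacent script types.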

Furthermore, a number of texture features can be extracted from the GLCM (Haralick et al. 1973; Clausi 2002). Unlike Brodić et al. (2013, 2014), eight of the fourteen texture features proposed in Haralick et al. (1973) are used. Their definitions are given in Table 4.

Table 4 GLCM texture feature definition

Table 5 shows the eight GLCM texture feature measures obtained from the same text written in different Slavic scripts (see Fig. 2 for reference).

Table 5 GLCM texture feature measures
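As an illustration, five of the texture features named in this paper (energy, maximum, contrast, dissimilarity and correlation) can be computed from a normalized GLCM as sketched below. The formulas are the commonly used textbook definitions and are an assumption here; the authors' exact definitions are those of Table 4, which may differ in detail (for example, energy is sometimes defined with a square root). Entropy, homogeneity and inverse difference moment are left out because their definitions vary more between sources.

```python
import math


def texture_features(C):
    """Compute five common texture features from a normalized GLCM (list of lists)."""
    G = len(C)
    cells = [(i, j, C[i][j]) for i in range(G) for j in range(G)]

    energy = sum(p * p for _, _, p in cells)                    # sum of squared probabilities
    maximum = max(p for _, _, p in cells)                       # largest probability
    contrast = sum(p * (i - j) ** 2 for i, j, p in cells)
    dissimilarity = sum(p * abs(i - j) for i, j, p in cells)

    # Means and variances of the row (reference) and column (neighbor) indices.
    mu_i = sum(i * p for i, _, p in cells)
    mu_j = sum(j * p for _, j, p in cells)
    var_i = sum((i - mu_i) ** 2 * p for i, _, p in cells)
    var_j = sum((j - mu_j) ** 2 * p for _, j, p in cells)
    denom = math.sqrt(var_i * var_j)
    covariance = sum((i - mu_i) * (j - mu_j) * p for i, j, p in cells)
    # Correlation is undefined when one index has zero variance.
    correlation = covariance / denom if denom > 0 else float("nan")

    return {
        "energy": energy,
        "maximum": maximum,
        "contrast": contrast,
        "dissimilarity": dissimilarity,
        "correlation": correlation,
    }


# Toy 2x2 matrix in which each code is always followed by itself.
print(texture_features([[0.5, 0.0], [0.0, 0.5]]))
```

For the toy matrix all probability mass lies on the diagonal, so contrast and dissimilarity are 0 and the correlation is 1.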

2.3 Criteria for the script discrimination

Each script is characterized by its own set of specific features (mainly typographical). The statistical analysis of the coded text is used to extract them. It starts with the script type distribution analysis, followed by the co-occurrence analysis. Compared to Brodić et al. (2013, 2014), the statistical analysis is extended with a larger set of extracted texture features. Furthermore, it is used not only for script characterization (the same document written in different scripts) but also for script identification (different documents written in different scripts). As an extension of the previous method (Brodić et al. 2014), the enlarged feature vector consists of four script type distribution measures and eight GLCM texture measures. Compared to the previous approaches (Brodić et al. 2013, 2014), this larger feature set gives a more reliable basis for establishing script discrimination criteria by threshold-based decision making. Accordingly, the statistical analysis shows a clear difference between scripts.

3 Experiments

The experiment is designed to evaluate the quality of the proposed algorithm. A custom-oriented database of documents, similar to those given in http://www.croatianhistory.net/etf/juraj_slovinac_misli.html and http://www.croatianhistory.net/etf/badurina_parcic.html, is created. It comprises a “training” and a “test” set (Silva and Ribeiro 2007). The training set consists of 130 documents in total, with at least 40 documents written in each script. The typical text length is approximately 300 to 3,000 characters. The test set consists of 10 documents written in each script, i.e., 30 documents in total. The typical text length is approximately 500 to 4,000 characters. Texts are extracted from the book “Le château de virginité” (“The Castle of Virginity”) written in 1411 by George d’Esclavonie (Juraj Slovinac) (http://www.croatianhistory.net/etf/juraj_slovinac_misli.html). He was a Croatian Glagolitic priest and professor at the Sorbonne in Paris around 1400. Figure 6 illustrates sample documents from the database written in different Slavic scripts.

4 Results and discussion

4.1 Results of the script type distribution

The script type distributions are used to extract four script features, which are used to characterize different scripts. To quantify the obtained results, we used the minimum and maximum values. Tables 6 and 7 show the distributions obtained from the training and test sets.

Table 6 Comparison of the script type distributions between scripts (training set)

Table 7 Comparison of the script type distributions between scripts (test set)

Figure 7 shows the script type distributions for the training set (panels a, c, e, g) and the test set (panels b, d, f, h).

From the training set, we can establish the script discrimination relation:

Test set confirms the previously established script discrimination relation.

4.2 GLCM feature results

The extended set of eight GLCM texture features (compared to Brodić et al. 2013, 2014) is used as a basis to discriminate the scripts. To quantify the obtained results, we used the minimum and maximum values. The texture features obtained from the statistical analysis of database texts written in Latin, Glagolitic and Cyrillic script are shown in Tables 8 and 9 for the training and test sets, respectively.

Table 8 GLCM texture feature measures (training set)

Table 9 GLCM texture feature measures (test set)

It should be noted that the values of entropy, inverse difference moment and homogeneity are quite similar among scripts. Consequently, these features are discarded from further discussion, and the remaining five texture features are of interest. Figure 8 shows the minimum and maximum values of the energy (Latin, Cyrillic and Glagolitic script).

The energy value of 0.3 differentiates Latin from the other two scripts. Figure 9 shows the minimum and maximum values of the GLCM maximum (Latin, Cyrillic and Glagolitic script).

Similarly, Latin can be separated from the other two scripts by the maximum value of 0.45. Figure 10 shows the minimum and maximum values of the dissimilarity (Latin, Cyrillic and Glagolitic script).

Next, Cyrillic can be separated from the other scripts by a dissimilarity value of 0.77 (test set only). Figure 11 shows the minimum and maximum values of the contrast (Latin, Cyrillic and Glagolitic script).

Glagolitic can be distinguished from the other scripts by a contrast value of 1.7 (test set only). Figure 12 shows the minimum and maximum values of the correlation (Latin, Cyrillic and Glagolitic script).

Latin can be differentiated from the other scripts by a correlation threshold of $-$0.15.
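The way such thresholds follow from the minimum and maximum values can be sketched as follows. This is a hedged illustration, not the authors' procedure: the helper `separating_threshold` is hypothetical, the choice of the midpoint of the gap is an assumption, and the numbers in the usage example are arbitrary toy values rather than measurements from the database.

```python
def separating_threshold(target_values, other_values):
    """Return a threshold between two groups of feature values, or None.

    A threshold exists only if the [min, max] range of the target script
    does not overlap the pooled range of the other scripts. The midpoint
    of the gap is returned (an assumption made for this sketch).
    """
    t_min, t_max = min(target_values), max(target_values)
    o_min, o_max = min(other_values), max(other_values)
    if t_max < o_min:   # target lies entirely below the others
        return (t_max + o_min) / 2
    if o_max < t_min:   # target lies entirely above the others
        return (o_max + t_min) / 2
    return None         # ranges overlap: this feature does not separate the target


# Arbitrary toy values for one feature, not taken from the paper.
print(separating_threshold([0.10, 0.20], [0.40, 0.50]))
print(separating_threshold([0.10, 0.45], [0.40, 0.50]))  # None
```

Because the values of the other scripts are pooled, a target script whose range lies between the ranges of the other two is reported as not separable by a single threshold.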

Taking into account all aforementioned features, i.e., energy, maximum, correlation, dissimilarity and contrast, we can establish the discrimination criteria that can be used for script recognition in the Slavic documents (test set only). The criteria are given by the following pseudo-code:

Although the aforementioned features show significant variation among scripts, the criteria above are valid for the test set only. To establish generalized criteria for script discrimination, broader information should be used, i.e., that obtained from the training set. The results from the script type distribution have to be included as well. The extended criteria can be expressed by the following pseudo-code:

The above criteria can be used to effectively discriminate a given script, i.e., Latin, Cyrillic and/or Glagolitic. The presented concept recognizes the scripts in the documents without errors. However, it should be noted that the concept was established under ideal conditions. To prove its validity in real circumstances, its effectiveness should be evaluated by incorporating it into an OCR system.

5 Conclusions

This manuscript proposed an algorithm for script characterization and identification in Slavic documents written in Latin, Glagolitic and Cyrillic scripts. The algorithm relies on the statistical analysis of the coded text, which is obtained by coding the text of a document according to the position of each letter in the text line. The statistical analysis consists of the script type distribution analysis and the co-occurrence analysis of the coded text. As a result, four script type features and eight GLCM texture features are obtained. Due to the differences in script characteristics, the results of the statistical analysis show significant diversity among scripts. This is the key point for the decision-making process of script identification. The proposed method is tested on documents from a custom-oriented database. The experiments gave encouraging results.

References

Bharati MH, Liu JJ, MacGregor JF (2004) Image texture analysis: methods and comparisons. Chemom Intell Lab Systems 72(1):57–71
Brodić D, Milivojević ZN, Maluckov Č (2013) Recognition of the script in Serbian documents using frequency occurrence and co-occurrence analysis. Sci World J 2013(896328):1–14
Brodić D, Milivojević Z, Maluckov ČA (2014) Script characterization in the old Slavic documents. In: Elmoataz A, Lezoray O, Nouboud F, Mammass D (eds) Image and Signal Processing, LNCS 8509, pp 230–238. Springer, Berlin
Busch A, Boles WW, Sridharan S (2006) Texture for script identification. IEEE Trans Pattern Anal Mach Intell 27(11):1720–1732
Chaudhuri BB, Pal U, Mitra M (2002) Automatic recognition of printed Oriya script. Sadhana 27(1):23–34
Clausi DA (2002) An analysis of co-occurrence texture statistics as a function of grey level quantization. Can J Remote Sens 28(1):45–62
Del Bimbo A (2001) Visual information retrieval. Morgan Kaufmann Publishers Inc, San Francisco
Eleyan A, Demirel H (2011) Co-occurrence matrix and its statistical features as a new approach for face recognition. Turkish J Electrical Eng Comput Sci 19(1):98–107
Ghosh D, Dube T, Shivaprasad AP (2010) Script recognition—a review. IEEE Trans Pattern Anal Mach Intell 32(12):2142–2161
Haralick R, Shanmugam K, Dinstein I (1973) Textural features for image classification. IEEE Trans Systems Man Cybern 3(6):610–621
Haralick RM (1979) Statistical and structural approaches to texture. Proc IEEE 67(5):786–804
Joshi GD, Garg S, Sivaswamy J (2007) A generalised framework for script identification. Int J Document Anal Recogn (IJDAR) 10(2):55–68
Pal U, Chaudhury BB (2002) Identification of different script lines from multi-script documents. Image Vis Comput 20(13–14):945–954
Silva C, Ribeiro B (2007) On text-based mining with active learning and background knowledge using SVM. Soft Comput 11(6):519–530
Tolambiya A, Venkatraman S, Kalra PK (2010) Content-based image classification with wavelet relevance vector machines. Soft Comput 14(2):129–136
Valkealahti K, Oja E (1998) Reduced multidimensional co-occurrence histograms in texture classification. IEEE Trans Pattern Anal Mach Intell 20(1):90–94
Yang Z, Purves D (2004) The statistical structure of natural light patterns determines perceived light intensity. In: Proceedings of the National Academy of Sciences of the United States of America 101(23):8745–8750
Zhang J, Tan T (2002) Brief review of invariant texture analysis methods. Pattern Recogn 35(3):735–747
Zramdini AW, Ingold R (1998) Optical font recognition using typographical features. IEEE Trans Pattern Anal Mach Intell 20(8):877–882

Acknowledgments

This work was partially supported by a grant of the Ministry of Education, Science and Technological Development of the Republic of Serbia, as a part of the projects TR33037 and III43011.

Author information

Authors and Affiliations

Technical Faculty in Bor, V.J. 12, University of Belgrade, 19210 Bor, Serbia
Darko Brodić & Čedomir A. Maluckov
College of Applied Technical Sciences, Aleksandra Medvedeva 20, 18000 Niš, Serbia
Zoran N. Milivojević

Corresponding author

Correspondence to Darko Brodić.

Additional information

Communicated by V. Loia.

Brodić, D., Milivojević, Z.N. & Maluckov, Č.A. An approach to the script discrimination in the Slavic documents. Soft Comput 19, 2655–2665 (2015). https://doi.org/10.1007/s00500-014-1435-1

Issue Date: September 2015
