1 Introduction

Optical character recognition (OCR) is a computer-based system that recognizes printed characters by scanning the original text document image. Basically, it consists of the following stages: (1) preprocessing, (2) feature extraction with classification and (3) post-processing (Ghosh et al. 2010).

Script recognition represents the part of OCR included in the feature extraction with classification stage, a very important part of document image analysis (Ghosh et al. 2010). Different methods have been developed for the script recognition task. They are classified as global and local methods.

Global methods consider larger blocks of the document image, which are subjected to statistical and frequency-domain analysis (Joshi et al. 2007). To improve the effectiveness of these methods, the document image blocks have to be normalized. Furthermore, the image should be noise-free and of high quality (Busch et al. 2006).

Local methods separate small pieces of text as connected components, which contain characters, words or lines. Different features, such as black pixel runs, are then analyzed (Pal and Chaudhury 2002). Local methods are suitable for low-quality and noisy documents; however, they are computationally expensive.

Textures are the image features that can be described according to their spatial, frequency and perceptual properties (Del Bimbo 2001; Tolambiya et al. 2010). An effective representation of textures can be based on statistical and structural properties of brightness patterns (Yang and Purves 2004). Texture can be measured by taking into account the spatial arrangement of gray-level primitives (Haralick 1979). Hence, the major statistical method used in texture analysis is based on the definition of the joint probability distributions of pairs of pixels (Valkealahti and Oja 1998). Texture analysis can be very helpful in cases where image objects are characterized more by their texture than by the intensity (Zhang and Tan 2002; Bharati et al. 2004; Eleyan and Demirel 2011).

This paper proposes a script recognition module. The objects of research are Slavic documents. These documents were chosen because they can be written in three different scripts: Latin, Glagolitic and Cyrillic. Consequently, their differentiation is a challenging task, and a new approach is introduced herein. Our approach unites local and global methods. First, it treats the individual characters in the text, in the manner of local methods: it maps each character to a script type according to its position in the text line. This way, the number of variables is significantly reduced. The result of this step is the coded text (Brodić et al. 2013). Second, the script type distribution of the coded text is analyzed, and four script features are extracted. Then, the coded text is subjected to textural analysis: the gray-level co-occurrence matrix (GLCM) is computed and used to extract the texture features (Haralick et al. 1973) needed for classification. Textural analysis is a typical step of global methods. Finally, a discrimination function is established based on the comprehensive classification of the texture features, representing the criteria for script discrimination and identification.

The paper is organized as follows. Section 2 addresses all aspects of the proposed algorithm: the coding, the script type distribution and co-occurrence analysis, the feature extraction and the establishment of criteria for script discrimination. Section 3 describes the custom-oriented database of documents written in different scripts and explains the experiment that evaluates the proposed algorithm. Section 4 gives the results of the experiment and discusses them. Section 5 concludes the paper.

2 Proposed algorithm

The proposed algorithm consists of the following stages: (1) coding, (2) statistical analysis, and (3) determination of criteria for script discrimination. Each stage can be further divided into sub-stages. Figure 1 illustrates the structural diagram of the algorithm flow.

Fig. 1 Structural diagram of the algorithm flow

2.1 Coding

Each text line can be split into three vertical zones: (1) upper, (2) middle and (3) lower (Zramdini and Ingold 1998; Chaudhuri et al. 2002). The letters of a given script occupy different positions in the text line. Base letters (B), like the letter a, occupy the middle zone only; ascender letters (A), like the letter h, spread over the middle and upper zones; and descendent letters (D), like the letter g, spread over the middle and lower zones. A few letters, like the capital letter Lj (in the Serbian or Croatian Latin alphabet), span all three zones. They are classified as full letters (F). Table 1 shows the script type classification.

Table 1 Script type classification

This way, the letters of the Serbian or Croatian Latin alphabet, the Serbian Cyrillic alphabet and the Croatian Glagolitic alphabet are mapped into the elements of the identification set \(I\):

$$\begin{aligned} I=\{\hbox {B},\hbox {A},\hbox {D},\hbox {F}\}. \end{aligned}$$
(1)

Furthermore, set \(I\) is coded to set \(C\) to effectively perform the statistical analysis (Brodić et al. 2013, 2014):

$$\begin{aligned} C=\{0,1,2,3\}, \end{aligned}$$
(2)

where B \(\rightarrow \) 0, A \(\rightarrow \) 1, D \(\rightarrow \) 2, and F \(\rightarrow \) 3. Table 2 shows the Latin, Glagolitic and Cyrillic letters together with their codes according to Table 1 (slightly adapted to the current Croatian or Serbian language) (Brodić et al. 2014).

Table 2 Coding of Slavic alphabets

The proposed algorithm replaces all letters from a certain script with the equivalent member of the set \(C\) by coding as in Table 2. This way, an initial text is converted into the coded text.
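The coding step can be illustrated with a minimal Python sketch. The class-to-code mapping (B, A, D, F to 0, 1, 2, 3) is the one stated above. The letter table `CLASS_OF_LETTER` is an assumption made for illustration only: it covers a handful of lowercase Latin letters, of which only a, h and g are classified in the text, and it is not a substitute for Table 2. The names `CODE_OF_CLASS`, `CLASS_OF_LETTER` and `encode` are hypothetical.

```python
# Script type class -> numeric code, as stated in the text.
CODE_OF_CLASS = {"B": 0, "A": 1, "D": 2, "F": 3}

# ASSUMPTION: illustrative, partial table for lowercase Latin letters.
# Only a (base), h (ascender) and g (descendent) come from the text;
# the remaining entries follow ordinary typographic zones.
CLASS_OF_LETTER = {
    "a": "B", "e": "B", "o": "B",
    "h": "A", "k": "A",
    "g": "D", "p": "D",
}


def encode(text):
    """Return the coded text as a list of integer codes.

    Symbols that are not in the letter table (spaces, punctuation,
    letters missing from this partial table) are skipped.
    """
    return [CODE_OF_CLASS[CLASS_OF_LETTER[ch]] for ch in text if ch in CLASS_OF_LETTER]


print(encode("hag"))
```

Digraphs such as Lj and the Glagolitic and Cyrillic alphabets would need their own table entries; they are left out here to keep the sketch short.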

2.2 Statistical analysis

Statistical analysis is divided into two parts: script type distribution analysis (Brodić et al. 2013) and co-occurrence analysis (Brodić et al. 2013, 2014). In both parts, the input is the coded text. Figure 2 illustrates the same text written in different Slavic scripts along with their coding.

Fig. 2 Same text given in different scripts: a original text in Latin script, and b its coded counterpart; c original text in Glagolitic script, and d its coded counterpart; e original text in Cyrillic script, and f its coded counterpart

2.2.1 Script type distribution

First, the script type distribution of coded text is analyzed. As a result, four script features are extracted. Table 3 shows these features, which are obtained from the same text written in different Slavic scripts (see Fig. 2 for reference).

Table 3 Comparison of the script type distributions between scripts

The script type distribution of Latin, Glagolitic and Cyrillic script is given in Fig. 3a–c, respectively.

Fig. 3 Script type distribution: a Latin script, b Glagolitic script, and c Cyrillic script

Glagolitic script has the highest distribution of the base script type, followed by Cyrillic script, while Latin script has the smallest. Latin script has the highest distribution of the ascending script type, Glagolitic script has a slightly smaller one, and Cyrillic script has a considerably lower one. Cyrillic script has the highest distribution of the descending script type, while Glagolitic and Latin scripts have a substantially lower one. Latin and Cyrillic scripts have similar distributions of the full script type, while Glagolitic script has a weak or even no distribution of the full script type.
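As a rough illustration, the four script type features can be computed from the coded text as below. The sketch assumes that each feature is the relative frequency of one code (0 to 3) in the coded text; whether Table 3 reports counts, fractions or percentages is not assumed here. The function name `script_type_distribution` is hypothetical.

```python
from collections import Counter


def script_type_distribution(codes, levels=4):
    """Return the relative frequency of each code 0..levels-1.

    ASSUMPTION: the script type features are relative frequencies
    of the base, ascender, descendent and full codes.
    """
    if not codes:
        raise ValueError("coded text is empty")
    counts = Counter(codes)  # missing codes count as 0
    return [counts[c] / len(codes) for c in range(levels)]


# Order of the result: [base, ascender, descendent, full]
print(script_type_distribution([0, 0, 1, 2]))
```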

2.2.2 Co-occurrence analysis

Next, the coded text is subjected to co-occurrence analysis (Haralick et al. 1973; Clausi 2002) to extract the texture features. This approach generates the texture features of an image from the calculated co-occurrence probabilities. These probabilities represent the conditional joint probabilities of all pair-wise combinations of gray levels in the spatial window of interest (WOI). The WOI is determined by the inter-pixel distance (\(d\)) and orientation (\(\theta \)) (Haralick et al. 1973; Clausi 2002). Figure 4 shows an illustration of the WOI.

Fig. 4 WOI for the calculation of texture features, considering \(d\) = 1 and different directions

The following parameters are considered to describe the image with GLCM: (1) the number of gray levels, (2) the orientation angle (\(\theta \)) and (3) the length of displacement (\(d\)). In our case, the codes are considered as the gray levels.

The method starts in the top left corner and counts the occurrences of each reference-pixel-to-neighbor-pixel relationship. This way, each element (\(i\), \(j\)) of the GLCM is the number of times a pixel with the value \(j\) is located at distance \(d\) and angle \(\theta \) from a pixel with the value \(i\). In other words, at the end of this process the element (\(i\), \(j\)) counts how many times the gray levels \(i\) and \(j\) appear as a sequence of two pixels located at the defined distance \(d\) along the chosen direction \(\theta \). The GLCM for an image I with \(M\) rows and \(N\) columns is parameterized by the offset (\(\Delta x\), \(\Delta y\)) as (Eleyan and Demirel 2011):

$$\begin{aligned} P(i,j)=\sum \limits _{x=1}^M \,\sum \limits _{y=1}^N \left\{ \begin{array}{ll} 1, &amp; \hbox {if}\;I(x,y)=i\;\hbox {and}\;I(x+\Delta x,y+\Delta y)=j\\ 0, &amp; \hbox {otherwise} \end{array}\right. \end{aligned}$$
(3)

The offset (\(\Delta x\), \(\Delta y\)) encodes the pixel displacement \(d\) and the orientation \(\theta \) at which the GLCM is calculated. In our case, the input is the coded text, treated as a 1D image. Accordingly, the feasible values of the parameters are narrowed to \(d =\) 1 and \(\theta = 0^{\circ }\), and the number of gray levels \(G\) of the coded text is 4 (codes 0 to 3) (Brodić et al. 2013, 2014).

The normalized probability version of the GLCM is given as:

$$\begin{aligned} C(i,j)=P(i,j)/\sum \limits _{i,j}^G {P(i,j)}. \end{aligned}$$
(4)
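For a 1D coded text, Eqs. (3) and (4) reduce to counting ordered pairs of codes that are \(d\) positions apart and dividing by the total number of pairs. The following Python sketch shows this under the stated setting (\(d\) = 1, \(\theta = 0^{\circ }\), four gray levels); the function names `glcm_1d` and `normalize` are hypothetical, and the matrix is kept non-symmetric, as Eq. (3) is written.

```python
def glcm_1d(codes, levels=4, d=1):
    """Count co-occurrences P(i, j) for a 1D sequence of codes.

    P[i][j] is the number of positions k with codes[k] == i
    and codes[k + d] == j (reference code i, neighbor code j).
    """
    P = [[0] * levels for _ in range(levels)]
    for ref, neighbor in zip(codes, codes[d:]):
        P[ref][neighbor] += 1
    return P


def normalize(P):
    """Divide every entry by the total count, as in Eq. (4)."""
    total = sum(sum(row) for row in P)
    if total == 0:
        raise ValueError("GLCM is empty; the coded text is too short")
    return [[value / total for value in row] for row in P]


P = glcm_1d([0, 1, 0, 0])   # ordered pairs: (0,1), (1,0), (0,0)
C = normalize(P)
print(P[0][1], P[1][0], P[0][0])
```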

To characterize different scripts, the same text written with different scripts (see Fig. 2 for reference) is subjected to the co-occurrence analysis. Figure 5 shows the normalized probability GLCM for each script.

Fig. 5 GLCM for the coded text from Fig. 2: a Latin, b Glagolitic, and c Cyrillic

Furthermore, a number of texture features can be extracted from the GLCM (Haralick et al. 1973; Clausi 2002). Unlike Brodić et al. (2013, 2014), eight of the fourteen texture features proposed in Haralick et al. (1973) are used. Their definitions are given in Table 4.

Table 4 GLCM texture feature definition

Table 5 shows typical values of the eight GLCM texture feature measures obtained from the same text written in different Slavic scripts (see Fig. 2 for reference).

Table 5 GLCM texture feature measures
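The eight texture measures can be computed from the normalized GLCM \(C(i,j)\) along the lines of the sketch below. It uses commonly cited definitions of these features; this is an assumption, since definitions vary between authors (in particular for energy, homogeneity and inverse difference moment), so the exact formulas of Table 4 take precedence. The function name `glcm_features` is hypothetical.

```python
import math


def glcm_features(C):
    """Compute eight texture measures from a normalized GLCM C.

    ASSUMPTION: common textbook definitions are used; they may
    differ in detail from the definitions in Table 4.
    """
    n = len(C)
    cells = [(i, j, C[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * p for i, j, p in cells)
    mu_j = sum(j * p for i, j, p in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * p for i, j, p in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * p for i, j, p in cells))
    cov = sum((i - mu_i) * (j - mu_j) * p for i, j, p in cells)
    return {
        "energy": sum(p * p for _, _, p in cells),
        "entropy": -sum(p * math.log(p) for _, _, p in cells if p > 0),
        "maximum": max(p for _, _, p in cells),
        "dissimilarity": sum(abs(i - j) * p for i, j, p in cells),
        "contrast": sum((i - j) ** 2 * p for i, j, p in cells),
        "inverse_difference_moment": sum(p / (1 + (i - j) ** 2) for i, j, p in cells),
        "homogeneity": sum(p / (1 + abs(i - j)) for i, j, p in cells),
        # Correlation is undefined when one marginal has zero variance.
        "correlation": cov / (sd_i * sd_j) if sd_i > 0 and sd_j > 0 else float("nan"),
    }


# Toy 2x2 matrix in which the two levels strictly alternate.
features = glcm_features([[0.0, 0.5], [0.5, 0.0]])
print(features["contrast"], features["correlation"])
```

The function accepts a square matrix of any size, so the 4 x 4 matrix obtained from a coded text can be passed in directly.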

2.3 Criteria for the script discrimination

Each script is characterized by its own set of specific features (mainly typographical). The statistical analysis of the coded text is used to extract them. It starts with the script type distribution analysis, followed by the co-occurrence analysis. Compared to the analysis given in Brodić et al. (2013, 2014), it is extended with a larger set of extracted texture features. Furthermore, it is used not only for script characterization (the same document written in different scripts), but also for script identification (different documents written in different scripts). As an extension of the previous method (Brodić et al. 2014), an enlarged feature vector consisting of four script type distribution measures and eight GLCM texture measures is used. Compared to the previous approaches (Brodić et al. 2013, 2014), this makes the criteria for script discrimination, which are based on threshold decision making, more reliable. Accordingly, the statistical analysis shows a clear difference between the scripts.

3 Experiments

The experiment is designed to evaluate the quality of the proposed algorithm. A custom-oriented database of documents similar to those given in http://www.croatianhistory.net/etf/juraj_slovinac_misli.html and http://www.croatianhistory.net/etf/badurina_parcic.html is created. It comprises a “training” and a “test” set (Silva and Ribeiro 2007). The training set consists of 130 documents in total, including at least 40 documents written in each script. The typical text length is from approx. 300 to 3,000 characters. The test set consists of 10 documents written in each script, i.e., 30 documents in total. The typical text length is from approx. 500 to 4,000 characters. The texts are extracted from the book “Le château de virginité” (“The Castle of Virginity”) written in 1411 by George d’Esclavonie (Juraj Slovinac) (http://www.croatianhistory.net/etf/juraj_slovinac_misli.html). He was a Croatian Glagolitic priest and professor at the Sorbonne in Paris around 1400. Figure 6 illustrates sample documents from the database written in different Slavic scripts.

Fig. 6 Sample documents from database: a Latin, b Glagolitic, and c Cyrillic

4 Results and discussion

4.1 Results of the script type distribution

The script type distributions are used to extract four script features, which characterize the different scripts. To quantify the obtained results, we used the minimum and maximum values. Tables 6 and 7 show the distributions obtained from the training and test sets, respectively.

Table 6 Comparison of the script type distributions between scripts (training set)
Table 7 Comparison of the script type distributions between scripts (test set)

Figure 7 shows the script type distributions for the training set (a, c, e, g) and the test set (b, d, f, h).

Fig. 7 Script type distributions: a base for training set, b base for test set, c ascender for training set, d ascender for test set, e descendent for training set, f descendent for test set, g full for training set, h full for test set

From the training set, we can establish the script discrimination relation:

figure e

The test set confirms the previously established script discrimination relation.

4.2 GLCM feature results

The extended set of eight GLCM texture features (compared to Brodić et al. 2013, 2014) is used as a basis to discriminate the different scripts. To quantify the obtained results, we have used the minimum and maximum values. The texture features obtained from the statistical analysis of the database texts written in Latin, Glagolitic and Cyrillic script are shown in Tables 8 and 9 for the training and test sets, respectively.

Table 8 GLCM texture feature measures (training set)
Table 9 GLCM texture feature measures (test set)

It should be noted that the values of entropy, inverse difference moment and homogeneity are quite similar among the scripts. Consequently, these features are discarded from further discussion, and the remaining five texture features are of interest in the further analysis. Figure 8 shows the minimum and maximum values of the energy (Latin, Cyrillic and Glagolitic script).

Fig. 8 The energy of Latin, Cyrillic and Glagolitic script: a training set, b test set

The energy value of 0.3 differentiates Latin from the other two scripts. Figure 9 shows the minimum and maximum values of the GLCM maximum (Latin, Cyrillic and Glagolitic script).

Fig. 9 The maximum of Latin, Cyrillic and Glagolitic script: a training set, b test set

Similarly, Latin can be separated from the other two scripts by a maximum value of 0.45. Figure 10 shows the minimum and maximum values of the dissimilarity (Latin, Cyrillic and Glagolitic script).

Fig. 10 The dissimilarity of Latin, Cyrillic and Glagolitic script: a training set, b test set

Here, Cyrillic can be separated from the other scripts by a dissimilarity value of 0.77 (test set only). Figure 11 shows the minimum and maximum values of the contrast of Latin, Cyrillic and Glagolitic script.

Fig. 11 The contrast of Latin, Cyrillic and Glagolitic script: a training set, b test set

Glagolitic can be distinguished from the other scripts by a contrast value of 1.7 (test set only). Figure 12 shows the minimum and maximum values of the correlation (Latin, Cyrillic and Glagolitic script).

Fig. 12 The correlation of Latin, Cyrillic and Glagolitic script: a training set, b test set

Latin can be differentiated from the other scripts by setting the correlation value to \(-\)0.15.
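Thresholds of this kind are read off the minimum and maximum values of a feature for each script. As a hedged sketch of that reasoning, the function below checks whether the range of one script is disjoint from the ranges of all other scripts and, if so, returns a threshold in the gap. Taking the midpoint of the gap is a choice made here, not something stated in the paper; the function name `separating_threshold`, the script labels and the numbers in the example are made up and are not taken from Tables 8 and 9.

```python
def separating_threshold(ranges, target):
    """Return a threshold separating `target` from all other scripts.

    ranges maps a script name to the (minimum, maximum) of one feature.
    Returns the midpoint of the gap if the target range lies entirely
    below or entirely above every other range, otherwise None.
    """
    others = [r for name, r in ranges.items() if name != target]
    if not others:
        raise ValueError("at least two scripts are needed")
    low, high = ranges[target]
    if all(o_max < low for _, o_max in others):    # target lies above the rest
        return (max(o_max for _, o_max in others) + low) / 2
    if all(o_min > high for o_min, _ in others):   # target lies below the rest
        return (high + min(o_min for o_min, _ in others)) / 2
    return None                                    # ranges overlap


# HYPOTHETICAL ranges, for illustration only.
example = {"script_1": (1.0, 2.0), "script_2": (3.0, 5.0), "script_3": (4.0, 6.0)}
print(separating_threshold(example, "script_1"))
print(separating_threshold(example, "script_2"))
```

In this toy example only the first script can be separated by a single threshold, because the ranges of the other two overlap; that is the situation in which several features have to be combined.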

Taking into account all aforementioned features, i.e., energy, maximum, correlation, dissimilarity and contrast, we can establish the discrimination criteria that can be used for script recognition in the Slavic documents (test set only). The criteria are given by the following pseudo-code:

figure f

Although the aforementioned features show significant variation among the scripts, the resulting criteria are valid for the test set only. To establish generalized criteria for script discrimination, we should use broader information, i.e., the information obtained from the training set. The results of the script type distribution have to be included as well. The extended criteria can be expressed by the following pseudo-code:

figure g

The above criteria can be used to effectively discriminate a given script, i.e., Latin, Cyrillic and/or Glagolitic script. The presented concept recognizes the scripts in the documents without errors. However, it should be noted that the established concept is based on ideal conditions. To prove its validity in real circumstances, its effectiveness should be evaluated by incorporating it into an OCR system.

5 Conclusions

This manuscript proposed an algorithm for script characterization and identification in Slavic documents written in Latin, Glagolitic and Cyrillic scripts. The algorithm relies on the statistical analysis of the coded text, which is obtained by coding the text of a document according to the position of each letter in the text line. The statistical analysis consists of the script type distribution analysis and the co-occurrence analysis of the coded text. As a result, four script type features and eight GLCM texture features are obtained. Due to the differences in the script characteristics, the results of the statistical analysis show significant diversity among the scripts. This is the key point of the decision-making process for script identification. The proposed method is tested on documents from a custom-oriented database. The experiments gave encouraging results.