Introduction

The Ministry of Health (MOH) and all of its affiliated bodies are institutions under the supervision of the Minister of Health in KSA. As part of their missions, doctors, analysts, pharmacists, etc., must process thousands of documents every day from all health organizations (healthcare reimbursements, hospitalizations, physiotherapy sessions, etc.). Besides their huge volume, these documents are sometimes handwritten and require hundreds of working hours per week to be processed and stored in databases.

The objective of our work is to identify the script of each block of text in the processed documents. Script recognition consists of finding the alphabet used to write a document. Observation of medical forms reveals different writings in the blocks representing the regions of interest. These regions are of significant importance for a patient's medical record and should therefore be recognized, whether to facilitate their storage in a knowledge database or their use by an indexing system. A block of text can be written in Arabic or Latin script, in handwritten or printed form (see Fig. 1). Manually identifying the language in which a block is written is no longer practical; automatic identification of the script and of the nature of its writing will route each homogeneous text block to an appropriate optical character recognition (OCR) system. That is why we focus on the best way to characterize the four kinds of writing: printed Arabic, manuscript Arabic, printed Latin and manuscript Latin. We rely on robust image processing tools to characterize homogeneous blocks extracted from medical forms; in particular, we use texture-based methods.

Fig. 1

Example of a medical document filled with Latin manuscript text and Arabic manuscript text

Background

Each script has specific visual characteristics, owing to variations in spatial properties such as character density and stroke orientation. For example, Western languages are characterized by a small alphabet and generally isolated signs. Conversely, the Arabic script is characterized by many ligatures and a strongly cursive style, while Asian languages use several thousand pictograms forming relatively complex internal structures of straight lines.

Script identification is a recent research area compared to optical character recognition (OCR). In any multilingual recognition system, it is very useful to know the script in which a text is written so as to direct it to the appropriate OCR engine (Latin, Arabic, Chinese, etc.). For printed documents, the variety of fonts, styles, sizes, etc., makes script recognition even more difficult.

The state of the art in script recognition shows that research on the topic is scarce and focuses mainly on printed documents, although there is also work on discriminating handwritten from printed text. Script identification works fall into three groups according to the level of analysis: the word (or pseudo-word) level, the line level and the block level. Block-based approaches assume that texts are normalized (equal height and width) and uniformly formatted (constant inter-line and inter-word spacing), and therefore contain only one script.

Zhou et al. [1] and Elgammal and Ismail [2] analyze horizontal and vertical projection profiles to discriminate different scripts. The efficiency of methods based on projection profiles decreases when the number of different alphabets is large.

Chaudhuri [3] presented an approach to the discrimination of printed and handwritten scripts for classifying the Bangla and Devanagari Indian scripts. Discrimination is based on structural and statistical characteristics at the line level.

Contrary to the vast majority of work in this field, which concerns the discrimination of whole blocks of text, the author of [4] proposed a discrimination between Arabic and Latin scripts at the word level on mixed blocks. The approach uses shape similarity for Arabic/non-Arabic script discrimination at the word level. To reduce computation time, the processing is preceded by a pre-selection of Arabic scripts based on the spatial relationships between connected components, without reducing the overall identification performance. This work was taken up by [5] with a discrimination method based on a recognition approach.

In [6], the authors demonstrated an automatic identification method for Arabic and Latin scripts in both handwritten and printed form. Another author suggested a method based on morphological and geometrical analysis for Arabic and Latin text-block discrimination, for both printed and handwritten shapes [7]. A framework based on the steerable pyramid transform at the word level for identifying Arabic and Latin scripts was proposed in [8]. Structural features at the word level were proposed in [9] to effectively recognize the Arabic or Latin script of machine-printed or handwritten documents. Additionally, Li and Tan [10] used statistical features at the character level to identify Arabic, English and Chinese scripts in camera-based images. The researchers in [11] provide a recent and detailed survey of script identification.

Several factors make the development of an automated assistant system for recognizing scripts extracted from handwritten medical forms a difficult task: the complexity of writings that appear illegible and incomprehensible; the variability of the writing of the same doctor or practitioner; the presence of writings resulting from several consultations recorded on the same document, for example overlapping lines and words; text written in the margin and/or between lines; and the deteriorated quality of several color images due to scanning or the use of carbon copies (Fig. 1).

Therefore, this research analyzes the image directly in grayscale, without filtering, geometric correction or restoration. This choice prevents us from using many existing methods, especially those based on segmentation.

The Proposed Approach

The main idea of our approach is to consider the entire text block to be labeled as a particular texture pattern. Each type of texture then has a homogeneous appearance characterizing one class of script, so we can reuse techniques that have already been applied successfully to distinguish textures.

Over the years, many approaches have been proposed for texture detection and quantification; nevertheless, texture analysis remains a very active research topic, and the continued attention it receives attests to the importance of texture for image understanding. Several surveys on texture analysis can be found in [12,13,14,15]. Statistical approaches rely on the intensity values of the pixels to compute numerical descriptors.

SGLD Calculation

The co-occurrence of gray levels, or the spatial dependence of intensities, is the most widely used second-order statistic in image processing. It is better known as the SGLD, for Spatial Gray Level Dependence. The SGLD counts the number of times two points of the image, at a given relative displacement, take a given pair of intensity values.

Let I be the image on which we compute the co-occurrence. I(x, y) is the intensity value observed at (x, y), and I(x + u, y + v) is the intensity value observed after a translation (u, v) of the coordinates. SGLD(u, v, i, j) counts the number of times I(x, y) takes the intensity value i while I(x + u, y + v) takes the intensity value j (Eq. 1). In statistical terms, co-occurrence makes it possible to compute the joint law of simultaneously observing the events (I(x, y) = i) and (I(x + u, y + v) = j) over all pixels (x, y) of the image.

$$SGLD(u,v,i,j) = {\text{Mes}}[(I(x,y) = i) \cap (I(x + u,y + v) = j)].$$
(1)
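As a concrete illustration of Eq. 1, here is a minimal sketch (not the authors' original implementation) that counts co-occurrences for a single fixed displacement (u, v); the function name and the convention that x indexes rows are our own.

```python
import numpy as np

def sgld_cartesian(image, u, v, n_levels):
    """SGLD(u, v, ., .) for one fixed displacement (u, v), as in Eq. 1.

    image: 2-D array of integer gray levels in [0, n_levels);
    here x indexes rows and y indexes columns.
    Returns M with M[i, j] = #{(x, y) : I(x, y) = i and I(x+u, y+v) = j}.
    """
    h, w = image.shape
    m = np.zeros((n_levels, n_levels), dtype=np.int64)
    for x in range(h):
        for y in range(w):
            xt, yt = x + u, y + v
            if 0 <= xt < h and 0 <= yt < w:   # translated point stays inside
                m[image[x, y], image[xt, yt]] += 1
    return m
```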

The SGLD is expressed according to the 4 variables (u, v, i, j) if we represent the spatial relation (u, v) in Cartesian coordinates. We can also represent it in polar coordinates (Eq. 2) by describing the displacement \((u,v)\) by its equivalent \((\rho \times \cos \theta ,\rho \times \sin \theta )\) (see Fig. 2).

Fig. 2

SGLD principle: a second-order statistic [16]

$$SGLD(\rho ,\theta ,i,j) = {\text{Mes}}[(I(x,y) = i) \cap (I(x + \rho \times \cos \theta ,y + \rho \times \sin \theta ) = j)].$$
(2)

Either of these two formulas can be used, but the polar version is more convenient since it makes explicit the displacement ρ and the orientation θ between the two points.

To characterize the different writings, the displacements ρ must be very small so as to measure the local form of the letters. With larger displacements ρ, we measure information on the layout, the interline distance and the height of the lines. The maximum displacement value ρmax defines the size of the analysis window and therefore the measurement scale.

Computing the co-occurrence of gray levels amounts to comparing the image with its translated copy for all possible displacements. The principle of co-occurrence is thus close to that of self-similarity, since it measures the shapes against themselves for all displacements of length ρ and every possible direction θ.

For each direction θ and each displacement ρ, we obtain a co-occurrence matrix of size Ng × Ng, where Ng is the number of gray levels in the image. The gray levels can be quantized into only Nmax levels to reduce the size of the resulting co-occurrence matrices. With displacements ρ limited to ρmax and θmax possible directions, the SGLD is thus an array of ρmax × θmax matrices, each of size Nmax × Nmax.
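The construction can be sketched as follows, reusing sgld_cartesian from above. This is our own illustrative reading of the text: gray levels are quantized to Nmax bins, polar displacements are rounded to integer pixel offsets, and the choice of θ values in [0, π) is an assumption.

```python
import numpy as np

def quantize(image, n_max=16):
    """Requantize gray levels into n_max bins to keep the matrices small."""
    img = image.astype(np.float64)
    span = img.max() - img.min()
    if span == 0:
        return np.zeros(img.shape, dtype=int)
    bins = ((img - img.min()) / span * n_max).astype(int)
    return np.minimum(bins, n_max - 1)

def sgld_stack(image, rho_max=8, n_theta=8, n_max=16):
    """Array of rho_max x n_theta co-occurrence matrices, each n_max x n_max.

    Polar displacements are rounded to the nearest integer offset, so for
    small rho several theta values collapse onto the same displacement.
    """
    q = quantize(image, n_max)
    stack = np.zeros((rho_max, n_theta, n_max, n_max), dtype=np.int64)
    for r in range(1, rho_max + 1):            # rho = 0 carries no information
        for t in range(n_theta):
            theta = t * np.pi / n_theta        # assumed directions in [0, pi)
            u = int(round(r * np.cos(theta)))
            v = int(round(r * np.sin(theta)))
            stack[r - 1, t] = sgld_cartesian(q, u, v, n_max)
    return stack
```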

To measure only the shape of the characters, we choose a small maximum displacement, ρmax = 8. This choice is compatible with the scale of the vast majority of available images regardless of their resolution; the corresponding analysis window is suitable for images with an average text-line height greater than 24 pixels. Since the size of the writing is not normalized across images, the SGLD may find two identical scripts of extremely different sizes dissimilar. On the other hand, the SGLD is robust to variations in image resolution as long as the scale remains the same; a very low-resolution image, however, loses information in the co-occurrence matrices for small displacement values (ρ = 1).

The co-occurrence matrices for ρ = 0 are not meaningful since they correspond to no displacement. Moreover, the discrete nature of images does not allow more than 4 directions for a displacement of 1 pixel, 8 directions for a displacement of 2 pixels, and so on; for small displacement values ρ, not all directions θ are therefore available.

All the SGLD co-occurrence matrices \({\text{Cooccurrence}}(\rho ,\theta ,i,j)\) together form four-dimensional data indexed by ρ, θ, i and j.

With the chosen values ρmax = 8, θmax = 8 and Nmax = 16, we obtain p = ρmax × θmax × Nmax × Nmax = 16,384 descriptors. The co-occurrence matrices therefore contain a considerable number of descriptors that cannot be used directly because of the “curse of dimensionality” [17]; this is precisely why raw co-occurrences are rarely exploited as such in image analysis. An effective reduction of the matrices is therefore necessary for a better exploitation of this multidimensional data.

The method proposed by Haralick et al. [18] remains a standard way to exploit the SGLD, notably for biomedical applications. We chose to use it first; we also designed a new CNN-based method, which we could not finish testing in this work for lack of time. We therefore computed the Haralick features and reduced them to the subset of characteristics that maximizes the discrimination rate (Fig. 3).

Fig. 3

Examples of SGLDs calculated for text blocks of the four script types: printed Latin, manuscript Latin, printed Arabic and manuscript Arabic

Haralick Features Calculation

Haralick [18] proposed 14 descriptors f1 to f14 to be computed on each of the grayscale co-occurrence matrices to describe the texture of an image. Some of these characteristics describe the presence of an organized structure while others reflect the complexity of an image or the nature of the transitions between the gray levels of the points of the image.

We present in the following a full description of these descriptors.

Let M be one of the co-occurrence matrices resulting from the SGLD for fixed values of displacements ρ = ρ0 and direction θ = θ0. If the image has Ng grayscales, then the matrix M will be of size Ng × Ng (Eq. 3).

$$M(i,j) = SGLD(\rho_{0} ,\theta_{0} ,i,j) = {\text{Mes}}[(I(x,y) = i) \cap (I(x + \rho_{0} \cos \theta_{0} ,y + \rho_{0} \sin \theta_{0} ) = j)] .$$
(3)

For each of the co-occurrence matrices, the joint probabilities \(P(i,j)\) are calculated by normalizing each value by the sum of the elements of the respective co-occurrence matrix (Eq. 4).

$$P(i,j) = \frac{M(i,j)}{{\sum\nolimits_{i = 0}^{{i < N_{\text{g}} }} {\sum\nolimits_{j = 0}^{{j < N_{\text{g}} }} {M(i,j)} } }}$$
(4)

\(\{ P(i,j)\}\) then becomes a distribution of joint probabilities to observe the two events \(\left( {I(x,y) = i} \right)\) and \(\left( {I(x + \rho_{0} \cos \theta_{0} ,y + \rho_{0} \sin \theta_{0} ) = j} \right)\) in an image I.

The characteristics of Haralick require the definition of:

Four Projections Px, Py, Px+y, Px−y

$$P_{x} (i) = \sum\limits_{j = 0}^{{j < N_{\text{g}} }} {P(i,j)}$$ is the sum of the P(i, j) along the i-th row.

$$P_{y} (j) = \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {P(i,j)}$$ is defined similarly in the vertical direction.

In terms of probabilities, the sum \(P_{x} (i)\) represents the marginal law of observing \((I(x,y) = i)\) over all points (x, y), for each gray level i. Similarly, the sum \(P_{y} (j)\) represents the second marginal law, that of observing \((I(x + \rho_{0} \cos \theta_{0} ,y + \rho_{0} \sin \theta_{0} ) = j)\) according to the gray level j. The projections \(P_{x + y} (k)\) and \(P_{x - y} (k)\) define two further marginal probability laws by summing along the two diagonals: the first measures the probability distribution of the events \(\left( {I(x,y) + I(x + \rho_{0} \cos \theta_{0} ,y + \rho_{0} \sin \theta_{0} ) = k} \right)\), while the second gives the distribution of the probabilities of the events \(\left( {\left| {I(x,y) - I(x + \rho_{0} \cos \theta_{0} ,y + \rho_{0} \sin \theta_{0} )} \right| = k} \right)\).

$$P_{x + y} (k) = \sum\limits_{i + j = k} {P(i,j)} ,\quad 0 \le k < 2N_{\text{g}} - 1,$$ is the sum of the P(i, j) along the ascending diagonals.

$$P_{x - y} (k) = \sum\limits_{\left| {i - j} \right| = k} {P(i,j)} ,\quad 0 \le k < N_{\text{g}} ,$$ is the sum of the P(i, j) along the descending diagonals.
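A compact sketch of Eq. 4 and the four projections is given below, under the same 0-indexed conventions as above (so i + j ranges over 0 … 2Ng − 2); the function names are ours.

```python
import numpy as np

def joint_probabilities(m):
    """Normalize a co-occurrence matrix into joint probabilities P(i, j) (Eq. 4)."""
    total = m.sum()
    return m / total if total > 0 else m.astype(float)

def projections(p):
    """Marginal laws Px, Py, Px+y and Px-y of a joint distribution P."""
    n = p.shape[0]
    px = p.sum(axis=1)                          # Px(i): sum over columns j
    py = p.sum(axis=0)                          # Py(j): sum over rows i
    i, j = np.indices((n, n))
    pxpy = np.bincount((i + j).ravel(), weights=p.ravel(),
                       minlength=2 * n - 1)     # Px+y(k), k = i + j
    pxmy = np.bincount(np.abs(i - j).ravel(), weights=p.ravel(),
                       minlength=n)             # Px-y(k), k = |i - j|
    return px, py, pxpy, pxmy
```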

Five Entropies HX, HY, HXY, HXY1, HXY2

Entropy is a function that measures the degree of disorder of a system; by extension, it can be used to characterize the regularity of a discrete distribution. Five entropies arise from the possible combinations of computation over the two marginal laws \(P_{x} (i)\) and \(P_{y} (j)\) and the joint probability distribution \(P(i,j)\):

$$\begin{aligned} & {\text{Entropy}}\;{\text{of}}\;P_{x} \quad HX = - \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {P_{x} (i) \times \log (P_{x} (i))} \\ & {\text{Entropy}}\;{\text{of}}\;P_{y} \quad HY = - \sum\limits_{j = 0}^{{j < N_{\text{g}} }} {P_{y} (j) \times \log (P_{y} (j))} \\ & {\text{Entropy}}\;{\text{of}}\;P\quad HXY = - \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {P(i,j) \times \log (P(i,j))} } \\ & {\text{Entropy}}\;{\text{of}}\;P\;{\text{on}}\;P_{x} \times P_{y} \quad HXY1 = - \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {P(i,j) \times \log \left( {P_{x} (i) \times P_{y} (j)} \right)} } \\ & {\text{Entropy}}\;{\text{of}}\;P_{x} \times P_{y} \quad HXY2 = - \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {P_{x} (i) \times P_{y} (j) \times \log \left( {P_{x} (i) \times P_{y} (j)} \right)} } \\ \end{aligned}$$
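These five entropies translate directly into code. In the sketch below, a small epsilon inside the logarithms implements the usual convention 0 · log 0 = 0; this is an implementation detail of ours, not from the paper.

```python
import numpy as np

EPS = 1e-12  # guards log(0); encodes the convention 0 * log(0) = 0

def entropies(p, px, py):
    """The five entropies HX, HY, HXY, HXY1 and HXY2 defined above."""
    hx = -np.sum(px * np.log(px + EPS))
    hy = -np.sum(py * np.log(py + EPS))
    hxy = -np.sum(p * np.log(p + EPS))
    prod = np.outer(px, py)                 # the product law Px(i) * Py(j)
    hxy1 = -np.sum(p * np.log(prod + EPS))
    hxy2 = -np.sum(prod * np.log(prod + EPS))
    return hx, hy, hxy, hxy1, hxy2
```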

Six Statistics

The Haralick descriptors use the averages \(\mu_{x}\) and \(\mu_{y}\) as well as the variances \(V_{x}\) and \(V_{y}\) of gray levels according to the distributions of Px and Py.

$$\begin{aligned} {\text{Averages}}\quad \mu_{x} & = \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {i \times P(i,j)} } \\ \mu_{y} & = \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {j \times P(i,j)} } \\ {\text{Variances}}\quad V_{x} & = \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {(i - \mu_{x} )^{2} \times P(i,j)} } \\ V_{y} & = \sum\limits_{i = 0}^{{i < N_{\text{g}} }} {\sum\limits_{j = 0}^{{j < N_{\text{g}} }} {(j - \mu_{y} )^{2} \times P(i,j)} } \\ \end{aligned}.$$

This yields the standard deviations \(\sigma_{x} = \sqrt {V_{x} }\) and \(\sigma_{y} = \sqrt {V_{y} }\) of the marginal laws Px and Py.

Finally, we define the average of the levels according to the marginal law Px−y: \(m_{x - y} = \sum\nolimits_{k = 0}^{{k < N_{\text{g}} }} {k \times P_{x - y} (k)}.\)
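The six statistics follow the definitions above verbatim; here is a minimal sketch (names ours), taking P and the projection Px−y as inputs.

```python
import numpy as np

def sgld_statistics(p, pxmy):
    """Means, variances, standard deviations, and the mean of Px-y."""
    n = p.shape[0]
    i, j = np.indices((n, n))
    mu_x = np.sum(i * p)
    mu_y = np.sum(j * p)
    v_x = np.sum((i - mu_x) ** 2 * p)
    v_y = np.sum((j - mu_y) ** 2 * p)
    sigma_x, sigma_y = np.sqrt(v_x), np.sqrt(v_y)
    k = np.arange(pxmy.size)
    m_xmy = np.sum(k * pxmy)                # average level of the |i - j| law
    return mu_x, mu_y, v_x, v_y, sigma_x, sigma_y, m_xmy
```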

All of these definitions allow us to present the fourteen textural attributes defined by Haralick [19,20,21] (Table 1):

Table 1 List of 14 Haralick descriptors
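The four descriptors used later in the experiments (f2, f6, f8, f11) can be sketched as follows, reusing projections from above and assuming the numbering of Haralick's original paper (f2 = contrast, f6 = sum average, f8 = sum entropy, f11 = difference entropy).

```python
import numpy as np

EPS = 1e-12

def selected_haralick(p):
    """f2, f6, f8 and f11 under the standard Haralick numbering (assumed)."""
    px, py, pxpy, pxmy = projections(p)
    k_sum = np.arange(pxpy.size)
    k_diff = np.arange(pxmy.size)
    f2 = np.sum(k_diff ** 2 * pxmy)              # contrast
    f6 = np.sum(k_sum * pxpy)                    # sum average
    f8 = -np.sum(pxpy * np.log(pxpy + EPS))      # sum entropy
    f11 = -np.sum(pxmy * np.log(pxmy + EPS))     # difference entropy
    return f2, f6, f8, f11
```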

Results and Discussion

Datasets

To validate our approach, we manually extracted 389 word blocks from medical forms. Each block of text may consist of one or more words of the same type of writing: Arabic manuscript, Latin manuscript, Arabic printed or Latin printed, giving four classes to discriminate. We call this part of our database HEL-ADU-W. The distribution of the words over the 4 classes is presented in the table below (see Table 2).

Table 2 HEL-ADU-W database

Experiments

After computing the SGLDs of the 389 images of homogeneous text blocks, divided into 4 classes, we calculated the 14 Haralick descriptors corresponding to each co-occurrence matrix.

Unlike factor analysis, which maximizes the variance of n observations as a function of p variables, discriminant analysis maximizes the separation of the n observations into their respective classes. There are many discrimination techniques; we retain Fisher's discriminant analysis, which assumes that the classes are approximately Gaussian and linearly separable. Because of the large number of descriptors, we apply a linear discriminant analysis (LDA) to 4 Haralick descriptors at a time (Table 3).

Table 3 Confusion matrix of the four classes MA, ML, PA and PL obtained by the SGLD with f2, f6, f8 and f11 of Haralick features
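As a sketch of this classification step, the snippet below runs a scikit-learn LDA with cross-validation on a feature matrix shaped like ours (389 blocks × 4 descriptors). The data here are random placeholders, since the HEL-ADU-W features are not reproduced in this paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(389, 4))                       # placeholder f2, f6, f8, f11
y = rng.choice(["MA", "ML", "PA", "PL"], size=389)  # placeholder labels

lda = LinearDiscriminantAnalysis()                  # Fisher-style linear model
y_pred = cross_val_predict(lda, X, y, cv=5)         # out-of-fold predictions
print(confusion_matrix(y, y_pred, labels=["MA", "ML", "PA", "PL"]))
```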

The LDA results show an unambiguous separation between the classes Manuscript Arabic (MA), Manuscript Latin (ML), Printed Arabic (PA) and Printed Latin (PL) when using the SGLD with the f2, f6, f8 and f11 Haralick features. The confusion matrix shows that the printed classes PA and PL are perfectly recognized. We reached 98.95% discrimination for ML and 97.95% for MA. MA and ML present a small number of confusions, which is predictable because handwritten scripts share common shape characteristics, whether Arabic or Latin, owing to their cursive forms.

In addition, handwritten Arabic or Latin words on medical forms are often barely legible, with characters that are not even individually distinguishable; they are therefore much more difficult to characterize than printed scripts.

Conclusion

This study presented a successful contribution to the characterization of handwritten and printed writings in text blocks, toward the understanding of medical forms. We chose to characterize a whole homogeneous block by statistical methods based on the SGLD. For our Arabic/Latin, printed/manuscript identification problem, this approach, based on a global analysis of any text zone, has the advantage of handling writing without any segmentation into characters, thanks to the textural method.

Indeed, the objective of this study is not character recognition, since the classification of each character separately is not required, but the characterization of a word or block of words, which is treated as a kind of texture. We assume that a block of text containing a sufficiently large number of letters of variable frequencies constitutes a basis of statistically reliable observations for determining the type of writing independently of the textual content. This is the motivation for using a texture measurement as robust as the SGLD.

The results showed that descriptors based on co-occurrences, such as those derived from the SGLD, make it possible to recover the writing classes of our medical documents with high accuracy.