1 Introduction

Over the last two decades, considerable work has been done on the computer recognition of handwritten words. Few of us, however, believe that a computer will ever read human handwriting as well as a human does. Even so, it is worthwhile to develop technology that approaches the recognition ability of humans. Solving the character segmentation problem is one of the keys to putting character recognition technology to practical use.

Character segmentation is an operation that seeks to decompose an image of a sequence of characters into subimages of individual symbols [7], as shown in Fig. 1. It is one of the decision processes in an optical character recognition (OCR) system. The performance of character segmentation techniques depends on the quality of the scanned document: poor-quality scanning and ink bleeding often cause neighboring characters in the scanned image to touch each other. Character segmentation is a major challenge for such degraded documents [14].

A wide variety of line, word, and character segmentation methods for printed documents in Indian languages are reported in the literature [12]. Segmentation of cursive handwriting, however, remains one of the most challenging problems in handwritten character recognition. Bangla and Hindi are two popular Indian languages whose writing style has this cursive property. Prior work on Bangla handwritten character segmentation includes [4, 6, 8, 13, 16], but only a few works address Hindi character segmentation [5, 9]. In this paper, we present a novel technique for segmenting unconstrained handwritten Hindi words that, in our experiments (Sect. 4), achieves higher accuracy than the existing methods we compared against. The key features of our proposed method are summarized as follows:

  • Extensive use of the structural properties of characters in the segmentation process.

  • Effective handling of inputs with highly skewed header lines.

  • Correct output across many different handwriting styles produced by different individuals.

Fig. 1. Segmentation of handwritten Hindi words.

The rest of the paper is organized as follows. Section 2 describes the characteristics of the Hindi language. Section 3 presents the proposed methodology for accurate segmentation of handwritten Hindi words. Experimental results and related discussion are reported in Sect. 4. Concluding remarks are given in Sect. 5.

2 Properties of Hindi Language

Devanagari is the script used to write the Hindi language. It consists of 14 vowels and 33 consonants. A sample set of basic Hindi characters is shown in Fig. 2(a). The writing direction is from left to right. There is no concept of lower or upper case as there is in English [2]. Half-form characters in Hindi increase the complexity of recognition. A half character may touch a full character to form a character called a conjunct or compound (see Fig. 2(b)). In each conjunct character, the right part is a full consonant and the left part is always a half consonant. When two or more characters are combined to form a word, their horizontal lines touch each other and form a header line called the shirorekha. Vowel modifiers can be placed at the left, right (or both), top, or bottom of a consonant. The vowels above the header line are called ascenders or upper modifiers, and the vowels below the consonants are called descenders or lower modifiers.

Fig. 2. Hindi alphabet set.

3 Proposed Methodology

The proposed method has three phases. In phase 1, a preliminary segmentation process extracts the header line and delineates the upper strip from the rest of the word. This yields vertically separated middle zone and bottom zone components that may be conjuncts, touching characters, characters with an attached lower modifier, shadow characters, or a combination of these. In phase 2, statistical information about these intermediate components is collected and the upper modifiers are segmented. In phase 3, this statistical information is used again to select the components on which further segmentation needs to be attempted, which separates the lower modifiers from the middle zone components. The methodology proceeds in the following hierarchical order, as shown in Fig. 3.

  1. Scan the handwritten Hindi words to be segmented and binarize them using the adaptive-cum-interpolative method described in [3]. Binarization plays an important role in character segmentation. In this method, a multiscale framework is added to an adaptive version of Otsu’s method [11] to handle noise at different scales. To make Otsu’s method adaptive, a local threshold is computed for each pixel from the intensity behavior of its neighboring pixels, instead of a single global threshold for the whole image.

  2. Detect the header line and remove it completely.

  3. Segment any upper modifiers left in the upper zone and perform the appropriate joining if required.

  4. Identify the middle zone components that contain lower modifiers and segment those lower modifiers if required.

  5. Finally, present the segmented result to the subsequent recognition process.
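As a rough illustration of the local thresholding idea in step 1, the following Python sketch applies Otsu's criterion to the neighbourhood of each pixel. It is not the adaptive-cum-interpolative method of [3]: the multiscale framework is omitted, and the function names, the `radius` and `min_contrast` parameters, and the fallback to a global threshold in flat neighbourhoods are assumptions made only for this example.

```python
def otsu_threshold(values):
    """Return the gray level t (0..255) that maximizes the between-class
    variance when values are split into {v <= t} and {v > t}."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of intensities at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_local(gray, radius=1, min_contrast=30):
    """Binarize an 8-bit grayscale image (list of rows); 1 marks an object
    (dark ink) pixel. Each pixel is compared with the Otsu threshold of its
    (2*radius+1)-wide neighbourhood. In nearly flat neighbourhoods the local
    threshold is unreliable, so a global Otsu threshold is used instead
    (an assumption of this sketch)."""
    rows, cols = len(gray), len(gray[0])
    global_t = otsu_threshold([v for row in gray for v in row])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [gray[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            if max(window) - min(window) < min_contrast:
                t = global_t
            else:
                t = otsu_threshold(window)
            if gray[r][c] <= t:
                out[r][c] = 1
    return out
```

The sketch recomputes a histogram for every pixel, which is slow for real scans; it is meant only to make the per-pixel thresholding idea concrete.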

Fig. 3. System architecture of the proposed method.

Fig. 4. Flowchart for header line detection and removal.

3.1 Detection and Removal of Header Lines

Figure 4 outlines the proposed method for detecting and removing header lines, even when they are skewed. The steps are described in detail below.

  1. Thin the binarized handwritten Hindi words to single-pixel-wide skeletons using Huang’s method [10].

  2. Find the start row, end row, start column, and end column that bound the word.

  3. Compute the horizontal density (the number of object pixels) of each row in the upper half of the word height, i.e., from ‘start row’ to ‘(start row + end row)/2’.

  4. Find the row with the highest density among these rows (denoted ‘record’) and take it as the approximate header line row.

  5. Divide the entire word width into stripes. The number of stripes is (2 \(\times \) width)/(lower height) of the input word, where width = (end column - start column) and lower height = (end row - record).

  6. For each stripe, find the row with the highest density of object pixels by scanning from ‘start row’ to ‘record’ + 7. The limit of 7 rows below ‘record’ was set experimentally.

  7. For each stripe, compute the difference between ‘record’ and the local maximum row found in step 6, and shift the entire stripe upwards or downwards according to the sign of the difference.

  8. Finally, remove the row ‘record’ together with an appropriate number of rows above and below it (according to the pen width of the input word) to eliminate the header line.
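Steps 2 to 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the image is a list of rows with 1 for object pixels, the number of stripes is rounded down to an integer (at least one), stripes have equal width except possibly the last, and all function names are hypothetical. Thinning, the actual shifting of stripes, and the final row removal are left out.

```python
def bounding_box(img):
    """Return (start_row, end_row, start_col, end_col) of the object pixels."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    return rows[0], rows[-1], cols[0], cols[-1]


def header_row(img):
    """Row with the most object pixels in the upper half of the word ('record')."""
    start_row, end_row, _, _ = bounding_box(img)
    upper_half = range(start_row, (start_row + end_row) // 2 + 1)
    return max(upper_half, key=lambda r: sum(img[r]))


def stripe_offsets(img, margin=7):
    """Return 'record' and, for each stripe, record - local_max_row.
    A positive offset means the stripe's header segment lies above 'record',
    so the stripe would be shifted down by that many rows; a negative offset
    means it would be shifted up."""
    start_row, end_row, start_col, end_col = bounding_box(img)
    record = header_row(img)
    width = end_col - start_col
    lower_height = max(1, end_row - record)
    n_stripes = max(1, (2 * width) // lower_height)    # floor is an assumption
    stripe_w = (width + n_stripes) // n_stripes        # ceil((width + 1) / n_stripes)
    last_row = min(end_row, record + margin)
    offsets = []
    for s in range(n_stripes):
        c0 = start_col + s * stripe_w
        c1 = min(end_col + 1, c0 + stripe_w)
        if c0 >= c1:
            break
        local = max(range(start_row, last_row + 1),
                    key=lambda r: sum(img[r][c0:c1]))
        offsets.append(record - local)
    return record, offsets
```

For a word whose header line sits one row higher in its left half than in its right half, the left stripe receives an offset of 1 and the right stripe an offset of 0, so only the left stripe would be moved before the header rows are removed.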

After removing the header line, we calculate the vertical density of object pixels for each column. Columns whose vertical density is zero are used as breakdown columns, and the components are separated at these positions.
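The column-wise split described above amounts to a vertical projection profile. A minimal Python sketch, assuming the header-free word is a list of rows with 1 for object pixels (the function name is hypothetical):

```python
def split_at_blank_columns(img):
    """Return (start_col, end_col) spans of consecutive columns that contain
    at least one object pixel; zero-density columns act as separators."""
    n_cols = len(img[0])
    density = [sum(row[c] for row in img) for c in range(n_cols)]
    spans, start = [], None
    for c, d in enumerate(density):
        if d > 0 and start is None:
            start = c                      # a component begins here
        elif d == 0 and start is not None:
            spans.append((start, c - 1))   # a blank column closes it
            start = None
    if start is not None:
        spans.append((start, n_cols - 1))
    return spans
```

Each returned span corresponds to one component; components that touch or overlap horizontally stay in the same span and need the later phases to be separated.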

Fig. 5. Flowchart for upper modifier segmentation.

3.2 Segmentation of Upper Modifiers

To deal with the upper modifiers separately, we extract them one by one using the Rosenfeld and Kak component labeling algorithm [15]. Figure 5 outlines the proposed technique for upper modifier segmentation. The steps are as follows:

  1. Classify the upper modifier. Upper modifiers in Hindi fall into two classes according to the number of distinct positions at which they touch the header line, as shown in Fig. 6. Depending on the writing style, they touch it either once or twice.

  2. If the modifier is single touching, check the class of the next modifier (if any) in left-to-right order. If there is no next modifier, or the next modifier is not single touching, go to step 3. If the next modifier is single touching, check whether the touching points of the present and next modifiers are close to each other. If they are not, go to step 3. Otherwise, join the two modifiers by applying the extrapolation method used in [1] and then move to the next step. All the possibilities for upper modifiers are shown in Fig. 7.

  3. Find the column that lies four columns before the touching point (see Fig. 8). If this column lies over blank space in the middle zone, connect the modifier to the middle zone component immediately to the right of that column. If it lies over a component in the middle zone, represent the modifier separately.

  4. If the modifier is double touching, find the middle column of its column span. In Fig. 9, the middle column is marked by a red line and the column span of the upper modifier by green lines. If the middle column lies over a component in the middle zone (see Fig. 9(a)), calculate the widths of that component to the left (\(W_l\)) and right (\(W_r\)) of the column. If \(W_l\) is less than \(W_r\), join the modifier with the middle zone component immediately to the left of the underlying component. Otherwise, join it with the component immediately to the right of the underlying component.

  5. If the middle column lies over blank space in the middle zone, as shown in Fig. 9(b), calculate the widths (\(C_l\) and \(C_r\)) of the components on the left and right sides of that column in the middle zone. If \(C_l\) is less than \(C_r\), join the modifier with the component of width \(C_l\); otherwise, join it with the component of width \(C_r\).
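The decisions in steps 3 to 5 can be sketched in Python as follows. This is a minimal illustration, assuming that the middle zone components are given as (start column, end column) spans sorted from left to right and that each function returns the index of the component to join with, or `None` when the modifier is kept separate. The function names, the span representation, and the behaviour when the required neighbour does not exist are assumptions; the joining of two adjacent single touching modifiers by extrapolation (step 2) is not covered.

```python
def component_at(spans, col):
    """Index of the span that covers column col, or None if col is blank."""
    for i, (start, end) in enumerate(spans):
        if start <= col <= end:
            return i
    return None


def right_neighbour(spans, col):
    """Index of the first span that starts to the right of col, or None."""
    for i, (start, _) in enumerate(spans):
        if start > col:
            return i
    return None


def left_neighbour(spans, col):
    """Index of the last span that ends to the left of col, or None."""
    for i in range(len(spans) - 1, -1, -1):
        if spans[i][1] < col:
            return i
    return None


def attach_single_touch(spans, touch_col, lookback=4):
    """Step 3: test the column 'lookback' columns before the touching point."""
    col = touch_col - lookback
    if component_at(spans, col) is not None:
        return None                       # over a component: keep separate
    return right_neighbour(spans, col)    # over blank space: join to the right


def attach_double_touch(spans, mod_start, mod_end):
    """Steps 4 and 5: decide from the middle column of the modifier's span."""
    mid = (mod_start + mod_end) // 2
    i = component_at(spans, mid)
    if i is not None:                     # step 4: over a component
        start, end = spans[i]
        w_l, w_r = mid - start, end - mid
        j = i - 1 if w_l < w_r else i + 1
        return j if 0 <= j < len(spans) else None   # assumption: None if absent
    left = left_neighbour(spans, mid)     # step 5: over blank space
    right = right_neighbour(spans, mid)
    if left is None or right is None:     # assumption: take whichever exists
        return left if right is None else right
    c_l = spans[left][1] - spans[left][0] + 1
    c_r = spans[right][1] - spans[right][0] + 1
    return left if c_l < c_r else right
```

With components at columns 0 to 5, 8 to 15, and 18 to 21, a single touching modifier that touches at column 11 is tested at column 7, which is blank, so it joins the second component (index 1).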

Fig. 6. (a) Single touching upper modifiers. (b) Double touching upper modifiers. Red circle indicates the touching point (Color figure online).

Fig. 7. (a) No next modifier. (b) Next modifier is not a single touching modifier. (c) Next modifier is single touching but the touching points are not close to each other. (d) Next modifier is single touching and the touching points are close to each other. Red circle indicates the touching point (Color figure online).

Fig. 8. Calculated column (marked in red) lies over (a) blank space in middle zone; (b) some component in middle zone (Color figure online).

Fig. 9. Middle column (in red) lies over (a) some component in middle zone; (b) blank space in middle zone (Color figure online).

3.3 Segmentation of Lower Modifiers

The proposed method for lower modifier segmentation is shown in Fig. 11. First, the components containing lower modifiers are identified. They are then classified into the three classes given below.

  • Middle bar characters

  • Right bar characters

  • No bar characters

The steps to determine the classes are:

Fig. 10. \(3\times 3\) grid for lower modifier detection.

  1. Divide the entire component into a 3\(\times \)3 grid, as shown in Fig. 10.

  2. Consider blocks 3 and 6 and check whether more than 90 % of their rows contain object pixels. If so, the component is a right bar character.

  3. Otherwise, consider blocks 2 and 5 and perform the same check over their rows. If it is satisfied, the component is a middle bar character.

  4. If neither check is satisfied, the component is a no bar character.
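The classification steps above can be sketched in Python as follows. The sketch assumes that the blocks of the grid are numbered 1 to 9 in row-major order (so blocks 3 and 6 form the upper two thirds of the right column), that the 90 % test is applied to the rows of the two blocks taken together, and that the component is at least 3 pixels high and wide. The function names are hypothetical.

```python
def grid_bounds(n, k=3):
    """Split the index range 0..n into k nearly equal (start, stop) parts."""
    return [(i * n // k, (i + 1) * n // k) for i in range(k)]


def block(img, number):
    """Sub-image of block 'number' (1..9, row-major) in a 3x3 grid."""
    row_bounds = grid_bounds(len(img))
    col_bounds = grid_bounds(len(img[0]))
    r0, r1 = row_bounds[(number - 1) // 3]
    c0, c1 = col_bounds[(number - 1) % 3]
    return [row[c0:c1] for row in img[r0:r1]]


def row_coverage(img, numbers):
    """Fraction of rows in the given blocks that contain an object pixel."""
    rows = [row for n in numbers for row in block(img, n)]
    return sum(1 for row in rows if any(row)) / len(rows)


def classify_bar(img, threshold=0.9):
    """Return 'right bar', 'middle bar', or 'no bar' for a component image."""
    if row_coverage(img, (3, 6)) > threshold:
        return "right bar"
    if row_coverage(img, (2, 5)) > threshold:
        return "middle bar"
    return "no bar"
```

A component with a continuous vertical stroke in its right third is reported as a right bar character, and one with the stroke in its middle third as a middle bar character.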

Fig. 11. Flowchart for lower modifier segmentation.

After the final classification, block 9 of the grid is considered for a right bar character and block 8 for a middle bar character. The density of object pixels in each row of the considered block is then calculated. Moving from top to bottom in that block, the first row at which the density increases sharply is located. Removing this row separates the lower modifier from its middle zone component. No bar characters are handled as exceptional cases; a more general treatment is left for future work (Sect. 4.3).
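The row search can be sketched in Python as follows. The text does not quantify a sudden increase, so the sketch assumes it means a rise of at least `min_jump` object pixels from one row to the next; that parameter, its default value, and the function names are hypothetical. The input is the sub-image of the considered block (block 9 or block 8), and the returned row index is relative to that block.

```python
def first_sudden_increase(block_img, min_jump=2):
    """Scan the block from top to bottom and return the first row whose
    density exceeds that of the previous row by at least min_jump,
    or None if there is no such row."""
    density = [sum(row) for row in block_img]
    for r in range(1, len(density)):
        if density[r] - density[r - 1] >= min_jump:
            return r
    return None


def cut_row(block_img, row):
    """Return a copy of the block with the given row cleared, which
    disconnects the lower modifier from the part above it."""
    return [[0] * len(r) if i == row else list(r)
            for i, r in enumerate(block_img)]
```

In a block whose row densities are 1, 1, 3, 2, the search returns row 2, and clearing that row separates the stroke above it from the modifier below it.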

Fig. 12. Experimental results of different phases for handwritten Hindi words.

4 Experimental Results and Discussion

This section presents the experimental results and related discussion of our proposed method.

4.1 Experimental Dataset

Our dataset consists of about 12750 Hindi word samples, of which 1200 are printed and 11550 are handwritten. The handwritten samples were collected from 30 different writers using 10 different types of pens with varying pen widths. The samples were scanned at 300 dpi using an HP Office Jet 5610 scanner. The method was implemented in MATLAB (R2010a).

4.2 Character Segmentation Results

Detection of Header Lines: The experimental results are shown in Fig. 12(a). In this figure, the left and right columns show the input samples and the corresponding outputs after the completion of phase 1. For each input, a white, single-width straight line is detected as the header line; it marks the row to be removed. An appropriate number of rows above and below this row, chosen according to the pen width of the written text, can then be removed to eliminate the header line completely.

Segmentation of Upper Modifiers: The experimental results are shown in Fig. 12(b). The left and right columns of the figure show the inputs and the corresponding outputs after the completion of phase 2. All the upper modifiers, along with their middle zone counterparts, are fully separated from the rest and represented individually. For upper modifiers that have counterparts in the middle zone, the appropriate joining has been done, and extrapolation has been performed where required.

Segmentation of Lower Modifiers: Finally, the lower modifiers are segmented in the last phase of the proposed method. The experimental results are shown in Fig. 12(c). The left and right columns of the figure show the inputs and the corresponding outputs after the completion of phase 3. All the lower modifiers are detected and correctly segmented from their middle zone components.

For all of the above test cases, we chose input samples that exhibit the difficulties mentioned earlier, in order to demonstrate the efficacy of the proposed method.

Table 1. Header line detection accuracy.
Table 2. Upper modifier segmentation accuracy.
Table 3. Lower modifier segmentation accuracy.

4.3 Comparison with Other Methods

We compared our results with two existing methods, those of Bansal–Sinha [5] and Hanmandlu–Agrawal [9]. Because the datasets used in those two works were not available, we prepared our own dataset of reasonable size, as discussed in Sect. 4.1, so that the comparison could be carried out on a common platform. We implemented the other two methods and tested all three methods on this dataset, with the aim of keeping the comparison unbiased. The accuracies of header line detection, upper modifier segmentation, and lower modifier segmentation are shown in Tables 1, 2, and 3, respectively. Our header line detection performs much better than that of the other two methods, because the proposed method handles a large variety of writing styles and skewed header lines in the input. For upper and lower modifier segmentation, our algorithm also improves on the accuracy of the other two methods. The overall success rate of our proposed method is 96.93 %, compared with 48.58 % for the Bansal–Sinha method and 72.8 % for the Hanmandlu–Agrawal method. When measuring accuracy, we counted both over-segmentation and under-segmentation as incorrect segmentation.

This work can be extended to a more general method for lower modifier segmentation of no bar characters and to the segmentation of two characters that touch in the upper, middle, or lower region. Shadow characters, which arise when one totally independent component lies under another component, also occur often in handwritten text and can be dealt with in future work.

We also compared the above methods with our proposed method with respect to computational time, using our own test dataset for this analysis. We observed that our proposed method is computationally more efficient than the other two methods for both printed and handwritten images.

5 Concluding Remarks

In this paper, we have proposed a character segmentation method based on the structural shape of the Hindi script. The proposed method performed well at each level of segmentation and handled large variations in the shapes produced by different Hindi writing styles. It was tested on handwritten Hindi word images, and the results are very promising, with an average accuracy of 96.93 %. However, the method does not perform well in a few particular cases, as stated earlier. In future work, we shall improve the accuracy of segmentation and make the method applicable to character recognition in a handwritten Hindi OCR system.