1 Introduction

Optical character recognition (OCR) is the electronic conversion of printed or handwritten scanned documents into machine-readable text. It changes physical text into digital form so that it can be edited, stored, or used by other machine processes. Its main advantage is that data can be entered from printed records without human intervention. A basic OCR system includes pre-processing, feature extraction, and classification steps.

Character segmentation is a pre-processing step in OCR: the process of separating a word image into individual character images. It is a fundamental and critical step, because incorrect character segmentation degrades OCR performance. Segmenting characters is a complex task for Indian scripts written in a cursive handwriting style, and it becomes more difficult with variation in writing styles, skew, and similar factors.

A large body of work exists on scripts such as Roman, Chinese, and Arabic [11, 15, 16, 24]. Strategies for character segmentation are consolidated in an early study by Casey and Lecolinet [12]. Techniques for character segmentation of Indian scripts are reviewed in [34], and methodologies used by researchers for segmenting Hindi and Bangla are brought together in [4]. A survey on recognition of Devanagari script is given in [23]; it consolidates work from the 1970s on printed and handwritten OCR.

Bansal and Sinha [6] have proposed a two-pass algorithm for segmenting Devanagari script. In the first pass, the word is segmented into easily separable characters. The separated characters are classified as composite or non-composite based on statistical information about the height and width of the characters obtained after the first pass. Characters classified as composite are segmented in the second pass of the algorithm. Hanmandlu and Agrawal [18] have used structural properties of the Hindi language for segmentation. Sarkar et al. [44] have proposed a non-linear fuzzy membership function for header line estimation and then used a non-linear fuzzy function to identify the segmentation points on the header line. An adaptive Hindi OCR is proposed in [28]. In this work, Hindi script is first identified in bilingual and multilingual documents based on structural properties, and character segmentation is then applied to the identified words. A horizontal projection approach is used for dissecting the word into characters. Another work on handwritten Hindi text segmentation based on header line removal is reported in [36]. Here, the header line is detected by estimating the average line height and then the characters are segmented. Srivastava and Sahu [47] have also used header line detection and removal for handwritten Hindi text segmentation. Ramteke et al. [41] have reported segmentation of Marathi handwritten text based on header line detection.

One of the early works on Bangla character segmentation is based on recursive contour following [10]. Pal and Dutta [35] have used a concept based on the water reservoir principle for character segmentation of Bangla. Bag et al. [3] have proposed a method based on vertex characterization of outer isothetic polygonal covers. Roy et al. [42] have presented an approach for skew detection, correction, and character segmentation. In this work, segmentation points are extracted on the basis of patterns observed in the handwritten words. Basu et al. [7] have proposed a two-step fuzzy technique for segmentation of Bangla words. In a recent work on handwritten Bangla OCR [1], a distance-based segmentation approach is used for line, word, and character segmentation.

Lehal and Singh [27] have developed a Gurmukhi OCR system for machine-printed text. The recognition of the text is done at the sub-character level: the word is divided into sub-characters in the segmentation phase, and in the recognition phase these sub-characters are classified and combined to form Gurmukhi characters. Lehal [26] has presented a complete OCR for printed multi-font Gurmukhi script. Kumar and Sengar [25] have proposed a bilingual character segmentation method covering Gurmukhi and Devanagari script, based on the horizontal and vertical projection of the word. Sharma et al. [45] have proposed an iterative approach based on the header line, aspect ratio, and vertical and horizontal projection profiles for handwritten Gurmukhi text. Mangala and Kaur [29] have proposed an end detection algorithm for segmentation of touching and broken characters in handwritten Gurmukhi words. They also used horizontal and vertical projection profiles for isolated character segmentation.

A number of works have addressed character segmentation for Indian languages, but most of the methods rely on removal of the header line followed by post-processing to segment the characters of the words [13, 28, 31, 36, 37, 46], as discussed above.

OCR of Indian scripts is an ongoing topic of research, as little work has been done on multilingual OCR of Indian scripts. Moreover, header line removal is troublesome for skewed word images. Implementing a generalized character segmentation method for more than one language is a non-trivial task due to the variation in writing styles and structural properties of the languages. A few research works address multilingual character segmentation for Indian scripts [2, 17, 22, 30], but all of them use different approaches for segmenting different languages and combine these approaches into a single module. All these works are based on script identification followed by character segmentation for the corresponding script.

The motivation for this work is to implement a generalized method for character segmentation that does not remove the header line, works on more than one language, and also handles skewed words. The novelty of the work lies in the fact that, unlike previous works, the proposed method does not identify the language before segmentation and handles all the languages using a single segmentation approach. The following key features illustrate the novelty of the proposed work.

  • Proposed method does not identify language before segmentation.

  • It uses a single approach to segment the words of different languages.

  • It performs character segmentation without removal of header line.

  • It works well on skewed words up to an angle of ±10°.

  • Works well on words with broken header line.

  • Proposed method segments broken words correctly.

This is significant because a single method can be used to develop multilingual OCR for Indian languages. To the best of our knowledge, only one work, by Bhattad and Chaudhuri [8], uses a single approach for bilingual character segmentation of Bangla and Devanagari, and that work does not perform segmentation of modifiers. In this paper, we propose a method to segment multilingual handwritten words into characters using the polygonal approximation of the word.

The rest of the paper is organized as follows. Section 2 discusses characteristics of the four languages: Hindi, Marathi, Punjabi, and Bangla. Section 3 presents the proposed methodology for character segmentation. Experimental results and their analysis are discussed in Section 4. Concluding remarks and scope for future work are given in Section 5.

2 Properties of languages

The Government of India has recognized 22 languages as official languages, namely Assamese, Bengali (Bangla), Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Marathi, Manipuri, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, and Urdu. Among these, four popular languages are Hindi, Marathi, Punjabi, and Bangla, collectively spoken by approximately 723 million people worldwide.

2.1 Properties of Hindi language

Hindi language is written in Devanagari script. It is the most prominent language spoken in India. It consists of 13 vowels and 33 consonants.

Unlike English, it makes no distinction between uppercase and lowercase characters. The writing direction is from left to right. Two or more characters are combined to form a word by joining the header lines of the individual characters.

2.2 Properties of Marathi language

Marathi is also written in Devanagari script. Over 73 million people speak Marathi in India, making it the fourth most spoken language in India and the nineteenth most spoken language in the world [19]. Because it is also written in Devanagari script, it has the same properties as Hindi.

2.3 Properties of Punjabi language

With over 100 million speakers worldwide, Punjabi is ranked the tenth most widely spoken language in the world [20]. With 30 million speakers in India, it is ranked the 11th most spoken language in India. Punjabi is the most widely spoken language in Pakistan. Gurmukhi script is used to write Punjabi. It comprises 10 vowels and 41 consonants. The characters of Punjabi words are also connected by a header line, and the writing direction is from left to right.

2.4 Properties of Bangla language

Bangla is written using Bangla script. The basic character set of Bangla comprises 11 vowels and 39 consonants. It is the second most spoken language in India and the seventh most spoken language in the world [21]. Over 80 million people in India speak Bangla. It is also the national language of Bangladesh. From an OCR point of view, Bangla is a very significant language.

Our algorithm exploits the common structural properties (Fig. 1) of the four languages described above. A comparative analysis of these four languages is given in Table 1.

Fig. 1 Structural properties of languages

Table 1 Comparison of the four Indian languages

3 Proposed method

In this work, we consider the header line and the baseline for character segmentation and modifier segmentation. In the proposed method, the header line of the word is not removed to perform character segmentation. The method is divided into four phases. The first is a preparatory phase in which the input word is transformed into the form required by the algorithm. In the second phase, polygonal approximation is applied to find the straight line segments of the word. These segments are traversed to find the segmentation points on the header line, and the word is divided into characters on the basis of these points. In the last two phases, upper and lower modifiers, if any, are segmented using statistical information about the individual components of the word. Figure 2 depicts the system architecture of the proposed method.

Fig. 2 System architecture of the proposed method

3.1 Preparatory stage

Optical scanning of the document produces the raw input image for the OCR system. Pre-processing is the first phase of document image analysis, and its purpose is to improve the quality of the image. Two pre-processing methods are applied to prepare the input for the next phase.

  • This stage reads the input word image and binarizes it using Otsu’s method [33], which selects a single global threshold from the gray-level histogram. Converting the gray scale image to a binary image is the first step of OCR.

  • Thinning is applied to the binarized input using the fast parallel algorithm of [49]. Thinning reduces the image regions to one-pixel-wide strokes.

These pre-processing techniques, binarization and thinning, are important because the proposed method depends entirely on them. Polygonal approximation of a digital curve is based on chain code properties. Chain codes work on one-pixel-wide curves, which is why thinning is applied, and binarization is a necessary step before thinning. If any stroke is lost during binarization, the approximated shape of the character will be inaccurate, and so will the segmentation of the word, which in turn affects the performance of OCR.
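The binarization step can be illustrated with a minimal NumPy sketch of Otsu's threshold selection. This is not the implementation used in the paper (which was written in MATLAB); the function names `otsu_threshold` and `binarize` are hypothetical, and the sketch assumes an 8-bit gray scale image with dark ink on a light background that contains at least two distinct gray levels.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level t that maximizes the between-class variance.

    Assumes `gray` is a uint8 array with at least two distinct gray levels.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                     # probability of class "<= t"
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan                  # thresholds with an empty class
    sigma_b = (mu_total * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b))

def binarize(gray):
    """Mark object (ink) pixels as 1, assuming dark ink on light paper."""
    t = otsu_threshold(gray)
    return (gray <= t).astype(np.uint8)
```

The binary output, with object pixels set to 1, is the form assumed by the later sketches; thinning would be applied to it before polygonal approximation.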

3.2 Character segmentation

Figure 3 outlines the proposed method for character segmentation. The following steps describe the process in detail.

  1. The binarized, thinned word obtained in Section 3.1 is the input for this process (Fig. 3b).

  2. Polygonal approximation [9] is applied to the binarized, thinned word image. This algorithm determines the digital straightness of the one-pixel-wide word obtained after thinning and extracts digital straight line segments (DSS) from the digital curve. Only a few such segments are required to cover the word, so the data needed to represent the word is greatly reduced. Since the proposed segmentation method is based on the structural properties of the four languages, it needs to store information about the structure and shape of the words efficiently. After polygonal approximation, the structure of the word can be represented with very little data and accessed efficiently: the large amount of pixel data is transformed into a simple graph with considerably fewer points, which makes traversal of the points simple. Polygonal approximation gives the approximated end points of the segments (Fig. 3c). When we connect these approximated points, we get the approximated shape of the word, as shown in Fig. 3e.

  3. A junction point is defined as a point that has more than two neighbors. We identify all the junction points among the approximated points. The green points are the junction points of the word (Fig. 3e).

  4. We define header junction points as the junction points lying on the header line, and identify them among all the junction points. The junction points on the header line are detected even if the word is skewed up to an angle of ±10°. To do this, the word is divided into vertical stripes (Fig. 3d). In each stripe, the row with the maximum object pixel density is detected as the header line. Red boxes in the figure show the header line for each region. The header line is thus detected irrespective of the skew of the word, and the header junction points are determined from it. Magenta points in Fig. 3e are the header junction points.

  5. We combine the approximated points to get the approximated shape of the word, which resembles a graph structure (Fig. 3e). These points are traversed to determine the segmentation points. Traversal is performed from every header junction point headerJP obtained in step 4. If only end points are reached while traversing from a source header junction point S, then S is marked as a segmentation point; if another header junction point is reached, that point is marked as the segmentation point instead. The following steps describe the traversal in detail:

    Step 1: Initialize a queue Q as empty and take a headerJP as the source S.

    Step 2: Each approximated point is connected to its neighbors, so we check the neighbors of S. If a neighbor point p of S is a junction point that has not been visited yet, p is marked as visited and inserted into Q.

    Step 3: If p is a headerJP, there is no need to traverse further for S. Point p is marked as visited and as a segmentation point, traversal is terminated for S, and Q is emptied.

    Step 4: If Q is empty, no points remain for traversal, which ensures that all the junction points reachable from headerJP have been traversed.

    Step 5: If Q is not empty, not all the points reachable from headerJP have been traversed. A point is removed from Q, it becomes the new source S, and steps 2–5 are repeated.

    Steps 1–5 are repeated for each headerJP.

    Figure 4 explains the traversal on an example word image. Both cases of traversal described above are shown in the example. Here, the four points J1, J2, J3, and J4 are the header junction points and E1, E2, and E3 are the end points. Green points are the junction points of the word (‘c’ becomes a junction point due to the presence of a small cut). We start traversing from J1 in the direction shown by the arrows and reach the point E1 via the three points ‘a’, ‘b’, and ‘c’. This is one path of traversal (Fig. 4a). On reaching E1 the queue becomes empty, so traversal is complete for header junction point J1. Another path of traversal from J1 to E1 is possible via the points ‘a’, ‘c’, and ‘b’ (Fig. 4b). On this path, we reach end point E1 through junction point ‘a’. Since not all the other points reachable from ‘a’ have been discovered yet, we backtrack and traversal continues in the direction of the arrows shown in Fig. 4b. On reaching junction point ‘b’, traversal stops, as all the points have been discovered and the queue is empty. We reached an end point and did not reach any other header junction point in this traversal, so header junction point J1 is marked as a segmentation point. Similarly, starting from J2, we reach E2. Starting from header junction point J3, however, we reach end point E3 via the two junction points ‘d’ and ‘e’. On reaching E3 the queue is not empty, as junction point ‘e’ reached from J3 has not been traversed completely (a neighbor of ‘e’ is still unvisited). So traversal backtracks from E3 and reaches another header junction point, J4, which is part of the same character. Therefore, among J1, J2, J3, and J4, the header junction points J1, J2, and J4 are used for segmentation, while J3 remains connected to the character. Small cuts and merges do not cause problems in traversal: after polygonal approximation only a few points are needed to represent the word, so the cuts and merges in the word are also few. All the points follow the same procedure, which results in character segmentation (as explained in the given example).

  6. Once the segmentation points are obtained, we segment the word into individual characters, with modifiers still attached (Fig. 3f). We then segment the upper and lower modifiers one after the other. The following sections explain the process of modifier segmentation.
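The stripe-wise header line detection of step 4 can be sketched as follows. This is only an illustration under stated assumptions: the function name `header_rows_by_stripe` is hypothetical, the number of stripes `n_stripes` is not given in the text and is chosen arbitrarily here, and the input is assumed to be a binary array with object pixels equal to 1.

```python
import numpy as np

def header_rows_by_stripe(binary, n_stripes=4):
    """For each vertical stripe, return (col_start, col_end, header_row).

    The header row of a stripe is taken as the row with the most object
    pixels inside that stripe. n_stripes is an assumed parameter.
    """
    height, width = binary.shape
    bounds = np.linspace(0, width, n_stripes + 1).astype(int)
    result = []
    for c0, c1 in zip(bounds[:-1], bounds[1:]):
        if c1 <= c0:
            continue                      # stripe narrower than one column
        row_density = binary[:, c0:c1].sum(axis=1)
        if row_density.max() == 0:
            continue                      # stripe contains no object pixels
        result.append((int(c0), int(c1), int(np.argmax(row_density))))
    return result
```

Because each stripe gets its own header row, a skewed header line is followed piecewise rather than being forced onto a single image row, which is the property the method relies on for skewed words.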

Fig. 3 The process of character segmentation: a input word image, b binarized thinned word, c detected approximated points, d detection of header line, e different points considered for segmentation (junction points are shown in green and header junction points are shown in magenta), f result of character segmentation (modifiers remain connected), g bounding boxes for upper modifier segmentation, h result of upper modifier segmentation, i baseline detection, j baseline junction point detection (rectangular box is the baseline region and the red point in this region is the baseline junction point), k bounding box (shown in red) for lower modifier segmentation, and l result of lower modifier segmentation and final segmented output

Fig. 4 Detection of segmentation points using PointTraversal procedure
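The traversal can be sketched as a breadth-first search over the graph of approximated points. The sketch below is a simplified reading of steps 1–5, not the authors' PointTraversal procedure. It assumes that the graph is given as an adjacency dictionary from which edges running along the header line have been removed, that header junction points are processed from left to right, and that every unvisited neighbor (not only junction points) is enqueued. The function names and the example graph are hypothetical; the graph is loosely modeled on the Fig. 4 walkthrough.

```python
from collections import deque

def junction_points(adjacency):
    """Points with more than two neighbors, as defined in step 3."""
    return [v for v, nbrs in adjacency.items() if len(nbrs) > 2]

def find_segmentation_points(adjacency, header_jps):
    """Return the header junction points selected as segmentation points.

    adjacency  : dict mapping each point to its neighbors (assumed to
                 exclude edges that run along the header line)
    header_jps : header junction points, assumed ordered left to right
    """
    header_set = set(header_jps)
    visited = set()
    segmentation = []
    for source in header_jps:
        if source in visited:
            continue                      # already reached from an earlier source
        visited.add(source)
        queue = deque([source])
        reached_other_header = False
        while queue:
            s = queue.popleft()
            for p in adjacency.get(s, ()):
                if p in visited:
                    continue
                visited.add(p)
                if p in header_set:       # Step 3: another header junction point
                    segmentation.append(p)
                    reached_other_header = True
                    queue.clear()
                    break
                queue.append(p)
        if not reached_other_header:      # only end points were reached
            segmentation.append(source)
    return segmentation

# Hypothetical graph loosely following the Fig. 4 example.
example_graph = {
    "J1": ["a"], "a": ["J1", "b", "c"], "b": ["a", "c"],
    "c": ["a", "b", "E1"], "E1": ["c"],
    "J2": ["E2"], "E2": ["J2"],
    "J3": ["d"], "d": ["J3", "e"], "e": ["d", "E3", "J4"],
    "E3": ["e"], "J4": ["e"],
}
print(find_segmentation_points(example_graph, ["J1", "J2", "J3", "J4"]))
```

On this example graph the sketch selects J1, J2, and J4 and leaves J3 attached to its character, which agrees with the outcome described for Fig. 4.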

3.3 Upper modifier segmentation

After character segmentation, statistical information about the word width is used to segment the upper modifiers. The proposed method is advantageous because all the upper modifiers are segmented using a single logic; we do not categorize the modifiers in order to segment them, as reported in earlier literature [5].

Procedure SegmentUpModifier in Algorithm 1 explains the approach for segmentation of upper modifiers.

  1. We determine the presence of an upper modifier with the help of a bounding box. A rectangular bounding box B_b (Fig. 3g) is drawn over each header junction point headerJP that belongs to the character (Line 1–4). The width of this box is equal to the width of the character, and the height h_b of the box is calculated using (1).

    $$ {h_{b}}_{i} = \left| {headerJP}^{x}_{i} - \mathtt{mid\_point}_{i} \right| $$
    (1)

    where

    $$ \mathtt{mid\_point}_{i} = (\mathtt{start\_row} + {headerJP}^{x}_{i})/2 $$
    (2)

    and start_row is the start row of the word image.

  2. Next, to detect the upper modifier, we count the number of object pixels n_p inside each B_b. If n_p > 𝜃_u, an upper modifier is present (Line 5–6). We set 𝜃_u to 30 on the basis of experimental analysis (Section 4.4).

  3. After identifying the modifier, we examine the width W_c of the character associated with the header junction point to which the modifier is connected (Line 7).

  4. The upper modifier is segmented from the character if W_c > 𝜃_w, where 𝜃_w is a width threshold (Line 8–10). Otherwise, the upper modifier stays attached to the character.

    This process is explained in Fig. 5. The first two figures show two modifiers. In the first case, w1 < 𝜃_w while w2 > 𝜃_w, so the upper modifier is segmented at point b but remains connected at a. The reverse is true in the second case, so the upper modifier is segmented at point c. In the next two figures, w > 𝜃_w, so the upper modifier is segmented at points e and f. In the last figure, the modifier at point g is segmented as w2 < 𝜃_w. We set 𝜃_w to 10 on the basis of experimental analysis (Section 4.4). Figure 3h shows the result of upper modifier segmentation for the input word. Some results of upper modifier segmentation for the four languages are shown in Fig. 6. If the upper modifier is not connected properly to the character of the middle zone and appears partly on top of two characters, the width of the nearest character is compared, because the upper modifier automatically falls inside the bounding box over the character that is nearer to it (Fig. 7). In the given Hindi example, point ‘p’ is nearer to header junction point ‘a’ than to point ‘b’, so the width of the character associated with header junction point ‘a’ is compared, resulting in proper segmentation of the upper modifier.
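The bounding-box test for upper modifiers can be sketched as below, using Eqs. (1) and (2) and the thresholds 30 and 10 reported in the text. This is not Algorithm 1 itself. The function names are hypothetical, the box is assumed to sit directly above the header row, and the directions of both comparisons (a modifier is present when the pixel count exceeds the first threshold, and it is cut when the character width exceeds the second) are assumptions of this sketch.

```python
import numpy as np

THETA_U = 30   # pixel-count threshold for an upper modifier (from the text)
THETA_W = 10   # character-width threshold (from the text)

def upper_box_height(header_row, start_row):
    """Height of the bounding box above a header junction point."""
    mid_point = (start_row + header_row) / 2.0     # Eq. (2)
    return abs(header_row - mid_point)             # Eq. (1)

def has_upper_modifier(binary, header_row, start_row, col_start, col_end):
    """Count object pixels in the box above the header row (assumed placement)."""
    h_b = int(upper_box_height(header_row, start_row))
    top = max(header_row - h_b, 0)
    n_p = int(binary[top:header_row, col_start:col_end].sum())
    return n_p > THETA_U                           # assumed comparison direction

def should_cut_upper_modifier(binary, header_row, start_row, col_start, col_end):
    """Cut the modifier only when the character is wide enough (assumed rule)."""
    char_width = col_end - col_start
    present = has_upper_modifier(binary, header_row, start_row, col_start, col_end)
    return present and char_width > THETA_W
```

Here `col_start` and `col_end` stand for the column extent of the character associated with the header junction point, so the box width equals the character width as described in step 1.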

Fig. 5 Explanation of upper modifier segmentation with example

Fig. 6 Image examples for segmentation of upper modifier for a Hindi, b Marathi, c Punjabi, and d Bangla languages

Fig. 7 Examples of upper modifier segmentation for Hindi when the modifier appears partly between two characters: a input word, b approximated points of word, and c segmented output

3.4 Lower modifier segmentation

The method used to segment lower modifiers is similar to the upper modifier segmentation method discussed in Section 3.3. For upper modifiers, we work on the junction points lying on the header line. Similarly, for lower modifiers we consider the junction points lying in the baseline region. The baseline is the invisible line that separates the middle zone from the lower zone of the word (Fig. 1). The advantage of this method is that all the lower modifiers are handled using a single approach. The method is explained in procedure SegmentLowModifier in Algorithm 2.

  1. The procedure starts by detecting the baseline using procedure DetectBaseline in Algorithm 3. To detect the baseline, sudden changes in the horizontal density of object pixels from the middle row to the end row of the image are considered (Line 1–4). A sudden change in density is found at the end of the lower zone, but this row is not the correct baseline if a lower modifier is present (Fig. 3i). To generalize baseline detection, the pixel density of the word in different zones is therefore considered. The first detected baseline is named Baseline_end (Line 6), and a second baseline, Baseline_second, is also detected (Line 7). We calculate the difference np_diff between the numbers of object pixels lying in the rows Baseline_second and Baseline_end (Line 9–11). If np_diff < 5 (value chosen by experimental analysis), Baseline_end is taken as the baseline (Line 12–13); otherwise Baseline_second is the baseline (Line 15).

    Figure 8 shows the two cases of baseline detection. In the first word of the image, np_diff is small because both rows, Baseline_end and Baseline_second, lie in the middle zone, which is the zone of highest object pixel density. In the presence of a lower modifier (second word of the image), row Baseline_end lies in the lower zone (low object pixel density) and row Baseline_second lies in the middle zone (high object pixel density), so np_diff is larger. Hence, for the first word of Fig. 8 the baseline is Baseline_end, and for the second word it is Baseline_second. All the threshold values used are based on experimental analysis.

  2. We define baseline junction points baseJP as the junction points lying in the baseline region (Fig. 3j). We identify all the baseline junction points baseJP after the baseline is detected.

  3. A bounding box is drawn for each baseJP (Fig. 3k) (Line 4–5). The width of the bounding box is equal to the width of the character associated with the baseJP, and its height h_b is calculated using (3).

    $$ {h_{b}}_{i} = \left| {baseJP}^{x}_{i} - \mathtt{end\_row} \right| $$
    (3)

    where end_row is the end row of the word image.

  4. We count the total number of object pixels inside the bounding box. If this number is greater than the threshold 𝜃_l, a lower modifier is present, and we disconnect it from that baseline junction point (Line 6–9). We set 𝜃_l to 30 on the basis of experimental analysis (Section 4.4). Figure 3l shows the result of lower modifier segmentation for the input image of Fig. 3a; it is the final segmented output of the input word. Results of lower modifier segmentation for all four languages are shown in Fig. 9.
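The choice between the two candidate baseline rows and the lower modifier test of Eq. (3) can be sketched as follows. This is not Algorithm 2 or Algorithm 3. The function names are hypothetical, the two candidate rows are taken as inputs because the text does not fully specify how they are located, and the box is assumed to extend from just below the baseline junction row down to the end row.

```python
import numpy as np

def choose_baseline(binary, baseline_end, baseline_second, max_diff=5):
    """Pick the baseline row from two candidates using their pixel counts."""
    row_counts = binary.sum(axis=1)
    np_diff = abs(int(row_counts[baseline_second]) - int(row_counts[baseline_end]))
    # Small difference: both rows lie in the dense middle zone.
    return baseline_end if np_diff < max_diff else baseline_second

def has_lower_modifier(binary, base_row, end_row, col_start, col_end, theta_l=30):
    """Count object pixels in the box below a baseline junction point."""
    h_b = abs(base_row - end_row)                          # Eq. (3)
    box = binary[base_row + 1: base_row + 1 + h_b, col_start:col_end]
    return int(box.sum()) > theta_l
```

The default values 5 and 30 are the thresholds reported in the text; they are exposed as parameters here only so that small synthetic images can be tried.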

Fig. 8 Example of baseline detection

Fig. 9 Examples of lower modifier segmentation for a Hindi, b Marathi, c Punjabi, and d Bangla languages

4 Experimental results and discussion

In this section, we discuss the experimental results of the proposed method and their analysis.

4.1 Experimental dataset

We have taken about 3000 handwritten words for each of the four languages, giving 12000 handwritten words in total. The datasets for Hindi and Bangla are obtained from [43], and the dataset for Punjabi is obtained from [32]. The Marathi dataset was collected from 20 individuals of different age groups. Each Marathi image was digitized at 300 dpi using a Canon Pixma E560 flat-bed scanner. The implementation was done in MATLAB.

4.2 Results for character segmentation

The method described above performs segmentation for four Indian languages. Some image results for Hindi and Marathi are shown in Fig. 10, and for Punjabi and Bangla in Fig. 11. The proposed method also handles some special cases such as broken words, words without a header line, and skewed words. These cases are described as follows:

Fig. 10 Experimental results for Hindi and Marathi languages. Alternating gray and black colors are used to represent the segmented components

Fig. 11 Experimental results for Punjabi and Bangla languages. Alternating gray and black colors are used to represent the segmented components

Character Segmentation of Skewed Words: As described in Section 3.2, the method works well on skewed words. Image results for skewed words in all four languages are shown in Fig. 12.

Fig. 12 Examples of character segmentation of skewed words for a Hindi, b Marathi, c Punjabi, and d Bangla languages

Character Segmentation of Broken Words:

In handwritten text, a word can be broken in two ways. In the first type, some of the characters become detached from the word while the writer is writing in flow, which makes character segmentation difficult. A few examples of such broken words are given in Fig. 13. Our proposed method works well on input with broken characters: segmentation is performed on the remaining connected part of the word (Fig. 14). The broken parts of the word do not require any segmentation and can be passed to classification as they are.

Fig. 13 Broken word image examples of Type-I

Fig. 14 Segmentation of broken words of Fig. 13

In the second type of broken word, the header line is broken in the middle, resulting in two or more parts in a single word. Words of this type are found mainly in Bangla. To handle such cases, the connected components of the word image are found and the proposed method is applied to each component, yielding segmented individual characters. Figure 15 shows an example of a word with a broken header line.

Fig. 15 Broken word image example of Type-II
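Splitting a Type-II broken word into its connected components can be sketched with a simple breadth-first labeling. The text does not state which labeling algorithm or connectivity was used; the sketch below assumes 8-connectivity, and the function name `connected_components` is hypothetical.

```python
from collections import deque
import numpy as np

def connected_components(binary):
    """Label 8-connected components of object pixels; returns (labels, count)."""
    height, width = binary.shape
    labels = np.zeros((height, width), dtype=int)
    count = 0
    for r in range(height):
        for c in range(width):
            if binary[r, c] == 0 or labels[r, c] != 0:
                continue
            count += 1                     # start a new component
            labels[r, c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Each labeled component can then be cropped and passed through the character segmentation steps on its own.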

Character Segmentation of Words without Header line: Many writers do not draw the header line while writing in flow. Without a header line, segmentation becomes difficult. Such cases are also found mainly in Bangla. To overcome this problem, we estimate the header line of the word using a linear curve fitting-based method proposed in [39]. After estimating the header line, character segmentation is applied as discussed in Section 3. Some example images for Bangla are given in Fig. 16.

Fig. 16 Image examples of broken header line in word in Bangla language.

Detailed experimental results for all four languages (Hindi, Marathi, Punjabi, and Bangla) are reported in Tables 2, 3, and 4 for character segmentation, upper modifiers’ segmentation, and lower modifiers’ segmentation, respectively. Lower modifiers in Punjabi are not connected to the characters, which is the reason for the 100% accuracy of lower modifier segmentation for Punjabi. The combined average accuracy of character segmentation over all the languages is 90.07%. For upper modifiers, the average accuracy over the four languages is 90.33%, and for lower modifiers it is 93.99%.

Table 2 Character segmentation result of all four languages
Table 3 Upper modifiers’ segmentation result of all four languages
Table 4 Lower modifiers’ segmentation result of all four languages

4.3 Character recognition result

In this section, we discuss the extracted features and the classifiers used for character recognition.

4.3.1 Feature extraction

Two categories of features [40] are extracted for each separated binary component of the word, as given below.

  i. Cumulative stretch in four different directions, and

  ii. Shadow features [14].

Cumulative stretch features are extracted for four directions, namely horizontal, vertical, left diagonal, and right diagonal. For shadow feature extraction, each component is divided into 8 octants, and each octant has 3 sides. Light is projected from 3 different directions and the length of the shadow is computed on each side of the octant. A total of 84 cumulative stretch features and 120 shadow features are obtained, giving 204 features for each component.

4.3.2 Classifiers

We have applied the extracted features to three classifiers, namely Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Random Forest.

MLP: An MLP, also known as a feed-forward neural network, consists of multiple layers of computational units interconnected in a feed-forward way. It maps the input to the predicted output. It uses the back propagation learning technique, in which the output values are compared with the correct answers to compute the mean square error. This error is fed back through the network, and the weights of the network are updated to reduce the mean square error between the actual and desired outputs. The network is considered trained when this error is minimized. In the present work, we use an MLP with one hidden layer.

This classifier is widely used for classification in various areas. Pramanik and Bag [38] have used an MLP for classification of Bangla compound characters, Zhang and Sun [50] have used a single layer feed-forward neural network to develop an angiosperm genus classification system, and Wang et al. [48] have also used a single layer feed-forward neural network for recognition of facial emotions.

SVM: SVM is a set of supervised learning methods that analyze data and recognize patterns for classification. In an SVM model, each data item is represented as a point in n-dimensional space, mapped so that the points of different categories are divided by a clear gap. More formally, an SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space that separates the classes. A hyperplane achieves better separation when the nearest data points of every class lie as far from it as possible. A larger distance (margin) between the hyperplane and these data points leads to a lower generalization error of the classifier. In this work, we have used a polynomial kernel in the SVM model for classification.
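
For reference, the conventional form of a polynomial kernel is K(x, y) = (γ⟨x, y⟩ + c)^d. The sketch below evaluates it in plain Python. The degree, γ, and c used in the experiments are not stated here, so the defaults in `polynomial_kernel` are placeholders, and the function names are illustrative only.

```python
def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Return (gamma * <x, y> + coef0) ** degree for two equal-length vectors.

    The default parameter values are assumptions, not the paper's settings.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def gram_matrix(samples, **kernel_params):
    """Kernel value for every pair of samples; this is what an SVM solver consumes."""
    return [[polynomial_kernel(a, b, **kernel_params) for b in samples]
            for a in samples]
```

Because the kernel depends on the samples only through their inner product, the same code applies unchanged to 204-dimensional feature vectors.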

Random Forest: Random forest is a supervised learning method that operates by constructing multiple decision tree classifiers during the training phase. To classify a new object on the basis of its attributes, the decision of the majority of the trees is taken as the final decision. The random forest classifier reduces the problem of over-fitting by using multiple trees, each trained on a re-sampled version of the data and on a different subset of the features. Moreover, the random forest classifier handles missing values, which makes it more robust to noise. These two advantages make it a potential choice for classification.
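
The three ingredients named above (re-sampling the data, varying the features per tree, and majority voting) can be sketched as follows. This is an illustrative outline in plain Python rather than a full random forest: tree construction is omitted, and the helper names and the subset size of 14 features are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) training rows with replacement (the re-sampling step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, subset_size, rng):
    """Pick the feature indices that one tree is allowed to use."""
    return sorted(rng.sample(range(n_features), subset_size))


def majority_vote(tree_predictions):
    """Final decision: the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
# 204 is the feature count per component; 14 (about sqrt(204)) is an assumed subset size.
feature_ids = random_feature_subset(204, 14, rng)
final_label = majority_vote(["ka", "kha", "ka", "ga", "ka"])
```

Here `final_label` is the label chosen by most of the (hypothetical) trees, which mirrors the voting rule described in the paragraph above.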

The recognition of characters is done using the above-mentioned classifiers. Table 5 shows the accuracy for each language with all three classifiers. Among all tested languages and classifiers, Hindi shows the highest accuracy, 96.09%, with the MLP classifier. Marathi and Punjabi also achieve their best accuracies with the MLP classifier, scoring 90.50% and 91.21% respectively. Bangla shows a relatively lower accuracy of 85.74%, obtained with the SVM classifier, in comparison to the other three languages. The wide variation in Bangla writing styles produces large deviations within the dataset, which makes character recognition difficult and lowers the recognition accuracy for this language. The best accuracy obtained for each of the four languages among the three classifiers is shown in bold.

Table 5 Character recognition accuracy for each language

We also applied a deep learning architecture for recognition of characters. A deep convolutional neural network (CNN) is a feed-forward multilayer network trained in supervised mode. A CNN consists of a number of layers, namely convolutional layers, pooling layers, and fully connected layers. There can be any number of fully connected layers after the convolutional and pooling layers, as in a standard multi-layer neural network. The CNN model takes the whole binary image of dimension 32 × 32 as input, flattened into a single one-dimensional array. Thus, a one-dimensional array of size 1024 is used as the feature vector. Of the total data for each language, 80% is used for training the model and the rest is used for testing. The training data is shuffled beforehand to reduce the risk of over-fitting. The model is trained with a batch size of 100 and a learning rate of 0.001. The CNN classifier is evaluated on all four languages. In our case, the CNN obtains an accuracy of 94.71% for Hindi, which is lower than the accuracy obtained by each of the other three classifiers. For Marathi and Punjabi, the CNN obtains accuracies of 89.16% and 89.84%, which are better than those of the random forest classifier for the same languages. For Bangla, the CNN obtains an accuracy of 83.77%, which is the same as that of the MLP classifier.
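
The input preparation described above (flattening each 32 × 32 binary image into 1024 values, then a shuffled 80/20 split) can be sketched in plain Python. The function names, the row-major flattening order, and the fixed seed are assumptions made for illustration; the CNN itself is not shown.

```python
import random


def flatten_image(image):
    """Row-major flatten of a 32 x 32 binary image into a 1024-element list."""
    if len(image) != 32 or any(len(row) != 32 for row in image):
        raise ValueError("expected a 32 x 32 image")
    return [pixel for row in image for pixel in row]


def shuffled_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples, then cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def batches(samples, batch_size=100):
    """Yield consecutive mini-batches (batch size 100, as stated in the text)."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

Shuffling before the cut keeps both parts representative of the whole dataset, and every sample ends up in exactly one of the two parts.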

4.4 Parameter tuning

Figure 17 shows the tuning of the threshold 𝜃u used to check for the presence of an upper modifier, as discussed in Algorithm 1. We have taken 400 random word images containing upper modifiers, 100 from each language, for the parameter tuning. Figure 17 plots the accuracy for different values of 𝜃u for all four languages. The accuracy of upper modifier segmentation varies across languages and values of 𝜃u, but all languages except Marathi obtain their best accuracy for 𝜃u ≥ 30. Marathi shows a slight decrease in accuracy for thresholds > 30. So, choosing 𝜃u = 30 gives the best accuracy for upper modifier segmentation, as depicted in Fig. 17.

Fig. 17 Tuning of parameter 𝜃u used in Algorithm 1

Similarly, Fig. 18 depicts the tuning of the threshold 𝜃w of Algorithm 1 for all four languages. The same 400 images have been tested for various threshold values starting from 𝜃w = 2. All languages except Punjabi obtain their best accuracy for 𝜃w ≥ 10, as shown in the graphs, while Punjabi achieves its best accuracy for 𝜃w ≥ 8. So, 𝜃w ≥ 10 has been chosen as the common threshold for all four languages.

Fig. 18 Tuning of parameter 𝜃w used in Algorithm 1

The tuning of the threshold 𝜃l of Algorithm 2 is shown in Fig. 19 for all four languages. The graph for Punjabi is a straight line at 100%. This is because lower modifiers in Punjabi are not connected to the characters of the middle zone, so the segmentation accuracy is 100% for any value of the threshold. All languages other than Punjabi obtain their best accuracy for 𝜃l ≥ 30.

Fig. 19 Tuning of parameter 𝜃l used in Algorithm 2
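
The selection rule used in this section (pick a threshold at which every language is at its best accuracy) can be written down as a small helper. This is a sketch of the reasoning only: the function names are hypothetical, and the accuracy values in the usage example are made-up placeholders, not values read from Figs. 17 to 19.

```python
def best_thresholds(curve):
    """Thresholds at which one language reaches its maximum accuracy.

    `curve` maps a threshold value to the accuracy measured at that value.
    """
    peak = max(curve.values())
    return {threshold for threshold, acc in curve.items() if acc == peak}


def common_threshold(curves):
    """Smallest threshold that is optimal for every language, or None if none exists."""
    shared = set.intersection(*(best_thresholds(c) for c in curves.values()))
    return min(shared) if shared else None


# Placeholder curves for two unnamed scripts (illustrative numbers only).
example_curves = {
    "script_a": {20: 0.80, 30: 0.90, 40: 0.90},  # stays at its best from 30 onwards
    "script_b": {20: 0.85, 30: 0.92, 40: 0.91},  # drops slightly above 30
}
chosen = common_threshold(example_curves)
```

With these placeholder curves the helper returns 30, the only value at which both scripts are at their peak, which is the same pattern of argument used above for 𝜃u.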

4.5 Comparison with other methods

To the best of our knowledge, only a single work, by Bhattad and Chaudhuri [8], reports character segmentation for two different languages using a single approach. So, to show the competency of our proposed method, we compared the character segmentation accuracy for each individual language against state-of-the-art approaches for each dataset. We compared our method with the recent works of Srivastav and Sahu [47] and Sarkar et al. [44] for Hindi, Ramteke et al. [41] for Marathi, Mangla and Kaur [29] for Punjabi, and Arefin et al. [1] for Bangla. As the databases used in the above papers are not available, we implemented the above methods and tested them on 500 words of each database, as discussed in Section 4.1. Table 6 shows the character segmentation accuracy for the above-mentioned methods and our proposed method. We observe that the character segmentation performance of our method is much better than that of the existing methods for all the languages. The reason is that the compared methods fail to segment skewed words. In the method proposed by Bhattad and Chaudhuri [8], lower modifiers are not segmented, which decreases its accuracy. In that method, the segmentation of upper modifiers is also followed by a post-processing step that reconnects them to the original characters, whereas our proposed method does not require any post-processing for modifier segmentation. Our proposed method is tested on four languages and gives satisfactory results, whereas [8] is tested on only two languages.

The success rate of our proposed method is 91.04% and 89.47% for Hindi and Bangla respectively, which is much better than the 80.59% and 82.26% obtained for Hindi and Bangla by the method of Bhattad and Chaudhuri [8]. It is also better than the methods proposed by Sarkar et al. [44] and Srivastav and Sahu [47] for Hindi and by Arefin et al. [1] for Bangla. Arefin et al. [1] have not mentioned the segmentation of modifiers, so this cannot be compared with our proposed method. For Marathi and Punjabi, our method gives accuracies of 88.72% and 91.82%, which are better than those of the corresponding methods. Upper and lower modifier segmentations are not mentioned separately in those papers, so they also cannot be compared. Only Sarkar et al. [44] and Srivastav and Sahu [47] have provided the segmentation accuracy for modifiers for Hindi, so the modifier segmentation accuracies of these methods are compared with those of our proposed method. Accuracies obtained by the proposed method are shown in bold.

Table 6 Comparison of character segmentation accuracy of the proposed method with the state-of-the-art methods

A comparison of the average process speed of the methods compared above is given in Table 7. This comparison is carried out on the same language datasets as specified above. The method proposed by Bhattad and Chaudhuri [8] is computationally less expensive but shows poorer performance. The process speed of our proposed method depends on the structure and length of the word. Our proposed method uses polygonal approximation, which is based on chain code, and this makes its processing comparatively time-consuming. Even so, our method shows better computational efficiency for Marathi and Punjabi than the other methods. The best average process speed for each language among all the methods is shown in bold.

Table 7 Average process speed (in seconds) taken by different character segmentation methods

5 Conclusion

We have proposed a novel script-independent character segmentation technique based on polygonal approximation. The use of structural properties of all the languages helps us to generalize the segmentation process. The method is tested on four popular Indian languages and gives promising results for all four, with an average segmentation accuracy of 90.07%. We have also performed character recognition with random forest, SVM, MLP, and CNN classifiers, with satisfactory segmentation and recognition results. In future, we shall extend our work to improve the accuracy by handling the failure cases and to apply the method to character segmentation of more languages.