1 Introduction

Handwriting is the art of drawing letters, words, or symbols on a surface with pen or pencil. A digital handwriting system captures a handwritten character by sequence of strokes. Online handwriting recognition is the process of analyzing these strokes and determining a particular character from a selected character set with a requisite degree of confidence. With the invention of affordable pen/stylus-based devices, research in the area of online handwriting recognition has received attention from many researchers [1, 2]. A limited amount of work in online character recognition of Gurmukhi has been carried out in recent past [3, 4]. However, a lot of works in handwriting recognition have been done for simplified Chinese, traditional Chinese, English, Japanese, and Korean languages [57]. In the literature, isolated character recognition for many Indic languages like Bangla, Hindi, Tamil, and Telugu has been reported by many researchers [811]. In this paper, recognition of Gurmukhi script, a popular script in northern part of India to write Punjabi language, has been attempted. This script consists of 9 vowels (laga matras), 41 consonants, \(bind\bar{\imath }\), ṭ\(ipp\bar{\imath }\) as two symbols for nasal sounds and addak as a symbol to duplicate the sound of a consonant. It is written from left to right. Table 1 illustrates various characters of Gurmukhi script. A sequence of points captured or drawn during a pen-down and pen-up event, while writing, is called a stroke. Gurmukhi characters are written in various styles that magnifies the difficulty in recognizing a particular character. The strokes for writing characters in Gurmukhi script can be drawn in one of the three horizontal zones, namely, upper, middle and lower zones as depicted in Fig. 1. Most of the character symbols of Gurmukhi are written in middle zone. The region above the headline denotes the upper zone, where some of the vowels () and sub-parts of some other vowels () reside, while the middle zone represents the area below the headline where the consonants and some sub-parts of vowels () are present. The lower zone represents the area below middle zone where some vowels and certain half characters lie in the foot of consonants. A Gurmukhi character can be written with a single or a combination of strokes, with stroke considered as the smallest unit. A major problem in character recognition of Gurmukhi is detection of headline and baseline. Appearance of similar looking strokes in these three zones results in formation of different characters which escalates the issue. This paper focuses on creation of a single classifier for estimation of strokes in the three zones. The solution has further been augmented with a robust postprocessing mechanism to transform the set of strokes into Gurmukhi characters.

Table 1 Chart of Gurmukhi characters

This paper is organized into seven sections. This section introduces basic concepts behind the work. Section 2 describes work done in the area of handwriting recognition for various languages. Section 3 describes features of Gurmukhi script and how handwritten data for Gurmukhi script are collected. This section also discusses various stages of a Gurmukhi character recognition system. Section 4 explains how various features have been calculated and then organized for recognition of strokes after applying various preprocessing steps on the collected data. Section 5 describes two classifiers, namely support vector machine (SVM) and hidden Markov model (HMM) that have been used for classification of strokes in this work. This section also explains how the models are constructed from the training sets of various features as obtained in Sect. 4. Section 6 describes the postprocessing steps performed to generate Gurmukhi characters from the identified strokes. Section 7 concludes the work done with salient findings.

Fig. 1
figure 1

Zones in Gurmukhi script

2 Related work

In the computing environment, keyboards had been an effective interface for interacting with computers, but with advancements in computational technology, and advent of touch-based systems, a more natural way of interacting with the computers is being explored. Handwriting is one such natural way that was pioneered and patented by Yaeger et al. [12]. This section briefly reviews the literature on handwriting recognition systems. Plamondon and Srihari [13] in their survey on online and offline handwriting recognition systems explained various processes involved, starting from input to final understanding/recognition of characters for a language. They have discussed categories of recognition methods, namely structural/rule-based methods and statistical methods in their work in order to recognize handwriting with pen based devices. Structural/rule-based methods involve with defining robust and reliable rules for recognition purposes. In Statistical methods, shape of a graphical mark known as stroke is described by fixed number of features, and the classes of these graphical marks are described by multidimensional probability distribution. Various recognition methods have been experimented by researchers across the world for handwriting recognition of scripts. These methods include artificial neural network (ANN), hidden Markov model (HMM), support vector machine (SVM), dynamic time warping (DTW), elastic matching and rule-based methods. Table 2 illustrates the recognition methods used and accuracy achieved by various researchers for different languages.

Table 2 Recognition techniques used and accuracy achieved for different languages

3 Data collection and steps in recognition process

Verma and Sharma [4] identified and grouped the strokes as per the zone in which they appear. The number of stoke classes considered by them was 102. They took 13 stroke classes in upper zone, 82 stroke classes in middle zone and 7 stroke classes in lower zone, and proposed zone-wise classifiers in their work. In this study, a different approach for classification has been tried where a single classifier is built with all the stroke classes, irrespective of their zone of appearance. In this process, similar looking strokes have been merged and a common strokeID has been assigned to them irrespective of their zones. For example, 123 () and 131 () have been merged and assigned a common strokeID 123; 122 () and 132 () have been merged and assigned a common strokeID 122; 133 () and 134 () have been merged and assigned a common strokeID 133. This resulted in a total of 74 stroke classes, which have been used for training of classifiers in this work.

3.1 Data collection

In this work, we have addressed the problem of identification of characters of Gurmukhi script. A character in Gurmukhi script can be written with one or more stroke combinations. One of the major steps in recognition of Gurmukhi characters is recognition of these strokes. Based on recognized strokes, and their combination with nearby strokes, the final Gurmukhi character is formed. To start with, a stroke classifier needs to be trained for recognition of strokes. For this purpose, a total of 44301 samples of Gurmukhi words have been collected from 124 writers using Tablet PC device.

For each word, all strokes along with their writing sequence have been recorded in an XML format. For each stroke, xy traces of digital PEN on the touch screen between successive pen-down and pen-up events have been recorded. The writers selected here were well-versed in writing Gurmukhi script. In order to have variations in handwriting samples, the writers were selected from various regions of Punjab (INDIA). For the purpose of training of classifiers, the strokes that were written by writers as per their correct script formation in an unconstrained environment were selected and annotated. A total of 74 stroke classes, as mentioned above, have been identified. Table 3 contains these strokes, the strokeIDs associated with them and the number of samples of these strokes used for training.

The developed online Gurmukhi character recognition system has been tested on the data collected from 10 new users. These users wrote each Gurmukhi consonant five times, giving a set of 1750 Gurmukhi characters. This data set is referred as testdata in Sect. 6.

Table 3 Gurmukhi strokes used for classification

3.2 Steps involved in recognition of characters

This section includes a brief description of the steps involved in recognition of Gurmukhi characters.

  1. (a)

    Stroke capturing: As the first step, we capture the xy traces of each stroke written by the writers.

  2. (b)

    Preprocessing: After capturing a stroke, the xy traces are preprocessed for extraction of features in the next step. During preprocessing, a noise or distortion which arises due to software limitations is removed from original stroke. Normalization and centering of the stroke, where the input stroke is fitted into a fixed size window and is moved to the center location, is then performed. During writing of a stroke, some points may be missed by the touch sensor. These missing points are interpolated using Bézier interpolation technique. Jitters are removed from the stroke by averaging each point with its neighbors. After performing the preprocessing steps, we obtain 64 preprocessed xy traces of the stroke in three different-sized windows, i.e., \(200 \times 200\), \(300 \times 300\), and \(400 \times 400\), to investigate the effect of windows size on the performance of stroke recognition.

  3. (c)

    Feature extraction: Features required for classification of a stroke as per model chosen are extracted from 64 preprocessed xy traces, which are used for recognition of stroke. The features used in this work are elucidated in Sect. 4.

  4. (d)

    Classification: SVM- and HMM-based classifiers have been used for classification of strokes. The results for recognition of strokes using radial basis function kernel in SVM and HMM for different feature sets are presented in Sect. 5.

  5. (e)

    Postprocessing: Zone identification is other important step in character formation, as certain strokes based on their appearance in different zones produce different characters. Zone of a stroke is identified based on its relative position compared to other strokes. Certain other postprocessing steps like rearrangement of strokes, stroke grouping/merging are applied on these recognized strokes to form the characters.

4 Feature extraction

Five different features, namely normalized xy traces (\(N_{xy}\)), region-based features (\(R_{xy}\)), curvature features (\(C_{xy}\)), curvature feature-based classes (\(C^N_{xy}\)), and directional features (\(D_{xy}\)), have been considered for classification models. This section now explains briefly how these features have been extracted from the preprocessed x-y traces.

  • Normalized xy traces ( \(N_{xy}\) ): The 64 preprocessed traces are considered as first feature set. This feature set contains 128 elements for a given stroke sample and is referred as \(N_{xy}\) feature set.

  • Region-based features ( \(R_{xy}\) ): As explained in Sect. 3.2, each stroke is normalized to windows of size \(200 \times 200\), \(300 \times 300\) and \(400 \times 400\). The windows were further divided into 100 small equi-sized sub-windows and labeled as \(w_1\), \(w_2, \ldots, \) \(w_{100}\). Each point in 64 preprocessed traces will lie in one of these 100 sub-windows. The window number where the point lies is taken as a feature value. As such, this feature set will have 64 elements and is referred as \(R_{xy}\) feature set.

  • Curvature features (\(C_{xy}\) ): Curvature is defined as the degree to which something is curved. The 64 normalized xy traces have been used to calculate curvature at each point. The curvature at a point (xy) for the curve \(y = f(x)\) is given by Eq. (1) [34].

    $$\begin{aligned} R_{\mathrm{curve}}=\frac{\left( 1+ \left( \frac{\mathrm {d} y}{\mathrm {d} x}\right) ^{2}\right) ^{\frac{3}{2}}}{\frac{\mathrm {d}^2 y}{\mathrm {d} x^2}} \end{aligned}$$
    (1)

    The curvature at a point is calculated using three adjacent points. Since we need at least three points to calculate second-order derivative \(\frac{\mathrm {d}^2 y}{\mathrm {d} x^2}\), we shall get a feature vector of size 62 from preprocessed xy traces. The features thus obtained are referred as \(C_{xy}\).

  • Curvature feature-based classes ( \(C^N_{xy}\) ): The values of curvature, \(C_{xy}\), lie in the range (0, \(\infty \)). These values have further been divided into 20 discrete classes based on uniform frequency of curvature values, and these classes have been assigned a code. Table 4 illustrates the coding pattern of these discrete classes along with range of curvature values. This feature set will contain 62 such values representing the class in which the curvature value appears and is referred as \(C^N_{xy}\).

  • Directional features ( \(D_{xy}\) ): To calculate directional features, 64 preprocessed xy traces (\(P_1\), \(P_2, \ldots, \) \(P_{64}\)) on the trajectory of a stroke as shown in Fig. 2 have been used. The angle between two points \(P_i\) and \(P_{i+2}\), \(i=1, 2, \ldots, 62\) is calculated and coded using a quantized vector consisting of 12 different values as illustrated in Table 5. This feature is referred as \(D_{xy}\).

Table 4 Coding of curvature features in \(C^N_{xy}\)
Fig. 2
figure 2

Sequence of 64 preprocessed xy traces

Table 5 Coding of directional features

5 Classification models

The features extracted from a stroke have now been used to build a classifier for recognition of the stroke. In this section, we explain two different classifiers experimented for classification of strokes. There are 9129 samples that have been used to train the recognition engine. The two classification approaches are given in following subsections.

5.1 Support vector machine (SVM)

For the classification process, SVM with radial basis function (RBF) kernel has been used. Other parameters of SVM that have empirically been optimized in this work are learning rate (\(\gamma \)) and penalty parameter (C). Three preprocessed data sets having window size \(200 \times 200\), \(300 \times 300\), and \(400 \times 400\) have been used to obtain features sets. Five different features, explained in Sect. 4, have been obtained from these preprocessed data sets. These 15 data sets were experimented for classification using k-fold cross-validation for three different values of k as mentioned in Table 6. Table 6 also contains the values of other parameters used for training SVM-based classifiers. As such, 45 different combinations have been tested with these parameters.

Table 6 SVM parameters

It is well established that scaling of inputs affects the performance of SVM. Experiments have also been conducted to find an efficient range for scaling the input parameters that yields better accuracy. These experiments suggest to fix the scale of interval as [1, 9] since highest accuracy was achieved for this scaling. The results obtained using five features with selected SVM parameters are illustrated in Table 7. Table 7 also gives the accuracy of the model on data consisting of 920 stroke samples. These 920 samples have not been used for training of SVM classifiers.

5.2 Hidden Markov model (HMM)

The features, namely \(R_{xy}\), \(C^N_{xy}\), and \(D_{xy}\) as explained in Sect. 4, have been used to build HMM model \(\lambda = (\pi , A, B)\) for each stroke as proposed by Rabiner and Juang [35]. The other two features, \(N_{xy}\), and \(C_{xy},\) have not been used for building HMM models due to their large observation space. In \(N_{xy}\) feature set, the number of observations equals the dimension of window size used (i.e., 200 in case of \(200 \times 200\)), whereas in the case of \(C_{xy}\) feature set, number of observations cannot be fixed as the feature is a floating point value. The parameters \(\pi \), A, and B for each strokeID have been calculated for the feature sets. For \(R_{xy}\) feature set, the observation set is O = \(\{1, 2, \ldots, 100\}\) and the set of states is \(S= \{1, 2, \ldots, 64\}\). For training HMM model for \(C^N_{xy}\) features, the observation set is \(O = \{1, 2, \ldots, 20\}\), and the set of states is \(S = \{1, 2, \ldots, 62\}\). For building HMM model with \(D_{xy}\), the observation set is \(O = \{1, 2, \ldots, 12\}\), and the set of states is \(S = \{1, 2, \ldots, 62\}\). Algorithm 1 has been used to create HMM models from the three features sets. Algorithm 2 has now been used to test a sample for its matching strokeID.

figure k
figure l

As such 27 different combinations have been tested. The results obtained using the HMM models with three features are presented in Table 7. Table 7 also shows the accuracy of these models on unseen data of 920 stroke samples, which were not used for training the models.

5.3 Summary of stroke classification results

In this work, two classification models, namely SVM and HMM, have been implemented. Five different features have been experimented with these classification models, resulting into eight feature–classifier combinations. Each of these combinations has been tested for three different preprocessed window sizes, i.e., \(200 \times 200\), \(300 \times 300\), and \(400 \times 400\). As shown in Table 7, feature–classifier combinations in \(300 \times 300\) window size are performing better than feature–classifier combinations in other two window sizes. In six of the eight feature–classifier combinations, the models resulted in a performance of more than 82.3 %. The features \(C_{xy}\) and \(C^N_{xy}\) have a low performance (\({<}39.7\) %) for SVM-based classifier. The \(N_{xy}\) feature yields the highest stroke recognition accuracy of 96.5 % among SVM-based classifiers. In HMM-based classifiers, features \(R_{xy}\) and \(C^N_{xy}\) achieved stroke recognition rates of 95.5 and 95.0 %, respectively. Direction feature \(D_{xy}\) yields the stroke recognition accuracy of 85.3 and 90.4 %, respectively, when used with SVM and HMM. It has also been observed that a very large training set is required to train a HMM model having features with large observation set, to minimize sparsity in transition and observational probability matrices of HMM. The two features \(N_{xy}\) and \(C_{xy}\) could not be used in training owing to data set used in this work. The three best feature–classifier combinations obtained from these experiments are \(C^N_{xy}\)-HMM, \(R_{xy}\)-HMM, and \(N_{xy}\)-SVM with stroke recognition accuracies of 95.0, 95.5, and 96.5 %, respectively, for window size of \(300 \times 300\).

Table 7 Feature-wise accuracy using SVM- and HMM-based classifiers

6 Postprocessing and character generation

The three best feature–classifier combinations resulted from the experiments illustrated in Sect. 5 were further used to implement the character recognition system. A voting-based classifier using these three classifiers have also been tested. After recognition of strokes using a classifier, the strokeIDs are processed further to generate characters of Gurmukhi script in postprocessing phase. As defined in Sect. 1, middle zone of Gurmukhi script is the busiest zone and hence most of the stroke classes selected for recognition in Table 3 are middle zone strokes. Some stokes with similar shape appear in different zones as illustrated in Table 8. Based on appearance of these strokes in different zones, different characters are formed. A zone detection algorithm in order to detect the position of strokes relative to middle zone stroke has been implemented. In this algorithm, original xy traces of the strokes are used for detecting their zone. The zonal information along with strokeID is passed to the character recognition process. The scheme proposed by Kumar et al. [3] has been used to recognize the Gurmukhi script characters.

Table 8 Some issues in character formation caused by zoning

Four character recognition systems developed using three best feature–classifier combinations and a voting-based classifier have been tested on testdata. The results obtained for this data set are depicted in Table 9. The proposed systems could achieve an average accuracy of 83.5 % for Gurmukhi character and an average accuracy of 88.0 % for character . The reason behind such a low accuracy achieved in recognition of these characters is due to confusion of strokes having strokeID 164 (), 165 () with strokeID 155 () and 156 (), respectively. Due to misclassification of these strokes, the resultant character is different from the ground truth. The system achieved 88.0 % accuracy for characters . The reason behind this low recognition rate is due to the stroke combinations, as mentioned in Table 8. Misidentification of zone of stroke is another common reason for low recognition rate of certain characters, namely , and . The system achieved an average recognition accuracy of 100.0 % for the characters , and . The average character recognition time in milliseconds (ms) are also elucidated in Table 9. The recognition time is calculated as the time taken by character recognition system from time of submission of strokes to the final character generated on screen. It is evident that HMM-based classifiers are faster than SVM-based classifiers. It is also evident from Table 9 that voting-based classifier, as expected, performs better than other classifiers with an overall character recognition accuracy of 96.7 %.

Table 9 Character-wise recognition accuracies of different models

7 Conclusion

We have implemented three models for recognition of online handwritten Gurmukhi characters. These three models are based on SVM and HMM classifiers. A voting-based classification model based on these three models has also been implemented. We have also experimented with five different features in this work. The writing styles of Gurmukhi script users give rise to different shapes of strokes using which the characters are formed. We have carried out a brief study on the issues that are prominent in dealing with formation of confusing characters.

In an earlier study on online Gurmukhi script recognition, an accuracy of 87.4 % has been achieved using elastic matching technique by Sharma et al. [27]. They experimented with HMM models using 130 samples of each Gurmukhi character and achieved a recognition rate of 91.9 %. The HMM model was built on direction code-based feature as explained in Sect. 4. The feature used by them has eight different values of direction, whereas in the current work, we have used 12 different directions. Verma and Sharma [4] in their work on zone-based features for online Gurmukhi script achieved a recognition rate of 92.1 % for normalized diagonal zone-based features. In this work, we have achieved a higher recognition accuracy of 96.7 % for Gurmukhi script character recognition using a voting-based classifier. As expected, the average recognition time taken by voting-based classifier is more than other three classifiers. These studies are summarized in Table 10.

As a further work in this direction, the low recognition accuracies of some of the confusing characters in Gurmukhi script can be improved by adding a more robust zone detection algorithm. Different classification techniques based on multilayer perceptron and linear discriminant analysis (LDA) for different feature combinations can also be tested. One can also explore the effect of training data on recognition rates of classification models.

Table 10 Comparison of results with earlier approaches