1 Introduction

Transferring data between human beings and computers plays an important part in document analysis and recognition. An optical character recognition (OCR) system is a key component of a document analysis system and has evolved for the identification of both printed and handwritten text: each document is segmented into lines, words, and characters for the recognition and analysis of the document text. Character recognition is a procedure which associates a symbolic definition with objects (letters, symbols, and numerals) drawn in an image. There are twenty-two official languages in India and about twelve distinct scripts used to write them, of which Gurumukhi is one. The Gurumukhi script is used for writing the Punjabi language, which is the twelfth most widely spoken language in the world. Gurumukhi is composed of 35 basic characters, comprising 3 vowel bearers and 32 consonants. In addition, the Gurumukhi script includes 6 additional consonants, 3 half characters, and 9 vowel modifiers. Gurumukhi is written from left to right and top to bottom and is case insensitive. Many algorithms have been put forward for character recognition, but their effectiveness is still not satisfactory. So, there is still scope to work on Gurumukhi character recognition and to improve the recognition accuracy achieved with different methodologies. Moreover, a handwritten Gurumukhi character recognition system finds applications in numerous fields such as reading handwritten notes, form processing, and recognition of handwritten postal codes. In this article, four feature extraction approaches, viz., zoning, peak extent-based, diagonal, and centroid features, are evaluated. These feature extraction approaches have been evaluated by numerous researchers for distinct scripts, and it has been observed that they accomplish commendable results as compared to other approaches proposed for handwriting recognition (Kumar et al. 2013a, b). Initially, each input image is partitioned into n uniform-sized zones for extracting various features (Kumar et al. 2014). Based on these four features, three classification techniques, namely convolution neural network (CNN), decision tree, and random forest, are used for the classification of input characters. In the present work, the authors have improved the recognition accuracy for offline handwritten Gurumukhi characters using the adaptive boosting (AdaBoost) technique. This is an effort toward the establishment of a system that can recognize handwritten Gurumukhi characters effectively. The authors have noticed that most of the work on text recognition has been carried out for non-Indic scripts and very few attempts have been made on Indian scripts. Various challenges have been identified, such as connected characters, variation in character patterns, uncertainty and indecipherability of characters, overlapping characters in a word, unconstrained cursive words, variations in human writing styles, variations in the shapes and sizes of characters, poor input quality, and a low accuracy rate in Gurumukhi character recognition. Other main considerations for Gurumukhi text recognition are the presence of distortions in the headlines of words, touching or overlapping characters, and different shapes of the same character at different occurrences.
Several methods have already been proposed for the recognition of Gurumukhi characters, but they have not achieved considerable recognition accuracy because the quality of the features and classifiers used is not sufficient. This motivated us to propose a hybrid approach combining various techniques to improve the recognition accuracy for Gurumukhi character recognition. Moreover, the authors observed that there are numerous ensemble techniques (such as bagging and boosting) available in the literature that aid in attaining higher recognition rates compared to individual classifiers. Although recognition of Gurumukhi script is a fairly old problem, the AdaBoost ensemble technique has not yet been employed for Gurumukhi character recognition. This fact motivated us to employ the AdaBoost technique in order to attain superior performance in an offline handwritten Gurumukhi character recognition system. The rest of this work is structured as follows. Section 2 details the related work on character recognition techniques. Section 3 discusses the various phases of the presented framework. Section 4 introduces the AdaBoost methodology along with its working. Section 5 presents the experimental results and a comparative analysis of the performance of the different classification techniques. The comparison of the presented work with state-of-the-art work is reported in Sect. 6. Finally, in Sect. 7, the authors conclude the paper and present a few future directions.

2 Related work

The recognition of offline handwritten Gurumukhi characters is a difficult task as a consequence of the different writing styles of individuals, various shapes of characters, and diverse sizes. A few writers use cursive writing, which makes the character recognition system even more complicated. A concise overview of related work on character recognition is presented in this section. Various researchers have dealt with handwritten character recognition for different scripts like Arabic, Bangla, Devanagari, etc. For example, Schwenk and Bengio (1997) applied the AdaBoost approach for decreasing the error rate in the recognition of online handwritten characters, written by 203 individual writers and classified by Diabolo networks and multilayer networks. Schomaker and Segers (1999) proposed a technique using geometrical features for the recognition of cursive Roman handwriting. Alaei et al. (2010) employed an SVM classifier on modified chain code features for a Persian isolated handwritten character recognition system, attaining a recognition accuracy of 98.1%. Ahranjany et al. (2010) presented a combination of CNN with a gradient-based training approach to recognize handwritten Farsi/Arabic digits, achieving an accuracy of 99.17%. Zhu et al. (2010) worked on 35,686 samples of online handwritten Japanese text by employing a robust model, achieving a recognition accuracy of 92.8%. Rampalli and Ramakrishnan (2011) employed a combined online and offline recognition system for online handwritten Kannada character recognition, which improves the accuracy of the online handwriting recognizer by 11%. Venkatesh and Ramakrishnan (2011) used geometrical features in an online recognition methodology that achieved an average accuracy of 92.6%. Gohell et al. (2015) worked on the recognition of online handwritten Gujarati characters and numerals using a low-level stroke feature-based technique, attaining a numeral recognition accuracy of 95.0%, a character recognition accuracy of 93.0%, and a combined numeral and character recognition accuracy of 90.0%. Saabni (2015) suggested the use of the AdaBoost approach for improving accuracy rates in automatic handwriting recognition, tested on 60,000 digit training samples. Antony et al. (2016) recognized the Tulu script using a hybridization of Haar features and the AdaBoost approach. Ardeshana et al. (2016) worked on the recognition of handwritten Gujarati characters by extracting discrete cosine transformation-based features and using a naïve Bayes classifier on 22,000 samples, achieving a recognition accuracy of 78.05%. Shahin (2017) proposed a system for the recognition of 14,000 different printed Arabic words using linear and nonlinear regression methodology, attaining a recognition accuracy of 86.0%. Dargan and Kumar (2018) surveyed the different feature extraction approaches and classification methods used by numerous researchers for writer identification in Indian and non-Indian scripts. Elbashir and Mustafa (2018) presented a CNN paradigm for the recognition of handwritten Arabic characters, giving a test accuracy of 93.5% and a training accuracy of 97.5% on the dataset presented by SUST ALT. Kumar et al. (2018a, b) presented a comprehensive review of character and numeral recognition of non-Indic and Indic scripts, highlighting various constraints and issues faced.
Gupta and Kumar (2019) attained an accuracy of 95.1% for the recognition of document forgery by extracting key printer noise features, oriented FAST and rotated BRIEF (ORB) features, and speeded-up robust features, which were classified by a random forest classifier along with the adaptive boosting approach. Husnain et al. (2019) recognized unconstrained multi-font offline handwritten Urdu characters and numerals with accuracies of 96.04% and 98.3%, respectively, using CNN. Joseph et al. (2019) presented a model which combines CNN as a feature extraction tool and XGBoost as an accurate prediction model to increase the recognition rate of handwritten text. This model was evaluated on 810,000 isolated handwritten English characters, providing an accuracy of 97.18%. Kavitha and Srimathi (2019) used CNN to recognize 156-class offline handwritten isolated Tamil characters from the dataset of 82,929 images prepared by HP Labs India, producing training and testing accuracies of 95.16% and 97.7%, respectively. Ptucha et al. (2019) presented an intelligent character recognition system based on the CNN approach that recognizes handwritten text streams of arbitrary length. Sethy and Patra (2019) achieved an average recognition accuracy of 98.8% by implementing the Axis Constellation approach along with PCA on offline Odia handwritten characters. A few researchers have explored the area of handwritten Gurumukhi character recognition. For example, Garg (2009) exhibited a model for offline handwritten Gurumukhi character recognition based on the idea of the human biological neural network, which normalizes the input character size to 32 × 32 and extracts various features for character recognition. Kumar et al. (2011, 2012) have also presented a model based on the k-NN classifier for offline handwritten Gurumukhi character recognition, achieving a recognition accuracy of 94.12% on 3500 samples, out of which only 350 samples were considered for the testing dataset. Siddharth et al. (2011) further published a model for offline handwritten Gurumukhi character recognition which explores support vector machine and probabilistic neural network classifiers. Singh and Budhiraja (2012) proposed a framework for Gurumukhi numeral recognition using wavelet transform features and a backpropagation neural network for classifying the numerals based on the extracted features, accomplishing a recognition accuracy of 88.83%; they also presented a detailed study of various feature extraction approaches and classifiers and their combinations. Koundal et al. (2017) studied systems for Punjabi printed and handwritten character recognition, exploring their numerous feature extraction and classification approaches. Kaur and Kumar (2021) have proposed a system that uses a holistic approach to recognize a word, where the word itself is considered as an individual item. They employed various statistical features and different classifiers for the recognition purpose and explored the AdaBoost (adaptive boosting) algorithm to boost the recognition accuracy. Using AdaBoost, they attained a maximum recognition accuracy of 88.78% for handwritten Gurumukhi word recognition. But, to date, researchers have not obtained acceptable recognition accuracy for offline handwritten Gurumukhi character recognition; therefore, there is still scope to work on enhancing the recognition accuracy.
In this paper, the authors have captured 14,000 Gurumukhi character samples written by four hundred individuals and improved the recognition accuracy by using the AdaBoost methodology. Yuan et al. (2019) proposed a gated CNN in order to integrate numerous convolutional layers for object detection. This approach was evaluated on two image datasets, PASCAL VOC and COCO, and attained promising results in terms of accuracy and speed as compared to existing approaches. Masita et al. (2020) provided a review on the application of deep learning algorithms in object detection. They inferred that deep neural networks, convolutional neural networks, and region-based convolutional neural networks have been employed as the baseline for numerous robust detection systems, and that deep learning has proved to be efficient in the detection of objects. Wang et al. (2021) proposed a background and foreground seed selection approach in order to detect salient objects. They conducted experiments on publicly available datasets and attained better performance than state-of-the-art approaches. In this work, the authors have considered a noise-free dataset for the experimental work, but the methodology can be applied to other documents after removing noise using the techniques proposed by Bhadouria et al. (2014), who presented various techniques and studies on noise reduction.

3 Description of the proposed system

The different phases used in the offline handwritten character recognition system are digitization and pre-processing, feature extraction, and classification, as shown in Fig. 1.

Fig. 1

Sequence of phases for handwritten character recognition system

3.1 Digitization and pre-processing

Transforming handwritten text written on paper into a computerized form is known as digitization. This conversion is completed by scanning the document, after which a bitmap image ('.bmp') of the original scanned text document is created. For online character recognition, the individual strokes are available, whereas for offline character recognition the document is scanned and only the digital image of that document is obtained. The digital image produced in the digitization process becomes the input for the pre-processing phase, which is the preliminary phase of the presented system. In this paper, the authors produced the digital images of paper-based text documents using an HP-1400 scanner. In the pre-processing phase, the digitized input images are normalized to a size of 88 × 88 pixels using the nearest neighborhood interpolation (NNI) algorithm, and the line width of the text is reduced using a thinning approach, viz., the parallel thinning algorithm proposed by Zhang and Suen (1984).
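As an illustration of this phase, the following minimal sketch (in Python, using scikit-image; not the authors' actual code) normalizes a scanned character to 88 × 88 pixels with nearest-neighbour interpolation and thins it; scikit-image's default two-dimensional skeletonization follows Zhang and Suen (1984). The binarization threshold of 0.5 is an assumption.

```python
import numpy as np
from skimage.io import imread
from skimage.transform import resize
from skimage.morphology import skeletonize

def preprocess(path):
    """Load a scanned character image, binarize it, normalize to 88 x 88, and thin it."""
    gray = imread(path, as_gray=True)             # grayscale values in [0, 1]
    binary = gray < 0.5                           # assume ink is darker than the background
    # order=0 selects nearest-neighbour interpolation in scikit-image
    normalized = resize(binary.astype(float), (88, 88), order=0, anti_aliasing=False) > 0.5
    thinned = skeletonize(normalized)             # default 2-D method follows Zhang and Suen (1984)
    return thinned.astype(np.uint8)               # 88 x 88 array of 0/1 pixels
```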

3.2 Feature extraction

It is used to extract the relevant details contained in the digital image of the input character. The performance of an optical character recognition system relies to a great extent on the extracted details. Various feature extraction techniques used in character recognition systems can be found in the literature. In optical character recognition applications, it is critical to extract those features that allow the system to differentiate between all the existing character classes. Feature extraction approaches are classified into two types, namely structural and statistical features. In the present work, the authors have extracted statistical features, viz., zoning, peak extent-based, centroid, and diagonal features. The zoning feature extraction technique divides the character image into n regions in a hierarchical arrangement as proposed by Kumar et al. (2014). At level L, the input image is divided into \(4^{L}\) sub-images. For instance, the number of sub-images is 4 for L = 1, 16 for L = 2, and so on. Therefore, a \(4^{L}\)-dimensional feature vector is produced for every L. In this work, L = 0, 1, 2, and 3 have been considered by the authors.

In the diagonal feature extraction methodology, a character image is first divided into n regions in hierarchical order. For each region, the features are extracted by moving along the diagonals of its pixels. The centroid feature takes the coordinates of the centroid of the foreground pixels in each region of the input image as the feature values. The peak extent-based feature is formed by summing the peak extents, computed from runs of consecutive black pixels, within each region of the input image.

The above-specified feature extraction approaches have been implemented in various script recognition systems by different researchers and have been found to perform better than other approaches defined for handwriting recognition systems. Also, peak extent-based features have been evaluated by Kumar et al. (2013a, 2015) as performing better than other existing approaches for handwritten Gurumukhi. So, in the presented work, in order to enhance the recognition accuracy of the handwritten Gurumukhi character recognition system, these feature extraction techniques have been implemented both individually and as a blend of the four techniques in a hybrid structure.

Algorithm (a): Zoning Feature Extraction.

To extract these features, the following steps have been used:

Step I The thinned image is divided up to four levels in hierarchical order to create zones.

Step II At each level and for each zone, the number of foreground pixels is calculated.

The above steps, for each level L, will provide a feature set with \(4^{L}\) elements.
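A minimal sketch of these steps is given below, assuming an 88 × 88 binary thinned image (1 = foreground); the helper zones_at_level is illustrative and not part of the original work.

```python
import numpy as np

def zones_at_level(img, level):
    """Split the image into a 2**level x 2**level grid, i.e. 4**level zones."""
    parts = 2 ** level
    rows = np.array_split(img, parts, axis=0)
    return [z for r in rows for z in np.array_split(r, parts, axis=1)]

def zoning_features(img, max_level=3):
    """Foreground-pixel count per zone for levels 0..max_level (1 + 4 + 16 + 64 = 85 values)."""
    feats = []
    for level in range(max_level + 1):
        feats.extend(int(z.sum()) for z in zones_at_level(img, level))
    return np.array(feats, dtype=float)
```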

Algorithm (b): Diagonal Feature Extraction.

For the diagonal feature extraction methodology, the following steps are used:

Step I The thinned image is divided up to four levels in hierarchical order to create zones.

Step II In each zone, the foreground pixels along each diagonal are summed to form sub-features.

Step III Then, to create a single value feature for the corresponding zone, the values calculated in Step II are averaged.

Step IV For the zones which do not have any diagonal foreground pixel, its feature value is considered as zero.

The above steps, for each level L, will provide a feature set with \(4^{L}\) elements.
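The following sketch illustrates these steps under the same assumptions as the zoning sketch; averaging the per-diagonal sums over all diagonals of a zone is our reading of Steps II and III.

```python
import numpy as np

def zones_at_level(img, level):
    parts = 2 ** level
    rows = np.array_split(img, parts, axis=0)
    return [z for r in rows for z in np.array_split(r, parts, axis=1)]

def diagonal_features(img, max_level=3):
    """One value per zone: the average foreground-pixel count over the zone's diagonals."""
    feats = []
    for level in range(max_level + 1):
        for zone in zones_at_level(img, level):
            h, w = zone.shape
            # sum along every diagonal (offsets from -(h-1) to w-1)
            diag_sums = [np.trace(zone, offset=k) for k in range(-(h - 1), w)]
            feats.append(float(np.mean(diag_sums)) if zone.any() else 0.0)  # zero for empty zones
    return np.array(feats, dtype=float)
```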

Algorithm (c): Centroid Feature Extraction.

Step I The thinned image is divided up to four levels in hierarchical order to create zones.

Step II At each level and for each zone, calculate the coordinates of foreground pixels.

Step III Store the coordinates of the centroid calculated from the values obtained in Step II as the feature values.

Step IV The feature value is considered as zero for the zones having no foreground pixel.

The above steps, for each level L, will generate a feature set with \(2 \times 4^{L}\) elements.
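A minimal sketch of the centroid features, under the same assumptions (binary thinned image, illustrative zones_at_level helper), is given below.

```python
import numpy as np

def zones_at_level(img, level):
    parts = 2 ** level
    rows = np.array_split(img, parts, axis=0)
    return [z for r in rows for z in np.array_split(r, parts, axis=1)]

def centroid_features(img, max_level=3):
    """Two values per zone: the (row, column) centroid of its foreground pixels."""
    feats = []
    for level in range(max_level + 1):
        for zone in zones_at_level(img, level):
            ys, xs = np.nonzero(zone)
            if ys.size:
                feats.extend([ys.mean(), xs.mean()])
            else:
                feats.extend([0.0, 0.0])   # empty zone contributes zero (Step IV)
    return np.array(feats, dtype=float)
```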

Algorithm (d): Peak Extent Feature Extraction.

Step I The thinned image is divided up to four levels in hierarchical order to create zones.

Step II For each row in a zone, calculate the sum of successive foreground pixels as the peak extent.

Step III For each row in a zone, the values of successive foreground pixels are replaced by peak extent value.

Step IV In each row, find the largest value of peak extent.

Step V For the respective zone, the summation of the peak extent values (sub-features) is considered as the feature.

Step VI The feature value is considered as zero for the zones having no foreground pixel.

Step VII The values in the feature vector are normalized by dividing each element of the feature vector with the largest value in the feature vector.

The above steps, for each level L, will generate a feature set with \(2 \times 4^{L}\) elements.
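The following sketch follows Steps I to VII under the same assumptions as above; interpreting a peak extent as the length of a run of consecutive foreground pixels in a row is our reading of the description, and the original papers by Kumar et al. define the exact feature.

```python
import numpy as np

def zones_at_level(img, level):
    parts = 2 ** level
    rows = np.array_split(img, parts, axis=0)
    return [z for r in rows for z in np.array_split(r, parts, axis=1)]

def longest_run(row):
    """Length of the longest run of consecutive foreground (1) pixels in a row."""
    best = cur = 0
    for px in row:
        cur = cur + 1 if px else 0
        best = max(best, cur)
    return best

def peak_extent_features(img, max_level=3):
    """Per zone: sum of the largest peak extent of each row, normalized by the vector maximum."""
    feats = []
    for level in range(max_level + 1):
        for zone in zones_at_level(img, level):
            feats.append(float(sum(longest_run(r) for r in zone)))   # 0 for empty zones
    feats = np.array(feats, dtype=float)
    return feats / feats.max() if feats.max() > 0 else feats         # Step VII normalization
```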

3.3 Classification

Classification is one of the crucial functions in the character recognition process. It is the last phase of a character recognition process, in which unique labels are assigned to character images based on the extracted features. Classification essentially determines the region of the feature space to which an unknown pattern belongs and, using different analytical properties and artificial intelligence approaches, assigns samples to one of a few classes. In this process, the feature vector of the input character is compared with the representation of each character class, with the main aim of reducing the misclassification attributable to the features extracted in the previous stage. In this work, the authors have considered three classification methodologies, i.e., random forest, convolution neural network, and decision tree. These classifiers are widely used in the fields of image processing and pattern recognition, with CNN being particularly appropriate for pattern recognition, and they give good performance for text recognition in various non-Indic and Indic scripts. So, these three classifiers are considered in this work for handwritten Gurumukhi character recognition. The authors have used the LeNet architecture of CNN for character recognition (LeCun et al. 1998). A decision tree classifier can be expressed as a recursive partition of the instance space. The decision tree is made up of nodes that form a rooted tree, i.e., a directed tree with a node termed the "root" that has no incoming edges, while every other node has exactly one incoming edge. In this representation, each internal node denotes a test on an attribute (feature), each branch of the tree represents an outcome of the test, and each leaf (terminal or decision node) gives the value of the class label. The over-fitting problem of a decision tree is mitigated by the random forest, a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset. Among the supervised learning classification algorithms, random forest performs exceptionally well for large databases (Breiman 2001). Experimental results using these classification techniques are presented in Sect. 5.
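For the two tree-based classifiers, a hedged sketch using the scikit-learn API is shown below; the feature files and parameter settings are hypothetical, and the LeNet-style CNN, which operates on the raw 88 × 88 images rather than on hand-crafted features, is not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical files holding the hybrid feature vectors and the 35 class labels
X = np.load("gurumukhi_features.npy")   # shape: (n_samples, n_features)
y = np.load("gurumukhi_labels.npy")     # shape: (n_samples,)

dt = DecisionTreeClassifier(random_state=0).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print("decision tree predictions:", dt.predict(X[:5]))
print("random forest predictions:", rf.predict(X[:5]))
```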

4 AdaBoost methodology

AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is defined as

$$ F_{T}(x) = \sum_{t = 1}^{T} f_{t}(x) $$

where each weak learner \(f_{t}\) takes an object \(x\) as input and returns a value indicating the class of the object. Adaptive boosting (AdaBoost), proposed by Freund and Schapire (1999), is classified as a machine learning meta-algorithm and is a representative algorithm of the boosting family. The basic idea of this algorithm is that it is used in conjunction with other learning algorithms to enhance their performance. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by earlier classifiers. The AdaBoost algorithm is widely used in various applications of artificial intelligence and pattern recognition. AdaBoost is a kind of classifier with high recognition accuracy. It is straightforward to set up and does not require the selection of high-quality features using PCA and so forth. In this paper, the authors have considered this approach for enhancing the recognition results of offline handwritten Gurumukhi characters. In the present work, three classifiers, namely random forest, convolution neural network, and decision tree, are evaluated as mentioned in Sect. 3. The convolution neural network performs better than the other two classifiers in this work, achieving a recognition accuracy of 90.4%. Therefore, the authors finally applied the AdaBoost algorithm to the convolution neural network to improve the recognition results and obtained a recognition accuracy of 96.3%. Moreover, the root mean squared error and false acceptance rate of the proposed framework have also been reduced with the AdaBoost algorithm. The experimental outcomes based on the three classifiers and the AdaBoost algorithm as specified above are discussed in the section below.
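A minimal sketch of the boosting step is given below using scikit-learn's AdaBoostClassifier with a shallow decision tree as the weak learner; the paper boosts a CNN, which would require a custom weak-learner wrapper, so this stand-in only illustrates the ensemble mechanics. File names and parameter settings are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature/label files produced by the feature extraction phase (Sect. 3.2)
X = np.load("gurumukhi_features.npy")
y = np.load("gurumukhi_labels.npy")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Boost a shallow decision tree (stand-in weak learner for the CNN used in the paper)
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)
print("test accuracy:", boosted.score(X_test, y_test))
```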

5 Experimental results and discussions

Four feature extraction methodologies, namely peak extent-based, diagonal, centroid, and zoning features, have been used to extract the shape information of characters for recognition. A combination of these features in a hybrid manner has also been tried to enhance the recognition accuracy. In order to recognize offline handwritten Gurumukhi characters, three classifiers, namely random forest, decision tree, and convolution neural network, have been considered in this work for the 35 basic characters (three vowel bearers and thirty-two consonants) of the Gurumukhi script. The authors have collected 14,000 samples of offline handwritten Gurumukhi characters, on which each approach has been tested. These samples were collected from four hundred writers at public places like schools, colleges, etc. Two evaluation strategies have been used: a partitioning strategy, in which 70% of the 14,000 samples are considered as the training dataset and the remaining 30% as the testing dataset, and a fivefold cross-validation technique. The results of the various experiments show that the proposed algorithm can accomplish a higher recognition rate (up to 96.3%) when CNN is combined with the adaptive boosting methodology, which is higher than that of the convolution neural network alone using the fivefold cross-validation technique. Classifier-wise recognition results based on the partitioning strategy and the fivefold cross-validation technique are depicted in Tables 1 and 2, respectively, and graphically in Figs. 2 and 3, respectively. Moreover, the root mean squared error (RMSE), reduced to 5.20%, and the false acceptance rate (FAR), reduced to 1.0%, are also improved using the AdaBoost methodology, as depicted in Tables 3, 4, 5, and 6. RMSE is graphically presented for the partitioning strategy and the fivefold cross-validation technique in Figs. 4 and 5, respectively, and similarly, FAR for the partitioning strategy and the fivefold cross-validation technique in Figs. 6 and 7, respectively. The authors have also noticed that the hybrid features are the most discriminating features as compared to the other features, as shown in Table 7.
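The two evaluation protocols can be sketched as follows (assumed setup, hypothetical file names); the boosted decision tree stands in for the CNN-based model actually used in the experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.load("gurumukhi_features.npy")   # hypothetical feature/label files
y = np.load("gurumukhi_labels.npy")
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)

# Partitioning strategy: 70% training, 30% testing (as in Table 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
print("hold-out accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# Fivefold cross-validation (as in Table 2)
print("5-fold mean accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```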

Table 1 Recognition accuracy using partitioning strategy (70% data as training set and 30% data as testing set)
Table 2 Recognition accuracy using fivefold cross-validation
Fig. 2

Recognition accuracy using partitioning strategy

Fig. 3

Recognition accuracy using fivefold cross-validation

Table 3 RMSE using partitioning strategy (70% data as training set and 30% data as testing set)
Table 4 RMSE using fivefold cross-validation
Table 5 FAR using partitioning strategy (70% data as training set and 30% data as testing set)
Table 6 FAR using fivefold cross-validation
Fig. 4

RMSE using partitioning strategy

Fig. 5

RMSE using fivefold cross-validation

Fig. 6

False acceptance rate using partitioning strategy

Fig. 7

False acceptance rate using fivefold cross-validation

Table 7 Comparison with existing methodologies for Gurumukhi character recognition

6 Comparison with state-of-the-art work

In this section, the authors present a comparison of the proposed methodology with existing work. A recognition accuracy of 91.6% was attained by Lehal and Singh (1999) for printed Gurumukhi text recognition. For handwritten Gurumukhi character recognition, Sharma and Jain (2010) experimented with extracted zoning features and two classifiers, namely k-NN and SVM, achieving maximum recognition accuracies of 72.5% and 72.0% with k-NN and SVM, respectively. Kumar et al. (2011) experimented on 3500 samples of handwritten characters by extracting intersection and open end point-based features, which were classified by an SVM classifier. Diagonal features were also extracted, achieving an accuracy of 94.1% on the same sample set using a k-NN classifier. It has been observed during experimentation that the number of training and testing samples in the dataset affects the recognition accuracy.

Kumar et al. (2013b) attained accuracies of 84.6%, 85.9%, and 89.2% by experimenting on 10,500 offline handwritten Gurumukhi character samples using modified division point-based features, classified by Linear-SVM, MLP, and k-NN classifiers, respectively. Kumar et al. (2016) used various transformation techniques for offline handwritten Gurumukhi character recognition and accomplished a recognition accuracy of 95.8% for 10,500 samples penned by 200 distinct writers. Kumar et al. (2018b) improved the recognition rate to 95.91% for 150 medieval handwritten Gurumukhi manuscripts by using a combination of zoning, DCT, and gradient features and a combination of k-NN, SVM, decision tree, and random forest classifiers, along with the application of the adaptive boosting and bagging approaches. Kumar et al. (2019) evaluated the performance of various classifiers, viz., k-NN, SVM (Linear, RBF), naïve Bayes, decision tree, CNN, and random forest, applied on the peak extent-based, diagonal, and centroid features extracted from a sample set of 13,000 Gurumukhi isolated characters and numerals. In this paper, the authors have considered 14,000 samples and obtained a recognition accuracy of 96.3% using CNN and the adaptive boosting methodology.

7 Conclusion and future scope

In this paper, the authors have presented improved recognition results for an offline handwritten Gurumukhi character recognition system based on the AdaBoost ensemble technique. Four features, namely peak extent-based, diagonal, centroid, and zoning features, along with their combinations, have been considered in order to extract the desired information from the character samples. The classifiers employed in this work are the convolution neural network, decision tree, and random forest. Experimental results demonstrate the effectiveness of the proposed system in terms of recognition accuracy, RMSE, and FAR. Using the considered classification techniques and the fivefold cross-validation technique, the authors have accomplished recognition accuracies of 90.4%, 61.7%, and 84.4% with CNN, decision tree, and random forest, respectively, on the 14,000 samples of the training and testing dataset. Finally, the highest recognition accuracy of 96.3% has been attained using the AdaBoost methodology, demonstrating the superior performance of the proposed approach as compared to other state-of-the-art approaches. To the best of the authors' knowledge, this is the first time the AdaBoost technique has been applied to recognize offline handwritten Gurumukhi characters, and thus the results attained can be considered as a baseline for comparisons in further research. It is well known that Gurumukhi shares a number of structural similarities with some other Indian scripts. As such, the work carried out in this paper can be extended to these scripts after building databases for them.