Introduction

Optical character recognition (OCR) is the process of reading document images and transforming them into a machine-readable format such as ASCII or Unicode. The idea of OCR first appeared with the invention of the ‘retina scanner’. In its early days, OCR was realised in mechanical devices that could recognize only a limited number of characters, and only slowly. Later, computer researchers started working on OCR systems. OCR offers a systematic and sophisticated way of handling documents: it enables electronic editing, keyword spotting and keyword searching. OCR methodologies are also applied to degraded documents for the purpose of preservation. A degraded document can be defined as a document that either does not follow typographical conventions or has been deformed by some external cause. A handwritten document also contains degradations of this kind, such as skew, tilted lines and uneven line gaps. In other words, the typographical conventions of document structure are not preserved in a handwritten document. Moreover, handwriting style varies from writer to writer. These degradations and variations make it difficult to achieve high accuracy, so the recognition of handwritten documents is a complex OCR problem. For the English language, OCR is now regarded as a solved problem. For European scripts such as Latin and Greek, OCR systems have achieved great success. For Indian languages such as Hindi, Bengali and Tamil, work on OCR systems has been going on for a few decades. Despite rigorous work by researchers, OCR of handwritten documents remains an unsolved and challenging problem. The challenges increase for Indian scripts because of their complex nature and large character sets.

An OCR system is mainly composed of three modules: pre-processing, segmentation and classification. Systems that follow all three phases and use the character as the unit of recognition are called analytic systems. Systems in which the unit of recognition is a word or a line are called holistic systems. In a holistic system the character segmentation module is skipped. The advantage of the holistic approach is that it removes the segmentation overhead and makes the OCR process faster. The most important step is classification, as the accuracy of the whole system depends largely on the classification (recognition) step. Over time, several techniques have been introduced for classification. The first approach was template matching. Later, structural analysis, statistical learning methods, feature extraction methods and others were introduced.

In this article, I propose a novel holistic model for the recognition of handwritten words. The proposed hybrid model consists of two major parts: feature extraction and recognition. In feature extraction, a combination of Convolutional Neural Network (CNN) layers and Bi-directional Long Short-Term Memory (BLSTM) layers is used to extract global and temporal features from the word image. These two features are combined by a compact bilinear pooling layer, which generates the final feature vector. Recognition is done using a Connectionist Temporal Classification (CTC) layer. The hybrid model is tested on three public datasets, CMATERdb2.1.2, IIIT-HW-Dev and IIIT-HW-Telugu, containing Bengali, Devanagari and Telugu scripts respectively. The results of existing methods on these datasets are compared with those of the proposed model. An advantage of the proposed model is that it is script independent. The highlights of the article are:

  • This article proposes a holistic recognition scheme for offline handwritten words in multiple Indian scripts.

  • Its objective is to eliminate the need for script-dependent classifiers and to build a classifier that can be trained and tested on any script.

  • In the proposed hybrid model, both spatial and temporal features are extracted from the handwritten word image and combined by a compact bilinear pooling layer. Connectionist Temporal Classification is used for recognition.

  • Using CNN and BLSTM as the feature extractors enables end-to-end training and testing of the model.

  • The model is tested on three public databases of Bengali, Devanagari and Telugu scripts and achieves better results than most of the existing classifiers on these datasets.

The organization of the article is as follows:

A detailed survey of present algorithms for character classification can be found in Sect. “Brief Survey on Classification Algorithms”. The description of the datasets can be found in Sect. “Database”. The proposed model is discussed in Sect. “Proposed Model”, which also contains the results achieved in the testing phase and an error analysis of the results. Finally, the conclusion and future scope are given in Sect. “Conclusion”.

Brief Survey on Classification Algorithms

The working principle of early OCR systems was template matching. In 1956, Kelner and Glauberman [1] implemented template matching using a magnetic shift register. In template matching, the match score is determined by calculating the differences between the test image and the template images. After reflected light had been used for template matching, Hannan and his group [2] combined electronic and optical techniques for OCR in 1962. They used an electron tube instead of light to generate the match score. After the mechanical and electronic devices, OCR became a field of work for computer researchers. Template matching was then done using a logical matching scheme named the peephole technique. In this technique, the binarized input character was matched against the strokes, that is, the black and white portions, and for characters of constant size and width the logic evaluated to true. This technique was adopted by Solartron Electronics Group Ltd. The main weakness of template matching is its rigidity: when character style and font vary, it requires a huge number of template samples. To overcome this, several improvements have been proposed over time. One of them is matching by correlation coefficient, in which the template is matched with the test character and a similarity score is generated along with the location where the score is maximum. The character is identified on the basis of this score. This approach was used in [3, 4].
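The correlation-based matching described above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the procedure of [3, 4]: images are plain nested lists of pixel values, and the function names `correlation`, `match_template` and `classify` are hypothetical.

```python
from math import sqrt

def correlation(a, b):
    """Correlation coefficient between two equally sized 2D patches."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    # A constant patch has zero variance; treat it as "no similarity".
    return num / den if den else 0.0

def match_template(image, template):
    """Slide the template over the image; return (best score, (row, col))."""
    th, tw = len(template), len(template[0])
    best = (-2.0, (0, 0))
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = correlation(patch, template)
            if score > best[0]:
                best = (score, (r, c))
    return best

def classify(image, templates):
    """Return the label of the template with the highest peak score."""
    return max(templates, key=lambda k: match_template(image, templates[k])[0])

# Toy templates (assumed for illustration): a vertical bar and a dash.
T_BAR = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
T_DASH = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
IMAGE = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
```

With these toy inputs, `classify(IMAGE, {"bar": T_BAR, "dash": T_DASH})` selects the bar template, and `match_template` also reports where in the image the peak score occurs. The rigidity noted above is visible here: every new font or style needs its own template.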

Template matching worked well for machine-printed characters, but it failed when handwritten characters came into the picture. Structural analysis was introduced to address this. Conceptually the two procedures are similar, but structural analysis adds a new idea: feature extraction. A feature can be defined as a distinguishing characteristic used to identify a character. Structural analysis is still an open problem, and different feature extraction techniques continue to be developed in pursuit of better accuracy. The first structural analysis was done using stroke features. In 1954, Rohland [5] counted the number of black pixels in a vertical scan and used it as a feature. In structural analysis the image is divided into partitions, and both topological features (such as the crossing count of black pixels) and geometrical features (such as length and width) are detected. Along with different features, different partitioning approaches were also proposed, for example by Johnson [6] and Dimond [7]. Sometimes weights are assigned to each partition; such an approach was followed by Kamensky [8].

The approaches discussed so far were based on either structural analysis or template matching. Later, hybrid schemes combining both approaches were proposed in different studies, adopting the advantages of each. The advantage of template matching is that it is robust to global changes. The advantage of structural analysis is that it can detect local features and adapt to variation. Munson [9] proposed a hybrid approach in which the images were divided into sub-regions and the objects of matching were local features within each sub-region. A similar idea can be found in the work of Greanias et al. [10]. The IBM 1287 was later built following this idea. The early features were very basic, for example shape and intensity features, and were not enough to capture the discriminating characteristics. More prominent features proposed later increased character recognition accuracy. A few of them are the Histogram of Gradient feature [11], the shift-invariant feature [12] and the wavelet feature [13].

The three approaches discussed so far are the foundation of all classification or recognition approaches proposed to date. Over time, algorithms have been developed to find a match, to extract features, or to combine the two. Some of the widely used algorithms are the following:

Hidden Markov model (HMM): In an HMM, statistically derived parameters are used, and the probability parameters are estimated from a large training set. Some of the works on HMM are [14, 15].

Support vector machine (SVM): The SVM is one of the best-known approaches to classification and recognition. It works best on two-class classification problems. A multi-class problem is normally decomposed into two-class problems, and the difficulty lies in choosing how to distribute the classes. Some recent works on SVM are reported in [16, 17].

Decision tree: Another method of character classification is the decision tree, or graph-based method. In decision-tree-based methods, the set of candidate classes is refined step by step using discriminating features. Moving down the branches, the tree proceeds from generic features to more specialized ones. Finally, the leaf nodes represent the class labels. Some of the important works using decision trees are [18, 19].

Fuzzy logic: Fuzzy logic is another well-known scheme used for character recognition. It mimics human reasoning by allowing a variable to take multiple possible truth values. Fuzzy-logic-based systems for offline character recognition are proposed in [20, 21].

Clustering: Clustering methods are widely used algorithms that work in the feature space and create groups, or clusters, of data points with similar properties. One popular clustering algorithm is k-means, which partitions the observations into k disjoint clusters. Each cluster has a cluster center, and the cluster centers should be well separated. A new data point is assigned to one of the k clusters. An example of the k-means clustering algorithm can be seen in [22]. A variation of k-means is fuzzy c-means clustering, which has been applied to Bulgarian character recognition in [23].

Artificial neural networks: Towards the end of the twentieth century, artificial neural networks (ANNs) started being widely used in OCR technologies. An ANN classifier is trained by supervised learning on labelled data. Some works on classification are [24] for Devanagari script and [25] for Arabic characters. Other works using artificial neural networks can be found in [26,27,28]. Deep learning is one variation of the ANN. Some of the basic and widely used deep learning networks are the Convolutional Neural Network, Recurrent Neural Network, Long Short-Term Memory, Generative Adversarial Networks (GANs) and Autoencoders; their applications are reported in [29,30,31,32,33], respectively.

Since the early days of OCR engines, researchers have combined two or more classifiers to propose hybrid schemes for OCR. It is not possible to discuss all of these studies here, so I have included some of them. In [34] a hierarchical model combining k-means and SVM is proposed. In [35] deep learning and a decision tree are combined. In [36] an SVM is hybridized with a multi-dimensional RNN. Another work [37] combines MLP and SVM. A combination of CNN and SVM can be seen in [38]. In [39], a CNN is combined with XGBoost. This concludes the brief overview of classification approaches.

Database

The proposed model has been examined using three public databases: CMATERdb2.1.2 [40], IIIT-HW-Dev [41] and IIIT-HW-Telugu [42]. The databases contain offline handwritten words in Bengali, Devanagari and Telugu. The details of the databases are as follows:

Fig. 1 Some sample images from CMATERdb2.1.2 database

Fig. 2 Some sample images from IIIT-HW-Dev database

Fig. 3 Some sample images from IIIT-HW-Telugu database

  • CMATERdb2.1.2: This database contains word images written in Bengali script. It contains 18,000 word samples with a lexicon size of 1298. Some of the word samples are shown in Fig. 1.

  • IIIT-HW-Dev: This dataset was developed by the researchers of CVIT, IIIT Hyderabad. It contains a total of 95,381 Devanagari words written by 12 individuals. The database has a total vocabulary of 95,381 Devanagari words. It is divided into train, test and validation subsets in the ratio of 70:15:15, containing nearly 66766, 14307 and 14808 annotated word images. Some images are shown in Fig. 2.

  • IIIT-HW-Telugu: This dataset was also developed by the researchers of CVIT, IIIT Hyderabad. It contains handwritten Telugu words. A total of 118,515 words were collected from 11 writers, with a vocabulary of 12,945 words. The words are divided in a ratio of roughly 70:15:15, resulting in 80637, 19980 and 17898 annotated word images in the train, test and validation sets respectively. Some images are shown in Fig. 3.

Proposed Model

I propose a hybrid model for offline word recognition. As mentioned earlier, the model is a combination of CNN layers and BLSTM layers. An overview of the whole system is given in Fig. 4. The system can be divided into three main components: feature extraction, training the classifier, and testing the classifier (recognition). In feature extraction, two types of features are extracted: spatial features by the convolutional layers and temporal features by the bi-directional long short-term memory layers. The two types of features are combined by compact bilinear pooling to form the final feature. Compact bilinear pooling helps in highlighting important features, which in turn improves recognition accuracy. The extracted feature then goes into the CTC layer for recognition. In the training phase, the labelled word images are fed into the classifier and the network is trained. In the testing phase, the combined feature is extracted from the word image and the trained model recognizes the word with the help of the trained CTC layer. The network and its components are discussed in detail in the following subsections.

Feature Extraction

Feature extraction is one of the most important factors affecting recognition accuracy. In fact, a good feature extraction technique can produce good results even with an average classifier. As discussed in Sect. “Brief Survey on Classification Algorithms”, researchers have introduced many feature extraction techniques over the years. Starting from area and pixel intensity features, features such as HOG and shift-invariant features are now widely used.

Fig. 4 Diagram of the Whole System

The aim of a feature extraction technique is to identify the discriminating features and extract them so that the classifier can recognize the input easily. A strong feature extraction technique is therefore as important to an OCR system as a strong classifier. The main challenge is to choose the technique that works best for a specific task. This requires manual intervention, and a wrong decision may harm the system. To overcome this, machine-extracted features were introduced, which remove the need for hand-crafted features. Examples are CNN, LSTM and BLSTM features. These techniques extract complex features automatically from the raw image. Additionally, they remove the need for a separate feature extraction unit, as the recognizer and the feature extractor can be trained simultaneously.

To remove the need for hand-crafted features and to train the feature extractor together with the recognition module, a feature extraction technique combining CNN and BLSTM is proposed in this work. The CNN and BLSTM features are combined by compact bilinear pooling. The combined feature is the output of the feature extraction module, which is depicted in Fig. 5.

The input to the proposed word recognition system is a binary word image of fixed size. A binary image consists of 0- and 1-valued pixels and is obtained by binarization. In CMATERdb2.1.2 the word images are already binary but of variable size, so the only pre-processing step for this database is padding with white pixels to fix the image size. The length of each image is fixed to 500 by padding on the right side, and the width is fixed to 200 by padding at the bottom, so the final image size after pre-processing is 500\(\times\)200. The IIIT-HW-Dev and IIIT-HW-Telugu databases contain resized RGB images of fixed length and width, so Otsu’s thresholding algorithm [43] is used to obtain the binary word images. The word images of all three databases are noise free, so no pre-processing other than binarization and padding is applied.
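The two pre-processing steps above, binarization by Otsu's method and padding with white pixels, can be sketched in plain Python as follows. This is a minimal illustration under assumptions that are not stated in the text: the image has already been converted to 8-bit grayscale, white is encoded as 1 and ink as 0, and "length" is taken as the horizontal extent because the padding is added on the right. The function names are hypothetical.

```python
def otsu_threshold(gray):
    """Return the threshold (0..255) that maximises between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalised)
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray):
    """Map pixels above the Otsu threshold to 1 (white), the rest to 0 (ink)."""
    t = otsu_threshold(gray)
    return [[1 if v > t else 0 for v in row] for row in gray]

def pad_to(img, width=500, height=200, fill=1):
    """Pad on the right and at the bottom with white pixels to a fixed size."""
    if len(img) > height or any(len(row) > width for row in img):
        raise ValueError("image is larger than the target size")
    rows = [list(row) + [fill] * (width - len(row)) for row in img]
    rows += [[fill] * width for _ in range(height - len(rows))]
    return rows
```

For a tiny grayscale patch such as `[[10, 12, 200], [11, 210, 205]]`, `binarize` separates the three dark pixels from the three bright ones, and `pad_to` then extends the result to the fixed input size.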

As mentioned earlier, two types of features are extracted from the binarized two-dimensional word image. The convolution layers extract the spatial features, that is, the shape, coordinates and regions of the word image. To capture more generic, global features, two CONV2d layers with different filter sizes are used in parallel. The configurations of the convolution layers are:

Conv1: Kernel size = maxnorm(5); Stride = 1; Dropout = 0.2;

Conv2: Kernel size = maxnorm(6); Stride = 1; Dropout = 0.2;

Each CONV2d layer is followed by a max-pooling layer. Both max-pooling layers have the same configuration: Maxpool: pool = (2 \(\times\) 2), Stride: 2 \(\times\) 2. The max-pooling layers reduce the dimension as well as the complexity of the extracted feature. The output of each max-pooling layer is flattened into a one-dimensional vector, and the two 1D vectors are concatenated to create the final global feature. Along with the global feature, two BLSTM layers of the same dimension are used to extract the temporal features. The BLSTM layers are responsible for capturing the correlations. In one BLSTM layer (BLSTM1) the word image is fed from left to right, and in the other (BLSTM2) it is fed from top to bottom. If the input word image has size (h \(\times\) w), then for BLSTM1 (left to right) the input is processed as (1 \(\times\) height \(\times\) width), and for BLSTM2 (top to bottom) as (1 \(\times\) width \(\times\) height). The output of each BLSTM layer is a 2D matrix. The configurations of the BLSTM layers are:

BLSTM1: Hidden:16*w Drop: 0.3;

BLSTM2: Hidden:16*w Drop: 0.3;
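The max-pooling, flattening and concatenation steps of the convolutional branch described above can be illustrated with a small sketch. It assumes single-channel feature maps stored as nested lists, which is a simplification of the actual network; the function names are hypothetical.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    h = len(fmap) // 2 * 2
    w = len(fmap[0]) // 2 * 2
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def flatten(fmap):
    """Turn a 2D feature map into a 1D vector, row by row."""
    return [v for row in fmap for v in row]

def global_feature(fmap1, fmap2):
    """Pool and flatten both branch outputs, then concatenate the 1D vectors."""
    return flatten(max_pool_2x2(fmap1)) + flatten(max_pool_2x2(fmap2))
```

Each pooling step halves both spatial dimensions, so a 4\(\times\)4 map contributes 4 values and a 2\(\times\)2 map contributes 1 value to the concatenated global feature.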

Fig. 5 Feature extraction module

A row-wise average is computed on the 2D output matrix of BLSTM1 and a column-wise average on that of BLSTM2. After averaging, the BLSTM2 output is transposed so that its dimension matches that of BLSTM1 and the final summation operation can be applied to them. The global features extracted by the CNN layers and the temporal features extracted by the BLSTM layers are combined by compact bilinear pooling. Bilinear pooling is a simple operation that takes the outer product of two input vectors. Its advantage over simpler mathematical operations is that the resulting vector captures fine-grained details, which in turn helps to achieve high recognition accuracy. Moreover, bilinear pooling has achieved state-of-the-art performance on a wide range of machine learning tasks. Compact bilinear pooling (CBP) is a variation of bilinear pooling with the additional advantage of compactness. The global and temporal features are transformed into count sketch vectors, which are then combined by a Fast Fourier Transform (FFT) convolution [44]. Another advantage of the CBP layer is that it allows back-propagation, which makes the proposed system an end-to-end model with feature extraction as an intrinsic part of the classification module. The output of the CBP layer is the final feature vector, which enters the CTC layer for recognition.
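The count sketch and convolution steps described above can be sketched in plain Python. This is an illustration under assumptions, not the configuration used in this work: the random hash and sign tables, the output dimension `d`, the seeds and all function names are hypothetical. The circular convolution is computed directly here to avoid dependencies; by the convolution theorem it gives the same result as multiplying the FFTs of the two sketches element-wise and applying the inverse FFT.

```python
import random

def make_sketch_params(n, d, seed):
    """Random bucket index h[i] in [0, d) and sign s[i] in {-1, +1} per input."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s

def count_sketch(x, h, s, d):
    """Project x to d dimensions: each x[i] is added, with its sign, to bucket h[i]."""
    out = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        out[hi] += si * xi
    return out

def circular_convolution(a, b):
    """Direct circular convolution of two length-d vectors."""
    d = len(a)
    return [sum(a[j] * b[(k - j) % d] for j in range(d)) for k in range(d)]

def compact_bilinear(x, y, d=16, seed=0):
    """Compact approximation of the outer product of x and y in d dimensions."""
    h1, s1 = make_sketch_params(len(x), d, seed)
    h2, s2 = make_sketch_params(len(y), d, seed + 1)
    return circular_convolution(count_sketch(x, h1, s1, d),
                                count_sketch(y, h2, s2, d))
```

The result has length `d` regardless of the input sizes, whereas the full outer product of vectors of length n and m has n\(\times\)m entries; this is the compactness referred to above. The convolved sketch equals the count sketch of the outer product under the combined hash (h1[i] + h2[j]) mod d and sign s1[i]·s2[j].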

Training the Classifier

In training the classifier, the most important part is to train the Connectionist Temporal Classification (CTC) layer. The main advantage of CTC is that it allows sequential recognition without the need for segmentation, especially in document processing. The model has been trained separately on the training set of each database mentioned in Sect. “Database”. The Devanagari and Telugu databases are already divided into train, test and validation sets in a 70:15:15 ratio. In [40], the CMATERdb2.1.2 database is divided into a training set of 14,400 images and a test set of 3600 images. A validation set is created by taking 3600 images from the training set, which leaves 10,800 images for training. In the training phase, the input to the whole module is the binary word image together with its text label. To train the CTC layer, the characters of the script and a blank character are considered, and the encoding in the CTC layer is done over these characters. The loss function is the CTC loss: first the probability of the ground truth is computed, and the loss is its negative logarithm. The loss is back-propagated through the network to update the weights and produce the trained model. The RMSProp optimizer with a learning rate of 0.001 is used for training. Training continued until the validation accuracy stopped improving.
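The CTC loss described above, the negative logarithm of the probability of the ground truth, can be illustrated with the standard forward (dynamic programming) computation. The sketch below is a plain-Python illustration, not the training code of this work; it assumes that `probs[t][k]` is the network's probability of label `k` at time step `t`, that label 0 is the blank, and that the target is a list of label indices.

```python
from math import log

def ctc_loss(probs, target, blank=0):
    """Negative log-probability of `target`, summed over all valid alignments."""
    # Extended target: a blank before, between and after the labels.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: probability of all partial alignments ending in ext[s] at time t.
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end in the last label or in the trailing blank.
    p = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return float("inf") if p == 0.0 else -log(p)
```

As a small worked case, with two time steps, one character and uniform probabilities of 0.5, the target "a" is produced by the three alignments (a, a), (a, blank) and (blank, a), so its probability is 3 × 0.25 = 0.75 and the loss is −ln 0.75 ≈ 0.288.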

Testing the Classifier

The experimental datasets are discussed in Sect. “Database”. The trained model has been tested on the test set of each database. Note that training and testing are done separately for each script. The databases are public, and several research works on them have been reported in the literature. The results obtained by the proposed model are compared with some of those works. In the testing phase, the CTC layer uses best-path decoding to produce the text, in two steps. In the first step, the character with the maximum probability at each time step is taken as the predicted character. In the second step, the duplicate characters and blank characters are removed to obtain the actual word or character sequence.
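The two decoding steps above can be sketched as follows. The sketch assumes that `probs[t][k]` is the probability of label `k` at time step `t` and that index 0 of the hypothetical `alphabet` list is a placeholder for the blank; consecutive duplicates are merged before blanks are dropped, so a blank between two identical characters keeps both.

```python
def best_path_decode(probs, alphabet, blank=0):
    """Greedy CTC decoding: arg-max per time step, merge repeats, drop blanks."""
    # Step 1: most probable label index at each time step.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]

    # Step 2: collapse consecutive duplicates, then remove blanks.
    out = []
    prev = None
    for k in best:
        if k != prev and k != blank:
            out.append(alphabet[k])
        prev = k
    return "".join(out)

# Toy example (assumed values): the best path is a, a, blank, a, b, b.
ALPHABET = ["-", "a", "b"]
PROBS = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.0, 0.1, 0.9],
]
```

For the toy input, the best path "aa-abb" collapses to "aab": the first two a's merge, the blank separates them from the third a, and the two b's merge.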

Result and Discussion

As mentioned, the model is tested on the three databases. The results obtained on the test sets are reported in Table 1.

Table 1 Classification accuracy obtained on the databases

For word recognition, two types of accuracy are calculated from the computed label error. The first is character-level accuracy, which measures the characters of a word image that are correctly identified. The second is word-level accuracy. For a word image, if the ratio of correctly recognized characters to the total number of characters is greater than a certain threshold, the word is considered correctly recognized. The threshold accepted in this work is 90%. The computed character-level and word-level accuracies are reported in Table 1.
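The two accuracies described above can be illustrated with a short sketch. The exact definition of the label error is not given in the text, so the sketch assumes it is the edit (Levenshtein) distance between the predicted and ground-truth strings, and takes the fraction of correct characters as one minus the distance divided by the ground-truth length. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_accuracy(pred, truth):
    """Fraction of ground-truth characters recognised correctly (assumed metric)."""
    if not truth:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, truth) / len(truth))

def evaluate(pairs, threshold=0.9):
    """Return (character-level accuracy, word-level accuracy) over (pred, truth) pairs."""
    total_chars = sum(len(t) for _, t in pairs)
    errors = sum(edit_distance(p, t) for p, t in pairs)
    char_acc = max(0.0, 1.0 - errors / total_chars)
    # A word counts as correct when its character accuracy exceeds the threshold.
    correct_words = sum(char_accuracy(p, t) > threshold for p, t in pairs)
    return char_acc, correct_words / len(pairs)
```

For example, `evaluate([("abcd", "abcd"), ("abxd", "abcd")])` gives a character-level accuracy of 0.875 (one error in eight characters) and a word-level accuracy of 0.5, because the second word reaches only 75% and falls below the 90% threshold.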

These two are standard metrics for handwriting recognition tasks. In Table 2, the results are compared with those reported on the CMATERdb2.1.2 database in [40]. From the table, it can be seen that the proposed method achieves high accuracy on the dataset.

Table 2 Comparison on CMATERdb2.1.2 database
Table 3 Comparison on IIIT-HW-Dev database
Table 4 Comparison on IIIT-HW-Telugu database

In Tables 3 and 4, I compare the results with those reported on the IIIT-HW-Dev and IIIT-HW-Telugu databases in [41] and [42] respectively. It can be observed that the method performs better than most of the existing methods. However, for Devanagari script it does not outperform the Mixed-IAM-SCNN-BLSTM-Lexi model. Similarly, for Telugu script the IAM-SCNN-BLSTM-Lexi model performs better.

Error Analysis

Although the proposed model performs well compared to state-of-the-art methods, errors occurred for a few word classes. Examination of the errors shows that they mainly arise from similarity in character shape. All three scripts studied here share this challenge: a small variation in the shape of one character turns it into another character. For example, in Bengali script, adding an arc to one character produces a different character, and adding a dot to a character likewise creates another character.

Another challenge with a high impact on classification is complex character shape. All Indian scripts have compound characters, in which two basic characters combine to create a modified shape. This increases the number of character classes as well as the difficulty of recognizing such characters, and lowers the recognition accuracy.

The final and most important factor contributing to the errors is the imperfections present in a handwritten document. A handwritten document contains degradations such as skew, tilted lines and uneven line gaps. Moreover, handwriting style varies from writer to writer. The recognition of handwritten documents is therefore a complex problem.

Conclusion

In this article, a holistic word recognition model is proposed and tested on Bengali, Devanagari and Telugu scripts. In the holistic approach, a given word image is treated as a single, indivisible unit. For that purpose a hybrid model is proposed. The model extracts global features from a word image using convolution layers and temporal features using BLSTM layers. The two features are combined by a compact bilinear pooling layer. The final feature vector goes into the CTC layer, which recognizes the word.

The proposed model is font and shape independent, which is why it worked for all three scripts. The results achieved are better than those of most existing methods, but there is still scope for improvement. An error analysis of the results shows that shape complexity, imperfections in the document and similarity in character shape are the main factors that affect accuracy. Additionally, no pre-processing other than binarization and padding has been applied. Steps such as slant correction and skew correction, which are very common in handwritten word recognition, are not used. Including such a pre-processing module may increase the recognition accuracy in future. Including a post-processing technique such as dictionary learning may increase the overall accuracy further.