1 Introduction

Transcription of handwritten text into digital text has received increasing attention from researchers because of the challenges and complexities it involves. Recognition systems for printed text documents already perform well [1], but the performance of handwritten text recognition systems still needs to improve. Handwritten text documents are divided into online and offline documents [2]. In online handwritten text documents, geometrical and temporal information is stored during writing (e.g. writing with a pressure-sensitive device on an electronic writing pad) [3, 4]. In an offline handwritten text document, only the finished sample of the text is available after writing, with variations in the handwriting styles of writers, which makes offline handwriting recognition more challenging than online handwriting recognition [5]. Moreover, the complex shapes of characters in some scripts, such as Devanagari and Bangla, make offline handwriting recognition more difficult. Since handwriting recognition has potential applications in historical document digitization, bank cheque processing, postal automation, automatic data entry, etc., there is a need to improve the accuracy of handwriting recognition systems.

Indic scripts pose more challenges in handwriting recognition than Latin, Chinese, Korean and Japanese scripts because of variations in the order of strokes or symbols, half consonants, etc., which are discussed in detail for online recognition in [6]. Kaur et al [7] presented a detailed review of work on multilingual online and offline character recognition for Indic and non-Indic scripts. That review identified deficiencies and gave an in-depth view of the work done at each phase of character recognition for printed and handwritten documents. Kumar et al [8] discussed the major challenges for character and numeral recognition in Indic and non-Indic scripts.

This research article investigates the recognition of Devanagari modified characters with the help of a CNN. Recognition performance depends largely on the feature extraction method. Features such as pixel value, shape, orientation, texture and position must be identified and extracted accurately to solve the recognition problem. Deep learning, on the other hand, reduces the need to develop a new feature extractor for every problem; e.g. a convolutional neural network (CNN) learns low-level features such as edges and lines in its early layers, then loops, and then a high-level representation of a text image. Among the methods presented in the literature for the recognition of characters, numerals, words, etc., deep learning methods are found to improve significantly on traditional feature-based methods [9,10,11,12]. In this article, a deep double-stage CNN is used for offline handwritten modified character recognition. The proposed methods are compared with a traditional feature extraction and classification method to support the claim that they perform better.

2 Proposed approach

In this research work, two CNN-based models and one traditional method based on feature extraction (Histogram of Oriented Gradients, HOG) and classification (Support Vector Machine, SVM) are presented for the recognition of offline handwritten modified characters. These models are described in sections 2.1 and 2.2.

2.1 Proposed CNN architecture description

The proposed CNN architecture is shown in figure 1. It consists of three components, from left to right: preprocessing, convolutional layers and a classification layer.

Figure 1 The network architecture.

Preprocessing consists of resizing the images and converting the samples from colour to greyscale. The convolution layers automatically extract features from each input image, and the last convolution layer forwards its output to a fully connected layer, which is followed by the classification layer.
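The preprocessing step can be illustrated with a minimal pure-Python sketch. The text does not state the greyscale weights, the interpolation method or the order of the two operations, so the BT.601 luma weights, the nearest-neighbour resizing and the greyscale-then-resize order used here are assumptions, and all function names are hypothetical.

```python
def rgb_to_grey(pixel):
    """Map one (R, G, B) pixel to a grey level (BT.601 luma weights, an assumption)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_greyscale(image):
    """Convert a colour image (rows of (R, G, B) tuples) to rows of grey levels."""
    return [[rgb_to_grey(p) for p in row] for row in image]


def resize_nearest(image, out_h, out_w):
    """Resize a 2-D image by nearest-neighbour sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def preprocess(image, size=300):
    """Greyscale conversion followed by resizing to size x size (300 is from the text)."""
    return resize_nearest(to_greyscale(image), size, size)
```

Calling `preprocess(image)` on a colour image returns a 300 x 300 grid of grey levels, matching the \(300\times 300\times 1\) input size given in the text.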

The architecture of this model consists of CNN layers. The input image size is set to \(300\times 300\times 1\). The 7 convolution layers use 8, 16, 32, 64, 128, 256 and 256 filters of size \(3\times 3\), respectively; the number of filters determines the number of feature maps. Padding is set to ‘same’ so that the spatial output size equals the input size, which helps to keep information at the borders. Stride is set to 1. Batch Normalization (BN) layers, placed between the convolution layers and the Rectified Linear Unit (ReLU) non-linearities, normalize the activations and gradients propagating through the network so that network training becomes an easier optimization problem. The max-pooling layer uses a pool size of [2,2] with a stride of 2 and returns the maximum value of each rectangular region of its input.

After the convolution and max-pooling (down-sampling) layers, a fully connected layer is added, whose neurons connect to all neurons of the preceding layer. The size of the fully connected layer is set to the total number of unique classes (or labels) in the target data: 435 in the single CNN architecture, and 37 in Stage-1 and 13 in Stage-2 of the double-CNN architecture. These CNN-based models are described in sections 2.1a and 2.1b. Softmax normalizes the output of the fully connected layer into positive numbers that sum to one. The classification layer uses the probabilities returned by the softmax activation function to assign each input image to one of the mutually exclusive classes and to compute the loss.

Stochastic Gradient Descent with Momentum (SGDM) with an initial learning rate of 0.01 is specified in the training options. The maximum number of epochs is set to 12 and the data are shuffled in every epoch. An epoch is one complete training cycle over the training dataset. During training, validation accuracy is calculated on the validation data at regular intervals set by the validation frequency.
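The effect of these layer settings on the feature-map size can be traced with a short sketch. The text does not say how many max-pooling layers are used or where they are placed, so the placement assumed here (one pooling layer after each of the first six convolution layers) is purely illustrative, as are the function names. The sketch also includes a softmax that maps fully connected outputs to positive values summing to one.

```python
import math

FILTERS = [8, 16, 32, 64, 128, 256, 256]  # filter counts given in the text


def conv_same(size, stride=1):
    """Spatial size after a 'same'-padded convolution."""
    return math.ceil(size / stride)


def max_pool(size, pool=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - pool) // stride + 1


def trace_shapes(input_size=300, filters=FILTERS, pooled_layers=6):
    """Return (height, width, channels) after each convolution block.

    pooled_layers is an assumption: the first `pooled_layers` blocks are
    each followed by a 2x2, stride-2 max-pooling layer.
    """
    size = input_size
    shapes = []
    for index, n_filters in enumerate(filters):
        size = conv_same(size)       # stride 1 keeps the spatial size
        if index < pooled_layers:
            size = max_pool(size)    # roughly halves the spatial size
        shapes.append((size, size, n_filters))
    return shapes


def softmax(scores):
    """Normalize raw class scores into probabilities that sum to one."""
    peak = max(scores)               # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Under the assumed pooling placement, `trace_shapes()` reduces the \(300\times 300\) input step by step to a \(4\times 4\times 256\) feature map before the fully connected layer; a different placement would give a different final size.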

2.1a Single CNN architecture:

This model is a 7-layer CNN trained on offline handwritten modified characters. Each label combines the names of the corresponding consonant and modifier (or matra), and the labels are formed by rearranging the Hindi consonants with Matras dataset, e.g. sample image of is labelled as ‘KaAA( )’. Because some labels in the resulting dataset have very few samples, sample images are repeated to increase its size. It is listed as the complete labelled dataset in table 1 and is divided into distinct train, validation and test groups in 6 experiments. Validation and test accuracies are calculated across the 6 experiments, and the mean validation accuracy (6-fold cross-validation accuracy) is used to evaluate performance, as shown in table 2.
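The repetition of sample images for sparsely populated labels can be sketched as follows. The text does not give the target number of samples per label, so `min_count` is a hypothetical parameter, and the file names and the label 'KhaI' in the usage note are placeholders rather than entries from the dataset.

```python
from collections import defaultdict
from itertools import cycle, islice


def oversample_by_repetition(samples, min_count):
    """Repeat samples of rare labels until each label has at least min_count.

    samples is a list of (image_id, label) pairs. Labels that already have
    min_count or more samples are left unchanged.
    """
    by_label = defaultdict(list)
    for image_id, label in samples:
        by_label[label].append((image_id, label))

    balanced = []
    for items in by_label.values():
        target = max(len(items), min_count)
        # cycle() walks through the existing samples repeatedly
        balanced.extend(islice(cycle(items), target))
    return balanced
```

For example, `oversample_by_repetition([("a.png", "KaAA"), ("b.png", "KaAA"), ("c.png", "KhaI")], 3)` returns three samples for each of the two labels. When repeated images are used, it is usually safer to keep all copies of an image in the same partition so that a duplicate of a training image does not appear in the test data.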

2.1b Double-CNN architecture:

In this proposed model, two CNNs are used. The first CNN, here named Stage-1, is trained on the consonant-labelled dataset, available as Hindi consonants with Matras in CALAM, in order to classify test samples into basic consonant classes. For example, Devanagari modified character is labelled as ‘Ka( )’. The second CNN, here named Stage-2, is trained on a modifier-labelled dataset, formed by rearranging the sample images of the consonant-labelled dataset, so that it learns to classify test samples into the correct modifier classes. For example, Devanagari modified character is labelled as ‘AA( )’. The model predicts the consonant and modifier classes of a test sample together by combining the results obtained from Stage-1 and Stage-2. The complete predicted label is compared with the actual label to calculate accuracy on the test dataset. To check the performance of both stages, 6-fold cross-validation accuracy is evaluated and tabulated in table 2.
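The combination of the two stage outputs can be sketched as below. The text does not specify how the two predictions are joined, so simple concatenation of the consonant and modifier names (giving labels in the style of 'KaAA') is an assumption, and the function names are hypothetical.

```python
def combine_labels(consonant, modifier):
    """Join a Stage-1 consonant label and a Stage-2 modifier label (assumed format)."""
    return consonant + modifier


def combined_accuracy(stage1_preds, stage2_preds, true_labels):
    """Fraction of samples whose combined predicted label equals the actual label.

    A sample counts as correct only when both the consonant and the modifier
    are predicted correctly.
    """
    if not (len(stage1_preds) == len(stage2_preds) == len(true_labels)):
        raise ValueError("all inputs must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    correct = sum(
        combine_labels(consonant, modifier) == truth
        for consonant, modifier, truth in zip(stage1_preds, stage2_preds, true_labels)
    )
    return correct / len(true_labels)
```

Because a combined label is correct only when both stages are correct, the combined accuracy cannot exceed the accuracy of either stage on the same test samples.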

2.2 HOG features and SVM classifier

The HOG is a feature descriptor that counts occurrences of gradient orientations in localized portions of an image called cells. The amount of shape information in the feature vector varies with the cell size. Cell sizes [2 2], [4 4] and [8 8] are visualized in figure 2 for an image of size [32 32]. The figure shows that cell size [2 2] encodes the most shape information and cell size [8 8] the least. However, the dimensionality of the feature vector increases from 324 at cell size [8 8] to 8100 at cell size [2 2]. Cell size [4 4] is chosen here as a good compromise: it encodes a sufficient amount of spatial information in a feature vector of length 1764. An SVM is chosen as the classifier and is trained by supervised learning on the HOG features and their corresponding labels.
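The feature lengths quoted above can be checked with a short calculation. The text does not state the block size, block overlap or number of orientation bins, so the values used here (blocks of \(2\times 2\) cells, an overlap of one cell between neighbouring blocks and 9 bins) are assumptions; they are common defaults and reproduce the reported lengths. The function name is hypothetical.

```python
def hog_feature_length(image_size, cell_size, block_size=2, overlap=1, num_bins=9):
    """Length of the HOG feature vector for a square image.

    image_size and cell_size are in pixels; block_size and overlap are in
    cells. The defaults for block_size, overlap and num_bins are assumptions.
    """
    cells_per_side = image_size // cell_size
    # Blocks slide over the grid of cells with a step of (block_size - overlap)
    blocks_per_side = (cells_per_side - block_size) // (block_size - overlap) + 1
    values_per_block = block_size * block_size * num_bins
    return blocks_per_side * blocks_per_side * values_per_block


for cell in (2, 4, 8):
    print(cell, hog_feature_length(32, cell))
```

Under these assumptions a \(32\times 32\) image gives 8100 values at cell size [2 2], 1764 at [4 4] and 324 at [8 8], so halving the cell size multiplies the feature length by roughly four to five.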

Figure 2 (a) Sample of modified character. (b) Visualization of HOG features on variable cell size.

3 Experimental results and analysis

The dataset used for the experiments is randomly divided into train (70\(\%\)), validation (15\(\%\)) and test (15\(\%\)) samples. The dataset distribution is tabulated in table 1. The dataset is divided into 6 sections (or folds), and each section is used for training, validation and testing at some point. The validation and test results of these 6 sections are calculated and their mean values are presented in table 2. For the single CNN architecture, the average 6-fold cross-validation accuracy and test accuracy are 81.52\(\%\) (standard deviation 4.81) and 81.62\(\%\) (standard deviation 5.17), respectively. Table 2 also shows that the average 6-fold cross-validation accuracy of Stage-1 and Stage-2 in the double-CNN architecture is approximately 89.80\(\%\) (standard deviation 3.33) and 85.65\(\%\) (standard deviation 5.87), respectively. The performance of Stage-1 and Stage-2 is also calculated for a random distribution of the dataset used in each stage over 11 experiments, and the average values are shown in table 3. Table 3 shows that both stages of the double-CNN architecture perform well on the recognition of test data, up to an average value of 90.99\(\%\) (standard deviation 0.01). A few examples of modified character recognition results obtained with the double-CNN architecture on the test dataset are presented in table 4.
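The random 70/15/15 partition and the summary statistics reported above can be sketched as follows. The seed handling, the rounding of partition sizes and the use of the sample (rather than population) standard deviation are assumptions, since the text does not specify them; the function names are hypothetical.

```python
import random
import statistics


def split_indices(n_samples, train=0.70, val=0.15, seed=0):
    """Randomly partition sample indices into train, validation and test lists.

    The test partition receives whatever remains after train and validation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


def summarize(accuracies):
    """Mean and sample standard deviation of per-experiment accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Calling `split_indices` with a different seed for each experiment gives a different random partition, and `summarize` can then be applied to the list of accuracies collected over the experiments.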

Table 1 Description of data used for experiment evaluation.
Table 2 Six-fold cross-validation and test accuracies for single CNN architecture and double-CNN architecture.
Table 3 Performance of double-CNN architecture using random distribution of dataset.
Table 4 Recognition results of double-CNN architecture for a few test samples.

4 Comparison of results

In this research work, offline handwritten modified characters in the Devanagari script are recognized using the two CNN-based models described in sections 2.1a and 2.1b. The way of recognition for a sample image of by both models is presented in figure 3. From tables 2 and 3, it is observed that the validation and test accuracies improve with the double-stage CNN model.

Figure 3 Recognition example by single CNN architecture and double-CNN architecture.

The recognition task discussed in this article is also evaluated with the feature extraction and classification method described in section 2.2. For this experiment, the dataset is partitioned into train and test data in a 7:3 ratio. HOG features are extracted with a \(4 \times 4\) cell size from each \(32 \times 32\) image, which gives a \(1 \times 1764\) feature vector for each image, as shown in figure 2. Table 5 compares the recognition performance of the HOG+SVM technique with that of the CNN technique on the dataset described in table 1.

Table 5 Comparison of proposed models with (HOG+SVM) technique.

5 Conclusion

The paper presents a novel technique for the recognition of offline handwritten modified characters in the Devanagari script. Two CNN-based methods are discussed, and the double-CNN architecture is observed to perform better than the single CNN architecture. A traditional feature extractor (HOG) and classifier (SVM) are also implemented for comparison, and the deep CNN is observed to recognize Devanagari modified characters with better accuracy.