1 Introduction

Pattern recognition is a fundamental requirement of many applications based on deep learning. It is an important aspect of application domains such as optical character recognition (OCR), video surveillance, face recognition, medical diagnosis, human-computer interaction and access control systems. Pattern recognition requires capable pipelines that can not only handle real-time data but also recognize it accurately.

Handwriting recognition is one of the basic tasks in pattern recognition and a fundamental requirement for OCR and for document analysis and extraction. Marathi is one of the most widely spoken regional languages in India and is the mother tongue of many Maharashtrians. Many researchers [1,2,3] have already studied the recognition of isolated handwritten characters, but Marathi numeral classification is not a frequently studied topic. The accuracy obtained by our pipeline is about 97.91%, which is higher than that of most existing pipelines.

The data was collected by scanning more than 80,000 handwritten numerals. Marathi numerals have a more curved, nonlinear shape than the comparatively linear English numerals. The various styles of writing the same numeral make digit recognition a challenging task. We have tried to cover all styles of writing in our dataset. Unlike the English numerals dataset (MNIST), a Marathi numerals dataset is not readily available on the Internet. The key highlights of the paper are:

  • A stacked ensemble of convolutional neural networks is proposed for numeral recognition.

  • The meta learning classifier has better performance than the existing systems.

In this paper, Sect. 2 reviews the work already done in this field, Sect. 3 describes the dataset collection and pre-processing techniques, Sect. 4 describes the proposed pipeline and its implementation in detail, Sect. 5 covers performance evaluation and analysis, and Sect. 6 presents the final conclusions and the future scope.

2 Related work

Its distinctive handwriting styles contribute to the unique character of the Marathi language in the Indian subcontinent. Developing a handwritten numeral identification system is not an easy task, given the diverse writing styles and the intricate curves involved in writing. Despite this, Dongre and Mankar proposed a solution in 2013 that identifies numerals using statistical discriminant functions and geometric attributes such as lines, line directions, perimeter, solidity, image area and eccentricity [4]. Kim et al. [5] proposed a system that used hybrid features for representation and a combined classifier for classification. Acharya et al. [6] devised a handwriting recognition system that used various features in multilevel classifiers. Vasantha et al. [7] implemented pre-processing and post-processing to raise accuracy above 99%. Kumar et al. [8] used morphological features to identify blobs in numerals composed of blobs and stems. Singh et al. [9] designed an artificial neural network pipeline that identifies five different types of fonts of the Devanagari script. Rajput and Mali [10] used Fourier descriptors to describe the shape of isolated Marathi handwritten numerals and tested the system using various algorithms. Bhattacharya et al. [11] achieved an accuracy of 92.83% using an artificial neural network (ANN) and a hidden Markov model (HMM). Srivastava and Gharde [12] used support vector machines (SVM) on a dataset constructed by the automated numeral extraction and segmentation program (ANESP). Moment invariant and affine moment invariant techniques were used to extract 18 features that were passed to the SVM, which achieved an accuracy of about 99.48%. Mane and Kulkarni [13] proposed a customized convolutional neural network (CNN), which achieved an accuracy of 94.93%. Mane and Kulkarni also highlighted the importance of CNNs in pattern recognition [14].
Vaidya and Joshi [15] proposed a handwritten numeral identification system using the statistical distributions of the image feature vectors. Hanmandlu et al. [16] proposed a fuzzy model for Hindi character identification, which achieved an accuracy of 90.65%. Khanale and Chitnis [17] used an ANN for Marathi character identification and achieved an accuracy of about 96%. Patil and Sinha [18] proposed a basic ANN approach for classification, but their dataset of 150 images is very small compared to the proposed dataset of 82,000 images. Duddela et al. [19] proposed an ANN classifier for Devanagari digits using PRTool, which achieved 95% accuracy. Many researchers [20,21,22,23] used the support vector machine and its extended variations, such as weighted SVM and SVM-KNN, for the recognition of handwritten characters. Patil et al. [24] presented a recurrent neural network with an LSTM model for the recognition of handwritten Marathi digits, which achieved 79% accuracy. Recently, Gupta et al. [25] used supervised deep learning techniques for handwritten digit recognition in eight different scripts, with a maximum accuracy of 96%. In comparison, our proposed stacked ensemble neural network is more elaborate than the ANN and other classifiers described above, and it outperforms the ANN-based approaches in terms of diversity and accuracy.

Fig. 1 Extracted and cropped images of collected samples

3 Dataset collection and preprocessing

A Marathi handwritten numeral dataset suitable for deep learning applications is not publicly available. Hence, the dataset was generated manually by collecting handwritten samples of Marathi numerals from people of diverse age groups. The individuals were asked to write the numerals from 0 to 9 on an A4-size sheet that was partitioned into nine fixed-size regions. Extracted and cropped images of the collected samples are shown in Fig. 1. All images are converted to grayscale and cropped to 28 × 28 pixels, i.e., 784 features per image. To save storage space, the features of each image are stored in a CSV file against the corresponding image ID. The dataset is then augmented using image augmentation techniques such as zooming (0.2), shear range (0.04), horizontal and vertical flips, rotation (8), width shift range (0.2) and height shift range (0.2). The images obtained after augmentation are shown in Fig. 2.
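
The storage step described above can be sketched as follows. This is a minimal illustration only: the column layout (`image_id` followed by 784 pixel columns named `p0` to `p783`) and the sample ID are hypothetical, since the exact file format of the dataset is not specified here.

```python
import csv
import io

SIDE = 28  # images are 28 x 28 grayscale, as stated in the text


def image_to_row(image_id, pixels):
    """Flatten a 28 x 28 grid of grayscale values into one CSV row."""
    assert len(pixels) == SIDE and all(len(r) == SIDE for r in pixels)
    flat = [value for row in pixels for value in row]  # row-major order
    return [image_id] + flat


def write_csv(rows):
    """Write rows to an in-memory CSV with a header (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["image_id"] + [f"p{i}" for i in range(SIDE * SIDE)])
    writer.writerows(rows)
    return buffer.getvalue()


# Toy example: a blank image with one bright pixel at row 1, column 2.
toy = [[0] * SIDE for _ in range(SIDE)]
toy[1][2] = 255
row = image_to_row("sample_0001", toy)
csv_text = write_csv([row])
```

Each row then holds one image ID and its 784 pixel values, so a whole dataset fits in a single file that can be read back line by line.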

Fig. 2 Images after applying augmentation techniques

4 Proposed pipeline and implementation

The proposed pipeline is a stacked ensemble model whose base pipeline is a customized CNN used to identify handwritten Marathi numerals.

  • The CNN pipeline does not use pooling operations. Instead, strided convolutions with larger kernel sizes are applied, so that downsampling is performed by filters whose weights are updated during backpropagation.

  • The CNN pipeline is inspired by the VGG16 architecture. It consists of two normal convolution blocks followed by a strided convolution block to extract lower-level features.
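
To illustrate how a strided convolution can take over the downsampling role of pooling, the sketch below computes spatial output sizes with the standard formula floor((n + 2p - k) / s) + 1. The kernel sizes, strides and padding values are illustrative assumptions, not the settings of the proposed pipeline.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# Assumed block: two "normal" 3x3 convolutions (stride 1, padding 1)
# followed by one strided 5x5 convolution (stride 2, padding 2).
size = 28
after_conv1 = conv_output_size(size, kernel=3, stride=1, padding=1)
after_conv2 = conv_output_size(after_conv1, kernel=3, stride=1, padding=1)
after_strided = conv_output_size(after_conv2, kernel=5, stride=2, padding=2)

# For comparison: a 2x2 pooling window with stride 2 and no padding
# follows the same size formula but has no learnable weights.
after_pooling = conv_output_size(after_conv2, kernel=2, stride=2, padding=0)
```

Under these assumptions both the strided convolution and the pooling window halve a 28 × 28 feature map to 14 × 14. The difference is that the strided filter has weights that are learned.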

All convolution and dense layers use the rectified linear unit (ReLU) activation function, given in Eq. 1, to introduce non-linearity into the pipeline.

$$\begin{aligned} f(x) = \left\{ \begin{array}{ll} 0, & x \le 0\\ x, & x > 0 \end{array} \right. \end{aligned}$$
(1)

ReLU is the most widely used activation function in CNN pipelines [26]. A softmax classifier is used in this pipeline; it is given as follows:

$$\begin{aligned} P(y=j)=\frac{e^{z_{j}}}{\sum _{k}e^{z_{k}}} \end{aligned}$$
(2)

where j = 1, ..., K and K is the number of classes. The function normalizes the output values to the range from 0 to 1 so that the output can be interpreted as a categorical probability distribution over the K classes.
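
As a small numeric illustration of Eqs. 1 and 2, the following sketch implements both functions in plain Python. Subtracting the maximum logit before exponentiating is a common numerical-stability step added here; it does not change the result of Eq. 2. The example logits are made up.

```python
import math


def relu(x):
    """Eq. 1: zero for non-positive inputs, identity otherwise."""
    return x if x > 0 else 0.0


def softmax(z):
    """Eq. 2: turn a list of logits into class probabilities."""
    m = max(z)  # shift by the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative values only
probs = softmax(logits)
```

The resulting probabilities are all positive, sum to 1 and preserve the ordering of the logits, so the predicted class is the index of the largest logit.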

Fig. 3 Base pipeline

The loss function used here is the categorical cross-entropy function, which is given by

$$\begin{aligned} H(p,q)=-\sum _{i}\left[ p_{i}\log q_{i}+(1-p_{i})\log (1-q_{i})\right] \end{aligned}$$
(3)

where p represents the actual labels and q the predicted probabilities.

We propose a stacking ensemble of our customized CNN pipelines to improve the accuracy of the predictions. In a stacking ensemble, all the pre-trained base pipelines (customized CNN pipelines) that participate in the average ensemble are integrated, or stacked, into a multi-head deep neural network that makes a prediction based on the outputs of the base pipelines. The base pipelines are stacked to create a meta-learning classifier that combines the outputs of the base models and gives the final result, as represented in Fig. 3.
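
To make the loss concrete, the sketch below evaluates the standard categorical cross-entropy, H(p, q) = -Σ p_i log q_i, which is the first term of Eq. 3. Whether the second term of Eq. 3 is also used in the actual implementation is not stated, so restricting the example to the first term is an assumption. The probability vectors are made up.

```python
import math


def categorical_cross_entropy(p, q, eps=1e-12):
    """Standard categorical cross-entropy for one sample.

    p: one-hot (or soft) target distribution
    q: predicted class probabilities
    eps guards against log(0).
    """
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


target = [0.0, 1.0, 0.0]           # true class is index 1
confident = [0.05, 0.90, 0.05]     # illustrative prediction
unsure = [0.30, 0.40, 0.30]        # illustrative prediction

loss_confident = categorical_cross_entropy(target, confident)
loss_unsure = categorical_cross_entropy(target, unsure)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true class, so the confident prediction is penalized less than the unsure one.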

A stacked ensemble of CNNs makes sure that the best possible contribution is taken from each base CNN pipeline, whereas in an average ensemble every pipeline contributes equally. Moreover, the base pipelines can be updated or retrained, depending on the functional requirements and the available computation capacity. Figure 4 illustrates the architecture of the proposed pipeline. The base classifier is the customized CNN pipeline, and the Keras API is used to implement it. The flowchart of the entire training process is shown in Fig. 5. Five base pipeline models are trained, and their average ensemble accuracy is about 97.2%. All the base pipelines are then merged to create a meta-learning classifier based on the stacked ensemble. Ensembling improves the performance metrics of the CNN pipeline and is widely used in pattern classification. The batch size used in the implementation is 64. Of the samples, 63,000 are used for training, 7,000 for validation and 11,500 for testing. For each fold, the training and validation samples are selected randomly to make the model more robust.
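
The average ensemble mentioned above can be sketched as follows. The probability vectors stand in for the softmax outputs of three hypothetical base models on one sample with three classes; they are invented for illustration and are not results from the trained pipelines.

```python
def average_ensemble(member_probs):
    """Average class probabilities over all members.

    Returns the averaged distribution and the index of the predicted class.
    Every member contributes with the same weight.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    avg = [sum(m[c] for m in member_probs) / n_members for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: avg[c])
    return avg, predicted


# Invented outputs of three base models for a single sample.
member_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
avg, predicted = average_ensemble(member_probs)
```

Here the first model alone would predict class 0, but the averaged distribution favors class 1 because the other two members agree on it.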

Fig. 4 Architecture diagram of proposed stacked ensemble model

The training algorithm for the customized convolutional neural networks is described in six steps.

  • Step I: Initialize all filters and parameters in the CNN pipeline with random values.

  • Step II: The pipeline takes an image batch as input, passes it through stages such as convolution and flattening, and finally produces an output.

  • Step III: Calculate the error using the categorical cross-entropy loss function mentioned above.

  • Step IV: Backpropagate the error and update all parameters accordingly.

  • Step V: Repeat the above steps for each base pipeline.

  • Step VI: Test each base pipeline on the given testing set.
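
The six steps above can be sketched with a deliberately small stand-in model. The sketch replaces the CNN with a single dense softmax layer trained by full-batch gradient descent, and uses three toy one-hot inputs instead of images. The learning rate, epoch count, seeds and data are all assumptions chosen for illustration, and the models are evaluated on the toy data itself.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, x):
    """Stand-in for the convolution/flatten stages: one dense softmax layer."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def train_base(data, n_classes, epochs=50, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_features = len(data[0][0])
    # Step I: initialize all parameters with random values.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
               for _ in range(n_classes)]
    losses = []
    for _ in range(epochs):
        grads = [[0.0] * n_features for _ in range(n_classes)]
        loss = 0.0
        for x, label in data:
            q = forward(weights, x)            # Step II: forward pass
            loss += -math.log(q[label])        # Step III: cross-entropy error
            for c in range(n_classes):         # Step IV: gradient is (q - p) * x
                err = q[c] - (1.0 if c == label else 0.0)
                for j in range(n_features):
                    grads[c][j] += err * x[j]
        for c in range(n_classes):             # Step IV: parameter update
            for j in range(n_features):
                weights[c][j] -= lr * grads[c][j] / len(data)
        losses.append(loss / len(data))
    return weights, losses


def accuracy(weights, data):
    hits = 0
    for x, label in data:
        q = forward(weights, x)
        hits += max(range(len(q)), key=lambda c: q[c]) == label
    return hits / len(data)


# Toy, linearly separable data: one one-hot "image" per class.
toy = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 2)]

# Step V: repeat the training for each base model (different seeds).
base_models = [train_base(toy, n_classes=3, seed=s) for s in range(3)]
# Step VI: test each base model (here on the toy data itself).
scores = [accuracy(w, toy) for w, _ in base_models]
```

Training several models from different random initializations is what later gives the ensemble members some diversity, even when they share the same architecture.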

Fig. 5 Training flowchart for base pipeline/model

We trained ten base pipelines but, keeping the computational constraints in mind, we stack only five of them. The meta-learning classifier is then created; it takes the stacked outputs of all the base models, processes them and gives the final output. The learning algorithm for the stacked ensemble meta-learning classifier is described in Algorithm 1.

The base pipelines were not updated as part of the training process of the meta-classifier. The meta-learning classifier was trained on the validation dataset for 1 epoch, using 8,000 samples. The accuracy achieved was 97.91%. Figure 6 shows the pipeline of the meta-learning classifier.
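
The idea of training a meta-classifier on frozen base-model outputs can be sketched as below. This is not a reproduction of Algorithm 1: the meta-classifier here is assumed to be a single softmax layer over the concatenated base-model probabilities, trained with one full-batch gradient step (one epoch). The four samples, two classes and two base models (A and B) are invented so that the effect is visible, and the models are scored on the same toy samples.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def stack_features(member_probs):
    """Concatenate the class probabilities of all base models."""
    return [p for probs in member_probs for p in probs]


def meta_predict(weights, features):
    return softmax([sum(w * f for w, f in zip(row, features)) for row in weights])


def train_meta(samples, n_classes, epochs=1, lr=1.0):
    """Train only the meta layer; base-model outputs are fixed inputs."""
    n_features = len(stack_features(samples[0][0]))
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * n_features for _ in range(n_classes)]
        for member_probs, label in samples:
            f = stack_features(member_probs)
            q = meta_predict(weights, f)
            for c in range(n_classes):
                err = q[c] - (1.0 if c == label else 0.0)
                for j in range(n_features):
                    grads[c][j] += err * f[j]
        for c in range(n_classes):
            for j in range(n_features):
                weights[c][j] -= lr * grads[c][j] / len(samples)
    return weights


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Each sample: ([probs from model A, probs from model B], true label).
# Model A is mostly reliable, model B is not (invented numbers).
samples = [
    ([[0.9, 0.1], [0.6, 0.4]], 0),
    ([[0.8, 0.2], [0.3, 0.7]], 0),
    ([[0.2, 0.8], [0.4, 0.6]], 1),
    ([[0.4, 0.6], [0.9, 0.1]], 1),
]

meta_weights = train_meta(samples, n_classes=2, epochs=1)

stacked_hits = sum(
    argmax(meta_predict(meta_weights, stack_features(m))) == y for m, y in samples
)
average_hits = sum(
    argmax([sum(p[c] for p in m) / len(m) for c in range(2)]) == y
    for m, y in samples
)
stacked_acc = stacked_hits / len(samples)
average_acc = average_hits / len(samples)
```

On this toy data the plain average is misled by model B on the last sample, while the meta layer learns to give more weight to model A. This is the motivation for stacking over averaging, shown here only on invented numbers.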

Fig. 6 Stacked ensemble meta learning classifier

5 Results and analysis

The approach proposed by Patil in 2012 has the highest accuracy, but its sample size is tiny. Hence, it is hard to say whether that model would outperform the other models under exhaustive testing. The proposed approach is evaluated on 11,500 samples, which is much larger than the sample sizes used by the authors of the other methods. Hence, considering both accuracy and sample size, the proposed approach compares favorably with the existing methods. Figure 7 plots accuracy against the number of members used in the average ensemble. The comparison of the proposed method with existing systems is given in Table 1. The average ensemble accuracy is about 97.21%, and the stacked ensemble meta-learning classifier raises it to 97.91%. The classification metrics used in the analysis of the proposed pipeline are described below, and the confusion matrix is shown in Fig. 8.

  • Confusion matrix: It is a square matrix whose size equals the number of target classes. The rows represent the actual labels, and the columns represent the predicted labels. Each entry gives the number of samples with a given actual label that were predicted as a given label. For an ideal classifier, all counts lie on the diagonal of the matrix and all other elements are zero.

  • Accuracy: It represents the percentage of samples whose predicted labels match their actual labels. It is one of the most basic and important classification metrics.

    $$\begin{aligned} Accuracy=\frac{TP+TN}{TP+TN+FP+FN}. \end{aligned}$$
    (4)
  • Precision: Precision tells us, out of all the samples predicted as a particular label, how many actually belong to that label.

    $$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
    (5)
  • Recall: It gives the proportion of samples from a particular class label that were identified correctly.

    $$\begin{aligned} Recall=\frac{TP}{TP+FN} \end{aligned}$$
    (6)
  • F1-score: The harmonic mean of precision and recall gives the F1-score. The maximum value of F1-score is 1 and minimum is 0.

    $$\begin{aligned} F1\text {-}score=\frac{2\times Precision\times Recall}{Precision+Recall} \end{aligned}$$
    (7)
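
The metrics in Eqs. 4 to 7 can be computed directly from a confusion matrix, as in the sketch below. For the multi-class case, accuracy is taken as the sum of the diagonal divided by the total number of samples, and TP, FP and FN are counted per class. The label lists are invented for illustration and are not results of the proposed pipeline.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_metrics(matrix, cls):
    """Precision, recall and F1-score for one class (Eqs. 5 to 7)."""
    tp = matrix[cls][cls]
    fp = sum(matrix[r][cls] for r in range(len(matrix))) - tp  # column minus TP
    fn = sum(matrix[cls]) - tp                                 # row minus TP
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented labels for eight samples and three classes.
actual = [0, 0, 1, 1, 2, 2, 2, 1]
predicted = [0, 1, 1, 1, 2, 2, 0, 1]

matrix = confusion_matrix(actual, predicted, n_classes=3)
overall_accuracy = accuracy(matrix)
precision_1, recall_1, f1_1 = per_class_metrics(matrix, cls=1)
```

In this toy example class 1 is never missed (recall of 1) but one sample of class 0 is wrongly assigned to it, which lowers its precision and therefore its F1-score.
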
Table 1 Comparison of proposed method with existing systems
Fig. 7 Accuracy versus number of members (k-fold validation)

Fig. 8 Confusion matrix

Fig. 9 Mis-classified samples

Table 2 Classification report

The confusion matrix of actual versus predicted labels for the testing samples is shown in Fig. 8. Table 2 shows the classification report of the stacked ensemble meta-learning classifier, and Fig. 9 shows some misclassified samples. The samples that are not predicted correctly are either unclear or written ambiguously, as can be seen in Fig. 9. The proposed pipeline fails to predict such handwritten samples accurately.

6 Conclusion

The paper focuses on Marathi handwritten numeral recognition using stacked ensembles. The proposed pipeline achieved an accuracy of 97.91%. The stacked ensemble learning approach improves on the accuracy of the average ensemble model. However, in some cases, the pipeline does not work as desired. This is because of complex writing curves, the lower resolution of the scanned images, and some unusual writing patterns. Nevertheless, the pipeline works as expected for most writing styles and patterns.

Optical character recognition for Devanagari scripts faces major obstacles in the identification of letters and numerals. The proposed approach could contribute considerably to OCR for Devanagari scripts. Additionally, applications such as the evaluation of Marathi answer sheets and Marathi sentiment analysis could benefit from our approach.

In the future, the same pipeline could be extended to Marathi alphabets. Additionally, the accuracy could be improved by increasing the number of individual classifiers or epochs. Furthermore, a dilated customized CNN could be used to decrease the computational cost as well as to increase accuracy. Future versions of the pipeline should be designed to handle the curves that the current pipeline failed to identify.