1 Introduction and motivation

A machine can recognize handwritten text using optical character recognition (OCR). There are two ways to recognize handwritten text: online and offline. In the online case, electronic devices trace the writing direction while the writing is in progress [24], as in online signature authentication [18]. Offline handwriting identification, on the other hand, identifies text from scanned textual images. It includes the recognition of numbers from bank cheques [1], addresses from letters, etc., and makes it possible to save both space and time. Here, no information is available about the movement of the pen tip, the trajectory, or the direction of the text line. Offline OCR is therefore more challenging than online OCR. Some of the essential steps involved in the offline process are shown in Fig. 1.

Fig. 1 Common steps involved in offline optical character recognition

Several scripts are used in India for various languages. Comparatively little OCR research in the literature supports handwritten text identification for them. Furthermore, most of the existing studies attempt to recognize Devanagari and Bangla handwritten characters rather than other regional languages [41]. Odia is an Indo-Aryan language spoken by more than 35 million people in Indian states such as Odisha, West Bengal, and Gujarat. Assamese and Bangla are the languages most closely related to Odia. The Odia script originated from the Kalinga script and is one of the many offshoots of ancient India’s Brahmi script. Since ancient times, palm leaves have been used in Odisha to record the state’s literature and history, and the printing press has been available in Odisha since 1838. There are many historical records and manuscripts in Odia by several renowned writers such as ‘Fakir Mohan Senapati’, ‘Radhanath Ray’, ‘Madhusudan Das’, ‘Gopabandhu Das’, etc. In recent decades, attempts have been made to preserve these as digital files. Saving the contents as text files is a better choice than keeping scanned copies of the pages, to avoid running out of storage space. Manual conversion of this material is impractical, so the conversion of these image files to text files must be automated, which necessitates offline OCR for Odia characters. The lack of Odia databases to train the OCR engine is the primary impediment, and the performance of Odia OCR must be improved to fulfill the requirements of real-time recognition. This strongly motivates contributing a good-quality Odia character and numeral dataset to the research community working in this domain. Handwritten character recognition is among the most popular applications of computer vision, yet the task is not simple. Researchers are encouraged to enhance recognition results using various pattern recognition algorithms, which support the offline text retrieval and digitization process.

Researchers began studying offline Odia OCR around 2005. Odia has 10 numerals and 49 basic characters, and more than 100 compound characters can be formed from these basic characters. The basic characters consist of 12 vowels and 37 consonants. These can be seen in Fig. 2, where all digits, basic characters, and some of the commonly used compound characters are presented in handwritten form. Each character class is marked by a green number at the top of the characters. These class labels follow the class sequence in the newly developed ‘IIITBOdiaV2’ database created at IIIT Bhubaneswar. The character groups numbered 0 to 9 are the digit classes, the numbers 10 to 56 represent basic characters, and 57 to 59 represent special modifiers. Two characters between classes 56 and 57 are excluded from the list of basic characters in this database because they are very similar to classes 34 and 35, respectively. The modified characters of ‘Ka’ are represented by classes 60 to 69. The most common compound characters are classes 70 to 111, except for 80, 81, 82, and 84, which are modified characters similar to classes 60 to 69. Currently, the compound characters labeled 70 to 111 are present in the database; although the script has more than a hundred compound characters, the rest are not included due to insufficient collections of samples.

Fig. 2 Set of handwritten digits, basic characters and compound characters in Odia. Numbers shown in green at the top of the characters are the respective class labels present in the ‘IIITBOdiaV2’ database

Very little research has been conducted on the recognition of offline Odia characters, and most of it addresses printed Odia character recognition. The recognition accuracies reported for printed characters are very satisfying. The body of work becomes smaller still for handwritten Odia character recognition, and the majority of it targets handwritten numerals; only a few works address handwritten characters, and their recognition outcomes are rather disappointing. For handwritten compound character recognition, the count drops to zero. Many factors contribute to this, including allograph complexity, a scarcity of datasets, and a scarcity of commercial interest. Smartphone use is expanding rapidly in all regions of India, which will lead to heavy human-computer interaction in the future, so OCR development for Indian scripts, including popular scripts such as Odia, will be in high demand. As a result of these factors, a high-quality recognizer for Odia handwritten text with various modifiers and compound characters is now required. This necessitates the collection of a standard handwritten Odia database containing the different possible compound characters. The current study proposes such a character database as well as a hybrid feature extraction technique. The neural network models are designed to take extracted features as input instead of direct images. This speeds up character recognition, because processing the extracted features takes less time than processing the raw images.

1.1 Characteristics and complexities of Odia scripts

Unlike English, the Odia script does not have separate uppercase and lowercase characters. The Odia script has a vast number of simple and complex characters, each with various components and characteristics. Most of them are circular in their upper sections and have perpendicular straight lines on the right side of the letters. The modified characters come in different shapes and sizes based on the ‘matras’ or ‘kaara’ used. Depending on the vowel, these characters can be small or large, and the ‘matras’ can occur on any of the four sides of a consonant character. Some of them are shown in Fig. 2, from class ‘60’ to ‘69’, and there are many more that are not labeled here. When a consonant or a vowel is combined with another consonant, a compound character is formed. Some compound character formations are shown in Fig. 3.

Fig. 3 Some examples of Odia compound characters using ‘juktas’ or ‘conjuncts’. Numbers shown in green at the top of the characters are the respective class labels present in the ‘IIITBOdiaV2’ database

An Odia character recognition system could face several problems, as follows:

  1. Due to roundish, identical, and sometimes similar characters, distinguishing character classes becomes difficult.

  2. There is no clear and fixed rule for the interference of modifiers with their base character. It is a significant roadblock for word-to-character segmentation.

  3. The structure of mirror images between two characters is another challenge. Flipping one letter may make it look like another character. The characters ‘Ga’ and ‘Ma’ become ‘Tha’ and ‘Dha’, respectively, when they are vertically flipped, as shown in Fig. 4.

  4. The absence of benchmark databases is another weakness in recognizing Odia handwritten characters. There are currently only three publicly accessible databases, provided by the Indian Statistical Institute in Kolkata, NIT Rourkela, and IIT Bhubaneswar.

Fig. 4 Two examples of characters that produce other characters when flipped vertically

The implementation of a handwritten character recognition system necessitates a massive number of training samples. Generating a dataset of this size is difficult because it takes a lot of time and effort. Only three extensive handwritten datasets have been developed in recent years, and they contain character images of handwritten Odia numerals and basic characters only. That is why an attempt has been made to build a database for Odia handwritten characters that includes some widely used modified characters and compound characters.

2 Related work

Around 35 conference and journal papers are present in the literature on recognizing handwritten Odia characters in an offline manner. This is very little in comparison to other regional languages of India, such as Gurumukhi, Bangla, Tamil, etc. All of this research was done within the last two decades. In most of the papers, the authors used synthetic handwritten character datasets created on their own instead of benchmark datasets, owing to the lack of benchmark datasets in the public domain. Only three datasets have been made publicly available for research purposes, and they do not cover all categories of characters; they are limited to digits and basic characters. In our literature study, we have emphasized papers that evaluate on benchmark datasets, because using a benchmark dataset in the experiments is the best procedure for evaluating a recognition mechanism.

In the early portion of the literature on Odia handwritten character recognition, some studies were performed on the ISI Kolkata numeral dataset. Tripathy et al. 2003 [42] suggested a method to recognize the numerals in which threshold-based binarization was used as the primary preprocessing technique. Features were extracted based on reservoir area and location, the path of water flow, the number of loops, the center of gravity, the ratio between reservoir and loop, profile-based features, and jump discontinuity. Finally, using a binary-tree classification approach, the accuracy obtained was around 97.74%. Roy et al., 2005 [32] suggested computing a region of interest (ROI) based on a bounding box and segmenting the characters into blocks. Feature extraction was based on a chain-code histogram with a dimension of 400. Applying neural networks and quadratic classifiers, an accuracy of 94.81% was reported in the study. In addition to these preprocessing techniques, a few others such as the Gaussian filter and Roberts filter were applied in [29], and an accuracy of 98.54% was achieved using the same quadratic classifier.

In [4], the authors implemented binarization, thresholding, and normalization of the digit images in the preprocessing phase. Applying a Hidden Markov model (HMM) to extracted features such as horizontal and vertical strokes, an accuracy of 90.50% was obtained. In [20], only a Gaussian filter and resizing were performed in the preprocessing phase. Zernike moment features were extracted and used with an ensemble of MLPs based on multi-class AdaBoost, and a recognition rate of 97.10% was reported. The Otsu threshold and a mean filter were implemented during the preprocessing step in [9]. Following that, all of the samples were resized to 72 × 72. The Kirsch gradient operator and curvature were used to extract features, and the feature dimension was reduced using PCA. For classification, a modified quadratic discriminant function and its discriminative learning version were used, producing an accuracy of 98.5%. In [34], the authors merged their own numeral dataset with the standard ISI Kolkata dataset. After this, an LU factorization-based feature and a naïve Bayes classifier were used for classification, producing an accuracy of 92.75%. Some preprocessing methods, such as filter-based noise reduction and connected component analysis, were used in [11]. Following that, all of the digit samples were resized to a 24 × 24 format. Different features such as Slantlet, Stockwell, and Gabor wavelet-based transformations were used with SVM, HMM, Bayes, quadratic discriminant function, and kNN classifiers. The highest accuracy attained was 98.8%. Binarization, thinning, size normalization, and dilation techniques were used in the preprocessing phase on only 500 samples of the ISI Kolkata dataset by Sethy et al. 2016 [36]. The discrete cosine transform was used to extract features, which were then fed to an ANN, resulting in 90% accuracy.

On the ISI Kolkata dataset, Majhi et al., 2018 [25] used preprocessing techniques such as a median filter and Canny edge detection. After that, all of the samples were resized to 64 × 64. Features were extracted using discrete Fourier, short-time Fourier, discrete cosine, discrete wavelet, S-transform, and curvelet transformations of the digits. Using PCA, the feature dimension was reduced to 32. MLP, artificial neural network, radial basis function network, and probabilistic neural network were among the classifiers used. Finally, a 98.70% recognition rate was reported. In [6], only binarization and resizing to 28 × 28 were applied to the whole ISI Kolkata dataset. The resulting 784 pixel values formed the input to the suggested extreme learning machine, which gave an accuracy of 94.47%.

In [13], the only preprocessing added to the ISI Kolkata and IIT Bhubaneswar datasets is segmentation. A feature based on sparse concept coded tetrolets was applied to RF, SVM, kNN, and modified quadratic discriminant function (MQDF) classifiers. Finally, the accuracy on IIT Bhubaneswar digits and characters was 98.72% and 93.24%, respectively. Similarly, the ISI Kolkata digit classification accuracy was found to be 99.22%. In [40] also, the only preprocessing applied to the same datasets was size normalization. The images were then supplied to a CNN model, which achieved up to 98.4% on the ISI Kolkata dataset and 97.71% on the IIT Bhubaneswar digit dataset.

Some preprocessing techniques, such as finding the ROI using a bounding box and binarization using the Otsu threshold, were performed on digit samples of the IIT Bhubaneswar dataset [10]. Following that, the samples were divided into nine zones. Non-redundant Stockwell transform and Slantlet coefficients were extracted and used in GA-, PSO-, and Differential Evolution-based optimization with kNN. Finally, on the ISI Kolkata and IIT Bhubaneswar digit datasets, 99.1% and 98.6% accuracy were obtained, respectively.

In [12], denoising, size normalization, binarization, and morphology-based operations were applied to the IIT Bhubaneswar dataset. Features such as the Stockwell transform and weighted gradients were extracted and studied with kNN and the modified quadratic discriminant function (MQDF). This gave an accuracy of 99.1% on numerals and 95.14% on basic characters of the IIT Bhubaneswar dataset.

In [5], preprocessing techniques such as binarization, pruning, and dilation were applied to the IIT Bhubaneswar digit dataset. After that, the samples were normalized to 40 × 40. Features were extracted through convolution layers whose weights were updated by JAYA optimization. These were then applied to a Random Forest classifier, which produced an accuracy of 98.25%.

Only size normalization was used on the OHCSv1.0 (NIT Rourkela) and ISI Kolkata datasets in [7]. Convolution layers with a multi-objective Jaya-based optimized network were used to extract features, which were then applied to SVM and Random Forest classifiers. This produced an accuracy of 98.9% when recognizing characters of the NIT Rourkela dataset using RF and 97.70% on the ISI Kolkata numeral dataset using the SVM classifier.

In [39], preprocessing such as size normalization, median filtering, and skeletonization was applied to the ISI Kolkata and OHCSv1.0 (NIT Rourkela) datasets. Only 200 samples from each character class and 300 samples from each digit class were taken. Features were extracted using row-symmetry and column-symmetry chords and applied to a Decision Tree classifier, achieving a recognition rate of 96.2% on ISI Kolkata numerals and 95.6% on the OHCSv1.0 character dataset. In [27], the authors performed classification on 57 character classes of the OHCSv1.0 (NIT Rourkela) dataset. Binarization, skew detection and correction, and segmentation were performed in the preprocessing phase. Features based on horizontal and vertical histograms were collected and applied to SVM, kNN, and BPNN classifiers. The highest accuracy achieved was 83.75%.

The authors of [38], in 2018, used normalization and dilation to preprocess the NIT Rourkela dataset. From each of the 47 classes, only 150 specimens were taken. The discrete wavelet transform was used to extract features, which were then reduced using PCA. The feature set was provided to a BPNN, and 94.8% accuracy was obtained. In 2019, the authors of [37] implemented noise reduction, skew correction, and normalization as preprocessing on the NIT Rourkela dataset. All 350 samples were taken from each of the 47 classes. Feature extraction was based on symmetric axis chords and mathematical features such as Euclidean distance and Hamilton distance, and the feature dimension was then reduced using PCA. These were applied to a radial basis function neural network with a Gaussian kernel, achieving a recognition rate of 98.8%.

3 Database used for simulation

Experiments are carried out on three preexisting benchmark datasets and one newly created dataset. Compound characters are not present in any existing databases, which motivated us to create a new dataset with compound characters. ISI Kolkata Digit, NIT Rourkela’s NITROHCSV1.0, and IIT Bhubaneswar are the only currently available datasets. IIITBOdiaV2 is the name of our most recently created dataset. The following sections provide an overview of each of the datasets, and the details of the publicly available datasets, including the recently created one, are elaborated in Table 1.

Table 1 Details of the publicly available Odia handwritten character datasets

3.1 ISI Kolkata digit

The ISI Kolkata Odia database [3] contains 5,970 handwritten Odia numerals from 356 authors. The database was built using 105 emails, 166 job letters, and various forms. From the available data, a training set of 4,970 samples and a test set of 1,000 samples were generated. Each character has 100 copies in the test set, while the training set is imbalanced across classes. The samples come in a variety of sizes.

3.2 NITROHCSV1.0

One of the databases for handwritten Odia characters is NIT Rourkela’s NITROHCSV1.0 [27]. Binarization and segmentation were used to preprocess the samples, and each specimen was archived after regular standardization. The samples of the dataset were produced by 150 people regardless of gender, age, and education level. They wrote 57 characters on a series of formatted sheets, each containing the ten numerals and 47 basic characters. The developers used various equipment to minimize paper crumpling and tearing.

3.3 IIT Bhubaneswar

Samples for the IIT Bhubaneswar [12] database were obtained in several locations and weather conditions. People were required to write in an unregulated environment in order to capture natural writing variations. Papers were scanned at both 600 dpi and 300 dpi. Each handwritten class of Odia numerals and characters has 500 entries in the database. There are a total of 80 classes: ten for numerals and 70 for characters. Two types of written samples were collected: the characters were written first on clean 70 GSM pages and then on predesigned grid arrangements of pages. The characters in the grid format are all the same height and separated by a fixed width.

3.4 IIITBOdiaV2

IIITBOdiaV2 is an Odia character database consisting of both numerals and characters. Papers with a grid design were prepared and supplied to 150 volunteers, whose ages range from 5 to 70 years. All sample copies were scanned at 300 dpi at IIIT Bhubaneswar using an HP scanner. After multiple preprocessing activities such as filtering and thresholding, the characters were segmented and isolated from the parent sheets. They were then fine-tuned to ensure that the samples for each class written by the volunteers are spread evenly during the development and processing of the dataset. IIITBOdiaV1 contained ten digit classes and fifty basic-character classes. In this study, a few popular compound characters as well as modified characters have been added to form IIITBOdiaV2. There are currently 112 class labels in the database: ten digit classes, 50 basic-character classes, and 52 classes of compound and modified characters. The dataset is accessible to anyone by contacting the authors via email, provided it is used only for scholarly purposes.

4 Database preprocessing

Preprocessing is important in every character recognition scheme, and it becomes even more essential when building a handwritten character recognition mechanism. A variety of impediments can be seen, such as noise, loss of paper quality after use, poor writing styles of individuals, and so on. These are very common when recording a database of handwritten characters, and several stages of preprocessing are needed to remove these flaws. In the first stage, we use preprocessing techniques such as median blur filtering, bounding boxes, and dilation to build the ‘IIITBOdiaV2’ database. Other preprocessing procedures such as thresholding, binarization, thinning, and resizing are carried out in the second stage before classification and identification.

4.1 Median blur filter

The median blur filter [21] is popular among order-statistic filters for its high efficiency and effectiveness in eliminating specific noise forms such as Gaussian and random salt-and-pepper noise. It replaces the middle pixel of the corresponding P × P window with the median value of the cells in that window. The median differs considerably in the noisy regions present in the images. Here the median blur filter from Python OpenCV has been used with a ‘P’ value of ‘3’ to minimize the noise pixels on the character images before segmentation. The requisite formula is shown in (1), where K_{mn} is the set of coordinates in a rectangular sub-image window centered at (m,n), D(m,n) is the restored destination image, and S(p,q) is the corrupted source image evaluated over the window K_{mn}.

$$ D(m,n)= \operatorname{median}\left\{S(p,q)\right\},\qquad (p,q)\in K_{mn} $$
(1)
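A minimal OpenCV sketch of this denoising step is shown below; the image file name is hypothetical, and only the window size P = 3 is taken from the description above.

```python
import cv2

# Sketch of the first-stage denoising, assuming a grayscale scanned page
# loaded from a hypothetical file name 'scanned_page.png'.
page = cv2.imread('scanned_page.png', cv2.IMREAD_GRAYSCALE)

# Median blur with window size P = 3, applied before segmentation;
# each pixel is replaced by the median of its 3x3 neighbourhood.
denoised = cv2.medianBlur(page, 3)
```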

4.2 Bounding box with dilation

The sheets issued to the volunteers are set up with uniform grids so that the character font sizes do not differ considerably. The next crucial step was to find a method that could determine the locations of the characters on the grid paper in the proper alphabetical order. The region of interest (ROI) was located using the bounding box technique [23]. However, the bounding-box approach has some flaws. It occasionally captured unwanted degraded grid lines, and further issues arose from inconsistent writing styles and the construction of the characters: some characters are made up of many pieces, and the bounding box selects the broken characters as well as the fragmented pieces. A minimum contour length of ‘60’ was therefore used as a threshold to fix these problems; contours shorter than 60 pixels, such as eroded grid-line fragments, were discarded. As a side effect, a few main components of certain characters with contour lengths less than ‘60’ were also omitted. For example, the matra ‘u’ of the modified character ‘Ku’ has a contour shorter than 60, so the bounding box treated such fragments as distinct characters, although in practice they are letter fragments rather than separate characters. The problem was solved by merging partial character components using a morphological dilation technique with kernel size 10, so that the bounding box treats them as a single entity and helps to locate definite characters.
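The following sketch illustrates this character-isolation idea with OpenCV. Only the kernel size (10) and the minimum contour length (60) are taken from the text; the function layout, variable names, and the assumption of a binarized page with white text on a black background are illustrative.

```python
import cv2
import numpy as np

def extract_character_boxes(page_bin):
    # Morphological dilation with a 10x10 kernel merges broken character
    # fragments (e.g. detached 'matras') into single connected components.
    kernel = np.ones((10, 10), np.uint8)
    dilated = cv2.dilate(page_bin, kernel, iterations=1)

    # Outer contours of the dilated components (OpenCV 4 return signature).
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        # Discard short contours (< 60 pixels), which are typically
        # degraded grid-line fragments rather than characters.
        if cv2.arcLength(cnt, True) < 60:
            continue
        boxes.append(cv2.boundingRect(cnt))  # (x, y, w, h)
    return boxes
```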

4.3 Thresholding and binarization

Character images must be correctly preprocessed before the recognition job to enhance its accuracy. For a gray-scale segmented character image, the intensity values of the pixels range from 0 to 255. In the text image, the values ‘255’ and ‘0’ signify ‘bright’ and ‘dark’, respectively, while the intermediate values have a light-black or greyish appearance. A high-dimensional representation is required to store and process these. Text images, however, do not necessitate such representations: they typically contain only two kinds of regions, the text line itself and the rest of the background. As a result, when dealing with character images, only binary representations are required. The ‘Otsu global thresholding’ [28] method is used to generate this binarized version of the samples. The breakpoint is set at the threshold ‘T’: if the intensity value is more than ‘T’, the updated value is 255, or white; otherwise, it is ‘0’, or black. All pixel values were divided by their highest value to standardize the images. After passing through the thresholding, the images are binarized [16], with ‘0’ representing black or the background and ‘1’ representing white for the text line.
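A minimal sketch of this step with OpenCV is given below, assuming a grayscale segmented character `char_gray`; depending on the scan polarity, the inverted variant may or may not be needed so that the text ends up white.

```python
import cv2

# Otsu global thresholding; THRESH_BINARY_INV assumes dark text on a light
# background so that the text line becomes white (255) after thresholding.
_, char_bw = cv2.threshold(char_gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Standardize to {0, 1}: 1 for the text line, 0 for the background.
char_bin = (char_bw // 255).astype('uint8')
```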

4.4 Thinning

The thinning operation [2] was carried out after the binarization phase. Because the data is handwritten, the line width of the images varies from place to place; hence, a thin version of the same character image becomes necessary. Throughout the skeleton, or thinned version, of a character image, the thickness of the text line is one pixel everywhere. The thin version of a character image is shown in Fig. 5. All the scanned images are thinned and supplied to the feature extraction procedure for further processing. However, the samples are resized to remove unnecessary background regions before obtaining their thinned versions.
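One common way to obtain such a one-pixel-wide skeleton is scikit-image's `skeletonize`, sketched below; the paper's exact thinning algorithm [2] may differ.

```python
from skimage.morphology import skeletonize

# `char_bin` is the {0, 1} binarized character from the previous step;
# the skeleton keeps the text line at a thickness of one pixel everywhere.
char_thin = skeletonize(char_bin.astype(bool)).astype('uint8')
```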

Fig. 5 Thinned version of a character followed by its binary version

4.5 Resize

Initially, the samples from the various datasets are large and variable in size, which makes processing computationally expensive. To standardize the collections retrieved from the different repositories, they are resized to a fixed dimension, so that the requirements of time and space can be reduced to some extent. The image size cannot be so small that the clarity of the image is compromised. As a result, the 64 × 64 and 64 × 128 pixel scales were chosen as the sizes for all samples from each dataset in the various scenarios.
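Putting the second-stage steps together, a rough end-to-end sketch might look like the following; the 64 × 64 target size follows the text, while the exact ordering of resizing and thinning and the interpolation choice are assumptions.

```python
import cv2
from skimage.morphology import skeletonize

def preprocess_character(char_gray, size=(64, 64)):
    # Otsu binarization (inverted so that text becomes white).
    _, bw = cv2.threshold(char_gray, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Resize to a fixed dimension; nearest-neighbour keeps the image binary.
    bw = cv2.resize(bw, size, interpolation=cv2.INTER_NEAREST)
    binary = (bw > 0).astype('uint8')        # 1 = text, 0 = background
    # One-pixel-wide skeleton supplied to feature extraction.
    return skeletonize(binary.astype(bool)).astype('uint8')
```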

5 Feature extractions

Feature extraction from handwritten text images should not be greatly affected by who wrote them, and a proper recognition framework that can be used by anyone is necessary. In this case, a contour-based representation of the character is a preferable choice, because a character written by any writer may vary in size yet bear an immense resemblance in structure. As a result of this, each recognition method’s performance may improve. This section gives a quick overview of three different feature extraction methods: (I) the angular motion of a character, (II) the distance between the character text lines and their centers, and (III) the five types of slopes formed by the text line of a character.

5.1 AMS Feature

‘AMS’ stands for the Angular Motion of Shape based feature. This is an effective technique for finding features based on the shape or contour of a character. The assumption is that the different classes of characters in any script produce different structures, which is why the AMS feature [15] for any class of characters should be unique. In this feature extraction technique, each image is split into two zones: the upper zone and the lower zone. In most characters, the upper portion looks completely different from the lower portion, which helps produce a quite identifiable feature from a character. After extracting the required features from every character image, the algorithm stores them in the AMS feature matrix. The necessary pseudo-code is shown in Algorithm 1. A 3 × 3 window moves vertically across the two zones of each character image; this type of window is also called a sliding window [14]. For the upper zone, the row-wise and column-wise range values are shown in (2). Similarly, for the lower zone, the row-wise and column-wise range values can be seen in (3).

$$ i_{upper}= 0 ~to ~\lceil{row/2}\rceil-1 ~and ~j_{upper} =0 ~to ~col-1 $$
(2)
$$ i_{lower}= \lceil{row/2}\rceil ~to ~row-1 ~and ~j_{lower}=~ 0 ~to ~col-1 $$
(3)

As the character images are converted from gray-scale to binary, the images have only two values, 1 and 0. In our case, ‘1’ signifies text, whereas ‘0’ signifies background. When the sliding window moves throughout the thin version of the character image and finds a ‘1’ in its center position, it examines the other eight adjacent cells; otherwise, the window ignores the contents. The eight adjacent cells are each labeled with a letter. Depending on the presence of the ‘1’ intensity value in the part of the character image covered by the sliding window, a code is generated, as shown in Fig. 6. The number of possible letter combinations formed from the eight cells of the sliding window is 2^8 = 256. This value follows from set theory: if there are ‘n’ characters in a set, then the total number of combinations that can be formed is 2^n. These letter-combination codes repeat as the window moves throughout the shape of a character. Note that the window moves only from left to right while scanning vertically, so the number of codes is 256/2 = 128 instead of 256. This value of 128 applies to one of the two zones, so for both zones the total is 128 × 2 = 256. All these 256 letter-combination codes are stored in a matrix called ‘Headermatrix’, and all of the codes from this matrix are copied as column headers of the AMS feature matrix. As a result, whenever a letter combination matches one of the column headers of the feature matrix, the frequency count of the relevant code is increased. The final updated value for that column, in the row occupied by the sample in the feature matrix, is stored as shown in Fig. 7. Hence, the numbers shown in the table of Fig. 7 represent the frequency, or number of occurrences, of the letter combination given in the corresponding column header. The whole process is shown in Algorithm 1; obtaining the frequency of all codes from the two zones for all samples is an iterative process. In the final version of the AMS feature matrix, there will be 257 columns, including the class label in the last column, and the number of rows will equal the number of samples present in the particular database.

Algorithm 1 AMS feature extraction
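The pseudo-code of Algorithm 1 is not reproduced verbatim here; the Python sketch below only conveys the underlying frequency-counting idea, assuming a thinned binary image with 1 for text. The restriction of the header codes to 128 patterns per zone is omitted for brevity, so this sketch simply counts all 256 possible neighbour patterns separately for the upper and lower zone.

```python
import numpy as np

def ams_feature(img):
    rows, cols = img.shape
    half = int(np.ceil(rows / 2))
    feature = []
    for r0, r1 in [(0, half), (half, rows)]:          # upper zone, lower zone
        counts = np.zeros(256, dtype=int)
        for j in range(1, cols - 1):                   # move left to right
            for i in range(max(r0, 1), min(r1, rows - 1)):  # scan vertically
                if img[i, j] != 1:
                    continue                           # centre must be text
                # eight neighbours read in a fixed order -> one 8-bit code
                neigh = [img[i-1, j-1], img[i-1, j], img[i-1, j+1],
                         img[i,   j-1],               img[i,   j+1],
                         img[i+1, j-1], img[i+1, j], img[i+1, j+1]]
                code = int(''.join(str(int(v)) for v in neigh), 2)
                counts[code] += 1                      # frequency of this code
        feature.append(counts)
    return np.concatenate(feature)                     # both zones concatenated
```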
Fig. 6 Code generation using a sliding window of size 3 × 3

Fig. 7 Extracting the AMS feature from the two portions of a character image using a sliding window

5.2 CTD Feature

This is also called the ‘Center to Text Distance’ feature [15]. Most characters of any script have different shapes and different distances from their centers, while characters belonging to a particular class tend to have a similar distance from the center of the text. The CTD feature extraction is designed around this fundamental idea. Eight different distances are computed from the center along eight different directions; the details of these eight distances {Rx, Ru, Rb, Lx, Lu, Lb, Ty, By} can be understood from Fig. 8. After applying the various preprocessing methods discussed in Section 4, the final binarized thin versions of the samples were collected and supplied to the CTD feature matrix computation program. The required pseudo-code is shown in Algorithm 2. Initially, the center of each character image is calculated. From this center, the controller moves along the eight directions, and wherever a white text pixel is found in any of the eight directions, the corresponding distance from that intersection point to the center is computed. These distances are calculated using the Euclidean distance and stored as floating-point values in the CTD feature matrix.

Fig. 8 Eight directional distance values from center of characters

Algorithm 2 CTD feature extraction
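A small Python sketch of the CTD computation follows. The mapping of {Rx, Ru, Rb, Lx, Lu, Lb, Ty, By} to concrete directions is an assumption made for illustration, and Algorithm 2 may differ in detail.

```python
import numpy as np

def ctd_feature(img):
    rows, cols = img.shape
    cy, cx = rows // 2, cols // 2                      # centre of the image
    directions = [(0, 1), (-1, 1), (1, 1),             # assumed Rx, Ru, Rb
                  (0, -1), (-1, -1), (1, -1),          # assumed Lx, Lu, Lb
                  (-1, 0), (1, 0)]                     # assumed Ty, By
    feature = []
    for dy, dx in directions:
        y, x, dist = cy, cx, 0.0
        while 0 <= y < rows and 0 <= x < cols:
            if img[y, x] == 1:                         # first text pixel hit
                dist = float(np.hypot(y - cy, x - cx)) # Euclidean distance
                break
            y, x = y + dy, x + dx
        feature.append(dist)
    return np.array(feature)                           # 8-dimensional CTD
```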

5.3 FoG feature

The Five Orientation Gradient (FoG) feature is based on the gradient, or slope, traced by the thin line of a text image. Like the AMS and CTD features, it also depends on the contour of a character. An image sample is a two-dimensional array, where each cell holds a pixel intensity value. For this type of feature extraction, a 2 × 2 window is used that moves horizontally from left to right. Initially, the character image samples are binarized and resized to 64 × 128, followed by median blur filtering. There are four zones column-wise and eight zones row-wise, so each image sample is divided into 32 (8 × 4 = 32) equal zones, each of size 16 × 16. This arrangement of 32 zones on a character is shown in Fig. 9, with an ordering of 1 to 32. As the window moves, at least two cells are expected to contain a white, or ‘1’, pixel value; the algorithm generates features only when at least two cells of the window are white. A single white pixel is ignored, because after the initial preprocessing with the median blur function, isolated single pixels vanish from the samples; adding a one-pixel case to the feature set would only increase its dimension unnecessarily, so it is eliminated.

Fig. 9 Dividing each character sample into 32 equal zones, where the size of each zone is 16 × 16

It is important to note that the maximum number of gradients, or slopes, that can be created through the 2 × 2 window is five, as seen in Fig. 10. Two types of intensity values, black and white, are present in the sliding window: a pixel of written text is represented by white or ‘1’, while the background is represented by black or ‘0’. In general, there are three scenarios faced by the sliding window: (1) only two cells have white pixels, (2) all four cells have white pixels, and (3) only three cells have white pixels.

Fig. 10 Representing the angular positions of the connecting pixels in a 2 × 2 cell and extracting the character code from this

The four forms of potential slope or gradient variants are collected in Case 1. From the four cells a, b, c, d, if ‘b’ and ‘d’ are both ‘1’, a line with a forward 45° gradient is formed, as shown in Fig. 10a. The backward gradient type with a 45° angle is formed if each of the ‘a’ and ‘c’ cells holds a ‘1’, as seen in Fig. 10b. A horizontal gradient of 0° is generated if the positions a, d or b, c are ‘1’, as seen in Fig. 10c. Similarly, if cells a and b, or d and c, are filled with ‘1’, a vertical line with a 90° gradient results, as shown in Fig. 10d.

A feature is also extracted when all four cells of the window are filled with white, or ‘1’; this representation belongs to case 2, where the window is completely occupied by the thin text line of the character. It is also referred to as the “no gradient” state, illustrated in Fig. 10e. Likewise, when three cells are filled with white, it belongs to case 3; logically, this situation arises from the merging of two different representations from case 1. For each zone, numbered 1 to 32 sequentially, the frequencies are calculated by moving the window horizontally, and all of the computed gradient frequencies are stored as a vector sequentially based on the zone numbers. Finally, five simple gradient types, including the one with no gradient, are considered in the FoG feature representation. Since there are 32 zones in all, the feature matrix has 160 dimensions, i.e., 32 × 5 = 160. The steps for obtaining the Five Orientation Gradient (FoG) feature are defined in Algorithm 3.

Algorithm 3 FoG feature extraction
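A small Python sketch of the FoG idea follows. The image orientation (128 rows × 64 columns), the labelling of the 2 × 2 cells, the zone ordering, and the handling of the three-pixel case are assumptions made for illustration; Algorithm 3 may differ in detail.

```python
import numpy as np

def fog_feature(img):
    assert img.shape == (128, 64)                        # 8 x 4 zones of 16 x 16
    feature = []
    for zr in range(8):                                  # 8 zones row-wise
        for zc in range(4):                              # 4 zones column-wise
            zone = img[zr*16:(zr+1)*16, zc*16:(zc+1)*16]
            counts = np.zeros(5, dtype=int)              # /, \, -, |, no gradient
            for i in range(15):
                for j in range(15):                      # 2x2 window positions
                    tl, tr = zone[i, j],   zone[i, j+1]
                    bl, br = zone[i+1, j], zone[i+1, j+1]
                    on = int(tl) + int(tr) + int(bl) + int(br)
                    if on < 2:
                        continue                         # single pixels ignored
                    if on == 4:
                        counts[4] += 1                   # "no gradient" case
                        continue
                    # two- and three-pixel cases: count every orientation present
                    # (the three-pixel case is treated as a merge of two
                    # two-pixel orientations, an assumption of this sketch).
                    if tr and bl: counts[0] += 1         # forward 45 degrees
                    if tl and br: counts[1] += 1         # backward 45 degrees
                    if (tl and tr) or (bl and br): counts[2] += 1  # horizontal
                    if (tl and bl) or (tr and br): counts[3] += 1  # vertical
            feature.extend(counts)
    return np.array(feature)                             # 32 x 5 = 160 dimensions
```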

6 Proposed character recognition model

An offline handwritten character recognition method for Odia characters, including the compound characters, is suggested here. In the literature, a few papers address handwritten compound characters [8, 33] of the Bangla language of India, in which the authors decomposed the compound characters into their basic forms of characters and ‘matras’ [30] and then applied their recognition methodologies to classify them. This logic may not be fruitful here, because decomposing an Odia compound character would reform it into more than one character, as shown in Fig. 11, and the meaning of the particular text would then change. As a result, rather than breaking compound characters down to identify them in their simplest form, an attempt has been made to recognize complete compound characters at once. All the steps involved in the identification process are shown in the block diagram of Fig. 12.

Fig. 11 Decomposition of one compound character forms multiple other characters. Numbers shown in green at the top of the characters are the respective class labels present in the ‘IIITBOdiaV2’ database

Fig. 12 Block diagram for the proposed handwritten character recognition framework

Three fundamental tasks are involved in this recognition process, and they can be identified from the block diagram: preprocessing, feature extraction, and classification. Sections 4 and 5 previously addressed the preprocessing and feature extraction phases. After this, the most necessary task is to classify the test samples, which is achieved by giving the test samples appropriate class labels based on their extracted features. Different experiments were carried out using six traditional machine learning algorithms and two neural network models. The machine learning classifiers include Logistic Regression (LR) [35], Gaussian Naive Bayes (GNB) [26], Decision Tree (DT) [26], K-Nearest Neighbors (kNN) [31], Random Forest (RF) [22], and Support Vector Machine (SVM) [31], alongside a modified RNN and a modified CNN [17]. The three feature sets, AMS, CTD, and FoG, are used to compute and evaluate the accuracies obtained by these classifiers. After all three kinds of feature matrices are created, the classifiers are applied and the classification accuracy is computed. In addition to these three feature matrices, one more feature matrix has also been considered for the experiments: the merged feature matrix of all three feature sets (a small concatenation sketch follows the list below).

  • AMS: dimension 256, plus 1 for the class label.

  • CTD: dimension 8, plus 1 for the class label.

  • FoG: dimension 160, plus 1 for the class label.

  • AMS + CTD + FoG: dimension 256 + 8 + 160 = 424, plus 1 for the class label.
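As an illustration, the merged matrix can be assembled by simple horizontal concatenation; the array names below are hypothetical.

```python
import numpy as np

# Assuming per-sample feature matrices ams (N x 256), ctd (N x 8), fog (N x 160)
# and labels (N,) holding the class labels.
merged = np.hstack([ams, ctd, fog])            # N x 424 merged feature matrix
dataset = np.column_stack([merged, labels])    # last column = class label
```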

The Python scikit-learn environment has been used to access the built-in classifiers LR, DT, kNN, GNB, RF, and SVM. Most of the function parameters are left at their defaults for the chosen classifier; wherever any tuning has been done, it is mentioned here. For Logistic Regression, solver=‘lbfgs’, multi_class=‘auto’, and max_iter=10000 are used. The solver mainly makes the algorithm more effective, and because the dataset contains a fairly large number of samples, the number of iterations is increased to achieve successful convergence. For Random Forest, criterion=‘entropy’ and n_estimators=300 have been set explicitly; n_estimators determines the number of trees to be built, and an information-gain-based entropy criterion was used to produce better split points. To keep the study standard and neutral, these settings are kept consistent for all the conducted experiments, and all remaining parameters for all classifiers are set to their default values. Instead of a conventional RNN and CNN, tweaked versions of the RNN and CNN are used in this research, built with Keras in Python. Instead of providing a set of images, the feature matrix is used as the input to these neural network architectures.
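A minimal sketch of this classifier setup is shown below, assuming the merged feature matrix X and labels y are available; only the explicitly named parameters are taken from the text, and everything else is left at scikit-learn defaults.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

classifiers = {
    'LR':  LogisticRegression(solver='lbfgs', multi_class='auto',
                              max_iter=10000),
    'GNB': GaussianNB(),
    'DT':  DecisionTreeClassifier(),
    'kNN': KNeighborsClassifier(),
    'RF':  RandomForestClassifier(n_estimators=300, criterion='entropy'),
    'SVM': SVC(),
}

# X: (N, 424) merged feature matrix, y: (N,) class labels; 5-fold CV as in
# the experiments reported later.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())
```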

6.1 Modified recurrent neural network (RNN)

A recurrent neural network (RNN) is a form of neural network that operates similarly to a traditional neural network; the main difference is that it maintains a short-term memory across time steps. Multiplying the inputs by the weights and applying activation functions, the network model generates results like a regular neural network, and the RNN feeds this result back into itself several times. The LSTM model was designed using the Keras deep learning library. The time step is so named because the model requires time to generate results; a value of 300 was taken for the time step. After the data is loaded into memory, the LSTM model can be designed, fitted, and checked. One hidden LSTM layer is used in this model. Since the AMS, CTD, and FoG features are combined, the resulting feature matrix has 8 + 256 + 160 = 424 dimensions; accordingly, the input layer in the experiments has a maximum of 424 neurons. A dropout layer is used to reduce over-fitting during training, with a dropout value of 0.05 in this case. The model has three hidden layers. At the end of each fully connected dense layer, the ‘ReLU’ activation function is used to maximize model efficiency. The network is compiled with the sparse categorical cross-entropy loss and the Adam variant of stochastic gradient descent as the optimizer. Both the batch size and the number of epochs are set to 50. Based on the dataset’s output class labels, the network output layer has anywhere from 10 to 112 neurons, and a softmax activation is used in the output layer to convert the model’s outputs into class probabilities.
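A minimal Keras sketch of such a modified RNN is given below. How the 424-dimensional feature vector is arranged into time steps and the hidden-layer widths are assumptions; only the single LSTM layer, the 0.05 dropout, the ReLU dense layers, the softmax output, the Adam optimizer, the sparse categorical cross-entropy loss, and the batch size and epoch count of 50 follow the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

num_features = 424                     # merged AMS + CTD + FoG dimension
num_classes = 112                      # 10 to 112 depending on the dataset

model = Sequential([
    # One hidden LSTM layer; each feature value is treated here as one time
    # step of length 1 (an assumption of this sketch).
    LSTM(128, input_shape=(num_features, 1)),
    Dropout(0.05),                     # dropout against over-fitting
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax'),   # class probabilities
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# X: (N, 424) feature matrix, y: (N,) integer class labels
# model.fit(X.reshape(-1, num_features, 1), y, batch_size=50, epochs=50)
```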

6.2 Modified convolutional neural network (CNN)

A convolutional neural network is used in many applications, such as visual recognition, object tracking, and facial recognition. CNN is a form of deep learning inspired by the neuron signaling model of the animal visual cortex, and the learning of its weights and biases closely follows the procedure of feedforward neural networks. We designed a convolutional neural network in the Keras environment under Python to achieve strong efficiency. In this architecture, the Keras sequential model was used, with layers stacked via the add() method. The rectified linear unit (ReLU) activation function is used in the network, with three densely connected convolution layers. The suggested feature vector is used as input, whose dimension is 8, 160, 256, or 424, depending on the selected feature set. The first input layer has nodes based on the input feature dimension and a dropout of 0.03 to reduce the overfitting effect. Excluding this input layer, three more hidden layers are incorporated. The ‘adam’ optimizer, a variant of stochastic gradient descent (SGD), is used. The validation accuracy obtained in the estimation process is based on the sparse categorical cross-entropy loss. Both the epoch count and the batch size are set to 50 for these experiments. The output neurons and the softmax multi-class activation function form the final layer; based on the dataset to be classified, the number of labels varies from 10 to 112.
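A minimal Keras sketch along these lines is given below. Whether the hidden layers are 1-D convolutions or dense layers, as well as the filter counts, are assumptions of this sketch; the ReLU activations, the 0.03 dropout, the adam optimizer, the sparse categorical cross-entropy loss, and the batch size and epoch count of 50 follow the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense, Dropout

num_features = 424                     # 8, 160, 256 or 424 per feature set
num_classes = 112                      # 10 to 112 depending on the dataset

model = Sequential([
    Conv1D(64, 3, activation='relu', input_shape=(num_features, 1)),
    Dropout(0.03),                     # reduces the overfitting effect
    Conv1D(64, 3, activation='relu'),
    Conv1D(32, 3, activation='relu'),
    Flatten(),
    Dense(num_classes, activation='softmax'),   # multi-class output
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# model.fit(X.reshape(-1, num_features, 1), y, batch_size=50, epochs=50)
```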

7 Experimental analysis

Here, various experiments have been conducted based on the datasets, feature sets, and classifiers. Four Odia handwritten character datasets were taken for the experiments: ISI Kolkata, IIT Bhubaneswar, NITROHCS_V1.0, and IIITBOdiaV2. Although the IIT Bhubaneswar dataset consists of digits, basic characters, and compound characters, we could collect only the digits and basic characters here. The NITROHCS_V1.0 dataset consists of basic characters only. The IIITBOdiaV1 dataset initially had digits and basic characters only; later, we added some commonly used compound characters to the existing ones to create IIITBOdiaV2. Several experiments were conducted on these datasets, and all the results are presented in tabular form so that small differences between the outcomes can be identified clearly. Except for ISI Kolkata, the results displayed in this section are the average of k-fold cross-validation with k = 5; the ISI Kolkata dataset has a separate training and testing set, but the others do not.

In the first trial, each of the three feature sets extracted from the different datasets is given to the classifiers and the obtained accuracies are compared. In the second trial, all three feature sets are merged into a higher-dimensional dataset and given as input to all the classifiers mentioned earlier, and the accuracies obtained from the different classifiers on each feature set are compared. Tables 2, 3 and 4 present the outcomes of the different classifiers on the different features for the ISI Kolkata, IIT Bhubaneswar, and IIITBOdiaV2 digit datasets, respectively. More or less, all the classifiers give better results using the merged set of all three features. Still, it is clear that all the classifiers perform better on the FoG feature than on AMS, even though FoG has only 160 dimensions, considerably fewer than the 256 dimensions of the AMS feature. Furthermore, when all of the features are combined and fed into the classifiers, the accuracy of all the classifiers increases. This means that all the feature sets are relevant to the correct recognition of characters; in the merged dataset, every feature subset makes a clear contribution towards accurate character identification.

Table 2 Recognition accuracy of different classifiers on different feature set of ISI Kolkata digit dataset
Table 3 Recognition accuracy of different classifiers on different feature set of IIT Bhubaneswar digit dataset
Table 4 Recognition accuracy of different classifiers on different feature set of IIITBOdiaV2 digit dataset

Tables 5, 6 and 7 present the outcomes of different classifiers on the basic character datasets from IIT Bhubaneswar, NITROHCSv1, and IIITBOdiaV2, respectively. The same trend observed for digit recognition appears here as well: the neural network-based models always perform better than the other state-of-the-art classifiers. Of the three datasets, the recognition accuracy on the IIITBOdiaV2 basic characters is quite low compared to the other basic character datasets, which implies that the basic characters in the IIITBOdiaV2 dataset are more complex and harder to recognize by any recognition methodology.

Table 5 Recognition accuracy of different classifiers on different feature set of IIT Bhubaneswar basic character dataset
Table 6 Recognition accuracy of different classifiers on different feature set of NITROHCSv1 basic character dataset
Table 7 Recognition accuracy of different classifiers on different feature set of IIITBOdiaV2 basic character dataset

Tables 8 and 9 present the results obtained by the classifiers on the IIT Bhubaneswar and IIITBOdiaV2 datasets, respectively, where these datasets consist of digits and basic characters only. Usually, if the number of classes increases, the accuracy decreases, but that does not happen here because these datasets contain digits as well as characters. Indeed, the recognition results for numerals are always better than those for characters, so mixing the digits with the alphabets helps enhance the overall recognition accuracy. In both IIT Bhubaneswar and IIITBOdiaV2, the number of classes is 60, and IIITBOdiaV2 has fewer samples; nonetheless, its recognition accuracy is lower. It can therefore be inferred that the IIITBOdiaV2 dataset has more complex specimens than the IIT Bhubaneswar dataset.

Table 8 Recognition accuracy of different classifiers on different feature set of IIT Bhubaneswar (digit + basic character) dataset
Table 9 Recognition accuracy of different classifiers on different feature set of IIITBOdiaV2 (digit + basic character) dataset

Tables 10 and 11 display the recognition rate, in terms of accuracy, for the sets that include compound characters. Table 10 shows the results where the features are extracted from 102 classes, consisting of basic characters and compound characters, whereas Table 11 presents the results for all the classifiers applied on features extracted from 112 classes, consisting of digits, basic characters, and compound characters. It should also be noted that Table 10 has fewer classes than Table 11, yet its recognition accuracy is lower; this is because the digit samples are merged in Table 11, which helps to enhance its recognition.

Table 10 Recognition accuracy of different classifiers on different feature set. Samples are taken from 102 classes (basic character + compound character) of IIITBOdiaV2 dataset
Table 11 Recognition accuracy of different classifiers on different feature set. Samples are taken from 112 classes (digit + basic character + compound character) of IIITBOdiaV2 dataset

7.1 Comparison

Images can be used to train and evaluate neural network models. A comparison has been performed on the neural networks based on the input provided, first by providing the raw images and second by providing the extracted features. The efficiency of the character recognition approach using both methods can be compared using two parameters: accuracy and execution time per epoch. The accuracy and execution time per epoch of the CNN and RNN, depending on the input, are shown in Table 12. Although the traditional CNN and RNN had marginally higher accuracy when images were used as input, they did so at a substantial cost in execution time. The two main factors influencing the increase in execution time of the traditional CNN are the sample count and the input sample resolution (in pixels); the number of CNN layers has been kept constant in both scenarios. The suggested feature extraction method, on the other hand, provides the classifiers with a constant feature dimension regardless of the image size, so the conventional CNN and RNN will always take more time to execute than the proposed approach. This study emphasizes not only accuracy but also execution time. As a result, further experiments with basic characters and compound characters based on direct images as input to the classification system have been omitted here. Table 13 compares our proposed approach with others present in the literature based on the accuracy obtained. It can be seen that our suggested approach and feature extraction technique do not reach the peak accuracies others have achieved, but the results are very close. This study is the first attempt at handwritten Odia compound character recognition. It is thus confirmed that the suggested feature extraction techniques work fruitfully on different categories of Odia characters while also successfully supporting numerous machine learning algorithms.

Table 12 Comparison of recognition accuracy of RNN and CNN classifier on different digit dataset based on the input provided
Table 13 Comparison of recognition accuracy achieved by different researchers present in literature with the proposed recognition model on Odia handwritten datasets

8 Conclusion and future work

Odia is one of the most popular languages in India, used by more than 50 million people worldwide. There is a considerable need for an Odia offline handwritten character recognition mechanism to identify numerals, basic characters, and compound characters. Although the number of users is increasing every day due to the digital involvement of computers and smartphones, very little research on handwritten character recognition exists in the literature. In this study, the issue of offline Odia handwritten character recognition has been addressed. One of the major issues was the public unavailability of an Odia character dataset that includes compound characters. With this in mind, a dataset, IIITBOdiaV2, has been created at IIIT Bhubaneswar. The dataset currently has 16,198 characters in total from 112 character classes. Development of this dataset is ongoing, and these numbers are increasing day by day. Here, 52 categories of compound and modified characters were incorporated, which are very popular and widely used. Deep learning models are consistently leaders in pattern recognition; generally, they take images directly as input, and the computation takes a long time when the image size and the number of samples are large. We have used deep learning networks with a set of features (AMS, CTD, and FoG) as input, with a much lower dimension than the size of the original character images. These extracted features are not only small in size but also effective, because they help in character identification across many classes. At the same time, these feature sets also support various machine learning classifiers by giving consistent recognition results in terms of accuracy under k-fold cross-validation. We are trying to append other possible compound characters of the Odia script to this existing dataset so that the quality of the character recognizer can be enhanced. It should also be noted that the proposed approach’s recognition accuracy is not as high as those found in the literature. In the future, new feature extraction approaches could be used to yield better recognition results with fewer feature dimensions. Nowadays, most documents are available in a distributed environment, and in such cases, new techniques like edge computing [19] can also be used in the future for the offline handwritten character recognition task.