1 Introduction

Identifying and recognizing handwritten characters and digits is one of the core problems and prominent tasks in the computer vision community. It has attracted an abundance of interest for its diverse applications in recognizing characters from images. The applications stretch from zip code identification to writer recognition, and from recognizing numerals and alphabets on vehicle number plates to bank check processing. Moreover, it can be used to convert handwritten characters to ASCII or Unicode, which can serve as a standard for further processing, such as translation to another language or speech synthesis. However, designing a handwritten character recognition system poses various challenges to researchers due to the problems that prevail in data acquisition and the unconstrained nature of handwritten characters or text. The shape of the same character may differ from writer to writer: some may write with a large structure, while others may complete it in a small-scale version. Hence, the overlap between the ink traces of two samples of the same character may be very small.

Further, depending on the acquisition device, pen width and ink color may impose variation on writing style. Moreover, handwritten Manipuri characters are complicated due to their structure and shape. The script includes a significant character set with many curves, loops, and other details in the characters, and many character pairs are quite similar in shape. All these issues demand attention and a solution in the form of an efficient recognition system. The recognition process requires the documents to be decomposed into identifiable elements. In the simplest form, the recognizable elements can be isolated characters, which can be applied directly for recognition. If sentences or paragraphs are taken into consideration, then segmentation into lines, words, and characters is required to reach the level of identifiable elements. Segmentation is an elementary constituent of a handwritten text recognition system, and therefore, it has drawn the attention of numerous researchers [1,2,3,4,5]. In this paper, we introduce self-collected Meitei Mayek datasets of handwritten isolated characters and unconstrained written text. Moreover, to validate the collected datasets, character recognition on the isolated character dataset and segmentation (lines, words, and characters) on the text dataset are presented.

Fig. 1: Meitei Mayek alphabets

Manipuri (Meitei Mayek) is a scheduled language of the Indian constitution and the official language of Manipur state. It belongs to the Tibeto-Burman branch of the Sino-Tibetan language family and is the primary language of communication of Manipur, spoken across the northeast and in some parts of Bangladesh and Myanmar. The current script is a reconstruction of the ancient Meitei Mayek script, and it consists of a rich set of characters: 10 numerals called Cheising Eeyek, 27 basic alphabets (comprising 18 original letters called Eeyek Eepee and 9 additional letters called Lom Eeyek), 8 derived letters called Lonsum Eeyek, and 8 associated symbols called Cheitap Eeyek. The characters are shown in Fig. 1. In this paper, we consider only the 27 basic alphabets and the 8 derived letters for the experiments. The script has been reinstated only recently, and no standard database is available for research. The unavailability of a conventional database on this script has hindered research work on it.

Keeping in view the prevailing problem of dataset unavailability, we have attempted to develop a Meitei Mayek database of isolated characters (Mayek27) and of unconstrained full-length Meitei Mayek (MM) handwritten text. The database of Meitei Mayek characters (Mayek27) will form a basis for recognition of this script. Most of the previous studies on Meitei Mayek have been carried out on isolated numerals and characters only. Therefore, the segmentation method on the full-text MM handwritten text dataset may form a benchmarking procedure. Further, to validate the proposed databases, we have performed experiments on them. On the Mayek27 database, a convolutional neural network-based character recognition system is proposed on 4900 images. On the MM dataset, we have performed text-line segmentation based on morphological operations to identify midpoints between consecutive lines and accordingly trail the gap block by block until the entire line is separated. Word segmentation is not treated as a separate problem but proceeds effectively alongside the run-length of line segmentation by identifying the optimal column to separate words. Moreover, a connected component-based technique is applied to segment individual characters from the words. The proposed segmentation algorithm works well on curved, touching, and skewed lines.

The remainder of the paper is organized as follows: Sect. 2 reviews existing methods and techniques available in the literature for character recognition systems and various segmentation algorithms. Section 3 presents the acquisition of the Mayek27 and MM databases in detail. Section 4 presents the experiments performed on the databases, namely the character recognition algorithm on the Mayek27 dataset and text-line, word, and character segmentation on the MM dataset, along with their results. Section 5 concludes the paper, stating the achievements and future work.

2 Related work

This section provides an extensive survey of the various character recognition algorithms that exist in the literature and of methods for segmenting text documents into identifiable elements such as text-lines, words, or characters.

2.1 Review on character recognition

Various character recognition systems have been developed for different languages of the world. Numerous feature extraction techniques in the literature are based on statistical attributes of the image in the spatial domain, while some are based on a transformed domain. Spatial features deal with the same spatial variables as the original image, where each pixel holds information about the image. Transformed-domain methods convert an image to a different domain, such as frequency or time, which facilitates feature descriptions that may not be conveniently exposed in the spatial domain. The extraction of distinctive features is a vital stage in the recognition process. Feature selection should increase interclass differences and decrease the intraclass gap.

Spatial features explore the statistical attributes and structural characteristics of an image, individually or in combination. Pixel values are manipulated, or their correlation is estimated, to extract discriminant properties, which can be computed globally or locally depending on the application. Pixel-based methods have been explored for their essential recognition characteristics in the literature [6,7,8]: either the direct binary ink trace (pixel pattern) or its density has been used for recognition problems. The prominent texture descriptor Local Binary Pattern (LBP) has been reported in the literature for various pattern recognition systems [9,10,11]. Many different forms of LBP can be obtained based on the distance and orientation of sampling points; the effectiveness of these variants has been explored and their results discussed. Further, the classification of an image by fusion of a global feature using the wavelet transform of the local ternary pattern and local features such as the speeded-up robust features (SURF) descriptor and bag of words has been proposed in [12]. Shape features like chain code [13] and histogram-of-gradient features [14] were exploited for character recognition. The decomposition of complex objects into elementary shapes has been explored in [15] for pattern recognition; analyzing the shapes helps in finding the relevant primitives for recognizing an object. The 2D SIFT algorithm [16] has been extended to a perfectly scale-invariant feature selection algorithm on 3D meshes for 3D object recognition in [17]. Recognition of characters by air-writing with the hand or fingertip is an interesting mechanism. One such work has been proposed in [18], where the color and depth images of a Kinect sensor are used to identify the writing. The method utilizes slope variation detection to extract features from the trajectory to identify Persian numbers.

Spatial-domain feature extraction misses the frequency components of the image, which inspired researchers to exploit transformed domains for feature extraction. Low-frequency components relate to the basic shape, while high-frequency components of the transform domain describe the details of an image. For this reason, the coefficients of the Fourier transform (FT) were considered for feature extraction in [19, 20]. Despite being a robust descriptor, the FT captures the overall properties of an image and often excludes local features. To capture fine details, the wavelet transform (WT) has been examined in [21]. With a valid numerical basis, the WT can analyze an image extensively by constructing a suitable mother wavelet. On account of their strong response to localized targets along a particular direction, orientation-based Gabor filters have been explored for character recognition [22, 23].

A dynamic transformation called the Stockwell transform [24] has been proposed to acquire a time-frequency domain description of an image, and these properties have been considered for various pattern recognition problems by researchers [25,26,27,28]. Markov model-based [29] and probabilistic and fuzzy feature-based [30] pattern recognition methods have also been reported in the literature for character recognition.

Image zoning is another technique in the literature used in the recognition of handwritten characters, as it can address variation in writing patterns. The entire image is broken down into numerous sub-images called zones, each of which can provide useful information related to a specific part of the image. The optimal selection of the zones from which localized features are extracted has been proposed in [27], based on a bio-inspired process; the zoning process iterates until the error is minimized. Another zoning method combined with membership functions has been proposed in [31, 32]. The role of zone membership functions is elaborated by defining the influence that different zones contribute to the overall feature. The best-suited membership function is selected for each zone of an image to exploit the characteristics of the feature distribution of that zone. Their experimental results were claimed to be superior to other traditional zoning methods.

In recent years, deep learning architectures [33, 34] have acquired much attention for various applications in pattern recognition and computer vision. The rise of deep learning spurred the adoption of the convolutional neural network (CNN) among machine learning algorithms. A multilayer artificial neural network has been introduced for character recognition and various other computer vision problems in [35]. A classic network popularly known as LeNet has been proposed by LeCun et al. [36], which incorporates gradient-based learning methods into the CNN. Further, a network similar to LeNet but considerably bigger and more powerful has been introduced in [37]; experiments were conducted on the ImageNet dataset for classification, and the network used ReLU and multiple GPUs with a special layer called local response normalization (LRN). A remarkable scheme is proposed in [38] where, instead of tuning many hyperparameters, the focus is on the evaluation of simpler networks in which convolution layers with fixed \((3 \times 3)\) filters are stacked with increasing depth.

2.2 Review on document segmentation

Segmentation of a transcribed document image into lines, words, and characters is a significant problem due to the complications that occur in handwritten documents, such as irregular spacing between lines and words and touching characters across words and text-lines. Although many algorithms have been proposed and enormous effort devoted to the segmentation of text-lines and words in unconstrained handwritten documents, there is still plenty of room for improvement.

Methods for text-line detection and segmentation of printed documents are relatively easy and have been explored [13, 39], as such documents have approximately straight, parallel text-lines that a global projection profile can segment. Handwritten documents, however, are often non-uniformly spaced and exhibit skew and curvature. Many efforts have been devoted to the challenging task of handwritten text-line and word segmentation. The approaches can be categorized broadly into projection profile analysis [7, 40,41,42,43,44,45,46,47,48,49], connected component grouping [50,51,52], and the level set method [53].

The \(X\)-\(Y\) cut algorithm [44] is a projection-based top-down segmentation method but performs well only on documents with parallel text-lines and large gaps. The partial-projection profile [41] has been proposed to deal with curved lines. The level set method [53] is an effective top-down approach for the segmentation of unconstrained handwritten documents, but it has high computational complexity. The Docstrum method [50] is a bottom-up approach that merges neighboring connected components but fails to detect some lines. The piece-wise projection segmentation method [47] is highly sensitive to variation in character size.

In [54], line and word segmentation techniques using midpoint detection were proposed. The midpoint detection-based approach relies on recognizing the spaces that separate two lines or words. Another text-line and word segmentation algorithm was proposed in [52], where the text-lines are segmented and normalized, and then the words are segmented. In their technique, the distance between connected components is measured, and lines and words are segmented based on a threshold. Hough transform-based line and word segmentation from document images was presented in [55, 56]. The technique was applied not only to document images but also to datasets for a business card reader system and a license plate recognition system. Although the algorithm has low under-segmentation rates, it sometimes fails to segment closely spaced lines.

In [57], the segmentation of words was formulated as a binary quadratic assignment problem that considers the pairwise correlation between gaps as well as the likelihood of individual gaps. The parameters are estimated within a structured SVM framework so that the method is independent of language and writing style. Segmentation of lines, words, and characters was presented in [58]. The authors extracted a horizontal projection profile from the document and performed line segmentation using local minima points. Further, simultaneous word and character segmentation is proposed by popping out column runs from each row in an intelligent manner.

3 Compilation of Meitei Mayek dataset

Data acquisition plays a significant role in research. It involves gathering and assessing relevant information to develop a target system, here a handwritten character recognition system. There is no publicly available dataset for handwritten Meitei Mayek characters. Therefore, we have manually collected isolated characters from various people who can read and write Meitei Mayek, for the development and evaluation of efficient character recognition. Previously, an isolated handwritten Meitei Mayek dataset was proposed in [59], but it consists of only the 27 classes of Eeyek Eepee. In this paper, we have included the 8 letters called Lonsum Mayek, which are derived from distinct Eeyek Eepee. In total, therefore, 35 classes of Meitei Mayek characters are considered for recognition in this paper. The derived characters are very similar to their respective originals, which further adds to the challenge of recognizing them, as highlighted in Fig. 2. Figure 3 illustrates an instance of a filled form of the Meitei Mayek dataset, which has four sections: a unique label, a demographic information holder, the printed character or text, and empty slot(s) for individuals to inscribe the character in their own writing style.

Fig. 2: The original and derived letters of the Meitei Mayek alphabet; images in the first row are the original letters, and the second row shows the respective derived Lonsum Eeyek

3.1 Mayek27 dataset

The dataset consists of 35 letters, collected in four sets across 140 A4 pages. The isolated characters are arranged in a tabular format, where each cell is occupied by one handwritten character sample. Every page provides 35 empty slots for individuals to inscribe the printed character in their own writing style. Since every character is sampled 35 times in a set, a sub-total of 1225 (\(35 \times 35\)) isolated characters is collected per set. Therefore, considering all four sets, a total of 4900 Meitei Mayek characters are available for experimentation in this work. To complete the data acquisition process, 90 people contributed their handwriting. These people have different educational backgrounds and a mixed age group between 6 and 40 years. The writers also recorded demographic information on the dataset form, such as name, address, occupation, qualification, and signature, so that other applications like signature verification can utilize the data. The preprocessing methods performed in this paper are similar to the approach described in our previous work [59, 60] and are illustrated in Figs. 4 and 5.

Fig. 3: Sample filled forms of the Mayek27 and MM datasets

Fig. 4: Preprocessing performed on the Mayek27 dataset

3.2 The Meitei Mayek (MM) dataset

This dataset is devoted to text documents containing words in both Meitei Mayek and English. English is included because there exist English words for which no Meitei Mayek equivalent is available. The text documents incorporate various challenges that make the segmentation problem more interesting, such as skewed, curved, close, and touching lines. In total, 189 document pages comprising 809 lines have been collected from 114 people of varied age groups.

Fig. 5: Process for obtaining the minimum bounding rectangle of a character in Mayek27: a edge image, b dilated image, c filled image, d minimally bounded box, e cropped image

Besides the handwritten characters and text for experimental evaluation, we have also collected the demographic information of the writers. This information can be made available for other applications, such as signature verification. A sample filled form of the Meitei Mayek dataset (isolated characters and text documents), consisting of four different sections, is illustrated in Fig. 3.

The developed database will be made available to the public for research, so researchers working on Meitei Mayek can easily download and use it in their work. The generated dataset contains samples of both printed and handwritten characters and text, so it can also be used for separate recognition of printed and handwritten characters, and for distinguishing printed from handwritten text. Moreover, the demographic information collected can be used in other applications such as signature verification, writer identification from the names, and pincode identification from the addresses. Since Meitei Mayek has been reinstated only recently and is receiving full attention, the proposed datasets should prove beneficial for various digital research applications in the future. The Mayek27 and MM datasets have been scanned at 300 dpi and stored in the “png” format for further processing.

Fig. 6: A sample architecture of the proposed convolutional neural network for Meitei Mayek recognition

4 Experimentation on datasets

For the technical validation of the datasets and their contribution as a standard benchmark platform for linguistic research on the Meitei Mayek script, we have applied a character recognition technique to the Mayek27 dataset and text-line, word, and character segmentation to the MM dataset, and we present the experimental results.

Fig. 7: Visual illustration of the ReLU activation function

4.1 Character recognition on Mayek27 dataset by CNN

The experiment has been carried out on the collected Mayek27 dataset of 4900 samples. All the images are isolated individually and normalized to a fixed size of \(32 \times 32\) for the recognition system.

Convolutional neural networks (CNNs) are a class of deep, feed-forward artificial neural networks. Deep learning provides a powerful set of techniques for learning in neural networks, which have been successfully adapted to various applications that investigate visual imagery. Convolutional neural networks are designed to enable a computer to learn from observational data. A CNN is commonly composed of a sequence of layers that can be grouped by their functionality, as illustrated in Fig. 6 and interpreted as follows:

4.1.1 Convolution layer

The convolution operation is one of the fundamental building blocks of a CNN. It performs 2D convolution on the input image with the filters' weights and combines the responses across the channels. Each filter has the same number of channels as the input image, while the output volume has the same depth as the number of filters. The advantage of a convolution layer over FC layers alone is parameter sharing and sparsity of connections. Parameter sharing means that a feature detector useful in one part of the image is probably useful in another part of the image as well. Sparsity signifies that, in each layer, each output value depends only on a small number of inputs.
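As a rough illustration (not the authors' implementation), the following PyTorch sketch shows both properties: the output depth equals the number of filters, and parameter sharing keeps the layer small. The layer sizes (one input channel, 16 filters of \(3 \times 3\), a \(32 \times 32\) input) are taken from Sect. 4.1; the framework choice is ours.

```python
import torch
import torch.nn as nn

# 16 filters of size 3x3 over a single-channel 32x32 image (sizes from Sect. 4.1)
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

x = torch.randn(1, 1, 32, 32)           # one grayscale character image
print(conv(x).shape)                    # torch.Size([1, 16, 30, 30]): depth = number of filters

# Parameter sharing: each 3x3 filter is reused at every spatial position, so the
# layer has only (3*3*1 + 1) * 16 = 160 parameters, far fewer than an FC layer
# over the same input would need.
print(sum(p.numel() for p in conv.parameters()))   # 160
```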

An activation function introduces a nonlinear mapping between the input and the output. It also lends robustness and strength to the network to learn something complicated and productive from the image. Another significant characteristic of an activation function is differentiability, which is needed to perform the backpropagation procedure. Backward propagation through the network computes gradients of the error (loss) with respect to the weights and then optimizes the weights accordingly using gradient descent. The activation layer therefore increases the nonlinearity of the network without affecting the receptive fields of the convolution layer.

$$\begin{aligned} R(z) = \max (0,z) \end{aligned}$$
(1)

Some popular types of activation functions are the sigmoid, tanh, and rectified linear unit (ReLU). In this approach, we have used the ReLU, which is given by Eq. 1 and illustrated in Fig. 7. It mitigates the vanishing gradient problem in a practical way. The function returns zero for all negative values and preserves linearity for positive values. Therefore, the ReLU is sparsely activated and is more likely to process a meaningful aspect of the problem. However, it should only be used within the hidden layers of a neural network model. Hence, for the output layer, a special activation function called the softmax is used at the end of the fully connected layer output to compute the probabilities of the classes (for classification). For a given input sample vector x and weight vectors \({w_j}\), the predicted probability of \(y=j\) is given by Eq. 2.

$$\begin{aligned} P(y=j|x) = \frac{e^{x^Tw_j}}{\sum _{k=1}^{K} e^{x^Tw_k}} \end{aligned}$$
(2)
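Equations 1 and 2 can be transcribed directly into code. The following NumPy sketch is our own illustration (the feature dimension is arbitrary; 35 classes matches Mayek27); the max-subtraction is a standard numerical-stability step that does not change the probabilities.

```python
import numpy as np

def relu(z):
    # Eq. 1: R(z) = max(0, z), applied element-wise
    return np.maximum(0, z)

def softmax_probs(x, W):
    # Eq. 2: P(y = j | x) = exp(x^T w_j) / sum_k exp(x^T w_k), with one
    # weight vector w_j per class in the columns of W.
    scores = x @ W
    scores -= scores.max()          # stability; probabilities are unchanged
    e = np.exp(scores)
    return e / e.sum()

x = np.random.randn(16)             # an illustrative feature vector
W = np.random.randn(16, 35)         # 35 classes, as in Mayek27
print(relu(np.array([-2.0, 3.0])))  # [0. 3.]
print(softmax_probs(x, W).sum())    # 1.0 -- a valid probability distribution
```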

4.1.2 Pooling layer

While the convolution layers provide activation mappings between input and output variables, pooling layers employ nonlinear downsampling on the activation maps. This layer takes a filter (normally of size \(2 \times 2\)) and a stride of the same length, applies it to the input volume, and outputs the maximum value (in the case of max-pooling) in every subregion the filter convolves over.

Fig. 8: Images that activate the channels within the network through the layers (for simplicity, 16 images from each CONV layer are shown)

Table 1 Comparison of recognition rate (RR) with the existing methods in the literature

The pooling operation preserves any feature detected in any region of the image in the downsampled output. Popular pooling methods are average and max-pooling. In this paper, we have used max-pooling, where no zero padding of the image is required. It uses two hyperparameters, the filter size FS and the stride S. For an image of size \({M} \times {N} \times {D}\), pooling produces an output of size \(\big (\lfloor {\frac{M - {FS}}{S} + 1}\rfloor \times \lfloor {\frac{N - {FS}}{S} + 1} \rfloor \times D\big )\). The pooling operation requires no parameters to learn. We have used a filter size \({FS} = 2 \times 2\) and stride \(S = 2\) for all max-pooling operations in this approach.
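A small Python helper makes the formula concrete; the example dimensions are illustrative and correspond to feature maps that arise with unpadded \(3 \times 3\) convolutions (our assumption).

```python
from math import floor

def pooled_size(M, N, D, FS=2, S=2):
    """Output size of pooling per the formula above; no parameters to learn."""
    return (floor((M - FS) / S) + 1, floor((N - FS) / S) + 1, D)

print(pooled_size(30, 30, 16))   # (15, 15, 16): spatial dims halved, depth preserved
print(pooled_size(13, 13, 32))   # (6, 6, 32): odd sizes round down
```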

4.1.3 Fully connected layer

It can be perceived as a regular neural network that ultimately learns to relate all the extracted visual features to the appropriate output labels. Fully connected (FC) layers are usually suited to classification or encoding tasks, with a vector as the common output which, after the softmax layer, expresses the confidence level for each class.

The CNN architecture in this approach consists of three convolution layers and three pooling layers, with one pooling layer applied after each convolution layer to downsize the feature maps by half. Finally, there are three fully connected layers, the last being the softmax layer (Eq. 3). As described for Eq. 2, this layer postulates a probability distribution over a fixed number of categories, and the category with the maximum probability assigned by the network is selected. Adding more layers without a large training set is likely to fit irregularities in the data, causing over-fitting and, in turn, reducing accuracy on the test data. Therefore, in this approach, we limit the network to only three convolution layers, as we do not have a large dataset.

$$\begin{aligned} S(y_i) = \frac{e^{y_i}}{\sum _j e^{y_j}} \end{aligned}$$
(3)

The first convolution layer accepts a character image of size \(32 \times 32\) to begin the extraction of distinctive features for recognition. Every convolution layer has the same filter size of \(3 \times 3\), but the number of kernels varies per layer: the first layer employs 16 kernels, followed by 32 and 96. The increase in the number of kernels grows the network in volume, thereby boosting its representational power as the number of units per layer becomes larger. Each convolution layer is passed through a ReLU activation layer, and the ReLU output is downsized by half by a max-pooling layer of size \(2 \times 2\), as illustrated in Fig. 6. After the final pooling layer, the image size is reduced to \(2 \times 2\) with 96 channels, corresponding to the number of kernels employed. The feature maps are then vectorized and flattened and passed to a fully connected layer of dimension 368, then to another FC layer of size 100, and lastly to the softmax layer, which generates a probability distribution over the classes for a given input. In this architecture, we have applied batch normalization as a regularization technique on every layer to facilitate network training and diminish the sensitivity to network initialization.
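A minimal PyTorch sketch of the described architecture follows; it is our reconstruction, not the authors' code. Unpadded convolutions are our assumption (they reproduce the stated \(2 \times 2\) final feature map), and mapping the flattened features to the stated 368-unit FC layer as a 384-to-368 linear layer is our interpretation.

```python
import torch
import torch.nn as nn

class Mayek27Net(nn.Module):
    """Sketch: three conv(3x3)-batchnorm-ReLU-maxpool(2x2) stages with 16, 32,
    and 96 kernels, then FC layers of 368 and 100 units and a 35-way softmax
    output. Unpadded convolutions (our assumption) give the stated 2x2 final
    feature map: 32 -> 30 -> 15 -> 13 -> 6 -> 4 -> 2."""
    def __init__(self, n_classes=35):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 96, 3), nn.BatchNorm2d(96), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(96 * 2 * 2, 368), nn.ReLU(),
            nn.Linear(368, 100), nn.ReLU(),
            nn.Linear(100, n_classes),   # softmax is applied by the loss at training time
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = Mayek27Net()
print(model(torch.randn(8, 1, 32, 32)).shape)   # torch.Size([8, 35])
```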

4.1.4 Experimental results and comparison

The CNN model is tested on the self-collected handwritten Meitei Mayek isolated character dataset of 4900 sample images, all normalized to \(32 \times 32\). The model works well even without a very large number of sample images. A validation accuracy of 99.02% correct recognition is obtained at the 20th epoch. Training took about 146 s to reach the 20th epoch and the final iteration. The accuracy was observed to exceed 90% by the second epoch and to increase consistently thereafter.

Further, as the network goes deeper with a higher number of convolution layers and filters, more complex and detailed information is gained, but only up to a certain limit, beyond which there is a possibility of over-fitting. This can be seen visually in Fig. 8: more meaningful information is perceived as the convolution layers proceed toward the softmax layer, at the expense of more complex computation. The output of CONV layer 1 appears as pure black-and-white blocks stacked together. However, as the network advances toward CONV layer 3, more meaningful images emerge, which can be used for classification. These images highlight the features cultivated by the network.

We have also experimented with additional fourth and fifth layers to analyze the change in recognition accuracy. It has been observed that accuracy does not necessarily improve as the number of layers increases. The experimental analysis shows a slight decline in recognition accuracy as additional layers are added. This is because increasing the number of layers does extract more features, but only up to a certain extent; there is a possibility of over-fitting the data, which may result in false positives if the training data are not large enough.

The proposed character recognition work has been compared with previous work in the literature, and the results are summarized in Table 1. It can be observed from the table that the proposed CNN model provides a higher recognition rate than the other neural network methods and techniques existing in the literature.

4.2 Text-line segmentation on MM dataset

The approach for text-line segmentation is a modification of the existing partial-projection profile technique: it calculates the projection histogram of only the first 100 to 200 columns from the left side of the document, and based on that, the number of lines and the transition points are estimated. Generally, the text is written from the left side of the document, and most of the lines in a document are covered within the first 100 to 200 columns; therefore, no further division of the text document takes place. From the transition points, we obtain the midpoints between two consecutive lines. The method then keeps track of the space between lines by calculating the projection profile for i rows above and below the midpoint, advancing j columns at a time. The horizontal projection histogram is divided into three parts, based on the fact that a line can proceed straight, upward, or downward. Various cases are then analyzed to identify the optimal row among the three parts, and the method advances into the region that has the lowest value of the projection histogram. The process continues until it covers all lines and the entire width of the document. The whole procedure is explained in detail in the subsequent section and illustrated in Fig. 9.

Fig. 9: The procedure for text-line segmentation

The first processing step of our text-line segmentation algorithm, after extracting the handwritten document, is to find the midpoints between adjacent lines. The procedure for computing the midpoints (mp) is given in Algorithm 1. The first step is to convert the given document into grayscale and estimate the edges of the whole text document. For some text, however, the edges are not well defined due to uneven stroke distribution by the writers. Therefore, gamma correction is performed before edge detection so that the strokes are even and not broken; one such instance is illustrated in Fig. 10. A sudden change in intensity value, i.e., where the gradient is maximal, represents an edge. We have used the Sobel operator with a window size of \(3 \times 3\) for this process. A sample edge image is illustrated in Fig. 9c. Next, morphological dilation is performed to add foreground pixels to the boundary so that the text is filled. For dilation, we have used a disk-shaped structuring element of size three, such that the pixels covered by the structuring element are set to 1 if the origin registers a hit. Similarly, we have performed erosion on the dilated image with the same structuring element to preserve the core shape of the text in case lines are too close or touching, as shown in Fig. 9d. A horizontal projection histogram of a stripe of the eroded image (Fig. 9e) is calculated and smoothed by a simple moving average filter of window size three. These operations are summarized in statements 1 to 5 of Algorithm 1. To remove any spurious continuity in the projection histogram, HPH values less than about 10 are set to 0, as depicted in Fig. 11. This step takes care of lines that are so close that their projection profiles coincide, so that a possible mid-index between the lines can still be obtained.

Algorithm 1: Computation of midpoints (mp) between adjacent text-lines
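The preprocessing in Algorithm 1 can be sketched with OpenCV and NumPy as below. This is our reconstruction from the description above, not the authors' code; the gamma value and the gradient threshold are our assumptions, while the stripe width, moving-average window, and HPH cutoff follow the values quoted in the text.

```python
import cv2
import numpy as np

def midpoints_between_lines(page_bgr, stripe_cols=200, gamma=1.5, hph_min=10):
    """Sketch of Algorithm 1 (our reconstruction): gamma-correct, detect
    edges, dilate/erode with a disk, take the smoothed horizontal projection
    of the left stripe, and return midpoints between consecutive lines."""
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    corrected = np.uint8(255 * (gray / 255.0) ** gamma)   # evens out weak strokes
    gx = cv2.Sobel(corrected, cv2.CV_64F, 1, 0, ksize=3)  # 3x3 Sobel, as in the paper
    gy = cv2.Sobel(corrected, cv2.CV_64F, 0, 1, ksize=3)
    edges = np.uint8((np.hypot(gx, gy) > 100) * 255)      # gradient threshold (ours)
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    filled = cv2.erode(cv2.dilate(edges, disk), disk)     # dilate to fill, erode to keep core shape
    hph = filled[:, :stripe_cols].sum(axis=1) / 255.0     # projection of the left stripe
    hph = np.convolve(hph, np.ones(3) / 3, mode="same")   # moving average, window three
    hph[hph < hph_min] = 0                                # break coinciding profiles of close lines
    mask = np.concatenate(([0], (hph > 0).astype(int), [0]))
    starts = np.flatnonzero(np.diff(mask) == 1)           # above-borders of the lines
    ends = np.flatnonzero(np.diff(mask) == -1)            # below-borders of the lines
    # midpoint between line k's below-border and line k+1's above-border
    return [(ends[k] + starts[k + 1]) // 2 for k in range(len(starts) - 1)]
```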
Fig. 10: a A text image, b edge image without gamma correction, c edge image after gamma correction

Fig. 11: a A sample continuous horizontal projection histogram, b rectified projection histogram

After obtaining the projection histogram of the stripe, we find the points (rows) where the value changes from zero to nonzero and vice versa, using statement 9 of Algorithm 1, and store them in an array lines[k], where k ranges from 1 to \((2\,\times \,\hbox {number of lines})\). As illustrated in Algorithm 2, except for the first and last elements of the array (which represent the above-border of the first line and the below-border of the last line), we take the average of consecutive elements of lines[k] to calculate the midpoint (mp) of each pair of adjacent lines. For each midpoint, we keep track of the space between the lines. Then, starting from column n and moving a fixed distance of j columns at a time, the horizontal projection profile is calculated for rows \({\hbox {mp}}-i\) to \({\hbox {mp}}+i\). For our algorithm, we have taken \(j = 20\) and \(i = 15\). The projection profile is divided into three parts, based on the fact that a line can proceed straight, upward, or downward. Various cases are considered to find the optimal row to proceed, and the method advances into the part that has the lowest value of the projection profile. The process is repeated until it covers all midpoints and the entire width of the document. The cases examined for finding the optimal row for line segmentation are presented in Algorithm 2.

Algorithm 2: Finding the optimal row to trace the separator between adjacent lines
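The block-by-block tracing can be sketched as follows. This is a loose reconstruction from the description above; the full case analysis of Algorithm 2 is richer than the simple minimum used here.

```python
import numpy as np

def trace_separator(ink, mp, i=15, j=20):
    """Sketch of the tracing in Algorithm 2 (our reconstruction): from a
    midpoint mp, advance j columns at a time, split the local horizontal
    projection of rows mp-i..mp+i into three parts, and move the separator
    toward the part with the lowest projection value."""
    rows, cols = ink.shape
    path, row = [], mp
    for col in range(0, cols, j):
        block = ink[max(row - i, 0):min(row + i, rows), col:col + j]
        hph = block.sum(axis=1)
        parts = np.array_split(hph, 3)              # line may go up, straight, or down
        best = int(np.argmin([p.sum() for p in parts]))
        row += (best - 1) * (2 * i // 3)            # 0: upward, 1: straight, 2: downward
        row = int(np.clip(row, i, rows - i - 1))
        path.append((row, col))
    return path                                      # separator polyline between two lines
```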

The total number of lines considered in our experiment is 809: 197 skewed, 114 curved, 145 close, 77 touching, and 276 straight lines. Of the total, 746 lines are segmented and 66 are not, giving an accuracy of \(91.84\%\). The performance of the text-line segmentation algorithm under the various constraints is summarized in Table 2. It can be seen from the table that the proposed segmentation algorithm segments lines well under the various constraints. The order of constraints in which segmentation accuracy decreases is: normal, skew, curve, close, and touching lines. The proposed algorithm segments normal and skewed lines with higher accuracy than under the other challenges. A few images depicting text-line segmentation under various constraints are shown in Fig. 12.

Table 2 Results of segmentation of lines with various constraints

The normal lines are more or less straight and have adequate space between consecutive lines; that is why they achieve the highest accuracy of 96.1% among the constraints. However, due to a large gap between the words of a line, the separator sometimes wrongly moves upward or downward if an extension of a character (from the line above or below) is present and aligned with the gap. Unconstrained handwritten text-lines are generally skewed (either upward or downward) in nature. Estimating the slopes and tracking the gap between text-lines is an important stage in document segmentation. A greater degree of skewness may also result in touching or overlapping characters between text-lines, which ultimately produces inefficient results. However, in this approach we do not use a separate algorithm for slope detection; rather, the gap between consecutive lines is traced block by block. An accuracy of 94.92% is achieved for skewed line segmentation. The smaller the gap, the more segmentation inaccuracy is observed: the touching lines have the lowest accuracy of 84.41%, while close lines have a slightly better accuracy of 85.51%.

Fig. 12: A few text-line-segmented documents: a, b straight lines, c, d skew lines, e, f curve lines, g, h close and touching lines

4.3 Word segmentation

In this approach, segmentation of words is not treated as a separate problem after line segmentation; it proceeds efficiently alongside line segmentation. As the tracking of the optimal row between neighboring lines proceeds using Algorithm 2, we monitor the vertical projection histogram (VPH) of t columns between the above-border and below-border of a line, as shown in Fig. 13. The operation is repeated for the run-length of line segmentation so that every word in a line is detected.

A word is detected by identifying the column at which the VPH of t columns is found to be all zeros, i.e., no text is present. If two consecutive blocks of t columns are all zeros, only one is considered, as only one column is required to separate and detect a word. All the separating points of words, i.e., the rows and columns of the beginning and end of each word, are identified. A word is identified by four points, two above the word and two below (for example, the points marked 1, 2, 3, and 4 in Fig. 13). These series of points are stored in two arrays, one for the above-border and the other for the below-border of a line. A word is then extracted using the four points. It may happen that a column is not detected, as in the case of the first word in Fig. 13, but the word is still extracted using the available information. If a text document contains almost straight lines, then the rows representing a word do not differ by more than 20 to 30 levels; in that case, the word can be extracted by taking the respective averages of the above and below line rows. As discussed in the previous section, however, the lines are not always straight, so taking averages alone does not give a clean separation of words. For skewed lines, we have therefore extended our word extraction algorithm. A sample skewed word with the identified rows and columns is given in Fig. 14.
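A minimal sketch of this gap detection, assuming a binarized line image with ink as nonzero and a hypothetical block width t, follows; it is our reconstruction of the described scan, not the authors' code.

```python
import numpy as np

def word_gap_columns(line_ink, t=5):
    """Sketch (our reconstruction): scan the line in blocks of t columns; a
    block whose VPH is all zero marks a gap, and a run of consecutive gap
    blocks contributes only one separating column."""
    vph = line_ink.sum(axis=0)              # vertical projection histogram
    gaps, prev_gap = [], False
    for c in range(0, line_ink.shape[1], t):
        is_gap = not vph[c:c + t].any()     # no text in these t columns
        if is_gap and not prev_gap:         # keep one column per run of gaps
            gaps.append(c)
        prev_gap = is_gap
    return gaps                              # separating columns between words
```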

Fig. 13: Process for word segmentation

Fig. 14: A sample skew word with row and column identification

Fig. 15: A few word-segmented documents: a, b straight lines, c, d, e skew lines, f, g curve lines, h, i close and touching lines

Fig. 16: Segmented words of the document image in Fig. 15e

Algorithm 3 illustrates the word extraction procedure. For a straight or minimally skewed/curved line, words are extracted by taking the average of the two above-word rows or the two below-word rows (for instance, row1 and row1' in Fig. 14). Otherwise, we identify clockwise or counterclockwise skewness or cursiveness: if \(\lfloor \frac{{\hbox {col2}}-{\hbox {col1}}}{{\hbox {row1}}-{\hbox {row1'}}}\rfloor < 0\), the skew is clockwise; otherwise, it is counterclockwise.

$$\begin{aligned} \hbox {base}= & {} |\hbox {row1}-\hbox {row1'}|\\ \hbox {perp}= & {} |\hbox {col1}-\hbox {col2}| \end{aligned}$$
Algorithm 3: Word extraction procedure

For a clockwise skew, the row is incremented by 1 for every \({\hbox {floor}}(\frac{{\hbox {perp}}}{{\hbox {base}}})\) columns; similarly, for a counterclockwise skew, the row is decremented by 1 for every \({\hbox {floor}}(\frac{{\hbox {perp}}}{{\hbox {base}}})\) columns, to perform the word extraction operation. False detections of blank space as words are discarded by the word extraction algorithm.
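The row adjustment can be sketched directly from the quantities base and perp defined above; this is our reconstruction, and the guard for the zero-skew case is ours.

```python
from math import floor

def skew_adjusted_rows(row1, row1p, col1, col2):
    """Sketch of the skew handling in Algorithm 3 (our reconstruction):
    step the border row by 1 every floor(perp/base) columns, downward for
    clockwise skew and upward for counterclockwise skew."""
    base = abs(row1 - row1p)                 # vertical offset across the word
    perp = abs(col1 - col2)                  # horizontal extent of the word
    if base == 0:
        return [row1] * perp                 # straight: the border row is constant
    # sign test from the text: floor((col2-col1)/(row1-row1')) < 0 -> clockwise
    step = 1 if floor((col2 - col1) / (row1 - row1p)) < 0 else -1
    every = max(floor(perp / base), 1)       # shift the row by 1 every floor(perp/base) columns
    return [row1 + step * (c // every) for c in range(perp)]
```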

The total number of words considered for word segmentation is 2917, of which 2595 are segmented, giving an accuracy of 88.96%. A post-processing method needs to be applied to eliminate the extra space around a word and additional text from neighboring lines or words. Figure 15 illustrates the results of word segmentation, and Fig. 16 depicts the segmented words of a sample document.

The words that appear in a text document are extracted using the word extraction algorithm and saved as individual images for the further step of character segmentation. The words in accurately segmented normal text-line documents are extracted with no loss of characters. Documents that are not largely skewed are also extracted correctly. However, in documents where the skewness is very high, a few characters of some words are lost, or irrelevant characters are added.

Algorithm 4: Character segmentation from word images by connected component analysis

Fig. 17: The process for character segmentation

4.4 Character segmentation

Segmentation of documents into identifiable elements is a necessary step in building a recognition system. In this section, character segmentation from word images using connected component analysis is explained in detail in Algorithm 4. In the first step, a word image is taken, and its binary counterpart is computed. Further, the complement of the binary image is estimated, and the number of connected components is calculated. Then, all the pixels identified by each connected component are turned into background pixels, and the result's complement is subtracted from the complement of the binary image. Finally, the minimum bounding rectangle of the subtracted image is obtained, and the character is segmented. The whole procedure is depicted in Fig. 17.
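A compact OpenCV sketch of this procedure follows; it is our reconstruction, in which OpenCV's connected component labeling replaces the explicit complement-and-subtract steps described in the text, and the speck-area threshold is our choice.

```python
import cv2

def segment_characters(word_gray):
    """Sketch of Algorithm 4 (our reconstruction): binarize the word image
    with ink as foreground (the complement step), label connected components,
    and crop each component's minimum bounding rectangle."""
    _, ink = cv2.threshold(word_gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(ink)
    boxes = []
    for k in range(1, n):                    # label 0 is the background
        x, y, w, h, area = stats[k]
        if area >= 10:                       # drop specks; the threshold is our choice
            boxes.append((x, y, w, h))
    boxes.sort()                             # left-to-right reading order
    return [word_gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```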

The above algorithm works for words without a headline. Generally, Meitei Mayek is written without a headline, but during database acquisition, some writers wrote sentences with headlines. For those words, the above algorithm fails and has to be dealt with differently. First, the headline is removed so that each character in a word is isolated; the characters can then be detected and separated by a vertical projection histogram. The headline is removed by finding the row with the maximum horizontal projection profile within the upper half of the image and deleting it (converting it to background pixels). The characters are then separated by a vertical projection histogram, as shown in Fig. 18.
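A sketch of the headline removal and the subsequent vertical projection split, again our reconstruction assuming a binarized word image with ink as foreground:

```python
import numpy as np

def split_after_headline_removal(word_ink):
    """Sketch (our reconstruction): delete the row with the maximum
    horizontal projection in the upper half of the word, then split the
    characters at all-zero vertical projection columns."""
    half = word_ink.shape[0] // 2
    headline = int(np.argmax(word_ink[:half].sum(axis=1)))   # strongest row in the upper half
    ink = word_ink.copy()
    ink[headline, :] = 0                                     # delete the headline row
    vph = ink.sum(axis=0)
    mask = np.concatenate(([0], (vph > 0).astype(int), [0]))
    starts = np.flatnonzero(np.diff(mask) == 1)
    ends = np.flatnonzero(np.diff(mask) == -1)
    return [ink[:, a:b] for a, b in zip(starts, ends)]       # one crop per character
```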

Fig. 18: a A word image with a headline, b the same image after headline removal

For the character segmentation algorithm, we have considered 4932 characters, out of which 4494 have been isolated, giving an accuracy of 91.12%. After segmenting words using Algorithm 4, a few of them are found to contain characters joined by the presence of a headline; these are further processed by removing the headline and isolating the characters with a vertical projection histogram. Still, the segmentation algorithm fails to segment some touching characters and those attached to symbols.

5 Conclusions and future work

The work presented in this paper addresses two problems: first, developing Meitei Mayek databases (the Mayek27 and MM datasets) for research work and, second, performing experiments on the acquired data. We have conducted character recognition and segmentation (line, word, and character) on the collected databases to validate them. The Mayek27 dataset has been compiled to form a basis for recognition models on Meitei Mayek. The MM dataset includes English words because some English words have no Meitei Mayek equivalent.

Character recognition and text-line segmentation have been performed on the large and challenging datasets that we have developed, Mayek27 and MM, respectively. Character recognition is performed using a CNN with three convolutional layers employing a \(3 \times 3\) filter size on the images. The analysis and extensive experimental study reveal the efficiency and effectiveness of this method for character recognition. The comparative evaluation suggests that this work accomplishes superior performance to the existing methods in the literature.

On the MM dataset, text-line, word, and character segmentation have been performed, and various constraints have been considered in our experiments. The proposed text-line segmentation can separate skewed, curved, and touching text-lines. Lines in a document could not be detected if the writing begins toward the center instead of the standard left side of the text. Nevertheless, the overall accuracy of text-line segmentation with our proposed method is 91.84%. Further, the accuracies obtained in word and character segmentation are 88.96% and 91.12%, respectively.