1 Introduction

India is a land of culture, civilization, art, craft, music, dance, and literature. It possesses a dynamic culture dating back to the beginning of mankind, and the Indian literary tradition is one of the oldest in the world [1]. Indian literature has always been a source of pride for every Indian. It is written in many languages using different scripts.

A language is a medium of communication consisting of vocabulary and grammar, whereas a script is a collection of graphical symbols used to denote the characters of a language. For example, Hindi is a language, and it is written in the Devnagari script. The Devnagari script is also used to write other languages such as Sanskrit and Marathi.

Handwritten documents written between the twelfth and sixteenth centuries are known as medieval manuscripts. These manuscripts are written in many languages and scripts. A large number of medieval manuscripts are found in the Devnagari script.

Many researchers have proposed segmentation techniques for different languages and scripts. It has been observed that techniques that work well for English characters (Roman script) are not well suited to Indian languages. Most Indian scripts, including Devnagari, use modifiers (matras) above, below, and beside the characters. These scripts also use joint characters. Modifiers and joint characters make the segmentation process difficult. Handwritten text introduces further complexities such as overlapped characters, skewed text, touching characters, and uneven gaps between lines and characters [2,3,4,5]. Medieval scripts also use many decorative marks and symbols, which makes the segmentation process very complex.

Very little research has been done on character recognition from ancient manuscripts. To the best of our knowledge, this is the first attempt to segment characters from handwritten medieval Devnagari manuscripts.

This paper proposes a novel approach to character segmentation that exploits the unique writing style of the medieval Devnagari script. This approach is referred to as ‘Shirorekha Based Character Segmentation’ (SBCS) in the rest of the paper. In recent Devnagari script (as used nowadays), a space is placed between two words, and each whole word is covered by a single continuous shirorekha above it. In the writing style of medieval Devnagari, there is no space between words (possibly to save valuable paper), so the text is a continuous sequence of characters, and each character has a separate shirorekha above it (as shown in Fig. 1).

Fig. 1 Sample medieval Devnagari manuscript

Segmentation is a very important phase in the optical character recognition process, and most recognition errors are caused by segmentation errors [6]. The proposed SBCS method scans the horizontal lines along the shirorekha in the pre-processed image. Wherever there is a break in the shirorekha, or only a minor touch between two characters, that position is considered a segmentation point. The method is discussed in detail in Sect. 3.

2 Literature review

Table 1 summarizes some existing approaches proposed by different researchers for various languages.

Table 1 Existing character segmentation method study

3 Shirorekha Based Character Segmentation (SBCS)

The proposed Shirorekha Based Character Segmentation (SBCS) method performs character segmentation on medieval Devnagari manuscripts. Certain operations are performed on the manuscript pages before the SBCS method is applied. These operations are:

  • Grayscale conversion of colored images

  • Binarization using Otsu's global thresholding algorithm

  • Line segmentation [10]
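
The first two operations above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' Visual Basic .NET implementation. The function names (`to_grayscale`, `otsu_threshold`, `binarize`), the luminosity weights used for grayscale conversion, and the convention that ink pixels are the dark class are all assumptions made for this example. Line segmentation [10] is not covered here.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an H x W x 3 uint8 RGB image to uint8 grayscale.

    Assumption: standard luminosity weights; the paper does not say
    which conversion was used.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return np.rint(rgb[..., :3] @ weights).astype(np.uint8)


def otsu_threshold(gray):
    """Return the gray level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)            # weight of the class with level <= t
    mu = np.cumsum(prob * levels)   # cumulative mean up to level t
    mu_total = mu[-1]
    w1 = 1.0 - w0                   # weight of the class with level > t
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_total * w0 - mu) ** 2 / (w0 * w1)
    # Levels where one class is empty give 0/0 or x/0; treat them as 0.
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))


def binarize(gray):
    """Return a boolean ink mask: True where the pixel is dark (ink)."""
    return gray <= otsu_threshold(gray)
```

The boolean mask returned by `binarize` marks black (ink) pixels as `True`, which is a convenient form for the row and column histograms used later in this section.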

The segmented text lines are taken as input to the SBCS method. A sample text line image is shown in Fig. 2.

Fig. 2 A segmented text line image

As mentioned earlier, medieval Devnagari has a unique writing style: there is no space between words, and each character has its own horizontal bar (shirorekha). The shirorekha of a character also covers the modifiers attached to that character.

The SBCS technique exploits this unique writing style. The SBCS algorithm is described in detail below:

  • Step 1: Create a row histogram for the entire line image. The row with the maximum number of black pixels in the upper half of the image is taken as the shirorekha row. Figure 3 shows the shirorekha from the row histogram chart.

  • Step 2: As the characters written in the manuscript are thick, the shirorekha cannot be a single pixel wide. Therefore, a couple of rows above and below the identified shirorekha row are considered the shirorekha span.

  • Step 3: The shirorekha span is analyzed to find all breaks in the shirorekha line. Breaks are identified by creating a column histogram for the shirorekha span. A threshold value must be chosen here according to the character thickness. If the column histogram value of a column is less than the chosen threshold, that column is considered a possible segmentation point.

  • Step 4: In some characters, the character body is wider than the shirorekha, or the shirorekha is longer than the character body. In these cases, segmenting only at the breaks in the shirorekha can cut off part of a character, as shown in Fig. 4 (for example, part of the character ‘ka’ would be lost).

    To resolve this, a column histogram for the whole image is created, as shown in Fig. 5.

  • Step 5: The column histogram values of a few columns before and after each possible end point are examined. The column with the minimum histogram value among them is taken as the final segmentation point for the character. Figure 6 shows the characters segmented using the SBCS algorithm.
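
Steps 1 to 5 can be sketched in Python with NumPy as shown below. This is an illustrative sketch under stated assumptions, not the authors' Visual Basic .NET implementation. The parameter names `half_span`, `break_threshold`, and `search_radius` and their default values are hypothetical, since the paper only says that "a couple of rows", a thickness-dependent threshold, and "some columns" are used. Grouping adjacent break columns into one run and ignoring breaks that touch the left or right image edge are also assumptions of this sketch.

```python
import numpy as np


def sbcs_segment(ink, half_span=2, break_threshold=1, search_radius=3):
    """Return segmentation columns for one text line.

    ink: 2-D boolean array, True = black (ink) pixel.
    half_span, break_threshold, search_radius: hypothetical tuning
    parameters; suitable values depend on the stroke thickness.
    """
    ink = np.asarray(ink, dtype=bool)
    height, width = ink.shape

    # Step 1: row histogram; the densest row in the upper half of the
    # line image is taken as the shirorekha row.
    row_hist = ink.sum(axis=1)
    shiro_row = int(np.argmax(row_hist[:max(1, height // 2)]))

    # Step 2: a few rows above and below form the shirorekha span.
    top = max(0, shiro_row - half_span)
    bottom = min(height, shiro_row + half_span + 1)

    # Step 3: column histogram of the span; columns with fewer ink
    # pixels than the threshold are possible segmentation points.
    is_break = ink[top:bottom].sum(axis=0) < break_threshold

    # Step 4: column histogram of the whole line image.
    col_hist = ink.sum(axis=0)

    # Step 5: around each run of break columns, pick the column with
    # the minimum full-image histogram value.
    cuts = []
    col = 0
    while col < width:
        if not is_break[col]:
            col += 1
            continue
        start = col
        while col < width and is_break[col]:
            col += 1
        end = col  # exclusive end of the break run
        if start == 0 or end == width:
            continue  # assumption: margin gaps do not separate characters
        lo = max(0, start - search_radius)
        hi = min(width, end + search_radius)
        cuts.append(lo + int(np.argmin(col_hist[lo:hi])))
    return cuts


def split_characters(ink, cuts):
    """Slice the line image into character images at the given columns."""
    ink = np.asarray(ink, dtype=bool)
    edges = [0] + list(cuts) + [ink.shape[1]]
    return [ink[:, a:b] for a, b in zip(edges[:-1], edges[1:])]
```

A call such as `split_characters(ink, sbcs_segment(ink))` yields one sub-image per character. Because the final cut is chosen from the full-image column histogram rather than from the shirorekha break alone, a stroke that extends beyond its shirorekha stays with its own character when a cleaner column exists nearby.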

Fig. 3 Shirorekha from the row histogram

Fig. 4 Identified possible end points

Fig. 5 Column histogram for text line image

Fig. 6 Segmented characters with SBCS algorithm

4 Experiment results

The proposed SBCS technique is implemented in Visual Basic .NET 2013 under Microsoft Windows on an x86-based PC with a 2.27 GHz processor and 3 GB RAM. The experimental results are analyzed for performance evaluation.

Segmented lines from eleven manuscript pages (prats) are selected for the experiment. Table 2 presents the experimental results and the computed accuracy. The accuracy for each prat is calculated as the percentage of correctly segmented characters out of the total number of characters.
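
Written as a formula, the per-prat accuracy described above is as follows. The symbols are introduced here for illustration and do not appear in the original: T_i is assumed to denote the total number of characters in prat i, and C_i the number of correctly segmented characters.

```latex
\mathrm{Accuracy}_i = \frac{C_i}{T_i} \times 100\%
```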

Table 2 Experiment results on various prats

The accuracy varies from 82.94% to 93.04%, with an overall average of 88.28%. The SBCS method is suitable for segmenting whole characters, numbers, characters with modifiers, and joint characters.

As this is the first research of its kind, there are no published results for medieval Devnagari character segmentation to compare against. The proposed method achieves a promising result compared with many results reported for recent Devnagari script.

5 Conclusion

The proposed Shirorekha Based Character Segmentation (SBCS) method handles all kinds of modifiers as well as the joint characters used in the medieval Devnagari script. The method achieves a character segmentation accuracy of 88.28%. This accuracy compares favorably with many methods proposed for recent Devnagari script.

6 Future scope

The proposed method can be improved to achieve better accuracy. It can also be extended to segment the special symbols and decorative marks used in ancient manuscripts.