1 Introduction

India is a diverse nation with a rich literary heritage. As of today, there are 33 languages and 2000 dialects, of which 22 languages are recognized under the constitution. The popular recognized South Indian languages include Telugu, Tamil, Kannada, Malayalam, and Tulu [1]. The Telugu script, an offshoot of the Brahmi script, has complex structural characteristics that make character recognition a difficult task. It has 16 vowels and 36 consonants. Consonants sometimes combine with vowels or with other consonants to form complex or compound characters [2, 3]. Recognition of compound characters is a challenging task for researchers.

Palm leaves were used as a writing material to record popular literature, scientific treatises, and history [4,5,6]. Countries such as India, Thailand, the Philippines, and Indonesia became home to large collections of palm leaf manuscripts. These are recognized as keys to knowledge in diverse fields that have been passed down through generations [4,5,6]. The current work acquires information using a 3D profiler to resolve the issues related to palm leaf manuscripts. The pressure applied by the scribe to incise characters on such manuscripts is measured. The depth of incision, a 3D feature, is used to perform background subtraction and to improve the character recognition rate.

2 Related work

Sastry et al. [7,8,9] worked on palm leaf documents by extracting a 3D feature, the depth of indentation, which is proportional to the pressure applied by the scribe, for basic isolated Telugu characters. The three coordinates (X, Y, Z) were measured only at selected points of a character, such as bends, start points, and end points, using a human and mechanical interface system. The two coordinates (X, Y) were measured using a mechanical instrument called a measuroscope. The third coordinate Z, the depth of indentation, was measured using a dial gauge indicator with a resolution of 0.01 microns. In [7,8,9], investigations were carried out to determine whether the depth of indentation, i.e., the pressure applied by the scribe, contributes to improving character recognition, without considering the background. Their approach is prone to human error when selecting the points from character to character, because two different instruments were used and the sample had to be physically moved from one instrument to the other at manually selected points.

Several other authors [6, 10,11,12,13,14] have contributed work on removing noise from various handwritten historical documents such as Thai, Ottoman, and Patrimonial manuscripts.

3 Proposed methodology

The flowchart of the proposed recognition scheme for palm leaf scribed text is shown in Fig. 1. It is divided into four major components: 3D data acquisition, preprocessing, feature extraction, and feature selection. A detailed description of these components is given in the following subsections.

Fig. 1 Flowchart of the proposed recognition scheme

3.1 Data acquisition

In the current study, data have been acquired using a computer-controlled XP-200 series profilometer from AMBiOS Technology [16]. A diamond-point stylus of radius 2.5 microns, with a lateral resolution of 100 nm, is used to measure the pressure applied by the scribe on the palm leaf manuscripts. The movement of the stylus in the vertical direction (Z direction) generates an analog signal which is converted into digital data.

The procedure for raster scanning is described with the help of Fig. 2. The profiler can be programmed for vertical raster scanning by specifying the following arguments to measure three coordinates (X, Y and Z) for the palm leaf text:

  1. Scan length (L)

  2. Scan width (W)

  3. Step size (S) between two consecutive scans/profiles

The scan length and width are specified for the text line as shown in Fig. 2a. Each vertical line in Fig. 2b indicates a scan profile. The difference between two consecutive profiles is the step size (S is set to 0.2 mm). For each scan, the up/down movement of the diamond-point stylus along the vertical direction (over the specified scan width W) is captured by the high-resolution camera. The depth of indentation, Z (a 3D feature), is recorded, converted into digital form, and stored in the computer attached to the profilometer. Each scan profile contains the recorded Y and Z values, where Y is measured in microns and Z in Angstrom units. The number of profiles (N) generated is given by L/S. The extreme left profile in Fig. 2b is used as a reference to measure the X coordinate of the text line. For every profile, X is incremented by the step size, as depicted in Eq. (1).

$$\begin{aligned} X = S \times (p - 1) \end{aligned}$$
(1)

where p is the profile number.
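The coordinate assembly implied by Eq. (1) can be sketched as follows. This is only an illustration, not the profiler's acquisition software; it assumes each profile is already available as an array of (Y, Z) samples and that the step size is converted to the same unit as Y.

```python
import numpy as np

def assemble_point_cloud(profiles, step_size_mm=0.2):
    """Assemble the (X, Y, Z) point cloud of one text line from N scan profiles.

    profiles: list of (n_i, 2) arrays holding (Y, Z) samples for profile p = 1..N,
    with Y in microns and Z in Angstroms, as described above.
    """
    points = []
    for p, profile in enumerate(profiles, start=1):
        x = step_size_mm * (p - 1) * 1000.0      # Eq. (1); mm -> microns (assumed conversion)
        for y, z in profile:
            points.append((x, y, z))
    return np.asarray(points)                    # shape: (total_samples, 3)
```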

Fig. 2 Arguments specified for raster scanning: a width and length of the text; b step size between profiles; the leftmost profile is used as reference to measure the X coordinate

Fig. 3 Sample analog scan profile containing Y and Z data points (W = 3 mm); valleys with \( V_{h} < L_{h}\) are ignored and valleys with \( V_{h} \ge L_{h}\) are considered for threshold computation

A total of 67 palm leaf manuscripts were considered for experimentation. The number of Telugu character samples considered for recognition is 8745, belonging to 53 different classes. Each character class contains 165 samples. These character samples are divided into five folds (each fold contains 1749 samples).

3.2 Preprocessing

A sample analog scan profile generated for a scan width of 4 mm is shown in Fig. 3. The x-axis in Fig. 3 represents the data points along the vertical direction (Y) in microns, and the y-axis represents the corresponding depth points along the Z direction in Angstroms. The blue line indicates the base of the sample. The manuscripts considered in our experiments are 300 years old; hence, they are crinkled, and this is reflected in the recorded profiles. The red line indicates the threshold applied to subtract the background, i.e., to extract the foreground text. The threshold applied is 40% of the average of all the minimum valley points \( V_\mathrm{min} \) encountered across the N profiles, where only valleys whose height \( V_{h} \) is greater than or equal to a predefined lateral height \( L_{h} \) are considered, to avoid errors due to the crinkled surface texture of the palm leaf manuscripts. The number of minimum valley points encountered in the profile shown in Fig. 3 is 2, as indicated in the figure. If no minimum valley points are encountered in a profile, it indicates spacing between strokes or characters. This adaptive threshold, computed from all N profiles of a line, is used to subtract the background even when the manuscripts are written by multiple writers.
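A minimal sketch of this adaptive thresholding is given below. The base-line estimation and sign convention of the profiler output are not fully specified in the text, so converting Z values to incision depths below the base and detecting one minimum valley per profile are assumptions made for illustration only.

```python
import numpy as np

def line_threshold(profiles, base_level, lateral_height_lh):
    """profiles: list of 1-D Z arrays (one per scan profile of the text line)."""
    valley_minima = []
    for z in profiles:
        depth = base_level - z               # incision depth below the base line (assumed convention)
        v_h = depth.max()                    # depth of the deepest valley in this profile
        if v_h >= lateral_height_lh:         # ignore shallow, crinkle-induced valleys
            valley_minima.append(v_h)
    if not valley_minima:                    # no valid valley: gap between strokes/characters
        return None
    return 0.4 * np.mean(valley_minima)      # 40% of the average minimum-valley depth

def foreground_mask(z, base_level, threshold):
    """Mark samples incised deeper than the line threshold as foreground text."""
    return (base_level - z) >= threshold
```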

The foreground text extracted after background subtraction is segmented into characters using the vertical projection profile algorithm. Selecting two coordinates at a time from the segmented data points, character patterns are generated in three different planes of projection, viz., XY, YZ, and XZ. For every segmented character image/pattern, the larger of the character's width and height is normalized to a fixed value M (M = 50 in the current work), preserving the aspect ratio; the images are then resized to \(\hbox {M} \times \hbox {M}\).
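A sketch of this normalization step is shown below, assuming the segmented character is available as a 2D image; centering the rescaled character on an M x M canvas is an assumed way of preserving the aspect ratio while producing a fixed-size image.

```python
import numpy as np
from skimage.transform import resize

def normalize_character(img, M=50):
    """Scale the larger side of a character image to M and pad to an M x M canvas."""
    h, w = img.shape
    scale = M / max(h, w)                                   # larger dimension maps to M
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    resized = resize(img, (new_h, new_w), anti_aliasing=True)
    canvas = np.zeros((M, M), dtype=resized.dtype)
    top, left = (M - new_h) // 2, (M - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized    # center on the canvas
    return canvas
```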

3.3 Feature extraction

The two features extracted from the characters/patterns generated in all three planes of projection are as follows:

  1. Histogram of gradients (HoG)

  2. Cell-based directional distribution

For classification, a k-NN classifier is employed with both of these feature sets.

3.3.1 Histogram of gradients

The features are extracted from the normalized palm leaf character image by slicing it into cells. The cells are then grouped to form blocks. The cell size is \(10 \times 10\) pixels, and the block size is \(2 \times 2\) cells. With 50% overlap between blocks, the histogram of gradient directions in each cell is computed using a 1D derivative mask [\(-\)1 0 1] along the x-axis and its transpose along the y-axis. The quantized gradient orientations are mapped into one of 9 bins, spaced at \( 2\pi /9 \) intervals. The normalized group of histograms of the corresponding gradient magnitudes represents the histogram descriptor of a character. The advantage of the HoG descriptor is that it is invariant to geometric transformations, as it operates on local cells [15].

The feature vector size is the product of the number of bins, the block size (in cells), and the number of blocks in the image. The number of blocks for a \(50 \times 50\) normalized image is \(4 \times 4\), as the blocks overlap by 50%. Hence, the size of the feature vector is \(9 \times 2 \times 2 \times 4 \times 4 = 576\).
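The descriptor size can be illustrated with scikit-image's hog function, using the cell, block, overlap, and bin settings stated above; the authors' exact HoG implementation (including the signed orientation range) may differ, so this is only a verification of the 576-dimensional descriptor length.

```python
import numpy as np
from skimage.feature import hog

character = np.random.rand(50, 50)             # stand-in for a normalized character image
features = hog(character,
               orientations=9,                 # 9 orientation bins
               pixels_per_cell=(10, 10),       # 10 x 10-pixel cells
               cells_per_block=(2, 2),         # 2 x 2-cell blocks, one-cell stride (50% overlap)
               block_norm='L2-Hys')
print(features.shape)                          # (576,) = 9 bins x 2 x 2 cells x 4 x 4 blocks
```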

3.3.2 Cell-based directional distribution

In this algorithm, the object pixels are distributed based on the directional information in each cell by applying 8 directional masks [17, 18] (each mask is of size \(3\times 3 \)). The cell size is \( 10\times 10 \) in this work. In each cell, the pixels distributed in a particular direction are accumulated and mapped into their respective bins. Hence, for 25 cells, a total of \(8 \times 25 = 200\) features represent a pattern/character.
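The directional masks of [17, 18] are not reproduced here, so the sketch below uses Kirsch compass masks purely as a stand-in, and it assumes that each object pixel is assigned to the direction of its strongest mask response before per-cell binning.

```python
import numpy as np
from scipy.ndimage import convolve

# Kirsch compass masks used only as placeholders for the 8 masks of [17, 18].
KIRSCH = [np.array(m) for m in (
    [[ 5,  5,  5], [-3, 0, -3], [-3, -3, -3]],   # N
    [[ 5,  5, -3], [ 5, 0, -3], [-3, -3, -3]],   # NW
    [[ 5, -3, -3], [ 5, 0, -3], [ 5, -3, -3]],   # W
    [[-3, -3, -3], [ 5, 0, -3], [ 5,  5, -3]],   # SW
    [[-3, -3, -3], [-3, 0, -3], [ 5,  5,  5]],   # S
    [[-3, -3, -3], [-3, 0,  5], [-3,  5,  5]],   # SE
    [[-3, -3,  5], [-3, 0,  5], [-3, -3,  5]],   # E
    [[-3,  5,  5], [-3, 0,  5], [-3, -3, -3]],   # NE
)]

def directional_distribution(img, cell=10):
    """Accumulate per-cell counts of object pixels by dominant direction (8 x 25 = 200 features)."""
    responses = np.stack([convolve(img.astype(float), k) for k in KIRSCH])
    direction = responses.argmax(axis=0)                  # dominant direction per pixel
    feats = []
    for r in range(0, img.shape[0], cell):
        for c in range(0, img.shape[1], cell):
            block = img[r:r + cell, c:c + cell]
            d = direction[r:r + cell, c:c + cell][block > 0]
            hist, _ = np.histogram(d, bins=np.arange(9))  # 8 direction bins per cell
            feats.extend(hist)
    return np.array(feats)
```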

3.4 Feature selection techniques

Reducing dimensionality, i.e., removing irrelevant features, is one of the major tasks in any recognition system [19, 20]. In the current work, feature dimensionality reduction is carried out by implementing optimization techniques such as Differential Evolution (DE) [19, 21] and Particle Swarm Optimization (PSO) [22, 23]. Reducing the number of features needed for learning improves predictive accuracy and reduces memory requirements [24]. Let the randomly generated initial population for these techniques be \( POP_{A} \), defined for \( Q_{p} \) individuals/patterns with D dimensions. Let \( \overrightarrow{A}_{i,g} \) be the \( i_{th} \) individual/pattern and \( a_{j,i,g} \) be its \( j_{th} \) dimension parameter in the \( g_{th} \) generation. Brief details of these techniques are given below.

3.4.1 Differential evolution

DE is a population-based stochastic approach introduced by Price and Storn in 1995. The four steps involved in DE are population initialization, mutation, crossover, and selection. Let the randomly generated \( i_{th} \) individual be \( \overrightarrow{A}_{i,g} \) and \( a_{j,i,g} \) be its \( j_{th} \) dimension feature in the \( g_{th} \) generation. The fitness is measured for all the individuals in the population. In the current work, the classification error computed using the k-NN (k-Nearest Neighbor) classifier is used as the fitness function. Let the fitness measured for the individual \( \overrightarrow{A}_{i,g} \) be \(f(\overrightarrow{A}_{i,g})\). The mutant individual \( \overrightarrow{B}_{i,g} \) is defined as follows [19]:

$$\begin{aligned} \overrightarrow{B}_{i,g} = \overrightarrow{A}_{r_{0},g} + F \times \left( \overrightarrow{A}_{r_{1},g}-\overrightarrow{A}_{r_{2},g}\right) \end{aligned}$$
(2)

where \( r_{0} \), \( r_{1} \), and \( r_{2} \) are three randomly generated indices. The scale factor F \( \in \) (0,1) controls the rate at which the population evolves and is inversely proportional to max(\(a_{j,r_{1},g} \), \( a_{j,r_{2},g} \)). This allows the second term in Eq. (2) to oscillate within limits without crossing optimal solutions. DE crosses every individual vector with the mutant vector to define a trial vector \( \overrightarrow{U}_{i,g} \), as depicted in Eq. (3).

$$\begin{aligned} \overrightarrow{U}_{i,g}= {\left\{ \begin{array}{ll} \overrightarrow{B}_{i,g}, &{} \text {if}\ \varnothing (0,1) \le C_{r} \\ \overrightarrow{A}_{i,g}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

where \(\varnothing \) is a random value and \(C_{r} \in (0,1)\) defines the crossover rate. In the current work, the crossover rate is set to 0.5. The DE algorithm [19] employs a roulette wheel to replace duplicated features in the trial vector. The roulette wheel is designed by computing positive and negative distribution factors of the features [19].

The selection process in DE chooses either the original or the trial individual for the next generation, based on their fitness values, as depicted in Eq. (4), to produce the next-generation individual \(\overrightarrow{A}_{i,g+1}\).

$$\begin{aligned} \overrightarrow{A}_{i,g+1}= {\left\{ \begin{array}{ll} \overrightarrow{U}_{i,g}, &{} \text {if}\ f \left( \overrightarrow{U}_{i,g}\right) \le f\left( \overrightarrow{A}_{i,g}\right) \\ \overrightarrow{A}_{i,g}, &{} \text {if}\ f\left( \overrightarrow{A}_{i,g}\right) < f\left( \overrightarrow{U}_{i,g}\right) \end{array}\right. } \end{aligned}$$
(4)

The four steps of DE are repeated for the new individuals in the population until the stopping criterion is reached, to find an optimum feature subset.
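A compact sketch of DE-based feature-subset selection following Eqs. (2)-(4) is given below. The individuals are vectors of D feature indices, `fitness` stands for the k-NN classification error on the selected subset, and the roulette-wheel repair of duplicated features from [19] is replaced by a simple random repair; these simplifications are assumptions made for illustration only.

```python
import numpy as np

def de_select(fitness, n_features, D, pop_size=40, g_max=100, F=0.5, Cr=0.5, rng=None):
    """Search for a subset of D feature indices minimizing the classification error."""
    rng = rng or np.random.default_rng()
    pop = np.array([rng.choice(n_features, D, replace=False) for _ in range(pop_size)])
    fit = np.array([fitness(ind) for ind in pop])
    for g in range(g_max):
        for i in range(pop_size):
            r0, r1, r2 = rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)
            mutant = pop[r0] + F * (pop[r1] - pop[r2])                    # Eq. (2)
            mutant = np.clip(np.round(mutant), 0, n_features - 1).astype(int)
            cross = rng.random(D) <= Cr
            trial = np.where(cross, mutant, pop[i])                       # Eq. (3)
            uniq = np.unique(trial)                                       # repair duplicates
            if uniq.size < D:                                             # (random repair, not roulette wheel)
                unused = np.setdiff1d(np.arange(n_features), uniq)
                trial = np.concatenate([uniq, rng.choice(unused, D - uniq.size, replace=False)])
            f_trial = fitness(trial)
            if f_trial <= fit[i]:                                         # Eq. (4): greedy selection
                pop[i], fit[i] = trial, f_trial
    best = fit.argmin()
    return pop[best], fit[best]
```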

3.4.2 Particle swarm optimization

PSO, developed by Kennedy and Eberhart [25], is motivated by the behavior of particles in a swarm. PSO starts with an initial swarm of particles in the search domain. Let the random position and velocity vectors of the \( i_{th} \) particle/individual be \( \overrightarrow{A}_{i,g} \) and \( \overrightarrow{V}_{i,g} \), respectively, at the \( g_{th} \) generation. Each particle in the swarm flies with these initial positions and velocities in the search domain to attain the best fitness. In the current work, the classification error is considered as the fitness function for PSO. At generation g, \( \overrightarrow{A}_{\mathrm{best}_{i}} \) and \( \overrightarrow{G}_{\mathrm{best}} \) are the best fitness positions achieved by the \( i_{th} \) particle/individual and by the swarm/group, respectively.

For every generation, the velocities and positions of the particles are computed [22] using Eqs. (5) and (6), respectively, with linearly time varying inertia weight (\( \omega \)) over generations [26]:

$$\begin{aligned} \overrightarrow{V}_{i,g+1}= & {} \omega \times \overrightarrow{V}_{i,g} + c_{1} \times \varnothing _{1} \times \left( \overrightarrow{A}_{\mathrm{best}_{i}}-\overrightarrow{A}_{i,g}\right) + c_{2} \nonumber \\&\times \varnothing _{2} \times \left( \overrightarrow{G}_{\mathrm{best}}-\overrightarrow{A}_{i,g}\right) \end{aligned}$$
(5)
$$\begin{aligned} \overrightarrow{A}_{i,g+1}= & {} \overrightarrow{A}_{i,g} + \overrightarrow{V}_{i,g+1} \end{aligned}$$
(6)

where \( \varnothing _{1} \) and \( \varnothing _{2} \) are two numbers generated randomly over the range [0, 1], \( c_{1} \) and \( c_{2} \) are the cognitive and social acceleration constants, respectively, and \( \omega \) is given by

$$\begin{aligned} \omega = \omega _{2} + (\omega _{1}-\omega _{2}) \times \frac{g_{max}-g}{g_{max}} \end{aligned}$$

where \( \omega _{1} \) and \( \omega _{2} \) are the initial and final inertia weights, g is the current generation, and \( g_{max} \) is the maximum number of generations. The optimum values [26] set for \( \omega _{1} \) and \( \omega _{2} \) are 0.9 and 0.4, respectively. The acceleration coefficients \( c_{1} \) and \( c_{2} \) are both set to a constant value of 2.

For every generation, the \( \overrightarrow{A}_{best_{i}} \) and \( \overrightarrow{G}_{best} \) positions of the \( i_{th} \) particle and of the swarm, respectively, are updated based on the fitness computed for the updated positions of the particles, until an optimum solution is achieved.
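The update rules of Eqs. (5) and (6), with the linearly decreasing inertia weight (0.9 to 0.4) and c1 = c2 = 2, can be sketched as a single PSO step; mapping the continuous positions back to valid feature subsets and evaluating the k-NN error are omitted and assumed to be handled elsewhere.

```python
import numpy as np

def pso_step(pos, vel, p_best, g_best, g, g_max,
             c1=2.0, c2=2.0, w1=0.9, w2=0.4, rng=None):
    """One PSO generation: update velocities (Eq. 5) and positions (Eq. 6)."""
    rng = rng or np.random.default_rng()
    w = w2 + (w1 - w2) * (g_max - g) / g_max          # linearly decreasing inertia weight
    phi1 = rng.random(pos.shape)                       # random factors in [0, 1]
    phi2 = rng.random(pos.shape)
    vel = w * vel + c1 * phi1 * (p_best - pos) + c2 * phi2 * (g_best - pos)   # Eq. (5)
    pos = pos + vel                                                            # Eq. (6)
    return pos, vel
```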

4 Experimental results

Traditionally, scanned and camera-captured documents are prone to several distortions such as uneven illumination, low resolution, perspective distortion, and motion blur. These distortions may elevate the noise levels in degraded documents. The actual textual information gets affected while suppressing these noise levels using conventional noise removal approaches. In contrast, no additional noise is introduced by the proposed depth sensing approach. With the specified stylus force (10 mg) applied to the sample document, no damage is caused to it.

4.1 Preprocessing

The proposed background subtraction algorithm extracts the foreground text even from stained documents. As the pressure applied by the scribe varies from line to line and from scribe to scribe, the threshold is computed for every line separately, considering all the profiles in a line, as discussed in Sect. 3.2. The sample background-subtracted images are shown in Fig. 4b for the input text strings shown in Fig. 4a. Though there is a stain in string 1, the text is extracted properly after background subtraction.

Fig. 4 Sample results: a input string images from a palm manuscript; b background-subtracted string images

Each segmented character consists of X, Y, and Z data points. Selecting two coordinates at a time, patterns are generated in three different planes of projection, viz., XY, YZ, and XZ. The patterns generated in all three planes are shown in Fig. 5a–d for some similar Telugu characters. These Telugu characters have dissimilar patterns in the YZ and XZ planes of projection, which leads to improved recognition accuracy.

Fig. 5 Patterns generated in XY, YZ, and XZ planes of projection for some similar Telugu characters: a Haa, b Hi, c Vaa, d Ha

4.2 Comparison of palm leaf character recognition approaches

The patterns generated in the three planes of projection are trained and tested separately, to identify the plane of projection that offers the best recognition of character patterns. Hence, the experiment is repeated three times, once for each plane. The number of characters considered for experimentation is 8745, from 53 different classes. All the characters are tested once through five-fold cross-validation.
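The evaluation protocol can be sketched as follows, assuming the extracted feature vectors and class labels are available as arrays X and y; stratified folds are used here since each class contributes the same number of samples to every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def cross_validated_accuracy(X, y, n_folds=5):
    """Five-fold cross-validated accuracy with a 1-NN classifier."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=1)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(scores)
```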

With the two different feature sets extracted from the patterns, as discussed in Sect. 3.3, the characters are classified using the k-NN classifier (k = 1). The cross-validated results obtained are compared with the existing and state-of-the-art methods and are tabulated in Table 1. In the state-of-the-art method, the noise from the flat-bed scanned palm leaf documents is removed using morphological operations such as dilation and erosion. The 'HoG' feature set yielded better classification accuracies in all three planes of projection. The best recognition rate obtained is 93.1%, in the YZ plane of projection, using the 'HoG' feature set. The maximum variations among similar Telugu characters are found in the Y direction [7,8,9]. This inherent characteristic of the Telugu script, in combination with the 3D feature (depth of incision) Z, yielded better recognition accuracies in the YZ plane of projection compared to the other planes in all cases. The recognition accuracies achieved in the XZ plane of projection are better than those in the XY plane, again owing to the contribution of Z. This demonstrates that the recognition rate is improved by using the 3D feature (depth).

Sastry et al. [7, 8] worked on basic isolated Telugu palm leaf characters, whereas the proposed work covers not only basic characters but also compound characters. These compound characters are obtained by combining vowels, consonants, or both. In [7, 8] the data from palm manuscripts were acquired using a 'human–mechanical interface system,' whereas in the current work the data are acquired using a sophisticated profiler.

In [7] the radon transform was used to extract features from the characters. Sastry et al. [8] recognized the palm leaf characters in two levels. In the first level, the characters were classified by extracting 2D-FFT features. The mis-recognized characters from the first level were classified again in the second level by extracting 2D-DCT features. They reported a best recognition rate of 92.8% in the YZ plane of projection. This approach is computationally expensive, as two sets of features were extracted for the characters.

The results reported in [7] are in line with the results obtained using the proposed approaches. The recognition accuracies in the XY and XZ planes of projection are better with the proposed approaches than those reported in [8]. Compared to the conventional, state-of-the-art method, the character recognition results obtained with the proposed system (considering the Z feature) are better in both the YZ and XZ planes.

Table 1 Comparison of palm leaf character recognition approaches with the existing and state-of-the-art methods

4.3 Feature selection results

As discussed in the previous subsection, the recognition results are better in the YZ plane of projection for both the proposed and existing approaches. Hence, in this subsection, optimization of features is carried out only for the patterns obtained in the YZ plane of projection. The two feature sets extracted from the patterns are concatenated and used to evaluate the performance of the optimization techniques DE and PSO. The subset size, i.e., the number of dimensions (D), is varied from 5 to 100 features in the experiment, as the proper subset size is unpredictable and no improvement is achieved for larger subset sizes [19].

Each optimization technique is run with the same initial population \( POP_{A} \). The population size \( Q_{p} \) is set to 40, and the maximum number of generations/iterations, \( g_\mathrm{max} \), is set to 100 for all the experiments. The fitness function is the classification error rate computed using the k-NN classifier. The experiment is repeated 20 times for each subset size. The average classification rates for the various subset sizes are shown in Fig. 6.

It has been reported in [19, 27] that the performance of DE is better than that of the PSO and GA optimization techniques for various medical datasets. In the current work as well, it is observed that, owing to its exploitation and exploration capabilities, DE outperformed PSO even for smaller subset sizes.

The best classification rate obtained with the combination of extracted features and PSO is 89.5%, for an optimum feature subset size of 85; no improvement in recognition rate is found using PSO. DE with the combination of extracted features yielded the best recognition accuracy of 94.68%, for an optimum feature subset size of 60. Thereby, the predictive classification accuracy is improved by reducing the number of features needed for learning [24].

Fig. 6 Feature optimization by combining the features in the YZ plane of projection

5 Conclusion

In conventional scanned or camera-captured images, distortions like low resolution, perspective distortion, and uneven illumination are common. For noisy or historical documents, these distortions may elevate the noise levels. The proposed 3D depth sensing approach subtracts the background and extracts the text from degraded palm manuscripts, even if the document has stains on it. No damage is caused to the sample document, even though the measurements are taken using a contact-type stylus profiler.

The problem of recognizing similar Telugu characters is resolved using the depth information, 'Z'. Patterns are developed in various planes of projection, viz., XY, YZ, and XZ. The best recognition rate obtained is 93.1%, in the YZ plane of projection, using the 'Histogram of Gradients' feature set. This attribute of the Telugu script, in conjunction with the 3D feature 'Z', yielded better recognition accuracies in the YZ plane of projection compared to the other planes of projection.

To reduce feature dimensionality, optimization techniques such as Differential Evolution and Particle Swarm Optimization are implemented. The experiments are conducted by concatenating the two sets of features and varying the subset sizes. With a reduced subset of features, the predictive accuracy is improved and the memory requirement is also reduced. The best recognition accuracy obtained is 94.68%, for an optimum feature subset size of 60, with the Differential Evolution technique in the YZ plane of projection. Hence, there is a significant improvement in recognition rate for the reduced feature subsets obtained using Differential Evolution.

A limitation of the proposed study is the high scan time; because of this, all the character samples were used in five-fold cross-validation. In future, the work will be extended to an independent test set by acquiring more data. Another limitation of the proposed system is that it is applicable only to palm leaf manuscripts.