1 Introduction

Writer identification is one of the important applications of document analysis and recognition. It is the process of identifying the author of a handwritten text, whether the script is Indic or non-Indic. Each script has its own characteristics, which can be exploited when designing an approach, for example in the choice of features extracted from a writer's text samples and of the classification algorithm. Automatic writer identification from online or offline digitized handwriting samples is widely used in applications such as crime investigation, forensic analysis, theft cases and the authentication of wills. It forms part of document analysis, which falls within pattern recognition and machine learning research. The work reported here develops a writer identification system for the Devanagari script, with the aim of achieving higher accuracy than the state-of-the-art work. Devanagari is the script used to write Hindi, an official language of India and one of the most widely spoken languages in the world. The Devanagari script is believed to have evolved from the Brahmi script through numerous transformations. Because handwriting recognition and writer identification are closely linked, handwriting is also believed to be a useful source of information about the writer, such as name, age, handedness (left or right) and even the state or country to which the person belongs.

Handwriting recognition and writer identification are related, but they treat writer-dependent variation in opposite ways:

  • Handwriting recognition ignores the writer-dependent variations in writing and extracts a set of features that can be identified across the different allographs of the same character.

  • Writer identification analyzes how one individual's handwriting style differs from another's and exploits the variations in characters and words to identify the writer of a given sample image.

1.1 Offline and online modes

Writer identification is of two types: offline and online. Offline writer identification identifies the writer from text that has already been written, scanned and transferred to a computer as a sample image. It is harder than the online case because of factors such as variations in handwriting style, quality of paper and preprocessing methods. Online writer identification identifies the writer while the text is being written on an electronic device such as a tablet with a stylus. Online data are easier to work with than offline data. Samples of offline and online handwriting are shown in Fig. 1a, b. This paper presents a system for offline Devanagari writer identification.

Fig. 1 Handwritten text: a offline text, b online text

1.2 Text-dependent and text-independent systems

Writer identification systems are further divided into two categories: text dependent and text independent. In the text-dependent methodology, the writer is required to write the same text that was previously supplied to the system, whereas in the text-independent methodology any text can be used for identification. The recognition rate depends on factors such as the number of writers, the type of handwriting and the division of the dataset into training and testing sets.

1.3 Applications of writer identification system

Application areas of writer identification systems include:

  • Biometric recognition

  • Forensic record analysis and investigation

  • Authentication of customers in the banking system

  • Crime investigations

  • Analysis of Wills

  • Mobile bank transactions

  • Signature verification

  • Resolving cases of suspected authorship.

1.4 Layout of the paper

The paper is divided into eight sections. Section 1 introduces the writer identification system. Section 2 reviews the state-of-the-art work on writer identification, focusing mainly on Indic scripts. The motivations and challenges of the proposed work are discussed in Sect. 3. Section 4 introduces the Devanagari script, its fundamentals and its characteristics. Section 5 describes the proposed framework of the writer identification system for Devanagari script. Section 6 presents the experimental results and discussion. Section 7 compares the proposed system with existing work on writer identification in terms of feature extraction methods, classification methods and accuracy achieved. Finally, limitations, conclusions and future directions are presented in Sect. 8.

2 State-of-the-art work

Sethi and Chatterjee (1977) presented a handwritten Devanagari character recognition system based on simple primitives. The presence, absence and positional relationships of these primitives were used to make decisions and reach conclusions. For training their model, they considered multilayer perceptron (MLP) and radial basis function (RBF) networks with the error back-propagation algorithm. Pal et al. (2010) proposed a novel method for multi-script text recognition in Bangla and Devanagari. Water reservoir and convex hull approaches were applied to the cavity regions and background information, and invariant features of each character were extracted from the foreground. Circular features and convex hull features were combined with a support vector machine classifier, and the experiment attained an accuracy of 99.18% when tested with 7515 Devanagari characters. For text documents in Bangla script, an accuracy of 98.86% was achieved on 7874 Bangla characters. Shaw and Puri (2010) proposed a framework for offline handwritten Devanagari word recognition. They used vertical and horizontal stroke-based features for feature extraction and a hidden Markov model for classification, and reported an identification accuracy of 81.63%. Siddiqi and Vincent (2010) used the concept of graphemes to characterize writers. A chain code-based technique was proposed to extract features from the handwriting contours. These features were integrated and validated with the IAM database. They attained an identification accuracy of 77.0% with the IAM database and 79.0% with the distribution of chain codes. Khanale and Chitnis (2011) proposed a character recognition system for Devanagari script based on a two-layer feed-forward neural network. Training the network with the back-propagation model, they achieved a recognition accuracy of 96.0%. Agnihotri (2012) presented a new method for generating the chromosome and fitness functions for offline handwritten Devanagari character recognition. He used a diagonal-based method for feature extraction and a genetic algorithm for classification. The corpus consisted of 1000 samples, and an accuracy of 85.78% was achieved. Singh and Lehri (2014) proposed an offline character recognition system for Devanagari script. They used one-dimensional feature vectors of size 49 × 1 with a neural network for classification and obtained an accuracy of 93.0%.

Halder et al. (2015) proposed a novel technique for writer identification in Devanagari script, using five copies of the Devanagari character set written by 50 different writers. They computed a 64-dimensional feature set from the image gradients and, using an SVM classifier, achieved an accuracy of 99.12%. Panda and Tripathy (2015) proposed a new method for offline Odia character recognition. They employed template matching and Unicode mapping for the writer identification process and reported an identification accuracy of 97.0%. Sagar and Pandey (2015) proposed a writer identification system for Devanagari script based on an artificial neural network (ANN). Slant estimation techniques, textural features, Radon and Hough transforms and Zernike moments were combined to attain better results. Alwzwazy et al. (2016) applied a deep convolutional neural network (CNN) to Arabic handwritten digit recognition. They performed their study on 45,000 samples, and the deep CNN classifier gave an accuracy of 95.7%. Xing and Qiao (2016) proposed a novel framework for writer identification based on a multi-stream deep CNN. They used local handwritten patches as input and trained with a softmax classification loss. A multi-stream structure and data augmentation learning were used to improve performance. The resulting network, DeepWriter, learns a deep representation for identifying writers. On the IAM and HWDB datasets, they attained an accuracy of 99.0%. Roy et al. (2017) presented a novel approach that uses deep belief networks to obtain a compressed representation of the data, with a hidden Markov model (HMM) for word recognition. They selected the RIMES and IFN/ENIT datasets for the Latin and Arabic languages, respectively, and also conducted experiments on Devanagari script. The experiments showed that the proposed method performed better than MLP-HMM approaches. Andrew et al. (2017) presented a novel technique for writer identification in Telugu script with 150 writers. They extracted directional features and descriptive convolution-based features for recognition. As classifiers, they considered nearest neighbor and SVM with linear and RBF kernels. The experiment achieved an identification accuracy of 71.0%.

Adak and Chaudhuri (2015) presented a novel method for writer identification designed specifically for Bangla numerals and characters. They used 193 ortho-syllabic numerals and characters of NewISldb:HwC and obtained improved identification results compared with previous studies. Adak et al. (2017) presented a writer identification system for two scripts, Bangla and English. Their technique assumes that a writer sometimes strikes out or crosses out inappropriate words in the document. A corpus of 29,341 words was collected and maintained with a handcrafted feature set. For classification, they used a support vector machine (SVM) with a radial basis function (RBF) kernel and achieved an identification accuracy of 79.07%. Dhandra and Vijaylaxmi (2015) presented a novel approach to writer recognition for the Kannada script. They used a feature vector consisting of multi-resolution spatial and directional features based on the discrete cosine transform, Radon transforms and structural features. Features from two or more words were combined, and classification with k-nearest neighbor under fivefold cross-validation gave satisfactory results. Thendral et al. (2013) proposed a text-dependent writer identification system for the Tamil script based on local and global features. With tenfold cross-validation and a decision tree classifier, they attained an identification accuracy of 98.6%. Deng et al. (2017a) presented a novel multi-objective optimization model based on the minimum walking distance of passengers, the idle time variation of gates, the number of flights and the utilization of gates. The method provides a reference for assigning gates in a hub airport and balances gate utilization rates while reducing walking distance, improving airport services and, most importantly, increasing passenger satisfaction. Deng et al. (2017b) proposed a new collaborative optimization algorithm that improves the local search ability of the genetic algorithm and the convergence speed of the ant colony optimization (ACO) algorithm. A multi-population strategy is used for information exchange and to overcome long search times, and a collaborative strategy dynamically balances the global and local search abilities and improves the convergence speed. Dargan and Kumar (2018) presented a state-of-the-art review covering various Indic and non-Indic scripts. The review assists novice researchers in this field by summarizing the feature extraction methods and classification techniques required for writer identification on both Indic and non-Indic scripts. They have also presented a comprehensive survey of deep learning and its applications in different research areas (Dargan and Kumar 2019). Kumar et al. (2011) recognized offline handwritten Gurmukhi characters using a k-NN classifier. They also presented a hierarchical technique for the recognition of offline handwritten Gurmukhi characters, with which they reported a recognition accuracy of 91.8% (Kumar et al. 2014). Kumar et al. (2018) collected 31,500 samples of Gurmukhi characters from 90 different writers for a 35-class problem. Using a combination of zoning, transition and peak extent-based features with a linear SVM classifier, they achieved identification accuracies of 89.85% and 94.76% with a partitioning strategy and tenfold cross-validation, respectively. Deng et al. (2019) proposed an improved ant colony optimization method, ICMPACO, which employs co-evolution, a multi-population strategy, a pheromone updating strategy and a pheromone diffusion approach. The goal is to maintain the balance between solution diversity and convergence speed. The problem is decomposed into several subproblems, and the ants in the population are divided into elite ants and common ants. On the traveling salesman problem and an actual gate assignment problem, the method gave better assignment results and improved optimization values. Zhao et al. (2018) proposed a novel high-order morphology gradient spectrum entropy method for describing the degree of fault damage in bearings, tested on the vibration signals of motors under no-load and load states. The experiment showed improved accuracy in estimating the degree of fault damage and in fault prediction for rotating machinery. Zhao et al. (2019) presented PABSFD, a fault diagnosis method based on principal component analysis (PCA) and a broad learning system (BLS). The method eliminates feature correlation, reduces the dimension of the feature matrix and produces accurate fault diagnosis results.

3 Motivations and challenges

The motivation for the proposed system, i.e., the development of a writer identification framework, originates from the usefulness of and continuous need for forensic record analysis, record verification in the banking system, biometric recognition and so on. Handwriting is flexible and dynamic; Fig. 2 shows the variations in handwriting style in Devanagari script, together with the name of each writer.

Fig. 2 Variations in handwriting styles of different writers

Several challenges make the development of a system based on handwritten text interesting to researchers. They include the difficulty of distinguishing the handwriting styles of different individuals, the diverse shapes and sizes of characters, the quality of the document, and the presence of both constrained and unconstrained handwriting. For Indic scripts, the quality of the paper and the flexibility and variation in writing styles are the major challenges in the identification process. Indic scripts are harder to work with than non-Indic scripts because of, for example, poor-quality documents containing holes, noise, spots and broken strokes, larger character sets, modifiers and the absence of standard test databases.

4 Devanagari script and data collection

Devanagari script, also known as Nagari (the prefix Deva means "god" or "divine"), is used to write the Hindi, Sanskrit, Marathi and Nepali languages. It developed from the North Indian monumental script known as Gupta. It is written from left to right and has a strong preference for symmetrical, rounded character shapes. The script has 47 primary characters, of which 14 are vowels and 33 are consonants. The character set of Devanagari script is depicted in Fig. 3.

Fig. 3 Character set of Devanagari script

The Devanagari script has played an important role in the development of literature. Hindi, Marathi, Nepali and Sanskrit are the main languages written in it. The script is written from left to right, has no distinct letter cases and is recognizable by the horizontal line that runs along the top of full letters. In this study, 49 characters of the Devanagari script were considered. Samples of these characters were collected at public places such as schools and colleges from 100 different writers, each of whom wrote every character five times. The result is a corpus of 24,500 samples for the 49-class Devanagari problem.
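As a quick check, the corpus size follows directly from the collection protocol stated in this section:

$$ 49\ \text{characters} \times 100\ \text{writers} \times 5\ \text{samples} = 24{,}500\ \text{samples} $$

so each writer contributes 49 × 5 = 245 samples.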

5 Proposed writer identification system

We present a novel system for text-dependent writer identification based on isolated handwritten Devanagari characters. The aim is a robust system in which an efficient feature extraction phase extracts discriminative features and is followed by a classification phase. The proposed framework consists of digitization, preprocessing, feature extraction and classification. These phases are briefly discussed in the following subsections.

5.1 Digitization and preprocessing

The digitization phase converts a handwritten character into a digital image using a scanner at 300 dpi, which is considered a standard value. The first step in writer identification is to convert the multilevel images into black and white images. In the preprocessing stage, several operations are applied to each character image. First, the character image is normalized to 64 × 64 pixels using nearest neighbor interpolation and converted into a bitmap image with pixel values 0 and 1. The bitmap image is then transformed into a thinned image using a parallel thinning algorithm (Zhang and Suen 1984).
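The preprocessing steps described in this subsection can be sketched as follows. This is a minimal illustration in plain Python, assuming a grayscale image stored as a list of rows with values from 0 to 255. The threshold of 128 and all function names are assumptions made for the example and are not taken from the paper. The thinning routine follows the two sub-iteration rules published by Zhang and Suen (1984).

```python
def binarize(gray, threshold=128):
    """Map grayscale pixels (0-255) to 1 = ink, 0 = background (assumed threshold)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def resize_nearest(img, size=64):
    """Nearest-neighbour resampling of a 2-D list to size x size."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def zhang_suen_thin(img):
    """Zhang-Suen parallel thinning of a 0/1 image; the input is not modified.

    Pixels on the outermost border are left untouched in this sketch.
    """
    img = [row[:] for row in img]
    h, w = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9: clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, h - 1):
                for c in range(1, w - 1):
                    if img[r][c] != 1:
                        continue
                    p = neighbours(r, c)
                    b = sum(p)  # number of foreground neighbours
                    # number of 0 -> 1 transitions in the circular sequence P2..P9
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:
                        cond = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
                    else:
                        cond = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((r, c))
            # Apply all deletions of a sub-iteration at once (parallel update).
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


def preprocess(gray):
    """Binarize, normalize to 64 x 64 and thin one character image."""
    return zhang_suen_thin(resize_nearest(binarize(gray), 64))
```

Collecting the deletions of each sub-iteration before applying them is what makes the thinning parallel: every decision in a pass is based on the same image state.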

5.2 Feature extraction

Feature extraction is the most important phase of the writer identification framework, as it extracts the features of the preprocessed characters on which classification relies. A good feature set captures the relevant properties of a character so that the classification phase can identify the writer with little time and computation. Four feature types are used in the proposed system: transition features, diagonal features, zoning features and peak extent-based features. These feature extraction methods are briefly discussed in the following subsections.

5.2.1 Zoning features (F1)

In this method, the thinned image of a character is decomposed into n (= 100) zones of approximately equal size. The number of foreground pixels in each zone is then counted. The counts p1, p2, …, pn obtained for the n zones are normalized to [0, 1], resulting in a feature set of n elements. This process yields 100 features for each image of a handwritten Devanagari character.
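A minimal sketch of the zoning computation is given below in plain Python. It assumes a 10 × 10 grid of zones (which gives n = 100) and normalization by the largest zone count. Neither detail is stated explicitly in the text, so both are assumptions, as are the function names. Because 64 is not divisible by 10, the zones are only approximately equal in size.

```python
def zone_bounds(length, parts):
    """Split range(length) into `parts` nearly equal [start, stop) intervals."""
    return [(i * length // parts, (i + 1) * length // parts) for i in range(parts)]


def zoning_features(img, grid=10):
    """Foreground-pixel count per zone, scaled to [0, 1] by the largest count."""
    rows = zone_bounds(len(img), grid)
    cols = zone_bounds(len(img[0]), grid)
    counts = [sum(img[r][c] for r in range(r0, r1) for c in range(c0, c1))
              for r0, r1 in rows for c0, c1 in cols]
    peak = max(counts)
    # An empty image yields an all-zero feature vector.
    return [n / peak if peak else 0.0 for n in counts]
```

For a 64 × 64 thinned character image, `zoning_features(img)` returns a list of 100 values, one per zone in row-major order.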

5.2.2 Diagonal features (F2)

This method also divides the thinned image of a character into n (= 100) zones of approximately equal size. Features are extracted from the pixels of each zone by moving along its diagonals. A zone of m × m pixels has 2m − 1 diagonals, and the foreground pixels present along each diagonal are summed to give one sub-feature per diagonal. The sub-features of a zone are then combined into a single value, so n features are obtained for each sample.
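The per-zone computation can be sketched as follows in plain Python. The text does not say how the diagonal sums of a zone are combined into one value. Averaging them is an assumption made here for illustration, and the function names are hypothetical.

```python
def diagonal_sums(zone):
    """Foreground count along each top-left to bottom-right diagonal of a zone."""
    h, w = len(zone), len(zone[0])
    sums = [0] * (h + w - 1)  # an m x m zone has 2m - 1 diagonals
    for r in range(h):
        for c in range(w):
            # Pixels with the same (c - r) lie on the same diagonal.
            sums[c - r + h - 1] += zone[r][c]
    return sums


def diagonal_feature(zone):
    """Single zone feature: mean of the diagonal sums (assumed combination rule)."""
    sums = diagonal_sums(zone)
    return sum(sums) / len(sums)
```

Applying `diagonal_feature` to each of the 100 zones of a character image gives the 100-element diagonal feature vector.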

5.2.3 Transition features (F3)

Transition features are based on the number and location of transitions from background to foreground pixels in the horizontal and vertical directions. The image is scanned from left to right and from top to bottom. As depicted in Fig. 4, this process gives 2n features for a character image, i.e., 200 features for each image of a handwritten Devanagari character.

Fig. 4 Transition feature extraction: a transitions in horizontal direction, b transitions in vertical direction
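One possible reading of the transition feature is sketched below in plain Python. It assumes that the two values per zone (2n in total for n zones) are the horizontal and vertical counts of background-to-foreground transitions within that zone. This interpretation and the function names are assumptions. A scan line that already starts on a foreground pixel contributes no transition at its first position in this sketch.

```python
def count_transitions(line):
    """Number of background (0) to foreground (1) changes along one scan line."""
    return sum(1 for prev, cur in zip(line, line[1:]) if prev == 0 and cur == 1)


def transition_features(zone):
    """Return (horizontal, vertical) transition counts for one zone."""
    horizontal = sum(count_transitions(row) for row in zone)        # left to right
    vertical = sum(count_transitions(col) for col in zip(*zone))    # top to bottom
    return horizontal, vertical
```

With 100 zones per character image, concatenating the two values of every zone gives a 200-element vector, matching the feature count stated in the text.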

5.2.4 Peak extent-based features (F4)

In this section, the authors apply peak extent-based features to the dataset. These features have previously been used for offline handwritten Gurmukhi character recognition, where they achieved the best recognition accuracy for the Gurmukhi script. In this technique, features are extracted by summing the peak extents that fit successive black pixels within each zone. Peak extent-based features can be extracted horizontally and vertically. The horizontal peak extent feature is the sum of the peak extents that fit successive black pixels horizontally in each row of a zone, as shown in Fig. 5b, and the vertical peak extent feature is the sum of the peak extents that fit successive black pixels vertically in each column of a zone, as shown in Fig. 5c.

Fig. 5 Peak extent-based features: a zone of bitmap image, b horizontally peak extent-based features, c vertically peak extent-based features

This process again yields 200 features (100 horizontal and 100 vertical peak extent features) for each sample of a handwritten Devanagari character. In the present study, the feature extraction techniques are applied individually and in various combinations to identify writers from their Devanagari handwriting styles. The combinations of feature vectors used to improve the identification accuracy are discussed in Sect. 6.
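The peak extent computation for a single zone can be sketched in plain Python as follows. The sketch assumes that the peak extent of a row or column is the length of its longest run of consecutive black pixels. This reading of the description and the function names are assumptions.

```python
def longest_run(line):
    """Length of the longest run of consecutive foreground (1) pixels."""
    best = run = 0
    for px in line:
        run = run + 1 if px else 0
        best = max(best, run)
    return best


def peak_extent_features(zone):
    """Return (horizontal, vertical) peak extent sums for one zone."""
    horizontal = sum(longest_run(row) for row in zone)        # one peak per row
    vertical = sum(longest_run(col) for col in zip(*zone))    # one peak per column
    return horizontal, vertical
```

Computing both values for each of the 100 zones gives the 100 horizontal and 100 vertical features mentioned in the text.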

5.3 Classification

Classification is the final phase of the writer identification system. It is a decision-making phase that identifies the writer from the features extracted in the previous phase. The authors considered two classification techniques, namely k-NN and SVM. Experiments were also run with neural network and decision tree classifiers, but k-NN and linear SVM performed better than the other classifiers, so only the results of k-NN and linear SVM are presented in this article.

  • k-nearest neighbor (k-NN) classifies an unknown sample according to its neighboring samples in the training feature space. The locations and labels of the training samples divide the space into regions. Euclidean distance is usually used to measure the distance between a stored feature vector and the candidate feature vector. In the present work, k = 1.

  • SVM is a very useful method for data classification and has been broadly applied in pattern recognition. It is based on statistical learning theory and uses supervised learning, in which a machine is trained on a set of input/output pairs rather than explicitly programmed to perform a given task. In this work, the SVM classifier is used with a linear kernel.
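The k-NN rule with k = 1 and Euclidean distance can be sketched in a few lines of plain Python. The function name and the toy data in the usage note are hypothetical, and the linear SVM is not reproduced here because the text does not give its training details.

```python
import math


def nearest_neighbour(train_vectors, train_labels, query):
    """Return the label of the training vector closest to `query` (k = 1)."""
    best = min(range(len(train_vectors)),
               key=lambda i: math.dist(train_vectors[i], query))
    return train_labels[best]
```

For example, with training vectors `[[0, 0], [1, 1], [5, 5]]` labelled `["writer_a", "writer_a", "writer_b"]`, the query `[4, 4.5]` is assigned to `"writer_b"`, since `[5, 5]` is its nearest stored vector.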

6 Experimental results and discussion

This section presents the experimental results of the proposed writer identification framework on Devanagari characters. The results are produced with different feature extraction and classification techniques. The corpus consists of 24,500 samples of segmented Devanagari characters collected from 100 different writers. Of these, 70% (17,150 samples) are used for training and the remaining 30% (7350 samples) for testing. In the feature extraction phase, four feature sets, namely zoning, diagonal, transition and peak extent-based features, are tested individually and in different combinations to improve the accuracy of the system. The proposed framework achieves a promising accuracy of 91.53% using a combination of zoning, diagonal, transition and peak extent-based features with a linear SVM classifier, as depicted in Table 1. Table 1 reports the accuracy, TPR (true positive rate) and FPR (false positive rate, also written FAR), computed using Eqs. (1)–(3), respectively. These experimental results are also shown graphically in Fig. 6. Writer-wise accuracy is also presented, in order to analyze the handwriting quality of each writer in terms of handwriting flow and character shapes. Writer-wise recognition accuracy with a combination of zoning, diagonal, transition and peak extent-based features and a linear SVM classifier is depicted in Table 2. With a single feature set, the maximum accuracy of 90.23% is achieved using zoning features or transition features.

Table 1 Feature-wise experimental results using k-NN and linear SVM classifier

Fig. 6 Feature-wise writer identification accuracy using linear SVM classifier

Table 2 Writer-wise recognition accuracy using proposed framework using linear SVM classifier

$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{1} $$

$$ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{2} $$

$$ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \tag{3} $$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
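Equations (1)–(3) translate directly into code. The sketch below is plain Python, and the counts used in the usage note are hypothetical values chosen for illustration, not results from the paper.

```python
def rates(tp, tn, fp, fn):
    """Return (accuracy, TPR, FPR) from confusion-matrix counts, per Eqs. (1)-(3)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)   # false positive rate
    return accuracy, tpr, fpr
```

For instance, with the hypothetical counts TP = 90, TN = 880, FP = 20 and FN = 10, the accuracy is 970/1000 = 0.97, the TPR is 90/100 = 0.9 and the FPR is 20/900 ≈ 0.022.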

7 Comparative study with the state-of-the-art work

In this section, the authors present a comparative study of the proposed work and the state-of-the-art work, based on the results obtained with the large corpus collected for this study. The comparison in Table 3 lists parameters such as author name, accuracy rate, feature extraction method and classifier used. Table 3 also shows that the proposed system performs better than existing work on writer identification from handwritten characters of Devanagari script.

Table 3 Comparative study of proposed work versus state-of-the-art work

8 Conclusion and future scope

Writer identification from handwritten text is an evolving and very useful approach in the field of biometric recognition. A writer identification system for Devanagari script takes handwriting as input and identifies the author of a given text. This paper has systematically reviewed the state-of-the-art work on writer identification, with emphasis on the Devanagari script, and has explained a framework consisting of digitization and preprocessing, feature extraction and classification. An accuracy of 91.53% was obtained with the proposed framework by combining zoning, diagonal, transition and peak extent-based features with a linear SVM classifier. An advantage of the proposed framework is that it can be extended to other Indian scripts such as Bengali, Gujarati, Tamil and Gurumukhi. Handwritten text, whether online or offline, could also support gender classification, handedness (left or right) prediction, age estimation and verification, and nationality prediction. Challenges in developing the proposed system include touching characters in the dataset, overlapping characters, unconstrained writer databases, poor scanning, the choice of threshold and noisy data. Deep learning, autoencoders, hybridization of classifiers and the use of multiple feature extraction methods are promising directions for future work.