
1 Introduction

The offline handwritten text recognition problem [7], which consists in the automatic transcription of images of handwritten text, has been extensively studied for many years [14, 17, 19]. The significant differences in the writing styles of different individuals and the cursive nature of handwritten characters make this recognition task hard, and it remains an open research problem.

In this work, we focus on the offline problem of recognizing handwritten characters. The main objective is to build a character recognition model general enough to be used as part of a word recognition system without the need to segment the words into their component characters. Thus, we apply the proposed model over a sliding window that runs along the word, providing evidence of the occurrence of each character sequentially, so that the complete word is finally recognized.

The task of building a character-based recognition model that is effective on characters embedded in a word, without segmenting the word, is difficult, since most handwritten character databases contain only isolated characters [5, 9]. For this reason, we built a new corpus of handwritten characters specially adapted to our problem, in which characters may appear joined, as they do inside a word.

Neural networks have been used for many years on the offline handwritten character recognition problem [12], and good recognition results have been achieved when training and evaluating over the same database. Different handwritten character databases have been used for this problem. For example, Yuan et al. [19] used CNNs on the UNIPEN offline database, Ciresan and collaborators [2] built ensembles of CNNs over the NIST database, and Van der Maaten [17] used neural networks with 2 to 5 layers over the TICH database. In all these cases, the authors reported results by training and evaluating over a train-test partition of the same database; they did not report cross-database evaluations, in which the model is built over one corpus and validated on a different one. We are interested in this type of evaluation, since our method includes a character-level model that is built on one corpus and then applied to a different corpus of handwritten words.

We also applied the model presented in this paper to the handwritten word recognition problem. In previous papers, some authors used methods based on identifying characters [1], and other authors [10] used methods based on sliding windows that extract features from the vertical columns of the image. In our future work, we are planning to combine these two approaches by applying a character-level model over a sliding window on the word images in order to recognize the sequence of characters present in those words. At the end of this paper, we sketch a possible approach to perform this task.

In this paper, we build a new character recognition model and validate its generalization capacity using a different corpus. Our objective is to build a new approach to handwritten word identification that avoids the character segmentation problem. To develop this approach, we have created a new offline handwritten character corpus using the UNIPEN database [9] and synthetically adding artifacts to each character image. We aim to approximate the shape a character takes when it is placed inside a word, basically by shifting and truncating it. We have also added artifacts to the left and right of each character. The main reason to create this new dataset is that the character-level model will encounter characters that are parts of words, found with a sliding window, and these characters can therefore appear shifted, truncated at the left or right, or joined to other characters.

In order to build our model, we created a new architecture based on CNNs, due to the impressive results recently produced by these networks in image classification problems in general [11], and in handwritten digit recognition [5] and character classification [20] in particular. Based on these advances, we designed a CNN architecture for the handwritten character image classification model that is based on the VGG network [16], stacking several convolutional layers with small \(3\times 3\) filters followed by final dense layers. We report results for this model at two different input image resolutions, \(64\times 64\) and \(32\times 32\) pixels.

Finally, we also present the first steps towards applying the previous model to obtain a representation of word images as sequences of per-character probabilities, computed over the word image with a sliding window. The initial results are promising, and we are currently working on combining this word representation with decoders such as CTC [6] and with language models in order to develop an end-to-end handwritten word image classifier.

Although the goal of our current research is to provide a general solution to the handwritten word recognition problem, this paper is mainly focused on a preliminary stage which includes: the creation of a new handwritten character database, the proposal of a classification model on this database, and the evaluation of the whole solution on different public handwritten character corpora. Our approach was applied to a dictionary of the 52 English uppercase and lowercase letters. In any case, the methodology presented can be applied to the characters of other languages with minimal changes, starting from a character corpus for the language considered.

2 Handwritten Character Database

In this section, we describe the steps followed to create the new dataset of handwritten characters. It is specially oriented to training a model that works on characters but acts as a component of a more general system for the automatic recognition of handwritten words. Our database was built in two steps:

  • First, we extract isolated characters from the UNIPEN online handwritten database.

  • Next, we perform several transformations on the initial character images to simulate how these characters can appear inside real words.

The UML activity diagram shown in Fig. 1 illustrates the stages involved in producing the new character database. The next two subsections describe these stages in detail.

2.1 Extraction of Characters from UNIPEN Online

By using this online database to extract the “base” characters, we avoid the problem of character segmentation in offline handwritten text and we also obtain perfectly isolated characters. The isolated annotated characters are first identified. Next, these characters are retouched by hand so that the x and y pixel coordinates are connected within each stroke that shapes the character. Strokes of variable thickness are used, depending on the original resolution of the characters, to ensure that all final characters have a similar thickness. Images are generated for each available category (i.e. each type of alphabet character): uppercase letters, lowercase letters, digits and punctuation marks. These generated images are resized to \(64\times 64\) pixels without changing their aspect ratio. Finally, the generated images are curated manually, one by one, to make sure that they are assigned to their correct category and are human-legible. Our database contains 93 categories and a total of 68,382 character images, with an average of 735 samples per category. Table 1 illustrates the types of characters present in our dataset. In the experiments of this work, we only used the uppercase and lowercase letter images, a total of 46,102 samples belonging to 52 classes. In order to facilitate the reproducibility of the results, our database of curated characters can be downloaded from: https://github.com/sueiras/handwritting_characters_database. We also provide there the Python code that implements the transformations described in the next subsection.
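As an illustration of this resizing step, the following minimal sketch (our own code, not the released scripts; it assumes grayscale images with a white background) scales a character image so that its longest side measures 64 pixels and pads the rest with background:

```python
import cv2
import numpy as np

def resize_keep_aspect(char_img, size=64, background=255):
    """Resize a grayscale character image so that its longest side is `size` pixels,
    then pad it with background values to a square `size` x `size` canvas."""
    h, w = char_img.shape[:2]
    scale = size / max(h, w)
    new_w = max(1, int(round(w * scale)))
    new_h = max(1, int(round(h * scale)))
    resized = cv2.resize(char_img, (new_w, new_h), interpolation=cv2.INTER_AREA)
    canvas = np.full((size, size), background, dtype=resized.dtype)
    y0 = (size - new_h) // 2
    x0 = (size - new_w) // 2
    canvas[y0:y0 + new_h, x0:x0 + new_w] = resized
    return canvas
```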

Fig. 1. UML activity diagram of the process of creation and transformation of the character database. Processes are presented as white boxes and outputs as gray boxes.

2.2 Transformations Applied to Curated Original Images

A set of transformations (see Fig. 2) was applied to the original character images with two goals: augmenting the size of the training set in the experiments and obtaining good results when the models are applied to parts of word images (i.e. at word level), where the characters are not isolated.

The applied transformations are the following. First, we resize the characters and translate them up or down. Specifically, we reduce each character by 25% of its original size, and the resulting image is shifted up or down depending on the character. The idea is that some characters can appear only in the upper or central part of a word, others only in the central or lower part, and the rest at any vertical position. The characters in each case are listed below (a sketch of this placement step is given after the list):

  • The letters: ‘a’, ‘c’, ‘e’, ‘i’, ‘m’, ‘n’, ‘o’, ‘r’, ‘s’, ‘u’, ‘v’, ‘w’, ‘x’, ‘z’ are placed up, centered, and down.

  • The letters: ‘g’, ‘j’, ‘p’, ‘q’, ‘y’ are placed centered and down.

  • Finally, the letters: ‘b’, ‘d’, ‘f’, ‘h’, ‘k’, ‘l’, ‘t’, ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’, ‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘V’, ‘W’, ‘X’, ‘Y’, ‘Z’ are placed up and centered.
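A minimal sketch of this placement step is shown here. The position groups follow the lists above, but the function names, the exact offsets and the background value are our own illustrative assumptions, not the released code:

```python
import cv2
import numpy as np

# Position groups taken from the lists above.
UP_CENTER_DOWN = set("acemnorsuvwxz")   # can appear at any vertical position
CENTER_DOWN = set("gjpqy")              # descenders: centered or down

def allowed_positions(char):
    if char in UP_CENTER_DOWN:
        return ["up", "center", "down"]
    if char in CENTER_DOWN:
        return ["center", "down"]
    return ["up", "center"]              # ascenders and uppercase letters

def place_character(char_img, position, size=64, background=255):
    """Shrink a size x size character by 25% and paste it at the given vertical position."""
    small_size = int(round(size * 0.75))
    small = cv2.resize(char_img, (small_size, small_size), interpolation=cv2.INTER_AREA)
    canvas = np.full((size, size), background, dtype=np.uint8)
    y_offsets = {"up": 0, "center": (size - small_size) // 2, "down": size - small_size}
    y0 = y_offsets[position]
    x0 = (size - small_size) // 2
    canvas[y0:y0 + small_size, x0:x0 + small_size] = small
    return canvas
```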

Second, we shift the character image to the left or right and include, at the borders of the image, the beginning or the end of another character. In this way, the training examples become similar to those that the model will encounter when applied to a piece of a word image. To achieve this, strips of 3 pixels of thickness are extracted from the borders of other character samples and added at the left and right ends of the existing character in order to generate a new sample. In this process, it is guaranteed that the character and the added border strips do not overlap. A sketch of this step is shown below.
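The sketch below illustrates this second transformation; the function name, the strip handling and the background value are again our own assumptions:

```python
import numpy as np

def add_neighbor_artifact(char_img, other_img, side="right", strip=3, background=255):
    """Paste a thin vertical strip taken from the edge of another character sample at
    the left or right edge of the current sample, avoiding overlaps with its strokes."""
    out = char_img.copy()
    h, w = out.shape
    if side == "right":
        piece = other_img[:, :strip]      # beginning of the next character
        target = out[:, w - strip:]
    else:
        piece = other_img[:, -strip:]     # end of the previous character
        target = out[:, :strip]
    mask = target == background           # only write over background pixels
    target[mask] = piece[mask]            # `target` is a view, so `out` is modified
    return out
```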

Table 1. Characters contained in the database.
Fig. 2. Example of the chain of transformations applied to build a training sample.

3 Classification Model and Results

The initial model that we have built to classify the character images has 52 letter categories (i.e. 26 uppercase and 26 lowercase). These categories are especially difficult because several letters, such as ‘c’, ‘f’, ‘i’, ‘j’, ‘k’, ‘o’, ‘p’, ‘u’, ‘v’, ‘w’, ‘x’ or ‘z’, have a similar shape in uppercase and lowercase. Our recognition system uses a deep neural network with convolutional and subsampling layers. Our architecture was inspired by the VGG model presented in [16], which proposes the use of stacked small \(3\times 3\) receptive fields in the convolutional layers.

The original VGG architecture has between 16 and 19 layers and is oriented to the more complex task of the ImageNet challenge [4], classifying a color image into 1,000 general categories. Our problem is more limited and does not need such a complex architecture. The proposed model has 10 trainable layers, grouped into 3 stacks of convolutional and subsampling layers followed by 3 final dense layers. The detailed architecture can be seen in Fig. 3.

Fig. 3. Architecture of the model with the sizes of the layers and the sizes of the convolutional masks (in red). (Color figure online)

In all the convolutional layers we use zero-padding to preserve the spatial size and a stride of one, and all hidden layers use rectified linear units (ReLU) [11]. Finally, we apply dropout regularization [15] with a ratio of 0.5 to the first two fully-connected layers. We trained the model for 5 epochs with a learning rate of 0.01 and for another 5 epochs with a learning rate of 0.001, using a momentum of 0.9 and the Nesterov technique [13]. The model obtained an accuracy of 87.5% on the test data. A sketch of a possible implementation is shown below.
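The sketch below shows a possible implementation of this architecture and training schedule. The number of convolutions per stack and the filter and unit counts are illustrative assumptions (the exact values appear in Fig. 3); the padding, activations, dropout and optimizer settings follow the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_character_cnn(input_shape=(64, 64, 1), n_classes=52):
    """VGG-style network: 3 stacks of 3x3 convolutions with subsampling, then 3 dense layers
    (7 convolutional + 3 dense = 10 trainable layers). Filter/unit counts are assumptions."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for n_convs, filters in [(2, 32), (2, 64), (3, 128)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))                    # subsampling layer
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model

# Training schedule as described: 5 epochs at lr=0.01, then 5 epochs at lr=0.001,
# with Nesterov momentum 0.9.
model = build_character_cnn()
for lr in (0.01, 0.001):
    opt = keras.optimizers.SGD(learning_rate=lr, momentum=0.9, nesterov=True)
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```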

To determine the importance of the resolution of the input images, we built the previous model with the same architecture and the same training parameters, changing only the resolution of the input images to \(32\times 32\) pixels. A recognition accuracy of 86.4% on the test images was achieved at this resolution.

To evaluate the generalization capacity of the presented model over other datasets and compare our results with previously published ones [19], we trained the model separately for uppercase and lowercase letters, since many public character benchmarks report separate error rates for uppercase and lowercase [3]. Using the UNIPEN database, Yuan et al. [19] applied the LeNet-5 CNN to offline handwritten English character recognition and reported recognition rates of 93.7% for uppercase and 90.2% for lowercase. In our experiments, on a held-out test sample of 20% of the original images of our corpus extracted from UNIPEN online, we achieved recognition rates of 98.4% for uppercase and 96.3% for lowercase.

Additional experiments were carried out over the TICH database [17]. Van der Maaten reported a best recognition result of 82.77% ± 0.82% using a 10-fold cross-validation procedure and a k-NN classifier. Our uppercase model (note that the TICH dataset contains only uppercase letters), trained with our database and tested on TICH, produced a recognition rate of 92.5% on the complete dataset.

There exist several benchmarks that use the NIST database (2nd edition) [8]. One of the most recent and accurate ones that uses deep networks is [3], which reported recognition rates of 98.17% for uppercase and 92.53% for lowercase. Our model, trained with our dataset and validated on this NIST benchmark, obtains recognition rates of 92.9% for uppercase (very close to the results obtained with the TICH database) and 71.5% for lowercase. In this case, the large difference between uppercase and lowercase results can be explained by the fact that our database contains grayscale images, while the NIST database uses binary images. By training the same architecture on the NIST dataset (i.e. using it as both train and test sets), we obtained recognition rates on NIST of 96.9% for uppercase and 90.4% for lowercase. All these results are summarized in Table 2.

Table 2. Summary of character recognition results for different databases.

4 First Steps for Word Image Recognition

In order to apply the character-level model to recognize word images, we define an algorithm that receives as input the image of a word. This input has a height of 64 pixels and a variable width of n pixels. In turn, the algorithm returns as output an array of size \(52\times n\). Each column of this array contains the likelihood of each character in the dictionary (26 uppercase and 26 lowercase letters) appearing at the corresponding column of the original image (see Fig. 4). Basically, the algorithm applies the character model over the word image using several sliding windows and aggregates the model outputs into the final result. It has the following stages:

  1. First, we attach to the left and right of the word image two 8-pixel-wide bands of background, in order to better include the start and end of the image in the sliding process. This helps to improve the detection of the initial and final characters of the word.

  2. Next, we define several sliding windows that move over the word image from left to right. These windows have different widths (w); in particular, we use widths of 32, 40, 48, 56 and 64 pixels. The goal is to capture characters of approximately these widths.

    • For window widths smaller than 64, we need to complete the window to a resolution of \(64\times 64\), since this is the input shape expected by the character classification model. To do so, we pad the window image with background bands on its left and right.

  3. We apply the character-level model to the previous windows by sliding them over the word image. In this way, we obtain, for each window width, a sequence of scores for the characters that the model detects with that window.

    • Each column of the original image appears in several windows of different widths and therefore receives several scores from the character model. In order to obtain a final character prediction for that column, we aggregate all the model scores that include it. In this aggregation, we apply a weighted average, using a normal distribution over each window that includes the column, so as to give more weight to the central columns of each window.

Fig. 4. Result of applying the proposed character recognition system over the image of a word. For simplicity, only three windows are averaged in the figure. Observe the different widths of the windows.

The previous algorithm can be described by Eq. 1, where n is the width of the input image, o(i) corresponds to the CNN output when the input window of the network is at position i, and \(N(x; i, w/2)\) is the value of a normal density at x with mean i and standard deviation w/2.

$$\begin{aligned} o'(x)=\frac{1}{|W|} \sum _{w\in W}\left( \sum _{i=0}^{n}o(i)\cdot N(x;i,w/2)\right) , \qquad W=\{32,40,48,56,64\} \end{aligned}$$
(1)
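The sketch below illustrates Eq. 1 in code. Here `model_predict` stands for the trained character CNN (a 52-way softmax over a \(64\times 64\) input); the sliding stride, the centering of each window at position i and the padding helper are our own assumptions, and the final normalization uses the accumulated Gaussian weights (a weighted average, as described in step 3 above) instead of the constant factor 1/|W|:

```python
import numpy as np
from scipy.stats import norm

WIDTHS = (32, 40, 48, 56, 64)

def pad_to_64(window, background=255):
    """Pad a 64 x w window with background bands on the left and right to make it 64 x 64."""
    h, w = window.shape
    left = (64 - w) // 2
    right = 64 - w - left
    return np.pad(window, ((0, 0), (left, right)), constant_values=background)

def word_to_char_scores(word_img, model_predict, step=2, background=255):
    """Return a (52, n) array of aggregated per-column character scores for a 64 x n word image."""
    word_img = np.pad(word_img, ((0, 0), (8, 8)), constant_values=background)  # stage 1: side bands
    n = word_img.shape[1]
    scores = np.zeros((52, n))
    weights = np.zeros(n)
    for w in WIDTHS:                                    # stage 2: several window widths
        for start in range(0, n - w + 1, step):
            window = pad_to_64(word_img[:, start:start + w], background)
            o_i = model_predict(window)                 # stage 3: 52 character scores
            center = start + w // 2
            g = norm.pdf(np.arange(n), loc=center, scale=w / 2)  # N(x; i, w/2)
            scores += np.outer(o_i, g)
            weights += g
    scores = scores / np.maximum(weights, 1e-12)        # weighted average over all windows
    return scores[:, 8:-8]                              # drop the added bands
```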

5 Conclusion

This paper described a system designed to obtain a character-based representation of any handwritten word image without segmenting it. It included a new handwritten character image database and a new classification model for handwritten characters. We have shown how this representation can be used to decode word images. Finally, the proposed system does not require any additional training process to be used with a specific English corpus.

As future work, we plan to improve the decoding process of the word images to build a complete and accurate offline handwritten word recognition system. It will be directly applicable to any corpus with English characters. Currently, we are working on the two following research lines:

  • Applying a decoding model with a connectionist temporal classification (CTC) output layer [6] to improve the word decoding process. This model can decode into the correct word the representation obtained by applying the character model with the sliding window.

  • Using the context of each word in the corpus to improve word recognition. To achieve this, a language model that takes the predictions for the previous and the next words into account will be included to improve the current word recognition. At this point, we are working with Recurrent Neural Network (RNN) models using Long Short-Term Memory (LSTM) units [18].