
1 Introduction

Handwriting style is a personal behavioral characteristic that can differentiate between individuals and can consequently be used as a biometric modality. Handwritten text recognition (HTR) is still an open research problem today due to the multiple difficulties that must be handled automatically [18, 23]. For example, the intrapersonal and interpersonal variabilities in handwriting make the task quite difficult: not everyone writes with the same style, and the same person may not write in the same way at every time and in every situation. It is also very common to capture some noise in the image, which can produce classification errors when performing this task automatically.

The handwriting recognition problem is addressed with two approaches, depending on the data capture process: on-line and off-line recognition [18]. In on-line recognition, the available data are time series of coordinates representing the movement of the pen, whereas in the off-line case the data consist only of an image of the text. Because the available data are more detailed and noise-free, on-line recognition generally obtains better results [19]. For the off-line case, there are also two main approaches: holistic and segmentation-based ones [15, 22]. While in the holistic (or global) approach the classification is carried out directly on the whole word image, in the segmentation-based case the isolated characters are extracted before performing their classification.

Typically, an HTR system has six main steps: pre-processing, segmentation, feature extraction, training, recognition and post-processing [17]. In the segmentation-based case, prior to character extraction, it is frequent practice to normalize the text images to reduce their variability and thus achieve better recognition results. Although there are several techniques that handle isolated characters, and some newer techniques that recognize large vocabularies with the help of dictionaries or grammars [12], there are still few works that consider the problem of recognizing long character sequences. In this paper, we present a new method for this handwritten text recognition problem, which is focused on the off-line approach and does not require any normalization or segmentation stage. Our approach was tested using variable-length digit sequence images.

Despite the many improvements in the HTR problem, especially in the on-line case, there are still several open problems to be solved [12, 17]. The possibly high variability in the text words and the presence of image noise must also be taken into account. In particular, length variability in words may complicate the character segmentation task, and the presence of noise can cause the system to wrongly isolate some characters.

One additional important problem we have to deal with is the variable length of words (or digits in numbers). There exist systems that produce very good recognition results on isolated characters (or digits), but whose performance drops when dealing with long words or numbers [12]. For example, different methods have been applied to classify isolated characters [7, 17] from the MNIST database [14]. One of the latest of these methods, proposed by Ciresan et al. [5], achieved 99.77% accuracy. These methods used Hidden Markov Models (HMM), Neural Networks (NN), hybrid NN/HMM and Convolutional Neural Networks (CNN) as classifiers [2, 16]. More recently, other proposals [9] aimed to recognize complete text words using Recurrent Neural Networks (RNN) [3, 7, 9] such as the Long Short-Term Memory (LSTM) [10, 11], complemented with some type of language model.

Normalizing the images is one of the most common ways to reduce the variability in handwritten text. As shown by Arica et al. [1], skew normalization and baseline extraction, slant normalization, size normalization and contour smoothing are some of the most common techniques for reducing text variability and simplifying the practical problem. In our approach, for the sake of producing a more realistic solution, we avoid this image normalization by enlarging the training database through data augmentation. Moreover, we generate the inputs ‘on demand’ from a handwritten numbers database, as will be explained in the next section.

In the normalization-based methods referred to above, a segmentation algorithm is applied after that step in order to obtain isolated characters. One of the most used approaches consists of splitting the words into connected components, which are then classified as characters [15]. As our method requires neither normalization nor segmentation, the text and noise variability conditions are higher, which makes the recognition task more difficult, as happens in many practical situations.

In this paper, we describe a new deep learning approach for handwritten number recognition. Our method has been successfully tested on large numbers (i.e. long chains of digits). The proposed system learns on an ‘on demand’ basis, which generates a potentially unlimited number of samples, instead of using a standard fixed-size database.

The paper is organized as follows. Section 2 describes the construction of our experimental number database from the MNIST dataset. Section 3 outlines the system architecture for the considered problem. Experimental results are presented and explained in Sect. 4. Finally, the last section summarizes the conclusions of this work.

2 Handwritten Digit Database

The samples used to train and test our system are built from the well-known MNIST database [6]. This database consists of 70,000 digit images, of which 60,000 are used for training and the remaining 10,000 for testing. All of these samples are grayscale images, centered and size-normalized (with \(28\times 28\) pixels). We use these images to create the new samples as follows:

  1. Randomly combining the MNIST digits to create variable-length sequences of digits, which constitute the data of our sample datasets.

  2. Moving and transforming the generated sequences in order to add variability.

Figure 1 shows the process of sample generation.

Fig. 1.

Example of a generated number sequence. The first row shows the original MNIST digits, the second row shows a generated digit sequence, and the remaining rows show some of the transformations that can be applied to the sequence image

2.1 Generation of Number Sequences from the MNIST Database

To develop our handwritten number recognition system, we have created a generator of numbers from the MNIST database images. In this generator it is possible to choose the length of the numbers (with a maximum of 10 digits), the number of digit classes present in the sequence, and the transformations applied to the images. The generation proceeds as follows (a sketch in code is given after the list):

  1. First, the number of digits of the sequence is randomly chosen and an empty image of size \(28\times 280\) is created.

  2. Second, the allowed digit classes are also randomly chosen.

  3. Third, the following steps are performed as many times as there are digits in the new sequence:

    (a) A digit is randomly chosen from among the allowed classes.

    (b) The new digit is added to the \(28\times 280\) image, placed as close as possible to the previous digits of the sequence.

  4. As the digits are joined without spaces, the width of the created number is less than 280 pixels, so it is possible to shift the image within the free pixels. Thus, if this option is selected, the digits are randomly translated.
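As an illustration, the following Python sketch implements the four steps above; the function and parameter names are ours, and the `images`/`labels` arrays are assumed to hold the MNIST training images and labels, loaded elsewhere:

```python
import numpy as np

def generate_sequence(images, labels, max_digits=10, max_classes=10, rng=None):
    """Generate one variable-length number image (28x280) and its digit labels.

    `images` is (N, 28, 28) and `labels` is (N,); names are illustrative.
    """
    rng = rng or np.random.default_rng()
    n_digits = int(rng.integers(1, max_digits + 1))    # step 1: sequence length
    canvas = np.zeros((28, 280), dtype=np.uint8)       # step 1: empty image
    n_classes = int(rng.integers(1, max_classes + 1))  # step 2: allowed classes
    allowed = rng.choice(10, size=n_classes, replace=False)
    x, seq = 0, []
    for _ in range(n_digits):                          # step 3
        cls = int(rng.choice(allowed))                 # (a) pick an allowed class
        idx = int(rng.choice(np.flatnonzero(labels == cls)))
        digit = images[idx]
        cols = np.flatnonzero(digit.sum(axis=0))       # trim empty margins so the
        digit = digit[:, cols[0]:cols[-1] + 1]         # digits sit close together
        w = digit.shape[1]
        canvas[:, x:x + w] = np.maximum(canvas[:, x:x + w], digit)  # (b) append
        x += w
        seq.append(cls)
    shift = int(rng.integers(0, 280 - x + 1))          # step 4: random translation
    canvas = np.roll(canvas, shift, axis=1)
    return canvas, seq
```

The `max_classes` argument restricts how many digit classes may appear in one sequence, which is also what the phased training of Sect. 3 relies on.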

2.2 Transformations Applied to Images

To add variability and produce a potentially unlimited number of samples, some transformations are applied to the created image. The generator can choose whether each of the following transformations is applied:

  1. Affine transformations to the right or to the left in order to rotate the images.

  2. Scaling of the digits in width or height.

  3. Morphological erosions.

  4. Morphological dilations.

  5. Translations of the digit positions.

After these transformations, a random background (with pixel gray values between 240 and 255) is added to hide the noise produced when a transformation is applied. Note that it is possible to apply more than one transformation to the same generated sequence. Regarding the translation transformation, the sequence is displaced to the right either randomly or by a fixed amount. Moreover, as we explain in the next section, our system has two different models: a convolutional neural network used for feature extraction, and a recurrent neural network that predicts the digit ordering of the sequence. For the convolutional model the random translation was applied, while for the recurrent model the images were shifted by 8 pixels.
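A possible implementation of these transformations using scipy.ndimage is sketched below; the application probabilities, the parameter ranges and the final inversion to a light background are our assumptions, not the paper's values:

```python
import numpy as np
from scipy import ndimage

def transform(img, rng=None):
    """Randomly transform a 28x280 sequence image (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    out = img.astype(float)
    if rng.random() < 0.5:                          # 1. affine rotation left/right
        out = ndimage.rotate(out, rng.uniform(-8, 8), reshape=False)
    if rng.random() < 0.5:                          # 2. scale width or height
        z = rng.uniform(1.0, 1.15)
        zoom = (z, 1.0) if rng.random() < 0.5 else (1.0, z)
        out = ndimage.zoom(out, zoom)[:28, :280]    # crop back to 28x280
    if rng.random() < 0.3:                          # 3. morphological erosion
        out = ndimage.grey_erosion(out, size=(2, 2))
    if rng.random() < 0.3:                          # 4. morphological dilation
        out = ndimage.grey_dilation(out, size=(2, 2))
    if rng.random() < 0.5:                          # 5. translate the digits
        out = np.roll(out, int(rng.integers(0, 9)), axis=1)
    inv = np.clip(255.0 - out, 0, 255)              # dark digits, light ground
    bg = rng.integers(240, 256, size=inv.shape)     # random 240-255 background
    return np.where(inv >= 240, bg, inv).astype(np.uint8)
```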

3 Classification

The proposed number recognition system (see Fig. 2) includes two main steps: (1) a Convolutional Neural Network (CNN) model to obtain the digit features of the image, and (2) a Long Short-Term Memory (LSTM) network to recognize the digit sequence. Convolutions for feature extraction have been used before, for example by Bluche [2], who applied them to predict the states of a Hidden Markov Model. In our system, we obtain the digit features by predicting the digits present in the sequence with a CNN model based on the VGG architecture [21]. To that end, the target of that model is an array of size 10 whose components are 1 if the corresponding digit is in the image and 0 otherwise. Note that the CNN model only detects the presence of the digit classes in the number sequence, not their positions.

Fig. 2.

General model architecture. Blue boxes correspond to the convolution masks. Green boxes correspond to the sliding window. Red boxes correspond to the tensors which are the inputs of the LSTM model. (Color figure online)

As the positions of the digits are important to obtain the number, an LSTM model placed at the output of the CNN model is applied to perform local predictions about the positions of the existing digits in the number. Moreover, to distinguish two or more consecutive identical digits, we split each class into two classes: instead of having class 0, we have class ‘beginning of 0’ (b0) and class ‘end of 0’ (e0), and so on for the rest of the classes. With this type of target, when we present the sequence 22 to the LSTM model, we get the output ‘b2e2b2e2’. We also use an extra ‘blank’ class, similar to the one used by Graves et al. [8]. Figure 2 shows the general model architecture of our system.
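The encoding itself is straightforward; a minimal sketch (the label names are illustrative):

```python
def begin_end_labels(digits):
    """Encode a digit sequence as 'beginning'/'end' labels.

    E.g. [2, 2] -> ['b2', 'e2', 'b2', 'e2'], so two consecutive identical
    digits remain distinguishable. A 'blank' label is used elsewhere for
    positions that belong to no digit.
    """
    out = []
    for d in digits:
        out += ['b%d' % d, 'e%d' % d]
    return out

print(begin_end_labels([2, 2]))  # ['b2', 'e2', 'b2', 'e2']
```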

The CNN model is a simpler version of the VGG architecture and aims to predict just whether each digit is present or not in the number. This model was trained in an incremental process where the number of classes was increased gradually. The training process of the CNN model had 10 phases (a sketch of this schedule is given after the list):

  1. In the first phase, the inputs had a variable length between 1 and 10 digits. Moreover, each input only contained random patterns from one class. This first phase ended when the training error was lower than 1%.

  2. In the second phase, the process was the same, but the random patterns were chosen from 1 or 2 different classes.

  3. We repeated this incremental process until the last phase, where all the classes could be present in the numbers.
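The incremental schedule can be sketched as follows, reusing the generate_sequence helper from Sect. 2.1 and assuming a compiled Keras model with a 10-unit sigmoid output and an accuracy metric; the batch size and the inner stopping test are our assumptions:

```python
import numpy as np

def presence_target(digit_seq):
    """10-dim multi-label target: 1 if the class appears in the sequence."""
    y = np.zeros(10, dtype=np.float32)
    y[list(set(digit_seq))] = 1.0
    return y

def train_in_phases(model, images, labels, batch_size=128):
    rng = np.random.default_rng()
    for phase in range(1, 11):              # phase p allows up to p digit classes
        error = 1.0
        while error > 0.01:                 # train until training error < 1%
            X, Y = [], []
            for _ in range(batch_size):
                img, seq = generate_sequence(images, labels,
                                             max_classes=phase, rng=rng)
                X.append(img[..., None].astype(np.float32) / 255.0)
                Y.append(presence_target(seq))
            loss, acc = model.train_on_batch(np.stack(X), np.stack(Y))
            error = 1.0 - acc               # assumes an 'accuracy' metric
```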

Table 1 shows examples of the CNN model inputs. As shown in Fig. 3, the final CNN model is composed of three groups of two convolution layers followed by a subsampling (in particular, max-pooling) layer, and two dense layers. In the first dense layer we used a dropout [20] of 60%. All the convolution layers have \(3\times 3\) kernels (two consecutive convolution layers are equivalent to an effective receptive field of \(5\times 5\)) and zero-padding. Also, all the neurons have ReLU [13] as activation function. A Keras-style sketch of this architecture is given after Fig. 3.

Fig. 3.

CNN model architecture.
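In the sketch below, the filter counts of the first two blocks and the width of the first dense layer are assumptions; the last block uses 64 filters so that its output is \(64\times 3\times 35\), matching the LSTM inputs described next:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn():
    """VGG-style CNN sketch for the digit-presence prediction task."""
    inp = keras.Input(shape=(28, 280, 1))
    x = inp
    for filters in (16, 32, 64):                     # three conv blocks
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)                # 28x280 -> 14x140 -> 7x70 -> 3x35
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation='relu')(x)      # first dense layer
    x = layers.Dropout(0.6)(x)                       # 60% dropout [20]
    out = layers.Dense(10, activation='sigmoid')(x)  # digit-presence targets
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model
```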

On the other hand, the inputs of the LSTM model come from a sliding window over the original image merged with the output of the last convolutional layer of the CNN model. We chose the last convolutional layer, instead of the last dense layer, because we are interested in the local information obtained from the convolutions. To build the LSTM inputs, we pass each image through the network and reshape the resulting \(64\times 3\times 35\) tensor into a \(192\times 35\) matrix. In the next step, we merge the convolutions with the original image, joining each column of the image (28 values) with its corresponding column of the convolution output (192 values), obtaining 220 values per column. Finally, each input sample for the LSTM training process is composed of a sliding window of 16 columns, and thus of \(220\times 16\) values.
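Our reading of this construction can be sketched as follows; the upsampling of the 35 convolution columns to the 280 image columns by repetition, and the zero-padding at the borders, are assumptions:

```python
import numpy as np

def lstm_inputs(image, conv_out, window=16):
    """Build LSTM input windows from a 28x280 image and the 64x3x35 tensor
    produced by the last convolutional layer (a sketch)."""
    feat = conv_out.reshape(192, 35)            # 64*3 = 192 features per column
    feat = np.repeat(feat, 8, axis=1)           # 35 conv columns -> 280 columns
    merged = np.concatenate([image, feat], axis=0)   # 28 + 192 = 220 per column
    # one 220x16 window per output position, zero-padded at the borders
    pad = np.pad(merged, ((0, 0), (window // 2, window // 2 - 1)))
    return np.stack([pad[:, t:t + window] for t in range(280)])
```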

The LSTM model that we built has two LSTM layers and a dense layer, as shown at the bottom of Fig. 2. For each input tensor, the output is a new tensor of size \(1\times 21\). As the original image has a width of 280, we concatenate the 280 outputs to get the output sequence of ‘beginnings’ and ‘ends’. The final step is to convert the ‘beginning-end’ label sequence into the final digit sequence. Note that the final softmax output gives us different possibilities for each class, and we need only one. Therefore, we select the combination that maximizes the sum of the selected class scores, taking into account two rules (a decoding sketch is given after the list):

  1. If there is a ‘beginning’ of a digit, there must be an ‘end’ of the same digit.

  2. The ‘blank’ can appear only after an ‘end’.
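One way to implement this constrained maximization is a small dynamic program over eleven states: ‘outside any digit’ and ‘inside digit k’ (beginning seen, end pending). The class index layout and the exact transition set are our assumptions, since the paper does not spell out the algorithm:

```python
import numpy as np

def decode(scores):
    """scores: (280, 21) per-column class scores; assumed layout:
    indices 0-9 = b0..b9, 10-19 = e0..e9, 20 = blank.
    Returns the digit list obeying the two rules above."""
    T, NEG = scores.shape[0], -1e9
    best = np.full(11, NEG)                 # state 0 = outside; 1+k = inside digit k
    best[0] = 0.0
    back = np.zeros((T, 11, 2), dtype=int)  # (previous state, emitted class)
    for t in range(T):
        new = np.full(11, NEG)
        # outside -> outside: emit blank (rule 2: blank only between digits)
        new[0] = best[0] + scores[t, 20]
        back[t, 0] = (0, 20)
        for k in range(10):
            # outside -> inside k: emit 'beginning of k'
            cand = best[0] + scores[t, k]
            if cand > new[1 + k]:
                new[1 + k] = cand; back[t, 1 + k] = (0, k)
            # inside k -> inside k: keep emitting 'beginning of k'
            cand = best[1 + k] + scores[t, k]
            if cand > new[1 + k]:
                new[1 + k] = cand; back[t, 1 + k] = (1 + k, k)
            # inside k -> outside: emit 'end of k' (rule 1: matching end)
            cand = best[1 + k] + scores[t, 10 + k]
            if cand > new[0]:
                new[0] = cand; back[t, 0] = (1 + k, 10 + k)
        best = new
    # backtrack from the 'outside' state, collecting one digit per emitted end
    digits, state = [], 0
    for t in range(T - 1, -1, -1):
        prev, emitted = back[t, state]
        if 10 <= emitted < 20:
            digits.append(emitted - 10)
        state = prev
    return digits[::-1]
```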

Table 1. Examples of the phases. In every phase the generated numbers have between 1 and 10 digits. In Phase 1 there is only 1 class, in Phase 2 there are two classes, and so on until Phase 10, where all the classes can be present in the same sequence.

4 Experimental Results

Experiments were carried out in two stages. In the first one, the CNN model was trained in several phases, as explained in the previous section. We used a learning rate of 0.01 and the Stochastic Gradient Descent (SGD) [4] algorithm as optimizer, and 60 epochs were needed to train the whole model. To measure the classification error, we built a test dataset of 1000 number sequence images, created with the same algorithm as the training data, except that in the test case no transformation was applied to the images. Figure 4(a) shows an example of an original test image and the scores obtained with the CNN+LSTM model. The example shows how the scores for the ‘beginning’ and the ‘end’ of each digit increase at the corresponding positions.

The CNN model produced a 99% Character Accuracy Rate (CAR) and a 90% Sequence Accuracy Rate (SAR), which means that in 90% of the cases all the digits were correctly found. On the other hand, the global CNN+LSTM model achieved a 98.4% CAR and an 88.6% SAR. To measure how much the CNN model improves the final results, we trained an LSTM model with the same architecture but with only the original image as input data. The results were slightly worse than those of the presented system, as we obtained an 86.2% SAR. These results are shown in Table 2.
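For reference, the two metrics can be computed as sketched below; the definition of CAR as one minus the normalized edit distance is our reading, since the paper does not give formulas:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two digit sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def car_sar(predictions, references):
    """CAR = 1 - normalized edit distance; SAR = fraction of exact matches."""
    errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    sar = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return 1.0 - errors / chars, sar
```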

Figure 4(b) shows the comparison between a perfect model, which would correctly predict all the numbers, and our system. This graphic is built by sorting the tested numbers by their confidence and counting how many numbers are successfully predicted. It can be seen that if we predicted just the 20% of the tested numbers with the highest confidence, all their digits would be correctly classified. As the numbers with lower confidence are included, the accuracy decreases down to 88.6% (i.e. the accuracy that we achieved when we predicted all the numbers of the test sample).

Table 2. Summary of recognition results of our system for isolated digits (CAR column) and variable-length numbers (SAR column)
Fig. 4.

(a) Example of a test image and its scores with the CNN+LSTM model. (b) Comparison between the cumulative response of a perfect model and that of our system.

Fig. 5.

Recognition results: (a) frequency of tested numbers by their edit distance and (b) frequency of correctly classified test numbers by their length

Finally, Fig. 5(a) shows the distribution of the edit distance over the test numbers. It shows that 88.6% of the tested numbers have an edit distance of 0 (which means they were correctly predicted), 11% have an edit distance of 1 (just one digit was wrongly predicted), and 0.4% have an edit distance of 2 or more. Figure 5(b) shows that the distribution of the successfully recognized numbers is uniform over the number length. Therefore, the number of digits in the sequences was not relevant to our classification results.

5 Conclusion

This paper describes a new automatic system for recognizing numbers of variable length. The system is based on deep learning neural models and was trained using an ‘on demand’ scheme to produce numbers from digit images of the MNIST dataset. We avoid the normalization step of many handwriting recognition systems. Moreover, our solution does not need to segment the digits and classifies the whole sequence as a number. Our model architecture is divided into two steps. In the first step, a Convolutional Neural Network (CNN) model was trained to extract the digit features. This model was trained in several phases in which the number of digits and classes, as well as the architecture of the model, were progressively increased. In the second step, the convolution outputs were combined with the original image and fed to an LSTM model in order to predict the beginning and the end of each digit. Finally, the obtained digit sequence was transformed into the recognized number. Using this system, 88.6% of the test numbers were correctly predicted. Moreover, we noticed that the lengths of the numbers do not influence the classification and that, when a number was misclassified, in most cases only one digit of it was wrongly predicted.

In the future, we would like to improve these models for longer numbers, as well as to train them on arbitrary types of character sequences.