
1 Introduction

Unconstrained offline handwritten text recognition has been studied for decades. Until recently, all the proposed approaches focused on recognizing the text from cropped parts (text regions) of the original document, leading to a sequential multi-step approach, namely text region segmentation, ordering and recognition. Numerous advances made it possible to extend the recognition stage to handle increasingly complex inputs. In the 1990s, the use of Hidden Markov Models (HMMs) made it possible to go from isolated character recognition [19] to multi-character (word or line) recognition [12, 14]. Thereafter, the democratization of deep neural networks, combined with the Connectionist Temporal Classification (CTC) loss [15], made the line-level approach the standard framework for handwritten document recognition [7, 11, 16, 22, 23, 30, 33].

Fig. 1.

Reading order comparison between the DAN (top) and the Faster DAN (bottom). Circles and crosses represent the start and the end of a pass, respectively. The first pass is shown in red, and the second one in blue. The DAN (top) sequentially predicts the characters of the whole document in a single pass. The Faster DAN first predicts the first character of each line (as well as the layout tokens), and then predicts the remainder of all the text lines in parallel, in a second pass. (Color figure online)

More recently, a few works have focused on text recognition at paragraph level [1, 2, 8, 32], reaching performance similar to that of line-level recognition [10]. However, whether it is at character, word, line or paragraph level, this three-step paradigm has many drawbacks: errors accumulate from one step to the next, additional physical segmentation annotations are required to train the segmentation step, the rule-based ordering stage is limited for documents with a complex layout, and the stages are performed independently, so they cannot benefit from one another.

Based on these observations, we proposed in [9] a new end-to-end paradigm named Handwritten Document Recognition (HDR). It aims at serializing documents in an XML-like way, combining both Handwritten Text Recognition (HTR) and Document Layout Analysis (DLA) through layout XML markups. We proposed the Document Attention Network (DAN) [9] to tackle HDR.

It is made up of a Fully Convolutional Network (FCN) encoder to extract features from the input image, and a transformer [29] decoder to recurrently predict the character and layout tokens. The DAN reached results competitive with state-of-the-art line-level or paragraph-level approaches, while recognizing both text and layout at page or double-page level. Its main drawback is its autoregressive character-level prediction process, which leads to long prediction times (a few seconds per document image).

In this paper, we propose Faster DAN, a novel approach to significantly reduce the prediction time of end-to-end HDR without impacting the training time. This approach is based on a new document positional encoding whose aim is to inject the line membership information into each predicted character. In this way, we can parallelize the recognition of the text lines while still using a single model, reducing the total number of iterations. The Faster DAN relies on a two-step prediction process: a first step is dedicated to the prediction of the layout tokens, as well as the first character of each text line; all the text lines are then recognized in parallel in the second step through multi-target transformer queries. This is illustrated in Fig. 1.

We show that the Faster DAN reaches competitive results compared to the original DAN, while being at least 4 times faster on three public datasets: READ 2016, RIMES 2009 and MAURDOR.

This paper is organized as follows. Section 2 is dedicated to the related works. DAN background is presented in Sect. 3. We detail the proposed approach in Sect. 4. Section 5 presents the experimental environment and the results. We draw the conclusion in Sect. 6.

2 Related Works

Nowadays, the most popular HTR framework is made up of three stages: the input document image is segmented into text line crops, which are then ordered and recognized. Indeed, the concept of text line is widely used as a building block in many works, and has been studied from different angles.

The text line has mostly been studied from the physical point of view: the majority of works focused on predicting text line bounding boxes, either through a pixel-by-pixel classification task [3, 21, 24] or through an object-detection approach [5, 6]. Detecting the start-of-line information was also studied as part of the segmentation stage. In [20], a model is trained to predict the coordinates of the bottom-left corner of each text line, as well as its height. Similarly, in [28, 31], the authors considered the prediction of the start-of-line coordinates as an object detection task, using a region proposal network. Scale and rotation values are also associated with each text line to handle slanted lines. Contrary to these works, the Faster DAN strategy we propose only relies on language supervision: we do not need any additional physical annotations.

Recent works proposed to perform the recognition step at paragraph level [1, 2, 8, 33]. Although not relying on raw physical text line annotations, most paragraph-level text recognition works take advantage of the physical properties of text lines in single-column layouts: the whole horizontal axis is associated with a text line, no matter its length. The authors of [32] and [8] concatenate the representations of the different text lines, or the text line predictions, respectively, to get back to a one-dimensional alignment problem. In [1, 10], the text lines are processed recurrently through a line-level attention mechanism.

Another approach to deal with multi-line images consists in relying on an autoregressive character-level prediction process [2, 9, 25, 27]. This time, the notion of text line is limited to the use of a dedicated line break token, used as any other character token. This way, the approach is no longer limited to single-column documents. This strategy is also used in [18] for visual question answering, information extraction or classification of documents, the OCR task being reduced to pretraining. In [9], we proposed the Document Attention Network to tackle Handwritten Document Recognition, by predicting opening and closing layout markup tokens in an XML way: all the character and layout tokens are sequentially and indifferently predicted, leading to hundreds or even thousands of iterations for single-page or double-page document images. This results in long prediction times: approximately one second per 100 characters on a single V100 GPU.

In this paper, we propose to speed up the prediction of this latter approach by reading text lines in parallel. This way, we take the best of both worlds: we can deal with documents with complex layout through this character-level attention, and we use the concept of text line more directly through the prediction of the first character of each line and by using a dedicated document positional encoding scheme, but without using any additional physical annotations.

3 DAN Background

We proposed the Document Attention Network (DAN) in [9] for the task of Handwritten Document Recognition. It takes as input an image of a whole document \(\boldsymbol{X} \in \mathbb {R}^{H_\text {i} \times W_\text {i} \times C_\text {i}}\), where \(H_\text {i}\), \(W_\text {i}\), \(C_\text {i}\) are the height, the width and the number of channels, respectively. It outputs the associated XML-like serialized representation \(\hat{\boldsymbol{y}}\), i.e., a sequence of tokens, each token \(\hat{\boldsymbol{y}}_i\) representing either a layout markup or a character from an alphabet \(\mathcal {A}^*\). For an input document represented by N tokens, we denote the expected output sequence by \(\boldsymbol{y}^* \in {\mathcal {A}^*}^N\). For instance, a three-line document, split into two paragraphs, could be represented as:

$${<}\text {D}{>}{<}\text {P}{>}\text {Line 1}\backslash \text {nLine 2}{<}/\text {P}{>}{<}\text {P}{>}\text {Line 3}{<}/\text {P}{>}{<}/\text {D}{>}$$

where <D> and <P> correspond to document and paragraph markups, respectively.
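To make the target format concrete, the following minimal sketch (illustrative only, not the authors' code; the `serialize` helper is hypothetical) builds such a serialized sequence from a nested list of paragraphs:

```python
# Minimal illustrative sketch (not the authors' code): building the XML-like
# serialized target sequence from a nested list of paragraphs.
def serialize(paragraphs):
    """paragraphs: list of paragraphs, each given as a list of text-line strings."""
    out = "<D>"
    for lines in paragraphs:
        out += "<P>" + "\n".join(lines) + "</P>"  # line break token between lines
    return out + "</D>"

# Three-line document split into two paragraphs:
print(repr(serialize([["Line 1", "Line 2"], ["Line 3"]])))
# '<D><P>Line 1\nLine 2</P><P>Line 3</P></D>'
```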

The DAN is made up of two main components. An FCN encoder is used to extract 2D features \(\boldsymbol{f}^\text {2D} \in \mathbb {R}^{H \times W \times d}\) from the input image \(\boldsymbol{X}\), with \(H=\frac{H_\text {i}}{32}\), \(W=\frac{W_\text {i}}{8}\) and \(d=256\). A transformer decoder is used to iteratively predict the tokens \(\hat{\boldsymbol{y}}_i\). To this aim, a special start-of-transcription token is used to initialize the prediction (\(\hat{\boldsymbol{y}}_0=\text {<sot>}\)) and a special end-of-transcription token is added to the ground truth to stop it. This way, the new target sequence is \(\boldsymbol{y} \in \mathcal {A}^{N+1}\) with \(\boldsymbol{y}_{N+1}=\text {<eot>}\) and \(\mathcal {A} = \mathcal {A}^* \cup \{ \text {<eot>} \}\). During inference, a maximum number of iterations \(N_\text {max}=3,000\) is fixed in case the <eot> token is not predicted.

The transformer attention mechanism being invariant to the order of its input sequences, positional encoding is added to inject the positional information: 2D positional encoding \(\boldsymbol{P}^\text {2D} \in \mathbb {R}^{H \times W \times d}\) for the 2D features of the image, and 1D positional encoding \(\boldsymbol{P}^\text {1D} \in \mathbb {R}^{N_\text {max} \times d}\) for the previously predicted tokens. Both positional encodings are defined as a fixed encoding based on sine and cosine functions with different frequencies, as proposed in the original Transformer paper [29]. The image features are flattened afterward, for transformer needs:

$$\begin{aligned} \boldsymbol{f}^\text {1D} = \text {flatten}(\boldsymbol{f}^\text {2D} + \boldsymbol{P}^\text {2D}). \end{aligned}$$
(1)
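As an illustration, one possible realization of the fixed sine/cosine encoding and of Eq. 1 is sketched below. This is a sketch under our own assumption that each spatial axis is encoded on half of the d channels; it is not the released implementation.

```python
import numpy as np

def sinusoidal_1d(length, dim):
    """Standard fixed sine/cosine encoding [29], shape (length, dim)."""
    pos = np.arange(length)[:, None]                 # (length, 1)
    i = np.arange(dim // 2)[None, :]                 # (1, dim/2)
    angles = pos / np.power(10000.0, 2.0 * i / dim)  # (length, dim/2)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def add_2d_pe_and_flatten(f2d):
    """f2d: (H, W, d) feature map -> flattened (H*W, d) features, as in Eq. 1."""
    H, W, d = f2d.shape
    pe_h = sinusoidal_1d(H, d // 2)   # half the channels encode the vertical position
    pe_w = sinusoidal_1d(W, d // 2)   # the other half encode the horizontal position
    p2d = np.concatenate(
        [np.repeat(pe_h[:, None, :], W, axis=1),
         np.repeat(pe_w[None, :, :], H, axis=0)], axis=-1)   # (H, W, d)
    return (f2d + p2d).reshape(H * W, d)
```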

The DAN can be seen under the prism of the question-answering paradigm. At iteration t, the question corresponds to the previously predicted tokens \(\boldsymbol{\hat{y}}^{\boldsymbol{t}}=[\hat{\boldsymbol{y}}_0, ..., \hat{\boldsymbol{y}}_{t-1}]\), referred to as context in this work, and the answer is the next token \(\hat{\boldsymbol{y}}_{t}\). Formally, the tokens are first embedded through a learnable matrix \(\boldsymbol{E} \in \mathbb {R}^{(|\mathcal {A}|+1) \times d}\) (+1 for the <sot> token), leading to \(\boldsymbol{e}^{\boldsymbol{t}} = [\boldsymbol{e}_0, ..., \boldsymbol{e}_{t-1}]\), with \(\boldsymbol{e}_i = \boldsymbol{E}_{\hat{\boldsymbol{y}}_i}\) (\(\in \mathbb {R}^{d}\)). Positional embedding is then added to get the transformer input query \(\boldsymbol{q}^{\boldsymbol{t}} = [\boldsymbol{q}_0, ..., \boldsymbol{q}_{t-1}]\) with \(\boldsymbol{q}_i = \boldsymbol{e}_i+\boldsymbol{P}^\text {1D}_i\).

The transformer’s self-attention and cross-attention mechanisms compute an output \(\boldsymbol{o}_i \in \mathbb {R}^{d}\) for each query input \(\boldsymbol{q}_i\) by comparing them with the other query tokens, and with the image features \(\boldsymbol{f}^\text {1D}\), respectively. Formally,

$$\begin{aligned} \boldsymbol{o}^{\boldsymbol{t}} = [\boldsymbol{o}_0, ..., \boldsymbol{o}_{t-1}] = \text {decoder}(\boldsymbol{q}^{\boldsymbol{t}}, \boldsymbol{f}^\text {1D}), \end{aligned}$$
(2)

where the decoder corresponds to a stack of 8 standard transformer decoder layers [29]. This process being autoregressive, the query at position i can only attend to positions from 0 to i. In addition, the intermediate computations are preserved for each layer from one iteration to another in order to avoid computing the same output multiple times.

A score \(\boldsymbol{s}^t_i\) is computed for each token i of the alphabet \(\mathcal {A}\) using a single densely-connected layer of weights \(\boldsymbol{W}_p\) (\(\boldsymbol{s}^t \in \mathbb {R}^{|\mathcal {A}|}\)):

$$\begin{aligned} \boldsymbol{s}^t = \boldsymbol{W}_p \cdot \boldsymbol{o}_{t-1}. \end{aligned}$$
(3)

Probabilities are obtained through softmax activation: \(\boldsymbol{p}^t_i = \frac{\exp {\boldsymbol{s}^t_i}}{\sum _j \exp {\boldsymbol{s}^t_j}}\). The predicted token at iteration t is the one whose probability is maximum:

$$\begin{aligned} \hat{\boldsymbol{y}}_t=\text {arg}\max (\boldsymbol{p}^t). \end{aligned}$$
(4)
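Putting Eqs. 2 to 4 together, DAN inference reduces to a greedy token-by-token loop. The following schematic sketch is only an illustration: `decoder`, `E`, `P1D` and `Wp` are placeholders for the components defined above, and the caching of intermediate computations mentioned earlier is omitted for clarity.

```python
import numpy as np

# Schematic greedy decoding loop for the DAN (Eqs. 2-4); not the released code.
def dan_decode(f1d, decoder, E, P1D, Wp, sot, eot, N_max=3000):
    y_hat = [sot]
    for t in range(1, N_max + 1):
        q = np.stack([E[tok] + P1D[i] for i, tok in enumerate(y_hat)])  # queries q^t
        o = decoder(q, f1d)              # Eq. 2: self- and cross-attention
        s = Wp @ o[-1]                   # Eq. 3: scores from the last position
        tok = int(np.argmax(s))          # Eq. 4 (softmax is monotonic, so argmax of s suffices)
        if tok == eot:
            break
        y_hat.append(tok)
    return y_hat[1:]                     # predicted sequence, without <sot>
```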

The model is trained in an end-to-end fashion using the cross-entropy loss over the sequence of tokens:

$$\begin{aligned} \mathcal {L}_\text {DAN} = \sum _{t=1}^{N+1} \mathcal {L}_\text {CE}(\boldsymbol{y}_t, \boldsymbol{p}^t). \end{aligned}$$
(5)

This autoregressive process can be parallelized during training through teacher forcing, but this is not possible during inference. That is why we propose the Faster DAN strategy.

4 Faster DAN

The standard character-level attention-based approach for HTR is to sequentially recognize all the characters \(\boldsymbol{y}_i\) of the whole input image \(\boldsymbol{X}\). This way, the number of iterations, and thus the prediction time, grows linearly with the number of characters in the document. This may be negligible for isolated text line images, for which the image feature extraction stage is predominant, but it becomes significant for whole-page images (around one second per 100 characters on a V100 GPU).

We propose the Faster DAN, a novel approach for Handwritten Document Recognition, to noticeably reduce the prediction time. The goal is to take advantage of the line-based structure of documents to parallelize the recognition of the text lines. Considering the layout markups and the <eot> token as lines by themselves (of unit length), we can rewrite the target sequence as \(\boldsymbol{y}=\text {concatenate}(\boldsymbol{y}^1, ..., \boldsymbol{y}^L)\), where L is the number of lines in the document and \(\boldsymbol{y}^j \in \mathcal {A}^{n_j}\) represents text line j (\(\boldsymbol{y}^{j}_i\) being the character i of line j).

Using one model per line would be prohibitive in terms of GPU memory consumption. Instead, the parallelization is carried out within a single model, which processes multi-target queries through masking in the second pass. This is feasible thanks to the dedicated document positional encoding scheme we propose. It is important to note that the proposed approach is not specific to the DAN architecture: it could be used with any attention-based HDR model. However, to our knowledge, the only available end-to-end HDR model is the DAN.

Reading Lines in Parallel. Parallelizing the recognition faces two main challenges: detecting all the text lines, and recognizing them in parallel through transformer queries without mixing them. Moreover, since our goal is to perform HDR, and not only HTR, we also need to recognize the layout entities.

Fig. 2.

Comparison of the prediction process and positional encoding scheme between the DAN and the Faster DAN, for an example document made up of three one-word text lines. The DAN associates a unique positional value with each token, which continues from one text line to the next. The Faster DAN uses two positional values: the index of the text line and the position of the token within this text line. Special (start and end) tokens are in blue and layout tokens are in green. (Color figure online)

To tackle these issues, we opted for a two-pass process, as illustrated in Fig. 2b. In a first pass, the model sequentially predicts the layout tokens as well as the first character of each text line, solving both layout recognition and text line detection. Then, the different text lines are completed in parallel based on their previously predicted first character. To this end, it is crucial to determine which token belongs to which line.

Document Positional Encoding. To parallelize the recognition of the text lines, we propose a new positional encoding scheme, as shown in Fig. 2. We associate with each predicted token \(\hat{\boldsymbol{y}}^j_i\) (with \(\hat{\boldsymbol{y}}^0_0 = \text {<sot>}\)) two 1D positional embeddings: one for the index of the line, and one for the index of the token within the line, leading to the global positional embedding \(\boldsymbol{P}^\text {doc} \in \mathbb {R}^{l_\text {max} \times n_\text {max} \times d}\), where \(l_\text {max}\) is the maximum number of lines and \(n_\text {max}\) is the maximum number of characters per line. \(\boldsymbol{y}^{j}_i\) is associated with:

$$\begin{aligned} \boldsymbol{P}^\text {doc}_{j,i} = \text {concatenate}(\boldsymbol{P}^\text {1D'}_{j}, \boldsymbol{P}^\text {1D'}_{i}), \end{aligned}$$
(6)

where \(\boldsymbol{P}^\text {1D'}\) is equivalent to \(\boldsymbol{P}^\text {1D}\) but encoded on half channels (\(\boldsymbol{P}^\text {1D'}_i \in \mathbb {R}^{d/2}\)). The transformer input queries become \(\boldsymbol{q}^j_i = \boldsymbol{E}_{\hat{y}^j_i} + \boldsymbol{P}^\text {doc}_{j,i}\). The idea of injecting the line information was already used in [27], but it was computed as a ratio with an arbitrary maximum number of lines, and concatenated to the token embedding directly. In addition, the position of the tokens was absolute, and not relative to the current text line, as for the standard DAN.
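Reusing the sinusoidal encoding sketched above, Eq. 6 could be implemented as follows (again an illustrative sketch under the same assumptions, not the released code):

```python
# Illustrative sketch of the document positional encoding (Eq. 6): half the
# channels encode the line index j, the other half the in-line position i.
# sinusoidal_1d is the helper from the earlier sketch.
def document_pe(l_max, n_max, d):
    half = sinusoidal_1d(max(l_max, n_max), d // 2)           # P^{1D'}, d/2 channels
    p_doc = np.zeros((l_max, n_max, d))
    for j in range(l_max):
        for i in range(n_max):
            p_doc[j, i] = np.concatenate([half[j], half[i]])  # Eq. 6
    return p_doc  # query embedding: q^j_i = E[y^j_i] + p_doc[j, i]
```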

First Pass. The Faster DAN follows the standard autoregressive process to predict the first token \(\hat{\boldsymbol{y}}^{j}_0\) of each line j based on Eqs. 2 to 4. At iteration t, \(\boldsymbol{q}^{\boldsymbol{t}}= [\boldsymbol{q}^0_0, ..., \boldsymbol{q}^{t-1}_0]\).

Second Pass. The standard transformer decoding process is to take a sequence of query tokens \(\boldsymbol{q}^{\boldsymbol{t}}\) as input and to keep only the output corresponding to the last token (\(\boldsymbol{o}_{t-1}\)) as the single output of iteration t. Instead, in this second pass, the outputs of the last token of each line \(\boldsymbol{o}^{j}_{t-1}\) are kept. We refer to this as multi-target queries. The \(\hat{\boldsymbol{y}}^{j}_0\) are duplicated into \(\hat{\boldsymbol{y}}^{j}_1\) to initiate the second pass; the modification of the associated position in the line (from 0 to 1) indicates to the model a change of expected behavior: from predicting the first token of the next line to predicting the next token of the current line. By setting \(\boldsymbol{q}^{\boldsymbol{t}} = [ \boldsymbol{q}^0_0, ..., \boldsymbol{q}^0_{t-1}, ..., \boldsymbol{q}^L_0, ..., \boldsymbol{q}^L_{t-1}]\) (the t first tokens of all the lines), we obtain \(\boldsymbol{o}^{\boldsymbol{t}} = [ \boldsymbol{o}^0_0, ..., \boldsymbol{o}^0_{t-1}, ..., \boldsymbol{o}^L_0, ..., \boldsymbol{o}^L_{t-1}]\) through Eq. 2. In this way, the \(t^\text {th}\) tokens of all the lines j are computed in a single iteration t:

$$\begin{aligned} \hat{\boldsymbol{y}}^j_t = \text {arg}\max (\boldsymbol{W}_p \cdot \boldsymbol{o}^{j}_{t-1}). \end{aligned}$$
(7)

Extra tokens (\(\hat{\boldsymbol{y}}^j_i \text { with } i>n_j\)) are discarded through masking.
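Schematically, one iteration of the second pass gathers the already-predicted tokens of every line, runs a single decoder call, and extends all the lines at once. The sketch below uses the same hypothetical placeholder components as the previous ones:

```python
# Schematic single iteration of the second pass (Eq. 7); not the released code.
def second_pass_step(lines, f1d, decoder, E, p_doc, Wp):
    """lines: list of token lists; lines[j][i] is the already-predicted y^j_i."""
    q = np.stack([E[tok] + p_doc[j, i]            # document positional encoding
                  for j, line in enumerate(lines)
                  for i, tok in enumerate(line)])
    o = decoder(q, f1d)                           # Eq. 2, with the Fig. 3 attention mask
    new_tokens, offset = [], 0
    for line in lines:
        offset += len(line)
        s = Wp @ o[offset - 1]                    # scores from the last token of line j
        new_tokens.append(int(np.argmax(s)))      # \hat{y}^j_t (Eq. 7)
    return new_tokens                             # extra tokens (i > n_j) are later masked out
```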

Context Exploitation. The naive approach to recognize the text lines in parallel would be to recognize them independently, by applying a mask to discard the tokens from all the other text lines. It means that \(\boldsymbol{q}^j_i\) could only attend to line j (itself) and to positions 0 to i in that line. However, this would lead to a significant loss of context. Instead, we propose to take advantage of all the partially predicted text lines: \(\boldsymbol{q}^j_i\) can attend to all lines, from 0 to L, and to positions 0 to i in those lines. This is illustrated in Fig. 3.

Fig. 3.

Context comparison between the DAN and the Faster DAN. The colored cells represent the current character to predict (in purple), the previously predicted tokens, i.e., the context (in blue and green), the token used for the prediction (in green), and the remaining characters to recognize (in gray). (Color figure online)

The major drawback of parallelizing the line recognition, compared to purely sequential recognition, is the loss of context. Indeed, the standard DAN benefits from all the past context during prediction; this context is only partially available to the Faster DAN, since it is limited to the beginning of the text lines. In this way, it becomes harder for the model to focus on the correct text part, especially for very short contexts. Indeed, a sequence of characters may appear several times in a document, especially if this sequence is short, e.g., at the beginning of the recognition process. We counterbalance the loss of past context by combining partial context from both past and future lines. We show the impact of this approach in Sect. 5.6.
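This context rule can be encoded as a boolean attention mask over the flattened second-pass query sequence: the query for in-line position i may attend to positions 0 to i of every line. A minimal sketch, assuming True means "may attend":

```python
# Illustrative boolean attention mask for the second pass; not the released code.
def second_pass_mask(line_lengths):
    in_line_pos = [i for n in line_lengths for i in range(n)]  # in-line index of each flattened token
    T = len(in_line_pos)
    mask = np.zeros((T, T), dtype=bool)
    for a in range(T):                  # query index
        for b in range(T):              # key index
            mask[a, b] = in_line_pos[b] <= in_line_pos[a]
    return mask
```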

Training and Inference. The model is trained over the target sequence using the cross-entropy loss:

$$\begin{aligned} \mathcal {L} = \sum _{j=1}^{L} \sum _{\begin{array}{c} i =0 \\ i \ne 1 \end{array}}^{n_j} \mathcal {L}_\text {CE}(\boldsymbol{y}^j_i, \boldsymbol{p}^j_i). \end{aligned}$$
(8)

It has to be noted that the training time is not impacted by this two-step decoding strategy since the whole expected sequence prediction (from both passes) is trained in parallel through teacher forcing, with appropriate masks.

During inference, the Faster DAN reduces the number of iterations I from

$$\begin{aligned} I_\text {DAN} = \displaystyle \sum _{j=1}^{L} n_j = N+1 \;\;\; \text { to } \;\;\; I_\text {FasterDAN} = L + \max _j(n_j), \end{aligned}$$

by considering the line breaks as belonging to the lines. For example, a document made up of 25 text lines of 50 characters, structured according to 3 layout entities, leads to 1,251 iterations for the DAN, and only 76 iterations for the proposed Faster DAN.
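The corresponding iteration counts can be computed directly from the line lengths, as in the small illustrative helper below (layout markups and <eot> counted as unit-length lines, line breaks counted as belonging to their text line):

```python
# Iteration counts for the DAN and the Faster DAN, given the line lengths n_j.
def iteration_counts(line_lengths):
    i_dan = sum(line_lengths)                             # = N + 1
    i_faster_dan = len(line_lengths) + max(line_lengths)  # L + max_j(n_j)
    return i_dan, i_faster_dan
```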

5 Experimental Study

5.1 Datasets

We used three document-level public datasets to evaluate the proposed approach: RIMES 2009 [17], READ 2016 [26] and MAURDOR [4]. Document image examples from these three datasets are shown in Fig. 4.

Fig. 4.

Document image examples from the RIMES 2009, READ 2016 and MAURDOR datasets.

RIMES 2009. The RIMES 2009 dataset corresponds to French grayscale handwritten page images. These pages are letters produced in the context of writing mail scenarios. Text regions are classified into one of the following 7 classes: sender coordinates, recipient coordinates, object, date & location, opening, body and PS & attachment. We used these classes as layout tokens.

READ 2016. The READ 2016 dataset corresponds to Early Modern German handwritten pages from the Ratsprotokolle collection. Images are RGB encoded. We used two versions of this dataset: single-page images and double-page images. The layout classes are as follows: page, section, margin annotation and body.

MAURDOR. The MAURDOR dataset consists of a heterogeneous collection of documents. We used the same configuration as in [9], i.e., we only use the English and French documents, and we focus on the C3 and C4 subsets of this dataset, which correspond to private or professional correspondence. The documents are either handwritten, printed, or a mix of both. The annotations are not sufficient to produce layout tokens, so we only evaluate the HTR task on this dataset.

Table 1 details the splits into training, validation and test sets, as well as the number of characters in the alphabet and the number of layout tokens (2 per class, for opening and closing markups) for each dataset.

Table 1. Splits and number of character and layout tokens for each dataset.

5.2 Metrics

In addition to the standard Character Error Rate (CER) and Word Error Rate (WER) metrics used to evaluate the text recognition performance, we proposed two metrics in [9] to evaluate the layout recognition specific to the HDR task. The Layout Ordering Error Rate (LOER) consists in considering the document layout as a graph and computing the graph edit distance between the prediction and the ground truth. The LOER aims at evaluating the layout recognition only, considering the reading order between layout entities. Since the LOER and the CER/WER only evaluate the layout and text recognition independently, the \(\text {mAP}_\text {CER}\) is used to evaluate the recognition of the layout with respect to the text content. It is computed as the area under the precision/recall curve, as in object detection approaches [13] for instance, but based on a CER threshold instead of an IoU one. The \(\text {mAP}_\text {CER}\) does not depend on the reading order between layout entities. That is why it is important to consider all these metrics together to evaluate the HDR task.
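For reference, the CER is the character-level edit (Levenshtein) distance between the prediction and the ground truth, normalized by the ground-truth length; a minimal implementation is sketched below (assuming a non-empty ground truth):

```python
# Minimal CER: Levenshtein distance between prediction and ground truth,
# normalized by the ground-truth length.
def cer(prediction, ground_truth):
    m, n = len(prediction), len(ground_truth)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == ground_truth[j - 1] else 1
            prev, row[j] = row[j], min(row[j] + 1,      # deletion
                                       row[j - 1] + 1,  # insertion
                                       prev + cost)     # substitution
    return row[n] / n
```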

5.3 Training Details

In [9], we used pretraining and curriculum training strategies to speed up the convergence of the DAN and to avoid using any physical segmentation annotation during training. To be fairly comparable with that work, we follow the exact same training configuration, whose major points are as follows:

  • The encoder is pretrained on synthetic isolated text line images using the CTC loss and a dedicated FCN line-level OCR model.

  • The Faster DAN is trained on a mixture of real and synthetic documents. Using a curriculum strategy, the Faster DAN is trained on increasingly complex synthetic documents through the epochs. The complexity varies in two aspects: the number of lines contained in the document image, and the size of this image. The ratio between synthetic and real documents also evolves during training, from 90%/10% to 20%/80%.

  • A rule-based post-processing is used to make sure that the layout markups have the correct format (no unpaired markup, for instance).

  • Whether it is for pretraining or training, input images are downsized to 150 dpi and normalized, and data augmentation is applied 90% of the time.

We carried out 2 days of pretraining and 4 days of training on a single V100 GPU (32 GB), using automatic mixed precision. We used the Adam optimizer with an initial learning rate of \(10^{-4}\). We do not use any external data, external language model or lexicon constraints.

5.4 Comparison with the State of the Art

To our knowledge, the only work performing HDR is the DAN [9]. Tables 2, 3 and 4 provide an evaluation of the Faster DAN on the READ 2016, RIMES 2009 and MAURDOR datasets, respectively, as well as a comparison with the state of the art.

Table 2. Evaluation of the Faster DAN on the test set of the READ 2016 dataset and comparison with the state of the art. Metrics are expressed in percentages.
Table 3. Evaluation of the Faster DAN on the test set of the RIMES 2009 dataset and comparison with the state of the art. Metrics are expressed in percentages.
Table 4. Evaluation of the Faster DAN on the test set of the MAURDOR dataset and comparison with the state of the art. Metrics are expressed in percentages.

The Faster DAN reaches results competitive with the DAN on the three datasets. For the READ 2016 dataset, it even reaches state-of-the-art results in terms of LOER and \(\textrm{mAP}_\textrm{CER}\) for both the single-page and double-page versions, implying a better recognition of the layout. Results are not as good for the RIMES 2009 dataset, which includes more variability in terms of layout. We assume that this higher variability makes the first pass of the Faster DAN more difficult. This is confirmed when measuring the CER for the first pass only: it is 4.72% and 5.34% for READ 2016 at single-page and double-page levels, respectively, and 9.10% for RIMES 2009. Concerning the MAURDOR dataset, the Faster DAN reaches competitive results on the C3 and C4 categories taken separately, and it reaches new state-of-the-art results when mixing both categories, with a CER of 10.50%, compared to 11.59% for the standard DAN.

Discussion. It has to be noted that comparing text recognition performance is more difficult at document level than at line level. Indeed, the reading order is far more complex for documents, going from one paragraph to another and from one line to the next, than for isolated lines. This way, even if the text is perfectly recognized, the CER can be severely impacted when the paragraphs are recognized in the wrong order. On the contrary, the \(\textrm{mAP}_\textrm{CER}\) is invariant to the order of the layout entities, but it depends on the correct recognition of the layout.

Another point to emphasize is the severity of the errors made. Two types of errors must be distinguished. The first corresponds to standard character addition, removal or substitution. During the first pass of the Faster DAN, this kind of error may have a great impact because a whole text line may be duplicated or discarded. However, during the second pass, we assume that the impact of such errors is roughly equivalent for the DAN and the Faster DAN. The second kind of error is related to the end-of-transcription token prediction. Indeed, although rare, the model may not predict the end of the transcription and loop over the same text region again and again until reaching an arbitrarily chosen iteration limit. For this latter issue, the standard DAN is more impacted than the Faster DAN. Indeed, the DAN only has one iteration limit, which corresponds to the global number of tokens to predict for the whole document. For the Faster DAN, we used two iteration limits: one for the number of lines, and one for the number of characters per line. Given that the range of values for a line length is smaller than for the whole document, the impact is less severe for the Faster DAN.

Prediction Time. Table 5 compares the Faster DAN with the DAN in terms of prediction time for the three datasets: RIMES 2009, READ 2016 and MAURDOR. To be fairly comparable, these times account for the whole prediction process, including the time dedicated to the encoder part and to formatting instructions. Additional details are given for each dataset, such as the image sizes, the number of characters, lines and layout tokens per image, and the number of characters per line. The values are averaged over the test set of each dataset. As one can note, the Faster DAN is significantly faster than the DAN for all the datasets, speeding up the prediction process by a factor of at least 4.

Table 5. Prediction time comparison between the DAN and the Faster DAN. Times (in seconds) are averaged on the test set for a single document image, using a single GPU V100.

We showed that the Faster DAN reaches competitive results on three document-level datasets while being at least 4 times faster than the standard DAN at prediction time. We now evaluate the performance on heterogeneous documents, by mixing both RIMES 2009 and READ 2016 datasets.

5.5 Evaluation on Heterogeneous Documents

In this experiment, we mixed the RIMES 2009 and READ 2016 datasets at single-page level, for both training and evaluation. Examples from both datasets are balanced at training time, i.e., the models have been trained on the same number of documents from each dataset. These are the first results for such an experiment; we also trained the standard DAN for comparison purposes. Results are shown in Table 6. As one can note, results are rather similar when training on the datasets separately or together, except for the DAN on the RIMES dataset, whose CER increases from 4.54% up to 7.96%.

Table 6. Evaluation of the Faster DAN on heterogeneous data (mixing READ 2016 and RIMES 2009 for both training and evaluation) and comparison with the state of the art.

5.6 Ablation Study

In Table 7, we propose an ablation study of the proposed approach on the RIMES 2009 and READ 2016 datasets. The first line corresponds to the Faster DAN baseline. In experiment (1), the document positional encoding is replaced by standard 1D positional encoding, i.e., a unique index is associated with each token. The model does not succeed in recognizing the text, showing the necessity of injecting line positional information to parallelize the recognition. In experiment (2), the model can only access the tokens of the text line being recognized, which also prevents text recognition. Indeed, it is nearly impossible to predict the next character from a one-character query (beginning of the second pass), since characters are not unique in a document. For both experiments, one can note that the LOER is barely impacted; this is because the layout recognition takes place in the first pass, before the parallelization.

Table 7. Ablation study of the Faster DAN and DAN. Results (in percentages) are given for the test set of the RIMES 2009 and READ 2016 datasets.

In experiment (3), in addition to the tokens of the text line to recognize, the first character of all the text lines, as well as the layout markup tokens, are available. This leads to an increase of the CER of at least 1.89 points for RIMES 2009, and up to 2.99 points for READ 2016 at double-page level, compared to the baseline. This shows the efficiency of the text line detection performed in the first pass, since the text recognition is successfully parallelized, but it also demonstrates that gathering the context from past and future lines helps to improve the performance. In experiment (4), the positional encodings of the line and of the index in the line are summed instead of being concatenated. As one can note, results are slightly in favor of the concatenation.

6 Conclusion

In this paper, we proposed the Faster DAN, a novel approach for end-to-end Handwritten Document Recognition. We evaluated this approach with the current state-of-the-art architecture and showed that it reaches competitive results on three document-level datasets while being at least 4 times faster. This way, we preserve the advantages of using a single end-to-end approach, while greatly mitigating its major drawback, the prediction time. In this work, we focused on line-level multi-target queries to show the gain in prediction time. However, this parallelization could also be performed at paragraph level in order to rely on a richer past context: this would represent an in-between in terms of prediction time.