1 Introduction

Over the last few decades, there has been growing interest in analyzing images of historical and modern documents, largely motivated by the need to investigate material stored in libraries and archives [4]. Millions of pages have been digitized and are accessible as images, but to let humanists, historians and the general public work efficiently with such documents, ongoing research aims at understanding the content of these documents and converting it into digital form [17]. This amounts to recognizing and comprehending the textual and non-textual content through an automated procedure.

Recently, research in document image analysis has investigated deep learning for the segmentation and recognition of typewritten and handwritten text. It is worth noting that images of historical documents are distinct from images of natural scenes, where deep learning techniques are already widely applied [8]. As deep learning techniques advance, the automated processing and understanding of documents has become increasingly efficient with respect to text recognition, and two steps have emerged as crucial to success [4]: text line detection and text line recognition. Text line detection is a technique used in document processing and computer vision to detect and extract text lines from a scanned document [21]. It is often applied as a pre-processing step for Optical Character Recognition (OCR) and layout analysis, and it has a great impact on the accuracy of each task [6].

In this work, by utilizing a variation of the well-known YOLOv5 [13] deep neural network model (YOLOv5-OBB, see footnote 1) and by proposing some simple but efficient modifications applied during training, we achieve very accurate text line detection results on different datasets of high diversity. Moreover, we demonstrate the efficiency of this workflow on the text recognition task.

The contributions of this paper are as follows: a) We introduce a new dataset, named GTLD, with annotated text line quadrilateral polygons; a smaller subset of 1,642 documents (including annotations on the Tobacco-3482 dataset) is publicly available (see footnote 2). b) We show in the experiments that the introduced oriented quadrilateral polygons can improve accuracy on the text line detection task, especially when text line boxes overlap. c) We provide promising results on text line recognition using an end-to-end workflow.

The rest of the paper is organized as follows. Section 2 presents related works, Sect. 3 introduces the proposed method, Sect. 4 demonstrates experimental scenarios and results and Sect. 5 presents the conclusion of this work.

2 Related Work

Text Line Detection. Text line detection methods have been proposed for many years [3, 16, 18, 22], where most techniques focus on historical documents because of their diverse nature. In addition, segmentation of unconstrained handwritten manuscripts is usually the most challenging task to comprehend [7]. In such cases, text line detection is sometimes equivalent to the detection of baselines.

In [10], a two-stage workflow is proposed to detect baselines, where the first stage is a pixel labeling (or goal-oriented binarization) deep hierarchical neural network (ARU-Net), which detects foreground baseline pixels and separators. The second stage then clusters the extracted superpixels in order to build baselines. In [2], dhSegment is proposed, where a generic neural network (U-Net [20] based on ResNet50 [12]) enables multi-task detection for page segmentation, layout analysis, line detection and ornament extraction. A similar technique has been proposed in [5], where Doc-UFCN utilizes a light U-shaped Fully Convolutional Network (FCN) without any dense layers for page segmentation and text line detection. In [1], text line detection is achieved by applying document image binarization as a pre-processing step, after which connected component analysis enables the localization of text blocks and lines. Finally, in [8] the authors propose a Mask-RCNN [11] based architecture for detecting text lines in historical Arabic handwritten documents, where they group labeled results after detecting text lines on image patches.

Text Line Recognition. Recognizing the text in documents has frequently been considered as a subsequent step to the text line detection task. This paper focuses on the recognition of Greek polytonic documents, where variations in writing styles and accents make the task very challenging [9]. In [23], the authors propose a workflow for the recognition of Greek printed polytonic scripts, using a character segmentation step combined with an intensity-based feature extractor and \(k\)-nearest neighbors (KNN) classification. Following the neural network approach and using only Long Short-Term Memory (LSTM) networks, the authors of [24] demonstrate some interesting results for Greek polytonic scripts.

Over the years, many methods have been proposed for the Handwriting Text Recognition (HTR) problem, where most approaches focus on Deep Learning techniques [25]. In [17], an attention-based sequence-to-sequence (Seq2Seq) model was proposed, where a Convolutional Neural Network (CNN) with Bidirectional LSTM (BLSTM) networks encode the image into a feature representation. Then, a decoder that utilizes a recurrent layer interprets this information and an attention mechanism enables focusing on the most relevant encoded features.

3 Proposed Method

3.1 Object Detection Model

The overall proposed architecture is provided in Fig. 1. Inspired by the success of the well-known YOLO [19] object detectors, we use a variation of the YOLOv5 model [13] adapted for the detection of oriented quadrilateral polygons [26] (named YOLOv5-OBB for convenience).

As shown in Fig. 1, the CSP-Darknet53 is the Darknet53 backbone architecture combined with the Cross Stage Partial (CSP) network strategy. CSP is applied in order to overcome the problem of redundant gradients by truncating the gradient flow, while preserving the advantage of the residual and dense blocks used in the backbone, which carry information to the deepest layers and mitigate the vanishing gradient problem. Using CSP, inference speed is increased (fewer FLOPs) while the number of parameters is reduced. Compared to its predecessors, YOLOv5 has a major change in the Neck of the architecture: a variant of the Spatial Pyramid Pooling layer feeds a Path Aggregation Network (PANet) in which the CSP strategy has been incorporated via the BottleNeckCSP layer. Finally, the Head of the network is composed of 3 convolution layers that predict the location of the bounding boxes, as well as the object classes with their scores and the angle estimation.

Fig. 1. Architecture of the proposed YOLOv5-OBB model: The CSP-Darknet53 is the backbone feature extractor and a variant of the Spatial Pyramid Pooling (SPP) layer feeds a Path Aggregation Network (PANet). The Head of the network is composed of 3 convolution layers combined with the Circular Smooth Label technique in order to classify the quadrilateral’s angle and to accurately detect the text lines in the document. In the BottleNeckCSP convolutional layers, the Cross Stage Partial (CSP) technique has been incorporated in order to overcome the problem of redundant gradients.

3.2 Circular Smooth Label for Angle Estimation

To localize text lines with orientation, this work extends the YOLOv5 model with the Circular Smooth Label (CSL) technique proposed in [26], which classifies the rotation angle of the object’s bounding polygon. Regression methods can cause issues because ideal predictions may fall outside the defined range. The result of a prediction can be controlled more effectively if the object angle is approached as a classification problem, where the angle is treated as a category label and the number of categories is determined by the angle range. However, changing the method from regression to one-hot encoded classification can result in a slight decrease in accuracy.

With the CSL method, the angles are encoded in a circular/repeating pattern (Fig. 1), while window functions (pulse, rectangular, triangle and Gaussian) can be used to eliminate boundary problems and obtain a better prediction of the angle’s class.
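A minimal sketch of such a circular encoding with a Gaussian window is given below. It assumes 180 one-degree angle classes and uses the window radius as the standard deviation of the Gaussian without truncation; these choices and the function names are illustrative assumptions, not details taken from [26].

```python
import numpy as np


def circular_smooth_label(angle, num_classes=180, radius=4):
    """Encode an integer angle class as a circular, Gaussian-smoothed label.

    Assumptions (not from the paper): one class per degree and the window
    radius used as the Gaussian standard deviation.
    """
    bins = np.arange(num_classes)
    # Distance between each bin and the target, measured around the circle,
    # so that class 0 and class num_classes - 1 are neighbours.
    diff = np.abs(bins - (angle % num_classes))
    circular_dist = np.minimum(diff, num_classes - diff)
    return np.exp(-(circular_dist ** 2) / (2.0 * radius ** 2))


def decode_angle(label):
    """Recover the angle class as the position of the strongest response."""
    return int(np.argmax(label))


label = circular_smooth_label(179)
print(decode_angle(label), round(float(label[0]), 3), round(float(label[178]), 3))
```

Because the distance is measured around the circle, the classes on both sides of the boundary (here 178 and 0) receive the same smoothed weight, which is the property that a plain one-hot encoding lacks.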

3.3 Extended Polygonal Labeling

In many cases, ground-truth text lines are strictly annotated and the polygonal edges almost overlap with the characters. Even when this is not the case, we observe that the LSTM layers in a text line recognition system handle edge characters better when the enclosing polygon is looser, as demonstrated in the experiments section. So, by loosening the enclosing polygon, both line detection and recognition systems improve. For this reason, during training of the YOLOv5-OBB model, we shift every point \(p_i = (x_i,y_i), i \in \{1, 2, 3, 4\}\) of the \(n\)-th text line quadrilateral of width \(w_n\) along the horizontal axis, resulting in the new point \({p_i}^{'} = ({x_i}^{'},{y_i}^{'})\), following Eqs. 1, 2:

$$\begin{aligned} {x_i}^{'} = {x_i} + a\times k \times w_n \end{aligned} \tag{1}$$
$$\begin{aligned} {y_i}^{'} = \frac{(y_{j} - y_{i})}{(x_{j} - x_{i})} \times ({x_i}^{'} - x_i) + {y_i} \end{aligned} \tag{2}$$

where \(j=(i+1) \bmod 4\), \(a=-1\) for the two leftmost points and \(a=+1\) for the two rightmost points. We empirically use \(k=0.03\) as a shift factor relative to the width of the quadrilateral, since we observe that the resulting polygons do not overlap, while starting and ending characters are recognized properly by the text line recognition system. A visualization is given in Fig. 2.

Fig. 2. Extended Polygonal Labeling: Ground-truth text lines are sometimes annotated strictly and the polygonal edges almost overlap with the characters (upper image). By loosening the enclosing polygon over the horizontal axis, both line detection and recognition systems improve (lower image).
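The corner shift of Eqs. 1 and 2 can be sketched as follows. The sketch makes several assumptions that the text does not fix: the corners are ordered top-left, top-right, bottom-right, bottom-left; each corner slides along its own top or bottom edge (so the slope in Eq. 2 is taken from that edge); and \(w_n\) is measured as the horizontal extent of the quadrilateral. The function name is hypothetical.

```python
import numpy as np


def extend_quadrilateral(quad, k=0.03):
    """Loosen a text line quadrilateral along the horizontal axis.

    quad: four (x, y) corners, assumed ordered top-left, top-right,
          bottom-right, bottom-left (an assumption of this sketch).
    k:    shift factor relative to the quadrilateral width.
    """
    quad = np.asarray(quad, dtype=float)
    # w_n, taken here as the horizontal extent of the quadrilateral.
    width = quad[:, 0].max() - quad[:, 0].min()
    # Assumed pairing: every corner moves along its top or bottom edge.
    partner = [1, 0, 3, 2]
    # a = -1 for the two leftmost points, +1 for the two rightmost points.
    sign = [-1.0, 1.0, 1.0, -1.0]

    extended = quad.copy()
    for i in range(4):
        x_i, y_i = quad[i]
        x_j, y_j = quad[partner[i]]
        new_x = x_i + sign[i] * k * width            # Eq. 1
        slope = (y_j - y_i) / (x_j - x_i)            # requires a non-vertical edge
        extended[i] = (new_x, slope * (new_x - x_i) + y_i)  # Eq. 2
    return extended


line = [(100, 50), (300, 60), (300, 90), (100, 80)]
print(extend_quadrilateral(line))
```

With \(k=0.03\) and a width of 200 pixels, each corner moves 6 pixels outwards while staying on the line through its original top or bottom edge, so the orientation of the text line is preserved.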

4 Experiments

4.1 Datasets

Since the detection of text lines can in general be considered a language-independent segmentation task, we combine two datasets with a total of 17,133 images. We perform our experiments on the Tobacco-3482 dataset [14] along with a new Greek polytonic dataset named GTLD. The latter is a combination of two heterogeneous collections, with the first subset also containing handwritten documents. An overview of the datasets is given in Table 1. Smaller portions of each dataset (denoted with a “small” suffix in Table 1), with a total of 1,642 images, are available along with the corresponding annotations at text line level (see footnote 2).

Tobacco-3482 Dataset. The Tobacco-3482 [14] dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The dataset has 3482 images and we provide annotations at a text line level.

Greek Text Line Detection Dataset (GTLD). We introduce a new dataset, named GTLD (see footnote 2), which contains a total of 13,651 images acquired from two collections. The ShakeIT collection contains images of Greek polytonic documents that originate from almost 140 books of Greek translations of Shakespeare’s plays published between 1842 and 1950. For this work, we use 28 books with a total of 3955 randomly selected pages. The PIOP collection consists of almost 100,000 digital documents from the Historical Archives of the Piraeus Bank Group Cultural Foundation (PIOP) (see footnote 3). This archive includes a significant number of collections starting from the early 20th century. In this paper, we use a total of 9696 randomly selected pages. Both collections contain mostly Greek polytonic characters, with rare cases of English/German. Exemplar images from both collections are shown in Fig. 3.

Table 1. Overview of the datasets included in this work and the number of images used for training, validation and testing.

Fig. 3. Exemplar documents of the GTLD dataset: a) typewritten text (ShakeIT), b) handwritten text (PIOP), c) “Hybrid” text (PIOP).

4.2 Experimental Results

Text Line Detection: For the text line detection task, we train all models shown in the following for 80–150 epochs (more epochs when using smaller subsets) using the SGD algorithm with a starting learning rate of 0.001 and a batch size of 4. We also apply common augmentation techniques during training (random affine transformations, color jitter, flipping, etc.) to the input images. For the Circular Smooth Label, we use a Gaussian kernel with a radius of 4. As a YOLOv5 backbone, we use the YOLOv5-large (YOLOv5l, see footnote 4) model, pre-trained on the MSCOCO dataset [15]. We keep the instance of the model that performed best on the validation set. As evaluation metrics, we use recall (R), precision (P), F1-score (F1) and mean Average Precision (mAP), evaluated at mAP@.5 and mAP@.5:.95, following the best evaluation practices in the domain of object detection. Table 2 shows the different architectures used in the following experiments, where “Rect” means rectangular polygons, “Quad” means quadrilateral polygons, “Extended” means extended quadrilateral or rectangular polygons and “CSL” means Circular Smooth Label.
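To make the role of the overlap threshold in these metrics concrete, the following sketch computes precision, recall and F1 at a single Intersection over Union (IoU) threshold. It is a simplification under stated assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples rather than oriented quadrilaterals, predictions are already sorted by confidence, and matching is greedy one-to-one. It is not the evaluation code used in the experiments, and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes.

    predictions are assumed to be sorted by decreasing confidence.
    """
    matched = set()
    true_pos = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth line can be matched only once
            iou = box_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_pos += 1
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


gt_lines = [(0, 0, 100, 20), (0, 30, 100, 50)]
pred_lines = [(0, 0, 100, 20), (0, 32, 100, 52), (200, 200, 300, 220)]
print(precision_recall_f1(pred_lines, gt_lines, iou_threshold=0.5))
print(precision_recall_f1(pred_lines, gt_lines, iou_threshold=0.9))
```

The second call shows why mAP@.5:.95, which averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05, rewards tighter localization: a slightly shifted prediction counts as correct at 0.5 but not at 0.9.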

Table 2. Different architectures for the text line detection task.

Table 3 presents the evaluation of all trained models for each collection. Note that each model has been trained either on all the “small” collections or on all the “large” collections; we do not train a separate model for each collection. At this point, it is important to highlight the following: a) For all small collections, lower metrics are expected, because fewer training samples are used than for the larger ones. b) For all setups, YOLOv5-rect is inferior to the variations proposed by this work. c) YOLOv5-rect-ext is superior to the YOLOv5-rect version, which demonstrates the contribution of considering extended bounding polygons (for this case, bounding boxes). d) Application of the CSL technique (YOLOv5-OBB) significantly improves detection results (\(+7.5\%\) on mAP@.5:.95) when compared to YOLOv5-rect. e) Extended Polygonal Labeling is a further improvement to the CSL technique for all the larger collections (\(+0.1\%\), \(+0.3\%\), \(+0.4\%\) on mAP@.5:.95 respectively). For the smaller ones it is useful when there is a need for more refined results (higher mAP@.5:.95 on PIOP-small and ShakeIT-small with Greek polytonic scripts), but if detail is not important (Tobacco-3482), extended quadrilaterals may be omitted.

In order to investigate the competitiveness of our proposed method, we evaluate two popular systems on our datasets. The first one, Google Cloud Vision API (see footnote 5), is commercial, while the second one is the well-known open-source OCR engine Tesseract (version 5.0.0, see footnote 6). In both cases we evaluate the systems using their default setup. The evaluation results are given in Table 4. It is observed that the Tesseract engine fails to process all collections. One reason is that Tesseract provides only bounding rectangles instead of quadrilaterals, so its predictions suffer from the problems mentioned above. Google Cloud Vision API seems to perform better and gives more refined results on ShakeIT-small, where the layout of the documents is simpler. On the other hand, our method is superior on the more diverse collections of PIOP-small and Tobacco-3482-small.

Table 3. Evaluation results (%) for the GTLD (PIOP, ShakeIT) and Tobacco-3482 datasets
Table 4. Comparison (mAP %) of the proposed method against other popular commercial/open-source systems.

Text Line Recognition: Finally, in order to demonstrate the robustness of the proposed method, we evaluate our systems indirectly on the subsequent task of text line recognition on Greek polytonic documents. To this end, we perform the following steps:

  • We train a text line recognition module by utilizing the Calamari-OCR engine (see footnote 7). We use about 100,000 annotated text lines (OCR level) from the PIOP and ShakeIT collections. We choose the “htr+” architecture [17].

  • We apply text line detection using four models (YOLOv5-rect, YOLOv5-rect-ext, YOLOv5-OBB and YOLOv5-OBB-ext) trained on the smaller collections.

  • In order to construct the full page OCR from the predictions, we sort the detected text lines using density-based spatial clustering (DBSCAN) and the coordinates of the extracted clusters.

  • We consider the full page OCR of the 66 ShakeIT-small test images for evaluation. We choose this collection because it has simpler layout in order to minimize reading order errors from the previous step.

  • As evaluation metrics, we consider the well-known Character Error Rate (CER) and Word Error Rate (WER).
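The two metrics in the last step can be sketched as follows. Both are computed here as the Levenshtein edit distance between reference and recognized text, normalized by the reference length, over characters for CER and over whitespace-separated words for WER. This is a generic illustration with hypothetical function names, not the evaluation script used for Table 5.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


print(character_error_rate("abcd", "abxd"))
print(word_error_rate("the quick fox", "the quick dog"))
```

A single wrong character costs one full word in WER, which is why the two metrics are usually reported together.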

The results of Table 5 demonstrate again that the CSL method with extended quadrilaterals yields the best results and significantly benefits the text line recognition task. CER and WER results also validate the robustness of our text line recognition module. Another observation is that the text line recognition workflow proposed above can be considered an end-to-end workflow without any prior processing steps, except for the line detection module combined with a grouping algorithm like DBSCAN for line ordering.
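One way such a grouping step could order lines on a simple, single-column layout is sketched below. The features being clustered are not specified above, so the sketch assumes that only the vertical centres of the detected boxes are grouped. On one-dimensional data, DBSCAN with a minimum of one sample per cluster reduces to splitting the sorted values wherever the gap exceeds `eps`, and that reduced form is what the code implements in place of a library call. Boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and the function name is hypothetical.

```python
def order_lines(boxes, eps):
    """Return the indices of the boxes in an assumed reading order.

    Lines whose vertical centres are chained by gaps of at most eps form
    one row; rows are read top to bottom and each row left to right.
    """
    centres = sorted(
        ((box[1] + box[3]) / 2.0, (box[0] + box[2]) / 2.0, index)
        for index, box in enumerate(boxes)
    )
    rows, current, previous_y = [], [], None
    for centre_y, centre_x, index in centres:
        # A vertical gap larger than eps starts a new row (a new cluster).
        if previous_y is not None and centre_y - previous_y > eps:
            rows.append(current)
            current = []
        current.append((centre_x, index))
        previous_y = centre_y
    if current:
        rows.append(current)
    return [index for row in rows for _, index in sorted(row)]


detected = [(300, 10, 500, 30), (0, 12, 200, 32), (0, 50, 500, 70)]
print(order_lines(detected, eps=10))
```

The value of `eps` would need to be tuned to the line spacing of the collection; multi-column layouts would require clustering on more than the vertical coordinate.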

Table 5. Evaluation of page OCR results using text line detection models on the ShakeIT-small test images.

5 Conclusion

In this work, text line detection and recognition have been considered, with a focus on Greek polytonic documents. It was proposed that by utilizing YOLOv5 with some easy and simple techniques like the CSL method and extended polygonal labeling, there is a significant improvement on the text line detection task and accuracy results outperform popular commercial and open-source systems. Moreover, by applying this logic to the subsequent task of performing OCR even at a page level, similar improvements are noted and text line detection (with a line ordering technique) can be the only step before OCR in an end-to-end workflow. Finally, a new dataset was introduced (GTLD), with a smaller subset being publicly available for future works.