Abstract
In this work we highlight the significance of Text Line Detection in documents. By utilizing a well-known Deep Neural Network and by proposing some simple but efficient modifications applied during training of such models, we can achieve very accurate results on datasets of high diversity. Moreover, such models can be robust even when trained with little data. Our focus is on Greek polytonic documents (typewritten and handwritten) and we provide a new dataset for text line detection (GTLD-small) to the public. We evaluate our method through scenarios applied to the detection and recognition tasks, demonstrating promising results when compared to popular commercial and open-source systems.
Keywords
- Deep Neural Networks
- Object Detection
- Text Line Detection
- Text Line Recognition
- YOLOv5
- Circular Smooth Label
1 Introduction
Over the last few decades, there has been a growing interest in studying and examining images of historical and modern documents, largely driven and motivated by the need to investigate material stored in libraries and archives [4]. There are millions of pages that have been digitized and are accessible as images, but in order to facilitate efficient work with such documents by humanists, historians and the general public, there is ongoing research and academic discourse aimed at understanding and converting the content of these documents into digital form [17]. This amounts to recognizing and comprehending the textual and non-textual content through an automated procedure.
Recently, research in document image analysis has investigated utilizing deep learning for the segmentation and recognition of typewritten and handwritten text. At this point, it is worth noting that images of historical documents are distinct from images of natural scenes, where deep learning techniques are already applied widely [8]. As deep learning techniques advance, the automated processing and understanding of documents has become increasingly efficient concerning text recognition, with two key steps highlighted as crucial towards success [4]: text line detection and text line recognition. Text line detection is a technique used in document processing and computer vision to detect and extract text lines from a scanned document [21]. It is often applied as a pre-processing step for Optical Character Recognition (OCR) and Layout Analysis, and has a great impact on the accuracy of each task [6].
In this work, by utilizing a variation of the well-known YOLOv5 [13] Deep Neural Network model (YOLOv5-OBB) and by proposing some simple but efficient modifications applied during training, we can achieve very accurate text line detection results on datasets of high diversity. Moreover, we demonstrate the efficiency of such a workflow on the text recognition task.
The contributions of this paper are as follows: a) We introduce a new dataset, named GTLD, with annotated text line quadrilateral polygons; a smaller subset of 1,642 documents (including annotations on the Tobacco-3482 dataset) is publicly available. b) We show in the experiments that the introduced oriented quadrilateral polygons can improve accuracy on the text line detection task, especially when text line boxes overlap. c) We provide promising results on text line recognition using an end-to-end workflow.
The rest of the paper is organized as follows. Section 2 presents related works, Sect. 3 introduces the proposed method, Sect. 4 demonstrates experimental scenarios and results and Sect. 5 presents the conclusion of this work.
2 Related Work
Text Line Detection. Text line detection methods have been proposed for many years [3, 16, 18, 22], where most techniques focus on historical documents because of their diverse nature. In addition, segmentation of unconstrained handwritten manuscripts is usually the most challenging task to comprehend [7]. In such cases, text line detection is sometimes equivalent to the detection of baselines.
In [10], a two-stage workflow is proposed to detect baselines, where the first stage is a pixel labeling (or goal-oriented binarization) deep hierarchical neural network (ARU-Net), which detects foreground baseline pixels and separators. The second stage then clusters extracted superpixels in order to build baselines. In [2], dhSegment is proposed, where a generic neural network (U-Net [20] based on ResNet50 [12]) enables multi-task detection for page segmentation, layout analysis, line detection and ornament extraction. A similar technique has been proposed in [5], where Doc-UFCN utilizes a light U-shaped Fully Convolutional Network (FCN) without any dense layers for page segmentation and text line detection. In [1], text line detection is achieved by applying document image binarization as a pre-processing step and connected component analysis to localize text blocks and lines. Finally, in [8] the authors propose a Mask-RCNN [11] based architecture for detecting text lines in historical Arabic handwritten documents, where they group labeled results after detecting text lines on image patches.
Text Line Recognition. Recognizing the text in documents has frequently been considered a subsequent step to the text line detection task. This paper focuses on the recognition of Greek polytonic documents, where variations in writing styles and accents pose a very challenging task [9]. In [23], the authors propose a workflow for the recognition of Greek printed polytonic scripts, using a character segmentation step combined with an intensity-based feature extractor and \(k\)-nearest neighbors (KNN) classification. Following the neural network logic and by using only Long Short-Term Memory (LSTM) Networks, the authors of [24] demonstrate some interesting results for Greek polytonic scripts.
Over the years, many methods have been proposed for the Handwriting Text Recognition (HTR) problem, where most approaches focus on Deep Learning techniques [25]. In [17], an attention-based sequence-to-sequence (Seq2Seq) model was proposed, where a Convolutional Neural Network (CNN) with Bidirectional LSTM (BLSTM) networks encode the image into a feature representation. Then, a decoder that utilizes a recurrent layer interprets this information and an attention mechanism enables focusing on the most relevant encoded features.
3 Proposed Method
3.1 Object Detection Model
The overall proposed architecture is provided in Fig. 1. Inspired by the success of the well-known YOLO [19] object detectors, we use a variation of the YOLOv5 model [13], adapted for the detection of oriented quadrilateral polygons [26] (named YOLOv5-OBB for convenience).
As shown in Fig. 1, the backbone is CSP-Darknet53, the Darknet53 architecture combined with the Cross Stage Partial (CSP) network strategy. CSP is applied in order to overcome the problem of redundant gradients by truncating the gradient flow, while preserving the advantage of the residual and dense blocks used in the backbone, which let information flow to the deepest layers and mitigate the vanishing gradient problem. Using CSP, inference speed is increased (fewer FLOPs) while the number of parameters is reduced. Compared to its predecessors, YOLOv5 has a major change in the neck of the architecture: a variant of the Spatial Pyramid Pooling layer feeds a Path Aggregation Network (PANet) in which the CSP strategy has been incorporated via the BottleNeckCSP layer. Finally, the head of the network is composed of three convolutional layers that predict the locations of bounding boxes, as well as the object classes with their scores and the angle estimation.
3.2 Circular Smooth Label for Angle Estimation
On top of the YOLOv5 model, and in order to localize text lines with orientation, this work follows the Circular Smooth Label (CSL) technique proposed in [26] to classify the rotation angle of the object’s bounding polygon. Regression-based angle estimation suffers at the boundaries of the defined range, where ideal predictions can fall outside it. The prediction can be controlled more effectively if the angle is approached as a classification problem, where the angle is treated as a category label and the number of categories is determined by the angle range. However, changing the method from regression to one-hot encoded classification can result in a slight decrease in accuracy.
With the CSL method, the angles are encoded in a circular/repeating pattern (Fig. 1), while windowing functions (pulse, rectangular, triangle and Gaussian) can be used to eliminate boundary problems and obtain a better prediction of the angle class.
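As an illustration, the circular Gaussian label encoding can be sketched as follows (a minimal sketch, assuming 180 one-degree angle classes and the Gaussian window radius of 4 used in our experiments; the function and variable names are ours, not those of the reference implementation):

```python
import numpy as np

def csl_encode(angle, num_classes=180, radius=4):
    """Circular Smooth Label: a Gaussian window centred on the true
    angle class, wrapped circularly so that, e.g., classes 179 and 0
    receive similarly high soft scores."""
    labels = np.arange(num_classes)
    # circular distance between every class and the target angle
    diff = np.abs(labels - angle)
    d = np.minimum(diff, num_classes - diff)
    return np.exp(-(d ** 2) / (2 * radius ** 2))
```

Training against such soft labels penalizes a prediction in proportion to its angular distance from the ground truth, instead of treating every wrong class as equally wrong.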
3.3 Extended Polygonal Labeling
In many cases, ground-truth text lines are strictly annotated and the polygonal edges almost overlap with the characters. Even when this is not the case, we observe that the LSTM layers of a text line recognition system handle edge characters better with looser polygons, as demonstrated in the experiments section. So, by loosening the enclosing polygon, both the line detection and recognition systems improve. For this reason, during training of the YOLOv5-OBB model, we shift every point \(p_i = (x_i,y_i), i \in \{1, 2, 3, 4\}\) of the \(n\)-th text line quadrilateral of width \(w_n\) along the horizontal axis, resulting in the new point \({p_i}^{'} = ({x_i}^{'},{y_i}^{'})\), following Eqs. 1, 2:
where \(j=(i+1) \bmod 4\), \(a=-1\) for the two leftmost points and \(a=+1\) for the two rightmost points. We empirically use \(k=0.03\) as a shift factor relative to the width of the quadrilateral, since we observe that the resulting polygons do not overlap, while starting and ending characters are recognized properly by the text line recognition system. A visualization is given in Fig. 2.
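A simplified sketch of this extension follows (horizontal shift only, with the stated \(a=\pm 1\) and \(k=0.03\); this is a loose interpretation rather than the exact Eqs. 1–2, since it ignores the \(j=(i+1) \bmod 4\) coupling, and all names are ours):

```python
import numpy as np

def extend_quad(quad, k=0.03):
    """Loosen a text-line quadrilateral: push its two leftmost points
    further left and its two rightmost points further right by a
    fraction k of the line width.  quad: (4, 2) array of (x, y)."""
    quad = np.asarray(quad, dtype=float)
    width = quad[:, 0].max() - quad[:, 0].min()
    out = quad.copy()
    order = np.argsort(quad[:, 0])   # point indices from leftmost to rightmost
    out[order[:2], 0] -= k * width   # a = -1 for the two leftmost points
    out[order[2:], 0] += k * width   # a = +1 for the two rightmost points
    return out
```

With \(k=0.03\) and a 100-pixel-wide line, each side is pushed outward by 3 pixels, which leaves a small margin around the first and last characters.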
4 Experiments
4.1 Datasets
Since the detection of text lines can in general be considered a language-independent segmentation task, we combine two datasets totaling 17,133 images. We perform our experiments on the Tobacco-3482 dataset [14] along with a new Greek polytonic dataset named GTLD. The latter is a combination of two heterogeneous collections, with the first subset also containing handwritten documents. An overview of the datasets is given in Table 1. Smaller portions of each dataset (denoted with a “small” suffix in Table 1), with a total of 1,642 images, are publicly available along with the corresponding annotations at the text line level.
Tobacco-3482 Dataset. The Tobacco-3482 [14] dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The dataset has 3482 images and we provide annotations at a text line level.
Greek Text Line Detection Dataset (GTLD). We introduce a new dataset, named GTLD, which contains a total of 13,651 images acquired from two collections. The ShakeIT collection contains images of Greek polytonic documents that originate from almost 140 books of Greek translations of Shakespeare’s plays published between 1842 and 1950. For this work, we use 28 books with a total of 3955 randomly selected pages. The PIOP collection consists of almost 100,000 digital documents from the Historical Archives of the Piraeus Bank Group Cultural Foundation (PIOP). This archive includes a significant number of collections starting from the early 20th century. In this paper, we use a total of 9696 randomly selected pages. Both collections contain mostly Greek polytonic characters, with rare cases of English/German. Exemplar images from both collections are shown in Fig. 3.
4.2 Experimental Results
Text Line Detection: For the text line detection task, we train all models shown in the following for 80–150 epochs (more epochs when using smaller subsets) using the SGD algorithm with a starting learning rate of 0.001 and a batch size equal to 4. We also apply common augmentation techniques during training (random affine transformations, color jitter, flipping, etc.) over the input images. For the Circular Smooth Label, we use a Gaussian kernel with a radius of 4. As a YOLOv5 backbone, we use the YOLOv5-large (YOLOv5l) model, pre-trained on the MSCOCO dataset [15]. We keep the instance of the model that performed best on the validation set. As evaluation metrics, we use recall (R), precision (P), F1-score (F1) and mean Average Precision (mAP), evaluated at mAP@.5 and mAP@.5:.95, following best evaluation practices in the domain of object detection. Table 2 shows the different architectures used in the following experiments, where “Rect” means rectangular polygons, “Quad” means quadrilateral polygons, “Extended” means extended quadrilateral or rectangular polygons and “CSL” means Circular Smooth Label.
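For orientation, precision and recall at an IoU threshold can be computed by greedily matching each prediction to an unmatched ground-truth box. The sketch below uses axis-aligned boxes for brevity; our evaluation actually operates on oriented quadrilaterals, whose IoU computation is more involved, and all names here are of our choosing:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at IoU >= thr."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= thr:
            tp += 1
            matched.add(best_j)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

mAP@.5:.95 then averages the resulting average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which rewards tightly fitting detections.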
Table 3 presents the evaluation of all trained models for each collection. Note that each model has been trained either on all the “small” collections or on all the “large” collections; we do not train a separate model for each collection. At this point, it is important to highlight the following: a) For all small collections, lower metrics are expected, because we use fewer training samples than for the larger ones. b) For all setups, YOLOv5-rect is inferior to the variations proposed in this work. c) YOLOv5-rect-ext is superior to the YOLOv5-rect version, which confirms the benefit of extended bounding polygons (in this case, bounding boxes). d) Application of the CSL technique (YOLOv5-OBB) significantly improves detection results (\(+7.5\%\) on mAP@.5:.95) when compared to YOLOv5-rect. e) Extended Polygonal Labeling further improves on the CSL technique for all the larger collections (\(+0.1\%\), \(+0.3\%\), \(+0.4\%\) on mAP@.5:.95, respectively). For the smaller ones it is useful when more refined results are needed (higher mAP@.5:.95 on PIOP-small and ShakeIT-small with Greek polytonic scripts), but if such detail is not important (Tobacco-3482), extended quadrilaterals may be omitted.
In order to investigate the competitiveness of our proposed method, we evaluate two popular systems on our datasets. The first one, the Google Cloud Vision API, is commercial, while the second one is the well-known open-source OCR engine Tesseract (version 5.0.0). In both cases we evaluate the systems using their default setups. The evaluation results are given in Table 4. It is observed that the Tesseract engine fails to process all collections. One reason is that Tesseract provides only bounding rectangles instead of quadrilaterals, so its predictions suffer from the problems mentioned above. The Google Cloud Vision API performs better and gives more refined results on ShakeIT-small, where the layout of the documents is simpler. On the other hand, our method is superior to both over the more diverse collections of PIOP-small and Tobacco-3482-small.
Text Line Recognition: Finally, in order to demonstrate the robustness of the proposed method, we evaluate our systems indirectly over the subsequent task of text line recognition on Greek polytonic documents. For this reason, we perform the following steps:
-
We train a text line recognition module by utilizing the Calamari-OCR engine. We use about 100,000 annotated text lines (OCR level) from the PIOP and ShakeIT collections. We choose the “htr+” architecture [17].
-
We apply text line detection using four models (YOLOv5-rect, YOLOv5-rect-ext, YOLOv5-OBB and YOLOv5-OBB-ext) trained on the smaller collections.
-
In order to construct the full page OCR from the predictions, we sort the detected text lines using Density-based spatial clustering (DBSCAN) and the coordinates of the extracted clusters.
-
We consider the full page OCR of the 66 ShakeIT-small test images for evaluation. We choose this collection because it has a simpler layout, in order to minimize reading order errors from the previous step.
-
As evaluation metrics, we consider the well-known Character Error Rate (CER) and Word Error Rate (WER).
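The line-ordering step above can be sketched as follows (a dependency-free illustration that substitutes a minimal one-dimensional clustering for DBSCAN; the eps value and all names are hypothetical choices of ours):

```python
import numpy as np

def cluster_rows(y_values, eps):
    """Tiny 1-D DBSCAN substitute: group sorted y-coordinates whose
    consecutive gaps are <= eps, yielding one cluster per text row."""
    order = np.argsort(y_values)
    labels = np.empty(len(y_values), dtype=int)
    cluster = 0
    labels[order[0]] = 0
    for prev, cur in zip(order, order[1:]):
        if y_values[cur] - y_values[prev] > eps:
            cluster += 1
        labels[cur] = cluster
    return labels

def reading_order(centers, eps=10.0):
    """centers: list of (x_center, y_center) of detected text lines.
    Returns line indices sorted top-to-bottom by row, left-to-right
    within each row."""
    centers = np.asarray(centers, dtype=float)
    rows = cluster_rows(centers[:, 1], eps)
    return sorted(range(len(centers)), key=lambda i: (rows[i], centers[i, 0]))
```

The actual implementation runs DBSCAN over the line coordinates; the same idea applies, with clusters ordered vertically and their members ordered horizontally.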
The results of Table 5 demonstrate again that the CSL method with extended quadrilaterals yields the best results and contributes significantly to the text line recognition task. The CER and WER results also validate the robustness of our text line recognition module. Another observation is that the text line recognition workflow proposed above can be considered an end-to-end workflow without any prior processing steps, except for the line detection module combined with a grouping algorithm like DBSCAN for line ordering.
5 Conclusion
In this work, text line detection and recognition have been considered, with a focus on Greek polytonic documents. We showed that by utilizing YOLOv5 with simple techniques such as the CSL method and extended polygonal labeling, there is a significant improvement on the text line detection task, and accuracy results outperform popular commercial and open-source systems. Moreover, by applying this logic to the subsequent task of performing OCR even at the page level, similar improvements are noted, and text line detection (with a line ordering technique) can be the only step before OCR in an end-to-end workflow. Finally, a new dataset was introduced (GTLD), with a smaller subset being publicly available for future works.
References
Ahn, B., Ryu, J., Koo, H.I., Cho, N.I.: Textline detection in degraded historical document images. EURASIP J. Image Video Process. 2017(1), 82 (2017)
Ares Oliveira, S., Seguin, B., Kaplan, F.: dhSegment: a generic deep-learning approach for document segmentation. In: 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp. 7–12. IEEE (2018)
Basu, S., Chaudhuri, C., Kundu, M., Nasipuri, M., Basu, D.: Text line extraction from multi-skewed handwritten documents. Pattern Recogn. 40(6), 1825–1839 (2007)
Boillet, M., Kermorvant, C., Paquet, T.: Robust text line detection in historical documents: learning and evaluation methods. Int. J. Doc. Anal. Recog. (IJDAR) 25(2), 95–114 (2022)
Boillet, M., Kermorvant, C., Paquet, T.: Multiple document datasets pre-training improves text line detection with deep neural networks. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2134–2141 (2021)
Diem, M., Kleber, F., Sablatnig, R.: Text line detection for heterogeneous documents. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 743–747 (2013)
Diem, M., Kleber, F., Sablatnig, R., Gatos, B.: cBAD: ICDAR 2019 competition on baseline detection. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1494–1498 (2019)
Droby, A., Kurar Barakat, B., Alaasam, R., Madi, B., Rabaev, I., El-Sana, J.: Text line extraction in historical documents using Mask R-CNN. Signals 3(3), 535–549 (2022). https://doi.org/10.3390/signals3030032
Gatos, B., et al.: GRPOLY-DB: an old Greek polytonic document image database. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 646–650 (2015)
Grüning, T., Leifert, G., Strauß, T., Michael, J., Labahn, R.: A two-stage method for text line detection in historical documents. Int. J. Doc. Anal. Recog. (IJDAR) 22(3), 285–302 (2019)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Jocher, G., et al.: ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation (2022). https://doi.org/10.5281/zenodo.7347926
Kumar, J., Ye, P., Doermann, D.: Structural similarity for document image classification and retrieval. Pattern Recogn. Lett. 43, 119–126 (2014). iCPR2012 Awarded Papers
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Text line detection in handwritten documents. Pattern Recogn. 41(12), 3758–3772 (2008)
Michael, J., Labahn, R., Grüning, T., Zöllner, J.: Evaluating sequence-to-sequence models for handwritten text recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1286–1293 (2019)
Nicolas, S., Paquet, T., Heutte, L.: Text line segmentation in handwritten document using a production system. In: 9th International Workshop on Frontiers in Handwriting Recognition, pp. 245–250 (2004)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Sahare, P., Dhok, S.B.: Review of text extraction algorithms for scene-text and document images. IETE Tech. Rev. 34(2), 144–164 (2017). https://doi.org/10.1080/02564602.2016.1160805
Shi, Z., Govindaraju, V.: Line separation for complex document images using fuzzy runlength. In: Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL 2004), p. 306. DIAL 2004, IEEE Computer Society, USA (2004)
Sichani, A.M., Kaddas, P., Mikros, G.K., Gatos, B.: OCR for Greek polytonic (multi accent) historical printed documents: development, optimization and quality control. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pp. 9–13. DATeCH2019, Association for Computing Machinery, New York, NY, USA (2019)
Simistira, F., Ul-Hassan, A., Papavassiliou, V., Gatos, B., Katsouros, V., Liwicki, M.: Recognition of historical Greek polytonic scripts using LSTM networks. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 766–770 (2015)
Teslya, N., Mohammed, S.: Deep learning for handwriting text recognition: existing approaches and challenges. In: 2022 31st Conference of Open Innovations Association (FRUCT), pp. 339–346 (2022)
Yang, X., Yan, J.: Arbitrary-oriented object detection with circular smooth label. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 677–694. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_40
Acknowledgment
This research has been partially co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE, project Culdile (Cultural Dimensions of Deep Learning, project code: T1EDK-03785), the Operational Program Attica 2014-2020, under the call RESEARCH AND INNOVATION PARTNERSHIPS IN THE REGION OF ATTICA, project reBook (Digital platform for re-publishing Historical Greek Books, project code: ATTP4-0331172) and the project “Corpus-assisted drama translation research: Shakespeare In Translation - ShakeIT”, under the call of internal research project funding of University of Cyprus (UCY https://www.ucy.ac.cy/directory/en/).
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Kaddas, P., Gatos, B., Palaiologos, K., Christopoulou, K., Kritsis, K. (2023). Text Line Detection and Recognition of Greek Polytonic Documents. In: Coustaty, M., Fornés, A. (eds) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14194. Springer, Cham. https://doi.org/10.1007/978-3-031-41501-2_15