1 Introduction

Segmentation in computer vision is the task of dividing an image into parts that are easier to analyse. Text lines of a handwritten document image are widely used for word segmentation, text recognition and spotting, manuscript alignment and writer recognition. Text lines need to be provided to these applications either by their locations or by the complete set of their pixels. The task of identifying the location of each text line is called detection, whereas the task of determining the pixels of each text line is called extraction. Much research in recent years has focused on text line detection [3, 14, 24]. However, detection defines text lines only loosely, by baselines or main body blobs. Extraction, on the other hand, is a harder task that defines text lines precisely by pixel labels or bounding polygons.

The challenges in text line extraction arise from variations in text line heights and orientations, the presence of overlapping and touching text lines, and diacritical marks in close interline proximity. Deep learning based methods have generally been shown to be effective at detecting text lines with various orientations [14, 22, 25, 30]. However, only a few recent works [8, 30] have addressed the problem of extraction given the detection, and these assume horizontal text lines.

This paper proposes a text line extraction method (FCN+EM) that uses a Fully Convolutional Network (FCN) to detect text lines in the form of blob lines (Fig. 1(b)), followed by an Energy Minimization (EM) function assisted by these blob lines to extract the text lines (Fig. 1(c)). The FCN is capable of handling curved and arbitrarily oriented text lines. Extraction, however, is problematic due to Sayre's paradox [27], which states that the exact boundaries of handwritten text can be defined only after its recognition, while handwritten text can be recognized only after extraction of its boundaries. Nevertheless, humans are good at perceiving the boundaries of text lines written in a language they do not know. We therefore use the EM framework to formulate text line extraction in compliance with human visual perception, with the aid of the Gestalt proximity principle for grouping [17]. The proposed EM formulation is free of any orientation assumption and can handle touching and overlapping text lines with disjoint strokes and close interline proximity (Fig. 1(a)).

Fig. 1. Given a handwritten document image (a), FCN learns to detect blob lines that strike through text lines (b). EM, with the assistance of the detected blob lines, extracts the pixel labels of the text lines, which are in turn enclosed by bounding polygons (c).

The proposed extraction method (FCN+EM) is evaluated on the Visual Media Lab Arabic Handwritten Text line Extraction (VML-AHTE) dataset, the Multiply Oriented and Curved (VML-MOC) dataset [5], and the DIVA Historical Manuscript Database (DIVA-HisDB) [28]. The VML-AHTE dataset is characterized by touching and overlapping text lines with close proximity and rich diacritics. The VML-MOC dataset contains arbitrarily oriented and curved text lines. The DIVA-HisDB dataset exhibits varying text line heights and touching text lines.

The rest of the paper is organized as follows. Related work is discussed in Sect. 2, and the datasets are described in Sect. 3. The method is then presented in Sect. 4, and the experimental evaluation and results are provided in Sect. 5. Finally, Sect. 6 draws conclusions and outlines future work.

2 Related Work

A text line is a set of image elements, such as pixels or connected components. Text line components in a document image can be represented using basic geometric primitives such as points, lines, polylines, polygons or blobs. The text line representation is given as an input to other document image processing algorithms; it is therefore important that it be complete and correct.

There are two main approaches to representing text lines: text line detection and text line extraction. Text line detection finds the lines, polylines or blobs that represent the locations of a spatially aligned set of text line elements. A detected line or polyline is called a baseline [14, 24] if it joins the lower parts of the character main bodies, and a separator path [8, 26] if it follows the space between two consecutive text lines. Detected blobs [3] that cover the character main bodies in a text line are called text line blobs.

Text line extraction determines the constituent pixels of, or the polygons around, the spatially aligned text line elements. Pixel labeling assigns the same label to all the pixels of a text line [9, 26, 30]. A bounding polygon encloses all the elements of a text line together with its neighbouring background pixels [11, 15]. Most extraction methods assume horizontally parallel text lines with constant heights, whereas some methods [2, 5] are more generic.

Recent deep learning methods estimate the x-height of text lines using an FCN and apply Line Adjacency Graphs (LAG) to post-process the FCN output and split touching lines [20, 21]. Renton et al. [24, 25] also use an FCN to predict the x-height of text lines. Kurar et al. [3] applied an FCN to challenging manuscript images with multi-skewed, multi-directed and curved handwritten text lines. However, these methods either perform only text line detection, or their extraction phase is inappropriate for unstructured text lines because it assumes horizontal text lines with constant heights. The proposed method is designed for complex layouts in both its detection and extraction phases.

The ICDAR 2009 [12] and ICDAR 2013 [29] datasets are commonly used for evaluating text line extraction methods, and the ICDAR 2017 [10] dataset is used for evaluating text line detection methods. The DIVA-HisDB dataset [28] is used for both types of evaluation, detection and extraction, and we therefore chose it since it provides ground truth for both. However, this dataset alone is not representative enough of all segmentation problems to evaluate a generic method. Hence, we also evaluated the proposed method on the publicly available VML-MOC dataset [5], which contains multiply oriented and curved text lines with heterogeneous heights, and on the VML-AHTE dataset, which contains crowded diacritics.

3 Datasets

We evaluated the proposed method on three publicly available handwritten datasets, which together demonstrate the generality of our method: the VML-AHTE dataset contains lines with crowded diacritics, the VML-MOC dataset contains multiply oriented and curved lines, and the DIVA-HisDB dataset contains consecutively touching lines. In this section we present these datasets.

3.1 VML-AHTE

The VML-AHTE dataset is a collection of 30 binary document images selected from several manuscripts (Fig. 2). It is a newly published dataset and is available online for download (Footnote 1). The dataset is split into 20 train pages and 10 test pages. Its ground truth is provided in three formats: bounding polygons in PAGE XML [23] format, color pixel labels and DIVA pixel labels [28].

Fig. 2. Some samples of the challenges in the VML-AHTE dataset.

3.2 DIVA-HisDB

The DIVA-HisDB dataset [28] contains 150 pages from 3 medieval manuscripts: CB55, CSG18 and CSG863 (Fig. 3). Each book has 20 train pages and 10 test pages. Among them, CB55 is characterized by a vast number of touching characters. The ground truth is provided in three formats: baselines and bounding polygons in PAGE XML [23] format, and DIVA pixel labels [28].

Fig. 3. The DIVA-HisDB dataset contains 3 manuscripts: CB55, CSG18 and CSG863. Notice the touching characters among multiple consecutive text lines in CB55.

3.3 VML-MOC

The VML-MOC dataset [5] is a multiply oriented and curved handwritten text line dataset that is publicly available (Footnote 2). Its text lines are side notes added by various scholars over the years on the page margins, in arbitrary orientations and curvy forms due to space constraints (Fig. 4). The dataset contains 30 binary document images and is divided into 20 train pages and 10 test pages. The ground truth is provided in three formats: bounding polygons in PAGE XML [23] format, color pixel labels and DIVA pixel labels [28].

Fig. 4. The VML-MOC dataset contains only binarized side notes, with arbitrary orientations and curvy forms.

4 Method

We present a method (FCN+EM) for text line detection together with extraction, and show its effectiveness on handwritten document images. In the first phase, the method uses an FCN to densely predict the pixels of the blob lines that strike through the text lines (Fig. 1(b)). In the second phase, we use an EM framework to extract the pixel labels of the text lines with the assistance of the detected blob lines (Fig. 1(c)). In the rest of this section we give a detailed description of FCN and EM, and of how they are used for text line detection and text line extraction.

4.1 Text Line Detection Using FCN

A Fully Convolutional Network (FCN) is an end-to-end semantic segmentation algorithm that extracts the features and learns the classifier function simultaneously. The FCN takes as input the original images and their pixel-level annotations, and learns a hypothesis function that predicts whether a pixel belongs to a text line label or not. A crucial decision has to be made about the representation of text line detection: the detection labels can be represented as baselines or as blob lines.

We use blob line labeling that connects the characters in the same line while disregarding diacritics and touching components among the text lines. Blob line labeling for the VML-AHTE and DIVA-HisDB datasets is automatically generated using the skeletons of the bounding polygons provided by their ground truth (Fig. 5(d)). Blob line labeling for the VML-MOC dataset is manually drawn using a sharp rectangular brush with a diameter of 12 pixels (Fig. 5(b)).
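For illustration, the automatic label generation might be sketched as follows, assuming scikit-image: the skeleton of each ground-truth polygon mask is thickened into a blob line. The dilation radius here is our assumption for illustration, not a value specified by the ground truth.

```python
import numpy as np
from skimage.morphology import skeletonize, dilation, disk

def blob_line_labels(polygon_masks: list[np.ndarray]) -> np.ndarray:
    """Generate a binary blob line label image from ground-truth
    bounding polygon masks (one boolean mask per text line).

    The skeleton of each polygon traces the text line's spine;
    dilating it produces a blob line. The brush radius is an
    assumption for illustration.
    """
    out = np.zeros_like(polygon_masks[0], dtype=bool)
    for mask in polygon_masks:
        spine = skeletonize(mask)          # 1-pixel-wide polygon skeleton
        out |= dilation(spine, disk(6))    # thicken spine into a blob line
    return out
```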

Fig. 5. Sample patches from document images of VML-MOC (a) and VML-AHTE (c). Blob line labeling for VML-AHTE and DIVA-HisDB is generated automatically (d). Blob line labeling for VML-MOC is manually drawn using a paint brush with a diameter of 12 pixels (b).

FCN Architecture. The FCN architecture (Fig. 6) we use is based on the FCN8 proposed for semantic segmentation [19]. The FCN8 architecture was selected because it has been successful in page layout analysis of handwritten documents [4]. It consists of an encoder and a decoder. The encoder downsamples the input image, so that the filters see coarser information with a larger receptive field. The decoder then adds the final layer of the encoder to the lower layers with finer information, and upsamples the combined layer back to the input size. The default input size is \(224\times 224\), which does not cover more than 2 to 3 text lines. To include more context, we changed the input size to \(350\times 350\) pixels. We also changed the number of output channels to 2, which is the number of classes: blob line or not.

Fig. 6. The FCN architecture used for text line detection. Vertical lines show the convolutional layers. Grids show the relative coarseness of the pooling and prediction layers. FCN8 upsamples the final layer by a factor of 4 and the pool4 layer by a factor of 2, and combines them with the pool3 layer. Finally, FCN8 upsamples the combination to the input size.
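For concreteness, the skip combination in Fig. 6 can be sketched in PyTorch as follows. This is a minimal illustration rather than the exact network of [19]: the encoder backbone and the channel sizes c3, c4 and c5 are assumptions, and only the fusion and upsampling logic follows the figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCN8Head(nn.Module):
    """Minimal FCN8-style decoder head (sketch).

    Fuses the coarse, high-level encoder output with the finer
    pool4/pool3 features, then upsamples the fusion back to the
    input resolution. Channel sizes depend on the assumed backbone.
    """
    def __init__(self, c3: int, c4: int, c5: int, num_classes: int = 2):
        super().__init__()
        # 1x1 convolutions produce per-class score maps at each scale;
        # num_classes = 2 for the two classes: blob line or not.
        self.score5 = nn.Conv2d(c5, num_classes, kernel_size=1)
        self.score4 = nn.Conv2d(c4, num_classes, kernel_size=1)
        self.score3 = nn.Conv2d(c3, num_classes, kernel_size=1)

    def forward(self, pool3, pool4, pool5, input_size):
        # Upsample the final (coarsest) layer to pool4 resolution, fuse.
        s = self.score4(pool4) + F.interpolate(
            self.score5(pool5), size=pool4.shape[2:],
            mode="bilinear", align_corners=False)
        # Upsample the fusion to pool3 resolution and fuse again.
        s = self.score3(pool3) + F.interpolate(
            s, size=pool3.shape[2:], mode="bilinear", align_corners=False)
        # Upsample the combination back to the 350x350 input size.
        return F.interpolate(s, size=input_size,
                             mode="bilinear", align_corners=False)
```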

FCN Training. For training, we randomly crop 50,000 patches of size \(350\times 350\) from the inverted binary images of the documents, together with the corresponding labels from the blob line label images (Fig. 5(b)). We adopted this patch size due to memory limitations: using full pages for training and prediction is not feasible on non-specialized systems without resizing the pages to a more manageable size, and resizing leads to a loss of detail, which usually reduces the accuracy of the segmentation results.

The FCN was trained with a batch size of 12, using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a learning rate of 0.001. The encoder part of the FCN was initialized with its publicly available pre-trained weights.
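A minimal sketch of this training setup, assuming PyTorch: `model` stands for the full FCN (encoder plus a head such as the one sketched above), and `pages`/`labels` for the lists of inverted binary page images and their blob line label images. All of these names are placeholders for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

def random_patches(pages, labels, size=350, batch=12):
    """Yield random (patch, label) batches from inverted binary page
    images and their blob line label images (2-D uint8 arrays)."""
    while True:
        xs, ys = [], []
        for _ in range(batch):
            k = np.random.randint(len(pages))
            h, w = pages[k].shape
            y = np.random.randint(0, h - size)
            x = np.random.randint(0, w - size)
            xs.append(pages[k][y:y + size, x:x + size])
            ys.append(labels[k][y:y + size, x:x + size])
        yield (torch.from_numpy(np.stack(xs)).float().unsqueeze(1),
               torch.from_numpy(np.stack(ys)).long())

# Hyperparameters as stated in the text; `model` is the assumed full FCN.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()            # 2 classes: blob line or not

for step, (patch, target) in enumerate(random_patches(pages, labels)):
    optimizer.zero_grad()
    loss = criterion(model(patch), target)   # model outputs (B, 2, H, W)
    loss.backward()
    optimizer.step()
    if step * 12 >= 50_000:                  # 50,000 patches as in the text
        break
```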

FCN Testing. During testing, a sliding window of size \(350\times 350\) is used for prediction, but only its inner window of size \(250\times 250\) is kept, in order to eliminate edge effects. A page is padded with black pixels on its right and bottom sides if its size is not an integer multiple of the sliding window size, and additionally on all 4 sides so that only the central part of each sliding window is considered.
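A minimal sketch of this sliding-window prediction, assuming NumPy; `predict_patch` is a placeholder for the trained FCN applied to a single patch (e.g., a forward pass followed by an argmax over the two classes).

```python
import numpy as np

def predict_page(page: np.ndarray, predict_patch, win: int = 350,
                 inner: int = 250) -> np.ndarray:
    """Sliding-window prediction keeping only the central inner window.

    `predict_patch` is an assumed callable mapping a (win, win) patch
    to a (win, win) label map.
    """
    margin = (win - inner) // 2                     # 50 px on each side
    h, w = page.shape
    # Pad to a multiple of the inner window, plus a margin on all 4
    # sides, with black (zero) pixels as described in the text.
    ph = int(np.ceil(h / inner)) * inner
    pw = int(np.ceil(w / inner)) * inner
    padded = np.zeros((ph + 2 * margin, pw + 2 * margin), dtype=page.dtype)
    padded[margin:margin + h, margin:margin + w] = page
    out = np.zeros((ph, pw), dtype=np.uint8)
    for y in range(0, ph, inner):
        for x in range(0, pw, inner):
            pred = predict_patch(padded[y:y + win, x:x + win])
            # Keep only the central inner window of the prediction.
            out[y:y + inner, x:x + inner] = pred[margin:margin + inner,
                                                 margin:margin + inner]
    return out[:h, :w]
```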

4.2 Text Line Extraction Using EM

We adopt the energy minimization (EM) framework [6], which uses graph cuts to approximate the minima of arbitrary functions. These functions can be formulated in terms of image elements, such as pixels or connected components. In this section we formulate a general function for text line extraction using the text line detection, and then adapt this general function to work with connected components.

EM Formulation. Let \(\mathcal {L}\) be the set of binary blob lines, and \(\mathcal {E}\) be the set of elements in the binary document image. Energy minimization finds a labeling f that assigns each element \(e\in \mathcal {E}\) to a label \(\ell _e\in \mathcal {L}\), such that the energy function \(\mathbf{E} (f)\) attains its minimum:

$$\begin{aligned} \mathbf{E} (f) = \sum _{e\in {\mathcal {E}}}D(e, \ell _e)+\sum _{\{e,e'\}\in \mathcal {N}}d(e, e')\cdot \delta (\ell _e \ne \ell _{e'}) \end{aligned}$$
(1)

The term D is the data cost, d is the smoothness cost, and \(\delta \) is an indicator function. The data cost is the cost of assigning element e to label \(\ell _e\); \(D(e, \ell _e)\) is defined as the Euclidean distance between the centroid of element e and the pixel of blob line \(\ell _e\) nearest to that centroid. The smoothness cost is the cost of assigning neighbouring elements to different labels. Let \(\mathcal {N}\) be the set of nearest element pairs. Then \(\forall \{e,e'\}\in \mathcal {N}\),

$$\begin{aligned} d(e,e') = \exp ({-\beta \cdot d_e(e,e')}) \end{aligned}$$
(2)

where \(d_e(e,e')\) is the Euclidean distance between the centroids of the elements e and \(e'\), and \(\beta \) is defined as

$$\begin{aligned} \beta =(2\langle {d_e(e,e')}\rangle )^{-1} \end{aligned}$$
(3)

where \(\langle \cdot \rangle \) denotes expectation over all pairs of neighbouring elements [7] in a document page image, and \(\delta (\ell _e \ne \ell _{e'})\) equals 1 if the condition inside the parentheses holds and 0 otherwise.

EM Adaptation to Connected Components. The presented method extracts text lines using the results of the text line detection by the FCN. Extraction-level representation labels each pixel of the text lines in a document image. The major difficulty in pixel labeling lies in the computational cost: a typical document image in the experimented datasets contains around 14,000,000 pixels. For this reason, we adapt the energy function (Eq. 1) to operate on connected components for the extraction of text lines.

The data cost of the adapted function measures how appropriate a label is for the component e, given the blob lines \(\mathcal {L}\). The data cost alone would amount to clustering the components with their nearest blob line. However, simple nearest neighbour clustering would fail to correctly label the free components that are disconnected from the blob lines (Fig. 7).

Fig. 7. Segmented samples that show the necessity of the smoothness cost for text line extraction. Samples in the first row are correct and were achieved with the smoothness cost. Samples in the second row are wrong and were caused by the lack of a smoothness cost. Notice that the smoothness cost pulls the diacritics to the components they belong to, in spite of their proximity to the wrong blob line.

A free component tends to lie closer to the components of the line it belongs to, but can be the nearest neighbour of a blob line that it does not belong to. The proximity grouping strength decays exponentially with Euclidean distance [18]; this phenomenon is formulated by the smoothness cost (Eq. 2). Semantically, this means that closer components have a higher probability of having the same label than distant components. Hence, the competition between the data cost and the smoothness cost forces free components to be labeled spatially coherently with their neighbouring components.
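To make the adaptation concrete, the following is a minimal sketch, assuming NumPy, SciPy and OpenCV, of how the data cost (distance from each component centroid to the nearest blob line pixel) and the smoothness cost (Eqs. 2 and 3) could be built over connected components. Defining the neighbour pairs as the k nearest centroids is our assumption; the final labeling would then be obtained by minimizing Eq. 1 with a graph cut optimizer such as alpha-expansion [6], which is omitted here.

```python
import numpy as np
import cv2
from scipy.spatial import cKDTree

def build_costs(binary_page: np.ndarray, blob_lines: list[np.ndarray]):
    """Data and smoothness costs over connected components (Eqs. 1-3).

    binary_page: 2-D uint8 array, ink pixels > 0.
    blob_lines:  one binary mask per detected blob line (same shape).
    """
    # Elements: connected components of the page, via their centroids.
    n, _, _, cents = cv2.connectedComponentsWithStats(binary_page)
    centroids = cents[1:]                 # drop the background component

    # Data cost: Euclidean distance from each centroid to the nearest
    # pixel of each blob line (one KD-tree per blob line).
    data = np.zeros((len(centroids), len(blob_lines)))
    for j, line in enumerate(blob_lines):
        ys, xs = np.nonzero(line)
        tree = cKDTree(np.column_stack([xs, ys]))   # (x, y) like centroids
        data[:, j], _ = tree.query(centroids)

    # Neighbour pairs: k nearest centroids (an assumption; the paper
    # specifies only "nearest element pairs").
    k = 4
    tree = cKDTree(centroids)
    dists, idx = tree.query(centroids, k=k + 1)     # first hit is self
    pairs = [(i, j) for i, row in enumerate(idx) for j in row[1:]]
    pair_d = dists[:, 1:].ravel()

    # Smoothness: exponential decay with beta = (2 * <d_e>)^-1 (Eq. 3).
    beta = 1.0 / (2.0 * pair_d.mean())
    smooth = np.exp(-beta * pair_d)                 # one weight per pair
    return data, pairs, smooth
```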

5 Experiments

We experiment with three datasets that differ in terms of the text line segmentation challenges they contain. The VML-AHTE dataset exhibits crowded diacritics and cramped text lines, whereas the DIVA-HisDB dataset contains consecutively touching text lines. Completely different from them, VML-MOC exhibits challenges caused by arbitrarily skewed and curved text lines. The performance is measured using the line segmentation evaluation metrics of ICDAR 2013 [13] and ICDAR 2017 [1].

5.1 ICDAR 2013 Line Segmentation Evaluation Metrics

The ICDAR 2013 metrics calculate the recognition accuracy (RA), detection rate (DR) and F-measure (FM). Given a set of image points I, let \(R_i\) be the set of points inside the \(i^{th}\) result region, \(G_j\) be the set of points inside the \(j^{th}\) ground truth region, and T(p) a function that counts the points inside the set p. Then MatchScore(i, j) is calculated by Eq. 4:

$$\begin{aligned} MatchScore(i,j) = \frac{T(G_{j}\cap R_{i})}{T(G_{j}\cup R_{i})} \end{aligned}$$
(4)

The evaluator considers a region pair (i, j) a one-to-one match if MatchScore(i, j) is equal to or above a threshold, which we set to 90%. Let \(N_1\) and \(N_2\) be the numbers of ground truth and output elements, respectively, and let M be the number of one-to-one matches. The evaluator calculates DR, RA and FM as follows:

$$\begin{aligned} DR = \frac{M}{N_1} \end{aligned}$$
(5)
$$\begin{aligned} RA = \frac{M}{N_2} \end{aligned}$$
(6)
$$\begin{aligned} FM=\frac{2\times DR\times RA}{DR+RA} \end{aligned}$$
(7)
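As a concrete illustration of Eqs. 4-7, the following sketch computes these metrics from regions represented as sets of pixel coordinates; the set-based representation and the simplified one-to-one matching are our assumptions, not the exact evaluator of [13].

```python
def match_score(g: set, r: set) -> float:
    """MatchScore(i, j) = |G_j ∩ R_i| / |G_j ∪ R_i| (Eq. 4)."""
    return len(g & r) / len(g | r)

def icdar2013_metrics(gt_regions, result_regions, threshold=0.9):
    """DR, RA and FM from one-to-one matches (Eqs. 5-7).

    Simplification: each ground truth region is assumed to match at
    most one result region at this threshold.
    """
    m = sum(1 for g in gt_regions
            if any(match_score(g, r) >= threshold for r in result_regions))
    dr = m / len(gt_regions)        # detection rate, Eq. 5
    ra = m / len(result_regions)    # recognition accuracy, Eq. 6
    fm = 2 * dr * ra / (dr + ra)    # F-measure, Eq. 7
    return dr, ra, fm
```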

5.2 ICDAR 2017 Line Segmentation Evaluation Metrics

The ICDAR 2017 metrics are based on the Intersection over Union (IU). An IU score is computed for each possible pair of ground truth (GT) polygons and prediction (P) polygons as follows:

$$\begin{aligned} IU=\frac{IP}{UP} \end{aligned}$$
(8)

IP denotes the number of intersecting foreground pixels of the pair of polygons, and UP denotes the number of foreground pixels in the union of the foreground pixels of the pair. The pairs with the maximum IU score are selected as the matching pairs of GT and P polygons. Then, the pixel IU and line IU are calculated over these matching pairs. For each matching pair, line TP is the number of foreground pixels that are correctly predicted, line FP is the number of foreground pixels that are falsely predicted, and line FN is the number of false negative foreground pixels. Accordingly, the pixel IU is:

$$\begin{aligned} \text {Pixel } IU=\frac{TP}{TP+FP+FN} \end{aligned}$$
(9)

where TP is the global sum of line TPs, FP is the global sum of line FPs, and FN is the global sum of line FNs.

Line IU is measured at the line level. For each matching pair, the line precision and line recall are:

$$\begin{aligned} \text {Line precision}=\frac{\text {line } TP}{\text {line } TP + \text {line } FP} \end{aligned}$$
(10)
$$\begin{aligned} \text {Line recall}=\frac{\text {line } TP}{\text {line } TP + \text {line } FN} \end{aligned}$$
(11)

Accordingly, line IU is:

$$\begin{aligned} \text {Line } IU=\frac{CL}{CL+ML+EL} \end{aligned}$$
(12)

where CL is the number of correct lines, ML is the number of missed lines, and EL is the number of extra lines.

For each matching pair, a line is correct if both the line precision and the line recall are above a threshold value; a line is missed if the line recall is below the threshold; and a line is extra if the line precision is below the threshold.
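The line-level bookkeeping can be illustrated with the following sketch; the threshold value here is a placeholder, as the competition's own setting is defined in [1].

```python
def line_iu(pairs, threshold=0.75):
    """Classify matching pairs into correct/missed/extra lines (Eq. 12).

    `pairs` holds (line_tp, line_fp, line_fn) pixel counts per matching
    pair; the threshold value is an assumption for illustration.
    """
    cl = ml = el = 0
    for tp, fp, fn in pairs:
        precision = tp / (tp + fp)   # Eq. 10
        recall = tp / (tp + fn)      # Eq. 11
        if precision > threshold and recall > threshold:
            cl += 1                  # correct line
        if recall < threshold:
            ml += 1                  # missed line
        if precision < threshold:
            el += 1                  # extra line
    return cl / (cl + ml + el)       # line IU, Eq. 12
```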

5.3 Results on VML-AHTE Dataset

Since VML-AHTE and VML-MOC are recently published datasets, we also ran two other supervised methods on them. The first is a holistic method that extracts text lines in a single phase and is based on instance segmentation using MRCNN [16]. The second runs the EM framework using the blob line labels from the ground truth; we refer to it as Human+EM. On the VML-AHTE dataset, FCN+EM outperforms all the other methods in terms of all the metrics except Line IU. It successfully splits the touching text lines and assigns the disjoint strokes to the correct text lines (Table 1).

Table 1. Results on VML-AHTE dataset
Fig. 8. Example of generated curved lines: (a) shows the original straight lines section, (b) is the result of warping (a) by 90\(^{\circ }\) in the middle to generate the curved lines, and (c) is the mirrored image of (b) in the vertical direction.

5.4 Results on VML-MOC Dataset

The VML-MOC dataset contains both straight and curved text lines. The number of straight text lines is substantially greater than the number of curved text lines. This imbalance causes the FCN to overfit on the straight text lines, which in turn leads to fragmented blob lines when predicting on the curved text lines. To compensate for this imbalance, we therefore generated images containing artificially curved text lines: we selected the document image parts with straight lines and warped them by 90\(^{\circ }\) from their middle. Furthermore, each of the warped images was mirrored in the horizontal and vertical directions, yielding curved lines in four directions. Figure 8 illustrates this procedure. The FCN+EM trained with the augmented curved text lines (FCN+EM+Aug) outperforms the FCN+EM trained only on the training set (Table 2), but FCN+EM+Aug still underperforms a learning-free algorithm [5].

Table 2. Results on VML-MOC dataset

5.5 Results on DIVA-HisDB Dataset

We compare our results with the results of Task-3 of the ICDAR 2017 competition on layout analysis for medieval manuscripts [28]. Task-3's scope of interest is only the main text lines, not the interlinear glosses. We removed these glosses prior to all our experiments using the ground truth; note that the Task-3 participants removed them using their own algorithms.

Table 3 compares our methods with the participants of the ICDAR 2017 competition on layout analysis for challenging medieval manuscripts in text line extraction. FCN+EM obtains a perfect Line IU score on the books CSG863 and CB55. Its Pixel IU is on par with the best performing method in the competition.

Table 3. Comparison with the Task-3 results of the ICDAR 2017 competition on layout analysis for challenging medieval manuscripts [28].

5.6 Discussion

An observable pattern in the results is the parallel behaviour of the line IU and pixel IU values, while the RA values fluctuate relative to the DR values. Clearly, such counter-intuitive behaviour of a metric is not preferable in terms of the interpretability of the results. In addition, the ICDAR 2017 evaluator cannot handle cases where a text line consists of multiple polygons. Such cases arise in the MRCNN results: MRCNN segments a text line instance correctly but represents it as multiple polygons with the same label. Evaluating the MRCNN results in their raw form therefore yields unfairly low values (Fig. 9), because the ICDAR 2017 evaluator computes an IU score for each possible pair of ground truth polygons and prediction polygons and then selects the pairs with the maximum IU score as the matching pairs. Consequently, a text line represented by multiple polygons is considered only by its largest polygon.

Fig. 9. The MRCNN method correctly predicts text line pixels, but its results are not fairly evaluated due to the disconnected polygons.

6 Conclusion

This paper presents a supervised text line segmentation method, FCN+EM. The FCN detects the blob lines that strike through the text lines, and the EM extracts the pixels of the text lines with the guidance of the detected blob lines. FCN+EM makes no assumption about text line orientation or height. The algorithm is very effective in detecting cramped, crowded and touching text lines. It shows superior performance on the VML-AHTE and DIVA-HisDB datasets and comparable results on the VML-MOC dataset.