1 Introduction

Segmentation in computer vision is the task of dividing an image into parts that are easier to analyse. Text lines of a handwritten document image are widely used for word segmentation, text recognition and spotting, manuscript alignment and writer recognition. Text lines need to be provided to these applications either by their locations or by the complete set of their pixels. The task of identifying the location of each text line is called detection, whereas the task of determining the pixels of each text line is called extraction. Much research in recent years has focused on text line detection [3, 14, 24]. However, detection defines text lines only loosely, by baselines or main body blobs. Extraction, on the other hand, is a harder task that defines text lines precisely by pixel labels or bounding polygons.

The challenges in text line extraction arise from variations in text line heights and orientations, the presence of overlapping and touching text lines, and diacritical marks in close interline proximity. Deep learning based methods have generally been shown to be effective at detecting text lines with various orientations [14, 22, 25, 30]. However, only a few recent works [8, 30] have addressed the problem of extraction given the detection, and these assume horizontal text lines.

This paper proposes a text line extraction method (FCN+EM) that uses a Fully Convolutional Network (FCN) to detect text lines in the form of blob lines (Fig. 1(b)), followed by an Energy Minimization (EM) function assisted by these blob lines to extract the text lines (Fig. 1(c)). The FCN is capable of handling curved and arbitrarily oriented text lines. Extraction, however, is problematic due to Sayre's paradox [27], which states that the exact boundaries of handwritten text can be defined only after its recognition, while handwritten text can be recognized only after extraction of its boundaries. Nevertheless, humans are good at perceiving the boundaries of text lines written in a language they do not know. We therefore use the EM framework to formulate text line extraction in compliance with human visual perception, with the aid of the Gestalt proximity principle for grouping [17]. The proposed EM formulation is free of any orientation assumption and can handle touching and overlapping text lines with disjoint strokes and close interline proximity (Fig. 1(a)).

Fig. 1. Given a handwritten document image (a), FCN learns to detect blob lines that strike through text lines (b). EM, with the assistance of the detected blob lines, extracts the pixel labels of the text lines, which are in turn enclosed by bounding polygons (c).

The proposed extraction method (FCN+EM) is evaluated on the Visual Media Lab Arabic Handwritten Text line Extraction (VML-AHTE) dataset, the Multiply Oriented and Curved (VML-MOC) dataset [5], and the DIVA Historical Manuscript Database (DIVA-HisDB) [28]. The VML-AHTE dataset is characterized by touching and overlapping text lines with close proximity and rich diacritics. The VML-MOC dataset contains arbitrarily oriented and curved text lines. The DIVA-HisDB dataset exhibits varying text line heights and touching text lines.

The rest of the paper is organized as follows. Related work is discussed in Sect. 2, and the datasets are described in Sect. 3. The method is then presented in Sect. 4, and the experimental evaluation and results are provided in Sect. 5. Finally, Sect. 6 draws conclusions and outlines future work.

2 Related Work

A text line is a set of image elements, such as pixels or connected components. Text line components in a document image can be represented using basic geometric primitives such as points, lines, polylines, polygons or blobs. The text line representation is given as an input to other document image processing algorithms; it is therefore important that it be complete and correct.

There are two main approaches to representing text lines: text line detection and text line extraction. Text line detection finds the lines, polylines or blobs that represent the locations of a spatially aligned set of text line elements. A detected line or polyline is called a baseline [14, 24] if it joins the lower parts of the character main bodies, and a separator path [8, 26] if it follows the space between two consecutive text lines. Detected blobs [3] that cover the character main bodies in a text line are called text line blobs.

Text line extraction determines the constituent pixels of, or the polygons around, the spatially aligned text line elements. Pixel labeling assigns the same label to all the pixels of a text line [9, 26, 30]. A bounding polygon encloses all the elements of a text line together with its neighbouring background pixels [11, 15]. Most extraction methods assume horizontally parallel text lines with constant heights, whereas some methods [2, 5] are more generic.

Recent deep learning methods estimate the x-height of text lines using an FCN and apply Line Adjacency Graphs (LAG) to post-process the FCN output and split touching lines [20, 21]. Renton et al. [24, 25] also use an FCN to predict the x-height of text lines. Kurar et al. [3] applied an FCN to challenging manuscript images with multi-skewed, multi-directed and curved handwritten text lines. However, these methods either perform only text line detection, or their extraction phase is inappropriate for unstructured text lines because it assumes horizontal text lines with constant heights. The proposed method is designed for complex layouts in both its detection and extraction phases.

The ICDAR 2009 [12] and ICDAR 2013 [29] datasets are commonly used for evaluating text line extraction methods, and the ICDAR 2017 [10] dataset is used for evaluating text line detection methods. The DIVA-HisDB dataset [28] is used for both types of evaluation, detection and extraction, and we therefore chose it since it provides ground truth for both. However, this dataset alone is not representative enough of all segmentation problems to evaluate a generic method. Hence, we also evaluated the proposed method on the publicly available VML-MOC dataset [5], which contains multiply oriented and curved text lines with heterogeneous heights, and on the VML-AHTE dataset, which contains crowded diacritics.

3 Datasets

We evaluated the proposed method on three publicly available handwritten datasets, which together demonstrate the generality of our method: the VML-AHTE dataset contains lines with crowded diacritics, the VML-MOC dataset contains multiply oriented and curved lines, and the DIVA-HisDB dataset contains consecutively touching lines. In this section we present these datasets.

3.1 VML-AHTE

The VML-AHTE dataset is a collection of 30 binary document images selected from several manuscripts (Fig. 2). It is a newly published dataset and is available online for download (Footnote 1). The dataset is split into 20 train pages and 10 test pages. Its ground truth is provided in three formats: bounding polygons in PAGE XML [23] format, color pixel labels and DIVA pixel labels [28].

Fig. 2. Some samples of the challenges in the VML-AHTE dataset.

3.2 DIVA-HisDB

The DIVA-HisDB dataset [28] contains 150 pages from 3 medieval manuscripts: CB55, CSG18 and CSG863 (Fig. 3). Each book has 20 train pages and 10 test pages. Among them, CB55 is characterized by a vast number of touching characters. The ground truth is provided in three formats: baselines and bounding polygons in PAGE XML [23] format, and DIVA pixel labels [28].

Fig. 3. The DIVA-HisDB dataset contains 3 manuscripts: CB55, CSG18 and CSG863. Notice the touching characters among multiple consecutive text lines in CB55.

3.3 VML-MOC

The VML-MOC dataset [5] is a multiply oriented and curved handwritten text line dataset that is publicly available (Footnote 2). Its text lines are side notes added by various scholars over the years on the page margins, in arbitrary orientations and curvy forms due to space constraints (Fig. 4). The dataset contains 30 binary document images and is divided into 20 train pages and 10 test pages. The ground truth is provided in three formats: bounding polygons in PAGE XML [23] format, color pixel labels and DIVA pixel labels [28].

Fig. 4. The VML-MOC dataset contains only binarized side notes, with arbitrary orientations and curvy forms.

4 Method

We present a method (FCN+EM) for text line detection together with extraction, and show its effectiveness on handwritten document images. In the first phase, the method uses an FCN to densely predict the pixels of the blob lines that strike through the text lines (Fig. 1(b)). In the second phase, we use an EM framework to extract the pixel labels of the text lines with the assistance of the detected blob lines (Fig. 1(c)). In the rest of this section we give a detailed description of FCN and EM, and of how they are used for text line detection and text line extraction.

4.1 Text Line Detection Using FCN

A Fully Convolutional Network (FCN) is an end-to-end semantic segmentation algorithm that extracts the features and learns the classifier function simultaneously. The FCN takes as input the original images and their pixel-level annotations, and learns a hypothesis function that predicts whether a pixel belongs to a text line label or not. A crucial decision has to be made about the representation of text line detection: the detection labels can be represented as baselines or as blob lines.

We use blob line labeling that connects the characters in the same line while disregarding diacritics and touching components among the text lines. Blob line labeling for the VML-AHTE and DIVA-HisDB datasets is automatically generated using the skeletons of the bounding polygons provided by their ground truth (Fig. 5(d)). Blob line labeling for the VML-MOC dataset is manually drawn using a sharp rectangular brush with a diameter of 12 pixels (Fig. 5(b)).
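For illustration, the automatic label generation might be sketched as follows, assuming scikit-image: the skeleton of each ground-truth polygon mask is thickened into a blob line. The dilation radius here is our assumption for illustration, not a value specified by the ground truth.

```python
import numpy as np
from skimage.morphology import skeletonize, dilation, disk

def blob_line_labels(polygon_masks: list[np.ndarray]) -> np.ndarray:
    """Generate a binary blob line label image from ground-truth
    bounding polygon masks (one boolean mask per text line).

    The skeleton of each polygon traces the text line's spine;
    dilating it produces a blob line. The brush radius is an
    assumption for illustration.
    """
    out = np.zeros_like(polygon_masks[0], dtype=bool)
    for mask in polygon_masks:
        spine = skeletonize(mask)          # 1-pixel-wide polygon skeleton
        out |= dilation(spine, disk(6))    # thicken spine into a blob line
    return out
```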

Fig. 5. Sample patches from document images of VML-MOC (a) and VML-AHTE (c). Blob line labeling for VML-AHTE and DIVA-HisDB is generated automatically (d). Blob line labeling for VML-MOC is manually drawn using a paint brush with a diameter of 12 pixels (b).

FCN Architecture. The FCN architecture (Fig. 6) we use is based on the FCN8 proposed for semantic segmentation [19]. The FCN8 architecture was selected because it has been successful in page layout analysis of handwritten documents [4]. It consists of an encoder and a decoder. The encoder downsamples the input image, so that the filters see coarser information with a larger receptive field. The decoder then adds the final layer of the encoder to the lower layers with finer information, and upsamples the combined layer back to the input size. The default input size is \(224\times 224\), which does not cover more than 2 to 3 text lines. To include more context, we changed the input size to \(350\times 350\) pixels. We also changed the number of output channels to 2, which is the number of classes: blob line or not.

Fig. 6. The FCN architecture used for text line detection. Vertical lines show the convolutional layers. Grids show the relative coarseness of the pooling and prediction layers. FCN8 upsamples the final layer by a factor of 4 and the pool4 layer by a factor of 2, and combines them with the pool3 layer. Finally, FCN8 upsamples the combination to the input size.
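For concreteness, the skip combination in Fig. 6 can be sketched in PyTorch as follows. This is a minimal illustration rather than the exact network of [19]: the encoder backbone and the channel sizes c3, c4 and c5 are assumptions, and only the fusion and upsampling logic follows the figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCN8Head(nn.Module):
    """Minimal FCN8-style decoder head (sketch).

    Fuses the coarse, high-level encoder output with the finer
    pool4/pool3 features, then upsamples the fusion back to the
    input resolution. Channel sizes depend on the assumed backbone.
    """
    def __init__(self, c3: int, c4: int, c5: int, num_classes: int = 2):
        super().__init__()
        # 1x1 convolutions produce per-class score maps at each scale;
        # num_classes = 2 for the two classes: blob line or not.
        self.score5 = nn.Conv2d(c5, num_classes, kernel_size=1)
        self.score4 = nn.Conv2d(c4, num_classes, kernel_size=1)
        self.score3 = nn.Conv2d(c3, num_classes, kernel_size=1)

    def forward(self, pool3, pool4, pool5, input_size):
        # Upsample the final (coarsest) layer to pool4 resolution, fuse.
        s = self.score4(pool4) + F.interpolate(
            self.score5(pool5), size=pool4.shape[2:],
            mode="bilinear", align_corners=False)
        # Upsample the fusion to pool3 resolution and fuse again.
        s = self.score3(pool3) + F.interpolate(
            s, size=pool3.shape[2:], mode="bilinear", align_corners=False)
        # Upsample the combination back to the 350x350 input size.
        return F.interpolate(s, size=input_size,
                             mode="bilinear", align_corners=False)
```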

FCN Training. For training, we randomly crop 50,000 patches of size \(350\times 350\) from the inverted binary images of the documents, together with the corresponding labels from the blob line label images (Fig. 5(b)). We adopted this patch size due to memory limitations: using full pages for training and prediction is not feasible on non-specialized systems without resizing the pages to a more manageable size, and resizing leads to a loss of detail, which usually reduces the accuracy of the segmentation results.

The FCN was trained with a batch size of 12, using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a learning rate of 0.001. The encoder part of the FCN was initialized with its publicly available pre-trained weights.
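A minimal sketch of this training setup, assuming PyTorch: `model` stands for the full FCN (encoder plus a head such as the one sketched above), and `pages`/`labels` for the lists of inverted binary page images and their blob line label images. All of these names are placeholders for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

def random_patches(pages, labels, size=350, batch=12):
    """Yield random (patch, label) batches from inverted binary page
    images and their blob line label images (2-D uint8 arrays)."""
    while True:
        xs, ys = [], []
        for _ in range(batch):
            k = np.random.randint(len(pages))
            h, w = pages[k].shape
            y = np.random.randint(0, h - size)
            x = np.random.randint(0, w - size)
            xs.append(pages[k][y:y + size, x:x + size])
            ys.append(labels[k][y:y + size, x:x + size])
        yield (torch.from_numpy(np.stack(xs)).float().unsqueeze(1),
               torch.from_numpy(np.stack(ys)).long())

# Hyperparameters as stated in the text; `model` is the assumed full FCN.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()            # 2 classes: blob line or not

for step, (patch, target) in enumerate(random_patches(pages, labels)):
    optimizer.zero_grad()
    loss = criterion(model(patch), target)   # model outputs (B, 2, H, W)
    loss.backward()
    optimizer.step()
    if step * 12 >= 50_000:                  # 50,000 patches as in the text
        break
```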

FCN Testing. During testing, a sliding window of size \(350\times 350\) is used for prediction, but only its inner window of size \(250\times 250\) is kept, in order to eliminate edge effects. A page is padded with black pixels on its right and bottom sides if its size is not an integer multiple of the sliding window size, and additionally on all 4 sides so that only the central part of each sliding window is considered.
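A minimal sketch of this sliding-window prediction, assuming NumPy; `predict_patch` is a placeholder for the trained FCN applied to a single patch (e.g., a forward pass followed by an argmax over the two classes).

```python
import numpy as np

def predict_page(page: np.ndarray, predict_patch, win: int = 350,
                 inner: int = 250) -> np.ndarray:
    """Sliding-window prediction keeping only the central inner window.

    `predict_patch` is an assumed callable mapping a (win, win) patch
    to a (win, win) label map.
    """
    margin = (win - inner) // 2                     # 50 px on each side
    h, w = page.shape
    # Pad to a multiple of the inner window, plus a margin on all 4
    # sides, with black (zero) pixels as described in the text.
    ph = int(np.ceil(h / inner)) * inner
    pw = int(np.ceil(w / inner)) * inner
    padded = np.zeros((ph + 2 * margin, pw + 2 * margin), dtype=page.dtype)
    padded[margin:margin + h, margin:margin + w] = page
    out = np.zeros((ph, pw), dtype=np.uint8)
    for y in range(0, ph, inner):
        for x in range(0, pw, inner):
            pred = predict_patch(padded[y:y + win, x:x + win])
            # Keep only the central inner window of the prediction.
            out[y:y + inner, x:x + inner] = pred[margin:margin + inner,
                                                 margin:margin + inner]
    return out[:h, :w]
```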

4.2 Text Line Extraction Using EM

We adopt the energy minimization (EM) framework [6], which uses graph cuts to approximate the minima of arbitrary functions. These functions can be formulated in terms of image elements, such as pixels or connected components. In this section we formulate a general function for text line extraction using the text line detection, and then adapt this general function to work with connected components.

EM Formulation. Let \(\mathcal {L}\) be the set of binary blob lines, and \(\mathcal {E}\) be the set of elements in the binary document image. Energy minimization finds a labeling f that assigns each element \(e\in \mathcal {E}\) to a label \(\ell _e\in \mathcal {L}\), such that the energy function \(\mathbf{E} (f)\) attains its minimum:

$$\begin{aligned} \mathbf{E} (f) = \sum _{e\in {\mathcal {E}}}D(e, \ell _e)+\sum _{\{e,e'\}\in \mathcal {N}}d(e, e')\cdot \delta (\ell _e \ne \ell _{e'}) \end{aligned}$$
(1)

The term D is the data cost, d is the smoothness cost, and \(\delta \) is an indicator function. The data cost is the cost of assigning element e to label \(\ell _e\); \(D(e, \ell _e)\) is defined as the Euclidean distance between the centroid of element e and the pixel of blob line \(\ell _e\) nearest to that centroid. The smoothness cost is the cost of assigning neighbouring elements to different labels. Let \(\mathcal {N}\) be the set of nearest element pairs. Then \(\forall \{e,e'\}\in \mathcal {N}\),

$$\begin{aligned} d(e,e') = \exp ({-\beta \cdot d_e(e,e')}) \end{aligned}$$
(2)

where \(d_e(e,e')\) is the Euclidean distance between the centroids of the elements e and \(e'\), and \(\beta \) is defined as

$$\begin{aligned} \beta =(2\langle {d_e(e,e')}\rangle )^{-1} \end{aligned}$$
(3)

where \(\langle \cdot \rangle \) denotes expectation over all pairs of neighbouring elements [7] in a document page image, and \(\delta (\ell _e \ne \ell _{e'})\) equals 1 if the condition inside the parentheses holds and 0 otherwise.

EM Adaptation to Connected Components. The presented method extracts text lines using the results of the text line detection by the FCN. Extraction-level representation labels each pixel of the text lines in a document image. The major difficulty in pixel labeling lies in the computational cost: a typical document image in the experimented datasets contains around 14,000,000 pixels. For this reason, we adapt the energy function (Eq. 1) to operate on connected components for the extraction of text lines.

The data cost of the adapted function measures how appropriate a label is for the component e, given the blob lines \(\mathcal {L}\). The data cost alone would amount to clustering the components with their nearest blob line. However, simple nearest neighbour clustering would fail to correctly label the free components that are disconnected from the blob lines (Fig. 7).

Fig. 7. Segmented samples that show the necessity of the smoothness cost for text line extraction. Samples in the first row are correct and were achieved with the smoothness cost. Samples in the second row are wrong and were caused by the lack of a smoothness cost. Notice that the smoothness cost pulls the diacritics to the components they belong to, in spite of their proximity to the wrong blob line.

A free component tends to lie closer to the components of the line it belongs to, but can be the nearest neighbour of a blob line that it does not belong to. The proximity grouping strength decays exponentially with Euclidean distance [18]; this phenomenon is formulated by the smoothness cost (Eq. 2). Semantically, this means that closer components have a higher probability of having the same label than distant components. Hence, the competition between the data cost and the smoothness cost forces free components to be labeled spatially coherently with their neighbouring components.
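To make the adaptation concrete, the following is a minimal sketch, assuming NumPy, SciPy and OpenCV, of how the data cost (distance from each component centroid to the nearest blob line pixel) and the smoothness cost (Eqs. 2 and 3) could be built over connected components. Defining the neighbour pairs as the k nearest centroids is our assumption; the final labeling would then be obtained by minimizing Eq. 1 with a graph cut optimizer such as alpha-expansion [6], which is omitted here.

```python
import numpy as np
import cv2
from scipy.spatial import cKDTree

def build_costs(binary_page: np.ndarray, blob_lines: list[np.ndarray]):
    """Data and smoothness costs over connected components (Eqs. 1-3).

    binary_page: 2-D uint8 array, ink pixels > 0.
    blob_lines:  one binary mask per detected blob line (same shape).
    """
    # Elements: connected components of the page, via their centroids.
    n, _, _, cents = cv2.connectedComponentsWithStats(binary_page)
    centroids = cents[1:]                 # drop the background component

    # Data cost: Euclidean distance from each centroid to the nearest
    # pixel of each blob line (one KD-tree per blob line).
    data = np.zeros((len(centroids), len(blob_lines)))
    for j, line in enumerate(blob_lines):
        ys, xs = np.nonzero(line)
        tree = cKDTree(np.column_stack([xs, ys]))   # (x, y) like centroids
        data[:, j], _ = tree.query(centroids)

    # Neighbour pairs: k nearest centroids (an assumption; the paper
    # specifies only "nearest element pairs").
    k = 4
    tree = cKDTree(centroids)
    dists, idx = tree.query(centroids, k=k + 1)     # first hit is self
    pairs = [(i, j) for i, row in enumerate(idx) for j in row[1:]]
    pair_d = dists[:, 1:].ravel()

    # Smoothness: exponential decay with beta = (2 * <d_e>)^-1 (Eq. 3).
    beta = 1.0 / (2.0 * pair_d.mean())
    smooth = np.exp(-beta * pair_d)                 # one weight per pair
    return data, pairs, smooth
```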

5 Experiments

We experiment with three datasets that differ in terms of the text line segmentation challenges they contain. The VML-AHTE dataset exhibits crowded diacritics and cramped text lines, whereas the DIVA-HisDB dataset contains consecutively touching text lines. Completely different from them, VML-MOC exhibits challenges caused by arbitrarily skewed and curved text lines. The performance is measured using the line segmentation evaluation metrics of ICDAR 2013 [13] and ICDAR 2017 [1].

5.1 ICDAR 2013 Line Segmentation Evaluation Metrics

The ICDAR 2013 metrics calculate the recognition accuracy (RA), detection rate (DR) and F-measure (FM). Given a set of image points I, let \(R_i\) be the set of points inside the \(i^{th}\) result region, \(G_j\) be the set of points inside the \(j^{th}\) ground truth region, and T(p) a function that counts the points inside the set p. Then MatchScore(i, j) is calculated by Eq. 4:

$$\begin{aligned} MatchScore(i,j) = \frac{T(G_{j}\cap R_{i})}{T(G_{j}\cup R_{i})} \end{aligned}$$
(4)

The evaluator considers a region pair (i, j) a one-to-one match if MatchScore(i, j) is equal to or above a threshold, which we set to 90%. Let \(N_1\) and \(N_2\) be the numbers of ground truth and output elements, respectively, and let M be the number of one-to-one matches. The evaluator calculates DR, RA and FM as follows:

$$\begin{aligned} DR = \frac{M}{N_1} \end{aligned}$$
(5)
$$\begin{aligned} RA = \frac{M}{N_2} \end{aligned}$$
(6)
$$\begin{aligned} FM=\frac{2\times DR\times RA}{DR+RA} \end{aligned}$$
(7)
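As a concrete illustration of Eqs. 4-7, the following sketch computes these metrics from regions represented as sets of pixel coordinates; the set-based representation and the simplified one-to-one matching are our assumptions, not the exact evaluator of [13].

```python
def match_score(g: set, r: set) -> float:
    """MatchScore(i, j) = |G_j ∩ R_i| / |G_j ∪ R_i| (Eq. 4)."""
    return len(g & r) / len(g | r)

def icdar2013_metrics(gt_regions, result_regions, threshold=0.9):
    """DR, RA and FM from one-to-one matches (Eqs. 5-7).

    Simplification: each ground truth region is assumed to match at
    most one result region at this threshold.
    """
    m = sum(1 for g in gt_regions
            if any(match_score(g, r) >= threshold for r in result_regions))
    dr = m / len(gt_regions)        # detection rate, Eq. 5
    ra = m / len(result_regions)    # recognition accuracy, Eq. 6
    fm = 2 * dr * ra / (dr + ra)    # F-measure, Eq. 7
    return dr, ra, fm
```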

5.2 ICDAR 2017 Line Segmentation Evaluation Metrics

The ICDAR 2017 metrics are based on the Intersection over Union (IU). An IU score is computed for each possible pair of ground truth (GT) polygons and prediction (P) polygons as follows:

$$\begin{aligned} IU=\frac{IP}{UP} \end{aligned}$$
(8)

IP denotes the number of intersecting foreground pixels of the pair of polygons, and UP denotes the number of foreground pixels in the union of the foreground pixels of the pair. The pairs with the maximum IU score are selected as the matching pairs of GT and P polygons. Then, the pixel IU and line IU are calculated over these matching pairs. For each matching pair, line TP is the number of foreground pixels that are correctly predicted, line FP is the number of foreground pixels that are falsely predicted, and line FN is the number of false negative foreground pixels. Accordingly, the pixel IU is:

$$\begin{aligned} \text {Pixel } IU=\frac{TP}{TP+FP+FN} \end{aligned}$$
(9)

where TP is the global sum of line TPs, FP is the global sum of line FPs, and FN is the global sum of line FNs.

Line IU is measured at the line level. For each matching pair, the line precision and line recall are:

$$\begin{aligned} \text {Line precision}=\frac{\text {line } TP}{\text {line } TP + \text {line } FP} \end{aligned}$$
(10)
$$\begin{aligned} \text {Line recall}=\frac{\text {line } TP}{\text {line } TP + \text {line } FN} \end{aligned}$$
(11)

Accordingly, line IU is:

$$\begin{aligned} \text {Line } IU=\frac{CL}{CL+ML+EL} \end{aligned}$$
(12)

where CL is the number of correct lines, ML is the number of missed lines, and EL is the number of extra lines.

For each matching pair, a line is correct if both the line precision and the line recall are above a threshold value; a line is missed if the line recall is below the threshold; and a line is extra if the line precision is below the threshold.
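The line-level bookkeeping can be illustrated with the following sketch; the threshold value here is a placeholder, as the competition's own setting is defined in [1].

```python
def line_iu(pairs, threshold=0.75):
    """Classify matching pairs into correct/missed/extra lines (Eq. 12).

    `pairs` holds (line_tp, line_fp, line_fn) pixel counts per matching
    pair; the threshold value is an assumption for illustration.
    """
    cl = ml = el = 0
    for tp, fp, fn in pairs:
        precision = tp / (tp + fp)   # Eq. 10
        recall = tp / (tp + fn)      # Eq. 11
        if precision > threshold and recall > threshold:
            cl += 1                  # correct line
        if recall < threshold:
            ml += 1                  # missed line
        if precision < threshold:
            el += 1                  # extra line
    return cl / (cl + ml + el)       # line IU, Eq. 12
```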

5.3 Results on VML-AHTE Dataset

Since VML-AHTE and VML-MOC are recently published datasets, we also ran two other supervised methods on them. The first is a holistic method that extracts text lines in a single phase and is based on instance segmentation using MRCNN [16]. The second runs the EM framework using the blob line labels from the ground truth; we refer to it as Human+EM. On the VML-AHTE dataset, FCN+EM outperforms all the other methods in terms of all the metrics except Line IU. It successfully splits the touching text lines and assigns the disjoint strokes to the correct text lines (Table 1).

Table 1. Results on VML-AHTE dataset
Fig. 8. Example of generated curved lines: (a) shows the original straight lines section, (b) is the result of warping (a) by 90\(^{\circ }\) in the middle to generate the curved lines, and (c) is the mirrored image of (b) in the vertical direction.

5.4 Results on VML-MOC Dataset

The VML-MOC dataset contains both straight and curved text lines. The number of straight text lines is substantially greater than the number of curved text lines. This imbalance causes the FCN to overfit on the straight text lines, which in turn leads to fragmented blob lines when predicting on the curved text lines. To compensate for this imbalance, we therefore generated images containing artificially curved text lines: we selected the document image parts with straight lines and warped them by 90\(^{\circ }\) from their middle. Furthermore, each of the warped images was mirrored in the horizontal and vertical directions, yielding curved lines in four directions. Figure 8 illustrates this procedure. The FCN+EM trained with the augmented curved text lines (FCN+EM+Aug) outperforms the FCN+EM trained only on the training set (Table 2), but FCN+EM+Aug still underperforms a learning-free algorithm [5].

Table 2. Results on VML-MOC dataset

5.5 Results on DIVA-HisDB Dataset

We compare our results with the results of Task-3 of the ICDAR 2017 competition on layout analysis for medieval manuscripts [28]. Task-3's scope of interest is only the main text lines, not the interlinear glosses. We removed these glosses prior to all our experiments using the ground truth; note that the Task-3 participants removed them using their own algorithms.

Table 3 compares our methods with the participants of the ICDAR 2017 competition on layout analysis for challenging medieval manuscripts in text line extraction. FCN+EM obtains a perfect Line IU score on the books CSG863 and CB55. Its Pixel IU is on par with the best performing method in the competition.

Table 3. Comparison with the Task-3 results of the ICDAR 2017 competition on layout analysis for challenging medieval manuscripts [28].

5.6 Discussion

An observable pattern in the results is the parallel behaviour of the line IU and pixel IU values, while the RA values fluctuate relative to the DR values. Clearly, such counter-intuitive behaviour of a metric is not preferable in terms of the interpretability of the results. In addition, the ICDAR 2017 evaluator cannot handle cases where a text line consists of multiple polygons. Such cases arise in the MRCNN results: MRCNN segments a text line instance correctly but represents it as multiple polygons with the same label. Evaluating the MRCNN results in their raw form therefore yields unfairly low values (Fig. 9), because the ICDAR 2017 evaluator computes an IU score for each possible pair of ground truth polygons and prediction polygons and then selects the pairs with the maximum IU score as the matching pairs. Consequently, a text line represented by multiple polygons is considered only by its largest polygon.

Fig. 9. The MRCNN method correctly predicts text line pixels, but its results are not fairly evaluated due to the disconnected polygons.

6 Conclusion

This paper presents a supervised text line segmentation method, FCN+EM. The FCN detects the blob lines that strike through the text lines, and the EM extracts the pixels of the text lines with the guidance of the detected blob lines. FCN+EM makes no assumption about text line orientation or height. The algorithm is very effective in detecting cramped, crowded and touching text lines. It shows superior performance on the VML-AHTE and DIVA-HisDB datasets and comparable results on the VML-MOC dataset.