1 Introduction

Making the valuable cultural heritage stored in historical documents accessible is an important concern of archives and libraries as well as of certain companies, e.g., those specializing in genealogy. After years of digitization at an industrial scale to protect and preserve these valuable goods, millions upon millions of scanned pages are stored on servers all over the world [1]. The natural next step is to make the enormous amount of content in these document images accessible and to enable humanists, historians, genealogists as well as ordinary people to work efficiently with these documents. Besides the cost- and time-intensive process of manually annotating entire volumes [2], how to automate this process is the subject of current research and scientific discussion [3].

Since 2009, tremendous progress has been achieved in the fields of Automated Text RecognitionFootnote 1 (ATR) [4, 5] and Keyword Spotting (KWS) [6,7,8]. The performance of state-of-the-art systems reaches character error rates below \(10 \%\) for ATR [9] and mean average precisions above 0.9 for KWS [10] on complex handwritten documents. Although efforts are made to develop systems working solely on the raw input image without any a-priori segmentation [11,12,13], the best performing recognition systems—with reference to recently hosted competitions—rely on segmented words or text lines as input. Entirely segmentation-free approaches either suffer from an enormous training/inference time and/or have, up to now, not demonstrated their applicability with competitive quality on challenging datasets [10]. Hence, a workflow which involves a text line extraction followed by the transformation of pixel information into textual information (ATR/KWS) is the widely used standard. This work deals with the first step of this information retrieval pipeline, namely the text line extraction. This step is mandatory, and errors made here directly affect the performance of the overall information retrieval process. Text line extraction is still unsolved to a certain extent for historical documents due to difficulties such as physical degradations (e.g., bleed-through, faded characters, heterogeneous stroke intensity), image capture conditions (e.g., scan curve, illumination issues), complex layouts (e.g., structured documents, marginalia, multi-column layouts, varying font sizes), arbitrary orientations and curved text lines.

The results achieved by state-of-the-art approaches are not satisfactory [14], especially when dealing with heterogeneous data. Therefore, this work focuses on the extraction of text lines in arbitrary historical documents. Since different ATR/KWS systems necessitate different text line representations, e.g., bounding boxes [15], x-height areas [16] or more precise polygonal representations following all ascenders and descenders [7], there is no single correct text line representation. Therefore, we limit ourselves to the text line detection task by representing each text line by its baseline. The detected baselines allow for an extraction of the text lines in a form that is appropriate for the subsequent method. Extracting a text line given its baseline has to deal with issues such as touching or overlapping components and can be tackled, e.g., by histogram approaches to estimate the x-height [16] or by utilizing dynamic programming to calculate separating seams [17]. However, this is not within the scope of this work.

Besides the classical image processing-based approaches, deep learning-based methods have become omnipresent in the document analysis community in recent years. Such techniques were recently used to solve several different problems such as binarization [18], page boundary extraction [19], page segmentation [20] or text line detection [16]. To our knowledge, the presented work is the first which uses a two-stage method combining deep learning strategies and state-of-the-art image processing-based techniques. We propose an extension of the U-Net [21], the so-called ARU-Net. The fully convolutional U-Net is extended by incorporating residual blocks [22] to increase its representative power. Furthermore, a spatial attention mechanism is developed which allows the ARU-Net to focus on image content at different positions and scales. The network is designed to process the entire, arbitrarily sized image at once to take into account all spatial context. The ARU-Net is universal in the sense that it could be used to tackle any pixel labeling task. In this work, it is trained in a fully supervised fashion to classify each pixel as belonging to one of the following classes: baseline, separator and other. The separator class is introduced to explicitly predict the beginning and end of each text line and not just rely on the information implicitly given by the baseline class. This is advantageous for text lines which are close together but have to be separated, e.g., those belonging to different columns. The network output serves as input for an image processing-based bottom-up clustering approach. This approach utilizes so-called states of superpixels [23], which encode local text orientation and interline distances. This second stage allows for an error correction of the network output by incorporating domain knowledge based on assumptions which hold for text lines in general, see Sect. 3.3.3. Additionally, it is easily possible to incorporate the separator information, which allows for the handling of documents with complex layouts, e.g., images containing tables or marginalia.

Each method relying on supervised deep learning, and therefore on training data, can suffer from the need for an enormous amount of labeled training data. We demonstrate that the presented approach achieves high-quality results on the Bozen dataset [24] with fewer than 50 full-page training samples by using data augmentation strategies. With an annotation effort of just a few minutes per page, the adaptation of the proposed method is easy and cheap. We demonstrate the applicability of the proposed method for images with arbitrarily oriented as well as curved text lines by achieving results nearly as good as for straight \(0^{\circ }\) oriented text lines. Finally, we show that the presented approach outperforms state-of-the-art methods on three different datasets. A relative F value [25] error (the gap to 1.0) reduction of at least \(24\%\) is achieved for the cBAD dataset [26]. This dataset is composed of 2036 historical images with annotated baselines from nine different archives and libraries from all over Europe and is therefore—in the opinion of the authors—the most representative and heterogeneous freely available dataset. In particular, for the complex track, which contains mostly documents with complex layouts, the average F value is increased from 0.859 to 0.922.

The main contributions of this work are:

  • Introduction of a newly designed deep neural network (ARU-Net) for pixel labeling along with a meaningful parametrization—the ARU-Net and its training framework are open source,Footnote 2

  • Introduction of the new concept of learned separators to handle complex layouts instead of an a-priori page segmentation or white/black run calculation,

  • Introduction of a state-of-the-art two-stage workflow which combines state-of-the-art deep learning and image processing techniques—the entire workflow is freely usable via the Transkribus platform.Footnote 3

2 Related work

A comprehensive survey of approaches for text line extraction in historical documents is given in [27, 28]. In this section, we will focus on approaches relevant for this work.

In [17, 29, 30], the principle of dynamic programming is utilized to calculate cost optimal paths passing the image from left to right to separate different text lines from each other. These methods basically differ in the way the images are pre-processed and in the definition of the cost function. Garz et al. [31] propose a method based on clustering of interest points (another name for what we call superpixels). Using a standard clustering technique, interest points in areas which exceed a certain density are clustered to form word clusters. Word clusters are split into sub-word segments, and these are finally grouped to build text lines. Ryu et al. [23] propose an algorithm which uses certain characteristics (so-called states) of extracted connected components to assign costs to certain clustering results. These states encode local text orientation and interline distances and are introduced in Definition 3.13. Subsequently, using four different operations (merge, split, merge–split, merge–merge–split) on an initial coarse clustering, the costs are minimized to obtain an optimal clustering, which leads to the final text line segmentation. Ahn et al. [32] improve this approach by introducing a newly developed binarization method and an improved clustering process. Grüning et al. [33] extended the approach of Ryu et al. to more general superpixels by introducing a clustering procedure which does not rely on a coarse initial clustering.

Besides these “classical” approaches, which are based on image processing techniques, methods based on machine learning have gained importance within the last two years. Moysset et al. [34] propose a method based on a recurrent neural network. The network is trained given only the number of lines in the image, utilizing connectionist temporal classification, which was originally introduced to train networks for handwritten text recognition and allows for ground truth data without any alignment. The trained neural network predicts confidences for the vertical coordinates of the image to belong either to the class line or interline. Further post-processing of the neural network output is performed to detect the text lines. In follow-up work, they formulated the problem as a regression problem [35]: the recurrent neural network directly predicts bounding boxes as well as the start of each text line. Besides this regression-based approach, classification-based approaches were proposed most recently. In contrast to the approach of Moysset et al., these methods perform a pixel labeling to classify each image pixel (instead of classifying only rows of pixels). For instance, Renton et al. [16] propose a fully convolutional network (FCN) based on dilated (or atrous) convolutions to classify pixels as text line main body or not. The classification results are utilized to extract the text line information. These techniques are currently very popular, e.g., four of the five participants of the cBAD: ICDAR2017 Competition on Baseline Detection [36] use methods relying on FCNs. For example, the methods presented in [16, 52,53,54] tackle the problem of baseline detection with fully convolutional neural networks.
However, these methods either rely on a patch-wise processing of the input image (which results in a cumbersome reconstruction of the baseline hypotheses for the entire image), on massive pre-trained encoder structures (which significantly increase the number of model parameters and slow down the system) or on elementary post-processing steps (which deteriorate the system’s performance). The presented method overcomes these limitations and shows its superiority on the challenging cBAD dataset.

3 Methodology

Fig. 1
figure 1

Two-stage workflow to detect baselines—the first stage utilizes a deep hierarchical neural network to perform a pixel labeling. The result of Stage I is the input for an image processing-based method in Stage II. This method clusters superpixels to build baselines. The image is sampled from the cBAD complex test set [25]

In this section, we introduce the two-stage method for baseline detection, see Fig. 1. The first stage relies on a deep neural network—the ARU-Net—and performs a pixel labeling. The pixel labeling can be seen as some kind of goal-oriented binarization. Instead of detecting all foreground elements, it restricts itself to those elements which are of interest for the specific task. The second stage performs a superpixel (SP) extraction on the first stage’s output. These SPs are further clustered to build baselines. In the following, the problem of baseline detection is formulated. Afterward, a detailed description of the proposed ARU-Net is given. Finally, the SP extraction and clustering approach are described.

3.1 Problem statement

We introduce the problem of baseline detection in a formal way by defining all necessary terms and notation. Within this work, we follow the definition of a baseline given in [26]:

Definition 3.1

(Baseline) A baseline is defined in the typographical sense as the virtual line where most characters rest upon and descenders extend below.

Hence, each baseline can be represented by a polygonal chain. The infinite set \(\mathcal {P}\) of all possible polygonal chains is called polygonal chain space. Within this work, we limit ourselves to gray-scale images. The set of all possible (gray-scale) images \(\mathcal {I}=\bigcup _{h,w\in \mathbb {N}} \left[ 0,1\right] ^{h\times w}\) is called image space. If a color image is available, we usually use it for visualization even though it is converted to its gray-scale version for all calculations. For visualization purposes, a pixel intensity value of 1 means white and 0 means black. \(I_h\) denotes the height of image I and \(I_w\) its width.

Definition 3.2

(Baseline detector, baseline hypothesis) We call a function \(b:\mathcal {I}\rightarrow \mathfrak {P}(\mathcal {P})\) which maps each image to a subset of \(\mathcal {P}\) a baseline detector. The set of all baseline detectors is denoted by \(\mathcal {B}\). The output of b for a certain image I is called baseline hypothesis.

Definition 3.3

(Baseline ground truth) The set \(\mathcal {G}_I\subset \mathcal {P}\) of polygonal chains representing the baselines of an image I (possibly annotated by a human operator) is called baseline ground truth (for image I).

Definition 3.1 allows for some variety in the baselines. Hence, there is no single unique and correct ground truth for an image. Therefore, ground truth information is always biased by its creator. This has to be taken into account for the evaluation process as well as for the baseline detector design.

Definition 3.4

(Similarity score) A function \(\langle \cdot ,\cdot \rangle _{\mu }:\mathcal {P}\times \mathcal {P}\rightarrow \left[ 0,1\right] \) assigning a scalar value to each pair of baseline ground truth and baseline hypothesis polygonal chain sets is called similarity score.

A value of 1.0 indicates that two polygonal chains are regarded as equal. Within this work, we follow the similarity score introduced in [25]: We measure the accuracy of a baseline detector in terms of the F value, see [25] for a detailed introduction.

The problem tackled in this work can now be formulated as follows: Suppose there are two sets of images along with their baseline ground truth information \(\mathcal {T}_\mathrm{train}=\{(I,\mathcal {G}_I)_i\ |\ i=1,\ldots ,n\}\) and \(\mathcal {T}_\mathrm{test}=\{(I,\mathcal {G}_I)_i\ |\ i=1,\ldots ,m\}\). We aim for a design of a baseline detector \(b^*\) given \(\mathcal {T}_\mathrm{train}\) which solves

$$\begin{aligned} b^*=\mathop {\mathrm{arg}\,\mathrm{max}}\limits _{b\in \mathcal {B}}\sum _{(I,\mathcal {G}_I)\in \mathcal {T}_\mathrm{test}} \langle \mathcal {G}_I , b(I) \rangle _{\mu }. \end{aligned}$$

In the design phase of \(b^*\), the set \(\mathcal {T}_\mathrm{test}\) is unknown and one is allowed to use solely \(\mathcal {T}_\mathrm{train}\). Hence, one has to ensure that \(b^*\) generalizes well from \(\mathcal {T}_\mathrm{train}\) to \(\mathcal {T}_\mathrm{test}\).

Since the proposed design consists of two stages and the first stage relies on deep learning techniques, an adaptation to a differently biased ground truth (produced by a different annotator) can be done easily by retraining the first stage without any fine tuning done by experts.

3.2 Stage I: ARU-Net

Typically, layout analysis algorithms directly work on the input image I or on a binarized version of it [17, 23, 29,30,31, 33]. Instead, we employ a more goal-oriented transformation of the input image utilizing a neural network, which is trained in a supervised manner to assign a certain class to each pixel like in [21, 37, 38]. This is often referred to as pixel labeling or semantic segmentation. We will introduce the problem of pixel labeling utilizing hierarchical neural networks, followed by a description of the proposed ARU-Net architecture.

3.2.1 Pixel labeling: problem formulation

Definition 3.5

(Neural pixel labeler) A neural pixel labeler (NPL) for the classes \(\mathcal {C}=\left\{ c_1,\ldots ,c_n\right\} \) is a hierarchical neural network \(\varPhi (\ \cdot \ ;\varvec{w}):\mathcal {I} \rightarrow \mathcal {I}^{\left| \mathcal {C}\right| }\). The NPL is parametrized by \(\varvec{w}\in \mathbb {R}^N\). For \(I\in \mathcal {I}\), it performs a prediction over all pixels and all possible classes \(\varPhi (I;\varvec{w})=C\in \left[ 0,1\right] ^{I_h\times I_w \times \left| \mathcal {C}\right| }\), where C sums to one over all classes and for all coordinates.

\(C(:,:,c)=C_{:,:,c}\in \mathcal {I}\) denotes the image which encodes the pixel-wise prediction (confidence) for the cth class.

Definition 3.6

(Pixel ground truth) A Cartesian product \(G_I\in \mathcal {I}^{\left| \mathcal {C}\right| }\) is called pixel ground truth (for image I) if it assigns exactly one class (one-hot-encoding) to each pixel.

Following the problem formulation of Sect. 3.1, we aim for an NPL, which was tuned on a training set and optimally performs on a test set. Assume there are training and test sets as stated above, but with pixel ground truth information instead of baseline ground truth information, which are denoted by \(\widetilde{\mathcal {T}}_\mathrm{train}\) and \(\widetilde{\mathcal {T}}_\mathrm{test}\). The performance of an NPL is evaluated in terms of the cross-entropy between the predicted and the ground truth distributions. The cross-entropy can also be motivated by a maximum likelihood estimation. This results in the cross-entropy loss function.

Definition 3.7

(Loss function) Let \(\widetilde{\mathcal {T}}\) be a set of images along with their pixel ground truth and \(\varPhi (\ \cdot \ ;\varvec{w})\) is an NPL. The performance of \(\varPhi \) on \(\widetilde{\mathcal {T}}\) is evaluated in terms of the (cross-entropy) loss function

$$\begin{aligned} -\sum _{(I,G_I)\in \widetilde{\mathcal {T}}}\sum _{y=1}^{I_h}\sum _{x=1}^{I_w}\sum _{c=1}^{\left| \mathcal {C}\right| }G_I(y,x,c)\ln {\varPhi (I;\varvec{w})_{y,x,c}}. \end{aligned}$$

To improve the performance of the NPL on the training set, one can calculate the loss function’s gradient with respect to the model parameters using the well-known technique of backpropagation [39]. The gradient is used to update the model parameters by gradient descent: \(\varvec{w} \leftarrow \varvec{w} - \tau \cdot \frac{\partial L}{\partial \varvec{w}}(\varPhi ,\widetilde{\mathcal {T}}_\mathrm{train})\) with a learning rate \(\tau \). This is repeated to successively adapt the NPL. The process of adapting the model by minimizing its loss is called training. Since one does not aim for a minimization of the loss on the training set as such, but for high-quality results on the test set, the system has to generalize. To stabilize training, avoid over-fitting, improve generalization, etc., dozens of techniques extending the simple update rule stated above have been introduced in recent years. Since their introduction is beyond the scope of this work, we refer to [40]. Details on the techniques used within this work are given in Sect. 4.
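For concreteness, the following is a minimal NumPy sketch of the loss of Definition 3.7; the epsilon guard for the logarithm is an implementation detail and not part of the definition, and the gradient descent update is only indicated as a comment.

```python
import numpy as np

def cross_entropy_loss(preds, gts, eps=1e-8):
    """Cross-entropy loss of Definition 3.7.

    preds: list of (I_h, I_w, |C|) arrays holding the NPL confidences Phi(I; w)
    gts:   list of (I_h, I_w, |C|) arrays holding the one-hot pixel ground truth
    """
    return -sum(np.sum(g * np.log(p + eps)) for p, g in zip(preds, gts))

# Schematic parameter update with learning rate tau (backpropagation yields dL/dw):
#   w <- w - tau * dL/dw
```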

3.2.2 ARU-Net: architecture

The ARU-Net is a special form of an NPL and is described in this section. We omit a formal introduction of the neural network components and concepts used and refer to the above-mentioned literature. Within the last few years, different architectures have been proposed for the pixel labeling task. Most of them are based on convolutional neural networks (CNNs) [41]. A direct application of CNNs for semantic segmentation is presented in [37]. The presented fully convolutional network (FCN) combines local features to produce more meaningful high-level features using pooling layers. Pooling reduces the spatial dimension; thus, the result suffers from a coarse resolution. Noh et al. [38] tackle this problem by applying a deconvolutional network to the subsampled output of the FCN. The U-Net proposed in [21] furthermore introduces shortcuts between layers of the same spatial dimension. This allows for an easier combination of local low-level features and global higher-level features. Additionally, error propagation for deep structures is facilitated, and the so-called vanishing gradient problems [42] are reduced. The U-Net is the basis for the proposed ARU-Net. We extend the U-Net by two more key concepts—spatial attention (A) and depth (residual structure (R))—to be described below. Notably, in contrast to the U-Net proposed in [21], we perform border padding. Hence, the spatial dimensions in each scale space of the U-Net are all the same, see Fig. 2 for a schematic representation of a U-Net. The output of the U-Net is thus a feature map (Z features in Fig. 2) of the same spatial dimension as the input. Hence, the U-Net becomes an NPL as defined in Definition 3.5 by adding a convolutional softmax classifier (to obtain pixel-wise predictions) on top, which distinguishes between the different classes of \(\mathcal {C}\).

Fig. 2
figure 2

U-Net—the input is an image of arbitrary spatial dimension. “Act” is the activation function; thus, the rectangles represent sets of activation maps. Each rectangle represents a 3-dim array (\(\in \mathbb {R}^{h\times w\times z}\)). Within each scale space (roman numbers), the feature map widths and heights are constant (encoded by the height of the rectangles). The number of feature maps Z is pictured by the width of the rectangles. Between adjacent scale spaces, the spatial dimension decreases by a certain factor (2 in the figure) and the representative depth (number of feature maps) increases by the same factor

Remark 3.1

If the presented architectures are used for the pixel labeling task, it is implicitly assumed that such a classifier is always added to generate per class confidences at pixel level.

He et al. [22] introduce very deep neural networks which remain trainable and yield state-of-the-art results. This is achieved using so-called residual blocks. Residual blocks introduce shortcuts, which enable error backpropagation and identity propagation even for very deep structures. Hence, the vanishing gradient problems are reduced [22]. There are various different forms of residual blocks. The one used within this work is depicted in Fig. 3.

Definition 3.8

(RU-Net) An RU-Net is an U-Net with residual blocks.

That means that each of the two-layer CNN blocks in Fig. 2 is replaced by a residual block as in Fig. 3.
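A minimal sketch of such a residual block in Keras notation is given below; the number of inner convolutions, the filter sizes and the ReLU activation are assumptions of this sketch and correspond to Fig. 3 only schematically.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """Residual block in the spirit of Fig. 3: the input is convolved once, the
    resulting logits feed both an inner convolution branch and the identity
    shortcut, and the two logit maps are summed before the final activation."""
    logits = layers.Conv2D(filters, kernel_size, padding="same")(x)
    branch = layers.Activation("relu")(logits)
    branch = layers.Conv2D(filters, kernel_size, padding="same",
                           activation="relu")(branch)
    branch = layers.Conv2D(filters, kernel_size, padding="same")(branch)
    summed = layers.Add()([logits, branch])          # identity shortcut
    return layers.Activation("relu")(summed)
```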

Fig. 3
figure 3

Residual block—the input is convolved, and the resulting 3-dim array (the maps before being passed through an activation function are referred to as logits) is used twice. In the first branch, it is passed through the activation function and further processed by several convolution layers. In the second branch, it is directly fed into a summation node. After a point-wise summation of the two logit maps, an activation function is applied. The shortcut enables easy identity propagation and error backpropagation. Arbitrarily many inner layers are possible

To explicitly incorporate the potential to handle various font sizes, especially mixed font sizes on a single page, we introduce a pixel-wise (spatial) attention mechanism. For this purpose, we introduce an attention network (A-Net). The A-Net is a multilayer CNN which generates a single output feature map. The A-Net is applied along with the RU-Net at different scales, and the same network weights are used at all scales (weight sharing). Specifically, a scale pyramid is built by downscaling the input image \(I = I_1\) several times. The resulting (scaled) images \(I_1,\ I_2,\ I_4,\ I_8,\ldots ,\ I_s\) (subscripts denote the scaling factors) are fed into the RU-Net and the A-Net. Trainable deconvolutional layers (of corresponding scales) are applied to the outputs of the RU-Net and the A-Net to obtain feature maps of spatial dimensions equal to the input. \(A_1,\ldots ,A_s\) denote the up-sampled feature maps of the A-Net and \(\mathrm{RU}_1,\ldots ,\mathrm{RU}_s\) those of the RU-Net, respectively. After applying a pixel-wise softmax normalization to the attention maps

$$\begin{aligned} \widehat{A}_i(y,x)=\frac{\exp (A_i(y,x))}{\sum _{j\in \{1,2,\ldots ,s\}} \exp (A_j(y,x))} \end{aligned}$$

the normalized attention maps \(\widehat{A}_i\) sum to one (pixel-wise). The feature maps \(\mathrm{RU}_i\) are combined following

$$\begin{aligned} ARU=\sum _{i\in \{1,2,\ldots ,s\}} \mathrm{RU}_i\odot \widehat{A}_i, \end{aligned}$$

where \(\odot \) is the Hadamard product. ARU is the input for the classifier to build an NPL, see Remark 3.1.
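The scale combination can be summarized in a few lines; the following NumPy sketch assumes the per-scale feature maps have already been up-sampled to the input resolution.

```python
import numpy as np

def combine_scales(ru_maps, a_maps):
    """Blend the RU-Net feature maps with the softmax-normalized attention maps.

    ru_maps: list of (h, w, z) arrays RU_1, ..., RU_s (up-sampled)
    a_maps:  list of (h, w) arrays A_1, ..., A_s (up-sampled)
    """
    a = np.stack(a_maps, axis=0)
    a = np.exp(a - a.max(axis=0, keepdims=True))      # numerically stable softmax
    a_hat = a / a.sum(axis=0, keepdims=True)          # pixel-wise normalization
    return sum(ru * att[..., None] for ru, att in zip(ru_maps, a_hat))  # ARU
```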

Fig. 4
figure 4

ARU-Net—the input image and its downscaled versions are fed into the A-Net and the RU-Net (weight sharing across different scales). The results for the lower resolutions are deconvolved. The attention maps are passed through a softmax normalization. The brighter the map at a certain position, the more attention is paid to that position at the corresponding scale. The attention maps are point-wise multiplied with the feature maps of the RU-Net. The results are summed, and a classification is performed

Definition 3.9

(ARU-Net) An RU-Net incorporating the described spatial attention mechanism is called ARU-Net, see Fig. 4.

The point-wise multiplication combined with the pixel-wise attention maps allows the ARU-Net to pay attention to different scales at different positions of the image. In Fig. 4, one can see that this behavior was indeed learned by the network. It seems that the RU-Net is specialized in a certain font size and the A-Net distinguishes between areas of different font sizes (bright and dark areas).

The ARU-Net as introduced can be used for any pixel labeling task, e.g., binarization, page detection and page segmentation. The purpose of the ARU-Net is defined and fixed by the number of classes and the ground truth data provided for training. In this work, we limit ourselves to the baseline detection problem introduced in Sect. 3.1. For this purpose, we introduce three different classes: baseline (bl), separator (sep) and other (\(\varnothing \)). The separators mark the beginning and end of each text line. Although the separator information is implicitly encoded by the baselines, it is advantageous to explicitly introduce it as a possible classification result. Especially for baselines which are close together, e.g., those belonging to two adjacent columns, this approach helps to avoid segmentation errors. Pixel ground truth for the classes \(\mathcal {C}=\{\text {bl},\text {sep},\varnothing \}\) can be automatically generated by Algorithm S.1 (supplements) given the baseline ground truth.
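The actual ground truth generation is given by Algorithm S.1 in the supplements; the following is only a rough sketch of the idea, in which the stroke width of the baseline class and the circular separator markers at the line ends are assumptions.

```python
import numpy as np
import cv2  # assumption: OpenCV is available for polyline rasterization

def baselines_to_pixel_gt(baselines, h, w, bl_width=3):
    """Rasterize baseline polygonal chains into a one-hot pixel ground truth
    with channel order (bl, sep, other)."""
    bl_img = np.zeros((h, w), np.uint8)
    sep_img = np.zeros((h, w), np.uint8)
    for chain in baselines:                              # chain: list of (x, y)
        pts = np.asarray(chain, np.int32).reshape(-1, 1, 2)
        cv2.polylines(bl_img, [pts], False, 1, thickness=bl_width)
        for p in (chain[0], chain[-1]):                  # mark begin and end
            cv2.circle(sep_img, tuple(map(int, p)), 2 * bl_width, 1, -1)
    gt = np.zeros((h, w, 3), np.float32)
    gt[..., 1] = sep_img
    gt[..., 0] = np.where(sep_img == 0, bl_img, 0)       # separator wins over bl
    gt[..., 2] = 1.0 - np.clip(gt[..., 0] + gt[..., 1], 0, 1)
    return gt
```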

A sample image with baseline ground truth along with its generated pixel ground truth is depicted in Fig. 5. The prediction of a trained ARU-Net for this sample image is shown in Fig. 6a.

3.3 Stage II: baseline estimation

This subsection describes the second stage of the proposed approach. Baselines are estimated given the output of the ARU-Net. This task consists of three steps: superpixel calculation, state estimation and superpixel clustering, which are described in the following.

The trained ARU-Net generates an output \(C{\in }\left[ 0,1\right] ^{I_h{\times } I_w\times 3}\) for each image \(I\in \mathcal {I}\). In the following, \(B=C_{:,:,1}\) denotes the image encoding the confidence of each pixel belonging to a baseline and \(S=C_{:,:,2}\) is the separator image, see Fig. 6a.

Fig. 5
figure 5

Baseline and pixel ground truth—these are shown for the top snippet of the image of Fig. 1 (color figure online)

Fig. 6
figure 6

Baseline detection process—two intermediate steps are shown for the top snippet of the image of Fig. 1 (color figure online)

3.3.1 Superpixel calculation

The number of all pixels in an image often exceeds several millions. To reduce the dimensionality of the problem (the number of pixels to be regarded for the baseline estimation), we limit ourselves to a subset of all pixels.

Definition 3.10

(Superpixel) Let \(\mathcal {S}=\{\varvec{p}_1,\ldots ,\varvec{p}_N\}\) be a subset of the image pixels of I (typically, \(N \ll I_h\cdot I_w\) holds). An element of \(\mathcal {S}\) is called superpixel (SP).

Basically, the definition of a superpixel does not introduce any new concept. An SP is just a normal pixel which is regarded to be of particular importance. Since it is a frequently used term, we decided to introduce it via a definition. It is easy to see that the choice of the set of SPs is crucial for the overall performance. If there are no SPs for a baseline at all, this baseline will be missed. To calculate a suitable set of SPs, we utilize the baseline map B generated by the ARU-Net.

In a first step, B is binarized, \(B_b = B>b\), by an element-wise comparison of B with a confidence threshold b. The morphological skeleton \(B_s=\text {SKE}(B_b)\) is calculated for \(B_b\) following Lantuéjoul’s formula [43]. All foreground pixels (pixels with an intensity of 1) of \(B_s\) build an initial set of pixels \(\{\varvec{p}_1,\ldots ,\varvec{p}_M\}\). Its elements are sorted in descending order w.r.t. their baseline confidences. Finally, \(\mathcal {S}\) is set up by iteratively adding pixels of this sorted list (beginning with the first pixel). To keep the number of SPs small, a new pixel \(\varvec{p}\) is added to \(\mathcal {S}\) only if its distance to all other SPs exceeds a certain threshold, \(\left\| \varvec{p}-\varvec{q}\right\| _2>d\quad \forall \varvec{q}\in \mathcal {S}\); otherwise it is skipped. In Fig. 6b, the set of resulting SPs is shown. These SPs build the basis for the further clustering.
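A compact sketch of this extraction step is given below; scikit-image’s skeletonization is used as a stand-in for Lantuéjoul’s morphological skeleton, and the quadratic distance check is kept simple for clarity.

```python
import numpy as np
from skimage.morphology import skeletonize  # assumption: scikit-image available

def extract_superpixels(B, b=0.2, d=10):
    """Binarize the baseline map B, skeletonize it, sort the skeleton pixels by
    baseline confidence and greedily keep only pixels at least d away from all
    previously kept superpixels (cf. Remark 3.2 for the default b and d)."""
    B_s = skeletonize(B > b)
    ys, xs = np.nonzero(B_s)
    order = np.argsort(-B[ys, xs])                  # descending confidence
    sps = []
    for i in order:
        p = np.array([xs[i], ys[i]], dtype=float)
        if all(np.linalg.norm(p - q) > d for q in sps):
            sps.append(p)
    return sps
```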

Remark 3.2

For all experiments, we have chosen fixed values of \(b=0.2\) (binarization threshold) and \(d=10\) (distance threshold). These demonstrated to be well suited for a wide range of different scenarios. Hence, they are not regarded as free parameters of the system which have to be further tuned. This also holds for the parameters which are fixed in Remarks 3.5 and 3.9.

3.3.2 Superpixel state estimation

Assume we can assign each SP to a certain text line. The state of an SP should encode meaningful characteristics of its text line. These characteristics will be defined and combined to build the state. This work is based on the previous work of [23, 33], but adapted to the characteristics of SPs extracted given the ARU-Net output, e.g., easier calculation of the local text orientation as well as a different smoothing cost formulation.

Definition 3.11

(Local text orientation) The local text orientation \(\theta \) of an SP \(\varvec{p}\) is the slope of its text line’s baseline at the coordinates closest (w.r.t. the Euclidean distance) to \(\varvec{p}\).

Definition 3.12

(Interline distance) The interline distance of an SP \(\varvec{p}\) is the distance of its text line’s baseline to the nearest other baseline, measured orthogonally to the local text direction of \(\varvec{p}\).

Definition 3.13

(State) The state of an SP is the pair \((\theta ,s)\) of its local text orientation and its interline distance.

In the following, we will describe a method to estimate the states of all SPs. The local text orientation will be calculated in a straightforward way utilizing solely the baseline image B and local information. On the other hand, the estimation of the interline distances combines local information of the text line’s periodicity with the more global assumption that nearby SPs tend to have similar interline distances. For these approaches, the concepts of neighborhood and connectivity are mandatory and will be introduced.

Definition 3.14

(Neighborhood system, edge, adjacent) We call a subset \(\mathcal {N}\subset \mathcal {S}\times \mathcal {S}\) a neighborhood system. An element of \(\mathcal {N}\) is called edge and denoted by \(\varvec{e}_{\varvec{p},\varvec{q}}\). \(\mathcal {N}\) is not directed (\(\varvec{e}_{\varvec{p},\varvec{q}}=\varvec{e}_{\varvec{q},\varvec{p}}\)). Two SPs \(\varvec{p},\ \varvec{q}\) are adjacent if \(\varvec{e}_{\varvec{p},\varvec{q}}\in \mathcal {N}\). \(\varvec{e}_{\varvec{p},\varvec{q}}{\setminus } \varvec{p}\in \mathcal {S}\) denotes the SP \(\varvec{q}\).

Remark 3.3

In the following, the neighborhood system \(\mathcal {N}\) for a set of SPs is always calculated by Delaunay’s triangulation [44].

Definition 3.15

(Connectivity function) The line segment \(g(\ \cdot \ ;\varvec{e}_{\varvec{p},\varvec{q}}):\left[ 0,1\right] \rightarrow \mathbb {R}^2\) defined by \(g(\tau ; \varvec{e}_{\varvec{p},\varvec{q}}):=\varvec{p} + \tau \left( \varvec{q}-\varvec{p}\right) \) connects the two pixels \(\varvec{p},\varvec{q}\) of the edge \(\varvec{e}_{\varvec{p},\varvec{q}}\). The function \(\varGamma : \mathcal {N}\times \mathcal {I}\rightarrow \left[ 0,1\right] \) defined by

$$\begin{aligned} \varGamma (\varvec{e}_{\varvec{p},\varvec{q}},I)=\frac{\int _{0}^{1} I(g(\tau ;\varvec{e}_{\varvec{p},\varvec{q}})) d\tau }{\left\| \varvec{p}-\varvec{q}\right\| _2} \end{aligned}$$

is called connectivity function. \(I(g(\tau ;\varvec{e}_{\varvec{p},\varvec{q}}))\) denotes the intensity of the pixel in I closest (w.r.t the Euclidean distance) to the real-valued coordinates \(g(\tau ;\varvec{e}_{\varvec{p},\varvec{q}})\).

The connectivity function calculates the average intensity for a given image along the shortest path connecting two pixels. The local text orientation of each SP is estimated by \(\theta _{\varvec{p}}=\text {LTO}(\varvec{p};\mathcal {N},B)\) utilizing \(\mathcal {N}\) and the baseline image B, see Algorithm 1. The LTO algorithm picks the two neighbors of an SP \(\varvec{p}\) with the largest baseline connectivity to \(\varvec{p}\) and determines the slope of the line passing through these neighbors.
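A discrete approximation of the connectivity function is straightforward; the number of sampling points along the segment is an assumption of this sketch.

```python
import numpy as np

def gamma(p, q, img, n_samples=64):
    """Average intensity of img along the line segment from p to q
    (Definition 3.15), sampled at the nearest pixel for each point g(tau)."""
    vals = []
    for tau in np.linspace(0.0, 1.0, n_samples):
        x, y = p + tau * (q - p)                    # g(tau; e_pq)
        yi = int(round(np.clip(y, 0, img.shape[0] - 1)))
        xi = int(round(np.clip(x, 0, img.shape[1] - 1)))
        vals.append(img[yi, xi])
    return float(np.mean(vals))
```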

Fig. 7
figure 7

Interline distance estimation—illustration of several projection profiles for a certain SP (red point). The profiles for different diameters \(d\in \{64,128,256,512\}\) and an orientation of \(0^{\circ }\) are shown in green. The winning period (interline distance) is drawn as yellow curve. In blue a histogram for a wrong orientation (\(45^{\circ }\)) is shown (color figure online)

figure a

The periodicity of text lines in document images is utilized to calculate the interline distances. We determine the interline distance of an SP \(\varvec{p}\) by evaluating the regional text line periodicity around \(\varvec{p}\) as follows. For an SP \(\varvec{p}\), a circular region of diameter \(d\in \mathbb {N}\) around \(\varvec{p}\), and a projection direction determined by the local text orientation \(\theta _{\varvec{p}}\), let \(\varvec{h}^{\varvec{p},d}=(h^{\varvec{p},d}_1,\ldots ,h^{\varvec{p},d}_{d})\in \mathbb {N}^d\) be the projection profile with respect to \(\mathcal {S}\), see Fig. 7. For the calculation of \(\varvec{h}^{\varvec{p},d}\), only SPs with a distance to \(\varvec{p}\) of less than \(\frac{d}{2}\) are taken into account.

Remark 3.4

The projection profile \(\varvec{h}^{\varvec{p},d}\) can be calculated very efficiently by utilizing the cross-product of the orientation vector \(\varvec{o}=(\cos (\theta _{\varvec{p}}),\sin (\theta _{\varvec{p}}))^T\) and the vectors for \(\varvec{q}\in \mathcal {S}\) with \(\left\| \varvec{p}-\varvec{q}\right\| _2\le \frac{d}{2}\).

To extract the regional periodicity inherent in the projection profile \(\varvec{h}^{\varvec{p},d}\), a Discrete Fourier Transformation (DFT) is applied to \(\varvec{h}^{\varvec{p},d}\) with resulting coefficients \(\varvec{H}^{\varvec{p},d}=(H^{\varvec{p},d}_1,\ldots ,H^{\varvec{p},d}_{d})\in \mathbb {C}^d\). A coefficient \(H^{\varvec{p},d}_k,\ k\in \left\{ 1,\ldots ,d\right\} \) corresponds to the contribution of the component with period \(\frac{d}{k}\) to the entire signal \(\varvec{h}^{\varvec{p},d}\). In the simplest case, the index \(k'\) of the dominant coefficient of \(\varvec{H}^{\varvec{p},d}\) determines the interline distance s of \(\varvec{p}\) as \(s = \frac{d}{k'}\). However, we may be forced to assign a different value to s due to additional constraints discussed below. Therefore, we introduce a data energy value for each possible value \(\frac{d}{k}\) of the interline distance s of \(\varvec{p}\). From this energy, we then derive a data cost to be used within a cost minimization framework for finding the optimal interline distance.

Definition 3.16

(Data energy, data cost) The data energy of SP \(\varvec{p}\) and interline distance \(\frac{d}{k}\) is given by \(E_{\varvec{p}}\left( \frac{d}{k}\right) =\frac{\left| H^{\varvec{p},d}_k\right| ^2}{\left\| \varvec{H}^{\varvec{p},d}\right\| _2^2}\). The data cost is calculated by \(D_{\varvec{p}}\left( \frac{d}{k}\right) = -\log \left( E_{\varvec{p}}\left( \frac{d}{k}\right) \right) \).

Remarkably, the data energy is normalized such that it sums (over k) to 1.0 for arbitrary \(d\in \mathbb {N}\). To cover a suitable range of different interline distances as well as to be robust against disturbances due to close-by text regions of a different style, the projection profiles and DFTs are calculated for different diameters \(d\in \{64,128,256,512\}\) and \(k\in \{3,4,5\}\). The choice of the values for d and k is application driven and results in the sorted list of possible interline distances \(\varvec{S}:=(170.7, 128.0, 102.4, 85.3, 64.0, 51.2, 42.7, 32.0, 25.6, 21.3, 16.0, 12.8)\).
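A sketch of the data energy computation for a single SP and one diameter is given below; the binning of the projection profile into d one-pixel bins is an assumption of this sketch.

```python
import numpy as np

def interline_data_energy(p, sps, theta_p, d):
    """Projection profile of the SPs around p orthogonal to the local text
    orientation, followed by a DFT; returns the normalized squared magnitudes,
    i.e., energy[k] = E_p(d/k) of Definition 3.16."""
    o = np.array([np.cos(theta_p), np.sin(theta_p)])
    profile = np.zeros(d)
    for q in sps:
        v = q - p
        if np.linalg.norm(v) <= d / 2:
            off = o[0] * v[1] - o[1] * v[0]          # cross-product, cf. Remark 3.4
            profile[int(np.clip(off + d / 2, 0, d - 1))] += 1
    H = np.fft.fft(profile)
    return np.abs(H) ** 2 / max(np.sum(np.abs(H) ** 2), 1e-12)

# The data cost is D_p(d/k) = -log(energy[k]), evaluated for
# d in {64, 128, 256, 512} and k in {3, 4, 5}.
```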

In the following, we write \(s_{\varvec{p}}\) for the assigned interline distance \(s=\frac{d}{k}\in \varvec{S}\) of SP \(\varvec{p}\) and say \(\varvec{p}\) is labeled with \(s_{\varvec{p}}\). A labeling \(\left\{ s_{\varvec{p}} \right\} _{{\varvec{p}}\in \mathcal {S}}\) of \(\mathcal {S}\) assigns an interline distance to each SP of \(\mathcal {S}\). A greedy labeling strategy which assigns the interline distance with the highest data energy to each SP leads to a noisy result, see Fig. 8a. To reduce the noise effects, the influence of close-by SPs is taken into account. It is reasonable to expect that neighboring SPs tend to have similar interline distances. This expectation is encoded via a smoothing cost defined for adjacent SPs.

Definition 3.17

(Smoothing cost) For each \(\varvec{e}_{\varvec{p},\varvec{q}}\in \mathcal {N}\) (and assigned interline distances \(s_{\varvec{p}},s_{\varvec{q}}\)) the smoothing cost is defined by

$$\begin{aligned} V_{\varvec{p},\varvec{q}}(s_{\varvec{p}},s_{\varvec{q}}) = {\left\{ \begin{array}{ll} \sigma , &{} \quad \langle s_{\varvec{p}},s_{\varvec{q}}\rangle _s \ge 4,\\ \langle s_{\varvec{p}},s_{\varvec{q}}\rangle _s, &{}\quad \mathrm{else}.\\ \end{array}\right. } \end{aligned}$$

\(\langle s_{\varvec{p}},s_{\varvec{q}} \rangle _s\) is the index difference of \(s_{\varvec{p}}\) and \(s_{\varvec{q}}\) in the sorted list \(\varvec{S}\) of possible interline distances, e.g., \(\langle 16.0, 42.7\rangle _s=4\). Thus, the smoothing cost \(V_{\varvec{p},\varvec{q}}(s_{\varvec{p}},s_{\varvec{q}})\) becomes large if interline distances of different sizes are assigned to adjacent SPs. A maximum cost value of \(\sigma \) is used for large differences in the interline distances. Setting \(\sigma \) to a large value prevents neighboring SPs from differing too much in their interline distances.

Fig. 8
figure 8

SPs with their assigned states—the local text orientation of each SP is visualized by the orientation of the green lines (rotated by \(90^{\circ }\)). The length of the lines encodes the interline distance of the corresponding SP (color figure online)

Definition 3.18

(Labeling cost) The labeling cost is given by

$$\begin{aligned} C\left( \left\{ s_{\varvec{p}} \right\} _{{\varvec{p}}\in \mathcal {S}}\right) = \alpha \sum _{{\varvec{p}}\in \mathcal {S}} D_{\varvec{p}}\left( s_{\varvec{p}}\right) +\beta \sum _{\varvec{e}_{\varvec{p},\varvec{q}}\in \mathcal {N}}V_{\varvec{p},\varvec{q}}(s_{\varvec{p}},s_{\varvec{q}}). \end{aligned}$$

The data cost and the smoothing costs are weighted by \(\alpha \) and \(\beta \), respectively, to form the labeling cost. The graph cut algorithm [45] is utilized to minimize the labeling cost. The final labeling is shown in Fig. 8b.
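The two cost terms can be written down directly; the sketch below only evaluates the objective of Definition 3.18, whereas the paper minimizes it with the graph cut algorithm [45].

```python
S_LIST = [170.7, 128.0, 102.4, 85.3, 64.0, 51.2, 42.7, 32.0,
          25.6, 21.3, 16.0, 12.8]                    # sorted list S from above

def smoothing_cost(s_p, s_q, sigma=25):
    """Smoothing cost of Definition 3.17 via the index difference in S_LIST."""
    diff = abs(S_LIST.index(s_p) - S_LIST.index(s_q))
    return sigma if diff >= 4 else diff

def labeling_cost(labels, data_cost, edges, alpha=1.0, beta=1.0):
    """Labeling cost of Definition 3.18.

    labels:    dict mapping SP index -> assigned interline distance s_p
    data_cost: dict mapping (SP index, s) -> D_p(s)
    edges:     iterable of index pairs (i, j) from the neighborhood system N
    """
    data = sum(data_cost[(i, labels[i])] for i in labels)
    smooth = sum(smoothing_cost(labels[i], labels[j]) for i, j in edges)
    return alpha * data + beta * smooth
```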

Remark 3.5

For all experiments, we have chosen fixed values of \(\sigma =25\), \(\alpha =1\) and \(\beta =1\).

3.3.3 Superpixel clustering

In the previous subsections, the calculation of SPs and their enrichment with state information was described. In a final step, this state information is utilized to cluster the SPs to build baselines. There will be a one-to-one assignment between clusters and baselines. In the following, we call a set of SPs a cluster.

In this subsection, we formulate the clustering problem and introduce a greedy clustering procedure to solve the problem. Two assumptions which hold for baselines in general constitute the conditions for the clustering problem:

  1. (I)

    Baselines should not exceed a certain curvilinearity value.

  2. (II)

    Within the interline distance of a baseline, there are no other baselines.

Basically, assumption (I) claims that a baseline can be approximated by a polynomial function of a certain degree, see [23]. Assumption (II) is self-explanatory.

Remark 3.6

In the following, \(\theta (\{\varvec{p}_1,\ldots ,\varvec{p}_n\})\) denotes the average orientation and \(s(\{\varvec{p}_1,\ldots ,\varvec{p}_n\})\) the average interline distance of all SPs in \(\{\varvec{p}_1,\ldots ,\varvec{p}_n\}\).

Definition 3.19

(Curvilinearity value) Let \(\mathrm{deg}\in \mathbb {N}\) and \(\mathcal {S}\) be a set of SPs. Assume \(p_{\mathcal {S},\mathrm{deg}}(t)\in \mathbb {P}\left[ t\right] \) is the polynomial which solves the linear regression problem in the monomials \(t^0,t^1,\ldots ,t^{\mathrm{deg}}\) for \(\mathcal {S}'\) which results from \(\mathcal {S}\) by rotating all pixels by \(-\theta (\mathcal {S})\). The root-mean-square regression error normalized by \(s(\mathcal {S})\) is called curvilinearity value of \(\mathcal {S}\) and is denoted by \(cur(\mathcal {S},\mathrm{deg})\).

Remark 3.7

We fix \(\mathrm{deg}=3\) and omit it in the following.
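The curvilinearity value can be computed with a standard polynomial fit; the following NumPy sketch assumes the SP coordinates are given as an (n, 2) array and the average interline distance is passed in.

```python
import numpy as np

def curvilinearity(points, theta, s_mean, deg=3):
    """Curvilinearity value of Definition 3.19: rotate the SPs by -theta, fit a
    degree-deg polynomial y(x) and return the RMS regression error normalized
    by the average interline distance s_mean."""
    c, s = np.cos(-theta), np.sin(-theta)
    R = np.array([[c, -s], [s, c]])
    pts = points @ R.T                               # rotate all SPs by -theta
    coeffs = np.polyfit(pts[:, 0], pts[:, 1], deg)
    residuals = pts[:, 1] - np.polyval(coeffs, pts[:, 0])
    return float(np.sqrt(np.mean(residuals ** 2)) / s_mean)
```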

Definition 3.19 allows for an easy evaluation of (I). To test for (II), we will introduce the distance of two clusters. Notably, only distances orthogonal to the text orientation should be taken into account. First, the orthogonal component of the distance between two SPs is introduced. Afterward, this is generalized to two clusters of SPs.

Definition 3.20

(Off-text distance) Given two SPs \(\varvec{p}\), \(\varvec{q}\) and an orientation \(\theta \), the off-text distance of \(\varvec{p}\) and \(\varvec{q}\) is the length of the component of \(\varvec{p}-\varvec{q}\in \mathbb {R}^2\) which is orthogonal to \(\theta \). It is denoted by \(\left\| \varvec{p}-\varvec{q} \right\| _{\theta }\).

Remark 3.8

The off-text distance can be efficiently calculated by \( \left\| \varvec{p}-\varvec{q} \right\| _{\theta }{=} \left| \left( \varvec{p}_x {-} \varvec{q}_x\right) \sin (\theta )-\left( \varvec{p}_y - \varvec{q}_y\right) \cos (\theta ) \right| \).

Calculating the minimal pairwise off-text distance of all SPs of two clusters could result in a cluster distance distorted by SP outliers. Therefore, SPs in each cluster will be projected onto the corresponding regression curve obtained by the regression problem in Definition 3.19, before taking pairwise distances.

Definition 3.21

(Regression curve) Let \(\mathcal {S},\ \mathcal {S}'\) and \(p_{\mathcal {S}}(t)\) be as in Definition 3.19. The spatial t-range of \(\mathcal {S}'\) is given by \(t_\mathrm{min}=\min \{\varvec{p}_x\ \vert \ \varvec{p}\in \mathcal {S}'\}\) and \(t_\mathrm{max}=\max \{\varvec{p}_x\ \vert \ \varvec{p}\in \mathcal {S}'\}\). A curve \(c_{\mathcal {S}}:\ \left[ 0,1\right] \rightarrow \mathbb {R}^2\) which results from rotating the graph of \(p_{\mathcal {S}}(t)\) for \(t\in \left[ t_\mathrm{min},t_\mathrm{max}\right] \) by \(\theta (\mathcal {S})\) is called regression curve of \(\mathcal {S}\).

The SPs in \(\mathcal {S}\) are projected onto \(c_{\mathcal {S}}\) (in direction \(\theta (\mathcal {S})+\frac{\pi }{2}\)). The resulting projected SPs are denoted by \(\mathcal {S}^c\). To achieve robust distance estimates even for curved and differently slanted text lines, we focus on SPs of the different clusters which are quite close to each other and furthermore take into account the slope of the regression curve at the specific SP positions instead of averaging over the entire text line.

Definition 3.22

(Cluster distance) Assume two clusters \(\mathcal {S}_1\), \(\mathcal {S}_2\) with regression curves \(c_{\mathcal {S}_1}(t)\), \(c_{\mathcal {S}_2}(t)\) and projected SPs \(\mathcal {S}^c_1\), \(\mathcal {S}^c_2\). The cluster distance is defined as

$$\begin{aligned} d(\mathcal {S}_1,\mathcal {S}_2)=\min _{\begin{array}{c} \varvec{p}\in \mathcal {S}_1^c, \varvec{q}\in \mathcal {S}_2^c: \\ \left\| \varvec{p}-\varvec{q}\right\| _2<4\cdot s(\mathcal {S}_1\cup \mathcal {S}_2) \end{array}} \left\| \varvec{p}-\varvec{q} \right\| _{\theta ^c(\varvec{p},\varvec{q})}, \end{aligned}$$

where \(\theta ^c(\varvec{p},\varvec{q})\) is the average slope of the corresponding regression curves at \(\varvec{p}\) and \(\varvec{q}\), respectively.
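Both distances translate directly into code; the cluster distance below is a simplified sketch that expects the projected SPs and the regression-curve slopes at those SPs to be precomputed.

```python
import numpy as np

def off_text_distance(p, q, theta):
    """Off-text distance of Definition 3.20, computed as in Remark 3.8."""
    return abs((p[0] - q[0]) * np.sin(theta) - (p[1] - q[1]) * np.cos(theta))

def cluster_distance(proj1, slopes1, proj2, slopes2, s_union):
    """Cluster distance of Definition 3.22 on projected SPs: only pairs closer
    than 4 * s_union are considered, and the orientation is the average of the
    regression-curve slopes at the two SPs."""
    best = np.inf
    for p, th_p in zip(proj1, slopes1):
        for q, th_q in zip(proj2, slopes2):
            if np.linalg.norm(p - q) < 4 * s_union:
                best = min(best, off_text_distance(p, q, 0.5 * (th_p + th_q)))
    return best
```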

Since it is now possible to evaluate conditions (I) & (II), we will use this to introduce feasible sets of clusters. For this purpose, we limit ourselves to partitions (a special kind of cluster sets) and require the baseline clusters to be \(\mathcal {N}\)-linked. The set of all partitions of a set \(\mathcal {M}\) is denoted by \(par(\mathcal {M})\).

Definition 3.23

(\(\mathcal {N}\)-linked) Let \(\mathcal {S}\) be a cluster and \(\mathcal {N}\) be a neighborhood system. \(\mathcal {S}\) is \(\mathcal {N}\)-linked iff \(\forall \varvec{p},\varvec{q}\in \mathcal {S}\ \exists \varvec{p}_0,\ldots ,\varvec{p}_N\in \mathcal {S}: \varvec{p}_0=\varvec{p}\wedge \varvec{p}_N=\varvec{q}\wedge \varvec{e}_{\varvec{p}_i,\varvec{p}_{i+1}}\in \mathcal {N}\ (0\le i\le N-1)\) holds.

That means, any two SPs of \(\mathcal {S}\) are connected by a path of edges in \(\mathcal {N}\) whose intermediate SPs all belong to \(\mathcal {S}\).

Definition 3.24

(Feasible) For \(\gamma ,\delta \in \mathbb {R}_+\), \(L\in \mathbb {N}\), a set of SPs \(\mathcal {S}\) and a neighborhood system \(\mathcal {N}\), we call a set of clusters \(\mathscr {P}=\left\{ \mathcal {S}_0, \ldots , \mathcal {S}_L \right\} \) feasible iff

  1. 1.

    \(\mathscr {P}\in par(\mathcal {S})\)

  2. 2.

    \(\forall i > 0:\ \mathcal {S}_i\) is \(\mathcal {N}\)-linked

  3. 3.

    Conditions (I) and (II) hold:

    • \(cur(\mathcal {S}_i)<\gamma \quad \forall i>0\)

    • \(d(\mathcal {S}_i,\mathcal {S}_j)>\delta \cdot \max \{s(\mathcal {S}_i),s(\mathcal {S}_j)\}\quad \forall i,j>0, i\ne j\).

The set of feasible sets of clusters is denoted by \(feas_{\mathcal {N}}(\mathcal {S})\).

The clusters \(\mathcal {S}_i,\ i> 0\) identify the baselines, and \(\mathcal {S}_0\) constitutes the clutter cluster containing SPs not belonging to any baseline. We identify the baseline corresponding to \(\mathcal {S}_i\) with the polygonal chain of the projected SPs \(\mathcal {S}_i^c\), which follows the regression curve \(c_{\mathcal {S}_i}(t)\), see Fig. 9. The number \(L\in \mathbb {N}\) of baselines is (a priori) unknown. In the following, we will incorporate domain knowledge to ensure that SPs belonging to different baselines are not \(\mathcal {N}\)-linked. Hence, clusterings with erroneously connected baselines are no longer feasible. This is done by a modification of the neighborhood system \(\mathcal {N}\).

Fig. 9
figure 9

Influence of the separator information—the resulting baselines (blue lines) with and without taking into account the separator information are shown (color figure online)

Since baselines of different text orientations should not contribute to the same cluster, we adjust the initial neighborhood system \(\mathcal {N}\) by removing edges \(\varvec{e}_{\varvec{p},\varvec{q}}\) of SPs with substantially different local orientations: \(\left| \theta _{\varvec{p}}-\theta _{\varvec{q}}\right| \mod \pi > \frac{\pi }{4}\). In addition, it is easy to incorporate layout information by further adjusting \(\mathcal {N}\). The layout information encoded by the separator image S (Fig. 6a) can be incorporated by taking into account the connectivity of SPs in S. All edges \(\varvec{e}_{\varvec{p},\varvec{q}}\in \mathcal {N}\) for which a separator is crossed, i.e., \(\varGamma (\varvec{e}_{\varvec{p},\varvec{q}},S)>\eta \) or \(\max _{\tau }S(g(\tau ;\varvec{e}_{\varvec{p},\varvec{q}}))> 2\cdot \eta \) (g of Definition 3.15) holds, are removed, see Fig. 9b.

Finally, a common scenario is the baseline detection with given text regions. We assume that the text regions are represented by closed polygonal chains \(\varvec{R}_1,\ldots ,\varvec{R}_N\). This additional layout information (if available) is easy to integrate. All edges for which \(\not \exists \varvec{R}_i:\ \varvec{R}_i\text { contains } \varvec{p},\varvec{q}\) holds are removed. Roughly speaking, a closed polygonal chain contains an SP if, on every path from the SP to the image border, one has to cross the polygonal chain. Hence, SPs which are part of different non-overlapping text regions are no longer \(\mathcal {N}\)-linked. Thus, each baseline \(\mathcal {S}_i,\ i>0\) is entirely contained in one text region for all feasible sets. The resulting neighborhood system is still denoted by \(\mathcal {N}\).
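The three pruning rules can be combined in one pass over the Delaunay edges; in the sketch below, gamma is the connectivity sketch from above, while max_along (maximum separator intensity along an edge) and point_in_region (point-in-polygon test) are assumed helpers.

```python
import numpy as np

def prune_edges(edges, theta, S, sps, eta=0.125, regions=None):
    """Remove edges between SPs of substantially different local orientation,
    edges crossing a separator in the separator image S, and (if text regions
    are given) edges whose endpoints share no common region."""
    kept = []
    for i, j in edges:
        if abs(theta[i] - theta[j]) % np.pi > np.pi / 4:
            continue                                  # different orientations
        p, q = sps[i], sps[j]
        if gamma(p, q, S) > eta or max_along(p, q, S) > 2 * eta:
            continue                                  # separator is crossed
        if regions is not None and not any(
                point_in_region(p, r) and point_in_region(q, r) for r in regions):
            continue                                  # no common text region
        kept.append((i, j))
    return kept
```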

Remark 3.9

For all experiments, we have chosen fixed values of \(\gamma = 0.3\), \(\delta = 0.5\) (Definition 3.24) and \(\eta =0.125\).

After reducing the neighborhood system, we now introduce the total baseline energy. We will assign an energy to all feasible sets and aim for an optimal one. This allows for the formulation of the clustering problem to be solved.

Definition 3.25

(Total baseline energy) Let B be a baseline image, \(\mathcal {N}\) a neighborhood system and \(\mathscr {P}=\left\{ \mathcal {S}_0, \ldots , \mathcal {S}_L \right\} \) a set of clusters over \(\mathcal {S}\). With \(\mathcal {N}(\mathcal {S}_i)=\{\varvec{e}_{\varvec{p},\varvec{q}}\in \mathcal {N}\ |\ \varvec{p},\varvec{q}\in \mathcal {S}_i\}\subset \mathcal {N}\) the total baseline energy is defined by

$$\begin{aligned} b(\mathscr {P})=\sum _{i=1}^L\ \sum _{\varvec{e}_{\varvec{p},\varvec{q}}\in \mathcal {N}(\mathcal {S}_i)}\varGamma (\varvec{e}_{\varvec{p},\varvec{q}},B). \end{aligned}$$

Finally, the clustering problem can be formulated as

$$\begin{aligned} \mathscr {P^*}=\mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\mathscr {P}\in feas_{\mathcal {N}}(\mathcal {S})}b(\mathscr {P}). \end{aligned}$$

Because there could be a huge number of feasible sets of clusters for large \(\mathcal {S}\), we introduce a greedy clustering algorithm. The proposed algorithm clusters edges of \(\mathcal {N}\) instead of clustering SPs. If an edge is assigned to a cluster (set) of edges, we assign both corresponding SPs to the corresponding cluster of SPs. In a first step, the set of edges in \(\mathcal {N}\) is sorted in decreasing order w.r.t.

$$\begin{aligned} \left( 1- \frac{\left\| \varvec{p}-\varvec{q}\right\| _{\theta (\{\varvec{p},\varvec{q}\})}}{\left\| \varvec{p}-\varvec{q}\right\| _2} \right) \cdot \varGamma (\varvec{e}_{\varvec{p},\varvec{q}},B). \end{aligned}$$

Hence, the sorting takes into account the B-connectivity value of an edge and discounts it if \(\varvec{e}_{\varvec{p},\varvec{q}}\) is rather orthogonal to \(\theta (\{\varvec{p},\varvec{q}\})\). Discounted edges are less likely to be part of a baseline and are therefore sorted to the end of the list. The sorted list is denoted by \(\varvec{N}\). This avoids that such edges are falsely assigned to baseline clusters which are composed of just a few correct edges (whose cluster statistics are not yet reliable). Given \(\mathcal {S}\) and \(\varvec{N}\), the proposed clustering algorithm initially assigns all edges to the clutter cluster (\(\mathcal {S}_0\)). It then iteratively moves edges to baseline clusters such that the resulting set of clusters remains feasible and the total baseline energy increases. The algorithm is shown in detail in the supplements (Algorithm S.2).
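The sorting criterion itself is a one-liner; the sketch below reuses the off-text distance and connectivity sketches from above.

```python
import numpy as np

def edge_sort_key(p, q, theta_pq, B):
    """Score used to sort the edges in decreasing order: the B-connectivity of
    the edge, discounted if the edge is rather orthogonal to the average
    orientation theta_pq of its two SPs."""
    discount = 1.0 - off_text_distance(p, q, theta_pq) / np.linalg.norm(p - q)
    return discount * gamma(p, q, B)

# N_sorted = sorted(edges, key=lambda e: edge_sort_key(*e), reverse=True)
```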

4 Experiments

The experiments section is divided into four subsections. First, we investigate the influence of the training set size as well as the influence of different data augmentation strategies. This is followed by an investigation of the performance of the proposed method when applied to images with curved or arbitrarily oriented text lines. The third subsection presents and compares results of different versions of our proposed NPL architectures on the very heterogeneous and challenging cBAD dataset [25, 26]. We perform statistical tests to show the statistical significance of the stated conclusion—the superiority of the proposed ARU-Net in a two-stage workflow over other architectures and a single-stage workflow. Finally, we compare the proposed method against other state-of-the-art methods on the datasets of three recently hosted competitions. As mentioned in Sect. 3, we follow the similarity score of [25] (F value) to measure the quality of the baseline detection. The configuration for all experiments, including the hyperparameters of the network architecture as well as the training, is summarized in Table 1. This configuration is the result of an extensive search in the hyperparameter space and yields strong results for various scenarios/datasets.

Table 1 Hyperparameters: the architecture and training configuration which were used in this work are described

Since no early stopping based on the loss for any validation set is used, we train on the entire training set. The ARU-Net workflow for training and inference (Tensorflow code) as well as a trained network are freely available.Footnote 4 The ARU-Net training takes 3–24 h from scratch (depending on the number of epochs and samples per epoch) on a Titan X GPU. The inference time ranges from 2 to 12 s per image on a dual-core laptop (Intel Core i7-6600U with 16 GiB RAM); this reduces to 0.5 to 2 s when running the ARU-Net on the Titan X.

4.1 Influence of training sample number and data augmentation

A major drawback of state-of-the-art approaches (Sect. 2) is the need for extensive expert tuning when confronted with scenarios which are not already covered. But the suitability for use at industrial scale depends on the ability to adapt easily and at reasonable cost. For approaches relying on machine learning, this reduces to two questions:

  • What about the amount of ground truth needed?

  • What about the effort of ground truth production?

Concerning the second question, we refer to the automatic generation of pixel ground truth given the baseline ground truth. The annotation of (polygonal) baselines for a document image is quite easy and does not require notable expert knowledge compared to, e.g., ground truth production for ATR systems for historical handwritings or even text line annotation at the surrounding polygon level. The effort is reduced to several minutes per page by using platforms such as Transkribus.Footnote 5 In the following, we want to examine the first question.

The influence of the training dataset size along with different data augmentation strategies is investigated for the freely available Bozen datasetFootnote 6 [24], see Fig. S.1 (supplements). This dataset is a subset of documents from the Ratsprotokolle collection of Bozen composed of minutes of the council meetings held from 1470 to 1805 and consists of 400 pages. It is written in Early Modern German. Baseline ground truth information is available in the form of PAGEFootnote 7 XML. The dataset is quite challenging concerning layout analysis issues. Most of the pages consist of a single main text region with many difficulties for line detection and extraction, e.g., bleed-through, touching text lines and marginalia. For the following experiments, we have randomly divided the Bozen set into a set of training samples \(\mathcal {T}\) of size 350 and a test set of size 50. In a first step, we randomly set up a chain of subsets of \(\mathcal {T}\)

$$\begin{aligned} \mathcal {T}_1\subset \mathcal {T}_3\subset \mathcal {T}_5\subset \mathcal {T}_{10}\subset \mathcal {T}_{30}\subset \mathcal {T}_{50}\subset \mathcal {T}_{100}\subset \mathcal {T}_{200}\subset \mathcal {T}_{350}, \end{aligned}$$
Fig. 10 Influence of the number of training samples and of different data augmentation strategies—the bar height represents the mean F value. The error bars encode the min–max values of the 5 experiments (not the standard deviation). The dashed green line marks the maximum mean value of 0.975 achieved for 350 training samples. For a description of the different augmentation strategies B, S, \(S+A\) and \(S+A+E\), see the main text (color figure online)

where \(\mathcal {T}_i\) contains i training samples (pages and pixel ground truth). Since we expect an influence of the choice of training samples (= sorting of \(\mathcal {T}\)), we repeat this procedure 4 times. Notably, the test set remains untouched. In total, we obtain 45 training sets—five of each size. For each set, we trained the RU-Net for 100 epochs with 256 images per epoch. To this end, we randomly choose samples of the training set and remove them from it; once each element of the training set has been used for training, we start again with the initial training set. Hence, it does not matter whether the number of training samples per epoch exceeds the size of the training set or not, and the same number of training samples is shown to the networks during training regardless of the training set size. The RU-Net was chosen instead of the ARU-Net because of the homogeneity of the Bozen dataset concerning font size and resolution. We trained the RU-Net from scratch on all 45 sets in 4 different scenarios. For these training runs, the image pre-processing mentioned in Table 1 is disabled. Instead, the training samples \((I,\mathcal {G}_I)_i\) are pre-processed following one of the four strategies (a code sketch of these strategies follows the list):

  1. Subsampled by a constant factor of 3 (no further data augmentation—one training sample per element of the training set)—B

  2. Randomly subsampled by a factor \(s\in \left[ 2,5\right] \)—S

  3. \(S +\) random affine transformation (three corner points of the image are randomly shifted within a circle of diameter \(0.025\cdot \max (\mathcal {I}_h,\mathcal {I}_w)\) around their original position)—\(S + A\)

  4. \(S + A +\) elastic transformation [46]—\(S + A + E\).
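The following is a minimal sketch of the B, S and \(S + A\) steps for OpenCV-style images; the elastic transformation of [46] is omitted, all parameter names are ours, and the corner shift samples uniformly in the bounding square of the circle for brevity.

```python
import numpy as np
import cv2

def subsample(img, factor):
    """Downscale the image by the given factor (strategy B uses factor 3)."""
    return cv2.resize(img, None, fx=1.0 / factor, fy=1.0 / factor,
                      interpolation=cv2.INTER_AREA)

def random_subsample(img, rng):
    """Strategy S: random subsampling factor s in [2, 5]."""
    return subsample(img, rng.uniform(2.0, 5.0))

def random_affine(img, rng):
    """Strategy A: shift three corner points within a circle of
    diameter 0.025 * max(height, width) around their original position."""
    h, w = img.shape[:2]
    radius = 0.025 * max(h, w) / 2.0
    src = np.float32([[0, 0], [w - 1, 0], [0, h - 1]])
    # square approximation of the circular neighborhood
    dst = src + rng.uniform(-radius, radius, size=(3, 2)).astype(np.float32)
    M = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REPLICATE)

rng = np.random.default_rng(0)
# augmented = random_affine(random_subsample(page_image, rng), rng)  # S + A
```

The corresponding pixel ground truth has to be transformed with the same subsampling factor and affine matrix so that image and label stay aligned.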

For the test set, the images were subsampled by the constant factor of 3 in all scenarios. The results of these 180 experiments are shown in Fig. 10. One can see that all three data augmentation strategies significantly improve the performance compared to the base (B) strategy. Notably, for small numbers of training samples the min–max difference is much larger than for larger numbers of training samples. Hence, if just a few training samples are available, their choice matters. The best mean F value (0.975) is achieved for all 350 training samples with the \(S+A+E\) strategy. Nevertheless, there is only a negligible loss in performance for 200 or 100 training samples. Even for 30 training samples, an F value of 0.963 is achieved with the \(S+A\) strategy, which is sufficient for most applications, see Fig. S.1 (supplements). This results in a quite acceptable effort for ground truth production, making the presented approach interesting even for production at an industrial scale. The \(S + A\) data augmentation strategy will be the default for the rest of this work.

Of course, the presented numbers are not directly transferable to collections with pages of entirely different scenarios, e.g., census tables mixed with postal cards mixed with … One would expect that more than 30 training samples are necessary for such a scenario. Nevertheless, the presented experiment reflects a common situation: one has a robust baseline detector which was trained on very heterogeneous data (see Sect. 4.4.3), but this detector does not work satisfyingly well for a certain (in most cases quite homogeneous) collection. The numbers presented here give a hint concerning the ground truth production effort necessary in this scenario.
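For reference, the per-epoch sampling scheme described above (pages are drawn without replacement and the pool is refilled once exhausted, so every epoch contains exactly 256 samples regardless of the training set size) can be sketched as follows; the generator is our illustration, not the released training code.

```python
import random

def epoch_sampler(training_set, samples_per_epoch=256):
    """Yield exactly `samples_per_epoch` pages per epoch.

    Pages are drawn without replacement; once every page has been used,
    the pool is refilled, so small training sets are simply cycled
    through several times within one epoch.
    """
    pool = []
    while True:
        epoch = []
        for _ in range(samples_per_epoch):
            if not pool:                      # pool exhausted -> restart
                pool = list(training_set)
                random.shuffle(pool)
            epoch.append(pool.pop())
        yield epoch
```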

4.2 Curved and oriented text lines

In this subsection, we demonstrate the ability of the introduced approach to handle curved or arbitrarily oriented text lines. In a first experiment, the test set of the Bozen dataset was deformed to contain arbitrarily curved text lines. For this purpose, we utilized trigonometric functions with random period to simulate curved text lines in the test phase. The RU-Net was trained (5 times) for 100 epochs with 256 samples per epoch on the Bozen training set using the \(S + A + E\) augmentation strategy with strong elastic deformations. We chose elastic transformations in training because they simulate curves of different amplitudes and frequencies within the same image. Furthermore, we increased the polynomial degree (Definition 3.19) to 5 to enable the system to handle the curvatures present in the test set.
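One possible way to realize such a trigonometric test deformation is to displace each image column vertically by a sine with random amplitude and period; the sketch below is our interpretation of the described procedure, with all parameter values chosen as examples.

```python
import numpy as np
import cv2

def curve_page(img, rng, max_amp=20.0):
    """Bend a page image with a sinusoidal vertical displacement.

    Each image column x is shifted vertically by
    amp * sin(2*pi*x / period), with random amplitude and period.
    """
    h, w = img.shape[:2]
    amp = rng.uniform(0.3, 1.0) * max_amp
    period = rng.uniform(0.5, 2.0) * w
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_y = (ys + amp * np.sin(2.0 * np.pi * xs / period)).astype(np.float32)
    return cv2.remap(img, xs, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)

rng = np.random.default_rng(42)
# curved = curve_page(page_image, rng)
```

The baseline ground truth of the test pages has to be bent with the same displacement field so that the evaluation remains consistent.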

Remark 4.1

Different methods were used to deform the images during the training and test phases. Hence, the system had to learn the concept of curved text lines rather than an inversion of the image degradation method used in the training phase.

In a second experiment, we trained an RU-Net (5 times) on arbitrarily oriented samples of the Bozen training set and evaluated the resulting networks on oriented pages of the test set. The results are shown in Table 2, and a few sample images are shown in Fig. S.2 (supplements). For the curved scenario, the results are as good as for the base scenario. In the oriented scenario, the results are slightly worse, but still excellent. This demonstrates the applicability to images with curved or oriented text lines without remarkable adaptation of the workflow. Finally, we trained five models with all degradations (affine, elastic, rotation) and evaluated these models on the three different scenarios. The corresponding F values are depicted in Table 2. This combined system is worse than the scenario-specific (expert) models for the base and curved scenarios, but in the oriented scenario it even benefits from the additional elastic transformations.

Table 2 Results for the Bozen test set: the results in the base (B), curved (C) and oriented (O) scenarios are depicted. In addition, the F values of a single system trained with all degradations (A) are shown for the different test scenarios

4.3 U-Net versus ARU-Net versus single-stage workflow

Table 3 Results for the cBAD test set: the results for different neural network architectures and the workflow without Stage II (for the ARU-Net) are shown

In Sect. 3, we have introduced the ARU-Net in a two-stage workflow. In this section, we investigate its superiority over the classical U-Net as well as over a “single-stage” workflow. For this purpose, we have trained the U-, RU-, ARU- and LARU-Net (each 5 times—random weight initialization and random training sample order) on the recently introduced cBAD datasetFootnote 8 [26]. The LARU-Net is an ARU-Net with a separable MDLSTMFootnote 9 layer at the lowest resolution to incorporate the full spatial context. The details of the dataset are described in [25]. In our opinion, this is the most challenging freely available dataset at the moment. We have trained each network for 250 epochs with 1024 training samples per epoch using the \(S + A\) data augmentation strategy. To assess the statistical significance of the claimed superiority of the newly introduced architecture, we follow [47] and provide the results of a statistical analysis. The choice of appropriate statistical tests is quite limited since we cannot make any assumptions regarding the underlying distribution. We utilize \(95\%\) confidence intervals (CI) obtained by nonparametric bootstrapping [48] as well as the Tukey–Duckworth test (level of significance: \(5\%\)) [49]. The results are summarized in Table 3. The ARU-Net performs significantly better (last two columns) than all architectures requiring less computational effort. The LARU-Net could not prove its superiority and is therefore dismissed. Furthermore, the results show that the introduction of the second stage is beneficial for the overall performance. It has to be mentioned that the above comparison is fair neither concerning the number of trainable parameters—U-2.16, RU-4.13, ARU-4.14, LARU-6.25 (in millions)—nor concerning the training or even inference time. The comparison is about different architectures which, theoretically, have different capabilities, and whether they make good use of them or not. For instance, the LARU-Net should be capable of incorporating a more detailed spatial context, but in fact it does not benefit (in our settings) from this capability.
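A nonparametric bootstrap CI as used in [48] can be reproduced along the following lines; the percentile interval and the number of resamples are standard choices on our part, not values taken from the paper.

```python
import numpy as np

def bootstrap_ci(f_values, n_resamples=10_000, alpha=0.05, seed=0):
    """95% percentile bootstrap confidence interval for the mean F value.

    f_values: per-run (or per-page) F values of one architecture.
    """
    rng = np.random.default_rng(seed)
    f_values = np.asarray(f_values, dtype=float)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        sample = rng.choice(f_values, size=f_values.size, replace=True)
        means[i] = sample.mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return f_values.mean(), (lo, hi)

# Example with made-up numbers (not the paper's results):
# print(bootstrap_ci([0.971, 0.968, 0.973, 0.970, 0.969]))
```

Two architectures are then considered significantly different if their intervals do not overlap, which is the reading behind the last two columns of Table 3.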

4.4 Comparison against the state of the art

In this subsection, we compare the proposed framework against the state of the art. We have chosen the three most recent competitions on text line detection for historical documents, namely the ICDAR 2015 competition on text line detection in historical documents [14], the ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts (Task 2) [50] and cBAD: ICDAR2017 Competition on Baseline Detection [36]. We will not further introduce the datasets or metrics used and refer the reader to the competition papers.

4.4.1 ICDAR 2015 competition on text line detection in historical documents (ANDAR-TL)

The ARU-Net was trained on the cBAD training set.Footnote 10 This competition aims at origin point (OP) detection. An OP is, roughly speaking, the lower left “corner” of a text line. Hence, we calculate the leftmost point of each detected baseline; this is the output of our system for this competition. The achieved results are shown in Table 4. Since the ARU-Net was not trained on the original training data, it is hard to compare its results to the other ones. Nevertheless, we would like to stress that trained systems usually perform better if the training and test sets are sampled from the same distribution. For example, the ARU-Net trained on the cBAD training set achieves an average F value of 0.9605 on the Bozen test set, which is worse than the F value of 0.9750 of the system trained solely on the Bozen training set, see Table 2. This indicates (but does not prove) the superiority of the presented method over the other methods in Table 4.
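Deriving the submitted OPs from the detected baselines thus amounts to taking the leftmost baseline point, e.g. (our own minimal sketch):

```python
def origin_points(baselines):
    """Return the leftmost point (x, y) of every detected baseline.

    baselines: list of polylines, each a list of (x, y) tuples.
    """
    return [min(line, key=lambda p: p[0]) for line in baselines]

# origin_points([[(120, 340), (510, 338)], [(118, 400), (505, 402)]])
# -> [(120, 340), (118, 400)]
```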

Table 4 Origin point (OP) detection results for the ANDAR-TL test set: results for the dataset of [14] are shown
Fig. 11 Results for an image of the CSG18 subset of the test set—the original image (only the main text lines were ground-truthed), the baseline image generated by the trained ARU-Net and the baselines detected by the proposed method are shown (from left to right)

4.4.2 ICDAR2017 competition on layout analysis for challenging medieval manuscripts (Task 2)

The DIVA-HisDB dataset consists of 150 annotated pages of three different medieval manuscripts with challenging layouts, see Fig. 11. The ARU-Net was trained for 250 epochs with 1024 samples per epoch on the competition training dataFootnote 11 provided by the competition organizers. This allows an entirely fair comparison to the participants' results, see Table 5. The proposed method substantially outperforms the winning one and reduces the error (the gap to 1.0) by \(43.26\%\) (relatively). A particularity of this competition was that the methods had to focus on a specific kind of text, e.g., comments were not annotated as text. Hence, the ARU-Net had to learn to distinguish between different types of text. The output of the ARU-Net and the detected baselines for a sample image of the CSG18 subset of the test set are shown in Fig. 11. One can see that the ARU-Net entirely ignores all text entities not regarded (in this competition) as main text. Remarkably, no further information besides the input image is provided to the ARU-Net.
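The quoted relative error reduction is computed from the gap to a perfect score, i.e., with \(F_{\text {win}}\) the winner's F value and \(F_{\text {ours}}\) the F value of the proposed method (both listed in Table 5):

$$\begin{aligned} \text {relative error reduction} = \frac{(1-F_{\text {win}})-(1-F_{\text {ours}})}{1-F_{\text {win}}} = \frac{F_{\text {ours}}-F_{\text {win}}}{1-F_{\text {win}}}, \end{aligned}$$

which yields the stated \(43.26\%\) for the values in Table 5.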

Table 5 Results for the ICDAR2017 competition on layout analysis for challenging medieval manuscripts: the F values for Task 2 of all participants and the proposed method are shown for the different subsets of the test set
Table 6 Results for the cBAD test set: the P-, R- and F values of all participants and of the proposed method for the simple and complex track of the cBAD: ICDAR2017 Competition on Baseline Detection are shown

4.4.3 cBAD: ICDAR2017 Competition on Baseline Detection

We compare our average result for the ARU-Net (see Table 3) to the results presented in [36], see Table 6. Our method performs considerably better in both tracks compared to all submissions. In particular, the increase in performance for the complex track is substantial. Remarkably, the winning team uses a U-Net-based system with task-specific pre- and post-processing. This indicates that the newly introduced concepts and parametrization presented in this work significantly improve the capability of the classical U-Net. Some results on chosen images of the cBAD test set are shown in Figs. S.3–S.5 (supplements). Notably, no further information besides the input image (and the text region information in the simple track) is provided either to the ARU-Net or to the second stage of the workflow during inference.

5 Conclusion

In this work, we presented a machine learning-based method for text line detection in historical documents. The text lines are represented by their baselines. The problem and the proposed method were introduced thoroughly. The proposed ARU-Net, which is a universal pixel labeling approach, was trained to predict the baseline position as well as the beginning and end of each text line. This enables the system to handle documents with complex layouts, e.g., tables, marginalia and multi-column layouts. We have shown that the system can be trained from scratch with manageably few training samples for a complex but homogeneous collection. Remarkably, ground truth production is quite cheap: a ground truth sample is just a page with annotated baselines, which can be produced in a few minutes per page. Notably, this annotation process is possible without any expert knowledge. This is a big advantage compared to classical image processing-based methods, which typically demand expert knowledge in the adaptation phase. Therefore, one can expect that an adaptation to collections not yet covered by the neural network is possible for a wide audience of users (not only computer scientists) with reasonable ground-truthing effort. The applicability of the proposed method was shown for straight, curved and oriented text lines as well as for a combined scenario. The superiority of the proposed ARU-Net in the two-stage workflow over the classical U-Net and over a simplified workflow was shown and statistically verified. Finally, we showed that the proposed method substantially outperforms the previous state of the art. Nevertheless, as one can see in Figs. S.3–S.5 (supplements), the system still makes errors, e.g., missed baselines (see Fig. S.4—bottom right), segmentation errors (see Fig. S.5—bottom left), false positives (see Fig. S.3—top left) or problems with strongly degraded documents (see Fig. S.4—top left). These errors do not seem to follow a certain deterministic principle, which is not surprising for a method based on machine learning. However, we plan to test newly introduced concepts like capsules, memory augmentation and deeply supervised networks to further improve the system's performance.