Fig. 1. The problem of recognizing table structure from its image. Solving it opens up many applications, including information retrieval, graphical representation and digitization for editing.

1 Introduction

Deep neural networks have shown promising results in understanding document layouts  [1,2,3]. However, more needs to be done for structural and semantic understanding. Among these problems, table structure recognition has been of high interest in the community  [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. Table structure recognition refers to the representation of a table in a machine-readable format, where its layout is encoded according to a pre-defined standard  [10,11,12,13,14, 17]. It can be represented in either a physical  [10, 12, 14, 17] or a logical format  [11, 13]. While the logical structure contains every cell's row and column spanning information, the physical structure additionally contains bounding box coordinates. Table structure recognition is a precursor to contextual table understanding, which has a myriad of applications in business document analysis, information retrieval, visualization, and human-document interactions, as motivated in Fig. 1.

Table structure recognition is a challenging problem due to complex structures and high variability in table layouts  [4,5,6,7,8,9,10,11,12,13,14,15,16,17]. Early attempts in this space depended on hand-crafted features and meta-data extracted from pdfs, combined with heuristic/rule-based algorithms  [21,22,23,24], to locate tables and to understand them by predicting/recognizing their structures. These methods, however, fail to extend to scanned documents as they rely on meta-data information contained in the pdfs. They also make strong assumptions about the structure of the tables. Some of these methods are also dependent on textual information analysis, which makes them domain-dependent. While textual features are useful, visual analysis becomes imperative for the analysis of complex page objects. Inconsistency in the size and density of tables, presence and location of table cell borders, variation in table cells' shapes and sizes, table cells spanning multiple rows and/or columns, and multi-line content are some challenges (refer to Fig. 2 for examples) that need to be addressed to solve the problem using visual cues  [4, 5, 21,22,23,24].

Fig. 2. Examples of complex table images from the unlv and icdar-2013 datasets. Complex tables are ones which contain partial or no ruling lines, multi-row/column spanning cells, multi-line content, and many empty or dense cells.

We pose the table structure recognition problem as the generation of an xml containing the table's physical structure in terms of bounding boxes along with spanning information and, additionally, digitized content for every cell (see Fig. 1). Since our method aims to predict this table structure given the table image only (without using any meta-information), we employ a two-step process—(a) top-down: we decompose the table image into fundamental table objects, i.e., table cells, using a cell detection network, and (b) bottom-up: we re-build the entire table as a collection of all the table cells localized by the top-down process, along with their row and column associations with every other cell. We represent row and column associations of table cells using row and column adjacency matrices.

Though table detection has seen significant success  [11, 25,26,27,28], detection of table cells remains a challenging problem. This is because of (i) large variation in the sizes and aspect ratios of different cells present in the same table, (ii) cells' inherent alignment despite high variance in text amount and justification, (iii) lack of linguistic context in cells' content, (iv) presence of empty cells, and (v) presence of cells with multi-line content. To overcome these challenges, we introduce a novel loss function that models the inherent alignment of cells in the cell detection network, and a graph-based problem formulation to build associations between the detected cells. Moreover, as detection of cells and building associations between them depend highly on one another, we present a novel end-to-end trainable architecture, termed tabstruct-net, for cell detection and structure recognition. We evaluate our model for physical structure recognition on benchmark datasets: scitsr [14], scitsr-comp [14], icdar-2013 table recognition [18], icdar-2019 (ctdar) archival  [19], and unlv [29]. Further, we extend the comparative analysis of the proposed work to logical structure recognition on the tablebank  [11] dataset. Our method sets up a new direction for table structure recognition as a collaboration of cell detection, association between the localized cells, and, additionally, extraction of cells' content.

Our main contributions can be summarised as follows:

  • We demonstrate how the top-down (cell detection) and bottom-up (structure recognition) cues can be combined visually to recognize table structures in document images.

  • We present an end-to-end trainable network, termed tabstruct-net, for training the cell detection and structure recognition networks in a joint manner.

  • We formulate a novel loss function (i.e., alignment loss) to incorporate structural constraints between every pair of table cells and modify Feature Pyramid Network (fpn) to capture better low-level and long-range features for cell detection.

  • We enhance the visual feature representation for structure recognition (built on top of the model in  [9]) through lstms.

  • We unify results from previously published methods on table structure recognition for a thorough comparison study.

Fig. 3. Block diagram of our approach. Table detection is a precursor to table structure recognition, and our method assumes that the table has already been localized from the input document image. The end-to-end architecture predicts cell bounding boxes and their associations jointly. From the outputs of the cell detection and association predictions, an xml is generated using a post-processing heuristic.

2 Related Work

In the space of document images, researchers have been working on understanding equations  [30, 31], figures  [32, 33] and tables  [6,7,8,9,10,11,12,13,14,15,16,17]. Diverse table layouts, tables with many empty cells, and multi-row/column spanning cells are some challenges that make table structure recognition difficult. Research on table understanding through structure recognition from document images dates back to the early 1990s, when algorithms based on heuristics were proposed  [21,22,23,24, 34,35,36]. These methods were primarily dependent on hand-crafted features and heuristics (horizontal and vertical ruling lines, spacing and geometric analysis). To avoid heuristics, Wang et al.  [5] proposed a method for table structure analysis using optimization methods similar to the x-y cut algorithm. Another technique based on column segmentation, header detection, and row segmentation to identify the table structure was proposed by Hu et al.  [4]. These methods make strong assumptions about table layouts, which prevents them from being domain-agnostic.

Many cognitive methods [6,7,8,9,10,11,12, 14,15,16, 37,38,43] have also been presented to understand table structures, as they are robust to the input type (whether scanned images or native digital documents). These methods also do not make any assumptions about the layouts, are data-driven, and are easy to fine-tune across different domains. Minghao et al.  [11] proposed one class of deep learning methods that directly predicts an xml from the table image using an image-to-markup model. Though this method worked well for small tables, it was not robust enough for dense and complex tables. Another set of methods addresses invoice-specific table extraction  [39, 40], which does not extend to more generic use cases. To overcome this challenge, a combination of heuristic and cognitive methods has also been presented in  [12]. Chris et al.  [10] presented another interesting deep model, called splerge, which is based on the fundamental idea of first splitting the table into sub-cells and then merging semantically connected sub-cells to preserve the complete table structure. Though this algorithm showed considerable improvements over earlier methods, it was still not robust to skew present in the table images. Another interesting direction was presented by Vine et al.  [42], where they used conditional generative adversarial networks to obtain the table skeleton and then fit a latent table structure into the skeleton using a genetic algorithm. Khan et al.  [15], through their gru based sequential models, showed improvements over several cnn based methods for table structure extraction. Recently, many works have preferred a graph-based formulation of the problem, as a graph is inherently an ideal data structure to model structural associativity. Qasim et al.  [9] proposed a solution where they used graph neural networks to model table-level associativity between words; the authors validated their method on synthetic table images. Chi et al.  [14] proposed another graph-based problem formulation and a solution using a graph attention mechanism. While these methods made significant progress towards understanding complex structured tables, they made certain assumptions, such as the availability of accurate word bounding boxes and accurate document text as additional inputs  [6, 9, 14]. Our method does not make any such assumptions. We use the table image as the input and produce an xml output without any other information. We demonstrate results on complex tables present in the unlv, icdar-2013, icdar-2019 ctdar archival, scitsr, scitsr-comp, tablebank, and pubtabnet datasets.

Fig. 4. Visual illustration of cell spanning information along rows and columns of a table from the unlv dataset. Left: the original table image from unlv; Right: the ground-truth cell spanning information.

3 TabStruct-Net

Our solution for table structure recognition progresses in three steps—(a) detection of table cells; (b) establishing row/column relationships between the detected cells; and (c) a post-processing step to produce the desired xml output. Figure 3 depicts the block diagram of our approach.

Fig. 5. Our tabstruct-net. The modified rpn in the cell detection network consists of both top-down and bottom-up pathways to better capture low-level visual features. The P2 layer of the optimized feature pyramid is used in the structure recognition network to extract visual features.

3.1 Top-Down: Cell Detection

The first step of our solution for table structure recognition is the localization of individual cells in a table image, for which we use the popular object detection paradigm. The difference from natural scene images, however, is the inherent association between table cells. r-cnn  [44] and its improved variants (Fast r-cnn  [45], Faster r-cnn  [46], Mask r-cnn  [47]) have shown significant success in object detection on natural scene images. Hence, we employ Mask r-cnn  [47] for our solution with additional enhancements—(a) we augment the Region Proposal Network (rpn) with dilated convolutions  [48, 49] to better capture long-range row and column visual features of the table; this improves the detection of multi-row/column spanning and multi-line cells; (b) inspired by  [50], we append the feature pyramid network with a top-down pathway, which propagates high-level semantic information to low-level feature maps and allows the network to work better for cells with varying scales; and (c) we append additional losses during the training phase in order to model the inherent structural constraints. We formulate two ways of incorporating this information—(i) through an end-to-end training of the cell detection and structure recognition networks (explained next), and (ii) through a novel alignment loss function. For the latter, we make use of the fact that every pair of cells is aligned horizontally if they span the same row and vertically if they span the same column. For ground truth where tight bounding boxes around the cells' content are provided  [13, 14, 18], we employ an additional ground-truth pre-processing step to ensure that bounding boxes of cells in the same row and the same column are aligned horizontally and vertically, respectively. We model these constraints during training in the following manner:

$$\begin{aligned} \begin{array}{l} L_{1} = \sum \nolimits _{r\in SR}\sum \nolimits _{c_i, c_j\in r} ||y1_{c_i} - y1_{c_j}||_2^2 \text {,}\ L_{2} = \sum \nolimits _{r\in ER}\sum \nolimits _{c_i, c_j\in r} ||y2_{c_i} - y2_{c_j}||_2^2 \\ L_{3} = \sum \nolimits _{c\in SC}\sum \nolimits _{c_i, c_j\in c} ||x1_{c_i} - x1_{c_j}||_2^2\ \text {and} \ L_{4} = \sum \nolimits _{c\in EC}\sum \nolimits _{c_i, c_j\in c} ||x2_{c_i} - x2_{c_j}||_2^2 \end{array} \end{aligned}$$

Here, SR, SC, ER and EC represent the starting row, starting column, ending row and ending column indices, as shown in Fig. 4. Also, \(c_i\) and \(c_j\) denote two cells in a particular row r or column c; \(x1_{c_i}\), \(y1_{c_i}\), \(x2_{c_i}\) and \(y2_{c_i}\) represent the bounding box coordinates X-start, Y-start, X-end and Y-end, respectively, of the cell \(c_i\). These losses (\(L_{1}\), \(L_{2}\), \(L_{3}\), \(L_{4}\)) can be interpreted as constraints that enforce proper alignment of cells beginning from the same row, ending on the same row, beginning from the same column and ending on the same column, respectively. The alignment loss is defined as

$$\begin{aligned} \begin{array}{l} L_{align} = L_{1}+L_{2}+L_{3}+L_{4}. \end{array} \end{aligned}$$
(1)
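To make the alignment loss concrete, the following is a minimal PyTorch-style sketch, assuming predicted boxes are given as (x1, y1, x2, y2) and that each box carries the SR/ER/SC/EC indices of its matched ground-truth cell; the function and argument names are hypothetical and do not correspond to our released code.

import torch

def alignment_loss(boxes, start_row, end_row, start_col, end_col):
    # boxes: (N, 4) tensor of predicted cell boxes (x1, y1, x2, y2).
    # start_row/end_row/start_col/end_col: (N,) integer tensors with the
    # SR/ER/SC/EC index of the ground-truth cell each box is matched to.
    def pairwise_term(indices, coord):
        # Sum of squared differences of one coordinate over all pairs of
        # cells that share the same row/column index (see Eq. 1).
        loss = boxes.new_zeros(())
        for idx in indices.unique():
            vals = coord[indices == idx]
            if vals.numel() < 2:
                continue
            diff = vals.unsqueeze(0) - vals.unsqueeze(1)   # all ordered pairs
            loss = loss + (diff ** 2).sum() / 2            # count each pair once
        return loss

    x1, y1, x2, y2 = boxes.unbind(dim=1)
    l1 = pairwise_term(start_row, y1)   # cells starting on the same row
    l2 = pairwise_term(end_row, y2)     # cells ending on the same row
    l3 = pairwise_term(start_col, x1)   # cells starting in the same column
    l4 = pairwise_term(end_col, x2)     # cells ending in the same column
    return l1 + l2 + l3 + l4

During training, this term is simply added to the Mask r-cnn losses as a regularizer (see Eq. 2).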

3.2 Bottom-Up: Structure Recognition

We formulate table structure recognition using graphs, similar to  [9]. We consider each cell of the table as a vertex and construct two adjacency matrices - a row matrix \(M_{row}\) and a column matrix \(M_{col}\) - which describe the associations between cells with respect to rows and columns. \(M_{row}, M_{col}\in \mathbb {R}^{N_{cells} \times N_{cells}}\). \(M_{row_{i,j}} = 1\) if cells i and j belong to the same row and \(M_{col_{i,j}} = 1\) if they belong to the same column; the entries are 0 otherwise.
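As an illustration, the following small sketch (with hypothetical field names) builds \(M_{row}\) and \(M_{col}\) from per-cell start/end row and column indices, treating two cells as related along rows (columns) whenever their row (column) spans overlap; handling spanning cells this way is an assumption about the annotation convention.

import numpy as np

def adjacency_matrices(cells):
    # cells: list of dicts with keys 'sr', 'er', 'sc', 'ec' giving the
    # start/end row and column indices of each cell (hypothetical format).
    n = len(cells)
    m_row = np.zeros((n, n), dtype=np.uint8)
    m_col = np.zeros((n, n), dtype=np.uint8)
    for i in range(n):
        for j in range(n):
            ci, cj = cells[i], cells[j]
            # row spans overlap -> the two cells share at least one row
            if ci['sr'] <= cj['er'] and cj['sr'] <= ci['er']:
                m_row[i, j] = 1
            # column spans overlap -> the two cells share at least one column
            if ci['sc'] <= cj['ec'] and cj['sc'] <= ci['ec']:
                m_col[i, j] = 1
    return m_row, m_col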

The structure recognition network aims to predict the row and column relationships between the cells predicted by the cell detection module, during both training and testing. During training, only those predicted table cells that overlap with ground-truth table cells with an IoU of at least 0.5 are used for structure recognition. This network has three components:

  • Visual Component: We use visual features from the P2 layer (refer to Fig. 5) of the feature pyramid, sampled by linear interpolation within the cell bounding boxes predicted by the cell detection module. In order to encode a cell's visual characteristics across its entire height and width, we pass the gathered P2 features for every cell along its centre horizontal and centre vertical lines through lstms  [51] to obtain the final visual features (refer to Fig. 5), as opposed to using visual features corresponding to cells' centroids only, as in  [52].

  • Interaction Component: We use the dgcnn architecture based on graph neural networks, as used in  [52], to model the interaction between geometrically neighboring detected cells. Its output, termed interaction features, is a fixed-dimensional vector for every cell that aggregates information from its neighbouring table cells.

  • Classification Component: For a pair of table cells, the interaction features are concatenated and appended with the difference between the cells' bounding box coordinates. This is fed as input to the row/column classifiers to predict row/column associations (a minimal sketch of this pairwise step is given after this list). Please note that we use the same Monte Carlo based sampling as  [52] to ensure efficient training and class balancing. At test time, however, predictions are made for every unique pair of table cells.
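A minimal sketch of the pairwise classification step mentioned above, assuming fixed-dimensional interaction features per cell; layer sizes and names are illustrative rather than the exact configuration of tabstruct-net.

import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    # One such classifier is used for rows and another for columns.
    def __init__(self, feat_dim=64):
        super().__init__()
        # input: interaction features of both cells + box-coordinate difference
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + 4, 64), nn.ReLU(),
            nn.Linear(64, 2))  # logits: same row/column vs. not

    def forward(self, feats, boxes, pairs):
        # feats: (N, feat_dim) interaction features, boxes: (N, 4) cell boxes,
        # pairs: (P, 2) indices of sampled cell pairs.
        i, j = pairs[:, 0], pairs[:, 1]
        x = torch.cat([feats[i], feats[j], boxes[i] - boxes[j]], dim=1)
        return self.mlp(x)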

We train the cell detection and structure recognition networks in a joint manner (termed tabstruct-net) to collectively predict cell bounding boxes along with the row and column adjacency matrices. The two structure recognition pathways, for the row and column adjacency matrices, operate in parallel; the visual features prepared using lstms for every vertex are duplicated and fed to both pathways. The overall empirical loss of tabstruct-net is given by:

$$\begin{aligned} \begin{array}{l} L = L_{box} + L_{cls} + L_{mask} + L_{align} + L_{{gnn}}, \end{array} \end{aligned}$$
(2)

where \(L_{box}\), \(L_{cls}\) and \(L_{mask}\) are the bounding box regression loss, classification loss and mask loss, respectively, as defined in Mask r-cnn  [47], \(L_{align}\) is the alignment loss, which is modeled as a regularizer (defined in Eq. 1), and \(L_{{gnn}}\) is the cross-entropy loss back-propagated from the structure recognition module of tabstruct-net. The additional loss components help the model better align cells belonging to the same rows/columns during training and, in a sense, fine-tune the predicted bounding boxes, which makes post-processing and structure recognition in the subsequent step easier.

3.3 Post-Processing

Once all the cells and their row/column adjacency matrices are predicted, we create the interpretable xml output as a post-processing step. From the cell coordinates along with the row and column adjacency matrices, SR, SC, ER and EC indices are assigned to each cell, which indicate the spanning of that cell along rows and columns. We use Tesseract  [53] to extract the content of every predicted cell. The xml output for every table image finally contains the coordinates of the predicted cell bounding boxes along with each cell's spanning information and content.
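A minimal sketch of this serialization step is given below, assuming the SR/SC/ER/EC indices have already been derived from the adjacency matrices and the cell content has been extracted with an ocr engine; the tag and attribute names are illustrative and not the exact schema we emit.

import xml.etree.ElementTree as ET

def cells_to_xml(cells):
    # cells: list of dicts with 'box' = (x1, y1, x2, y2), 'sr', 'sc', 'er',
    # 'ec' spanning indices and 'text' content (hypothetical format).
    table = ET.Element('table')
    for c in cells:
        cell = ET.SubElement(table, 'cell', {
            'start-row': str(c['sr']), 'start-col': str(c['sc']),
            'end-row': str(c['er']), 'end-col': str(c['ec'])})
        x1, y1, x2, y2 = c['box']
        ET.SubElement(cell, 'bbox', {'x1': str(x1), 'y1': str(y1),
                                     'x2': str(x2), 'y2': str(y2)})
        ET.SubElement(cell, 'content').text = c['text']
    return ET.tostring(table, encoding='unicode')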

4 Experiments

4.1 Datasets

We use various benchmark datasets for extracting table structure information: scitsr  [14], scitsr-comp  [14], icdar-2013 table recognition  [18], icdar-2019 (ctdar) archival  [19], unlv  [29], Marmot extended  [12], tablebank  [11] and pubtabnet  [13]. Statistics of these datasets are listed in Table 1.

Table 1. Statistics of the datasets used for our experiments.

4.2 Baseline Methods

We compare the performance of our tabstruct-net against seven benchmark methods—deepdesrt  [7], tablenet  [12], graphtsr  [14], splerge  [10], dgcnn  [9], Bi-directional gru  [15] and Image-to-Text  [11].

Table 2. Performance of our tabstruct-net for physical table structure recognition on various benchmark datasets.

4.3 Implementation Details

tabstruct-net has been trained and evaluated with table images scaled to a fixed size of 1536 \(\times \) 1536 while maintaining the original aspect ratio as the input. While training, cell-level bounding boxes along with row and column adjacency matrices (prepared from the start-row, start-column, end-row and end-column indices) are used as the ground truth. We use an nvidia titan x gpu with 12 gb memory for our experiments and a batch size of 1. Instead of using a 3 \(\times \) 3 convolution on the output feature maps from the fpn, we use a dilated convolution with a filter size of 2 \(\times \) 2 and a dilation parameter of 2. Also, we use a resnet-101 backbone that is pre-trained on the ms-coco  [54] dataset. Dilated convolution blocks of filter size 7 are used in the fpn. To compute region proposals, we use 0.5, 1 and 2 as the anchor scales and 8, 16, 32, 64 and 128 as the anchor box sizes. The lstms used to gather visual features have a depth of 128. The final memory state of the lstm layers is concatenated with the cell's coordinates to prepare features for the interaction network. Further, for generation of the row/column adjacency matrices, we use 2400 as the maximum number of vertices, keeping dense tables in mind. Next, features from 40 neighboring vertices are aggregated using an edge convolution layer followed by a dense layer of size 64 with ReLU activation. Since every input table may contain hundreds of table cells, training can be a time-consuming process. To achieve faster training, we employ a two-stage training process. In the first stage, we use 2014 anchors and 512 rois, and in the second stage, we use 3072 anchors and 2048 rois. During both stages, we use 0.001 as the learning rate, 0.9 as the momentum and 0.0001 as the weight decay regularisation.
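For reference, the hyper-parameters described above can be summarized as follows; the dictionary keys are hypothetical names used only for illustration, and the values follow the text.

config = {
    'image_size': (1536, 1536),            # input scaled, aspect ratio preserved
    'backbone': 'resnet-101',              # pre-trained on ms-coco
    'anchor_scales': (0.5, 1, 2),
    'anchor_box_sizes': (8, 16, 32, 64, 128),
    'fpn_dilated_filter_size': 7,
    'lstm_depth': 128,
    'max_vertices': 2400,                  # upper bound on cells per table
    'neighbour_vertices': 40,              # aggregated by the edge convolution
    'dense_layer_size': 64,                # followed by ReLU activation
    'learning_rate': 0.001,
    'momentum': 0.9,
    'weight_decay': 0.0001,
    'batch_size': 1,
}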

Table 3. Performance of our tabstruct-net for logical table structure recognition on various benchmark datasets.

4.4 Evaluation Measures

We use existing measures, namely precision, recall and F1  [14, 18, 29], to evaluate the performance of our model for recognition of the physical structure of tables. For recognition of the logical structure of tables, we use the bleu  [55] score, as used in  [11], and the Tree-Edit-Distance-based similarity (teds)  [13]. Since xml is our final output for table structure recognition, we also use bleu  [55], cider  [56] and rouge  [57] scores to compare the generated xml and the ground-truth xml on the spanning information and content of every cell. We first calculate these scores separately on each table and then compute both the micro-averaged and macro-averaged scores as the final result. We consistently use an IoU threshold of 0.6 to compute the confusion matrix. Please note that, similar to  [18], only non-empty table cells are considered for the evaluation.
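For illustration, a minimal sketch of how detection-style precision, recall and F1 can be computed at a fixed IoU threshold is given below; it mirrors the matching idea described above but is not the official evaluation script of any of the cited benchmarks.

def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(pred_boxes, gt_boxes, thresh=0.6):
    # Greedy one-to-one matching of predictions to ground truth at an IoU
    # threshold; a true positive is a prediction matched to an unused gt box.
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1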

4.5 Experimental Setup

One major challenge in a comparison study with the existing methods is their inconsistent use of additional information (e.g., meta-features extracted from the pdfs  [10], content-level bounding boxes from ground truths  [12, 14] and cell location features generated from a synthetic dataset  [9]). Hence, we conduct experiments in two different setups:

  • Setup-A (S-A): using only the table image as the input

  • Setup-B (S-B): using the table image along with additional information (e.g., cell bounding boxes) as the input. For this, instead of removing the cell detection component from the network, we ignore the predicted boxes and use the ground-truth ones as input for structure recognition.

5 Results on Table Structure Recognition

Tables 2 and 3 summarize the performance of our model on standard datasets used in the space of table structure recognition.

5.1 Analysis of Results

Table 4 presents results on the icdar-2013 dataset. In S-A, we observe that our model outperforms the deepdesrt  [7] method by 27.5% in F1 score. This is because cell coordinates for the latter are obtained by row and column intersections, making it unable to recognize cells that span multiple rows/columns. For dense tables (small inter-row spacing), the row segmentation results of deepdesrt combined multiple rows into one in several instances. The split+heuristic  [10] method outperforms tabstruct-net by a small margin; however, it requires icdar-2013 dataset-specific cell merging heuristics and is trained on a considerably larger set of images. Therefore, a direct comparison of split+heuristic with our method is not fair. Nevertheless, the comparable results of tabstruct-net indicate its robustness on the icdar-2013 dataset without using any kind of dataset-specific post-processing. If compared under the same training environment and with no post-processing, our model outperforms splerge by a 3% average F1 score. splerge works well for datasets where ground-truth bounding boxes are annotated at the content level instead of the cell level, because this allows a wider area for a prospective prediction of a row/column separator. Further, since it is based on cell detection through row and column separators, it is not robust to input image noise such as skew and rotation. This method relies on dataset-specific post-processing, as opposed to ours, where no post-processing is needed.

Table 4. Comparison of results for physical structure recognition on the icdar-2013 dataset. #Images: number of table images in the training set. Heuristic: dataset-specific cell merging rules for the various models in  [10].

In S-B, tabstruct-net sets up a state-of-the-art benchmark on the icdar-2013 dataset, outperforming all the existing methods  [9, 10, 12, 14]. It is further interesting to note that our technique outperforms the split-pdf+heuristic model, again without needing any post-processing. This is because our enhancements to the dgcnn  [9] model capture the visual characteristics of a cell across a larger span through lstms. We observe that our model achieves significantly improved performance when content-level bounding boxes are used instead of cell-level ones; these are much easier to obtain with the help of ocr tools and pdf meta-information.

Table 5. Physical structure recognition results on the icdar-2013 dataset for varying IoU thresholds to demonstrate tabstruct-net's robustness. ES: Experimental Setup, CD: Cell Detection, TH: IoU threshold value, SR: Structure Recognition, P2: using visual features from the P2 layer of the fpn instead of separate convolution blocks, lstm: use of lstms to model visual features along the center-horizontal and center-vertical lines of every cell, td + bu: use of Top-Down and Bottom-Up pathways in the fpn, AL: addition of alignment loss as a regularizer to tabstruct-net.

Table 5 shows the performance of our technique under varying IoU thresholds. It can be inferred from the table that our model achieves an F1 score of 79.1% on structure recognition even at an IoU threshold as high as 0.7. For IoU values of 0.5 and 0.6, our model's performance is 91.9% and 90.6%, respectively, which demonstrates its robustness. Figures 6 and 7 display some qualitative outputs of our method on the datasets discussed in Sect. 4.1. Figure 8 shows some failure cases of cell detection by our method; our model fails for table images that have large amounts of empty space. The supplementary material contains (i) more quantitative results, (ii) more qualitative examples, (iii) specific implementation details, and (iv) a detailed comparative analysis, IoU variation results, and ablation study on all the datasets.

Fig. 6. Sample intermediate cell detection results of tabstruct-net on table images from the icdar-2013, icdar-2019 ctdar, unlv, scitsr, scitsr-comp and tablebank datasets.

5.2 Ablation Study

Table 6 shows the outcome of our enhancements to the Mask r-cnn  [47] and dgcnn  [9] models for both the cell detection and structure recognition networks under S-A and S-B. From the table, it can be observed that our additions to the networks result in a significant increase of 4% in average F1 score on the cell detection and structure recognition tasks. The novel alignment loss, along with the use of top-down and bottom-up pathways in the fpn, results in an improvement of 2.3% in F1 score for cell detection and 2.4% for structure recognition. Use of lstms and the P2 layer output to prepare visual features for structure recognition results in a 2.1% improvement in F1 score. Interestingly, because both models are trained together in an end-to-end fashion, the effect is also observed on cell detection in the form of a 1.5% improvement in average F1 score. This empirically bolsters our claim of using an end-to-end architecture for cell detection and, in turn, structure recognition.

Table 6. Ablation study for physical structure recognition on the icdar-2013 dataset. ES: Experimental Setup, CD: Cell Detection, SR: Structure Recognition, P2: using visual features from the P2 layer of the fpn instead of separate convolution blocks, lstm: use of lstms to model visual features along the center-horizontal and center-vertical lines of every cell, td + bu: use of Top-Down and Bottom-Up pathways in the fpn, AL: addition of alignment loss as a regularizer to tabstruct-net.
Fig. 7. Sample structure recognition output of tabstruct-net on table images from the icdar-2013, icdar-2019 ctdar archival and unlv datasets. First row: prediction of cells which belong to the same row. Second row: prediction of cells which belong to the same column. Cells marked in orange are the examined cells, and cells marked in green are those which belong to the same row/column as the examined cell. (Color figure online)

Fig. 8. Sample intermediate cell detection results of tabstruct-net on table images from the icdar-2013, icdar-2019 ctdar, unlv, scitsr, scitsr-comp and tablebank datasets, illustrating failure cases of tabstruct-net.

6 Summary

We formulate the problem of table structure recognition as a combination of cell detection (top-down) and structure recognition (bottom-up) tasks. For cell detection, we modify the rpn of the original Mask r-cnn and introduce a novel alignment loss function (formulated for every pair of table cells) to enforce structural constraints. For structure recognition, we improve the input representation of the dgcnn network by using lstms, a pre-trained resnet-101 backbone and the rpn of the cell detection network. Further, we propose an end-to-end trainable architecture to collectively predict cell bounding boxes along with their row and column adjacency matrices, which define the structure. We demonstrate our results on multiple public datasets covering both digitally scanned and archival handwritten table images. We observe that our approach fails to handle tables containing a large number of empty cells along both the horizontal and vertical directions. In conclusion, we encourage further research in this direction.