
1 Introduction

Document structure analysis and parsing are among the most crucial parts of document image processing for digitization and information extraction. Tables are one of the most important components of documents: they represent data in a structured way, allowing for visual and logical grouping of information in a highly comprehensible manner. Tables in documents are frequently used to present key information such as financial records, receipts, and data forms. Extracting this information can be vital and beneficial to businesses around the world. However, tabular data extraction is a challenging task, as this data is most often found in scanned document images that contain no structural or content metadata.

Tabular data extraction consists of three major tasks: Table Detection, Table Structure Recognition, and Semantic Understanding. Table detection is the task of locating tables in a given document, while table structure recognition aims at segmenting tables into rows and columns for layout understanding. Semantic understanding involves assigning information such as a row or column header (e.g., Unit price, Stock value, etc.) to a particular cell. Our focus in this work is table structure recognition, where the aim is to extract rows and columns from a given table image.

In practice, Convolutional Neural Networks (CNNs) are an effective tool for extracting meaningful features from visual information. Given that scanned document images contain only visual information, CNNs are increasingly important for Table Detection and Table Structure Recognition alike. State-of-the-art CNNs are efficient at extracting visual features from data; however, they can be highly data-demanding. This precondition, coupled with the unavailability of large Table Structure Recognition datasets, presents a challenging problem, as annotating large datasets is both expensive and time-consuming. It is, therefore, crucial to focus on methods that improve the data efficiency of deep learning models in order to achieve better Table Structure Recognition results on smaller datasets.

Augmentation techniques are widely used to improve data efficiency in deep neural networks. Augmentation is the practice of applying slight but realistic changes to the original data to increase the diversity of the training set. This helps a model regularize and avoid over-fitting on small datasets. Several image transformation techniques, such as scaling, rotating, and smearing, are widely used in Computer Vision to augment data. For the purposes of comparison, we apply Random Cropping (first employed in AlexNet [14]) and Color Jitter (used in the re-implementation of ResNet [9] by Facebook AI Research) as the standard augmentation. Please refer to the bottom row of Fig. 2 for examples of these transformations on the ICDAR 2013 dataset. However, as shown by our experimentation in Sect. 4, they are not effective for table structure recognition: they do not alter the structure of a table but instead produce unrepresentative data, which results in decreased performance.

To overcome these challenges, we propose a novel method of data augmentation based on two key operations on rows and columns, Replication and Deletion, illustrated in Fig. 1; sample images of these operations can be seen in Fig. 2. We also introduce a data-driven probabilistic model for generating the parameters that control the appearance of the augmented data. The augmented variations generated by this technique better represent the real-world variation in tables. Please refer to Sect. 3 for further details about our methodology.

Our experiments on the ICDAR 2013 dataset demonstrate that our proposed augmentation technique yields higher data efficiency and outperforms both the non-augmented and standard image augmentation approaches. Please refer to Sect. 4 for further details on the experimental evaluation.

These results set a strong foundation for future work in this direction. To facilitate this, we have made our implementation of structural data augmentation, named TabAug, and the associated experimentation code open-source (Footnote 1). We have also designed the code to be simple to plug into various existing models, with support for the T-Truth annotation format [20] and no training overhead.

Fig. 1. Visual depiction of the proposed augmentation operations.

The rest of the paper is structured as follows. Section 2 outlines a literature review of relevant works in the domain of table structure recognition and tabular data extraction in general. In Sect. 3 we present our methodology and the implementation of the proposed augmentation technique. In Sect. 4 we report on our experiments and comparative results. Finally, in Sect. 5 we conclude our research with future directions for our work.

2 Related Work

Tabular data extraction is an old and recognized problem, with solutions maturing over the past 20 years. One of the earliest works, in table detection in text files using heuristics, was done by [24] in 1996, followed by Pyreddy et al. [15] in 1997. The following years saw a novel heuristic-based approach to detecting tables in document images from Kieninger et al. [11, 12] and [13], forming a combined table spotting and recognition system named T-Recs. Zanibbi et al. [25] presented a comprehensive survey focused on the table recognition literature, recording extensive observations on various aspects of the popular methods available at the time. A novel approach to detecting tables from ruling lines was presented by Gatos et al. [6]. In 2010, Shahab et al. [20] presented comprehensive and rigorous evaluation metrics for table detection and structure recognition. Significant work was done by Shafait et al. [19] in developing a table detection algorithm for multiple layouts; they also introduced more meaningful performance metrics for table detection.

The use of data-driven approaches in table detection started with Chen et al. [3] in 2011, where SVMs were used for table detection in handwritten documents with noise and artifacts. In 2013, Kasar et al. [10] also made use of SVMs for table region detection by classifying ruling lines. The following year, Anukriti et al. [2] used Conditional Random Fields (CRFs), with encoded foreground block characteristics and contextual information as features, for learning layouts and labeling table and non-table regions in documents.

From 2015 onward, deep learning models have been used extensively for solving the detection and recognition problems. [1, 8, 18] and [22] made use of Faster R-CNN [17] with their own take on handcrafted features and pre-processing methodologies. These methods succeeded in producing state-of-the-art results for table detection but failed to produce significant results in structure recognition. Schreiber et al. [18] point out the unnatural approach of detecting rows and columns through Faster R-CNN [17] and instead propose fine-grained image segmentation through the FCN-Xs architecture of Shelhamer et al. [21]; however, their technique relies heavily upon image stretching, expanding background pixels as a pre-processing step.

In more recent works on table structure recognition from Qasim et al. [16] and Tensmeyer et al. [23], we see a more stable and natural formulation of the problem. Qasim et al. [16] made use of Graph Neural Networks to generate a cell adjacency matrix over all OCR-detected words, while Tensmeyer et al. [23] formulated structure recognition as a combination of row and column splits followed by cell merges to define the structure of a table. Qasim et al. [16] mention that the lack of large datasets has been a major hindrance for deep learning methods in table structure recognition, and they make use of synthetic data to show the effectiveness of their network. However, randomly generated synthetic data is limited in capturing the visual and general characteristics of the target dataset, due to which the model fails to perform on the target dataset despite optimal results on the synthetic data. Similarly, the Split model of Tensmeyer et al. [23], even though more data-efficient than the GNN model of Qasim et al. [16], requires a large proprietary dataset to train an effective model for testing on ICDAR 2013. Our experiments of training and testing the Split model of Tensmeyer et al. [23] on the ICDAR 2013 dataset reveal results below the network's potential, due to the lack of a large and diverse training dataset.

With newer deep learning algorithms, we see a trend of increasing performance on larger datasets; however, this translates into sub-optimal results on smaller datasets, limiting their potential use cases. Traditionally in computer vision, these problems are addressed with image transformation based data augmentation techniques, referred to here as standard data augmentation, for increasing the diversity of the training data. Albeit successful on natural images, standard augmentation has an insignificant and unstable impact on table images. In this paper, we propose a re-imagined Tabular Data Augmentation, inspired by [4] and [7], that replicates and removes table structure elements (rows and columns) to form augmented tables while maintaining their visual artifacts. We also introduce a data-driven probabilistic model (similar to [5]) for the decision parameters of our augmentation method.

3 Methodology

We adopt an elementary approach to structural data augmentation for table structure recognition. We consider cells to be the building blocks of a table; that is, a table is defined as a combination of cells. Therefore, it should theoretically be possible to achieve a large number of combinations given a small number of cells. However, combining different cells into a table poses many limitations with regard to compatibility. First, the widths and heights of cells can vary greatly, and combining them without due deliberation can lead to unnatural tables. Second, cells tend to inherit their formatting and styling from their adjacent cells. In a fully randomized collection of cells, one cell might have left-justified text while the next might have right-justified text, making for a confusing and unnatural table.

To overcome these limitations, we redefine a table to be a combination of rows and columns rather than cells. This keeps the correlation of cell styling within a row and a column intact. To further simplify the problem, we limit these combinations to randomizations of rows and columns within the same table. Thus, we can generate a randomized permutation of the rows and columns of a table to obtain a structurally augmented version of it. This augmentation technique is then combined with a data-driven probabilistic sampler to maximize its effectiveness.

3.1 Augmentation Operations

To achieve randomization of rows and columns, we define four basic operations that we can apply to each table:

Fig. 2. Sub-figure 2a contains an original table sample from the ICDAR 2013 dataset. Sub-figures 2b and 2c show variations of the table generated by TabAug. Sub-figures 2d and 2e show the results of image transformation based augmentations.

  • Row deletion

  • Column deletion

  • Row replication

  • Column replication

These operations are depicted visually in Fig. 1 for better understanding. They are the core atomic operations that we utilize in our augmentation pipeline. They can be applied sequentially, in random order, on a table to achieve increasingly varied versions of the same table.

3.2 Augmentation Sub-operations

Each of the proposed augmentation operations consists of several sub-operations, which are explained below. For simplicity, we only consider Column deletion and replication, as Row deletion and replication can be directly inferred from them.

Source Selection: Before doing any replication or removal, we need to select a column on which to apply the operation. We randomly select an index c in the range \(\{1\,..\, C-1\}\), where C is the total number of columns in the given table. We purposefully leave out the 0-th index, as the first column can contain row headers (similarly, column headers in the case of row selection), which should be retained in their location to maintain a natural table.

Furthermore, if spanning cells exist in the selected column, we must ensure that no partial/broken cells are copied.

$$\begin{aligned} \text {span}_{\text {min}} = \min (\{\text {cells}_{r\,c}^{\text {Start Column}} \,\, \forall r \in \{0..R-1\}\}) \end{aligned}$$
(1)
$$\begin{aligned} \text {span}_{\text {max}} = \max (\{\text {cells}_{r\,c}^{\text {End Column}} \,\, \forall r \in \{0..R-1\}\}) \end{aligned}$$
(2)

For the selected column c, the values of \(\text {span}_{\text {min}}\) and \(\text {span}_{\text {max}}\) are calculated using Eqs. 1 and 2. These values provide us with the column indices of a self-contained convex block (as depicted in Figs. 3a and 3b). If, even after performing the above steps, the selected block is non-convex (as depicted in Fig. 3c), we abort the operation and return the table unaltered. Otherwise, our selection is defined by \(c_{\text {min}}=\text {span}_{\text {min}}\) and \(c_{\text {max}}=\text {span}_{\text {max}}\), which is used by the next sub-operations.
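For concreteness, a minimal sketch of this sub-operation is given below. It assumes a hypothetical cell representation in which each cell carries start_col and end_col fields for its (inclusive) column span, and that every column contains at least one cell; these are our assumptions for illustration, not the paper's data structures.

```python
import random

def select_source_columns(cells, num_columns):
    """Pick a convex column block [c_min, c_max]; return None to abort."""
    # Skip index 0 so row headers stay in place (see above).
    c = random.randrange(1, num_columns)
    # Eqs. 1 and 2: widen the selection over every cell touching column c.
    touching = [cell for cell in cells
                if cell["start_col"] <= c <= cell["end_col"]]
    span_min = min(cell["start_col"] for cell in touching)
    span_max = max(cell["end_col"] for cell in touching)
    # Convexity check: every cell overlapping [span_min, span_max] must lie
    # entirely inside it (Fig. 3); otherwise abort and leave the table as-is.
    for cell in cells:
        overlaps = cell["start_col"] <= span_max and cell["end_col"] >= span_min
        inside = span_min <= cell["start_col"] and cell["end_col"] <= span_max
        if overlaps and not inside:
            return None  # non-convex even after expansion (Fig. 3c)
    return span_min, span_max  # c_min, c_max for the next sub-operations
```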

Fig. 3. A non-convex selection (3a) cannot be replicated or deleted and hence is expanded so as to make it convex (3b). Sub-figure 3c represents the case where expansion also results in a non-convex selection.

Target Location Selection (only for Replication): This sub-operation is similar to the previous one, with a few modifications. We randomly select an index d in the range \(\{1\,..\, C\}\), where C is the total number of columns in the given table. The target location for column replication will be at the start of the d-th column, so we purposefully leave out the 0-th index, as we do not want to move the first column. Furthermore, if \(d=C\), the target location is after the last column, at the end of the image. We perform a check to see if the target location d intersects a spanning cell. If so, placing a column at that location would split the existing spanning cell apart, and hence we change d according to Eq. 3:

$$\begin{aligned} d = {\left\{ \begin{array}{ll} \text {span}_{\text {min}}, &{} \text {if } |d - \text {span}_{\text {min}}| \le |d - \text {span}_{\text {max}}| \text { and } \text {span}_{\text {min}} \ne 0\\ \text {span}_{\text {max}} + 1, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

where \(\text {span}_{\text {min}}\) and \(\text {span}_{\text {max}}\) are calculated using Eqs. 1 and 2 by setting \(c=d\).

Similar to the previous sub-operation, if d still intersects a spanning cell after this correction, we abort the operation and return the table unaltered.
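A sketch of this sub-operation, under the same hypothetical cell representation as above, could look as follows:

```python
import random

def select_target_column(cells, num_columns):
    """Pick an insertion index d in {1..C}; return None to abort."""
    d = random.randint(1, num_columns)  # d == C inserts after the last column
    # Cells straddled by the boundary just before column d.
    straddling = [cell for cell in cells
                  if cell["start_col"] < d <= cell["end_col"]]
    if straddling:
        span_min = min(cell["start_col"] for cell in straddling)
        span_max = max(cell["end_col"] for cell in straddling)
        # Eq. 3: snap d to the closer edge of the spanning block.
        if abs(d - span_min) <= abs(d - span_max) and span_min != 0:
            d = span_min
        else:
            d = span_max + 1
        # If the corrected d still splits a spanning cell, abort.
        if any(cell["start_col"] < d <= cell["end_col"] for cell in cells):
            return None
    return d
```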

Execution: To execute the operation, whether deletion or replication, we need to alter both the table image and the ground-truth of the table; the specifics vary between the two operations. The following values are required in both:

$$ x_{\text {min}} = \text {columns}[c_{\text {min}}].x_1, \qquad x_{\text {max}} = \text {columns}[c_{\text {max}}].x_2, \qquad w = x_{\text {max}} - x_{\text {min}} $$

where \(x_1\) denotes the starting x-coordinate of a column and \(x_2\) denotes the ending x-coordinate of a column.

Deletion: First, we remove the columns with indices in the range [\(c_{\text {min}}\), \(c_{\text {max}}\)] from the ground-truth, including all the contained cells and span information. For the columns and cells having indices greater than \(c_{\text {max}}\), we subtract w from their x-coordinates to move them to the left.

For the image, we cut out the pixels from \(x_{\text {min}}\) to \(x_{\text {max}}\) and then shift the pixels to the right of \(x_{\text {max}}\) left by w.
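Assuming the table image is a NumPy array and columns are dictionaries with x1/x2 pixel coordinates (a representation we adopt only for illustration; cell entries are handled analogously), the deletion step might be sketched as:

```python
import numpy as np

def delete_column_block(img, columns, c_min, c_max):
    """Remove columns c_min..c_max from both the image and the ground-truth."""
    x_min, x_max = columns[c_min]["x1"], columns[c_max]["x2"]
    w = x_max - x_min
    # Image: cut out [x_min, x_max) and shift the right part left by w.
    img = np.concatenate([img[:, :x_min], img[:, x_max:]], axis=1)
    # Ground-truth: drop the block; shift later columns left by w pixels.
    kept = []
    for idx, col in enumerate(columns):
        if c_min <= idx <= c_max:
            continue  # deleted block
        if idx > c_max:
            col = {**col, "x1": col["x1"] - w, "x2": col["x2"] - w}
        kept.append(col)
    return img, kept
```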

Replication: First, we calculate the x-coordinate of the target location where the replicated column is to be placed.

$$\begin{aligned} x_{\text {dst}} = \text {columns}[d].x1 \end{aligned}$$
(4)

Then we copy the columns with indices in the range [\(c_{\text {min}}\), \(c_{\text {max}}\)] from the ground-truth, including all the contained cells and span information. For the copied columns and cells, we offset the x-coordinates by \(x_{\text {dst}}-x_{\text {min}}\) to move them to the target location. Further, for the columns and cells having indices greater than or equal to d, we add w to their x-coordinates to move them to the right, clearing up space for the replicated column.

For the image, we move the pixels to the right of \(x_{\text {dst}}\) by w, so as to clear up space for the crop of the replicated columns. Then we copy the pixels from \(x_{\text {min}}\) to \(x_{\text {max}}\) and move them by \(x_{\text {dst}}-x_{\text {min}}\), placing them at the destination location \(x_{\text {dst}}\).
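Under the same illustrative representation, replication can be sketched as follows. Note that the canvas grows by w; the pipeline later bounds this growth via the 1.5x size check of Sect. 3.3.

```python
import numpy as np

def replicate_column_block(img, columns, c_min, c_max, d):
    """Insert a copy of columns c_min..c_max before column d."""
    x_min, x_max = columns[c_min]["x1"], columns[c_max]["x2"]
    w = x_max - x_min
    x_dst = columns[d]["x1"] if d < len(columns) else img.shape[1]  # Eq. 4
    # Image: splice the copied pixel block in at x_dst, widening the canvas.
    img = np.concatenate([img[:, :x_dst], img[:, x_min:x_max], img[:, x_dst:]],
                         axis=1)
    # Ground-truth: shift columns at or after d right by w, then insert the
    # copied block offset by (x_dst - x_min).
    shifted = [({**c, "x1": c["x1"] + w, "x2": c["x2"] + w} if i >= d else c)
               for i, c in enumerate(columns)]
    copied = [{**c, "x1": c["x1"] + x_dst - x_min, "x2": c["x2"] + x_dst - x_min}
              for c in columns[c_min:c_max + 1]]
    return img, shifted[:d] + copied + shifted[d:]
```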

3.3 Augmentation Pipeline

Having defined the augmentation operations, we must establish an effective pipeline that can utilize them for training a model; we propose such a pipeline in the following.

Fig. 4. An example of an augmentation search tree. Each node \(N_{ij}\) is obtained by applying an augmentation operation to its parent node, where the root of the tree is the original table \(T_i\).

Augmentation Tree Exploration: Before training a model, we pre-compute a sample set \(N_i\) of augmented versions of a given table \(T_i\). Each member of \(N_{i}\) is considered a node (represented as \(N_{ij}\)) that is achievable by applying a series of augmentation operations (edges in the tree). An example of such a partial tree is shown in Fig. 4. By applying this tree search, we can extract a controlled sample set of all achievable augmented tables of a given root table \(T_i\). To avoid an inundation of samples, we prune the tree search to keep their number within a reasonable limit. First, only the nodes with tree depth greater than 5 and at most 10 are kept as part of the sample set. Further, the maximum search width at each depth level is restricted: for depth = 1 the maximum search width is 8, for depth = 2 it is 4, for depth = {3..5} it is 2, and for depth = {6..10} it is 1. Furthermore, any node with a pixel height or width greater than 1.5 times that of the original table \(T_i\) is discarded, ensuring that unnaturally large tables are not used for training.
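The pruning rule admits more than one reading; the sketch below interprets the maximum search width as the branching factor applied to each node at a given depth, which is an assumption on our part. `apply_random_op` stands for one of the four operations of Sect. 3.1 and is assumed to return a new node (or the input unchanged when an operation aborts).

```python
from dataclasses import dataclass

@dataclass
class Node:
    image_w: int
    image_h: int
    # image and ground-truth payload omitted in this sketch

# Maximum branching factor at each tree depth (one reading of the pruning rule).
MAX_WIDTH = {1: 8, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1}

def explore(root, apply_random_op, min_depth=5, max_depth=10):
    """Breadth-first expansion of the augmentation tree with pruning."""
    samples, frontier = [], [root]
    for depth in range(1, max_depth + 1):
        frontier = [apply_random_op(node)
                    for node in frontier
                    for _ in range(MAX_WIDTH[depth])]
        if depth > min_depth:  # keep only nodes deeper than 5
            samples.extend(frontier)
    # Discard unnaturally large nodes (> 1.5x the root in either dimension).
    return [n for n in samples
            if n.image_w <= 1.5 * root.image_w
            and n.image_h <= 1.5 * root.image_h]
```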

Categorization of Tables and Their Nodes: As a pre-processing step, both the tables and their nodes are mapped to categories based on their numbers of rows and columns. The mapping table is of size 5 \(\times \) 4, as depicted in Fig. 5. For example, a node with 4 rows and 5 columns is assigned to category B2. Further, if the number of columns in \(N_{ij}\) is greater than 10, it is assigned a category in \(\{A-E\}4\). Similarly, if the number of rows in \(N_{ij}\) is greater than 14, it is assigned a category in \(E\{1-4\}\).

After the completion of this step, each \(T_i\) and \(N_{ij}\) has a category associated with it, referenced as \(T_i^{\text {cat}}\) and \(N_{ij}^{\text {cat}}\) respectively. It must be noted that the choice of category boundaries is purely empirical.
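A sketch of this mapping is shown below. Apart from the documented overflow points (more than 10 columns, more than 14 rows), the bucket boundaries are not spelled out in the text, so the edges used here are purely illustrative; they are chosen so that the 4-row, 5-column example above lands in B2.

```python
COL_EDGES = [4, 6, 8, 10]       # hypothetical upper bounds for categories 1-4
ROW_EDGES = [3, 5, 8, 11, 14]   # hypothetical upper bounds for categories A-E

def categorize(num_rows, num_cols):
    """Map a table/node to a cell of the 5 x 4 category grid of Fig. 5."""
    # Overflow rules: > 10 columns -> category 4, > 14 rows -> category E.
    col_cat = next((i + 1 for i, e in enumerate(COL_EDGES) if num_cols <= e), 4)
    row_cat = next((chr(ord("A") + i) for i, e in enumerate(ROW_EDGES)
                    if num_rows <= e), "E")
    return f"{row_cat}{col_cat}"

print(categorize(4, 5))   # -> "B2", matching the example above
print(categorize(20, 12)) # -> "E4", both overflow rules applied
```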

Fig. 5. A grid enumerating all of the possible table categories. Each table is assigned one of these categories based on its number of rows and columns.

Probability Based Selection: For each table \(T_i\), we construct a probability distribution \(P_i\) from which \(T_i\)'s nodes are sampled during training. First, a global frequency distribution of the tables over the table categories is generated, giving us \(F^G\), a 5 \(\times \) 4 matrix. Further, for each table \(T_i\), a frequency distribution of all \(N_{ij}\) over the table categories is generated, giving us \(F_i\), another 5 \(\times \) 4 matrix. Lastly, we generate a third 5 \(\times \) 4 matrix \(\text {Gauss}_i\), a 2-D Gaussian centered on \(T_i^{\text {cat}}\). The spread of this Gaussian distribution controls the diversity of the nodes \(N_{ij}\) sampled during training. These three matrices are multiplied to obtain \(P_i=\text {Gauss}_i * F^G * F_i\).

During training, a random table category is selected using \(P_i\) as the probability distribution. Once the category is selected, one node is sampled uniformly at random from all the nodes of \(T_i\) having the selected category. This selected node is passed on for training.
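Since all three matrices are 5 x 4, we read the product as element-wise. A sketch of the construction and of the two-stage sampling could look as follows; the Gaussian spread sigma is a free parameter, and categories in which \(T_i\) has no nodes receive zero probability because \(F_i\) is zero there.

```python
import random
import numpy as np

def build_distribution(F_global, F_i, cat_i, sigma=1.0):
    """P_i = Gauss_i * F^G * F_i (element-wise), normalized to sum to 1."""
    rows, cols = F_global.shape            # the 5 x 4 category grid
    r0, c0 = cat_i                         # grid position of T_i's own category
    rr, cc = np.mgrid[0:rows, 0:cols]
    gauss = np.exp(-((rr - r0) ** 2 + (cc - c0) ** 2) / (2.0 * sigma ** 2))
    P = gauss * F_global * F_i             # zero wherever T_i has no nodes
    return P / P.sum()

def sample_node(P, nodes_by_category):
    """Draw a category from P_i, then a node of T_i uniformly within it."""
    flat = np.random.choice(P.size, p=P.ravel())
    cat = tuple(int(v) for v in np.unravel_index(flat, P.shape))
    return random.choice(nodes_by_category[cat])
```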

4 Experiments and Results

To evaluate the efficacy of our proposed augmentation methodology, we train the Split model proposed by [23] on the ICDAR 2013 dataset. We train the model using three different methodologies:

  1. Non-Augmented: Images are fed for training without any modification.

  2. Standard: Basic image transformations, such as hue, saturation, and brightness jitter, in combination with image cropping.

  3. TabAug: Our proposed augmentation methodology.

The dataset has a total of 128 pages and 156 tables, which are divided into train, test, and validation splits in the proportion 0.72 : 0.2 : 0.08, respectively. We train the model with 4 different percentages of the training set: \(25\%\), \(50\%\), \(75\%\), and \(100\%\). It must be noted that this is the fraction of the training split utilized for training, not a fraction of the total dataset. Further, we repeat each experiment 3 times to get a better estimate of the average results and their deviation.

For training, we use a batch size of 1 and a learning rate of 0.00075. After every 15 epochs, we decay the learning rate by a factor of 0.8. Lastly, we train the models for a total of 27,500 iterations, which we empirically found to be enough for convergence of the model.
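For reference, this schedule could be expressed as follows, assuming a PyTorch implementation of the Split model; the optimizer type and the stand-in model are our assumptions, as the text fixes only the batch size, learning rate, decay factor, and iteration count.

```python
import torch

model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)  # stand-in, not the real Split model
optimizer = torch.optim.Adam(model.parameters(), lr=0.00075)  # optimizer type assumed
# Decay the learning rate by a factor of 0.8 after every 15 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.8)

for epoch in range(60):  # training runs for 27,500 iterations in total
    ...                  # one epoch over the training split, batch size 1
    scheduler.step()
```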

4.1 Ground Truth

For the ICDAR 2013 dataset, ground-truths are provided as bounding boxes of cells, along with their starting and ending row/column indices. This ground-truth format was converted to the T-Truth format [20]; however, the resulting row and column separators were not aligned with the ruling lines of the tables. We therefore re-annotated the dataset using T-Truth to align the separators, and then cross-checked the annotations to ensure that they were coherent. The models are trained and tested on this re-annotated dataset; however, as the annotations have been manually cross-checked, the evaluation is applicable to the original annotations as well.

During the re-annotation phase, we upgraded the T-Truth annotation tool to fix several bugs and make it more user-friendly. The upgraded version has been made publicly available (Footnote 2).

While the T-Truth annotation format is used for the augmentation process, the Split model [23] requires pixel-wise annotations. To generate such ground-truth, the separators of the T-Truth annotations are expanded to the nearest words (horizontally in the case of column ground-truth, and vertically in the case of row ground-truth).
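A sketch of this expansion for a single column separator is given below, assuming word bounding boxes are available as (x1, y1, x2, y2) tuples; the row case is symmetric along the vertical axis.

```python
import numpy as np

def column_separator_mask(sep_x, words, width, height):
    """Grow a column separator at pixel sep_x out to the nearest words."""
    # Right edge of the closest word left of the separator (or image border).
    left = max((w[2] for w in words if w[2] <= sep_x), default=0)
    # Left edge of the closest word right of the separator (or image border).
    right = min((w[0] for w in words if w[0] >= sep_x), default=width)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[:, left:right] = 1  # every pixel between the neighbouring words
    return mask
```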

4.2 Performance Measures

We chose the evaluation metrics proposed by Shahab et al. [20], primarily for two reasons. First, these metrics are comprehensive, painting a complete picture of a model's performance. Second, they are general-purpose and can be applied to any type of segment, such as rows, columns, and cells.

In the evaluation metrics proposed by Shahab et al. [20], the ground-truth segments and the predicted segments are first numbered. A correspondence matrix of shape \(n \times m\) is created, where n is the number of ground-truth segments and m is the number of predicted segments. Each entry [i, j] of the matrix stores the number of pixels intersecting between ground-truth segment \(G_i\) and predicted segment \(S_j\), that is, \(|G_i \cap S_j|\). Further, the sum of the i-th row of the correspondence matrix gives the total number of pixels in ground-truth segment \(G_i\), and the sum of the j-th column gives the total number of pixels in predicted segment \(S_j\). Once this correspondence matrix is generated, we define the following evaluation metrics using the threshold value \(T=0.1\) (consequently \(1-T=0.9\)):

Fig. 6. Sample outputs of all three approaches for a comparative analysis. Each row provides a separate test sample. The red boxes highlight the image region that is displayed for visualization of the ground-truth and the predictions. Ground-truth is displayed as blue lines while predictions are displayed as green lines. (Color figure online)

  1. Correct Detections: The total number of ground-truth segments that have a one-to-one mapping with a predicted segment and a major overlap. Concretely, a given ground-truth segment \(G_i\) is considered a correct detection if it has a major overlap with a predicted segment \(S_j\) and \(S_j\) does not have a significant overlap with any other ground-truth segment \((G_k, \forall k \ne i)\). That is:

     $$\begin{aligned} \frac{|G_i \cap S_j|}{|G_i|} > 1 - T \text { and } \frac{|G_k \cap S_j|}{|G_k|} < T \ \ \forall k \ne i \end{aligned}$$

  2. Over-Segmentations: The total number of ground-truth segments that have a significant overlap with more than one predicted segment. That is, a ground-truth segment \(G_i\) is over-segmented if:

     $$\begin{aligned} T< \frac{|G_i \cap S_j|}{|G_i|}< 1 - T \text { and } T< \frac{|G_i \cap S_k|}{|G_i|} < 1 - T \ \ \text {where } k \ne j \end{aligned}$$

  3. Under-Segmentations: The total number of predicted segments that have a significant overlap with more than one ground-truth segment. That is, a predicted segment \(S_j\) is under-segmented if:

     $$\begin{aligned} T< \frac{|G_i \cap S_j|}{|G_i|}< 1 - T \text { and } T< \frac{|G_k \cap S_j|}{|G_k|} < 1 - T \ \ \text {where } k \ne i \end{aligned}$$

All of these evaluation metrics are normalized by dividing by the total number of ground-truth segments, and they are presented as percentages in the results. We intentionally leave out Partial Detections, Missed Segments, and False Positive Detections from our evaluation metrics, as due to the problem formulation of the Split model [23] they always evaluate to zero.
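These metrics follow mechanically from the correspondence matrix. A sketch is given below, assuming `gt` and `pred` are integer label maps of the same shape in which every segment has a distinct id in 0..n-1 and 0..m-1 respectively; this encoding is our assumption for illustration.

```python
import numpy as np

def correspondence_matrix(gt, pred, n, m):
    """C[i, j] = |G_i ∩ S_j|, counted over pixels."""
    C = np.zeros((n, m), dtype=np.int64)
    np.add.at(C, (gt.ravel(), pred.ravel()), 1)
    return C

def evaluate(C, T=0.1):
    """Return (correct, over-segmented, under-segmented) as fractions of n."""
    over_gt = C / C.sum(axis=1, keepdims=True)  # |G_i ∩ S_j| / |G_i|
    n = C.shape[0]
    correct = 0
    for i in range(n):
        j = int(over_gt[i].argmax())
        # Major overlap with S_j, and S_j not significantly overlapping others.
        if over_gt[i, j] > 1 - T and (np.delete(over_gt[:, j], i) < T).all():
            correct += 1
    partial = (T < over_gt) & (over_gt < 1 - T)        # significant, non-major
    over_seg = int((partial.sum(axis=1) >= 2).sum())   # G_i split over several S_j
    under_seg = int((partial.sum(axis=0) >= 2).sum())  # S_j merging several G_i
    return correct / n, over_seg / n, under_seg / n
```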

Table 1. Results of evaluating the trained models on the 20% of the ICDAR 2013 dataset reserved for testing.
Fig. 7. Models were trained on a varying percentage of the total training dataset and then evaluated on a consistent 20% test split. The graphs show the progression of correct detections with an increase in training data for all three approaches.

4.3 Results and Analysis

We evaluate all the trained models on a consistent 20% test split of the ICDAR 2013 dataset. An evaluation on the performance measures described in Sect. 4.2 is provided in Table 1. Our proposed approach outperforms the other approaches on all performance measures for each of Cell, Row, and Column recognition. Further, the models trained with the proposed approach show less standard deviation in the evaluation metrics, which is indicative of more stable training. Moreover, as depicted in Table 2 and Fig. 7, our proposed approach consistently shows improvements across a range of training dataset sizes.

The standard augmentation approach manages to improve upon the evaluation metrics for Rows, but is deficient in the Cell and Column evaluations. This is explained by the fact that row segmentation, especially in the ICDAR 2013 dataset, is an easier task with less visual variability and complexity. Hence, it can be learned mostly from white-spaces, for which image transformations provide an ample amount of augmentation. Column segmentation, on the other hand, contains many complex scenarios and requires learning more abstract and structural features, for which image transformations turn out to be insufficient.

We further show a comparison of the predictions generated by the models trained with each of the three approaches in Fig. 6. In Fig. 6a, we see a case where both the standard augmentation and TabAug approaches successfully segment the rows, while the non-augmented approach under-segments a row in the header region. Conversely, in Fig. 6b, we see a case of over-segmentation by the non-augmented approach. In Fig. 6c, the non-augmented and standard augmentation approaches fail to recognize the boundaries of the columns correctly, due to the white-spaces between the protruding words of the header. However, TabAug robustly recognizes the boundaries of the column and predicts a single column separator. In Fig. 6d, we see a scenario where the TabAug approach fails, as it confuses vertically consistent breaks in the text with a column break. Regardless, it still does a decent job of making logically sound column segmentations compared to the other two approaches. Figure 6e depicts another sample that is correctly segmented by TabAug but causes significant confusion in column segmentation for the other two approaches. Finally, in Fig. 6g the first column is over-segmented by all of the approaches, as the column header has no overlap with its content below. This demonstrates a hard sample that requires a higher cognitive understanding for correct prediction.

Table 2. Correct detection percentages achieved by models trained on different percentages of the training dataset, evaluated on a consistent 20% test split of the ICDAR 2013 dataset.

5 Conclusion

In this paper, we presented TabAug, a novel augmentation technique capable of producing structural changes in a table through the replication and deletion of rows and columns. A data-driven probabilistic model is used in conjunction with the augmentation technique to control the augmentation process. Following the promising results of the Split model [23] trained on the publicly available ICDAR 2013 dataset using TabAug, we believe our work provides a strong foundation for numerous future extensions. In the future, we plan to explore ideas for cross-table augmentation through statistical layout matching.