
1 Introduction

In recent years, table information retrieval has garnered substantial attention. In several cases, table data describe, explain, or complement key statements in the document; therefore, they can be utilized for various natural language processing tasks, such as question answering systems [16, 30, 34], constructing or augmenting a knowledge base [4, 22, 23], and fact-checking [1]. In particular, tables that are contained in free-format documents such as PDF, Word, Excel, and images are often critical for the above tasks, e.g., experimental data in papers; financial performance in financial reports; and statistics in public documents, invoices, and ledgers.

However, the amount of table data available to machines is still limited; a major reason for this is that extracting tables and converting them into a machine-readable format remains a great challenge. This difficulty arises because free-format documents do not have tags or separators for tables, unlike markup languages such as HTML, XML, and JSON; therefore, even after identifying the location of a table [6, 9, 24, 25, 29], it is necessary to structure it into a machine-readable format.

Specifically, the main issue is parsing the table elements into a machine-readable table format. Table elements can be extracted using a PDF content-stream analyzer or an optical character reader; however, these tools only provide a bounding box position for each table element. To obtain machine-readable table data, it is essential to parse these bounding boxes into a structured table format. This task is often known as table structure recognition and is the main subject of this study.

For table structure recognition, the following difficulties prevent a simple pre-defined rule strategy: (1) the presence of spanning cells; (2) the varying widths and heights of the bounding boxes. For instance, an intuitive approach would be to construct a parsing rule based on the relative positions of the bounding boxes; i.e., if two or more bounding boxes are aligned on a single vertical line, these boxes may belong to a single column. This rule-based approach sometimes works, especially for simple tables. In practice, however, most tables have spanning cells that belong to multiple columns or rows. Moreover, determining the box alignment is difficult because each bounding box has a different width and height.

To overcome the above difficulties, recent studies have proposed deep neural network-based approaches. An earlier attempt [24] employs a fully convolutional network (FCN) architecture [15] to detect the row and column regions. This approach has also been adopted in recent works [27, 28], which apply an object detection framework. The advantage of this approach is that it naturally incorporates table structure information, such as ruled lines or margins. However, care must be taken with the mechanism through which blank cells are joined to construct spanning cells [31, 35], which is necessary for correctly recognizing the hierarchical structure of the table. In this paper, we refer to this approach as the detection-based approach.

Recently, relation classification approaches have been proposed in several studies [2, 14, 18, 21], wherein row and column recognition is considered as a relation classification task between a pair of bounding boxes. The advantage of this method is that a joining operation is not required for constructing spanning cells. Most studies on this approach utilize the graph structure of the table elements and employ graph convolutional networks (GCNs) [13], which successfully recognize rows and columns that contain spanning cells. However, one major disadvantage of this approach is the difficulty of feature engineering; for instance, it is difficult to fully utilize ruled line information. In this paper, we refer to this approach as the graph-based approach.

In this study, we adopt a novel and simple image-based relation classification approach for the table structure recognition task. Our idea is to employ an edge-based rectangle region formed by each pair of nodes as the input to a relation classifier. This rectangle contains essential information for the classification: the relative position, ruled lines, and the geometry of bounding boxes. Moreover, enlarging this edge-based region incorporates the global patterns of the table, which significantly improves the model accuracy. We stress that our approach has the advantages of both detection- and graph-based approaches, and succeeds in considerably reducing the complex design of pre-defined rules or feature engineering. Another advantage of our approach is that the data can be augmented through label-invariant operations. We propose novel label-invariant data augmentation techniques for the edge-based region, and demonstrate that they make significant contributions, especially when training with small amounts of data. In summary, our contributions are as follows.

  • We propose a novel edge-based cropping strategy for table structure recognition.

  • We introduce an edge region-based convolutional neural network (ER-CNN) that efficiently encodes the edge-based cropped images and positional information of the bounding boxes.

  • We propose efficient data augmentation techniques for the edge-based cropped images.

We evaluate our approach on two real-world table datasets consisting of PDF and handwritten scanned images. We compare our approach with four baselines, including two state-of-the-art graph-based approaches. We observe that our approach significantly outperforms the baselines in the exact matching ratio for tables.

The remainder of this article is organized as follows. In Sect. 2, we briefly review related works. In Sect. 3, we define the problem that is the focus of this study. In Sect. 4, we introduce the motivation of our approach through observation. In Sect. 5, we describe our approach. We then present our experimental results in Sect. 6; finally, we provide a conclusion in Sect. 7.

2 Related Works

For table structure recognition, similar to the development of the table detection task [6, 9, 24, 25, 29], recent studies have adopted deep learning approaches rather than pre-defined rules or heuristics [11, 26, 32] for structuring more complex tables. Several studies use semantic segmentation or object detection methods to detect the columns and rows of a table [24, 27, 28]. The difference in our approach is that, while previous studies are based on row and column detection, we adopt the relation classification approach and employ an edge-based cropping strategy for the classification.

Recently, other approaches based on relation classification have been proposed [2, 14, 18, 21]. In this approach, the table structure is recognized via the relationships between cells. Most works utilize the graph structure of the table elements, considering each bounding box as a node. In [14], the graph structure is constructed using the k-nearest neighbor (k-NN) algorithm, and features for the classification are constructed via GCN [13]. In [2], a multi-head attention mechanism is incorporated; both the node and edge features are convolved via GCN, thereby exchanging their feature propagation. [21] also convolves the edge feature via a GCN architecture. Meanwhile, [18] adopts GravNet [17] and DGCNN [33] for graph convolution. A significant difference in our approach is that, while the previous studies mainly utilize the positional information of the table elements as their input features, we incorporate information about the ruled lines and the geometry of the bounding boxes by adopting a CNN-based architecture and an edge-based region cropping strategy.

Our approach also relates to object detection and categorization, such as R-CNN [8], Fast R-CNN [7], and Faster R-CNN [20] in that the cropped image can be considered as a proposed object, and the relation classification corresponds to the categorization. The difference is that we determine the cropping region through the combination of the nodes, and utilize both the image and position for the model input to stress the geometry of the component.

3 Problem Setting

In our problem setting, we define dataset \(\mathcal {D}\) as a set of n tables: \(\mathcal {D} \equiv \{T^{1}, \dots , T^{n}\}\), where each table \(T^{t}\) consists of a table image \(I^{t}\), set of bounding boxes \(\mathcal {B}^{t}\) and set of relations \(\mathcal {R}^{t}\):Footnote 1

$$\begin{aligned} T^{t} \equiv \{I^{t}, \mathcal {B}^{t}, \mathcal {R}^{t}\} \,. \end{aligned}$$
(1)

The table image \(I^{t}\) has \(H^{t} \times W^{t}\) pixels and C channels, i.e.,

$$\begin{aligned} I^{t} \in [0, 1]^{H^{t} \times W^{t} \times C} \,. \end{aligned}$$
(2)

In this study, we assume that table images are preprocessed into gray-scaled or binarized pictures with a single channel; that is, \(C=1\). Meanwhile, \(\mathcal {B}^{t}\) is a set of bounding boxes for each table element, i.e.,

$$\begin{aligned} \mathcal {B}^{t} \equiv \{b^{t}_{1}, b^{t}_{2}, \dots , b^{t}_{m^{t}}\} \,, \end{aligned}$$
(3)

where \(m^{t}\) denotes the number of bounding boxes contained in table \(T^{t}\). b represents a bounding box that is defined as follows:

$$\begin{aligned} b^{t}_{i} \equiv (x_{li}^{t}, y_{ti}^{t}, x_{ri}^{t}, y_{bi}^{t} ) \,. \end{aligned}$$
(4)

We let each bounding box be described as a rectangle with a four-dimensional vector \((x_l, y_t, x_r, y_b)\), where \((x_l, y_t)\) and \((x_r, y_b)\), respectively, represent the top-left and bottom-right positions of the bounding box. Coordinate x/y increases from left/top to right/bottom; x and y satisfy \(0 \le x_l< x_r < W\) and \(0 \le y_t< y_b < H\), respectively. We also refer to \((x_{ci}, y_{ci})\) as the coordinates of the center of \(b_i\). In practice, bounding boxes may be split, merged, or missing owing to incomplete identification.Footnote 2 Although such misidentification can be mitigated by improving the box identification tool, improvements of the tool are beyond the scope of this paper. Therefore, in our problem setting, we assume that the bounding boxes are ideally provided; that is, \(b \in \mathcal {B}\) has a one-to-one correspondence with the table element. Finally, \(\mathcal {R}^{t}\) represents a set of relations between pairs of boxes, which is determined by a set of triplets:

$$\begin{aligned} \mathcal {R}^{t} \equiv \{ (b^{t}_i, b^{t}_j, y^{t}_{ij}) \mid b^{t}_i, b^{t}_j \in \mathcal {B}^{t}, y^{t}_{ij} \in \mathcal {L} \} \,, \end{aligned}$$
(5)

where \(\mathcal {L}\) represents a set of relation labels: \(\mathcal {L}=\{\text {irrelevant, row, column}\}\). Subsequently, by analogy from the graph representation, we may refer to the bounding boxes as nodes and relations between boxes as edges. Moreover, we may omit the table index t if it is clear from the context.

The relation classification approach to table structure recognition predicts \(y_{ij}\) for each pair \(b_i\) and \(b_j\), given a table image I and a set of bounding boxes \(\mathcal {B}\).
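
To make the notation above concrete, the following is a minimal Python sketch of how a table instance might be represented; the class and field names, as well as the integer label encoding, are our own and are not prescribed by the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

# Relation labels L = {irrelevant, row, column}; the integer encoding is an assumption.
IRRELEVANT, ROW, COLUMN = 0, 1, 2

Box = Tuple[float, float, float, float]  # (x_l, y_t, x_r, y_b)

@dataclass
class Table:
    image: np.ndarray                      # gray-scale table image, shape (H, W), values in [0, 1]
    boxes: List[Box]                       # one bounding box per table element
    relations: List[Tuple[int, int, int]]  # triplets (i, j, label) over box indices

def box_center(box: Box) -> Tuple[float, float]:
    """Center (x_c, y_c) of a bounding box."""
    x_l, y_t, x_r, y_b = box
    return 0.5 * (x_l + x_r), 0.5 * (y_t + y_b)
```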

4 Observations

To clarify the motivation of our approach, we provide an overview of the relationship between nodes, edges, ruled lines, and other neighbor nodes, using concrete examples.

Figure 1 shows examples of the geometry of nodes in a table. The figure shows that the relationship between the two blue boxes differs depending on the geometry of the other nodes and the ruled lines, even if the relative positions of the two nodes are the same. From the upper examples in Fig. 1, most relationships can be inferred by observing the inner area of the two nodes. In (1), we can infer that a column relationship between the two blue nodes is allowed, whereas this is inappropriate in (2) because of the presence of the intermediate cell. Meanwhile, (3), (4), and (5) show the effect of the ruled lines: (3) allows for column relationships between blue nodes, whereas (4) does not. Similarly, the combination of these ruled lines shown in (5) cuts off the column relationship between the two blue nodes.

Fig. 1. Relationships between column relations and table elements. The boxes represent the bounding boxes in the table, and the two blue boxes correspond to the nodes of interest. The red lines represent the column relationships, while the thick black lines represent the table ruled lines. Here, we omit the row relationships. (Color figure online)

Meanwhile, there are examples where the outer geometry influences the relationship, as shown in the lower examples in Fig. 1. In these examples, if we focus only on the inner area (orange rectangle), there could be a column relationship between the two blue nodes. However, once we increase the size of the region, such a relationship is found to be inappropriate because of the relationship with the other nodes. This observation suggests that the model should incorporate a proper range of peripheral information.

In previous relation classification approaches, these geometrical patterns were not efficiently incorporated, because constructing a node or edge feature that captures them requires heavy feature engineering. Meanwhile, the image region around a pair of nodes, which we call the edge region, naturally contains such information, and a CNN architecture can extract it efficiently without complex feature engineering. Motivated by these observations, we propose a novel image-based relation classification approach, which is discussed in the subsequent section.Footnote 3

5 Description of Our Approach

Our approach consists of two modules: the preprocessor, which extracts the information of the edge region, and the ER-CNN, which is employed as the classification model. An overview of our approach is shown in Fig. 2.

Fig. 2. Illustration of our approach.

5.1 Preprocessor

Edge-Based Cropping Region. The key idea of our approach is to employ a node pair-based image cropping strategy. Based on the observations in the previous section, we use a cropped table image with a rectangle formed by a pair of nodes as a primitive feature for the relation classification. More concretely, if a node pair has bounding boxes \((x_{li}, y_{ti}, x_{ri}, y_{bi})\) and \((x_{lj}, y_{tj}, x_{rj}, y_{bj})\), we define the inner region of the bounding boxes as follows:

$$\begin{aligned} b_{ij}^{e} \equiv (\min (x_{li}, x_{lj}), \min (y_{ti}, y_{tj}), \max (x_{ri}, x_{rj}), \max (y_{bi}, y_{bj})) \,. \end{aligned}$$
(6)

This region encompasses information about the inner state between two nodes.

Another important aspect for relation classification is the outer status near the node pair. To incorporate this global information, we scale the width and height of the inner region \(b_{ij}^{e}\). The edge-based cropping region is determined as follows:

$$\begin{aligned} b'^{e}_{ij} = ( \min (x_{li}, x_{lj}) - r w_{ij}, \min (y_{ti}, y_{tj}) - r h_{ij}, \nonumber \\ \max (x_{ri}, x_{rj}) + r w_{ij}, \max (y_{bi}, y_{bj}) + r h_{ij}) \,. \end{aligned}$$
(7)

Here, \(w_{ij}\) and \(h_{ij}\) denote the width and height of \(b_{ij}^{e}\), respectively, and r is a hyperparameter that defines the scale of the cropping region; we adopt \(r=1\) in this paper. If the cropping region extends outside the image, we fill the overflow with a blank value. We cut out the rectangular region \(b'^{e}_{ij}\) of the original table image I, which we define as \(I_{ij}^{e}\), and use it to craft the input image for the ER-CNN.
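
The region computation in Eqs. (6) and (7) reduces to a few coordinate operations; a minimal sketch is given below, with function names of our own choosing.

```python
def inner_region(box_i, box_j):
    """Inner region b^e_ij of two boxes (Eq. 6); boxes are (x_l, y_t, x_r, y_b)."""
    return (min(box_i[0], box_j[0]), min(box_i[1], box_j[1]),
            max(box_i[2], box_j[2]), max(box_i[3], box_j[3]))

def edge_region(box_i, box_j, r=1.0):
    """Enlarged edge-based cropping region b'^e_ij (Eq. 7) with scale parameter r.
    The result may extend outside the table image; the overflow is padded with a
    blank value when the crop is taken."""
    x_l, y_t, x_r, y_b = inner_region(box_i, box_j)
    w, h = x_r - x_l, y_b - y_t
    return (x_l - r * w, y_t - r * h, x_r + r * w, y_b + r * h)
```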

Crafting the Input Image of the Model. We construct the input of ER-CNN by splitting the cropped image \(I_{ij}^{e}\) into three channels: the channels of bounding boxes i and j, the channel of the other bounding boxes, and the channel of the other pixels, containing the noise and ruled information of the image. In the channels for the boxes, the rectangles of the boxes are filled with a constant, and the remaining area is filled with a blank value. This channel splitting helps the model to correctly recognize the table components. Finally, we resize this channel-split image to a \(64 \times 64\) square shape for the input of ER-CNN.
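
The following is a sketch of this channel-splitting step under a few assumptions not fixed by the paper: a gray-scale image array with values in [0, 1], a fill constant of 1 for box rectangles and 0 for blank areas, and OpenCV used only for resizing; `other_boxes` is assumed to hold the bounding boxes other than \(b_i\) and \(b_j\).

```python
import numpy as np
import cv2  # used only for resizing; any image library would do

def craft_input(image, box_i, box_j, other_boxes, region, out_size=64, fill=1.0):
    """Crop the edge region from the table image and split it into three channels:
    (1) boxes i and j, (2) the other boxes, (3) the remaining pixels (ruled lines
    and noise). Returns an array of shape (out_size, out_size, 3)."""
    x_l, y_t, x_r, y_b = [int(round(v)) for v in region]
    H, W = image.shape
    pad = max(0, -x_l, -y_t, x_r - W, y_b - H)          # blank padding for overflow
    padded = np.pad(image, pad, constant_values=0.0)
    crop = padded[y_t + pad:y_b + pad, x_l + pad:x_r + pad]

    def box_mask(boxes):
        m = np.zeros_like(crop)
        for bx_l, by_t, bx_r, by_b in boxes:            # filled rectangle per box
            u1 = int(np.clip(round(bx_l) - x_l, 0, crop.shape[1]))
            u2 = int(np.clip(round(bx_r) - x_l, 0, crop.shape[1]))
            v1 = int(np.clip(round(by_t) - y_t, 0, crop.shape[0]))
            v2 = int(np.clip(round(by_b) - y_t, 0, crop.shape[0]))
            m[v1:v2, u1:u2] = fill
        return m

    ch_pair = box_mask([box_i, box_j])                  # channel of boxes i and j
    ch_other = box_mask(other_boxes)                    # channel of the other boxes
    ch_rest = np.where((ch_pair + ch_other) > 0, 0.0, crop)  # ruled lines and noise
    stacked = np.stack([ch_pair, ch_other, ch_rest], axis=-1).astype(np.float32)
    return cv2.resize(stacked, (out_size, out_size))
```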

Box Position Extraction. In addition to the cropped image, we utilize the positional information of the bounding boxes. The sizes of the bounding boxes can vary considerably depending on the lengths and styles of the original table elements. Therefore, the pixel area occupied by a bounding box can sometimes be extremely small after the cropping and resizing procedure. In such a case, the geometry of the table element cannot be extracted properly from the image alone. To cope with this problem, we explicitly input the box positions into the model. Specifically, we use the box centers of \(b_i\) and \(b_j\) normalized by the size of \(b'^{e}_{ij}\): \(\{c_i, c_j\}\), where

$$\begin{aligned} c_i \equiv \left( \frac{x_{ci} - x'_{lij}}{w'_{ij}} \,,\, \frac{y_{ci} - y'_{tij}}{h'_{ij}} \right) \,. \end{aligned}$$
(8)

Here, \(w'_{ij}\) and \(h'_{ij}\) denote the width and height of \(b'^{e}_{ij}\), respectively, and \(x'_{lij} / y'_{tij}\) is the left/top position of \(b'^{e}_{ij}\).
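
Equation (8) corresponds to a simple normalization of the box center by the cropping region; a sketch follows, reusing the `edge_region` helper defined above.

```python
def normalized_center(box, region):
    """Normalized center of a box with respect to the edge region b'^e_ij (Eq. 8)."""
    x_l, y_t, x_r, y_b = region
    x_c, y_c = 0.5 * (box[0] + box[2]), 0.5 * (box[1] + box[3])
    return ((x_c - x_l) / (x_r - x_l), (y_c - y_t) / (y_b - y_t))

# Example: both normalized centers are fed to the position encoder of the ER-CNN.
# region = edge_region(box_i, box_j, r=1.0)
# c_i, c_j = normalized_center(box_i, region), normalized_center(box_j, region)
```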

5.2 Model

As described in Fig. 2, the ER-CNN consists of two encoders: an image encoder and a position encoder. The outputs of these encoders are concatenated and then passed through a final classification layer that outputs the label probabilities. For the backbone architecture of the image encoder, we adopt a small pre-trained residual neural network (ResNet) model (with 18 layers) [10]. The encoded vector is obtained from the output of the first block layer of the ResNet (with 64 channels), followed by two FC layers with batch normalization and rectified linear unit (ReLU) activation. Notably, one can easily exchange this module with a larger and more complex architecture. As demonstrated in the results section, our approach performs well even with this shallow architecture.

For the position encoder, the input is the set of box positions defined by Eq. (8). Each two-dimensional coordinate is first embedded into a d-dimensional hidden space \(\mathcal {R}^d\) via a transformation f: \( l_{i} = f(c_{i})\). Thereafter, the encoded position vector \(l_{p}\) is obtained by feeding the concatenation of \(l_{i}\) and \(l_{j}\) into a function g: \(l_{p} = g(l_{i} \oplus l_{j})\). For the transformation functions f and g, we adopted two FC layers with ReLU activation. Similarly, we set up the final classification layer using two FC layers with batch normalization, ReLU activation, and a softmax function. We set the size of all hidden layers to 64 and \(d=64\).
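
A PyTorch-style sketch of the ER-CNN described above is given below; the layer sizes follow the text, but details such as the pooling applied after the first ResNet block and the exact placement of batch normalization are assumptions, and the torchvision weight-loading argument may vary across versions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ERCNN(nn.Module):
    def __init__(self, d=64, n_labels=3):
        super().__init__()
        backbone = resnet18(weights="DEFAULT")  # pre-trained ResNet-18
        # Image encoder: ResNet stem + first block (64 channels), pooled, then two FC layers.
        self.cnn = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                 backbone.maxpool, backbone.layer1,
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.img_fc = nn.Sequential(nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
                                    nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU())
        # Position encoder: f embeds each 2-d center, g mixes the concatenated embeddings.
        self.f = nn.Sequential(nn.Linear(2, d), nn.ReLU(), nn.Linear(d, d), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.ReLU())
        # Final classifier over {irrelevant, row, column}; softmax is applied in the loss.
        self.head = nn.Sequential(nn.Linear(64 + d, 64), nn.BatchNorm1d(64), nn.ReLU(),
                                  nn.Linear(64, n_labels))

    def forward(self, image, c_i, c_j):
        # image: (B, 3, 64, 64) channel-split crop; c_i, c_j: (B, 2) normalized centers.
        v_img = self.img_fc(self.cnn(image))
        v_pos = self.g(torch.cat([self.f(c_i), self.f(c_j)], dim=-1))
        return self.head(torch.cat([v_img, v_pos], dim=-1))
```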

5.3 Data Augmentation

An advantage of the image-based approach is that the amount of data can be augmented through label-invariant operations. Unlike in typical image classification tasks, however, the geometry or presence of bounding boxes and ruled lines significantly affects the relation label; therefore, commonly used data augmentations, such as random cropping, random erasing [36], or cutout [3], are likely to generate noisy samples.

We introduce two novel label-invariant data augmentation techniques for our approach: randomly changing the size of the bounding boxes and adding noise near the boxes. The former is based on the intuition that the size of a box can vary depending on the size of the characters of the table element, whereas the relationship is mostly independent of the character size. The latter incorporates noise that often occurs near a box owing to the mismatch between the characters and the bounding box. Specifically, we create the augmented images \(\tilde{I}_{ij}^{e}\) table-by-table according to the process in Sect. 5.1, using the randomly rescaled bounding boxes and the table image. More precisely, we first fill the original box areas with a blank value and then place the rescaled bounding boxes. We added one augmented sample per original sample to our model's training set, and a new set of augmented data was generated for each training iteration.
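
The box-rescaling part of this augmentation can be sketched as follows; the scale range is an illustrative choice rather than a value taken from the paper, and the rescaled boxes would be passed through the cropping procedure of Sect. 5.1 to obtain \(\tilde{I}_{ij}^{e}\).

```python
import numpy as np

def rescale_boxes(boxes, low=0.8, high=1.2, rng=np.random):
    """Randomly rescale each bounding box about its center. The relation labels are
    unchanged because row/column membership does not depend on character size."""
    out = []
    for x_l, y_t, x_r, y_b in boxes:
        sx, sy = rng.uniform(low, high, size=2)           # independent width/height scales
        cx, cy = 0.5 * (x_l + x_r), 0.5 * (y_t + y_b)
        hw, hh = 0.5 * (x_r - x_l) * sx, 0.5 * (y_b - y_t) * sy
        out.append((cx - hw, cy - hh, cx + hw, cy + hh))
    return out
```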

5.4 Scalability

We finally mention the scalability of our approach. If predictions are made for all combinations of boxes, \(\mathcal {O}(m^2)\) computations are required for each table, which is infeasible for large tables. However, this computational complexity can be reduced by exploiting the fact that distant boxes are mostly irrelevant pairs. Specifically, we reduce the number of candidate box pairs using a k-NN method based on the locations of the boxes [2]. This reduces the computational complexity to \(\mathcal {O}(km)\), making the approach practically feasible. Moreover, we expect that the number of actual CNN computations can be reduced using techniques similar to Fast R-CNN [7], which is an interesting direction for future work.
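
A sketch of the k-NN candidate reduction, using scikit-learn's nearest-neighbor search; normalizing the box centers by the table size follows the setup described in Sect. 6.4, and the function name is our own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def candidate_pairs(boxes, table_w, table_h, k=20):
    """Restrict relation prediction to each box's k nearest neighbors, giving O(km) pairs."""
    centers = np.array([[(b[0] + b[2]) / (2 * table_w), (b[1] + b[3]) / (2 * table_h)]
                        for b in boxes])
    k_eff = min(k + 1, len(boxes))                 # +1 because each box is its own neighbor
    _, idx = NearestNeighbors(n_neighbors=k_eff).fit(centers).kneighbors(centers)
    return sorted({(min(i, j), max(i, j)) for i, row in enumerate(idx)
                   for j in row if i != j})        # unordered pairs, deduplicated
```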

6 Experiments

In this section, we first review the datasets and introduce evaluation metrics. Next, we introduce the baselines and experimental details. Finally, we present the results under the three experimental settings.

6.1 Dataset

We used two real-world table datasets: SciTSR [2], comprising typed PDF images, and ICDAR2019 [5], comprising handwritten scanned images.

The SciTSR dataset comprises 15,000 PDF-format tables, containing bounding boxes, relationships, and a table image for each table. The average number of nodes per table is approximately 40. Because we found that some table relationships involving the bounding box in the lower-left corner were missing, we fixed the generation code. In addition, some tables in the dataset extended outside the PDF area; these were removed by imposing a simple threshold on the maximum and minimum positions of the bounding boxes. After filtering, 11,134 training tables and 2,801 test tables were obtained. The test dataset provides a list of complex table IDs (635 tables after the preprocessing above), and we also report the performance on this list as the complex test set.

The ICDAR2019 dataset comprises 850 scanned table images with handwritten entries (600 for training and 250 for test). The ground truth of the table area and table structure is provided in XML format, from which we constructed the relations. To reduce the overlap between the boxes and ruled lines, we reduced the size of each bounding box by 50%. In addition, because the images had various background colors, we gray-scaled and binarized them using the 80th-percentile value of each table image as the threshold. Some scanned images contained multiple tables; these were split using the XML tags. In the test set, we found that images with ID numbers greater than 10,000 had significantly different properties from the other training and test data: they are captured rather than handwritten images and are approximately one-tenth the size of the others. Most models did not perform well on this subset; therefore, we evaluated it separately as the out-of-domain test set and the remainder as the in-domain test set. After preprocessing, we obtained 677 tables for training, 190 tables for the in-domain test set, and 145 for the out-of-domain test set.Footnote 4 The average number of nodes per table was approximately 300.

6.2 Evaluation Metrics

We adopted macro-averaged precision, recall, and F1 scores as metrics for our experiment. These metrics tend to be high in the relation classification of table structures; for instance, one misclassification per table still yields an F1 score of approximately 0.99. However, such misclassification, even at a rate of one per table, seriously degrades the performance of subsequent natural language processing tasks. Therefore, we employed an additional metric, the exact match, which yields 1 if the predicted rows or columns match the ground truth perfectly for a table. In our experiment, we measured the average exact-match ratio for rows, columns, and tables (i.e., 1 if both rows and columns match perfectly).
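
The exact-match metric can be computed from sets of predicted and ground-truth triplets; a minimal sketch follows, assuming the integer label encoding used in the earlier sketch.

```python
def exact_match(pred, gold, label):
    """1 if the predicted relations with the given label match the ground truth exactly."""
    to_set = lambda rel: {(min(i, j), max(i, j)) for i, j, y in rel if y == label}
    return int(to_set(pred) == to_set(gold))

def table_exact_match(pred, gold, row=1, column=2):
    """1 only if both the row and the column relations of a table match perfectly."""
    return exact_match(pred, gold, row) * exact_match(pred, gold, column)
```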

6.3 Baselines

We compared our model performance with the following four baselines.

Rule Base. A simple but strong baseline that constructs the table rows and columns using a pre-defined rule. Here, we adopted the following extraction algorithm based on the overlaps of pairs of bounding boxes.

  1. Select \(b_i \in \mathcal {B}\), which is located at the left/top-most position.

  2. Select \(b_j\), which is located at the left/top-most position in \(\mathcal {B} \backslash \{b_i\}\).

  3. If \(b_i\) and \(b_j\) do not overlap on the vertical/horizontal axis and overlap on the horizontal/vertical axis above a threshold length, set \(y_{ij} = \) row/column; otherwise, set \(y_{ij} =\) irrelevant.

  4. If \(y_{ij} = \) irrelevant, then remove \(b_j\) and go back to step 2.

  5. If a \(b_j\) that satisfies \(y_{ij} \ne \) irrelevant is found, or all \(b_j\) have been searched, assign \(y_{ij'} = \) irrelevant to all the remaining \(b_{j'}\), and restart the algorithm from step 1, replacing \(\mathcal {B}\) with \(\mathcal {B} \backslash \{b_i\}\).

In our experiments, we set the threshold value in step 3 to 50% of the smaller height/width of \(b_i\) and \(b_j\). Because this rule accurately identifies the row/column relationship between two distant nodes, the algorithm achieves high prediction accuracy, although it cannot recover spanning-cell relationships.
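
A sketch of the row-extraction variant of this rule is given below (the column case is symmetric); the interpretation that a row relation requires non-overlapping horizontal extents and a vertical overlap above the threshold is our reading of step 3, and the helper names are our own.

```python
def x_overlap(a, b):
    """Length of the overlap of two boxes (x_l, y_t, x_r, y_b) along the x-axis."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))

def y_overlap(a, b):
    """Length of the overlap of two boxes along the y-axis."""
    return max(0.0, min(a[3], b[3]) - max(a[1], b[1]))

def rule_based_rows(boxes, thresh_ratio=0.5):
    """Greedy left-to-right row linking; returns (i, j) index pairs labeled as row."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])   # left-most first
    rows = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:                                  # next left-most candidates
            a, b = boxes[i], boxes[j]
            thresh = thresh_ratio * min(a[3] - a[1], b[3] - b[1])  # 50% of the smaller height
            if x_overlap(a, b) == 0 and y_overlap(a, b) > thresh:
                rows.append((i, j))                                # first match found; stop
                break
    return rows
```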

MLP. Multi-layer perceptron (MLP) is a class of feed-forward network consisting of input, output, and hidden layers. In this experiment, we constructed a module with three hidden layers and ReLU activations. As the input, we fed the concatenation of the features of a pair of nodes. The node features adopted in the experiment are described in Sect. 6.4.

GraphTSR. [2] incorporates node and edge features as the input of a graph neural network. The authors adopt a multi-head attention layer for the graph convolution. Both node and edge features are constructed from the positions and sizes of the nodes. We adopted the same architecture and features in our experiment.

Ties. [18] examines variations in the architecture of the graph convolution mechanism for node features. Based on the results of that study, we adopted the DGCNN [33] module, in which the node graph structure is dynamically constructed from the hidden features of the nodes; we set the number of vertex neighbors for the DGCNN to 10. The image information was also used for the node feature by convolving the table image and sampling the CNN feature at each node position. In addition, the authors employed an edge sampling strategy to reduce the memory complexity. In our experiment, we sampled a constant number of negative samples (i.e., irrelevant edges) for each node, which we set to 10.Footnote 5

Table 1. Performances on the SciTSR dataset. P, R, and \(F_1\) denote precision, recall, and F1 scores, respectively. The numbers in parentheses represent standard deviations.
Table 2. Performances on ICDAR2019 dataset.
Table 3. Ablation study on the SciTSR dataset.

6.4 Experimental Details

We split the training dataset in a ratio of 3:1; the former was used for training and the latter for validation. Training was terminated by referring to the average F1 score of the rows and columns of the validation data. We adopted the cross-entropy loss and minimized it using the Adam optimizer [12]. For GraphTSR and Ties, we referred to the official code and modified or reconstructed it for our experiments, retaining the original architectures.

To construct a set of relations \(\mathcal {R}\) including \(y=\) irrelevant, we adopted a conventional negative sampling method: for each node i, we constructed a set of node pairs by pairing i with its k nearest neighbor nodes, and we assigned a row, column, or irrelevant label to each node pair by referring to the ground truth. For all baselines, we set the number of nearest neighbor nodes to \(k=20\) for training and validation and \(k=50\) for testing, except for GraphTSR, for which we found that using the same k for training and prediction yielded higher performance and hence adopted \(k=20\) for prediction.
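
The label assignment for the sampled pairs amounts to a lookup in the ground-truth relation set; a sketch follows, combinable with the `candidate_pairs` helper sketched in Sect. 5.4.

```python
def label_pairs(pairs, relations, irrelevant=0):
    """Attach ground-truth labels to candidate (i, j) pairs; pairs absent from the
    annotated relations become negative (irrelevant) samples."""
    gold = {(min(i, j), max(i, j)): y for i, j, y in relations}
    return [(i, j, gold.get((i, j), irrelevant)) for i, j in pairs]
```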

We used the following 16 node features for MLP and Ties: the box positions, box centers, box widths, and box heights, along with these features normalized by the table size. The box center normalized by the table size was also used for the k-NN box search. For GraphTSR, the features are standardized and the edge features are used; we adopted the same features as the original code.

Each approach was run three times and the average values are reported.

6.5 Performances on the Full Data

First, we tested the performance of the models when trained on all training data samples. Tables 1 and 2 summarize the results.Footnote 6 For the SciTSR dataset, we observe that our method significantly outperforms the baselines on most of the metrics.Footnote 7 In the ICDAR2019 dataset, the images are distorted and noisy, which is difficult for an image-based approach. Nevertheless, our approach achieves performance competitive with the baselines on the F1 scores and significantly outperforms them on the exact matching ratio for rows and tables. We expect this accuracy to improve further with more sophisticated image preprocessing.

6.6 Ablation Studies

Next, we present the results of the ablation studies of our model on the SciTSR dataset. In this experiment, we tested the model performance by ablating the data augmentation (-DA) and changing the scaling parameter r over the range (0, 0.5, 1, 2). The results are summarized in Table 3. Interestingly, even \(r=0\), which only uses information on the inner region, yields accuracy competitive with the baselines, indicating the importance of the internal states of the node pairs. The best accuracy was obtained at \(r=1\), and the accuracy decreased for larger r values: when the cropped area is excessively large, important information on the inner part is lost owing to the resizing procedure.

Fig. 3. Performances on the small data. The shaded regions represent the standard deviation. F1 in the left panel represents the average of the row and column F1 scores.

6.7 Performances on the Small Data

Finally, we present the model performance under a small data setting. Documents often contain sensitive information, and crowdsourcing is not always available. In such a situation, it is desirable to achieve practical accuracy with a small number of annotations. Assuming a sample size that is sufficiently small to be annotated without crowdsourcing, we sampled 5, 10, 15, 20, 30, 50 and 100 tables for training and 10 tables for validation from the SciTSR training dataset. The performance was evaluated on the same test dataset used in the full data experiment.

The results are presented in Fig. 3. With the graph-based approaches (Ties and GraphTSR), the models cannot be trained well on such a small training dataset, and their performances fall below that of the rule-based baseline. In contrast, our method significantly outperforms the graph-based approaches and is competitive with the rule-based baseline, even without data augmentation. With data augmentation, the performance increases further; even with five samples, the exact matching ratio is greater than 60%, and with 30 samples, it is above 70%.

7 Conclusion and Discussion

In this study, we proposed a novel image-based approach for the table structure recognition task. Our model efficiently exploits information on the geometry of the table elements and the table ruled lines through edge-based cropped images. We evaluated our approach on two real-world table datasets, consisting of typed PDFs and handwritten scanned images, in comparison with four baselines, and observed that it significantly outperforms the baselines in the exact matching ratio for tables. In addition, our experiments confirmed that our approach works well with small amounts of data. We finally note that our approach can easily be combined with a graph convolutional architecture by exchanging the position encoder with a GCN, which may help to improve robustness against noisy and distorted images.