
1 Introduction

With the rise of digital documents, document understanding has received much attention from leading industrial companies such as IBM [35] and Microsoft [31, 32]. Visual Question Answering (VQA) on visually-rich documents (i.e. scanned document images or PDF file pages) aims to examine comprehensive document understanding conditioned on given questions [13]. A comprehensive understanding of a document includes structural understanding [18, 25, 26] and content understanding [6, 7].

Existing document VQA datasets mainly examine document understanding in terms of contextual understanding [21, 29] and key information extraction [10, 24]. Their questions are designed to ask about specific contents on a document page. For example, the question “What is the income value of consulting fees in 1979?" expects a specific value from the document contents. Such questions examine the model’s ability to understand questions and document textual contents simultaneously.

Apart from the contents, the other important aspect of a document is its structured layout, which organizes the content hierarchically. Including such structural layout understanding in the document VQA task is also critical for improving a model’s ability to understand documents at a high level, because in real-world document understanding it is common to query a document from a higher level rather than only about specific contents. For example, a common question would be “What is the figure on this page about?", and answering it requires the model to recognize the figure element and understand that the figure caption, which is structurally associated with the figure, should be extracted and returned as the best answer.

Additionally, existing document VQA limits the scale of document understanding to a single independent document page [21, 29]. However, most documents in daily work are multi-page documents with successive logical connections between pages. It is a more natural demand to holistically understand the full document file and capture the connections of textual contents and their structural relationships across multiple pages, rather than understanding each page independently. Thus, it is important to expand the current scale of page-level document understanding to the full document level.

In this work, we propose a new document VQA dataset, PDF-VQA, that contains questions to comprehensively examine document understanding from the aspects of 1) document element recognition, 2) structural relationship understanding, and 3) both page-level and full document-level understanding. Specifically, we set up three tasks for our dataset with questions that target different aspects of document understanding. The first task mainly aims at document element recognition and the understanding of their relative positional relationships at the page level, the second task focuses on structural understanding and information extraction at the page level, and the third task targets the hierarchical understanding of document contents at the full document level. Moreover, we adopt an automatic question-answer generation process to save human annotation time and enrich the dataset with diverse question patterns. We also explicitly annotate the relative hierarchical and positional relationships between document elements. As shown in Table 1, our PDF-VQA provides a hierarchically logical relational graph and a spatial relational graph, indicating the different relationship types between document elements. This graph information can be used in model construction to learn the document element relationships. We also propose a graph-based model to give insights into how those graphs can be used to gain a deeper understanding of document element relationships from different aspects.

Our contributions are summarized as follows: 1) We propose a new document-based VQA dataset to examine document understanding from comprehensive aspects, including document element recognition and structural layout understanding; 2) We are the first to boost the scale of document VQA questions from the page level to the full document level; 3) We provide explicit annotations of spatial and hierarchically logical relational graphs of document elements for easier use of relationship features in future work; 4) We propose a strong baseline for PDF-VQA by adopting graph-based components.

Table 1. Summary of conventional document-based VQA. Answer type abbreviations are MCQ: Multiple Choice; Ex: Extractive; Num: Numerical answer; Y/N: yes/no; Ab: Abstractive. Datasets with a tick mark in the Text Info. column provide textual information/OCR tokens of the image/document page ROIs. LR graph: logical relational graph; SR graph: spatial relational graph.

2 Related Work

VQA was first proposed by [1], which categorizes the image sources of VQA tasks into three types: realistic/synthetic photos, scientific charts, and document pages. VQA with realistic or synthetic photos is widely known as conventional VQA [1, 8, 11, 12, 19]. These realistic photos contain diverse object types, and the questions of conventional VQA query the recognition of objects, their attributes, and the positional relationships between objects. The later proposed scene text VQA problem [2, 23, 27, 30] involves realistic photos with scene texts, such as the picture of a restaurant with its brand name; its questions ask about recognising the scene texts associated with objects in the photos. VQA with scientific charts [3, 13, 14, 22] contains scientific-style plots, such as bar charts, and the questions usually query trend recognition, value comparison, and the identification of chart properties. VQA with document pages involves images of various document types, for example, screenshots of web pages that contain short paragraphs and diagrams [29], info-graphics [20], and single document pages of scanned letters/reports/forms/invoices [21]. These questions usually query the textual contents of a document page, and most answers are text spans extracted from the document pages.

VQA tasks on document pages are related to Machine Reading Comprehension (MRC) tasks in that questions are about the textual contents and are answered by extractive text spans. Some research works [21, 29] also consider it an MRC task, so it can be solved by applying language models to the texts extracted from the document pages. However, input usage is the main difference between MRC and VQA: whereas MRC is based on pure texts of paragraphs and questions, document-based VQA focuses on processing image inputs and questions. Our PDF-VQA is based on document pages of published scientific articles, which requires the simultaneous processing of PDF images and questions. We compare VQA datasets of different attributes in Table 1. While the questions of previous datasets mainly ask about the specific contents of document pages or certain values of scientific charts/diagrams, our PDF-VQA questions also query the document layout structures and examine the understanding of positional and hierarchical relationships among the recognized document elements.

Table 2. Data statistics of Tasks A, B, and C. The numbers in the Image row refer to the number of document pages for Tasks A/B but the number of entire documents for Task C.

3 PDF-VQA Dataset

Our PDF-VQA dataset contains three subsets for three different tasks to mainly examine the different aspects of document understanding: Task A) Page-level Document Element Recognition, B) Page-level Document Layout Structure Understanding, and C) Full Document-level Understanding. Detailed dataset statistics are in Table 2.

Task A aims to examine document element recognition and the understanding of their relative spatial relationships at the document page level. Questions are designed in two types: verifying the existence of document elements and counting the element numbers. Both question types examine the understanding of relative spatial relationships between different document elements, for example, “Is there any table below the ‘Results’ section?" in Fig. 1 and “How many tables are on this page?". Answers are yes/no and numbers from a fixed answer space.

Task B focuses on understanding the document layout structures spatially and logically, based on the recognized document elements at the document page level, and extracting the relevant texts as answers to the questions. There are two main question types: structural understanding and object recognition. The structural understanding questions examine spatial structures from both relative positions and human reading order. For example, “What is the bottom section about?" requires understanding the document layout structure from the relative bottom position, and “What is the last section about?" requires identifying the last section based on the human reading order of a document. The object recognition questions explicitly mention a specific document element and require recognizing the queried element first, such as the question “What is the bottom table about?" in Fig. 1. Answering both types of questions requires a logical understanding of the hierarchical relationships of document elements. For instance, based on the textual contents, a section title is a logically high-level summarization of its following section and is regarded as the answer to “What is the last section about?". Similarly, a table caption is logically associated with a table, so the table caption contents best describe the table.

Task C questions have a sequence of answers extracted from multiple pages of the full document, enhancing document understanding from the page level to the full document level. Answering a question in Task C requires reviewing the full document contents and identifying the contents hierarchically related to the queried item in the question. For example, the question “Which section does describe Table 2?" in Fig. 1 requires identifying all the sections of the full document that describe the queried table. The answers to such questions are the texts of the corresponding section titles, extracted as high-level summarizations of the identified sections. Identifying the items at the higher-level hierarchy of the queried item is defined as a parent relation understanding question in PDF-VQA. Conversely, Task C also contains questions about identifying the items at the lower-level hierarchy of the queried item, which are defined as child relation understanding questions. For example, the question “What is the ‘Methods’ section about?" requires extracting all the subsection titles as the answer.

The detailed question type distribution of each task is shown in Table 3.

Fig. 1. PDF-VQA sample questions and document pages for Task A, B, and C.

3.1 Data Source

Our PDF-VQA dataset collects the PDF versions of visually-rich documents from the PubMed Central (PMC) Open Access Subset. Each document file has a corresponding XML file that provides structured representations of the textual contents and graphical components of the article. We applied a pretrained Mask R-CNN [35] over the collected document pages to get the bounding boxes and categories of each document element. The categories initially consisted of five common PDF document element types: title, text, list, figure, and table. We then labelled the text elements that are positionally closest to tables and figures with two additional categories, table caption and figure caption, respectively.
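To make the caption-labelling step concrete, the following Python sketch shows one way the text element nearest to each detected table/figure could be relabelled as its caption. The `Element` structure, the centre-distance measure, and the function names are illustrative assumptions rather than our actual annotation code.

```python
from dataclasses import dataclass

# Hypothetical sketch of the caption-relabelling heuristic: the text element
# whose centre is closest to each detected table/figure is re-labelled as
# "table caption" / "figure caption".

@dataclass
class Element:
    box: tuple          # (x0, y0, x1, y1) predicted by the detector
    category: str       # "title", "text", "list", "figure", "table"

def centre(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def distance(a, b):
    (ax, ay), (bx, by) = centre(a.box), centre(b.box)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def relabel_captions(elements):
    """Re-label the text element nearest to each table/figure as its caption."""
    for target in [e for e in elements if e.category in ("table", "figure")]:
        texts = [e for e in elements if e.category == "text"]
        if not texts:
            continue
        nearest = min(texts, key=lambda t: distance(target, t))
        nearest.category = f"{target.category} caption"
    return elements
```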

3.2 Relational Graphs

Visually rich documents of scientific articles consist of fixed layout structures and hierarchically logical relationships among sections, subsections, and other elements such as tables/figures and table/figure captions. Understanding such layout structures and relationships is essential to boost the understanding of this type of document. Graphs have been used as an effective way to represent relationships between objects in many tasks [4, 18, 33, 34]. Inspired by this, for each document we annotate a hierarchically logical relational graph (LR graph) and a spatial relational graph (SR graph) to explicitly represent the logical and spatial relationships between document elements, respectively. These two graphs can be directly used by any deep-learning mechanism to enhance feature representations. In Sect. 6, we propose a graph-based model to illustrate how such relational information can help solve the PDF-VQA questions. The SR graph indicates the relative spatial relationships between document elements based on the absolute geometric positions of their bounding box coordinates. For each document element on a single document page, we identify its relative spatial relationship with every other document element among eight spatial types: top, bottom, left, right, top-left, top-right, bottom-left and bottom-right. The LR graph indicates the potential affiliations between document elements by identifying parent objects and their child objects based on the hierarchical structures of document layouts. We follow [18] to annotate the parent-child relations between the document elements on a single document page to generate the LR graph. The graphs of a full multi-page document are composed of the graphs of its individual pages.
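As an illustration of how the SR graph relations could be derived from bounding-box geometry, the sketch below assigns one of the eight spatial types to every ordered pair of elements on a page. The centre-based comparison rule and helper names are assumptions for exposition, not the exact annotation procedure.

```python
# Illustrative sketch of deriving one of the eight spatial relation types
# between two document elements from their bounding-box centres.

def spatial_relation(box_a, box_b):
    """Return the relation of box_b relative to box_a (image origin at top-left)."""
    ax = (box_a[0] + box_a[2]) / 2.0
    ay = (box_a[1] + box_a[3]) / 2.0
    bx = (box_b[0] + box_b[2]) / 2.0
    by = (box_b[1] + box_b[3]) / 2.0

    horizontal = "left" if bx < ax else "right" if bx > ax else ""
    vertical = "top" if by < ay else "bottom" if by > ay else ""

    if vertical and horizontal:
        return f"{vertical}-{horizontal}"   # e.g. "top-left"
    return vertical or horizontal or "overlap"

def build_sr_graph(boxes):
    """SR graph: relation label for every ordered pair of elements on a page."""
    return {(i, j): spatial_relation(boxes[i], boxes[j])
            for i in range(len(boxes)) for j in range(len(boxes)) if i != j}
```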

Table 3. Ratio and exact number of various question types of Task A, B and C.

3.3 Question Generation

Visually rich documents of scientific articles have consistent spatial and logical structures. The associated XML files of these documents provide detailed logical structures between semantic entities. Based on this structural information and pre-defined question templates, we applied an automatic question-generation process to generate large-scale question-answer pairs efficiently. For example, the question “How many tables are above the ‘Discussion’?" is generated from the question template “How many 〈E1〉 are 〈R〉 the ‘〈E2〉’?" by filling the masked terms 〈E1〉, 〈R〉 and 〈E2〉 with a document element label (“table"), a positional relationship (“above") and a title name extracted from the document contents (“Discussion"), respectively. We prepare each question template with various language patterns to diversify the questions. For instance, the above template can also be written as “What is the number of 〈E1〉 are 〈R〉 the ‘〈E2〉’?". We have 36, 15, and 15 question patterns for Task A, B, and C, respectively. We limit the parameter values of the document element label to only title, list, table, and figure, as asking about the number/existence/position of text elements would be less valuable. The parameter values include four document element labels, eight positional relationships (top, bottom, left, right, top-left, top-right, bottom-left and bottom-right), ordinal forms (first, last) and texts from document contents (e.g. section titles, references, etc.). We also replace some parameter values with their synonyms, such as “on the top of" for “above".
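The sketch below illustrates the template-filling step with the example template above. The template strings mirror the examples given in this section, while the sampling helpers and parameter pools are simplified assumptions.

```python
import random

# Hypothetical sketch of filling a question template with parameter values.
TEMPLATES = [
    "How many {E1} are {R} the '{E2}'?",
    "What is the number of {E1} {R} the '{E2}'?",   # paraphrased pattern
]
ELEMENT_LABELS = ["title", "list", "table", "figure"]
RELATIONS = ["above", "below", "left of", "right of",
             "top-left of", "top-right of", "bottom-left of", "bottom-right of"]

def generate_question(section_titles):
    template = random.choice(TEMPLATES)
    return template.format(
        E1=random.choice(ELEMENT_LABELS) + "s",
        R=random.choice(RELATIONS),
        E2=random.choice(section_titles),     # e.g. "Discussion"
    )

print(generate_question(["Discussion", "Results", "Methods"]))
```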

To automatically generate the ground-truth answers to our questions, we first represent each document page (for Task A and B) or the full document (for Task C) with all the document elements and the associated relations from the two relational graphs described in Sect. 3.2. We then apply a functional program, which is uniquely associated with each question template and contains a sequence of functions each representing a reasoning step, over such document (page) representations to reach the answer. For example, the functional program for the question “How many tables are above the ‘Discussion’?" consists of the function sequence \(\textit{filter-unique}\rightarrow \textit{query-position}\rightarrow \textit{filter-category}\rightarrow \textit{count}\), which filters the document elements that satisfy the asked positional relationship and counts them as the ground-truth answer.
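A minimal Python sketch of such a functional program is given below, assuming a page representation in which each element carries an index, a category, and its text, and the SR graph maps ordered index pairs to relation labels; the data structures and helper names are illustrative, not our released implementation.

```python
# Illustrative sketch of the program
# filter-unique -> query-position -> filter-category -> count
# executed over a page representation.

def filter_unique(elements, title_text):
    """Find the single title element whose text matches the queried title."""
    matches = [e for e in elements
               if e["category"] == "title" and e["text"] == title_text]
    assert len(matches) == 1
    return matches[0]

def query_position(sr_graph, anchor, relation):
    """Indices of elements holding the given spatial relation to the anchor."""
    return [j for (i, j), rel in sr_graph.items()
            if i == anchor["index"] and rel == relation]

def filter_category(elements, indices, category):
    return [i for i in indices if elements[i]["category"] == category]

def count(indices):
    return len(indices)

def answer_how_many(elements, sr_graph, title_text="Discussion",
                    relation="top", category="table"):
    anchor = filter_unique(elements, title_text)
    candidates = query_position(sr_graph, anchor, relation)  # "above" -> "top"
    tables = filter_category(elements, candidates, category)
    return count(tables)
```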

Moreover, we conduct question balancing from answer-based and question-based aspects to avoid question-conditional biases and balance the answer distributions. Firstly, we conduct answer-based balancing by down-sampling questions based on the answer distribution: we identify QA pairs whose answers have large ratios, divide the identified questions into groups based on their patterns, and reduce the over-represented QA pairs until the answer distributions are balanced. After that, we further conduct question-based balancing to avoid duplicated question types. To achieve this, we smooth the distributions of parameter values filled into the question templates by removing questions with large proportions of certain parameter values until the distribution of parameter value combinations is balanced. Since the parameter values of Task C question templates are almost unique, as all of them are texts from document contents, we did not conduct balancing for Task C. After balancing, Task A questions are down-sampled from 444,967 to 81,085, and Task B questions are down-sampled from 246,740 to 53,872.
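A simplified sketch of the answer-based down-sampling step is shown below; the per-answer cap is an assumption for illustration, and the actual balancing additionally groups questions by pattern as described above.

```python
import random
from collections import defaultdict

# Illustrative sketch of answer-based balancing: QA pairs whose answers are
# over-represented are down-sampled so that no answer exceeds a target share
# of the original set.

def balance_by_answer(qa_pairs, max_share=0.2, seed=0):
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for qa in qa_pairs:
        by_answer[qa["answer"]].append(qa)

    cap = int(max_share * len(qa_pairs))
    balanced = []
    for answer, group in by_answer.items():
        if len(group) > cap:
            group = rng.sample(group, cap)   # down-sample over-represented answers
        balanced.extend(group)
    return balanced
```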

Fig. 2. The top 4 words of questions in Task A, B and C.

4 Dataset Analysis and Evaluation

4.1 Dataset Analysis

The average numbers of questions per document page/document in Task A, B, and C are 6.57, 4.37, and 4.93, and the average question lengths are 25, 10, and 15, respectively. A sunburst plot showing each task’s top 4 question words is shown in Fig. 2. We can see that Task A question priors are more diverse, which complements the simplicity of document element and position recognition questions and prevents the model from memorizing question patterns. For Task B and C, question priors distribute over “What", “When", “Can you", and “Which", and we also specifically design questions in a declarative form with “Name out the section..." in Task C. 13.43%, 0.24%, and 29.38% of the questions in Task A, B, and C are unique questions. This unique question ratio seems low compared to other document-based VQA datasets because, rather than only aiming at the textual understanding of certain page contents, our PDF-VQA dataset targets more the spatial and hierarchically structural understanding of document layouts. Our questions generally ask about the document structures from a higher level and thus contain fewer unique texts associated with the specific contents of each document page. Answers for Task A questions come from a fixed answer space of eight possible answers: “yes", “no", “0", “1", “2", “3", “4" and “5". Answers for Task B and C are texts retrieved from the document page/entire document. We also analyzed the top 15 frequent question patterns in Task A, B and C, as shown in Fig. 3, to show the common questions of each question type in each task. In this analysis, we used a placeholder “X” to replace the figure/table numbers or section titles that appear in the questions, in order to present the common question patterns.

Fig. 3. Top 15 frequent questions of Task A, B and C.

Table 4. Positive rates (Pos(%)) and Fleiss Kappa Agreement (Kappa) of human evaluation.

4.2 Human Evaluation

To evaluate the quality of the automatically generated question-answer pairs, we invited ten raters, including deep-learning researchers and crowd-sourcing workers. Firstly, to determine the relevance between the question and the corresponding page/document, we define the Relevance criterion. Correspondingly, we define Correctness to determine whether the auto-generated answer is correct for the question. In addition, we ask raters to judge whether our QA pairs are meaningful and could appear in the real world using the Meaningfulness criterion. After collecting the raters’ feedback, we calculate the positive rate of each perspective and apply Fleiss’ Kappa to measure the agreement between multiple raters, as shown in Table 4. All three tasks achieve decent positive rates with substantial or almost perfect agreements. For Task A, Relevance and Correctness reach high positive rates with almost perfect agreements. A few raters gave negative responses regarding the Meaningfulness of questions about the existence of tables or figures, although those questions are crucial to understanding the document layout for any subsequent table/figure content understanding questions. In Task B, all three perspectives achieve high positive rates with substantial agreements. The disagreements on Task B mainly come from the questions with no specific answer (N/A): some raters thought those questions were incorrect and meaningless, but such questions are crucial for covering commonly appearing real-world cases, because it is possible that a page does not contain the element queried in the question, in which case no specific answer is a reasonable answer. Finally, for Task C, both the positive rates and the agreements across the three perspectives are notable. In addition to the three perspectives, raters agreed that most of the questions in Task C need cross-page understanding (the positive rate is 82.91%).

5 Baseline Models

We experimented with several baselines on our PDF-VQA dataset to provide a preliminary view of different models’ performances. We chose vision-and-language models that have demonstrated good performance on VQA tasks and a language model, as listed in Table 5. We followed the original settings of each baseline and only modified the output layers to suit the different PDF-VQA tasks.

6 Proposed Model: LoSpa

In this paper, we introduce a strong baseline, the Logical and Spatial Graph-based model (LoSpa), which utilizes logical and spatial relational information based on the logical (LR) and spatial (SR) relational graphs introduced in Sect. 3.2.

Fig. 4. Logical and Spatial Graph-based Model Architecture for the three tasks. Task A, B and C use the same relational information to enhance the object representations but different model architectures in the decoding stage.

Input Representation: We treat questions as sequential plain-text inputs and encode them with BERT. For the document elements of a given document page I, such as Title, Text, and Figure, we use a pretrained ResNet-101 backbone to extract visual representations \(X_v \in \mathbb {R}^{N\times d_f}\) and use the [CLS] token from BERT as the semantic representation \(X_s \in \mathbb {R}^{N\times d_s}\) of the texts of each document element.
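The following sketch (using Hugging Face Transformers and torchvision, under our own preprocessing assumptions) illustrates how the per-element semantic feature \(X_s\) and visual feature \(X_v\) could be extracted; it is a minimal illustration rather than the exact pipeline.

```python
import torch
from transformers import BertTokenizer, BertModel
from torchvision import models, transforms

# Minimal sketch (not the exact implementation) of building the per-element
# semantic feature ([CLS] from BERT over the element's OCR text) and visual
# feature (ResNet-101 activations of the element's image crop).

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

resnet = models.resnet101(weights="IMAGENET1K_V1").eval()
visual_backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

@torch.no_grad()
def element_features(page_image, element_text, element_box):
    # Semantic feature: [CLS] hidden state of the element's text (d_s = 768).
    tokens = tokenizer(element_text, return_tensors="pt", truncation=True)
    x_s = bert(**tokens).last_hidden_state[:, 0]                       # (1, 768)

    # Visual feature: pooled ResNet-101 activations of the cropped region (d_f = 2048).
    crop = page_image.crop(element_box)                                # PIL image, box = (x0, y0, x1, y1)
    x_v = visual_backbone(to_tensor(crop).unsqueeze(0)).flatten(1)     # (1, 2048)
    return x_s, x_v
```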

Relational Information Learning: We construct two graphs, a logical graph \(\mathcal {G}_{l} = \left( \mathcal {V}_{l}, \mathcal {E}_{l} \right) \) and a spatial graph \(\mathcal {G}_{s} = \left( \mathcal {V}_{s}, \mathcal {E}_{s} \right) \), for each document page. For the logical graph \(\mathcal {G}_{l}\), based on [18], we use the semantic features as node representations \(\mathcal {V}_{l}\) and the existence of a parent-child relation between document elements (extracted from the logical relational graph annotation in our dataset) as binary edge values \(\mathcal {E}_{l} \in \left\{ 0, 1 \right\} \). Similarly, for the spatial graph \(\mathcal {G}_{s}\), we follow [18] to use the visual features of document elements as node representations \(\mathcal {V}_{s}\) and the distances to the two nearest document elements as weighted edge values \(\mathcal {E}_{s}\).

For each document page I, we take \(X_s \in \mathbb {R}^{N\times d_s}\) and \(X_v \in \mathbb {R}^{N\times d_f}\) as the initial node feature matrices for \(\mathcal {G}_{l}\) and \(\mathcal {G}_{s}\) respectively. These initial node features are fed into a two-layer Graph Convolution Network (GCN) trained by predicting each node’s category. After the GCN training, we extract the first-layer hidden states as the updated node representations \(X_s' \in \mathbb {R}^{N\times d}\) and \(X_v' \in \mathbb {R}^{N\times d}\), which augment the relational information between document elements for \(\mathcal {G}_{l}\) and \(\mathcal {G}_{s}\) respectively, where \(d=768\). For each feature aspect, we apply separate linear transformations to the initial feature matrices (\(X_s\)/\(X_v\)) and the updated feature matrices (\(X_s'\)/\(X_v'\)). Inspired by [18], we then apply element-wise max-pooling over them. The pooled features \(X''_s\) and \(X''_v\) are the final semantic and visual node representations enhanced by logical and spatial relations, respectively. Finally, we concatenate the semantic and visual features of each document element, yielding relational-information-enriched multi-modal object representations \(O_1, O_2,..., O_N\).
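A minimal sketch of this relational-information learning step is given below: a plain dense two-layer GCN over a page's adjacency matrix, trained on node categories, followed by max-pooling fusion of the initial and graph-updated features. The dense GCN formulation, hidden sizes, and the number of node classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of a two-layer GCN over a page's adjacency matrix, whose first-layer
# hidden states serve as the relation-augmented node features, followed by
# element-wise max-pooling fusion with the initial features.

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hid_dim=768, num_classes=7):   # 7 element categories (assumption)
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, num_classes)              # trained to predict node categories

    def forward(self, x, adj):
        # Symmetrically normalised adjacency with self-loops: D^-1/2 (A+I) D^-1/2
        a = adj + torch.eye(adj.size(0))
        d = a.sum(-1).clamp(min=1).pow(-0.5)
        a_norm = d.unsqueeze(1) * a * d.unsqueeze(0)
        h1 = torch.relu(a_norm @ self.w1(x))                    # first-layer hidden states (kept as X')
        logits = a_norm @ self.w2(h1)
        return h1, logits

def fuse(x_init, x_updated, proj_init, proj_updated):
    """Element-wise max-pooling of linearly transformed initial/updated features."""
    return torch.maximum(proj_init(x_init), proj_updated(x_updated))

# Usage sketch: N elements with 768-d semantic features on the LR graph.
N = 12
x_s = torch.randn(N, 768)
adj_lr = (torch.rand(N, N) > 0.8).float()                       # binary parent-child edges
gcn_l = TwoLayerGCN(in_dim=768)
x_s_updated, node_logits = gcn_l(x_s, adj_lr)
x_s_final = fuse(x_s, x_s_updated, nn.Linear(768, 768), nn.Linear(768, 768))
```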

QA Prediction: We sum the object features \(O_1, O_2,..., O_N\) with positional embeddings to integrate the ordering information of the document elements, and input them into multiple transformer encoder layers together with the sequence of question word features \(q_1, q_2,...,q_T\). We pass the encoder outputs into transformer decoders and apply a pointer network upon the decoder outputs to predict the answers. We apply a one-step decoding process each time, using the word embedding \(w_{i}\) of one answer from the fixed answer space as the decoder input. Let \(z_{i}^{dec}\) be the decoder output for the decoder input \(w_{i}\); we then compute the score \(y_{i}\) between \(z_{i}^{dec}\) and the answer word embedding \(w_{i}\) as \(y_{i} = \left( w_{i} \right) ^{T}z_{i}^{dec} + b_{i}^{dec}\), where \(i = 1,..., C\) and C is the total number of answers in the fixed answer space for each task. We apply a softmax function over all the scores \(y_{1},...,y_{C}\) and choose the answer word with the highest probability as the final answer for the current image-question pair. We treat Task B and C as the same classification problem as Task A, where the answers are fixed to 25 document element index numbers for Task B and 400 document element index numbers for Task C. The index numbers of document elements start from 0 and increase following the human reading order (i.e. top to bottom, left to right) over a single document page (for Task B) and across multiple document pages (for Task C). For the final retrieved answers to Task B and C questions, OCR tokens are extracted from the document element with the corresponding predicted index number. For Task C questions with multiple answers, we use the Sigmoid function and select all the document elements whose probability exceeds 0.5.
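The answer-scoring step can be sketched as follows; the decoder is abstracted as a single callable, and the multi-label branch for Task C reflects the sigmoid thresholding described above. Names and shapes are illustrative assumptions.

```python
import torch

# Illustrative sketch of the answer-scoring step: each candidate answer's word
# embedding w_i is used as the decoder input; its score is the dot product of
# w_i with the decoder output z_i^dec plus a bias, then softmax (or sigmoid
# thresholding for multi-answer Task C questions) selects the answer(s).

def predict_answer(decoder_step, answer_embeddings, bias, multi_label=False):
    """answer_embeddings: (C, d) word embeddings of the fixed answer space."""
    scores = []
    for i, w_i in enumerate(answer_embeddings):
        z_i = decoder_step(w_i.unsqueeze(0)).squeeze(0)   # one-step decoding per answer
        scores.append(w_i @ z_i + bias[i])                # y_i = w_i^T z_i^dec + b_i
    y = torch.stack(scores)
    if multi_label:                                       # Task C: multiple answers
        return (torch.sigmoid(y) > 0.5).nonzero(as_tuple=True)[0]
    return torch.softmax(y, dim=0).argmax()               # single best answer
```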

7 Experiments

7.1 Performance Comparison

We compare the performances of the baseline models and our proposed relational-information-enhanced model over the three tasks of our PDF-VQA dataset in Table 5. All the models process the questions in the same way, as sequences of question words encoded by pretrained BERT, but differ in how they process the other features. The three large vision-and-language pretrained models (VLPMs), VisualBERT, ViLT and LXMERT, achieved better performances than the other baselines while inputting only question and visual features. The better performance of VisualBERT over ViLT indicates that object-level visual features are more effective than image patch representations on PDF-VQA images with segmented document elements. Among these three models, LXMERT, which uses the same object-level visual features plus additional bounding box features, achieved the best results on Task A and B, indicating the effectiveness of bounding box information for the PDF-VQA task. However, its performance on Task C is lower than VisualBERT’s. This might be because Task C inputs the sequence of objects (document elements) from multiple pages; the bounding box coordinates are independent on each page and therefore introduce noise during training. Surprisingly, LayoutLMv2, pretrained on document understanding datasets, achieved much lower accuracy than the three VLPMs. This might be because LayoutLMv2 uses token-level visual and bounding box features, which are ineffective for identifying whole document elements. Compared to LayoutLMv2 with its token-level contextual features, M4C, a non-pretrained model that inputs object-level bounding box, visual, and contextual features, achieved higher performance. Such results further indicate that object-level features are more effective for our PDF-VQA tasks. The object-level contextual features of each document element are represented as the [CLS] hidden states of a pretrained BERT model fed the OCR token sequence extracted from each document element.

Our proposed LoSpa achieves the highest performance compared to all baselines, demonstrating the effectiveness of our GCN-encoded relational features. Overall, all models perform best on Task A among the three tasks due to its relatively simple questions about object recognition and counting. The performances of all the models naturally drop on Task B, where contextual and structural understanding are required simultaneously. Performances on Task C are the lowest for all models, indicating the difficulty of document-level questions and leaving substantial room for improvement for future research on this task.

Table 5. Performance Comparison over Task A, B, and C. Acronym of feature aspects: Q: Question features; B: Bounding box coordinates; V: Visual appearance features; C: Contextual features; R: Relational Information.
Table 6. Validating the effectiveness of proposed logical-relation (LR) and spatial-relation (SR) based graphs.

7.2 Relational Information Validation

To further demonstrate the influence of relational information on document VQA tasks, we perform ablation studies on each task, as shown in Table 6. For all three tasks, adding both aspects of relational information effectively improves the performance of our LoSpa model. The spatial relation (SR) enhancement makes the models more robust on all three tasks. The logical relation (LR) enhancement leads to more apparent improvements on Task B, since Task B involves more questions that require a more comprehensive understanding of the document structure. Moreover, since the graph representations of the two relation features are trained on the training set, most test set performances are lower than the validation set performances during the QA prediction stage.

Table 7. Task A, B and C performance on different question types. As for the overall performance reported previously, the metric for Task A/B is F1 and for Task C is Accuracy.

7.3 Breakdown Results

We conduct a breakdown performance comparison over the different question types of each task, as shown in Table 7. Generally, all models perform slightly better on Existence/Structural Understanding/Parent Relation Understanding questions than on Counting/Object Recognition/Child Relation Understanding questions in Task A, B, and C, respectively, due to their larger numbers of training questions. Overall, all models perform stably across the different question types of each task and follow the same performance trend as on all questions in Table 5. However, M4C’s performance on Object Recognition is much lower than on Structural Understanding questions. This indicates that M4C is better at recognizing contexts and identifying the semantic structures between document elements, but lacks the capacity to identify the queried elements and the related semantic elements simultaneously. Also, LXMERT performs much better on Parent Relation Understanding questions than on Child Relation Understanding questions. This is because answers to parent questions are normally located on the same page as the queried elements, whereas answers to child questions are normally distributed over several pages, which is impacted by the page-independent bounding box coordinates. The stable performance of M4C over the two question types of Task C also indicates that using contextual features alleviates this issue. Our LoSpa, incorporating relational information between document elements, achieves stable performances over both question types in Task C.

8 Conclusion

We proposed a new document-based VQA dataset to comprehensively examine document understanding conditioned on natural language questions. In addition to contextual understanding and information retrieval, our dataset questions specifically emphasize the importance of document structural layout understanding for comprehensive document understanding. This is also the first dataset to introduce document-level questions that boost document understanding to the full document level rather than being limited to a single page. We enriched our dataset by providing a logical relational graph and a spatial relational graph that explicitly annotate the different relationship types between document elements. We showed that integrating such graph information enables our model to outperform all the baselines. We hope our PDF-VQA dataset will be a useful resource for the next generation of document-based VQA models with entire multi-page document-level understanding and a deeper semantic understanding of vision and language.