
1 Introduction

Document digitization plays a critical role in the automatic retrieval and management of document information. Most of these documents are still processed manually, incurring billions in labor costs across industry every year. Thus, research on automatic document image classification has great practical value. The document image classification task attempts to predict the type of a document image by analyzing the document’s appearance, layout, and content representation. Traditional solutions to this challenge mainly include image-based classification methods and text-based classification methods. The former tries to extract patterns in the pixels of the image, such as shapes or textures, and match them to a specific category. The latter tries to understand the text printed in the document and associate it with its corresponding class.

Fig. 1. Model size vs. classification accuracy on the RVL-CDIP dataset. LayoutLMv2 [1] is currently the state-of-the-art method. However, this model has far more parameters (426M) and requires tens of millions of samples for pre-training to achieve its best accuracy.

However, in real-business applications, the same kind of document often contains different layouts. This intra-class difference makes visual-based classification difficult and rules out rigid feature detection and matching methods such as SURF [2], SIFT [3], and ORB [4]. In addition, different kinds of documents sometimes show high visual similarity, which further increases the difficulty of classification. For example, some news articles contain tables and figures, making them look like scientific publications. Therefore, it is difficult for purely visual methods, including CNNs, to classify document images with ideal accuracy.

Judging from the text content, however, such documents have a similar structure: the address and date usually appear at the top, and the signature usually appears at the bottom. Making full use of the information in document images, including visual, positional, and textual features, can therefore improve classification accuracy. In recent years, researchers have started to use graph concepts, including the GCN [5], for graph node classification and link prediction tasks, exploiting the feature aggregation capability of GCNs. We therefore propose a framework based on the GCN architecture that makes full use of the multimodal characteristics of the document image. The model incorporates three types of input features: (1) compact image feature representations for the slice of each text block and for the whole document image; (2) textual features from the text content of each text block; and (3) positional features denoting the positions of texts within the document image. By doing so, the model can aggregate the visual and textual features in the document image, and the accuracy of document image classification can be effectively improved.

To sum up, the contributions of this work are threefold:

  (1) A one-step, end-to-end approach is developed to handle document image classification tasks with a single GCN-based model. The model possesses great scalability to take on this task across various document images with complex layouts.

  (2) The model uses the concept of graphs to classify documents and innovatively proposes a method for constructing node features that combines visual, positional, and textual features, which greatly improves model performance with a smaller parameter size and achieves the best accuracy-speed trade-off, as shown in Fig. 1.

  (3) In practical applications, the model can be trained from scratch and does not require large-scale pre-training.

2 Related Work

Document image classification tasks were generally solved using semantic-based methods in the past, and Bag of Words (BOW)-based methods in particular have shown great success in document image classification [6, 7]. However, the primary mechanism of BOW-based approaches is to compute frequency information over a word dictionary, ignoring the layout and position information of the document image components, which limits their ability to describe document images.

With the development of deep learning in various fields of computer vision and natural language processing, such as object recognition and scene analysis, deep learning methods show better performance than traditional methods. Some scholars apply deep CNNs to document image classification and achieve satisfactory performance. Le Kang et al. are the first to use a CNN to classify document images [8]. Their results prove that the performance of the CNN is better than that of traditional methods. Later, Afzal et al. propose a deeper neural network [9], pre-train it on the ImageNet dataset [10], and then perform transfer learning on the document image dataset. They obtain better results on the same document image classification dataset, with a 12.25% improvement in accuracy. Their experiments show that training a CNN requires a large amount of data and that transfer learning techniques are practical and feasible. However, CNN-based models can only handle visually distinct documents, and their performance is deficient on visually similar documents.

To classify document images from their content, some researchers combine Optical Character Recognition (OCR) [11,12,13,14] with Natural Language Processing (NLP) [15]. These methods can deal with visually similar documents well but do not make full use of the visual information of the document images. Moreover, document images usually contain defects, including rotation, skew, distortion, scanning noise, etc. All of these pose significant challenges to the OCR system and directly affect the subsequent NLP modules. Although enormous effort has been devoted, OCR + NLP approaches still fall short of satisfactory performance for the above reasons.

Recently, some researchers have noticed that the classification of complex document images requires multi-modal feature fusion. For example, LayoutLMv2 [1] combines textual, visual, and positional information for the document classification task, achieving state-of-the-art performance. Still, it needs a large number of parameters (426M) and tens of millions of pre-training samples to achieve its optimal result.

3 Proposed Approach

We propose a document image classification framework that constructs a graph representation for each document image; the overall architecture is shown in Fig. 2. The first CNN sub-module (CNN1) is responsible for extracting the whole image’s visual features. For each OCR text block, the second CNN sub-module (CNN2) extracts local-aware visual features from the image slice of the block. Textual features are extracted from text contents by a Tokenize-Embedding-GRU (Gated Recurrent Unit) pipeline. Positional features are extracted from text block coordinates by a Fully Connected layer (FC1). The GCN sub-module is designed to fuse and update the above visual, textual, and positional features and to extract a graph representation for the document image. Finally, the graph representation is passed to a Fully Connected layer (the classification layer, FC3 in Fig. 2) to obtain the specific category of the document image.

The input of the model includes four parts from the document image: (1) the full image of the document; (2) the image slices of each text block; (3) the text contents of each text block; and (4) the coordinates of each text block. In practice, the text block information is generated by an off-the-shelf OCR system, from which we obtain the text content and the coordinates of the four vertices of each text block. Each text block from the OCR results is taken as one graph node. Based on this information, an innovative graph node feature construction method is proposed, which combines the full-image feature with the features of each text block.

Fig. 2. Overall design of the proposed model. The model employs CNN1 and CNN2 as backbone networks for extracting the full-image visual features and the local-aware visual features. The embedding layer is responsible for converting text information into textual features. FC1 converts the position vector into the positional feature. The GCN sub-module is designed to fuse and update node features and extract a graph representation for the document image. FC1, FC2, and FC3 are Fully Connected layers.

3.1 Graph Node Feature Extraction

Node features of the graph are constructed from two parts. They are full input image features and text block features, where the text block features include text image features, text content features, and text position features.

The whole-image features are extracted by a CNN sub-module (CNN1 in Fig. 2). In our experiments, we try different CNN backbones, including ResNet50 and VGG19. For these backbones, the final Fully Connected layer is removed, and the size of the Adaptive Average Pooling layer is changed to 7\(\times \)7. The full document image is resized to a fixed size and then passed to this module to obtain a 7\(\times {7}\times {C}\) feature map, where C is the number of image feature channels. This feature map is then split into 7\(\times \)7 parts along the x-direction and y-direction, yielding 49 feature parts of size 1\(\times \)1\(\times {C}\) along the channel direction. Finally, each feature part is squeezed and taken as one node feature of the graph. From a computer vision point of view, this is similar to dividing the original image into 7\(\times \)7 sections and extracting a node feature with the CNN for each section.
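As an illustration of this splitting step, the following PyTorch-style sketch (a minimal example under our own assumptions, not the authors' implementation) shows how a ResNet50 backbone with its FC layer removed and a 7\(\times \)7 adaptive pooling layer produces a feature map that is reshaped into 49 per-cell node features.

```python
import torch.nn as nn
import torchvision.models as models

class GlobalImageNodes(nn.Module):
    """Sketch of CNN1: encode the full document image into a 7x7xC feature map
    and split it into 49 node feature vectors (one per grid cell).
    Backbone choice follows the paper; exact layer wiring is an assumption."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50()                       # randomly initialized ResNet50
        # Drop the final FC layer and the default global pooling, ...
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # ... then pool to a fixed 7x7 spatial grid.
        self.pool = nn.AdaptiveAvgPool2d((7, 7))

    def forward(self, image):                              # image: (B, 3, H, W)
        fmap = self.pool(self.features(image))             # (B, C, 7, 7), C = 2048
        # Flatten the 7x7 grid so each of the 49 cells becomes one graph node.
        return fmap.flatten(2).transpose(1, 2)             # (B, 49, C)
```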

The first 49 node features are prepared from the full input image's CNN features as described above; the next step is to prepare node features from each OCR text block. The image slice features of each text block are extracted by another CNN sub-module (CNN2). Similar to CNN1, we choose ResNet34 and VGG16 as CNN2 backbones in different experimental setups. The difference between CNN2 and CNN1 is that, after removing the last Fully Connected layer, the size of the Adaptive Average Pooling layer of CNN2 is 1\(\times \)1. Thus, the size of the visual feature generated by CNN2 for each text block is 1\(\times \)1\(\times {C}\).

To prepare the textual feature of each text block, we pad or truncate the text content to a fixed length of 16 words. Then, the BERT WordPiece tokenizer is used to convert the text into token id indexes. Unlike BERT [16] training, the [CLS] and [SEP] tokens are removed. An embedding layer converts these id indexes into 64-d features. Finally, each line of text is transformed into a 128-d textual feature by a 128-unit GRU layer.
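The following sketch illustrates this textual branch under the stated settings (16-token length, 64-d embedding, 128-unit GRU); the Hugging Face tokenizer checkpoint "bert-base-uncased" and the batching details are our assumptions rather than the authors' exact implementation.

```python
import torch.nn as nn
from transformers import BertTokenizerFast

class TextBlockEncoder(nn.Module):
    """Sketch of the textual branch: tokenize each text block with a BERT
    WordPiece tokenizer (no [CLS]/[SEP]), pad/truncate to 16 tokens, embed
    into 64-d vectors, and summarize with a 128-unit GRU."""
    def __init__(self, vocab_size=30522, max_len=16):      # 30522 = bert-base-uncased vocab
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        self.max_len = max_len
        self.embedding = nn.Embedding(vocab_size, 64)
        self.gru = nn.GRU(64, 128, batch_first=True)

    def forward(self, texts):                               # texts: list of n block strings
        ids = self.tokenizer(texts,
                             add_special_tokens=False,      # drop [CLS]/[SEP]
                             padding="max_length",
                             truncation=True,
                             max_length=self.max_len,
                             return_tensors="pt")["input_ids"]
        emb = self.embedding(ids)                            # (n, 16, 64)
        _, h = self.gru(emb)                                 # final hidden state: (1, n, 128)
        return h.squeeze(0)                                  # (n, 128) textual features
```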

The positional information of each text block is obtained from the coordinates of its four vertices. Each coordinate is composed of two values in the x-direction and y-direction. Therefore, an 8-d position vector is constructed for each text block and then transformed into a 128-d feature vector by a Fully Connected layer (FC1).

For each OCR text block, the visual, textual, and positional features are prepared by the above steps. Next, they are concatenated and passed to a Fully Connected layer (FC2) to obtain the final node feature vector. With this setting, we obtain n node features if there are n OCR text blocks. As previously introduced, 49 node features have been prepared from CNN1, so the graph representation of the input image has 49+n nodes, as sketched below.
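The sketch below shows one plausible way to assemble these node features; the dimensions not stated in the paper (the CNN2 channel size and the assumption that FC2 projects to the CNN1 channel size so that all 49+n nodes share one feature dimension) are our own choices.

```python
import torch
import torch.nn as nn

class NodeFeatureBuilder(nn.Module):
    """Sketch of the node feature construction: for each of the n text blocks,
    concatenate the CNN2 slice feature, the 128-d textual feature, and the
    128-d positional feature (FC1 over the 8 vertex coordinates), project with
    FC2, and stack the result under the 49 global-image nodes."""
    def __init__(self, cnn1_dim=2048, cnn2_dim=512):         # assumed channel sizes
        super().__init__()
        self.fc1 = nn.Linear(8, 128)                          # positional feature
        self.fc2 = nn.Linear(cnn2_dim + 128 + 128, cnn1_dim)  # final node feature

    def forward(self, global_nodes, slice_feats, text_feats, boxes):
        # global_nodes: (49, cnn1_dim)   slice_feats: (n, cnn2_dim)
        # text_feats:   (n, 128)         boxes: (n, 8) flattened vertex coordinates
        pos_feats = self.fc1(boxes)                                               # (n, 128)
        block_nodes = self.fc2(torch.cat([slice_feats, text_feats, pos_feats],
                                         dim=-1))                                 # (n, cnn1_dim)
        return torch.cat([global_nodes, block_nodes], dim=0)                      # (49 + n, cnn1_dim)
```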

3.2 Graph Convolutional Network Module

Unlike CNNs, which perform convolution operations in a regular Euclidean space such as a two-dimensional matrix, GCNs extend the convolution operation to non-Euclidean data with a graph structure. A GCN takes the graph structure and node features as input, obtains new node representations by performing graph convolution operations over the neighbors of each node, and then pools all nodes to represent the entire graph.

A multi-layer GCN is defined by the following layer-wise propagation rule [5]:

$$\begin{aligned} H^{(l+1)}=\sigma (\widetilde{D}^{-1/2}\widetilde{A}\widetilde{D}^{-1/2}H^{(l)}W^{(l)}) \end{aligned}$$
(1)

Here, \(\widetilde{A}=A+I_N\) is the adjacency matrix with added self-connections, \(\widetilde{D}\) is its degree matrix, \(H^{(l)}\) is the node feature matrix of layer l (with \(H^{(0)}=X\)), \(W^{(l)}\) is a trainable weight matrix, and \(\sigma \) is an activation function. Therefore, as long as the input feature X and the adjacency matrix A are known, the updated node features can be calculated. In our model, the input feature X consists of the 49+n node features. Since the graph in our model is a fully connected graph, every two nodes are connected, so the adjacency matrix A is an \({N}\times {N}\) all-ones matrix. We build a GCN module with two graph convolutional layers, as shown in Fig. 3. Each graph convolutional layer is followed by a SiLU activation function. The graph is defined over the fully connected N nodes and initialized with the node features prepared by the above steps. States and features are propagated across the entire graph by the two graph convolutional layers. The final node states form an \({N}\times \)512 matrix, which is averaged into a 1\(\times \)512 vector, the graph representation of the input data. Finally, this 512-d enriched graph representation is passed to a 512\(\times {k}\) FC layer (FC3 in Fig. 2), where k is the number of document image classes.
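A minimal sketch of such a two-layer GCN head is given below, applying Eq. (1) to the all-ones adjacency; the hidden width of the first layer and the plain matrix-multiplication implementation are assumptions, while the SiLU activations, mean pooling, and FC3 classifier follow the description above.

```python
import torch
import torch.nn as nn

class GCNClassifier(nn.Module):
    """Sketch of the GCN sub-module: two graph convolutions over a fully
    connected graph, each followed by SiLU, then mean pooling and FC3."""
    def __init__(self, in_dim=2048, hidden_dim=512, num_classes=16):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)   # W^(0)
        self.w2 = nn.Linear(hidden_dim, 512, bias=False)       # W^(1), 512-d node states
        self.fc3 = nn.Linear(512, num_classes)
        self.act = nn.SiLU()

    @staticmethod
    def normalized_adjacency(num_nodes):
        # Fully connected graph: the all-ones adjacency already contains
        # self-connections, so only the symmetric normalization D^{-1/2} A D^{-1/2} is applied.
        a_tilde = torch.ones(num_nodes, num_nodes)
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)

    def forward(self, x):                                     # x: (N, in_dim), N = 49 + n
        a_hat = self.normalized_adjacency(x.size(0)).to(x.device)
        h = self.act(self.w1(a_hat @ x))                      # first graph convolution
        h = self.act(self.w2(a_hat @ h))                      # second graph convolution, (N, 512)
        g = h.mean(dim=0)                                     # graph representation, (512,)
        return self.fc3(g)                                    # class logits, (k,)
```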

Fig. 3. Schematic depiction of the multi-layer graph convolution aggregating node features. The model’s input includes a graph definition with a total of N nodes and the node features.

4 Experiments

4.1 Datasets Description

The model is applied to the document image classification task on the Medical Insurance Document Image (MIDI) dataset and the Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset [17].

The MIDI Dataset. This dataset contains scanned images and photographs collected from a real business system. It has a total of 160,000 images in 20 categories; sample images are shown in Fig. 4. We split these images into 120,000 training images, 20,000 validation images, and 20,000 testing images. The images are collected from various provinces and cities in China. This dataset is characterized by significant intra-class differences and slight inter-class differences.

Fig. 4. Sample images from the MIDI dataset. From left to right: Claim form, Personal information form, Medical invoice, Medical imaging report, Claim notice.

The RVL-CDIP Dataset. This dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 testing images. The images are resized to a maximum length of 1000 pixels. Some sample images of this dataset can be seen in Fig. 5.

Fig. 5. Sample images from the RVL-CDIP dataset. From left to right: Letter, Form, Email, Handwritten, Advertisement, Scientific report, Scientific publication, Specification, File folder, News article, Budget, Invoice, Presentation, Questionnaire, Resume, Memo.

4.2 Model Training and Evaluation

For each experiment, a trainable end-to-end pipeline was built according to Fig. 2, and the output of the pipeline was the enriched feature of the original image. After the final classifier (FC3), the category to which the image belongs was predicted. To test the impact of different visual backbones (CNN1 and CNN2 in Fig. 2) on model performance, we tested ResNet50 and VGG19 for CNN1 and ResNet34 and VGG16 for CNN2. To compare our model with CNN-based visual models, we also tested the performance of VGG16 and ResNet50 on the MIDI dataset. For the RVL-CDIP dataset, we followed the same model and hyper-parameter setups as in the MIDI experiments.

The number of training epochs was set to 20 for all experiments, with gradient accumulation used to ensure stable convergence of the model. All models were trained on an NVIDIA Tesla V100 machine, using the Cross-Entropy loss function and the AdamW optimizer. The maximum learning rate was set to 8e-5 with a cosine learning rate scheduler. In addition, the number of learning rate warm-up steps was set to 50,000 for the RVL-CDIP dataset and 10,000 for the MIDI dataset. During training, all input data were shuffled at the beginning of each epoch. A condensed sketch of this setup is given below.
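In the sketch, build_model and train_loader are hypothetical placeholders, and the batch size, accumulation steps, and total step count are assumptions; the optimizer, loss, learning rate, scheduler, and warm-up follow the setup described above.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

ACCUM_STEPS, WARMUP_STEPS, TOTAL_STEPS = 4, 10_000, 200_000   # assumed values (MIDI-style warm-up)

model = build_model()                       # hypothetical constructor for the pipeline in Fig. 2
criterion = torch.nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=8e-5)
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=WARMUP_STEPS,
                                            num_training_steps=TOTAL_STEPS)

for step, (inputs, labels) in enumerate(train_loader):         # train_loader is assumed
    loss = criterion(model(inputs), labels) / ACCUM_STEPS      # scale loss for accumulation
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:                          # update every ACCUM_STEPS batches
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```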

5 Results and Discussion

On the MIDI dataset, the classification accuracies of the proposed models and the CNN-based models are shown in Table 1. The results suggest that our models with different backbone setups significantly surpass the CNN-based methods. The experiments reach a best classification accuracy of 99.10%, a 6.58% and 5.71% improvement over the CNN-based VGG16 and ResNet50, respectively. This outstanding performance means that the proposed models can be directly used in industrial applications, since this dataset is an actual business dataset. The proposed models also have far fewer parameters than the VGG16 because we removed its parameter-heavy FC layers.

Table 1. Classification accuracy of different models on the MIDI dataset.

Table 2 compares our model with VGG16, ResNet50, and other models, including text-only and image-only models, on the RVL-CDIP dataset. The table shows that the proposed model outperforms the text-only and image-only models as it leverages the multi-modal information within the documents. The proposed model uses the fewest parameters yet shows the best classification accuracy.

It is worth noting that although the RVL-CDIP dataset is larger than the MIDI dataset, the classification accuracy on the MIDI dataset is higher than on the RVL-CDIP dataset under the same model and training setups, owing to the MIDI dataset's higher image resolution, higher OCR character recognition accuracy, and color images. The OCR engine in our experiments is a general multi-language engine that is not specially optimized for English data. Thus, the OCR character recognition accuracy is unsatisfactory due to the lack of engine optimization for English and the low pixel resolution of the text in several images.

Table 2. Comparison of accuracies on RVL-CDIP of best models from other papers.
Fig. 6. Confusion matrix of the proposed model on the RVL-CDIP dataset.

Figure 6 reports the confusion matrix of the proposed model on the RVL-CDIP dataset. It shows that the proposed model performs very well on most categories of images. However, the classification accuracy for three categories, namely form, scientific report, and presentation, is less than 90%. This is because the definitions of these three categories overlap. For example, some pages of scientific reports contain data forms, which causes them to be labeled as the “form” category.

6 Conclusion

This paper presents a document image classification framework based on GCN. We propose a novel multi-modal graph node feature construction method that combines the visual, textual, and positional features of each text block in the image with the visual features of the full document image. All of these make the feature expression richer. By propagating information through the GCN, meaningful features are enriched for classification. Experiments were carried out on the MIDI dataset and the RVL-CDIP dataset. The proposed model obtained classification accuracies of 99.10% and 93.45% on the two datasets, respectively, which is superior to CNN-based algorithms. The experimental data show that our model is effective and efficient. Moreover, our end-to-end pipeline does not require handcrafted features or large-scale pre-training as other works do.

In our experiments, the OCR engine available to us is not optimized for English data. The lower gain on the RVL-CDIP dataset is directly affected by the high OCR recognition error rate and the low image resolution of several images. Therefore, we will look for commercial OCR systems better suited to English text recognition to tackle this problem. We also plan to add more features to the GCN model, such as learning the relationships between text blocks, to make full use of the capabilities of the GCN and the various information in the document image.