1 Introduction

Deep learning has provided compelling results in various document understanding problems such as document retrieval, information extraction, and document image classification. Owing to its impressive performance across a large number of tasks, this area has been explored extensively, and existing works cover several techniques, including document binarization [3, 45], layout analysis [44, 51], and structural similarity constraints [14], for many document analysis tasks. However, to ensure good generalization, deep neural networks with large numbers of parameters have been used for document image classification in order to extract the most relevant visual features [36].

Fig. 1

Samples of different document classes in the RVL-CDIP dataset. From left to right: Advertisement, Budget, Email, File folder, Form, Handwritten, Invoice, Letter, Memo, News article, Presentation, Questionnaire, Resume, Scientific publication, Scientific report, Specification

Unlike the general images in the ImageNet dataset [50], document images have a distinct visual style (Fig. 1). Therefore, numerous studies on document processing tasks have used transfer learning, which has been shown to be effective in boosting the classification performance on document images [1, 16, 25], whereas randomly initialized networks underperform [29]. Additionally, from the perspective of a natural language processing classifier, document images can be categorized into various classes based on their textual content extracted by an optical character recognition (OCR) system [48, 57]. Yang et al. [65] presented a neural network that extracts semantically enriched information from textual content based on a word embedding mechanism. Appiani et al. [5] described a system that exploits a structural analysis approach to characterize and automatically index heterogeneous documents with variable layouts, determining the class of a document image based on reliable automatic information extraction methods.

Nevertheless, the challenge of document images lies in their wide visual variability: documents from the same category may have different spatial properties. Because of this particular visual style, relying on deep convolutional networks to extract visual properties for document image classification may fail to distinguish between highly correlated classes. The intra-class variability of document images can even exceed the inter-class variability, so that two or more document images of different categories may be closer, both visually and in terms of their textual content, than two or more documents from the same category. This intra-class variability can be mitigated by introducing the latent semantic information of the text corpus contained in the document image. Once the visual features of the image modality and the textual features of the text modality are extracted, they are fed to a multi-modal network that combines both feature vectors into a single feature vector through a feature fusion methodology [7, 17, 43]. Typically, multi-modal methods for document image classification rely on image and text modalities. They contain two, or an ensemble of, deep networks pre-trained on large-scale datasets to extract discriminative features from the input data. With such approaches, the learning process of the image modality and that of the text modality remain independent of one another, and the output features of both modalities are subsequently combined to form an ensemble trainable document image classification network [6, 21, 61, 62]. Yet, these independent learning approaches might be enhanced if the visual and textual features shared some mutual information.

In this paper, we propose an ensemble trainable network with a mutual learning strategy based on a new regularization term, which models the interaction between the visual and textual features learned by the image and text modalities throughout the training stage. The conventional mutual learning strategy encourages collaborative learning between modalities, allowing the image and text modalities to learn their discriminant features simultaneously and mutually. The aim of this approach is to enable the modality currently being trained to mimic the other modality by minimizing the difference between the class probabilities produced by the image modality and those produced by the text modality. However, rather than following the conventional distillation-based teacher–student approach with one-way knowledge transfer from a pre-trained teacher to a student [27], conventional mutual learning starts with a pool of untrained students in a student-to-student peer-teaching setting and learns to solve the task collaboratively [72]. It turns out that conventional mutual learning achieves better results than independent learning, whether in a supervised setting or with conventional distillation from a larger pre-trained teacher. Nonetheless, conventional mutual learning is a bidirectional knowledge transfer method: the current modality can learn from a better-performing peer, but the better modality also learns from the worse one. That is to say, if the other student is worse than the current student, negative knowledge is introduced and may weaken the ongoing training, which violates the motivation of conventional mutual learning. Thus, we introduce a mutual learning approach based on a truncated Kullback–Leibler divergence regularization term (Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)). This approach enables the current modality to learn only the positive knowledge from the other modality and prevents negative knowledge from being introduced into the ongoing training of the current modality. The proposed collaborative mutual learning approach with regularization improves the quality of the final predictions of the single-modal and multi-modal branches and overcomes the drawback of conventional mutual learning trained with the standard Kullback–Leibler divergence (KLD).

Furthermore, since one of the goals of this paper is to combine image and text features through a better multi-modal feature fusion methodology, we introduce a self-attention-based feature fusion module that serves as a middle block in our ensemble trainable network. Through this module, we aim to simultaneously extract more powerful and meaningful features from the different middle blocks of the image and text modalities. This approach allows the network to focus on the salient parts of the feature maps of each modality and to capture relevant semantic information between pairs of image regions and text words. Such self-attention-based modules have recently become an essential component of many multi-modal tasks such as visual question answering, image captioning, and image–text matching [31, 37, 42, 63]. Moreover, we adopt an early average ensemble fusion scheme in the final model to ensure a more stable and better-performing solution for the task of document image classification.

This work builds on our previous works on multi-modal networks for document image classification [10, 11]. In the rest of the paper, we denote mutual learning trained with the standard KLD as \(\hbox {ML}_{{\mathrm{KLD}}}\), mutual learning trained with the proposed regularization as ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\), and ensemble self-attention-based mutual learning with regularization as EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\). The main contributions of this paper are as follows:

  • We introduce a mutual learning strategy with a regularization term to overcome the drawback of conventional mutual learning. This approach allows the current modality to learn positive knowledge from the other modality rather than negative knowledge, which would weaken the learning capacity of the modality being trained.

  • We present a self-attention-based feature fusion module for a better multi-modal feature extraction to perform fine-grained document image classification. Our proposed self-attention module enhances the overall accuracy of the ensemble network and achieves state-of-the-art classification performance compared to single-modal and multi-modal learning methods.

  • We perform a comprehensive ablation study on the benchmark RVL-CDIP and Tobacco-3482 datasets to analyze the effectiveness of our proposed ensemble trainable network with/without the mutual learning approach, and with/without the self-attention-based feature fusion module.

  • We evaluate the performance and the generalization ability of the proposed ensemble network through inter-dataset and intra-dataset evaluation on the benchmark RVL-CDIP and Tobacco-3482 datasets for the single-modal and multi-modal fusion modalities.

The remainder of this paper is organized as follows. Section 2 reviews the related work and Sect. 3 introduces the proposed network architecture. Section 4 then provides the details of our proposed method. We describe the experimental setup in Sect. 5 and present the experiments and ablation study in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future work.

2 Related work

2.1 Image embeddings

Over the past few years, a variety of research studies have addressed document image classification. Because each document is organized differently, document images may be classified based on their heterogeneous visual structural properties and/or their textual content. Earlier attempts utilized layout structure to convert printed documents into a complementary logical structure [18]. Region-based analysis techniques have shown notable performance in visually identifying document components, assuming that documents share a particular spatial configuration [12]. Among these, DCNN-based approaches have outperformed handcrafted feature methods for document image classification. Hao et al. [24] proposed a novel method for table detection in PDF documents based on CNNs. Harley et al. [25] proposed an alternative strategy to learn visual features through region-based approaches. Still, many pre-trained DCNN-based approaches such as AlexNet, VGG-16, GoogLeNet, and ResNet-50 [2, 26, 32, 33, 53, 55, 56] have been used along with transfer learning to achieve accurate document image classification results on the RVL-CDIP and Tobacco-3482 datasets.

2.2 Text embeddings

Recently, classifying the textual content extracted from document images has also been investigated. In many natural language processing (NLP) tasks, the representation of words has drawn significant attention. The development from static word embeddings such as Word2Vec and GloVe [40, 46] to contextualized dynamic word embeddings such as ELMo, fastText, XLNet, and BERT [19, 41, 47, 66] has made huge progress in addressing the polysemy problem and the semantic aspects of words. In the meantime, several approaches have handled document image classification by applying optical character recognition (OCR) techniques. Yang et al. [65] combined generated text features with visual features in a fully convolutional neural network. Also, [8, 17] experimented with shallow bag-of-words (BoW) features alongside visual features in a two-modality classifier. Moreover, similar to our approach, Lai et al. [35] presented a hybrid RNN-CNN approach to extract contextual information.

2.3 Multi-modal embeddings

As stated before, documents are natively multi-modal. Multi-modal learning for computer vision and natural language processing has been widely used for image- and text-level understanding problems such as text document image classification, visual question answering [67, 74], image captioning [4], and image–text matching [38]. Most multi-modal fusion and attention learning methods require multi-modal reasoning over inputs represented in a common space, where data related to the same topic of interest tend to appear together. For multi-modal fusion, earlier attempts used naive concatenation, element-wise multiplication, and/or ensemble methods on multi-modal features [23, 52, 64, 71]. More recent works such as [6, 7] introduced multi-modal deep networks that jointly learn visual and textual features through a fusion methodology. Noce et al. [43] proposed an approach that combines OCR and NLP algorithms to extract and manipulate relevant text concepts from document images, which are visually embedded within each document image to improve the classification results of a convolutional neural network. Fukui et al. [22] proposed multi-modal compact bilinear pooling to efficiently and expressively combine multi-modal features. Xu et al. [61, 62] recently proposed a novel architecture to merge textual and layout information for document image classification. Finally, Souhail et al. [10, 11] proposed a multi-modal learning network that jointly learns image and text features through different fusion schemes. The proposed methods showed superior performance compared to the single modalities and achieved state-of-the-art results on the RVL-CDIP and Tobacco-3482 datasets using heavyweight and lightweight deep neural networks along with different static and dynamic word embeddings.

2.4 Self-attention-based fusion embeddings

Attention learning was adopted to attend to the most relevant regions of the input space by assigning different weights to different regions. It was first proposed by Bahdanau et al. [9] for neural machine translation, where the most relevant words for the output often occur at similar positions in the input sequence. Later, Vaswani et al. [58] proposed a self-attention module for machine translation models which achieved state-of-the-art results at the time. The self-attention module was subsequently introduced to guide visual attention over images. For the image modality, self-attention-based modules learn to focus on particular image regions within a given document image [49, 59, 73]. Beyond visual attention modules applied solely to the image modality, recent studies have introduced co-attention models that learn simultaneously from visual and textual attention to benefit from fine-grained representations of both modalities [31, 42]. Wang et al. [60] proposed a position-focused attention network to investigate the relation between the visual and textual views. Chen et al. [13] proposed a question-guided attention map that projects the question embeddings onto the visual space and formulates a configurable convolutional kernel to search the image attention region. Furthermore, some existing works that handle the joint learning of interactions between image and text features use co-attention and self-attention modules [39, 68, 69, 70].

3 Architecture overview

The proposed ensemble deep network (see Fig. 3) is based on a multi-modal architecture, which consists of the image, text, and image/text fusion modalities. The image and text modalities are dedicated to extracting visual features and textual embeddings, respectively. The fusion branch combines the extracted image and text features into multi-modal features. After training the ensemble network, the classification of document images can be conducted by either the image modality or the text modality. Moreover, the learned visual features and text embeddings are fused to conduct document image classification in a multi-modal manner.

3.1 Image modality

The image modality extracts visual features using Inception-ResNet-V2 [54] as a backbone network, a convolutional neural network that achieved state-of-the-art results on the ILSVRC image classification benchmark. The model has 54.36 M parameters.
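A minimal sketch of how such a backbone can be instantiated is given below; the use of PyTorch and the timm library is our assumption, not necessarily the authors' implementation.

```python
import timm
import torch.nn as nn

# Pre-trained Inception-ResNet-V2 used as a feature extractor;
# num_classes=0 strips the original ImageNet classification head.
backbone = timm.create_model("inception_resnet_v2", pretrained=True, num_classes=0)

# 16-way classification head for the RVL-CDIP classes.
image_head = nn.Linear(backbone.num_features, 16)
```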

3.2 Text modality

Furthermore, we process all document images with an off-the-shelf optical character recognition (OCR) system, namely Tesseract OCR, to extract the text from document images. Since the document images from the RVL-CDIP and Tobacco-3482 datasets are well oriented and relatively clean, it is quite straightforward to run the Tesseract OCR engine on such documents. We used this OCR engine to conduct fully automatic page segmentation without orientation or script detection. We analyzed the output of the OCR and found many recognition errors, especially for the classes Handwritten and Notes, due to its inability to recognize handwriting. Besides, the Tesseract OCR engine is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns and may try to join text across columns, which is the case for some samples from the classes ADVE and Scientific, as shown in the qualitative OCR results in Fig. 2. In addition, it may produce poor-quality OCR output as a result of poor-quality scans or the distinct forms of document images, as in the sample from the class News shown in Fig. 2; such documents may contain handwritten text, tables, figures, and multi-column layouts. The embedded features extracted from the generated text corpus are computed with the BERT-base model [19], a contextualized bidirectional word embedding mechanism that jointly conditions word representations on both left and right context in all layers using self-attention.
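As a rough sketch of this text pipeline, the snippet below uses the pytesseract wrapper and the Hugging Face transformers implementation of BERT-base; both library choices, the file name, and the truncation to 512 tokens are our assumptions.

```python
import pytesseract
from PIL import Image
from transformers import BertTokenizer, BertModel

# Fully automatic page segmentation without orientation/script detection (--psm 3).
text = pytesseract.image_to_string(Image.open("document.png"), config="--psm 3")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Truncate to BERT's 512-token limit and take the contextual token embeddings.
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
token_embeddings = bert(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```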

Fig. 2

Sample images and their corresponding OCR results of 9 classes of the Tobacco-3482 dataset that overlap with the RVL-CDIP dataset

Fig. 3

The proposed ensemble self-attention-based mutual learning network (EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\))

3.3 Multi-modal module

After training the image branch and the text branch with the proposed mutual learning approach with regularization (i.e., ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)), we fuse these two branches to simultaneously learn the image and text features they extract. Moreover, we adopt an early fusion methodology (i.e., average ensembling) as in [11], which enhances the global performance of multi-modal networks.

3.4 Self-attention-based fusion module

The proposed self-attention-based fusion module is inspired by the attention modules in the squeeze-and-excitation network [28], which re-weights the channel-wise responses of a given CNN layer using soft self-attention in order to model the inter-dependencies between the channels of the convolutional features. As shown in Fig. 4a, the attention fusion module is used as a middle fusion block in our ensemble trainable network. The intermediate features extracted from the middle blocks of the image branch (e.g., the output of Residual block0) and the text branch (e.g., the output of Transform block0) are passed to the corresponding attention block as its inputs. Channel-wise information is then extracted from the intermediate image or text features by down-sampling with global average pooling and global max pooling layers in the attention blocks (see Fig. 4b). The generated channel-wise features are fed to the self-attention block(s) to compute the attention maps. Specifically, the self-attention maps obtained from the different self-attention blocks are concatenated to form the final self-attention map of the visual attention block. Finally, the self-attention maps obtained from the visual attention block and the text attention block are concatenated to generate the fusion attention map across modalities. The fusion attention map is multiplied by the image and text intermediate features, respectively (i.e., the inputs of the visual and text attention blocks), and the result is fed to the following residual/transform block in the image/text branch (see Fig. 4a).

4 Proposed method

In this section, we detail the proposed multi-modal mutual learning and self-attention-based feature fusion approaches.

4.1 Multi-modal mutual learning

As shown in Fig. 3, the proposed multi-modal mutual learning network consists of three different modalities: the image modality (image branch), the text modality (text branch), and the multi-modal modality (fusion of the image and text modalities).

Consider a training dataset with samples and labels \((x_n, y_n) \in ({\mathcal {X}}, {\mathcal {Y}})\) over a set of K classes \({\mathcal {Y}} \in \{1,2,\ldots ,K\}\). To learn the parametric mapping function \(f_s(x_n) : {\mathcal {X}} \mapsto {\mathcal {Y}}\), we train our ensemble network \(f_s(x_n, {\Theta })\), where \({\Theta }\) denotes the parameters obtained by minimizing the training objective function \({\mathcal {L}}_{\mathrm{train}}\):

$$\begin{aligned} {\Theta } = \underset{\theta }{\arg \min } {\mathcal {L}}_{\mathrm{train}}(y, f_s(x, {\theta })) \end{aligned}$$
(1)

The total training loss of the ensemble network \({\mathcal {L}}_{\mathrm{train}}\) is the sum of the weighted losses of the different modalities, i.e., the image modality loss \({\mathcal {L}}_1\), the text modality loss \({\mathcal {L}}_2\), and the multi-modal fusion (image/text) loss \({\mathcal {L}}_3\). Specifically, \({\mathcal {L}}_1\) and \({\mathcal {L}}_2\) are obtained by mutual learning and are therefore also referred to as the mutual learning losses. Thus, the total loss \({\mathcal {L}}_{\mathrm{train}}\), for a pair \((x_n,y_n)\), is defined as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{train}}({\mathbf {X}}_n;\Theta ) = \sum _{i=1}^{M}w_i{\mathcal {L}}_i({\mathbf {X}}_n^{(i)};\Theta _i) = w_1{\mathcal {L}}_1 + w_2{\mathcal {L}}_2 + w_3{\mathcal {L}}_3 \end{aligned}$$
(2)

where \(M = 3\) is the number of modalities. \({\mathbf {X}}_i\) and \({\Theta _i}\) denote, respectively, the features and the parameters learned by modality i, and \(\Theta =\{\Theta _i\}_{i=1}^{M}\) are the overall network parameters optimized by \({\mathcal {L}}_{\mathrm{train}}\). The weights \(w_i \in [0,1]\), with \(\sum w_i=1\), are hyper-parameters that balance the individual loss terms. Thus, \({\mathbf {X}}_i \in {\mathbb {R}}^{d_i}\), where \(d_i\) is the dimension of the features \({\mathbf {X}}_i\), and \({\mathcal {L}}_i, w_i \in {\mathbb {R}}^1\).
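A minimal sketch of Eq. (2) is given below, assuming the three per-modality losses are already computed; the concrete weight values are illustrative, not those used in the paper.

```python
def total_training_loss(loss_image, loss_text, loss_fusion, w=(0.4, 0.4, 0.2)):
    """Weighted sum of the three modality losses, Eq. (2). The weight values
    shown here are illustrative; the paper only requires w_i in [0, 1]
    with sum(w_i) = 1."""
    w1, w2, w3 = w
    return w1 * loss_image + w2 * loss_text + w3 * loss_fusion
```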

4.1.1 Mutual learning loss

The conventional mutual learning task loss consists of two terms: a supervised learning loss (e.g., the cross-entropy loss) and a mimicry loss (e.g., the Kullback–Leibler divergence (KLD)). The conventional mutual learning setting aims to help the training of the current modality by transferring knowledge between one or an ensemble of modalities in a mutual learning manner as in [72]. However, the knowledge transferred from the other modality through the standard KLD includes both a negative part and a positive part. Therefore, instead of using the standard KLD of the original mutual learning, we propose a truncated KLD loss (Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)) as a new regularization term in the training loss of the current modality, which filters out the negative knowledge learned from the other modality and only keeps the knowledge that is positive for the current modality [see Eq. (5)]. In this work, the cross-entropy loss \({\mathcal {L}}_{s}\) of the modality currently being trained can be written as:

$$\begin{aligned} {\mathcal {L}}_{s}({\mathbf {X}};\Theta ) = \sum ^{K}_{k=1}-y_k\log ({\mathcal {P}}_{s}({\hat{y}}_k|{\mathbf {X}},\theta _k)) \end{aligned}$$
(3)

where the probability \({\mathcal {P}}_{s}\) is the softmax operation given by:

$$\begin{aligned} {\mathcal {P}}_{s}({\mathbf {X}};\theta _k) = \frac{e^{f^{\theta _k}({\mathbf {X}})}}{\sum ^{K}_{k'} e^{f^{\theta _{k'}}({\mathbf {X}})}} \end{aligned}$$
(4)

where K is the number of classes in the dataset, \({y_k}\) is the one-hot label of the feature \({\mathbf {X}}\) of the input sample, and \({\mathcal {P}}_{s}\) is the class probability estimated by the softmax function. The truncated Kullback–Leibler divergence regularization (Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)) loss of the modality currently being trained, \({\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}}\), is given by:

$$\begin{aligned} {\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}}({\mathcal {P}}_{{s}_{2}}\parallel {{\mathcal {P}}_{{s}_{1}}}) = \sum ^{K}_{k=1}{\mathcal {P}}_{{s}_{2}} \max \left\{ 0, \log \left( \frac{{\mathcal {P}}_{{s}_{2}}}{{\mathcal {P}}_{{s}_{1}}}\right) \right\} \end{aligned}$$
(5)

where \({\mathcal {P}}_{{s}_{1}}\) is the class probability estimated by the current modality, while \({\mathcal {P}}_{{s}_{2}}\) refers to the class probability estimated by the other modality. In this way, the mutual learning approach transfers only the positive knowledge from the other modality to the current modality, by adapting conventional mutual learning with the constraint of the mimicry loss \({\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}}\) (i.e., Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)). In the following, \({\mathcal {P}}_{{s}_{1}}\) refers to the class probabilities of the image modality, while \({\mathcal {P}}_{{s}_{2}}\) refers to the class probabilities of the text modality.

  1. (i)

    Image modality setting: for the image modality, the overall loss function \({\mathcal {L}}_1\) is given by:

    $$\begin{aligned} {\mathcal {L}}_1({\mathbf {X}}_1;\Theta _1) = {\mathcal {L}}_{{s}_{1}}({\mathbf {X}}_1;\Theta _1) + \beta {\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}} ({\mathcal {P}}_{{s}_{2}}\parallel {{\mathcal {P}}_{{s}_{1}}}) \end{aligned}$$
    (6)

    where \(\beta = 0.5\) is a hyper-parameter denoting the regularization weight. Conventional mutual learning aims to augment the training capacity of the network by introducing the mimicry loss, which aligns the classification probabilities of the current modality with those of the better-trained peer modality. However, it is not always true that the other/text modality performs better than the current/image modality; in that case, the ongoing training of the current/image modality would be weakened by adding the mimicry loss to the supervised loss (i.e., the cross-entropy loss for document image classification). In contrast, the mutual learning with regularization loss \({\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}}\) encourages the current/image modality to learn only the positive knowledge from the other/text modality and, thus, prevents negative knowledge from being introduced into the ongoing training of the current/image modality.

  2. (ii)

    Text modality setting: for the text modality, the overall loss function \({\mathcal {L}}_2\) can be written as:

    $$\begin{aligned} {\mathcal {L}}_2({\mathbf {X}}_2;\Theta _2) = {\mathcal {L}}_{{s}_{2}}({\mathbf {X}}_2;\Theta _2) + \beta {\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}}({\mathcal {P}}_{{s}_{1}} \parallel {{\mathcal {P}}_{{s}_{2}}}) \end{aligned}$$
    (7)

    Similarly to the image modality setting, the mutual learning with regularization loss \({\mathcal {D}}_{{\mathrm{KL}}_{\mathrm{Reg}}}\) prevents the transfer of negative knowledge that might be introduced from the other/image modality and, thus, encourages the transfer of only the positive knowledge to the current/text modality throughout the training process. A code sketch of this regularizer and the two modality losses is given below.
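A minimal PyTorch-style sketch of the truncated regularizer of Eq. (5) and its use in the per-modality losses of Eqs. (6) and (7); implementation details such as the epsilon for numerical stability and detaching the peer probabilities are our assumptions.

```python
import torch
import torch.nn.functional as F

def truncated_kld(p_other, p_current, eps=1e-8):
    """Tr-KLD_Reg of Eq. (5): KL(p_other || p_current) with the element-wise
    log-ratio clamped at zero, so that only classes on which the other modality
    is more confident than the current one contribute (positive knowledge)."""
    log_ratio = torch.log((p_other + eps) / (p_current + eps))
    return (p_other * torch.clamp(log_ratio, min=0.0)).sum(dim=1).mean()

def modality_loss(logits_current, logits_other, targets, beta=0.5):
    """Eqs. (6)/(7): cross-entropy of the current modality plus the truncated
    mimicry term towards the other modality's predictions. Detaching the peer
    probabilities (treating them as fixed soft targets) is our assumption."""
    ce = F.cross_entropy(logits_current, targets)
    p_current = F.softmax(logits_current, dim=1)
    p_other = F.softmax(logits_other, dim=1).detach()
    return ce + beta * truncated_kld(p_other, p_current)
```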

Fig. 4

The proposed attention-based fusion module

4.1.2 Multi-modal learning loss

Instead of classifying document images using only the independent image or text modalities described above, we can also conduct document image classification in a multi-modal manner by combining the image features and text embeddings extracted from the two modalities trained with the mutual learning approach with regularization (i.e., ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)). We directly superpose the visual features of the trained image modality and the text embeddings of the trained text modality to generate the ensemble cross-modal features, as shown in Eq. (9). Note that the features extracted from the image modality and the text modality have the same dimension in this work, denoted as d. A softmax layer at the end of the network learns to classify document images based on the ensemble cross-modal features \({\mathbf {X}}_3\). The parameter \(\Theta _3\) of the softmax layer is optimized with the cross-entropy loss function \({\mathcal {L}}_3({\mathbf {X}}_3;\Theta _3)\), which is given by:

$$\begin{aligned} {\mathcal {L}}_3({\mathbf {X}}_3;\Theta _3) =-\sum ^{K}_{k=1}y_k \log P({\hat{y}}_k|{\mathbf {X}}_3,\Theta _3) \end{aligned}$$
(8)

where \({\mathbf {X}}_3\) is given by:

$$\begin{aligned} {\mathbf {X}}_3 = {\mathbf {X}}_1 + {\mathbf {X}}_2, \quad {\mathbf {X}}_3 \in {\mathbb {R}}^{d} \end{aligned}$$
(9)
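A short sketch of Eqs. (8) and (9) follows; the feature dimension and class count shown are illustrative values, not the paper's.

```python
import torch.nn as nn
import torch.nn.functional as F

d, num_classes = 1024, 16          # illustrative; d is the shared feature dimension
fusion_classifier = nn.Linear(d, num_classes)

def fusion_loss(x_image, x_text, targets):
    """Eqs. (8)-(9): element-wise superposition of the equal-sized image and
    text feature vectors, followed by a softmax classifier trained with
    cross-entropy."""
    x_fused = x_image + x_text     # X_3 = X_1 + X_2, Eq. (9)
    logits = fusion_classifier(x_fused)
    return F.cross_entropy(logits, targets)
```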

4.2 Self-attention-based fusion module

The aim of the self-attention-based fusion module (see Fig. 4) is to enhance the representation of the concatenated image and text feature maps, capturing their salient features while suppressing, to some extent, the irrelevant or noisy ones. The adopted self-attention-based fusion module is inspired by the attention modules in [28, 58], which are based on channel-wise re-calibration of feature maps to model channel dependencies. The intermediate feature maps of each single modality can be interpreted as a set of local descriptors that inject global information into the decision process of the network. This is achieved by using global max pooling and global average pooling layers to generate channel-wise information; the advantage of these pooling operations is to enforce correspondences between feature maps and categories.

Consider a set of input features \({\mathbf {X}} = [\mathrm {x}_{1},\ldots ,\mathrm {x}_{m}] \in {\mathbb {R}}^{{m}\times {d}_{x}}\) and output features \({\mathcal {F}} = [\mathrm {f}_{1},\ldots ,\mathrm {f}_{m}] \in {\mathbb {R}}^{{m}\times {d}_{f}}\), where \(m\) is the number of samples, and \({d}_{x}\) and \({d}_{f}\) are the dimensions of the input and output features, respectively. For the image modality, the input features \({\mathbf {X}}\) are passed to global average pooling and global max pooling layers. The spatial information for each layer is computed as:

$$\begin{aligned} {\mathcal {F}}^{'}_{{I}_{\mathrm{Avg}}} = GlobalAvgPool2D ({\mathbf {X}}_{{I}_{\mathrm{Avg}}}) \end{aligned}$$
(10)
$$\begin{aligned} {\mathcal {F}}^{'}_{{I}_{\mathrm{Max}}} = GlobalMaxPool2D ({\mathbf {X}}_{{I}_{\mathrm{Max}}}) \end{aligned}$$
(11)

where \({\mathcal {F}}^{'}_{{I}_{\mathrm{Avg}}}\) and \({\mathcal {F}}^{'}_{{I}_{\mathrm{Max}}}\) correspond to the intermediate feature maps of the intermediate input features \({\mathbf {X}}_{{I}_{\mathrm{Avg}}}\) and \({\mathbf {X}}_{{I}_{\mathrm{Max}}}\) of the image modality. For the text modality, the input features are fed to a global max pooling layer:

$$\begin{aligned} {\mathcal {F}}^{'}_{{T}_{\mathrm{Max}}} = GlobalMaxPool1D ({\mathbf {X}}_{{T}_{\mathrm{Max}}}) \end{aligned}$$
(12)

where \({\mathcal {F}}^{'}_{{T}_{\mathrm{Max}}}\) corresponds to the intermediate feature maps of the input features \({\mathbf {X}}_{{T}_{\mathrm{Max}}}\) of the text modality. In our proposed self-attention-based fusion module, the intermediate feature maps of the image and text modalities extracted by the pooling operations are fed to three independent fully connected layers, which produce the query, key, and value vectors, respectively, as follows:

$$\begin{aligned} {\mathrm{Q}} = {\mathrm{FC}}_{q}({\mathcal {F}}^{'}); \quad {\mathrm{K}} = {\mathrm{FC}}_{k}({\mathcal {F}}^{'}); \quad {\mathrm{V}} = {\mathrm{FC}}_{v}({\mathcal {F}}^{'}) \end{aligned}$$
(13)

where \(\mathrm {Q}, \mathrm {K}, \mathrm {V} \in {\mathbb {R}}^{{m}\times {d}}\) are three vectors of the same shape used to compute the attention function, which measures the compatibility of the query with the key vectors in order to retrieve the corresponding value.

Given a query \(\mathrm {q} \in \mathrm {Q}\) and all keys \(\mathrm {K}\), we calculate the dot products of \(\mathrm {q}\) with all keys \(\mathrm {K}\), divide each by a scaling factor \(\sqrt{{d}_{f}}\), and apply the softmax function to get the attention weights on the values. The output features of each self-attention module of image and text modalities \({\mathcal {F}}\) are given as follows:

$$\begin{aligned} \mathrm {A} = Softmax \left( \frac{\mathrm {Q}\cdot \mathrm {K}^{\top }}{\sqrt{{d}_{f}}}\right) \end{aligned}$$
(14)
$$\begin{aligned} {\mathcal {F}} = \mathrm {A}\cdot \mathrm {V} \end{aligned}$$
(15)

where \(\mathrm {A}\) is the attention map containing the attention weights for all query–key pairs, and the output features of the self-attention blocks \({\mathcal {F}}\) are the weighted summation of the values \(\mathrm {V}\) determined by the attention function \(\mathrm {A}\).

Learning an accurate attention map \(\mathrm {A}\) is crucial for self-attention learning. The scaled dot-product attention in Eqs. [(14), (15)] models the relationship between feature pairs. Once the spatial information is extracted and fed into the self-attention blocks to compute the attention maps, they are then concatenated and multiplied by the input features of the image and text modalities for adaptive feature fusion, which is computed as follows:

$$\begin{aligned} {\mathcal {M}}({\mathcal {F}}) = \sigma ({\mathcal {F}})\cdot {\mathcal {F}} \end{aligned}$$
(16)

where \({\mathcal {M}}\) is the feature map that is passed to the following intermediate image and text blocks of the image and text modalities. The term \(\sigma (\cdot )\) denotes the sigmoid function. This feature map generated by the proposed self-attention-based fusion module focuses on the important features of the channels and concentrates on where the salient features are located.
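The sketch below condenses Eqs. (10)-(16) into a single block; the exact wiring (one attention head over three stacked pooled descriptors, and a shared gate for both branches) is our simplification of the module, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusionBlock(nn.Module):
    """Sketch of the fusion block of Eqs. (10)-(16): pooled channel descriptors
    -> Q/K/V projections -> scaled dot-product attention -> sigmoid gating of
    the original intermediate features."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.scale = channels ** 0.5

    def forward(self, image_feats, text_feats):
        # image_feats: (B, C, H, W) from the image branch,
        # text_feats:  (B, L, C) from the text branch.
        img_avg = image_feats.mean(dim=(2, 3))   # Eq. (10): GlobalAvgPool2D
        img_max = image_feats.amax(dim=(2, 3))   # Eq. (11): GlobalMaxPool2D
        txt_max = text_feats.amax(dim=1)         # Eq. (12): GlobalMaxPool1D

        # Stack the pooled descriptors of both modalities as the attention input.
        desc = torch.stack([img_avg, img_max, txt_max], dim=1)      # (B, 3, C)

        # Eqs. (13)-(15): scaled dot-product self-attention.
        q, k, v = self.q(desc), self.k(desc), self.v(desc)
        attn = F.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        fused = attn @ v                                            # (B, 3, C)

        # Eq. (16): sigmoid-gated re-weighting of the original features.
        gate = torch.sigmoid(fused.mean(dim=1))                     # (B, C)
        return image_feats * gate[:, :, None, None], text_feats * gate[:, None, :]
```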

5 Experimental setup

5.1 Datasets

To evaluate the performance of our proposed ensemble trainable network presented in Sect. 3, two benchmark datasets have been used. First, we use a subset of the IIT-CDIP Test Collection known as RVL-CDIP. This dataset consists of grayscale scanned document images labeled into 16 classes (Advertisement, Budget, Email, File Folder, Form, Handwritten, Invoice, Letter, Memo, News article, Presentation, Questionnaire, Resume, Scientific publication, Scientific report, Specification). The dataset is split into a training set of 320,000 images and validation and test sets of 40,000 images each. Some representative images from the dataset are shown in Fig. 1. Second, we use the public Tobacco-3482 dataset to evaluate the performance and the generalization ability of the trained ensemble network on the classes shared by the two datasets. The Tobacco-3482 dataset contains 3,482 grayscale document images of 10 categories: ADVE, Email, Form, Letter, Memo, News, Notes, Report, Resume, and Scientific.

5.2 Preprocessing

Since the image modality requires inputs of a fixed size, we first downscale all images to 229 x 229 pixels. When training DCNNs, data augmentation has been shown to be effective for real-world image classification [33]. The training data are augmented by shifting images horizontally and vertically with a range of 0.1; a shear transform is also applied with a range of 0.1. To improve the regularization of our image modality, cutout [20] is applied, which augments the training data with partially occluded versions of the existing sample images. On the other hand, the document images from the RVL-CDIP dataset are well oriented and relatively clean; hence, we run the Tesseract OCR engine directly. We used version 4.0.0-beta.1 of Tesseract, based on an LSTM engine, for better accuracy. The resulting extracted text was not post-processed. Although document information such as typefaces, graphics, and layout might be lost in OCR, and the output may contain stop words, misspellings, and spurious symbols and characters, it could benefit from some level of spell checking to improve semantic learning. However, we chose to use the raw output of Tesseract OCR as it is.
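The snippet below approximates the described augmentation pipeline with torchvision; the library choice, the interpretation of the shear range, and the use of RandomErasing as a stand-in for cutout are our assumptions.

```python
from torchvision import transforms

# Sketch of the augmentation: resize to the stated 229 x 229 input size,
# horizontal/vertical shifts and shear with a range of 0.1, and a
# cutout-style random occlusion.
train_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((229, 229)),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), shear=0.1),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),   # stands in for cutout [20]
])
```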

5.3 Training details

The training of our proposed approaches was conducted on 4 NVIDIA RTX-2080 GPUs, using the stochastic gradient descent (SGD) optimizer with Nesterov momentum, a mini-batch size of 16, and a learning rate of 1e-3 decayed by a factor of 0.5 every 10 epochs. The learning rate decay is defined as:

$$\begin{aligned} \mathrm{lr} = \mathrm{initial}\_\mathrm{lr} * \mathrm{drop}^{\left( \frac{\mathrm{iter}}{\mathrm{iter}\_\mathrm{drop}} \right) } \end{aligned}$$
(17)

The mutual learning strategy with regularization (i.e., ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)) is performed on each mini-batch throughout the training process. At each iteration, the predictions of each modality are computed and the parameters are updated according to the predictions of the other modality as in Eqs. [(6)–(8)]. The optimization of the parameters \(\Theta _1\), \(\Theta _2\), and \(\Theta _3\) is performed iteratively until convergence. We applied early stopping with a patience of 10 epochs to stop the training process once the model's performance stops improving on the held-out validation set.
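A hedged PyTorch sketch of the described optimizer and the step decay of Eq. (17); the placeholder model and the momentum value of 0.9 are assumptions, since the paper does not state them.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 16)   # placeholder for the ensemble network

# SGD with Nesterov momentum; the momentum value is not stated in the paper,
# so 0.9 here is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, nesterov=True)

# Eq. (17): lr = initial_lr * drop^(iter / iter_drop), with drop = 0.5 applied
# every 10 epochs, i.e., a standard step decay schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# scheduler.step() is called once at the end of each training epoch.
```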

6 Experiments and ablation study

6.1 Evaluation protocol

To evaluate the performance and the generalization ability of our proposed ensemble network, we proceed with intra-dataset and inter-dataset evaluation on the benchmark RVL-CDIP and Tobacco-3482 datasets. For the intra-dataset evaluation, we train and test the model on the same dataset to evaluate the performance of the proposed approaches, whereas for the inter-dataset evaluation, we train and test the ensemble network on different datasets to evaluate the generalization ability of the trained model. First, we train our ensemble network on the RVL-CDIP dataset and then employ the intra-dataset evaluation on RVL-CDIP and the inter-dataset evaluation on Tobacco-3482. Second, we train our ensemble network on the Tobacco-3482 dataset and then employ the intra-dataset evaluation on Tobacco-3482 and the inter-dataset evaluation on RVL-CDIP. Note that there is no overlap between the training set and the test set in either the intra-dataset or the inter-dataset evaluation.

Table 1 The overall classification accuracy (Acc.), recall (R.), precision (Pr.) metrics of the proposed approaches on the RVL-CDIP dataset

We report the accuracy, recall, and precision achieved on the test set for the following methods: independent learning of the single-modal image and text modalities, mutual learning trained with the standard Kullback–Leibler divergence (KLD), mutual learning trained with the truncated Kullback–Leibler divergence regularization (Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)) loss, and ensemble self-attention mutual learning trained with Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\), denoted, respectively, as IL, \(\hbox {ML}_{{\mathrm{KLD}}}\), ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\), and EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) (see Table 1). We also compute the average precision (AP) from the prediction scores; it summarizes a precision–recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

$$\begin{aligned} \mathrm {AP} = \sum _{n}(\mathrm {R}_{n} - \mathrm {R}_{n-1})\mathrm {P}_{n} \end{aligned}$$
(18)

where \(\mathrm {P}_{n}\) and \(\mathrm {R}_{n}\) are the precision and recall at the nth threshold. A high area under the precision–recall curve reflects both high recall and high precision, where high precision relates to a low false positive rate and high recall relates to a low false negative rate. High scores for both precision and recall show that the model returns accurate results (high precision) as well as a majority of all positive results (high recall). In addition, we compare our work against other state-of-the-art methods on the RVL-CDIP and Tobacco-3482 datasets. Note that the baseline methods in Tables 2 and 4 are not necessarily based on image and text modalities. For example, [61] leverages image features to incorporate words' visual information into LayoutLM for document-level pre-training, and [62] pre-trains text, layout, and image in a multi-modal framework using text–image alignment and text–image matching tasks, so that the cross-modality interaction is better learned.
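For completeness, Eq. (18) corresponds to the standard average-precision computation; the toy labels and scores below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Eq. (18): AP as the recall-weighted mean of precisions over thresholds.
# One-vs-rest example for a single class; y_true holds binary labels and
# y_score the predicted scores for that class.
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.8, 0.4, 0.3, 0.9])
ap = average_precision_score(y_true, y_score)
print(f"AP = {ap:.3f}")
```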

6.2 Intra-dataset evaluation

6.2.1 Results on the RVL-CDIP dataset

On the large-scale RVL-CDIP dataset, all of the approaches adopted in this work achieve performance comparable to the state-of-the-art models. We report the overall accuracy results in Table 2, compared to our latest work [11] and other baseline methods. The proposed EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) model achieves the best accuracy for the single-modal image and text modalities and for the multi-modal fusion modality, at 97.67, 97.63, and 97.70%, respectively. The adopted self-attention-based fusion module has shown its effectiveness in simultaneously capturing the inter-modal interactions between image features and text embeddings, along with the mutual learning approach with regularization (i.e., ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)). Therefore, it improves the global classification performance of the single-modal and multi-modal modalities and outperforms the state-of-the-art methods.

6.2.2 Evaluation of the single-modal tasks on the RVL-CDIP dataset

  1. (i)

    IL vs \({\mathrm{ML}}_{{\mathrm{KLD}}}\): the results reported in Table 1 illustrate the impact of training the independent image and text modalities in a mutual learning manner on the learning process of both modalities. We observe that the \(\hbox {ML}_{{\mathrm{KLD}}}\) method improves the classification performance of the image modality from 85.04 to 88.87%, while it deteriorates the performance of the text modality from 84.96 to 80.89%. We explain this deterioration of the text modality by the negative knowledge learned from the image modality. In fact, the knowledge transferred via the standard KLD loss harms the ongoing training of the current/text modality. Here, given the image features of a sample and its corresponding text embeddings, the negative learning comes from the low class probabilities predicted by the image modality while, at the same time, the text modality makes the right predictions on the same sample. In this way, the mutual training of the text modality is harmed and its loss \({\mathcal {L}}_{2}({\mathbf {X}}_2;\Theta _2)\) decreases more slowly. Thus, the mutual learning \(\hbox {ML}_{{\mathrm{KLD}}}\) method actually makes the text modality worse than the independent learning (IL) method.

    Table 2 The overall classification accuracy of our best EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method against baseline methods on the RVL-CDIP dataset

    Nonetheless, for the image modality, the classification accuracy has improved. This means that transferring the knowledge from the text modality to the image modality by learning mutually from the text predictions is effective.

  2. (ii)

    IL vs ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\): the classification results in Table 1 show that training the image and text modalities in a mutual learning manner with the regularization term (i.e., Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)) provides an improvement over both the IL and \(\hbox {ML}_{{\mathrm{KLD}}}\) methods. It improves the classification accuracy of the image modality from 85.04% for the IL method to 90.81% for the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method and enhances the accuracy of the text modality from 84.96 to 88.80%. Accordingly, the network keeps learning only from its cross-entropy loss \({\mathcal {L}}_{s}({\mathbf {X}};\Theta )\) whenever the knowledge transferred from the other modality would harm the ongoing training of the current modality.

  3. (iii)

    ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) vs EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\): the proposed self-attention-based fusion module for image and text feature fusion focuses on the salient feature maps generated from the image and text modalities and suppresses the unnecessary ones in order to efficiently leverage the two modalities. Introducing this attention module to fuse the two modalities along with the mutual learning approach proves effective compared to the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method, as shown in Table 1. The EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method outperforms the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method by a significant margin, with accuracies of 97.67 and 97.63% for the image and text modalities, respectively. The attention module enhances the classification performance of all classes for the single-modal modalities. Therefore, leveraging each modality with the other in a middle fusion manner, along with the mutual learning strategy, encourages collaborative learning during the training stage.

6.2.3 Evaluation of the multi-modal tasks on the RVL-CDIP dataset

In the multi-modal learning task, the learned image and text features are combined to conduct document image classification. First, Table 1 shows that the multi-modal fusion predictions outperform the independent predictions of the single-modal modalities for each method. Moreover, jointly learning both modalities in an ensemble network benefits from training the image and text modalities both independently (IL) and in a mutual learning manner (ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)). The ensemble predictions learned with the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method, with an accuracy of 97.70%, outperform the predictions obtained by training the ensemble network with the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\), \(\hbox {ML}_{\mathrm{KLD}}\), and IL approaches, which reach accuracies of 96.28, 90.06, and 94.44%, respectively. That is to say, the ability of the self-attention-based fusion module, together with the mutual learning strategy trained with the regularization term (i.e., Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)), to improve ensemble models is beneficial for document image classification and outperforms the state-of-the-art results for the multi-modal task, as shown in Table 2.

Table 3 The overall classification accuracy (Acc.), recall (R.), precision (Pr.) metrics of the proposed approaches on the Tobacco-3482 dataset

Accordingly, the proposed EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method manages to correct the classification errors produced by image and text modalities during the learning process. Hence, it provides state-of-the-art classification results for the task of document image classification.

In this manner, we have shown the effectiveness of leveraging visual and textual features learned with the mutual learning with regularization strategy through a self-attention-based feature fusion module. Our approach simultaneously learns relevant and accurate information from the image and text modalities during the training stage, enhances the ensemble model predictions by encouraging collaborative attention learning from one modality to the other, and boosts the overall classification performance. We report in (Online Resource 1, Figs. 1, 2) the confusion matrices of the multi-modal modalities of the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods, respectively.

6.2.4 Results on the Tobacco-3482 dataset

As reported in Table 3, which shows the performance achieved on the Tobacco-3482 dataset, the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method improves the classification performance significantly. It raises the overall accuracy of the single-modal and multi-modal modalities to 97.99, 96.27, and 98.57% for the image modality, the text modality, and the multi-modal fusion modality, respectively, compared to the other methods. Thus, it achieves compelling performance compared to the baseline methods on the Tobacco-3482 dataset (see Table 4).

Besides, the results illustrate that training the image and text modalities in a mutual learning manner with the \(\hbox {ML}_{\mathrm{KLD}}\) method weakens the learning capacity of the text modality. Therefore, we show the effectiveness of the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) approach, which transfers only the positive knowledge from the other modality to the modality currently being trained.

6.3 Inter-dataset evaluation

6.3.1 Evaluation on the Tobacco-3482 dataset

To evaluate the generalization ability of our ensemble network trained on the RVL-CDIP dataset, we use the benchmark Tobacco-3482 dataset and report the overall accuracy, recall, precision, and F1-score of the single-modal and multi-modal modalities. Since Tobacco-3482 is an imbalanced dataset, we focus more on the precision–recall metrics, reported in Tables 5 and 6, which are useful to measure the success of predictions when the classes are imbalanced. Note that precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned; the F1-score is the weighted average of precision and recall, with equal relative contributions from both. We evaluate on the 9 classes of the RVL-CDIP dataset that overlap with the classes of the Tobacco-3482 dataset, namely Advertisement, Email, Form, Letter, Memo, News article, Resume, Scientific publication, and Scientific report. We exclude the category Note from the Tobacco-3482 dataset, as it does not overlap with any category of the RVL-CDIP dataset.

Table 4 The overall classification accuracy of the proposed approaches against baseline methods on the Tobacco-3482 dataset
Table 5 The inter-dataset evaluation results of the mutual learning ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method on the Tobacco-3482 dataset

As can be seen from Tables 5 and 6 and (Online Resource 1, Figs. 3, 4, 5), the proposed EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method displays a better generalization behavior than the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method (Online Resource 1, Figs. 6, 7, 8) over the 8 categories that overlap with the RVL-CDIP dataset. The EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method performs better, with an overall accuracy of 87.29% for the image modality, 87.23% for the text modality, and 87.63% for the multi-modal fusion modality, compared to 84.82, 83.72, and 86.68% for the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method, respectively. Regarding the Scientific publication category, the recall of both the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods is very low: the ability of the model to find the positive samples of this category is only \(37.93\%\), \(36.02\%\), and \(39.46\%\) for the image modality, the text modality, and the multi-modal fusion modality, respectively, for the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method, and 33.72, 33.72, and 34.10% for the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method. The low recall of both methods is due to the overlap between the Scientific publication and Scientific report categories.

Overall, we see that for both the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods, the model returns far fewer results than in the intra-dataset evaluation, but most of its predicted labels are correct with respect to the ground-truth labels, for the single-modal modalities as well as for the multi-modal fusion modality. Among all classes, the generalization ability of the model for both methods is very poor on the class Scientific report, where precision and recall are very low, whereas in the intra-dataset evaluation, the performance of the ensemble network on the Scientific report category reaches 94.62% and 94.30% for the multi-modal fusion modality of the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods, respectively.

Table 6 The inter-dataset evaluation results of the ensemble self-attention mutual learning (EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)) approach on the Tobacco-3482 dataset

We illustrate in Figs. 5 and 6 the precision–recall curves of the best and worst classes for the multi-modal modalities of the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods, respectively. These curves show the trade-off between precision and recall for different thresholds. We compute the average precision (AP) from the prediction scores, which summarizes each precision–recall curve. The model returns accurate results (high precision) as well as a majority of the positive results (high recall) for the categories Resume, Email, and Memo, where most of the predicted samples are labeled correctly by both the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods. However, we observe good precision but low recall for the Scientific publication category, and poor precision and recall for the Scientific report category (Table 7).

Table 8 reports the average precision (AP) scores on the common categories for the two proposed methods ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\). We observe a good generalization ability of our proposed EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods, trained on RVL-CDIP and evaluated on Tobacco-3482, on 7 of the common classes between the RVL-CDIP and Tobacco-3482 datasets, with the exception of the Scientific publication and Scientific report categories, where the generalization is the worst.

Fig. 5

The precision–recall curves of the inter-dataset evaluation of the best classes of the multi-modal modalities for the two EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods

Fig. 6

The precision–recall curves of the inter-dataset evaluation of the worst classes of the multi-modal modalities for the two EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) methods

6.3.2 Evaluation on the RVL-CDIP dataset

Symmetrically, we evaluate the generalization ability of our proposed model trained on the Tobacco-3482 dataset and tested on the large-scale RVL-CDIP dataset. The overall accuracy, recall, precision, and F1-score of our best EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) approach are reported in Table 7. We follow the same evaluation protocol as in Sect. 6.3.1, where 9 classes of the Tobacco-3482 dataset overlap with the classes of the RVL-CDIP dataset.

Table 7 The inter-dataset evaluation results of the ensemble self-attention mutual learning (EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)) approach on the RVL-CDIP dataset

From Table 7, the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method displays a better generalization ability than the other methods. It performs best, with an overall accuracy of 78.89% for the image modality, 79.06% for the text modality, and 86.68% for the multi-modal fusion modality. Among all classes, and similarly to the inter-dataset evaluation on the Tobacco-3482 dataset, the network generalizes worst on the same categories, namely Scientific publication and Scientific report, while it generalizes best on the categories Resume, Letter, Memo, and Email. Moreover, the ensemble network manages to predict only 10.50% of the samples belonging to the Scientific report category as true positives, while 85.26% are predicted as belonging to the Scientific publication category. Consequently, the precision and recall of the model are very low on the Scientific report category for each modality. As mentioned in Sect. 6.3.1, the poor precision and recall are due to the overlap between the two categories, which results in a poor generalization ability of the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method for these two categories only, contrary to the intra-dataset evaluation, where the ensemble network achieves accurate results with high precision and recall for all categories.

Therefore, we observe a good generalization ability of our proposed EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) method, trained on Tobacco-3482 and evaluated on RVL-CDIP, on 7 of the common classes between the RVL-CDIP and Tobacco-3482 datasets, with the exception of the Scientific publication and Scientific report categories, where it generalizes the worst. These results are encouraging, as they show that our proposed system is able to learn on a small dataset (around 6000 documents) compared to the RVL-CDIP training set.

Table 8 The average precision (AP) scores of the inter-dataset evaluation of the ML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) and the EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\) for the multi-modal fusion modality on the Tobacco-3482 dataset

7 Conclusion and future work

In this paper, we have proposed an ensemble network that jointly learns the visual structural properties and the corresponding text embeddings of document images through a self-attention-based mutual learning strategy (EAML\(_{{\mathrm{Tr}\text {-}\mathrm{KLD}}_{\mathrm{Reg}}}\)). We have shown that the designed self-attention-based fusion module, along with the mutual learning approach with the regularization term, enables the current modality to learn the positive knowledge from the other modality rather than the negative knowledge, which would weaken its learning capacity during the training stage. This constraint is realized by adding a truncated Kullback–Leibler divergence regularization mimicry loss (i.e., Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)) to the conventional supervised setting. With this approach, we have further combined the mutual predictions computed by the trained image and text modalities in an ensemble network through multi-modal learning to boost the overall classification accuracy on document images. The proposed mutual learning strategy with regularization has proven efficient in improving the overall performance of the ensemble model. In future research, we will work on improving the performance and the generalization ability of our self-attention-based mutual learning strategy to enhance the learning process between different modalities, both independently and in an ensemble network.