1 Introduction

A major part of human communication, formal or informal, takes place through documents. The need for document processing and analysis is especially crucial in the corporate sector, where various organizational decisions depend upon the information extracted from business documents (letters, invoices, quotations, tax notices, resumes, bank statements, etc.). Some of the most common document processing tasks needed by corporate organizations include document classification, clustering, forensics, and information extraction.

Document classification refers to automatically identifying and assigning the correct category for a given document, based on clues hidden in the document’s content [1]. This contextual information used for classification can be in the form of text, images, or both.

Recent studies show that deep learning techniques can be used to perform different types of document processing and analysis tasks [2, 3]. But, while traditional deep learning algorithms rely on static training and test sets, most real-life datasets for business documents classification are constantly evolving. Indeed, every day, companies receive/digitize new documents which can belong to existing categories or represent new document categories. Traditional deep learning approaches, which learn from static data, are not optimal in such cases, as most of them would have to be retrained from scratch every time a new class (or at least significant information) is added, resulting in excessive time and resource consumption. Incremental learning models are thus more suitable for most real-life document classification applications.

Incremental learning is a branch of machine learning where models are trained on the go with the arrival of new data during training [4, 5]. The model, after being trained on the initial dataset, updates itself to adjust to new data distribution at each iteration where new data is added.

In this paper, we focus on incremental business document classification, which comprises classifying automatically all inbound communication, i.e. diverse document streams, including emails, invoices, resumes, tax notices, bank statements, etc. [6]. More precisely, in this research work, we propose a rehearsal-based incremental learning protocol for the classification of business documents based on the text automatically extracted from these documents. The idea behind rehearsal strategies is that, along with the new data arriving on the go, a subset of data from previous training iterations is also kept in memory and used for training during later iterations [5, 7], to avoid “catastrophic forgetting” for the classes learned during early learning iterations [8].

We experiment with the proposed protocol in the presence of documents written in two different languages: English and French, and in the presence of balanced, as well as highly imbalanced, datasets.

The three main contributions of this paper are:

  • We introduce a novel protocol for incremental document classification and perform incremental classification of document streams using a weight-sharing strategy between transformer model layers across multiple incremental iterations, which, to the best of our knowledge, has not been explored before.

  • Using different datasets, we extensively compare its performance with that of its static counterpart and a baseline approach, and investigate the effects of class imbalance on both models.

  • We provide recommendations for setting the main parameters required for rehearsal-based incremental document classification models: (a) the number of instances from “old” classes (classes learned in previous training iterations) that need to be kept in memory at each iteration in order to avoid catastrophic forgetting, (b) the optimal number of new classes to add at each iteration, and (c) the effect on the overall performance of the order in which the classes are presented to the model for learning. We formulate these recommendations based on an extensive experimental study.

This article is divided into six sections. Section 2 sheds light on some of the existing works for document classification and incremental learning. The proposed approach, methodology, and the details of the experiments performed in this research work are presented in Sect. 3. Section 4 presents the analysis of our results, while Sect. 5 contains a discussion and recommendations. Finally, Sect. 6 concludes the paper and presents future directions for this research.

2 Literature review

Several researchers have proposed models for business document classification using textual, image, and multimodal information, possibly using incremental learning.

Since this paper lies at the crossroad of document classification and incremental learning, this section is further divided into three sub-sections. While Sect. 2.1 briefly reviews recent works for document classification, Sect. 2.2 presents a thorough literature review for incremental learning, and Sect. 2.3 discusses the choices adopted in this research work.

2.1 Document classification

Typical business document classification workflows include document capture, image analysis, Optical Character Recognition (OCR) for recognizing the text from bitmap images, text analysis, assigning the appropriate category to the document, and document routing to some business process based on the category assigned. However, the complexity and diversity of informative elements, backgrounds, and geometric layouts make it difficult to achieve very good results for automatic document classification [9].

Asim et al. [1] present a two-stage text classification system (TSCNN). The first stage uses a filter-based feature selection method, the Normalized Difference Measure (NDM), to eliminate redundant or irrelevant features. The selected features are then fed to multichannel Convolutional Neural Networks (CNNs) for classifying the input document into the relevant category. On two publicly available datasets (BBC News and 20 News-Group), this approach achieves accuracies of 99.251% and 91.746%, respectively. While it outperforms the baselines, it requires an additional feature engineering step, which (for large datasets) may result in increased training time compared to contemporary models.

Alhaj et al. [10] propose using stemming to reduce the high dimensionality of feature vectors and save computational cost. Using three stemming methods (Information Science Research Institute, Tashaphyne, and ARLStem) and three machine learning algorithms (Naive Bayes, support vector machines, and K-nearest neighbours), they classify Arabic text documents. The best results (94.64% Micro-F1) are obtained using ARLStem for dimensionality reduction, combined with support vector machines. The approach was not tested on mainstream languages other than Arabic.

d’Andecy et al. [9] compare the performance of a CNN-RNN-based approach and a custom incremental learning approach for automatic document classification, using the OCR-based textual representation of documents from the Digital Mailroom dataset. They report that the CNN-RNN-based approach outperforms the incremental classification approach, achieving an accuracy of 94%. Though the results achieved via the custom incremental learning approach serve as a proof of concept for the viability of the approach, its performance remains below the state-of-the-art performance achieved via the CNN-RNN-based approach.

Shahkolaei et al. [11] use a log-Gabor filter for text/non-text image segmentation, followed by an SVM for classification. On two publicly available datasets (Visual Document Image Quality Assessment and MHDID), accuracies of 76.11% and 85.07%, respectively, were achieved. This model, however, is only tested on Arabic-language documents, and the extent of its adaptability to other languages is not known.

Some other interesting rule-based and machine learning approaches for document classification are proposed in [9, 12,13,14]. A comparison of some of the existing approaches for document classification is presented in Table 1.

Table 1 A summary of existing works for document classification

2.2 Incremental learning

As explained in Sect. 1, incremental learning (IL) is desirable for modern document classification systems because it allows for efficient resource utilization by not having to retrain the system from scratch when new documents/information arrive. It reduces memory consumption by avoiding or limiting the huge quantity of data that must be stored for the proper functioning of the system. Incremental learning most closely resembles human learning [15].

In this research, we employ deep learning for incremental document classification. The choice of deep learning is motivated by its outstanding performance and its minimal requirements for human supervision.

Luo et al. [16] classified incremental learning scenarios into three categories: instance incremental learning, class incremental learning, and instance and class incremental learning. Instance incremental learning keeps the number of classes fixed while the number of instances per class expands during each incremental learning stage. In class incremental learning, only new classes are added during each incremental learning stage. Finally, in instance and class incremental learning, the number of instances from existing classes expands along with the addition of instances from new classes. In the rest of this paper, we follow this convention to refer to IL scenarios (instance incremental learning, class incremental learning, and instance and class incremental learning). We use the term “task incremental learning” (learning new tasks within the same domain [17]) to refer collectively to these three IL scenarios.

The major risk associated with incremental learning is “catastrophic forgetting” (CF): a sharp performance decline for already learned tasks after the acquisition of new data [8]. Catastrophic forgetting can be caused by several phenomena, including activation drift, weight drift, task-recency bias (referring to the bias towards the most recently-learned tasks), and inter-task confusion [15]. Several IL techniques have been proposed to reduce the risk of catastrophic forgetting while simultaneously learning new information.

But, trying to prevent catastrophic forgetting can cause another major issue: intransigence, or the unwillingness to learn new tasks [18]. Effective IL systems must find a fine balance between catastrophic forgetting and intransigence, a tradeoff called the stability-plasticity dilemma [19].

In the next section, we provide a thorough review of the state-of-the-art IL techniques aiming at finding a good trade-off between stability and plasticity. To this end, on the basis of recent works, four categories of incremental learning strategies are discussed in the next four sub-sections: regularization approaches, architectural strategies, variational continual learning (VCL), and rehearsal-based approaches.

2.2.1 Regularization approaches

Regularization approaches mitigate catastrophic forgetting by introducing a special regularization term to classification loss. Based on recent studies, regularization approaches can be divided into two major categories: weight regularization strategies and distillation strategies.

Weight Regularization Strategies Weight regularization strategies are based on the hypothesis that the previous information learned by a neural network can be preserved by (i) assessing the importance of the weights relevant to previously learned information and (ii) restricting the learning rate for new information. An additional regularization term is incorporated into the loss function (on top of the cross-entropy loss), which can be formulated as:

$$\begin{aligned} L_{reg}(\theta ^t) = \frac{1}{2} \sum _{i=1}^{|\theta ^{t-1}|} \Omega _{i}\,(\theta _{i}^{t-1} - \theta _{i}^{t})^2 \end{aligned}$$

where \(\theta _{i}^t\) is weight i of the network trained for the current task t, \(\theta _{i}^{t-1}\) is the value of weight parameter i at the end of training on task \(t-1\), \(|\theta ^{t-1}|\) stands for the total number of weights in the network, and \(\Omega _{i}\) contains importance values for all the network weights.
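To illustrate, the following minimal PyTorch sketch computes this penalty over hypothetical flattened weight vectors; how the importance values \(\Omega\) are estimated depends on the specific method (e.g., via the Fisher information in elastic weight consolidation), and the vector sizes below are arbitrary:

```python
import torch

def weight_regularization_loss(theta_prev, theta_curr, omega):
    """L_reg = 1/2 * sum_i Omega_i * (theta_i^{t-1} - theta_i^t)^2,
    computed over flattened network weights."""
    return 0.5 * torch.sum(omega * (theta_prev - theta_curr) ** 2)

# Hypothetical example: 1000 weights, with a frozen copy from task t-1.
theta_prev = torch.randn(1000)                      # weights after task t-1
theta_curr = theta_prev + 0.01 * torch.randn(1000)  # weights during task t
omega = torch.rand(1000)                            # per-weight importance
penalty = weight_regularization_loss(theta_prev, theta_curr, omega)
```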

Zeng et al. [20] propose the orthogonal weights modification (OWM) approach for incremental image classification. By restricting the direction for weights updates of each parameter, it protects previously gained knowledge.

Farajtabar et al. [21] propose the orthogonal gradient descent (OGD) approach which, when a new task arrives, determines the orthogonal basis S of the previous task and then transforms the current task’s original gradient to a new gradient perpendicular to S.
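A simplified sketch of this projection step is given below, assuming the basis S of previous-task gradient directions is stored as a list of orthonormal flattened vectors (names are illustrative):

```python
import torch

def ogd_project(grad, basis):
    """Remove from `grad` its components along each orthonormal basis
    vector of S, yielding an update direction perpendicular to S."""
    for v in basis:
        grad = grad - torch.dot(grad, v) * v
    return grad
```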

Some previously introduced IL models based on weight regularization are proposed by Chaudhary et al. [18], Aljundi et al. [5], and Castro et al. [22].

Distillation Strategies Distillation is a regularisation approach that protects knowledge at the level of model outputs rather than individual weights: it constrains the outputs of the new model to remain close to those of the old model. This allows information from the previous model to be incorporated into the current model, hence partially reducing CF.

Hinton et al. [23] suggested knowledge distillation (KD) to minimise knowledge loss. The KD equation may be stated as follows using the softmax output layer:

$$\begin{aligned} q_{i} = \frac{\exp (z_{i}/T)}{\sum _{j}\exp (z_{j}/T)} \end{aligned}$$

where \(q_{i}\) is the probability of the ith class and \(z_{i}\) is the corresponding logit from the preceding layer. T is a temperature coefficient that is normally set to 1 during the inference stage.
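A minimal sketch of this softened softmax, together with a resulting distillation loss, is shown below (the \(T^2\) scaling follows Hinton et al. [23]; the default temperature is illustrative):

```python
import torch.nn.functional as F

def softened_probs(logits, T=2.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    return F.softmax(logits / T, dim=-1)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's and student's softened
    # distributions; T**2 keeps gradients comparable across temperatures.
    q_teacher = softened_probs(teacher_logits, T)
    log_q_student = F.log_softmax(student_logits / T, dim=-1)
    return -(q_teacher * log_q_student).sum(dim=-1).mean() * T ** 2
```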

Other IL models based on distillation strategies have been proposed by Wu et al. [7], Zhang et al. [24], and Lee et al. [25].

One of the main advantages of regularization strategies is that they do not require extra storage during each incremental stage. However, most regularization strategies struggle when there are numerous incremental stages/tasks to learn (especially weight regularization strategies). The main reason for this poor performance is the large number of required regularization terms, which may prevent many weight parameters from updating, resulting in intransigence.

2.2.2 Architectural strategies

In architectural strategies, multiple classifiers are trained for every sequential incremental task. Then, during the inference stage, a selector decides which model is best suited for the task at hand.

Roy et al. [26] created a hierarchical framework with a tree structure, where IL is achieved by adaptively altering the tree’s leaves. Although this strategy reduces catastrophic forgetting to some extent, it requires considerable storage space, which prevents the model from being trained efficiently.

Other techniques, such as Progressive Neural Networks (PNN), rely on iteratively growing the network. But, most such methods cannot efficiently use the network’s capacity. Mandivarapu et al. [27] propose a solution to this problem using the Self-Net model, which encodes a group of low-dimensional weights learned from multiple tasks using an auto-encoder.

Some previously introduced IL models using architectural strategies were proposed by Polikar et al. [28], Rusu et al. [29], Yoon et al. [30].

Thanks to their multiple connected classifiers, architectural strategies are beneficial for mitigating catastrophic forgetting while learning new knowledge. However, with each incremental iteration, their number of parameters expands, which may increase the model’s training time for later iterations.

2.2.3 Variational continual learning

Variational continual learning (VCL) is built on the Bayesian inference framework. VCL was first proposed by Nguyen et al. [31], in an attempt to combine online Variational Inference (VI) and the latest advancements in Monte Carlo VI for neural networks.

Farquhar and Gal [32] propose Variational Generative Replay (VGR), a variational inference extension of Deep Generative Replay (DGR) based on the Bayesian online learning paradigm, to complement variational continual learning.

Some other interesting IL models using variational continual learning are proposed by Chen et al. [33], Adel et al. [34], and Ebrahimi et al. [35].

VCL is a natural way of capturing past information learned by a neural network in the form of a prior, which mitigates catastrophic forgetting. Its main drawback is that estimating a prior that effectively captures past information is a major challenge. Furthermore, VCL is computationally expensive when the number of variables involved in Bayesian inference is large.

2.2.4 Rehearsal approaches

Rehearsal and pseudo-rehearsal approaches use retrospective knowledge to mitigate the effects of catastrophic forgetting (CF). The rehearsal strategy enables the IL model to re-use previous knowledge while acquiring new knowledge, by keeping in memory a subset of the data learned in the past. In pseudo-rehearsal techniques, upon learning new knowledge, the model creates pseudo-data that closely matches the distribution of the old data. Then, during each incremental learning iteration, both the new data and the “old” data (or pseudo-data in the case of pseudo-rehearsal) are used to train the incremental model.
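The core mechanism can be sketched in a few lines of Python (a hypothetical per-class memory budget is assumed; the corresponding parameter of our protocol, IEC, is introduced in Sect. 3.3):

```python
import random

def build_iteration_training_set(new_class_data, memory, budget_per_class):
    """Mix the documents of newly arrived classes with a bounded random
    sample of each previously learned class kept in memory."""
    rehearsal = []
    for label, instances in memory.items():
        # Keep at most `budget_per_class` instances per old class;
        # classes with fewer instances contribute all they have.
        k = min(budget_per_class, len(instances))
        rehearsal.extend(random.sample(instances, k))
    return new_class_data + rehearsal
```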

One of the earliest rehearsal-based deep incremental learning models, Incremental Classifier and Representation Learning (iCaRL), was proposed by Rebuffi et al. [4]; it employs a CNN for the incremental classification of images from the CIFAR-100 and ImageNet ILSVRC 2012 datasets.

Wu et al. [36] propose a rehearsal-based IL technique that uses knowledge distillation for the incremental classification of images from the CIFAR-100, Flower-102, and MS-Celeb-1M-Base datasets. Their model uses vanilla generative adversarial networks to cater for the distribution difference in the exemplars kept in memory.

Zhang et al. [24] propose the Deep Model Consolidation (DMC) algorithm, a pseudo-rehearsal incremental learning technique. The proposed approach trains two individual models (for old and new classes) and then combines them via a novel double distillation strategy. The combined model is further consolidated by exploiting publicly available unlabelled auxiliary data representative of both old and new classes.

Masarczyk and Tautkute [37] propose a pseudo-rehearsal incremental learning model that uses a two-step optimization process to generate synthetic data via meta-gradients, which, when learned in a sequence, does not result in catastrophic forgetting. In the first step, the model uses generative neural networks to create a synthetic sequence of tasks to evaluate the learner model in IL scenarios. The learner model, once trained on synthetic data, is evaluated on real data. The loss obtained on real data is used to fine-tune the parameters of the generative network. The process continues whenever a new task or a class is encountered. The proposed model is trained to incrementally classify images from the Split-MNIST dataset.

Other rehearsal and pseudo-rehearsal incremental learning models were proposed by Shin et al. [38], Kemker and Kanan [39], and Hou et al. [40].

Rehearsal techniques work well and inherit a long history of success. They are considered efficient and effective at tackling CF because, by keeping only a limited amount of previously learned data, CF is reduced significantly. Though rehearsal-based incremental learning models are simple to implement, they require several hyper-parameters to be set carefully to achieve optimal performance: the number of initial classes, the memory size (for instances from old classes), the batch size for new classes, etc. One of the goals of this paper is to give recommendations for setting these parameters in the context of incremental document classification.

2.3 Discussion

The performances of automatic document classification systems for companies have greatly improved in recent years [41, 42]. Nonetheless, the requirements concerning training datasets, the time to set up, and the cost to keep up with the changes as the domain grows still pose a serious challenge for the large-scale practical application of these systems [43, 44]. Furthermore, the evolving nature of business document datasets makes it difficult for static machine learning models to achieve optimal performance in the long run. Incremental learning models are more suited for training machine learning models for business document classification.

Table 2 summarizes incremental learning approaches discussed in the previous section. To ease their comparison, we describe the advantages and/or drawbacks of each category.

Table 2 Various categories of incremental learning approaches along with their advantages and limitations

Several researchers have undertaken the task of applying incremental learning techniques for document classification [46,47,48,49,50]. However, most of the existing research works report performance of incremental learning models for document classification in terms of a single performance metric such as accuracy or F1 measure on a static test set. Furthermore, to the best of our knowledge, none of the existing works investigate the performance of rehearsal-based incremental learning models for document classification, nor how to set its main parameters (memory size, best number of new classes to add at each iteration, the effect of the order in which these new classes appear, etc).

This research work proposes a rehearsal-based incremental learning protocol for text document classification using a weight-sharing strategy between transformer model layers across multiple incremental iterations. Our choice of rehearsal-based approaches for document classification is motivated by (i) the lack of existing rehearsal-based incremental learning models for document classification, (ii) the simplicity and explainability of rehearsal-based models, as they only require mixing instances from new and old classes for every training cycle, and (iii) the absence of evidence regarding optimal values for the various rehearsal-based IL parameters, whose settings greatly affect the overall performance of the model.

For this work, we chose not to consider pseudo-rehearsal approaches since they depend on the quality of the generative models used to produce data representing the old classes. Moreover, in the case of imbalanced datasets where some classes are not well represented, the quality of generated pseudo-data is subpar compared to the real data used in rehearsal strategies [51].

In the rest of this paper,

  • We propose a rehearsal-based incremental learning model for document classification;

  • Using different datasets, we extensively compare its performance with that of its static counterpart, and investigate the effects of class imbalance on both models;

  • Experiments are performed to analyse the effect of various parameters of the incremental learning model on real-life training scenarios for incremental document classification. To this end, we attempt to answer the following questions for various training scenarios: (i) what is the effect, on the overall performance, of the order in which new classes appear during training, (ii) what are the optimal batch size of previous classes and the number of instances from previous classes to keep in memory, and (iii) what is the optimal number of new classes to add during each incremental training cycle.

The next section presents the proposed rehearsal-based incremental learning protocol for document classification and the methodology adopted to evaluate the model performance. Recommendations are also provided about which values to use for the main hyper-parameters of such rehearsal-based approaches in the case of document classification.

3 Proposed approach and methodology

Most previously proposed approaches involve a traditional, static machine learning pipeline where models are trained once using the complete dataset and tested on a static test set. Unlike such approaches, this research work proposes a rehearsal-based incremental learning protocol that trains the model in multiple iterations, using a subset of the data during each iteration.

Rehearsal-based incremental learning models are simple to implement. However, for optimal performance, the values of some of their hyper-parameters must be identified before training, which is one of the goals of this research work.

The rest of this section explains the datasets, the proposed protocol, and the methodology we adopted to evaluate its performances and select its best hyper-parameter values in the context of document classification. A similar methodology could be employed for other final applications.

Table 3 Class distribution for the imbalanced private dataset

3.1 Datasets

Two datasets are used to train and evaluate our incremental learning models. The first dataset is a real-world French language dataset consisting of grayscale document images from 47 classes.

In total, there are 23,577 images in the dataset, among which 15,491 documents are in the training dataset, 2203 documents are in the validation dataset, and 5883 in the test dataset. The orientation of the images in the dataset is mostly horizontal. It is important to note that the orientation is not an indicator of any particular class label.

This dataset is highly imbalanced, as some of its least represented classes have only a single instance in the training set, while the most represented classes have up to 2940 instances. Table 3 depicts the class distribution in training, test, and validation sets. It has to be noted that, for 6 out of 47 classes (the ones with the fewest samples), there is no document in the test dataset, but there is one document each for 4 of those classes in the validation set. Despite the absence of samples in the test dataset for those 6 classes, we chose to keep all 47 classes in the dataset because (i) we chose to adopt a realistic scenario where some classes that were learned might never occur for a certain customer and (ii) it is interesting for real-life scenarios to observe the effect of such “never-seen” classes on the overall performances of the model.

Though our private dataset consists of real-world documents, it cannot be accessed publicly. Hence, we also performed experiments on the publicly available RVL-CDIP dataset which consists of 400,000 grayscale document images belonging to 16 classes, with 25,000 images per class [52]. Different from our private dataset, RVL-CDIP is thus a balanced dataset, and it is in English. It contains three subsets: training, testing, and validation. The training set contains 320,000 images, while the validation and test sets contain 40,000 images each.

3.2 Proposed approach

Figures 1 and 2 respectively depict the static and rehearsal-based incremental learning model for document classification proposed in this research work.

Fig. 1 Proposed approach for document classification using a static deep learning model

Fig. 2 Proposed approach for rehearsal-based incremental document classification. In this diagram, the number of initial classes (BC) and new classes per iteration (NC) is arbitrarily set to 3. L represents the total number of classes. N represents the total number of training iterations

As shown in Fig. 1, the static document classification model is a traditional machine learning model whose input is the text extracted, via an Optical Character Recognition (OCR) tool, from document images from all the classes in the dataset; here we chose ABBYY FineReader for its excellent performance in practice.Footnote 1

The vector representation for text documents in the RVL-CDIP dataset is generated via the DistilBERT [53] model, which is a faster, smaller, and lighter version of the BERT model [54]. For our French-language private dataset, we switched to the transformer-based Flaubert language model [55], which is pretrained on a French-language corpus.

We chose to use DistilBERT and Flaubert transformers for text representation, because our preliminary experiments showed that they yield excellent performance for the static classification of documents written in English and French languages, respectively, using deep learning models.

The vector representations of input documents, as returned by the DistilBERT and Flaubert transformer models, are passed to a dense layer of a fully connected neural network for further fine-tuning. Since we have a multi-class classification problem where the model output is a label for one of several document categories, a Softmax layer is added to make the final predictions. The categorical cross-entropy loss function is used to calculate the loss, and the Adam optimizer is used to optimize the model.
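A minimal sketch of this architecture in PyTorch with the Hugging Face transformers library is given below; the dense-layer width (256) and the learning rate are illustrative assumptions, not values prescribed by our protocol:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DocumentClassifier(nn.Module):
    """Transformer encoder (DistilBERT for English; Flaubert for French)
    followed by a dense layer and a softmax output over categories."""
    def __init__(self, num_classes, encoder_name="distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size       # 768 for DistilBERT
        self.dense = nn.Linear(hidden, 256)
        self.out = nn.Linear(256, num_classes)

    def forward(self, input_ids, attention_mask):
        enc = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = enc.last_hidden_state[:, 0]              # [CLS]-position vector
        return self.out(torch.relu(self.dense(cls)))   # logits; softmax applied in the loss

model = DocumentClassifier(num_classes=16)             # e.g., RVL-CDIP
loss_fn = nn.CrossEntropyLoss()    # categorical cross-entropy over softmax outputs
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```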

Our rehearsal-based incremental learning model, as depicted in Fig. 2, is trained in multiple iterations where the workflow of each individual iteration is very similar to the static model. However, unlike the static model, for each iteration, document images from only a small number of new classes (3 in the illustrative figure) are used for training, along with the subset of document images from classes used for training the model in the previous iterations.

For example, in Fig. 2, in the first iteration, all the documents from classes 1, 2, and 3 in the training dataset are used for training. In the second iteration, the model is trained using documents from classes 4, 5, and 6, along with a subset of documents from classes 1, 2, and 3. This process continues until the model has been trained using documents from L classes across N iterations, where L is the total number of classes and N is the total number of training iterations. In the last iteration N, the model is trained using documents from classes L, L-1, and L-2, along with a subset of documents from all the previous classes, L-3, L-4, L-5, down to classes 3, 2, and 1.

Another important difference is that, in the rehearsal-based incremental learning model, the weights learned during previous iterations in the transformer-based NLP models, the dense neural network layers, and the softmax layers are passed on to subsequent iterations. Preserving these weights across iterations allows the model to retain the knowledge learned during previous iterations and to effectively classify documents from previously encountered classes. As shown in Sect. 4.1, this continuity in learning (as opposed to static models) contributes significantly to the model’s performance and ensures the retention of valuable insights from past iterations.
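One way to realize this weight sharing is sketched below: the encoder and dense layers are simply carried over, while the final pre-softmax layer is grown to accommodate the classes of the new iteration (a sketch, assuming the output layer is a linear layer feeding the softmax):

```python
import torch
import torch.nn as nn

def expand_output_layer(old_head: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Grow the final (pre-softmax) layer for newly arrived classes while
    copying the weights learned for existing classes in prior iterations."""
    in_features, old_out = old_head.in_features, old_head.out_features
    new_head = nn.Linear(in_features, old_out + num_new_classes)
    with torch.no_grad():
        new_head.weight[:old_out] = old_head.weight  # reuse old-class weights
        new_head.bias[:old_out] = old_head.bias      # new rows stay randomly initialised
    return new_head
```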

3.3 Experimental protocol

One of the goals of this research work is to study the effect of the batch size of new classes used for each iteration of the incremental learning model. In this regard, a total of 5 experiments are performed, as summarized in Table 4. The number of base classes (BC) and the number of new classes (NC) have been chosen arbitrarily for each experiment.

In the rest of this article, the base classes (BC) refer to the classes that are used to train the incremental learning model for the first time; the new classes (NC) refer to the classes that are used to train the model in subsequent iterations; and instances from existing classes (IEC) refers to the number of instances, from each class used in previous training iterations, that are kept in memory. For each experiment, 50, 100, 150, 200, and 250 are considered as possible IEC values. In cases where an existing class contains fewer instances than the IEC value, all the instances from this class are kept in memory.

Table 4 Simulation strategy for the proposed approach for rehearsal based incremental classification of documents

The number of iterations (NoI) for incremental training is calculated as follows:

$$\begin{aligned} NoI = \left\lceil \frac{TotalClasses - BC}{NC} + 1 \right\rceil \end{aligned}$$
(1)
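For instance, a direct implementation of Eq. (1) reproduces the 10 iterations of Exp5 on our private dataset (47 classes, BC = NC = 5) mentioned in Sect. 3.4:

```python
import math

def number_of_iterations(total_classes: int, bc: int, nc: int) -> int:
    """Eq. (1): NoI = ceil((TotalClasses - BC) / NC + 1)."""
    return math.ceil((total_classes - bc) / nc + 1)

number_of_iterations(47, bc=5, nc=5)   # -> 10
```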

For incremental training in real scenarios, new classes can have more, less, or a similar number of documents as the classes that were already used in previous training iterations. Therefore, to study the effect of the order of addition of new classes on the overall performance of the incremental learning model, three training strategies are adopted:

  • Most Frequently Occurring Classes First The base classes (BC) and new classes (NC) for each incremental iteration are selected in the descending order of the number of instances per class.

  • Least Frequently Occurring Classes First The base classes (BC) and new classes (NC) for each incremental iteration are selected in the ascending order of the number of instances per class.

  • Random Addition of New Classes Document classes are randomly selected for all the training iterations of the incremental learning model (without taking into account their occurring frequency).

The motivation behind selecting these three strategies is to see how the order of addition of new classes affects the overall model performance, and what are the most optimal values for the parameters number of base classes (BC), new classes (NC), and instances from existing classes (IEC), which return the best performance for the aforementioned scenarios. These training strategies can also help us understand the effect of class imbalance, which is often present in real-life document datasets, including our highly imbalanced private dataset.

3.4 Evaluation strategies and measures

Two different but related evaluation strategies are adopted for (i) the experiments that compare the rehearsal-based incremental learning model with the static deep learning model for document classification, and (ii) the experiments that study the effect of class order and of the hyper-parameters BC, NC, and IEC.

3.4.1 Evaluation strategy for comparison with the static learning model

To obtain baseline results for our static deep learning model, the dataset is divided into training, validation, and test sets. Static deep learning models are trained using the complete training set in a single pass. The best model is selected via the model performance on the validation set which is also obtained via a single pass prediction. Finally, for the sake of comparison, predictions are made on the independent test set.

To compare rehearsal-based incremental learning models with the static deep learning models, depending upon the number of iterations (NoI), the training and validation sets are divided into multiple sub-training and sub-validation sets. For instance, if the number of incremental iterations is 10 as in the case of Exp5 on our private dataset (\(TotalClasses=47\), BC=5, NC=5), the number of sub-training and sub-validation sets is 10. The incremental learning model is trained and validated in 10 iterations using these sub-training and sub-validation sets. The final model performance for both incremental and static models is compared by using the same fixed test set.

We chose accuracy as a performance metric since it has been widely used in the literature for the performance evaluation of document classification systems. Furthermore, since our private dataset is highly imbalanced, we chose to use, on top of accuracy, the F1 measure since it provides better insights about the model performance in case of imbalanced datasets.

3.4.2 Evaluation strategy to study the effect of class order, BC, NC and IEC values

To study the effect of batch sizes of new classes (NC) and base classes (BC), the order in which classes are used for training, and the number of instances from existing classes (IEC), on the incremental learning model performance, a static test set is not required. Hence, the test and validation sets are merged, resulting in updated validation sets. Depending upon the number of iterations (NoI), the training and validation sets are further divided into sub-training sets and sub-validation sets. The model performance is evaluated using the average of accuracies and F1 measures for all the sub-validation sets.
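With scikit-learn, the per-iteration scores can be computed along the following lines (a sketch; the final reported figure is the average of these scores over all NoI sub-validation sets):

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate_iteration(y_true, y_pred):
    """Scores on one sub-validation set; the weighted F1 accounts for
    the class imbalance of our private dataset."""
    return (accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="weighted"))
```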

The code for the proposed approach and experiments is available online.Footnote 2

4 Results and discussion

This section contains the results and related discussion for two kinds of experiments:

  • the experiments performed for comparing the incremental learning model with the static deep learning model, and for studying the effect of class imbalance on both models;

  • the experiments to study the effect of class order and of the values of hyper-parameters BC, NC, and IEC on the incremental learning model.

In order to evaluate our proposed methodology against the current state-of-the-art, we have chosen the study conducted by Voerman et al. [50] as the baseline approach. Their approach addresses the challenge of classifying low-represented classes by employing a cascaded system, leveraging deep learning networks for major classes to ensure high precision while employing specialized architectures such as few-shot learning for rare classes. Their approach works in two stages: (i) documents are classified using deep neural networks, and only predictions with high confidence are retained; and (ii) for predictions with low confidence, a specialized architecture such as few-shot learning is used. The final prediction is based on the global confidence of all the systems in the cascade. A major drawback of their approach is that it requires training multiple neural networks in a cascaded system; predictions are generated in series from the outputs of these multiple networks, potentially slowing down the model.

This selection of the approach from Voerman et al. [50] as a baseline is based on two primary reasons: firstly, their research represents the sole existing work focusing on incremental classification of business document streams, and secondly, it utilizes the same datasets employed in our study. The results for experiments involving the random addition of new classes are compared with the baseline model since these experiments resemble the approach adopted in the baseline model.

In the next section, we will compare the performance of our incrementally trained model with that of their static counterpart.

Table 5 Percentage accuracies for private dataset on test set when incremental learning is performed using most frequently occurring classes first
Table 6 Percentage accuracies for private dataset on test set when incremental learning is performed using least frequently occurring classes first
Table 7 Percentage accuracies for private dataset on test set when incremental learning is performed via random addition of new classes

4.1 Comparison with static model

This section contains the results of the experiments where models are trained incrementally and evaluated on the same fixed test set that is used for the evaluation of the statically trained models, i.e. the test set of the dataset used (RVL-CDIP or our private dataset), as detailed in Sect. 3.1.

4.1.1 Results for private dataset

Tables 5, 6 and 7 depict the results of the experiments where models are incrementally trained on the private dataset when adding the most frequent classes first, the least frequent classes first, and using the random class addition approach, respectively. These results show that in the scenarios where the model is incrementally trained using the most frequent classes first or with random class addition, a maximum accuracy of 97.65% is achieved, which is only slightly less than the accuracy achieved by the static model (98.23%). For the least-frequent-classes-first scenario, an accuracy of 97.46% is achieved.

Thus, in the rest of this section, we mainly consider the incremental scenario where new classes are added randomly (independently of their occurrence frequency) because it gives the best results, and it is certainly the most realistic scenario for most real-life applications where the user does not necessarily know the number of instances for each class.

Selecting the best rehearsal-based IL model parameters for comparing with static model

Let us now focus on selecting the best rehearsal-based IL model hyper-parameters, to compare with its statically trained counterpart. To this end, let us start by studying the effect of the parameter IEC: the number of instances from each existing class kept in memory when classes are added in random order. Analysis of variance (ANOVA) [56] tests were carried out to compare the accuracies obtained via various experiments and IEC values for IL models trained with random addition of new classes (accuracies shown in Table 7). The test showed that the difference among the results obtained for all the experiments using various IEC values is statistically significant, with \(p-value<0.05\) (\(p-value=0.0007\)), meaning that the value of IEC has an impact on the model’s performance.

Fig. 3 Confusion matrix for the static model on the private dataset

To further explore the effect of IEC values, T-Tests were performed to determine the significance of the best-case result obtained via an IEC value of 200. To this end, pair-wise T-tests were performed to compare the accuracies obtained with IEC = 200 on one hand, and the results obtained via the remaining IEC values on the other hand. The results show that the difference between the accuracies obtained using IEC values of 50 and 100 on one hand, and 200 on the other hand, are statistically significant. However, the difference in the accuracies obtained with IEC values of 150, 200, and 250 is not significant. This shows that IEC values can be divided into two groups based on statistical significance: (50, 100) and (150, 200, 250), with the latter reaching the best performances.
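These comparisons can be reproduced with scipy along the following lines. This is a sketch: acc_by_iec is assumed to map each IEC value to the list of accuracies obtained across Exp1-Exp5 (the corresponding rows of Table 7), and the use of independent rather than paired t-tests is an assumption:

```python
from scipy import stats

def compare_iec_settings(acc_by_iec, best_iec=200):
    """One-way ANOVA across all IEC settings, followed by pair-wise
    t-tests of the best-performing setting against every other one."""
    _, p_anova = stats.f_oneway(*acc_by_iec.values())
    pairwise = {iec: stats.ttest_ind(acc_by_iec[best_iec], accs).pvalue
                for iec, accs in acc_by_iec.items() if iec != best_iec}
    return p_anova, pairwise
```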

Now that we selected the optimal values of IEC, let us focus on the effect of the parameters BC and NC: the numbers of base classes and new classes at each iteration respectively, when classes are added in random order. ANOVA tests were performed to compare the accuracies obtained for various experiments (Exp1, Exp2, Exp3, Exp4, and Exp5, with different values of BC and NC as detailed in Table 4), for prediction on a static test set. This test showed that the differences between these accuracies are not statistically significant at \(p<0.05\) (\(p=0.96\)). Thus, the numbers of base classes (BC) and new classes (NC) do not significantly impact the model’s performance. Similar results were obtained for IL models trained using most-frequent and least-frequent classes first.

Fig. 4 Confusion matrix for the incrementally learned model on the private dataset (Exp 2, BC = 2, NC = 2, IEC = 200)

Comparing rehearsal-based IL model with the static model, in the presence of class imbalance

To further compare the behaviour of statically and incrementally trained models using a static test set, we perform an error analysis on the results obtained via the static model and the incrementally trained model that returns the highest accuracy with random addition of new classes (Exp 2, BC = 2, NC = 2, IEC = 200). Figures 3 and 4 depict confusion matrices for all the classes in the static test set, predicted via the static and incrementally trained models, respectively. To help analyze the confusion matrices, the mappings between the class IDs in the confusion matrices and the names of the document categories in the corresponding datasets are depicted in Table 8.

Table 8 Mappings between class IDs and document categories in the private (French) and RVL-CDIP(English) datasets

These confusion matrices show that the classification behaviours of incremental and static learning approaches are quite similar, even though, on average, the incremental learning model misclassifies a given class with fewer other classes than its statically trained counterpart. Indeed, for the static model, on average the number of labels assigned to one class is 3.04 (including the true label), whereas this number is 2.31 for the incrementally trained model. The reason could be that a statically trained model is trained using all classes at once, hence leading to confusion with more other classes, for one given class. On the other hand, with rehearsal-based incrementally trained models, new classes are learned in multiple iterations with fewer classes per iteration, particularly in initial iterations, which might lead to confusion with fewer other classes, for one given class.
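A possible way to compute the per-model “number of labels assigned to one class” figures quoted above, assuming they correspond to the average number of non-zero entries per confusion-matrix row:

```python
import numpy as np

def average_labels_per_class(conf_matrix):
    """Average number of distinct labels predicted for each true class,
    i.e. the mean count of non-zero entries per confusion-matrix row."""
    cm = np.asarray(conf_matrix)
    return np.count_nonzero(cm, axis=1).mean()
```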

This assumption is further supported by the observation that in the case of an incrementally learned model, misclassification mostly occurs among the classes that are trained together for the first time in the same training iteration. For instance, when, for incremental training, classes 42 (Insurance Daily Allowance) and 1 (Account Dues) were trained as new classes in the same training iteration, class 42 was misclassified as class 1 for 50% of instances (see Fig. 4) whereas, with the statically trained model, class 42 is mostly confused with class 30 (see Fig. 3).

Furthermore, comparisons of prediction accuracies for individual classes between the statically and incrementally trained models show that prediction accuracies for both models are similar in most cases, with a very slight difference in some cases. The most notable differences between accuracies are observed for the classes with IDs 32 (Mail Account Dues) and 46 (ID, i.e. identity documents), for which the incremental learning model returns an accuracy of 100%, whereas the static model fails to correctly predict any instance of these classes. These results are striking, given that we use the same test set for static and incremental learning (as explained before).

Further investigation reveals that classes with Ids 32 and 46 are among the least represented both in the training and test sets (due to class imbalance in our private dataset). The reason for the better performance of incremental models for least represented classes could be that in the case of the incremental learning model, the class imbalance in each iteration is less compared to static models trained on the whole dataset (in particular, thanks to the use of a bounded-sized memory).

4.1.2 Results for RVL-CDIP dataset

For the RVL-CDIP dataset, the results for the incrementally trained model are presented in Table 9. The results show that a maximum performance of 78.50% is achieved when the IEC (instances from existing classes) value is 200 and the batch size for base classes (BC) and new classes (NC) is 5 (Exp5). The performance achieved via the incremental model is slightly better than the static model performance on the RVL-CDIP test set: 76.49%.

Selecting the best rehearsal-based IL model parameters for comparing with static model

Just like for our private dataset, ANOVA tests show that the parameter IEC is statistically linked to the accuracy (here \(p-value=0.00009\)), that the effect of the number of base classes (BC) and new classes (NC) is not statistically significant at \(p-value=0.31\), and pair-wise T-tests show that based on statistical significance, IEC values can be divided into two groups: (50, 100) and (150, 200, 250), the latter reaching the best performances.

Thus, in the rest of this section, we compare the static model with the rehearsal-based IL model that gives us the best accuracy on the RVL-CDIP static test dataset when classes are added in random order: IEC=200, BC=NC=5.

Table 9 Percentage accuracies for RVL-CDIP dataset on test set when incremental learning is performed via random addition of new classes

Comparing rehearsal-based IL model with the static model (balanced classes)

Figures 5 and 6 give the confusion matrices for all the classes in the RVL-CDIP test set predicted via static and incrementally trained model, respectively. The comparison of these two confusion matrices reveals that, on average, the static model confuses a given class with more other classes (10.32) compared to the incremental learning model (8.06). This behaviour is similar to the results obtained on our private dataset.

Fig. 5 Confusion matrix for the RVL-CDIP dataset using the static model (cells contain accuracies)

Fig. 6 Confusion matrix for the RVL-CDIP dataset using the incrementally learned model (Exp 5, IEC = 200)

The comparison of the individual accuracies shows that, overall, the performance of the incremental learning model is slightly better than that of the static model on the same fixed test set. The reason could be that for balanced datasets such as RVL-CDIP, in the case of incremental learning, the relationship between the feature and label sets is learned several times (during multiple iterations), particularly for the classes in the initial iterations.

4.2 Effects of class order, BC, NC, and IEC values for private dataset

This section presents the results of the experiments performed on our private dataset (with class imbalance) to study the effects of (i) the order in which new classes are added, (ii) the batch sizes of the base and new classes (BC/NC), and (iii) the IEC (instances from existing classes) on the performance of incrementally trained models.

In the previous section, we studied the effects of these hyper-parameters only with the aim of selecting the optimal model settings before comparing the incremental model to its statically trained counterpart (on a fixed test dataset). In this section, on the other hand, we are going to study more thoroughly the effects of these hyper-parameters on the IL model’s performances at each incremental iteration, depending on the type of class considered (newly added classes vs. classes added in earlier iterations).

To this end, the model performance is evaluated against the average of accuracies and weighted F1 values for: (i) all the classes in the validation sets for each iteration, (ii) “old” classes only, in the validation sets for each iteration, and (iii) “new” classes only, in the validation sets for each iteration.

To study the effect of the order in which the new classes are added in each iteration, we compare three scenarios on our highly imbalanced private dataset: most-frequently-occurring classes first, least-frequent-occurring classes first, and random addition of new classes.

4.2.1 Private dataset–most-frequently-occurring classes first

Table 10 depicts the average accuracies achieved on the validation sets for all the classes of the IL models trained using the classes with the largest number of instances first. As explained earlier, we consider NoI validation sets in total (one for each iteration), where NoI depends on the total number of classes and the values of BC and NC, as detailed in Eq. (1). The results show that the overall best observed performance (93.09%) is achieved for Exp1 (BC = 2, NC = 1) with an IEC value of 200.

Table 10 Average of percentage accuracies on private dataset for all batches of validation sets when incremental learning is performed using most frequently occurring classes first

Analysis of variance (ANOVA) tests show that the difference among the average accuracies obtained for each experiment using various IEC values is significant at \(p<0.05\) (\(p=0.002\)). On the other hand, ANOVA tests show that the difference in average accuracies for the various experiments (“Exp1”–“Exp5”, see Table 4) is not statistically significant (\(p-value=0.95\)), showing no significant effect of the number of base classes (BC) and new classes (NC) on the accuracy. Just as on the static test set, pair-wise T-tests show that, based on statistical significance, IEC values can be divided into two groups: (50, 100) and (150, 200, 250), the latter reaching the best performances.

Table 11 shows the average accuracies achieved on the NoI validation sets for old classes only in the corresponding learning iterations (most-frequently-occurring-classes-first scenario). Just like when considering all classes in the validation set (see Table 10), the results obtained on “old” classes only show that the overall best performance (93.13%) is achieved for Exp1, where the batch size for base classes (BC) is 2 and NC (new classes) is 1, with an IEC value of 200 (even though, as for the average accuracies on all classes, ANOVA and pair-wise T-tests show that there is no statistical difference between the average accuracies of Exp1-Exp5, nor between IEC = 200 and the other IEC values in the group {150, 200, 250}).

Table 11 Average of percentage accuracies for private dataset for old classes in all batches of validation sets when incremental learning is performed using most frequently occurring classes first

Average accuracies over the NoI validation sets for newly added classes, for IL models trained via the most-frequently-occurring-classes-first strategy, are depicted in Table 12. The highest average accuracy of 78.43% is achieved when the batch size of base classes (BC) and newly added classes (NC) is 5, with an IEC value of 150. In this specific case, though, ANOVA tests reveal no statistically significant effect of the parameter IEC nor of BC and NC on the average accuracy (\(p-value=0.072\) and \(p-value=0.58\), respectively).

The main reason for this lack of statistical significance could be the highly imbalanced nature of data at each iteration. This is because, during each iteration, the class imbalance remains very high, and the instances belonging to new classes remain much fewer than old class instances, irrespective of the IEC, BC, and NC values. Hence overall, there is no significant difference in the results obtained via different experiments (“Exp1”–“Exp5”) and IEC values.

Table 12 Average of percentage accuracies for new classes in all batches of validation sets when incremental learning is performed using most frequently occurring classes first

It is further observed that for all the experiments and all IEC values, the average accuracy values for old classes are higher than for new classes. As an example, Fig. 7 shows the accuracy values for the 10 iterations of Exp5 (\(TotalClasses=47\), BC = 5, NC = 5, see Eq. (1)) with an IEC value of 150. The figure shows accuracies for old classes, new classes, and all classes during each iteration, along with the rolling average values up to a particular iteration. It shows that, on average, the classification accuracy for old classes is higher than for new classes. The reason for this behaviour can be that, since models are trained via the most-frequently-occurring classes in the initial iterations, the model has more instances from the old classes and hence learns to classify old classes more effectively than new classes, which are less represented within a training batch.

Another important observation is that for initial learning iterations, the accuracy for both new and old classes is higher compared to the later learning iterations. The reason can be the class order, since the initial iterations contain classes with more instances, resulting in models having more information to learn as compared to later batches. Also, the number of classes in the early iterations is lower compared to later iterations, and hence the model has to classify fewer classes with more information per class in the early iterations, leading to better performance.

Fig. 7 Comparison of accuracies for old, new, and all classes for Exp5 with IEC = 150, for different iterations of incremental models trained on the private dataset, using most-frequently-occurring classes first

Similar observations are made for different values of BC and NC, as shown in Appendix A (link to the appendix in the footnote).Footnote 3

Table 13 Average of percentage accuracies for private dataset for all batches of validation sets when incremental learning is performed using least frequently occurring classes first
Table 14 Average of percentage accuracies for private dataset for old classes in all batches of validation sets when incremental learning is performed using least frequently occurring classes first

4.2.2 Private dataset–least-frequently-occurring classes first

Let us now focus on the scenario where the least frequently occurring classes are learned first. Table 13 shows the average accuracies over all the validation sets for the IL models trained following this scenario. The results show that the best-case accuracy of 67.02% is obtained for Exp5 (BC = 5, NC = 5) with an IEC value of 50.

In this scenario, hypothesis testing gives very different results compared to the most-frequently-occurring-classes-first scenario.

First, when using ANOVA tests to compare the average accuracies between experiments “Exp1”–“Exp5”, the differences are found to be statistically significant (\(p = 0.00004\)). This behaviour can be explained by the fact that, when training the model incrementally with the least-frequently-occurring classes first, increasing the batch size for new classes lets the model train with more information. Also, with a higher number of classes under this strategy, it is possible that, even with few instances per class, some classes are more easily recognizable, resulting in a better overall average accuracy.

Second, ANOVA tests show no significant difference between the different IEC values. The reason could be that 18 of the 47 classes (roughly 38%) have fewer than 50 instances; hence, for the earliest 38% of the incremental training iterations, the effective IEC remains below 50, reducing the overall impact of the IEC values.

Table 15 Average of percentage accuracies for private dataset for new classes in all batches of validation sets when incremental learning is performed using least frequently occurring classes first

Average accuracies for old classes over all validation sets, for models trained incrementally with the least-frequently-occurring classes first, are given in Table 14. The best-case accuracy of 61.71% is achieved when new classes are added in batches of NC = 1 with an IEC value of 150. Just as for all classes, ANOVA tests show a statistically significant difference in the average accuracies when BC and NC vary, but not when IEC varies.

The reason for obtaining the best results with NC = 1 could be that, when training with the least-frequent classes first, the number of instances in old classes at each iteration is lower than the number of instances in new classes. When new classes are added in larger batches, the dataset becomes more imbalanced than when they are added in smaller batches. Thus, with large values of NC, the model becomes biased toward the new classes, resulting in poor performance on the old classes (catastrophic forgetting).
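
This imbalance mechanism follows directly from how a rehearsal batch is assembled. The sketch below is our reading of the protocol, assuming IEC acts as a per-class cap on rehearsed instances (consistent with the observation above that classes with fewer than 50 instances cap the effective IEC); the function and variable names are illustrative, not taken from any released code.

```python
# Illustrative assembly of one rehearsal training batch.
import random

def build_rehearsal_batch(old_data, new_data, iec):
    """old_data / new_data map class label -> list of training instances."""
    batch = []
    # Rehearsal memory: keep at most IEC instances per old class.
    for label, instances in old_data.items():
        kept = random.sample(instances, min(iec, len(instances)))
        batch.extend((x, label) for x in kept)
    # Newly added classes contribute all of their available instances.
    for label, instances in new_data.items():
        batch.extend((x, label) for x in instances)
    random.shuffle(batch)
    return batch

# With least-frequent-first ordering, the new classes supply far more
# instances than IEC * (number of old classes), so a large NC skews the
# batch toward new classes -- the bias toward new classes observed here.
```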

Average accuracies for new classes over all validation sets are shown in Table 15. The best-case accuracy of 73.61% is achieved when new classes are added in batches of 5 (BC = 5, NC = 5) with an IEC value of 50.

Unlike the results obtained for all classes and for old classes, ANOVA tests show a statistically significant difference in the average accuracies both when BC and NC vary (\(p = 0.0018\)) and when IEC varies (\(p = 0.014\)).

The reason for obtaining better results with bigger batch sizes could be that, when training the model with the least-frequently-occurring classes first, the new classes in each iteration contain the majority of instances. Adding a larger batch of new classes therefore feeds the model with more instances from new classes, resulting in improved performance on the new classes (but catastrophic forgetting of old classes, as discussed above).

The overall results shown in Fig. 8 highlight that, for each iteration of Exp5 (BC = 5, NC = 5) with an IEC value of 50, the accuracy values for newly added classes are higher than for old (existing) classes. The reason could be that, since the model is trained with the least-frequently-occurring classes first, the new classes in each iteration contain more instances than the old classes. Hence the model learns to classify new classes more accurately than old classes, which are less represented within a training batch. Once again, this is the opposite of what is observed when the incremental learning models are trained with the most-frequently-occurring classes first, as depicted in Fig. 7.

Fig. 8

Comparison of accuracies for old, new, and all classes for Exp5 with IEC = 50, for different iterations of incremental models trained on private dataset using least-frequently-occurring classes first

Another important observation is that, for early batches, the accuracy for both new and old classes is lower than for the later batches. This behaviour can result from the class order: the earlier iterations contain classes with fewer instances, so the model has less information to learn from in the early batches than in the later ones.

Similar observations are made for different values of BC and NC, as shown in Appendix B.Footnote 4

4.2.3 Private dataset–random addition of new classes

The results of the experiments on our private dataset where new classes are added in random order are presented in Tables 16, 17, and 18.

Table 16 Average of percentage accuracies for private dataset for all batches of validation sets when incremental learning is performed by randomly adding new classes
Table 17 Average of percentage accuracies for private dataset for old classes in all batches of validation sets when incremental learning is performed by random addition of new classes
Table 18 Average of percentage accuracies for private dataset for new classes in all batches of validation sets when incremental learning is performed by randomly adding new classes

Table 16 shows that, over all classes, the best-case average accuracy of 95.25% is achieved across all validation sets for Exp1 (BC = 2, NC = 1) when the IEC value is 250. This result represents an improvement of 1.55% over the baseline.

Just as when the most frequent classes are learned first, ANOVA and pairwise T-tests show no statistical difference between the average accuracies of “Exp1”–“Exp5”, nor among the IEC values in {150, 200, 250}.

The average accuracies over the validation sets for old classes in the private dataset are presented in Table 17. The highest average accuracy of 95.32% is achieved for Exp1 (BC = 2, NC = 1) with 250 instances from existing classes (IEC). ANOVA and pairwise T-tests show no statistical difference between the average accuracies of “Exp1”–“Exp5”, nor among the IEC values in {150, 200, 250}.

The reason for the high performance of Exp1 with NC = 1 could be that, in each iteration, the majority of classes come from previous batches and have already been used to train the model in previous iterations; only one new class is added per batch. Hence the model performs best on old classes in the case of Exp1.

Table 18 shows the average accuracies for new classes over all validation sets. The best-case average accuracy of 91.66% is achieved for Exp5 (BC = 5, NC = 5) with an IEC value of 250. The reason could be that, when new classes are added in larger batches, the model has more information to learn from than when they are added in smaller batches.

However, for the random addition of new classes, the best-case average accuracy for new classes (91.66%) is still lower than the best-case accuracy for old classes (95.32%). This behaviour can be attributed to the nature of the incremental learning model: the difference between the numbers of old and new classes is small in the initial training iterations, but grows in later iterations, as the incremental model is repeatedly retrained on old classes over more iterations than on new ones.
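
To make this widening gap concrete, using the experiment's notation (the formula is our restatement of the protocol, with \(t\) indexing the incremental iterations after base training):

\[
|\mathcal{C}_{\mathrm{old}}(t)| = BC + (t-1)\,NC, \qquad |\mathcal{C}_{\mathrm{new}}(t)| = NC .
\]

For Exp1 (BC = 2, NC = 1), the model therefore faces 2 old classes versus 1 new class at \(t = 1\), but 11 versus 1 at \(t = 10\): the old classes increasingly dominate both the label space and the rehearsal memory.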

In this specific case, ANOVA tests reveal no statistically significant effect of the IEC parameter, but a significant effect of the BC and NC parameters on the average accuracy (\(p = 0.061\) and \(p = 0.00009\), respectively). However, the statistical significance for new classes across the experiments in this section cannot be considered definitely conclusive. Indeed, the order in which the classes are added is totally random and the dataset is imbalanced, so the manner in which the documents arrive may change the effect of the IEC, BC, and NC values (as shown in the previous experiments with the extreme cases of most-frequently-occurring and least-frequently-occurring classes first).

The overall results from this section, detailed in Appendix C,Footnote 5 show that, in the case of random addition of new classes, the maximum IEC value of 250 produces the highest average accuracies for all, new, and old classes across all training iterations. This supports the hypothesis that, overall, using more instances from previous iterations can improve model performance when new classes are added randomly. However, the significance tests show that the performance differences between IEC values of 150, 200, and 250 are not statistically significant; hence, to save training time and memory, an IEC value of 150 may be used for incremental training.

Table 19 Average of percentage accuracies for RVL-CDIP dataset for all batches of validation sets when incremental learning is performed by randomly adding new classes
Table 20 Average of percentage accuracies for RVL-CDIP dataset for old classes in all batches of validation sets when incremental learning is performed by randomly adding new classes
Table 21 Average of percentage accuracies for RVL-CDIP dataset for new classes in all batches of validation sets when incremental learning is performed by randomly adding new classes

4.3 Effects of BC, NC, and IEC values for RVL-CDIP dataset

Our private real-world dataset is highly imbalanced, hence we evaluated three training scenarios depending on the order in which new classes are added at each incremental learning iteration (most-frequently-occurring classes first, least-frequently-occurring classes first, and random). In contrast, the publicly available RVL-CDIP dataset is balanced; hence, experiments are only performed where new classes are randomly added at each incremental learning iteration. In this section, we analyse the differences observed when comparing the random addition of new classes on a balanced dataset (RVL-CDIP) and on an imbalanced dataset (our private dataset, see Sect. 4.2.3).

The results of the experiments on the RVL-CDIP dataset where new classes are added in random order are presented in Tables 19, 20, and 21.

Table 19 shows the average accuracies for all classes over all validation sets. Just as in the imbalanced case, the best average accuracy (here 84.36%) is achieved for Exp1 (BC = 2, NC = 1) with IEC = 250. Thus, it seems that, independently of the balanced or imbalanced nature of the dataset, when new classes are added in a random order, adding as few new classes as possible at each iteration, with a large memory size, is the best strategy. This result further represents an improvement of 3.6% over the baseline.

Table 22 Results for the comparison of incremental learning model with static learning model

However, we must moderate this conclusion by noting that, here again, ANOVA and pairwise T-tests show no statistical difference between the average accuracies of “Exp1”–“Exp5”, nor among the IEC values in {150, 200, 250}. Thus, to save training time and memory, an IEC value of 150 may be used for incremental training.

Table 20 shows the average accuracies for old classes only, over all validation sets. Again, as in the imbalanced case, the best-case average accuracy (here 83.00%) is achieved for Exp1 (BC = 2, NC = 1) with IEC = 250. The statistical-significance analysis using ANOVA and pairwise T-tests is also consistent with the results obtained on our private dataset for the same experiment.

Finally, Table 21 shows the average accuracies for new classes only, over all validation sets. The best-case average accuracy of 93.77% is achieved for Exp1 (BC = 2, NC = 1) with an IEC value of 50. This behaviour differs from that observed on our imbalanced dataset, where the best-case accuracy for new classes was achieved for Exp5 (BC = 5, NC = 5) with an IEC value of 250. The reason for this difference could be the nature of the data. With our highly imbalanced private dataset, when new classes are added in random order, the instances added for some of the new classes may be fewer than the total number of instances held in memory. Hence, in the imbalanced case, the new classes are better recognized when more new classes are added within one iteration, whereas when the dataset is balanced, it remains best to add as few new classes per iteration as possible. To counterbalance the small number of new classes and encourage the plasticity of the model, new classes are learned best when only a small memory is kept (IEC = 50).

In this case, ANOVA tests and pairwise T-tests show that the effect of the number of base classes (BC) and new classes (NC) is statistically significant (\(p = 0.03\)), and that an IEC value of 50 gives significantly better performance than the other IEC values.

The figures containing the detailed results for the RVL-CDIP dataset, where classes are randomly added at each new iteration, are presented in Appendix D.Footnote 6

Even though the following findings should be confirmed with experiments on other balanced datasets, the results presented in this section tend to show that, when the dataset is balanced, the best performances are achieved when as few classes as possible are added at each iteration. As for the number of instances from existing classes (and thus the memory size), it can be fixed by finding a good trade-off between plasticity, stability, and training time/memory (here IEC = 150).

Table 23 Overall results of the experiments to find the best values for the number of instances from existing classes (IEC), the number of base classes (BC), and the number of new classes (NC) for the incremental learning model over multiple iterations

5 Discussion & recommendations

The findings of this study demonstrate that the rehearsal-based incremental learning strategy presented herein surpasses the baseline model’s performance by 1.55% and 3.66% on our private dataset and the RVL-CDIP dataset, respectively. This superior performance can be attributed to the effective use of weight sharing among transformer models across multiple iterations, enabling the model to retain the knowledge of previously encountered classes while learning new ones.

Furthermore, the comparison of the rehearsal-based incremental learning model with the static model (on a static test dataset) reveals some interesting observations. Table 22 summarizes them for both our imbalanced private dataset and the balanced RVL-CDIP dataset.

For both datasets, the optimal value of the parameter IEC (the number of instances from existing classes kept in memory at each incremental learning iteration), i.e. the one giving the best average accuracy, is IEC = 200. However, pairwise T-tests show that, based on statistical significance, the IEC values can be divided into two groups: {50, 100} and {150, 200, 250}. There is no statistically significant performance difference within either group, but there is a significant difference between the groups. Thus, we recommend using an IEC value of 150, so as to obtain results comparable with IEC = 200 while saving training memory and resources.
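
Such a pairwise grouping can be checked along the following lines; the sketch is illustrative and the accuracy samples are placeholders, not our measurements.

```python
# All-pairs T-tests over accuracies grouped by IEC value.
from itertools import combinations
from scipy.stats import ttest_ind

acc_by_iec = {
    50:  [91.0, 90.2, 91.5], 100: [91.3, 90.8, 91.1],
    150: [94.6, 94.1, 94.9], 200: [95.0, 94.7, 95.2], 250: [94.8, 95.1, 94.9],
}

for a, b in combinations(acc_by_iec, 2):
    _, p = ttest_ind(acc_by_iec[a], acc_by_iec[b])
    print(f"IEC {a} vs {b}: p = {p:.4f}")
# With data like this, the comparisons split the IEC values into two groups,
# {50, 100} and {150, 200, 250}: significant differences only across groups.
```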

For the base class (BC) and new class (NC) values, even though the statistical significance tests are not conclusive, we recommend using a small number of new classes in real-life scenarios (NC = 1 or NC = 2 if possible), since real-life datasets are mostly imbalanced and adding too many new classes at each iteration may increase the overall class imbalance.

A summary of our analysis of how incrementally trained models are affected by (i) the order in which new classes are added, (ii) the IEC (instances from existing classes), and (iii) the batch sizes of the base and new classes (BC/NC), for both our private (imbalanced) dataset and the RVL-CDIP dataset, is presented in Table 23.

Overall, based on our experimental results, it can be concluded that the choice of the number of instances from existing classes (IEC) and of the batch sizes of the base and new classes (BC and NC) depends on the nature of the dataset (balanced or imbalanced). With an imbalanced dataset, the order (most frequent first, least frequent first, or random) in which the classes are selected for incremental training across multiple iterations also affects the choice of values for IEC, BC, and NC.

In most real-life scenarios, datasets are imbalanced and models are trained with the most-frequently-occurring classes first or with random class addition (the least-frequently-occurring-first scenario being very unlikely in real applications). In such cases, or even if the expected number of instances from future classes is unknown, we recommend using 150 instances from existing classes (IEC) per iteration, keeping the number of base classes (BC) to a minimum, and adding new classes one by one (NC = 1).

However, if the dataset is initially very small, with very few instances per base class, and a large number of incoming documents from new classes are expected over time, then we recommend using a larger batch size for new classes (NC).
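
These recommendations can be condensed into a default configuration, as in the purely illustrative sketch below (the class and field names are ours and not tied to any released code), with the larger-NC exception from the previous paragraph noted in a comment.

```python
# Illustrative default hyperparameters distilled from Sect. 5, for the common
# real-life case: imbalanced data, most-frequent-first or random class order.
from dataclasses import dataclass

@dataclass
class RehearsalConfig:
    iec: int = 150  # instances from existing classes kept in memory
    bc: int = 2     # base classes, kept to a minimum (as in Exp1)
    nc: int = 1     # new classes added per incremental iteration
    # If the initial dataset is very small and many documents from new classes
    # are expected over time, use a larger batch of new classes instead
    # (e.g. nc = 5, as in Exp5).

config = RehearsalConfig()
```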

6 Conclusion and future work

Document classification is an important task for corporate organizations and private companies. However, the incessant evolution of the documents to classify, and particularly the arrival of new classes of documents over time, makes it difficult to train a static machine learning model for document classification. Incremental learning models cater to such scenarios, where document datasets constantly evolve with new information.

In this study, we propose a rehearsal-based incremental learning model for document classification that learns from the text extracted from evolving document datasets. The performance of the proposed model is compared with traditional deep learning models for the classification of business documents, using textual information from our private dataset and from the RVL-CDIP dataset. The results show that a rehearsal-based incremental learning model, if trained using all the training data (even across multiple iterations), performs similarly to a static machine learning model on a fixed test set. This indicates that an incremental learning model can be successfully employed for document classification when the data evolves at run time with the arrival of new documents, since the results are similar to those obtained with a static learning model.

Moreover, we investigate the optimal values of the parameters controlling the memory size and variety (namely BC and IEC) and the batch size for new classes (NC), and analyse their effects on the performance of the rehearsal-based model along its training iterations. Based on this analysis, we formulate a list of recommendations for these values, depending on the scenario at hand and the nature of the dataset.

Though this research work analyses various aspects of incremental learning models for document classification, some possibilities remain to be explored.

First, we could make our evaluation scenarios even more realistic, for instance by simulating scenarios where the instances from the NC new classes are not all fed to the model in a single training iteration, where the value of parameter BC differs largely from NC, or where the value of IEC varies across training iterations.

Second, this study explores various aspects of the rehearsal-based approach to incremental learning and compares it with a traditional deep learning model for document classification. It would be useful to explore how other incremental learning approaches, such as regularization-based approaches and variational continual learning, can be used for document classification, and how they compare with static deep learning models.

Finally, another very interesting future direction is to use not only the text but also the layout (image) information of the documents, for multimodal incremental document classification.