1 Introduction

With the rapid development of artificial intelligence (AI) [1] in recent years, computers can automatically handle many tasks in industry. In manufacturing [2,3,4], companies deploy robots and automated equipment in place of human labor, freeing employees from repetitive and tedious tasks. AI in medicine [5,6,7] has also become a hot research topic, with breakthrough achievements in multiple directions, e.g., medical robotics [8], medical diagnosis [9, 10], medical statistics [11], human biology [12], etc. Advances in AI have also changed the world of finance [13] by fueling the trend of quantitative research. Many Fintech companies apply machine learning methodologies to trading strategy decisions [14] and high-frequency trading [15] to increase profit. In the utility industry [16,17,18,19], companies likewise turn to AI technologies to minimize human intervention and reduce expenses.

This paper targets a typical application scenario in the utility industry. All utility companies face the same problem: how to address customer requirements accurately and quickly [20, 21]. Take Con Edison as an example: a large number of regulations from external regulatory bodies are received every day. To handle these regulations properly, Con Edison hires many people to read each regulation in full and then forward it to the relevant departments. If this classification process could be completed automatically or semi-automatically by AI, the company could not only improve work efficiency and classification accuracy, but also save the expense of training and hiring staff (Fig. 1).

Fig. 1
figure 1

Pipeline overview

In this paper, we present an automatic document classification pipeline (shown in Fig. 1) based on deep learning to solve the regulation classification task at Con Edison. The pipeline consists of two parts: i) a binary classification module that separates the regulations important to Con Edison from those that are not, and ii) a multi-label classification module that assigns each important regulation to specific departments within the company. Con Edison provides a large corpus of thousands of already processed regulation texts with two types of labels: i) important versus not important to Con Edison for the binary classification task; ii) multiple labels indicating the specific departments to which the regulation belongs. For the multi-label classification task, a regulation may not belong to only one specific department; it can concern multiple departments within Con Edison. Therefore, the second task in the pipeline is a multi-label, not a multi-class, classification task.

For the binary classification task, we utilize a support vector machine (SVM) [22], Naive Bayes (NB) [23], random forest [24], and an artificial neural network (ANN) [25], and combine the four binary soft classifiers with soft voting. The accuracy of the binary classification module in the pipeline reaches \(\approx 92\%\). For the multi-label classification task, we utilize the DocBERT model (a fully connected ANN added after the BERT model for classification). Moreover, we apply the binary cross-entropy loss (BCELoss) instead of the classical cross-entropy loss (CELoss), since the task is multi-label rather than multi-class. The accuracy of the multi-label classification module reaches \(\approx 80\%\) under the top-3 accuracy metric defined in Sect. 4.

This paper is organized as follows: Section 2 reviews the literature on binary classification, multi-label classification, and natural language processing. Section 3 describes the details of the datasets that we use for analysis. Section 4 defines the accuracy metrics utilized for the evaluation of the models. Section 5 discusses the construction of the automatic pipeline and its performance on the corresponding datasets. Finally, Sect. 6 concludes the contribution of our work.

2 Literature review

2.1 Binary classification

Binary classification is the process of classifying observations of a dataset into two groups based on a classification rule. It is a classical topic with many practical scenarios, e.g., medical testing [26], quality control in industry [27], information retrieval [28, 29], etc. Many commonly used machine learning techniques have been introduced to solve binary classification problems. The Naive Bayes (NB) [23] classifier constructs a probability model based on Bayes' theorem with strong independence assumptions between the features. The decision tree (DT) [30] is a nonparametric supervised learning algorithm that constructs a classification/regression tree by identifying ways to split a dataset based on different conditions. Since a single decision tree might grow over-complex and fail to generalize well, an ensembling method, random forest (RF) [24], constructs a multitude of decision trees during the training process and then lets them vote for the final result. Further tree-based ensembling methods, e.g., AdaBoost [31], XGBoost [32], lightGBM [33], etc., became popular with desirable performance on many machine learning tasks. [22, 34] proposed the support vector machine (SVM), which constructs a hyperplane with the largest separation, or margin, between two classes. By introducing the kernel trick [22], which nonlinearly maps the inputs into a very high-dimensional space, SVM can also solve nonlinear classification problems. In 1958, the psychologist Frank Rosenblatt [35] borrowed the concept of the biological neural network into computer science and proposed the first artificial neural network (ANN). The fully connected neural network is the simplest ANN, in which the connections between neurons of a biological neural network are modeled as weights [25]. Neural networks spread quickly after being proposed due to their excellent performance across a wide range of applications [36, 37]. Recently, many different types of neural networks have been proposed based on the requirements of different tasks. The convolutional neural network (CNN) [38, 39] introduces a shared-weight architecture of convolution kernels that slide along the input features, reducing the number of parameters compared with fully connected neural networks and achieving excellent performance in image processing tasks. To deal with tasks involving sequential data, e.g., speech recognition [40], video recognition [41], text generation [42], etc., recurrent neural networks (RNN) [43], which use previous outputs as inputs, were proposed. To mitigate the loss of long-term memory in the RNN, the long short-term memory (LSTM) [44] adds a gate construction (input gate, output gate, and forget gate) to the vanilla RNN to control whether the input of the current step is remembered. The gated recurrent unit (GRU) [45] simplifies the LSTM structure by decreasing the number of gates from three to two and achieves comparable performance on multiple tasks.

2.2 Multi-label classification

In traditional multi-class classification [46] tasks, an observation in the dataset carries only a single label from a set of labels, and cross-entropy [47] can be used as the objective function. In multi-label classification tasks, however, a single observation may carry multiple labels from the label set. Multi-label classification [48, 49] was first motivated by text classification [50] and medical diagnosis [48], where text documents contain more than one theme and patients may suffer from more than one disease. With the rapid development of technology, multi-label classification has become essential in many modern applications, e.g., protein function classification [51, 52], music categorization [53,54,55], semantic scene classification [56,57,58], etc. Methods for the multi-label classification task fall into two groups: i) problem transformation methods [59,60,61] and ii) algorithm adaptation methods [62, 63]. Problem transformation methods transfer the original multi-label problem into a combination of several multi-class classification tasks. Algorithm adaptation methods, in contrast, modify algorithms so that they can be applied directly to the original multi-label task; for example, the cross-entropy loss used for multi-class tasks can be replaced by the binary cross-entropy loss, which suits the multi-label setting [64]. However, traditional multi-label classification methods encounter many obstacles in extreme multi-label classification problems with thousands of labels, e.g., recommendation systems [65,66,67], natural language processing [68] and image processing [69]. Many new techniques, e.g., one-versus-all (OvA) classifiers [70,71,72], tree-based classifiers [66, 73], deep learning-based classifiers [74,75,76], and embedding-based classifiers [77, 78], have been proposed to solve the extreme multi-label classification task. Our task, however, has only hundreds of labels, so it is not an extreme multi-label classification problem and can be solved by conventional multi-label classification methods.
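To make the loss-level algorithm adaptation above concrete, the following minimal PyTorch sketch (our own illustration, not taken from any cited work) contrasts the multi-class cross-entropy loss with the binary cross-entropy loss applied independently to each label; BCEWithLogitsLoss folds the per-label sigmoid into the loss.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 6)  # batch of 4 examples, 6 candidate labels

# Multi-class: each example has exactly one correct class index.
multiclass_targets = torch.tensor([2, 0, 5, 1])
ce_loss = nn.CrossEntropyLoss()(logits, multiclass_targets)

# Multi-label: each example has a 0/1 indicator vector, possibly with several 1s.
multilabel_targets = torch.tensor([[0, 1, 1, 0, 0, 0],
                                   [1, 0, 0, 0, 1, 0],
                                   [0, 0, 0, 0, 0, 1],
                                   [1, 1, 0, 0, 0, 0]], dtype=torch.float)
# BCEWithLogitsLoss applies a sigmoid per label, so labels are scored independently.
bce_loss = nn.BCEWithLogitsLoss()(logits, multilabel_targets)

print(ce_loss.item(), bce_loss.item())
```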

2.3 Natural language processing

Language is among the most important mental creations of humans and distinguishes us from animals [79, 80]. More than 7,100 spoken languages exist today, and our connected world is filled with an abundant volume of natural language text covering many kinds of knowledge [81]. With the rapid advance of AI, scientists are placing more and more emphasis on natural language processing (NLP) [82,83,84] to enable machines to understand text as efficiently and accurately as humans.

NLP technology is widely employed in many applications, e.g., speech recognition [40], sentiment analysis [85], document classification [86, 87], natural language generation [88, 89], etc. An NLP system can be separated into two processes: i) data processing [90] and ii) model construction [84]. The data processing step [90] maps text documents into vectors that are understandable to computers. Many techniques have been proposed to vectorize long sentences or text documents by learning word associations from a large corpus of text, e.g., bag-of-words (BOW) [91], continuous bag-of-words (CBOW) [92] and skip-gram [92]. A large number of deep learning models, e.g., convolutional neural networks (CNN) [38], recurrent neural networks (RNN) [43], textCNN [93], BiLSTM [94, 95] and attention mechanisms [96], are utilized for model construction in NLP tasks. Recently, the emergence of powerful pre-trained models, e.g., CoVe [97], ELMo [98], the OpenAI generative pre-trained transformer (GPT) [99] and bidirectional encoder representations from transformers (BERT) [100], has dramatically increased the performance of deep learning models on multiple NLP tasks.

2.3.1 Transformer-based models

The transformer unit [101] was a milestone invention in NLP history and brought NLP into a new era. The self-attention mechanism proposed in the transformer captures bidirectional information from the whole text, allowing it to outperform, in many tasks, sequential models such as the RNN, textCNN, and LSTM, which only consider one-directional information. The powerful BERT model, built by stacking the encoder layers of the transformer block, is widely used in NLP tasks. RoBERTa [102] presents a replication study of BERT pretraining [100] and achieves a more powerful pretrained model by increasing the training time and batch sizes, removing the next-sentence-prediction objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data. The decoding-enhanced BERT with disentangled attention (DeBERTa) [103] further enhances BERT by introducing a disentangled attention mechanism, incorporating absolute positions in the decoding layer to predict the masked tokens during pre-training, and using a new virtual adversarial training method for fine-tuning. There are many BERT-based models for document classification. DocBERT [37] inserts a fully connected layer on top of the last hidden state vector of the BERT architecture. In the RoBERT [104] model, the hidden state vectors and posterior probabilities of the BERT model are stacked and then fed into an LSTM layer; the output of this LSTM serves as a document embedding. In the ToBERT [104] model, the hidden state vectors and posterior probabilities from BERT are also stacked, but this time they are fed into a transformer block, since transformers are known for capturing long-distance relationships between the words in a sequence. Hierarchical attention networks (HAN) [105] are designed to capture two basic insights into document structure. Accordingly, they use two levels of attention, word level and sentence level, and words and sentences are encoded with bidirectional GRU layers that summarize information from both directions.

The text-to-text transfer transformer (T5) [106, 107] is a comprehensive text-to-text model designed to address various NLP tasks. Unlike other multi-task language models that rely on task-specific architectural components and loss functions, T5's creators developed a unified learning approach that treats every NLP challenge as a text-to-text problem. This enables them to use a single, consistent model, loss function, and set of hyperparameters to obtain a unified, multi-task model. The ByT5 [108] model is a modified version of T5 that handles text as raw bytes rather than tokens. In contrast, models such as BERT need a separate tokenization process to divide documents into sub-word vocabularies, which can impose greater memory requirements because larger vocabularies necessitate extensive embedding matrices with numerous parameters. T5-based models have also been applied to text ranking [109] and document classification [110, 111] in recent works. The main difference between BERT-based and T5-based models is that BERT only includes encoders, while T5 contains both encoders and decoders and performs better on natural language generation (NLG) tasks. However, document classification is a natural language understanding (NLU) task, not an NLG task, so focusing on BERT-based models works well for our document classification task. Moreover, T5 is a text-to-text generation model that requires manually tuning the prompt [112,113,114]. We therefore focus on BERT-based models in our implementation.

3 Data analysis

In this section, we provide details about the datasets provided by Con Edison. Two dataset portions are granted to us at different stages of the project. The first portion is provided for training and validation at the beginning of the project, and the held-out dataset is provided at a later stage and used for testing. The dataset portions differ in their distributions, mainly because the held-out dataset contains the most recent samples. Despite having different distributions, they share the same structure with two components, Regulations and Obligations. The connection between these two components is established with an ID key: if the same ID key is associated with both a regulation text and an obligation text, the text in the obligation component is the highlighted, refined fragment of the corresponding text in the regulation component.

3.1 Train and validation dataset

Here, we provide the details on the initially provided dataset which we used for training and validation. As mentioned, it has two components: Regulations and Obligations.

3.1.1 Regulations

Regulations are longer documents, some several pages long, containing the laws or legislation put forward by the regulating state or the federal government. In this dataset, the regulations have two possible labels: Applicable and Not Applicable. As the label names imply, a regulation labeled Applicable contains a law or legislation that applies to at least one of Con Edison's departments, whereas a regulation labeled Not Applicable does not. In total, there are 5570 different applicable regulations and 2212 different not applicable ones in this dataset.

3.1.2 Obligations

Since regulation texts are long, not every part of a regulation, even an applicable one, contains the vital point of the announced law. In the Obligations dataset, only the important sentences or parts of the regulations are provided. Thus, Obligations are much shorter, containing at most two paragraphs, and most of them consist of several sentences from a single paragraph.

Note that several obligations can be deduced from the same regulation, since a regulation might contain important information in different paragraphs or sections. Besides, a single obligation might concern more than one department of Con Edison, which makes the task carried out on the obligation dataset a multi-label classification task. Note that the labels in this dataset are the department names, anonymized here as digit numbers for confidentiality purposes.

In total, there are 111 different department names or labels, 5320 different obligation texts, and 7428 different text-label pairs. This is thus a highly imbalanced and small dataset relative to the number of classes. We decided to group the departments with ten or fewer associated obligations into one label called Others, since the model cannot realistically learn patterns from very few samples. After this grouping, we are left with 59 different department labels, including the Others label, which has 158 samples. The histogram of the dataset after this grouping is provided in Fig. 2.

Fig. 2
figure 2

Histogram of the labels in the obligation dataset (The actual department and section names are hidden for confidentiality)
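A minimal sketch of this grouping step, assuming the obligation data is available as a list of (text, department) pairs (the variable names are ours; the threshold of ten follows the description above):

```python
from collections import Counter

# text_label_pairs: hypothetical list of (obligation_text, department_label) tuples
label_counts = Counter(label for _, label in text_label_pairs)

# Departments with ten or fewer associated obligations are merged into "Others".
mapping = {label: (label if count > 10 else "Others")
           for label, count in label_counts.items()}

grouped_pairs = [(text, mapping[label]) for text, label in text_label_pairs]
```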

3.2 Held-out test dataset

The held-out dataset has the same structure as the train and validation dataset. It contains 122 applicable and 1333 not applicable regulations. The obligations from the applicable regulations have 176 department labels. Although there are 27 different departments among the 176 department labels, six of them are grouped into the Others label using the dictionary obtained while forming the histogram in Fig. 2 from the training set. After this procedure, the histogram of the new set of obligations from the held-out dataset is provided in Fig. 3.

Fig. 3
figure 3

Histogram of the obligations per department in the held-out set

4 Evaluation metrics

In total, we evaluate the models with four different metrics: accuracy (%), soft accuracy (%), Top-k accuracy (%), and the normalized discounted cumulative gain (nDCG) score. Apart from the conventional accuracy score, the remaining three metrics are introduced for evaluating the performance of experiments on the multi-label classification task.

4.1 Accuracy score (%)

Accuracy is one of the most interpretable evaluation metrics for most experiments. It is simply the percentage of correctly predicted labels with respect to the total size of the evaluation set. When evaluating models for the multi-label classification task, a correct prediction is defined as an exact match between the target vector and the model's output layer after the sigmoid activation function is applied to each output neuron and the results are rounded to 0 or 1.

4.2 Soft accuracy score (%)

The soft accuracy score is more generous when evaluating performance on the multi-label classification task. In contrast to the conventional accuracy score, a prediction is counted as correct when at least one of the sigmoid-activated and rounded output neurons with value 1 corresponds to a target label. In this way, the model is evaluated in a more flexible manner. Notice that this is again a percentage.

4.3 Top-k accuracy score (%)

The Top-k accuracy score is a popular evaluation metric frequently used by the ML community, especially in the presence of many target classes. If the target class belongs to the list of the k most likely classes predicted by the model, the prediction is counted as correct. In the multi-label setting, a prediction is counted as correct if there is an overlap between the set of target classes and the set of the k most likely classes predicted by the model.
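For concreteness, the three accuracy variants above can be computed as in the following sketch (our own illustrative implementation, operating on a matrix of sigmoid outputs and a matrix of 0/1 target vectors):

```python
import numpy as np

def exact_match_accuracy(probs, targets):
    """Conventional accuracy: rounded sigmoid outputs must match the target vector exactly."""
    preds = (probs >= 0.5).astype(int)
    return 100.0 * np.mean(np.all(preds == targets, axis=1))

def soft_accuracy(probs, targets):
    """Soft accuracy: at least one predicted label (rounded to 1) is among the target labels."""
    preds = (probs >= 0.5).astype(int)
    return 100.0 * np.mean(np.any((preds == 1) & (targets == 1), axis=1))

def top_k_accuracy(probs, targets, k):
    """Top-k accuracy: the k most probable classes overlap with the target labels."""
    topk = np.argsort(-probs, axis=1)[:, :k]
    hits = [targets[i, topk[i]].sum() > 0 for i in range(len(targets))]
    return 100.0 * np.mean(hits)
```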

4.4 nDCG-k score

To understand the normalized discounted cumulative gain (nDCG) score, we first explain the discounted cumulative gain (DCG) score, of which nDCG is the normalized version. DCG quantifies the ranking success of the model prediction. Let y and \(\hat{y}\) be the true label vector and the output of the classifier, respectively. That is,

$$\begin{aligned} y&= [y_1, y_2, \ldots , y_L] \in {\{0,1\}}^{L} \\ \hat{y}&= [\hat{y}_1, \hat{y}_2, \ldots , \hat{y}_L] \in \mathbb {R}^{L} \end{aligned}$$

where L is the number of classes. Then, the DCG score is defined as follows:

$$\begin{aligned} \text {DCG-}k = \sum _{l \in \text {rank}_k(\hat{y})} \frac{y_l}{\log (l+1)} \end{aligned}$$

DCG-k measures the accuracy based on the k most probable classes in the prediction. The term \(\log (l+1)\) in the denominator controls the weight of each class: the higher a class is ranked in the prediction, the greater its impact on the final DCG score. Using this definition of DCG, we normalize DCG-k by its maximum achievable value (the ideal DCG-k) and formulate the nDCG score as:

$$\begin{aligned} \text {nDCG-}k = \frac{\text {DCG-}k}{ \sum _{l=1}^{\min (k, \Vert y\Vert _0)} \frac{1}{\log (l+1)} } \end{aligned}$$
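A minimal implementation of these scores is sketched below. We interpret \(l\) as the rank position of each of the top-\(k\) predicted classes (our reading of the formula); note that the base of the logarithm cancels in the nDCG ratio.

```python
import numpy as np

def dcg_k(y_true, y_score, k):
    """DCG-k: relevance of the top-k ranked classes, discounted by the log of the rank position."""
    top_k = np.argsort(-y_score)[:k]                 # indices of the k most probable classes
    positions = np.arange(1, len(top_k) + 1)         # rank positions 1..k
    return np.sum(y_true[top_k] / np.log(positions + 1))

def ndcg_k(y_true, y_score, k):
    """nDCG-k: DCG-k divided by the best achievable DCG-k for this label vector."""
    ideal_positions = np.arange(1, min(k, int(np.sum(y_true))) + 1)
    ideal = np.sum(1.0 / np.log(ideal_positions + 1))
    return dcg_k(y_true, y_score, k) / ideal if ideal > 0 else 0.0
```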

5 Pipeline

In this section, we provide details of our pipeline. Particularly, the pipeline consists of two different modules: the binary classification module and the multi-label classification module. Figure 4 shows the outline of our pipeline including the modules and their components.

Fig. 4
figure 4

Detailed diagram of the pipeline

The binary classification module is responsible for determining whether a given raw regulation text is applicable to Con Edison. If it is not applicable, the pipeline returns the result accordingly. If it is applicable, the raw text is then sent to the multi-label classification module. We emphasize that the raw text, rather than the version already processed in the binary classification module, is sent to the multi-label module, since the processing steps of the two modules differ. After receiving the text, the multi-label classification module predicts the k departments to which the case in the regulation most probably belongs. Details of these modules are explained in Subsections 5.1 and 5.2.

5.1 Binary classification module

The binary classification module works as a rough filter to separate the Regulation dataset into two parts: the documents that are Applicable to Con Edison and the documents that are Not Applicable to Con Edison. The structure of the binary classification module shown in Fig. 5 is constructed as follows:

  1. (1)

    Text processing: data cleaning and text vectorization via the bag-of-words method.

  2. (2)

    Binary soft classifiers: train four binary soft classifiers: Naive Bayes (NB), support vector machine (SVM), random forest (RF), and an artificial neural network with two hidden layers (ANN2).

  3. (3)

    Final prediction for binary classification: ensemble the binary soft classifiers via soft voting to obtain the final prediction.

Fig. 5
figure 5

Structure of binary classification module

5.1.1 Text processing

After the data cleaning process of removing punctuation, eliminating numbers, and removing stopwords, the remaining text dataset contains a total of 30733 different words. From the histogram of words shown in Fig. 6, 19108 words appear fewer than 10 times and 135 words appear more than 5000 times. On the one hand, the words that appear too frequently act like stopwords, for example, “company,” “following” and “state,” and do not help with the prediction. On the other hand, the large volume of rare words carries little information for training while drastically increasing the computational cost. Therefore, we remove rare words that appear fewer than 10 times and frequent words that appear more than 5000 times, which leaves a cleaned word dictionary of 11490 words.

Fig. 6
figure 6

Histogram of words set for binary classification task
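A sketch of this frequency-based filtering step, assuming the cleaned documents are available as a list of strings (the thresholds follow the description above; variable names are illustrative):

```python
from collections import Counter

# cleaned_docs: hypothetical list of documents, already stripped of punctuation, numbers and stopwords
word_counts = Counter(word for doc in cleaned_docs for word in doc.split())

# Keep only words that appear at least 10 times and at most 5000 times.
vocabulary = {word for word, count in word_counts.items() if 10 <= count <= 5000}
```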

The next step is to transform the text into vectors that can be processed mathematically by computers. We map the original text dataset into a vector space with the bag-of-words (BOW) [91] method. The BOW method assigns an index to each word in the cleaned word dictionary and records its number of occurrences in the text. Figure 7 shows an example that illustrates the BOW method.

Fig. 7
figure 7

Example of the bag-of-words method
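A minimal sketch of this vectorization step using scikit-learn's CountVectorizer on two toy documents (an illustrative implementation, not necessarily the exact tooling used in the pipeline; in our pipeline the vocabulary would be the 11490-word cleaned dictionary built above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["utility company files annual report",
        "company must report gas leak incidents"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.get_feature_names_out())    # indexed words of the dictionary
print(X.toarray())                           # per-document word counts
```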

5.1.2 Binary soft classifiers

For the binary classification task, we introduce four classical and powerful machine learning classification methods: Naive Bayes (NB) [23], support vector machine (SVM) [22, 34], random forest (RF) [24], and a fully connected artificial neural network with two hidden layers (ANN2) [25]. The NB classifier constructs a probability model based on Bayes' theorem with strong independence assumptions between the features. The SVM classifier separates the two classes by finding the hyperplane with the maximum margin. The RF classifier constructs a multitude of decision trees during the training process and then lets them vote for the final result. For the fully connected artificial neural network classifier, we compare three network structures (Fig. 8): (i) ANN1(100), a fully connected neural network with one hidden layer of 100 neurons; (ii) ANN1(1000), a fully connected neural network with one hidden layer of 1000 neurons; and (iii) ANN2, a fully connected neural network with two hidden layers of 1000 and 100 neurons, respectively. We perform a grid search on the hyperparameters (see Table 1): initial learning rate (\(lr_0=[0.01,0.005,0.001]\)), batch size (\(bs=[16,32,128]\)), dropout (\(dp=[0.1,0.3,0.6]\)) and weight decay (\(wd=[10^{-4},10^{-5},0]\)). Table 2 lists the best two parameter settings for each neural network. We find that ANN2, whose first hidden layer has 1000 neurons and whose second hidden layer has 100 neurons, outperforms the other network structures by \(0\sim 1\)%.

Fig. 8
figure 8

Structures of three types of fully connected artificial neural networks: ANN1(100), ANN1(1000) and ANN2 from left to right

Table 1 Hyperparameter search grid for ANN models
Table 2 Best two settings of the parameters with maximal validation accuracy after training for 200 epochs on ANN1(100), ANN1(1000) and ANN2. We compute the validation accuracy after each epoch. The final val. accu. is the validation accuracy at the end of the training process. The best val. accu. is the highest validation accuracy during the training process
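For reference, a minimal PyTorch sketch of the ANN2 architecture described above (two hidden layers of 1000 and 100 neurons on top of the 11490-dimensional bag-of-words features; the dropout placement and the two-logit output layer are our assumptions):

```python
import torch.nn as nn

class ANN2(nn.Module):
    """Fully connected binary classifier: 11490-dim BoW input -> 1000 -> 100 -> 2 logits."""
    def __init__(self, input_dim=11490, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1000), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1000, 100), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(100, 2),
        )

    def forward(self, x):
        return self.net(x)
```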

Therefore, we choose the ANN2 structure as the soft classifier under the fully connected neural network setting and compare it with the other soft classifiers: NB, SVM and RF. Table 3 lists the training and validation accuracy of the four binary soft classifiers. We do not compare the best validation accuracy across all classifiers, since there is no concept of epochs in the training procedures of NB and RF. As Table 3 shows, each soft classifier reaches an accuracy above 85%, and ANN2 outperforms the others with an accuracy above 97%.

Table 3 Training accuracy and final validation accuracy of ANN2, NB, SVM and RF

5.1.3 Final prediction for binary classification

Based on the performance of the four soft classifiers illustrated in Table 3, we design two strategies for obtaining the final prediction:

  1. (1)

    Vanilla strategy: Utilize the single ANN2 classifier which outperforms other classifiers.

  2. (2)

    Ensembling strategy: Ensemble all binary soft classifiers by soft voting. Soft voting is inspired by the voting process of the random forest method: each soft classifier first predicts class probabilities separately, and these probabilities are then combined to produce the final prediction.

Intuitively, strategy #2 should be the better choice for two reasons. First, the accuracy results of the binary soft classifiers (Table 3) are all above 85%, which implies that even the weakest classifier, NB, is a powerful method for this binary classification task; including all models in the final pipeline is therefore more likely to achieve better performance. Second, ensembling multiple models into the final classifier reduces variance, mitigates model bias, and ultimately decreases the chance of overfitting. An ideal model must generalize well across different datasets. However, a single soft classifier is trained on a given dataset with a fixed model structure; when faced with a new dataset whose distribution is not exactly that of the original dataset, it is likely to suffer from overfitting. In contrast, ensembling multiple binary soft classifiers alleviates the risk of focusing too much on a specific feature and achieves better generalization. For Con Edison's document classification task in particular, the data distribution does not stay constant across time periods, so the generalization ability of the model is of vital importance for achieving good performance.

To verify our hypothesis that strategy #2 is better, we design the following experiment to compare the two prediction strategies. In addition to the original regulation dataset used for training and testing the binary soft classifiers in Sect. 5.1.2, we use the held-out test dataset to measure how well each strategy generalizes to data with a different distribution. The held-out test dataset is another dataset provided by Con Edison that also consists of regulatory documents but comes from a different time period. Its distribution therefore differs slightly from that of the original train and validation datasets, making it a good fit for assessing the model's robustness against distribution shift.

For strategy #1, we simply apply the ANN2 classifier trained in Sect. 5.1.2. For strategy #2, we utilize the soft voting mechanism, which takes the average of the probabilities of all binary soft classifiers as the final prediction of the binary classification module. Table 4 shows that although strategy #1 performs slightly better on the original dataset used for training and validation, it generalizes much worse to the held-out test dataset than the ensembling strategy. Strategy #2 thus generalizes better, and we select it when forming the final pipeline.

Table 4 Accuracy of two strategies on the original and new datasets
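A sketch of such a soft-voting ensemble using scikit-learn's VotingClassifier (illustrative only: the toy data below stands in for the BoW features and Applicable/Not Applicable labels, and MLPClassifier stands in for the PyTorch ANN2 model used in our pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in for the BoW features and binary labels.
X, y = make_classification(n_samples=500, n_features=100, random_state=0)
X = abs(X)  # MultinomialNB expects non-negative features, as word counts are

ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("svm", SVC(kernel="linear", probability=True)),      # probability=True enables soft voting
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("ann", MLPClassifier(hidden_layer_sizes=(1000, 100))),  # stand-in for ANN2
    ],
    voting="soft",  # average the predicted class probabilities of all classifiers
)
ensemble.fit(X, y)
probs = ensemble.predict_proba(X)  # averaged probabilities serve as the module's final output
```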

5.2 Multi-label classification module

The main role of the multi-label classification module of the pipeline is to list the departments potentially concerned by a given regulation. To do so, we experiment with several models and configurations. First, naive ANN models that take bag-of-words features are used as a baseline. Then, we experiment with LSTM-based models [115], which are effective for sequential data such as time series or text. Finally, we run experiments with more sophisticated BERT-based models. BERT [116] is a state-of-the-art language representation model that outperformed its competitors, such as OpenAI GPT, on several NLP tasks by large margins. For the experiments in this module, we use the obligation dataset, with \(20\%\) reserved for validation and the remaining \(80\%\) used for training.

5.2.1 Experiments with ANN models

For this model type, text processing techniques similar to those in the binary classification module are applied, that is, dropping numbers and punctuation, eliminating words based on their frequency, and creating the feature vector using bag-of-words modeling. After dropping the words based on their frequency in the obligation dataset, we are left with 8638 different words. Hence, the feature vector obtained with BoW, and thus the input size of the model, is 8638-dimensional. In total, we experiment with two ANN models of different size:

  1. (1)

    ANN1: ANN model with 1 hidden layer with 500 neurons

  2. (2)

    ANN2: ANN model with 2 hidden layers with 1000 and 500 neurons, respectively.

Also, in both networks, hidden layers are followed by ReLU activation functions. Table 5 shows the hyperparameter search grid used in the experiments.

Table 5 Hyperparameter search grid for ANN models

Hence, there are 72 different training settings for each model, and the networks are trained for 300 epochs in each setting. Among all these settings, Table 6 provides the best-performing configuration of each model type, selected based on the conventional accuracy score achieved on the validation set.

Table 6 Best performing model configurations for ANN models with soft and conventional accuracy scores

The results of the Top-k accuracy scores computed on the validation set using the model configurations specified above are provided in Table 7.

Table 7 Top-k accuracy scores obtained on the validation set for ANN models

Similarly, the results of the nDCG-k scores computed on the validation set using the model configurations specified above are provided in Table 8.

Table 8 nDCG-k scores obtained on the validation set for ANN models

As can be seen from Tables 6, 7 and 8, although the ANN2 model achieves higher soft and classical accuracy scores, the ANN1 model reaches higher scores in terms of all the top-k accuracy and nDCG-k scores for \(k = [2, 3, 5, 7, 10]\).

5.2.2 Experiments with LSTM-based models

Because BoW modeling only captures the presence of words and their frequency, it cannot incorporate contextual information. Hence, we turn our focus toward word embeddings and models that can make use of them. In particular, we use LSTM models, since their internal gates allow them to capture long-range dependencies better than RNN models [115]. Because of this superiority over RNNs, LSTMs are used as a solution to many NLP tasks such as machine translation and document classification [117].

We use a state-of-the-art LSTM-based classification model proposed in [118], which comprises a one-layer BiLSTM network followed by a max-pooling layer over the concatenated hidden states and a feed-forward network. The authors report that the proposed model achieves better accuracy scores than other CNN-based and LSTM-based models on text classification tasks when trained with proper regularization, hence they refer to their model as \(\rm{LSTM}_{\rm{reg}}\). Following the paper, we apply dropout to the hidden layers and the embedding vectors and use weight decay. For the word embeddings, we use pretrained Word2Vec embeddings, which capture semantic word similarities and provide continuous vector representations of words [119]; the Word2Vec embeddings are 300-dimensional.

The classifier part of the model is again selected from the ANN1 and ANN2 models explained in Sect. 5.2.1. For ANN1, the following hidden layer sizes are used in the experiments: [100, 500, 1000]. For ANN2, the following hidden layer size configurations are searched as part of the hyperparameter tuning: [(500, 100), (500, 500), (1000, 100), (1000, 500)]. For both ANN1 and ANN2, the input layer size is twice the hidden state size of the BiLSTM, since the feature vector is obtained by max-pooling the concatenated hidden states from both directions. Note that the hidden state size of the LSTM model is varied in the hyperparameter search; hence, the input size of the classifier is [1024, 1536, 2048] when the LSTM hidden state size is [512, 768, 1024], respectively.

We use a sequence length of 512 for the \(\rm{LSTM}_{\rm{reg}}\) model: if a text contains fewer than 512 tokens, we pad it with zeros, and if it is longer, we take only the first 512 tokens as input to the model. We experiment with different hidden state sizes [512, 768, 1024], dropout values [0.1, 0.3] and weight decay values \([10^{-6}, 10^{-5}]\). The \(\rm{LSTM}_{\rm{reg}}\) model is trained for 200 epochs with the Adam optimizer, a learning rate of \(10^{-4}\), and a batch size of 64.
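A condensed PyTorch sketch of this architecture is given below (a one-layer BiLSTM over pretrained 300-dimensional embeddings, max-pooling over the concatenated hidden states, then an ANN classifier). The classifier hidden size of 500 is one of the searched configurations and is our arbitrary pick here; padding handling and other details are simplified.

```python
import torch
import torch.nn as nn

class LSTMReg(nn.Module):
    """BiLSTM + max-pooling + feed-forward classifier, in the spirit of the LSTM_reg model."""
    def __init__(self, embeddings, hidden_size=512, num_labels=59, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=False)  # 300-dim Word2Vec
        self.emb_dropout = nn.Dropout(dropout)
        self.bilstm = nn.LSTM(embeddings.size(1), hidden_size,
                              batch_first=True, bidirectional=True)
        # The max-pooled feature vector has 2 * hidden_size dimensions (both directions concatenated).
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, 500), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(500, num_labels),
        )

    def forward(self, token_ids):                    # token_ids: (batch, 512)
        emb = self.emb_dropout(self.embedding(token_ids))
        hidden, _ = self.bilstm(emb)                 # (batch, 512, 2 * hidden_size)
        pooled, _ = hidden.max(dim=1)                # max-pooling over the sequence dimension
        return self.classifier(pooled)               # one logit per department label

# Example instantiation with a random matrix standing in for the Word2Vec embedding table.
model = LSTMReg(torch.randn(30000, 300))
```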

In the tables below, we provide the details and results of the best configuration, that is, the one yielding the highest accuracy score, for both the ANN1 and ANN2 classifiers on top of the \(\rm{LSTM}_{\rm{reg}}\) model.

Table 9 Best performing model configurations for \(\rm{LSTM}_{\rm{reg}}\) models with soft and conventional accuracy scores
Table 10 Top-k accuracy scores obtained on the validation set for \(\rm{LSTM}_{\rm{reg}}\) models
Table 11 nDCG-k scores obtained on the validation set for \(\rm{LSTM}_{\rm{reg}}\) models

As shown in Tables 9, 10 and 11, the \(\rm{LSTM}_{\rm{reg}}\) model with the ANN1 classifier outperforms the one with the ANN2 classifier in terms of classical and soft accuracy scores. It also achieves better Top-k accuracy and nDCG-k scores.

5.2.3 Experiments with BERT-based models

Given the unsatisfactory accuracy scores obtained on the validation set with the ANN and LSTM models, we also experiment with models that use a more advanced base model, namely BERT. The motivation is that the feature vectors obtained with BoW modeling do not capture contextual information at all, since they depend only on the occurrence of dictionary words. The BERT model uses transformers as its core building block and can therefore incorporate contextual information into the word embeddings; for example, the word bank in bank account has a different representation than in river bank. Moreover, it features bidirectional self-attention layers, meaning that it can obtain a better relation map in the attention mechanism compared with unidirectional ones [116].

Although the BERT model is pretrained on unlabeled corpus data with the masked language modeling and next-sentence-prediction tasks, it is possible to modify its output layers and then fine-tune it end-to-end to exploit BERT's contextual understanding in any task [116]. Our task is to identify the departments to which a given regulatory case belongs, which is a document classification task. Several models have been created by changing the BERT model's output layer for this type of task; the one we use is DocBERT, proposed in [120]. As explained in that paper, the model is obtained by inserting a fully connected layer on top of the first hidden state of BERT's last layer, which corresponds to the [CLS] token. Note that for classification tasks, attaching the fully connected layer at this specific place is what the authors of the BERT paper suggest [116].
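A minimal sketch of this DocBERT-style setup using the Hugging Face transformers library (illustrative only: the bert-base-uncased checkpoint, the dropout placement, and the number of labels are our assumptions; the BCE loss with logits reflects the multi-label loss described earlier):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DocBERT(nn.Module):
    """BERT encoder with a classification head on the [CLS] token's final hidden state."""
    def __init__(self, num_labels=59, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # 768 -> num_labels

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]   # hidden state of the [CLS] token
        return self.classifier(self.dropout(cls_state))

# Multi-label training uses a per-label sigmoid via BCEWithLogitsLoss.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = DocBERT()
loss_fn = nn.BCEWithLogitsLoss()
```

In our experiments the single linear layer is replaced by the ANN1 or ANN2 classifiers described next.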

For our experiments, we use two DocBERT models, which replace the fully connected layer described in the previous paragraph with the ANN1 and ANN2 models. Recall that the details of these models are given in Sect. 5.2.1. The only change from those models is the input size, which is now 768 instead of 8638 since the BERT model's hidden states are 768-dimensional vectors. In total, we experiment with DocBERT models in four different schemes:

  • DocBERT with ANN1 classifier using Adam optimizer

  • DocBERT with ANN2 classifier using Adam optimizer

  • DocBERT with ANN1 classifier using Adagrad optimizer

  • DocBERT with ANN2 classifier using Adagrad optimizer

Even though we run experiments with different hyperparameters in the listed schemes, we observe that the Adagrad optimizer does not converge. The motivation for using the Adagrad optimizer is to update more aggressively the parts of the BERT and classifier weights that are rarely updated, while changing the frequently updated parts less. Although this is the main strength of the Adagrad optimizer, it drastically underperforms the Adam optimizer in our experiments, so we do not report the results obtained with Adagrad. In Table 12, we share the hyperparameter search grid for the experiments with the DocBERT models.

Then, in Table 13, we share the highest accuracy scores achieved in both schemes (DocBERT with ANN1 and with ANN2), together with the hyperparameter selections yielding these results. Again, the best-performing settings are selected based on the models' performance on the validation set. Finally, we provide the Top-k accuracy scores and nDCG scores in Tables 14 and 15.

Table 12 Hyperparameter search grid for the experiments with DocBERT model
Table 13 Best performing model configurations for DocBERT models with soft and conventional accuracy scores
Table 14 Top-k accuracy scores obtained on the validation set with the same DocBERT models
Table 15 nDCG-k scores obtained on the validation set with the same DocBERT models

As is shown in Tables 13, 14, 15, the DocBERT model with the ANN1 classifier outperforms the one with the ANN2 classifier in terms of classical and soft accuracy scores, whereas it obtains inferior results in Top-k accuracy and nDCG-k scores.

5.2.4 Multilabel classification final remarks

In Table 16, we share the results of best-performing ANN, \(\rm{LSTM}_{\rm{reg}}\) and DocBERT models based on their conventional accuracy scores and top-k accuracy scores on the validation set.

Table 16 Comparison of best performing multilabel classification models

As can be seen, the transformer-based BERT variant, DocBERT, outperforms the \(\rm{LSTM}_{\rm{reg}}\) and ANN models in terms of accuracy scores, attaining around \(17\%\) and \(15\%\) higher accuracy than the best-performing ANN and \(\rm{LSTM}_{\rm{reg}}\) models, respectively. The accuracy metric essentially reflects the model's confidence in exactly predicting the ground-truth labels, which is an indication of its generalization capability. The superiority of DocBERT is also present in the other metrics, demonstrating the effectiveness of transformers in handling long-range dependencies and extracting semantic information. Hence, we use the DocBERT model with the ANN1 classifier in the multilabel classification module of our pipeline.

5.3 Pipeline results

After combining the binary and multi-label modules as shown in Fig. 1, we obtain the pipeline for document elimination and classification. We then evaluate the performance of our pipeline on both the previous train \(+\) validation set and the held-out test set (Table 17). Note that we only report the classical accuracy for the binary classification module and the top-k accuracy score for the multi-label classification module, since these are the most interpretable metrics. For the top-k accuracy score, we only report \(k = [2,3,5]\), since these were selected as the most reasonable k values by the Con Edison team at the operational level.

Table 17 Result of the pipeline and its components on train + validation set

Although the results on the train + validation set are of limited interest for the binary classification module, we would like to emphasize the accuracy scores achieved by the multi-label classification part and the overall pipeline. This is mainly because the evaluation of the pipeline and its components is carried out on full regulations, meaning that the multi-label classification part has not seen exactly this data but only some sections of its text. The results show that there is not much deviation from the results obtained on the validation portion of the obligation dataset.

We also test this pipeline on the held-out test set. Table 18 shows the pipeline results obtained on this dataset.

Table 18 Result of the pipeline and its components on the held-out set

Tables 17 and 18 together show a drop in the accuracy scores of the multi-label module and the overall pipeline. It is mainly caused by the large number of samples in the held-out test dataset whose only associated label is Others. This is not the case in the training set, which implies a change in the distribution of the held-out dataset compared with the previous one. We also assess the pipeline while ignoring these samples and provide the results in Table 19. There is a considerable increase in the performance of the multi-label classification module, since the distributions of the train set and the held-out set become closer to each other. Overall, the pipeline succeeds both in determining whether a regulation is applicable to Con Edison and in finding the departments most concerned by the applicable regulations.

Table 19 Result of the pipeline and its components on the new held-out set without the samples that have only Others label

6 Conclusion

Our research proposes a deep learning-based automatic document classification system that aims to efficiently allocate text regulations to the relevant departments at Con Edison. The system achieves high accuracy, with over 90% accuracy for binary classification and over 80% Top-3 accuracy for multi-label classification on the given datasets. The pipeline can be used in various contexts where document classification is required, but it has a limitation in classifying long documents, as the DocBERT model used for the multi-label task can only process documents with fewer than 512 tokens. Currently, the system only considers the last 512 tokens of a long document for multi-label classification, which may not be suitable for applications with much longer documents. To overcome this issue, we plan to incorporate embedding or text abstraction techniques into our pipeline for long document processing in the future.