1 Introduction

With the rapid development of artificial intelligence (AI) [1] in recent years, computers can automatically handle many tasks in industry. In manufacturing [2,3,4], companies deploy robots and automated equipment in place of human labor, freeing employees from repetitive and tedious tasks. AI in medicine [5,6,7] has also become a hot research topic, with breakthrough achievements in multiple directions, e.g., medical robotics [8], medical diagnosis [9, 10], medical statistics [11], human biology [12], etc. Advances in AI have also changed the world of finance [13] by fueling the trend of quantitative research. Many Fintech companies apply machine learning methodologies to trading strategy decisions [14] and high-frequency trading [15] to increase profit. In the utility industry [16,17,18,19], companies likewise turn to AI technologies to minimize human intervention and reduce expenses.

This paper targets a typical application scenario in the utility industry. All utility companies face the same problem: how to address customer requirements accurately and quickly [20, 21]. Take Con Edison as an example: a large number of regulations from external regulatory bodies are received every day. To handle these regulations properly, Con Edison hires many people to read each regulation in full and then forward it to the relevant departments. If this classification process could be completed automatically or semi-automatically by AI, the company could not only improve work efficiency and classification accuracy, but also save the expense of training and hiring staff (Fig. 1).

Fig. 1
figure 1

Pipeline overview

In this paper, we present an automatic document classification pipeline (shown in Fig. 1) based on deep learning to solve the regulation classification task at Con Edison. The pipeline consists of two parts: i) a binary classification module that separates the regulations important to Con Edison from those that are not, and ii) a multi-label classification module that assigns each important regulation to specific departments within the company. Con Edison provides a large corpus of thousands of already processed regulation texts with two types of labels: i) important versus not important to Con Edison for the binary classification task; ii) multiple labels indicating the specific departments to which the regulation belongs. For the multi-label classification task, a regulation may not belong to only one specific department; it can concern multiple departments within Con Edison. Therefore, the second task in the pipeline is a multi-label, not a multi-class, classification task.

For the binary classification task, we utilize a support vector machine (SVM) [22], Naive Bayes (NB) [23], random forest [24], and an artificial neural network (ANN) [25], and combine the four binary soft classifiers with soft voting. The accuracy of the binary classification module in the pipeline reaches \(\approx 92\%\). For the multi-label classification task, we utilize the DocBERT model (a fully connected ANN added after the BERT model for classification). Moreover, we apply the binary cross-entropy loss (BCELoss) instead of the classical cross-entropy loss (CELoss), since the task is multi-label rather than multi-class. The accuracy of the multi-label classification module reaches \(\approx 80\%\) under the top-3 accuracy metric defined in Sect. 4.

This paper is organized as follows: Section 2 reviews the literature on binary classification, multi-label classification, and natural language processing. Section 3 describes the details of the datasets that we use for analysis. Section 4 defines the accuracy metrics utilized for the evaluation of the models. Section 5 discusses the construction of the automatic pipeline and its performance on the corresponding datasets. Finally, Sect. 6 concludes the contribution of our work.

2 Literature review

2.1 Binary classification

Binary classification is the process of classifying observations of a dataset into two groups based on a classification rule. It is a classical topic with many practical scenarios, e.g., medical testing [26], quality control in industry [27], information retrieval [28, 29], etc. Many commonly used machine learning techniques have been introduced to solve binary classification problems. The Naive Bayes (NB) [23] classifier constructs a probability model based on Bayes' theorem with strong independence assumptions between the features. The decision tree (DT) [30] is a nonparametric supervised learning algorithm that constructs a classification/regression tree by identifying ways to split a dataset based on different conditions. Since a single decision tree might grow over-complex and fail to generalize well, an ensembling method, random forest (RF) [24], constructs a multitude of decision trees during the training process and then lets them vote for the final result. Further tree-based ensembling methods, e.g., AdaBoost [31], XGBoost [32], lightGBM [33], etc., became popular with desirable performance on many machine learning tasks. [22, 34] proposed the support vector machine (SVM), which constructs a hyperplane with the largest separation, or margin, between two classes. By introducing the kernel trick [22], which nonlinearly maps the inputs into a very high-dimensional space, SVM can also solve nonlinear classification problems. In 1958, the psychologist Frank Rosenblatt [35] borrowed the concept of the biological neural network into computer science and proposed the first artificial neural network (ANN). The fully connected neural network is the simplest ANN, in which the connections between neurons of a biological neural network are modeled as weights [25]. Neural networks spread quickly after being proposed due to their excellent performance across a wide range of applications [36, 37]. Recently, many different types of neural networks have been proposed based on the requirements of different tasks. The convolutional neural network (CNN) [38, 39] introduces a shared-weight architecture of convolution kernels that slide along the input features, reducing the number of parameters compared with fully connected neural networks and achieving excellent performance in image processing tasks. To deal with tasks involving sequential data, e.g., speech recognition [40], video recognition [41], text generation [42], etc., recurrent neural networks (RNN) [43], which use previous outputs as inputs, were proposed. To mitigate the loss of long-term memory in the RNN, the long short-term memory (LSTM) [44] adds a gate construction (input gate, output gate, and forget gate) to the vanilla RNN to control whether the input of the current step is remembered. The gated recurrent unit (GRU) [45] simplifies the LSTM structure by decreasing the number of gates from three to two and achieves comparable performance on multiple tasks.

2.2 Multi-label classification

In traditional multi-class classification [46] tasks, an observation in the dataset carries only a single label from a set of labels, and cross-entropy [47] can be used as the objective function. In multi-label classification tasks, however, a single observation may carry multiple labels from the label set. Multi-label classification [48, 49] was first motivated by text classification [50] and medical diagnosis [48], where text documents contain more than one theme and patients may suffer from more than one disease. With the rapid development of technology, multi-label classification has become essential in many modern applications, e.g., protein function classification [51, 52], music categorization [53,54,55], semantic scene classification [56,57,58], etc. Methods for the multi-label classification task fall into two groups: i) problem transformation methods [59,60,61] and ii) algorithm adaptation methods [62, 63]. Problem transformation methods transfer the original multi-label problem into a combination of several multi-class classification tasks. Algorithm adaptation methods, in contrast, modify algorithms so that they can be applied directly to the original multi-label task; for example, the cross-entropy loss used for multi-class tasks can be replaced by the binary cross-entropy loss, which suits the multi-label setting [64]. However, traditional multi-label classification methods encounter many obstacles in extreme multi-label classification problems with thousands of labels, e.g., recommendation systems [65,66,67], natural language processing [68] and image processing [69]. Many new techniques, e.g., one-versus-all (OvA) classifiers [70,71,72], tree-based classifiers [66, 73], deep learning-based classifiers [74,75,76], and embedding-based classifiers [77, 78], have been proposed to solve the extreme multi-label classification task. Our task, however, has only hundreds of labels, so it is not an extreme multi-label classification problem and can be solved by conventional multi-label classification methods.
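To make the loss-level algorithm adaptation above concrete, the following minimal PyTorch sketch (our own illustration, not taken from any cited work) contrasts the multi-class cross-entropy loss with the binary cross-entropy loss applied independently to each label; BCEWithLogitsLoss folds the per-label sigmoid into the loss.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 6)  # batch of 4 examples, 6 candidate labels

# Multi-class: each example has exactly one correct class index.
multiclass_targets = torch.tensor([2, 0, 5, 1])
ce_loss = nn.CrossEntropyLoss()(logits, multiclass_targets)

# Multi-label: each example has a 0/1 indicator vector, possibly with several 1s.
multilabel_targets = torch.tensor([[0, 1, 1, 0, 0, 0],
                                   [1, 0, 0, 0, 1, 0],
                                   [0, 0, 0, 0, 0, 1],
                                   [1, 1, 0, 0, 0, 0]], dtype=torch.float)
# BCEWithLogitsLoss applies a sigmoid per label, so labels are scored independently.
bce_loss = nn.BCEWithLogitsLoss()(logits, multilabel_targets)

print(ce_loss.item(), bce_loss.item())
```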

2.3 Natural language processing

Language is among the most important mental creations of humans and distinguishes us from animals [79, 80]. More than 7,100 spoken languages exist today, and our connected world is filled with an abundant volume of natural language text covering many kinds of knowledge [81]. With the rapid advance of AI, scientists are placing more and more emphasis on natural language processing (NLP) [82,83,84] to enable machines to understand text as efficiently and accurately as humans.

NLP technology is widely employed in many applications, e.g., speech recognition [40], sentiment analysis [85], document classification [86, 87], natural language generation [88, 89], etc. An NLP system can be separated into two processes: i) data processing [90] and ii) model construction [84]. The data processing step [90] maps text documents into vectors that are understandable to computers. Many techniques have been proposed to vectorize long sentences or text documents by learning word associations from a large corpus of text, e.g., bag-of-words (BOW) [91], continuous bag-of-words (CBOW) [92] and skip-gram [92]. A large number of deep learning models, e.g., convolutional neural networks (CNN) [38], recurrent neural networks (RNN) [43], textCNN [93], BiLSTM [94, 95] and attention mechanisms [96], are utilized for model construction in NLP tasks. Recently, the emergence of powerful pre-trained models, e.g., CoVe [97], ELMo [98], the OpenAI generative pre-trained transformer (GPT) [99] and bidirectional encoder representations from transformers (BERT) [100], has dramatically increased the performance of deep learning models on multiple NLP tasks.

2.3.1 Transformer-based models

The transformer unit [101] was a milestone invention in NLP history and brought NLP into a new era. The self-attention mechanism proposed in the transformer captures bidirectional information from the whole text, allowing it to outperform, in many tasks, sequential models such as the RNN, textCNN, and LSTM, which only consider one-directional information. The powerful BERT model, built by stacking the encoder layers of the transformer block, is widely used in NLP tasks. RoBERTa [102] presents a replication study of BERT pretraining [100] and achieves a more powerful pretrained model by increasing the training time and batch sizes, removing the next-sentence-prediction objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data. The decoding-enhanced BERT with disentangled attention (DeBERTa) [103] further enhances BERT by introducing a disentangled attention mechanism, incorporating absolute positions in the decoding layer to predict the masked tokens during pre-training, and using a new virtual adversarial training method for fine-tuning. There are many BERT-based models for document classification. DocBERT [37] inserts a fully connected layer on top of the last hidden state vector of the BERT architecture. In the RoBERT [104] model, the hidden state vectors and posterior probabilities of the BERT model are stacked and then fed into an LSTM layer; the output of this LSTM serves as a document embedding. In the ToBERT [104] model, the hidden state vectors and posterior probabilities from BERT are also stacked, but this time they are fed into a transformer block, since transformers are known for capturing long-distance relationships between the words in a sequence. Hierarchical attention networks (HAN) [105] are designed to capture two basic insights into document structure. Accordingly, they use two levels of attention, word level and sentence level, and words and sentences are encoded with bidirectional GRU layers that summarize information from both directions.

The text-to-text transfer transformer (T5) [106, 107] is a comprehensive text-to-text model designed to address various NLP tasks. Unlike other multi-task language models that rely on task-specific architectural components and loss functions, T5's creators developed a unified learning approach that treats every NLP challenge as a text-to-text problem. This enables them to use a single, consistent model, loss function, and set of hyperparameters to obtain a unified, multi-task model. The ByT5 [108] model is a modified version of T5 that handles text as raw bytes rather than tokens. In contrast, models such as BERT need a separate tokenization process to divide documents into sub-word vocabularies, which can impose greater memory requirements because larger vocabularies necessitate extensive embedding matrices with numerous parameters. T5-based models have also been applied to text ranking [109] and document classification [110, 111] in recent works. The main difference between BERT-based and T5-based models is that BERT only includes encoders, while T5 contains both encoders and decoders and performs better on natural language generation (NLG) tasks. However, document classification is a natural language understanding (NLU) task, not an NLG task, so focusing on BERT-based models works well for our document classification task. Moreover, T5 is a text-to-text generation model that requires manually tuning the prompt [112,113,114]. We therefore focus on BERT-based models in our implementation.

3 Data analysis

In this section, we provide details about the datasets provided by Con Edison. Two dataset portions are granted to us at different stages of the project. The first portion is provided for training and validation at the beginning of the project, and the held-out dataset is provided at a later stage and used for testing. The dataset portions differ in their distributions, mainly because the held-out dataset contains the most recent samples. Despite having different distributions, they share the same structure with two components, Regulations and Obligations. The connection between these two components is established with an ID key: if the same ID key is associated with both a regulation text and an obligation text, the text in the obligation component is the highlighted, refined fragment of the corresponding text in the regulation component.

3.1 Train and validation dataset

Here, we provide the details on the initially provided dataset which we used for training and validation. As mentioned, it has two components: Regulations and Obligations.

3.1.1 Regulations

Regulations are longer documents, some several pages long, containing the laws or legislation put forward by the regulating state or the federal government. In this dataset, the regulations have two possible labels: Applicable and Not Applicable. As the label names imply, a regulation labeled Applicable contains a law or legislation that applies to at least one of Con Edison's departments, whereas a regulation labeled Not Applicable does not. In total, there are 5570 different applicable regulations and 2212 different not applicable ones in this dataset.

3.1.2 Obligations

Since regulation texts are long, not every part of a regulation, even an applicable one, contains the vital point of the announced law. In the Obligations dataset, only the important sentences or parts of the regulations are provided. Thus, Obligations are much shorter, containing at most two paragraphs, and most of them consist of several sentences from a single paragraph.

Note that several obligations can be deduced from the same regulation, since a regulation might contain important information in different paragraphs or sections. Besides, a single obligation might concern more than one department of Con Edison, which makes the task carried out on the obligation dataset a multi-label classification task. Note that the labels in this dataset are the department names, anonymized here as digit numbers for confidentiality purposes.

In total, there are 111 different department names or labels, 5320 different obligation texts, and 7428 different text-label pairs. This is thus a highly imbalanced and small dataset relative to the number of classes. We decided to group the departments with ten or fewer associated obligations into one label called Others, since the model cannot realistically learn patterns from very few samples. After this grouping, we are left with 59 different department labels, including the Others label, which has 158 samples. The histogram of the dataset after this grouping is provided in Fig. 2.

Fig. 2
figure 2

Histogram of the labels in the obligation dataset (The actual department and section names are hidden for confidentiality)
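A minimal sketch of this grouping step, assuming the obligation data is available as a list of (text, department) pairs (the variable names are ours; the threshold of ten follows the description above):

```python
from collections import Counter

# text_label_pairs: hypothetical list of (obligation_text, department_label) tuples
label_counts = Counter(label for _, label in text_label_pairs)

# Departments with ten or fewer associated obligations are merged into "Others".
mapping = {label: (label if count > 10 else "Others")
           for label, count in label_counts.items()}

grouped_pairs = [(text, mapping[label]) for text, label in text_label_pairs]
```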

3.2 Held-out test dataset

The held-out dataset has the same structure as the train and validation dataset. It contains 122 applicable and 1333 not applicable regulations. The obligations from the applicable regulations have 176 department labels. Although there are 27 different departments among the 176 department labels, six of them are grouped into the Others label using the dictionary obtained while forming the histogram in Fig. 2 from the training set. After this procedure, the histogram of the new set of obligations from the held-out dataset is provided in Fig. 3.

Fig. 3
figure 3

Histogram of the obligations per department in the held-out set

4 Evaluation metrics

In total, we evaluate the models with four different metrics: accuracy (%), soft accuracy (%), Top-k accuracy (%), and the normalized discounted cumulative gain (nDCG) score. Apart from the conventional accuracy score, the remaining three metrics are introduced for evaluating the performance of experiments on the multi-label classification task.

4.1 Accuracy score (%)

Accuracy is one of the most interpretable evaluation metrics for most experiments. It is simply the percentage of correctly predicted labels with respect to the total size of the evaluation set. When evaluating models for the multi-label classification task, a correct prediction is defined as an exact match between the target vector and the model's output layer after the sigmoid activation function is applied to each output neuron and the results are rounded to 0 or 1.

4.2 Soft accuracy score (%)

The soft accuracy score is more generous when evaluating performance on the multi-label classification task. In contrast to the conventional accuracy score, a prediction is counted as correct when at least one of the sigmoid-activated and rounded output neurons with value 1 corresponds to a target label. In this way, the model is evaluated in a more flexible manner. Notice that this is again a percentage.

4.3 Top-k accuracy score (%)

The Top-k accuracy score is a popular evaluation metric frequently used by the ML community, especially in the presence of many target classes. If the target class belongs to the list of the k most likely classes predicted by the model, the prediction is counted as correct. In the multi-label setting, a prediction is counted as correct if there is an overlap between the set of target classes and the set of the k most likely classes predicted by the model.
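For concreteness, the three accuracy variants above can be computed as in the following sketch (our own illustrative implementation, operating on a matrix of sigmoid outputs and a matrix of 0/1 target vectors):

```python
import numpy as np

def exact_match_accuracy(probs, targets):
    """Conventional accuracy: rounded sigmoid outputs must match the target vector exactly."""
    preds = (probs >= 0.5).astype(int)
    return 100.0 * np.mean(np.all(preds == targets, axis=1))

def soft_accuracy(probs, targets):
    """Soft accuracy: at least one predicted label (rounded to 1) is among the target labels."""
    preds = (probs >= 0.5).astype(int)
    return 100.0 * np.mean(np.any((preds == 1) & (targets == 1), axis=1))

def top_k_accuracy(probs, targets, k):
    """Top-k accuracy: the k most probable classes overlap with the target labels."""
    topk = np.argsort(-probs, axis=1)[:, :k]
    hits = [targets[i, topk[i]].sum() > 0 for i in range(len(targets))]
    return 100.0 * np.mean(hits)
```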

4.4 nDCG-k score

To understand the normalized discounted cumulative gain (nDCG) score, we first explain the discounted cumulative gain (DCG) score, of which nDCG is the normalized version. DCG quantifies the ranking success of the model prediction. Let y and \(\hat{y}\) be the true label vector and the output of the classifier, respectively. That is,

$$\begin{aligned} y&= [y_1, y_2, \ldots , y_L] \in {\{0,1\}}^{L} \\ \hat{y}&= [\hat{y}_1, \hat{y}_2, \ldots , \hat{y}_L] \in \mathbb {R}^{L} \end{aligned}$$

where L is the number of classes. Then, the DCG score is defined as follows:

$$\begin{aligned} \text {DCG-}k = \sum _{l \in \text {rank}_k(\hat{y})} \frac{y_l}{\log (l+1)} \end{aligned}$$

DCG-k measures the accuracy based on the k most probable classes in the prediction. The term \(\log (l+1)\) in the denominator controls the weight of each class: the higher a class is ranked in the prediction, the greater its impact on the final DCG score. Using this definition of DCG, we normalize DCG-k by its maximum achievable value (the ideal DCG-k) and formulate the nDCG score as:

$$\begin{aligned} \text {nDCG-}k = \frac{\text {DCG-}k}{ \sum _{l=1}^{\min (k, \Vert y\Vert _0)} \frac{1}{\log (l+1)} } \end{aligned}$$
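A minimal implementation of these scores is sketched below. We interpret \(l\) as the rank position of each of the top-\(k\) predicted classes (our reading of the formula); note that the base of the logarithm cancels in the nDCG ratio.

```python
import numpy as np

def dcg_k(y_true, y_score, k):
    """DCG-k: relevance of the top-k ranked classes, discounted by the log of the rank position."""
    top_k = np.argsort(-y_score)[:k]                 # indices of the k most probable classes
    positions = np.arange(1, len(top_k) + 1)         # rank positions 1..k
    return np.sum(y_true[top_k] / np.log(positions + 1))

def ndcg_k(y_true, y_score, k):
    """nDCG-k: DCG-k divided by the best achievable DCG-k for this label vector."""
    ideal_positions = np.arange(1, min(k, int(np.sum(y_true))) + 1)
    ideal = np.sum(1.0 / np.log(ideal_positions + 1))
    return dcg_k(y_true, y_score, k) / ideal if ideal > 0 else 0.0
```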

5 Pipeline

In this section, we provide details of our pipeline. Particularly, the pipeline consists of two different modules: the binary classification module and the multi-label classification module. Figure 4 shows the outline of our pipeline including the modules and their components.

Fig. 4
figure 4

Detailed diagram of the pipeline

The binary classification module is responsible for determining whether a given raw regulation text is applicable to Con Edison. If it is not applicable, the pipeline returns the result accordingly. If it is applicable, the raw text is then sent to the multi-label classification module. We emphasize that the raw text, rather than the version already processed in the binary classification module, is sent to the multi-label module, since the processing steps of the two modules differ. After receiving the text, the multi-label classification module predicts the k departments to which the case in the regulation most probably belongs. Details of these modules are explained in Subsections 5.1 and 5.2.

5.1 Binary classification module

The binary classification module works as a rough filter to separate the Regulation dataset into two parts: the documents that are Applicable to Con Edison and the documents that are Not Applicable to Con Edison. The structure of the binary classification module shown in Fig. 5 is constructed as follows:

  1. (1)

    Text processing: data cleaning and text vectorization via the bag-of-words method.

  2. (2)

    Binary soft classifiers: train four binary soft classifiers: Naive Bayes (NB), support vector machine (SVM), random forest (RF), and an artificial neural network with two hidden layers (ANN2).

  3. (3)

    Final prediction for binary classification: ensemble the binary soft classifiers via soft voting to obtain the final prediction.

Fig. 5
figure 5

Structure of binary classification module

5.1.1 Text processing

After the data cleaning process of removing punctuation, eliminating numbers, and removing stopwords, the remaining text dataset contains a total of 30733 different words. From the histogram of words shown in Fig. 6, 19108 words appear fewer than 10 times and 135 words appear more than 5000 times. On the one hand, the words that appear too frequently act like stopwords, for example, “company,” “following” and “state,” and do not help with the prediction. On the other hand, the large volume of rare words carries little information for training while drastically increasing the computational cost. Therefore, we remove rare words that appear fewer than 10 times and frequent words that appear more than 5000 times, which leaves a cleaned word dictionary of 11490 words.

Fig. 6
figure 6

Histogram of words set for binary classification task
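A sketch of this frequency-based filtering step, assuming the cleaned documents are available as a list of strings (the thresholds follow the description above; variable names are illustrative):

```python
from collections import Counter

# cleaned_docs: hypothetical list of documents, already stripped of punctuation, numbers and stopwords
word_counts = Counter(word for doc in cleaned_docs for word in doc.split())

# Keep only words that appear at least 10 times and at most 5000 times.
vocabulary = {word for word, count in word_counts.items() if 10 <= count <= 5000}
```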

The next step is to transform the text into vectors that can be processed mathematically by computers. We map the original text dataset into a vector space with the bag-of-words (BOW) [91] method. The BOW method assigns an index to each word in the cleaned word dictionary and records its number of occurrences in the text. Figure 7 shows an example that illustrates the BOW method.

Fig. 7
figure 7

Example of the bag-of-words method
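A minimal sketch of this vectorization step using scikit-learn's CountVectorizer on two toy documents (an illustrative implementation, not necessarily the exact tooling used in the pipeline; in our pipeline the vocabulary would be the 11490-word cleaned dictionary built above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["utility company files annual report",
        "company must report gas leak incidents"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.get_feature_names_out())    # indexed words of the dictionary
print(X.toarray())                           # per-document word counts
```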

5.1.2 Binary soft classifiers

For the binary classification task, we introduce four classical and powerful machine learning classification methods: Naive Bayes (NB) [23], support vector machine (SVM) [22, 34], random forest (RF) [24], and a fully connected artificial neural network with two hidden layers (ANN2) [25]. The NB classifier constructs a probability model based on Bayes' theorem with strong independence assumptions between the features. The SVM classifier separates the two classes by finding the hyperplane with the maximum margin. The RF classifier constructs a multitude of decision trees during the training process and then lets them vote for the final result. For the fully connected artificial neural network classifier, we compare three network structures (Fig. 8): (i) ANN1(100), a fully connected neural network with one hidden layer of 100 neurons; (ii) ANN1(1000), a fully connected neural network with one hidden layer of 1000 neurons; and (iii) ANN2, a fully connected neural network with two hidden layers of 1000 and 100 neurons, respectively. We perform a grid search on the hyperparameters (see Table 1): initial learning rate (\(lr_0=[0.01,0.005,0.001]\)), batch size (\(bs=[16,32,128]\)), dropout (\(dp=[0.1,0.3,0.6]\)) and weight decay (\(wd=[10^{-4},10^{-5},0]\)). Table 2 lists the best two parameter settings for each neural network. We find that ANN2, whose first hidden layer has 1000 neurons and whose second hidden layer has 100 neurons, outperforms the other network structures by \(0\sim 1\)%.

Fig. 8
figure 8

Structures of three types of fully connected artificial neural networks: ANN1(100), ANN1(1000) and ANN2 from left to right

Table 1 Hyperparameter search grid for ANN models
Table 2 Best two settings of the parameters with maximal validation accuracy after training for 200 epochs on ANN1(100), ANN1(1000) and ANN2. We compute the validation accuracy after each epoch. The final val. accu. is the validation accuracy at the end of the training process. The best val. accu. is the highest validation accuracy during the training process
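For reference, a minimal PyTorch sketch of the ANN2 architecture described above (two hidden layers of 1000 and 100 neurons on top of the 11490-dimensional bag-of-words features; the dropout placement and the two-logit output layer are our assumptions):

```python
import torch.nn as nn

class ANN2(nn.Module):
    """Fully connected binary classifier: 11490-dim BoW input -> 1000 -> 100 -> 2 logits."""
    def __init__(self, input_dim=11490, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1000), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1000, 100), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(100, 2),
        )

    def forward(self, x):
        return self.net(x)
```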

Therefore, we choose the ANN2 structure as the soft classifier under the fully connected neural network setting and compare it with the other soft classifiers: NB, SVM and RF. Table 3 lists the training and validation accuracy of the four binary soft classifiers. We do not compare the best validation accuracy across all classifiers, since there is no concept of epochs in the training procedures of NB and RF. As Table 3 shows, each soft classifier reaches an accuracy above 85%, and ANN2 outperforms the others with an accuracy above 97%.

Table 3 Training accuracy and final validation accuracy of ANN2, NB, SVM and RF

5.1.3 Final prediction for binary classification

Based on the performance of the four soft classifiers illustrated in Table 3, we design two strategies for obtaining the final prediction:

  1. (1)

    Vanilla strategy: Utilize the single ANN2 classifier which outperforms other classifiers.

  2. (2)

    Ensembling strategy: Ensemble all binary soft classifiers by soft voting. Soft voting is inspired by the voting process of the random forest method: each soft classifier first predicts class probabilities separately, and these probabilities are then combined to produce the final prediction.

Intuitively, strategy #2 should be the better choice for two reasons. First, the accuracy results of the binary soft classifiers (Table 3) are all above 85%, which implies that even the weakest classifier, NB, is a powerful method for this binary classification task; including all models in the final pipeline is therefore more likely to achieve better performance. Second, ensembling multiple models into the final classifier reduces variance, mitigates model bias, and ultimately decreases the chance of overfitting. An ideal model must generalize well across different datasets. However, a single soft classifier is trained on a given dataset with a fixed model structure; when faced with a new dataset whose distribution is not exactly that of the original dataset, it is likely to suffer from overfitting. In contrast, ensembling multiple binary soft classifiers alleviates the risk of focusing too much on a specific feature and achieves better generalization. For Con Edison's document classification task in particular, the data distribution does not stay constant across time periods, so the generalization ability of the model is of vital importance for achieving good performance.

To verify our hypothesis that strategy #2 is better, we design the following experiment to compare the two prediction strategies. In addition to the original regulation dataset used for training and testing the binary soft classifiers in Sect. 5.1.2, we use the held-out test dataset to measure how well each strategy generalizes to data with a different distribution. The held-out test dataset is another dataset provided by Con Edison that also consists of regulatory documents but comes from a different time period. Its distribution therefore differs slightly from that of the original train and validation datasets, making it a good fit for assessing the model's robustness against distribution shift.

For strategy #1, we simply apply the ANN2 classifier trained in Sect. 5.1.2. For strategy #2, we utilize the soft voting mechanism, which takes the average of the probabilities of all binary soft classifiers as the final prediction of the binary classification module. Table 4 shows that although strategy #1 performs slightly better on the original dataset used for training and validation, it generalizes much worse to the held-out test dataset than the ensembling strategy. Strategy #2 thus generalizes better, and we select it when forming the final pipeline.

Table 4 Accuracy of two strategies on the original and new datasets
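A sketch of such a soft-voting ensemble using scikit-learn's VotingClassifier (illustrative only: the toy data below stands in for the BoW features and Applicable/Not Applicable labels, and MLPClassifier stands in for the PyTorch ANN2 model used in our pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in for the BoW features and binary labels.
X, y = make_classification(n_samples=500, n_features=100, random_state=0)
X = abs(X)  # MultinomialNB expects non-negative features, as word counts are

ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("svm", SVC(kernel="linear", probability=True)),      # probability=True enables soft voting
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("ann", MLPClassifier(hidden_layer_sizes=(1000, 100))),  # stand-in for ANN2
    ],
    voting="soft",  # average the predicted class probabilities of all classifiers
)
ensemble.fit(X, y)
probs = ensemble.predict_proba(X)  # averaged probabilities serve as the module's final output
```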

5.2 Multi-label classification module

The main role of the multi-label classification module of the pipeline is to list the departments potentially concerned by a given regulation. To do so, we experiment with several models and configurations. First, naive ANN models that take bag-of-words features are used as a baseline. Then, we experiment with LSTM-based models [115], which are effective for sequential data such as time series or text. Finally, we run experiments with more sophisticated BERT-based models. BERT [116] is a state-of-the-art language representation model that outperformed its competitors, such as OpenAI GPT, on several NLP tasks by large margins. For the experiments in this module, we use the obligation dataset, with \(20\%\) reserved for validation and the remaining \(80\%\) used for training.

5.2.1 Experiments with ANN models

For this model type, text processing techniques similar to those in the binary classification module are applied, that is, dropping numbers and punctuation, eliminating words based on their frequency, and creating the feature vector using bag-of-words modeling. After dropping the words based on their frequency in the obligation dataset, we are left with 8638 different words. Hence, the feature vector obtained with BoW, and thus the input size of the model, is 8638-dimensional. In total, we experiment with two ANN models of different size:

  1. (1)

    ANN1: ANN model with 1 hidden layer with 500 neurons

  2. (2)

    ANN2: ANN model with 2 hidden layers with 1000 and 500 neurons, respectively.

Also, in both networks, hidden layers are followed by ReLU activation functions. Table 5 shows the hyperparameter search grid used in the experiments.

Table 5 Hyperparameter search grid for ANN models

Hence, there are 72 different training settings for each model, and the networks are trained for 300 epochs in each setting. Among all these settings, Table 6 provides the best-performing configuration of each model type, selected based on the conventional accuracy score achieved on the validation set.

Table 6 Best performing model configurations for ANN models with soft and conventional accuracy scores

The results of the Top-k accuracy scores computed on the validation set using the model configurations specified above are provided in Table 7.

Table 7 Top-k accuracy scores obtained on the validation set for ANN models

Similarly, the results of the nDCG-k scores computed on the validation set using the model configurations specified above are provided in Table 8.

Table 8 nDCG-k scores obtained on the validation set for ANN models

As can be seen from Tables 6, 7 and 8, although the ANN2 model achieves higher soft and classical accuracy scores, the ANN1 model reaches higher scores in terms of all the top-k accuracy and nDCG-k scores for \(k = [2, 3, 5, 7, 10]\).

5.2.2 Experiments with LSTM-based models

Because BoW modeling only captures the presence of words and their frequency, it cannot incorporate contextual information. Hence, we turn our focus toward word embeddings and models that can make use of them. In particular, we use LSTM models, since their internal gates allow them to capture long-range dependencies better than RNN models [115]. Because of this superiority over RNNs, LSTMs are used as a solution to many NLP tasks such as machine translation and document classification [117].

We use a state-of-the-art LSTM-based classification model proposed in [118], which comprises a one-layer BiLSTM network followed by a max-pooling layer over the concatenated hidden states and a feed-forward network. The authors report that the proposed model achieves better accuracy scores than other CNN-based and LSTM-based models on text classification tasks when trained with proper regularization, hence they refer to their model as \(\rm{LSTM}_{\rm{reg}}\). Following the paper, we apply dropout to the hidden layers and the embedding vectors and use weight decay. For the word embeddings, we use pretrained Word2Vec embeddings, which capture semantic word similarities and provide continuous vector representations of words [119]; the Word2Vec embeddings are 300-dimensional.

The classifier part of the model is again selected from the ANN1 and ANN2 models explained in Sect. 5.2.1. For ANN1, the following hidden layer sizes are used in the experiments: [100, 500, 1000]. For ANN2, the following hidden layer size configurations are searched as part of the hyperparameter tuning: [(500, 100), (500, 500), (1000, 100), (1000, 500)]. For both ANN1 and ANN2, the input layer size is twice the hidden state size of the BiLSTM, since the feature vector is obtained by max-pooling the concatenated hidden states from both directions. Note that the hidden state size of the LSTM model is varied in the hyperparameter search; hence, the input size of the classifier is [1024, 1536, 2048] when the LSTM hidden state size is [512, 768, 1024], respectively.

We use a sequence length of 512 for the \(\rm{LSTM}_{\rm{reg}}\) model: if a text contains fewer than 512 tokens, we pad it with zeros, and if it is longer, we take only the first 512 tokens as input to the model. We experiment with different hidden state sizes [512, 768, 1024], dropout values [0.1, 0.3] and weight decay values \([10^{-6}, 10^{-5}]\). The \(\rm{LSTM}_{\rm{reg}}\) model is trained for 200 epochs with the Adam optimizer, a learning rate of \(10^{-4}\), and a batch size of 64.
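A condensed PyTorch sketch of this architecture is given below (a one-layer BiLSTM over pretrained 300-dimensional embeddings, max-pooling over the concatenated hidden states, then an ANN classifier). The classifier hidden size of 500 is one of the searched configurations and is our arbitrary pick here; padding handling and other details are simplified.

```python
import torch
import torch.nn as nn

class LSTMReg(nn.Module):
    """BiLSTM + max-pooling + feed-forward classifier, in the spirit of the LSTM_reg model."""
    def __init__(self, embeddings, hidden_size=512, num_labels=59, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=False)  # 300-dim Word2Vec
        self.emb_dropout = nn.Dropout(dropout)
        self.bilstm = nn.LSTM(embeddings.size(1), hidden_size,
                              batch_first=True, bidirectional=True)
        # The max-pooled feature vector has 2 * hidden_size dimensions (both directions concatenated).
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, 500), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(500, num_labels),
        )

    def forward(self, token_ids):                    # token_ids: (batch, 512)
        emb = self.emb_dropout(self.embedding(token_ids))
        hidden, _ = self.bilstm(emb)                 # (batch, 512, 2 * hidden_size)
        pooled, _ = hidden.max(dim=1)                # max-pooling over the sequence dimension
        return self.classifier(pooled)               # one logit per department label

# Example instantiation with a random matrix standing in for the Word2Vec embedding table.
model = LSTMReg(torch.randn(30000, 300))
```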

In the tables below, we provide the details and results of the best configuration, that is, the one yielding the highest accuracy score, for both the ANN1 and ANN2 classifiers on top of the \(\rm{LSTM}_{\rm{reg}}\) model.

Table 9 Best performing model configurations for \(\rm{LSTM}_{\rm{reg}}\) models with soft and conventional accuracy scores
Table 10 Top-k accuracy scores obtained on the validation set for \(\rm{LSTM}_{\rm{reg}}\) models
Table 11 nDCG-k scores obtained on the validation set for \(\rm{LSTM}_{\rm{reg}}\) models

As shown in Tables 9, 10 and 11, the \(\rm{LSTM}_{\rm{reg}}\) model with the ANN1 classifier outperforms the one with the ANN2 classifier in terms of classical and soft accuracy scores. It also achieves better Top-k accuracy and nDCG-k scores.

5.2.3 Experiments with BERT-based models

Given the unsatisfactory accuracy scores obtained on the validation set with the ANN and LSTM models, we also experiment with models that use a more advanced base model, namely BERT. The motivation is that the feature vectors obtained with BoW modeling do not capture contextual information at all, since they depend only on the occurrence of dictionary words. The BERT model uses transformers as its core building block and can therefore incorporate contextual information into the word embeddings; for example, the word bank in bank account has a different representation than in river bank. Moreover, it features bidirectional self-attention layers, meaning that it can obtain a better relation map in the attention mechanism compared with unidirectional ones [116].

Although the BERT model is pretrained on unlabeled corpus data with the masked language modeling and next-sentence-prediction tasks, it is possible to modify its output layers and then fine-tune it end-to-end to exploit BERT's contextual understanding in any task [116]. Our task is to identify the departments to which a given regulatory case belongs, which is a document classification task. Several models have been created by changing the BERT model's output layer for this type of task; the one we use is DocBERT, proposed in [120]. As explained in that paper, the model is obtained by inserting a fully connected layer on top of the first hidden state of BERT's last layer, which corresponds to the [CLS] token. Note that for classification tasks, attaching the fully connected layer at this specific place is what the authors of the BERT paper suggest [116].
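A minimal sketch of this DocBERT-style setup using the Hugging Face transformers library (illustrative only: the bert-base-uncased checkpoint, the dropout placement, and the number of labels are our assumptions; the BCE loss with logits reflects the multi-label loss described earlier):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DocBERT(nn.Module):
    """BERT encoder with a classification head on the [CLS] token's final hidden state."""
    def __init__(self, num_labels=59, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # 768 -> num_labels

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]   # hidden state of the [CLS] token
        return self.classifier(self.dropout(cls_state))

# Multi-label training uses a per-label sigmoid via BCEWithLogitsLoss.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = DocBERT()
loss_fn = nn.BCEWithLogitsLoss()
```

In our experiments the single linear layer is replaced by the ANN1 or ANN2 classifiers described next.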

For our experiments, we use two DocBERT models, which replace the fully connected layer described in the previous paragraph with the ANN1 and ANN2 models. Recall that the details of these models are given in Sect. 5.2.1. The only change from those models is the input size, which is now 768 instead of 8638 since the BERT model's hidden states are 768-dimensional vectors. In total, we experiment with DocBERT models in four different schemes:

  • DocBERT with ANN1 classifier using Adam optimizer

  • DocBERT with ANN2 classifier using Adam optimizer

  • DocBERT with ANN1 classifier using Adagrad optimizer

  • DocBERT with ANN2 classifier using Adagrad optimizer

Even though we run experiments with different hyperparameters in the listed schemes, we observe that the Adagrad optimizer does not converge. The motivation for using the Adagrad optimizer is to update more aggressively the parts of the BERT and classifier weights that are rarely updated, while changing the frequently updated parts less. Although this is the main strength of the Adagrad optimizer, it drastically underperforms the Adam optimizer in our experiments, so we do not report the results obtained with Adagrad. In Table 12, we share the hyperparameter search grid for the experiments with the DocBERT models.

Then, in Table 13, we share the highest accuracy scores achieved in both schemes (DocBERT with ANN1 and with ANN2), together with the hyperparameter selections yielding these results. Again, the best-performing settings are selected based on the models' performance on the validation set. Finally, we provide the Top-k accuracy scores and nDCG scores in Tables 14 and 15.

Table 12 Hyperparameter search grid for the experiments with DocBERT model
Table 13 Best performing model configurations for DocBERT models with soft and conventional accuracy scores
Table 14 Top-k accuracy scores obtained on the validation set with the same DocBERT models
Table 15 nDCG-k scores obtained on the validation set with the same DocBERT models

As is shown in Tables 13, 14, 15, the DocBERT model with the ANN1 classifier outperforms the one with the ANN2 classifier in terms of classical and soft accuracy scores, whereas it obtains inferior results in Top-k accuracy and nDCG-k scores.

5.2.4 Multilabel classification final remarks

In Table 16, we share the results of best-performing ANN, \(\rm{LSTM}_{\rm{reg}}\) and DocBERT models based on their conventional accuracy scores and top-k accuracy scores on the validation set.

Table 16 Comparison of best performing multilabel classification models

As can be seen, the transformer-based BERT variant, DocBERT, outperforms the \(\rm{LSTM}_{\rm{reg}}\) and ANN models in terms of accuracy scores, attaining around \(17\%\) and \(15\%\) higher accuracy than the best-performing ANN and \(\rm{LSTM}_{\rm{reg}}\) models, respectively. The accuracy metric essentially reflects the model's confidence in exactly predicting the ground-truth labels, which is an indication of its generalization capability. The superiority of DocBERT is also present in the other metrics, demonstrating the effectiveness of transformers in handling long-range dependencies and extracting semantic information. Hence, we use the DocBERT model with the ANN1 classifier in the multilabel classification module of our pipeline.

5.3 Pipeline results

After combining the binary and multi-label modules as shown in Fig. 1, we obtain the pipeline for document elimination and classification. We then evaluate the performance of our pipeline on both the previous train \(+\) validation set and the held-out test set (Table 17). Note that we only report the classical accuracy for the binary classification module and the top-k accuracy score for the multi-label classification module, since these are the most interpretable metrics. For the top-k accuracy score, we only report \(k = [2,3,5]\), since these were selected as the most reasonable k values by the Con Edison team at the operational level.

Table 17 Result of the pipeline and its components on train + validation set

Although the results on the train + validation set are of limited interest for the binary classification module, we would like to emphasize the accuracy scores achieved by the multi-label classification part and the overall pipeline. This is mainly because the evaluation of the pipeline and its components is carried out on full regulations, meaning that the multi-label classification part has not seen exactly this data but only some sections of its text. The results show that there is not much deviation from the results obtained on the validation portion of the obligation dataset.

We also test this pipeline on the held-out test set. Table 18 shows the pipeline results obtained on this dataset.

Table 18 Result of the pipeline and its components on the held-out set

Tables 17 and 18 together show a drop in the accuracy scores of the multi-label module and the overall pipeline. It is mainly caused by the large number of samples in the held-out test dataset whose only associated label is Others. This is not the case in the training set, which implies a change in the distribution of the held-out dataset compared with the previous one. We also assess the pipeline while ignoring these samples and provide the results in Table 19. There is a considerable increase in the performance of the multi-label classification module, since the distributions of the train set and the held-out set become closer to each other. Overall, the pipeline succeeds both in determining whether a regulation is applicable to Con Edison and in finding the departments most concerned by the applicable regulations.

Table 19 Result of the pipeline and its components on the new held-out set without the samples that have only Others label

6 Conclusion

Our research proposes a deep learning-based automatic document classification system that aims to efficiently allocate text regulations to the relevant departments at Con Edison. The system achieves high accuracy, with over 90% accuracy for binary classification and over 80% Top-3 accuracy for multi-label classification on the given datasets. The pipeline can be used in various contexts where document classification is required, but it has a limitation in classifying long documents, as the DocBERT model used for the multi-label task can only process documents with fewer than 512 tokens. Currently, the system only considers the last 512 tokens of a long document for multi-label classification, which may not be suitable for applications with much longer documents. To overcome this issue, we plan to incorporate embedding or text abstraction techniques into our pipeline for long document processing in the future.