1 Introduction

As a fundamental task in natural language processing, Named Entity Recognition (NER) has attracted extensive attention in recent years [1]. The traditional NER task aims to extract entity mentions from text and classify them into a predefined, fixed set of entity classes [2, 3]. However, in many real-world scenarios, requirements for recognizing new entity classes arise over time, and the model must learn to recognize the new classes while maintaining its ability to recognize the old ones. For example, an e-commerce search engine is often required to grasp new user intents, which requires recognizing new entity classes. A traditional solution is to construct a dataset that contains annotations for both old and new entity classes and then train a model from scratch. This solution is time-consuming and laborious, and the original data may also become unavailable due to privacy protection issues, further increasing the cost of recognizing new entity classes.

To tackle the shortcomings of traditional solutions, Monaikul et al. [4] introduce Class Incremental Learning (CIL) into NER, forming the Class Incremental Named Entity Recognition (CINER) setting, which is more in line with real-world application scenarios. Under the CINER setting, as illustrated in Fig. 1, the model must continuously learn new entity classes, with each new dataset containing annotations only for the new entity classes. The learning process is divided into multiple steps, such as \({S}_{1}, {S}_{2}, \dots \), and each step corresponds to one or more new entity classes. As is well known, CIL faces the catastrophic forgetting problem [5,6,7,8]. This is because current machine learning models (especially deep learning models) store knowledge in their parameters, and training on a dataset that contains annotations only for new entity classes makes the parameters deviate from the original pattern. This parameter drift results in a significant performance drop on the old entity classes, giving rise to catastrophic forgetting. Worse still for NER, the distinctive labeling of NER data exacerbates the problem, because the data of the current step might include entities of previous classes: apart from the new entity class introduced in the current step, entities of all other classes are labeled as non-entity. For example, see \({S}_{2}\) in Fig. 1, where "Musk" is labeled as a non-entity, but in fact "Musk" is a Person entity. Since only the ORG (Organization) entities are labeled in the \({S}_{2}\) step, entities of other classes are disregarded. Obviously, training models on such a dataset aggravates catastrophic forgetting. Current CINER methods mainly mitigate forgetting through knowledge distillation [9] and by reviewing past knowledge [10]. These existing methods adopt a single-model approach, in which the model is initialized with the parameters learned in the previous step. According to [10] and [4], the performance of their methods drops severely as the number of steps grows, which demonstrates that these methods still suffer from catastrophic forgetting. Furthermore, these single-model methods are unstable across different entity class learning orders [10]: for the same set of entity classes, the performance under different learning orders can differ substantially.

Fig. 1

The figure shows the CINER setting and the MM framework. Among them, "PER," "ORG," "LOC," and "O" stand for Person, Organization, Location, and non-entity labels, respectively. The dataset available for each step contains labels only for the new entity class. MM trains a model for each entity class independently and then combines the output label sequences from all MM models for inference

Due to the severe catastrophic forgetting in CINER, it is difficult for a single model to continuously learn new entity classes without forgetting. Moreover, single-model methods struggle to exclude the influence of the order in which entity classes are learned. To this end, we propose the Multi-Model (MM) framework for CINER. As shown in Fig. 1, within the MM framework, we independently train a new model at each step and employ all models for joint inference. Under the CINER setting, the MM framework naturally has two advantages. First and foremost, MM has no forgetting problem, because it uses a new model to learn each new entity class, so learning new classes does not affect the past models' ability to recognize old classes. Moreover, because each model is trained independently, MM is robust to different entity class learning orders.

Despite the considerable advantages the multi-model framework offers in the CINER context, three problems remain to be addressed. First, in MM, individual models cannot acquire knowledge from other steps, which wastes the knowledge contained in the datasets of those steps. In response, we propose an Error-Correction Training (ECT) strategy. In ECT, we leverage the new entity class labels in the current dataset to supervise and rectify incorrect predictions made by previous models. ECT discourages past models from recognizing new-class entities as belonging to their own entity classes and can therefore improve the precision of past models in MM. Experimental results demonstrate that ECT improves the performance of MM. Second, outputs from multiple models may conflict. For example, consider the sentence "Tesla will have a launch event.", where the gold label for "Tesla" is Organization. If both the Person model and the Organization model recognize "Tesla" as their corresponding entity, the conflict-handling rule should determine "Organization" as the final output for "Tesla." Hence, an appropriate conflict-handling rule is essential for the MM framework. For this purpose, we design several conflict-handling rules, conduct detailed experiments, and select the rule that yields the largest performance improvement for MM. Lastly, the MM framework incurs higher inference and storage costs than single-model methods. To mitigate these costs, we share partial parameters across all models by freezing the parameters of the bottom layers. Experimental results show that MM with shared parameters remains competitive with the SOTA single-model methods.

We evaluated the MM framework on the English entity recognition datasets of CoNLL-03 [11] and OntoNotes-V5 [12]. The experimental results show that MM outperforms the SOTA CINER methods.

The contributions of this paper are summarized as follows:

  1.

    To the best of our knowledge, MM is the first multi-model CINER framework. MM has no forgetting problem and is robust to different entity class learning orders.

  2.

    We propose conflict-handling rules and ECT that significantly improve MM. ECT improves the prediction precision of the models within the MM framework, while the conflict-handling rules help MM produce more accurate final label sequences.

  3.

    Experimental results show that MM outperforms the SOTA methods by a large margin, and MM with shared parameters is still competitive with the SOTA methods.

2 Related works

2.1 Class incremental learning

Class incremental learning (also referred to as lifelong learning or continual learning) is a machine learning task that aims to learn new knowledge continually without forgetting. To deal with catastrophic forgetting, the mainstream methods use knowledge distillation. Learning without Forgetting [8] adapts a learned model to new tasks while retaining the knowledge gained earlier through a regularization term on output probabilities. Kirkpatrick et al. [13] store the sensitivity of the previous task loss to different parameters and penalize parameter changes from one task to the next according to these sensitivities. There are also methods that store parts of past samples for review. Rebuffi et al. [14] store valuable samples of old classes within a limited memory and train the network while maintaining the outputs of all visible samples on the past model. D'Autume et al. [15] alternately train on new data and memory data, ensuring that the feature space can always be restored to its original state through a simple linear transformation. Zhou et al. [16] introduce a class incremental learning method, called MEMO, that leverages shared and specialized layers to balance memory efficiency and accuracy. Tong et al. [17] propose a model of structured memory based on a closed-loop transcription between object classes and their subspaces in a low-dimensional feature space, and show experimentally that it prevents catastrophic forgetting and performs well for both generation and classification tasks. Recently, some researchers have recognized the advantages of ensemble methods in CIL. Ensemble methods allow knowledge to be stored in different neural networks, preventing catastrophic forgetting. Ye et al. [18] propose a lifelong learning method called Teacher–Student Network Learning, in which the teacher network is responsible for learning new tasks while the student network learns from the teacher through knowledge distillation, enabling the continuous accumulation of knowledge. Ye et al. [19] also propose the Knowledge Discrepancy Score (KDS) criterion, which measures the relevance of new task inputs to the knowledge already accumulated by the teacher module from its previous training. KDS keeps the teacher architecture lightweight and allows previously learned knowledge to be reused when appropriate, thereby accelerating the learning of a given task. Carta et al. [20] introduce a new lifelong learning paradigm called "Ex-Model Continual Learning" and argue that lifelong learning systems should utilize compressed information in the form of trained models. Douillard et al. [21] suggest that parameter-expansion-based dynamic architectures can effectively mitigate catastrophic forgetting in continual learning. Ostapenko et al. [22] introduce Dynamic Generative Memory (DGM), a continual learning framework driven by synaptic plasticity that maintains old knowledge while learning new tasks and benefiting from it.

These CIL methods are not suitable for the CINER setting, because they do not consider the problem of old-class entities in new data being labeled as non-entities. In our MM, each model does not need to learn other entity classes and is therefore unaffected by this issue. Furthermore, the aforementioned ensemble CIL methods lack conflict-handling rules specifically designed for NER tasks and do not consider utilizing data from subsequent steps to correct previous models.

2.2 Class incremental named entity recognition

Traditional NER methods [23,24,25,26,27] aim to extract entity mentions from text and classify them into a predefined fixed set of entity classes. However, in many real-world scenarios, new entity classes appear periodically, and the model needs to recognize them without forgetting past entity classes. Therefore, Monaikul et al. [4] introduce the class incremental setting to NER and construct the CINER setting. They preserve past knowledge by constraining the model output as it learns new entity classes, in particular by constraining changes of the model output at non-entity positions when learning on new datasets. Under the CINER setting, Xia et al. [10] propose a two-stage framework: in the learning stage, they distill old knowledge from a teacher to a student on the current dataset; in the reviewing stage, they generate samples of old entity classes with generation models for review. Ma et al. [28] propose an entity-aware contrastive learning method that adaptively detects entity clusters in "O" under the CINER setting. Zhang et al. [29] introduce a pooled feature distillation loss that navigates the trade-off between retaining knowledge of old entity types and acquiring new ones, thereby more effectively mitigating catastrophic forgetting. Wang et al. [30] focus on NER under few-shot scenarios and achieve good results by using past NER models to generate training samples of past classes. Li et al. [31] propose a method called task Relation Distillation and Prototypical pseudo-label (RDP) for Incremental Named Entity Recognition (INER); their task relation distillation scheme ensures inter-task semantic consistency across different incremental learning tasks by minimizing an inter-task relation distillation loss. Zhang et al. [32] propose a Decomposing Logits Distillation (DLD) method for CINER. Chen et al. [33] propose a decoupled two-phase framework for the CINER task under few-shot scenarios: the whole task is converted into two separate tasks, entity span detection and entity class discrimination, which leverage parameter cloning and label fusion to learn different levels of knowledge separately. The above methods focus on balancing the learning of new knowledge and the preservation of old knowledge in a single model. Despite their initial success, they can only partially solve the forgetting problem, which becomes more severe as the number of entity classes grows. In contrast, our multi-model framework has no forgetting problem.

3 Methodology

The training and testing process of MM is shown in Algorithm 1, where \(S_k\) denotes the kth step. In this section, we first formally define the CINER setting. Then, we introduce the NER model, error-correction training, and the conflict-handling rules.

Algorithm 1

Training and testing procedure of the MM framework at \(S_k\)
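Since the pseudocode of Algorithm 1 is provided only as a figure, the following minimal Python sketch summarizes one incremental step and joint inference. The callables `train_fn`, `ect_fn`, and `merge_fn` are hypothetical placeholders for the procedures of Sects. 3.2, 3.3, and 3.4; they are not part of the original implementation.

```python
def mm_step(models, D_k, train_fn, ect_fn):
    """One incremental step S_k of the MM framework."""
    new_model = train_fn(D_k)                   # Sect. 3.2: learn the new entity class E_k
    models = [ect_fn(m, D_k) for m in models]   # Sect. 3.3: error-correction training of past models
    return models + [new_model]

def mm_predict(models, sentence, merge_fn):
    """Joint inference: one BIO sequence per model, merged by the conflict-handling rules."""
    label_seqs = [m.predict(sentence) for m in models]  # assumes each model exposes predict()
    return merge_fn(label_seqs)                         # Sect. 3.4: conflict handling
```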

3.1 Problem formulation

We adopt the CINER setting proposed by [4]. Under this setting, we define steps according to the learning order of new entity classes, denoted as \(\{ S_1 ,S_2 , \ldots , S_k , \ldots \}\). Each step contains one or more new entity classes, and in each step we train a single model for the new entity classes. The MM framework processes a step with one entity class and a step with several entity classes in the same way; to simplify the description, we assume that each step introduces one new entity class. For example, in \(S_k\), the corresponding entity class is \(E_k\), and the entity classes of different steps do not overlap. In \(S_k\), only the dataset \(D_k\) is available, and \(D_k\) contains only the annotations for \(E_k\). \(D_k\) is decomposed into \(D_{{\text{train}}_k }\) and \(D_{{\text{val}}_k }\) for training and validation, respectively. The learning objective is to recognize \(E_k\) while maintaining the ability to recognize \(\{ E_1 , \ldots ,E_{k - 1} \}\). After the learning process in \(S_k\) is complete, the test set \(D_{{\text{test}}_k }\), which contains the annotations for \(\{ E_1 , \ldots ,E_k \}\), is used for testing.

In this study, all discussions of NER methods and all experiments use the "BIO" annotation scheme by default, where the label "B" marks the beginning of an entity, the label "I" marks the continuation of an entity, and the label "O" marks a non-entity token. For a specific entity class, the labels are written as B-XXX (beginning of an "XXX" entity) and I-XXX (inside an "XXX" entity); for example, B-PERSON and I-PERSON denote a PERSON entity.

3.2 Learn to recognize new entity class

Under the CINER setting, methods need to train a model on the dataset that only contains the annotations for the new entity class. In MM, we initialize a new model to learn to recognize the new entity class. The NER model used in MM and the training process are detailed below.

In this study, we treat NER as a sequence labeling task. Our NER model employs the BERT text encoder [34] in conjunction with a Conditional Random Field (CRF) layer [35]. The implementation process is shown in Fig. 2. Given an input sentence \(x = \{ x_0 , \ldots ,x_L \}\) and a ground-truth label sequence \(y = \left\{ {y_1 , \ldots ,y_L } \right\},y_i \in R^{{\text{label}}_{{\text{num}}} }\), where \(L\) is the sequence length and \({\text{label}}_{{\text{num}}}\) is the number of label types, BERT and a linear layer are first applied to map \(x\) to the emission score \(e(x,y)\). The emission score measures how likely a particular label is for a given word; it is computed by passing the BERT output through a linear layer, yielding a score for each possible label at each position in the sequence. We then use the CRF label transition matrix \(T\) in conjunction with the emission score \(e\) to compute the probability \(p\left( {y|x} \right)\) of the output label sequence. The label transition matrix \(T\) captures the probabilities of transitioning from one label to another: each element \(T[y_i ,y_{i + 1} ]\) represents the likelihood of transitioning from label \(y_i\) at position \(i\) to label \(y_{i + 1}\) at position \(i + 1\). This allows the model to account for dependencies between consecutive labels, which is important for sequence labeling tasks such as NER. The calculations are as follows:

$$ {\text{score}}\left( {x,y} \right) = \mathop \sum \limits_{i = 1}^L \left( {e(x_i ,y_i )} \right) + \mathop \sum \limits_{i = 1}^{L - 1} \left( {T\left[ {y_i ,y_{i + 1} } \right]} \right) $$
(1)
$$ p\left( {y|x} \right) = \frac{{e^{{\text{score}}\left( {x,y} \right)} }}{{\sum_{y{\prime} \in Y(x)} e^{{\text{score}}\left( {x,y{\prime} } \right)} }} $$
(2)

where \({\text{score}}(x,y)\) represents the sequence label score, and \(Y(x)\) represents all possible label sequences.
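As an illustration of Eqs. (1) and (2), the following NumPy sketch computes the sequence score and the log-probability of a label sequence. The brute-force enumeration of \(Y(x)\) is for clarity only and assumes a toy label set; practical CRF implementations compute the partition function with the forward algorithm.

```python
import numpy as np
from itertools import product

def crf_score(emissions, transitions, labels):
    """score(x, y) = sum_i e(x_i, y_i) + sum_i T[y_i, y_{i+1}]  (Eq. 1)."""
    emit = sum(emissions[i, y] for i, y in enumerate(labels))
    trans = sum(transitions[labels[i], labels[i + 1]] for i in range(len(labels) - 1))
    return emit + trans

def crf_log_prob(emissions, transitions, labels):
    """log p(y | x) by enumerating all label sequences Y(x)  (Eq. 2)."""
    L, num_labels = emissions.shape
    all_scores = [crf_score(emissions, transitions, seq)
                  for seq in product(range(num_labels), repeat=L)]
    log_z = np.log(np.sum(np.exp(all_scores)))      # log of the partition function
    return crf_score(emissions, transitions, labels) - log_z

# Toy example: 3 tokens, 3 labels (O, B-PER, I-PER), random scores for illustration.
rng = np.random.default_rng(0)
e = rng.normal(size=(3, 3))      # emission scores from BERT + linear layer
T = rng.normal(size=(3, 3))      # label transition matrix
gold = (1, 2, 0)                 # B-PER I-PER O
print(crf_log_prob(e, T, gold))  # log-probability of the gold label sequence
```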

Fig. 2

Entity recognition process

During training, the loss function uses a negative log-likelihood function, and the loss calculation equation is as follows:

$$ {\text{loss}} = - \log p\left( {y|x} \right) $$
(3)

During inference, the Viterbi algorithm [36] is used for sequence decoding.
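For reference, a compact NumPy sketch of Viterbi decoding under the scoring of Eq. (1) is given below; it is an illustrative implementation, not the one used in the paper.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the label sequence maximizing score(x, y) from Eq. (1).

    emissions:   (L, K) emission scores for L tokens and K labels
    transitions: (K, K) label transition scores T
    """
    L, K = emissions.shape
    score = emissions[0].copy()               # best score of paths ending in each label
    backptr = np.zeros((L, K), dtype=int)
    for i in range(1, L):
        # cand[prev, cur] = best score ending in `prev` + T[prev, cur] + e(x_i, cur)
        cand = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    best = [int(np.argmax(score))]            # best label at the last position
    for i in range(L - 1, 0, -1):             # follow back-pointers to recover the path
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]
```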

3.3 Error-correction training

In MM, each model is trained only on the dataset of its corresponding step, so it cannot obtain knowledge from other datasets. We consider this a waste of the knowledge contained in the datasets of the other steps. Although the dataset of another step contains only the labels of its own entity class, those labels can still serve as valuable supervision to help past models distinguish that entity class from their own. Hence, datasets from subsequent steps have the potential to improve previously trained models. To this end, we propose an Error-Correction Training (ECT) strategy. In ECT, we employ the entity labels of the current step's dataset as supervision that discourages past models from recognizing new-class entities as their own entity classes. The error-correction process is shown in Fig. 3. In ECT, each past model first predicts label sequences on the current step's dataset, the parts that conflict with the gold labels are corrected, and finally each past model is trained on its own corrected, self-labeled dataset. The implementation of ECT is described below.

Fig. 3

The ECT procedure for a single sample

The procedure of ECT is shown in Fig. 3. Assume the current step is \(S_k\) and the available dataset is \(D_k\), which contains only the annotations of \(E_k\). We perform ECT on all past models \(M_1 , \ldots , M_{k - 1}\). For each sample \(x = \{ x_0 , \ldots ,x_L \}\) with gold-label sequence \(y = \left\{ {y_1 , \ldots ,y_L } \right\}\) in \(D_k\), we generate the prediction \(y^i = \{ y_1^i , \ldots ,y_L^i \}\) using \(M_i\), where \(i < k\). We then examine whether the entity labels in \(y^i\) conflict with those in \(y\); if a conflict arises, we reset the labels of the entire conflicting entity to "O." For instance, as illustrated in Fig. 3, consider the sentence "Elon Musk is the founder of Tesla" with gold-label sequence "O O O O O O B-ORG" in \(S_k\). The prediction of \(M_1\) (the PER model) is "B-PER I-PER O O O O B-PER." The predicted label for "Tesla" conflicts with the gold label, so we correct the label sequence to "B-PER I-PER O O O O O" and add this corrected sequence to the dataset \(D_{k,1}\). Finally, \(M_i\) is trained on the dataset \(D_{k,i}\). Subsequent experiments demonstrate that ECT improves the precision of past models and the overall performance of MM.
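A minimal sketch of this relabeling step on one sentence, assuming labels are plain BIO strings, follows; it reproduces the correction in Fig. 3 but is only an illustrative reading of the procedure.

```python
def ect_correct(pred, gold):
    """Reset any predicted entity that conflicts with a gold entity of the new class to "O"."""
    corrected = list(pred)
    i = 0
    while i < len(pred):
        if pred[i].startswith("B-"):
            j = i + 1                                    # find the span of the predicted entity
            while j < len(pred) and pred[j].startswith("I-"):
                j += 1
            if any(gold[t] != "O" for t in range(i, j)):  # conflict with a gold E_k entity
                for t in range(i, j):
                    corrected[t] = "O"
            i = j
        else:
            i += 1
    return corrected

# Example from Fig. 3: "Elon Musk is the founder of Tesla"
pred = ["B-PER", "I-PER", "O", "O", "O", "O", "B-PER"]   # prediction of M_1 (the PER model)
gold = ["O",     "O",     "O", "O", "O", "O", "B-ORG"]   # gold labels of S_k (ORG only)
print(ect_correct(pred, gold))   # ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O']
```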

3.4 Conflict-handling rules

In MM, the final output label sequence is obtained by merging the output label sequences of all models. Conflict handling is required when the output labels from multiple models differ at the same position. We design several conflict-handling rules, conduct detailed experiments to compare their performance, and choose the best-performing one for MM. The rules are described in detail below. Conflicts are divided into four types, which cover all possible cases. The conflict types and their handling are as follows:

"B" and "O" conflicts: If only model \(M_i\) outputs the label "B-\(E_i\)" and others output the label "O," the final output label is set to "B-\(E_i\)."

"B" and "B" conflicts: If multiple models output the label "B," select the label "B-\(E_i\)" with the largest output probability, set the final output label as "B-\(E_i\)."

"I" and "O" conflicts: If one or multiple models output the label "I" and others output the label "O," check whether the final output label at the previous position also belongs to the entity class \(E_i\). If so, output "I-\(E_i\)." Otherwise, output label "O."

"B" and "I" conflicts: If both labels "B-\(E_i\)" and label "I-\(E_j\)" are present in the output labels of multiple models, and the output label at the previous position is B-\(E_j\) or I-\(E_j\).

Rule1: Compare the output probabilities of the label "B-\(E_i\)" and the label "I-\(E_j\)," and select the one with the higher probability as the final output label. In addition, if the final output label is set to "B-\(E_i\)," modify the previous output labels of the truncated entity \(E_j\) to the label "O."

Rule2: Compare the output probabilities of the label "B-\(E_i\)" and the label "I-\(E_j\)," and select the one with the higher probability as the final output label.

Rule3: Do not compare the output probabilities; directly set the final label to "I-\(E_j\)."

As described above, these rules differ only in how they handle "B" and "I" conflicts. Rule1 and rule2 both determine the current output label from the probability outputs: when the output probability of "B-\(E_i\)" is greater than that of "I-\(E_j\)," both assign the label "B-\(E_i\)" to the current position. They differ in that rule1 also resets the labels of the truncated entity \(E_j\) to "O." Rule3 ignores the probability outputs and directly outputs the label "I-\(E_j\)." Subsequent experiments confirm that rule1 is the best of the proposed rules. We believe this is because rule1 takes the probability output into account and avoids errors caused by truncated entities.
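To make the merging procedure concrete, the following simplified Python sketch applies the rules above with rule1 for "B"/"I" conflicts. It assumes each model exposes, besides its BIO label, a per-position confidence for that label; this probability interface is an assumption for illustration, not a specification of the actual implementation.

```python
def merge_rule1(label_seqs, prob_seqs):
    """Merge per-model BIO sequences position by position (rule1 for "B"/"I" conflicts).

    label_seqs[m][i]: label of model m at position i ("O", "B-X", or "I-X")
    prob_seqs[m][i]:  confidence of model m for that label
    """
    length = len(label_seqs[0])
    out = []
    for i in range(length):
        cands = [(seq[i], prob[i]) for seq, prob in zip(label_seqs, prob_seqs) if seq[i] != "O"]
        prev = out[-1] if out else "O"
        b_cands = [c for c in cands if c[0].startswith("B-")]
        # an "I" label is only valid if it continues the entity chosen at the previous position
        i_cands = [c for c in cands if c[0].startswith("I-") and prev != "O" and prev[2:] == c[0][2:]]
        if b_cands and i_cands:                       # "B" and "I" conflict -> rule1
            best_b = max(b_cands, key=lambda c: c[1])
            best_i = max(i_cands, key=lambda c: c[1])
            if best_b[1] > best_i[1]:
                j = len(out) - 1                      # reset the truncated E_j entity to "O"
                while j >= 0 and out[j][2:] == best_i[0][2:]:
                    starts_entity = out[j].startswith("B-")
                    out[j] = "O"
                    j -= 1
                    if starts_entity:
                        break
                out.append(best_b[0])
            else:
                out.append(best_i[0])
        elif b_cands:                                 # "B"/"O" or "B"/"B" conflict
            out.append(max(b_cands, key=lambda c: c[1])[0])
        elif i_cands:                                 # "I"/"O" conflict with a valid running entity
            out.append(i_cands[0][0])
        else:                                         # only "O", or an "I" continuing nothing
            out.append("O")
    return out
```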

4 Experiments setting

4.1 Datasets

To evaluate the effectiveness of our proposed framework, we conducted experiments on the English NER datasets CoNLL-03 and OntoNotes-V5. The details of the two datasets are as follows:

CoNLL-03 [11]: The English NER dataset of CoNLL-03 includes 22 K sentences extracted from news stories. The dataset contains four entity class labels, namely: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC).

OntoNotes-V5 [12]: The English NER dataset of OntoNotes-V5 includes 144 K sentences extracted from various texts. Text types include news, transcribed telephone conversations, and weblogs. The dataset contains the labels of 18 entity classes.

For a fair comparison, we followed the experimental settings of [10] and [31]. In \(S_k\), to construct \(D_k\), we selected all samples that contain entities of class \(E_k\) and, for each sample, changed all labels "B-\(E_i\)" and "I-\(E_i\)" \(\left( {i \ne k} \right)\) to the label "O." To construct the test set \(D_{{\text{test}}_k }\), we selected all samples that contain entities of classes \(\cup_{i = 1}^k E_i\) and changed the labels "B-\(E_i\)" and "I-\(E_i\)" \((i > k)\) to the label "O."
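This slicing protocol can be summarized by the sketch below, assuming fully annotated samples stored as (tokens, BIO labels) pairs; the helper and the class names in the comments are purely illustrative.

```python
def slice_step_data(samples, required_classes, kept_classes):
    """Keep samples containing a required class and mask all other entity labels to "O".

    samples:          list of (tokens, labels) pairs with full BIO annotation
    required_classes: a sample is kept only if it contains an entity of one of these classes
    kept_classes:     labels of all other classes are rewritten to "O"
    """
    sliced = []
    for tokens, labels in samples:
        if not any(lab != "O" and lab[2:] in required_classes for lab in labels):
            continue
        masked = [lab if lab != "O" and lab[2:] in kept_classes else "O" for lab in labels]
        sliced.append((tokens, masked))
    return sliced

# Training slice for S_k (e.g. E_k = ORG): only ORG annotations are kept.
# D_k      = slice_step_data(full_train, {"ORG"}, {"ORG"})
# Test set after S_k: samples with any class learned so far, annotated for all of them.
# D_test_k = slice_step_data(full_test, {"PER", "LOC", "ORG"}, {"PER", "LOC", "ORG"})
```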

4.2 Metric

In evaluations, we report the micro-F1 and macro-F1 scores [37] of our proposed framework. Specifically, we report the test macro-F1 and micro-F1 averaged over all learning orders at each step. The calculation equations are as follows:

$$ {\text{macro}}F1_k = \frac{1}{r} \sum \limits_r {\text{macro}}F1_k^r $$
(4)
$$ {\text{micro}}F1_k = \frac{1}{r} \sum \limits_r {\text{micro}}F1_k^r $$
(5)

where \(F1_k^r\) denotes the test \(F1\) score of \(S_k\) under learning order \(r\).

In addition, we assessed the robustness of MM to different entity class learning orders. Robustness to learning order refers to the stability of the framework's test performance when the same training set is learned under different entity class orders. Note that, in the CINER setting, this robustness can only be measured at the final step. Referring to Table 1, at the last step different learning orders share the same entity class set, whereas at earlier steps different learning orders correspond to different entity class sets and hence to different training and test sets. Therefore, the assessment of robustness to learning orders relies on the performance at the final step. We use the Error Bound (EB) [38] to evaluate this robustness. EB is defined as follows:

$$ {\text{EB}} = Z_{\frac{\alpha }{2}} \times \frac{\sigma }{{\sqrt {n} }} $$
(6)

where \(Z_{\frac{\alpha }{2}}\) is the confidence coefficient at confidence level \(\alpha\), and \(\sigma\) is the standard deviation of the average F1 values obtained from the \(n\) different learning orders. A framework with a lower EB shows better robustness.
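Eq. (6) reduces to a few lines of Python. The sketch below assumes the sample standard deviation and z = 1.96 for the 95% confidence level used in the experiments; the F1 values in the example are illustrative placeholders, not reported results.

```python
import math
import statistics

def error_bound(f1_per_order, z=1.96):
    """Error Bound (Eq. 6): z * sigma / sqrt(n) over the final-step F1 of n learning orders."""
    n = len(f1_per_order)
    sigma = statistics.stdev(f1_per_order)   # standard deviation across learning orders
    return z * sigma / math.sqrt(n)

# Illustrative placeholder values for 8 learning orders (not actual results).
print(error_bound([88.1, 88.2, 88.0, 88.1, 88.2, 88.1, 88.0, 88.1]))
```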

Table 1 The different entity classes learning order for each dataset

4.3 Implementation details

In the experiments, we adopted the settings of [10] and [31] to evaluate the effectiveness of the proposed framework. In the setting proposed by Xia et al. [10], multiple entity class learning orders are sampled for the same set of entity classes, enabling an assessment of a method's robustness to different learning orders. Specifically, we sampled 8 orders from CoNLL-03 and 6 from OntoNotes-V5, detailed in Table 1. The confidence level for EB is set to 0.95. For a fair comparison, we used BERT-base-uncased [34], as in Xia et al. [10].

We also implemented the setting proposed in [31], where entity classes are learned in a fixed (alphabetical) order using the corresponding data slices to train the model sequentially. This setting is referred to as FG-a-PG-b, where "a" denotes the number of entity classes included in the first step and "b" the number of entity classes included in each subsequent step. For CoNLL-03, we adopted the FG-1-PG-1 and FG-2-PG-1 configurations. For OntoNotes-V5, which includes 18 entity classes, we used FG-8-PG-2, FG-8-PG-1, FG-2-PG-2, and FG-1-PG-1. The learning difficulty increases with the number of steps; the FG-1-PG-1 setting on OntoNotes-V5 is the most challenging, as it involves 18 incremental steps. For a fair comparison, we used BERT-base-cased [34], as in [31], under the FG-a-PG-b setting.

In all experiments, we initialized our models with the pre-trained parameters provided by BERT [34]. The maximum sequence length is set to 128. For learning a new entity class, we used the Adam [39] optimizer with a learning rate of 5e-5, a batch size of 32, a maximum of 20 training epochs, and early stopping (patience = 3). For ECT, we performed one epoch of training with the SGD [40] optimizer, a learning rate of 5e-5, and a batch size of 64. Unless stated otherwise, the MM used in all experiments except Sect. 5.5 is the version without shared parameters. All experiments are implemented in PaddlePaddle and run on a single NVIDIA V100 GPU.

4.4 Compared baselines

All comparison baselines use training settings identical to those of MM. In this context, MM does not share parameters; for experiments on parameter sharing, see Sect. 5.5. In all experiments, we directly quote the results reported in the corresponding studies. ExtendNER [4] and L&R [10] report only the macro-F1 score.

ExtendNER [4]: ExtendNER is a method based on knowledge distillation. It takes the current model as the teacher and the new model as the student, and distills past knowledge from the teacher to the student on the current step's dataset.

L&R [10]: L&R is a SOTA CINER method that combines generation methods with knowledge distillation. L&R trains a generation model for each step's dataset. Unlike ExtendNER, it not only distills on the current dataset but also uses data produced by the generation models for review.

DLD [32]: DLD is a decomposing logits distillation method that enhances the model's ability to retain old knowledge and mitigates catastrophic forgetting.

RDP [31]: RDP is a prototypical pseudo-label strategy that distinguishes old entity types from the current non-entity type using the old model.

CPFD [29]: CPFD retains knowledge of old entity types while acquiring new ones by introducing a pooled feature distillation loss. Additionally, it develops confidence-based pseudo-labeling for the non-entity type.

(ours)w/o-ECT: The MM framework without ECT. In each step, the corresponding model is fixed after training, so the method is independent of the learning order.

(ours)MM: MM denotes the multi-model framework with ECT. Unless otherwise specified below, MM uses ECT by default.

UpperBound: It is the theoretical upper boundary of CINER. It does not perform incremental learning but trains the model from scratch on the dataset, which contains annotations for all entity classes.

5 Result analysis

5.1 Comparison experiments discussion

Based on the results in Tables 2, 3, 4, and 5, our proposed method achieves significant improvements over the state-of-the-art baseline methods. The best results are shown in bold. As shown in Tables 2 and 3, in \(S_4\) on CoNLL-03, the macro-F1 score of MM is 0.34 higher than that of L&R. On OntoNotes-V5, which involves more incremental steps, MM improves over L&R by 1.33, 2.27, 3.44, 3.28, and 3.32 in \(S_{2, \ldots ,6}\), respectively. The results in Tables 4 and 5 show that, under the FG-1-PG-1 setting, which has the most incremental steps, our method achieves significant advantages over the advanced baselines. In the FG-1-PG-1 setting on CoNLL-03, which includes four steps, our method exceeds the SOTA methods by 5.78 and 6.36 in micro-F1 and macro-F1, respectively. In the FG-1-PG-1 setting on OntoNotes-V5, which includes 18 steps, our method surpasses the state-of-the-art methods by 13.85 and 13.72 in micro-F1 and macro-F1, respectively. The experimental results reveal that the advantage of MM over single-model methods becomes more pronounced as the number of incremental steps increases. This is because current single-model approaches still suffer from forgetting, which worsens as the number of incremental steps grows. Our MM learns new entity classes incrementally without forgetting and significantly outperforms single-model methods under the CINER setting. Moreover, in all experiments, MM with ECT consistently outperforms MM without ECT, demonstrating the effectiveness of ECT.

Table 2 Results on CoNLL-03 under CINER setting proposed by Xia et al.
Table 3 Results on OntoNotes-V5 under CINER setting proposed by Xia et al.
Table 4 Results on CoNLL-03 under FG-a-PG-b CINER setting
Table 5 Results on OntoNotes-V5 under FG-a-PG-b CINER setting

5.2 The advantage of multi-model framework

5.2.1 Multi-model framework has no forgetting problem

The multi-model framework theoretically solves the catastrophic forgetting problem completely. The forgetting problem of current single-model methods stems from the parameter drift caused by learning new entity classes. In contrast, in MM, each model does not need to learn new patterns, so there is no forgetting problem. Additionally, through the ECT strategy, MM can improve the performance of past models by utilizing data from subsequent steps. Single-model methods cannot use ECT, which is a unique advantage of the multi-model framework. Tables 2, 3, 4, and 5 show that MM outperforms single-model methods significantly.

5.2.2 Multi-model framework is robust to different learning orders

Robustness to incremental learning orders refers to the performance gap observed when the same set of entity classes is learned in different orders. For example, given an entity set such as {Person, Location}, if the performance gap between learning Person first and learning Location first is small, the method is robust to learning orders. Tables 2 and 3 present the EB values of each method at the final step. At the final step, different learning orders share the same entity class set and test data, so the EB values at this step indicate the robustness of the models to different learning orders. The EB value of the w/o-ECT method is 0 because each model remains unchanged after isolated training. The UpperBound method learns all entity classes from the start, so its EB value is also 0. The experimental results show that the EB of MM at the last step is 0.02 on CoNLL-03 and 0.11 on OntoNotes-V5, much smaller than that of the single-model methods. In single-model methods, knowledge is stored within a single model, so earlier knowledge can be overwritten by later knowledge and performance may degrade; due to this interference, the performance on the same entity class can vary considerably across steps, making single-model methods sensitive to the entity class learning order. In contrast, each model in MM is trained independently, so the learning order hardly affects the final performance. Thus, MM exhibits strong robustness to different learning orders.

5.3 Impact of error-correction training

According to Tables 2, 3, 4, and 5, our proposed ECT achieves steady improvements for MM. We further investigated the mechanisms behind its effectiveness from two aspects: first, the validation precision of each model after applying ECT; second, the influence of different hyperparameters on ECT.

5.3.1 Model precision after applying ECT

We experimentally verify the specific impact of ECT on each model in MM. Tables 6 and 7 present the validation precision results. Specifically, on the CoNLL-03 dataset, we select the first four entity class learning orders and record, for each of them, the validation precision of the model corresponding to the first entity class at each step after applying ECT. For example, for the learning order "LOC → ORG → MISC → PER," we record the validation precision of the LOC model on the validation set \({D}_{{\text{val}}_{1}}\) at each step after applying ECT; the result is shown in the LOC row of Table 6. We conducted the same experiments on OntoNotes-V5. Based on the results in Tables 6 and 7, we observe that ECT generally improves the precision of the models in MM. For instance, after applying ECT, the NORP model on the OntoNotes-V5 dataset exhibits a precision improvement of 1.38 percentage points. This demonstrates the effectiveness of ECT.

Table 6 Validation precision of each entity class at each step on CoNLL-03 when using ECT
Table 7 Validation precision of each entity class at each step on OntoNotes-V5 when using ECT

5.3.2 The effect of different hyperparameters for ECT

We present the micro-F1 values of MM under different hyperparameters in Fig. 4 and Table 8. As discussed in the preceding section, MM is robust to entity class learning orders, so to simplify the experiments and focus on specific parameters we used a fixed learning order. Figure 4 illustrates the impact of different learning rates on performance; a learning rate of 5e-5 consistently delivers the highest performance. Table 8 compares one and two epochs of ECT and shows that two epochs lead to a serious performance drop compared to one epoch. These findings demonstrate that, for the MM framework, one epoch of ECT suffices to achieve optimal performance.

Fig. 4

Effects of ECT with different learning rates on performance. The histogram on the left shows the micro-F1 score for the entity class learning order "LOC → PER → ORG → MISC." The right one shows the micro-F1 score for the entity class learning order "ORG → PERSON → GPE → DATE → CARDINAL → NORP"

Table 8 This table shows the micro-F1 of different ECT epochs

5.4 Performance of different conflict-handling rules

We designed three conflict-handling rules; the rule used in MM is recorded as rule1. The rules differ only in how they handle "B" and "I" conflicts and are otherwise identical; details can be found in Sect. 3.4. Table 9 shows the performance of w/o-ECT with different conflict-handling rules. Rule1 achieves the best all-around performance. We believe this is because rule1 takes the probability output into account and avoids errors caused by truncated entities. Therefore, we chose rule1 as the conflict-handling rule of the MM framework.

Table 9 The average micro-F1 of all learning orders with different conflict-handling rules for w/o-ECT

5.5 Sharing parameters in MM

The inference cost of MM increases linearly with the number of models, which can lead to scalability challenges. To mitigate this cost, we employ parameter sharing for the bottom layers across all models. Specifically, all models in MM are initialized with identical parameters, and we freeze the bottom \(n\) layers during training. In the inference phase, we compute the output vectors of the bottom \(n\) layers once and then feed them into the remaining portion of each model. This allows us to store only a single copy of the parameters of the bottom \(n\) layers and to compute their outputs only once during inference.
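The inference pattern with shared bottom layers can be sketched as follows; `shared_bottom` and `per_model_top` are hypothetical callables standing in for the frozen bottom-\(n\) BERT layers and each model's remaining layers plus CRF, and are not part of the released implementation.

```python
def shared_inference(shared_bottom, per_model_top, token_features):
    """Run the frozen bottom n layers once, then the per-class top layers of every model.

    shared_bottom: callable mapping token features to the hidden states of layer n
    per_model_top: list of callables, one per entity class, covering the remaining layers + CRF
    """
    hidden = shared_bottom(token_features)          # computed and stored only once
    return [top(hidden) for top in per_model_top]   # one BIO label sequence per model
```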

To explore the impact of shared parameters on framework performance, we conducted detailed experiments. Tables 10 and 11 display the results of the parameter-sharing experiments in the last step. We set \(n\) to 6 and 9, denoted as MM(6) and MM(9), respectively. Our experiments indicate that the MM with shared parameters outperforms existing single-model methods on OntoNotes-V5. Particularly in the FG-n-PG-n setting of OntoNotes-V5, the results in Table 11 show that MM(6) and MM(9) achieve better performance compared to MM without shared parameters under various experimental settings. Notably, MM(6) even surpasses the UpperBound trained with all data directly in the FG-8-PG-2 setting. This demonstrates that MM with shared parameters still holds significant advantages over single-model methods, and this advantage increases with the number of incremental steps in the task.

Table 10 The macro-F1 results of sharing parameters MM at the last incremental step
Table 11 The results of sharing parameters MM at the last incremental step under FG-n-PG-n setting on OntoNotes-V5

5.6 Inference time and space complexity analysis

Due to the use of multiple models, the MM framework requires more memory and inference time than single-model methods. We conducted experiments to compare the inference cost and performance of our MM (with parameters shared across the bottom six or nine layers) against the state-of-the-art baseline CPFD. MM executes its multiple BERT models in parallel, while CPFD uses a single BERT model. We randomly selected 10 samples from the OntoNotes-V5 dataset and performed inference sequentially (without batching). The micro-F1 and macro-F1 scores were calculated on the entire test set. The experimental results are shown in Table 12.

Table 12 Inference cost and performance on the OntoNotes-V5 dataset at the last step under the FG-8-PG-2(6 steps) and FG-1-PG-1(18 steps) setting, MM includes six or eighteen model parameters

In the FG-8-PG-2 setting, the number of parameters, FLOPs, inference time, and memory usage of MM(6) are approximately 3.5, 3.5, 1.6, and 1.6 times that of CPFD, respectively. For MM(9), these metrics are approximately 2.25, 2.25, 1.45, and 1.3 times that of CPFD, respectively. Despite the increased inference costs, MM with shared parameters achieves higher performance, with MM(6) improving micro-F1 from 83.38 to 87.01 and macro-F1 from 66.37 to 75.64. MM(9) also shows improvements with micro-F1 at 84.72 and macro-F1 at 70.92.

In the FG-1-PG-1 setting, MM(6) and MM(9) demonstrate even more pronounced improvements, though at higher costs. MM(6) has parameter count, FLOPs, inference time, and memory usage approximately 9.5, 9.5, 3.7, and 3 times that of CPFD, respectively. MM(9) reduces these costs slightly. However, MM(6) improves micro-F1 from 66.73 to 83.67 and macro-F1 from 54.12 to 67.86. MM(9) also shows strong performance, with micro-F1 at 82.95 and macro-F1 at 63.70.

These results demonstrate that despite the increased inference costs, the MM framework with shared parameters significantly outperforms CPFD, especially as the number of incremental steps increases.

6 Conclusion

This paper proposed a Multi-Model (MM) framework to solve the catastrophic forgetting problem of the CINER task. MM trains and saves a model independently for each incremental step, so it has no forgetting problem and is robust to the learning order of entity classes. Moreover, we proposed ECT to improve the precision of the models and designed an effective conflict-handling rule to merge the output label sequences of multiple models. We conducted detailed experiments on CoNLL-03 and OntoNotes-V5, and the results show that our method is superior to the SOTA CINER methods. Furthermore, we demonstrated the effectiveness of each part of MM through ablation analysis. In the future, we will consider how to train the multiple models independently while constraining their output probability distributions to a common scale. It would also be interesting to explore how to incorporate knowledge from multiple models into a single model.