
1 Introduction

Since the rise of BERT [3], pre-trained language models (PLMs) have become the state-of-the-art (SOTA) models for natural language processing (NLP) [13, 31]. Many PLMs have been developed by academia and industry, such as GPT [15], XLNet [25], and ALBERT [10]. These BERT-style models achieve considerable improvements on many NLP tasks via self-supervised pre-training and transfer learning to labeled tasks, such as classification, text pair matching, named entity recognition (NER), etc. Despite their outstanding performance, their industrial usage is still limited by the high latency during inference.

Fig. 1. The overall architecture of our FastNER framework.

NER and other sequence labeling tasks play a central role in many application scenarios, such as question answering, document search, document-level information extraction, etc. However, these applications require low latency. For example, an online search engine needs to respond to the user’s query in less than 100 milliseconds. Thus, a NER model should be both accurate and efficient. In addition, consumer query traffic is highly concentrated in certain time intervals. For example, during dinner hours, food search engines are used much more often than usual. Thus, it is important for deployed NER models to flexibly adjust their latency.

Literature has focused on making PLMs’ inference more efficient via adaptive inference [14, 23, 28, 30]. The idea of adaptive inference is to process simple queries with lower layers of BERT and more difficult queries with deeper layers, thus speeding up inference on average without loss of accuracy. The speedup ratio can be flexibly controlled with certain hyper-parameters without re-deploying the model services. Early exiting is a representative adaptive inference method [1]. As depicted in Fig. 1, it implements adaptive inference by installing an early exit, i.e., an intermediate prediction head, at each layer of a PLM (a multi-exit PLM) and exiting “easy” samples early to speed up inference. All the exits are jointly optimized at the training stage together with BERT’s parameters. At the inference stage, certain early exiting strategies are designed to decide at which layer to exit [4, 8, 18, 23, 27, 28]. In this mode, different samples can exit at different depths.

For our framework to be generally applicable, we mainly adopt the biaffine model [26] for NER. The biaffine model converts the NER task into a 2-dimensional table filling task, thus providing a unified solution to both the flat and nested NER problems. [26] shows that the biaffine model can achieve SOTA performances on both nested and flat NER tasks.

In this work, we propose a framework for the early exiting of BERT biaffine NER models, inspired by BADGE [29]. First, we add a convolutional bypass to each transformer layer to provide different representations for the current layer’s biaffine exit and the next transformer layer of the BERT backbone. In this way, the BERT backbone will not be distracted by the two different tasks, thus improving the cross-layer average performance of the multi-exit biaffine model. Second, we extend the early exiting mechanisms commonly used in sentence classification tasks, entropy-based early exiting and max-probability-based early exiting, to the biaffine NER model, so that we can perform adaptive inference for NER tasks. Intuitively, the decision to exit early is made when the intermediate biaffine exit is confident in its predictions.

Extensive experiments are conducted on six benchmark NER tasks. Three of them are nested NER tasks: ACE2004, ACE2005, and GENIA [9]. We also experiment on three flat NER tasks: CONLL2003 [19], OntoNotes 4.0 Chinese, and the Chinese MSRA task [11]. We show that: (1) our FastNER training method consistently performs better than the baseline multi-exit model training methods; (2) we can achieve 2–3x speedup with limited performance losses. In addition, we show that with the better multi-exit model trained with FastNER, better efficiency-performance tradeoffs can be made. Ablation studies validate the architectural design of our FastNER method.

The rest of the paper is organized as follows. First, we introduce the preliminaries for the biaffine NER model and early exiting. Second, we elaborate on our FastNER method. Third, we conduct experiments on 6 NER tasks and a series of ablation studies. Finally, we conclude with possible future works.

2 Preliminaries

This section introduces the background for PLMs and early exiting. Throughout this work, we consider a NER task with samples \(\{(x_i, y_i): x_i\in \mathcal {X}, y_i\in \mathcal {Y}, i=1, 2, ..., N\}\), e.g., sentences and their NER span information, and the number of entity categories is K (including the non-entity type label). The input sequence length after BERT’s subword tokenization is L.

2.1 PLM Models

We use BERT as the backbone model. BERT is a Transformer [20] model pre-trained in a self-supervised manner on a large corpus. In the ablation studies, we also use ALBERT [10] as the backbone. ALBERT is more lightweight than BERT since it shares parameters across different layers, and the embedding matrix is factorized. The number of layers of our PLM backbone is denoted as M, and the hidden dimension is d.

2.2 The Biaffine Model for NER

The BERT-Biaffine model [26] transforms the NER task into a two-dimensional table filling task. It asks the model to identify whether the slot in the table with coordinate (s, e) corresponds to an entity with category k, that is, whether a pair of tokens \((x_s, x_e)\) in the input sequence \(x = (x_1, x_2, ..., x_L)\) are the start and end tokens of an entity with category k. Formally, after BERT encoding, the contextualized embeddings of tokens s and e are \(h_s\) and \(h_e\) (\(h_s\), \(h_e\) \(\in \) \(\mathcal {R}^{d}\)). Then \(h_s\) and \(h_e\) go through two multi-layer perceptrons with Tanh activation functions (denoted as MLP-start and MLP-end),

$$\begin{aligned} h_s = \text {Tanh}(h_s W^{start}_{1}) W^{start}_{2}, \end{aligned}$$
(1)
$$\begin{aligned} h_e = \text {Tanh}(h_e W^{end}_{1}) W^{end}_{2}. \end{aligned}$$
(2)

MLP-start and MLP-end transform BERT’s representations to adapt to the table-filling NER task. Then in a biaffine layer f, the score of span (s, e) is calculated by

$$\begin{aligned} f(s, e) = h_s^{T} U h_e + W (h_s \oplus h_e) + b. \end{aligned}$$
(3)

Since we need to calculate the scores for K entity categories, U is a \(d \times K \times d\) tensor, and W is a \(2d \times K\) tensor. \(f(s, e) \in \mathcal {R}^{K}\) is the vector of scores (or logits). A softmax operation transforms f(s, e) into a probability distribution p(s, e), which represents how likely the span (s, e) is an entity of category k.
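For concreteness, the scoring computation of Eqs. (1)–(3) can be sketched in PyTorch as below. The module name, the MLP width, and the einsum-based batching are our own illustrative choices under the stated shapes, not the released implementation of [26].

```python
# A minimal sketch of the biaffine scorer (Eqs. 1-3); names and the MLP width
# (mlp_dim) are illustrative assumptions.
import torch
import torch.nn as nn


class BiaffineScorer(nn.Module):
    def __init__(self, hidden_dim: int, num_labels: int, mlp_dim: int = 256):
        super().__init__()
        # MLP-start and MLP-end (Eqs. 1 and 2), each with a Tanh activation.
        self.mlp_start = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.Tanh(),
                                       nn.Linear(mlp_dim, mlp_dim))
        self.mlp_end = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.Tanh(),
                                     nn.Linear(mlp_dim, mlp_dim))
        # Biaffine parameters: U (mlp_dim x K x mlp_dim), W (2*mlp_dim x K), b (K).
        self.U = nn.Parameter(torch.randn(mlp_dim, num_labels, mlp_dim) * 0.02)
        self.W = nn.Parameter(torch.randn(2 * mlp_dim, num_labels) * 0.02)
        self.b = nn.Parameter(torch.zeros(num_labels))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, L, d) contextualized embeddings from BERT.
        hs = self.mlp_start(hidden)          # (batch, L, mlp_dim) start features
        he = self.mlp_end(hidden)            # (batch, L, mlp_dim) end features
        # Bilinear term h_s^T U h_e for every (s, e) pair.
        bilinear = torch.einsum("bsd,dke,bte->bstk", hs, self.U, he)
        # Linear term W (h_s ⊕ h_e), broadcast over all (s, e) pairs.
        lin = torch.einsum("bsd,dk->bsk", hs, self.W[: hs.size(-1)]).unsqueeze(2) \
            + torch.einsum("btd,dk->btk", he, self.W[hs.size(-1):]).unsqueeze(1)
        return bilinear + lin + self.b       # (batch, L, L, K) span logits
```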

The learning objective of the biaffine model is to assign a correct category (including the non-entity) to each valid span. Hence it is a multi-class classification problem at each slot of the two-dimensional table and can be optimized with cross-entropy loss:

$$\begin{aligned} \mathcal {L} = - \sum _{s=1}^{L} \sum _{e = s}^{L} \sum _{k=1}^{K} \mathcal {I}(y(s, e) = k) \log p_{k}(s, e), \end{aligned}$$
(4)

where y(s, e) is the ground-truth label of span (s, e), \(p_{k}(s, e)\) is the predicted probability mass of (s, e) having label k, and \(\mathcal {I}(\cdot )\) is the indicator function. After fine-tuning the BERT biaffine model, the inference procedure follows [26], which involves determining the final named entity spans. Since there may be conflicting spans, [26] ranks the spans by their probability masses, and the span with a higher probability mass is kept when it conflicts with other predicted spans.
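The span decoding step can be sketched as follows. The conflict rule shown here is the flat-NER variant (any overlap between two spans counts as a clash), which is an illustrative simplification; the nested setting of [26] only forbids boundary-crossing spans.

```python
# A minimal decoding sketch following the ranking idea above; the overlap-based
# conflict rule is an assumption for illustration, not the exact rule of [26].
import torch


def decode_spans(probs: torch.Tensor, non_entity_id: int = 0):
    """probs: (L, L, K) softmax table for one sentence; returns kept spans."""
    L = probs.size(0)
    candidates = []
    for s in range(L):
        for e in range(s, L):                      # only upper-triangular slots
            score, label = probs[s, e].max(dim=-1)
            if label.item() != non_entity_id:
                candidates.append((score.item(), s, e, label.item()))
    # Rank candidate spans by their probability mass, highest first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for score, s, e, label in candidates:
        # Keep a span only if it does not overlap with an already-kept span.
        if all(e < ks or s > ke for _, ks, ke, _ in kept):
            kept.append((score, s, e, label))
    return [(s, e, label) for _, s, e, label in kept]
```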

2.3 Early Exiting

As depicted in Fig. 1, early exiting architectures, or multi-exit architectures, are networks with exits at each transformer layer. Since the previous literature usually considers sentence-level classification tasks, the exits are sentence classifiers. However, since we are dealing with sequence labeling tasks formulated as two-dimensional table filling, with M exits, M separate biaffine modules \(f^{(m)}\) are installed right after each layer of BERT (\(m=1, 2, ... ,M\)), and the scores for span (s, e) at layer m are given by:

$$\begin{aligned} f^{(m)}(s, e) = h_s^{T} U^{(m)} h_e + W^{(m)} (h_s \oplus h_e) + b^{(m)}. \end{aligned}$$
(5)

And the loss function at each layer becomes

$$\begin{aligned} \mathcal {L}^{(m)} = - \sum _{s=1}^{L} \sum _{e = s}^{L} \sum _{k=1}^{K} \mathcal {I}(y(s, e) = k) \log p_{k}^{(m)}(s, e), \end{aligned}$$
(6)

where \(p^{(m)}(s, e) = \text {Softmax}(f^{(m)}(s, e))\) is the predicted probability distribution at exit m.

Training. The most commonly used training method for early exiting architectures is joint training (JT). All the exits are jointly optimized at the training stage with a summed loss function. Following [6] and [28], the overall loss function is:

$$\begin{aligned} \mathcal {L}^{WA} = \dfrac{\sum _{m=1}^{M} m * \mathcal {L}^{(m)}}{\sum _{m=1}^{M} m }. \end{aligned}$$
(7)

Note that the weight m corresponds to the relative inference cost of exit m.
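A minimal sketch of the weighted joint-training loss of Eq. (7), assuming the per-exit losses of Eq. (6) have already been computed; the function name is our own.

```python
# Weighted joint-training objective (Eq. 7): deeper exits get larger weights.
import torch


def weighted_joint_loss(per_exit_losses):
    """per_exit_losses: list of scalar loss tensors [L^(1), ..., L^(M)]."""
    weights = torch.arange(1, len(per_exit_losses) + 1,
                           dtype=per_exit_losses[0].dtype,
                           device=per_exit_losses[0].device)
    losses = torch.stack(per_exit_losses)
    return (weights * losses).sum() / weights.sum()
```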

Two other commonly used training methods are two-stage training [14, 23] (2ST) and alternating training [24] (ALT). 2ST first fine-tunes the PLM backbone and the last exit till convergence in the first stage and then fine-tunes the intermediate exits in the second stage. ALT trains the backbone and the last exit at the even optimization steps, and the intermediate exits at the odd optimization steps.

Inference. At inference, the multi-exit PLMs can operate in two different modes, depending on whether or not the computational budget to classify an example is known.

Static Early Exiting. We can directly appoint a fixed exit \(m^{*}\) of the PLM, \(f^{(m^{*})}\), to predict all the queries.

Dynamic Early Exiting. Under this mode, upon receiving a query input x, the model starts to predict on the exits \(f^{(1)}\), \(f^{(2)}\), ..., in turn. It continues until it receives a signal to stop early at an exit \(m^{*} < M\), or arrives at the last exit M. At this point, it outputs a prediction by combining the current and previous predictions in a certain way. Different samples might exit at different layers under this setting.

Speedup Ratio. Following PABEE [28], we mainly report the speedup ratio as the efficiency metric. For each test sample \(x_i\), the inference time cost is \(t_i\) under early exiting, and is \(T_i\) without early exiting, then the average speedup ratio on the test set is calculated by

$$\begin{aligned} \text {Speedup} = 1 - \dfrac{ \sum _{i=1}^{N} t_i}{ \sum _{i=1}^{N} T_i }. \end{aligned}$$
(8)
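In code, the metric of Eq. (8) is simply the fraction of inference time saved relative to always running the full model (a small illustrative helper, not from any released code):

```python
# Speedup ratio of Eq. (8): 0 means no saving, values closer to 1 mean faster.
def speedup_ratio(times_with_exit, times_full):
    return 1.0 - sum(times_with_exit) / sum(times_full)
```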

3 FastNER

3.1 A Lite Biaffine Module

Note that the original BERT biaffine NER model does not consider the early exiting scenario. Each biaffine module (Eq. 5) introduces 5–7 million parameters. If we add this biaffine module at each layer, the added parameters will amount to 60 million or above. Introducing too many randomly initialized parameters would result in low efficiency, difficulty in optimization, and overfitting for shallow layers. Thus, we propose a modified version of the biaffine module, called the lite biaffine module.

In the lite biaffine module, MLP-start and MLP-end are substituted by a simple linear projection layer that projects \(h_s\) and \(h_e\) from dimension d to \(d_1 = d / 4\), and the down-projected \(h_s\) and \(h_e\) are fed into Eq. 5 for logit calculations. This way, the parameters in a biaffine layer are reduced to less than 0.5 million. In the experiments, we will show that our lite biaffine module performs better than the original one in the early exiting scenarios.
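A hedged PyTorch sketch of the lite biaffine exit is given below. Whether the start and end features share the projection is not specified in the text, so the sketch uses two separate linear projections; all module and parameter names are illustrative.

```python
# A sketch of the lite biaffine exit: the two MLPs are replaced by plain linear
# down-projections to d1 = d / 4 before the biaffine scoring of Eq. (5).
import torch
import torch.nn as nn


class LiteBiaffineExit(nn.Module):
    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        d1 = hidden_dim // 4                       # down-projected dimension
        self.proj_start = nn.Linear(hidden_dim, d1)
        self.proj_end = nn.Linear(hidden_dim, d1)
        self.U = nn.Parameter(torch.randn(d1, num_labels, d1) * 0.02)
        self.W = nn.Parameter(torch.randn(2 * d1, num_labels) * 0.02)
        self.b = nn.Parameter(torch.zeros(num_labels))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        hs = self.proj_start(hidden)               # (batch, L, d1) start features
        he = self.proj_end(hidden)                 # (batch, L, d1) end features
        bilinear = torch.einsum("bsd,dke,bte->bstk", hs, self.U, he)
        lin = torch.einsum("bsd,dk->bsk", hs, self.W[: hs.size(-1)]).unsqueeze(2) \
            + torch.einsum("btd,dk->btk", he, self.W[hs.size(-1):]).unsqueeze(1)
        return bilinear + lin + self.b             # (batch, L, L, K) span logits
```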

3.2 Motivation

Similar to the analysis in [29], training a multi-exit BERT biaffine model requires training multiple prediction heads of different depths simultaneously. Thus, under this setting, an intermediate layer has to fulfill two tasks at once: (a) providing semantic representations to the next layer and (b) providing proper token features to the biaffine module of the current layer. One may wonder whether these two tasks conflict with each other and result in poor optimizations. [12] investigate this problem in the sentence classification tasks and find that each layer’s optimizations are often in conflict and can cause gradient instability. They provide a solution called gradient equilibrium, which is to adjust the gradients from each exit. However, in our experiments, we will show that this method does not provide significant improvements.

Another solution is to use different sub-networks for these two tasks, following the literature on sparse multi-task learning [17]. However, this method produces the two different representations via two separate forward passes through different sub-networks, significantly slowing down the inference speed. Thus, this approach does not meet our purpose.

To summarize, we need a new method that can provide two different representations, one for the next layer and the other for the current layer’s prediction, within a single forward pass.

3.3 Bypass Architecture

We now present the core of our FastNER framework: the convolutional bypasses (depicted in Fig. 1). The notation follows [29]. Denote the hidden states of the input from the last layer as \(H_{m-1}\). \(H_{m-1}\) goes through the transformer layer’s multi-head self-attention (MHSA) and position-wise feed-forward (FFN) modules to become \(H_{m, 0}\), and then a LayerNorm operation to output \(H_{m}\).

We want the efficient bypass \(B_m\) to adjust \(H_{m}\) to fit the task better. \(B_m\) is simple in architecture (on the right side of Fig. 1). After receiving the input \(H_{m-1}\), \(B_m\) reduces its dimension from d to r (where \(r \ll d\), and r is called the bottleneck dimension) via a down-projection \(W_{down} \in {\mathbb {R}}^{d \times r}\), obtaining \(H_{m, B}^{(1)}\). \(H_{m, B}^{(1)}\) then goes through a non-linear activation function \(g_1\), a convolutional layer with kernel size 3 (denoted as conv), and another activation function \(g_2\), and becomes \(H_{m, B}^{(2)}\). \(H_{m, B}^{(2)}\) is then up-projected to \(H_{m, B}^{(3)}\) to recover the dimension, via an up-projection matrix \(W_{up} \in {\mathbb {R}}^{r \times d}\). Formally, \(B_m\) can be expressed as:

$$\begin{aligned} H_{m, B}^{(3)} \leftarrow g_{2}(\text {conv}(g_{1}(H_{m-1} W_{down})))W_{up}. \end{aligned}$$
(9)

Finally, the current layer will output two representations, \(H_{m}\), which is the original hidden states, and \(H_{m}^{'}\), which is modified by the bypass by:

$$\begin{aligned} H^{'}_{m} & = \text {LayerNorm}(H_{m, B}^{(3)} + H_{m, 0}), \nonumber \\ H_{m} & = \text {LayerNorm}(H_{m, 0}). \end{aligned}$$
(10)

\(H_{m}\) is passed to the next transformer layer as input, and \(H^{'}_{m}\) is the hidden state received by the intermediate biaffine exit. The bottleneck dimension r is very small, e.g., 16, so that the extra parameters and FLOPs introduced by the bypasses are less than 1% of those of the BERT backbone.
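The bypass and the two layer outputs of Eqs. (9)–(10) can be sketched as follows, assuming GELU activations, kernel size 3, bottleneck r = 16, and that the layer’s LayerNorm is shared between the two outputs; module and function names are our own illustration rather than the authors’ released code.

```python
# A sketch of the convolutional bypass B_m (Eq. 9) and the two outputs (Eq. 10).
import torch
import torch.nn as nn


class ConvBypass(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)              # W_down
        self.conv = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Linear(bottleneck, hidden_dim)                # W_up
        self.act1, self.act2 = nn.GELU(), nn.GELU()                # g_1, g_2

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # h_prev: (batch, L, d), i.e. H_{m-1} from the previous layer.
        x = self.act1(self.down(h_prev))                           # (batch, L, r)
        x = x.transpose(1, 2)                                      # conv over the length dim
        x = self.act2(self.conv(x)).transpose(1, 2)                # (batch, L, r)
        return self.up(x)                                          # H^{(3)}_{m,B}: (batch, L, d)


def layer_outputs(h_m0: torch.Tensor, bypass_out: torch.Tensor,
                  layer_norm: nn.LayerNorm):
    """Eq. (10): the backbone keeps H_m while the exit reads the bypassed H'_m."""
    h_next = layer_norm(h_m0)                  # H_m, fed to the next transformer layer
    h_exit = layer_norm(bypass_out + h_m0)     # H'_m, fed to this layer's biaffine exit
    return h_next, h_exit
```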

3.4 Early Exiting for Biaffine NER Model

Although the literature has comprehensively studied early exiting for sentence classification tasks, the early exiting mechanism of entity-level tasks like NER has been neglected. Based on the literature on early exiting for sentence classification tasks, this work proposes two plausible early exiting mechanisms for the biaffine NER model.

Entropy-Based Early Exiting (Entropy). This method is a direct extension of [18] and [23] from the sentence classification tasks to NER. We denote the table of distributions predicted by the biaffine exit m as \(\mathcal {T}^{(m)} = \{p^{(m)}(s, e) |s, e \in 1, ..., L \}\), which is a \(L \times L \times K\) tensor. On each slot \(p^{(m)}(s, e)\) of the biaffine table, we can calculate its entropy \(\text {Ent}^{(m)}(s, e)\) via

$$\begin{aligned} \text {Ent}^{(m)}(s, e) = \dfrac{-1}{\log (K)} \sum _{k=1}^{K} p_{k}^{(m)}(s, e) \log (p_{k}^{(m)}(s, e)). \end{aligned}$$
(11)

Intuitively, if the biaffine exit is confident with its prediction, the average entropy \(\text {AvgEnt}^{(m)}\), calculated by

$$\begin{aligned} \text {AvgEnt}^{(m)} = \dfrac{\sum _{s=1}^{L}\sum _{e=1}^{L} \text {Ent}^{(m)}(s, e) }{L * L}, \end{aligned}$$
(12)

will be smaller. Thus \(\text {AvgEnt}^{(m)}\) can be treated as the early exiting criterion. A threshold \(\tau _{e}\) is predefined. If at layer m, \(\text {AvgEnt}^{(m)}\) is smaller than \(\tau _{e}\), the model will exit. Otherwise, the model will continue its forward pass.
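A small sketch of the entropy criterion of Eqs. (11)–(12); the function name and the numerical clamp are our own choices.

```python
# Entropy-based exiting: fire when the normalized entropy, averaged over the
# L x L table, falls below the threshold tau_e.
import torch


def should_exit_by_entropy(logits: torch.Tensor, tau_e: float) -> bool:
    """logits: (L, L, K) biaffine scores of the current exit for one sentence."""
    probs = torch.softmax(logits, dim=-1)
    # Per-slot entropy, normalized by log(K) so it lies in [0, 1]  (Eq. 11).
    ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    ent = ent / torch.log(torch.tensor(float(logits.size(-1))))
    return ent.mean().item() < tau_e               # AvgEnt < tau_e  (Eq. 12)
```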

Maximum-Probability-Based Early Exiting (Maxprob). This method is a direct extension of [16] from the sentence classification tasks to NER. Intuitively, if the biaffine exit is confident with its prediction, the table of predicted distributions \(\mathcal {T}^{(m)}\) will concentrate its probability masses on single specific labels. Denote the maximum probability mass at slot (s, e) as \(MP^{(m)}(s, e)\); then the average maximum probability is given by

$$\begin{aligned} \text {AvgMP}^{(m)} = \dfrac{\sum _{s=1}^{L}\sum _{e=1}^{L} MP^{(m)}(s, e)}{L * L}. \end{aligned}$$
(13)

A threshold \(\tau _{mp}\) is predefined. If at layer m, \(\text {AvgMP}^{(m)}\) is larger than \(\tau _{mp}\), the model will exit. Otherwise, the model will continue its forward pass.
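Putting the pieces together, the dynamic early-exiting loop can be sketched as below, using the maxprob rule of Eq. (13) by default; `layer(hidden, bypass)` is an assumed wrapper returning the pair \((H_m, H^{'}_m)\) of Eq. (10), so the module interfaces are illustrative.

```python
# A sketch of dynamic early exiting over the multi-exit biaffine model.
import torch


def confident_by_maxprob(logits: torch.Tensor, tau_mp: float = 0.9) -> bool:
    """AvgMP over the L x L table exceeds the threshold tau_mp (Eq. 13)."""
    probs = torch.softmax(logits, dim=-1)
    return probs.max(dim=-1).values.mean().item() > tau_mp


@torch.no_grad()
def dynamic_early_exit(hidden, encoder_layers, bypasses, exits,
                       is_confident=confident_by_maxprob):
    """Run layers in turn; stop as soon as the current exit looks confident.
    `should_exit_by_entropy` from the previous sketch can be passed instead."""
    for m, (layer, bypass, exit_head) in enumerate(
            zip(encoder_layers, bypasses, exits)):
        hidden, h_exit = layer(hidden, bypass)     # backbone H_m and bypassed H'_m
        logits = exit_head(h_exit)                 # (L, L, K) biaffine table at exit m
        if is_confident(logits) or m == len(encoder_layers) - 1:
            return logits, m + 1                   # prediction table and exit depth
```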

4 Experiments

4.1 Datasets

We evaluate our FastNER on both nested and flat NER tasks. For the nested NER tasks, we use the ACE2004 task, the ACE2005 task, and the GENIA task [9]. For the flat NER tasks, we evaluate our method on the CONLL 2003 task [19] (CONLL03), the OntoNotes 4.0 corpus (Onto-4), and the Chinese MSRA NER (MSRA) task [11].

4.2 Baselines

For multi-exiting BERT biaffine fine-tuning, we compare our FastNER framework with the following baselines:

Two-Stage Training (2ST). This method is adopted by [23] and [14]. It first fine-tunes the backbone and the last exit till convergence. Then all the intermediate layers’ exits (except the last layer’s exit) are trained on top of the frozen backbone.

Joint Training (JT). This method trains the BERT backbone and all the biaffine exits jointly. The literature has different variations of JT. PABEE [28] and RightTool [16] adopt increasing loss weights for higher layers. We denote their version of joint training as JT-PABEE. BranchyNet [18] adopts different, gradually increasing learning rates for different exits during training. We denote their version as JT-BranchyNet.

Alternating Training (ALT). BERxiT [24] proposes to combine the training of JT-PABEE and 2ST, that is, it conducts back-propagation via the loss from the last exit at the even optimization steps and via the average loss from all the intermediate exits at the odd optimization steps.

Gradient Equilibrium (GradEquil). [12] proposes to adjust the gradient norm of each intermediate layer during optimization, so that the training process will be more stable.

Sparse Multi-task (Sparse-MT) [17]. As analyzed in Sect. 3.2, this method is not suitable as an inference speedup method since it requires multiple forward passes to generate the different representations suitable for different tasks. We include this method as a sanity check and show that our FastNER method performs better even with a single forward pass.

Early Exiting Mechanisms. To show that our FastNER method can effectively improve the model’s early exiting performances, we will run dynamic early exiting with different early exiting mechanisms as described in Sect. 3.4: (a) Entropy-based method (entropy); (b) maximum probability-based method (maxprob). Early exiting will be run on different backbones to show that our FastNER framework can improve the efficiency-performance tradeoffs.

4.3 Experimental Settings

Training. English NER tasks use the open-sourced Google BERT [3] as the backbone, and the Chinese tasks adopt the BERT-wwm-ext model released by [2] as the backbone model. We also use ALBERT-base and ALBERT-base-Chinese by [10] as the backbone models for ablation studies. We add a lite biaffine NER layer or an original biaffine layer after each intermediate layer of the pre-trained models as the intermediate classifiers. The convolutional bypasses’ activation functions are set to GELU [5]. We fine-tune models for at most 25 epochs; early stopping with patience eight is performed, and the best checkpoint is selected based on the dev set performances. We perform grid search over batch sizes of 16, 32, 128, learning rates of 1e−5, 2e−5, 3e−5, 5e−5 with an Adam optimizer, and convolutional bypass bottleneck dimensions of 8, 16, 32. We implement FastNER and all the baselines on top of Hugging Face’s Transformers [22]. Experiments are conducted on four Nvidia V100 16 GB GPUs.

Inference. Following prior work on input-adaptive inference [8, 18], inference is on a per-instance basis, i.e., the batch size for inference is set to 1. This is a common scenario in the industry where individual requests from different users [16] come at different time points. We report the median performance over five runs with different random seeds.

4.4 Evaluation Metrics

Entity-level F1 is the most widely used metric for NER tasks [7]. For multi-exit PLMs, each exit has a performance score. Thus, to properly evaluate multi-exit PLMs, we use the following two derived metrics: (a) F1-avg, which denotes the cross-layer average F1 score; (b) F1-best, which is the best F1 score among all the layers. We use F1-avg as our primary metric for experimental result reporting and checkpoint selection during training.
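For clarity, the two derived metrics reduce to a simple average and maximum over the per-exit entity-level F1 scores (an illustrative helper, not from any released code):

```python
# F1-avg and F1-best over the per-exit entity-level F1 scores.
def multi_exit_metrics(per_exit_f1):
    """per_exit_f1: list of entity-level F1 scores, one per exit."""
    f1_avg = sum(per_exit_f1) / len(per_exit_f1)   # cross-layer average F1
    f1_best = max(per_exit_f1)                     # best single exit
    return f1_avg, f1_best
```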

Table 1. Experimental results of models with BERT backbone on the six benchmark NER datasets.

4.5 Overall Comparison

We compare our FastNER with the previous SOTA training methods of multi-exit BERT-biaffine models. Table 1 reports the performance on the six benchmark NER datasets when using BERT as the backbone model. The upper half of the table reports the performances of the last transformer layer’s biaffine exit and the 6-th layer’s exit. With fewer randomly initialized parameters, our lite biaffine layer performs comparably to the original biaffine layer.

The cross-layer average and best performances are reported in the lower half of Table 1. The following takeaways can be made:

  • Our FastNER method consistently outperforms the existing multi-exit BERT biaffine model training methods in terms of F1-avg by a clear margin. Note that, as modifications to the joint training methods, GradEquil and Sparse-MT perform comparably to JT-BranchyNet and JT-PABEE under our settings. Although ALT [24] and 2ST [23] perform well on sentence-level tasks like the GLUE benchmarks [21], they do not perform very well when training the BERT biaffine NER models.

  • With the help of the bypasses, our FastNER method improves the average performances across all the intermediate layers and improves the F1-best scores compared with JT-PABEE or JT-BranchyNet by a large margin. This result is consistent with our motivation: introducing the convolutional bypasses helps the intermediate transformer layers concentrate on providing semantic representations to the next layer, while the bypasses provide a modified version of the current layer’s hidden states that is more suitable for the current layer’s biaffine exit. In this way, both the F1-best and F1-avg scores improve. As a direct result, the best layer’s score under our FastNER framework is comparable to or better than that of vanilla fine-tuning.

  • To show that our FastNER does not achieve such performance improvements merely by adding more parameters, we also run FastNER with the original biaffine module (the FastNER + original biaffine setting in Table 1). With far more additional parameters, FastNER + original biaffine still under-performs the FastNER setting in terms of F1-avg. We believe this is because the original biaffine modules are too parameter-heavy for the shallow layers to learn.

Fig. 2. The speedup-score curves with different dynamic early exiting methods, on the ACE, CoNLL03 and Onto-4 datasets. The multi-exit BERT biaffine models are trained with FastNER or JT-PABEE.

4.6 Dynamic Early Exiting Performances

With the improved overall performances on each layer, intuitively, the model’s early exiting performances can be improved. We run early exiting with different confidence thresholds on multi-exit BERT-biaffine models trained by FastNER or JT-PABEE. The early exiting mechanisms are the entropy-based method and the maxprob-based method. The resulting speedup-performance curves are plotted in Fig. 2, where the x-axis represents the speedup ratio, and the y-axis is the F1 score achieved on the test set.

From Fig. 2, we can see that with our FastNER training and early exiting methods, a BERT biaffine model can achieve 2x–3x speedups with limited performance degradation. The apparent gaps between our FastNER model and JT-PABEE are observed consistently across the datasets in Fig. 2, proving that improving the average performances across layers results in better efficiency-performance tradeoffs during early exiting. In addition, the entropy-based and maxprob-based methods perform comparably with each other for the early exiting of BERT biaffine models.

4.7 Ablation on Whether to Pass \(H_i^{'}\) to the Next Layer

The core idea of FastNER is to provide different intermediate hidden states for different purposes via our convolutional bypasses. As a sanity check and to demonstrate that our design is necessary, we now consider the following setting: reducing our design of bypasses by passing \(H_i^{'}\) to the next layer and using it for prediction. We will denote this setting as FastNER-reduced. We use FastNER-reduced for training on ACE2004 and Onto-4 datasets, and the results are reported in Table 2.

From Table 2, we can see that FastNER-reduced asks \(H_i^{'}\) to complete two tasks at once, resulting in a significant drop in overall performances. Note that the performance difference between FastNER and FastNER-reduced is significant on both ACE2004 and Onto-4. The results show that our design of providing different representations for different purposes via convolutional bypasses is the key to improving the overall performances of multi-exit BERT biaffine models.

Table 2. Experimental results for the ablation study of whether to pass \(H_i^{'}\) to the next layer. The F1-avg (cross-layer average F1) scores are reported. The performance differences between FastNER and FastNER-reduced are significant for both tasks.

4.8 Comparisons of Different Bottleneck Dimensions r

The bottleneck dimension r is 16 for the main experiments. To investigate the effects of the bottleneck dimension on the model performances, we conduct ablation experiments on the ACE2004 and Onto-4 tasks. Table 3 reports the F1-avg scores under different values of r. From Table 3, we can see that smaller bottleneck dimensions do not result in significant performance drops. An intriguing observation is that larger bottleneck dimensions do not provide performance improvements, demonstrating that the superior performances of FastNER do not come from introducing additional model parameters.

Table 3. Experimental results for comparing different bottleneck dimensions. The F1-avg (cross-layer average F1) scores are reported.

5 Conclusions

In this work, we first design the FastNER framework, built on convolutional bypasses, to enhance the overall performances of multi-exit BERT biaffine NER models. Second, since the existing literature does not investigate the problem of early exiting for NER tasks, we transfer the early exiting methods for sentence-level tasks to the biaffine NER model and propose two early exiting mechanisms: the entropy-based method and the maxprob-based method. Experiments are conducted on six benchmark NER datasets. The experimental results show that: (a) our FastNER framework can effectively improve the overall performances of multi-exit BERT biaffine models, thus providing stronger backbones for dynamic early exiting; (b) the early exiting mechanisms we designed for the BERT biaffine NER models can achieve 2–3 times inference speedups with quite limited performance drops.