1 Introduction

Analyzing and extracting useful information from plain text, especially web content, is an important research area. For many applications, sequence labeling is a fundamental pre-processing step, and it is one of the most well-studied tasks in natural language processing. As shown in Table 1, sequence labeling tasks aim at automatically assigning a label to each word in a text.

Traditionally, Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and Support Vector Machines (SVMs) have been widely used for sequence labeling tasks [9, 10, 14, 15]. Compared with these models, Recurrent Neural Networks (RNNs) are able to capture information over fairly long distances. Recently, with the help of extra resources and feature engineering, combinations of RNNs and other models have achieved state-of-the-art results [3, 8, 12, 13].

Table 1. An example of sequence labeling tasks.

For sequence labeling, each target word and its corresponding label are explicitly aligned. Previous RNNs predict each label solely from the hidden layer aligned with the corresponding target word. In practice, however, a single hidden layer is not sufficient for prediction, even with sophisticated variations like the Bi-directional Recurrent Neural Network (Bi-RNN) [16], Long Short-Term Memory (LSTM) [7], and the Gated Recurrent Unit (GRU) [4].

This paper proposes an Attention-based Recurrent Neural Network for Sequence Labeling (ARNN-SL), which allows RNNs to “focus” not only on the aligned hidden layer but also on other informative hidden layers.

Different from other tasks such as machine translation [1], image captioning [19], and speech recognition [5], where the attention mechanism has been successfully applied, sequence labeling has its own characteristics for deciding which hidden layers are informative. Intuitively, the closer a hidden layer is to the target word, the more information it contains, and the aligned hidden layer is always the most important. We therefore introduce a windowing technique that limits our model to selectively focus on hidden layers within a small window, instead of irrelevant hidden layers far away. ARNN-SL explicitly leverages the information from the aligned and attention-focused hidden layers for prediction.

2 Model

2.1 Simple RNN for Sequence Labeling

Formally, sequence labeling aims at finding the most probable label sequence \(\mathbf y =\{y_1, ... , y_T\}\) for a given input word sequence \(\mathbf w =\{w_1, ... , w_T\}\), where T is the sequence length.

The overall architecture of the simple RNN (sRNN) for sequence labeling is depicted in Fig. 1. In this figure, \(w_t\) represents the word at time step t and \(x_t \in \mathbb {R}^n\) is \(w_t\)’s word embedding. \(y_t \in \mathbb {R}^L\) is the probability distribution over L labels for the word at position t and is defined as:

$$\begin{aligned} y_t = softmax(V h_t) \end{aligned}$$
(1)

where \(h_t \in \mathbb {R}^{m}\) is the hidden state at time step t. \(h_t\) encodes the information in previous time steps and is computed as:

$$\begin{aligned} h_t = \sigma (W x_t + U h_{t-1}) \end{aligned}$$
(2)

where V, W and U are weight matrices, and \(\sigma \) is the activation function, often set to the sigmoid.
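
For concreteness, a minimal NumPy sketch of the sRNN forward pass in Eqs. 1 and 2 is given below. The weight shapes and helper functions are our own illustrative choices under the definitions above, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def srnn_forward(X, W, U, V, h0):
    """X: (T, n) word embeddings; W: (m, n); U: (m, m); V: (L, m); h0: (m,).
    Returns hidden states h_t (Eq. 2) and label distributions y_t (Eq. 1)."""
    T = X.shape[0]
    H, Y = [], []
    h_prev = h0
    for t in range(T):
        # Eq. 2: h_t = sigmoid(W x_t + U h_{t-1})
        h_t = 1.0 / (1.0 + np.exp(-(W @ X[t] + U @ h_prev)))
        # Eq. 1: y_t = softmax(V h_t)
        y_t = softmax(V @ h_t)
        H.append(h_t); Y.append(y_t)
        h_prev = h_t
    return np.stack(H), np.stack(Y)
```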

2.2 RNN Variations

Most variations of the RNN focus on modifying the way the hidden layer \(h_t\) in Eq. 2 is calculated. For example, Long Short-Term Memory (LSTM) [7] and the Gated Recurrent Unit (GRU) [4] introduce additional memory cells and gates to better capture long-range dependencies. The local context window technique [13] replaces \(x_{t}\) in the computation of \(h_t\) with a weighted sum of all \(x_j\) in a context window of size wn, \(\sum _{i=-wn}^{wn}{U_i x_{t+i}}\), which allows the RNN to consider more local context information.
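
A minimal sketch of the local context window input is shown below. The per-offset matrices \(U_i\) follow the formula above; zero-padding at the sequence boundaries is our assumption, since the padding scheme of [13] is not reproduced here.

```python
import numpy as np

def context_window_input(X, U_list, wn):
    """Replace x_t with sum_{i=-wn}^{wn} U_i x_{t+i} (zero-padded at the
    boundaries, which is an assumption). X: (T, n); U_list: 2*wn+1 matrices
    of shape (m, n), indexed from offset -wn to +wn."""
    T, n = X.shape
    m = U_list[0].shape[0]
    out = np.zeros((T, m))
    for t in range(T):
        for i in range(-wn, wn + 1):
            j = t + i
            if 0 <= j < T:               # skip positions outside the sequence
                out[t] += U_list[i + wn] @ X[j]
    return out
```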

Fig. 1. Illustration of sRNN and ARNN-SL for sequence labeling.

The Bi-directional Recurrent Neural Network [16] is also commonly used to improve the performance of the sRNN. It computes a forward hidden layer \(h_t^f\) and a backward hidden layer \(h_t^b\) from \(h_{t-1}^f\) and \(h_{t+1}^b\) respectively, and concatenates the two to form the final hidden layer \(h_t\). In this way, both past and future information is preserved.
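
The bidirectional combination could be sketched as follows; the per-direction recurrence functions are left as placeholders (sRNN, LSTM, or GRU steps), so this only illustrates the concatenation of \(h_t^f\) and \(h_t^b\).

```python
import numpy as np

def bi_rnn_states(X, step_fwd, step_bwd, h0_f, h0_b):
    """Sketch of a Bi-RNN: run one recurrence left-to-right and another
    right-to-left, then concatenate h_t^f and h_t^b per position.
    step_* are placeholder recurrence functions (h_prev, x_t) -> h_t."""
    T = X.shape[0]
    Hf, Hb = [], [None] * T
    h = h0_f
    for t in range(T):                 # forward pass uses h_{t-1}^f
        h = step_fwd(h, X[t]); Hf.append(h)
    h = h0_b
    for t in reversed(range(T)):       # backward pass uses h_{t+1}^b
        h = step_bwd(h, X[t]); Hb[t] = h
    return np.concatenate([np.stack(Hf), np.stack(Hb)], axis=1)  # (T, 2m)
```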

2.3 Proposed Attention Architecture

In practice, \(h_t\) alone is still not sufficient for encoding all the information needed to predict \(y_t\), even with sophisticated variations like Bi-RNN, LSTM, and GRU. In this paper, we propose the Attention-based Recurrent Neural Network for Sequence Labeling (ARNN-SL), which allows the RNN to selectively use information from multiple hidden vectors instead of using \(h_t\) alone.

As shown in Fig. 1, ARNN-SL has two types of hidden layers: the encoding layer \(h_t\) and the decoding layer \(\widetilde{h}_t\). As in most encoder-decoder frameworks, the last encoding layer is used as the input of the first decoding layer, and two separate sets of parameters are used for the encoder and the decoder.

If we ignore the attention component \(c_t\) and all decoding layers, the architecture is exactly the same as the sRNN and its variations. The attention component \(c_t\) is used to selectively gather information from the encoding layers for prediction, and is computed as:

$$\begin{aligned} c_t = \sum _{j=1}^{T}{a_{t,j} h_j} \end{aligned}$$
(3)

where \(a_{t,j}\) is the weight for \(h_j\) at time step t:

$$\begin{aligned} a_{t,j} = \frac{\exp (e_{t, j})}{\sum _{k=1}^{T}{\exp (e_{t, k})}} \end{aligned}$$
(4)

where \(e_{t,j}\) is the attention score. The larger \(e_{t,j}\) is, the larger \(a_{t,j}\) becomes and the more \(h_j\) contributes to \(c_t\) and \(y_t\). For the sequence labeling task, \(y_t\) is mainly decided by the encoding layer \(h_t\); to make full use of its information, \(h_t\) is used directly for prediction and \(e_{t,t}\) is set to 0. Since encoding layers that are far from the current time step may be noisy, we pre-define a window size wn and also set \(e_{t,j}\) to 0 when j is outside the window. To summarize, \(e_{t,j}\) is calculated as:

$$\begin{aligned} e_{t, j} = \left\{ \begin{array}{rl} score(\widetilde{h}_{t}, h_j) & t-wn< j < t+wn \text { and } j \ne t \\ 0 & \text {else}\\ \end{array} \right. \end{aligned}$$
(5)

where the inputs of the score function are the decoding layer \(\widetilde{h}_{t}\) at the current time step and the encoding layer \(h_j\). As proposed in [11], there are two main types of score functions that can be used:

$$\begin{aligned} score(\widetilde{h}_{t}, h_j) = \left\{ \begin{array}{lr} \widetilde{h}_{t}^T W_e h_j & general \\ V_e\tanh (\widetilde{W}_e \widetilde{h}_{t}+W_e h_j) & concat \\ \end{array} \right. \end{aligned}$$
(6)
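
A hedged sketch of the two score functions is shown below; the shapes of \(W_e\), \(\widetilde{W}_e\) and \(V_e\) are our own assumptions, chosen so that each score is a scalar.

```python
import numpy as np

def score_general(h_dec, h_enc, W_e):
    """'general' form of Eq. 6: h~_t^T W_e h_j (h_dec: (m~,), W_e: (m~, m))."""
    return h_dec @ W_e @ h_enc

def score_concat(h_dec, h_enc, W_e_dec, W_e_enc, V_e):
    """'concat' form of Eq. 6: V_e tanh(W~_e h~_t + W_e h_j).
    V_e is taken to be a row vector so the result is a scalar (assumption)."""
    return V_e @ np.tanh(W_e_dec @ h_dec + W_e_enc @ h_enc)
```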

Finally, the calculation of the output layer \(y_t\) in Eq. 1 is defined as:

$$\begin{aligned} y_t = softmax(V_h h_t + V_c c_t) \end{aligned}$$
(7)

where the encoding hidden layer \(h_t\), which is the most informative for predicting \(y_t\), is explicitly leveraged together with the attention component \(c_t\).
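
Putting Eqs. 3-5 and 7 together, one labeling step of ARNN-SL could look roughly like the following sketch. This is our own illustration under the definitions above, with the score function (Eq. 6) passed in as a callable.

```python
import numpy as np

def arnn_sl_step(H, h_dec_t, t, wn, score, V_h, V_c):
    """One labeling step of ARNN-SL (Eqs. 3-5 and 7). H: encoding layers
    (T, m); h_dec_t: decoding layer at step t; `score` is one of the
    functions in Eq. 6; V_h, V_c: (L, m) output weights."""
    T = H.shape[0]
    e = np.zeros(T)                      # Eq. 5: e_{t,j} = 0 at j = t and outside the window
    for j in range(T):
        if t - wn < j < t + wn and j != t:
            e[j] = score(h_dec_t, H[j])
    a = np.exp(e) / np.exp(e).sum()      # Eq. 4: softmax over all positions
    c_t = a @ H                          # Eq. 3: attention component c_t
    z = V_h @ H[t] + V_c @ c_t           # Eq. 7: aligned layer h_t used directly
    z = np.exp(z - z.max())
    return z / z.sum()                   # y_t
```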

Since our attention architecture does not change the way hidden layers are computed, it can be built directly upon the sRNN and LSTM. When ARNN-SL is built upon bidirectional RNNs, two sets of weights and variables are used for the forward and backward directions respectively. For example, two attention components \(c_t^f\) and \(c_t^b\) selectively focus on \(h_j^f\) and \(h_j^b\), and are then concatenated to form a new attention component \(c_t\) for the final prediction.
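
A small sketch of this bidirectional extension is given below; `attn` stands for any routine implementing Eqs. 3-5 over one direction (such as the step sketched above), and the exact wiring is an assumption on our part.

```python
import numpy as np

def bidirectional_attention_component(Hf, Hb, hdec_f_t, hdec_b_t, t, wn, attn):
    """Sketch of the bidirectional case: `attn(H, h_dec_t, t, wn)` returns an
    attention component over one direction's encoding layers (Eqs. 3-5).
    c_t^f and c_t^b are computed separately and then concatenated."""
    c_f = attn(Hf, hdec_f_t, t, wn)    # forward attention component c_t^f
    c_b = attn(Hb, hdec_b_t, t, wn)    # backward attention component c_t^b
    return np.concatenate([c_f, c_b])  # c_t used for the final prediction
```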

3 Related Work

Recently, the attention mechanism has led to state-of-the-art results on many complex tasks such as machine translation [1, 11], image captioning [19], and speech recognition [2, 5]. However, due to characteristics such as the alignment strategy, directly applying the same mechanism to sequence labeling is not feasible.

Fig. 2. Attention architectures used in different tasks.

The main difference between our specially designed ARNN-SL and the attention architectures used in other tasks is the way the attention component \(c_t\) is calculated, as shown in Fig. 2. In machine translation [1], a translated word could be aligned with a word at any position of the sentence, so the attention architecture should selectively focus on hidden layers at every position. In image captioning [19], in order to generate the current caption word, hidden layers corresponding to all image segments should be focused on for the same reason.

The attention architecture most similar to ARNN-SL is the one used in speech recognition [2, 5]. Instead of using all hidden layers, only the layers corresponding to the most probable k consecutive acoustic frames are focused on. The same idea has also been implemented in machine translation [11], but it performs worse than attending over all layers. Compared with these architectures, ARNN-SL does not dynamically decide which consecutive hidden layers should be used for attention (Eq. 5), since for sequence labeling the useful hidden layers are always close to the current position.

For sequence labeling, the label is mostly determined by the word and hidden layer at the current time step. To the best of our knowledge, ARNN-SL is the first attention architecture that explicitly leverages the contributions from the current hidden layer and the attention component (Eq. 7).

4 Experiments

4.1 Datasets and Experimental Setup

ARNN-SL is evaluated on four commonly used sequence labeling tasks: Part-Of-Speech Tagging (POS), Chunking, Named Entity Recognition (NER), and Slot Filling for Spoken Language Understanding (SF-SLU) [6, 17, 18]. Since there is no pre-defined development set for the Chunking and SF-SLU datasets, we randomly choose 20% of the training data for validation. AdaDelta is used to control the learning rate [21]. We use the same dropout strategy as in [20], with the dropout rate set to 0.5. The word embedding size and hidden layer size are set to 500. Word embeddings are either randomly initialized or pretrained with the Word2Vec toolkit on English Wikipedia (August 2013 dump). Models are trained for 25 epochs, and we report the results of the epoch that achieves the highest performance on the development data. We do not use features derived from lexical resources or other NLP systems; the only pre-processing is lowercasing.
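
For reference, the reported settings can be collected into an illustrative configuration sketch; the key names below are our own and do not come from the authors' code.

```python
# Hyperparameters reported in the text, gathered into an illustrative config
# (key names are hypothetical, values are those stated above).
config = {
    "optimizer": "AdaDelta",        # controls the learning rate [21]
    "dropout_rate": 0.5,            # same dropout strategy as [20]
    "embedding_size": 500,
    "hidden_size": 500,
    "epochs": 25,                   # best epoch chosen on development data
    "dev_split": 0.2,               # for Chunking and SF-SLU (no predefined dev set)
    "lowercase_input": True,        # the only pre-processing used
    "embeddings": "random init or word2vec on English Wikipedia (Aug 2013 dump)",
}
```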

Fig. 3. Impact of attention window size and score function. ARNN-SL is built upon Bi-LSTM and pretrained word embeddings are used. Attention window size: “0” indicates that the attention architecture is not used; “++” indicates that the windowing technique is not used and attention selectively focuses on all encoding layers. w/o LVRG (dotted line): the traditional attention mechanism. w/ LVRG (solid line): ARNN-SL, which explicitly leverages the contributions from the current hidden layer and the attention component (Eq. 7).

4.2 Main Results

As shown in Fig. 3, compared with the traditional attention mechanism (w/o LVRG), models that explicitly leverage the information from the aligned and attention-focused hidden layers (w/ LVRG) perform consistently better. In most cases, models without LVRG actually perform worse than the Bi-LSTM baselines, especially on the SF-SLU task and when the window size is large: the weighted sum of all encoding layers (Eq. 3) is likely to introduce noise, so the current encoding layer should always be used directly for prediction.

Our proposed attention architecture consistently improves the performance of Bi-LSTM on all tasks. The best performance is usually achieved at a window size of 2 or 3; performance drops when the window size exceeds 3 and reaches its minimum when attention focuses on all encoding layers. The windowing technique is therefore indispensable for good performance on sequence labeling.

The curve for the concat score function follows a trend similar to that of general, but the general score function often performs better. We highly recommend the general score function in practice, since it is also easier to implement and 2–3 times faster.

Overall, ARNN-SL obtains improvements of 1.14%, 2.14%, 3.21% and 0.90% on POS, Chunking, NER and SF-SLU respectively, compared with the sophisticated bidirectional Bi-LSTM baselines.

Fig. 4. The positions of the encoding layers that ARNN-SL focuses on. ARNN-SL is built upon Bi-LSTM and tested on the SF-SLU task. For each window position j, we sum the variable \(a_{t,j}\) over all training examples and all time steps; each bar represents the percentage of this summed value at the corresponding position.

4.3 Attention Visualization

Since the attention window size is crucial for ARNN-SL’s performance, it is worth exploring which positions of the encoding layers the attention really focuses on. As shown in Fig. 4, both the general and concat score functions tend to focus on positions near the center. This explains the decline in performance when the attention window size exceeds 3 in the previous section, and further supports our claim that encoding layers far from the current time step are noisy.
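
The aggregation described in the caption of Fig. 4 could be reproduced with a sketch like the following, assuming the attention weights \(a_{t,j}\) are stored as one \(T \times T\) matrix per sentence (a layout we assume for illustration).

```python
import numpy as np

def attention_mass_per_offset(attention_weights, wn):
    """Aggregate attention as in Fig. 4: for each relative offset i in
    [-wn, wn], sum a_{t, t+i} over all sentences and time steps, then
    report each offset's share of the total. `attention_weights` is a list
    of (T, T) matrices, one per sentence (an assumed layout)."""
    totals = np.zeros(2 * wn + 1)
    for A in attention_weights:
        T = A.shape[0]
        for t in range(T):
            for i in range(-wn, wn + 1):
                j = t + i
                if 0 <= j < T:
                    totals[i + wn] += A[t, j]
    return totals / totals.sum()   # fraction of attention mass per offset
```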

Another interesting phenomenon in this figure is that the forward direction of ARNN-SL mainly focuses on both position \(-1\) and position 1, while the backward direction mainly focuses only on position \(-1\). This may be because the backward RNN’s hidden layers contain only future information, so the hidden layer at position 1 carries no information about the word at position \(-1\), while the word at that position may contain the most useful information.

5 Conclusion and Future Work

This paper presents a novel attention architecture called ARNN-SL, designed for sequence labeling. We demonstrate its effectiveness on the POS, Chunking, NER, and SF-SLU tasks. More precisely, we conclude that for sequence labeling tasks: (1) it is crucial to explicitly leverage the contributions from the current hidden layer and the attention component, (2) the general score function is a better choice than concat, and (3) restricting the attention with a windowing technique is indispensable, and the window size should be small.

The aim of this paper is to investigate the impact of the attention architecture, so we keep our model as simple and reproducible as possible. Note that the state-of-the-art results on the POS, Chunking and NER tasks are all obtained through model combination, extra resources, and feature engineering [3, 8, 12]. In the future, it would be promising to apply ARNN-SL under the same sophisticated configurations for further improvements on these tasks (e.g., combining ARNN-SL with CRF/CNN, and making use of different features and DBpedia knowledge).