
1 Introduction

Target-specific Stance Detection can be formulated as follows: given a tweet X and a target Y, the aim is to classify the stance of X towards Y into one of three categories, Favour, Neither or Against. The target may be a person, an organisation, a government policy, a movement, a product, etc. [8]. Target-specific Stance Detection differs from Aspect-level Sentiment Analysis [11, 15] in the following ways: the same stance can be expressed through positive, negative or neutral sentiment [9]; and the target of interest does not necessarily have to occur in the tweet, as the target-specific stance can be expressed by mentioning the target implicitly, or by talking about other relevant targets. Besides typical tweet characteristics, such as being short and noisy, the main challenge in this task is that the decision made by the classifier has to be target-specific, whilst very little contextual information or supervision is provided. Example training data from the benchmark target-specific Stance Detection dataset for SemEval-2016 Task 6 [8] can be found in Table 1.

Deep neural networks enable continuous vector representations of the underlying semantic and syntactic information in natural language texts, and save researchers the effort of feature engineering [14, 15]. Recently, they have achieved significant improvements in various natural language processing tasks, such as Machine Translation [2, 3], Question Answering [14] and Sentiment Analysis [6, 11, 15, 18]. However, applying deep neural networks to target-specific Stance Detection has not been successful so far, as their performance has been slightly worse than that of traditional machine learning algorithms with manual feature engineering, such as Support Vector Machines (SVM) [8].

Table 1. Examples of target-specific stance detection.

In this work, the above challenges are tackled, based on our intuition that the target information is vital for Stance Detection, and that the vector representations of the tweets should be “aware” of the given targets. Since not all parts of a tweet are equally helpful for detecting the stance towards the specified target, we first apply the state-of-the-art token-level attention mechanism [2]. This allows neural networks to automatically pay more attention to the tokens that are more relevant to the target and more informative for detecting the target-specific stance. Importantly, a given token can be interpreted differently according to different targets, and the semantic features in the token’s vector representation can be of different levels of importance, conditional on the given target. We propose a novel attention mechanism, which extends the current attention mechanism from the token level to the semantic level, through a gated structure, whereby the tokens can be encoded adaptively, according to the target.

We compare the models we propose, based on the token-level attention mechanism and the novel semantic-level attention mechanism, with several baselines, on the target-specific Stance Detection dataset for the SemEval-2016 Task 6.A [8], which is currently the most widely used dataset for target-specific Stance Detection in tweets. The experimental results show that substantial improvements can be achieved on this task, compared with all previous neural network-based models, by inferring conditional tweet vector representations with respect to the given targets; the neural network model with semantic-level attention also outperforms the SVM algorithm, which achieved the previous best performance in this task [8]. Additionally, it should be noted that our results are obtained with a minimum of supervision, with no external domain corpus collected to pre-train target-specific word embeddings, and no extra sentiment information annotated. Moreover, no target-specific configurations or hand-engineered features are involved, so the proposed models can be easily generalised to other targets, with no additional effort.

2 Neural Network Models for Target-Specific Stance Detection in Tweets

In this section, we first describe two baseline models: the bi-directional Gated Recurrent Unit (biGRU) model, and the model that stacks a Convolutional Neural Network (CNN) structure on the outputs of the biGRU (the biGRU-CNN model). We then show how we extend these two baseline models by incorporating the target information through token-level and semantic-level attention mechanisms, obtaining the AT-biGRU model and the AS-biGRU-CNN model, respectively. Finally, we describe methods to generate the target embedding, how to obtain the stance detection result from the tweet vector representation, and other model training details.

2.1 biGRU Model

GRU [3] aims to solve the vanishing and exploding gradient problems by introducing a gating mechanism. It adaptively captures dependencies in sequences, without introducing extra memory cells. GRU maps an input sequence of length N, \([x_1, x_2, \cdots , x_N]\), into a set of hidden states \([h_1, h_2, \cdots , h_N]\) as follows:

$$\begin{aligned} r_n&= \sigma (W_r x_n + U_r h_{n-1} + b_r)\end{aligned}$$
(1)
$$\begin{aligned} z_n&= \sigma (W_z x_n + U_z h_{n-1} + b_z) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{h_{n}}&= \tanh (W_h x_n + U_h (r_n \odot h_{n-1}) + b_h)\end{aligned}$$
(3)
$$\begin{aligned} h_n&= (1-z_n) \odot h_{n-1} + z_n \odot \tilde{h_{n}}. \end{aligned}$$
(4)

where \(n \in \{1, \dots , N\}\); \(r_n\) is the reset gate and \(z_n\) is the update gate; \(\tilde{h_{n}} \in \mathbb {R}^{d_1}\) represents the “candidate” hidden state generated by the GRU; \(h_n \in \mathbb {R}^{d_1}\) represents the real hidden state generated by the GRU; \(x_n \in \mathbb {R}^{d_0}\) represents the word embedding vector of a token in the tweet; \(W_r\), \(W_z\), \(W_h \in \mathbb {R}^{d_1 \times d_0}\) and \(U_r\), \(U_z\), \(U_h \in \mathbb {R}^{d_1 \times d_1}\) represent the weight matrices; \(b_r\), \(b_z\), \(b_h \in \mathbb {R}^{d_1}\) represent the bias terms; \(\sigma (\cdot )\) represents the sigmoid function; \(\odot \) represents the Hadamard product operation (element-wise multiplication).
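To make the gating concrete, the following minimal NumPy sketch implements one GRU step following Eqs. (1)–(4); the packing of the parameters into a dictionary and the variable names are our own illustrative conventions, not part of the original implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_n, h_prev, params):
    """One GRU step, following Eqs. (1)-(4).

    x_n:    token embedding, shape (d0,)
    h_prev: previous hidden state h_{n-1}, shape (d1,)
    params: dict with W_* of shape (d1, d0), U_* of shape (d1, d1),
            and biases b_* of shape (d1,)
    """
    r_n = sigmoid(params['W_r'] @ x_n + params['U_r'] @ h_prev + params['b_r'])  # reset gate, Eq. (1)
    z_n = sigmoid(params['W_z'] @ x_n + params['U_z'] @ h_prev + params['b_z'])  # update gate, Eq. (2)
    h_tilde = np.tanh(params['W_h'] @ x_n
                      + params['U_h'] @ (r_n * h_prev)  # Hadamard product r_n * h_{n-1}, Eq. (3)
                      + params['b_h'])
    return (1.0 - z_n) * h_prev + z_n * h_tilde         # interpolated hidden state, Eq. (4)
```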

To capture the information from both the past and the future sequence, the bi-directional GRU (biGRU), which processes the sequence in both the forward and backward directions, has proven to be successful in various applications [2, 18]. In biGRU, the hidden states generated by processing the sequence in opposite directions are concatenated as the new output: \([\overrightarrow{h_1} \mathbin {\Vert }\overleftarrow{h_1}, \overrightarrow{h_2} \mathbin {\Vert }\overleftarrow{h_2}, \cdots , \overrightarrow{h_N} \mathbin {\Vert }\overleftarrow{h_N}] \), where \(\overrightarrow{h_n} \mathbin {\Vert }\overleftarrow{h_n} \in \mathbb {R}^{2d_1}\), and the arrow represents the direction of the processing.

In the biGRU model, the final hidden states obtained from processing the input sequence in the two opposite directions are concatenated, to form the vector representation of the tweet s:

$$\begin{aligned} s=\overrightarrow{h_N} \mathbin {\Vert }\overleftarrow{h_1}. \end{aligned}$$
(5)
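As an illustration, the biGRU encoding of Eq. (5) can be sketched by running the gru_step function above in both directions and concatenating the two final hidden states; the zero initial state is an assumption of the sketch.

```python
def bigru_encode(X, params_fwd, params_bwd, d1):
    """Sketch of Eq. (5): s = forward h_N concatenated with backward h_1.

    X: list of N token embeddings, each of shape (d0,)
    """
    h_fwd = np.zeros(d1)
    for x_n in X:                    # forward pass over x_1, ..., x_N
        h_fwd = gru_step(x_n, h_fwd, params_fwd)
    h_bwd = np.zeros(d1)
    for x_n in reversed(X):          # backward pass over x_N, ..., x_1
        h_bwd = gru_step(x_n, h_bwd, params_bwd)
    return np.concatenate([h_fwd, h_bwd])   # s, of shape (2*d1,)
```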

2.2 biGRU-CNN Model

The biGRU model attempts to propagate all the semantic and syntactic information in a tweet into two fixed hidden state vectors, which can become a bottleneck when long-distance dependencies exist in the tweet. In [14], Recurrent Neural Network (RNN) outputs were fed into a CNN structure, to generate a vector representation based on all the hidden states of the RNN, rather than just the final hidden state. Specifically, a filter \(w_f \in \mathbb {R}^{2kd_1}\) is applied to k concatenated consecutive hidden states \(h_{i:i+k-1} \in \mathbb {R}^{2kd_1} \) to compute \(c_i\), one value in the feature map corresponding to this filter:

$$\begin{aligned} c_{i} = f(w_f^{T} h_{i:i+k-1}+b_f), \end{aligned}$$
(6)

where f is the rectified linear unit function and \(b_f \in \mathbb {R}\) is a bias term. A max-pooling operation is further applied over the feature map \(\mathbf {c}=(c_1, c_2, \cdots , c_{N-k+1})\), to capture the most important semantic feature \(\hat{c}\) in each feature map:

$$\begin{aligned} \hat{c}=\max \{\mathbf {c}\}. \end{aligned}$$
(7)

\(\hat{c}\) is the feature generated by filter \(w_f\). Filters with varying sliding window sizes k can be applied, to obtain multiple features. The features generated by different filters are concatenated, to form the vector representation of the tweet s.
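A minimal sketch of Eqs. (6)–(7) for a single filter is given below; the tweet representation s would concatenate the pooled features of all filters. Shapes follow the notation above, and the hypothetical filter_bank in the usage comment is our own convention.

```python
def pooled_feature(H, w_f, b_f, k):
    """One filter's pooled feature, following Eqs. (6)-(7).

    H:   array of biGRU hidden states, shape (N, 2*d1)
    w_f: filter, shape (2*k*d1,); b_f: scalar bias
    """
    N = H.shape[0]
    c = np.empty(N - k + 1)                    # the feature map c
    for i in range(N - k + 1):
        window = H[i:i + k].ravel()            # h_{i:i+k-1}, concatenated
        c[i] = max(0.0, w_f @ window + b_f)    # ReLU, Eq. (6)
    return c.max()                             # max-pooling, Eq. (7)

# s = np.array([pooled_feature(H, w, b, k) for (w, b, k) in filter_bank])
```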

2.3 AT-biGRU Model

Whilst they address the problems described above, neither the biGRU model nor the biGRU-CNN model takes the target information into account. However, when human annotators are asked to label the stance of a tweet towards a given target, they are likely to keep the information about the target in mind, and pay more attention to the parts relevant to the target. The token-level attention mechanism, first proposed in [2] for Machine Translation, allowed the neural network to automatically search for tokens of a source sentence that were relevant to predicting a target word, and mask irrelevant tokens; it relieved the RNN of the burden of compressing the entire source sentence into a static, fixed representation. The attention mechanism has been successfully applied to Question Answering [14], Caption Generation [17], Sentiment Analysis [18], etc.

In this paper, we propose to apply the attention mechanism to the biGRU model, to enable the model to automatically compute proper alignments in the tweet, which reflect the importance levels of different tokens in deciding the tweet’s stance towards the given target, as shown in Fig. 1.

Fig. 1. The AT-biGRU model for target-specific stance detection.

In the AT-biGRU model, the vector representation s of the tweet is calculated as the weighted sum of the hidden states:

$$\begin{aligned} s=\sum _{n=1}^{N} \alpha _{n} h_{n}. \end{aligned}$$
(8)

In the above equation, the weight \(\alpha _{n}\) of each hidden state \(h_n\) is computed by:

$$\begin{aligned} \alpha _{n} = \frac{\exp (e_n)}{\sum _{j=1}^N \exp (e_j)}, \end{aligned}$$
(9)

where \(e_n \in \mathbb {R}\) is calculated through a multi-layer perceptron that takes \(h_n\) and the target embedding q as input, specifically:

$$\begin{aligned} e_n=att(h_n, q)=w_m^T(\tanh (W_{ah}h_n+W_{aq}q+b_{a}))+b_m. \end{aligned}$$
(10)

where \(W_{ah} \in \mathbb {R}^{2d_1 \times 2d_1}\); \(W_{aq} \in \mathbb {R}^{2d_1 \times d_2}\); \(b_a\), \(w_m \in \mathbb {R}^{2d_1}\); \(b_m \in \mathbb {R}\) are token-level attention parameters to optimise. In Sect. 2.5, we explore various ways to generate the target embedding \(q \in \mathbb {R}^{d_2}\), based on the embeddings of the tokens in the target Y, denoted by \(y_1\), \(y_2 \in \mathbb {R}^{d_0}\). The weight \(\alpha _n\) can be interpreted as the degree to which the model attends to token \(x_n\) in the tweet, while deciding the stance of the tweet towards the given target.
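The token-level attention of Eqs. (8)–(10) can be sketched as follows, reusing the NumPy conventions of the earlier sketches; the max-subtraction in the softmax is a standard numerical-stability detail, not part of the equations.

```python
def token_attention(H, q, params):
    """Token-level attention, following Eqs. (8)-(10).

    H: biGRU hidden states, shape (N, 2*d1); q: target embedding, shape (d2,)
    """
    # Unnormalised attention scores e_n, Eq. (10)
    e = np.array([params['w_m'] @ np.tanh(params['W_ah'] @ h_n
                                          + params['W_aq'] @ q
                                          + params['b_a']) + params['b_m']
                  for h_n in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()         # softmax over the N tokens, Eq. (9)
    return alpha @ H             # weighted sum s, Eq. (8)
```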

2.4 AS-biGRU-CNN Model

The model proposed above is an improvement on prior research; however, it can be further refined, as follows. The AT-biGRU model applies the attention mechanism at the token level, which enables the model to pay more attention to the tokens that contribute to the stance decision towards the specified target. However, in the AT-biGRU model, the vector representations of the tokens have no direct interaction with the vector representation of the target, which contradicts the intuition that the target can influence the human annotators’ interpretation of each token. For example, the token “email” in Table 1 implies an Against stance towards the target “Hillary Clinton”, but has no obvious influence on stances towards other targets; the token “cold” can either reveal the user’s Favour stance towards the target “Climate Change is a Real Concern”, or suggest the user’s Against stance towards the target “Donald Trump”.

Thus, we use a gated structure to extend the current token-level attention mechanism to a more fine-grained semantic level, by introducing the direct interaction between the hidden states and the vector representation of the target. The gated structure can be embedded into the biGRU-CNN model, which results in the AS-biGRU-CNN model, as shown in Fig. 2.

Fig. 2. The AS-biGRU-CNN model for target-specific stance detection.

In Fig. 2, we introduce the target-specific hidden state \(h_n^{'}\), to replace the original hidden state \(h_n\) generated by biGRU. The target-specific hidden state is calculated as follows:

$$\begin{aligned} h_n^{'}=a_n \odot h_n. \end{aligned}$$
(11)

The attention vector \(a_n \in \mathbb {R}^{2d_1}\), which decides which semantic features in each hidden state are meaningful specifically towards the target, is calculated through a gated structure, as follows:

$$\begin{aligned} a_n = \sigma (W_m(\tanh (W_{ah}h_n+ W_{aq}q + b_a)) + b_m). \end{aligned}$$
(12)

where \(W_{ah}\), \(W_m \in \mathbb {R}^{2d_1 \times 2d_1}\); \(W_{aq}\in \mathbb {R}^{2d_1 \times d_2}\); \(b_a\), \(b_m \in \mathbb {R}^{2d_1}\) are the semantic-level attention parameters to optimise in the gated structure. The methods to derive the target embedding \(q \in \mathbb {R}^{d_2}\), based on the embeddings of the tokens in the target Y, denoted by \(y_1\), \(y_2 \in \mathbb {R}^{d_0}\), will be explained in Sect. 2.5. The elements of the attention vector \(a_n\) can be understood as the degrees to which the model attends to the semantic features of token \(x_n\) in the tweet, while deciding the stance of the tweet towards the given target.
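A minimal sketch of the gated structure of Eqs. (11)–(12), in the same NumPy conventions as the sketches above; the resulting target-specific hidden states are then fed to the CNN structure of Sect. 2.2.

```python
def semantic_attention(H, q, params):
    """Semantic-level gated attention, following Eqs. (11)-(12).

    Returns the target-specific hidden states h'_n = a_n * h_n.
    """
    H_prime = []
    for h_n in H:
        a_n = sigmoid(params['W_m'] @ np.tanh(params['W_ah'] @ h_n
                                              + params['W_aq'] @ q
                                              + params['b_a'])
                      + params['b_m'])   # gate in (0, 1)^{2*d1}, Eq. (12)
        H_prime.append(a_n * h_n)        # Hadamard product, Eq. (11)
    return np.stack(H_prime)             # shape (N, 2*d1), input to the CNN
```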

2.5 Target Embedding

The models proposed in Sects. 2.3 and 2.4 employ the embedding of the given target, \(q \in \mathbb {R}^{d_2}\), derived from the embeddings of the tokens in the given target, \(y_1\), \(y_2 \in \mathbb {R}^{d_0}\). Without loss of generality, we use a target with two tokens as an example; the methods can be directly applied to targets with any number of tokens. To generate target embeddings of the same dimensionality for targets with different numbers of tokens, we propose to use a separate biGRU model, as described in Sect. 2.1, with the target token embeddings \(y_1\) and \(y_2\) as inputs. In this scenario, the dimensionality of q, denoted by \(d_2\) in Sects. 2.3 and 2.4, equals the dimensionality of the concatenated final hidden states of the biGRU model, namely \(2d_1\). Results of the AT-biGRU model and the AS-biGRU-CNN model using the biGRU target embedding are reported in Sect. 3.4. In some aspect-level Sentiment Analysis works, researchers have used the average of the aspect token embeddings to encode the aspect [11, 15]. We also use this averaging method as a baseline target encoding approach to derive the target embedding q, by averaging the target token embeddings \(y_1\) and \(y_2\). In this scenario, \(d_2\) equals the dimensionality of the target token embeddings, namely \(d_0\). Results of the AT-biGRU model and the AS-biGRU-CNN model using the averaging target embedding are reported in Sect. 3.5. Both options are sketched below.
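Both target encodings can be expressed directly in terms of the earlier bigru_encode sketch; Y denotes the array of target token embeddings, and the function names are our own.

```python
def target_embedding_avg(Y):
    """Averaging target embedding: d2 = d0."""
    return np.mean(Y, axis=0)

def target_embedding_bigru(Y, params_fwd, params_bwd, d1):
    """biGRU target embedding: d2 = 2*d1 (concatenated final hidden states)."""
    return bigru_encode(list(Y), params_fwd, params_bwd, d1)
```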

2.6 Model Training

The vector representation of the tweet s is fed as input to a softmax layer, after a linear transformation step that transforms it into a vector whose length equals the number of possible stance categories. The outputs of the softmax layer (denoted by o in Figs. 1 and 2) are the probabilities P(z|X, Y) of the tweet X belonging to stance category z, given the target Y. The stance category with the maximum probability is selected as the predicted category, \(z^*\):

$$\begin{aligned} z^* = \arg \max _{z \in \mathbf {z}} P(z|X, Y). \end{aligned}$$
(13)

All the models are smooth and differentiable, and they can be trained in an end-to-end manner, with standard back-propagation. We use the cross-entropy loss as the objective function \(L(\theta )\), which is defined as follows:

$$\begin{aligned} L(\theta ) = - \sum _{X \in \mathbf {X}} \sum _{z \in \mathbf {z}} P^{'}(z|X, Y)\cdot \log (P(z|X, Y)). \end{aligned}$$
(14)

where \(\mathbf {X}\) is the set of training data; \(\mathbf {z}\) is the set of stance categories; \(P^{'}(z|X, Y)\) denotes the gold-standard stance distribution over z, given X and Y; \(\theta \) is the set of parameters.
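A sketch of the prediction and loss of Eqs. (13)–(14) for one tweet follows; W_o and b_o denote the assumed parameters of the final linear transformation, and the loss would be summed over the training set during optimisation.

```python
def softmax(logits):
    e = np.exp(logits - logits.max())   # shifted for numerical stability
    return e / e.sum()

def predict(s, W_o, b_o):
    """Linear transformation + softmax; the argmax is Eq. (13)."""
    p = softmax(W_o @ s + b_o)          # P(z | X, Y) over the stance categories
    return p, int(np.argmax(p))

def cross_entropy(p_true, p_pred):
    """Per-tweet cross-entropy term of Eq. (14)."""
    return -np.sum(p_true * np.log(p_pred))
```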

3 Experimental Results

3.1 Dataset Description

As mentioned above, we evaluated the effectiveness of the proposed models on the benchmark Stance Detection dataset for the SemEval-2016 Task 6.A [8]. We used exactly the same data as provided to the contestants for this task, with no extra labelled data [4] or domain corpus [1, 9] employed. The benchmark Stance Detection training dataset contained 2,914 tweets relevant to five targets: “Atheism” (A), “Climate Change is a Real Concern” (CC), “Feminist Movement” (FM), “Hillary Clinton” (HC) and “Legalisation of Abortion” (LA). Each tweet was annotated as Favour, Neither or Against towards one of the five targets. The benchmark Stance Detection test dataset contained 1,249 tweets, together with their targets of interest. Detailed statistics about the dataset can be found in Table 2, where “#” represents the number of tweets, and “%F”, “%A” and “%N” represent the percentages of tweets with Favour, Against and Neither stances towards the targets, respectively.

Table 2. Statistics of the benchmark target-specific stance detection dataset.

3.2 Comparison Models

We compared the proposed models with the two best performing models in the SemEval-2016 Task 6.A: (1) MITRE [19], which trained separate Long Short-Term Memory (LSTM) networks with a voting scheme for different targets; the LSTM networks were pre-trained by an auxiliary hashtag prediction task on 298,973 self-collected tweets; (2) pkudblab [16], which also trained separate CNN classifiers for different targets, with a voting scheme employed both within and across epochs, to improve the performance. We also compared against the SVM classifiers trained on the corresponding training datasets for the five targets, using word n-gram and character n-gram features, as reported in [8], representing the previous best performer for this task. Additionally, to illustrate the influence of the token-level and semantic-level attention mechanisms, we included performance comparisons between the biGRU model (Sect. 2.1) and the AT-biGRU model (Sect. 2.3), and between the biGRU-CNN model (Sect. 2.2) and the AS-biGRU-CNN model (Sect. 2.4).

3.3 Experimental Settings and Model Configuration

In line with previous work, we first trained separate classifiers for different targets. To obtain a fair comparison, we employed the only evaluation metric in the SemEval-2016 Task 6.A, which was the macro-average of the F1 scores for the Favour and Against stance categories. This evaluation metric will be referred to as the “macro-average F1 score” in this paper, for simplicity; a sketch of its computation is given below. In the evaluation stage of SemEval-2016 Task 6.A, the target information of each tweet was ignored, in order to measure each team’s overall performance, rather than the performance on each separate target. This was because the training datasets for different targets had different percentages of tweets with Favour, Against and Neither stances, as well as different percentages of tweets expressing stances by mentioning the given target or by mentioning other targets. Thus, this evaluation metric reflects each team’s overall ability to deal with different scenarios. It should be noted that even though separate classifiers were trained for different targets, we used the same configurations for all target-specific classifiers, to make sure our proposed models can be easily applied to any other target, and to effectively demonstrate the advantages of target-specific tweet vector representations, by eliminating the effects of target-specific model settings. Various methods were applied to avoid overfitting. We performed a standard 5-fold cross-validation. For each round of cross-validation, we experimentally set the maximum number of epochs to 50, and located the epoch that achieved the best performance on the validation dataset. The post-softmax probabilities of the 5 trained classifiers were averaged, to obtain the probabilities of a tweet in the test dataset belonging to the three stance categories.
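For clarity, here is a sketch of the official metric, assuming hypothetical label strings; note that the Neither class does not enter the average, but misclassifications involving it still affect the Favour and Against counts.

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def macro_f1_favour_against(gold, pred):
    """SemEval-2016 Task 6.A metric: mean of F1(Favour) and F1(Against)."""
    scores = []
    for c in ('FAVOUR', 'AGAINST'):      # hypothetical label strings
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / 2
```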

We implemented the proposed models using Theano and Keras.

For comparison fairness, all the neural network-based models in the experiments used the same hyper-parameters (as described below), which were selected using grid search on the baseline biGRU model. In the experiments, all the word embeddings were initialised with the GloVe [10] 100-dimensional embeddings pre-trained on Wikipedia data, i.e., \(d_0=100\). We applied dropout [13] with probability 0.2 on the embedding layer. The word embeddings were fine-tuned during the training process, to capture the stance information. From preliminary experiments, we observed that models that shared the embedding layer between the tweets and the targets performed significantly better than models that did not. We chose the dimensionality of the hidden states (\(d_1\)) of both the GRU encoding the tweet and the GRU encoding the target to be 64, and the GRU weights were initialised from a uniform distribution \(U(-\epsilon , \epsilon )\). Following [5], we added a dropout level of 0.3 between the recurrent connections in the GRU that encoded the tweets. We further selected the hyper-parameters for the CNN structure on top of the fixed hyper-parameters of the biGRU model. Following [6], we used filters with window sizes \(k \in \{ 3, 4, 5\}\), with widths equal to the dimensionality of the outputs of the biGRU, which was 128 in this case. There were 100 filters for each size. To increase the robustness of the models against overfitting, a dropout level of 0.5 was further applied before the softmax layer.

We used the Adam optimiser [7] for back-propagation, with the two momentum parameters set to 0.9 and 0.999, respectively. The mini-batch size was set to 16. The code for the experiments is available at https://github.com/zhouyiwei/tsd.
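As an illustration of how these settings fit together, below is a minimal Keras-style sketch of the baseline biGRU classifier with the stated hyper-parameters. The paper used a Theano-backend Keras; this sketch uses current Keras 2 layer names, and vocab_size, max_len and the random emb_matrix are assumed stand-ins for the preprocessing outputs and the GloVe-initialised embedding matrix, not values from the paper.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Bidirectional, GRU, Dense
from keras.optimizers import Adam

d0, d1 = 100, 64                    # embedding and hidden dimensions, Sect. 3.3
vocab_size, max_len = 20000, 30     # assumed preprocessing outputs
emb_matrix = np.random.uniform(-0.05, 0.05, (vocab_size, d0))  # stand-in for GloVe

inp = Input(shape=(max_len,))
x = Embedding(vocab_size, d0, weights=[emb_matrix], trainable=True)(inp)
x = Dropout(0.2)(x)                                   # dropout 0.2 on the embedding layer
x = Bidirectional(GRU(d1, recurrent_dropout=0.3))(x)  # biGRU; recurrent dropout 0.3
x = Dropout(0.5)(x)                                   # dropout 0.5 before the softmax layer
out = Dense(3, activation='softmax')(x)               # Favour / Against / Neither

model = Model(inputs=inp, outputs=out)
model.compile(optimizer=Adam(beta_1=0.9, beta_2=0.999),
              loss='categorical_crossentropy')
# model.fit(..., batch_size=16, epochs=50), with 5-fold cross-validation
```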

3.4 Using the biGRU Target Embedding

The experimental results are shown in Table 3. Besides the evaluation metric of the SemEval-2016 Task 6.A, we also provide the macro-average F1 scores for the different targets, for reference. From the comparison between the biGRU model and the biGRU-CNN model, it can be seen that the CNN structure on top of the biGRU model helps to generate more compact and abstract vector representations of the tweets for Stance Detection.

Both neural network-based models that incorporate target information when generating vector representations for the tweets, i.e., AT-biGRU and AS-biGRU-CNN, outperform the neural network-based models that do not, i.e., MITRE, pkudblab, biGRU and biGRU-CNN. Specifically, the state-of-the-art token-level attention mechanism increases the performance of the biGRU model by 0.32 in the overall macro-average F1 score. The injection of target information through the proposed semantic-level attention mechanism into the biGRU-CNN model, which results in the AS-biGRU-CNN model, leads to a more significant improvement (1.71) over the biGRU-CNN model, making it the best performing model among all the neural network-based models. This demonstrates the effectiveness of attention mechanisms in constructing a composite vector representation from the target and the contextual information provided in the tweet. Compared with the AT-biGRU model with token-level attention, the proposed AS-biGRU-CNN model with semantic-level attention has a stronger capability to model the complex interaction between the target and each token in the tweet, and to generate an expressive conditional vector representation of the tweet with respect to the target.

Moreover, the AS-biGRU-CNN model outperforms the traditional SVM algorithm with word n-gram and character n-gram features reported in [8] by a substantial margin, in the absence of feature engineering and target-specific tuning, which justifies the motivation to automatically intensify the features that are essential to the target, and “dilute” the features that are not.

Table 3. Performance of target-specific stance detection based on the macro-average F1 score, using separate classifiers.

3.5 Using the Averaging Target Embedding

In Table 3, we used biGRU to generate the vector representations of the targets. Additionally, we further experimented with the AT-biGRU and AS-biGRU-CNN models using the averaging target embeddings. The overall macro-average F1 score of the AT-biGRU model increases from 67.97 to 68.30, while that of the AS-biGRU-CNN model decreases from 69.42 to 68.35. One possible explanation is that a simple averaging approach is insufficient to capture the semantic meanings of the targets; thus, for the biGRU-CNN model, which has stronger expressive power than the biGRU model in target-specific Stance Detection, it is helpful to use more flexible target embeddings to perform complex inference. However, for the AT-biGRU model, the target embeddings generated by biGRU surpass its capability to learn and generalise. This is also the reason why stacking the CNN structure on top of the AT-biGRU model does not improve the performance, as it does in the AS-biGRU-CNN model.

3.6 Using Combined Classifiers

In the Stance Detection dataset for the SemEval-2016 Task 6.A, the training data for all the targets were of similar sizes, except for the target “Climate Change is a Real Concern”. There were only 395 items in its training data, and they were highly biased, with only 3.8% of them coming from the Against category. As a result, none of the models in Table 3 achieves performance on this target comparable to that on the other targets. When there is not enough training data for some targets, or the training data for some targets is highly biased, the performance of independent classifiers for these targets cannot be guaranteed. For this case, we hypothesised that a combined classifier over all the targets could alleviate this problem, by jointly modelling the interaction between the stances and contexts of all the available targets. This way, when performing Stance Detection on the “Climate Change is a Real Concern” target, the classifier can employ, or even transfer, the knowledge about the intricate connection between stances and contexts learnt from the training data of the other targets. Motivated by this idea, we further trained combined classifiers based on the proposed models, using all the training data, rather than training separate classifiers for different targets. The combined classifiers’ performances are shown in Table 4.

Table 4. Performance of target-specific stance detection based on the macro-average F1 score, using combined classifiers.

In Table 4, we use the combined SVM classifier reported in [8] as a baseline. For combined classifiers, richer semantic and syntactic information needs to be captured in the tweets’ vector representations, as the relatedness and diversity of different targets in stance expressions must additionally be encoded. This is a much harder task, as the combined classifier has to employ useful knowledge from other targets while avoiding the impairment caused by useless information. For this reason, we continued to employ the biGRU model to generate the target embeddings, as it has stronger expressive power than the averaging method. The difficulty of this task is illustrated by the significantly diminished overall macro-average F1 score of the combined SVM classifier in Table 4, compared with the overall macro-average F1 score of the separate SVM classifiers in Table 3. We experimentally increased the dimensionality of the pre-trained word embedding vectors from 100 to 300, and the dimensionality of the hidden states of the GRU from 64 to 256, to satisfy the above requirements. All the other hyper-parameters were kept the same, as described in Sect. 3.3.

From Table 4, it can be observed that for the target “Climate Change is a Real Concern”, all models benefit from employing the training data of the other targets. Comparatively, combined classifiers based on neural networks achieve much better macro-average F1 scores on this target than the combined classifier based on the traditional SVM algorithm. This is because the neural network-based models employ continuous vector representations of tweets, which allows them to more easily incorporate information from other domains, compared with the traditional SVM algorithm, which employs sparse and discrete vector representations based on feature engineering. The combined classifier using the proposed AS-biGRU-CNN model yields the best performance so far on the “Climate Change is a Real Concern” target, which further illustrates the model’s strong ability to capture the generality of stance expressions across different targets. However, the overall performance of all the combined classifiers decreases, because the performance for targets with sufficient training data can be negatively influenced by redundant information from other targets. Nevertheless, the AS-biGRU-CNN model still yields the best overall performance among the combined classifiers, which shows the model’s power in modelling the differences in stance expressions across different targets.

4 Related Work

Few recent studies have attempted to tackle the target-specific Stance Detection task on tweets [1, 4, 9, 16, 19]. [1] focused on predicting the stances towards targets with no training data provided, which was the SemEval-2016 Task 6.B, a different task from the one studied here; for the problem we tackle in this work, there is a training dataset for each specified target, which can effectively update the states and memories of the encoders. [4] studied the correlation between sentiment and stance, and the sentiment labels of the tweets were additionally needed to train the model. Thus, the settings of both of the above studies differ from the settings of the SemEval-2016 Task 6.A. [16, 19] ignored the target information while performing classification, whereas our experiments have clearly shown that target-specific vector representations of tweets can substantially boost the performance. [9] relied on feature engineering and a large domain corpus to perform feature selection, which is hard to generalise to other targets; the collection of a domain corpus adds further difficulty, because of the limitations of the Twitter API. The attention-based models proposed in this work, on the contrary, are fully automatic, with minimum supervision. We did not collect any extra domain corpus or use any linguistic tools, and no feature engineering was needed. Since no target-specific configurations are involved, the proposed models can be directly applied to other targets.

Another relevant track of research is aspect-level Sentiment Analysis on texts [11, 12, 15]. In this task, the text to be analysed, or at least part of it, focuses on the aspects of interest by explicitly mentioning them, which makes it easier to model the importance and relatedness of tokens with respect to the aspects. This is not the case for the target-specific Stance Detection task. Thus, a deeper integration between the target and the tweet, and a more complex inference mechanism, are needed, as proposed in our research.

5 Conclusion

To the best of our knowledge, we are the first to effectively apply the traditional token-level attention mechanism to the problem of target-specific Stance Detection in tweets, achieving better performance than other neural network-based models. Moreover, we propose to use a gated structure on top of the biGRU-CNN model, to embed target information into the tweet’s vector representation, aiming to introduce direct semantic interaction between the target and each token in the tweet, in order to perform target-specific Stance Detection. The proposed model employs a semantic-level attention mechanism, which is more fine-grained than the token-level attention mechanism: it searches for certain semantic features of each token in the tweet, based on the contribution these semantic features make in deciding the stance of the tweet towards the given target. In the resulting AS-biGRU-CNN model, not only the tweet’s representation vector, but also the representation vectors of the tokens, are target-specific. The experimental results demonstrate that the proposed model outperforms several state-of-the-art baselines, in terms of the macro-average F1 score, on the benchmark target-specific Stance Detection dataset of tweets, both in the scenario where separate classifiers are allowed for different targets and in the scenario where only one combined classifier is allowed. Thus, the AS-biGRU-CNN model has stronger expressive power, and higher generalising capability, to extract target-specific knowledge from annotated datasets and perform target-specific Stance Detection on tweets. Importantly, unlike previous works on target-specific Stance Detection in tweets, the models employed in this work do not rely on any extra annotation, domain corpus or feature engineering, and can be easily generalised to other targets of interest.