1 Introduction

Tweets carry abundant and useful information and have therefore received great attention from researchers. The task in this paper is to predict individuals’ sentiment towards potential topics on a two-point scale (positive or negative) based on their past tweets. Generally, a user’s attitudes towards different topics are closely related and do not change dramatically in a short time, so modelling individuals’ tweets and estimating sentiment polarities towards potential topics supports precise recommendations for individuals, including related topics, advertisements and social circles. Earlier work [5] uses a Support Vector Machine with part-of-speech features to categorize tweets. An adaptive recursive neural network for target-dependent classification has also been proposed, which propagates sentiment signals from sentiment-bearing words to specific targets on a dependency tree [4].

All the methods mentioned above ignore potential sentiment relations within individuals’ tweets and rely heavily on sentiment lexicons. Besides, they focus only on the classification task, while more practical applications require a combination of extraction and classification. In this work, we propose a hierarchical model of individuals’ tweets (HMIT), which extracts topics with the Single-Pass algorithm and models the relationship between individual sentiments and different topics. The main contributions of this paper are three-fold:

  • Models are built on individuals’ tweets, and the topic phrase of each tweet is obtained through Single-Pass. Individuals’ tweets without sentiment words, along with the extracted topics and gold labels, are the inputs of HMIT. Based on this approach, it is possible to provide precise topic recommendations for individuals.

  • We propose a novel topic-dependent hierarchical model, which extracts features from fine-tuned BERT and incorporates topic information through topic-level attention. A CNN then classifies sentence representations as positive or negative.

  • We build models for six users separately, drawn from one Twitter benchmark dataset and a dataset collected by ourselves. We also create a new test dataset by collecting neutral sentences on three general topics. In our experiments, the proposed method outperforms multiple baselines on both datasets in terms of classification and quantification.

2 Related Work

Target-based sentiment analysis aims to judge the sentiment polarity expressed towards each target being discussed. To capture semantic relations flexibly, a target-dependent Long Short-Term Memory (TD-LSTM) is proposed in [10]. As the attention mechanism has been successfully applied to many tasks, a variety of attention-based RNN models have proven effective [12]. To our knowledge, we are the first to exploit the target-individual relation for target-based sentiment analysis. Hierarchical models have been used predominantly for representing sentences. A hierarchical ConvNet is employed in [2] to extract salient sentences from reviews. With the wide use of pre-trained language models, there is a recent trend of incorporating extra knowledge into pre-trained language models as a different kind of hierarchical model [1]. BERT is difficult to apply to downstream tasks that need to put emphasis on several specific words. We propose a hierarchical model that extracts overall information from fine-tuned BERT and then incorporates topic information. A CNN then classifies the whole sentence representation as positive or negative.

3 Proposed Method

The HMIT architecture is shown in Fig. 1. We describe each component and how it is used in learning and inference in detail. For one user, \(\{s_{1},s_{2},\ldots ,s_{m}\}\) is the collection of their tweets, containing m tweets on various topics. A tweet \(s_{i}\) composed of n words is denoted as \(s_{i}=\{x_{1}^{(i)},x_{2}^{(i)},\ldots ,x_{n}^{(i)}\}\) with a gold sentiment label \(y_{i}\in \{POSITIVE,NEGATIVE\}\).

Fig. 1. The overall architecture of HMIT

3.1 Topic Phrases Extraction

To extract topic phrases from tweets, we employ the Single-Pass algorithm. The core idea is to process texts one at a time and measure how well each input text matches the existing clusters. A text whose maximum similarity to a cluster core is greater than the given threshold \(p_{0}\) is assigned to that cluster; otherwise it starts a new cluster. After all texts are clustered, we take the most frequent bi-gram in each cluster as its topic phrase, so that every tweet is associated with a topic phrase \(\{TP_{1},TP_{2}\}\).
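The clustering step can be illustrated with a minimal sketch. The similarity measure and the definition of the cluster core are not specified above, so the sketch assumes cosine similarity over whitespace-token counts and a core equal to the running sum of member vectors; the function names `single_pass` and `topic_phrase` are hypothetical.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(texts, p0=0.4):
    """Assign each text to its most similar cluster, or open a new one."""
    clusters = []  # each cluster: {"core": Counter, "members": [token lists]}
    for text in texts:
        tokens = text.lower().split()
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["core"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > p0:
            # Assumption: the core is the running sum of member vectors.
            best["core"].update(vec)
            best["members"].append(tokens)
        else:
            clusters.append({"core": Counter(vec), "members": [tokens]})
    return clusters


def topic_phrase(cluster):
    """Most frequent bi-gram among the cluster members, as (TP_1, TP_2)."""
    bigrams = Counter()
    for tokens in cluster["members"]:
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams.most_common(1)[0][0] if bigrams else None
```

Because texts are processed once and in order, the resulting clusters depend on the input order, which is a known property of single-pass clustering.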

3.2 Fine-Tuned BERT with Non-sentiment Words

All sentiment words are first removed from tweets according to sentiment lexicons [6]. \(\{x_{1}^{(i)},x_{2}^{(i)},\ldots ,x_{n^{\prime }}^{(i)}\}\) represents a tweet without sentiment words, which is then passed to the BERT tokenizer. Each sentence is tokenized and padded to length N with padding tokens. The embedding layer of BERT integrates word, position and token type embeddings, where \(E_j\in {\mathbb {R}}^K\) is the K-dimensional vector of the j-th word in the tweet. BERT is a multi-layer bidirectional Transformer encoder. In text classification, the decoder applies first-token pooling followed by a fully connected layer with softmax activation, returning a probability distribution over the two categories. After fine-tuning BERT on our own dataset, we extract one layer of latent vectors from the encoder of the fine-tuned BERT.
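The preprocessing before BERT can be sketched as follows. This is only an illustration: the four-word lexicon is a hypothetical stand-in for the lexicons in [6], and the sketch works on whitespace tokens rather than on the WordPiece tokens and special tokens that the BERT tokenizer produces.

```python
# Hypothetical stand-in for the sentiment lexicons cited as [6].
SENTIMENT_WORDS = {"love", "hate", "great", "terrible"}
PAD = "[PAD]"


def strip_sentiment(tokens, lexicon=SENTIMENT_WORDS):
    """Drop every token that appears in the sentiment lexicon."""
    return [t for t in tokens if t.lower() not in lexicon]


def pad_to_length(tokens, n, pad=PAD):
    """Truncate or right-pad the token list to exactly n entries."""
    return tokens[:n] + [pad] * max(0, n - len(tokens))


tweet = "I love the new climate bill".split()
cleaned = strip_sentiment(tweet)        # sentiment word "love" removed
model_input = pad_to_length(cleaned, 8)  # fixed length N = 8 for the example
```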

3.3 Topic-Level Attention

We use a topic-level attention mechanism over a topic phrase to produce a single representation. Since different tokens in a topic phrase may contribute to its semantics differently, we calculate an attention vector for a topic phrase. The hidden outputs corresponding to \(\{TP_1^{(i)},TP_2^{(i)}\}\) are denoted as \(H^{(i)} = \{h_{TP1}^{(i)}, h_{TP2}^{(i)}\}\). We compute the aggregated representation of a topic phrase as

$$\begin{aligned} H^{(i)\prime }={\alpha }^{(i)\mathsf {T}}H^{(i)}=\sum \limits _{o\in \{1,2\}}{\alpha }_{o}^{(i)}{h_{TPo}^{(i)}} \end{aligned}$$
(1)

where the topic attention vector \({\alpha }^{(i)} = \{{\alpha }_{1}^{(i)},{\alpha }_{2}^{(i)}\}\) is distributed over topic phrase \(H^{(i)}\). The attention vector \({\alpha }^{(i)}\) is a self-attention vector that takes the hidden outputs of a topic phrase as input and feeds them into a bi-layer perceptron. We concatenate each token representation and the aggregated topic representation \(H^{(i)\prime }\) to obtain the final context-aware representation for each word.
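A minimal pure-Python sketch of this step is given below. The text states only that the attention weights come from a two-layer perceptron, so the tanh activation, the softmax normalisation and the weight shapes (`w1` as a matrix, `w2` as a vector) are assumptions, and all function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_weights(hidden, w1, w2):
    """Score each topic token with a two-layer perceptron, then softmax."""
    scores = []
    for h in hidden:
        layer1 = [math.tanh(v) for v in matvec(w1, h)]  # assumed activation
        scores.append(sum(w * v for w, v in zip(w2, layer1)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_topic(hidden, alpha):
    """Weighted sum of the topic-token vectors, as in Eq. (1)."""
    dim = len(hidden[0])
    return [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(dim)]


def add_topic(token_vectors, topic_vector):
    """Concatenate the topic vector to every token vector (K -> 2K dims)."""
    return [tok + topic_vector for tok in token_vectors]
```

The concatenation doubles the token dimension from K to 2K, which matches the filter width used by the CNN classifier in Sect. 3.4.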

3.4 CNN Classification

CNNs have recently attracted increasing attention in text classification tasks due to their strong ability to capture local contextual dependencies. We therefore use a CNN as the final classification layer. As shown in Fig. 1, the convolution operation involves kernels of three different sizes. Suppose \(w \in {{\mathbb { R}}^{q \times 2K}}\) is a filter spanning q tokens; a feature \(c_j\) is generated by:

$$\begin{aligned} {c_j} = f(w \circ h_{j:j + q - 1}^{(i)\prime } + b) \end{aligned}$$
(2)

Here \(\circ \) denotes convolution, \(b\in \mathbb {R}\) is a bias term and f is the ReLU activation function. The filter is applied to every possible window of q tokens in the sentence to produce a feature map:

$$\begin{aligned} c = [{c_1},{c_2},...,{c_{N - q + 1}}] \in {{\mathbb { R}}^{N - q + 1}} \end{aligned}$$
(3)

The max-pooling layer takes the maximum value \(\hat{c} = \max \{ c\} \) of c as the feature corresponding to filter w. \({{\hat{y}}_i}\) denotes the predicted label for the i-th tweet.
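Eqs. (2) and (3) together with the max-pooling step can be sketched for a single filter as follows. The sketch is a plain-Python illustration with hypothetical function names, not the implementation used in the experiments; it treats the convolution as an element-wise product of the filter with each window followed by a sum.

```python
def relu(x):
    """ReLU activation f."""
    return max(0.0, x)


def conv_feature_map(tokens, kernel, bias=0.0):
    """Slide a q x d filter over N token vectors; returns N - q + 1 features."""
    q = len(kernel)
    feature_map = []
    for j in range(len(tokens) - q + 1):
        window = tokens[j:j + q]
        total = sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        feature_map.append(relu(total + bias))
    return feature_map


def max_pool(feature_map):
    """Keep the largest activation as the feature for this filter."""
    return max(feature_map)
```

With several filters of sizes 2, 3 and 4, one pooled value per filter would be collected into the vector that is passed to the output layer.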

3.5 Inference and Learning

The objective to train topic-level attention and CNN classifier is defined as minimizing the sum of the cross-entropy losses of prediction on each tweet as follows:

$$\begin{aligned} {{\mathcal {L}}_s} = - \sum \limits _{i = 1}^m \left[ {y_i}\log ({{\hat{y}}_i}) + (1 - {y_i})\log (1 - {{\hat{y}}_i}) \right] \end{aligned}$$
(4)
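As a small numerical illustration of this objective, the sketch below sums the binary cross-entropy over tweets. It assumes that gold labels are encoded as 1 (positive) and 0 (negative) and that the prediction is the probability of the positive class, so the logarithm is applied to the prediction; the clamping constant `eps` is an added safeguard, not part of the objective.

```python
import math


def cross_entropy(gold, predicted, eps=1e-12):
    """Summed binary cross-entropy over all tweets of one user."""
    loss = 0.0
    for y, y_hat in zip(gold, predicted):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return loss
```

For example, two tweets with gold labels 1 and 0 that are both predicted with probability 0.5 give a loss of 2 ln 2, and the loss shrinks as predictions move towards the gold labels.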

For inference, a test news item is first passed to the fine-tuned BERT to obtain its hidden vectors. Through the topic-level attention mechanism, the context-aware representations are combined with topic information and then fed to the CNN classifier, which outputs the prediction.

4 Experiments

4.1 Experimental Settings

Datasets. Table 1 shows the statistics of the datasets. We select three users from Sentiment140 [9] and build a model for each separately, as \(\mathbb {D}_{s1}\), \(\mathbb {D}_{s2}\) and \(\mathbb {D}_{s3}\). We also collect tweets from three talkative users and label them manually as \(\mathbb {D}_{t1}\), \(\mathbb {D}_{t2}\) and \(\mathbb {D}_{t3}\). To verify the feasibility of the method in practical applications, we collect 100 news items for each of three topics (health care, climate change and social security) as \(\mathbb {T}_{h}\), \(\mathbb {T}_{c}\) and \(\mathbb {T}_{s}\).

Table 1. Dataset statistics

Network Details. For Single-Pass, we set the threshold \(p_0\) to 0.4. We fine-tune the pre-trained BERT base uncased model, which has a hidden size K of 768, 12 hidden layers and 12 attention heads. The maximum sequence length N, batch size and learning rate are set to 128, 32 and \(5\times 10^{-5}\) respectively. For the CNN classifier, we adopt three filter sizes: 2, 3 and 4. 64 filters are used for each filter size, and the three pooling sizes are all set to 4. We fine-tune BERT for 3 epochs and train the CNN classifier for 20 epochs.

Evaluation Metrics. We employ accuracy and F1 score to evaluate classification. We regard the evaluation on the test sets as a quantification task, which estimates the distribution of tweets across the two classes. We adopt Mean Absolute Error, computed from a predicted distribution \({\hat{p}}\), the true distribution p and the set \(\mathcal {C}\) of classes. It is computed separately for each topic, and the results are averaged across the three topics to yield the final score.
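The quantification score can be sketched as below. The formula is not written out in the text, so the sketch assumes the common definition in which the absolute differences between predicted and true class proportions are averaged over the classes in \(\mathcal {C}\); the function names and the example distributions are hypothetical.

```python
def class_distribution(labels, classes=("positive", "negative")):
    """Fraction of items assigned to each class."""
    return {c: labels.count(c) / len(labels) for c in classes}


def mean_absolute_error(true_dist, pred_dist):
    """Average absolute gap between predicted and true class proportions."""
    gaps = [abs(pred_dist.get(c, 0.0) - p) for c, p in true_dist.items()]
    return sum(gaps) / len(gaps)


def averaged_mae(per_topic):
    """per_topic: list of (true_dist, pred_dist) pairs, one pair per topic."""
    scores = [mean_absolute_error(p, p_hat) for p, p_hat in per_topic]
    return sum(scores) / len(scores)
```

Under this definition a lower score is better, and a score of 0 means the predicted class proportions match the true ones exactly for every topic.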

4.2 Models Under Comparison

We compare our proposed method with the methods that have been proposed for sentiment analysis (SA) and target-based sentiment analysis (TBSA).

  • BERT [3]: BERT achieves state-of-the-art results in sentence classification, including sentiment classification.

  • mem_absa [11]: Mem_absa adopts a multi-hop attention mechanism over an external memory to focus on the importance level of the context words and the given target.

  • IAN [8]: IAN applies attention mechanisms to both the target and the full context. It uses two attention-based LSTMs to interactively capture the keywords of the target and its context.

  • Cabasc [7]: Cabasc takes into account the correlation between the given target and each context word. It is composed of a sentence-level content attention mechanism and a context attention mechanism.

BERT\(^{-s}\), mem_absa\(^{-s}\), IAN\(^{-s}\) and Cabasc\(^{-s}\) are variant models of BERT, mem_absa, IAN and Cabasc respectively, removing sentiment words in training and testing.

4.3 Results and Analysis

Main Results. From Table 2, we observe that HMIT significantly outperforms the other baselines in both classification and quantification tasks on our own dataset, which suggests that the proposed method is effective at capturing the relationship between individual sentiments and different topics and succeeds in estimating sentiment towards potential topics. We also find that BERT performs reasonably well on the validation sets, which confirms its strong ability to represent a whole sentence and its suitability as the first layer of our model. We compare HMIT with one SA model, three TBSA models and their variants. Table 2 shows that TBSA models display little advantage over SA models, which implies that current sentiment classification is mostly decided by sentiment lexicons or opinion words around the target rather than by the target itself. Furthermore, the superior performance of the variants on both datasets indicates that removing sentiment words from tweets enables models to pay more attention to the topic in a tweet, and thus to learn the relationship between topics and individual sentiments.

Table 2. Comparison results

Extract Features from BERT. We investigate which encoding layer of BERT is the most appropriate for further modification and classification. We extract features from the third-to-last (-3), penultimate (-2) and last (-1) encoding layers of BERT and simply add a CNN classifier on top. In Fig. 2, we report the accuracy and F1 score of cross-validation on \(\mathbb {D}_{t2}\). The penultimate layer turns out to be the most appropriate for making changes or incorporating external information. The last layer is too close to the target, and the earlier layers may not have fully learned the semantics. Therefore, we extract the penultimate layer in our method.

Fig. 2. Accuracy and F1 score on \(\mathbb {D}_{t2}\) with different BERT layers

5 Conclusion

We have proposed a hierarchical model for estimating individuals’ sentiment towards potential topics. The approach extracts topics automatically and models the relationship between individual sentiments and different topics. It takes tweets without sentiment words as input, first extracts features from fine-tuned BERT and then incorporates topic information into the context-aware token representations through the topic-level attention mechanism. A CNN further classifies the representation as positive or negative. The proposed architecture can potentially be applied to precise individual recommendation or to group sentiment estimation towards a topic.