
1 Introduction

As an increasing number of people share and obtain information from social media, it has become an important real-time information source. The unprecedented volume and variety of user-generated content, together with the user interaction network, constitute new opportunities for understanding social behavior and building socially intelligent systems. It is therefore both important and challenging to teach a machine to automatically understand the content presented in social media.

Although considerable progress has been made in Machine Reading Comprehension (MRC), most previous works focus on other domains, such as news [1, 5], stories [11, 16], and Wikipedia [17, 27]; very few works address social media MRC. Table 1 shows an example of social media comprehension. Unlike other domains, in the social media domain one normally posts a message on the assumption that readers have specific background knowledge. Such messages are generally short and contain limited contextual information, as shown in the table. Thus, it is difficult for a machine to understand them thoroughly based on the text alone. In this example, looking only at the message, a machine reader without background knowledge would be puzzled about its topic and could not answer the question.

Table 1. An example of machine reading comprehension in social media domain.

To obtain background knowledge for a machine reader, one feasible method is to introduce external knowledge from knowledge bases such as ConceptNet [10] and WordNet [4], as previous works do in other domains [9, 22, 24]. Unfortunately, due to the informal and diverse nature of social media messages, some key phrases of these short messages cannot be found in those pre-constructed knowledge bases. For example, in Table 1, the token fantasticfour, which indicates the topic of the message, appears in neither ConceptNet nor WordNet.

By studying social media messages, we find that a notable characteristic of them is clustering. That is, on a social media platform, a group of people tend to express their opinions or report news around one topic. Specifically, such topic-relevant messages are commonly clustered by a hashtag, which is marked with the “#” symbol (e.g., #fantasticfour in Table 1) and is ubiquitous in the social media domain. Thus, given a social media message, we can find a group of relevant messages based on its hashtag. As shown in Table 2, a series of topic-relevant messages is clustered by the hashtag “#fantasticfour”. Through those messages, we learn that the topic is a science fiction film, from Marvel, about some superheroes, and so on. These hashtag-clustered messages tend to share a common topic and can be considered a knowledge source on the topic of the given message. To this end, we propose a novel method that obtains and utilizes topic knowledge from hashtag-clustered messages to address the lack of background knowledge in social media comprehension.

Table 2. An example of hashtag-clustered messages in social media.

Given a message and a question, we extract the hashtag from the message and retrieve relevant messages based on the hashtag. Subsequently, we refine topic knowledge from the retrieved messages. Moreover, we construct a neural network, dubbed the Topic Knowledge Reader (TKR). The refined knowledge is fused into the TKR model and contributes to the process of reading comprehension and question answering. We conduct experiments on the TweetQA dataset [26], and the results show the effectiveness of our method.

To summarize, the major contributions of this paper are as follows:

  • In the task of machine reading comprehension, we investigate the problem of lacking background knowledge in the social media domain. We propose to exploit the clustering nature of social media to obtain knowledge from other relevant messages.

  • We propose a dedicated knowledge acquisition approach, which retrieves and refines topic knowledge from relevant messages clustered by the hashtag, a marker that appears pervasively in social media messages.

  • We build a machine reading comprehension model, TKR, to utilize the refined knowledge in a targeted manner, and we conduct experiments on a public dataset that demonstrate the effectiveness of our method.

2 Related Work

Social Media NLP: Over the past few years, social media has revolutionized the way we communicate. Massive amounts of text are continuously generated by users, which creates enormous challenges for the NLP community in analyzing and understanding that text automatically. In recent years, several NLP techniques and datasets for processing social media text have been proposed. Dos Santos and Gatti [3] use a deep convolutional neural network that exploits information from the character level to the sentence level to perform sentiment analysis of short texts. Vo and Zhang [21] split the context and employ distributed word representations and neural pooling functions to extract features from tweets. Zhou and Chen [29] propose a graphical model, named location-time constrained topic (LTT), to capture the content, time, and location of social messages. Singh et al. [18] develop an event classification and location prediction system that uses a Markov model for location inference. Qian et al. [13] jointly discover subevents from microblogs of multiple media types (user, text, and image) and design a multimedia event summarization process.

Machine Reading Comprehension: Owing to the fast development of deep learning techniques and large-scale datasets, Machine Reading Comprehension (MRC) has gained increasingly wide attention over the past few years. Richardson et al. [15] built the multiple-choice dataset MCTest, which encouraged early research on machine reading comprehension and inspired a strand of MRC models [11, 16]. Hermann et al. [5] propose CNN & Daily Mail, a cloze-test dataset that is large-scale and better suited than MCTest to deep learning methods; based on it, they propose an attention-based LSTM [6] model named Attentive Reader. Moreover, Rajpurkar et al. [14] released the span extraction dataset SQuAD, which has become the most popular MRC dataset in recent years and inspired many classic MRC models, such as BiDAF [17] and R-Net [25]. In addition, the multi-hop MRC dataset HotpotQA [27] has recently gained wide attention; it addresses question answering that requires combining multiple clues.

3 Method

Fig. 1. The framework of our proposed method. The left half is the process of obtaining topic knowledge. The right half is the reading comprehension model, Topic Knowledge Reader (TKR).

Figure 1 shows the framework of our method. Although the figure uses tweets as the example, the method applies equally to messages from other social media platforms. Given a tweet and a question, we obtain the answer by the following steps. First, we extract the hashtag from the tweet; meanwhile, we encode the tweet and the question with the BERT encoder. Second, we retrieve relevant tweets that contain the same hashtag. Third, we refine the topic knowledge from the retrieved tweets. Next, the knowledge is encoded and fused with the BERT representation. Finally, the model predicts the answer based on the knowledge-aware representation of the tweet and question.

3.1 Knowledge Acquisition

We regard the set of tweets clustered by a hashtag as the source of knowledge and obtain topic knowledge from them. We first retrieve relevant tweets and then gather common concepts from those tweets. Meanwhile, we maintain a hashtag pool to score each concept. Finally, we refine the concepts by selecting the top-k scored ones as the topic knowledge.

Retrieving Relevant Tweets. Given a tweet text T, we first extract its hashtag H. Next, we use H as the query and retrieve relevant tweets, obtaining a set S of tweets that contain the same hashtag. The tweets in S tend to share the topic of the given tweet T, and their information helps to understand T comprehensively. We then remove the non-English tweets from S and delete abnormal strings from the remaining text, such as URLs starting with “http” and picture references starting with “pic”.
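
To make this concrete, the following is a minimal sketch of the post-retrieval cleaning, assuming simple regular expressions for URLs and picture references; the is_english filter is a hypothetical stand-in, as the paper does not specify its language detector.

```python
import re

def is_english(text):
    # Crude placeholder heuristic: keep tweets that are mostly ASCII.
    return sum(c.isascii() for c in text) / max(len(text), 1) > 0.9

def clean_tweets(tweets):
    """Drop non-English tweets and strip abnormal strings from the rest."""
    cleaned = []
    for text in tweets:
        # Remove URLs starting with "http" and picture references
        # starting with "pic" (e.g., pic.twitter.com/...).
        text = re.sub(r"http\S+|pic\S+", "", text).strip()
        if text and is_english(text):
            cleaned.append(text)
    return cleaned
```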

As an exception, for tweets that contain no hashtag, we use a hashtag extractor to extract hashtag words from the tweet. The extractor is composed of a BERT encoder and a span pointer. We input the tweet T to the BERT model and obtain the representation \(P = \left\{ p_0, p_1, ..., p_n\right\} \), where \(p_i\) is the representation of the i-th word of the tweet. Then we extract the hashtag from the tweet by a pointer:

$$\begin{aligned} \begin{aligned} Start_i = \frac{exp(w_0^Tp_i)}{\sum _{j}{exp(w_0^Tp_{j})}} \qquad End_i = \frac{exp(w_1^Tp_i)}{\sum _{j}{exp(w_1^Tp_{j})}} \end{aligned} \end{aligned}$$
(1)

where \(w_0\) and \(w_1\) are trainable vectors. The pointer gives the probability of each word being the start and the end of the hashtag, respectively. We score a span by multiplying these two probabilities and take the span with the maximum score as the hashtag. Figure 2 shows an example. We train the extractor on the hashtag extraction dataset proposed by Zhang et al. [28]; evaluated on its test set, our extractor achieves 85.1% accuracy.
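
A minimal sketch of this pointer decoding, given the BERT word representations; the cap on span length (max_len) is an assumption not stated in the paper.

```python
import torch

def best_hashtag_span(p, w0, w1, max_len=5):
    """Eq. (1): softmax start/end distributions over word representations
    p (n x h), then return the span maximizing Start_i * End_j."""
    start = torch.softmax(p @ w0, dim=0)  # Start_i for each word
    end = torch.softmax(p @ w1, dim=0)    # End_i for each word
    best, best_span = -1.0, (0, 0)
    for i in range(p.size(0)):
        for j in range(i, min(i + max_len, p.size(0))):
            score = (start[i] * end[j]).item()
            if score > best:
                best, best_span = score, (i, j)
    return best_span

# Example with random stand-ins for the BERT outputs and pointer vectors:
n, h = 12, 768
print(best_hashtag_span(torch.randn(n, h), torch.randn(h), torch.randn(h)))
```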

Fig. 2. An example of extracting a hashtag for tweets without one.

Gathering Relevant Concepts. Having retrieved the tweets sharing the same topic, we gather fine-grained knowledge, i.e., concepts connected to the topic. We tokenize every tweet text in S to obtain a set of tokens. Due to the informal nature of tweets, some tokens can contain multiple words, like the hashtag “#secretwars”; we therefore segment each token. After that, we obtain a set C of concepts (e.g., “movie”, “marvel”, and “secret wars”).
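
The paper does not name its segmenter; as one illustration, the off-the-shelf wordsegment package recovers multi-word concepts from hashtag-like tokens:

```python
import string
from wordsegment import load, segment  # pip install wordsegment

load()  # load the package's word statistics once

def gather_concepts(tweets):
    """Tokenize each retrieved tweet and segment every token, so that
    '#secretwars' yields 'secret wars' while plain words pass through."""
    concepts = []
    for text in tweets:
        for token in text.split():
            token = token.lower().strip(string.punctuation + "#@")
            if token:
                concepts.append(" ".join(segment(token)))
    return concepts

print(gather_concepts(["Excited for #secretwars from Marvel!"]))
# e.g. ['excited', 'for', 'secret wars', 'from', 'marvel']
```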

Maintaining the Hashtag Pool. To further refine the concepts, we maintain a hashtag pool. First, a large collection of recent tweets is gathered as the original corpus. From this corpus, we collect the hashtags, then find the relevant tweets and obtain the concept set C for each hashtag following the above-mentioned process. These hashtags and all of their relevant concepts are added to the empty hashtag pool as its initialization. When a new tweet T is given at application time, we update the hashtag pool by adding the hashtag of T and its relevant concepts C to the pool.
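
The pool itself can be sketched as a mapping from hashtags to concept sets that exposes the two counts later needed by Eq. (2); this is an illustrative data structure, not the authors' implementation.

```python
class HashtagPool:
    """Maps each hashtag to the set of concepts gathered from its tweets."""

    def __init__(self):
        self.pool = {}  # hashtag -> set of relevant concepts

    def add(self, hashtag, concepts):
        self.pool.setdefault(hashtag, set()).update(concepts)

    def num_hashtags(self):
        return len(self.pool)  # |P| in Eq. (2)

    def doc_freq(self, concept):
        # |p_i|: hashtags whose relevant concepts contain this concept.
        return sum(concept in cs for cs in self.pool.values())

pool = HashtagPool()
pool.add("#fantasticfour", {"movie", "marvel", "secret wars"})
pool.add("#secretwars", {"marvel", "comic"})
print(pool.num_hashtags(), pool.doc_freq("marvel"))  # 2 2
```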

Refining Topic Knowledge. Given the tweet T and the concepts C obtained by the above steps, we apply Term Frequency-Inverse Document Frequency (TF-IDF) to score each concept. The score of \(concept_i\) in C is calculated by:

$$\begin{aligned} score_i = \frac{n_i}{N}log\frac{|P|}{|p_i|+1} \end{aligned}$$
(2)

where \(n_i\) is the frequency of \(concept_i\) in C, N is the total count of concepts in C, and P denotes the hashtag pool. Thus, |P| is the total number of hashtags in the pool, and \(|p_i|\) is the number of hashtags in the pool whose relevant concepts contain \(concept_i\). Finally, the top-k scored concepts are selected as the topic knowledge K of the tweet T.
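
Given the pool, the scoring and selection reduce to a few lines; this sketch assumes the HashtagPool interface above and uses k = 8, the value chosen in Sect. 4.2.

```python
import math
from collections import Counter

def refine_topic_knowledge(concepts, pool, k=8):
    """Score every concept in C with Eq. (2) and keep the top-k."""
    counts = Counter(concepts)    # n_i for each concept
    total = sum(counts.values())  # N: total concept occurrences in C
    scores = {
        c: (n / total) * math.log(pool.num_hashtags() / (pool.doc_freq(c) + 1))
        for c, n in counts.items()
    }
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [(c, scores[c]) for c in top]  # topic knowledge K with scores S
```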

3.2 Topic Knowledge Reader

Fig. 3. The detailed architecture of Topic Knowledge Reader (TKR).

As shown in Fig. 3, we propose a reading comprehension model, named Topic Knowledge Reader (TKR), to fuse the refined concepts and then answer the question. The inputs of the model are

  • the given tweet \(T = \left\{ t_{0}, t_{1}, ..., t_{n-1}\right\} \in \mathbb {R}^{n}\), where n is the number of words in the tweet, \(t_i\) is the i-th word in T.

  • the question \(Q = \left\{ q_{0}, q_{1}, ..., q_{m-1}\right\} \in \mathbb {R}^{m}\), where m is the number of words in the question, \(q_i\) is the i-th word in Q.

  • the concept knowledge \(K = \left\{ k_{00}, k_{01}, ..., k_{ij},..., k_{(l-1)x}\right\} \in \mathbb {R}^{y}\), where y is the total number of words across all concepts and \(k_{ij}\) refers to the j-th word of the i-th concept.

  • the concept score \(S = \left\{ s_{0}, s_{1}, ..., s_{l-1}\right\} \in \mathbb {R}^{l}\), where l is the number of concepts and \(s_i\) is the TF-IDF score of the i-th concept from Sect. 3.1.

The output of the model is the predicted answer.

Encoding Tweet and Question. We first concatenate the tweet T and the question Q. The combined passage is

$$\begin{aligned} \begin{aligned} D = \left\{ [CLS], t_{0}, t_{1}, ..., t_{n-1}, [SEP], q_{0}, q_{1},... q_{m-1}, [SEP]\right\} \end{aligned} \end{aligned}$$
(3)

where we add the special tokens “[CLS]” and “[SEP]”, following the input processing of Devlin et al. [2]. Then we employ BERT [2] to encode the tweet and the question together, obtaining the question-aware representation of the passage:

$$\begin{aligned} \begin{aligned} P^0 = BERT\left( D\right) \in \mathbb {R}^{(m+n+3) \times h} \end{aligned} \end{aligned}$$
(4)
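
As an illustration, Eqs. (3)-(4) map directly onto the HuggingFace Transformers API (an implementation choice assumed here; the paper only states that bert-base is used). Note that BERT tokenizes into subwords, so in practice the encoded length is counted in subword tokens rather than the m + n + 3 words of Eq. (3).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Illustrative inputs in the spirit of Table 1 (hypothetical text):
tweet = "Can't wait to see it this weekend! #fantasticfour"
question = "Which movie is the user excited about?"

# The tokenizer inserts the [CLS] and two [SEP] tokens of Eq. (3).
inputs = tokenizer(tweet, question, return_tensors="pt")
with torch.no_grad():
    P0 = bert(**inputs).last_hidden_state  # Eq. (4): (1, seq_len, 768)
```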

Encoding Concepts. Before the fusion step, we encode the concepts. To obtain their original representation, we apply the BERT encoder to the concepts as well. Analogously, we add the special token “[CLS]” to the single sequence of knowledge words,

$$\begin{aligned} \begin{aligned} K = \left\{ [CLS], k_{00}, k_{01}, ..., k_{ij},..., k_{(l-1)x}\right\} \end{aligned} \end{aligned}$$
(5)

and then the pre-trained BERT model is applied to encode the concepts:

$$\begin{aligned} C^w = BERT\left( K\right) \in \mathbb {R}^{(y+1) \times h} \end{aligned}$$
(6)

where the rows of \(C^w\) corresponding to concept words are denoted \(c^0_{ij}\), the representation of the j-th word of the i-th concept.

Word Aggregation: As some concepts contain multiple words, we aggregate the words of each concept by mean pooling to obtain a single-vector representation, \(c^0_i\), for each concept:

$$\begin{aligned} c^0_i = \frac{1}{N_i}\sum _{j\in [0, N_i) }{c^0_{ij}} \qquad C^0 = \left\{ c^0_0, c^0_1, ..., c^0_i, ..., c^0_{l-1}\right\} \in \mathbb {R}^{l \times h} \end{aligned}$$
(7)

where \(N_i\) is the number of words in the i-th concept.

Self Attention: Although no sequential relation exists among these concepts, they are still interrelated. Thus, we use the self-attention mechanism to perform non-sequential context encoding on the concepts:

$$\begin{aligned} \begin{aligned} c^1_i = \sum _j{\alpha _{ij}c^0_j} \qquad \alpha _{ij} = \frac{exp(\sigma (W_qc^0_i)\cdot \sigma (W_kc^0_j))}{\sum _{j^\prime } {exp(\sigma (W_qc^0_i)\cdot \sigma (W_kc^0_{j^\prime }))}} \end{aligned} \end{aligned}$$
(8)

where \(\sigma \) is the activation function, and \(W_q\in \mathbb {R}^{h \times h}\) and \(W_k \in \mathbb {R}^{h \times h}\) are trainable matrices. Thus we have the self-aligned concepts \(C^1 = \left\{ c^1_0, c^1_1, ..., c^1_{l-1}\right\} \in \mathbb {R}^{l \times h}\).
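
A PyTorch sketch of the concept encoder covering Eqs. (7)-(8); since the paper only says "activation function", ReLU is assumed for \(\sigma \).

```python
import torch
import torch.nn as nn

class ConceptSelfAttention(nn.Module):
    """Mean-pool each concept's word vectors (Eq. 7), then apply the
    single-head self-attention of Eq. (8)."""

    def __init__(self, h=768):
        super().__init__()
        self.W_q = nn.Linear(h, h, bias=False)
        self.W_k = nn.Linear(h, h, bias=False)
        self.act = nn.ReLU()  # assumed choice for sigma

    def forward(self, word_reps, concept_lens):
        # Eq. (7): aggregate the words of each concept by mean pooling.
        pooled, offset = [], 0
        for n_i in concept_lens:
            pooled.append(word_reps[offset:offset + n_i].mean(dim=0))
            offset += n_i
        C0 = torch.stack(pooled)                # (l, h)
        # Eq. (8): alpha_ij from dot products of projected concepts.
        q, k = self.act(self.W_q(C0)), self.act(self.W_k(C0))
        alpha = torch.softmax(q @ k.T, dim=-1)  # (l, l)
        return alpha @ C0                       # C^1, shape (l, h)

enc = ConceptSelfAttention()
C1 = enc(torch.randn(10, 768), [2, 1, 3, 1, 1, 1, 1])  # 7 concepts, 10 words
```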

Score Scaling: We then scale the concepts by the scores \(S \in \mathbb {R}^{l}\) assigned in the knowledge refining step:

$$\begin{aligned} \begin{aligned} C^2 = S C^1 \in \mathbb {R}^{l \times h} \end{aligned} \end{aligned}$$
(9)

\(C^2\) denotes the final representation of the concepts; each concept vector is scaled by its own score, i.e., \(c^2_i = s_i c^1_i\).

Topic Knowledge Fusion. The concepts are fused into the passage by:

$$\begin{aligned} \begin{aligned} p^1_i = \sum _j{\beta _{ij}c^2_j} \qquad \beta _{ij} = \frac{exp(\sigma (W_pp^0_i)\cdot \sigma (W_cc^2_j))}{\sum _{j^\prime } {exp(\sigma (W_pp^0_i)\cdot \sigma (W_cc^2_{j^\prime }))}} \end{aligned} \end{aligned}$$
(10)

where \(\sigma \) is the activation function, and \(W_p\in \mathbb {R}^{h \times h}\) and \(W_c \in \mathbb {R}^{h \times h}\) are trainable matrices. Thus, we obtain the concepts-aware passage representation \(P^1 = \left\{ p^1_0, p^1_1, ..., p^1_{m+n+2}\right\} \in \mathbb {R}^{(m+n+3) \times h}\). A bidirectional LSTM then performs an additional sequential context encoding and aggregates the original question-aware passage representation \(P^0\) with the concepts-aware representation \(P^1\):

$$\begin{aligned} P^2 = BiLSTM([P^0;P^1])\in \mathbb {R}^{(m+n+3) \times h} \end{aligned}$$
(11)
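
A matching sketch of Eqs. (9)-(11); using h/2 LSTM units per direction (so that \(P^2\) keeps width h) and ReLU for \(\sigma \) are both assumptions.

```python
import torch
import torch.nn as nn

class TopicKnowledgeFusion(nn.Module):
    """Scale concepts by their scores (Eq. 9), attend from each passage
    position to the concepts (Eq. 10), then aggregate with a BiLSTM (Eq. 11)."""

    def __init__(self, h=768):
        super().__init__()
        self.W_p = nn.Linear(h, h, bias=False)
        self.W_c = nn.Linear(h, h, bias=False)
        self.act = nn.ReLU()
        self.bilstm = nn.LSTM(2 * h, h // 2, bidirectional=True,
                              batch_first=True)

    def forward(self, P0, C1, scores):
        C2 = scores.unsqueeze(-1) * C1           # Eq. (9): (l, h)
        q = self.act(self.W_p(P0))               # (m+n+3, h)
        k = self.act(self.W_c(C2))               # (l, h)
        beta = torch.softmax(q @ k.T, dim=-1)    # Eq. (10)
        P1 = beta @ C2                           # concepts-aware passage
        fused = torch.cat([P0, P1], dim=-1)      # [P^0; P^1]
        P2, _ = self.bilstm(fused.unsqueeze(0))  # Eq. (11)
        return P2.squeeze(0)                     # (m+n+3, h)

fusion = TopicKnowledgeFusion()
P2 = fusion(torch.randn(30, 768), torch.randn(8, 768), torch.rand(8))
```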

Prediction. We employ two linear layers to predict the start and end positions of the answer in the passage, respectively, and then normalize the prediction scores:

$$\begin{aligned} \begin{aligned} \tilde{Start_i} = \frac{exp(w_s^Tp^2_i)}{\sum _{j}{exp(w_s^Tp^2_{j})}} \qquad \tilde{End_i} = \frac{exp(w_e^Tp^2_i)}{\sum _{j}{exp(w_e^Tp^2_{j})}} \end{aligned} \end{aligned}$$
(12)

where \(w_s\in \mathbb {R}^{h}\) and \(w_e \in \mathbb {R}^{h}\) are trainable weight vectors. We use Negative Log-Likelihood (NLL) as the training loss. During evaluation, we score each span of the tweet by multiplying its start score and end score and select the text span with the maximum score as the answer.
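
The decoding rule amounts to an argmax over the upper-triangular matrix of start-times-end scores; a minimal sketch:

```python
import torch

def predict_answer_span(P2, w_s, w_e):
    """Eq. (12) plus evaluation-time decoding: normalize start/end scores,
    then pick the span (i, j) with i <= j maximizing their product."""
    start = torch.softmax(P2 @ w_s, dim=0)
    end = torch.softmax(P2 @ w_e, dim=0)
    # Zero out spans with end before start, then take the global argmax.
    span_scores = torch.triu(start.unsqueeze(1) * end.unsqueeze(0))
    flat = span_scores.argmax().item()
    return divmod(flat, span_scores.size(1))  # (start, end) positions

h = 768
print(predict_answer_span(torch.randn(30, h), torch.randn(h), torch.randn(h)))
```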

4 Experiment

4.1 TweetQA Dataset

We conduct experiments on the recently released social media MRC dataset, TweetQA. Each instance of the dataset is a triple consisting of a tweet text, a human-proposed question, and a list of human-annotated answers. The dataset comprises 10,692 training triples, 1,086 development triples, and 1,979 test triples. It is the first large-scale MRC dataset over social media data.

4.2 Implementation Details

Preprocessing: As we employ BERT [2] to encode the text, we tokenize the text with BERT's default tokenizer. Since answer spans are not labeled in the training set, we annotate an approximate answer span in each tweet by selecting the span that achieves the best F1 score against the annotated answer.
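
A sketch of this weak annotation using standard token-level F1; the cap on candidate span length is hypothetical.

```python
from collections import Counter

def f1(span_tokens, answer_tokens):
    """Token-level F1 between a candidate span and a gold answer."""
    overlap = sum((Counter(span_tokens) & Counter(answer_tokens)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(span_tokens), overlap / len(answer_tokens)
    return 2 * p * r / (p + r)

def annotate_span(tweet_tokens, answer_tokens, max_len=20):
    """Return the tweet span with the best F1 against the gold answer,
    along with its F1 (later used for the 0.6 training filter)."""
    best, best_span = 0.0, None
    for i in range(len(tweet_tokens)):
        for j in range(i + 1, min(i + max_len, len(tweet_tokens)) + 1):
            score = f1(tweet_tokens[i:j], answer_tokens)
            if score > best:
                best, best_span = score, (i, j)
    return best_span, best
```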

Knowledge Acquisition: To simulate the real-world scenario in which a social media MRC system operates, we use the training set as the original corpus for initializing the Hashtag Pool. During evaluation, we update the Hashtag Pool with the hashtags and relevant concepts from the development and test sets. Based on our experimental analysis, we select the top-8 scored concepts for each hashtag in the knowledge refining step.

Training: We select the instances containing a span whose F1 score is no less than 0.6 to train the model in a weakly supervised manner; as a result, 8,238 instances are used during training. We employ the Adam optimizer, set the learning rate to \(3\times 10^{-5}\), fine-tune the model for 3 epochs, and set the dropout rate of BERT to 0.1. We use the pre-trained bert-base model [2] (as opposed to bert-large); its hidden size is 768.

Evaluation: As the answer in TweetQA is not always a span of the given tweet, following Xiong et al. [26] we use natural language generation metrics to evaluate the models, namely BLEU-1, Meteor, and Rouge-L. The answers of the test set are not released, so we submit our predictions to the official TweetQA evaluation platform and receive the performance results.

4.3 Baselines

  • Query Matching: a simple IR baseline [7], adapted to the TweetQA task by Xiong et al. [26].

  • BiDAF: a popular neural MRC baseline [17], which extracts answers from the original tweet text.

  • Generative QA: an RNN-based generative model [19] that employs both copy and coverage mechanisms during generation.

  • BERT Extraction: a recently proposed pre-trained model [2]. Following [2], we construct a BERT-based answer extraction model by feeding the passage representation \(P^0\) from Eq. 4 directly into the prediction layer of Eq. 12.

  • BERT Generation: because some TweetQA answers are not spans of the tweet text, we build a BERT-based generative model. We use BERT as the encoder, as in Eq. 4. Following [20], we employ a pointer generator, which selects words from both the tweet and the vocabulary, to decode the answer. The generative model is trained on all instances of the training set.

  • Knowledge Concat: a simpler method of fusing the topic knowledge. The model directly concatenates the topic knowledge (i.e., the selected concepts) with the sequence of tweet and question before BERT encoding and finally, like TKR, conducts span prediction.

  • KAR: Knowledge Aided Reader (KAR) [23] is a recently proposed MRC model that utilizes knowledge from WordNet. The model conducts mutual attention and self-attention based on connections among the words of the question and the passage, where the connections are built from WordNet knowledge. For a fair comparison, we modify the model to use BERT as the basic encoder instead of the original embedding layers composed of GloVe [12], CNN [8], and LSTM.

4.4 Main Results

Table 3. The results on TweetQA dataset. Extract-UB denotes the upper bound of extractive methods.

As shown in Table 3, our model, TKR, surpasses the recently proposed Knowledge Aided Reader (KAR) and achieves competitive performance. In our view, due to the limited coverage of the pre-constructed knowledge base (WordNet), KAR suffers from sparse knowledge extraction for the diverse and informal expressions of the social media domain. Moreover, TKR significantly outperforms all of the other baselines, especially the BERT-based model BERT Extraction. BERT Extraction is exactly the architecture that remains when the topic knowledge is ablated from TKR, so the comparison between the two directly demonstrates the benefit of acquiring and utilizing topic knowledge for social media comprehension.

Moreover, Knowledge Concat performs better than BERT Extraction, which also validates the effectiveness of the knowledge. Comparing Knowledge Concat with TKR, we find that TKR performs better, because TKR integrates the refined knowledge into the MRC model in a more targeted manner.

Fig. 4. The performance of TKR with different numbers (k) of concepts on the development set of TweetQA.

4.5 Different Number of Concepts

To further verify the effectiveness of the topic knowledge, we study the relationship between the number of employed concepts and the performance of TKR. We select the top-k concepts during the refining step, with k ranging from 2 to 18, and then train and evaluate TKR at each setting of k. As shown in Fig. 4, as k increases from 2 to 18, the performance of TKR first rises rapidly until k reaches 8 and then drops slowly. The performance gain from \(k=2\) to \(k=8\) shows that our topic knowledge is effective, while the loss from \(k=8\) to \(k=18\) is caused by the noise introduced by the low-scored concepts.

Table 4. Sampled cases which show the effect of the different numbers of concepts. The concepts in blue are the Top-5 scored ones, and those in black are the 13–18th scored concepts.

To probe the effect of different numbers of concepts more intuitively, we sample and analyze cases from the development set. Table 4 shows one of them, where Question0 and Question1 are two questions proposed for the same tweet. In this example, we find that the top-5 concepts describe the topic comprehensively and build semantic connections between key concepts in the tweet, including panda, Mei Xiang, and National Zoo, which ultimately contributes to answering the questions. On the contrary, the 13th-18th scored concepts tend to deviate from the topic and, as noise-like information, can even harm the reading comprehension model, TKR, when introduced into it.

4.6 Ablation Study

Table 5. Ablation study on the development set. -score scale denotes TKR without the score scaling module; -self attn denotes TKR without self-attention; -word agg denotes TKR without word aggregation; -LSTM denotes TKR using a dense layer instead of the LSTM for knowledge fusion.

To study the effect of the key modules of TKR, we conduct ablation experiments on the development set. As shown in Table 5, all three knowledge encoding modules, namely score scaling, self-attention, and word aggregation, contribute to the overall performance. The results demonstrate that these modules, which are designed for the topic knowledge in a targeted manner, indeed help the model encode and absorb the knowledge. Furthermore, the performance of -LSTM falls slightly behind the original TKR, indicating that the sequential information captured by the additional context encoding benefits comprehension.

4.7 Extractive vs. Generative

Table 6. Sampled cases that show the difference between the generative model and the extractive one.

As shown in Table 3, compared with BERT Generation, the extractive models, including BERT Extraction and TKR, achieve better performance, even though some answers have no exactly matching substring in the tweet. Table 6 shows two sampled cases that illustrate the difference between the extractive and generative models. As shown in the table, the generative model performs better in some cases where the answer must be synthesized from the question and the tweet. Conversely, it lags behind the extractive model when the answer is an uninterrupted snippet of the tweet. However, after studying more cases, we find that even in many cases where the answer needs to be synthesized, the generative model fails to provide a qualified answer. We believe much more data is needed to train a qualified generative MRC model.

4.8 Weakly Supervised Training

To train the extractive model, TKR, we annotate the answer span in each tweet by the F1 score and train the model to locate the annotated span. As the annotated span may not be the true answer, this is a weakly supervised training process. As shown in Table 7, we study the relationship between the span score of the training data and the performance on the development set. As the table shows, by lowering the span score threshold, more training data is involved; meanwhile, the performance first rises until span score = 0.6 and then drops. That is, as different amounts of weakly supervised training data are introduced, span score = 0.6 is the point where the gap between the benefit of additional positive examples and the damage from noise is maximized.

Table 7. The performance on the development set with different scales of training data. \(Span Score \ge i\) denotes that the model is trained on instances containing a span whose F1 score is no less than i; data and proportion refer to the size and proportion of the selected training data.

5 Conclusion

In this paper, we focus on machine reading comprehension in the social media domain. We propose a novel method to address the lack of background knowledge in this task. Exploiting the clustering nature of social media, we retrieve and refine topic knowledge from relevant messages and then integrate the knowledge into an MRC model, TKR. Experimental results show that our proposed method outperforms recently proposed models and BERT-based baselines, which proves the method effective overall. By introducing different amounts of topic knowledge, we demonstrate the effectiveness of our refined knowledge. Moreover, the ablation study further validates the contribution of TKR's key modules for utilizing the knowledge.