Linking News and Tweets

Lin, Xiaojie; Gu, Ye; Zhang, Rui; Fan, Ju

doi:10.1007/978-3-319-46922-5_41

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9877)

Included in the following conference series:

Australasian Database Conference

Abstract

In recent years, the rise of social media such as Twitter has been changing the way people acquire information. Meanwhile, traditional information sources such as news articles remain irreplaceable. This has led to a new branch of study on understanding the relationship between news articles and social media posts and on fusing information from these heterogeneous sources. In this paper, we present a system that effectively and efficiently links news and relevant tweets. Specifically, given a news stream and a tweet stream, the system discovers the tweets that are relevant to each news article in the news stream.

1 Introduction

Nowadays, a real-world event such as a traffic accident or a criminal activity is not only covered by news articles, but also prompts ordinary people to post comments on social media such as Twitter (Note 1), Facebook (Note 2) and Weibo (Note 3).

The strong relationship between news and social media has attracted the interest of researchers, and many studies have analyzed these two types of information sources together. For example, Yang et al. [4] use relevant social media posts to summarize and extract highlights from news articles; Kim et al. [2] use the relationship to analyze the characteristics of different types of media and the diffusion pattern of news events.

These studies and applications require a reliable system to link news articles and relevant social media posts. In this paper, we present such a system, which effectively and efficiently links news and relevant tweets.

2 The Linking System

The structure of the system is shown in Fig. 1. The input is a news stream and a tweet stream. When a news article arrives, it is first preprocessed (e.g. tokenization and stemming) and then stored in a buffer for D days. Indices are also built over the news in the buffer to speed up the subsequent filtering and linking steps. When a tweet arrives, it is preprocessed in the same way, and an efficient filtering algorithm (e.g. BM25 with a minimum threshold) decides whether the tweet should be added to the tweet candidate set of any news article in the buffer. Once the filtering module has processed a certain number of tweets, it outputs a set of news articles along with their tweet candidates. The more expensive linking algorithm (discussed below) then performs the linking and outputs the final results: news articles and their relevant tweets.
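The buffering and filtering stage can be illustrated with a minimal sketch. Everything named here (`News`, `NewsBuffer`, `overlap`) is hypothetical, timestamps are assumed to be in seconds, news is assumed to arrive in time order, and a plain token-overlap count stands in for the BM25 filter mentioned above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Set

DAY = 24 * 3600  # assumption: timestamps are seconds


@dataclass
class News:
    news_id: str
    tokens: Set[str]          # preprocessed (tokenized, stemmed) content
    published: float
    candidates: List[str] = field(default_factory=list)  # tweet candidate set


class NewsBuffer:
    """Keeps news for `days` days and collects tweet candidates per news."""

    def __init__(self, days: int,
                 score: Callable[[Set[str], Set[str]], float],
                 threshold: float) -> None:
        self.window = days * DAY
        self.score = score
        self.threshold = threshold
        self.items: Deque[News] = deque()

    def add_news(self, news: News) -> None:
        # Assumes news arrives in non-decreasing order of publication time.
        self.items.append(news)

    def expire(self, now: float) -> None:
        # Drop news that has been in the buffer longer than the window.
        while self.items and now - self.items[0].published > self.window:
            self.items.popleft()

    def add_tweet(self, tweet_id: str, tokens: Set[str], now: float) -> List[str]:
        """Adds the tweet to every matching news; returns the matched news ids."""
        self.expire(now)
        matched = []
        for news in self.items:
            if self.score(tokens, news.tokens) >= self.threshold:
                news.candidates.append(tweet_id)
                matched.append(news.news_id)
        return matched


def overlap(tweet_tokens: Set[str], news_tokens: Set[str]) -> float:
    # Stand-in filter score: number of shared tokens (not BM25).
    return float(len(tweet_tokens & news_tokens))
```

A linear scan over the buffer is used only for brevity; the indices described in the text would replace it in a real deployment.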

We use an SVM classifier for the final linking. For each pair of a news article and a tweet, a feature vector is extracted, and the SVM predicts whether the tweet is relevant to the news article. The most important features we use are as follows:

BM25.

BM25 computes a relevance score for a document and a query. In our case, we treat the news in the buffer as the document corpus and each tweet as a query.
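As a rough illustration of this feature, the sketch below scores a tweet (the query) against buffered news (the corpus) with a common Okapi BM25 formulation. The parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the class name are assumptions; the paper does not state which variant or settings were used.

```python
import math
from collections import Counter
from typing import List


class BM25:
    """Okapi BM25 over a small, non-empty in-memory corpus of token lists."""

    def __init__(self, docs: List[List[str]], k1: float = 1.2, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.tfs = [Counter(d) for d in docs]          # term frequencies per doc
        self.lengths = [len(d) for d in docs]
        self.avgdl = sum(self.lengths) / len(docs)     # average document length
        df = Counter(t for tf in self.tfs for t in tf)  # document frequencies
        n = len(docs)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0)
                    for t, c in df.items()}

    def score(self, query: List[str], doc_index: int) -> float:
        tf = self.tfs[doc_index]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[doc_index] / self.avgdl)
        total = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total


news = [["brazil", "nightclub", "fire"], ["election", "result", "brazil"]]
bm25 = BM25(news)
tweet = ["nightclub", "fire"]
scores = [bm25.score(tweet, i) for i in range(len(news))]
```

Here the tweet scores above zero only against the first news article, since it shares no terms with the second.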

Time.

For a news article and a tweet published at times $t_1$ and $t_2$ respectively, the time feature is computed as $1 / (t_2 - t_1 + 1)$. Note that we only consider tweets published after the news article, so $t_2 - t_1 > 0$.
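A direct transcription of the time feature might look as follows. The function name is hypothetical, and the time unit is an assumption left to the caller, since the text does not specify one.

```python
def time_feature(t_news: float, t_tweet: float) -> float:
    """Returns 1 / (t_tweet - t_news + 1); requires the tweet to come after the news."""
    delta = t_tweet - t_news
    if delta <= 0:
        # Only tweets published after the news are considered.
        raise ValueError("tweet must be published after the news")
    return 1.0 / (delta + 1.0)
```

The feature lies strictly between 0 and 1 and decays as the gap between the news article and the tweet grows.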

Named Entity.

We extract named entities and calculate a TF-IDF score for each of them. The named entity feature is computed as:

$$\max_{n \in NE(a) \cap NE(t)} \mathrm{tfidf}(n),$$

where NE(a) and NE(t) are the sets of named entities extracted from news article a and tweet t respectively.
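The named entity feature can be sketched as below. The function name and the precomputed `tfidf` mapping are hypothetical, and returning 0.0 when the two entity sets do not intersect is an assumption, since the text does not define the feature for that case.

```python
from typing import Mapping, Set


def named_entity_feature(news_entities: Set[str],
                         tweet_entities: Set[str],
                         tfidf: Mapping[str, float]) -> float:
    """Maximum TF-IDF score over entities shared by the news and the tweet."""
    shared = news_entities & tweet_entities
    # Assumption: no shared entity means a feature value of 0.0.
    return max((tfidf.get(e, 0.0) for e in shared), default=0.0)


tfidf_scores = {"Brazil": 0.4, "Santa Maria": 1.3, "Kiss": 0.9}
value = named_entity_feature({"Brazil", "Santa Maria", "Kiss"},
                             {"Brazil", "Santa Maria"},
                             tfidf_scores)
```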

Event Phrase.

We use a dependency parser (Note 4) to extract relations and noun phrases from news and tweets. Collectively, we call them event phrases since they can describe the essence of an event. Examples of extracted event phrases are shown in Fig. 2. We train another SVM classifier to generate a confidence score for each event phrase. The score indicates how well the event phrase describes a news article. For a news article and a tweet, the event phrase feature is calculated as $\max_{e \in EP(t)} \mathrm{confidence}(e, a)$, where EP(t) is the set of event phrases extracted from tweet t and a is the news article.
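The event phrase feature takes the same max-over-a-set form. In the sketch below, `toy_confidence` is a purely illustrative stand-in (word overlap) for the trained SVM confidence scorer, which is not reproduced here; the function names and the 0.0 default for a tweet without event phrases are assumptions.

```python
from typing import Callable, Iterable, Set


def event_phrase_feature(tweet_phrases: Iterable[str],
                         news_tokens: Set[str],
                         confidence: Callable[[str, Set[str]], float]) -> float:
    """Maximum confidence(e, a) over the event phrases e of the tweet."""
    # Assumption: a tweet with no event phrases gets a feature value of 0.0.
    return max((confidence(p, news_tokens) for p in tweet_phrases), default=0.0)


def toy_confidence(phrase: str, news_tokens: Set[str]) -> float:
    # Stand-in scorer: fraction of the phrase's words that occur in the news.
    words = phrase.lower().split()
    if not words:
        return 0.0
    return sum(w in news_tokens for w in words) / len(words)


news_tokens = {"fire", "sweeps", "nightclub", "brazil"}
feature = event_phrase_feature(["fire sweeps nightclub", "so sad"],
                               news_tokens, toy_confidence)
```

Passing the scorer as a callable keeps the feature computation independent of how confidence is actually learned.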

3 Experiments

We use a dataset derived from Guo’s dataset [1], which contains 12,704 news and 34,888 tweets. In the gold standard, a tweet and a news article are considered relevant if the tweet contains a URL pointing to the news article. URLs in the tweets are removed before conducting experiments.
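The gold-standard construction described above can be sketched as follows. The regular expression, the `url_to_news` mapping and both function names are hypothetical, and the sketch assumes tweet URLs are already in the same form as the article URLs (real tweets often use shortened links that would need resolving first).

```python
import re
from typing import Mapping, Set

URL_RE = re.compile(r"https?://\S+")


def gold_links(tweet_text: str, url_to_news: Mapping[str, str]) -> Set[str]:
    """Returns ids of news articles whose URL appears in the tweet."""
    links = set()
    for url in URL_RE.findall(tweet_text):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url in url_to_news:
            links.add(url_to_news[url])
    return links


def strip_urls(tweet_text: str) -> str:
    """Removes URLs so they cannot leak the label into the features."""
    return " ".join(URL_RE.sub(" ", tweet_text).split())


url_to_news = {"http://example.com/brazil-fire": "n1"}
text = "Terrible news http://example.com/brazil-fire. Thoughts with them"
labels = gold_links(text, url_to_news)
clean = strip_urls(text)
```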

Guo’s dataset does not contain the full content of news articles. Also, most of the news articles do not have any relevant tweets. Therefore, we identify the news articles with at least 20 relevant tweets and download their full contents. A small number of news articles are also removed because of download or parsing errors. The final dataset contains 381 news articles with full contents and all 34,888 tweets.

Some news articles in the dataset cover the same event and are very similar to each other. For example, “Scores Dead as Fire Sweeps Through Nightclub in Brazil” and “Hundreds killed in Brazil nightclub fire” are about the same accident. Therefore, we also conduct extra experiments on a clustered version of the dataset, which contains 240 news clusters.

We test a wide range of unsupervised approaches along with ours. The results are shown in Table 1. The unsupervised approaches include the model of Tsagkias et al. [3], which is based on a language model (LM); BM25 using news as the document corpus (BM25-news); BM25 using tweets as the document corpus (BM25-tweets); cosine similarity of TF-IDF word vectors; and the WTMF-G model [1].

For the unsupervised approaches, 5-fold cross-validation is used to determine the cut-off threshold that maximizes the $F_1$ score. Precision and recall are reported under the same threshold. For our supervised approaches, the same 5-fold cross-validation is used for training and testing.
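Choosing a cut-off threshold that maximizes $F_1$ can be sketched as below. The function names and the toy scores are illustrative only; in a cross-validation setting the threshold would be picked on the training folds and then applied to the held-out fold.

```python
from typing import List, Sequence, Tuple


def prf(scores: Sequence[float], labels: Sequence[int],
        threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 when predicting relevant iff score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Tries each observed score as a cut-off and keeps the one with the highest F1."""
    candidates: List[float] = sorted(set(scores))
    # On ties, max() keeps the first candidate, i.e. the lowest threshold.
    return max(candidates, key=lambda t: prf(scores, labels, t)[2])


train_scores = [0.9, 0.8, 0.4, 0.3, 0.1]   # toy relevance scores
train_labels = [1, 1, 0, 1, 0]             # 1 = relevant pair
cutoff = best_threshold(train_scores, train_labels)
precision, recall, f1 = prf(train_scores, train_labels, cutoff)
```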

As shown in Table 1, our approach “SVM with event phrase features” performs the best on both the unclustered and clustered versions of the dataset. Note that the model of Tsagkias et al. (LM) does not work well in a binary classification setting: the relevance scores it generates vary widely across news articles, so no single cut-off threshold is reasonable, and its reported metric values are very poor.

Table 1. Performance of different approaches

4 Demonstration

We build an online news service based on our system. After tweets are linked to news, we also use the relevant tweets to analyze the popularity and trend of each news article. Our news service can be accessed via our website, Android client or REST API. Screenshots of the website and Android client are shown in Fig. 3.

Notes

1. https://twitter.com/
2. https://www.facebook.com/
3. The most popular microblogging platform in China: https://weibo.com
4. http://www.cs.cmu.edu/~ark/TweetNLP/#tweeboparser_tweebank

References

1. Guo, W., Li, H., Ji, H., Diab, M.T.: Linking tweets to news: a framework to enrich short text data in social media. In: ACL, vol. 1, pp. 239–249. Citeseer (2013)
2. Kim, M., Newth, D., Christen, P.: Trends of news diffusion in social media based on crowd phenomena. In: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pp. 753–758. International World Wide Web Conferences Steering Committee (2014)
3. Tsagkias, M., de Rijke, M., Weerkamp, W.: Linking online news and social media. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 565–574. ACM (2011)
4. Yang, Z., Cai, K., Tang, J., Zhang, L., Su, Z., Li, J.: Social context summarization. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 255–264. ACM (2011)

Author information

Authors and Affiliations

Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Xiaojie Lin, Ye Gu & Rui Zhang
Renmin University of China, Beijing, China
Ju Fan

Corresponding author

Correspondence to Xiaojie Lin.

Editor information

Editors and Affiliations

Monash University, Clayton, Australia
Muhammad Aamir Cheema
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Wenjie Zhang
University of New South Wales, Sydney, New South Wales, Australia
Lijun Chang

Cite this paper

Lin, X., Gu, Y., Zhang, R., Fan, J. (2016). Linking News and Tweets. In: Cheema, M., Zhang, W., Chang, L. (eds) Databases Theory and Applications. ADC 2016. Lecture Notes in Computer Science, vol 9877. Springer, Cham. https://doi.org/10.1007/978-3-319-46922-5_41

DOI: https://doi.org/10.1007/978-3-319-46922-5_41
Published: 21 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46921-8
Online ISBN: 978-3-319-46922-5
eBook Packages: Computer Science, Computer Science (R0)
