1 Introduction

Microblogging services have become one of the prominent platforms for sharing, disseminating and consuming user generated content. In order to understand and exploit this information, microblogging services enable users to search for posts that contain topic phrases, and return the results sorted by metrics that consider recency and relevancy (Sharifi et al. 2013). To get a snapshot of what users are primarily saying about a particular topic, it is necessary to acquire a summary or gist of these posts as opposed to returning all of the posts that match the search criterion. Without effective data reduction or summarization mechanisms, users are often confronted with an overwhelming amount of replicated information, which makes it difficult for them to understand the essence of the topics and therefore, possibly miss valuable information. Applying summarization methods on microblogging services, e.g., Twitter, facilitates the generation of condensed summaries about a certain topic discussed by the users in real-time with less time and effort, which can be specifically advantageous for individuals, companies, agencies or institutions seeking public opinion. Therefore, it would be of great benefit if effective mechanisms can be developed for summarizing various aspects of a topic of interest on microblogs (Bian et al. 2015).

Microblog summarization can be viewed and formulated as a multi-document summarization (MDS) task (Jones 2007), which has been widely studied in information retrieval. MDS allows users to quickly capture the essential information contained in a large cluster of documents by producing a summary about a particular topic. In recent years, several MDS approaches ranging from cluster-based (Wan and Yang 2008) and graph-based (Mihalcea and Tarau 2005) to semantic-based approaches (Wang et al. 2008; Hennig and Labor 2009) have been proposed to analyze the information contained in a document set and extract highly salient sentences for generating a summary. While existing works in MDS are designed for well-organized documents, they could still be applicable to microblogs since they are based on simple frequency based methods (Erkan and Radev 2004a; Mihalcea and Tarau 2004). However, research shows that techniques for noise removal and extensive pre-processing on microblog content is required due to the informal, short and noisy nature of the content posted on microblogging services (Bian et al. 2015). Recently, a few efforts have been undertaken for microblog summarization (Sharifi et al. 2013; Bian et al. 2015; Inouye and Kalita 2011; Nichols et al. 2012; Lin et al. 2012). In this paper, we propose a novel summarization method based on sentiment and topical aspect analysis to generate a holistic summary for trending topics in microblogs. Specifically, the proposed method comprises of three stages. First, after pre-processing and semantically enriching microblog posts, we extract the topics and sentiments expressed by each post. Then, we build a sentiment-based Word Graph for each topic and cluster the graph to extract different aspects of the topic. Finally, we apply state-of-the-art summarization methods to summarize each topical aspect individually; and aggregate all aspect-level summaries to generate the holistic summary for each topic.

To the best of our knowledge, there have been limited work that consider sentiments, topics and aspects detection algorithm and semantic enrichment process of tweets in tandem for microblog summarization. As such the core contributions of our work can be enumerated as follows:

  • We propose a sentiment-based approach in our summarization method to automatically consider sentiments (positive or negative) of semantically enriched tweets when generating summaries. This is useful because it allows positive and negative feedback to be aggregated equally into the final generated summary.

  • In addition to sentiments, we propose to construct a word graph, inspired by KeyGraphs (Ohsawa et al. 1998), to automatically extract aspects of a topic to be used in the summarization process. Therefore, our work benefits from topic aspects and sentiments simultaneously when generating summaries.

  • We conduct experiments on an already available Twitter dataset presented by Abel et al. (2011) consisting of approximately 3M tweets. By comparing with several state of the art summarization methods, we show improved summarization performance by our proposed approach.

The rest of the paper is organized as follows: in the next section we review the related literature, followed by the presentation of the overview of our proposed approach. In Section 4, the details of our work are presented and Section 5 covers the experimental setup. We present our findings in Section 6 and the paper is finally concluded in Section 7.

2 Related work

Work in automated multi-document summarization has drawn much attention in the recent years. A number of algorithms have been developed to improve summarization as well as to perform summarization on new forms of documents such as Web pages (Sun et al. 2005), discussion forums and blogs (Zhou and Hovy 2006; Ku et al. 2006). General MDS methods can be separated into two categories: extractive and abstractive (Knight and Marcu 2002; Jing and McKeown 2000). Extractive summarization involves assigning saliency scores to sentences and paragraphs of the documents calculated by a set of predefined features such as term-frequency-inverse document frequency (TF-IDF), sentence or term position (Lin and Hovy 2002; Yih et al. 2007), and the number of keywords (Yih et al. 2007), and extracting those with the highest scores. Notable extractive MDS methods include SumBasic (Vanderwende et al. 2007) and centroid-based methods such as MEAD (Radev et al. 2004). The underlying premise behind SumBasic is that words which occur more frequently across documents have a higher probability of being selected for human created multi-document summaries over those words that occur less frequently. SumBasic tends to favour longer sentences, as they are more likely to have higher average probabilities leading to increased recall, as noted in Sharifi et al. (2013). On the other hand, MEAD (Radev et al. 2004), which is a centroid-based method, scores sentences based on sentence-level and inter-sentence features, including cluster centroids, position, TF-IDF, and other features. NeATS (Lin and Hovy 2002) uses sentence position, term frequency, topic signature and term clustering to select important content. Unlike MEAD and SumBasic, the Maximal Marginal Relevance (MMR) (Goldstein et al. 1999) method has been developed as an extractive method that decides which sentences should be removed to reduce the original document to a summary as opposed to selecting sentences to be included in the summary.

Abstractive summarization employs techniques from information fusion (Knight and Marcu 2002; Jing and McKeown 2000), rule based approach (Genest and Lapalme 2012); sentence compression (Knight and Marcu 2002); sentence merging based on semantics (Liu et al. 2015), and graph-based algorithms (Ganesan et al. 2010; Bhargava et al. 2016). The abstractive summarization framework proposed by Liu et al. (2015) parses input sentences to build Abstract Meaning Representation (AMR) graphs based on semantic representation of text. AMR graphs are then converted to a summary graph using a perceptron model prediction algorithm to select the subgraphs, which can be further used for summary generation. Ganesan et al. (2010) describes an approach that uses the original sentence word order to build directed graphs in order to generate abstractive summaries. Their technique leverages the graphical form of the input text to reduce redundancy. In Lloret and Palomar (2011), the authors propose a summarization approach which builds a directed weighted word graph in which each word is represented by a node in the graph and the edge indicates the adjacency relation between the words. The weight of the edge is calculated by a combination of their pagerank value and the frequency of the words. Important sentences are determined by selecting the first n words with the highest TF-IDF score. Sentence correctness is also ensured using basic rules of grammars.

Most recently, graph-based ranking methods have been proposed for MDS which rank sentences based on votes or recommendations. TextRank (Mihalcea and Tarau 2004) and LexPageRank (Erkan and Radev 2004b) use algorithms such as PageRank and HITS to rank important keywords of n-grams in a corpus. Thus, their summaries appear to be short snippets of the corpus. These methods first construct a graph representing the relationship between sentences and then evaluate the importance of each sentence based on the topology of the graph. LexRank uses Erkan and Radev (2004a), a modified cosine similarity function to construct an adjacency matrix with similarity values of two sentences. This matrix is treated as a markov chain, and an iterative algorithm is used to compute the stationary distribution. Each value of the stationary distribution represents a weight for the corresponding document, and the one with the highest weight is chosen to represent the summary. Mihalcea and Tarau (2005) also propose an algorithm based on PageRank, which exploits a meta-summarization process to summarize the meta-document generated by assembling all the single summaries of each document.

Some other methods have been designed that identify semantically important sentences for summary generation. Gong and Liu (2001) propose a method that uses Latent Semantic Analysis (LSA) to select highly ranked sentences for summarization; while the work in Haghighi and Vanderwende (2009) exploits a hierarchical LDA-style model to represent content specificity as a hierarchy of topic vocabulary distributions, based on which sentences are selected according to these distributions. Wang et al. (2008) have proposed a framework based on Sentence Level Semantic Analysis (SLSS) and Symmetric Non-negative Matrix Factorization (SNMF) to capture relationships between sentences in a semantic manner and factorize the similarity matrix to obtain meaningful groups of sentences. Other methods include NMF-based topic-specific summarization (Lin and Hovy 2003); Conditional Random Fields (CRF) based summarization (Lin and Hovy 2002); and Hidden Markov Model (HMM) based methods (Mani 2001).

These methods are specifically designed for formal texts and documents, thus applying them on a microblog dataset may not produce the best results (Bian et al. 2015). With the increasing interest in using microblog services to disseminate information, a few works have shifted their focus to process microblog data. Summarizing microblogs can be considered as an instance of extractive MDS which deals with informal documents derived from a wide range of users and topics. Most of the prior work on Twitter data summarization are about topic-level summarization. In Zhou et al. (2016), the authors present CMiner, an opinion mining system for Chinese microblogs. CMiner adopts an unsupervised label propagation algorithm to extract target opinions based on the assumption that similar messages are focused on similar opinion targets. On this basis, a summarization framework is proposed to generate opinion summaries for different opinion targets. CMiner clusters the extracted opinion targets based on different similarity measures and ranks both the opinion targets and microblog sentences based on the proposed co-ranking algorithm.

Chakrabarti and Punera (2011) have formalized the problem of tweet summarization for highly structured and recurring events. The authors discuss how events can be subdivided into smaller events using Hidden Markov Models (HMMs). They assert that HMMs are useful in detecting bursty events and are able to learn differences in language models of sub-events automatically. This method is useful when a training set can describe changes in events; otherwise, it is difficult to use this method effectively. Lin et al. (2012) adopt a graph optimization method to generate event storyline of an ongoing event from microblogs. Temporal information is utilized for event representation. This framework is only suitable for relatively long-term events, which makes it less effective for our task given the fact that the hot period of most social events to be summarized is usually very short. TwitInfo (Marcus et al. 2011) has been designed to aggregate tweets about a topic into visual summaries on peaks of high tweet activity and display the summaries on users’ timelines. Given a search query related to an event, the streaming algorithm identifies and labels event peaks; highlights important terms and tweets in the conversation; and provides an aggregative view of users’ sentiments. These visualizations must be interpreted by users and do not include sentence-level textual summaries. Carrillo-de Albornoz et al. (2016) propose a novel methodology for evaluating the performance of three representative summarization algorithms such as LexRank, Follower and Single Voting on generating summaries in the context of online reputation reports from Twitter. This work exploits the RepLab Dataset(Amigó et al. 2013) in which the tweets have been manually annotated by experts for relevancy, polarity, topic and priority. The empirical results indicate that incorporating priority signals improve the summarization task.

One of the significant contributions in microblog summarization is the work presented by Sharifi et al. (2010, 2013). In this paper, the authors have developed two multi-document summarization algorithms, which include: 1) a clustering based algorithm; and 2) a Hybrid TF-IDF algorithm, which is a direct extension of TF-IDF used in single-document summarization algorithms. They also propose a Phrase Reinforcement algorithm which uses a graph to represent overlapping phrases in a set of related microblog sentences and summarizes Twitter hot topics through finding the most commonly used phrases that encompass the topic phrase. Sharifi et al. evaluated the performance of their proposed algorithms in comparison with other notable summarizers, including MEAD (Radev et al. 2004), LexRank (Erkan and Radev 2004a), TextRank (Mihalcea and Tarau 2004) , and SumBasic (Vanderwende et al. 2007).

Our proposed method is a topic-level microblog summarization approach, which first semantically enriches the microblogs and constructs a Word Graph for each class of sentiment in each topic and then applies clustering algorithms on each graph for the purpose of topical aspect extraction. Different cluster-based multi-document summarization methods are explored and exploited to produce a final summary.

The idea of incorporating aspect-relevance and sentiment intensity of documents for the summarization purpose has already been considered in the literature. For instance, in (Xu et al. 2011), the authors propose an aspect-based summarization method which considers representativeness and diversity of online reviews in order to generate summaries with maximum coverage and minimum redundancy in terms of sub-aspects and their opinions. Furthermore, in Wu et al. (2016), the authors adopt a direction different from prior work on aspect extraction by directly mapping each review sentence into pre-defined aspect categories, assuming that users already know what aspects each product can have. This paper proposes two convolutional neural network-based approaches to 1) extract implicit mentions of an aspect in a review and assign them to a pre-defined set of aspects, and 2) predict the sentiment polarity of the input sentences. In Lin and Hovy (2003), the authors present a framework for identifying and extracting important pieces of information along with their sentiments in the document to form a summary. This approach categorizes the textual content in two subjective and objective categories and then aggregates them using specific rules. In Piryani et al. (2018), the authors present a summarization framework to generate aspect-wise extractive sentiment summary for textual reviews. However, this approach focuses on certain domains, e.g., laptop reviews, and identifies the set of aspects manually. In Hu et al. (2017), the authors present a model to classify the reviews from the Trip Advisor website into predefined aspects and then apply a topic modelling technique along with sentiment analysis on the classified reviews to identify hidden information which would be further exploited to generate summaries.

It should be noted that most existing approaches for aspect extraction and polarity classification for the summarization purpose are designed for formal texts and documents, which might not be precise enough or suitable for short texts such as microblog posts (Ling et al. 2008; Titov and McDonald 2008; Zhuang et al. 2006). Besides, many existing approaches have been built on the assumption of the existence of a pre-defined set of aspects and did not propose any aspect detection algorithm to automatically derive aspects from free-form texts.

We have chosen Sharifi’s proposed algorithms presented in Sharifi et al. (2013) as the baselines since they have shown strong performance in comparison to the state of the art summarization approaches for Twitter.

3 Approach overview

The objective of our work is to automatically generate a textual summary for any given topic present on Twitter. The flowchart of our proposed microblog summarization work is illustrated in Fig. 1.

Fig. 1
figure 1

Flowchart of our proposed microblog summarization method

As seen in the figure, there are three main stages in our proposed work. In the first stage, after preprocessing the textual content of microblog posts, we discover the active topics that are present in microblog posts using the LDA topic modelling approach (Blei et al. 2003) and then further employ sentiment analysis to derive the sentiment of each microblog post, accordingly. Once the topics are identified and the sentiment of each microblog post is determined, we group the microblog posts into several topic-sentiment clusters that contain posts related to similar topics and with similar sentiments. In the second stage, we create a Word Graph (WG) for each cluster of microblogs that share the same topic and sentiment in order to model the co-occurrence of the words within that cluster. Once WG is constructed, graph clustering is applied to identify the different topical aspects of the specific topic-sentiment cluster. The final stage re-assigns microblogs to the identified topical aspects in each WG and further exploits document clustering algorithms to summarize each topical aspect, accordingly. Finally, we aggregate all topical aspect-level summaries to derive a holistic summary for each topic .

4 Our proposed approach

In this section, we elaborate on the details of the proposed microblog summarization method including topic extraction, sentiment analysis, word graph construction and topical aspect extraction, and aspect level summary extraction as shown in Fig. 1.

4.1 Microblog topic and sentiment extraction

Given the core of our work is to summarize topics on Twitter, we assume that a collection of microblog posts denoted as \(M=\{ M_{1},.,M_{i},..,M_{|n|}\}\) are present that collectively form k topics represented by \(TP=\{tp_{1},..,tp_{k}\}\). By considering M, the set of microblog posts as a textual corpus, it is possible to extract a set of active topics TP using topic modelling techniques such as LDA. As proposed in Varga et al. (2014) and Saif et al. (2012), to obtain better topics without modifying the standard topic modelling methods, we enrich each microblog post \(M_{i}\) using an existing semantic annotator, i.e., TagMe (Ferragina and Scaiella 2010) and employ the extracted entities within the topic modeling process. This has shown to result in the reduction of noisy content within the topic detection process (Zarrinkalam et al. 2015; Zarrinkalam et al. 2016). Therefore, in our work, each microblog post is represented as a set of one or more semantic entities that collectively denote the underlying semantics of the microblog post. We view a topic, defined in Definition 1, as a distribution over entities.

Definition 1

(Topic): Let M be a microblog collection and \(E =\{e_{1},e_{2},..., e_{|E|}\}\) be a vocabulary of entities. A topic, tp, is defined to be a vector of weights, i.e., (\(g_{tp(e_{1})},...,g_{tp(e_{|E|)}}\)), where \(g_{tp(e_{i})}\) shows the participation score of term \(e_{i} \in E \) in forming topic tp. Collectively, \(TP = \{tp_{1},...tp_{K}\}\) denotes a set of K topics extracted from M.

To extract topics from microblogs using LDA, documents should naturally correspond to microblog posts. We treat each microblog post that has been enriched with entities as a single document and train LDA on all microblog posts M. LDA has two parameters to be inferred from the corpus of documents: document-topic distribution \(\theta \), and the K topic-term distribution \(\omega \). Given that each document corresponds to the microblog entities, by applying LDA over all microblog posts, K topic-entity distributions will be produced, where each topic entity distribution associated with a topic \(tp \in TP\) represents one active topic in M.

In the next step, given the set of enriched microblog posts with entities, we incorporate the semantic features of the microblogs into a Naive Bayes (NB) classifier using the semantic augmentation method (Saif et al. 2012; Go et al. 2009) to identify the sentiment of each microblog post in M. The assignment of a sentiment class c to a given microblog post \(M_{i}\) in a NB classifier can be computed as:

$$\begin{array}{@{}rcl@{}} \hat{c} &=& \operatorname{arg max}_{c \in C} P(c|e)\\ &=& \operatorname{arg max}_{c \in C} P(c) \underset{1 \le i \le N_{e}}{\prod}P(c|e) \end{array} $$
(1)

where \(N_{e}\) is the total number of entities in microblog \(M_{i}\), \(P(c)\) is the prior probability of a post appearing in class c, \(P(e|c)\) is the conditional probability of entity e occurring in a microblog post of class c .

In multinomial NB, \(P(c)\) can be estimated by \(P(c) = N_{c}/N\) where \(N_{c}\) is the number of microblog posts in class c and N is the total number of microblog posts. \(P(e|c)\) can be estimated using maximum likelihood with Laplace smoothing:

$$ P(e|c) = \frac{N(e,c) + 1} {{\sum}_{e^{\prime} \in V} N(e^{\prime}|c)) +|V| } $$
(2)

where \(N(e, c)\) is the occurrence frequency of word e in all training microblogs of class c and \(|V|\) is the number of entities in the vocabulary. In our work, we only consider polarity when determining sentiment and therefore, \(|c|= 3\); consisting of positive, neutral and negative classes.

4.2 Co-occurrence graph model for topical aspect extraction

In this section, we present the notion of Word Graph (WG), a co-occurrence graph model, to build a graph of entities for microblog posts (hereafter, tweets) for each class of sentiments derived for a particular topic. Our goal is to apply graph clustering algorithms on WG to discover different topical aspects of a topic, which will be further exploited to generate summaries for that topic. The idea of applying clustering algorithms on long texts (multi-documents) as well as short texts (microblogs) has been well-studied in the literature. Various approaches have been proposed to cluster similar tweets ranging from using hierarchical clustering methods (Abdullah and Hamdan 2015; Jashki et al. 2009), term-frequency analysis (Atefeh and Khreich 2015) to tweet attribute analysis such as favourite and re-tweet counts (Bild et al. 2015). These approaches apply clustering algorithms directly on tweets to induce different attributes and topical aspects of the topic for the summarization task. However, we argue that building a graph of entity co-occurrence for each class of sentiments of a particular topic prior to clustering would significantly enhance the summarization task for two reasons:

  1. 1.

    a graph representation captures different relations pertaining to node attributes (e.g., favorite counts in tweets) between directly connected nodes as well as hidden relations between indirectly connected nodes; and

  2. 2.

    as the number of tweets with positive and negative sentiments are imbalanced (i.e., consider the topic that is dominated by highly positive tweets and few negative ones), directly applying clustering algorithm on tweets may result in the domination of only positive tweets, and negative tweets most likely would be overlooked in the generated summary.

Let \(Mtp^{1}\) be a set of tweets related to topic \(tp_{1} \in TP=\{tp_{1},...,tp_{k}\}\), denoted as \(Mtp^{1}= \{Mtp^{1}_{sen_{p}} \cup Mtp^{1}_{sen_{n}} \cup Mtp^{1}_{sen_{o}} \}\) where \(Mtp^{1}_{sen_{p}}\) indicates the set of microblog posts with a positive sentiment; \(Mtp^{1}_{sen_{n}}\) indicates a set of microblogs with the negative sentiment; and \(Mtp^{1}_{sen_{o}}\) indicates the set of microblogs with a neutral sentiment. We construct a Word Graph (WG) for each sentiment of a particular topic based on Definition 2, as follows:

Definition 2

(Word Graph Representation): Given a collection of tweets such as \(Mtp\), a co-occurrence word graph can be defined as an undirected, weighted graph \(G = (V, E)\), where the set of vertices V correspond to the entities of the tweets and E is the set of edges that represent relationships among these entities. The weight of the edges is calculated based on the total number of times the two entities have co-occurred in different tweets and the number of times those tweets have been favorited (i.e., liked, re-tweeted) as follows,

$$ Weight_{E(e_{i}, e_{j})} =\sum\limits_{\forall M_{k} \in Mtp} Occurred(e_{i},e_{j},M_{k}) \times \sum\limits_{\forall M_{k} \in Mtp} Count(e_{i},e_{j},M_{k}) $$
(3)

such that:

$$ Occurred(e_{i},e_{j},M_{k}) = \left\{\begin{array}{ll} 1 & \text{ if } e_{i} , e_{j} \in M_{k} \\ 0 & \text{ otherwise} \end{array}\right. $$
(4)

and,

$$ Count\left( e_{i},e_{j},M_{k}\right) = \left\{\begin{array}{ll} 1 & \text{ if } e_{i} , e_{j} \in M_{k} \text{ and } M_{k}\text{ isFavourited} \\ 0 & \text{ otherwise} \end{array}\right. $$
(5)

This process will generate three word graphs per topic, i.e., one word graph per positive, neutral and negative aspect of each topic, respectively.

For example, consider the following two tweets for the topic, ”Costco”:

  • All the retailers are closed now, except Costco. [liked: 0]

  • Costco reports 2 percent profit after stock market closed. [Liked = 1]

4.2.1 Word graph clustering

A natural form of graph clustering is to partition the vertex set into disjoint subsets called clusters. One of the common requirements for graph clustering is modularity (Blondel et al. 2008; Newman 2006), which formalizes that the connections within graph clusters should be dense, and the connections between different graph clusters should be sparse.

In this paper, we have explored various well-known graph clustering algorithms for partitioning sparsely connected dense word subgraphs from each other. As indicated in Table 1, there is no single superior algorithm among existing work and their quality is dependent on the characteristic of the specific graph under study. Therefore, in our paper, we adopt three clustering algorithms that consider edge weights in their process: 1) InfoMap (Rosvall and Bergstrom 2008); 2) Newman’s Eigenvector Method (Newman 2006); and 3) Blondel’s Multi-level Clustering (Blondel et al. 2008). The overview of these algorithms is presented in Table 1.

Table 1 Summary of the graph clustering algorithms

We compare these three algorithms using two clustering metrics to measure the quality of the produced clusters, namely: 1) Variance of Information (VI) (Meila 2003), which measures the amount of information loss when changing from one clustering to another; and 2) Split-Joint distance (Dongen 2000), which is a non-commutative metric that measures the overlap between two different clusters. The reason why we exploit such metrics is because they make no assumptions about how the clusterings were generated and apply to both soft and hard clusterings (Meila 2003). The algorithm with the highest values for VI and Split-joint distance was selected as the representative graph clustering technique to induce topical aspects (e.g. See Table 2) in our experiments.

Table 2 Sample Aspects for the Tsunami Topic

4.3 Microblog summary generation

In this subsection, we explore how to utilize the discovered topical aspects of the constructed word graphs to facilitate the generation of the summary for a specific topic. We propose an approach for text summarization based on cluster-based multi-document summarization algorithms such that different sentiments regarding a single topic would be equally weighted and aggregated in the final summary. The key idea is to apply cluster-based multi-document summarization algorithms on every topical aspect of the constructed word graphs in order to group tweets into clusters; and then select the representative tweets from each cluster to generate the final summary for a specific topic. For this purpose, we first need to re-assign the set of tweets with the given topic to one or more of the topical aspects that we have extracted from the word graphs.

Definition 3

(Microblog and topical aspect similarity): Given a topical aspect (\(\mathcal {A}\)) and the entities in a microblog post m, we calculate the similarity of the microblog post to the topical aspect as follows:

$$ Sim(\mathcal{A},m) = \frac{{\sum}_{m_{i}\in m}{\sum}_{t_{j}\in\mathcal{A}} sim(m_{i},t_{j})}{|\mathcal{A}|\times|m|}. $$
(6)

where sim is a semantic relatedness measure that calculates the similarity of two semantic entities (Feng et al. 2017). We employ the similarity measure proposed in Ferragina and Scaiella (2010) for this purpose. A microblog m will be assigned to a topical aspect \(\mathcal {A}\) if their similarity score is larger than a threshold, \(\lambda \).

Now, given a collection of tweets that are assigned to the different topical aspects of a certain topic, we exploit various cluster-based algorithms to group microblog posts into clusters given each specific topical aspect of the word graphs. In this work, we have adopted two clustering algorithms that are widely-used for document summarization, namely, Agglomerative Clustering (Ackermann et al. 2014); and Bisect K-Means++ Clustering (Steinbach et al. 2000). These algorithms are described in more detail later in the paper. After grouping the microblogs into clusters within each topical aspect, in each cluster, we compute the score of a tweet to measure how important the tweet is to be included in the summary. We rank the tweets based on a tweet score calculated as follows:

$$ Score(m_{i},C_{k}) = \frac{1}{N_{C_{k}}-1} \sum\limits_{m_{j} \in C_{k} / m_{i}} sim(m_{i}, m_{j}) $$
(7)

where \(Score(m_{i},C_{k})\) measures the average similarity score between tweet m and all the other tweets in the cluster \(C_{k}\), and \(N_{C_{k}}\) is the number of tweets in \(C_{k}\).

Finally, given the set of word graphs built based on different classes of sentiments of a certain topic, we select k tweets with the highest score, calculated by (7), from each topical aspect of the word graphs to form the final summary.

5 Experimental setup

5.1 Dataset and pre-processing

Our experiments were conducted on the available Twitter dataset released by Abel et al. (2011). It consists of approximately 3M tweets sampled between November 1 and December 31, 2010. Given the fact that tweets are an informal way of communicating, they were preprocessed to remove spam and other noise features. We adopted the Datumbox Framework package and followed the steps proposed in Sharifi et al. (2013) to preprocess the tweets, as indicated in Table 3.

Table 3 Tweet Preprocessing Steps

5.2 Topic modeling and sentiment analysis on tweets

We have enriched the processed tweets with Wikipedia entities using the TagMe (Ferragina and Scaiella 2010) semantic annotator. To derive topics and their associated sentiments from the tweets, we adopted the Stanford Topic Modeling Toolbox (Ramage and Rosen 2011) to run LDA on tweets, and the Naive Bayes implementation provided in Datumbox official website http://www.datumbox.com/ to discover tweet sentiments in our work.

5.3 Word graph construction

In order to construct Word Graphs for tweets, we exploited Jung (Java Universal Network/Graph Framework)Footnote 1 API and extended it to meet our purposes. The graphs were ported to PajekFootnote 2 format and used in Python iGraph library for clustering. Each Word Graph was uniquely identified by the topic and the overall sentiment of positive, negative, or neutral. Thus, each topic had a Word Graph for each sentiment. Furthermore, we also investigated if a tweet had any replies associated with it. On Twitter, users can reply to tweets, favorite them, or re-tweet them. In many instances, replies may not have the same semantic annotation as the original tweet and, hence, will not be assigned to the same topic as the original tweet. In order to ensure that replies are also assigned to the same topic, we looked at the replyToId attribute of the tweet and linked the original tweet to it. So if the original tweet is assigned to a topic, automatically, all of its replies would be assigned to that topic as well. Each edge weight between the original tweet and the reply would be equal to the number of times the reply is favourited plus 1, consistent with the methodology of weights applied to co-occurrence of words within a tweet itself.

6 Evaluation

6.1 Outline

We have proposed a method for extracting summaries from a collection of tweets. The proposed method builds a sentiment-based word graph for each topic and clusters the word graph into k clusters each of which represents a specific aspect of the topic. Then, each topical aspect is summarized individually through document clustering. In the next subsection, we perform testing of various graph clustering approaches to select the best method for WG clustering. We then evaluate how the performance of existing document clustering algorithms are improved when they are used in our proposed method for summarization. More specifically, our goal is to determine if having word graphs to induce topical aspects of the topic, before summarization, will improve the quality of the generated summaries or not. An articulated example on the end-to-end process of our proposed mechanism in automatic summary generation is provided in the Appendix.

6.2 Selection of the clustering method

As indicated in Table 1, we have considered different clustering techniques to cluster our Word Graphs, namely, InfoMap, Newman and Blondel clustering algorithms. We compare these methods using two clustering metrics in order to identify the most suitable one for our experiments. Table 4 shows the comparative result of the three algorithms with respect to the Variance of Information (VI) metric. As is indicated, VI, which measures the amount of information loss when changing from one type of clustering to another, shows smaller values for Blondel compared to Newman and InfoMap, meaning that Blondel shows better clustering performance, given the current graph structure, compared to the other clustering approaches.

Table 4 Variance of information for the three clustering algorithms

We also measure the split-join-distance for these three clustering algorithms. The split-join-distance shown in Table 5 are higher for Blondel compared to Newman and InfoMap, meaning that the amount of overlap would be low if we change our clustering from Blondel to one of the other two clustering algorithms. Newman also showed a comparatively high score; however, since the VI metric is more stable for Blondel, we choose Blondel to cluster our Word Graphs in the rest of the experiments.

Table 5 Split-join-distance for the three clustering techniques

6.3 Summarization algorithms

We compare the results of our proposed microblog summarization method with baseline algorithms and well-known multi-document clustering algorithms. The detailed overview of the comparative algorithms is presented in Table 6. We have chosen the work presented in Sharifi et al. (2013) as our baseline, due to that fact that it has already shown better performance compared with other state of the art approaches such as Random Summarizer (Sharifi et al. 2013), Most recent summarizer (Sharifi et al. 2013), SumBasic (Vanderwende et al. 2007), MEAD (Radev et al. 2004), LexRank (Erkan and Radev 2004a) and TextRank (Mihalcea and Tarau 2004). Thus, our baselines include Bisect K-Means++ with Hybrid TF-IDF; Hybrid TF-IDF; as well as the Phrase Reinforcement algorithm.

Table 6 Description of the baseline algorithms

6.4 Evaluation process

To evaluate the efficiency of our proposed method in generating the summary for a tweet collection, we compare the performance of different multi-document summarization algorithms under two different conditions: 1) when adopting our proposed microblog summarization method; and 2) without adopting our proposed method. There are two types of workflows for evaluation. In the first workflow, we directly cluster the documents in k groups and find a representative tweet from each cluster to come up with a k sentence summary. In the second workflow, we first create Word Graphs to induce aspects of the topic and then cluster each topical aspect into k groups, and pick a representative tweet from each cluster to come up with a topical aspect summary. As described in Table 6, we exploit Agglomerative clustering, Bisect K-Means++ clustering; Hybrid TF-IDF algorithm, and Phrase Reinforcement algorithm (Sharifi et al. 2013) as our baseline comparative approaches for document summarization.

In the first workflow, the first two comparative approaches involve directly applying Agglomerative and Bisect K-Means++ algorithms on tweets without building any word graphs. These k clusters are then passed to a 1-means algorithm to determine the centroid which would be used as the representative tweet for that cluster. The other comparative approaches involve applying the above-mentioned algorithms with the exception of exploiting Hybrid TF-IDF approach instead of a 1-means algorithm. Sharifi et al concluded that Bisect K-Means++ combined with Hybrid TF-IDF is the best methodology for tweet summarization. The next comparative method is to directly apply Hybrid TF-IDF on tweets, which has shown the highest summarization performance with respect to F-measure according to Sharifi et al. In order to keep the documents from being too similar in content, Sharifi et al. (2013) conducted preliminary tests to determine the best cosine similarity threshold, which was reported to be 0.77.

In the the second workflow, the comparative approaches are aimed at evaluating the effectiveness of building word graphs in improving the quality of summarization. Thus, after building word graphs for each class of sentiment, we apply the five above-mentioned methods on the microblog posts related to each topical aspect of the word graph. The next baseline method is dedicated to exploiting the PR algorithm proposed by Sharifi et al. (2013) to obtain a one sentence summary phrase for each topical aspect, without applying any clustering algorithm on that topical aspect.

6.5 Manual summarization model (Gold Standard)

For building the gold standard, we adopted the approach proposed in Sharifi et al. (2013). Our manual multi-document summaries were created by four volunteers for six topics. Each topic had at least 80 positive tweets and 80 negative tweets. The choice for our sentiment analysis method was motivated by the findings reported in Saif et al. (2012). Each topic was analyzed twice by two separate individuals. So, every annotator analyzed three topics and provided one manual summary for each aspect. Annotators worked independently and did not share any information. An example of the selected topics and the number of tweets per topic is provided in the Appendix.

Volunteers had to perform summarization tasks for the first workflow (prior to Word Graph construction) as well as the second workflow (post-Word Graph construction). For evaluating summaries prior to Word Graph construction (first workflow), volunteers were given three sets of tweets for each topic: positive, negative, and neutral. For each set of tweets, they were required to group the tweets into clusters and then pick a representative tweet from each cluster to obtain a summary for that sentiment. Afterwards, the volunteers were asked to provide content scores for different algorithms to evaluate the automated summaries against the manual summaries in order to understand whether or not we are truly achieving human-comparable summaries.

For evaluating summaries after Word Graph construction (second workflow), volunteers were given three sets of tweets for each topic: positive, negative, and neutral. Each of these sets contained subsets of tweets corresponding to the topical aspects. For each set of tweets, volunteers were required to group the tweets into clusters and then pick a representative tweet from each cluster to obtain a summary for that sentiment. Afterwards, as suggested by Sharifi et al. (2013), the volunteers were asked to provide content scores for different algorithms by comparing their manual summary to the summary generated by the algorithms. Table 7 presents the average results of content score for algorithms after the word graph construction. As seen in the figure, the human evaluators believed that the proposed Agglomerative method produced the best summaries for the topics after the word graph was constructed. While the Agglomerative method was the best method from the perspective of the human evaluators, our proposed Agglomerative + Hybrid TF-IDF method showed competitive performance to the baseline Bisect K-Means++ / Hybrid TF-IDF method. We further show how the human evaluators’ perception of the quality of the summaries changed before and after the application of the word graph. As seen in Fig. 2, regardless of whether the method was proposed in this paper or by the baseline, the application of the word graph leads to higher quality topic summaries. We conclude that (1) employment of word graphs leads to higher quality summaries regardless of the method that is used and (2) our proposed Agglomerative method produces the most desirable summaries from the perspective of the human evaluators compared to both the baselines and the other variations of our proposed methods.

Table 7 Content scores for the algorithms after the application of the word graph
Fig. 2
figure 2

Content score comparison after word graphs construction

6.6 Evaluation metrics

In Saggion et al. (2010) and Louis and Nenkova (2009), the authors have mentioned different complex methods to evaluate automatically generated summaries. In this paper, we employ ROUGE, a widely-use summarization evaluation method (Lin and Hovy 2003) which automatically determines the similarity of a summary compared to human generated gold standards.

ROUGE-N is an N-gram recall metric computed as follows:

$$ ROUGE-N = \frac{{\sum}_{s \in MS} {\sum}_{n-gram \in s} match(n-gram)}{{\sum}_{s \in MS} {\sum}_{n-gram \in s} count(n-gram)} $$
(8)

where MS is the set of manual summaries; n is the length of n-grams, \(count(n-gram)\) is the number of n-grams in the manual summary; and \(match (n-grams)\) is the number of co-occurrences where an n-gram was found in both the manual summary and automated summary. Since the baseline (Sharifi et al. 2013) used the \(ROUGE-1\) metric, we will adopt the same for comparison to the baselines. The \(ROUGE-1\) metric can be modified to obtain precision of automatically generated summaries as follows:

$$ p = ROUGE-1^{\prime} = \frac{{\sum}_{m \in MS} {\sum}_{u \in m} match(u)}{|MS| \times {\sum}_{u \in a} count(u)} $$
(9)

where \(|MS|\) is the number of manual summaries; a is the automatically generated summary. The recall of the automated summaries can be also computed using related formulation of the ROUGE metric, as follows:

$$ r = ROUGE-1 = \frac{{\sum}_{m \in MS} {\sum}_{u \in m} match(u)}{{\sum}_{m \in MS} \times {\sum}_{u \in m} count(u)} $$
(10)

where u is the set of unigrams in a particular manual summary. Finally, \(F-measure\) which is a harmonic average of precision and recall is computed as follows:

$$ F-measure =\frac{2 \times p \times r}{p+r} $$
(11)

6.7 Evaluation results

In this subsection, we evaluate the effectiveness of our proposed work compared to several summarization approaches. Figure 3 shows the results of ROUGE based on clustering techniques prior to Word Graphs. We observe that Sharifi et al.’s baseline of Bisect K-Means++ with Hybrid TF-IDF was outperformed by Agglomerative clustering with 1-means pass. The standalone Bisect K-Means++ method, which picks out the centroids from each cluster with a 1-means pass also performed competitively. Agglomerative clustering technique with 1-means pass, which has not been evaluated in any previous work for microblog clustering and is proposed in this paper, outperforms all the other algorithms.

Fig. 3
figure 3

Performance of document clustering techniques prior to word graphs construction

Figure 4 shows the performance of the methods after word graph construction, with the addition of Phrase Reinforcement. We wanted to bring in the PR method in this experiment to observe the effectiveness of choosing the most weighted phrases. In many instances, tweets did not contain the topic phrases that existed in the concept titles of the topic. Hence, the performance of the PR algorithm was low.

Fig. 4
figure 4

Performance of document clustering techniques after word graphs construction

We notice that all algorithms performed better after Word Graph construction. The best overall summarizer was Bisect K-Means++ with Hybrid TF-IDF. The Hybrid TF-IDF algorithm, whether with or without Word Graphs, did not perform as well as expected. We also evaluate which summarization algorithm benefits the most from the word graphs. Figure 5 shows the results of the performance deltas. We can observe that Agglomerative clustering with Hybrid TF-IDF (F-Measure delta: + 0.255) benefits the most, followed by Bisect K-Means++ with Hybrid TF-IDF (F-Measure delta: + 0.251), and followed by Hybrid TF-IDF only ((F-Measure delta: + 0.249). We also noted that all techniques had positive deltas, which was desirable and points to the fact that when word graphs are used as proposed in this paper, a more focused set of tweets are considered when generating summaries.

Fig. 5
figure 5

Delta of the performance of document clustering techniques

Finally, we conclude that Agglomerative clustering technique with 1-means pass has been the better summarizer without the construction of Word Graphs. With the construction of Word Graphs, all summarization algorithms that we experimented had improved F-Measures compared to the baselines. In particular, the Agglomerative clustering technique with 1-means pass reported the highest performance improvement.

Our research findings can be summarized as follows:

  1. 1.

    The content score of Agglomerative clustering is higher prior and post Word Graph construction compared with other summarization algorithms, meaning that the automatically generated summaries by our proposed Agglomerative clustering based approach are better matched with the human perception of summarization quality.

  2. 2.

    The quality of summaries generated by Agglomerative clustering technique with 1-means pass outperforms other state-of-the-art approaches without the construction of Word Graphs.

  3. 3.

    Building a sentiment-based Word Graph to extract different aspects of the topic, as proposed in this paper, improves the performance of the summarization task. The F-Measures of all representative summarization algorithms are consistently higher after the construction of the Word Graphs.

  4. 4.

    Agglomerative clustering with Hybrid TF-IDF has the highest improvement with the construction of Word Graphs, with an F-Measure delta of + 0.255.

7 Concluding remarks

In this paper, we presented a summarization method based on sentiment topical aspect analysis to automatically generate a holistic textual summary for the topics present on Twitter. The proposed approach features the exploration of the intrinsic sentiments associated with the microblog posts as well as the analysis of the topical aspects for enhancing the summarization performance. In particular, we propose three major stages to accomplish summarization. First, we employed an approach for semantically enriching the microblog post and extracting the sentiment and the topic associated with each microblog post. Then, we create a Word Graph for each topic-sentiment cluster of microblog posts to identify different topical aspects of each specific cluster. Finally, we generate a holistic summary for each topic by applying different state-of-the-art summarization algorithms on each topical aspect of the Word Graph.

For the purpose of evaluation, we conducted a series of experiments on a Twitter dataset (Abel et al. 2011) to comparatively evaluate the performance of our proposed work against existing document clustering algorithms. Experimental results showed that inducing topical aspects with the word graphs will improve the quality of the generated summaries compared to the state of the art. We also comparatively evaluated the performance of our work against leading document clustering algorithms without the construction of the Word Graph. We found that our proposed Agglomerative clustering based approach with 1-means pass, which to the best of our knowledge has not been employed for the summarization task before and is one of the contributions of our paper, outperforms the others.

There are areas where our evaluations could be strengthened. There have been work in the document summarization literature that argue ROUGE-2 is a more reliable measure of summary quality compared to the ROUGE-1 metric. As such, we intend to perform additional evaluation in our future work to evaluate our work based on the ROUGE-2 metric. The primary reason for selecting ROUGE-1 in this paper has been motivated by this choice in the baseline paper. Additionally, performing more in-depth evaluation of the performance of each of the steps of our proposed method, in addition to the overall performance evaluation of the summarization technique, can provide insight as to which steps are causing the issues and are in fact the primary causes for errors in summarization. We also intend to undertake such evaluation as a part of our future work.

In addition and as a part of our future work, we intend to improve the proposed approach in the following ways. First, our summarization process does not currently consider the time of occurrence of the tweets. Motivated by Time-Aware Knowledge Extraction (TAKE) methodology presented in De Maio et al. (2016), we plan to extend our summarization process to incorporate the temporal evolution of tweets by identifying temporal peaks of tweets frequency through analyzing their timestamps. Second, it would be interesting to evaluate the performance of the proposed summarization process in the context of question-answering systems. Previous work on query-oriented summarization (Miao and Li 2010; Torres-Moreno et al. 2009; Biryukov et al. 2005) mostly aim to automatically extract information from documents. We intend to study how our proposed approach can be adopted in question-answering systems such that relevant and useful answers are generated from the corpus of tweets for a given query. Finally, in the first stage of our approach, we have performed tweet content wikification (Feng et al. 2018) by enriching each tweet using an existing semantic annotator (i.e., TagMe) which links entities to the pertinent Wikipedia articles. However, our work could benefit from various annotation systems such as Carrillo-de Albornoz et al. (2016), De Maio et al. (2014), and Hu et al. (2009) which use Wikipedia as a resource to support accurate algorithms for keyword extraction and word sense disambiguation. One of the possible venues for future work will be to conduct different experiments to compare the efficacy of each annotation algorithm in the context of the proposed summarization process.