1 Introduction

Microblogs data, the micro-length user-generated data posted on the web, such as tweets, online reviews, news comments, social media comments, and user check-ins, has become very popular in recent years. Because each item is so short, users can easily and quickly generate plenty of them every day. In fact, every day, over one billion users post more than four billion microblogs [104, 331] on Facebook and Twitter. Such tremendous amounts of user-generated data have rich content, e.g., news, updates on ongoing events, reviews, location information, language information, user information, and discussions on politics, products, and many other topics. This richness has motivated researchers and developers worldwide to take advantage of microblogs to support a wide variety of practical applications [227, 249], including public health [140, 272], disaster response [101, 144, 145, 156, 157, 161, 304], public safety [325], education [354], real-time news delivery [8], geo-targeted advertising [256], and several disciplines of academic research such as social science [330], information modeling [270], human dynamics [308], engagement in education [328], political science [329], behavioral science [335], and even medical research [135]. The distinctive nature of microblogs data, which combines large data sizes, high velocity, and short noisy text, has introduced new challenges and motivated researchers to develop numerous novel techniques to support microblogs data management, analysis, and visualization at scale.

Fig. 1 Microblogs literature timeline

This paper provides a comprehensive review of the major existing techniques and systems for microblogs data management since the inception of Twitter in 2006. The literature on microblogs is rich and spans several major research communities, e.g., data management, natural language processing, and information retrieval. However, this survey is addressed to the data management community, which provides scalable infrastructures for indexing and querying microblogs and incorporates them in data management systems to enable managing this data at scale. The paper includes three main parts. The first part reviews core indexing and query processing components of microblogs data management, including their query languages and associated main-memory management techniques. The second part focuses on major genres of data management systems that are either designed for microblogs data or equipped with infrastructures to manage fast and large data, the distinguishing characteristics of microblogs. The third part highlights major research topics that exploit data management infrastructures to build applications and analysis modules on top of microblogs, such as visual analysis, user analysis queries, and event detection. This part does not include other major research directions, e.g., natural language processing and information retrieval, as they are orthogonal to data management research and out of the scope of this paper. In fact, dedicated survey papers review parts of their literature [80, 117].

Figure 1 depicts a summary of the different parts and research topics covered in this survey in a timeline format. The horizontal axis in Fig. 1 represents the year of publication or system release for each technique/system, while the vertical axis represents the research topic. The techniques are then classified into three categories: (1) techniques that deal with real-time data, i.e., very recent data, depicted by a filled black circle, (2) techniques that deal with historical data, depicted by a blank circle, and (3) techniques that deal with both real-time and historical data, depicted by a blank triangle. As the vertical axis of Fig. 1 depicts, the paper is organized around three main parts, indexing and querying, systems, and data analysis, each of which is outlined below:

(1) Data indexing and querying: this part covers existing work on indexing and querying microblogs data, depicted in the first to third rows of Fig. 1, and includes the following three topics:

  • Query languages: this work provides generic query languages that support SQL-like queries on top of microblogs, offering basic operators and advanced functions to express a variety of queries on microblogs.

  • Indexing and query processing: this work includes various indexing techniques, and their associated query processing techniques, that have been proposed to index incoming microblogs either in main-memory [50, 51, 211, 223, 229, 305, 360] or on disk [60, 223]. This includes keyword search based on temporal ranking [51, 60], single-attribute search based on generic ranking functions [211], spatial-aware search that exploits location information in microblogs [229], personalized social-aware search that exploits the social graph and produces user-specific search results [205], and aggregate queries [50, 225, 305] that find trending keywords and correlated location-topic pairs instead of individual microblog items.

  • Main-memory management: this work includes techniques that optimize main-memory consumption and utilization. Most microblogs indexing techniques depend on main-memory to manage microblogs in real time. Thus, some techniques are equipped with main-memory management so that memory resources are efficiently utilized, either for aggregate queries [225] or for basic search queries that retrieve individual data items [224, 229].

(2) Data management systems: this part highlights the current state and the challenges of managing microblogs data through major types of big data systems [18, 21, 24, 26, 51, 223, 245, 315], depicted in the fourth to eighth rows of Fig. 1. Specifically, we give a briefing on system challenges and motivational case studies for providing system-level data management for microblogs. Then, we highlight the data management features related to managing microblogs in the following system genres:

  • Specialized systems: such as Twitter Earlybird [51, 245], Taghreed [223], and Kite [228], which are designed around the distinguishing characteristics of microblogs data and queries.

  • Big semi-structured data management systems: such as AsterixDB [18], a generic big data management system that supports various data sources. Recently, AsterixDB has extended its components to natively support fast data [121], e.g., microblogs, in the system. We review the fast data support in AsterixDB, which shows the current challenges of persisting fast data.

  • Fast data-optimized database systems: such as VoltDB [315], which is mainly optimized for database transactions on fast data, e.g., microblogs. We review the challenges of supporting transactional applications on fast data and solutions at the system level.

  • Fast batch processing systems: such as Apache Spark [24] and Apache Flink [21], which are optimized to support high-throughput applications on fast data via batch processing models. We discuss viable use cases as well as challenges and limitations of such systems in supporting efficient management for different microblogs applications.

  • Key-value stores: such as Apache Cassandra [20] and Redis [279], which store big datasets in key-value pairs. We discuss the adequacy of such systems for certain microblogs applications as well as their limitations in supporting other applications.

  • Hybrid system architectures: such as gluing a stream processing engine, e.g., Apache Storm [26], to a persistent data store, e.g., MongoDB [252]. We discuss the challenges of managing real-time data in such a setting, showing the need to consider data velocity inherently in different system components.

(3) Data analysis: this part covers the major types of analysis on microblogs data, depicted in the ninth to thirteenth rows of Fig. 1. The selected types of analysis are those that exploit the data management infrastructure to pose queries on massive numbers of microblogs and that are popular in the research community. This excludes both ad-hoc non-research applications, such as web applications that exploit microblogs data, and orthogonal research directions, such as linguistic analysis or information retrieval, which are too vast to cover here and whose literature is partially reviewed in dedicated surveys [80, 117]. This part includes the following five types of analysis:

  • Visual analysis: this work covers existing microblogs data visualization techniques that make use of the underlying scalable queries to enable visual analysis of massive numbers of microblog records. This work uses aggregate queries for aggregation-based visualization [93, 114, 284, 316], non-aggregate queries for sampling-based visualization [223, 294], or a combination of both [162, 236, 327].

  • User analysis: this work is mainly interested in querying user information for different purposes, such as identifying the top influential users in certain regions or topics [163, 223, 336] or discovering users with similar interests [34, 132]. Such users, or groups of users, can then be used in several scenarios, including posting ads and enhancing the social graph.

  • Event detection and analysis: this work exploits the fact that microblogs users post many updates on ongoing events. Such updates are queried, grouped, and analyzed to discover events in real time [2, 292] or to analyze long-term events [238, 327], e.g., revolutions.

  • Recommendation: this work exploits microblogs user-generated content as a means of capturing user preferences to support diverse recommendation tasks, such as recommending content to follow [14], real-time news to read [268], authority users to follow [53], products [384], or users who share similar interests [132].

  • Automatic geotagging: this work tries to attach geo-locations to microblogs that are not geotagged, based on analyzing their different attributes. It is mainly motivated by the small percentage of geotagged microblogs, e.g., less than 4% of tweets, in contrast with the needs of the many location-aware applications on top of microblogs, e.g., [2, 229, 256, 294].

Other sporadic analysis tasks have been addressed on microblogs data in both the research community, e.g., news extraction [268, 294] and topic extraction [143, 277], and the industrial community, e.g., geo-targeted advertising [256] and generic social media analysis [324, 382]. However, we outline the major analysis tasks that exploit the data management infrastructure and include a wide variety of research techniques, which are of interest to the data management research community.

The rest of this paper details each of the three parts highlighting existing challenges, innovations, and future opportunities in microblogs data management research. Section 2 gives details of the data indexing and querying part. Section 3 gives details of the data management systems part. Section 4 gives details of the data analysis part. Finally, Sect. 5 concludes the paper and discusses different open problems in microblogs research.

2 Microblogs data indexing and querying

This section gives a comprehensive review of data management techniques that support large-scale indexing and querying of microblogs data. We first introduce microblogs query languages, in Sect. 2.1, which enable high-level declarative interfaces for querying microblogs. Then, Sect. 2.2 reviews the core indexing and query processing techniques. Finally, Sect. 2.3 outlines the main-memory management techniques that are used in association with in-memory index structures.

2.1 Query languages

There have been a few attempts in the literature to standardize query languages tailored to the needs of microblogs and inspired by the SQL query language: TweeQL [237] and MQL [226, 228], each outlined below.

TweeQL [237] is a wrapper over the Twitter APIs that lets the user pose SQL-like queries on top of Twitter data, with the underlying query processing performed through the user's Twitter developer account. TweeQL supports Select-Project-Join-Aggregate queries, recognizing aggregation as a major part of querying microblogs in several applications, e.g., trend discovery. In addition, TweeQL provides two additional constructs. First, built-in filters for the three major microblog attributes: keywords, spatial, and temporal attributes. Second, user-defined functions that allow higher-level analysis of tweets, such as automatic geotagging and sentiment analysis.
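
To make this flavor concrete, the following minimal Python sketch mimics a TweeQL-style query combining built-in filters with a UDF; the function names, tweet fields, and the sentiment UDF are illustrative assumptions, not TweeQL's actual syntax or API.

```python
# Hypothetical stand-in for a TweeQL query such as:
#   SELECT text, sentiment(text) FROM twitter_stream
#   WHERE text CONTAINS 'match' AND time >= 50 AND location IN bbox;
def sentiment(text):
    # Stand-in UDF; a real TweeQL UDF would call a sentiment classifier.
    return "pos" if "great" in text else "neg"

def select_tweets(tweets, keyword, bbox, since):
    """Apply keyword, temporal, and spatial filters, then a per-tweet UDF."""
    west, south, east, north = bbox
    for t in tweets:
        if (keyword in t["text"] and t["time"] >= since
                and west <= t["lon"] <= east and south <= t["lat"] <= north):
            yield t["text"], sentiment(t["text"])

stream = [{"text": "great match", "time": 100, "lon": -87.6, "lat": 41.9}]
print(list(select_tweets(stream, "match", (-88, 41, -87, 42), since=50)))
```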

Unlike TweeQL, MQL [226, 228], which stands for Microblogs Query Language, is proposed as an inherent part of data management systems that support microblogs. MQL allows Select-Project-Join-Count queries, focusing on count as the only useful aggregate measure on microblogs. The major distinction of MQL is promoting the top-k and temporal aspects as mandatory in all queries, arguing that no practical microblog query can avoid these two aspects. Even if the user does not explicitly provide a top-k ranking function and a temporal horizon for the query, MQL beefs up the query with default values. In addition, MQL supports filtering data based on arbitrary attributes, including spatial boundaries and keywords, and continuous queries similar to traditional data streams [58, 125, 250, 343, 356, 358, 391].
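
The following minimal Python sketch illustrates this beef-up step under assumed defaults; the class, constants, and default values are hypothetical illustrations, not MQL's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_K = 100                # assumed default answer size
DEFAULT_HORIZON_HOURS = 24.0   # assumed default temporal horizon
DEFAULT_RANKING = "recency"    # assumed default top-k ranking function

@dataclass
class MQLQuery:
    keywords: List[str]
    k: Optional[int] = None
    ranking: Optional[str] = None
    horizon_hours: Optional[float] = None

def beef_up(q: MQLQuery) -> MQLQuery:
    """Every query leaves this step with explicit top-k and temporal aspects."""
    q.k = q.k if q.k is not None else DEFAULT_K
    q.ranking = q.ranking or DEFAULT_RANKING
    q.horizon_hours = q.horizon_hours or DEFAULT_HORIZON_HOURS
    return q

# A keywords-only query still executes as "top-100 by recency over last 24 h".
print(beef_up(MQLQuery(keywords=["world cup"])))
```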

Fig. 2 An overview of microblogs data management literature

Table 1 Summary of indexing and top-k querying of microblogs

2.2 Indexing and query processing

This section reviews indexing and query processing techniques that have been proposed to support large-scale querying on microblogs. Figure 2 depicts a high-level overview of this literature, classified based on both query type (Fig. 2a) and index type (Fig. 2b). Based on query type, existing techniques are classified into non-aggregate querying techniques (detailed in Sect. 2.2.1) and aggregate querying techniques (detailed in Sect. 2.2.2). Based on index type, microblogs are indexed using either tree-based indexing or hash-based indexing that could employ a single or multiple layers of hash-based indexes. Table 1 provides more details, summarizing these techniques in terms of the query attribute(s), index structure, index cell content order, and top-k ranking function. As the table shows, all existing queries on microblogs include both temporal and top-k aspects regardless of their other details. This is attributed to the nature of microblogs, which come in large numbers around the clock. This large number mandates retrieving the most useful k microblogs based on a certain top-k ranking function; otherwise, much useless data would be reported. In addition, being a kind of streaming data, microblogs are real-time by nature, and many users and applications are interested in recent ones. This has inspired almost all existing techniques to embed the time aspect by default in the query signature, unless it is disabled by the user. In fact, without the time aspect, a query might retrieve data from several years ago, which leads to a significant querying overhead. So, by disabling this default option, users become aware of the implications on querying performance when they consider data over long temporal periods.

A generic query signature that represents all queries in Table 1 is: “Find the top-k microblogs/keywords ranked based on a ranking function F.” In non-aggregate queries that retrieve individual microblogs, the ranking function F can be temporal [6, 51, 60], spatio-temporal [229, 230], significance-temporal [211], or socio-temporal [108], as shown in Table 1. In aggregate queries [50, 167, 225, 305], the temporal aspect is used as a filter for the queried data, and the ranking functions mostly depend on keyword counts and their derived measures, e.g., trendline slope, except for GeoScope, which employs a correlation measure.

Almost all microblogs indexing techniques are optimized for high digestion rates in a main-memory index for real-time data indexing, while secondary-storage indexes hold older data to serve queries on historical microblogs. The only exception is TI [60], which primarily uses a disk-based index. In addition, the query processing techniques are optimized for top-k and temporal queries. The rest of this section briefly outlines each technique shown in Table 1, for both non-aggregate and aggregate querying.

2.2.1 Non-aggregate indexing and querying

This section reviews non-aggregate querying techniques that “Find the top-k microblogs ranked based on a ranking function F” and retrieve individual microblog records.

TI [60] employs a disk-based inverted index structure where microblogs are sorted based on their timestamps. The main idea in TI is to defer indexing unimportant microblogs, reducing the number of microblogs that are indexed immediately so as to cope with the large number of incoming data records. To this end, it keeps in memory a set of recent and popular queries and topics. Then, it categorizes each incoming microblog and decides whether it should be indexed immediately or deferred. The categorization considers the microblog's recency, the user's PageRank, the popularity of the topic, and the textual relevance. The unindexed microblogs are written into a log file, and offline batch indexing is performed periodically to reduce real-time indexing latency. This is the first work to consider temporality in optimizing for microblogs. TI achieves higher indexing throughput than traditional techniques: indexing time ranges from 0.1 to 1 s based on index parameter settings, whereas for a traditional index the indexing time is a constant 1.6 s. Query processing time ranges from 30 to 90 ms as the number of involved microblogs increases with growing answer size, and query accuracy also increases, with a minimum of 90% across all settings. Nevertheless, TI uses a disk-based index, which cannot scale with high microblogs arrival rates; the following techniques use in-memory structures that can digest fast data rates while providing low query latency.
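
The following minimal sketch illustrates TI's immediate-versus-deferred decision; the thresholds and simplified scoring signals are assumptions, as the paper's categorization combines recency, user PageRank, topic popularity, and textual relevance in a more elaborate way.

```python
import time

HOT_TOPICS = {"election", "earthquake"}   # stands in for recent popular topics
DEFER_LOG = []                            # stands in for the on-disk log file

def should_index_now(mb, now):
    recent = now - mb["timestamp"] < 60                # assumed recency window
    popular = bool(HOT_TOPICS & set(mb["terms"]))      # matches a hot topic
    authority = mb["user_pagerank"] > 0.5              # assumed threshold
    return recent and (popular or authority)

def ingest(mb, index, now):
    if should_index_now(mb, now):
        for term in mb["terms"]:                 # index immediately
            index.setdefault(term, []).append(mb["id"])
    else:
        DEFER_LOG.append(mb)   # indexed later by the periodic offline batch

index = {}
now = time.time()
ingest({"id": 1, "timestamp": now, "terms": ["election"], "user_pagerank": 0.1},
       index, now)
print(index, len(DEFER_LOG))   # indexed immediately; nothing deferred yet
```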

Earlybird [51], the core retrieval engine that powers Twitter's real-time search service, is a distributed system where each node manages multiple inverted index segments to index keywords in real time. Incoming data first goes to a partitioner that divides tweets over nodes. In each node, ingested tweets first fill up the active segment before proceeding to the next one. Therefore, there is at most one index segment actively being modified, whereas the remaining segments are read-only. Each index segment is a traditional inverted index; however, the postings for each term are maintained in reverse chronological order. It is worth mentioning that Earlybird reduces the concurrency management overhead by adopting a single-writer multiple-readers model to eliminate any contention and race conditions. When a query comes, a blender receives it and determines which nodes should be accessed. Then, the query is posted to these nodes, and the partial answers are retrieved and compiled by the blender to return the final answer. The experimental evaluation shows that Earlybird achieves a 7000 tweets/s indexing rate at a latency of 180 ms.
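
The sketch below captures this per-node layout, a single writable segment in front of read-only ones, with postings traversed newest-first; the class names and toy capacity are assumptions, and real Earlybird segments hold millions of tweets with compressed postings.

```python
SEGMENT_CAPACITY = 4  # tiny for illustration; real segments are far larger

class Segment:
    def __init__(self):
        self.postings = {}   # term -> tweet ids, appended in arrival order
        self.count = 0

    def add(self, tweet_id, terms):
        for t in terms:
            self.postings.setdefault(t, []).append(tweet_id)
        self.count += 1

class EarlybirdNode:
    def __init__(self):
        self.segments = [Segment()]   # last segment is the only writable one

    def index(self, tweet_id, terms):
        if self.segments[-1].count >= SEGMENT_CAPACITY:
            self.segments.append(Segment())  # old segment becomes read-only
        self.segments[-1].add(tweet_id, terms)

    def search(self, term, k):
        hits = []
        for seg in reversed(self.segments):                 # newest segment first
            hits.extend(reversed(seg.postings.get(term, [])))  # newest tweets first
            if len(hits) >= k:
                break
        return hits[:k]

node = EarlybirdNode()
for i in range(6):
    node.index(tweet_id=i, terms=["goal"])
print(node.search("goal", k=3))   # -> [5, 4, 3], reverse chronological
```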

ContexEven [6] also supports keyword search queries on real-time microblogs, aiming to find real-time content of the top-k relevant events. It defines the event context with a set of keywords and organizes incoming data in an inverted index based on these keywords. Each index entry maintains a list of event ids that correspond to a certain keyword, ordered by a hybrid score that combines popularity and time recency, while an event is represented with a temporal tree that shows the chronological order of data within the same event [7]. To cope with high-velocity data, each index entry divides its posting list into buckets of exponentially growing sizes to reduce the insertion overhead in real time. In addition, ContexEven adopts a lazy update strategy for the index that defers updating the event id order until the event is moved to the \((2 \times k)\)th position, sacrificing slight query accuracy for real-time efficiency. The query processor then iterates over all index entries that correspond to the query keywords and aggregates the final event ranking scores from all entries to return the final top-k events.

MIL [52] is another event-based real-time search system that employs a multi-layer inverted index organizing event data based on keywords. The index has m layers, and each layer maintains a separate inverted index. The index key at the \(i\)th layer is a set of i keywords that co-occur in certain events, while the posting list stores the ids of the events that correspond to these keywords. So, a layer-1 key has a single keyword, a layer-2 key has a pair of co-occurring keywords, and so on. A new microblog is inserted into all layers that correspond to the different combinations of its keywords. Incoming queries also access all index layers to perform a nearest neighbor search based on cosine similarity. Experiments show that MIL outperforms variants of its competitor IL in search time, pruning power, and index update time. MIL search time is below 2 ms across different data sizes and query lengths, while pruning power is nearly constant at almost 1. Index update time is less than 0.1 ms for up to 10 million records.
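
The following sketch shows the layered insertion idea with two layers; the event representation and the layer count are simplified assumptions.

```python
from itertools import combinations

M_LAYERS = 2   # assumed layer count; layer i is keyed by i co-occurring keywords

def insert_event(layers, event_id, keywords):
    """Insert an event id under every keyword combination of size 1..M_LAYERS."""
    for i in range(1, M_LAYERS + 1):
        for combo in combinations(sorted(keywords), i):
            layers[i - 1].setdefault(combo, []).append(event_id)

layers = [dict() for _ in range(M_LAYERS)]
insert_event(layers, event_id=7, keywords={"flood", "rescue", "houston"})
# Layer 1 keys: ('flood',), ('houston',), ('rescue',)
# Layer 2 keys: ('flood','houston'), ('flood','rescue'), ('houston','rescue')
print(sorted(layers[1]))
```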

Spatiotemporal ranking functions are exemplified by Mercury [229] and its successor Venus [230]. Mercury employs a partial quad-tree structure, where each cell contains a list of microblogs that have arrived within the cell boundary in the last T time units, ordered chronologically. As traditional data insertion, expiration, and index restructuring are very inefficient for real-time data, Mercury employs bulk data insertion, speculative index cell splitting, piggybacked deletion, and lazy cell merging to significantly reduce the overall indexing overhead and scale for real-time microblogs. The bulk insertion buffers incoming data and inserts it every t seconds, where t is 1–2 s, so that the different index levels are navigated once for several thousands of microblogs. In addition, deletion and index restructuring operations are piggybacked on the insertion navigation. For index restructuring, a cell split is performed if and only if the cell exceeds its maximum capacity and the microblogs in that cell would span at least two quadrants of the quad-tree node. Cell merging is deferred until at least three out of the four quadrant siblings are empty, to reduce redundant splits and merges in real time. Query processing in Mercury has two phases, namely an initialization phase and a pruning phase. In the first phase, cells lying within the query range are ordered based on a spatiotemporal proximity score, and microblogs are retrieved from these cells based on their score. The pruning phase tightens the original search boundaries so that microblogs outside the new boundaries are pruned early. This significantly reduces the total number of processed microblogs needed to get the final answer. Experimental results show that Venus supports arrival rates of up to 64,000 microblogs/s with an average query latency of 4 ms.
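
The two predicates below sketch Mercury's speculative splitting and lazy merging rules in isolation; the flat cell representation and capacity are assumptions, as Mercury maintains a full partial quad-tree with bulk insertion and piggybacked cleanup around these rules.

```python
MAX_CAPACITY = 4   # toy cell capacity

def quadrant(point, cell):
    """0..3 quadrant id of an (x, y) point inside cell = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = cell
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return (point[0] >= mx) + 2 * (point[1] >= my)

def should_split(points, cell):
    # Speculative split: only when over capacity AND the data would span at
    # least two quadrants, avoiding splits that leave everything in one child.
    if len(points) <= MAX_CAPACITY:
        return False
    return len({quadrant(p, cell) for p in points}) >= 2

def should_merge(sibling_counts):
    # Lazy merge: defer until at least three of the four siblings are empty.
    return sum(1 for c in sibling_counts if c == 0) >= 3

cell = (0.0, 0.0, 1.0, 1.0)
pts = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.8, 0.1), (0.3, 0.4)]
print(should_split(pts, cell))      # True: 5 points spanning 3 quadrants
print(should_merge([0, 0, 0, 2]))   # True: three empty siblings
```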

LSII [211] supports top-k queries based on combining three ranking scores for a microblog: its significance, its keyword similarity with the query, and its temporal freshness. A microblog is more significant if it is posted, for example, by an authority user or has high popularity with a large number of forwards and replies. High keyword similarity indicates high relevance to the query, and freshness measures the temporal recency of the microblog. LSII consists of a sequence of m inverted indexes where each index \(I_i\) is double the size of its predecessor \(I_{i-1}\). The first index \(I_0\) is a read–write structure to which new data is appended, and the microblog list of each keyword is ordered chronologically. The indexes \(I_1\) to \(I_{m-1}\) are read-only, and each keyword there has three microblog lists, one sorted by each of the three ranking scores. The small size and simple organization of \(I_0\) enable high digestion rates of real-time data, while the three sorted lists of the subsequent indexes enable efficient query processing. When the size of index \(I_{i-1}\) reaches a certain threshold, a merge operation with index \(I_i\) is triggered and index \(I_{i-1}\) is flushed. To process an incoming query, LSII first scans \(I_0\) to get the initial set of top-k microblogs; then it proceeds to scan the other indexes. If the upper bound of index \(I_i\) is no more than the scores of the top-k candidates, traversal of that index is skipped and the search proceeds to the next one. Since each index \(I_i\) is less recent than index \(I_{i-1}\), the search is more likely to get pruned at earlier indexes, since they have higher freshness scores. Extensions to LSII include personalized search, where a user is only interested in microblogs from specific users. The performance of LSII is compared to an append-only approach and a Triple-Posting-List approach: LSII's query processing time is lower both when varying the number of microblogs a query asks for and when varying the number of queries in a mixed stream of queries and updates. The query time is between 1 and 10 s for a varying number of microblogs asked by a query and grows linearly from this range with an increasing number of queries. The total processing time is almost 10 s and does not vary when changing the weights of the ranking function.
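
The sketch below shows the cascade of doubling indexes with toy capacities; it keeps each index as a single time-ordered list, whereas real LSII maintains, per keyword in the read-only indexes, three lists sorted by significance, similarity, and freshness.

```python
BASE_CAPACITY = 4   # toy capacity of I0; each subsequent index doubles it

class LSII:
    def __init__(self, m=3):
        self.indexes = [[] for _ in range(m)]                  # I0 .. I(m-1)
        self.caps = [BASE_CAPACITY * (2 ** i) for i in range(m)]

    def insert(self, microblog):
        self.indexes[0].append(microblog)       # append to read-write I0
        i = 0
        # When Ii reaches capacity, merge it into I(i+1) and flush Ii;
        # the merge may cascade into later, larger indexes.
        while len(self.indexes[i]) >= self.caps[i] and i + 1 < len(self.indexes):
            self.indexes[i + 1].extend(self.indexes[i])
            self.indexes[i] = []
            i += 1

lsii = LSII()
for i in range(12):
    lsii.insert(i)
print([len(ix) for ix in lsii.indexes])   # [0, 4, 8] after cascading merges
```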

Another type of ranking considers the social relevance as well as the textual relevance along with microblog freshness. A 3-D inverted index structure is proposed in [205], where each index cell is a three-dimensional data cube with a dimension for term frequency, a dimension for social relevance, and a dimension for time freshness. Each dimension is partitioned into intervals; the social graph is partitioned with k-way partitioning using a minimum-cut utility. The time and textual dimensions are sorted at indexing time, whereas the social dimension is sorted at query time. Data is first partitioned by time; then, the cubes in each time interval are indexed with a B+-tree to avoid maintaining many empty cubes. New data records are added to the last time interval. When the size of data in the latest time interval exceeds a threshold, the interval is closed and a new time interval is introduced. For query processing, cubes are first sorted by their estimated total score. Then, the query processor iterates over neighboring cubes and gets the actual scores for microblogs. When the existing top-k records are more relevant than the next unseen cube, query processing terminates and prunes all remaining data cubes to ensure efficient query latency. The 3-D index outperforms both time pruning and frequency pruning, the two state-of-the-art techniques, with an average of 4–8x speedup for most parameter settings.

Proven [360] optimizes keyword search on microblogs for a unique similarity measure that depends on data provenance, measured through microblog content such as hashtags, URLs, and keywords. Incoming microblogs are grouped into bundles based on their provenance similarities and ordered based on their temporal evolution. An inverted index organizes the bundles, which are continuously updated with incoming microblogs. The inverted index has provenance elements, such as hashtags, URLs, and keywords, as index keys and bundles as values. Through this index, incoming queries retrieve whole connected bundles of microblogs, which improves the relevance of search results.

RT-SocialMedia [111] proposes a generic index structure for a generic query function that can be extended to support temporal, spatial, and/or social aspects. It uses an inverted index structure with a space-partitioning strategy in which the document ids are partitioned into intervals, and each interval partitions documents into blocks based on keywords. To facilitate top-k retrieval, metadata is stored within each block, including an interval id, a maximum score, and a bitmap signature that determines which documents are present in the block. The maximum score is an upper bound for all documents in the block, so if the current kth score exceeds this bound, the block is safely pruned. The signature field also provides a tighter bound that speeds up the search, as absent documents are not included in the upper-bound value. Documents are sorted in the inverted index by document id, so newer documents are appended to the end of the list to naturally support the temporal aspect. To support the spatial aspect, the index is extended with a uniform grid where each cell stores the interval ids present in that cell, which helps to skip documents that do not appear in the query cell. To support the social aspect, the index metadata is extended with a friendship bitmap, which helps to quickly determine whether a user is a friend of another user. Experiments show that RT-SocialMedia achieves better query latency than its competitors in keyword search and spatial-keyword search, and better query latency in most cases compared to LSII [211] in temporal keyword search.
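
The core block-level max-score pruning idea, also shared by Judicious below, can be sketched as follows; the block layout and scores are illustrative assumptions.

```python
import heapq

def topk_with_block_pruning(blocks, k):
    """blocks: list of (max_score, [(score, doc_id), ...]) for one query term."""
    heap = []  # min-heap of (score, doc_id), holds the current top-k
    for max_score, docs in sorted(blocks, key=lambda b: -b[0]):
        if len(heap) == k and max_score <= heap[0][0]:
            break                      # safe prune: bound <= current kth score
        for entry in docs:
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry[0] > heap[0][0]:
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)

blocks = [(9.0, [(9.0, "d1"), (4.0, "d2")]),
          (3.0, [(3.0, "d3")]),        # pruned once the running top-2 is {9, 8}
          (8.0, [(8.0, "d4")])]
print(topk_with_block_pruning(blocks, k=2))   # [(9.0, 'd1'), (8.0, 'd4')]
```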

Judicious [372] is the only microblog querying technique that does not consider temporality in its indexing. It offers a compact inverted index structure that treats rare terms, which are not frequently present in the data, differently from common terms, which are. For rare terms, a traditional inverted index is used. For common terms, a compact inverted index is proposed that uses block partitioning schemes, where microblogs are hashed into intervals and each interval is stored in a block with a maximum score as block metadata to facilitate early pruning during query processing. Thus, whole blocks are pruned if their maximum scores are not within the current query upper bound. Incoming queries have two types: singular queries that ask for one type of term, either rare or common, and mixed queries that ask for both rare and common terms. Singular queries are answered from their corresponding index. In mixed queries, the rare-term lists are retrieved first and used as fancy lists that tighten the query upper-bound score and speed up pruning the search space. Experiments have shown that Judicious achieves 2–3 times query speedup over the state-of-the-art approaches with a much smaller index size. For the same dataset, Judicious maintains an index of 35 GB, whereas the competitors BM-OPT and BMW-LB-PB maintain indexes of 49 GB and 50 GB, respectively. Average response time on TREC queries ranges from 9 to 130 ms with an increasing number of keywords in Judicious, whereas it ranges from 25 to 290 ms in the other two techniques. With increasing answer size, Judicious' average response time ranges from 21 to 30 ms, whereas the other two techniques range from 70 to 110 ms.

2.2.2 Aggregate indexing and querying

This section reviews aggregate querying techniques that “Find the top-k keywords ranked based on a ranking function F.” These techniques retrieve keywords, rather than individual microblog records, ranked based on aggregate information, e.g., frequency or frequency growth over time.

AFIA [305] retrieves the top-k frequent keywords that lie within any arbitrary spatial range and temporal interval. To support this at scale, AFIA maintains in main-memory a set of spatial grid indexes at different spatial and temporal granularities. Each grid cell keeps a summary of the top-k keywords that lie within its spatial and temporal ranges, using a modified version of the SpaceSaving algorithm [243]. At query time, the query range is mapped to the corresponding grid cells, and summaries from all cells are merged to get the top-k keywords for the query spatiotemporal range. Despite using the SpaceSaving algorithm, which has a small memory footprint, AFIA consumes significant memory resources when supporting fine spatial and temporal granularities, as shown in [167, 225], due to maintaining a huge number of summaries without supporting deletions or data expiration.
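
For reference, a compact version of the SpaceSaving algorithm [243] that such per-cell summaries build on is shown below: track at most m counters, and when a new keyword arrives while all counters are taken, evict the minimum counter and inherit its count.

```python
def space_saving(stream, m):
    counters = {}
    for keyword in stream:
        if keyword in counters:
            counters[keyword] += 1
        elif len(counters) < m:
            counters[keyword] = 1
        else:
            victim = min(counters, key=counters.get)
            # New keyword inherits the evicted minimum count plus one, so the
            # stored count is an upper bound on the true frequency.
            counters[keyword] = counters.pop(victim) + 1
    return counters

summary = space_saving(["goal", "goal", "rain", "goal", "vote", "rain"], m=2)
print(summary)   # frequent keywords survive with (over)estimated counts
```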

Unlike AFIA, GeoTrend [225] limits its search scope to recent microblogs and retrieves the top-k trending keywords that lie within any arbitrary spatial range during the last T time units. GeoTrend accommodates various trending measures, including keyword frequency and trendline slope, which gauges the keyword frequency growth over time. To support this efficiently, GeoTrend maintains a partial quad-tree structure where each cell contains aggregate information about keywords that arrive within its spatial boundaries. A list of top-k keywords is materialized in each cell at indexing time. At query time, GeoTrend first gets the local top-k trending keywords within the cells that intersect the query boundaries. Then, to get the global top-k trending keywords, the global trending value of each keyword is aggregated from the local values using Fagin's algorithm [105], and the final top-k keywords are returned. Experimental evaluation shows that GeoTrend supports arrival rates of up to 50K microblogs/s, an average query latency of 3 ms, and 90% query accuracy under limited memory resources.
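
As a worked example of the trendline-slope measure, the sketch below fits a least-squares line to a keyword's per-interval counts; the interval layout is an assumption, and GeoTrend maintains such counts incrementally per quad-tree cell.

```python
def trendline_slope(counts):
    """Least-squares slope of counts c_0..c_{n-1} against time steps 0..n-1."""
    n = len(counts)
    mean_x, mean_y = (n - 1) / 2, sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

print(trendline_slope([2, 3, 8, 21]))     # 6.2: rising keyword, large slope
print(trendline_slope([20, 19, 18, 17]))  # -1.0: fading keyword, negative slope
```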

GARNET [167] generalizes trend discovery to any arbitrary user-defined context instead of being limited to the spatial space. Specifically, GARNET finds the top-k trending keywords within: (a) a d-dimensional context defined on arbitrary d microblog attributes, and (b) an arbitrary time interval. For example, it could find trending keywords that are posted by teenagers in Spanish during July 2018. In this example, the context is two-dimensional and defined over the age and language attributes. Each contextual attribute is divided into a set of discrete values or disjoint intervals, e.g., the age attribute can be divided into child, teenager, and elder, while the language attribute can be categorized into English, Spanish, French, and Others. Then, a d-dimensional grid index is employed to map incoming data to the corresponding context grid cells. An in-memory grid index is maintained for recent data, and an on-disk grid index is maintained for historical data. Each in-memory grid cell maintains a list of the top-k trending keywords over the last T time units, while each on-disk grid cell maintains a temporal tree of top-k trending keywords for multiple temporal granularities over extended periods. At query time, top-k keywords are aggregated from the corresponding grid cells and a final top-k list is compiled in a similar way to [225]. An experimental evaluation was conducted to show index scalability and query performance with different numbers of grid cells. The comparison with AFIA [305] has shown the superiority of GARNET. GARNET's in-memory insertion time is below 400 ms for rates up to 24,000 microblogs/s and reaches up to 1 s for higher rates. For varying grid cells, query latency ranges from 0.1 to 1 ms for both frequent and trending queries. The naive scanning alternative is not a competitor and increases query latency up to 1 s.

Unlike all other techniques, GeoScope [50] measures localized trending topics based on the correlation between topics and a predefined set of locations, e.g., a list of cities. The main idea of GeoScope is to discover localized trending topics rather than topics that are popular all over the space. For example, a presidential election campaign is trending in many cities all over the country, while a city council election campaign is trending only within a specific city. To this end, GeoScope limits the monitored locations to the \(\theta \)-frequent locations, keeps track of topics that are \(\phi \)-frequent in at least one location, and then tracks only the \(\psi \)-frequent locations of each such topic. GeoScope has two main data structures: a Location-StreamSummary-Table, which maintains the top frequent topics for each location, and a Topic-StreamSummary-Table, which maintains the top frequent locations for each topic. At query time, this aggregate information is processed to retrieve topics that are correlated only to the query location, distinguishing them from topics that are popular in all locations. Experiments show that GeoScope consumes an almost constant amount of memory and processing time with increasing window size. Also, it reports perfect recall and near-perfect precision.

2.3 Main-memory management

All major microblogs indexing techniques store data in main-memory to support real-time indexing of fast data and provide low query response times. However, with the rapid increase in the number of microblogs, it is infeasible to store all data in main-memory for extended periods. At a certain point, the available memory becomes full and part of the memory content has to be moved to a secondary-storage structure to free up memory resources for incoming microblogs. To this end, different indexing techniques use, implicitly or explicitly, flushing policies that decide which microblogs to flush from main-memory to secondary storage. Although the problem of selecting memory content to evict has been studied before for buffer management in database systems [97], anti-caching in main-memory databases [85, 197, 374], and load shedding in data stream management systems [33, 112, 138], flushing in microblogs data management is different in terms of the optimization goals and the anticipated real-time overhead, as detailed in [224]. This section reviews the major flushing policies proposed in the literature to manage main-memory for microblogs data management.

Many of the major microblogs indexing techniques implicitly depend on temporal-based flushing [51, 60, 108, 211, 305], where a chunk of the oldest data is flushed to disk to free up memory resources. The main intuition behind this simple policy is that: (a) recent microblogs are more important than old microblogs in several applications, and (b) incoming data, in these techniques, is indexed and ordered based on temporal recency, so flushing the oldest data incurs very limited overhead in real time. This intuition is correct in a practical sense and gives temporal flushing its major advantage, a low overhead in real-time environments, so its invocation does not limit system scalability. However, it has a major limitation that affects both main-memory utilization and query latency: it underutilizes memory resources, storing \(\sim \) 70% of memory data that is never reported to any incoming query, as detailed in [224]. The main reason is that flushing decisions depend solely on data recency without accounting for what is actually needed by incoming queries. Subsequent techniques in the literature have addressed this limitation for different types of queries, as outlined below. The main objective of all these techniques is better utilization of main-memory resources, as useless data is evicted and useful data accumulates in main-memory. This increases the memory hit ratio, so more queries are answered from in-memory content without accessing disk content.

Mercury [229] and its successor Venus [230] provide flushing policies that decide on evicting non-aggregate data, i.e., individual microblogs. The flushing policy is optimized for top-k spatiotemporal queries that retrieve microblogs from a spatial boundary R and a temporal interval of the last T time units. By default, each index cell stores data from the last T time units. Mercury's flushing policies provide two tighter time bounds, \(T_c\) and \(T_{c,\beta }\), both no greater than T, where any data record outside \(T_c\) or \(T_{c,\beta }\) can be flushed to disk. The main observation behind these tighter bounds is that highly populated areas, e.g., Downtown Chicago, have higher arrival rates than other areas, so the top-k microblogs can be retrieved from a shorter time interval than in areas with lower arrival rates. Thus, the values of \(T_c\) and \(T_{c,\beta }\) are derived based on the local arrival rate, the ranking function, and the query parameters. \(T_c\) ensures accurate query answers, meaning that no data record outside \(T_c\) is reported to any incoming query. On the contrary, \(T_{c,\beta }\) employs a load shedding parameter \(\beta \), \(0 \le \beta \le 1\), that saves up to \(100 \times \beta \)% of the memory with probability \(\beta ^3\) of missing a needed data record, trading off a slight decrease in query accuracy for a significant saving in memory resources. \(\beta \) in this case is an input parameter set by the system administrator. Experimental results show that, compared with the default case where data from the last T time units is stored, the policy consumes 65% less storage while achieving an accuracy of 98–99.5% when \(\beta = 0.3\). At \(\beta = 0.7\), 75% less memory is consumed and the accuracy is 97.5–99.3%. Venus [230] extends this with an adaptive load shedding technique where the value of \(\beta \) is calculated and automatically adjusted with the distribution changes in incoming queries and data. This leads to a different \(\beta \) value for each region, based on local data and query distributions, rather than a single global value for all regions. The strategy saves up to 80% of the storage while keeping an accuracy of more than 99%, a significant enhancement over Mercury.

Table 2 Summary of system features for supporting efficient management of microblogs data

kFlushing [224] is another flushing policy for non-aggregate data. kFlushing accounts for a variety of top-k queries with arbitrary attributes, ranking functions, and index structures. kFlushing performs flushing in three phases; a following phase is invoked only when the preceding phase cannot flush B% of memory, where the default value of B is 10. The first phase keeps only k microblogs in each index cell and trims any records beyond k. The second phase removes the infrequent values of indexed attributes, e.g., keywords, with their associated microblogs in ascending order of their latest arrival time. If the infrequent entries do not clear B% of memory, the last phase removes data in least-recently-used order. The main idea in all three phases is evicting data at the level of index entries rather than individual microblogs, which significantly reduces the real-time overhead and scales in highly dynamic data environments. Comparisons with the first-in-first-out and least-recently-used policies demonstrate the superiority of kFlushing: it increases the memory hit ratio by 26–330% compared with the existing flushing schemes and saves up to 75% of memory resources.
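
The three phases can be sketched as follows; the index layout and the item-count accounting (instead of bytes) are simplifying assumptions.

```python
def kflush(index, k, target, last_arrival, freq_threshold):
    """index: term -> posting list; frees entries until `target` items are gone."""
    freed = 0
    # Phase 1: keep only the top-k microblogs per index entry.
    for term in list(index):
        freed += max(0, len(index[term]) - k)
        index[term] = index[term][:k]
    if freed >= target:
        return freed
    # Phase 2: drop infrequent terms, in ascending order of latest arrival.
    for term in sorted(index, key=last_arrival):
        if len(index[term]) < freq_threshold:
            freed += len(index.pop(term))
            if freed >= target:
                return freed
    # Phase 3: fall back to evicting whole entries in LRU order.
    for term in sorted(index, key=last_arrival):
        freed += len(index.pop(term))
        if freed >= target:
            break
    return freed

idx = {"goal": [1, 2, 3], "rare": [4], "vote": [5, 6]}
arrivals = {"goal": 30, "rare": 10, "vote": 20}
kflush(idx, k=2, target=2, last_arrival=arrivals.get, freq_threshold=2)
print(idx)   # "goal" trimmed to 2 entries; infrequent "rare" evicted
```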

GeoTrend's flushing policy, TrendMem [225], depends on aggregate information to evict data from main-memory. GeoTrend queries find the top-k trending keywords within an arbitrary spatial region and recent time, where the different trend measures depend on keyword counts. To effectively utilize memory resources, TrendMem evicts keywords that are consistently infrequent during all recent time periods, as they are unlikely to contribute to any top-k trending query answer. Targeting consistent infrequency ensures that a rising keyword is not missed. Therefore, TrendMem periodically removes \(\epsilon \)-infrequent keywords every \(\frac{1}{\epsilon }\) insertions in each index cell, so dense spatial cells do not affect less populated cells. TrendMem achieves significant memory savings while maintaining highly accurate query answers.
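
A minimal per-cell sketch of this eviction rule is shown below; the cell layout is an assumption, and the \(\epsilon \) value is unrealistically large so the toy example evicts early.

```python
EPSILON = 0.5   # toy value; real deployments use a much smaller epsilon

class TrendCell:
    def __init__(self):
        self.counts = {}   # keyword -> count inside this spatial cell
        self.n = 0         # insertions this cell has seen

    def insert(self, keyword):
        self.counts[keyword] = self.counts.get(keyword, 0) + 1
        self.n += 1
        if self.n % int(1 / EPSILON) == 0:         # periodic, per-cell cleanup
            cutoff = EPSILON * self.n
            # Drop keywords whose count stayed below epsilon * n: they are
            # consistently infrequent and unlikely to enter any top-k answer.
            self.counts = {w: c for w, c in self.counts.items() if c >= cutoff}

cell = TrendCell()
for w in ["goal", "goal", "rain", "goal"]:
    cell.insert(w)
print(cell.counts)   # {'goal': 3}; "rain" never reached the frequency cutoff
```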

GARNET [167] also provides a flushing policy, which aims to use a minimal amount of memory rather than utilizing a fixed memory budget. The policy is tailored for its trending queries over arbitrary time periods. Each incoming microblog needs the past \(N+1\) index cells to calculate its trending measure. Thus, only these \(N+1\) cells are kept in memory and any older data is flushed to disk. If less than B% of the memory is flushed, GARNET flushes the least recently arrived keywords until it reaches B%. The memory usage of TrendMem, GARNET, and AFIA has been compared: TrendMem consumes less than 10% of AFIA's memory, while GARNET consumes around 40% of AFIA's memory. It is also shown that GARNET supports the highest arrival rate; the arrival rate supported by TrendMem is higher than AFIA's and is also an order of magnitude higher than the current Twitter rate.

3 Microblogs data management systems

In this section, we highlight the major data management systems that either support microblogs data in particular or share similar characteristics so that microblogs data can be one of their use cases. Due to the plethora of new systems emerging in the data management literature, our review gives representative examples for each major genre of systems. We identified the major genres based on the adequacy of their features and components to handle microblogs data. Specifically, microblogs combine both large-volume and high-velocity aspects, and the major novel techniques for managing microblogs data give particular attention to its fast streaming nature. Managing fast data has recently received attention in many data management systems, from both academia and industry, which makes some microblogs queries manageable in different system genres. This section reviews five genres of systems: specialized systems designed and developed for microblogs, semi-structured data management systems, fast-data-optimized database systems, fast batch processing systems, and key-value stores. In addition, we highlight hybrid architectures that combine two different types of systems to manage microblogs, showing the limitations of this approach.

Table 2 summarizes the microblog-related features for systems that are reviewed in this section. It summarizes their capabilities in terms of indexing, supported queries, and flushing policies, highlighting the minimum and ideal requirements for efficient management of microblogs data. The rest of this section outlines different genres of systems, discussing their challenges, solutions, and limitations.

Specialized systems. The literature has a few systems that are specialized for microblogs data. A major example from industry is Twitter Earlybird. As introduced in Sect. 2.2, the Earlybird system started as a distributed search system that powers real-time keyword search in Twitter [51]. However, Twitter added different functionalities [195, 208, 209, 239, 245, 246, 326] related to real-time data management, large-scale logging, and higher-level data analysis. We focus on one such functionality, real-time query suggestions, as it was a motivational use case to radically re-design the way Twitter handles its real-time data, and it shows the importance of radically re-thinking batch processing systems to support efficient queries on real-time data, as detailed in [245]. When a user poses a keyword query, a query suggestion module finds potentially related queries to suggest to the user. For example, a user who searches for football might receive suggestions such as soccer, FIFA, or world cup. Twitter used to support query suggestions through a query analyzer that employs Hadoop MapReduce to analyze the query log of the Earlybird system and produce the suggestions. However, using Hadoop led to significant overhead, where one hour of data takes fifteen minutes to process. This is much slower than the changes in Twitter's query distribution, which shifts every few minutes [209, 239]. Thus, a fifteen-minute latency to process one hour of data is way behind such fast changes and led to producing inaccurate query suggestions. To overcome this, Twitter beefed up the Earlybird system with in-memory query analyzer modules that directly access user queries through Earlybird blenders (see Sect. 2.2). Each in-memory query analyzer maintains statistics about incoming queries, with a ranking module to filter the top related query suggestions. Every five minutes, the suggestions are persisted to a distributed file system that represents a data store from which query suggestions are retrieved for end users. This addition to the Earlybird system motivated Twitter to add several latency-sensitive components to their internal systems and radically re-design solutions that depend on batch processing systems such as Hadoop.

Two more examples of specialized systems, coming from academia, are the Taghreed and Kite systems. Taghreed [223] and its successor Kite [228] were early end-to-end holistic systems from academic systems groups focusing on microblogs data management. In particular, both system designs inherently consider microblogs characteristics, of both data and queries, in indexing, query processing, and main-memory management. For data, they support fast and large-volume data requirements. To this end, they employ both in-memory and on-disk index structures as core components to store, index, and retrieve recent and historical data. Indexes at different storage tiers are optimized for different objectives. In-memory indexes support fast data ingestion by batching incoming data and segmenting the index into small segments that are lightly updatable. In addition, in-memory indexes are equipped with flushing policies that are responsible for moving a portion of memory content to disk when the available main-memory budget is exhausted. Flushing policies are optimized to sustain the system's real-time operations as well as to carefully select the victim data to evict, so that memory resources are utilized to store useful data that serves incoming queries. For microblogs queries, they promote temporal, spatial, textual, and top-k queries as first-class citizens in indexing and query processing. Thus, each of the two systems supports two families of index structures: a spatial index and a keyword index. Each index incorporates the temporal aspect in organizing its data, and in certain settings it incorporates the top-k ranking function. Moreover, index segmentation is based on the time dimension in both memory and disk indexes. Disk indexes are optimized for efficient queries over arbitrarily large temporal periods through a richer segmentation setting. Basically, the data is replicated over different temporal resolutions, e.g., day, week, and month, so that querying data over several months still accesses a limited number of index segments and provides a relatively low query latency. Beyond indexing and query processing, both Taghreed and Kite give particular attention to main-memory utilization as a core asset to manage hundreds of millions of microblogs. For this, they provide different optimization techniques in their flushing policies so that the most useful data accumulates in main-memory and obsolete data is moved to disk earlier.

Although Taghreed [223] and Kite [228] share many characteristics in both objectives and system internals, Taghreed is an earlier version of Kite that started to identify the core components and requirements to support microblogs data and queries. Thus, Taghreed focused on a single generic range query that retrieves microblogs within a spatiotemporal range that are relevant to a set of keywords. Any further processing, e.g., top-k ranking, is then performed on top of the Taghreed query processor. Kite generalized this to allow querying any arbitrary attribute, while still promoting temporal, spatial, and textual as the prime attributes. Also, Kite added support for more advanced queries in the system components, such as top-k queries and aggregate queries. Ideas in these systems are patented [251] and commercialized by a social media analysis startup company.

Semi-structured data management systems. A major example of such systems is Apache AsterixDB [18], a distributed big data management system that has been developed by academic research groups and has recently been incubated by the Apache Foundation as a top-level Apache project [19]. AsterixDB is a general-purpose system designed to manage large-volume, billion-scale datasets that other systems are limited in managing efficiently. Recently, AsterixDB has introduced a core system component, called data feeds, to provide scalable ingestion and management for fast data [121], such as microblogs. A data feed digests and preprocesses raw data in main-memory. Then, data is forwarded to primary and secondary index structures. Each index is disk-based; however, it has in-memory components that aggregate data in main-memory before flushing it to the disk-resident components. Data becomes accessible to the query processor once it is resident in the disk components. When data is congested, AsterixDB is equipped with different ingestion policies to select a portion of the data to ingest promptly, while the rest is discarded or deferred. AsterixDB has achieved data digestion rates comparable to current Twitter peak rates with a cluster of five machines, as experimented in [121]. Such performance is higher than what is reported by Earlybird [51] in terms of data digestion per single machine. In terms of digestion latency (or searchability latency), i.e., the average time between a microblog's arrival and its being indexed and available in search results, AsterixDB data feeds provide low latency appropriate for real-time applications under certain ingestion policies, and significantly higher latency under other policies. So, it is crucial to configure the system carefully for the underlying application's needs. As a general-purpose system that is not designed for microblogs data, AsterixDB provides common utilities that fit general fast data use cases without focusing on particular microblogs characteristics, such as temporal and top-k query signatures.

Fast-data-optimized database systems. Although many of microblogs applications do not require transactional data management, database systems that are optimized for transactions on fast data are strong candidates to be used to handle some of microblogs queries, with optionally turning on or off the transactional features. This is due to their light weight management overhead with streaming data, while sustaining a high throughput of scalable queries. VoltDB is an example for such systems. VoltDB [315] is a distributed in-memory database management system (DBMS) that is designed and optimized to support high-throughput ACID database transactions on fast data. The system has started as an academic project, under the name of H-Store [169], that is commercialized by VoltDB [315, 338]. The main additions of VoltDB to traditional disk-based database systems are driven by reducing the overhead of the database transaction manager. Particularly, VoltDB identifies four major sources of overhead in transaction management: (1) multi-threading that is required to manage multiple transactions concurrently, (2) buffer manager that swaps in data pages from disk to a main-memory buffer and evicts pages to disk on full memory buffer, (3) locking that is used to manage data consistency in concurrency control, and (4) logging that is essential in recovery management of completed transactions and rolling back aborted transactions. So, the four main contributions of VoltDB are to tackle such overhead sources to increase the throughout of transactions for fast data management. The multi-threading overhead is totally eliminated by assigning each transaction to a single dedicated CPU core. The buffer management overhead is totally eliminated by eliminating disk storage and storing all data in main-memory, so no buffer is managed in VoltDB. The locking overhead is also eliminated through determining deterministic orders for executing transactions through introducing global and local serializer components. The global serializer is a component that is aware of different data replicas on different machines, while the local serializer has the transactions details on a single local machine. Both components exchange information so the global serializer is able to provide each local replica deterministic orders for transactions, which leads to eliminating the locking overhead. Finally, the logging overhead is significantly reduced through logging data images instead of logging single transaction commands. In particular, VoltDB does not provide recovery management through the traditional write-ahead logging that mandates to write each transaction step to the database log file. Instead, only transaction parameters are written to file proactively. Then, in lazy basis, a full image of current data is written to disk for recovery purposes. This significantly reduces disk access and increases the throughput to 16,000 transaction per core per second, with almost linear scalability when adding more cores. This light management overhead has significantly lifted up managing fast data. Thus, VoltDB indexing and data management infrastructures are suitable to digest fast data efficiently and support important queries in real time, such as keyword queries. However, there are two major concerns for effectively supporting microblogs data end to end. 
First, VoltDB and similar systems are not optimized for large-volume datasets, as stated in their technical documentation, which leads to limitations in handling historical microblogs, spanning several months, that are richly exploited in different use cases. Second, VoltDB has no support for prime attributes of microblogs, such as the spatial attribute, which makes it inadequate for several important queries even on fast microblogs data.
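
To illustrate how eliminating multi-threading and locking works, here is a minimal Python sketch of a VoltDB-style partition that executes transactions serially on one dedicated thread, in the order handed down by a (here, implicit) global serializer. It is a conceptual sketch, not VoltDB's implementation, and all names are illustrative.

    from queue import Queue
    from threading import Thread

    class Partition:
        """One dedicated thread executes transactions serially in a
        deterministic order, so no locks or buffer pool are needed."""

        def __init__(self):
            self.data = {}             # all data lives in main memory
            self.inbox = Queue()       # deterministic transaction order
            Thread(target=self._run, daemon=True).start()

        def _run(self):
            while True:
                txn = self.inbox.get()   # one transaction at a time
                txn(self.data)           # executes without any locking

        def submit(self, txn):
            self.inbox.put(txn)

    # Example transaction: insert a microblog keyed by its id.
    p = Partition()
    p.submit(lambda data: data.update({"tweet:1": "game tonight!"}))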

Fast batch processing systems. Recently, a new generation of distributed batch processing systems has emerged, extending Hadoop-like systems with main-memory data management infrastructures for efficient processing of large and fast datasets. Spark [24] and Flink [21] are prime examples of these systems. Both systems primarily process data in main-memory with options to connect to popular file systems, such as HDFS, or store state in persistent data stores, such as RocksDB [286]. As in-memory systems that support fast data through streaming packages, e.g., Spark Streaming [25], some microblogs applications could fit as use cases for these systems. However, unlike all the systems reviewed earlier, Spark and Flink do not inherently support data indexing. Instead, they provide an advanced generation of batch processing systems, similar in spirit to Hadoop, that performs efficient parallel scans over all data records using commodity hardware clusters. Batch processing has limitations in several applications that need inherent indexing for either large-volume or high-velocity data. Newer systems, e.g., Apache AsterixDB, have tackled these limitations and provide different types of indexing for large and fast data. Obviously, many microblogs applications are among the applications that require data indexing of several types, as detailed earlier. For that reason, every system that gives particular attention to microblogs, e.g., Earlybird, Taghreed, or Kite, has provided different types of indexing for microblogs data. Furthermore, batch processing systems, such as Spark and Flink, do not consider query signatures that are popular in microblogs applications, e.g., top-k, spatial, and textual queries. This adds more overhead when powering large-scale microblogs applications on batch processing systems. The pros and cons of Spark and Flink apply to other batch processing systems that share similar characteristics and architecture, e.g., Apache Impala [23] and Presto [271].
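
As an example of this batch-oriented, index-free style of processing, the following is a minimal Spark Streaming (DStream) sketch in Python that counts keywords over a microblog stream; the socket source on localhost:9999 is an assumption made for illustration only.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "MicroblogKeywords")
    ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

    # Assume microblog text arrives on a local socket, one post per line.
    posts = ssc.socketTextStream("localhost", 9999)

    # No index is consulted: every micro-batch is fully scanned.
    counts = (posts.flatMap(lambda post: post.lower().split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()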

Key-value stores. A major genre of the emerging big data systems is key-value stores, which work as massively distributed hashtables to store data in key-value pairs with various data models, e.g., Apache Cassandra [20], Redis [279], and Apache Ignite [22]. These systems are suitable for certain microblogs applications that require fast data ingestion with hash-based indexing, e.g., real-time keyword search. In fact, some microblogs-oriented systems, e.g., Earlybird [51] and Kite [228], use the key-value store model to support in-memory keyword indexing. However, distributed key-value stores still lack other essential features that are needed in several microblogs applications, such as spatial indexing, temporal awareness, and top-k query processing. Such shortcomings prevent them from being an end-to-end solution for managing microblogs, yet they provide a solid foundation to build upon.
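
The keyword-indexing use case can be sketched with Redis sorted sets, mapping each keyword to post ids scored by timestamp. This is an illustrative sketch (assuming a Redis server on localhost:6379), not the design of Earlybird or Kite.

    import time
    import redis   # assumes a local Redis server

    r = redis.Redis()

    def index_post(post_id, text):
        """Map each keyword to a sorted set of post ids scored by time,
        giving hash-based keyword lookup in reverse-chronological order."""
        now = time.time()
        for word in set(text.lower().split()):
            r.zadd(f"kw:{word}", {post_id: now})

    def recent_posts(keyword, k=10):
        # Hash lookup plus score order; note there is no spatial or
        # relevance-based top-k support beyond this recency ranking.
        return r.zrevrange(f"kw:{keyword}", 0, k - 1)

    index_post("t1", "Flooding reported downtown")
    print(recent_posts("flooding"))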

Hybrid architectures. An alternative way to handle fast and large data is gluing a streaming engine, such as Apache Storm [26], to a persistent data store, such as MongoDB [252]. In fact, MongoDB, a document-oriented database that provides several indexing and querying modules, has received significant attention as a highly scalable database for persistent data, while Apache Storm has received similar attention for processing streaming data. However, each of them is designed and optimized for one aspect of big data, either large volume or high velocity, but not both. Gluing these two systems has been experimented with in [121] to handle fast data that is persisted in large volumes. The comparison with Apache AsterixDB has shown up to two orders of magnitude higher digestion latency for the glued alternative, assuming that data is queried only after it is persisted to disk. Such significant overhead confirms the need for inherent support of fast data in the system components to provide scalable data indexing and querying. Similar conclusions are also drawn in other studies, e.g., [245], on the adequacy of retrofitting fast data management onto systems that are optimized for large volumes. A major source of overhead is the incompatibility of system optimization goals, which leads to different decisions in different system components. For example, MongoDB is optimized for throughput, write concurrency, and durability, which leads to high wait time per single data write to disk and, in turn, high ingestion latency. Another source of overhead in such systems is the concurrency and transactions model, which assumes general-purpose applications with complex scenarios and requirements. This precludes simple and scalable concurrency models, such as single-writer multiple-readers, that are adopted by several microblogs-oriented systems, e.g., [51, 211, 229].
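
The single-writer multiple-readers model mentioned above can be sketched as follows; it is a simplified illustration built on an append-only segment list, not the actual design of any cited system.

    class SingleWriterBuffer:
        """Illustrative single-writer/multiple-readers buffer: only the
        ingestion thread appends, and appends never mutate existing
        entries, so readers can scan a snapshot without taking locks."""

        def __init__(self):
            self._segments = []     # append-only list of sealed batches

        def append_batch(self, batch):
            # Called by the single writer only; each batch is sealed as
            # an immutable segment before becoming visible.
            self._segments.append(tuple(batch))

        def snapshot(self):
            # Called by any number of readers concurrently; copying the
            # segment list yields a stable, lock-free view because the
            # writer only ever appends new segments.
            return [post for seg in list(self._segments) for post in seg]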

4 Microblogs data analysis

The reviewed data management techniques and systems on microblogs have enabled powering a variety of data analysis tasks at scale. This section highlights the major research on analysis tasks that exploit the scalable data management infrastructures on microblogs to provide high-level functionality. As microblogs data analysis is a broad literature that includes several topics unrelated to the data management community, this section limits its scope to the analysis tasks that lie in the intersection of two categories. First, they have novel research contributions, which excludes a plethora of development applications that analyze microblogs data without addressing novel problems. Second, they exploit the querying techniques that are developed by the data management community. This excludes major research directions that are orthogonal to the data management research, such as natural language processing and information retrieval. In fact, these research directions have a rich literature, and dedicated survey papers review parts of it [80, 117]. The goal of this section is not to discuss the details of various techniques. Instead, we present a high-level classification of techniques in the literature, and we summarize each topic through a generic framework that is induced from a variety of existing techniques when applicable. Then, we briefly highlight similarities or differences of each major technique in this topic compared with the induced framework. With such contributions, this section represents a road map for the various microblogs data analysis tasks that make use of the underlying data management infrastructures. We review major work in five main analysis tasks: visual analysis (Sect. 4.1), user analysis (Sect. 4.2), event detection and analysis (Sect. 4.3), recommendations using microblogs (Sect. 4.4), and automatic geotagging (Sect. 4.5). Finally, Sect. 4.6 briefly highlights other microblogs analysis tasks.

Fig. 3 An overview of microblogs data visualization literature

Fig. 4 Example of aggregation-based visualization based on the spatial dimension

4.1 Visual analysis

Visualizing microblogs data has gained particular attention due to the importance of end-user interaction with microblogs applications, e.g., political and disaster event analysis, disease outbreak detection, and user community analysis. The challenges faced in visualizing microblogs data align with the general challenges in visualizing other types of big data [43, 59, 99, 100, 128, 178, 263]. So, several pieces of the proposed research for big data visualization can be used for microblogs data as one type of big dataset. However, we review visualization work that targets specific problems in microblogs datasets for different applications. In particular, microblogs have microlength content, which makes them easy for users to generate all the time, e.g., a user can easily generate a tweet in a few seconds or less. This leads to generating a large number of data records in relatively short times. Visualizing such large numbers of records is beyond the capacity of existing frontend technologies, such as mapping technologies, e.g., Google Maps. So, visualization techniques that focus on microblogs address this problem by aggregation, sampling, or a combination of both. Figure 3 classifies the visualization literature into three categories of techniques: (1) aggregation-based techniques, (2) sampling-based techniques, and (3) hybrid techniques. The visualization modules in all these categories use underlying querying modules, both aggregate and non-aggregate queries, to retrieve the data to be visualized. Thus, they directly make use of the scalable data management infrastructures that are built for microblogs. The rest of this section outlines each category of techniques.

Aggregation-based visualization. Techniques in this category [3, 93, 114, 155, 159, 236, 284, 302, 316, 341, 349, 353, 366] reduce the amount of data to be visualized by visualizing aggregate summaries of microblogs at different levels of aggregation, e.g., different spatial or temporal levels, rather than visualizing individual microblogs. Such aggregation is application-dependent and is usually performed either based on major attributes, e.g., temporal aggregation [93, 155], spatial aggregation [114, 349], or keyword aggregation [93, 316], or based on derived attributes, e.g., sentiment [155, 284]. Thus, these techniques are lossless and present all available information in a summarized form without ignoring any portion of the data. Aggregation could be based on a single attribute (one-dimensional) or multiple attributes (multi-dimensional). Figure 4 shows an example of aggregation-based visualization based on a single attribute, the spatial attribute [259]. In Fig. 4a, spatial regions that have a large number of data points are visualized as variable-size circles that show the number of points in each region. On the contrary, regions that have sparse data, e.g., the Arctic Ocean and Norwegian Sea in Fig. 4a, show the actual data points. When zooming in on the map view, more detailed data is visualized down to the street level that shows individual data points, as depicted in Fig. 4b, which shows street-level data in Riverside, California. Figure 5 shows an example of aggregation-based visualization based on two attributes, the spatial attribute and the language attribute [114]. In this case, the number of microblogs is aggregated in each spatial region, and the visualized circle categorizes data based on the language attribute to show the percentage of microblogs posted in English, Arabic, Indonesian, Persian, etc.
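
A minimal sketch of the adaptive behavior in Fig. 4 follows: points are counted per grid cell, dense cells are rendered as sized circles, and sparse cells fall back to raw points. The cell size and density threshold are illustrative choices.

    from collections import defaultdict

    def aggregate_for_viewport(points, cell_size=1.0, threshold=5):
        """points: iterable of (lat, lon). Returns per-cell aggregates
        for dense cells and raw points for sparse cells, as in Fig. 4."""
        cells = defaultdict(list)
        for lat, lon in points:
            cell = (int(lat // cell_size), int(lon // cell_size))
            cells[cell].append((lat, lon))
        circles, raw = [], []
        for cell, pts in cells.items():
            if len(pts) >= threshold:
                # Dense cell: one circle sized by its point count
                # (lossless in the aggregate sense: every point counts).
                circles.append((cell, len(pts)))
            else:
                raw.extend(pts)      # sparse cell: show actual points
        return circles, raw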

The literature currently has seventeen visualization modules that employ only data aggregation based on microblogs queries. We next briefly outline each of them, highlighting their aggregation attributes and visualization format. VisCAT [114] aggregates data based on categorical attributes, e.g., language, and spatial and temporal ranges. DiscVis [349] aggregates tweets based on spatial region, language, and topics. DestinyViz [93] aggregates tweets related to certain games based on time, sentiment, and keywords. NLCOMS [3] aggregates tweets based on user communities and visualizes them in a graph form. GovViZ [155] aggregates data based on time, country, topic, keywords, sentiment, and content objects, e.g., links, images, and videos. VisImp [366] aggregates data based on communities and social interactions. Twigraph [316] aggregates data based on keywords and visualizes it in a graph form. Plexus [353] aggregates data based on topics and emoji objects in the textual content. TSViz [284] aggregates data based on time, sentiment, and hashtags. PairCSA [341] aggregates data based on location stamps or location mentions to relate users' locations and the locations they mention. Tweetviz [302] aggregates data based on sentiment for business intelligence. NetworkTweet [159] aggregates external passenger flow and unusual phenomena based on spatiotemporal attributes, and uses trending keywords from microblogs to understand users' behavior. TwitterViz13 [168] aggregates data based on tweet intensity (tweets/second) and tweet sentiment. CityViz [283] aggregates data based on user behavior in cities to visualize periods of intense/sparse user activity. TileViz [68] generates summary statistics of the data for each tile to explore the raw dataset. TwitterViz15 [98] provides two visualization views for Twitter data: (a) a spatiotemporal analysis view and (b) a graph analysis view. The spatiotemporal view aggregates data based on spatial regions, sentiment, social bonds combined with spatiotemporal information, temporal evolution, and real-time statistics. The second view aggregates data based on the social graph and real-time graph statistics. ImpressViz [188] aggregates textual and meta-data information to quantify user impression and visualize data in a six-dimensional impression space.

Fig. 5 Example of aggregation-based visualization based on both spatial and language dimensions

Fig. 6 Example of sampling-based visualization for tweets with different languages

Sampling-based visualization. Techniques in this category [223, 294, 346] reduce the amount of visualized data through sampling. A sample of data is selected and visualized as a representative of the whole dataset, while the rest of the data is not visualized. The sampling technique can be classified along different dimensions. A sample could be a query-guided sample or an arbitrary sample. An example of a query-guided sample is OmniSci TweetMap (Fig. 6), which samples tweets based on their language, as the query predicate filters data based on the language attribute. Another example is TwitterStand [294] (Fig. 7), which samples tweets whose textual content contains news stories. For certain queries, the query predicate generates more data than can be visualized efficiently. In this case, applications, e.g., [223], select an arbitrary data sample to reduce the data size. A second classification dimension is the amount of data in the sample: the sample is either fixed or interactive. For example, TwitterStand [294] takes a fixed sample of data that contains news stories. Any interaction of end users with the map view, in Fig. 7, will not change the content of this sample. User interactions only change the subset of this sample that is shown on the map. On the contrary, an interactive sample changes the sample content based on user interactions. At the beginning, an initial sample of 100K microblogs, for example, is visualized from all languages, including 30K English microblogs. When the user filters the data to show only English microblogs, the visualized English microblogs can be increased to 100K, as they are the only data visualized. Such an interactive technique exploits the whole capacity of frontend technologies while increasing the overall amount of data visualized to users. This technique is not yet heavily used and still poses several research challenges for supporting large-scale data.
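
Arbitrary sampling over a stream is commonly realized with reservoir sampling; the following standard sketch (not tied to any specific cited system) keeps a uniform sample of k posts from a stream of unknown length.

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of
        unknown length -- one way to realize an 'arbitrary sample'."""
        sample = []
        for i, post in enumerate(stream):
            if i < k:
                sample.append(post)
            else:
                j = random.randint(0, i)   # replace with decreasing odds
                if j < k:
                    sample[j] = post
        return sample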

Fig. 7 Example of sampling-based visualization for news tweets

Unlike aggregation-based techniques that are lossless, sampling-based techniques might be lossy or lossless depending on the application and the size of the query result. If certain application queries generate a reasonable sample size, then all data points are considered. Otherwise, such as in arbitrary sampling, a subset of data points is ignored and the sampling is lossy.

The literature currently has three visualization modules that employ only data sampling based on microblogs non-aggregate queries. We briefly outline each module, highlighting its sampling attributes and stages. CulTweet [346] samples data based on language, country, and topic. Taghreed [223] performs two-step sampling. First, it samples data guided by query predicates based on spatial, temporal, and keyword attributes. Then, if the sample size is still excessive, it performs an arbitrary sampling. TwitterStand [294] samples data based on textual content and spatial extent.

Hybrid visualization. Some applications use both aggregation and sampling to reduce the amount of data to be visualized [76, 162, 236, 238, 240, 327, 365]. For example, event analysis applications [238, 327] sample microblogs based on their relevance to specific events. Then, the event data needs to be aggregated to summarize the event highlights to users, e.g., showing changes over time, space, users, or topics. Such applications usually do not encounter challenges in visualizing their data, as the data size is reduced over two different phases, sampling and aggregation, which leads to a significant reduction in size and eases the visualization task. We highlight examples of such applications.

We highlight nine visualization modules that employ both data aggregation and sampling based on microblogs queries. We briefly outline each of them, highlighting its different stages. TweetTracker [327] samples tweets that are relevant to a set of tracked long-term events; then, it aggregates them based on location, time, and keywords. TwitInfo [238] samples event-related data and aggregates it based on sentiment and spatial attributes. ATR-Vis [236] samples tweets that are relevant to a set of input debates; then, it aggregates and labels tweets based on mentioned hashtags and the corresponding debate. Cloudberry [162] samples data based on keywords and aggregates it based on space and time. TweetDesk [240] provides a sample of top tweets of an event, along with a summary of the event. ChineSentiment [365] visualizes sentiment distribution based on temporal, spatial, and hot-event dimensions. EmotionWatch [174] visualizes a sentiment summary of public reactions toward events. It allows visualization of intense emotional reactions (peaks), controversial reactions, and emotional anomalies. UserViz [124] analyzes users' connections and the frequency of tweets sent by one user or a group of users, classifies these tweets, generates a tag cloud, and visualizes the most popular users. Taqreer [231] samples microblogs based on user-defined categories, e.g., different car models, defined by a set of keywords; then, data for each category is aggregated based on spatial and temporal ranges and visualized on map and aggregate views.

4.2 User analysis

The importance of microblogs in different applications originates from their user-generated nature, where hundreds of millions of users worldwide are posting around the clock. Among the major analysis directions is analyzing user behavior related to different topics, locations, and communities based on users' profiles and the content of their microblogs. In fact, this kind of user analysis highly overlaps between microblogs, i.e., the microlength user-generated data, and social media in general, which includes both short and long posts and objects, e.g., images and videos. This section limits its scope to analyzing microblogs users, where excessive numbers of data records are generated, compared to regular social media data, due to the microlength of microblogs.

Figure 8 classifies the literature of user analysis techniques on microblogs into techniques that either (1) find top-k users according to a certain ranking criterion, e.g., top-k influential users for a certain topic or top-k active users in a certain location, to provide useful answers for higher-level applications, or (2) classify users based on certain characteristics. Top-k user queries directly benefit from the data indexing and query processing techniques that are introduced in the data management literature to support different types of scalable top-k queries based on various ranking functions. In fact, usernames are used interchangeably with keywords as string keys for the index structures, which makes many of the proposed data management techniques applicable to user queries. Figure 9 depicts a high-level framework for user analysis in microblogs that is induced from the existing literature. The framework consists of three main stages. First, the microblogs of each user are fed into a feature extraction module to profile the user behavior through different pieces of information, such as keywords, followers/friends, timestamps, and locations. The actual extracted information differs across applications. Then, the extracted user information is forwarded to an indexing (or modeling) module that produces a relevant index/model for users based on their information. Finally, a query processor accesses the index/model to answer application-level queries. As this description shows, the last two stages of the user analysis framework make significant use of the data management techniques, and hence, new advancements in indexing and query processing techniques would positively affect the performance of user analysis tasks. Following the described framework, major techniques in the literature serve different applications with diverse purposes. We outline each category of techniques below.
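
A minimal end-to-end sketch of the three stages in Fig. 9 follows; the keyword-count features and the inverted index are deliberately simplistic stand-ins for the application-specific choices described above.

    from collections import Counter, defaultdict
    import heapq

    # Stage 1: feature extraction -- profile each user by keyword counts.
    def extract_features(microblogs):          # [(user, text), ...]
        profiles = defaultdict(Counter)
        for user, text in microblogs:
            profiles[user].update(text.lower().split())
        return profiles

    # Stage 2: indexing -- invert profiles into keyword -> (count, user),
    # using keywords as string keys just as usernames could be.
    def build_index(profiles):
        index = defaultdict(list)
        for user, counter in profiles.items():
            for word, count in counter.items():
                index[word].append((count, user))
        return index

    # Stage 3: query processing -- top-k most active users for a topic.
    def topk_users(index, keyword, k=3):
        return heapq.nlargest(k, index.get(keyword, []))

    profiles = extract_features([("ann", "flood flood rescue"),
                                 ("bob", "flood downtown")])
    print(topk_users(build_index(profiles), "flood"))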

Fig. 8 An overview of microblogs user analysis literature

Fig. 9 A framework for microblogs user analysis

Top-k user queries. Section 4.4 reviews several techniques that recommend top-k users as potential friends, which overlap with top-k user queries. DomUsr [213] finds the most influential users based on nine features that are aggregated through different models to calculate a final influence score. The used models are both aggregation and SVM classification models. TkLUS [163] finds the top-k local users who are most active for a certain topic in a certain location. TkLUS uses textual, social, and spatial relevance in a hybrid spatial-keyword index to organize and retrieve top-k users efficiently. PromUsr [49] finds prominent users for a certain event through a probabilistic model that analyzes their temporal and textual information. LnkUsr [336] identifies top candidate user entities with limited information on microblogging platforms that can be linked to user entities on other platforms. It extends graph matching techniques with two heuristics to overcome the limited available information. Twittomender [132] finds top users with interests similar to the querying user's to expand homogeneous communities of similar interests. It profiles the content of user posts and uses collaborative filtering techniques to find similar users. TemUsr [293] models users' temporal behavior for different short-term and long-term topics. CUsr [102] samples microblogs data records, rather than sampling k users, for efficient user community reconstruction based on strongly connected components. IBCF [221] uses dynamic user interactions in different topics to model the dynamics of relationship strength between users and topics over time. Then, the modeled relationships are used in a matrix factorization recommendation model to improve social-based recommendation quality. FURec [352] predicts the top-k users who will retweet or mention a focal user in the future by formalizing the prediction problem as a link prediction problem over an evolving hybrid network. InfUsr [15] finds the most influential users in a certain topic. A nodal feature called focus rate is introduced to measure how focused users are on specific topics. Then, nodal features are incorporated into network features, and a modified PageRank algorithm is used to analyze the topical influence of users. FadeRank [40] evaluates the reputation of Twitter users. It summarizes the past history in a bounded number of values and combines them with the raw reputation computed from the most recent behavior to assign a final ranking score. TrueTop [375] outputs the top-k influential non-sybil users among a given set of Twitter users. The system constructs an interaction graph and then performs iterative credit distribution using weighted eigenvector centrality as the metric to make the influential non-sybil users stand out. UIRank [381] identifies influential users whose tweets can cause readers to change emotion, opinion, or behavior. The algorithm is based on random walks and measures the user's tweet influence and information dissemination ability to evaluate the influence of the user. FAME [193] finds topical authorities on Twitter for a given topic. The algorithm adopts a focused crawling strategy to collect a high-quality graph and applies a query-dependent personalized PageRank to find important nodes that represent authorities. Cognos [116] identifies expert users for a certain topic by mining the meta-data of Twitter user lists that are created by the crowd. Lexical techniques are used to infer user expertise; then, experts in the same topic are ranked based on cover density ranking.

User classification. In addition to top-k queries, user analysis is also performed for user classification. PEDIdent [137] identifies pro-eating disorder (pro-ED) Tumblr posts and Twitter users. It uses the associative classification algorithm CMAR to generate classification rules and trains a classifier to identify pro-ED posts and users. AOH [118] classifies users into automated agents and human users using a random forest classifier. OMT [176] identifies the orientation of a user by analyzing tweets that mention more than one orientation using a logistic regression model. HUsr [282] identifies hateful users on Twitter. It first samples users using a diffusion process based on DeGroot's learning model. Then, a crowd-sourcing service is adopted to manually annotate the samples. AutoOPU [378] detects opioid users through a multi-kernel learning model based on meta-structures over a heterogeneous information network.

4.3 Event detection and analysis

Event detection and analysis has gained tremendous attention with the rise of microblogging platforms [1, 2, 12, 16, 17, 32, 78, 91, 103, 108, 133, 151, 153, 154, 170, 171, 187, 203, 212, 218, 253, 267, 269, 285, 290, 292, 348, 357, 368, 369, 377, 383, 387, 388, 389]. The reason is the popularity of event-related updates that are posted by users through microblogs around the clock. This includes a wide variety of both short-term and long-term events, such as concerts, crimes, sports matches, accidents, natural disasters, social unrest, festivals, traffic jams, elections, and conflicts. Analyzing event-related microblogs has enabled several applications at different levels of importance, including crucial applications, leisure applications, and applications in between. An example of crucial applications is rescue services and emergency response, which have used microblogs to save hundreds of souls in different natural disasters since 2012 across the world [101, 144, 145, 156, 157, 161, 304]. An example of leisure applications is detecting surrounding entertainment events that are not collected in a single calendar, e.g., concerts, light shows, and special museum exhibitions in the Los Angeles area. In between both types, other applications have become popular, such as news extraction based on events [8], event-driven advertising [256], public opinion analysis for political campaigns [333, 334], and analyzing protests and social unrest [28, 257, 332].

The advancements in microblogs data management enable significant performance enhancements in both event detection and event analysis tasks. As noted in the data management section, there are several state-of-the-art indexing and query processing techniques that are tailored for organizing and retrieving event data, such as ContexEven [6] and MIL [52], which are reviewed in Sect. 2.2.1. In a more general context, event detection makes use of indexing data based on temporal attributes, which enables efficient retrieval of recent and temporally compact data, a major characteristic for grouping relevant data of a single event. In addition, indexing data based on spatial attributes gives an edge for discovering local events in geographic neighborhoods.

Fig. 10 An overview of microblogs event detection and analysis literature

Figure 10 depicts an overview of the literature of event detection and analysis on microblogs. The rich literature is categorized into three main categories: (1) detecting arbitrary events, (2) detecting specific types of events, and (3) analyzing events. We summarize each category with a generic framework that is induced from major work in the literature. Figure 11 shows three frameworks that correspond to the three categories. In the rest of this section, we review each category, describing the different components of its framework and mapping the existing literature to this framework, highlighting similarities and deviations.

Fig. 11 Frameworks for microblogs event detection and analysis

4.3.1 Detecting arbitrary events

A major direction of event detection research focuses on detecting arbitrary events that have either no predefined characteristics or at most very high-level ones. For example, finding coherent discussions on Twitter [32, 78] without having a prior idea about what such discussions could be about. Another example is looking for local events in a certain city [2, 108] without determining any specific characteristics of such events. These events are arbitrary events, as the user does not provide a detailed prior description of the event characteristics. Figure 11a depicts a framework that is followed by most arbitrary event detection techniques. The framework consists of five main stages: (a) filtering & feature extraction, (b) grouping, (c) scoring, (d) summarization, and (e) visualization. A microblog dataset, either streaming or stored, is processed through the filtering and feature extraction stage to identify potentially relevant microblogs and extract their temporal [2, 16, 108, 187, 285, 348, 368, 383, 389], textual [16, 32, 91, 348, 368, 377, 389], spatial [2, 16, 108, 171, 187, 348, 368, 369, 377, 383, 389], and semantic (part-of-speech (POS) tags/named entities) [218, 285, 290, 357, 368, 387, 388] features. These four types of features are the main drivers for detecting new events. Then, microblogs are forwarded to a grouping stage that assembles microblogs with similar features into groups, where each group represents an event candidate. The grouping stage uses different types of techniques, including clustering [2, 16, 78, 108, 170, 218, 348, 367, 368, 369], lexical matching [290, 377], graph partitioning [32, 91], and statistical techniques such as Bayesian models [368], latent variable models [285, 387, 388], and regression models [187], as depicted in Fig. 10. The set of candidate events is then forwarded to a scoring module that gives a score (or a label) to each candidate to distinguish actual events from noisy groups. Scoring is performed in different ways, including labeling [171, 187, 367, 368, 383, 387] or ranking candidates based on diffusion [2, 78, 285], similarity [16, 32, 78, 170, 218, 285, 377, 389], correlation [91, 290, 348, 388], and/or burstiness of different combinations of temporal, keyword, and spatial features [2, 108, 369, 383]. For example, in scoring based on keyword correlation, if the group of microblogs has scattered keywords that are not related to each other based on statistical co-occurrences of words, then this group is discarded as a noisy group that does not reflect an event. On the contrary, if the set of keywords is cohesive and has high co-occurrence likelihood in real topics, it is assigned a high score as an actual event. Several scoring techniques also consider temporal and spatial similarities besides textual-based measures. Then, the top-scored candidates are selected as actual events, while the rest of the groups are considered noise. The events are then fed to an optional summarization module that identifies the most important microblog posts to represent a certain event using different signals of importance, such as the popularity of the post, its temporal position, etc. Finally, the events are forwarded to a visualizer that displays representative microblogs along with their labels, content, locations, and temporal details to end users. The visualizer uses the microblogs visual analysis techniques that are presented in Sect. 4.1, so their details are not duplicated in this section.
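
As a concrete illustration of keyword-correlation scoring, the following sketch scores a candidate group by the co-occurrence of its keywords: cohesive keyword sets (a real event) score high, scattered ones (noise) score low. The normalization is illustrative and not taken from any specific cited technique.

    from itertools import combinations

    def cooccurrence_score(group):
        """group: list of keyword sets, one per microblog in the
        candidate event. Higher scores mean more cohesive keywords."""
        pair_counts = {}
        for keywords in group:
            for a, b in combinations(sorted(keywords), 2):
                pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
        if not pair_counts:
            return 0.0
        # Average pairwise co-occurrence, normalized by group size.
        return sum(pair_counts.values()) / (len(pair_counts) * len(group))

    event = [{"quake", "magnitude"}, {"quake", "shaking"},
             {"quake", "magnitude"}]
    noise = [{"coffee"}, {"homework"}, {"quake"}]
    print(cooccurrence_score(event), cooccurrence_score(noise))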

Table 3 Summary of clustering-based and lexical techniques that detect arbitrary events from microblogs data
Table 4 Summary of graph-based and statistical techniques that detect arbitrary events from microblogs data

The described framework drives the major techniques in the literature. Tables 3 and 4 summarize the different stages of each technique. This literature can be categorized into four categories based on the grouping technique, as the major stage that generates event candidates. Figure 10 depicts the four-category classification, namely clustering-based techniques, lexical techniques, graph-based techniques, and statistical techniques. Table 3 summarizes clustering-based and lexical techniques, and Table 4 summarizes graph-based and statistical techniques. The rest of this section briefly outlines the techniques of each category.

Clustering-based techniques. EvenTweet [2] proposes a framework to detect localized events in real time from a Twitter stream and track their evolution over time by adopting a continuous analysis of the most recent tweets within a time-based sliding window. Event candidate extraction is based on clustering keywords according to the cosine similarity of their spatial signatures. Scoring the events is based on keyword burstiness and time diffusion from the cluster. Detected localized events are summarized by the number of related keywords and spatiotemporal characteristics. The STREAMCUBE [108] system extracts microblogs hashtags along with spatiotemporal attributes. Then, hashtags are grouped through a single-pass hierarchical spatiotemporal clustering to detect event candidates, which are scored and ranked based on burstiness and local features. The system provides ways to explore events with different granularities in both time and location. EveMin [171] detects visual events based on photos and locations. Feature extraction calculates area weights and the commonness score of words, grouping depends on word bursts using an n-gram model and image clustering based on a deep convolutional neural network (DCNN), and labeling uses another DCNN. ReutersTracer [218] extracts features based on named entities, while grouping uses a novel clustering algorithm that accounts for microblogs features. TrioVecEvent [368] detects local events by extracting semantic textual, temporal, and spatial features that are used by a multimodal embedding learner to map correlated microblogs to the same latent space. Then, a novel Bayesian mixture clustering model finds geo-topic candidate events. These candidate events are then passed to a classifier that relies on the multimodal embeddings to label whether an event is a local event. The DisruptEven [16] framework has both classification and clustering. The classification phase is used for filtering event-related posts from noisy posts and is based on a naive Bayes model. Then, online clustering is performed using temporal, spatial, and textual features. After clustering, the framework offers event summarization using a novel temporal Term Frequency–Inverse Document Frequency (TF-IDF) that generates a summary of top terms without requiring prior knowledge of the entire dataset. MGeLDA [357] is a sub-event detection technique that extracts semantic features based on microtopics. The microtopics are identified by a novel mutually generative latent Dirichlet allocation (LDA) model for microblogs hashtags. Then, k-means clustering is used to group related topics and discover events. StoryEven [170] also introduces a model that summarizes each event as a sequence of sub-events on a timeline based on nonnegative matrix factorization (NMF) clustering.

Lexical techniques. Jasmine [348] extracts co-occurring words as well as the geo-location and timestamp of microblogs. Then, microblogs that are generated within a short time and a small geographic area are grouped to form event candidates. The co-occurring words of each candidate are analyzed to distinguish noisy candidates from local events. DisaSubEven [290] extracts sub-events from a bigger event, e.g., a disastrous event has a series of small-scale emergencies such as a bridge collapsing, an airport getting shut down, and medical aid reaching an area. Feature extraction is based on POS tagging, grouping of sub-events is based on noun-verb pairs, and ranking is based on the frequency of co-occurrence of their constituent nouns and verbs in the corpus. For summarization, DisaSubEven uses an integer linear programming (ILP) technique that considers the maximum occurrence of nouns, verbs, and numerals.

Graph-based techniques. DynamiCentr [32] combines the first three stages of the framework depicted in Fig. 11a by extracting emergent keywords from incoming data streams based on analyzing dynamic semantic graphs, where nodes represent keywords and edges represent keyword co-occurrence. Then, events are summarized based on the minimum weighted set cover applied to the semantic graph of the dynamically highly ranked keywords. SNAF [377] detects local events based on spatial and textual features of microblogs. It first filters event-relevant microblogs based on lexical analysis and statistical user profiling. Then, relevant microblogs are geotagged based on a large gazetteer and distance-based data cleaning algorithms. The cleaned data is then grouped into spatially connected components that represent events. GeoBurst [369] uses spatial and keyword features to build a keyword co-occurrence graph that is used to infer semantic features through random walks. Then, geo-topic clusters are formed as candidate events by combining both spatial and semantic features. A set of pivot microblogs is identified for each cluster; the clusters are then ranked based on spatiotemporal bursts, and the top-k are selected. GeoBurst+ [367] differs from GeoBurst by employing a new supervised framework for selecting the local events, instead of burst ranking. In addition, it performs keyword embedding to capture the subtle semantics of microblogs. The EvenDetecTwitter [91] framework identifies both short-term and long-term events. It first extracts temporal and textual features that include word frequency, conditional word frequency, inverse trend word frequency, fuzzy representation, and scale time modeling. The features are used to connect data in a graph model. Then, a multi-assignment graph partitioning scheme is employed so that each microblog can belong to multiple events. The similarity measure differs based on event type: for short-term events, a cross-correlation similarity measure is used, whereas for long-term events, the Riemannian distance is used.

Statistical techniques. This category can be divided into two sub-categories. The first is latent variable models. ExplorEven [387] proposes a pipeline process of event filtering, extraction, and categorization. The filtering is based on lexicon matching and binary classification to select only event-relevant microblogs. Feature extraction then processes relevant microblogs for time expression resolution, named entity recognition, part-of-speech (POS) tagging and stemming, and the mapping of named entities to semantic concepts. The event candidate extraction and grouping phase is based on an unsupervised latent variable model, called the latent event and category model (LECM). For labeling a cluster, the most prominent semantic class obtained from the event entities is employed as the event type. ProbEvent [388] extracts features through POS tagging and named entity recognition, groups microblogs based on a novel unsupervised latent variable model, called the LEEV model, which simultaneously extracts events and generates visualizations, and scores candidate events based on the correlation between named entities, dates, locations, and words. OpenEve [285] extracts temporal features, named entities, and POS tags, which are used to filter irrelevant microblogs through an event tagger module based on conditional random fields (CRF). The microblogs are then grouped based on a latent variable model and ranked based on the association between event and time. The second sub-category is miscellaneous models that use different statistical methods, including regression, Markov models, graphical models, and temporal analysis. Eyewitness [187] extracts local events and summarizes them using time series analysis of geotagged tweet volumes from localized regions. The framework identifies features as counts of data records based on spatial and temporal localities. Then, for a given region, a regression model is learned to predict the volume of data versus data spikes as a function of time. Local events are identified when the actual volume exceeds the prediction by a significant amount. SpatialEvent [383] forecasts spatiotemporal events using an enhanced Hidden Markov Model (HMM) that characterizes the transitional process of event development by jointly considering the time-evolving context and the space-time burstiness of Twitter streams. To handle the noisy nature of tweet content, words that are exclusive to a single event are identified by a language model that has been optimized by a dynamic programming algorithm to achieve an accurate sequence likelihood calculation. SEvent [389] detects related events, e.g., a sinking boat and an on-going flood in the same spatial region. It first extracts textual, spatial, and temporal features. Then, a novel graphical model-based framework, called location–time constrained topic (LTT), is used to express each microblog as a probability distribution over a number of topics. To group related microblogs, a KL divergence-based measure is employed to gauge the similarity between two microblogs. Then, another longest common subsequence (LCS)-based measure is used for the link similarity between two sequences of user microblogs. Sequences are grouped based on spatial, temporal, and topical similarities. BEven [78] focuses on discovering breaking events and distinguishing real-life events from virtual events that happen only in the online community. Therefore, it categorizes microblogs based on three features extracted from the hashtags: (1) instability for temporal analysis, (2) meme possibility to distinguish social events from virtual topics or memes, and (3) authorship entropy for mining the most contributing authors. Based on these features, an unsupervised technique is used to categorize microblogs into advertisements, memes, breaking events, or miscellaneous.

Table 5 Summary of techniques that detect specific types of events from microblogs data

The rich literature of event detection on microblogs not only contains holistic frameworks that start with raw data and output events to end users, but also specialized work that does not propose a holistic framework and instead focuses on one or more of the stages, or studies a problem that serves as a utility for event detection. We outline examples of such work in the rest of this section.

HierEmbed [267] focuses on mining topics that are related to events in microblog streams. It presents an unsupervised multi-view hierarchical embedding (UMHE) framework that generates topics with high accordance to the events in a microblog stream. The framework applies LDA to extract the feed-topic and topic-word distributions. Thus, for each latent topic, there are two different view features, namely the latent word distribution and the relevant feed collection. Then, it applies a novel multi-view Bayesian rose tree (Mv-BRT) to refactor the latent topics into a hierarchy. A translation-based hierarchical embedding is formulated to encode the topics and relations in low-dimensional dense vectors to better capture their semantic coherence. ET-LDA [151] proposes a joint model based on LDA to extract the topics covered in an event and its tweets, and to segment the event into topically coherent segments. AnchorMF [133] solves the event context identification problem using a matrix factorization technique by leveraging a prevalent feature in social networks, namely the anchor information. A probabilistic model is built to consider users, events, and anchors as latent factors. An anchor selection algorithm is proposed to automatically identify informative anchors for the model. A Gibbs sampler and a maximum a posteriori (MAP) estimator are proposed to estimate the model parameters. KeyExtract [1] focuses on extracting real-time local keywords through a time sliding window approach. For each keyword, a probability distribution over co-occurring places is estimated and used to eliminate spatial outliers. The spatial distribution is updated by inserting new content and removing old content that has expired from the sliding window. AutoSummarize [17] focuses on the automatic summarization of Twitter posts using three methods, namely temporal TF-IDF, retweet voting, and temporal centroid representation. The temporal TF-IDF method extracts the highest-weighted terms as determined by the TF-IDF weights over two successive time frames. The voting method considers the highest number of retweets a post received in the time window. The temporal centroid method selects posts that correspond to each cluster centroid.

4.3.2 Detecting specific types of events

Another major direction of event detection research focuses on detecting specific types of events that have a set of distinguishing information characterizing the event, e.g., keywords. Examples of such events are crime events, earthquakes, or traffic jams. Crime events can be described by a set of keywords, while earthquakes are characterized by labeled training data, for example. In general, each event type is described by a set of event-related information. Figure 11b shows a framework that utilizes the event-related information along with incoming microblogs data to detect events of a specific type. The framework consists of three main stages: (a) feature extraction, (b) event classification, and (c) visualization. The incoming microblog data is processed to extract temporal [154], textual [154, 203, 212, 386], spatial [154], and sentiment features [361]. Then, the processed data is forwarded to a classification model that uses the event-related information to distinguish data relevant to the event type of interest from irrelevant data. The classification can be performed through two different types of techniques, as depicted in Fig. 10: (1) learning-based techniques [12, 142, 153, 154, 269, 292, 361], such as support vector machines (SVM) [142, 153, 292] and regression models [269], and (2) lexical techniques [203, 212, 386]. The type of classification is coupled with the type of the provided event-related information, which might be keywords or labeled training data. The classified relevant microblogs are directly fed to a visualizer that displays events to end users. The visualizer still uses one of the visualization techniques that exploit aggregation, sampling, or both, as presented in Sect. 4.1. Compared to arbitrary event detection (Sect. 4.3.1), this framework replaces the clustering and scoring modules with a classification model that exploits the event-related information to directly group and filter relevant data and reduce noisy output.
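
A minimal illustration of the learning-based variant of this framework is sketched below with scikit-learn, training an SVM text classifier on toy labeled data; the toy posts and labels stand in for the event-related information, and this is not the pipeline of any specific cited system.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy labeled training data standing in for event-related information.
    posts = ["strong earthquake shaking downtown",
             "earthquake magnitude 5 reported",
             "great coffee this morning",
             "watching a movie tonight"]
    labels = [1, 1, 0, 0]            # 1 = event-related, 0 = irrelevant

    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(posts, labels)

    # Classify incoming microblogs before forwarding hits to the visualizer.
    print(clf.predict(["felt an earthquake just now", "lunch was good"]))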

This framework drives the major existing work on detecting different types of events. Table 5 summarizes the different stages of each technique. The literature includes two categories of techniques based on the classification stage, as depicted in Fig. 10: learning-based techniques and lexical techniques. We briefly outline techniques of each category.

Learning-based techniques. This category includes both supervised and semi-supervised techniques. EarthquakEven [292] detects earthquake events through Twitter. It uses SVM classifiers and labeled earthquake training data to classify earthquake-related tweets. ContraEven [269] detects controversial events through a regression classification model along with labeled training data on well-known controversial topics, such as the Obama Nobel Peace Prize. WellEven [12] extracts wellness events from tweets. It extracts features based on a graph-guided multi-task learning model and classifies data based on a novel supervised model that takes task relatedness into account. TarEven [154] detects social media events that are related to news reports. It extracts features from both tweets and news reports to find relevant tweets. Then, relevant tweets are split into positive and negative examples through an EM-based refinement algorithm, and final relevance is computed based on textual, spatial, and temporal similarities. The data is then fed to a novel semi-supervised approach for detecting spatiotemporal events from tweets. STED [153] proposes a semi-supervised approach that enables automatic detection and visualization of user-defined specific events. The framework first applies transfer learning and label propagation to automatically generate labeled data, then learns an SVM text classifier based on tweet mini-clusters obtained by graph partitioning. Finally, it applies fast spatial scan statistics to estimate the locations of events. PersonaLife [361] detects personal life events from users' tweets using a multi-task LSTM model with attention. The system detects whether the tweet is an explicit event, an implicit event, or not an event, and then detects the category of the event from predefined life event categories. CrowdEven [142] treats each bus-related tweet as a microevent that can be further analyzed for event type categorization, entity extraction, and sentiment mining. It uses CRF for entity extraction and a one-against-one classification strategy with SVM as the classifier.

Lexical techniques. TEDAS [203] detects crime events based on crime-related keywords along with lexical matching to classify relevant data. TrafEven [212] detects traffic events using related keywords along with wavelet analysis to classify relevant tweets. DynKeyGen [386] proposes a semi-supervised solution based on an expectation maximization mechanism that leverages word information to infer tweet labels. The candidate tweets are selected based on a set of keywords, which are generated and updated dynamically based on a word importance score that changes over time.

4.3.3 Event analysis

Unlike event detection techniques, where new events are outputs, event analysis techniques take an event as an input and analyze its data in different ways. Specifically, event analysis work focuses more on providing exploration tools for known predefined events rather than detecting new events that are not known beforehand. For example, the Syrian revolution is a long-term event that is known beforehand with a set of features such as keywords and locations. So, an event analysis module is interested in analyzing the data of this well-known event rather than discovering a new event. Another example is the King Tut festival in Hayward, California. This is a short-term event that is known beforehand with a set of keywords, locations, and a time period. Again, an event analysis module focuses on analyzing the data of this event without discovering any new events. Thus, most existing event analysis work follows the simple framework that is depicted in Fig. 11c. The framework has two stages: (a) filtering and (b) visualization & analysis. The filtering stage employs simple filters on different attributes, e.g., keywords [29, 237, 238, 327], spatial [29, 327], and temporal [29], to extract microblogs relevant to a certain event, e.g., Hurricane Sandy. Then, the extracted data is forwarded to a rich visualization module that enables end users to analyze event data through multiple views, e.g., a map view, aggregate views, frequent keywords, influential users, a timeline view, a sentiment view, or individual microblogs. The features of the analysis and visualization views vary widely and depend on the application and the analysis purpose. The rest of this section presents examples of event analysis applications in the literature.

TweetTracker [327] provides an event analysis framework for long-term events, such as the Arab Spring uprisings, Occupy Wall Street, and US presidential elections. Users can define new jobs to define new events to analyze. Event data is filtered based on keyword, location, and username features. Newly incoming data is tracked based on the event features over a long term. Then, the collected and new data is visualized through a time series view, a geographic map view, a trending keywords view, an entities view, and an individual tweets view. TweetTracker has collected 3.2 billion tweets so far, and it is adding around 700,000 new tweets every day. TwitterPoliticalIndex [334] is a social media index for US presidential elections co-developed by Twitter and Topsy Labs, a social search and analytics company with access to all Twitter data that was acquired by Apple Inc. [27]. The index visualizes tweets relevant to US elections based on political party, sentiment, locations such as states and counties, and a timeline view. TwitInfo [237, 238] provides a timeline-based event analysis framework that allows users to define events based on relevant keywords. Then, the system collects relevant tweets, categorizes them based on sentiment, and organizes them in timeline and map views in both aggregate and individual data record forms. The system addresses the scalability problems that are associated with analyzing and visualizing such large numbers of data records. STEvent [29] analyzes events based on three aspects. First, how topic initiators influence the popularity of the topic. Second, the impact of geography on popularity, by partitioning the Twitter network according to regional divisions and studying the behavior of popular and non-popular topics. Third, the effect of topology and the dynamics of topic spread on popularity.

4.3.4 Events and microblogs aggregate queries

Several aggregate querying techniques (Sect. 2.2.2) have been motivated by detecting events from large-scale microblogs data [50, 225, 305]. This includes detecting highly frequent [305] and highly trending [225] keywords that identify popular topics among users, and detecting keywords highly correlated with different locations [50] that identify localized topics of interest. Such techniques can be used as scalable infrastructures to detect events from large amounts of data. However, their core research methods focus on indexing and query processing at a large scale, which lies at a lower level of the data analysis stack compared to the techniques reviewed in this section.

4.4 Recommendations using microblogs

Microblogs represent a rich and up-to-date source of user-generated content. Therefore, they are appealing for several recommendation applications to extract up-to-date user preferences, which is essential to recommend relevant items. Although the recommendation applications that exploit microblogs data are diverse, being an up-to-date source of user preferences is the common theme that links all of them. From a data management perspective, having such large and highly changing data as a source of preferences introduces significant challenges in updating recommendation models in practice. In fact, this has triggered deep research discussions in the data management community on the ability to support recommendation models efficiently in data management systems [109, 181, 191, 192, 295, 296, 297, 298, 362]. This clearly makes a transformative shift toward a new generation of recommender systems that should be able to recommend relevant items accurately by updating models much more efficiently than their ancestor generations of recommender systems. Microblogs data plays a major role as a source of preferences for this new generation of recommender systems, and the data management research community is at the heart of addressing their challenges.

Figure 12 depicts a high-level overview about recommendation techniques using microblogs. The literature includes two major recommendation problems, recommending content and recommending friends, in addition to a set of diverse miscellaneous recommendation applications. Such applications are as diverse as recommending news items, products, question answers, events, and scholarly information. The rest of this section highlights each category.

Fig. 12. An overview of recommendations using microblogs literature

Recommending user-generated content. One of the major recommendation problems that is widely studied in the literature is recommending user-generated content, such as recommending other microblogs to read, hashtags to search, and mentions to post. NetRec [14] recommends tweets that are not visible to the user, e.g., posted by friends of friends or further, by exploiting the social network, content, and retweet analysis. The importance of invisible tweets is initially estimated by social distance. Then, both content analysis and user analysis are performed to rank highly relevant users and recommend their tweets. Content analysis is based on textual analysis using bigrams, while user analysis is based on comparing timelines and mutual retweets. BlgRec [173] combines the user's location, social network feeds, and in-app actions to infer the user's interests and develop a personalized recommendation model. A user's feed is then made up of recommended content, including trending news, social network feeds, and social content, on either local or global scales depending on the user's spatial interests. TWIMER [307] recommends tweets by formulating a query from the user's interest profile and evaluating it against probabilistic language models. Then, irrelevant and near-duplicate tweets are discarded using threshold-based filtering, locality sensitive hashing, and tweet freshness. SimGraph [74] is a scalable recommendation model based on a similarity graph that captures the mutual interest among users by analyzing retweets. The probability that a certain user likes incoming microblogs is estimated based on a propagation model that aggregates top-k tweets and recommends them to the user (see the sketch below). CmpRec [66] tackles a more fundamental functionality in microblogs recommendation by comparing two approaches to compute similarity among microblogs with brief content: a topic-based approach and a WordNet corpus-based approach. The study shows the superiority of the WordNet corpus in capturing similarity between the brief textual content of microblogs.
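As a minimal sketch of the SimGraph-style scoring step, the following ranks incoming tweets for a user by the similarity edge, mined from mutual retweets, to each tweet's author. The data layout is hypothetical, and the actual propagation model in [74] is richer than this single-hop approximation.

```python
import heapq

def recommend_tweets(similarity: dict, incoming: list, user: str, k: int) -> list:
    """Rank incoming (author, tweet_id) pairs for `user` by the
    user-to-author similarity score and keep the top k. A single-hop
    simplification of SimGraph's propagation model."""
    heap = []  # min-heap of (score, tweet_id), capped at size k
    for author, tweet_id in incoming:
        score = similarity.get(user, {}).get(author, 0.0)
        if score > 0.0:
            heapq.heappush(heap, (score, tweet_id))
            if len(heap) > k:
                heapq.heappop(heap)
    return [tid for _, tid in sorted(heap, reverse=True)]
```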

Hashtag and mention recommendation is another content recommendation task that is popular in the literature, as it lets users easily search for their topics of interest. EmTaggeR [88] learns word embeddings and uses the trained embedding model to assign hashtags. CogRec [186] proposes two cognitive-inspired hashtag recommendation techniques based on the Base-Level Learning (BLL) equation: \(\hbox {BLL}_{{I, S}}\) and \(\hbox {BLL}_{{I, S, C}}\). BLL accounts for the time-dependent decay of item exposure in human memory; it is applied once with the current tweet content (\(\hbox {BLL}_{{I, S, C}}\)) and once without (\(\hbox {BLL}_{{I, S}}\)). MRTM [204] is a personalized hashtag recommendation model based on collaborative filtering and topic modeling. It integrates user adoption behaviors, user hashtag content, and contextual information into a joint probabilistic latent factor model to recommend hashtags to users. MenRec [222] addresses the problem of using both texts and images of microblogs for mention recommendation. It proposes a cross-attention memory network that considers the content of a tweet, the interests of the user, and the interests of the author to recommend a user to be mentioned for a certain tweet.
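For reference, the Base-Level Learning equation originates from the ACT-R cognitive architecture, where the activation of an item grows with its frequency and recency of use. In the hashtag setting, a common form (how CogRec weights and combines it in the content-aware variant may differ) is:

\[ \mathrm{BLL}(h,u) \;=\; \ln\left( \sum_{j=1}^{n} \left(t_{\mathrm{ref}} - t_j\right)^{-d} \right) \]

where \(t_1, \ldots, t_n\) are the timestamps of user \(u\)'s previous usages of hashtag \(h\), \(t_{\mathrm{ref}}\) is the recommendation time, and \(d\) is a decay parameter (commonly 0.5 in ACT-R). Recently and frequently used hashtags thus receive higher activation, capturing the time-dependent decay described above.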

Recommending friends. Another recommendation problem that is widely studied in the literature by researchers from academia and industry (specifically Twitter) is recommending users to follow to expand and enhance the social graph's connected components. Twittomender [132] started exploiting the real-time nature of microblogs by dynamically profiling users through their recent microblogs. Then, collaborative filtering techniques are used to recommend users with similar interests. FURec [352] tackles the problem from a different angle and recommends top-k users who will likely interact with microblog posts of a certain focal user. It uses the existing follower network, creates a new network based on retweets and mentions, and then composes a single hybrid network to recommend new users. The problem is also studied and realized by Twitter Inc. [129, 130, 300], where substantial contributions in enriching connections between Twitter users are made. The Who to Follow (WTF) project [129, 300] started to recommend users to follow and enrich the Twitter social graph. The core of the system is the Cassovary in-memory graph processing engine and a novel technique for performing user recommendation, called Stochastic Approach for Link-Structure Analysis (SALSA). SALSA constructs a bipartite graph that includes the user's circle of trust on the left side, while the right side includes users who are followed by the users on the left side. Then, this bipartite graph is traversed, ranking scores are assigned, and users are recommended accordingly (see the sketch below). Approximation algorithms are also provided in the second generation of WTF to reduce the complexity of processing hundreds of millions of users. To exploit the time aspect of Twitter data, Twitter added MagicRecs [130], which recommends users who are followed by friends within certain temporal constraints. To expand Twitter's recommendation services, they added content recommendation through GraphJet [300], which is based on a bipartite graph similar to the one maintained in the WTF system, except that the right side models actual user tweets. A random walk on this graph with a fixed probability of reset outputs a ranked list of vertices that represent the tweets to be recommended to the user.
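The following is a simplified sketch of SALSA-style scoring over such a bipartite graph, assuming the structure described above: the circle of trust forms the left (hub) side, and the accounts its members follow form the right (authority) side. Weight alternates between the two sides for a few iterations, and right-side scores rank the follow recommendations. This is an illustrative sketch, not Twitter's implementation.

```python
def salsa_scores(circle_of_trust: list, follows: dict, iterations: int = 10) -> dict:
    """Alternate weight between hubs (circle of trust) and authorities
    (followed accounts) over a bipartite graph; the resulting authority
    scores rank candidate users to follow."""
    hubs = {u: 1.0 / len(circle_of_trust) for u in circle_of_trust}
    # Precompute reverse edges, restricted to the circle of trust.
    followers = {}
    for u in circle_of_trust:
        for v in follows.get(u, []):
            followers.setdefault(v, []).append(u)
    auths = {}
    for _ in range(iterations):
        # Hub weight spreads evenly over the accounts each hub follows.
        auths = {v: 0.0 for v in followers}
        for u, w in hubs.items():
            out = follows.get(u, [])
            for v in out:
                auths[v] += w / len(out)
        # Authority weight spreads back evenly over its followers.
        hubs = {u: 0.0 for u in hubs}
        for v, w in auths.items():
            for u in followers[v]:
                hubs[u] += w / len(followers[v])
    return auths  # higher score = stronger follow recommendation
```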

Miscellaneous recommendation applications. A significant portion of the literature recommends miscellaneous items/users, where the common theme is using microblogs as an up-to-date source of user preferences. NewsRec [268] re-ranks news items based on user preferences extracted from tweets. The user's tweets and RSS news feeds are both processed by a preference extraction module that identifies keywords common to both. These keywords are then used to promote relevant news in the news feed timeline, so important news appears early to users. METIS [384] recommends products based on detecting purchase intent from microblogs data in near real time, combining this model with traditional offline models similar to those of e-commerce website recommendations, e.g., Amazon. Such exploitation of real-time user-generated data has enhanced the effectiveness of product recommendation models. Another recommendation model that handles the cold-start problem for product recommendation by exploiting user-generated microblogs is CSPR [385]. CSPR uses data from microblogging users with no historical purchase records to map users' attributes extracted from microblogs into feature representations learned from e-commerce websites. Thus, given a microblogging user, a personalized ranking of recommended products can be generated to overcome the cold-start problem. EvenRec [232] exploits geotagged microblogs to recommend events from Eventbrite, a popular event organization website. The extracted events depend on microblogs' locations that are fed to item-user models. This work is orthogonal to event recommendation in event-based social networks [84, 122, 347], e.g., Meetup.com, which have a different nature than microblogging platforms and are thus beyond the scope of this paper. CRAQ [312] recommends potential answers to a posted question by selecting a group of potential authority users based on their topically relevant microblogs. The candidate group is then iteratively filtered by discarding non-informative users, and the top-k relevant microblogs are determined as potential answers. Jury [53] recommends potential authority users who are able to answer a given question. It adapts a probabilistic model that selects a set of users such that the probability of obtaining a wrong answer is minimized. SchRec [364] recommends scholarly information through microblogs posted by researchers about their latest findings or research resources. Two neural embedding methods are proposed to learn vector representations for both users and microblogs, and recommendations are made by measuring the cosine distance between a given microblog and a user.
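SchRec's final ranking step reduces to a nearest-neighbor search in a shared embedding space. A minimal sketch follows, assuming the embeddings are already learned (the embedding methods themselves are the core contribution of [364] and are not shown):

```python
import numpy as np

def recommend_by_cosine(user_vec: np.ndarray, microblog_vecs: dict, k: int) -> list:
    """Return the ids of the k microblogs whose learned vectors are
    closest to the user's vector in cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    ranked = sorted(microblog_vecs.items(),
                    key=lambda kv: cosine(user_vec, kv[1]),
                    reverse=True)
    return [mid for mid, _ in ranked[:k]]
```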

4.5 Automatic geotagging

Geo-locations are heavily exploited in several microblogs applications, such as localized event detection [2], geo-targeted advertising [256], local news extraction [294], user interest inference [126], and finding local active users [163]. Despite this importance of geo-location data in microblogs applications, the majority of microblogs are still not associated with precise location information. In fact, only a small percentage (< 4%) of popular microblogging data, e.g., Twitter, is associated with locations sourced from user devices. This triggered a need to automatically associate location information with more microblogs data so that as many microblogs as possible can be exploited in location-aware applications. However, traditional geotagging techniques are of limited use for enriching microblogs location data due to the brevity of microblogs textual content. Such brief text contains many abbreviations and noisy words that make it hard for named entity recognizers to extract accurate places and locations. In this section, we give an overview of new techniques in the literature that are designed to extract locations from microblogs data. Although traditional geotagging techniques depend purely on linguistic analysis to extract locations, recent geotagging techniques on microblogs go beyond this to identify top-k locations for both users and data records, as elaborated later in this section. This recent paradigm overlaps with and makes use of certain indexing and query processing techniques from the data management literature. Thus, automatic geotagging on microblogs is leaning toward making more use of data management infrastructures in addition to linguistic techniques.

Figure 13 classifies the literature at a high level into techniques that use a single microblog record at a time for geotagging and techniques that use collections of microblogs. Figure 14 shows frameworks for the two types of techniques. In fact, most microblogs geotagging techniques in the literature depend on classification models to assign location(s) to one microblog at a time. Figure 14a shows a geotagging framework that is induced from existing work on microblogs. The framework consists of two stages. The first stage is a feature extraction stage that extracts keywords and named place entities from the brief textual content of training microblogs. The extracted keywords and places are used to train the classification model. For each incoming microblog, the classifier assigns a location based on its textual content features. The location classification is performed through different models, such as probabilistic models [199, 274, 291], multinomial naive Bayes [141], lexical matching [158], ensembles of statistical and heuristic classifiers [235], pure place entity recognition [210], gazetteer verification [13, 95], and matrix factorization [94]. A minimal instance of this framework is sketched after the figures below.

Fig. 13. An overview of microblogs automatic geotagging literature

Fig. 14. Frameworks for microblogs automatic geotagging
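To make the single-record framework of Fig. 14a concrete, the following minimal sketch trains a multinomial naive Bayes classifier (one of the model choices cited above [141]) over bag-of-words features and assigns a location label to each incoming microblog. The training data and labels are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training microblogs paired with known locations (e.g., GPS-derived labels).
train_texts = ["heavy traffic on the loop downtown", "beach day at the pier"]
train_locations = ["Chicago", "Los Angeles"]

# Stage 1: extract unigram/bigram keyword features; stage 2: train classifier.
geotagger = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
geotagger.fit(train_texts, train_locations)

# Each incoming microblog is assigned the most probable location label.
print(geotagger.predict(["stuck in loop traffic again"]))  # ['Chicago']
```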

A common problem in these techniques is the trade-off between error distance and classification precision, where error distance is the distance between the actual and predicted locations. The precision drops significantly for practical margins of error distance. For example, with an error within 100 m, the precision ranges from 10 to 20% for different techniques. Increasing the error distance to 30 km raises the precision to 60–80%, and with an error distance of 100+ km, the precision reaches 80–90%. Therefore, accurate location prediction comes with very low precision, where 80–90% of the data is mistakenly geo-located, while significantly increasing the error distance makes the predicted locations not useful for practical applications.

To overcome this problem, a state-of-the-art technique [199] proposed to process microblogs as collections instead of individual records, as depicted in Fig. 14b. The technique gathers all microblogs of each user into one collection and performs exact and fuzzy location extraction on them to identify all possible locations for this user. Then, top-k locations for each user are predicted and identified as the most likely locations from which the user posts microblogs. When a new microblog arrives, a set of top-k locations is extracted from the microblog's content and metadata. Then, the k microblog locations and the k user locations are fed into a location refinement module that predicts the final top-k microblog locations. This technique has shown a tremendous enhancement in prediction precision and recall (95+%) within a 100 m error distance, which is the threshold for accurate location prediction.
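The following sketches the two stages of this collection-based framework under simplifying assumptions: candidate user locations are mined by exact gazetteer matching over the user's collection (the cited technique [199] also applies fuzzy matching), and the refinement step is reduced to a toy heuristic that prefers candidates on which the incoming tweet and the user history agree.

```python
from collections import Counter

def user_top_k_locations(user_microblogs: list, gazetteer: set, k: int) -> list:
    """Stage 1: count exact gazetteer mentions across a user's whole
    collection and keep the k most frequent candidate locations."""
    counts = Counter()
    for text in user_microblogs:
        lowered = text.lower()
        for place in gazetteer:
            if place in lowered:
                counts[place] += 1
    return [place for place, _ in counts.most_common(k)]

def refine(tweet_locations: list, user_locations: list, k: int) -> list:
    """Stage 2 (toy heuristic): rank candidates appearing in both the
    incoming tweet and the user's top locations first, then the rest."""
    agreed = [p for p in tweet_locations if p in user_locations]
    rest = [p for p in tweet_locations + user_locations if p not in agreed]
    return (agreed + rest)[:k]
```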

4.6 Other analysis tasks

The analysis tasks reviewed in the previous sections represent the major high-level analysis tasks on microblogs that are of interest to the community of data management and analysis researchers. However, the microblogs literature and applications are too rich to enumerate all possible analysis types or techniques. In fact, other analysis tasks are sporadically addressed on microblogs in both (1) the academic community, such as news extraction [268, 294], topic extraction [143, 201, 277], summarization [47, 96, 119], situational awareness [289, 303], and resource needs matching [41, 42], and (2) the industrial community, such as geo-targeted advertising [256] and generic social media analysis [324, 382]. Yet, the reviewed literature represents the main high-level analysis tasks performed, spanning a wide variety of interests, applications, and novel research challenges as well as future research opportunities.

5 Conclusions and future directions

This paper has provided a comprehensive review of major research work and systems for microblogs data management and the corresponding analysis tasks. The paper categorized the literature into three parts: data indexing and querying, data management systems, and data analysis, where each part is further divided into sub-topics. The data indexing and querying part has reviewed microblogs query languages, individual indexing and query processing techniques, and main-memory management techniques. The systems part has reviewed the characteristics of different genres of big data systems, e.g., batch processing systems, big data indexing systems, and key-value stores, in terms of their adequacy to handle microblogs query workloads. It has also discussed the challenges and solutions that these systems provide for fast data, highlighting their potential limitations in handling certain microblogs applications. The data analysis part provided a detailed roadmap for the major analysis tasks that directly or indirectly make use of the data management literature: visual analysis, user analysis, event detection and analysis, recommendations, and automatic geotagging. For each task, we presented a generic framework, when applicable, that is induced from major techniques in the literature and drives the main research innovations for this task. In addition, we classified the literature based on the major components of this framework to provide a better understanding of different techniques and highlight existing challenges and future opportunities in this research direction.

The rich literature of research on microblogs data faces several big challenges and is still rich with opportunities on different fronts. In terms of data management, there are several research opportunities in real-time indexing, query optimization, and system-level integration. For real-time indexing, the microblogs literature does not provide a comprehensive study of supporting spatial-keyword queries on real-time data. This has not been studied before either in existing spatial-keyword querying techniques [54,55,56, 63,64,65, 69, 72, 73, 127, 196, 200, 206, 219, 223, 233, 234, 343, 371, 373], which focus on traditional static datasets, e.g., restaurants, or in existing microblogs indexing, which considers the spatial-keyword combination only in aggregate queries that retrieve frequent or trending keywords [50, 225, 305]. Existing specialized systems for microblogs support two separate indexes, a keyword index and a spatial index, as a generic option that allows supporting various queries with few system resources (a sketch of this design is given below). However, it is not clear how much performance is lost compared to hybrid indexing strategies. Quantifying such performance losses will enable a better understanding of the parameters that control querying performance on different indexes, which in turn will allow optimizing each index. Such understanding contributes to developing query optimization models for real-time data management, as elaborated below. In addition to spatial-keyword queries, social information is still underutilized in supporting scalable personalized queries on real-time microblogs data. Although a few techniques exploit this information [205, 211], these queries still suffer from inherent scalability limitations due to the overhead of supporting hundreds of millions of users while sustaining efficient data digestion, indexing, and querying in real time.
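For illustration, the two-separate-index design mentioned above can be sketched as follows: a spatial-keyword query is answered by intersecting candidates from an inverted keyword index with candidates from a spatial index (here, a naive uniform grid restricted to one cell). A hybrid index would instead prune both dimensions at once; the grid layout and single-cell query are simplifying assumptions.

```python
from collections import defaultdict

class SeparateIndexes:
    """Two separate per-dimension indexes over incoming microblogs:
    an inverted keyword index and a uniform-grid spatial index.
    A spatial-keyword query intersects their candidate sets."""
    def __init__(self, cell_size: float = 1.0):
        self.keyword_index = defaultdict(set)  # term -> microblog ids
        self.spatial_index = defaultdict(set)  # grid cell -> microblog ids
        self.cell_size = cell_size

    def _cell(self, lat: float, lon: float) -> tuple:
        return (int(lat // self.cell_size), int(lon // self.cell_size))

    def insert(self, mid: int, text: str, lat: float, lon: float) -> None:
        for term in text.lower().split():
            self.keyword_index[term].add(mid)
        self.spatial_index[self._cell(lat, lon)].add(mid)

    def query(self, term: str, lat: float, lon: float) -> set:
        # Intersect the two candidate sets; a hybrid spatial-keyword
        # index would avoid materializing either set in full.
        return self.keyword_index[term.lower()] & self.spatial_index[self._cell(lat, lon)]
```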

Despite the richness of exploring real-time indexing on microblogs, there is almost no work studying the implications of these novel indexing techniques on query optimization models. For example, traditional selectivity estimation models assume relatively stable index content that is dominated by read operations and encounters far fewer write operations. This assumption does not hold for microblogs real-time indexes, which have highly dynamic content. In addition, microblogs indexes are segmented based on temporal and spatial ranges, which gives room for compressing estimation models to serve such excessive amounts of data with limited storage requirements. In general, the implications of new real-time indexing techniques on traditional query optimization models need to be revisited for microblogs.

Integrating all existing and future techniques of microblogs data management into end-to-end systems is a must to widen the impact of existing data management technology on microblogs applications. Recently, extensive efforts have started to develop end-to-end systems to support microblogs data, as elaborated in Sect. 3. However, there is still a gap between the available research techniques and their applicability for system-level integration. For example, existing aggregate query techniques face challenges in being integrated with microblogs systems, as they cannot be supported efficiently using existing indexes and require separate ones, which is not favorable from a system point of view. So, new ways need to be devised to integrate aggregate data structures within the index cells of non-aggregate queries at the system level. Another example is flushing policies, which are far more developed in standalone indexes than those supported at the system level. This is due to a lack of integration techniques that allow flexible flushing policies while maintaining real-time performance.

In terms of data analysis, there are several untackled challenges on two levels: enhancing the analysis modules and integrating them with microblogs systems to extend their functionality for enriching and facilitating microblogs applications. Many examples can be induced from the reviewed literature; we highlight a few of them in different analysis areas. First, developing a unified event detection framework that allows users to express different types of event-based queries. Such a framework would exploit the rich literature of event detection and analysis on microblogs to provide common utilities that allow effective and efficient event queries. Second, real-time geotagging of microblogs data. Although recent work has started to tackle this problem [94, 95], there are still challenges in reducing the geotagging time due to the high computational cost of this task. Achieving the goal of attaching locations to microblogs as they arrive would widely impact the plethora of location-aware applications that are built on top of microblogs. Third, integrating the rich literature of user analysis techniques with the scalable data management infrastructures, e.g., indexes and query processors, in microblogs systems. Such integration would allow a variety of user-centric applications to be supported at scale. Fourth, developing a unified recommendation framework that exploits microblogs data and allows users to express a variety of recommendation queries flexibly. Such a framework would serve the diverse set of applications that are reviewed in Sect. 4.4. The envisioned unified framework could exploit existing work on supporting generic recommendation queries in data management systems [198, 296,297,298].

In addition to enhancing different analysis modules, there is a dire need to integrate this rich literature of analysis techniques with microblogs data systems to widen the impact of microblogs research in a practical sense. Such integration would have a tremendous impact on a plethora of applications that benefit society, the research community, and businesses, including public health, disaster response, public safety, and education. The feasible way to achieve such a goal is to abstract different analysis tasks on microblogs into basic building blocks that can be supported in microblogs systems, inspired by the SELECT-PROJECT-JOIN building blocks in SQL database management systems. Such a task is huge and should start with developing generic frameworks for different analysis tasks, as discussed earlier for event detection and recommendations and as provided throughout Sect. 4.