1 Introduction

Online activities can be associated with dramatic offline effects, such as voter fraud misinformation contributing to the 6 January 2021 riots and invasion of the US Capitol building in Washington DC (Scott 2021), COVID-19 misinformation leading to panic buying of toilet paper (Yap 2020), online narratives incorrectly attributing Australia’s “Black Summer” bushfires to arson amplifying public attention to it via the media (Weber et al. 2020a), and attempts to influence domestic and foreign politics (Ratkiewicz et al. 2011; Woolley 2016; Morstatter et al. 2018; Woolley and Howard 2018). For researchers to successfully analyse online activity and provide advice about protection from such events, they must be able to reliably analyse data from online social networks (OSNs).

Social network analysis (SNA) facilitates exploration of social behaviours and processes. OSNs are often considered convenient proxies for offline social networks, because they seem to offer a wide range of data on a broad spectrum of individuals, their expressed opinions and interrelationships. It is assumed that the social networks present on OSNs can inform the study of information dissemination and opinion formation, contributing to an understanding of offline community attitudes. Though such claims are prevalent in the social media literature, there are serious questions about their validity due to an absence of SNA theory on online behaviour, the mapping between online and offline phenomena, and the repeatability of such studies. In particular, the issue of reliable data collection is fundamental. Collection of OSN data is often prone to inaccurate boundary specifications due to sampling issues, collection methodology choices, as well as platform constraints. The establishment of datasets in which the research community can have confidence, as well as the ability for the replication of studies, including through common benchmarks, is vital for the validation of research findings (Assenmacher et al. 2021).

Previous work has considered the question of data reliability from a variety of perspectives. Broadly speaking, questions of how to reason about data quality appeared in the late 1960s in statistics but were picked up by management research in the 1980s and computer science in the 1990s as part of database and data warehouse research (Scannapieco et al. 2005). The dimensions described by Scannapieco et al. (2005) provide a structured way to reason about data quality in terms of accuracy, completeness, time-related measures and consistency. It is increasingly apparent that data heavy disciplines such as machine learning (ML) cannot rely on their techniques and a simple abundance of data to overcome these issues (Roccetti et al. 2020). Even if data is available, some ML techniques can still struggle if its distribution is uneven (Sun et al. 2009) and the ‘cleanliness’ of data can be a significant factor in the performance of ML systems (Breck et al. 2019; Roccetti et al. 2020). Data quality is also especially important for modern Big Data systems (Emani et al. 2015), including those underpinning OSNs, but those using OSN Application Programming Interfaces (APIs) can be assured of high-quality data, at least with regard to the completeness of the schemas and validity of the values they provide.

Turning to OSN data specifically, relevant research into reliability has explored sampling (Morstatter et al. 2013; González-Bailón et al. 2014; Joseph et al. 2014; Paik and Lin 2015), biases (Ruths and Pfeffer 2014; Tromble et al. 2017; Pfeffer et al. 2018; Olteanu et al. 2019) and the danger of making invalid generalisations while relying on the promise of Big Data without first developing a nuanced understanding of the data (Lazer et al. 2014; Tufekci 2014; Falzon et al. 2017; Venturini et al. 2018). Analyses of incomplete networks exist (Holzmann et al. 2018), but this paper specifically addresses the question of data reliability for SNA, considering not only the significance of online interactions for discovering meaningful social networks, but also how sampling and boundary issues can complicate analyses of the networks constructed. Through an exploration of modelling and collection issues, and a measurement study examining the reliability of simultaneously collected, or parallel, datasets, this multidisciplinary study addresses the following research questions:

  • To what extent do datasets obtained with social media collection tools differ, even when the tools are configured with the same search settings?

  • How do variations in collections affect the results of social network analyses?

Our work makes the following contributions:

  1. Discussion of the challenges mapping OSN data to meaningful social and information networks;

  2. A methodology for systematic dataset comparison;

  3. Recommendations for the use and evaluation of social media collection tools; and

  4. Five original social media datasets collected in parallel, and relevant analysis code.Footnote 1

This paper extends Weber et al. (2020b) primarily with the introduction of three further case studies, in which we vary the use of collection tool features, but also with a dedicated discussion section and expanded conclusion with recommendations for social media analysts and researchers. Specifically, we have added the following:

Literature: a broader examination of related literature;

Social media data: a deeper examination of challenges arising from the collection of social media data; discussion of the balance between interactions apparently common between platforms and their platform-specific semantics when developing methods with cross-platform applicability; a deeper consideration of the nuances relating to the mechanics of constructing social networks from social media interactions;

Datasets: three more case studies, each with parallel datasets, and an experimental framework incorporating factors to illuminate the effect of certain “value-added” collection tool features—specifically, the new case studies explore the disruption introduced by ‘smart’ features, language and terminology clashes, the effect of API credentials, and the limits of certain statistics;

Methodology: further detail regarding the metrics employed in the comparison methodology;

Analysis: deeper analyses of the Q&A datasets, including examination of their content and activity over time; new visualisations of network statistics that facilitate comparison; and a summary of findings in each case study, raising further questions, which are then considered in the following case studies;

Discussion: a summary of lessons from the case studies regarding the effects observed on various statistics, considerations regarding language and terminology clashes, and effects from query term choice, the size of datasets and those introduced by the platforms themselves; thoughts offered regarding the issue of representativeness, which relates to (perhaps unobtainable) dataset completeness and a general measure of dataset ‘reliability’; and

Recommendations: an expanded conclusion that includes observations and recommendations for social media analysts and researchers.

This expansion aims to better demonstrate the use of our systematic comparison methodology to facilitate research in social media analytics.

Five sections follow from this point: Sect. 2 addresses challenges obtaining and modelling social networks from OSN data for SNA; Sect. 3 describes our methodology for systematic parallel dataset comparison; Sect. 4 presents results from using our methodology in a number of case studies; Sect. 5 discusses our findings and provides an exploration of the notion of a measure of reliability; and finally Sect. 6 offers recommendations for social media researchers and analysts, plus directions for future research.

2 Social networks from social media data

Using SNA to explore social behaviours and processes from OSN data presents many challenges. Most easily accessible OSN data consists of timestamped interactions, rather than details of long-standing relationships, which form the basis of SNA theory. Additionally, although interactions on different OSNs are superficially similar, how they are implemented may subtly alter their interpretation. They offer a window into online behaviour only, and any implications for offline relations and behaviour are unclear. Beyond modelling and reasoning with the data is the question of collection—accessing the right data to construct meaningful social networks is challenging. OSNs provide a limited subset of their data through a variety of mechanisms, balancing privacy and competitive advantage with openness and transparency.

2.1 Interactions and relationships online

SNA provides concepts and tools to model social relationships among actors. It is based on the premise that an actor’s position in the network impacts their ability to access opportunities and resources and therefore allows us to understand social behaviours and processes in network terms (Borgatti et al. 2013). Given the availability, nature and structure of much OSN data, the use of network-based techniques is a natural choice for the analysis of online social behaviour.

There is, however, an important distinction between the relatively stable, long-term relationships that are typically studied in SNA and the social connections among online actors (Wasserman and Faust 1994; Nasim 2016; Borgatti et al. 2009). On social media, accounts can easily fulfil the role of actors, but precisely what constitutes a relationship is unclear. An obvious candidate is the friend or follower relationship common to most OSNs, but, due to how OSNs present their specific features to users, each online community develops its own social relation culture. Therefore, such connections do not necessarily easily translate between OSNs. Is a Facebook friendship really the same as a follow on Twitter, even if it is reciprocated? And how does each relate to offline friendships?

Table 1 Equivalent social media interaction primitives

OSNs offer ways to establish and maintain relations with others. This is done through interactions, many of which are common between OSNs, such as replying to the posts of others, mentioning others (causing the mentioning post to appear in the mentioned user’s activity feed), using hashtags to reach broader communities, or sharing or reposting another’s post to one’s followers or friends. A sample of interactions with equivalents on different OSNs is offered in Table 1. (N.B. We distinguish interactions from following or friending actions, which define information flows (i.e. they tell the OSN where to send posts), which are persistent once created.) Specific interactions may be visible to different accounts, intentionally or incidentally (cf. replying to one post versus using a hashtag). Exploration of these differences may lead to an understanding of the author’s intent and the identity of the intended audience. Is replying to a politician’s Facebook post a way to connect directly with the politician, or is it a way to engage with the rest of the community replying to the post, either by specifically engaging with dialogue or merely signalling one’s presence with a comment of support or dismay? A reply could be all of these things but, in particular, it is evidence of engagement at a particular time and indicates information flow between individuals (Bagrow et al. 2019). Since most online interactions are directed towards a particular individual or group, they offer an opportunity to study the flow of information and influence. On the other hand, although friend and follower connections may indicate community membership, they obscure the currency of that connection. Through their dynamic interactions, a user who liked a Star Wars page ten years ago can be distinguished from one who not only liked it, but posted original content to it on a monthly basis. Therefore, we specifically focus on interactions rather than friend and follower relations in this study.

2.2 Social network analysis theory

Relationships between individuals in a social network may last for extended periods of time, vary in strength, and be based upon a variety of factors, not all of which are easily measurable. Because of the richness of the concept of social relationships, data collection for SNA is often a qualitative activity, involving directly surveying community members for their perceptions of their direct relations and then perhaps augmenting that data with observational data such as recorded interactions (e.g. meeting attendance, emails, phone calls). Just as it is tempting to believe that delving into Big Data will bring quick rewards, only to discover that extracting semantic information can be remarkably challenging (Emani et al. 2015), it is tempting to believe that the richness of social relationships should be discoverable in the vast amount of interaction data provided by OSNs, but there are issues to consider:

  1. Links between social media accounts may vary in type and across OSNs—it is unclear how they contribute to any particular relationship;

  2. What is observed online is only a partial record of interactions in a relationship, where interactions may occur via other OSNs or online media, or entirely offline; and

  3. Collection strategies and OSN constraints may also hamper the ability to obtain a complete dataset.

Although many interactions seem common across OSNs (e.g. a retweet on Twitter resembles a repost on Tumblr and a share on Facebook), nuances in how they are implemented and how data retrieved about them is modelled (beyond questions of semantics) may confound direct comparison. For example, a Twitter retweet refers directly to the original tweet, obscuring any chain of accounts through which it has passed to the retweeter (Ruths and Pfeffer 2014). There are efforts to probabilistically regenerate such chains (Rizoiu et al. 2018; Gray et al. 2020), but, in any case, is one account sharing the post further evidence of a relationship? What if it is reciprocated once, or three times? What if the reciprocation occurs only over some interval of time? These questions require careful consideration before SNA can be applied to OSN data.

2.3 Challenges obtaining OSN data

Social media data is typically accessed via an OSN’s APIs, which place constraints on how true a picture researchers can form of any relationship. Via its API an OSN can control: how much data is available, through rate limiting, biased or at least non-transparent sampling, and temporal constraints; what types of data are available, through its data model; and how precisely data can be specified, through its query syntax. Many OSNs offer commercial access, which provides more extensive access for a price, though use of such services in research raises questions of repeatability (Ruths and Pfeffer 2014; Assenmacher et al. 2021). This is done to protect users’ privacy but also to maintain competitive advantage. Researchers must often rely on the cost-free APIs, which present further issues. Twitter’s 1% Sample API has been found to provide highly similar samples to different clients, and it is therefore unclear whether these are truly representative of Twitter traffic (Joseph et al. 2014; Paik and Lin 2015). If the samples were truly random, then they ought to be quite distinct, with only minimal overlap. Studying social media data therefore raises questions about “the coverage and representativeness” (González-Bailón et al. 2014, p.17) of the sample obtained and how it therefore “affects the networks of communication that can be reconstructed from the messages sampled” (González-Bailón et al. 2014, p.17).

Empirical studies have compared the inconsistencies between collecting data from search and streaming APIs using the same or different lists of hashtags. Differences have been discovered between the free streaming API and the full (commercial) “firehose” API (Morstatter et al. 2013). There is general agreement in the literature that the consistency of networks inferred from two streaming samples is greater when there is a high volume of tweets even when the list of hashtags is different (González-Bailón et al. 2014). More concerning is the ability to tamper with Twitter’s sample API to insert messages (Pfeffer et al. 2018), introducing unknown biases at this early stage of data collection (Tromble et al. 2017; Olteanu et al. 2019).

Assuming that Big Data will provide easy success without deep understanding of the data can also lead to inappropriate generalisations and conclusions (Lazer et al. 2014; Tufekci 2014; Emani et al. 2015). This is well illustrated, for example, by the range of motivations behind retweeting behaviour including affirmation, sarcasm, disgust and disagreement (Tufekci 2014). Similarly, in the study of collective action, there are important social interactions that occur offline (Venturini et al. 2018). Furthermore, relying solely on observable online behaviours risks overlooking passive consumers, resulting in underestimating the true extent to which social media can influence people (Falzon et al. 2017).

Big Data and its precursors in databases and data warehouses have had to address issues of data quality since the late 1960s (Scannapieco et al. 2005), both in terms of the cleanliness of the data (e.g. missing or incorrect values, poorly designed schemas, difficulties in the enforcement of consistency or other validation practices) as well as techniques to manage the distribution of values within the data. ML algorithms have long benefited from techniques to manage class imbalance for classifiers (Sun et al. 2009), and careful human input is very much needed to guide ML system design: Roccetti et al. (2020) describe their experiences studying faulty water meters in Italy, finding the contribution of subject matter experts invaluable in defining ‘clean’ data to train ML classifiers. Others have begun to systematise how to study the effect of data quality on the performance of ML algorithms (Foidl and Felderer 2019; Breck et al. 2019), though the phenomenon is long known (Sessions and Valtorta 2006).

In the case of OSN data, the quality of the data is high (as it has already been processed by the OSN platforms) and thus the further challenges are at least twofold:

  • To determine the completeness of a given dataset; and

  • To extract meaningful network information (i.e. semantic information) from datasets using OSN-specific schemas, which are provided by OSN-specific APIs, many of which have unique and idiomatic characteristics.

For the first challenge, it is unclear when a dataset obtained via an OSN’s API is complete, because only the OSN knows the extent of its holdings and whether all query results have been provided. Repeatability requires that a query returns the same results (ignoring other effects, such as the introduction or removal of data, i.e. adding new posts or losing them when rate limits are reached); however, it is not necessary for complete results to be returned, only the same results. The primary requirement for repeatability comes from benchmarking, and recent efforts have begun to examine how to ensure repeatability for benchmarking without a requirement for complete results (Assenmacher et al. 2021). The second challenge requires careful design of networks from the data available, including an awareness of what information can be extracted from particular OSNs’ data models and, therefore, how transferable methods applied to the data of one OSN are to the data of another.

OSN APIs provide data by streaming it live or through retrieval services, both of which make use of OSN-specific query syntaxes. Conceptually, therefore, there are two primary collection approaches to consider: 1) focusing on a user or users as seeds (e.g. Gruzd 2011; Morstatter et al. 2018; Keller et al. 2017) using a snowball strategy to discover the accounts that surround them (Goodman 1961) and 2) using keywords or filter terms, defining the community as the accounts that use those terms (e.g. Ratkiewicz et al. 2011; Ferrara 2017; Morstatter et al. 2018; Woolley and Guilbeault 2018; Bessi and Ferrara 2016; Nasim et al. 2018). Focusing on seeds can reveal the flow of information within the communities around the seeds, while a keyword-based collection provides the ebb and flow of conversation related to a topic. These approaches can be combined, as exemplified by Morstatter et al. (2018) in their study of the 2017 German election: an initial keyword-based collection was conducted for eleven days to identify the most active accounts, the usernames of which were then used as keywords in a subsequent six-week collection.

Once a reasonable dataset is obtained, there may be benefit in stripping what Foidl and Felderer (2019) call ‘context-dependent’ Data Smells. This includes junk content introduced by automated accounts such as bots (Ferrara et al. 2016; Davis et al. 2016). The question, however, of whether to remove content from social bots (bots that actively pretend to be human) depends on the research question at hand; because humans are easily fooled by social bots (Cresci et al. 2017; Nasim et al. 2018; Cresci 2020), their contribution to discussions may still be valid (unlike, e.g. that of a sport score announcement bot). Several studies have examined how humans and bots interact, especially within political discussions (Bessi and Ferrara 2016; Rizoiu et al. 2018; Woolley and Guilbeault 2018).

So far, the following has been established:

  • The OSN information selected and used to form ties in social networks requires careful consideration to ensure meaningfulness;

  • Uncertainty regarding the completeness of OSN data (due to rates of access, accessibility of data models, query construction and OSN owner commercial or other priorities) must be accounted for; and

  • Because OSNs maintain Big Data systems as infrastructure, researchers can rely on them to have carried out many tasks associated with data quality by the time they request data from the APIs (e.g. ensuring schema consistency and valid values)—these are tasks that other SNA researchers, such as those collecting data through direct community interaction, must do themselves.

We are now in a position to empirically examine more closely the issue of repeatability, by comparing simultaneously retrieved collections.

3 Methodology

Our initial hypothesis was that if the same collection strategies were used at the same time, then each OSN would provide the same data, regardless of the collection tool used. Consequently, social networks built from such data using the same methodology should be highly similar, in terms of both network- and node-level measurements. Our methodology consisted of these steps:

  1. Conduct simultaneous collections on an OSN using the same collection criteria with different tools.Footnote 2

  2. Compare statistics across datasets.

  3. Construct sample social networks from the data collected and compare network-level statistics.

  4. Compare the networks at the node level.

  5. Compare the networks at the cluster level.

Examining the parallel datasets in each of these ways provides the opportunity for the analyst to develop a well-rounded understanding of the participants in an online discussion, their behaviour, how they relate to each other and the communities they form.

3.1 Scope

The scope of this work is limited to datasets obtained via streaming APIs filtered with keywords. Other collection styles may start with seed accounts and collect their data and the data of accounts connected to them, either through interaction (e.g. via comments, replies or mentions) or via follower links, as mentioned above. Such collections (especially follower networks) often require the collection of data that is prohibitively costly to obtain, is immediately out of date, and provides no real indication of the strength of relationships, as discussed in Sect. 2.1. Additionally, in the absence of a domain-focused research question to inform the choice of seed accounts, no particular accounts would make sensible seeds, so here we rely on keyword-based collections.

3.2 Data collection

Twitter was chosen as the source OSN due to the availability of its data, the fact that the data it provides was thought to be highly regular (Joseph et al. 2014), and because it has similar interaction primitives to other major OSNs. Twitter is also widely used in academia for research that makes predictions, in particular predictions about population-level events, behavioural patterns and information flows, such as studies of predicting social unrest (Tuke et al. 2020) or misinformation (Wu et al. 2016). The validity of these predictions is fundamentally based on the consistency of the underlying (accessible) data. Two very different collection tools were chosen:

TwarcFootnote 3 is an open-source library which wraps Twitter’s API and provided the baseline for the study.

RAPID (Real-time Analytics Platform for Interactive Data Mining) (Lim et al. 2018) is a social media collection and data analysis platform for Twitter and Reddit. It enables filtering of OSN live streams, as well as dynamic topic tracking, meaning it can update filter criteria in real time, adding terms popular in recent posts and removing unused ones.

Both tools facilitate filtering Twitter’s Standard version 1.1 live streamFootnote 4 with keywords, providing datasets of tweets as JSON objects.
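For illustration, the sketch below shows a minimal keyword-filtered stream collection of this kind, assuming the twarc 1.x client; it is not the exact collection script used in this study, and the credential values and output path are placeholders.

```python
# Minimal sketch of a keyword-filtered stream collection, assuming the
# twarc 1.x client (Twarc.filter wraps the Standard v1.1 filtered stream).
# Credential values and the output path are placeholders.
import json
from twarc import Twarc

twarc = Twarc("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

keywords = "qanda"  # comma-separated filter terms

with open("stream.jsonl", "w", encoding="utf-8") as out:
    for tweet in twarc.filter(track=keywords):
        # Each tweet arrives as a JSON object (a Python dict here);
        # persist one tweet per line for later processing.
        out.write(json.dumps(tweet) + "\n")
```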

3.3 Constructing social networks

A social network is constructed from dyads of pairwise relations between nodes, which in our case are Twitter accounts. The node ties denote intermittent relations between accounts, inferred from observed interactions (Nasim 2016; Borgatti et al. 2009). Like any choice of knowledge representation, different networks can be constructed to address different research questions. For example, a network to study information flow could draw an arc from node A to B if account B retweets A’s tweet (implying B has read and perhaps agreed with A’s tweet); alternatively, the same interaction could be used to draw an arc from B to A if the relation is to imply an attribution of status or influence (A has influence because B has supported it through a retweet). Networks can be constructed based on direct or inferred relations, including retweeting, replying or mentioning, which we discuss below, or through the shared use of hashtags or URLs, reciprocation or minimum levels of interaction activity, or friend/follower connections. Morstatter et al. (2018) constructed networks of accounts based on retweets and mentions to discover communities active during the 2017 German election, valuing mentions and retweets equally to mean one account reacting to another. URL sharing behaviour is often studied in the detection and classification of spam and political campaigns (Cao et al. 2015; Wu et al. 2018; Giglietto et al. 2020). Some require more complex calculation such as linking accounts through their participation in detected events (Nasim et al. 2018). Of course, applications for social network analysis exist outside the online sphere, e.g. in narrative analysis (Edwards et al. 2020), and require similar considerations with regard to network design. In the absence of clear alternative research questions, we will examine the social relationships implied by direct interactions and retweet networks (due to their frequency in the literature), and thus, we will focus only on the three types of network construction discussed below.

Here, we consider three social networks built from interaction types common to many OSNs: ‘mention networks’, ‘reply networks’, and ‘retweet networks’ (retweets are analogous to Facebook shares or Tumblr reposts, and replies are analogous to comments on posts on Reddit, as shown in Table 1). We define a social network \(G=(V,E)\) of accounts \(u \in V\) linked by directed, weighted edges \((u_i,u_j) \in E\) based on the criteria below.

Mention networks Twitter users can mention one or more other users in a tweet. In a mention network, an edge \((u_i, u_j)\) exists iff \(u_i\) mentions \(u_j\) in a tweet, and the weight corresponds to the number of times \(u_i\) has mentioned \(u_j\).

Reply networks A tweet can be a reply to one other tweet. In a reply network, an edge \((u_i, u_j)\) exists iff \(u_i\) replies to a tweet by \(u_j\), and the weight corresponds to the number of replies \(u_i\) has made to \(u_j\)’s tweets.

Retweet networks A user can repost or ‘retweet’ another’s tweet on their own timeline, which is then visible to their own followers. Though retweets are not necessarily direct interactions (Ruths and Pfeffer 2014), they can be used to determine an account’s reach and are widely used in the literature (e.g. Vo et al. 2017; Rizoiu et al. 2018; Woolley and Guilbeault 2018; Morstatter et al. 2018; Weber et al. 2020a). In a retweet network, an edge \((u_i, u_j)\) exists iff \(u_i\) retweets a tweet by \(u_j\), and its weight corresponds to the number of \(u_j\)’s tweets \(u_i\) has retweeted.

Examining networks of the three types built from the same dataset, we see that replies (Fig. 1b) are the least common of the three interaction types and that all three networks are dominated by a single large component. Mention networks (Fig. 1a) exhibit relatively high cohesiveness. The similarity between retweets (Fig. 1c) and mentions arises because the data model of a retweet includes a mention of the retweeted account, and thus the retweet edges form a subset of the mention edges. Removing these implicit mention links, if they are unwanted, would be part of data preparation, after collection but prior to network construction.
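To make these construction rules concrete, the following sketch builds the three directed, weighted networks from a file of collected tweets with networkx and shows one (coarse) way to remove the mention edges implied by retweets; the field names assume Twitter’s v1.1 tweet schema, and the file name is a placeholder.

```python
# Sketch: building directed, weighted mention, reply and retweet networks
# from a line-delimited file of v1.1 tweet JSON objects using networkx.
import json
import networkx as nx

def add_edge(g, src, dst):
    # Accumulate the edge weight (number of interactions from src to dst).
    if g.has_edge(src, dst):
        g[src][dst]["weight"] += 1
    else:
        g.add_edge(src, dst, weight=1)

mentions, replies, retweets = nx.DiGraph(), nx.DiGraph(), nx.DiGraph()

with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]

        for m in tweet.get("entities", {}).get("user_mentions", []):
            add_edge(mentions, author, m["screen_name"])

        if tweet.get("in_reply_to_screen_name"):
            add_edge(replies, author, tweet["in_reply_to_screen_name"])

        if "retweeted_status" in tweet:
            add_edge(retweets, author,
                     tweet["retweeted_status"]["user"]["screen_name"])

# Optional data preparation: drop mention edges that exist only because a
# retweet's data model embeds a mention of the retweeted account. (This
# coarse version removes any mention edge that also appears as a retweet
# edge; a finer approach would decrement the edge weights instead.)
mentions.remove_edges_from([e for e in retweets.edges()
                            if mentions.has_edge(*e)])
```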

Fig. 1 Sample networks of accounts built from 5 min of Twitter data. Nodes may appear in one or more networks, depending on their behaviour during the sampled period. The diagrams were constructed with visone (https://visone.info)

3.4 Analyses

At this point, comparative analysis can be applied to the parallel tweet datasets, initially by examining OSN-specific features and then the mention, reply and retweet networks constructed from them. When analysing these networks, it is relevant to note that SNA posits two important axioms on which most network measures are based: network structure affects collective outcomes and positions within networks affect actor outcomes (Robins 2015). Furthermore, we should expect minor differences in collections to be amplified in resulting social networks (Holzmann et al. 2018).

3.4.1 Dataset statistics

To compare the parallel datasets, we examined a number of features, their frequencies and several maxima. The first group of measures relates to the absolute counts of the following features:

  • Tweets The number of tweets in the corpus.

  • Accounts The number of unique accounts that posted tweets in the corpus (i.e. does not include those that were only mentioned or whose tweets were retweeted).

  • Retweets The number of tweets which were native retweets, i.e. created by clicking the ‘retweet’ button on the Twitter user interface, rather than manually typing in “RT @original_author: original text”, which is another valid, though time-consuming, way to post a retweet. Both include an implicit mention of the account being retweeted.

  • Quotes The number of tweets which were quote tweets (non-native retweets, or retweets with comments).

  • Replies The number of tweets which were replies, including replies to tweets outside of the corpus.

  • URLs The number of tweets using URLs, the number of unique URLs used and the number of URL uses.

  • Hashtags The number of tweets using hashtags, the number of unique hashtags used and the number of hashtag uses.

  • Mentions The number of tweets containing mentions of other accounts, the number of unique mentioned accounts, and the number of mentions overall.

The remainder relate to the highest values of the following features:

  • Tweeting account The most prolific account and the number of tweets they posted.

  • Mentioned account The most mentioned account and the number of times they were mentioned.

  • Retweeted tweet The most retweeted tweet and how often it was retweeted.

  • Replied-to tweet The tweet with the most direct replies, and the number of those replies.

  • Used hashtags The first and second most used hashtags, and the number of times they were used.

  • URLs The most used URL, and the number of times it was used.

Based on these figures, we account for major discrepancies between the datasets, which can guide post-processing (e.g. spam filtering). Depending on the application domain, it may be appropriate to also consider comparing the distributions of particular features, rather than just their maximum values.
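As an illustration of how these counts and maxima can be derived from a collected corpus, the sketch below computes them from a list of tweets; field names assume Twitter’s v1.1 schema, and the quote-tweet test (presence of a quoted_status field on a non-retweet) is an assumption.

```python
# Sketch: dataset statistics for a list of v1.1 tweet JSON objects.
from collections import Counter

def dataset_stats(tweets):
    accounts, hashtags, mentions, urls = Counter(), Counter(), Counter(), Counter()
    retweeted, replied_to = Counter(), Counter()
    n_retweets = n_quotes = n_replies = 0

    for t in tweets:
        accounts[t["user"]["screen_name"]] += 1
        if "retweeted_status" in t:                    # native retweets
            n_retweets += 1
            retweeted[t["retweeted_status"]["id_str"]] += 1
        elif "quoted_status" in t:                     # quote tweets
            n_quotes += 1
        if t.get("in_reply_to_status_id_str"):         # replies
            n_replies += 1
            replied_to[t["in_reply_to_status_id_str"]] += 1
        entities = t.get("entities", {})
        hashtags.update(h["text"].lower() for h in entities.get("hashtags", []))
        mentions.update(m["screen_name"] for m in entities.get("user_mentions", []))
        urls.update(u["expanded_url"] for u in entities.get("urls", []))

    return {
        "tweets": len(tweets),
        "accounts": len(accounts),
        "retweets": n_retweets,
        "quotes": n_quotes,
        "replies": n_replies,
        "hashtag uses": sum(hashtags.values()),
        "unique hashtags": len(hashtags),
        "mention uses": sum(mentions.values()),
        "unique mentioned accounts": len(mentions),
        "url uses": sum(urls.values()),
        "unique urls": len(urls),
        "top account": accounts.most_common(1),
        "top mentioned account": mentions.most_common(1),
        "top retweeted tweet": retweeted.most_common(1),
        "top replied-to tweet": replied_to.most_common(1),
        "top hashtags": hashtags.most_common(2),
        "top url": urls.most_common(1),
    }
```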

3.4.2 Network statistics

The following network statistics are used to assess differences in the constructed networks: the numbers of nodes and edges, average degree, density, mean edge weight, component count along with the size and diameter of the largest component, Louvain (Blondel et al. 2008) cluster count along with the size of the largest cluster, reciprocity, transitivity and the maximum k-core. These measures provide us with an understanding of the ‘shape’ of the networks in terms of how broad and dense they are and the strength of the connections within.
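A sketch of how most of these network-level statistics can be computed for a directed, weighted interaction network is given below; it assumes networkx version 2.8 or later, which ships a built-in Louvain implementation (earlier versions would require the separate python-louvain package).

```python
# Sketch: network-level statistics for a directed, weighted DiGraph g.
# Assumes networkx >= 2.8 (for nx.community.louvain_communities).
import networkx as nx

def network_stats(g):
    und = g.to_undirected()
    und.remove_edges_from(list(nx.selfloop_edges(und)))  # k-cores need no self-loops

    components = sorted(nx.connected_components(und), key=len, reverse=True)
    largest = und.subgraph(components[0])
    clusters = nx.community.louvain_communities(und, weight="weight", seed=1)

    return {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "avg degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
        "density": nx.density(g),
        "mean edge weight": sum(w for _, _, w in g.edges(data="weight"))
                            / g.number_of_edges(),
        "components": len(components),
        "largest component size": largest.number_of_nodes(),
        "largest component diameter": nx.diameter(largest),
        "clusters": len(clusters),
        "largest cluster size": max(len(c) for c in clusters),
        "reciprocity": nx.reciprocity(g),
        "transitivity": nx.transitivity(und),
        "max k-core": max(nx.core_number(und).values()),
    }
```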

3.4.3 Centrality values

Centrality measures offer a way to consider the importance of individual nodes within a network (Newman 2010). The centrality measures considered here include: degree centrality, indicating how many other nodes one node is directly linked to; betweenness centrality, referring to the number of shortest paths between all pairs of nodes in the network that a node is on and thus to what degree the node is able to control information flowing between other nodes; closeness centrality, which provides a sense of how topologically close a node is to the other nodes in a network; and eigenvector centrality, which measures how connected a node is to other highly connected nodes. Eigenvector centrality is often compared with Google’s PageRank algorithm (Brin and Page 1998), which gives a measure of the importance of nodes (e.g. websites) based on references to them by other important nodes or by many nodes. The interested reader is referred to (Robins 2015; Wasserman and Faust 1994) for more details.

Only centrality measures for mention and reply networks are considered, as edges in retweet networks are not necessarily direct interactions (Ruths and Pfeffer 2014).

Given the set of nodes in each corresponding pair of networks is not guaranteed to be identical, it is not possible to directly compare the centrality values of each node, so instead we rank the nodes in each network by the centrality values, take the top 1000 from each list, further constrain the lists to only the nodes common to both lists, and then compare the rankings. We initially compare the rankings visually using scatter plots, where a node’s rank in the first and second list is shown on the x and y axes, respectively. A statistical measure of the similarity of the two rankings (of common nodes) is obtained with the Kendall \(\tau\) coefficient, with Spearman’s \(\rho\) coefficient used as a confirmation measure. To classify the strength of the correlations, we followed the guidance of Dancey and Reidy (2011, p.175), who posit that a coefficient of 0.0–0.1 is uncorrelated, 0.11–0.4 is weak, 0.41–0.7 is moderate, 0.71–0.90 is strong, and 0.91–1.0 is perfect.
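The ranking comparison can be sketched as follows, using networkx centrality functions and the Kendall \(\tau\) and Spearman \(\rho\) implementations in scipy; the function and variable names are illustrative.

```python
# Sketch: comparing node centrality rankings between two parallel networks
# (top 1000 nodes per network, restricted to the nodes common to both).
import networkx as nx
from scipy.stats import kendalltau, spearmanr

def top_ranks(g, centrality_fn, k=1000):
    scores = centrality_fn(g)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return {node: rank for rank, node in enumerate(ranked)}

def compare_rankings(g1, g2, centrality_fn=nx.degree_centrality, k=1000):
    r1, r2 = top_ranks(g1, centrality_fn, k), top_ranks(g2, centrality_fn, k)
    common = sorted(set(r1) & set(r2))
    x, y = [r1[n] for n in common], [r2[n] for n in common]
    tau, _ = kendalltau(x, y)
    rho, _ = spearmanr(x, y)
    return len(common), tau, rho

# e.g. compare_rankings(twarc_mentions, rapid_mentions, nx.betweenness_centrality)
```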

3.4.4 Cluster comparison

The final step is to consider the clusters discoverable in the mention, reply and retweet networks and compare their membership. We first compare the distribution of the sizes of the twenty largest Louvain clusters (Blondel et al. 2008) visually. The Louvain method was chosen because it works well with large and small networks (Yang et al. 2016) and is well known in the literature (e.g. Morstatter et al. 2018; Nasim et al. 2018; Nizzoli et al. 2020; Weber and Neumann 2020).

We then use the adjusted Rand index (Hubert and Arabie 1985) to compare membership. This considers two networks of the same nodes that have been partitioned into subsets. When node pairs are considered, there are pairs that appear in the same subset in both partitions (a), there are (many) pairs that do not appear in the same subset in either partition (b), and the rest appear in the same subset in one of the partitions but not in the other. Defining the total number of possible pairings of the n nodes (\(\frac{n(n-1)}{2}\)) as c, the Rand index, R, is simply \(R = \frac{a + b}{c}\). The adjusted Rand index (ARI) corrects for chance and provides a value in the range \([-1,1]\), where 0 implies that the two partitions are random with respect to one another and 1 implies they are identical.
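In code, the comparison can be sketched with scikit-learn’s implementation; since the parallel networks’ node sets are not identical in practice, the sketch restricts the comparison to the nodes common to both partitions (an assumption about handling the mismatch, not a prescription).

```python
# Sketch: adjusted Rand index over the cluster memberships of the nodes
# common to two partitioned networks, using scikit-learn.
from sklearn.metrics import adjusted_rand_score

def membership(clusters):
    # clusters: iterable of sets of nodes (e.g. from louvain_communities).
    return {node: i for i, cluster in enumerate(clusters) for node in cluster}

def ari(clusters_a, clusters_b):
    part_a, part_b = membership(clusters_a), membership(clusters_b)
    common = sorted(set(part_a) & set(part_b))
    return adjusted_rand_score([part_a[n] for n in common],
                               [part_b[n] for n in common])
```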

4 Evaluation of case studies

Several case studies were conducted to evaluate the comparison methodology; their requirements developed progressively, with each new case study informed by lessons from the previous one. Different tools were used to carry out the parallel collections. As mentioned, Twarc was employed as a baseline, while RAPID was used with topic tracking enabled and disabled, and the tool TweepyFootnote 5 was used in only one case study as a second baseline. The first case study consisted of two parallel Twitter datasets relating to an Australian panel discussion television programme with a prominent online community (Q&A); the first datasets were collected over the running of the programme (4 h) and the second covered the following day’s discussion (15 h), both employing RAPID’s topic tracking feature to broaden the conversation. The second case study examined discussion surrounding the national Australian Rules Football competition (the Australian Football League, or AFL) over a longer period (3 days), without RAPID’s topic tracking. The third also examined the same online sports discussion, but over a longer period again (6 days), and made use of RAPID alone, without topic tracking. The final case study incorporated a third tool to act as a further baseline and covered a large regional Election Day, during which a significant amount of activity was expected. These conditions are summarised in Table 2.

Table 2 Summary of data collection conditions
Table 3 Summary statistics for the datasets used in this paper

A summary of the corpora collectedFootnote 6 is presented in Table 3. As noted above, when topic tracking was employed with RAPID, some of the tweets it collected did not contain any of the initial keywords. These datasets are given the label ‘RAPID-E’. Prior to comparison with the corresponding Twarc datasets, the RAPID-E datasets were filtered to retain only tweets containing at least one of the original keywords. The AFL2 case study used RAPID without topic tracking expansion, running two sets of Twitter credentials simultaneously; in this case, the datasets are labelled ‘RAPID1’ and ‘RAPID2’. The third collection tool, Tweepy, was included in the Election Day case study to act as a second baseline.
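The filter-back step can be sketched as below; the keyword list and the fields searched are illustrative assumptions (mirroring the matching behaviour described in Sect. 4.1.1), with v1.1 field names.

```python
# Sketch: paring an expanded (RAPID-E) collection back to tweets containing
# at least one of the original filter keywords. Keyword list and searched
# fields are illustrative; v1.1 field names are assumed.
KEYWORDS = ("qanda",)  # original filter terms (illustrative)

def matches_original_keywords(tweet, keywords=KEYWORDS):
    text = tweet.get("extended_tweet", {}).get("full_text") or tweet.get("text", "")
    haystack = " ".join([
        text,
        tweet.get("user", {}).get("screen_name", ""),
        tweet.get("user", {}).get("description") or "",
    ]).lower()
    return any(k.lower() in haystack for k in keywords)

# usage: rapid = [t for t in rapid_e_tweets if matches_original_keywords(t)]
```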

4.1 Case study 1: Q&A, #qanda and the effect of topic tracking

Initially, to capture a moderately active discussion, we collected data from Twitter’s Standard live stream relevant to an Australian television panel show, Q&A, that invites its viewers to participate in the discussion live.Footnote 7 A particular broadcast in 2018 was chosen due to the expectation of high levels of activity given the planned discussion topic. As a result, the filter keywords used were ‘qanda’Footnote 8 and two terms that identified a panel member (available on request). We collected two parallel datasets over two periods:

Q&A Part 1 Four hours starting 30 min before the hour-long programme, to allow for contributions from the country’s major timezones; and

Q&A Part 2 From 6 am to 9 pm the following day, capturing further related online discussions.

Twarc acted as the baseline collection as it provides direct access to Twitter’s API, while RAPID was configured to use topic tracking via co-occurrence keyword expansion (Lim et al. 2018), meaning it would progressively add keywords to the original set if they appeared sufficiently frequently (five times in 10 min). Expanded datasets such as these are referred to as ‘RAPID-E’; each was filtered back to just the tweets containing the original keywords and labelled ‘RAPID’ to enable fair comparison with the corresponding ‘Twarc’ dataset. We expected that the moderate activity observed would not breach rate limits, and thus that RAPID would capture all tweets captured by Twarc. This was not the case.

Table 4 Summary statistics for the Q&A Parts 1 and 2 datasets

4.1.1 Comparison of collection statistics

The first striking difference between the datasets was the number of tweets collected and the corresponding number of contributors (Table 4). RAPID collected fewer tweets from fewer accounts, but its datasets were close to subsets of the Twarc datasets. Between 26 and 42% of the tweets collected by Twarc were missed by RAPID, but the proportion of retweets in each part is similar (52% and 55% for Part 1 and 69% and 71% for Part 2). In both parts, very few accounts appear in only the RAPID collections. Discussions with RAPID’s developers revealed that it discards tweets whose textual parts (e.g. the body, the author’s screen name and the author’s profile description) do not contain the filter terms. The extra tweets RAPID collected were relevant and in EnglishFootnote 9 (based on manual inspection) but were posted by different accounts (unique to RAPID-E). The tweets RAPID collected that did contain the keywords were posted by almost the same set of accounts as in the Twarc dataset, but the tweets themselves were not the same.

The benefit of topic tracking via keyword expansion is yet to be strongly evaluated, but this study indicates there are benefits (relevant tweets that omit the original filter terms are picked up once related terms are added) as well as costs (some tweets that include the original filter terms are not collected). RAPID’s expansion strategies can be modified to optimise data collection; however, we chose not to make use of this capability to avoid confounding the current comparative study. The rest of this analysis explores how much of a difference the keyword expansion makes with regard to SNA.

Table 5 Detailed statistics of Q&A Parts 1 and 2
Table 6 The top ten most used hashtags in the Q&A datasets (ignoring case and anonymising names)

Table 5 reveals that although feature counts vary significantly, many of the most common values are the same (e.g. most retweeted tweet, most mentioned account, most used hashtags). Many are approximately proportional to corpus size (Twarc is 1.72 and 1.32 times larger than RAPID for Parts 1 and 2, respectively), but with notable exceptions and no apparent pattern. Some values are remarkably similar, despite the size of the corpora they arise from being so different. For example, Twarc picked up nearly 8000 more hashtag uses than RAPID in Part 1, but fewer than 200 more in Part 2. Notably, although the most prolific account is different in Part 2, the most mentioned account is the same for both Parts 1 and 2, potentially implying that account has had similarly high influence in both parallel datasets. Furthermore, both datasets shared almost all the same top ten hashtags, though in different orders (see Table 6). Approximately 5000 of the extra hashtag uses are of ‘#qanda’. In Part 2, again, the top ten hashtags are nearly the same, but this time the usage counts are similar, except for ‘#auspol’ being used \(22\%\) more often in RAPID (1652 times compared with 1349), which would account for the overall difference of 190 uses when combined with the noise of lesser used hashtags. The most used URL in Part 1 is a shortened form of a link to a political party policy comparison resource prepared by an account prominent in the #auspol Twitter discussion.Footnote 10 In the longer collection, the most prominent URL is overtaken by retweets, one by @QandA (Tweet \(t_5\)) and one by @SkyNewsAust, an official news media account (Tweet \(t_6\)).

Fig. 2 Twitter activity in the Q&A Parts 1 and 2 dataset over time (in 15 min blocks)

Moving beyond the bare statistics, the timelines shown in Fig. 2a, b show the clear differences in tweets retrieved. Though the Twarc and pared back RAPID timelines appear at least proportionately similar, it is firstly notable that the RAPID-E dataset captured so much less data in Part 1 (Fig. 2a) and so much more in Part 2 (Fig. 2b), particularly from approximately 4 a.m. onwards (UTC). One possible explanation for this is that the discussion on the night of the episode was far more directly focused on the episode themes and had less opportunity to drift to other issues, especially while informed and guided by what was being broadcast at the time. In contrast, those discussing the episode the following day would have had more opportunity to broaden the discussion to other topics, and RAPID’s topic tracking attended to that, apparently at the cost of tweets matching the exact filter terms. Word clouds of the terms drawn from the first and last 5000 tweets of the RAPID-E dataset appear to offer mild support for this (Fig. 3). Terms are sized according to their frequency. The discussion across the day focuses on the #auspol hashtag, but #qanda is more prominent early on. Mentions of anonymised IDs 1 and 18 are prominent early but shift to ID 6 later. All of these IDs refer to the same individual,Footnote 11 but by Twitter handle and first name early on and by surname later in the day. Figure 3c, showing the top terms unique to the evening discussion, indicates that the discussion shifts to humanitarian concerns (e.g. “kidsoffnauru”, “[asylum] seeker”, “shameful”, “cried”, “sadness”), perhaps due to events of the day. The early discussion (Fig. 3a) seems to mention individuals much more than later, as indicated by the greater size of anonymised IDs. This fact alone implies that the early discussion was focused more directly on the Q&A episode, as the topics it covered related to particular relationships and events involving those individuals.

Fig. 3 Word clouds of the 50 most used terms (anonymised) in the first and last 5000 tweets of the Q&A Part 2 RAPID-E dataset, and the top 100 terms unique to the last 5000 tweets

The second notable feature is that the RAPID tool appeared to miss many of the available tweets in the first half hour of the Part 1 collection. RAPID-E’s first half hour includes only six tweets, the first of which was at 9 a.m. (UTC), while RAPID’s includes only four tweets, the first of which was at 9:15 a.m. It is unclear why the tool missed the tweets that Twarc captured, but a discrepancy such as this suggests it was not by design. The reason that RAPID-E included tweets without the key terms early in the specified timeframe is that the collection was running prior to the 9 a.m. (UTC) start, tracking topics while it ran, as a ‘burn-in’ period; we then extracted just these specific periods (UTC 0900 to 1300, and UTC 1900 to 1000 the next day) to study, post-collection.
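Extracting such specific periods post-collection can be sketched as follows; the timestamp format is that of v1.1 tweets’ created_at field, and the dates shown are placeholders (only the UTC window boundaries follow the text).

```python
# Sketch: extracting a UTC window from a collection post hoc, parsing the
# v1.1 created_at format (e.g. "Mon Nov 05 09:00:00 +0000 2018").
from datetime import datetime, timezone

FMT = "%a %b %d %H:%M:%S %z %Y"

def in_window(tweet, start, end):
    created = datetime.strptime(tweet["created_at"], FMT).astimezone(timezone.utc)
    return start <= created < end

start = datetime(2018, 1, 1, 9, 0, tzinfo=timezone.utc)   # UTC 0900 (placeholder date)
end = datetime(2018, 1, 1, 13, 0, tzinfo=timezone.utc)    # UTC 1300 (placeholder date)
part1 = [t for t in all_tweets if in_window(t, start, end)]  # all_tweets: list of dicts
```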

4.1.2 Comparison of network statistics

Given the differences in datasets, we expect differences in the derived social networks (Tables 7 and 8) (Holzmann et al. 2018). We also present the proportional balance between each dataset’s statistics in Figs. 4 and 5. Each network is dominated by a single large component, comprising over 90% of nodes in the retweet and mention networks and around 70% in the reply networks. The distributions of component sizes appear to follow a power law, resulting in corresponding high numbers of detected clusters.

Table 7 Q&A Part 1 network statistics
Fig. 4 The proportional balance between Twarc and RAPID statistics of the retweet, mention and reply networks built from the Q&A Part 1 datasets

Table 8 Q&A Part 2 network statistics
Fig. 5 The proportional balance between Twarc and RAPID statistics of the retweet, mention and reply networks built from the Q&A Part 2 datasets

Network structure statistics like density, diameter (of the largest component in disconnected networks), reciprocity and transitivity may offer insight into social behaviours such as influence and information gathering. The high component counts in all networks lead to low densities and correspondingly low transitivities, as the potential number of triads is limited by the connectivity of nodes. That said, the largest components were consistently larger in the Twarc datasets, but the diameters of the corresponding largest components from each dataset were remarkably similar, implying that the extra nodes and edges were in the components’ centres rather than on the periphery. This increase in internal structures improves connectivity and therefore the number of nodes to which any one node could pass information (and therefore influence) or, at least, reduces the length of paths between nodes so information can pass more quickly. The similarities in transitivity imply the increase may not be significant, however, with networks of these sizes. Reciprocity values may provide insight into information gathering, which often relies on patterns of to-and-fro communication as a person asks a question and others respond. Interestingly, the only significant difference in reciprocity is in the Part 1 retweet networks, with the Twarc dataset having a reciprocity nearly double that of the RAPID dataset (though still small). The Twarc dataset includes 60% more retweets than the corresponding RAPID dataset and 40% more accounts (Table 4), which may account for the discrepancy. Given the network sizes, the reciprocity values indicate low degrees of conversation, mostly in the reply networks. Interestingly, mean edge weights are very low (1.3 at most), implying that most interactions between accounts in all networks happen only once, despite these being corpora of issue-based discussions.

The proportional statistical differences between the corresponding datasets are highlighted in Fig. 4 for Part 1 and Fig. 5 for Part 2. Part 1’s Twarc networks were larger, in both nodes and edges, but less dense than the RAPID ones, and the largest component in each network is larger by a significant proportion of the extra nodes (it is not clear what portion of the extra nodes are members of the largest components, however). An increase in components also led to a corresponding increase in detected clusters, and an increase in the size of the largest detected cluster. As mentioned earlier, the increase in internal structures leads to a higher maximum k-core value. Though the proportional differences in reciprocity in the retweet networks are high, the values themselves remain low. Part 2’s reply networks are remarkably similar despite the Twarc dataset having \(26\%\) more tweets. The differences in Part 2’s retweet and mention networks are similar to those of Part 1.

That the differences in retweet and mention networks are so proportionately similar across both Parts 1 and 2 is notable because the retweet network is not based on direct interactions, while the mention network is. Retweeting a tweet links a retweeter, X, back to the original author, Y, of a tweet, rather than any intermediate account, even if the retweet passed through several accounts on its way between Y and X. It is possible that these datasets were sufficiently constrained in size, timespan and participant focus (by which we mean participants engaged in the discussion by following the #qanda hashtag) that there was little opportunity to build up chains of retweets.

Next we look at two major categories of network analysis: indexing, the computation of node-level properties, such as centrality, and grouping, the identification of specific groups of nodes, such as clusters.

4.1.3 Comparison of centralities

Centrality measures can tell us about the influence an individual has over their neighbourhood, though the timing of interactions should ideally be taken into account to get a better understanding of their dynamic aspects (e.g. Falzon et al. 2018). If networks are constructed from partial data, network-level metrics (e.g. radius, shortest paths, cluster detection) and neighbourhood-aware measures (e.g. eigenvector and Katz centrality) may vary and not be meaningful (Holzmann et al. 2018).

Fig. 6 Centrality ranking comparison scatter plots of the mention and reply networks built from the Q&A Parts 1 and 2 datasets. In each plot, each point represents a node’s ranking in the RAPID and Twarc lists of centralities (common nodes amongst the top 1000 of each list). The number of nodes appearing in both lists is inset. Point darkness indicates rank on the x axis (darker = higher)

We compare centralities of corresponding networks using scatter plots of node rankings, as per Sect. 3.4 (Fig. 6). The symmetrical structures come from corresponding shifts in order: if an item appears higher in one list, then it displaces another, leading to the evident fork-like patterns. There is considerable variation in most centrality rankings for both mention and reply networks in Part 1 (Fig. 6a) but much less in Part 2 (Fig. 6b), apart from the ranking of eigenvector centralities for the mention networks, which lacks almost any alignment between the RAPID and Twarc node rankings, despite the high number of common nodes (825). This implies that the neighbourhoods of nodes differ between the Twarc and RAPID mention networks, but the top-ranked nodes are similar though their orders differ greatly. Furthermore, the relatively few common nodes in Part 1’s Twarc mention networks (521 to 585) and greater edge count (Table 7) could indicate that the extra edges significantly affect the node rankings. However, Part 2’s Twarc mention networks also had many more edges, but many more nodes in common (approximately 900). Thus, it must have been how the mentions were distributed in the datasets that differed, rather than simply their number. It is not clear that Part 1’s four-hour duration (cf. Part 2’s 15 h) explains this. Instead, if we look at the 11,480 tweets unique to Twarc in Part 1 (cf. fewer than 4000 are unique to Twarc in Part 2, Table 4), only 622 are replies, whereas 6915 include mentions. There are also \(34\%\) more unique accounts in the Part 1 Twarc dataset, but only \(19\%\) more in the Part 2 Twarc dataset (Table 4). Each mention refers to one of these accounts and forms an extra edge in the mention network, thus altering the network’s structure and the centrality values of many of its nodes; this is likely where the variation in rankings originates.

Fig. 7 Centrality ranking comparisons using Kendall \(\tau\) and Spearman’s \(\rho\) coefficients for corresponding mention and reply networks made from the Q&A Parts 1 and 2 datasets

The Kendall \(\tau\) and Spearman’s \(\rho\) coefficients were calculated comparing the corresponding lists of nodes, each pair ranked by one of the four centrality measures (Fig. 7). Although somewhat proportional, it is notable how different the coefficient values are, especially in Part 2. While Twarc produced more tweets than RAPID (Table 4), and more unique accounts, the corresponding mention and reply node counts are not significantly higher (Tables 7 and 8). In fact, the node counts in the Part 1 reply networks are correspondingly lower than in the Part 2 reply networks, even though both Part 2 datasets were smaller. Edge counts in the mention networks were very different (Twarc had many more) but were quite similar in the reply networks.

The biggest variation was in the mention networks from Part 1 (Fig. 6a and Table 7), due to the large number of extra mentions from Twarc. It is notable that the Kendall’s \(\tau\) was low for all mention networks (Fig. 7), especially for degree and closeness centrality. It is worth noting that minor differences in the degree and immediate neighbours of nodes impact degree and closeness centralities significantly and, correspondingly, their relative rankings. In contrast, rankings for betweenness and eigenvector centrality, which rely more on global network structure, remained relatively stable.

4.1.4 Comparison of clusters

Finally, we compare the networks via their largest clusters (Fig. 8). The reply network clusters are relatively similar, and the largest mention and reply clusters differ the most. The ARI scores (Table 9) confirm that the reply clusters were most similar for Parts 1 and 2 (0.738 and 0.756, respectively), possibly due to the small size of the reply networks. The mention and retweet clusters for Part 2 were more similar than those of Part 1 (0.437 and 0.468 compared to 0.320 and 0.350), possibly due to the longer collection period. In Part 1, there is a chance the networks are different due to RAPID’s expansion strategy. Changes to filter keywords may have collected posts of other vocal accounts not using the original keywords, at the cost of the posts which did.

Fig. 8 The largest retweet, mention and reply clusters built from the Q&A Parts 1 and 2 datasets

Table 9 Adjusted Rand index scores for the clusters found in the corresponding retweet, mention and reply networks built from the Q&A Parts 1 and 2 datasets

4.1.5 Summary of findings

Overall, Twarc and RAPID provided very different views into the Twitter activity surrounding the Q&A episode in question, both on the evening of the broadcast and the day after. The differences manifest in basic collection statistics, in network statistics for the retweet, mention and reply networks built from the collected data, in the centrality measures of the nodes in those networks, and in the detected clusters. The extra tweets collected by Twarc appear to have resulted in greater numbers of connections internal to the largest components, which may have implications for the analysis of influence, as reachability correspondingly increases. Deeper study of reply content is required to inform patterns of information gathering.

We are left with the open question of how reliable social media can be as a data source, if conducting simultaneous collection activities with the same query criteria can provide such different networks. Is the variation due to the platform providing a random sample of the overall data or an effect of the tool being used?

We next considered a more tightly controlled comparison of Twarc and RAPID, disabling RAPID’s expansion strategies so that the tools performed as similarly as possible.

4.2 Case study 2: a weekend of AFL without topic tracking

RAPID’s topic tracking feature broadens the scope of the collection at the cost of strictly matching tweets, resulting in distinctly different corresponding corpora. Although the rankings of the most central nodes in networks built from the corpora appear relatively stable, the question remains of why the corpora were so different in size. In this section, we discuss a case study in which we disabled RAPID’s topic tracking feature, expecting the resulting corresponding corpora to be more similar, especially over a longer collection period. Figure 9 indicates that, initially at least, Twarc and RAPID again appeared to produce datasets that were very different in size, though proportional over time. Constraining the datasets to only those tweets with a “lang” property of “en” or “und” resulted in much more similar datasets.

Table 10 Summary dataset statistics of the AFL1 collection
Fig. 9 Twitter activity in the AFL1 dataset over time (in 60-min blocks)

4.2.1 Comparison of collection statistics

We conducted two parallel collections with the term “afl” over a three-day period in March 2019 (the start of the AFL season), using RAPID without topic tracking and Twarc. This collection is labelled “AFL1” in Tables 2 and 3, and further details are provided in Table 10. The datasets obtained appear to be dramatically different: RAPID collected just shy of 22,000 tweets while Twarc found approximately twice that number, around 45,000 tweets, with 21,730 in common. Interestingly, as can be seen in Fig. 9, the extra tweets appear to occur relatively evenly and consistently over time, rather than spiking. On closer inspection, it became apparent that the balance of languages differed, with \(36\%\) of the Twarc collection having a lang property of ‘jp’ (Japanese) and \(52\%\) ‘en’ (English), while RAPID consisted of \(94\%\) English tweets (Fig. 10). When both collections were trimmed to tweets with a lang property of ‘en’ or ‘und’ (undefined), they reduced to 25,231 tweets (Twarc) and 21,235 tweets (RAPID), with 21,166 in common, which still leaves more than 4000 tweets specific to Twarc (Fig. 11). The “AFL1” dataset, reduced to only posts with a lang property of ‘en’ or ‘und’, is referred to as “AFL1-en” henceforth.
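A minimal sketch of this language trim is shown below; it assumes tweets are stored as v1.1 JSON objects, one per line (JSONL), which is how Twarc-style tools commonly write their output (an assumption, not a detail given in this paper).

```python
# Illustrative sketch of the language trim used to produce "AFL1-en",
# assuming v1.1 tweet JSON stored one object per line (JSONL).
import json

KEEP_LANGS = {"en", "und"}

def trim_by_lang(path):
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            if tweet.get("lang") in KEEP_LANGS:
                kept.append(tweet)
    return kept

# Overlap between trimmed parallel datasets, by tweet id (field name assumed):
# common_ids = {t["id_str"] for t in twarc_en} & {t["id_str"] for t in rapid_en}
```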

Fig. 10 Distributions of tweet language values (specified in the lang property) in the RAPID and Twarc datasets, collected using the filter term “afl”

Fig. 11 Tweet counts of the RAPID and Twarc datasets in the AFL1 and AFL1-en collections, obtained using the filter term “afl”

As previously mentioned, RAPID does not retain tweets which do not contain the filter terms in the text-related portions of the tweet. In the Twarc collection, the term ‘afl’ appeared in the domain of a website that many of the Japanese tweets referred to, belonging to an online marketplace. These tweets were dropped by RAPID and did not appear in the final collection.

Only 69 tweets were unique to the RAPID AFL1-en dataset, and they appear to be AFL-related sports discussions. The 4065 tweets unique to the Twarc dataset comprise 2595 English tweets and 1470 with “und” as the lang value. This field is populated by Twitter based on language detection algorithms; when a language cannot be detected, such as when there is insufficient free text to analyse, the value “und” is used. Inspection of the tweets indicates why: the undefined tweets include 884 retweets, 1366 tweets with URLs, 116 with hashtags, and 916 with mentions. Of the “und” tweets with URLs, the vast majority (1188) refer to a Japanese online electronics marketplace (771) or a Japanese online media platform (417). The next largest group comprises 38 retweets: 9 of the official @AFL account, plus 16 and 5 retweets of two accounts that Botometer (Davis et al. 2016) scored as bot-like at 4.2 and 4.4 out of 5, respectively; both of these accounts refer to the previously mentioned Japanese electronics marketplace. The top 12 most used hashtags in the English subset relate to the AFL, while the top 14 for the “und” subset are all Japanese terms, except for “iphone” (at number 9); the top term (in Japanese) is the name of the marketplace. The English tweets are mostly related to the AFL, though there is considerable, obviously automated content from bot-like accounts, with several accounts posting the same content (offers of live streams of the matches) repeatedly within a short space of time (their messages appear adjacent in the timeline).
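The feature counts above can be reproduced with a simple pass over the unique tweets; the sketch below is illustrative only and assumes the v1.1 JSON schema (fields such as retweeted_status and entities), which the paper does not spell out.

```python
# Illustrative sketch (v1.1 JSON fields assumed) of the per-feature counts used
# to characterise tweets unique to one dataset.
from collections import Counter

def feature_counts(tweets):
    counts = Counter()
    for t in tweets:
        entities = t.get("entities", {})
        if "retweeted_status" in t:
            counts["retweets"] += 1
        if entities.get("urls"):
            counts["with_urls"] += 1
        if entities.get("hashtags"):
            counts["with_hashtags"] += 1
        if entities.get("user_mentions"):
            counts["with_mentions"] += 1
    return counts
```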

Table 11 Statistics of the AFL1-en RAPID and Twarc datasets

Once reduced to a relatively comparable state, the “AFL1-en” parallel datasets can be examined in more detail. It is understood that the tweets they consist of will differ, given that rate-limiting constraints may have caused each to receive different tweets. The statistics in Table 11 bear this out, with the Twarc dataset statistics being approximately proportionately larger than the RAPID dataset statistics. The author IDs have been anonymised, but the most mentioned account is the official @AFL account, while the most prolific author appears to be automated to some degree, having posted nearly 35,000 tweets in two years and having a Botometer (Davis et al. 2016) Complete Automation Probability (CAPFootnote 12) of \(68\%\); many of its posts seemingly promote the AFL, tennis, and a singer. The most replied to tweet was posted by an Australian NBAFootnote 13 player and the most retweeted tweet was of an amusing video of an AFL supporter.

4.2.2 Comparison of network statistics

Table 12 Comparative statistics for networks generated from the RAPID and Twarc datasets for the AFL1-en collection
Fig. 12 The proportional balance between Twarc and RAPID statistics of the retweet, mention and reply networks built from the AFL1-en datasets

The network statistics in Table 12 indicate that the networks were much more similar than in the Q&A case study, though there are still notable differences. The largest components of the retweet, mention and reply networks are, at most, 15% larger by node count, and the largest components are correspondingly similar, though their diameters and densities indicate they are much sparser than the corresponding Q&A ones, with corresponding implications for the opportunity to influence. In sporting discussions, there is less motivation to attempt to convert fellow sports fans to cheer for one’s team than there is in a political discussion. Certainly in this study, politics has tended to generate more discussion than sport in general, and the nature of the discussions is also different. The reciprocity values here are much higher than in the Q&A case study, implying more communication among the communities that do exist. Another difference that lends weight to this interpretation is the average degree of nodes in the networks. In the Q&A retweet and mention networks, the average degrees were around 2-2.5 and 3, respectively, implying some repetition in connectivity, whereas in the sporting discussion the average degrees are around 1 and 2, respectively, implying much less continued interaction. As indicated in Fig. 12, the degree values of the Twarc and RAPID networks are highly similar.

The number of tweets and accounts in the AFL1-en datasets (Table 11), coupled with the number of nodes and edges in the derived mention and reply networks (Table 12), indicates that although the AFL1-en collections differed by nearly 4000 tweets, the number of accounts was not significantly different (approximately \(10\%\) more in Twarc), with a corresponding increase in nodes and edges in the mention network (8.3% and 7.3%, respectively) but only 54 and 77 more (1.1% and 1.6%, respectively) in the reply network.
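For reference, the sketch below shows one plausible way such mention and reply networks can be constructed from raw tweets; it is an assumption-laden illustration (v1.1 JSON fields, directed weighted edges from author to mentioned or replied-to account), not the study’s actual pipeline.

```python
# Illustrative sketch (not the study's pipeline): directed, weighted mention
# and reply networks between accounts, assuming v1.1 tweet JSON fields.
import networkx as nx

def build_interaction_networks(tweets):
    mention_g, reply_g = nx.DiGraph(), nx.DiGraph()
    for t in tweets:
        src = t["user"]["id_str"]
        for m in t.get("entities", {}).get("user_mentions", []):
            w = mention_g.get_edge_data(src, m["id_str"], default={"weight": 0})
            mention_g.add_edge(src, m["id_str"], weight=w["weight"] + 1)
        tgt = t.get("in_reply_to_user_id_str")
        if tgt:
            w = reply_g.get_edge_data(src, tgt, default={"weight": 0})
            reply_g.add_edge(src, tgt, weight=w["weight"] + 1)
    return mention_g, reply_g
```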

4.2.3 Comparison of centralities

Considering the similarity of the interaction networks constructed from the respective AFL1-en datasets, we compare the relative rankings of the top network nodes by various centrality values (with an upper bound of 1000 nodes). Figure 13 shows scatter plots of the relative rankings of nodes common to corresponding networks, and Fig. 14 plots the Kendall \(\tau\) and Spearman’s coefficients of the corresponding relative rankings. As with the Q&A collection, the centralities of nodes in the reply networks show more similarity than those in the mention networks, which is likely due to their relative size; Table 12 indicates a significant discrepancy in the reply and mention network sizes and average degrees. Closeness similarity is notably low, though the high component count would account for that. It is apparent that the most central nodes in both network types mostly maintain their ordering for the first several hundred nodes, but all begin to diverge at some point. A few isolated nodes change their ranking significantly, such as those in the top left of the mention betweenness and closeness plots, whose rankings degrade (appearing above the diagonal), and those in the reply closeness and eigenvector plots, whose rankings improve (appearing below the diagonal), but the majority diverge in a trident pattern, implying that lower-ranked nodes improve their rankings by swapping with higher-ranked nodes at progressively greater distances. The reason for the consistency is unclear: minor variations would ensure that nodes’ centrality values varied, and thus their rankings could easily vary significantly, especially given the high number of components. The high k-core values for the mention networks likely explain the high betweenness and eigenvector centrality values, as the highest-ranked of these nodes will reside in the larger components, which have the greater likelihood of being similar across the networks.

Fig. 13 Centrality ranking comparison scatter plots of the mention and reply networks built from the AFL1-en datasets. In each plot, each point represents a node’s ranking in the RAPID-en and Twarc-en lists of centralities (common nodes amongst the top 1000 of each list). The number of nodes appearing in both lists is inset. Point darkness indicates rank on the x axis (darker = higher)

Fig. 14 Centrality ranking comparisons from the RAPID and Twarc datasets of the AFL1-en collection using Kendall \(\tau\) scores and Spearman’s coefficients

4.2.4 Comparison of clusters

Comparing the clusters detected with the Louvain method (Blondel et al. 2008) in the retweet, mention and reply networks yields the ARI values in Table 13. These imply that although the networks consisted of many components, the clusters they formed were highly similar for the retweet and reply networks, and only slightly less so for the mention networks, despite the Twarc mention network including more than 2000 additional mention edges.

Table 13 Adjusted Rand index scores for the clusters found in the networks built from the RAPID and Twarc datasets for the AFL1-en collection

4.2.5 Summary of findings

This case study makes it clear that the tool used for collection can have a significant effect on the data collected and the resulting analytic results. It was serendipitous that the filter term chosen was “afl”, because a more specific term or set of terms is unlikely to have captured the non-English content that Twarc did. This highlighted the fact that RAPID was post-processing and filtering the tweets it collected, and raised general questions for social media data collection: Do other collection tools, especially commercial ones, perform this kind of post-processing too, as a “convenience” or “value-add” to their users? Do they make it clear if and when they do? The validity of evidence-based conclusions rests on these details. Even when both datasets were filtered to ensure some degree of consistency, there remained large differences in the networks constructed from them. Minor differences in datasets may result in amplified differences in analyses.

A further, even more fundamental, question remained after this case study, which is addressed by the next subsection: Does the same tool provide the same data over two simultaneous collections with identical filter terms?

4.3 Case study 3: tracking AFL Twitter activity with RAPID

Table 14 Summary dataset statistics of the AFL2 collection
Fig. 15 Twitter activity in the AFL2 dataset over time (in 60-min blocks). Thick and dashed lines are used here to highlight how the timeseries overlap almost exactly. The timestamps of tweets unique to RAPID2 are shown as blue points

Given that different collection tools appeared to produce different results from the same inputs, the question remained of whether the API delivers consistent content to all clients. A second collection (Table 14) was initiated over a longer period (6 days) using the same filter term and tool (RAPID), but with different API credentials. One set of credentials belonged to a relatively new and unused account (created in 2018, having posted only 3 tweets) and the other to a well-established account (created in 2009, having posted approximately 17,000 tweets). This resulted in two highly similar, but not quite identical, datasets of 30,103 and 30,115 tweets; their timeline is shown in Fig. 15. The first dataset was a proper subset of the second, so the difference of 12 posts can be regarded as noise or the result of minor differences in timing. A brief examination revealed that these extra tweets (shown in blue in the figure) were all about the AFL or other sports in Australia, and their timing appeared random. Further confirmation of the similarity between the datasets can be seen in Table 15, where the most prolific account, most retweeted tweet, most replied to tweet and most mentioned account details are all identical. Again, the most mentioned account is the official @AFL account.
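Confirming the subset relationship is a simple set comparison over tweet identifiers; the sketch below is illustrative and assumes v1.1 JSON objects with an id_str field.

```python
# Illustrative subset check by tweet id (id_str field assumed).
def id_differences(tweets_a, tweets_b):
    ids_a = {t["id_str"] for t in tweets_a}
    ids_b = {t["id_str"] for t in tweets_b}
    return ids_a - ids_b, ids_b - ids_a   # tweets unique to each collection
```

If the first returned set is empty, the first collection is a proper subset of the second, as observed here.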

Table 15 Statistics of two parallel datasets collected using RAPID with the filter term “afl” over a six-day period with different API credentials

4.3.1 Comparison of network statistics

Table 16 Selected comparative statistics for networks generated from the two RAPID datasets for the AFL2 collection

Due to the similarity of the datasets, the retweet, mention and reply networks generated from them were almost identical, and only a summary of their structures is provided in Table 16. Details of the networks are provided in Fig. 16, which shows that the only differences occur in the detected clusters. In particular, the largest cluster detected in the RAPID2 mention network is around \(3\%\) larger than the corresponding cluster from the RAPID1 mention network. This is likely due to the element of randomness in the Louvain algorithm (Blondel et al. 2008).

Fig. 16 The proportional balance between the RAPID1 and RAPID2 statistics of the retweet, mention and reply networks built from the AFL2 datasets

Fig. 17 Centrality ranking comparison scatter plots of the mention and reply networks built from the AFL2 datasets. In each plot, each point represents a node’s ranking in the RAPID1 and RAPID2 lists of centralities (common nodes amongst the top 1000 of each list). The number of nodes appearing in both lists is inset. Point darkness indicates rank on the x axis (darker = higher)

Fig. 18 Centrality ranking comparisons from the two RAPID datasets of the AFL2 collection using Kendall \(\tau\) scores and Spearman’s coefficients

4.3.2 Comparison of centralities and clusters

The similarity of the networks based on their statistics is further confirmed by a comparison of their centrality rankings, which indicates that their structures are all but identical. A visual inspection of the respective rankings in Fig. 17 reveals no major differences, and the Kendall \(\tau\) and Spearman’s coefficients indicate their rankings are, in fact, identical (Fig. 18).

Table 17 Adjusted Rand index scores for the clusters found in the networks built from the two RAPID datasets for the AFL2 collection

Interestingly, the high degree of similarity does not extend to the membership of the detected clusters, as measured by the ARI (Table 17). Given the randomness inherent in Louvain clustering, these scores are presumably as close to the maximum as can be expected: for a score of 1.0, each pair of detected clusters would need to match perfectly across the thousands of nodes in the networks, so any minor variation reduces the score.

4.3.3 Summary of findings

This evidence suggests that the results provided by the Twitter API (if not other platforms’ APIs) are consistent, regardless of the consumer. It is clearly important that researchers understand how their collection tool works in order to correctly interpret the results returned. In this regard, open-source solutions are, as the name implies, more transparent than closed-source solutions. The benefit gained from more tailored filtering must be balanced against the initial effort required to understand how the APIs are employed by the tool used and what modifications the tool makes to the data it collects.

4.4 Case study 4: Election Day

This final case study highlights the importance of continuous network connectivity, and of awareness of when that condition is not met. Given that the social media researcher can offload many other aspects of data quality to the OSN (e.g. well-designed schemas, data consistency, value validity; Scannapieco et al. 2005; Foidl and Felderer 2019), it is important to note that this is an aspect for which the researcher must retain responsibility.

To examine a more focused collection activity, and to include a second open-source collection tool (thus similar to the baseline tool, Twarc), a collection was conducted over an Election Day (a 24-h period) in early 2019, using RAPID, Twarc and Tweepy, each configured with the same filter terms: #nswvotes, #nswelection, #nswpol, and #nswvotes2019. RAPID and Twarc collected slightly below 40,000 tweets each, while Tweepy collected around 36,000 tweets but suffered network outages on two occasions, for approximately 110 and 96 min (see Fig. 19). In the resulting datasets (highlights of which are shown in Table 18), 285 tweets were unique to RAPID, three to Twarc, and 19 were shared by Twarc and Tweepy but not RAPID. The vast majority of the Tweepy dataset’s 36,172 tweets appeared in all three datasets, while Tweepy missed the 3118 further tweets that appeared in both the Twarc and RAPID datasets. In fact, examining the periods in which Tweepy lost its connection, around 6 p.m. (UTC) and again approximately 6 h later, Twarc retained 3036 tweets and RAPID retained 3055 tweets (RAPID-E collected 3918 during these periods), so it is possible that, had Tweepy’s connection stayed up, its dataset would have been very similar to those of Twarc and RAPID, especially as the remainder of the tools’ collection behaviour appears almost identical in the timeline.
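Such outages can be surfaced by scanning the sorted tweet timestamps for unusually long gaps; the following sketch is illustrative only, assuming the v1.1 created_at timestamp format and a hypothetical gap threshold of 10 min.

```python
# Illustrative outage detection: sort tweets by created_at and report gaps
# longer than a threshold (v1.1 timestamp format assumed).
from datetime import datetime, timedelta

TS_FMT = "%a %b %d %H:%M:%S %z %Y"

def find_gaps(tweets, threshold_min=10):
    times = sorted(datetime.strptime(t["created_at"], TS_FMT) for t in tweets)
    return [(earlier, later)
            for earlier, later in zip(times, times[1:])
            if later - earlier > timedelta(minutes=threshold_min)]
```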

Fig. 19 Twitter activity in the Election Day dataset over time (in 15-min blocks). Dashed and dotted lines are used here to highlight how the timeseries overlap almost exactly

Table 18 Summary dataset statistics of the Election Day collection

4.4.1 Comparison of collection statistics

The collection statistics are highly similar and are provided primarily for completeness. The effect of Tweepy’s disconnection is highlighted by the differences between its statistics and those of the Twarc baseline. Although more than 3000 tweets were missed, only a few hundred accounts, quotes, replies and tweets with URLs were missed. Several thousand retweets were missed, as well as tweets with hashtags and mentions, but the effect on the features with the highest counts is limited: the most prolific account, most retweeted tweet, most replied to tweet, most mentioned accounts, hashtags and URLs are all the same (Table 19).

Table 19 Statistics of the Twarc, RAPID, and Tweepy datasets collected in parallel over a 24-h period

4.4.2 Comparison of network statistics

Continuing the similarities in the collection statistics, statistics drawn from the retweet, mention and reply networks built from the Election Day datasets are also strikingly resilient, despite the Tweepy networks including several hundred fewer nodes (Table 20). This is borne out by the proportional differences between Twarc and RAPID in Fig. 20, where the only significant difference is the size of the largest detected cluster (again, likely due to the randomness inherent in the Louvain algorithm, Blondel et al. 2008), and by the proportional differences in all the statistics across the Twarc and Tweepy networks in Fig. 21.

Table 20 Comparative statistics for networks generated from the Twarc, RAPID and Tweepy datasets for the Election Day collection
Fig. 20 The proportional balance between Twarc and RAPID statistics of the retweet, mention and reply networks built from the Twarc and RAPID datasets

Fig. 21 The proportional balance between Twarc and Tweepy statistics of the retweet, mention and reply networks built from the Twarc and Tweepy datasets

4.4.3 Comparison of centralities

Examining the centralities of the mention and reply networks built from the Election Day datasets, and comparing RAPID and Tweepy against the Twarc baseline, shows, as expected, only minor variations in the RAPID networks, which occur only among the lower-ranked nodes (Fig. 22), and more widespread differences in the Tweepy networks (Fig. 23). Statistically, Twarc and RAPID’s mention network centrality rankings, shown in Fig. 24, had Kendall \(\tau\) values around 0.35 to 0.4 and Spearman’s coefficients around 0.45 to 0.6, while the reply networks’ values were higher, with \(\tau\) around 0.5 and Spearman’s coefficients around 0.7, possibly due to the smaller size of the reply networks. These values all approach or exceed the \(\tau\) values of 0.4 to 0.6 regarded as reasonably to highly similar, mentioned in Sect. 3.4. The ranking similarity statistics calculated by comparing the Twarc and Tweepy networks are notably lower (Fig. 25), though even the reply networks’ betweenness and closeness comparisons are moderately similar, with \(\tau\) around 0.4 and Spearman’s coefficients around 0.5 to 0.6.

Fig. 22 Centrality ranking comparison scatter plots of the mention and reply networks built from the Twarc and RAPID Election Day datasets. In each plot, each point represents a node’s ranking in the Twarc and RAPID lists of centralities (common nodes amongst the top 1000 of each list). The number of nodes appearing in both lists is inset. Point darkness indicates rank on the x axis (darker = higher)

Fig. 23 Centrality ranking comparison scatter plots of the mention and reply networks built from the Twarc and Tweepy Election Day datasets. In each plot, each point represents a node’s ranking in the Twarc and Tweepy lists of centralities (common nodes amongst the top 1000 of each list). The number of nodes appearing in both lists is inset. Point darkness indicates rank on the x axis (darker = higher)

Fig. 24 Centrality ranking comparisons from the Twarc and RAPID datasets of the Election Day collection using Kendall \(\tau\) scores and Spearman’s coefficients

Fig. 25 Centrality ranking comparisons from the Twarc and Tweepy datasets of the Election Day collection using Kendall \(\tau\) scores and Spearman’s coefficients

4.4.4 Comparison of clusters

Despite the similarities between the Twarc and RAPID networks, the cluster memberships still vary significantly, with the highest similarity found amongst the (smaller) reply networks, as can be seen in the ARI scores in Table 21. The clusters found in the Twarc and Tweepy networks are less similar, almost in line with the differences in network sizes: the retweet networks had fewer nodes than the mention networks and their ARI scores differ less, while the reply networks were the smallest and show the smallest difference in ARI scores.

Table 21 Adjusted Rand index scores for the clusters found in the corresponding retweet, mention and reply networks built from the Election Day datasets

4.4.5 Summary of findings

This final case study provides further confidence that the differences observed earlier in the Q&A datasets are primarily caused by enhancements provided by the RAPID platform, and that the differences in the AFL1 datasets were due, in part, to the choice of “afl” as the lone filter term. The Election Day collection used several specific filter terms and ran long enough to collect several tens of thousands of tweets, rendering minor differences in start and stop times negligible. The differences that did occur had no significant effect on the networks constructed from the data or on the network analysis measures calculated over those networks.

5 Discussion

These case studies raise a number of points worthy of further discussion. Here we first consider the statistical effects of the case study variations specifically, and then more general questions regarding dataset size, the effects of language and terminology, and the influence of the platforms.

5.1 Regarding statistics

The case studies presented highlight how decisions regarding the collection specification, such as the filter terms used or their number, and the collection duration, can result in datasets that trigger features in complex collection tools, quite apart from the configuration of such tools to dynamically change the collection specification (e.g. use of RAPID’s topic tracking feature). The primary variations explored here involved filter terms, although collection duration also varied, depending on the collection event. The biggest variations in parallel datasets appeared when few filter terms were used and when they were short (i.e. having few characters), resulting in incidental noise from posts in unexpected languages (#qanda) or with unexpected acronyms, and from elements in post metadata (#afl). When multiple terms were used, and when those terms were not valid words in a language (e.g. variations on #nswelec), the parallel datasets were much more similar. Although it might be common sense to encourage careful design of collection specifications, these case studies highlight the value of being more specific (and the danger of not being), by dictating the language of the posts required as well as using multiple filter terms.

When variations in datasets occurred, the extra tweets resulted in the introduction of new nodes (accounts) in the retweet, mention and reply networks, the majority of which were located within the largest connected components (relatively few appeared as new, independent components). This consistently reduced the density of the retweet, mention and reply networks but rarely affected the diameter of the largest component (Q&A Part 2’s retweet network is an exception), implying that the new nodes appear in the core of the components rather than on the periphery. Consequently, the extra nodes increased reciprocity, transitivity, and sometimes maximum k-core values in the retweet and mention networks, but rarely changed the reply networks. Reply interactions occurred least frequently in all datasets, and so the reply networks were the least different in raw size (nodes and edges).

The effects of collection variation were most prominent in centrality scores, particularly when the collection event involved direct interaction between participants (e.g. issue- or theme-based discussions, such as during Q&A and over weekends of football), and less so for straightforward information dissemination (e.g. during an election campaign). The ranking of nodes by centrality varied most in the mention and reply networks of Q&A Part 1, even though more than half the top thousand ranked nodes in each pair of parallel networks were the same (an average of 560.5 for mentions and 991 for replies). The forking patterns appearing in the scatter plots imply the presence of groups of nodes with adjacent centrality rankings, which then swapped when new nodes were added, possibly by altering the internal topology of the largest components in some way. Spearman’s \(\rho\) and Kendall \(\tau\) correlation coefficients were consistently higher for reply networks than mention networks, possibly due to their smaller size. No particular patterns in differences between centrality types were observed, which implies the differences between pairs of parallel networks did not result in significantly different topologies.

A final lesson regarding network statistics can be drawn from the use of ARI scores: even clustering highly similar (almost identical) networks, as in Case Study 4, results in ARI scores around 0.7, meaning that ARI scores around 0.4 can be taken as confirmation that cluster membership is, in fact, quite similar.

5.2 Regarding dataset size

Social media datasets analysed in the literature are often much larger than the datasets used in this study. For example, Cao et al. collected over a billion URLs from Twitter alone in their study of URL sharing (Cao et al. 2015). There is significant power in such datasets for examining the flow of information and influence, but their scale can hamper more granular examinations focusing on accounts and the communities they form. The study of conversations can rely on direct interactions, such as replies or comments on posts and mentions, or on indirect interactions such as the shared use of hashtags (e.g. Ackland 2020). Such studies examine both the structure and dynamics of conversations and their prevalence, but those structures can be found in small, targeted datasets as well as larger ones. Information sharing via retweets, reposts or URLs can reveal patterns of information dissemination, and related research can certainly benefit from larger datasets, especially when relying on mathematical models of behaviour (e.g. Lee et al. 2013; Cao et al. 2015; Bagrow et al. 2019). Depending on researchers’ access to privileged APIs and data access rates, generating large datasets can often run into API rate limits, raising the question of completeness, which may or may not be an issue depending on the research questions under investigation. Assuming that collection activities are rate limited in a consistent way, we expect larger parallel datasets to exhibit many of the same patterns observed here, but this remains an open question for future research.

5.3 Regarding language and terminology clashes

Most popular OSNs have been developed in the English-speaking world, primarily for an English-speaking audience (at least initially), and even though most now have significant numbers of non-English-speaking users (e.g. \(56.5\%\) of internet use originates in Western, Southern, Eastern and South Eastern Asia; Kemp 2021, slide 27), English enjoys significant support. Though many OSNs now support many languages and alphabets, anglicised spelling variants of many non-English languages exist because the major mobile operating systems (Apple iOS and Google Android) originate in America. For these reasons, if terms (essentially combinations of letters) that are meaningful in multiple languages are used to filter streams or in queries, it is possible that non-English posts may be captured, especially if preferred languages are not specified as part of the filter or query. This was certainly the case for the Q&A case study (Fig. 26). RAPID attempts to address this oversight by ensuring that filter terms appear in text-related fields of the posts it captures, but our experience with the AFL datasets raises questions about what other terms can capture posts unexpectedly. Depending on how OSN queries are interpreted, using “http”, “#” or “March” as filter terms could return every post including a URL, using a hashtag, or posted in March; these are also questions for further research.

Fig. 26 Semantic networks (Radicioni et al. 2020) of co-mentioned hashtags (i.e. hashtags appearing in the same tweet) built from the RAPID and Twarc Q&A Part 1 datasets. The node for the hashtag #qanda has been excluded, as all tweets included it, and the minimum edge weight (i.e. the number of times hashtags needed to co-occur in a tweet) was set to 3. Nodes are coloured according to Louvain clusters (Blondel et al. 2008), and labels identifying individuals have been anonymised. It is clear that non-English clusters of tweets have been captured due to a clash with the term ‘qanda’

A second, more common, form of clash is semantic rather than lexicographic in nature. A prime example is found in Weber and Neumann’s study of an Australian election, in which the filter term #liberals (referring to a political party) clashed with the use of the term during student protests against gun violence in America, where it refers more to political ideology, resulting in a spike of American tweets in a predominantly Australian discussion (Weber and Neumann 2020). Similarly, the hashtag #voteno clashed in a study of the 2017 Australian postal plebiscite on same-sex marriage, drawing in American commentary on a healthcare Bill before the US Congress (Nasim 2019). To remove such Data Smells (Foidl and Felderer 2019), co-occurring hashtag networks, otherwise known as semantic networks (Radicioni et al. 2020), can be used to identify out-of-scope content, but any use of automation is likely to require human oversight to avoid removing relevant content.
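A minimal sketch of how such a co-hashtag (semantic) network might be built is shown below; it assumes v1.1 tweet JSON and mirrors the settings described for Fig. 26 (excluding #qanda and dropping edges with weight below 3), but it is illustrative rather than the code used to produce that figure.

```python
# Illustrative co-hashtag ("semantic") network, mirroring the settings of
# Fig. 26: the collection hashtag is excluded and edges with weight < 3 dropped.
from itertools import combinations
import networkx as nx

def semantic_network(tweets, exclude=frozenset({"qanda"}), min_weight=3):
    g = nx.Graph()
    for t in tweets:
        tags = {h["text"].lower()
                for h in t.get("entities", {}).get("hashtags", [])}
        tags -= exclude
        for a, b in combinations(sorted(tags), 2):
            w = g.get_edge_data(a, b, default={"weight": 0})
            g.add_edge(a, b, weight=w["weight"] + 1)
    weak = [(a, b) for a, b, w in g.edges(data="weight") if w < min_weight]
    g.remove_edges_from(weak)
    g.remove_nodes_from(list(nx.isolates(g)))
    return g
```

Clusters of hashtags disconnected from the topic of interest (such as the non-English clusters in Fig. 26) can then be inspected and, with human oversight, used to exclude out-of-scope posts.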

5.4 Regarding platform influences

Two of the biggest impediments to credibility in social media datasets are confidence that they are complete and, if they are known to be incomplete, knowledge of the sampling biases; both rely on openness and transparency on the part of the OSNs. Case study 3 (Sect. 4.3) at least confirms that different credentials, when used with the same collection tool against the same OSN with the same network boundaries (i.e. filter expression), result in approximately identical datasets (allowing some minor variation for the timing of network connections). OSNs are commercial entities, and thus it stands to reason that they would bias samples to maintain users’ attention, which could mean that if Case study 3 were repeated by running collections in different parts of the world, regional preferences (e.g. languages, topics of discussion) could influence the datasets, causing greater divergence. That said, studies of Twitter’s 1% Sample API seem to offer evidence to the contrary (Joseph et al. 2014; Paik and Lin 2015). The Sample API is different from a keyword-based query or stream filter, however, and is primarily designed to support research. Query term–based APIs might be more likely to exhibit regional differences, as the queries they service could originate from user-facing applications or market analysts, and not just academic researchers. Though these studies all focus on Twitter, most other OSNs are under similar commercial pressures, and regional popularity is vital to the management of their brands.

5.5 Regarding measures for reliability and representativeness

The central purpose of this paper is to draw attention to unexpected variations in datasets collected from social media streams and in the networks constructed from them. This is especially relevant when the stream is known to be limited (either through platform rate-limiting or through platform algorithms, as occurs with, say, Twitter’s 1% sample stream). An obvious follow-up question is whether an objective measure of reliability is feasible. This relates closely to the question of how representative the samples provided by platforms are of their entire data holdings (e.g. as studied by Morstatter et al. 2013; González-Bailón et al. 2014; Joseph et al. 2014), but that question relies on examining the choices made by the platform in deciding what to include in the sample it offers. Here, similar to Paik and Lin (2015), our interest is in confirming that the data we request from a platform (with filter terms) matches what it has, or is at least representative of what it has (if rate limits are encountered). Such a measure might rely on comparing the distributions of various features in the result dataset and the complete dataset (known only to the platform), such as the accounts and the number of tweets they post, the number of hashtags, URLs and mentions used, and the replies, quotes and retweets made. Only the platform has sufficient information to calculate this measure, and there may be significant value in providing it alongside the free or low-cost streams offered to researchers, analysts and other social media mass consumers. Providing a measure of representativeness (indicating reliability) alongside query and filter results could (1) encourage consumers to pay for the higher cost streams and (2) provide consumers with more certainty in any conclusions they draw from the results they analyse. A measure of reliability rather than representativeness could, in fact, be more useful, because there may be good reasons for results not to be truly representative; this would be the case when the complete dataset includes significant amounts of spam, pornography or other objectionable content.Footnote 14 The reliability measure would indicate how representative the provided results are when compared with the complete results.

6 Conclusion

Under a variety of conditions, the collection tools employed in several use cases provided different views of specific online discussions. These differences manifested as variations in collection statistics, and network-level and node-level statistics for retweet, mention and reply social networks built from the collected data. Extra tweets were most often collected by Twarc, and these appear to have resulted in more connections within the largest components without affecting their diameters. This may affect results of information diffusion analysis, as reachability correspondingly increases. Deeper study of reply content is required to inform discussion patterns.

How reliable social media can be as a source for research without deep knowledge of the effects of collection tools on analyses is an open question. If a tool adds value through analytics or data cleaning features, what is the nature of the effect? This paper provides a methodology to explore those effects. A canonical measure of the reliability of a dataset would be valuable to the research and broader social media analysis community. Such a measure would express how complete the results of a search or filter of live posts are and, if they are not complete, how representative the provided sample is of the complete results. Only the platforms have this information; however, there would be benefits for them in providing such a measure, both as an enticement for consumers to pay for greater access to platform data holdings and as a way of informing consumers of the degree to which they can depend on analyses of the data they receive. Twitter, in particular, has recently introduced changes as part of its API version 2.0 that facilitate academic research.Footnote 15

We recommend the following to those using OSN data:

  • Be aware of tool biases and their effects.

  • Take care to specify filter and search conditions with keywords that capture relevant data and avoid irrelevant data, and make use of metadata filters to avoid unwanted content, e.g. constraining language codes. Beware of short filter terms and ones that are meaningful in non-target languages.

  • Check the integrity of the data. In the Election Day case study, we observed gaps and minor inconsistencies due to connection failures, as well as the appearance of duplicate tweets identical in data and metadata; a minimal de-duplication sketch follows this list.
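The duplicate-removal step can be as simple as keeping the first record seen for each tweet identifier; the sketch below is illustrative and assumes v1.1 JSON with an id_str field. Gaps can be detected by scanning sorted timestamps, as sketched in Sect. 4.4.

```python
# Illustrative de-duplication: keep the first record seen for each tweet id
# (id_str field assumed).
def deduplicate(tweets):
    seen, unique = set(), []
    for t in tweets:
        if t["id_str"] not in seen:
            seen.add(t["id_str"])
            unique.append(t)
    return unique
```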

There are a number of avenues by which to expand this research:

  • Consolidate and expand the methodology so that it may be applied to other OSNs and collection tools, especially proprietary ones. A method to shed light on the biases of tools would be of great value to the community.

  • Introduce content analysis to the comparison. If social networks built from parallel datasets vary, what is the effect on analyses of the discussion? Are text analysis methods robust enough to overcome differences or is it possible to draw entirely different conclusions, depending on the method used?

  • Examine differences in information flow based on clusters rather than just individual nodes to help inform questions about broad information flow within the networks. If the differences are moderate, then we may draw confidence that overall flows of influence in a network may remain relatively steady, even with variations in the collection.

  • Although OSNs are similar when the interaction primitives they offer are considered, the way in which their feature sets are presented in user interfaces creates a platform-specific interaction culture, which affects the observable behaviour of their users and may in turn affect analyses. An exploration of how platform culture manifests itself and differs across platforms would help inform the search for higher-level social activities such as coordination (Grimme et al. 2018; Pacheco et al. 2020; Weber and Neumann 2020; Graham et al. 2020).

  • Key for future social media research is the development of processes for repeatable analysis, including access to common datasets, both of which underpin the practice of benchmarking. Currently, OSN terms and conditionsFootnote 16 often hamper researchers from exchanging datasets, so new techniques cannot easily be evaluated on the same data, raising questions of fair comparison. For example, Twitter requires that only tweet IDs are shared, forcing the next researcher to ‘re-hydrate’ the tweets by downloading them again from Twitter’s servers. By the time a new researcher does this, the data may have changed: metadata, such as retweet and like counts, are constantly updated; tweets and entire accounts may have been deleted or removed through suspension or account closure; or accounts may have become private and inaccessible. Assenmacher et al. (2021) recently proposed a benchmarking system addressing this issue, to which algorithm implementations can be submitted for execution over a dataset; the dataset remains in the possession of the researcher who collected it, who is then responsible for executing the submitted algorithms. Until a more extensive solution is available, concern will remain over the repeatability of (and confidence in) social media analytics.

Finally: does it matter if a streamed collection is not necessarily complete or representative? As long as researchers make clear how they conducted a collection, and with what tools and configuration, does it not still result in an analysis of behaviour that occurred online? The answer is that it very much depends on the conclusions being drawn. Yes, the collection represents real activity that occurred, but its potential incompleteness may cause conclusions drawn from it to be unintentionally misinformed and lacking in nuance. This is especially important for benchmarking efforts (Assenmacher et al. 2021). We have seen that variations in collections have an impact on network size and structure. This may result in different community compositions and affect centrality analyses, consequently misleading the identification of influential accounts and the expected diffusion patterns. A firm understanding of the data and how it was obtained is therefore vital.