
1 Introduction

Social networks like Twitter (currently rebranding to X) have become an integral part of our social lives. They have revolutionized the way we communicate online, shape public discourse, and provide access to the latest news and opinions. One major issue within social networks is the prevalence of bot accounts, which have been known to influence public opinion, especially in critical areas like politics or financial markets [2]. It is notoriously hard to estimate the true extent of the presence of bots on social media platforms, and platforms may be incentivized to misrepresent it, as it could negatively impact revenue (Footnote 1). In 2017, Varol et al. estimated that bots may account for up to 15% of all Twitter accounts [13]. In another study, Cresci et al. analyzed Twitter discussions concerning the US stock market and concluded that up to 71% of the engaged users might be bots [4].

Furthermore, bots appear to become more sophisticated over time [2, 6], a phenomenon often referred to as bot evolution. This term describes the adversarial cycle in which newer bots evade increasingly sophisticated bot detection measures by becoming progressively indistinguishable from real humans. An illustrative example of this effect is the result reported in early 2017 by Cresci et al. [3]: participants tasked with telling bots apart from legitimate users were only able to correctly identify newer bots with 24% accuracy, compared to 91% for older bots. Cresci [2] points out that bot detection methods must be able to distinguish genuine users from bots that disguise themselves as genuine users through stolen profile pictures and neutral messages. This complexity has been further intensified by the advancement of artificial intelligence, particularly generative AI, which makes it more difficult to separate individual bot accounts from genuine users. The increasing difficulty in distinguishing between human-written and AI-generated text underscores the complexity of the issue, as highlighted by OpenAI's decision to disable their AI classifier as of July 2023 due to its low accuracy in distinguishing AI-generated from human-written content (Footnote 2).

In response to these challenges with feature-based methods, graph-based methods are emerging as an alternative, due to their proven effectiveness in recognizing coordinated, synchronized activities [6]. By leveraging these techniques, it becomes possible to study not only how users interact with content, but also how they interact with other users. The rationale behind these approaches stems from the assumption that human-guided, authentic activities typically display more variability than their automated, inauthentic counterparts. This emphasizes the need to move beyond analyzing individual accounts and towards identifying patterns of suspicious coordination within groups.

However, research by Elmas et al. [5] on retweet bots, utilizing data from services previously purchased on black-market sites, discovered discrepancies in common assumptions about bot characteristics, including, but not limited to, volume of activity, diversity, follower and following counts, and temporality. They showed that bots may emerge from compromised accounts, acting as bots only for certain periods of time, and did not find a single case of one bot following another. Such insights should prompt researchers to critically assess whether the metrics used to evaluate the performance of bot detection methods are in fact contributing to improving downstream applications. Hays et al. [8] argued that this is currently not the case for Twitter bot detection tools, attributing their high performance to simplistic collection and labeling practices in the datasets employed. Separately, Martini et al. [10] observed that different methods yield remarkably different results when compared. This implies that current tools may not be ready for downstream usage and may result in the misclassification of many users [11].

With the heightened difficulty in identifying individual bot accounts, we focus our efforts on group activities and their coordinated behavior patterns. Our work is in line with trends in recent research that focuses more on actions and behavior of groups of accounts rather than on the classification of individual accounts [1].

We investigate the potential of new sets of relations that are challenging to circumvent; any attempt to do so could drastically limit the functionality of organized automated actions by restricting their common operational patterns. The goal of our research is to determine the feasibility of utilizing coordination patterns for bot detection, with due consideration to both the inherent complexities and data restrictions. With these challenges in mind, we contrast first-order, behavior-based relations, such as retweets (a user sharing a tweet), with higher-order relations like co-retweet (two users retweet the same tweet) and co-hashtag (two users tweet the same hashtag more than a certain number of times). The former highlights direct user behavior, while the latter reveals shared interests or subjects, uncovering subtler collective actions. This approach is set against the currently conventional method of utilizing follow relations, which are more static. Using the same dataset and graph neural network architecture across all experiments, we conduct a comparative study between the conventional follow relations and those centered around behavioral patterns to assess their impact on bot detection, avoiding the introduction of new uncertainties through algorithmic changes or dataset variations. Though our results do not surpass the conventional approach, they remain competitive in terms of accuracy and F1-score, demonstrating the viability of behavior-based relations. To the best of our knowledge, this is the first work that integrates higher-order relations in a behavior-based approach to bot detection.

2 Methodology

2.1 Dataset

We utilize the TwiBot-22 dataset for our experiments. Compared to previous datasets, TwiBot-22 includes a broader and more diverse range of relations. For an in-depth exploration of the dataset's conceptual framework, we refer the reader to the work of Feng et al. [6] that introduced TwiBot-22. Previous bot detection methods were constrained to rely only on follower/following relationships between user entities and an implicit relation between users and their tweets. The TwiBot-22 dataset encompasses an extensive set of 14 different kinds of relations. In this work we leverage the follower (user a is followed by user b), following (user a follows user b), retweet (tweet a retweets tweet b), post (user a posts tweet b), and discuss (tweet a discusses hashtag b) relations. We believe that this range of relations offers substantial potential for the future development of more sophisticated and accurate bot detection methods. The accessibility of these diverse relations not only enhances our analytical capabilities but also allows us to reveal hidden connections between users, cross-referencing entities in ways previously unattainable. We refer the reader to Table 1 for an overview of TwiBot-22, comprising both statistics and an exploration of some of the characteristics that differentiate humans from bots. The left side of the table provides a quantitative overview of the dataset; the right side offers a more nuanced analysis of the differences in human and bot behavior. Key contrasts include variations in tweet and following/follower counts (Footnote 3) as well as ratios like hashtag-to-tweet, revealing discrepancies between the two types of accounts. This comparative analysis offers valuable insight that guides the process of deriving new relations. We explore these aspects further in subsequent sections, specifically in Subsect. 2.3.
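As a minimal illustration of how we work with these relations, the following sketch loads the dataset's edge list and splits it into per-relation tables. The file name and column names are assumptions for illustration; the actual TwiBot-22 release may use a different layout.

```python
import pandas as pd

# Hypothetical file and column names: we assume a single edge list that mixes
# all 14 relation kinds and split it into one frame per relation.
edges = pd.read_csv("edge.csv")  # assumed columns: source_id, relation, target_id

used = ["follower", "following", "retweet", "post", "discuss"]
relations = {
    name: group[["source_id", "target_id"]].reset_index(drop=True)
    for name, group in edges.groupby("relation")
    if name in used
}

post = relations["post"]        # user  -> tweet
retweet = relations["retweet"]  # tweet -> tweet
discuss = relations["discuss"]  # tweet -> hashtag
```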

Table 1. Statistics (left) and in-depth analysis (right) of human and bot characteristics in TwiBot-22. *: users with at least 1 tweet. †: users with at least 1 follower / following.

While TwiBot-22 is believed to contain high-quality labels, it is important to recognize that we cannot entirely dismiss the possibility of underlying biases towards older notions of bot characteristics. A potential bias could be introduced by the use of non-transparent, hand-crafted labeling functions and the dependence on existing bot detection methods. These methods are often trained on follow relationships, an assumption we challenged in the introduction. This reliance on possibly flawed assumptions may further steer bot detection in the wrong direction. In addition, recent evidence indicates that classifiers performing exceptionally well within one dataset may significantly underperform when applied to others, even when employing more sophisticated models [8]. This may be attributed to the reliance on inherently unstable features present in the initial training data. Therefore, although TwiBot-22's expert-guided process signals a marked improvement, the broader methodology might compromise the dataset's overall effectiveness. Nevertheless, we treat the labels in the data as ground truth. This assumption is made due to the lack of better annotation methods and the inherent difficulty of this problem.

2.2 BotRGCN

BotRGCN (Bot detection with Relational Graph Convolutional Networks) [7] is a graph-based method for Twitter bot detection. The model first creates a multi-modal encoding by jointly encoding multiple numerical and categorical user properties, as well as encoding user tweets and descriptions using a pre-trained RoBERTa model. These encodings serve to represent individual users, capturing diverse aspects of their behavior and characteristics. A heterogeneous graph is constructed by defining multiple relational neighborhoods for each Twitter user. BotRGCN applies relational graph convolutional networks (RGCN), which support a variable number of relations, allowing the model to capture complex patterns of interaction between users. We chose to work with BotRGCN due to its modular and well-designed architecture that allows for easy modification and experimentation. The model was used with the hyperparameter initialization found in the original implementation, available in the corresponding GitHub repository (Footnote 4). Adjustments were made to accommodate the specific number of categorical and numerical properties in TwiBot-22. The architecture and specific components of BotRGCN are further detailed in Table 2.
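A minimal PyTorch Geometric sketch of this pipeline is given below. It is not the authors' implementation: the layer sizes, activations, and the way modalities are split across the hidden dimension are assumptions, but it reflects the overall structure summarized in Table 2 (per-modality input layers, concatenation, two RGCN layers with dropout in between, and a linear classifier trained with cross-entropy loss).

```python
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv


class MiniBotRGCN(nn.Module):
    """Simplified BotRGCN-style model: encode each modality, fuse, apply RGCN."""

    def __init__(self, d_desc, d_tweet, d_num, d_cat, d_hidden, num_relations):
        super().__init__()
        d_part = d_hidden // 4
        # One input layer per modality; their outputs are concatenated.
        self.enc_desc = nn.Linear(d_desc, d_part)
        self.enc_tweet = nn.Linear(d_tweet, d_part)
        self.enc_num = nn.Linear(d_num, d_part)
        self.enc_cat = nn.Linear(d_cat, d_part)
        # Two relational graph convolutions with dropout in between.
        self.rgcn1 = RGCNConv(4 * d_part, d_hidden, num_relations)
        self.rgcn2 = RGCNConv(d_hidden, d_hidden, num_relations)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(d_hidden, 2)  # human vs. bot

    def forward(self, desc, tweet, num, cat, edge_index, edge_type):
        x = torch.cat(
            [self.enc_desc(desc), self.enc_tweet(tweet),
             self.enc_num(num), self.enc_cat(cat)], dim=1)
        x = torch.relu(self.rgcn1(x, edge_index, edge_type))
        x = self.dropout(x)
        x = self.rgcn2(x, edge_index, edge_type)
        # CrossEntropyLoss applies the softmax implicitly (cf. Table 2).
        return self.classifier(x)
```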

Table 2. Architecture of the BotRGCN model. Variables: D: embedding size, \(D_s\): description size, \(T_s\): tweet size, \(N_s\): numerical properties size, \(C_s\): categorical properties size. The input layers’ outputs are concatenated before processing through the hidden layers. A dropout regularization technique is applied between the RGCN layers. The model is used with the CrossEntropyLoss, which implicitly includes a Softmax activation on the output.

2.3 Derived Relations

Elmas et al. [5] argue that a significant challenge in bot detection is the non-intuitive nature of bot characteristics. For instance, their analysis revealed that the majority of bot accounts in their dataset had more followers than accounts they were following, and no two bots followed each other.

Moreover, the authors also observed different retweet behavior for bots, both temporal and quantitative. This insight, coupled with the observation of bot evolution, led us to investigate the potential offered by new sets of relations.

Inspired by the work of Vargas et al. [12], which builds upon coordination patterns from [9], we introduce the following relations:

  • Retweet: a user retweeted the tweet of another user.

  • Co-Retweet: two users retweeted the same tweet.

  • Co-Hashtag: two users tweet the same hashtag above a certain threshold.

These relations are behavior-based, which makes them harder to manipulate than, e.g., follower and following relations. We believe that this approach has the potential to reveal additional patterns of coordinated behavior among users. However, none of these relations is readily usable out of the box; each requires some data transformation steps.

Retweet: Our analysis showed that bots tend to retweet disproportionately. In order to take advantage of this, we first need to transform the existing retweet relation from tweet\(\rightarrow \)tweet to user\(\rightarrow \)user. By cross-referencing the given retweet relation with the post relation (user\(\rightarrow \)tweet), we are able to associate a user with each tweet and subsequently derive the retweet relation in the form user\(\rightarrow \)user. This process is illustrated in Fig. 1.
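A sketch of this join, assuming the per-relation frames from the earlier listing (column names are illustrative), could look as follows:

```python
import pandas as pd

# post: user -> tweet; retweet: tweet -> tweet (per-relation frames from above)
tweet_author = post.rename(columns={"source_id": "user_id", "target_id": "tweet_id"})

# Attach the author of the retweeting tweet and of the retweeted tweet,
# turning the tweet->tweet retweet relation into a user->user relation.
user_retweet = (
    retweet
    .merge(tweet_author, left_on="source_id", right_on="tweet_id")
    .rename(columns={"user_id": "retweeting_user"})
    .merge(tweet_author, left_on="target_id", right_on="tweet_id")
    .rename(columns={"user_id": "retweeted_user"})
)[["retweeting_user", "retweeted_user"]]
```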

Co-Retweet: We introduce this relation to capture instances where two users retweeted the same tweet. To achieve this, we map a user to each tweet that retweets another tweet, similar to the process laid out for the retweet relation above. Then, we group these users by their retweeted target tweet. From these groups, we create all possible combinations of distinct users and export them as our new co-retweet relation.
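A sketch of this grouping and pairing step, reusing the frames from the previous listings, might look like this:

```python
from itertools import combinations

import pandas as pd

# (retweeting user, retweeted target tweet), one row per user/tweet combination
retweeters = (
    retweet
    .merge(tweet_author, left_on="source_id", right_on="tweet_id")
    [["user_id", "target_id"]]
    .drop_duplicates()
)

# For each retweeted tweet, emit every unordered pair of distinct retweeters.
pairs = []
for _, group in retweeters.groupby("target_id"):
    for u, v in combinations(sorted(group["user_id"]), 2):
        pairs.append((u, v))

co_retweet = pd.DataFrame(pairs, columns=["user_a", "user_b"]).drop_duplicates()
```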

Fig. 1. Visualization of the process of deriving the new Co-Hashtag (co_hashtag) relation. Initially, the edge file is split into individual relations (not depicted). We then join the post and discuss relations to associate user ids with each hashtag in the discuss table. In this example we assume a threshold of 100 on the amount value, below which co_hashtag occurrences are discarded. We then create pairs of users together with the count of how often they share a hashtag. Lastly, we keep only pairs with at least n shared hashtags and discard the amount column to obtain the expected format.

Co-Hashtag: Using a similar grouping and pairing approach as for the Co-Retweet relation, we focus on the discuss relation (tweet\(\rightarrow \)hashtag). Prior to the pairing step, we filter out hashtags with an unusually large number of users, both to decrease computational demands and to discard hashtags that do not offer any reasonable insight. After this step, we create pairs of users who tweeted the same hashtag a minimum of n times. The choice of n can be regarded as a hyperparameter itself and is detailed further in the subsequent experiments section and Table 3.
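The following sketch illustrates one way to derive this relation from the post and discuss frames introduced earlier. The popularity cap, the value of n, and the counting of distinct shared hashtags per user pair are illustrative assumptions; the exact filtering and counting follow the description above and Fig. 1.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

n = 100             # illustrative minimum number of shared hashtags (hyperparameter)
max_users = 10_000  # hypothetical cap to drop extremely popular hashtags

# user -> hashtag, obtained via the tweet that carries the hashtag
user_hashtag = (
    discuss.rename(columns={"source_id": "tweet_id", "target_id": "hashtag"})
    .merge(post.rename(columns={"source_id": "user_id", "target_id": "tweet_id"}),
           on="tweet_id")[["user_id", "hashtag"]]
    .drop_duplicates()
)

# Discard hashtags used by an unusually large number of users.
popularity = user_hashtag.groupby("hashtag")["user_id"].nunique()
keep = popularity[popularity <= max_users].index
user_hashtag = user_hashtag[user_hashtag["hashtag"].isin(keep)]

# Count, per user pair, how many hashtags they share; keep pairs with >= n.
pair_counts = Counter()
for _, group in user_hashtag.groupby("hashtag"):
    for u, v in combinations(sorted(group["user_id"].unique()), 2):
        pair_counts[(u, v)] += 1

co_hashtag = pd.DataFrame(
    [(u, v) for (u, v), c in pair_counts.items() if c >= n],
    columns=["user_a", "user_b"],
)
```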

3 Experiments

To determine the feasibility of utilizing coordination patterns for bot detection, we conducted sensitivity and ablation studies. We kept hyperparameters constant across all experiments. The model is initialized with the same parameters as mentioned in Subsect. 2.2. We further fixed the dropout rate at 0.3, the learning rate at 0.001, and the weight decay at 0.005, and standardized the number of training epochs to 200 across all experimental runs. We reused the train/test split that comes with TwiBot-22 for comparability with prior work.
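A minimal sketch of this fixed training configuration is shown below. It reuses the MiniBotRGCN sketch from Sect. 2.2; the optimizer choice (AdamW) and the tensor and mask names are assumptions, while the learning rate, weight decay, and epoch count match the values stated above.

```python
import torch

# model: MiniBotRGCN instance; desc, tweet, num, cat, edge_index, edge_type,
# labels, train_mask: assumed tensors built from the TwiBot-22 data.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    logits = model(desc, tweet, num, cat, edge_index, edge_type)
    loss = criterion(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
```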

Table 3. Sensitivity study of the co-hashtag edge creation threshold. The Amount column corresponds to the parameter \( n \), representing the minimum number of times pairs of users tweeted the same hashtag. We run each experiment five times and report the average value as well as the standard deviation in parentheses.

First, we determined a threshold for the Co-Hashtag relation. The threshold was set to three standard deviations above the mean, with the resulting values provided in Table 3. Since the differences between the thresholds were minor, we chose the one that achieved the highest F1-score, indicating the most reliable predictions. Results of additional experiments with different sets and numbers of relations are reported in Table 4. Notably, the follower relation alone yielded the best results, as opposed to the common follower+following combination, which matches the intuition that this relation can be a strong indicator. Our main interest, however, was in the newly derived behavioral relations, with follow relationships serving as a baseline for comparison.
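As an illustration, one plausible reading of this rule is to compute the cut-off from the distribution of shared-hashtag counts per user pair; the snippet below assumes pair_counts from the earlier co-hashtag sketch holds those counts, which is an assumption rather than a description of our exact pipeline.

```python
import pandas as pd

# Hypothetical computation of the cut-off: mean plus three standard deviations
# of the shared-hashtag counts per user pair (pair_counts from the sketch above).
counts = pd.Series(list(pair_counts.values()))
threshold = counts.mean() + 3 * counts.std()
print(f"co-hashtag threshold n = {threshold:.0f}")
```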

Our findings require contextual interpretation, contrasting our approach with the conventional use of follower+following relations. Instead of relying on these, we leverage the higher-order co-retweet and co-hashtag relations to capture more complex user behaviors, such as a mutual affinity for retweeting particular content or using the same hashtags above a certain level. We do not dismiss the retweet relation, however, and still consider it valuable for future exploration. Though we did not outperform the conventional approach, our results are closely competitive, with accuracy at most 1.22 percentage points and F1-score at most 3.78 percentage points lower.

Table 4. Sensitivity analysis of BotRGCN to different edge types in the graph. We run each experiment five times and report the average value as well as the standard deviation in parentheses.

This gap, although initially discouraging, reveals upon closer examination that the derived relations are capable of supporting reliable predictions while avoiding biases that might have characterized previous approaches; our concerns regarding these biases are outlined in Subsect. 2.1, dedicated to the dataset. Despite the notable performance of the single follower relation, there is a clear improvement when using three or five relations instead of two. This highlights the potential of a multi-relational approach, although it is essential to note that inherent characteristics of the dataset might influence these observations. Such results are particularly significant, as bot developers may find it challenging to evade behavior-based detection without substantially constraining their capabilities. Building on the findings of Feng et al. [7], which confirmed that optimal performance is achieved with 2 layers of RGCN, we carried out an ablation study of BotRGCN using the same layer configuration. Our experiments, detailed in Table 5, show that the integration of all available modalities remains essential for robust bot detectors. The challenge requires a multi-faceted approach that integrates various modalities and models the aggregation of these signals, aiming to ensure a clear distinction between accounts involved in automated coordinated efforts and those demonstrating authentic behavior, which may stem from genuine social initiatives.

Table 5. Ablation Study of BotRGCN under different relation types using 2 layers of RGCN. Abbreviations used: T = User Tweets; N = User Numerical Properties; C = User Categorical Properties; D = User Descriptions. We run each experiment five times and report the average value as well as the standard deviation in parentheses.

4 Conclusion

The complexity of bots continues to evolve, making bot detection a critical challenge. Our investigation into alternative higher-order, behavior-based relations demonstrates a different approach to detecting automated coordinated group activities. Although not surpassing the conventional approach, the competitiveness of our results suggests a reliable method that avoids the suspected biases of traditional techniques. Bot developers seeking to avoid detection may find it increasingly difficult to do so without limiting their capabilities. TwiBot-22, the dataset used in this study, has been instrumental in establishing these new relations. Looking towards further research, the incorporation of temporal patterns into these newly established relations seems promising. This direction, however, necessitates datasets that support such patterns, a limitation we currently face. We are optimistic that pursuits in this direction can foster the development of more robust and reliable detection methods.