1 Introduction

Agent-based simulations derived from observational data can be a powerful tool to model and predict how users react to different information to which they are exposed on social media platforms. However, a number of challenges present themselves in effectively using available data to build agent models that can be used in high-fidelity simulations to make such predictions. First, the phenomena of interest may be the product of many agents following decision workflows that are not easily recoverable from the available data. Second, machine learning approaches have typically been used in this area to learn particular, focused aspects of user behavior via classification methods (e.g. link prediction, content suggestions), and have not been applied to larger questions of information flow and evolution. Third, ML models that do address global information flow tend to create a single global model that is limited in its predictive power because it does not capture the differing reactions and activities of individual agents, which might behave differently in the future because of changes in their individual environments and goals. Fourth, the data usually embodies observations of events that took place within a given set of social media platforms, meaning that positive examples of user activity abound while negative examples, for instance of when and why a specific action was not taken, are harder to find. Finally, any learned or rule-based behavior must be designed to run very efficiently in order to scale simulations to hundreds of thousands or millions of users while modeling individual decisions taken in their dynamic contexts.

In this article we present a framework for agent modeling of user behavior from observational data that addresses these challenges and show its use in building an effective model of cross-platform behavior across GitHub, Twitter and Reddit. This article is an extension of a paper presented at AAMAS 2020 [24], and it includes in-depth analysis and discussion of the results. Some key contributions of this article include:

  • We develop a set of ML models that learn aspects of user behavior in Twitter and Reddit as classification problems, combining features of users and resources.

  • We integrate the ML models into agent behaviors as alternatives to simpler statistical and rule-based models of behavior developed within the DASH cognitive agent framework [7, 10].

  • We develop features for ML learning that can be interpreted differently for different agents, such as ‘the volume of the Reddit channel I have most frequently contributed to’, to combine generalization from the data with individualized agent behavior.

  • We present performance results of four model configurations that were rigorously evaluated during the DARPA SocialSim Challenge event, along with two baseline comparisons. Our models performed best among all participants across a broad range of metrics defined for the event, indicating the promise of the approach we describe here. The best performing model overall combines learned agent behavior with statistical modeling of bursty behavior, indicating a beneficial interaction between this agent-centered learning and the architecture of the simulation.

The next section describes related work. We then introduce the domain in which we modeled cross-platform information spread and the available data. Next, we introduce our agent model and the machine learning approaches used to define agent behavior. Finally, we present our experimental results and discussion and point to the current and future work of our team in this area.

2 Related work

Several approaches have been proposed to study structure and user behavior on social media networks. For example, Bergenti et al. [4, 5] explore various models to simulate the static structure of the social networks enabled by social media. The authors introduce several metrics for characterizing the networks generated by several agent-based simulations. Similar metrics were used in the DARPA SocialSim Challenge. However, the models discussed there focus on the static structure of the network and are often limited to simulating one social platform, while the networks of the agent simulations we describe here are data-driven and evolve from an existing state.

In recent years there has been evolving interest in studying cross-platform information spread (e.g. [15, 18, 21]). For example, Krijestorac et al. [21] studied the cross-platform spillover effects of viral videos. They found that when video content appears on a lag platform it doubles its subsequent view growth on the original platform. These approaches, however, do not make use of multi-agent simulations.

Simulation of information spread across multiple platforms on the scale of tens of thousands to millions of agents can be computationally intensive. Previous work on the scalability of agent-based simulation frameworks was described in [8, 9, 14, 16]. When simulations are distributed across several computational nodes, it is important to group agents in order to minimize expensive cross-node communication [9, 14, 16]. In [9], the authors extend the geographic or grid-based segmentation heuristics that are commonly used in earlier approaches to consider segmentation based on clusters in the observed communication network.

Applications of machine learning methods have also been widely discussed for link prediction in social networks. There has been extensive research in the area of network embedding and its application to link prediction [17, 19, 22, 26]. Wang et al. [27] cast the link prediction problem as a two-class discrimination problem and use machine learning approaches to learn prediction models. Link prediction is often used for recommending links on social networks [3]. The work we describe here is unique in applying such graph embedding models to the problem of agent decision-making.

Content-based link prediction methods have also been developed and have gained popularity among researchers. A large set of these models focuses on predicting activity in social networks based on a combination of numerous content and user features [13, 29]. Link prediction and information spread prediction on social networks is often challenging because the available training data provides only positive feedback. The authors in [28] use retweet networks to address this problem, with the retweet network used to reconstruct negative feedback.

3 Challenge problem description

The DARPA SocialSim Challenge aims at simulating specific types of dynamics between users and content on three major online social platforms: GitHub, Twitter and Reddit. The goal of the challenge is to predict how a given unit of information will spread across a multi-platform online environment. In particular, it is designed to focus on the simulation of social structure and temporal dynamics with an emphasis on: (1) cross-platform information spread, (2) the spread of specific units of information: common vulnerabilities and exposures (CVE) IDs, URLs and malware, and (3) quantifying how newly created information spreads. A unit of information is any trackable string that can be identified across platforms, including specific URLs, CVE IDs and malware-related keywords. For example, suppose that an important piece of malware is discussed on Twitter with the hashtag #malware_name, and consequently the associated anti-malware software starts appearing on GitHub and is discussed on Reddit. The simulation should capture as precisely as possible the behaviors of all actors involved in the information spread.

All the teams participating in the challenge were provided with training data: a set of relevant actions extracted from GitHub, Reddit and Twitter over the course of 32, 32 and 19 months respectively. The data covers the period before the end of August 2017, which corresponds to the end of the training data period in Fig. 1. All data was provided in "raw" format, as a set of JSON objects returned from the platform APIs for Twitter, Reddit, and GitHub. For GitHub, some additional data was provided, such as information on users, a list of potential bot accounts, and information on repositories. The teams had the freedom to use the provided data as needed to build appropriate models. At the beginning of the challenge, the teams received additional data covering an unidentified period before the simulated time, which could be used to set up the initial conditions of the system (Fig. 1). The initial conditions data covers the 2 weeks right before the test period (simulation time); a time gap between the training data and the initial conditions data could be introduced. The teams were requested to simulate a specific time interval after the initial conditions. The test period, or simulation time, is 6 weeks long, with \(\sim 200\hbox {K}\) users and \(\sim 500\hbox {K}\) tweets and posts in the simulation. The accuracy of the simulation was assessed by comparing the simulation output to the ground truth using the set of measures explained in Sect. 5.2.

At the beginning of the simulation, the initial conditions data contains a set of actions and content, which may contain the relevant keywords or no keywords at all. Users can interact with one or many pieces of content at any time during the simulation.

Fig. 1

Simulation timeline. Training data has been used to train the models. Initial conditions data has been provided on the day of the challenge. It is used to set the initial state of the system before the simulation. The simulation output is then compared to the ground truth to assess the accuracy of the simulation

The simulation output is the list of all the actions performed by the relevant actors on all three platforms. For GitHub, we simulate ten actions that users can take: create or delete a repository, a tag, or a branch (respectively Create and Delete), create or comment on a pull request (respectively PullRequest and PullRequestReviewComment), create or comment on an issue (respectively Issues and IssueComment), and push commits or comment on commits (respectively Push and CommitComment). Moreover, a user can watch and fork existing repositories. For Reddit we simulate two user actions: post – creating original content; and comment – commenting on a post or on another comment. For Twitter we focus on two user actions: tweet – creating an original tweet; and retweet – broadcasting another user's tweet.

4 Agent framework and models

To simulate multiple social platforms with millions of users interacting with hundreds of thousands of pieces of content, we use the DASH agent framework implemented in the FARM simulation platform [10], which is written in Python. The framework was deployed on a cluster of six Linux servers with 8 CPUs and 16GB RAM, which was sufficient for all simulations. The DASH framework was chosen in part because it provides support for cognitive modeling, including a dual process model, supporting reactive planning and forward projection as well as spreading activation, within a framework where specific behaviors can easily be replaced or augmented with customized modules. This feature is used to implement decision processes that are influenced by the output of learning methods, as we describe below. DASH uses an object-oriented model for agents, allowing the creation of generic agents with representations for beliefs about their environment and action models. New agents can rapidly be modeled as sub-classes of these agents with specific additional behaviors. Finally, its discrete event simulation framework and support for large distributed simulations enabled efficient simulations of the size required for this task.

At the core of the simulation framework is a discrete event simulation where the event queue contains two different kinds of events, illustrated in Fig. 2. First, individual agents appear on the queue with specific time-points. When these agents reach the front of the queue they begin a decision process to determine what action will be taken, e.g. a comment on a Reddit post. On completing an action, the agent is placed back on the queue at a time point that implements a basic rate of activity for the agent. The second kind of object on the event queue is a trigger, which on reaching the front of the queue may conditionally initiate new events based on the state of the simulation. As we describe below, in this case we use triggers to model bursty behavior, or rapid increases in the rate of activity perhaps triggered by some event that is external to the social media environment. We call this technique a dual event queue approach. While it is simple, it combines flexibility in modeling global behavior with the efficiency of the discrete event model.
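A minimal sketch of such a dual event queue is shown below. The class and parameter names are illustrative only and are not the actual FARM/DASH implementation; agent self-activation events sit on the main queue, while external triggers sit on a second queue and, when fired, push new agent events onto the main queue.

```python
import heapq
import itertools

class DualQueueSimulation:
    """Sketch of a dual event queue: a main queue of agent self-activations
    and an external signal queue of triggers (e.g. bursts)."""

    def __init__(self):
        self._tie = itertools.count()   # tie-breaker for events with equal timestamps
        self.main_queue = []            # entries: (time, tie, agent_id)
        self.signal_queue = []          # entries: (time, tie, trigger_fn)

    def schedule_agent(self, time, agent_id):
        heapq.heappush(self.main_queue, (time, next(self._tie), agent_id))

    def schedule_trigger(self, time, trigger_fn):
        heapq.heappush(self.signal_queue, (time, next(self._tie), trigger_fn))

    def run(self, end_time, act, next_activation_delay):
        """`act(agent_id, now)` runs one agent decision process;
        `next_activation_delay(agent_id)` implements the agent's base activity rate."""
        while self.main_queue or self.signal_queue:
            t_main = self.main_queue[0][0] if self.main_queue else float("inf")
            t_sig = self.signal_queue[0][0] if self.signal_queue else float("inf")
            now = min(t_main, t_sig)
            if now > end_time:
                break
            if t_sig <= t_main:
                _, _, trigger_fn = heapq.heappop(self.signal_queue)
                # A trigger conditionally activates a set of agents on the main queue.
                for agent_id, delay in trigger_fn(now):
                    self.schedule_agent(now + delay, agent_id)
            else:
                _, _, agent_id = heapq.heappop(self.main_queue)
                act(agent_id, now)
                self.schedule_agent(now + next_activation_delay(agent_id), agent_id)
```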

Fig. 2

Dual event queue approach

4.1 ML ready agents and simulation components

We describe our general approach to modeling agents that are active across multiple social media platforms. Some researchers have modeled cross-platform information spread at the global level or with very simple agents [15, 21]. In contrast, we use a high-fidelity multi-agent model that allows agents to make independent choices using potentially complex decision algorithms. In reality, users of a social network have their personal preferences and habits. For example, one Twitter user may be interested in UK politics, while the other follows sports, and both of them are also interested in crypto coins. The first one might be very active and often engage in discussions, while the other one rarely replies. Users can vary widely across a number of different dimensions. In order to create such a model, however, we must specify this variety of behaviors for possibly millions of agents in a way that matches our observable data, and for this we turn to machine learning approaches as described below.

Agents in our simulation interact with each other by interacting with the social media resources they produce or are associated with. These interactions can take many forms. For example, on Twitter we define agents as users and resources as tweets. Users can interact with tweets by retweeting them, liking them or replying to them.

The environment \(\epsilon \) consists of a set of agents \(A = \{\alpha _1, \alpha _2,\ldots , \alpha _n\}\) and a set of resources \(R = \{r_1,r_2,\ldots , r_m\}\). Agents and resources are defined by their state. The state of an agent or a resource is represented by a vector \(X_{\alpha _i}\) or \(X_{r_j}\) respectively. The states are not static; they change over time as the simulation progresses and the agents and resources engage in activities. These dynamic state vectors are used as the feature vectors in the ML decision models described in Sect. 4.2.

Fig. 3

Workflow of the DASH SocialSim agent - structural view of agent’s modular decision process. Grey blocks show alternative models and implementations

All models used the agent decision process shown in Fig. 3. This process is implemented as a DASH agent and consists of modular components that can have multiple implementations (marked with grey blocks, for example 1.1, 1.2, 3.1–3.2 and 5.1–5.3). Experiments can also be configured to have heterogeneous agents using different module implementations (algorithms). The major components of the agent's decision process are the following:

  1. Activation. Agents become active and ready to take an action in one of two ways: internal self-activation, by coming to the front of the event queue as an agent, or external activation, via a trigger that comes to the front of the queue. These methods can be used simultaneously or separately. The base model uses self-activation and the burst model uses external activation, following the dual queue approach for incorporating external signals into simulations described in Sect. 4.4.

  2. Identification of the horizon. It is both computationally and cognitively infeasible for the active agent to consider all possible resources when making an action decision. The agent's visible horizon h is defined by a set of rules that limit the resources the agent can possibly interact with in a given time step. The users of a social network are presented with a very small subset of the resources they could in theory interact with, as illustrated in Fig. 4a. For example, a Twitter user can interact only with tweets that are displayed on their device. We emulate this limited visibility of the available resources by allowing agents to interact only with a specific fraction of resources. The models for calculating the visibility horizon h are described in more detail in Sect. 4.5.

  3. Invoking the process of selecting a resource and action type. If an agent decides to take an action on an already existing resource (e.g. edit or delete its tweet), it needs to select the resources to interact with. This decision can be translated into the following set of problems: for each agent–resource pair and corresponding sets of features, decide whether the action will be selected. These are classification problems, where a combined set of features \(x = x_{\alpha _i} \mathbin \Vert x_{r_j}\) needs to be classified into a positive or negative class. There are two models for this decision: one is probabilistic (the base model) and the other is based on machine learning (the GitHub ITree model). Models developed for this problem are described in Sects. 4.2 and 4.3.

  4. Starting a new information cascade. Starting an information cascade means creating an original resource rather than acting on an already existing one. Our simulation models use training data to sample the roots of information cascades. Information units (keywords, hashtags, URLs) are also sampled from the training data.

  5. Reacting to an existing information cascade. This process is similar to (3). The decision can be translated into the following question: for each agent–resource pair and corresponding sets of features, decide whether the action will be selected. Twitter and Reddit models developed for this problem are described in Sect. 4.2.

  6. Updating meta information. Agents add and update meta information based on the action that was taken, for example agents' and resources' features and environment features.

Each agent in this framework is a cross-platform agent in the sense that it can take actions on different platforms. The social platform where the action happens is specified when the agent's activation is scheduled on the queue. The parameterization and training required for each component (1–5) in Fig. 3 is done separately for each social platform.

During the simulation, this process is repeated for each agent whenever it is activated, as illustrated in Fig. 4b. To optimize the use of computational resources, we sometimes pre-compute, parallelize or bundle some of the steps when possible.
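Putting the steps above together, a simplified sketch of one agent activation might look as follows. The `horizon` and `score` callables stand in for the interchangeable module implementations (grey blocks in Fig. 3); all names are illustrative and not the actual DASH code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Resource:
    rid: str
    platform: str
    created_at: float
    features: Dict[str, float] = field(default_factory=dict)

@dataclass
class Agent:
    aid: str
    features: Dict[str, float] = field(default_factory=dict)

def agent_step(agent: Agent,
               resources: List[Resource],
               now: float,
               horizon: Callable[[Agent, List[Resource], float], List[Resource]],
               score: Callable[[Agent, Resource], float],
               react_threshold: float = 0.5) -> Tuple[str, str, Optional[str], float]:
    """One activation of an agent (steps 2-6 of Fig. 3, heavily simplified)."""
    visible = horizon(agent, resources, now)             # step 2: visible horizon h
    scored = [(score(agent, r), r) for r in visible]     # steps 3/5: score each candidate
    if scored:
        p, r = max(scored, key=lambda pr: pr[0])
        if p >= react_threshold:
            return ("react", agent.aid, r.rid, now)      # act on an existing resource
    return ("create", agent.aid, None, now)              # step 4: start a new cascade
```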

Fig. 4

Agents, visibility horizon and simulation in time: a An agent can interact with any resource which falls within their visibility horizon h. The decision if a particular agent will interact with a particular resource is left to other decision models; b Multiple agents are active within the simulated environment, and they repeat their decision process over time in the evolving environment

4.2 Twitter and Reddit ML models

ML models predict the probability of interaction between an agent and a resource. In the "real world" users are presented with a set of objects they can interact with. For instance, a user on Twitter can see a few dozen tweets at a given moment. For each of these tweets, the user decides either to like it, retweet it, reply to it or ignore it. In our simulation, for each of the actions the decision process is translated into a binary classification problem for a given feature vector \(X_k = X^{\alpha }_{i} \mathbin \Vert X^{r}_{j} \mathbin \Vert X^{c}_{k}\). The feature vector is made up of a feature vector \(X^{\alpha }_{i}\) for agent i, a feature vector \(X^{r}_{j}\) for resource j and a combined feature vector \(X^{c}_{k}\), which are concatenated into a single vector \(X_k\). This way, for each agent–resource pair we can construct a unique feature vector that will be used to make a decision on an action. During the simulation, as each agent reaches the front of the event queue, it is matched to the set of resources that are within the agent's horizon h, and solves a binary classification problem for each one of them. For a given agent \(\alpha _i\), a vector \(X_k\) is constructed for all resources within the horizon h. For example, suppose the current simulated time step is t and we need to simulate the action of user i on Twitter in the next time step \(t+1\). The user i is defined by the corresponding state vector/feature vector. By using the horizon models (Sect. 4.5) we identify all the tweets (resources) the user might potentially interact with. Each resource j is also defined by its corresponding feature vector. If the user can "see" n tweets, we construct n combined feature vectors \(X_k\) and solve the binary classification problem for each of them. The solutions of the binary classifications are the probabilities of interaction between user i and each of the resources within their visible horizon. During the simulation we then choose the match with the highest probability and perform an "action", say a retweet.
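A minimal sketch of this per-activation classification step is shown below, assuming a trained scikit-learn-style classifier that exposes `predict_proba`; function and argument names are ours, not the paper's.

```python
import numpy as np

def choose_resource(agent_vec, resource_vecs, combined_vecs, clf):
    """Build X_k = X_alpha || X_r || X_c for every resource in the horizon,
    obtain interaction probabilities from a trained binary classifier, and
    return the index and probability of the best-matching resource."""
    X = np.hstack([
        np.tile(agent_vec, (len(resource_vecs), 1)),  # same agent features on every row
        np.asarray(resource_vecs),
        np.asarray(combined_vecs),
    ])
    probs = clf.predict_proba(X)[:, 1]                # P(interaction) for each resource
    best = int(np.argmax(probs))
    return best, probs[best]
```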

Reddit training. On Reddit, users can comment either on a post, that is a top-level topic of discussion, or on another comment down the comment tree. In case of Reddit, we solve two classification problems: (1) post\(\rightarrow \) comment: whether the user will make a comment on a given post and (2) comment\(\rightarrow \) comment: whether the user will make a sub–comment on a given comment. For each of the two problems we build separate training datasets.

For the post\(\rightarrow \)comment problem, we first match all the users with posts they commented on. Then, for each user–post pair we create a unified feature vector by concatenating user features, post features and combined features into a single vector, as illustrated in Table 1. All the vectors created this way are assigned a positive label, as they represent instances of users commenting on posts. The initial dataset thus consists only of positive examples. This problem differs from the standard supervised classification problem by the lack of negative examples in the training set. Usually, to properly train a binary classifier, one needs negative examples. Data on "real" negatives could be harvested by carefully observing users' behavior within the Reddit interface and selecting the posts a user saw but decided to ignore. Such information is not available to us, so we approach this task as a one-class classification problem.

Often referred to as PU learning, this problem is solved by building a binary classifier from a training set consisting of positive P and unlabeled U data. To create the U part of the dataset, we match users with posts they did not comment on and create feature vectors from such user–post pairs. A fairly naive interpretation of the PU learning process is that unlabeled vectors that are similar to true positives will be labeled as positive, and unlabeled vectors that are different from true positives will be labeled as negative. It is out of the scope of this paper to discuss the multiple nuances of PU learning. To build our training sets, we use an approach similar to the one proposed in [23].

Finally, with the training set ready, we build a classifier able to predict whether a given user–post pair will result in the user commenting on the post. After comparing multiple classification algorithms and optimizing parameters, we decided to use a Multi-layer Perceptron classifier with 15 hidden layers, which yielded the highest AUC score.
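The following sketch illustrates how such a PU-style training set and classifier could be assembled. It is a simplification: unlabeled pairs are treated directly as negatives rather than relabeled as in [23], and the network architecture shown is a placeholder, not the exact configuration used in the paper.

```python
import random
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_pu_training_set(positive_pairs, all_users, all_posts, featurize,
                          negatives_per_positive=1, seed=0):
    """Observed (user, post) comment pairs form the positive class; randomly
    matched unobserved pairs form the unlabeled set, treated here as negatives."""
    rng = random.Random(seed)
    positive = set(positive_pairs)
    X = [featurize(u, p) for u, p in positive]   # X = user || post || combined features
    y = [1] * len(X)
    needed = len(positive) * negatives_per_positive
    while needed > 0:
        u, p = rng.choice(all_users), rng.choice(all_posts)
        if (u, p) not in positive:
            X.append(featurize(u, p))
            y.append(0)
            needed -= 1
    return np.asarray(X), np.asarray(y)

# Hypothetical architecture; the paper reports an MLP selected by AUC after a
# model comparison, not necessarily these layer sizes.
clf = MLPClassifier(hidden_layer_sizes=(64,) * 15, max_iter=300)
# clf.fit(*build_pu_training_set(...))
```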

Table 1 To build the agent-based ML models, the training set has been built by assembling user features, post features and combined features

We approach the comment\(\rightarrow \)comment problem in a similar way. The main difference is in the set of features used to predict the child comment. For the post\(\rightarrow \)comment problem we match user features with the post features, while for the comment\(\rightarrow \)comment problem we match user features with the comment features. This way, we allow the agents to apply the different models when deciding to comment, depending on the resource they are commenting on.

Reddit features. For the post\(\rightarrow \)comment problem, the final feature vector \(X_k\) consists of \(\approx 200\) features made from user features \(X^{\alpha }_{i}\), post features \(X^{r}_{j}\) and combined features \(X^{c}_{k}\).

  • User features \(X^{\alpha }_{i}\) consist of preference features, sentiment features and frequency features. Preference features are multidimensional vectors which represent the user's preference towards keywords, subreddits and domains. The main idea behind building the preference vectors is to quantify the user's interests and leaning towards certain topics. The preference vectors are built using the FastText word embedding technique [11]. First, we train a set of FastText models on a corpus of relevant Reddit content. Then, we assign the embedded vectors to each keyword, subreddit name and domain which appears in the training data. During the simulation, there is a possibility of encountering keywords or subreddit names which do not exist in the training data. FastText can embed words which were never seen before but appear in a similar context or have similar lexical features to the words from the training corpus. For each user, the preference vector is computed as a separate weighted average vector of all the keywords, subreddits and domains they used. For example, the keyword preference vector for user i is calculated as \(Pk_i = \sum _{j=1}^{n} {f_j K_j}\), where \(K_j\) is the embedded vector of a keyword j used by the user, and \(f_j\) is the number of times the keyword has been used (a computation sketch follows this list). This way, we are able to identify the direction of the user's interests. The preference vectors for the subreddits and domains are calculated in a similar fashion. Sentiment features quantify two dimensions of a user's writing on Reddit, namely polarity and subjectivity. Polarity is a measure usually in the range between -1 and 1, where 1 indicates a positive statement and -1 a negative statement. Subjectivity measures how likely the sentence is a personal opinion, where 0 is a very objective and 1 a very subjective statement. We calculate the means and standard deviations of the polarity and subjectivity of all of a user's posts and comments separately. Frequency features quantify the user's activity by measuring the number of their posts, comments and subreddits, together with additional temporal measures of the user's activity.

  • Post features \(X^{r}_{j}\), similarly to user features, consist of embedded vectors of keywords, subreddits and domains related to the post. Additionally, we use a single measure of sentiment polarity and subjectivity extracted from the post content. Furthermore, we include information about the author of the post.

  • Combined features \(X^{c}_{k}\) are the cosine and Euclidean distances between the user preference vectors and the post-related embedded vectors. This way we quantify the similarity between the given post and the user's past interests.
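A rough sketch of how such preference vectors and combined distance features can be assembled is shown below. It assumes a trained embedding lookup, e.g. a gensim FastText model's `wv`; names and normalization choices are illustrative.

```python
from collections import Counter
import numpy as np

def preference_vector(used_terms, embed):
    """Frequency-weighted combination of embedded term vectors, following
    Pk_i = sum_j f_j * K_j. `used_terms` lists the keywords (or subreddit names,
    or domains) the user has used; `embed(term)` returns its FastText vector,
    e.g. `model.wv[term]` for a trained gensim FastText model."""
    counts = Counter(used_terms)
    weighted_sum = sum(f * np.asarray(embed(t)) for t, f in counts.items())
    # Dividing by the total count gives the weighted-average form described in the text.
    return weighted_sum / max(sum(counts.values()), 1)

def combined_distance_features(user_pref, post_vec):
    """Cosine and Euclidean distances between a user preference vector and a
    post-related embedded vector (the combined features X^c)."""
    denom = np.linalg.norm(user_pref) * np.linalg.norm(post_vec) + 1e-12
    cosine_dist = 1.0 - float(np.dot(user_pref, post_vec) / denom)
    euclidean_dist = float(np.linalg.norm(user_pref - post_vec))
    return cosine_dist, euclidean_dist
```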

For the comment\(\rightarrow \)comment problem we use a similar approach. Instead of the post features we calculate the similar set of comment features and concatenate it together with the user features and combined features. We add an additional feature which represents the position of a comment in a discussion tree.

Twitter Model. For Twitter, we focus on modeling the 'retweet' action of a user. At any instant, a user has access to a multitude of tweets on the platform. In our simulation, we develop a classification model for the 'retweet' action, (tweet, user)\(\rightarrow \)retweet: whether a tweet will be retweeted for a given (tweet, user) combination. This model is applied to all tweets that an agent has access to whenever it reaches the front of the event queue. The simulation selects the tweet with the highest probability according to the model and creates a retweet event for this agent and tweet. The final model used for the task was a 'Word2Vec' model combined with a 'random forest' model.

Twitter Feature Engineering. In order to build a predictive retweet model, we need truly indicative features from the textual data to help the model learn. There are two main types of retweet activity on the platform: 'normal retweets' and 'quoted retweets'. A normal retweet is when a user retweets the original tweet without any additions, while a quoted retweet adds to the original tweet. A quoted retweet can therefore give more information about the overlap between the tweet and the user profile than a normal retweet. This understanding is important for interpreting what the model learns. The feature space used for the model consists of three types of features: user features, tweet features and combined features.

  • User features: We create a user profile in terms of their historical tweet activity on the platform. The intent is to model the user's persona on the platform based on their past activities. There are 84 features describing a user. One set of features focuses on the user's profile, including the user description, number of followers, number of friends, etc. The other set focuses on the tweet activity of a user, such as their average number of tweets per week, maximum number of user mentions per tweet, average retweet period, etc. We also exploit hashtags used by users to model their interest in the multiple topics circulating on the platform in the form of tweets at any given time. The user description field has interesting potential to define users' interests at a broader level. A retweet score is separately assigned to each of the connections of a user. Every tweet has an author, and the author's retweet score for the user is added to the final set of features. This captures the hypothesis that a user is more likely to retweet a tweet authored by the connection with the highest retweet score.

  • Tweet features: Tweets are described with features including time of the day, day of the week, hashtags used, user mentions, number of characters and several others.

  • Combined features: User historical features are adjoined with features of the entire set of tweets the user can act on at a given time. A 'Word2Vec' model (W) is used to find the similarity of the user's profile with the tweet features. It is trained on all hashtags of the corpus along with user descriptions. A representative feature vector \(V_k\) is obtained for each hashtag. We create an average hashtag feature vector \(H_{u_i}\) for each user based on the user's most frequently used hashtags, and a user description vector \(D_{u_i}\). \(H_{u_i}\) and \(D_{u_i}\) are then used to find the similarity of the user features with tweets. For each hashtag vector \(V_k\) of each tweet t, the cosine similarity with \(H_{u_i}\) and \(D_{u_i}\) is calculated (a computation sketch follows this list).
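The sketch below illustrates how such hashtag-based similarity features could be computed with a trained Word2Vec model. We assume the gensim toolkit here; the text only says 'Word2Vec', so the specific API is our assumption.

```python
import numpy as np
from gensim.models import Word2Vec  # assumed toolkit, not named in the paper

def hashtag_similarity_features(model, user_hashtags, user_description_tokens, tweet_hashtags):
    """Average the user's most frequent hashtag vectors into H_u and the
    description tokens into D_u, then compute the cosine similarity of each
    tweet hashtag vector V_k with H_u and D_u."""
    def avg_vec(tokens):
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    h_u = avg_vec(user_hashtags)
    d_u = avg_vec(user_description_tokens)
    return [(cos(model.wv[h], h_u), cos(model.wv[h], d_u))
            for h in tweet_hashtags if h in model.wv]
```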

Feature engineering equipped the data with the multiple defining features mentioned above. For the current model, we use hashtags as the major input for Word2Vec features for both normal and quoted retweets. The model is then trained on the resulting feature set with retweet labels. A separate development set is used to tune parameters, eventually obtaining a model accuracy of 92% on the test data.

Notice that the sets of features used to build the agent decision models for Reddit are not exactly the same as the features for Twitter. The choice of features depends on multiple factors. First, the platforms we simulate are different, and this dictates the set of features we can use. For example, users on Twitter can follow other users, while on Reddit that is not possible. Second, the data we have access to includes multiple enrichments, such as sentiment measures and other user-related characteristics. During model training, we eliminate the least informative features, which leads to different final sets of features for the two platforms. An example of a feature that matters more on one platform is the time of day at which a user is active. The two platforms operate at different paces, which dictates the rate of actions as well as the lifetime of the content, and also affects the importance of some temporal features. The time of day is important on Twitter and can lead to better prediction, while on Reddit it is less important and can often be ignored.

4.3 GitHub Inverse Tree (ITree) model

Consider the following task: given a user \(u_i\) and a repository \(r_j\), predict the number of actions \(y_{i,j}\) the user \(u_i\) will perform on the repository \(r_j\). This translates to a typical regression problem and could be approached in multiple ways. However, in GitHub the target repository is often uncertain while the rate of a user's actions is typically already known, since a user's activity rate varies slowly and is easily predicted just by observing the rates in the past. Active users remain active, and new users usually start with a low rate of actions. Thus we solve the inverse problem: given a user \(u_i\) and the number of actions \(y_i\), predict the repository \(r_j\) on which the user will most likely perform an action. To predict the repository, we build a model that narrows our choice down to the set of repositories that share the same set of features.

Fig. 5

GitHub Inverse Tree (ITree) model example. Back-traversing a decision tree starting from a leaf node (representing a desired outcome) will result in the upper and lower bounds of the features of interest

For the given vectors of input and target variables, the regression tree finds a mapping from the input variables to a finite number of groups of output variables such that the average within-group squared error of the target variables is minimized. The ITree traverses the trained regression tree backwards and, for a given target variable and set of known features, extracts the boundaries of the unknown features. A regression model relates y to a function of X, so that \(y \approx f(X)\). In our case, \(X=X^{u} \mathbin \Vert X^{r} \mathbin \Vert X^{c}\), and we need to solve \({X^{r}=f(y,X^{u},X^{c})}\). In Fig. 5 we illustrate the back-traversal of a simple decision tree involving a handful of features. In this example, the value of feature3 is already known. The task is to find the upper and lower bounds of the other features that would, in the case of decision tree regression, yield the desired outcome. By traversing the decision tree backwards, we can identify the sets of upper and lower bounds of the desired features. We find this approach useful in cases where the values of some features and the value of the target variable are known, but the values of other features are missing, e.g. user features are known but repository features are missing.

The problem of predicting the set of target variables resembles the problem known as multi-output regression [6, 12], also known as multi-target [1, 2] or multi-response regression [25]. Multi-output regression models are built to simultaneously predict multiple real-valued target variables, and the result is usually a vector of values. However, instead of predicting a single vector, the ITree algorithm outputs two vectors: the first, \(L_i\), with lower bounds and the second, \(U_i\), with upper bounds for each feature we want to predict. Finally, the output is processed as a set of tuples \(Y_i = \{(l_1,u_1), \ldots ,(l_m,u_m)\}\).

The ITree model works in the opposite direction compared to a "traditional" regression tree model. First, we build a training data set D of N instances such that each instance is characterized by an input vector of m real variables \(X_k = (x_k^{1}, \ldots ,x_k^{m})\) and a single target variable \(y_k\). The input vectors are built by concatenating the user features \(X^{u}_{i}\), repository features \(X^{r}_{j}\) and combined features \(X^{c}_{k}\) so that \(X_k = X^{u}_{i} \mathbin \Vert X^{r}_{j} \mathbin \Vert X^{c}_{k}\). Then, we build a regression tree able to predict the number of user–repository actions, which is the target variable y. During the simulation, we do not predict y, as it is already estimated based on the user's previous activity. Instead, we use the target variable \(y_i\) together with the known features \(X^{u}_{i}\) and \(X^{c}_{k}\) and back-traverse the previously trained regression tree. By traversing the regression tree backwards, we extract the boundaries of the unknown features \(X^{r}_{j}\) which describe the repository. Based on the lower and upper feature bounds, we select the set of repositories to choose from. The initial version of the model picks a single repository \(r_{j}\) from the set of chosen repositories uniformly at random and assigns the \(y_i\) actions from a user \(u_i\) to a repository \(r_j\). For instance, assume it is known that user \(u_{i}\) will make z push actions. The ITree model first calculates the feature boundaries of the repositories the user is most likely to push to, narrowing the selection down to multiple repositories as potential targets. The ITree model then randomly chooses z repositories from this narrowed selection and simulates the push actions of user \(u_i\) on the chosen repositories.
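A minimal sketch of the back-traversal using a fitted scikit-learn DecisionTreeRegressor is shown below. The tolerance and leaf-selection strategy are our illustrative choices, not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_feature_bounds(tree: DecisionTreeRegressor, leaf: int, n_features: int):
    """Walk from a leaf back to the root of a fitted regression tree and
    collect the lower/upper bounds imposed on every feature along that path."""
    t = tree.tree_
    parent = {}
    for node in range(t.node_count):
        for child in (t.children_left[node], t.children_right[node]):
            if child != -1:
                parent[child] = node
    lower = np.full(n_features, -np.inf)
    upper = np.full(n_features, np.inf)
    node = leaf
    while node in parent:
        p = parent[node]
        f, thr = t.feature[p], t.threshold[p]
        if node == t.children_left[p]:      # left branch: feature <= threshold
            upper[f] = min(upper[f], thr)
        else:                               # right branch: feature > threshold
            lower[f] = max(lower[f], thr)
        node = p
    return lower, upper

def itree_bounds(tree, target_y, n_features, tol=0.5):
    """Return the (lower, upper) bounds of all leaves whose predicted value is
    close to the known target y; the caller keeps only the bounds on the unknown
    (repository) features and uses them to filter candidate repositories."""
    t = tree.tree_
    leaves = [n for n in range(t.node_count) if t.children_left[n] == -1]
    return [leaf_feature_bounds(tree, n, n_features)
            for n in leaves if abs(t.value[n][0][0] - target_y) <= tol]
```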

4.4 Dual queue event scheduling and burstiness

The framework has two queues: one for scheduling agents' own activation times (self-activation) and one for external signals or triggers. Events scheduled in the main event queue are executed as a discrete event schedule. Events from the external signal queue represent triggers that activate events on the main queue. In this paper, the external queue was used for burst events (burst activation model 1.2 in Fig. 3). Use of the external signal queue is optional because agents can schedule their next events on the main queue using their self-activation rate.

In this paper we discuss only one type of external trigger: burst events. In our experiments we observed that using only the agent self-activation rate does not simulate the burstiness of events well. For this reason we used the external signal queue to trigger bursts of events. Bursts were measured in the training data. A sequence of bursts observed in the training data was sampled and replayed in the simulation. Each burst event was scheduled in the external signal queue. When a burst event is handled by the simulation controller, it creates multiple events associated with this burst and schedules them in the main queue. The size and length of each burst, as well as the participating users, were sampled from bursts observed in the training data. Kleinberg's algorithm [20] was used to detect bursts.
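A small sketch of how such sampled bursts could be replayed through the external signal queue is given below, building on the dual queue sketch earlier in Sect. 4. The burst record fields ('start', 'length', 'users') are hypothetical.

```python
import random

def make_burst_trigger(burst, seed=0):
    """Turn one burst sampled from the training data into a trigger function
    compatible with the DualQueueSimulation sketch above."""
    rng = random.Random(seed)

    def trigger_fn(now):
        # Activate the sampled participants at random offsets inside the burst window.
        return [(user, rng.uniform(0.0, burst["length"])) for user in burst["users"]]

    return trigger_fn

# Hypothetical usage: replay a sampled burst sequence on the external queue.
# for burst in sampled_bursts:
#     sim.schedule_trigger(burst["start"], make_burst_trigger(burst))
```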

4.5 Horizon and limited visibility models

Limited visibility models determine which active discussions and events are visible to agents to react to. Formally, we define the agent's horizon at a time step as all resources that are visible to the agent at that step. Depending on the platform, this may be a list of repositories for GitHub, tweets and retweets for Twitter, or posts and comments for Reddit that are visible to an agent. Resources can form information cascades: a tree of resources (e.g., a tweet and its retweets and replies on Twitter). In our experiments we used resource age (the time between when a resource was posted and when it is viewed) to determine whether it should be visible to an agent.

Resources on different platforms have different life spans. For GitHub we used all repositories observed during the training period (2–6 weeks of training data) and kept them during the test period (6 weeks). Twitter and Reddit are more fast-paced social platforms. For example, the majority of Twitter discussions have a life span of about 2 days.

We used the following algorithm to determine the visible resources for each agent. Each information cascade stays visible for x days, where x is the minimum of a fixed threshold (2 days) and the estimated lifespan of the information cascade. The lifespan of the information cascade is estimated using training data as follows: when a new information cascade is created (an agent creates a root resource, e.g. an original tweet), its metadata (e.g. features) is sampled from the training data. These features include a lifespan, which becomes the estimate of the lifespan of the new cascade.
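A minimal sketch of this visibility rule is shown below; the field names and units are illustrative, and the lifespan value is assumed to have been sampled from training data when the cascade root was created.

```python
def visible_resources(cascades, now, threshold_days=2.0):
    """Age-based horizon rule: a cascade stays visible for
    min(fixed threshold, estimated lifespan) days after its root was created."""
    visible = []
    for c in cascades:
        window = min(threshold_days, c.get("lifespan_days", threshold_days))
        age = now - c["created_at_days"]
        if 0.0 <= age <= window:
            visible.append(c)
    return visible
```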

5 Results

5.1 Experiments and models

Our simulation produces a sequence of events that represent users’ activity on Reddit, Twitter, and GitHub. The resulting sequence of events is then compared to the ground truth according to a set of metrics described below. The ground truth is the sequence of events recorded from the actual social networks for the simulated period, as illustrated in Fig. 1. In this paper we discuss the following model configurations:

  • Random null model: this model does not rely on the simulation framework but instead replays the training data in the test time period, randomly shuffling the order of events.

  • Base model: a model where each step in the agent's decision workflow is executed using a simple probabilistic model. The probability of each decision is based on the frequency of that decision in the training data. For example, steps such as selecting an action type or selecting a resource are separately determined using this approach.

  • ML model: an extension of the base model where for Reddit and Twitter platforms we use the ML model to determine how cascades grow (what tweets are retweeted and what posts are commented on). This is step 5 in Fig. 3.

  • Bursts model: an extension of the base model where, in addition to self-activation, agents are activated by burst events using the external signal queue as described in Sect. 4.4. The approach samples from a set of bursts detected in the training data using Kleinberg's algorithm [20]. For each burst, we sample from a subset of agents that are activated during a very short period of time (within 30 min) after the burst.

  • Combined model: a combination of the Bursts and ML models from above.

  • ITree for GitHub: an extension of the base model where GitHub agents use ITree models to determine the action type and resource (step 3 in Fig. 3).

The ML model, Bursts model, Combined model, and ITree for GitHub model are all built on top of the base model. The ML and ITree models augment the agent’s decision flow, and the Bursts model adds bursts in the event queue.

5.2 Evaluation metrics

In order to capture information spread across platforms we track information units, which are keywords, hashtags or associated URLs, depending on the platform. These are typically gathered from metadata, and one action or resource may be associated with multiple information units. In order to evaluate the predicted information spread across platforms, we measure the number of shares (e.g. number of retweets in Twitter, or comments in Reddit), the size of the audience (the number of unique users who shared), the speed of information spread, and its lifetime. The speed of information spread is measured as the number of shares over time (the number of shares per day).

Fig. 6

Cross-platform community and burst metrics. Our models outperform the base model in the majority of metrics

For model performance evaluation we use the following metrics:

  • Temporal correlation of shares (audience sizes)—the Pearson correlation between the number-of-shares (audience size) time series is computed between all pairs of platforms. This distribution is computed for both the simulation and the ground truth. Then the Kolmogorov-Smirnov test and Jensen-Shannon divergence are computed to measure similarity (a computation sketch follows this list).

  • Time delta—determines the number of hours it takes for a piece of information to appear on another platform. The Kolmogorov-Smirnov test and Jensen-Shannon divergence are computed to measure similarity.

  • Correlation of shares (audiences)—the Pearson correlation between the activity (audience size) time series between all pairs of platforms. Measured as RMSE.

  • Correlation of lifetimes—the Pearson correlation of the average lifetime for each platform, then correlation between simulation and the ground truth is computed. Measured as RMSE.

  • Correlation of speeds—the Pearson correlation between speeds of information across platforms for each information unit, then correlation between simulation and the ground truth is measured. Measured as RMSE.
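As an illustration, the temporal-correlation metric from the first bullet can be computed roughly as follows. This is a simplified sketch using scipy; the daily time series per information unit and platform are assumed inputs, and the binning choice for the Jensen-Shannon divergence is ours, not the challenge's.

```python
import numpy as np
from scipy.stats import ks_2samp, pearsonr
from scipy.spatial.distance import jensenshannon

def cross_platform_correlations(series_by_unit):
    """For each information unit, Pearson-correlate its daily share counts
    between every pair of platforms; returns the distribution of correlations.
    `series_by_unit` maps unit -> {platform: daily count array}."""
    corrs = []
    for per_platform in series_by_unit.values():
        platforms = sorted(per_platform)
        for i, a in enumerate(platforms):
            for b in platforms[i + 1:]:
                corrs.append(pearsonr(per_platform[a], per_platform[b])[0])
    return np.asarray(corrs)

def similarity_to_ground_truth(sim_corrs, gt_corrs, n_bins=20):
    """KS statistic and Jensen-Shannon divergence between the simulated and
    ground-truth correlation distributions (binned on [-1, 1] for JS)."""
    ks_stat = ks_2samp(sim_corrs, gt_corrs).statistic
    bins = np.linspace(-1.0, 1.0, n_bins + 1)
    p, _ = np.histogram(sim_corrs, bins=bins)
    q, _ = np.histogram(gt_corrs, bins=bins)
    js_div = jensenshannon(p, q) ** 2     # squared JS distance = divergence
    return ks_stat, js_div
```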

For burst measurements we use Kleinberg’s algorithm [20] to detect bursts on all platforms (population level measurement). Each metric related to bursts is computed as the RMSE of the correlation between simulation and the ground truth. Figure 6 shows RMSE of correlation of the following distributions:

  • Distribution of the number of bursts

  • Distribution of the average burst size

  • Distribution of the average number of users per burst

  • Distribution of the number of new users per burst

  • Distribution of bursts lifetime

In this paper we present measurements at two scales: the population level and the community level. A community is a subset of users grouped by certain criteria (e.g. active on the same subreddit, using the same repositories, or using similar information units). We identified several dozen communities. Community level metrics were computed separately for each community and the average is shown in Fig. 6. Results for population level metrics (burst and cross-platform) are shown in Fig. 7. Higher values are better in Figs. 6 and 7, as we show 1/RMSE. Table 2 summarizes the evaluation metric values and uses the original non-inverted RMSE values (lower values indicate smaller error).

Table 2 Evaluation metrics values: cross-platform community, cross-platform population and burst metrics

From Table 2 and Figs. 6 and 7, we can see that the ML model, the burst model and their combination outperform the random null model in a majority of the metrics. The ML model and GitHub ITree model scored on average \(23\%\) higher in Jensen-Shannon divergence of the temporal correlation of audiences. The Twitter and Reddit ML models also improved cross-platform community metrics. Both the Jensen-Shannon divergence and the Kolmogorov-Smirnov test of the temporal correlation of audiences and shares have the highest values in the ML model and the combined ML+bursts model, meaning maximum similarity to the ground truth. At the same time, burst metrics were the most difficult for all models we tested. In two out of five metrics the random null model showed better performance.

We discuss these results in greater detail below in Sect. 6.

Fig. 7

Cross-platform population metrics. Our models outperform both the base model and the random model in the majority of metrics

6 Conclusion and Discussion

In this paper we present several agent-based simulation models that utilize machine learning in the agent's decision workflow. We also introduce a dual queue technique for discrete event scheduling that allows incorporating external event signals such as bursts. Our experiments on cross-platform information spread show that machine learning methods can be successfully used in multi-platform agent-based simulations, and that they improve performance metrics compared to simple probabilistic models (the base model and random null model).

In our experiments, all models perform significantly better in terms of the temporal correlation of cross-platform information spread than the random null model (see Figs. 6 and 7). In fact, the random null model scores zeros across all community level metrics.

Both the ML models and the ITree model score on average \(23\%\) higher in Jensen-Shannon divergence of the temporal correlation of audiences. The results are similar for population and community level measurements, showing that the Reddit and Twitter ML models help agents make better decisions (relative to the base model) about which tweets are retweeted and which posts are commented on. The GitHub ITree model likewise helps make better decisions about which repositories agents interact with.

All models presented in the paper are data-driven, which allows us to build large-scale high-fidelity simulations. The results also suggest that predicting bursts of events is hard without external signals. Reproducing bursts using burst samples taken from training data can improve temporal metrics.

The base model and random null model perform relatively well on burst metrics. On most burst metrics the bursts model does not perform well compared to the other models. This suggests that replaying a sequence of bursts from the initial conditions data-set does not improve the metrics that measure burst properties and distributions. However, the combined model (ML + Bursts) scores best across almost all population level cross-platform information spread metrics, while the bursts model and ML model separately come second and third respectively on these metrics. This can be explained by the fact that explicit modeling of bursts (the bursts model) affects all time-sensitive metrics. Introducing bursts alone did not improve the burstiness metrics but improved the temporal correlations of content shares and audience sizes.

There is no single ML model that performs best on all metrics, though models that use the ML approach often outperform others. While the ML approach alone is not always the best solution, it is a viable option for simulating various phenomena, as demonstrated by several metrics. The main advantage of the integrated ML approach is generalizability: with properly developed ML models, we are able to rapidly deploy simulations for new, never-seen scenarios. It is worth emphasizing that the ML models were never used alone, but always as an addition for decision-making within the larger simulation framework. On the other hand, it is important to recognize the flaws of such an approach. The main drawbacks lie in interpretability and potential modeling errors. ML models are prone to modeling errors due to data availability and data imbalance. It might seem counterintuitive, but an online action is, on average, a relatively rare event for a single user. The majority of users are mostly passive and become active only within very short time windows. At the agent level, this characteristic of the system translates into solving a classification problem with highly unbalanced data, where the majority of the data points are labeled as negative and just a few as positive. Additionally, due to the scale of the problem, our ML models have not been tuned by running the simulation multiple times and choosing the best parameters. The ML models are trained in separate procedures first and then embedded in the agent's decision pipeline. This multi-step procedure might allow some unwanted artifacts to be transferred from training to the simulation.

The results presented in this paper lay the groundwork for future research and extensions of the proposed models. This includes incorporating external signals that drive the activation of agents, which can potentially improve the prediction of event bursts (the reaction of a subset of agents to an external signal). Machine learning methods can also be applied to other aspects of the agent's decision workflow. For example, when agents generate new information cascades they have to choose information units. This choice can also be learned from observational data.

The approach we have described is capable of incorporating models that operate on different levels of resolution, e.g. a macro level (models of global phenomena), a meso level (models of groups of agents), and the micro level of individual agents and resources. This is important, since observation data pertinent to a simulation may be available at several different levels, and some predictive techniques are known to work well at a macro level where more data is available and predicted features may be more stable. In this work, for example, we made use of burst prediction data, which predicts a volume of activity independent of particular agents, by assigning predicted actions to individual agents that are naturally likely to take those actions. We are currently developing more general and flexible approaches to incorporate learned information described at a macro or meso level within the agent simulation at an individual level.

Finally, there are multiple ways in which the event visibility horizon can be implemented. In this paper we used a naive approach with an estimation of the resource life span (which determines how long it stays within the horizon). The rules that define which events are visible can affect agents' decisions. The visibility horizon model can also be used as a tool for modeling social network users' cognitive load and limitations.