
1 Introduction

The recent pandemic caused by the SARS-CoV-2 virus has fundamentally shaped the way we plan for and respond to the spread of highly-infectious pathogens. Drastic control measures like imposing general lockdowns proved to be particularly damaging to the global economy and the wellbeing of the population [42], causing widespread discontent among all social strata. As such, less restrictive health interventions were introduced instead to curb dangerous infection rates, such as educating the public to socially distance, deploying large-scale testing schemes, and quarantining contacts through different tracing mechanisms [21]. Despite the advent of highly effective vaccines [3, 12], financing and support for these measures continued for several months in the majority of the Western world, fueled by evidence of their continued efficacy [63, 83]. However, with the emergence of seemingly milder variants [91], concerns about the limitations [67] or the societal impact [15, 60] of the aforementioned interventions, and growing evidence of reduced public compliance [18], several administrations decided to significantly reduce the resources allocated to these programmes. In the United Kingdom, for example, the new “living with COVID” strategy meant appreciable cost reductions could be achieved [61], while heavy disruptions like the recent “pingdemic” could be entirely avoided [76, 81]. Unfortunately, blindly scaling down the public health efforts to break transmission chains has proven unsuccessful, as cases across the country soared yet again within a relatively short timeframe, a trend that has been replicated across Europe [34]. With vaccine protection waning over time [24, 55], and with demand for further doses decreasing among healthy adults [94, 97], similar surges could reoccur henceforth.

In this work, we propose a major shift in the implementation of “test and trace” programmes that is adaptable to a country’s budget and risk tolerance, while minimizing the burden of viral infection chains. To achieve this, we study different types of targeted policies for conducting testing and isolating contacts in an epidemic under fixed budgeting requirements, and show that a reinforcement learning agent can derive powerful and generalizable policies that outperform all baselines considered in terms of infection reach. We validate our results on several epidemic, budget and interaction network configurations, illustrating the versatility of our proposed method. Moreover, we demonstrate that even static non-learning agents significantly outcompete customary untargeted strategies.

The contributions in this paper are threefold:

  1. We put forward a novel way of operating public health interventions in a realistic scenario where economic and societal disruptions are to be minimized: restricting the testing and tracing efforts to higher-risk individuals. To that end, we derive highly effective policies using different agents, including centrality-based, neighborhood-based and learning-based ones, and compare them against more traditional approaches, such as random, acquaintance or frequency-based sampling. Our reinforcement learning agent, backed by a Graph Neural Network (GNN) adapted from the recent development of [65], is shown to outperform the other methods in both tasks across numerous configurations, despite being trained on a simple test prioritization setup with partially observable information.

  2. Aside from presenting the numerical results and epidemic curves resulting from running our control policies over multiple simulations, we also study what the learning-based agent chooses to focus on while making its decisions. For such a system to be deployed in the real world, policy makers need to be reasonably confident the model produces sensible outputs. At the same time, testing or isolation decisions have to be explainable and verifiable when audited or contested. Here, we explore a perturbation-based technique for explaining the policy derived by an agent’s GNN module, GraphLIME [39], putting into perspective the former’s superior adaptability. Moreover, we propose visualizations for the inferred node embeddings that can be used to direct community-wide interventions or scrutinize the model’s performance.

  3. We apply our framework to several scenarios featuring COVID-specific spreading models, including a multi-site mean-field [40, 83] and an agent-based model, with parameters obtained from [20], and show that our agents consistently perform well across a diverse set of experimental setups.

2 Related Work

2.1 Epidemic Modelling

Traditionally, simulating epidemics has been accomplished using either equation-based or agent-based models. The first of these is possibly the most common, owing its appreciable success to early work by [45], where the modelled population was said to transition between disease-specific compartments according to a system of ordinary differential equations. Recent years, however, have seen agent-based approaches become more popular, partly due to their superior granularity and ability to assess a system’s behavior at the individual level [98]. Government-advising groups in the United Kingdom employed this paradigm during the initial waves of the COVID-19 pandemic to assess the effects of public health interventions [25, 35]. Others used such formulations to study the combined effects of manual tracing with digital solutions at various application uptakes, employing parameters fitted to infection data from several regions [1, 83]. In this study, we simulate viral epidemics using a modified version of a recently-proposed multi-site mean-field model [83], which relies on the SEIR compartmental formulation but retains the capacity to leverage an individual’s locality information through contact graphs and mean effects [23, 40]. For completeness, we also investigate our policies in a purely agent-based setup, similar in spirit to the network-based approaches proposed in recent works [1, 65]. In both cases, we employ the COVID-specific dynamics parameters inferred by [20], and allow all disease-unrelated events to be time-discretized (i.e. selecting an action or updating the active links set takes place every \(t_u\) days, with \(t_u\) = 1).

2.2 Graph Neural Networks and Reinforcement Learning

A few years back, graph neural networks became one of the de facto machine learning tools for processing graph-structured information [27, 110]. The earliest studies in this space defined a GNN as a set of two functions: transition \(f_\theta \) and output \(o_\theta \) [30, 86]. The former expresses the dependence between a node i and its vicinity, while the latter controls the space spanned by the model output. These functions form the basis of what later came to be known as the message-passing paradigm [29], which quickly became dominant in the field due to its effectiveness and computational efficiency [11]. After graph convolutional networks (GCN) were first introduced [48], many GNN successes followed suit [11, 32]. Motivated by the accomplishments of attention mechanisms in natural language processing [19, 104] and computer vision [4, 46], authors soon enhanced GNNs with attention capabilities, often increasing their performance (e.g. GAT [105], GATv2 [10]). Expressive power bounds for the message passing algorithm were later noted by [108], who then proposed an architecture that reaches the upper limit of the widely-used Weisfeiler-Lehman heuristic (1-WL): the Graph Isomorphism Network (GIN). Efforts to break node symmetries and surpass this upper bound have been significant ever since, with many approaches currently existing: augmenting the nodes with random features [85], modifying the message passing rule [6], or changing the input graph structure itself [71]. Additionally, issues such as feature oversmoothing [36, 74] and bottlenecks [2] have been identified as common reasons for underperforming message passing systems, with proposed solutions ranging from maintaining a low layer count and connecting all nodes in the last layer to ease information flow, to augmenting the message exchange routine (e.g. Neural Sheaf Diffusion [8]). Our framework leverages ideas from GATv2 and GIN to attain expressive power and computational efficiency, while reducing the impact of the above problems by using randomised node features and a small number of GNN layers, with a final fully-adjacent layer to mitigate the over-squashing of long-range dependencies.

A widely-used approach for explaining predictions in deep learning involves perturbing the inputs and fitting local explainable models to each data point and its corresponding perturbations. LIME [80] and SHAP [58] are two popular examples of this methodology. Although the above are directly applicable to GNNs, they do not possess the capability to leverage structural information from graph data or capture nonlinear relationships between the inputs and the outputs. To address these limitations, GraphLIME was proposed [39]. GraphLIME replaces the local perturbations matrix with stacked node features selected from a node’s neighborhood, fitting nonlinear interpretable models using HSIC Lasso [109].
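
To make the procedure concrete, the sketch below illustrates the local-surrogate idea in simplified form: for a given node, the features of its stacked neighborhood samples are turned into centered kernel matrices, and a non-negative Lasso is fit against the kernelized model outputs, yielding one importance weight per feature. This is our own minimal approximation of the HSIC Lasso step rather than the exact GraphLIME implementation; all function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def centered_rbf_kernel(x, gamma=1.0):
    """Centered RBF kernel matrix for a single feature column x of shape (n,)."""
    d = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-gamma * d)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix, as used in HSIC
    return H @ K @ H

def explain_node(X_neigh, y_neigh, alpha=0.01):
    """One importance weight per feature for a node, from its stacked neighborhood samples.

    X_neigh: (n_neighbors, n_features) features of the node's neighborhood
    y_neigh: (n_neighbors,) model outputs (e.g. raw action scores) on those neighbors
    """
    n, f = X_neigh.shape
    K_feats = np.stack([centered_rbf_kernel(X_neigh[:, j]).ravel() for j in range(f)], axis=1)
    L_out = centered_rbf_kernel(y_neigh).ravel()
    surrogate = Lasso(alpha=alpha, positive=True, fit_intercept=False)
    surrogate.fit(K_feats, L_out)
    return surrogate.coef_    # non-negative "beta" importances, one per feature
```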

Among many other domains, GNNs have also been extensively used in the context of epidemiology. From the literature dedicated to COVID-19, we note here several noteworthy efforts: infection forecasting [44, 75], full population state estimation [102], finding “patient 0” [90], and controlling public interventions, such as testing [65] or vaccination policies [41].

Sequential decision processes are often modelled via Markov Decision Processes (MDPs) of the form (\(\mathcal {S}\), \(\mathcal {A}\), \(\mathcal {P}\), \(\mathcal {R}\)) [78], where \(\mathcal {S}\) is a state space, \(\mathcal {A}\) is an action space, \(\mathcal {P}\) is a transition probability matrix, while \(\mathcal {R}\) is a reward function for the state-action pairs. Agents sample actions from their policy \(a_t \sim \pi (a|s_t;\theta )\), with \(a \in \mathcal {A}\) and \(\theta \) a parametrization, then execute them, transitioning to different states \(s_{t+1}\) and earning rewards \(R_t\), according to the environment’s \(\mathcal {P}\) and \(\mathcal {R}\). The goal of reinforcement learning (RL) is to solve MDPs by predicting and maximizing the \(\gamma \)-discounted returns of future rewards \(G_t = \sum _{i=1}^{T} \gamma ^{i-1} R_{t+i}\) [100]. This is routinely achieved through supervision, using a w-parameterized model \(V(s_t|w)\) that predicts \(G_t\), with the returns or some intermediary estimates serving as targets. The first of these approaches is called the Monte Carlo (MC) algorithm, and is known to be effective despite presenting several drawbacks: slow offline learning and high variance [99, 100]. For example, a variation of MC featuring search trees was used to derive competitive policies in 2-player board games [92, 93]. In contrast, the online temporal difference (TD) learning method casts the sum of the current estimate of the next-step return \(G_{t+1}=V(s_{t+1}|w)\) and \(R_{t+1}\) as a regression target, lowering the variance and speeding up training [103]. The latter constitutes the basis for many RL algorithms to date, such as the on-policy SARSA [82, 100] and the off-policy Q-learning [106], which proved successful in multiple problem instances: reaching or outperforming human-level performance in games [69, 70], autonomous car driving [49], and many others.
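
As a minimal numeric illustration of the two regression targets discussed above (our own example, not taken from the cited works):

```python
def discounted_return(future_rewards, gamma=0.99):
    """G_t = sum_{i>=1} gamma^(i-1) * R_{t+i}, given the list [R_{t+1}, R_{t+2}, ...]."""
    return sum(gamma ** i * r for i, r in enumerate(future_rewards))

def td_target(r_next, v_next, gamma=0.99):
    """TD(0) regression target: R_{t+1} + gamma * V(s_{t+1} | w)."""
    return r_next + gamma * v_next

print(discounted_return([1.0, 0.0, 2.0]))   # 1.0 + 0.99*0.0 + 0.99**2 * 2.0 = 2.9602
print(td_target(1.0, 0.5))                  # 1.0 + 0.99*0.5 = 1.495
```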

Approaches that directly optimize both \(\theta \) and w are called actor-critics [52], and have become the preferred algorithmic choice when faster convergence rates are sought after and sample efficiency is not required. Recent years have seen actor-critic methods like the Proximal Policy Optimization (PPO) [88] and Deep Deterministic Policy Gradient (DDPG) [56] achieve state-of-the-art results across a wide range of challenging tasks [54, 88]. Although online implementations are possible, these agents have traditionally been trained using MC.

Learning policies in environments with combinatorial action spaces such as ours has generally been considered a difficult undertaking. In spite of this, RL methods proved to be effective in instances like multiple item [96] or thread popularity selection [33]. In the context of epidemics, an RL system based on multi-armed bandits and demographics data was recently introduced by the Greek authorities to prioritize the COVID-19 testing allocations at border control [5]. For classic combinatorial problems, such as the travelling salesman problem (TSP) and its vehicle routing variants, RL approaches have also been shown to perform well [7, 53]. Incorporating graph embeddings into the RL agents has generally led to improved solvers, outcompeting other learning methods [17, 43].

2.3 Influencing Graph Dynamics

The problem of influencing diffusion processes over networks has been studied in many different settings before, most notably for solving influence maximization [73], optimizing immunization strategies [77], and targeting pathogen testing [66]. It has been long established that random vaccination policies tend to be suboptimal, and even simple heuristics like acquaintance sampling can outperform them [16, 68]. Centrality-based strategies were also explored in this context, with PageRank [14], eigenvector [62] or betweenness centrality [84] becoming popular choices. For influence maximization, degree-based strategies were shown to render competitive results (e.g. LIR [57], degree discount [13]). Over time, however, multiple authors have identified problem instances where any centrality measure used by itself can lead to suboptimal results [9, 77]. The question of which heuristic to use for what problem has since become a focal point in many application domains. As an alternative, reinforcement learning techniques have been proposed for mixing different heuristics in an optimal manner, thus reducing the impact of the aforementioned drawbacks [65, 101]. Node targeting for detecting the state of a spreading process is a slightly less explored use case of control in the literature, but efficient heuristics that exploit the known state of a vertex’s neighborhood instead of centrality-derived information have proven to be successful [66]. The domain of prioritizing contact tracing, however, remains largely uninvestigated to date, but recent work suggests that isolating subsets of individuals based on the frequency of appearing in the vicinity of positive cases can lead to similar levels of containment as naively isolating every contact [51].

Meirom et al. introduce a reinforcement learning model that can derive general control policies for diffusion processes over networks, using test prioritization and influence maximization as illustrations [65]. A GNN-based controller, cast in an actor-critic framework, learns effective policies using simulated data, integrating local and long-distance information over time. The elegance of the approach stems from the fact that the training process is not conditioned on having the full epidemic state made available to the agent. The work also shows that it is possible to learn a policy on small networks (e.g. 1000 nodes) and deploy it on larger graphs with similar statistics (e.g. 50,000 nodes, the size of a small city). Our study builds on top of this versatile control framework, but differs from the aforementioned work in several key aspects: First, we extend the problem formulation to cover prioritizing both testing and tracing, amending the framework to accommodate ranking nodes from eligible subsets. The latter also enables us to add a simple extension to all our agents which empirically improves performance: restricting the action space to exclude recently-tested negative individuals. Second, we analyze the control outcomes more thoroughly, looking at evaluation episodes longer than 25 days, plotting epidemic curves, and interpreting the agents’ decisions using a perturbation-based explainability technique designed for graphs, GraphLIME. Third, we employ COVID-specific spreading parameters and analyze the behavior of the policies beyond agent-based modelling. Finally, we perform a range of algorithmic changes in our implementation to improve efficiency: using bootstrapping and eligibility traces to mitigate the memory cost of the offline PPO routine, a shared network between the actor and the critic [88] to enrich the graph embeddings, a GATv2 layer in the diffusion module to enable better tracking of the point-to-point spreading process, multiple GIN layers followed by a final fully-adjacent one in the information module to increase its expressive power, as discussed above, and standard scaling for bounding the exploding node hidden states instead of \(L^2\) normalization or GRU-based transformations.

3 Methodology

3.1 Simulating Epidemics

We simulate several epidemics using the SEIR compartmental model together with COVID-specific parameters obtained from [20]: a base infection rate of \(b=0.0791\) and an average exposed duration of \(e=3.7\) days. In order to remove stochastic artefacts that may conceal performance differences, most of our setups assume that nodes remain infectious for the whole duration of the episode unless they get isolated (i.e. recovery rate \(\rho = 0\)). Intuitively, the impact of this assumption becomes significant only when the problem becomes oversaturated (i.e. the testing budget k and/or the recovery rate \(\rho \) are large enough for any agent to achieve containment). For completeness, however, we also present results when \(\rho \) is varied (see Section 4.1).

The viral infection diffuses over different interaction network configurations, with events getting generated by either a multi-site mean-field (see [83]) or an agent-based model. These configurations correspond to both common artificial generation methods, such as Erdős-Rényi [22], dual Barabási-Albert [72] or Holme-Kim [38], and real interaction patterns. In practice, the graph connections needed for conducting such fine-grained control would have to be inferred from a monitoring system, like a digital tracing mechanism [65] or human mobility tracking via GPS [89], a process which requires careful data anonymization. We assign a transmission weight \(w_j \sim \mathcal {U}(0.5,1)\) to every edge j in our graphs, calculating an interaction’s transmission probability by scaling \(w_j\) with the base factor b. In the multi-site mean-field simulations, stochasticity is ensured by the events sampling procedure, which is efficiently performed using Gillespie’s algorithm [28]. In contrast, the agent-based model relies on sampled exposed-state and recovery durations for each node, \(d_i \sim \mathcal {N}(e, 1)\), \(r_i \sim \mathcal {N}(\frac{1}{\rho }, 1)\), and the \(w_j\) weights to induce variability among individuals. For further details, please consult Appendix A.2.
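
The snippet below sketches how the edge weights, transmission probabilities and per-node durations described above could be sampled; variable names and the clipping of durations to at least one day are our own choices, not taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
b, e_mean, rho = 0.0791, 3.7, 0.0          # base infection rate, mean exposed duration, recovery rate

def sample_edge_weights(edges):
    """Assign a transmission weight w_j ~ U(0.5, 1) to every edge j."""
    return {edge: rng.uniform(0.5, 1.0) for edge in edges}

def transmission_prob(w_j):
    """Per-day infection probability along an active edge, scaling w_j by the base rate b."""
    return min(1.0, b * w_j)

def sample_node_durations(n_nodes):
    """Exposed and recovery durations d_i ~ N(e, 1), r_i ~ N(1/rho, 1); infinite when rho = 0."""
    d = np.clip(rng.normal(e_mean, 1.0, n_nodes), 1.0, None)
    r = (np.full(n_nodes, np.inf) if rho == 0
         else np.clip(rng.normal(1.0 / rho, 1.0, n_nodes), 1.0, None))
    return d, r
```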

3.2 Control Setup

Each epidemic is allowed to progress until at least \(c_a\) days have passed since the simulation began and a minimum proportion \(c_i\) of the nodes has become infected, before the agent commences its interventions. On the first day of control, the agent is informed at random about the status of a proportion \(c_k\) of the infected population, after which it is only allowed to test k individuals and isolate \(k_c\) contacts of recently-detected positive nodes (i.e. detected within the previous 6 timestamps) per day. As the actor is not aware of a node’s state unless it is part of \(c_k\) or it got tested recently, the environment is partially observable. In this work, we fix \(c_a=5\), \(c_i=5\%\) and \(c_k=25\%\), while the budgets are varied between experiments. A block diagram of our framework, which includes the agents’ class hierarchy, is provided in Fig A1.

During evaluation, each agent is asked to select the top-k nodes to test and the top-\(k_c\) contacts to isolate every day, according to its appraisal of the epidemic and graph states. Consequently, this constitutes an instance of the subset selection problem [79], where nodes that are traced by the system or found to be positive are marked as isolating, becoming incapable of infecting other nodes. In principle, those individuals remain disconnected from the graph, yet we allow messages to continue flowing through their connections during the training phase of the learning-based agents. Importantly, the process of tracing is assumed to be carried out with delays shorter than a day, which usually implies that a contact tracing application is already deployed and functioning [26, 107]. To evaluate the efficacy of each policy, we analyze the fraction of nodes kept healthy throughout the entire epidemic and the corresponding infection curves.
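
A schematic of one control day under this setup is given below, assuming a dictionary-based contact graph and ranking functions supplied by the agents; this is a hedged sketch of the protocol described above, not the authors’ exact implementation.

```python
def control_step(graph, infected, isolating, tested_log, day,
                 test_ranker, trace_ranker, k, k_c, recent_window=6):
    """One simulated day of test-and-trace control.

    graph: dict node -> set of neighbours; infected / isolating: sets of node ids;
    tested_log: list of (day, node, positive); rankers: callables returning ordered node lists.
    """
    # 1. Spend the daily testing budget on the k highest-ranked non-isolating nodes.
    candidates = [n for n in graph if n not in isolating]
    for n in test_ranker(candidates)[:k]:
        positive = n in infected
        tested_log.append((day, n, positive))
        if positive:
            isolating.add(n)

    # 2. Isolate up to k_c contacts of positives detected in the last `recent_window` days.
    recent_pos = {n for (d, n, pos) in tested_log if pos and day - d < recent_window}
    contacts = [c for p in recent_pos for c in graph[p] if c not in isolating]
    for c in trace_ranker(contacts)[:k_c]:
        isolating.add(c)
    return isolating, tested_log
```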

3.3 Baseline and Learning Agents

In this study, we consider a wide variety of baseline agents for controlling the viral diffusion, each leveraging a separate heuristic: Random samplers (or randag); Acquaintance samplers (or acq); Centrality-based (e.g. Degree or deg, Eigenvector or eig, PageRank or prank, Closeness, Betweenness); Neighborhood state-based (or neigh). The latter is the only baseline that uses information about the epidemic state, targeting the nodes that have the highest number of positively-detected neighbors in their 2-hop vicinity via lexicographical ordering [66].
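
A minimal sketch of the neighborhood-based ranking is given below; the lexicographic key over the 1-hop and 2-hop counts is one plausible reading of the heuristic in [66], and the exact 2-hop definition is our assumption.

```python
def neigh_rank(graph, known_positive, candidates):
    """Order candidates by detected positives in their 1-hop, then 2-hop, neighborhood."""
    def key(node):
        one_hop = graph[node]                                 # graph: dict node -> set of neighbours
        two_hop = {v for u in one_hop for v in graph[u]} - one_hop - {node}
        return (len(one_hop & known_positive), len(two_hop & known_positive))
    return sorted(candidates, key=key, reverse=True)
```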

Aside from the above, when ranking the contacts of positives, additional information can be exploited through heuristic methods: the frequency with which nodes appear in the neighborhood of detected cases. We derive two baselines from the above: Frequency, which randomly samples nodes with probabilities proportional to the individual frequencies (equivalent to the tracing mechanism studied in the multi-site mean-field approach of [83]), and Backward, which greedily picks the nodes with the highest frequencies (as per [51]).
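
The two tracing baselines can be sketched as follows (our own reconstruction; function names are illustrative, and nodes are assumed to be integer ids):

```python
import numpy as np
from collections import Counter

def contact_frequencies(detected_positives, graph):
    """Count how often each non-positive node appears among the contacts of detected cases."""
    return Counter(c for p in detected_positives for c in graph[p]
                   if c not in detected_positives)

def frequency_tracer(freqs, k_c, rng=np.random.default_rng()):
    """Sample k_c contacts with probability proportional to their frequency (cf. [83])."""
    nodes = list(freqs)
    p = np.array([freqs[n] for n in nodes], dtype=float)
    idx = rng.choice(len(nodes), size=min(k_c, len(nodes)), replace=False, p=p / p.sum())
    return [nodes[i] for i in idx]

def backward_tracer(freqs, k_c):
    """Greedily pick the k_c most frequently appearing contacts (cf. [51])."""
    return [n for n, _ in freqs.most_common(k_c)]
```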

We also propose a simple yet powerful extension to these baselines: recollection of recent negative test results. This effectively restricts the action space to nodes without a negative test in the past \(t_n\) days, speeding up the network exploration. We set \(t_n=3\), an appropriate timeline for COVID-19 [95], which renders good results empirically.
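
In code, the recollection extension amounts to a simple filter on the candidate set (a sketch under the conventions used in the control-step example above):

```python
def apply_recollection(candidates, tested_log, day, t_n=3):
    """Drop nodes that returned a negative test within the last t_n days."""
    recently_negative = {n for (d, n, positive) in tested_log
                         if not positive and day - d <= t_n}
    return [n for n in candidates if n not in recently_negative]
```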

Our learning-based agents are inspired by the recent publication of [65], leveraging multiple GNNs due to their proven efficacy for targeting testing campaigns. The abstract structure of our models remains similar to the previous work, with a single-layered diffusion module and a long-range information module, followed by two multi-layer perceptrons (MLPs), one that computes the node hidden states \(h_i\), and another that defines the output space. However, our proposed solution features several improvements or simplifications: First, we utilize two output MLPs to produce a score for each vertex and a full state score from the same model, thus sharing the embedding space between the two. Second, we employ a GATv2 layer in the diffusion module to leverage attention when aggregating information from the immediate neighborhood of each node, and 3 GIN layers followed by a fully-adjacent layer in the information module to improve the expressivity and long-range information flow. Finally, after experimenting with different normalization schemes to mitigate the issue of the exploding hidden states \(h_i\) (a problem also outlined in the aforementioned study), we propose the usage of standard scaling, which leads to stable training behaviors.
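
A condensed PyTorch Geometric sketch of this architecture is shown below. Layer sizes, the approximation of the fully-adjacent layer by a broadcast global summary, and the placement of the standard scaling are our own simplifications rather than the exact model used in the paper.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATv2Conv, GINConv

def make_mlp(dim):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

class RankingGNN(nn.Module):
    """Sketch: GATv2 diffusion layer, three GIN layers plus a final long-range mixing
    step, and two heads sharing the node embedding space (actor scores, critic value)."""

    def __init__(self, in_dim, hid_dim=32):
        super().__init__()
        self.diffusion = GATv2Conv(in_dim, hid_dim)
        self.info_layers = nn.ModuleList([GINConv(make_mlp(hid_dim)) for _ in range(3)])
        self.global_mix = nn.Linear(hid_dim, hid_dim)   # stand-in for the fully-adjacent layer
        self.node_head = make_mlp(hid_dim)              # per-node score (actor)
        self.node_out = nn.Linear(hid_dim, 1)
        self.state_head = nn.Linear(hid_dim, 1)         # full-state score (critic)

    def forward(self, x, edge_index):
        h = torch.relu(self.diffusion(x, edge_index))
        for layer in self.info_layers:
            h = torch.relu(layer(h, edge_index))
        # Long-range mixing: every node receives a transformed global summary.
        h = h + self.global_mix(h.mean(dim=0, keepdim=True))
        # Standard scaling keeps the hidden states h_i from exploding.
        h = (h - h.mean(dim=0)) / (h.std(dim=0) + 1e-6)
        node_scores = self.node_out(torch.relu(self.node_head(h))).squeeze(-1)
        state_value = self.state_head(h.mean(dim=0))
        return node_scores, state_value
```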

In addition to the above, we carefully scrutinized different combinations of node features, choosing the following final set for training our policies: the degree and eigenvector centralities, the number of infected vertices in the 1-hop and 2-hop neighborhoods, 5 random features that break structure symmetries, and 4 test-state features: a one-hot vector of size 3, marking the test status of node i at the previous timestamp (untested, negative or positive), and a binary value marking whether the vertex has ever tested positive. To allow the hidden states to incorporate information from these features before the training commences, we disable gradient updates for the first 11 passes.
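
A sketch of how such a feature matrix could be assembled with networkx follows; it is our reconstruction of the list above, with the exact 2-hop definition and feature ordering being assumptions.

```python
import numpy as np
import networkx as nx

def build_node_features(G, known_positive, last_test, ever_positive,
                        rng=np.random.default_rng()):
    """Per-node features: 2 centralities, 2 infection counts, 5 random, 4 test-state."""
    deg = nx.degree_centrality(G)
    eig = nx.eigenvector_centrality_numpy(G)
    feats = []
    for i in G.nodes:
        one_hop = set(G.neighbors(i))
        two_hop = {v for u in one_hop for v in G.neighbors(u)} - {i}
        inf_1 = len(one_hop & known_positive)
        inf_2 = len(two_hop & known_positive)
        rand = rng.uniform(size=5)                      # symmetry-breaking random features
        # One-hot last test status (untested / negative / positive) plus ever-positive flag.
        status = {"untested": [1, 0, 0], "negative": [0, 1, 0],
                  "positive": [0, 0, 1]}[last_test.get(i, "untested")]
        feats.append([deg[i], eig[i], inf_1, inf_2, *rand, *status,
                      float(i in ever_positive)])
    return np.array(feats, dtype=np.float32)
```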

The ranking of nodes can be performed by both a supervised learning (SL) and a reinforcement learning (RL) agent, with little to no changes to the underlying neural network architecture. The SL agent is trained as a simple node classifier by optimizing a binary cross-entropy loss on the infection status of each vertex, with the output space representing the next-step infection likelihood. In contrast, our RL agent gets optimized via a surrogate PPO objective, which only needs access to the total number of infected at each time point (for more details, refer to Appendix A.3), ultimately solving for the criterion below, where E(t), I(t) and R(t) are the number of individuals in each compartment at time t:

$$\begin{aligned} \min \sum _{t=t_0}^\infty \gamma ^{t-t_0}(E(t) + I(t) + R(t)) \end{aligned}$$
(1)

Here, two reward functions can be used: the negative of the number of infected vertices, or the number of susceptible vertices at time t (the corresponding models are denoted rl and rlpos, respectively). The performance between the two varies due to numerical reasons, but the differences are small (see Fig A5). Consequently, Section 4 features only the former in the summary tables.
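
Concretely, the two reward variants can be written as below; the grouping of E, I and R as “infected” follows Eq. (1) and is our reading rather than a verbatim excerpt.

```python
def reward_rl(S_t, E_t, I_t, R_t):
    """Negative of the number of infected vertices at time t (the `rl` variant)."""
    return -(E_t + I_t + R_t)

def reward_rlpos(S_t, E_t, I_t, R_t):
    """Number of susceptible vertices at time t (the `rlpos` variant)."""
    return S_t
```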

To ensure sufficient exploration during training, the RL agent passes the raw outputs of the ranking model through a softmax function that features a decaying temperature, starting from \(\epsilon =0.5\). Note that other strategies are also possible here, including the transforms proposed in [64] and [65], but our simple alternative proved sufficiently effective at exploring the state space. During evaluation, the sampling process is turned off, greedy actions are taken instead, and the edges connected to positively-identified vertices are masked before being fed to the information module, limiting feature oversmoothing. In contrast, we allow the single-layer diffusion GNN to utilize the aforementioned links such that the positive-related node features can pass through to their neighbors.
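
The action-selection scheme can be sketched as follows; only the starting temperature of 0.5 comes from the text, while the exponential decay schedule is an assumption of ours.

```python
import numpy as np

def select_nodes(scores, k, episode, train=True, eps0=0.5, decay=0.99,
                 rng=np.random.default_rng()):
    """Temperature-decayed softmax sampling during training, greedy top-k at evaluation."""
    scores = np.asarray(scores, dtype=float)
    if not train:
        return np.argsort(scores)[::-1][:k]              # greedy top-k, no sampling
    temperature = max(eps0 * decay ** episode, 1e-3)
    z = (scores - scores.max()) / temperature            # stabilized logits
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(scores), size=min(k, len(scores)), replace=False, p=probs)
```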

By comparing the training behavior of the SL and RL agents with the containment achieved by the centrality-based actors with recollection, we observe a clear distinction between the two, as reflected by Fig 1. While the RL policy outperforms all baselines within a few training episodes, despite not yet entering evaluation mode (i.e. the stage at which exploration would be turned off), the SL policy struggles to compete. Further evidence of the SL agent’s underperformance can be seen in the plots of Fig A2, as well as in the extensive comparison previously conducted by [65]. Consequently, we focus our main analysis in Section 4 on the policies derived by the RL actors, comparing them against the rest of the baseline agents.

Fig. 1.

The learning agents’ training behavior. Results obtained by the centrality-based agents and the random tester are plotted for comparison.

Table 1. Fraction kept healthy with budget \(k=1\%\) and different recovery rates. Average over 5 seeded runs for each of the considered 5 realizations of Barabási-Albert networks with \(N=1000\) nodes and a mean degree of approximately 3. “w/ R” denotes agents with recollection of recent negative test results.

4 Results and Discussion

4.1 Prioritizing COVID Testing in Static Graphs

We first explore our agents’ policies in the context of targeted testing campaigns. To that end, we investigate the fraction of nodes kept healthy throughout various epidemics triggered across different network models when the budget of daily testing k is fixed, while \(k_c\) is set to 0. Once tested positive by the framework, a node gets isolated and eventually acquires immunity, thus remaining uninfectious until the end of the simulation. As stated previously, most of our setups assume nodes do not spontaneously become uninfectious (i.e. \(\rho =0\)), but for completeness we present results for different full-recovery rates in Table 1.

Despite being trained for only 50 episodes on a single epidemic configuration spanning a preferential attachment network of 1000 nodes, our reinforcement learning agent consistently outperforms the other baselines across a range of different network sizes (see Table 2), budgets (see Fig A4), and wiring configurations (see Fig A5). Interestingly, as previously hinted by [65], the learning-based agents possess a great generalization capability when the daily budgets scale with the number of nodes, making deployment into larger networks possible without losing efficacy, irrespective of the training graph size.

Several epidemic curves corresponding to prioritizing testing in graphs of 5000 nodes are shown in Fig A3. We note that the random approaches perform strikingly worse than all our informed policies, while the impact of recollection is apparent. Moreover, in spite of using recollection, the heuristics considered remained inferior to the RL policy in terms of the average containment rate.

Table 2. Fraction kept healthy with budget \(k=1\%\) and different population sizes. Average over 5 seeded runs for each of the considered 5 realizations of Barabási-Albert networks with a mean degree of approximately 3. “w/R” denotes agents with recollection of recent negative test results. Here, a single model is trained for 50 episodes on a network of size 1000, but its policy is able to generalize to appreciably larger graphs.
Table 3. Fraction kept healthy for 1000 nodes. Results are averaged over 5 runs for each of the 5 realizations of a configuration model built using real tracing statistics.

4.2 Prioritizing Testing in Dynamic Graphs

In the previous section, we analyzed scenarios in which the connections between nodes remain fixed for the entire simulation. However, in practice, interaction patterns change over time. In Fig 2, we present boxplots of the percentage of nodes kept healthy obtained by different agents on several preferential attachment networks whose active edges are re-sampled every day (a uniformly random fraction drawn daily from \(\mathcal {U}[0.4,0.8]\)). The reinforcement learning agent was retrained to accommodate this dynamic context, allowing the model to pass messages through the most recent edges only. The top-performing policies were also evaluated on dynamic networks built using statistics from a real contact tracing network [65], with the resulting average containment rates displayed in Table 3.

Fig. 2.

Infection control performance on different dynamic network architectures. The uncertainties are shown as boxplots.

Fig. 3.

Averaged epidemic curves and their standard deviations during test and trace control. These are for 5000-node Barabási-Albert networks featuring a mean degree of approximately 3, with a daily testing budget of \(k=1\%\) and no tracing on the left, and \(k=10\) with a limit of \(k_c=25\) traced contacts on the right. Two RL agents are displayed: one trained for 50 episodes, and another for 200.

4.3 Targeted Test and Trace Programmes

Next, we investigate the extent to which different combinations of agents tasked with conducting testing and contact tracing under the constraints of a fixed budget can reduce the spread of a pathogen. For this problem, we train an RL agent for 200 episodes on the same testing task as before, and compare the resulting policy against the other baselines. Tables 4 and 5 confirm that the RL tester improves the overall quality of the test and trace programmes, irrespective of the chosen tracer. Moreover, employing the same agent to perform the ranking of contacts as well generally improves the containment further.

Table 4. Percentage of nodes kept healthy for graphs of size 1000 and an approximate mean degree of 3, with budgets \(k=2\), \(k_c=5\). Averages over 5 runs for each of the considered 5 realizations of the following: dual Barabási-Albert with \(m_1=5\), \(m_2=1\) (BA 5-1) and \(m_1=10\), \(m_2=1\) (BA 10-1), Holme-Kim (PC), and Erdős-Rényi (ER).
Table 5. Percentage of nodes kept healthy when controlling epidemics over a dynamic real interaction network of 74 vertices, derived from the Social Evolution dataset [59]. Averages over 5 runs for each of the considered 5 infection seeds. Test budget is \(k=2\).

We also inspect the averaged epidemic curves associated with these targeted test and trace campaigns when \(N=5000\). The results obtained by each agent are shown in the second column of Fig 3, with the first column serving as a test-only reference (i.e. values from Fig A3). As stated before, heuristics with recollection bring large improvements over random policies, yet the RL agents outcompete them in most setups. Note that testing \(k=50\) individuals per day performs similarly to testing \(k=10\) while tracing up to \(k_c=25\) contacts daily. While the balance between these will depend on various factors, the results highlight the effectiveness of tracing.

4.4 Agents Interacting with Different Spreading Dynamics

To assess the ability of the agents to generalize to other spreading dynamics, we compare the containment rates they achieve under both a multi-site mean-field and an agent-based model run with similar hyperparameters. The RL agent retains all the parameters learned in the previous experiments.

Despite the fact that the control mechanism in the mean-field case relies on discretizing a continuous-time process, we observe minor differences between the two simulation approaches (Tables 6 and 7). This confirms that the agents continue to perform well irrespective of the underlying dynamics.

Table 6. Fraction kept healthy for 2000 nodes and an average degree of 3. Results represent averages over 5 runs for each of the considered 5 instances of a dual Barabási-Albert model (\(m_1=10\), \(m_2=1\)). Testing budget is \(k=2\) and no tracing is conducted.
Table 7. Fraction kept healthy for 2000 nodes and an average degree of 3. Results represent averages over 5 runs for each of the considered 5 instances of a dual Barabási-Albert model (\(m_1=10\), \(m_2=1\)). Budgets are \(k=2\) and \(k_c=10\).
Fig. 4.

Explaining early predictions on a 200-node network using the \(\beta \) importances from GraphLIME. Initially, the agent does not possess information about the epidemic state, and as such, it focuses on the centrality features. The top row displays each node’s feature values, while neighborhood averages are shown underneath.

Fig. 5.

Explaining later predictions on a 200-node network using the \(\beta \) importances from GraphLIME. During the later stages of an outbreak, the agent shifts its focus towards the epidemic state features, like the previously untested and positive flags, or the number of infected neighbors. Numbers in the first row represent each node’s feature values, while the second row displays the neighborhood averages.

4.5 Explaining and Visually-Inspecting the Learning Agent’s Policy

To derive explanations for the decisions taken by our reinforcement learning policy, we employ the GraphLIME algorithm, fitting multiple interpretable models to the raw action-values the model outputs. Fig 4 presents the feature importances derived by GraphLIME for a given day in the early stages of an epidemic, highlighting that the RL agent preferentially attends to the centrality features when it does not possess enough information about the diffusion state. As soon as the tester records positive individuals in the vicinity of a vertex, the rank of the latter increases. After neighborhoods become filled with known infections, the agent targets the affected sectors by focusing on the epidemic state features (see Fig 5). As previous results also suggest, the degree remains an effective predictor of a node’s importance throughout the process. Interestingly, the untested flag is often correlated with the action scores, which may indicate the agent favors exploring unknown sectors or reinforcing testing in recently-targeted regions. To put the significance of adaptability into perspective, we show in Fig 6 an example where an RL tester starts by targeting the same node as a degree-directed policy, but then quickly changes its behavior to also test bridging vertices. The ability to plan ahead and adapt to potential threats leads in this case to a successful containment of the pathogen within the first cluster, while the degree agent is unable to prevent the infection of every community. We note that the RL agent infers that bridges are important transmission vehicles despite never computing the time-consuming betweenness centralities (also see Appendix A.1).

Considering the promising results exhibited by our RL policy, we hypothesize that useful patterns emerge within the ranking model’s hidden states \(h_i\). To verify this assumption, we plot t-SNE mappings and dendrograms for these embeddings across different days (refer to Fig A6). The detected positives (colored in blue) tend to be grouped together, while new infections (red) get pushed to a handful of clusters within the same region. Such visualizations could be used for scrutinizing the actions of an agent or deriving effective community-wide health interventions.
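
A visualization of the kind behind Fig A6 could be produced as follows (a sketch; the color choices follow the description above, and the function name is ours):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_hidden_states(h, detected_positive, newly_infected, day):
    """2-D t-SNE map of the ranking module's hidden states h_i (one row per node),
    colored by epidemic status: detected positives in blue, new infections in red."""
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(h)
    colors = ["blue" if i in detected_positive
              else "red" if i in newly_infected
              else "lightgray" for i in range(len(h))]
    plt.scatter(xy[:, 0], xy[:, 1], c=colors, s=12)
    plt.title(f"Node embeddings (t-SNE), day {day}")
    plt.show()
```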

Fig. 6.

Visualization of the spread for the Degree w/R and the RL agents. This corresponds to a stochastic-block network [37] with three communities. Susceptibles are green, exposed yellow, infectious orange, and detected blue. On the first day, the two policies are identical, but later on the RL agent preferentially targets the bridges. (Color figure online)

5 Conclusion and Future Work

In this study, we show how policies for controlling an epidemic through testing and tracing in a resource-limited environment can be learned using expressive graph neural networks that can integrate both local and long-range infection dynamics. Across many different scenarios, a policy inferred by a reinforcement learning agent outperforms a wide range of ad-hoc rules drawing from the connectivity properties of the underlying interaction graph, achieving containment rates up to 15% higher than degree-based solutions with recollection, and more than 50% higher than random samplers. Interestingly, our agent also exhibits strong transferability, with one model trained on small preferential attachment networks being able to control the viral diffusion on several graphs of tens of thousands of vertices and diverse linkage patterns. While building on previous efforts [65], we explore the role of contact tracing, compare different ways of modelling the infection spread (multi-site mean-field versus individual agent-based), and scrutinize a varied set of heuristics. Exploring further epidemic configurations and assessing the proposed test and trace framework on real region-level data would constitute natural extensions to this work.

Additionally, we demonstrate how orderings derived by the deep learning model can be interpreted using the node features, as well as propose visualization strategies for the cluster structures that arise in the latent space of the ranking module. We believe future work could expand on the aforementioned ideas to derive more effective public health interventions and decision-making appraisals.