
1 Introduction

Distributed systems, in which autonomous entities communicate with each other, are booming in our societies. Drones communicating with each other, swarms of various objects, intelligent cars: all may face communication and leadership issues. Therefore, the algorithm executed by all entities should be verified. Thales developed an industrial election algorithm for an arbitrary number of processes, which can possibly fail. We cannot describe the code of the actual algorithm for confidentiality reasons; we therefore consider a modified variant of it. This algorithm elects a leader in a distributed system with a potentially large number of entities (or nodes) in an asynchronous environment. Our main contribution is a formal verification of the correctness of the algorithm for a large number of nodes. By correctness, we mean the actual election of the leader after a fixed number of rounds.

We consider here a special form of the general leader election problem [22]: we assume that, in the network, all the processes (or nodes) have a specific ID number, and that they execute the same code (symmetry) in order to agree on which ID number is the highest. In the synchronous context, where all processes communicate simultaneously, the problem is often solved using the “Bully algorithm” [18]. In the asynchronous context, where each process communicates with its own period, possibly subject to delay variation (jitter), the problem is much more difficult: the periods can all be slightly different from each other, which makes the problem particularly complex. For example, a classical distributed leader election protocol, where the nodes exchange data using broadcasting, was designed by Leslie Lamport [20] in the asynchronous context. The correctness of this algorithm was proved mechanically several times using, e. g., the TLA\(^+\) tool [21] or, more recently, the timed model checking tool Uppaal [10]. However, these automated proofs work only for a small number p of processes, typically \(p \le 10\). In this paper, we present a technique to prove the correctness of such a distributed leader election using automated tools for a large number of nodes (e. g., \(p=5000\)). The method relies on an abstraction consisting in viewing the network from the point of view of a specific (but arbitrary) node, say \(node_i\), and considering the rest of the nodes of the network as an abstract environment interacting with \(node_i\). On this abstract model, two basic properties of the algorithm can be proven. However, in order to prove the full correctness of the leader election algorithm, we need an auxiliary model, where some timing information is added to (a raw form of) the abstract model. Using this auxiliary timed model, we are able to prove an additional property of the leader election algorithm. Thanks to the three aforementioned properties added as assumptions, we can then prove the full correctness of the leader election algorithm, using the bounded model checker SafeProver [16] on the abstract model.

The leader election algorithm we use is not Lamport’s algorithm, but a simple asynchronous form of the Bully algorithm. We consider a specific network structure and a specific asynchronous form of communication. Basically, we make the following assumptions:

  1. the graph is complete (every node communicates with all the other ones);

  2. the communications are instantaneous (the time between the sending of a message and its reception is zero), and the nodes exchange data via synchronous one-way unconditional value passing;

  3. the processes are possibly faulty, i. e., they always execute the generic code of the algorithm, trying to elect the leader when they are non-faulty (mode \(\mathsf {On}{}\)), and do nothing when they are faulty (mode \(\mathsf {Off}\)).

1.1 Relationship with Thales’ Actual Algorithm

As mentioned above, for confidentiality reasons, we cannot reveal the original algorithm developed at Thales. Nevertheless, the algorithm we present is in essence the same: only the executed code has been modified. In addition, the technique presented in this paper was designed for and applied to the original algorithm. To summarize, we present exactly the methodology, up to the content of the UpdateNode code (which remains similar in spirit).

After its verification using the techniques presented here, the original algorithm was implemented in C, and is nowadays running in one of Thales’ products. This product embeds a standard processor (in the line of the Intel x86), with some limited RAM, a hard drive, Ethernet ports, etc.

1.2 Related Work

The method proposed here makes use of several powerful techniques such as counter abstraction, bounded model checking and parametric timed model checking for verifying distributed fault-tolerant algorithms, similarly to what has been recently described, e. g., in [19]. As said in [19]: “Symmetry allows us to change representation into a counter representation (also referred to as ‘counter abstraction’): (...) Instead of recording which process is in which local state, we record for each local state, how many processes are in this state. Thus, we need one counter per local state \(\ell \), hence we have a fixed number of counters. A step by a process that goes from local state \(\ell \) to local state \(\ell '\) is modeled by decrementing the counter associated with \(\ell \) and incrementing the counter associated with \(\ell '\). When the number p of processes is fixed, each counter is bounded by p.” The work described in [19] makes use of SMT solvers [15] in order to perform finite-state model checking of the abstracted model.

Our work can be seen as a new application of such techniques to (a variant of) an industrial election algorithm. Another original aspect is the combination of counter abstraction and bounded model checking with parametric timed model checking.

In an orthogonal direction, the verification of identical processes in a network, i. e., an unknown number of nodes running the same algorithm, has been studied in various settings, notably in the long line of work around regular model checking [11, 17], and in various settings in the timed case [1,2,3]. However, the focus of that latter series of works is on decidability; they do not consider real-world algorithms, nor do they provide tools implementing these results.

Finally, the line of works around the Cubicle model checker [12,13,14] performs parameterized verification of cache memory protocols, i. e., verification parameterized in the number of processes. However, timing parameters are not present in these works.

1.3 Outline

The rest of the paper is organized as follows. Section 2 introduces the variant of the leader election algorithm we consider. Section 3 presents our direct verification method for a small number of nodes. Section 4 presents our abstraction-based verification for a much larger number of nodes. Section 5 concludes the manuscript and outlines perspectives.

2 An Asynchronous Leader Election Algorithm

Thales recently proposed a leader election algorithm. This simple leader election algorithm is based on the classical Bully algorithm, originally designed for the synchronous framework [18]. Basically, all nodes have an ID (all different), and the node with the largest ID must be elected as the leader. This algorithm is asynchronous. As usual, each node runs the same version of the code. We cannot describe the code of the actual algorithm for confidentiality reasons; we therefore consider and prove a modified variant of Thales’ original algorithm, described throughout this section.

2.1 Periods, Jitters, Offset

The system is a fixed set of p nodes \(\mathcal {N} = \{node_1,\ldots ,node_p\}\), for some \(p \in {\mathbb N}\). Each node \(node_i\) is defined by:

  1. its integer-valued ID \(node_i.id \in {\mathbb N}\),

  2. its rational-valued activation period \(node_i.per \in [\mathsf {period_\mathsf {min}}{}, \mathsf {period_\mathsf {max}}{}]\),

  3. its rational-valued first activation time \(node_i.start\in [0,node_i.per]\) (which can be seen as an offset, with the usual assumption that the offset must be less than or equal to the period), and

  4. its rational-valued jitter, a delay variation applied at each period, taking values in \([\mathsf {jitter_\mathsf {min}}{}, \mathsf {jitter_\mathsf {max}}{}]\), a static interval defined for all nodes and known beforehand.

Observe that all periods are potentially different (even though they all lie in a fixed interval, and each of them remains constant over the entire execution), which makes the problem particularly challenging. In contrast, the jitter is different at each period (this is the common definition of a jitter), and the jitter of node i at the jth activation is denoted by \(jitter_i^j\). The jth activation of node \(node_i\) therefore takes place at time \(t_i^j = t_i^{j-1} + node_i.per + jitter_i^j\). In addition, \(t_i^0 = node_i.start\).
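For concreteness, the recurrence can be transcribed directly. The following is a minimal Python sketch with our own names (it is not part of the paper's models):

```python
from itertools import accumulate

def activation_times(start, period, jitters):
    """Transcription of the recurrence: t^0 = start and
    t^j = t^{j-1} + period + jitter^j (one jitter value per activation)."""
    return list(accumulate([start] + [period + j for j in jitters]))

# A node with period 50 ms, offset 10 ms and three jittered activations:
print(activation_times(10, 50, [0.5, -0.3, 0.2]))  # [10, 60.5, 110.2, 160.4]
```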

The concrete values for the static timing constants are given in Table 1.

Table 1. Constants (in ms)

Example 1

Assume the system is made of three nodes. Assume \(node_1.per = 49\). Recall that a period is an arbitrary constant in a predefined interval. Assume \(node_1.start = 0\).

Assume \(node_2.per = 51\) and \(node_2.start = 30\).

Assume \(node_3.per = 49\) and \(node_3.start = 0.1\).

Also assume the jitters for the first three activations of the nodes given in Table 2.

Table 2. Jitter values for Example 1

We therefore have \(t_1^0 = 0\), \(t_1^1 = 49.5\), \(t_1^2 = 97.5\), \(t_1^3 = 147.5\), \(t_2^0 = 30\), \(t_2^1 = 81\), \(t_2^2 = 132.1\), \(t_2^3 = 183\), \(t_3^0 = 0.1\), \(t_3^1 = 48.6\), \(t_3^2 = 98.4\), \(t_3^3 = 147.6\). The first activations of the nodes are depicted in Fig. 1. Due to both the uncertain periods and the jitters, it can happen that, between two consecutive activations of a node, another node is not activated at all: for example, between \(t_3^0\) and \(t_3^1\), node 1 is never activated, and therefore node 3 receives no message from node 1 during this interval. Conversely, between two consecutive activations of a node, another node may be activated twice: for example, between \(t_3^1\) and \(t_3^2\), node 1 is activated twice (at \(t_1^1\) and \(t_1^2\)), and therefore node 3 may receive two messages from node 1 during this interval.

Finally, note that, in this example, the numbers of activations of nodes 1 and 3 since system start remain equal at any timestamp, up to a difference of 1 (due to the jitters), because these nodes have the same period. In contrast, the number of activations of node 2 falls behind that of nodes 1 and 3 by an increasing difference, since node 2 is slower (period 51 instead of 49). This phenomenon does not occur when all nodes have equal periods, and it makes our setting more challenging.
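As a sanity check of the arithmetic above, the same recurrence reproduces the reported timestamps. Since Table 2 is not reproduced here, the jitter values below are the ones forced by the reported times, i. e., reconstructions:

```python
# Reusing activation_times from the sketch above; jitter values are
# reconstructed from the reported timestamps (stand-ins for Table 2).
print(activation_times(0,   49, [0.5, -1.0, 1.0]))   # [0, 49.5, 97.5, 147.5]
print(activation_times(30,  51, [0.0, 0.1, -0.1]))   # [30, 81.0, 132.1, 183.0]
print(activation_times(0.1, 49, [-0.5, 0.8, 0.2]))   # [0.1, 48.6, 98.4, 147.6]
# (values shown up to floating-point rounding)
```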

Fig. 1. Activation of three nodes with uncertain periods and jitters

Remark 1

The rest of this paper assumes the constant values given in Table 1. However, our method remains generic for constants of the same order of magnitude. Here, the variability of the periods is reasonably limited (around 4 %). A variability of more than 40 % would endanger the soundness of our method, as our upcoming assumption that, between any three consecutive activations of a node, all nodes execute at least once would not hold anymore.

2.2 IDs, Modes, Messages

We assume that all the IDs of the nodes in the network are different. Each node executes the same code. Each node has the ability to send messages to all the nodes in the network, and can store (at least) one message received from any other node in the network. Nodes are either in mode \(\mathsf {On}{}\) and execute the code at each activation time, or do nothing when they are in mode \(\mathsf {Off}\). (This models the fact that some nodes in the network might fail.) A node in mode \(\mathsf {On}{}\) is in one of the following states:

  • \(\mathsf {Follower}\): the node is not competing to become leader;

  • \(\mathsf {Candidate}\): the node is competing to become leader;

  • \(\mathsf {Leader}\): the node has declared itself to be the leader.

Each transmitted message is of the form: \(message = (SenderID, state)\) where state is the state of the sending node.

2.3 The Algorithm

At each new activation, \(node_i\) executes the code given in Algorithm 1. In short, if the Boolean flag \(node_i.EvenActivation\) (which we can suppose is initially arbitrary) is true, then the code line 1–line 15 is executed. In this code, the node first reads its mailbox and checks whether any message contains a node ID higher than its own (line 3–line 7); if so, it sets itself as a follower (line 6). If no higher ID was received, the node “upgrades” its status from follower to candidate (line 10), from candidate to leader (line 12), or remains leader if already leader (line 14).

Finally (and this code is executed at every activation), the node toggles the Boolean flag EvenActivation (line 16), prepares a message with its ID and current state (line 17), and sends this message to the entire network (line 18). We assume that the \(Send\_To\_All\_Network\) function sends a message to all nodes—including the sender.

We can see that the significant part of the code (line 1–line 15) is only executed once every two activations (due to the Boolean test \(node_i.EvenActivation \)). This is enforced in order to ensure that each node executes the significant code only after receiving at least one message from all the other nodes (in mode \(\mathsf {On}{}\)). However, note that each node sends a message at each activation.
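The per-activation logic just described can be summarized by the following Python sketch, a reconstruction of Algorithm 1's structure from the prose (names and mailbox handling are our own assumptions, not Thales' code):

```python
from dataclasses import dataclass, field

FOLLOWER, CANDIDATE, LEADER = "Follower", "Candidate", "Leader"

@dataclass
class Node:
    id: int
    state: str = FOLLOWER
    even_activation: bool = True                  # initially arbitrary
    mailbox: list = field(default_factory=list)   # (sender_id, state) pairs

def update_node(node, network):
    """One activation of `node`; `network` lists all nodes in mode On."""
    if node.even_activation:          # significant part (lines 1-15)
        if any(sender > node.id for (sender, _state) in node.mailbox):
            node.state = FOLLOWER     # a higher ID exists: step down (line 6)
        elif node.state == FOLLOWER:  # no higher ID received: upgrade
            node.state = CANDIDATE    # (line 10)
        elif node.state == CANDIDATE:
            node.state = LEADER       # (line 12)
        # a Leader simply remains Leader (line 14)
        node.mailbox.clear()          # assumption: the mailbox is emptied once read
    # executed at every activation:
    node.even_activation = not node.even_activation  # toggle the flag (line 16)
    message = (node.id, node.state)                  # (line 17)
    for other in network:                            # send to all nodes,
        other.mailbox.append(message)                # including the sender (line 18)
```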

The order of magnitude of the constants in Table 1 yields the following immediate lemma.

figure a

Lemma 1

Assume a node i and activation times \(t_i^{j}\) and \(t_i^{j+2}\). Then, in between these two activations, node i has received at least one message from every node in mode \(\mathsf {On}\).

Proof

From Table 1 and Algorithm 1.

Remark 2

For different orders of magnitude, we may need to execute the significant code only once every more than two activations. For example, if we set \(\mathsf {jitter_\mathsf {min}}{} = - 25\) and \(\mathsf {jitter_\mathsf {max}}{} = 25\) in Table 1, the code should be executed once every three activations for our algorithm to remain correct.

2.4 Objective

We first introduce the following definitions.

Definition 1

(round). A round is a time period during which all the nodes that are \(\mathsf {On}{}\) have sent at least one message.

Definition 2

(cleanness). A round is said to be clean if, during its time period, no node has switched from \(\mathsf {On}{}\) to \(\mathsf {Off}\) or from \(\mathsf {Off}\) to \(\mathsf {On}{}\).

The correctness property that we want to prove is:

figure b

This property is denoted by (P) in the following.

Remark 3

(fault model). Our model does allow for faults but, according to Definition 2, only prior to the execution of the algorithm: once it has started, every node remains in \(\mathsf {On}\) or \(\mathsf {Off}\) during the entire execution. If, in reality, a fault occurs during the execution, it suffices to consider the execution of the algorithm starting at the next clean round.

3 Direct Verification of the Leader Election Algorithm

In this section, we first verify our algorithm for a fixed number of processes.

We describe here the results obtained by SafeProver on a model M directly representing a network of a fixed, constant number p of processes (without abstraction); for a small number p of nodes, we thus obtain a simple proof of the correctness of the algorithm. The model explicitly includes a representation of each node of \(\mathcal {N}\), as well as their associated periods, first activation times, local memories, and mailboxes of received messages. The code is given in Algorithm 2. The mailbox is represented as a queue, initially filled with a message from oneself.

figure c

During the initialization, we declare everything as free variables (with some constraints, e. g., on the periods) so as to make no assumption on the state of the network at the beginning. This ensures that the model is valid whatever happened in the past, and it can be seen as a symbolic initial state; this notion of symbolic initial state was used to solve a challenge by Thales [23], also featuring uncertain periods. We also fully initialize the mailboxes of all the nodes, since we assume that we are right after a clean round. The variable Activation stores how many times a node has been executed since the last clean round. The code of the called function UpdateNode(i) corresponds exactly to Algorithm 1.
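To give a concrete intuition of M, here is a non-symbolic Python analogue. Where SafeProver explores all values of the free variables symbolically, this sketch merely samples them; the constants are hypothetical stand-ins for Table 1 (chosen consistent with Example 1), and Node/update_node come from the sketch after the description of Algorithm 1:

```python
import random
# Continues the earlier sketch: reuses Node, update_node and the state names.

PERIOD_MIN, PERIOD_MAX = 49.0, 51.0   # hypothetical stand-ins for Table 1
JITTER_MIN, JITTER_MAX = -1.0, 1.0    # (the actual constants are not shown here)

def sample_network(p):
    """One instantiation of the free variables: the symbolic initial state
    is emulated by randomizing everything the solver leaves free."""
    nodes = []
    for i in range(1, p + 1):
        n = Node(id=i,
                 state=random.choice([FOLLOWER, CANDIDATE, LEADER]),
                 even_activation=random.choice([True, False]))
        n.per = random.uniform(PERIOD_MIN, PERIOD_MAX)  # fixed once for all rounds
        n.start = random.uniform(0.0, n.per)            # offset <= period
        n.mailbox = [(n.id, n.state)]     # initially a message from oneself
        n.next_t = n.start                # next activation time
        n.activation = 0                  # activations since the last clean round
        nodes.append(n)
    return nodes

def simulate(nodes, rounds=8):
    """Event-driven loop: repeatedly fire the earliest pending activation."""
    while min(n.activation for n in nodes) < rounds:
        n = min(nodes, key=lambda m: m.next_t)
        update_node(n, nodes)
        n.activation += 1
        n.next_t += n.per + random.uniform(JITTER_MIN, JITTER_MAX)
    return nodes
```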

The property (P) we want to prove is formalized as:

\(\big (\forall i \in \{1,\ldots ,p\}: Activation(i) \ge 4\big )\ \Rightarrow \ \big (\forall j\in \{1,\ldots ,p\}, j \ne maxId: node_j.state = \mathsf {Follower}{} \ \wedge \ node_{maxId}.state = \mathsf {Leader}{}\big )\)

with \( maxId = \text {arg max}(\{node_i.id \mid node_i.mode = \mathsf {On}{} \}_{i\in \{1,\ldots ,p\}})\). Using this model and SafeProver [16], we obtain the proof of (P) with the computation times tabulated in Table 3. While this method allows us to formally prove the leader election for up to 5 nodes, SafeProver times out for larger numbers of nodes. This leads us to consider another method to prove the correctness of our algorithm for larger numbers.

Table 3. Computation times
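The formula above can also be transcribed into an executable predicate over the simulated network, continuing the previous sketch (all sampled nodes are assumed to be in mode On):

```python
def property_P(nodes):
    """(forall i: Activation(i) >= 4)  =>  the node with the largest ID is
    Leader and every other node is a Follower (all nodes assumed On)."""
    if any(n.activation < 4 for n in nodes):
        return True                    # premise not met: (P) holds vacuously
    max_id = max(n.id for n in nodes)
    return all(n.state == (LEADER if n.id == max_id else FOLLOWER)
               for n in nodes)

# Random testing of (P) on sampled instantiations: a sanity check, not a proof.
for _ in range(100):
    assert property_P(simulate(sample_network(p=5)))
print("(P) held on all sampled runs")
```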

4 Abstraction-Based Method

We now explain how to construct an abstract model \(\widehat{M}{}\) of the original model M. This model \(\widehat{M}{}\) clusters together all the p processes, except the process \(node_i\) under study (where i is arbitrary, i. e., a free variable); \(\widehat{M}{}\) also abstracts away the timing information contained in M. We then use SafeProver to infer two basic properties P1 and P2 for \(\widehat{M}{}\).

In a second phase, we consider an auxiliary simple (abstract) model T of M, which merely contains the relevant timing information; we then use a parametric timed model checker to infer a third property (P3) for T. The parametric timed model checker is required due to the uncertain periods, which can take any value in \([\mathsf {period_\mathsf {min}}{} , \mathsf {period_\mathsf {max}}{}]\) but remain constant over the entire execution.

In the third phase, we consider again the model \(\widehat{M}{}\) and integrate P1–P3 into SafeProver as assumptions, which allows us to infer a fourth property P4. Properties P1 and P4 together express a statement equivalent to the desired correctness property P of the leader election algorithm. The advantage of reasoning on the abstract models \(\widehat{M}{}\) and T, rather than directly on M, is the ability to prove P for a large number p of processes.

We now describe our method step by step.

4.1 Abstract Model \(\widehat{M}{}\) and Proof of P1-P2

The idea is to model the system as one node \(node_i\) (the node of interest) interacting with the rest of the network: \(node_i\) receives messages from the other nodes, which are clustered into a single abstract process (see Fig. 2). In the abstract model \(\widehat{M}{}\), each node can take any state at any activation, with no regard to the parity flag (\(node_i.EvenActivation\)), to what has been previously sent, or to what \(node_i\) is sending. We only consider the activations of \(node_i\). The rest of the network is abstracted by the messages contained in the mailbox of \(node_i\). Since we assume that at least one clean round has passed, we always have a message from a working node in the mailbox. The code is given in Algorithm 3. (Note its analogy with the SafeProver code of Algorithm 2 for M.) The first four lines define free variables.

Fig. 2. Scheme of model \(\widehat{M}{}\) with node i under study interacting with the cluster of all the other nodes

The list of assumptions that the solver can make on the received messages is denoted by \(List\_of\_assumptions\_on\_message_j\). This list is initially empty, and is augmented with “guarantees” (a.k.a. “proven properties”) on the other nodes as they are iteratively generated by the solver. The first proven properties are:

  • P1: \( (Activation(j) \ge 2 \ \wedge \ node_j.id \ne maxId) \Rightarrow node_j.state = \mathsf {Follower}{}\)

  • P2: \( (Activation(j) \ge 2 \ \wedge \ node_j.id = maxId) \Rightarrow node_j.state \in \{\mathsf {Candidate}{}, \mathsf {Leader}{}\}\)

In these properties, j is a free variable: it is therefore “fixed” within one execution, but can correspond to any of the node IDs. The two properties state that, after two activations, a node that does not have the largest ID is necessarily a follower (P1), while the node with the largest ID is a candidate or a leader (P2). As said before, P1 and P2 are then integrated into \(List\_of\_assumptions\_on\_message_j\).

\(Guarantee\_to\_prove\) contains iteratively P1, then P1 and P2, then P1, P2 and P4.

figure d
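To illustrate what the solver establishes on \(\widehat{M}{}\), the following self-contained sketch exhaustively enumerates a small adversarial environment: the initial state of node j is arbitrary, and each mailbox read contains arbitrary messages, except that (since a clean round has passed) at least one message from the node with the maximal ID is always present. It then checks P1 and P2 after two activations. This is a toy finite enumeration of our own, not SafeProver's symbolic proof:

```python
from itertools import combinations, product

FOLLOWER, CANDIDATE, LEADER = "Follower", "Candidate", "Leader"
STATES = (FOLLOWER, CANDIDATE, LEADER)
IDS, MAX_ID = (1, 2, 3), 3      # toy ID domain; 3 plays the role of maxId

def significant_step(own_id, state, mailbox):
    """The significant part of UpdateNode (one read of the mailbox)."""
    if any(sender > own_id for (sender, _st) in mailbox):
        return FOLLOWER
    return {FOLLOWER: CANDIDATE, CANDIDATE: LEADER, LEADER: LEADER}[state]

ALL_MSGS = list(product(IDS, STATES))   # every (sender_id, state) pair

def adversarial_mailboxes():
    """All mailboxes containing at least one message from the max-ID node,
    as guaranteed after at least one clean round."""
    for r in range(1, len(ALL_MSGS) + 1):
        for subset in combinations(ALL_MSGS, r):
            if any(sender == MAX_ID for (sender, _st) in subset):
                yield subset

# Within any 2 activations, exactly one is significant (the flag toggles) and
# the other leaves the state unchanged, so one significant step suffices here.
for own_id, state0, mbox in product(IDS, STATES, list(adversarial_mailboxes())):
    state = significant_step(own_id, state0, mbox)
    if own_id != MAX_ID:
        assert state == FOLLOWER                  # P1 on the toy domain
    else:
        assert state in (CANDIDATE, LEADER)       # P2 on the toy domain
print("P1 and P2 hold for every adversarial mailbox in the toy domain")
```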

4.2 Abstract Model T and Proof of P3

To represent the timed abstract model T of M, we use an extension of the formalism of timed automata [5], a powerful extension of finite-state automata with clocks, i. e., real-valued variables that all evolve at the same rate. These clocks can be compared with constants when taking a transition (“guards”) or to constrain how long the system may remain in a discrete state (“invariants”). Discrete states are called locations. Timed automata have proven successful in verifying many systems mixing time and concurrency, especially with the state-of-the-art model checker Uppaal [10]. However, timed automata cannot model arbitrary constant periods: while it is possible to model a different period at each round, it is not possible to first fix a period once and for all (in an interval), and then use this period for the rest of the execution. We therefore use the extension of “parametric timed automata” [6, 8], which allows considering parameters, i. e., unknown constants (possibly constrained to an interval). IMITATOR [9] is a state-of-the-art model checker supporting this formalism.

In our method, the timed abstract model T of M is a product of two similar parametric timed automata, representing the node i under study and a generic node j from \(\mathcal{N}\setminus \{i\}\), respectively. Each parametric timed automaton contains a single location. The parametric timed automaton corresponding to \(node_i\) uses an activation period \(per_i\) that we model as a parameter. Indeed, recall that the period belongs to an interval: taking a new value in the interval at each round would not be correct, as the period would not be constant. This is where our method needs parameters. In addition, we constrain this parameter \(per_i\) to belong to \([\mathsf {period_\mathsf {min}}{},\mathsf {period_\mathsf {max}}{}]\). Each automaton has its own clock \(c_i\), used to measure how much time has elapsed since the last activation, and a discrete variable Activation(i), initialized to 0, which counts the number of activations of this node. Initially, we constrain \(c_i \in [0, per_i + \mathsf {jitter_\mathsf {max}}{}]\). An activation can occur as soon as \(c_i\) reaches \(per_i + \mathsf {jitter_\mathsf {min}}{}\); this is modeled by the guard \(c_i \ge per_i + \mathsf {jitter_\mathsf {min}}{}\) on the transition that resets \(c_i\) and increments Activation(i). An activation can occur as long as \(c_i\) is below or equal to \(per_i + \mathsf {jitter_\mathsf {max}}{}\); this is modeled by the invariant \(c_i \le per_i + \mathsf {jitter_\mathsf {max}}{}\) on the unique location of the automaton, which forces the transition to occur at the latest when \(c_i\) reaches this upper bound. This parametric timed automaton is represented in Fig. 3. The other component, representing the cluster of the remaining nodes, is modeled similarly as a generic component \(node_j\).

Fig. 3. Component 1 of timed model T
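The component of Fig. 3 can be read operationally: once the parameter \(per_i\) is fixed, each activation must occur between \(per_i + \mathsf {jitter_\mathsf {min}}\) and \(per_i + \mathsf {jitter_\mathsf {max}}\) after the previous one. Below is a minimal Python sketch of this semantics, again with hypothetical constants (Table 1 is not reproduced here):

```python
import random

PERIOD_MIN, PERIOD_MAX = 49.0, 51.0   # hypothetical Table 1 values, as before
JITTER_MIN, JITTER_MAX = -1.0, 1.0

class NodeTimer:
    """Operational reading of the component of Fig. 3: one location with
    invariant c <= per + jitter_max, and a self-loop guarded by
    c >= per + jitter_min that resets c and increments Activation."""
    def __init__(self):
        # The parameter per is fixed once and for all: exactly what plain
        # timed automata cannot express and parameters provide.
        self.per = random.uniform(PERIOD_MIN, PERIOD_MAX)
        # Initially c lies anywhere in [0, per + jitter_max], so the first
        # activation happens within that delay:
        self.next_time = random.uniform(0.0, self.per + JITTER_MAX)
        self.activation = 0

    def fire(self):
        """Take the self-loop; the next activation is then forced to occur
        within [per + jitter_min, per + jitter_max] from now (clock reset)."""
        self.activation += 1
        self.next_time += self.per + random.uniform(JITTER_MIN, JITTER_MAX)
```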

For nodes \(node_i\) and \(node_j\), the property that we want to specify corresponds in the direct model M (without abstraction) of Sect. 3 to:

  • \(\begin{aligned} (Activation(i) \le 13 \ \wedge \ Activation(j)&\le 13) \\&\Rightarrow \ \mid Activation(i) - Activation(j) \mid \ \le \ 2. \end{aligned}\)

In our timed abstract model T, such a property becomes:

  • \((P3): Activation(j) \le 13 \ \Rightarrow \ -2 \le Activation(j) - Activation(i) \le 1\)

where Activation(i) denotes the number of activations of node i since the last clean round.

The value “13” has been obtained experimentally: smaller values made the property fail (a counterexample was found). Intuitively, it corresponds to the number of activations within which we are sure the leader will have been elected.

The proof of P3 is obtained by adding to the model an observer automaton checking the values of the discrete variables Activation(i) and Activation(j), and moving to a bad location when the property is violated. The property is then verified by showing that the bad location is unreachable. For the values of the timing constants in Table 1, IMITATOR proves P3 (by showing the non-reachability of the bad location) in 12 s. Recall that, thanks to our abstraction, the input model for IMITATOR only contains two nodes.
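IMITATOR's verification is symbolic and exhaustive over all parameter valuations; as an independent plausibility check only, one can also randomly simulate the two components with the NodeTimer sketch above. We test the symmetric, direct-model form of the bound from the beginning of this subsection, since the exact asymmetric offsets of P3 depend on the actual Table 1 constants:

```python
# Uses NodeTimer and the hypothetical constants from the previous sketch.

def check_bound_once(bound=13, max_diff=2):
    """One random run of the two components: whenever both counters are at
    most `bound`, their difference must stay within `max_diff`."""
    node_i, node_j = NodeTimer(), NodeTimer()
    while min(node_i.activation, node_j.activation) <= bound:
        earliest = node_i if node_i.next_time <= node_j.next_time else node_j
        earliest.fire()
        if node_i.activation <= bound and node_j.activation <= bound:
            assert abs(node_i.activation - node_j.activation) <= max_diff

for _ in range(10_000):
    check_bound_once()
print("no violation found over 10,000 random runs (a sanity check, not a proof)")
```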

In the next part, we show how adding P3 as an assumption to the original abstract model allows us to prove the desired property P for a large number of nodes.

4.3 Proof of P Using P1–P3 as Assumptions

In addition to P1–P2, we now put P3 (\(Activation(j) \in [Activation(i) - 2; Activation(i) + 1]\)) as an element of \(List\_of\_assumptions\_on\_message_j\) used in the SafeProver code of \(\widehat{M}{}\) (see Algorithm 3). SafeProver is then able to generate the following property:

$$\begin{aligned} P4 : (Activation(i) \ge 4 \ \wedge \ node_i.id = maxId) \Rightarrow node_i.state = \mathsf {Leader}{} \end{aligned}$$

Property P4 states that the node with the highest ID declares itself \(\mathsf {Leader}\) after at most 4 activations. Besides, property P1 states that a node not having the highest ID is in state \(\mathsf {Follower}{}\) within at most 2 activations. Properties P1 and P4 together thus express a statement equivalent to the desired correctness property P. The generation of properties (P1), (P2) and (P4) by SafeProver takes the computation times tabulated in Table 4. As one can see, the computation times are now smaller by an order of magnitude than the ones given in Table 3, thus showing the good scalability of our method.

Table 4. Computation times

Remark 4

Note that verifying the model for 5,000 nodes also gives a guarantee for any smaller number of nodes. Indeed, we can assume that an arbitrary number of nodes are in mode \(\mathsf {Off}{}\) and remain so, which is equivalent to a smaller number of nodes.

4.4 Discussion

Soundness. We briefly discuss the soundness of our approach. First, note that the assumptions used above have been validated by the system designers (i. e., those who designed the algorithm). Second, SafeProver validated the assumptions, i. e., proved that they are not inconsistent with each other (which would result in an empty model).

Now, the abstraction used in Sect. 4.2, i. e., considering only two nodes, is the one that required the most human creativity. Let us briefly discuss its soundness. Our abstraction allows modeling the sending of any message, which includes the actual messages sent in the actual system. The fact that a message is necessarily received in the actual system between two (significant) executions of the node under study follows from the fact that all nodes necessarily execute at least once within the last two periods (see Lemma 1). Of course, this soundness is only valid under our assumptions on the variability of the period, considering the constants in Table 1: if the period of one node is 1 while another’s is 100, our framework is obviously not sound anymore.

Parametric vs. parameterized model checking. As shown in Table 4, we verified the model for a constant number of nodes. This comes in contrast with the recent work on parameterized verification (e. g., [2, 12, 13]). However, while the latter considers a parameterized number of nodes, it considers non-parametric timed models; in contrast, we need parametric timed models here to represent the uncertainty on the periods. Combining both worlds (a parameterized number of nodes with parametric timed models) would be of interest, but remains a very challenging objective.

5 Conclusion

We described a method combining abstraction, SafeProver and parametric timed model checking in order to prove the correctness of a variant of an asynchronous leader election algorithm designed by Thales. Our approach can efficiently verify the leader election after a fixed number of rounds for a large number p of processes (up to \(p=5000\)). The method relies on the construction of two abstract models \(\widehat{M}{}\) and T of the original model M. Although each abstraction is intuitive, it could be interesting to prove formally that it is correct, in the sense that it over-approximates all the behaviors of M.

Perspectives

Many variants of the algorithm can be envisioned (loss of messages, non-instantaneous transmission, non-complete graph topology, ...). The fault model could also be enriched. It will then be interesting to extend our abstraction-based method to prove the correctness of such extensions.

The correctness of the method relies on the order of magnitude of the constants used (Remark 1). For different intervals, it might be necessary to adapt both the algorithm (reading messages only once every k activations) and the assumptions used in the abstraction-based proof, a manual and possibly error-prone process. A more general verification method would be desirable.

In addition, the number of activations in our correctness property (“after 13 activations, the leader is elected”) was obtained by incremental verification (values up to 12 all gave concrete counterexamples). As future work, we would like to infer this value automatically too, i. e., to obtain the minimal number of activations before a leader is guaranteed to be elected.

Finally, adding probabilities to model node failures will be of interest.