
1 Introduction

1.1 Leader Election

This is a classical problem encountered in distributed computing. Each process \(p_i\) has a local variable \(leader_i\), and it is required that all the local variables \(leader_i\) forever contain the same identity, which is the identity of one of the processes. A classical way to elect a leader consists in selecting the process with the smallest identityFootnote 1. If processes may crash, the system is fully asynchronous, and the elected leader must be a process that does not crash, leader election cannot be solved [21]. Not only must the system no longer be fully asynchronous, but the leader election problem must be weakened to the eventual leader election problem. This problem is denoted \(\varOmega \) in the failure detector parlance [2, 3]. Notice that the algorithm must elect a new leader each time the previously elected leader crashes.

1.2 Related Work

Many algorithms for electing an eventual leader in crash-prone partially synchronous systems have been proposed. Surveys of such algorithms are presented in [20, Chapter 17] when communication is through a shared memory, and in [21, Chapter 18] when communication is through reliable message-passing.

Different levels of communication reliability are proposed in [1], where it is shown that, in systems with only some timely channels and a complete network, correct processes must send messages forever, even when at most one process may crash. An algorithm for implementing \(\varOmega \) in networks with unknown membership is presented in [13]. This algorithm works in a complete network, and every process needs to communicate its name to every neighbor using a broadcast protocol.

An implementation of \(\varOmega \) for the crash-recovery model, in which processes can crash and then recover infinitely many times and channels can lose messages arbitrarily, is presented in [7]. The case of dynamic systems is addressed in [14], and the case where the underlying synchrony assumptions may change with time is addressed in [9]. Stabilizing leader election in crash-prone synchronous systems is investigated in [5].

The ADD Distributed Computing Model. This model was introduced in [22] as a realistic partially synchronous model of channels that can lose and reorder messages. Each channel guarantees that some subset of the messages sent on it will be delivered in a timely manner, and that such messages are not too sparsely distributed in time. More precisely, for each channel there exist two constants K and D, not known to the processes (and not necessarily the same for all channels), such that for every K consecutive messages sent in one direction, at least one is delivered within D time units after it has been sent.
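As an illustration, the ADD guarantee can be checked on a message trace with a small Python sketch (the function name and the trace encoding are ours, not part of the model):

```python
def satisfies_add(deliveries, K, D):
    """Check a send trace against the ADD property.

    `deliveries` has one entry per consecutive message sent on the channel:
    None if the message was lost (or arbitrarily delayed), otherwise the
    delay with which it was delivered.  ADD with constants (K, D) requires
    that among every K consecutive messages, at least one is delivered
    within D time units of being sent.
    """
    for start in range(len(deliveries) - K + 1):
        window = deliveries[start:start + K]
        if not any(d is not None and d <= D for d in window):
            return False
    return True

# One timely message (delay 3) out of every 4 sent: ADD holds for K=4, D=12.
assert satisfies_add([None, None, None, 3] * 5, K=4, D=12)
# Four consecutive lost messages violate ADD with K=4.
assert not satisfies_add([3, None, None, None, None, 3], K=4, D=12)
```

Note that the property constrains only one timely message per window of K; all other messages may be lost or arbitrarily delayed.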

Even though ADD channels seem so weak, it is possible to implement an eventually perfect failure detector, \(\Diamond P\), in a fully connected network of ADD channels, where asynchronous processes may fail by crashing [22]. Later on, it was shown that it is also possible to implement \(\Diamond P\) in an arbitrarily connected network of ADD channels [16]. Recall that \(\Diamond P\) is a classic failure detector, relatively powerful (more than sufficient to solve consensus), stronger than \(\varOmega \) [2] and yet realistically implementable [3].

The algorithm in [16] works for arbitrary connected networks of ADD channels, and sends messages of bounded size, improving on the previous unbounded size algorithm presented in [12]. However, the size of messages is exponential in n, the number of processes. More recently, an implementation of \(\Diamond P\) using messages of size \( O(n~\mathsf{{log}}~n)\), in an arbitrarily connected network of ADD channels was presented in [23].

1.3 Contribution

This paper shows that it is possible to implement \(\varOmega \) in an arbitrarily connected network where asynchronous processes may fail by crashing in a weaker model than the one presented in [23]. It first presents an implementation of \(\varOmega \) with messages of size \(O(\mathsf{{log}}~n)\), reducing the message size with respect to [23].

Most of the previous works related to \(\varOmega \) concentrated on communication-efficient algorithms in fully connected networks, where efficiency is measured by the number of messages. They considered neither the size of the messages nor arbitrarily connected networks.

The proposed algorithm works under very weak assumptions, requiring only that a directed spanning tree from the leader exists, composed of channels that eventually satisfy the ADD property. This algorithm requires that processes know n, the number of processes. Then, the paper shows how to extend the ideas to design an algorithm for the case where n is unknown in arbitrarily connected networks. Initially a process knows only its set of incident channels. Interestingly enough, eventually the size of the messages used by this algorithm is also \(O(\mathsf{{log}}~n)\).

We pay particular attention to the size of the messages because it plays an important role in the time it takes for the processes to agree on the same leader, yet we show that our algorithms elect a leader in essentially optimal time. When designing ADD-based algorithms, it is challenging to transmit a large message by splitting it into smaller messages, due to the uncertainty created by the fact that, while the constants K and D do exist, a process knows neither them nor the time from which the channels forever satisfy them. This type of difficulty is also encountered in the design of leader election algorithms under weak eventual synchrony assumptions, e.g., [1, 8, 23]. A similar difficulty arises in self-stabilizing problems, where ideas close to our hopbound technique have been used [6], as well as in [23]. We found it even more challenging to work under the assumption that some edges might not satisfy any property at all; our algorithm works under the assumption that only edges on an (unknown to the processes) spanning tree are guaranteed to comply with the ADD property.

2 Model of Computation

Process Model. The system consists of a finite set of n processes \(\varPi = \{p_1,p_2,...,p_n\}\). Every process \(p_i\) has an identity, and without loss of generality we consider that the identity of \(p_i\) is its index i. As there is no ambiguity, we use \(p_i\) and i interchangeably to denote the same process.

Every process \(p_i\) also has a read-only local clock \(\mathsf{{clock}}_i()\), which is assumed to generate ticks at a constant rateFootnote 2. Local clocks need not be synchronized (they need not exhibit the same time value at the same time); they are used only to implement timers. To simplify the presentation, it is assumed that local computations have zero duration.

Any number of processes may fail by crashing. A process is correct if it does not crash, otherwise, it is faulty.

Virtual Global Clock. For notational simplicity, we assume the existence of an external reference clock, which remains always unknown to the processes. The range of its ticks is the set of natural numbers. It makes it possible to associate consistent dates with the events generated by the algorithm.

Communication Network. It is represented by a directed graph \(G=(\varPi ,E)\), where an edge \((p_i,p_j)\in E\) means that there is a unidirectional channel that allows the process \(p_i\) to send messages to \(p_j\). A bidirectional channel can be represented by two unidirectional channels, possibly with different timing assumptions. Each process \(p_i\) has a set of input channels and a set of output channels.

The graph connectivity requirement on the communication graph G depends on the problem to be solved. It will be stated in Sects. 4 and 5, which are devoted to the proofs of the proposed algorithms.

Basic Channel Property. It is assumed that no directed channel creates, corrupts, or duplicates messages.

The ADD Property. A directed channel \((p_i,p_j)\) satisfies the ADD property if there are two constants K and D (unknown to the processesFootnote 3) such that

  • for every K consecutive messages sent by \(p_i\) to \(p_j\), at least one is delivered to \(p_j\) within D time units after it has been sent. The other messages from \(p_i\) to \(p_j\) can be lost or experience arbitrary delays.

Each directed channel can have its own pair \((K,D)\). To simplify the presentation, and without loss of generality, we assume that the pair \((K,D)\) is the same for all the channels.

The \(\Diamond \)ADD Property. The eventual ADD property states that the ADD property is satisfied only after an unknown but finite period of time. Hence this weakened property allows the system to experience an initial anarchy period during which the behavior of the channels is arbitrary.

The Span-Tree Assumption. We consider that there is a time \(\tau \) after which there is a directed spanning tree (i) that includes all the correct processes and only them, (ii) its root is the correct process with the smallest identity, and (iii) its channels satisfy the \(\Diamond \)ADD property. This behavioral assumption is called Span-Tree in the following.

Eventual Leader Election. Assuming a read-only local variable \(leader_i\) at each process \(p_i\), the leader failure detector \(\varOmega \) satisfies the following properties [2, 18]:

  • Validity: For any process \(p_i\), each read of \(leader_i\) by \(p_i\) returns a process name.

  • Eventual leadership: There is a finite (but unknown) time after which the local variables \(leader_i\) of all the correct processes contain forever the same process name, which is the name of one of them.

3 Eventual Leader Election with Known Membership

This section presents Algorithm 1 that implements \(\varOmega \), assuming each process knows n. Parameter T denotes an arbitrary duration. Its value affects the efficiency of the algorithm, but not its correctnessFootnote 4.

3.1 Local Variables at a Process \(p_i\)

Each process \(p_i\) manages the following local variables.

  • \(in\_neighbors_i\) (resp., \(out\_neighbors_i\)) is a (constant) set containing the identities of the processes \(p_j\) such that there is a channel from \(p_j\) to \(p_i\) (resp., there is a channel from \(p_i\) to \(p_j\)).

  • \(leader_i\) contains the identity of the elected leader.

  • \( timeout _i[1..n, 1..n]\) is a matrix of timeout values and \( timer _i[1..n, 1..n]\) is a matrix of timers, such that the pair \(\langle timer _i[j,n-k], timeout _i[j,n-k]\rangle \) is used by \(p_i\) to monitor the elementary paths from \(p_j\) to it whose length is k.

  • \(hopbound_i[1..n]\) is an array of non-negative integers; \(hopbound_i[i]\) is initialized to n, while each other entry \(hopbound_i[j]\) is initialized to 0. Then, when \(j\ne i\), \(hopbound_i[j]=n-k \ne 0\) means that, if \(p_j\) is currently considered as leader by \(p_i\), the information carried by the last message alive \((j,n-1)\) sent by \(p_j\) to its out-neighbors (which forwarded alive \((j,n-2)\) to their out-neighbors, etc.) went through a pathFootnote 5 of k different processes before being received by \(p_i\). The code executed by \(p_i\) when it receives a message alive \((j,-)\) is detailed in Sect. 3.2.

    The identifier hopbound stands for “upper bound on the number of forwardings” that –due to the last message alive \((j,-)\) received by \(p_i\)– the message alive \((j,-)\) sent by \(p_i\) has to undergo to be received by all processes. It is similar to a time-to-live value.

  • \(penalty_i[1..n,1..n]\) is a matrix of integers such that \(p_i\) increases \(penalty_i[j,n-k]\) each time the \(timer_i[j,n-k]\) expires. It is a penalization counter monitored by \(p_i\) with respect to the elementary paths of length k starting at \(p_j\) and ending at \(p_i\).

  • \(not\_expired_i\) is an auxiliary local variable.
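The local variables above can be sketched in Python as follows (the initial timeout value and the neighbor sets are illustrative placeholders; index 0 is unused so that indices match process identities 1..n):

```python
class LocalState:
    """Sketch of the local variables of process p_i in Algorithm 1."""
    def __init__(self, i, n, in_neighbors, out_neighbors):
        self.i, self.n = i, n
        self.in_neighbors = set(in_neighbors)
        self.out_neighbors = set(out_neighbors)
        self.leader = i                          # p_i starts as its own candidate
        # timeout[j][hb] / timer[j][hb] monitor paths of length n - hb from p_j.
        self.timeout = [[1] * (n + 1) for _ in range(n + 1)]     # placeholder value 1
        self.timer = [[None] * (n + 1) for _ in range(n + 1)]    # None = not running
        self.hopbound = [0] * (n + 1)
        self.hopbound[i] = n                     # own entry initialized to n
        self.penalty = [[0] * (n + 1) for _ in range(n + 1)]

s = LocalState(i=2, n=5, in_neighbors={1}, out_neighbors={3})
assert s.leader == 2 and s.hopbound[2] == 5 and s.hopbound[3] == 0
```

The auxiliary variable \(not\_expired_i\) is recomputed on the fly at each message reception, so it does not appear in the persistent state.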

3.2 General Principle of the Algorithm

Like many other leader election algorithms, Algorithm 1 elects the process that has the smallest identity among the set of correct processes, by keeping as candidate leader the smallest identifier received, as explained in the following sections. The algorithm is made up of three main parts: one that generates and forwards the alive() messages, one that receives alive() messages, and one that handles timer expiration. Each part is described in detail below.

Generating and Forwarding Messages. (Lines 6–9) Every T time units of clock \(\mathsf{{clock}}_i()\), a process \(p_i\) sends the message alive \((leader_i,hopbound_i[leader_i]-1)\).

A message alive \((*,n-1)\) is called generating message. A message alive \((*,n-k)\) such that \(1< k < n-1\), is called forwarding message (in this case it is the forwarding of the last message alive \((leader_i,hopbound)\) previously received by \(p_i\)). Moreover, the value \(n-k\) is called hopbound value. When a process \(p_i\) starts the algorithm, it proposes itself as candidate to be leader.

A message is sent only if the predicate \(hopbound_i[leader_i]>1\) of line 7 is true. The message sent is then alive \((leader_i,hopbound_i[leader_i]-1)\).

The message forwarding is motivated by the following observation: if \(hopbound_i[leader_i] >1\), the last message alive \((leader_i,-)\) whose sending was initiated by \(leader_i\) –and then forwarded along paths of processes, each of which decreased the carried hopbound value– may not have reached all the processes yet. In this case, \(p_i\) must participate in the forwarding. To this end, it sends the message alive \((leader_i,hopbound_i[leader_i]-1)\) to each of its out-neighbors (line 8).

Let us observe that, during the anarchy period –due to the timeout values and the current asynchrony, channel behavior, and process failure pattern– several generating messages alive \((*,n-1)\) can be sent by distinct processes (which compete to become leader) and forwarded by the other processes with decreasing hopbound values. But, when there are no more process crashes and enough directed channels satisfy the ADD property, there is a finite time from which a single process (namely, the correct process \(p_\ell \) with the smallest identity) sends messages alive \((\ell ,n-1)\), and no other process \(p_j\) sends the generating message alive \((j,n-1)\).
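The generation/forwarding rule of lines 6–9 can be sketched in Python (the `send` callback is a placeholder for the channel layer; names are ours):

```python
from types import SimpleNamespace

def on_timer_tick(state, send):
    """Sketch of the sending task (lines 6-9): executed every T clock ticks.

    A message alive(leader, hb - 1) is sent only when hopbound[leader] > 1,
    so the carried hopbound value decreases at each forwarding hop.
    """
    hb = state.hopbound[state.leader]
    if hb > 1:                                   # predicate of line 7
        for j in sorted(state.out_neighbors):
            send(j, ("alive", state.leader, hb - 1))

# Usage: a process whose current leader has hopbound 5 forwards alive(1, 4).
sent = []
st = SimpleNamespace(leader=1, hopbound={1: 5}, out_neighbors={2, 3})
on_timer_tick(st, lambda j, m: sent.append((j, m)))
assert sent == [(2, ("alive", 1, 4)), (3, ("alive", 1, 4))]
```

When \(leader_i = i\), \(hopbound_i[i] = n\), so \(p_i\) acts as the generator of alive \((i,n-1)\) messages.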

Message Reception. (Lines 10–17) When a process \(p_i\) such that \(leader_i\ne i\) receives a message alive \((\ell ,n-k)\), it learns that (a) \(p_\ell \) is a candidate to be leader, and (b) there is a path with k hops from \(p_\ell \) to itself.

If \(\ell \le leader_i\), \(p_i\) considers \(\ell \) as its current leader (line 11); hence, if \(\ell <leader_i\), \(p_\ell \) becomes its new leader. Otherwise (\(\ell > leader_i\)), \(p_i\) discards the message. This is due to the fact that \(p_i\) currently considers \(leader_i\) as leader, and the eventual leader must be the correct process with the smallest identity.

Then, as the message alive \((\ell ,n-k)\) indirectly comes from \(leader_i=\ell \) (which generated alive \((\ell ,n-1)\)) through a path made up of k different processes, \(p_i\) increases the associated timeout value if the timer \( timer _i[leader_i,hb]\) expired before it received the message alive \((\ell ,hb)\) (line 13). Moreover, whether \( timer _i[leader_i,hb]\) expired or not, \(p_i\) resets \( timer _i[leader_i,hb]\) (line 14) and starts a new monitoring session with respect to its current leader and the cycle-free paths of length hb from \(leader_i\) to it.

The role of the timer \( timer _i[\ell ,hb]\) is to allow \(p_i\) to monitor \(p_{\ell }\) with respect to the forwarding of the messages alive \((\ell ,hb)\) it receives such that \(hb=n-k\) (i.e., with respect to the messages received from \(p_\ell \) along paths of length k).

Finally, \(p_i\) updates \(hopbound_i[leader_i]\). To update it, \(p_i\) computes the value of \(not\_expired_i\) (line 15), which is a bag (or multiset) of cycle-free path lengths x whose timers \(timer_i[leader_i,n-x]\) are still running. The idea then is to select the least penalized path (hence the “smallest non-negative value” at line 16). However, it is possible that two different cycle-free paths of lengths x1 and x2 are such that \(penalty_i[leader_i,n-x1]=penalty_i[leader_i,n-x2]\). In this case, in a conservative way, \(\mathsf{{max}}(n-x1,n-x2)\) is selected to update the local variable \(hopbound_i[leader_i]\).
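The reception handling of lines 10–17 can be sketched in Python (the timeout-increase policy shown here –doubling– is an assumption of this sketch, and indices are hopbound values \(hb = n-k\), as in the text):

```python
from types import SimpleNamespace

def on_receive(state, ell, hb, now):
    """Sketch of the reception of alive(ell, hb) (lines 10-17)."""
    if ell > state.leader:
        return                                   # bigger identity: discard
    state.leader = ell                           # line 11
    t = state.timer[ell][hb]
    if t is not None and t <= now:               # timer had expired (line 13)
        state.timeout[ell][hb] *= 2              # assumed increase policy
    state.timer[ell][hb] = now + state.timeout[ell][hb]      # reset (line 14)
    # Bag of hopbound values whose timers are still running (line 15).
    running = [h for h in range(2, state.n)
               if state.timer[ell][h] is not None and state.timer[ell][h] > now]
    best = min(state.penalty[ell][h] for h in running)       # least penalized
    state.hopbound[ell] = max(h for h in running
                              if state.penalty[ell][h] == best)   # line 16

# Usage: p_i (current leader 3) receives alive(1, 4) at time 10.
st = SimpleNamespace(leader=3, n=5,
                     timer=[[None] * 6 for _ in range(6)],
                     timeout=[[4] * 6 for _ in range(6)],
                     penalty=[[0] * 6 for _ in range(6)],
                     hopbound=[0] * 6)
on_receive(st, ell=1, hb=4, now=10)
assert st.leader == 1 and st.timer[1][4] == 14 and st.hopbound[1] == 4
```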

[Algorithm 1: pseudo-code (figure a)]

Timer Expiration. (Lines 18–23) Given a process \(p_i\), when the timer currently monitoring its current leader through a path of length \(k=n-hb\) expires (line 18), it increases its \(penalty_i[leader_i,n-k]\) entry (line 19).

The entry \(penalty_i[j,n-k]\) is used by \(p_i\) to cope with the negative effects of the channels which are on cycle-free paths of length k from \(p_j\) to \(p_i\) and do not satisfy the ADD property. More precisely, we have the following. If, while \(p_i\) considers \(p_j\) as its current leader (we have then \(leader_i=j\)), \(timer_i[j,n-k]\) expires, then \(p_i\) increases \(penalty_i[j,n-k]\). The values in the vector \(penalty_i[j,1..n]\) are then used at lines 15–16 (and line 22) to update \(hopbound_i[leader_i]\), which (if \(p_j\) is the eventually elected leader) will contain the length of a cycle-free path from \(p_j\) to \(p_i\) made up of \(\Diamond \)ADD channels (i.e., a path on which \(timer_i[j,n-k]\) will no longer expire).

Then, if for all the hopbound values, the timers currently monitoring the current leader have expired (line 20), \(p_i\) becomes candidate to be leader (line 21).

If one (or more) of the timers monitoring its current leader has not expired, \(p_i\) recomputes the path associated with the least penalized hopbound value in order to continue monitoring \(leader_i\) (line 22).
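The expiration handling of lines 18–23 can be sketched in Python (a minimal sketch with our own names; the `running` computation mirrors the one used at message reception):

```python
from types import SimpleNamespace

def on_timer_expire(state, hb, now):
    """Sketch of the expiration of timer[leader][hb] (lines 18-23)."""
    ell = state.leader
    state.penalty[ell][hb] += 1                  # penalize the path (line 19)
    running = [h for h in range(2, state.n)
               if state.timer[ell][h] is not None and state.timer[ell][h] > now]
    if not running:                              # all timers expired (line 20)
        state.leader = state.i                   # become candidate (line 21)
    else:                                        # keep monitoring (line 22)
        best = min(state.penalty[ell][h] for h in running)
        state.hopbound[ell] = max(h for h in running
                                  if state.penalty[ell][h] == best)

# Usage: all timers for leader 1 have expired, so p_2 candidates itself.
st = SimpleNamespace(i=2, leader=1, n=5,
                     timer=[[None] * 6 for _ in range(6)],
                     penalty=[[0] * 6 for _ in range(6)],
                     hopbound=[0] * 6)
on_timer_expire(st, hb=4, now=10)
assert st.leader == 2 and st.penalty[1][4] == 1
```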

4 Proof of Algorithm 1

This section shows that Algorithm 1 elects an eventual leader while assuming the Span-Tree behavioral assumption.

We have to prove that the algorithm satisfies Validity and Eventual leadership. For Validity, let us observe that the local variables \(leader_i\) of all the processes always contain a process identity. Hence, we must only prove Eventual leadership, i.e., we must show that the variables \(leader_i\) of all the correct processes eventually converge to the same process identity, which is the identity of one of them.

Due to space limitation, the proofs of the lemmas are in the full version of the paper, which can be found in [15].

Lemma 1

Let \(p_i\) and \(p_j\) be two correct processes connected by a \(\Diamond \)ADD channel, from \(p_i\) to \(p_j\). There is a time after which any two consecutive messages received by \(p_j\) on this channel are separated by at most \(\varDelta =(K-1)\times T+D\) time units.
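As a quick numeric check, Lemma 1's bound can be computed for the constants later used in the experiments of Sect. 4.2:

```python
def delta(K, T, D):
    """Upper bound of Lemma 1 on the time separating two consecutive
    receptions on a stabilized ADD channel: (K - 1) * T + D."""
    return (K - 1) * T + D

# Constants from the experiments (K=4, D=12) with two sending periods T:
assert delta(K=4, T=1, D=12) == 15
assert delta(K=4, T=10, D=12) == 42
```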

Given any run r of Algorithm 1, let \(\mathsf{{correct}}(r)\) denote the set of processes that are correct in this run and \(\mathsf{{crashed}}(r)\) denote the set of processes that are faulty in this run.

The following lemma shows that there is a time after which there are no alive \((j,n-k)\) messages with \(p_j \in \mathsf{{crashed}}(r)\), i.e., eventually all correct processes stop forwarding the alive messages of a failed process. This proves that once a leader fails, eventually all processes elect a new leader.

Lemma 2

Let \(p_i \in \mathsf{{crashed}}(r)\) and \(1 \le a < n-1\). Given a run r, there is a time after which there are no messages alive \((i,n-a)\).

Theorem 1

Given a run r satisfying the Span-Tree property, there is a finite time after which the variables \(leader_i\) of all the correct processes contain the smallest identity \(\ell \in \mathsf{{correct}}(r)\). Moreover, after \(p_\ell \) has been elected, there is a finite time after which the only messages sent by processes are alive \((\ell ,-)\) messages.

Theorem 2

The size of a message is \(O(\mathsf{{log}}~n)\).

Proof

The proof follows directly from the fact that a message carries a process identity, which belongs to the set \(\{1,\cdots ,n\}\), and a hopbound number hopbound such that \(2\le hopbound \le n-1\). Since an integer bounded by n can be represented with \(\lceil \mathsf{{log}}~n \rceil \) bits, and a message carries two such integers, the size of every message is \(O(\mathsf{{log}}~n)\).
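Theorem 2 can be illustrated concretely (a minimal sketch; the function name is ours):

```python
import math

def message_size_bits(n):
    """Bits needed for an alive(id, hopbound) message:
    two integers in the range 1..n, each on ceil(log2 n) bits."""
    return 2 * math.ceil(math.log2(n))

assert message_size_bits(1024) == 20   # 10 bits for the id + 10 for the hopbound
assert message_size_bits(1000) == 20   # ceil(log2 1000) = 10 as well
```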

4.1 Time Complexity

Given a run r, let \(\ell \) denote the smallest identity such that \(\ell \in correct(r)\). Let \(t^a\) be the time given by Lemma 2, i.e., a time from which no message from crashed processes is still in transit (all such messages have been received or lost). Let \(t^r \ge t^a\) be the time after which:

  1. All failures already happened.

  2. All \(\Diamond \)ADD channels satisfy their constants K and D.

After \(t^r\), let \(\varDelta \) be the constant given by Lemma 1.

Lemma 3

Let \(p_i\) be a correct process such that for every \(t > t^r\), \(hopbound_i[\ell ] = n-k\). Then, for every correct process \(p_j\) such that there is a \(\Diamond \)ADD channel from \(p_i\) to \(p_j\), \(timeout_j[\ell ,n-(k+1)] \le \mathcal {C} + 2^{log(\lceil \varDelta \rceil )}\), with \(timeout_j[\ell ,n-(k+1)]= \mathcal {C}\) before \(t^r\).

Lemma 3 states that after \(t^r\), a timeout value is increased a finite number of times. Let \(t^c\) be the time after which all timeouts have reached their maximum, namely, no timeout is increased again. The following claims refer to the communication graph after \(t^c\).

Lemma 4

For every correct process \(p_i\) such that there is a minimum length path of \(\Diamond \) ADD channels of length k from \(p_\ell \) to \(p_i\), \(leader_i = \ell \) at time \(t^c + (k \times \varDelta )\).

Let \(\mathcal {D}\) be the diameter of the underlying spanning-tree of \(\Diamond \) ADD channels.

Theorem 3

For every correct process \(p_i\) it takes \(O(\mathcal {D} \cdot \varDelta )\) time to have \(leader_i = \ell \).

Proof

The proof is direct from Lemma 4.

4.2 Simulation Experiments

This section presents simulation experiments related to the performance of Algorithm 1 predicted by Theorem 3. Only a few experiments are presented; a more detailed experimental study is beyond the scope of this conference version. Our experiments show that a leader is elected in time proportional to the diameter of the network, in two network topologies: a ring and a random regular graph of degree 3.

Consider the constants K and D satisfied by a \(\Diamond \)ADD channel once it stabilizes. Lemma 1 shows that, for a given T (the period with which messages are sent), \(\varDelta =(K-1)\times T+D\) is an upper bound on the time between the reception of two consecutive messages by a process. According to Theorem 3, the time to elect a leader is proportional to the diameter of the network, where K, D and T determine the slope of the function.

For the (time and memory) efficiency of the experiments, we make some simplifying assumptions, which seem sufficient for a preliminary illustration of the results:

  • All the channels are \(\Diamond \)ADD, to avoid the need for a penalization array.

  • All the messages are delivered within time at most D or not delivered at all. This is sufficient to illustrate the convergence time to a leader. Additional experimental work is needed to determine the damage done by messages that are delivered very late.

  • We selected \(K=4\), \(D=12\) and \(T=1,5,10\).
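The simplified channel model of the second assumption can be sketched in Python (a hypothetical helper, not the actual simulation code, which is based on SimPy):

```python
import random

def channel_trace(num_msgs, D, p, seed=0):
    """Per-message outcomes under the simplified experimental model:
    each message is lost with probability p, otherwise delivered with
    some delay of at most D.  Returns a list of delays (None = lost)."""
    rng = random.Random(seed)
    return [None if rng.random() < p else rng.uniform(0, D)
            for _ in range(num_msgs)]

trace = channel_trace(num_msgs=1000, D=12, p=0.01)
delivered = [d for d in trace if d is not None]
assert all(d <= 12 for d in delivered)   # no message arrives later than D
assert len(delivered) > 900              # with p = 1%, most messages arrive
```

With a \(1\%\) drop rate, timely messages are far denser than the ADD property requires, which is consistent with the observed convergence constants being smaller than \(\varDelta \).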

Convergence Experiments. The ring experiments in Fig. 1 and Fig. 2 correspond to a message loss probability of \(1\%\) and \(99\%\), respectively. The case of a random graph of degree 3 with up to 50,000 nodes is in Fig. 3, with a message loss probability of \(1\%\). These experiments verify that the convergence time is indeed proportional to the diameter. The constants appear to be smaller than \(\varDelta \), the value predicted by Theorem 3.

Simulation Details. We performed our simulations on a 48-core machine with 256 GB of memory, using a program based on SimPy, a discrete-event simulation framework for Python. We used the NetworkX package to model graphs composed of ADD channels. For the ring simulations, experiments were performed for each n from 10 up to 400 nodes, taking the average of 10 executions for each value of n. For the random regular networks, the degree selected was 3, and experiments were run for n starting at 100 up to 10,000, taking the average of 5 executions; n was incremented by 100 until reaching 10,000, and from then on until 50,000 it was incremented by 10,000 each time. A performance impediment was indeed the large amount of memory used.

Fig. 1. A ring with drop rate of \(1\%\)

The convergence time curves we obtained for the ring experiment are functions of the form \(f(x) =c\cdot x\), where x represents the diameter of the network, and the constant c is, roughly, between 2.5 and 4.5 as T goes from 1 to 10. For the random regular networks, we again obtained a constant that roughly doubled as T went from 1 to 10. This behavior is better than the one predicted by Theorem 3, which says that the constant c should have grown 10 times.

Fig. 2. A ring with drop rate of \(99\%\)

Re-Election Convergence Simulation. If an elected leader fails, we would like to know how much time it takes for a new leader to be elected.

Note that the \(\Diamond \)ADD channels can arbitrarily delay the delivery of some messages. This condition has a great impact on the time it takes Algorithm 1 to replace a failed leader. For the following simulations, we again assume that all the messages are delivered within time at most D, or not delivered at all. Note, however, that in a realistic scenario we can ease the impact of arbitrarily delayed messages by adding a timestamp to every message and keeping track, for every neighbor, of the last timestamp received: if the timestamp of a newly received message is smaller than the current one, the message is simply ignored. This timestamp is not bounded, but if we use an integer and increase it by one every second a message is sent, this integer can last for a century without overflowingFootnote 6. By adding an integer to the message, we keep messages of size \(O(log \ n)\).
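The timestamp mechanism just described can be sketched in Python (a minimal sketch with our own names):

```python
def make_timestamp_filter():
    """Per-neighbor filter for arbitrarily delayed messages: a message
    whose timestamp is smaller than the last one seen from that neighbor
    is ignored."""
    last = {}
    def accept(neighbor, timestamp):
        if timestamp < last.get(neighbor, timestamp):
            return False                  # stale (arbitrarily delayed) message
        last[neighbor] = timestamp
        return True
    return accept

accept = make_timestamp_filter()
assert accept("p2", 5)                    # first message from p2
assert not accept("p2", 3)                # older timestamp: delayed, ignored
assert accept("p2", 7)                    # fresh message: accepted
```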

Fig. 3. A 3-regular random graph with drop rate of \(1\%\)

For the simulation of Fig. 4, we selected \(K=4\), \(D=12\), \(T=1\), and a message loss probability of \(1\%\). We performed this simulation on a ring. The algorithm starts at time \(t_0\) and continues its execution until the average time at which a leader is elected (the curve represented in orange). At this time, the candidate to be leader fails, and a timer from an external observer is started in every process. This timer is used to measure the average time needed for each process to discard the failed leader (curve represented in purple) and then converge to a new leader (curve represented in blue). This experiment verifies that the convergence time after the current leader fails is indeed proportional to the diameter, since \(\varDelta = (K-1)\times T+D = 3 + 12 = 15\).

5 Eventual Leader Election with Unknown Membership

Fig. 4. Convergence time for re-election

Here, while n exists and has a fixed value, it is no longer assumed that processes know it. Consequently, the processes have “unknown membership”: they do not know how many or which processes are in the network. Nevertheless, for convenience, the proposed algorithm still uses the array notation for storing the values of timers, timeouts, hopbounds, etc. (in an implementation, dynamic data structures –e.g., lists– should be used).

Algorithm 2 solves eventual leader election in the \(\Diamond \)ADD model with unknown membership, which means that, initially, a process knows nothing about the network except its input/output channels.

Our goal is to maintain the \(O(\mathsf{{log}}~ n)\) bound on the size of the messages even in this model. It seems difficult to obtain it by a minor modification of the first algorithm. For instance, a classic way of ensuring that the forwarding of an alive message is cycle-free is to include in the message the path along which the forwarding occurred, as done in [16]. This would result in messages of exponential size, while, assuming a slightly different model, we show how to eventually stay with messages of size \(O(\mathsf{{log}}~ n)\).

Furthermore, since we want the message size to eventually be \(O(\mathsf{{log}}~n)\), we need to design a mechanism that works like a broadcast in which, once a process \(p_i\) knows a new process name from \(p_j\), the latter no longer needs to send \(p_i\) the same information, but only the leader information. The mechanism proposed in this paper differs from the one proposed in [23], since we prevent processes from forever sending all the known names: eventually, they send only the leader information.

Since no process has knowledge of the number of participating processes, this number must be learned dynamically as the names of processes arrive. In order for the leader to reach every process in the network, there must be a path of \(\Diamond \)ADD channels between every correct process and the leader. It follows that an algorithm for eventual leader election in networks with unknown membership cannot be a straightforward extension of the previous algorithm. More precisely, instead of the unidirectional channels and the Span-Tree assumption, Algorithm 2 assumes that (i) all the channels are bidirectional \(\Diamond \)ADD channels, and (ii) the communication network restricted to the correct processes remains always connected (namely, there is always a path –including correct processes only– connecting any two correct processes).

In Algorithm 1, every process \(p_i\) uses n to initialize its local variable \(hopbound_i[i]\) (which thereafter is never modified). In the unknown membership model, \(hopbound_i[i]\) is used differently, namely it represents the number of processes known by \(p_i\) so far. So its initial value is 1. Then, using a technique presented in [23], \(hopbound_i[i]\) is updated as processes know about each other: every time a process \(p_i\) discovers a new process identity it increases \(hopbound_i[i]\).

5.1 General Principle of the Algorithm

Initially each process \(p_i\) only knows itself and how many channels are connected to it. So the first thing \(p_i\) needs to do is communicate its identity to its neighbors. Once its neighbors know about it, \(p_i\) no longer sends its identity. The same is done with other names that \(p_i\) learns. For that, \(p_i\) keeps a pending set for every channel connected to it that tracks the information it needs to send to its neighbors. So initially, \(p_i\) adds the pair \(({\texttt {new}},i)\) to every pending set.

During a finite amount of time, it is necessary to send an alive() message to every neighbor without any constraint, because the set of process names needs to be communicated to the other processes. That is, the information about a leader might be empty, and the message then contains only the corresponding pending set.

When process \(p_i\) receives an alive() message from \(p_j\), this message can contain information about the leader and the corresponding pending set that \(p_j\) keeps for \(p_i\). First, \(p_i\) processes the information contained in the pending set, and then the information about the leader.

How \(\boldsymbol{p_i}\) Learns New Process Names. If \(p_i\) finds a pair with a name labeled new that it is not aware of, it stores the new name in the set \(known_i\), increases its hopbound value, and adds this information, labeled new, to every pending set (except the one belonging to \(p_j\)). In any case, \(p_i\) needs to communicate to \(p_j\) that it now knows that information, so \(p_i\) adds it to the pending set of \(p_j\), but labeled as an acknowledgment.

When \(p_j\) receives a name labeled as an acknowledgment from \(p_i\), i.e., \(({\texttt {ack}},name)\), it stops sending the pair \(({\texttt {new}},name)\) to \(p_i\), i.e., it deletes that pair from \(p_i\)’s pending set. Eventually, \(p_i\) receives a pending set from \(p_j\) that no longer includes \(({\texttt {new}},name)\), and \(p_i\) then deletes \(({\texttt {ack}},name)\) from \(p_j\)’s pending set.
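The new/ack mechanism can be sketched in Python (a simplified one-step sketch with our own names; it omits the final deletion of ack pairs, which needs the reception of a full pending set):

```python
def process_pair(state, sender, pair):
    """Process one (label, name) pair received from `sender`'s pending set.

    `state` holds the known set, p_i's own hopbound counter, and one
    pending set per neighbor channel.
    """
    label, name = pair
    if label == "new":
        if name not in state["known"]:
            state["known"].add(name)
            state["hopbound_self"] += 1          # one more known process
            for ch in state["pending"]:          # propagate, except to sender
                if ch != sender:
                    state["pending"][ch].add(("new", name))
        state["pending"][sender].add(("ack", name))   # acknowledge in any case
    elif label == "ack":                         # sender already knows `name`
        state["pending"][sender].discard(("new", name))

# Usage: p_1 learns name 4 from neighbor 2, then neighbor 3 acknowledges it.
st = {"known": {1}, "hopbound_self": 1, "pending": {2: set(), 3: set()}}
process_pair(st, sender=2, pair=("new", 4))
assert st["known"] == {1, 4} and st["hopbound_self"] == 2
assert ("new", 4) in st["pending"][3] and ("ack", 4) in st["pending"][2]
process_pair(st, sender=3, pair=("ack", 4))
assert ("new", 4) not in st["pending"][3]
```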

How \(\boldsymbol{p_i}\) Processes the Leader Information. As in Algorithm 1, every process keeps as leader a process with minimum id. Since it is assumed that all the channels are \(\Diamond \) ADD, there is no need to keep a timer for every hopbound value, nor a penalty array. In this case, process \(p_i\) keeps the greatest hopbound value (i.e., \(n-k\)) that it has received from the process it currently considers to be the leader. If this value (or a greater one) does not arrive on time, \(p_i\) proposes itself as the leader. If a smaller hopbound value from the leader arrives, it is adopted only if the timer has expired.
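This rule can be sketched in Python as follows; the state record, the field names, and the function name are our own illustration, not the paper's pseudo-code:

```python
# Hypothetical sketch of the leader/hopbound rule described above.
def process_leader_info(state, ldr, hb, timer_expired):
    """state holds the current 'leader' and its 'hopbound' (our naming)."""
    if ldr < state["leader"]:
        # a smaller identity always wins the election
        state["leader"], state["hopbound"] = ldr, hb
    elif ldr == state["leader"]:
        # keep the greatest hopbound seen from the current leader;
        # accept a smaller one only when the timer has expired
        if hb > state["hopbound"] or timer_expired:
            state["hopbound"] = hb
```

A larger identity than the current leader's is simply ignored, which matches the minimum-id election rule.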

5.2 Local Variables at Each Process \(p_i\)

Each process \(p_i\) manages the following local variables.

  • \(leader_i\) contains the identity of the candidate leader.

  • \(hopbound_i[\cdot ]\) is an array of natural numbers, with one entry per known process; \(hopbound_i[i]\) is initialized to 1.

  • \(timeout_i[\cdot ]\) and \(timer_i[\cdot ]\) have the same meaning as in Algorithm 1. So, when \(p_i\) knows \(p_j\), the pair \(\langle timer _i[j], timeout _i[j]\rangle \) is used by \(p_i\) to monitor the sending of messages by \(p_j\) (which is not necessarily a neighbor of \(p_i\)).

  • \(known_i\) is a set containing the processes currently known by \(p_i\). At the beginning, \(p_i\) only knows itself.

  • \(out\_neighbors_i\) is a set containing the names of the channels connecting \(p_i\) to its neighbor processes. The first time \(p_i\) receives through channel m a message sent by a process \(p_j\), \(p_j\) and m become synonyms.

  • \(pending_i[1,...,k]\) is a new array in which, when \(p_i\) knows \(p_j\), \(pending_i[j]\) contains the pairs of the form \((label,id)\) that are pending to be sent through the channel connecting \(p_i\) and \(p_j\). There are two possible labels, denoted \({\texttt {new}}\) and \({\texttt {ack}}\).
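The local state just listed can be sketched as a record; the field names mirror the variables above, while the dictionary layout, the channel count parameter k, and the function name init_state are our own choices:

```python
# Hypothetical sketch of the local state of a process p_i.
def init_state(i, k):
    return {
        "leader": i,                 # p_i is initially its own candidate leader
        "hopbound": {i: 1},          # hopbound_i[i] is initialized to 1
        "timeout": {}, "timer": {},  # filled in as processes become known
        "known": {i},                # p_i initially knows only itself
        "out_neighbors": set(),      # channel names, learned on first receive
        # one pending set per channel, seeded with ("new", i)
        "pending": {m: {("new", i)} for m in range(k)},
    }

s = init_state(3, 2)
```

Note that the hopbound, timeout, and timer entries are stored in dictionaries rather than fixed-size arrays, since the membership (and hence n) is initially unknown.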

5.3 Detailed Behavior of a Process \(p_i\)

The code of Algorithm 2 addresses two complementary issues: the management of the initially unknown membership, and the leader election.

Initialization. (Lines 1–4) Initially, each process \(p_i\) knows only itself and how many input/output channels it has. Moreover, it knows neither the names of the processes connected to these channels (if any) nor how many neighbors it has (the number of channels is greater than or equal to the number of neighbors). So, when the algorithm begins, \(p_i\) proposes itself as the leader and adds the pair \(({\texttt {new}},i)\) to the pending set of every channel, so that its neighbors learn its identity.

Algorithm 2 (figure b)

Sending a Message. (Lines 5–11) Every T units of time, \(p_i\) sends a message through every channel m. In some cases the leader information is empty, because of the condition of line 7, but in any case \(p_i\) must send a message that includes the network information contained in the set \(pending_i[j]\).
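A minimal sketch of this periodic send, assuming a send(channel, leader_info, pending) callback of our own; the exact condition of line 7 is in the algorithm figure, and is only approximated here by checking whether leader information is available:

```python
# Hypothetical sketch of the periodic send (lines 5-11); the state layout
# and the send() callback are illustrative assumptions.
def tick(state, send):
    for m, pend in state["pending"].items():
        ldr = state["leader"]
        hb = state["hopbound"].get(ldr)   # None models empty leader info
        # the pending set is always sent, with or without leader info
        send(m, (ldr, hb) if hb is not None else None, set(pend))

sent = []
state = {"leader": 3, "hopbound": {3: 1}, "pending": {0: {("new", 3)}}}
tick(state, lambda m, info, pend: sent.append((m, info, pend)))
```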

Receiving a Message. (Lines 12–38) When \(p_i\) receives a message (line 12) from process \(p_j\) (through channel m), initially it only knows through which channel the message arrived, and it eventually learns which process sent it. When the message is received, the information included in pending is processed first (lines 14–29), and then the leader information is processed (lines 30–38).

Processing New Information. (Lines 14–29) The input parameter set pending includes pairs of the form \((label,id)\), where \(label \in \{{\texttt {new}},{\texttt {ack}}\}\) and id is the name of some process. When \(p_i\) processes the pairs that it received from \(p_j\), there can be two kinds of pairs. The first is a pair with label \({\texttt {new}}\) (line 15), which means that \(p_j\) is sending information that is new (at least for \(p_j\)) to \(p_i\). When this information is actually new for \(p_i\) (line 17), it stores the new name, increases its hopbound entry, and adds this new information to every pending set (except the one from which it received the information) (line 20).

If \(p_i\) already knows the information that \(p_j\) labeled as new (line 21), then \(p_i\) checks whether this information also appears, labeled new, in its own pending set for \(p_j\). If that is the case, it deletes this pair from pending[m] (line 22). In any case, \(p_i\) adds the pair \(({\texttt {ack}},k)\) to the pending set of the channel from which this message was received (line 23).

If \(p_i\) receives the pair \(({\texttt {ack}},k)\) (line 25), then it deletes the pair \(({\texttt {new}},k)\) from the set \(pending_i[m]\), because the process that sent this pair already knows k.

Processing the Leader Related Information. (Lines 30–38) If the leader related information is not empty, \(p_i\) processes it. As in the first algorithm, if the identity of the proposed leader is smaller than the current one, it is set as \(p_i\)’s new leader (line 31). Then, \(p_i\) processes the hopbound. If the recently arrived hopbound is greater than the one currently stored, the recently arrived value is set as the new hopbound (line 33). If the timer for the expected leader has expired, the leader's messages need more time to reach \(p_i\), so the timeout is increased (line 35) and the timer is reset to the new timeout (line 36).

Deleting Pairs. (Lines 21, 25 and 28) If some process \(p_i\) wants to send some information k to \(p_j\), it adds the pair \(({\texttt {new}},k)\) to the pending set of \(p_j\). When \(p_j\) receives this pair, it checks whether this information is already in its own set; if so, it deletes the pair from \(p_i\)’s pending set (line 21). Then, \(p_j\) adds \(({\texttt {ack}},k)\) to the pending set of \(p_i\). As soon as \(p_i\) receives this pair from \(p_j\), it deletes the pair \(({\texttt {new}},k)\) from \(p_j\)’s pending set (line 25). So, when \(p_j\) receives a pending set from \(p_i\) without the pair \(({\texttt {new}},k)\), it means that \(p_i\) already received the acknowledgment, so \(p_j\) deletes \(({\texttt {ack}},k)\) from \(p_i\)’s pending set (line 28).
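The three deletions can be traced end to end with a small sketch, assuming messages are modeled as plain sets of pairs; deliver() and the variable names are ours:

```python
# Hypothetical round-trip trace of the pair-deletion steps (lines 21, 25, 28).
def deliver(msg, my_pending, my_known):
    """Process a received pending set; my_pending is toward the sender."""
    for label, name in msg:
        if label == "new":
            if ("new", name) in my_pending:      # line 21: both sides advertise
                my_pending.discard(("new", name))
            my_known.add(name)
            my_pending.add(("ack", name))
        elif label == "ack":
            my_pending.discard(("new", name))    # line 25: peer knows name
    for label, name in list(my_pending):         # line 28: ack acknowledged,
        if label == "ack" and ("new", name) not in msg:
            my_pending.discard((label, name))    # so it can be dropped

A, B = {("new", 7)}, set()      # p_i advertises name 7 to p_j
ki, kj = {1, 7}, {2}
deliver(set(A), B, kj)          # p_j learns 7 and acks it
deliver(set(B), A, ki)          # p_i drops ("new", 7)
deliver(set(A), B, kj)          # p_j drops ("ack", 7): quiescence
```

After three message deliveries both pending sets are empty again, which illustrates why the membership traffic is only transient.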

Timer Expiration. (Line 39) When the timer for the expected leader expires, \(p_i\) proposes itself as the leader.

Notice that, when compared to Algorithm 1, Algorithm 2 does not use the local arrays \(penalty_i[1..n,1..n]\) employed to monitor the paths made of non-ADD channels.

6 Underlying Behavioral Assumption and Proof of Algorithm 2

In the following we consider that there is a time \(\tau \) after which no more failures occur, and the network is such that (i) all the channels are bidirectional \(\Diamond ADD\) channels, and (ii) the communication network restricted to the correct processes always remains connected. Under these assumptions, this section shows that Algorithm 2 eventually elects a leader despite the initially unknown membership. All the proofs for this algorithm are given in the full version of this paper, which can be found in [15].

7 Conclusion

The \(\Diamond \)ADD model has been studied in the past as a realistic, particularly weak communication model. A channel from a process p to a process q satisfies the ADD property if there are two integers K and D (unknown to the processes) and a finite time \(\tau \) (also unknown to the processes) such that, after \(\tau \), in any sequence of K consecutive messages sent by p to q, at least one message is delivered by q at most D time units after it has been sent. Assuming first that the correct processes are connected by a spanning tree made up of \(\Diamond \) ADD channels, this article has presented an algorithm that elects an eventual leader using messages of size only \(O(~\mathsf{{log}}~n)\). Previous algorithms in the \(\Diamond \)ADD model implemented an eventually perfect failure detector, with messages of size \(O(n ~\mathsf{{log}}~n)\). In addition, the article has presented a second eventual leader election algorithm in which no process initially knows the number of processes. This algorithm sends larger messages, so as to be able to estimate n, but only for a finite amount of time, after which the size of the messages is again \(O(~\mathsf{{log}}~n)\). We conjecture that it is necessary for the process identities to be repeatedly communicated to the potential leader. Although we proved that our algorithms elect a leader in time proportional to the diameter of the graph, many interesting questions related to performance remain open.