
1 Introduction

Distributed consensus is essential to building highly available systems: it allows a collection of machines to work as a coherent group that can tolerate the failure of some of its members. The Paxos [1, 2] and Raft [3] consensus algorithms are widely adopted in modern large distributed systems such as Megastore, Spanner, CockroachDB, OceanBase, and TiDB for data replication and fault tolerance. Many large-scale distributed systems like GFS, HDFS, and RAMCloud typically use a replicated state machine [4] such as Chubby, Boxwood, or ZooKeeper for activities including operation sequencing, coordination, leader election, and resource discovery.

Distributed systems place three main demands on consensus algorithms: (1) High throughput for replication inside a distributed system; (2) Low latency for replication across data centers; (3) High availability for services.

Typical consensus algorithms and replication protocols such as Multi-Paxos, Raft, Zab, and Viewstamped Replication share a common limitation: all clients communicate with a single leader server whenever it is available. When the leader fails, an additional consensus mechanism is required to elect a new leader. Leader leases [5] are usually used as a failure detector so that the other replicas can discover a failed leader in time.

However, leader-based consensus algorithms still suffer from several problems. For example, the leader bears a higher load and can easily become a bottleneck of the distributed system. To mitigate these problems, Megastore, Spanner, CockroachDB, OceanBase, and TiDB partition data into multiple overlapping consensus groups. Mencius [6] shares the leader load by distributing the leader responsibilities round-robin among the replicas.

Egalitarian Paxos (EPaxos) [7] abandons the leader and exploits commutativity in state machine commands. However, if concurrently proposed commands interfere with each other, EPaxos requires an additional round of communication. EPaxos therefore achieves high availability but cannot maintain low latency and high throughput when commands interfere.

To achieve low latency, high throughput, and high availability at the same time, we design Veca, a consensus algorithm for state machine replication. Veca requires no designated leader replica. Instead, all replicas can commit commands concurrently at any time, and in the normal case each command is committed after just one round of communication with a majority of replicas. Veca separates agreement from ordering and execution: all replicas commit commands concurrently without determining their order, but track their dependencies using vector clocks. A subsequent replay phase then assigns an order to the commands and executes them in that order. Commands are thus committed out of order yet executed in the same order by all replicas. A replica can take the initiative to learn the decision for an instance using a failure recovery protocol. Thanks to the leaderless design, systems built with Veca provide continuous service as long as more than half of the replicas are available. Veca thereby offers load balancing, high availability, and high performance.

The remainder of this paper is organized as follows. Section 2 provides Paxos background. Section 3 presents the intuition behind Veca. Section 4 describes the detailed design of the Veca consensus algorithm. Section 5 presents our implementation and evaluation of Veca. Section 6 concludes the paper.

2 Paxos Background

Paxos [1, 2] is the best-known consensus algorithm; it decides a value in two phases. When it receives a command from a client, a replica tries to become the leader of a new instance by creating a proposal identified by an increasing proposal id and sending Prepare messages to a quorum of acceptors (possibly including itself). If the proposal id is higher than that of any previously received proposal, the acceptor replies with a Promise message, promising to ignore all future proposals with a lower proposal id. If the acceptor has accepted a proposal at some point in the past, it must include that proposal in its response to the proposer. If the proposer receives Promise messages from a quorum of acceptors, it becomes the temporary leader of this instance and can run the second phase. If any acceptor had previously accepted a proposal, the proposer must set the value of its own proposal to the value associated with the highest proposal id reported by the acceptors; otherwise, the proposer may choose any value. The proposer then sends Propose messages to a quorum of acceptors with the chosen value. An acceptor that receives a Propose message must accept it if and only if it has not already promised a prepared proposal with a greater proposal id; in that case, it accepts the value and sends an Accept message to the proposer. When the proposer receives Accept messages from a quorum of acceptors, it commits the command locally and asynchronously notifies all other replicas and the client.
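
For concreteness, the following sketch (in Go, with hypothetical type and field names; Paxos itself prescribes no particular implementation) shows the acceptor-side rules of the two phases.

```go
// A minimal sketch of the Paxos acceptor rules; type and field names are
// hypothetical and chosen only for illustration.
package paxos

type Proposal struct {
	ID    int64       // proposal id; higher ids win
	Value interface{} // proposed command
}

type Acceptor struct {
	promisedID int64     // highest proposal id promised so far
	accepted   *Proposal // last accepted proposal, if any
}

// HandlePrepare implements phase 1: promise to ignore proposals with a
// lower id, and report any previously accepted proposal to the proposer.
func (a *Acceptor) HandlePrepare(id int64) (ok bool, prev *Proposal) {
	if id <= a.promisedID {
		return false, nil // already promised to an equal or higher id
	}
	a.promisedID = id
	return true, a.accepted
}

// HandlePropose implements phase 2: accept the value if and only if no
// promise to a higher proposal id has been made in the meantime.
func (a *Acceptor) HandlePropose(p Proposal) bool {
	if p.ID < a.promisedID {
		return false
	}
	a.promisedID = p.ID
	a.accepted = &p
	return true
}
```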

3 Veca Intuition

Veca is a consensus algorithm for state machine replication in which all replicas in the replicated state machine run their own instances of Paxos concurrently and independently. The commands produced by the instances led by the replicas are committed without determining their order; instead, their dependencies are tracked using vector clocks [8]. A subsequent replay phase then assigns to the commands an order that guarantees serializability and linearizability and executes them in that order.

An instance in Veca runs roughly as follows. Every replica maintains a vector clock. When a replica starts a new instance, it skips the prepare phase of Paxos and runs the accept phase directly: it increases its vector clock and sends Propose messages carrying that vector clock to at least a majority of replicas. When a replica receives a Propose message, it updates its vector clock with the received vector clock and replies with an Accept message carrying its own vector clock. When the proposer receives an Accept message, it updates its vector clock with the received vector clock. Once the proposer has received Accept messages from a majority of replicas, it commits the instance with its vector clock and sends Commit messages to all other replicas. When a replica receives a Commit message, it updates its vector clock with the received vector clock and commits the instance. Whenever a replica commits an instance, it starts a replay phase that assigns an order to the command produced by that instance and to the relevant commands, according to the dependencies tracked by the vector clocks, and finally executes them in that order.
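
The vector-clock bookkeeping above can be sketched as follows (a minimal sketch in Go, with hypothetical names; Veca does not prescribe a concrete data structure).

```go
// A minimal sketch of the vector-clock operations used by Veca instances.
package veca

// VClock maps a replica identifier to its latest known clock value.
type VClock map[int]int64

// Merge takes the element-wise maximum of this clock and other.
func (v VClock) Merge(other VClock) {
	for id, c := range other {
		if c > v[id] {
			v[id] = c
		}
	}
}

// Copy returns an independent copy of the vector clock.
func (v VClock) Copy() VClock {
	out := make(VClock, len(v))
	for id, c := range v {
		out[id] = c
	}
	return out
}

// Replica is the minimal replica state needed for this sketch.
type Replica struct {
	id    int
	clock VClock
}

// startInstance advances the replica's own clock entry and snapshots the
// clock that travels with the Propose messages (the proposed vector clock).
func (r *Replica) startInstance() VClock {
	r.clock[r.id]++
	return r.clock.Copy()
}

// onRemoteClock is the common step of the Propose/Accept/Commit handlers:
// merge the received vector clock into the local one.
func (r *Replica) onRemoteClock(received VClock) {
	r.clock.Merge(received)
}
```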

Figure 1 presents a simple example of how Veca works. A replicated state machine consists of replicas A, B, and C. Replicas A and C are running two instances concurrently. Command \(C_1\) did not discover command \(C_2\), but \(C_2\) discovered \(C_1\), which makes \(C_2\) depend on \(C_1\). Thus, \(C_1\) can be executed as soon as it is committed, but \(C_2\) must wait for the commit of \(C_1\) and be executed after \(C_1\).

Fig. 1. Veca message flow in a simple example.

4 Design

4.1 Assumptions

Non-Byzantine Failure. A replica may crash, or it may fail to respond to messages from other replicas indefinitely, but it cannot respond in a way that does not conform to the protocol.

Asynchronous Distributed System. Replicas are connected by a network. The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order.

Totally Ordered Replicas. Each replica has a unique identifier, and all identifiers can be totally ordered.

Majority Requirement. At least a majority of replicas are available.

4.2 Design Goals

Load Balance. No leader, all replicas can commit commands concurrently at any time.

Low Latency. Each command can be committed after one round of communication with a majority of replicas.

High Throughput. Many requests can be handled concurrently and efficiently.

High Availability. Systems can provide continuous service as long as more than half of replicas are available.

4.3 Guarantees

Non-triviality. Only proposed commands can be learned.

Durability. Once a command has been committed, it will remain so at any later time.

Safety. At most one command can be learned in an instance.

Liveness (with High Probability). If a command has been proposed in an instance, then eventually every replica will learn a command in the instance, as long as more than half of the replicas are available and messages eventually arrive before the timeout.

Consistency. All committed commands will be executed in the same order by all replicas.

Linearizability. If committed commands are serialized by clients, they will be executed in that serialized order. Furthermore, the correctness of Veca is provable.

4.4 The Veca Consensus Algorithm

Veca is an algorithm for managing the replication, ordering, and execution of commands inside a replicated state machine. The replicated state machine comprises \(N = 2F + 1\) replicas, where F is the maximum number of tolerated failures. For every replica R there is an unbounded sequence of numbered instances \((R.id, 1), (R.id, 2), (R.id, 3), \dots, (R.id, i), \dots\) that replica R owns, where R.id is the identifier of replica R and i is the incremental clock of replica R. The state of each replica includes all instances owned by every replica in the system. At most one command will be chosen in an instance. The order of instances is not pre-determined; instead, it is determined dynamically by the algorithm as commands are chosen.
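
The per-replica state implied by this description can be sketched as follows (hypothetical Go types continuing the sketch from Sect. 3, which defines VClock; the paper does not give concrete data structures).

```go
// InstanceID identifies instance (R.id, i): the i-th instance owned by
// replica R.
type InstanceID struct {
	ReplicaID int
	Seq       int64 // incremental clock of the owning replica
}

// Possible states of a command within an instance.
const (
	StateAccepted  = iota // saved locally, not yet known to be chosen
	StateCommitted        // chosen by a majority, order still undetermined
	StateReplaying        // currently being ordered by the replay algorithm
	StateReplayed         // ordered and executed
)

// Instance holds at most one chosen command together with the vector
// clocks that bound its logical lifetime.
type Instance struct {
	Command   []byte
	Ballot    int64  // per-instance ballot number
	Proposed  VClock // logical start time of the instance
	Committed VClock // logical end time of the instance
	State     int
}

// ReplicaState keeps all instances owned by every replica in the group.
type ReplicaState struct {
	Instances map[InstanceID]*Instance
}
```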

Veca comprises (1) the commit protocol for replicas to commit commands concurrently without determining their order but to track their dependencies; (2) the replay algorithm for replicas to assign an order to the commands according to their dependencies and to execute them in that order; and (3) the failure recovery protocol for replicas to take the initiative to learn the decision for an instance.

To access the service of a replicated state machine, a client sends a Request message to a replica of its choice. A Response message from that replica notifies the client of the execution result of the command. If a client times out waiting for the Response message after sending a Request message, it resends the Request message to another replica. When a replica receives a Request message from a client, it runs the commit protocol to commit a command. Then, the replica calls the replay algorithm to assign an order to the command and the relevant commands and executes them in that order. Finally, the replica obtains the execution result and replies to the client. If a replica times out waiting for the commit of a relevant command, it runs the failure recovery protocol to learn it.
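
The client side of this interaction can be sketched as follows (a minimal sketch with a hypothetical SendFunc transport abstraction; the paper does not specify a client API).

```go
// A minimal sketch of a Veca client; the transport is abstracted away.
package vecaclient

import (
	"errors"
	"time"
)

// SendFunc sends cmd to the replica at addr and returns the payload of
// its Response, or an error on timeout. (Hypothetical signature.)
type SendFunc func(addr string, cmd []byte, timeout time.Duration) ([]byte, error)

// Execute sends a Request to a replica of the client's choice and, if no
// Response arrives before the timeout, resends the Request to another
// replica, as described above.
func Execute(send SendFunc, cmd []byte, replicas []string, timeout time.Duration) ([]byte, error) {
	if len(replicas) == 0 {
		return nil, errors.New("no replicas configured")
	}
	for attempt := 0; attempt < len(replicas); attempt++ {
		target := replicas[attempt%len(replicas)]
		result, err := send(target, cmd, timeout)
		if err == nil {
			return result, nil // the Response carries the execution result
		}
		// Timed out or failed: try the next replica with the same Request.
	}
	return nil, errors.New("all replicas timed out")
}
```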

The Commit Protocol. The commit protocol allows replicas to commit commands concurrently without determining their order, while tracking their dependencies using vector clocks. Due to space limitations, the pseudocode is not listed here.

When a replica receives a Request message from a client, it becomes the leader of a new instance. The replica increases its vector clock. Then, it initializes a new command in a new instance and saves it in the instance. Finally, the replica sends Propose messages to all other replicas.

When a replica receives a Propose message, it first checks the ballot number: if the ballot number of the received command is smaller than the largest one previously received for the instance, the Propose message is ignored. Otherwise, the replica updates its own vector clock with the vector clock of the received command, saves the received command, and replies with an Accept message to the proposer.

When a replica receives an Accept message, it updates its own vector clock with the received vector clock. Once the replica has received at least \(\lfloor N/2 \rfloor\) Accept messages for the command, it updates the vector clock of the command, marks the state of the command as committed, and sends Commit messages to all other replicas. Finally, the replica calls the replay algorithm, waits for the command to be executed, obtains the execution result, and replies with a Response message to the client. A deterministic rule ensures that all replicas start replaying from the same command, namely the one proposed by the replica with the smallest identifier.

When a replica receives a Commit message, it updates its own vector clock with the vector clock of the received command. Then, it saves the received command. Finally, it calls the replay algorithm to execute commands.
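
Since the pseudocode is omitted, the following sketch outlines the three message handlers described above (simplified Go continuing the earlier sketches, which define VClock, InstanceID, Instance, ReplicaState, and the state constants; the n and acks fields are hypothetical, and networking, persistence, and the client reply are omitted).

```go
// replicaNode extends the earlier Replica sketch with the fields the
// commit protocol needs in this illustration.
type replicaNode struct {
	id    int
	n     int // group size N
	clock VClock
	state ReplicaState
	acks  map[InstanceID]int // Accept messages received per instance
}

// onPropose: ignore stale ballots, merge the proposed vector clock into
// the local clock, save the command, and return the clock to send back
// in the Accept message.
func (r *replicaNode) onPropose(id InstanceID, cmd *Instance) (VClock, bool) {
	if old := r.state.Instances[id]; old != nil && cmd.Ballot < old.Ballot {
		return nil, false // smaller ballot than previously seen: ignore
	}
	r.clock.Merge(cmd.Proposed)
	cmd.State = StateAccepted
	r.state.Instances[id] = cmd
	return r.clock.Copy(), true
}

// onAccept: merge the acceptor's clock; after floor(N/2) Accepts (plus
// the proposer's own copy, i.e. a majority) commit the instance.
func (r *replicaNode) onAccept(id InstanceID, acceptClock VClock) {
	r.clock.Merge(acceptClock)
	r.acks[id]++
	inst := r.state.Instances[id]
	if inst != nil && inst.State == StateAccepted && r.acks[id] >= r.n/2 {
		inst.Committed = r.clock.Copy() // committed vector clock
		inst.State = StateCommitted
		// Send Commit messages to all other replicas, then run the replay
		// algorithm and reply to the client (omitted here).
	}
}

// onCommit: adopt the committed vector clock and save the command; the
// replay algorithm is then run (omitted here).
func (r *replicaNode) onCommit(id InstanceID, cmd *Instance) {
	r.clock.Merge(cmd.Committed)
	cmd.State = StateCommitted
	r.state.Instances[id] = cmd
}
```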

As in Paxos, every message contains a ballot number to indicate message freshness. The difference is that the ballot number in Paxos is global, whereas in Veca every instance has its own independent ballot number. Replicas disregard messages whose ballot number is smaller than the largest they have seen for that instance. For correctness, ballot numbers used by different replicas must be distinct, so they include a replica identifier. A replica increases only the incremental number of the ballot when it initiates a new ballot. Each replica is the default leader of its own instances, so the ballot number is initialized to the replica identifier at the beginning of every instance.
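
One possible encoding of such ballot numbers packs the incremental number and the replica identifier into a single integer, so that plain integer comparison orders ballots first by round and then by replica identifier (the concrete encoding below, including the 16-bit identifier field, is only an assumption for illustration).

```go
// Ballot encodes (round, replicaID) in one integer for easy comparison.
type Ballot int64

const replicaBits = 16 // assumed upper bound on the replica identifier width

func makeBallot(round int64, replicaID int) Ballot {
	return Ballot(round<<replicaBits | int64(replicaID))
}

// initialBallot is the default ballot of an instance owned by replicaID
// (round 0), matching "initialized to the replica identifier".
func initialBallot(replicaID int) Ballot {
	return makeBallot(0, replicaID)
}

// nextBallot increases only the round while keeping this replica's
// identifier, as a replica does when it initiates a new ballot.
func nextBallot(b Ballot, replicaID int) Ballot {
	round := int64(b) >> replicaBits
	return makeBallot(round+1, replicaID)
}
```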

The Replay Algorithm. The replay algorithm assigns an order to the unordered commands according to their dependencies and executes them in that order. An instance starts at its proposed vector clock and ends at its committed vector clock. All instances that are concurrent with it during this period are captured by the proposed vector clock and the committed vector clock.

Figure 2 places the instances concurrent with instance (R, S) on a timeline. The proposed vector clock of instance (R, S) is \([(R_0, S_0), (R_1, S_1), \dots, (R, S), \dots, (R_{N-1}, S_{N-1})]\), and the committed vector clock of instance (R, S) is \([(R_0, T_0), (R_1, T_1), \dots, (R, T), \dots, (R_{N-1}, T_{N-1})]\). Note that S and T do not necessarily have the same value, because a replica can start its next instance before the previous one completes, and the next instance may run faster than the previous one. The two vector clocks not only record the logical start and end time of instance (R, S), but also record the instances that ran concurrently with it. The commands produced by those concurrent instances must be reordered according to their dependencies.
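
One plausible way to enumerate these concurrent instances from the two vector clocks is sketched below (a hypothetical helper reusing the VClock and InstanceID types from the earlier sketches; the exact range convention is an assumption for illustration).

```go
// concurrentInstances lists, for every other replica Q, the instances
// (Q, s) with proposed[Q] < s <= committed[Q]: one plausible reading of
// the instances that ran concurrently with the given one and must be
// ordered against it.
func concurrentInstances(self int, proposed, committed VClock) []InstanceID {
	var out []InstanceID
	for q, hi := range committed {
		if q == self {
			continue // self dependencies are handled separately
		}
		for s := proposed[q] + 1; s <= hi; s++ {
			out = append(out, InstanceID{ReplicaID: q, Seq: s})
		}
	}
	return out
}
```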

Fig. 2. The concurrent instances during the running of instance (R, S) on a timeline.

Figure 3 shows the directed acyclic graph (DAG) of the dependencies between command \(C(R, S)\) and its concurrent commands. Every command has the same dependencies with its concurrent commands. If we put all commands together, we obtain a complete dependency graph. The self dependencies are monotonic and cannot form a cycle. However, the concurrent dependencies are not monotonic and can form a cycle. As a result, the complete dependency graph is no longer a DAG but a directed cyclic graph (DCG). Fortunately, concurrent dependencies can only occur between commands produced by instances owned by different replicas. Thus, the cycles in the complete dependency graph can simply be broken using the order of the replicas.

Fig. 3. The directed acyclic graph (DAG) of the dependencies between command \(C(R, S)\) and its concurrent commands. The dark arrows signify self dependencies and the gray arrows signify concurrent dependencies.

The replay algorithm is a topological sort of a DCG. We extend the depth-first search algorithm for topological sorting: when a cycle in the DCG is detected, it is simply broken by the ascending order of the replicas. To execute a command, the replay algorithm first saves the state of the command and marks it as replaying. The first loop recursively replays the commands it has self dependencies on, and the next loop recursively replays the commands it has concurrent dependencies on, in ascending order of the replicas. Any failure during this process rolls back the state of the replaying command and returns false immediately. Finally, if there is no error and the command has been committed, the command is executed and its state is marked as replayed.
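
The structure of this extended depth-first search can be sketched as follows (simplified Go continuing the earlier sketches; selfDeps, concurrentDeps, and execute are assumed helpers, and the cycle-breaking rule is abbreviated to skipping back-edges while visiting dependencies in ascending replica order).

```go
// replay tries to execute the command of instance id after every command
// it depends on; it returns false when the command, or a dependency that
// cannot be broken, is still uncommitted or missing.
func (r *replicaNode) replay(id InstanceID) bool {
	inst := r.state.Instances[id]
	if inst == nil || inst.State == StateAccepted {
		return false // uncommitted or missing
	}
	switch inst.State {
	case StateReplayed:
		return true
	case StateReplaying:
		// Back-edge: a cycle of concurrent dependencies. Break it here;
		// the visiting order (ascending replica id) decides which of the
		// two commands is executed first.
		return true
	}
	inst.State = StateReplaying

	// Self dependencies: earlier instances of the same replica (acyclic).
	for _, dep := range r.selfDeps(id) {
		if !r.replay(dep) {
			inst.State = StateCommitted // roll back on failure
			return false
		}
	}
	// Concurrent dependencies, visited in ascending order of replica id.
	for _, dep := range r.concurrentDeps(id) {
		if !r.replay(dep) {
			inst.State = StateCommitted // roll back on failure
			return false
		}
	}
	r.execute(inst) // apply the command to the replicated state machine
	inst.State = StateReplayed
	return true
}
```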

The replay algorithm can fail to execute a command in the following cases: (1) the command has not been committed; (2) the command depends on an uncommitted or missing command and the dependency cannot be broken. If the dependency can be broken, the command can be executed without waiting for the dependent command to be committed. If a replica times out waiting for the commit of a command, it takes the initiative to recover the instance of that command with the failure recovery protocol.

The Failure Recovery Protocol. The failure recovery protocol allows a replica to take the initiative to learn the decision for an instance. If a replica times out waiting for the commit of an instance, it tries to take ownership of that instance by running the failure recovery protocol, at the end of which the replica either learns what command was proposed in the instance and finalizes committing it or, if no replica has seen a command, commits a no-op command to finalize the instance.

When a replica is going to recover an instance of a potentially failed replica, it increases the incremental number of the ballot number of this instance, concatenates the increased number with its own identifier to generate a new ballot number, and sends Prepare messages to all replicas (including itself) with the new ballot number.

When a replica receives a Prepare message, it ignores the message if the ballot number in the Prepare message is not larger than the largest ballot number previously received for the instance. Otherwise, the replica updates the ballot number of the instance with the received ballot number and replies with a Promise message, promising to ignore all future Prepare messages with a smaller or equal ballot number and all Propose messages with a smaller ballot number in this instance.

When a replica receives a Promise message and has collected at least \(\lfloor N/2 \rfloor + 1\) Promise messages for the instance, there are four situations, decided by the set of replies with the highest ballot number: (1) if the set of replies contains a command whose state is committed, which indicates that the instance is committed, the replica simply saves the command and sends Commit messages to all other replicas; (2) if the set of replies contains at least \(\lfloor N/2 \rfloor + 1\) commands whose state is accepted, which indicates that the instance could have been committed, the replica updates the vector clock of the command with all the vector clocks of the accepted commands, marks the state of the command as committed, and sends Commit messages to all other replicas; (3) if the set of replies contains a command whose state is accepted, which indicates that the command has not yet been replicated to a majority of replicas, the replica saves the command locally and sends Propose messages to all other replicas to restart the commit protocol; (4) if none of the above applies, the replica initializes a no-op command and sends Propose messages to all other replicas to start the commit protocol.
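
The decision taken after collecting the Promise replies can be sketched as follows (simplified Go continuing the earlier sketches; replies are assumed to be pre-filtered to the highest ballot number, the clock-merging rule in case (2) is one plausible reading, and broadcastCommit, broadcastPropose, and newNoop are assumed helpers).

```go
// decideRecovery is run once floor(N/2)+1 Promise replies have arrived.
// A nil entry means that acceptor has never seen a command for the instance.
func (r *replicaNode) decideRecovery(id InstanceID, replies []*Instance) {
	var committed *Instance
	var accepted []*Instance
	for _, cmd := range replies {
		switch {
		case cmd == nil:
			// nothing reported by this acceptor
		case cmd.State == StateCommitted:
			committed = cmd
		default:
			accepted = append(accepted, cmd)
		}
	}
	switch {
	case committed != nil:
		// (1) Already committed somewhere: adopt it and re-broadcast Commit.
		r.state.Instances[id] = committed
		r.broadcastCommit(id)
	case len(accepted) >= r.n/2+1:
		// (2) Accepted by a majority: merge the vector clocks of the
		// accepted commands, mark it committed, and broadcast Commit.
		chosen := accepted[0]
		chosen.Committed = chosen.Proposed.Copy()
		for _, c := range accepted[1:] {
			chosen.Committed.Merge(c.Proposed)
		}
		chosen.State = StateCommitted
		r.state.Instances[id] = chosen
		r.broadcastCommit(id)
	case len(accepted) >= 1:
		// (3) Seen, but not yet on a majority: restart the commit protocol.
		r.state.Instances[id] = accepted[0]
		r.broadcastPropose(id)
	default:
		// (4) Nobody has seen a command: finalize the instance with a no-op.
		r.state.Instances[id] = r.newNoop(id)
		r.broadcastPropose(id)
	}
}
```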

Veca allows multiple replicas to run the failure recovery protocol concurrently for the same instance. As in Paxos, if multiple replicas propose commands at almost the same time for the same instance, no command may be accepted by a majority of replicas, and the instance fails again. When this happens, each replica times out again and restarts the failure recovery protocol with a larger ballot number. Without extra measures, this can lead to a livelock in which failed recoveries repeat indefinitely. Like leader election in Raft, Veca uses randomized timeouts to ensure that such collisions are rare and resolved quickly.
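
A minimal sketch of such a randomized timeout is given below (the bounds are an assumption; the paper gives no concrete values).

```go
// A randomized recovery timeout in the spirit of Raft's election timeouts.
package veca

import (
	"math/rand"
	"time"
)

// recoveryTimeout returns a timeout drawn uniformly from [base, 2*base)
// so that concurrent recoveries for the same instance rarely collide.
func recoveryTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}
```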

5 Evaluation

We evaluated Veca on a LAN and a WAN using three replicas (tolerating one failure) and five replicas (tolerating two failures). The LAN replicas are located in our laboratory, connected by Gigabit Ethernet and running Windows 10. The WAN replicas are located in Amazon EC2 datacenters in California (CA), Virginia (VA), and Ireland (EU), plus Oregon (OR) and Japan (JP) for the five-replica experiments, running Amazon Linux AMI 2017.03. Veca, EPaxos, and Multi-Paxos are implemented with the pipelining optimization, while Raft and Mencius do not permit this optimization.

5.1 Latency in WAN

We validate that Veca has low latency in the WAN using three replicas. In the latency experiment, ten clients are co-located with each replica. They generate requests simultaneously and measure the latency of each request. In Veca, Mencius, and EPaxos, clients send requests to their local replica, while in Multi-Paxos and Raft, clients send requests to the leader. Figure 4 shows the average and 99th percentile latency for Veca, EPaxos, Mencius, Multi-Paxos, and Raft in the WAN using three replicas.

Fig. 4. Average latency (99th percentile indicated by lines atop the bars) at each of the 3 replicas in the WAN. The Multi-Paxos and Raft leader is in CA.

Veca has lower latency than EPaxos. Mencius performs relatively well in the balanced experiment, in which all replicas receive request messages at the same aggregate rate. However, Mencius experiences latency corresponding to the round-trip time to the replica farthest from the client, which adds latency compared to Veca. Multi-Paxos and Raft have high latency at non-leader replicas. Multi-Paxos can be implemented with the pipelining optimization while Raft cannot, which gives Multi-Paxos lower latency than Raft.

5.2 Throughput in LAN

We validate that Veca has higher throughput than EPaxos, Mencius, Multi-Paxos, and Raft in the LAN. A client on a separate server sends requests in an open loop and measures the rate at which it receives replies. For Veca, EPaxos, and Mencius, the client sends each request to a replica chosen uniformly at random. Figure 5 shows the throughput of Veca, EPaxos, Mencius, Multi-Paxos, and Raft in the LAN using three replicas and five replicas.

Fig. 5. Throughput in the LAN for 3 replicas (left) and 5 replicas (right), with 95% confidence intervals.

Veca has higher throughput than EPaxos, and the improvement is even larger with five replicas. Mencius cannot compete with Veca in this experiment because it introduces considerable overhead for synchronizing and coordinating the pre-partitioned instances, and the impact grows under high concurrency. Besides, rotating the leader for every instance makes the replicated state machine run at the speed of the slowest replica, which also reduces throughput.

Multi-Paxos and Raft have low throughput because the leader becomes bottlenecked by its CPU and network. Multi-Paxos and Raft cannot achieve load balance among replicas since the leader must process more messages than other replicas.

5.3 Availability Under Failures

Figure 6 shows the evolution of throughput in a replicated state machine with three replicas in the LAN when one replica fails. For Veca, EPaxos, and Mencius, the failed replica is an arbitrary one in the system; for Multi-Paxos and Raft, the failed replica is the leader. A client sends requests at the same rate of approximately 10,000 requests per second for every system.

Fig. 6. Throughput when one of three replicas fails. For Multi-Paxos and Raft, the leader fails.

Veca and EPaxos are barely affected by the failure. In Multi-Paxos, the new leader may take a long time to recover the large number of instances left behind. In Raft, the recovery time is reduced by the safety property of its leader election. The failure of a non-leader replica in Multi-Paxos and Raft does not affect availability.

In contrast, any replica failure disrupts Mencius. Every replica in Mencius must hear from all other replicas before committing a command. If any replica fails to respond, no other replica can make progress until the failure is detected and another replica commits no-ops on behalf of the possibly failed replica. At this point, the delayed commands are committed, which causes the throughput spike depicted. Live replicas commit no-ops periodically until the failed replica recovers, or until a membership reconfiguration.

6 Conclusion

We have presented the design of Veca, a high-performance consensus algorithm for state machine replication. With its efficient, leaderless design, Veca achieves load balance, low latency, high throughput, and high availability at the same time, which is approximately optimal among existing consensus algorithms. Veca has important theoretical and practical benefits for state machine replication in both LAN and WAN settings.