
1 Introduction

Access hot spots in global web services usually come from the same geographical region (e.g., a country, a state, or a time zone), and such hot spots may move frequently as a function of local time. For example, a global electronic trading system sees its hot spots move between countries because of different stock-market opening times. If such web services use servers in fixed locations, latency increases significantly as the hot spots migrate to locations far away from the working servers. Different applications have different tolerance for delay, but high latency generally leads to a decline in revenue.

Paxos-based protocols  [1, 2] are widely used in web services  [3, 9] and database systems  [10] to tolerate server failures and provide high availability. However, existing Paxos protocols do not adapt to frequently changing workloads; two problems lie ahead of these protocols:

  1. Agreements are reached by a majority, so the total number of nodes (denoted N) is tied to the number of tolerated failures (denoted F) by \(N = 2F+1\); a larger N therefore implies a larger F and larger majority quorums. This property limits the number of nodes that can be deployed in a Paxos instance and makes it impossible to place a server near every possible client.

  2. During data migration, usually only a specific part of the data needs to be migrated, but existing reconfiguration mechanisms transfer the entire state machine, which adds extra cost.

In this paper, we propose MPaxos to solve these problems. We propose the Working Cluster allocation mechanism: a Working Cluster is the set of nodes explicitly specified for one object, and a command can only be committed by a majority of the related Working Cluster. In this way, we decouple the quorum size (\(F+1\)) from the number of nodes in the cluster (N) by introducing W, the number of nodes in a Working Cluster, where \(W \le N\) and \(W = 2F + 1\), which solves problem 1. Furthermore, the Working Clusters of different objects work independently, so migrating one object does not affect reads and writes of other objects, which alleviates problem 2. Moreover, MPaxos provides a scheduling framework that automates Working Cluster selection and migration to maintain low latency.

We briefly introduce the relevant background on Paxos and other related work in Sect. 2. The design and implementation of MPaxos are presented in Sect. 3, and the scheduling framework is introduced in Sect. 4. Section 5 presents the performance evaluation, and the paper concludes in Sect. 6 with a summary.

2 Related Work

Leader-based Paxos protocols use a master to determine the execution order of all commands, while leaderless Paxos protocols usually do not order unrelated commands and only establish ordering constraints between commands on the same object. Egalitarian Paxos (EPaxos)  [4] is an efficient leaderless Paxos protocol that uses a fully decentralized approach to commit commands and handle interference: when choosing a command in a log entry, each participant attaches ordering constraints to that command, and agreement is reached when a majority accepts those constraints, so unrelated commands can be committed by different replicas. As a result, EPaxos naturally achieves high throughput.

The commit protocol of EPaxos is divided into three phases. When commands from different command leaders do not interfere, they can be committed on the Fast-Path (involving only Phases 1 and 3); otherwise, interfering commands are committed on the Slow-Path (involving Phases 1, 2, and 3). On the Slow-Path, each command proposed by a command leader requires replies from a majority of replicas, while on the Fast-Path \(F+\lfloor {\frac{F+1}{2}}\rfloor \) replies are needed to guarantee correctness, where \(F = \lfloor {\frac{N}{2}}\rfloor \) and N is the number of replicas.
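For concreteness, the quorum arithmetic above can be summarized in a few lines of Go; this is a minimal illustration of the formulas, not code taken from EPaxos or Paxi:

```go
package main

import "fmt"

// epaxosQuorums returns, for a cluster of n replicas, the number of
// tolerated failures F = floor(N/2), the Fast-Path quorum size
// F + floor((F+1)/2), and the Slow-Path quorum size F + 1.
func epaxosQuorums(n int) (f, fast, slow int) {
	f = n / 2
	fast = f + (f+1)/2
	slow = f + 1
	return
}

func main() {
	// With N = 9 replicas: F = 4, Fast-quorum = 6, Slow-quorum = 5.
	f, fast, slow := epaxosQuorums(9)
	fmt.Printf("F=%d fast=%d slow=%d\n", f, fast, slow)
}
```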

Numerous works have also been proposed for service migration and workload-burst handling. Some early works use VM live migration, in which the VMs containing the services are migrated between datacenters; this method has been widely used in practice  [7, 8]. These works adopt shared-disk technology for faster migration but suffer long latency when accessing the shared disk image. Another VM-based system, Supercloud  [6], proposed a Data Propagation method that implements the storage layer much like a state machine: modified blocks are transferred between datacenters to maintain a consistent storage view.

3 The Design of MPaxos

MPaxos is designed to achieve lower latency and higher throughput than existing Paxos-based protocols in wide-area deployments. High throughput is reached by implementing a decentralized command-committing method similar to EPaxos, as read and write requests can be distributed across different nodes. However, the EPaxos Fast-Path consists of roughly \(\frac{3}{4}\) of the nodes, which makes the commit latency even higher. MPaxos therefore makes the following changes to EPaxos: 1. Introduce the concept of a Working Cluster. 2. Introduce Reorganization, which changes the Working Cluster. 3. Let each object have its own Working Cluster.

Below we describe the components in MPaxos in more detail.

3.1 Working Cluster

The Working Cluster is a subset of the overall cluster, and a replica inside the Working Cluster is called a Working Replica. Commands must be committed by a Working Replica; a replica outside the Working Cluster that receives a command from a client should redirect it to a Working Replica.

Through the Working Cluster mechanism, we reduce the Fast-quorum size to \(F_{W} + \lfloor {\frac{F_{W}+1}{2}}\rfloor \) and the Slow-quorum size to \(F_{W} + 1\), where \(F_{W} = \lfloor {W/2}\rfloor \) is the number of failures tolerated by a Working Cluster of size W. We can therefore deploy many additional machines in the cluster while keeping the quorum size small.
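The decoupling can be illustrated with a small Go sketch (the function name is our own, purely for illustration): quorum sizes are now derived from W rather than from the total cluster size N.

```go
package mpaxos

// workingClusterQuorums sketches how MPaxos decouples quorum sizes from
// the total cluster size N: quorums are computed from the Working
// Cluster size W (W <= N) instead of from N.
func workingClusterQuorums(w int) (fw, fast, slow int) {
	fw = w / 2           // failures tolerated within the Working Cluster
	fast = fw + (fw+1)/2 // Fast-quorum size
	slow = fw + 1        // Slow-quorum size
	return
}

// Example: with N = 9 replicas but W = 3 working replicas,
// F_W = 1, Fast-quorum = 2, Slow-quorum = 2, regardless of N.
```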

3.2 Reconfiguration and Reorganization Algorithm

Changing the Working Cluster (i.e., Reorganization) is essentially a reconfiguration of the Working Cluster, except that the migration does not involve starting up or shutting down replicas. Reconfiguration can be divided into three steps: 1. stop the old state machine, 2. transfer the state, 3. start the new state machine. Below we describe steps 1 and 2 in detail; the third step is straightforward, so we do not discuss it here.

The general way to stop the old state machine is to submit a stop command (usually by committing a RECONFIG command); after the stop command, no other valid commands may appear in the old state machine (only NOP commands are permitted)  [5]. Due to the multi-leader style of MPaxos, multiple RECONFIG commands with possibly different contents may be committed at the same time. One solution is to modify the commit protocol to reject the older RECONFIG and use another round of communication to confirm the new one, but this could cause a livelock (different replicas alternately send new RECONFIG commands and none is ever confirmed). We therefore chose another way: allow multiple, potentially different, RECONFIG commands to be committed, but let only the earliest one take effect. To do this, two definitions need to be introduced:

Definition 1

RECONFIG commands conflict with each other.

Definition 2

RECONFIG commands conflict with read/write commands.

With Definition 1, concurrent RECONFIG commands establish an execution order among themselves. Definition 2 establishes an execution order between read/write commands and RECONFIG commands. A read/write command must be turned into a NOP when there is a RECONFIG command in its dependencies, which guarantees that no valid command follows the RECONFIG.
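The two definitions can be folded directly into the interference check used when choosing commands. The following Go sketch is illustrative only; the command types and the read/write rule (same object, at least one write) are our assumptions, not the exact MPaxos implementation:

```go
package mpaxos

// CmdType distinguishes ordinary read/write commands from RECONFIG
// commands in this sketch; the type and field names are illustrative.
type CmdType int

const (
	Read CmdType = iota
	Write
	Reconfig
)

type Command struct {
	Type CmdType
	Key  string // the object the command operates on
}

// interferes encodes Definitions 1 and 2: a RECONFIG command conflicts
// with every command (including other RECONFIGs). For read/write
// commands we use the usual EPaxos-style rule (same object, at least
// one write) purely for illustration.
func interferes(a, b Command) bool {
	if a.Type == Reconfig || b.Type == Reconfig {
		return true
	}
	return a.Key == b.Key && (a.Type == Write || b.Type == Write)
}
```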

We abstract the reconfiguration process of MPaxos into three states: NORMAL, RECONFIGURING, and TRANSFERRING. RECONFIGURING means that some replica has sent a RECONFIG command that is not yet committed; TRANSFERRING means that the RECONFIG command has been committed and the transfer of state is in progress. To detect when the transfer finishes, we define a TRANSFER-FINISH command and commit it in a majority of the new configuration; a receiver acknowledges this log entry only after its own transfer is complete, and once the command is committed the TRANSFERRING state ends. A replica cannot submit normal commands in the RECONFIGURING and TRANSFERRING states, while a RECONFIG command can be submitted at any time.
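The replica states and the proposal rule can be summarized with the following Go sketch (the names are illustrative and do not correspond to identifiers in our implementation):

```go
package mpaxos

// ReplicaState models the three reconfiguration states described above.
type ReplicaState int

const (
	Normal        ReplicaState = iota // regular operation
	Reconfiguring                     // a RECONFIG was proposed but not yet committed
	Transferring                      // RECONFIG committed, state transfer in progress
)

// canPropose sketches the rule above: normal commands may only be
// proposed in the NORMAL state, whereas a RECONFIG command may be
// proposed in any state.
func canPropose(s ReplicaState, isReconfig bool) bool {
	return isReconfig || s == Normal
}
```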

The reorganization process inherits the three steps and the three replica states from reconfiguration, and introduces one extra log type, REORGANIZE. Conceptually, however, reorganization merely alters the roles played by a set of replicas, so it has less impact on performance. Figure 1 shows the pseudocode of the protocol for choosing commands in MPaxos, and Fig. 2 shows the execution logic of the REORGANIZE command.

Fig. 1. The basic migratable Paxos protocol for choosing commands

Fig. 2. The execution and transfer phase of migratable Paxos

4 Scheduling Framework

In this section, we present the scheduling framework of MPaxos, which is responsible for making migration decisions. Suppose there are n replicas deployed in MPaxos: \(\{d_{1}, d_{2}, ..., d_{n}\}\). A Working Cluster placement plan for some object \(\theta \) with k nodes is denoted as \(P=\{p_{1}, p_{2}, ..., p_{k}\}\), where \(d_{p_{i}}\) is a replica in the Working Cluster. Periodically, MPaxos measures the end-to-end latency between replicas and stores the results in a matrix L, where L(i, j) is the round-trip time (RTT) from \(d_{i}\) to \(d_{j}\). The workload statistics are denoted as \(S=\{(r_{1}, w_{1}), (r_{2}, w_{2}),...,(r_{n}, w_{n})\}\), where \((r_{i}, w_{i})\) is the read and write workload on replica \(d_{i}\).

To compare placement plans, we define a function \(f(P, S, L)\) that evaluates a placement plan under a given workload:

$$f(P,S,L) = -\frac{\sum _{i=1}^{n} (\alpha \cdot C(P,i) \cdot r_{i} + \beta \cdot C(P,i) \cdot w_{i})}{n}$$

where \(C(P, i)\) is replica \(d_{i}\)'s commit latency under placement plan P, and \(\alpha \) and \(\beta \) are weights indicating the relative importance of read and write latency.

The evaluation of \(C(P, i)\) is split into two steps:

  1. Redirect latency. Send the command from \(d_{i}\) to the closest replica in P (denoted \(p_{L}\)); the latency is:

    $$C_{1} = {\left\{ \begin{array}{ll} L(i, p_{L}), & \text{if } i \notin P \\ 0, & \text{if } i \in P \\ \end{array}\right. }$$
  2. Commit latency. The command can be committed on the Fast-Path or the Slow-Path; we introduce a parameter e for the probability of taking the Fast-Path, so the expected latency is:

    $$C_{2} = e \cdot Fast(P, p_{L}) + (1-e) \cdot Slow(P, p_{L})$$

    The Fast-Path involves only one round trip within a Fast-quorum of working replicas. As a Fast-quorum contains \(F_{W} + \lfloor {\frac{F_{W}+1}{2}}\rfloor \) replicas (\(F_{W} = \lfloor {W/2}\rfloor \), with W denoting the cardinality of P), the network latency of a Fast-Path commit is roughly the third quartile of the latencies between \(p_{L}\) and the other replicas in P, that is:

    $$Fast(P, p_{L}) = 3rd~Quartile(\{L(p_{L}, p_{i}) \mid p_{i} \in P\})$$

    The Slow-Path involves two round trips within a Slow-quorum (i.e., a majority), so the network latency is twice the median of the latencies from \(p_{L}\) to the replicas in P (a Go sketch of the full cost computation follows this list).

    $$Slow(P, p_{L}) = 2 \cdot Median(\{L(p_{L}, p_{i}) \mid p_{i} \in P\})$$
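Under these definitions, a placement plan can be scored as sketched below in Go. The helper names and the interpolation used for the quartile and median are our own choices for illustration and do not correspond to a concrete API in the MPaxos codebase:

```go
package mpaxos

import (
	"math"
	"sort"
)

// Workload holds the read and write request rates observed at one replica.
type Workload struct{ R, W float64 }

// quantile returns the q-quantile (0..1) of xs with linear interpolation;
// the median is quantile(xs, 0.5) and the third quartile quantile(xs, 0.75).
func quantile(xs []float64, q float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	pos := q * float64(len(s)-1)
	lo, hi := int(math.Floor(pos)), int(math.Ceil(pos))
	frac := pos - float64(lo)
	return s[lo]*(1-frac) + s[hi]*frac
}

// closest returns the replica in P with the smallest RTT from replica i.
func closest(L [][]float64, P []int, i int) int {
	best := P[0]
	for _, p := range P[1:] {
		if L[i][p] < L[i][best] {
			best = p
		}
	}
	return best
}

// commitLatency evaluates C(P, i): the redirect latency C1 plus the
// expected commit latency C2 = e*Fast + (1-e)*Slow, where Fast is the
// third quartile and Slow twice the median of RTTs from p_L to P.
func commitLatency(L [][]float64, P []int, i int, e float64) float64 {
	pL := closest(L, P, i)
	c1 := L[i][pL]
	for _, p := range P {
		if p == i {
			c1 = 0 // d_i is itself a working replica
		}
	}
	rtts := make([]float64, 0, len(P))
	for _, p := range P {
		rtts = append(rtts, L[pL][p])
	}
	fast := quantile(rtts, 0.75)
	slow := 2 * quantile(rtts, 0.5)
	return c1 + e*fast + (1-e)*slow
}

// score evaluates f(P, S, L): the negated average weighted commit
// latency over all replicas, so higher scores mean better placements.
func score(L [][]float64, S []Workload, P []int, alpha, beta, e float64) float64 {
	var sum float64
	for i := range S {
		c := commitLatency(L, P, i, e)
		sum += alpha*c*S[i].R + beta*c*S[i].W
	}
	return -sum / float64(len(S))
}
```

A scheduler can then compare the scores of candidate placement plans and trigger a Reorganization when a plan with a clearly higher score is found.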

5 Evaluation

We implemented MPaxos on top of Paxi, a framework that implements EPaxos and other Paxos protocols, in order to evaluate and compare their performance. The implementation is in Go 1.11.2, and we release it as an open-source project on GitHub at https://github.com/dante159753/MPaxos.

We evaluated MPaxos on Amazon EC2, using small instances (two 64-bit virtual cores with 2 GB of memory) for both state machine replicas and clients, running Ubuntu Linux 18.04.2.

5.1 Workloads

We specify two types of workloads: (1) a static workload whose hot spots come from one or two continents; (2) a shifting workload with a request peak at 9:00 am local time, where the number of requests as a function of local time follows a normal distribution.
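As an illustration of the second workload, the request rate at each site can be modeled as a Gaussian over local time; the sketch below is our own approximation, and its parameters are not taken from the benchmark configuration:

```go
package mpaxos

import "math"

// requestRate sketches the shifting workload: the request rate at a
// site follows a normal distribution over local time, peaking at
// 9:00 am. peak is the maximum request rate and sigma (in hours)
// controls how quickly the load falls off; both are illustrative.
func requestRate(localHour, peak, sigma float64) float64 {
	d := localHour - 9.0
	return peak * math.Exp(-(d*d)/(2*sigma*sigma))
}
```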

Our tests also capture conflicts, an important workload characteristic: a conflict arises when potentially interfering commands reach replicas in different orders. Conflicts affect both EPaxos and MPaxos. As write requests usually account for no more than 2% of all requests, we believe that 0% and 2% command interference rates are the most realistic [4].

Fig. 3. Commit latency under static workloads

Fig. 4. Latency under regularly shifting workloads

5.2 Latency in Wide Area

We evaluate MPaxos with nine replicas (tolerating one failure, so the minimal size of a Working Cluster is 3). The replicas are located in Amazon EC2 datacenters in California (CA), Virginia (VA), Oregon (OR), Japan (JP), Korea (KR), Singapore (SG), London (LON), Paris (PAR), and Sweden (SE). We set \(F_{W}=1\) for MPaxos and \(F=0, f=1\) for WPaxos  [11].

Figure 3 shows the average client latency for MPaxos, EPaxos, Multi-Paxos, and WPaxos under static workloads, where WPaxos is a recent leader-based Paxos protocol optimized for migration scenarios. The x-axis indicates the positions of the clients. With nine replicas, protocols with static clusters, such as EPaxos and Multi-Paxos, produce high latency, while protocols with migratable clusters, such as MPaxos and WPaxos, achieve lower latency; MPaxos outperforms WPaxos because of its leaderless command-committing style. The last group of the test (RANDOM), in which requests come from random clients around the world, shows that MPaxos also works well under irregular workloads.

Figure 4 shows the average client latency of these protocols under the shifting workload. MPaxos again achieves the lowest commit latency by responding to the shifting workload in a timely manner, and WPaxos performs very close to MPaxos. Nevertheless, as shown in Fig. 5, MPaxos outperforms WPaxos in migration cost: although WPaxos needs only one round of communication during migration instead of the two rounds in MPaxos, it suffers from a larger Phase-1 quorum size (at least six replicas), whereas MPaxos only needs to communicate with two replicas twice.

Fig. 5. Migration cost between different continents

Fig. 6. Latency and workload when an emergency occurred in the 3rd second.

To evaluate how MPaxos and other protocols perform under an emergency, we initialize the workload by deploying 300 clients in NA, then simulate the emergency by shutting down all clients in NA and starting 300 new clients in AS at the 3rd second. Figure 6 shows how latency and throughput change: MPaxos consistently retains the lowest latency and highest throughput by responding to the changes in client characteristics in a timely manner. The latency of WPaxos is roughly as good as that of MPaxos, but its single-leader-per-object style limits throughput.

6 Conclusion

In this paper, we propose MPaxos, a Paxos-based protocol for shifting workloads. We show that designing specifically for client characteristics yields significant performance rewards. MPaxos includes two main proposals: (1) Working Clusters, which keep replication quorums small and close to users, and (2) Reorganization, which enables Working Clusters to respond to workload shifts in a timely manner with low migration cost. These proposals improve performance significantly, as we show on a real deployment across nine datacenters.