1 Introduction

Caches such as memcached [32], Redis [35], Ignite [15], KOSAR [17], and others improve the performance of traditional database management systems with workloads that exhibit a high read-to-write ratio [6, 7, 40]. A caching layer may consist of tens of servers for a small installation and thousands of servers for a popular site such as Facebook [33].

A physical server with many cores may host several Cache Manager Instances, CMIs. Each CMI is a process that might be multi-threaded. It is assigned a fixed number of cores and some amount of memory. It is also assigned a fraction of the cache entries, termed a fragment; multiple fragments may be assigned to one CMI for load balancing. A load balancer may consider factors such as the imposed load and cache hit rate to adjust the assignment of fragments to CMIs, enhancing a performance metric such as system throughput [1, 38].

A configuration is an assignment of fragments to CMIs. A coordinator manages configuration changes. A configuration changes due to: (1) addition or removal of CMIs by an auto-scaling component (or a system administrator), (2) re-assignment of fragments to CMIs by a load-balancer, (3) re-assignment of fragments to CMIs in the presence of network partitions, (4) re-partitioning of data across fragments by a re-organization component in the form of either increasing or decreasing the number of fragments, or (5) a combination of these.

Configuration changes must preserve the application’s read-after-write consistency, defined as a read of a cache entry observing the value produced by the last committed write of that entry [31]. Configuration changes may compromise read-after-write consistency for two reasons. First, during the window of time in which the coordinator publishes a new configuration, a few clients may have the old configuration while others have the new one. This discrepancy may cause two or more clients that reference the same cache entry to contact different CMIs, observing different values. If they write this cache entry then they generate different values in different CMIs. The value observed by a subsequent read depends on whether this read is issued using the old or the new configuration, potentially compromising read-after-write consistency and the correctness of an application. Second, a configuration change may re-assign a fragment to a new destination CMI without physically deleting its cache entries from the source CMI, leaving these entries to become cold at the source CMI and to be evicted by its cache replacement technique. In the presence of updates and a subsequent configuration change that assigns the fragment back to the source, the application may observe stale cache entries. To elaborate, consider a system that migrates F\(_k\) from CMI\(_i\) to CMI\(_j\) without deleting F\(_k\)’s cache entries stored at CMI\(_i\). Should the application update F\(_k\)’s cache entries assigned to CMI\(_j\), the values of their replicas on CMI\(_i\) become stale. If a subsequent configuration change assigns F\(_k\) back to CMI\(_i\), references to these cold cache entries observe stale values. This violates read-after-write consistency.

An ideal solution to these two challenges should (1) be pauseless by processing user requests in the presence of configuration changes, (2) provide read-after-write consistency, guaranteeing that data produced by a committed write is observed by all subsequent reads, (3) be agnostic to the number of clients, and (4) preserve as many valid keys as possible in the presence of configuration changes. To illustrate the last requirement, with \(v\) fragments assigned to CMI\(_i\), if a configuration change assigns one fragment to a different CMI then the keys of the remaining \(v-1\) fragments of CMI\(_i\) should remain valid.

The primary contribution of this paper is Rejig [16], a scalable online algorithm that satisfies the above requirements. Rejig extends existing systems [21, 33]. While it allows multiple clients to use different configurations, it guarantees consensus on a fragment’s replica by requiring clients that reference the fragment to always use its latest CMI assignment. Moreover, it allows cache entries of an old replica of a re-assigned fragment to become cold and be evicted by the CMI’s replacement technique. For those entries that remain, Rejig detects when the application references them and treats them as cache misses. Rejig may be configured to delete these cache entries to free the CMI’s memory.

Fig. 1. A candidate architecture for Rejig. Concepts underlying Rejig are applicable to those architectures that tightly integrate the cache with the application [17] or represent the caching layer as a middleware [15, 28, 34].

Rejig is designed for clients and CMIs deployed in a data center. Rejig does not require the coordinator to propagate a new configuration to all CMIs. It employs clients to perform this task on demand, making it appropriate for other deployments as well. For example, it may be used with trusted mobile clients that have intermittent network connectivity to the caching layer. It also functions with CMIs deployed across two or more geographically distributed data centers [3]. These deployments are possible because Rejig satisfies the following properties. First, CMIs are passive entities that respond to requests. Second, clients do not transmit a configuration to one another directly. They use CMIs and the coordinator intelligently to obtain the latest configuration on demand, i.e., once a Rejig client detects that its request was issued using an old configuration.

Fig. 2. A configuration with two partitioning techniques.

The software architecture of Rejig hides its implementation details. Its transparency frees software developers to focus on their application and its requirements (instead of configuration changes and how to manage impacted fragments). Central to its design is a monotonically increasing, globally unique identifier for each configuration published by the coordinator, GlobalCfgID. Each inserted or updated cache entry is tagged with the current GlobalCfgID. Moreover, each client must piggyback the GlobalCfgID of its cached configuration with every request it issues to a CMI. Hence, Rejig imposes two types of overhead. First, it requires 4 extra bytes of memory per cache entry to store GlobalCfgID. Second, a client request consumes 4 extra bytes of the available network bandwidth per request to transmit the GlobalCfgID of its configuration.

The rest of this paper is organized as follows. Section 2 details Rejig’s design. Section 3 presents an implementation of this design. We evaluate this implementation using microbenchmarks and traces from Azure [10] and WorldCup ’98 [4] in Sect. 4. Section 5 contains related work. Brief conclusions are presented in Sect. 6.

2 General Design of Rejig

Rejig’s software architecture consists of cache manager software, a coordinator responsible for maintaining the configuration, and a client component used by the application to issue requests. Figure 1 shows an example deployment of this software architecture. A cache server hosts one or more Rejig cache manager instances, CMIs. Each CMI is identified using the combination of its IP and port number, see Fig. 2. The coordinator represents a configuration as F fragments assigned to different CMIs. We detail a space efficient implementation of a configuration in Appendix B.

Rejig supports hash and range partitioning techniques to shard the application’s data across CMIs. With the former, a hash function is used to identify the fragment containing a referenced key K. With the latter, the configuration maintains the range of values assigned to each fragment. It performs a binary search to identify the fragment containing K. Rejig also supports a hybrid partitioning technique [1] that applies a hash function to map K to an order preserving space that is range partitioned across CMIs. We describe how Rejig’s coordinator increases the number of fragments F with the hash partitioning strategy in Appendix B.1.
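The lookup logic implied by these partitioning techniques can be summarized in a short sketch. This is a minimal illustration in Java, assuming a TreeMap holds the range boundaries; the class and method names are ours, not Rejig’s actual API.

```java
import java.util.TreeMap;

// Identifies the fragment containing a referenced key K under the hash and
// range partitioning techniques described above. Illustrative names only.
public class FragmentLookup {
    private final int numFragments;              // F
    private final TreeMap<Long, Integer> ranges; // range start -> fragment id

    public FragmentLookup(int numFragments, TreeMap<Long, Integer> ranges) {
        this.numFragments = numFragments;
        this.ranges = ranges;
    }

    // Hash partitioning: a hash function maps K to one of F fragments.
    public int hashFragment(String key) {
        return Math.floorMod(key.hashCode(), numFragments);
    }

    // Range partitioning: a binary search (TreeMap.floorEntry) over the range
    // boundaries maintained in the configuration. Assumes the first range
    // starts at Long.MIN_VALUE so every key falls in some range.
    public int rangeFragment(long key) {
        return ranges.floorEntry(key).getValue();
    }
}
```

The hybrid technique composes the two: hash K into the order preserving space, then perform the range lookup.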

Table 1. Terms and their Definitions.

Rejig’s coordinator uses a monotonically increasing integer, GlobalCfgID, to identify a configuration. Rejig’s client and server components maintain their latest value of GlobalCfgID. Rejig’s client component caches a local copy of the latest configuration for efficient processing of requests. When the coordinator computes a new configuration, it increments GlobalCfgID. For each impacted fragment, the coordinator informs either a subset or all impacted CMIs of the new GlobalCfgID and inserts the corresponding configuration in these CMIs. As an example of updating a subset of the impacted CMIs, consider a scenario that migrates a fragment from a source CMI to a destination CMI. It is sufficient for Rejig to update the source CMI with the new GlobalCfgID and insert the latest configuration in this CMI only. As an example of updating all impacted CMIs, when a CMI is removed, its fragments are assigned to different CMIs. The coordinator updates the GlobalCfgID of these CMIs and inserts the new configuration in all (Table 1).

Rejig’s clients and CMIs use a distributed collaborative algorithm to update their GlobalCfgID value and cached configuration. Section 2.1 details this algorithm.

Rejig’s coordinator implements a re-organization algorithm that changes the assignment of fragments to CMIs in response to an evolving workload. The efficiency of a configuration change and how it re-organizes fragments’ assignment is the responsibility of this algorithm (and not Rejig). At any given time, there is one active coordinator. However, there may be multiple standby coordinators, each of which is prepared to take over if the active coordinator crashes. The active coordinator stores the latest configuration and its GlobalCfgID on an external storage system that is highly available (such as ZooKeeper [26]). The standby coordinators use the external storage system to detect failure of the active coordinator, select a new active coordinator, and recover the configuration and its GlobalCfgID.

Table 2. Rejig and its variants.

With each configuration change, the coordinator maintains, for each fragment F\(_k\) impacted by that change, the value of GlobalCfgID as FragCfgID\(_k\). A fragment’s FragCfgID\(_k\) is initialized to the GlobalCfgID value that created it. A given fragment is impacted by one configuration change at a time; however, a single configuration change may impact several fragments. For example, removal of a cache server in Fig. 2 impacts multiple CMIs and their assigned fragments. Thus, at any instant in time, different fragments may have different FragCfgID\(_k\) values, identifying the GlobalCfgID value that last changed their assignment to a CMI.
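To make the relationship between GlobalCfgID and FragCfgID\(_k\) concrete, the following sketch shows the state a configuration might carry (per Table 1 and Fig. 2); the field and class names are our assumptions.

```java
import java.util.List;

// A minimal sketch of the configuration state published by the coordinator.
// Each fragment records the GlobalCfgID value (FragCfgID) that last changed
// its assignment, plus the address of its CMI. Names are illustrative.
class FragmentEntry {
    int fragCfgId;  // GlobalCfgID value that last re-assigned this fragment
    String cmiIp;   // IP of the CMI hosting the fragment
    int cmiPort;    // port number of that CMI
}

class Configuration {
    int globalCfgId;               // monotonically increasing configuration id
    List<FragmentEntry> fragments; // indexed by fragment id, F entries
}
```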

Example 2.1

In Fig. 2(b), F\(_j\)’s FragCfgID is 765. Assume the current GlobalCfgID value is also 765, GlobalCfgID = 765. If the coordinator assigns a different fragment, say F\(_0\), to a different CMI then it increments GlobalCfgID by one, GlobalCfgID = 766. It produces a new configuration with F\(_0\)’s FragCfgID set to 766 along with the IP and port number of its newly assigned CMI. The FragCfgIDs of other fragments, including F\(_j\)’s, remain unchanged.\(\blacksquare \)

2.1 Processing Get Requests

Algorithms 1 and 2 provide Rejig’s protocol to process the get command by a client and a CMI, respectively. Appendix A provides a formal proof of this protocol. A client piggybacks its value of GlobalCfgID with every request it issues to a CMI\(_i\), see line 3 of Algorithm 1. A cache manager instance, CMI\(_i\), compares its latest known GlobalCfgID with the one provided by the client. There are three possibilities, see lines 1 to 8 of Algorithm 2: either the two are equal, CMI\(_i\)’s GlobalCfgID is greater than the client’s GlobalCfgID, or CMI\(_i\)’s GlobalCfgID is less than the client’s GlobalCfgID. We consider each in turn.

Algorithm 1. Rejig’s client-side protocol for processing a get request.

\(\mathbf {CMI}_i\)’s GlobalCfgID Equals Client Provided GlobalCfgID: If CMI\(_i\) has the value associated with the referenced key then it returns this value along with its configuration id, V\(_\mathrm {cid}\). V\(_\mathrm {cid}\) identifies the configuration in which this cache entry was either inserted or updated in CMI\(_i\) (line 10 in Algorithm 2). The client compares V\(_\mathrm {cid}\) with its assigned fragment’s FragCfgID\(_k\). If V\(_\mathrm {cid}\) is greater than or equal to FragCfgID\(_k\) then the value is valid and the client provides it to the application. Otherwise, the value was created in an older configuration that mapped this fragment to the same CMI and may be stale. Hence, the client discards the value and reports a cache miss. One may configure Rejig clients to delete this cache entry, freeing the available cache space of the CMI, see lines 19 to 24 in Algorithm 1.
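A sketch of this client-side validity check follows; it is a simplification of lines 19 to 24 of Algorithm 1, under assumed class and field names.

```java
// Client-side validity check on a cache hit: compare the V_cid returned with
// the value against FragCfgID_k of the key's fragment in the client's cached
// configuration. Names are illustrative assumptions.
class CacheResult {
    byte[] value;
    int vCid; // configuration id stored with the entry when inserted/updated
}

class RejigClientCheck {
    // Returns the value if valid, or null to report a cache miss.
    byte[] validate(CacheResult result, int fragCfgId, boolean deleteStale) {
        if (result == null) return null;                    // regular cache miss
        if (result.vCid >= fragCfgId) return result.value;  // valid value
        // The value was written under an older assignment of this fragment to
        // the same CMI and may be stale: treat it as a miss.
        if (deleteStale) {
            // Optionally issue delete(K) to the CMI to free its cache space.
        }
        return null;
    }
}
```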

This algorithm may incorrectly identify a value as stale if it was not updated while its fragment F\(_k\) was assigned to some other CMIs. The likelihood of this false negative is a function of the popularity of the cache entry, the mix of reads and writes in the workload, and how long F\(_k\) was mapped to a different CMI before being re-assigned. We quantify this in Sect. 4.2.

CMI\(_i\)’s GlobalCfgID is Greater than the Client’s GlobalCfgID: This condition is satisfied when the client’s cached configuration is old and CMI\(_i\) has been provided with a more recent configuration. CMI\(_i\) returns a “Refresh & Retry” response to the client, see line 2 in Algorithm 2. In response, the client fetches the latest configuration from CMI\(_i\) and retries its request. If CMI\(_i\) has evicted the latest configuration (i.e., reports a cache miss), the client contacts the coordinator for the latest configuration. Subsequently, it retries its request, see lines 4 to 14 in Algorithm 1. It is possible to reduce the number of roundtrips by requiring CMI\(_i\) to piggyback the new configuration (assuming CMI\(_i\) has it) with its “Refresh & Retry” response.

Rejig clients disseminate the latest configuration to one another using the CMIs. A client that fetches a configuration inserts it into the next \(\alpha \) unique CMIs that it contacts to process a request, piggybacking its GlobalCfgID value along with each insert, see line 17 in Algorithm 1. A CMI ignores this insertion when its GlobalCfgID is greater, i.e., the configuration changed and this CMI has a more recent configuration.
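A minimal sketch of this dissemination step, assuming a client-side helper (the names are ours):

```java
import java.util.HashSet;
import java.util.Set;

// After fetching the latest configuration, a client inserts it into the next
// ALPHA unique CMIs it contacts (line 17 of Algorithm 1). A CMI drops the
// insert when its own GlobalCfgID is more recent. Illustrative names only.
class ConfigDisseminator {
    static final int ALPHA = 4; // tunable; its impact is quantified in Sect. 4.1
    private final Set<String> insertedCmis = new HashSet<>();

    // Invoked for every CMI the client contacts while processing requests.
    void maybeInsert(String cmiAddress, int globalCfgId, byte[] configuration) {
        if (insertedCmis.size() >= ALPHA) return;  // quota exhausted
        if (!insertedCmis.add(cmiAddress)) return; // already inserted here
        // insert(configKey, configuration), piggybacking globalCfgId; the CMI
        // ignores the insertion if its GlobalCfgID is greater.
    }
}
```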

Algorithm 2. Rejig’s CMI-side protocol for processing a get request.

There are other variations of this dissemination technique [12]. We consider two variants named Rejig\(^C\) and Rejig\(^T\), see Table 2. With Rejig\(^C\), a client inserts its known configuration into the next \(\alpha \) unique CMIs only if it fetches the latest configuration from the coordinator. With Rejig\(^T\), a client stops inserting once a CMI reports that it has the latest configuration. We quantify the tradeoffs associated with Rejig, Rejig\(^C\), and Rejig\(^T\) in Sect. 4.1.

CMI\(_i\)’s GlobalCfgID is Less than the Client’s GlobalCfgID: CMI\(_i\) deletes its known configuration and sets its GlobalCfgID with the one provided by the client, see lines 5 to 6 in Algorithm 2. The client may insert its known configuration in CMI\(_i\), see line 17 of Algorithm 1.

When multiple threads of a client receive “Refresh & Retry”, only one thread obtains the latest configuration from CMI\(_i\) while other threads wait. Once this thread obtains the configuration, all threads use it to retry their requests. Threads referencing CMIs other than CMI\(_i\) are not blocked.

In sum, Rejig employs clients (instead of the coordinator) to propagate a new configuration to other CMIs on demand. This prevents the coordinator from becoming a bottleneck, realizing a scalable Rejig protocol.

Example 2.2

Consider the range partitioned configuration of Fig. 2(b). Cache entry K\(_i\) is assigned to Fragment F\(_j\). F\(_j\) is assigned to CMI 1.0. Assuming GlobalCfgID = 765, a write of K\(_i\) sets its V\(_\mathrm {cid}\) to 765. Assume a configuration change that assigns F\(_j\) to the CMI hosting F\(_0\), CMI 0.5. This results in the following changes: GlobalCfgID is set to 766, F\(_j\)’s FragCfgID is set to 766, and F\(_j\)’s CMI IP and port number are set to those of CMI 0.5. Another write of K\(_i\) is directed to CMI 0.5, creating this cache entry and setting its V\(_\mathrm {cid}\) to 766. Now, the copy of K\(_i\) on CMI 1.0 is stale. Should another configuration change assign F\(_j\) back to CMI 1.0, the GlobalCfgID is incremented by one, GlobalCfgID = 767. Moreover, F\(_j\)’s FragCfgID is also set to 767 and its CMI IP and port number are set to those of CMI 1.0. A reference for K\(_i\) may observe the stale version. This version’s V\(_\mathrm {cid}\) (765) is lower than F\(_j\)’s FragCfgID (767). Hence, Rejig discards this version and reports a cache miss.    \(\blacksquare \)

2.2 Read-After-Write Consistency

Intuitively, Rejig provides read-after-write consistency for several reasons. First, a CMI impacted by a configuration change does not process a request by a client that does not have a GlobalCfgID pertaining to either this configuration change or a more recent one. Second, a client must update its configuration when a CMI provides a “Refresh & Retry” response to the client request, increasing the response time of this request. This increase is dictated by the amount of time required for the CMI to transmit the new configuration to the client. Third, a CMI always updates its GlobalCfgID to the latest GlobalCfgID provided by a client or the coordinator. Fourth, a cache entry that observes a hit is valid only if its configuration id V\(_\mathrm {cid}\) is more recent than the configuration that changed its fragment’s assignment. The latter ensures replicated cache entries from a previous configuration (that have not yet been evicted and are potentially stale) are discarded. See Appendix A for a formal proof.

2.3 CMI Discarding Stale Cache Entries

Thus far, a Rejig client discards cache entries identified as potentially stale. Alternatively, a CMI may report a miss instead of transmitting a potentially stale cache entry to a client. To realize this, Rejig’s client component is extended to provide both FragCfgID\(_k\) and GlobalCfgID with every read. After a CMI verifies that a client provided GlobalCfgID is the latest, it must verify that the V\(_\mathrm {cid}\) of the referenced cache entry is greater than or equal to FragCfgID\(_k\). If this is the case then it provides the cache entry to the client. Otherwise, it reports a cache miss and deletes this entry. This requires extra processing by a CMI and incurs the overhead of transmitting FragCfgID\(_k\) with every read request. However, if configuration changes are the norm and discarded cache entries are large in size then this approach may save network bandwidth.
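A sketch of this CMI-side check, under assumed names:

```java
import java.util.HashMap;
import java.util.Map;

// The CMI compares the entry's V_cid with the client-provided FragCfgID_k and
// reports a miss (deleting the entry) when the entry is potentially stale.
// Entry layout and names are illustrative assumptions.
class StaleAwareCmi {
    static class Entry { byte[] value; int vCid; }

    private final Map<String, Entry> store = new HashMap<>();

    // Assumes the client-provided GlobalCfgID was already verified as the
    // latest (the "Refresh & Retry" path of Algorithm 2 is handled elsewhere).
    byte[] get(String key, int fragCfgId) {
        Entry e = store.get(key);
        if (e == null) return null;              // regular cache miss
        if (e.vCid >= fragCfgId) return e.value; // valid entry
        store.remove(key);                       // potentially stale: delete it
        return null;                             // and report a cache miss
    }
}
```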

2.4 Leases

Rejig assumes an architecture that uses leases to ensure consistency of data in the presence of server failures and network partitions. These leases are managed by the coordinator of Fig. 1. A lease is similar to a lock but with a fixed lifetime [25]. A CMI may process requests referencing Fragment F\(_k\) as long as the coordinator grants it a valid lease on F\(_k\). The CMI may contact the coordinator to renew its lease on F\(_k\) prior to its expiration. Once a lease on F\(_k\) expires, the CMI stops servicing requests referencing F\(_k\). The coordinator then assigns F\(_k\) to other available CMIs.

Similarly, before the coordinator changes the assignment of F\(_k\) from CMI\(_i\) to CMI\(_j\), it (1) revokes F\(_k\)’s lease from CMI\(_i\) to stop it from processing requests, and (2) grants a lease on F\(_k\) to CMI\(_j\) to enable it to process requests referencing F\(_k\). Subsequently, it changes the configuration and uses Rejig to propagate the new configuration to the clients.
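The ordering of these steps matters: leases are settled before the new configuration is published. A minimal sketch, with LeaseService as an assumed interface rather than Rejig’s actual API:

```java
// Orders a fragment migration as described above: revoke the source lease,
// grant the destination lease, then publish the new configuration.
interface LeaseService {
    void revoke(String cmi, int fragmentId);
    void grant(String cmi, int fragmentId);
}

class MigrationCoordinator {
    void migrate(LeaseService leases, int fragmentId, String srcCmi, String dstCmi) {
        leases.revoke(srcCmi, fragmentId); // (1) stop the source from serving F_k
        leases.grant(dstCmi, fragmentId);  // (2) let the destination serve F_k
        // (3) increment GlobalCfgID, update FragCfgID_k and F_k's CMI address,
        //     and insert the new configuration into the source CMI; Rejig's
        //     clients propagate it on demand.
    }
}
```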

Leases and Rejig serve different purposes and complement one another. Both are required to implement read-after-write consistency in the presence of network partitions and configuration changes. While leases ensure data availability in the presence of network partitions, Rejig disseminates a new configuration efficiently. In particular, Rejig enables a CMI to use its eviction policy to delete cache entries of those fragments that are no longer assigned to it (lazily as the space occupied by these entries is required).

The coordinator does not publish a new configuration that impacts the assignment of a fragment from/to a CMI that is unreachable. It waits for the lease to expire (or for the network connection to be restored so that it can issue a lease revoke/grant) prior to publishing a new configuration.

When a client references a CMI for a cache entry assigned to a fragment with an expired lease, the CMI may respond with a “Refresh & Retry” response. This causes the client to look up the CMI for the latest configuration. If the CMI reports a miss, the client contacts the coordinator for the latest configuration. If no CMI is assigned to this fragment then the coordinator selects the least loaded CMI, assigns the lease on the fragment to it, increments its GlobalCfgID and computes a new configuration, updates the GlobalCfgID of the CMI, inserts the latest configuration in the CMI, and provides the client with the latest configuration.

2.5 Replication for High Availability

A system may construct R replicas of a fragment to enhance data availability in the presence of CMI failure(s). Below, we describe two popular approaches to keep these replicas consistent. For each, we describe how a failure is represented as a configuration change and supported by Rejig.

The first approach requires (1) a read action to obtain r Shared (S) leases prior to reading the value of a replica and (2) a write action to obtain \(\omega \) eXclusive (X) leases prior to writing all replicas [11]. While S leases are compatible, X leases conflict with one another and S leases. Conflicts cause the leases to race with the loser backing off and retrying. For read-after-write consistency, \(r + \omega \) must be greater than R. With this approach, failure of a CMI is a configuration change that removes one replica of a fragment, decrementing R by one.
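The quorum condition \(r + \omega > R\) ensures every read quorum intersects every write quorum; for example, with \(R = 3\) replicas, choosing \(r = 2\) and \(\omega = 2\) suffices. A one-line check:

```java
// Read-after-write consistency under the S/X lease approach requires every
// read quorum (r S leases) to intersect every write quorum (w X leases).
class QuorumCheck {
    static boolean quorumsIntersect(int r, int w, int R) {
        return r + w > R; // e.g., R = 3, r = 2, w = 2
    }
}
```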

The second approach designates one copy of a fragment as the primary and the other \(R-1\) as secondaries [22, 39]. All reads and writes are processed by the primary fragment. The primary is responsible for propagating updates to the secondaries in the same order it receives them. If the primary fails, the coordinator promotes one of its secondaries to become the primary. With this design, a configuration change reflects promotion of a secondary to the primary and demotion of the primary to a secondary. The coordinator increments GlobalCfgID. The value of FragCfgID\(_k\) for the promoted secondary is left unchanged. If this new primary buffers changes for the demoted primary, then the value of FragCfgID\(_k\) for the demoted primary is also left unchanged. Should the coordinator decide to discard the fragment of the demoted primary, it simply sets its FragCfgID\(_k\) to the latest value of GlobalCfgID.

The coordinator inserts the new configuration into the \(R-1\) secondaries and updates their GlobalCfgID. A client that fails to issue a request to the failed primary CMI may contact one of the secondary CMIs for the latest configuration. If it observes a miss then the client contacts the coordinator for the latest configuration, piggybacking the identity of the unavailable CMI. Hence, clients discover failed CMIs and report them to the coordinator.

2.6 Overhead

Rejig’s overhead is in the form of storage space and network bandwidth. Storage overheads include (1) a 4-byte configuration id associated with each cache entry and (2) a configuration cache entry in a CMI. Network overheads include the transmission of (1) a 4-byte client configuration id attached to each request, (2) at most two round-trips per client to get the latest configuration (clients first retrieve the configuration from a CMI and, if not found, fetch the configuration from the coordinator), and (3) \(\alpha \) roundtrips per client to insert a new configuration into CMIs.

We quantify Rejig’s overhead based on the statistical models of Facebook’s production key size and value size [5, 19]. Facebook’s mean key size is 35 bytes and its mean value size is 329 bytes, resulting in an average entry size of 364 bytes. Hence, the average storage overhead of the configuration id associated with a cache entry is 1% (4/364). With a configuration consisting of F = 5000 fragments and 12-byte metadata per fragment, the configuration size is 60,000 bytes. This is equivalent to 165 (60,000/364) of Facebook’s cache entries. Rejig’s network overhead is 11.4% (4/35) for get and delete requests, and 1% (4/364) for set requests with Facebook’s cache entries.
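For concreteness, the arithmetic above works out as follows (all numbers come from the Facebook models and the F = 5000 configuration):

\[
\frac{4}{35+329} \approx 1.1\% \ \text{(storage per entry)}, \qquad
\frac{4}{35} \approx 11.4\% \ \text{(network, get/delete)},
\]
\[
F \times 12 = 5000 \times 12 = 60{,}000 \ \text{bytes} \approx 165 \ \text{average cache entries}.
\]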

3 Implementation

We implemented a prototype of Rejig using memcached’s Whalin client [41]. Its client library is implemented in around 500 lines of Java code. The client interfaces used to communicate with a CMI are unchanged.

We implemented Rejig’s coordinator using Google RPC [24] in 800 lines of Java code. The coordinator increases or decreases the number of CMIs based on the system load and adjusts the assignment of fragments to CMIs with the goal of ensuring each CMI receives a similar number of fragments.

We extended IQ-Twemcached [21] (Twitter’s extended version of memcached with Inhibit and Quarantine leases) to store GlobalCfgID and associate each cache entry with a configuration id. We extended the standard APIs, e.g., get, set, and delete, to accept an optional configuration id. We also added a new API to IQ-Twemcached to allow the coordinator to update a CMI’s configuration id. The extension to IQ-Twemcached requires only 40 lines of C code.
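The shape of the extended interface might look as follows; the signatures are our assumptions for illustration, not the actual Whalin/IQ-Twemcached extension.

```java
// Standard memcached-style APIs extended with an optional configuration id,
// plus the new API the coordinator uses to update a CMI's GlobalCfgID.
// Signatures are illustrative assumptions.
interface RejigCmiApi {
    ValueWithCid get(String key, int clientGlobalCfgId); // value and its V_cid
    boolean set(String key, byte[] value, int clientGlobalCfgId);
    boolean delete(String key, int clientGlobalCfgId);
    void setGlobalCfgId(int globalCfgId);                // coordinator-only API
}

class ValueWithCid {
    byte[] value;
    int vCid; // configuration id stored with the entry
}
```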

4 Evaluation

We answer the following questions in this section: (1) How fast can Rejig disseminate a new configuration? (2) How much stale data does Rejig prevent? (3) Does Rejig impact cache hit rate in the presence of graceful and drastic configuration changes?

Factors that impact Rejig’s dissemination rate are the number of clients and CMIs. CMIs are passive providers of a new configuration. Once all CMIs receive a new configuration, all clients receive the new configuration as soon as they issue a request to a CMI. The load imposed on the coordinator and the number of configuration insertions in CMIs are an interplay of \(\alpha \), Rejig and its variants, and the duration of time a CMI caches a configuration. In the worst case scenario, the size of a configuration cache entry is larger than the available memory of each CMI, causing the load imposed on the coordinator to equal the number of clients. In the best case scenario, each CMI has sufficient memory to cache the configuration and never evicts it. In this case, a larger \(\alpha \) expedites dissemination of the configuration to all CMIs, reducing the number of clients that fetch the configuration from the coordinator. Rejig\(^{C}\) bounds the number of configuration fetches from the coordinator by the number of CMIs. This is because a client must discover a new configuration from a CMI and fetch the configuration from the coordinator before it propagates the new configuration to other CMIs. Rejig\(^{T}\) bounds the number of repeated configuration insertions in CMIs by the number of clients since a client terminates the insertion once it encounters a CMI with a copy of the configuration. Section 4.1 quantifies Rejig’s dissemination rate.

We use Azure virtual machine event traces [10] and WorldCup ’98 request traces [4] to demonstrate the amount of stale data Rejig prevents and its impact on the cache hit rate. The Azure trace exhibits graceful configuration changes while the WorldCup ’98 trace exhibits drastic configuration changes with a diurnal pattern.

Lastly, we use the YCSB benchmark [8] to evaluate the performance impact of Rejig, namely, tagging a request with the client’s configuration id and returning a cache entry’s configuration id along with its value upon a cache hit. Our evaluation shows that Rejig’s network overhead is insignificant. The average read latency increases by less than 2% with Rejig when compared to without it.

Rejig can also be extended to support diverse migration techniques [27, 33, 36, 42]. In its current form, once the coordinator assigns a fragment to a different CMI, the new CMI starts with an empty replica of the fragment. If migration is enabled, a cache miss on the new CMI migrates the cache entry from the original CMI.

Main lessons are as follows:

  • Rejig disseminates a new configuration to all clients and CMIs efficiently and quickly.

  • Collaborative dissemination reduces the number of clients that contact the coordinator for the latest configuration significantly.

  • Rejig preserves read-after-write consistency in the presence of configuration changes by detecting and discarding all consistency violations.

  • In all experiments, the cache hit rate remains high even though a configuration change does not migrate cache entries. With the trace driven analysis, the cache hit rate is always higher than 99%.

  • With \(\alpha \ge 1\), if the objective is to minimize the number of configuration fetches from the coordinator then Rejig is superior to the alternatives shown in Table 2. On the other hand, if the objective is to minimize the number of repeated configuration insertions into CMIs then Rejig\(^C\) is superior. In particular, Rejig\(^C\) is superior to Rejig\(^T\).

Below, we describe experiments in turn.

4.1 Scalable Configuration Dissemination

We design a microbenchmark to evaluate Rejig’s configuration dissemination. It consists of 100 CMIs, a fixed number of clients, and one coordinator. There are 5000 fragments, F = 5000. We use Facebook’s published cache entry size of 364 bytes [5]. The size of a configuration is 60,000 bytes, \(\mathrm {F} \times \mathrm {12}\); twelve bytes assuming 8 bytes for the CMI address + 4 bytes for FragCfgID. For the first two experiments below, we assume the total available memory is greater than the database size and there are no evictions. We consider limited memory in the worst case scenario at the end of this section.

An experiment performs a sequence of iterations. In each iteration, each client issues a read for a randomly selected key K\(_i\) assigned to CMI\(_i\), \(i=\mathrm {Config}[\mathrm {h}(\mathrm {K}_i)].\mathrm {CMI}\). The experiment starts with the coordinator publishing a new configuration that assigns a fragment from a source CMI to a destination CMI. The coordinator inserts the new configuration into the source CMI. The experiment terminates when all clients and CMIs have the latest GlobalCfgID and configuration. We report the number of iterations required for Rejig to disseminate a configuration change.

Configuration Dissemination Rate. The number of clients impacts how fast Rejig disseminates a configuration change. As we increase the number of clients from 100 to 1,000 and 10,000, Rejig requires 12 ± 1.7, 5 ± 0.5, 4 ± 0 (mean ± standard deviation) iterations, respectively. This is also an upper bound on the number of client configuration insertions into a CMI. For a CMI to receive the latest GlobalCfgID, a client with the latest GlobalCfgID must issue a request to it. With a larger number of clients, a larger number of requests are issued to CMIs in each iteration. This increases the likelihood of clients referencing all CMIs in an iteration, causing the experiment to terminate with fewer iterations. The reported number of iterations is orthogonal to the value of \(\alpha \) because a CMI always updates its configuration id if the client-provided one is more recent.

The number of configuration fetches from the coordinator depends on the value of \(\alpha \). With \(\alpha =0\), it is approximately the same as the number of clients: 93 ± 2.2, 962 ± 5.4, 9745 ± 16.3 with 100, 1000, and 10,000 clients, respectively. The explanation for this is as follows. Once the coordinator publishes the new configuration in one CMI, say CMI\(_i\), it also notifies this CMI of the new GlobalCfgID. A client that references CMI\(_i\) and obtains the latest GlobalCfgID will subsequently reference another CMI\(_j\) and cause its GlobalCfgID to reflect the latest without providing it with the latest configuration. Those clients that reference this CMI\(_j\) observe a cache miss for the configuration and fetch the configuration from the coordinator. The number of configuration fetches is slightly less than the number of clients because those that observe a hit for the configuration using CMI\(_i\) do not contact the coordinator.

The next section discusses \(\alpha \ge 1\).

Collaborative Dissemination, \(\alpha \ge 1\). In a second experiment, we evaluate the impact of requiring a client to insert a new configuration in the next \(\alpha \ge 1\) unique CMIs it references. This is the standard Rejig. We consider its variants Rejig\(^C\) and Rejig\(^T\), see Table 2. Results of this section show Rejig\(^C\) is superior to Rejig\(^T\).

Figure 3 shows the number of configuration fetches with these variants as a function of \(\alpha \). With Rejig, the number of configuration fetches from the coordinator drops 100-fold with \(\alpha =1\) and 10,000 clients when compared with \(\alpha =0\), see Fig. 3(c). When a client inserts its configuration in the next \(\alpha =1\) unique CMI it visits, other clients that reference CMI\(_j\) and observe a “Refresh & Retry” reply will now observe a cache hit for the configuration. Hence, they will not fetch the configuration from the coordinator. The number of configuration fetches from the coordinator continues to drop as we increase the value of \(\alpha \). With Rejig and \(\alpha \ge 4\), there are almost no client fetches from the coordinator. Clients and CMIs facilitate propagation of the configuration among themselves without further involvement of the coordinator.

Fig. 3. The impact of \(\alpha \) on the number of configuration fetches from the coordinator.

Fig. 4. The impact of \(\alpha \) on the number of repeated configuration insertions in CMIs.

With Rejig\(^C\), the number of configuration fetches from the coordinator remains higher than 10 as we increase \(\alpha \) beyond 4. Rejig\(^C\) requires a client to insert a configuration only if it fetches the configuration from the coordinator. Hence, it is less aggressive in spreading the latest configuration when compared with Rejig. This increases the likelihood of a client observing a miss for a configuration in a CMI. Hence, the number of configuration fetches from the coordinator is higher.

A larger \(\alpha \) causes two or more clients to insert the same configuration into the same CMI repeatedly. Figure 4 shows the total number of repeated insertions with different variants. Rejig\(^C\) performs the fewest repeated insertions because it is the least aggressive. The standard Rejig performs the most because every client that obtains the latest configuration (either from the coordinator or a CMI) will insert into the next \(\alpha \) unique CMIs. Rejig\(^T\) is moderately aggressive by requiring a client to terminate configuration insertion once a CMI reports that it has the configuration.

These results show Rejig\(^C\) is superior to Rejig\(^T\) for two reasons. First, it performs significantly fewer configuration insertions than Rejig\(^T\), see Fig. 4. Second, its overall number of configuration fetches from the coordinator is comparable, see Fig. 3.

Worst Case Scenario. Consider a worst case scenario where the size of a configuration equals the amount of memory assigned to each CMI. (This is highly unlikely because the size of a configuration is in the order of hundreds of kilobytes with thousands of nodes while memory sizes are typically much larger.) Thus, the coordinator’s insertion of a configuration evicts all cached entries of that CMI. Similarly, a client that inserts a cache entry upon a cache miss evicts the configuration cache entry. With \(\alpha =0\), almost all clients (9998 of 10,000) fetch the configuration from the coordinator. The coordinator populates at least one CMI and this copy is fetched by 1 or 2 clients. As we increase \(\alpha \), the number of configuration fetches drops dramatically: 9998 ± 0.15, 2197 ± 133.93, 110 ± 10.39 and 24 ± 5.26 for \(\alpha \) values of 0, 1, 2 and 3, respectively. This is because, after a client inserts the configuration in a CMI, another client may reference it there before a subsequent entry insertion evicts the configuration. The likelihood of this hit increases with larger values of \(\alpha \), reducing the number of configuration fetches from the coordinator.

4.2 Trace Driven Evaluation

We use two traces to evaluate Rejig: the Azure virtual machines (VM) trace [10] and 92 days of the WorldCup 1998 request trace [4]. The first trace provides dynamic addition and removal of VMs. We augment it with a database and use its trace to emulate addition and removal of CMIs from a configuration. WorldCup 1998 provides request traces (HTTP GET/POST on a page) with approximately 1.3 billion requests. It exhibits drastic workload fluctuations. We augment it with an auto-scaling framework that adjusts addition (\(\mathrm {Add}_i\)) and removal (\(\mathrm {Remove}_{i}\)) of CMIs based on the imposed load.

With both traces, when \(\mathrm {Add}_i\) new CMIs are inserted in a configuration, the coordinator assigns fragments of existing CMIs to the new CMIs until the number of fragments per CMI is approximately the same. No data is migrated. Similarly, when \(\mathrm {Remove}_{i}\) CMIs are removed, the coordinator assigns the orphaned fragments to other CMIs until the same number of fragments are assigned to each CMI. Once again, no data is migrated.

Fig. 5. Azure trace.

In all experiments, the values stored in a CMI are known and the workload generator can verify the correctness of the fetched values. We use the following metrics: (1) the cache hit rate, (2) the number of discarded keys, and (3) the percentage of discarded keys. The percentage of discarded keys highlights the percentage of read requests that may observe stale entries. Had a client produced a discarded key as an output, the read request may have violated read-after-write consistency. The number of discarded entries that are indeed stale highlights how many read requests violate read-after-write consistency with a system that does not use our techniques. Rejig eliminates these stale entries. We describe each trace and the obtained results in turn.

Azure VM Trace. Azure virtual machines (VM) trace [10] provides a representative subset of the first-party Azure VM workload in one geographical region. It monitors VMs in a consecutive 30-day period and contains a total of 2 million VMs. It provides the exact lifetime of each virtual machine with approximately 145,000 VMs running at a time. This trace has no data set. Hence, we generate a database with 10 million key-value records partitioned into 100,000 fragments for our evaluation using this trace. Before a configuration change, we randomly select 20% of entries and update their values. Since a configuration change does not migrate data, copies of these entries in the impacted CMIs are now stale.

Figure 5(a) shows the number of configuration changes we extracted from the Azure trace. It shows the number of added CMIs, removed CMIs, and total CMIs per configuration change. Figure 5(b) shows both the observed cache hit rate and the percentage of keys discarded by Rejig with each configuration. The cache hit rate remains higher than 99% even though a configuration change may add 1800 CMIs. The variation in cache hit rate is higher with a larger number of CMIs either added to or removed from a configuration. This is expected because a higher number of fragments are assigned to different CMIs, resulting in more cache misses. Note that the percentage of discarded keys is low (less than 0.003%) in this experiment. Among all discarded keys, 99% are stale. These are read-after-write inconsistencies prevented by Rejig.

Fig. 6. WorldCup 1998 trace.

WorldCup 1998 Trace. WorldCup 1998 [4] exhibits a diurnal pattern and drastic workload fluctuations. It contains 92 days of request traces at the granularity of seconds. A large percentage (99.98%) of its approximately 1.3 billion requests are reads. The peak system load observes 10 million requests per hour. The traces reference 89,997 unique keys. We start the simulation from Day 5 (the first four days have no data).

We use the following auto-scaling framework with this trace. We assume the peak processing capacity of a CMI is \(C=1000\) requests per second. We define the imposed system load as the number of requests at a given time, Load\(_i\). The number of required CMIs is a function of these two parameters: the number of CMIs to be added is \(\mathrm {Add}_i=\mathrm {max}(\frac{\mathrm {Load}_i}{C}-\mathrm {S}_{i-1},\ 0)\) and the number of CMIs to be removed is \(\mathrm {Remove}_{i}=\mathrm {max}(\mathrm {S}_{i-1}-\frac{\mathrm {Load}_{i}}{C},\ 0)\), where \(\mathrm {S}_{i-1}\) is the number of CMIs in the previous hour. These ensure the caching layer has a sufficient number of CMIs to handle Load\(_i\) at a given hour i. This auto-scaling is performed every hour of the trace for its entire 92 days.
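A direct transcription of this rule, assuming integer CMI counts are obtained by rounding \(\frac{\mathrm{Load}_i}{C}\) up:

```java
// Auto-scaling rule used with the WorldCup '98 trace: C = 1000 requests per
// second per CMI; prevCmis is S_{i-1}, the CMI count of the previous hour.
class AutoScaler {
    static final double C = 1000.0;

    static int add(double loadI, int prevCmis) {
        return Math.max((int) Math.ceil(loadI / C) - prevCmis, 0);
    }

    static int remove(double loadI, int prevCmis) {
        return Math.max(prevCmis - (int) Math.ceil(loadI / C), 0);
    }
}
```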

Figure 6(a) reports the total number of added and removed CMIs per hour. As expected, more CMIs are added during the daytime. They are removed during the nighttime. The resulting drastic configuration change causes the cache hit rate to fluctuate as we do not migrate data, see Fig. 6(b). However, the percentage of discarded keys remains low. The 0% cache hit rate on Day 60 is because there is no data for a few hours on that day. The number of read-after-write consistency anomalies avoided by Rejig is still significant even though the update ratio is only 0.02%, see Fig. 6(c).

5 Related Work

Existing work [20, 21, 33] on cache augmented database systems focuses on eliminating or minimizing inconsistency between the caching layer and the data store. The IQ-framework [21] uses leases to ensure a cache entry’s replica in the caching layer is consistent with its replicas in the data store. Gumball [20] minimizes inconsistency by rejecting reads and writes referencing a cache entry for a certain duration after the entry is deleted. Facebook [33] sets this duration to two seconds. Rejig complements these systems by preserving consistency in the presence of configuration changes.

The CAP theorem [23] states that a distributed data store may provide at most two of the following three properties: consistency, availability, and partition tolerance. Rejig does not address these properties directly. Rejig is a scalable online algorithm for cache server configuration changes. It assumes an architecture that uses heartbeat messages to detect network partitions and leases to re-claim fragments assigned to unreachable nodes. While it strives to preserve consistency, one may use Rejig with an architecture that provides eventual consistency.

Two-phase commit [37] based systems (e.g., Sinfonia [2]) and Paxos [29] based systems (e.g., Spanner [9]) ensure consistency between multiple replicas. Rejig does not ensure consistency across all replicas. Instead, it detects and discards stale cache entries caused by configuration changes.

Rejig’s configuration dissemination protocol is inspired by Demers et al.’s work [12] on epidemic algorithms where all sites of a data store actively participate in propagating an update made by one site. Rejig propagates a configuration made by the coordinator using both clients and CMIs. Its CMIs are passive entities that respond to client requests. Its clients actively propagate configuration to other clients using CMIs. Rejig’s dissemination protocol is also applicable to a data store.

Rejig’s configuration management is inspired by previous work, e.g., Google File System [22], Hyperdex [13], and Slicer [1]. Rejig is unique because it is designed for a caching solution and not a data store. With Rejig, there is a permanent copy of data elsewhere and loss of cached data does not result in data loss. Another novel feature of Rejig is that it is intended for frameworks that allow for stale cache entries to exist. Rejig detects these entries by storing a configuration id with each cache entry and its fragment, see Sect. 2. This concept is missing from prior work. Below, we provide an overview of each related system and how Rejig is different.

Google File System (GFS) [22] is a distributed file system. A GFS file contains a list of chunks. Each chunk is associated with a chunk version number. The master maintains the latest chunk version number for each chunk. Once a chunk server recovers from a failure, the master detects a stale chunk if its version number is less than the master’s chunk version number. Rejig stores configuration id with each cache entry and fragment to detect stale cache entries.

Hyperdex [13] is a novel distributed key-value store that supports index structures on more than one attribute. Similar to Rejig, it employs a coordinator that manages its configuration with a strictly increasing configuration id. Upon a configuration change, the coordinator increments the configuration id and distributes the latest configuration to all servers. Both Hyperdex servers and its clients cache the configuration id. A client embeds its local configuration id on every request to a server and discovers its cached configuration is stale if the id does not match the server’s configuration id. Hyperdex is a data store and prevents stale values for data items. Rejig is for a caching environment where stale entries may exist. It assigns a configuration id to every cache entry and uses this information to detect stale entries and discard them to provide read-after-write consistency.

Slicer [1] is Google’s general purpose sharding service that is transparent to its applications. It maintains assignments using generation numbers (equivalent to Rejig’s GlobalCfgID). Slicer employs leases to ensure that a key is assigned to one slicelet (equivalent to Rejig’s CMI) at a time. Applications are unavailable for at most 4 s during an assignment change due to updating leases to reflect the latest generation number (and assignment) to slicelets. Also, Slicer must provision resources for a large number of distributors to disseminate a new assignment to clients and slicelets. Rejig is designed for caches and is different in several ways. First, while its CMIs may cache a copy of the configuration (Slicer’s assignment), they do not use it to decide whether or not to process a client request. CMIs use GlobalCfgID (Slicer’s generation number) for this purpose. Moreover, Rejig requires its coordinator to update impacted CMIs only and employs both clients and CMIs to participate in distributing a new configuration. Finally, Slicer neither prevents nor detects stale data. Rejig is novel because it detects stale cache entries by storing the configuration id with each entry and fragment.

6 Conclusion

Rejig is a scalable online algorithm for cache server configuration changes that preserves read-after-write consistency. It does not require deletion of cached entries impacted by a configuration change, leaving them to be evicted by the cache replacement technique. It serves as the building block for a fragment re-organization algorithm to balance system load, an auto-scaling framework that grows and shrinks the size of a caching layer, a data availability technique that re-assigns fragments in response to network partitions, and a persistent caching layer [18] that must recover cached entries after a failure.