
1 Introduction

The performance of applications on multi-core systems is often harmed by saturated locks, i.e., locks for which at least one thread is waiting. Prior work has observed that as the number of threads circulating through a saturated lock grows, the overall application performance often fades or even drops abruptly [2, 7, 16, 17], a behavior called scalability collapse [7]. This happens because threads compete over shared system resources, such as computing cores and the last-level cache (LLC). For instance, an increase in the number of distinct threads circulating through the lock typically leads to increased cache pressure, resulting in cache misses. At the same time, threads waiting for the lock consume valuable resources and might preempt the lock holder, preventing it from making progress in its critical section and exacerbating the contention on the lock even further.

Fig. 1. Microbenchmark performance with different locks on a 2-socket machine with 20 hyper-threads per socket.

An example of scalability collapse can be seen in Fig. 1, which depicts the performance of a key-value map microbenchmark with three popular locks on a 2-socket x86 machine featuring 40 logical CPUs in total (full details of the microbenchmark and the machine are provided later). The shape and the exact point of the performance decline differ between the locks, yet none of them is able to sustain peak throughput. With the Test-Test-Set lock, for instance, the performance drops abruptly when more than just a few threads are used, while with the MCS lock [21] the performance is relatively stable up to the capacity of the machine and collapses once the system gets oversubscribed (i.e., has more threads available than the number of cores). Note that one of the locks, MCS-TP, was designed specifically to handle oversubscription [14], yet its performance still falls short of the peak.

It might be tempting to argue that one should never create a workload where the underlying machine is oversubscribed, pre-tuning the maximum number of threads and using a lock, such as MCS, to keep the performance stable. We note, however, that in modern component-based software, the total number of threads is often out of the hands of the developer. A good example is applications that use thread pools, or even multiple mutually unaware thread pools. Furthermore, in multi-tenant and/or cloud-based deployments, where the resources of a physical machine (including cores and caches) are often shared between applications running inside virtual machines or containers, applications can run concurrently with one another without even being aware that they share the same machine. Thus, limiting the maximum number of threads by the number of cores does not help much. Finally, even when a saturated lock delivers seemingly stable performance, threads spinning and waiting for the lock consume energy and take resources (such as CPU time) from other, unrelated tasks.

In this paper we introduce generic concurrency restriction (GCR) to deal with the scalability collapse. GCR operates as a wrapper around any existing lock (including POSIX pthread mutexes and specialized locks provided by an application). GCR intercepts calls for lock acquisition and decides which threads may proceed with the acquisition of the underlying lock (those threads are called active) and which threads are blocked (those threads are called passive). Reducing the number of threads circulating through the lock improves cache performance, while blocking passive threads reduces competition over CPU time, leading to better system performance and energy efficiency. To avoid starvation and achieve long-term fairness, active and passive threads are shuffled periodically. We note that the admission policy remains fully work conserving with GCR. That is, when a lock holder exits, one of the waiting threads will be able to acquire the lock immediately and enter its critical section.

In this paper we also show how GCR can be extended to the non-uniform memory access (NUMA) settings of multi-socket machines. In those settings, accessing data residing in a local cache is far cheaper than accessing data in a cache located on a remote socket. Previous research on locks tackled this issue by trying to keep the lock ownership on the same socket [3, 9, 10, 23], thus increasing the chance that the data accessed by a thread holding the lock (and the lock data as well) would be cached locally to that thread. The NUMA extension of GCR, called simply GCR-NUMA, takes advantage of that same idea by trying to keep the set of active threads composed of threads running on the same socket. As a by-product of this construction, GCR-NUMA can convert any lock into a NUMA-aware one.

We have implemented GCR (and GCR-NUMA) in the context of the LiTL library [13, 20], which provides implementations of over two dozen locks. We have evaluated GCR with all those locks using a microbenchmark as well as two well-known database systems (namely, Kyoto Cabinet [12] and LevelDB [18]), on three different systems (two x86 machines and one SPARC). The results show that GCR avoids the scalability collapse, which translates to substantial speedup (up to three orders of magnitude) under high lock contention for virtually every evaluated lock, workload and machine. Furthermore, we show empirically that GCR does not harm the fairness of underlying locks (in fact, in many cases GCR improves fairness). GCR-NUMA brings even larger performance gains, starting at even lighter lock contention.

2 Related Work

Prior work has explored adapting the number of active threads based on lock contention [7, 17]. However, that work customized certain types of locks, exploiting their specific features, such as the fact that waiting threads are organized in a queue [7], or that lock acquisition can be aborted [17]. Those requirements limit the ability to adapt those techniques to other locks and use them in practice. For instance, very few locks allow waiting threads to abandon an acquisition attempt, and many spin locks, such as a simple Test-Test-Set lock, do not maintain a queue of waiting threads. Furthermore, the lock implementation is often opaque to the application, e.g., when POSIX pthread mutexes are used. At the same time, prior research has shown that every lock has its own “fifteen minutes of fame”, i.e., there is no lock that always outperforms others, and the choice of the optimal lock depends on the given application, platform and workload [6, 13]. Thus, in order to be practical, a mechanism to control the number of active threads has to be lock-agnostic, like the one provided by GCR.

Other work in different, but related, contexts has observed that controlling the number of threads used by an application is an effective approach for meeting certain performance goals. For instance, Raman et al. [24] demonstrate this with a run-time system that monitors application execution to dynamically adapt the number of worker threads executing parallel loop nests. In another example, Pusukuri et al. [22] propose a system that runs an application multiple times for short durations while varying the number of threads, and determines the optimal number of threads to create based on the observed performance. Chadha et al. [4] identify cache-level thrashing as a scalability impediment and propose system-wide concurrency throttling. Heirman et al. [15] suggest intentional undersubscription of threads as a response to competition for shared caches. Hardware and software transactional memory systems use contention managers to throttle concurrency in order to optimize throughput [25]. The issue is particularly acute in the context of transactional memory, as failed optimistic transactions waste resources.

Trading off between throughput and short-term fairness has been extensively explored in the context of NUMA-aware locks [3, 9, 10, 23]. Those locks, however, do not feature a concurrency restriction mechanism; in particular, they do not avoid contention at the intra-socket level and the issues resulting from it.

3 Background

Contending threads must wait for the lock when it is not available. There are several common waiting policies. The simplest one is unbounded spinning, also known as busy-waiting or polling. There, the waiting threads spin on a global or local memory location and wait until the value in that location changes. Spinning consumes resources and contributes to preemption when the system is oversubscribed, i.e., has more ready threads than available logical CPUs. Yet, absent preemption, it is simple and provides fast lock handover times, and for those reasons it is used by many popular locks, e.g., Test-Test-Set.
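To make the policy concrete, below is a minimal sketch of a Test-Test-Set spin lock using C11 atomics; the type and function names are ours and not taken from any particular lock library.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } ttas_lock_t;

    void ttas_lock(ttas_lock_t *l) {
        for (;;) {
            /* "test" phase: spin on a plain read until the lock looks free */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ; /* busy-wait */
            /* "test-and-set" phase: try to grab the lock atomically */
            if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
                return;
        }
    }

    void ttas_unlock(ttas_lock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }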

An alternative waiting policy is parking, where a waiting thread voluntarily releases its CPU and passively waits (by blocking) for another thread to unpark it when the lock becomes available. Parking is attractive when the system is oversubscribed, as it releases CPU resources for threads ready to run, including the lock holder. However, the cost of the voluntary context switching imposed by parking is high, which translates to longer lock handover times when the next owner of the lock has to be unparked.

To mitigate the overhead of parking and unparking on the one hand, and limit the shortcomings of unbounded spinning on the other hand, lock designers proposed a hybrid spin-then-park policy. There, threads spin for a brief period, and park if the lock is still not available by the end of that period. While tuning the optimal spinning time is challenging [16, 19], it is typically set to the length of a context-switch round trip [7].
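The following is a minimal sketch of a spin-then-park wait on a flag, assuming a Linux futex is used for parking; SPIN_LIMIT is a hypothetical tuning knob that would typically correspond to the cost of a context-switch round trip.

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 1000 /* hypothetical value */

    void wait_for_flag(atomic_int *flag) {
        /* spin phase: poll briefly, hoping for a fast handover */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (atomic_load_explicit(flag, memory_order_acquire))
                return;
        /* park phase: block in the kernel until woken (or the flag is already set) */
        while (!atomic_load_explicit(flag, memory_order_acquire))
            syscall(SYS_futex, flag, FUTEX_WAIT, 0, NULL, NULL, 0);
    }

    void set_flag_and_wake(atomic_int *flag) {
        atomic_store_explicit(flag, 1, memory_order_release);
        syscall(SYS_futex, flag, FUTEX_WAKE, 1, NULL, NULL, 0);
    }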

4 Generic Concurrency Restriction

GCR “wraps” a lock API, i.e., calls to Lock/Unlock methods go through the corresponding methods of GCR. In our implementation, we interpose on the standard POSIX pthread_mutex_lock and pthread_mutex_unlock functions. Thus, using the standard LD_PRELOAD mechanism on Linux and Unix, GCR can be made immediately available to any application that uses the standard POSIX API, without even recompiling the application or its locks.
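As an illustration, the snippet below sketches how such interposition can be set up, assuming the real functions are resolved with dlsym(RTLD_NEXT, ...); the comments stand for the GCR logic described in the rest of this section, and the symbol names are ours.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>

    static int (*real_lock)(pthread_mutex_t *);
    static int (*real_unlock)(pthread_mutex_t *);

    int pthread_mutex_lock(pthread_mutex_t *m) {
        if (!real_lock)
            real_lock = (int (*)(pthread_mutex_t *))dlsym(RTLD_NEXT, "pthread_mutex_lock");
        /* GCR admission logic (Fig. 2) would run here before delegating */
        return real_lock(m);
    }

    int pthread_mutex_unlock(pthread_mutex_t *m) {
        if (!real_unlock)
            real_unlock = (int (*)(pthread_mutex_t *))dlsym(RTLD_NEXT, "pthread_mutex_unlock");
        /* GCR bookkeeping (Fig. 2) would run here as well */
        return real_unlock(m);
    }

Such a library would be built as a shared object (e.g., gcc -shared -fPIC -ldl) and activated via LD_PRELOAD, without touching the application binary.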

The pseudo-code implementation is provided in Fig. 2, where FAA, SWAP and CAS stand for the atomic fetch-and-add, swap and compare-and-swap instructions, respectively. In the following description, we distinguish between active threads, that is, threads allowed by GCR to invoke the API of the underlying lock, and passive threads, which are not allowed to do so. Note that this distinction is unrelated to the running state of the corresponding threads. That is, active threads may actually be blocked (parked) if the underlying lock decides to do so, while passive threads may be spinning, waiting for their turn to join the set of active threads. In addition, given that GCR by itself does not provide lock semantics (even though it implements the lock API), we will refer to the underlying lock simply as the lock.

Fig. 2. GCR pseudo-code implementation.

The auxiliary data structures used by GCR include Node and LockType (cf. Fig. 2). The Node structure represents a node in the queue of passive threads. In addition to pointers to the successor and predecessor nodes in the queue, the Node structure contains the event flag. This flag is used to signal a thread when its node moves to the head of the queue.

The LockType structure contains the internal lock metadata (passed to the Lock and Unlock functions of that lock) and a number of additional fields:

  • top and tail are the pointers to the first and the last nodes in the queue of passive threads, respectively.

  • topApproved is a flag used to signal the passive thread at the top of the queue that it can join the set of active threads.

  • numActive is the counter for the number of active threads.

  • numAcqs is a counter for the number of lock acquisitions. It is used to move threads from the passive set to the active set, as explained below.

In addition to the LockType structure, GCR uses the nextLock (nextUnlock) function pointer, which is initialized to the Lock (Unlock, respectively) function of the underlying lock. The initialization code is straightforward (on Linux it can use the dlsym function), and thus is not shown.
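For concreteness, the following sketch shows how these structures might be declared in C. The field names follow the description above; the exact types, the use of pthread_mutex_t for the internal lock metadata, and the resolution of nextLock/nextUnlock via dlsym are our assumptions.

    #include <stdatomic.h>
    #include <pthread.h>

    typedef struct Node {
        struct Node *next;        /* successor in the queue of passive threads */
        struct Node *prev;        /* predecessor in the queue */
        atomic_int   event;       /* set when this node reaches the head of the queue */
    } Node;

    typedef struct LockType {
        pthread_mutex_t internal; /* metadata of the underlying lock */
        Node  *top;               /* first node in the queue of passive threads */
        Node  *tail;              /* last node in the queue */
        atomic_int topApproved;   /* signals the top passive thread to join the active set */
        atomic_int numActive;     /* number of active threads */
        unsigned long numAcqs;    /* number of lock acquisitions (updated under the lock) */
    } LockType;

    /* pointers to the underlying lock's methods, resolved once at initialization */
    static int (*nextLock)(pthread_mutex_t *);
    static int (*nextUnlock)(pthread_mutex_t *);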

GCR keeps track of the number of active threads per lock. When a thread invokes the Lock method wrapped by GCR, GCR checks whether the number of active threads is larger than a preconfigured threshold (cf. Line 3, where we use a threshold of 1). If not, the thread proceeds by calling the lock’s Lock method after atomically incrementing the (per-lock) counter of active threads (numActive) in Line 5. This constitutes the fast path of the lock acquisition. We note that the check in Line 3 and the increment in Line 5 are not mutually atomic, that is, multiple threads can reach Line 5 and thus increment the counter stored in numActive concurrently. However, the lack of atomicity may only impact performance (as the underlying lock will become more contended), not correctness. Besides, such races should be rare once the system reaches a steady state.
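A sketch of this fast path, using the structures sketched above (gcr_slow_path() is a hypothetical name for the passive-thread path described next):

    int gcr_lock(LockType *l) {
        /* fast path: the lock is not saturated (threshold of 1 active thread) */
        if (atomic_load(&l->numActive) <= 1) {
            atomic_fetch_add(&l->numActive, 1); /* FAA; not atomic with the check above */
            return nextLock(&l->internal);
        }
        /* slow path: wait in the queue of passive threads */
        return gcr_slow_path(l);
    }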

Otherwise, GCR detects that the lock is saturated, and places the (passive) thread into a (lock-specific) queue where that thread waits for its turn to join the set of active threads. This queue is based on a linked list; each node is associated with a different thread. Every thread in the queue but the first can wait by spinning on a local flag (event) in its respective node, yield the CPU and park, or any combination thereof. (The thread at the head of the queue has to spin, as explained below.) In practice, we choose the spin-then-park policy for all passive threads in the queue but the first, to limit the use of system resources by those threads.

The thread at the top of the queue monitors the signal from active threads indicating that it may join the active set. It does so by spinning on the topApproved flag (Line 17). In addition, this thread monitors the number of active threads by reading the numActive counter (Line 21). Note that unlike the topApproved flag, this counter changes on every lock acquisition and release. Thus, reading it on every iteration of the spinning loop would create unnecessary coherence traffic and slow down active threads when they attempt to modify this counter. In the longer version of this paper [8], we describe a simple optimization that allows reading this counter less frequently while still monitoring the active set effectively. Once the passive thread at the top of the queue breaks out of the spinning loop, it leaves the queue, notifying the next thread t (if one exists) that the head of the queue has changed (by setting a flag in t’s node and unparking t if necessary), and proceeds by calling the lock’s Lock method.
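The sketch below captures this slow path under the stated assumptions. enqueue_self() and dequeue_and_signal_next() are hypothetical queue helpers, wait_for_flag() is the spin-then-park helper sketched in Sect. 3, and the exact exit condition of the head thread's spinning loop is our simplification of the pseudo-code in Fig. 2.

    int gcr_slow_path(LockType *l) {
        Node me = { .next = NULL, .prev = NULL, .event = 0 };
        /* join the queue; returns a non-NULL predecessor if we are not at the head */
        if (enqueue_self(l, &me) != NULL)
            wait_for_flag(&me.event);   /* spin-then-park until our node is at the head */
        /* at the head: spin until signaled, or (assumption) until the active set drains */
        while (!atomic_exchange(&l->topApproved, 0) &&
               atomic_load(&l->numActive) > 0)
            ; /* keep spinning; the head thread must not park */
        dequeue_and_signal_next(l, &me);    /* set successor's event flag and unpark it */
        atomic_fetch_add(&l->numActive, 1); /* we now join the active set */
        return nextLock(&l->internal);
    }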

When a thread invokes GCR’s Unlock method, it checks whether it is time to signal the (passive) thread at the head of the queue to join the set of active threads (cf. Lines 34–38). This is done to achieve long-term fairness, preventing starvation of passive threads. To this end, GCR keeps a simple counter of the number of lock acquisitions (numAcqs), which is incremented with a simple store, as the increment is done under the lock. (Other alternatives, such as timer-based approaches, are possible.) Following that, GCR atomically decrements the counter of active threads and calls the lock’s Unlock method.
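A corresponding sketch of the Unlock wrapper; FAIRNESS_THRESHOLD is a hypothetical tuning parameter controlling how often a passive thread is promoted.

    #define FAIRNESS_THRESHOLD 1000 /* hypothetical value */

    int gcr_unlock(LockType *l) {
        /* still holding the lock, so a plain increment suffices for numAcqs */
        l->numAcqs++;
        if (l->numAcqs % FAIRNESS_THRESHOLD == 0 && l->top != NULL)
            atomic_store(&l->topApproved, 1); /* signal the head passive thread */
        atomic_fetch_sub(&l->numActive, 1);   /* leave the active set */
        return nextUnlock(&l->internal);
    }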

In the longer version of the paper [8], we describe a number of optimizations intended to reduce the overhead of GCR when the underlying lock is not saturated. For instance, we show how to avoid atomic operations on the counter of active threads (numActive) by dynamically enabling and disabling GCR based on the actual contention on the lock.

5 NUMA-Aware GCR

As GCR controls which threads join the active set, it may well do so in a NUMA-aware way. In practice, this means that it should strive to keep the active set composed of threads running on the same socket (or, more precisely, on the same NUMA node). Note that this does not place any additional restrictions on the underlying lock, which may or may not be NUMA-aware itself. Naturally, if the underlying lock is NUMA-oblivious, the benefit of such an optimization would be higher.

Introducing NUMA-awareness into GCR requires relatively few changes. On a high level, instead of keeping just one queue of passive threads per lock, we keep a number of queues, one per socket. Thus, a passive thread joins the queue corresponding to the socket it is running on. In addition, we introduce the notion of a preferred socket, which is a socket that gets preference in deciding which threads should join the active set. In our case, we set the preferred socket solely based on the number of lock acquisitions (i.e., the preferred socket is changed in a round-robin fashion every certain number of lock acquisitions), but other, more refined (e.g., time-based) schemes are possible.

We say that a (passive) thread is eligible (to check whether it can join the active set) if it is running on the preferred socket or if the queue (of passive threads) of the preferred socket is empty. When a thread calls the Lock function, we check whether it is eligible and let it proceed with examining the size of the active set (i.e., read the numActive counter) only if it is. Otherwise, it immediately goes into the slow path, joining the queue according to its socket. This means that once the designation of the preferred socket changes (when threads running on that socket acquire and release the lock “enough” times), active threads from the now non-preferred socket will become passive when they attempt to acquire the lock again.
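A sketch of the preferred-socket rotation and the eligibility check described above; NUM_SOCKETS, ACQS_PER_SOCKET, the per-socket queues array, queue_empty() and the caller-supplied my_socket are our assumptions rather than the paper's actual names.

    #include <stdbool.h>

    #define NUM_SOCKETS     2     /* hypothetical: a dual-socket machine */
    #define ACQS_PER_SOCKET 10000 /* hypothetical rotation period */

    /* round-robin choice of the preferred socket, driven by the acquisition count */
    static int preferred_socket(const LockType *l) {
        return (int)((l->numAcqs / ACQS_PER_SOCKET) % NUM_SOCKETS);
    }

    /* a thread may examine the size of the active set only if it is eligible */
    static bool is_eligible(const LockType *l, int my_socket) {
        int pref = preferred_socket(l);
        return my_socket == pref || queue_empty(&l->queues[pref]);
    }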

Having only eligible threads monitor the size of the active set has two desirable consequences. First, only the passive thread at the top of the queue corresponding to the preferred socket will be the next thread (out of all passive threads) to join the set of active threads. This keeps the set of active threads composed of threads running on the same (preferred) socket and ensures long-term fairness. Second, non-eligible threads (running on other, non-preferred sockets) do not access the counter of active threads (but rather wait until they become eligible), reducing contention on that counter.

6 Evaluation

We implemented GCR as a stand-alone library conforming to the pthread mutex lock API defined by the POSIX standard. We integrated GCR into LiTL [20], an open-source project providing implementations of dozens of locks, including well-established locks, such as MCS [21] and CLH [5], as well as more recent ones, such as the NUMA-aware Cohort [10] and HMCS [3] locks. The LiTL library also includes an implementation of the related Malthusian lock [7], which introduces a concurrency restriction mechanism into the MCS lock. Furthermore, the LiTL library allows specifying various waiting policies (e.g., spin or spin-then-park) for locks that support them (such as MCS, CLH or Cohort locks). Overall, we experimented with 24 different lock+waiting policy combinations in LiTL (for brevity, we refer to each lock+waiting policy combination simply as a lock).

We run experiments on three different platforms. In this paper, we focus on a dual-socket x86-based system with 40 logical CPUs in total. The qualitative results from the other two systems (a four-socket x86-based system with 144 logical CPUs in total and a dual-socket SPARC-based system with 512 logical CPUs in total) were similar, and are included in the longer version of the paper [8].

In all experiments, we vary the number of threads up to twice the capacity of each machine. We do not pin threads to cores, relying on the OS to make its choices. In all experiments, we employ a scalable memory allocator [1]. We disable turbo mode to avoid its effect on the results, which varies with the number of threads. Each reported experiment was run 3 times in exactly the same configuration, and the presented results are the average of those 3 runs.

6.1 AVL Tree Microbenchmark

The microbenchmark uses a sequential AVL tree implementation protected by a single lock. The tree supports the API of a key-value map, including operations for inserting, removing and looking up keys (and associated values) stored in the tree. After an initial warmup, not included in the measurement interval, all threads are synchronized to start running at the same time, and apply tree operations chosen uniformly at random from the given distribution, with keys chosen uniformly at random from the given range. At the end of the measurement period (lasting 10 s), the total number of operations is calculated, and the throughput is reported. The reported results are for a key range of 4096 and threads performing \(80\%\) lookup operations, while the rest is split evenly between inserts and removes. The tree is pre-initialized to contain roughly half of the key range. Finally, the microbenchmark allows controlling the amount of external work, i.e., the duration of the non-critical section (simulated by a pseudo-random number calculation loop). In this experiment, we use a non-critical section duration that allows scalability up to a small number of threads.

The absolute performance of the AVL tree benchmark (in terms of total throughput) with several locks is shown in Fig. 3. Figure 3(a) and (b) show how the popular MCS lock [21] performs without GCR, with GCR and with GCR-NUMA, and how those locks compare to the recent Malthusian lock [7], which implements a concurrency restriction mechanism directly in the MCS lock. Locks in Fig. 3(a) employ the spinning waiting policy, while those in Fig. 3(b) employ the spin-then-park policy. In addition, Fig. 3(c) and (d) compare the performance achieved with the simple Test-Test-Set (TTAS) lock and the POSIX pthread mutex lock, respectively, when used without GCR, with GCR and with GCR-NUMA. The concurrency restriction mechanism of the Malthusian lock cannot be applied directly to the simple TTAS or POSIX pthread mutex locks, so we do not include a Malthusian variant in those two cases.

With the spinning policy (Fig. 3(a)), GCR has a small detrimental effect (\(2\%\) slowdown for a single thread, and in general, at most \(12\%\) slowdown) on the performance of MCS as long as the machine is not oversubscribed. This is because all threads remain running on their logical CPUs and the lock handoff is fast, while GCR introduces a certain (albeit small) overhead. The Malthusian lock performs similarly to (but worse than) GCR. MCS with GCR-NUMA, however, tops the performance chart, as it limits the amount of cross-socket communication incurred by all other locks when the lock is handed off between threads running on different sockets. The performance of the MCS and Malthusian locks plummets once the number of running threads exceeds the capacity of the machine. At the same time, GCR (and GCR-NUMA) is not sensitive to that, as it parks excess threads, preserving the overall performance. In the case of GCR-NUMA, for instance, this performance is close to the peak achieved with 10 threads.

The MCS and Malthusian locks with the spin-then-park policy exhibit a different performance pattern (Fig. 3(b)). Specifically, the former shows poor performance at relatively low thread counts. This is because as the number of threads grows, the waiting threads stop spinning and park, adding the overhead of unparking to each lock handoff. The Malthusian lock, with its concurrency restriction mechanism, avoids that. Yet, its performance is slightly worse than that of MCS with GCR. Once again, MCS with GCR-NUMA easily beats all other contenders.

Fig. 3. Throughput results for several popular locks (AVL tree).

In summary, the results in Fig. 3(a) and (b) show that despite being generic, the concurrency restriction mechanism of GCR outperforms that of the specialized Malthusian lock. Moreover, unlike with the Malthusian lock, the choice of a waiting policy for the underlying lock becomes much less crucial when GCR (or GCR-NUMA) is used.

The TTAS and pthread mutex locks exhibit yet another performance pattern (Fig. 3(c) and (d)). Similarly to the MCS spin-then-park variant, their performance drops at low thread counts; however, they manage to maintain reasonable throughput even as the number of threads grows. In both cases, the GCR and GCR-NUMA variants mitigate the drop in performance.

We also ran experiments in which we measured the handoff time for each of the locks presented in Fig. 3, that is, the interval between a timestamp taken right before the current lock holder calls Unlock() and one taken right after the next lock holder returns from Lock(). Previous work has shown that the performance of a parallel system is dictated by the length of its critical sections [11], which is composed of the time required to acquire and release the lock (captured by the handoff data) and the time a lock holder spends in the critical section. Indeed, the data in Fig. 4 shows a correlation between the achieved throughput and the handoff time. That is, in all cases where the throughput of a lock degrades in Fig. 3, the handoff time increases. At the same time, GCR (and GCR-NUMA) manages to maintain a nearly constant handoff time across virtually all thread counts.
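Under this definition, the handoff time could be measured roughly as sketched below; handoff_start is a hypothetical shared variable that only the current lock holder writes, so it needs no extra synchronization of its own.

    #include <time.h>

    static struct timespec handoff_start; /* written by the outgoing lock holder */

    /* outgoing holder: take a timestamp right before calling Unlock() */
    void before_unlock(void) {
        clock_gettime(CLOCK_MONOTONIC, &handoff_start);
    }

    /* next holder: compute the handoff interval right after Lock() returns */
    long after_lock_ns(void) {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - handoff_start.tv_sec) * 1000000000L
             + (now.tv_nsec - handoff_start.tv_nsec);
    }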

Fig. 4. Handoff time for the MCS (spin) and Test-Test-Set locks (AVL tree).

Fig. 5. Total throughput measured with multiple instances of the microbenchmark, each run with 40 threads.

In a different experiment, we run multiple instances of the microbenchmark, each configured to use a number of threads equal to the number of logical CPUs (40). This illustrates the case where an application with a configurable number of threads chooses to set that number based on the machine capacity (as typically happens by default, for instance, in OpenMP framework implementations). Figure 5 presents the results for two of the locks. Both GCR and GCR-NUMA scale well up to 4 instances for all tested locks. Except for pthread mutex (not shown), all locks without GCR (or GCR-NUMA) exhibit greatly reduced performance, especially when the number of instances is larger than one (which is when the machine is oversubscribed). Pthread mutex fares relatively well, although its single-instance performance is worse than that of several other locks in this experiment.

It is natural to ask how the fairness of each lock is affected once the GCR mechanism is used. In the longer version of the paper [8], we demonstrate that GCR can, in fact, improve long-term fairness, i.e., the distribution of the total number of operations performed by threads over a time interval. This is because some locks can be grossly unfair, mainly due to caching effects. That is, if multiple threads attempt to acquire the lock at the same time, the thread on the same core or socket as the previous lock holder is likely to win, as it has the lock word in its cache. GCR restricts the number of threads competing for the lock and shuffles those threads periodically, achieving long-term fairness. Interestingly, GCR-NUMA achieves even better fairness, as it picks active threads from the same socket. Thus, it reduces the chance that the same thread(s) will acquire the lock repeatedly while a thread on a different socket fails to do so due to expensive remote cache misses.

6.2 Kyoto Cabinet

We report on our experiments with the Kyoto Cabinet [12] kccachetest benchmark run in the wicked mode, which exercises an in-memory database. Similarly to [7], we modified the benchmark to use the standard POSIX pthread mutex locks, which we interpose with locks from the LiTL library. We also modified the benchmark to run for a fixed time and report the aggregate work completed. Finally, we fixed the key range at a constant (10M) number of elements. (Originally, the benchmark set the key range based on the number of threads.) All those changes were also applied to Kyoto in [7] to allow a fair comparison of performance across different thread counts. The length of each run was 60 s.

Kyoto employs multiple locks, each protecting a slot comprising a number of buckets in a hash table; the latter is used to implement the database [12]. Given that the wicked mode exercises the database with random operations and random keys, one should expect a lower load on each of the multiple slot locks compared to the load on the central lock protecting access to the AVL tree in the microbenchmark above. Yet, Kyoto provides a view of how GCR behaves in a real application setting.

The results are presented in Fig. 6, where we run GCR and GCR-NUMA on top of the 24 locks provided by LiTL. A cell at row X and column Y represents the throughput achieved with Y threads when GCR (GCR-NUMA, respectively) is used on top of lock X, divided by the throughput achieved when lock X is used by itself (i.e., without GCR or GCR-NUMA). Shades of red represent slowdown (speedup below 1, which in virtually all cases falls in the range [0.8, 1), i.e., less than 20% slowdown), while shades of green represent positive speedup; the intensity of the color represents how substantial the slowdown/speedup is. Both GCR and GCR-NUMA deliver robust gains (at times, over 1000x), and those gains start, for virtually all locks, even before the machine becomes oversubscribed.

Fig. 6. Speedup achieved by GCR and GCR-NUMA over various locks (Kyoto).

In the longer version of the paper [8], we also present results for LevelDB, an open-source key-value storage library [18]. The LevelDB results largely echo those for Kyoto, and lead to the same high-level conclusion as the other benchmarks: increased lock contention leads to larger speedups achieved by GCR and GCR-NUMA.

7 Conclusion

We have presented GCR, a generic concurrency restriction mechanism, and GCR-NUMA, its extension to NUMA settings. GCR wraps any underlying lock and controls which threads are allowed to compete for its acquisition. The idea is to keep the lock saturated by as few threads as possible, while parking all other, excess threads that would otherwise compete for the lock, create contention and consume valuable system resources. Extensive evaluation with more than two dozen locks shows substantial speedup achieved by GCR on various systems and benchmarks; the speedup grows even larger when GCR-NUMA is used.