1 Introduction

Correctness in task parallel programs is guaranteed only if the programs are free of data races. A data race is a pair of concurrent accesses to a shared memory location where at least one access writes to the location. The order in which these accesses occur can change the outcome of the program, creating nondeterminism.

Structured parallelism sometimes takes the form of lightweight tasks. Languages such as Habanero Java, OpenMP, and Erlang encourage the use of new tasks for operations that can be done independently of other tasks. As a result, many programs written in this family of languages use large numbers of threads. In some cases, the number of threads cannot be statically bounded.

Data races are usually undesirable, and much work has gone into detecting them automatically and efficiently with static analysis. However, static techniques often report too many false positives to be effective tools in practice. Precise data race detection for a single input can be achieved dynamically. Many dynamic techniques use access histories (shadow memory) to track accesses to shared memory locations.

Vector clocks [12, 22] are an efficient implementation of shadow memory. One analysis based on vector clocks is capable of reasoning about multiple schedules from a single trace [17]. Its complexity is linear if the number of threads and locks used is constant. Vector clocks have been extended to more efficient representations for recursively parallel programs [1, 6] that yield improved empirical results. In all of these cases, the complexity of vector clock algorithms is sensitive to the number of threads used.

When programs are restricted to structured parallelism, shadow memory can reference a computation graph that encodes which events are concurrent. This allows the size of shadow memory to be independent of the number of threads in the program.

The SP-bags algorithm [11], which has been extended to task parallel languages with futures [26], detects data race by executing a program in a depth first fashion and tracking concurrent tasks. Other extensions enable locks [5] and mutual exclusion [25], but can produce false positives.

As an alternative to shadow memory, each task can maintain a set of the shared memory locations it has accessed. Lu et al. [20] created TARDIS, a tool that detects data race in a computation graph by intersecting the access sets of concurrent nodes. TARDIS is more efficient for programs with many sequential events or where many accesses are made to the same shared memory location. However, TARDIS does not reason about mutual exclusion.

The computation graph analysis by Nakade et al. [24] for recursively task parallel programs can reason precisely about mutual exclusion. By model checking, it is capable of proving or disproving the presence of data race over all possible schedules. Its algorithm for detecting data races in a single trace is direct, albeit naïve: it computes the transitive closure of the computation graph. This quadratic algorithm admits significant improvement.

This paper presents such an improvement in the form of Zipper, a new algorithm for detecting data races on computation graphs. Zipper maintains precision while exploiting mutual exclusion to improve the efficiency of the computation graph analysis. Its asymptotic time complexity is superior to that of vector clock implementations when the number of threads is large. This paper also presents an implementation of the algorithm and a comparison with the naïve computation graph algorithm. The implementation is an addition to the code base published by Nakade et al., which allows for a direct comparison of the two.

In summary, the contributions of this paper are:

  • An algorithm for identifying data races in the framework of Nakade et al.,

  • A discussion of the relative merits of vector clocks and this algorithm, and

  • An empirical study of both the naïve computation graph algorithm and the optimized Zipper algorithm.

The structure of this paper is as follows: Sect. 2 discusses computation graphs and SP-bags. Section 3 presents the Zipper algorithm, and demonstrates the algorithm on a small graph. Section 4 contains an empirical study that compares Zipper to the original computation graph algorithm. Section 5 details work related to this paper. Section 6 concludes.

2 Background

2.1 Programming Model

The surface syntax for the task parallel language used in this paper is based on the language used by Nakade et al. [24] and is given in Fig. 1. A program P is a sequence of procedures, each of which takes a single parameter l of type L. The body of a procedure is inductively defined by s. The expression language, e, is elided.

Fig. 1. The surface syntax for task parallel programs.

The \(\mathbf {async}\), \(\mathbf {finish}\), and \(\mathbf {isolated}\) statements have interprocedural effects that influence the shape of the computation graph. The remaining statements have their usual sequential meaning. The \(\mathbf {async}\)-statement calls a procedure p asynchronously with argument e. The \(\mathbf {finish}\)-statement waits until all tasks initiated within its dynamic scope terminate. The \(\mathbf {isolated}\)-statement executes its body in mutual exclusion with all other isolated regions.

This programming model disallows task passing and therefore does not capture some concurrent language constructs like futures. Futures can result in non-strict computation graphs that Zipper is not currently able to analyze. Related work for structured parallelism can reason about programming models that include futures [20, 26] but cannot reason about isolation. ESP-bags can reason about isolated regions, but only when they commute [25], as discussed in Sect. 2.3.

Restrictions on task passing are not unique to this programming model. Task parallel languages usually restrict task passing in some way in order to ensure deadlock freedom. For example, Habanero Java [4] requires futures to be declared final, and Deterministic Parallel Ruby [20] requires futures to be completely independent and to deep copy their arguments. Extending Zipper to handle task passing is the subject of future research. Despite this restriction, the collection of concurrent constructs treated in this paper is sufficient to simulate a wide range of functionality common in modern task parallel languages.

2.2 Computation Graph

A computation graph for a task parallel program is a directed acyclic graph representing the concurrent structure of one program execution [7]. The edges in the graph encode the happens-before relation [19] over the set of nodes: \(\prec \subset N\times N\). There is a data race in the graph if and only if there are two nodes, \(n_i\) and \(n_j\), such that the nodes are concurrent, (\(n_i \mid \mid _{\prec } n_j \equiv n_i \nprec n_j \wedge n_j \nprec n_i\)), and the two nodes conflict:

$$\begin{aligned} conflict (n_i,n_j) = \begin{array}{l} \rho (n_i) \cap \omega (n_j) \ne \emptyset \ \vee \\ \rho (n_j) \cap \omega (n_i) \ne \emptyset \ \vee \\ \omega (n_i) \cap \omega (n_j) \ne \emptyset , \end{array} \end{aligned}$$
(1)

where \(\rho (n)\) and \(\omega (n)\) are the sets of read and write accesses recorded in n.
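For concreteness, the conflict predicate of Eq. (1) amounts to three set intersections. The following Java sketch realizes it directly; the class name, method names, and the use of strings to identify memory locations are illustrative assumptions, not part of the jpf-hj implementation.

  import java.util.Set;

  final class ConflictSketch {
      // conflict(ni, nj): some location is written by one node and read or
      // written by the other (Eq. 1).
      static boolean conflict(Set<String> readsI, Set<String> writesI,
                              Set<String> readsJ, Set<String> writesJ) {
          return intersects(readsI, writesJ)
              || intersects(readsJ, writesI)
              || intersects(writesI, writesJ);
      }

      // True if the two sets share at least one element; iterate the smaller set.
      static boolean intersects(Set<String> a, Set<String> b) {
          Set<String> small = a.size() <= b.size() ? a : b;
          Set<String> large = (small == a) ? b : a;
          for (String location : small) {
              if (large.contains(location)) return true;
          }
          return false;
      }
  }

A race is reported only when two nodes both conflict and are concurrent in the graph.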

In order to prove or disprove the presence of data race in a program that uses mutual exclusion, a model checker must be used to enumerate all reorderings of critical sections [24]. For each reordering, a different computation graph is generated that must be checked for data race. The main contribution of this paper is an algorithm that can efficiently check a computation graph for data race without reporting false positives even in the presence of mutual exclusion.

2.3 The SP-Bags Algorithm

The SP-bags algorithm can check a computation graph for data race with a single depth first traversal. ESP-bags [25] generalizes SP-bags to task parallel models similar to the one in Fig. 1. However, it can report false positives when the isolated regions in a program do not commute. To demonstrate this limitation, an example is given where it reports a race when in fact there is none.

ESP-bags maintains shadow memory for each shared memory location. The reader and writer shadow spaces record the last relevant task to read or write the location. Similar shadow spaces exist for isolated regions. To check an access for data race, one must determine whether the last task to access a location is executing concurrently with the current task. Tasks that are executing concurrently with the current task are stored in “P-bags”; serialized tasks are stored in “S-bags”. Checking an access for data race therefore reduces to checking whether the last task to access the location is in a P-bag.

When a task is created with \(\mathbf {async}\), its S-bag is created containing only the task itself and its P-bag is empty. When the task completes and returns to its parent, its S-bag and P-bag are emptied into its parent’s P-bag. When a \(\mathbf {finish}\) block completes, the contents of its S-bag and P-bag are emptied into its parent’s S-bag.
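The bag bookkeeping described above can be summarized in the following sketch. The Bags class and the single-bag race check are simplifications made for illustration; the actual ESP-bags algorithm consults the bags of all procedure instances on the current call chain.

  import java.util.HashSet;
  import java.util.Set;

  final class Bags {
      final Set<Integer> sBag = new HashSet<>();  // tasks serialized with the owner
      final Set<Integer> pBag = new HashSet<>();  // tasks parallel with the owner
  }

  final class EspBagsSketch {
      // async: the new task's S-bag starts with the task itself; its P-bag is empty.
      static Bags spawn(int taskId) {
          Bags bags = new Bags();
          bags.sBag.add(taskId);
          return bags;
      }

      // A task returning to its parent drains both of its bags into the parent's P-bag.
      static void taskCompletes(Bags task, Bags parent) {
          parent.pBag.addAll(task.sBag);
          parent.pBag.addAll(task.pBag);
      }

      // A finish block drains both of its bags into its parent's S-bag on completion.
      static void finishCompletes(Bags finish, Bags parent) {
          parent.sBag.addAll(finish.sBag);
          parent.sBag.addAll(finish.pBag);
      }

      // An access races if the last task recorded in shadow memory for the
      // location sits in a P-bag (simplified here to a single bag).
      static boolean races(int lastAccessor, Bags current) {
          return current.pBag.contains(lastAccessor);
      }
  }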

Fig. 2. A simple example of a task parallel program.

2.4 Example

Figure 2 contains a program whose critical sections do not commute and therefore cause ESP-bags and similar algorithms to report races when there are none. There are two isolated blocks in the program. If the isolated block in procedure main executes first, then the shared variable x is never written. Otherwise, x is written and the isolated block in p happens before the isolated block in main. Because the happens-before relation is transitive, the read of x in main becomes ordered after the write in p and there is no race.

Table 1 shows the state of the ESP-bags algorithm as it executes the program. Only rows that contain state changes are listed. The thread that executes procedure main is labeled as \(T_1\) and the thread that executes procedure p is labeled \(T_2\). The only finish block is labeled \(F_1\).

Table 1. ESP-bags state through Fig. 2

The first row shows the initial state of the algorithm. The next two rows show the initialization of the \(T_1\) and \(T_2\) S-bags. On line fifteen, the shared variable x is written because of the order in which ESP-bags executes the critical sections in the program. When \(T_2\) completes and the algorithm returns to the \(\mathbf {finish}\) block that spawned it, \(T_2\) is placed in the P-bag of \(F_1\), signifying that it will be in parallel with all subsequent statements in the \(\mathbf {finish}\) block. This is the state in effect when x is read outside of an isolated block on line nine. Here ESP-bags reports a race because the last isolated writer is in a P-bag. This is a false positive.

ESP-bags is an efficient algorithm, but its imprecision makes it unsuitable when false positives are unacceptable. The goal of the computation graph analysis is to precisely prove or disprove the absence of data race. As such, a comparison of the efficiency of ESP-bags with Zipper is not given in this work.

3 The Zipper Algorithm

The algorithm presented by Nakade et al. [24] checks every node against every other node. While effective, this approach is inefficient. This paper presents the Zipper algorithm, which is more efficient but still sound. Zipper performs a depth-first search over non-isolation edges in the computation graph, capturing the serialization imposed by isolation. Section 3.1 describes the variables the algorithm maintains and their semantics. Section 3.2 presents the algorithm as several procedures. Lastly, Sect. 3.3 shows an example execution of the algorithm.

3.1 Definitions

Integers have lowercase names and collection names (arrays and sets) are capitalized. Their types are as follows:

  • \( Z {\downarrow }\), \( Z {\uparrow }\): Array of sets of node IDs; indices correspond to isolation nodes

  • \( slider {\downarrow }\), \( slider {\uparrow }\), \( next\_branch\_id \), \( next\_bag\_id \): Integer

  • \( Z {\downarrow _{\lambda }}\), \( Z {\uparrow _{\lambda }}\), \( S \), \( I \): Array of sets of node IDs; indices correspond to branch IDs

  • \( C \): Array of sets of pairs of node IDs; indices correspond to branch IDs

  • \( R \): Set of node IDs

  • \(B_{\lambda }\): Array of sets of branch IDs; indices correspond to bag IDs

  • \(B_{i}\): Array of sets of isolation indices; indices correspond to bag IDs

Zippers encode serialization. The isolation zipper \( Z \) captures serialization with respect to isolation nodes. The “lambda” zipper \( Z _{\lambda }\) captures, for a particular task, the nodes that are not in the isolation zipper.
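Collecting the definitions above, the analysis state can be written down roughly as follows. The field names mirror the symbols above; representing the arrays as maps keyed by isolation index, branch ID, or bag ID, the NodePair record, and the initial slider values are assumptions made for illustration, not the jpf-hj data structures.

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  // A candidate pair; the first component is the node on the branch
  // currently being traversed.
  record NodePair(int current, int other) {}

  final class ZipperState {
      // Z-down and Z-up: isolation-node index -> node IDs serialized by that node.
      final Map<Integer, Set<Integer>> zDown = new HashMap<>();
      final Map<Integer, Set<Integer>> zUp = new HashMap<>();

      // Sliders bounding which isolation indices still need to be checked
      // (initial values are an assumption).
      int sliderDown = -1;
      int sliderUp = Integer.MAX_VALUE;

      int nextBranchId = 0;
      int nextBagId = 0;

      // Lambda zippers, per-branch node sets, and per-branch isolation nodes,
      // all keyed by branch ID.
      final Map<Integer, Set<Integer>> zDownLambda = new HashMap<>();
      final Map<Integer, Set<Integer>> zUpLambda = new HashMap<>();
      final Map<Integer, Set<Integer>> s = new HashMap<>();
      final Map<Integer, Set<Integer>> i = new HashMap<>();

      // Candidate pairs discovered on the downward pass, keyed by branch ID.
      final Map<Integer, Set<NodePair>> c = new HashMap<>();

      // Async nodes whose children have all been traversed.
      final Set<Integer> r = new HashSet<>();

      // Branch IDs and isolation indices grouped by bag ID.
      final Map<Integer, Set<Integer>> bLambda = new HashMap<>();
      final Map<Integer, Set<Integer>> bIso = new HashMap<>();
  }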

Traversal begins at the topmost \(\mathbf {async}\) node. Whenever an \(\mathbf {async}\) node is visited, each branch is given a new branch ID (tracked with \( next\_branch\_id \)) and the branches are queued for evaluation in arbitrary order. As the algorithm traverses downward, the visited node IDs are added to the set at the index of the branch ID in the \( S \) array. This continues until an isolation node with an outgoing isolation edge is visited. Then the contents of the \( S \) set for the current branch ID are emptied into the set at the index of that isolation node in the down zipper \( Z {\downarrow }\). Once a wait node is visited on the way down, all nodes in the \( S \) set for the current branch ID are emptied into the set at the index of the current branch ID in the down lambda zipper \( Z {\downarrow _{\lambda }}\). The process is then performed again on the way up, except that isolation nodes with incoming isolation edges trigger a dump from \( S \) into \( Z {\uparrow }\). Additionally, when an \(\mathbf {async}\) node is reached, the \( S \) set for the current branch is emptied into the up lambda zipper \( Z {\uparrow _{\lambda }}\).

When an isolated node is visited, its ID is also placed into the set at the index of the current branch ID in \( I \), creating an easily accessible mapping from branch ID to the isolated nodes on that branch. The ready set \( R \) identifies which async nodes have had all of their children traversed; after the last branch of an async node is traversed, the async node ID is placed into \( R \) to signify that the algorithm can continue with the corresponding wait node, since all children have been processed. Any time an isolation node is visited on the way down, \( slider {\downarrow }\) is set to that isolation node’s index; similarly, \( slider {\uparrow }\) is set to the index of isolation nodes seen on the way up. The sliders restrict data race checks to the fewest possible nodes.
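The downward bookkeeping for a single non-async node can then be sketched as follows, using the ZipperState sketch above. The Node interface and the visitDown procedure are illustrative simplifications of the prose, not the other_node listing of Sect. 3.2; the upward pass mirrors this logic with \( Z {\uparrow }\), \( Z {\uparrow _{\lambda }}\), and \( slider {\uparrow }\).

  import java.util.HashSet;

  // Minimal view of a computation graph node, assumed for this sketch.
  interface Node {
      int id();
      boolean isIsolated();
      boolean isWait();
      int isolationIndex();                 // only meaningful for isolated nodes
      boolean hasOutgoingIsolationEdge();
  }

  final class ZipperTraversalSketch {
      // Record the node in S, advance the slider at isolation nodes, and dump S
      // into the appropriate zipper at isolation and wait nodes.
      static void visitDown(ZipperState st, int branchId, Node n) {
          st.s.computeIfAbsent(branchId, k -> new HashSet<>()).add(n.id());

          if (n.isIsolated()) {
              st.i.computeIfAbsent(branchId, k -> new HashSet<>()).add(n.id());
              st.sliderDown = n.isolationIndex();
              if (n.hasOutgoingIsolationEdge()) {
                  // Nodes seen so far are serialized before later isolation nodes.
                  st.zDown.computeIfAbsent(n.isolationIndex(), k -> new HashSet<>())
                          .addAll(st.s.get(branchId));
                  st.s.get(branchId).clear();
              }
          } else if (n.isWait()) {
              // Remaining nodes are ordered only within this branch.
              st.zDownLambda.computeIfAbsent(branchId, k -> new HashSet<>())
                            .addAll(st.s.get(branchId));
              st.s.get(branchId).clear();
          }
      }
  }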

When returning to an async node, the set in \( I \) at the current branch ID is emptied into \(B_{i}\) at the current bag ID. The current branch ID is also placed into the set at the current bag ID index in \(B_{\lambda }\). Together, \(B_{i}\) and \(B_{\lambda }\) identify nodes that are concurrent with the current node and not serialized from it by an isolation edge.

On the way down a branch, each time a node is visited the \(p\_bag\_id\) is used to index into \(B_{\lambda }\) and \(B_{i}\). Each branch ID in the \(B_{\lambda }\) set at the \(p\_bag\_id\) index is used to index into \( Z {\downarrow _{\lambda }}\) to obtain node IDs that are possibly in parallel with the current node. Each pair of possibly parallel nodes is placed into the set located in the \( C \) array at the current branch ID. A similar process is used with \(B_{i}\) and \( Z {\downarrow }\); however, only indices larger than \( slider {\downarrow }\) that are in the \(B_{i}\) set at the \(p\_bag\_id\) index are paired and placed in \( C \).

On the way up, the same process is followed, except that \( Z {\uparrow }\) and \( Z {\uparrow _{\lambda }}\) are used, and only indices smaller than \( slider {\uparrow }\) are used when indexing into \( Z {\uparrow }\). Also, node pairs that are possibly in parallel are not placed in \( C \); instead, they are checked against \( C \). A node pair discovered in both the downward and upward traversals is actually in parallel and is checked with the conflict predicate of Sect. 2.2.
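Under the same assumptions, the pairing logic reads roughly as follows, reusing the ZipperState and NodePair declarations from the earlier sketch. Again, this is a sketch of the prose description rather than the checkDown and checkUp listings of Sect. 3.2, and reportIfConflicting is a hypothetical stand-in for the conflict check of Eq. (1).

  import java.util.HashSet;
  import java.util.Set;

  final class ZipperCheckSketch {
      // Downward pass: collect candidate pairs (current node, possibly-parallel node).
      static void checkDown(ZipperState st, int branchId, int pBagId, int nodeId) {
          Set<NodePair> candidates = st.c.computeIfAbsent(branchId, k -> new HashSet<>());
          for (int otherBranch : st.bLambda.getOrDefault(pBagId, Set.of())) {
              for (int other : st.zDownLambda.getOrDefault(otherBranch, Set.of())) {
                  candidates.add(new NodePair(nodeId, other));
              }
          }
          for (int isoIndex : st.bIso.getOrDefault(pBagId, Set.of())) {
              if (isoIndex > st.sliderDown) {            // not serialized by isolation
                  for (int other : st.zDown.getOrDefault(isoIndex, Set.of())) {
                      candidates.add(new NodePair(nodeId, other));
                  }
              }
          }
      }

      // Upward pass: a pair seen on both passes is truly parallel, so its access
      // sets are intersected. The analogous loop over bLambda and zUpLambda is elided.
      static void checkUp(ZipperState st, int branchId, int pBagId, int nodeId) {
          Set<NodePair> candidates = st.c.getOrDefault(branchId, Set.of());
          for (int isoIndex : st.bIso.getOrDefault(pBagId, Set.of())) {
              if (isoIndex < st.sliderUp) {
                  for (int other : st.zUp.getOrDefault(isoIndex, Set.of())) {
                      if (candidates.contains(new NodePair(nodeId, other))) {
                          reportIfConflicting(nodeId, other);
                      }
                  }
              }
          }
      }

      // Placeholder: would intersect the two nodes' read/write sets per Eq. (1).
      static void reportIfConflicting(int nodeA, int nodeB) { }
  }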

3.2 The Algorithm

The top level of the algorithm is the \( recursive\_analyze \) function. Before it is invoked, several variables are declared and initialized:

(Listing a: declaration and initialization of the global variables.)

\( recursive\_analyze \) relies on three helpers. \( async\_node \) analyzes nodes with \(\mathbf {async}\)-statements, \( wait\_node \) analyzes nodes that terminate \(\mathbf {finish}\)-statements, and \( other\_node \) analyzes other nodes:

(Listings b–d: the async_node, wait_node, and other_node procedures.)

Lastly, \( checkDown \) and \( checkUp \) are used for identifying data races:

(Listings e–f: the checkDown and checkUp procedures.)

3.3 Zipper Example

Figure 3 shows a computation graph that illustrates the Zipper algorithm; each step of the algorithm is given in Table 2. The Node column in Table 2 lists the visited nodes in traversal order. The other columns refer to the global variables and their values at each step. For brevity, empty sets in the \( Z {\downarrow }\), \( Z {\uparrow }\), \( Z {\downarrow _{\lambda }}\), and \( Z {\uparrow _{\lambda }}\) arrays are omitted and nonempty sets are preceded by their index. Additionally, the \( S \) column shows the set at the current branch ID rather than the entire \( S \) array.

Fig. 3. An example computation graph.

At node a in Fig. 3 all variables are initialized to empty; async_node is then called, which calls recursive_analyze on line 21. Node b is then visited in other_node and added to the \( S \) set at the current branch ID. Node b is then checked for conflicts at line 12 in other_node; however, the \(B_{i}\) and \(B_{\lambda }\) sets at \(p\_bag\_id\) are empty, so no operation takes place. This is true for the entirety of the first branch; data race checks are performed while traversing the second branch. The traversal then recursively visits the child node c. Node c calls other_node and is an isolated node, so \( slider {\downarrow }\) is set to the index of c (0) on line 5. Node c is also added to \( S \), which is emptied into \( Z {\downarrow }\) at index 0 on lines 7 and 8. Recursion continues until node g. The slider is set to the index of g, but \( S \) is not emptied because the isolation edge is not outgoing, as shown on line 6 in other_node. Execution continues to node i until the wait node j is visited and wait_node is called.

Table 2. Step-by-step Zipper algorithm

In wait_node the \( S \) set is emptied into \( Z {\downarrow _{\lambda }}\) at the current branch ID on lines 8 and 9. Node i is placed in \( S \) on line 2 in other_node, and then recursion returns to node g. The \( S \) set is then emptied into \( Z {\uparrow }\) at index 3 (the index of g) on lines 21 and 22. Execution continues in a similar fashion until it arrives back at node a. Note that since execution is returning up the first branch, the \( S \) set is not emptied at node c, which has an outgoing isolation edge (\( S \) empties on the way down at isolation nodes with outgoing edges and on the way up at those with incoming edges). It is important to note that c and b are in \( Z {\uparrow _{\lambda }}\) at the index of the current branch. A recursive call is then made to traverse the second branch in the same manner as the first, except that data race checks are performed. The data race checks are not shown in the table, but they appear in checkDown and checkUp and are described in Sect. 3.1.

4 Implementation and Results

4.1 Methods

The implementation of Zipper is an addition to the implementation provided by Nakade et al. [24]. The original implementation is available at https://jpf.byu.edu/jpf-hj/. The benchmarks referenced in their paper, which are available as part of the same repository, provide the basis for comparing the two algorithms.

The results of the comparison of the two analyses are included in Table 3. In the table, the Benchmark column contains the name of the program used. The Nodes column contains the number of nodes in a computation graph. The Isolation and Race columns indicate, respectively, whether or not isolation and data race are present. The CG (ms) and Zipper (ms) columns contain the execution time for the respective analyses in milliseconds. Lastly, \(\frac{Zipper}{CG}\) is the ratio of the two time measurements.

Table 3. Comparison of the computation graph and Zipper analyses

The experimental results measure the time taken to model check over all possible isolation schedules and reason about each resulting computation graph. All experiments were run on an Intel(R) Xeon(R) Gold 5120 CPU with 8 GB RAM.

4.2 Analysis

Zipper performed better on every benchmark (except DoAll2OrigNo) whose computation graph had more than 37 nodes. For the benchmarks with 37 nodes or fewer, the Zipper analysis was slower except on IsolatedBlockNo and PrimeNumCounter; in all of these smaller cases, however, the two analyses’ execution times were virtually identical. As expected, the degree to which Zipper outperforms the computation graph analysis grows with the number of nodes.

While the size of the computation graph is the strongest predictor of relative runtime performance between the two analyses, other factors contribute to performance. For example, DoAll1OrigNo and DoAll2OrigNo produce computation graphs with identical structure. However, they differ in both number and placement of shared variable reads and writes in their respective nodes. As a result, the Zipper analysis executes in half the time that the computation graph analysis does on DoAll1OrigNo. On the other hand, the two analyses take about the same amount of time when analyzing DoAll2OrigNo.

The key difference between the Zipper algorithm and the CG algorithm is the work done to identify the nodes that must be checked for data race. The Zipper algorithm identifies the nodes that execute in parallel much more quickly than the CG algorithm. If the number of reads and writes is large in proportion to the number of nodes, the Zipper algorithm performs comparably to the CG algorithm, since both spend the majority of their time checking conflicting nodes for data race in the same way. Conversely, if there are relatively few reads and writes in proportion to the number of nodes, identifying the nodes that need to be checked accounts for a much larger share of the analysis time, and Zipper’s faster identification reduces the overall time. This makes Zipper more suitable for recursively parallel programs or any task parallel programs that use many lightweight threads.

4.3 Comparison with TARDIS, SP-Bags and Vector Clocks

Like SP-bags and TARDIS, the Zipper algorithm operates as a depth-first traversal of a computation graph that represents a single execution of the program. Zipper tracks reads and writes to shared memory locations in a set for each task and intersects these sets to check for races. However, Zipper does not union access sets and therefore performs more intersection operations than TARDIS. Unlike TARDIS and SP-bags, Zipper can reason about mutual exclusion, and it uses the scheduled order of isolated regions to reduce the number of intersection operations necessary to check a graph for races.

The vector clock algorithm by Kini et al. [17] checks a program execution for data race by comparing the vector clock of a shared memory location, recorded when it is accessed, with the current thread’s vector clock, ensuring that the last thread to access the location is not concurrent with the current thread. The vector clocks are updated after access events and synchronization events.
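For comparison, the core happens-before check in a vector clock race detector looks roughly like the following sketch. This is a generic formulation of the standard check, not the specific algorithm of Kini et al. [17]; the class and method names are illustrative.

  import java.util.Arrays;

  final class VectorClockSketch {
      // The previous access by thread accessThread (recorded at that thread's
      // clock value in accessClock) happens before the current operation of the
      // current thread exactly when that value is covered by currentClock.
      static boolean happensBefore(int[] accessClock, int accessThread, int[] currentClock) {
          return accessClock[accessThread] <= currentClock[accessThread];
      }

      // Componentwise join, performed at synchronization events; each join
      // touches every thread entry, which is where the sensitivity to the
      // number of threads comes from.
      static int[] join(int[] a, int[] b) {
          int[] joined = Arrays.copyOf(a, a.length);
          for (int t = 0; t < b.length; t++) {
              joined[t] = Math.max(joined[t], b[t]);
          }
          return joined;
      }
  }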

The vector clock algorithm takes \(O(N(L+T^2))\) time to analyze a trace, where N is the number of events in the trace, L is the number of locks used, and T is the number of threads. In the programming model used in this paper, L is always one. The algorithm is linear in the length of the trace for programs that use a small, bounded number of locks and threads.

It takes \(O(M(T+I))\) time for Zipper to analyze a computation graph and compute the pairs of nodes that are parallel with each other, where M is the number of nodes, T is the number of branches in the computation graph, and I is the number of isolated regions. Zipper must also intersect the access sets of \(O(M^2)\) pairs of nodes, each set containing up to K events. This makes the total complexity of Zipper \(O(M(T+I) + M^2K)\).

When a program repeats many accesses to the same shared memory locations, M and K can be much smaller than N, as TARDIS shows. In this case, Zipper is more efficient than vector clocks and can scale to larger programs. In addition, it may be possible to apply the ideas of TARDIS to Zipper in order to achieve a linear number of intersection and union operations.

5 Related Work

This work is an improvement upon the computation graph analysis by Nakade et al. [24]. Lu et al. [20] implement a similar analysis based on access sets in their tool TARDIS. TARDIS requires only a linear number of intersection and union operations to detect data race in a computation graph but does not support mutual exclusion.

Feng and Leiserson’s SP-bags algorithm [11] is a sound and complete data race detection algorithm for a single program execution, but it can only reason about a subset of task parallel programs that do not use locks. Work has been done to apply SP-bags to other task parallel models that use futures [26] or async, finish, and isolation constructs [25], with the limitations discussed in Sect. 2. The ALL-SETS and BRELLY algorithms [5] extend SP-bags to handle locks and enforce lock disciplines, but they can also report false positives when the execution order of critical sections changes the control flow of the program being verified. Other SP-bags implementations use parallelization to increase performance [2].

Mellor-Crummey [23] uses thread labels to determine whether two nodes in a graph are concurrent and gives a labeling scheme that bounds the length of labels to be proportional to the nesting level of parallel constructs. This work, however, does not treat critical sections at all.

Many algorithms for detecting data race are based on vector clocks, which map events to timestamps such that the partial order on events is preserved over the set of timestamps. The complexity of vector clock algorithms is sensitive to the number of threads used in a program. Fidge [12] modifies vector clocks to support dynamic creation and deletion of threads. Christiaens and Bosschere [6] developed vector clocks that grow and shrink dynamically as threads are created and destroyed. Flanagan et al. [13] replace vector clocks with more lightweight “epoch” structures where possible. Audenaert [1] presents clock trees that are also more suitable for programs with many threads; a typical operation on a clock tree takes time linear in the level of nested parallelism in the program. Kini et al. [17] present a vector clock algorithm that runs in linear time with respect to the number of events in the analyzed execution, assuming the number of threads and the number of locks used is constant. This assumption also fails in programs that use large numbers of lightweight threads.

This work relies on structured parallelism to reduce the cost of precise dynamic analysis. Structured parallelism is strict in how threads are created and joined; much as a locking protocol leads to static, dynamic, or hybrid lock-set analyses for data race detection, this structure enables analyses that are often more efficient than approaches to unstructured parallelism [9, 10, 28]. Unstructured parallelism defines no protocol for when and where threads can be created or joined. Data race detection for unstructured parallelism typically relies on static analysis to approximate parallelism and memory accesses [16, 18, 27] and then improves precision with dynamic analysis [3, 8, 14]. Other approaches reason about threads individually [15, 21]. Because the work in this paper relies heavily on structured parallelism, it is hard to compare directly to these more general approaches.

6 Conclusion

The computation graph analysis presented by Nakade et al. [24] is well suited to task parallel programs with isolation and lightweight threads. However, its admittedly direct algorithm for identifying data races is inefficient. The Zipper algorithm achieves the same soundness and completeness as the direct algorithm with significantly better asymptotic time complexity and empirical performance. In programs with many threads, its time complexity is also superior to that of vector clock implementations. This affords the computation graph analysis greater efficiency, enabling it to prove or disprove the presence of data race in larger and more complex task parallel programs.