1 Introduction

A data-race occurs when two concurrent executions access the same memory location and at least one of the two accesses is a write. It introduces non-determinism into the program execution as the behavior may depend on the order in which the concurrent executions access memory. Data-race is problematic because it is not possible to directly control or observe the run-time internals to know if a data-race exists, let alone enumerate program behaviors when one does.
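As a concrete illustration only (a plain-Java sketch, not one of the paper's figures or its task-parallel language), the following two threads race on the shared field shared, and the printed value depends on how the scheduler interleaves them:

    // Hypothetical minimal data-race: both threads perform an unsynchronized
    // read-modify-write of the same field, so the result may be 1 or 2.
    class Counter {
        static int shared = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> shared++);  // write to 'shared'
            Thread t2 = new Thread(() -> shared++);  // races with t1
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(shared);              // may print 1 or 2
        }
    }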

The data-race detection problem, given a program with its input, is to determine if there exists an execution containing a data-race. The research presented in this paper is concerned with proving data-race freedom for task-parallel models that impose structure on parallelism by constraining how threads are created and joined, and by constraining how shared memory is accessed (e.g., OpenMP, Cilk, X10, Chapel, Habanero, etc.). These models rely on run-time environments to implement task abstractions to represent concurrent executions [4, 7, 8, 21]. The language restrictions on parallelism and shared memory interactions enable properties like determinism (i.e., the computation is independent of the execution) or the ability to serialize (i.e., removing all task-related keywords yields a serial solution). Such properties only hold in the absence of data-race, which is not always the case since programmers, both intentionally and unintentionally, move outside the programming model.

Data-race detection in task-parallel models generally prioritizes performance and the ability to scale to many tasks over a proof of absence. The predominant SP-bags algorithm, with its variants, is a dynamic approach that exploits assumptions on task creation and joining for efficient on-the-fly detection with low overhead [2, 9, 15, 33, 36]; millions of tasks are feasible with varying degrees of slow-down (i.e., slow-down increases as parallelism constraints are relaxed) [34, 35]. SP-bags and its variants, however, do not work in the presence of mutual exclusion. Other approaches use access histories [30, 32], access sets [28], or programmer annotations [40]. Performance is a priority requiring careful integration into complex run-time environments, and solutions are often only complete, meaning that little can be concluded about other executions of the program on the same input.

The research presented in this paper reprises the data-race problem in task-parallel models with the intent to prove, via model checking, data-race freedom on a given input over all feasible executions, with support for mutual exclusion. Prior model checking-based solutions enumerate schedules that interleave conflicting accesses, meaning at least one access is a write, to shared variables [1, 17, 42]. In this way, the model checker schedules on every shared variable access to enumerate all interesting schedules on which a data-race might exist, but such an approach comes with an exponential cost in the number of schedules generated; these solutions quickly run out of resources even on very small toy programs that are data-race free. For input programs with data-race, performance depends on whether the scheduler is lucky enough to choose a schedule on which the data-race manifests before running out of resources.

The approach here rather uses techniques from dynamic approaches to build a happens-before relation, in the form of a computation graph, from a single observed program execution that is sufficient to prove data-race freedom in all executions that order mutually exclusive regions in the same way as the observed execution [20, 24, 27]. The happens-before relation is a partial order that induces an equivalence class of executions: the observed trace is representative of all other schedules that observe the same order on the mutually exclusive regions but perhaps order other concurrent actions differently.

Unlike dynamic approaches, though, the approach here further generates other program executions necessary to prove data-race freedom over all executions on the input. As a result, in the absence of mutual exclusion, a single program execution is sufficient to prove data-race freedom. In the presence of mutual exclusion, the model checker generates and checks all feasible orderings of the mutually exclusive regions to complete the proof. Underlying this contribution is the assumption that the program under test terminates; if that is not the case, then the research in this paper does not directly apply.

The research presented in this paper includes an empirical study of the proposed model checking algorithm for a Java implementation of Habanero with the Java Pathfinder model checker (JPF). Unlike prior solutions, this implementation uses an idealized verification run-time for Habanero rather than a production run-time, does not require internal modifications to JPF, and gives results about the input program that generalize to any language run-time implementation [1, 17, 42]. Results over several published benchmarks, comparing to JPF's default race detection using partial order reduction and to a task-parallel approach with permission regions, show the approach here to be more efficient, given JPF's inherent overhead.

Fig. 1 A program with data-race. a Task-parallel. b Habanero Java

An additional algorithm using vector clocks [16] is also implemented in JPF as part of the research presented in this paper and compared to the proposed model checking algorithm. This comparison shows that in the absence of data-race, the two approaches are comparable in terms of time and resources. In the presence of data-race, the vector clock algorithm, being on-the-fly, typically finds the data-race and terminates before the approach in this paper, which has to wait until the execution completes before doing the data-race analysis. That said, the results also show that the vector clock analysis runs out of memory for examples with many tasks and objects to track.

Of course, as with any model checking approach, the intent is to not scale to millions of parallel tasks with hundreds of mutually exclusive regions; rather, this research assumes that it is possible to provide input to any given program that results in hundreds of tasks and a few mutually exclusive regions. It further assumes that a data-race freedom proof on the small instance of the program, the one that results in hundreds of tasks and a few mutually exclusive regions, generalizes to a large instance of the same program with millions of tasks and many mutually exclusive regions. In other words, without changing the program text in any meaningful way, it is possible to give test input appropriate for model checking. The primary contributions are thus

  • a simple approach to data-race detection in programs that terminate based on creating a happens-before relation from an execution of a task-parallel program;

  • a proof that scheduling to interleave mutually exclusive regions is sufficient to prove data-race freedom; and

  • an implementation of the approach for Habanero Java in JPF, together with an implementation of an algorithm that uses vector clocks, both with results from benchmarks comparing to other solutions in JPF.

The rest of this paper is organized as follows: Sect. 2 illustrates the approach in a small example; Sect. 3 defines the computation graph and an algorithm to detect data-race on a computation graph; Sect. 4 defines the language and semantics of parallel tasks and shows how to build a computation graph from a program execution; Sect. 5 gives a correctness proof; Sect. 6 presents the model checking algorithm and proves it generates all interesting schedules over mutual exclusion; Sect. 7 is the empirical study with a summary of the implementation; Sect. 8 briefly discusses related work; and Sect. 9 is the conclusion with future work.

2 Example

The approach to data-race detection in this paper is illustrated with a very simple example. Consider the task-parallel program in Fig. 1a. The language used is defined in this paper with a formal semantics to facilitate proofs but has a direct expression in most task-parallel languages. For example, Fig. 1b shows the equivalent program in the Habanero Java language.

For Fig. 1, execution begins with the procedure m. The variable g is global. The \(\mathbf async ~\)statement creates a new asynchronous task running procedure p passing 0 for its parameter. The task handle is stored in the region r, also global, and when that task completes and joins with its parent m, it runs a default \(\lambda \)-expression as a return-value handler. In this case, that handler is defined to be the no-op skip.

The isolated statement runs the statement in its scope in mutual exclusion to other isolated statements. The await statement joins all tasks in region r with the task that issued the await. The issuer may join with a task in the region if that task has evaluated the expression in its return statement. That value, at the join, is passed to the return-value handler in the parent context. The parent blocks at the await statement until it has joined with all tasks in the indicated region.

The program in Fig. 1 has a schedule dependent data-race. If the scheduler runs the isolated statement in procedure p before the isolated statement in procedure m then there is a write–write data-race; otherwise, there is no data-race.

Related work in model checking task-parallel languages enumerates schedules to interleave the mutual exclusion and to interleave unprotected shared memory access leading quickly to state explosion [1, 17, 42]. These approaches use the happens-before relation to detect data-race but not to reduce the number of considered schedules—every schedule is checked.

The approach in this paper exploits so-called partial order analyses to reduce the number of schedules that must be checked to prove data-race freedom. The approach uses the simple happens-before partial order, [27], but is easily extended to something like weak causally precedes to further reduce checked schedules [20, 24]. Unlike other partial-order approaches though, a sufficient set of schedules is checked to prove data-race freedom on the given input.

The approach dynamically detects shared memory accesses and uses the language semantics to capture the happens-before relation in the form of a computation graph during execution. Figure 2b shows the computation graph for the data-race free schedule of the example program. Every node represents a block of sequential operations, and edges order the nodes. The thick a-labeled line is the result of the \(\mathbf async ~\)statement creating a new task, and the dashed boxes are the isolated statements. Intuitively, the computation graph is a Hasse diagram with inverted edges (things at the bottom happen-after things at the top) and with extra information on each node to indicate read and written memory locations. For example, the \(\omega (g)\) label indicates that variable g is written in the node. Such a graph can be checked for data-race using any number of algorithms [16, 27], or, if there is no isolation, [22, 28]. This research uses a simpler algorithm with worse complexity that nevertheless supports all the language features and is easy to reason about in the proofs.

To reason over all schedules, the approach in this paper assumes input programs are well formed and terminate. Well formed in this context captures common properties of task-parallel programming models that prohibit deadlock and non-determinism in how tasks are synchronized. Under these restrictions, the model checker, to prove data-race freedom, must generate a set of schedules that contains all ways allowed by the program semantics to interleave isolated statements. Such an algorithm is presented with a proof that it enumerates all such schedules.

The other graph in Fig. 2 shows the computation graph for the data-race schedule of the simple example program. Although the two schedules in Fig. 2 are the only schedules that need to be considered by the model checker in this example, the number of interesting schedules grows exponentially in the number of concurrent dependent isolated statements. The growth limits the model-checking approach in this paper to input programs that can be instantiated in such a way as to not require resources beyond the limits of the model checker. In other words, a program intended to run on inputs that create a very large number of tasks and mutually exclusive regions has a test mode that does not meaningfully change the program text and only creates a modest number of tasks and mutually exclusive regions. A data-race proof in the test mode would then generalize to a full-scale input since the program text and underlying concurrent structure is unchanged.

Fig. 2 Two computation graphs for Fig. 1

3 Data-race detection on computation graphs

This section defines the computation graph and gives an algorithm to detect data-race given a computation graph as input. A complexity proof is then given showing the algorithm to be quadratic when the number of heap accesses in any given node is small relative to the number of nodes in the graph, and otherwise cubic. The relation between the algorithm here and the SP-bags algorithm is discussed. The algorithm is then proved to be sound and complete.

3.1 Computation graphs

A computation graph for a task-parallel program is a directed acyclic graph representing the concurrent structure of the program execution [11]. It is modified here to track memory locations accessed by tasks. For the definition, \(\mathtt {Globals}\) is the set of the unique identifiers for the shared locations, and \({\mathcal {P}}\left( \mathtt {Globals}\right) \) is the power set.

Definition 1

A computation graph is defined as a directed acyclic graph (DAG), \(G= \left( N, E, \rho , \omega \right) \), where

  • \(N\) is a finite set of nodes;

  • \(E\subseteq N\times N\) is a set of directed edges;

  • \(\rho : (N\mapsto {\mathcal {P}}\left( \mathtt {Globals}\right) )\) maps \(N\) to the unique identifiers for the shared locations read by the tasks; and

  • \(\omega : (N\mapsto {\mathcal {P}}\left( \mathtt {Globals}\right) )\) maps \(N\) to the unique identifiers for the shared locations written by the tasks.

The graph collapses sequential accesses to memory locations into single nodes between concurrent operations to create, join, or isolate concurrent executions. The structure of the graph is the happens-before relation ordering shared accesses. Any feasible execution schedule of a task-parallel program can be captured in a computation graph.
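For concreteness, a minimal sketch of Definition 1 as a data structure is shown below; the class and member names are illustrative and not those of the actual implementation described in Sect. 7.

    import java.util.*;

    // Sketch of Definition 1: nodes, directed edges, and per-node read/write sets.
    final class ComputationGraph {
        final Set<Integer> nodes = new HashSet<>();               // N
        final Map<Integer, Set<Integer>> succ = new HashMap<>();  // E as adjacency lists
        final Map<Integer, Set<String>> reads = new HashMap<>();  // rho : N -> P(Globals)
        final Map<Integer, Set<String>> writes = new HashMap<>(); // omega : N -> P(Globals)

        void addNode(int n) {
            nodes.add(n);
            succ.computeIfAbsent(n, k -> new HashSet<>());
            reads.computeIfAbsent(n, k -> new HashSet<>());
            writes.computeIfAbsent(n, k -> new HashSet<>());
        }

        void addEdge(int from, int to) {  // 'from' happens before 'to'
            succ.get(from).add(to);
        }
    }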

3.2 Data-race detection

Let \(\prec \subset N\times N\) be the happens-before relation in the graph G with constant time lookup. Such a relation can be computed in O(\(|N|*|E|\)) time. There is a data-race in the graph if and only if there are two nodes, \(n_i\) and \(n_j\), such that the nodes are concurrent, (i.e., \(n_i \nprec n_j \wedge n_j \nprec n_i\) or, equivalently, \(n_i \mid \mid _{\prec } n_j\)), and the two nodes conflict:

$$\begin{aligned} \mathrm{conflict}(n_i,n_j,\rho ,\omega ) = \begin{array}{l} \rho (n_i) \cap \omega (n_j) \ne \emptyset \ \vee \\ \rho (n_j) \cap \omega (n_i) \ne \emptyset \ \vee \\ \omega (n_i) \cap \omega (n_j) \ne \emptyset \ \\ \end{array} \end{aligned}$$
(1)

The algorithm to detect data-race in a computation graph is given in Algorithm 1. It relies on a topological ordering of the nodes in the computation graph, which can be computed in O(\(|N|+|E|\)) time. The algorithm takes advantage of the topological sort to only check, for each node, the nodes that come after it in the sort, since those nodes cannot happen before it; though, this small optimization does not change the inherent complexity of the algorithm.

(Algorithm 1: detect data-race on a computation graph)

If there is a small bound on the number of reads and writes in any given node, meaning that the bound is much smaller than the number of nodes in the computation graph, then the call to the conflict function in the inner loop is considered constant time rather than linear time (e.g., computing the intersection is negligible). As such, the data-race detection algorithm operates in quadratic or cubic time, depending on the number of reads and writes in the graph's nodes. Most often, except in contrived benchmarks to characterize the algorithm, the bound on the reads and writes in any given node is much smaller than the number of nodes in the graph.
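Since Algorithm 1 itself appears only as a figure, the following sketch reflects only the structure described above (a topological sort, a happens-before closure, and a pairwise check of later nodes for concurrency and conflict), using the ComputationGraph sketch from Sect. 3.1; the line numbers cited by the proofs in Sect. 3.3 refer to the original algorithm, not to this sketch.

    import java.util.*;

    final class RaceDetector {
        // Sketch of the check described in Sect. 3.2 (illustrative only).
        static boolean hasRace(ComputationGraph g) {
            List<Integer> order = topologicalSort(g);
            Map<Integer, Set<Integer>> before = happensBefore(g, order);
            for (int i = 0; i < order.size(); i++) {
                int ni = order.get(i);
                for (int j = i + 1; j < order.size(); j++) {        // only nodes after ni in the sort
                    int nj = order.get(j);
                    boolean ordered = before.get(nj).contains(ni);  // ni happens before nj?
                    if (!ordered && conflict(g, ni, nj)) return true;
                }
            }
            return false;
        }

        static boolean conflict(ComputationGraph g, int a, int b) { // Eq. (1)
            return !Collections.disjoint(g.reads.get(a), g.writes.get(b))
                || !Collections.disjoint(g.reads.get(b), g.writes.get(a))
                || !Collections.disjoint(g.writes.get(a), g.writes.get(b));
        }

        // Kahn's algorithm; assumes the graph is a DAG as required by Definition 1.
        static List<Integer> topologicalSort(ComputationGraph g) {
            Map<Integer, Integer> indeg = new HashMap<>();
            for (int n : g.nodes) indeg.put(n, 0);
            for (int n : g.nodes) for (int m : g.succ.get(n)) indeg.merge(m, 1, Integer::sum);
            Deque<Integer> ready = new ArrayDeque<>();
            for (int n : g.nodes) if (indeg.get(n) == 0) ready.add(n);
            List<Integer> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                int n = ready.remove();
                order.add(n);
                for (int m : g.succ.get(n)) if (indeg.merge(m, -1, Integer::sum) == 0) ready.add(m);
            }
            return order;
        }

        // Transitive closure of the edges, propagated in topological order (O(|N|*|E|)).
        static Map<Integer, Set<Integer>> happensBefore(ComputationGraph g, List<Integer> order) {
            Map<Integer, Set<Integer>> before = new HashMap<>();
            for (int n : g.nodes) before.put(n, new HashSet<>());
            for (int n : order)
                for (int m : g.succ.get(n)) {
                    before.get(m).add(n);                // direct predecessor
                    before.get(m).addAll(before.get(n)); // plus everything before n
                }
            return before;
        }
    }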

It is worth noting that there are more efficient algorithms to compute the data-race in the computation graph when there is no mutual exclusion [22, 28], or with mutual exclusion using vector clocks [16, 27]. Algorithm 1 is preferred here for its simplicity relative to the proofs, and, more critically, the cost of data-race detection is not driven by the algorithm that detects data-race on the computation graph; it is rather driven by generating the needed executions for the proof, as shown in Sect. 7.

That said, this research did implement an approach based on vector clocks [16], and though it is not proven to have the same correctness properties as Algorithm 1, assuming it does, it plugs directly into the model checking algorithm presented in this paper and all the same properties hold. The results show that, since it is on-the-fly, it can get lucky and report a race before running out of resources, unlike Algorithm 1.
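For contrast, the core of a vector-clock approach is the per-access happens-before test sketched below; this fragment only illustrates the clock operations, and the actual on-the-fly algorithm from [16] maintains additional per-variable and per-lock state.

    // Minimal vector clock with one slot per task (illustrative only).
    final class VectorClock {
        final int[] clock;

        VectorClock(int numTasks) { clock = new int[numTasks]; }

        void tick(int task) { clock[task]++; }          // local step of 'task'

        void join(VectorClock other) {                  // pointwise maximum, e.g., at a join point
            for (int i = 0; i < clock.length; i++)
                clock[i] = Math.max(clock[i], other.clock[i]);
        }

        boolean happensBefore(VectorClock other) {      // this <= other, pointwise
            for (int i = 0; i < clock.length; i++)
                if (clock[i] > other.clock[i]) return false;
            return true;
        }

        static boolean concurrent(VectorClock a, VectorClock b) {
            return !a.happensBefore(b) && !b.happensBefore(a);  // unordered accesses may race
        }
    }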

It may be helpful to pause here to make clear the difference between an SP-bags algorithm and Algorithm 1. SP-bags relies on a special traversal of the computation graph so that least-common-ancestor queries combined with some bookkeeping determine the parallel relationship of any two nodes in the graph. As such, it is based on Tarjan's fast LCA algorithm. Algorithm 1 only leverages the fact that the computation graph encodes the happens-before relation and is not sensitive to the structure of the graph or the observed execution order. This independence is also true of algorithms based on vector clocks; though, those incur some non-trivial memory overhead in the clocks as shown in Sect. 7. As such, Algorithm 1 works on computation graphs that otherwise break SP-bags and its variants.

3.3 Proof of correctness

Theorem 1

(Soundness of Algorithm 1) If Algorithm 1 finds a data-race when applied to a computation graph \(G\), \(G\) contains a data-race.

Proof

The condition on line 5 is the definition of a data-race in a computation graph; though the symmetric condition of \(n_j \nprec n_i\) is implied by the fact that \(n_i\) is topologically ordered before \(n_j\) by the outer loop on line 3. Algorithm 1 reports a data-race only on line 6, which is only reachable if the condition on line 5 is true. Therefore, Algorithm 1 can only report a data-race on a graph \(G\) if that data-race exists in \(G\). \(\square \)

Theorem 2

(Completeness of Algorithm 1) If a computation graph \(G\) contains one or more instances of data-race, Algorithm 1 finds at least one instance when applied to \(G\).

Proof

Data-race is only defined over distinct nodes in \(G\); it is impossible for a node to race with itself. The condition on line 5 is agnostic to the order in which it receives nodes; it is true for \(\left( n_{i}, n_{j}\right) \) if and only if it is true for \(\left( n_{j}, n_{i}\right) \). The proof of Theorem 1 establishes that the condition on line 5 matches the definition of data-race in a computation graph and that line 6 always reports a data-race. As such, as long as each pair of nodes is considered, a data-race is reported.

The loop defined on line 3 guarantees that every node is considered as \(n_{i}\). The loop defined on line 4 guarantees that every node after \(n_{i}\) is considered as \(n_{j}\). Line 6 can break out of the loop, but only after reporting a valid data-race; in this case, the theorem holds. If no data-race is detected, every node is compared against every other node at line 5 exactly once, line 6 never executes, and the algorithm correctly reports no data-race. \(\square \)

4 Capturing computation graphs from parallel tasks

This section formally defines how to build computation graphs from parallel task executions. The semantic model for parallel tasks in this presentation is inspired by that of isolated parallel tasks [5]. In that model, concurrent computations are hierarchically divided into concurrent tasks with each task executing sequentially. Tasks additionally maintain regions that track handles to other tasks.

The initial task begins without any task handles. When a task \(t\) creates a child task \(t^\prime \), it stores the handle for \(t^\prime \) in one of its regions; \(t\) and \(t^\prime \) are concurrent at this point, and \(t^\prime \) is able to recursively create new tasks and store handles in its own regions. If task \(t\) requires the computation of task \(t^\prime \), then it must await the completion of \(t^\prime \) and in so doing possibly block its own execution if \(t^\prime \) is yet to complete. When \(t^\prime \) completes, \(t\) consumes its handle. The value returned by \(t^\prime \) is combined into the state of \(t\) through a programmer-supplied return-value handler. Parent tasks may transfer ownership of subordinate tasks to children at creation, and children always pass unconsumed subordinate tasks to parents at completion.
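As a rough analogy only (standard java.util.concurrent, not the formal model; regions and ownership transfer have no direct counterpart here), the lifecycle reads as follows:

    import java.util.concurrent.*;

    final class TaskLifecycleSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newCachedThreadPool();
            int parentState = 0;

            // Parent t creates child t': the Future plays the role of the handle
            // stored in one of t's regions.
            Future<Integer> handle = pool.submit(() -> 21 + 21);

            // Parent awaits the child's completion and consumes its handle; the
            // returned value is combined into the parent's state by a
            // programmer-supplied return-value handler (here, addition).
            parentState += handle.get();

            System.out.println(parentState);  // 42
            pool.shutdown();
        }
    }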

Parallel tasks, unlike isolated parallel tasks, allow shared memory for communication between tasks, restrict where parents are allowed to pass tasks to children, and force the programmer to define a total order on synchronizing with child tasks any time the return-value handlers side-effect the parent's state. Parallel tasks remove the non-determinism from side-effecting return-value handlers that makes data-race detection especially hard, while preserving the core semantics that encompass most aspects of existing programming models. For example, languages such as OpenMP, Cilk, X10, and Habanero fall largely within the semantics of parallel tasks. The only notable missing language features are barriers for phased execution and some generality in futures. Barriers for phased execution are a lesser known and used language feature unique to Habanero. The well-known and widely used core language features are covered by parallel tasks.

4.1 Syntax

The syntax for parallel tasks is given in Fig. 3. A program P is a sequence of procedures with names taken from a finite set: \(p_0 \ldots p_i \in \mathtt {Procs}^*\). Each p has a single L-type parameter l taken from a finite set of parameter names Vars and a top-level statement \(\mathbf s \) that gives the body of the procedure. The semantics is abstracted over concrete values and operations, so the possible types of l are not specified.

Statements are inductive via composition with other control-flow statements, and the set of all statements is given by Stmts. The syntax for expressions is intentionally undefined but includes a finite set of values Vals and local or global variable references. It is assumed that Vals minimally includes \(\mathbf true \), \(\mathbf false \), and a special value \(\bot \) for uninitialized. Global variable references are taken from a finite set of names, \(\mathtt {Globals}\), where \(\mathtt {Globals}\cap \mathtt {Vars} = \emptyset \). The names include a special reserved variable isolate that is only used by the semantics for mutual exclusion.

Fig. 3 The surface syntax for task-parallel programs

The \(\mathbf async ~\), \(\mathbf future ~\), \(\mathbf await ~\), and \(\mathbf isolated ~\) statements relate to concurrency; the rest of the statements are sequential and have their usual meaning. Intuitively, the \(\mathbf async ~\)statement adds a task into a region r, taken from a finite set of region identifiers, Regs, by indicating the procedure p for the task with an expression e for the value of the local variable. The return value from p is ignored and not combined into the parent state. The \(\mathbf future ~\)statement is similar only it includes a return value handler \(d: \texttt {Vals} \rightarrow \texttt {Stmts}\) to combine a return value into the parent state, and it allows the parent to pass the ownership of tasks in the regions in the region sequence \(\mathbf {r}\) to the child.

A task is termed completed when it has evaluated the expression contained in its return statement. The \(\mathbf await ~\)statement blocks execution until some task with a handle in r completes, at which point its return handler is executed. The \(\mathbf await ~\)statement recursively calls itself until all tasks with handles in r complete.

The \(\mathbf isolated ~\)statement provides mutual exclusion relative to other \(\mathbf isolated ~\)statements. If s is isolated, then it runs mutually exclusive to any other statements \(s^\prime \) that are also isolated; however, s does not run mutually exclusive to other non-isolated statements that may be concurrent with s. An \(\mathbf isolated ~\)statement must consist of only sequential statements.
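A rough Java analogy (not the formal semantics): all \(\mathbf isolated ~\)statements behave as if they synchronized on one global lock, while non-isolated accesses ignore that lock entirely, so an isolated access can still race with a concurrent unprotected access.

    final class IsolationSketch {
        static final Object ISOLATION = new Object();  // single global lock, like the reserved isolate variable
        static int g = 0;

        static void isolatedWrite(int v) {
            synchronized (ISOLATION) {  // isolated { g := v }: mutually exclusive with other isolated statements
                g = v;
            }
        }

        static void unprotectedWrite(int v) {
            g = v;  // may run concurrently with isolatedWrite: a possible data-race
        }
    }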

For the rest of the presentation, only well-formed programs are considered.

Definition 2

(Well-formed) A program is said to be well-formed if and only if it conforms to a programming model that ensures (1) freedom from deadlock; and (2) awaiting futures cannot introduce non-determinism.

There are several programming models that guarantee well-formedness. For example, consider the Habanero model. For deadlock, there is only a single lock for mutual exclusion, so it is not possible to hold and wait; the model also prohibits cyclic dependencies between futures and only allows sequential statements in isolation for the same reason. Additionally, futures do not have return value handlers; instead, the value returned from a future's statement block is made accessible via the future's get() method. As a result, no nondeterminism is introduced when a \(\mathbf finish \) block waits on multiple futures.

The language is purposely kept simple to focus on the core concept of a computation graph as an equivalence class over task-parallel program executions. Additionally, there is no loss of generality in restricting procedures to a single parameter or not including statements for procedure calls because, for example, there are global variables, the syntax for types and expressions is left free, and procedure calls are syntactic sugar for a \(\mathbf future ~\)statement followed by an \(\mathbf await ~\)statement with an appropriate return-value handler.

4.2 Tree-based semantics

The semantics of task-parallel programs is defined over a tree of procedure frames to represent the concurrency in the language rather than a stack of procedure frames which is inherently sequential. A frame in the tree is a task. Execution proceeds by stepping some task in the tree. If that step creates a new task, then the task is posted as a child of the creating task in the tree. If that step consumes a completed child task, then the completed child task is removed from the tree, and any unconsumed children of the consumed task become children of the consuming task.

In addition to the tree of tasks, there is a global store and a computation graph. The computation graph augments the program state and is updated by the semantics to capture the program execution; it includes a few additional tuple members that will be defined shortly. The computation graph has no bearing on how program execution proceeds; that is completely defined by the tree of tasks and the global store.

Any task in the tree that is not blocked may step in an execution of a task-parallel program. A task step updates the auxiliary computation graph as appropriate, and it updates the statement for the current task. Additionally, depending on the statement, the step may also change the value of the task’s local variable, change the global store, or, as mentioned previously, change the structure of the tree by adding or removing a child. With this intuition, the semantics are ready to be formalized.

4.2.1 Program state

The state of a parallel tasks program is defined by the tuple \(\left( \varGamma , \sigma , \left( t,m \right) \right) \) where \(\varGamma \) is an augmented computation graph, \(\sigma :\mathtt {Globals}\rightarrow \mathtt {Vals}\) is a partial function mapping global variable names to values, and \(\left( t,m \right) \) is a tree configuration for the root of the tree. Let \(\sigma _o\) be the initial store such that \(\forall g \in \mathtt {Globals}, \sigma _o(g) = \bot \).

An augmented computation graph is a tuple \(\varGamma = \left( G, \textit{last}, R \right) \), where \(G= \left( N, E, \rho , \omega \right) \) is a computation graph; \(\textit{last}\in N\) is the last isolated node and is used to assert the observed order of \(\mathbf isolated ~\)statements in the execution; and \(R: \mathtt {Regs} \mapsto {\mathcal {P}}\left( N\right) \) is a function to track nodes that need to join in the computation graph at the end of an \(\mathbf await ~\)statement. In general, a function notation is adopted to access members of tuples. For example, the members of \(G\) are accessed as \(G(N)\), \(G(E)\), \(G(\rho )\), etc. Let \(\mathtt {Nodes}\) be a finite set of nodes, and \(\mathrm {fresh}()\) return as yet unused nodes in \(\mathtt {Nodes}\). Let \(\varGamma _o\) be the initial augmented computation graph such that for \(n_o = \mathrm {fresh}()\), \(\varGamma _o(G) = \left( \{n_o\}, \emptyset , \rho _o, \omega _o \right) \) where \(\forall n \in \mathtt {Nodes}, \rho _o(n) = \emptyset \wedge \omega _o(n) = \emptyset \), \(\varGamma _o(\textit{last}) = n_o\), and \(\forall r \in \mathtt {Regs}, \varGamma _o(R)(r) = \emptyset \).

A tree configuration, \(c = \left( t,m \right) \), is an inductively defined tree with task-labeled vertices, t, and region labeled edges given by the region valuation function, \(m : \texttt {Regs} \rightarrow \texttt {Configs}\), where Configs is the set of tree configurations. The tree is finite and unordered. For a given vertex \(c = \left( t,m \right) \), m(r) returns the collection of sub-trees connected to the t-labeled root by r-labeled edges. In this context, m is local to the current task. Let \(m_o\) be the initial region valuation such that \(\forall r \in \mathtt {Regs}, m_o(r) = \emptyset \); it is used in both the initial state and transition rules when creating new tree configurations.

A task \(t = \left( \ell , s, d, n \right) \) is a tuple containing the value, \(\ell \in \mathtt {Vals}\), of the task’s single local variable l, along with its statement s, its return value handler d, and its associated node n in the computation graph. The initial task \(t_o = \left( \bot , s_o, \lambda v.\mathbf skip ~, \varGamma _o(last) \right) \) is defined to start and then await the task running the top-level procedure,

$$\begin{aligned} s_o = \mathbf async ~\ r_o \leftarrow p_o\ \ell \ ;\ \mathbf await ~\ r_o \end{aligned}$$

where \(p_o\) is the top level procedure and \(\ell \) is the initial value for \(p_o\)’s single parameter. With that, the initial state of a task-parallel program is \(\left( \varGamma _o, \sigma _o, \left( t_o, m_o \right) \right) \).

4.3 Notation

The semantics are defined as a set of transition rules relating states. Some amount of additional notation is needed to help the presentation of the transition rules be as concise and uncluttered as possible. That notation follows.

Let \(\llbracket \cdot \rrbracket _e\) be a partial evaluation function for expressions without any variables. For convenience in the semantics:

$$\begin{aligned} e(t, \sigma ) &= e(\left( \ell , s, d, n \right) , \sigma ) \\ &= e(\ell , \sigma ) \\ &= \llbracket e[\ell / \texttt {l}, \sigma (\mathtt {g}_0) / \mathtt {g}_0, \sigma (\mathtt {g}_1) / \mathtt {g_1}, \ldots ] \rrbracket _e \end{aligned}$$

If \(e[\ell / \texttt {l}, \sigma (\mathtt {g}_0) / \mathtt {g}_0, \sigma (\mathtt {g}_1) / \mathtt {g_1}, \ldots ]\) has any free variables or other errors, then by definition \(\llbracket e[\ell / \texttt {l}, \sigma (\mathtt {g}_0) / \mathtt {g}_0, \sigma (\mathtt {g}_1) / \mathtt {g_1}, \ldots ] \rrbracket _e\) has no meaning and is undefined. The set of global variables that appear in an expression is given by \(\mathtt {Globals}(e)\).

A statement context S for s is a convenient way to consider, and manipulate, the next-to-be-executed statement in s. For example, \(S[\texttt {l} := e]\) means that the next-to-be-executed statement in the context S is \(\texttt {l} := e\); other statements may follow since s is inductive (e.g., \(\texttt {l} := e\ ;\ s^\prime \)). A configuration context, C, is a way to consider a single task in a larger tree while leaving the surrounding context (i.e., other tasks in the tree) unchanged.

The contexts are used to express transition rules. For example, consider the transition rule for a step on a task whose next-to-execute statement is a \(\mathbf skip ~\). The transition rule, via contexts, is given as:

(Skip transition rule)

The configuration context can be any task in the tree that matches the statement context. In the source state, it is matched when the next-to-execute instruction in the statement context is \(\mathbf skip ~; s\) where s is any valid statement. In the target state, the transition side effects the task’s statement context to be s, having consumed the \(\mathbf skip ~\), and leaves the rest of the surrounding context, or state, unchanged.

A function can be strongly updated as in a write to a global store: \(\sigma ^\prime = \sigma [\texttt {g} \mapsto \ell ]\); the new store is just like the old store only now \(\sigma ^\prime (\mathtt {g}) = \ell \). A function may also be weakly updated as in adding new variables into the read set for a node in the computation graph: \(\rho ^\prime = \rho [n \overset{\cup }{\mapsto }\mathtt {Globals}(e)]\); the new read set is just like the old read set only now \(\rho ^\prime (n) = \rho (n) \cup \mathtt {Globals}(e)\).

A function may remove mappings as in a parent task consuming a completed child task in the region valuation: \(m_1^\prime = m {\setminus } (r \mapsto \left( t_2,m_2 \right) )\); the new region valuation is just like m only now \(m_1^\prime (r)\) no longer includes \(\left( t_2,m_2 \right) \). Regions may also be merged together as in \(m_1 \cup m_2\) with the intuitive weak update meaning. The projection of the region valuation, m, to the sequence of regions \(\mathbf {r}\) is given as \(m|_{\mathbf {r}}\) and defined such that \(m|_{\mathbf {r}}(r^\prime ) = m(r^\prime )\) when \(r^\prime \) occurs in \(\mathbf {r}\) and \(m|_{\mathbf {r}}(r^\prime ) = \emptyset \) otherwise.
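In code, a strong update overwrites a mapping while a weak update unions into it; a small illustration with Java maps (names hypothetical) is:

    import java.util.*;

    final class UpdateSketch {
        public static void main(String[] args) {
            // Strong update: sigma' = sigma[g -> 5]; the old binding is replaced.
            Map<String, Integer> sigma = new HashMap<>();
            sigma.put("g", 1);
            sigma.put("g", 5);

            // Weak update: rho' = rho[n (union)-> {g, h}]; the old read set is extended.
            Map<Integer, Set<String>> rho = new HashMap<>();
            rho.computeIfAbsent(0, k -> new HashSet<>()).add("g");
            rho.get(0).addAll(Set.of("g", "h"));

            System.out.println(sigma + " " + rho);  // e.g., {g=5} {0=[g, h]}
        }
    }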

And finally, for a tuple such as a \(\varGamma = \left( G, \textit{last}, R \right) \), this presentation adopts the same notation for updating functions where \(\varGamma ^\prime = \varGamma [\textit{last}\mapsto n^\prime ]\) is understood to mean that \(\varGamma ^\prime (\textit{last})\) now has the value \(n^\prime \) and everything else is the same as in \(\varGamma \).

There are two additional functions to help manipulate the augmented computation graph efficiently in the transition rules. The fork function adds nodes for new tasks: \(G^\prime = \mathrm {fork}(G,n,n^\prime ,n^{\prime \prime },V)\) forks the node n in the computation graph \(G\) to \(n^\prime \) and \(n^{\prime \prime }\) and updates the read set for n with the variables in the set V:

$$\begin{aligned} G^\prime &= G[N\overset{\cup }{\mapsto }\{n^\prime , n^{\prime \prime }\}] \\ &\quad [E\overset{\cup }{\mapsto }\{\left( n, n^\prime \right) , \left( n, n^{\prime \prime } \right) \}] \\ &\quad [\rho (n) \overset{\cup }{\mapsto }V] \end{aligned}$$

The new graph has a new thread of concurrent execution and corresponds to the creation of a new task in the task configuration tree.

The join function makes clear in the graph where completed tasks synchronize at an \(\mathbf await ~\)statement: \(G^\prime = \mathrm {join}(G,R,n_2, n)\) joins the nodes in \(R\) and the node from the recently completed task, \(n_2\), to the new node n in the computation graph \(G\):

$$\begin{aligned} G^\prime &= G[N\overset{\cup }{\mapsto }\{n\}] \\ &\quad [E\overset{\cup }{\mapsto }\{(n_2, n)\} \cup \{(n_i, n) \mid n_i \in R\}] \end{aligned}$$

The new graph has gathered concurrent executions at the join and corresponds to the removed child tasks in the configuration tree.
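Read as operations on the ComputationGraph sketch from Sect. 3.1 (illustrative only; the region bookkeeping of the transition rules is elided), fork and join amount to:

    import java.util.*;

    final class GraphUpdates {
        // fork(G, n, n', n'', V): continue the parent in n' and start the child in n''.
        static void fork(ComputationGraph g, int n, int nPrime, int nDoublePrime, Set<String> v) {
            g.addNode(nPrime);
            g.addNode(nDoublePrime);
            g.addEdge(n, nPrime);        // the parent's next node happens after n
            g.addEdge(n, nDoublePrime);  // the child's first node happens after n
            g.reads.get(n).addAll(v);    // globals read to evaluate the child's argument
        }

        // join(G, R, n2, n): gather the completed task's node n2 and the nodes in R into n.
        static void join(ComputationGraph g, Set<Integer> r, int n2, int n) {
            g.addNode(n);
            g.addEdge(n2, n);            // n2 happens before the join node n
            for (int ni : r) g.addEdge(ni, n);
        }
    }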

Fig. 4 Transition rules for sequential statements

Fig. 5 Transition rules for concurrent statements

4.3.1 Transition rules

Figure 4 gives the transition rules for all the sequential statements but the \(\mathbf skip ~\)statement (given previously). Sequential statements have no direct effect on the shape of the configuration tree. Relative to the computation graph, these statements are only able to update the read or write sets for the task’s associated computation graph node.

The \(\mathbf return ~\)statement does merit some discussion. Its rule inserts a machine-only \(\mathbf done ~\)statement. The rule evaluates the return expression and then puts the value in the \(\mathbf done ~\)statement. Later, when an \(\mathbf await ~\)statement joins with the completed task, the \(\mathbf done ~\)statement is consumed and the value is used by the return-value handler.

Figure 5 gives the transition rules for all the concurrent statements in the language. These statements affect the shape of the configuration tree and the shape of the computation graph.

The Async rule creates a new child task. The rule adds two fresh nodes \(n_0^\prime \) and \(n_1\) to the computation graph. Node \(n_0^\prime \) represents the statements following the \(\mathbf async ~\) and \(n_1\) represents the statements to be executed by the new task. The rule orders both after the current node for the parent, \(n_0\), in the computation graph. The read set \(\rho \) of node \(n_0\) is updated to include any global variables referenced in the expression, \(\mathtt {Globals}\left( e\right) \), for the local parameter value in the new task. The region mapping m of the parent task is updated to include the new task with its empty initial region mapping. The Future rule mimics the Async rule only it allows arbitrary return value handlers, and its region valuation is seeded with regions passed from the parent using projection.

The Await rule blocks the execution of the currently executing task until a task in the indicated region completes. A new node to join the two tasks is not created in the computation graph, nor are the two tasks ordered in the sense of join because the choice of task \(t_2\) in the region is non-deterministic; as such, the computation graph allows tasks in the region to join in any order contrary to the observed reduction by the rule. The rule saves the current node in the graph for \(t_2\), \(n_2\), to join later once the region is empty, and it updates the read set for \(t_2\) on the expression in the return statement.

The rest of the rule separates out the task from the region, makes sure the region is not yet empty, and gets the statement for the return value handler. The new state adds an await statement after the return value handler statement since the region is not yet empty, and the region valuation function in the new state includes any tasks owned by \(t_2\).

The Await-Done rule activates when the last task in the region is joined. It differs from the Await rule in that it orders all tasks that have joined in the region to happen-before the new node for the parent in the computation graph, and it does not insert another await statement in the new state.

If no other isolated statements are running, then the Isolated rule updates the isolate shared variable and inserts, after the isolated statement \(\textit{s}\), the new isolated-end keyword that is only a part of the machine syntax for the semantics. The computation graph gets a new node to track accesses in the isolated statement with an appropriate edge from the previous node. A sequencing edge from \(\textit{last}\) is also added so the previous isolated statement happens before this new isolated statement. As a reminder, \(\textit{last}\) is initialized to the initial node when execution starts.

The Isolated-End rule creates a new node in the computation graph to denote the end of isolation, updates the isolate shared variable, and updates \(\textit{last}\) to properly sequence any future isolation. The \(\textit{last}\) variable implements how the semantics track and record the order of isolated statements in the computation graph while reducing the state.

5 Proof of correctness

The semantics in Sect. 4 determine both how a program executes and how computation graphs are created. This section proves that computation graphs are correct with respect to the executions that produce them. It also proves a much stronger claim: Computation graphs are correct with respect to all compatible executions. A computation graph contains a data-race if and only if an execution with the same order on \(\mathbf isolated ~\)statements contains a data-race.

This section begins with relevant formal definitions, including a definition for data-race. It then proceeds to relevant reasoning, concluding with Theorem 4, which proves soundness, and Theorem 5, which proves completeness.

Definition 3

(Compatible execution) A given execution is compatible with a computation graph \(G\) if and only if its order on \(\mathbf isolated ~\)statements is the same as the order on isolated nodes in \(G\).

Definition 4

(Conflict) Two statements conflict if they both access the same shared variable and at least one of them writes to that variable. Here \(\rho \left( s\right) \) and \(\omega \left( s\right) \) denote the shared variables read and written by the statement \(s\), respectively.

$$\begin{aligned} \mathrm{conflict}(s_i,s_j) = \begin{array}{l} \rho (s_i) \cap \omega (s_j) \ne \emptyset \ \vee \\ \rho (s_j) \cap \omega (s_i) \ne \emptyset \ \vee \\ \omega (s_i) \cap \omega (s_j) \ne \emptyset . \end{array} \end{aligned}$$
(2)

A state’s set of active statements is used to define concurrency.

Definition 5

(Active statements) A state \(\varsigma \) has a set of active statements \(a\left( \varsigma \right) \) that corresponds to the next statement to be reduced in each of the active tasks in the state.

Notice that the definition does not check if the active statements are reducible. For example, an \(\mathbf await ~\)statement may be active but not reducible in a state because there is no task in its indicated region that has completed; thus, the Await rule is not active on that statement. This nuance becomes important in the next definition. For that definition, and without loss of generality, we call the program’s initial state \(\varsigma _{0}\).

Definition 6

(Concurrency) Any two statements are concurrent if and only if an execution of the program can result in a state \(\varsigma \) such that both statements are active at the same time:

$$\begin{aligned} s \mid \mid s' \Longleftrightarrow \exists \varsigma : \varsigma _{0} \rightarrow ^{*} \varsigma \; \wedge \; \left\{ s, s'\right\} \subseteq a\left( \varsigma \right) . \end{aligned}$$
(3)

A state \(\varsigma \) that satisfies this condition for \(s\) and \(s'\) is called a witness state for \(s \mid \mid s'\). \(\rightarrow ^{*}\) is the transitive closure of the transition relation \(\rightarrow \) defined in Sect. 4.2. Two statements are ordered if and only if they are not concurrent.

Note that the definition treats some statements as concurrent that are in fact not concurrent in the witness state. For example, consider an \(\mathbf isolated ~\)statement and an assign statement. The \(\mathbf isolated ~\)statement may be blocked in the witness state waiting for the isolation lock while the assign statement is always able to reduce; hence, these statements are not actually concurrent in the witness state.

This discrepancy is harmless because the only statements that can block in these ways are the \(\mathbf isolated ~\)statement and the \(\mathbf await ~\)statement. In the case of the \(\mathbf isolated ~\)statement, it only interacts with the machine-only isolate global variable. Similarly, a task that is completed is at a \(\mathbf done ~\)statement, and that statement does not read or write any variables; reading and writing variables happens when the \(\mathbf return ~\)statement is reduced. As such, it is not possible to ever have a data-race with either of these statements, so calling them concurrent with other statements is harmless.

Definition 7

(Data race) There is a data-race on two statements \(s\) and \(s'\) if and only if they conflict and are concurrent:

$$\begin{aligned} \textit{DR}\left( s, s'\right) = s \mid \mid s' \; \wedge \mathrm{conflict}\left( s, s'\right) . \end{aligned}$$
(4)

Two statements that occur in the same thread of execution cannot be concurrent, as exactly one statement is active in each active thread at any point in time. Accordingly, events contained in the same node are always serialized. The semantics for the Isolated and Isolated-done transitions in Fig. 5 ensure that no two statements inside of \(\mathbf isolated ~\)statements can be concurrent, as only one thread may enter an \(\mathbf isolated ~\)statement at a time.

Definition 8

(Specific transition) A state \(\varsigma \) transitions to some other state \(\varsigma '\) via a configuration context C when C is the configuration context reduced in the transition from \(\varsigma \) to \(\varsigma '\). This is denoted \(\varsigma {\mathop {\rightarrow }\limits ^{C}} \varsigma '\).

Definition 9

(Reachable states) A set \(\Sigma _{r}\) of reachable states is the set of states reachable from the initial state \(\varsigma _{0}\):

$$\begin{aligned} \Sigma _{r} \equiv \left\{ \varsigma \mid \varsigma _{0} \rightarrow ^{*} \varsigma \right\} . \end{aligned}$$
(5)

Definition 10

(Safe statement ordering) Any two statements \(s\) and \(s'\) are safely ordered if and only if any state \(\varsigma \) with two configuration contexts C and \(C'\) transitions to the same state \(\varsigma '\) after transitioning via C and \(C'\), regardless of the order in which these transitions occur:

$$\begin{aligned} s {\mathop {\leftrightarrow }\limits ^{?}} s' \equiv \forall \varsigma \in \Sigma _{r}:&\left( \varsigma \left( C\right) = \left( \left( l, s, d, n\right) , m\right) \wedge \varsigma \left( C'\right) = \left( \left( l', s', d', n'\right) , m'\right) \right) \\&\implies \varsigma {\mathop {\rightarrow }\limits ^{C}} \varsigma _{1} {\mathop {\rightarrow }\limits ^{C'}} \varsigma ' \wedge \varsigma {\mathop {\rightarrow }\limits ^{C'}} \varsigma _{2} {\mathop {\rightarrow }\limits ^{C}} \varsigma '. \end{aligned}$$
(6)

Two statements can be safely ordered in one of two ways. First, they can be ordered (in which case the definition is vacuously true). Second, they can commute, satisfying the definition’s consequent. Thus, all realizable orders on safely ordered statements yield deterministic results.

A data-race may introduce nondeterminism. Similarly, it is possible for conflicting statements (intended data-races) in different \(\mathbf isolated ~\)statements to introduce nondeterminism. Lemma 1 and Theorem 3 prove that in the absence of both data-race and isolation, program execution is deterministic.

Lemma 1

(Local determinism) When two statements do not race, they are safely ordered:

$$\begin{aligned} \lnot \textit{DR}\left( s, s'\right) \implies s {\mathop {\leftrightarrow }\limits ^{?}} s'. \end{aligned}$$
(7)

Proof

A well-formed program cannot introduce data-race on the order in which futures are gathered. In a well-formed program, then, this assertion is equivalent to two implications:

$$\begin{aligned}&\lnot \textit{DR}\left( s, s'\right) \implies s {\mathop {\leftrightarrow }\limits ^{?}} s' \\&\lnot \left( s \mid \mid s' \wedge \mathrm{conflict}\left( s, s'\right) \right) \implies s {\mathop {\leftrightarrow }\limits ^{?}} s' \\&\left( s \mid \mid s' \wedge \mathrm{conflict}\left( s, s'\right) \right) \vee s {\mathop {\leftrightarrow }\limits ^{?}} s' \\&\left( s \mid \mid s' \vee s {\mathop {\leftrightarrow }\limits ^{?}} s'\right) \wedge \left( \mathrm{conflict}\left( s, s'\right) \vee s {\mathop {\leftrightarrow }\limits ^{?}} s'\right) \\&\left( \lnot s \mid \mid s' \implies s {\mathop {\leftrightarrow }\limits ^{?}} s'\right) \wedge \left( \lnot \mathrm{conflict}\left( s, s'\right) \implies s {\mathop {\leftrightarrow }\limits ^{?}} s'\right) \end{aligned}$$

The first implication is true from Definition 10: when \(s\) and \(s'\) are not concurrent, no reachable state exists in which both are active, so they are vacuously safely ordered.

The second implication is true by induction on the transition rules in Figs. 4 and 5. Informally, the two statements act independently in their respective configuration contexts. Only a global side effect in one statement that changes a value read or written in the other statement could change the second statement’s behavior. This can only occur, however, when the two statements conflict.

For example, consider a state with two active statements, one whose next transition is governed by the Assign Local transition rule and another whose transition is governed by the Assign Global rule. The first statement computes the value of e and assigns it to the local l; the second changes the value of a global variable g. If e reads g, the two statements conflict and the implication is vacuously true. On the other hand, if e does not read g, its result is the same regardless of whether the Assign Global transition happens before or after the Assign Local transition. \(\square \)

Theorem 3

(General determinism) In the absence of data-race and on some order of \(\mathbf isolated ~\)statements, a program’s behavior is deterministic.

Proof

By induction on Lemma 1. \(\square \)

Definition 11

(Race-sensitive partial order) Every computation graph is derived from some execution, which is a total order on statements executed. Some of these statements race with each other; by definition, these statements occur in pairs. The computation graph is a partial order on the statements contained in its nodes. This partial order, together with additional pairs to enforce the observed order on pairs of statements that race with each other (and additional pairs, as needed, for transitivity) is a race-sensitive partial order.

Race-sensitive partial orders are not created in practice. However, they are useful as a theoretical construct for the proof. Observe that whenever there is no data-race, there are no additional synchronization pairs and the race-sensitive partial order is the same as the order induced by the computation graph.

Lemma 2

If two nodes \(n\) and \(n'\) are ordered in a computation graph \(G\), every \(s\in n\) and \(s' \in n'\) are ordered:

$$\begin{aligned} n\prec _{G}n' \implies s\prec s'. \end{aligned}$$
(8)

Proof

By induction on the transition rules in Fig. 5 (no rule in Fig. 4 creates a node or changes its task’s node except to add a statement to that node). Every computation graph \(G\) is created by these transition rules. Every edge is created from a node that has completed (no additional events are added to it) to a node that is beginning; with the exception of \(\textit{last}\), which is used only to create an edge in the Isolated transition rule, no state contains a reference to \(n\) after the Async, Future, Await-done, Isolated, or Isolated-done rules complete. In each of these cases, the edge or edges created in the transition reflect a necessary order on events. For example, an \(\mathbf async ~\)statement (in \(n_{0}\)) must execute before the events that follow it in its own task (in \(n_{0}'\)) and must occur before all events in the task it creates (in \(n_{1}\)). Events contained in the same node always pertain to the same task and are strictly ordered.

Accordingly, every edge in every computation graph reflects an ordering imposed by the semantics. \(\square \)

Corollary 1

If two statements are unordered, their respective nodes in the computation graph are unordered:

$$\begin{aligned} s \mid \mid s' \implies n \mid \mid _{\prec _{G}} n' \end{aligned}$$
(9)

Proof

By two uses of the contrapositive of Lemma 2:

$$\begin{aligned}&\left( s\nprec s' \implies n\nprec _{G}n'\right) \wedge \left( s' \nprec s\implies n' \nprec _{G}n\right) \\&s\nprec s' \wedge s' \nprec s\implies n\nprec _{G}n' \wedge n' \nprec _{G}n\\&s \mid \mid s' \implies n \mid \mid _{\prec _{G}} n'. \end{aligned}$$

\(\square \)

Lemma 3 reasons about unordered nodes in a race-sensitive partial order. This is in contrast with Lemma 2, which reasons about ordered nodes in a computation graph.

Lemma 3

If two nodes \(n\) and \(n'\) are unordered in a computation graph \(G\), every \(s\in n\) and \(s' \in n'\) not ordered in the race-sensitive partial order induced by \(G\) are concurrent:

$$\begin{aligned} \forall s\in n, s' \in n' : n \mid \mid _{\prec } n' \implies s \mid \mid s' . \end{aligned}$$
(10)

Proof

If \(s\) and \(s'\) are ordered in the race-sensitive partial order for \(G\), the lemma holds vacuously.

Synchronization between threads is accomplished by restrictions in the transition rules. For example, no task may enter an \(\mathbf isolated ~\)statement when another task is currently inside an \(\mathbf isolated ~\)statement of its own because the other task must have set isolate to \(\mathbf true \) when executing the Isolated transition rule and no other Isolated transition rule may fire before the task currently in a critical section exits, setting isolate to \(\mathbf false \) in its execution of the Isolated-done rule. Similarly, an \(\mathbf await ~\)statement cannot step forward (per the Await and Await-done transition rules) until one of the tasks in its region is at a \(\mathbf done ~\)statement (and is therefore finished) and cannot complete until all tasks in its region have been reaped (per the Await-done transition rule). In every case, execution of a synchronization event results in the creation of a corresponding edge in \(G\).

There is a single first node in each computation graph \(G\) that happens before every other node. As a result, \(G\) is a finite semilattice with a global infimum. It follows that every two nodes \(n\) and \(n'\) have a least common ancestor \(n_{a}\).

Since \(s\) and \(s'\) both exist in \(G\), they are both reachable in the execution that created \(G\). The same is true of \(n\) and \(n'\). As a result, their least common ancestor \(n_{a}\) is also reachable. Therefore, some state \(\varsigma _{a}\) must exist that creates outgoing edges to \(n_{1}\) and \(n_{2}\) where \(n_{1}\) is either equal to \(n\) or is an ancestor of it and where the same is true of \(n_{2}\) and \(n'\). \(N_{1}\) is the set of nodes on a path from \(n_{1}\) to \(n\), inclusive. \(N_{2}\) is the set of nodes on a path from \(n_{2}\) to \(n'\), inclusive. There cannot be any synchronization edges between nodes in \(N_{1}\) and \(N_{2}\), as this would contradict the definition of \(n_{a}\) as the least common ancestor of \(n\) and \(n'\).

As a result, it is possible for \(\varsigma _{a}\) to transition in the two tasks independently of each other until reaching \(s\) and \(s'\). The state in which both statements are active is a witness for their concurrency. \(\square \)

Corollary 2

Every topological sort on a race-sensitive partial order is a realizable trace.

Proof

By Lemma 3. \(\square \)

Theorem 4

(Soundness) If a computation graph \(G\) contains a data-race, an actual data-race exists on an execution compatible with that graph.

Proof

The race-sensitive partial order for \(G\) is \(\prec _{r}\). Let \(S_{r}\) be the set of conflicting statements in unordered nodes in \(G\); that is, the statements that \(G\) claims to be racing statements. Because \(G\) contains a data-race, \(S_{r}\) must have at least two members. Furthermore, Corollary 1 proves that every pair of statements that actually race must belong, respectively, to unordered nodes. Since they conflict by definition, these statements are members of \(S_{r}\).

Let a reported race be a pair of members of \(S_{r}\) that conflict with each other and that occur in nodes that are unordered in \(G\). Let \(s\) and \(s'\) be a reported race such that \(s\) occurs before \(s'\) in the observed order of execution. Let \(n\) and \(n'\), respectively, be the nodes that contain \(s\) and \(s'\). Without loss of generality, choose \(s'\) to be the first statement that meets these requirements. Again without loss of generality, choose \(s\) to be the last statement compatible with these requirements (including the choice of \(s'\)). To restate, \(\textit{conflict}\left( s, s'\right) \) and \(n \mid \mid _{\prec _{G}} n'\).

By Corollary 2, any topological sort on \(\prec _{r}\) is a realizable execution. Choose a topological sort that executes until \(s\) is active and then advances other tasks until \(s'\) is active. The choice of \(s\) and \(s'\) ensures that if they are ordered by \(\prec _{r}\), they race: if they were ordered because of some other pair of statements that race, the second of those statements must precede \(s'\), creating a contradiction. By Lemma 3 and because \(n\) and \(n'\) are unordered, \(s\) and \(s'\) must either race each other (in which case the theorem holds) or must be unordered by \(\prec _{r}\). This latter case, however, contradicts itself: if \(s\) and \(s'\) are unordered in \(\prec _{r}\), they are necessarily concurrent. Since it is given that they conflict, they race. This, however, contradicts that \(s\) and \(s'\) are unordered by \(\prec _{r}\), which orders all racing statements. Hence \(s\) and \(s'\) race on a realizable execution compatible with \(G\). \(\square \)

Theorem 5

(Completeness) If a computation graph \(G\) does not contain a data-race, no compatible execution contains a data-race.

Proof

By Lemma 2, any two statements in ordered nodes are themselves ordered. If statements in unordered nodes conflict, there is a data-race in the graph. Therefore, if there is no data-race in a graph, no statements in its unordered nodes conflict, and no data-race is possible. \(\square \)

6 Model checking task-parallel programs

This section applies model checking to a task-parallel program to prove data-race freedom under the given input to the program. The model checker uses a depth-first search-based algorithm to exhaustively explore all computation graphs that arise from different total orders on isolated nodes. A proof of correctness is given. The proof shows that the depth-first search enumerates a sufficient set of graphs to be both sound and complete on the given input.

6.1 Depth-first scheduling

The depth-first algorithm to enumerate all schedules over \(\mathbf isolated ~\)statements is given in Algorithm 2. The algorithm is a fairly direct implementation of a depth-first search on the semantics. It is divided into five ordered cases:

  1. The program is completed and only a single task is left that cannot reduce any further (line 2);
  2. There is some task that is not at an \(\mathbf await ~\)statement or \(\mathbf isolated ~\)statement that can be reduced (possibly more than once);
  3. There is some task at an \(\mathbf await ~\)statement that can be reduced;
  4. There is some task at an \(\mathbf isolated ~\)statement that can be reduced, from which all possible choices must be explored; or
  5. The program is not well-formed.

Reduced in this context means the state has a successor. Each case is discussed below.

Line 2 detects when the task-parallel program has run to completion. The function Configs(\(\varsigma \)) returns all configuration contexts (i.e., task frames) in the task tree of the state, and the notation \(\varsigma \not \rightarrow \varsigma ^\prime \) conveys that it is not possible to reduce the state any further. A well-formed program eventually gathers all tasks, which means that eventually there is a single context in the task tree, and the statement in that context is a \(\mathbf skip ~\)statement that cannot be reduced because no statement follows it. In this case, the computation graph is pulled out of the state and checked for data-race.

Line 4 reduces a context in the tree that is not at an \(\mathbf await ~\)statement or \(\mathbf isolated ~\)statement. The notation \(\varsigma \overset{C}{\hookrightarrow }\varsigma ^\prime \) conveys that the context is reduced as much as possible, which is at least once, using any of the rules in Fig. 4 or Fig. 5 except Await, Await-done, and Isolated; the reduction stops at a \(\mathbf done ~\)statement, an \(\mathbf await ~\)statement, an \(\mathbf isolated ~\)statement, or the case in line 2. In all of these cases, the procedure makes a recursive call on the reduced state.

Line 6 reduces \(\mathbf await ~\)statements. It is only reached when the first two cases fail, meaning the program cannot reduce further without first reducing either an \(\mathbf await ~\)statement or an \(\mathbf isolated ~\)statement. The algorithm first gives priority to the \(\mathbf await ~\)statement. In this case, if it is possible to step the context exactly once on the \(\mathbf await ~\)statement, then the procedure makes the recursive call on the successor state.

Line 8 is the depth-first search over the \(\mathbf isolated ~\)statements. The key difference from the await case on line 6 is that this case searches from the successor of every context with an \(\mathbf isolated ~\)statement that can reduce at the current state, not just one successor. As such, the current state is the backtrack point for the search; from this state, all possible orders of isolation are considered.

The final case, on line 12, is only reached for a program that is not well-formed. Such a program may be deadlocked, have dangling tasks that have not been gathered, etc. Note that a divergent program causes the search to loop forever and is not detected.

(Algorithm 2)
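The following Java-style sketch restates the five cases of Algorithm 2; the State type and the helper names (configs, reduceNonBlocking, stepOneAwait, stepEachIsolated, reportNotWellFormed) are illustrative stand-ins for the semantics, not the paper's identifiers.

import java.util.List;

// Sketch of the depth-first search over isolation orders described above.
abstract class DepthFirstScheduler {
    abstract List<Object> configs(State s);          // all task contexts in the task tree
    abstract boolean canReduce(State s);             // does the state have any successor?
    abstract State reduceNonBlocking(State s);       // case 2: maximally reduce a context not at await/isolated; null if none
    abstract State stepOneAwait(State s);            // case 3: step one await statement exactly once; null if none
    abstract List<State> stepEachIsolated(State s);  // case 4: one successor per enabled isolated statement
    abstract void checkForDataRace(State s);         // check the state's computation graph for data-race
    abstract void reportNotWellFormed(State s);      // case 5: deadlock, dangling tasks, ...

    void explore(State sigma) {
        if (configs(sigma).size() == 1 && !canReduce(sigma)) {  // case 1 (line 2): program has run to completion
            checkForDataRace(sigma);
            return;
        }
        State next = reduceNonBlocking(sigma);                   // case 2 (line 4)
        if (next != null) { explore(next); return; }

        State afterAwait = stepOneAwait(sigma);                  // case 3 (line 6)
        if (afterAwait != null) { explore(afterAwait); return; }

        List<State> successors = stepEachIsolated(sigma);        // case 4 (line 8): sigma is the backtrack point
        if (!successors.isEmpty()) {
            for (State s : successors) explore(s);
            return;
        }
        reportNotWellFormed(sigma);                              // case 5 (line 12)
    }

    static class State { /* task tree, store, computation graph */ }
}

Branching only in case 4, and exploring every enabled isolated successor there, is what makes the search produce exactly one computation graph per total order on the isolated statements.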

6.2 Proof of correctness

Theorem 6

A well-formed, terminating program can manifest a data-race on some schedule if and only if a computation graph generated by Algorithm 2 contains a data-race.

Proof

A program that is well-formed is guaranteed not to deadlock. Furthermore, it cannot introduce data-race via futures, and Theorem 3 proves that all other nondeterminism is impossible. If either of these properties is false, the theorem is vacuously true. When both properties hold, the model checker performs a depth-first search on isolation orders, generating a computation graph for each possible order.

Since Algorithm 2 enumerates a computation graph for each possible ordering on \(\mathbf isolated ~\)statements, each execution is compatible with one of these computation graphs. By Theorems 4 and 5, each of these graphs contains a data-race if and only if a compatible execution contains a data-race. \(\square \)

7 Implementation and results

The model checking approach to data-race detection described in this paper has been implemented for Habanero Java. The implementation uses the verification run-time specifically designed to test Habanero Java programs and to work cleanly with JPF [1]. The implementation is a set of JPF listeners that create the computation graph and schedule over isolated statements. Two methods for data-race detection on the generated computation graph were tested. The first is an implementation of Algorithm 1, in which data-race is detected by intersecting the access sets of parallel nodes [28]. The second is an implementation of FastTrack [16], a fast vector-clock algorithm that optimizes the common case where only the clock value of the last thread to access a shared memory location is needed to determine whether a race exists. A further key difference between the two methods is that FastTrack checks for data-race alongside the execution of the program, while the first method waits until program completion to check the graph for data-race.
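As a rough illustration of the first method, the sketch below reports a race whenever two nodes that are unordered by the happens-before relation have conflicting access sets; the representation (locations as long addresses, a HappensBefore oracle answered from the transitive closure of the graph) is an assumption for the example, not the actual implementation.

import java.util.HashSet;
import java.util.Set;

// Access-set based detection on a computation graph: two nodes that are unordered
// by the happens-before relation race if their accesses conflict (write/write or
// read/write on a common location). Names and types are illustrative.
class AccessSetCheck {
    static class Node {
        final Set<Long> reads = new HashSet<>();   // locations read in the node
        final Set<Long> writes = new HashSet<>();  // locations written in the node
    }

    // Assumed to be answered from the transitive closure of the graph's edges.
    interface HappensBefore { boolean ordered(Node a, Node b); }

    static boolean conflict(Node a, Node b) {
        return intersects(a.writes, b.writes)
            || intersects(a.writes, b.reads)
            || intersects(a.reads, b.writes);
    }

    static boolean intersects(Set<Long> x, Set<Long> y) {
        for (Long l : x) if (y.contains(l)) return true;
        return false;
    }

    static boolean hasRace(Iterable<Node> nodes, HappensBefore hb) {
        for (Node a : nodes)
            for (Node b : nodes)
                if (a != b && !hb.ordered(a, b) && conflict(a, b)) return true;
        return false;
    }
}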

This implementation uses JPF’s VM listeners to track various program events related to parallelism. The methods objectCreated and objectReleased are used to create and connect nodes in the computation graph. The objectCreated method tracks the creation of a new async task: it detects when an async statement executes and adds the appropriate edges to the computation graph.

The objectReleased method tracks when finish blocks complete execution and adds the corresponding edges. The await statement creates a node in the graph at which the tasks belonging to the finish block join. The executeInstruction method tracks the memory locations accessed by each task by updating the task’s current node with the location accessed during the execution of that instruction. All in all, seven listeners and two factories are replaced in JPF, comprising roughly 1.6K lines of code.
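A minimal sketch of such a listener is shown below, assuming jpf-core's ListenerAdapter and its usual VMListener callback signatures; the GraphBuilder helper and its class-name checks are illustrative placeholders, not the actual implementation.

import gov.nasa.jpf.ListenerAdapter;
import gov.nasa.jpf.jvm.bytecode.FieldInstruction;
import gov.nasa.jpf.vm.ElementInfo;
import gov.nasa.jpf.vm.Instruction;
import gov.nasa.jpf.vm.ThreadInfo;
import gov.nasa.jpf.vm.VM;

// Sketch of a VM listener that maintains a computation graph during the search.
public class ComputationGraphListener extends ListenerAdapter {
    private final GraphBuilder graph = new GraphBuilder();  // hypothetical helper

    @Override
    public void objectCreated(VM vm, ThreadInfo ti, ElementInfo ei) {
        // When the created object is an async task, add a fork edge from the
        // creating task's current node to the new task's first node.
        if (graph.isTaskObject(ei)) graph.addForkEdge(ti, ei);
    }

    @Override
    public void objectReleased(VM vm, ThreadInfo ti, ElementInfo ei) {
        // When a finish block's bookkeeping object is released, join its tasks
        // at a single node of the graph.
        if (graph.isFinishObject(ei)) graph.addJoinEdges(ti, ei);
    }

    @Override
    public void executeInstruction(VM vm, ThreadInfo ti, Instruction insn) {
        // Record heap accesses in the access set of the executing task's node.
        if (insn instanceof FieldInstruction) {
            graph.recordAccess(ti, (FieldInstruction) insn);
        }
    }

    // Hypothetical bookkeeping stub; the class-name checks are only a heuristic.
    static class GraphBuilder {
        boolean isTaskObject(ElementInfo ei)   { return ei.getClassInfo().getName().contains("Async"); }
        boolean isFinishObject(ElementInfo ei) { return ei.getClassInfo().getName().contains("Finish"); }
        void addForkEdge(ThreadInfo ti, ElementInfo ei)       { /* add node and fork edge */ }
        void addJoinEdges(ThreadInfo ti, ElementInfo ei)      { /* join tasks at a new node */ }
        void recordAccess(ThreadInfo ti, FieldInstruction fi) { /* update the node's access set */ }
    }
}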

The approach in this paper is compared to two other approaches implemented in JPF: the Precise race detector (PRD) and Gradual permission regions (GPR). The PRD algorithm is a partial order reduction based on JPF’s ability to dynamically detect shared memory accesses; in this mode, JPF schedules on all detected shared memory accesses. GPR uses program annotations to reduce the number of shared locations at which scheduling must be considered by grouping several bytecodes that access shared locations into a single atomic block of code with read/write indications [31]. For example, if there are two bytecodes that touch shared memory locations, PRD schedules at each of the two locations. In contrast, if those two locations are wrapped in a single permission region, then GPR only considers schedules from the start of the region, with the region treated as atomic. GPR is equivalent to PRD if every bytecode that accesses shared memory is put in its own region. Both approaches are forms of partial order reduction, with GPR outperforming PRD by virtue of considering significantly fewer scheduling points via the user-annotated permission regions.
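To make the contrast concrete, the fragment below uses hypothetical region markers rather than the actual GPR annotations (which are not reproduced here); the point is only where scheduling decisions occur.

// Hypothetical illustration of permission regions; the marker names are not the
// actual GPR API. Without the region, PRD schedules at each of the shared
// accesses; with the region, GPR schedules once at the region entry and treats
// the body as atomic with a read/write permission summary.
class RegionExample {
    static int x, y;                       // shared locations

    void withoutRegions() {
        x = 1;                             // PRD: scheduling point
        y = x + 1;                         // PRD: further scheduling points (read x, write y)
    }

    void withRegion() {
        Regions.begin("rw:x,y");           // GPR: single scheduling point, body is atomic
        x = 1;
        y = x + 1;
        Regions.end();
    }

    // Hypothetical markers standing in for the GPR annotations.
    static class Regions {
        static void begin(String perms) { }
        static void end() { }
    }
}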

The comparison over a set of benchmarks is shown in Table 1. The benchmarks are a collection of those from the Habanero Java distribution itself and various presentation materials introducing the Habanero model; other benchmarks come from testing various language constructs during development. The table indicates, for each benchmark, its relative size in lines-of-code and tasks, the number of states generated by JPF for the proof, the time in minutes and seconds, and finally whether or not a race was detected. The “-” indicates that no results are available because the approach exceeded the arbitrary one-hour time bound for each run. The “X” in the time column indicates that the analysis ran out of memory. Two times are reported for the approach in this paper, one for each method of data-race detection on the generated computation graphs: transitive closure (TC) and FastTrack (FT). The experiments were run on a machine with a 2.8 GHz Intel Core i5 processor and 8 GB of RAM.

Table 1 Computation graphs versus permission regions versus PreciseRaceDetector

The table shows that, in general, PRD does not finish within the time bound. The “Y*” in the Race column for PRD indicates that PRD incorrectly reports data-race on array objects in some examples. For GPR, the “Y*” indicates that GPR reports the intended data-race between mutually exclusive regions. GPR falls behind quickly as the number of permission regions grows. The difference in performance is seen in the Add, Scalar multiply, and Prime number counter benchmarks, which use shared variables. The regions are made as big as possible without creating a data-race. The last six benchmarks also have isolated regions; as a result, the state space for computation graphs is also large compared to the other benchmarks. Of course, in the presence of isolation, the approach in this paper must enumerate all possible computation graphs, so it suffers the same state explosion as other model checking approaches.

On benchmarks with no data-race, the TC and FT algorithms are comparable. Because FT checks for data-race on the fly, it finishes on some benchmarks that have data-race while the TC algorithm does not. Even PRD and GPR finish on True Dependant Linear and Last Private Missing, while the TC algorithm runs out of memory or times out. The TC algorithm can be made to work on the fly for programs without isolation by unioning the access sets of parent nodes and their direct children [28]; this adjustment would make it more broadly useful.

For each benchmark that caused the TC algorithm to time out or run out of memory, a smaller instance of the same benchmark was also tested (with the exception of Jacobi Parallel). These rows are marked with an asterisk in the benchmark name; they illustrate how checking a program against a smaller problem instance can be a useful technique in data-race detection.

The benchmark Many Network Requests is an example of a case where the TC algorithm is more memory efficient than FT and therefore finishes while FT does not. In this benchmark, many tasks are spawned with few reads or writes per task. The benchmark is so named because it simulates a common idiom in languages that support lightweight tasks: initiate a network request in each task and then wait on the results at a later time. In this case, FT runs out of memory quickly because it must maintain, for each task and each shared variable, a vector clock whose size is linear in the number of tasks. Neither algorithm finishes analyzing Jacobi Parallel: FT runs out of memory and TC times out.
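The memory cost is easy to see from the shape of the data structures involved. The following is a simplified sketch; it omits FastTrack's epoch optimization, which avoids full clocks in the common case, and the task and variable counts are illustrative only.

import java.util.HashMap;
import java.util.Map;

// Simplified illustration of why vector-clock state grows with the task count:
// a full clock is an int per task, and in the worst case one clock is kept per
// shared variable, i.e., O(tasks * variables) ints overall.
class VectorClockFootprint {
    static final int NUM_TASKS = 10_000;               // many lightweight tasks

    static int[] newClock() { return new int[NUM_TASKS]; }

    public static void main(String[] args) {
        Map<String, int[]> writeClocks = new HashMap<>();   // one clock per shared variable (worst case)
        for (int v = 0; v < 1_000; v++) {
            writeClocks.put("var" + v, newClock());         // 10,000 * 1,000 ints is roughly 40 MB for these clocks alone
        }
        System.out.println("clocks allocated: " + writeClocks.size());
    }
}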

The next set of results is for bigger, real-world programs. The Crypt-af benchmark is an implementation of the IDEA encryption algorithm, and Series-af and Series-f are the Fourier coefficient analysis benchmarks adapted from the JGF suite [6] using async-finish and future constructs, respectively. The Strassen benchmark is adapted from the OpenMP version of the program in the Kastors suite [38]. These are quickly verified free of data-race using computation graphs, as shown below; PRD and GPR time out. Source code and additional benchmarks converted from benchmarks at https://github.com/LLNL/dataracebench can be found at https://bitbucket.org/byu-vv/jpf-hj/src/master/.

Test ID     SLOC   Tasks   States   (TC) mm:ss   (FT) mm:ss   Race
Crypt-af    1010   251     259      00:10        00:08        N
Series-af   730    99      106      00:03        00:01        N
Series-f    830    197     207      00:04        00:03        N
Strassen    560    1       2        00:37        00:37        N

8 Related work

Data-race detection in unstructured thread parallelism, where there is no defined protocol for creating and joining threads or for accessing shared memory, relies on static analysis to approximate parallelism and memory accesses [23, 26, 37] and then improves precision with dynamic analysis [10, 12, 16, 18, 27]. Other approaches reason about separate threads individually [19, 29]. These approaches make few assumptions about the parallelism for the sake of generality, and some apply fairly directly to structured programs, as seen in the previous section's comparison with FastTrack [16].

There is more recent work to improve the efficiency and precision of data-race detection in unstructured thread parallelism. RacerD restricts the analysis to only report data-races that it has high confidence are real, at the expense of proving a program data-race free [3]. Another approach compresses the execution trace using a grammar representation and analyzes the grammar for data-race [25]; it supports lock-set or happens-before analysis with vector clocks for detection. Yet another approach re-implements FastTrack and verifies the core algorithm correct, and thread safe, with CIVL [41]. None of these looks specifically at the structure of the parallelism per se, and they are best grouped with the techniques in the previous paragraph.

Structured parallelism constrains, through the programming model, how threads are created and joined and how shared memory is accessed. For example, a locking protocol leads to static, dynamic, or hybrid lock-set analyses for data-race detection that are typically more efficient than approaches for unstructured parallelism [13, 14, 39]. It is worth noting that locking protocols can be applied to isolation with similar results: they over-approximate the set of shared locations, potentially rejecting programs as having data-race when indeed they do not.

Dynamic data-race detection based on SP-bags has been shown to effectively scale to large program instances and the method has been applied to the Habanero programming model to support a limited set of Habanero keywords including futures [34, 35]. Another extension supports isolation [33] but is not complete, meaning it reports false races when isolated regions do not commute. The goal in this paper is verification and not run-time monitoring, so it needs to enumerate all possible computation graphs but can benefit from the more efficient SP-bags algorithm to detect data-race on-the-fly in the computation graph. Other dynamic data-race detectors for structured parallelism are based on access sets [28] and thread labels [30]. They can be more efficient for certain classes of programs but neither has been extended to support isolation.

Programmer annotations indicating shared interactions (e.g., permission regions) do improve model checking in general [40]. These are best understood as helping the partial order reduction by grouping several shared accesses into a single atomic block. The regions are then annotated with read/write properties to indicate what the atomic block is doing. The model checker only considers the interactions of these shared regions to reduce the number of executions explored to prove the system correct.

There are other model checkers for task-parallel languages [17, 42]. The first modifies JPF and an X10 run-time extensively (beyond the normal JPF options for customization), and the second is a new virtual machine built to model check the language. Both of these solutions require extensive programming, whereas the solution in this paper leverages the existing Habanero verification run-time for JPF. That run-time maps tasks to threads, making it small enough (relatively few lines of code) to argue correctness, and it works with JPF without any modification to JPF internals.

9 Conclusion and future work

This paper presents a model checking approach for data-race detection in task-parallel programs using computation graphs. The computation graph represents the happens-before relation of the task-parallel program and can readily be checked for data-race. The approach enumerates all computation graphs created by different schedules of isolated regions to prove data-race freedom. The data-race detection analysis is implemented for a Java implementation of the Habanero programming model using JPF and evaluated on a host of benchmarks. The results are compared to JPF’s precise race detector and to an extension based on gradual permission regions. The results show that computation graph analysis reduces the time required for verification significantly relative to JPF’s standard race detection.

The paper further implements FastTrack and uses it in the model checking to compare its data-race detection performance on the computation graph with that of Algorithm 1. To be clear, the model checking algorithm works with any data-race detection algorithm insofar as that algorithm is sound and complete with respect to the computation graph. The results over the benchmark set show that there are benefits to the on-the-fly nature of FastTrack, but that it is susceptible to overhead from the vector clocks. The results also confirm, in a small way, that checking smaller instances of a problem can be a practical fallback, at least in this benchmark set; it is hard to know how well that generalizes.

Future work is to reduce the number of schedules that must be considered by the model checker by weakening the happens-before relation in a manner similar to recent advances in dynamic data-race detection [20, 24]. Soundly weakening the happens-before relation grows the number of schedules covered by any one observed program execution, including schedules that order isolated statements differently. The larger equivalence classes captured by the weakened relation can then be used to prune schedules from consideration by the model checker. Other future work is to leverage static analysis, abstract interpretation, etc., to reason over the input space so that the proof can be generalized to all inputs and executions.