
1 Introduction

Debugging concurrent programs is an interesting application of reversibility. A reversible debugger allows one to explore a program execution by going forward (letting the program execute normally) or backward (rolling back the execution by undoing the effects of previously executed instructions). Several works have explored this idea in the past (see, e.g., the survey in [6]), and reversible debugging is used in mainstream tools as well [21]. Only recently, however, has the idea of a causal-consistent debugger been proposed, by Giachino et al. in [10]. The key idea in [10] was to base the debugger primitives on a causal-consistent reversible semantics for the target programming language. Causal consistency, introduced by Danos and Krivine in their seminal work on reversible CCS [5], allows one, when reversing a concurrent execution, to undo any event provided that its consequences, if any, are undone first. On top of a causal-consistent semantics one can define a rollback operator [15] to undo an arbitrary past action. It provides a minimality guarantee, useful for exploring concurrent programs prone to state explosion: only events in the causal future of the target one are undone, not events that are causally independent but happen to have been interleaved with it in the execution.

The CauDEr debugger [11, 17, 18] builds on these ideas and provides a reversible debugger for a core subset of the Erlang programming language [3]. Erlang is interesting because it mixes functional programming with a concurrency model inspired by actors [1], and it has been widely applied, since its initial uses by Ericsson, to build distributed infrastructures.

This paper presents an extension of CauDEr to take into account distribution primitives which are not part of the core subset of Erlang handled by CauDEr. Specifically, we additionally consider the three Erlang primitives \(\mathsf {start}\), to create a new node for executing Erlang processes, \(\mathsf {node}\), to retrieve the identifier of the current node, and \(\mathsf {nodes}\), which allows the current process to obtain a list of all the currently active nodes in an Erlang system. We also extend the \(\mathsf {spawn}\) primitive handled by CauDEr to take as an additional parameter the node on which to create a new Erlang process.

Adding support for these primitives is non-trivial, for they introduce causal dependencies in Erlang programs different from those originating in the functional and concurrent fragment considered in CauDEr, which covers, beyond sequential constructs, only message passing and process creation on the current node. Indeed, the set of nodes acts as a shared set variable that can be read, checked for membership, and extended with new elements. Interestingly, the causal dependencies induced by this shared set cannot be faithfully represented in the general model for reversing languages introduced in [14], which allows for resources that can only be produced and consumed.

The contributions of the current work are therefore as follows: (i) we extend the reversible semantics for the core subset of the Erlang language used by CauDEr with the above distribution primitives; (ii) we present a rollback semantics that underlies the rollback primitives in our extended CauDEr debugger; (iii) we have implemented an extension of the CauDEr debugger that handles Erlang programs written in our distributed fragment of the language; (iv) we illustrate on an example how our CauDEr extension can be used to capture subtle bugs in distributed Erlang programs. Due to space constraints, we do not detail our extended CauDEr implementation in this paper, but the code is publicly available in the dedicated GitHub repository [8].

The rest of this paper is organized as follows. Section 2 briefly recalls the reversible semantics on which CauDEr is based [19]. Section 3 presents the distributed Erlang language fragment we consider in our CauDEr extension, its reversible semantics and the corresponding rollback semantics. Section 4 briefly describes our extension to CauDEr and presents an example that illustrates bug finding in distributed Erlang programs with our extended CauDEr. Section 5 discusses related work and concludes the paper with hints for future work. Due to space constraints we omit some technicalities; for further details we refer the interested reader to the companion technical report [9] or to the master's thesis of the first author [7].

2 Background

We recall here the main aspects of the language in [19], as needed to understand our extension. We refer the interested reader to [19] for further details.

Fig. 1. Language syntax

2.1 The Language Syntax

Figure 1 shows the language syntax. The language depicted is a fragment of Core Erlang [2], an intermediate step in Erlang compilation. A module is a collection of function definitions, where a function is a mapping between a function name and a function expression. An expression can be a variable, a literal, a function name, a list, a tuple, a call to a built-in function, a function application, a case expression, or a let binding. An expression can also be a \(\mathsf {spawn}\), a \(\mathsf {send}\), a \(\mathsf {receive}\), or a \(\mathsf {self}\), which are built-in functions. Finally, we distinguish expressions, patterns and values. Here, patterns are built from variables, tuples, lists and literals, while values are built from literals, tuples and lists, i.e., they are ground patterns. When we have a \(\mathsf {case}~e~\mathsf {of} \ldots \) expression, we first evaluate e to a value, say v, then we search for a clause that matches v. When one is found, if the guard \(\mathsf {when}~expr\) is satisfied then the \(\mathsf {case}\) construct evaluates to the clause expression, otherwise the search continues with the next clause. The \(\mathsf {let}~X = expr_1~\mathsf {in}~expr_2\) expression binds inside \(expr_2\) the fresh variable X to the value to which \(expr_1\) reduces.
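To make the evaluation order concrete, here is a minimal sketch in plain Erlang syntax (not Core Erlang); the module and function names are illustrative assumptions:

```erlang
-module(demo).
-export([classify/1]).

%% Plays the role of "let Y = expr1 in expr2" followed by a case:
%% Y is bound to the value of X * 2, then the clauses are tried in
%% order, and a clause is selected only if its guard also holds.
classify(X) ->
    Y = X * 2,
    case Y of
        N when N > 10 -> big;
        _ -> small
    end.
```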

As for the concurrent features, since Erlang implements the actor model, there is no shared memory. An Erlang system is a pool of processes that interact by exchanging messages. Each process is uniquely identified by its pid and has its own queue of incoming messages. Function \(\mathsf {spawn}~(expr, [expr_1, \ldots , expr_n])\) evaluates to a fresh process pid p. As a side-effect, it creates a new process with pid p, which applies the function to which expr evaluates to the values of the arguments \(expr_1, \ldots , expr_n\). As in [19], we assume that the only way to introduce a new pid is through the evaluation of a \(\mathsf {spawn}\). Then, \(expr_1~!~expr_2\) allows a process to send a message to another one. Expression \(expr_1\) must evaluate to a pid (identifying the receiver process) and \(expr_2\) evaluates to the content of the message, say v. The whole send expression evaluates to v and, as a side-effect, the message will eventually be stored in the receiver's queue. The counterpart of message sending is \(\mathsf {receive}~clause_1, \ldots , clause_n~\mathsf {end}\). This construct traverses the queue of messages searching for the first message v that matches one of the n clauses. If no such message is found then the process suspends. Finally, \(\mathsf {self}\) evaluates to the pid of the current process.
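The following self-contained sketch exercises these primitives in plain Erlang; the module name and message shapes are our own illustrative choices:

```erlang
-module(ping).
-export([main/0, pong/0]).

%% spawn/3 returns the pid of the new process, "!" sends a message,
%% receive pattern-matches on the local queue, self() is the own pid.
pong() ->
    receive
        {ping, From} -> From ! pong   %% reply to the sender's pid
    end.

main() ->
    P = spawn(?MODULE, pong, []),     %% side-effect: new process P
    P ! {ping, self()},               %% the send evaluates to {ping, self()}
    receive
        pong -> ok                    %% suspends until pong arrives
    end.
```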

2.2 The Language Semantics

This subsection provides key elements to understand the CauDEr semantics. We start with the definition of process.

Definition 1 (Process)

A process is denoted by a tuple \(\langle p, \theta , e, q \rangle \), where p is the process’ pid, \(\theta \) is an environment, i.e. a map from variables to their actual value, e is the current expression to evaluate, and q is the queue of messages received by the process.

Two operations are allowed on queues: v : q denotes the addition of a new message on top of the queue and \(q \mathbin {\backslash \backslash }v\) denotes the queue q after removing v (note that v may not be the first message).
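As an illustration, here is a minimal sketch of the \(q \mathbin {\backslash \backslash }v\) operation, assuming queues are represented as Erlang lists, that v occurs in the queue, and that remove/2 is a hypothetical helper name:

```erlang
%% Removes the first occurrence of V from the queue; V need not be
%% the message at the head of the queue.
remove(V, [V | Q]) -> Q;
remove(V, [H | Q]) -> [H | remove(V, Q)].
```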

A (running) system can be seen as a pool of running processes.

Definition 2 (System)

A system is denoted by the tuple \(\varGamma ; \varPi \). The global mailbox \(\varGamma \) is a multiset of pairs of the form \((target\_process\_pid,~message)\), where a message is stored after being sent and before being scheduled to its receiver. \(\varPi \) is the pool of processes, denoted by an expression of the form

$$\begin{aligned} \langle p_1, \theta _1, e_1, q_1 \rangle \,|\,\ldots \,|\,\langle p_n, \theta _n, e_n, q_n \rangle \end{aligned}$$

where “ |” is an associative and commutative operator. \(\varGamma \cup \{(p,v)\}\), where \(\cup \) is multiset union, is the global mailbox obtained by adding the pair \((p,v)\) to \(\varGamma \). We write \(p \in \varGamma ; \varPi \) when \(\varPi \) contains a process with pid p.

We highlight a process p in a system by writing \(\varGamma ; \langle p, \theta ,e, q\rangle \,|\,\varPi \). The presence of the global mailbox \(\varGamma \), which is similar to the “ether” in [24], allows one to simulate all the possible interleavings of messages. Indeed, in this semantics the order of the messages exchanged between two processes belonging to the same runtime may not be respected, differently from what happens in current Erlang implementations. See [24] for a discussion on this design choice.

The semantics in [19] is defined in a modular way, similarly to the one presented in [4], i.e., there is a semantics for the expression level and one for the system level. This approach simplifies the design of the reversible semantics since only the system one needs to be updated. The expression semantics is defined as a labelled transition relation of the form:

$$\begin{aligned} \{Env,Expr\} \times Label \times \{Env, Expr\} \end{aligned}$$

where Env represents the environment, i.e., a substitution, and Expr denotes the expression, while Label is an element of the following set:

$$\{\tau , \mathsf {send}(v_1,v_2), \mathsf {rec}(\kappa , \overline{cl_n}), \mathsf {spawn}(\kappa , a/n, [\overline{v_n}]), \mathsf {self}(\kappa )\}$$

The semantics is a classical call-by-value semantics for a first-order language. Label \(\tau \) denotes the evaluation of a (sequential) expression without side-effects, like the evaluation of a \(\mathsf {case}\) expression or a \(\mathsf {let}\) binding. The remaining labels denote a side-effect associated with the rule execution or the request of some needed information. The system semantics uses the label to execute the associated side-effect or to provide the necessary information. More precisely, in label \(\mathsf {send}(v_1,v_2)\), \(v_1\) and \(v_2\) represent the pid of the target process and the value of the message. In label \(\mathsf {rec}(\kappa , \overline{cl_n})\), \(\overline{cl_n}\) denotes the n clauses of a \(\mathsf {receive}\) expression. Inside label \(\mathsf {spawn}(\kappa , a/n, [\overline{v_n}])\), a/n represents the function name, while \([\overline{v_n}]\) is the (possibly empty) list of arguments of the function. Where used, \(\kappa \) acts as a future: the expression evaluates to \(\kappa \), and the corresponding system rule then replaces it with its actual value.
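Schematically, and as our own illustration of the mechanism rather than a rule reproduced from [19], a \(\mathsf {self}\) evaluates at the expression level to the future \(\kappa \):

$$\begin{aligned} (\theta , \mathsf {self}()) \xrightarrow {\mathsf {self}(\kappa )} (\theta , \kappa ) \end{aligned}$$

The system rule for \(\mathsf {self}\) in a process with pid p then substitutes p for \(\kappa \) in the resulting expression.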

For space reasons, we do not show here the system rules, which are available in [19]. We will however show in the next section how sample rules are extended to support reversibility.

2.3 A Reversible Semantics

The reversible semantics is composed of two relations: a forward relation \(\rightharpoonup \) and a backward relation \(\leftharpoondown \). The forward reversible semantics is a natural extension of the system semantics obtained via a typical Landauer embedding [13]. The idea underlying Landauer's work is that any formalism or programming language can be made reversible by adding the history of the computation to each state. Hence, at each step this semantics saves in an external device, called history, the previous state of the computation so that such a state can be restored later on. The backward semantics allows us to undo a step while ensuring causal consistency [5, 16]: before undoing an action we must ensure that all its consequences have been undone.

In the reversible semantics each message exchanged must be uniquely identified in order to allow one to undo the sending of the “right” message, hence we denote messages with the tuple \(\{\lambda , v\}\), where \(\lambda \) is the unique identifier and v the message body. See [19] for a discussion on this design choice.

Due to the Landauer embedding the notion of process is extended as follows.

Definition 3 (Process)

A process is denoted by a tuple \(\langle p, h, \theta , e, q \rangle \), where h is the history of the process. The other elements are as in Definition 1. The expression \(\mathsf {op}(\ldots ): h\) denotes the history h with a new history item added on top. The generic history item \(\mathsf {op}(\ldots )\) ranges over the following set.

$$\begin{aligned} \{ \tau (\theta , e), \mathsf {send}(\theta , e, \{\lambda , v\}), \mathsf {rec}(\theta , e, \{\lambda , v\}, q), \mathsf {spawn}(\theta , e, p), \mathsf {self}(\theta , e)\} \end{aligned}$$

Here, each history item carries the information needed to restore the previous state of the computation. For rules that do not cause causal dependencies (i.e., \(\tau \) and \(\mathsf {self}\)) it is enough to save \(\theta \) and e. For the other rules we must carry additional information to check that every consequence has been undone before restoring the previous state. We refer to [19] for further details.
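For instance, as a sketch using the notation above, if a process in state \((\theta , e)\) with history h sends the message \(\{\lambda , v\}\), its history becomes

$$\begin{aligned} \mathsf {send}(\theta , e, \{\lambda , v\}): h \end{aligned}$$

so that undoing the send can restore exactly the pair \((\theta , e)\), while \(\lambda \) identifies which message has to be removed from the system.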

Fig. 2. An example of a rule belonging to the forward semantics and its counterpart.

Figure 2 shows a sample rule from the forward semantics (additions w.r.t. the standard system rule are highlighted in red) and its counterpart from the backward semantics. In the premises of rule Spawn we can see the expression-level semantics in action, transitioning from the configuration \((\theta , e)\) to \((\theta ', e')\) with the label that the forward semantics uses to determine the associated side-effect. When rule Spawn is applied, the system transitions to a new state where process \(p'\) is added to the pool of processes and the history of process p is enriched with the corresponding history item. Finally, the forward semantics takes care of updating the value of the future \(\kappa \) by substituting it with the pid \(p'\) of the new process.

The reverse rule, \(\overline{Spawn}\), can be applied only when all the consequences of the \(\mathsf {spawn}\), namely every action performed by the spawned process \(p'\), have been undone. Such a constraint is enforced by requiring the history of the spawned process to be empty. Since the last history item of p is the \(\mathsf {spawn}\), and thanks to the assumption that every new pid, except for that of the first process, is introduced by evaluating a \(\mathsf {spawn}\), we are sure that there are no pending messages for \(p'\). Hence, if the history is empty, we can remove process \(p'\) from \(\varPi \) and restore p to its previous state.

3 Distributed Reversible Semantics for Erlang

In this section we discuss how the syntax and the reversible semantics introduced in the previous section have been updated to tackle the three distribution primitives \(\mathsf {start}\), \(\mathsf {node}\) and \(\mathsf {nodes}\). Lastly, we extend the rollback operator introduced in [19, 20], which allows one to undo an arbitrary past action together with all and only its consequences, to support distribution.

3.1 Distributed System Semantics

The updated syntax is like the one in Fig. 1, with the only difference that now expr can also be \(\mathsf {start}(e)\), \(\mathsf {node}()\) or \(\mathsf {nodes}()\), and \(\mathsf {spawn}\) takes an extra argument that represents the node where the new process must be spawned.

Let us now briefly discuss the semantics of the new primitives. First, in function \(\mathsf {start}\), e must evaluate to a node identifier (also called a nid), which is an atom of the form ‘name@host’. Then, the function, as a side-effect, starts a new node, provided that no node with the same identifier exists in the network, and evaluates to the node identifier in case of success or to an error in case of failure. Node identifiers, contrary to pids, which are always generated fresh, can be hardcoded, as usually happens in Erlang. Also, function \(\mathsf {node}\) evaluates to the local node identifier. Finally, function \(\mathsf {nodes}\) evaluates to the (possibly empty) list of nodes to which the executing node is connected. A formalization of the intuition above can be found in [7]. Here, we assume that each node has an atomic view of the network, hence we do not consider network partitioning.
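For intuition, the following hedged sketch shows counterparts of these primitives in plain distributed Erlang, assuming the local runtime is started in distributed mode and taking slave:start/2 as one concrete way to start a node (the host and node names are illustrative):

```erlang
-module(dist_demo).
-export([demo/0]).

demo() ->
    %% Start a new node; this fails if a node with the same nid exists.
    {ok, Nid} = slave:start(localhost, storage1),
    io:format("local nid: ~p, connected nodes: ~p~n", [node(), nodes()]),
    %% Remote spawn: create a process running on node Nid.
    spawn(Nid, fun() -> ok end).
```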

Notions of process and system are updated to cope with the extension above.

Definition 4 (Process)

A process is denoted by a tuple \(\langle nid, p, \theta , e, q \rangle \), where nid is an atom of the form name@host, called a node identifier (nid), pointing to the node on which the process is running. For the other elements of the tuple the reader can refer to Definition 1.

The updated definitions of node and network follow.

Definition 5 (Node and network)

A node is a pool of processes, identified by a nid. A network, denoted by \(\varOmega \), is a set of nids. Hence, nids in a network should all be distinct.

Now, we can proceed to give the formal definition of a distributed system.

Definition 6 (Distributed system)

A distributed system is a tuple \(\varGamma ;\varPi ;\varOmega \). The global mailbox \(\varGamma \) and the pool of running processes \(\varPi \) are as before (but processes now include a nid). Instead, \(\varOmega \) represents the set of nodes connected to the network. We will use \(\cup \) to denote set union.

3.2 Causality

To understand the following development, one needs not only the operational semantics informally discussed above, but also a notion of causality. Indeed, backward rules can undo an action only if all its causal consequences have been undone, and forward rules should store enough information to both decide whether this is the case and, if so, to restore the previous state.

Thus, to guide the reader, we discuss below the possible causal links among the distribution primitives (including \(\mathsf {spawn}\)). As for the functional and concurrent primitives, the only dependencies are that a message receive is a consequence of the scheduling of the same message to the target process, which is in turn a consequence of its send.

Intuitively, there is a dependency between two consecutive actions if either they cannot be executed in the opposite order (e.g., a message cannot be scheduled before having been sent), or by executing them in the opposite order the result would change (e.g., by swapping a successful \(\mathsf {start}\) and a \(\mathsf {nodes}\) the result of the \(\mathsf {nodes}\) would change).

Beyond the fact that later actions in the same process are a consequence of earlier actions, we have the following dependencies:

  1. every action of process p depends on the (successful) \(\mathsf {spawn}\) of p;

  2. a (successful) \(\mathsf {spawn}\) on node nid depends on the \(\mathsf {start}\) of nid;

  3. a (successful) \(\mathsf {start}\) of node nid depends on previous failed \(\mathsf {spawn}\)s on the same node, if any (if we swap the order, the \(\mathsf {spawn}\) will succeed);

  4. a failed \(\mathsf {start}\) of node nid depends on its (successful) \(\mathsf {start}\);

  5. a \(\mathsf {nodes}\) reading a set \(\varOmega \) depends on the \(\mathsf {start}\) of all nids in \(\varOmega \), if any (as discussed above).

Fig. 3. Distributed forward reversible semantics

3.3 Distributed Forward Reversible Semantics

Figure 3 shows the forward semantics of the distribution primitives, described below. The other rules are as in the original work [19], except for the introduction of \(\varOmega \).

The forward semantics in [19] has just one rule for \(\mathsf {spawn}\), since it can never fail. Here, instead, a \(\mathsf {spawn}\) can fail if the node fed as first argument is not part of \(\varOmega \). Nonetheless, following the approach of Erlang, we always return a fresh pid, independently of whether the \(\mathsf {spawn}\) has failed or not. Also, the history item created in both cases is the same. Indeed, thanks to the uniqueness of pids, one can ascertain whether the \(\mathsf {spawn}\) of \(p'\) has been successful just by checking whether there is a process with pid \(p'\) in the system: if there is, the \(\mathsf {spawn}\) succeeded, otherwise it failed. Hence, the only difference between rules SpawnS and SpawnF is that a new process is created only in rule SpawnS.
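This check can be made concrete with a small sketch, assuming the pool \(\varPi \) is represented as an Erlang list of process tuples as in Definition 4 (the helper name is ours):

```erlang
%% The spawn of pid P succeeded iff some process with pid P is in the
%% pool Pi, each process being a {Nid, Pid, Theta, Exp, Queue} tuple.
spawn_succeeded(P, Pi) ->
    lists:any(fun({_Nid, Pid, _Theta, _Exp, _Q}) -> Pid =:= P end, Pi).
```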

Similarly, two rules describe the \(\mathsf {start}\) function: rule StartS for a successful \(\mathsf {start}\), which updates \(\varOmega \) by adding the new nid \(nid'\), and rule StartF for a \(\mathsf {start}\) which fails because a node with the same nid already exists. Here, contrary to the \(\mathsf {spawn}\) case, the two rules create different history items. Indeed, if two or more processes had the same history item \(\mathsf {start}(\theta ,e,nid)\), then it would not be possible to decide which one performed the \(\mathsf {start}\) first (and, hence, succeeded).

Lastly, the Nodes rule saves, together with \(\theta \) and e, the current value of \(\varOmega \). This is needed to check dependencies on the \(\mathsf {start}\) executions, as discussed in Sect. 3.2. The Node rule, since \(\mathsf {node}\) is a sequential operation, just saves the environment and the current expression.

Fig. 4. Extended backward reversible semantics

3.4 Distributed Backward Reversible Semantics

Figure 4 depicts the backward semantics of the distribution primitives.

The semantics is defined in terms of the relation \(\leftharpoondown _{p,r,\varPsi }\), where:

  • p represents the pid of the process performing the backward transition

  • r describes which action has been undone

  • \(\varPsi \) lists the requests satisfied by the backward transition (the supported requests are listed in Sect. 3.5)

These labels will come into play later on, while defining the rollback semantics. We may drop them when not relevant.

As already discussed, to undo an action, we need to ensure that its consequences, if any, have been undone before. When consequences in other processes may exist, side conditions are used to check that they have already been undone.

Rule \(\overline{SpawnS}\) is analogous to rule \(\overline{Spawn}\) in Fig. 2. Rule \(\overline{SpawnF}\) undoes a failed spawn. As discussed in Sect. 3.2, we first need to undo a \(\mathsf {start}\) of a node with the target nid, if any, otherwise the \(\mathsf {spawn}\) would now succeed. To this end, we check that \(nid' \notin \varOmega \).

Then, we have rule \(\overline{StartS}\) to undo the (successful) creation of node \(nid'\). Before applying it we need to ensure three conditions: (i) that no process is running on node \(nid'\); (ii) that no \(\mathsf {nodes}\) has read \(nid'\); and (iii) that no other \(\mathsf {start}\) of a node with identifier \(nid'\) failed. The conditions, discussed in Sect. 3.2, are checked by ensuring that the lists of pids computed by the auxiliary functions spawns, reads and \(failed\_starts\) are empty. Indeed, they compute the lists of pids of processes in \(\varPi \) that have performed, respectively, a \(\mathsf {spawn}\) on \(nid'\), a \(\mathsf {nodes}\) returning a set containing \(nid'\), and a failed \(\mathsf {start}\) of a node with identifier \(nid'\). Condition (i) needs to be checked since nids are hardcoded, hence any process could perform a spawn on \(nid'\). The check would be redundant if nids were created fresh by the \(\mathsf {start}\) function.
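As an illustration, here is a hypothetical sketch of spawns/2, assuming each process carries its history and that distributed spawn history items have the shape {spawn, Theta, Exp, Nid, ChildPid} (both assumptions are ours):

```erlang
%% Collects the pids of all processes whose history contains a spawn
%% (successful or not) on the target node TargetNid.
spawns(TargetNid, Pi) ->
    [Pid || {_Nid, Pid, Hist, _Theta, _Exp, _Q} <- Pi,
            lists:any(fun({spawn, _T, _E, N, _C}) -> N =:= TargetNid;
                         (_) -> false
                      end, Hist)].
```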

Rule \(\overline{StartF}\) instead requires no side condition: \(\mathsf {start}\) fails only if the node already exists, but this condition remains true afterwards, since we do not have primitives to stop a node. Rule \(\overline{Node}\) has no dependency either.

To execute rule \(\overline{Nodes}\) we must ensure that the value of \(\varOmega '\) in the history item and of \(\varOmega \) in the system are the same, as discussed in Sect. 3.2.

We now report a fundamental result of the reversible semantics. Like most of our results, it holds for reachable systems, that is, systems that can be obtained using the rules of the semantics starting from a single process with an empty history.

Lemma 1 (Loop Lemma)

For every pair of reachable systems, \(s_1\) and \(s_2\), we have \(s_1 \rightharpoonup s_2\) iff \(s_2 \leftharpoondown s_1\).

Proof

The proof that a forward transition can be undone follows by rule inspection. The other direction relies on the restriction to reachable systems: consider the process undoing the action. Since the system is reachable, restoring the state saved in the history item puts us back in a state where the undone action can be performed again (if the system were not reachable the history item could be arbitrary, hence there would be no such guarantee), as desired. Again, this can be proved by rule inspection.   \(\square \)

Note that, as exemplified above, this result would fail if we allowed one to undo an action before its consequences.

3.5 Distributed Rollback Semantics

Since undoing steps one by one may be tedious and unproductive for the developer, CauDEr provides a rollback operator that allows the developer to undo several steps in an automatic manner, while maintaining causal consistency. We extend it to cope with distribution. Our definition takes inspiration from the formalization style used in [20], but it improves it and applies it to a system with explicit local queues for messages. Dealing with explicit local queues is not trivial. Indeed, without local queues the receive primitive takes messages directly from \(\varGamma \). With local queues we use a rule called Sched to move a message from \(\varGamma \) to the local queue of the target process, and the receive takes the message from the local queue. A main point is that the Sched action does not create an item in the history of the process receiving the message; as a result it is concurrent with all other actions of the same process except receive. We refer to [19] for a formalization of rule Sched and of its inverse. When during a rollback both a \(\overline{Sched}\) and another backward transition are enabled at the same time, one has to choose which one to undo, and selecting the wrong one may violate the property that only consequences of the target action are undone.

We denote a system in rollback mode by \(\lceil \lceil {\mathcal S}\rceil \rceil _{\{p,\psi \}}\), where the subscript means that we wish to undo the action \(\psi \) performed by process p and every action which depends on it. More generally, the subscript of \(\lceil \lceil \cdot \rceil \rceil \), often denoted by \(\varPsi \) or \(\varPsi '\) (where \(\varPsi \) can be empty while \(\varPsi '\) cannot), can be seen as a stack (with : as cons operator) of undo requests that need to be satisfied. Once the stack is empty, the system has reached the state desired by the user. We consider requests \(\{p,\psi \}\), asking process p to undo a specific action, namely:

  • \(\{p,s\}\): a single step back;

  • \(\{p,\lambda ^{\Downarrow }\}\): the receive of the message uniquely identified by \(\lambda \);

  • \(\{p,\lambda ^{\Uparrow }\}\): the send of the message uniquely identified by \(\lambda \);

  • \(\{p,\lambda ^{sched}\}\): the scheduling of the message uniquely identified by \(\lambda \);

  • \(\{p, st_{nid'}\}\): the successful start of node \(nid'\);

  • \(\{p, sp_{p'}\}\): the spawn of process \(p'\).

The rollback semantics is defined in Fig. 5 in terms of the relation \(\rightsquigarrow \), selecting which backward rule to apply and when. There are two categories of rules: (i) U-rules that perform a step back using the backward semantics; (ii) rule Request that pushes a new request on top of \(\varPsi \) whenever it is not possible to undo an action since its consequences need to be undone before.

Fig. 5. Rollback semantics

Let us analyse the U-rules. During rollback, more than one backward rule could be applicable to the same process. In our setting, the only possibility is that one of the rules is a \(\overline{Sched}\) and the other one is not. It is important to select which rule to apply, to ensure that only consequences of the target action are undone.

First, if an enabled transition satisfies our target, then it is executed and the corresponding request is removed (rule \(U-Satisfy\)). Intuitively, since two applications of rule \( Sched \) to the same process are always causally dependent, if the target action is an application of \( Sched \), then an enabled \( Sched \) is for sure one of its consequences, hence it needs to be undone (rule \(U-Sched\)). Dually, if the target is not a \( Sched \) and a non-\( Sched \) is enabled, we undo the latter (rule \(U-Act\)). If a unique rule is applicable, then it is selected (rule \(U-Unique\)).

Rule Request considers the case where no backward transition of the target process is enabled. This happens when the action on top of the history has some consequence in another process. Such a consequence needs to be undone first, hence the rule uses operator \(\mathsf {dep}\) in Fig. 6 to find both the dependency and the process concerned, and pushes the corresponding request on top of \(\varPsi \).

Let us discuss operator \(\mathsf {dep}\). In the first case, a send cannot be undone since the sent message is not in the global mailbox, hence a request has to be made to the receiver \(p'\) to undo the Sched of message \(\lambda \).
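Schematically, and assuming the notation above (this is our illustration, not a rule reproduced from Fig. 5), rolling back a send whose message has already been scheduled first pushes a request for the receiver:

$$\begin{aligned} \lceil \lceil {\mathcal S}\rceil \rceil _{\{p,\lambda ^{\Uparrow }\}} \rightsquigarrow \lceil \lceil {\mathcal S}\rceil \rceil _{\{p',\lambda ^{sched}\}:\{p,\lambda ^{\Uparrow }\}} \end{aligned}$$

Once \(p'\) has undone the Sched of \(\lambda \) (and, recursively, its consequences), the message is back in \(\varGamma \) and the original request can be satisfied.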

Fig. 6. Dependencies operator

In case of multiple dependencies, we add them one by one. This happens, e.g., in case \(\mathsf {nodes}\), where we need to undo the start of all the nodes which are in \(\{nid'\}\cup \varOmega '\) but not in \(\varOmega \). Adding all the dependencies at once would make the treatment more complex: by solving one of them we may solve others as well, and thus we would need an additional check to avoid starting a computation to undo a dependency which is no longer there. Adding the dependencies one by one solves the problem, hence operator \(\mathsf {dep}\) nondeterministically selects one of them. Notice also that the order in which dependencies are solved is not relevant.

In some cases (e.g., \(\mathsf {send}\)) we find a precise target event, in others we use just s, that is, a single step. In the latter case, a backward step is performed (and its consequences are undone), then the condition is re-checked and another backward step is required, until the correct step is undone. We could have computed more precise targets, but this would have required additional technicalities.

Function \(parent(nid',\varPi )\), used in the definition of \(\mathsf {dep}\), returns the pid of the process that started \(nid'\) while function \(fst(\cdot )\) returns the first element of a list.

An execution of the rollback operator corresponds to a backward derivation, while the opposite is generally false.

Theorem 1 (Soundness of rollback)

If \(\lceil \lceil {\mathcal S}\rceil \rceil _{\varPsi '} \rightsquigarrow ^* \lceil \lceil {\mathcal S}'\rceil \rceil _{\varPsi }\) then \({\mathcal S}\leftharpoondown ^* {\mathcal S}'\) where \(^*\) denotes reflexive and transitive closure.

Proof

The rollback semantics either executes backward steps using the backward semantics or executes administrative steps (i.e., pushing new requests on top of \(\varPsi \)), which do not alter the state of the system. The thesis follows.   \(\square \)

In addition, the rollback semantics generates the shortest computation satisfying the desired rollback request.

Theorem 2 (Minimality of rollback)

If \(\lceil \lceil {\mathcal S}\rceil \rceil _{\varPsi } \rightsquigarrow ^* \lceil \lceil {\mathcal S}'\rceil \rceil _{\emptyset }\) then the backward steps occurring as first premises in the derivation of \(\lceil \lceil {\mathcal S}\rceil \rceil _{\varPsi } \rightsquigarrow ^* \lceil \lceil {\mathcal S}'\rceil \rceil _{\emptyset }\) form the shortest computation from \({\mathcal S}\) satisfying \(\varPsi \) derivable in the reversible semantics.

A precise formalization and proof of this result are quite long, hence for space reasons we refer to [7, Theorem 3.2].

4 Distributed CauDEr

CauDEr [11, 17, 18] is the proof-of-concept debugger that we extended to support distribution following the semantics above. Notably, CauDEr works on Erlang, but the primitives for distribution are the same in Core Erlang and in Erlang, hence our approach can be directly applied. CauDEr is written entirely in Erlang and bundled with a convenient graphical user interface that facilitates the interaction. The usual CauDEr workflow is the following. The user selects the Erlang source file, then CauDEr loads the program and shows the source code. Then, the user can select the function that will act as entry point, specify its arguments, and choose the node identifier where the first process runs. The user can either perform single steps on some process (both forward and backward), perform n steps in the chosen direction in an automatic manner (a scheduler decides which process performs each step), or use the rollback operator.

The interface (see Fig. 7) is organized as follows: CauDEr shows the source code on the top left, the selected process' state and history (the log is not considered in this paper) on the bottom left, and information on the system structure and execution on the bottom right. Execution controls are on the top right.

We illustrate below how to use CauDEr to find a non-trivial bug.

Fig. 7. A screenshot of CauDEr.

Finding Distributed Bugs with CauDEr. Let us consider the following scenario. A client produces a stream of data and wants to store them in a distributed storage system. A server acts as a hub: it receives data from the client, forwards them to a storage node, receives a confirmation that the storage node has saved the data, and finally sends an acknowledgement to the client. Each storage node hosts one process only, acting as node manager, and has an id as part of its name, ranging from one to five. Each node manager is able to store at most m packets. Once the manager reaches the limit, it informs the server that its capacity has been reached. The server holds a list of domains and an index referring to one of them. Each domain is coupled with a counter, i.e., an integer, and each domain can host at most five storage nodes. Each time the server receives a notification from a node manager stating that the node's maximum capacity has been reached, it proceeds as follows. If the id of the current storage manager is five, it means that the domain has reached its capacity. Then, the server selects the next domain in the list, resets its counter and starts a new node (and a corresponding storage manager) on the new domain. If the id of the node is less than five, the server increases its counter and then starts a new node (and storage manager) on the same domain, using the value of the counter as the new id. Each node should host at most one process.
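The intended node-rotation logic can be sketched as follows; this is our hypothetical reconstruction of the correct behaviour, not code from the repository [8], and the names handle_full/3 and start_storage/2 are illustrative:

```erlang
%% On a {store, full} notification: if the current id is 5, select the
%% next domain first, reset the counter, and only then start the new
%% storage node there; otherwise stay on the domain and bump the counter.
handle_full(Domains, Idx, 5) ->
    NewIdx = (Idx rem length(Domains)) + 1,
    start_storage(lists:nth(NewIdx, Domains), 1),
    {NewIdx, 1};
handle_full(Domains, Idx, Counter) ->
    start_storage(lists:nth(Idx, Domains), Counter + 1),
    {Idx, Counter + 1}.

%% Illustrative stub: starts a node named storage<Id> on Domain and
%% spawns the storage manager there (its code is left abstract).
start_storage(Domain, Id) ->
    Name = list_to_atom("storage" ++ integer_to_list(Id)),
    {ok, Nid} = slave:start(Domain, Name),
    spawn(Nid, fun() -> ok end).
```

The bug described below stems precisely from performing these two steps in the wrong order.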

Let us now consider the program \(\mathsf {distributed\_storage\_node.erl}\), available in the GitHub repository [8], which shows a wrong implementation of the program described above. In order to debug the program, one has to load it and start the system. Then, it is sufficient to execute about 1500 steps forward to notice that something went wrong. Indeed, by checking the \(\mathsf {Trace}\) box (Fig. 7) one can see a warning: a \(\mathsf {start}\) has failed since a node with the same identifier already existed. Then, since no check is performed on the result of the \(\mathsf {start}\), the program spawns a new storage manager on a node with the same identifier as the one that failed to start. Hence, two storage managers now run on the same node.

To investigate why this happened, one can roll back to the reception of the message \(\mathsf {\{store,full\}}\) right before the failed \(\mathsf {start}\). Note that it would not be easy to obtain the same result without reversibility: one would need to re-run the program and, at least in principle, a different scheduling might lead to a different state where the error does not occur. After rolling back, one can perform forward steps on the server in manual mode, since the misbehavior happened there. After receiving the message, the server enters the case where the index of the storage manager is 5, which is correct because so far we have 5 storage nodes on the domain. Now, the server performs the start of the node (and of the storage manager) on the selected domain and only afterwards selects the new domain, whereas it should have first selected a new domain and then proceeded to start a new storage node (and a new storage manager) there. This misbehavior occurred because a few lines of code were swapped.

5 Related Work and Conclusion

In this work we have presented an extension of CauDEr, a causal-consistent reversible debugger for Erlang, and the related theory. CauDEr was first introduced in [17] (building on the theory in [19]) and then improved in [11] with a refined graphical interface and support for working directly on Erlang instead of Core Erlang. We built our extension on top of this last version. CauDEr was able to deal with the concurrent aspects of Erlang: our extension also supports some distribution primitives (\(\mathsf {start}\), \(\mathsf {node}\) and \(\mathsf {nodes}\)). We built the extension on top of the modular semantics for Erlang described in [11, 19]. Monolithic approaches to the semantics of Erlang also exist [22], but the two-layer approach is more convenient for us since the reversible extension only affects the system layer.

Another work defining a formal semantics for distributed Erlang is [4]. There the emphasis is on ensuring the order of messages is respected in intra-node communications but not in inter-node communications (an aspect we do not consider). Similarly to us, they have rules to start new nodes and to perform remote spawns, although they do not consider the case where these rules fail.

In the context of CauDEr, replay has also been studied [20]. In particular, CauDEr supports causal-consistent replay, which allows one to replay the execution of the system up to a selected action, including all and only its causes. This can be seen as dual to rollback. Our extension currently does not support replay; we leave it for future work.

To the best of our knowledge, causal-consistent debugging has been explored in a few settings only. The seminal paper [10] introduced causal-consistent debugging in the context of the toy language \(\mu \)Oz. Closer to our work is Actoverse [23], a reversible debugger for the Akka actor model. Actoverse provides message-oriented breakpoints, which allow the user to stop when some conditions on messages are satisfied, rollback, state inspection, a message timeline and session replay, which allows one to replay the execution of a program given the log of a computation, as well as the capacity to go back in the execution. While many of these features would be interesting for CauDEr, Actoverse currently does not support distribution.

Reversible debugging of concurrent programs has also been studied for imperative languages [12]. However, differently from us, they force actions to be undone in reverse order of execution, and they do not support distribution.

As future work, it would be interesting to refine the semantics to deal with failures (node crashes, network partitions). Indeed, failures are unavoidable in practice, and we think reversible debugging in a faulty context could be of great help to the final user. It would also be good to extend CauDEr and the related theory to support additional features of the Erlang language, such as error handling, failure notification, and code hot-swapping. Finally, we would like to experiment with more case studies to understand the practical impact of our tool.