Keywords

1 Introduction

The “happens before” or the causality relation, denoted \(\rightarrow \), between events in a distributed system was defined by Lamport [6]. Given two events e and \(e'\), the causality detection problem asks to determine whether \(e\rightarrow e'\).

There is a rich literature on solutions for solving the causality detection problem between events. See [4, 5, 9, 15, 17] for an overview of some approaches such as tracking causality graphs, scalar clocks, vector clocks [3, 8], and variants of logical clocks such as hierarchical clocks, plausible clocks, dotted version vectors, Bloom clocks, interval tree clocks and resettable encoded vector clocks. Some of these variants track causality accurately while others introduce approximations as trade-offs to save on the space and/or time and/or message overheads. Schwarz and Mattern [17] stated that the quest for the holy grail of the ideal causality tracking mechanism is on. This literature above assumed that processes are correct (non-faulty). The causality detection problem for a system with Byzantine processes was recently introduced and studied in [11].

A related problem is the causal ordering of messages. Under the Byzantine failure model, causal ordering has recently been studied in [10, 12, 13].

Contributions. It was recently proved that the problem of detecting causality between a pair of events cannot be solved in an asynchronous system in the presence of Byzantine processes, irrespective of whether the communication is via unicasts, multicasts, or broadcasts [11]. In the multicast mode of communication, each send event sends a message to a group consisting of a subset of the set of processes in the system. Different send events can send to different subsets of processes. Communicating by unicasts and communicating by broadcasts are special cases of multicasting. It was shown in [11] that in asynchronous systems with even a single Byzantine process, the unicast and multicast modes of communication are susceptible to false positives and false negatives, whereas the broadcast mode of communication is susceptible to false negatives but no false positives. A false positive means that \(e\not \rightarrow e'\) whereas \(e\rightarrow e'\) is perceived/detected. A false negative means than \(e\rightarrow e'\) whereas \(e\not \rightarrow e'\) is perceived/detected. The impossibility result for asynchronous systems prompts us to examine whether the causality detection problem can be solved in synchronous systems in the presence of Byzantine processes. We answer in the affirmative for unicasts, multicasts, and broadcasts by outlining two approaches in this brief announcement. The results are summarized in Table 1.

Table 1. Solvability of causality detection between events under different communication modes in Byzantine-prone asynchronous and synchronous systems. FP is false positive, FN is false negative. \(\overline{FP}/\overline{FN}\) means no false positive/no false negative is possible.

2 System Model

The distributed system is modelled as an undirected complete graph \(G = (P,C)\). Here P is the set of processes communicating in the distributed system. Let \(|P|=n\). C is the set of (logical) FIFO communication links over which processes communicate by message passing. The processes may be Byzantine [7, 14]. The distributed system is assumed to be synchronous [2].

Let \(e^x_i\), where \(x \ge 1\), denote the x-th event executed by process \(p_i\). An event may be an internal event, a message send event, or a message receive event. Let the state of \(p_i\) after \(e^x_i\) be denoted \(s^x_i\), where \(x\ge 1\), and let \(s^0_i\) be the initial state. The execution at \(p_i\) is the sequence of alternating events and resulting states, as \(\langle s^0_i,e^1_i,s^1_i,e^2_i, s^2_i\ldots \rangle \). The sequence of events \(\langle e^1_i,e^2_i,\ldots \rangle \) is called the execution history at \(p_i\) and denoted \(E_i\). Let \(E=\bigcup _i\{E_i\}\) and let T(E) denote the set of all events in (the set of sequences) E. The happens before [6] relation, denoted \(\rightarrow \), is an irreflexive, asymmetric, and transitive partial order defined over events in a distributed execution that is used to define causality. The causal past of an event e is denoted as CP(e) and defined as the set of events \(\{e'\in T(E)\,|\, e'\rightarrow e\}\).

3 Problem Formulation and a Brief Overview of Solutions

The problem formulation is analogous to that in [11]. An algorithm to solve the causality detection problem collects the execution history of each process in the system and derives causal relations from it. \(E_i\) is the actual execution history at \(p_i\). For any causality detection algorithm, let \(F_i\) be the execution history at \(p_i\) as perceived and collected by the algorithm and let \(F=\bigcup _i\{F_i\}\). F thus denotes the execution history of the system as perceived and collected by the algorithm. Analogous to T(E), let T(F) denote the set of all events in F. Analogous to the definition of \(\rightarrow \) on T(E) [6], the happens before relation can be defined on T(F) instead of on T(E).

Let \(e1\rightarrow e2|_E\) and \(e1\rightarrow e2|_F\) be the evaluation (1 or 0) of \(e1\rightarrow e2\) using E and F, respectively. Byzantine processes may corrupt the collection of F to make it different from E. We assume that a correct process \(p_i\) needs to detect whether \(e^x_h \rightarrow e^*_i\) holds and \(e^*_i\) is an event in T(E). If \(e^x_h \not \in T(E)\) then \(e^x_h\rightarrow e^*_i|_E\) evaluates to false. If \(e^x_h \not \in T(F)\) (or \(e^*_i \not \in T(F)\)) then \(e^x_h\rightarrow e^*_i|_F\) evaluates to false. We assume an oracle that is used for determining correctness of the causality detection algorithm; this oracle has access to E which can be any execution history such that \(T(E)\supseteq CP(e^*_i)\).

Byzantine processes may collude as follows.

  1. 1.

    To delete \(e^x_h\) from \(F_h\) or in general, record F as any alteration of E such that \(e^x_h\rightarrow e^*_i|_F = 0\), while \(e^x_h\rightarrow e^*_i|_E = 1\), or

  2. 2.

    To add a fake event \(e^x_h\) in \(F_h\) or in general, record F as any alteration of E such that \(e^x_h\rightarrow e^*_i|_F = 1\), while \(e^x_h\rightarrow e^*_i|_E = 0\).

Without loss of generality, we have that \(e^x_h \in T(E)\cup T(F)\). Note that \(e^x_h\) belongs to \(T(F)\setminus T(E)\) when it is a fake event in F.

Definition 1

The causality detection problem \(CD(E,F,e^*_i)\) for any event \(e^*_i\in T(E)\) at a correct process \(p_i\) is to devise an algorithm to collect the execution history E as F at \(p_i\) such that \(valid(F)=1\), where

$$ valid(F) = \left\{ \begin{array}{ll} 1 &{} \text{ if } \forall e^x_h, e^x_h\rightarrow e^*_i|_E = e^x_h\rightarrow e^*_i|_F \\ 0 &{} \text{ otherwise } \end{array} \right. $$

When 1 is returned, the algorithm output matches the actual (God’s) truth and solves CD correctly. Thus, returning 1 indicates that the problem has been solved correctly by the algorithm using F. 0 is returned if either

  • \(\exists e^x_h\) such that \(e^x_h\rightarrow e^*_i|_E = 1 \wedge e^x_h\rightarrow e^*_i|_F = 0\) (denoting a false negative), or

  • \(\exists e^x_h\) such that \(e^x_h\rightarrow e^*_i|_E = 0 \wedge e^x_h\rightarrow e^*_i|_F = 1\) (denoting a false positive).

In our first solution, we use the Replicated State Machine (RSM) approach [16] and vector clocks in the algorithm for causality detection. We can show that F at a correct process can be made to exactly match E, hence there is no possibility of a false positive or of a false negative. The RSM approach works only in synchronous systems. In a system with n application processes, the RSM-based solution uses \(3t+1\) process replicas per application process, where t is the maximum number of Byzantine processes that can be tolerated in a RSM. Thus, there can be at most nt Byzantine processes among a total of \((3t+1)n\) processes partitioned into n RSMs of \(3t+1\) processes each, with each RSM having up to t Byzantine processes. By using \((3t+1)n\) processes and the RSM approach to represent n application processes, the malicious effects of Byzantine process behaviors are neutralized.

Another approach is as follows. A generic transformation from Byzantine failures to crash failures for synchronous systems can be applied [1], this requires \(t< n/3\). The possibility of correct Byzantine-tolerant causality detection would be implied by the possibility of correct crash-tolerant causality detection.