1 Introduction

Faulty software is estimated to cost the US economy 60 billion dollars per year [1] and has been single-handedly responsible for major newsworthy catastrophes. This problem is exacerbated by the fact that debugging (defined as the process of finding and fixing a fault) is complex and time consuming – estimated to consume 50–60 % of the time a programmer spends in the maintenance and development cycle [2]. Consequently, the development of effective and efficient methods for software fault localisation has the potential to greatly reduce costs, wasted programmer time and the possibility of catastrophe.

In this paper we advance the state of the art in lightweight fault localisation by building on research in spectrum-based fault localisation (sbfl). In sbfl, a measure is used to determine the degree of suspiciousness of each line of code with respect to being faulty, where this degree is defined as a function of the number of passing/failing traces that do/do not cover that code. sbfl is one of the most prominent areas of software fault localisation research, has recently been estimated to make up 35 % of published work in the field [3] and has been consistently demonstrated to be effective and efficient at finding faults [4–21].

However, so far there have not been many formal properties of the general problem of fault localisation which sbfl measures have been shown to satisfy, representing a potential theoretical shortcoming of the approach. Although properties that measures should satisfy a priori (such as strict rationality [22]) have been discussed, and measures that solve fault localisation sub-problems have been presented (such as single bug optimal measures [15]), there is not yet an sbfl measure that solves the problem of fault localisation for all benchmarks. Indeed, Yoo et al. have recently established theoretical results which show that a “best” performing suspiciousness measure does not exist [23]. In light of this, the sbfl literature has favoured developing measures with good experimental performance as opposed to developing them according to a priori requirements. This has facilitated a culture of borrowing measures from other domains [11, 15, 16], manually tweaking measures [13, 17], or using machine learning methods [19, 20, 24]. Thus, there remains the challenge of developing new, better performing and comparably efficient methods that can satisfy key properties of fault localisation. Our contributions in this paper are as follows:

  • We introduce and motivate a new formal framework denoted Probabilistic Fault Localisation (pfl), which can leverage any sbfl measure.

  • We formally prove that pfl satisfies desirable formal properties which sbfl does not.

  • We demonstrate that pfl techniques are substantially and statistically significantly more effective (using \(p = 0.01\)) than all known (200) sbfl measures at finding faults on what, to our knowledge, is the largest scale experimental comparison in software fault localisation to date.

  • We show that it is theoretically impossible to define strictly rational sbfl measures that can outperform given pfl techniques on many of our benchmarks.

  • We demonstrate that pfl maintains efficiency comparable to sbfl.

The rest of the paper is organised as follows. In Sect. 2 we present the formal preliminaries common to the approaches discussed in this paper and in Sect. 3 introduce a small illustrative example of sbfl. In Sect. 4, we introduce and motivate the formal theory underlying the pfl approach and formally prove it satisfies desirable fault localisation properties which sbfl does not. Section 5 presents our experimental comparison of pfl techniques against sbfl measures. Finally, we present related work and general conclusions.

2 Preliminaries

In this section we summarise the formal apparatus common to the approaches in this paper.

We model each program as an ordered set P \(=\langle C_1,\ldots , C_n\rangle \). Intuitively, each \(C_i\) can be thought of as a program entity, event, or proposition, which is executed, occurs, or is true if a corresponding program entity is covered in a given execution. A program entity (or component) can be thought of as a program statement, branch, path, or block of code [7, 25]. A component is called complex if it is the union of other components, and atomic otherwise. In practice, P is modelled as a set of atomic components in order to reduce overhead [4–21].

We model each test suite as an ordered set of test cases \(\mathbf T = \langle t_{1},\ldots , t_{m}\rangle \). Each test case \(t_k\) is a Boolean vector of length \(|{{\varvec{P}}}|\) such that \(t_k = \langle c_1^k,\ldots , c_{|{{\varvec{{P}}}}|}^k \rangle \), where \(c_i^k = \top \) if \(C_i\) is covered/occurs/is true in \(t_k\) and \(\bot \) otherwise. We also use 1 and 0 for \(\top \) and \(\bot \) respectively. \(C_{|{{\varvec{{P}}}}|}\) can be thought of as the event of the error (denoted by E), where \(c_{|{{\varvec{{P}}}}|}^k = e^k =\) \(\top \) if the test case fails and \(\bot \) if it passes. Intuitively, each test case records the coverage details of a given execution, and is failing/passing if that execution violates/satisfies a given specification, where a specification is a logically contingent proposition stated in some formal language.

Each test suite may be partitioned T = \(\mathbf F \cup \mathbf P \), where F and P are the set of failing and passing test cases respectively. By convention each test suite is ordered such that the failing traces appear before the passing. In general, we assume that every failing test case covers at least one component, and that every component is covered by at least one failing test case. We may represent a test suite with an \(m \times n\) coverage matrix, in which the k-th row of the i-th column represents whether \(C_i\) occurred in \(t_k\). An example of a coverage matrix is given in Fig. 2.

For each \(C_i \in {{\varvec{P}}}\) we can construct its program spectrum using a test suite. A program spectrum is defined as a vector of four elements \(\langle a_{ ef }^i, a_{ nf }^i, a_{ ep }^i, a_{ np }^i \rangle \), where \(a_{ ef }^i\) is the number of failing test cases in \(\mathbf T \) that cover \(C_i\), \(a_{ nf }^i\) is the number of failing test cases in \(\mathbf T \) that do not cover \(C_i\), \(a_{ ep }^i\) is the number of passing test cases in \(\mathbf T \) that cover \(C_i\) and \(a_{ np }^i\) is the number of passing test cases in \(\mathbf T \) that do not cover \(C_i\). Probabilistic expressions may be defined as a function of program spectra as follows. We identify \(P(C_i \cap E)\), \(P(\overline{C_i} \cap E)\), \(P(C_i\,\cap \,\overline{E})\) and \(P(\overline{C_i}\,\cap \,\overline{E})\) with \(\frac{a_{ ef }^i}{|\mathbf T |}\), \(\frac{a_{ nf }^i}{|\mathbf T |}\), \(\frac{a_{ ep }^i}{|\mathbf T |}\) and \(\frac{a_{ np }^i}{|\mathbf T |}\) respectively. Using definitions from probabilistic calculus [26], we may then identify many measures with a probabilistic expression.

A suspiciousness measure w maps a program entity to a real number as a function of its spectrum [15], where this number is called the program entity’s degree of suspiciousness. The higher the degree the more suspicious the program entity \(C_i\) is assumed to be with respect to being a fault. In practical sbfl the components in the program are investigated in descending order of suspiciousness until a fault is found. Prominent measures include Zoltar = \(\frac{a_{ ef }^i}{a_{ ef }^i+a_{ nf }^i+a_{ ep }^i + k}\) where k = \(\frac{10000a_{nf}^ia_{ ep }^i}{a_{ ef }^i}\) [15], Kulczynski2 = \(\frac{1}{2}(P(E|C_i) + P(C_i|E))\) [15], Ochiai = \({P(C_i \cap E)}/{\sqrt{P(C_i)P(E)}}\) [4], and Positive predictive power (ppv) = \(P(E|C_i)\) [6]. ppv is equivalent to the Tarantula measure [27].
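As an illustration, spectra and the measures above can be computed directly from a coverage matrix. The following sketch is ours, not from the paper; the function names and the list-based matrix layout are illustrative assumptions:

```python
# Sketch: computing a program spectrum and two suspiciousness measures.
# `column` holds C_i's coverage bit per test case; `errors` holds e^k per test case.

def spectrum(column, errors):
    """Return <a_ef, a_nf, a_ep, a_np> for one component."""
    a_ef = sum(1 for c, e in zip(column, errors) if c and e)
    a_nf = sum(1 for c, e in zip(column, errors) if not c and e)
    a_ep = sum(1 for c, e in zip(column, errors) if c and not e)
    a_np = sum(1 for c, e in zip(column, errors) if not c and not e)
    return a_ef, a_nf, a_ep, a_np

def ppv(a_ef, a_nf, a_ep, a_np):
    """Positive predictive power P(E|C_i) = a_ef / (a_ef + a_ep)."""
    return a_ef / (a_ef + a_ep) if a_ef + a_ep else 0.0

def ochiai(a_ef, a_nf, a_ep, a_np):
    """Ochiai: a_ef / sqrt((a_ef + a_nf) * (a_ef + a_ep)), after cancelling |T|."""
    denom = ((a_ef + a_nf) * (a_ef + a_ep)) ** 0.5
    return a_ef / denom if denom else 0.0
```

For instance, a component covered by all three failing and one of seven passing test cases has spectrum \(\langle 3, 0, 1, 6 \rangle\), giving ppv \(= 0.75\).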

Some suspiciousness measures are informally known as measures of causal strength [11]. Measures of causal strength are designed to measure the propensity of an event to cause a given effect. Any measure can be proposed as a measure of causal strength. Historically such measures have been developed around the premise that causes raise the probability of their effects. Prominent measures include Lewis = \(P(E|C_i)/P(E|\lnot C_i)\), Fitelson = \(P(E|C_i)/P(E)\), Suppes = \(P(E|C_i) - P(E|\lnot C_i)\), and Eells = \(P(E|C_i) - P(E)\) (see [11]).

Formally, a suspiciousness measure w is rational if and only if for all \(c > 0\) (1) \(w(a_{ ef }, a_{ nf }, a_{ ep }, a_{ np }) \le w(a_{ ef }+c, a_{ nf }-c, a_{ ep }, a_{ np })\), and (2) \(w(a_{ ef }, a_{ nf }, a_{ ep }+c, a_{ np }-c) \le w(a_{ ef }, a_{ nf }, a_{ ep }, a_{ np })\). The property of strict rationality is defined by replacing \(\le \) with < in these two conditions. Roughly speaking, a measure is rational/strictly-rational if more failing traces covering a component make it more suspicious, and more passing traces covering it make it less suspicious – conforming to our intuition of suspiciousness. Many suspiciousness measures have been shown to satisfy strict rationality, at least when \(a_{ef}, a_{ep} > 1\) [15, 22]. Naish et al. argue that it is reasonable to restrict the sbfl approach to rational measures [28].
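The two rationality conditions can be checked numerically for a candidate measure over a finite grid of spectra. This sketch (ours; a finite check, not a proof of rationality) tests the conditions with \(c = 1\):

```python
# Sketch: finite numerical check of the two rationality conditions
# for a measure w over all spectra with fixed numbers of failing
# and passing test cases.

def is_rational_on(w, total_f=5, total_p=5):
    for a_ef in range(total_f + 1):
        a_nf = total_f - a_ef
        for a_ep in range(total_p + 1):
            a_np = total_p - a_ep
            # (1) an extra covering failing trace must not lower w
            if a_nf >= 1 and w(a_ef, a_nf, a_ep, a_np) > w(a_ef + 1, a_nf - 1, a_ep, a_np):
                return False
            # (2) an extra covering passing trace must not raise w
            if a_np >= 1 and w(a_ef, a_nf, a_ep + 1, a_np - 1) > w(a_ef, a_nf, a_ep, a_np):
                return False
    return True
```

For example, a ppv-style measure \(a_{ef}/(a_{ef}+a_{ep})\) passes this check, whereas \(a_{ep} - a_{ef}\) fails it.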

We now discuss established methods for evaluating the performance of a suspiciousness measure. First, there are wasted effort scores (or W-scores). W-scores estimate the percentage of non-faulty components a user will look through until a fault is found. Best case, worst case, and average case W-scores have been defined [4, 11, 13, 15]. Where w is a measure, b is a fault with the highest degree of suspiciousness, and f is the number of faults which are equally suspicious to the most suspicious fault, we use the following definitions: \( best (w) = \frac{|\{x \mid w(x) > w(b)\}|}{|{{\varvec{{P}}}}|-1}100\), \( worst (w) = \frac{|\{x \mid w(x) \ge w(b)\}| - 1}{|{{\varvec{{P}}}}|-1}100\), \( average (w) = \) \( best (w) + \) \(\frac{ worst (w) - best (w)}{ f + 1}\). We use avg W-scores. Second, there are absolute scores (or A-scores) [29]. A-scores record whether a given measure found a fault after inspecting n non-faulty components [29]. Thus, for a given n a suspiciousness measure receives 100 % if the user finds a fault after investigating n non-faulty components, and 0 % otherwise. We use \(n = 0\). A suspiciousness measure performs well if it has low mean W-scores and high mean A-scores.
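The avg W-score definitions above can be sketched as follows (ours; `scores` maps component indices to suspiciousness values and `faults` is the set of faulty indices, both illustrative names):

```python
# Sketch: avg W-score for a ranked program, following the definitions
# of best(w), worst(w) and average(w) above.

def avg_w_score(scores, faults):
    n = len(scores)
    wb = max(scores[i] for i in faults)            # suspiciousness of the best-ranked fault b
    best = 100 * sum(1 for s in scores if s > wb) / (n - 1)
    worst = 100 * (sum(1 for s in scores if s >= wb) - 1) / (n - 1)
    f = sum(1 for i in faults if scores[i] == wb)  # faults tied with the most suspicious fault
    return best + (worst - best) / (f + 1)
```

For example, with scores \(\langle 5, 3, 3, 1 \rangle\) and the single fault at the second component, one non-faulty component outranks the fault and one ties with it, giving an avg W-score of 50.0.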

Finally, Naish et al. define the unavoidable costs of any strictly rational measure. These are the scores that the best performing strictly rational measure can possibly receive [28]. To determine this score, one constructs an ordered list with the property that for every component, \(C_i\) is ranked higher than a given fault \(C_j\) just in case every strictly rational measure would rank \(C_i\) higher than \(C_j\). The W/A-scores of this list are the unavoidable cost W/A-scores. Unavoidable cost scores estimate the upper bound limit for the performance of the sbfl approach in general (see [28] for details).

3 Example

We present a small example to illustrate sbfl. Consider the C program minmax.c in Fig. 1 (from [30]). The program is formally modelled as the following set of program entities P = \(\langle C_{1}, C_{2}, C_{3}, C_{4}, E \rangle \), where E models the event in which the specification \(\mathtt{assert(least <= most)}\) is violated. The program fails to always satisfy this specification. The explanation for the failure is the fault at \(C_3\), which should be an assignment to least instead of most. We collected coverage data from ten test cases to form our test suite \(\mathbf T = \langle t_1, \ldots , t_{10} \rangle \). The coverage matrix for these test cases is given in Fig. 2. Three of the test cases fail and seven pass. We compute the program spectrum for each component using the coverage matrix. For example, the program spectrum for \(C_3\) is \(\langle 3, 0, 1, 6 \rangle \).

Fig. 1. minmax.c

Fig. 2. Coverage matrix

To illustrate an instance of sbfl we use the suspiciousness measure Wong-2 = \(a^i_{ ef } - a^i_{ ep }\) [13]. The user inspects the program in decreasing order of suspiciousness until a fault is found. \(C_3\) is inspected first with a suspiciousness of 2 and thereby a fault is found immediately. The example illustrates that sbfl measures can be successfully employed as heuristics for fault localisation, but that the formal connection to fault localisation could potentially be improved.
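The Wong-2 value for \(C_3\) can be reproduced from its spectrum \(\langle 3, 0, 1, 6 \rangle\) given above (a quick check, ours):

```python
# Wong-2 suspiciousness: failing coverage count minus passing coverage count.
def wong2(a_ef, a_nf, a_ep, a_np):
    return a_ef - a_ep

# C_3's spectrum from the example yields the suspiciousness of 2 reported above.
print(wong2(3, 0, 1, 6))  # 2
```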

4 Estimating Fault Probability

In this section, we introduce assumptions to generate our estimation of fault probability and then prove this estimation satisfies important properties that are not satisfied by any sbfl measure.

We begin as follows. We introduce a probability function P the domain of which is a set of propositions. To define the set of propositions, we first define two sets of atomic propositions \(\mathbf H =\) \(\{h_i | C_i \in {{\varvec{P}}}\}\) and \(\mathbf C =\) \(\{h_i^k | C_i \in {{\varvec{P}}} \wedge t_k \in \mathbf T \}\). Intuitively, \(\mathbf H \) is a set of fault hypotheses, where \(h_i\) expresses the hypothesis that \(C_i\) is faulty, and \(\mathbf C \) is a set of causal hypotheses, where \(h_i^k \) expresses the hypothesis that \(C_i\) was the cause of the error E in execution \(t_k\). The set of propositions is then defined inductively as follows. For each \(p,q \in \mathbf{H \cup \mathbf C }\), p and q are propositions. If p and q are propositions, then \(p \wedge q\), \(p \vee q\), \(\lnot p\) are propositions. We also assume the following standard properties of probability [26]. For each proposition p and q: \(P(p) = 1 \) if \(p = \) \(\top \). \(P(p) = 0\) if \(p = \) \(\perp \). \(P(p \vee q)\) = P(p) + \(P(q) - P(p \wedge q)\). \(P(\lnot p) = 1 - P(p)\). \(P(p|q) = P(p \wedge q)/P(q)\).

We now present assumptions A1-7 which are designed to be plausible in any probability space induced by a test suite T for a faulty program P.

  • A1. For all \(h_i \in \mathbf H \), \(h_i = \bigvee _{k = 1}^{|\mathbf T |} h_i^k\).

This states that \(C_i\) is faulty just in case \(C_i\) was the cause of the error E in some execution of the program.

  • A2. For all \(t_k \in \mathbf F \), \(\bigvee _{i=1}^{|{{\varvec{{P}}}}|} h_i^k =\) \(\top \).

This states that for every failing trace, there is some component \(C_i \in {{\varvec{P}}}\) which caused the error E in that trace. In other words, if an error occurred then something must have caused it. For all \(h_i^k \in \mathbf {C} \) we also have the following

  • A3. if \(h_i^k = \top \) then \(C_i \ne E\).

  • A4. if \(h_i^k = \top \) then \(c_i^k = \top \) and \(e^k = \top \).

These assumptions state that if \(C_i\) was the cause of E in \(t_k\), then \(C_i\) must have been a different event to E (A3), and \(C_i\) and E must have actually occurred (A4). These two assumptions have been described as fundamental properties about causation [31]. For all \(h_i^k, h_j^k \in \mathbf C \)

  • A5. if \(C_i \ne C_j\) then \(h_i^k \wedge h_j^k = \) \(\bot \).

This states that no two events could both have been the cause of the error in a given trace. In other words, different causal hypotheses for the same trace are mutually exclusive. The rationale for this is that the intended meaning of \(h_i^k\) is that \(C_i\) was the cause of E in \(t_k\), and as the definite article implies uniqueness, no two events could each have been the cause. In general, any union of events may be said to be the cause so long as that union is in P. For all \(h_i^k \in \) C and every sample \(\mathbf S \subseteq \mathbf T - \{t_k\}\)

  • A6. \(P(h_i^k|\bigvee _{t_n \in S} h_i^n) = P(h_i^k)\).

This states that the probability that \(C_i\) was the cause in one trace is not affected by whether it was the cause in some others. In other words, whether it was the cause in one trace is statistically independent of whether it was in others. Here, we assume that our probabilities describe objective chance, and that the causal properties of each execution are determined by the properties of the events in that execution alone, and therefore cannot affect the causal properties of other executions. Independence principles are well established in probability theory [26].

In light of the above assumptions we may define \(c(t_k) = \{C_i|C_i \in {{\varvec{P}}} \wedge c_i^k = e^k = \top \wedge C_i \ne E\}\) as the set of candidate causes of E in \(t_k\). Following this, for some measure w, and all \(C_i, C_j \in c(t_k)\), we assume

  • A7. \(P(h_i^k) / P(h_j^k)\) \( {=} \) \( w(C_i) / w(C_j)\).

Here, we assume w measures the propensity of a given event to cause the error (and is thus motivated as a measure of causal strength as described in the preliminaries). Accordingly, the assumption states that the relative likelihood that one event caused the error over another, is directly proportional to their propensities to do so. In general, any suspiciousness measure w from the sbfl literature may be proposed as a measure of causal strength, and thus there is great room for experimentation over the definition of w. One formal proviso is that measures be re-scaled so \(w(C_i) > 0\) if \(a_{ef}^i > 0\) (this avoids divisions by zero). We use the notation pfl-w when measure w is being used.

We now show that the assumptions A1-7 (henceforth pfl assumptions) imply Eqs. (1), (2) and (3) (henceforth pfl equations). The pfl equations can be used to determine the probability that a given component \(C_i\) is faulty. For all \(h_i^k \in \mathbf C \)

$$\begin{aligned} P(h_i) = P(\bigvee _{n = 1}^{|\mathbf T |}\!\!h_i^n) \end{aligned}$$
(1)
$$\begin{aligned} P(\bigvee _{j=n}^{|\mathbf T |}\!\!h_i^j) = P(h_i^{n}) + P(\!\bigvee _{j=n+1}^{|\mathbf T |}\!\!\!h_i^j) - P(h_i^{n}) \, P(\!\bigvee _{j=n+1}^{|\mathbf T |}\!\!\!h_i^j) \end{aligned}$$
(2)
$$\begin{aligned} P(h_i^k) = {\left\{ \begin{array}{ll} w(C_i) \, / \!\!\sum \limits _{C_j \in c(t_k)}\!\! w(C_j) &{} \text {if } C_i \in c(t_k)\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
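The way the equations combine can be sketched in a few lines (ours; names are illustrative). Per failing trace, a covered component's causal probability is its weight normalised over the trace's candidate causes; repeated application of the disjunction equation amounts to combining traces as independent events, i.e. a noisy-or:

```python
# Sketch of the pfl equations. `failing_traces` is a list of sets of
# component indices covered by each failing trace; `weights` maps a
# component index to w(C) > 0.

def pfl_probability(i, failing_traces, weights):
    """P(h_i): probability that component i is faulty."""
    p_not = 1.0
    for covered in failing_traces:
        if i in covered:
            p_k = weights[i] / sum(weights[j] for j in covered)  # Eq. (3)
        else:
            p_k = 0.0                                            # Eq. (3), else-case
        p_not *= 1.0 - p_k   # Eq. (2) unrolled: p + q - pq = 1 - (1-p)(1-q)
    return 1.0 - p_not       # Eq. (1)
```

For example, with unit weights and failing traces \(\{C_1, C_2\}\) and \(\{C_1\}\), the second trace pins \(P(h_1) = 1\) while \(P(h_2) = 0.5\).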

Proposition 1

The pfl assumptions imply the pfl equations.

Proof

We first show Eq. (1). \(h_i = \bigvee _{k = 1}^{|\mathbf T |} h_i^k\) (by A1). Thus \(P(h_i) = P(\bigvee _{k = 1}^{|\mathbf T |} h_i^k)\) (by Leibniz’s law). We now show Eq. (2). The definition of disjunction states \(P(\bigvee _{j=n}^{|\mathbf T |}h_i^j) = P(h_i^{n}) + P(\!\bigvee _{j=n+1}^{|\mathbf T |}h_i^j) - P(h_i^{n} \wedge \bigvee _{j=n+1}^{|\mathbf T |}h_i^j)\). It remains to show \(P(h_i^{n} \wedge \bigvee _{j=n+1}^{|\mathbf T |}h_i^j)\) = \(P(h_i^{n})P(\bigvee _{j=n+1}^{|\mathbf T |}h_i^j)\). \(P(h_i^{n} \wedge \bigvee _{j=n+1}^{|\mathbf T |}h_i^j)\) is equal to \(P(h_i^{n} | \bigvee _{j=n+1}^{|\mathbf T |}h_i^j)P(\bigvee _{j=n+1}^{|\mathbf T |}h_i^j)\) (by probabilistic calculus). This is equal to \(P(h_i^{n})P(\bigvee _{j=n+1}^{|\mathbf T |}h_i^j)\) (by A6).

We now show Eq. (3). We have two cases to consider: \(C_i \in c(t_k)\) and \(C_i \notin c(t_k)\). Assume \(C_i \in c(t_k)\). We may assume \(t_k\) is ordered such that \(\bigwedge _{i=1}^n c_i^k =\) \( \top \), \(\bigwedge _{i=n+1}^{|{{\varvec{{P}}}}|-1} c_i^k = \) \(\perp \) and \(c_{|{{\varvec{{P}}}}|}^k = e^{k}\) \(= \top \) (such that \(c(t_k)\) = \(\{C_1, \dots ,C_{n}\}\)). Now, for all \(C_i, C_j \in c(t_k)\) \(P(h_i^k)/P(h_j^k)\) = \(w(C_i)\)/\(w(C_j)\) (by A7). Thus, for \(C_i, C_j \in c(t_k)\) \(w(C_i)/P(h_i^k)\) = \(w(C_j)/P(h_j^k)\) (as \(x/y = w/z \equiv z/y = w/x\)). So, \(w(C_1) / P(h_1^k)\) = \(w(C_2) / P(h_2^k) \) = \(\dots \) = \(w(C_n) / P(h_{n}^k)\). Thus, there is some c such that for all \(C_i \in c(t_k)\), \(c = w(C_i) / P(h_i^k)\) (by the last result). Equivalently, there is some c such that for all \(C_i \in c(t_k)\), \(P(h_i^k) = w(C_i)/c\). To complete the proof it remains to prove \(c = \sum _{C_j \in c(t_k)} w(C_j)\). \(\bigvee _{i=1}^{|{{\varvec{{P}}}}|} h_i^k = \top \) (by A2). But, \(\bigvee _{i=n+1}^{|{{\varvec{{P}}}}|-1} h_i^k = \) \(\perp \) (by A4), and \(h_{|{{\varvec{P}}}|}^k = \) \(\perp \) (by A3). Thus, \(\bigvee _{i=1}^n h_i^k = \top \) (by \(\vee \)-elimination). So, \(P(\bigvee _{i=1}^{n} h_i^k) = 1\) (by probabilistic calculus). Thus, \(\sum _{i=1}^{n} P(h_i^k) = 1\) (by probabilistic calculus and A5). So, \(\sum _{i=1}^{n} (w(C_i)/c) = 1\). Thus, \((\sum _{i=1}^{n} w(C_i))/c = 1\). Equivalently, \(\sum _{i=1}^{n} w(C_i) = c\). So, \(\sum _{C_i \in c(t_k)} w(C_i) = c\) (by def. of \(c(t_k)\) above). We now do the second condition. Assume \(C_i \notin c(t_k)\). Then \(\lnot (c_i^k = e^k = \top \wedge C_i \ne E)\) (by def. of \(c(t_k)\)). Thus \(c_i^k = \) \(\perp \) or \(e^k = \) \(\perp \) or \(C_i = E\). If \(C_i \ne E\), then \(P(h_i^k) = 0\) (by A3). If \(c_i^k =\) \( \perp \), then \(P(h_i^k) = 0\) (by A4). If \(e^k = \) \(\perp \), then \(P(h_i^k) = 0\) (by A4). Thus, if \(C_i \notin c(t_k)\), then \(P(h_i^k) = 0\).

To use the pfl equations, it remains for the user to choose a measure w for A7. One proposal is \(w(C_i)\) = \(P(E|C_i)\) (the ppv measure [11]) or \(P(E|C_i)/P(E)\) (the Fitelson measure of causal strength [32]). For the purposes of defining \(P(h_i^k)\) both proposals are equivalent (observe \(P(h_i^k) / P(h_j^k)\) \( {=} \) \( P(E|C_i) / P(E|C_j) = (P(E|C_i) / P(E)) / (P(E|C_j) / P(E))\) using A7).

The proposal captures three potentially plausible intuitions about causal likelihood. Firstly, it captures an intuition that the more something raises the probability of the error, the more likely it is to be the cause of it (to see this, observe we have \(P(h_i^k) / P(h_j^k)\) = \(P(E|C_i)\)/\(P(E|C_j)\) using A7). Secondly, it captures an intuition that events which do not affect the error’s likelihood are equally unlikely to have caused it (to see this, assume both \(C_i\) and \(C_j\) are independent of E i.e. \(P(E) = P(E|C_i) \) and \(P(E) = P(E|C_j)\); then it follows that \(P(h_i^k) = P(h_j^k)\) using A7). Thirdly, a plausible estimate of \(w(C_i)\) as a measure of \(C_i\)’s causal strength is the probability that \(C_i\) causes E given \(C_i\), and \(P(E|C_i)\) accordingly provides an upper bound for this estimate. In our running example pfl-ppv returns \(P(h_1) = 0.00\), \(P(h_2) = 0.31\), \(P(h_3) = 1.00\), and \(P(h_4) = 0.25\), thereby identifying the correct hypothesis as the most probable one.

Finally, given a test suite T and measure w, an algorithm to find a single fault in a program P is as follows. Step one: find \(\max _{h_j \in \mathbf {H} }(P(h_j))\) by computing the value of \(P(h_i)\) for each \(h_i \in \mathbf {H} \) using the pfl equations. If the most probable hypothesis \(h_j\) represents a fault in the program, the procedure stops. Otherwise, \(h_j\) is removed from the set of candidates by setting \(c_j^k =\) \(\perp \) for each \(t_k\), and we return to step one. We call this the pfl algorithm. A property of this algorithm is that yet to be investigated components can change in fault likelihood at each iteration.
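The pfl algorithm can be sketched as follows (ours; the callback `is_faulty` stands in for the user's inspection step, and the probability computation is inlined):

```python
# Sketch of the iterative pfl algorithm: rank hypotheses, inspect the
# most probable, and on a miss remove the component from every trace
# before re-ranking.

def pfl_localise(failing_traces, weights, is_faulty):
    traces = [set(t) for t in failing_traces]  # mutable copies
    inspected = []
    while any(traces):
        def prob(i):
            # noisy-or over failing traces, per the pfl equations
            p_not = 1.0
            for covered in traces:
                if i in covered:
                    p_not *= 1.0 - weights[i] / sum(weights[j] for j in covered)
            return 1.0 - p_not
        candidates = set().union(*traces)
        best = max(candidates, key=prob)
        inspected.append(best)
        if is_faulty(best):
            return best, inspected
        for covered in traces:   # refuted: set c_j^k to false everywhere
            covered.discard(best)
    return None, inspected
```

Note how removing a refuted hypothesis redistributes probability mass: the remaining candidates of each affected trace become more suspicious, illustrating that fault likelihoods can change at each iteration.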

We now identify desirable formal properties which we prove the pfl equations satisfy, but which no sbfl suspiciousness measure can.

Definition 1

Fault Likelihood Properties. For all \(C_i, C_j \in {{\varvec{P}}}\), where \(C_i \ne C_j\), we define the following:

  1.

    Base case. If there is some failing trace which only covers \(C_i\), but this property does not hold of \(C_j\), then \(C_i\) is more suspicious than \(C_j\).

  2.

    Extended case. Let \(\mathbf T _1\) be a test suite in which all failing traces cover more than one component, and let \(\mathbf T _2\) be identical to \(\mathbf T _1\) except that \(c_i^k = 1\) and \(c_j^k = 1\) in \(\mathbf T _1\) while \(c_i^k = 1\) and \(c_j^k = 0\) in \(\mathbf T _2\); then the suspiciousness of \(C_i\) in \(\mathbf T _2\) is greater than its suspiciousness in \(\mathbf T _1\).

These properties capture the intuition that the fewer covered entities there are in a failing trace, the fewer places there are for the fault to “hide”, and so the a priori likelihood that a given covered entity is faulty must increase. An upper bound for this increase is established by the base case – if a failing trace only covers a single component then that component must be faulty. We now formally establish that the pfl equations, but no sbfl measure, satisfy these properties.

Proposition 2

The pfl equations satisfy the fault likelihood properties.

Proof

We first prove the base property. We first show that if there is some failing trace \(t_k\) which only covers \(C_i\), then nothing is more suspicious than \(C_i\). Let \(t_1\) be a failing trace which only covers \(C_i\). Then \(P(h_i^1) = \frac{w(C_i)}{w(C_i)} = 1\) (by Eq. (3)). Letting n abbreviate \(P(\bigvee _{j = \tiny {2}}^{|\mathbf {T} |} h_i^j)\), we then have \(P(\bigvee _{k = \tiny {1}}^{|\mathbf {T} |} h_i^k) = (1 + n) - (1n) = 1\) (by Eq. (2)). So \(P(h_i) = 1\) (by Eq. (1)). Thus, nothing can be more suspicious than \(C_i\). We now show that if there is no failing trace which only covers \(C_j\), then \(C_j\) must be less suspicious than \(C_i\). Assume the antecedent; then for each \(t_k\) we have \(P(h_j^k) = \frac{w(C_j)}{\sum _{C_m \in c(t_k)} w(C_m)} < 1\) (by Eq. (3)). Thus \(P(\bigvee _{k = \tiny {1}}^{|\mathbf {T} |} h_j^k) < 1\) (by Eqs. (2) and (3)). Thus \(P(h_j) < 1\) (by Eq. (1)). Thus \(P(h_j) < P(h_i)\), which means \(C_i\) is more suspicious than \(C_j\).

We now prove the extended property. Let \(\mathbf T _1\) be a test suite in which all failing traces cover more than one component, and let \(\mathbf T _2\) be identical to \(\mathbf T _1\) except that \(c_i^1 = 1\) and \(c_j^1 = 1\) in \(\mathbf T _1\) while \(c_i^1 = 1\) and \(c_j^1 = 0\) in \(\mathbf T _2\). Let n abbreviate \(P(\bigvee _{m = \tiny {2}}^{|\mathbf {T} |} h_i^m)\). Then \(P(h_i) = P(h_i^1) + n - (P(h_i^1)n)\) (by Eqs. (1) and (2)). It remains to show, first, that \(P(h_i^1)\) is greater in \(\mathbf T _2\), and second, that n has the same value for both test suites where \(n < 1\). For the former, let \(P(h_i^1)\) = \(\frac{w(C_i)}{w(C_1) + \dots + x + \dots + w(C_{|c(t_1)|})}\) for both test suites (using Eq. (3)), where we let \(x = w(C_j)\) for \(\mathbf T _1\) (where \(w(C_j) > 0\)), and \(x = 0\) for \(\mathbf T _2\) (as \(C_j \notin c(t_1)\) in \(\mathbf T _2\)). So \(P(h_i^1)\) is greater in \(\mathbf T _2\). For the latter, we observe that for all \(1 < m \le |\mathbf T _1|\) we have \(P(h_i^m) < 1\) (by assumption each failing \(t_m \in \mathbf F \subseteq \mathbf T _1\) covers at least two components) and that \(P(h_i^m)\) is the same in both \(\mathbf T _1\) and \(\mathbf T _2\); thus \(n < 1\) (by Eq. (2)) and n has the same value for both.

Proposition 3

No sbfl measure satisfies either property.

Proof

To show that no suspiciousness measure w satisfies the base property, we show that for any w we can construct a test suite in which (1) there is a failing trace which only covers \(C_1\), (2) there is no failing trace which only covers \(C_2\), and (3) \(w(C_1) = w(C_2)\). A simple example is as follows. Let \({{\varvec{P}}}\) = \(\langle C_1, C_2, C_3, E \rangle \) and \(\mathbf T \) = \(\langle \langle 1, 1, 1, 1 \rangle , \langle 0, 1, 1, 1 \rangle , \langle 1, 0, 0, 1 \rangle \rangle \). The spectrum for both \(C_1\) and \(C_2\) is \(\langle 2,1,0,0 \rangle \), and so \(w(C_1) = w(C_2)\), violating the base property.

To show that no suspiciousness measure w satisfies the extended property, we show that for any w we can construct a pair of test suites \(\mathbf T _1\) and \(\mathbf T _2\) which are otherwise identical except that (1) \(c_i^k = 1\) and \(c_j^k = 1\) in \(\mathbf T _1\), (2) \(c_i^k = 1\) and \(c_j^k = 0\) in \(\mathbf T _2\), and (3) \(C_i\) receives the same suspiciousness in both. The simplest example is as follows. Let \({{\varvec{P}}}\) = \(\langle C_1, C_2, E \rangle \) and \(\mathbf T _1\) = \(\langle \langle 1, 1, 1 \rangle \rangle \) and \(\mathbf T _2\) = \(\langle \langle 1, 0, 1 \rangle \rangle \). The spectrum for \(C_1\) is \(\langle 1,0,0,0 \rangle \) in both cases, and so \(C_1\) is equally suspicious in both, violating the extended property.
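The base-case counterexample can be checked mechanically: computing spectra from the three test cases shows that \(C_1\) and \(C_2\) both receive the spectrum \(\langle 2,1,0,0\rangle\), so every spectrum-based measure must score them equally (a quick check, ours):

```python
# Checking the base-case counterexample: C_1 and C_2 get identical
# spectra even though a failing trace covers only C_1.
tests = [(1, 1, 1, 1), (0, 1, 1, 1), (1, 0, 0, 1)]  # last bit is the error E

def spec(i):
    fails = [t for t in tests if t[-1] == 1]
    passes = [t for t in tests if t[-1] == 0]
    return (sum(t[i] for t in fails),
            sum(1 - t[i] for t in fails),
            sum(t[i] for t in passes),
            sum(1 - t[i] for t in passes))

assert spec(0) == spec(1)  # any measure w must give w(C_1) = w(C_2)
```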

The proof of the last proposition suggests that there are large classes of test suites in which sbfl measures violate the properties. sbfl measures do not have the resources to satisfy the properties because each \(C_i\)’s suspiciousness is only a function of its program spectrum, which itself is only a function of the i-th column of a coverage matrix.

5 Experimentation

In this section we discuss our experimental setup and our results. The aim of the experiment is to compare the performance of the pfl algorithm against sbfl measures at the practical task of finding a fault in large faulty programs.

5.1 Setup

We use the Steimann test suite in our experiments, described in Table 1 [33]. M is the number of methods, UUT the number of units under test, b the number of blocks of code, UUT / b the mean number of UUTs per block, t the number of test cases. The last column gives the number of program versions with 1/2/4/8/16/32 faults respectively. The average number of covered components that were faulty for all 1/2/4/8/16/32 fault benchmarks was found to be 1.00/1.92/3.63/6.71/11.81/20.02 respectively (7.52 on average). Steimann’s test suite is the only test suite found to represent large programs with a large range of faults and a large number of program versions. For more about the suite see [33]. The suite came with a program that generated the coverage matrices (see [33]).

Table 1. Benchmarks

We used blocks as our atomic program entity, and only considered atomic blocks for all methods compared. A block corresponds to a maximal set of executable statements covered by the same traces. This correspondence provides a natural grouping, as the degree of suspiciousness of a line of code is the same as that of the block to which it belongs, and does not affect the fault localisation process from the user’s point of view. In the majority of cases a block represented a contiguous chunk of the program, and blocks were similar in size – the average size of these blocks is reported in the UUT / b column of Table 1, and can often be quite large.

We now discuss the techniques compared. We include all known sbfl techniques: the 157 measures in [11], which themselves include 30 measures from the studies of Naish [15] and the 40 measures of Lo [16]. We also include the 30 genetic measures of Yoo [20], the Dstar measures [17] and the 6 “combination” measures of Kim et al. [21]. This brings the number of sbfl measures to almost 200, which to our knowledge is the largest comparison of sbfl measures to date. We now discuss the pfl techniques. We used the weighted model pfl-w, with the ppv, Ochiai, Kulczynski2, and Suppes measures (see preliminaries) as values for w. Not all measures could be tested because pfl techniques take slightly longer to run. To our knowledge sbfl and pfl techniques are the only ones which can feasibly scale to our experiments. Techniques which take 10 min on average to localise a fault in one program version would take almost a year to complete the experiment.

We evaluated the effectiveness of a technique using avg W-scores and A-scores (see preliminaries). We define higher level W/A-scores as follows. For each \(n \in \{1,2,4,8,16,32\}\), a basic score for the n-fault versions of a given benchmark is the mean of the scores for all versions of that benchmark with n faults. The score for the n-fault versions is the mean of the ten basic scores for the n-fault versions. The Avg score is the mean of the 60 basic scores. We used Wilcoxon rank-sum tests to determine whether a technique’s 60 W/A basic scores were statistically significantly better than another’s (using \(p = 0.01\)). To provide a lower bound for sbfl performance, we included scores for the Random measure (defined as a measure which outputs a random number). To provide an upper bound, we computed the unavoidable costs for W/A-scores (discussed in Sect. 2).
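The rank-sum comparison can be sketched with the usual normal approximation (ours; no tie correction, a stand-in for a library routine such as scipy.stats.ranksums):

```python
# Sketch: Wilcoxon rank-sum z-statistic for two samples of basic scores.
# Negative z means xs tend to be smaller than ys (e.g. lower W-scores).

def rank_sum_z(xs, ys):
    combined = sorted((v, src) for src, vals in ((0, xs), (1, ys)) for v in vals)
    r1 = sum(rank for rank, (v, src) in enumerate(combined, start=1) if src == 0)
    n1, n2 = len(xs), len(ys)
    mean = n1 * (n1 + n2 + 1) / 2               # expected rank sum under H0
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # its standard deviation
    return (r1 - mean) / sd
```

The resulting z-statistic is compared against the critical value for the chosen significance level (here \(p = 0.01\)).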

5.2 Results

We begin with overall results. Zoltar was the sbfl measure with the best Avg W-score, at 2.59. pfl-ppv improved on this score with an Avg W-score of 1.88. Thus, the user has to investigate 37.77 % more code when using the best sbfl measure. Klosgen was the sbfl measure with the best Avg A-score, at 55.2. pfl-ppv improved on this score with an Avg A-score of 76.2. Thus, the user finds a fault immediately 27.56 % less frequently using the next best sbfl measure. pfl-ppv’s 60 W and A basic scores were each a statistically significant improvement over the next best performing sbfl measures using \(p = 0.01\). Thus, the pfl approach was a substantial and significant improvement at localising faults.

We now discuss unavoidable cost (UC) scores. UC’s Avg W/A-scores were 1.38 and 76.45 respectively. pfl-ppv outperformed UC’s W-scores at 30/60 benchmarks, and UC’s A-scores at 24/60 benchmarks. It is thus theoretically impossible to design a strictly rational sbfl measure that outperforms pfl-ppv on those benchmarks.

To give an impression of the overall range of performance, we present the following additional Avg scores. Kulczynski2, Ochiai, Suppes, ppv (Tarantula) and Random had Avg W-scores of 2.69, 2.97, 3.60, 3.88 and 19.77, and Avg A-scores of 52.65, 52.47, 52.47, 49.73 and 12.97 respectively. When pfl-w was used in conjunction with the first three measures (scaled to [0, 1]), the techniques had Avg W-scores of 2.31, 2.05 and 2.09, and Avg A-scores of 70.57, 72.55 and 72.30. Thus, all our pfl-w approaches outperformed all sbfl measures regardless of the choice of w. This suggests that the pfl-w framework is more responsible for the improvement in fault localisation effectiveness than the choice of weight w.
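The measures used as weights w were first scaled to [0, 1]. A minimal sketch of such scaling, assuming simple min-max normalisation (the paper does not spell out its exact scaling, and the constant-vector case below is our own convention):

```python
def minmax_scale(scores):
    """Scale raw suspiciousness scores to [0, 1] via min-max normalisation,
    as one plausible way to prepare a measure for use as the weight w in
    pfl-w.  A constant score vector is mapped to all zeros (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```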

Fig. 3. W-scores for selected techniques

Fig. 4. A-scores for selected techniques

We now discuss how measures behave as more faults are introduced into a program. In Figs. 3 and 4 we graphically compare a range of techniques. Firstly, we plot the unavoidable cost scores to show how pfl-ppv approximates (and in some cases exceeds) the idealised upper bounds on the performance of strictly rational sbfl measures. Secondly, we plot Zoltar, as it was the sbfl measure with the best W-scores. Thirdly, we plot Tarantula (equivalent to ppv) to show how using pfl-ppv improves performance. Each column represents a technique’s score for the n-fault versions of that suite, with the key’s shade representing the value of n (for example, the W-score for pfl-ppv at the 2-fault benchmarks is 1.94).

We observed the following trends. In general, the more faults there were in a program, the better an sbfl measure’s W-scores, but the worse that measure’s A-scores. A proposed explanation is that the more faults there were in a program, the more likely the user was to find a fault early (due to increased luck, thus improving W-scores), but the less likely to find one immediately (due to increased noise, thus worsening A-scores). These trends were observed for all of our sbfl measures, of which Zoltar and Tarantula are examples. By contrast, no negative trend in A-scores was observed for our variants of pfl-w, which demonstrated a superior ability to deal with the noise introduced by multiple faults.

Fig. 5. pfl-ppv performance

We now discuss Fig. 5. For each set of n-fault benchmarks, if \(y\,\%\) of the program versions received a W-score of \(\le x\,\%\), a point was plotted on that graph at \((x, y)\). The mean (Avg) of the 6 graphs is also plotted. The figure demonstrates that if we limit fault localisation to only 10 % of the blocks, on Avg we would expect to find a fault 95 % of the time using pfl-ppv. One outlier is that pfl-ppv does slightly worse on the 16-fault benchmarks. In general, the graph confirms the conclusion that pfl-ppv’s performance is not substantially worsened by the number of faults in the program.
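The construction of each such curve can be sketched as follows (illustrative names; the sampling interval for x is our assumption, not stated in the text):

```python
def cumulative_curve(w_scores, xs=range(0, 101, 10)):
    """Points (x, y) where y is the percentage of program versions whose
    W-score is <= x %, mirroring the construction described for Fig. 5."""
    n = len(w_scores)
    return [(x, 100.0 * sum(1 for w in w_scores if w <= x) / n)
            for x in xs]
```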

We now discuss time efficiency. In our implementation it took under a second to find the most suspicious component in both the sbfl and pfl procedures. The complete pfl procedure (as per the algorithm in Sect. 4) took an average of 6.16 s, with potential for further optimisation, thus establishing pfl’s negligible overhead.

In summary, pfl approaches substantially and statistically significantly improve over the best performing sbfl approaches on our large multiple-fault programs, and are comparably efficient. Furthermore, they outperform the theoretical optimum performance of strictly rational sbfl measures on a large class of benchmarks.

We briefly report results (Avg scores) from additional experiments of much smaller size and lower quality. We generated 500+ 1/2/3/4-fault versions using the methodology and 10 SIR benchmarks of [34]. 80 % of the faults were covered by all failing traces, which made the task less challenging for our techniques in terms of noise. The highest scoring sbfl measure was Kulczynski2 (K2) (W-score 8.76, A-score 33.75). pfl-Suppes, pfl-K2 and pfl-Ochiai came second, third and fourth (W-scores 9.66, 10.27, 10.40, and A-scores 25.73, 24.43, 28.43 respectively). K2’s scores were not statistically significantly better. The Barinel tool (see [34]) came 43rd overall and was not competitive. pfl-ppv came 6th overall after sbfl measures (W-score 11.39, A-score 29.15). These experiments confirm that pfl is high performing.

6 Related Work

The most prominent lightweight approach to software fault localisation is sbfl [3], which provides the theoretical groundwork for the pfl approach. Research is driven by the development of new measures and experimental comparisons of them on benchmarks [4–21]. Causal measures were introduced to sbfl in [11]. Threats to the validity of empirical studies of sbfl are studied in [33]. Theoretical results include proving potentially desirable formal properties of measures and finding equivalence proofs for classes of measures [10, 11, 15, 22, 35]. Yoo et al. have established theoretical results that show that a “best” performing suspiciousness measure for sbfl does not exist [23], and thus there remains the problem of finding formal properties for lightweight techniques to exploit. We have tried to address this problem in this paper.

A prominent probabilistic approach is Barinel, which differs from pfl insofar as it uses Bayesian methods to generate likelihoods of given hypotheses [34], and the minimal hitting set algorithm Staccato to generate those hypotheses. Their approach is designed for the simultaneous localisation of sets of multiple faults, and was only scalable to our additional experiments. Other heavyweight techniques are similarly unscalable [30, 36–41], which emphasises the importance of developing lightweight techniques such as pfl/sbfl.

In general, sbfl methods have been successfully used in the following applications. Firstly, in semi-automated fault localisation, in which users inspect code in descending order of suspiciousness [29]. Secondly, in fully-automated fault localisation subroutines within algorithms which inductively synthesise programs (such as Cegis [42]) or repair them (such as GenProg [43]). Thirdly, as a substitute for heavyweight methods which cannot scale to large programs [30, 36, 37]. Fourthly, as a technique combined with other methods [21, 24, 44–50]. In general, pfl may be used as a substitute for sbfl measures in all these applications. For a major recent survey we defer to Wong et al. [3].

7 Conclusions

In this paper we have presented a new formal framework which we call pfl, and compared it to sbfl in terms of (1) desirable theoretical properties, (2) its effectiveness at fault localisation and (3) its efficiency. Regarding (1), the pfl equations were formally proven to satisfy desirable fault likelihood properties which sbfl measures could not. Regarding (2), pfl-ppv was shown to substantially and statistically significantly (using \(p = 0.01\)) outperform all known sbfl measures at W- and A-scores in what is, to our knowledge, the largest-scale experimental comparison in software fault localisation to date. We found that the user has to investigate 37.77 % more blocks of code, and finds a fault immediately 27.56 % less frequently, when using the best sbfl measures instead of pfl-ppv. Furthermore, we showed that for 30/60 and 24/60 of our benchmarks respectively it is theoretically impossible to design strictly rational sbfl measures that outperform pfl-ppv’s W/A-scores. Regarding (3), we found that the pfl approach maintains a comparably negligible overhead to sbfl. Thus, our results suggest the pfl framework has theoretical and practical advantages over sbfl.

For future work, we would firstly like to find additional suspiciousness measures for use with pfl-w. Secondly, we would like to find a method to determine upper-bound scores for pfl performance (similar to Naish’s unavoidable costs). Thirdly, we would like to implement pfl in an easy-to-use tool for engineers.