1 Introduction

When querying a database, a user may not always obtain the expected results, and the system could provide some explanations. They could be useful to further understand the data or check if the query is the intended one. Actually, the notion of explanation for a query result was introduced in [47], on the basis of the deeper concept of actual causation.Footnote 1

A tuple t is an actual cause for an answer \(\phantom {\dot {i}\!}\bar {a}\) to a conjunctive query \(\phantom {\dot {i}\!}\mathcal { Q}\) from a relational database instance D if there is a contingent set of tuples Γ, such that, after removing Γ from D, \(\phantom {\dot {i}\!}\bar {a}\) is still an answer, but after further removing t from \(\phantom {\dot {i}\!}D\smallsetminus {\Gamma }\), \(\phantom {\dot {i}\!}\bar {a}\) is not an answer anymore (cf. Section 2.1 for a precise definition). Here, Γ is a set of tuples that has to accompany t so that the latter becomes a counterfactual cause for answer \(\phantom {\dot {i}\!}\bar {a}\). Actual causes and contingent tuples are restricted to be among a pre-specified set of endogenous tuples, which are admissible, possible candidates for causes, as opposed to exogenous tuples, which may also be present in the database. In rest of this paper, whenever we simply say “cause”, we mean “actual cause”.

In applications involving large data sets, it is crucial to rank potential causes by their responsibilities [47, 48], which reflect the relative (quantitative) degrees of their causality for a query result. The responsibility measure for a cause is based on its contingency sets: the smallest (one of) its contingency sets, the strongest it is as a cause.

Actual causation, as used in [47], can be traced back to [32, 33], which provides a model-based account of causation on the basis of counterfactual dependence. Causal responsibility was introduced in [19], to provide a graded, quantitative notion of causality when multiple causes may over-determine an outcome.

Apart from the explicit use of causality, research on explanations for query results has focused mainly, and rather implicitly, on provenance [1315, 22, 38, 40, 61]. A close connection between causality and provenance has been established in [47]. However, causality is a more refined notion that identifies causes for query results on the basis of user-defined criteria, and ranks causes according to their responsibilities [48].

Consistency-based diagnosis [53], a form of model-based diagnosis [60](@var1@sec. 10.3@var1@), is an area of knowledge representation. The problem here is, given the specification of a system in some logical formalism and a usually unexpected observation about the system, to obtain explanations for the observation, in the form of a diagnosis for the unintended behavior (cf. Section 2.3 for a precise definition).

In a different direction, a database instance, D, that is expected to satisfy certain integrity constraints may fail to do so. In this case, a repair of D is a database D that does satisfy the integrity constraints and minimally departs from D. Different forms of minimality can be applied and investigated. A consistent answer to a query from D and with respect to the integrity constraints is a query answer that is obtained from all possible repairs, i.e. is invariant or certain under the class of repairs (cf. Section 2.2 for a precise definition). These notions were introduced in [2] (see [7, 9] for surveys).Footnote 2

These three forms of reasoning, namely inferring causes from databases, consistency-based diagnosis, and consistent query answering (and repairs) are all non-monotonic [55]. For example, a (most responsible) cause for a query result may not be such anymore after the database is updated. Furthermore, they all reflect some sort of uncertainty about the information at hand. In this work we establish natural, precise, useful, and deeper connections between these three reasoning tasks.

More precisely, we unveil a strong connection between computing causes and their responsibilities for conjunctive query answers, on one hand, and computing repairs in databases with respect to denial constraints, on the other. These computational problems can be reduced to each other. In order to obtain repairs with respect to a set of denial constraints from causes, we investigate causes for queries that are unions of conjunctive queries, and develop algorithms to compute causes and responsibilities.

We show that inferring and computing actual causes and their responsibilities in a database setting become diagnosis reasoning problems and tasks. Actually, a causality-based explanation for a conjunctive query answer can be viewed as a diagnosis, where in essence the first-order logical reconstruction of the relational database provides the system description [54], and the observation is the query answer. We obtain causes and their responsibilities -and as a side result, also database repairs- from diagnosis.

Being the causality problems the main focus of this work, we take advantage of algorithms and complexity results both for consistency-based diagnosis on one side; and database repairs and consistent query answering [9], on another. In this way, we obtain new complexity results for the main problems of causality, namely computing actual causes, determining their responsibilities, and obtaining most responsible causes; and also for their decision versions. In particular, we obtain fixed-parameter polynomial-time algorithms for some of them. More precisely, our main results are as follows: (the complexity results are all in data complexity)

  1. 1.

    We characterize actual causes and most responsible actual causes for a Boolean conjunctive query in terms of subset- and cardinality-repairs of the instance with respect to the denial constraint associated to the query (the query being the violation view of the constraint). In this way we can compute causes from repairs.

    In the other direction, we obtain repairs of databases with respect to sets of denial constraints from causes for query results. For this, we extend the treatment of causality to unions of conjunctive queries (to represent multiple denial constraints). We characterize an actual cause’s responsibility in terms of cardinality-repairs. Along the way we provide PTIME algorithms to compute causes and their (minimal) contingency sets for unions of conjunctive queries.

  2. 2.

    We reduce causes for a Boolean conjunctive query to consistency-based diagnosis for the query being unexpectedly true according to a system description. In particular, we show how to compute actual causes, their contingency sets, and responsibilities using the diagnosis characterization. As a side result, we obtain database repairs from diagnosis.

    Hitting-set-based algorithmic approaches to diagnosis [53] inspire our algorithmic/complexity approaches to causality. In particular, we reformulate the causality problems as hitting-set problems and vertex cover problems on hypergraphs, which allows us to apply results and techniques for the latter to causality.

  3. 3.

    We obtain several new computational complexity results:

    1. (a)

      Checking minimal contingency sets can be done in PTIME.

    2. (b)

      The responsibility decision problem for conjunctive queries, which is about deciding if a tuple’s responsibility is greater that a bound v (that is part of the input) is NP-complete. However, this problem becomes fixed-parameter tractable, with the parameter being \(\phantom {\dot {i}\!}\frac {1}{v}\).

    3. (c)

      The problem of computing responsibilities of causes is F P NP(log(n))-complete. Deciding most responsible causes is P NP(log(n))-complete.

    4. (d)

      The structure of the resulting hitting-set problem allows us to obtain efficient parameterized algorithms and good approximation algorithms for computing causes and minimal contingency sets.

    5. (e)

      From the repair connection we obtain that, for consistency based-diagnosis with specifications given by positive implications with disjunctive consequents, the problems of computing minimum-cardinality diagnoses and computing minimum-cardinality diagnoses that contain a given atom are both F P NP(log(n))-hard in the size of their underlying Herbrand structure.

  4. 4.

    We define notions of preferred causes; in particular one based on prioritized repairs [59]. We also propose an approach to causality based on interventions that are repair actions that replace attribute values by null values.

The paper is structured as follows. Section 2 introduces technical preliminaries for relational databases, causality in databases, database repairs and consistent query answering, consistency-based diagnosis, and relevant complexity classes. Section 3 characterizes actual causes and responsibilities in terms of database repairs. Section 4 characterizes repairs and consistent query answers in terms of causes and contingency sets for queries that are unions of conjunctive queries, and presents an algorithm for computing both of the latter. Section 5 formulates causality and repair problems as consistency-based diagnosis problems. Section 6 shows complexity and algorithmic results; in particular a fixed-parameter tractability result for causes’ responsibilities, and also about consistency based-diagnosis. Section 7 deals with preferred causes. Section 8 discusses several relevant issues, connections and open problems around causality in databases. It also draws some final conclusions. We provide proofs for all the results except for those that are rather straightforward. This is an extended version of [58]. It contains proofs, many improvements in the presentation, and also new developments and results, mainly in Sections 6.2 and 7.

2 Preliminaries

We consider relational database schemas of the form \(\phantom {\dot {i}\!}\mathcal {S} = (U,\mathcal {P})\), where U is the possibly infinite database domain of constants and \(\phantom {\dot {i}\!}\mathcal {P}\) is a finite set of database predicates Footnote 3 of fixed arities. A database instance D compatible with \(\phantom {\dot {i}\!}\mathcal {S}\) can be seen as a finite set of ground atomic formulas (in databases aka. atoms or tuples), of the form P(c 1,...,c n ), where \(\phantom {\dot {i}\!}P \in \mathcal {P}\) has arity n, and c 1,…,c n U.

A conjunctive query (CQ) is a formula \(\phantom {\dot {i}\!}\mathcal { Q}(\bar {x})\) of the first-order (FO) logic language, \(\phantom {\dot {i}\!}\mathcal {L}(\mathcal {S})\), associated to \(\phantom {\dot {i}\!}\mathcal {S}\) of the form \(\phantom {\dot {i}\!}\exists \bar {y}(P_{1}(\bar {s}_{1}) \wedge {\cdots } \wedge P_{m}(\bar {s}_{m}))\), where the \(\phantom {\dot {i}\!}P_{i}(\bar {s}_{i})\) are atomic formulas, i.e. \(\phantom {\dot {i}\!}P_{i} \in \mathcal {P}\), and the \(\phantom {\dot {i}\!}\bar {s}_{i}\) are sequences of terms, i.e. variables or constants.Footnote 4 The \(\phantom {\dot {i}\!}\bar {x}\) in \(\phantom {\dot {i}\!}\mathcal {Q}(\bar {x})\) shows all the free variables in the formula, i.e. those not appearing in \(\phantom {\dot {i}\!}\bar {y}\). If \(\phantom {\dot {i}\!}\bar {x}\) is non-empty, the query is open. If \(\phantom {\dot {i}\!}\bar {x}\) is empty, the query is Boolean (a BCQ), i.e. the query is a sentence, in which case, it is true or false in a database, denoted by \(\phantom {\dot {i}\!}D \models \mathcal {Q}\) and \(\phantom {\dot {i}\!}D \not \models \mathcal {Q}\), respectively. A sequence \(\phantom {\dot {i}\!}\bar {c}\) of constants is an answer to an open query \(\phantom {\dot {i}\!}\mathcal {Q}(\bar {x})\) if \(\phantom {\dot {i}\!}D \models \mathcal {Q}[\bar {c}]\), i.e. the query becomes true in D when the free variables are replaced by the corresponding constants in \(\phantom {\dot {i}\!}\bar {c}\).

An integrity constraint is a sentence of language \(\phantom {\dot {i}\!}\mathcal { L}(\mathcal { S})\), and then, may be true or false in an instance for schema \(\phantom {\dot {i}\!}\mathcal {S}\). Given a set IC of integrity constraints for schema \(\phantom {\dot {i}\!}\mathcal {S}\), a database instance D is consistent with IC if DI C; otherwise it is said to be inconsistent. In this work we assume that sets of integrity constraints are always finite and logically consistent.

A particular class of integrity constraints is formed by denial constraints (DCs), which are sentences κ of the form: \(\phantom {\dot {i}\!}\forall \bar {s} \neg (A_{1}(\bar {s}_{1}) \wedge {\cdots } \wedge A_{n}(\bar {s}_{n})\), where \(\phantom {\dot {i}\!}\bar {s}= \bigcup \bar {s}_{i}\) and each \(\phantom {\dot {i}\!}A_{i}(\bar {s}_{i})\) is a database atom, i.e. predicate \(\phantom {\dot {i}\!}A_{i} \in \mathcal {P}\!\). So as with conjunctive queries, the atoms may contain constants. Denial constraints are exactly the negations of BCQs. Sometimes we use the common representation of DCs as “negative rules” of the form: \(\phantom {\dot {i}\!}\leftarrow \ A_{1}(\bar {s}_{1}), \ldots , A_{n}(\bar {s}_{n})\). We will also consider functional dependencies (FDs) as DCs. They are represented by negative rules of the form: \(\phantom {\dot {i}\!}\leftarrow \ A(\bar {x}_{1},\bar {x}_{2},y), A(\bar {x}_{1},\bar {x}_{3},z), y \neq z\), saying that the last attribute of relation A functionally depends upon the attributes holding variables \(\phantom {\dot {i}\!}\bar {x}_{1}\). They do not contain constants, and correspond to BCQs with inequality.

2.1 Causality and responsibility

Assume that the database instance is split in two, i.e. D = D nD x, where D n and D x denote the disjoint sets of endogenous and exogenous tuples, respectively.

Actual causes and contingent tuples are usually restricted to be among a pre-specified set of endogenous tuples, which are admissible, possible candidates for causes, as opposed to the exogenous tuples. Actually, the latter provide the context or the background for the problem, and are considered as external factors that are not of interest to the current problem statement or beyond our control. Since no intervention (or update, in database parlance) is conceivable on exogenous tuples, they can not be included in any contingency set or be an actual cause. They are assumed to be included in all conceivable hypothetical states of a database.

The endogenous/exogenous partition is application-dependent and captures predetermined factors, such as users preferences that may affect QA-causal analysis. For example, certain tuples or full tables might be identified as irrelevant (or exogenous) in relation to a particular query at hand, or decided to be exogenous or endogenous a priori, independently from the query.

A tuple tD n is called a counterfactual cause for a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), if \(\phantom {\dot {i}\!}D\models \mathcal { Q}\) and \(\phantom {\dot {i}\!}D\smallsetminus \{t\} \not \models \mathcal { Q}\). A tuple tD n is an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) if there exists Γ⊆D n, called a contingency set, such that t is a counterfactual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) in \(\phantom {\dot {i}\!}D\smallsetminus {\Gamma }\) [47].

We will concentrate mostly on CQs. However, the definitions of actual cause and contingency set can be applied without a change to monotone queries in general [47], in particular to unions of BCQs (UBCQs), with or without built-ins.

The responsibility of an actual cause t for \(\phantom {\dot {i}\!}\mathcal { Q}\), denoted by \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(t)\), is the numerical value \(\phantom {\dot {i}\!}\frac {1}{|{\Gamma }| + 1}\), where |Γ| is the size of the smallest contingency set for t. We can extend responsibility to all the other tuples in D n by setting their value to 0. Those tuples are not actual causes for \(\phantom {\dot {i}\!}\mathcal {Q}\).

Example 1

Consider D = D n={R(a 4,a 3),R(a 2,a 1),R(a 3,a 3),S(a 4), S(a 2), S(a 3)}, and the query \(\phantom {\dot {i}\!}\mathcal {Q} :\exists x \exists y ( S(x) \land R(x, y) \land S(y))\). It holds: \(\phantom {\dot {i}\!}D \models \mathcal {Q}\).

Tuple S(a 3) is a counterfactual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\). If S(a 3) is removed from D, \(\phantom {\dot {i}\!}\mathcal { Q}\) is not true anymore. Therefore, the responsibility of S(a 3) is 1. Besides, R(a 4,a 3) is an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) with contingency set {R(a 3,a 3)}. If R(a 3,a 3) is removed from D, \(\phantom {\dot {i}\!}\mathcal { Q}\) is still true, but further removing R(a 4,a 3) makes \(\phantom {\dot {i}\!}\mathcal {Q}\) false. The responsibility of R(a 4,a 3) is \(\phantom {\dot {i}\!}\frac {1}{2}\), because its smallest contingency sets have size 1. Likewise, R(a 3,a 3) and S(a 4) are actual causes for \(\phantom {\dot {i}\!}\mathcal {Q}\) with responsibility \(\phantom {\dot {i}\!}\frac {1}{2}\).

For the same \(\phantom {\dot {i}\!}\mathcal { Q}\), but with D={S(a 3),S(a 4),R(a 4,a 3)}, and the partition D n={S(a 4),S(a 3)} and D x={R(a 4,a 3)}, it turns out that both S(a 3) and S(a 4) are counterfactual causes for \(\phantom {\dot {i}\!}\mathcal {Q}\).

Remark 1

In the rest of this paper, we will assume in the context of causality that database instances D are partitioned as D = D nD x, into a subset of endogenous and a set of exogenous tuples, respectively. We will denote with \(\phantom {\dot {i}\!}Causes(D,\mathcal {Q})\) the set of actual causes for the BCQ \(\phantom {\dot {i}\!}\mathcal {Q}\) (being true) from instance D.

2.2 Database repairs

Given a set IC of integrity constraints, a subset repair (simply, S-repair) of a possibly inconsistent instance D for schema \(\phantom {\dot {i}\!}\mathcal {S}\) is an instance D for \(\phantom {\dot {i}\!}\mathcal { S}\) that satisfies IC and makes \(\phantom {\dot {i}\!}{\Delta }(D,D^{\prime })=(D \smallsetminus D^{\prime }) \cup ( D^{\prime } \smallsetminus D)\) minimal under set inclusion.Footnote 5 Srep(D,IC) denotes the set of S-repairs of D with respect to IC [2]. Similarly, D is a cardinality repair (simply C-repair) of D if D satisfies IC and minimizes |Δ(D,D )|. Crep(D,IC) denotes the class of C-repairs of D with respect to IC. C-repairs are always S-repairs. For DCs, S-repairs and C-repairs are obtained from the original instance by deleting an S-minimal, resp. C-minimal, set of tuples. In other words, S- and C-repairs under DCs become maximal (under set inclusion), resp. maximum (in cardinality), consistent subsets of the given instance.

In more general terms, we say that a set is S-minimal in a class of sets \(\phantom {\dot {i}\!}\mathcal { C}\) if it is minimal under set inclusion in \(\phantom {\dot {i}\!}\mathcal {C}\). Similarly, a set is C-minimal (or minimum) if it is minimal in cardinality within \(\phantom {\dot {i}\!}\mathcal {C}\). S-maximality and C-maximality are defined similarly.

Example 2

(ex. 1 cont.) Consider the denial constraint κ :← S(x),R(x,y),S(y), whose body corresponds to the CQ in Example 1, and is violated by the given instance D.

Here, Srep(D,κ) = {D 1, D 2, D 3} with D 1={R(a 4,a 3), R(a 2,a 1), R(a 3,a 3), S(a 4), S(a 2)}, D 2={R(a 2,a 1), S(a 4), S(a 2), S(a 3)}, D 3={R(a 4,a 3), R(a 2,a 1), S(a 2), S(a 3)}. The only C-repair is D 1, i.e. Crep(D,κ)={D 1}.

More generally, different repair semantics may be considered to restore consistency with respect to general integrity constraints. They depend on the kind of allowed updates on the database (i.e. tuple insertions/deletions, changes of attribute values), and the minimality conditions on repairs, e.g. subset-minimality, cardinality-minimality, etc.

Given D and IC, a repair semantics, S, defines a class Rep S(D,I C) of S-repairs, which are the intended repairs [9](@var1@Sec. 2.5@var1@). All the elements of R e p S(D,I C) are instances over the same schema of D, and consistent with respect to IC. If D is already consistent, Rep S(D,IC) contains D as its only member.

Given a repair semantics, S, \(\phantom {\dot {i}\!}\bar {c}\) is a S-consistent answer to an open query \(\phantom {\dot {i}\!}\mathcal {Q}(\bar {x})\) if \(\phantom {\dot {i}\!}D^{\prime } \models \mathcal {Q}[\bar {c}]\) for every D R e p S(D,I C). A BCQ is S-consistently true if it is true in every D R e p S(D,I C). In particular, if \(\phantom {\dot {i}\!}\bar {c}\) is a consistent answer to \(\phantom {\dot {i}\!}\mathcal {Q}(\bar {x})\) with respect to S-repairs, we say it is an S-consistent answer. Similarly for C-consistent answers. Consistent query answering for DCs under S-repairs was investigated in detail [18]. C-repairs and consistent query answering under them were investigated in detail in [43]. (Cf. [9] for more references.)

2.3 Consistency-based diagnosis

Consistency-based diagnosis, a form of model-based diagnosis [60](@var1@Sec. 10.4@var1@), considers problems \(\phantom {\dot {i}\!}\mathcal {M}=\) (SD, COMPS, OBS), where SD is the description in logic of the intended properties of a system under the explicit assumption that all the components in COMPS are working normally. OBS is a FO sentence that represents the observations. If the system does not behave as expected (as shown by the observations), then the logical theory obtained from SDOBS plus the explicit assumption, say \(\phantom {\dot {i}\!}\bigwedge _{c \in \mathit {COMPS}} \neg Ab(c)\), that the components are indeed behaving normally, becomes inconsistent. Ab is an abnormality predicate.Footnote 6

The inconsistency is captured via the, i.e. those minimal subsets C O M P S of COMPS, such that SDOBS \(\phantom {\dot {i}\!}\cup \{\bigwedge _{c \in \mathit {COMPS}^{\prime }} \neg \textit {Ab}(c)\}\) is inconsistent. As expected, different notions of minimality can be used at this point.

A minimal diagnosis for \(\phantom {\dot {i}\!}\mathcal { M}\) is a minimal subset Δ of COMPS, such that \(\phantom {\dot {i}\!} \textit {SD} \cup \textit {OBS} \cup \{\neg \textit {Ab}(c)~|~ c \in \textit {COMPS} \smallsetminus {\Delta } \} \cup \{\textit {Ab}(c)~ |~ c \in {\Delta }\}\) is consistent. That is, consistency is restored by flipping the normality assumption to abnormality for a minimal set of components, and those are the ones considered to be (jointly) faulty. The notion of minimality commonly used is S-minimality, i.e. a diagnosis that does not have a proper subset that is a diagnosis. We will use this kind of minimality in relation to diagnosis. Diagnosis can be obtained from conflict sets [53].

Example 3

Consider a simple logical gate Or, denoted with o (the only system component in this case), that receives two digits, x,y, as inputs and outputs a digit val(x,y).

This simple system can be specified in terms of normal behavior by the logical formula σA B(o)→∀xy(val(x, y) = 0 ←→⇔ x = y=0)), saying that, when the gate is not abnormal, the output is 0 iff the inputs are both 0.

The logical theory {σ,val(0,1)=0} is logically consistent (it can be made true) despite the unexpected observation (namely, output 0 with inputs 0,1). This is because the system’s model allows for abnormal behaviors. However, this theory together with the extra assumption ¬A b(o), i.e. that the gate is normal, form the theory {σ, v a l(0,1)=0,¬A b(o)} that is inconsistent in the sense that it can not be made true (in technical terms, it has not models).

2.4 Complexity classes

We recall some complexity classes [52] used in this paper. FP is the class of functional problems associated to decision problem in the class PTIME, i.e. that are solvable in polynomial time. P NP (or \(\phantom {\dot {i}\!}{{\Delta }^{P}_{2}}\)) is the class of decision problems solvable in polynomial time by a machine that makes calls to an NP oracle. For P NP(log(n)) the number of calls is logarithmic. It is not known if P NP(log(n)) is strictly contained in P NP . FP NP(log(n)) is similarly defined.

3 Actual Causes From Database Repairs

In this section we characterize actual causes for a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) being true in a database instance D in terms of the repairs of D with respect to a denial constraint whose violation view is \(\phantom {\dot {i}\!}\mathcal {Q}\), i.e. the latter asks if the constraint is violated. In essence, the actual causes will become the tuples outside an S-repair. The complement of the latter contains the cause plus a contingency set for the cause. In order to capture responsibility, C-repairs are considered.

Let D be an instance for schema \(\phantom {\dot {i}\!}\mathcal {S}\), and \(\phantom {\dot {i}\!}\mathcal { Q}\!: \exists \bar {x}(P_{1}(\bar {x}_{1}) \wedge {\cdots } \wedge P_{m}(\bar {x}_{m}))\) a BCQ. \(\phantom {\dot {i}\!}\mathcal { Q}\) may be unexpectedly true, i.e. \(\phantom {\dot {i}\!}D \models \mathcal { Q}\). Now, \(\phantom {\dot {i}\!}\neg \mathcal { Q}\) is logically equivalent to the DC \(\phantom {\dot {i}\!}\kappa (\mathcal { Q})\!: \forall \bar {x} \neg (P_{1}(\bar {x}_{1}) \wedge {\cdots } \wedge P_{m}(\bar {x}_{m}))\). The requirement that \(\phantom {\dot {i}\!}\neg \mathcal {Q}\) holds can be captured by imposing \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\) on D. Due to \(\phantom {\dot {i}\!}D \models \mathcal {Q}\), it holds \(\phantom {\dot {i}\!}D \not \models \kappa (\mathcal {Q})\). So, D is inconsistent with respect to \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\), and could be repaired.

Repairs for (violations of) DCs are obtained by tuple deletions. Intuitively, a tuple that participates in a violation of \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\) in D is an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\). S-minimal sets of tuples like this are expected to correspond to S-repairs for D with respect to \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\).

More precisely, given an instance D, a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), and a tuple tD n, we consider:

  • The class containing the sets of differences between D and those S-repairs that do not contain t, and are obtained by removing a subset of D n:

    $$\begin{array}{@{}rcl@{}} \textit{Diff}^{s}(D,\kappa(\mathcal{ Q}), t)&=&\{ D \smallsetminus D^{\prime} ~|~ D^{\prime} \in \textit{Srep}(D,\kappa(\mathcal{ Q})), \\ && {\kern2.5cm}t \in (D\smallsetminus D^{\prime}) \subseteq D^{n}\}. \end{array} $$
    (1)
  • The class containing the sets of differences between D and those C-repairs that do not contain t, and are obtained by removing a subset of D n:

    $$\begin{array}{@{}rcl@{}} \textit{Diff}^{c}(D,\kappa(\mathcal{ Q}), t)&=&\{ D \smallsetminus D^{\prime} ~|~ D^{\prime} \in \textit{Crep}(D,\kappa(\mathcal{ Q})),\\ && {\kern2.5cm}t \in (D\smallsetminus D^{\prime}) \subseteq D^{n}\}. \end{array} $$
    (2)

It holds \(\phantom {\dot {i}\!}\textit {Diff}^{c}(D,\kappa (\mathcal { Q}), t) \subseteq \textit {Diff}^{s}(D,\kappa (\mathcal { Q}), t)\).

Now, any \(\phantom {\dot {i}\!}{\Lambda } \in \textit {Diff}^{s}(D,\kappa (\mathcal { Q}), t)\) can be written as Λ=Λ∪{t}. From the S-minimality of S-repairs, it follows that \(\phantom {\dot {i}\!}D \smallsetminus ({\Lambda }^{\prime } \cup \{t\}) \models \kappa (\mathcal {Q})\), but \(\phantom {\dot {i}\!}D \smallsetminus {\Lambda }^{\prime } \models \neg \kappa (\mathcal {Q})\). That is, \(\phantom {\dot {i}\!}D \smallsetminus ({\Lambda }^{\prime } \cup \{t\}) \not \models \mathcal { Q}\), but \(\phantom {\dot {i}\!}D \smallsetminus {\Lambda }^{\prime } \models \mathcal { Q}\). As a consequence, t is an actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) with contingency set Λ. We have obtained the following result.

Proposition 1

Given an instance D and a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , t∈D n is an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) iff \(\phantom {\dot {i}\!}\textit {Diff}^{s}(D,\kappa (\mathcal { Q}), t) \not = \emptyset \) . Furthermore, if \(\phantom {\dot {i}\!}D \smallsetminus D^{\prime } \in \textit {Diff}^{s}(D,\kappa (\mathcal { Q}), t)\) , then \(\phantom {\dot {i}\!}D \smallsetminus (D^{\prime } \cup \{t\})\) is a minimal contingency set for t.

Proposition 2

Given an instance D, a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , and t∈D n:

  1. (a)

    If \(\phantom {\dot {i}\!}\textit {Diff}^{s}(D,\kappa (\mathcal { Q}), t) = \emptyset \) , then \(\phantom {\dot {i}\!}\rho _{_{D}}(t)=0\).

  2. (b)

    Otherwise, \(\phantom {\dot {i}\!}\rho _{_{D}}(t)=\frac {1}{|{\Lambda }|}\) , where \(\phantom {\dot {i}\!}{\Lambda } \in \textit {Diff}^{s}(D,\kappa (\mathcal { Q}), t)\) and there is no \(\phantom {\dot {i}\!}{\Lambda }^{\prime } \in \mathit {Diff}^{~s}(D,\kappa (\mathcal {Q}), t)\) such that |Λ |<|Λ|.

Corollary 1

Given an instance D and a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) : t∈D n is a most responsible actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) iff \(\phantom {\dot {i}\!}\textit {Diff}^{c}(D, \kappa (\mathcal { Q}), t) \not = \emptyset \).

Example 4

(ex. 1 and 2 cont.) Consider the same instance D and query \(\phantom {\dot {i}\!}\mathcal {Q}\). The associated DC is \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\!: \leftarrow S(x),R(x, y),S(y)\) that we considered in Example 2, where we obtained \(\phantom {\dot {i}\!}\textit {Srep}(D, \kappa (\mathcal { Q}))\) = {D 1, D 2, D 3} and \(\phantom {\dot {i}\!}\textit {Crep}(D, \kappa (\mathcal { Q}))=\{D_{1}\}\).

For tuple R(a 4,a 3), \(\phantom {\dot {i}\!}\textit {Diff}^{s}(D,\kappa (\mathcal { Q}), R(a_{4},a_{3}))=\{D \smallsetminus D_{2}\} = \{ \{ R(a_{4},a_{3}),\) R(a 3, a 3)}}, which, by Propositions 1 and 2, confirms that R(a 4,a 3) is an actual cause, with responsibility \(\phantom {\dot {i}\!}\frac {1}{2}\). The complement of \(\phantom {\dot {i}\!}D\smallsetminus D_{2}\) contains the actual cause R(a 3,a 3) plus a contingency set of it, namely that formed by tuple R(a 3,a 3), which has to be deleted together with the actual cause R(a 4,a 3) to restore consistency (cf. Example 2).

For tuple S(a 3), \(\phantom {\dot {i}\!}\textit {Diff}^{s}(D,\kappa (\mathcal { Q}), \) \(\phantom {\dot {i}\!} S(a_{3})) = \{D \smallsetminus D_{1}\}\) ={{S(a 3)}}. So, S(a 3) is an actual cause with responsibility 1.

Similarly, R(a 3,a 3) is an actual cause with responsibility \(\phantom {\dot {i}\!}\frac {1}{2}\), because Diff s(D,\(\phantom {\dot {i}\!}\kappa (\mathcal {Q}),\) \(\phantom {\dot {i}\!}R(a_{3},a_{3})) = \{D\smallsetminus D_{2}, \ D \smallsetminus D_{3}\}\) ={{R(a 4, a 3), R(a 3,a 3)}, {R(a 3,a 3), S(a 4)}}.

It holds \(\phantom {\dot {i}\!}\textit {Diff}^{s}(D,\kappa (\mathcal { Q}),S(a_{2})) = \textit {Diff}^{s}(D,\kappa (\mathcal { Q}),R(a_{2} , a_{1})) = \emptyset \), because all repairs contain S(a 2), R(a 2,a 1). This means they do not participate in the violation of \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\) or contribute to make \(\phantom {\dot {i}\!}\mathcal {Q}\) true. So, they are not actual causes for \(\phantom {\dot {i}\!}\mathcal {Q}\), confirming the result in Example 1.

\(\textit {Diff}^{c}(D,\kappa (\mathcal { Q}), S(a_{3}))=\{\{S(a_3)\}\} \). From Corollary 1, S(a 3) is the most responsible cause. □

Remark 2

The results in this section can be easily extended to unions of BCQs. This can be done by associating a DC to each disjunct of the query, and considering the corresponding problems for database repairs with respect to several DCs (cf. Section 4.1).

4 Database Repairs from Actual Causes

In this section we characterize repairs for inconsistent databases with respect to a set of DCs in terms of actual causes with their contingency sets. The reduction of repair-related computations to cause-related computations is particularly relevant, because we can take advantage of known complexity results for repairs to obtain new lower-bound complexity results for causality.

Causality has been investigated so far mainly for single conjunctive queries. However, database repairs appear in the context of sets of constraints. We concentrate on sets of DCs, which requires extending the analysis of causality to unions of conjunctive queries.

More concretely, in this section we characterize repairs of a database instance D with respect to a set Σ of DCs in terms of the actual causes (with their contingency sets) for the union of the conjunctive queries naturally associated to the (bodies of the) DCs. In essence, an S-repair D is a maximal subset of D that does not contain any actual cause t, and the tuples other than t and outside D form a contingency set for t. As expected, C-repairs require the use of most responsible tuples.

Consider an instance D for schema \(\phantom {\dot {i}\!}\mathcal {S}\), and a set of DCs Σ on \(\phantom {\dot {i}\!}\mathcal { S}\). For each κ∈Σ, say \(\phantom {\dot {i}\!}\kappa \!: \leftarrow A_{1}(\bar {x}_{1}),\ldots ,A_{n}(\bar {x}_{n})\), consider its associated violation view defined by a BCQ, namely \(V^{\!\kappa }\!\!: \exists \bar {x}(A_{1}(\bar {x}_{1})\wedge {\cdots } \wedge A_{n}(\bar {x}_{n}))\). The answer yes to V κ shows that κ is violated (i.e. not satisfied) by D.

Next, consider the query that is the union of the individual violation views: \(\phantom {\dot {i}\!}V^{\!{\Sigma }}:= \bigvee _{\kappa \in {\Sigma }} V^{\kappa } \), a union of BCQs (UBCQs). Clearly, D violates (is inconsistent with respect to) Σ iff DV Σ .

It is easy to verify that D, with D x = , is consistent with respect to Σ iff Causes(D, V Σ) = , i.e. there are no actual causes for V Σ when all tuples are endogenous.

Now, let us collect all S-minimal contingency sets associated with an actual cause t for V Σ:

Definition 1

For an instance D and a set Σ of DCs:

$$\begin{array}{@{}rcl@{}} \textit{Cont}(D,V^{\!{\Sigma}},t) &:=& \{ {\Gamma}\subseteq D^{n} ~|~ D\smallsetminus {\Gamma} \models V^{\!{\Sigma}}, D\smallsetminus ({\Gamma} \cup \{t\}) \not \models V^{\!{\Sigma}}, \\ && \text{ and~} \forall {\Gamma}^{\prime}\subsetneqq {\Gamma}, \ D \smallsetminus ({\Gamma}^{\prime} \cup \{t\}) \models V^{\!{\Sigma}} \}. \end{array} $$
(3)

Notice that for Γ∈Cont(D,V Σ,t), it holds t∉Γ. When D x = , if tC a u s e s(D,V Σ) and Γ∈C o n t(D,V Σ,t), from the definition of actual cause and the S-minimality of Γ, it holds that Γ=Γ∪{t} is an S-minimal subset of D with \(\phantom {\dot {i}\!}D \smallsetminus {\Gamma }^{\prime \prime } \not \models V^{\!{\Sigma }}\). So, \(D \smallsetminus {\Gamma }^{\prime \prime }\) is an S-repair for D. Then, the following holds.

Proposition 3

For an instance D, with D x =∅, and a set DCs Σ: D ⊆D is an S-repair for D with respect to Σ iff, for every \(t \in D \smallsetminus D^{\prime }\) : t∈Causes(D,V Σ ) and \(D \smallsetminus (D^{\prime } \cup \{t\}) \in \textit {Cont}(D, V^{\!{\Sigma }}, t)\).

To establish a connection between most responsible actual causes and C-repairs, assume that D x = , and collect the most responsible actual causes for V Σ:

Definition 2

For an instance D with D x = :

$$\begin{array}{@{}rcl@{}} \textit{MRC}(D,V^{\!{\Sigma}}) &:=& \{t \in D ~|~ t \in \textit{Causes}(D,V^{\!{\Sigma}}), \not \exists t^{\prime} \in \textit{Causes}(D,V^{\!{\Sigma}}) \\ && {\kern1.2cm} \text{ with} \,\,\rho_{_{D}}(t^{\prime})> \rho_{_{D}}(t)\}. \end{array} $$
(4)

Proposition 4

For instance D, with D x =∅, and set of DCs Σ: D ⊆D is a C-repair for D with respect to Σ iff, for every \(\phantom {\dot {i}\!}t \in D \smallsetminus D^{\prime }\) : t∈MRC(D,V Σ ) and \(\phantom {\dot {i}\!}D \smallsetminus (D^{\prime } \cup \{t\}) \in \textit {Cont}(D, V^{\!{\Sigma }}, t)\).

Actual causes for V Σ, with their contingency sets, account for the violation of some κ∈Σ. Removing those tuples from D should remove the inconsistency. From Propositions 3 and 4 we obtain:

Corollary 2

Given an instance D and a set DCs Σ, the instance obtained from D by removing an actual cause, resp. a most responsible actual cause, for V Σ together with any of its S-minimal, resp. C-minimal, contingency sets forms an S-repair, resp. a C-repair, for D with respect to Σ.

Example 5

Consider D={P(a),P(e),Q(a,b),R(a,c)} and Σ={κ 1,κ 2}, with κ 1 :←P(x),Q(x,y) and κ 2 :←P(x),R(x,y).

The violation views are \(\phantom {\dot {i}\!} V^{\kappa _{1}}\!: \exists xy (P(x) \land Q(x,y)) \) and \(\phantom {\dot {i}\!} V^{\kappa _{2}} \!: \exists xy (P(x) \land R(x,y))\). For \(\phantom {\dot {i}\!}V^{\!{\Sigma }} := V^{\kappa _{1}}\lor V^{\kappa _{2}}\), DV Σ and D is inconsistent with respect to Σ.

Now assume all tuples are endogenous. It holds Causes(D,V Σ)={P(a), Q(a,b),R(a,c)}, and its elements are associated with sets of S-minimal contingency sets, as follows: Cont(D,V Σ ,Q(a,b))={{R(a,c)}}, Cont(D,V Σ , R(a,c))={{Q(a,b)}}, and Cont(D,V Σ ,P(a))={}.

From Corollary 2, and Cont(D,V Σ , R(a,c))={{Q(a,b)}}, \(\phantom {\dot {i}\!}D_{1}=D \smallsetminus (\{ R(a,c)\} \cup \{Q(a,b)\}) =\{ P(a),P(e)\}\) is an S-repair. So is \(\phantom {\dot {i}\!}D_{2}=D \smallsetminus (\{P(a) \} \cup \emptyset )=\{P(e),Q(a,b),\) R(a,c)}. These are the only S-repairs.

Furthermore, MRC(D,V Σ)={P(a)}. From Corollary 2, D 2 is also a C-repair for D.

Remark 3

An actual cause t with any of its S-minimal contingency sets determines a unique S-repair. The last example shows that, with different combinations of a cause and one of its contingency sets, we may obtain the same repair (e.g. for the first two Cont sets). So, we may have more minimal contingency sets than minimal repairs. However, we may still have exponentially many minimal contingency sets, so as we may have exponentially many minimal repairs of an instance with respect to DCs, as the following example shows.Footnote 7

Example 6

Consider D={R(1,0),R(1,1),…,R(n,0),R(n,1),S(1),S(0)} and the DC κ :←R(x,y),R(x,z),S(y),S(z). D is inconsistent with respect to κ. There are exponentially many S-repairs of D: \(\phantom {\dot {i}\!}D^{\prime } = D \smallsetminus \{S(0)\}\), \(\phantom {\dot {i}\!}D^{\prime \prime } = D \smallsetminus \{S(1)\}\), \(\phantom {\dot {i}\!}D_{1} = D \smallsetminus \{R(1,0), \ldots , R(n,0)\}\), ..., \(\phantom {\dot {i}\!}D_{2^{n}} = D \smallsetminus \{R(1,1), \ldots , R(n,1)\}\). The C-repairs are only D and D .

For the BCQ V κ associated to κ, DV κ, and S(1) and S(0) are actual causes for V κ (courterfactual causes with responsibility 1). All tuples in R are actual causes, each with exponentially many S-minimal contingency sets. For example, R(1,0) has the S-minimal contingency set {R(2,0),…,R(n,0)}, among exponentially many others (any set built with just one element from each of the pairs {R(2,0),R(2,1)}, ..., {R(n,0),R(n,1)} is one).

4.1 Causes for Unions of Conjunctive Queries

If we want to compute repairs with respect to sets of DCs from causes for UBCQs using, say Corollary 2, we first need an algorithm for computing the actual causes and their (minimal) contingency sets for UBCQs. These algorithms could be used as a first stage of the computation of S-repairs and C-repairs with respect to sets of DCs. However, these algorithms (developed in Section 4.2), are also interesting and useful per se.

The PTIME algorithm for computing actual causes in [47] is for single conjunctive queries, but does not compute the actual causes’ contingency sets. Actually, doing the latter increases the complexity, because deciding responsibility Footnote 8 of actual causes is N P-hard [47] (which would be tractable if we could efficiently compute all (minimal) contingency sets).Footnote 9 In principle, an algorithm for responsibilities can be used to compute C-minimal contingency sets, by iterating over all candidates, but Example 6 shows that there can be exponentially many of them.

We first concentrate on the problem of computing actual causes for UBCQs, without their contingency sets, which requires some notation.

Definition 3

Given \(\phantom {\dot {i}\!}\mathcal { Q} = C_{1} \vee {\cdots } \vee C_{k}\), where each C i a BCQ, and an instance D:

  1. (a)

    \(\phantom {\dot {i}\!}\mathfrak {S}(D)\) is the collection of all S-minimal subsets of D that satisfy a disjunct C i of \(\phantom {\dot {i}\!}\mathcal {Q}\).

  2. (b)

    \(\phantom {\dot {i}\!}\mathfrak {S}^{n}(D)\) consists of the S-minimal subsets Λ of D n for which there exists a \(\phantom {\dot {i}\!}{\Lambda }^{\prime } \! \in \mathfrak {S}(D) \) with Λ⊆Λ and \(\phantom {\dot {i}\!} {\Lambda } \smallsetminus {\Lambda }^{\prime } \subseteq D^{x}\). □

\(\phantom {\dot {i}\!}\mathfrak { S}^{n}(D)\) contains all S-minimal sets of endogenous tuples that simultaneously (and possibly accompanied by exogenous tuples) make the query true. It is easy to see that \(\phantom {\dot {i}\!}\mathfrak {S}(D)\) and S n(D) can be computed in polynomial time in the size of D.

Now, generalizing a result for CQs in [47], actual causes for a UBCQs can be computed in PTIME in the size of D without computing contingency sets. We formulate this results in terms of the corresponding causality decision problem (CDP).

Proposition 5

Given an instance D, a UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , and t∈D n:

  1. (a)

    t is an actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) iff there is \(\phantom {\dot {i}\!}{\Lambda } \in \mathfrak { S}^{n}(D)\) with t∈Λ.

  2. (b)

    The causality decision problem (about membership of)

    $$ \mathcal{C}\mathcal{D}\mathcal{P} := \{ (D,t)~ |~ t \in D^{n}, \text{ and} \,\,t \in \textit{Causes}(D,\mathcal{ Q})\} $$
    (5)

    belongs to PTIME.

Proof 1

  1. (a)

    Assume \(\phantom {\dot {i}\!}\mathfrak {S}(D) =\{{\Lambda }_{1}, \ldots , {\Lambda }_{m}\}\), and there exists a \(\phantom {\dot {i}\!}{\Lambda } \in \mathfrak { S}^{n}(D)\) with t∈Λ. Consider a set Γ⊆D n such that, for all \(\phantom {\dot {i}\!}{\Lambda }_{i} \in \mathfrak {S}^{n}(D)\) where Λ i ≠Λ, Γ∩Λ i and Γ∩Λ = . With such a Γ, t is an actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) with contingency set Γ. So, it is good enough to prove that such Γ always exists. In fact, since all subsets of \(\phantom {\dot {i}\!}\mathfrak {S}^{n}(D)\) are S-minimal, then, for each \(\phantom {\dot {i}\!}{\Lambda }_{i} \in \mathfrak {S}^{n}(D)\) with Λ i ≠Λ, Λ i ∩Λ = . Therefore, Γ can be obtained from the set of difference between each Λ i and Λ.

    Now, if t is an actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\), then there exist an S-minimal Γ∈D n, such that \(\phantom {\dot {i}\!}D \smallsetminus ({\Gamma } \cup \{t\}) \not \models \mathcal {Q}\), but \(\phantom {\dot {i}\!}D \smallsetminus {\Gamma } \models \mathcal {Q}\). This implies that there exists an S-minimal subset Λ of D, such that t∈Λ and \(\phantom {\dot {i}\!}{\Lambda } \models \mathcal { Q}\). Due to the S-minimality of Γ, it is easy to see that t is included in a subset of \(\phantom {\dot {i}\!}\mathfrak {S}^{n}(D)\).

  2. (b)

    This is a simple generalization of the proof of the same result for single conjunctive queries found in [47].

Example 7

(ex. 5 cont.) Consider the query \(\mathcal { Q}\!: \exists xy (P(x) \land Q(x,y)) \lor \exists xy(P(x) \land R(x,y))\) , and assume that for D, D n={P(a),R(a,c)} and D x={P(e),Q(a,b)}. It holds \(\mathfrak {S}(D)=\{ \{P(a),Q(a,b)\}, \{P(a),\) R(a,c)}}. Since {P(a)}⊆{P(a), R(a,c)}, \(\phantom {\dot {i}\!}\mathfrak { S}^{n}(D)=\{\{P(a)\}\}\). So, P(a) is the only actual cause for \(\phantom {\dot {i}\!} \mathcal { Q}\).

4.2 Contingency Sets for Unions of Conjunctive Queries

It is possible to develop a (naive) algorithm that accepts as input an instance D and a UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), and returns \(\phantom {\dot {i}\!}\textit {Causes}(D,\mathcal {Q})\); and also, for each \(\phantom {\dot {i}\!}t \in \textit {Causes}(D,\mathcal { Q})\), its (set of) S-minimal contingency sets \(\phantom {\dot {i}\!}{Cont}(D,\mathcal {Q},t)\).

The basis for the algorithm is a correspondence between the actual causes for \(\phantom {\dot {i}\!}\mathcal { Q}\) with their contingency sets and a hitting-set problem.Footnote 10 More precisely, for a fixed UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), consider the hitting-set framework

$$ \mathfrak{H}^{n}(D) = \langle D^{n}, \mathfrak{S}^{n}(D)\rangle, $$
(6)

with \(\phantom {\dot {i}\!}\mathfrak {S}^{n}(D)\) as in Definition 3. Different computational and decision problems are based on \(\phantom {\dot {i}\!}\mathfrak { H}^{n}(D)\), and we will confront some below. Notice that hitting-sets (HSs) are all subsets of D n.

The S-minimal hitting-sets for \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D)\) correspond to actual causes with their S-minimal contingencies for \(\phantom {\dot {i}\!}\mathcal {Q}\). Most responsible causes for \(\phantom {\dot {i}\!}\mathcal {Q}\) are in correspondence with hitting-sets for \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D)\). This is formalized as follows:

Proposition 6

For an instance D, a UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , and t∈D n:

  1. (a)

    t is an actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) with S-minimal contingency set Γ iff Γ∪{t} is an S-minimal hitting-set for \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D)\).

  2. (b)

    t is a most responsible actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) with C-minimal contingency set Γ iff Γ∪{t} is a minimum hitting-set for \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D)\).

The proof is similar to that of part (a) of Proposition 5.

Example 8

(ex. 5 and 7 cont.) D and \(\phantom {\dot {i}\!}\mathcal { Q}\) are as before, but now all tuples are endogenous. Here, \(\phantom {\dot {i}\!}\mathfrak { S}(D)= \mathfrak { S}^{n}(D) = \{ \{P(a),Q(a,b)\}, \{P(a),\) R(a,c)}}. \(\phantom {\dot {i}\!}\mathfrak { H}^{n}(D)\) has two S-minimal hitting-sets: H 1={P(a)} and H 2={Q(a,b),R(a,c)}. Each of them implicitly contains an actual cause (any of its elements) with an S-minimal contingency set (what’s left after removing the actual cause). H 1 is also the C-minimal hitting-set, and contains the most responsible actual cause, P(a).

Remark 4

For \(\phantom {\dot {i}\!}\mathfrak { H}^{n}(D)=\langle D^{n}, \mathfrak { S}^{n}(D)\rangle \), \(\phantom {\dot {i}\!}\mathfrak { S}^{n}(D)\) can be computed in PTIME in data complexity, and its elements are bounded in size by \(\phantom {\dot {i}\!}|\mathcal {Q}|\), which is the maximum number of atoms in one of \(\phantom {\dot {i}\!}\mathcal {Q}\)’s disjuncts. This is a special kind of hitting-set problems. For example, deciding if there is a hitting-set of size at most k as been called the d-hitting-set problem [50], and d is the bound on the size of the sets in the set class. In our case, d would be \(\phantom {\dot {i}\!}|\mathcal {Q}|\).

4.3 Causality, Repairs, and Consistent Answers

Corollary 2 and Proposition 6 can be used to compute repairs. If the classes of S- and C-minimal hitting-sets for \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D)\) (with D n = D) are available, computing S- and C-repairs will be in PTIME in the sizes of those classes. However, it is well known that computing minimal hitting-sets is a complex problem. Actually, as Example 6 implicitly shows, we can have exponentially many of them in |D|; so as exponentially many minimal repairs for D with respect to a denial constraint. We can see that the complexity of contingency sets computation is in line with the complexities of computing hitting-sets and repairs.

As Corollary 2 and Proposition 6 show, the computation of causes, contingency sets, and most responsible causes via minimal/minimum hitting-set computation can be used to compute repairs and decide about repair questions. Since the hitting-set problems in our case are of the d-hitting-set kind, good algorithms and approximations for the latter (cf. Section 6.1) could be used in the context of repairs.

In the rest of this section we consider an instance D whose tuples are all endogenous, and a set Σ of DCs. For the disjunctive violation view V Σ, the following result is obtained from Propositions 3 and 4, and Corollary 2.

Corollary 3

For an instance D, with D x=∅, and a set Σ of DCs, it holds:

  1. (a)

    For every t∈Causes(D,V Σ), there is an S-repair that does not contain t.

  2. (b)

    For every t∈MRC(D,V Σ), there is a C-repair that does not contain t.

  3. (c)

    For every D ∈Srep(D,Σ) and D ′′ ∈Crep(D,Σ), it holds \(\phantom {\dot {i}\!}D \smallsetminus D^{\prime } \subseteq \textit {Causes}(D,V^{\!{\Sigma }})\) and \(\phantom {\dot {i}\!}D \smallsetminus D^{\prime \prime } \subseteq \textit {MRC}(D,V^{\!{\Sigma }})\).

For a projection-free, and a possibly non-Boolean CQ \(\phantom {\dot {i}\!}\mathcal { Q}\), we are interested in its consistent answers from D with respect to Σ. For example, for \(\phantom {\dot {i}\!}\mathcal { Q}(x,y,z)\!: \ R(x,y) \wedge S(y,z)\), the S-consistent (C-consistent) answers would be of the form 〈a,b,c〉, where R(a,b) and S(b,c) belong to all S-repairs (C-repairs) of D.

From Corollary 3, 〈a,b,c〉 is an S-consistent (resp. C-consistent) answer iff R(a,b) and S(b,c) belong to D, but they are not actual causes (resp. most responsible actual causes) for V Σ.

The following simple result and its corollary will be useful in Section 6.

Proposition 7

For an instance D, with D x =∅, a set Σ of DCs, and a projection-free CQ \(\mathcal {Q}(\bar {x})\!: \ P_{1}(\bar {x}_{1}) \wedge \cdots \wedge P_{k}(\bar {x}_{k})\):

  1. (a)

    \(\phantom {\dot {i}\!}\bar {c}\) is an S-consistent answer iff, for each i, \(\phantom {\dot {i}\!}P_{i}(\bar {c}_{i}) \in (D \smallsetminus \textit {Causes}(D, V^{\!{\Sigma }}))\).

  2. (b)

    \(\phantom {\dot {i}\!}\bar {c}\) is a C-consistent answer iff, for each i, \(P_{i}(\bar {c}_{i}) \in (D \smallsetminus \textit {MRC}(D, V^{\!{\Sigma }}))\).

Example 9

(ex. 5 cont.) Consider \(\phantom {\dot {i}\!}\mathcal { Q}(x)\!: \ P(x)\). We had Causes(D,V Σ) ={P(a), Q(a,b), R(a,c)}, MRC(D,V Σ)={P(a)}. Then, 〈e〉 is both an S- and a C-consistent answer.

Notice that Proposition 7 can easily be extended to conjunctions of ground atomic queries.

Corollary 4

Given an instance D and a set Σ of DCs, the ground atomic query \(\phantom {\dot {i}\!}\mathcal {Q}\!\!: P(c)\) is C-consistently true iff P(c)∈D and it is not a most responsible cause for V Σ.

Example 10

For D={P(a,b),R(b,c),R(a,d)} and the DC κ : ←P(x,y),R(y,z), we obtain: Causes(D, V κ) = MRC(D,V κ)={P(a,b),R(b,c)}.

From Proposition 7, the ground atomic query \(\phantom {\dot {i}\!}\mathcal { Q}\!\!: R(a,d)\) is both S- and C-consistently true in D with respect to κ, because, \(D \smallsetminus \textit {Causes}(D, V^{\kappa }) = D \smallsetminus \textit {MRC}(D, V^{\kappa })\) ={R(a,d)}.

The CQs considered in Proposition 7 and its Corollary 4 are not particularly interesting per se, but we will use those results to obtain new complexity results for causality later on, e.g. Theorem 3.

5 Causes and Repairs from Consistency-Based Diagnosis

The main objective in this section is to characterize database causality computation as a diagnosis problem.Footnote 11 This is interesting per se, and will also allow us to apply ideas and techniques from model-based diagnosis to causality. As a side result we obtain a characterization of database repairs in terms of diagnosis.

Let D be an instance for schema \(\phantom {\dot {i}\!}\mathcal {S}\), and \(\phantom {\dot {i}\!}\mathcal { Q}\!: \exists \bar {x}(P_{1}(\bar {x}_{1}) \wedge {\cdots } \wedge P_{m}(\bar {x}_{m}))\), a BCQ. Assume \(\phantom {\dot {i}\!}\mathcal { Q}\) is, possibly unexpectedly, true in D. So, for the associated DC \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\!: \forall \bar {x} \neg (P_{1}(\bar {x}_{1}) \wedge {\cdots } \wedge P_{m}(\bar {x}_{m}))\), \(\phantom {\dot {i}\!}D \not \models \kappa (\mathcal { Q})\). \(\phantom {\dot {i}\!}\mathcal { Q}\) is our observation, for which we want to find explanations, using a consistency-based diagnosis approach.

For each predicate \(\phantom {\dot {i}\!}P \in \mathcal { P}\), we introduce predicate Ab P , with the same arity as P. Intuitively, a tuple in its extension is abnormal for P. The “system description”, S D, includes, among other elements, the original database, expressed in logical terms, and the DC as true “under normal conditions”.

More precisely, we consider the following diagnosis problem, \(\phantom {\dot {i}\!}\mathcal {M}=(\textit {SD},D^{n},\) \(\phantom {\dot {i}\!} \mathcal { Q})\), associated to \(\phantom {\dot {i}\!}\mathcal {Q}\). The FO system description, S D, contains the following elements:

  1. (a)

    Th(D), which is Reiter’s logical reconstruction of D as a FO theory [54] (cf. Example 11).

  2. (b)

    Sentence \(\phantom {\dot {i}\!}\kappa (\mathcal { Q}){\!^{\textit {Ab}}}\), which is \(\phantom {\dot {i}\!}\kappa (\mathcal { Q})\) rewritten as follows:

    $$\begin{array}{@{}rcl@{}} \kappa(\mathcal{ Q}){\!^{\textit{Ab}}}\!: \ \forall \bar{x}\neg (P_{1}(\bar{x}_{1}) \wedge \neg \textit{Ab}_{P_{1}}(\bar{x}_{1}) \wedge {\cdots} \wedge P_{m}(\bar{x}_{m}) \wedge \neg {Ab}_{P_{m}}(\bar{x}_{m}) ). \end{array} $$
    (7)
  3. (c)

    Formula (7) can be refined by applying the abnormality predicate, Ab, to endogenous tuples only. For this we need to use additional auxiliary predicates E n d P , with the same arity of \(\phantom {\dot {i}\!}P \in \mathcal {S}\), which contain the endogenous tuples in P’s extension (see Example 11). Accordingly, we introduce the inclusion dependencies: For each \(\phantom {\dot {i}\!}P \in \mathcal {P}\),

    $$\forall \bar{x}(\textit{Ab}_{P}(\bar{x}) \rightarrow \textit{End}_{\!P}(\bar{x})), \ \text{ and} \ \forall \bar{x}({End}_{\!P}(\bar{x}) \rightarrow P(\bar{x})).$$

The last entry, \(\phantom {\dot {i}\!}\mathcal { Q}\), in \(\phantom {\dot {i}\!}\mathcal {M}\) is the “observation”, which together with SD will produce and inconsistent theory, because we make the initial and explicit assumption that all the abnormality predicates are empty (equivalently, that all tuples are normal), i.e. we consider, for each predicate P, the sentenceFootnote 12

$$ \forall \bar{x}(\textit{Ab}_{P}(\bar{x}) \rightarrow \textbf{false}), $$
(8)

where, false is a propositional atom that is always false.

The second entry in \(\phantom {\dot {i}\!}\mathcal { M}\) is D n. This is the set of “components” that we can use to try to restore consistency, in this case, by (minimally) changing the abnormality condition on tuples in D n. In other words, the universal rules (8) are subject to exceptions or qualifications: some endogenous tuples may be abnormal. Each diagnosis shows an S-minimal set of endogenous tuples that are abnormal.

Example 11

(ex. 1 cont.) Consider the query \(\phantom {\dot {i}\!}\mathcal { Q} :\exists x \exists y ( S(x) \land R(x, y) \land S(y))\), and the instance D={S(a 3), S(a 4), R(a 4,a 3)}, with D n = {S(a 4), S(a 3)}, consider the diagnostic problem \(\phantom {\dot {i}\!}\mathcal {M}=( \textit {SD},\{S(a_{4}),S(a_{3})\},\) \(\phantom {\dot {i}\!} \mathcal { Q})\), with SD containing the sentences in (a)-(c) below:

  1. (a)

    Predicate completion axioms plus unique names assumption:

    $$\begin{array}{@{}rcl@{}} &&\forall x y (R(x,y) \leftrightarrow x= a_{4} \wedge y = a_{3}), \ \ \ \forall x(S(x) \leftrightarrow x = a_{3} \vee x = a_{4}), \end{array} $$
    (9)
    $$\begin{array}{@{}rcl@{}} && \forall x y (\textit{End}_{R}(x,y) \leftrightarrow \textbf{false}), \ \ \ \forall x(\textit{End}_{S}(x) \leftrightarrow x = a_{3} \vee x = a_{4}), \end{array} $$
    (10)
    $$\begin{array}{@{}rcl@{}} && a_{4} \neq a_{3}. \end{array} $$
    (11)
  2. (b)

    The denial constraint qualified by non-abnormality, \(\phantom {\dot {i}\!}\kappa (\mathcal { Q})^{\!\textit {Ab}}\):

    $$\begin{array}{@{}rcl@{}} \forall x y \neg&&\!\!\!\!( S(x) \land \neg \textit{Ab}_{S}(x) \land R(x, y) \land \neg \textit{Ab}_{R}(x, y)\land \ S(y) \wedge \neg {Ab}_{S}(y)). \end{array} $$

    In diagnosis formalizations this formula would be usually presented as:

    $$\begin{array}{@{}rcl@{}} \forall x y(&&\!\!\!(\neg\textit{Ab}_{S}(x) \wedge \neg\textit{Ab}_{R}(x,y) \wedge \neg \textit{Ab}_{S}(y)) \ \longrightarrow \neg (S(x) \land R(x, y) \land \ S(y))). \end{array} $$

    That is, under the normality assumption, the “system” behaves as intended; in this case, there are no violations of the denial constraint. This main formula in the diagnosis specification can also be written as a disjunctive positive rule:

    $$\begin{array}{@{}rcl@{}} &{}\!\!\!\forall x y(S(x) \land R(x, y) \land S(y) \ \longrightarrow \ \textit{Ab}_{S}(x) \vee \textit{Ab}_{R}(x,y) \vee \textit{Ab}_{S}(y)). \end{array} $$
    (12)
  3. (c)

    Abnormality/endogenousity predicates are in correspondence to the database schema, and only endogenous tuples can be abnormal:

    $$\begin{array}{@{}rcl@{}} &&\forall x y(\textit{Ab}_{R}(x,y) \rightarrow \textit{End}_{R}(x,y)), \ \ \ \forall x y(\textit{End}_{R}(x,y) \rightarrow R(x,y)), \end{array} $$
    (13)
    $$\begin{array}{@{}rcl@{}} && \forall x(\textit{Ab}_{S}(x) \rightarrow \textit{End}_{S}(x)), \ \ \ \ \ \ \ \ \ \ \ \ \forall x(\textit{End}_{S}(x) \rightarrow S(x)). \end{array} $$
    (14)

    In addition to this specification, we have the observation \(\phantom {\dot {i}\!}\mathcal { Q}\):

    $$\begin{array}{@{}rcl@{}} \exists x \exists y ( S(x) \land R(x, y) \land S(y)). \end{array} $$
    (15)

Finally, we make the assumption that there are not abnormal tuples:

$$\begin{array}{@{}rcl@{}} \forall x y(\textit{Ab}_{R}(x,y) \rightarrow \textbf{false}), \ \ \ \ \forall x(\textit{Ab}_{S}(x) \rightarrow \textbf{false}). \end{array} $$
(16)

The FO theory formed by (9) - (16) (more precisely, (9), (11), (12), (15) and (16)) is inconsistent.

Now, in more general terms, the observation is \(\phantom {\dot {i}\!}\mathcal { Q}\) (being true), obtained by evaluating query \(\phantom {\dot {i}\!}\mathcal { Q}\) on (theory of) D. In this case, \(\phantom {\dot {i}\!}D \not \models \kappa (\mathcal {Q})\). Since all the abnormality predicates are assumed to be empty, \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\) is equivalent to \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})^{{Ab}}\), which also becomes false with respect to D. As a consequence, \(\phantom {\dot {i}\!}{SD} \cup \{(8)\} \cup \{\mathcal {Q} \}\) is an inconsistent FO theory. A diagnosis is a set of endogenous tuples that, by becoming abnormal, restore consistency.

Definition 4

  1. (a)

    A diagnosis for \(\phantom {\dot {i}\!}\mathcal { M}\) is a Δ⊆D n, such that

    $$\textit{SD} \ \cup \ \{\textit{Ab}_{P}(\bar{c}) ~|~ P(\bar{c}) \in {\Delta}\} \ \cup \ \{\neg \textit{Ab}_{P}(\bar{c}) ~|~ P(\bar{c}) \in D \smallsetminus {\Delta}\} \ \cup \ \{\mathcal{ Q}\} \vspace{-7mm} $$

    is consistent.

  2. (b)

    \(\phantom {\dot {i}\!}\textit {Diag}^{s}(\mathcal { M},t)\) denotes the set of S-minimal diagnoses for \(\phantom {\dot {i}\!}\mathcal { M}\) that contain tuple tD n.

  3. (c)

    \(\phantom {\dot {i}\!}\textit {Diag}^{c}(\mathcal { M},t)\) denotes the set of C-minimal diagnoses in \(\phantom {\dot {i}\!}\textit {Diag}^{s}(\mathcal { M},t)\).

Example 12

(ex. 11 cont.) The theory can be made consistent by giving up (16), and making S-minimal sets of tuples abnormal. According to (13)-(14), those tuples have to be endogenous.

\(\phantom {\dot {i}\!}\mathcal { M}\) has two S-minimal diagnosis: Δ1={S(a 3)} and Δ4={S(a 4)}. The first one corresponds to replacing the second formula in (16) by ∀x(A b S (x)∧xa 3false), obtaining now a consistent theory.

Here, \(\phantom {\dot {i}\!}\textit {Diag}^{s}(\mathcal { M},S(a_{3}))= \textit {Diag}^{c}(\mathcal { M}, S(a_{3}))=\{ \{ S(a_{3})\}\}\), and \(\phantom {\dot {i}\!}{Diag}^{s}(\mathcal {M},\) \(\phantom {\dot {i}\!} S(a_{4}))={Diag}^{c}(\mathcal {M}, S(a_{4}))=\{\{ \) S(a 4)}}.

If R(a 4,a 3) is also endogenous, then also {R(a 4,a 3)} becomes a minimal diagnosis.

By definition, \(\phantom {\dot {i}\!}\textit {Diag}^{c}(\mathcal { M},t) \subseteq \textit {Diag}^{s}(\mathcal { M},t)\). Diagnoses for \(\phantom {\dot {i}\!}\mathcal { M}\) and actual causes for \(\phantom {\dot {i}\!}\mathcal { Q}\) are related.

Proposition 8

Consider an instance D, a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , and the diagnosis problem \(\phantom {\dot {i}\!}\mathcal { M}\) associated to \(\phantom {\dot {i}\!}\mathcal {Q}\) . Tuple t∈D n is an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) iff \(\phantom {\dot {i}\!}{Diag}^{s}(\mathcal {M},t) \not = \emptyset \).

The responsibility of an actual cause t is determined by the cardinality of the diagnoses in \(\phantom {\dot {i}\!}{Diag}^{c}(\mathcal {M},t)\).

Proposition 9

For an instance D, a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , the associated diagnosis problem \(\phantom {\dot {i}\!}\mathcal {M}\) , and a tuple t∈D n, it holds:

  1. (a)

    \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(t)=0\) iff \(\phantom {\dot {i}\!}\textit {Diag}^{c}(\mathcal { M},t) = \emptyset \).

  2. (b)

    Otherwise, \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(t)=\frac {1}{|{\Delta }|}\) , where \(\phantom {\dot {i}\!}{\Delta } \in \textit {Diag}^{c}(\mathcal { M},t)\).

For the proofs of Propositions 8 and 9, it is easy to verify that the conflict sets of \(\phantom {\dot {i}\!}\mathcal {M}\) coincide with the sets in \(\phantom {\dot {i}\!}\mathfrak {S}(D^{n})\) (cf. Definition 3). The results are obtained from the characterization of minimal diagnosis as minimal hitting-sets of sets of conflict sets (cf. Section 2 and [53]) and Proposition 6.

Example 13

(ex. 12 cont.) From Propositions 8 and 9, S(a 3) and S(a 4) are actual cases, with responsibility 1. If R(a 4,a 3) is also endogenous, it also becomes an actual cause with responsibility 1.

In consistency-based diagnosis, minimal diagnoses can be obtained as S-minimal hitting-sets of the collection of S-minimal conflict sets (cf. Section 2) [53]. In our case, conflict sets are S-minimal sets of endogenous tuples that, if not abnormal (only endogenous ones can be abnormal), and together, and possibly in combination with exogenous tuples, make (7) false.

It is easy to verify that the conflict sets of \(\phantom {\dot {i}\!}\mathcal { M}\) coincide with the sets in \(\phantom {\dot {i}\!}\mathfrak {S}(D^{n})\) (cf. Definition 3 and Remark 4). As a consequence, conflict sets for \(\phantom {\dot {i}\!}\mathcal {M}\) can be computed in PTIME, the hitting-sets for \(\phantom {\dot {i}\!}\mathcal {M}\) contain actual causes for \(\phantom {\dot {i}\!}\mathcal {Q}\), and the hitting-set problem for the diagnosis problems is of the d-hitting-set kind.

The reduction from causality to consistency-based diagnosis allows us to apply constructions and techniques for the latter (cf. [27, 49]), to the former.

Example 14

(ex. 11 cont.) The diagnosis problem \(\phantom {\dot {i}\!}\mathcal {M}=( \textit {SD},\{S(a_{4}),S(a_{3})\},\) \(\phantom {\dot {i}\!} \mathcal { Q})\) gives rise to the hitting-set framework \(\phantom {\dot {i}\!}\mathfrak { H}^{n}(D) = \langle \{S(a_{4}),S(a_{3})\}, \{\{ (S(a_{3}),\) S(a 4)}}〉, with {S(a 3),S(a 4)} corresponding to the conflict set c={S(a 4), S(a 3)}.

\(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D)\) has two minimum hitting-sets: {S(a 3)} and {S(a 4)}, which are the S-minimal diagnosis for \(\phantom {\dot {i}\!}\mathcal {M}\). Then, the two tuples are actual causes for \(\phantom {\dot {i}\!}\mathcal {Q}\) (cf. Proposition 8). From Proposition 9, \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(S(a_{3}))= \rho _{_{\!D\!}}(S(a_{4}))= 1\).

The solutions to the diagnosis problem can be used for computing repairs.

Proposition 10

Consider an instance D with D x =∅, a set of DCs of the form \(\phantom {\dot {i}\!}\kappa \!: \ \forall \bar {x}\neg (P_{1}(\bar {x}_{1}) \wedge {\cdots } \wedge P_{m}(\bar {x}_{m})\) , and their associated “abnormality-aware” integrity constraints Footnote 13 in (7) (in this case we do not need End P atoms).

Each S-minimal diagnosis Δ gives rise to an S-repair of D, namely \(\phantom {\dot {i}\!}D_{\!{\Delta }} = D \smallsetminus \{P(\bar {c}) \in D ~|~ \textit {Ab}_{P}(\bar {c}) \in {\Delta }\}\) ; and every S-repair can be obtained in this way. Similarly, for C-repairs using C-minimal diagnoses.

Example 15

(ex. 13 cont.) The instance D={S(a 3), S(a 4), R(a 4,a 3)}, with all tuples endogenous, has three (both S- and C-) repairs with respect to the DC κ : ∀x y¬(S(x)∧R(x,y)∧S(y)), namely D 1={S(a 3),R(a 4,a 3)}, D 2={S(a 4),R(a 4,a 3)}, and D 3={S(a 3),S(a 4)}. They can be obtained as \(\phantom {\dot {i}\!}D_{\!{\Delta }_{1}}, D_{\!{\Delta }_{2}},\) \(\phantom {\dot {i}\!} D_{\!{\Delta }_{3}}\) from the only (S- and C-) diagnoses, Δ1={S(a 3)}, Δ2={S(a 4)}, Δ3={R(a 4,a 3)}, resp.

We have characterized repairs in terms of diagnosis. Thinking of the other direction, and as a final remark, it is worth observing that the very particular kind of diagnosis problem we introduced above (with restricted logical formulas) can be formulated as a preferred-repair problem [9](@var1@Sec. 2.5@var1@). Without going into the details, the idea is to materialize tables for the auxiliary predicates A b P and E n d P , and consider the DCs of the form (7) (with the E n d P atoms when not all tuples are endogenous), plus the DCs (8), saying that the initial extensions for the A b P predicates are empty. If D is inconsistent with respect to this set of DCs, the S-repairs that are obtained by only inserting endogenous tuples into the extensions of the A b P predicates correspond to S-minimal diagnosis, and each S-minimal diagnosis can be obtained in this way.

6 Complexity Results

There are three main computational problems in database causality. For a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) and database D:

  • The causality problem (CP) is about computing the actual causes for \(\phantom {\dot {i}\!}\mathcal { Q}\). Its decision version of this problem, CDP, is stated in (5). Both CP and CDP are solvable in polynomial time [47], which can be extended to UBCQs (cf. Proposition 5).

  • The responsibility problem (RP) is about computing the responsibility \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(t)\) of a given actual cause t. (Since a tuple that is not an actual cause has responsibility 0, this problem subsumes (a).) This is a maximization problem due to the minimization of |Γ| in the denominator.

    We will consider the decision version of this problem that, as usual for maximization problems [29], asks whether the real-valued function being computed (responsibility in this case) takes a value greater than a given threshold v of the form \(\phantom {\dot {i}\!}\frac {1}{k}\), for a positive integer k.

Definition 5

For a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), the responsibility decision problem (RDP) is (deciding about membership of):

$$\begin{array}{@{}rcl@{}} \mathcal{R}\mathcal{D}\mathcal{P}(\mathcal{ Q})=\{(D,t,v)\; |\; t &\in& D^{n}, v \in \{0\} \cup \{\frac{1}{k}\; |\; k \in \mathbb{N}^{+}\}, \text{ and}\\ D&\models& \mathcal{ Q} \ \text{ and} \ \rho_{_{\!D\!}}(t) > v \}, \end{array} $$

that is, deciding if a tuple has a responsibility greater than a bound v (as a cause for \(\phantom {\dot {i}\!}\mathcal {Q}\)).

The complexity analysis of RDP in [47] is restricted to conjunctive queries without self-joins. Here, we will generalize the complexity analysis for RDP to general CQs.

  1. (c)

    Computing the most responsible actual causes (MRC). Its decision version, MRCDP, the most responsible cause decision problem, is a natural problem, because actual causes with the highest responsibility tend to provide most interesting explanations for query answers [47, 48].

Definition 6

For a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), the most responsible cause decision problem is (membership of):

\(\phantom {\dot {i}\!}\mathcal { MRCDP}(\mathcal { Q})\) \(\phantom {\dot {i}\!}=\{(D,t) | t \in D^{n} \text { and} \,\,0 < \rho _{_{\!D\!}}(t)~\text {is a maximum for} \ D\}\).

We start by analyzing a more basic decision problem, that of deciding if a set of tuples Γ is an S-minimal contingency set associated to a cause t (cf. (3)). Due to the results in Sections 3 and 4, it is clear that there is a close connection between this problem and the S-repair checking problem [9](@var1@Chap. 5@var1@), about deciding if instance D is an S-repair of instance D with respect to a set of integrity constraints. Actually, the following result is obtained from the PTIME solvability of the S-repair checking problem for DCs [18] (see also [1]).

Proposition 11

For a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , the minimal contingency set decision problem (MCSDP), i.e. \(\phantom {\dot {i}\!}\mathcal {MCSDP}(\mathcal { Q}):=\{(D,t,{\Gamma })~ |~ {\Gamma }~\text {is minimal} \text {~element in} \,\,\textit {Cont}(D,\mathcal { Q},t)\}\) , belongs to PTIME.

Proof 2

To decide if \(\phantom {\dot {i}\!}(D,t,{\Gamma }) \in \mathcal { MCSDP}(\mathcal { Q})\), it is good enough to observe, from Proposition 1, that \(\phantom {\dot {i}\!}(D,t,{\Gamma }) \in \mathcal {M}\mathcal {C}\mathcal {S}\mathcal {D}\mathcal {P}(\mathcal {Q})\) iff \(\phantom {\dot {i}\!}D \smallsetminus ( {\Gamma } \cup \{t\})\) is an S-repair for D with respect to \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\). S-repair checking can be done in PTIME in data [18]. □

We could also consider the decision problem defined in Proposition 11, but with C-minimal Γ. We will not use results about this problem in the following. Furthermore, its connection with the C-repair checking problem is less direct. As one can see from Section 3, C-minimal contingency sets correspond to a repair semantics somewhere between the S-minimal and C-minimal repair semantics (a subclass of Srep, but a superclass of Crep): It is about an S-minimal repair with minimum cardinality that does not contain a particular tuple.

Now we establish that RDP is NP-complete for CQs in general. The NP-hardness is shown in [47]. Membership of NP is obtained using Proposition 11.

Theorem 1

  1. (a)

    For every BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , \(\phantom {\dot {i}\!}\mathcal {R}\mathcal {D}\mathcal {P}(\mathcal { Q}) \in \textit {NP}\).

  2. (b)

    [47] There are CQs \(\phantom {\dot {i}\!}\mathcal { Q}\) for which \(\phantom {\dot {i}\!}\mathcal {R}\mathcal {D}\mathcal {P}(\mathcal { Q})\) is NP-hard.

Proof 3

  1. (a)

    We give a non-deterministic PTIME algorithm to solve RDP. Non-deterministically guess a subset Γ⊆D n, return yes if \(\phantom {\dot {i}\!}|{\Gamma }| < \frac {1}{v}\) and (D,t,\(\phantom {\dot {i}\!} {\Gamma } ) \in \mathcal {M}\mathcal {C}\mathcal {S}\mathcal {D}\mathcal {P}\); otherwise return no. According to Proposition 11 this can be done in PTIME in data complexity.

In order to better understand the complexity of RP, the responsibility computation problem, we will investigate the functional, non-decision version of RDP.

The main source of complexity when computing responsibilities is related to the hitting-set problem associated to \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D) = \langle D^{n}, \mathfrak {S}^{n}(D)\rangle \) in Remark 4 (cf. (6)). In this case, it is about computing the cardinality of a minimum hitting-set that contains a given vertex (tuple) t. That this is a kind of d-hitting-set problem [50] will be useful in Section 6.1.

Remark 5

Our responsibility problem can also be seen as a vertex cover problem on the hypergraph Footnote 14

$$ \mathfrak{G}^{n}(D) = \langle D^{n}, \mathfrak{ S}^{n}(D)\rangle $$
(17)

associated to \(\phantom {\dot {i}\!}\mathfrak { H}^{n}(D) = \langle D^{n}, \mathfrak { S}^{n}(D)\rangle \) (that is, the hitting-set framework can be seen as a hypergraph). In it, the hyperedges are the members of \(\phantom {\dot {i}\!}\mathfrak {S}^{n}(D)\). Determining the responsibility of a tuple t becomes the problem on hypergraphs of determining the size of a minimum vertex cover that contains vertex t (among all vertex covers that contain the vertex). Again, in this problem the hyperedges are bounded in size by \(\phantom {\dot {i}\!}|\mathcal {Q}|\).Footnote 15

Example 16

For \(\phantom {\dot {i}\!}\mathcal { Q}\!: \exists xy(P(x)\wedge R(x,y)\wedge P(y))\), and D = D n={P(a),P(c), R(a,c), R(a,a)}, \(\phantom {\dot {i}\!}\mathfrak {S}(D)=\mathfrak {S}^{n}(D)=\{ \{P(a), R(a,a)\}, \{P(a), P(c), R(a,c)\} \}\).

The hypergraph \(\phantom {\dot {i}\!}\mathfrak { G}^{n}(D)\) has D as set of vertices, and its hyperedges are {P(a), R(a,a)} and {P(a),P(c),R(a,c)}. Its minimal vertex covers are: v c 1={P(a)}, v c 2={P(c), R(a,a)}, v c 3={R(a,a),R(a,c)}. Only the first has minimum cardinality. Accordingly, its only element, P(a), is an actual cause with responsibility 1. The other tuples are actual causes with responsibility \(\phantom {\dot {i}\!}\frac {1}{2}\).

Remark 6

To simplify the presentation of the next computational problems (Lemmas 1 and 2 and Proposition 12), we will formulate and address them in terms of graphs. However, they still hold for hypergraphs [43, 44], which is what we need for the complexity results obtained in the rest of this section.

Lemma 1

(representation lemma) There is a fixed database schema \(\phantom {\dot {i}\!}\mathcal {S}\) and a BCQ \(\phantom {\dot {i}\!}\mathcal {Q} \in \mathcal {L}(\mathcal {S})\) , without built-ins, such that, for every graph G=(V,E), with non-empty E, and v∈V, there is an instance D for \(\phantom {\dot {i}\!}\mathcal {S}\) and a tuple t∈D, such that the size of a minimum vertex cover of G containing v is the inverse of the responsibility of t as an actual cause for \(\phantom {\dot {i}\!}\mathcal { Q}\).

Proof 4

Consider a graph G=(V,E), and assume the vertices of G are uniquely labeled.

Consider the database schema with relations Ver(v 0) and Edges(v 1,v 2,e), and the conjunctive query \(\phantom {\dot {i}\!}\mathcal { Q}\!: \exists v_{1}v_{2}e(\mathit {Ver}(v_{1}) \wedge \mathit {Ver} (v_{2}) \wedge \mathit {Edges}(v_{1}, v_{2}, e))\). Ver stores the vertices of G, and Edges, the labeled edges. For each edge (v 1,v 2)∈E, E d g e s contains n tuples of the form (v 1,v 2,i), where n is the number of vertices in G. All the values in the third attribute of E d g e s are different, say from 1 to n×|E|. This padding of relation E d g e will ensure in the rest of the proof that C-minimal contingency sets for the query answer consist only of vertices, i.e. elements of V e r (as opposed to Edge tuples). The size of the padded instance is still polynomial in the size of G. It is clear that \(\phantom {\dot {i}\!}D \models \mathcal {Q}\).

Assume VC is the minimum vertex cover of G that contains vertex v, where tuple t is Ver(v). Consider the set of tuples Λ={V e r(x) | xV C}. Since vV C, Λ=Λ∪{V e r(v)}. Then, \(\phantom {\dot {i}\!}D \smallsetminus ({\Lambda }^{\prime } \cup \mathit {Ver}(v)) \not \models Q\). This is because for every tuple E d g e(v i ,v j ,k) in the instance, either v i or v j belongs to V C. Due to the minimality of V C, \(\phantom {\dot {i}\!}D \smallsetminus {\Lambda }^{\prime } \models \mathcal {Q}\). Therefore, tuple V e r ( v ) is an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\).

Suppose Γ is a C-minimal contingency set associated to Ver(v). Due to the C-minimality of Γ, it entirely consists of tuples in Ver. It holds that \(\phantom {\dot {i}\!}D \smallsetminus ({\Gamma } \cup \{\textit {Ver}(v)\}) \not \models \mathcal {Q}\) and \(\phantom {\dot {i}\!}D \smallsetminus {\Gamma } \models \mathcal {Q}\). Consider the set V C ={x | V e r(x)∈Γ}∪{v}. Since \(\phantom {\dot {i}\!}D \smallsetminus ({\Gamma } \cup \{\mathit {Ver}(v)\}) \not \models \mathcal {Q}\), for every tuple E d g e(v i ,v j ,k) in D, either v i V C or v j V C . Therefore, V C is a minimum vertex cover of G that contains v. It holds that \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(\mathit {Ver}(v))=\frac {1}{1+|{\Gamma }|}\). So, the size of a minimum vertex cover of G that contains v can be obtained from \(\phantom {\dot {i}\!}\rho _{_{\!D\!}}(\mathit {Ver}(v))\). □

Having represented our responsibility problem as a graph-theoretic problem, we first consider functional computational problems in graphs.

Definition 7

The minimal vertex cover membership problem (MVCMP) consists in, given a graph G=(V,E), and a vertex vV as inputs, computing the size of a minimum vertex cover of G that contains v.

Lemma 2

Given a graph G and a vertex v in it, there is a graph G extending G that can be constructed in polynomial time in |G|, such that the size of a minimum vertex cover for G that contains v and the size of a minimum vertex cover for G coincide.

Proof 5

The size of V C G (v), the minimum vertex cover of G that contains the vertex v, can be computed from the size of I G , the maximum independent set of G, that does not contain v. In fact,

$$ |VC_{G}(v)|=|G|-|I_{G}|. $$
(18)

Since I G is a maximum independent set that does not contain v, it must contain one of the adjacent vertices of v (otherwise, I G is not maximum, and v can be added to I G ). Therefore, |V C G (v)| can be computed from the size of a maximum independent set I that contains v , one of the adjacent vertices of v.

Given a graph G and a vertex v in it, a graph G that extends G can be constructed in polynomial time in the size of G, in such a way that: there is a maximum independent set I of G containing v iff v belongs to every maximum independent set of G iff the sizes of maximum independent sets for G and G differ by one.

Actually, graph G can be obtained by adding a new vertex v that is connected only to the neighbors of v . It holds:Footnote 16

$$\begin{array}{@{}rcl@{}} |I_{G}|&=&|I_{G^{\prime}}|-1, \end{array} $$
(19)
$$\begin{array}{@{}rcl@{}} |I_{G^{\prime}}|&=&|G^{\prime}|-|\textit{VC}_{G^{\prime}}|, \end{array} $$
(20)

where \(\phantom {\dot {i}\!}I_{G^{\prime }}\) is a maximum indent set in G , and \(\phantom {\dot {i}\!}\textit {VC}_{G^{\prime }}\) is a minimum vertex cover of G . From (18), (19) and (20), we obtain: \(\phantom {\dot {i}\!}|VC_{G}(v)|= |{VC}_{G^{\prime }}|\). □

From Lemma 2 and the FP NP(log(n))-completeness of determining the size of a maximum clique in a graph [39], we obtain:

Proposition 12

The MVCMP problem for graphs is FP NP(log(n)) -complete.

Proof 6

We prove membership by describing an algorithm in FP NP(log(n)) for computing the size of the minimum vertex cover of a graph G=(V,E) that contains a vertex vV . We use Lemma 2, and build the extended graph G .

The size of a minimum vertex cover for G gives the size of the minimum vertex cover of G that contains v. Since computing the maximum cardinality of a clique can be done in time F P NP(log(n)) [39], computing a minimum vertex cover can be done in the same time (just consider the complement graph). Therefore, MVCMP belong to F P NP(log(n)).

Hardness can be obtained by a reduction from computing minimum vertex covers in graphs to MVCMP. Given a graph G construct the graph G as follows: Add a vertex v to G and connect it to all vertices of G. It is easy to see that v belongs to all minimum vertex covers of G . Furthermore, the sizes of minimum vertex covers for G and G differ by one. Consequently, the size of a minimum vertex cover of G can be obtained from the size of a minimum vertex cover of G that contains v. Computing the minimum vertex cover is F P NP(log(n))-complete. This follows from the F P NP(log(n))-completeness of computing the maximum cardinality of a clique in a graph [39]. □

Theorem 2

  1. (a)

    For every BCQ, \(\phantom {\dot {i}\!}\mathcal { Q}\) , computing the responsibility of a tuple as a cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) is in FP NP(log(n)) .

  2. (b)

    There is a database schema and a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , without built-ins, such that computing the responsibility of a tuple as a cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) is FP NP(log(n)) -complete.

Proof 7

For membership, we observe from Remark 5 that computing a tuple’s responsibility amounts to computing the size of a minimum vertex cover containing the tuple in the graph associated to the query and instance at hand. By Proposition 12, this problem belongs to F P NP(log(n)).

Hardness follows from Lemma 1 and the hardness result in Proposition 12. □

Now we address the most responsible causes problem, MRCDP (cf. Definition 6). We use the connection with consistent query answering of Section 4.3, namely Corollary 4, and the P NP(log(n))-completeness of consistent query answering under the C-repair semantics for queries that are conjunctions of ground atoms and a particular DC [43, Theorem 4].

Theorem 3

  1. (a)

    For every BCQ, \(\phantom {\dot {i}\!}\mathcal { MRCDP}(\mathcal { Q}) \in P^{\textit {NP}\textnormal {(}log\textnormal {(}n\textnormal {))}}\!\!\).

  2. (b)

    There is a database schema and a BCQ \(\phantom {\dot {i}\!}\mathcal {Q}\) , without built-ins, for which \(\phantom {\dot {i}\!}\mathcal {M}\mathcal {R}\mathcal {C}\mathcal {D}\mathcal {P}(\mathcal {Q})\) is P NP(log(n)) -complete.

Proof 8

  1. (a)

    To show that \(\phantom {\dot {i}\!}\mathcal { MRCDP}(\mathcal { Q})\) belongs to P NP(log(n)), consider first the hitting-set framework \(\phantom {\dot {i}\!}\mathfrak {H}^{n}(D) = \langle D^{n}, \mathfrak {S}^{n}(D)\rangle \) (cf. Definition 3 and 6) and its associated hypergraph \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\) (cf. (17)).

    It holds that t is a most responsible cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) iff \(\phantom {\dot {i}\!}\mathfrak { H}^{n}(D)\) has a C-minimal hitting-set that contains t (cf. Proposition 6). Therefore, t is a most responsible cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) iff t belongs to some minimum vertex cover of \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\).

    It is easy to see that \(\phantom {\dot {i}\!}\mathfrak { G}^{n}(D)\) has a minimum vertex cover that contains t iff \(\phantom {\dot {i}\!}\mathfrak { G}^{n}(D)\) has a maximum independent set that does not contains t. Checking if t belongs to all maximum independent set of \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\) can be done in P NP(log(n)) [43, Lemma 2].

    If t belongs to all independent sets of \(\phantom {\dot {i}\!}\mathfrak { G}^{n}(D)\), then \(\phantom {\dot {i}\!}(D,t) \not \in \mathcal { MRCDP}(\mathcal { Q})\); otherwise \(\phantom {\dot {i}\!}(D,t) \in \mathcal {M}\mathcal {R}\mathcal {C}\mathcal {D}\mathcal {P}(\mathcal {Q})\). As a consequence, the decision can be made in time P NP(log(n)).

  2. (b)

    The proof is by a reduction, via Corollary 4, from consistent query answering under the C-repair semantics for queries that are conjunctions of ground atoms, which was proved to be P NP(log(n))-complete in [43, Theorem 4]. Actually, that proof (of hardness) uses a particular database schema \(\phantom {\dot {i}\!}\mathcal {S}\) and a DC κ. In our case, we can use the same schema \(\phantom {\dot {i}\!}\mathcal {S}\) and the violation query V κ associated to κ (cf. Section 4).

From Proposition 6 and the FP NP(log(n))-completeness of determining the size of C-repairs for DCs [43, Theorem 3], we obtain the following for the computation of the highest responsibility value.

Proposition 13

  1. (a)

    For every BCQ, computing the responsibility of the most responsible causes is in FP NP(log(n)).

  2. (b)

    There is a database schema and a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , without built-ins, for which computing the responsibility of the most responsible causes is FP NP(log(n)) -complete.

Proof 9

  1. (a)

    To show the membership of F P NP(log(n)), consider the hypergraph \(\phantom {\dot {i}\!}\mathfrak { G}^{n}(D)\) as obtained in Theorem 3. The responsibility of most responsible causes for \(\phantom {\dot {i}\!}\mathcal {Q}\) can be obtained from the size of the minimum vertex cover of \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\) (cf. Proposition 6). The size of the minimum vertex cover in a graph can be computed in F P NP(log(n)), which is obtained from the membership of F P NP(log(n)) of computing the maximum cardinality of a clique in graph [39].

    It is easy to verify that minimum vertex covers in hypergraphs can be computed in the same time.

  2. (b)

    This is by a reduction from the problem of determining the size of C-repairs for DCs shown to be F P NP(log(n))-complete in [43, Theorem 3]). Actually, that proof (of hardness) uses a particular database schema \(\phantom {\dot {i}\!}\mathcal {S}\) and a DC κ. In our case, we may consider the same schema \(\phantom {\dot {i}\!}\mathcal {S}\) and the violation query V κ associated to κ (cf. Section 4).

The size of C-repairs for an inconsistent instance D of the schema \(\phantom {\dot {i}\!}\mathcal { S}\) with respect to κ can be obtained from the responsibility of most responsible causes for V κ (cf. Corollary 2). □

6.1 FPT of Responsibility

We need to cope with the intractability of computing most responsible causes. The area of fixed parameter tractability (FPT) [28] provides tools to attack this problem. In this regard, we recall that a decision problem with inputs of the form (I,p), where p is a distinguished parameter of the input, is fixed parameter tractable (or belongs to the class FPT), if it can be solved in time O(f(|p|)⋅|I|c), where c and the hidden constant do not depend on |p| or |I|, and f does not depend on |I|.

In our case, the parameterized version of the decision problem \(\phantom {\dot {i}\!}\mathcal {R}\mathcal {D}\mathcal {P}(\mathcal { Q})\) (cf. Definition 5) is denoted with \(\phantom {\dot {i}\!}\mathcal {R}\mathcal {D}\mathcal {P}^{p}(\mathcal {Q})\), and the distinguished parameter is k, such that \(\phantom {\dot {i}\!}v = \frac {1}{k}\).

That \(\phantom {\dot {i}\!}\mathcal {R}\mathcal {D}\mathcal {P}^{p}(\mathcal { Q})\) belongs to FPT can be obtained from its formulation as a d-hitting-set problem (d being the fixed upper bound on the size of the sets in the set class). The latter problem consists in, given a hitting-set framework with d-bounded subsets and an element t (a tuple in our case), deciding if there is a hitting-set of cardinality smaller that k that contains t. This problem belongs to FPT.

Theorem 4

For every BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) , \(\phantom {\dot {i}\!}\mathcal {R}\mathcal {D}\mathcal {P}^{p}(\mathcal { Q})\) belongs to FPT, where the parameter is the inverse of the responsibility bound.

Proof 10

First, there is a PTIME parameterized algorithm for the d-hitting-set problem about deciding if there is a hitting-set of size at most k that runs in time O(e k + n), with n the size of the underlying set and e = d−1 + o(d −1) [50]. In our case, n=|D|, and \(\phantom {\dot {i}\!}d = |\mathcal {Q}|\) (cf. also [26]).

Now, to decide if the responsibility of a given tuple t is greater than \(\phantom {\dot {i}\!}v=\frac {1}{k}\), we consider the associated hypergraph \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\), and we decide if it has a vertex cover that contains t and whose size is less than k. In order to answer this, we use Lemma 2, and build the extended hypergraph \(\phantom {\dot {i}\!}\mathfrak {G}^{\prime }\).

The size of a minimum vertex cover for \(\phantom {\dot {i}\!}\mathfrak { G}^{\prime }\) gives the size of the minimum vertex cover of \(\phantom {\dot {i}\!}\mathfrak { G}^{n}(D)\) that contains t. If \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\) has a vertex cover that contains t of size less than k, then \(\phantom {\dot {i}\!}\mathfrak {G}^{\prime }\) has a vertex cover of size less than k. If \(\phantom {\dot {i}\!}\mathfrak {G}^{\prime }\) has a vertex cover of size less than k, its minimum size for a vertex cover is less than k. Since this minimum is the same as the size of a minimum vertex cover for \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\) that contains t, \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\) has a vertex cover of size less than k that contains t. As a consequence, it is good enough to decide if \(\phantom {\dot {i}\!}\mathfrak {G}^{\prime }\) has a vertex cover of size less than k. For this, we use the hitting-set formulation of this hypergraph problem, and the already mentioned FPT algorithm. □

This result and the corresponding algorithm sketched in its proof show that the higher the required responsibility degree, the lower the computational effort needed to compute the actual causes with at least that level of responsibility. In other terms, parameterized algorithms are effective for computing actual causes with high responsibility or most responsible causes. In general, parameterized algorithms are very effective when the parameter is relatively small [28].

Now, in order to compute most responsible causes, we could apply, for each actual cause t, the just presented FPT algorithm on the hypergraph \(\phantom {\dot {i}\!}\mathfrak {G}^{n}(D)\), starting with k=1, i.e. asking if there is vertex cover of size less than 1 that contains t. If the algorithm returns a positive result, then t is a counterfactual cause, and has responsibility 1. Otherwise, the algorithm will be launched with k=2,3,…,|D n|, until a positive result is returned. (The procedure can be improved through binary search on k=1,2,3,…,m, with m possibly much smaller than |D|.)

The complexity results and algorithms provided in this section can be extend to UBCQs. This is due to Remark 2 and the construction of \(\phantom {\dot {i}\!}\mathfrak {S}^{n}(D)\), which the results in this section build upon.

For the d-hitting-set problem there are also efficient parameterized approximation algorithms [11]. They could be used to approximate the responsibility problem. Furthermore, approximation algorithms developed for the minimum vertex cover problem on bounded hypergraphs [34, 51] should be applicable to approximate most responsible causes for query answers. Via the causality/repair connection (cf. Section 4.3), it should be possible to develop approximation algorithms to compute S-repairs of particular sizes, C-repairs, and consistent query answers with respect to DCs.

6.2 Complexity of Diagnosis with Positive Disjunctive Rules

It is known that consistency-based diagnosis decision problems can be unsolvable [53]. However, there are decidable classes of FO diagnosis specifications, and those classes are amenable to complexity analysis. However, there is little research on the complexity analysis of solvable classes of consistency-based diagnosis problems. The connection we established in the previous sections between causality, repairs and consistency-based diagnosis can be used to obtain new algorithmic and complexity results for the latter. Without trying to be exhaustive about this, which is beyond the scope of this paper, we give an example of the kind of results that can be obtained.

Considering the diagnosis problem we obtained in Section 5, we can define a class of diagnosis problems. Cf. Example 11, in particular (12), for motivation.

Definition 8

A disjunctive positive (DP) diagnosis specification Σ is a consistent FO logical theory, such that:

  1. (a)

    Σ has a signature (schema) consisting of a finite set of constants, a set of predicates \(\phantom {\dot {i}\!}\mathcal { S}\), a set \(\phantom {\dot {i}\!}\mathcal {S}^{{ab}}\) of predicates of the form A b R ,Footnote 17 with \(\phantom {\dot {i}\!}R \in \mathcal { S}\), and Ab R with the same arity of R. \(\phantom {\dot {i}\!}\mathcal { S}\) and \(\phantom {\dot {i}\!}\mathcal { S}^{\textit {ab}}\) are mutually disjoint.

  2. (b)

    Σ is inconsistent with \(\phantom {\dot {i}\!}\textit {AB}^{\mathcal { S}} := \{\forall \bar {x}(\textit {Ab}_{\!R}(\bar {x}) \rightarrow \mathbf {false}) ~| ~R \in \mathcal {S}\}\).

  3. (c)

    Consists of:

    • (c1) Sentences of the form \(\phantom {\dot {i}\!}\forall \bar {x}(C(\bar {x}) \ \longrightarrow \ \bigvee _{i} \! \textit {Ab}_{\!R_{i}}(\bar {x}_{i}))\), with \(\phantom {\dot {i}\!}\bar {x}_{i} \subseteq \bar {x}\), and \(\phantom {\dot {i}\!}C(\bar {x})\) a conjunction of atoms that does not include Ab-atoms of any kind.

    • (c2) Sentences of the forms \(\phantom {\dot {i}\!}\forall \bar {x}(\!\textit {Ab}_{\!R}(\bar {x})\ \longrightarrow \ (R(\bar {x}) \wedge S(\bar {x})))\), with \(\phantom {\dot {i}\!}S \in \mathcal {S}\).

    • (c3) A finite background universal theory \(\phantom {\dot {i}\!}\mathcal { T}\) expressed in terms of predicates in \(\phantom {\dot {i}\!}\mathcal { S}\) (and constants) that has a unique Herbrand model.Footnote 18

As above, a diagnosis is a set of Ab R -atoms that, when assumed to be true, restores the consistency of the correspondingly modified \(\phantom {\dot {i}\!}{\Sigma } \cup {AB}^{\mathcal {S}}\).

There are at least two important computational tasks that emerge, namely, given a disjunctive positive (DP) diagnosis specification Σ together with \(\phantom {\dot {i}\!}\textit {AB}^{\mathcal { S}}\):

  1. 1.

    The minimum-cardinality diagnosis (MCD) problem, about computing minimum-cardinality diagnoses.

  2. 2.

    The minimal membership diagnosis, (MMD) about computing minimum-cardinality diagnoses that contain a given Ab-atom.

It is not difficult to see that these problems are computable (or solvable in their decision versions). Now we can obtain complexity lower bounds for them. Actually, in Section 5, the responsibility and most responsible causes problem were reduced to diagnosis problems for specifications that turned out to be disjunctive positive (see (12)).

More specifically, Proposition 9 reduces computing responsibility of a tuple to computing the size of a minimum-cardinality diagnosis that contains the tuple. Furthermore, as a simple corollary of Proposition 9, we obtain the computation of minimum-cardinality diagnoses allows us to compute most responsible causes. Now, combining all this with Proposition 13 and Theorem 2, we obtain the following lower bounds for our diagnosis problems.

Theorem 5

For disjunctive positive diagnosis specifications, the MCD and MMD problems are FP NP(log(n)) -hard in the size of their underlying Herbrand structure.

7 Preferred Causes for Query Answers

In Section 3 we characterized causes and most responsible causes in terms of S-repairs and C-repairs, resp. We could generalize the notion of a cause and/or its responsibility by using, in principle, any repair semantics S. The latter is represented by a class of repairs R e p S(D,Σ), of D with respect to a set of denial constraints (cf. Section 2.2). When dealing with (sets of) DCs, the repair actions can only be of certain kinds. Usually tuple deletions have been considered. This is the case of the S- and C-repairs we have considered in this work so far.

We could go beyond and consider the notion of prioritized repair [59]. Also changes of attribute values can be the chosen repair actions, including the use of null values, to “destroy” joins (again, with different semantics, e.g. with nulls à la SQL [8, 12]).

In this section we explore the possibility of introducing a notion of preferred cause that is based on a given repair semantics. This idea is inspired by (and generalizes) the characterization of causes in terms of repairs that we obtained before, namely (1), (2), Proposition 1, and Corollary 1.

If we define causes and their (minimal) contingency sets on the basis of a given repair semantics, the minimality condition involved in the latter will have an impact on the notion of minimal (or preferred) contingency set, and indirectly, on the notions of responsibility and most responsible cause.Footnote 19

In Section 7.1 In Section 7.2 we impose preferences on causes on the basis of the prioritized repairs introduced in [59] (and further investigated in [25]). In Section 7.3, we briefly investigate the possibility of capturing endogenous repairs, i.e. that do not change exogenous tuples, by means of a priority relation. Finally, in Section 7.4, we briefly consider the possibility of defining (preferred) causes via attribute-based repairs that use null values.

7.1 Prioritized Repairs

The prioritized repairs in [59] are based on a priority relation, ≻, on the set of database tuples. In the case of a pair of (mutually) conflicting tuples, i.e. that simultaneously violate a constraint in a given set set of DCs (possibly in company of other tuples), the repair process reflects the user preference -as captured by the priority relation- on the tuples that are privileged to be kept in the database, i.e. in the intended repairs.

Given such a priority relation, in [59] different classes of prioritized repairs are introduced, namely the class of globally optimal repairs, that of Pareto-optimal repairs, and that of completion-optimal repairs. Intuitively, each class relies on a different optimality criterion that is used to extend the priority relation ≻ on pairs of conflicting facts to a priority relation on the set of S-repairs. As a consequence, each of these three classes is contained in that of the S-repairs. In particular, all these repairs are based on tuple deletions.

Let us denote with \(\phantom {\dot {i}\!}\textit {Rep}^{\succ ,\mathcal { X}}(D,{\Sigma })\) the class of all prioritized repairs based on ≻ and the optimality criterion \(\phantom {\dot {i}\!}\mathcal {X}\). Its elements are called \(\phantom {\dot {i}\!}(\succ ,\mathcal { X})\)-prioritized repairs of D with respect to a the set Σ of DCs. It holds \(\phantom {\dot {i}\!}{Rep}^{\succ ,\mathcal {X}}(D,{\Sigma }) \subseteq {Srep}(D,{\Sigma })\), and then, all the elements of \(\phantom {\dot {i}\!}{Rep}^{\succ ,\mathcal {X}}(D,{\Sigma })\) are subsets of D.

In order to show a concrete class \(\phantom {\dot {i}\!}\textit {Rep}^{\succ ,\mathcal { X}}(D,{\Sigma })\), we first recall the definitions of priority relation and global-optimal repair from [59].

Definition 9

Given an instance D and a set of denial constraints Σ , a binary relation ≻ on D is a priority relation with respect to Σ if: (a) ≻ is acyclic, and (b) for every t,t D, if tt , then t and t are mutually conflicting.Footnote 20

Definition 10

Let D be an instance, Σ a set of DCs, and ≻ a corresponding priority relation. Let D and D be two consistent sub-instances of D. D is a global improvement of D if D D , and for every tuple \(\phantom {\dot {i}\!}t^{\prime } \in D^{\prime \prime }\smallsetminus D^{\prime }\), there exists a tuple \(\phantom {\dot {i}\!}t \in D^{\prime }\smallsetminus D^{\prime \prime }\) such that tt . D is a global-optimal repair of D, if D is an S-repair and does not have a global improvement.

In this definition, the optimality criterion, a possible \(\phantom {\dot {i}\!}\mathcal { X}\) above, is that of global-optimal repair, or (≻ ,g o)-repair, which leads to a class R e p ≻,go(D,Σ). We consider this repair semantics just for illustration purposes.

Example 17

Consider the database schema Author(Name, JournalN), J o u r n a l (J o u r n a l N, T o p i c,P a p e r #), and the following instance D: D:

ᅟ ᅟ

Consider the following denial constraint:

$$ \kappa\!: \ \forall x y z z^{\prime} \neg (\textit{Author}(\textit{x,y}) \land \textit{Journal}(y,z,z^{\prime}) \land x=\textsf{John} \land z^{\prime}=\textsf{XML}), $$
(21)

capturing the condition that “John has not published a paper in a journal that has published papers on XML”.

D is inconsistent with respect to κ, and contains the following sets of conflicting tuples:

$$\begin{array}{@{}rcl@{}} C_{1}&= & \{\textit{Author(John,TKDE), Journal(TKDE,30,XML)}\}, \\ C_{2}&=&\{\textit{Author(John,TODS), Journal(TODS,32,XML)}\}. \end{array} $$

D has the following S-repairs, each obtained by deleting one tuple from each of C 1 and C 2, to resolve the conflicts:

$$\begin{array}{@{}rcl@{}} D_{1}&= & \{Author(Tom, \mathit{TKDE}), Journal(\mathit{TKDE}, 31, \mathit{CUBE}), Author(John, \mathit{TODS}),\\&& Journal(\mathit{TKDE}, 30, \mathit{XML})\} \\ D_{2}&= & \{Author(Tom, \mathit{TKDE}), Journal(\mathit{TKDE}, 31, \mathit{CUBE}), Journal(\mathit{TKDE}, 30, \mathit{XML}),\\&& {Journal(\mathit{TODS}, 32, \mathit{XML})}\} \\ D_{3}&= & \{Author(Tom, \mathit{TKDE}), Journal(\mathit{TKDE}, 31, \mathit{CUBE}), Author(John, \mathit{TKDE}),\\&& {Journal(\mathit{TODS}, 32, \mathit{XML})}\} \\ D_{4}&= & \{{Author(Tom, \mathit{TKDE}), Journal(\mathit{TKDE}, 31, \mathit{CUBE}), Author(John, \mathit{TKDE})},\\&& {Author(John, \mathit{TODS})}\} \end{array} $$
  1. (a)

    Now, assume a user prefers to resolve a conflict by removing tuples from the Author table rather than the Journal table, maybe because he considers the latter more reliable than the former. This is expressed the following priority relationships on conflicting tuples: Journal(TKDE, 30, XML) ≻Author(John, TKDE) and Journal(TODS, 32, XML) ≻Author(John, TODS).

    In this case only D 2 is a global-optimal repair. Actually, D 2 is a global improvement over each of D 1, D 3 and D 4. For D 1, for example: \(\phantom {\dot {i}\!}D_{2} \smallsetminus D_{1}=\{Journal(\mathit {TODS}, 32, \mathit {XML})\}\) and \(\phantom {\dot {i}\!}D_{1} \smallsetminus D_{2}=\{\) Author(John,TODS) }. We can see that, for each tuple in \(\phantom {\dot {i}\!}D_{2} \smallsetminus D_{1}\), there is a tuple in \(\phantom {\dot {i}\!}D_{1} \smallsetminus D_{2}\) that has a higher priority. Therefore, D 2 is a global improvement on D 1. So, in this case R e p ≻,go(D,κ)={D 2}

    In this case, the uniqueness of the global-optimal repair is quite natural as the preference relation among conflicting tuples is a total relation. So, we know how to resolve every conflict according to the user preferences.

  2. (b)

    For a more subtle situation, assume the user has the priorities as before, but in addition he tends to believe that John has a paper in TODS. In this case we have only the relationship J o u r n a l(T K D E,30,X M L)≻ A u t h o r(J o h n,T K D E), and no preference for resolving the second conflict. Now both D 1 and D 2 are global-optimal repairs. That is, now \(\phantom {\dot {i}\!}{Rep}^{\succ ^{\prime }\!\!,{go}}(D,\kappa ) = \{D_{1}, D_{2}\}\).

7.2 Preferred causes from prioritized repairs

According to the motivation provided at the beginning of this section, we now define preferred causes on the basis of a class of prioritized repairs. (Compare (22) below with (1) and (2).) To keep things simple, we concentrate on single BCQs, \(\phantom {\dot {i}\!}\mathcal {Q}\), whose associated denial constraints are denoted by \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\).

Before providing technical details, we motivate the notion of preference in the context of causality. In this direction, first notice that under actual causality, we already make a difference -and only this difference- between endogenous and exogenous tuples. We can think of extending this priority relation among tuples in such a way that, for example, we prioritize -as causes- tuples in a given relation R, and we are not interested in tuples in another relation S. So, the user can specify a priority relation between the two relations, or different scores for these relations [46].

In Section 4.2 actual causes and their minimal contingency sets for a UBCQ were characterized as the minimal hitting-sets of the collection \(\phantom {\dot {i}\!}\mathcal {C}\) of minimal subsets of a database that entail the query. Those minimal hitting-sets are obtained by removing at least one tuple from each of the elements of \(\phantom {\dot {i}\!}\mathcal {C}\) (cf. Proposition 6). At this point, user preferences, or priorities, could be applied to tuples that belong to a same set \(\phantom {\dot {i}\!}\mathcal {C}\).

Definition 11

Given an instance D and a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), tuples t and t are jointly-contributing if tt , and there exists an S-minimal Λ⊆D such that \(\phantom {\dot {i}\!}{\Lambda } \models \mathcal {Q}\) and t,t ∈Λ.

Now we define priority relations on jointly-contributing tuples.

Definition 12

Given an instance D and a BCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), a binary relation ≻ c on D is a causal priority relation with respect to \(\phantom {\dot {i}\!}\mathcal {Q}\) if: (a) ≻ c is acyclic, and (b) for every t,t D, if t c t , then t and t are jointly-contributing tuples.

This definition introduces a natural notion of preference on causality. Actually, this way of approaching priorities on causes is in (inverse) correspondence with preference on repairs as based on priority relations on conflicting tuples. To see this, first observe that for a given instance D and BCQ \(\phantom {\dot {i}\!}\mathcal {Q}\): t and t are jointly-contributing tuples for \(\phantom {\dot {i}\!}\mathcal {Q}\) iff t and t are mutually conflicting tuples for \(\phantom {\dot {i}\!}\kappa (\mathcal {Q})\).

Next, in the context of prioritized repairs, a priority relation reflects a user preference on tuples that are preferred to be kept in the database. This is the inverse of causality, where a causal priority relation, as we defined it, reflects the tuples that are preferred to be (hypothetically or counterfactually) removed from database, to make them preferred causes.

In the following assume \(\phantom {\dot {i}\!}\succ _{\!c}^{r}\) is the inverse of a causal priority relation ≻ c . That is, \(\phantom {\dot {i}\!}t \succ _{\!c}^{r} t^{\prime }\) iff t c t. Clearly, \(\phantom {\dot {i}\!}\succ _{\!c}^{r}\) is acyclic, and can be imposed, with the expected result, on pairs of conflicting tuples. As a consequence, \(\phantom {\dot {i}\!}\succ _{\!c}^{r}\) can be used to define prioritized repairs.

Definition 13

Let D be an instance, \(\phantom {\dot {i}\!}\mathcal { Q}\) a BCQ, t a tuple in D, ≻ c a causal priority relation on D’s tuples.

  1. (a)
    $$\begin{array}{@{}rcl@{}} \ \textit{Diff}^{\;{\succ_{\!c}^r,\mathcal{ X}\!}}(D,\kappa(\mathcal{ Q}), t):=\{ D \smallsetminus D^{\prime}&|& D^{\prime} \in \textit{Rep}^{{\succ_{\!c}^r,\mathcal{ X}\!}}(D,\kappa(\mathcal{ Q})), \text{ and}\\&& t \in D\smallsetminus D^{\prime}\}. \end{array} $$
    (22)
  2. (b)

    tD is a \(\phantom {\dot {i}\!}({\succ _{\!c},\mathcal { X}\!})\) -preferred cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) iff \(\phantom {\dot {i}\!}\textit {Diff}^{\;{\succ _{\!c}^r,\mathcal { X}\!}}(D, \kappa (\mathcal { Q}), t) \not = \emptyset \).

Notice that every \(\phantom {\dot {i}\!}({\succ _{\!c},\mathcal { X}\!})\)-preferred cause is also an actual cause. This follows from Proposition 1 and the fact that prioritized repairs are also S-repairs.

Similarly to Proposition 2, for each \(\phantom {\dot {i}\!}{\Lambda } \in \textit {Diff}^{\;{\succ _{\!c}^r,\mathcal {X}\!}}(D,\kappa (\mathcal { Q}), t)\), it holds that t∈Λ, t is a \(\phantom {\dot {i}\!}({\succ _{\!c},\mathcal {X}\!})\)-preferred cause, and also an actual cause for \(\phantom {\dot {i}\!}\mathcal {Q}\) with S-minimal contingency set \(\phantom {\dot {i}\!}{\Lambda } \smallsetminus \{t\}\). In particular, t’s responsibility can be defined and computed as before, but now restricting its contingency sets to those of the form \(\phantom {\dot {i}\!}{\Lambda } \smallsetminus \{t\}\), with \({\Lambda } \in \textit {Diff}^{\;{\succ _{\!c}^r,\mathcal { X}\!}}(D,\kappa (\mathcal { Q}), t)\). In this way, a causal priority relation may affect the responsibility of a cause (with respect to the non-prioritized case).

Example 18

(example 17 cont.) The following BCQ query \(\phantom {\dot {i}\!}\mathcal { Q}\) is true in D:

$$\begin{array}{@{}rcl@{}} \exists \textit{JournalN} \ \exists \textit{Paper\#}(\textit{Author}(\textsf{John},&&\!\!\textit{JournalN}) \ \wedge\\&& {Journal}({JournalN},{Paper\#},\textsf{XML})); \end{array} $$

and its associated DC \(\phantom {\dot {i}\!}\kappa (\mathcal { Q})\) is κ in (21).

We want to obtain the preferred causes for \(\phantom {\dot {i}\!}\mathcal { Q}\) being, possibly unexpectedly, true in D, with the following preferences: (a) We prefer those among the Author tuples. (b) It is likely that John does have a paper in TODS. So, we prefer Author(John, TODS) not to be the cause.

These causal priorities are in inverse correspondence with those in the second case of Example 17(b) about priorities for repairs. That is, for our causal priority relation ≻ c here, its inverse \(\phantom {\dot {i}\!}\succ _{\!c}^{r}\) is ≻ in Example 17(b). There we had \(\phantom {\dot {i}\!}{Rep}^{\succ ^{\prime },{go}}(D,\kappa (Q)) = \{D_{1}, D_{2}\}\), which we can use to apply Definition 13.

We obtain as the globally-optimal causes, i.e. as (≻ c ,go)-causes: Author(John, TKDE), Author(TODS, 32, XML) and Author(John, TODS), all with the same responsibility, \(\phantom {\dot {i}\!}\frac {1}{2}\). □

Notice that Definition 13 can be easily extended to UBCQs. This is done, as earlier in this work, by considering the set Σ of denial constraints associated to a UBCQ. In the other direction, we recall that if we start with a set of DCs Σ, the corresponding UBCQ is denoted with V Σ.

As we did in the previous sections of this work, we could take advantage of algorithmic and complexity results about prioritized repairs [25, 59], to obtain complexity results for preferred causes problems. As an example, we establish the complexity of the minimal contingency set decision problem for (≻ c ,g o)-preferred causes. More precisely, for an instance D and a UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), the minimal preference-contingency set (decision) problem is about deciding if a set of tuples Γ is an S-minimal contingency set associated to a (≻ c ,g o)-preferred cause t.

Notation

\(\textit {Cont}^{{\succ _{\!c},\mathcal { X}\!}}(D,\mathcal { Q},t) := \{\Lambda \smallsetminus \{t\}~ |~ {\Lambda } \in \mathit {Diff}^{\;{\succ _{\!c}^r,\mathcal { X}\!}}(D,\kappa (\mathcal {Q}), t)\}\) is the class of all S-minimal contingency sets for a \(\phantom {\dot {i}\!}({\succ _{\!c},\mathcal { X}\!})\)-preferred cause t.

Definition 14

For a UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\), the minimal preference-contingency set decision problem is about membership of:

\(\phantom {\dot {i}\!}\mathcal {M}\mathcal {P}\mathcal {C}\mathcal {D}\mathcal {P}(\mathcal { Q}):=\{(D,\succ _{\!c}, t,{\Gamma })~ | ~t \in D, {\Gamma } \subseteq D, \text { and} \,\,{\Gamma } \in \textit {Cont}^{\succ _{\!c},go}(D,\mathcal { Q},t)\}\).

From Definition 13, there is a close connection between \(\phantom {\dot {i}\!}\mathcal { MPCDP}\) and the global-optimal repair checking problem, i.e. about deciding if an instance D is a (≻,g o)-repair of D with respect to a set of denial constraints. If we accept functional dependencies (FDs) among our denial constraints (and then, UBCQs that involve inequalities), the following result can be obtained from the NP-completeness of globally-optimal repair checking [59] for FDs.

Proposition 14

For a UBCQ \(\phantom {\dot {i}\!}\mathcal { Q}\) with inequalities, \(\phantom {\dot {i}\!}\mathcal { MPCDP}(\mathcal { Q})\) is NP-hard.

Proof 11

It is good enough to reduce globally-optimal repair checking to our contingency checking problem. So, consider an inconsistent instance D with respect to a set of denial constraint Σ, a priority relation for repairs ≻, and D D. To check if D R e p ≻,go(D,Σ) we can check, for an arbitrary element \(\phantom {\dot {i}\!}t \in D \smallsetminus D^{\prime }\), if \(\phantom {\dot {i}\!}(D,\succ ^{r}, t, D \smallsetminus (D^{\prime }\cup \{t\}))\in \mathcal {M}\mathcal {P}\mathcal {C}\mathcal {D}\mathcal {P}(V^{\!{\Sigma }})\). □

It is worth contrasting this result with the tractability result in Proposition 11 for the minimal contingency set decision problem (MCSDP) for actual causes. Notice that Proposition 11 still holds for UBCQs with inequality.

Notice that we could generalize the notion of preferred cause by appealing to any notion of repair. More precisely, if we have a repair semantics rSem (based on tuple deletions for DCs), we could replace \(\phantom {\dot {i}\!}{Rep}^{{\succ ,\mathcal {X}}}(D,\kappa (\mathcal {Q}))\) in (22) by \(\phantom {\dot {i}\!}{Rep}^{\textsf {S}}(D,\kappa (\mathcal {Q}))\). However, to obtain the intended results for causes, we have to be careful, as above, about a possible inverse relationship between preference on repairs and preference on causes.

7.3 Endogenous Repairs

The partition of a database into endogenous and exogenous tuples that is used in the causality setting may also be of interest in the context of repairs. Considering that we should have more control on endogenous tuples than on exogenous ones, which may come from external sources, it makes sense to consider endogenous repairs, which would be obtained by updates (of any kind) on endogenous tuples only. (Of course, a symmetric treatment of “exogenous” repairs is also possible; what is relevant here is the partition.)

For example, in the case of DCs, endogenous repairs would be obtained by deleting endogenous tuples only. More formally, given D = D nD x, possibly inconsistent with a set of DCs Σ, an endogenous repair D of D is a maximally consistent sub-instance of D with \(\phantom {\dot {i}\!}D \smallsetminus D^{\prime } \subseteq D^{n}\), i.e. D keeps all the exogenous tuples of D. If endogenous repairs form the class Srep n(D,Σ), it holds Srep n(D,Σ)⊆Srep(D,Σ).

Example 19

Consider D = D nD x, with D n={R(a 2,a 1),R(a 4,a 3),S(a 3), S(a 4)} and D x={R(a 3,a 3),S(a 2)}, and the DC κ : ¬∃x y(S(x)∧R(x,y)∧S(y)).

Here, Srep(D,κ) = {D 1, D 2, D 3}, with D 1={R(a 2,a 1),R(a 4,a 3), R(a 3,a 3), S(a 4),S(a 2)}, D 2={R(a 2,a 1),S(a 3),S(a 4),S(a 2)}, and D 3={R(a 2,a 1), R(a 4,a 3),S(a 3),S(a 2)}. The only endogenous S-repair is D 1.

In this section, without trying to be exhaustive or detailed, we consider the possibility of defining endogenous repairs on the basis of a suitable priority relation ≻ on tuples,Footnote 21 while at the same time taking advantage of the op optimality condition considered in Section 7.1.Footnote 22

First, if we assume that relation ≻, the extension of ≻, is such, that t t when tD x and t D n ( ≻ is ≻ if the latter already has this property), then it is easy to verify that every endogenous S-repair globally improves any non-endogenous S-repair. As a consequence, if there is an endogenous S-repair, then all the (≻,g o)-repairs are endogenous. Notice that the extension ≻ may destroy the acyclicity assumption on the priority relation, because we are starting from a given (acyclic) relation ≻, which we are now extending.

It might be the case that there is no endogenous S-repair, in which case non-endogenous S-repairs would not the improved by an endogenous one. So, if we want to prevent the existence of non-endogenous repairs, we can add an extra, dummy predicate D(⋅) to the schema, and the endogenous tuple D(d) to D. We modify every DC in Σ, say \(\phantom {\dot {i}\!}\kappa \!: \ \leftarrow \ C(\bar {x})\), by adding an extra, dummy condition: \(\phantom {\dot {i}\!}\kappa ^{d}\!: \ \leftarrow \ D(d), C(\bar {x})\), obtaining a set Σd of DCs. In this case, the S-repairs will be: \(\phantom {\dot {i}\!}D^{d}:=D \smallsetminus \{D(d)\}\), which is endogenous, and also all those S-repairs of D with respect to Σ (now each including D(d)). The latter are all non-endogenous. If we assume that t D(d), for every tD x, then every non-endogenous S-repair will be improved by D d, and will not be considered.

7.4 Null-based Causes

Consider an instance D={R(c 1,…,c n ),…} that may be inconsistent with respect to a set of DCs. The allowed repair updates are changes of attribute values by the constant null. We assume that null does not join with any other value, including null itself.

In order to keep track of changes, we may introduce numbers as first arguments in tuples, as global tuple identifiers (ids). So, D becomes D={R(1;c 1,…,c n ),…}. Assume that id(t) returns the id of the tuple tD. For example, id(R(1;c 1,…,c n ))=1.

If, by updating D into D in this way, the value of the ith attribute in R is changed to null, then the change is captured as the string R[1;i]. These strings are collected forming the set Diff null(D,D ). For example, if D={R(1;a,b),S(2;c,d),S(3;e,f)} is changed into D ={R(1;a,n u l l),S(2;n u l l,d), S(3;n u l l,n u l l)}, we have D i ff null(D,D )={R[1;2],S[2;1],S[3;1],S[3;2]}.

A null-repair of D with respect to a set of DCs Σ is a consistent instance D , such that D i ff null(D,D ) is minimal under set inclusion.Footnote 23 Rep null(D,Σ) denotes the class of null-based repairs of D with respect to Σ.

Example 20

(example 19 cont.) Consider the following inconsistent instance with respect to DC κ : ¬∃x y(S(x)∧R(x,y)∧S(y)):

D={R(1;a 2,a 1),R(2;a 3,a 3),R(3;a 4,a 3), S(4;a 2),S(5;a 3),S(6;a 4)}.

For simplicity, we do not make any difference between endogenous and exogenous tuples. Here, the class of null-based repairs, Rep null(D,κ), is formed by:

D 1={R(1;a 2,a 1),R(2;a 3,a 3),R(3;a 4,a 3),S(4;a 2),S(5;null),S(6;a 4)},

D 2={R(1;a 2,a 1),R(2;null,a 3),R(3;a 4,null),S(4;a 2),S(5;a 3),S(6;a 4)},

D 3={R(1;a 2,a 1),R(2;null,a 3),R(3;a 4,a 3),S(4;a 2),S(5;a 3),S(6;null)},

D 4={R(1;a 2,a 1),R(2;a 3,null),R(3;a 4,null),S(4;a 2),S(5;a 3),S(6;a 4)},

D 5={R(1;a 2,a 1),R(2;a 3,null),R(3;null,a 3),S(4;a 2),S(5;a 3),S(6;a 4)},

D 6={R(1;a 2,a 1),R(2;a 3,null),R(3;a 4,a 3),S(4;a 2),S(5;a 3),S(6;null)}.Here, Diff null(D,D 2)={R[2;1],R[3;2]}, and Diff null(D,D 3)={R[2;1], S[6;1]}.

According to the motivation provided at the beginning of this section, we can now define causes appealing to the class of null-based repairs of D. Since repair actions in this case, are attribute-value changes, causes can be defined at both the tuple and attribute levels. The same applies to the definition of responsibility (in this case generalizing Proposition 2).

Definition 15

For D an instance and \(\phantom {\dot {i}\!}\mathcal {Q}\) a BCQ, and tD be a tuple of the form R(i;c 1,…,c n ).

  1. (a)

    R[i;c j ] is a null-based attribute-value cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) if there is D Rep null(D,\(\phantom {\dot {i}\!}\kappa (\mathcal { Q}))\) with R[i;j]∈Diff null(D,D ).

    (That is, the value c j for attribute A j in the tuple is a cause if it is changed into a null in some repair.)

  2. (b)

    t is a null-based tuple cause for \(\phantom {\dot {i}\!}\mathcal { Q}\) if some R[i;c j ] is a null-based attribute-value cause for \(\phantom {\dot {i}\!}\mathcal {Q}\).

    (That is, the whole tuple is a cause if at least one of its attribute values is changed into a null in some repair.)

  3. (c)

    The responsibility, ρ tnull(t), of t, a null-based tuple cause for \(\phantom {\dot {i}\!}\mathcal {Q}\), is the inverse of \(\phantom {\dot {i}\!}\textit {min}\{|\textit {Diff}^{\;\textit {null}}(D,D^{\prime })| : R[i;j] \in \textit {Diff}^{\;\textit {null}}(D,D^{\prime }), \text { for some}\,\, j, \text { and} \,\,D^{\prime } \in {Rep}^{null}(D,\kappa (\mathcal {Q}))\}\).

  4. (d)

    The responsibility, ρ anull(R[i;c j ]), of R[i;c j ], a null-based attribute-value cause for \(\phantom {\dot {i}\!}\mathcal {Q}\), is the inverse of min{|Diff null(D,D )|:R[i;j]∈Diff null(D,\(\phantom {\dot {i}\!}D^{\prime }), \text {and} \,\,D^{\prime } \in {Rep}^{null}(D,\kappa (\mathcal {Q}))\}\).

In cases (c) and (d) we minimize over the number of changes in a repair that are made together with that of the candidate tuple/attribute-value to be a cause. In the case of a tuple cause, any change made in one of its attributes is considered in the minimization. For this reason, the minimum may be smaller than the one for a fixed attribute value change; and so the responsibility at the tuple level may be greater than that at the attribute level. More precisely, if t = R(i;c 1,…,c n )∈D, and R[i;c j ]) is a null-based attribute-value cause, then it holds ρ anull(R[i;c j ])≤ρ tnull(t).

Example 21

(ex. 20 cont.) Consider R(2;a 3,a 3)∈D. Its projection on its first (non-id) attribute, R[2;a 3], is an attribute-level cause since R[2;1]∈Diff null(D,D 2). Also R[2;1]∈Diff null(D,D 3).

Since |Diff null(D,D 2)|=|Diff null(D,D 3)|=2, it holds \(\phantom {\dot {i}\!}\rho ^{a-null}(R[2;1]) = \frac {1}{2}\).

Clearly R(2;a 3,a 3) is a null-based tuple cause for \(\phantom {\dot {i}\!}\mathcal { Q}\), with \(\phantom {\dot {i}\!}\rho ^{t-null}(t) = \frac {1}{2}\).

Notice that the definition of tuple-level responsibility, i.e. case (c) in Definition 15, does not take into account that a same id, i, may appear several times in a D i f f null(D,D ). In order to do so, we could redefine the size of the latter by taking into account those multiplicities. For example, if we decrease the size of the Diff by one with every repetition of the id, the responsibility for a cause may (only) increase, which makes sense.

8 Discussion and Conclusions

Our work opens interesting research directions, some of which are briefly discussed below. They are matter of ongoing and future research.

8.1 Endogenous Repairs

As discussed in Section 7, the partition of a database into endogenous and exogenous tuples may also be of interest in the context of repairs. We may prefer endogenous repairs that change (delete in this case) only endogenous tuples. However, if there are no endogenous tuples, a preference condition could be imposed on repairs, keeping those that change exogenous tuples the least. This is something to explore.

As a further extension, it could be possible to assume that combinations of (only) exogenous tuples never violate the integrity constraints, which could be checked at upload time. In this sense, there would be a part of the database that is considered to be consistent, while the other is subject to possible repairs. For somehow related research, see [31].

Going a bit further, we could even consider the relations in the database with an extra, binary attribute, N, that is used to annotate if a tuple is endogenous or exogenous (it could be both), e.g. a tuple like R(a,b,y e s). integrity constraints could be annotated too, e.g. the “exogenous” version of DC κ, could be κ E : ←P(x,y,yes),R(y,z,yes), and could be assumed to be satisfied.

8.2 Objections to Causality

Causality as introduced by Halpern and Pearl in [32, 33], aka. HP-causality, is the basis for the notion of causality in [47]. HP-causality has been the object of some criticism [35], which is justified in some (more complex, non-relational) settings, specially due to the presence of different kinds of logical variables (or lack thereof). In our context the objections do not apply: variables just say that a certain tuple belongs to the instance (or not); and for relational databases the closed-world assumption applies. In [35, 36], the definition of HP-causality is slightly modified. In our setting, this modified definition does not change actual causes or their properties.

8.3 Open Queries

We have limited our discussion to Boolean queries. It is possible to extend our work to consider conjunctive queries with free variables, e.g. \(\phantom {\dot {i}\!}\mathcal {Q}(x)\!: \exists yz(R(x,y) \wedge S(y,z))\). In this case, a query answer would be of the form 〈a〉, for a a constant, and causes would be found for such an answer. In this case, the associated DC would be of the form κ a : ←R(a,y),S(y,z), and the rest would be basically as above.

8.4 ASP Specification of Causes

S-repairs can be specified by means of answer set programs (ASPs) [3, 6], and C-repairs too, with the use of weak program constraints [3]. This should allow for the introduction of ASPs in the context of causality, for specification and reasoning. There are also ASP-based specifications of diagnosis [24] that could be brought into a more complete picture.

8.5 Causes and Functional Dependencies, and Beyond

Functional dependencies are DCs with conjunctive violation views with inequality, and are still monotonic. There is much research on repairs and consistent query answering for functional dependencies, and more complex integrity constraints [9]. In causality, mostly CQs without built-ins have been considered. The repair connection could be exploited to obtain more refined results for causality and CQs with inequality, and also other classes of queries, even non-monotonic ones, that correspond violation views for other kinds of integrity constraints. In a different, but related direction, causality for monotonic queries in the presence of integrity constraints has been investigated in [56].

8.6 View Updates and Abduction

Abduction [20, 23] is another form of model-based diagnosis, and is related to the subjects investigated in this work. The view update problem, about updating a database through views, is a classical problem in databases that has been treated through abduction [21, 37]. User knowledge imposed through view updates creates or reflects uncertainty about the base data, because alternative base instances may give an account of the intended view updates. The view update problem, specially in its particular form of deletion propagation, has been recently related in [41, 42] to causality as introduced in [47]. (Notice only tuple deletions are used with violation views and repairs associated to DCs.)

Database repairs are also related to the view update problem. Actually, answer set programs (ASP) for database repairs [6] implicity repair the database by updating intentional, annotated predicates (cf. Section 8.4). Even more, in [8], in order to protect sensitive information, databases are explicitly and virtually “repaired” through secrecy views that specify the information that has to be kept secret. These are prioritized repairs that have been specified via ASPs. Abduction has been explicitly applied to database repairs [5].

The deep interrelations between causality, abductive reasoning, view updates and repairs are the objects of our ongoing research efforts [10, 57].

To conclude, let us emphasize that in this research we have unveiled and formalized some first interesting relationships between causality in databases, database repairs, and consistency-based diagnosis. These connections allow us to apply results and techniques developed for each of them to the others. This is particularly beneficial for causality in databases, where still a limited number of results and techniques have been obtained or developed.

The connections we established here inspired complexity results for causality, e.g. Theorems 2 and 3, and were used to prove them. We appealed to several non-trivial results found in [43] (and the proofs thereof found in [44]) about repairs and CQA. It is also the case that the well-established hitting-set approach to diagnosis inspired a similar approach to causal responsibility, which in its turn allowed us to obtain results about its fixed-parameter tractability. It is also the case that diagnostic reasoning, as a form of non-monotonic reasoning, can provide a solid foundation for causality in databases and query answer explanation, in general [16, 17].

In ongoing research we have established connections between query answer causality, abductive diagnosis and database updates through views [57]. It is interesting that several of these areas of data management and knowledge representation, including those considered in this work, fall under what has been called “reverse data management” tasks [45]. Our work establishes formal connections between them and sets the ground for further investigation into their interrelationships.