1 Introduction

Functional dependencies (FDs) are database constraints initially devoted to database design [26]. Since then, they have been used for numerous tasks ranging from data cleaning [5] to data mining [28]. However, when dealing with real world data, FDs are also a simple yet powerful way to syntactically express background knowledge coming from domain experts [12]. More precisely, a FD \(X \rightarrow A\) between a set of attributes (or features) X and another attribute A depicts a function of the form \(f(X) = A\). In this context, asserting the existence of a function which determines A from X in a dataset amounts to testing the validity of \(X \rightarrow A\) in a relation, i.e. to checking that every pair of tuples that are equal on X are also equal on A. Unfortunately, this semantics of satisfaction suffers from two major drawbacks which make it inadequate to capture the complexity of real world data: (i) it must be checked on the whole dataset, and (ii) it uses equality.

Drawback (i) does not take into account data quality issues such as outliers, mismeasurements or mistakes, which should not impact the relevance of a FD in the data. To tackle this problem, it is customary to estimate the partial validity of a given FD with a coverage measure, rather than its total satisfaction. The most common of these measures is the \(g_3\)-error [8, 17, 21, 31], introduced by Kivinen and Mannila [22]. It is the minimum proportion of tuples to remove from a relation in order to satisfy a given FD. As shown for instance by Huhtala et al. [21], the \(g_3\)-error can be computed in polynomial time for a single (classical) FD.

As for drawback (ii), equality does not always adequately capture the closeness of two real-world values. It hides the imprecision and uncertainty inherent in every observation. To handle closeness (or difference) more appropriately, numerous works have replaced equality with binary predicates, as witnessed by recent surveys on relaxed FDs [6, 32].

However, while predicates extend FDs in a powerful and meaningful way with respect to real-world applications, they also make computations harder. In fact, contrary to strict equality, computing the \(g_3\)-error with binary predicates becomes NP-complete [12, 31]. In particular, it has been proven for differential [30], matching [11], metric [23], neighborhood [1], and comparable dependencies [31]. Still, there is no detailed analysis of what makes the \(g_3\)-error hard to compute when dropping equality for more flexible predicates. As a consequence, domain experts are left without any insight into which predicates they can use in order to estimate the validity of their background knowledge in their data quickly and efficiently.

This last problem constitutes the motivation for our contribution. In this work, we study the following question: which properties of predicates make the \(g_3\)-error easy to compute? To do so, we introduce binary predicates on each attribute of a relation scheme. Binary predicates take two values as input and return true or false depending on whether the values match a given comparison criterion. Predicates are a convenient framework to study the impact of common properties such as reflexivity, transitivity, symmetry, and antisymmetry (the properties of equality) on the hardness of computing the \(g_3\)-error. In this setting, we make the following contributions. First, we show that dropping reflexivity and antisymmetry does not make the \(g_3\)-error hard to compute. When removing transitivity, the problem becomes NP-complete. This result is intuitive as transitivity plays a crucial role in the computation of the \(g_3\)-error for dependencies based on similarity/distance relations [6, 32]. Second, we focus on symmetry. Symmetry has attracted less attention, despite its importance in partial orders and order FDs [10, 15, 27]. Even though symmetry seems to have less impact than transitivity in the computation of the \(g_3\)-error, we show that when it is removed the problem also becomes NP-complete. This result holds in particular for ordered dependencies.

Paper Organization. In Sect. 2, we recall some preliminary definitions. Section 3 is devoted to the usual \(g_3\)-error. In Sect. 4, we introduce predicates, along with definitions for the relaxed satisfaction of a functional dependency. Section 5 investigates the problem of computing the \(g_3\)-error when equality is replaced by predicates on each attribute. In Sect. 6 we relate our results with existing extensions of FDs. We conclude in Sect. 7 with some remarks and open questions for further research.

2 Preliminaries

All the objects we consider are finite. We begin with some definitions on graphs [2] and ordered sets [9]. A graph \(G\) is a pair \((V, E)\) where \(V\) is a set of vertices and \(E\) is a collection of pairs of vertices called edges. An edge of the form \((u, u)\) is called a loop. The graph \(G\) is directed if edges are ordered pairs of elements. Unless otherwise stated, we consider loopless undirected graphs. Let \(G= (V, E)\) be an undirected graph, and let \(V' \subseteq V\). The graph \(G[V'] = (V', E')\) with \(E' = \{(u, v) \in E\mid \{u, v\} \subseteq V' \}\) is the graph induced by \(V'\) with respect to \(G\). A path in \(G\) is a sequence \(e_1, \dots , e_m\) of pairwise distinct edges such that \(e_i\) and \(e_{i+1}\) share a common vertex for each \(1 \le i < m\). The length of a path is its number of edges. An independent set of \(G\) is a subset I of V such that no two vertices in I are connected by an edge of \(G\). An independent set is maximal if it is inclusion-wise maximal among all independent sets. It is maximum if it is an independent set of maximal cardinality. Dually, a clique of \(G\) is a subset K of \(V\) such that every pair of distinct vertices in K are connected by an edge of \(G\). A graph \(G\) is a co-graph if it has no induced subgraph corresponding to a path of length 3 (called \(P_4\)). A partially ordered set or poset is a pair \(P = (V, \le )\) where \(V\) is a set and \(\le \) a reflexive, transitive, and antisymmetric binary relation. The relation \(\le \) is called a partial order. If for every \(x, y \in V\), \(x \le y\) or \(y \le x\) holds, \(\le \) is a total order. A poset P is associated to a directed graph \(G(P) = (V, E)\) where \((u_i, u_j) \in E\) exactly when \(u_i \ne u_j\) and \(u_i \le u_j\). An undirected graph \(G= (V, E)\) is a comparability graph if its edges can be directed so that the resulting directed graph corresponds to a poset.

We move to terminology from database theory [24]. We use capital first letters of the alphabet (A, B, C, ...) to denote attributes and capital last letters (..., X, Y, Z) for attribute sets. Let U be a universe of attributes, and \(R\subseteq U\) a relation scheme. Each attribute A in \(R\) takes value in a domain \(\textsf {dom}(A)\). The domain of \(R\) is \(\textsf {dom}(R) = \bigcup _{A\in R} \textsf {dom}(A)\). Sometimes, especially in examples, we write a set as a concatenation of its elements (e.g. AB corresponds to \(\{A,B\}\)). A tuple over \(R\) is a mapping \(t :R\rightarrow \textsf {dom}(R)\) such that \(t(A) \in \textsf {dom}(A)\) for every \(A \in R\). The projection of a tuple t on a subset X of \(R\) is the restriction of t to X, written t[X]. We write t[A] as a shortcut for \(t[\{A\}]\). A relation r over \(R\) is a finite set of tuples over \(R\). A functional dependency (FD) over \(R\) is an expression \(X \rightarrow A\) where \(X \cup \{A\} \subseteq R\). Given a relation r over \(R\), we say that r satisfies \(X \rightarrow A\), denoted by \(r \models X \rightarrow A\), if for every pair of tuples \((t_1, t_2)\) of r, \(t_1[X] = t_2[X]\) implies \(t_1[A] = t_2[A]\). When r does not satisfy \(X \rightarrow A\), we write \(r \not \models X \rightarrow A\).
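As an illustration, the satisfaction test \(r \models X \rightarrow A\) can be sketched in Python as follows. The encoding of a relation as a list of dictionaries, the function name `satisfies`, and the toy relation are assumptions made for this sketch; they are not part of the formalism above.

```python
from itertools import combinations

def satisfies(r, X, A):
    """Return True if relation r satisfies the FD X -> A under equality."""
    for t1, t2 in combinations(r, 2):
        # A violating pair agrees on every attribute of X but differs on A.
        if all(t1[B] == t2[B] for B in X) and t1[A] != t2[A]:
            return False
    return True

# Hypothetical toy relation (not the relation of Fig. 1).
toy = [
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 2, "C": 0},
    {"A": 2, "B": 2, "C": 1},
]
print(satisfies(toy, ["C"], "A"))  # C -> A holds in toy
print(satisfies(toy, ["B"], "A"))  # B -> A fails: the last two tuples agree on B only
```

The test examines every pair of tuples, so it runs in time quadratic in \(\vert r \vert \).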

3 The \(g_3\)-error

This section introduces the \(g_3\)-error, along with its connection with independent sets in graphs through counterexamples and conflict-graphs [3].

Let r be a relation over \(R\) and \(X \rightarrow A\) a functional dependency. The \(g_3\)-error quantifies the degree to which \(X \rightarrow A\) holds in r. We write it as \(g_3(r, X \rightarrow A)\). It was introduced by Kivinen and Mannila [22], and it is frequently used to estimate the partial validity of a FD in a dataset [6, 8, 12, 21]. It is the minimum proportion of tuples to remove from r to satisfy \(X \rightarrow A\), or more formally:

Definition 1

Let \(R\) be a relation scheme, r a relation over \(R\) and \(X \rightarrow A\) a functional dependency over \(R\). The \(g_3\)-error of \(X \rightarrow A\) with respect to r, denoted by \(g_3(r, X \rightarrow A)\) is defined as:

$$ g_3(r, X \rightarrow A) = 1 - \frac{{ \textsf {max}}(\{ \vert s \vert \mid s \subseteq r, s \models X \rightarrow A \})}{\vert r \vert } $$

In particular, if \(r \models X \rightarrow A\), we have \(g_3(r, X \rightarrow A) = 0\). We refer to the problem of computing \(g_3(r, X \rightarrow A)\) as the error validation problem [6, 31]. Its decision version reads as follows:

  • Error Validation Problem (EVP)

  • Input:       A relation r over \(R\), a FD \(X \rightarrow A\), \(k \in \mathbb {R}\).

  • Question:  Is it true that \(g_3(r, X \rightarrow A) \le k\)?

It is known [6, 12] that there is a strong relationship between this problem and the task of computing the size of a maximum independent set in a graph:

  • Maximum Independent Set (MIS)

  • Input:       A graph \(G= (V, E)\), \(k \in \mathbb {N}\).

  • Question:  Does \(G\) have a maximal independent set I such that \(\vert I \vert \ge k\)?

To see the relationship between EVP and MIS, we need the notions of counterexample and conflict-graph [3, 12]. A counterexample to \(X \rightarrow A\) in r is a pair of tuples \((t_1, t_2)\) such that \(t_1[X] = t_2[X]\) but \(t_1[A] \ne t_2[A]\). The conflict-graph of \(X \rightarrow A\) with respect to r is the graph \(\textsf {CG}(r, X \rightarrow A) = (r, E)\) where a (possibly ordered) pair of tuples \((t_1, t_2)\) in r belongs to \(E\) when it is a counterexample to \(X \rightarrow A\) in r. An independent set of \(\textsf {CG}(r, X \rightarrow A)\) is precisely a subrelation of r which satisfies \(X \rightarrow A\). Therefore, computing \(g_3(r, X \rightarrow A)\) reduces to finding the size of a maximum independent set in \(\textsf {CG}(r, X \rightarrow A)\). More precisely, \(g_3(r, X \rightarrow A) = 1 - \frac{\vert I \vert }{\vert r \vert }\) where I is a maximum independent set of \(\textsf {CG}(r, X \rightarrow A)\).
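The correspondence between the \(g_3\)-error and independent sets can be sketched as follows. This is a minimal illustration under stated assumptions: tuples are dictionaries, vertices are tuple indices, the toy relation is hypothetical, and the independent set is found by exhaustive search, which is exponential and only meant to mirror the definition.

```python
from itertools import combinations

def conflict_graph(r, X, A):
    """Edges (i, j), i < j, between indices of tuples forming a counterexample."""
    return {
        (i, j)
        for i, j in combinations(range(len(r)), 2)
        if all(r[i][B] == r[j][B] for B in X) and r[i][A] != r[j][A]
    }

def max_independent_set_size(n, edges):
    """Exhaustive search over subsets of range(n), largest first."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            if not any(i in chosen and j in chosen for i, j in edges):
                return size
    return 0

def g3_via_conflict_graph(r, X, A):
    edges = conflict_graph(r, X, A)
    return 1 - max_independent_set_size(len(r), edges) / len(r)

# Hypothetical toy relation (not the relation of Fig. 1).
toy = [
    {"A": 1, "B": 0, "D": 0},
    {"A": 1, "B": 1, "D": 0},
    {"A": 2, "B": 2, "D": 0},
    {"A": 3, "B": 3, "D": 1},
]
print(conflict_graph(toy, ["D"], "A"))         # the third tuple conflicts with the first two
print(g3_via_conflict_graph(toy, ["D"], "A"))  # one tuple out of four must be removed
```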

Example 1

Consider the relation scheme \(R= \{A, B, C, D\}\) with \(\textsf {dom}(R) = \mathbb {N}\). Let r be the relation over \(R\) on the left of Fig. 1. It satisfies \(BC \rightarrow A\) but not \(D \rightarrow A\). Indeed, \((t_1, t_3)\) is a counterexample to \(D \rightarrow A\). The conflict-graph \(\textsf {CG}(r, D \rightarrow A)\) is given on the right of Fig. 1. For example, \(\{t_1, t_2, t_6\}\) is a maximum independent set of \(\textsf {CG}(r, D \rightarrow A)\). We obtain:

$$ g_3(r, D \rightarrow A) = 1 - \frac{\vert \{t_1, t_2, t_6\} \vert }{\vert r \vert } = 0.5 $$

In other words, we must remove half of the tuples of r in order to satisfy \(D \rightarrow A\).

Fig. 1. The relation r and the conflict-graph \(\textsf {CG}(r, D \rightarrow A)\) of Example 1.

However, MIS is an NP-complete problem [13] while computing \(g_3(r, X \rightarrow A)\) takes polynomial time in the size of r and \(X \rightarrow A\) [21]. This difference is due to the properties of equality, namely reflexivity, transitivity, symmetry and antisymmetry. They make \(\textsf {CG}(r, X \rightarrow A)\) a disjoint union of complete k-partite graphs, and hence a co-graph [12]. In this class of graphs, solving MIS is polynomial [14]. This observation suggests studying in greater detail the impact of such properties on the structure of conflict-graphs. First, we need to introduce predicates to relax equality, and to define a more general version of the error validation problem accordingly.
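To make the polynomial case concrete: under equality, each class of tuples agreeing on X induces a complete multipartite graph whose parts are the distinct values of A, so a maximum independent set keeps the largest part of each class. The following sketch applies this grouping idea. It is not claimed to be the algorithm of [21]; the function name, the dictionary encoding of tuples, and the toy relation are assumptions.

```python
from collections import Counter, defaultdict

def g3_equality(r, X, A):
    """g3-error of X -> A under equality, by grouping tuples on X."""
    groups = defaultdict(Counter)
    for t in r:
        # Count the A-values occurring within each X-class.
        groups[tuple(t[B] for B in X)][t[A]] += 1
    # In each X-class, keep only the tuples carrying the most frequent A-value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(r)

# Hypothetical toy relation (not the relation of Fig. 1).
toy = [
    {"A": 1, "B": 0, "D": 0},
    {"A": 1, "B": 1, "D": 0},
    {"A": 2, "B": 2, "D": 0},
    {"A": 3, "B": 3, "D": 1},
]
print(g3_equality(toy, ["D"], "A"))  # 3 of the 4 tuples can be kept
```

A single pass over r suffices, in contrast with the exhaustive search that a general conflict-graph would require.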

4 Predicates to Relax Equality

In this section, in line with previous research on extensions of functional dependencies [6, 32], we equip each attribute of a relation scheme with a binary predicate. We define the new \(g_3\)-error and the corresponding error validation problem.

Let \(R\) be a relation scheme. For each \(A \in R\), let \(\phi _A :\textsf {dom}(A) \times \textsf {dom}(A) \rightarrow \{\texttt {\small {true}}, \texttt {\small {false}}\}\) be a predicate. For instance, the predicate \(\phi _A\) can be equality, a distance, or a similarity relation. We assume that predicates are black-box oracles that can be computed in polynomial time in the size of their input.

Let \(\varPhi \) be a set of predicates, one for each attribute in \(R\). The pair \((R, \varPhi )\) is a relation scheme with predicates. In a relation scheme with predicates, relations and FDs are unchanged. However, the way a relation satisfies (or not) a FD can easily be adapted to \(\varPhi \).

Definition 2

(Satisfaction with predicates). Let \((R, \varPhi )\) be a relation scheme with predicates, r a relation and \(X \rightarrow A\) a functional dependency both over \((R, \varPhi )\). The relation r satisfies \(X \rightarrow A\) with respect to \(\varPhi \), denoted by \(r \models _{\varPhi } X \rightarrow A\), if for every pair of tuples \((t_1, t_2)\) of r, the following formula holds:

$$ \left( \bigwedge _{B \in X} \phi _B(t_1[B], t_2[B])\right) \implies \phi _A(t_1[A], t_2[A]) $$

A new version of the \(g_3\)-error adapted to \(\varPhi \) is presented in the following definition.

Definition 3

Let \((R, \varPhi )\) be a relation scheme with predicates, r be a relation over \((R, \varPhi )\) and \(X \rightarrow A\) a functional dependency over \((R, \varPhi )\). The \(g_3\)-error with predicates of \(X \rightarrow A\) with respect to r, denoted by \(g_3^{\varPhi }(r, X \rightarrow A)\) is defined as:

$$ g_3^{\varPhi }(r, X \rightarrow A) = 1 - \frac{{ \textsf {max}}(\{ \vert s \vert \mid s \subseteq r, s \models _\varPhi X \rightarrow A \})}{\vert r \vert } $$

From the definition of \(g_3^{\varPhi }(r, X \rightarrow A)\), we derive the extension of the error validation problem from equality to predicates:

  • Error Validation Problem with Predicates (EVPP)

  • Input:  A relation r over \((R, \varPhi )\), a FD \(X \rightarrow A\) over \(R\), \(k \in \mathbb {R}\).

  • Question:  Is it true that \(g_3^{\varPhi }(r, X \rightarrow A) \le k\)?

Observe that according to the definition of satisfaction with predicates (Definition 2), counterexamples and conflict-graphs remain well-defined. However, for a given predicate \(\phi _A\), \(\phi _A(x, y) = \phi _A(y, x)\) need not be true in general, meaning that we have to consider ordered pairs of tuples. That is, an ordered pair of tuples \((t_1, t_2)\) in r is a counterexample to \(X \rightarrow A\) if \(\bigwedge _{B \in X} \phi _B(t_1[B], t_2[B]) = \texttt {\small {true}}\) but \(\phi _A(t_1[A], t_2[A]) \ne \texttt {\small {true}}\).

We call \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\) the conflict-graph of \(X \rightarrow A\) in r. In general, \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\) is directed. It is undirected if the predicates of \(\varPhi \) are symmetric (see Sect. 5). In particular, computing \(g_3^{\varPhi }(r, X \rightarrow A)\) still amounts to finding the size of a maximum independent set in \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\).
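The definitions of this section can be sketched as follows. The sketch assumes that predicates are given as a dictionary `phi` mapping attribute names to Python functions, and that tuples are dictionaries; all names and the toy relations are hypothetical. Ordered pairs \((t, t)\) are tested as well, since predicates need not be reflexive. The search for a maximum independent set is exhaustive and therefore exponential.

```python
from itertools import combinations

def is_counterexample(t1, t2, X, A, phi):
    """The ordered pair (t1, t2) violates X -> A with respect to phi."""
    return all(phi[B](t1[B], t2[B]) for B in X) and not phi[A](t1[A], t2[A])

def directed_conflict_graph(r, X, A, phi):
    """Arcs (i, j) over tuple indices; i == j is possible without reflexivity."""
    n = len(r)
    return {(i, j) for i in range(n) for j in range(n)
            if is_counterexample(r[i], r[j], X, A, phi)}

def g3_predicates(r, X, A, phi):
    """g3-error with predicates, by exhaustive search over subrelations."""
    arcs = directed_conflict_graph(r, X, A, phi)
    n = len(r)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            if not any(i in chosen and j in chosen for i, j in arcs):
                return 1 - size / n
    return 1.0

near = lambda x, y: abs(x - y) <= 1   # reflexive and symmetric, not transitive
leq = lambda x, y: x <= y             # not symmetric
eq = lambda x, y: x == y

# Hypothetical toy relations (not the relation of Fig. 1).
toy = [{"C": 1, "A": 1}, {"C": 2, "A": 3}, {"C": 3, "A": 4}, {"C": 5, "A": 9}]
print(directed_conflict_graph(toy, ["C"], "A", {"C": near, "A": near}))
print(g3_predicates(toy, ["C"], "A", {"C": near, "A": near}))

pair = [{"C": 1, "A": 1}, {"C": 2, "A": 2}]
print(directed_conflict_graph(pair, ["C"], "A", {"C": leq, "A": eq}))  # a single arc
```

With the symmetric predicate `near`, arcs come in opposite pairs and the graph can be read as undirected; with `leq` on C, only one orientation of the pair is a counterexample.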

Example 2

We use the relation of Fig. 1. Let \(\varPhi = \{\phi _A, \phi _B, \phi _C, \phi _D\}\) be the collection of predicates defined as follows, for every \(x, y \in \mathbb {N}\):

  • \(\phi _A(x, y) = \phi _B(x, y) = \phi _C(x, y) = \texttt {\small {true}}\) if and only if \(\vert x - y \vert \le 1\). Thus, \(\phi _A\) is reflexive and symmetric but not transitive (see Sect.  5),

  • \(\phi _D\) is the equality.

The pair \((R, \varPhi )\) is a relation scheme with predicates. We have \(r \models _\varPhi AB \rightarrow D\) but \(r \not \models _\varPhi C \rightarrow A\). In Fig. 2, we depict \(\textsf {CG}_{\varPhi }(r, C \rightarrow A)\). A maximum independent set of this graph is \(\{t_1, t_2, t_3, t_5\}\). We deduce

$$ g_3^{\varPhi }(r, C \rightarrow A) = 1 - \frac{\vert \{t_1, t_2, t_3, t_5\} \vert }{\vert r \vert } = \frac{1}{3} $$
Fig. 2. The conflict-graph \(\textsf {CG}_{\varPhi }(r, C \rightarrow A)\) of Example 2.

Thus, there is also a strong relationship between EVPP and MIS, similar to the one between EVP and MIS. Nonetheless, unlike EVP, the problem EVPP is NP-complete [31]. In the next section, we study this gap of complexity between EVP and EVPP via different properties of predicates.

5 Predicates Properties in the \(g_3\)-error

In this section, we study properties of binary predicates that are commonly used to replace equality. We show how each of them affects the error validation problem.

First, we define the properties of interest in this paper. Let \((R, \varPhi )\) be a relation scheme with predicates. Let \(A \in R\) and \(\phi _A\) be the corresponding predicate. We consider the following properties:

(ref):

\(\phi _A(x, x) = \texttt {\small {true}}\) for all \(x \in \textsf {dom}(A)\) (reflexivity)

(tra):

for all \(x, y, z \in \textsf {dom}(A)\), \(\phi _A(x, y) = \phi _A(y, z) = \texttt {\small {true}}\) implies \(\phi _A(x, z) = \texttt {\small {true}}\) (transitivity)

(sym):

for all \(x, y \in \textsf {dom}(A)\), \(\phi _A(x, y) = \phi _A(y, x)\) (symmetry)

(asym):

for all \(x, y \in \textsf {dom}(A)\), \(\phi _A(x, y) = \phi _A(y, x) = \texttt {\small {true}}\) implies \(x = y\) (antisymmetry).

Note that symmetry and antisymmetry together imply transitivity, as \(\phi _A(x, y) = \texttt {\small {true}}\) entails \(x = y\).
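The four properties can be tested mechanically on a finite sample of a domain, as in the sketch below. The function name `properties` and the sample `range(5)` are assumptions. A test on a sample can refute a property, but it only confirms the property on that sample, not on the whole domain.

```python
from itertools import product

def properties(phi, domain):
    """Check (ref), (tra), (sym), (asym) for phi on a finite sample of values."""
    dom = list(domain)
    return {
        "ref": all(phi(x, x) for x in dom),
        "tra": all(phi(x, z) for x, y, z in product(dom, repeat=3)
                   if phi(x, y) and phi(y, z)),
        "sym": all(phi(x, y) == phi(y, x) for x, y in product(dom, repeat=2)),
        "asym": all(x == y for x, y in product(dom, repeat=2)
                    if phi(x, y) and phi(y, x)),
    }

sample = range(5)
print(properties(lambda x, y: x == y, sample))           # equality: all four hold
print(properties(lambda x, y: abs(x - y) <= 1, sample))  # closeness: (ref) and (sym) only
print(properties(lambda x, y: x <= y, sample))           # order: all but (sym)
```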

As a first step, we show that symmetry and transitivity are sufficient to make EVPP solvable in polynomial time. In fact, we prove that the resulting conflict-graph is a co-graph, as with equality.

Theorem 1

The problem EVPP can be solved in polynomial time if the predicates used on each attribute are transitive (tra) and symmetric (sym).

Proof

Let \((R, \varPhi )\) be a relation scheme with predicates. Let r be relation over \((R, \varPhi )\) and \(X \rightarrow A\) be a functional dependency, also over \((R, \varPhi )\). We assume that each predicate in \(\varPhi \) is transitive and symmetric. We show how to compute the size of a maximum independent set of \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\) in polynomial time.

As \(\phi _A\) is not necessarily reflexive, a tuple t in r can produce a counterexample \((t, t)\) to \(X \rightarrow A\). Indeed, it may happen that \(\phi _B(t[B], t[B]) = \texttt {true}\) for each \(B \in X\), but \(\phi _A(t[A], t[A]) = \texttt {false}\). However, it follows that t never belongs to a subrelation s of r satisfying \(s \models _\varPhi X \rightarrow A\). Thus, let \(r' = r \setminus \{t \in r \mid \{t\} \not \models _\varPhi X \rightarrow A\}\). Then, a subrelation of r satisfies \(X \rightarrow A\) if and only if it is an independent set of \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\) if and only if it is an independent set of \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\). Consequently, computing \(g_3^{\varPhi }(r, X \rightarrow A)\) is solving MIS in \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\).

We prove now that \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\) is a co-graph. Assume for contradiction that \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\) has an induced path P with 4 elements, say \(t_1, t_2, t_3, t_4\) with edges \((t_1, t_2)\), \((t_2, t_3)\) and \((t_3, t_4)\). Recall that edges of \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\) are counterexamples to \(X \rightarrow A\) in \(r'\). Hence, by symmetry and transitivity of the predicates of \(\varPhi \), we deduce that for each pair \((i, j)\) in \(\{1, 2, 3, 4\}\), \(\bigwedge _{B \in X} \phi _B(t_i[B], t_j[B]) = \texttt {\small {true}}\). Thus, we have \(\bigwedge _{B \in X} \phi _B(t_3[B], t_1[B]) = \bigwedge _{B \in X} \phi _B(t_1[B], t_4[B]) = \texttt {\small {true}}\). However, neither \((t_1, t_3)\) nor \((t_1, t_4)\) belong to \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\) since P is an induced path by assumption. Thus, \(\phi _A(t_3[A], t_1[A]) = \phi _A(t_1[A], t_4[A]) = \texttt {\small {true}}\) must hold. Nonetheless, the transitivity of \(\phi _A\) implies \(\phi _A(t_3[A], t_4[A]) = \texttt {\small {true}}\), a contradiction with \((t_3, t_4)\) being an edge of \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\). We deduce that \(\textsf {CG}_{\varPhi }(r', X \rightarrow A)\) cannot contain an induced \(P_4\), and that it is indeed a co-graph. As MIS can be solved in polynomial time for co-graphs [14], the theorem follows.   \(\square \)
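One standard way to solve MIS on a co-graph, sketched below, relies on the fact that a co-graph with at least two vertices is either disconnected or has a disconnected complement. In the first case the sizes of the components add up; in the second case an independent set lies within a single co-component. This is not necessarily the algorithm of [14], and it is written for clarity rather than efficiency. The function names are assumptions, vertices are tuple indices, and edges are undirected pairs of the conflict-graph of \(r'\).

```python
def components(vertices, adjacent):
    """Connected components of the graph on `vertices` under the test `adjacent`."""
    remaining, comps = set(vertices), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            neighbours = {v for v in remaining if adjacent(u, v)}
            remaining -= neighbours
            comp |= neighbours
            stack.extend(neighbours)
        comps.append(comp)
    return comps

def cograph_mis(vertices, edges):
    """Size of a maximum independent set, assuming the graph is a co-graph."""
    vertices = set(vertices)
    if len(vertices) <= 1:
        return len(vertices)
    adj = lambda u, v: (u, v) in edges or (v, u) in edges
    parts = components(vertices, adj)
    if len(parts) > 1:      # disjoint union: the sizes add up
        return sum(cograph_mis(p, edges) for p in parts)
    co_parts = components(vertices, lambda u, v: u != v and not adj(u, v))
    if len(co_parts) > 1:   # join: an independent set lies within one part
        return max(cograph_mis(p, edges) for p in co_parts)
    raise ValueError("not a co-graph: the graph contains an induced P4")

# Hypothetical conflict-graph on four tuples: vertex 2 conflicts with 0 and 1.
print(cograph_mis(range(4), {(0, 2), (1, 2)}))
```

On an input that is not a co-graph, such as a path on four vertices, the recursion reaches a subgraph that is both connected and co-connected and raises an error instead of returning a wrong answer.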

One may encounter non-reflexive predicates when dealing with strict orders or with binary predicates derived from SQL equality. In the 3-valued logic of SQL, comparing the null value with itself evaluates to unknown rather than true, so the comparison is not satisfied. In this regard, it could be natural for domain experts to use a predicate which is transitive, symmetric, and reflexive everywhere except on the null value. This would make it possible to handle missing information without altering the data.

The previous proof heavily makes use of transitivity, which has a strong impact on the edges belonging to the conflict-graph. Intuitively, conflict-graphs can become much more complex when transitivity is dropped. Indeed, we prove an intuitive case: when predicates are not required to be transitive, EVPP becomes intractable.

Theorem 2

The problem EVPP is NP-complete even when the predicates used on each attribute are symmetric (sym) and reflexive (ref).

The proof is omitted due to space limitations; it can be found in [33]. It is a reduction from the problem (dual to MIS) of finding the size of a maximum clique in general graphs. It uses arguments similar to the proof of Song et al. [31] showing the NP-completeness of EVPP for comparable dependencies.

We turn our attention to the case where symmetry is dropped from the predicates. In this context, conflict-graphs are directed. Indeed, an ordered pair of tuples \((t_1, t_2)\) may be a counterexample to a functional dependency, but not \((t_2, t_1)\). Yet, transitivity still contributes to constraining the structure of conflict-graphs, as suggested by the following example.

Example 3

We consider the relation of Example 1. We equip ABCD with the following predicates:

  • \(\phi _C(x, y) = \texttt {\small {true}}\) if and only if \(x \le y\)

  • \(\phi _A(x, y)\) is defined by

    $$ \phi _A(x, y) = {\left\{ \begin{array}{ll} \texttt {\small {true}} &{} \text {if } x = y \\ \texttt {\small {true}} &{} \text {if } x = 1 \text { and } y \in \{2, 4\} \\ \texttt {\small {true}} &{} \text {if } x = 3 \text { and } y = 4 \\ \texttt {\small {false}} &{} \text {otherwise.} \end{array}\right. } $$
  • \(\phi _B\) and \(\phi _D\) are the equality.

Let \(\varPhi = \{\phi _A, \phi _B, \phi _C, \phi _D\}\). The conflict-graph \(\textsf {CG}_{\varPhi }(r, C \rightarrow A)\) is represented in Fig. 3. Since \(\phi _C\) is transitive, we have \(\phi _C(t_3[C], t_j[C]) = \texttt {\small {true}}\) for each tuple \(t_j\) of r. Moreover, \(\phi _A(t_3[A], t_6[A]) = \texttt {\small {false}}\) since \((t_3, t_6)\) is a counterexample to \(C \rightarrow A\). Therefore, the transitivity of \(\phi _A\) implies either \(\phi _A(t_3[A], t_4[A]) = \texttt {\small {false}}\) or \(\phi _A(t_4[A], t_6[A]) = \texttt {\small {false}}\). Hence, at least one of \((t_3, t_4)\) and \((t_4, t_6)\) must be a counterexample to \(C \rightarrow A\) too. In the example, this is \((t_3, t_4)\).

Fig. 3. The conflict-graph \(\textsf {CG}_{\varPhi }(r, C \rightarrow A)\) of Example 3.

Nevertheless, if transitivity constrains the complexity of the graph, dropping symmetry still allows new kinds of graph structures. Indeed, in the presence of symmetry, a conflict-graph cannot contain induced paths with more than 3 elements because of transitivity. However, such paths may exist when symmetry is removed.

Example 4

In the previous example, the tuples \(t_2, t_4, t_5, t_6\) form an induced \(P_4\) of the underlying undirected graph of \(\textsf {CG}_{\varPhi }(r, C \rightarrow A)\), even though \(\phi _A\) and \(\phi _C\) enjoy transitivity.

Therefore, we are left with the following intriguing question: can the loss of symmetry be used to break transitivity, and offer conflict-graphs a structure sufficiently complex to make EVPP intractable? The next theorem answers this question affirmatively.

Theorem 3

The problem EVPP is NP-complete even when the predicates used on each attribute are transitive (tra), reflexive (ref), and antisymmetric (asym).

The proof is omitted due to space limitations. It is given in [33]. It is a reduction from MIS in 2-subdivision graphs [29].

Theorems 1, 2 and 3 characterize the complexity of EVPP for each combination of predicate properties. In the next section, we discuss the granularity of these results, and we use them as a framework to compare the complexity of EVPP for some known extensions of functional dependencies.

6 Discussions

Replacing equality with various predicates to extend the semantics of classical functional dependencies is frequent [6, 32]. Our approach offers to compare these extensions on EVPP within a unifying framework based on the properties of the predicates they use. We can summarize our results with the hierarchy of classes of predicates given in Fig. 4.

Fig. 4. Complexity of EVPP with respect to the properties of predicates.

Regarding the computation of the \(g_3\)-error, most existing works have focused on similarity/distance predicates. First, the \(g_3\)-error can be computed in polynomial time for classical functional dependencies [20]. Then, Song et al. [31] show that EVPP is NP-complete for a broad range of extensions of FDs whose predicates happen to be reflexive (ref) and symmetric (sym), which is consistent with Theorem 2. However, they do not study predicate properties as we do in this paper. More precisely, they identify the hardness of EVPP for differential [30], matching [11], metric [23], neighborhood [1], and comparable dependencies [31]. For some of these dependencies, predicates may be defined over sets of attributes. Using one predicate per attribute and taking their conjunction is a particular case of predicate on attribute sets.

Some extensions of FDs use partial orders as predicates. This is the case of ordered dependencies [10, 15], ordered FDs [27], and also of some sequential dependencies [16] and denial constraints [4] for instance. To our knowledge, the role of symmetry in EVPP has received little attention. For sequential dependencies [16], a measure different from the \(g_3\)-error has been used. The predicates of Theorem 3 are reflexive, transitive and antisymmetric. Hence they are partial orders. Consequently, the FDs in this context are ordered functional dependencies as defined by Ng [27]. We obtain the following corollary:

Corollary 1

EVPP is NP-complete for ordered functional dependencies.

Ordered functional dependencies are a restricted case of ordered dependencies [15], sequential dependencies [16], and denial constraints [4] (see [32]). The hardness of computing the \(g_3\)-error for these dependencies follows from Corollary 1.

The hierarchy depicts quite accurately the current knowledge about EVPP and the delimitation between tractable and intractable cases. However, this analysis may require further refinements. Indeed, there may be particular types of FDs with predicates where EVPP is solvable in polynomial time, even though their predicates belong to a class for which the problem is NP-complete. For instance, assume that each attribute A in \(R\) is equipped with a total order \(\phi _A\). We show in Proposition 1 and Corollary 2 that in this case, EVPP can be solved in polynomial time, even though the predicates are reflexive, transitive and antisymmetric.

Proposition 1

Let \((R, \varPhi )\) be a relation scheme with predicates. Then, EVPP can be solved in polynomial time for a given FD \(X \rightarrow A\) if \(\phi _B\) is transitive for each \(B \in X\) and \(\phi _A\) is a total order.

Proof

Let \((R, \varPhi )\) be a relation scheme with predicates and \(X \rightarrow A\) a functional dependency. Assume that \(\phi _B\) is transitive for each \(B \in X\) and that \(\phi _A\) is a total order. Let r be a relation over \((R, \varPhi )\). Let \(G= (r, E)\) be the undirected graph underlying \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\), that is, \((t_i, t_j) \in E\) if and only if \((t_i, t_j)\) or \((t_j, t_i)\) is an edge of \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\).

We show that \(G\) is a comparability graph. To do so, we associate the following predicate \(\le \) to \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\): for each pair \(t_i, t_j\) of tuples of r, \(t_i \le t_i\) and \(t_i \le t_j\) if \((t_i, t_j)\) is a counterexample to \(X \rightarrow A\). We show that \(\le \) is a partial order:

  • reflexivity. It follows by definition.

  • antisymmetry. We argue by contraposition. Let \(t_i, t_j\) be two distinct tuples of r and assume that \((t_i, t_j)\) belongs to \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\). We need to prove that \((t_j, t_i)\) does not belong to \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\), i.e. it is not a counterexample to \(X \rightarrow A\). First, \((t_i, t_j) \in \textsf {CG}_{\varPhi }(r, X \rightarrow A)\) implies that \(\phi _A(t_i[A], t_j[A]) = \texttt {\small {false}}\). Then, since \(\phi _A\) is a total order, \(\phi _A(t_j[A], t_i[A]) = \texttt {\small {true}}\). Consequently, \((t_j, t_i)\) cannot belong to \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\) and \(\le \) is antisymmetric.

  • transitivity. Let \(t_i, t_j, t_k\) be tuples of r such that \((t_i, t_j)\) and \((t_j, t_k)\) are in \(\textsf {CG}_{\varPhi }(r, X \rightarrow A)\). Applying transitivity, we have that \(\bigwedge _{B \in X} \phi _B(t_i[B], t_k[B]) = \texttt {\small {true}}\). We show that \(\phi _A(t_i[A], t_k[A]) = \texttt {\small {false}}\). Since \((t_i, t_j)\) is a counterexample to \(X \rightarrow A\), we have \(\phi _A(t_i[A], t_j[A]) = \texttt {\small {false}}\). As \(\phi _A\) is a total order, we deduce that \(\phi _A(t_j[A], t_i[A]) = \texttt {\small {true}}\). Similarly, we obtain \(\phi _A(t_k[A], t_j[A]) = \texttt {\small {true}}\). As \(\phi _A\) is transitive, we derive \(\phi _A(t_k[A], t_i[A]) = \texttt {\small {true}}\). Now assume for contradiction that \(\phi _A(t_i[A], t_k[A]) = \texttt {\small {true}}\). Since \(\phi _A(t_k[A], t_j[A]) = \texttt {\small {true}}\), we derive \(\phi _A(t_i[A], t_j[A]) = \texttt {\small {true}}\) by transitivity of \(\phi _A\), a contradiction. Therefore, \(\phi _A(t_i[A], t_k[A]) = \texttt {\small {false}}\). Using the fact that \(\bigwedge _{B \in X} \phi _B(t_i[B], t_k[B]) = \texttt {\small {true}}\), we conclude that \((t_i, t_k)\) is also a counterexample to \(X \rightarrow A\). The transitivity of \(\le \) follows.

Consequently, \(\le \) is a partial order and \(G\) is indeed a comparability graph. Since MIS can be solved in polynomial time for comparability graphs [18], the result follows.   \(\square \)
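For illustration, the polynomial case of Proposition 1 can be sketched as follows. An independent set of \(G\) is an antichain of the partial order \(\le \) built in the proof, so by Dilworth's theorem its maximum size equals the number of tuples minus a maximum matching in the bipartite graph of the strict order. This technique is swapped in for the sketch and is not necessarily the method of [18]. The function names are assumptions, and every attribute is compared with Python's `<=` on numbers, which is one particular choice of total order.

```python
def max_antichain_size(n, less):
    """Size of a largest antichain of a strict partial order on range(n).

    `less(i, j)` must be irreflexive and transitive. By Dilworth's theorem,
    the answer is n minus a maximum matching in {(i, j) | less(i, j)}.
    """
    match_of_right = [None] * n

    def augment(i, seen):
        # Augmenting-path search (Kuhn's algorithm) from left vertex i.
        for j in range(n):
            if less(i, j) and j not in seen:
                seen.add(j)
                if match_of_right[j] is None or augment(match_of_right[j], seen):
                    match_of_right[j] = i
                    return True
        return False

    matching = sum(augment(i, set()) for i in range(n))
    return n - matching

def g3_total_orders(r, X, A):
    """g3-error of X -> A when every attribute is compared with <=."""
    def counterexample(i, j):
        return (i != j and all(r[i][B] <= r[j][B] for B in X)
                and not r[i][A] <= r[j][A])
    return 1 - max_antichain_size(len(r), counterexample) / len(r)

# Hypothetical toy relation: only the second and third tuples conflict.
toy = [{"C": 1, "A": 1}, {"C": 2, "A": 3}, {"C": 3, "A": 2}, {"C": 4, "A": 4}]
print(g3_total_orders(toy, ["C"], "A"))
```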

We can deduce the following corollary on total orders, that can be used for ordered dependencies.

Corollary 2

Let \((R, \varPhi )\) be a relation scheme with predicates. Then, EVPP can be solved in polynomial time if each predicate in \(\varPhi \) is a total order.

In particular, Golab et al. [16] proposed a polynomial-time algorithm for a variant of \(g_3\) applied to a restricted type of sequential dependencies using total orders on each attribute.

7 Conclusion and Future Work

In this work, we have studied the complexity of computing the \(g_3\)-error when equality is replaced by more general predicates. We studied four common properties of binary predicates: reflexivity, symmetry, transitivity, and antisymmetry. We have shown that when symmetry and transitivity are taken together, the \(g_3\)-error can be computed in polynomial time. Transitivity strongly impacts the structure of the conflict-graph of the counterexamples to a functional dependency in a relation. Thus, it comes as no surprise that dropping transitivity makes the \(g_3\)-error hard to compute. More surprisingly, removing symmetry instead of transitivity leads to the same conclusion. This is because deleting symmetry makes the conflict-graph directed. In this case, the orientation of the edges weakens the impact of transitivity, thus allowing the conflict-graph to be complex enough to make the \(g_3\)-error computation problem intractable.

We believe our approach sheds new light on the problem of computing the \(g_3\)-error, and that it is suitable for estimating the complexity of this problem when defining new types of FDs, by looking at the properties of predicates used to compare values.

We now highlight some research directions for future work. In a recent paper [25], Livshits et al. study the problem of computing optimal repairs in a relation with respect to a set of functional dependencies. A repair is a collection of tuples which does not violate a prescribed set of FDs. It is optimal if it is of maximal size among all possible repairs. Hence, there is a strong connection between the problem of computing repairs and computing the \(g_3\)-error with respect to a collection of FDs. In their work, the authors give a dichotomy between tractable and intractable cases based on the structure of FDs. In particular, they use previous results from Gribkoff et al. [19] to show that the problem is already NP-complete for 2 FDs in general. In the case where computing an optimal repair can be done in polynomial time, it would be interesting to use our approach and relax equality with predicates in order to study the tractability of computing the \(g_3\)-error on a collection of FDs with relaxed equality.

From a practical point of view, the exact computation of the \(g_3\)-error is extremely expensive in large datasets. Recent works [7, 12] have proposed to use approximation algorithms to compute the \(g_3\)-error both for equality and predicates. It could be of interest to identify properties or classes of predicates where more efficient algorithms can be adopted. It is also possible to extend the existing algorithms calculating the classical \(g_3\)-error (see e.g. [21]). They use the projection to identify equivalence classes among values of A and X. However, when dropping transitivity (for instance in similarity predicates), separating the values of a relation into “similar classes” requires devising a new projection operation, a seemingly tough but fascinating problem to investigate.