Keywords

1 Introduction

Databases represent information about some domain of the real world. For this purpose, data dependencies provide the main mechanism for enforcing the semantics of the given application domain within a database system. As such, data dependencies are essential for most data management tasks, including database design, query and update processing, as well as data cleaning, exchange, and integration. The usability of a class \(\mathcal {C}\) of data dependencies for these tasks depends critically on the computational properties of its associated implication problem. The implication problem for \(\mathcal {C}\) is to decide whether for a given finite set \(\varSigma \cup \{\varphi \}\) of data dependencies from \(\mathcal {C}\), \(\varSigma \) implies \(\varphi \), i.e. whether every database that satisfies all the elements of \(\varSigma \) also satisfies \(\varphi \). If we require databases to be finite, then we speak of the finite implication problem, and otherwise of the unrestricted implication problem. While the importance of data dependencies continues to hold for new data models, the focus of this article is on the implication problems for important classes of data dependencies in the relational model of data. In this context, data dependency theory is deep and rich [37]. Our submission is from the area of database theory, on which DASFAA’s call for paper has solicited original contributions.

Functional and inclusion dependencies constitute the most commonly used classes of data dependencies in practice. In particular, functional dependencies (FDs) are more expressive than keys, and inclusion dependencies (INDs) are more expressive than foreign keys, thereby capturing Codd’s principles of entity and referential integrity, respectively, on the logical level. An FD \(R:X\rightarrow Y\) with attribute subsets XY on relation schema R expresses that the values on attributes in Y are uniquely determined by the values on attributes in X. In particular, \(R:X\rightarrow R\) expresses that X is a key for R. An inclusion dependency (IND) \(R[A_1,\ldots ,A_n]\subseteq R'[B_1,\ldots ,B_n]\), with attribute sequences \(A_1,\ldots ,A_n\) on R and \(B_1,\ldots ,B_n\) on \(R'\), expresses that for each tuple t over R there is some tuple \(t'\) over \(R'\) such that for all \(i=1,\ldots ,n\), \(t(A_i)=t'(B_i)\) holds. If \(n=1\) we call the IND unary (UIND).

A fundamental result in dependency theory is that the unrestricted and finite implication problems for the combined class of FDs and INDs differ and each is undecidable [10, 32, 33]. Interestingly, for the expressive sub-class of FDs and UINDs, the unrestricted and finite implication problems still differ but each are axiomatizable and decidable in low-degree polynomial time [12].

Another important expressive class of data dependencies are embedded multivalued dependencies (EMVDs). An EMVD \(R:X\rightarrow Y\bot Z\) with attribute subsets XYZ of R expresses that the projection r[XYZ] of a relation r over R on the set union XYZ is the join \(r[XY]\bowtie r[XZ]\) of its projections on XY and XZ. Another fundamental result in dependency theory is that the unrestricted and finite implication problems for EMVDs differ, each is not finitely axiomatizable [36] and each is undecidable [23, 24]. An important fragment of EMVDs are multivalued dependencies (MVDs), which are a class of full dependencies in which XYZ covers the full underlying set R of attributes. In fact, MVDs are the basis for Fagin’s fourth normal form [13]. For the combined class of FDs, MVDs, and UINDs, finite implication is axiomatizable and decidable in cubic time, while unrestricted implication is axiomatizable and decidable in almost linear time [12, 26].

Independence atoms (IAs) constitute an expressive subclass of EMVDs and FDs. An IA \(X\bot Y\) with attribute subsets XY of R expresses that \(X\cap Y\) is constant (i.e., the FD \(\emptyset \rightarrow X\cap Y\) holds) and that the EMVD \(\emptyset \rightarrow X\setminus Y \bot Y\setminus X\) holds. The latter expresses that the projection of a relation r on XY equals the cartesian product of its projections on X and Y, i.e., \(r[XY]=r[X]\times r[Y]\). For disjoint X and Y, the independence atoms \(X\bot Y\) thus form a subclass of EMVDs. For the class of IAs, the finite and unrestricted implication problems coincide, they are finitely axiomatizable and decidable in low-degree polynomial time [27].

Given the usefulness of EMVDs, FDs, and INDs for data management, given their computational barriers, and given the attractiveness of IAs as a tractable fragment of EMVDs, it is a natural question to ask how IAs, FDs, and INDs interact. Our article helps address the current gap in the existing rich theory of relational data dependencies. Adding further to the challenge it is important to note that IAs still form an embedded fragment of EMVDs, in contrast to MVDs which are a class of full dependencies. Somewhat surprisingly, already the interaction of IAs with just keys is intricate [19, 20]. For example, unrestricted implication is finitely axiomatizable but finite implication is not for keys and unary IAs (those with singleton attribute sets), while the finite and unrestricted implication problems coincide and enjoy a finite axiomatization for IAs and unary keys (those with a singleton attribute set). In contrast, the extension of INDs with IAs, although being more expressive than the class of INDs alone, does not add further complexity to the latter. For INDs and IAs taken together both implication problems still coincide and are finitely axiomatizable, \(\mathsf {PSPACE} \)-complete, and fixed-parameter tractable in their arity [9, 21].

Examples. We few examples will illustrate how knowledge about IAs advances data management. FDs and INDs do not require further motivation but the more we know about the interaction of IAs with FDs and INDs, the more we can advance data management.

Our first example is query processing. In particular, we show how the validity of independence atoms is intrinsically linked to the optimization of the famous division operator. The operator \(\pi _{XY}(R)\div \pi _Y(R)\) returns all those X-values x such that for every Y-value y there is some tuple t with \(t(X)=x\) and \(t(Y)=y\) [11]. The ability of the division operator to express universal quantification makes it very powerful for expressing natural queries. The following result establishes the intrinsic link.

Theorem 1

For all relations r over R, \(\pi _{XY}(R)(r)\div \pi _Y(R)(r)=\pi _X(R)(r)\) if and only if r satisfies \(X\bot Y\).

Proof

The division operator is defined as follows:

$$\begin{aligned} \pi _{XY}(R)(r)\div \pi _Y(R)(r)=\pi _X(R)(r)- \pi _X((\pi _X(R)(r)\times \pi _Y(R)(r))-\pi _{XY}(R)(r)), \end{aligned}$$

and r satisfies \(X\bot Y\) if and only if \(\pi _X(R)(r)\times \pi _Y(R)(r)=\pi _{XY}(R)(r)\). The result follows directly.

In particular, the validity of an IA reduces the quadratic complexity of the division operator to a linear complexity of a simple projection [28]. The reduction in complexity also applies to the expression complexity of a query. Suppose we would like to return those entities x that occur together with all entities y (for example, suppliers that supply all products), then we need to express the division operator \(\pi _{X,Y}(R)\div \pi _{Y}(R)\) in SQL by double-negation as in:

$$\begin{aligned} \begin{array}{l} \text {SELECT}\ R0.X\ \text {FROM}\ R\ \text {AS}\ R0 \\ \text {WHERE\ NOT\ EXISTS} \\ \quad \text {SELECT}\ *\ \text {FROM}\ R\ \text {AS}\ R1 \\ \quad \text {WHERE\ NOT\ EXISTS} \\ \quad \quad \,\, \text {SELECT}\ *\ \text {FROM}\ R\ \text {AS}\ R2 \\ \quad \quad \,\, \text {WHERE}\ R2.X\, =\, R1.X\ \text {AND}\\ \qquad \qquad \quad \,\,\, \, R2.Y\, =\, R0.Y\;; \\ \end{array} \end{aligned}$$

where R.X is short for \(\bigwedge _{A\in X}R.A\), and R2.X = R1.X (R2.Y = R0.Y) for \(\bigwedge _{A\in X} R2.A=R1.A\) (\(\bigwedge _{B\in Y} R2.B=R1.B\)). However, if a query optimizer can notice that the IA \(X\bot Y\) is implied by the enforced set \(\varSigma \) of constraints, then the query can be rewritten into

$$\begin{aligned} \begin{array}{l} \text {SELECT}\ X \\ \text {FROM}\ R\,; \end{array} \end{aligned}$$

Our second example is database security. More specifically, the aim of inference control is to protect private data under inferences that clever attacks may use to circumvent access limitations [7]. For example, the combination of a particular patient name (say Jack) together with a particular medical examination (say angiogram) may be considered a secret, while access to the patient name and access to the medical examination in isolation may not be a secret. However, in some given context such as a procedure to diagnose some condition, all patients may need to undergo all examinations. That is, the information about the patient is independent of the information about the examination. Now, if the secret (Jack, angiogram) must not be revealed to an unauthorized user that can query the data source, then this user must not learn both: that Jack is a patient undergoing the diagnosis of the condition, and that angiogram is a medical examination that is part of the process for diagnosing the condition. Being able to understand the interaction of independence atoms with other database constraints can therefore help us to protect secrets under clever inference attacks.

Our final example is data profiling. Here we would like to demonstrate that independence atoms do occur in real-world data sets. For that purpose, we have mined some well-known publicly available data sets that have been used for the mining of other classes of data dependencies before [34]. We report the basic characteristics of these data sets in the form of their numbers of rows and columns, and list the number of maximal IAs and the maximum arity of those found. Here, an IA \(X\bot Y\) is maximal in a given set of IAs if there is no other IA \(V\bot W\) in the set such that \(V\subseteq X\) and \(W\subseteq Y\) holds. The arity of an IA is defined as the total number of attribute occurrences.

Data set

Number of columns

Number of rows

Number of IAs

Maximum arity

Bridges

13

108

4

3

Echocardiogram

13

132

5

4

Adult

14

48,842

9

3

Hepatitis

20

155

855

6

Horse

27

368

112

3

Table 1. Subclasses of FD+IND+IA. We write “ui” and “fi” for unrestricted and finite implication, respectively.

It should be stressed that the usefulness of these IAs is not restricted to those that are semantically meaningful. For example, the optimizations for the division operator also apply to IAs that “accidentally” hold on a given data set.

1.1 Contributions

In this article we make the following contributions.

  1. (1)

    We illustrate the relevance of independence atoms for data management, such as their intrinsic link to the optimization of the division operator, more precise cardinality estimations for choosing better query plans, and database security. Moreover, we show that they occur in real-world data sets.

  2. (2)

    For the combined class of FDs and IAs, finite and unrestricted implication differ [19, 20]. We show that finite implication is not finitely axiomatizable, already for binary FDs (those with a two-element attribute set on the left-hand side) and unary IAs. For the combined class of IAs and unary FDs, we show that finite and unrestricted implication coincide and establish a finite axiomatization. Hence, the situation for the combined class of FDs and IAs is more intricate than for the combined class of FDs and MVDs, where finite and unrestricted implication coincide, which enjoy an elegant finite axiomatization [6], and for which implication can be decided in almost linear time [14].

  3. (3)

    For the combined class of IAs, unary FDs, and UINDs, we establish axiomatizations for their finite and unrestricted implication problems, and show that both are decidable in low-degree polynomial time. This is analogous to the results for the combined class of FDs, MVDs, and UINDs. To the best of our knowledge, the class of IAs, unary FDs, and UINDs is only the second known class for which the finite and unrestricted implication differ but both are decidable in low-degree polynomial time. The class is practically relevant as it covers arbitrary IAs on top of unary keys and unary foreign keys, and already unary keys and unary foreign keys occur readily in practice [12]. The significant difference to FDs, MVDs, and UINDs is the more intricate interaction between FDs and IAs in comparison to FDs and MVDs. Note that unary FDs and INDs frequently occur in practice as surrogate keys and foreign keys that reference them. For example, 6 out of 8 keys are unary and 8 out of 9 foreign keys are unary in the TPC-H benchmark, while 20 out of 32 keys are unary and 44 and out 46 foreign keys are unary in the TPC-E benchmarkFootnote 1. The ability to reason efficiently about IAs, UFDs, and UINDs is good news for data management. Finally, trading in restrictions of the arity on INDs and FDs for restrictions on the arity of IAs cannot be successful: Finite implication for unary IAs and binary FDs is not finitely axiomatizable, see (2).

  4. (4)

    For the combined class of IAs and FDs we establish tractable conditions sufficient for non-interaction in both the finite and unrestricted cases. Instances of the finite or unrestricted implication problems that meet the non-interaction conditions can therefore be decided efficiently by using already known algorithms for the sole class of IAs and the sole class of FDs. The decidability of the finite and unrestricted implication problems for IAs and FDs are both still open.

Organization. In Sect. 2 we present all the necessary definitions for the article. Section 3 addresses the combined class of FDs and IAs. In Sect. 4 we focus on the combination of UFDs, UINDs, and IAs, and establish axiomatizations for their finite and unrestricted implication problems. Section 5 identifies polynomial-time criteria for the non-interaction between INDs and IAs, and also between FDs and IAs. Finally, in Sect. 6 we discuss the computational complexity of the implication problems. Due to lack of space we refer the reader to Appendix for any remaining proofs. The appendix can be found in [22].

2 Preliminaries

We denote by \(A,B,C,\ldots \) attributes and by \(X,Y,Z,\ldots \) either sets or sequences of attributes. For two sets (sequences) X and Y, we write XY for their union (concatenation). Similarly, we may write A instead of the single element set or sequence that consists of A. The size of a set (or length of a sequence) X is written as |X|.

A relation schema consists of attributes A, each equipped with a set of domain values denoted by \(\text {Dom}(A)\). By database schema we denote a pairwise disjoint sequence of relations schemata. Given a relation schema R, a tuple over R is a function that maps each attribute A from R to \(\text {Dom}(A)\). A relation r over R is then a non-empty set of tuples over R, and a database d over \(\mathcal {R}=(R_1, \ldots ,R_n)\) is a sequence \((r_1, \ldots ,r_n)\) where each \(r_i\) is a relation over \(R_i\)Footnote 2. We sometimes write r[R] to denote that r is a relation over R, and similarly we may write \(d[\mathcal {R}]\). A relation is called finite if the underlying set of tuples is finite, and a database is finite if it is a sequence of finite relations. For a tuple t and a relation r over R and a subset (or subsequence) X of R, t(X) is the restriction of t to X, and r(X) is the set of all restrictions t(X) where \(t\in r\).

Next we define the syntax and semantics of functional and inclusion dependencies and independence atoms.

Functional Dependency. Let X and Y be two sets of attributes from a relation schema R. Then \(R: X\rightarrow Y\) is a functional dependency that is satisfied by a database \(d=(r[R])\) iff for all \( t,t'\in r\), \(t(X)=t'(X)\) implies \(t(Y)=t'(Y)\).

Inclusion Dependency. Let \(A_1, \ldots ,A_n\) and \(B_1, \ldots ,B_n\) be two sequences of distinct attributes from relation schemata R and \(R'\), respectively. Then \(R[A_1\ldots A_n]\) \(\subseteq R'[B_1\ldots B_n]\) is an inclusion dependency that is satisfied by a database \(d=(r[R],r'[R'])\) iff for all \(t\in r_i\) there is \(t'\in r_j\) such that \(t(A_1)=t'(B_1), \ldots ,t(A_n)=t'(B_n)\).

Independence Atom. Let X and Y be two (not necessarily disjoint) attribute sets from a shared relation schema R. Then \(R:X\bot Y\) is an independence atom that is satisfied by a database \(d=(r[R])\) iff for all tuples \(t,t \in r\) there is a tuple \(t''\in r\) such that \(t''(X)=t(X)\) and \( t''(Y)=t'(Y)\). A disjoint independence atom (DIA) is an IA \(X\bot Y\) where \(X\cap Y\) is empty.

Regarding all the aforementioned dependencies, if the relation schema R is not needed in the context, we will drop it from the syntax. E.g., we will write \(X\bot Y\) instead of \(R:X\bot Y\).

We say that an IND is k-ary if it is of the form \(A_1\ldots A_k\subseteq B_1\ldots B_k\). An IA \(X \bot Y\) and an FD \(X\rightarrow Y\) are called k-ary if \(\max \{|X|,|Y|\}= k\). A class of dependencies is called k-ary if it contains at most k-ary dependencies. Most of the subclasses that we consider are only unary, so we add “U” to a class name to denote its unary subclass. For instance, UIND denotes the class of all unary INDs. In general, for \(k\ge 2\), we add “k” to a class name to denote its k-ary subclass. We use “\(+\)” to denote unions of classes, e.g., IND+IA denotes the class of all inclusion dependencies and independence atoms.

Notice that the semantic condition for IAs \(X\bot Y\) holds only if the values of the common attributes of X and Y are constant. In other words, the following holds:

  • * \(d\models R:X\bot X\), if for all \( s,s'\in r\) it holds that \(s(X)=s'(X)\).

Hence, we also call unary FDs of the form \(\emptyset \rightarrow A\) and unary IAs of the form \(A\bot A\) constancy atoms (CAs).

The restriction of a dependency \(\sigma \) to a set of attributes R, written \(\sigma \upharpoonright R\), is \(X\,\cap \, R \rightarrow Y\,\cap \, R\) for an FD \(\sigma \) of the form \(X\rightarrow Y\), and \(X\,\cap \, R\bot Y\,\cap \, R\) for an IA \(\sigma \) of the form \(X\bot Y\). If \(\sigma \) is an IND of the form \(A_1\ldots A_n\subseteq B_1\ldots B_n\) and \(i_1, \ldots ,i_k\) lists \(\{i=1, \ldots ,n:A_i\in R\ \text {and}\ B_i\in R\}\), then \(\sigma \upharpoonright R=A_{i_1}\ldots A_{i_k}\subseteq B_{i_1}\ldots B_{i_k}\). For a set of dependencies \(\varSigma \), the restriction of \(\varSigma \) to R, written \(\varSigma \upharpoonright R\), is the set of all \(\sigma \upharpoonright R\) for \( \sigma \in \varSigma \). For attributes A and B from R, we denote by \(\sigma (R:A\mapsto B)\) the dependencies obtained from \(\sigma \) by replacing any number of occurrences of A with B.

A set of axioms \(\sigma \) and rules of the form \(\sigma _1, \ldots ,\sigma _n \Rightarrow \sigma \) is called an axiomatization. A rule is called n-ary if its antecedent part has n conjuncts. An axiomatization consisting of at most n-ary rules is called n-ary. A deduction from a set of dependencies \(\varSigma \) by an axiomatization \(\mathfrak {R}\) is a sequence of dependencies \((\sigma _1, \ldots ,\sigma _n)\) where each \(\sigma _i\) is either an element of \(\varSigma \), an axiom, or follows from \(\sigma _1, \ldots ,\sigma _{i-1}\) by an application of a rule in \(\mathfrak {R}\). In such an occasion we write \(\varSigma \vdash _{\mathfrak {R}}\sigma \), or simply \(\varSigma \vdash \sigma \) if \(\mathfrak {R}\) is known.

Given a finite set of database dependencies \(\varSigma \cup \{\sigma \}\), the (finite) unrestricted implication problem is to decide whether all (finite) databases that satisfy \(\varSigma \) also satisfy \(\sigma \), written \(\varSigma \models \sigma \) (\(\varSigma \models _\mathrm{fin}\sigma \)). An axiomatization \(\mathfrak {R}\) is sound for the unrestricted implication problem of a class of dependencies \(\mathcal {C}\) if for all finite sets \(\varSigma \cup \{\sigma \}\) of dependencies from \(\mathcal {C}\), \(\varSigma \vdash _{\mathfrak {R}} \sigma \Rightarrow \varSigma \models \sigma \); it is complete if \( \varSigma \models \sigma \Rightarrow \varSigma \vdash _{\mathfrak {R}} \sigma \). Soundness and completeness for finite implication are defined analogously.

Some of our proofs use the chase algorithm that was invented in the late 70s [3, 31]. For a detailed exposition of this technique we refer the reader to [2].

Axiomatizations. Tables 2 and 3 present the axiomatizations considered in this article. In Table 2, the axiomatization \(\mathfrak {I}:=\{\mathcal {I}{1}, \ldots ,\mathcal {I}{5}\}\) is sound and complete for independence atoms alone [20, 27]. The rules \(\mathcal {F}{1}, \mathcal {F}{2} ,\mathcal {F}{3}\) form the Armstrong axiomatization for functional dependencies [4], and the rules \(\mathcal {FI}{1}\) and \(\mathcal {FI}{2}\) describe simple interaction between independence atoms and functional dependencies. Table 3 depicts the sound and complete axiomatization of inclusion dependencies introduced in [8, 9]. Table 2 presents rules describing interaction between inclusion dependencies and independence atoms [21].

We leave it to the reader to check the soundness of the axiom systems in Tables 2 and 3. The proof does not include anything unexpected; we only note that soundness of \(\mathcal {UI}{3}\) follows only if databases are not allowed to contain empty relations.

Theorem 2

The axiomatization \(\mathfrak {A}\cup \mathfrak {B}\cup \mathfrak {C}\) is sound for the unrestricted and finite implication problems of FD+IND+IA.

Lastly, we note that, for notational clarity only, we will restrict attention to the uni-relational case in all our proofs. That is, we will consider only those cases where databases consist of a single relation.

Table 2. Axiomatizations \(\mathfrak {A}\) for FDs and IAs and \(\mathfrak {C}\) for IAs and INDs. We define \(\mathfrak {I}:=\{\mathcal {I}{1},\ldots ,\mathcal {I}{5}\}\) and \(\mathfrak {A}^*:=\mathfrak {A}\setminus \{\mathcal {I}{5},\mathcal {F}{3}\}\).

3 IAs+FDs

First we consider the interaction between FDs and IAs. Already keys and IAs combined form a somewhat intricate class: Their finite and unrestricted implication problems differ and the former lacks a finite axiomatization [20]. In Sect. 3.1 we extend these results to the classes FD+IA and 2FD+UIA. However, the interaction between unary FDs and IAs is less involved. In Sect. 3.2 we show that for UFD+IA unrestricted and finite implication coincide and the axiomatization \(\mathfrak {A}^*\) given in Table 2 forms a sound and complete axiomatization.

3.1 Implication Problem for FDs and IAs

The following theorem enables us to separate the finite and unrestricted implication problems for FD+IA as well as for FD+UIA.

Theorem 3

[19]. The unrestricted and finite implication problems for keys and UIAs differ.

This theorem was proved by showing that \(\varSigma \models _\mathrm{fin}\sigma \) and \(\varSigma \not \models \sigma \), for \(\varSigma :=\{A\bot B,C\bot D, BC\rightarrow AD,AD\rightarrow BC\}\) and \(\sigma :=AB\rightarrow CD\). In [19] it was shown that this counterexample can be extended to a non-axiomatizability results for finite implication of keys and IAs. By an analogous line of reasoning this results carries over to the class of FDs and IAs, as well (see Appendix).

Table 3. Axiomatization \(\mathfrak {B}\) for INDs

Theorem 4

The finite implication problem for FD+IA (2FD+UIA) is not finitely axiomatizable.

The implicit assumption in the above theorem is that an axiomatization must be attribute-bounded, meaning that it may not introduce new attributes [10]. It is easy to see that with this prerequisite finite axiomatization entails decidability. Contrarily, there are finite axiomatizations for undecidable implication problems that do not adhere to this assumption [16,17,18, 33].

To the best of our knowledge, decidability is open for both FD+IA and FD+UIA with respect to their finite and unrestricted implication problems. It is worth noting here that the unrestricted (finite) implication problem for FD+UIA is as hard as that for FD+IA. For this, we demonstrate a simple reduction from the latter to the former. Let \(\varSigma \cup \{\sigma \}\) be a set of FDs and IAs, and let \(\varSigma '\) denote the set of FDs and IAs where each IA of the form \(X\bot Y\) is replaced with dependencies from \(\{A\bot B,X\rightarrow A,A\rightarrow X,Y\rightarrow B,B\rightarrow Y\}\) where A and B are fresh attributes. If \(\sigma \) is an FD, then \(\varSigma \) (finitely) implies \(\sigma \) iff \(\varSigma '\) (finitely) implies \( \sigma \). Also, if \(\sigma \) is of the form \(X\bot Y\), then we have \(\varSigma \models \sigma \) iff \(\varSigma ''\models \sigma '\), where

$$ \varSigma '':=\varSigma '\cup \{X\rightarrow A,A\rightarrow X,Y\rightarrow B,B\rightarrow Y\}, $$

\(\sigma ':=A\bot B\), and A and B are fresh attributes.

3.2 Implication for UFDs and IAs

Next we turn to the class UFD+IA. Extending the scope and methods from [20], which presented a finite axiomatization for unary keys and IAs, we show that the axiomatization \(\mathfrak {A}^*\) (see Table 2) is sound and complete for UFD+IA in both with respect to finite and unrestricted implication. Hence, compared to UIAs and FDs, the interaction between IAs and UFDs is relatively tame. Combined, however, these two may entail new restrictions to column sizes. For instance, in the finite \(A\rightarrow B_1\), \(A\rightarrow B_2\), and \(B_1\bot B_2\) imply \(|r(B_1)|\cdot |r(B_2)|\le |r(A)|\). The proof of the following completeness theorem is obtained by a chase-based model construction (see Appendix).

Theorem 5

The axiomatization \(\mathfrak {A}^*\) is sound and complete for the unrestricted and finite implication problems of UFD+IA.

As the same axiomatization characterizes both finite and unrestricted implication, we obtain the following corollary.

Corollary 1

The finite and unrestricted implication problems coincide for UFD+IA.

4 IAs+UFDs+UINDs

Next we turn attention to the combined class of FDs, INDs, and IAs. In the previous section we noticed that the finite implication problem for binary FDs and unary IAs is not finitely axiomatizable. On the other hand, both the finite and unrestricted implication problems for unary FDs and binary INDs are undecidable [32]. Hence, in this section we restrict to unary FDs and unary INDs, a class for which the two implication problems already deviate [12]. It turns out that the combination UFD+UIND+IA can be axiomatized with respect to both problems. However, in the finite case the axiomatization is infinite as one needs to add so-called cycle rules for UFDs and UINDs.

An axiomatization for unrestricted implication follows from results in Sect. 3.2 and [12]. For the proof, see Appendix.

Theorem 6

The axiomatization \(\mathfrak {A}^*\cup \{\mathcal {U}{1},\mathcal {U}{2},\mathcal {UI}{3},\mathcal {UI}{4}\}\) is sound and complete for the unrestricted implication problem of UFD+UIND+IA.

For finite implication a complete axiomatization of UFD+UIND+IA is found by extending \(\mathfrak {A}^*\cup \{\mathcal {U}{1},\mathcal {U}{2},\mathcal {UI}{3},\mathcal {UI}{4}\}\) with the so-called cycle rules [12] (see Table 2) and by removing \(\mathcal {UI}{3},\mathcal {UI}{4}\) which become redundant. However, the completeness proof is now more involved and proved in two steps. We will combine the chase-based approach of the proof of Theorem 5 with the graph-theoretic approach from [12]. The latter method was used to prove a complete axiomatization for the finite implication problem of UIND+FD. For the graph-theoretic approach, we commence by introducing multigraphs with two sorts of edges: red ones which encode UFDs and black ones which encode UINDs.

Table 4. Cycle rules for finite implication

Definition 1

[12]. For each set \(\varSigma \) of UINDs and UFDs over R, let \(G(\varSigma )\) be the multigraph that consists of nodes R, red directed edges (AB), for \(A\rightarrow B \in \varSigma \), and black directed edges (AB), for \(B\subseteq A\in \varSigma \). If \(G(\varSigma )\) has red (black) directed edges from A to B and vice versa, then these edges are replaced with an undirected edge between A and B.

Given a multigraph \(G(\varSigma )\), we first topologically sort its strongly connected components which form a directed acyclic graph [25]. That is, each component is assigned a unique scc-number, greater than the scc-numbers of all its descendants. For an attribute A, denote by \(\text {scc}(A)\) the scc-number of the component node A belongs to. Note that \(\text {scc}(A)\ge \text {scc}(B)\) if (AB) is an edge in \(G(\varSigma )\). Denote also by \(\text {scc}_i\) the set of attributes A with \(\text {scc}(A)=i\), and let \(\text {scc}_{\le i}:=\bigcup _{j\le i} \text {scc}_j\) and define \(\text {scc}_{\ge i}\), \(\text {scc}_{<i}\), and \(\text {scc}_{>i}\) analogously. The following lemma is a simple consequence of the definition.Footnote 3

Lemma 1

[12]. Let \(\varSigma \) be a set of UFDs and UINDs, closed under \(\{\mathcal {F}{1},\mathcal {F}{2},\mathcal {U}{1},\) \(\mathcal {U}{2}\}\cup \{\mathcal {C}_k:k\in \mathbb {N}\}\). Then every node in \(G(\varSigma )\) has a red and a black self-loop. The red (black) subgraph of \(G(\varSigma )\) is transitively closed. The subgraphs induced by the strongly connected components of \(G(\varSigma )\) are undirected. In each strongly connected component, the red (black) subset of undirected edges forms a collection of node-disjoint cliques. Note that the red and black partitions of nodes could be different.

We now apply this graphical approach to earlier techniques presented in this paper. Theorem 7 shows completeness of the axiomatization \(\mathfrak {A}^*\cup \{\mathcal {U}{1},\mathcal {U}{2}\}\cup \{\mathcal {C}_{n}: n\in \mathbb {N}\}\) for the finite implication problem of UFD+UIND+IA by using the relation generated in Lemma 2. The proof of this lemma describes an incremental modification of the base relation, taken from the proof of Theorem 5, that is shown to reflect a growing number of inclusion dependencies in its composition. This is achieved by an inductive re-organization of the column values according to the underlying scc-numbering while at the same time maintaining the integrity of the UFD and IA dependencies in the base relation. The proof of the theorem and the lemma can be found in Appendix.

Lemma 2

Let \(\varSigma \) be a set of UFDs, UINDs, and IAs over R, partitioned respectively to \(\varSigma _\mathrm{UFD}\), \(\varSigma _\mathrm{UIND}\), and \(\varSigma _\mathrm{IA}\). Assume that \(\varSigma _\mathrm{UFD}\cup \varSigma _\mathrm{UIND}\) contains all UFDs and UINDs derivable from \(\varSigma \) by \(\mathfrak {A}^*\cup \{\mathcal {U}{1},\mathcal {U}{2}\}\cup \{\mathcal {C}_k:k\in \mathbb {N}\}\), and assume that we have assigned an scc-numbering to \(G(\varSigma _\mathrm{UFD}\cup \varSigma _\mathrm{UIND})\). Let E be either the empty set or a single attribute, and let \(R':=\{B\in R: E \rightarrow B\not \in \varSigma \}\). Then there exists a finite relation r and tuples \(t_0,t_1\in r\) such that:

  1. (i)

    \(\varSigma \vdash X\bot Y\) if \(X,Y\subseteq R'\) and for some \(t\in r\), \(t(X)=t_0(X)\) and \(t(Y)=t_1(Y)\);

  2. (ii)

    \(r\models \varSigma _\mathrm{UFD}\cup \varSigma _\mathrm{IA}\);

  3. (iii)

    r(A) is (strictly) included in r(B) if \(\text {scc}(A)\) is (strictly) less than \(\text {scc}(B)\).

Theorem 7

The axiomatization \(\mathfrak {A}^*\cup \{\mathcal {U}{1},\mathcal {U}{2}\}\cup \{\mathcal {C}_{n}: n\in \mathbb {N}\}\) is sound and complete for the finite implication problem of UFD+UIND+IA.

5 Polynomial-Time Conditions for Non-interaction

The interaction-freeness between the class FD+IND has been well-studied in the literature [29, 30]. Here, we examine the frontiers for tractable reasoning about the class FD+IA in both the finite and unrestricted cases. For IND+IA these questions have been studied in [21]. The idea is to establish sufficient criteria for the non-interaction between IAs and FDs. There is a trade-off between the simplicity and generality of such criteria. While simple criteria may be easier to apply, more general criteria allow us to establish non-interaction in more cases. Our focus here is on generality, and the criteria are driven by the corresponding inference rules. We define non-interaction between two classes as follows.

Definition 2

Let \(\varSigma _0\) and \(\varSigma _1\) be two sets of dependencies from classes \(\mathcal {C}_0\) and \(\mathcal {C}_1\), respectively. We say that \(\varSigma _0,\varSigma _1\) have no interaction with respect to unrestricted (finite) implication if

  • for \(\sigma \) from \(\mathcal {C}_0\), \(\sigma \) is (finitely) implied by \(\varSigma _0\) iff \(\sigma \) is (finitely) implied by \(\varSigma _0\cup \varSigma _1\).

  • for \(\sigma \) from \(\mathcal {C}_1\), \(\sigma \) is (finitely) implied by \(\varSigma _1\) iff \(\sigma \) is (finitely) implied by \(\varSigma _0\cup \varSigma _1\).

Let us now define two syntactic criteria for describing non-interaction. We say that an IA \(X\bot Y\) splits an FD \(U\rightarrow V\) if both \((X\setminus Y)\,\cap \, U\) and \((Y\setminus X)\,\cap \, U\) are non-empty. An IA \(X\bot Y\) splits an IND \(Z \subseteq W\) if both \( X\cap W \) and \( Y\cap W \) are non-empty. Furthermore, \(X\bot Y\) intersects \(U\rightarrow V\) if \(XY\cap U\) is non-empty. Notice that both these concepts give rise to possible interaction between two different classes. We show that lacking splits implies non-interaction for FD+IA in the unrestricted case. Non-interaction for FD+IA in the finite is guaranteed by the stronger condition defined in terms of lacking intersections.

For IND+IA lack of splits entail non-interaction [21].

Theorem 8

[21]. Let \(\varSigma _\mathrm{IND}\) and \(\varSigma _\mathrm{IA}\) be respectively sets of INDs and IAs. If no IA in \(\varSigma _\mathrm{IA}\) splits any IND in \(\varSigma _\mathrm{IND}\), then \(\varSigma _\mathrm{IND}\) and \(\varSigma _\mathrm{IA}\) have no interaction with respect to unrestricted (finite) implication.

We proceed with the non-interaction results for FD+IA. The proofs are located in Appendix. For unrestricted implication the idea is to first apply the below polynomial-time algorithm which transforms an assumption set \(\varSigma \) to an equivalent set \(\varSigma ^*\). The set \(\varSigma ^*\) is such that it has no interaction between FDs and IAs provided that none of its FDs split any IAs.

For a set of FDs \(\varSigma \), let us denote by \(\text {Cl}(\varSigma ,X)\) the closure set of all attributes A for which \(\varSigma \models X\rightarrow A\). This set can be computed in linear time by the Beeri-Bernstein algorithm [5]. The non-interaction condition for unrestricted implication is now formulated using \(\varSigma _\mathrm{IA}^*=\{X_1\bot Y_1, \ldots ,X_n\bot Y_n\}\) and \(\varSigma _\mathrm{FD}^*=\varSigma _\mathrm{FD}\cup \{\emptyset \rightarrow Z\}\) where \(Z,X_iY_i\) are computed using the following algorithm that takes an FD set \(\varSigma _\mathrm{FD}\) and an IA set \(\varSigma _\mathrm{IA}=\{U_1\bot V_1, \ldots , U_n\bot V_n\}\) as an input.

figure a

From the construction we obtain that \(\varSigma _\mathrm{FD}^*\cup \varSigma _\mathrm{IA}^*\) is equivalent to \(\varSigma _\mathrm{FD}\cup \varSigma _\mathrm{IA}\) and that

  1. (1)

    for \(Z_1\bot Z_2\in \varSigma _\mathrm{IA}^*\) and \(i=1,2\), \(\varSigma _\mathrm{FD}^*\models Z_i\rightarrow X\) implies \(X\subseteq Z_i\);

  2. (2)

    \(\varSigma _\mathrm{FD}^*\cup \varSigma _\mathrm{IA}^*\models \emptyset \rightarrow A\) iff \( A \in Z \).

Recall that the closure set \(\text {C}(\varSigma _\mathrm{FD},X)\) can be computed in linear time by the Beeri-Bernstein algorithm. Now, at stage 5 (or stage 6) the computation of the closure set is resumed whenever V introduces attributes that are new to \(X_i\) (\(Y_i\)). Since the number of the closures considered is \(2 |\varSigma _\mathrm{IA}|\), we obtain a quadratic time bound for the computation of \(Z,X_i,Y_i\).

Theorem 9

Let \(\varSigma _\mathrm{FD}\) and \(\varSigma _\mathrm{IA}\) be respectively sets of FDs and IAs, and let \(\varSigma _\mathrm{FD}^*\) and \(\varSigma _\mathrm{IA}^*\) be obtained from \(\varSigma _\mathrm{FD}\) and \(\varSigma _\mathrm{IA}\) by Algorithm 1. Then the following holds:

  • if no IA in \(\varSigma _\mathrm{IA}^*\) splits any FD in \(\varSigma _\mathrm{FD}^*\), then \(\varSigma _\mathrm{FD}^*\) and \(\varSigma _\mathrm{IA}^*\) have no interaction with respect to unrestricted implication;

  • if no IA in \(\varSigma _\mathrm{IA}\) intersects any FD in \(\varSigma _\mathrm{FD}\), then \(\varSigma _\mathrm{FD}\) and \(\varSigma _\mathrm{IA}\) have no interaction with respect to finite implication.

To illustrate the necessity for a stronger condition in the finite case, recall from Sect. 3.1 that \(AB \rightarrow CD\) is finitely implied by \(\{A\bot B, C\bot D, BC\rightarrow AD, AD\rightarrow BC\}\), and notice that \(AB \rightarrow CD\) is not finitely implied by \(\{BC\rightarrow AD, AD\rightarrow BC\}\). However, Algorithm 1 does not produce any fresh assumptions, and neither \(A\bot B\) nor \(C\bot D\) splits any FD assumption. Therefore, lack of splits is not sufficient for non-interaction in the finite case.

6 Complexity Results

Next we examine the computational complexity of the discussed implication problems. We show that both implication problems for UFD+UIND+IA can be solved in low-degree polynomial time, even though the problems differ from one another. The associated decision procedures, found in Appendix, transform the implication problems first to graphs, as earlier in this paper, and subsequently modify them according to appropriate inference rules. The only difference with finite implication is that an application of the cycle rules is included in the process. The implication problem then reduces, for UFDs and UINDs, to reachability in the graph, and for IAs, to an IA-implication instance which reflects the topology of the graph. Consequently, the stated time bounds follow.

Theorem 10

Let \(\varSigma _\mathrm{UFD},\varSigma _\mathrm{UIND},\varSigma _\mathrm{IA}\) be respectively sets of UFDs, UINDs, and IAs over a relation schema R. The unrestricted and finite implication problems for \(\sigma \) by \(\varSigma _\mathrm{UFD}\cup \varSigma _\mathrm{UIND}\cup \varSigma _\mathrm{IA}\) can be decided in time:

  • \(O(|\varSigma _\mathrm{IA}|\cdot |\varSigma _\mathrm{UFD}|+|\varSigma _\mathrm{UIND}|)\) if \(\sigma \) is an UFD or UIND;

  • \(O(|\varSigma _\mathrm{IA}|\cdot (|\varSigma _\mathrm{UFD}|+|R|^2)+ |\varSigma _\mathrm{UIND}|)\) if \(\sigma \) is a IA.

7 Conclusion and Outlook

In view of the infeasibility of EMVDs and of FDs and INDs combined, the class of FDs, MVDs and unary INDs is important as it is low-degree PTIME decidable in the finite and unrestricted cases. As independence atoms form an important tractable embedded sub-class of EMVDs, we have delineated axiomatizability and tractability frontiers for sub-classes of FDs, INDs, and IAs. The most interesting class is that of IAs, unary FDs and unary INDs, for which finite and unrestricted implication differ but each is axiomatisable and decidable in low-degree polynomial time. The results form a basis for the advancement of several data processing tasks, including cardinality estimation, database security, and query optimization.

Even though research on dependency theory has been rich and deep, there are many problems that warrant future research. Theoretically, the decidability remains open for both independence atoms and functional dependencies as well as unary independence atoms and functional dependencies, both in the finite and unrestricted case. This line of research should also be investigated in the probabilistic setting of conditional independencies, fundamental to multivariate statistics and machine learning. Practically, implementations and experimental evaluations of the algorithms can complement the findings in the research. Of direct practical use for data profiling would be algorithms that compute the set of IAs that hold on a given relation, as would algorithms to mine notions of approximate IAs [1].