
1 Introduction

Background. Cardinality constraints (CCs) and functional dependencies (FDs) are fundamental for understanding the structure and semantics of data, and have a long and fruitful history in conceptual modeling, database theory, and practice. CCs were introduced in the seminal paper by Chen [5], while FDs were introduced in the seminal paper by Codd [6]. We focus on cardinality constraints that define an upper bound on the number of objects that have matching values on a given set of attributes. For example, no project manager may look after more than three projects at any point in time. An FD expresses that the values on some attributes uniquely determine the values on some other attributes. For example, every project has at most one manager. Due to their ability to express desirable properties of many application domains, CCs and FDs have been used successfully for core data management tasks, including database cleaning, design, integration, modeling, querying, and updating.

Motivation. Relational databases were developed for applications with certain data, including accounting, inventory, and payroll. Modern applications, such as information extraction, sensor networks, and data integration, produce large volumes of uncertain data. While different approaches to uncertainty in data exist, our running example considers a simple scenario in which a qualitative approach is applied to the integration of two data sources. The scenario maintains the levels of confidence associated with objects: objects that occur in both sources are labeled ‘fully possible’, while objects that occur in only one source are labeled ‘somewhat possible’. The information about the confidence of objects is clearly useful, but probability distributions are unavailable. Instead, a qualitative approach as founded in possibility theory is appropriate [9, 10, 35]. Figure 1 shows a possibilistic instance (p-instance) where each object is associated with a possibility degree (p-degree) from a finite scale: \(\alpha _1>\ldots >\alpha _{k+1}\). The top degree \(\alpha _1\) is reserved for objects that are ‘fully possible’, the bottom degree \(\alpha _{k+1}\) for objects that are ‘impossible’ to occur. Intermediate degrees, with linguistic interpretations chosen as preferred, such as ‘somewhat possible’ (\(\alpha _2\)), can be used in between.

Fig. 1. A possibilistic instance and the scope by which constraints apply to its objects

Interestingly, p-degrees enable us to express CCs and FDs with different degrees of certainty. For example, to express that it is ‘impossible’ for the same department and manager to be associated with more than three employees, we declare the CC \(\textit{card}(\textit{Dep, Mgr})\le 3\) to be ‘fully certain’ by using the label \(\beta _1\), stipulating that no combination of department and manager can feature in more than three objects that are at least ‘somewhat possible’. Similarly, to say that it is only ‘somewhat possible’ for a department to have more than one manager, we declare the FD \(\textit{Dep}\rightarrow \textit{Mgr}\) as ‘somewhat certain’ by using the label \(\beta _2\), stipulating that no department has more than one manager in ‘fully possible’ objects. We will investigate the combined class of CCs and FDs in this possibilistic data model.

Contributions and impact. Our contributions are as follows. (1) We show that the combination of CCs and FDs in a possibilistic data model constitutes a ‘sweet spot’ in terms of expressivity and computational behavior. In particular, we unify previous work under a more expressive framework that retains efficient computational properties. Slightly more expressive approaches result in non-axiomatizability, intractability, or even undecidability. (2) We establish a finite axiomatization and a linear-time decision algorithm for the associated implication problem. We illustrate applications from constraint maintenance, query optimization, and pivoting to eliminate data redundancy. (3) We establish an effective construction of Armstrong representations for any given set of our constraints. Here, we overcome the practical challenge that finite Armstrong instances frequently do not exist. We thus provide automated support for the acquisition of the constraints that are meaningful in a given application domain.

Organization. Section 2 discusses related work. Our data model is defined in Sect. 3. In Sect. 4 we characterize the implication problem axiomatically and algorithmically. Applications are highlighted in Sect. 5. Section 6 describes how to compute Armstrong representations. In Sect. 7 we conclude and discuss future work. Proofs are available in [39].

2 Related Work

FDs are probably the most studied class of constraints, due to their expressivity, computational behavior, and impact on practice. FDs were introduced in Codd’s seminal paper [6], and are intrinsically linked to conceptual, logical, and physical database design [27, 44]. Applications on the conceptual level include graphical reasoning [8] and pivoting [3, 17]. CCs are an influential contribution of conceptual modeling to database constraints. They featured in Chen’s seminal paper [5]. Cardinality constraints subsume the class of keys as a special case where the upper bound on the cardinality is fixed to 1. Keys are fundamental to most data models [4, 11, 16, 21, 22, 26, 29, 30, 42, 46]. Most languages for conceptual design (description logics, ER, UML, ORM) come with means for specifying CCs. CCs have been studied extensively in database design [7, 18, 25, 31, 32, 38, 43].

Probability theory offers a popular quantitative approach to uncertain data [41]. Research about constraints on probabilistic data is in its infancy [4, 40]. Probabilistic FDs, which specify a lower bound on the marginal probability by which an FD holds on a probabilistic database, are not finitely axiomatizable.

The results of our article unify various previous works under one, more expressive, framework. In fact, our framework subsumes (1) the sole class of possibilistic CCs [15, 28], (2) the sole class of possibilistic FDs [35], and (3) the combined class of CCs and FDs over relational data (the special case of possibilistic data with only one degree of confidence, i.e. where \(k=1\)) [18]. While our framework is strictly more expressive, it retains the good computational properties of previous work, making it special. Indeed, making our framework more expressive is likely to result in the loss of good computational behavior. For example, using numerical dependencies instead of FDs leaves the implication problem not finitely axiomatizable [14]; using multivalued dependencies requires more elaborate possibilistic data models, and their interaction with CCs is not well understood [19, 20, 23, 24, 33, 34, 45]; using conditional FDs makes the implication problem coNP-complete [13]; adding inclusion dependencies makes the implication problem undecidable [37]; and adding lower bounds alongside the upper bounds of our CCs requires solving open problems from combinatorial design theory, even in the special case where \(k=1\) [18]. Further restrictions on what we additionally include are always possible, but our focus here is the natural class of cardinality constraints with upper bounds and functional dependencies.

3 Cardinality Constraints and Functional Dependencies

We extend object types that model certain objects in traditional conceptual modeling to model uncertain objects qualitatively. This allows us to extend CCs and FDs from their use on certain object types to uncertain object types.

An object type, denoted by O, is a finite non-empty set of attributes. Each attribute \(A\in O\) has a domain \(\textit{dom}(A)\) of values. An object o over O is an element of the Cartesian product \(\prod _{A\in O}{} \textit{dom}(A)\). For \(X\subseteq O\) we denote by o(X) the projection of o on X. An instance over O is a set \(\iota \) of objects over O. For example, we use the object type Work with attributes Emp, Dep, and Mgr. Objects either belong or do not belong to an instance. For example, we cannot express that we are less confident that employee Nara works in the Tennis department under manager Federer than that employee Nishikori does.

We model uncertain instances by assigning to each object some degree of possibility with which the object occurs in an instance. Formally, we have a possibility scale, or p-scale, that is, a strict linear order \(\mathcal {S}=(S,<)\) with \(k+1\) elements. We write \(\mathcal {S}=\{\alpha _1,\ldots ,\alpha _{k+1}\}\) to declare that \(\alpha _1>\cdots>\alpha _k>\alpha _{k+1}\). The elements \(\alpha _i\in S\) are called possibility degrees, or p-degrees. Here, \(\alpha _1\) is reserved for objects that are ‘fully possible’ while \(\alpha _{k+1}\) is reserved for objects that are ‘impossible’ to occur in an instance. Humans like to use simple scales in everyday life to communicate, compare, or rank. Here, the word “simple” means that items are classified qualitatively rather than quantitatively by putting precise values on them. Classical instances use two p-degrees, i.e. \(k=1\).

A possibilistic object type \((O,\mathcal {S})\), or p-object type, consists of an object type O and a p-scale \(\mathcal {S}\). A possibilistic instance, or p-instance, over \((O,\mathcal {S})\) consists of an instance \(\iota \) over O, and a function \(\textit{Poss}\) that assigns to each object \(o\in \iota \) a p-degree \(\textit{Poss}(o)\in \mathcal {S}-\{\alpha _{k+1}\}\). We sometimes omit Poss when denoting a p-instance. Figure 1 shows a p-instance over \((\textsc {Work},\mathcal {S}=\{\alpha _1,\alpha _2,\alpha _3\})\).

Fig. 2. Nested worlds of the p-instance from Fig. 1 and possibilistic constraints

P-instances enjoy a possible world semantics. For \(i=1,\ldots ,k\) let \(w_i\) consist of all objects in \(\iota \) that have p-degree at least \(\alpha _i\), that is, \(w_i=\{o\in \iota \mid \textit{Poss}(o)\ge \alpha _i\}\). Indeed, we have \(w_1\subseteq w_2\subseteq \cdots \subseteq w_k\). If \(o\notin w_k\), then \(\textit{Poss}(o)=\alpha _{k+1}\). Every object that is ‘fully possible’ occurs in every possible world, and is therefore also ‘fully certain’. Hence, instances are a special case of uncertain instances. Figure 2 shows the possible worlds \(w_1\subsetneq w_2\) of the p-instance of Fig. 1.
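To make the possible world semantics concrete, the following minimal Python sketch is our own illustration (not from the paper's artifacts): objects are dictionaries over the attributes of Work, p-degrees are encoded as integer indices where 1 stands for \(\alpha _1\), and the object values reuse names from the running example.

```python
# A p-instance over Work: pairs of (object, p-degree index), where index i
# encodes alpha_i; the bottom degree alpha_{k+1} never occurs in an instance.
K = 2  # p-scale with k+1 = 3 degrees: alpha_1 > alpha_2 > alpha_3
p_instance = [
    ({"Emp": "Nishikori", "Dep": "Tennis", "Mgr": "Federer"}, 1),  # fully possible
    ({"Emp": "Nara", "Dep": "Tennis", "Mgr": "Federer"}, 2),       # somewhat possible
]

def possible_world(p_instance, i):
    """w_i: all objects whose p-degree is at least alpha_i (index <= i)."""
    return [o for (o, d) in p_instance if d <= i]

worlds = [possible_world(p_instance, i) for i in range(1, K + 1)]
assert all(o in worlds[1] for o in worlds[0])  # nested worlds: w_1 is contained in w_2
```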

As CCs and FDs are fundamental to applications with certain data, their possibilistic variants serve similar roles for applications with uncertain data. A cardinality constraint over object type O is an expression \(\textit{card}(X)\le b\) where \(X\subseteq O\), and b is a positive integer. The CC \(\textit{card}(X)\le b\) over O is satisfied by an instance w over O, denoted by \(\models _w\textit{card}(X)\le b\), if there are no \(b+1\) distinct objects \(o_1,\ldots ,o_{b+1}\in w\) with matching values on all the attributes in X. For example, Fig. 2 shows that \(\textit{card}(\textit{Dep,Mgr})\le 1\) is satisfied by neither \(w_1\) nor \(w_2\), and \(\textit{card}(\textit{Dep,Mgr})\le 2\) is satisfied by \(w_1\), but not by \(w_2\). A functional dependency over object type O is an expression \(X\rightarrow Y\) where \(X,Y\subseteq O\). The FD \(X\rightarrow Y\) over O is satisfied by an instance w over O, denoted by \(\models _w X\rightarrow Y\), if for any two objects \(o_1,o_2\in w\) the following holds: if \(o_1(X)=o_2(X)\), then \(o_1(Y)=o_2(Y)\). For example, Fig. 2 shows that \(\textit{Dep}\rightarrow \textit{Mgr}\) is satisfied by \(w_1\), but not by \(w_2\), and \(\textit{Emp}\rightarrow \textit{Dep}\) is satisfied by \(w_1\) and \(w_2\).

The p-degrees of objects result in degrees of certainty by which constraints hold. Since \(\textit{Emp}\rightarrow \textit{Dep}\) holds in every possible world, it is fully certain to hold on \(\iota \). As \(\textit{Dep}\rightarrow \textit{Mgr}\) and \(\textit{card}(\textit{Dep,Mgr})\le 2\) are only violated in a somewhat possible world \(w_2\), they are somewhat certain to hold on \(\iota \). Since \(\textit{card}(\textit{Dep,Mgr})\le 1\) is violated in the fully possible world \(w_1\), it is not certain to hold on \(\iota \).

Similar to the scale \(\mathcal {S}\) of p-degrees \(\alpha _i\) for objects, we use a scale \(\mathcal {S}^T\) of certainty degrees, or c-degrees, \(\beta _j\) for CCs and FDs. Formally, the correspondence between p-degrees in \(\mathcal {S}\) and the c-degrees in \(\mathcal {S}^T\) is defined by the mapping \(\alpha _i\mapsto \beta _{k+2-i}\) for \(i=1,\ldots ,k+1\). Hence, the certainty \(C_\iota (\sigma )\) by which the CC \(\sigma =\textit{card}(X)\le b\) or FD \(\sigma =X\rightarrow Y\) holds on the uncertain instance \(\iota \) is either the top degree \(\beta _1\) if \(\sigma \) is satisfied by \(w_k\), or the minimum amongst the c-degrees \(\beta _{k+2-i}\) that correspond to possible worlds \(w_i\) in which \(\sigma \) is violated, that is,

$$C_\iota (\sigma )={\left\{ \begin{array}{ll} \beta _1 &{} \text {if } \models _{w_k}\sigma ,\\ \min \{\beta _{k+2-i}\mid \not \models _{w_i}\sigma \} &{} \text {otherwise.} \end{array}\right. }$$
We can now define the semantics of possibilistic CCs and FDs. Let \((O,\mathcal {S})\) denote a p-object type. A possibilistic CC (p-CC) over \((O,\mathcal {S})\) is an expression \((\textit{card}(X)\le b,\beta )\) where \(\textit{card}(X)\le b\) denotes a CC over O and \(\beta \in \mathcal {S}^T\). A p-instance \((\iota ,\textit{Poss})\) over \((O,\mathcal {S})\) satisfies the p-CC \((\textit{card}(X)\le b,\beta )\) if and only if \(C_\iota (\textit{card}(X)\le b)\ge \beta \). A possibilistic FD (p-FD) over \((O,\mathcal {S})\) is an expression \((X\rightarrow Y,\beta )\) where \(X\rightarrow Y\) denotes an FD over O and \(\beta \in \mathcal {S}^T\). A p-instance \((\iota ,\textit{Poss})\) over \((O,\mathcal {S})\) satisfies the p-FD \((X\rightarrow Y,\beta )\) if and only if \(C_\iota (X\rightarrow Y)\ge \beta \).
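Continuing the sketch from Sect. 3, the certainty degree \(C_\iota (\sigma )\) can be computed by scanning the nested worlds for the first violation. This is our own illustration of the definitions, with c-degrees also encoded as integer indices (1 for \(\beta _1\)); the satisfaction of a p-CC or p-FD \((\sigma ,\beta )\) is then an index comparison.

```python
def satisfies_cc(world, X, b):
    """No b+1 distinct objects agree on all attributes in X."""
    counts = {}
    for o in world:
        key = tuple(o[a] for a in sorted(X))
        counts[key] = counts.get(key, 0) + 1
    return all(c <= b for c in counts.values())

def satisfies_fd(world, X, Y):
    """Objects agreeing on X also agree on Y."""
    image = {}
    for o in world:
        kx = tuple(o[a] for a in sorted(X))
        ky = tuple(o[a] for a in sorted(Y))
        if image.setdefault(kx, ky) != ky:
            return False
    return True

def certainty(p_instance, k, sat):
    """C_iota(sigma) as a degree index: a violation first seen in world w_i
    yields beta_{k+2-i}; if w_k satisfies sigma the result is beta_1 (index 1)."""
    for i in range(1, k + 1):
        w = [o for (o, d) in p_instance if d <= i]
        if not sat(w):
            return k + 2 - i
    return 1

p_instance = [  # the two-object p-instance from the sketch above
    ({"Emp": "Nishikori", "Dep": "Tennis", "Mgr": "Federer"}, 1),
    ({"Emp": "Nara", "Dep": "Tennis", "Mgr": "Federer"}, 2),
]
# card(Dep) <= 1 first fails in w_2, so its certainty is beta_2 (index 2):
assert certainty(p_instance, 2, lambda w: satisfies_cc(w, {"Dep"}, 1)) == 2
# hence (card(Dep) <= 1, beta_2) is satisfied, while (card(Dep) <= 1, beta_1) is not
```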

For example, Fig. 2 shows some of the p-CCs and p-FDs that the p-instance \(\iota \) from Fig. 1 satisfies. The next example introduces the set \(\varSigma \) of p-CCs and p-FDs we will use as an example constraint set in the remainder of the article.

Example 1

Let \(\varSigma \) denote the set with the following p-CCs and p-FDs over p-object type \((\textsc {Work},\mathcal {S}=\{\alpha _1,\alpha _2,\alpha _3\})\): \((\textit{Emp}\rightarrow \textit{Dep},\beta _1)\), \((\textit{card}(\textit{Dep,Mgr})\le 3,\beta _1)\), \((\textit{Dep}\rightarrow \textit{Mgr},\beta _2)\), and \((\textit{card}(\textit{Mgr})\le 2,\beta _2)\).    \(\square \)

4 Computational Problems and Their Solutions

We establish fundamental tools to reason about p-CCs and p-FDs. Their applicability will be illustrated in Sect. 5. First, we define the implication problem and then address its solution in terms of inference rules and algorithms.

Let \(\varSigma \cup \{\varphi \}\) denote a set of p-CCs and p-FDs over \((O,\mathcal {S})\). We say \(\varSigma \) implies \(\varphi \), denoted by \(\varSigma \,\models \,\varphi \), if every p-instance \((\iota ,\textit{Poss})\) over \((O,\mathcal {S})\) that satisfies every element of \(\varSigma \) also satisfies \(\varphi \). We use \(\varSigma ^*=\{\varphi \mid \varSigma \ \models \ \varphi \}\) to denote the semantic closure of \(\varSigma \). The implication problem for p-CCs and p-FDs is to decide, given any p-object type, and any set \(\varSigma \cup \{\varphi \}\) of p-CCs and p-FDs over the p-object type, whether \(\varSigma \ \models \ \varphi \) holds.

Example 2

Let \(\varSigma \) be as in Example 1. Further, let \(\sigma \) denote the CC \(\textit{card}(\textit{Dep})\le 2\). Then the highest c-degree \(\beta \) such that \((\sigma ,\beta )\) is implied by \(\varSigma \) is \(\beta _2\). Indeed, \(\varSigma \) does not imply \(\varphi =(\sigma ,\beta _1)\). We can create a p-instance \(\iota \) with three distinct objects that have matching values on department and manager but pairwise different employees, where two of the objects have p-degree \(\alpha _1\) and the remaining object has p-degree \(\alpha _2\). Then the c-degree of \(\textit{card}(\textit{Dep})\le 2\) on \(\iota \) is \(\beta _2\), which means that \((\textit{card}(\textit{Dep})\le 2,\beta _1)\) is violated. Since the c-degrees of \(\textit{Emp}\rightarrow \textit{Dep}\), \(\textit{Dep}\rightarrow \textit{Mgr}\), and \(\textit{card}(\textit{Dep,Mgr})\le 3\) on \(\iota \) are \(\beta _1\), and the c-degree of \(\textit{card}(\textit{Mgr})\le 2\) on \(\iota \) is \(\beta _2\), \(\iota \) satisfies \(\varSigma \), but violates \(\varphi \).    \(\square \)

4.1 Using \(\beta \)-Cuts

Our overarching goal is to extend the combined use of CCs and FDs from certain to uncertain data, while maintaining their good computational properties. The core notion for achieving this goal is that of a \(\beta \)-cut for a given set \(\varSigma \) of p-CCs and p-FDs and a c-degree \(\beta >\beta _{k+1}\). Informally, the \(\beta \)-cut \(\varSigma _\beta \) of \(\varSigma \) contains all CCs and FDs \(\sigma \) such that there is some p-CC or p-FD \((\sigma ,\beta ')\) in \(\varSigma \) where \(\beta '\) is at least \(\beta \). That is, \(\varSigma _\beta =\{\sigma \mid (\sigma ,\beta ')\in \varSigma \text { and }\beta '\ge \beta \}\) is the \(\beta \)-cut of \(\varSigma \). The following theorem shows how the \(\beta \)-cut can be used to reduce the implication problem for p-CCs and p-FDs to the implication problem for traditional CCs and FDs. The theorem does not hold for CCs with lower bounds or for multivalued dependencies.

Theorem 1

Let \(\varSigma \cup \{(\sigma ,\beta )\}\) be a set of p-CCs and p-FDs over \((O,\mathcal {S})\) where \(\beta >\beta _{k+1}\). Then \(\varSigma \,\models \,(\sigma ,\beta )\) if and only if \(\varSigma _\beta \,\models \,\sigma \).

Theorem 1 allows us to apply achievements from CCs and FDs for certain data to p-CCs and p-FDs. It is a major tool to establish our results.
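In code, the \(\beta \)-cut is a one-line filter. The sketch below is our own encoding, representing CCs and FDs as tagged tuples and c-degrees as integer indices (1 for \(\beta _1\)); it reproduces the cuts of the set \(\varSigma \) from Example 1, which Example 3 below discusses. By Theorem 1, deciding \(\varSigma \,\models \,(\sigma ,\beta )\) then reduces to deciding \(\varSigma _\beta \,\models \,\sigma \) over plain CCs and FDs.

```python
def beta_cut(Sigma, beta):
    """Sigma_beta: the plain CCs/FDs whose attached c-degree is at least beta
    (i.e. whose degree index is <= beta in this integer encoding)."""
    return [sigma for (sigma, b) in Sigma if b <= beta]

# Example 1's constraint set: CCs as ("card", X, b), FDs as ("fd", X, Y).
SIGMA = [
    (("fd", frozenset({"Emp"}), frozenset({"Dep"})), 1),
    (("card", frozenset({"Dep", "Mgr"}), 3), 1),
    (("fd", frozenset({"Dep"}), frozenset({"Mgr"})), 2),
    (("card", frozenset({"Mgr"}), 2), 2),
]
assert len(beta_cut(SIGMA, 1)) == 2  # Sigma_{beta_1}: two constraints
assert len(beta_cut(SIGMA, 2)) == 4  # Sigma_{beta_2}: all four constraints
```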

Example 3

Let \(\varSigma \) be as in Example 1. Then \(\varSigma _{\beta _1}\) consists of \(\textit{card}(\textit{Dep,Mgr})\le 3\) and \(\textit{Emp}\rightarrow \textit{Dep}\), while \(\varSigma _{\beta _2}\) contains \(\varSigma _{\beta _1}\) and includes \(\textit{card}(\textit{Mgr})\le 2\) and \(\textit{Dep}\rightarrow \textit{Mgr}\). Using knowledge about the interaction of CCs and FDs from relational data [18], we conclude that \(\varSigma _{\beta _1}\) does not imply \(\textit{card}(\textit{Dep})\le 2\), but \(\varSigma _{\beta _2}\) does imply \(\textit{card}(\textit{Dep})\le 2\). Theorem 1 shows then that \(\varSigma \) does not imply \((\textit{card}(\textit{Dep})\le 2,\beta _1)\), but \(\varSigma \) does imply \((\textit{card}(\textit{Dep})\le 2,\beta _2)\). In fact, the possible world \(w_1\) of the p-instance \(\iota \) from Example 2 satisfies \(\varSigma _{\beta _1}\), and violates \(\textit{card}(\textit{Dep})\le 2\).   \(\square \)

4.2 Axiomatic Characterization

A finite axiomatization allows us to effectively enumerate all implied p-CCs and p-FDs, that is, to determine the semantic closure \(\varSigma ^*=\{\varphi \mid \varSigma \,\models \,\varphi \}\) of \(\varSigma \). A finite axiomatization facilitates human understanding of the interaction of the given constraints, and ensures that all opportunities for the use of these constraints in applications can be exploited (Sect. 5). We determine the semantic closure by applying inference rules of the form \(\frac{\text {premise}}{\text {conclusion}}\). For a set \(\mathfrak {R}\) of inference rules let \(\varSigma \vdash _\mathfrak {R}\varphi \) denote the inference of \(\varphi \) from \(\varSigma \) by \(\mathfrak {R}\). That is, there is some sequence \(\sigma _1,\ldots ,\sigma _n\) such that \(\sigma _n=\varphi \) and every \(\sigma _i\) is an element of \(\varSigma \) or is the conclusion that results from an application of an inference rule in \(\mathfrak {R}\) to some premises in \(\{\sigma _1,\ldots ,\sigma _{i-1}\}\). Let \(\varSigma ^+_\mathfrak {R}=\{\varphi \,\mid \,\varSigma \vdash _{\mathfrak {R}}\varphi \}\) be the syntactic closure of \(\varSigma \) under inferences by \(\mathfrak {R}\). \(\mathfrak {R}\) is sound (complete) if for every set \(\varSigma \) over every \((O,\mathcal {S})\) we have \(\varSigma ^+_\mathfrak {R}\subseteq \varSigma ^*\) (\(\varSigma ^*\subseteq \varSigma ^+_\mathfrak {R}\)). The (finite) set \(\mathfrak {R}\) is a (finite) axiomatization if \(\mathfrak {R}\) is both sound and complete. Table 1 shows an axiomatization \(\mathfrak {C}\) for p-CCs and p-FDs. Here, \((O,\mathcal {S})\) is an arbitrarily given p-object type, \(X,Y\subseteq O\), b is a positive integer, \(\beta ,\beta '\in \mathcal {S}^T\) are c-degrees, and \(\sigma \) uniformly denotes either some CC or FD. In particular, \(\beta _{k+1}\) denotes the bottom c-degree in \(\mathcal {S}^T\).

Table 1. Finite axiomatization of p-CCs and p-FDs

Theorem 2

The set \(\mathfrak {C}\) forms a finite axiomatization for the implication of possibilistic cardinality constraints and functional dependencies.   \(\square \)

The application of inference rules in \(\mathfrak {C}\) from Table 1 is illustrated next.

Example 4

Consider \(\varSigma \) from Example 1. Applying pullback to \((\textit{Dep}\rightarrow \textit{Mgr},\beta _2)\) and \((\textit{card}(\textit{Mgr})\le 2,\beta _2)\) results in \((\textit{card}(\textit{Dep})\le 2,\beta _2)\in \varSigma ^+_{\mathfrak {C}}\). For an inference of \((\textit{card}(\textit{Emp,Mgr})\le 1,\beta _1)\) consider the following steps. Applying reflexivity infers \((\textit{Emp,Mgr}\rightarrow \textit{Emp},\beta _1)\). Then we apply transitivity to \((\textit{Emp,Mgr}\rightarrow \textit{Emp},\beta _1)\) and \((\textit{Emp}\rightarrow \textit{Dep},\beta _1)\) to infer \((\textit{Emp,Mgr}\rightarrow \textit{Dep},\beta _1)\). Next we apply extension to \((\textit{Emp,Mgr}\rightarrow \textit{Dep},\beta _1)\) to infer \((\textit{Emp,Mgr}\rightarrow \textit{Emp,Dep,Mgr},\beta _1)\). The top rule infers \((\textit{card}(\textit{Emp,Dep,Mgr})\le 1,\beta _1)\). Finally, we apply pullback to \((\textit{Emp,Mgr}\rightarrow \textit{Emp,Dep,Mgr},\beta _1)\) and \((\textit{card}(\textit{Emp,Dep,Mgr})\le 1,\beta _1)\) to infer \((\textit{card}(\textit{Emp,Mgr})\le 1,\beta _1)\in \varSigma ^+_\mathfrak {C}\).    \(\square \)
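For readability, the same inference can be laid out as a derivation sequence \(\sigma _1,\ldots ,\sigma _6\); this is our own compact rendering of the steps just described, with the rule names from Table 1.

$$\begin{array}{lll} \sigma _1: &{} (\textit{Emp,Mgr}\rightarrow \textit{Emp},\beta _1) &{} \text {reflexivity}\\ \sigma _2: &{} (\textit{Emp}\rightarrow \textit{Dep},\beta _1) &{} \text {element of }\varSigma \\ \sigma _3: &{} (\textit{Emp,Mgr}\rightarrow \textit{Dep},\beta _1) &{} \text {transitivity: }\sigma _1,\sigma _2\\ \sigma _4: &{} (\textit{Emp,Mgr}\rightarrow \textit{Emp,Dep,Mgr},\beta _1) &{} \text {extension: }\sigma _3\\ \sigma _5: &{} (\textit{card}(\textit{Emp,Dep,Mgr})\le 1,\beta _1) &{} \text {top}\\ \sigma _6: &{} (\textit{card}(\textit{Emp,Mgr})\le 1,\beta _1) &{} \text {pullback: }\sigma _4,\sigma _5 \end{array}$$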

4.3 Algorithmic Characterization

While \(\mathfrak {C}\) enables us to enumerate all p-CCs and p-FDs that are implied by a set \(\varSigma \) of p-CCs and p-FDs, in practice it often suffices to decide whether a given p-CC or p-FD \(\varphi \) is implied by \(\varSigma \). Enumerating all implied constraints and checking whether \(\varphi \) is among them is inefficient and makes no use of \(\varphi \) itself. However, our axiomatization \(\mathfrak {C}\) provides the insight needed to develop efficient algorithms for deciding the associated implication problem.

First, Theorem 1 tells us that the implication of some p-CC or p-FD \((\sigma ,\beta )\) by \(\varSigma \) can be decided by considering the \(\beta \)-cut \(\varSigma _\beta \). If \(\sigma \) denotes an FD \(X\rightarrow Y\), then our axiomatization \(\mathfrak {C}\) tells us that the decision only depends on the FDs in \(\varSigma _\beta \) and the cardinality constraints \(\textit{card}(Z)\le 1\in \varSigma _\beta \), as each such constraint implies the FD \(Z\rightarrow O\in \varSigma ^*_{\beta }\). For a given set \(\varSigma \) of cardinality constraints and functional dependencies, let \(\varSigma [\text {FD}]\) denote the set of FDs in \(\varSigma \) together with the FDs \(Z\rightarrow O\) for every \(\textit{card}(Z)\le 1\in \varSigma \). The p-FD \((X\rightarrow Y,\beta )\) is therefore implied by \(\varSigma \) if and only if the FD \(X\rightarrow Y\) is implied by \(\varSigma _\beta [\text {FD}]\). The latter condition is equivalent to Y being a subset of the attribute set closure \(X^+_{\varSigma _\beta [\text {FD}]}=\{A\in O\mid {\varSigma _\beta [\text {FD}]} \,\models \,X\rightarrow A\}\), which can be computed in time linear in the size of the input set \(\varSigma _\beta [\text {FD}]\) [1]. This shows condition (i) in Theorem 3 below. If \(\sigma \) denotes a cardinality constraint \(\textit{card}(X)\le b\), then our axiomatization \(\mathfrak {C}\) tells us that the decision only depends on the existence of some cardinality constraint \(\textit{card}(Y)\le b'\in \varSigma _\beta \) such that \(Y\subseteq X^+_{\varSigma _\beta [\text {FD}]}\) and \(b'\le b\). The clause \(b'\le b\) follows from the relax rule, and the clause \(Y\subseteq X^+_{\varSigma _\beta [\text {FD}]}\) follows from the pullback rule and the fact that \(X^+_{\varSigma _\beta [\text {FD}]}\) is the maximal subset of O that is functionally determined by X given \(\varSigma _\beta [\text {FD}]\). This shows condition (ii) in Theorem 3 below.

Theorem 3

Let \(\varSigma \) denote a set of p-CCs and p-FDs over \((O,\mathcal {S})\) with \(|\mathcal {S}|=k+1\). Then (i) \(\varSigma \) implies \((X\rightarrow Y,\beta )\) if and only if \(Y\subseteq X^+_{\varSigma _\beta [\text {FD}]}\), and (ii) \(\varSigma \) implies \((\textit{card}(X)\le b,\beta )\) if and only if \(X^+_{\varSigma _\beta [\text {FD}]}=O\), or there is some \(\textit{card}(Y)\le b'\in \varSigma _\beta \) such that \(Y\subseteq X^+_{\varSigma _\beta [FD]}\) and \(b'\le b\).

The worst-case complexity bound in the following result follows from the well-known fact that the computation of \(X^+_{\varSigma [FD]}\) is linear in the total number of attribute occurrences in \(\varSigma [FD]\) [1], and that the size of \(\varSigma [FD]\) is bounded by \(|O|\times |\varSigma |\), where |S| denotes the number of elements of a set S.

Corollary 1

An instance \(\varSigma \,\models \,\varphi \) of the implication problem for p-CCs and p-FDs can be decided in time \(\mathcal {O}(|O|\times |\varSigma \cup \{\varphi \}|)\).   \(\square \)

We illustrate the use of Theorem 3 on our running example.

Example 5

Let \(\varSigma \) be as in Example 1. Then we can use Theorem 3 to decide whether the p-CC \((\textit{card}(\textit{Dep})\le 2,\beta _2)\) is implied by \(\varSigma \). Indeed, \(\{\textit{Dep}\}^+_{\varSigma _{\beta _2}[\text {FD}]}=\{\textit{Dep,Mgr}\}\) and \(\textit{card}(\textit{Mgr})\le 2\in \varSigma _{\beta _2}\). Similarly, \(\varSigma \) implies \((\textit{card}(\textit{Emp,Mgr})\le 1,\beta _1)\) since \(\{\textit{Emp,Mgr}\}^+_{\varSigma _{\beta _1}[\text {FD}]}=O\).   \(\square \)
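The decision procedure behind Theorem 3 can be sketched in a few lines of Python. This is our own illustration, reusing the tagged-tuple encoding of \(\varSigma \) from the \(\beta \)-cut sketch in Sect. 4.1; the naive fixpoint computation of the closure stands in for the linear-time algorithm of [1]. The assertions replicate Examples 3 and 5.

```python
O = frozenset({"Emp", "Dep", "Mgr"})
SIGMA = [
    (("fd", frozenset({"Emp"}), frozenset({"Dep"})), 1),
    (("card", frozenset({"Dep", "Mgr"}), 3), 1),
    (("fd", frozenset({"Dep"}), frozenset({"Mgr"})), 2),
    (("card", frozenset({"Mgr"}), 2), 2),
]

def fds_of(cut):
    """Sigma[FD]: the FDs in the cut, plus X -> O for every card(X) <= 1."""
    fds = []
    for (kind, X, rhs) in cut:
        if kind == "fd":
            fds.append((X, rhs))
        elif kind == "card" and rhs == 1:
            fds.append((X, O))  # card(X) <= 1 implies X -> O
    return fds

def closure(X, fds):
    """Attribute set closure X+ under a set of FDs (naive fixpoint)."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs) in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def implies(Sigma, phi, beta):
    """Decide Sigma |= (phi, beta) via the beta-cut and Theorem 3."""
    cut = [s for (s, b) in Sigma if b <= beta]
    kind, X, rhs = phi
    Xplus = closure(X, fds_of(cut))
    if kind == "fd":                       # condition (i)
        return rhs <= Xplus
    return Xplus == O or any(              # condition (ii)
        k == "card" and Y <= Xplus and b2 <= rhs
        for (k, Y, b2) in cut)

# Example 5: (card(Dep) <= 2, .) is implied at beta_2 but not at beta_1
assert implies(SIGMA, ("card", frozenset({"Dep"}), 2), 2)
assert not implies(SIGMA, ("card", frozenset({"Dep"}), 2), 1)
assert implies(SIGMA, ("card", frozenset({"Emp", "Mgr"}), 1), 1)
```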

5 Applications

We give a series of examples that illustrate core data processing areas on which our solutions have an impact. These include more efficient update and query operations, as well as schema decompositions to avoid data redundancy.

Non-redundant Constraint Maintenance. Constraints ensure data integrity. Whenever database instances are updated, it must be validated that the updated instance satisfies all the given constraints. Data integrity therefore comes at the cost of enforcing it. However, it is redundant to validate any implied constraints, because every instance that satisfies the remaining constraints already satisfies the implied constraints. Unnecessary costs for implied constraints are removed by computing a non-redundant cover of the given constraint set. This is done by successively removing any constraint \(\sigma \in \varSigma \) from \(\varSigma \) whenever \(\varSigma -\{\sigma \}\) implies \(\sigma \). Having an efficient algorithm to decide implication means that we also have an efficient algorithm to compute a non-redundant set of constraints; a sketch follows this paragraph. Note that the time complexity refers to the schema size, which is negligible in comparison to the size of the instance. Furthermore, the larger database instances are, the more time we save by validating non-redundant sets of constraints. We now illustrate these ideas on our running example from the introduction. Some of the p-CCs and p-FDs satisfied by the p-instance in Fig. 1 include: \((\textit{Emp}\rightarrow \textit{Dep},\beta _1)\), \((\textit{card}(\textit{Dep})\le 3,\beta _1)\), \((\textit{card}(\textit{Mgr})\le 3,\beta _1)\), \((\textit{card}(\textit{Emp})\le 2,\beta _1)\), \((\textit{card}(\textit{Emp,Dep})\le 2,\beta _1)\), \((\textit{Emp}\rightarrow \textit{Dep},\beta _2)\), \((\textit{Dep}\rightarrow \textit{Mgr},\beta _2)\), \((\textit{Emp}\rightarrow \textit{Mgr},\beta _2)\), \((\textit{card}(\textit{Dep})\le 2,\beta _2)\), \((\textit{card}(\textit{Mgr})\le 2,\beta _2)\), \((\textit{card}(\textit{Dep,Mgr})\le 2,\beta _2)\), and \((\textit{card}(\textit{Emp, Mgr})\le 3,\beta _2)\). This set is redundant, and a non-redundant subset that implies all constraints of the given set is shown in Fig. 2.
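As a sketch of the cover computation (our own illustration, reusing the implies function from the Sect. 4.3 sketch): each removal test is one linear-time implication check, so the whole cover is computed in time quadratic in the number of constraints.

```python
def nonredundant_cover(Sigma):
    """Successively drop any (sigma, beta) that the remaining constraints imply."""
    cover = list(Sigma)
    for entry in list(cover):
        rest = [e for e in cover if e is not entry]
        sigma, beta = entry
        if implies(rest, sigma, beta):
            cover = rest  # entry is redundant; continue with the smaller set
    return cover
```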

Query Optimization. Knowing which constraints hold on a given instance also assists us with making the evaluation of queries more efficient. Take, for example, the query

SELECT DISTINCT Emp FROM Work WHERE p-degree=\(\alpha _1\);

and assume it is evaluated on the p-instance from Fig. 1. Since the p-instance satisfies the p-CCs and p-FDs in Fig. 2, and these constraints imply the p-CC \((\textit{card}(\textit{Emp})\le 1,\beta _2)\), a query optimizer that can reason about our constraints is able to conclude that the DISTINCT clause in the query above is superfluous. Eliminating this clause can save considerable evaluation time, because sorting tuples and removing duplicates are expensive operations. For another query evaluated on the same p-instance consider

SELECT Dep, COUNT(Emp) FROM Work WHERE p-degree=\(\alpha _1\)

GROUP BY Dep HAVING COUNT(Emp)\(\le 3\);

which lists the departments together with the number of their ‘certain’ employees, provided that number does not exceed 3. A query optimizer able to determine that the p-CC \((\textit{card}(\textit{Dep})\le 3,\beta _2)\) is implied by the satisfied set of p-CCs and p-FDs can remove the HAVING clause from the query without affecting the result.
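For illustration, assuming the optimizer applies both inferences above, the two rewritten queries are simply:

SELECT Emp FROM Work WHERE p-degree=\(\alpha _1\);

SELECT Dep, COUNT(Emp) FROM Work WHERE p-degree=\(\alpha _1\) GROUP BY Dep;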

Removing Data Redundancy by Pivoting. The goal of pivoting is to decompose object schemata at design time in an effort to reduce data redundancy and optimize constraint validation time during the lifetime of the target database. We briefly use our running example to illustrate the impact of possibilistic constraints on pivoting. For this purpose, consider again the (possible worlds of the) p-instance in Fig. 2.

Each occurrence of the Mgr-value Federer in world \(w_1\) is redundant in the sense that any update of this occurrence to a different value would result in a violation of the p-FD \((\textit{Dep}\rightarrow \textit{Mgr},\beta _2)\). In contrast, the occurrence of Federer in \(w_2\) is not redundant, because the p-FD \((\textit{Dep}\rightarrow \textit{Mgr},\beta _2)\) only applies to objects with p-degree \(\alpha _1\). In other words, we could decompose the schema Work into the two schemata {Dep,Mgr,ID1} and {Emp,ID1} for objects with p-degree \(\alpha _1\). For objects with p-degree \(\alpha _2\) we could decompose Work into the two schemata {Emp,Dep,ID2} and {Mgr,ID2}, based on the p-FD \((\textit{Emp}\rightarrow \textit{Dep},\beta _1)\). That is, our framework enables us to first apply a horizontal decomposition of the given database instance into \(w_1\) and \(w_2-w_1\), and then apply traditional pivoting to decompose the schemata with respect to the \(\beta \)-cuts \(\varSigma _{\beta _2}\) and \(\varSigma _{\beta _1}\), respectively. The resulting decomposition of the p-instance from Fig. 1 would look like:

(figure: the decomposed instances over \(\{\textit{Dep},\textit{Mgr},\textit{ID}1\}\), \(\{\textit{Emp},\textit{ID}1\}\), \(\{\textit{Emp},\textit{Dep},\textit{ID}2\}\), and \(\{\textit{Mgr},\textit{ID}2\}\))

in which all redundant data value occurrences have been removed. In addition, the original cardinality constraint \((\textit{card}(\textit{Dep,Mgr})\le 2,\beta _2)\) now becomes a cardinality constraint stipulating that each ID1 value in the \(\{\textit{Dep},\textit{Mgr},\textit{ID}1\}\) instance should occur in at least 1 and at most 2 objects of the \(\{\textit{Emp},\textit{ID}1\}\) instance.

6 Armstrong Instances and Representations

We establish computational support for the acquisition of p-CCs and p-FDs that are meaningful in a given application domain. A major inhibitor to the acquisition is the mismatch in expertise between business analysts and domain experts. The former know database concepts but not the domain, while the latter know the domain but not database concepts. To facilitate effective communication between them, Armstrong instances serve as data samples that perfectly represent the current set of constraints. We will sketch how to compute Armstrong instances for any given set of p-CCs and p-FDs, which analysts and experts can jointly inspect to consolidate the set of meaningful constraints.

We first restate the original definition of an Armstrong database [12] in our context. A p-instance \(\iota \) is said to be Armstrong for a given set \(\varSigma \) of p-CCs and p-FDs on a given p-object type \((O,\mathcal {S})\) if and only if for all p-CCs and p-FDs \(\varphi \) over \((O,\mathcal {S})\) it is true that \(\iota \) satisfies \(\varphi \) if and only if \(\varSigma \) implies \(\varphi \). As such, Armstrong p-instances exhibit for each cardinality constraint and functional dependency the largest c-degree for which it is implied by the given set \(\varSigma \).

Example 6

The p-instance from Fig. 1 is Armstrong for the set of p-CCs and p-FDs from Fig. 2.   \(\square \)

We will now explain how to compute an Armstrong p-instance \(\iota \) for an arbitrarily given set \(\varSigma \) of p-CCs and p-FDs.

For every attribute subset X and every c-degree \(\beta _i\), we compute the smallest \(b_{X,i}\) such that \((\textit{card}(X)\le b_{X,i},\beta _i)\) is implied by \(\varSigma \). We start with \(b_{X,i}=\infty \), and set \(b_{X,i}=1\) if \(X^+_{\varSigma _{\beta _i}[\text {FD}]}=O\) holds. Otherwise, we set \(b_{X,i}\) to b whenever there is some \(\textit{card}(Y)\le b\in \varSigma _{\beta _i}\) such that \(Y\subseteq X^+_{\varSigma _{\beta _i}[\text {FD}]}\) and \(b<b_{X,i}\), see Theorem 3 (ii). Now it suffices to introduce \(b_{X,i}\) objects into \(\iota \) with p-degree \(\alpha _{k+1-i}\), matching values \(c_{A,i}\) on all \(A\in X\), and unique values on all \(A\notin X\). This ensures that all p-CCs implied by \(\varSigma \) are satisfied in \(\iota \) and all p-CCs not implied by \(\varSigma \) are violated. Several optimizations reduce the number of objects in an Armstrong p-instance: If \(b_{X,i}=1\), no objects need to be introduced in \(\iota \). If \(Y\subset X\) and \(b_{Y,i}=b_{X,i}\), then it suffices to introduce the \(b_{X,i}\) objects for X, because they also violate every \((\textit{card}(Y)\le b,\beta _i)\) with \(b<b_{Y,i}\). For \(j>i\) and \(b_{X,j}\le b_{X,i}\), where \(b_{X,j}\) objects with p-degree \(\alpha _{k+1-j}\) have already been introduced, it suffices to introduce \(b_{X,i}-b_{X,j}\) further objects of p-degree \(\alpha _{k+1-i}\), again with matching values \(c_{A,j}\) on all \(A\in X\) and unique values on all \(A\notin X\).
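The computation of the bounds \(b_{X,i}\) is a direct application of Theorem 3 (ii). The sketch below is our own illustration, reusing closure and fds_of from the Sect. 4.3 sketch; for brevity it omits the pruning optimizations just described.

```python
from itertools import combinations

def powerset(attrs):
    attrs = sorted(attrs)
    for r in range(len(attrs) + 1):
        yield from (frozenset(c) for c in combinations(attrs, r))

def cardinality_bounds(Sigma, O, k):
    """b_{X,i}: the smallest b such that Sigma implies (card(X) <= b, beta_i)."""
    bounds = {}
    for i in range(1, k + 1):
        cut = [s for (s, b) in Sigma if b <= i]
        fds = fds_of(cut)
        for X in powerset(O):
            Xplus = closure(X, fds)
            if Xplus == O:
                bounds[X, i] = 1  # X is a key of the world for beta_i
            else:
                cands = [b for (kind, Y, b) in cut
                         if kind == "card" and Y <= Xplus]
                bounds[X, i] = min(cands, default=float("inf"))
    return bounds

bounds = cardinality_bounds(SIGMA, O, 2)
assert bounds[frozenset({"Dep", "Mgr"}), 1] == 3  # from card(Dep,Mgr) <= 3
assert bounds[frozenset({"Dep", "Mgr"}), 2] == 2  # from card(Mgr) <= 2
```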

Fig. 3. Attribute sets X with cardinalities \(b_{X,i}\) for \(i=1,2\) from left to right

As an illustration, Fig. 3 shows for all attribute subsets X and c-degrees \(\beta _1\) and \(\beta _2\) the associated cardinalities \(b_{X,i}\) for our running example from Example 1. The bold attribute sets are those that require the insertion of objects into an Armstrong p-instance for the given \(\varSigma \).

In general, we still need to ensure that all p-FDs not implied by \(\varSigma \) are violated. For every attribute \(A\in O\) and every c-degree \(\beta _i\), we compute all attribute subsets X that are maximal with the property \(A\notin X^+_{\varSigma _{\beta _i}[\text {FD}]}\), that is, for all \(B\in O-(XA)\) we have \(A\in (XB)^+_{\varSigma _{\beta _i}[\text {FD}]}\). These sets are known as the maximal sets for \(\varSigma _{\beta _i}[\text {FD}]\) and can be computed by an algorithm given in [36]. For each set X that is maximal with respect to \(\varSigma _{\beta _i}[\text {FD}]\), we introduce two objects with p-degree \(\alpha _{k+1-i}\), matching values \(c_{A,i}\) on all \(A\in X\), and unique values on all \(A\notin X\). Again, some optimizations reduce the number of objects in the final Armstrong p-instance: If X is maximal with respect to both \(\varSigma _{\beta _i}[\text {FD}]\) and \(\varSigma _{\beta _j}[\text {FD}]\) where \(i<j\), then it suffices to introduce the two objects with p-degree \(\alpha _{k+1-j}\). Finally, we do not need to introduce the two objects for the maximal set X if \(b_{Y,j}\) objects have previously been introduced for some \(j\ge i\) where \(X\subseteq Y\) and X is only maximal for attributes \(A\notin Y-X\) with respect to \(\varSigma _{\beta _i}[\text {FD}]\).
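At this schema size, a brute-force computation of the maximal sets is already practical. The sketch below is our own illustration, reusing powerset, closure, fds_of, and beta_cut from the earlier sketches; [36] gives a proper algorithm. The assertion reproduces one entry of the table below.

```python
def maximal_sets(fds, O, A):
    """All X maximal with the property that A is not in the closure of X."""
    candidates = [X for X in powerset(O - {A})
                  if A not in closure(X, fds)]
    # keep only candidates not properly contained in another candidate
    return [X for X in candidates if not any(X < Y for Y in candidates)]

# For Sigma_{beta_2}[FD] and attribute Mgr the only maximal set is the empty set:
fds2 = fds_of(beta_cut(SIGMA, 2))
assert maximal_sets(fds2, O, "Mgr") == [frozenset()]
```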

Continuing with the construction of an Armstrong p-instance for the given set \(\varSigma \) from Example 1, the following table lists, for each attribute and each \(\varSigma _{\beta _i}[\text {FD}]\), the maximal attribute subsets (here, exactly one maximal set in each case).

 | Emp | Dep | Mgr
\(\varSigma _{\beta _1}\) | \(\{\textit{Dep,Mgr}\}\) | \(\{\textit{Mgr}\}\) | \(\{\textit{Emp,Dep}\}\)
\(\varSigma _{\beta _2}\) | \(\{\textit{Dep,Mgr}\}\) | \(\{\textit{Mgr}\}\) | \(\emptyset \)
Indeed, only the set \(\{\textit{Mgr}\}\) that is maximal for \(\varSigma _{\beta _2}[\text {FD}]\) requires us to insert two objects. In particular, the maximal set \(X=\{\textit{Dep,Mgr}\}\) for \(\varSigma _{\beta _2}[\text {FD}]\) is already covered by the \(b_{X,2}=2\) objects introduced previously, see Fig. 3, and the maximal set \(\emptyset \) is covered because the p-FD \((\emptyset \rightarrow \textit{Mgr},\beta _2)\) is already violated after two objects with different \(\textit{Mgr}\) values have been introduced. Similarly, all the maximal sets for \(\varSigma _{\beta _1}[\text {FD}]\) have already been covered.

The outlined algorithm ensures that Armstrong p-instances exist for every given set \(\varSigma \) of p-CCs and p-FDs, and that they are computed in time exponential in the size of the input. Since there are cases where the minimum number of required objects is exponential in the given input, as is known for traditional FDs [2], no polynomial-time algorithm can exist. However, as our running example illustrates, we still need to deal with the following case, which occurs frequently in practice. There are attribute subsets X and c-degrees \(\beta _i\) such that \(b_{X,i}=\infty \), that is, there is no finite upper bound b such that \((\textit{card}(X)\le b,\beta _i)\) is implied by the input \(\varSigma \). It follows that every Armstrong p-instance is necessarily infinite, which seems to make our acquisition strategy unfit for its intended purpose. However, we apply the following representation trick to overcome this challenge. Instead of introducing \(b_{X,i}\) different objects with matching values \(c_{A,i}\) on all \(A\in X\) and unique values on all \(A\notin X\), we introduce one single object with \(c_{A,i}\) on all \(A\in X\) and \(*\) on all \(A\notin X\), plus its cardinality \(b_{X,i}\) in a new column card. This single object represents the \(b_{X,i}\) different objects; in particular, \(*\) represents unique values in all columns outside of X. If the objects result from a maximal set, then the cardinality is simply 2. Representations resulting from this transformation of (finite or infinite) Armstrong p-instances are called Armstrong p-representations for \(\varSigma \). We can show that the optimizations applied in our computation result in representations whose size is bounded in terms of a minimum-sized Armstrong p-representation (an Armstrong p-representation with the least number of objects) and the number of given constraints.
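To sketch the representation trick in code (our own illustration, applied to the bounds computed in the earlier sketch; without the pruning optimizations it may emit more rows than the minimum-sized representation shown below): every bound \(b_{X,i}>1\), finite or infinite, becomes a single representative row.

```python
def representation_rows(bounds, O, k):
    """One row per bound b_{X,i} > 1: a matching constant on X, '*' outside X,
    p-degree alpha_{k+1-i}, and the cardinality b_{X,i} in the extra card column."""
    rows = []
    for (X, i), b in sorted(bounds.items(),
                            key=lambda e: (e[0][1], sorted(e[0][0]))):
        if b == 1:
            continue  # nothing to witness: uniqueness is already implied
        row = {A: (f"c_{A}_{i}" if A in X else "*") for A in sorted(O)}
        row["p-degree"] = f"alpha_{k + 1 - i}"
        row["card"] = b  # may be float('inf'), representing infinitely many objects
        rows.append(row)
    return rows
```

Applied to the running example, this produces rows such as the (Musashimaru, Sumo) row with cardinality \(\infty \) in the representation below; the rows stemming from maximal sets would be added analogously with cardinality 2.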

Theorem 4

Given an arbitrary set \(\varSigma \) of p-CCs and p-FDs over some given p-object type, the outlined algorithm computes an Armstrong p-representation \(\zeta \) for \(\varSigma \) whose size is bounded by that of a minimum-sized Armstrong p-representation \(\zeta ^\textit{min}\) for \(\varSigma \) and the number of elements in \(\varSigma \) as follows: \(|\zeta | \le |\zeta ^\textit{min}|\times (|\varSigma |+|\zeta ^\textit{min}|)\).    \(\square \)

This construction yields the following Armstrong p-representation for the given set \(\varSigma \) of p-CCs and p-FDs from Example 1.

Emp | Dep | Mgr | p-degree | card
\(*\) | Tennis | Federer | \(\alpha _1\) | 2
\(*\) | \(*\) | Gauss | \(\alpha _1\) | 2
\(*\) | Tennis | Federer | \(\alpha _2\) | 1
Musashimaru | Sumo | \(*\) | \(\alpha _2\) | \(\infty \)
\(*\) | \(*\) | Taiho | \(\alpha _2\) | \(\infty \)
We list some of the observations we can make by inspecting this Armstrong p-representation. First of all, the given constraint set \(\varSigma \) has not captured any ‘fully certain’ finite bounds on the cardinalities by which (Emp,Dep)-objects or (Mgr)-objects occur. Indeed, the combination (Musashimaru, Sumo) can occur infinitely many times when ‘somewhat possible’ objects are involved, and the same applies to (Taiho). In contrast, \(\varSigma \) does guarantee the uniqueness of any (Emp)-objects that are ‘fully possible’, and a maximum cardinality of two for any (Dep,Mgr)-objects that are ‘fully possible’. Similarly, no non-trivial FD \(\textit{Mgr}\rightarrow A\) is even ‘somewhat certain’. The FD \(\textit{Dep}\rightarrow \textit{Mgr}\) is only ‘somewhat certain’, because there are two ‘somewhat possible’ occurrences of the (Sumo) department, each in combination with a different manager. While the FD \(\textit{Emp}\rightarrow \textit{Dep}\) is ‘fully certain’, the FD \(\textit{Emp}\rightarrow \textit{Mgr}\) is only ‘somewhat certain’, because there are two ‘somewhat possible’ occurrences of the employee (Musashimaru), each in combination with a different manager.

7 Conclusion and Future Work

Cardinality constraints and functional dependencies naturally co-occur in most aspects of life. Consequently, they have received sustained interest from the conceptual modeling community over the last three decades. In contrast to various previous works, we have studied cardinality constraints and functional dependencies over uncertain data. Uncertainty has been modeled qualitatively by applying the framework of possibility theory. Our results show that cardinality constraints and functional dependencies form a ‘sweet spot’ in terms of both expressivity and good computational behavior, as more expressive classes of constraints behave poorly. In particular, we have established a finite axiomatization and a linear-time algorithm to decide the implication problem associated with our class, and illustrated their applicability to conceptual design, update and query efficiency. We have also established an algorithm that computes an Armstrong representation for every given set of our constraints. These representations embody the exact certainty with which any constraint in our class is currently perceived to hold by data analysts. The analysts can show our Armstrong representations to domain experts in order to jointly consolidate the actual certainty with which cardinality constraints and functional dependencies shall hold in a given application domain.

Our framework opens up several questions for future investigation, including a detailed study and performance tests for our applications, the interaction with yet other constraint classes despite the limits outlined, and empirical evaluations for the usefulness of Armstrong representations. It is further interesting to investigate possibilistic approaches to more expressive data models, such as SQL with partial and duplicate information, XML, RDF, or graph databases.