
1 Introduction

Background. The notion of cardinality constraints is fundamental for understanding the structure and semantics of data. In traditional conceptual modeling, cardinality constraints were already introduced in Chen’s seminal paper [3]. They have attracted significant interest and tool support ever since. Intuitively, a cardinality constraint \(\textit{card}(X)\le b\) stipulates for an attribute set X and a positive integer b that a relation must not contain more than b different tuples that all have matching values on all the attributes in X. For example, bank customers with no more than 5 withdrawals from their bank account per month may qualify for a special interest rate. Traditionally, cardinality constraints empower applications to control the occurrences of certain data, and have applications in data cleaning, integration, modeling, processing, and retrieval, among others.

Example. Relational databases target applications with certain data, such as accounting, inventory, and payroll. Modern applications, such as data integration, information extraction, scientific data management, and financial risk assessment, produce large amounts of uncertain data. For instance, RFID (radio frequency identification) is used to track movements of endangered species of animals, such as wolverines. Here, it is sensible to apply probabilistic databases. Table 1 shows a probabilistic relation (p-relation) over Tracking = {rfid, time, zone}, which is a probability distribution over a finite set of possible worlds, each being a relation. Data patterns occur with different frequency in different worlds. That is, different worlds satisfy different cardinality constraints. For example, the cardinality constraint \(c_1=\textit{card}(\textit{time},\textit{zone})\le 1\) holds in the world \(w_1\), and {time, zone} is therefore a key in this world, while \(c_2=\textit{card}(\textit{time},\textit{zone})\le 2\) holds in the worlds \(w_1\) and \(w_2\). Typically, the likelihood that a cardinality constraint holds in a given application domain, i.e. the constraint’s degree of meaningfulness, should be reflected by its marginal probability. In the example above, \(c_1\) and \(c_2\) have marginal probability 0.75 and 0.9, respectively, and we may write \((\textit{card}(\textit{time},\textit{zone})\le 1,\ge 0.75)\) and \((\textit{card}(\textit{time},\textit{zone})\le 2,\ge 0.9)\) to denote the probabilistic cardinality constraints (pCCs) stating that \(c_1\) holds with probability at least 0.75 and \(c_2\) holds with probability at least 0.9.

Table 1. Probabilistic relation

Applications. pCCs have important applications. Data quality: Foremost, they can express desirable properties of modern application domains that must accommodate uncertain data. This enables database systems to enforce higher levels of consistency in probabilistic databases, as updates to data are questioned when they result in violations of some pCC. Enforcing hard constraints, holding with probability 1, may remove plausible worlds and lead to an incomplete representation. The marginal probability of cardinality constraints can balance the consistency and completeness targets for the quality of an organization’s data. Query estimation: pCCs can be used to obtain lower bounds on the probability by which a given maximum number of answers to a given query will be returned, without having to evaluate the query on any portion of the given, potentially big, database. For example, the query

SELECT rfid FROM Tracking WHERE zone='z2' AND time='09'

asks for the rfid of wolverines recorded in zone z2 at 09am. Reasoning about our pCCs tells us that at most 3 answers will be returned with probability 1, at most 2 answers will be returned with minimum probability 0.9, and at most 1 answer will be returned with minimum probability 0.75. A service provider may return these numbers, or approximate costs derived from them, to a customer, who can make a more informed decision whether to pay for the service. The provider, on the other hand, does not need to utilize unpaid resources for querying the potentially big data source to return the feedback.

Contributions. The applications motivate us to stipulate lower bounds on the marginal probability of cardinality constraints. The main inhibitor for the uptake of pCCs is the identification of the right lower bounds on their marginal probabilities. While it is already challenging to identify traditional cardinality constraints which are semantically meaningful in a given application domain, identifying the right probabilities is an even harder problem. Lower bounds appear to be a realistic compromise here. Our contributions can be summarized as follows. (1) Modeling: We propose pCCs as a natural class of semantic integrity constraints over uncertain data. Their main target is to help organizations derive more value from data by ensuring higher levels of data quality and by assisting with data processing. (2) Reasoning: We characterize the implication problem of pCCs by a simple finite set of Horn rules, as well as a linear-time decision algorithm. This enables organizations to reduce the overhead of data quality management by pCCs to the minimal level necessary. For example, enforcing \((\textit{card}(\textit{rfid})\le 3,\ge 0.9)\), \((\textit{card}(\textit{zone})\le 4,\ge 0.9)\) and \((\textit{card}(\textit{rfid},\textit{zone})\le 3,\ge 0.75)\) together would be partially redundant, as \((\textit{card}(\textit{rfid},\textit{zone})\le 3,\ge 0.75)\) is already implicitly enforced by enforcing \((\textit{card}(\textit{rfid})\le 3,\ge 0.9)\).

Table 2. PC-sketch of Table 1

(3) Acquisition: For acquiring the right marginal probabilities by which pCCs hold, we show how to visualize concisely any given system of pCCs in the form of an Armstrong PC-sketch. Recall that every p-relation can be represented by some PC-table. Here, we introduce Armstrong PC-sketches as finite semantic representations of some possibly infinite p-relation which satisfies every cardinality constraint with the exact marginal probability by which it is currently perceived to hold. Problems with such perceptions are explicitly pointed out by the PC-sketch. For example, Table 2 shows a PC-sketch for the p-relation from Table 1, which is Armstrong for the pCCs satisfied by the p-relation. The sketch shows which patterns of data must occur in how many rows (represented in column card) in which possible worlds (represented by the world identifiers in column \(\iota \)). The symbol \(*\) represents some data value that is unique within each world of the p-relation derived from the sketch. \(\varPi \) defines the probability distribution over the resulting possible worlds. Even when they represent finite p-relations, PC-sketches are still more concise since they only show patterns that matter and how often these occur.

Organization. We discuss related work in Sect. 2. pCCs are introduced in Sect. 3, and reasoning tools for them are established in Sect. 4. These form the foundation for computational support to acquire the correct marginal probabilities in Sect. 5. We conclude and outline future work in Sect. 6. Due to lack of space, all proofs have been made available in the technical report [16].

2 Related Work

Cardinality constraints are one of the most influential contributions conceptual modeling has made to the study of database constraints. They were already present in Chen’s seminal paper [3]. It is no surprise that today they are part of all major languages for data and knowledge modeling, including UML, EER, ORM, XSD, and OWL. Cardinality constraints have been extensively studied in database design [4–9, 12, 14, 15, 18]. For a recent survey, see [19].

Probabilistic cardinality constraints \((\textit{card}(X)\le b,\ge p)\), introduced in this paper, subsume the class of probabilistic keys [2] as the special case where \(b=1\).

For possibilistic cardinality constraints [10], tuples are attributed some degree of possibility, and cardinality constraints some degree of certainty that says to which tuples they apply. In general, possibility theory is a qualitative approach while probability theory is a quantitative approach to uncertainty. Our research thereby complements the qualitative approach to cardinality constraints in [10] by a quantitative approach.

Our contribution extends results on cardinality constraints from traditional relations, which are covered by our framework as the special case where the p-relation consists of only one possible world [1, 6]. As pCCs form a new class of integrity constraints, their associated implication problem and properties of Armstrong p-relations have not been investigated before.

There is also a large body of work on the discovery of “approximate” business rules, such as keys, functional and inclusion dependencies [13]. Here, approximate means that almost all tuples satisfy the given rule; hence allowing for very few exceptions. Our constraints are not approximate since they are either satisfied or violated by the given p-relation or the PC-sketch that represents it.

3 Cardinality Constraints on Probabilistic Databases

Next we introduce some preliminary concepts from probabilistic databases and the central notion of a probabilistic cardinality constraint. We use the symbol \(\mathbb {N}_1^\infty \) to denote the positive integers together with the symbol \(\infty \) for infinity.

A relation schema is a finite set R of attributes A. Each attribute A is associated with a domain \(\textit{dom}(A)\) of values. A tuple t over R is a function that assigns to each attribute A of R an element t(A) from the domain \(\textit{dom}(A)\). A relation over R is a finite set of tuples over R. Relations over R are also called possible worlds of R here. An expression \(\textit{card}(X)\le b\) with some non-empty subset \(X\subseteq R\) and \(b\in \mathbb {N}_1^\infty \) is called a cardinality constraint over R. In what follows, we will always assume that a subset of R is non-empty without mentioning it explicitly. A cardinality constraint \(\textit{card}(X)\le b\) over R is said to hold in a possible world w of R, denoted by \(w\models \textit{card}(X)\le b\), if and only if there are no \(b+1\) different tuples \(t_1,\ldots ,t_{b+1}\in w\) such that for all \(1\le i<j\le b+1\), \(t_i\not =t_j\) and \(t_i(X)=t_j(X)\).
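To make the satisfaction condition concrete, the following is a minimal sketch in Python, under the assumption that a possible world is a collection of distinct tuples, each encoded as a dict from attribute names to values; the function name and encoding are ours, not the paper's.

```python
from collections import Counter

# Minimal sketch of the check w |= card(X) <= b. Each tuple t of the world
# is assumed to be a dict from attribute names to values; b may be math.inf.
def satisfies(world, X, b):
    """True iff no b+1 distinct tuples of `world` agree on all attributes in X."""
    projections = Counter(tuple(t[A] for A in sorted(X)) for t in world)
    return all(count <= b for count in projections.values())
```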

A probabilistic relation (p-relation) over R is a pair \(r=(W,P)\) of a finite non-empty set W of possible worlds over R and a probability distribution \(P:W\rightarrow (0,1]\) such that \(\sum _{w\in W} P(w)=1\) holds.

Table 1 shows a p-relation over relation schema Wolverine = {rfid, time, zone}. World \(w_2\) satisfies the CCs \(\textit{card}(\textit{rfid})\le 3\), \(\textit{card}(\textit{time})\le 3\), \(\textit{card}(\textit{zone})\le 4\), \(\textit{card}(\textit{rfid},\textit{time})\le 3\), \(\textit{card}(\textit{rfid},\textit{zone})\le 3\), and \(\textit{card}(\textit{time},\textit{zone})\le 2\), but violates the CC \(\textit{card}(\textit{time},\textit{zone})\le 1\).

A cardinality constraint \(\textit{card}(X)\le b\) over R is said to hold with probability \(p\in [0,1]\) in the p-relation \(r=(W,P)\) if and only if \(\sum _{w\in W, w\models \textit{card}(X)\le b}P(w)=p\). In other words, the probability of a cardinality constraint in a p-relation is the marginal probability with which it holds in the p-relation. We will now introduce the central notion of a cardinality constraint on probabilistic databases.
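Continuing the sketch above, the marginal probability can be computed by summing the probabilities of the worlds that satisfy the constraint; the (world, probability) pair encoding of a p-relation is our own assumption.

```python
def marginal_probability(p_relation, X, b):
    """Sum of P(w) over all worlds w of the p-relation with w |= card(X) <= b.

    `p_relation` is assumed to be a list of (world, probability) pairs whose
    probabilities sum to 1; reuses `satisfies` from the sketch above."""
    return sum(P for world, P in p_relation if satisfies(world, X, b))
```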

Definition 1

A probabilistic cardinality constraint, or pCC for short, over relation schema R is an expression \((\textit{card}(X)\le b,\ge p)\) where \(X\subseteq R\), \(b\in \mathbb {N}_1^\infty \) and \(p\in [0,1]\). The pCC \((\textit{card}(X)\le b,\ge p)\) over R is said to hold in the p-relation r over R if and only if the probability with which the cardinality constraint \(\textit{card}(X)\le b\) holds in r is at least p.
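Definition 1 then amounts to a single comparison against the marginal probability; a one-line sketch, reusing marginal_probability from above.

```python
def satisfies_pcc(p_relation, X, b, p):
    """(card(X) <= b, >= p) holds iff the marginal probability of
    card(X) <= b in the p-relation is at least p (Definition 1)."""
    return marginal_probability(p_relation, X, b) >= p
```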

Example 1

In our running example over relation schema Wolverine, the p-relation from Table 1 satisfies the set \(\varSigma \) consisting of the following pCCs: \((\textit{card}(\textit{rfid})\le 3,\ge 1)\), \((\textit{card}(\textit{time})\le 3,\ge 1)\), \((\textit{card}(\textit{zone})\le 4,\ge 1)\), \((\textit{card}(\textit{time},\textit{zone})\le 2,\ge 0.9)\), \((\textit{card}(\textit{rfid},\textit{time})\le 1,\ge 0.75)\), \((\textit{card}(\textit{rfid},\textit{zone})\le 2,\ge 0.75)\), as well as \((\textit{card}(\textit{time},\textit{zone})\le 1,\ge 0.75)\). It violates the pCC \((\textit{card}(\textit{rfid},\textit{time})\le 1,\ge 0.9)\).

4 Reasoning Tools

When enforcing sets of pCCs to improve data quality, the overhead they cause must be reduced to a minimal level necessary. In practice, this requires us to reason about pCCs efficiently. We will now establish basic tools for this purpose.

Implication. Let \(\varSigma \cup \{\varphi \}\) denote a finite set of constraints over relation schema R; in particular, \(\varSigma \) is always finite. We say \(\varSigma \) (finitely) implies \(\varphi \), denoted by \(\varSigma \models _{(f)}\varphi \), if every (finite) p-relation r over R that satisfies \(\varSigma \) also satisfies \(\varphi \). We use \(\varSigma ^*_{(f)}=\{\varphi \mid \varSigma \models _{(f)}\varphi \}\) to denote the (finite) semantic closure of \(\varSigma \). For a class \(\mathcal {C}\) of constraints, the (finite) \(\mathcal {C}\)-implication problem is to decide for a given relation schema R and a given set \(\varSigma \cup \{\varphi \}\) of constraints in \(\mathcal {C}\) over R, whether \(\varSigma \) (finitely) implies \(\varphi \). The finite implication problem and the implication problem coincide for the class of pCCs, and we thus speak of the implication problem.

Axioms. We determine the semantic closure by applying inference rules of the form \(\frac{\text {premise}}{\text {conclusion}}\). For a set \(\mathfrak {R}\) of inference rules let \(\varSigma \vdash _\mathfrak {R}\varphi \) denote the inference of \(\varphi \) from \(\varSigma \) by \(\mathfrak {R}\). That is, there is some sequence \(\sigma _1,\ldots ,\sigma _n\) such that \(\sigma _n=\varphi \) and every \(\sigma _i\) is an element of \(\varSigma \) or is the conclusion that results from an application of an inference rule in \(\mathfrak {R}\) to some premises in \(\{\sigma _1,\ldots ,\sigma _{i-1}\}\). Let \(\varSigma ^+_\mathfrak {R}=\{\varphi \,\mid \,\varSigma \vdash _{\mathfrak {R}}\varphi \}\) be the syntactic closure of \(\varSigma \) under inferences by \(\mathfrak {R}\). \(\mathfrak {R}\) is sound (complete) if for every set \(\varSigma \) over every relation schema R we have \(\varSigma ^+_\mathfrak {R}\subseteq \varSigma ^*\) (\(\varSigma ^*\subseteq \varSigma ^+_\mathfrak {R}\)). The (finite) set \(\mathfrak {R}\) is a (finite) axiomatization if \(\mathfrak {R}\) is both sound and complete. In the set \(\mathfrak {P}\) of inference rules from Table 3, R denotes the underlying relation schema, X and Y form attribute subsets of R, \(b,b'\in \mathbb {N}_1^\infty \), and p, q as well as \(p+q\) are probabilities. Due to lack of space we omit the soundness and completeness proof of the following theorem, see [16].

Table 3. Axiomatization \(\mathfrak {P}=\{\mathcal {D},\mathcal {Z},\mathcal {U},\mathcal {S},\mathcal {B},\mathcal {P}\}\)

Theorem 1

\(\mathfrak {P}\) forms a finite axiomatization for the implication of probabilistic cardinality constraints.

Example 2

The set \(\varSigma \) of pCCs from Example 1 implies \(\varphi =(\textit{card}(\textit{rfid},\textit{time})\le 4,\ge 0.8)\), but not \(\varphi '=(\textit{card}(\textit{rfid},\textit{time})\le 1,\ge 0.8)\). In fact, \(\varphi \) can be inferred from \(\varSigma \) by applying \(\mathcal {S}\) to \((\textit{card}(\textit{rfid})\le 3,\ge 1)\) to infer \((\textit{card}(\textit{rfid},\textit{time})\le 3,\ge 1)\), applying \(\mathcal {B}\) to this pCC to infer \((\textit{card}(\textit{rfid},\textit{time})\le 4,\ge 1)\), and then applying \(\mathcal {P}\) to weaken the probability bound from 1 to 0.8.

If a data set is validated against a set \(\varSigma \) of pCCs, then the data set does not need to be validated against any pCC \(\varphi \) implied by \(\varSigma \). The larger the data set, the more time is saved by avoiding redundant validation checks.

Algorithms. In practice it is often unnecessary to determine all implied pCCs. In fact, the implication problem for pCCs has as input \(\varSigma \cup \{\varphi \}\), and the question is whether \(\varSigma \) implies \(\varphi \). Computing \(\varSigma ^*\) and checking whether \(\varphi \in \varSigma ^*\) is hardly efficient. Instead, we will now establish a linear-time algorithm for computing the maximum probability p such that \(\varphi =(\textit{card}(X)\le b,\ge p)\) is implied by \(\varSigma \). The following theorem provides the foundation for the algorithm [16].

Theorem 2

Let \(\varSigma \cup \{(\textit{card}(X)\le b,\ge p)\}\) denote a set of pCCs over relation schema R. Then \(\varSigma \) implies \((\textit{card}(X)\le b,\ge p)\) if and only if (i) \(X=R\) or (ii) \(p=0\) or (iii) \(b=\infty \) or (iv) there is some \((\textit{card}(Z)\le b',\ge q)\in \varSigma \) such that \(Z\subseteq X\), \(b'\le b\), and \(q\ge p\).

Example 3

Continuing Example 2, we can apply Theorem 2 directly to see that \(\varSigma \) implies \(\varphi =(\textit{card}(\textit{rfid},\textit{time})\le 4,\ge 0.8)\). Indeed, the pCC \((\textit{card}(\textit{rfid})\le 3,\ge 1)\in \varSigma \) satisfies the sufficient conditions of Theorem 2 to imply \(\varphi \), since \(\{\textit{rfid}\}\subseteq \{\textit{rfid},\textit{time}\}\), \(3\le 4\), and \(1\ge 0.8\).

Theorem 2 motivates the following algorithm that returns for a given cardinality constraint \(\textit{card}(X)\le b\) the maximum probability p by which \((\textit{card}(X)\le b,\ge p)\) is implied by a given set \(\varSigma \) of pCCs over R: If \(X=R\) or \(b=\infty \), then we return probability 1; Otherwise, starting with \(p=0\) the algorithm scans all input pCCs \((\textit{card}(Z)\le b',\ge q)\in \varSigma \) and sets p to q whenever q is larger than the current p, X contains Z, and \(b'\le b\). Here, \(||\varSigma ||\) denotes the total number of attribute occurrences in \(\varSigma \) plus the logarithms of the integer bounds occurring in \(\varSigma \), and we assume without loss of generality that \(\infty \) does not occur.
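The scan can be sketched in a few lines of Python; the triple encoding (Z, b', q) of a pCC is our own assumption, and math.inf stands in for \(\infty \).

```python
import math

def max_implied_probability(R, sigma, X, b):
    """Maximum p such that (card(X) <= b, >= p) is implied by sigma,
    following the conditions of Theorem 2 with one linear scan.

    sigma is a list of triples (Z, b_prime, q), one per pCC (card(Z) <= b', >= q)."""
    if set(X) == set(R) or b == math.inf:   # conditions (i) and (iii)
        return 1.0
    p = 0.0                                 # condition (ii) as the default
    for Z, b_prime, q in sigma:             # condition (iv): the scan
        if set(Z) <= set(X) and b_prime <= b and q > p:
            p = q
    return p
```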

Theorem 3

On input \((R,\varSigma ,\textit{card}(X)\le b)\) our algorithm returns in \(\mathcal {O}(||\varSigma \cup \{(\textit{card}(X)\le b,\ge p)\}||)\) time the maximum probability p with which \((\textit{card}(X)\le b,\ge p)\) is implied by \(\varSigma \).

Example 4

Continuing Example 1, we can apply our algorithm to the schema Wolverine, pCC set \(\varSigma \), and the cardinality constraint \(\textit{card}(\textit{rfid},\textit{time})\le 4\), which gives us the maximum probability 1 for which it is implied by \(\varSigma \).

Theorem 3 allows us to decide the associated implication problem efficiently, too. Given \(R,\varSigma ,(\textit{card}(X)\le b,\ge p)\) as an input to the implication problem, we use our algorithm to compute \(p':=\max \{q\mid \varSigma \models (\textit{card}(X)\le b,\ge q)\}\) and return an affirmative answer if and only if \(p'\ge p\).
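The decision procedure is then a single comparison; the hedged sketch below replays Examples 2, 4, and 5 on the set \(\varSigma \) of Example 1, with attribute names abbreviated for brevity.

```python
def implies(R, sigma, X, b, p):
    """sigma implies (card(X) <= b, >= p) iff p does not exceed the
    maximum implied probability p' (the corollary below)."""
    return max_implied_probability(R, sigma, X, b) >= p

# Sigma from Example 1, abbreviating rfid, time, zone to r, t, z.
R = {"r", "t", "z"}
SIGMA = [({"r"}, 3, 1.0), ({"t"}, 3, 1.0), ({"z"}, 4, 1.0),
         ({"t", "z"}, 2, 0.9), ({"r", "t"}, 1, 0.75),
         ({"r", "z"}, 2, 0.75), ({"t", "z"}, 1, 0.75)]

assert max_implied_probability(R, SIGMA, {"r", "t"}, 4) == 1.0  # Example 4
assert implies(R, SIGMA, {"r", "t"}, 4, 0.8)                    # phi of Example 2
assert not implies(R, SIGMA, {"r", "t"}, 1, 0.8)                # phi' of Example 2
```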

Corollary 1

The implication problem of probabilistic cardinality constraints can be decided in linear time.

Example 5

Continuing Example 4 we can see directly that \(\varSigma \) implies the pCC \(\varphi =(\textit{card}(\textit{rfid},\textit{time})\le 4,\ge 0.8)\) since our algorithm returned 1 as the maximum probability for which \(\textit{card}(\textit{rfid},\textit{time})\le 4\) is implied by \(\varSigma \). Since the given probability of 0.8 does not exceed \(p=1\), \(\varphi \) is indeed implied.

5 Acquiring Probabilistic Cardinality Constraints

Data quality, and therefore largely the success of data-driven organizations, depends on the ability of analysts to identify the semantic integrity constraints that govern the data. For probabilistic cardinality constraints \((\textit{card}(X)\le b,\ge p)\), the “right” marginal probability p and the “right” upper bound b must be identified for a given set X of attributes. Choosing p too large or b too small prevents the entry of clean data, resulting in a lower level of data completeness. Choosing p too small or b too large can lead to the entry of dirty data, resulting in a lower level of data consistency. Analysts benefit from computational support to improve upon their ad-hoc perceptions of an appropriate probability p and bound b.

Goal. Armstrong relations are a useful tool for consolidating the perception of analysts about the cardinality constraints (CCs) of a given application domain. Starting with a set \(\varSigma \), the tool creates a small relation r that satisfies \(\varSigma \) and violates all CCs not implied by \(\varSigma \). This property makes r a perfect sample for \(\varSigma \): any CC is satisfied by the relation if and only if it is implied by \(\varSigma \).

Our goal is to develop the tool of Armstrong p-relations for a given set \(\varSigma \) of pCCs: the marginal probability by which a traditional constraint \(\textit{card}(X)\le b\) holds on the Armstrong p-relation is the maximum probability p by which the pCC \((\textit{card}(X)\le b,\ge p)\) is implied by \(\varSigma \). So, if an analyst wants to check for an arbitrary pCC \((\textit{card}(X)\le b,\ge p)\) whether it is implied by \(\varSigma \), she can compute the marginal probability \(p'\) by which the CC \(\textit{card}(X)\le b\) holds on the Armstrong p-relation and verify that \(p'\ge p\). For the remainder of this section, we will review Armstrong relations, add new results, and then devise our construction of Armstrong p-relations and more concise representations thereof.

Armstrong Relations. An Armstrong relation w for a given set \(\varSigma \) of CCs over relation schema R satisfies \(\varSigma \) and violates all CCs \(\textit{card}(X)\le b\) over R which are not implied by \(\varSigma \). Recall that \(\varSigma \models \textit{card}(X)\le b\) if and only if \(X=R\) or \(b=\infty \) or there is some \(\textit{card}(Z)\le b'\in \varSigma \) where \(Z\subseteq X\) and \(b'\le b\). Hence, if \(\varSigma \not \models \textit{card}(X)\le b\), then \(X\not =R\), \(b<\infty \) and for all \(\textit{card}(Z)\le b'\in \varSigma \) where \(Z\subseteq X\) we have \(b'>b\). Our strategy is therefore to find, for every subset X, the smallest upper bound \(b_X\) that applies to X. In other words, \(b_X=\inf \{b\mid \varSigma \models \textit{card}(X)\le b\}\). Moreover, if \(b_{XY}=b_X\) for attribute sets X and Y, then it suffices to violate \(\textit{card}(XY)\le b_{XY}-1\). For this reason, the set \(\textit{dup}_{\varSigma }(R)\) of duplicate sets is defined as \(\textit{dup}_{\varSigma }(R)=\{\emptyset \subset X\subset R\mid b_X>1\wedge (\forall A\in R-X\, (b_{XA}<b_X))\}\). For each duplicate set \(X\in \textit{dup}_{\varSigma }(R)\), we introduce \(b_X\) new tuples \(t^X_1,\ldots ,t^X_{b_X}\) that all have matching values on all the attributes in X and all have unique values on all the attributes in \(R-X\). An Armstrong relation for \(\varSigma \) is obtained by taking the disjoint union of \(\{t^X_1,\ldots ,t^X_{b_X}\}\) for all duplicate sets X.
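For small schemas, the bounds \(b_X\) and the duplicate sets can be computed by brute force over all attribute subsets; a sketch, assuming the certain CCs are encoded as (Z, b) pairs and again using math.inf for \(\infty \).

```python
from itertools import combinations
import math

def least_bound(R, ccs, X):
    """b_X = inf{ b | Sigma |= card(X) <= b } for a set ccs of certain CCs."""
    X = frozenset(X)
    bounds = [b for Z, b in ccs if frozenset(Z) <= X]
    if X == frozenset(R):
        bounds.append(1)           # card(R) <= 1 holds in every relation
    return min(bounds, default=math.inf)

def duplicate_sets(R, ccs):
    """dup_Sigma(R): nonempty proper X with b_X > 1 such that adding any
    attribute A outside X strictly lowers the bound."""
    result = []
    for k in range(1, len(R)):
        for X in map(set, combinations(sorted(R), k)):
            bX = least_bound(R, ccs, X)
            if bX > 1 and all(least_bound(R, ccs, X | {A}) < bX
                              for A in set(R) - X):
                result.append((frozenset(X), bX))
    return result
```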

Example 6

For a probability p and a given set \(\varSigma \) of pCCs let \(\varSigma _p=\{\textit{card}(X)\le b\mid \exists p'\ge p\, (\textit{card}(X)\le b,\ge p')\in \varSigma \}\). Continuing Example 1, consider the sets \(\varSigma _{0.75}\), \(\varSigma _{0.9}\) and \(\varSigma _{1}\) of traditional cardinality constraints on Wolverine. The attribute subsets which are duplicate with respect to these sets are illustrated in Fig. 1, together with their associated cardinalities. The worlds \(w_1\), \(w_2\) and \(w_3\) in Table 1 are Armstrong relations for \(\varSigma _{0.75}\), \(\varSigma _{0.9}\) and \(\varSigma _{1}\), respectively.

Fig. 1. Duplicate sets X in bold font and their cardinalities \(b_{X}\) for Example 6

Armstrong Sketches. While this construction works well in theory, a problem occurs with the actual use of these Armstrong relations in practice. In some cases, the Armstrong relation will be infinite and therefore of no use. These cases occur exactly if there is some attribute \(A\in R\) for which \(b_A=\infty \), in other words, if there is some attribute for which no finite upper bound has been specified. For a practical solution we introduce Armstrong sketches, which are finite representations of possibly infinite Armstrong relations.

Let \(R_*\) denote a relation schema resulting from R by extending the domain of each attribute of R by the distinguished symbol \(*\). A sketch \(\varsigma =(\textit{card},\omega )\) over R consists of a finite relation \(\omega =\{\tau _1,\ldots ,\tau _n\}\) over \(R_*\), and a function \(\textit{card}\) that maps each tuple \(\tau _i\in \omega \) to a value \(b_i=\textit{card}(\tau _i)\in \mathbb {N}^\infty _1\). An expansion of \(\varsigma \) is a relation w over R such that

  • \(w=\bigcup _{i=1}^n\{t^1_i,\ldots ,t^{b_i}_i\}\),

  • (preservation of domain values) for all \(i=1,\ldots ,n\), for all \(k=1,\ldots ,b_i\), for all \(A\in R\), if \(\tau _i(A)\not =*\), then \(t^k_i(A)=\tau _i(A)\),

  • (uniqueness of values substituted for \(*\)) for all \(i=1,\ldots ,n\), for all \(A\in R\), if \(\tau _i(A)=*\), then for all \(k=1,\ldots ,b_i\), for all \(j=1,\ldots ,n\), and for all \(l=1,\ldots ,b_j\) (where \(l\not =k\), if \(j=i\)), \(t^k_i(A)\not =t^l_j(A)\).

We call \(\varsigma \) an Armstrong sketch for \(\varSigma \) if every expansion of \(\varsigma \) is an Armstrong relation for \(\varSigma \). The following simple algorithm can be used to construct an Armstrong sketch \(\varsigma =(\textit{card},\omega )\) for \(\varSigma \): for each duplicate set \(X\in \textit{dup}_{\varSigma }(R)\) we introduce a tuple \(\tau _X\) into \(\omega \) such that, for all \(A\in X\), \(\tau _X(A)\) has some unique domain value from \(\textit{dom}(A)-\{*\}\), for all \(A\in R-X\), \(\tau _X(A)=*\), and \(\textit{card}(\tau _X)=b_X\). The main advantage of Armstrong sketches over Armstrong relations is their smaller number of tuples. In fact, this number coincides with the number of duplicate sets, which is guaranteed to be finite. In contrast, if some \(b_X=\infty \), then every Armstrong relation must be infinite.
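Both the sketch construction just described and the expansion of a finite sketch are mechanical; a sketch under the same encoding assumptions, reusing duplicate_sets from above.

```python
import itertools

def armstrong_sketch(R, ccs):
    """One sketch tuple per duplicate set X: fresh values on X, '*' elsewhere;
    the associated cardinality is b_X. Returns a list of (tau, card) pairs."""
    fresh = itertools.count()
    return [({A: (f"v{next(fresh)}" if A in X else "*") for A in sorted(R)}, bX)
            for X, bX in duplicate_sets(R, ccs)]

def expand(sketch):
    """One expansion of a sketch: copy each tuple card(tau) times, replacing
    every '*' by a globally unique value (all cardinalities must be finite)."""
    unique = itertools.count()
    world = []
    for tau, b in sketch:
        for _ in range(b):
            world.append({A: (f"u{next(unique)}" if v == "*" else v)
                          for A, v in tau.items()})
    return world
```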

Example 7

Continuing Example 6, the following tables show Armstrong sketches (A-sketches) for the sets \(\varSigma _{0.75}\), \(\varSigma _{0.9}\), and \(\varSigma _{1}\), which have expansions \(w_1\), \(w_2\), and \(w_3\) as shown in Table 1, respectively.

[Tables omitted: Armstrong sketches for \(\varSigma _{0.75}\), \(\varSigma _{0.9}\), and \(\varSigma _{1}\)]

Armstrong p-sketches. An Armstrong p-relation for a set \(\varSigma \) of pCCs over R is a p-relation r over R such that for all pCCs \(\varphi \) over R the following holds: \(\varSigma \models \varphi \) if and only if r satisfies \(\varphi \). As relations are the idealized special case of p-relations in which the relation forms the only possible world of the p-relation, there are sets of pCCs for which no finite Armstrong p-relation exists, i.e., the Armstrong p-relation contains some possible world that is infinite. For this reason we introduce probabilistic sketches and their expansions, as well as Armstrong p-sketches which are guaranteed to be finite p-relations.

A probabilistic sketch (p-sketch) over R is a probabilistic relation \(s=(\mathcal {W},\mathcal {P})\) over \(R_*\) where the possible worlds in \(\mathcal {W}\) are sketches over R. A probabilistic expansion (p-expansion) of s is a p-relation \(r=(W,P)\) where W contains for every sketch \(\varsigma \in \mathcal {W}\) a single expansion w over R of \(\varsigma \), and \(P(w)=\mathcal {P}(\varsigma )\).

An Armstrong p-sketch for a set \(\varSigma \) of pCCs over R is a p-sketch over R such that each of its p-expansions is an Armstrong p-relation for \(\varSigma \).

Example 8

Continuing Example 1, the following table shows an Armstrong p-sketch s for the given set \(\varSigma \) of pCCs.

[Table omitted: Armstrong p-sketch s]

A p-expansion of s is the finite Armstrong p-relation of Table 1.

Naturally, the question arises whether Armstrong p-sketches exist for any given set of pCCs over any given relation schema. The next theorem shows that every assignment of probabilities to a finite set of cardinality constraints that is consistent with the inference rules from Table 3 can be represented by a single p-relation which exhibits this assignment in the form of marginal probabilities [16].

Theorem 4

Let \(l:2^R\times \mathbb {N}_1^\infty \rightarrow [0,1]\) be a function such that the image of l is a finite subset of [0,1], \(l(R,1)=1\) and for all \(X\subseteq R\), \(l(X,\infty )=1\), and for all \(X,Y\subseteq R\) and \(b,b'\in \mathbb {N}_1\), \(l(X,b)\le l(XY,b+b')\) holds. Then there is some p-sketch s over R such that every p-expansion r of s satisfies \((\textit{card}(X)\le b,\ge l(X,b))\), and for all \(X\subseteq R\), \(b\in \mathbb {N}_1^\infty \) and \(p\in [0,1]\) such that \(p>l(X,b)\), r violates \((\textit{card}(X)\le b,\ge p)\).

We say that pCCs enjoy Armstrong p-sketches if, for every relation schema R and every finite set \(\varSigma \) of pCCs over R, there is some p-sketch over R that is Armstrong for \(\varSigma \) [16].

Theorem 5

Probabilistic cardinality constraints enjoy Armstrong p-sketches.

Armstrong PC-sketches. Probabilistic databases can have huge numbers of possible worlds. It is therefore important to represent and process probabilistic data concisely. Probabilistic conditional databases, or PC-tables for short [17], are a popular system that can represent any given probabilistic database. Considering our aim of finding concise data samples of pCCs, we would like to compute Armstrong p-sketches in the form of Armstrong PC-sketches.

For this purpose, we first adapt the standard definition of PC-tables [17] to that of PC-sketches. A conditional sketch, or c-sketch, is a tuple \(\varGamma = \langle \varsigma ,\iota \rangle \), where \(\varsigma =(\textit{card},\omega )\) is a sketch (where \(\omega \) may contain duplicate tuples), and \(\iota \) assigns to each tuple \(\tau \) in \(\omega \) a finite set \(\iota _\tau \) of positive integers. The set of world identifiers of \(\varGamma \) is the union of the sets \(\iota _\tau \) for all tuples \(\tau \) of \(\omega \). Given a world identifier i of \(\varGamma \), the possible world sketch \(\varsigma _i=(\textit{card}_i,\omega _i)\) associated with i is given by \(\omega _i = \{ \tau \mid \tau \in \omega \text{ and } i\in \iota _\tau \}\), where \(\textit{card}_i\) is the restriction of \(\textit{card}\) to \(\omega _i\). The representation of a c-sketch \(\varGamma = \langle \varsigma ,\iota \rangle \) is the set \(\mathcal {W}\) of possible world sketches \(\varsigma _i\) where i denotes some world identifier of \(\varGamma \). A probabilistic conditional sketch, or PC-sketch, is a pair \(\langle \varGamma ,\varPi \rangle \) where \(\varGamma \) is a c-sketch, and \(\varPi \) is a probability distribution over the set of world identifiers of \(\varGamma \). The representation of a PC-sketch \(\langle \varGamma ,\varPi \rangle \) is the p-sketch \(s=(\mathcal {W},\mathcal {P})\) where \(\mathcal {W}\) is the set of possible world sketches associated with \(\varGamma \) and the probability \(\mathcal {P}\) of each possible world sketch \(\varsigma _i\in \mathcal {W}\) is defined as the probability \(\varPi (i)\) of its world identifier i.
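The representation of a PC-sketch is a straightforward grouping by world identifier; a sketch, assuming the c-sketch is encoded as a list of triples (tau, card, iota).

```python
def represent(c_sketch, Pi):
    """P-sketch represented by a PC-sketch: for each world identifier i, the
    world sketch (tuples tau with i in iota_tau) and its probability Pi(i)."""
    ids = set().union(*(iota for _, _, iota in c_sketch))
    return {i: ([(tau, b) for tau, b, iota in c_sketch if i in iota], Pi[i])
            for i in ids}
```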

It is simple to see that every p-sketch can be represented as a PC-sketch [16].

Theorem 6

Every p-sketch can be represented as a PC-sketch.

A PC-sketch is called an Armstrong PC-sketch for \(\varSigma \) if and only if its representation is an Armstrong p-sketch for \(\varSigma \).

Example 9

Table 2 shows a PC-sketch \(\langle \varGamma ,\varPi \rangle \) that is Armstrong for the set \(\varSigma \) of pCCs from Example 1.


Algorithm 1 contains the pseudo-code, together with comments, for computing an Armstrong PC-sketch for any given set \(\varSigma \) of pCCs over any given relation schema R. In particular, line (5) uses the definition \(b^i_X:=\inf \{b\mid \textit{card}(Y)\le b\in \varSigma _{p_i}\wedge Y\subseteq X\}\) to compute the cardinalities.
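Since the pseudo-code of Algorithm 1 is not reproduced here, the following is only a hedged reading of the construction, inferred from Examples 6-9 and the description of line (5): one possible-world sketch per distinct probability occurring in \(\varSigma \), Armstrong for the corresponding stratum \(\varSigma _{p_i}\), with the weights chosen so that the marginal probabilities come out right. It reuses armstrong_sketch from above.

```python
def armstrong_pc_sketch(R, sigma):
    """Hedged reconstruction: world sketch i is Armstrong for Sigma_{p_i}
    and receives weight Pi(i) = p_i - p_{i-1} (with p_0 = 0 and 1 added to
    the probabilities). On the running example this yields worlds for
    Sigma_{0.75}, Sigma_{0.9}, Sigma_1 weighted 0.75, 0.15, 0.1 as in Table 1."""
    probs = sorted({p for _, _, p in sigma} | {1.0})
    worlds, Pi, prev = {}, {}, 0.0
    for i, p in enumerate(probs, start=1):
        stratum = [(X, b) for X, b, q in sigma if q >= p]    # Sigma_{p_i}
        worlds[i] = armstrong_sketch(R, stratum)             # uses b^i_X of line (5)
        Pi[i] = p - prev                                     # marginal weight
        prev = p
    return worlds, Pi
```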

Theorem 7

For every set \(\varSigma \) of pCCs over relation schema R, Algorithm 1 computes an Armstrong PC-sketch for \(\varSigma \).

Finally, we derive some bounds on the time complexity of finding Armstrong PC-sketches. Since the relational model is subsumed, there are cases where the number of tuples in every Armstrong PC-sketch for \(\varSigma \) over R is exponential in \(||\varSigma ||\). Such a case is given by \(R_n=\{A_1,\ldots ,A_{2n}\}\) and \(\varSigma _n=\{ (\textit{card}(A_{2i-1},A_{2i})\le 1,\ge 1)\mid i=1,\ldots ,n\}\) with \(||\varSigma _n||=2\cdot n\). Indeed, every Armstrong PC-sketch for \(\varSigma _n\) must feature \(2^n\) different tuples to accommodate the \(2^n\) different duplicate sets X with associated cardinality \(b^1_X=\infty \), and there is only one possible world. Algorithm 1 was designed with the goal that the worst-case time bound from the traditional relational case does not deteriorate in our more general setting. This is indeed achieved, as the computationally most demanding part of Algorithm 1 is the computation of the cardinalities in line (5), which is achieved in time exponential in \(\textit{max}(||\varSigma ||,|R|)\), where |R| denotes the number of attributes in R.

Theorem 8

The time complexity to find an Armstrong PC-sketch for a given set \(\varSigma \) of pCCs over schema R is precisely exponential in \(\textit{max}(||\varSigma ||,|R|)\).

There are also cases where the number of tuples in some Armstrong PC-sketch for \(\varSigma \) over R is logarithmic in \(||\varSigma ||\). Such a case is given by \(R_n=\{A_1,\ldots ,A_{2n}\}\) and \(\varSigma _n=\{ (\textit{card}(X_1\cdots X_n)\le 1,\ge 1)\mid X_i\in \{A_{2i-1},A_{2i}\}\text { for }i=1,\ldots ,n\}\) with \(||\varSigma _n||=n\cdot 2^n\). There is an Armstrong PC-sketch for \(\varSigma _n\) that contains only one tuple for each of the n duplicate sets \(X=R-\{A_{2i-1},A_{2i}\}\) with associated cardinality \(b^1_X=\infty \).

In practice we recommend using both representations of business rules: one in the form of the set \(\varSigma \) of pCCs itself, and one in the form of an Armstrong PC-sketch for \(\varSigma \). This is always possible by our results. We think Armstrong PC-sketches help identify bounds b that are too low or probabilities p that are too high, while the set \(\varSigma \) helps identify bounds b that are too high or probabilities p that are too low.

Fig. 2. Screenshot of the GUI Fortuna

Graphical User Interface. We have implemented Algorithm 1 in the form of a graphical user interface (GUI) called Fortuna. A user can enter some attributes and specify probabilistic cardinality constraints using any combination of these. The GUI shows an Armstrong PC-sketch for the specified input, sketches of the possible worlds can be brought up, and their individual tuples can be expanded at will. Figure 2 shows a partial screenshot of our GUI Fortuna with some outputs for our running example.

6 Conclusion and Future Work

Probabilistic cardinality constraints were introduced to stipulate lower bounds on the marginal probability by which a maximum number of the same data pattern can occur in sets of uncertain data. As shown in Fig. 3 the marginal probability can be used to balance the consistency and completeness targets for the quality of data, enabling organizations to derive more value from it.

Fig. 3. Control mechanism p

Axiomatic and algorithmic tools were developed to reason efficiently about probabilistic cardinality constraints. This can help minimize the overhead in using them for data quality purposes or deriving probabilities on the maximum number of query answers without querying any data. These applications are effectively unlocked by developing computational support in the form of probabilistic Armstrong samples for identifying the right marginal probabilities by which cardinality constraints should hold in a given application domain. Analysts and domain experts can jointly inspect Armstrong samples which point out any flaws in the current perception of the marginal probabilities. Our tool Fortuna can be used to generate Armstrong samples for any input, and to explore the possible worlds it represents.

Our results constitute the core foundation for probabilistic cardinality constraints, which can be extended in various directions in future work. It will be interesting to raise the expressiveness of probabilistic cardinality constraints by allowing the stipulation of lower bounds on the number of occurrences of the same data patterns, and/or upper bounds on the marginal probabilities, for example. For a given PC-table it would be interesting to develop efficient algorithms that compute the marginal probability by which cardinality constraints hold on the data the table represents. Experiments with our implementation are expected to provide further insight into the average-case performance of Algorithm 1 in relationship to the worst and best cases discussed. Finally, it would be interesting to conduct an empirical investigation into the usefulness of our framework for acquiring the right marginal probabilities of cardinality constraints in a given application domain. This will also require us to extend empirical measures from certain [11] to probabilistic data sets. Particularly intriguing will be the question of whether Armstrong PC-sketches or Armstrong p-sketches are more useful in practice. While Armstrong PC-sketches are more concise, they may prove to be too concise to draw the attention of analysts and domain experts to critical constraint violations.