First-Order Under-Approximations of Consistent Query Answers

Geerts, Floris; Pijcke, Fabian; Wijsen, Jef

doi:10.1007/978-3-319-23540-0_24

Floris Geerts⁶,
Fabian Pijcke⁷ &
Jef Wijsen⁷

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9310))

Included in the following conference series:

International Conference on Scalable Uncertainty Management

538 Accesses
2 Citations

Abstract

Consistent Query Answering (CQA) has by now been widely adopted as a principled approach for answering queries on inconsistent databases. The consistent answer to a query q on an inconsistent database ${\mathbf {db}}$ is the intersection of the answers to q on all repairs, where a repair is any consistent database that is maximally close to ${\mathbf {db}}$. Unfortunately, computing consistent answers under primary key constraints has already exponential data complexity for very simple conjunctive queries, which is completely impracticable.

In this paper, we propose a new framework for divulging an inconsistent database to end users, which adopts two postulates. The first postulate complies with CQA and states that inconsistencies should never be divulged to end users. Therefore, end users should only get consistent query answers. The second postulate states that the data complexity of user queries must remain tractable (i.e., in ${\mathbf {P}}$ or even in ${\mathbf {FO}}$). User queries with exponential data complexity will be rejected. We investigate which consistent query answers can still be obtained under such access postulates.

Access provided by Autonomous University of Puebla. Download conference paper PDF

On the Data Complexity of Consistent Query Answering

Article 04 November 2014

Dealing with Inconsistency in Databases: An Overview

A Survey of the Data Complexity of Consistent Query Answering under Key Constraints

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Inconsistent, incomplete and uncertain data is widespread in the internet and social media era. This has given rise to a new paradigm for query answering, called Consistent Query Answering (CQA). This paradigm starts with the notion of repair, which is a new consistent database that minimally differs from the original inconsistent database. In general, an inconsistent database can have many repairs. In this respect, database repairing is different from data cleaning which aims at a unique cleaned database.

In this paper, we assume that the only constraints are primary keys, one per relation. A repair of an inconsistent database ${\mathbf {db}}$ is a maximal subset of ${\mathbf {db}}$ that satisfies all primary key constraints. Primary keys will be underlined. For example, the database of Fig. 1 stores ages and cities of residence of male and female persons. For simplicity, assume that persons have unique names (attribute N). Every person has exactly one age (attribute A) and city (attribute C). However, distinct tuples may agree on the primary key N, because there can be uncertainty about ages and cities. In the database of Fig. 1, there is uncertainty about the city of Ed (it can be Mons or Paris). The database can be repaired in two ways: delete either $M({\underline{\text {Ed}}},{48},\mathrm{{Mons}})$ or $M({\underline{\text {Ed}}},{48},\mathrm{{Paris}})$.

When database repairing results in multiple repairs, CQA shifts from standard semantics to certainty semantics. Given a query, the certain answer (also called consistent answer) is defined as the intersection of the answers on all repairs. That is, for a query q on an inconsistent database ${\mathbf {db}}$, CQA replaces the standard query answer $q({\mathbf {db}})$ with the certain answer, defined by the following intersection:

$$\begin{aligned} \bigcap \{q({\mathbf {r}})\mid {\mathbf {r}} \text{ is } \text{ a } \text{ repair } \text{ of } {\mathbf {db}}\}. \end{aligned}$$

(1)

Thus, the certainty semantics exclusively returns answers that hold true in every repair. Given a query q, we will denote by $\lfloor {q}\rfloor $ the query that maps a database to the answer defined by (1).

A practical obstacle to CQA is that the shift to certainty semantics involves a significant increase of complexity. When we refer to complexity in this paper, we mean data complexity, i.e., the complexity in terms of the size of the database (for a fixed query) [1, p. 422]. It is known for long [7] that there exist conjunctive queries q that join two relations such that the data complexity of $\lfloor {q}\rfloor $ is already ${\mathbf {coNP}}$-hard. If this happens, CQA is completely impracticable.

This paper investigates ways to circumvent the high data complexity of CQA in a realistic setting, which is based on the following assumptions:

If a query returns an answer to a user, then every tuple in that answer should belong to the certain answer. In Libkin’s terminology [16], query answers must not contain false positives, i.e., tuples that are not certain.
The only queries that can be executed in practice are those with data complexity in ${\mathbf {P}}$ or, even better, in ${\mathbf {FO}}$. ${\mathbf {FO}}$ is the descriptive complexity class that captures all queries expressible in relational calculus.

Therefore, if the data complexity of a query $\lfloor {q}\rfloor $ is not in ${\mathbf {P}}$, then the best we can go for is an approximation without false positives (also called under-approximation), computable in polynomial time. The term strategy will be used for queries that compute such approximations. Intuitively, a strategy can be regarded as a two-step process in which one starts by issuing a number of well-behaved queries $\lfloor {q_i}\rfloor $, for $i\in \{1,\ldots ,\ell \}$, which can then be subject to a post-processing step. In this paper, well-behaved queries are those that are accepted by a query interface, e.g., self-join-free conjunctive queries $q_i$ such that $\lfloor {q_i}\rfloor $ is in ${\mathbf {FO}}$, and post-processing is formalized as queries built-up from the $\lfloor {q_i}\rfloor $’s.

We next illustrate our setting by an example. Consider the following scenario with two persons, called Bob and Alice. The person called Bob owns a database that is publicly accessible only via a query interface which restricts the syntax of the queries that can be asked. Our main results concern the case where the interface is restricted to self-join-free conjunctive queries. The database schema including all primary key constraints is publicly available. However, Bob is aware that his database contains many mistakes which should not be divulged. Therefore, whenever some end user asks a query q, Bob will actually execute the query $\lfloor {q}\rfloor $. That is, end users will get exclusively consistent answers. But, for feasibility reasons, Bob will reject any query q for which the data complexity of $\lfloor {q}\rfloor $ is too high. In this paper, we assume that Bob considers that data complexity is too high when it is beyond ${\mathbf {FO}}$. The person called Alice interrogates Bob’s database, and she will be happy to get exclusively consistent answers. Unfortunately, her query q will be rejected by Bob if the data complexity of $\lfloor {q}\rfloor $ is too high (i.e., not in ${\mathbf {FO}}$). If this happens, Alice has to change strategy. Instead of asking q, she can ask a finite number of queries $q_{1}$, $q_{2}$, ..., $q_{\ell }$ such that for every $i\in \{1,\dots ,\ell \}$, the data complexity of $\lfloor {q_{i}}\rfloor $ is in ${\mathbf {FO}}$, and hence the query $q_{i}$ will be accepted by Bob. No restriction is imposed on the number $\ell $ of queries that can be asked. The best Alice can hope for is that she can compute herself the answer to $\lfloor {q}\rfloor $ (or even to q) from Bob’s answers to $\lfloor {q_{1}}\rfloor $, ..., $\lfloor {q_{\ell }}\rfloor $ by means of some post-processing. The question addressed in this paper is: Given that Alice wants to answer q, what queries should she ask to Bob?

Here is a concrete example. Assume Bob owns the database of Fig. 1. Interested in stable couples^{Footnote 1}, Alice submits the query $q_{1}$ which asks “Get pairs of ages of men and women living in the same city”:

$$q_{1}=\{y,w\mid \exists x\exists u\exists z\left( {M(\underline{x},y,z)\wedge F(\underline{u},w,z)}\right) \}.$$

The consistent answer is $\{(48,37)$, $(29,37)\}$. However, the query $\lfloor {q_{1}}\rfloor $ that returns the certain answer is known to have ${\mathbf {coNP}}$-hard data complexity [13, 14]. Therefore, Bob will reject $q_{1}$. Alice changes strategy and asks the query $q_{2}$ which asks “Get pairs of ages and city of men and women living in the same city”:

$$\begin{aligned} q_{2}=\{y,w,z\mid \exists x\exists u\left( {M(\underline{x},y,z)\wedge F(\underline{u},w,z)}\right) \}. \end{aligned}$$

(2)

Since the data complexity of $\lfloor {q_{2}}\rfloor $ is known to be in ${\mathbf {FO}}$ [13, 14], Bob will execute $\lfloor {q_{2}}\rfloor $. The query $q_{2}$ returns {(29, 37, Mons), (48, 37, Mons)} on one repair, and {(29, 37, Mons), (48, 37, Paris)} on the other repair, so the certain answer is $\{(29,37,\text {Mons})\}$. This in turn allows Alice to derive a certain answer to the original query: since $(29,37,\text {Mons})$ belongs to the answer to $\lfloor {q_{2}}\rfloor $, it is correct to conclude that (29, 37) belongs to the answer to $\lfloor {q_{1}}\rfloor $. An interesting question is whether Alice has a better strategy that divulges even more answers to $\lfloor {q_{1}}\rfloor $.

The technical contributions of this paper are as follows. We first show that the following problem is undecidable: Given a relational calculus query q, is $\lfloor {q}\rfloor $ in ${\mathbf {FO}}$? In view of this undecidability result, we then limit our attention to strategies that are first-order combinations (using disjunction and existential quantification) of queries $\lfloor {q}\rfloor $ that are known to be in ${\mathbf {FO}}$. We show how to build optimal strategies under such syntax restrictions.

This paper is organized as follows. Section 2 discusses related work. Section 3 provides some mathematical definitions. Section 4 introduces our new framework for studying consistent query answering under primary key constraints, and introduces the problem ${\mathsf {OPTSTRATEGY}}$. Intuitively, ${\mathsf {OPTSTRATEGY}}$ asks, given a query q, to find a new query $q'$ that gets the largest subset of consistent answers while still obeying the restrictions imposed by our framework. Section 5 provides ways to solve ${\mathsf {OPTSTRATEGY}}$ in restricted settings. Finally, Sect. 6 concludes the paper.

2 Related Work

Consistent query answering (CQA) was proposed in [2] as a principled approach to handle data quality problems that arise from violations of integrity constraints. See the textbooks [3, 10] for comprehensive overviews of these domains.

Fuxman and Miller [11] were the first ones to focus on CQA under the restrictions that consistency is only with respect to primary keys and that queries are self-join-free conjunctive. See [21] for a survey on consistent query answering to conjunctive queries under primary key constraints. Some recent results not covered by this survey can be found in [13, 14].

Instead of returning the query answers true in every repair, one could return the query answers true in, e.g., a majority of repairs. This leads to the counting variant of CQA, which has been studied in [17, 18]. As observed in [20], the counting variant of CQA under primary key constraints is closely related to query answering in block-independent-disjoint (BID) probabilistic databases [8, 9]. Alternatively, one can obtain approximations by restricting the set of repairs. This approach has been considered in [5] in the setting of ontology-based data access.

Our work can also be regarded as querying “consistent views,” in the sense that Bob returns exclusively consistent answers. It has been observed long ago [19] that consistent views are not closed under relational calculus. In other words, the position of the $\lfloor {\cdot }\rfloor $ construct in a query does matter. For example, for the database of Fig. 1, the query $\{x\mid \exists y\exists z\lfloor {M(\underline{x},y,z)}\rfloor \}$ returns only Dirk, while $\lfloor {\{x\mid \exists y\exists z M(\underline{x},y,z)\}}\rfloor $ returns both Ed and Dirk. Bertossi and Li [4] have used views to protect the secrecy of data in a database. In our setting, the query answers that are to be hidden from end users are those that are not true in every repair.

3 Preliminaries

We assume disjoint sets of variables and constants. If $\mathbf {x}$ is a sequence containing variables and constants, then $\mathsf {vars}({\mathbf {x}})$ denotes the set of variables that occur in $\mathbf {x}$. A valuation over a set U of variables is a total mapping $\theta $ from U to the set of constants.

Atoms and Key-equal Facts. Each relation name R of arity n, $n\ge 1$, has a unique primary key which is a set $\{1,2,\dots ,k\}$ where $1\le k\le n$. We say that R has signature $[{n},{k}]$ if R has arity n and primary key $\{1,2,\dots ,k\}$. We say that R is all-key if $n=k$. For all positive integers n, k such that $1\le k\le n$, we assume denumerably many relation names with signature $[{n},{k}]$.

If R is a relation name with signature $[{n},{k}]$, then $R(s_{1},\dots ,s_{n})$ is called an R-atom (or simply atom), where each $s_{i}$ is either a constant or a variable ($1\le i\le n$). Such an atom is commonly written as $R(\underline{\mathbf {x}},\mathbf {y})$ where the primary-key value $\mathbf {x}=s_{1},\dots ,s_{k}$ is underlined and $\mathbf {y}=s_{k+1},\dots ,s_{n}$. An R-fact (or simply fact) is an R-atom in which no variable occurs. Two facts $R_{1}(\underline{\mathbf {a}_{1}},\mathbf {b}_{1}),R_{2}(\underline{\mathbf {a}_{2}},\mathbf {b}_{2})$ are key-equal if $R_{1}=R_{2}$ and $\mathbf {a}_{1}=\mathbf {a}_{2}$.

We will use letters F, G, H for atoms. For an atom $F=R(\underline{\mathbf {x}},\mathbf {y})$, we denote by ${\mathsf {key}}({F})$ the set of variables that occur in $\mathbf {x}$, and by ${\mathsf {vars}}({F})$ the set of variables that occur in F, that is, ${\mathsf {key}}({F})=\mathsf {vars}({\mathbf {x}})$ and ${\mathsf {vars}}({F})=\mathsf {vars}({\mathbf {x}})\cup \mathsf {vars}({\mathbf {y}})$.

Uncertain Databases, Blocks, and Repairs. A database schema is a finite set of relation names. All constructs that follow are defined relative to a fixed database schema.

A database is a finite set ${\mathbf {db}}$ of facts using only the relation names of the schema. We often refer to databases as “uncertain databases” to stress that such databases can violate primary key constraints.

A block of ${\mathbf {db}}$ is a maximal set of key-equal facts of ${\mathbf {db}}$. The term R-block refers to a block of R-facts, i.e., facts with relation name R. An uncertain database ${\mathbf {db}}$ is consistent if no two distinct facts are key-equal (i.e., if every block of ${\mathbf {db}}$ is a singleton). A repair of ${\mathbf {db}}$ is a maximal (with respect to set containment) consistent subset of ${\mathbf {db}}$. We write $\mathsf {rset}({{\mathbf {db}}})$ for the set of repairs of ${\mathbf {db}}$.

Queries and Consistent Query Answering. We assume that the reader is familiar with relational calculus [1, Chapter 5] and with the notion of queries [15, Definition 2.7]. By ${\mathbf {FO}}$, we denote the descriptive complexity class that contains the queries expressible in relational calculus.

For every m-ary ($m\ge 0$) relational calculus query q, we define $\lfloor {q}\rfloor $ as the m-ary query that maps every database ${\mathbf {db}}$ to $\bigcap \{q({\mathbf {r}})\mid {\mathbf {r}}\in \mathsf {rset}({{\mathbf {db}}})\}$. Clearly, if ${\mathbf {db}}$ is a consistent database, then $\lfloor {q}\rfloor ({\mathbf {db}})=q({\mathbf {db}})$.

Given two m-ary queries $q_{1}$ and $q_{2}$, we say that $q_{1}$ is contained in $q_{2}$, denoted by $q_{1}\sqsubseteq q_{2}$, if for every database ${\mathbf {db}}$, $q_{1}({\mathbf {db}})\subseteq q_{2}({\mathbf {db}})$. We write $q_{1}\sqsubset q_{2}$ if $q_{1}\sqsubseteq q_{2}$ and $q_{2}\not \sqsubseteq q_{1}$. We say that $q_{1}$ and $q_{2}$ are equivalent, denoted by $q_{1}\equiv q_{2}$, if $q_{1}\sqsubseteq q_{2}$ and $q_{2}\sqsubseteq q_{1}$.

A 0-ary query is called Boolean. If q is a Boolean query, then q maps any database to either $\{\langle {}\rangle \}$ or $\{\}$, corresponding to $\mathbf{true }$ and $\mathbf{false }$ respectively.

A conjunctive query is a relational calculus query of the form $\{\mathbf {z}\mid \exists \mathbf {y}B\}$ where B is a conjunction of atoms. The conjunction B and the query are said to be self-join-free if no relation name occurs more than once in B. We write $\mathsf {vars}({B})$ for the set of variables that occur in B. By a slight abuse of notation, we denote by B also the set of conjuncts that occur in B. For example, if $B_{1}=R(\mathbf {x})\wedge R(\mathbf {x})\wedge R(\mathbf {y})$ and $B_{2}=R(\mathbf {x})\wedge R(\mathbf {y})\wedge R(\mathbf {z})$, then we may write $B_{1}\subseteq B_{2}$.

Significantly, the following example shows that $\lfloor {q}\rfloor $ may not be expressible in relational calculus, even if q is self-join-free conjunctive.

Example 1

Let $q_{1}=\{\langle {}\rangle \mid \exists x\exists y \exists z\left( {R(\underline{x},z)\wedge S(\underline{y},z)}\right) \}$. The query $q_{1}$ is self-join-free conjunctive. It follows from [13] that $\lfloor {q_{1}}\rfloor $ is not in ${\mathbf {FO}}$ (i.e., not expressible in relational calculus).

Let $q_{2}=\{\langle {}\rangle \mid \exists x\exists y\left( {R(\underline{x},y)\wedge S(\underline{y},b)}\right) \}$, where b is a constant. Then, $\lfloor {q_{2}}\rfloor $ is equivalent to the following relational calculus query:

$$ \begin{array}{rl} \exists x\exists y &{} \left( R(\underline{x},y)\wedge \right. \\ \forall y &{} \left. \left( {R(\underline{x},y)\rightarrow \left( {S(\underline{y},b)\wedge \forall z\left( {S(\underline{y},z)\rightarrow z=b}\right) }\right) }\right) \right) . \quad \square \end{array}$$

4 A Framework for Divulging Inconsistent Databases

In this section, we formalize the setting that was described and illustrated in Sect. 1. The setting is captured by the language called ${\mathsf {CQAFO}}$, which consists of first-order quantification and Boolean combinations of atomic formulas of the form $\lfloor {q}\rfloor $, where q is any relational calculus query. The atomic formulas $\lfloor {q}\rfloor $ capture that the database owner Bob only returns certain answers. Subsequently, the end user Alice, who interrogates Bob’s database, can do some post-processing on Bob’s outputs. In our setting, we assume that Alice uses first-order quantification and Boolean combinations of Bob’s answers.

Example 2

The scenario in Sect. 1 is captured by the ${\mathsf {CQAFO}}$ query

$$ \{y,w\mid \exists Z\lfloor {\exists x\exists u\left( {M(\underline{x},y,Z)\wedge F(\underline{u},w,Z)}\right) }\rfloor \}. $$

The formula within $\lfloor {\cdot }\rfloor $ is the query (2). The quantification $\exists Z$ corresponds to Alice projecting away the cities column returned by Bob. For readability, we will often use upper case letters for variables that are quantified outside the range of $\lfloor {\cdot }\rfloor $. $\square $

Example 3

The following query allows Alice to find the names of men with more than two cities in the database:

$$ \{x\mid \lfloor {\exists y\exists z M(\underline{x},y,z)}\rfloor \wedge \lnot \exists Z\lfloor {\exists y M(\underline{x},y,Z)}\rfloor \}. $$

To understand this query, it may be helpful to notice that $\{x,Z\mid \lfloor {\exists y M (\underline{x},y,Z)}\rfloor \}$ returns tuple (n, c) whenever c is the only city of residence encoded for the person named n. $\square $

4.1 The Language ${\mathsf {CQAFO}}$

Syntax of ${\mathsf {CQAFO}}$

If q is a relational calculus query, then $\lfloor {q}\rfloor $ is a ${\mathsf {CQAFO}}$ formula.
If $\varphi _{1}$ and $\varphi _{2}$ are ${\mathsf {CQAFO}}$ formulas, then $\varphi _{1}\wedge \varphi _{2}$, $\varphi _{1}\vee \varphi _{2}$, and $\lnot \varphi _{1}$ are ${\mathsf {CQAFO}}$ formulas.
If $\varphi $ is a ${\mathsf {CQAFO}}$ formula, then $\exists x\varphi $ and $\forall x\varphi $ are ${\mathsf {CQAFO}}$ formulas.

If $\varphi $ is a ${\mathsf {CQAFO}}$ formula, then ${\mathsf {free}}({\varphi })$ denotes the set of free variables of $\varphi $ (i.e., the variables not bound by a quantifier). If $\mathbf {x}$ is a tuple containing the free variables of $\varphi $, we write $\varphi (\mathbf {x})$.

A ${\mathsf {CQAFO}}$ query is an expression of the form $\{\mathbf {x}\mid \varphi \}$, where $\mathbf {x}$ is a sequence of variables and constants containing each variable of ${\mathsf {free}}({\varphi })$. If $\mathbf {x}$ contains no constants and no double occurrences of the same variable, then such query is also denoted $\varphi (\mathbf {x})$.

Semantics. Let ${\mathbf {db}}$ be an uncertain database. Let $\varphi (\mathbf {x})$ be a ${\mathsf {CQAFO}}$ formula, and $\mathbf {a}$ be a sequence of constants (of same length as $\mathbf {x}$). We inductively define ${\mathbf {db}}\models \varphi (\mathbf {a})$.

If $\varphi (\mathbf {x})=\lfloor {q(\mathbf {x})}\rfloor $ for some relational calculus query $q(\mathbf {x})$, then ${\mathbf {db}}\models \varphi (\mathbf {a})$ if for every repair ${\mathbf {r}}$ of ${\mathbf {db}}$, ${\mathbf {r}}\models q(\mathbf {a})$;^{Footnote 2}
${\mathbf {db}}\models \lnot \varphi (\mathbf {a})$ if ${\mathbf {db}}\not \models \varphi (\mathbf {a})$;
${\mathbf {db}}\models \varphi _{1}\wedge \varphi _{2}$ if ${\mathbf {db}}\models \varphi _{1}$ and ${\mathbf {db}}\models \varphi _{2}$;
${\mathbf {db}}\models \varphi _{1}\vee \varphi _{2}$ if ${\mathbf {db}}\models \varphi _{1}$ or ${\mathbf {db}}\models \varphi _{2}$;
if $\psi (\mathbf {x})=\exists y\varphi (y,\mathbf {x})$, then ${\mathbf {db}}\models \psi (\mathbf {a})$ if ${\mathbf {db}}\models \varphi (a',\mathbf {a})$ for some $a'$;
if $\psi (\mathbf {x})=\forall y\varphi (y,\mathbf {x})$, then ${\mathbf {db}}\models \psi (\mathbf {a})$ if ${\mathbf {db}}\models \varphi (a',\mathbf {a})$ for all $a'$.

Let $Q=\{\mathbf {x}'\mid \varphi (\mathbf {x})\}$ be a ${\mathsf {CQAFO}}$ query. The answer $Q({\mathbf {db}})$ is the smallest set containing $\theta (\mathbf {x}')$ for every valuation $\theta $ over $\mathsf {vars}({\mathbf {x}})$ such that for some $\mathbf {a}$, $\theta (\mathbf {x})=\mathbf {a}$ and ${\mathbf {db}}\models \varphi (\mathbf {a})$. Notice that $\mathsf {vars}({\mathbf {x}'})=\mathsf {vars}({\mathbf {x}})$, but $\mathbf {x}'$, unlike $\mathbf {x}$, can contain constants and multiple occurrences of the same variable. If $\mathbf {x}'$ contains no variables, then Q is Boolean.

4.2 Restrictions on Data Complexity

The language ${\mathsf {CQAFO}}$ of Sect. 4.1 captures our first postulate which states that the database owner Bob returns exclusively certain answers. But we do not prohibit that end user Alice does some post-processing on Bob’s answers. In this section, we will add our second postulate which states that Bob rejects queries q if the data complexity of $\lfloor {q}\rfloor $ is not in ${\mathbf {FO}}$. Unfortunately, Bob has to face the following undecidability result.

Theorem 1

The following problem is undecidable. Given a relational calculus query q, is $\lfloor {q}\rfloor $ in ${\mathbf {FO}}$?

Proof

Let $q_{1}=\{\langle {}\rangle \mid \exists x\exists y\exists z\left( {R(\underline{x},z)\wedge S(\underline{y},z)\wedge \varphi }\right) \}$ where $\varphi $ is a closed relational calculus formula such that all relation names in $\varphi $ are all-key. We show hereinafter that $\lfloor {q}\rfloor $ is in ${\mathbf {FO}}$ if and only if $\varphi $ is unsatisfiable. The desired result then follows by [1, Theorem 6.3.1], which states that (finite) satisfiability of relational calculus queries is undecidable.

Obviously, if $\varphi $ is unsatisfiable, then $\lfloor {q_{1}}\rfloor \equiv \mathbf{false }$, and hence $\lfloor {q_{1}}\rfloor $ is in ${\mathbf {FO}}$.

We show next that if $\varphi $ is satisfiable, then $\lfloor {q_{1}}\rfloor $ is not in ${\mathbf {FO}}$. Assume that $\varphi $ is satisfiable. Let $q_{0}=\exists x\exists y\exists z\left( {R(\underline{x},z)\wedge S(\underline{y},z)}\right) $. Let ${\mathsf {CERTAIN}{\textsf {0}}}$ and ${\mathsf {CERTAIN}{\textsf {1}}}$ be the problems defined next.

${\mathsf {CERTAIN}{\textsf {0}}}\!\!:$ Given a database ${\mathbf {db}}$, determine whether every repair of ${\mathbf {db}}$ satisfies $q_{0}$.
${\mathsf {CERTAIN}{\textsf {1}}}\!\!:$ Given a database ${\mathbf {db}}$, determine whether every repair of ${\mathbf {db}}$ satisfies $q_{1}$.

Let ${\mathbf {db}}_{0}$ be a database that is input to ${\mathsf {CERTAIN}{\textsf {0}}}$. We show a polynomial-time many-one reduction from ${\mathsf {CERTAIN}{\textsf {0}}}$ to ${\mathsf {CERTAIN}{\textsf {1}}}$. Let ${\mathbf {S}}$ be the database schema that contains the relation names occurring in $\varphi $. An algorithm can consider systematically every finite database ${\mathbf {db}}'$ over ${\mathbf {S}}$ and test ${\mathbf {db}}'\models \varphi $, until a database ${\mathbf {db}}'$ is found such that ${\mathbf {db}}'\models \varphi $. The algorithm terminates because $\varphi $ is satisfiable. Since the computation of ${\mathbf {db}}'$ does not depend on ${\mathbf {db}}_{0}$, it takes ${\mathcal {O}}({1})$ time. Since all relation names in ${\mathbf {db}}'$ are all-key, we have that ${\mathbf {db}}'$ is consistent. Clearly, $q_{0}$ is true in every repair of ${\mathbf {db}}_{0}$ if and only if $q_{1}$ is true in every repair of ${\mathbf {db}}_{0}\cup {\mathbf {db}}'$. So we have established a polynomial-time many-one reduction from ${\mathsf {CERTAIN}{\textsf {0}}}$ to ${\mathsf {CERTAIN}{\textsf {1}}}$. Since ${\mathsf {CERTAIN}{\textsf {0}}}$ is ${\mathbf {coNP}}$-hard [13], it follows that ${\mathsf {CERTAIN}{\textsf {1}}}$ is ${\mathbf {coNP}}$-hard. Since ${\mathbf {FO}}\subsetneq {\mathbf {coNP}}$ [12], it follows that ${\mathsf {CERTAIN}{\textsf {1}}}$ is not in ${\mathbf {FO}}$. $\square $

By Theorem 1, there exists no algorithm that allows Bob to decide whether he has to accept or reject a relational calculus query. In general, little is known about the complexity of $\lfloor {q}\rfloor $ for relational calculus queries q. One of the stronger known results is the following.

Theorem 2

([13]). The following problem is decidable in polynomial time. Given a self-join-free conjunctive query q, is $\lfloor {q}\rfloor $ in ${\mathbf {FO}}$? Moreover, if $\lfloor {q}\rfloor $ is in ${\mathbf {FO}}$, then a relational calculus query equivalent to $\lfloor {q}\rfloor $ can be effectively constructed.

In view of Theorems 1 and 2, the following scenario is the best we can go for with the current state of art.

1.
The database owner Bob only accepts self-join-free conjunctive queries q such that $\lfloor {q}\rfloor $ is in ${\mathbf {FO}}$. Thus, Bob rejects every query that is not self-join-free conjunctive, and rejects a self-join-free conjunctive query q if $\lfloor {q}\rfloor $ is not in ${\mathbf {FO}}$.
2.
As before, Alice can do some first-order post-processing on the answers obtained from Bob.

Under these restrictions, we focus on the following research task: given that Alice wants to answer a self-join-free conjunctive query q on a database owned by Bob, develop a strategy for Alice to get a subset (the greater, the better) of certain answers. Our framework applies to Boolean queries by representing $\mathbf{true }$ and $\mathbf{false }$ by $\{\langle {}\rangle \}$ and $\{\}$ respectively. A formal definition follows.

4.3 Strategies

Strategies for a query q are defined next as relational calculus queries that can be expressed in ${\mathsf {CQAFO}}$ and that are contained in $\lfloor {q}\rfloor $.

Definition 1

Let q be a self-join-free conjunctive query. A strategy for q is a ${\mathsf {CQAFO}}$ query $\varphi $ such that $\varphi \sqsubseteq \lfloor {q}\rfloor $ and for every atomic formula $\lfloor {q'}\rfloor $ in $\varphi $, we have that $q'$ is a self-join-free conjunctive query such that $\lfloor {q'}\rfloor $ is in ${\mathbf {FO}}$.

A strategy $\varphi $ for q is optimal if for every strategy $\psi $ for q, we have $\psi \sqsubseteq \varphi $. The problem ${\mathsf {OPTSTRATEGY}}$ takes in a self-join-free conjunctive query q and asks to determine an optimal strategy for q.

Some observations are in place.

If the input to ${\mathsf {OPTSTRATEGY}}$ is a self-join-free conjunctive q such that $\lfloor {q}\rfloor $ is in ${\mathbf {FO}}$, then the ${\mathsf {CQAFO}}$ query $\lfloor {q}\rfloor $ is itself an optimal strategy.
Every strategy $\varphi $ is in ${\mathbf {FO}}$, because all atomic formulas $\lfloor {q'}\rfloor $ are required to be in ${\mathbf {FO}}$. Therefore, if Alice wants to answer a query q such that $\lfloor {q}\rfloor $ is not in ${\mathbf {FO}}$, then there is no strategy $\varphi $ such that $\varphi \equiv \lfloor {q}\rfloor $.
There is no fundamental reason why the input query to ${\mathsf {OPTSTRATEGY}}$ is required to be self-join-free conjunctive query. However, developing strategies for more expressive queries is left as an open question.

5 How to Construct Good Strategies?

Let q be a self-join-free conjunctive query. In this section, we investigate ways for constructing good (if not optimal) strategies for q of a particular syntax. In Sect. 5.1, we take the most simple approach: take the union of queries $\lfloor {q_{i}}\rfloor $ contained in $\lfloor {q}\rfloor $, where $q_{i}$ is self-join-free conjunctive and $\lfloor {q_{i}}\rfloor $ is in ${\mathbf {FO}}$. We then show that the strategies obtained in this way cannot be optimal. Therefore, an enhanced approach is developed in Sect. 5.2.

5.1 Post-processing by Unions only

Assume that the input to ${\mathsf {OPTSTRATEGY}}$ is a self-join-free conjunctive query $q(\mathbf {z})$. In this section, we look at strategies of the form

$$\begin{aligned} \bigcup _{i=1}^{\ell }\lfloor {q_{i}}\rfloor , \end{aligned}$$

(3)

where each $q_{i}$ is of the form $\{\mathbf {z_{i}}\mid \exists \mathbf {y}_{i}B_{i}\}$ in which $\mathbf {z}_{i}$ has same length as $\mathbf {z}$ and $B_{i}$ is a self-join-free conjunction of atoms.^{Footnote 3}

We use union (with its standard semantics) instead of disjunction to avoid notational difficulties. For example, the union

$$\{x,a\mid \lfloor {R(\underline{x},a)}\rfloor \}\cup \{x,y\mid \lfloor {S(\underline{x},y)}\rfloor \},$$

where a is a constant, is semantically clear, and is equivalent to

$$\{x,y\mid \lfloor {R(\underline{x},y)\wedge y=a}\rfloor \vee \lfloor {S(\underline{x},y)}\rfloor \},$$

in which equality is needed. It would be wrong to write $\{x,y\mid \lfloor {R(\underline{x},a)}\rfloor \vee \lfloor {S(\underline{x},y)}\rfloor \}$, an expression that is even not domain independent [1, p. 79].

Clearly, a formula of the form (3) is a strategy if for every $i\in \{1,\dots ,\ell \}$, $\lfloor {q_{i}}\rfloor $ is in ${\mathbf {FO}}$ and $\lfloor {q_{i}}\rfloor \sqsubseteq \lfloor {q}\rfloor $. The latter condition is equivalent to $q_{i}\sqsubseteq q$ as shown next.

Lemma 1

Let q and $q'$ be self-join-free m-ary conjunctive queries. Then, $q\sqsubseteq q'$ if and only if $\lfloor {q}\rfloor \sqsubseteq \lfloor {q'}\rfloor $.

Proof

Let $q=\{\mathbf {z}\mid \exists \mathbf {y}B\}$ and $q'=\{\mathbf {z}'\mid \exists \mathbf {y}'B'\}$, where $\mathbf {z}$ and $\mathbf {z}'$ both have the same length m.

Straightforward. Assume $\lfloor {q}\rfloor \sqsubseteq \lfloor {q'}\rfloor $. Let $\mu $ be an injective mapping with domain $\mathsf {vars}({B})$ that maps each variable to a fresh constant not occurring elsewhere. Since $\mu $ is injective, its inverse ${\mu }^{-1}$ is well defined. Let ${\mathbf {db}}=\mu (B)$. Clearly, ${\mathbf {db}}$ is consistent and $q({\mathbf {db}})=\{\mu (\mathbf {z})\}=\lfloor {q}\rfloor ({\mathbf {db}})$. From $\lfloor {q}\rfloor \sqsubseteq \lfloor {q'}\rfloor $, it follows $\mu (\mathbf {z})\in q'({\mathbf {db}})=\lfloor {q'}\rfloor ({\mathbf {db}})$. Then, there exists a valuation $\theta $ over $\mathsf {vars}({B'})$ such that $\theta (B')\subseteq {\mathbf {db}}$ and $\theta (\mathbf {z}')=\mu (\mathbf {z})$. Then ${\mu }^{-1}\circ \theta (B')\subseteq B$ and ${\mu }^{-1}\circ \theta (\mathbf {z}')=\mathbf {z}$. Since ${\mu }^{-1}\circ \theta $ is a homomorphism from $q'$ to q, it follows $q\sqsubseteq q'$ by the Homomorphism Theorem [1, Theorem 6.2.3]. $\square $

Lemma 1 does not hold for conjunctive queries with self-joins, as shown next.

Example 4

Let $q=\{\langle {}\rangle \mid R(\underline{a},b)\wedge R(\underline{a},c)\}$. For every uncertain database ${\mathbf {db}}$, $\lfloor {q}\rfloor ({\mathbf {db}})=\{\}$. Let $q'$ be a query such that $q\not \sqsubseteq q'$ (such query obviously exists). Then, $\lfloor {q}\rfloor \sqsubseteq \lfloor {q'}\rfloor $ and $q\not \sqsubseteq q'$. $\square $

Lemma 1 allows us to construct strategies of the form (3), as follows. Assume that the input to ${\mathsf {OPTSTRATEGY}}$ is a self-join-free conjunctive query $q(\mathbf {z})$. For some positive integer $\ell $, generate self-join-free conjunctive queries $q_{1},\dots ,q_{\ell }$ such that for each $i\in \{1,\dots ,\ell \}$, $q_{i}\sqsubseteq q$ and $\lfloor {q_{i}}\rfloor $ is in ${\mathbf {FO}}$. The condition $q_{i}\sqsubseteq q$ is decidable by [1, Theorem 6.2.3]; the condition that $\lfloor {q_{i}}\rfloor $ is in ${\mathbf {FO}}$ is decidable by Theorem 2. Then by Lemma 1, $\bigcup _{i=1}^{\ell }\lfloor {q_{i}}\rfloor $ is a strategy for q.

Unfortunately, Theorem 3 given hereinafter states that there are cases where no strategy of the form (3) is optimal. We first generalize Lemma 1 to unions.

Lemma 2

Let $q_{0}$, $q_{1}$, ...$q_{\ell }$ be self-join-free m-ary conjunctive queries. Then, $\lfloor {q_{0}}\rfloor \sqsubseteq \bigcup _{i=1}^{\ell }\lfloor {q_{i}}\rfloor $ if and only if for some $i\in \{1,\dots ,\ell \}$, $q_{0}\sqsubseteq q_{i}$.

Proof

Straightforward. Assume $\lfloor {q_{0}}\rfloor \sqsubseteq \bigcup _{i=1}^{\ell }\lfloor {q_{i}}\rfloor $. Let $q_{0}=\{\mathbf {z}_{0}\mid \exists \mathbf {y}_{0}B_{0}\}$, where $B_{0}$ is self-join-free. Let $\mu $ be an injective mapping with domain $\mathsf {vars}({B_{0}})$ that maps each variable to a fresh constant not occurring elsewhere. Since $\mu $ is injective, its inverse ${\mu }^{-1}$ is well defined. Let ${\mathbf {db}}=\mu (B_{0})$. Clearly, ${\mathbf {db}}$ is consistent and $q_{0}({\mathbf {db}})=\{\mu (\mathbf {z}_{0})\}=\lfloor {q_{0}}\rfloor ({\mathbf {db}})$. From $\lfloor {q_{0}}\rfloor \sqsubseteq \bigcup _{i=1}^{\ell }\lfloor {q_{i}}\rfloor $, it follows that we can assume $i\in \{1,\dots ,\ell \}$ such that $\mu (\mathbf {z}_{0})\in q_{i}({\mathbf {db}})=\lfloor {q_{i}}\rfloor ({\mathbf {db}})$. Let $q_{i}=\{\mathbf {z}_{i}\mid \exists \mathbf {y}_{i}B_{i}\}$. Then, there exists a valuation $\theta $ over $\mathsf {vars}({B_{i}})$ such that $\theta (B_{i})\subseteq {\mathbf {db}}$ and $\theta (\mathbf {z}_{i})=\mu (\mathbf {z}_{0})$. Then ${\mu }^{-1}\circ \theta (B_{i})\subseteq B_{0}$ and ${\mu }^{-1}\circ \theta (\mathbf {z}_{i})=\mathbf {z}_{0}$. Since ${\mu }^{-1}\circ \theta $ is a homomorphism from $q_{i}$ to $q_{0}$, it follows $q_{0}\sqsubseteq q_{i}$. $\square $

Theorem 3

There exists a self-join-free conjunctive query q such that for every strategy $\varphi $ of the form (3) for q, there exists another strategy $\psi $ of the form (3) for q such that $\varphi \sqsubset \psi $.

Proof

Let $q=\{\langle {}\rangle \mid \exists x\exists y\exists z\left( {R(\underline{x},z)\wedge S(\underline{y},z)}\right) \}$. Then $\lfloor {q}\rfloor $ is not in ${\mathbf {FO}}$ [14]. For every constant c, let $q_{c}$ be the query defined by $\{\langle {}\rangle \mid \exists y\exists z\left( {R(\underline{c},z)\wedge S(\underline{y},z)}\right) \}$. For every constant c, we have that $\lfloor {q_{c}}\rfloor \sqsubseteq \lfloor {q}\rfloor $ and $\lfloor {q_{c}}\rfloor $ is in ${\mathbf {FO}}$.

Let $\varphi $ be a strategy for q of the form (3). Let A be the greatest set of constants such that for all $c\in A$, there exists some $i\in \{1,\dots ,\ell \}$ such that $q_{i}\equiv q_{c}$. Let b be a constant such that $b\not \in A$. Clearly $\varphi \sqsubseteq \varphi \cup \lfloor {q_{b}}\rfloor \sqsubseteq \lfloor {q}\rfloor $. It suffices to show that $\varphi \sqsubset \varphi \cup \lfloor {q_{b}}\rfloor $, meaning that $\varphi $ is not optimal.

Assume towards a contradiction that $\lfloor {q_{b}}\rfloor \sqsubseteq \varphi $. By Lemma 2, there exists $i\in \{1,\dots ,\ell \}$ such that $q_{b}\sqsubseteq q_{i}\sqsubseteq q$. Let $q_{i}$ be the existential closure of $\left( {R(\underline{s},t)\wedge S(\underline{u},v)}\right) $. From $q_{i}\sqsubseteq q$, it follows that $t=v$. From $q_{b}\sqsubseteq q_{i}$ and $b\not \in A$, it follows that s, t, u are pairwise distinct variables. But then $q_{i}\equiv q$, contradicting that $\lfloor {q_{i}}\rfloor $ is in ${\mathbf {FO}}$. We conclude by contradiction that $\varphi \sqsubset \varphi \cup \lfloor {q_{b}}\rfloor $. $\square $

5.2 Post-processing by Unions and Quantification

The proof of Theorem 3 indicates that strategies of the form (3) lack expressiveness because the number of constants in such strategies is bounded. An obvious extension is to look for strategies that replace constants with existentially quantified variables. The following example shows how such extension solves the lack of expressiveness that underlies the proof of Theorem 3.

Example 5

Let $q=\exists x\exists y\exists z\left( {R(\underline{x},z)\wedge S(\underline{y},z)}\right) $. Let $\varphi $ be the ${\mathsf {CQAFO}}$ formula defined by $\varphi \mathrel {\mathop :}=\exists X\lfloor {\exists y\exists z\left( {R(\underline{X},z)\wedge S(\underline{y},z)}\right) }\rfloor $. It can be shown that $\varphi $ is a strategy for q, i.e., $\varphi \sqsubseteq \lfloor {q}\rfloor $ and $\lfloor {\exists y\exists z\left( {R(\underline{X},z)\wedge S(\underline{y},z)}\right) }\rfloor $ is in ${\mathbf {FO}}$. Recall from Example 2 that the use of upper case X is for readability. $\square $

Assume that the input to ${\mathsf {OPTSTRATEGY}}$ is a self-join-free conjunctive query $q(\mathbf {z})$. In this section, we investigate strategies of the form

$$\begin{aligned} \bigcup _{i=1}^{\ell }Q_{i}, \end{aligned}$$

(4)

where for each $i\in \{1\dots ,\ell \}$, $Q_{i}$ is a ${\mathsf {CQAFO}}$ query of the form

$$\begin{aligned} \{\mathbf {z}_{i}\mid \exists \mathbf {X}_{i}\lfloor {\exists \mathbf {y}_{i}B_{i}}\rfloor \}, \end{aligned}$$

(5)

in which $\mathbf {z}_{i}$ has the same length as $\mathbf {z}$, and $B_{i}$ is a self-join-free conjunction of atoms. It is understood that $\mathbf {z}_{i}$, $\mathbf {X}_{i}$, and $\mathbf {y}_{i}$ have, pairwise, no variables in common, and that $\mathsf {vars}({\mathbf {z}_{i}\mathbf {X}_{i}\mathbf {y}_{i}})=\mathsf {vars}({B_{i}})$. For readability, we will use upper case Q to refer to ${\mathsf {CQAFO}}$ queries of the form (5). The main tools for constructing strategies of the form (4) are provided by Theorems 4 and 5.

Theorem 4

The following problem is decidable in polynomial time. Given a ${\mathsf {CQAFO}}$ query Q of the form (5), is Q in ${\mathbf {FO}}$? Moreover, if Q is in ${\mathbf {FO}}$, then a relational calculus query equivalent to Q can be effectively constructed.

Proof

A ${\mathsf {CQAFO}}$ query Q of the form (5) is in ${\mathbf {FO}}$ if and only if $\lfloor {\exists \mathbf {y}_{i}B_{i}}\rfloor $ is in ${\mathbf {FO}}$. The latter condition is decidable by Theorem 2.

Theorem 5

Given a self-join-free conjunctive query $q_{1}$ and a ${\mathsf {CQAFO}}$ query $Q_{2}$ of the form (5), it can be decided whether $Q_{2}\sqsubseteq \lfloor {q_{1}}\rfloor $.

Proof

(Crux.) Let $q_{1}=\{\mathbf {z}_{1}\mid \exists \mathbf {y}_{1} B_{1}\}$ and $Q_{2}=\{\mathbf {z}_{2}\mid \exists \mathbf {X}_{2}\lfloor {\exists \mathbf {y}_{2}B_{2}}\rfloor \}$. It can be shown that $Q_{2}\sqsubseteq \lfloor {q_{1}}\rfloor $ if and only if there exists a valuation $\theta $ over $\mathsf {vars}({B_{1}})$ such that $\theta (\mathbf {z}_{1})=\mathbf {z}_{2}$ and $\theta (B_{1})\subseteq B_{2}$. $\square $

We point out that Theorem 5 is interesting in its own right. It is well known [1, Corollary 6.3.2] that containment of relational calculus queries is undecidable. A large fragment for which containment is decidable is the class of unions of conjunctive queries. Notice, however, that the queries in the statement of Theorem 5 need not be monotone (and even not first-order), and that decidability of query containment for such queries is not obvious.

Example 6

Let $Q=\{x\mid \exists Y\lfloor {R(\underline{x},Y)}\rfloor \}$. Let ${\mathbf {db}}=\{R(\underline{a},1)\}$ and ${\mathbf {db}}'=\{R(\underline{a},1)$, $R(\underline{a},2)\}$. Then ${\mathbf {db}}\subseteq {\mathbf {db}}'$, but $Q({\mathbf {db}})=\{a\}$ is not contained in $Q({\mathbf {db}}')=\{\}$. Hence Q is not monotone. We have that Q is equivalent to the following relational calculus query:

$$ \{x\mid \exists y\left( {R(\underline{x},y)\wedge \forall y'\left( {R(\underline{x},y')\rightarrow y=y'}\right) }\right) \}. $$

$\square $

Assume that the input to ${\mathsf {OPTSTRATEGY}}$ is a self-join-free conjunctive query $q(\mathbf {z})$. Theorem 5 allows us to build a strategy of the form (4) for q as follows. Let A be the set of constants that occur in q. Let $\varphi $ be the disjunction of all (up to variable renaming) ${\mathsf {CQAFO}}$ formulas $Q_{i}$ of the form (5) that use exclusively constants from A such that $Q_{i}\sqsubseteq \lfloor {q}\rfloor $ and $Q_{i}$ is in ${\mathbf {FO}}$. Clearly, there are at most finitely many such formulas (up to variable renaming). Containment of $Q_{i}$ in $\lfloor {q}\rfloor $ is decidable by Theorem 5. Finally, the condition that $Q_{i}$ is in ${\mathbf {FO}}$ is decidable by Theorem 4. The following theorem remedies the negative result of Theorem 3.

Theorem 6

For every self-join-free conjunctive query q, there exists a computable strategy $\varphi $ of the form (4) for q, such that for every strategy $\psi $ of the form (4) for q, $\psi \sqsubseteq \varphi $.

Proof

Assume that the input to ${\mathsf {OPTSTRATEGY}}$ is a self-join-free conjunctive query $q(\mathbf {z})$. Let $\varphi $ be the strategy defined in the paragraph preceding this theorem. Let $Q=\{\mathbf {z}_{0}\mid \exists \mathbf {X}\lfloor {\exists \mathbf {y}B}\rfloor \}$ be a query of the form (5) where B is a self-join-free conjunction of atoms such that Q is in ${\mathbf {FO}}$ and $Q\sqsubseteq \lfloor {q}\rfloor $. If all constants that occur in B also occur in q, then Q is already contained in some disjunct of $\varphi $ (by construction of $\varphi $). Assume next that B contains some constants that do not occur in q, and let these constants be $a_{1},\dots ,a_{m}$. For $i\in \{1,\dots ,m\}$, let $X_{i}$ be a new fresh variable. Let $B'$ be the conjunction obtained from B by replacing each occurrence of each $a_{i}$ with $X_{i}$. Let $Q'=\{\mathbf {z}_{0}\mid \exists \mathbf {X}\exists X_{1}\cdots \exists X_{m}\lfloor {\exists \mathbf {y}B'}\rfloor \}$. From the proof of Theorem 2, it follows $Q'\sqsubseteq \lfloor {q}\rfloor $. It can be easily seen that $Q\sqsubseteq Q'$. Furthermore, from [13], it follows that $Q'$ is in ${\mathbf {FO}}$. Since all constants that occur in $B'$ also occur in q, we have that $Q'$ is already contained in some disjunct of $\varphi $ (by construction of $\varphi $).

To conclude, whenever $Q=\{\mathbf {z}_{0}\mid \exists \mathbf {X}\lfloor {\exists \mathbf {y}B}\rfloor \}$ is a query of the form (5) where B is a self-join-free conjunction of atoms such that Q is in ${\mathbf {FO}}$ and $Q\sqsubseteq \lfloor {q}\rfloor $, we have that $\varphi \cup Q\sqsubseteq \varphi $. $\square $

So far, we have imposed no restrictions on the size of the computable strategy $\varphi $ in the statement of Theorem 6. From a practical point of view, it is interesting to construct, among all optimal strategies $\varphi $ of the form (4), the one with the smallest number $\ell $ of disjuncts. It is an open question, however, how to minimize strategies of the form (4).

6 Conclusion

We have studied a realistic setting for divulging an inconsistent database to end users. In this setting, users access the database exclusively via syntactically restricted queries, and get exclusively consistent answers computable in ${\mathbf {FO}}$ data complexity. If the data complexity is higher, then the query will be rejected, in which case users have to fall back on strategies that obtain a large (the larger, the better) subset of the consistent answer. Such strategies combine answers obtained from several “easier” queries.

Although our setting applies to arbitrary queries and constraints, we searched for strategies when constraints are primary keys, and the database is accessible only via self-join-free conjunctive queries for which consistent query answering is in ${\mathbf {FO}}$. Under these access restrictions, we showed how to construct strategies that combine answers by means of union and quantification. It is an open question whether our strategies can still be improved, e.g., by using negation.

Notes

1.
According to [6], marital stability is higher when the wife is 5+ years younger than her husband.
2.
${\mathbf {r}}\models q(\mathbf {a})$ is defined in the standard way.
3.
Notice that is can be easily verified that $\lfloor {\{\mathbf {z_{i}}\mid \exists \mathbf {y}_{i}B_{i}\}}\rfloor \equiv \{\mathbf {z_{i}}\mid \lfloor {\exists \mathbf {y}_{i}B_{i}}\rfloor \}$.

References

Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
MATH Google Scholar
Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent databases. In: PODS, pp. 68–79. ACM Press (1999)
Google Scholar
Bertossi, L.E.: Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2011)
Google Scholar
Bertossi, L.E., Li, L.: Achieving data privacy through secrecy views and null-based virtual updates. IEEE Trans. Knowl. Data Eng. 25(5), 987–1000 (2013)
Article Google Scholar
Bienvenu, M., Rosati, R.: Tractable approximations of consistent query answering for robust ontology-based data access. In: IJCAI. IJCAI/AAAI (2013)
Google Scholar
Cao, N.V., Fragnire, E., Gauthier, J.-A., Sapin, M., Widmer, E.D.: Optimizing the marriage market: an application of the linear assignment model. Eur. J. Oper. Res. 202(2), 547–553 (2010)
Article MATH Google Scholar
Chomicki, J., Marcinkowski, J.: Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197(1–2), 90–121 (2005)
Article MATH MathSciNet Google Scholar
Dalvi, N.N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Commun. ACM 52(7), 86–94 (2009)
Article Google Scholar
Dalvi, N.N., Re, C., Suciu, D.: Queries and materialized views on probabilistic databases. J. Comput. Syst. Sci. 77(3), 473–490 (2011)
Article MATH MathSciNet Google Scholar
Fan, W., Geerts, F.: Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2012)
MATH Google Scholar
Fuxman, A.D., Miller, R.J.: First-order query rewriting for inconsistent databases. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 337–351. Springer, Heidelberg (2005)
Chapter Google Scholar
Immerman, N.: Descriptive Complexity. Graduate Texts in Computer Science. Springer, New York (1999)
Book MATH Google Scholar
Koutris, P., Wijsen, J.: The data complexity of consistent query answering for self-join-free conjunctive queries under primary key constraints. In: PODS, pp. 17–29. ACM (2015)
Google Scholar
Koutris, P., Wijsen, J.: A trichotomy in the data complexity of certain query answering for conjunctive queries. CoRR, abs/1501.07864 (2015)
Google Scholar
Libkin, L.: Elements of Finite Model Theory. Springer, New York (2004)
Book MATH Google Scholar
Libkin, L.: SQL’s three-valued logic and certain answers. In: ICDT. LIPIcs, vol. 31, pp. 94–109. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2015)
Google Scholar
Maslowski, D., Wijsen, J.: A dichotomy in the complexity of counting database repairs. J. Comput. Syst. Sci. 79(6), 958–983 (2013)
Article MATH MathSciNet Google Scholar
Maslowski, D., Wijsen, J.: Counting database repairs that satisfy conjunctive queries with self-joins. In: ICDT, pp. 155–164. OpenProceedings.org (2014)
Google Scholar
Wijsen, J.: Making more out of an inconsistent database. In: Benczúr, A.A., Demetrovics, J., Gottlob, G. (eds.) ADBIS 2004. LNCS, vol. 3255, pp. 291–305. Springer, Heidelberg (2004)
Chapter Google Scholar
Wijsen, J.: Charting the tractability frontier of certain conjunctive query answering. In: PODS, pp. 189–200. ACM (2013)
Google Scholar
Wijsen, J.: A survey of the data complexity of consistent query answering under key constraints. In: Beierle, C., Meghini, C. (eds.) FoIKS 2014. LNCS, vol. 8367, pp. 62–78. Springer, Heidelberg (2014)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Universiteit Antwerpen, Antwerpen, Belgium
Floris Geerts
Université de Mons, Mons, Belgium
Fabian Pijcke & Jef Wijsen

Authors

Floris Geerts
View author publications
You can also search for this author in PubMed Google Scholar
Fabian Pijcke
View author publications
You can also search for this author in PubMed Google Scholar
Jef Wijsen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jef Wijsen .

Editor information

Editors and Affiliations

Fakultät für Mathematik und Informatik, FernUniversität in Hagen, Hagen, Germany
Christoph Beierle
California Polytechnic State University, SAN LUIS OBISPO, USA
Alex Dekhtyar

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Geerts, F., Pijcke, F., Wijsen, J. (2015). First-Order Under-Approximations of Consistent Query Answers. In: Beierle, C., Dekhtyar, A. (eds) Scalable Uncertainty Management. SUM 2015. Lecture Notes in Computer Science(), vol 9310. Springer, Cham. https://doi.org/10.1007/978-3-319-23540-0_24

Download citation

DOI: https://doi.org/10.1007/978-3-319-23540-0_24
Published: 16 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23539-4
Online ISBN: 978-3-319-23540-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

First-Order Under-Approximations of Consistent Query Answers

Abstract

Similar content being viewed by others

On the Data Complexity of Consistent Query Answering

Dealing with Inconsistency in Databases: An Overview

A Survey of the Data Complexity of Consistent Query Answering under Key Constraints

Keywords

1 Introduction

2 Related Work

3 Preliminaries

Example 1

4 A Framework for Divulging Inconsistent Databases

Example 2

Example 3

4.1 The Language \({\mathsf {CQAFO}}\)

4.2 Restrictions on Data Complexity

Theorem 1

Proof

Theorem 2

4.3 Strategies

Definition 1

5 How to Construct Good Strategies?

5.1 Post-processing by Unions only

Lemma 1

Proof

Example 4

Lemma 2

Proof

Theorem 3

Proof

5.2 Post-processing by Unions and Quantification

Example 5

Theorem 4

Proof

Theorem 5

Proof

Example 6

Theorem 6

Proof

6 Conclusion

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation