1 Introduction

Today, dealing with incomplete data is imperative, especially due to the increasing use of applications involving data integration and data exchange. Having been extensively studied in relational databases (see, for instance, Imielinski and Lipski Jr. [31], Grahne [22], Libkin [37], Reiter [49], Zaniolo [56]), incomplete information now also appears as an important issue in the semantic web domain [5, 44, 54].

Incompleteness can be of many kinds, and finding a trade-off between expressivity and the difficulty of query answering is still a challenging issue [1]. Our work focuses on a database perspective where incompleteness arises when values (such as attribute values, or class or property instances) are missing. In this setting, we follow Reiter [49], who showed how to extend the relational data model accordingly, thus providing FOL (first-order logic) semantics to null values of type ‘value exists but is currently unknown’. Indeed, in Reiter [49], databases are presented not as models but as theories (i.e. sets of formulas), a point of view that proved well suited to providing an intuitively correct semantics for null values. The logical approach to databases has since been largely influenced by that work. To focus the discussion, let us consider an example similar to the one used in Reiter’s paper.

Example 1.1

In a database storing researcher names, conferences and their attendances, the information that Ann and Bob are researchers and that Ann attends \(VLDB'18\) can be stored through the formula

$$\begin{aligned} \textsf {Researcher}(Ann)\wedge \textsf {Researcher}(Bob) \wedge \textsf {Conf}(VLDB'18)\,\wedge \, \textsf {Attends} (Ann, VLDB'18). \end{aligned}$$

In order to store the fact that ‘Bob attended a conference, but we do not know which one’, Reiter proposed the following formula

$$\begin{aligned} (\exists x)(\textsf {Conf}(x) \wedge \textsf {Attends}(Bob, x)) \end{aligned}$$

which asserts the existence of a conference x which Bob attended. Furthermore, by naming this conference with a null value \(N_1\) and by writing

$$\begin{aligned} \textsf {Conf}(N_1) \wedge \textsf {Attends}(Bob, N_1) , \end{aligned}$$

Reiter linked the database terminology to FOL in the following terms: logicians call [the null value] a Skolem constant, providing a technical device for the elimination of existential quantifiers in proof theory. \(\square \)

The viewpoint illustrated in the above example is exactly the one we adopt in this paper. However, instead of dealing with query answering (as in Reiter [49] and also in the majority of the papers dealing with incomplete information, such as Imielinski and Lipski Jr. [31], Grahne [22], Libkin [37] and Zaniolo [56]), our proposal focuses on updates. Although updates on incomplete databases have been the subject of several works, such as Grahne [22] and Winslett [55], it is worth noting that the dynamic aspects of data are usually neglected in favour of querying features. Modelling database updates as updates on a FOL theory has been a way to organise and explain different update approaches. In this context, the question becomes: given a FOL theory \(\mathsf {Th}\) and new information concerning changes on \(\mathsf {Th}\), how can we find a new theory \({\mathsf {Th}}_1\) which integrates this new information? Update semantics has been considered under different viewpoints, varying not only with respect to applications but also with respect to the research domain considered (artificial intelligence, databases, philosophy, etc.). No consensus has been established, and many different approaches are available. However, Winslett [55] offers a classification identifying two semantic classes for FOL theory updating: the model-based approach (whose goal is to update the models of a theory) and the formula-based approach (whose goal is to update the set of formulas). The approach proposed in this paper falls in the latter category. Our choice to start this paper with a theoretical overview of our update semantics reflects our concern to place our work within this wide landscape of updating policies.

To achieve our goal, we first recall the following basic aspects which are taken into account in our proposal: (i) \(\mathsf {Th}_{\mathbb {D,C}}\) consists of a unique conjunctive formula \(\mathbb {D}\) which represents the database instance and a set of axioms \(\mathbb {C}\) which represents constraints (logical implications); (ii) \(\mathsf {Th}_{\mathbb {D,C}}\) is consistent in the sense that any instantiation of the variables in \(\mathbb {D}\) yields a model of \(\mathbb {C}\); and (iii) the well-known closed world assumption (CWA) and unique name assumption (UNA) hold.

In this context, an update may change the conjunctive formula \(\mathbb {D}\) (i.e. the database instance) to obtain a new consistent theory \(\mathsf {Th}_{\mathbb {D',C}}\) from the original one \(\mathsf {Th}_{\mathbb {D,C}}\). The changes on \(\mathbb {D}\), when they are possible, follow the basic principle that: (i) the insertion of a formula \(\beta \) in \(\mathbb {D}\) transforms \(\mathbb {D}\) into a new conjunctive formula \(\mathbb {D'}\) having \(\beta \) as a sub-formula and (ii) the deletion of a formula \(\beta \) from \(\mathbb {D}\) transforms \(\mathbb {D}\) into a new formula \(\mathbb {D'}\) of which \(\beta \) is not a sub-formula.

Example 1.2

In Example 1.1, the current database instance \(\mathbb {D}\) is the FOL formula

$$\begin{aligned}&\textsf {Researcher}(Ann)\wedge \textsf {Researcher}(Bob) \wedge \textsf {Conf}(VLDB'18)\, \\&\quad \wedge \,\textsf {Attends}(Ann, VLDB'18) \wedge (\exists x)(\textsf {Conf}(x) \wedge \textsf {Attends}(Bob, x)) \end{aligned}$$

Moreover, assume now that \(\mathbb {C}\) consists of the following axiom:

$$\begin{aligned} (\forall x, y)(\textsf {Attends}(x,y) \Rightarrow \textsf {Researcher}(x)) \end{aligned}$$
(1)

Clearly, \(\mathsf {Th}_{\mathbb {D,C}}\) is consistent (because Ann and Bob attend conferences and are researchers). Now, the insertion of \( \textsf {Attends}(Alice,VLDB'18)\), transforming formula \(\mathbb {D}\) above into \({\mathbb {D}}\wedge \textsf {Attends}(Alice,VLDB'18)\), renders the theory inconsistent, because (1) cannot be satisfied for any instantiation of the variable x occurring in \(\mathbb {D}\). However, the insertion of \( \textsf {Attends}(Alice,VLDB'18) \wedge \textsf {Researcher}(Alice)\) results in \({\mathbb {D}}\wedge \textsf {Attends}(Alice,VLDB'18) \wedge \textsf {Researcher}(Alice)\), a new consistent theory. Therefore, in our approach, should a user ask to insert \( \textsf {Attends}(Alice,VLDB'18)\), consistency is automatically preserved through the additional insertion of \(\textsf {Researcher}(Alice)\) as a side effect.

Similarly, the deletion from \(\mathbb {D}\) of \( \textsf {Attends}(Ann,VLDB'18) \wedge \textsf {Researcher}(Ann)\) results in the new conjunctive formula \({\mathbb {D}}'\)

$$\begin{aligned} \textsf {Researcher}(Bob) \wedge \textsf {Conf}(VLDB'18) \wedge (\exists x)(\textsf {Conf}(x) \wedge \textsf {Attends}(Bob, x)) \end{aligned}$$

thus leading to a new consistent theory \(\mathsf {Th}_{\mathbb {D',C}}\). Notice that consistency would not hold when deleting from \(\mathbb {D}\) only the atom \(\textsf {Researcher}(Ann)\). \(\square \)

It should be clear from the above example that our constraints (as FOL axioms) follow a database perspective, in the sense that (i) they are never modified by updates and (ii) they must always be satisfied by the database instance (the other option being to see constraints as inference rules used in query processing).

On the other hand, as we focus on updates while keeping database consistency, we have to face the important problem of data redundancy, which has been the subject of many studies (see for instance [16, 20]). This problem occurs because, given a consistent theory and updates, there may be different target consistent theories (differing from the original one only in the formula \(\mathbb {D}\)). The question of whether there is a best solution has been answered in Fagin et al. [16] through the definition of the core. The results of that work are applied in this paper to define the ‘best’ database instance resulting from an update. The following example illustrates the intuition behind our reasoning.

Example 1.3

Let us consider the following theory, whose axioms (constraints) impose that every conference be assigned to a location for which a web page informing about visa regulations is available.

$$\begin{aligned} \begin{array}{ll} {\mathbb {C}}: &{} (\forall x) (\textsf {Conf}(x) \Rightarrow (\exists y)(\textsf {Loc}(x,y)))\\ &{}(\forall x,y)(\textsf {Loc}(x, y) \Rightarrow (\exists z)(\textsf {VisaReg}(y,z))) \\ {\mathbb {D}}: &{} \textsf {Conf}(VLDB'18) \wedge (\exists x,y)(\textsf {Loc}(VLDB'18, x) \wedge \textsf {VisaReg}(x,y)) \end{array} \end{aligned}$$

Intuitively, \(\mathsf {Th}_{\mathbb {D,C}}\) states that \(VLDB'18\) is held at a location which is currently unknown and that this location should be associated with some (currently unknown) visa regulation. The insertion in \(\mathsf {Th}_{\mathbb {D,C}}\) of \(\textsf {Loc}(VLDB'18, Rio)\) requires inserting associated information about visa regulations, according to the second constraint. However, since the associated URL is still unknown, the inserted formula is \((\exists u)(\textsf {Loc}(VLDB'18, Rio) \wedge \textsf {VisaReg}(Rio, u))\), yielding \({\mathbb {D}}^1_1\)

$$\begin{aligned}&\textsf {Conf}(VLDB'18) \wedge (\exists x,y)(\textsf {Loc}(VLDB'18, x) \wedge \textsf {VisaReg}(x,y))\, \\&\quad \wedge \,(\exists u)(\textsf {Loc}(VLDB'18, Rio) \wedge \textsf {VisaReg}(Rio, u)). \end{aligned}$$

Notice that, in \({\mathbb {D}}_1^1\), the last conjunct is a partial instantiation of the second one, implying that \({\mathbb {D}}^1_1\) contains redundancies. To avoid this situation, the result of the insertion is simply defined as the equivalent formula \({\mathbb {D}}_1\)

$$\begin{aligned} \textsf {Conf}(VLDB'18) \wedge (\exists u)(\textsf {Loc}(VLDB'18, Rio) \wedge \textsf {VisaReg}(Rio, u)). \end{aligned}$$

As a last update, assume now the insertion of \(\textsf {VisaReg}(Rio, url1)\), meaning that the URL containing visa regulations is now identified as url1. Reasoning as above then produces the following non-redundant instance \({\mathbb {D}}_2\)

$$\begin{aligned} \textsf {Conf}(VLDB'18) \wedge \textsf {Loc}(VLDB'18, Rio) \wedge \textsf {VisaReg}(Rio, url1). \end{aligned}$$

According to Fagin et al. [16], simplifications such as the ones above amount to taking the core of an instance, shown to be a ‘condensed’ version that preserves the answers to queries.

Notice that this approach is not the one adopted by Reiter, for whom a database instance such as \({\mathbb {D}}_1^1\) above would be acceptable. However, keeping the database size as small as possible was out of the scope of Reiter [49]. \(\square \)

We emphasize that the above example illustrates current challenges considered in the RDF world (see [54]) and that examples from this context can be expressed in a FOL formalism, ensuring generic solutions.

The main goal of our paper is to propose an approach to updating databases with incomplete information that not only generalizes naive tables [1] but also applies to most recent database models, such as graph database systems or RDFS triple store systems. To achieve this goal, we consider the following context, based on the generic FOL formalism:

  • A database instance is a closed conjunctive formula built up with atoms, possibly containing existentially quantified variables.

  • The database instance must satisfy a set of constraints, each of them being a closed universally quantified implication, possibly with existentially quantified variables in the right-hand side. These constraints are also called TGDs (tuple-generating dependencies) in the literature.

In this context, the following issues are investigated.

1.1 How to deal with marked nulls?

There is no consensus on how database semantics should be extended to deal with nulls and, furthermore, more difficulties arise when considering non-relational database models such as graph databases. To cope with this issue, we consider a FOL formalism in which existentially quantified variables appear in their skolemized form, as marked nulls (as in Reiter [49], Imielinski and Lipski Jr. [31], Lipski Jr. [41]). In doing so, we can model not only naive tables, but also most database models that are based on FOL. However, it should be noticed that updating naive tables has been the object of very few investigations and, in any case, this issue has not been studied in the presence of TGDs.

1.2 How to avoid redundancies?

As shown in Example 1.3, updates with incomplete information can yield redundancies that, when dealing with real-size databases, must be avoided. We argue in this respect that our update processing keeps the database instance as ‘small’ as possible, while preserving answers to queries as defined in Fagin et al. [16]. We achieve this important feature by considering only instances equal to their core (see [19]). Although the complexity of core computation is known to be high in general [20], we refer to Sect. 7 for possible optimizations in our context.

1.3 How to preserve consistency while avoiding the generation of too many nulls?

When considering TGDs, non-redundant incomplete information may be generated, possibly indefinitely, when maintaining database consistency. A simple example of this issue is illustrated in the following.

Example 1.4

Consider the following constraints about citations:

$$\begin{aligned} \begin{array}{ll} {\mathbb {C}}: &{} (\forall x,y) (\textsf {Cites}(x,y) \Rightarrow \textsf {Paper}(x))\\ &{}(\forall x,y) (\textsf {Cites}(x,y) \Rightarrow \textsf {Paper}(y))\\ &{}(\forall x)(\textsf {Paper}(x) \Rightarrow (\exists y)(\textsf {Cites}(x,y))) \\ \end{array} \end{aligned}$$

These constraints stipulate that citations deal with papers and that every paper must contain at least one citation. Due to the third constraint, inserting the formula \(\textsf {Cites}(P1,P2) \wedge \textsf {Paper}(P1) \wedge \textsf {Paper}(P2)\) implies that the formula \((\exists y_1)(\textsf {Cites}(P2,y_1)\wedge \textsf {Paper}(y_1))\) has to be inserted as well. We then clearly enter an infinite process, requiring the insertion of a formula of the form \((\exists y_1 \ldots y_k\ldots )(\textsf {Cites}(P2,y_1)\wedge \textsf {Paper}(y_1)\wedge \ldots \wedge \textsf {Cites}(y_{k-1},y_k)\wedge \textsf {Paper}(y_k) \wedge \ldots )\). \(\square \)

To cope with this problem, it is usual to impose restrictions on the constraints so as to ensure the termination of the generation of nulls, known as the chase procedure (see [6, 13, 23, 45]). In our approach, we make no restriction on the constraints. Instead, we define the notion of degree of a null as the nesting degree assigned to a null generated by applying a constraint (roughly speaking, the degree is represented by the integer \(k-1\) in Example 1.4). Then, assuming a pre-defined maximal degree \(\delta _{max}\), we reject insertions that would give a null a degree higher than \(\delta _{max}\). In this way, whatever the constraints, we limit the spreading of nulls in the database instance. Referring to Example 1.4, we allow the constraints but, for any maximal degree \(\delta _{max}\), we reject the insertion. Notice, however, that the insertion of \(\textsf {Cites}(P1,P2) \wedge \textsf {Cites}(P2,P1) \wedge \textsf {Paper}(P1) \wedge \textsf {Paper}(P2)\) is allowed.
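To make the degree bound concrete, here is a minimal Python sketch, hard-wired to the constraints of Example 1.4; the encoding of atoms as tuples, the representation of nulls as `('null', id, degree)` triples and the function name are our illustrative assumptions, not the paper's algorithm.

```python
import itertools

def insert_papers(instance, new_atoms, max_degree):
    # Degree-bounded completion sketch for Example 1.4's constraints:
    # every Paper(x) must have some Cites(x, y) with Paper(y).
    # A null generated to satisfy a paper of degree d gets degree d + 1;
    # the insertion is rejected (None) once a null would exceed max_degree.
    inst = set(instance) | set(new_atoms)
    fresh = itertools.count()
    # papers introduced by the update, all at degree 0
    pending = [(a[1], 0) for a in new_atoms if a[0] == 'Paper']
    while pending:
        paper, degree = pending.pop()
        if any(a[0] == 'Cites' and a[1] == paper for a in inst):
            continue  # the third constraint already holds for this paper
        if degree + 1 > max_degree:
            return None  # reject: a null would exceed the maximal degree
        null = ('null', next(fresh), degree + 1)
        inst |= {('Cites', paper, null), ('Paper', null)}
        pending.append((null, degree + 1))
    return inst
```

With any `max_degree`, inserting `Cites(P1,P2) ∧ Paper(P1) ∧ Paper(P2)` is rejected, whereas the cyclic insertion `Cites(P1,P2) ∧ Cites(P2,P1) ∧ Paper(P1) ∧ Paper(P2)` is accepted as-is, mirroring the discussion above.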

1.4 How to keep updates deterministic?

We propose algorithms for inserting or deleting sets of atoms involving incomplete information (represented as nulls not occurring in the current database instance). As illustrated in the following example, deterministic updating is an issue for deletions; we overcome this issue by assuming that choices can be made a priori, at constraint design time.

Example 1.5

Consider the following theory \(\mathsf {Th}_{\mathbb {D,C}}\), where the constraint imposes that, for any conference, all attendees must be registered.

$$\begin{aligned} \begin{array}{ll} {\mathbb {C}}:&{}(\forall x,y)(\textsf {Attends}(x, y) \wedge \textsf {Conf}(y) \Rightarrow \textsf {Registered}(x,y)) \\ {\mathbb {D}}: &{} \textsf {Attends}(Bob, VLDB'18) \wedge \textsf {Conf}(VLDB'18) \\ &{}\wedge \,{\textsf {Registered}(Bob,VLDB'18)\qquad \qquad \qquad \qquad }\\ \end{array} \end{aligned}$$

If Bob’s registration is cancelled, at least one of the facts \(\textsf {Attends}(Bob, VLDB'18)\) or \(\textsf {Conf}(VLDB'18)\) has to be deleted along with \(\textsf {Registered}(Bob,VLDB'18)\), in order to maintain consistency. This situation yields three possible ways of processing the deletion, thus showing a case of non-deterministic update. In our approach, we assume that, for each constraint, the atom in the body to be deleted upon deletion of the head is known. In our example, a ‘natural’ choice would be the atom \(\textsf {Attends}(x, y)\). In this case, the deletion of \(\textsf {Registered}(Bob,VLDB'18)\) would imply deleting \(\textsf {Attends}(Bob, VLDB'18)\) in order to maintain consistency. To record this choice, we mark the chosen atom with ‘-’ as a superscript. In our example, the constraint would be written as \((\forall x,y)(\textsf {Attends}^-(x, y) \wedge \textsf {Conf}(y) \Rightarrow \textsf {Registered}(x,y))\). \(\square \)
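The role of the marked atom can be sketched in a few lines of Python. This is only an illustration of the deterministic-deletion idea on the example above, not the paper's deletion algorithm (Sect. 5); the tuple encoding, the ‘?’-prefix for variables and the assumption that the marked atom only uses head variables are ours.

```python
def delete(instance, atom, constraints):
    # Deterministic deletion sketch: each constraint is a (head, body_minus)
    # pair of atom templates, body_minus being the atom marked with '-'.
    # Deleting a fact that matches a head also deletes the corresponding
    # instance of body_minus, recursively.  Atoms are tuples
    # ('Pred', t1, ...); '?'-prefixed terms are variables.
    inst = set(instance)
    pending = [atom]
    while pending:
        a = pending.pop()
        if a not in inst:
            continue
        inst.discard(a)
        for head, body_minus in constraints:
            if head[0] != a[0] or len(head) != len(a):
                continue  # predicate or arity mismatch
            binding, ok = {}, True
            for s, t in zip(head[1:], a[1:]):
                if s.startswith('?'):
                    ok = binding.setdefault(s, t) == t
                else:
                    ok = s == t
                if not ok:
                    break
            if ok:  # instantiate body_minus and schedule its deletion
                pending.append((body_minus[0],) + tuple(
                    binding.get(s, s) for s in body_minus[1:]))
    return inst
```

On the example, deleting `Registered(Bob, VLDB'18)` with the constraint encoded as `(Registered(?x,?y), Attends(?x,?y))` removes `Attends(Bob, VLDB'18)` as well, leaving only `Conf(VLDB'18)`.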

In addition to determinism, the usual properties of minimal change and of monotonicity are also investigated in our framework.

Paper organization Section 2 provides the basics of our approach, using FOL, while database instances and database constraints are defined in Sect. 3. Then, Sect. 4 deals with the two ways constraints are applied in our update processing: in a standard way for insertions, and in an ‘opposite’ way that we call backward for deletions. Section 5 introduces our approach to updates by means of two algorithms, one for insertions and one for deletions. Properties of updates are studied in Sect. 6, while Sect. 7 positions our chase approach with regard to other chase versions and provides details on the core algorithm. Section 8 presents the results of some experiments along with possible optimizations. In Sect. 9, related work is discussed, and Sect. 10 concludes the paper.

2 Background: the core of a set of instantiated atoms

We assume that we have a standard FOL alphabet composed of three pairwise disjoint sets, namely: const, a set of constants, var, a set of variables, and pred, a set of predicates, every predicate being associated with a positive integer called its arity. In this setting, a term is a constant or a variable and an atomic formula, or an atom, is a formula of the form \(P(t_1, \ldots , t_n)\) where P is a predicate of arity n and \(t_1, \ldots , t_n\) are terms. Every atom in which no variables occur is called a fact.

A homomorphism from a set of atoms \(A_1\) to a set of atoms \(A_2\) is a mapping h from the terms of \(A_1\) to the terms of \(A_2\) such that: (i) if \(t\in \)const, then \(h(t)= t\) and (ii) if \(P(t_1,\ldots ,t_n)\) is in \(A_1\), then \(P(h(t_1), \ldots , h(t_n))\) is in \(A_2\).

A homomorphism associating every variable x in \(A_1\) with a constant a occurring in \(A_2\) is called a valuation of the variables in \(A_1\). The set \(A_1\) is isomorphic to the set \(A_2\) iff there exists a homomorphism \(h_1\) from \( A_1\) to \(A_2\) which admits an inverse homomorphism (from \(A_2\) to \(A_1\)).
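To make these notions concrete, here is a minimal Python sketch (ours, not the paper's) of a homomorphism search between atom sets by backtracking; atoms are encoded as tuples whose first element is the predicate, terms starting with ‘?’ stand for variables, and all other terms are constants (which must map to themselves).

```python
def find_homomorphism(a1, a2, h=None):
    # Search for a homomorphism h from the atom list a1 to the atom
    # list a2.  Atoms are tuples ('Pred', t1, ..., tn); a term starting
    # with '?' is a variable, anything else is a constant.
    # Returns a dict binding variables of a1 to terms of a2, or None.
    h = dict(h or {})
    if not a1:
        return h
    first, rest = a1[0], a1[1:]
    for cand in a2:
        if cand[0] != first[0] or len(cand) != len(first):
            continue  # predicate or arity mismatch
        trial, ok = dict(h), True
        for s, t in zip(first[1:], cand[1:]):
            if s.startswith('?'):
                # a variable already bound elsewhere must agree
                ok = trial.setdefault(s, t) == t
            else:
                ok = s == t  # constants map to themselves
            if not ok:
                break
        if ok:
            result = find_homomorphism(rest, a2, trial)
            if result is not None:
                return result  # extend this partial mapping
    return None  # backtracked through all candidates
```

For instance, `find_homomorphism([('Q', 'a', '?x')], [('Q', 'a', 'b')])` returns `{'?x': 'b'}`, witnessing a homomorphism from \(\{Q(a,x)\}\) to \(\{Q(a,b)\}\).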

We let \(\Phi \) be the set of all formulas \(\phi \) of the form \((\exists \mathbf{X })(\varphi _1(\mathbf{X } _1) \wedge \ldots \wedge \varphi _n(\mathbf{X } _n))\) where \(\mathbf{X } \) is a vector of variables made of all variables occurring in \(\mathbf{X } _i\) (\(i=1, \ldots , n\)) and where for every \(i=1, \ldots ,n\), \(\varphi _i(\mathbf{X } _i)\) is an atomic formula in which the free variables are those in \(\mathbf{X } _i\). Moreover, the set \(\{\varphi _1(\mathbf{X } _1), \ldots , \varphi _n(\mathbf{X } _n)\}\) is denoted by \(atoms(\phi )\).

Given \(\phi \) in \(\Phi \), a model \(M\) of \(\phi \) is a set of facts such that there exists a homomorphism from \(atoms(\phi )\) to \(M\). In such a setting, for all \(\phi _1\) and \(\phi _2\) in \(\Phi \), \(\phi _1 \Rightarrow \phi _2\) holds if each model of \(\phi _1\) is a model of \(\phi _2\), and as usual, \(\phi _1\) and \(\phi _2\) in \(\Phi \) are said to be equivalent, denoted by \(\phi _1 \Leftrightarrow \phi _2\), if \(\phi _1 \Rightarrow \phi _2\) and \(\phi _2 \Rightarrow \phi _1\) both hold, that is, if \(\phi _1\) and \(\phi _2\) have the same models.

Lemma 2.1

For all \(\phi _1\) and \(\phi _2\) in \(\Phi \), if there exists a homomorphism h from \(atoms(\phi _2)\) to \(atoms(\phi _1)\), then \(\phi _1 \Rightarrow \phi _2\) holds.

Proof

Given a model \(M_1\) of \(\phi _1\), there exists a homomorphism \(h_1\) from \(atoms(\phi _1)\) to \(M_1\). Then, assuming that there exists a homomorphism h from \(atoms(\phi _2)\) to \(atoms(\phi _1)\) we have \(h(atoms(\phi _2)) \subseteq atoms(\phi _1)\). Hence, by composition we obtain \(h_1(h(atoms(\phi _2))) \subseteq h_1(atoms(\phi _1))\), and with \(h_1(atoms(\phi _1)) \subseteq M_1\), we conclude that \(M_1\) is also a model of \(\phi _2\). Thus, \(\phi _1 \Rightarrow \phi _2\) holds. \(\square \)

The following basic lemma is a consequence of Lemma 2.1.

Lemma 2.2

For all \(\phi _1\) and \(\phi _2\) in \(\Phi \) such that \(atoms(\phi _1) \subseteq atoms(\phi _2)\), the equivalence \(\phi _1 \Leftrightarrow \phi _2\) holds if and only if there exists a homomorphism h from \(atoms(\phi _2)\) to \(atoms(\phi _1)\).

Proof

By Lemma 2.1, assuming that there exists a homomorphism h from \(atoms(\phi _2)\) to \(atoms(\phi _1)\) implies that \(\phi _1 \Rightarrow \phi _2\) holds. On the other hand, the inclusion \(atoms(\phi _1) \subseteq atoms(\phi _2)\) shows that every model of \(\phi _2\) is also a model of \(\phi _1\), that is, \(\phi _2 \Rightarrow \phi _1\) also holds.

Conversely, assuming that \(\phi _1 \Leftrightarrow \phi _2\) holds, the proof relies on the fact that every \(\phi \) in \(\Phi \) has at least one model M defined by a valuation of variables h such that (i) for all distinct variables \(x_1\) and \(x_2\) in \(\phi \), \(h(x_1) \ne h(x_2)\) and (ii) for every fact A in M, \(h^{-1}(A)\) contains exactly one atom in \(\phi \). To see this, let M be a model of \(\phi \) defined by a valuation h such that, for all distinct variables \(x_1\) and \(x_2\) occurring in \(\phi \), \(h(x_1)\) and \(h(x_2)\) are distinct constants not occurring in \(\phi \). By definition, h satisfies (i). Regarding point (ii), let A be in M such that \(h^{-1}(A)=\{\varphi _1, \varphi _2\}\). Then, \(\varphi _1\) and \(\varphi _2\) differ in at least one variable at a given position. Assume that at a given position p, \(x_1\) occurs in \(\varphi _1\) and not in \(\varphi _2\). If the term in \(\varphi _2\) at position p is a constant c, then it is not possible that \(h(x_1) =c\), and so it is not possible that \(h(\varphi _1)=h(\varphi _2)\). If the term in \(\varphi _2\) at position p is a variable \(x_2\), then according to (i), \(h(x_1) \ne h(x_2)\), and so it is not possible that \(h(\varphi _1)=h(\varphi _2)\). On the other hand, if \(h^{-1}(A) =\emptyset \), then it is clear that \(M {\setminus } \{A\}\) is also a model of \(\phi \). Therefore, \(\phi \) has at least one model M satisfying (i) and (ii).

Now let \(\phi _1\) and \(\phi _2\) be equivalent formulas in \(\Phi \) such that \(atoms(\phi _1) \subseteq atoms(\phi _2)\), and let M be a model of \(\phi _1\) satisfying points (i) and (ii). Since \(\phi _1\) and \(\phi _2\) are equivalent, M is also a model of \(\phi _2\). Let \(h_1\) (respectively \(h_2\)) be the valuation of the variables in \(\phi _1\) (respectively in \(\phi _2\)) such that \(h_1(\phi _1)=M\) (respectively \(h_2(\phi _2) \subseteq M\)). Then, let h be defined for every variable x occurring in \(\phi _2\) by \(h(x)=h_1^{-1}(h_2(x))\). Since \(h_1\) satisfies (ii), h is well defined, and for every \(\varphi \) in \(atoms(\phi _2)\), \(h(\varphi )\) is in \(atoms(\phi _1)\). Thus, the proof is complete. \(\square \)

If \(\phi _1\) and \(\phi _2\) are two formulas in \(\Phi \), \(\phi _1\) is said to be a simplification of \(\phi _2\), denoted by \(\phi _1 \preceq \phi _2\), if (i) \(\phi _1 \Leftrightarrow \phi _2\) holds, and (ii) \(atoms(\phi _1) \subseteq atoms(\phi _2)\).

The relation \(\preceq \) is a partial ordering. A simplification \(\phi '\) of \(\phi \) is said to be minimal if \(\phi ' \preceq \phi \) and there is no \(\phi ''\) such that \(\phi '' \prec \phi '\).

To illustrate the notion of simplification, consider the following very simple case where \(\phi \) is the formula \((\exists x,y)(P(a,x) \wedge P(a,y))\). It is easy to see that \((\exists x)(P(a,x))\) and \((\exists y)(P(a,y))\) are two distinct but equivalent simplifications of \(\phi \). The following proposition shows that this always holds, i.e. that minimal simplifications are equal up to variable renaming.
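Lemma 2.2 suggests a naive way to compute a minimal simplification: repeatedly drop an atom as long as the full atom set still maps homomorphically into the reduced one (the subset inclusion plus the homomorphism give equivalence). The following is a hedged Python sketch under our own tuple encoding of atoms, with ‘?’-prefixed terms as variables; it is an illustration, not the paper's core algorithm of Sect. 7.

```python
def hom(a1, a2, h=None):
    # Backtracking search for a homomorphism from atom list a1 to a2;
    # '?'-prefixed terms are variables, all other terms are constants.
    h = dict(h or {})
    if not a1:
        return h
    for cand in a2:
        if cand[0] != a1[0][0] or len(cand) != len(a1[0]):
            continue
        trial, ok = dict(h), True
        for s, t in zip(a1[0][1:], cand[1:]):
            ok = trial.setdefault(s, t) == t if s.startswith('?') else s == t
            if not ok:
                break
        if ok:
            res = hom(a1[1:], a2, trial)
            if res is not None:
                return res
    return None

def core(atoms):
    # Greedily remove an atom whenever the full set still maps
    # homomorphically into the reduced one; the fixpoint is a
    # minimal simplification (a core).
    atoms = list(atoms)
    changed = True
    while changed:
        changed = False
        for a in atoms:
            reduced = [b for b in atoms if b != a]
            if hom(atoms, reduced) is not None:
                atoms, changed = reduced, True
                break
    return atoms
```

Applied to the atoms of \((\exists x,y)(P(a,x) \wedge P(a,y))\), the sketch keeps a single atom, matching the two equivalent minimal simplifications discussed above.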

Proposition 2.1

If \(\phi \) is a formula of \(\Phi \) and \(\phi _1\) and \(\phi _2\) are two minimal simplifications of \(\phi \), then \(atoms(\phi _1)\) and \(atoms(\phi _2)\) are isomorphic.

Proof

By Lemma 2.2, for \(i=1,2\), there exists \(h_i\) such that for every \(\varphi \) in \(atoms(\phi )\), \(h_i(\varphi )\) is in \(atoms(\phi _i)\). For \(i=1,2\), if \(H_i\) is the restriction of \(h_i\) to the terms in \(\phi _j\) (\(j=1,2\) and \(j \ne i\)), we have \(H_i(atoms(\phi _j)) \subseteq atoms(\phi _i)\).

We show that the inclusion is in fact an equality. Indeed, for \(i=1\) and \(j=2\), consider the morphism \(h'_1=h_1 \circ h_2\): for every \(\varphi \) in \(atoms(\phi )\), \(h'_1(\varphi )\) is in \(h_1(atoms(\phi _2))\). Therefore, by Lemma 2.2, the formula \(\phi '_1\) such that \(atoms(\phi '_1)=h_1(atoms(\phi _2))\) is a simplification of \(\phi \). Consequently, assuming that \(atoms(\phi '_1) \subset atoms(\phi _1)\) entails that \(\phi '_1 \prec \phi _1\), a contradiction to the fact that \(\phi _1\) is a minimal simplification of \(\phi \). Recalling that \(atoms(\phi '_1)=h_1(atoms(\phi _2))\), we obtain \(h_1(atoms(\phi _2))=atoms(\phi _1)\).

Since a similar reasoning holds for \(i=2\) and \(j=1\), and since for \(i,j=1,2\) and \(i \ne j\), \(h_i(atoms(\phi _j))=H_i(atoms(\phi _j))\), we obtain that \(H_i(atoms(\phi _j)) = atoms(\phi _i)\), and the proof is complete. \(\square \)

We recall that, in the literature, the result of Proposition 2.1 follows from a similar result for graphs (see [30]), whereas our proof relies rather on FOL. Minimal simplifications are also called cores in the literature; in the remainder of this paper, we follow this usual terminology, and the core of a given formula \(\phi \) is denoted by \(core(\phi )\).

3 Consistent database with marked nulls

Database instance A database instance is basically a formula \(\phi \) in \(\Phi \) that cannot be simplified, i.e. such that \(core(\phi )=\phi \). However, in order to simplify notation, we ‘skolemize’ formulas in \(\Phi \) by replacing the variables with specific constants referred to as Skolem constants or as (marked) nulls, and by omitting the existential quantifiers. To do so, we assume an additional set of symbols in our alphabet, denoted by null and disjoint from the sets const and var. Skolem constants are then elements of the set null, which implies that a term is of one of three types: a constant, a null, or a variable. Any atom of the form \(P(t_1, \ldots ,t_n)\) where every \(t_i\) (\(i=1 , \ldots ,n\)) is in const \(\cup \) null is called an instantiated atom (not to be confused with a fact, for which all terms \(t_i\) are in const).

Moreover, as usual, the transformed conjunctive formula is written as the set of its conjuncts. In other words, a database instance is a set of instantiated atoms that can be written as \(atoms(Sk(\phi ))\), where \(Sk(\phi )\) is the Skolem version of a formula \(\phi \) in \(\Phi \) such that \(core(\phi )=\phi \).

Example 3.1

Given \(\phi \) defined by \((\exists x,y)(Q(a, x)\wedge S(a,x,y)\wedge Q(a,b))\), we have \(\phi =core(\phi )\). Indeed, although \(Q(a,x)\) can be mapped to \(Q(a,b)\), there is no atom of the form \(S(a,b,\_)\) to which we can map \(S(a,x,y)\). Thus, the set \(\{Q(a, N_1), S(a,N_1,N_2), Q(a,b)\}\) denotes the corresponding database instance, assuming that \(N_1\) and \(N_2\) are elements of the set null.

Now let \(\phi '\) be defined by \((\exists x,y)(Q(a, x) \wedge R(a,y) \wedge S(a,x,y)\wedge Q(a,b)\wedge S(a,b,y))\). Since \(core(\phi ')\) is defined by \((\exists y)(Q(a, b) \wedge R(a,y) \wedge S(a,b,y))\), \(\phi '\) cannot be seen as a database instance. In this case, the associated instance is \(\{Q(a,b), R(a,N_1), S(a,b, N_1)\}\), where \(N_1\) is in null. \(\square \)
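The skolemization step itself is mechanical: each variable is consistently replaced by a fresh null. A small Python sketch, where the ‘?’-prefix for variables and the \(N_i\) names are our encoding rather than the paper's notation:

```python
import itertools

def skolemize(atoms):
    # Replace each '?'-prefixed variable in a list of atoms by a fresh
    # marked null N1, N2, ..., reusing the same null for repeated
    # occurrences of the same variable.
    counter = itertools.count(1)
    mapping, out = {}, []
    for atom in atoms:
        terms = []
        for t in atom[1:]:
            if t.startswith('?'):
                if t not in mapping:
                    mapping[t] = 'N%d' % next(counter)
                terms.append(mapping[t])
            else:
                terms.append(t)
        out.append((atom[0],) + tuple(terms))
    return out
```

On the atoms of \(\phi \) from Example 3.1, `skolemize([('Q','a','?x'), ('S','a','?x','?y'), ('Q','a','b')])` yields `[('Q','a','N1'), ('S','a','N1','N2'), ('Q','a','b')]`, i.e. the instance \(\{Q(a, N_1), S(a,N_1,N_2), Q(a,b)\}\).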

Constraints In the database domain, constraints are usually established in order to ensure consistency and, thus, data quality. In this paper, we consider constraints that are implications known as tuple-generating dependencies, or TGDs for short [1], of the generic form:

$$\begin{aligned}&(\forall \mathbf{X } _1, \ldots ,\mathbf{X } _k, \mathbf{Y } _1, \ldots ,\mathbf{Y } _k)((L^-_1(\mathbf{X } _1, \mathbf{Y } _1) \wedge \ldots \wedge L_k(\mathbf{X } _k, \mathbf{Y } _k))\\&\quad \Rightarrow {(\exists \mathbf{Z })(L(\mathbf{X } _1, \ldots ,\mathbf{X } _k, \mathbf{Z })))} \end{aligned}$$

where \(L(\mathbf{X } _1, \ldots ,\mathbf{X } _k, \mathbf{Z })\) is an atom and for \(i=1, \ldots , k\), \(L_i(\mathbf{X } _i, \mathbf{Y } _i)\) is an atom in which \(\mathbf{X } _i\) (respectively \(\mathbf{Y } _i\)) is the vector of variables occurring in \(L_i\) and in L (respectively not in L). Removing quantifiers and denoting the left-hand side conjunction as \(B(\mathbf{X },\mathbf{Y })\) yields the simplified form \(B(\mathbf{X }, \mathbf{Y }) \Rightarrow L(\mathbf{X }, \mathbf{Z }).\)

The set of all atoms in \(B(\mathbf{X }, \mathbf{Y })\) is denoted by body(c), and the atom \(L(\mathbf{X }, \mathbf{Z })\) is referred to as head(c). Moreover, in order to ensure determinism of deletions (as discussed in Example 1.5), for every constraint c, a unique atom in body(c) is identified using a hyphen ‘-’ as a superscript. In what follows, this atom is denoted by \(body^-(c)\), with the convention that the superscript may be omitted when body(c) contains only one atom.

Constraints where no existentially quantified variables occur are said to be full (or safe according to Datalog terminology).

Constraint satisfaction is defined as usual in this context: given a set I of instantiated atoms, I satisfies the constraint c, denoted by \(I \models c\), if for every homomorphism h such that \(h(body(c))\subseteq I\), there is a homomorphism \(h'\) such that \(h(body(c)) = h'(body(c))\) and \(h'(head(c))\) belongs to I. Notice that here, by homomorphism we mean any mapping from the variables in c to the set of constants or nulls. If \({\mathbb {C}}\) is a set of constraints, I satisfies \({\mathbb {C}}\), denoted by \(I \models {\mathbb {C}}\), if for every c in \({\mathbb {C}}\), \(I \models c\).

Example 3.2

Let \({\mathbb {C}}\) be a set containing the following two constraints:

  • \(c_1:\textsf {Conf}(x_1) \Rightarrow \textsf {Loc}(x_1,y_1)\)

  • \(c_2: \textsf {Loc}(x_2, y_2) \Rightarrow \textsf {VisaReg}(y_2, z_2)\)

and consider the following sets of instantiated atoms:

$$\begin{aligned} {\mathfrak {D}}_1= & {} \{\textsf {Conf}(VLDB'18), \textsf {Loc}(VLDB'18, N_1), \textsf {VisaReg}(N_1,N_2)\}\\ {\mathfrak {D}}_2= & {} \{\textsf {Conf}(VLDB'18), \textsf {Loc}(VLDB'18, N_1), \textsf {VisaReg}(N_1,N_2), \\&{\textsf {Loc}(VLDB'18, Rio)\}~}\\ {\mathfrak {D}}_3= & {} \{\textsf {Conf}(VLDB'18), \textsf {Loc}(VLDB'18, N_1), \textsf {VisaReg}(N_1,N_2), \\&{\textsf {Loc}(VLDB'18, Rio), \textsf {VisaReg}(Rio,N_3)\}.} \end{aligned}$$

Notice that \({\mathfrak {D}}_1\) and \({\mathfrak {D}}_3\) satisfy \({\mathbb {C}}\), whereas \({\mathfrak {D}}_2\) does not, due to \(c_2\). Moreover, it should be clear that \({\mathfrak {D}}_3\) contains redundancies since instantiating \(N_1\) by Rio shows that \(\textsf {Loc}(VLDB'18, N_1)\) and \(\textsf {VisaReg}(N_1,N_2)\) are not in \(core({\mathfrak {D}}_3)\). \(\square \)
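
To make the satisfaction test concrete, here is a minimal Python sketch of the check \(I \models c\) (our own illustration: the tuple-based encoding of atoms, the ‘?’ convention for variables, and the representation of nulls as plain constants are assumptions, not notation from the paper):

```python
def match(pattern, atom, subst):
    """Try to extend `subst` so that `pattern` maps onto `atom`.
    Variables are strings starting with '?'; constants and nulls
    must match literally. Returns the extended substitution or None."""
    if pattern[0] != atom[0] or len(pattern[1]) != len(atom[1]):
        return None
    s = dict(subst)
    for p, a in zip(pattern[1], atom[1]):
        if isinstance(p, str) and p.startswith('?'):
            if p in s and s[p] != a:
                return None
            s[p] = a
        elif p != a:
            return None
    return s

def homomorphisms(patterns, atoms, subst=None):
    """All homomorphisms mapping the list `patterns` into the set `atoms`."""
    subst = {} if subst is None else subst
    if not patterns:
        yield subst
        return
    for atom in atoms:
        s = match(patterns[0], atom, subst)
        if s is not None:
            yield from homomorphisms(patterns[1:], atoms, s)

def satisfies(I, constraint):
    """I |= c: every homomorphism of body(c) into I extends to head(c)."""
    body, head = constraint
    return all(any(match(head, atom, h) is not None for atom in I)
               for h in homomorphisms(body, I))

# Example 3.2 encoded with variables as '?'-prefixed strings.
c1 = ([('Conf', ('?x',))], ('Loc', ('?x', '?y')))
c2 = ([('Loc', ('?x', '?y'))], ('VisaReg', ('?y', '?z')))
D1 = {('Conf', ('VLDB18',)), ('Loc', ('VLDB18', 'N1')), ('VisaReg', ('N1', 'N2'))}
D2 = D1 | {('Loc', ('VLDB18', 'Rio'))}
D3 = D2 | {('VisaReg', ('Rio', 'N3'))}
```

For instance, `satisfies(D2, c2)` evaluates to `False`, matching the discussion of Example 3.2, while both constraints hold on `D1` and `D3`.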

The following lemma states that considering the core of a consistent set of instantiated atoms cannot destroy consistency.

Lemma 3.1

Given a set of constraints \({\mathbb {C}}\), for every set of instantiated atoms I, if \(I \models {\mathbb {C}}\), then \(core(I) \models {\mathbb {C}}\).

Proof

Suppose, toward a contradiction, that there is a constraint \(c:B(\mathbf{X },\mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z })\) in \({\mathbb {C}}\) such that \(core(I) \not \models c\). In this case, core(I) contains all instantiated atoms in the instance \(B(\alpha , \beta )\) of body(c), but no atom of the form \(L(\alpha , \gamma )\). However, since \(core(I) \subseteq I\), we have that \(B(\alpha , \beta ) \subseteq I\), implying that I contains an atom of the form \(L(\alpha , \gamma )\), because \(I \models {\mathbb {C}}\). Thus, core(I) contains an instance \(L(\alpha ', \gamma ')\) of \(L(\alpha , \gamma )\) where \(\alpha ' \ne \alpha \). We obtain a contradiction because in this case, core(I) would contain the atoms of \(B(\alpha ', \beta ')\) instead of those of \(B(\alpha , \beta )\). \(\square \)

Databases We are now ready to formally define the notion of database as consisting of a set of instantiated atoms and a set of constraints. In our approach, only consistent databases are considered, i.e.  database instances should satisfy the database constraints. Nulls play a central role in our database model; their impact on update strategies will be described in the next section.

Definition 3.1

(Database) A database\(\Delta \) is a pair of the form \(\Delta =({\mathfrak {D}}, {\mathbb {C}})\) where \({\mathfrak {D}}\) is a finite set of instantiated atoms and \({\mathbb {C}}\) is a finite set of constraints such that (i) \(core({\mathfrak {D}})={\mathfrak {D}}\) and (ii) \({\mathfrak {D}}\models {\mathbb {C}}\).

Referring back to Example 3.2, it is easy to see that \(\Delta _1=({\mathfrak {D}}_1, {\mathbb {C}})\) satisfies Definition 3.1, because \({\mathfrak {D}}_1 \models {\mathbb {C}}\) and \(core({\mathfrak {D}}_1)={\mathfrak {D}}_1\). On the other hand, neither \(\Delta _2=({\mathfrak {D}}_2, {\mathbb {C}})\) nor \(\Delta _3=({\mathfrak {D}}_3, {\mathbb {C}})\) satisfies Definition 3.1, because \({\mathfrak {D}}_2 \not \models {\mathbb {C}}\) on the one hand, and \(core({\mathfrak {D}}_3) \ne {\mathfrak {D}}_3\) on the other. However, \(\Delta '_3=(core({\mathfrak {D}}_3), {\mathbb {C}})\) satisfies Definition 3.1, because \(core({\mathfrak {D}}_3) \models {\mathbb {C}}\).

4 Constraint application

Given a set of instantiated atoms I and a set of constraints \({\mathbb {C}}\), constraints can be applied forward (i.e. from body to head) or backward (i.e. from head to body) to I in order to generate a new set of atoms \(I'\). These two ways of applying constraints are precisely the subject of the present section.

4.1 Applying constraints forward

Applying forward a constraint \(c:B(\mathbf{X }, \mathbf{Y }) \Rightarrow L(\mathbf{X }, \mathbf{Z })\) to a given set of instantiated atoms I means (i) checking whether I contains an instance \(B(\alpha , \beta )\) of body(c) and, if so, (ii) when I contains no atom of the form \(L(\alpha , \gamma )\), creating new nulls \(\mathbf{N } \) and adding to I the instantiated atom \(L(\alpha , \mathbf{N })\).

More generally, this process has been widely studied in the past decades [15, 16, 20, 45, 47] under the name of chasing. While checking whether a set of constraints has a terminating chase is undecidable [13], weak acyclicity [15] has been established as a sufficient condition for chase termination. In this paper, we propose a practical solution which not only guarantees chase termination, but also allows for flexibility when handling nulls. To this end, we define a specific chasing operator that, as explained in Example 1.4, avoids the infinite generation of nulls by associating every null N with an integer denoted by \(\delta (N)\) and called the degree of N.

We assume that a maximal null degree \(\delta _{\max }\) is given. The iterative chasing process starts with all null degrees equal to 0, and when a new null N has to be inserted due to the application of a constraint c:

  • If the instance of body(c) contains a null \(N'\) such that \(\delta (N') \ge \delta _{\max }\), then the constraint is not applied;

  • Otherwise, \(\delta (N)\) is set to the maximal degree occurring in the instance of body(c) plus 1.
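
The two rules above can be condensed into a small helper. The following Python sketch is our own illustration (the function name and encoding are assumptions); it returns the degree of the fresh null, or `None` when the constraint must not be applied:

```python
def fresh_null_degree(body_null_degrees, delta_max):
    """Degree of a null created by a forward constraint application.

    body_null_degrees: degrees of the nulls occurring in the instantiated
    body of the constraint. Returns None when some body null has reached
    delta_max (the constraint is not applied); otherwise 0 for a null-free
    body, and the maximal body degree plus 1 otherwise."""
    if any(d >= delta_max for d in body_null_degrees):
        return None
    return max(body_null_degrees) + 1 if body_null_degrees else 0
```

For instance, a body containing a single null of degree 0 yields a fresh null of degree 1 when \(\delta _{\max }=2\), and blocks the application when \(\delta _{\max }=0\).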

Formally, chasing and handling null degrees are defined using an operator called forward operator as follows.

Definition 4.1

(Forward Operator) Given a set of constraints \({\mathbb {C}}\) and a maximum degree of nulls \(\delta _{max}\), the associated forward operator, denoted by \(T_\textsf {fw}\), is defined for every set of instantiated atoms I by:

$$\begin{aligned} \begin{array}{rll} T_\textsf {fw}(I)=I \,\cup &{} \{h(L(\mathbf{X },\mathbf{Z }))\,|&{}(\exists c:B(\mathbf{X },\mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z }) \in {\mathbb {C}})(\exists h)\\ &{}&{} ((c{ \text{ is } \text{ full }})\,\wedge (h(body(c)) \subseteq I))\}\\ ~\cup &{} \{h'(L(\mathbf{X },\mathbf{Z }))\,|&{}(\exists c:B(\mathbf{X },\mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z }) \in {\mathbb {C}})(\exists h)\\ &{}&{}((c{ \text{ is } \text{ not } \text{ full }})\,\wedge (h(body(c)) \subseteq I)\,\wedge \\ &{}&{}(\forall h'')((h''(body(c)) =h(body(c))) \Rightarrow \\ &{}&{} \quad h''(L(\mathbf{X },\mathbf{Z })) \notin I)\,\wedge \\ &{}&{}({\text{ for } \text{ all } \text{ nulls }\, N \text{ in } }h(body(c)), \delta (N) < \delta _{max} ) \,\wedge \\ &{}&{}({h' \text{ extends } \, h\, \text{ so } \text{ as }\, h'(\mathbf{Z })\, \text{ are } \text{ new } \text{ nulls } }N{ \text{ with } }\\ &{}&{} \delta (N)=0{ \text{ if } \, h(body(c))\, \text{ contains } \text{ no } \text{ null, } \text{ or }}\\ &{}&{}\delta (N)= N_{\max } +1, { \text{ where }}\\ &{}&{} N_{\max }=\max \{\delta (N')~|~ N'{ \text{ occurs } \text{ in } }h(body(c))\}))\}. \end{array} \end{aligned}$$

Given a finite set of instantiated atoms I, let \(\left( T^k_\textsf {fw} \right) _{k \in \mathbb {N}}\) be defined by:

  •  \(T^0_\textsf {fw} = I\) where for every null N in I, \(\delta (N)\) has been set to 0;

  •  \(T^{k+1}_\textsf {fw} = T_\textsf {fw}(T^k_\textsf {fw})\).

The following lemma shows that this sequence has a limit which is computed after a finite number of steps. In what follows, this limit is denoted by \(T_\textsf {fw}^*(I)\).

Lemma 4.1

Considering the sequence \(\left( T^k_\textsf {fw} \right) _{k \in \mathbb {N}}\) as defined above, there exists an integer \(k_0\) such that, for every \(k \ge k_0\), \(T^{k+1}_\textsf {fw}=T^{k}_\textsf {fw}\).

Proof

Since \(I \subseteq T_\textsf {fw}(I)\) holds, we trivially have that for every \(k \ge 0\), \(T^{k}_\textsf {fw} \subseteq T^{k+1}_\textsf {fw}\). Therefore, the sequence \(\left( T^k_\textsf {fw} \right) _{k \in \mathbb {N}}\) is monotonic. As a consequence, either \(T^{k+1}_\textsf {fw}=T^k_\textsf {fw}\) or \(T^{k+1}_\textsf {fw}\) contains at least one atom not in \(T^k_\textsf {fw}\).

On the other hand, to compute \(T^{k+1}_\textsf {fw}\) from \(T^k_\textsf {fw}\), according to Definition 4.1, two sets may be computed. The first set is built using the full constraints, which create no new constants or nulls. Therefore, the number of atoms contributed by full constraints is finite, since I is finite. The second set is built using the constraints with an existential variable in their head. An infinite computation would require applying at least one such constraint infinitely many times. However, since the degrees of nulls are bounded, each such constraint can only be applied a finite number of times. Therefore, the number of atoms containing nulls is also finite.

Hence, there must exist \(k_0\) such that \(T_\textsf {fw}(T^{k_0}_\textsf {fw})\) generates no new atom, that is, by monotonicity, such that \(T_\textsf {fw}(T^{k_0}_\textsf {fw})=T^{k_0}_\textsf {fw}\). Applying again monotonicity, this implies that for every \(k \ge k_0\), \(T^{k+1}_\textsf {fw}=T^k_\textsf {fw}\), which completes the proof. \(\square \)

The following example illustrates the computation of \(T_\textsf {fw}^*(I)\). In this example, as well as in the remainder of the paper, we use the notation \(N^d\) to specify that N is a null of degree d, i.e. \(N^d\) means that \(\delta (N)=d\) holds.

Example 4.1

In the context of Examples 1.3 and 1.5, consider the set \({\mathbb {C}}\) of the following constraints:

  • \(c_1: \textsf {Attends}^-(x_1,y_1), \textsf {Conf}(y_1) \Rightarrow \textsf {Registered}(x_1,y_1)\)

  • \(c_2: \textsf {Conf}^-(x_2) \Rightarrow \textsf {Loc}(x_2,y_2)\)

  • \(c_3: \textsf {Loc}^-(x_3,y_3) \Rightarrow \textsf {VisaReg}(y_3, z_3)\)

and \(I=\{\textsf {Attends}(Bob,VLDB'18) ,\)\( \textsf {Conf}(VLDB'18)\}\).

For \(\delta _{max}=0\), the computation of \(T^*_\textsf {fw}(I)\) yields the following: first, \(T_\textsf {fw}(I)\) is set to \(I \cup \{\textsf {Registered}(Bob,VLDB'18),\)\({\textsf {Loc}}(VLDB'18,N_1^0)\}\) due to \(c_1\) and \(c_2\). Then, when computing \(T_\textsf {fw}(T_\textsf {fw}(I))\), the only constraint to apply is \(c_3\) with \(\textsf {Loc}(VLDB'18,N_1^0)\). However, as this atom contains a null whose degree is equal to \(\delta _{\max }\), \(c_3\) is not triggered, implying that \(T_\textsf {fw}(T_\textsf {fw}(I))=T_\textsf {fw}(I)\) and thus that \(T^*_\textsf {fw}(I)=T_\textsf {fw}(I)\). Notice that, in this case \(T^*_\textsf {fw}(I)\) does not satisfy \({\mathbb {C}}\).

Now, for \(\delta _{max}=2\), the computation of \(T^*_\textsf {fw}(I)\) yields first \(T_\textsf {fw}(I)=I \cup \{\textsf {Registered}(Bob,VLDB'18),\)\(\textsf {Loc}(VLDB'18,N_1^0)\}\) as above. Then, contrary to the previous case, \(c_3\) is triggered for computing \(T_\textsf {fw}(T_\textsf {fw}(I))\), thus generating \(\textsf {VisaReg}(N_1^0, N_2^1)\). As no further constraint applies, we obtain that \(T^*_\textsf {fw}(I)=I \cup \{\textsf {Registered}(Bob,VLDB'18),\)\(\textsf {Loc}(VLDB'18,N_1^0),\)\(\textsf {VisaReg}(N_1^0, N_2^1)\}\). Notice that, in this case, \(T^*_\textsf {fw}(I)\) does satisfy \({\mathbb {C}}\). \(\square \)
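
The two runs above can be replayed with a short Python sketch of the forward operator (our own illustration, in the spirit of Definition 4.1; the encodings below — variables as '?'-prefixed strings and nulls as `('null', id, degree)` triples — are assumptions, not the paper's notation):

```python
from itertools import count

_ids = count(1)  # fresh identifiers for invented nulls

def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def is_null(t):
    return isinstance(t, tuple) and len(t) == 3 and t[0] == 'null'

def match(pattern, atom, subst):
    """Extend `subst` so that `pattern` maps onto `atom`, or return None."""
    if pattern[0] != atom[0] or len(pattern[1]) != len(atom[1]):
        return None
    s = dict(subst)
    for p, a in zip(pattern[1], atom[1]):
        if is_var(p):
            if p in s and s[p] != a:
                return None
            s[p] = a
        elif p != a:
            return None
    return s

def homomorphisms(patterns, atoms, subst=None):
    """All homomorphisms mapping the list `patterns` into the set `atoms`."""
    subst = {} if subst is None else subst
    if not patterns:
        yield subst
        return
    for atom in atoms:
        s = match(patterns[0], atom, subst)
        if s is not None:
            yield from homomorphisms(patterns[1:], atoms, s)

def t_fw(I, constraints, delta_max):
    """One application of the forward operator (sketch of Definition 4.1)."""
    out = set(I)
    for body, head in constraints:
        exist = [v for v in head[1]
                 if is_var(v) and not any(v in atom[1] for atom in body)]
        for h in homomorphisms(body, I):
            head_pat = (head[0], tuple(h.get(a, a) for a in head[1]))
            if not exist:                 # full constraint: add h(head(c))
                out.add(head_pat)
                continue
            # skip when a head instance with the same body image is in I
            if any(match(head_pat, atom, {}) is not None for atom in I):
                continue
            degs = [v[2] for v in h.values() if is_null(v)]
            if any(d >= delta_max for d in degs):
                continue                  # a body null reached delta_max
            d = max(degs) + 1 if degs else 0
            fresh = {v: ('null', next(_ids), d) for v in exist}
            out.add((head[0], tuple(fresh.get(a, h.get(a, a)) for a in head[1])))
    return out

def t_fw_star(I, constraints, delta_max):
    """Iterate t_fw up to its fixpoint (finite, as in Lemma 4.1)."""
    cur = set(I)
    while True:
        nxt = t_fw(cur, constraints, delta_max)
        if nxt == cur:
            return cur
        cur = nxt

# Constraints and instance of Example 4.1 (body^- marks play no role here).
C = [([('Attends', ('?x', '?y')), ('Conf', ('?y',))], ('Registered', ('?x', '?y'))),
     ([('Conf', ('?x',))], ('Loc', ('?x', '?y'))),
     ([('Loc', ('?x', '?y'))], ('VisaReg', ('?y', '?z')))]
I = {('Attends', ('Bob', 'VLDB18')), ('Conf', ('VLDB18',))}
```

With `delta_max = 0`, the fixpoint contains the Registered atom and a Loc atom with a degree-0 null but no VisaReg atom, whereas with `delta_max = 2` a VisaReg atom with a degree-1 null is also produced, as in Example 4.1.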

The following proposition states that \(T_\textsf {fw}\) allows for restoring consistency when no nulls N such that \(\delta (N) \ge \delta _{max}\) are generated.

Proposition 4.1

For every set of instantiated atoms I, if \(T^*_\textsf {fw}(I)\) contains no null N such that \(\delta (N) \ge \delta _{max}\), then \(T^*_\textsf {fw}(I) \models {\mathbb {C}}\).

Proof

Suppose that \(c:B(\mathbf{X }, \mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z })\) is a constraint not satisfied by \(T^*_\textsf {fw}(I)\). There exists a homomorphism h such that \(h(body(c)) \subseteq T^*_\textsf {fw}(I)\) satisfying the following:

  1. (a)

    If c is full, then h(head(c)) is not in \(T^*_\textsf {fw}(I)\). However, by Definition 4.1, h(head(c)) belongs to \(T_\textsf {fw}(T^*_\textsf {fw}(I))=T^*_\textsf {fw}(I)\), a contradiction.

  2. (b)

    If c is not full, then for every \(h'\) such that \(h'(body(c)) =h(body(c))\), \(h'(head(c))\) is not in \(T^*_\textsf {fw}(I)\). However, since every null N in h(body(c)) is such that \(\delta (N) < \delta _{max}\), according to Definition 4.1, \(T_\textsf {fw}(T^*_\textsf {fw}(I))=T^*_\textsf {fw}(I)\) contains an instantiated atom \(h'(head(c))\) containing new nulls with a degree d such that \(d \le \delta _{max}\), a contradiction which completes the proof.\(\square \)

Referring to Proposition 4.1, it is important to notice from Example 4.1 that when \(T^*_\textsf {fw}(I)\) contains nulls N such that \(\delta (N) \ge \delta _{max}\), then \(T^*_\textsf {fw}(I) \models {\mathbb {C}}\) does not hold in general. In this paper, nulls N such that \(\delta (N) \ge \delta _{\max }\) are said to be disallowed. We emphasize in this respect that when the underlying set of constraints is empty and \(\delta _{\max }\) is strictly positive, no disallowed nulls can occur. On the other hand, whatever the set of constraints, when \(\delta _{\max }=0 \) no nulls are allowed in the database (as in Halfeld Ferrari and Laurent [28]).

The following example illustrates that considering disallowed nulls has an impact on the computation of the core.

Example 4.2

Consider a database dealing with assistant professors and PhD students, as people who teach and do research in certain domains. Let us suppose that the following set is obtained by applying constraints forward with \(\delta _{max}=1\). (We refer to Example 5.1 for the details of this application.)

$$\begin{aligned} I= & {} \{ \textsf {AssistProf}(Alice), \textsf {Researcher}(Alice), \textsf {Teaches} (Alice, N_1^0), \textsf {PhD}(Alice), \\&{\textsf {doesResearchIn}(Alice, N_2^0), \textsf {Teaches}(Alice, N_3^1)\}.} \end{aligned}$$

When computing the core of I, two homomorphisms are possible, namely:

  • \(h_1\) such that \(h_1(N_1^0)= N_3^1\), which gives

    $$\begin{aligned} core(I)= & {} \{ \textsf {AssistProf}(Alice), \textsf {Researcher}(Alice), \textsf {PhD}(Alice),\\&{\textsf {doesResearchIn}(Alice, N_2^0), \textsf {Teaches}(Alice, N_3^1)\}} \end{aligned}$$
  • \(h_2\) such that \(h_2(N_3^1)= N_1^0\), which gives

    $$\begin{aligned} core(I)= & {} \{ \textsf {AssistProf}(Alice), \textsf {Researcher}(Alice), \textsf {PhD}(Alice), \\&{\textsf {doesResearchIn}(Alice, N_2^0), \textsf {Teaches} (Alice, N_1^0)\} } \end{aligned}$$

Although these two sets are isomorphic (as stated earlier in Proposition 2.1), the first one contains disallowed nulls, contrary to the second one. Consequently, in our approach, the second one is accepted, whereas the first one is not. \(\square \)

In order to take this situation into account, we consider a restricted version of the core, called \(R\_core\), discarding homomorphisms that map allowed nulls to disallowed ones. These homomorphisms, called restricted homomorphisms, are defined as follows.

Definition 4.2

(Restricted homomorphism) A homomorphism h is said to be restricted if for all nulls \(N_i\) and \(N_j\) such that \(h(N_i)=N_j\), it holds that if \(\delta (N_i)< \delta _{\max }\), then \(\delta (N_j) < \delta _{\max }\).
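
As a quick illustration, the condition of Definition 4.2 can be tested as follows (a sketch with our own encoding: a homomorphism as a dictionary on nulls, with degrees kept in a separate map; identity on constants is left implicit):

```python
def is_restricted(h, delta, delta_max):
    """Definition 4.2 (sketch): h is restricted if it never maps an
    allowed null (degree < delta_max) to a disallowed one
    (degree >= delta_max).

    h: mapping from nulls to nulls; delta: mapping null -> degree."""
    return all(delta[nj] < delta_max
               for ni, nj in h.items()
               if ni in delta and nj in delta and delta[ni] < delta_max)

# The two homomorphisms of Example 4.2, with delta_max = 1:
delta = {'N1': 0, 'N2': 0, 'N3': 1}
h1 = {'N1': 'N3'}   # folds the allowed N1 onto the disallowed N3
h2 = {'N3': 'N1'}   # folds the disallowed N3 onto the allowed N1
```

Here `h1` is rejected while `h2` is restricted, which is why only the second core of Example 4.2 is accepted.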

The following lemma, whose proof follows from the definitions and Proposition 2.1, shows the relationship between the core and the restricted core.

Lemma 4.2

Given a set of instantiated atoms I and a maximal null degree \(\delta _{\max }\),

  • if \(R\_core(I)\) contains no disallowed nulls, then \(R\_core(I)\) and core(I) are equal up to a null renaming;

  • if \(R\_core(I)\) contains disallowed nulls, then for every homomorphism h leading to core(I), h(I) contains disallowed nulls.

4.2 Applying constraints backward

Applying backward a constraint \(c:B(\mathbf{X }, \mathbf{Y }) \Rightarrow L(\mathbf{X }, \mathbf{Z })\) to a given set of instantiated atoms I with respect to another set of instantiated atoms J means:

  1. 1.

    Check whether I contains an instance \(L(\alpha , \gamma _1)\) of head(c) and whether \(J {\setminus } I\) contains an instance \(B(\alpha , \beta _1)\) of body(c). If so, proceed to the following step.

  2. 2.

    If \(J {\setminus } I\) contains no atom of the form \(L(\alpha , \gamma _2) \) where \(\gamma _2 \ne \gamma _1\), add to I the atom \(A(\alpha _0, \gamma _0)\), where \(A(\alpha _0, \gamma _0)\) is the instantiation of \(body^-(c)\) induced by the homomorphism of step 1.

This processing is defined through a backward operator as follows.

Definition 4.3

(Backward Operator) Given a set of constraints \({\mathbb {C}}\) and two finite sets of instantiated atoms I and J, \(T_\textsf {bw}(I,J)\) is defined as follows:

$$\begin{aligned} \begin{array}{rl} T_\textsf {bw}(I,J)=I \cup \{A\,|&{}(\exists c \in {\mathbb {C}})(\exists h)((A = h(body^-(c)))\,\wedge \\ &{}(h(body(c)) \subseteq (J {\setminus } I)) \wedge (h(head(c)) \in I)\,\wedge \\ &{}(\forall h')((h'(body(c))=h(body(c))) \Rightarrow h'(head(c))\not \in (J{\setminus } I)))\}. \end{array} \end{aligned}$$

Given two finite sets of instantiated atoms I and J, let \(\left( T^k_\textsf {bw} \right) _{k \in \mathbb {N}}\) be defined by:

  • \(T^0_\textsf {bw} = I\);

  • \(T^{k+1}_\textsf {bw} = T_\textsf {bw}(T^k_\textsf {bw}, J)\).

The following lemma shows that this sequence has a limit which is computed after a finite number of steps. In what follows, this limit is denoted by \(T_\textsf {bw}^*(I,J)\).

Lemma 4.3

Considering the sequence \(\left( T^k_\textsf {bw} \right) _{k \in \mathbb {N}}\) as defined above, there exists an integer \(k_0\) such that, for every \(k \ge k_0\), \(T^{k+1}_\textsf {bw}=T^{k}_\textsf {bw}\).

Proof

Since \(I \subseteq T_\textsf {bw}(I,J)\) holds, we trivially have that for every \(k \ge 0\), \(T^{k}_\textsf {bw} \subseteq T^{k+1}_\textsf {bw}\). Therefore, the sequence \(\left( T^k_\textsf {bw} \right) _{k \in \mathbb {N}}\) is monotonic. As a consequence, either \(T^{k+1}_\textsf {bw}=T^k_\textsf {bw}\) or \(T^{k+1}_\textsf {bw}\) contains at least one atom not in \(T^k_\textsf {bw}\). Moreover, since \(T_\textsf {bw}(I,J)\) is a subset of J, J is a finite set, and \(T_\textsf {bw}(I,J) \subseteq T_\textsf {bw}(T_\textsf {bw}(I,J),J)\), the total number of iterations is finite.

Hence, there must exist \(k_0\) such that \(T_\textsf {bw}(T^{k_0}_\textsf {bw},J)\) generates no new atom, that is, by monotonicity, such that \(T_\textsf {bw}(T^{k_0}_\textsf {bw},J)=T^{k_0}_\textsf {bw}\). Applying monotonicity again, this implies that for every \(k \ge k_0\), \(T^{k+1}_\textsf {bw}=T^k_\textsf {bw}\), which completes the proof. \(\square \)

The following proposition states that the limit \(T_\textsf {bw}^*(I,J)\) makes it possible to restore the consistency of J.

Proposition 4.2

For all finite sets of instantiated atoms I and J, it holds that \(J {\setminus } T^*_\textsf {bw}(I,J) \models {\mathbb {C}}\).

Proof

Suppose that there exists c which is not satisfied by \(J {\setminus } T^*_\textsf {bw}(I,J)\). Then, there exists a homomorphism h such that \(h(body(c)) \subseteq J {\setminus } T^*_\textsf {bw}(I,J)\) and \(h(head(c)) \not \in J {\setminus } T^*_\textsf {bw}(I,J)\). In this case, by Definition 4.3, \(h(body^-(c))\) would appear in \(T_\textsf {bw}(T^*_\textsf {bw}(I,J),J)\), thus in \(T^*_\textsf {bw}(I,J)\). Hence, \(h(body(c)) \subseteq J {\setminus } T^*_\textsf {bw}(I,J)\) does not hold, which is a contradiction. Therefore, the proof is complete. \(\square \)

Example 4.3

Referring to Example 4.1, let \({\mathbb {C}}=\{c_1, c_2, c_3\}\) be defined by:

  • \(c_1: \textsf {Attends}^-(x_1,y_1), \textsf {Conf}(y_1) \Rightarrow \textsf {Registered}(x_1,y_1)\)

  • \(c_2: \textsf {Conf}(x_2) \Rightarrow \textsf {Loc}(x_2,y_2)\)

  • \(c_3: \textsf {Loc}(x_3,y_3) \Rightarrow \textsf {VisaReg}(y_3, z_3)\)

Several computations of \(T^*_\textsf {bw}(I,J)\) are presented below for the following set J:

$$\begin{aligned} J= & {} \{\textsf {Attends}(Bob,VLDB'18) , \textsf {Conf}(VLDB'18), \textsf {Registered}(Bob,VLDB'18),\\&{ \textsf {Loc}(VLDB'18,Rio), \textsf {VisaReg}(Rio, url1), \textsf {VisaReg}(Rio, url2)\}.} \end{aligned}$$

Considering first \(I_1=\{\textsf {Registered}(Bob,VLDB'18)\}\), \(c_1\) applies backward, producing \(T_\textsf {bw}(I_1,J)= I_1 \cup \{\textsf {Attends}(Bob,VLDB'18)\}\). Since no other constraint applies, \(T^*_\textsf {bw}(I_1,J)=\{\textsf {Registered}(Bob,VLDB'18),\)\(\textsf {Attends}(Bob,VLDB'18)\}\).

For \(I_2=\{\textsf {VisaReg}(Rio, url1)\}\), \(c_3\) is considered as follows: \((J {\setminus } I_2)\) contains one instance of \(body(c_3)\) associated with the two instances of \(head(c_3)\), among which one is in \((J {\setminus } I_2)\), namely \(\textsf {VisaReg}(Rio, url2)\). Hence, according to Definition 4.3, we have \(T^*_\textsf {bw}(I_2,J)=I_2\) because no constraint other than \(c_3\) applies.

For \(I_3=\{\textsf {VisaReg}(Rio, url1),\)\(\textsf {VisaReg}(Rio, url2)\}\), \((J {\setminus } I_3)\) contains an instance of \(body(c_3)\), namely \(\textsf {Loc}(VLDB'18,Rio)\), associated with the two instances of \(head(c_3)\) that belong to \(I_3\). Thus, according to Definition 4.3, we have \(T_\textsf {bw}(I_3,J)=I_3 \cup \{\textsf {Loc}(VLDB'18,Rio)\}\). Applying \(c_2\) backward yields \(T_\textsf {bw}(T_\textsf {bw}(I_3,J),J)=T_\textsf {bw}(I_3,J)\cup \{ \textsf {Conf}(VLDB'18)\}\). As no other constraint applies backward, \(T^*_\textsf {bw}(I_3,J)=I_3\cup \{\textsf {Loc}(VLDB'18,Rio),\)\(\textsf {Conf}(VLDB'18)\}\). \(\square \)
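
The computations of Example 4.3 can be replayed with the following Python sketch of the backward operator (our own illustration; the tuple encoding of atoms and the index-based marking of \(body^-\) are assumptions):

```python
def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def match(pattern, atom, subst):
    """Extend `subst` so that `pattern` maps onto `atom`, or return None."""
    if pattern[0] != atom[0] or len(pattern[1]) != len(atom[1]):
        return None
    s = dict(subst)
    for p, a in zip(pattern[1], atom[1]):
        if is_var(p):
            if p in s and s[p] != a:
                return None
            s[p] = a
        elif p != a:
            return None
    return s

def homomorphisms(patterns, atoms, subst=None):
    subst = {} if subst is None else subst
    if not patterns:
        yield subst
        return
    for atom in atoms:
        s = match(patterns[0], atom, subst)
        if s is not None:
            yield from homomorphisms(patterns[1:], atoms, s)

def t_bw(I, J, constraints):
    """One backward step (sketch of Definition 4.3). A constraint is
    (body, minus_index, head), minus_index selecting body^-."""
    out, rest = set(I), J - I
    for body, mi, head in constraints:
        for h in homomorphisms(body, rest):          # h(body(c)) in J \ I
            head_pat = (head[0], tuple(h.get(a, a) for a in head[1]))
            if not any(match(head_pat, a, {}) is not None for a in I):
                continue                             # no instance of head(c) in I
            if any(match(head_pat, a, {}) is not None for a in rest):
                continue                             # another head instance survives
            out.add((body[mi][0], tuple(h.get(a, a) for a in body[mi][1])))
    return out

def t_bw_star(I, J, constraints):
    """Iterate t_bw up to its fixpoint (finite, as in Lemma 4.3)."""
    cur = set(I)
    while True:
        nxt = t_bw(cur, J, constraints)
        if nxt == cur:
            return cur
        cur = nxt

# Constraints and set J of Example 4.3; the '-' marks are encoded as index 0.
C = [([('Attends', ('?x', '?y')), ('Conf', ('?y',))], 0, ('Registered', ('?x', '?y'))),
     ([('Conf', ('?x',))], 0, ('Loc', ('?x', '?y'))),
     ([('Loc', ('?x', '?y'))], 0, ('VisaReg', ('?y', '?z')))]
J = {('Attends', ('Bob', 'VLDB18')), ('Conf', ('VLDB18',)),
     ('Registered', ('Bob', 'VLDB18')), ('Loc', ('VLDB18', 'Rio')),
     ('VisaReg', ('Rio', 'url1')), ('VisaReg', ('Rio', 'url2'))}
```

The three runs of Example 4.3 are then obtained by calling `t_bw_star` with the sets \(I_1\), \(I_2\) and \(I_3\).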

We point out that the backward operator \(T_\textsf {bw}\) does not require considering null degrees, contrary to the forward operator \(T_\textsf {fw}\). In the next section, we show how the two operators \(T_\textsf {fw}\) and \(T_\textsf {bw}\) are used in our update processing.

5 Update processing

An update in our approach is a deterministic process that modifies the database instance according to the following guidelines: given a database \(\Delta = ({\mathfrak {D}},{\mathbb {C}})\) and a set I of instantiated atoms such that the nulls occurring in I do not occur in \({\mathfrak {D}}\):

Inserting I in \(\Delta \) means generating a database \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) such that \({\mathfrak {D}}'\) contains an instance of every atom of I. More precisely:

  1. 1.

    Compute first \(T^*_\textsf {fw}({\mathfrak {D}}\cup I)\) and then the restricted core of the output.

  2. 2.

    If disallowed nulls occur, then the insertion is rejected and \(\Delta \) is not changed. Otherwise, \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\), where \({\mathfrak {D}}'\) is the instance defined above, is returned.

Deleting I from \(\Delta \) means generating a database \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) such that \({\mathfrak {D}}'\) contains no atom isomorphic to an atom in I. More precisely, under the restriction that in I every null occurs in a single atom:

  1. 1.

    Delete from \({\mathfrak {D}}\) all atoms isomorphic to an atom of I. Denoting by \({\mathfrak {D}}_1\) the resulting set of instantiated atoms, compute \(T^*_\textsf {fw}({\mathfrak {D}}_1)\) and then the restricted core of the output.

  2. 2.

    The computation of the restricted core can imply that atoms isomorphic to atoms in I are back in the output, or that constraint satisfaction cannot be restored due to disallowed nulls. Denoting by ToDel the set of all these atoms, compute \(T^*_\textsf {bw}(ToDel, {\mathfrak {D}}_1)\) and remove the resulting atoms from \({\mathfrak {D}}_1\); the final instance \({\mathfrak {D}}'\) is the core of the result of this last removal.

The details of update operations are introduced in the following.

5.1 Insertion processing

Given a database \(\Delta =({\mathfrak {D}},{\mathbb {C}})\) and a maximal null degree \(\delta _{max}\), the insertion in \(\Delta \) of a set of instantiated atoms \({\textsf {iRequest}} \) is defined as the output of Algorithm 1. Notice that, as earlier mentioned, insertions might be rejected due to a null degree greater than or equal to \(\delta _{max}\). Different situations are illustrated by the following example.

(Algorithm 1: insertion processing; pseudo-code figure not reproduced)

Example 5.1

In the context of Example 4.2, let \(\Delta = ({\mathfrak {D}}, {\mathbb {C}})\) be a database where \({\mathfrak {D}}=\emptyset \) and \({\mathbb {C}}\) is the set of constraints below: \(c_1\) and \(c_2\) state, respectively, that assistant professors are researchers and that a PhD student does some research in a scientific domain, while \(c_3\) and \(c_4\) state that doing research implies teaching at least one course.

  • \(c_1: \textsf {AssistProf}(x_1) \Rightarrow \textsf {Researcher} (x_1)\)

  • \(c_2: \textsf {PhD}(x_2) \Rightarrow \textsf {doesResearchIn} (x_2, y_2)\)

  • \(c_3: \textsf {doesResearchIn} (x_3, y_3) \Rightarrow \textsf {Teaches} (x_3, z_3)\)

  • \(c_4: \textsf {Researcher} (x_4) \Rightarrow \textsf {Teaches} (x_4, z_4)\)

We apply Algorithm 1 for \({\textsf {iRequest}} =\{ \textsf {AssistProf}(Alice)\), \(\textsf {PhD}(Alice)\}\), first for \(\delta _{\max } =0\) and then for \(\delta _{\max } =1\).

In the first case, the computation of \(T^*_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})\) on line 3 first produces \(T^1_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}) ={\textsf {iRequest}} \cup \{\textsf {Researcher}(Alice),\textsf {doesResearchIn} (Alice, N_1^0) \}\). Then, when computing \(T^2_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})\), \(c_3\) is not applied forward because its instantiated body contains \(N_1^0\) whose degree is equal to \(\delta _{\max }\). However, \(c_4\) applies forward, generating \(\textsf {Teaches} (Alice, N_2^0)\). We thus obtain

$$\begin{aligned} T^*_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})={\textsf {iRequest}} \cup \{\textsf {Researcher}(Alice), \textsf {doesResearchIn} (Alice, N_1^0), \textsf {Teaches} (Alice, N_2^0)\}. \end{aligned}$$

Since the computation of the restricted core on line 4 does not change this set, the test on line 5 fails, and so the insertion is rejected. This shows that when \(\delta _{\max }=0\), nulls are not accepted in the database instance, as was the case in our previous work [28].

Now, for \(\delta _{\max }=1\), the computation of \(T^*_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})\) on line 3 is as follows:

$$\begin{aligned} \begin{array}{rl} T^1_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}) = &{}{\textsf {iRequest}} \cup \{\textsf {Researcher}(Alice),\textsf {doesResearchIn} (Alice, N_1^0) \}\\ T^2_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}) =&{}T^1_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}) \\ &{} \cup \,\{\textsf {Teaches} (Alice, N_2^0), \textsf {Teaches} (Alice, N_3^1) \}\\ T^3_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}) =&{}T^2_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}). \end{array} \end{aligned}$$

Thus, \(T^*_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}}) = T^2_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})\), and the computation of the restricted core on line 4 is as discussed in Example 4.2, that is, the atom \(\textsf {Teaches} (Alice, N_3^1)\) is discarded as an instance of \(\textsf {Teaches} (Alice, N_2^0)\). Since in the produced set all nulls have a degree strictly less than \(\delta _{\max }\), the test on line 5 succeeds, and so the insertion is performed producing the following database instance:

$$\begin{aligned} {\mathfrak {D}}'= & {} \{\textsf {AssistProf}(Alice), \textsf {Researcher} (Alice), \textsf {Teaches} (Alice, N_2),\\&{\textsf {PhD}(Alice), \textsf {doesResearchIn} (Alice, N_1) \}. } \end{aligned}$$

\(\square \)

We refer to Sect. 7 regarding the relationship between our insertion processing and the chase procedures summarized in Onet [45]. In particular, we show that, when an insertion is not rejected and when chase procedures terminate, our approach is similar to the core chase procedure.

5.2 Deletion semantics

Before considering how deletions are performed, their semantics should be clarified by considering three particular aspects: (i) the meaning of a deletion of an atom with a null value, (ii) the possibility of considering linked nulls in the deletion request and (iii) the determinism of deletions.

Concerning aspect (i), in our approach, a deletion expressed as ‘delete P(a, N)’ means ‘delete all atoms over P where a is associated with a null’. We draw attention to the fact that this update should not be understood as ‘delete all atoms of the form \(P(a, \_)\)’. To avoid confusion, we do not consider updates whose expression involves the placeholder ‘_’.

Concerning aspect (ii), the presence of nulls in the atoms to be deleted raises subtle issues, as illustrated by the following example.

Example 5.2

In the context of Example 1.5, let \(\Delta _1=({\mathfrak {D}}_1, {\mathbb {C}}_1)\) where \({\mathbb {C}}_1=\emptyset \) and \({\mathfrak {D}}_1=\{\textsf {Attends}(Bob,VLDB'18)\}\).

If the atoms in \(I_1=\{\textsf {Attends}(Bob, VLDB'18),\)\(\textsf {Conf}(VLDB'18)\}\) have to be deleted, then any database system will output the database instance \({\mathfrak {D}}'_1 = \emptyset \).

Now, as a case where nulls are involved, let \(\Delta _2=({\mathfrak {D}}_2, {\mathbb {C}}_1)\) where \({\mathfrak {D}}_2=\{\textsf {Attends}(Bob,N_1)\}\). Considering that the atoms in \(I_2=\{\textsf {Attends}(Bob,N_2),\)\(\textsf {Conf}(N_2)\}\) have to be deleted, the following question arises: should the database instance be changed or not? The following two answers are then possible:

  • Referring to the previous case, the answer is yes because \(I_2\) contains an atom isomorphic to the one in \({\mathfrak {D}}_2\), meaning that the result is \({\mathfrak {D}}'_2=\emptyset \).

  • Referring to FOL semantics, \(I_2\) stipulates that we are deleting all pairs of atoms stating that Bob attends a conference whose name is unknown. Since \({\mathfrak {D}}_2\) contains no such pair, the deletion should not change the database instance. Thus, the answer to the question above is no!

To keep the deletion semantics for null and non-null atoms intuitively similar, while avoiding questions such as the one above, we do not allow nulls to have several occurrences in the set of atoms to be deleted. That is, in our case, the set \(I_2\) above is changed to \(I=\{\textsf {Attends}(Bob,N_2),\)\(\textsf {Conf}(N_3)\}\).

It is worth noting that the above question also concerns: (1) a choice between an ‘or’ semantics (delete if it matches this OR that) and an ‘and’ semantics (delete if it matches this AND that), together with (2) the verification of whether the required deletion covers a whole partition in the database instance (e.g. the situation where we have \(I_2\) and an instance containing \(\textsf {Attends}(Bob,N_1),\)\(\textsf {Conf}(N_1)\) and \(\textsf {Loc}(N_1, Rio)\)). Since the ‘and’ semantics raises further important issues, this point lies outside the scope of the present paper. \(\square \)

Finally, regarding aspect (iii), we recall from Sect. 4.2 that the definition of the backward operator \(T_\textsf {bw}\) is based on the fact that for every constraint c, a literal \(L_{i}(\mathbf{X } _{i},\mathbf{Y } _{i})\) in \(B(\mathbf{X }, \mathbf{Y })\) of c has been marked with ‘-’ as its exponent, meaning that:

  • When the database contains the atom \(L_i(\alpha _i,\beta _i)\) along with \(L(\alpha , \gamma )\),

    if \(L(\alpha , \gamma )\) has to be deleted and \({\mathfrak {D}}{\setminus } \{L(\alpha , \gamma )\}\) does not satisfy c,

    then \(L_{i}(\alpha _{i},\beta _{i})\) is deleted to restore constraint satisfaction.

Deletion is thus deterministic because when restoring consistency, each constraint applied backward generates a unique side effect. It should be noticed, as argued later in Sect. 9, that this way of ensuring the determinism of deletions can be related to what is called active rules in the literature [50].

5.3 Deletion processing

Algorithm 2 implements deletions and is illustrated in the following examples. We recall that, as argued in Example 5.2, we assume that in the sets of atoms to be deleted, every null occurs once.

(Algorithm 2: deletion processing; pseudo-code figure not reproduced)

Example 5.3

In this example, we consider \(\Delta =({\mathfrak {D}}, {\mathbb {C}})\) where \({\mathbb {C}}\) is as in Example 4.3, that is:

  • \(c_1: \textsf {Attends}^-(x_1,y_1), \textsf {Conf}(y_1) \Rightarrow \textsf {Registered}(x_1,y_1)\)

  • \(c_2: \textsf {Conf}(x_2) \Rightarrow \textsf {Loc}(x_2,y_2)\)

  • \(c_3: \textsf {Loc}(x_3,y_3) \Rightarrow \textsf {VisaReg}(y_3, z_3)\)

Moreover, we set \(\delta _{max}\) to be equal to 1 and we consider \({\mathfrak {D}}\) as follows:

$$\begin{aligned} {\mathfrak {D}}= & {} \{\textsf {Attends}(Bob,VLDB'18) , \textsf {Conf}(VLDB'18), \textsf {Registered}(Bob,VLDB'18),\\&{ \textsf {Loc}(VLDB'18,Rio), \textsf {VisaReg}(Rio, url1), \textsf {VisaReg}(Rio, url2)\}.} \end{aligned}$$

We first apply Algorithm 2 for \({\textsf {dRequest}} =\{\textsf {Registered}(Bob,VLDB'18)\}\). We have \(ToDel={\textsf {dRequest}} \) and on line 4, \({\mathfrak {D}}_1={\mathfrak {D}}\) because when applying \(T_\textsf {fw}\) to \({\mathfrak {D}}{\setminus } {\textsf {dRequest}} \), \(\textsf {Registered}(Bob,VLDB'18)\) is generated. Then, the restricted core computation on line 5 does not change \({\mathfrak {D}}_1\), meaning that we have \(Disallowed =\emptyset \) (since no nulls are present) and \(Back=\{\textsf {Registered}(Bob,VLDB'18)\}\). Therefore, the test on line 8 fails and \(T^*_\textsf {bw}(\{\textsf {Registered}(Bob,VLDB'18)\}, {\mathfrak {D}}_1)\) is computed on line 11. By Example 4.3, the result is \(\{\textsf {Registered}(Bob,VLDB'18),\)\(\textsf {Attends}(Bob,VLDB'18)\}\) and so, on line 11, \({\mathfrak {D}}_2\) is set to \({\mathfrak {D}}{\setminus } \{\textsf {Registered}(Bob,\)\(VLDB'18),\)\(\textsf {Attends}(Bob,VLDB'18)\}\). Since no null is involved, the computation of the restricted core on line 12 produces \({\mathfrak {D}}' ={\mathfrak {D}}_2\), which is the updated database instance.

Now, we apply Algorithm 2 for \({\textsf {dRequest}} =\{\textsf {VisaReg}(Rio, url1)\}\). As above, \(ToDel={\textsf {dRequest}} \), but on line 4, we have \({\mathfrak {D}}_1={\mathfrak {D}}{\setminus } {\textsf {dRequest}} \), because \(head(c_3)\) has two instances in \({\mathfrak {D}}\) matching with \(\textsf {Loc}(VLDB'18, Rio)\). Then, it is easy to see that the sets Disallowed and Back are set to \(\emptyset \), and so the result \({\mathfrak {D}}'= {\mathfrak {D}}_1\) is output on line 9. That is, the deletion is processed by removing \(\textsf {VisaReg}(Rio, url1)\) from \({\mathfrak {D}}\).

As another deletion, let \({\textsf {dRequest}} =\{\textsf {VisaReg}(Rio, url1),\)\(\textsf {VisaReg}(Rio, url2)\}\). As in the previous two cases, \(ToDel={\textsf {dRequest}} \), but on line 4, \({\mathfrak {D}}_1\) is set to \(({\mathfrak {D}}{\setminus } ToDel) \cup \{\textsf {VisaReg}(Rio, N^0_1)\}\) due to \(c_3\). Then, as in the previous case, the sets Disallowed and Back are set to \(\emptyset \), and the resulting updated instance \({\mathfrak {D}}_1\) is output on line 9. That is, the updated database instance \({\mathfrak {D}}'\) is obtained by removing \(\textsf {VisaReg}(Rio, url1)\) and \(\textsf {VisaReg}(Rio, url2)\) and adding \(\textsf {VisaReg}(Rio, N^0_1)\). We point out that in this case \({\mathfrak {D}}' \subseteq {\mathfrak {D}}\) does not hold.

As a last deletion in this example, let \({\textsf {dRequest}} =\{Loc(VLDB'18, Rio)\}\), which again is equal to ToDel on line 1 of Algorithm 2. Then, on line 4, \({\mathfrak {D}}_1\) is set to \(({\mathfrak {D}}{\setminus } ToDel) \cup \{\textsf {Loc}(VLDB'18, N^0_1),\)\(\textsf {VisaReg}(N^0_1, N^1_2)\}\) due to \(c_2\) and \(c_3\). Since \(\textsf {Loc}(VLDB'18, N^0_1)\) cannot be instantiated, the restricted core computation on line 5 does not change \({\mathfrak {D}}_1\), and so, on line 6 we have \(Disallowed =\{\textsf {VisaReg}(N^0_1, N^1_2)\}\) because \(\delta _{\max }=1\). On line 7, Back is set to \(\emptyset \), and the test on line 8 fails, implying that \(T^*_\textsf {bw}(\{\textsf {VisaReg}(N^0_1, N^1_2)\}, {\mathfrak {D}}_1)\) is computed on line 11, producing \(\{\textsf {VisaReg}(N^0_1, N^1_2),\)\(\textsf {Loc}(VLDB'18, N^0_1),\)\(\textsf {Conf}(VLDB'18)\}\). Since \({\mathfrak {D}}_2\) as computed on line 11 contains no null, the restricted core computation on line 12 does not change \({\mathfrak {D}}_2\). Consequently, \({\mathfrak {D}}'\) is obtained by removing from \({\mathfrak {D}}\) the atoms \(\textsf {Loc}(VLDB'18, Rio)\) and \(\textsf {Conf}(VLDB'18)\). \(\square \)
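The forward repair used in the deletions above, where \(c_3\) invents a fresh null once all \(\textsf{VisaReg}(Rio,\cdot)\) atoms are gone, can be sketched as follows (a minimal sketch; the encoding of nulls as tagged tuples \((\texttt{"N"}, i, degree)\) is our own assumption):

```python
import itertools

_fresh = itertools.count(1)  # global supply of fresh null identifiers

def fw_step_c3(db):
    """One application of c3: Loc(x, y) => VisaReg(y, z).
    For every Loc(x, y) with no VisaReg(y, _) in db, add VisaReg(y, N)
    where N is a fresh null of degree 0, encoded as ("N", i, 0)."""
    new = set()
    for a in db:
        if a[0] == "Loc":
            y = a[2]
            if not any(b[0] == "VisaReg" and b[1] == y for b in db):
                new.add(("VisaReg", y, ("N", next(_fresh), 0)))
    return db | new
```

If some \(\textsf{VisaReg}(Rio,\cdot)\) atom remains, the step changes nothing, which is why deleting only url1 produces no side effect.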

In the following two examples, we show that the core computations on lines 5 and 12 in Algorithm 2 are necessary in order to properly implement deletions.

Example 5.4

As for the core computation on line 5 of Algorithm 2, let \(\Delta = ({\mathfrak {D}},{\mathbb {C}})\) where \({\mathfrak {D}}=\{P(a,b),\)\(P(a,N_1),\)\(P(a',N_1)\}\) and where \({\mathbb {C}}=\emptyset \) (which satisfies that \(R\_core({\mathfrak {D}})={\mathfrak {D}}\)). We also set \(\delta _{\max }=1\). Since \({\mathbb {C}}\) is empty, this implies that all null degrees are equal to 0. Hence, no null can be disallowed and thus, all homomorphisms are restricted.

For \({\textsf {dRequest}} = \{P(a',N)\}\), we have \(ToDel=\{P(a',N_1)\}\) and \(T^*_\textsf {fw}({\mathfrak {D}}{\setminus } {\textsf {dRequest}})={\mathfrak {D}}{\setminus } \{P(a',N_1)\}\). Therefore, on line 4, we have \({\mathfrak {D}}_1=\{P(a,b),\)\(P(a, N_1)\}\), and thus, at this stage we do not have that \(R\_core({\mathfrak {D}}_1)={\mathfrak {D}}_1\), since \(R\_core({\mathfrak {D}}_1)=\{P(a,b)\}\). This is what is computed on line 5 and returned by Algorithm 2, since in that case, the sets Disallowed and Back are empty.

We point out from this example that even when no constraints are considered in \(\Delta \), the core computations are necessary in order to keep the size of the database instance as small as possible. \(\square \)

Example 5.5

To illustrate the core computation on line 12 of Algorithm 2, consider that \(\delta _{\max }=1\) and let \(\Delta = ({\mathfrak {D}},{\mathbb {C}})\) where \({\mathfrak {D}}=\{P(a,N_1),\)\(Q(N_1),\)\(Q(b),\)\(R(a, N_2)\}\) and \({\mathbb {C}}=\{c_1, c_2\}\) defined as follows:

  • \(c_1:P(x_1,y_1) \Rightarrow Q(y_1)\)

  • \(c_2: P(x_2,y_2) \Rightarrow R(x_2,z_2)\)

For \({\textsf {dRequest}} = \{R(a,N)\}\), we have \(ToDel=\{R(a,N_2)\}\), and due to \(c_2\), \({\mathfrak {D}}_1=T^*_\textsf {fw}({\mathfrak {D}}{\setminus } ToDel)\) is \(({\mathfrak {D}}{\setminus }\{R(a,N_2^0)\}) \cup \{R(a,N_3^0)\}\). Therefore, on line 6, the set Disallowed is set to \(\emptyset \), and on line 7, the set Back is set to \(\{R(a,N_3^0)\}\). Since \(T_\textsf {bw}^*(\{R(a, N_3^0)\}, {\mathfrak {D}}_1)=\{R(a, N_3^0),\)\(P(a,N_1^0)\}\), on line 11 we have \({\mathfrak {D}}_2 = \{Q(N_1^0),\)\(Q(b)\}\). Thus, the core computation of line 12 returns \(R\_core({\mathfrak {D}}_2)=\{Q(b)\}\), and so Algorithm 2 outputs \(\Delta ' =({\mathfrak {D}}' , {\mathbb {C}})\) where \({\mathfrak {D}}'=\{Q(b)\}\). \(\square \)

5.4 Complexity remarks

Checking whether all previously satisfied TGDs are still satisfied after an update is an NP-complete problem in the general case (see [19, 47]).

In Algorithm 1, the most expensive computations concern the chase (performed on line 3) and the core (performed on line 4). Similarly, in Algorithm 2, the most expensive computations appear on lines 4 and 11 (where the forward or the backward operator is applied) and on lines 5 and 12 (where the core is computed). We refer to Sect. 7 for a detailed discussion on our core algorithm together with its computation.

Here, following a similar reasoning as in Halfeld Ferrari and Laurent [28], we consider complexity aspects of computing \(T^*_{\textsf {fw}}(I)\) for a given instance I. Denoting by \(\alpha \) the maximal arity of predicates, the number of constants occurring in I is bounded by \(\alpha \times |I|\), and so the number of possible instantiations of an atom \(L(x_1, \dots , x_{\alpha })\) is \((\alpha \times |I|)^{\alpha }\). Thus, for a constraint c with \(m_c\) atoms in its body, the number of its possible instantiations is \( ((\alpha \times |I|)^{\alpha })^{m_c}\), that is, \(O(|I|^{\alpha . m_c})\). This implies that, for one constraint c, the number of possible atoms generated by the application of \(T_{\textsf {fw}}\) on c and I is \(O(|I|^{\alpha . m_c})\). Hence, applying \(T_{\textsf {fw}}\) on \({\mathbb {C}}\) and I generates a new set of atoms in time \(O( |I|^{\alpha . m})\), where m is the maximal number of atoms in the bodies of the constraints in \({\mathbb {C}}\). As, by Lemma 4.1, constraints are applied at most \((k_0 +1)\) times, the complexity of computing \(T^*_{\textsf {fw}}(I)\) is in \(O( |I|^{\alpha . m . (k_0 +1)})\). Consequently, the computation of \(T^*_{\textsf {fw}}({\mathfrak {D}}\cup {\textsf {iRequest}})\) (line 3 of Algorithm 1) is in \(O( |{\mathfrak {D}}\cup {\textsf {iRequest}} |^{\alpha . m . (k_0 +1)})\).
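The counting argument above can be replayed numerically (a toy check of the bounds; the concrete numbers are illustrative only):

```python
def body_instantiations(i_size, alpha, m_c):
    """At most alpha*|I| constants occur in I, so one atom has at most
    (alpha*|I|)**alpha instantiations, and a body with m_c atoms has at
    most ((alpha*|I|)**alpha)**m_c of them, i.e. O(|I|**(alpha*m_c))."""
    return ((alpha * i_size) ** alpha) ** m_c

def fw_star_bound(i_size, alpha, m, k0):
    """Bound |I|**(alpha*m*(k0+1)) on computing T*_fw(I), since
    constraints are applied at most k0+1 times (Lemma 4.1)."""
    return i_size ** (alpha * m * (k0 + 1))
```

For example, with \(|I|=10\), \(\alpha =2\) and \(m_c=2\), a body admits at most \((2\times 10)^{2\times 2}=160{,}000\) instantiations.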

In Algorithm 2, a similar reasoning shows that the complexity of computing line 4 is \(O( |{\mathfrak {D}}{\setminus } ToDel|^{\alpha . m . (k_0 +1)})\). Moreover, the complexity of applying the operator \(T_{\textsf {bw}}\) is similar to that obtained for the forward operator, except that \(T_{\textsf {bw}}\) operates on rules having only one atom in the body and in the head. Thus, the complexity of computing \(T^*_\textsf {bw}((Disallowed \cup Back), {\mathfrak {D}}_1)\) (line 11) is \(O ( |Disallowed \cup Back|^{\alpha . (k_0 +1)} )\) where \(k_0\) is the integer in Lemma 4.3.

To sum up the above remarks, it can be stated that the generation of side effects for insertions and deletions in our approach is polynomial with respect to the sizes of \({\mathfrak {D}}\), \({\textsf {iRequest}} \) and \({\textsf {dRequest}} \).

6 Update properties

In this section, we investigate the properties of the updates as defined in the previous section. To this end, given \(\Delta =({\mathfrak {D}}, {\mathbb {C}})\) and the insertion of \({\textsf {iRequest}} \) or the deletion of \({\textsf {dRequest}} \), we denote by \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) the result of the update and we consider the following issues:

  • Effectiveness This property states that the output of any update is indeed a database, i.e. \(\Delta '\) satisfies Definition 3.1. Moreover, when the update is not rejected, the property ensures that the update is effective, meaning that in case of insertion, \({\mathfrak {D}}'\) contains an instance of every atom in \({\textsf {iRequest}} \) and in case of deletion, \({\mathfrak {D}}'\) contains no atom isomorphic to an atom in \({\textsf {dRequest}} \).

  • Determinism The way an insertion (respectively a deletion) is processed depends only on \(\Delta \) and on the set \({\textsf {iRequest}} \) (respectively \({\textsf {dRequest}} \)), i.e.   no choice has to be made during the processing.

  • Monotonicity This property requires an ordering over instances to state that after any insertion (respectively any deletion), \({\mathfrak {D}}'\) is greater (respectively less) than or equal to \({\mathfrak {D}}\). Example 5.3 shows that, in our approach, set inclusion cannot be used as in standard approaches. Instead, we compare instances according to a relation denoted by \(\sqsubseteq \), based on homomorphisms and to be defined in the following.

  • Minimal change Intuitively, minimal change states that updates should modify the database instance as little as possible. However, checking this property requires considering all instances, which is in general intractable. Even if this were feasible, there exist in general several distinct instances satisfying minimal change, which contradicts the determinism property. Instead, minimal change in our approach is expressed as follows: the insertion of \({\textsf {iRequest}} \) (respectively the deletion of \({\textsf {dRequest}} \)) satisfies the minimal change property if, when ‘forgetting’ one of the specified side effects, at least one constraint is not satisfied.

It should be clear that our update processing is deterministic because no choice has to be made when running Algorithms 1 or 2. We recall, however, that deletions are deterministic thanks to the choice of an atom in the bodies of constraints, a choice made at design time and not during update processing. We now define the relation according to which instances are compared.

Definition 6.1

(Comparison of sets of instantiated atoms) Let \(I_1\) and \(I_2\) be two sets of instantiated atoms. We say that \(I_1 \sqsubseteq I_2\) holds if there exists a homomorphism h such that \(h(I_1) \subseteq I_2\).

Notice that \(\sqsubseteq \) is a partial pre-ordering and that \(\sqsubseteq \) generalizes set inclusion, because \(I_1 \subseteq I_2\) implies \(I_1 \sqsubseteq I_2\). Moreover, it is easy to see that if \(\phi _1\) and \(\phi _2\) are the formulas in \(\Phi \) associated, respectively, with \(I_1\) and \(I_2\), then \(I_1 \sqsubseteq I_2\) holds if and only if \(\phi _2 \Rightarrow \phi _1\) does (any model of \(\phi _2\) satisfies \(\phi _1\) through the homomorphism). As a consequence, for every set of instantiated atoms I, \(I \sqsubseteq R\_core(I)\) and \(R\_core(I) \sqsubseteq I\) both hold.
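Definition 6.1 can be checked naively by enumerating candidate images for the nulls of \(I_1\) (a brute-force sketch; the convention that nulls are strings starting with "N" is our own encoding):

```python
from itertools import product

def is_null(t):
    # our encoding convention: nulls are strings "N1", "N2", ...
    return isinstance(t, str) and t.startswith("N")

def sqsubseteq(i1, i2):
    """I1 ⊑ I2 iff some homomorphism h (identity on constants, nulls of
    I1 mapped to terms occurring in I2) satisfies h(I1) ⊆ I2."""
    nulls = sorted({t for a in i1 for t in a[1:] if is_null(t)})
    targets = {t for a in i2 for t in a[1:]}
    for image in product(targets, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        if all((a[0],) + tuple(h.get(t, t) for t in a[1:]) in i2 for a in i1):
            return True
    return False
```

Note that \(\sqsubseteq\) strictly generalizes inclusion: \(\{P(a,N_1), P(a,N_2)\} \sqsubseteq \{P(a,N_1)\}\) holds by mapping both nulls to \(N_1\), which is exactly why \(I \sqsubseteq R\_core(I)\).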

Regarding insertions, it should be clear that given \(\Delta =({\mathfrak {D}}, {\mathbb {C}})\) and a set \({\textsf {iRequest}} \), if Algorithm 1 returns \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) where \({\mathfrak {D}}' = {\mathfrak {D}}\), then the update is not effective when \({\mathfrak {D}}\) contains no instance of an atom in \({\textsf {iRequest}} \). However, such an update is trivially deterministic, monotonic and satisfies minimal change. Since our update processing has already been stated to be deterministic, the following proposition shows that when \({\mathfrak {D}}' \ne {\mathfrak {D}}\) (implying that the insertion is not rejected), the insertion satisfies all the properties listed above.

Moreover, in this proposition, minimal change is stated by considering as side effects those instantiated atoms not in \({\textsf {iRequest}} \) (up to null renaming) that are inserted so as to maintain consistency.

Proposition 6.1

Let \(\Delta = ({\mathfrak {D}},{\mathbb {C}})\) be a database and \({\textsf {iRequest}} \) a finite set of facts. If Algorithm 1 returns \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) where \({\mathfrak {D}}' \ne {\mathfrak {D}}\), then \({\mathfrak {D}}'\) satisfies the following statements:

  1. Effectiveness: (a) for every \(\varphi \) in \({\textsf {iRequest}} \), \({\mathfrak {D}}'\) contains an instance of \(\varphi \), (b) \(R\_core({\mathfrak {D}}')={\mathfrak {D}}'\) and (c) \({\mathfrak {D}}' \models {\mathbb {C}}\).

  2. Monotonicity: \({\mathfrak {D}}\sqsubseteq {\mathfrak {D}}'\).

  3. Minimal change: For every \(\varphi \) in \({\mathfrak {D}}'\) not in \({\mathfrak {D}}\) and not isomorphic to an atom in \({\textsf {iRequest}} \), \({\mathfrak {D}}' {\setminus } \{\varphi \} \not \models {\mathbb {C}}\).

Proof

See “Appendix A”. \(\square \)

The following proposition gives basic properties of Algorithm 2, namely that deletions are effective and monotonic. However, as will be argued later, deletions do not satisfy the requirement of minimal change. We notice in this respect that, for deletions, side effects are those instantiated atoms not in \({\textsf {dRequest}} \) (up to null renaming) that have been deleted so as to maintain consistency.

Proposition 6.2

Let \(\Delta = ({\mathfrak {D}},{\mathbb {C}})\) be a database and \({\textsf {dRequest}} \) a finite set of instantiated atoms where every null occurs at most once. Algorithm 2 returns \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) where \({\mathfrak {D}}'\) satisfies the following statements:

  1. Effectiveness: (a) for every \(\varphi \) in \({\textsf {dRequest}} \), \({\mathfrak {D}}'\) contains no atom isomorphic to \(\varphi \), (b) \(R\_core({\mathfrak {D}}')={\mathfrak {D}}'\), and (c) \({\mathfrak {D}}' \models {\mathbb {C}}\).

  2. Monotonicity: \({\mathfrak {D}}' \sqsubseteq {\mathfrak {D}}\).

Proof

See “Appendix B”. \(\square \)

The following example shows that our approach does not satisfy minimal change in the sense that deletions might be performed with fewer changes.

Example 6.1

Consider the set \({\mathbb {C}}\) containing the following two constraints

  • \(c_1:P^-(x_1,y_1), Q(y_1,z_1) \Rightarrow R(x_1, z_1)\)

  • \(c_2:Q(x_2,y_2) \Rightarrow S(x_2, t_2)\)

along with \(\Delta =({\mathfrak {D}},{\mathbb {C}})\) where \({\mathfrak {D}}=\{P(a,b),\)\(Q(b,c),\)\(R(a,c),\)\(S(b,N_1)\}\). For \(\delta _{\max }=1\) and \({\textsf {dRequest}} =\{R(a,c),\)\(S(b,N)\}\), according to Algorithm 2, we have first \(ToDel= \{R(a,c),\)\(S(b,N_1)\}\), implying that \(T^*_\textsf {fw}({\mathfrak {D}}{\setminus } ToDel)=\{P(a,b),\)\(Q(b,c),\)\(R(a,c),\)\(S(b, N_2^0)\}\). In other words, \({\mathfrak {D}}_1\) is equal to \({\mathfrak {D}}\) up to a renaming of \(N_2\). Hence, \(R\_core({\mathfrak {D}}_1)={\mathfrak {D}}_1\), showing that \(Disallowed =\emptyset \) and that \(Back=\{R(a,c),\)\(S(b,N_2^0)\}\). Thus, \(T^*_\textsf {bw}(Disallowed \cup Back, {\mathfrak {D}}_1)=\{R(a,c),\)\(S(b,N_2^0),\)\(P(a,b),\)\(Q(b,c)\}\) and so \({\mathfrak {D}}'=\emptyset \).

However, it is easy to see that the deletion of P(ab) is not necessary. Indeed, \(\Delta ''=({\mathfrak {D}}'', {\mathbb {C}})\) where \({\mathfrak {D}}''=\{P(a,b)\}\) is an acceptable result of the deletion because atoms in \({\textsf {dRequest}} \) are not in \({\mathfrak {D}}''\), \({\mathfrak {D}}'' \models {\mathbb {C}}\) and \(R\_core ({\mathfrak {D}}'')={\mathfrak {D}}''\). \(\square \)

The following proposition shows nevertheless that when the bodies of constraints contain one atom, our approach to deletions does satisfy minimal change.

Proposition 6.3

Let \(\Delta = ({\mathfrak {D}},{\mathbb {C}})\) be a database such that all constraints in \({\mathbb {C}}\) have one literal in their body, and \({\textsf {dRequest}} \) a finite set of atoms. Let \(\Delta '=({\mathfrak {D}}', {\mathbb {C}})\) be the database returned by the call of Algorithm 2 with \(\Delta \) and \({\textsf {dRequest}} \) as input. Then, for every \(\varphi \) in \((({\mathfrak {D}}{\setminus } ToDel) {\setminus } {\mathfrak {D}}')\), \({\mathfrak {D}}' \cup \{\varphi \} \not \models {\mathbb {C}}\).

Proof

Since \(\varphi \) is in \((({\mathfrak {D}}{\setminus } ToDel) {\setminus } {\mathfrak {D}}')\), line 11 in Algorithm 2 has been run, and \(\varphi \) is in \(T^*_\textsf {bw}(Disallowed \cup Back, {\mathfrak {D}}_1)\) and in \({\mathfrak {D}}\). By definition of \(T_\textsf {bw}\), there exist \(k>0\), \(c : B(\mathbf{X },\mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z })\) in \({\mathbb {C}}\) and h such that \(\varphi =h(B(\mathbf{X },\mathbf{Y }))=B(\alpha , \beta )\), \(h(B(\mathbf{X },\mathbf{Y })) \in ({\mathfrak {D}}{\setminus } T^k_\textsf {bw}(Disallowed \cup Back, {\mathfrak {D}}_1))\) and all atoms in \({\mathfrak {D}}\) of the form \(L(\alpha ,\_)\) are in \(T^k_\textsf {bw}(Disallowed \cup Back, {\mathfrak {D}}_1)\). As a consequence, \({\mathfrak {D}}_2 \cup \{\varphi \}\) contains h(body(c)) but no atom of the form \(L(\alpha , \_)\), and so, \({\mathfrak {D}}_2 \cup \{\varphi \} \not \models {\mathbb {C}}\).

Let us now assume that \({\mathfrak {D}}' \cup \{\varphi \}\), i.e.   \(R\_core({\mathfrak {D}}_2) \cup \{\varphi \}\), satisfies \({\mathbb {C}}\). Since \(R\_core({\mathfrak {D}}_2) \subseteq {\mathfrak {D}}_2\), \(R\_core({\mathfrak {D}}_2) \cup \{\varphi \}\) contains no atom of the form \(L(\alpha , \_)\), and thus, our assumption entails that \(R\_core({\mathfrak {D}}_2) \cup \{\varphi \}\) contains no atom of the form \(B(\alpha , \_)\). This contradicts the fact that \(\varphi \) is such an atom and clearly belongs to \(R\_core({\mathfrak {D}}_2) \cup \{\varphi \}\). Therefore, the proof is complete. \(\square \)

7 Chasing versions and core computation

This section first offers a discussion about our chase approach, with regard to different chase versions available in the literature, and then provides details on the core computation, which is the most complex step in our algorithms.

7.1 Chasing: our choice of controlling null propagation

It is well known that when chasing an instance with respect to TGDs, the process might not terminate due to the specific form of the constraints; Example 1.4 shows such a case. The usual way to address this issue is to identify those rules that might lead to a non-terminating chase so as to discard them. However, as the problem is known to be undecidable, sufficient conditions ensuring termination have been proposed in the literature [45]. In our approach, we avoid such restrictions through the introduction of the parameter \(\delta _{\max }\), thus offering the possibility of dealing with any kind of constraints while avoiding infinite processing.

More precisely, we show that: (i) for computations where \(\delta _{max}\) is not reached, our chasing operator \(T_{\textsf {fw}}\) corresponds to a well-defined chase semantics and (ii) for situations where constraints meet the conditions ensuring chase termination, we can determine \(\delta _{max}\) so that all null degrees are less than \(\delta _{max}\).

Positioning our Chase Procedure We first recall from Onet [45] that various chasing versions have been proposed in the literature, under the names of standard, oblivious, semi-oblivious and core chase. We show here that our chasing approach as defined through the operator \(T_\textsf {fw}\) is closely related to the standard-chase and the core-chase procedures. To this end, we define a new operator T as follows, for every set I of instantiated atoms:

$$\begin{aligned} \begin{array}{rll} T(I)=I \,\cup &{} \{h(L(\mathbf{X },\mathbf{Z }))\,|&{}(\exists c:B(\mathbf{X },\mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z }) \in {\mathbb {C}})(\exists h)\\ &{}&{} \quad ((c{ \text{ is } \text{ full }})\,\wedge (h(body(c)) \subseteq I))\}\\ ~\cup &{} \{h'(L(\mathbf{X },\mathbf{Z }))\,|&{}(\exists c:B(\mathbf{X },\mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z }) \in {\mathbb {C}})(\exists h)\\ &{}&{}((c{ \text{ is } \text{ not } \text{ full }})\,\wedge (h(body(c)) \subseteq I)\,\wedge \\ &{}&{}(\forall h'')((h''(body(c)) =h(body(c))) \Rightarrow \\ &{}&{} h''(L(\mathbf{X },\mathbf{Z })) \notin I))\} \end{array} \end{aligned}$$

Roughly speaking, T is a simplified version of \(T_\textsf {fw}\) in which all considerations about null degrees are omitted. In other words, T can be seen as a ‘parallelized’ version (i.e. constraints are applied in a breadth-first manner) of a step of the standard-chase procedure, where constraints are applied in a depth-first manner [45]. Thus, when the standard chase terminates, the following sequence:

  • \(T^0=I\)

  • for every \(k >0\), \(T^k=T(T^{k-1})\)

has a limit, denoted by \(T^*(I)\), which is precisely the result of the standard-chase procedure applied to I.
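For full constraints, the sequence \(T^0, T^1, \dots\) can be computed by a plain fixpoint loop. The following sketch is restricted to full rules with a single body atom, encoded (our own convention) as functions from an atom to the head atom or None:

```python
def T(db, rules):
    """One 'parallel' chase step: apply every rule to every atom.
    rules: functions atom -> head atom (or None if the rule does not
    match); full constraints only, so no fresh nulls are needed."""
    derived = {h for a in db for r in rules if (h := r(a)) is not None}
    return db | derived

def T_star(db, rules):
    """Iterate T up to its limit; for full constraints only finitely
    many atoms over the constants of db can be produced, so the loop
    terminates."""
    while (nxt := T(db, rules)) != db:
        db = nxt
    return db
```

For instance, with \(P(x,y) \Rightarrow Q(y,x)\) and \(Q(x,y) \Rightarrow R(x)\), chasing \(\{P(a,b)\}\) yields \(\{P(a,b), Q(b,a), R(b)\}\) after two steps.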

On the other hand, another variant of the chase procedure is known as core chase procedure and can be defined as follows. For every I, let \(T_c(I)=core(T(I))\) and let \(core\_chase(I)\) be the limit of the sequence defined by

  • \(T_c^0=I\)

  • for every \(k >0\), \(T_c^k=T_c(T_c^{k-1})\)

when this limit exists. If the limit does not exist, \(core\_chase(I)\) is set to \(\bot \).

The relation between the two versions of the chase procedure as set in Fagin et al. [15] shows that when standard chase terminates, \(core\_chase(I)=core(T^*(I))\). Thus, the following proposition holds.

Proposition 7.1

Let I be a set of instantiated atoms such that the computation of \(T_\textsf {fw}^*(I)\) yields no nulls whose degree is greater than or equal to \(\delta _{\max }\). Then, \(core\_chase(I) \ne \bot \) and \(core\_chase(I) =core(T_\textsf {fw}^*(I))\).

Proof

Since the computation of \(T_\textsf {fw}^*(I)\) yields no nulls whose degree is greater than or equal to \(\delta _{\max }\), we have \(T_\textsf {fw}^*(I)=T^*(I)\) and thus \(core\_chase(I) =core(T^*(I))=core(T_\textsf {fw}^*(I))\), and the proof is complete. \(\square \)

Chase Termination Conditions and \(\delta _{\max }\) We recall that the threshold \(\delta _{\max }\) has been introduced to cope with two important issues when dealing with TGDs in practice:

  1. Avoid non-terminating computations of \(T_\textsf {fw}({\mathfrak {D}})\). This issue is the most important one because termination of the chase procedure is known to be undecidable in general (see [45]). In the literature, sufficient conditions for termination of the chase procedure have been proposed, and in this section, we relate one of these sufficient conditions to the value of \(\delta _{\max }\).

  2. Avoid any unsuitable spreading of nulls throughout the database instance. Even when the constraints are such that termination can be ensured, an instance containing a huge number of nulls produced by cascading constraint applications is likely of very limited interest.

Focussing on the first item above, we now recall the notion of weak acyclicity of the dependency graph of \({\mathbb {C}}\) as introduced in Fagin et al. [15]. The dependency graph of \({\mathbb {C}}\), denoted by \(DG({\mathbb {C}})\), is a labelled directed graph defined as follows:

  • the vertices of \(DG({\mathbb {C}})\) are all pairs (P, i) where P is an n-ary predicate occurring in a constraint c of \({\mathbb {C}}\) and i is an integer such that \(1 \le i\le n\), representing an argument position of P.

  • an edge occurs between vertices (P, i) and (Q, j) if there exists a constraint \(c:B(\mathbf{X }, \mathbf{Y }) \Rightarrow L(\mathbf{X },\mathbf{Z })\) in \({\mathbb {C}}\) such that body(c) contains an atom over P for which position i is a variable x in \(\mathbf{X } \), Q is the predicate L and one of the following holds:

    (a) x occurs in \(L(\mathbf{X },\mathbf{Z })\) at position j,

    (b) there exists a variable z in \(\mathbf{Z } \) occurring at position j in \(L(\mathbf{X },\mathbf{Z })\).

In case (a), the edge is said to be universal and in case (b) the edge is said to be existential. The graph \(DG({\mathbb {C}})\) is said to be weakly acyclic if it contains no cycle involving an existential edge.
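Weak acyclicity reduces to a reachability test: no existential edge \((u, v)\) may lie on a cycle, i.e. v must not reach u. A sketch, with positions encoded as (predicate, index) pairs:

```python
from collections import defaultdict

def weakly_acyclic(universal, existential):
    """universal/existential: sets of directed edges over positions.
    DG is weakly acyclic iff no cycle contains an existential edge,
    i.e. there is no existential edge (u, v) such that v reaches u."""
    adj = defaultdict(set)
    for u, v in universal | existential:
        adj[u].add(v)

    def reaches(src, dst):  # iterative depth-first search
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        return False

    return not any(reaches(v, u) for u, v in existential)
```

For example, \(c_2: \textsf{Conf}(x) \Rightarrow \textsf{Loc}(x,y)\) yields a universal edge \(((\textsf{Conf},1),(\textsf{Loc},1))\) and an existential edge \(((\textsf{Conf},1),(\textsf{Loc},2))\), and the graph is weakly acyclic; whereas \(P(x) \Rightarrow Q(x,y)\) together with \(Q(x,y) \Rightarrow P(y)\) creates a cycle through the existential edge \(((P,1),(Q,2))\).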

It has been shown in Fagin et al. [15] that when the dependency graph is weakly acyclic, standard chase always terminates. Using this result in our context, we show the following proposition.

Proposition 7.2

Let \({\mathbb {C}}\) be such that \(DG({\mathbb {C}})\) is weakly acyclic and K be the maximum number of existential edges occurring in a path of \(DG({\mathbb {C}})\). If \(\delta _{\max } > K\), then the computation of \(T_\textsf {fw}^*(I)\) yields no nulls whose degree is greater than or equal to \(\delta _{\max }\).

Proof

By definition of \(T_\textsf {fw}\), when \(\delta (N)\) is set to k, this means that there exists a path in \(DG({\mathbb {C}})\) involving k existential edges. Thus, the result follows. \(\square \)

As a consequence of Propositions 7.1 and 7.2, it turns out that, in our approach, when the dependency graph of \({\mathbb {C}}\) is weakly acyclic, \(\delta _{\max }\) can be chosen so that all insertions are accepted, producing the same instance as if the core-chase procedure were applied. On the other hand, we recall that sets of constraints \({\mathbb {C}}\) whose dependency graph is not weakly acyclic are accepted in our approach, at the cost of rejecting some insertions.

7.2 Computing the core

As mentioned earlier, computing cores is necessary in our approach to keep the database instance as small as possible and to avoid the presence of useless, and thus confusing, nulls in the database. In what follows, we give details about the core computation and its complexity.

Given a set I of instantiated atoms, the computation of \(R\_core(I)\) is based on a partition \(\Pi (I)\) of I whose construction can be explained based on the so-called Gaifman graph of the nulls of I (see [16]). This graph, denoted by G(I), is defined as follows:

  • the vertices of G(I) are all the nulls occurring in I,

  • \((N,N')\) is an edge of G(I) if N and \(N'\) both occur in one atom in I.

The connected components of G(I), say \(G_1, \ldots ,G_n\), allow us to partition I into \(n+p\) pairwise disjoint subsets \(I_1, \ldots , I_n, I_{n+1}, \ldots , I_{n+p}\) where, for \(j=1, \ldots , n\), \(I_j\) contains all atoms in I in which nulls in \(G_j\) occur and where \(I_{n+1}, \ldots , I_{n+p}\) are all pairwise distinct singletons, each of them containing an atom in I with no nulls. Then, denoting by \(\Pi (I)\), respectively, \(\overline{\Pi }(I)\), the set \(\{I_1, \ldots , I_n\}\), respectively, \(\{I_1, \ldots , I_n , I_{n+1}, \ldots , I_{n+p}\}\), Algorithm 3 shows how \(R\_core(I)\) is computed. This algorithm can be seen as a simplification of the one given in Fagin et al. [16], because the chase computation is not part of our algorithm. Notice, however, that the computation of the homomorphism is given in more detail than in Fagin et al. [16], in order to better explain the optimizations we focus on.
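The blocks \(I_1, \ldots , I_n\) of \(\Pi(I)\) can be computed with a union-find structure over the nulls (a sketch; nulls are assumed, by our own convention, to be strings starting with "N"):

```python
from collections import defaultdict

def null_blocks(instance,
                is_null=lambda t: isinstance(t, str) and t.startswith("N")):
    """Return the blocks I_1..I_n of Pi(I): atoms grouped by the
    connected component of the Gaifman graph G(I) their nulls belong
    to. Null-free atoms (the singleton blocks) are left out."""
    parent = {}

    def find(n):  # union-find with path halving
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for atom in instance:
        ns = [t for t in atom[1:] if is_null(t)]
        for a, b in zip(ns, ns[1:]):  # nulls co-occurring in one atom
            parent[find(a)] = find(b)
    blocks = defaultdict(set)
    for atom in instance:
        ns = [t for t in atom[1:] if is_null(t)]
        if ns:
            blocks[find(ns[0])].add(atom)
    return list(blocks.values())
```

On the instance of Example 7.1, the nulls split into the components \(\{N_1\}\) and \(\{N_2, N_3\}\) (linked by \(S(a,N_2,N_3)\)), yielding a block of one atom and a block of four atoms.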

(Algorithm 3: restricted core computation.)

It is important to note that the loop on line 1 can be computed in parallel because any two sets in \(\Pi (I)\) share no nulls. Such parallel processing would thus result in a fast global computation, in particular when the sizes of the sets \(\pi \) are small.

However, the loop on line 5 cannot be computed by just choosing one homomorphism in each set \(hom(\pi )\) and combining these chosen homomorphisms into one global homomorphism. To see this, consider \(I=\{P(a,N_1),\)\(P(a,N_2)\}\) where for the sake of simplification no disallowed nulls occur, implying that null degrees are omitted and that all homomorphisms are restricted. We have \(\Pi (I)=\overline{\Pi }(I)=\{\{P(a,N_1)\},\)\(\{P(a,N_2)\}\}\) and \(h_1\) and \(h_2\) such that \(h_1(N_1)= N_2\) and \(h_2(N_2)=N_1\) are found on line 2, possibly in parallel. Thus, one global homomorphism would simply exchange the nulls, and thus, I would not be changed. This is not correct because, as found by Algorithm 3, \(R\_core(I)\) is \(\{P(a,N_1)\}\) (or equivalently \(\{P(a,N_2)\}\)). The following example illustrates Algorithm 3 in a less simple case, where again, no disallowed nulls are present and thus where null degrees are omitted.

Example 7.1

Assuming that all homomorphisms in this example are restricted, let \(I=\{P(a),\)\(P(N_1),\)\(Q(a,b),\)\(Q(a, N_2),\)\(R(a,N_3),\)\(S(a,b,N_3),\)\(S(a,N_2,N_3)\}\). Then, \(\Pi (I)=\{\pi _1, \pi _2\}\) where \(\pi _1= \{P(N_1)\}\) and \(\pi _2=\{Q(a, N_2),\)\(R(a,N_3),\)\(S(a,b,N_3),\)\(S(a,N_2,N_3)\}\), whereas \(\overline{\Pi }(I)=\Pi (I) \cup \{ \{P(a)\}, \{Q(a,b)\}\}\).

On line 2 of Algorithm 3, it is found that \(hom(\pi _1)=\{h_1\}\) where \(h_1\) is defined by \(h_1(N_1)=a\) and \(hom(\pi _2) =\{h_2\}\) where \(h_2\) is defined by \(h_2(N_2)=b\) and \(h_2(N_3)=N_3\). In the loop of line 5, choosing \(\pi _1\) and then \(\pi _2\), \(\Gamma \) is first set to \((I {\setminus }\{P(N_1)\}) \cup \{P(a)\}=(I {\setminus }\{P(N_1)\})\), and then changed to \((\Gamma {\setminus }\{Q(a,N_2),\)\(R(a,N_3),\)\(S(a,b,N_3)\), \(S(a,N_2,N_3)\}) \cup \{Q(a,b),\)\(R(a,N_3),\)\(S(a,b,N_3)\}\). Hence, \(R\_core(I)=\{P(a),\)\(Q(a,b),\)\(R(a,N_3),\)\(S(a,b,N_3)\}\), and as \(\pi _1 \cap \Gamma =\emptyset \) and \(\pi _2 \cap \Gamma = \{R(a,N_3),\)\(S(a,b,N_3)\}\), \(\Pi (R\_core(I))= \{\{R(a,N_3),\)\(S(a,b,N_3)\}\}\). \(\square \)

Complexity It has been shown in Fagin et al. [16] that the complexity of Algorithm 3 is equal to that of computing the homomorphisms on line 2, which itself is in \(O(|I|^{b})\) where b is the maximum number of nulls occurring in the sets \(I_j\) of \(\Pi (I)\). We also recall from Gottlob [20] that, in our context, this complexity can be brought down to \(O(|I|^{(b/2+2)})\).

8 Experimental study

We have implemented a prototype of our approach, which can be downloaded together with all the details of our tests (see [10]). The implementation, written in JAVA, allows running all the examples provided in the present paper and serves as a proof of concept for our updating methods. Although the implementation deals with facts stored in main memory, it allows us to offer a preliminary study on the trade-off between the gain in expressive power and the efficiency of updating. Our tests were run on an Intel(R) Core(TM) i7-6600U CPU 2.60GHz x 4 with 16 GB of RAM. Time results correspond to the average time of 5 executions of each test.

8.1 Synthetic example

As a preliminary test, we run our algorithms on a synthetic example, reproducible according to instructions available in DBOrleans-Team [10]. To provide a fair idea of the impact of allowing more complex constraints on the consistency maintenance of a database, we propose the following evolving test scenario:

  1.

    The initial database instance \({\mathfrak {D}}\) is fixed: a set of instantiated atoms (possibly containing nulls) over 525 distinct predicates.

  2.

    Each test consists in performing a fixed number of updates (K) on the initial instance \({\mathfrak {D}}\).

  3.

    Three different sets of constraints, denoted by \({\mathbb {C}}_{0np}\), \({\mathbb {C}}_{1np}\) and \({\mathbb {C}}_{2np}\), give rise to three different tests. The difference between these tests is the increasing number of side effects, i.e. each test is performed on a set of constraints which evolves in the following way:

    \({\mathbb {C}}_{0np}\):

    In this first test, there is no null propagation, i.e. a constraint generates a null-atom, but this new atom cannot match the body of another constraint. More precisely, each constraint has the form \(A_i (x, y) \Rightarrow B_i(y, z)\) for \( i \in [1, K] \). This scenario corresponds to the restrictions imposed on constraints in Halfeld Ferrari et al. [25].

    \({\mathbb {C}}_{1np}\):

    In the second test, each generated null propagates, generating a new side effect. The set of constraints is composed of subsets containing two constraints of the form \(A_i (x,y) \Rightarrow B_i(y, z)\) and \( B_i(x, y) \Rightarrow C_i(x,y,z)\) for \(i\in [1, K]\).

    \({\mathbb {C}}_{2np}\):

    In the third test, the set of constraints is composed of subsets containing three constraints of the form \(A_i (x,y) \Rightarrow B_i(y, z)\), \( B_i(x, y) \Rightarrow C_i(x,y,z)\) and \(C_i(x,y,z) \Rightarrow D_i(x,y,z, u)\) for \(i\in [1, K]\).

    Tests are built by increasing the side-effect chain triggered by an atom on \(A_i\). Moreover, each newly generated side effect corresponds to an atom with a new null, which also contains the nulls generated by the previous constraint of the chain. For instance, in \({\mathbb {C}}_{1np}\), the insertion of \(A_1(a,b)\) generates \(B_1(b, N_1)\) and \(C_1(b,N_1, N_2)\). In \({\mathbb {C}}_{2np}\), this insertion generates \(B_1(b, N_1)\), \(C_1(b,N_1, N_2)\) and \(D_1(b,N_1, N_2, N_3)\).

  4.

    Each insertion is a set of facts on the predicates \(A_i \) for \( i \in [1, K] \). The constants appearing in such facts are randomly chosen from the active domain of the database instance, together with new nulls. Similarly, a deletion is a set of facts on the predicate appearing at the end of the side-effect chain considered in the corresponding test; e.g. for the second test, deletions are on the predicates \(C_i\).

  5.

    Besides constraints with only one atom in the body, we also offer some tests with constraints having two atoms in the body. Three tests are performed in this scenario:

    \({\mathbb {C}}^{2B}_{0np}\):

    In this test, the set of constraints is composed of constraints of the form \(A_i (x, y), A_{i+1}(x, z) \Rightarrow AA_i(x, z) \).

    \({\mathbb {C}}^{2B}_{1np}\):

    In this test, the set of constraints is composed of subsets containing two constraints, one of the form \(A_i (x, y), A_{i+1}(x, z) \Rightarrow AA_i(x, z) \) and the other of the form \(A_i (x, y), A_{i+1}(y, z) \Rightarrow AB_i(x, u) \) for \( i \in [1, K-1] \).

    \({\mathbb {C}}^{2B}_{2np}\):

    In this third test, the set of constraints is composed of subsets containing the two constraints of \({\mathbb {C}}^{2B}_{1np}\) together with a third constraint of the form \( AA_i(x, y), AB_i(x, z) \Rightarrow AC_i(x, z, u)\).

    In this scenario, insertions are sets containing instantiated atoms on the predicates \(A_i\) and \(A_{i+1}\).
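The side-effect chains of this scenario can be reproduced with a minimal forward-chase sketch. The encoding and names below are ours (not the paper's \(T^*_\textsf{fw}\) implementation): constraints are pairs of a single body atom and a head atom, following the forms \(A_i(x,y) \Rightarrow B_i(y,z)\) given above, and each head variable not bound by the body receives a fresh null.

```python
import itertools

_fresh = itertools.count(1)

def chase(facts, tgds):
    """Naive forward chase for single-body-atom TGDs, given as pairs
    ((body_pred, body_vars), (head_pred, head_vars)); head variables not
    bound by the body are filled with fresh nulls N<k>."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (bp, bv), (hp, hv) in tgds:
            for f in list(facts):
                if f[0] != bp or len(f) - 1 != len(bv):
                    continue
                env = dict(zip(bv, f[1:]))
                # skip if the head is already satisfied (existential
                # positions may be matched by any value)
                if any(g[0] == hp and len(g) - 1 == len(hv)
                       and all(v not in env or env[v] == g[1 + i]
                               for i, v in enumerate(hv))
                       for g in facts):
                    continue
                facts.add((hp,) + tuple(env[v] if v in env
                                        else f"N{next(_fresh)}"
                                        for v in hv))
                changed = True
    return facts

# The two constraints of C_1np for i = 1, in the forms given above
C1np = [(("A1", ("x", "y")), ("B1", ("y", "z"))),
        (("B1", ("x", "y")), ("C1", ("x", "y", "z")))]
```

With these constraints, `chase({("A1", "a", "b")}, C1np)` yields \(A_1(a,b)\) together with one \(B_1\) atom and one \(C_1\) atom chaining its nulls (up to null renaming), i.e. the two side effects of the chain.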

Our tests offer a preliminary evaluation of how update performance is impacted by the use of complex constraints such as TGDs.

(A) The first impact is, naturally, the possibility of performing automatic consistency verification for applications requiring such complex constraints.

(B) The second impact concerns the time needed for processing updates. We analyse time results for insertions and deletions separately, on the basis of the scenario described above: starting with a very simple case, where each update triggers a single constraint and has one side effect, we continue, as explained above, by increasing the complexity of our constraints, the null propagation and the number of generated nulls.

(B.1) Table 2 shows our results for insertions. Insertions are done according to the explanations given in item 4 and, thus, concern the worst case where all constraints are triggered. The results are promising: the gain of expression power does not compromise the performance. To better analyse the situation, we compare the results obtained for \({\mathbb {C}}_{0np}\) with the other results in Table 2:

  (a)

    From \({\mathbb {C}}_{0np}\) to \({\mathbb {C}}_{1np}\): while the number of triggered constraints is doubled, the increase in total time for the insertion of 75 atoms is only \(6.4\%\). Similarly, from \({\mathbb {C}}_{0np}\) to \({\mathbb {C}}_{2np}\), while the number of triggered constraints is tripled, the increase in total insertion time is only \(1.7\%\).

  (b)

    Experiments with constraints having more than one atom in the body (lines 4–6 of Table 2) show that the insertion time increases, respectively, by \(0.4\%\), \(4.7\%\) and \(13.7\%\), when compared to \({\mathbb {C}}_{0np}\).

  (c)

    Unsurprisingly, the core is the most time-expensive part of our algorithm, accounting, here, for more than \(80\%\) of the total computation time. Inserting a set of atoms with linked null values into an instance with a large number of null values renders the search for instantiations expensive, because the algorithm looks for instantiations of whole partition blocks and not of single atoms. In our synthetic database, \(44\%\) of the instantiated atoms have null values. Section 8.3 proposes some optimizations which should alleviate this problem.

Moreover, the core computation deserves some attention, even if it is independent of constraint complexity. In our examples, the forward operator always produces atoms with nulls. The core procedure tries to instantiate these atoms, but as the original database instance has \(44\%\) of its atoms with null values, such instantiations are rare. The situation is illustrated by the results in column \(|{\mathfrak {D}}'|\) of Table 2, which show that all computed side effects are inserted into the database, meaning that no core reduction is possible. However, to complete the example, consider now a different insertion scenario:

  • Take the set of constraints and database instance of line 3 in Table 2.

  • Consider \({\textsf {iRequest}} \) as containing the same 75 atoms as before, plus 25 facts corresponding to instantiations of atoms appearing in the database instance and in the head of some constraints.

In this new scenario, our results are: \(|T^*_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})| = 243\), \(|{\mathfrak {D}}'| = 49,888\), \(t_{T^*_\textsf {fw}} = 0.197 s\) and \(t_{Core} = 2.256 s\). In this case, the core eliminates 75 atoms (since \((243 + 49,710)-49,888=75\)), thus resulting in a significant simplification.

(B.2) Table 3 shows our results for deletions. The tests consider only the worst case of deletion, i.e. when Back is not empty (line 12 of Algorithm 2), which requires computations involving the operator \(T_\textsf {bw}\) and the restricted core.

Deleting atoms with a null value is much more expensive than inserting them, since a deletion requires a database traversal to determine all possible null instantiations, while an insertion needs just one instantiation. For instance, considering a database instance \({\mathfrak {D}}= \{A(a, N_1), B(a,c), B(a, b), B(a, d)\}\), the constraint \(B(x, y) \Rightarrow A(x, z)\), and the deletion of \(A(a, N_1)\), the algorithm has to find all the instantiations of \(B(a, \_)\) in the database instance. This operation is expensive in the current implementation because data are stored in regular files. This should be significantly improved in the new version (under construction) where data are stored using a database management system (DBMS). In this case, thanks to the optimized query processing of a DBMS, the instantiations of \(B(a, \_)\) can be efficiently retrieved. Notice that a similar situation is described in Halfeld Ferrari et al. [25] and Halfeld Ferrari et al. [27], where only constraints such as those in \({\mathbb {C}}_{0np}\) are considered.
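The instantiation lookup just described can be sketched as a plain pattern scan. This is our own illustration (nulls encoded as strings starting with `N`), precisely the scan that an indexed DBMS query would replace:

```python
def instantiations(pattern, instance):
    """Atoms of `instance` that the null-atom `pattern` can be instantiated
    to: same predicate and arity, nulls in the pattern matching any term,
    constants matching exactly."""
    return {f for f in instance
            if f[0] == pattern[0] and len(f) == len(pattern)
            and all(p.startswith("N") or p == t
                    for p, t in zip(pattern[1:], f[1:]))}
```

On the instance of the example, `instantiations(("B", "a", "N2"), D)` returns the three atoms \(B(a,c)\), \(B(a,b)\) and \(B(a,d)\), each a candidate instantiation the deletion has to examine.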

Table 1 Notation used for results of the synthetic tests
Table 2 Results for insertions: examples consider different constraint sets on a database instance \({\mathfrak {D}}\) such that \(|{\mathfrak {D}}| =49{,}710\). The set \({\textsf {iRequest}} \) is fixed and contains 75 instantiated atoms. Results are presented using the notation shown in Table 1
Table 3 Results for deletions: examples consider different constraint sets on a database instance \({\mathfrak {D}}\) such that \(|{\mathfrak {D}}| =49{,}710\). The set \({\textsf {dRequest}} \) is fixed and contains 10 instantiated atoms. Results are presented using the notation shown in Table 1

Table 3 shows that, once again, even if our tests concern only the worst case, the results are promising, because the gain of expression power does not compromise the performance.

8.2 URBS example

Recalling that our approach is well adapted to RDF data sources, we now illustrate such an application. As explained in Halfeld Ferrari and Laurent [28], our constraints can be instantiated to express RDF/S semantic constraints. For instance, in Halfeld Ferrari and Laurent [28] and Halfeld Ferrari et al. [27], this is done by using the formalism of Flouris et al. [19] and by classifying predicates into two sets: one concerning the schema and another concerning the instances. Then, the general constraints of RDF/S are written.

In this paper, we test our approach over the RDF data set used for tests in Halfeld Ferrari et al. [25], which had been generated by importing data from the Curitiba Urbanization Company (URBS). We recall that Curitiba (Brazil) has been known for its mass transport corridors and mobility solutions using bus rapid transit (BRT) systems since the 1970s [48]. The complete system, according to Curitiba’s Institute of Research and Urban Planning (IPPUC), includes about 482 routes, distributed among 9940 bus stops, 23 bus terminals and 17 categories of streets [35]. Different types of routes are implemented, such as express routes, inter-district and local lines, or feeder lines.

Table 4 Constraints for the RDF database concerning Curitiba Urbanization Company

In this paper, each RDF triple of the form \((s, P, o)\) (i.e. subject–predicate–object) is translated into an atom of the form P(s, o). Notice also that we do not use special predicates as in Flouris et al. [19], Halfeld Ferrari and Laurent [28] and Halfeld Ferrari et al. [25] (such as PI, standing for property instance, or CI, standing for class instance). Instead, we perform a translation in order to obtain only atoms on application predicates. For example, the instance of property ExpressLineStop, expressed in RDF as \(PI(a, b, \textsf {ExpressLineStop})\), is translated into \(\textsf {ExpressLineStop}(a,b)\). Our choice is only motivated by performance issues. Indeed, considering predicates PI and CI, finding, for instance, that \(PI(a, N_1, \textsf {ExpressLineStop}) \) and \(PI(a, b, \textsf {ExpressLineStop}) \) unify would require testing all atoms on PI, whereas in our implementation, only atoms on ExpressLineStop are tested.
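The translation and its performance rationale can be sketched as follows (our own minimal illustration, not the JAVA implementation): grouping the translated atoms by predicate means a unification test only scans the atoms of the relevant predicate.

```python
from collections import defaultdict

def index_by_predicate(triples):
    """Translate RDF triples (s, P, o) into atoms P(s, o), stored per
    predicate so that a lookup touches only atoms of that predicate."""
    atoms = defaultdict(set)
    for s, p, o in triples:
        atoms[p].add((s, o))
    return atoms
```

For instance, checking whether \(\textsf{ExpressLineStop}(a,N_1)\) and \(\textsf{ExpressLineStop}(a,b)\) unify only inspects `atoms["ExpressLineStop"]`, instead of every atom on a generic PI predicate.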

Table 5 shows the results of our tests on URBS data. Based on Table 4, two sets of constraints are considered, namely \({\mathbb {C}}_1 = \{c_1, c_2, c_3, c_4, c_5, c_6\}\) and \({\mathbb {C}}_2 = {\mathbb {C}}_1 \cup \{c_7, c_8, c_9, c_{10}, c_{11}\}\). The set \({\mathbb {C}}_1\) contains simple constraints, that is, constraints with no null propagation, as in Halfeld Ferrari et al. [25]. The set \({\mathbb {C}}_2\) is obtained by adding more sophisticated constraints to the previous set. This new set contains constraints with more than one atom in the body (\(c_{7}\) and \(c_{10}\)) and allows for null propagation. In this way, it is possible to express constraints establishing, for instance, that express lines should be allocated (\(c_9\)) to lines connecting districts to the town terminals (\(c_8\)). The interaction between constraints \(c_8\) and \(c_9\), not possible in Halfeld Ferrari et al. [25], is allowed in the present approach, thus representing an important gain in the expression power of constraints.

Table 5 Results for insertions in the RDF database URBS: examples consider different constraint sets on a database instance \({\mathfrak {D}}\) such that \(|{\mathfrak {D}}| =115{,}910\). The set \({\textsf {iRequest}} \) is fixed and contains 10 instantiated atoms. Results are presented using the notation shown in Table 1

8.3 Possible optimizations

Our experiments have shown that the core computations may become expensive. This is why we now discuss possible optimizations of the core computation in our current implementation, based on the fact that, given \(\Delta =({\mathfrak {D}}, {\mathbb {C}})\), we have \(R\_core({\mathfrak {D}})={\mathfrak {D}}\).

In the case of an insertion, as can be seen from Algorithm 1, one core computation is necessary for the set \(T^*_\textsf {fw}({\mathfrak {D}}\cup {\textsf {iRequest}})\), which we denote by \({\mathfrak {D}}\cup I\) where I contains the atoms in \({\textsf {iRequest}} \) along with the side effects of the insertion. We notice that, although \({\mathfrak {D}}\) and \({\textsf {iRequest}} \) have no nulls in common, I may contain nulls occurring in \({\mathfrak {D}}\). As a consequence, the partition \(\Pi ({\mathfrak {D}}\cup I)\) is not equal to \(\Pi ({\mathfrak {D}}) \cup \Pi (I)\). However, assuming that \(\Pi ({\mathfrak {D}})\) is available, the computation of \(\Pi ({\mathfrak {D}}\cup I)\) can be optimized, as compared with a computation from scratch. This is so because every partition block associated with a connected component of \(G({\mathfrak {D}}\cup I)\) that also occurs in \(G({\mathfrak {D}})\) does not have to be modified.

Assuming that \(\Pi ({\mathfrak {D}}\cup I)\) has been computed, and referring to line 2 of Algorithm 3, since we have \(R\_core({\mathfrak {D}})={\mathfrak {D}}\), if the block \(\pi \) of \(\Pi ({\mathfrak {D}}\cup I)\) being considered was already in \(\Pi ({\mathfrak {D}})\), then, clearly, a homomorphism exists only if one atom in \(\pi \) can be instantiated by one atom in I. Therefore, this narrows the search space for the computation of \(hom(\pi )\).

Turning now to deletions, by Algorithm 2, two core computations are involved, and both address an instance that can be written as \({\mathfrak {D}}{\setminus } I\) where I is a subset of \({\mathfrak {D}}\). In this case, \(\Pi ({\mathfrak {D}})\) first has to be modified so as to generate \(\Pi ({\mathfrak {D}}{\setminus } I)\). This can be optimized because the only blocks \(\pi \) to be modified are those such that \(\pi \cap I \ne \emptyset \), in which case \(\pi \) is changed into \(\pi {\setminus } I\). Every such block \(\pi {\setminus } I\) that becomes empty is removed, and the others have to be checked because they might have to be split. To see this, let \({\mathfrak {D}}=\{P(N_1), Q(N_1,N_2), R(N_2)\}\) and \(I=\{Q(N_1,N_2)\}\), in which case \(\Pi ({\mathfrak {D}})=\{{\mathfrak {D}}\}\). Then, the block of \(\Pi ({\mathfrak {D}})\) is changed to \(\{P(N_1),R(N_2)\}\), which clearly has to be split into two blocks.
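The split check of this example can be sketched as a connected-components pass over the shrunk block (again our own encoding, with nulls as strings starting with `N`):

```python
def split_block(block):
    """Split a set of null-atoms into maximal subsets whose atoms are
    transitively connected by shared nulls."""
    comps = []  # pairs (nulls of the component, atoms of the component)
    for atom in block:
        ns = {t for t in atom[1:] if t.startswith("N")}
        atoms, rest = {atom}, []
        for cn, ca in comps:  # merge every component sharing a null with atom
            if cn & ns:
                ns, atoms = ns | cn, atoms | ca
            else:
                rest.append((cn, ca))
        comps = rest + [(ns, atoms)]
    return [frozenset(a) for _, a in comps]
```

On the example, `split_block({("P", "N1"), ("R", "N2")})` returns the two singleton blocks, whereas the original block \(\{P(N_1), Q(N_1,N_2), R(N_2)\}\) stays in one piece.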

Assuming that \(\Pi ({\mathfrak {D}}{\setminus } I)\) has been computed, and referring to line 2 of Algorithm 3, the only blocks to process are those that do not belong to \(\Pi ({\mathfrak {D}})\), which in general shortens the execution of the loop.

9 Related work

Semantics for nulls Incomplete data in the relational model have been the subject of much work. In Reiter [49], we find the proposal of a first-order interpretation of the null value, together with query answering algorithms. At the same time, proposals on relational theory emerged. In particular, the landmark paper [31] introduced the notion of Table (a relation containing nulls) and of representation system (detailing two of them considered of ‘practical interest’). Assuming that a table contains only constants and variables (representing missing information), the first representation system deals with Codd Tables [9], where variables occur at most once. However, it has been noticed in Imielinski and Lipski Jr. [31] that ‘no representation system based on Codd Tables can support join and projection at a time’. The second representation system is based on naive Tables, allowing different marked variables (or nulls) to occur more than once. It is shown in Imielinski and Lipski Jr. [31] that this system supports arbitrary conjunctive queries. A third representation system, based on the idea of conditional tables and considered mainly of theoretical interest, is also described in Imielinski and Lipski Jr. [31]. Since then, considerable work has been done on querying incomplete databases as, for instance, in Grahne [22], Zaniolo [56], Fagin et al. [17] and Libkin [38].

Updating incomplete data Updates on incomplete databases have drawn less attention than queries. We mention (i) the work of Abiteboul and Grahne based on conditional tables (see [1]); (ii) the work in Fagin et al. [18], which discusses the fact that the result of an update depends on the way the logical database is constructed; and (iii) the survey in Winslett [55], which offers a clear distinction between model-based and formula-based approaches to updating logical theories.

However, interest in updates on incomplete data sources is now stemming from needs in the semantic web domain [12]. Work such as Nikolaou and Koubarakis [44] extends to the RDF world concepts introduced in Imielinski and Lipski Jr. [31] and explores SPARQL query evaluation. Moreover, when De Giacomo et al. [11] propose an ontology update management, they seem to revive previous proposals such as Halfeld Ferrari et al. [26]. Indeed, when dealing with different rule levels and when proposing semantics tolerant of inconsistencies coming from the intentional level, the relation with the exceptions proposed in Halfeld Ferrari et al. [26] is undeniable. The problem of updating an incomplete database also concerns when and how null values should be replaced by the new data coming with the updates. Part of the problem is connected to the core computation. In Halfeld Ferrari and Laurent [28], we dealt with a non-null database, a situation which corresponds to the case \(\delta _{max}=0\). The differences between update and revision are another important aspect to be considered. (We refer to Hansson [29] for an overview.) These differences are the consequence of different views of the problem and influence the semantics of changes of each particular proposal.

Tuple-generating dependencies (TGDs) generalize the commonly used foreign key constraints, allowing one to express restrictions found in semantic web or graph database domains. But considering TGDs increases expressiveness at the cost of some difficulties. The first one concerns the computation of the semantics, which has been addressed through a chase procedure (see [45] for a survey). Work, as in Fagin et al. [16], Gottlob [20], Fagin et al. [15] and Pichler and Skritek [47], considers the use of the chase procedure in data exchange. Furthermore, in an update context, the chase procedure is associated with the generation of side effects, that is, extra insertions or deletions (with respect to those required by the user) needed to preserve consistency. Clearly, constraints are expected not only to be inherently consistent (e.g. a set of constraints generating contradictory side effects for the same update u is not acceptable) but also to avoid contradicting the original intention of the user’s update. (For example, if the user asks for the deletion of f, the insertion of f is not an acceptable side effect.) The existence of non-repairable transitions (i.e. with side effects that invalidate the required update) is the focus of Schewe and Thalheim [50], which establishes necessary and sufficient conditions for a set of constraints to work correctly. In Halfeld Ferrari et al. [26], we can find results on the detection of inconsistency in a set of simpler constraints, such as those used in Halfeld Ferrari and Laurent [28]. In both cases, the authors use paths in a graph (defined from the constraints) to detect inconsistency. The theory of consistency enforcement in databases has been the subject of Link [39], Link and Schewe [40] and Schewe and Thalheim [51] as well. Another difficulty is that more expressive constraints represent a barrier to update determinism. In Halfeld Ferrari and Laurent [28] and Halfeld Ferrari et al. [26], determinism is guaranteed because constraints have only one atom in the body and in the head, while in Flouris et al. [19], determinism is possible thanks to a total ordering, which imposes arbitrary choices.

In the current approach, the determinism of deletions is ensured by distinguishing one atom in the body of each constraint. Notice, however, that this could be generalized so as to allow the deletion of more than one atom. Our solution is a practical answer to an update scenario where users want to juggle different possibilities, namely: (a) expressing constraints richer than those usually allowed by keys and foreign keys; (b) ensuring an automatic maintenance of the database consistency; and (c) not wasting time in considering different possible new consistent database instances.

In fact, our solution corresponds to a very popular way of specifying updates in relational databases, known as ‘active rules’. Indeed, a constraint such as \(p^-, q \Rightarrow r\) can be seen as the active rule: when r is deleted, if p and q are in the database, then delete p. Active rules have been widely studied in the literature (e.g. in Schewe and Thalheim [50]), and it is well known that they do not satisfy some suitable properties such as convergence, independence from the application order (which is related to determinism), minimal change, monotonicity, etc. Clearly, there is a price to pay for having deterministic deletion as a practical solution for users.

Minimal data instance When applying the chase over TGDs to generate new data, one should naturally consider universal solutions (those having homomorphisms into every possible solution). They are identified in Fagin et al. [15], while Fagin et al. [16] and Gottlob [20] consider the problem of finding the smallest one (the core). Updates are not considered, except in Pichler and Skritek [47], which studies the complexity of the TGD checking problem, including its update variant.

Positioning our work Our approach adopts the FOL formalism introduced in Reiter [49] (which, in fact, is a different way of modelling naive tables) to deal with updates. We extend our previous work in Halfeld Ferrari and Laurent [28] and Halfeld Ferrari et al. [27] by dealing both with an incomplete database and with linear tuple-generating dependencies (TGDs) without restrictions. Determinism of deletion is ensured by a pre-established choice giving priority to one side effect (marked in the constraint’s body). Although, in our approach, the behaviour of a deletion depends on the non-existence of side effects invalidating it, the problem of detecting such behaviour in a set of constraints is outside the scope of this paper. We conjecture that the approaches proposed in Schewe and Thalheim [50] and Schewe and Thalheim [51] can be adapted to our context, together with their results concerning the conditions and the complexity of performing such tests. As updates yield side effects which may involve nulls, we propose two ways of avoiding instances with too many nulls generated from other nulls, namely: the possibility of fixing a pre-defined depth for such null generation, and the strategy of keeping the instance as small as possible by applying the core computation. In this context, our updates are performed by taking into account marked null values (which can be stored). Notice that our results can be related to work on repair checking (such as Afrati and Kolaitis [2]), because the updated database can be seen as a repair of an inconsistent database obtained by performing only the updates in \({\textsf {dRequest}} \) and \({\textsf {iRequest}} \). However, our minimal change conditions are not necessarily considered in those papers.

Our update strategy, applied to the RDF world, is different from proposals such as Ahmeti et al. [3], Chirkova and Fletcher [8], Gutierrez et al. [24] and Lösch et al. [42]. Although some of them discuss non-determinism, all these approaches consider constraints as inference rules (as in Gottlob et al. [21], Lausen et al. [36], Motik et al. [43] and Patel-Schneider [46]), whereas we deal with constraints from a traditional database viewpoint, as in Flouris et al. [19]. (But the latter does not deal with null values.) Constraints are taken into account in the context of RDF technologies such as ShEx [52], SPIN [33] and SHACL [34]. However, their focus is on schemas, whereas ours is on integrity constraints. Stardog [53] deals with constraints which are closer to ours. Finally, it is worth noting that we consider updates as changes in the world and not as a revision of our knowledge of the world [4, 29, 32]. However, as our model does not include any explicit representation of time, it is possible to relate the results of our method to the core principles of Belief Revision. Indeed, our deletion and insertion operations are similar to contraction and revision in Belief Revision, and are guided by precepts corresponding to the principles of closure, success, inclusion, consistency and vacuity (see [4]).

10 Concluding remarks

In this paper, we present a formal semantics for updates of incomplete databases. Although this problem has motivated many research efforts during the last decades, no consensus exists. By proposing a generic framework based on FOL, we expect our approach to be able to deal with many different contexts, including RDF data, where blank nodes are the cause of many issues.

Indeed, new applications on linked data (such as urban transportation systems [7, 14, 25]) need to ensure the trustworthiness and quality of the information they provide. In such a context, updating and consistency maintenance of RDF data is an increasingly required task. The update strategies and properties established in this paper contribute to the development of tools capable of dealing with data evolution in RDF databases where nulls (blank nodes) are marked as semantically connected.

In our approach, updates are performed in a deterministic way, without any arbitrary choice, while preserving consistency with respect to TGDs and while ensuring that useless incompleteness is discarded. A possible drawback of our approach is that the complexity of our update processing is high in theory, due to core computations. However, our paper outlines possible optimizations that would lead to a tractable implementation. Indeed, a prototype of our method has been implemented and is publicly available [10]. This initial in-memory version allows testing and is a proof of concept setting the basis for the next step, where the prototype will be enhanced to deal with standard databases. Our future work includes enhancing the expression power of our constraints towards a more general framework.