1 Introduction

Many authors have made proposals to model and handle relational databases involving uncertain data. In particular, the last two decades have witnessed a blossoming of research on this topic (cf. [29] for a survey of probabilistic approaches). Even though most of the literature about uncertain databases uses probability theory as the underlying uncertainty model, some approaches rather rest on possibility theory [30]. The initial idea of applying possibility theory to this issue goes back to the early 1980’s [26, 27]. This came shortly after the introduction of the idea of a “fuzzy database”, for which various proposals were made, ranging from fuzzy relations (thus having weighted tuples) to ordinary relations with tuples of fuzzy values (represented by fuzzy sets), or more simply with tuples of weighted values. These different views, developed by several authors, did not necessarily refer to possibility theory; see [4] for references. Since then, several possibilistic representations have been introduced, and it is useful to clarify their respective roles.

As we will discuss in Sect. 3, the possibilistic framework constitutes an interesting alternative to the probabilistic one, notably because of its qualitative nature. In this paper, we provide a survey of different ways of modeling uncertain data with possibility theory. The remainder is structured as follows. In Sect. 2, we recall some notions about uncertain databases and their interpretation in terms of possible worlds. Section 3 is devoted to a presentation of four possibilistic database models, with different levels of expressiveness. Section 4 discusses a specific topic where uncertain data management can play a role, namely data cleaning. Section 4.2 points out a sample of issues deserving further investigation. Finally, Sect. 5 concludes the paper and outlines some short-term research perspectives.

2 About Uncertain Databases and Possible Worlds

In the context of uncertain databases, two kinds of uncertainty are considered: tuple-level uncertainty (where the existence of some tuples in a relation is uncertain, i.e., more or less probable/possible) and attribute-level uncertainty (where some attribute values in some tuples may be ill-known or uncertainly known). The latter case can be seen as more general than the former, since a tuple involving uncertain attribute values may be translated into a set of mutually exclusive uncertain tuples (involving only ordinary attribute values). An attribute value represented as a disjunctive weighted set can be interpreted as a probability distribution or a possibility distribution, depending on the underlying uncertainty model considered. From a semantic point of view, an uncertain database D can be interpreted as a set of usual databases, called possible worlds \(W_1\), ..., \(W_p\), and the set of all interpretations of D is denoted by rep(D) = \(\{W_1\), ..., \(W_p\}\). Any world \(W_i\) is obtained by choosing a value in each disjunctive set appearing in D. One of these (regular) databases is supposed to correspond to the actual state of the universe modeled. The assumption of independence between the sets of candidates is usually made, and then any world \(W_i\) corresponds to a conjunction of independent choices; thus the probability (resp. possibility) degree associated with a world is computed using a conjunction operator, namely the product (resp. “min”).
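To make the possible-worlds semantics concrete, here is a minimal Python sketch, where the relation, attribute names, and degrees are all invented for the example. It expands a single uncertain tuple into its worlds and computes the degree of each world with “min” (the possibilistic reading); under the same independence assumption, replacing min by the product would give the probabilistic reading.

```python
from itertools import product as cartesian

# An uncertain tuple: each attribute value is a disjunctive weighted set,
# i.e., a dict mapping candidate values to their possibility degrees.
uncertain_tuple = {
    "City": {"Paris": 1.0, "Lyon": 0.7},
    "Age":  {31: 1.0, 32: 0.5},
}

attrs = list(uncertain_tuple)
candidate_sets = [uncertain_tuple[a].items() for a in attrs]

# Each world picks one candidate per attribute (independence assumption);
# its possibility is the min of the chosen degrees.
# (A probabilistic reading would use the product of the degrees instead.)
for choice in cartesian(*candidate_sets):
    values  = {a: v for a, (v, _) in zip(attrs, choice)}
    degrees = [d for (_, d) in choice]
    print(values, "possibility =", min(degrees))
```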

When processing a query, a naive approach would be to make all the interpretations of D explicit in order to query each of them. Such an approach is intractable in practice, and it is of prime importance to find a more realistic alternative. To this end, the notion of a representation system was introduced by Imielinski and Lipski [14]. The basic idea is to represent both initial tables and those resulting from queries in such a way that the representation of the result of a query q against any database D, denoted by q(D), is equivalent (in terms of worlds) to the set of results obtained by applying q to every interpretation of D, i.e.: \(rep(q(D)) = q(rep(D))\) where \(q(rep(D)) = \{q(W) \,|\,W \in rep(D)\}\). If this property holds for a representation system \(\rho \) and a subset \(\sigma \) of the relational algebra, \(\rho \) is called a strong representation system for \(\sigma \). From a querying point of view, this property enables a direct (or compact) evaluation of a query q, which then applies to D itself without making the worlds explicit.
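As an illustration of this property, the following toy sketch (the data structures and helper functions are ours, not from [14]) checks \(rep(q(D)) = q(rep(D))\) for a projection over a small or-set database:

```python
from itertools import product as cartesian

# Toy or-set database: each tuple is a list of disjunctive candidate sets.
D = [[{"Paris", "Lyon"}, {75}],
     [{"Nice"}, {6, 83}]]

def worlds(db):
    """All regular databases obtained by choosing one candidate per or-set."""
    per_tuple = [list(cartesian(*t)) for t in db]
    return {frozenset(w) for w in cartesian(*per_tuple)}

def project(db, cols):
    """Compact projection, applied directly to the or-set representation."""
    return [[t[c] for c in cols] for t in db]

# rep(q(D)), computed compactly on the representation ...
lhs = worlds(project(D, [0]))
# ... versus q(rep(D)), computed world by world.
rhs = {frozenset(tuple(t[c] for c in [0]) for t in W) for W in worlds(D)}

assert lhs == rhs  # the two viewpoints coincide for projection
```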

3 Possibilistic Uncertainty

We first recall some distinctive features of possibility theory before reviewing the different possibilistic representations.

3.1 Possibility Theory

Possibility theory departs from probability theory in several respects. Possibility theory involves two dual set functions: the possibility \(\varPi \) and the necessity N, such that \(N(A)= 1-\varPi (\bar{A})\), while probability is self-dual, namely \(P(A) = 1-P(\bar{A})\). This provides room for modeling epistemic uncertainty, including total ignorance. Indeed, \(\varPi (A)=1\) does not prevent having \(\varPi (\bar{A})=1\) as well in case of complete ignorance about A (while \(P(A)=P(\bar{A}) (= 1/2)\) does not distinguish situations of genuine equiprobability from situations where, due to ignorance, one applies the Insufficient Reason Principle). \(\varPi \) (and N) are associated with a possibility distribution \(\pi \), defined from a universe U to a scale such as \([0,\,1]\), where \(\forall A\subseteq U, \varPi (A)= \max _{u\in A} \pi (u)\). Due to the use of max and min operations, possibility and necessity functions are more “qualitative” than the probabilistic models involving sum and product.
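The following small sketch (universe and degrees invented) implements these two set functions and exhibits the duality \(N(A)= 1-\varPi (\bar{A})\):

```python
# A possibility distribution on a small universe (degrees are illustrative).
pi = {"red": 1.0, "green": 0.6, "blue": 0.3}
U = set(pi)

def Pi(A):                     # possibility: Pi(A) = max over u in A of pi(u)
    return max(pi[u] for u in A)

def N(A):                      # necessity, by duality: N(A) = 1 - Pi(not A)
    complement = U - A
    return 1 - Pi(complement) if complement else 1.0

A = {"red", "green"}
print(Pi(A), N(A))             # 1.0 and 0.7
# Total ignorance: pi identically 1 gives Pi(A) = Pi(not A) = 1 for any A.
```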

Still, possibility theory may be quantitative or qualitative [8]. In the first case, the whole scale \([0,\,1]\) is used, and possibility and necessity may be thought of as upper and lower bounds of an unknown probability (then conditioning is based on the product rather than “min”). However, possibility theory does not require the use of the scale \([0,\,1]\): it can be defined on any linearly ordered chain (e.g., a finite subset of \([0,\,1]\) including 0 and 1), or more generally any lattice, and is then qualitative. Moreover, possibility theory has a logical counterpart, namely possibilistic logic [6] (which involves only lower bounds of necessity degrees, which can be viewed as certainty levels), and generalized possibilistic logic [11] (which involves both set functions). Besides, two other set functions are of interest in possibility theory, namely the guaranteed possibility \(\varDelta (A)= \min _{u\in A} \delta (u)\) and its dual set function, where \(\delta \) is a possibility distribution. In bipolar representations [9], one uses a pair of possibility distributions \((\delta , \pi )\) to distinguish between values u such that \(\pi (u) =0\), which are excluded, and values \(u'\) such that \(\delta (u')>0\), which are guaranteed to be possible to some extent (since, e.g., they were observed), assuming the consistency condition \(\delta \le \pi \) (expressing that what is guaranteed to be possible cannot be excluded).
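Along the same lines, here is a sketch of the guaranteed possibility and of the bipolar consistency condition \(\delta \le \pi \) (degrees again invented for the illustration):

```python
# Bipolar pair (delta, pi): delta = guaranteed possible, pi = not excluded.
pi    = {"red": 1.0, "green": 0.6, "blue": 0.0}   # blue is excluded
delta = {"red": 0.8, "green": 0.0, "blue": 0.0}   # red was actually observed

def Delta(A):                  # guaranteed possibility: min over A of delta
    return min(delta[u] for u in A)

# Consistency condition delta <= pi: what is guaranteed cannot be excluded.
assert all(delta[u] <= pi[u] for u in pi)
print(Delta({"red"}))          # 0.8: red is guaranteed possible to degree 0.8
```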

3.2 Possibilistic Representations

There is not a unique possibilistic data model; the existing models serve different purposes. From the least to the most expressive, one can distinguish four possibilistic models for uncertain data that have actually been proposed:

  • databases with layered tuples;

  • tuples involving certainty-qualified attribute values;

  • tuples involving attribute values restricted by possibility distributions;

  • possibilistic c-tables.

Layered Tuples. The idea here is simply to provide a complete ordering of the tuples in the database according to the more or less strong confidence we have in their truth. This can be easily encoded by associating a possibility level with each tuple. The result is a layered database: all the tuples having the same degree are in the same layer (and only those). Tuples having a possibility level equal to 1 may also be associated with a certainty level equal to 1, while the others, with a possibility level strictly less than 1, are not certain at all; this means that any possible world database contains all the tuples at level 1, while the other tuples may or may not be present in a particular possible world; see [17] for details. This modeling is not very expressive, since it provides no indication of which attribute values in a tuple are particularly uncertain. In that respect, it may be considered too poor a modeling from a querying perspective. Still, it has been shown useful for design purposes, by providing a setting for attaching certainty levels to functional dependencies (FDs) (through a duality relation with the possibility levels of the tuples violating the FDs). This enables the generalization of Armstrong’s axioms with attached certainty levels, and the extension of Boyce-Codd/third normal form approaches to database design in the presence of uncertain tuples, by taking advantage of the levels [18]. Such a possibilistic model is also useful for handling keys [15] and cardinality constraints [12, 28] in the presence of uncertain data.
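As an illustration, the sketch below enumerates the possible worlds of a small layered database. The data are invented, and the way the possibility of a world is computed (the minimum level of the uncertain tuples it contains) is our assumption, consistent with the semantics recalled above; see [17] for the exact definitions.

```python
from itertools import chain, combinations

# Layered database: each tuple carries a possibility level (illustrative).
layered = {("Ann", "Paris"): 1.0, ("Bob", "Lyon"): 0.8, ("Carl", "Nice"): 0.4}

sure      = [t for t, lv in layered.items() if lv == 1.0]
uncertain = [t for t, lv in layered.items() if lv < 1.0]

# Every world contains all level-1 tuples; uncertain tuples may be absent.
# Assumption made here: the possibility of a world is the min level of the
# uncertain tuples it actually contains (1 when it contains none of them).
for subset in chain.from_iterable(
        combinations(uncertain, k) for k in range(len(uncertain) + 1)):
    world = sure + list(subset)
    poss = min((layered[t] for t in subset), default=1.0)
    print(sorted(world), "possibility =", poss)
```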

Certainty-Qualified Attribute Values. In this model [23], attribute values (or disjunctions thereof) are associated with a certainty level (which is a lower bound of the value of a necessity function). This amounts to associating each attribute value with a simplified type of possibility distribution restricting it (namely, a value a qualified with certainty \(\alpha \) corresponds to the distribution where a has possibility 1 and any other candidate value has possibility \(1-\alpha \)). Different attributes in a tuple may have different certainty levels associated with their respective values. Then a tuple may be associated with a certainty level, which is the minimum of the certainty levels associated with the attribute values of the tuple, in agreement with the minitivity of necessity functions. Still, this global certainty level should not be confused with the possibility level of the previous approach. In terms of possible worlds, a tuple associated with such a certainty level corresponds to several tuples with a possibility level. Indeed, consider the simple example of a tuple made of two attribute values a and b, associated respectively with certainty levels \(\alpha \) and \(\beta \): this yields as possible worlds \(\langle a,\,b\rangle \) with possibility 1, \(\langle a',\, b\rangle \) with possibility \(1 -\alpha \), \(\langle a,\, b'\rangle \) with possibility \(1 -\beta \), and \(\langle a',\, b'\rangle \) with possibility \(\min (1 -\alpha , 1 -\beta )\), where \(a'\) (resp. \(b'\)) is any value distinct from a (resp. b) in the attribute domain to which a (resp. b) belongs.
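The following lines simply replay this worked example in Python (the values and degrees are those of the example, with illustrative numbers for \(\alpha \) and \(\beta \)):

```python
# The two-attribute example from the text: value a certain to degree alpha,
# value b certain to degree beta; a2/b2 stand for "any other value".
alpha, beta = 0.8, 0.6
a, b, a2, b2 = "a", "b", "a'", "b'"

worlds = {
    (a,  b):  1.0,                       # the stated values: fully possible
    (a2, b):  1 - alpha,                 # a wrong: possible to 1 - alpha
    (a,  b2): 1 - beta,                  # b wrong: possible to 1 - beta
    (a2, b2): min(1 - alpha, 1 - beta),  # both wrong
}
for w, poss in worlds.items():
    print(w, "possibility =", poss)

# Certainty of the whole tuple = min of attribute certainties (minitivity).
print("tuple certainty =", min(alpha, beta))
```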

This model has some advantages with respect to querying: (i) it constitutes a strong representation system for the whole relational algebra (up to some minor restrictions); (ii) it does not require the use of any lineage mechanism, and the query complexity is close to the classical case; (iii) the approach seems more robust to small changes in the values of the degrees than a probabilistic handling of uncertainty (see the last section of [23]). Moreover, there exists a simplified version of this model (see [24]) that uses a scale with only three certainty levels (“completely certain”, “somewhat certain”, “not at all certain”), which makes the assessment of certainty particularly easy. Besides, another approach with the same formal type of modeling, but where certainty is evaluated in terms of subsets of sources (together with their reliability levels), makes it possible to rank-order the answers to a query on such a basis as well [22].

Attribute Values Restricted by General Possibility Distributions. In this “full possibilistic model” [3], any attribute value can be represented by an arbitrary possibility distribution. Moreover, representing the result of some relational operations (in particular the join) in this model requires expressing dependencies between candidate values of different attributes in the same tuple, which leads to the use of nested relations. In [3], it is shown that this model is a strong representation system for selection, projection, and foreign-key join only. The handling of the other relational operations requires the use of a lineage mechanism, as in the probabilistic approaches. This model makes it possible to compute not only the more or less certain answers to a query (as in the previous model), but also the answers that are only possible to some extent.
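As an illustration of possible versus certain answers, here is a small sketch (attribute, condition, and degrees invented) evaluating a selection condition against a value restricted by a general possibility distribution:

```python
# Full possibilistic model: an attribute value is any possibility distribution.
# Query: "Age >= 33" on one tuple whose Age is ill-known (degrees invented).
age = {30: 0.4, 33: 1.0, 35: 0.7}

# Possibility that the tuple satisfies the condition: max over matching values.
Pi_answer = max((d for v, d in age.items() if v >= 33), default=0.0)
# Necessity (certainty): 1 - possibility of the condition being violated.
N_answer = 1 - max((d for v, d in age.items() if v < 33), default=0.0)

print(Pi_answer, N_answer)   # 1.0 and 0.6: possible answer, somewhat certain
```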

Possibilistic c-tables. This model is outlined in [25]. The possibilistic extension of c-tables preserves all the advantages of classical c-tables (for expressing constraints linking attribute values), while the attribute values are restricted by any kind of possibility distribution. This model generalizes the two previous ones. In fact, possibilistic c-tables, like probabilistic c-tables, can be encompassed in the general setting of the semiring framework proposed by Val Tannen et al.

4 Data Cleaning

This section first provides a brief overview of two approaches that respectively (i) make it possible to query inconsistent databases and (ii) take advantage of a possibilistic modeling for cleaning the data, before suggesting new lines of research.

4.1 Some Existing Approaches

In the presence of inconsistent data, two points of view may be adopted. The first one consists in cleaning the database so as to make it consistent, either by means of an automated process [13] or by an interactive approach. The second one, exemplified by Consistent Query Answering (CQA) approaches [2], takes the inconsistencies into account at query processing time.

An approach corresponding to this second line of thought is described in [21]. It aims at warning the user about the presence of suspect answers in a selection query result, in the context of a classical database (that may include data inconsistent with some functional dependencies). Roughly speaking, the idea is that such elements can be identified inasmuch as they can also be found in the result of associated negative queries. The notion of a suspect answer can be refined by introducing some gradedness in terms of cardinality (the number of functional dependency violations in which the tuple is involved) or similarity (by relaxing the equality constraint of a functional dependency into an approximate equality). However, this approach does not, for the moment, involve any uncertainty degree associated with attribute values or tuples. In other words, it handles only inconsistency, not uncertainty.
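To fix ideas, here is a toy sketch, not the actual algorithm of [21], that flags as suspect the answers of a selection query coming from tuples involved in at least one violation of a functional dependency (the relation and the FD are invented):

```python
from itertools import combinations

# Classical relation (Name, City, Zip) with FD Zip -> City (data invented).
rows = [("Ann", "Paris", 75001), ("Bob", "Lyon", 75001), ("Carl", "Nice", 6000)]

# Count, for each tuple, the FD violations in which it is involved.
violations = {r: 0 for r in rows}
for r1, r2 in combinations(rows, 2):
    if r1[2] == r2[2] and r1[1] != r2[1]:   # same Zip, different City
        violations[r1] += 1
        violations[r2] += 1

# Answers to a selection query; an answer is flagged suspect when the tuple
# it comes from also appears in the "negative" result (FD-violating tuples).
answers = [r for r in rows if r[2] == 75001]
for r in answers:
    status = "suspect" if violations[r] > 0 else "safe"
    print(r, status, "violations:", violations[r])
```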

A possibilistic approach to data cleaning has been proposed recently in [16]. This approach belongs to the research trend aimed at restoring a form of consistency in the database. Still, the approach identifies tuples that are suspect or even fraudulent, and does so independently of any particular query. It relies on a model closely related to the layered-tuple-based model reviewed above. However, the model is used in the reverse way, since one starts with certainty-valued constraints (called business rules) from which the confidence levels associated with the tuples are computed (on a qualitative scale: “normal”/“suspect”/“fraud”), by solving a minimal possibilistic vertex cover problem (taking into account the number of violations in which the tuples are involved). Here, it is the possibility levels of the tuples that are revised in order to restore (graded) consistency.
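The sketch below only conveys the flavour of this idea: it builds a violation graph and computes a greedy vertex cover (not the minimal possibilistic cover of [16]), then labels the covered tuples; the graph, thresholds, and labeling rule are all invented for the illustration.

```python
# Violation graph: nodes are tuple ids, edges join pairs of tuples that
# jointly violate some business rule (graph is illustrative).
nodes = ["t1", "t2", "t3", "t4", "t5", "t6"]          # t6 violates nothing
edges = [("t1", "t2"), ("t1", "t3"), ("t4", "t5")]

# Greedy vertex cover: repeatedly pick the node covering the most edges.
remaining, cover = list(edges), set()
while remaining:
    counts = {}
    for u, v in remaining:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    best = max(counts, key=counts.get)
    cover.add(best)
    remaining = [e for e in remaining if best not in e]

# Invented labeling rule: covered tuples are blamed, more so when they are
# involved in several violations; uncovered tuples are deemed normal.
degree = {n: sum(n in e for e in edges) for n in nodes}
for n in nodes:
    if n in cover:
        label = "fraud" if degree[n] > 1 else "suspect"
    else:
        label = "normal"
    print(n, label)
```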

4.2 Some Issues Deserving Further Investigations

A first extension one may think of is to introduce certainty degrees into the first of the two approaches reviewed in the preceding section (reference [21]). This means extending the querying method, which keeps the data as they are and indicates which answers are suspect, to the setting of the certainty-based model described in Sect. 3. In the original model, an answer is suspect as soon as there exists a repair (w.r.t. a functional dependency) of the query result to which it does not belong. In the extended context, the notion of repair naturally becomes graded, as does the concept of suspiciousness (now assessed both in terms of the certainty degrees attached to the values of the tuples concerned and in terms of the number of functional dependencies violated by the tuples).

Another interesting issue is to unify the above view with the possibilistic approach to data cleaning reviewed in the previous section (reference [16]). We can observe that, although the outputs of the two approaches are quite similar (tuples assigned a certainty degree expressing different levels of suspiciousness), the inputs are completely different: in one case, constraints with certainty levels; in the other, attribute values with certainty levels. However, it seems clear that the approach of [21] could also be extended by introducing functional dependencies with certainty levels while keeping all of the attribute values completely certain (rather than the opposite, as suggested in the paragraph above). This would make the two approaches easier to compare.

5 Conclusion

In this brief survey, we have tried to make clear that there exist different possibilistic models, with different levels of expressiveness, but also dedicated to different database tasks (design, data cleaning, querying). Other issues worth mentioning are the modeling of null values [1] and the extrapolation of missing data [5]. Two kinds of tasks are, in our opinion, particularly worth investigating: (i) a practical comparison of the certainty-based model (which offers a rather good simplicity/expressiveness compromise) with probabilistic approaches; (ii) the comparison of, and the cooperation between, possibilistic data cleaning tools and probabilistic ones. Another line of thought which, we think, might be of interest is to consider causality issues for evaluating responsibility for inconsistencies, for which probabilistic AI models have been considered from a database perspective [19], while possibilistic counterparts of these AI models also exist [10].