Keywords

1 Introduction

Querying imperfect databases received a lot of importance recently with the emergence of domains like sensor networks, data cleaning, recommendation and recommender systems, etc. Indeed, information generated from this type of applications is obviously pervaded with imperfection (uncertainty, imprecision, ignorance...). That is why, database models that handle imperfect data were introduced (Probabilistic, possibilistic and evidential databases [1, 3, 5, 6, 16]). The latter, models several types of imperfect data but also perfect information using theory of belief functions. In database management, querying is a fundamental step. As consequence, multiple types of ‘imperfect’ queries were introduced. We name the evidential skyline [9], the extended relational queries [1] and the evidential Top-k queries [4].

In general, Top-k queries are needed in real world applications. For an example, movies, music and books are ordered by the preferred ones, researchers by their H-index, etc. Imperfect top-k queries can be very challenging when it comes to their semantics but also when it comes to their practical implementation.

In this paper, we present two algorithms of evidential Top-k queries: The first named NaiETop-k is based on the computation of the preference degrees (called evidential scores) as introduced in [18] and adapted in [4]. The second named OptETop-k is based on an optimized version of the preference degree calculation justified by the complementarity property as detailed in [4]. The proposed implementation allows the ranking of all evidential scores and finally, it provides the k most interesting results among all rank-ordered answers.

Table 1 is an example of an evidential table that stores some users’ preferences about books: \(b_1, b_2, b_3, b_4\). This relation includes three attributes: The ID which is a unique reader identifier. The BookRate that includes the reader’s appreciations about one book and/or several books modeled through the belief functions theory (in this context only few researches addressed the issue of preference elicitation using this theory [2, 11]). The CL which is a specific attribute to evidential databases that stores intervals of confidence about user’s responses.

Table 1. Books appreciations’ table: EDB

This paper is organized as follows: we recall, in Sect. 2 some basic concepts about the belief functions theory and evidential databases. In Sect. 3, we remind needs and challenges of evidential Top-k queries and we present the mathematical materials to compute and compare Evidential Scores and Preference Degrees. Section 4 is dedicated to the presentation of proposed algorithms. Experiments and results are shown in Sect. 5. Section 6 is devoted to the conclusion and the future works.

2 Evidence Theory and Evidential Databases

Evidence theory named the belief functions theory or the Dempster-Shafer theory [7, 8, 17], is a powerful tool to model ignorance and to represent uncertain, imprecise and inconsistent information.

In the theory of belief functions, a set \(\varTheta = \{ \theta _1, \theta _2,\ldots , \theta _n\}\) is a finite, non empty and exhaustive set of n elementary and mutually exclusive hypotheses related to a given problem. The set \(\varTheta \) is called the frame of discernment or universe of discourse.

The power set \(2^{\theta }= { \{\varnothing , \theta _1, \theta _2,\ldots ,\theta _n,\{\theta _1, \theta _2\},\ldots ,\{\theta _1, \theta _2,\ldots , \theta _n \} \}}\) is the set of all subsets of \(\varTheta \).

A mass function, noted m, is a mapping from \(2^\varTheta \) to the interval [0, 1]. The basic belief mass of an hypothesis x is noted m(x), it represents the belief on the truth of that hypothesis x. A mass function is also called basic belief assignment (bba). It is formalized such that:

$$\begin{aligned} \displaystyle {\sum _{x\subseteq \varTheta }}m^{\varTheta }(x)=1 \end{aligned}$$
(1)

If \(m^\varTheta (x)>0\), x is called focal element. The set of all focal elements is denoted F and the couple \(\{F,m\}\) is called body of evidence.

The belief function, denoted bel, is the minimal degree of support committed exactly to x such that:

$$\begin{aligned} bel(x) = \displaystyle {\sum _{y\subseteq x; y \ne \varnothing } m^\varTheta (y)} \end{aligned}$$
(2)

The plausibility function, denoted pl, is the maximal degree of support committed exactly to x such that:

$$\begin{aligned} pl(x)= \sum _{y \subseteq \varTheta ; x \cap y \ne \varnothing } m^\varTheta (y) \end{aligned}$$
(3)

An evidential database, denoted EDB, stores different types of data using the belief functions theory as shown in Table 2.

Table 2. The different types of information modeled in the evidential database

Definition 1

[Compact Evidential Database]

An EDB has N objects and A attributes. An evidential value, noted \(V_{la}\), is the value of an attribute a (\(1\le a\le A\)) for an object l (\(1\le l\le N\)) that represents a basic belief assignment.

$$\begin{aligned} V_{la} : 2^{\varTheta _{a}} \rightarrow [0,1] \end{aligned}$$
(4)
$$\begin{aligned} with \,\,m^{\varTheta _{a}}_{la}(\varnothing ) = 0 \quad and \quad \displaystyle {\sum _{x \subseteq \varTheta _a}m^{\varTheta _{a}}_{la}(x)=1} \end{aligned}$$
(5)

The set of focal elements relative to the bba \(V_{la}\) is noted \(F_{la}\) such that:

$$\begin{aligned} F_{la} = \{x \subseteq \varTheta _a / m_{la}(x) > 0\} \end{aligned}$$
(6)

A confidence level, CL, is a specific attribute that includes intervals. Each one represents the confidence about its object l in the evidential database. The confidence level is a pair of belief and plausibility [bel; pl] reflecting the pessimistic and the optimistic degrees of support about each object’ existence in the database [1, 14, 15].

Multiple types of queries can be applied over an EDB like the extended relational operators (select, project, join...) [1, 14, 15], skyline queries [9, 10] and ranking queries [4].

3 Evidential Top-k Querying

Top-k queries represent a mighty tool to order queries’ results and give only the most interesting answers. Top-k queries were firstly introduced in the multimedia systems [12, 13]. They use a score function to rank answers where only results with the highest scores are returned.

Evidential Top-k queries, denoted ETop-k, rank answers using an evidential score function and return the most interesting ones (with the highest scores). Contrary to usual top-k queries that give a ranking based on a score function with precise values, the ETop-k queries give answers based on a score function with intervals. The latter reflect the minimal and the maximal amounts of confidence about each answer.

Definition 2

[Evidential Score]

Let \(R_i\) be a response generated from processing a query Q over an evidential database EDB of a size N and let \(S(R_i)\) be the score function of that answer \(R_i\) and \(bel(R_i)\) and \(pl(R_i)\) are respectively its belief and plausibility in the table, such that:

$$\begin{aligned} S(R_i)= [ bel(R_i); pl(R_i)] \end{aligned}$$
(7)
$$\begin{aligned} where \quad bel(R_i) = \dfrac{ \sum ^N_{l=1} bel_l(R_i) * bel_l}{N} \end{aligned}$$
$$\begin{aligned} pl(R_i) = \dfrac{ \sum ^N_{l=1} pl_l(R_i) * pl_l}{N} \end{aligned}$$

The belief of an answer, \(bel(R_i)\), is a disjunction of the response’s beliefs in each object of the database. The belief of a response in one object l, denoted \(bel_l\), is the product of its belief in the attribute and the belief of that object. Same for the plausibility of an answer, \(pl(R_i)\). It is the disjunction of the response’s plausibilities in each object of the database where the plausibility of a response in one object l, denoted \(pl_l\) is the product of its plausibility in the attribute and the plausibility of that object [1, 14].

Example 1

The Top-k query processed over the evidential database of Table 1 is the following [4]:

Q: SELECT BookRate FROM EDB ORDER BY S(BookRate) LIMIT k;

Four possible responses are computed using the evidential score as detailed in Definition 2:

  • S(\(b_1\)) = [bel(\(b_1\)); pl(\(b_1\))] = [0,0375; 0.325]

  • S(\(b_2\)) = [bel(\(b_2\)); pl(\(b_2\))] = [0,0375; 0.525]

  • S(\(b_3\)) = [bel(\(b_3\)); pl(\(b_3\))] = [0,125; 0.65]

  • S(\(b_4\)) = [bel(\(b_4\)); pl(\(b_4\))] = [0,0375; 0.1]

Often a top-k query processed over an evidential database gives a large number of results. These latter need to be ranked in order to respond to the objective of the given query. In the evidential case, the result is a set of intervals that must be compared. Two methods were introduced to compare interval results in EDBs’ context [4, 18].

  1. (i)

    The first method was introduced in [18] and adapted in [4]. It is about computing degrees of preference of two intervals and then compare their results to deduce the rank based on three cases:

Definition 3

[Preference Degree]

Let \(S(R_i) = [bel_i;pl_i]\) and \(S(R_j) = [bel_j;pl_j]\) be two evidential scores. Each one is an interval composed of a belief degree and a plausibility degree. The degree of one interval to be greater than the other one is called a degree of preference and denoted P.

The degree of preference that \(S(R_i) > S(R_j)\) is defined such that:

$$\begin{aligned} P(S(R_i) > S(R_j)) = \dfrac{max(0,pl_i - bel_j)- max(0,bel_i - pl_j)}{(pl_i - bel_i) + (pl_j - bel_j)} \end{aligned}$$
(8)

The degree of preference that \(S(R_i) < S(R_j)\) is defined such that:

$$\begin{aligned} P(S(R_i) < S(R_j)) = \dfrac{max(0,pl_j - bel_i)- max(0,bel_j - pl_i)}{(pl_i - bel_i) + (pl_j - bel_j)} \end{aligned}$$
(9)

The different cases of comparing intervals \(S(R_i)\) and \(S(R_j)\) are as follows:

  • If \(P(S(R_i)> S(R_j))> P(S(R_j) > S(R_i))\), then \(S(R_i)\) is said to be superior to \(S(R_j)\), denoted by \(S(R_i) \succ S(R_j)\).

  • If \(P(S(R_i)> S(R_j)) = P(S(R_j) > S(R_i)) = 0.5\), then \(S(R_i)\) is said to be indifferent to \(S(R_j)\), denoted by \(S(R_i) \sim S(R_j)\).

  • If \(P(S(R_j)> S(R_i))> P(S(R_i) > S(R_j))\), then \(S(R_i)\) is said to be inferior to \(S(R_j)\), denoted by \(S(R_i) \prec S(R_j)\).

  1. (ii)

    The second method optimizes the first one using the complementarity proof*, results are compared in order to deduce their rank [4]:

Fig. 1.
figure 1

Specific cases to deduce evidential scores [4]

Definition 4

[Optimized Preference Degree]

Let \(S(R_i)=[bel_i;pl_i]\) and \(S(R_j)=[bel_j;pl_j]\) be two evidential scores. Every interval is composed of degrees of belief (bel) and plausibility (pl) and P is the calculated preference degree.

$$\begin{aligned} P(S(R_i) > S(R_j)) = \dfrac{max(0,pl_i - bel_j)- max(0,bel_i - pl_j)}{(pl_i - bel_i) + (pl_j - bel_j)} = \lambda \end{aligned}$$
(10)

The different cases of comparing intervals \(S(R_i)\) and \(S(R_j)\) are as follows:

  • If \(\lambda > 0.5\) then \(S(R_i) \succ S(R_j)\).

  • If \(\lambda = 0.5\), then \(S(R_i) \sim S(R_j)\).

  • If \(\lambda < 0.5\) then \(S(R_i) \prec S(R_j)\).

Cases that permit to minimize computations before using Definitions 3 or 4 are illustrated in Fig. 1.

Note here the importance of transitivity** property detailed in [18] to give the final ranking.

4 Implementation of Evidential Top-k Query

As the best of our knowledge, there is no implementation of evidential top-k queries. Indeed, we present in this paper an object-oriented implementation of two methods to rank evidential scores (the evidential intervals). The first method is naive, it consists on computing the preference degree through three steps each time: (a) it computes the preference degree that the first interval is superior to the second one and then (b) it computes the preference degree that the first interval is inferior to the second one. Finally, (c) it compares results and give the partial rank. This algorithm is presented in Table 3.

The second method is an optimization of the first one. Indeed, it consists on computing only in one step the preference degree and then deduce the partial order between two intervals. This algorithm is detailed in Table 4.

Finally, the last order is treated using a sorting algorithm, that ranks all evidential intervals and provide the k most interesting ones. The presented implementations offer two methods of evidential intervals’ ranking. Both algorithms use the object-oriented paradigm for its programming benefits.

Table 3. ETop-k naive algorithm
Table 4. ETop-k optimized algorithm

5 Experimental Study

In this section, we evaluate both algorithms from a performance point of view. We used a windows 10 operating system with 2.10 GHz CPU and 4 GB RAM. We also used Java programming language and NetBeans platform.

5.1 Data Sets

We used synthetic data sets with the following parameters (a) N the size of the database, (b) S the evidential score which is an interval of belief and plausibility [BelPL] with BEl, PL \(\in \) [0;1] and BEL \(\le \) PLFootnote 1.

To generate a synthetic evidential database, the used algorithm uses a procedure that generates a synthetic S. Indeed, the procedure computes randomly a fixed number of evidential scores in the interval [0, 1]. Then one of the algorithms (naive or optimized) are processed in order to compare intervals. Finally, a sorting function is used to provide the final complete ranking of all intervals. Note that each interval is associated to a specific and unique item in the evidential database. In our example, the item is a specific book.

Table 5. Impact of the database size for methods: NaiTopK and OptTopK
Fig. 2.
figure 2

Comparison of performance of NaiETopK and OptETopK

Experiments showed interesting results from a performance point of view. In fact, we varied the database size parameter (N) from 10 to 3000. The execution time did not exceed 4 min and 50 s for both algorithms. Results are presented in Table 5. Both algorithms showed interesting results. Moreover, OptETopK gave better ones as shown in Fig. 2. For example, OptETopK ranked 1500 tuples in 69 s against 60 s for NaiETopKNote. Note that complexity depends also on the intervals’ nature generated randomly as detailed theoretically in Sect. 3.

6 Conclusion

Throughout this paper, we presented an implementation of the Evidential Top-k query, ETop-k. In fact, we proposed two algorithms NaiETopK and OptETopK. Both methods showed interesting results when we varied the database size but OptETopK showed best performance in practice as shown theoretically in [4]. The proposed implementation is an important achievement of the evidential Top-k querying fitting the semantics of returning the k most credible answers.

Other types of queries, in the evidential context, like aggregation, range, threshold remain as a promising future works.