
1 Introduction

Skyline queries were introduced by Börzsönyi et al. in [4] to formulate multi-criteria searches. Recently, this concept has gained much attention in the database community and has been integrated into many database applications that require decision making and personalized services. The skyline process attempts to identify the most interesting objects (those not dominated in the Pareto sense) in a set of data. Skyline queries are based on the Pareto dominance relationship: given a set D of d-dimensional points (objects), a skyline query returns the skyline S, i.e., the set of points of D that are not dominated by any other point (object) of D. A point p dominates another point q iff p is better than or equal to q in all dimensions and strictly better than q in at least one dimension.

A great research effort has been devoted to developing efficient algorithms for skyline computation [12, 14, 17, 20, 24, 29]. However, the skyline computation often returns a huge number of skyline objects, which is less informative for the user and does not bring any insight to decision making. In order to solve this problem and reduce the size of the skyline, several algorithms have been developed [2, 6, 7, 9, 13, 18, 21, 23, 26]. In this paper, we consider this problem from another, novel angle. In particular, the idea of the advocated solution is borrowed from the formal concept analysis field. It consists in building a formal concept lattice over the skyline objects based on the minimal distance between each concept and the target concept (i.e., the ideal object w.r.t. the user query). The refined skyline \(S_{ref} \) is given by the concept that has the minimal distance to the target concept and contains k objects (k is a parameter given by the user). Starting from this idea, we develop an algorithm, called FLCMD, to compute the refined skyline. In summary, our main contributions cover the following points:

  • We define an efficient approach to refine the skyline S, based on the minimal distance between each concept of the lattice and the target concept.

  • We develop and implement an algorithm to compute \(S_{ref}\) efficiently.

  • We conduct a set of thorough experiments to study, analyze and compare the relevance and effectiveness of the proposed approach and of the naive method.

This paper is structured as follows. In Sect. 2, we define some necessary notions about skyline queries, fuzzy set theory, formal concept analysis and concept lattices; we then report some works related to skyline refinement and, at the end of the section, we explain the naive approach. In Sect. 3, we present our approach and give the FLCMD algorithm that computes the refined skyline \(S_{ref}\). Section 4 is dedicated to the experimental study and Sect. 5 concludes this paper and points out some future work.

2 Background and Related Work

2.1 Skyline Queries

Skyline queries [4] represent a very popular and powerful paradigm for extracting objects from a multidimensional dataset. They are based on the Pareto dominance principle, which can be defined as follows:

Definition 1

Let D be a set of d-dimensional data points, and let \( u_{i} \) and \( u_{j} \) be two points of D. \( u_{i} \) is said to dominate \( u_{j} \) in the Pareto sense (denoted \( u_{i}\succ u_{j} \)) iff \( u_{i} \) is better than or equal to \( u_{j} \) in all dimensions and strictly better than \( u_{j} \) in at least one dimension [25].

Formally, we write:

$$\begin{aligned} u_{i} \succ u_{j} \Leftrightarrow (\forall k \in \{1,..,d\}, u_{i}[k]\le u_{j}[k]) \wedge (\exists l\in \{1,..,d\}, u_{i}[l]<u_{j}[l]) \end{aligned}$$
(1)

where each tuple \( u_{i}=(u_{i}[1], u_{i}[2],\cdots , u_{i}[d]) \) with \( u_{i}[k] \) stands for the value of the tuple \( u_{i} \) for the attribute \( A_{k} \).

In Eq. (1), without loss of generality, we assume that the smaller the value, the better.
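As a minimal illustration of Eq. (1), the following Java sketch tests Pareto dominance between two tuples under this minimization convention (the class and method names are ours, not part of the paper's implementation):

```java
// Hypothetical helper: Pareto dominance between two d-dimensional tuples,
// assuming that smaller values are better on every dimension (Eq. 1).
public final class Dominance {

    // Returns true iff u dominates v: u <= v on every dimension
    // and u < v on at least one dimension.
    public static boolean dominates(double[] u, double[] v) {
        boolean strictlyBetterSomewhere = false;
        for (int k = 0; k < u.length; k++) {
            if (u[k] > v[k]) {
                return false;                     // u is worse on dimension k
            }
            if (u[k] < v[k]) {
                strictlyBetterSomewhere = true;
            }
        }
        return strictlyBetterSomewhere;
    }

    public static void main(String[] args) {
        double[] u = {450.0, 12.0};               // illustrative (price, dist_wh) values
        double[] v = {500.0, 12.0};
        System.out.println(dominates(u, v));      // true
        System.out.println(dominates(v, u));      // false
    }
}
```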

Definition 2

The skyline of D, denoted by S, is the set of points which are not dominated by any other point.

$$\begin{aligned} u \in S \Leftrightarrow \not \exists u' \in D, u' \succ u \end{aligned}$$
(2)

Example 1

To illustrate the concept of the skyline, let us consider a database containing information about apartments, as shown in Table 1. For each apartment, the list includes the following information: apartment code, area (\(m^{2}\)), price (in €) and distance between home (apartment) and work (\(dist\_wh\), in km). Ideally, a person looking to rent an apartment wants a minimal price and a minimal distance to his/her work (price and \(dist\_wh\)), ignoring the other pieces of information. Applying the traditional skyline to the apartment list of Table 1 returns the following apartments: \(\{A_{1}, A_{3}, A_{5}, A_{7}\}\), see Fig. 1.
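For small datasets, the skyline of Definition 2 can be obtained with the simplified block-nested-loop style filter sketched below. It reuses the Dominance.dominates helper sketched above and keeps the whole window in memory, which is a simplification of the BNL algorithm of [4]; on the (price, \(dist\_wh\)) values of Table 1 it would return the tuples of \(A_{1}, A_{3}, A_{5}\) and \(A_{7}\).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified BNL-style skyline filter: keeps the points of the dataset
// that are not dominated by any other point (Definition 2).
public final class SkylineFilter {

    public static List<double[]> skyline(List<double[]> data) {
        List<double[]> window = new ArrayList<>();
        for (double[] candidate : data) {
            // Drop window points that the candidate dominates.
            window.removeIf(p -> Dominance.dominates(candidate, p));
            // Keep the candidate only if no window point dominates it.
            boolean dominated = window.stream()
                    .anyMatch(p -> Dominance.dominates(p, candidate));
            if (!dominated) {
                window.add(candidate);
            }
        }
        return window;
    }
}
```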

Table 1. List of apartments
Fig. 1. Skyline of apartments

2.2 Fuzzy Set Theory

The concept of fuzzy sets was introduced by Zadeh [30] in 1965 to represent classes or sets whose limits are imprecise. Fuzzy sets can describe gradual transitions between full membership and full rejection. Formally, a fuzzy set F on the universe X is described by a membership function \( \mu _{F}: X \rightarrow [0,1] \), where \(\mu _{F}(x) \) represents the degree of membership of x in F. By definition, if \(\mu _{F}(x)=0 \) then the element x does not belong to F at all, while if \(\mu _{F}(x)=1 \) then x completely belongs to F; the latter elements form the core of F, denoted \(Cor(F)=\{x \in X \mid \mu _{F}(x)=1\} \). When \(0< \mu _{F}(x) < 1\) we speak of partial membership; the elements with a strictly positive degree form the support of F, denoted \(supp(F)= \{x \in X \mid \mu _{F}(x)>0\}\). Moreover, the closer \(\mu _{F}(x)\) is to 1, the more x belongs to F. Let \(x, y \in X\); we say that x is preferred to y iff \(\mu _{F}(x)>\mu _{F}(y)\). If \(\mu _{F}(x)=\mu _{F}(y)\) then x and y have the same preference. In practice, F can be represented by a trapezoidal membership function (t.m.f.) \((\alpha ,\beta ,\varphi ,\psi )\), where \([\beta ,\varphi ] \) is its core and \(]\alpha ,\psi [\) is its support, see Fig. 2.
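As an illustration, a t.m.f. \((\alpha ,\beta ,\varphi ,\psi )\) can be evaluated with the following sketch (the class and method names are ours). The degenerate trapezoid used in the main method coincides, on [340, 540], with the decreasing price degree \(1-(x-340)/200\) used later in Sect. 3.

```java
// Hypothetical evaluation of a trapezoidal membership function
// (alpha, beta, phi, psi): core on [beta, phi], support on ]alpha, psi[.
public final class Trapezoid {

    public static double membership(double x, double alpha, double beta,
                                    double phi, double psi) {
        if (x >= beta && x <= phi) {
            return 1.0;                              // inside the core
        }
        if (x <= alpha || x >= psi) {
            return 0.0;                              // outside the support
        }
        if (x < beta) {
            return (x - alpha) / (beta - alpha);     // increasing slope
        }
        return (psi - x) / (psi - phi);              // decreasing slope
    }

    public static void main(String[] args) {
        // Degenerate trapezoid (340, 340, 340, 540): degree 1 at 340,
        // linearly decreasing to 0 at 540.
        System.out.println(membership(440.0, 340.0, 340.0, 340.0, 540.0)); // 0.5
    }
}
```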

Fig. 2. Trapezoidal fuzzy set.

2.3 Formal Concept Analysis

The theory of formal concept analysis (FCA) was proposed by Wille in 1982 [28]. It is based on a formal context \( \mathcal {K} = (O, P, R) \), where O is a set of objects, P is a set of properties (attributes) and R is a binary relation between O and P. Wille defined correspondences between the sets O and P, called Galois derivation operators (or sufficiency operators) and denoted \(\vartriangle \). Given \(A\subset O\) and \(B\subset P\), \(A^{\vartriangle }\) is the set of all properties satisfied by all the objects of A and, dually, \(B^{\vartriangle }\) is the set of objects satisfying all the properties of B (see [28]). The dual pair of operators (\((.)^{\vartriangle }\), \((.)^{\vartriangle }\)) constitutes a Galois connection, which allows us to introduce formal concepts. A formal concept of a formal context \( \mathcal {K}\) is a pair (A, B) with \(A\subset O\), \(B\subset P\), \(A^{\vartriangle }= B\) and \(B^{\vartriangle } = A\); A and B are respectively called the extent and the intent of the formal concept (A, B). The set of all formal concepts is equipped with a partial order, denoted \(\preceq \) and defined by: \((A_{1},B_{1})\preceq (A_{2},B_{2})\) iff \(A_{1}\subseteq A_{2}\) (equivalently, \(B_{2}\subseteq B_{1}\)). Ganter and Wille proved in [10] that the set of all formal concepts ordered by \(\preceq \) forms a complete lattice, the concept lattice of the formal context \(\mathcal {K}\), denoted \(\mathcal {L(\mathcal {K})}\). In many applications, as in our case, the attributes are defined in a fuzzy way. In order to take into account relations allowing a gradual satisfaction of a property by an object, a fuzzy FCA was proposed by Burusco and Fuentes-González [5] and by Bělohlávek et al. [3]. In this setting, the satisfaction of a property is expressed by a degree in [0, 1]. A fuzzy formal context is a tuple (L, O, P, R), where the fuzzy relation \(R\in L^{O\times P}\) is a function \(O \times P \longrightarrow L\) which assigns to each object \(o\in O\) and each property \(p\in P\) a degree R(o, p) to which the object o has the property p. In general, \(L=[0, 1]\). The generalization of the Galois derivation operators to the fuzzy setting is based on a fuzzy implication, as defined by Bělohlávek in [3]. For a fuzzy subset \(A \in L^{O}\) (and, similarly, for a fuzzy subset \(B\in L^{P}\)), it is defined as follows:

$$\begin{aligned} A^{\vartriangle }(p)=\bigwedge _{o\in O}(A(o) \rightarrow R(o,p)) \end{aligned}$$
(3)
$$\begin{aligned} B^{\vartriangle }(o)=\bigwedge _{p\in P} ( B(p)\rightarrow R(o,p)) \end{aligned}$$
(4)

\(\rightarrow \) is a fuzzy implication, i.e., it satisfies \( 0\rightarrow 0=0\rightarrow 1=1\rightarrow 1=1\) and \(1\rightarrow 0=0\). Three types of fuzzy formal concepts can be distinguished: concepts with a crisp extent and a fuzzy intent, concepts with a fuzzy extent and a crisp intent, and concepts with a fuzzy extent and a fuzzy intent.

In this paper, we use concepts with a crisp extent and a fuzzy intent, i.e., the set of objects is crisp and the set of properties is fuzzy.

Example 2

To illustrate the computation of formal concepts in our setting, let us consider a database containing information about hotels, as shown in Table 2. The set of objects O is composed of the hotels \(\{h_{1}, h_{2}, h_{3}\} \) and the set of properties P contains the properties \( cheap \) (denoted Ch) and \( near\, the\, beach \) (denoted Nb), i.e., \(P=\{Ch, Nb\}\). \(R(o_{i},p_{j}) \) represents the degree to which the object \(o_{i}\) satisfies the property \(p_{j}\); for example, \(R(h_{2},Ch)=0.5\) means that the hotel \(h_{2}\) satisfies the property \( cheap \) with degree 0.5. Let us consider the sets of objects \(A_{1}= \{h_{2}, h_{3} \}\) and \(A_{2}= \{h_{2}\}\), and the set of properties \(B_{1}= \{Ch^{0.5}, Nb^{0.5}\}\). Now, let us describe how to compute \((A_{1})^{\vartriangle }\), \((A_{2})^{\vartriangle }\) and \((B_{1})^{\vartriangle }\). For \((A_{1})^{\vartriangle }\) and \((A_{2})^{\vartriangle }\), we use Eq. (3) and the Gödel implication defined by

$$\begin{aligned} p\longrightarrow q= {\left\{ \begin{array}{ll} 1 &{} \text {if } p \le q \\ q &{} \text {else} \end{array}\right. } \end{aligned}$$
(5)
Table 2. List of hotels

\(A_{1}= \{h_{2},h_{3}\}=\{h_{1}^{0}, h_{2}^{1}, h_{3}^{1}\} \)

\((A_{1})^{\vartriangle }(Ch)= \wedge (0 \rightarrow 0,\,\, 1\rightarrow 0.5,\,\, 1\rightarrow 0.5)=\wedge (1,\, 0.5,\, 0.5)=0.5\)

\((A_{1})^{\vartriangle }(Nb)= \wedge (0 \rightarrow 0.8,\,\, 1\rightarrow 0.5,\,\, 1\rightarrow 0.6)=\wedge (1,\, 0.5,\, 0.6)=0.5\)

\((A_{1})^{\vartriangle }=\{Ch^{0.5},Nb^{0.5}\}=B_{1} \)

Similarly, we obtain \((A_{2})^{\vartriangle }=\{Ch^{0.5},Nb^{0.5}\} = B_{1}\)

To compute \((B_{1})^{\vartriangle }\), we use Eq. (4) and the Rescher-Gaines implication defined by

$$\begin{aligned} p\longrightarrow q= {\left\{ \begin{array}{ll} 1 &{} \text {if } p \le q \\ 0 &{} \text {else} \end{array}\right. } \end{aligned}$$
(6)

\((B_{1})^{\vartriangle }(h1)=\wedge (0.5 \rightarrow 0,\,\, 0.5\rightarrow 0.8)=\wedge (0,\, 1)=0\)

\((B_{1})^{\vartriangle }(h2)=\wedge (0.5 \rightarrow 0.5,\,\, 0.5\rightarrow 0.5)=\wedge (1,\, 1)=1\)

\((B_{1})^{\vartriangle }(h3)=\wedge (0.5 \rightarrow 0.5,\,\, 0.5\rightarrow 0.6)=\wedge (1,\, 1)=1\)

\((B_{1})^{\vartriangle }=\{h_{1}^{0},h_{2}^{1},h_{3}^{1}\}=\{h_{2}, h_{3}\}=A_{1} \).

  • \((A_{1})^{\vartriangle }=B_{1}\) and \((B_{1})^{\vartriangle }=A_{1}\); this means that \((A_{1},B_{1})\) forms a fuzzy formal concept, \(A_{1}\) being its extent and \(B_{1}\) its intent.

  • \((A_{2})^{\vartriangle }=B_{1}\) but \((B_{1})^{\vartriangle }= \{h_{2}, h_{3}\} \ne A_{2}\); therefore, \((A_{2},B_{1})\) is not a fuzzy formal concept.
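The computations of Example 2 can be reproduced with the sketch below, which implements Eqs. (3) and (4) in the crisp-extent/fuzzy-intent setting, using the Gödel implication for Eq. (3) and the Rescher-Gaines implication for Eq. (4), as above. The class and method names are ours; the matrix R holds the degrees of Table 2.

```java
// Minimal sketch of the fuzzy derivation operators of Eqs. (3) and (4).
// R[i][j] is the degree to which object i satisfies property j.
public final class FuzzyDerivation {

    static double godel(double p, double q)         { return p <= q ? 1.0 : q; }
    static double rescherGaines(double p, double q) { return p <= q ? 1.0 : 0.0; }

    // Eq. (3): intent of a set of objects A, given as a characteristic
    // vector in {0, 1}, using the Gödel implication.
    static double[] intentOf(double[] A, double[][] R) {
        double[] intent = new double[R[0].length];
        for (int j = 0; j < intent.length; j++) {
            double min = 1.0;
            for (int i = 0; i < A.length; i++) {
                min = Math.min(min, godel(A[i], R[i][j]));
            }
            intent[j] = min;
        }
        return intent;
    }

    // Eq. (4): extent of a fuzzy set of properties B, using the
    // Rescher-Gaines implication, which yields a crisp extent in {0, 1}.
    static double[] extentOf(double[] B, double[][] R) {
        double[] extent = new double[R.length];
        for (int i = 0; i < R.length; i++) {
            double min = 1.0;
            for (int j = 0; j < B.length; j++) {
                min = Math.min(min, rescherGaines(B[j], R[i][j]));
            }
            extent[i] = min;
        }
        return extent;
    }

    public static void main(String[] args) {
        // Hotel context of Table 2, degrees for (Ch, Nb).
        double[][] R = { {0.0, 0.8}, {0.5, 0.5}, {0.5, 0.6} };
        double[] A1  = {0.0, 1.0, 1.0};                       // {h2, h3}
        double[] B1  = intentOf(A1, R);                       // {Ch^0.5, Nb^0.5}
        double[] ext = extentOf(B1, R);                       // {h2, h3}
        System.out.println(java.util.Arrays.toString(B1));    // [0.5, 0.5]
        System.out.println(java.util.Arrays.toString(ext));   // [0.0, 1.0, 1.0]
    }
}
```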

2.4 Related Work

The work of Börzsönyi et al. [4] is the first to address skyline queries in the database field. They proposed two algorithms to process skyline queries over complete databases, namely Block Nested Loop (BNL) and Divide and Conquer (D&C). Later, many algorithms inspired by BNL and D&C have been developed [4, 8, 19, 23, 27]. Several authors have been interested in the problem of huge skylines and have proposed additional mechanisms to refine the skyline and reduce its size.

In [2, 7, 13, 18, 21, 23, 26], ranking functions are used to refine the skyline. The idea of these approaches is to combine the skyline operator with the top-k operator. Each skyline tuple is assigned a score computed by means of a ranking function F, which must be monotone in all its arguments. The skyline tuples are then ordered according to their scores, and the top-k tuples are returned.

In [11], the authors propose the notion of fuzzy skyline queries, which replaces the standard comparison operators \((=, <, >, \le , \ge )\) with fuzzy comparison operators defined by the user. In [15], Hadjali et al. have proposed some ideas to introduce an order between the skyline points in order to single out the most interesting ones. In [1], a new definition of the dominance relationship based on the fuzzy quantifier “almost all” is introduced to refine the skyline, while in [16] the authors introduce a strong dominance relationship that relies on a relation called “much preferred”. This leads to a new extension of the skyline, called MPS (Much Preferred Skyline), to find the most interesting skyline tuples. In [22], the authors propose a flexible approach called “\(\theta \)-skyline” to categorize and refine the skyline set by applying successive relaxations of the dominance conditions with respect to the user's preferences. This approach is based on a ranking method which deals with decision making in the presence of conflicting choices; furthermore, the authors define a global ranking method over the skyline set. In [13], Haddache et al. have proposed an approach based on the ELECTRE method, borrowed from the outranking domain, to refine the skyline.

Furthermore, several researchers have worked on skyline refinement for evidential data. In [9], the authors have developed efficient algorithms to retrieve the best evidential skyline objects over uncertain data.

2.5 Naive Method

This approach [6] is based on two steps: (i) first, compute for each skyline point p the number of points dominated by p, denoted num(p); (ii) then, sort the skyline points in decreasing order of num(p) and return the top-k points.
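A minimal sketch of this naive method is given below; it reuses the Dominance.dominates helper of Sect. 2.1 and, for brevity, recomputes num(p) inside the comparator, whereas a real implementation would precompute and cache the scores. All identifiers are ours.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Naive refinement: score each skyline point by the number of dataset
// points it dominates, then keep the k best-scoring points.
public final class NaiveTopK {

    public static List<double[]> refine(List<double[]> skyline,
                                        List<double[]> dataset, int k) {
        return skyline.stream()
                .sorted(Comparator.comparingLong(
                        (double[] p) -> countDominated(p, dataset)).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    // num(p): number of dataset points dominated by p.
    private static long countDominated(double[] p, List<double[]> dataset) {
        return dataset.stream()
                .filter(q -> Dominance.dominates(p, q))
                .count();
    }
}
```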

3 Our Approach

In this section, we present the main steps of our approach. First, we assume that we have:

  • A database formed by a set of m objects (tuples), \(O=\{o_{1},o_{2},\cdots ,o_{m}\}\).

  • A set P of n properties (or dimensions or attributes), \(P=\{p_{1},p_{2},\cdots ,p_{n}\}\).

  • Each object \(o_{i}\) of the set O is evaluated on every property \(p_{j}\).

  • S, the skyline of O: \(S=\{o_{1},o_{2},\cdots ,o_{t}\}\), \(t\le m\), where t is the size of the skyline.

  • \(S_{ref}\) the refined skyline returned by our approach.

  • In our approach, we use the Rescher-Gaines implication (\(\longrightarrow \)) defined by Eq. (6).

Fig. 3. Steps of our approach

The principle of our approach is to build the fuzzy concept lattice of the skyline points based on the minimal distance between each new concept and the target concept. In summary, our approach proceeds through the following steps (see Fig. 3).

Algorithm 1. FLCMD
  1. First, we compute the skyline using the Block Nested Loop (BNL) algorithm; for more details, see [4].

  2. Second, we compute the refined skyline using Algorithm 1 (FLCMD). This algorithm starts by computing, for each object \(o_{i}\), the degree \(R(o_{i},p_{j})\) to which \(o_{i}\) minimizes the property \(p_{j}\) chosen by the user. Then, it computes the formal concept whose intent minimizes the properties chosen by the user, i.e., maximizes the degrees \(R(o_{i},p_{j})\) for these properties (this concept is called the target concept).

  3. FLCMD then builds the fuzzy lattice of the skyline objects. It starts from the formal concept whose intent minimizes the degrees \(R(o_{i},p_{j})\) for the properties chosen by the user.

  4. FLCMD computes all the concepts that follow this concept in the lattice.

  5. For each new concept, FLCMD computes the size of its extent and the distance between its intent and the intent of the target concept.

  6. If the size of an extent equals k (where k is a user-defined parameter), the process stops. The refined skyline is given by the objects of this extent (if several extents have size k, FLCMD chooses the one whose intent has the minimal distance to the target intent).

  7. If the sizes of the extents are greater than k, FLCMD selects the intent that has the minimal distance and resumes from step 4.

The FLCMD algorithm uses the following functions (a sketch of its exploration loop is given after this list):

  • \(Next\_intent(Intent\_min, i)\): returns the intent following \(Intent\_min \) on dimension i.

  • \(Compute\_Extent(New\_Intent)\): computes the extent of \(New\_Intent\), using Eq. (4) and the implication given by Eq. (6).

  • \(Compute\_distance(New\_Intent, Intent\_target)\): computes the Euclidean distance between \(New\_Intent\) and \(Intent\_target\).
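Algorithm 1 itself is not reproduced here; the sketch below only illustrates the exploration loop of steps 3 to 7, assuming implementations of \(Next\_intent\) and \(Compute\_Extent\) are provided (the Lattice interface and all identifiers are ours, not the paper's code).

```java
import java.util.List;

// Hypothetical sketch of the FLCMD exploration loop (steps 3-7 above).
public final class FlcmdSketch {

    // Stand-ins for Next_intent and Compute_Extent of Algorithm 1.
    interface Lattice {
        double[] nextIntent(double[] intent, int dimension); // null if none exists
        List<Integer> computeExtent(double[] intent);        // Eq. (4) with Eq. (6)
    }

    public static List<Integer> refine(Lattice lat, double[] intentMin,
                                       double[] intentTarget, int k) {
        double[] current = intentMin;
        while (true) {
            List<Integer> answer = null;   // extent of size exactly k (step 6)
            double[] next = null;          // min-distance intent with |extent| > k (step 7)
            double bestK = Double.MAX_VALUE, bestLarger = Double.MAX_VALUE;
            // Steps 4-5: generate the following concepts and evaluate them.
            for (int i = 0; i < current.length; i++) {
                double[] cand = lat.nextIntent(current, i);
                if (cand == null) continue;
                List<Integer> extent = lat.computeExtent(cand);
                double d = distance(cand, intentTarget);
                if (extent.size() == k && d < bestK) {
                    answer = extent; bestK = d;            // step 6 (ties: minimal distance)
                } else if (extent.size() > k && d < bestLarger) {
                    next = cand; bestLarger = d;           // step 7 candidate
                }
            }
            if (answer != null) return answer;             // |S_ref| = k
            if (next == null) return null;                 // no concept of size k reached
            current = next;                                // resume from step 4
        }
    }

    // Compute_distance: Euclidean distance between two intents.
    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(s);
    }
}
```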

Example 3

To illustrate our approach, let us come back to the skyline computed in Example 1 of Sect. 2.1. As a reminder, we use two properties, namely price and \(dist\_wh\), and we assume that the smaller the value, the better. The BNL algorithm returns as skyline the following apartments: \(\{A_{1}, A_{3}, A_{5}, A_{7}\}\), see Table 3.

Remark. In the following, we denote the intent (\(price^{\alpha }\), \(dist\_wh^{\beta }\)) by \((\alpha , \beta )\).

Table 3. Classic skyline and objects degrees

First, we compute, for each skyline object \(A_{i}\), the degree \(R(A_{i}, p_{j}) \) to which \(A_{i}\) minimizes the property \(p_{j}\). These degrees are given by (see Fig. 4):

  • \(R(A_{i},price)=1-(x_{1}-340)/200\), where \(x_{1}\) is the value of \(A_{i}\) w.r.t. the property \( price \).

  • \( R(A_{i}, dist\_wh )=1-(x_{2}-10)/80\), where \(x_{2}\) is the value of \(A_{i}\) w.r.t. the property \( dist\_wh \).
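For instance, a hypothetical apartment priced at 440 € and located 30 km from the workplace (illustrative values, not taken from Table 1) would obtain the degrees

$$\begin{aligned} R(A,price)=1-\frac{440-340}{200}=0.5, \qquad R(A,dist\_wh)=1-\frac{30-10}{80}=0.75. \end{aligned}$$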

Second, we compute the target intent and the intent that minimizes the degrees \(R(A_{i}, p_{j}) \) w.r.t. a cheap price and a short distance. Using the data of Table 3 and Algorithm 1, one can observe that \(Intent\_target=(1,1)\) and \(Intent\_min=(0,0)\). Then, we compute the intents following \(Intent\_min\).

For \(i=1\), \(New\_Intent= (0.075, 0)\). The distance between this intent and the target intent is \(d=\sqrt{(1-0.075)^{2}+(1-0)^{2}}=1.36\), and its extent is \((A_{1}, A_{3}, A_{5})\).

For \(i=2\), \(New\_Intent= (0, 0.0625)\). The distance between this intent and the target intent is \(d=\sqrt{(1-0)^{2}+(1-0.0625)^{2}}=1.37\), and its extent is \((A_{1}, A_{3}, A_{7})\).

If \(k=3\), the process stops and \(S_{ref}=\{A_{1}, A_{3}, A_{5}\}\).

If \(k<3\), we select the intent (0.075, 0) (because it has the minimal distance, 1.36, to the target intent), then we compute its following intents, and the process continues as shown in Fig. 5. From Fig. 5, we can see that, if \(k=2\), the refined skyline equals \(\{A_{3}, A_{5}\}\), and when \(k=1\), \(S_{ref}=\{A_{3}\}\).

Fig. 4. Objects degrees

Fig. 5. Lattice of skyline points based on minimal distance

4 Experimental Study

In this section, we present the experimental study that we have conducted. The goal of this study is to show the effectiveness of our algorithm and its ability to refine huge skylines, and to compare its relevance to that of the naive method. All experiments were performed under Windows OS, on a machine with an Intel Core i7 2.90 GHz processor, 8 GB of main memory and 250 GB of disk. All algorithms were implemented in Java. The benchmark datasets are generated using the method described in [4]. The test parameters are the dataset distribution [DIS] (correlated, anti-correlated and independent), the dataset size [D] (100K, 250K, 500K, 1000K, 2000K, 4000K) and the number of dimensions [d] (2, 4, 6, 10, 15). To interpret the results, we define the following refinement rate (\(ref\_rate\)):

$$\begin{aligned} ref\_rate=\dfrac{(ntcs-ntrs)}{(ntcs)} \end{aligned}$$
(7)

where ntcs is the number of tuples of the classic skyline and ntrs is the number of tuples of the refined skyline.

Impact of [DIS]. In this case, we use a dataset with \(|D|=100K\) and \(d=6\). Figure 6 shows that the execution time of the two algorithms on anti-correlated data is high compared to correlated or independent data. This is due to the large number of tuples to refine (14758 tuples for anti-correlated data versus 2184 and 89 tuples for independent and correlated data, respectively). Figure 6 also shows that our algorithm has the best execution time compared to the naive algorithm (0.004 s for FLCMD vs. 0.85 s for the naive algorithm on correlated data, 10.41 s vs. 72.32 s on anti-correlated data, and 0.38 s vs. 18.2 s on independent data). The refinement rate of the two algorithms is very high (for correlated data, \((89-10)/89=0.88\); for anti-correlated data, \((14758-10)/14758=0.99\); and for independent data, \((2184-10)/2184=0.995\)).

Fig. 6. Impact of [DIS]

Fig. 7. Impact of [D]

Impact of the Size of the Dataset [D]. In this case, we study the impact of the size of the database on the execution time of the refinement and on the refinement rate for the two algorithms. To do this, we use an anti-correlated database with \(d=4\). Figure 7 shows that the execution time increases with the database size, but the execution time of our algorithm remains the best compared to the naive algorithm (it increases from 0.3 s for \(|D|=100K\) to 10.52 s for \(|D|=4000K\) for FLCMD, and from 13.48 s for \(|D|=100K\) to 1130.5 s for \(|D|=4000K\) for the naive algorithm). The refinement rate of the two algorithms is very high, varying from 0.996 \(((2811-10)/2811=0.996)\) when \(|D|=100K\) to 0.999 \(((12540-10)/12540=0.999)\) when \(|D|=4000K\).

Impact of the Number of Dimensions [d]. In this case, we study the impact of varying the number of skyline dimensions on the computation of \( S_{ref} \). We use anti-correlated data with \(|D|=50K\). Figure 8 shows that the execution time increases with the number of dimensions (from 0.008 s for \(d=2\) to 120.3 s for \(d=15\) for the FLCMD algorithm, and from 0.5 s to 420 s for the naive algorithm when d varies from 2 to 15). Again, our algorithm gives the best execution time compared to the naive algorithm. The refinement rate increases from 0.94 \(((187-10)/187=0.94)\) when \(d=2\) to 0.99 \(((48103-10)/48103=0.99)\) for \(d=15\).

Fig. 8. Impact of [d]

5 Conclusion and Perspectives

In this paper, we addressed the problem of huge skylines and proposed a new approach to reduce their size. The basic idea of this approach is to build a fuzzy concept lattice for the skyline objects based on the minimal distance between each concept and the target concept. The refinement process stops when we reach a concept that contains k objects (where k is a user-defined parameter) and has the minimal distance to the target concept; the refined skyline is given by the objects of this concept. An algorithm, called FLCMD, is proposed to compute the refined skyline. In addition, we implemented the naive algorithm in order to compare its performance with that of our algorithm. The experimental study we have conducted shows that our approach is a good alternative for reducing the size of the classic skyline (the refinement rate reaches \(99\%\)) with a reasonable computation time, and that the execution time of our algorithm is better than that of the naive algorithm. As future work, we will explore, on the one hand, the use of a semantic distance between concepts to build the refinement lattice and, on the other hand, lattice construction algorithms that produce fuzzy extents in order to rank the objects within the same concept.