1 Introduction

Over the past decade, interest in designing and developing more flexible query operators in database management systems has increased dramatically. These query operators have changed the way data are retrieved from the database. Preference queries return data items of the database if (and only if) they meet the user’s given preferences. Skyline queries constitute one of the most practical and predominant types of preference queries. They were first introduced in database systems by Borzsonyi et al. [1]. Skyline queries return only those non-dominated data items from the database that best meet the user’s given preferences in the submitted query [1,2,3,4,5,6,7,8,9,10]. For instance, suppose someone is looking for a house in a specific area, and each house available on the market possesses different features (number of rooms, number of bathrooms, type of bedrooms, location, and distance from the workplace). Browsing a house rental website can be quite time-consuming and tedious, as the interested user has to choose from among the many houses advertised on the website. In this scenario, submitting a skyline query eliminates all those houses that do not fulfil the user’s stated preferences and narrows the results to those that best suit his or her interest.

To clarify how the skyline technique works, consider the following example of a hotel-finder database. Assume that a researcher is looking for a hotel near his conference venue. He wants the hotel that is closest to the conference venue and has the best rating. Figure 1a represents the data in relational form. This relation contains the details of ten hotels. The first attribute represents the hotel ID, the second attribute indicates the rating of the hotel, and the third attribute denotes the distance to the conference venue. Figure 1b illustrates the hotel-finder database example in 2-D space. Based on the definition of skyline queries and considering the existing constraints, hotels h1, h2, h4, h5, and h7 are fully dominated by h9 and h8. This is because h9 and h8 have better ratings than these hotels (h1, h2, h4, h5, and h7) and are similar in terms of distance. Hotels h3, h6, and h10 are partially dominated by h9 based on distance. However, these hotels are dominated by h8 in both dimensions (better rating and shorter distance). The dominated hotels (h1, h2, h3, h4, h5, h6, h7, h10) are therefore not skylines. It can also be observed that hotel h9 partially dominates h8 in the third dimension (shorter distance), and similarly, hotel h8 partially dominates hotel h9 in the second dimension (better rating). According to the skyline definition, these two hotels (h8 and h9) are the skylines of this hotel-finder database.

Fig. 1 Skylines of hotel recommendation database

Due to their immense benefits, skyline queries are widely used in many domains such as multi-criteria decision making, decision support [11,12,13,14,15], hotel recommenders [16], restaurant finders [2, 16], temporal databases [17], crowdsourcing databases [18,19,20], and cloud databases [21]. Since the skyline operator was first introduced into database systems by Borzsonyi et al. [1], researchers in the database community have been striving to improve the performance of the skyline process by reducing the search space and minimizing the number of pairwise comparisons, which in turn lowers the processing time of any skyline computation. Many skyline approaches proposed in the literature [11,12,13, 15, 22,23,24] concentrate on skyline issues over complete databases, where the values of all dimensions are present in the database. The completeness of the data makes it easier to identify the skylines that fit the user’s preference among all data items present in the database. However, real-world databases, which are mostly large and multidimensional, are rather unlikely to be complete. In other words, the values of some data items are missing in one or more dimensions. Thus, the assumption of data completeness in any given skyline process cannot be sustained.

The incompleteness of data in databases has given rise to many challenges in evaluating skyline queries and has made it quite difficult to identify the skylines of the database. Because of this incompleteness, the approaches proposed for complete databases should not be applied directly to incomplete databases. Incomplete data results in unnecessary pairwise comparisons that eventually make the cost of processing skyline queries prohibitive. Most importantly, an incomplete database also means losing the transitivity property of the skyline, which in turn may lead to the problem of cyclic dominance [2]. Furthermore, in large incomplete databases, the number of skylines may increase because missing values in many dimensions render many data items incomparable. Besides, a high number of skylines does not necessarily provide any insight to the user and may not assist him or her in making the right choice [2].

In order to further clarify the issue of incomplete data and its impact on processing skyline queries, the following example is given. A house-rental website possesses a database to allow users to rent houses, and a customer wants to rent a house whose features should be as follows: 1) it should have the maximum number of bedrooms; 2) it should be closest to his workplace; and 3) the rent should be the lowest among the houses available in the database. Assume that the database contains three houses and that some values are missing: h1 (2, *, 600), h2 (*, 5, 700) and h3 (3, 6, *). The first dimension contains the number of bedrooms, the second dimension the distance (in km) from the workplace, and the third dimension the rent. The symbol * denotes that the value of a particular dimension is missing. In order to evaluate the skylines for this kind of incomplete dataset, we need to compare all houses with one another and identify the best house according to the user’s preferences. When comparing h1 with h2, we find that h1 dominates h2 in the third dimension (a rent of 600 versus 700). When comparing h2 with h3, we find that h2 is better than h3 in the second dimension (5 km versus 6 km). By the transitivity property, since h1 dominates h2 and h2 dominates h3, h1 should dominate h3. However, this is not the case in the given dataset. Instead, we find that h3 dominates h1, since the only dimension they share is the first one (3 bedrooms versus 2). Hence, we lose the transitivity property and also encounter the issue of cyclic dominance. In consequence, the process ends with no result, as every house is dominated by another.

  • Transitivity: Given ai, aj, ak ∈ R, if ai ≻ aj, and aj ≻ ak, according to transitivity property ai ≻ ak holds for complete data. However, it may not hold for incomplete data.

  • Cyclic dominance: Given ai, aj, ak ∈ R, ai ≻ aj, aj ≻ ak, and ak ≻ ai may all hold over incomplete data.
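The house-rental example can be reproduced with a short sketch. It assumes that dominance between two incomplete items is tested only on the dimensions where both have a value, and it encodes the stated preferences (more bedrooms, shorter distance, lower rent); the names `dominates`, `LARGER_IS_BETTER`, and `houses` are illustrative and not taken from the paper.

```python
from typing import Optional, Sequence

Item = Sequence[Optional[float]]  # None marks a missing value (*)

# Preference direction per dimension in the house example:
# bedrooms (larger is better), distance in km and rent (smaller is better).
LARGER_IS_BETTER = (True, False, False)


def dominates(a: Item, b: Item) -> bool:
    """True if a dominates b on the dimensions present in both items."""
    comparable = False
    strictly_better = False
    for x, y, larger in zip(a, b, LARGER_IS_BETTER):
        if x is None or y is None:
            continue  # skip dimensions missing in either item
        comparable = True
        if not larger:
            x, y = -x, -y  # flip sign so that larger is always better
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return comparable and strictly_better


houses = {
    "h1": (2, None, 600),
    "h2": (None, 5, 700),
    "h3": (3, 6, None),
}

cycle = [(p, q) for p in houses for q in houses
         if p != q and dominates(houses[p], houses[q])]
print(cycle)  # [('h1', 'h2'), ('h2', 'h3'), ('h3', 'h1')]
```

Every house appears on the right-hand side of some pair, so a naive skyline computation over this dataset would return an empty result.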

The approach presented in this paper first sorts the data items in descending order based on the available data in each dimension. It then accumulates the domination power of each data item, which denotes the number of dominated data items, by scanning the sorted lists. Subsequently, filtration is applied as an initial pruning of the dominated data items before applying the skyline technique. Then, the remaining data items are partitioned into disjoint sets called clusters based on their domination power. These clusters are further divided into smaller groups, where each group contains data items with the same bitmap representation. Dividing the clusters into smaller groups simplifies the skyline process and helps sustain the transitivity property of the skyline technique. In addition, it helps reduce the search space and minimizes the domination tests. In order to accelerate the process, it is run in parallel on each group to eliminate the unwanted data items effectively. Lastly, the approach compares the local cluster skylines and returns only the non-dominated data items as the skylines.

The following points summarize the contributions of this paper:

  • We re-introduce the problem of identifying skylines in a database with incomplete data and justify the need to form a new and more efficient solution.

  • We conduct a comprehensive review of the most notable work in the area of skyline queries in database systems. This covers the previous approaches designed for complete and incomplete databases. The focus is on examining and summarizing the strengths and weaknesses of each approach.

  • We propose a new algorithm for processing skyline queries on incomplete data, called Sorting-based Cluster Skylines Algorithm (SCSA), which efficiently answers skyline queries in incomplete databases by exploiting sorting and clustering to simplify the skyline computation.

  • We develop an innovative method that helps sort initial data items into distinct clusters based on the domination power of the data items.

  • We incorporate the two optimization techniques of filtration and optimization that help reduce the number of pairwise comparisons between data items. Filtration prunes the initial incomplete database and eliminates the dominated data items before applying the skyline technique.

  • We evaluate the efficiency and the effectiveness of the proposed solution through experiments using both real and synthetic datasets. These experiments demonstrate the effectiveness of the proposed solution.

The remainder of the paper is organized as follows: In Section 2, the previous works related to this research are reported and discussed. The basic definitions and notations used in the rest of the paper are set out in Section 3. The proposed approach is explained and illustrated in Section 4, and the experimental results are illustrated and explained in Section 5. The conclusion is described in Section 6.

2 Related work

Skyline queries were initially examined as maximal vector [25,26,27] or Pareto set computation [28, 29]. Borzsonyi et al. [1] were the first to present the skyline operator in database management systems. Following this work, a number of approaches were proposed to improve the efficiency and performance of the skyline process in database management systems. The algorithms proposed in the literature focus on reducing the search space, minimizing the pairwise comparisons between the data items needed to declare the skylines, and reducing the execution time. In this section, we review the existing algorithms designed for complete databases in Subsection 2.1. Subsection 2.2 examines the existing skyline solutions designed for incomplete databases.

2.1 Skyline queries for complete databases

Borzsonyi et al. [1] proposed two algorithms, BNL and D&C. Block Nested-Loop (BNL) reads data items one by one and compares each of them to the data items in the window. The data items that are dominated are removed from the window, leaving the rest of the data items in the window for the next iterations. Divide and Conquer (D&C) divides the initial dataset into two main sections, computes the local skylines of both sections separately, and then identifies the final skylines by evaluating the local skylines of both sections together. Numerous algorithms have been proposed after BNL and D&C, applying two methods (sorting and partitioning) to the initial data to evaluate skylines.
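The window-based idea behind BNL can be sketched as follows. This is a simplified in-memory illustration, assuming complete data, a "greater is better" preference in every dimension, and a window that never overflows (the original algorithm also handles windows that do not fit in memory); the function names and sample points are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p is no worse than q everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def bnl_skyline(points: Sequence[Point]) -> List[Point]:
    window: List[Point] = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated by a window item, so it is dropped
        # p survives: evict the window items that p dominates, then keep p
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


points = [(1, 5), (3, 3), (2, 2), (5, 1), (3, 4)]
print(bnl_skyline(points))  # [(1, 5), (5, 1), (3, 4)]
```

Note how (3, 3) first enters the window and is evicted only when (3, 4) arrives, which is the kind of late pruning that sorting-based methods try to avoid.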

Sorting approaches

The main goal of sorting is to reorganize the data items in such a way that the dominated data items are pruned as early as possible, which also reduces domination tests. In BNL, the data items are accessed in their initial (input) order. In order to enhance the efficiency of BNL, other algorithms such as SFS [23], LESS [30] and SaLSa [31] have been proposed. These algorithms use the sorting concept and rearrange the data items so that those with the least potential are eliminated at the early stages of the skyline process. SFS sorts the entire database in non-ascending order to eliminate non-skyline data items early, whereas LESS combines the advantages of SFS and BNL and reorders the data items. SaLSa uses another sorting function that keeps dominant data items on top, which helps prune more data items and reduces domination tests.
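The benefit of presorting can be illustrated with a minimal sketch in the spirit of SFS. It assumes complete data, "greater is better" in every dimension, and uses the sum of the coordinates as the monotone sorting score; that score is an illustrative choice, not necessarily the function used in [23].

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p is no worse than q everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def sfs_skyline(points: Sequence[Point]) -> List[Point]:
    # If p dominates q then sum(p) > sum(q), so after sorting by the sum in
    # non-ascending order no item can be dominated by one that comes later.
    ordered = sorted(points, key=sum, reverse=True)
    skyline: List[Point] = []
    for p in ordered:
        if not any(dominates(s, p) for s in skyline):
            skyline.append(p)  # once added, p is never removed again
    return skyline


points = [(1, 5), (3, 3), (2, 2), (5, 1), (3, 4)]
print(sorted(sfs_skyline(points)))  # [(1, 5), (3, 4), (5, 1)]
```

Compared with a plain window scan, each item is tested only against confirmed skyline items, and no item ever has to be evicted.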

Partitioning approaches

The main idea behind partitioning datasets is to increase the efficiency of the algorithm. It also reduces the domination tests between the data items. As stated earlier, D&C divides the dataset into two main subsets and then retrieves the skylines by recursively partitioning the subsets. Several other algorithms such as INDEX [11], NN [22], BBS [24], OSPS [32], ZSearch [33], and BSkyTree [34] have been proposed based on the D&C concept. They divide the dataset into small partitions, retrieve the local skylines from all partitions, and then combine all local skylines to produce the final skyline.
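The partition-then-merge pattern can be sketched as below. This is a simplified illustration that assumes complete data and "greater is better" preferences, and that splits the input by position rather than by the median of a dimension as a full D&C implementation would; the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p is no worse than q everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def dc_skyline(points: Sequence[Point]) -> List[Point]:
    if len(points) <= 1:
        return list(points)
    mid = len(points) // 2
    left = dc_skyline(points[:mid])    # local skyline of the first partition
    right = dc_skyline(points[mid:])   # local skyline of the second partition
    # Merge step: keep a local skyline item only if no local skyline item of
    # the other partition dominates it (valid because dominance is transitive
    # on complete data).
    keep_left = [p for p in left if not any(dominates(q, p) for q in right)]
    keep_right = [q for q in right if not any(dominates(p, q) for p in left)]
    return keep_left + keep_right


points = [(1, 5), (3, 3), (2, 2), (5, 1), (3, 4)]
print(sorted(dc_skyline(points)))  # [(1, 5), (3, 4), (5, 1)]
```

The merge step relies on transitivity, which is exactly the property that is lost once values are missing.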

2.2 Skyline queries for incomplete databases

While Borzsonyi et al. [1] first introduced skyline queries for complete data, Khalefa et al. [2] were the first to propose skyline queries for incomplete data. They proposed the ISkyline algorithm to identify skylines in incomplete data. ISkyline optimizes the authors’ own Bucket algorithm [2] using two optimization techniques, virtual points and shadow skylines. Instead of comparing all data items of bucket Bi with Bi++, virtual points Pk are created for each bucket (Bi-Bn), and the virtual point of Bi is used to compare the data items present in another group Bi++. The objective here is to reduce the number of unwanted domination tests and prune unwanted data items. In order to identify the correct skylines, a set of data items represented as shadow skylines is used to dominate some data items across the buckets. This set of data items, however, is dominated by virtual points. Despite these enhancements to the Bucket algorithm, ISkyline’s performance is affected by the order of the initial data. Using the virtual point technique may also produce incorrect results: a data item in bucket Bi++ that should be in the final skyline may be dominated because the virtual point of bucket Bi is used, or vice versa. Moreover, ISkyline involves n virtual points, which are derived from the local skylines of the buckets and placed on top of each bucket to prune the dominated local skylines. However, the number of virtual points increases when the number of local skylines increases, which in turn increases the number of pairwise comparisons. The performance of ISkyline is thus highly influenced by the number of virtual points.

RSSSQ [35] uses the same concept of Khalefa et al. [2] by replacing missing values with certain numbers that are larger than the domain value in order to avoid losing the transitivity property of the skyline and the issue of cyclic dominance. Miao et al. [36] proposed three algorithms known as baseline, virtual point (VP) and k-iSkyband (kISB). Their baseline algorithm requires a high number of pairwise comparisons to retrieve the skylines. In addition, the correction of data items in the buckets is ignored. The VP algorithm overcomes the issue of data item correction. Lastly, kISB further optimizes VP by reducing the domination test process and eliminating redundant data storage.

Based on [2], Bharuka and Kumar [37] proposed the SIDS algorithm (Sorting-based Incomplete Data Skyline) to improve the ISkyline algorithm. SIDS pre-sorts the input data to rearrange the position of the data items. This rearrangement places the data items that are most likely to dominate other data items at the top of the list. SIDS uses the sorting technique of [13, 38], where the values of each dimension are sorted in descending order without counting the missing dimensions. Domination tests are done by selecting a data item P from each dimension according to the round-robin method and comparing it with the data items C present in the same dimension. The variable ProcessedCount (pc) records the domination tests of the data item P. Dominated data items are subsequently removed from the candidate set. If the value pc of P is equal to the number of non-missing dimensions of P, then P is moved to ResultSet. When all candidate skylines (data items) have been processed at least once, the domination process stops and the remaining data items in the candidate set are also moved to ResultSet. The ResultSet consists of the final skylines. However, in the SIDS approach, the lists have to be accessed in sequential order, and the system has to receive the results of all lists before moving to the next phase. Thus, increasing the number of lists may degrade the performance of the skyline process and delay generating the skylines for the end user. Moreover, SIDS is not fully efficient in handling all types of incomplete datasets, as it is based on the datasets used in [13, 38]. SIDS lacks any optimization that could simplify the process of identifying the skylines. It identifies skylines by accessing each data item in sequential order. This sequential access makes the pairwise comparison process costly, as many unnecessary pairwise comparisons are needed to eliminate the dominated data items.

The work in [39] addresses the issue of processing skyline queries in incomplete datasets by proposing an approach named Incomplete Data Frequent Skyline (IDFS). IDFS adopts the top-k frequent skyline technique suggested in [12] in order to control the size of the skyline results. It utilizes the concept of top-k to derive superior skylines based on the fractional skyline frequency of each data item, which enables it to identify the superior skylines for a database with missing values. Experimental results have reportedly shown the efficiency of IDFS using real and synthetic datasets. However, its performance degrades considerably if the examined space for determining the frequent skyline data items is very large.

A number of alternative approaches have been proposed for processing skyline queries in incomplete data, such as the COBO framework [40] and SOBA [41]. Based on the COBO framework, Zhang et al. [40] proposed an algorithm (ISSA) that identifies skylines in two phases. In the first phase, the bucket technique from [2] is adopted to eliminate most of the dominated data items at an early stage. In the second phase, a function aggregates the values of the non-missing dimensions of the data items and sorts those data items in descending order to prune dominated data items. SOBA, on the other hand, uses the same approach of partitioning the data items into subsets based on the same namespace. For optimization, SOBA sorts the data items of each subset in ascending order to identify local skylines. Unlike ISSA, SOBA compares data items with one another across buckets without first pruning the dataset by eliminating the dominated data items. This necessitates many unwanted pairwise comparisons among data items during the skyline process.

To the best of our knowledge, the latest works on processing skyline queries on incomplete databases are those of Alwan et al. [42] and Wang et al. [43]. Alwan et al. [42] developed the Incoskyline algorithm for processing skyline queries in multidimensional and incomplete databases. Incoskyline improves on ISkyline by using a different way of reducing pairwise comparisons and pruning non-skyline data items as early as possible. In the Incoskyline algorithm, the initial dataset is divided into a set of clusters Cn, each cluster containing data items of the same namespace (same bitmap representation). Each cluster is divided into two groups, Ci.G1 and Ci.G2. The first group contains all the data items that possess the highest values in any dimension, while the other group contains the data items with the second highest values. Subsequently, the local skylines are found by comparing the data items with one another in each group. After combining the local skylines of each group in one list, domination tests are carried out to determine the local skylines of each cluster (Ci-Cn). A virtual data item called k-dom is created by selecting the highest values of all dimensions of any data item in cluster Ci and Ci + 1. k-dom is used to compare data items of Ci + 1 instead of comparing all data items of Ci with Ci + 1. Thus, for cluster C1, k-dom is created from C2 and C3; for cluster C2, k-dom is created from clusters C3 and C4. Likewise, for cluster Cn, k-dom is created from clusters Cn-1 and C1. Although Incoskyline improves on ISkyline in terms of execution time and pairwise comparisons, it lacks result accuracy due to its use of a virtual data item (k-dom), which is similar to ISkyline’s virtual points as explained earlier.

Wang et al. [43] recently introduced the SPQ approach (Skyline Preference Query). SPQ adopts the SIDS [37] approach. However, it divides the initial data into two subsets based on preference. The first subset contains all those dimensions given high priority by the user. The SIDS approach is applied to the first subset to retrieve the local skylines. For the other subset, the D&C [2] technique is used to divide the data items based on their bitmap representation and identify the local skylines. In order to retrieve the final skylines, the local skylines of the first subset are compared to the local skylines of the second subset. Although SPQ improves on SIDS, both approaches involve the creation of multiple arrays, each of which is processed sequentially, which slows down processing. Furthermore, many unwanted pairwise comparisons are performed in order to identify the local skylines of each subset, since no filtration is done to prune the unwanted data items as early as possible.

In clear contrast, our proposed algorithm (SCSA) uses a new hybrid approach that combines the power of sorting and partitioning to prune the unwanted data items before further processing. Unlike SPQ and SIDS, the proposed approach uses several optimization techniques that allow the pruning of the dominated data items before applying the skyline operation. This is achieved during the sorting and filtering and selecting superior local skylines phases. We attempt to identify those data items that are most likely to be contained in the final skyline result by identifying the domination power of each data item. Also, partitioning the data items based on their domination power value enhances the performance of our solution.

3 Definitions and notations

In this section, we provide a number of definitions and notations related to skyline queries in incomplete databases. These definitions and notations help clarify our proposed approach.

Table 1 summarizes the symbols used throughout the paper. These terms are further explained below. Our approach has been developed in the context of incomplete relational databases, D. A relation of the database D is denoted by R (d1, d2..., dm) where R is the name of the relation with m-arity and d = (d1, d2, ..., dm) is the set of dimensions.

  • Definition 1 Skyline: The skyline technique retrieves the skyline set S such that no data item in S is dominated by any other data item in the database.

  • Definition 2 Dominance: Given two data items pi and pj ∈ D with d dimensions, pi dominates pj (greater is better), denoted by pi ≻ pj, if (and only if) the following condition holds: ∀ dk ∈ d, pi.dk ≥ pj.dk ∧ ∃ dl ∈ d, pi.dl > pj.dl.

  • Definition 3 Skyline Queries: Select a data item pi from the database D if (and only if) pi is as good as pj (where i ≠ j) in all dimensions (attributes) and strictly better than pj in at least one dimension (attribute). We use Sskyline to denote the set of skyline data items, Sskyline = {pi | ∀ pi, pj ∈ D, pi ≻ pj}.

  • Definition 4 Incomplete Database: Given a database D (R1, R2, ..., Rn), where Ri is a relation denoted by Ri (d1, d2, ..., dm), D is said to be incomplete if (and only if) it contains at least one data item pj with missing values in one or more dimensions dk (attributes); otherwise, it is complete.

  • Definition 5 Comparable: Let ai and aj ∈ R be two data items. ai and aj are comparable (denoted by ai ε aj) if (and only if) both have non-missing values in at least one identical dimension; otherwise, ai is incomparable to aj (denoted by ai ε/ aj).
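Definitions 2 and 5 can be expressed compactly in code. The sketch below assumes that missing values are encoded as `None` and that, for incomplete items, the dominance test of Definition 2 is evaluated only over the dimensions present in both items; this restriction, the function names, and the sample items are assumptions made for illustration.

```python
from typing import Optional, Sequence

Item = Sequence[Optional[float]]  # None stands for a missing value (*)


def comparable(a: Item, b: Item) -> bool:
    """Definition 5: both items have a value in at least one common dimension."""
    return any(x is not None and y is not None for x, y in zip(a, b))


def dominates(a: Item, b: Item) -> bool:
    """Definition 2 over the common non-missing dimensions (greater is better)."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    no_worse = all(x >= y for x, y in common)
    strictly_better = any(x > y for x, y in common)
    return bool(common) and no_worse and strictly_better


p = (4, None, 7)
q = (3, 9, None)
r = (None, 2, None)

print(comparable(p, q), dominates(p, q))  # True True
print(comparable(p, r), dominates(p, r))  # False False
```

Incomparable items can never dominate each other, which is one reason why the number of skylines tends to grow as more values go missing.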

Table 1 Symbols and description

4 SCSA algorithm

We propose a new efficient algorithm called Sorting-based Cluster Skyline Algorithm (SCSA) for deriving skylines in a database with incomplete data. The proposed algorithm consists of five phases: Sorting and Filtering, Clustering and Grouping, Identifying Local Skylines, Selecting Superior Local Skylines, and Retrieving Final Skylines, as illustrated in Fig. 2. These five phases are further explained in the following subsections.

Fig. 2 The phases of the proposed algorithm (SCSA)

In order to illustrate how SCSA works, a sample run has been conducted on the incomplete database shown in Fig. 3. The database example contains 40 data items with seven dimensions in which some values are not present (marked as *).

Fig. 3 An example of incomplete database

4.1 Sorting and filtering

In this phase, the data items of the initial dataset are sorted in descending order based on the values of each non-missing dimension, and the dominated data items with a low domination power value are then eliminated. Data items with a low domination power value are unlikely to contribute to the skyline results; thus, removing them before applying the skyline technique saves a large number of unnecessary pairwise comparisons and reduces the overhead of the skyline computation process. The process starts by sorting the data items into one distinct list per dimension according to the values of that dimension. It is worth noting that only the id of each data item is stored in the lists, which are scanned in a round-robin fashion to count the domination power of each data item. This process continues until all the data items of the initial dataset have been read at least once. Those data items with a domination power value less than the user-defined threshold (th) are removed, as they may be dominated by other data items in one single dimension. The eliminated data items are thus not expected to be part of the skyline result, as they are most likely to be dominated by other data items with a domination power value greater than th. This filtration process helps simplify the skyline operation by reducing the number of pairwise comparisons between data items.

As part of the sample run, the initial dataset D is sorted in descending order for each dimension d1–d6. A set of lists u1–u6 is constructed to store the sorted data items based on the corresponding dimension. Figure 4 depicts the sorted data items according to each dimension in the dataset. It can be observed that the six constructed lists u1–u6 correspond to the dimensions d1–d6 in the dataset D.

Fig. 4 List of sorted arrays

The data items of the constructed lists are scanned in a round-robin fashion to calculate the domination power for each data item. This process continues until all the data items of the dataset have been read at least once. The following equation demonstrates the formula of computing the domination power (dp) of the data item.

$$ dp=\sum \limits_{k=1}^u{dt}_i,\kern0.5em \mathrm{iff}\ d{t}_i.k\succ d{t}_j.k $$

In our sample run, the process works as follows: the first data item m8 in list u1 is read and its domination power dp is increased by 1, which records the appearance of m8. In the second iteration, the first data item m30 in list u2 is read and its dp is also increased by 1. In the third iteration, the data item m30 is read again in list u3, and its dp value is increased by 1 to become 2. This process continues until all data items of the dataset have been read and their dp computed. Based on the example (Fig. 4), the process terminates at the 104th iteration, where m27 in list u2 is read. The process terminates after reading m27 because the termination condition has become true (number of distinct scanned data items = number of data items in the dataset).
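The round-robin scan described above can be sketched as follows. The sketch assumes that each list holds only the ids of items with a value in that dimension, that one position of every list is read per round, and that the scan stops as soon as the last unread item has been seen; the function names and the four-item dataset are hypothetical and do not reproduce the data of Fig. 3.

```python
from typing import Dict, List, Optional, Sequence

Item = Sequence[Optional[float]]  # None marks a missing value


def sorted_lists(data: Dict[str, Item]) -> List[List[str]]:
    """One id list per dimension, in descending value order, missing values skipped."""
    m = len(next(iter(data.values())))
    lists = []
    for k in range(m):
        present = [(item_id, v[k]) for item_id, v in data.items() if v[k] is not None]
        present.sort(key=lambda pair: pair[1], reverse=True)
        lists.append([item_id for item_id, _ in present])
    return lists


def domination_power(lists: List[List[str]], n_items: int) -> Dict[str, int]:
    dp: Dict[str, int] = {}
    unread = n_items
    longest = max((len(u) for u in lists), default=0)
    for depth in range(longest):          # one round over all lists per depth
        for u in lists:
            if depth >= len(u):
                continue                  # this list is already exhausted
            item_id = u[depth]
            if item_id in dp:
                dp[item_id] += 1          # item seen again near the top of a list
            else:
                dp[item_id] = 1
                unread -= 1
                if unread == 0:
                    return dp             # every item has been read at least once
    return dp


data = {"a": (9, 8, None), "b": (7, None, 6), "c": (None, 5, 9), "d": (1, 2, 3)}
dp = domination_power(sorted_lists(data), len(data))
print(dp)  # {'a': 2, 'c': 2, 'b': 2, 'd': 1}
```

Item d only shows up at the bottom of the lists, so it is read once before the scan stops, while items that rank high in several dimensions accumulate a larger dp.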

The scanning process helps facilitate the filtration process by utilizing the domination power value of each data item. Figure 5 illustrates the scanned data items and their domination power.

Fig. 5 Domination power of each data item

During the filtration process, all data items with a domination power lower than the user-defined threshold (th) are eliminated from further processing as all data items with dp < th are not likely to be part of the skyline result given that they are good only in one dimension. The main idea of filtering lies in exploiting the domination power value to further simplify the skyline process in incomplete databases with multiple dimensions and large dataset size. These data items can thus be safely removed since the elimination of these data items will not affect the skyline results. We represent the dp-list set using the following formula.

$$ dp\text{-}list=\left\{d{t}_i:\ dp\ \mathrm{of}\ d{t}_i\ge th\right\}, $$

where i = 1, …, n and th is the minimum defined threshold value of dp.

For instance, if th is set to 2, the data item m25 has to be removed as its dp < 2. m25 is dominated in the common non-missing dimensions by the comparable data items m29, m30, and m32. Hence, m25 should be removed before conducting unnecessary pairwise comparisons between data items. In total, the data items m5, m19, m25, m11, m24, m35, m33, m36, m14, m28, and m27 are removed since their dp is < 2. A sizeable number of data items is thus removed prior to applying the skyline technique, which means that unnecessary pairwise comparisons are avoided. Figure 6 depicts the remaining data items sorted in descending order based on their domination power.
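The filtration step itself is a simple threshold test on the dp values. The sketch below assumes that the dp values are already available as a dictionary; the function name `filter_by_dp` and the sample values are hypothetical.

```python
from typing import Dict


def filter_by_dp(dp: Dict[str, int], th: int = 2) -> Dict[str, int]:
    """Keep items with dp >= th, ordered by descending domination power."""
    kept = {item_id: power for item_id, power in dp.items() if power >= th}
    return dict(sorted(kept.items(), key=lambda pair: pair[1], reverse=True))


dp = {"a": 2, "b": 3, "c": 1, "d": 2}  # hypothetical domination powers
print(filter_by_dp(dp, th=2))  # {'b': 3, 'a': 2, 'd': 2}
```

Raising `th` removes more items before the skyline computation starts, at the price of a more aggressive pruning rule.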

Fig. 6 Data items after filtration

We argue that this sorting and filtering process simplifies the skyline process by eliminating unwanted data items before applying the skyline technique. In our running database example, 11 out of 40 data items are deleted before commencing the skyline process, which removes about 27% of the data items from the pairwise comparison process of the skyline. For the sake of simplicity and without loss of generality, we assume that all data items with a dp value < 2 must be removed from further processing. However, our approach can also accommodate cases where the user chooses a different minimum th value for removing the dominated data items. For instance, in cases where the number of dimensions is large (here meaning more than eight dimensions), the filtration condition can be set to remove all data items with dp < 3 or 4. A larger number of dominated data items can then be removed, which reduces the number of pairwise comparisons between data items in the skyline process.
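As a rough check of the numbers in the running example, assume purely for illustration a naive skyline computation that may compare every pair of data items in the worst case (this is not a cost model of SCSA itself). The share of removed data items and the corresponding worst-case numbers of pairwise comparisons before and after filtration are then:

$$ \frac{11}{40}=0.275,\qquad \frac{40\cdot 39}{2}=780,\qquad \frac{29\cdot 28}{2}=406,\qquad 1-\frac{406}{780}\approx 0.48 $$

Under this assumption, removing 27.5% of the data items would cut the worst-case number of pairwise comparisons by roughly 48%, since the number of pairs grows quadratically with the number of data items.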

Figure 7 illustrates the detailed steps of the sorting and filtering algorithm. In this algorithm, the data items of dataset D are sorted in descending order for each dimension and stored in the constructed lists (steps 1–4). In step 5, a 2D array (dp-list) is constructed to store the ID of all data items present in D and their domination power value. A variable AllDataItemsRead is initialized to the total number of data items in D (step 6). In step 8, each row dri of List is selected. If DimCountDone is true, the scanning process terminates (step 9); if not, the data item index in column ui of dri is read (step 12). If AllDataItemsRead is greater than 0 (step 13), then uj is checked to determine whether it has been read before (step 14). If it has been read before, the dp of uj is incremented by 1 (step 15); if not, uj is added to dp-list (step 17) and its dp is set to 1 (step 18) before AllDataItemsRead is decremented by 1 (step 19). If AllDataItemsRead is not greater than 0, DimCountDone is set to true (step 22) and the process is terminated (step 25). Steps 8–26 are repeated in order to read all the data items from List at least once.

Fig. 7 The algorithm for sorting and filtering

The scanning process is executed in a round-robin fashion. By the end of the scanning process, the dp-list contains all the data items present in D and their domination power value (dp). The dp indicates how many times a data item (dt) has been read. The maximum value of dp for each data item is equal to the number of non-missing dimensions for that particular data item. Steps 27–31 represent the filtration process, which starts by reading each data item present in dp-list (step 27). If dp of dti is <2 (step 28), then it is removed from the dp-list (step 29). The process ends by returning the filtered dp-list (step 32).
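The sorting, round-robin scanning, and filtration steps described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation evaluated in this paper: the function names `domination_power` and `filter_dominated`, the use of `None` for missing values, the four-item sample data, and the reading that scanning stops as soon as every data item has been read once are all assumptions made for the example. Larger values are taken to be better.

```python
def domination_power(data):
    """Round-robin scan of the per-dimension sorted lists; returns {id: dp}."""
    n_dims = len(next(iter(data.values())))
    # One list per dimension: IDs with a non-missing value, largest value first.
    sorted_lists = [
        sorted((k for k, v in data.items() if v[d] is not None),
               key=lambda k, d=d: data[k][d], reverse=True)
        for d in range(n_dims)
    ]
    dp = {}
    unread = len(data)  # plays the role of AllDataItemsRead
    depth = max(len(lst) for lst in sorted_lists)
    for row in range(depth):          # one row at a time ...
        for lst in sorted_lists:      # ... visiting each dimension in turn
            if unread == 0:           # every item has been read once: stop
                return dp
            if row >= len(lst):       # this dimension has no more values
                continue
            item = lst[row]
            if item in dp:
                dp[item] += 1         # read before: increment its dp
            else:
                dp[item] = 1          # first read: add it to the dp-list
                unread -= 1
    return dp


def filter_dominated(dp, th=2):
    """Keep only the data items whose dp reaches the threshold th."""
    return {k: v for k, v in dp.items() if v >= th}


# Hypothetical sample data (not the running example of the paper).
data = {
    "a": (9, 8, None),
    "b": (7, None, 9),
    "c": (1, 2, 1),
    "d": (None, 9, 8),
}
dp_list = domination_power(data)
filtered = filter_dominated(dp_list)
```

In this sample, item c is read only once before the scan stops, so it is the only item removed by the default threshold of 2.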

4.2 Clustering and grouping

This phase aims at further simplifying the skyline process in a database with incomplete data by partitioning the data items into smaller clusters based on their domination power generated in the previous phase. The idea of clustering relies on grouping data items with similar domination power in one cluster. This is achieved by scanning the list of the remaining data items after filtration and placing the data items with similar domination power in one cluster. The number of created clusters equals (d − d′ − dpmin), where d is the total number of dimensions in the database excluding the primary key dimension, d′ is the number of dimensions with missing values, and dpmin denotes the minimum value of dp for the removed dominated data items. Distributing filtered data items into different clusters based on their domination power significantly reduces the number of pairwise comparisons necessary to generate the final skylines. This is due to the fact that clustering essentially applies the divide-and-conquer technique, which has been proven to be an effective way of processing skyline queries in database systems [1, 2, 41,42,43].

Hence, this process helps reduce the number of data items to be compared with each other, which in turn helps avoid many unwanted pairwise comparisons without compromising the correctness of the skyline result.

The detailed steps of clustering are as follows. First, the list of filtered data items is scanned to identify the highest domination power (dp) value in the list, and a new cluster Ci is created that contains all data items with this highest dp value. Another new cluster Cj is then created for the next highest dp value and contains the data items whose dp equals that value. This process continues until the dp value of the remaining data items is below the user-defined threshold value th. The following formula expresses the process of creating clusters based on the domination power dp of the data items.

$$ {C}_j=\left\{d{t}_i: dp\left(d{t}_i\right)= hdp\right\} $$

where hdp is the highest dp value in the dp-list.

Data items in a cluster may have different bitmap representations, which makes it very difficult to apply the skyline technique because data items with incompatible bitmap representations cannot be compared directly. This problem is caused by values missing in one or more dimensions of the data items, which makes it impractical to perform pairwise comparisons between data items in a way that ensures that the transitivity property of skylines always holds. Most importantly, the issue of cyclic dominance inevitably arises. Thus, these large clusters need to be further divided into smaller, manageable groups to avoid these problems. The purpose of grouping is to create groups from each cluster based on the common bitmap representation of data items. We use the bitmap representation to determine the group to which each data item belongs. For instance, consider a set of data items a1, a2, a3 in a cluster Ci, where each data item has three dimensions in total and one dimension with a missing value: a1 (4, 5, *), a2 (4, *, 9) and a3 (1, 3, *). The bitmap representation of a1 is 110, where 1 denotes that the value is present in that particular dimension while 0 indicates that the value is missing. Similarly, the bitmap representations of a2 and a3 are 101 and 110, respectively. Therefore, two groups, G1 and G2, must be created, where G1 contains a1 and a3 while G2 contains a2. Aggregating data items with the same subspace allows for smooth pairwise comparisons, thus sustaining the transitivity property of the skyline and excluding the problem of cyclic dominance [2, 4, 41, 42]. The following formula expresses the process of adding data items to their corresponding groups.

$$ {G}_j=\left\{d{t}_i: d{t}_i. bitmap={G}_j. bitmap\right\} $$
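The bitmap-based grouping of a1, a2, and a3 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: `bitmap` and `group_by_bitmap` are hypothetical helper names, and the missing value * is encoded as `None`.

```python
from collections import defaultdict


def bitmap(values):
    """Return '1' for each present value and '0' for each missing one."""
    return "".join("0" if v is None else "1" for v in values)


def group_by_bitmap(cluster):
    """Place data items with an identical bitmap in the same group."""
    groups = defaultdict(list)
    for item_id, values in cluster.items():
        groups[bitmap(values)].append(item_id)
    return dict(groups)


# The three data items from the example in the text (* written as None).
cluster = {"a1": (4, 5, None), "a2": (4, None, 9), "a3": (1, 3, None)}
groups = group_by_bitmap(cluster)
```

Here `groups` maps bitmap 110 to a1 and a3 (group G1) and bitmap 101 to a2 (group G2), matching the example.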

Grouping also helps reduce the number of pairwise comparisons by limiting the comparisons to the data items within one group rather than comparing all the data items in one cluster. This simplifies the skyline process by decreasing the number of pairwise comparisons, which in turn reduces the execution time of the skyline process. We believe clustering and grouping contribute significantly to improving the performance of the skyline process by eliminating a large number of unnecessary pairwise comparisons.

Figure 8 demonstrates the result of the clustering process conducted on our running database example. From the figure it can be observed that four distinct clusters have been created (C1, C2, C3, C4). Cluster C1 contains one data item m29 with dp = 5, which means that m29 has appeared five times in the list of sorted arrays. Likewise, clusters C2, C3, and C4 contain data items with the next highest dp values of 4, 3 and 2 respectively. Furthermore, the data items belonging to each cluster are further separated and distributed in distinct groups based on similar bitmap representation.

Fig. 8 Data items after clustering

Figure 9 depicts the output of the grouping process applied to the clusters created for our running database example. It can be observed that only one group with the bitmap representation (111101) has been created from cluster C1. Similarly, four groups (G1, G2, G3, G4) with different bitmap representations have been created from the data items of cluster C2. Creating smaller groups further simplifies the skyline process: this divide-and-conquer technique further lowers the number of data items to be compared against each other.

Fig. 9 List of clusters divided into groups

Figure 10 details the steps of the data clustering algorithm. The input of the algorithm includes the dp-list and the initial dataset with incomplete data, while the output consists of a list of distinct clusters. The algorithm works as follows: Initially, each data item dti in the dp-list is read. If the dp value of the data item dti is equal to the dp value of any created cluster Cj (step 2), the data item is inserted into the corresponding cluster Cj (step 3). Otherwise, a new cluster Ck is created (step 5) and the data item dti is inserted into cluster Ck (step 6). This process continues until each data item in the dp-list has been read and inserted into one of the created clusters. Finally, the algorithm returns a list of distinct clusters, each containing data items with the same dp value (step 9).

Fig. 10 The algorithm for clustering
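A minimal Python sketch of this clustering step is given below. The function name `cluster_by_dp`, the dictionary layout, and the sample dp values are assumptions made for illustration and are not taken from the running example.

```python
def cluster_by_dp(dp_list):
    """Group item IDs by their dp value, highest dp first."""
    clusters = {}
    for item_id, dp in dp_list.items():
        if dp in clusters:
            clusters[dp].append(item_id)   # a cluster for this dp exists
        else:
            clusters[dp] = [item_id]       # create a new cluster
    # Order the clusters from the highest to the lowest dp value.
    return dict(sorted(clusters.items(), reverse=True))


# Hypothetical filtered dp-list: item ID -> domination power.
sample_dp_list = {"q": 4, "p": 5, "s": 2, "r": 4}
clusters = cluster_by_dp(sample_dp_list)
```

Keying the clusters by dp value makes the membership test in step 2 a single dictionary lookup.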

Figure 11 clarifies the steps of creating groups from the constructed clusters. The input of the algorithm is a cluster containing data items with different bitmap representations, whereas the output is a list of distinct groups based on bitmap representation. Each data item dti belonging to cluster Cj is read (step 1). If the bitmap representation of the data item dti is equivalent to the bitmap representation of any previously created group Gk (step 2), the data item is inserted into group Gk (step 3). Otherwise, a new group Gk is created (step 5) and the data item dti is inserted into this group (step 6).

Fig. 11 The algorithm for grouping

Eventually, a set of groups with distinct bitmap representations is returned (step 9). It should be noted that this process runs simultaneously on all clusters, since they are independent of each other and creating groups from one cluster has no effect on the other clusters. Therefore, this process can be conducted in parallel, which speeds up the skyline process in a database with incomplete data.

We noticed that the idea of generating the domination power and clustering data items based on their domination power values is very beneficial. It significantly reduces the complexity of the skyline process in incomplete data by lowering the number of data items to be considered for pairwise comparison. Therefore, the number of pairwise comparisons that need to be conducted is further decreased, which in turn results in less processing time. Constructing groups based on bitmap representation further simplifies the skyline process in incomplete data, as it ensures that the transitivity property of the skyline technique is preserved and that the problem of cyclic dominance is avoided.

4.3 Identifying local skylines

In this phase, the local skylines of each cluster are identified, and all dominated data items are removed and excluded from further processing. This helps reduce the number of pairwise comparisons that need to be made in order to identify the final skyline of the dataset. The process begins by identifying the skylines of each group belonging to each created cluster. This step is important as it ensures the smooth execution of the pairwise comparison process, since all data items in a group have the same bitmap representation. Thus, the loss of the transitivity property and the issue of cyclic dominance can be avoided. The skylines of the groups are then compared with each other to derive the local skylines of the cluster. This process reduces the data items to be considered in the subsequent phases to the non-dominated data items. Hence, the complex skyline process is considerably simplified. The process runs in parallel, whereby the local skylines of all clusters are generated simultaneously. Exploiting this parallel execution accelerates the skyline process and shortens the total execution time.

In our running database example, the data items present in each group of the constructed clusters are compared to each other in order to identify the group skylines. As shown in Fig. 9, group C1.G1 of cluster C1 contains only one data item, m29, which is thus considered the skyline of group C1.G1. Similarly, four groups, C2.G1, C2.G2, C2.G3 and C2.G4, are created from cluster C2. The skylines of each group are identified by comparing all data items within the group to one another. The dominated data items in each group are discarded and eliminated from further processing. In a similar fashion, the group skylines of clusters C3 and C4 are identified. It is important to note that the process of identifying the group skylines of all clusters is performed simultaneously. This is possible because the groups are independent of each other. This parallel execution speeds up the process of identifying the group skylines of the clusters. The group skylines of each cluster are then compared to each other in order to determine the local skyline of that cluster. Figure 12 shows the local skylines of the clusters for our running database example. We can conclude that most dominated data items have been removed and excluded from further processing.

Fig. 12 Local skylines of each cluster

Figure 13 describes the detailed steps of the process of identifying the group skylines. The input of the algorithm is a set of data items in group Gj, while the output is a set of group skylines. The algorithm works as follows: In step 1, each data item dti of group Gj is read. A pointer ptr indicates data item dti (step 2). Each data item dtk in Gj is then read, starting from k = i + 1 (step 4). A pointer cw points to dtk (step 5). The values of the non-missing dimensions of ptr and cw are compared by calling algorithm 6 (step 6). If the comparison returns 1 (step 7), the data item pointed to by cw is removed from Gj (step 8), and if the comparison returns 2 (step 9), IsDominated is set to true (step 10). This process is repeated until all the data items of Gj have been compared to the data item pointed to by ptr. If the data item pointed to by ptr is dominated by any data item pointed to by cw (step 13), the data item pointed to by ptr is removed from Gj (step 14). This process continues until all the remaining data items of the group have been compared to each other. The algorithm ends by returning the group skylines.

Fig. 13 Algorithm for group skylines
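The ptr/cw scan over a single group can be sketched in Python as shown below. This is an illustration under stated assumptions, not the evaluated implementation: `dominates` and `group_skyline` are hypothetical names, missing values are encoded as `None`, larger values are better, and a Boolean dominance test stands in for the integer result of the comparison algorithm.

```python
def dominates(a, b):
    """True if a is no worse than b on every shared dimension and better on one."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def group_skyline(group):
    """Return the IDs of the non-dominated items of one group."""
    items = list(group.items())            # [(id, values), ...]
    i = 0
    while i < len(items):
        ptr_dominated = False              # IsDominated flag for ptr
        k = i + 1
        while k < len(items):
            cw = items[k][1]
            if dominates(items[i][1], cw):
                del items[k]               # cw is dominated: remove it now
                continue
            if dominates(cw, items[i][1]):
                ptr_dominated = True       # remember, but keep scanning
            k += 1
        if ptr_dominated:
            del items[i]                   # remove ptr after the full scan
        else:
            i += 1
    return [item_id for item_id, _ in items]


# Hypothetical groups whose items all share the bitmap 110.
skyline_a = group_skyline({"g1": (4, 5, None), "g2": (1, 3, None), "g3": (5, 2, None)})
skyline_b = group_skyline({"g1": (4, 5, None), "g2": (1, 3, None),
                           "g3": (5, 2, None), "g4": (5, 5, None)})
```

Note that ptr is removed only after it has been compared with every remaining item, so a dominated ptr can still eliminate the items it dominates.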

Figure 14 details the steps of the algorithm used to identify the local cluster skylines. The input of the algorithm consists of the set of groups of a cluster, while the output consists of the set of local skylines of that cluster. The algorithm works as follows: In step 1, a group Gi of cluster Cj is selected. The data item dtk of Gi is read (step 2). A pointer ptr denotes data item dtk (step 3). Then, a new group Gj++ of Cj is selected (step 5) before each data item dtl of group Gj++ is read (step 6), followed by setting a pointer cw to the data item dtl (step 7). A pairwise comparison between the data items pointed to by ptr and cw is conducted (step 8). If the result returned from the comparison is 1 (step 9), the data item pointed to by cw is removed from Gj++ (step 10), and if the result of the comparison is 2 (step 11), IsDominated is set to true (step 12). This process is repeated until all data items of Gj++ have been compared to the data item pointed to by ptr (steps 6–14). The process continues by comparing ptr with the data items of the other groups of Cj (steps 5–15). If a data item pointed to by cw in any group dominates the data item pointed to by ptr (step 16), the data item pointed to by ptr is removed from Gj (step 17). This process continues until all the remaining data items of group Gj have been compared to the data items of the other groups (steps 2–19). The procedure is complete once all groups have been compared to each other and the skylines of the cluster have been identified (steps 1–20), and it ends by returning the local skylines of the cluster (step 21).

Fig. 14 Algorithm for cluster local skylines

Figure 15 presents the steps of the pairwise comparison algorithm. The input of the algorithm consists of two different data items, dti and dtj, while the output is an integer value that denotes the result of the comparison. The process begins by reading the value of each dimension dk of dti (step 1). If the value of dk is missing in dti or dtj, or the values of dk in dti and dtj are equal (step 2), the remaining statements are skipped and control returns to step 1 (step 3). If the value of dk of dti is better than the value of dk of dtj (step 5), PDom is set to true (step 6). However, if the value of dk of dti is worse than the value of dk of dtj (step 7), CDom is set to true (step 8). If both PDom and CDom are true (step 9), 0 is returned (step 10). This process continues until all dimensions of dti and dtj have been compared (steps 1–13). If only PDom is true (step 14), 1 is returned (step 15). However, if only CDom is true (step 16), 2 is returned (step 17).

Fig. 15 Comparison algorithm
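The comparison steps translate almost directly into Python. The sketch below assumes that missing values are encoded as `None` and that larger values are better. Returning 0 when neither flag is set (no comparable dimension differs) is also an assumption, since the description above does not cover that case.

```python
def compare(dt_i, dt_j):
    """Return 1 if dt_i dominates dt_j, 2 if dt_j dominates dt_i, else 0."""
    p_dom = False   # dt_i is better in at least one dimension
    c_dom = False   # dt_j is better in at least one dimension
    for v_i, v_j in zip(dt_i, dt_j):
        # Skip dimensions that are missing in either item or that are equal.
        if v_i is None or v_j is None or v_i == v_j:
            continue
        if v_i > v_j:
            p_dom = True
        else:
            c_dom = True
        if p_dom and c_dom:
            return 0          # each item wins somewhere: incomparable
    if p_dom:
        return 1
    if c_dom:
        return 2
    return 0                  # assumption: nothing to compare or all equal


r1 = compare((4, 5, None), (1, 3, None))   # first item better in both dimensions
r2 = compare((1, 3, None), (4, 5, None))   # second item better in both dimensions
r3 = compare((4, 5, None), (5, 2, None))   # each item better in one dimension
r4 = compare((4, 5, None), (4, None, 9))   # only dimension 1 is shared, and it is equal
```

The early return as soon as both flags are set avoids reading the remaining dimensions of two incomparable data items.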

We can conclude at this point that applying the skyline technique during the local skyline identification phase helps eliminate many dominated data items. As evident from our database example, only 15 of the remaining 29 data items of the dataset are left for further processing. This represents a reduction of nearly 50% (14 of 29 data items).

4.4 Selecting superior local skylines

The main purpose of this phase is to further optimize the process of identifying the skylines in the incomplete database. This is achieved by eliminating some of the local cluster skylines produced during the previous phase before comparing them to the local skylines of other clusters. This step is necessary in order to avoid many unwanted pairwise comparisons while identifying the final skylines. To do so, we scan the available values in all dimensions and retain only the data items with the highest value in any of the non-missing dimensions. In other words, we first read the value of each data item in sequence by accessing only one dimension and mark each data item that has the highest value in that particular dimension. This process continues by accessing the remaining non-missing dimensions and marking the data items with the highest value in each. All unmarked data items are eventually removed from the local skyline of the cluster. This optimization step can greatly reduce the number of local skylines of the clusters that need to be considered in deriving the final skylines and also reduces the number of pairwise comparisons that need to be conducted in order to identify the final skylines. This process is carried out concurrently on all clusters, which makes the skyline process more efficient. The following formula generalizes the process of identifying the superior local skylines of each cluster.

$$ SLS\left({C}_i\right)=\left\{d{t}_i:\max \left(d{t}_i.{d}_j\right),\ \mathrm{where}\ 0\le {d}_j\le u\right\} $$

Figure 16 details the process of deriving the superior skylines of the clusters. The data items with shaded cells represent the superior skylines of the clusters. Here, m30, m31, m32, and m21 constitute the superior skylines of cluster C2, and m0 should be removed from C2 since it does not have the highest value in any of its non-missing dimensions and will be dominated by the local skyline m29 of cluster C1. Therefore, unnecessary pairwise comparisons between m29 and m0 are avoided, which in turn accelerates the skyline process. Similarly, data items m1, m37, m38, and m34 are removed since their values in the non-missing dimensions are not the highest and they will be dominated by the local skylines of the other clusters, C1 and C2.

Fig. 16 The process of selecting superior local skylines

Figure 17 illustrates the superior local skylines of each cluster. We can conclude that m29 is the superior skyline of cluster C1. Similarly, m30, m31, m32 and m21 are the superior skylines of cluster C2, m3, m6, and m39 are the superior skylines of C3, and m12 and m16 are the superior skylines of cluster C4.

Fig. 17 Superior local skylines of each cluster

Figure 18 details the steps of the algorithm for selecting the superior local skylines of a cluster. The input of the algorithm is the local skylines of a cluster, while the output is a list of superior local skylines of the cluster. The algorithm starts by selecting a dimension di (step 4) and then sets d_max to the highest available value in di (step 5). Each value in dimension di is read (step 6), and if the value is equal to d_max and the corresponding data item dtj is unmarked (step 8), the data item dtj is marked (step 9). This process continues until all the data items of cluster Ci have been read. The unmarked data items are removed from the list of local skylines of the cluster, CLS (step 14). Finally, the remaining data items in CLS are returned as the superior local skylines of cluster Ci (step 15).

Fig. 18 Algorithm for identifying superior local skylines
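The marking procedure can be sketched in Python as follows. The name `superior_local_skylines`, the encoding of missing values as `None`, and the four sample items are assumptions made for illustration only.

```python
def superior_local_skylines(local_skylines):
    """Keep the items that hold the maximum value in at least one dimension."""
    n_dims = len(next(iter(local_skylines.values())))
    marked = set()
    for d in range(n_dims):
        present = [v[d] for v in local_skylines.values() if v[d] is not None]
        if not present:                    # dimension missing in every item
            continue
        d_max = max(present)               # highest available value in d
        for item_id, values in local_skylines.items():
            if values[d] == d_max:
                marked.add(item_id)        # mark every holder of d_max
    # Unmarked items are dropped; the original order is preserved.
    return [item_id for item_id in local_skylines if item_id in marked]


# Hypothetical local skylines of one cluster.
cls = {
    "s1": (9, None, 2),
    "s2": (7, 8, None),
    "s3": (6, 5, 1),
    "s4": (None, 8, 4),
}
superior = superior_local_skylines(cls)
```

In this sample, s3 never holds the highest value in any dimension and is therefore the only item removed.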

The phase of selecting the superior local skylines includes a new optimization technique that significantly reduces the number of data items to be considered in the skyline process. The idea is novel and attempts to exploit the maximum available value in any dimension when selecting the superior local skylines. Therefore, we believe that this optimization technique significantly simplifies the skyline process in incomplete datasets. From our database example, it can be seen that five of the 15 remaining data items are eliminated from further processing. This represents a 33% reduction of the remaining data items.

4.5 Retrieving final skylines

In the final phase of our proposed approach to processing skyline queries in incomplete data, we identify the global skylines of the entire dataset. In other words, we retrieve the final skylines that are not dominated by any data item in the whole dataset. This objective is accomplished by comparing the superior local skylines of each cluster to each other and determining the final skylines. The process of this phase is identical to the process of identifying the local skylines of each cluster as already described in Section 4.3. In order to derive the final skylines, algorithm 5 is used. The input consists of the list of superior local skylines of each cluster, whereas the output consists of a set of final skylines, final_S.

In our running database example, the superior local skylines of cluster C1 are compared to the superior local skylines of C2, C3, and C4. If the superior local skyline of cluster C1 dominates superior local skylines of C2, C3, or C4, the dominated ones are removed, and if a superior local skyline of C2, C3, or C4 dominates the superior local skyline of C1, the latter is removed at the end of the iteration. Similarly, the superior local skylines of C2 are compared to the remaining superior local skylines of C3 and C4. Finally, the superior local skylines of C3 are compared to the remaining superior local skylines of C4. Eventually, the remaining superior local skylines of all clusters are reported as the final skylines, final_S, of the dataset. Figure 19 depicts the final skylines of the given dataset D.

Fig. 19 Final skylines

5 Experimental environment

In order to evaluate the efficiency and the performance of our proposed approach (SCSA) developed for processing skyline queries in incomplete databases, we have compared our approach to the most recent skyline approaches designed for incomplete databases, namely SPQ [43], Incoskyline [42], SIDS [37], and Iskyline [2].

All approaches considered in this research work have been implemented using the C# programming language. A comprehensive set of experiments has been conducted on an i3 1.6 GHz PC with 3 GB of memory running the 32-bit Windows 7 platform. Since skyline query processing is argued to be a CPU-intensive process [1, 2, 12, 13, 35, 37, 39, 42], the experiments in this work use two performance metrics: the number of pairwise comparisons between the data items and the processing time. In all experiments, these two metrics are measured by varying the number of dimensions of the database, the number of dimensions with incomplete data, and the total size of the database. Both synthetic and real datasets are used in this paper. Two types of synthetic datasets are generated for the experiments. The first is an independent dataset, whose values in one dimension are unrelated to the values of the other dimensions. The second is a correlated dataset, whose values in one dimension are influenced by the values of the other dimensions. In addition, three real datasets are selected: the NBA, MovieLens, and CoIL 2000 insurance company datasets. These datasets are more realistic and are frequently used by researchers in the area of processing skyline queries in complete and incomplete database systems [1, 2, 4, 12, 13, 15, 37, 39, 42]. Also, some of them suit our experimental conditions as they are initially incomplete (MovieLens and CoIL 2000 insurance company). For the sake of simplicity and without loss of generality, we opted for a query statement that retrieves the skylines with the highest values. Table 2 summarizes the range of parameter values for the synthetic and real datasets used to evaluate the proposed approach for handling skyline queries in incomplete databases presented in this paper.

Table 2 The parameter settings of the synthetic and real datasets in the experiments for incomplete database
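To make the distinction between the two synthetic dataset types concrete, the following Python sketch generates an independent and a correlated dataset with missing values. It is purely illustrative and is not the generator used in the experiments: the function `make_dataset`, the value range [0, 1], the noise width of 0.1, the per-value missing rate, and the sizing by number of items (rather than in KB) are all assumptions.

```python
import random


def make_dataset(n_items, n_dims, missing_rate, correlated, seed=0):
    """Return a list of tuples with values in [0, 1] and None for missing."""
    rng = random.Random(seed)              # seeded for reproducibility
    rows = []
    for _ in range(n_items):
        if correlated:
            # All dimensions stay close to one shared base value.
            base = rng.random()
            row = [min(1.0, max(0.0, base + rng.uniform(-0.1, 0.1)))
                   for _ in range(n_dims)]
        else:
            # Each dimension is drawn independently of the others.
            row = [rng.random() for _ in range(n_dims)]
        # Blank out each value with probability missing_rate.
        row = [None if rng.random() < missing_rate else v for v in row]
        rows.append(tuple(row))
    return rows


independent = make_dataset(100, 6, missing_rate=0.2, correlated=False)
correlated = make_dataset(100, 6, missing_rate=0.2, correlated=True)
```

In the correlated dataset, a data item that is good in one dimension tends to be good in the others, so a few items dominate most of the rest.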

5.1 Experimental results

This section highlights the results of the experiments performed on the synthetic and real datasets for our proposed approach of processing skyline queries in incomplete databases. In this section, we investigate the impact of database dimensionality (number of dimensions) and the influence of database cardinality (dataset size) on the process of pairwise comparison and the processing time for skyline evaluation. We argue that these are the most crucial parameters that influence skyline query processing [1, 2, 12, 15, 24, 37, 42, 44,45,46].

5.1.1 Effect of number of dimensions

It has been shown in the literature on skyline queries that the number of dimensions highly influences the skyline query process [2, 37, 42]. Therefore, the first set of experiments examines in particular the impact of the number of dimensions on the process of pairwise comparison in order to derive the skylines for a database with incomplete data. In this section, we illustrate the experimental results for the synthetic and real datasets used throughout the paper. Figure 20a and b illustrate the results for the independent and correlated synthetic datasets, in which the number of dimensions varies from 4 to 12 and the dataset size is fixed at 300 KB. From the figures, it is clear that our approach consistently outperforms SPQ, Incoskyline, SIDS, and Iskyline for both types of synthetic datasets. Furthermore, Fig. 20c presents the experimental results on the NBA real dataset, depicting the number of pairwise comparisons between data items executed during the skyline process. For this set of experiments, the dataset size is fixed at 120 KB, while the number of dimensions varies from 5 to 17, including the dimensions with missing values.

Fig. 20 The effect of number of dimensions on the number of pairwise comparisons

From the results on the NBA dataset, we conclude that our approach is superior to SPQ, Incoskyline, SIDS, and Iskyline. We can also observe that our approach performs better than SPQ and SIDS as the number of dimensions increases beyond 9. Figure 20d illustrates the experimental results on the CoIL 2000 insurance company real dataset. This set of experiments evaluates the approaches by varying the number of dimensions in the range of 3 to 21 and fixing the dataset size at 150 KB. We can observe that our approach is superior in all cases in terms of the number of executed pairwise comparisons. Since the MovieLens dataset consists of only four dimensions, it is not used in this experiment.

It can be observed that, with SCSA, the number of pairwise comparisons on the correlated, NBA, and CoIL 2000 datasets is lower than on the independent dataset, since most of the dominated data items are eliminated after filtration or during the identification of the group skylines, which helps avoid most of the pairwise comparisons across groups and clusters.

As for the experimental results, we observed that SCSA performs best compared to the other approaches (SPQ, Incoskyline, SIDS, and Iskyline), as increasing the number of dimensions has only a minor influence on its number of pairwise comparisons. Our proposed method prunes the initial dataset by applying the sorting and filtration processes before applying the skyline process, which leads to eliminating many unwanted data items. The idea of filtration relies on exploiting the concept of domination power (dp), meaning the number of dimensions in which a data item is counted as a skyline. Furthermore, our approach selects superior local skylines by exploiting the highest value found in each dimension to help eliminate the dominated data items. These two techniques have significantly contributed towards reducing the number of pairwise comparisons while identifying the final skylines. We also observed that Iskyline constitutes the least efficient technique due to its complicated process, which results in deriving many local skylines in each distinct cluster. Furthermore, Iskyline also derives a large number of virtual skylines from different local skylines, which are compared to the local cluster skylines to identify the candidate skylines. These processes culminate in exhaustive pairwise comparisons among the entire dataset in order to retrieve the final skylines. The other three approaches, SPQ, Incoskyline and SIDS, perform better than Iskyline for all datasets. However, these three approaches are worse than our proposed approach, SCSA.

Figure 21a, b, c, and d depict the processing time that the different approaches require to generate the skylines on the independent, correlated, NBA, and CoIL 2000 insurance company datasets. The parameter settings of this set of experiments are the same as in the previous experiments for the synthetic datasets (independent and correlated) and the real datasets (NBA and CoIL 2000 insurance company). Figure 21a and b demonstrate that SCSA requires less processing time to identify the final skylines for both synthetic datasets. We eliminate the dominated data items before applying the skyline technique in order to lessen the number of pairwise comparisons, which in turn decreases the processing time of computing the skylines. Most importantly, our approach evaluates the skylines of groups and clusters simultaneously in order to accelerate the skyline identification process. From the figures, we also learn that Iskyline underperforms in all cases, and its performance deteriorates when the number of dimensions increases to six for the independent dataset (Fig. 21a) and to eight for the correlated dataset (Fig. 21b). However, Incoskyline performs only slightly worse than our approach on the independent dataset, and its performance declines when the number of dimensions is larger than eight on the correlated dataset. For the NBA and CoIL 2000 insurance company real datasets, the parameter settings (dataset size and number of dimensions) are the same as in the previous experiment (Fig. 21c and d). Both figures show that our approach is superior to the other approaches in all the given cases.

Fig. 21 The effect of number of dimensions on the processing time

The results indicate that increasing the number of dimensions has an insignificant impact on the performance of our approach. This is due to the fact that applying the sorting and data filtration techniques helps prune the initial dataset and eliminates many dominated data items from further processing. Furthermore, the optimization process embedded in SCSA helps prune many dominated local skylines before applying the skyline process. Therefore, a significant reduction in the number of pairwise comparisons needed to identify the final skylines can be obtained. From Fig. 21, we conclude that our proposed approach (SCSA) outperforms all the other approaches designed to process skyline queries in incomplete datasets (SPQ, Incoskyline, SIDS, and Iskyline). For instance, while SCSA takes less than two seconds to produce the skylines of a 300 KB dataset with 12 dimensions, SPQ, Incoskyline, SIDS and Iskyline require more than 10 s, as shown in Fig. 21b.

The idea of constructing clusters based on the data items’ domination power and dividing the data items of clusters into smaller groups makes each cluster and each group within a cluster independent. This allows clusters and groups to be processed simultaneously when identifying the local skylines. This highly efficient technique allows SCSA to consume less processing time when identifying the final skylines, which makes SCSA efficient and highly effective compared to the other approaches (SPQ, Incoskyline, SIDS and Iskyline).

5.1.2 Effect of dataset size

Figure 22a, b, c, d and e report the number of pairwise comparisons executed on the data items during the skyline process for the independent, correlated, NBA, CoIL 2000 insurance company, and MovieLens datasets. This set of experiments examines the impact of the dataset size on the skyline computation process. For the synthetic datasets (independent and correlated), the number of dimensions is fixed to six while the dataset size varies from 100 KB to 600 KB (Fig. 22a and b). From the results presented in the figure, we conclude that the dataset size has a significant impact on the skyline process, as the number of pairwise comparisons grows steadily as the dataset size increases. We also notice that SCSA outperforms SPQ, Incoskyline, SIDS, and Iskyline in all cases since it makes use of the domination power (dp) of each data item: a data item with a low dp value is eliminated before the skyline technique is applied. Figure 22c shows the number of pairwise comparisons performed in order to identify the skylines for the NBA dataset. In this experiment, the number of dimensions is fixed to 17, while the dataset size varies from 40 to 200 KB. From the figure, we can conclude that our approach consistently outperforms SPQ, Incoskyline, SIDS, and Iskyline in all cases and that the performance of Iskyline deteriorates dramatically as the dataset size increases. We also notice that SPQ, Incoskyline, and SIDS perform only slightly worse than our approach in all cases. The improvement achieved by SCSA on the NBA dataset is only slight because the NBA dataset is far smaller than the synthetic, CoIL 2000 insurance company, and MovieLens datasets, and the dataset size significantly influences the performance of Incoskyline, SIDS, and Iskyline. On the larger datasets, these approaches generate more local skylines and execute more pairwise comparisons between data items. In contrast, SCSA generates fewer local skylines than Iskyline, SIDS, and Incoskyline.
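The metric used in this experiment, the number of pairwise comparisons, and the effect of removing dominated items early can be illustrated with a small Python sketch. It is a simplified stand-in and not the SCSA procedure: the pre-filter below uses a single hypothetical pivot (the item with the largest attribute sum) rather than the domination power, and it assumes complete tuples where larger values are better, as is the case inside one group of items that share the same missing-value pattern.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_with_count(items):
    """Naive skyline that also counts the dominance tests it performs."""
    comparisons = 0
    skyline = []
    for p in items:
        dominated = False
        for q in items:
            if q is p:
                continue
            comparisons += 1
            if dominates(q, p):
                dominated = True
                break  # one dominator is enough to discard p
        if not dominated:
            skyline.append(p)
    return skyline, comparisons


def prefilter(items):
    """Hypothetical pre-filter: drop every item dominated by one strong pivot.

    Returns the surviving items and the number of dominance tests spent.
    """
    pivot = max(items, key=sum)
    kept = [p for p in items if p is pivot or not dominates(pivot, p)]
    return kept, len(items) - 1


items = [(9, 8), (1, 1), (2, 1), (1, 2), (3, 3), (8, 9), (2, 2)]
skyline, naive_cost = skyline_with_count(items)
kept, filter_cost = prefilter(items)
filtered_skyline, remaining_cost = skyline_with_count(kept)
print(skyline == filtered_skyline, naive_cost, filter_cost + remaining_cost)
```

On this toy input both runs return the same skyline, while the filtered run spends fewer dominance tests in total. The saving depends on how many items the pivot dominates, which is why a pruning criterion that identifies strong items matters.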

Fig. 22 The effect of dataset size on the number of pairwise comparisons

Figure 22d presents the results of the experiment on the real dataset, CoIL 2000 insurance company. In this experiment, 13 dimensions have been considered, and the dataset size varies from 50 to 300 KB. The result shows that the size of the dataset has a marginal impact on the performance of our approach. This is because many unwanted data items are removed during the filtration process performed before applying the skyline technique, which helps avoid many unnecessary pairwise comparisons. Figure 22e depicts the experimental result obtained for the MovieLens real dataset. In this experiment, the number of dimensions is four, and the dataset size varies from 400 to 2000 KB. From this figure, we notice that the performance of our approach is marginally better than that of SPQ and SIDS if the dataset size is less than 1200 KB. However, the gap between the performance of our approach and that of SPQ and SIDS gradually widens when the dataset size is larger than 1200 KB.

Figure 23a, b, c, d and e describe the processing time of identifying the skylines on an incomplete database for the independent, correlated, NBA, CoIL 2000 insurance company, and MovieLens datasets. The parameter settings of this set of experiments are the same as those of the previous experiments described in Fig. 22. Figure 23a and b describe the processing time of the skyline query operation in an incomplete database on the synthetic datasets (independent and correlated). From the figure, it is clear that our approach requires less processing time than SPQ, Incoskyline, SIDS, and Iskyline in all cases. The reason is that the sorting and filtration phase and the phase of selecting superior local skylines remove many dominated data items while identifying the skylines in a database with incomplete data.

Fig. 23 The effect of dataset size on the processing time

Figure 23c shows the result of the experiment conducted on the NBA dataset. We notice that our approach outperforms the previous approaches in all cases and that SPQ, Iskyline, and SIDS require more processing time as the dataset size increases. Our approach performs only slightly better than Incoskyline if the dataset size is less than 120 KB, because a small dataset contains fewer dominated data items to prune than a dataset with a large number of data items. This causes unnecessary pairwise comparisons between the data items, which further increases the processing time. The same behavior can also be observed on the synthetic datasets (independent and correlated).

The processing time of the skyline operation on the CoIL 2000 insurance company dataset is shown in Fig. 23d, which indicates that our approach steadily outperforms SPQ, Iskyline, SIDS, and Incoskyline in all cases. This can be explained by the fact that our approach avoids many unwanted pairwise comparisons among the data items, which in turn reduces the processing time. Lastly, Fig. 23e presents the experimental results obtained from the MovieLens dataset.

We can conclude that SCSA requires less processing time than Iskyline, SIDS, Incoskyline, and SPQ during the skyline operation for the different dataset sizes. This is because our approach successfully excludes many dominated data items from the process of pairwise comparison, which shortens the processing time considerably. The figure also demonstrates that Iskyline, SIDS, Incoskyline, and SPQ perform worse as the dataset size keeps increasing, since they are forced to scan the entire dataset more than once. These multiple data scans generate a high number of pairwise comparisons to identify the skylines.

The experimental results presented throughout the paper show that our proposed technique outperforms the most recent techniques proposed for processing skyline queries in incomplete databases (Iskyline, Incoskyline, SIDS, and SPQ). The results demonstrate the effectiveness and the efficiency of our proposed solution in managing the skyline query process in incomplete databases. Our proposed approach sorts the data items based on their non-missing values before the skyline process, unlike the Iskyline and Incoskyline techniques, which apply the skyline process without first filtering the data items. Prior filtering of the initial data helps avoid unnecessary exhaustive pairwise comparisons [37, 39, 43]. Most importantly, generating the domination power for each data item also eliminates many unnecessary data items from further processing before the skyline operation is applied. Moreover, creating clusters based on the domination power values of the data items significantly contributes to simplifying the skyline process and avoiding unnecessary pairwise comparisons. This stands in contrast to the examined SIDS and SPQ techniques, which perform a sequential scan of all sorted data items without using the domination power to exclude unnecessary data items. Lastly, creating smaller groups from clusters based on the bitmap representation greatly improves the efficiency of our proposed approach and minimizes both the number of pairwise comparisons and the processing time of the skyline process.

6 Conclusion

Skyline queries are generally considered an expensive process given the extensive domination tests required to determine the skylines. The dataset size and the number of dimensions have a critical impact on the search space and affect the computation process. This paper proposes a novel hybrid algorithm (SCSA) that derives skylines from incomplete data by minimizing the search space and reducing the number of domination tests. SCSA processes the skylines by removing the dominated data items before applying the skyline technique. The clustering of data items is achieved by generating the domination power of each data item. Furthermore, dividing clusters into smaller groups based on the bitmap representation allows all clusters, and all groups within clusters, to be processed simultaneously. These two preprocessing steps simplify the skyline process involving incomplete databases. In order to demonstrate the efficiency and the effectiveness of our proposed approach, several experiments have been run on real and synthetic datasets. The results show that our algorithm outperforms the most recent skyline algorithms proposed to process skyline queries in incomplete datasets.