1 Introduction

During the last two decades, page ranking for web queries has attracted great interest from the machine learning community. Owing to the ubiquitous nature of ranking, it is also applied in fields such as metasearch, voting methods in elections, information retrieval and collaborative computing (Aledo et al. 2013). Rank aggregation techniques are mainly designed for web-based tools to provide an effective and fast response to the search queries posed by users. Web-based activity has grown enormously around the world, driven by new and highly advanced metasearch engines, so the demand for fast query-searching techniques rises every day. Plenty of search engines are available worldwide, differing in their search capabilities, speed and indexing algorithms. Hence, new methods and models are required to improve a search engine's ability to return results that match, or come close to, what the user intended. Another very important factor is the response time taken by the search engine for a particular query. A web search evaluation system coordinates search algorithms with distance-evaluation methods (known as rank aggregation methods). Generally, a user is interested in viewing only the first twenty results displayed for a query, but for that query the search engine has to evaluate the distance of the query with respect to all the web pages it indexes. Sometimes satisfactory results appear on the first page, while in other cases favourable results may not be found at all. Hence, improving the ranking of web pages for a particular query provides better and faster search options.

2 Earlier work

Rank aggregation might look simple but, according to social choice theory, it is impossible to design a fair voting-based ranking list. The first voting paradox was discovered by the French mathematician M. Condorcet in 1785 (Condorcet 1785), although earlier, in 1781, J. C. Borda had compiled work highlighting the importance of ranking in the field of voting (Borda 1781). Condorcet described a situation in which an election has no winner. Later, in 1951, Kenneth Arrow published his impossibility theorem, which characterised the best achievable outcomes for social choice theory. According to Arrow, it is impossible to design an aggregation method for three or more candidates that meets all the required fairness criteria (Arrow 1951). Many solutions for rank aggregation have been proposed but, as Arrow proved in his theorem, none of them is perfect. Rank aggregation techniques divide into two areas: unsupervised and supervised learning. An unsupervised approach requires no training dataset to derive results, e.g., Borda count (Borda 1781; Saari 2000), median rank aggregation (Dwork et al. 2001) and genetic and fuzzy based rank aggregation (Beg and Ahmad 2002), whereas supervised techniques utilise a dataset. In the context of web page ranking, this work traces back to social choice theory (Bartholdi et al. 1989), since aggregation reduces the discrepancy among the similar inputs fed to search engines and generates a merged result. Diaconis and Graham (1977) first applied the two popular distance measures of social choice theory, Spearman's footrule and Kendall tau, to compute distances between two lists.
Spearman's footrule distance calculates the sum of absolute differences between the ranks assigned by the two lists, whereas the method by Kendall (1938) counts the pairwise disagreements between the two lists and satisfies the neutrality and consistency requirements of the Condorcet property (Young 1974, 1988).

Page and Brin (1998) placed the biggest milestone for exploring the World Wide Web. Their PageRank algorithm calculates the relevance of pages in comparison with other existing pages. PageRank (Page and Brin 1998) and Kleinberg's HITS algorithm (Kleinberg 1999) are two important algorithms for measuring the importance of documents on the web. The work reported by Dwork et al. (2001) focused on combining ranking results from various sources in the web context; like the graph-based PageRank method (Page and Brin 1998), it uses the Markov chain concept to generate a new list of web pages. Montague and Aslam (2001) applied rank aggregation to metasearch and decomposed the problem into score-normalisation techniques. Montague et al. (2001b) addressed relevance-score normalisation (Fox et al. 1999; Lee 1997) for documents that are not retrieved. Renda and Straccia (2003) treated rank fusion as a metasearch problem in which various lists are combined so that an optimised solution is produced. Akritidis et al. (2011) designed a metasearch engine named QuadSearch, which aimed to combine speed, a reliable rank aggregation method and spam-free results with detailed, enriched information.

The concept of multiple-objective GA-based rank aggregation was addressed by Kaur et al. (2015). The Mean and Stuart methods were considered for the rank order of queries, and a comparative analysis between these methods and a genetic-algorithm-based approach was carried out. The GA-based approach performed best, but at the cost of more time.

3 Methods for rank aggregation

3.1 Kemeny-Young method

The Kemeny-Young method was suggested by John Kemeny (1959). Young and Levenglick (1978) showed that it satisfies the Condorcet criterion, which indicates the robustness of a method: a candidate c is elected as winner if it beats every other candidate in pairwise comparison. The Kemeny-Young method produces the ordering of candidates that minimises the summed Kendall tau distance to the input lists. Unfortunately, Kemeny's approach is NP-hard, even for as few as four rank lists. Dwork et al. (2001) proposed a 2-approximation to the Kemeny-Young method that employs Spearman's footrule instead of the Kendall tau distance.
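Since exact Kemeny aggregation is NP-hard, it is only feasible by exhaustive search on toy instances. The following minimal Python sketch (function names are illustrative, not from the paper) finds the Kemeny-optimal ordering by trying every permutation and summing Kendall tau distances:

```python
from itertools import combinations, permutations

def kendall_tau(sigma, tau):
    """Count pairwise disagreements between two rankings.
    sigma, tau map candidate -> position."""
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (sigma[a] - sigma[b]) * (tau[a] - tau[b]) < 0
    )

def kemeny_aggregate(rank_lists):
    """Exhaustively search for the ordering minimising the summed
    Kendall tau distance (exponential time -- toy sizes only)."""
    candidates = rank_lists[0]
    positions = [{c: i for i, c in enumerate(rl)} for rl in rank_lists]
    best, best_cost = None, float("inf")
    for perm in permutations(candidates):
        sigma = {c: i for i, c in enumerate(perm)}
        cost = sum(kendall_tau(sigma, tau) for tau in positions)
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best, best_cost
```

Even for |U| = 20 documents this brute force is hopeless (20! permutations), which is exactly why approximations and heuristics such as the GA of Section 5 are needed.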

3.2 Borda count method

Jean-Charles de Borda (1781) developed, in the late 1770s, an aggregate rank list of candidates for political elections. Saari (2000) applied the Borda count, in which a score is assigned to each candidate based on its position in a rank list: if a candidate c is in the first position of a list of size n, it is assigned a score of n − 1, since it beats n − 1 candidates. Similarly, the candidate ranked second scores n − 2, and so on, while the candidate at the bottom of the list scores zero because it beats no other candidate. The cumulative score of a candidate over all lists is used to rank the candidates in the final list. Formally, for an item i and a list m_k ∈ R, A_{m_k}(i) equals the number of items j in m_k such that m_k(j) > m_k(i). The total Borda score for item i is given as:

$$ A(i)=\sum_{m_k\in R} A_{m_k}(i) $$
(1)

Ranks are allocated by sorting the score values, with a larger score receiving a smaller (better) rank. Equation (1) covers full lists; for a partial list m_k with |m_k| = |U| − d, A_{m_k}(i) is calculated as above for all items i ∈ m_k, while every item j ∉ m_k is assigned:

$$ A_{m_k}(j)=\frac{(d+1)^2+(d+1)}{2d} $$
(2)

The Borda count involves very simple computations and is easily implemented. However, Borda is not a Condorcet method and it also fails to satisfy some other fairness criteria.
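The positional scoring described above can be sketched in a few lines of Python (a minimal illustration for full lists of equal size; the function name is ours):

```python
from collections import defaultdict

def borda_count(rank_lists):
    """Borda aggregation: in each list of size n, the item at
    position p (0-based) scores n - 1 - p; items are then sorted
    by cumulative score (ties broken lexicographically)."""
    scores = defaultdict(int)
    for rl in rank_lists:
        n = len(rl)
        for p, item in enumerate(rl):
            scores[item] += n - 1 - p
    return sorted(scores, key=lambda x: (-scores[x], x))
```

For example, aggregating the lists (a, b, c) and (b, a, c) gives a and b a score of 3 each and c a score of 0.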

3.3 Markov chain based methods

Dwork et al. (2001) proposed four Markov-chain-based methods. These methods define a Markov chain whose transitions follow the ordering of candidates in the rank lists, and the rank of a candidate is derived from the stationary distribution of the chain. The PageRank algorithm is likewise based on a Markov chain.
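As a rough sketch of the MC4 variant used later in this paper: from the current page P, a page Q is drawn uniformly at random, and the chain moves to Q only if a majority of the input lists rank Q above P. The code below is a simplified illustration under that assumption (it omits the ergodicity fix a production implementation would need, and the function name is ours):

```python
import numpy as np

def mc4_ranking(rank_lists):
    """Simplified MC4 sketch: build the majority-based transition
    matrix and rank pages by (approximate) stationary probability."""
    items = rank_lists[0]
    n = len(items)
    pos = [{c: i for i, c in enumerate(rl)} for rl in rank_lists]
    P = np.zeros((n, n))
    for i, p in enumerate(items):
        for j, q in enumerate(items):
            if i == j:
                continue
            # move to q iff a majority of lists rank q above p
            majority = sum(r[q] < r[p] for r in pos) > len(pos) / 2
            P[i, j] = 1.0 / n if majority else 0.0
        P[i, i] = 1.0 - P[i].sum()   # stay otherwise
    pi = np.full(n, 1.0 / n)
    for _ in range(500):             # power iteration
        pi = pi @ P
    return [items[k] for k in np.argsort(-pi)]
```

A candidate that beats every other in pairwise majority (a Condorcet winner) becomes an absorbing state and ends up first in the ranking.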

3.4 The mean by variance (MBV) method

The MBV method comprises a function, related to fuzzy methods, that derives an aggregated list. MBV belongs to cardinal rank aggregation, also called score-based aggregation. Beg and Ahmad (2003) used the mean-to-variance ratio computed from Eq. (3): a document is ranked higher the smaller its mean-by-variance value. The ratio is derived from Eq. (3) for all M documents and arranged in ascending order to form the aggregated list.

$$ mbv(i)=\overline{m}_{d_i}\big/\overline{\sigma}_{d_i}^{\,2} $$
(3)

3.5 The PageRank method

Page and Brin (1998) developed the PageRank algorithm at Stanford University. This algorithm became the backbone of the Google search engine and various other search engines. PageRank mainly calculates the relevance of pages in comparison to other existing web pages; other parameters such as the title, tags, anchor text and keywords are also used to calculate the overall rank of a page.

The steps used for page ranking in SERPs are as follows (Page and Brin 1998):

  a) The pages that carry the keywords are first filtered out.

  b) Ranking is performed according to the maximum keyword occurrence in a page.

  c) The value of inbound anchor-text links is calculated.

  d) Finally, the results are ordered in descending order using Eq. 4.

    $$ PR(u)=\left(1-d\right)+d{\displaystyle \sum_{v\in B(u)}}\frac{PR(v)\ }{N_v} $$
    (4)

    where d is a dampening factor, typically 0.85 (its value can be varied as per the user's requirement), B(u) is the set of pages directly linking to u, N_v is the number of out-links of page v, and (1 − d) accounts for the rank contribution from pages that are not directly linked.
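Eq. (4) can be evaluated by simple fixed-point iteration, as in this minimal Python sketch (link structure passed as an adjacency dictionary; a toy illustration, not Google's implementation):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative evaluation of Eq. (4):
    PR(u) = (1 - d) + d * sum over v in B(u) of PR(v) / N_v,
    where B(u) are the pages linking to u and N_v is v's out-degree."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # initial guess
    out_deg = {p: len(ts) for p, ts in links.items()}
    for _ in range(iters):
        nxt = {}
        for u in pages:
            backlinks = [v for v in pages if u in links[v]]
            nxt[u] = (1 - d) + d * sum(pr[v] / out_deg[v] for v in backlinks)
        pr = nxt
    return pr
```

For two pages that link only to each other, the iteration converges to PR = 1 for both, the fixed point of PR = 0.15 + 0.85 · PR.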

3.6 HITS method

Kleinberg's HITS algorithm (Kleinberg 1999) uses the concepts of authorities and hubs, instead of plain in-links and out-links, for allotting ranks to pages. The edges between hubs and authorities are used in the calculation, and weights are assigned to hubs and authorities to refine the results. The algorithm computes the ranks iteratively.

In each iteration, the authority and hub weights assigned to a node are computed as:

$$ a_i=\sum_{j\in B(i)} h_j \quad \text{and} \quad h_j=\sum_{i\in F(j)} a_i $$
(5)

The algorithm keeps iterating until the vectors converge. HITS computes the hub and authority scores by taking into account the structure of the graph around each node.
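Eq. (5) amounts to a power iteration on the link adjacency matrix. A minimal sketch (with the usual L2 normalisation each step, which the text leaves implicit):

```python
import numpy as np

def hits(adj, iters=100):
    """Power iteration for Eq. (5): authority a_i sums the hub
    scores of pages linking to i; hub h_j sums the authority
    scores of the pages j links to. adj[j][i] = 1 if j links to i."""
    n = len(adj)
    A = np.array(adj, dtype=float)
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h                    # authorities from hubs
        h = A @ a                      # hubs from authorities
        a /= np.linalg.norm(a)         # normalise so scores converge
        h /= np.linalg.norm(h)
    return a, h
```

In a two-page graph where page 0 links to page 1, page 1 ends up as the pure authority and page 0 as the pure hub.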

3.7 Distance measure methods

Distance evaluation among document lists plays an important role in rank aggregation. The two basic distance measures, the Kendall tau distance and Spearman's footrule distance, were first reported by Diaconis and Graham (1977) and Diaconis (1988).

  a) Kendall-tau distance measure: This method is widely used to calculate the correlation between two rank lists. The Kendall tau distance (Kendall 1938) counts the number of inversions in the order of pairs of candidates between two rank lists, i.e., the number of pairwise disagreements between two full lists. For two full lists denoted σ and τ it is given in Eq. 6:

    $$ K(\sigma,\tau)=\left|\left\{(k,l): k<l,\ \sigma(k)<\sigma(l)\ \text{but}\ \tau(k)>\tau(l)\right\}\right| $$
    (6)

    The normalised Kendall tau distance is derived by dividing by the greatest possible value, n(n − 1)/2 (Kendall 1938). Another approach is optimal aggregation, which works on the principle of removing outliers from the various aggregation methods; it satisfies the neutrality and consistency requirements known as the Condorcet property (Young 1974, 1988).

  b) Spearman's footrule distance measure: Spearman's footrule distance is defined as the absolute distance between the positions of candidates in two ranking lists. Suppose two document lists l1 : (d1, d2, d3, d4) and l2 : (d4, d3, d2, d1) are given. The footrule distance d is the sum of the absolute differences between document positions; e.g., d1 is 1st in l1 and 4th in l2, so it contributes |1 − 4| to the distance. The total distance is:

    $$ d=\left|1-4\right|+\left|2-3\right|+\left|3-2\right|+\left|4-1\right|=8 $$

More generally, Spearman's footrule distance sums, over all elements k of a given set, the absolute difference between the ranks of k in the two lists. For two lists σ and τ it is calculated as:

$$ F(\sigma,\tau)=\sum_k \left|\sigma(k)-\tau(k)\right| $$
(7)
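Both measures from Eqs. (6) and (7) are a few lines of Python; the sketch below reproduces the worked footrule example above (rankings are represented as candidate-to-position dictionaries):

```python
from itertools import combinations

def spearman_footrule(sigma, tau):
    """Eq. (7): sum of absolute position differences."""
    return sum(abs(sigma[k] - tau[k]) for k in sigma)

def kendall_tau(sigma, tau):
    """Eq. (6): number of pairs ordered differently by the two lists."""
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (sigma[a] - sigma[b]) * (tau[a] - tau[b]) < 0
    )

# the worked example: l1 = (d1, d2, d3, d4), l2 = (d4, d3, d2, d1)
l1 = {d: i + 1 for i, d in enumerate(["d1", "d2", "d3", "d4"])}
l2 = {d: i + 1 for i, d in enumerate(["d4", "d3", "d2", "d1"])}
```

Here the footrule distance is |1−4|+|2−3|+|3−2|+|4−1| = 8, and the Kendall tau distance is 6, since all six document pairs are inverted.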

4 Rank aggregations as an optimization problem

The rank aggregation based problems are basically divided into two categories as:

  • Ordinal rank aggregation: This method employs order-based preferences provided by the voters; the ordering of candidates is used to calculate the aggregated ranking.

  • Cardinal rank aggregation: This method is based on scores given by the voters to each candidate; these scores are used to derive the aggregated ranking. In this work both methods are used to generate the consensus ranking.

The problem of rank aggregation is concerned with finding a consensus ranking list that represents a combined list from various search engines and is as close as possible to all the individual rank lists. Finding such a combined list is NP-hard, since the obtained aggregated list cannot be guaranteed to be the best one; hence optimisation techniques are required. Dwork et al. (2001) suggested Kendall-distance-based aggregation, known as Kemeny optimal aggregation, which generates the geometric median of the inputs, and showed that Kemeny optimal aggregation is NP-hard; a polynomial-time algorithm was constructed for footrule optimal aggregation on full lists. In the last decade, Beg and Ahmad (2003) proposed a heuristic approach to the NP-hard partial footrule optimal aggregation (PFOA) problem using a genetic-algorithm-based soft-computing technique; both positional methods and PFOA were applied to derive optimal solutions. Their results were better with the Spearman's footrule distance measure, but the Kendall tau distance was not addressed. Motivated by this literature, we apply the genetic approach to both the Kendall tau and Spearman's footrule distances on a complex data set.

5 Rank aggregations with genetic approach

The genetic approach is a metaheuristic approach. The genetic algorithm (GA) was first applied by Goldberg (1989); here it is used for the NP-hard partial-list problem derived from the various rank aggregation methods. Multiple-objective genetic algorithms employing the Mean and Stuart methods were used by Kaur et al. (2015) to calculate the rank order of different web queries. A genetic algorithm differs from traditional search and optimisation in many ways: it can avoid being trapped in a local optimum, unlike traditional methods that search from a single point; it uses probabilistic selection; and it works on chromosomes that encode the solution parameters. The fitness score generated by the GA provides a convenient way to evaluate objective functions. The Kendall tau and Spearman's footrule distance measures are taken as the objective functions. Let S_i = {d_{i1}, d_{i2}, d_{i3}, …, d_{in}} be the set of document results derived from search engine i, with M participating engines, so the universe is U = ∪_{i=1}^{M} S_i = {1, 2, …, |U|}. In this work five search engines are taken: Google, AltaVista, Deeperweb, Excite and Hotbot. Generally, no search engine generates a full list for a given query; the derived partial lists therefore need to be converted into full lists so that the distance measures can be calculated uniformly. Let the partial list be t_j and the full list t, with |t_j| and |U| elements respectively. Equations (8) and (9) convert a partial list into a full list, with the position values of the full list t modified using Eq. (9).

$$ t_j\left[t_j(i)\right]=\begin{cases} \text{unchanged} & \text{if } i\le |t_j|, \\ x \mid x\in U \wedge x\notin t_j, & \text{otherwise} \end{cases} $$
(8)
$$ t(i)=\begin{cases} \text{unchanged}, & \text{if } i\le |t_j|, \\ \dfrac{\sum_{k=|t_j|+1}^{|U|} t(k)}{|U|-|t_j|}, & \text{otherwise}, \end{cases} $$
(9)
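A minimal Python sketch of the conversion in Eqs. (8) and (9), under the reading that every item missing from the partial list is appended (Eq. 8) and assigned the average of the unused positions (Eq. 9); the function name is ours:

```python
def partial_to_full(t_j, universe):
    """Extend a partial list t_j over the universe U: missing items
    are appended, and each appended item receives the average of
    the positions |t_j|+1 .. |U| as its rank value."""
    missing = [x for x in universe if x not in t_j]
    full = list(t_j) + missing
    n_known, n = len(t_j), len(universe)
    # average of the unused (1-based) positions, shared by all missing items
    avg_pos = (
        sum(range(n_known + 1, n + 1)) / (n - n_known) if missing else None
    )
    positions = {item: i + 1 for i, item in enumerate(t_j)}
    for item in missing:
        positions[item] = avg_pos
    return full, positions
```

For example, extending the partial list (a, b) over U = {a, b, c, d} assigns both c and d the position (3 + 4)/2 = 3.5.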

Spearman's footrule distance F(t, t_j) between the lists t and t_j is calculated using Eq. (7), and the Kendall tau distance K(t, t_j) using Eq. (6). The normalised aggregated footrule distance over the set of N partial document lists T = {t_1, t_2, t_3, …, t_N} is computed using Eq. (10).

$$ F\left(t,T\right)=\frac{{\displaystyle {\sum}_{i=1}^N}F\left(t,{t}_i\right)}{N} $$
(10)

Also, the normalized aggregated Kendall-tau distance is given as:

$$ K\left(t,T\right)=\frac{{\displaystyle {\sum}_{i=1}^N}K\left(t,{t}_i\right)}{N} $$
(11)
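Eqs. (10) and (11) share the same shape, an average of pairwise distances over the N input lists, so a single helper suffices (a sketch; the distance function is passed in):

```python
def aggregated_distance(dist_fn, t, partial_lists):
    """Eqs. (10)-(11): average distance between a candidate
    aggregate ranking t and the N input lists, using dist_fn
    (footrule for Eq. 10, Kendall tau for Eq. 11)."""
    return sum(dist_fn(t, t_i) for t_i in partial_lists) / len(partial_lists)
```

This is exactly the fitness value the GA minimises over candidate chromosomes t.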

The objective functions given by Eqs. (10) and (11) are minimised using the genetic algorithm. The initial chromosomes are the aggregated lists t generated by the Borda count, mean-by-variance, PageRank, Markov chain MC4 and scaled footrule (SFO) methods. The major GA stages, generation, reproduction, crossover and mutation, are evaluated for the given NP-hard partial-list problem. First, chromosomes are converted from binary to decimal form so that each represents a valid permutation of the numbers 1 to |U|. The objective function is then evaluated for the selected chromosomes from a pool of randomly generated permutations of the universe U = {1, 2, …, |U|}; each chromosome in the pool in turn is scored by the aggregated footrule distance and the aggregated Kendall distance against the rankings given by the M participating search engines. Chromosomes with smaller values survive, whereas those with large values are eliminated from the pool.

At the crossover stage, consider two to-be-crossed chromosomes D_i^k = [d_{i,1}, d_{i,2}, …, d_{i,n}] and D_j^k = [d_{j,1}, d_{j,2}, …, d_{j,n}], where n = |U|. The first m (m < n) elements are exchanged between the two chromosomes, and the remaining elements, d_{i,m+1}, …, d_{i,n} in D_i^k and d_{j,m+1}, …, d_{j,n} in D_j^k, are adjusted to form the new chromosomes D_i^{k+1} = [d_{j,1}, …, d_{j,m}, d'_{i,m+1}, …, d'_{i,n}] and D_j^{k+1} = [d_{i,1}, …, d_{i,m}, d'_{j,m+1}, …, d'_{j,n}], which must again be valid permutations. The adjustment follows the concept of multiplication of permutations, producing the chromosomes given by Eqs. (12) and (13):

$$ \left\{d'_{i,m+1},\dots,d'_{i,n}\right\}=\left\{d_{j,m+1},d_{j,m+2},\dots,d_{j,n}\right\}\times\left\{d_{i,m+1},d_{i,m+2},\dots,d_{i,n}\right\}, $$
(12)
$$ \left\{d'_{j,m+1},\dots,d'_{j,n}\right\}=\left\{d_{i,m+1},d_{i,m+2},\dots,d_{i,n}\right\}\times\left\{d_{j,m+1},d_{j,m+2},\dots,d_{j,n}\right\} $$
(13)

The new generation thus consists of two distinct chromosomes, D_i^{k+1} and D_j^{k+1}, each a valid permutation of the universe U. For mutation, the digit to be mutated is exchanged with another randomly selected digit within the same permutation, which again yields a valid permutation.
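The crossover and mutation steps above can be sketched in Python. For the tail adjustment we use a standard order-crossover variant (keep the first m genes from one parent, fill the rest in the other parent's order) as a simplified stand-in for the paper's permutation-multiplication step, which the sketch does not reproduce exactly:

```python
import random

def crossover(p1, p2, m):
    """Valid-permutation crossover sketch: each child keeps the
    first m genes of one parent and fills the tail with the
    remaining items in the other parent's order."""
    head1 = p1[:m]
    child1 = head1 + [g for g in p2 if g not in head1]
    head2 = p2[:m]
    child2 = head2 + [g for g in p1 if g not in head2]
    return child1, child2

def mutate(perm, rng=random):
    """Swap mutation: exchange two randomly chosen positions,
    preserving the validity of the permutation."""
    i, j = rng.sample(range(len(perm)), 2)
    perm = list(perm)
    perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Both operators, by construction, always return valid permutations, so no repair step is needed before the next fitness evaluation.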

The genetic algorithm for Kendall-tau and Spearman’s footrule distance measure is as follows:

figure a

After running the GA, the optimised parameters corresponding to the minimised objective functions (Eqs. 10 and 11) are obtained.

6 Experiment and results

In this work, three queries are applied as input to five search engines. GA-based distance measures, Kendall tau (KTu) and Spearman's footrule distance (SFD), are used to aggregate the rank lists for these queries. The simulation results show that the proposed GA-based GKTu and GSFD give better results than the non-GA distance measures. Table 1 lists the three main queries. The five search engines are Google, AltaVista, Deeperweb, Excite and Hotbot, and the top 20 results from each are taken for evaluation. The aggregated results are also compared using the Borda, mean-by-variance, PageRank, scaled aggregated footrule and Markov MC4 methods.

Table 1 Main queries and search engines

Word association is one of the novel applications of rank aggregation. All three queries are combined through AND/OR operators in order to analyse the effect of word association; the keywords for the given queries are chosen by the user. The new sets of queries formulated using the AND and OR operators separately are given in Table 2.

Table 2 Main queries and new logical operator based queries

6.1 Non GA results for various rank aggregation techniques

After applying the three queries of Table 1 to the Google, AltaVista, Deeperweb, Excite and Hotbot search engines, raw lists are obtained. These lists are converted into the full lists given in Table 3 using the rank aggregation techniques, and both the Kendall tau and Spearman's footrule distances are calculated for each list.

Table 3 Aggregated list derived from Rank Aggregation techniques

6.2 GA based results for various rank aggregation techniques

The GA-based rank aggregation techniques are implemented to obtain the optimum value of the objective function, and the rank aggregation methods are compared on the basis of which achieves the minimum optimised distance; a smaller distance value helps to improve the performance of metasearch. Table 4 compares the Kendall tau distance with and without genetic-based optimisation. Tables 4 and 5 show the results of the proposed genetic-based Kendall tau algorithm (GKTu) and genetic-based Spearman's footrule (GSFD) for each rank aggregation method with iteration counts k = 10, k = 50 and k = 100.

Table 4 Comparison of Kendall-tau Genetic Algorithms (GKTu) with rank aggregation techniques
Table 5 Comparison of Spearmen’s footrule Genetic Algorithms (GSFD) with rank aggregation techniques

Tables 4 and 5 show the distance values with and without GA for the Borda method. With the word association technique, the non-GA Kendall tau distance for Borda on query 1 is 0.237, and for the modified query 1a it is reduced to 0.211; optimised aggregated lists are thus derived for metasearch. Similarly, with GA the Kendall tau value for the same queries is 0.226, reduced to 0.211 after modification. With the OR operator the values increase slightly compared to AND, but remain below those of the queries without any operator. Hence, word-association-based queries improve the quality of search to a large extent. Figure 1 shows the proposed genetic-based Kendall tau distance for the Borda and MBV methods, along with the best and mean fitness values for iteration counts k = 10, 50 and 100. Similarly, Fig. 2 shows the GA-based Spearman's footrule distance (fitness value) for the MC4 and SFO methods at the same iteration counts.

Fig. 1
figure 1

GA based optimized Kendall-tau distance measure for Borda and MBV Methods

Fig. 2
figure 2

GA based optimized Spearman’s footrule distance measure for MC4 and SFO Methods

Tables 6 and 7 present the best aggregated lists obtained from the optimised rank aggregation techniques. The Borda method is the winner, yielding improved results for all types of queries with and without word-association operators. Table 8 compares all aggregation methods without GA for both the Kendall and Spearman's footrule distances, whereas Table 9 gives the comparison with GA. Figures 3 and 4 plot the Spearman's footrule and Kendall tau distances for all the rank aggregation techniques.

Table 6 Top 20 results for the Query Parallel Sorting Neural Network
Table 7 Top 20 results for the Query Parallel Sorting AND Neural Network
Table 8 Performance comparison of different rank aggregation techniques for all six queries without GA
Table 9 Performance comparison of different rank aggregation techniques for all six queries with GA at 100th iteration
Fig. 3
figure 3

Distance comparison of rank aggregation techniques for Spearman foot rule

Fig. 4
figure 4

Distance comparison of rank aggregation techniques for Kendall

7 Conclusions and future scope

The genetic algorithm approach is applied to the distance measures in order to minimise the distance values of the various aggregated lists and generate an aggregated rank. The minimum distances derived from the analysis improve the quality of metasearch and are helpful for building more effective metasearch engines. The genetic-based Kendall tau and Spearman's footrule show variation in magnitude for the Borda method: at the 10th generation the proposed GA-based approach shows no improvement in the distance values, but as the iterations increase the values become optimised. Similarly, the GA results for MBV improve consistently, while for Markov chain MC4 the results improve between the 50th and 100th generations. In the non-GA approach, by contrast, the distance values derived for the MBV, Markov chain MC4, PageRank and SFO techniques are not satisfactory. The results obtained using the AND/OR operators show significant improvement for both the proposed GA-based and non-GA measures; these Boolean operators, when applied in web search, act as extremely effective filters for finding the exact information required by users. It is concluded that the GA-based approach is more efficient than the non-GA approach in terms of optimised distance. As future work, more soft-computing techniques should be explored along with different rank aggregation techniques to minimise the execution time.