1 Introduction

In the present scenario, large and interesting real-world datasets are available worldwide for every discipline, such as sports, politics, education, psychology and the stock market. However, searching them for useful information is a tedious and complex task. Metasearch engines return lists of Web sites in response to a user query containing its keywords, but these lists seldom fully match the user's satisfaction level. Hence, rank aggregation of several lists into one “super” list is one of the best solutions for obtaining better results. Various rank aggregation techniques have been employed in the literature, such as score-based methods, positional methods, probabilistic methods, learning techniques and soft computing techniques (Beg and Ahmad 2003, 2004; Beg et al. 2016; Dwork et al. 2001; Renda and Straccia 2003; Montague et al. 2001). To investigate the performance of rank aggregation techniques, two distance measures are used: Spearman’s footrule and Kendall-tau. The distance of the final aggregated list is calculated against the input lists, a normalized aggregated distance is obtained, and the top rank is assigned to the Web site having the minimum distance value. Minimizing this normalized aggregated distance is known as optimal aggregation, which is an NP-hard problem. A common rank aggregation technique is Borda’s method, developed in the 1770s by Borda (1781). Although this method was originally applied to political data, it was later investigated on several new problems, with sports, psychology and the stock market as application areas (Langville and Meyer 2012). In 1951, Arrow proved his impossibility theorem, which describes the limitations inherent in voting systems. Shimura (1973) proposed a fuzzy logic technique for ranking objects. Dubois and Prade’s (DP) technique (Ross 1997) was adopted for rank aggregation; it is based on fuzzy ordering of elements and, apart from the mean position of a document, also uses averages over the results, an aspect missing from Borda’s method. The entropy technique of Ross (1997) was employed for ranking large datasets with much variation and complexity, using a threshold to obtain the positions of documents within a given range.

In 1998, the paper “Anatomy of a Search Engine” by Page and Brin (1998) opened new horizons in the field of ranking and rating through the design of the Google search engine, which used Markov chains to rank web pages. The Google search engine was based on the PageRank algorithm, which helped in calculating the relevancy of new pages in comparison with other existing pages. PageRank (Page and Brin 1998) and Kleinberg’s HITS (Kleinberg 1999) algorithms are currently used for measuring the importance of a document on the web. Dwork et al. (2001) considered the problem of rank aggregation in the context of the web and studied rank aggregation techniques for overcoming the problem of spam fighting. Montague et al. (2001) came up with various methods of metasearch, focusing on normalized score values used to decompose the metasearch concept.

Beg and Ahmad (2003) employed a genetic algorithm (GA) to optimize the Spearman footrule distance for partial lists, known as partial footrule optimal aggregation (PFOA). Beg and Ahmad (2004) proposed three strategies, which were compared with the classical fuzzy technique and Borda’s method for web applications. Yan et al. (2011) improved the quality of the PageRank algorithm by introducing a genetic-algorithm-based variant known as genetic PageRank (GPRA). GPRA is based on the topic-sensitive idea and is divided into two phases: collecting themes related to the calculation of PageRank vectors, and online inquiries to determine the theme context. The problem of refining page ranking was considered by Laughlin et al. (2011) using fuzzy sets and logic. Akritidis et al. (2011) highlighted effective rank aggregation techniques for metasearch; their paper introduced QuadRank, a new rank aggregation method that considers the query terms, the collected results and the data correlated with each result, such as title, text, snippet, URLs and individual rankings. Likewise, the concept of feature selection was employed by Waad et al. (2013) in the area of rank aggregation. The rank aggregation problem was investigated by Pihur et al. (2014) using an R tool for computing cross-entropy. The concept of multiple-objective GA-based rank aggregation techniques was addressed by Kaur et al. (2015), where the Mean and Stuart methods were applied to the ranking of different queries. Beg et al. (2016) covered the rank aggregation problem in metasearch using a modified Shimura method with the addition of an OWA operator. Kaur et al. (2017) implemented a genetic algorithm that optimizes the distance measures for various URLs obtained from different search engines using rank aggregation techniques; genetic-based Kendall-tau (GKTu) and Spearman’s footrule (GSFD) distance measures were proposed to derive optimized distance values. Although the work reported by Beg and Kaur provides better solutions to the ranking problem, it has the drawback of large execution time. Napoles et al. (2015) presented an ant-colony-based model for partial ranking using the Kemeny approach.

In the past few years, more advanced nature-inspired optimization techniques, such as ant colony, particle swarm (PSO) and bee colony, have been proposed for solving dynamic and mathematical problems. These techniques are yet to be applied to the ranking of Web sites. Motivated by the different ranking techniques and by the need for a better rank aggregation strategy with lower execution time, these are taken as the problem objectives of this work.

In this paper, a bio-inspired ant colony-based rank aggregation methodology is proposed. Simulation results are obtained for five aggregation methods: Borda, Markov chain, scaled footrule, PageRank and mean-by-variance. The performance of the suggested scheme is then compared with the traditional genetic algorithm on the basis of execution time and minimized distance value. Finally, three main performance metrics, Precision, Recall and F-Measure, are discussed for checking the efficiency of various search engines.

2 Rank aggregation methods used for metasearch

The area of rank aggregation is divided into two categories, i.e., unsupervised and supervised learning approaches. In the unsupervised approach, no training dataset is required for deriving the results, as in the Borda count (Borda 1781; Saari 2000), median rank aggregation by Dwork et al. (2001), and genetic and fuzzy-based rank aggregation (Beg and Ahmad 2002), whereas supervised techniques utilize a training dataset.

2.1 Kemeny optimal aggregation method

Kemeny optimal aggregation (Kemeny and Snell 1962) is one of the desirable methods used for ranking problems. Given a set of full rankings \(\sigma ^{1},\ldots ,\sigma ^{j},\ldots , \sigma ^{N}\) on the set of items Z, the goal is to find an ordering \(\sigma \) of Z. Suppose that the input rankings \(\sigma ^{j}\) are noisy versions of \(\sigma \), each obtained by swapping pairs of elements of \(\sigma \) with a probability \(r<0.5\). The optimal aggregation is then given as:

$$\begin{aligned} \sigma ^{*}= {\arg \min }_\sigma \frac{1}{N}\sum _{j=1}^N {K(\sigma ,\sigma ^{j}}) \end{aligned}$$
(1)

The estimate \(\sigma ^{*}\) is referred to as the Kemeny optimal aggregation. Dwork et al. (2001) showed that computing the Kemeny optimal aggregation is an NP-hard problem even when the number of input rankings is only four.
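Because the minimization in Eq. (1) ranges over all |Z|! orderings, an exact computation is feasible only for very small item sets. The following minimal Python sketch (illustrative only; the function and variable names are our own) evaluates Eq. (1) by brute force, using the Kendall-tau distance K defined in Sect. 3.1:

```python
from itertools import combinations, permutations

def pairwise_disagreements(sigma, tau):
    # K(sigma, tau): number of item pairs ordered differently in the two rankings.
    pos_s = {item: r for r, item in enumerate(sigma)}
    pos_t = {item: r for r, item in enumerate(tau)}
    return sum(1 for a, b in combinations(sigma, 2)
               if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0)

def kemeny_optimal(rankings):
    # Exhaustive search over all orderings of Z; the problem is NP-hard,
    # so this is practical only for a handful of items.
    items = rankings[0]
    def avg_distance(candidate):
        return sum(pairwise_disagreements(candidate, r)
                   for r in rankings) / len(rankings)
    return min(permutations(items), key=avg_distance)
```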

2.2 Borda’s method

Borda’s method (Borda 1781) is one of the classical methods based on position-oriented scores. Scores are assigned according to the position at which a candidate appears within each voter’s ranked list of preferences, and the score values of all candidates are used for sorting. Formally, let \(\alpha _1,\alpha _2,\ldots ,\alpha _k\) be the set of full lists and let \(S_i (c)\) be the number of candidates ranked below c in list \(\alpha _i\). For each candidate \(c\in D\), the total Borda score is expressed as:

$$\begin{aligned} \sum _{i=1}^k {S_i (c)} \end{aligned}$$
(2)

The candidates are sorted in decreasing order of the Borda score.
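As a concrete illustration, the following short Python sketch (our own naming; it assumes full lists over the same candidate set) computes the Borda score of Eq. (2) and sorts the candidates:

```python
from collections import defaultdict

def borda_aggregate(full_lists):
    # Each list ranks the same candidates, best first.
    scores = defaultdict(int)
    for lst in full_lists:
        for pos, c in enumerate(lst):
            # S_i(c): number of candidates ranked below c in this list.
            scores[c] += len(lst) - 1 - pos
    # Candidates sorted in decreasing order of total Borda score.
    return sorted(scores, key=scores.get, reverse=True)

# Example: borda_aggregate([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
# gives ["a", "b", "c"] with scores a = 5, b = 3, c = 1.
```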

2.3 Markov chain method

Markov chains were introduced by Markov in 1906; their use for rank aggregation, in several variants, was proposed by Dwork et al. (2001). A Markov chain models a system as a set of states, here corresponding to the items to be ranked. A nonnegative \(n\times n\) stochastic transition matrix M is formed, where entry \(M_{ij}\) gives the probability of moving from state i to state j. The method progresses by moving from one state to another: at each step, the next state distribution is the product of the current state distribution and M. The starting distribution is denoted by x. Iterations continue until the distribution reaches a steady state, known as the stationary distribution. The stationary distribution is given by the principal left eigenvector y of M, i.e., \(yM=\gamma y\). The entries of y induce a natural ordering on the state set S, called the Markov chain ordering; Dwork et al. define four variants of M, denoted MC1, MC2, MC3 and MC4, which can be applied to different applications as per their requirements.
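A minimal sketch of the stationary-distribution computation by power iteration is given below (our own naming; it assumes M is row-stochastic):

```python
import numpy as np

def stationary_distribution(M, tol=1e-10, max_iter=10000):
    # M: n x n row-stochastic transition matrix.
    n = M.shape[0]
    x = np.full(n, 1.0 / n)      # uniform start distribution
    for _ in range(max_iter):
        x_next = x @ M           # next distribution = current distribution times M
        if np.abs(x_next - x).sum() < tol:
            return x_next        # steady state reached: y with yM = y
        x = x_next
    return x

# Ranking: the items are then sorted by their stationary probabilities
# in decreasing order to obtain the Markov chain ordering.
```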

2.4 PageRank method

Page and Brin (1998) developed the PageRank algorithm at Stanford University. This algorithm became the backbone of the Google search engine, along with several other search engines. The PageRank method calculates the relevancy of a page in comparison with other existing web pages. Other parameters, such as the title, tags, anchor text and keywords, are also used to calculate the overall rank of a page under PageRank.

The final results are arranged in descending order of PageRank value, computed using the following relation:

$$\begin{aligned} \hbox {PR}\left( u \right) =\left( {1-d} \right) +d\mathop \sum \limits _{v\in B\left( u \right) } \frac{\hbox {PR}\left( v \right) }{N_v} \end{aligned}$$
(3)

where PR(u) denotes the PageRank of page u, B(u) is the set of pages linking to u, and \(N_v\) denotes the number of out-links of page v. The damping factor d, typically set to 0.85 but adjustable as per the user’s requirement, weights the contribution of directly linked pages, while \((1-d)\) accounts for the rank contribution from pages that are not directly linked.
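A minimal sketch of the PageRank iteration of Eq. (3) follows (our own naming; the link structure in the example and the iteration count are illustrative assumptions):

```python
def pagerank(in_links, out_degree, d=0.85, iters=50):
    # in_links[u]: the pages v in B(u) that link to u;
    # out_degree[v]: the number of out-links N_v of page v.
    pr = {u: 1.0 for u in in_links}
    for _ in range(iters):
        pr = {u: (1 - d) + d * sum(pr[v] / out_degree[v]
                                   for v in in_links[u])
              for u in in_links}
    return pr  # pages are then sorted by PR value in descending order

# Example: a tiny 3-page web where "a" links to "b" and "c",
# and both "b" and "c" link back to "a".
# pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]},
#          {"a": 2, "b": 1, "c": 1})
```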

2.5 HITS algorithm

The HITS algorithm (Kleinberg 1999) is based on the concepts of authorities and hubs, instead of in-links and out-links, for assigning ranks to pages. The edges between hubs and authorities are used for the calculations, and weights are assigned to hubs and authorities in order to refine the results. The algorithm computes the ranks iteratively; at each iteration, the weights assigned to a node are computed as:

$$\begin{aligned} a_i = \sum \limits _{j\in B\left( i \right) } h_{j} \quad {\text {and}} \quad h_j = \sum \limits _{i\in F\left( j \right) } a_i \end{aligned}$$
(4)

where \(a_i\) is the authority score of node i, updated from the hub scores of the nodes B(i) linking to it, and \(h_{j}\) is the hub score of node j, updated from the authority scores of the nodes F(j) it links to.
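A minimal sketch of the iterative update of Eq. (4), with normalization added to keep the scores bounded (a standard practice, though not stated explicitly above), is:

```python
import numpy as np

def hits(adj, iters=50):
    # adj[i][j] = 1 if page i links to page j,
    # so j is in F(i) and i is in B(j).
    A = np.asarray(adj, dtype=float)
    a = np.ones(A.shape[0])  # authority scores
    h = np.ones(A.shape[0])  # hub scores
    for _ in range(iters):
        a = A.T @ h          # a_i = sum of h_j over pages j in B(i)
        h = A @ a            # h_j = sum of a_i over pages i in F(j)
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h
```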

3 Rank aggregation-based distance measures

Rank aggregation can be defined as generating a new list of elements that is closest to a set of given lists. The performance of a rank aggregation method is measured by two distance measures, namely the Kendall-tau distance and Spearman’s footrule distance. These measures were first reported by Diaconis and Graham (1977) and Diaconis (1988).

3.1 Kendall-tau distance measure

This measure is widely used to calculate the correlation between two ranked lists. The Kendall-tau distance (Kendall 1938) counts the number of inversions in the order of pairs of candidates between two ranked lists, i.e., the number of pair-wise disagreements between two full lists. For two full lists \(\sigma \) and \(\tau \), with k and l denoting items, it is formulated as:

$$\begin{aligned} K\left( {\sigma ,\tau } \right) =\left| {\left\{ {\left( {k,l} \right) : k<l,\ \sigma \left( k \right) <\sigma \left( l \right) \ {\text {but}} \ \tau \left( k \right) >\tau \left( l \right) } \right\} } \right| \end{aligned}$$
(5)

The normalized Kendall-tau distance is obtained by dividing \(K(\sigma ,\tau )\) by its greatest possible value, \(0.5\,n(n-1)\) pair-wise disagreements for lists of n items.
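A direct Python transcription of Eq. (5) with the normalization just described (our own naming) is:

```python
from itertools import combinations

def kendall_tau_distance(sigma, tau, normalize=True):
    # sigma, tau: full ranked lists over the same items, best first.
    pos_s = {item: r for r, item in enumerate(sigma)}
    pos_t = {item: r for r, item in enumerate(tau)}
    # Count pairs (k, l) with sigma(k) < sigma(l) but tau(k) > tau(l).
    d = sum(1 for k, l in combinations(sigma, 2)
            if (pos_s[k] - pos_s[l]) * (pos_t[k] - pos_t[l]) < 0)
    if normalize:
        n = len(sigma)
        d /= 0.5 * n * (n - 1)   # divide by the greatest possible value
    return d
```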

3.2 Spearman’s footrule distance measure

Spearman’s footrule distance (Diaconis and Graham 1977) is defined as the absolute distance between the positions of candidates in two ranking lists. Suppose two document lists l1 : (d1, d2, d3, d4) and l2 : (d4, d3, d2, d1) are given. The footrule distance d is calculated as the sum of absolute differences between the positions of the documents; e.g., the position of d1 in l1 is 1st and in l2 is 4th, so the contribution of document d1 to the distance is \(|{1-4}|\). Hence, the total distance is calculated as:

$$\begin{aligned} d= & {} \left| {1-4} \right| +\left| {2-3} \right| +\left| {3-2} \right| +\left| {4-1} \right| \\ d= & {} 8 \end{aligned}$$

In general, Spearman’s footrule distance sums, over all elements k of a given set, the absolute difference between the ranks of k according to the two lists. For two given lists \(\sigma ,\tau \), their Spearman’s footrule distance is given as:

$$\begin{aligned} F\left( {\sigma ,\tau } \right) =\mathop \sum \limits _k \left| {\sigma \left( k \right) -\tau \left( k \right) } \right| \end{aligned}$$
(6)
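The worked example above can be reproduced with a direct transcription of Eq. (6) (our own naming):

```python
def footrule_distance(sigma, tau):
    # F(sigma, tau): sum of absolute rank differences; ranks are 1-based.
    pos_s = {d: r + 1 for r, d in enumerate(sigma)}
    pos_t = {d: r + 1 for r, d in enumerate(tau)}
    return sum(abs(pos_s[d] - pos_t[d]) for d in sigma)

# Reproduces the example above:
# footrule_distance(["d1", "d2", "d3", "d4"],
#                   ["d4", "d3", "d2", "d1"])  # -> 8
```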

In this work, minimization of Kendall-tau and Spearman’s footrule distance measures is formulated as objective functions:

$$\begin{aligned} {\text {Obj}}\_{\text {function}} =\min \{K(\sigma ,\tau ),F(\sigma ,\tau )\} \end{aligned}$$
(7)

4 Rank aggregation by genetic algorithms approach

The genetic algorithm (GA) was developed by Holland (1975) and forms a strong foundation for evolutionary computing (Yang 2010, 2014). The GA process is basically divided into three operators: crossover, mutation and selection. GA was first applied to NP-hard problems by Goldberg (1989) in the search and machine learning areas. The general procedure for a GA is to encode the objective or cost, define the fitness function or selection criterion, and create a population of individuals. An iteration or evolution cycle is then carried out by evaluating the fitness of all individuals in the population, and within these cycles new populations are created from the older ones. Lastly, the results are decoded to obtain the optimized value (Yang 2010). The pseudo-code for GA-optimized distance measures for rank aggregation is given as:

[Figure a: pseudo-code of the GA-based rank aggregation procedure]
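Since the pseudo-code figure is not reproduced here, the following simplified Python sketch illustrates the loop described above; it uses a plain swap mutation in place of the permutation-multiplication crossover detailed below, evaluates only the aggregated footrule distance for brevity, and all parameter values and names are illustrative assumptions:

```python
import random

def ga_rank_aggregation(input_rankings, universe, pop_size=50, gens=100):
    # Chromosomes are permutations of the universe; fitness is the
    # aggregated Spearman's footrule distance to the M input rankings.
    def aggregated_distance(perm):
        pos = {item: r for r, item in enumerate(perm)}
        total = 0.0
        for ranking in input_rankings:
            rpos = {item: r for r, item in enumerate(ranking)}
            total += sum(abs(pos[k] - rpos[k]) for k in universe)
        return total / len(input_rankings)

    # Initial pool of randomly generated permutations of the universe.
    pool = [random.sample(list(universe), len(universe))
            for _ in range(pop_size)]
    for _ in range(gens):
        pool.sort(key=aggregated_distance)        # lower distance survives
        survivors = pool[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = random.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]  # swap keeps a valid permutation
            children.append(child)
        pool = survivors + children
    return min(pool, key=aggregated_distance)
```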

The objective function from Eq. (7) is evaluated on a set of chromosomes selected from a pool of randomly generated permutations of the universe \(U=\{1,2,3,\ldots , |U|\}\). Each chromosome in the pool takes its turn at evaluating the objective function, which is the aggregated footrule distance and the aggregated Kendall distance of that chromosome with respect to the rankings given by the M participating search engines. Chromosomes with lower values survive, whereas those with higher values leave the pool (Beg and Ahmad 2003).

During the crossover stage, consider two chromosomes to be crossed, \(D_i^k =[d_{(i,1)},d_{(i,2)},\ldots ,d_{(i,n)}]\) and \(D_j^k =[d_{(j,1)},d_{(j,2)},\ldots ,d_{(j,n)}]\), where \(n=|U|\).

Then, the first \(m\,(m<n)\) elements are exchanged between the two chromosomes, and the remaining elements, \(d_{(i,m+1)},d_{(i,m+2)},\ldots ,d_{(i,n)}\) in \(D_i^k\) and \(d_{(j,m+1)},d_{(j,m+2)},\ldots ,d_{(j,n)}\) in \(D_j^k\), are adjusted to form the new chromosomes \(D_i^{(k+1)} =[d_{(j,1)},\ldots , d_{(j,m)},d_{(i,m+1)}^{\prime }, \ldots ,d_{(i,n)}^{\prime }]\) and \(D_j^{(k+1)} =[d_{(i,1)},\ldots ,d_{(i,m)},d_{(j,m+1)}^{\prime }, \ldots ,d_{(j,n)}^{\prime }]\), which are valid permutations. The adjustment is implemented using the concept of multiplication of permutations (Beg and Ahmad 2003). The adjusted tails are formed by multiplication of permutations as per the following relations:

$$\begin{aligned} \left\{ d_{(i,m+1)}^{\prime },\ldots ,d_{(i,n)}^{\prime } \right\}= & {} \{d_{(j,m+1)} ,d_{(j,m+2)},\ldots ,d_{(j,n)}\} \nonumber \\&\times \{d_{(i,m+1)},d_{(i,m+2)} ,\ldots ,d_{(i,n)}\} \end{aligned}$$
(8)
$$\begin{aligned} \left\{ d_{(j,m+1)}^{\prime },\ldots ,d_{(j,n)}^{\prime } \right\}= & {} \{d_{(i,m+1)} ,d_{(i,m+2)},\ldots ,d_{(i,n)}\} \nonumber \\&\times \{d_{(j,m+1)},d_{(j,m+2)} ,\ldots ,d_{(j,n)}\} \end{aligned}$$
(9)

This produces a new generation of valid permutations of the universe U, \(D_i^{(k+1)}\) and \(D_j^{(k+1)}\), for the two distinct chromosomes. During mutation, the digit to be mutated is exchanged with another randomly selected digit within the same permutation, so mutation also produces chromosomes that are valid permutations (Beg and Ahmad 2003).
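For completeness, one common convention for the multiplication (composition) of permutations used in Eqs. (8) and (9) is sketched below; the 0-based-position encoding is our own assumption and may differ from the convention in Beg and Ahmad (2003):

```python
def multiply_permutations(p, q):
    # Composition p * q acting on 0-based positions: the position that q
    # assigns to rank r is re-mapped through p, so the product is again
    # a valid permutation, as required for the crossover tails.
    return [p[q[r]] for r in range(len(q))]

# Example: multiply_permutations([1, 2, 0], [2, 0, 1]) -> [0, 1, 2]
# (the two permutations are inverses, so their product is the identity).
```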

GA (Kaur et al. 2017) was applied to find the optimal values of the Kendall-tau (KTu) and Spearman’s footrule (SFD) distance measures. The initial population is created with the help of rank aggregation methods, such as the Borda, MBV, Markov, PageRank and scaled footrule methods, given in Table 2. The Kendall-tau and Spearman’s footrule measures are then chosen as objective functions to calculate the distance values. The results obtained from 50 and 100 iterations using GA are shown in Tables 3 and 4. The complete steps of the GA algorithm using Kendall-tau and Spearman’s footrule for rank aggregation are discussed in Kaur et al. (2017).

The literature points to several drawbacks of GA: execution time is large, and the rates of mutation, crossover and selection need to be handled carefully to avoid false results. The performance of individual queries was also not covered in our previous work (Kaur et al. 2017) in terms of Precision, Recall and F-Measure. Here we propose ant colony-based Kendall-tau and Spearman’s footrule distance measures for the rank aggregation of each query. The performance of each search engine for each query, with and without word association, is also tested using the Precision, Recall and F-Measure parameters.

5 Proposed ant colony-based rank aggregation strategy

In 1992, M. Dorigo introduced an optimization technique modeled on the foraging behavior of wild ants, noting their food searching capabilities. This foraging behavior was later utilized for solving various optimization problems, as reported by Dorigo (1992) and Dorigo and Gambardella (1997). The ant colony optimization (ACO) strategy is mainly inspired by the food searching technique of ants: pheromone concentration is strengthened on the path closest to the food, and pheromone evaporates from paths far away from the food or destination (Yang 2014). ACO has also been applied to discrete problems, as addressed by Dorigo et al. (1999) and Blum (2005). Similarly, the natural food-search behavior of honey bees is described by Gould and Gould (1988). ACO is based on the principle that a random route is generated first, which is similar to mutation; pheromone-based selection then helps in selecting the shortest routes. There is no explicit crossover in ACO, but fitness-based mutation helps in providing the solutions. In this work, a web link rank aggregation problem is framed and two objective functions are defined, Kendall-tau and Spearman’s footrule, and the URLs are arranged so that the URL with the minimum objective function value occupies the highest (first) rank position in the whole list. The pseudo-code for the proposed ant colony-based rank aggregation strategy is given as:

[Figure b: pseudo-code of the proposed ACO-based rank aggregation strategy]
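As the pseudo-code figure is not reproduced here, the following simplified Python sketch illustrates one way to realize the strategy: pheromone values are kept per (URL, rank position) pair, ants build candidate permutations by pheromone-weighted sampling, and the iteration-best list is reinforced. All parameter values and names are illustrative assumptions, and only the footrule objective is shown for brevity:

```python
import random

def aco_rank_aggregation(input_rankings, universe, ants=20, iters=100,
                         rho=0.1, q=1.0):
    # Pheromone tau[item][rank]: desirability of placing item at that rank.
    n = len(universe)
    tau = {item: [1.0] * n for item in universe}

    def aggregated_distance(perm):
        # Aggregated Spearman's footrule distance to the input rankings.
        pos = {item: r for r, item in enumerate(perm)}
        total = 0.0
        for ranking in input_rankings:
            rpos = {item: r for r, item in enumerate(ranking)}
            total += sum(abs(pos[k] - rpos[k]) for k in universe)
        return total / len(input_rankings)

    best_perm, best_cost = None, float("inf")
    for _ in range(iters):
        solutions = []
        for _ in range(ants):
            remaining = list(universe)
            perm = []
            for rank in range(n):
                weights = [tau[item][rank] for item in remaining]
                choice = random.choices(remaining, weights=weights)[0]
                perm.append(choice)
                remaining.remove(choice)
            solutions.append((aggregated_distance(perm), perm))
        # Evaporate pheromone, then reinforce the iteration-best list.
        for item in universe:
            tau[item] = [(1 - rho) * t for t in tau[item]]
        cost, perm = min(solutions, key=lambda s: s[0])
        for rank, item in enumerate(perm):
            tau[item][rank] += q / (1.0 + cost)
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost
```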

The optimized values of the objective functions obtained with the proposed ACO-based strategy are depicted in Tables 3 and 4, respectively.

6 Results and discussion

In this section, the performance of the proposed ant colony-based strategy is compared with the genetic algorithm for rank aggregation. The initial dataset is taken from Kaur et al. (2017); it comprises three main queries (Q1–Q3), which are further refined using word association, as shown in Table 1. All these queries are fed to five search engines: \({ Google}^{1}\), \({ AltaVista}^{2}\), \({ Deeperweb}^{3}\), \({ Excite}^{4}\) and \({ Hotbot}^{5}\).

Table 1 Main queries and new logical operator-based Queries
Table 2 Aggregated rank list for each query by rank aggregation methods

The aggregated rank list for each query, produced by the rank aggregation methods Borda, mean-by-variance, Markov chain MC4, PageRank and scaled footrule, is given in Table 2.

The comparative analysis between GA- and ACO-based ranking strategies is discussed in the next subsections.

6.1 Performance comparison of GA- and ACO-based ranking strategies

Both the proposed ACO-based and the GA-based (Kaur et al. 2017) ranking strategies are compared on the basis of experimental simulation results for 50 and 100 iterations. Table 3 lists the optimized values of the Kendall-tau distance measure and the execution time at 100 iterations for each query and rank aggregation method. The values for 50 iterations are not shown due to space constraints, but the graphical results for 50 iterations are given in Figs. 1 and 2, respectively.

Table 3 Comparison of proposed ACO and GA optimized Kendall-tau distance measure at (100 gen.)

Likewise, the optimized values of the Spearman’s footrule distance measure and the execution time for both the GA-based and the proposed ACO-based techniques at 100 iterations are shown in Table 4.

Fig. 1 Graphical representation of fitness value at 50 and 100 iterations for GA and ACO for query Q1 by Borda

Fig. 2 Graphical representation of fitness value at 50 and 100 iterations for GA and ACO for query Q3b by Borda

Table 4 Comparison of proposed ACO and GA optimized Spearman’s footrule distance measures at (100 gen.)

Tables 3 and 4 show the optimized distance values for the Kendall-tau and Spearman’s footrule distance measures in standard form and with GA and ACO at 100 iterations, respectively. Table 3 shows that the proposed ACO-based strategy gives better results than GA. For example, for Query 1 the Kendall-tau distance value is 0.237 with the standard method and 0.226 with GA, while ACO reduces it to 0.178 with a much smaller execution time of 13.16 s. Likewise, at the 100th iteration the Spearman’s footrule value is 0.51 with GA, whereas with ACO it is 0.31, with an execution time of 14.98 s compared to 139.8 s for GA, as shown in Table 4. Hence, from the overall analysis, the optimized results of the proposed ACO-based strategy are found to be much better than those of GA and the standard form for both distance measures.

Table 5 Performance comparison of rank aggregation methods on the basis of aggregated ranks

Although the average aggregated distance value for the Borda method is found to be minimum in its standard form, the ACO-based strategy gives optimized results for the other ranking methods. Table 5 shows the average aggregated rank of all queries for the Kendall-tau and Spearman’s distance measures.

Table 6 Comparison of rank aggregation methods on the basis of GA and ACO optimized Kendall and Spearman’s distance measures

Table 6 indicates the average aggregated rank values of the Kendall-tau and Spearman’s footrule measures for the genetic and the proposed ACO-based ranking strategies. The aggregated distance values for all the ranking methods are found to be minimum with the proposed strategy compared to genetic optimization (Kaur et al. 2017), as indicated in Table 6. It is concluded from the aggregated ranks that the Borda method has the minimum rank value among its counterparts. Hence, the Borda method has been chosen for further analysis and for obtaining a more refined rank list that can meet the user’s satisfaction level.

Figure 1 shows the graphical representation of the best cost derived from the GA and the proposed ACO approach for the Borda method with Query Q1 at the 50th and 100th iterations. Figure 2 shows that the ACO-based approach outperforms GA in terms of best cost, i.e., the optimized distance measure value; it presents the best cost for the Borda method with Query Q3b at 100 iterations, which improves on the 50th-iteration value under the proposed ACO algorithm.

6.2 Best optimized aggregated rank list for Queries

For the given set of three queries Q1–Q3, two sub-queries were derived from each using the AND–OR technique to test query relevancy, as indicated in Table 1, giving a total of nine queries. From the above analysis, among the set of queries Q1, Q1a and Q1b, the query Q1 occupies the topmost position in the aggregated list, with the minimum distance value obtained by Borda’s method, as indicated in Tables 3 and 4. Table 7 shows the previous and the optimized new ranks of 20 URLs for query Q1. Hence, the final aggregated list of 20 URLs for query Q1 is

Q1:\(\{1,3,2,4,7,9,16,19,12,18,5,10,6,11,15,8,13,14,17,20\}\) as given in Table 7.

Table 7 Optimized aggregated rank list of URLs

Similarly, among the queries Q2, Q2a and Q2b, the query Q2a has the most optimized aggregated list of URLs, obtained from the scaled footrule rank aggregation method, as given in Tables 3 and 4. Among the queries Q3, Q3a and Q3b, the query Q3a under PageRank is the winner on the basis of the minimum distance measure value obtained from the proposed ACO-based strategy, given in Tables 3 and 4. Hence, the best final aggregated URL lists for Q2a and Q3a are:

$$\begin{aligned} t_\mathrm{query2a}= & {} \left\{ 15,4,9,13,8,7,6,11,3,10,5,12,14,1,2,\right. \\&\left. 16,17,18,19,20\right\} \\ t_\mathrm{query3a}= & {} \left\{ 4,14,20,8,16,2,1,3,10,6,7,9,12,11,15,\right. \\&\left. 13,5,17,18,19 \right\} \end{aligned}$$

6.3 Performance metrics for optimized lists

In this section, various testing parameters, namely Precision, Recall and F-Measure, are applied to investigate the performance of metasearch engines.

6.3.1 Precision metrics

Precision, also called positive predictive value, is the fraction of retrieved instances that are relevant, i.e., the ratio of retrieved relevant documents to all retrieved documents. The performance of the final aggregated queries Q1, Q2a and Q3a is evaluated on the basis of Precision. The Precision values for the URLs derived for query Q1 are calculated using Eq. (13). Table 8 presents the Precision of the top twenty documents retrieved from each of the five search engines.

$$\begin{aligned} {\text {Precision}} =\frac{{\text {retrieved relevant}}}{{\text {retrieved}}}= \frac{|{A\cap B}|}{|B|} \end{aligned}$$
(13)
Table 8 Precision for Query Q1

In Table 8, the average Precision for the Google and Excite search engines is found to be better than for the other search engines.

6.3.2 Recall metrics

Recall, also known as sensitivity, is the fraction of relevant instances that are retrieved, i.e., the ratio of retrieved relevant documents to all relevant documents. Using Eq. (14), the Recall is calculated for Q1 and the results are shown in Table 9. The Recall value for the first URL of Q1 is 0.05, i.e., 1/20, since the first URL accounts for one of the total of 20 relevant URLs. Similarly, the Recall is calculated for all of the top twenty documents, as shown in Table 9.

$$\begin{aligned} {\text {Recall}} =\frac{{\text {retrieved relevant}}}{{\text {relevant}}}= \frac{|{A\cap B}|}{|A|} \end{aligned}$$
(14)
Table 9 Recall for Query Q1

6.3.3 F-Measure metrics

F-Measure, also known as F-score, is the weighted harmonic mean used for measuring accuracy. It is used here with Precision and Recall equally weighted, and it considers both values to calculate the score. The F-Measure is calculated using Eq. (15) for the top twenty documents retrieved for Query Q1, as shown in Table 10.

$$\begin{aligned} F{\text {-Measure}} =\frac{2\times {\text {Precision}} \times {\text {Recall}}}{{\text {Precision}}+ {\text {Recall}}} \end{aligned}$$
(15)
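A minimal sketch computing Eqs. (13)–(15) for one ranked result list (our own naming; A is the relevant set and B the retrieved list, as in the equations) is:

```python
def precision_recall_f(retrieved, relevant):
    # retrieved: ranked list B returned by a search engine;
    # relevant:  set A of documents judged relevant.
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0   # Eq. (13)
    recall = hits / len(relevant) if relevant else 0.0        # Eq. (14)
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)            # Eq. (15)
    return precision, recall, f_measure
```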

Table 10 presents the F-Measure for Query Q1, calculated using Eq. (15) for the twenty URLs.

Table 10 F-Measure for Query Q1
Fig. 3 Average performance metrics

Figure 3 shows the comparison of Precision, Recall and F-Measure on the basis of the average value of the performance metrics for query Q1. It also shows that, out of the five search engines, Google’s performance metrics surpass those of the other search engines. The positions acquired by the search engines in terms of performance metrics are \(\hbox {Google}>\hbox {Excite}> \hbox {HotBot}> \hbox {AltaVista} > \hbox {Deeperweb}\).

The values of Precision, Recall and F-Measure for all the final aggregated queries, used to check the relevancy of the documents, are shown in Table 11. Table 11 also reports the average Precision and Recall values used to check the relevancy of each query given in Tables 3 and 4 under the proposed ACO scheme. Hence, it is concluded from the performance analysis that the Google and Excite search engines are the winners in producing the most relevant documents for users under the proposed ACO-based strategy.

7 Conclusion and future scope

Table 11 Average performance metrics

In this work, the ranking problem for web links has been framed by applying different rank aggregation techniques. Objective functions based on the Kendall-tau and Spearman’s footrule distance measures are chosen for optimization. An ant colony-based rank optimization strategy has been proposed to minimize the objective function value and to assign the highest rank to the web link having the lowest objective function value. The performance of the proposed ranking strategy has been compared with the standard and GA-optimized rank aggregation methods on the basis of distance values (Kendall and Spearman) and execution time for each query, as given in Tables 3 and 4, respectively. From the simulation results, it is concluded that Borda is the best among the aggregation methods considered, as it gives the minimum distance values. Finally, performance tests, namely Precision, Recall and F-Measure, are applied to evaluate the proposed strategy. The word association operators applied to the queries provide more prominent results, very close to the web lists provided by the search engines. In the future, other meta-heuristic approaches, such as particle swarm, teacher-learner, moth search and hybrid techniques, can be applied to optimize the performance of search engines and to improve metasearch techniques.