1 Introduction

With the rapid growth of the internet, everybody experiences a flood of information. It is estimated that the present web contains at least 4.62 billion pagesFootnote 1, which makes it highly difficult for common users to find the desired information on the web. Ranking is the most commonly used technique in the field of Information Retrieval (IR); it brings the required documents to the top of the retrieved results. First, for a given set of documents and a query, a scoring function is computed which determines the degree of relevance of each document with respect to the query. Then a ranking list is generated by sorting the documents based on their relevance scores. Modern ranking approaches use different techniques such as BM25 [37] and PageRank [34] to generate such a ranking function, and thereby achieve great improvements in ranking performance [28]. By nature, most IR problems are ranking problems, such as anti web spam [38, 45], collaborative filtering [7], product rating [14], key term extraction [11, 41, 44, 46], important e-mail routing [10], sentiment analysis [32, 43], definition finding [58], and text summarization [40, 48]. Among these ranking problems, document ranking is a common problem faced by many search engines. Emails, web documents, news articles, books, and academic papers are some examples of documents. Some ranking scenarios of document retrieval are:

  • Documents are ranked as per their relevance to the user query.

  • Website structures [12], diversity [61], and similarity relationships [55] between documents are some of the features considered during the ranking process; this is known as relational ranking [36].

  • Several candidate ranked lists are aggregated to get a better ranked list, known as meta search [4].

  • Determining to what degree a property of a web document influences the ranking result.

In recent years, ranking has become a very important research direction in the domain of IR, and a large number of ranking models have been proposed that have achieved high influence [3, 21, 39, 57]. All these models can be roughly categorized as:

  • Query-dependent model:

    In this model, documents are retrieved based on the occurrences of the query terms in the documents. Examples are the standard Boolean model [9], the Vector Space model [50], Latent Semantic Indexing [31], and probabilistic ranking techniques such as the Binary Independence Model [23] and Latent Dirichlet Allocation [24].

  • Query-independent model:

    In this model, documents are ranked based on their own importance, as in the traditional PageRank algorithm, query-independent learning [13], content-based techniques [30], etc.

PageRank was the first algorithm used by the Google search engine to rank web pages (or documents). In the PageRank algorithm, the importance of a web document is evaluated by considering the number of quality incoming links to that document. As the internet grows rapidly, PageRank helps to retrieve the required information quickly. In the current PageRank algorithm, the importance or relevance of a web document is a relative concept that completely depends on the user query. This is one of the major drawbacks of the present ranking system, and it can be addressed by using the concept of the personalized PageRank algorithm. There are many such limitations of the existing PageRank algorithms [22], and some of them are listed below:

  • In some of the PageRank algorithms, PageRank is calculated not at the query time but at the indexing time.

  • Most of the PageRank algorithms suffer from a problem called topic drifting, which decreases their efficiency.

  • Some PageRank algorithms judge pages based on the importance of the web documents, whereas others completely ignore the importance of each individual document.

  • The content of web documents, which plays a vital role in a PageRank algorithm, is sometimes ignored, which reduces the performance of the algorithm.

Among the above limitations, the main limitation of the traditional PageRank algorithm is topic drifting. This is because it assumes a uniform link structure, i.e., a surfer jumps from one document to another uniformly. For example, suppose someone is looking for documents related to computer science; then documents that have outgoing links to biological documents are also incorporated in the computation of PageRank (since some biological documents can be relevant or linked to computer science, such as ‘prediction of diabetes using machine learning’, ‘detection of breast cancer’, ‘recognizing brain tumors’, ‘finding stress levels in the human brain’, etc.).

Our earlier query-optimized PageRank approach [42] dealt with the limitations of the PageRank algorithm by biasing the next jump towards documents relevant to the user query. The importance of web pages for different users can be better determined if the PageRank algorithm takes user preferences into consideration, which is called personalized PageRank. The importance of a page differs for individuals with different interests, knowledge, and backgrounds, so a global ranking of a web page does not necessarily indicate the importance of that page for individual users. It is therefore important to calculate a personalized view of the importance of the pages.

Hence, to make the query-optimized PageRank approach better (i.e., more user friendly), the proposed technique extends our earlier approach by introducing a personalized PageRank that combines with a user query to rank the web documents, which is the main objective of this paper.

The major contributions of the proposed approach are as follows:

  i.

    Incorporating the importance of a document into its personalized PageRank is an innovative way to rank web documents. It updates the link structure according to the similarity score between the document and the user query, and thereby refines the retrieval results by bringing the required documents to the top.

  ii.

    Using the traditional PageRank algorithm, search engines might return pages that do not satisfy user needs and preferences. Hence, using the personalized PageRank algorithm to restructure the links based on the user query is more beneficial when implementing it on a search engine over datasets that have many citations and good link structures, such as Wikipedia databases, research journal databases, business databases, etc.

  iii.

    As the content of each web document is considered along with the personalized PageRank, the proposed approach achieves high performance by bringing the required documents to the top of the search results. Here, the content of each web document and the user query are converted into TF-IDF form, and the similarity score is computed between them. Based on this similarity score, the required documents are retrieved at the top of the search results, which reduces the searching time of the user.

  iv.

    By re-ranking the web documents, the relevancy of the results is enhanced. Here, re-ranking means personalized ranking of our earlier query-optimized PageRank approach, improved by using the optimization function [18]. The modified link structure is the input to the personalized PageRank algorithm and contains only those output links that are connected to relevant documents (here, relevance means non-zero cosine-similarity with the user query).

  v.

    By introducing a novel feature selection technique named TCFS (Term-Term Correlation based Feature Selection), the noise terms are removed from the corpus before the personalized PageRank starts. This makes the personalized PageRank process more effective.

Although much work has been done on ranking web documents (as is evident from the past literature), those ranking mechanisms either completely ignore the content of the documents or are fully dependent on the user query. Hence, the realm of personalized PageRank combined with similarity scores between the relevant documents and the user query provides a relatively unexplored pool of opportunities. The proposed algorithm is implemented on different benchmark datasets, and the experimental results show the effectiveness of the proposed feature selection technique and the query-optimized personalized PageRank algorithm.

The remainder of this paper is organized as follows: Sect. 2 discusses past work in the ranking domain. The basic preliminaries required for the proposed approach are discussed in Sect. 3. Section 4 discusses the query-optimized personalized PageRank. The experimental work of the proposed approach is analyzed in Sect. 5. Section 6 concludes the work with some future enhancements.

2 Past work

The dynamic web contains a huge volume of digital documents, and it is growing very rapidly. This makes it difficult for a search engine to retrieve relevant results. A search engine needs to rank the documents in such a way that the retrieved results are most relevant to the user. Among the existing ranking techniques, Spatial TF-IDF, suggested by Ali et al. [25], ranks documents by incorporating their spatial and textual features. The authors also proposed a method named Spatial-Keyword Inverted File for Points (SKIF-P) for web document indexing. They implemented their algorithm on real and synthetic datasets and showed that their technique is more efficient than existing ranking techniques. Chahal et al. discussed a new semantic-based document ranking mechanism [8] where conceptual instances between the keywords are captured by building an ontology. The authors analyzed important relations among the keywords, and the importance of each web document is decided based on these relations. Experimentally, they showed that their approach can outperform existing ranking techniques. Derhami et al. [15] proposed Reinforcement Learning (RL) for web document ranking. They considered each web document as a state and developed a technique combining RL Rank and BM25 (a content-based algorithm) to rank the documents. Experimental results on the LETOR and dotIR datasets show that their approach achieves much better results than the PageRank algorithm. Du and Hai [17] suggested a semantic approach for web documents based on formal concept analysis. Their approach uses a combination of all three types of web mining (i.e., web content, usage, and structure). Empirical results show that the returned results are highly efficient and relevant to the user query.

Patterns or similar words of a document are combined to generate a topic. Topic models play a vital role in document ranking, and some primary research has been done in this direction [29, 47, 52, 62]. Bougouin et al. [6] suggested a graph-based topic ranking mechanism for key-phrase extraction. Their approach generates topics by clustering the candidate key-phrases. Empirical results on benchmark datasets show that their method is better than existing ranking methods. In a similar line, multiple topic tracking, which classifies news articles as either interesting or not for a specific user, was developed by Pon et al. [35]. Empirical results justify the performance of their approach compared to traditional pattern- and term-based models. A pattern-based topic model, which is an information filtering model, was proposed by Gao et al. [20]. In their work, multiple topics are combined to generate useful information for ranking the documents. Experimental results on different benchmark datasets justify the efficiency of their work.

The present document ranking structure treats the user query as independent, which overlooks the interests of the user. Working in this direction, a cumulative proximity expansion method was proposed by Vuurens et al. [56]. The authors found that the positions of query term occurrences are very useful for measuring a document’s relevance. They implemented their work on Newswire and web corpora and showed the effectiveness of their technique. Evi et al. [60] proposed a quality-biased ranking that incorporates signals from passages based on a novel use of community question answering data. Their approach develops a set of methodologies to improve the term relevance estimates from which answering passages are extracted. Ranking experiments on two web test collections (GOV2 and ClueWeb09B) show the efficiency of their approach. Fafalios et al. [19] suggested a ranking method that ranks archive documents for structured queries. They proposed probabilistic and stochastic ranking models that consider the timeliness, relativeness, and temporal relations among the documents. For experimental purposes, they used the New York Times annotated corpus, which contains 1.8 million articles, and the results show the effectiveness of their approach. A deep learning architecture named ‘DeepRank’ was used by Pang et al. [33] for relevance ranking of documents. DeepRank captures query term importance, proximity heuristics, and diverse relation requirements. Empirical results on LETOR4.0 and a large clickthrough dataset show that the DeepRank model outperforms existing ranking methods and deep IR models.

The approaches discussed above are either fully dependent on the query or neglect the content of the web documents. Combining the personalized PageRank of documents with their relevance is an innovative way to rank web documents: it updates the link structure of documents based on their similarity score with the user query, and thereby refines the retrieval results by bringing the required documents to the top. Experimental results on five benchmark datasets show the efficiency of the proposed ranking approach.

3 Basic preliminaries

3.1 TF-IDF

\(TF\text{- }IDF\) [53] is a common technique which finds the importance of a term t in a given document d by considering its appearance in the whole corpus and is shown in Eq. 1.

$$\begin{aligned} TF\text{- }IDF_{t, d} = TF_{t, d} \times IDF_{t} \end{aligned}$$
(1)

where,

$$\begin{aligned} TF_{t, d} = \frac{Number\; of\; t\in d}{|d|} \end{aligned}$$

|d| represents the total length of d, and

$$\begin{aligned} IDF_{t} = \log _{10}\left( \frac{ Number\; of\;d\in P}{Number\; of\; d\; containing\; t}\right) \end{aligned}$$
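To make the computation concrete, the following is a minimal Python sketch of Eq. 1; the tokenized toy corpus and all names are ours, not part of the paper:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF per Sect. 3.1: TF is the frequency of t in d divided by |d|;
    IDF is log10(#documents / #documents containing t)."""
    n_docs = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log10(n_docs / df[t])
                        for t in tf})
    return weights

corpus = [["ranking", "web", "documents"],
          ["personalized", "pagerank", "ranking"],
          ["web", "graph", "pagerank"]]
print(tf_idf(corpus)[0])  # 'documents' scores highest in document 0
```

Note that a term appearing in every document gets an IDF of log10(1) = 0, so such terms carry no weight, which is the intended behavior.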

3.2 Silhouette coefficient

Silhouette Coefficient [49] of a term t is defined using Eq. 2.

$$\begin{aligned} silhouette(t) = \frac{s(t) - c(t)}{max \big (c(t), s(t)\big )} \end{aligned}$$
(2)

where c(t) and s(t) are the cohesion (how close is t to its own cluster) and separation score (how well separated is t from other clusters) of the term t respectively.

3.3 Fuzzy C-means

Fuzzy C-Means (FCM) algorithm [5] distributes a finite collection of n documents into c clusters. It returns a list of c cluster centroids along with a matrix giving the degree of membership of each document in each cluster. It aims to minimize the objective function shown in Eq. 3.

$$\begin{aligned} T_m=\sum _{i=1}^{n}\sum _{j=1}^{c}v_{ij}^{m}||d_{ij}||^{2} \end{aligned}$$
(3)

where \(d_{ij} = x_{i}-c_{j}\) (so \(||d_{ij}||\) is the distance between document \(x_i\) and centroid \(c_j\)), m is the fuzzy coefficient, generally set to 2, \(c_{j}\) is the centroid (vector) of cluster j, and \(x_{i}\) is the \(i^{th}\) document. \(v_{ij} \in [0, 1]\) is the degree of membership of \(x_{i}\) with respect to \(c_{j}\), subject to the following conditions:

$$\begin{aligned}&\sum \limits _{j=1}^{c}v_{ij} = 1,~~ i=1, 2, 3, \ldots ,n \;\; \text{ and } \\&0< \sum \limits _{i=1}^{n}v_{ij} < n, ~~ j=1, 2, 3, \ldots ,c \end{aligned}$$

The values of \(c_j\) and \(v_{ij}\) are updated at each iteration using Eqs. 4 and 5 respectively.

$$\begin{aligned} c_{j}= & {} \frac{\sum _{i=1}^{n}v_{ij}^{m}\,x_{i}}{\sum _{i=1}^{n}v_{ij}^{m}} \end{aligned}$$
(4)
$$\begin{aligned} v_{ij}= & {} \frac{1}{\sum _{k=1}^{c}(\frac{||d_{ij}||}{||x_{i}-c_{k}||})^{\frac{2}{m-1}}} \end{aligned}$$
(5)
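A compact sketch of this iteration is given below, assuming the documents are TF-IDF row vectors in a NumPy array; the function and variable names are illustrative:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2, iters=100, seed=0):
    """Alternately update centroids (Eq. 4) and memberships (Eq. 5)
    to minimize the objective T_m of Eq. 3."""
    rng = np.random.default_rng(seed)
    V = rng.random((X.shape[0], c))
    V /= V.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(iters):
        W = V ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]            # Eq. 4
        d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        V = inv / inv.sum(axis=1, keepdims=True)                  # Eq. 5
    return centroids, V

X = np.random.default_rng(1).random((20, 5))     # 20 documents, 5 terms
centroids, memberships = fuzzy_c_means(X, c=3)
print(memberships.sum(axis=1)[:3])               # each close to 1
```

The membership update is written in the equivalent form \(v_{ij} = d_{ij}^{-2/(m-1)} / \sum_k d_{ik}^{-2/(m-1)}\), which is the same as Eq. 5 after dividing numerator and denominator by \(d_{ij}^{2/(m-1)}\).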

3.4 Mutual information judge

The relationship between a class c and a term t is established using the Mutual Information Judge (MI) [26], which measures how much information the presence of t carries about c. The MI is computed using Eq. 6.

$$\begin{aligned}&MI(t,c)= \nonumber \\&\sum _{e_t\in \{0,1\}}\sum _{e_c\in \{0,1\}}Prob(e_t,e_c)\log _2\frac{Prob(e_t,e_c)}{Prob(e_t)Prob(e_c)} \end{aligned}$$
(6)

where the Bernoulli variables \(e_t\) and \(e_c\) are defined as

$$\begin{aligned} e_t= & {} \left\{ \begin{array}{l} 1,\hbox { if}\ t\in d \\ 0, \text{ otherwise }\end{array}\right. \;\; \text{ and } \\ e_c= & {} \left\{ \begin{array}{l} 1,\hbox { if}\ d\in c \\ 0, \text{ otherwise }\end{array}\right. \end{aligned}$$
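A small sketch of Eq. 6 follows, assuming documents are given as term sets with class labels; the toy data is hypothetical:

```python
import math

def mutual_information(docs, labels, term, cls):
    """MI(t, c) of Eq. 6 from the joint counts of the Bernoulli
    variables e_t (term present in d) and e_c (d belongs to class c)."""
    n = len(docs)
    counts = {(et, ec): 0 for et in (0, 1) for ec in (0, 1)}
    for doc, label in zip(docs, labels):
        counts[(int(term in doc), int(label == cls))] += 1
    mi = 0.0
    for (et, ec), k in counts.items():
        if k == 0:                      # 0 * log 0 is taken as 0
            continue
        p_joint = k / n
        p_t = (counts[(et, 0)] + counts[(et, 1)]) / n
        p_c = (counts[(0, ec)] + counts[(1, ec)]) / n
        mi += p_joint * math.log2(p_joint / (p_t * p_c))
    return mi

docs = [{"pagerank", "web"}, {"fuzzy", "cluster"}, {"pagerank", "rank"}]
labels = ["ir", "ml", "ir"]
print(mutual_information(docs, labels, "pagerank", "ir"))  # approx. 0.918
```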

4 Proposed approach

This section briefly discusses our earlier query-optimized PageRank approach [42] and then the current query-optimized personalized PageRank technique in detail.

4.1 Query-optimized PageRank

The following steps are used in the query-optimized PageRank algorithm:

  1.

    Initially, all the documents of a given corpus are pre-processed and converted into vector forms using Step 1 of Sect. 4.2.

  2.

    Top l (l = 1 or 2)Footnote 2 terms whose average TF-IDF values are maximum in the corpus are selected as the query terms. The reason for restricting the query length to one or two terms is that the literature [54] shows most queries are very short (i.e., either one or two terms).

  3.

    Cosine-similarity is calculated between each document and the query. The documents which are highly dissimilar (cosine similarity is zero) are discarded from the corpus, and then the ranks of the remaining documents are calculated using the PageRank algorithm. The main idea of the approach is that a surfer searching for a query on the web should only jump to web documents that are highly correlated with the query.

  4.

    After the link structure of the documents has been modified, the weights of the documents are adjusted by considering the damping factor. Initially, all the documents get the same importance (i.e., the same weight). Next, the rank of a web document \(p_i\) is updated by adding the importance of its incoming links to the current rank score of \(p_i\). This process is repeated for every document of the corpus. A rank matrix r is created which stores the updated rank of each web document after incorporating the damping factor. The following steps are used to compute the PageRank:

    i.

      Consider a directed graph G with k nodes and up to \(k(k-1)\) edges, where each node is a web document and each edge represents the relationship between two documents. When document i refers to document j, a directed edge is added from node i to node j in G.

    ii.

      All the documents that are linked from a single document initially get equal importance; hence, if a node has n outgoing edges, the importance passed along each edge is \(\frac{1}{n}\). Let A be the transition matrix of the graph G, represented as

      $$\begin{aligned}A = \begin{bmatrix} x_{11} &{}\quad x_{12} &{}\quad x_{13} &{} \dots &{} x_{1k} \\ x_{21} &{}\quad x_{22} &{}\quad x_{23} &{} \dots &{} x_{2k} \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ x_{k1} &{}\quad x_{k2} &{}\quad x_{k3} &{} \dots &{} x_{kk} \end{bmatrix} \end{aligned}$$

      where \(x_{ij}\) is the weight of the link from document i to j.

    iii.

      Let v be the initial rank vector, all of whose entries are \(\frac{1}{k}\), because initially all web documents receive equal importance. The rank of a web document i is updated by adding the importance of its incoming links to the current value of i, which is the same as multiplying the transition matrix A with the rank vector v. Hence, after the first iteration, the new importance vector becomes \(v_1\) = Av. We keep iterating this step, generating the sequence \(v, Av, A^{2}v, A^{3}v,\ldots\), which converges to the PageRank of the web graph G.

    iv.

      Since the experimental dataset is large, the graph G may not be connected; thus, one requires an unambiguous meaning of the rank of a document for such a directed web graph. To overcome this problem, a damping factor (p) is used, which is a positive constant between 0 and 1; its typical value is 0.85. Equation 7 is used to compute the PageRank of G, and a minimal code sketch of the computation is given after the definition of B below.

      $$\begin{aligned} PageRank(G) = pA + (1-p)B \end{aligned}$$
      (7)

      where,

      $$\begin{aligned}B = \frac{1}{k}\begin{bmatrix} 1 &{}\quad 1 &{}\quad 1 &{} \dots &{} 1 \\ 1 &{}\quad 1 &{}\quad 1 &{} \dots &{} 1 \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ 1 &{}\quad 1 &{}\quad 1 &{} \dots &{} 1 \end{bmatrix}\end{aligned}$$
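A minimal sketch of steps i-iv is given below. As an assumption on our part, A is built column-stochastic so that v ← Mv propagates rank along incoming links, and dangling nodes (no outlinks) receive a uniform column, a case the text does not specify:

```python
import numpy as np

def pagerank(adj, p=0.85, tol=1e-10):
    """Power iteration for Eq. 7 with M = p*A + (1-p)*B,
    where B is the uniform matrix with all entries 1/k."""
    k = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # column-stochastic A: A[i, j] = 1/outdeg(j) if document j links to i
    A = np.divide(adj.T, out_deg, out=np.full((k, k), 1.0 / k),
                  where=out_deg > 0)
    M = p * A + (1 - p) / k          # adding (1-p)/k elementwise = (1-p)*B
    v = np.full(k, 1.0 / k)          # initial rank vector, all entries 1/k
    while True:
        v_new = M @ v                # one step of v, Av, A^2 v, ...
        if np.abs(v_new - v).sum() < tol:
            return v_new
        v = v_new

# toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [1., 0., 0.]])
print(pagerank(adj))                 # document 2 ranks highest
```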

4.2 Query-optimized personalized PageRank

To improve the retrieved results of the above query-optimized PageRank, we combine the personalized PageRank with the content of the relevant documents. The new ranks of the web documents, computed using personalized PageRank, are obtained through the following steps:

  1.

    Data acquisition and pre-processing of the corpus:

    Consider a corpus P having q classes. All the documents are pre-processed, which includes lexical analysis, stop-word elimination, removal of HTML tags, and stemmingFootnote 3, and then the index terms are extracted. Documents of all q classes are put together, which makes the dimension of P \(b \times l\), where b and l represent the number of documents and terms respectively. Table 1 shows the document-term matrix, where \(t_{ij}\) indicates the weight of the jth term in the ith document.

  2.

    Document-term cluster formation:

    The FCM clustering algorithm is run on the corpus P, dividing all the terms of the documents of P into k doc-term clusters \(dt_p\), where \(dt_p\) = {\(dt_1, dt_2, \ldots , dt_k\)}, by bringing similar terms into the same cluster. Here, each \(dt_p\) is of dimension \(b \times n\) (i.e., the number of documents remains the same for each cluster, but the number of terms is reduced by clustering). The reason for choosing FCM among the existing clustering techniques is that it is a soft clustering algorithm whose fuzziness can be exploited to obtain crisper behavior and better results, and it is one of the best algorithms for text data compared to hard clustering algorithms [5]. The next objective is to select the significant terms from each of the k clusters, maintaining uniformity without excluding any collection.

  3.

    Term-Term Correlation based Feature Selection (TCFS): The following steps discuss how important features are selected from each cluster \(dt_p, \forall p \in [1, k]\).

    (i)

      Frequency-based correlation (CF) calculation:

      First, the frequency-based correlation measure between every pair of terms i and j of each cluster \(dt_p\) is calculated using Eq. 8.

      $$\begin{aligned} \textit{CF}_{ij} = \sum \limits _{m \in dt_p} f_{im} * f_{jm} \end{aligned}$$
      (8)

      where, \(f_{im}\) and \(f_{jm}\) represent the frequency of \(i^{th}\) and \(j^{th}\) terms in the \(m^{th}\) document of the cluster \(dt_p\).

    (ii)

      Constructing association matrix:

      An association matrix, shown in Table 2, is constructed where each entry represents the association or frequency-based correlation measure between the terms \(t_i\) and \(t_j\).

    (iii)

      Normalizing \(CF_{ij}\):

      The frequency-based correlation measure CF\(_{ij}\) is normalized (the result is named the normalized correlation measure (NCM)) using Eq. 9, which scales the correlation values to lie between 0 and 1, as shown in Table 3. All the diagonal values of \(\textit{NCM}\) are 1, since i = j there.

      $$\begin{aligned} \textit{NCM}_{ij} = \frac{CF_{ij}}{CF_{ii}+CF_{jj}-CF_{ij}} \end{aligned}$$
      (9)
    (iv)

      Semantic centroid vector generation:

      For each term \(t_i\) (i.e., for each row of NCM), the mean is calculated, and all the means together form an n-dimensional vector named the semantic centroid vector \(sc_p\). Each component of \(sc_p\) is shown in Eq. 10.

      $$\begin{aligned} sc_{p_i} = \frac{\sum _{j=1}^{n}{} \textit{NCM}_{ij}}{n},~~~~1\le i\le n \end{aligned}$$
      (10)

      Each component of the semantic centroid vector is represented as

      $$\begin{aligned} \left[ \begin{array}{c} sc_{p_1} = \frac{(NCM_{11} ~+~ NCM_{12} ~+ ~NCM_{13} ~+~ \cdots ~+~ NCM_{1n})}{n}\\ \\ sc_{p_2} = \frac{(NCM_{21} ~+~ NCM_{22}~+~NCM_{23}~ + ~\cdots ~ +~ NCM_{2n})}{n}\\ \\ sc_{p_3}= \frac{(NCM_{31} ~+~ NCM_{32} ~+~NCM_{33}~ +~ \cdots ~+~ NCM_{3n})}{n}\\ \vdots \\ sc_{p_n} = \frac{(NCM_{n1} ~+~ NCM_{n2} ~ + ~ NCM_{n3} ~+~ \cdots ~+~ NCM_{nn})}{n}\\ \end{array} \right] \end{aligned}$$
    (v)

      Selecting important features:

      a.

        Calculating silhouette coefficient:

        The silhouette coefficient (silhout) of the term \(t_i \in dt_p\) is computed using Eq. 11. Cohesion (coh) measures how close the term \(t_i \in dt_p\) is to its own semantic centroid \(sc_p \in dt_p\) (Eq. 12), and separation (sep) measures how well separated the term \(t_i \in dt_p\) is from the semantic centroids of the other clusters \(sc_m\), \(\forall m \in [1, k]\), \(m \ne p\) (Eq. 13).

        $$\begin{aligned} {\textit{silhout}} (t_i) = \frac{{\textit{sep}}(t_i) - {\textit{coh}}(t_i)}{{max}\big ( {\textit{coh}}(t_i), {\textit{sep}}(t_i)\big )} \end{aligned}$$
        (11)
        $$\begin{aligned} {\textit{coh}}(t_i) = (||sc_p - t_i||) \end{aligned}$$
        (12)
        $$\begin{aligned} \begin{aligned} {\textit{sep}}{(t_i)} = \text{ min }\big (||sc_m - t_i||\big ) \\ \end{aligned} \end{aligned}$$
        (13)

        where, \(sc_m\) is the semantic centroid of the \(m^{th}\) cluster.

      b.

        Finally, the terms are ranked based on their silhouette coefficient scores, and among them the top ‘m%’ terms (for the experimental work we choose m = 10 of the total terms, decided empirically) are selected as the important features for the cluster \(dt_p\).

    (vi)

      By repeating Steps 3(i)-(v) for all k doc-term clusters, the top m% important terms are generated for each doc-term cluster and the noise terms are discarded from each cluster. The documents which do not contain any of these important terms are removed from the cluster.

    The details of this feature selection technique are generalized in Algorithm 1 for implementation purposes; a hedged code sketch of the algorithm follows.

    Algorithm 1: Term-Term Correlation based Feature Selection (TCFS)
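Algorithm 1 itself is not reproduced here; the following is a hedged sketch of Steps 3(i)-(v). The paper leaves the vector representation of a term in Eqs. 12-13 implicit, so, as an assumption of ours, each term is represented by its frequency column over the b documents, and the semantic centroid of a cluster is embedded in the same space as the Eq. 10-weighted mean of its term vectors; all names are illustrative:

```python
import numpy as np

def tcfs(clusters, m_percent=10):
    """TCFS sketch. `clusters` is a list of (b x n_p) doc-term frequency
    matrices; returns, per cluster, the indices of the top m% terms."""
    centroids = []
    for F in clusters:
        CF = F.T @ F                                       # Eq. 8
        d = np.diag(CF)
        NCM = CF / (d[:, None] + d[None, :] - CF + 1e-12)  # Eq. 9
        w = NCM.mean(axis=1)                               # Eq. 10
        # assumption: centroid = w-weighted mean of the term vectors
        centroids.append((F * w).sum(axis=1) / w.sum())
    kept = []
    for p, F in enumerate(clusters):
        scores = []
        for i in range(F.shape[1]):
            t = F[:, i]
            coh = np.linalg.norm(centroids[p] - t)          # Eq. 12
            sep = min(np.linalg.norm(c - t)                 # Eq. 13
                      for q, c in enumerate(centroids) if q != p)
            scores.append((sep - coh) / (max(coh, sep) + 1e-12))  # Eq. 11
        top = max(1, int(len(scores) * m_percent / 100))
        kept.append(np.argsort(scores)[::-1][:top])         # top m% terms
    return kept

rng = np.random.default_rng(0)
clusters = [rng.integers(0, 5, (8, 12)).astype(float) for _ in range(3)]
print(tcfs(clusters))   # indices of the retained terms per cluster
```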
  4.

    Query vector generation:

    Among the top \(m\%\) important terms of each cluster \(dt_p\) of the corpus P, the top l terms (l = 1 or 2; the reason for this choice of l is discussed in Sect. 4.1) based on their silhouette coefficient scores are selected to generate the query \(q_p,\forall p \in [1, k]\), for that cluster. As we work with the bag-of-words model, the order of the terms in the query does not matter.

  5.

    Computing similarity between the documents and the query:

    Using Eq. 14, the cosine-similarity (\(cosine\text{- }sim\)) is computed between each document \(d_i \in dt_p\) and the query vector \(q_p\). The documents of each \(dt_p\) are then arranged based on their cosine-similarity scores, and those documents whose scores fall below a threshold of 0.5Footnote 4 are discarded from the corpus P.

    $$\begin{aligned} cosine\text{- }sim (d_i, q_p)= \frac{{d_i}.{q_p}}{{||d_i||}*{||q_p||}} \end{aligned}$$
    (14)

    All the documents of each \(dt_p\) are merged together, which generates a new corpus \(P_{new}\); a brief sketch of this filtering step follows.
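A brief sketch of this filtering step, assuming the cluster's documents and the query are TF-IDF vectors in NumPy arrays (illustrative names):

```python
import numpy as np

def filter_by_query(docs, query, threshold=0.5):
    """Eq. 14: keep documents whose cosine similarity with the query
    reaches the threshold, returned in decreasing order of score."""
    sims = docs @ query / (np.linalg.norm(docs, axis=1)
                           * np.linalg.norm(query) + 1e-12)
    keep = np.where(sims >= threshold)[0]
    return keep[np.argsort(sims[keep])[::-1]]

docs = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.7, 0.3, 0.1]])
query = np.array([1.0, 0.0, 0.0])
print(filter_by_query(docs, query))   # documents 0 and 2 survive
```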

  6.

    Computing the personalized PageRank of \(P_{new}\):

    The personalized PageRank, discussed in the next step, is used to rank the documents of \(P_{new}\), assuming it contains n documents. At the beginning, all the documents of \(P_{new}\) receive the same importance. Next, their ranks are updated by adding the importance of the incoming links to their current rank scores. This update is repeated for all the documents of \(P_{new}\).

  7.

    Applying the link-based technique on the corpus \(P_{new}\):

    In the link-based technique, the personalized PageRank is evaluated for all web documents of the corpus \(P_{new}\) having non-zero cosine-similarity, which improves on the earlier PageRank approach. The link-based approach is developed using the following steps:

    (i)

      Adjacency matrix construction:

      We represent the web by a directed graph G = \(\{V, E\}\), where the vertex set V is the set of web documents and an edge \((u, v) \in E\) represents a hyperlink from document u to document v. The outlink information between web documents has been stored according to the format of the dataset (for illustration, we show the link structure of a few documents), as demonstrated in Table 4.

Table 1 Document-term matrix
Table 2 Association matrix
Table 3 Normalized correlation measure
Table 4 Outlinks to web documents

The outlink information of web documents can easily be obtained from each row of Table 4. For example, the last row gives the outlink information of the fifth web document, which is explained in Table 5 for better understanding.

Table 5 Outlink information of fifth web document

The adjacency matrix A of the outlinks information is defined as follows

$$\begin{aligned} A= \begin{bmatrix} \beta _{11} ~&{}\quad ~\beta _{12} &{}\quad \beta _{13} &{}\quad ~\dots &{} ~\beta _{1n} \\ \beta _{21}~ &{}\quad ~\beta _{22} &{}\quad \beta _{23} &{}\quad ~\dots ~ &{} ~\beta _{2n} \\ \beta _{31}~ &{}\quad ~\beta _{32} &{}\quad \beta _{33} &{}\quad ~\dots ~ &{} ~\beta _{3n} \\ \vdots ~ &{} ~\vdots &{} ~\vdots &{}~\ddots &{} ~\vdots \\ \beta _{n1}~ &{} ~\beta _{n2} &{} \beta _{n3} &{} ~\dots &{} ~\beta _{nn} \end{bmatrix} \end{aligned}$$

where \(\beta _{ij}\) is the number of outlinks from document i to document j. To illustrate the calculation of the adjacency matrix A, we show an example adjacency matrix \(A'\) as

$$\begin{aligned} A'= \begin{bmatrix} 0 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~2 &{}\quad ~~1 &{}\quad ~~0\\ 2 ~~&{}\quad ~~0 &{}\quad ~~1 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1\\ 0 ~~&{}\quad ~~1 &{}\quad ~~0 &{}\quad ~~1 &{}\quad ~~0 &{}\quad ~~4\\ 2 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1 &{}\quad ~~3\\ 1 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~3 &{}\quad ~~0 &{}\quad ~~1\\ 5 ~~&{}\quad ~~3 &{}\quad ~~3 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 \end{bmatrix} \end{aligned}$$

The adjacency matrix is normalized by dividing each row by its row sum. This normalized form is used for the personalized PageRank. The normalized forms of A and \(A'\) are denoted by H and \(H'\) respectively.

$$\begin{aligned} H= & {} \begin{bmatrix} \frac{\beta _{11}}{\beta _{11}+\beta _{12}+\cdots +\beta _{1n}} &{} ~\dots &{} ~\frac{\beta _{1n}}{\beta _{11}+\beta _{12}+\cdots +\beta _{1n}} \\ \\ \frac{\beta _{21}}{\beta _{21}+\beta _{22}+\cdots +\beta _{2n}} &{} ~\dots &{} ~\frac{\beta _{2n}}{\beta _{21}+\beta _{22}+\cdots +\beta _{2n}} \\ \vdots ~ &{} ~\ddots &{} ~\vdots \\ \frac{\beta _{n1}}{\beta _{n1}+\beta _{n2}+\cdots +\beta _{nn}} &{} ~\dots &{} ~\frac{\beta _{nn}}{\beta _{n1}+\beta _{n2}+\cdots +\beta _{nn}} \end{bmatrix} \\ H'= & {} \begin{bmatrix} 0 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~2/3 &{}\quad ~~1/3 &{}\quad ~~0\\ 2/4 ~~&{}\quad ~~0 &{}\quad ~~1/4 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1/4\\ 0 ~~&{}\quad ~~1/6 &{}\quad ~~0 &{}\quad ~~1/6 &{}\quad ~~0 &{}\quad ~~4/6\\ 2/6 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1/6 &{}\quad ~~3/6\\ 1/5~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~3/5 &{}\quad ~~0 &{}\quad ~~1/5\\ 5/{11} ~~&{}\quad ~~3/{11} &{}\quad ~~3/{11} &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 \end{bmatrix} \end{aligned}$$
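The normalization can be verified in a few lines; here we recompute \(H'\) from the example matrix \(A'\) above (this example has no zero-outlink rows, which would otherwise need a separate rule such as a uniform row):

```python
import numpy as np

A = np.array([[0, 0, 0, 2, 1, 0],
              [2, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0, 4],
              [2, 0, 0, 0, 1, 3],
              [1, 0, 0, 3, 0, 1],
              [5, 3, 3, 0, 0, 0]], dtype=float)

H = A / A.sum(axis=1, keepdims=True)   # divide each row by its row sum
print(H[0])                            # [0. 0. 0. 0.667 0.333 0.]
```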
  (ii)

    Calculation of personalized PageRank:

    The personalized PageRank of a web document i is evaluated by considering all other web documents’ contributions to the PageRank of i. In the graph G, the contribution of a vertex \(V_1\) to the PageRank of another vertex \(V_2\) is described in terms of personalized PageRank [2]. For a row-normalized adjacency matrix H, the PageRank \(\textit{PR}_i\) (for document i) is determined as

    $$\begin{aligned} \textit{PR}_i\leftarrow \alpha *\textit{PR}_i*H + (1-\alpha )*\vartheta \end{aligned}$$
    (15)

    In Eq. 15, \(\vartheta \) is the teleportation vector and \(\alpha \in [0, 1]\) is the scaling parameter, which in practice is normally set to 0.85 [27]. To calculate the personalized PageRank contribution vector for the ith web document, the ith bit of \(\vartheta \) is set to 1 and the remaining bits are set to 0. At the beginning, \(\textit{PR}_i\) is set to

    $$\begin{aligned} \textit{PR}_i=\left( \frac{1}{n},\frac{1}{n},\frac{1}{n},\cdots ,\frac{1}{n}\right) \end{aligned}$$

    where n is the total number of documents of \(P_{new}\); \(\textit{PR}_i\) is updated iteratively using Eq. 15 and then stored in the \(i^{th}\) row of the personalized PageRank (ppr) matrix, i.e., \(ppr[i, :] \leftarrow \textit{PR}_i\). The computation of the personalized PageRank (ppr) is described explicitly in Algorithm 2 for implementation purposes, and a minimal code sketch follows it. The overview of the query-optimized personalized PageRank technique is shown in Fig. 1.

Algorithm 2: Computation of the personalized PageRank (ppr) matrix
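Algorithm 2 is not reproduced verbatim; the sketch below implements the Eq. 15 iteration for each teleportation vector, recomputing the row-stochastic \(H'\) of the previous snippet so that it is self-contained (names are ours):

```python
import numpy as np

def personalized_pagerank(H, alpha=0.85, tol=1e-10):
    """Build the ppr matrix: row i is the PageRank vector obtained with
    the teleportation vector concentrated on document i (Eq. 15)."""
    n = H.shape[0]
    ppr = np.zeros((n, n))
    for i in range(n):
        theta = np.zeros(n)
        theta[i] = 1.0                  # teleport only to document i
        pr = np.full(n, 1.0 / n)        # uniform start, entries 1/n
        while True:
            pr_next = alpha * pr @ H + (1 - alpha) * theta   # Eq. 15
            if np.abs(pr_next - pr).sum() < tol:
                break
            pr = pr_next
        ppr[i] = pr_next                # ppr[i, :] <- PR_i
    return ppr

A = np.array([[0, 0, 0, 2, 1, 0],
              [2, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0, 4],
              [2, 0, 0, 0, 1, 3],
              [1, 0, 0, 3, 0, 1],
              [5, 3, 3, 0, 0, 0]], dtype=float)
H = A / A.sum(axis=1, keepdims=True)    # row-stochastic H' as before
ppr = personalized_pagerank(H)
print(ppr[0].round(3))                  # personalization vector of document 0
```

Since \(\alpha < 1\) and H is row-stochastic, each iteration is a contraction, so the loop converges for every teleportation vector.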
Fig. 1: Query-optimized personalized PageRank

5 Experimental work

For experimental purposes, five benchmark datasets are used (DMOZFootnote 5, Classic4Footnote 6, ReutersFootnote 7, 20-NewsgroupsFootnote 8, and WebKBFootnote 9). A brief description of each dataset is given below:

  i.

    DMOZ is one of the largest directories on the web, and it has 14 categories of web documents. Out of 69,068 documents, we used 38,000 for training and 31,068 for testing. Many documents have no content or very little content. The total number of terms is 60,320, out of which 39,886 are considered for training.

  ii.

    20-Newsgroups is a standard machine learning dataset with 11,293 training and 7528 testing documents classified into 20 classes. All the classes are considered for experimental purposes. Some of the documents have no content. Among 52,422 terms, 32,270 are used for training and the rest for testing.

  iii.

    Classic4 is a text mining dataset with 4257 training and 2838 test documents classified into four classes - cisi, med, cacm, and cran - having 1460, 1033, 3204, and 1400 documents respectively. All the classes are considered in the evaluation. The total number of terms in all documents is 21,299, of which 15,971 occur in the training documents.

  iv.

    Reuters is a well-known machine learning dataset having eight categories of documents, and all categories are used for experimental purposes. Among these documents, 5485 are used for training and 2189 for testing. The total number of terms is 17,582, among which 13,531 are used for training.

  v.

    WebKB is a popular machine learning dataset which has four categories of documents from four different university websites. All categories are used in the experiment, with 2803 documents for training and 1396 for testing. The total number of terms of all these documents is 7606, of which 7522 are used for training. Documents having little content are filtered out as discussed in the next paragraph, but they are counted during the ranking process.

For experimental purposes, we discuss how the adjacency matrix is generated for the WebKB dataset; the same technique is applied to all other datasets. A collection of 4199 different hosts is considered, as available in the WebKB dataset (host graph format). Initially, the adjacency matrix of dimension 4199 \(\times \) 4199 for the host graph is computed and normalized as discussed in Step 7 of Sect. 4.2. The dataset of 4199 web documents has been filtered down to 994 web documents. The filtering process is done using the four steps mentioned below:

  (i)

    Initially, only those web documents for which human-assigned labels are available are selected from the dataset; all other web documents are discarded.

  (ii)

    Among those labeled web documents, all working links are selected.

  (iii)

    Next, the content of those working links is extracted and stored in a corpus in text file format.

  (iv)

    At the end, only those web documents which have content of at least 1 KB are selected from the corpus. The 1 KB threshold ensures that each document has enough content for the personalized PageRank to run smoothly, since the content of the web documents is important for the experimental work.

Table 6 Query: Agriculture (NFr = 0.333)
Table 7 Query: Massachusetts (NFr = 0)
Table 8 Query: Roman Empire (NFr =0.661)

5.1 Result analysis of query-optimized PageRank

This section discusses the experimentation of the earlier query-optimized PageRank. To implement the PageRank algorithm combined with the content of the documents, a benchmark research dataset called DBpediaFootnote 10 was chosen. The reason for choosing this dataset is that it has both the content and the link structure needed for the experimental work. Its limitation is that it does not have a sufficient set of relevant documents for any given query, which makes it difficult to compute the accuracy of our earlier approach. To handle this problem, we used a method called Spearman’s footrule [16]. To check the efficiency of our earlier approach, we compared the accuracy of the query-optimized PageRank with the cosine-similarity ranking of web documents. The query vector is generated in the same way as discussed in Step 4 of Sect. 4.2. Here, we have considered only monogram (l=1) and bi-gram (l=2) queries for experimental purposes, for the reason already discussed in Sect. 4.1. The following are some of the links of the DBpedia dataset used for experimental purposes:

  • Links/amsterdammuseum_links

  • Links/dailymed_links

  • Links/eunis_links

  • En/external_links_en

  • En/infobox_properties_en

  • Links/italian_public_schools_links

  • Sv/labels_en_uris_sv

  • Pl/long_abstracts_en_uris_pl

  • Links/revyu_links

  • Links/yago_links

Table 9 Query: General History (NFr =0.714)
Table 10 Query: Mediterranean (NFr = 0.611)
Table 11 Query: Catholic (NFr = 0.704)
Table 12 Query: Civil (NFr = 0.857)

Assume that the DBpedia dataset contains N documents, ranked between 1 and N. The Spearman’s footrule method is applied to both the query-optimized and cosine-similarity ranking techniques for measuring their accuracies. No ties are allowed, as the rankings generated by the two techniques being compared are basically permutations of each other. Let the resulting rankings be the permutations \(\sigma _2\) for the ranking based on cosine-similarity and \(\sigma _1\) for the query-optimized PageRank. The comparison of the top ‘k’ documents of the two techniques is carried out over S, the set of overlapping results between the two ranking techniques. Equation 16 is used to compute the Spearman’s footrule.

$$\begin{aligned} Fr^{|S|}(\sigma _1, \sigma _2) = \sum \limits _{i=1}^{|S|}|\sigma _1(i) - \sigma _2(i)| \end{aligned}$$
(16)

The value of \(Fr^{|S|}\) is normalized by dividing the obtained result by its maximum value. The resulting value is independent of the size of the overlap S and lies between 0 and 1. The following three cases are observed based on the value of |S|:

  i.

    When the ranking lists of the query-optimized and cosine-similarity ranking techniques are equal, \(Fr^{|S|}\) is zero.

  ii.

    When |S| is even, \(Fr^{|S|}\) attains the maximum value \(\frac{1}{2}|S|^2\).

  iii.

    When |S| is odd, \(Fr^{|S|}\) attains the maximum value \(\frac{1}{2}(|S|+1)(|S|-1)\).

Equation 17 is used to compute the normalized Spearman’s footrule (NFr) for \(|S|>1\).

$$\begin{aligned} \text{ NFr } = \frac{Fr^{|S|}}{max\; Fr^{|S|}} \end{aligned}$$
(17)

Thus, NFr ranges between 0 and 1. Rankings using query-optimized PageRank and cosine-similarity for monogram and bi-gram queries are shown in Tables 6, 7, 8, 9, 10, 11, and 12. In these tables, NFr represents the accuracy gained by the query-optimized PageRank over cosine-similarity ranking. Since very few documents are retrieved for the query “Massachusetts” and the link structure could not refine the ranks much based on the cosine-similarity of the documents with the query, the Spearman coefficient turned out to be 0, as shown in Table 7. A non-zero Spearman score indicates that the ranking given by the query-optimized PageRank puts forward a new direction of research for the modified, query-dependent PageRank.
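For concreteness, a small sketch of Eqs. 16 and 17, assuming the two rankings are given as document-to-position mappings; the toy rankings are hypothetical:

```python
def normalized_footrule(rank_a, rank_b):
    """Spearman's footrule over the overlapping documents (Eq. 16),
    divided by its maximum possible value (Eq. 17)."""
    overlap = set(rank_a) & set(rank_b)
    s = len(overlap)
    fr = sum(abs(rank_a[d] - rank_b[d]) for d in overlap)
    # max Fr: |S|^2/2 if |S| is even, (|S|+1)(|S|-1)/2 if odd
    max_fr = s * s // 2 if s % 2 == 0 else (s + 1) * (s - 1) // 2
    return fr / max_fr if max_fr else 0.0

query_opt = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}   # query-optimized ranks
cosine    = {"d1": 2, "d2": 1, "d3": 4, "d4": 3}   # cosine-similarity ranks
print(normalized_footrule(query_opt, cosine))       # 4 / 8 = 0.5
```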

Table 13 Size (approximate) of the input feature vector

5.2 Result analysis of proposed feature selection technique

Table 13 shows the different parameters used for the feature selection technique. All top \(m\%\) important terms (m = 10) (as discussed in Step 3(v) of Sect. 4.2) are combined to generate the training feature vector for the classifiers. Comparison results of the proposed TCFS feature selection technique with other well-known techniques (Mutual Information (MI), Chi-square (CHI-2), GINI [51], and Information Gain (IG) [59]) are given in Tables 14, 15, 16, 17 and 18 respectively. Eight classifiers, namely LinearSVM, Gaussian Naive Bayes (G-NB), Binomial Naive Bayes (B-NB), Multinomial Naive Bayes (M-NB), Adaboost, Decision Trees (DT), Random Forest (RF), and Extra Trees (ET), are used for document classification on the different datasets. All ensemble classifiers use 10 base classifiers. We adapted the above techniques to check the performance of the proposed feature selection with respect to the conventional techniques. Equation 18 is used to measure the performance of each classifier. The bold results indicate the highest F-measure obtained by the proposed feature selection technique using the corresponding classifier. From the results, it is observed that the proposed feature selection technique is either comparable to or better than the conventional techniques.

$$\begin{aligned} \textit{Precision (p)}= & {} \frac{|\mathrm {relevant}_{documents}\cap \mathrm {retrieved}_{documents}|}{|\mathrm{{retrieved}_{documents}}|} \nonumber \\ \mathrm {Recall (r)}= & {} \frac{|\mathrm {relevant}_{documents}\cap \mathrm {retrieved}_{documents}|}{|\mathrm{{relevant}_{documents}}|} \nonumber \\ \mathrm {F-measure (f)}= & {} 2*\left(\frac{p*r}{p+r}\right) \end{aligned}$$
(18)
Table 14 Comparisons on 20-NG Dataset
Table 15 Comparisons on DMOZ Dataset
Table 16 Comparisons on Classic4 Dataset

5.2.1 Tuning hyper-parameters:

The tuned hyper-parameters of the different classifiers are given below:

  i.

    LinearSVM: Cs = [0.001, 0.01, 0.1, 1, 10], gammas = [0.001, 0.01, 0.1, 1], param_grid = {‘C’: Cs, ‘gamma’: gammas}

  ii.

    Naive Bayes: Prior Probabilities = [0.65, 0.35]

  iii.

    Adaboost: n_estimators = 12, max_depth = 5, subsample = 0.5, random_state = 0, number of classifiers used = 10

  iv.

    Decision Trees: max_depth = 10, min_sample_splits = 40%, min_samples_leaf = 5

  v.

    Random Forest: n_estimators = 11, max_depth = 4, subsample = 0.5, random_state = 0, number of classifiers used = 10

  vi.

    Gradient Boosting: learning_rate = 0.1, n_estimators = 280, max_depth = 4, subsample = 0.4, random_state = 0, number of classifiers used = 10

  vii.

    Extra Trees: n_estimators = 20, learning_rate = 1, max_depth = 4, subsample = 0.6, random_state = 0, number of classifiers used = 10

Table 17 Comparisons on Reuters Dataset
Table 18 Comparisons on WebKB Dataset

5.3 Result analysis of query-optimized personalized PageRank

This section discusses the experimental results of the proposed query-optimized personalized PageRank technique in detail. At the beginning of the work, two sets of documents are formed:

  i.

    One set contains all the original categories of documents of a dataset, named ‘\(original_{category}\)’.

  ii.

    The other set contains the newly formed clusters, named \(new_{cluster}\), where a cluster may contain documents coming from different categories of a dataset. New clusters are formed by combining all the documents of the different categories of a dataset and then running the FCM technique on those documents. The number of clusters formed for each dataset is the same as the number of categories the dataset has; for example, 20-NG has seven top-level categories, so the number of clusters generated by the FCM algorithm is seven. To know which cluster of \(new_{cluster}\) belongs to which category of \(original_{category}\), a simple rule is used: cluster i belongs to category j iff i contains the maximum number of documents of j. This is done in order to evaluate the ranking process efficiently.

Table 19 Monogram and Bi-grams query words
Table 20 Performance of the ranking approach on 20-NG
Table 21 Performance of ranking on DMOZ

The query-optimized personalized PageRank is applied on \(new_{cluster}\) to rank all the documents. The top ‘s%’Footnote 11 ranked results of \(original_{category}\) and the top ‘t%’Footnote 12 ranked results of \(new_{cluster}\) are considered for the performance measurement of the ranking approach. Equations 19 and 20 are used to compute the precision and recall respectively.

$$\begin{aligned}&{precision (p')} = \frac{a}{b} \end{aligned}$$
(19)
$$\begin{aligned}&recall (r') = \frac{a}{d} \end{aligned}$$
(20)
$$\begin{aligned}&F\text{- }measure(f') = 2\big (\frac{p'*r'}{p'+r'}\big ) \end{aligned}$$
(21)

where ‘a’ is the number of common documents between the top ‘\(s\%\)’ documents of \(original_{category}\) and the top ‘t%’ documents of \(new_{cluster}\), ‘b’ is the number of documents in the top ‘t%’ ranked results of \(new_{cluster}\), and ‘d’ is the number of documents in the top ‘\(s\%\)’ of \(original_{category}\). For experimental purposes, we discuss the ‘Computer’ category of the 20-NG dataset, named \(original_{comp}\), which has 1952 documents. The proposed query-optimized personalized PageRank algorithm is run on both the newly generated computer cluster, named \(new_{comp}\), and \(original_{comp}\) to rank the documents of both clusters. We then find how many of the top ‘s%’ documents of \(original_{comp}\) match the top ‘t%’ documents of \(new_{comp}\), from which the F-measure is computed as mentioned in Eq. 21; a small sketch of this computation follows.
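To make the quantities a, b, and d concrete, a tiny sketch under our own naming and with hypothetical document ids:

```python
def overlap_f_measure(original_ranked, cluster_ranked, s=10, t=10):
    """Eqs. 19-21: compare the top s% of the original category ranking
    with the top t% of the new cluster ranking."""
    top_d = set(original_ranked[:max(1, len(original_ranked) * s // 100)])
    top_b = set(cluster_ranked[:max(1, len(cluster_ranked) * t // 100)])
    a = len(top_d & top_b)            # common documents
    p, r = a / len(top_b), a / len(top_d)
    return 2 * p * r / (p + r) if p + r else 0.0

original_ranked = [f"doc{i}" for i in range(100)]          # ranked ids
cluster_ranked = (["doc3", "doc7", "doc42", "doc0", "doc1"]
                  + [f"doc{i}" for i in range(50, 145)])
print(overlap_f_measure(original_ranked, cluster_ranked))  # 0.4
```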

5.3.1 Comparison of the proposed personalized PageRank with cosine-similarity and the traditional PageRank algorithm

The monogram and bi-gram queries generated from each dataset (shown in Table 19) are used for the cosine-similarity, PageRank, and proposed query-optimized personalized PageRank approaches. The proposed query-optimized personalized PageRank approach using monogram queries is tested on each category of all the datasets, and the performances are shown in Tables 20, 21, 22, 23 and 24 respectively. Comparisons of the three ranking techniques using monogram and bi-gram queries are shown in Figs. 2 and 3 respectively. From the figures, it is observed that the performance of query-optimized personalized PageRank is better than both the query combined with cosine-similarity and the query-optimized PageRank approach. The overall (i.e., average) F-measure on each dataset shows the stability of the proposed ranking approach.

Table 22 Performance of ranking on Classic4
Table 23 Performance of ranking on Reuters
Table 24 Performance of ranking on WebKB
Fig. 2: F-measure (monogram query)

Fig. 3: F-measure (bi-grams query)

6 Conclusion

This paper is an extension of our earlier approach; it improves the existing PageRank by personalizing it and then combining it with the query to rank the web documents. In the earlier approach, an efficient ranking model was developed which improves the ranking mechanism by bringing the required documents to the top of the retrieved results. The current approach further improves the earlier approach in the following ways:

  i.

    To improve the ranking mechanism, a novel feature selection technique (TCFS) is proposed which removes the noise features from the corpus.

  ii.

    To ensure that the proposed TCFS technique is efficient, it is compared with the existing feature selection techniques.

  iii.

    Next, the earlier PageRank algorithm is improved by personalizing it.

  iv.

    The personalized PageRank and the cosine-similarity of the documents with the user query further enhance the earlier ranking mechanism, which was based on the relevance of documents and their outlinks to the user query.

Experimental results on five benchmark datasets show the stability and effectiveness of the proposed query-optimized personalized PageRank approach. This work can further be extended by considering the following points:

  i.

    The proposed ranking method does not consider whether the ranked documents are spam. Hence, spam detection can be performed before the ranking process starts.

  ii.

    Similarly, duplicate documents are a big threat to search engines and are not detected by the proposed method. Detecting duplicate documents before the ranking process can further improve the proposed work.

  iii.

    Considering each query as a mixture of various topics (generated from the documents using Latent Dirichlet Allocation) can further improve the PageRank matrix, which then receives more relevant and important outlinks.

  iv.

    The proposed approach can be improved by incorporating user behavior signals [1].