1 Introduction

With the rapid growth of the internet, everybody experiences a flood of information. It is estimated that the present web contains at least 4.62 billion pagesFootnote 1, which makes it highly difficult for common users to find the desired information on the web. Ranking is the most commonly used technique in the field of Information Retrieval (IR); it brings the required documents to the top of the retrieved results. First, for a given set of documents and a query, a scoring function is computed which determines the degree of relevance of each document with respect to the query. Then a ranking list is generated by sorting the documents based on their relevance scores. Modern ranking approaches use different techniques such as BM25 [37] and PageRank [34] to generate such a ranking function, and thereby achieve great improvements in ranking performance [28]. By nature, most IR problems are ranking problems, such as anti web spam [38, 45], collaborative filtering [7], product rating [14], key term extraction [11, 41, 44, 46], important e-mail routing [10], sentiment analysis [32, 43], definition finding [58], and text summarization [40, 48]. Among these ranking problems, document ranking is a common problem faced by many search engines. Emails, web documents, news articles, books, and academic papers are some examples of documents. Some ranking scenarios of document retrieval are:

  • Documents are ranked as per their relevance to the user query.

  • Website structures [12], diversity [61], and similarity relationships [55] between documents are some of the features considered during the ranking process; this is known as relational ranking [36].

  • Several candidate ranked lists are aggregated to get a better ranked list, known as meta search [4].

  • Determining to what degree a property of a web document influences the ranking result.

In recent years, ranking has become a very important research direction in the domain of IR, and a large number of ranking models have been proposed that have achieved high influence [3, 21, 39, 57]. All these models can be roughly categorized as:

  • Query-dependent model:

    In this model, documents are retrieved based on the occurrences of the query terms in the documents. Examples are the standard Boolean model [9], the Vector Space model [50], Latent Semantic Indexing [31], and probabilistic ranking techniques such as the Binary Independence Model [23] and Latent Dirichlet Allocation [24].

  • Query-independent model:

    In this model, documents are ranked based on their own importance, as in the traditional PageRank algorithm, query-independent learning [13], content-based techniques [30], etc.

PageRank was the first algorithm used by the Google search engine to rank web pages (or documents). In the PageRank algorithm, the importance of a web document is evaluated by considering the number of quality incoming links to that document. As the internet grows rapidly, PageRank helps to retrieve the required information quickly. In the current PageRank algorithm, the importance or relevance of a web document is a relative concept that completely depends on the user query. This is one of the major drawbacks of the present ranking system, and it can be addressed by using the concept of the personalized PageRank algorithm. There are many such limitations of the existing PageRank algorithms [22], and some of them are listed below:

  • In some of the PageRank algorithms, PageRank is calculated not at the query time but at the indexing time.

  • Most of the PageRank algorithms suffer from a problem called topic drifting, which decreases their efficiency.

  • Some PageRank algorithms judge pages based on the importance of the web documents, whereas others completely ignore the importance of each individual document.

  • The content of web documents, which plays a vital role in a PageRank algorithm, is sometimes ignored, which reduces the performance of the algorithm.

Among the above limitations, the main limitation of the traditional PageRank algorithm is topic drifting. This is because it assumes a uniform link structure, i.e., a surfer jumps from one document to another uniformly. For example, suppose someone is looking for documents related to computer science; then documents that have outgoing links to biological documents are also incorporated in the computation of PageRank (since some biological documents can be relevant or linked to computer science, such as ‘prediction of diabetes using machine learning’, ‘detection of breast cancer’, ‘recognizing brain tumors’, ‘finding stress levels in the human brain’, etc.).

Our earlier query-optimized PageRank approach [42] dealt with the limitations of the PageRank algorithm by biasing the next jump towards documents relevant to the user query. The importance of web pages for different users can be better determined if the PageRank algorithm takes user preferences into consideration, which is called personalized PageRank. The importance of a page differs for individuals with different interests, knowledge, and backgrounds, so a global ranking of a web page does not necessarily indicate the importance of that page for individual users. It is therefore important to calculate a personalized view of the importance of the pages.

Hence, to make the query-optimized PageRank approach better (i.e., more user friendly), the proposed technique extends our earlier approach by introducing a personalized PageRank that combines with a user query to rank the web documents, which is the main objective of this paper.

The major contributions of the proposed approach are as follows:

  i.

    Incorporating the importance of a document into its personalized PageRank is an innovative way to rank web documents. It updates the link structure according to the similarity score between the document and the user query, and thereby refines the retrieval results by bringing the required documents to the top.

  ii.

    Using the traditional PageRank algorithm, search engines might return pages that do not satisfy user needs and preferences. Hence, using the personalized PageRank algorithm to restructure the links based on the user query is more beneficial when implementing it on a search engine over datasets that have many citations and good link structures, such as Wikipedia databases, research journal databases, business databases, etc.

  iii.

    As the content of each web document is considered along with the personalized PageRank, the proposed approach achieves high performance by bringing the required documents to the top of the search results. Here, the content of each web document and the user query are converted into TF-IDF form, and the similarity score is computed between them. Based on this similarity score, the required documents are retrieved at the top of the search results, which reduces the searching time of the user.

  iv.

    By re-ranking the web documents, the relevancy of the results is enhanced. Here, re-ranking means personalized ranking of our earlier query-optimized PageRank approach, improved by using the optimization function [18]. The modified link structure is the input to the personalized PageRank algorithm and contains only those output links that are connected to relevant documents (here, relevance means non-zero cosine-similarity with the user query).

  v.

    By introducing a novel feature selection technique named TCFS (Term-Term Correlation based Feature Selection), the noise terms are removed from the corpus before the personalized PageRank starts. This makes the personalized PageRank process more effective.

Although much work has been done on ranking web documents (as is evident from the past literature), those ranking mechanisms either completely ignore the content of the documents or are fully dependent on the user query. Hence, the realm of personalized PageRank combined with similarity scores between the relevant documents and the user query provides a relatively unexplored pool of opportunities. The proposed algorithm is implemented on different benchmark datasets, and the experimental results show the effectiveness of the proposed feature selection technique and the query-optimized personalized PageRank algorithm.

The remainder of this paper is organized as follows: Sect. 2 discusses past work in the ranking domain. The basic preliminaries required for the proposed approach are discussed in Sect. 3. Section 4 discusses the query-optimized personalized PageRank. The experimental work of the proposed approach is analyzed in Sect. 5. Section 6 concludes the work with some future enhancements.

2 Past work

The dynamic web contains a huge volume of digital documents, and it is growing very rapidly. This makes it difficult for a search engine to retrieve relevant results. A search engine needs to rank the documents in such a way that the retrieved results are most relevant to the user. Among the existing ranking techniques, Spatial TF-IDF, suggested by Ali et al. [25], ranks documents by incorporating their spatial and textual features. The authors also proposed a method named Spatial-Keyword Inverted File for Points (SKIF-P) for web document indexing. They implemented their algorithm on real and synthetic datasets and showed that their technique is more efficient than existing ranking techniques. Chahal et al. discussed a new semantic-based document ranking mechanism [8] where conceptual instances between the keywords are captured by building an ontology. The authors analyzed important relations among the keywords, and the importance of each web document is decided based on these relations. Experimentally, they showed that their approach can outperform existing ranking techniques. Derhami et al. [15] proposed Reinforcement Learning (RL) for web document ranking. They considered each web document as a state and developed a technique combining RL Rank and BM25 (a content-based algorithm) to rank the documents. Experimental results on the LETOR and dotIR datasets show that their approach achieves much better results than the PageRank algorithm. Du and Hai [17] suggested a semantic approach for web documents based on formal concept analysis. Their approach uses a combination of all three types of web mining (i.e., web content, usage, and structure). Empirical results show that the returned results are highly efficient and relevant to the user query.

Patterns or similar words of a document are combined to generate a topic. Topic models play a vital role in document ranking, and some primary research has been done in this direction [29, 47, 52, 62]. Bougouin et al. [6] suggested a graph-based topic ranking mechanism for key-phrase extraction. Their approach generates topics by clustering the candidate key-phrases. Empirical results on benchmark datasets show that their method is better than existing ranking methods. In a similar line, multiple topic tracking, which classifies news articles as either interesting or not for a specific user, was developed by Pon et al. [35]. Empirical results justify the performance of their approach compared to traditional pattern- and term-based models. A pattern-based topic model, which is an information filtering model, was proposed by Gao et al. [20]. In their work, multiple topics are combined to generate useful information for ranking the documents. Experimental results on different benchmark datasets justify the efficiency of their work.

The present document ranking structure treats the user query as independent, which overlooks the interests of the user. Working in this direction, a cumulative proximity expansion method was proposed by Vuurens et al. [56]. The authors found that the positions of query term occurrences are very useful for measuring a document’s relevance. They implemented their work on Newswire and web corpora and showed the effectiveness of their technique. Evi et al. [60] proposed a quality-biased ranking that incorporates signals from passages based on a novel use of community question answering data. Their approach develops a set of methodologies to improve the term relevance estimates from which answering passages are extracted. Ranking experiments on two web test collections (GOV2 and ClueWeb09B) show the efficiency of their approach. Fafalios et al. [19] suggested a ranking method that ranks archive documents for structured queries. They proposed probabilistic and stochastic ranking models that consider the timeliness, relativeness, and temporal relations among the documents. For experimental purposes, they used the New York Times annotated corpus, which contains 1.8 million articles, and the results show the effectiveness of their approach. A deep learning architecture named ‘DeepRank’ was used by Pang et al. [33] for relevance ranking of documents. DeepRank captures query term importance, proximity heuristics, and diverse relation requirements. Empirical results on LETOR4.0 and a large clickthrough dataset show that the DeepRank model outperforms existing ranking methods and deep IR models.

The approaches discussed above are either fully dependent on the query or neglect the content of the web documents. Combining the personalized PageRank of documents with their relevance is an innovative way to rank web documents: it updates the link structure of documents based on their similarity score with the user query, and thereby refines the retrieval results by bringing the required documents to the top. Experimental results on five benchmark datasets show the efficiency of the proposed ranking approach.

3 Basic preliminaries

3.1 TF-IDF

\(TF\text{- }IDF\) [53] is a common technique which finds the importance of a term t in a given document d by considering its appearance in the whole corpus and is shown in Eq. 1.

$$\begin{aligned} TF\text{- }IDF_{t, d} = TF_{t, d} \times IDF_{t} \end{aligned}$$
(1)

where,

$$\begin{aligned} TF_{t, d} = \frac{Number\; of\; t\in d}{|d|} \end{aligned}$$

|d| represents the total length of d, and

$$\begin{aligned} IDF_{t} = \log _{10}\left( \frac{ Number\; of\;d\in P}{Number\; of\; d\; containing\; t}\right) \end{aligned}$$
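To make the computation concrete, the following is a minimal Python sketch of Eq. 1; the tokenized toy corpus and all names are ours, not part of the paper:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF per Sect. 3.1: TF is the frequency of t in d divided by |d|;
    IDF is log10(#documents / #documents containing t)."""
    n_docs = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log10(n_docs / df[t])
                        for t in tf})
    return weights

corpus = [["ranking", "web", "documents"],
          ["personalized", "pagerank", "ranking"],
          ["web", "graph", "pagerank"]]
print(tf_idf(corpus)[0])  # 'documents' scores highest in document 0
```

Note that a term appearing in every document gets an IDF of log10(1) = 0, so such terms carry no weight, which is the intended behavior.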

3.2 Silhouette coefficient

Silhouette Coefficient [49] of a term t is defined using Eq. 2.

$$\begin{aligned} silhouette(t) = \frac{s(t) - c(t)}{max \big (c(t), s(t)\big )} \end{aligned}$$
(2)

where c(t) and s(t) are the cohesion (how close is t to its own cluster) and separation score (how well separated is t from other clusters) of the term t respectively.

3.3 Fuzzy C-means

Fuzzy C-Means (FCM) algorithm [5] distributes a finite collection of n documents into c clusters. It returns a list of c cluster centroids along with a matrix giving the degree of membership of each document in each cluster. It aims to minimize the objective function shown in Eq. 3.

$$\begin{aligned} T_m=\sum _{i=1}^{n}\sum _{j=1}^{c}v_{ij}^{m}||d_{ij}||^{2} \end{aligned}$$
(3)

where \(d_{ij} = x_{i}-c_{j}\) (so \(||d_{ij}||\) is the distance between document \(x_i\) and centroid \(c_j\)), m is the fuzzy coefficient, generally set to 2, \(c_{j}\) is the centroid (vector) of cluster j, and \(x_{i}\) is the \(i^{th}\) document. \(v_{ij} \in [0, 1]\) is the degree of membership of \(x_{i}\) with respect to \(c_{j}\), subject to the following conditions:

$$\begin{aligned}&\sum \limits _{j=1}^{c}v_{ij} = 1,~~ i=1, 2, 3, \ldots ,n \;\; \text{ and } \\&0< \sum \limits _{i=1}^{n}v_{ij} < n, ~~ j=1, 2, 3, \ldots ,c \end{aligned}$$

The values of \(c_j\) and \(v_{ij}\) are updated at each iteration using Eqs. 4 and 5 respectively.

$$\begin{aligned} c_{j}= & {} \frac{\sum _{i=1}^{n}v_{ij}^{m}\,x_{i}}{\sum _{i=1}^{n}v_{ij}^{m}} \end{aligned}$$
(4)
$$\begin{aligned} v_{ij}= & {} \frac{1}{\sum _{k=1}^{c}(\frac{||d_{ij}||}{||x_{i}-c_{k}||})^{\frac{2}{m-1}}} \end{aligned}$$
(5)
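A compact sketch of this iteration is given below, assuming the documents are TF-IDF row vectors in a NumPy array; the function and variable names are illustrative:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2, iters=100, seed=0):
    """Alternately update centroids (Eq. 4) and memberships (Eq. 5)
    to minimize the objective T_m of Eq. 3."""
    rng = np.random.default_rng(seed)
    V = rng.random((X.shape[0], c))
    V /= V.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(iters):
        W = V ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]            # Eq. 4
        d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        V = inv / inv.sum(axis=1, keepdims=True)                  # Eq. 5
    return centroids, V

X = np.random.default_rng(1).random((20, 5))     # 20 documents, 5 terms
centroids, memberships = fuzzy_c_means(X, c=3)
print(memberships.sum(axis=1)[:3])               # each close to 1
```

The membership update is written in the equivalent form \(v_{ij} = d_{ij}^{-2/(m-1)} / \sum_k d_{ik}^{-2/(m-1)}\), which is the same as Eq. 5 after dividing numerator and denominator by \(d_{ij}^{2/(m-1)}\).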

3.4 Mutual information judge

The relationship between a class c and a term t is established using the Mutual Information Judge (MI) [26], which measures how much information the presence of t carries about c. The MI is computed using Eq. 6.

$$\begin{aligned}&MI(t,c)= \nonumber \\&\sum _{e_t\in \{0,1\}}\sum _{e_c\in \{0,1\}}Prob(e_t,e_c)\log _2\frac{Prob(e_t,e_c)}{Prob(e_t)Prob(e_c)} \end{aligned}$$
(6)

where the Bernoulli variables \(e_t\) and \(e_c\) are defined as

$$\begin{aligned} e_t= & {} \left\{ \begin{array}{l} 1,\hbox { if}\ t\in d \\ 0, \text{ otherwise }\end{array}\right. \;\; \text{ and } \\ e_c= & {} \left\{ \begin{array}{l} 1,\hbox { if}\ d\in c \\ 0, \text{ otherwise }\end{array}\right. \end{aligned}$$
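A small sketch of Eq. 6 follows, assuming documents are given as term sets with class labels; the toy data is hypothetical:

```python
import math

def mutual_information(docs, labels, term, cls):
    """MI(t, c) of Eq. 6 from the joint counts of the Bernoulli
    variables e_t (term present in d) and e_c (d belongs to class c)."""
    n = len(docs)
    counts = {(et, ec): 0 for et in (0, 1) for ec in (0, 1)}
    for doc, label in zip(docs, labels):
        counts[(int(term in doc), int(label == cls))] += 1
    mi = 0.0
    for (et, ec), k in counts.items():
        if k == 0:                      # 0 * log 0 is taken as 0
            continue
        p_joint = k / n
        p_t = (counts[(et, 0)] + counts[(et, 1)]) / n
        p_c = (counts[(0, ec)] + counts[(1, ec)]) / n
        mi += p_joint * math.log2(p_joint / (p_t * p_c))
    return mi

docs = [{"pagerank", "web"}, {"fuzzy", "cluster"}, {"pagerank", "rank"}]
labels = ["ir", "ml", "ir"]
print(mutual_information(docs, labels, "pagerank", "ir"))  # approx. 0.918
```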

4 Proposed approach

This section briefly discusses our earlier query-optimized PageRank approach [42] and then the current query-optimized personalized PageRank technique in detail.

4.1 Query-optimized PageRank

The following steps are used in the query-optimized PageRank algorithm:

  1.

    Initially, all the documents of a given corpus are pre-processed and converted into vector forms using Step 1 of Sect. 4.2.

  2.

    Top l (l = 1 or 2)Footnote 2 terms whose average TF-IDF values are maximum in the corpus are selected as the query terms. The reason for restricting the query length to one or two terms is that the literature [54] shows most queries are very short (i.e., either one or two terms).

  3.

    Cosine-similarity is calculated between each document and the query. The documents which are highly dissimilar (cosine similarity is zero) are discarded from the corpus, and then the ranks of the remaining documents are calculated using the PageRank algorithm. The main idea of the approach is that a surfer searching for a query on the web should only jump to web documents that are highly correlated with the query.

  4.

    After the link structure of the documents has been modified, the weights of the documents are adjusted by considering the damping factor. Initially, all the documents get the same importance (i.e., the same weight). Next, the rank of a web document \(p_i\) is updated by adding the importance of its incoming links to the current rank score of \(p_i\). This process is repeated for every document of the corpus. A rank matrix r is created which stores the updated rank of each web document after incorporating the damping factor. The following steps are used to compute the PageRank:

    i.

      Consider a directed graph G with k nodes and up to \(k(k-1)\) edges, where each node is a web document and each edge represents the relationship between two documents. When document i refers to document j, a directed edge is added from node i to node j in G.

    ii.

      All the documents that are linked from a single document initially get equal importance; hence, if a node has n outgoing edges, the importance passed along each edge is \(\frac{1}{n}\). Let A be the transition matrix of the graph G, represented as

      $$\begin{aligned}A = \begin{bmatrix} x_{11} &{}\quad x_{12} &{}\quad x_{13} &{} \dots &{} x_{1k} \\ x_{21} &{}\quad x_{22} &{}\quad x_{23} &{} \dots &{} x_{2k} \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ x_{k1} &{}\quad x_{k2} &{}\quad x_{k3} &{} \dots &{} x_{kk} \end{bmatrix} \end{aligned}$$

      where \(x_{ij}\) is the weight of the link from document i to j.

    iii.

      Let v be the initial rank vector, all of whose entries are \(\frac{1}{k}\), because initially all web documents receive equal importance. The rank of a web document i is updated by adding the importance of its incoming links to the current value of i, which is the same as multiplying the transition matrix A with the rank vector v. Hence, after the first iteration, the new importance vector becomes \(v_1\) = Av. We keep iterating this step, generating the sequence \(v, Av, A^{2}v, A^{3}v,\ldots\), which converges to the PageRank of the web graph G.

    iv.

      Since the experimental dataset is large, the graph G may not be connected; thus, one requires an unambiguous meaning of the rank of a document for such a directed web graph. To overcome this problem, a damping factor (p) is used, which is a positive constant between 0 and 1; its typical value is 0.85. Equation 7 is used to compute the PageRank of G, and a minimal code sketch of the computation is given after the definition of B below.

      $$\begin{aligned} PageRank(G) = pA + (1-p)B \end{aligned}$$
      (7)

      where,

      $$\begin{aligned}B = \frac{1}{k}\begin{bmatrix} 1 &{}\quad 1 &{}\quad 1 &{} \dots &{} 1 \\ 1 &{}\quad 1 &{}\quad 1 &{} \dots &{} 1 \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ 1 &{}\quad 1 &{}\quad 1 &{} \dots &{} 1 \end{bmatrix}\end{aligned}$$
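A minimal sketch of steps i-iv is given below. As an assumption on our part, A is built column-stochastic so that v ← Mv propagates rank along incoming links, and dangling nodes (no outlinks) receive a uniform column, a case the text does not specify:

```python
import numpy as np

def pagerank(adj, p=0.85, tol=1e-10):
    """Power iteration for Eq. 7 with M = p*A + (1-p)*B,
    where B is the uniform matrix with all entries 1/k."""
    k = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # column-stochastic A: A[i, j] = 1/outdeg(j) if document j links to i
    A = np.divide(adj.T, out_deg, out=np.full((k, k), 1.0 / k),
                  where=out_deg > 0)
    M = p * A + (1 - p) / k          # adding (1-p)/k elementwise = (1-p)*B
    v = np.full(k, 1.0 / k)          # initial rank vector, all entries 1/k
    while True:
        v_new = M @ v                # one step of v, Av, A^2 v, ...
        if np.abs(v_new - v).sum() < tol:
            return v_new
        v = v_new

# toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [1., 0., 0.]])
print(pagerank(adj))                 # document 2 ranks highest
```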

4.2 Query-optimized personalized PageRank

To improve the retrieved results of the above query-optimized PageRank, we combine the personalized PageRank with the content of the relevant documents. The new ranks of the web documents, computed using personalized PageRank, are obtained through the following steps:

  1.

    Data acquisition and pre-processing of the corpus:

    Consider a corpus P having q classes. All the documents are pre-processed, which includes lexical analysis, stop-word elimination, removal of HTML tags, and stemmingFootnote 3, and then the index terms are extracted. Documents of all q classes are put together, which makes the dimension of P \(b \times l\), where b and l represent the number of documents and terms respectively. Table 1 shows the document-term matrix, where \(t_{ij}\) indicates the weight of the jth term in the ith document.

  2.

    Document-term cluster formation:

    The FCM clustering algorithm is run on the corpus P, dividing all the terms of the documents of P into k doc-term clusters \(dt_p\), where \(dt_p\) = {\(dt_1, dt_2, \ldots , dt_k\)}, by bringing similar terms into the same cluster. Here, each \(dt_p\) is of dimension \(b \times n\) (i.e., the number of documents remains the same for each cluster, but the number of terms is reduced by clustering). The reason for choosing FCM among the existing clustering techniques is that it is a soft clustering algorithm whose fuzziness can be exploited to obtain crisper behavior and better results, and it is one of the best algorithms for text data compared to hard clustering algorithms [5]. The next objective is to select the significant terms from each of the k clusters, maintaining uniformity without excluding any collection.

  3.

    Term-Term Correlation based Feature Selection (TCFS): The following steps discuss how important features are selected from each cluster \(dt_p, \forall p \in [1, k]\).

    (i)

      Frequency-based correlation (CF) calculation:

      First, the frequency-based correlation measure between every pair of terms i and j of each cluster \(dt_p\) is calculated using Eq. 8.

      $$\begin{aligned} \textit{CF}_{ij} = \sum \limits _{m \in dt_p} f_{im} * f_{jm} \end{aligned}$$
      (8)

      where, \(f_{im}\) and \(f_{jm}\) represent the frequency of \(i^{th}\) and \(j^{th}\) terms in the \(m^{th}\) document of the cluster \(dt_p\).

    (ii)

      Constructing association matrix:

      An association matrix, shown in Table 2, is constructed where each entry represents the association or frequency-based correlation measure between the terms \(t_i\) and \(t_j\).

    (iii)

      Normalizing \(CF_{ij}\):

      The frequency-based correlation measure CF\(_{ij}\) is normalized (the result is named the normalized correlation measure (NCM)) using Eq. 9, which scales the correlation values to lie between 0 and 1, as shown in Table 3. All the diagonal values of \(\textit{NCM}\) are 1, since i = j there.

      $$\begin{aligned} \textit{NCM}_{ij} = \frac{CF_{ij}}{CF_{ii}+CF_{jj}-CF_{ij}} \end{aligned}$$
      (9)
    (iv)

      Semantic centroid vector generation:

      For each term \(t_i\) (i.e., for each row of NCM), the mean is calculated, and all the means together form an n-dimensional vector named the semantic centroid vector \(sc_p\). Each component of \(sc_p\) is shown in Eq. 10.

      $$\begin{aligned} sc_{p_i} = \frac{\sum _{j=1}^{n}{} \textit{NCM}_{ij}}{n},~~~~1\le i\le n \end{aligned}$$
      (10)

      Each component of the semantic centroid vector is represented as

      $$\begin{aligned} \left[ \begin{array}{c} sc_{p_1} = \frac{(NCM_{11} ~+~ NCM_{12} ~+ ~NCM_{13} ~+~ \cdots ~+~ NCM_{1n})}{n}\\ \\ sc_{p_2} = \frac{(NCM_{21} ~+~ NCM_{22}~+~NCM_{23}~ + ~\cdots ~ +~ NCM_{2n})}{n}\\ \\ sc_{p_3}= \frac{(NCM_{31} ~+~ NCM_{32} ~+~NCM_{33}~ +~ \cdots ~+~ NCM_{3n})}{n}\\ \vdots \\ sc_{p_n} = \frac{(NCM_{n1} ~+~ NCM_{n2} ~ + ~ NCM_{n3} ~+~ \cdots ~+~ NCM_{nn})}{n}\\ \end{array} \right] \end{aligned}$$
    (v)

      Selecting important features:

      a.

        Calculating silhouette coefficient:

        The silhouette coefficient (silhout) of the term \(t_i \in dt_p\) is computed using Eq. 11. Cohesion (coh) measures how close the term \(t_i \in dt_p\) is to its own semantic centroid \(sc_p \in dt_p\) (Eq. 12), and separation (sep) measures how well separated the term \(t_i \in dt_p\) is from the semantic centroids of the other clusters \(sc_m\), \(\forall m \in [1, k]\), \(m \ne p\) (Eq. 13).

        $$\begin{aligned} {\textit{silhout}} (t_i) = \frac{{\textit{sep}}(t_i) - {\textit{coh}}(t_i)}{{max}\big ( {\textit{coh}}(t_i), {\textit{sep}}(t_i)\big )} \end{aligned}$$
        (11)
        $$\begin{aligned} {\textit{coh}}(t_i) = (||sc_p - t_i||) \end{aligned}$$
        (12)
        $$\begin{aligned} \begin{aligned} {\textit{sep}}{(t_i)} = \text{ min }\big (||sc_m - t_i||\big ) \\ \end{aligned} \end{aligned}$$
        (13)

        where, \(sc_m\) is the semantic centroid of the \(m^{th}\) cluster.

      b.

        Finally, the terms are ranked based on their silhouette coefficient scores, and among them the top ‘m%’ terms (for the experimental work we choose m = 10 of the total terms, decided empirically) are selected as the important features for the cluster \(dt_p\).

    (vi)

      By repeating Steps 3(i)-(v) for all k doc-term clusters, the top m% important terms are generated for each doc-term cluster and the noise terms are discarded from each cluster. The documents which do not contain any of these important terms are removed from the cluster.

    The details of this feature selection technique are generalized in Algorithm 1 for implementation purposes; a hedged code sketch of the algorithm follows.

    Algorithm 1: Term-Term Correlation based Feature Selection (TCFS)
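Algorithm 1 itself is not reproduced here; the following is a hedged sketch of Steps 3(i)-(v). The paper leaves the vector representation of a term in Eqs. 12-13 implicit, so, as an assumption of ours, each term is represented by its frequency column over the b documents, and the semantic centroid of a cluster is embedded in the same space as the Eq. 10-weighted mean of its term vectors; all names are illustrative:

```python
import numpy as np

def tcfs(clusters, m_percent=10):
    """TCFS sketch. `clusters` is a list of (b x n_p) doc-term frequency
    matrices; returns, per cluster, the indices of the top m% terms."""
    centroids = []
    for F in clusters:
        CF = F.T @ F                                       # Eq. 8
        d = np.diag(CF)
        NCM = CF / (d[:, None] + d[None, :] - CF + 1e-12)  # Eq. 9
        w = NCM.mean(axis=1)                               # Eq. 10
        # assumption: centroid = w-weighted mean of the term vectors
        centroids.append((F * w).sum(axis=1) / w.sum())
    kept = []
    for p, F in enumerate(clusters):
        scores = []
        for i in range(F.shape[1]):
            t = F[:, i]
            coh = np.linalg.norm(centroids[p] - t)          # Eq. 12
            sep = min(np.linalg.norm(c - t)                 # Eq. 13
                      for q, c in enumerate(centroids) if q != p)
            scores.append((sep - coh) / (max(coh, sep) + 1e-12))  # Eq. 11
        top = max(1, int(len(scores) * m_percent / 100))
        kept.append(np.argsort(scores)[::-1][:top])         # top m% terms
    return kept

rng = np.random.default_rng(0)
clusters = [rng.integers(0, 5, (8, 12)).astype(float) for _ in range(3)]
print(tcfs(clusters))   # indices of the retained terms per cluster
```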
  4.

    Query vector generation:

    Among the top \(m\%\) important terms of each cluster \(dt_p\) of the corpus P, the top l terms (l = 1 or 2; the reason for this choice of l is discussed in Sect. 4.1) based on their silhouette coefficient scores are selected to generate the query \(q_p,\forall p \in [1, k]\), for that cluster. As we work with the bag-of-words model, the order of the terms in the query does not matter.

  5.

    Computing similarity between the documents and the query:

    Using Eq. 14, the cosine-similarity (\(cosine\text{- }sim\)) is computed between each document \(d_i \in dt_p\) and the query vector \(q_p\). The documents of each \(dt_p\) are then arranged based on their cosine-similarity scores, and those documents whose scores fall below a threshold of 0.5Footnote 4 are discarded from the corpus P.

    $$\begin{aligned} cosine\text{- }sim (d_i, q_p)= \frac{{d_i}.{q_p}}{{||d_i||}*{||q_p||}} \end{aligned}$$
    (14)

    All the documents of each \(dt_p\) are merged together, which generates a new corpus \(P_{new}\); a brief sketch of this filtering step follows.
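A brief sketch of this filtering step, assuming the cluster's documents and the query are TF-IDF vectors in NumPy arrays (illustrative names):

```python
import numpy as np

def filter_by_query(docs, query, threshold=0.5):
    """Eq. 14: keep documents whose cosine similarity with the query
    reaches the threshold, returned in decreasing order of score."""
    sims = docs @ query / (np.linalg.norm(docs, axis=1)
                           * np.linalg.norm(query) + 1e-12)
    keep = np.where(sims >= threshold)[0]
    return keep[np.argsort(sims[keep])[::-1]]

docs = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.7, 0.3, 0.1]])
query = np.array([1.0, 0.0, 0.0])
print(filter_by_query(docs, query))   # documents 0 and 2 survive
```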

  6.

    Computing the personalized PageRank of \(P_{new}\):

    The personalized PageRank, discussed in the next step, is used to rank the documents of \(P_{new}\), assuming it contains n documents. At the beginning, all the documents of \(P_{new}\) receive the same importance. Next, their ranks are updated by adding the importance of the incoming links to their current rank scores. This update is repeated for all the documents of \(P_{new}\).

  7.

    Applying the link-based technique on the corpus \(P_{new}\):

    In the link-based technique, the personalized PageRank is evaluated for all web documents of the corpus \(P_{new}\) having non-zero cosine-similarity, which improves on the earlier PageRank approach. The link-based approach is developed using the following steps:

    (i)

      Adjacency matrix construction:

      We represent the web by a directed graph G = \(\{V, E\}\), where the vertex set V is the set of web documents and an edge \((u, v) \in E\) represents a hyperlink from document u to document v. The outlink information between web documents has been stored according to the format of the dataset (for illustration, we show the link structure of a few documents), as demonstrated in Table 4.

Table 1 Document-term matrix
Table 2 Association matrix
Table 3 Normalized correlation measure
Table 4 Outlinks to web documents

The outlink information of web documents can easily be obtained from each row of Table 4. For example, the last row gives the outlink information of the fifth web document, which is explained in Table 5 for better understanding.

Table 5 Outlink information of fifth web document

The adjacency matrix A of the outlinks information is defined as follows

$$\begin{aligned} A= \begin{bmatrix} \beta _{11} ~&{}\quad ~\beta _{12} &{}\quad \beta _{13} &{}\quad ~\dots &{} ~\beta _{1n} \\ \beta _{21}~ &{}\quad ~\beta _{22} &{}\quad \beta _{23} &{}\quad ~\dots ~ &{} ~\beta _{2n} \\ \beta _{31}~ &{}\quad ~\beta _{32} &{}\quad \beta _{33} &{}\quad ~\dots ~ &{} ~\beta _{3n} \\ \vdots ~ &{} ~\vdots &{} ~\vdots &{}~\ddots &{} ~\vdots \\ \beta _{n1}~ &{} ~\beta _{n2} &{} \beta _{n3} &{} ~\dots &{} ~\beta _{nn} \end{bmatrix} \end{aligned}$$

where \(\beta _{ij}\) is the number of outlinks from document i to document j. To illustrate the calculation of the adjacency matrix A, we show an example adjacency matrix \(A'\) as

$$\begin{aligned} A'= \begin{bmatrix} 0 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~2 &{}\quad ~~1 &{}\quad ~~0\\ 2 ~~&{}\quad ~~0 &{}\quad ~~1 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1\\ 0 ~~&{}\quad ~~1 &{}\quad ~~0 &{}\quad ~~1 &{}\quad ~~0 &{}\quad ~~4\\ 2 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1 &{}\quad ~~3\\ 1 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~3 &{}\quad ~~0 &{}\quad ~~1\\ 5 ~~&{}\quad ~~3 &{}\quad ~~3 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 \end{bmatrix} \end{aligned}$$

The adjacency matrix is normalized by dividing each row by its row sum. This normalized form is used for the personalized PageRank. The normalized forms of A and \(A'\) are denoted by H and \(H'\) respectively.

$$\begin{aligned} H= & {} \begin{bmatrix} \frac{\beta _{11}}{\beta _{11}+\beta _{12}+\cdots +\beta _{1n}} &{} ~\dots &{} ~\frac{\beta _{1n}}{\beta _{11}+\beta _{12}+\cdots +\beta _{1n}} \\ \\ \frac{\beta _{21}}{\beta _{21}+\beta _{22}+\cdots +\beta _{2n}} &{} ~\dots &{} ~\frac{\beta _{2n}}{\beta _{21}+\beta _{22}+\cdots +\beta _{2n}} \\ \vdots ~ &{} ~\ddots &{} ~\vdots \\ \frac{\beta _{n1}}{\beta _{n1}+\beta _{n2}+\cdots +\beta _{nn}} &{} ~\dots &{} ~\frac{\beta _{nn}}{\beta _{n1}+\beta _{n2}+\cdots +\beta _{nn}} \end{bmatrix} \\ H'= & {} \begin{bmatrix} 0 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~2/3 &{}\quad ~~1/3 &{}\quad ~~0\\ 2/4 ~~&{}\quad ~~0 &{}\quad ~~1/4 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1/4\\ 0 ~~&{}\quad ~~1/6 &{}\quad ~~0 &{}\quad ~~1/6 &{}\quad ~~0 &{}\quad ~~4/6\\ 2/6 ~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~1/6 &{}\quad ~~3/6\\ 1/5~~&{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~3/5 &{}\quad ~~0 &{}\quad ~~1/5\\ 5/{11} ~~&{}\quad ~~3/{11} &{}\quad ~~3/{11} &{}\quad ~~0 &{}\quad ~~0 &{}\quad ~~0 \end{bmatrix} \end{aligned}$$
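The normalization can be verified in a few lines; here we recompute \(H'\) from the example matrix \(A'\) above (this example has no zero-outlink rows, which would otherwise need a separate rule such as a uniform row):

```python
import numpy as np

A = np.array([[0, 0, 0, 2, 1, 0],
              [2, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0, 4],
              [2, 0, 0, 0, 1, 3],
              [1, 0, 0, 3, 0, 1],
              [5, 3, 3, 0, 0, 0]], dtype=float)

H = A / A.sum(axis=1, keepdims=True)   # divide each row by its row sum
print(H[0])                            # [0. 0. 0. 0.667 0.333 0.]
```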
  (ii)

    Calculation of personalized PageRank:

    The personalized PageRank of a web document i is evaluated by considering all other web documents’ contributions to the PageRank of i. In the graph G, the contribution of a vertex \(V_1\) to the PageRank of another vertex \(V_2\) is described in terms of personalized PageRank [2]. For a row-normalized adjacency matrix H, the PageRank \(\textit{PR}_i\) (for document i) is determined as

    $$\begin{aligned} \textit{PR}_i\leftarrow \alpha *\textit{PR}_i*H + (1-\alpha )*\vartheta \end{aligned}$$
    (15)

    In Eq. 15, \(\vartheta \) is the teleportation vector and \(\alpha \in [0, 1]\) is the scaling parameter, which in practice is normally set to 0.85 [27]. To calculate the personalized PageRank contribution vector for the ith web document, the ith bit of \(\vartheta \) is set to 1 and the remaining bits are set to 0. At the beginning, \(\textit{PR}_i\) is set to

    $$\begin{aligned} \textit{PR}_i=\left( \frac{1}{n},\frac{1}{n},\frac{1}{n},\cdots ,\frac{1}{n}\right) \end{aligned}$$

    where n is the total number of documents of \(P_{new}\); \(\textit{PR}_i\) is updated iteratively using Eq. 15 and then stored in the \(i^{th}\) row of the personalized PageRank (ppr) matrix, i.e., \(ppr[i, :] \leftarrow \textit{PR}_i\). The computation of the personalized PageRank (ppr) is described explicitly in Algorithm 2 for implementation purposes, and a minimal code sketch follows it. The overview of the query-optimized personalized PageRank technique is shown in Fig. 1.

Algorithm 2: Computation of the personalized PageRank (ppr) matrix
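Algorithm 2 is not reproduced verbatim; the sketch below implements the Eq. 15 iteration for each teleportation vector, recomputing the row-stochastic \(H'\) of the previous snippet so that it is self-contained (names are ours):

```python
import numpy as np

def personalized_pagerank(H, alpha=0.85, tol=1e-10):
    """Build the ppr matrix: row i is the PageRank vector obtained with
    the teleportation vector concentrated on document i (Eq. 15)."""
    n = H.shape[0]
    ppr = np.zeros((n, n))
    for i in range(n):
        theta = np.zeros(n)
        theta[i] = 1.0                  # teleport only to document i
        pr = np.full(n, 1.0 / n)        # uniform start, entries 1/n
        while True:
            pr_next = alpha * pr @ H + (1 - alpha) * theta   # Eq. 15
            if np.abs(pr_next - pr).sum() < tol:
                break
            pr = pr_next
        ppr[i] = pr_next                # ppr[i, :] <- PR_i
    return ppr

A = np.array([[0, 0, 0, 2, 1, 0],
              [2, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0, 4],
              [2, 0, 0, 0, 1, 3],
              [1, 0, 0, 3, 0, 1],
              [5, 3, 3, 0, 0, 0]], dtype=float)
H = A / A.sum(axis=1, keepdims=True)    # row-stochastic H' as before
ppr = personalized_pagerank(H)
print(ppr[0].round(3))                  # personalization vector of document 0
```

Since \(\alpha < 1\) and H is row-stochastic, each iteration is a contraction, so the loop converges for every teleportation vector.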
Fig. 1: Query-optimized personalized PageRank

5 Experimental work

For experimental purposes, five benchmark datasets are used (DMOZFootnote 5, Classic4Footnote 6, ReutersFootnote 7, 20-NewsgroupsFootnote 8, and WebKBFootnote 9). A brief description of each dataset is given below:

  i.

    DMOZ is one of the largest directories on the web, and it has 14 categories of web documents. Out of 69,068 documents, we used 38,000 for training and 31,068 for testing. Many documents have no content or very little content. The total number of terms is 60,320, out of which 39,886 are considered for training.

  ii.

    20-Newsgroups is a standard machine learning dataset with 11,293 training and 7528 testing documents classified into 20 classes. All the classes are considered for experimental purposes. Some of the documents have no content. Among 52,422 terms, 32,270 are used for training and the rest for testing.

  iii.

    Classic4 is a text mining dataset with 4257 training and 2838 test documents classified into four classes - cisi, med, cacm, and cran - having 1460, 1033, 3204, and 1400 documents respectively. All the classes are considered in the evaluation. The total number of terms in all documents is 21,299, of which 15,971 occur in the training documents.

  iv.

    Reuters is a well-known machine learning dataset having eight categories of documents, and all categories are used for experimental purposes. Among these documents, 5485 are used for training and 2189 for testing. The total number of terms is 17,582, among which 13,531 are used for training.

  v.

    WebKB is a popular machine learning dataset which has four categories of documents from four different university websites. All categories are used in the experiment, with 2803 documents for training and 1396 for testing. The total number of terms of all these documents is 7606, of which 7522 are used for training. Documents having little content are filtered out as discussed in the next paragraph, but they are counted during the ranking process.

For experimental purposes, we discuss how the adjacency matrix is generated for the WebKB dataset; the same technique is applied to all other datasets. A collection of 4199 different hosts is considered, as available in the WebKB dataset (host graph format). Initially, the adjacency matrix of dimension 4199 \(\times \) 4199 for the host graph is computed and normalized as discussed in Step 7 of Sect. 4.2. The dataset of 4199 web documents has been filtered down to 994 web documents. The filtering process is done using the four steps mentioned below:

  (i)

    Initially, only those web documents for which human-assigned labels are available are selected from the dataset; all other web documents are discarded.

  (ii)

    Among those labeled web documents, all working links are selected.

  (iii)

    Next, the content of those working links is extracted and stored in a corpus in text file format.

  (iv)

    At the end, only those web documents which have content of at least 1 KB are selected from the corpus. The 1 KB threshold ensures that each document has enough content for the personalized PageRank to run smoothly, since the content of the web documents is important for the experimental work.

Table 6 Query: Agriculture (NFr = 0.333)
Table 7 Query: Massachusetts (NFr = 0)
Table 8 Query: Roman Empire (NFr =0.661)

5.1 Result analysis of query-optimized PageRank

This section discusses the experimentation of the earlier query-optimized PageRank. To implement the PageRank algorithm combined with the content of the documents, a benchmark research dataset called DBpediaFootnote 10 was chosen. The reason for choosing this dataset is that it has both the content and the link structure needed for the experimental work. Its limitation is that it does not have a sufficient set of relevant documents for any given query, which makes it difficult to compute the accuracy of our earlier approach. To handle this problem, we used a method called Spearman’s footrule [16]. To check the efficiency of our earlier approach, we compared the accuracy of the query-optimized PageRank with the cosine-similarity ranking of web documents. The query vector is generated in the same way as discussed in Step 4 of Sect. 4.2. Here, we have considered only monogram (l=1) and bi-gram (l=2) queries for experimental purposes, for the reason already discussed in Sect. 4.1. The following are some of the links of the DBpedia dataset used for experimental purposes:

  • Links/amsterdammuseum_links

  • Links/dailymed_links

  • Links/eunis_links

  • En/external_links_en

  • En/infobox_properties_en

  • Links/italian_public_schools_links

  • Sv/labels_en_uris_sv

  • Pl/long_abstracts_en_uris_pl

  • Links/revyu_links

  • Links/yago_links

Table 9 Query: General History (NFr =0.714)
Table 10 Query: Mediterranean (NFr = 0.611)
Table 11 Query: Catholic (NFr = 0.704)
Table 12 Query: Civil (NFr = 0.857)

Assume that the DBpedia dataset contains N documents, ranked between 1 and N. The Spearman’s footrule method is applied to both the query-optimized and cosine-similarity ranking techniques for measuring their accuracies. No ties are allowed, as the rankings generated by the two techniques being compared are basically permutations of each other. Let the resulting rankings be the permutations \(\sigma _2\) for the ranking based on cosine-similarity and \(\sigma _1\) for the query-optimized PageRank. The comparison of the top ‘k’ documents of the two techniques is carried out over S, the set of overlapping results between the two ranking techniques. Equation 16 is used to compute the Spearman’s footrule.

$$\begin{aligned} Fr^{|S|}(\sigma _1, \sigma _2) = \sum \limits _{i=1}^{|S|}|\sigma _1(i) - \sigma _2(i)| \end{aligned}$$
(16)

The value of \(Fr^{|S|}\) is normalized by dividing the obtained result by its maximum value. The resulting value is independent of the size of the overlap S and lies between 0 and 1. The following three cases are observed based on the value of |S|:

  i.

    When the ranking lists of the query-optimized and cosine-similarity ranking techniques are equal, \(Fr^{|S|}\) is zero.

  ii.

    When |S| is even, \(Fr^{|S|}\) attains the maximum value \(\frac{1}{2}|S|^2\).

  iii.

    When |S| is odd, \(Fr^{|S|}\) attains the maximum value \(\frac{1}{2}(|S|+1)(|S|-1)\).

Equation 17 is used to compute the normalized Spearman’s footrule (NFr) for \(|S|>1\).

$$\begin{aligned} \text{ NFr } = \frac{Fr^{|S|}}{max\; Fr^{|S|}} \end{aligned}$$
(17)

Thus, NFr ranges between 0 and 1. Rankings using query-optimized PageRank and cosine-similarity for monogram and bi-gram queries are shown in Tables 6, 7, 8, 9, 10, 11, and 12. In these tables, NFr represents the accuracy gained by the query-optimized PageRank over cosine-similarity ranking. Since very few documents are retrieved for the query “Massachusetts” and the link structure could not refine the ranks much based on the cosine-similarity of the documents with the query, the Spearman coefficient turned out to be 0, as shown in Table 7. A non-zero Spearman score indicates that the ranking given by the query-optimized PageRank puts forward a new direction of research for the modified, query-dependent PageRank.
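For concreteness, a small sketch of Eqs. 16 and 17, assuming the two rankings are given as document-to-position mappings; the toy rankings are hypothetical:

```python
def normalized_footrule(rank_a, rank_b):
    """Spearman's footrule over the overlapping documents (Eq. 16),
    divided by its maximum possible value (Eq. 17)."""
    overlap = set(rank_a) & set(rank_b)
    s = len(overlap)
    fr = sum(abs(rank_a[d] - rank_b[d]) for d in overlap)
    # max Fr: |S|^2/2 if |S| is even, (|S|+1)(|S|-1)/2 if odd
    max_fr = s * s // 2 if s % 2 == 0 else (s + 1) * (s - 1) // 2
    return fr / max_fr if max_fr else 0.0

query_opt = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}   # query-optimized ranks
cosine    = {"d1": 2, "d2": 1, "d3": 4, "d4": 3}   # cosine-similarity ranks
print(normalized_footrule(query_opt, cosine))       # 4 / 8 = 0.5
```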

Table 13 Size (approximate) of the input feature vector

5.2 Result analysis of proposed feature selection technique

Table 13 shows the different parameters used for the feature selection technique. All top \(m\%\) important terms (m = 10) (as discussed in Step 3(v) of Sect. 4.2) are combined to generate the training feature vector for the classifiers. Comparison results of the proposed TCFS feature selection technique with other well-known techniques (Mutual Information (MI), Chi-square (CHI-2), GINI [51], and Information Gain (IG) [59]) are given in Tables 14, 15, 16, 17 and 18 respectively. Eight classifiers, namely LinearSVM, Gaussian Naive Bayes (G-NB), Binomial Naive Bayes (B-NB), Multinomial Naive Bayes (M-NB), Adaboost, Decision Trees (DT), Random Forest (RF), and Extra Trees (ET), are used for document classification on the different datasets. All ensemble classifiers use 10 base classifiers. We adapted the above techniques to check the performance of the proposed feature selection with respect to the conventional techniques. Equation 18 is used to measure the performance of each classifier. The bold results indicate the highest F-measure obtained by the proposed feature selection technique using the corresponding classifier. From the results, it is observed that the proposed feature selection technique is either comparable to or better than the conventional techniques.

$$\begin{aligned} \textit{Precision (p)}= & {} \frac{|\mathrm {relevant}_{documents}\cap \mathrm {retrieved}_{documents}|}{|\mathrm{{retrieved}_{documents}}|} \nonumber \\ \mathrm {Recall (r)}= & {} \frac{|\mathrm {relevant}_{documents}\cap \mathrm {retrieved}_{documents}|}{|\mathrm{{relevant}_{documents}}|} \nonumber \\ \mathrm {F-measure (f)}= & {} 2*\left(\frac{p*r}{p+r}\right) \end{aligned}$$
(18)
Table 14 Comparisons on 20-NG Dataset
Table 15 Comparisons on DMOZ Dataset
Table 16 Comparisons on Classic4 Dataset

5.2.1 Tuning hyper-parameters:

The tuned hyper-parameters of the different classifiers are given below:

  i.

    LinearSVM: Cs = [0.001, 0.01, 0.1, 1, 10], gammas = [0.001, 0.01, 0.1, 1], param_grid = {‘C’: Cs, ‘gamma’: gammas}

  ii.

    Naive Bayes: Prior Probabilities = [0.65, 0.35]

  iii.

    Adaboost: n_estimators = 12, max_depth = 5, subsample = 0.5, random_state = 0, number of classifiers used = 10

  iv.

    Decision Trees: max_depth = 10, min_sample_splits = 40%, min_samples_leaf = 5

  v.

    Random Forest: n_estimators = 11, max_depth = 4, subsample = 0.5, random_state = 0, number of classifiers used = 10

  vi.

    Gradient Boosting: learning_rate = 0.1, n_estimators = 280, max_depth = 4, subsample = 0.4, random_state = 0, number of classifiers used = 10

  vii.

    Extra Trees: n_estimators = 20, learning_rate = 1, max_depth = 4, subsample = 0.6, random_state = 0, number of classifiers used = 10

Table 17 Comparisons on Reuters Dataset
Table 18 Comparisons on WebKB Dataset

5.3 Result analysis of query-optimized personalized PageRank

This section discusses the experimental results of the proposed query-optimized personalized PageRank technique in detail. At the beginning of the work, two sets of documents are formed:

  i.

    One set contains all the original categories of documents of a dataset, named ‘\(original_{category}\)’.

  ii.

    The other set contains the newly formed clusters, named \(new_{cluster}\), where a cluster may contain documents coming from different categories of a dataset. New clusters are formed by combining all the documents of the different categories of a dataset and then running the FCM technique on those documents. The number of clusters formed for each dataset is the same as the number of categories the dataset has; for example, 20-NG has seven top-level categories, so the number of clusters generated by the FCM algorithm is seven. To know which cluster of \(new_{cluster}\) belongs to which category of \(original_{category}\), a simple rule is used: cluster i belongs to category j iff i contains the maximum number of documents of j. This is done in order to evaluate the ranking process efficiently.

Table 19 Monogram and Bi-grams query words
Table 20 Performance of the ranking approach on 20-NG
Table 21 Performance of ranking on DMOZ

The query-optimized personalized PageRank is applied on \(new_{cluster}\) to rank all the documents. The top ‘s%’Footnote 11 ranked results of \(original_{category}\) and the top ‘t%’Footnote 12 ranked results of \(new_{cluster}\) are considered for the performance measurement of the ranking approach. Equations 19 and 20 are used to compute the precision and recall respectively.

$$\begin{aligned}&{precision (p')} = \frac{a}{b} \end{aligned}$$
(19)
$$\begin{aligned}&recall (r') = \frac{a}{d} \end{aligned}$$
(20)
$$\begin{aligned}&F\text{- }measure(f') = 2\big (\frac{p'*r'}{p'+r'}\big ) \end{aligned}$$
(21)

where ‘a’ is the number of common documents between the top ‘\(s\%\)’ documents of \(original_{category}\) and the top ‘t%’ documents of \(new_{cluster}\), ‘b’ is the number of documents in the top ‘t%’ ranked results of \(new_{cluster}\), and ‘d’ is the number of documents in the top ‘\(s\%\)’ of \(original_{category}\). For experimental purposes, we discuss the ‘Computer’ category of the 20-NG dataset, named \(original_{comp}\), which has 1952 documents. The proposed query-optimized personalized PageRank algorithm is run on both the newly generated computer cluster, named \(new_{comp}\), and \(original_{comp}\) to rank the documents of both clusters. We then find how many of the top ‘s%’ documents of \(original_{comp}\) match the top ‘t%’ documents of \(new_{comp}\), from which the F-measure is computed as mentioned in Eq. 21; a small sketch of this computation follows.
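To make the quantities a, b, and d concrete, a tiny sketch under our own naming and with hypothetical document ids:

```python
def overlap_f_measure(original_ranked, cluster_ranked, s=10, t=10):
    """Eqs. 19-21: compare the top s% of the original category ranking
    with the top t% of the new cluster ranking."""
    top_d = set(original_ranked[:max(1, len(original_ranked) * s // 100)])
    top_b = set(cluster_ranked[:max(1, len(cluster_ranked) * t // 100)])
    a = len(top_d & top_b)            # common documents
    p, r = a / len(top_b), a / len(top_d)
    return 2 * p * r / (p + r) if p + r else 0.0

original_ranked = [f"doc{i}" for i in range(100)]          # ranked ids
cluster_ranked = (["doc3", "doc7", "doc42", "doc0", "doc1"]
                  + [f"doc{i}" for i in range(50, 145)])
print(overlap_f_measure(original_ranked, cluster_ranked))  # 0.4
```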

5.3.1 Comparison of the proposed personalized PageRank with cosine-similarity and the traditional PageRank algorithm

The monogram and bi-gram queries generated from each dataset (shown in Table 19) are used for the cosine-similarity, PageRank, and proposed query-optimized personalized PageRank approaches. The proposed query-optimized personalized PageRank approach using monogram queries is tested on each category of all the datasets, and the performances are shown in Tables 20, 21, 22, 23 and 24 respectively. Comparisons of the three ranking techniques using monogram and bi-gram queries are shown in Figs. 2 and 3 respectively. From the figures, it is observed that the performance of query-optimized personalized PageRank is better than both the query combined with cosine-similarity and the query-optimized PageRank approach. The overall (i.e., average) F-measure on each dataset shows the stability of the proposed ranking approach.

Table 22 Performance of ranking on Classic4
Table 23 Performance of ranking on Reuters
Table 24 Performance of ranking on WebKB
Fig. 2: F-measure (monogram query)

Fig. 3: F-measure (bi-grams query)

6 Conclusion

This paper is an extension of our earlier approach; it improves the existing PageRank by personalizing it and then combining it with the query to rank the web documents. In the earlier approach, an efficient ranking model was developed which improves the ranking mechanism by bringing the required documents to the top of the retrieved results. The current approach further improves the earlier approach in the following ways:

  i.

    To improve the ranking mechanism, a novel feature selection technique (TCFS) is proposed which removes the noise features from the corpus.

  ii.

    To ensure that the proposed TCFS technique is efficient, it is compared with the existing feature selection techniques.

  iii.

    Next, the earlier PageRank algorithm is improved by personalizing it.

  iv.

    The personalized PageRank and the cosine-similarity of the documents with the user query further enhance the earlier ranking mechanism, which was based on the relevance of documents and their outlinks to the user query.

Experimental results on five benchmark datasets show the stability and effectiveness of the proposed query-optimized personalized PageRank approach. This work can further be extended by considering the following points:

  i.

    The proposed ranking method does not consider whether the ranked documents are spam. Hence, spam detection can be performed before the ranking process starts.

  ii.

    Similarly, duplicate documents are a big threat to search engines and are not detected by the proposed method. Detecting duplicate documents before the ranking process can further improve the proposed work.

  iii.

    Considering each query as a mixture of various topics (generated from the documents using Latent Dirichlet Allocation) can further improve the PageRank matrix, which then receives more relevant and important outlinks.

  iv.

    The proposed approach can be improved by incorporating user behavior signals [1].