1 Introduction

Most existing search engines rely only on the keywords shared between user queries and returned documents. However, a study of queries submitted to a search engine [24] observed that users usually formulate very short queries. Keyword-based systems are therefore often incapable of retrieving documents that respond to the information needs of search engines’ users.

Due to the increasing volume of information on the World Wide Web, and its lack of organization, retrieving relevant documents using the initial user query alone is becoming an almost impossible task [4]. The longer a query is, the more likely it is that important terms co-occur in the query and in its relevant documents.

Moreover, the query considered in isolation is insufficient to identify clearly what the user is looking for, because a query is only a partial and often ambiguous expression of the user’s information needs [5]. By considering the context around the query, the ambiguity can be resolved to a certain degree and the partial expression of the user’s needs can be completed.

It has been demonstrated that, for a correct interpretation, the user’s query should be placed in its appropriate context [4]. Context, however, is a broad notion: it includes the user context (the user’s domains of interest, preferences and search history) and the query context, i.e., the environment of the query (its relevant documents, its terms, etc.). The former requires user profiles, but a single profile can group a large variety of domains and interests that are not always relevant for a particular query. We therefore use the environment of the query as the appropriate context to improve the precision of the query.

In this paper, we propose an approach of query expansion based on the context around the query. This approach is divided into two phases:

  • In the first phase, we build the query context using Language Models (LMs) [2, 5, 6, 26] as a query recommendation method. The proposed recommendation technique estimates the probability that the LM of each past user query generates the new user query.

  • In the second phase, a query expansion method is applied using: (1) the new query to expand; (2) its top-ranked documents; (3) the queries best recommended to it; and (4) the top-ranked documents of each recommended query. We also apply the LSA method [14, 20] to find the terms most similar to the new user query terms. These similarities are then merged using the cohesion weight presented in Cui et al. [9] to order the candidate expansion terms according to the whole query to expand.

The Latent Semantic Analysis (LSA) method has already been used for query expansion. The originality of this work lies in extracting the context of the query using LMs. In information retrieval [26], the concept of LMs has traditionally been exploited to estimate the probability that the LM of a document generates a new query, in order to rank documents with respect to that query. A further contribution of our work is that the recommendation score is calculated simultaneously on term vectors and document vectors.

In the rest of this paper, we first provide a literature review in Sect. 2. Then, in Sect. 3, we describe the different processes of our approach, including the proposed query recommendation and query expansion methods. We present our results in Sect. 4. Finally, we conclude in Sect. 5 and discuss future work.

2 Related work

Query Suggestion [22] consists of proposing to the user queries that are similar to the submitted query, whereas Query Expansion [7, 8] consists of adding terms that are closely similar to the user query terms; the latter can be considered a method for improving retrieval performance by exploiting the context around the user’s query.

Several query suggestion methods have been proposed in the literature. For example, reference [25] proposes a method for recommending a list of queries related to the user’s query, based on clustering similar past queries. The authors’ hypothesis is that semantically similar queries may not share query terms, but they do share terms in the documents selected by users who submitted them. The query recommendation algorithm presented in that work consists of three steps. First, in an offline step, past user queries are represented by term-weighted vectors built from the text of their clicked URLs and are clustered. The second step is performed online: when a new query is submitted to the search engine, the cluster to which it belongs is found and a rank score is measured for each query in this cluster with respect to the new query. Finally, the related queries are ordered by rank score and returned as the output of the algorithm. The paper also proposes a similarity measure called the “Tanimoto Coefficient” for ranking the related queries. Experiments over real query logs show the effectiveness of this algorithm.

On the other hand, the terms used for expansion can be selected from external sources or from the corpus itself. Some methods select terms from the corpus through global analysis, where the list of candidate terms is generated from the whole collection, while others rely on local analysis [13], in which relevance feedback techniques [12] select expansion terms from the documents ranked highest by users.

Global analysis methods [12] are computationally very expensive, and their effectiveness is generally no better, and sometimes worse, than local analysis. The problem with local analysis is that user feedback is required to identify the relevant top-ranked documents. This user involvement makes it difficult to develop fully automatic query expansion methods. To avoid this problem, a pseudo-relevance feedback approach is preferred, where documents are retrieved using an efficient matching function and the top-retrieved documents are assumed to be relevant [9, 19].

However, these expansion methods are limited to extracting expansion terms from a set of documents and do not use information about interactions between the users and the system; such information is exploited by expansion based on query logs [11, 15, 23].

Query logs are a mine of information about the interaction between users and the information retrieval system. As an example, the approach proposed in Fonseca et al. [11] suggests concepts for expanding the original user query with additional context. The first part of the approach generates concepts from query logs in three steps: an offline step determines query relations in the log, then two steps, performed at query processing time, build a query relation graph and identify concepts in this graph. The second part is a concept-based query expansion method, where the system proposes to the user a set of concepts related to the query. Once the user has selected a concept, it is added to the original query and the expanded query is processed. Such approaches are simple, intuitive and effective according to the reported experiments, but they use only past user queries for expanding the new query and do not reduce the gap between query terms and document terms.

In recent decades, the information retrieval field has also seen the integration of semantic aspects into query expansion, document ranking and clustering, question-answering systems and query recommendation. In Meng et al. [15], a new model for measuring similarity between web queries was proposed, taking into account both the word forms and the semantic information of the two queries. It uses WordNet [16], a thesaurus that focuses on word meanings instead of word forms, to obtain the semantic information. The described approach uses bottom-up hierarchical clustering to group past user queries and discover the different topics gathered in the search engine’s log; the clustering is carried out using the newly proposed similarity metric. Experiments on this model show that it improves the precision of query expansion by 9.2 % and its recall by 8.1 %.

Traditionally, the concept of the LM [2, 5, 6, 26] is exploited in information retrieval to represent the relevance relationship between a document and a submitted query, by estimating the probability that the LM of the document generates the query. In this work, we exploit LMs to rank the past user queries, extracted from the search engine’s query log, according to their capacity to generate the new user query. We then apply the LSA method [14, 20] as a query expansion technique: LSA produces a matrix linking documents with their terms, from which we can compute the similarity between a query and a document or between two terms.

3 Context-aware query expansion method

The approach proposed in this paper (Fig. 1) is a context-aware query expansion method called “Latent Semantic Analyses using Recommended Queries” (LSARQ), and it is composed of two phases:

  1. Query Recommendation: we use the LMs to find the past queries most related to the user query.

  2. Query Expansion: we use the LSA model to select the candidate terms for expanding the user query.

3.1 Phase 1: LM for the query’s context extraction

The main objective of this work is to provide high-quality suggestions for the original user query, which we later use to expand the query with additional context. We use the LM to calculate the correlation between two queries: past queries extracted from a query log are ordered according to their capacity to generate the new user query. The recommendation score of a past query \({{\varvec{Q}}}_{{\varvec{p}}}\) with respect to the initial query \({{\varvec{Q}}}_{{\varvec{n}}}\) is calculated as follows:

$$\begin{aligned} \textit{RankScore}\left( {{Q_n},{Q_p}} \right) = \gamma \textit{Score}_{LM\_T}\left( {{Q_n},{Q_p}} \right) + \left( {1 - \gamma } \right) \textit{Score}_{LM\_D}\left( {{Q_n},{Q_p}} \right) \end{aligned}$$
(1)

where \(\gamma \in \left[ {0,1}\right] \) is an interpolation parameter that weights the two scores.

The \(\textit{Score}_{\textit{LM}\_T}\) is calculated using vectors that represent the presence or absence of each term in a query. \(Score_{LM\_D}\) is computed using vectors that represent the presence or absence of each document among the clicked documents of a query.

The typical score function defined by KL-divergence in the language modeling framework [1, 3] is used as a measuring function for \(Score_{LM\_T}\) and \(Score_{LM\_D}\). Its expression is as follows:

$$\begin{aligned} \textit{Score}_{\textit{LM}}\left( Q_{n}, Q_{p} \right) = {\sum }_{t \in V} P\left( t|\theta _{Qn} \right) \log P\left( t|\theta _{Qp} \right) \approx - KL(\theta _{Qn}\Vert \theta _{Qp}) \end{aligned}$$
(2)

where \(\theta _{Q_n }\) is the LM of the new query, \(\theta _{Q_p}\) the LM created for a past query, and V the vocabulary of terms.

\(P\left( {t|\theta _Q } \right) \) represents the probability of a term t in the LM of the query and is computed using the Maximum Likelihood Estimation (MLE), by the equation:

$$\begin{aligned} P\left( {t|\theta _Q } \right) =\frac{f\left( t \right) }{{\sum }_{t_{i} \in Q} f\left( {t_i } \right) } \end{aligned}$$
(3)

where \(f\left( t \right) \) is the frequency of t in the query.

Thus, Eq. (1) represents the global ranking score of a past query \(Q_p \) with respect to the new query \(Q_n \). Equation (2) is used to calculate the \(Score_{LM\_T} \) and \(Score_{LM\_D}\) of Eq. (1). The \(P\left( {t|\theta _{Q_n }} \right) \) and \(P(t|\theta _{Q_p })\) in Eq. (2) are computed using the expression in Eq. (3).
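The combination of Eqs. (1)–(3) can be sketched in Python as follows. Function and variable names are illustrative, not from the paper; for brevity, terms absent from a past query receive a small floor probability here, whereas the actual approach handles them with Jelinek–Mercer smoothing (Sect. 3.1):

```python
import math
from collections import Counter

def mle(items):
    """Maximum-likelihood distribution of Eq. (3): relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

def score_lm(new_items, past_items, vocab, eps=1e-6):
    """Generation score of Eq. (2): cross-entropy-style sum over the vocabulary.

    `eps` is a stand-in floor probability for items absent from the past
    query; the paper instead smooths past-query models (Eq. 4)."""
    p_new, p_past = mle(new_items), mle(past_items)
    return sum(p_new[x] * math.log(p_past.get(x, eps))
               for x in vocab if p_new.get(x, 0.0) > 0)

def rank_score(q_new, q_past, docs_new, docs_past, gamma=0.5):
    """Eq. (1): interpolate the term-based and document-based LM scores."""
    vocab_t = set(q_new) | set(q_past)      # term vocabulary
    vocab_d = set(docs_new) | set(docs_past)  # clicked-document "vocabulary"
    return (gamma * score_lm(q_new, q_past, vocab_t)
            + (1 - gamma) * score_lm(docs_new, docs_past, vocab_d))
```

A past query sharing terms and clicked documents with the new query scores higher (closer to zero) than an unrelated one, which yields the ordering used for recommendation.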

However, the size of the training corpus can never match the size of the language. Thus, when a query contains a term that is absent from the training corpus, this term is estimated with a null probability, and a null probability is therefore assigned to any sequence of words containing it. This issue, known as data sparsity, is the main problem affecting LMs.

The usual solution to this problem is smoothing, whose aim is to assign a non-null probability to terms absent from the training corpus by redistributing the observed probability mass.

Several smoothing methods have been developed in the literature [18]. The choice of the appropriate smoothing technique depends on the environment of experimentation according to Cao et al. [6]. One of the common smoothing methods used in information retrieval is the Jelinek–Mercer interpolation smoothing:

$$\begin{aligned} P\left( {t|\theta ^{\prime }_{Q_p } } \right) =\left( {1-\lambda } \right) P\left( {t|\theta _{Q_p } } \right) +\lambda P\left( {t|\theta _C } \right) \end{aligned}$$
(4)

where \(\lambda \) is an interpolation parameter and \(\theta _C \) the LM of the collection of queries extracted from the search engine’s log.

For our approach, we use the Jelinek–Mercer interpolation smoothing only for past queries. The new query model \(\theta _{Q_n } \) is estimated by maximum likelihood estimation without any smoothing.
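Equation (4) can be sketched as follows; the argument names are illustrative, with `collection_tokens` standing for the concatenation of all past queries in the log:

```python
from collections import Counter

def jelinek_mercer(term, query_tokens, collection_tokens, lam=0.2):
    """Eq. (4): interpolate the query MLE with the collection MLE.

    A term absent from the query still gets a non-null probability
    as long as it appears somewhere in the collection."""
    q_counts = Counter(query_tokens)
    c_counts = Counter(collection_tokens)
    p_query = q_counts[term] / len(query_tokens)
    p_collection = c_counts[term] / len(collection_tokens)
    return (1 - lam) * p_query + lam * p_collection
```

For example, a term missing from the past query but present twice in a four-token collection gets probability \(0.8 \times 0 + 0.2 \times 0.5 = 0.1\) with \(\lambda = 0.2\).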

3.2 Phase 2: query expansion using the LSA method

The LSA is a method that tries to overcome the problems of lexical matching by retrieving information on the basis of conceptual meaning instead of individual words.

LSA assumes that there is some latent (underlying) structure in word usage that is partially obscured by the variability in word choice. A particular mathematical technique called singular value decomposition (SVD) is applied to a word-document matrix in order to estimate the structure in word usage across documents [14].

Applied to a set of documents and a user query to expand, the LSA method first builds a weighted word-document matrix \({{\varvec{A}}}_{{{\varvec{t}}}\times {{\varvec{d}}}}\), where t is the number of terms and d the number of documents plus one column representing the query vector. Then, the SVD projection is computed by decomposing the term-document matrix \({{\varvec{A}}}_{{{\varvec{t}}} \times {{\varvec{d}}}}\) into the product of three matrices \({{\varvec{T}}}_{{{\varvec{t}}}\times {{\varvec{n}}}}\), \({{\varvec{S}}}_{{{\varvec{n}}}\times {{\varvec{n}}}}\) and \({{\varvec{D}}}_{{{\varvec{d}}}\times {{\varvec{n}}}}\):

$$\begin{aligned} A_{t\times d} =T_{t\times n} S_{n\times n} D^{\prime }_{n\times d} \end{aligned}$$
(5)

where \(n = \min (t, d)\) is the number of dimensions for A, also called the rank of A and D’ is the transpose of D.

The matrices T and D represent terms and documents in the new space and have orthonormal columns, i.e., \(\mathbf{T}^{\mathbf{T}}\mathbf{T} = \mathbf{D}^{\mathbf{T}}\mathbf{D}= \mathbf{I}\). The matrix S is diagonal and contains the singular values of A in descending order; the ith singular value indicates the amount of variation along the ith axis.

In the next step, the SVD matrices are truncated by reducing the rank n of the matrix A. The objective of the method is to find the rank \(k < n\) that gives a new matrix \({{\varvec{A}}}^{\prime }\) which is the best approximation of \({{\varvec{A}}}\).

\({{\varvec{A}}}^{\prime }\) is constructed by restricting the matrices T and D to their first k columns, keeping the k largest singular values of S, and multiplying them as follows:

$$\begin{aligned} A^{\prime }_{t\times d} =T_{t\times k} S_{k\times k}D^{\prime }_{k\times d} \end{aligned}$$
(6)

The rank reduction projects the matrix \({{\varvec{A}}}\) into the lower dimensional space while minimizing the “distance” between the two matrices, as measured by the 2-norm:

$$\begin{aligned} \varDelta = {\left| {\left| {A - A'} \right| } \right| _2} \end{aligned}$$
(7)

The choice of the number of dimensions k for \({{\varvec{A}}}^{\prime }\) is an interesting problem. Reducing n can remove much of the noise, but keeping too few dimensions may lose important information. It has been observed, however, that the LSA method works well with a relatively small number of dimensions k, which shows that these dimensions capture a major portion of the meaningful structure [20].

The truncated SVD captures most of the important underlying structure in the association of terms and at the same time removes the noise or variability in word usage. For example, terms that occur in similar queries or documents will be near each other in the k-dimensional space even if they never co-occur in the same query. In fact, some terms that never co-occur with the new query terms can be similar to them in the k-space.
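The decomposition and truncation of Eqs. (5) and (6) can be sketched with NumPy, whose `svd` routine returns \(D^{\prime}\) directly; the function name is illustrative:

```python
import numpy as np

def truncated_lsa(A, k):
    """Rank-k approximation A' of the term-document matrix A (Eqs. 5-6)."""
    # Eq. (5): A = T S D'.  numpy returns the transpose D' as `Dt`
    # and the singular values `s` already in descending order.
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    # Eq. (6): keep the first k columns of T, the k largest singular
    # values, and the first k rows of D'.
    return T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
```

Because the singular values come out in descending order, this truncation is exactly the best rank-k approximation of A in the 2-norm sense of Eq. (7); k can then be chosen, for instance, so that the retained singular values cover most of the spectrum.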

Finally, the term vectors of the new user query can be compared to all candidate expansion terms, which are ranked by their similarity to each term of the query to expand using the common cosine similarity measure [21]:

$$\begin{aligned} \textit{Simc}\left( {\vec {t_i },\vec {t_j }} \right) =|\cos \left( {\vec {t_i },\vec {t_j }} \right) |=\frac{|\vec {t_i }\cdot \vec {t_j }|}{\Vert \vec {t_i }\Vert \times \Vert \vec {t_j }\Vert }. \end{aligned}$$
(8)

By combining the similarities of a candidate term \(t_j \) with all the new query terms \(t_i^{Q}\), we calculate the cohesion weight [9] of the candidate term, which represents the correlation (relationship) between this term and the whole query to expand.

The cohesion weight of a term \(t_j \) for a user query \({{\varvec{Q}}}\) is measured by the expression [9]:

$$\begin{aligned} {\textit{CoWeight}}\left( {Q,t_j} \right) =\ln \left( \prod \limits _{t_i^{Q}\in Q} \left( {\textit{Simc}\left( {t_j ,t_i^{Q}} \right) +1} \right) \right) . \end{aligned}$$
(9)

This method returns a list of weighted terms. The top-ranked terms can be selected as expansion terms for the new user query.
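Equations (8) and (9) can be sketched as follows, with candidate and query terms represented by their row vectors in the truncated k-space; names are illustrative:

```python
import math
import numpy as np

def simc(u, v):
    """Eq. (8): absolute cosine similarity between two term vectors."""
    return abs(float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v)))

def co_weight(query_vecs, cand_vec):
    """Eq. (9): cohesion of a candidate term with the whole query."""
    return math.log(math.prod(simc(cand_vec, q) + 1.0 for q in query_vecs))

def rank_expansion_terms(query_vecs, candidates):
    """Order candidate terms (a name -> vector mapping) by cohesion weight."""
    scores = {name: co_weight(query_vecs, vec) for name, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The `+ 1` inside the product keeps a single zero similarity from annihilating the cohesion weight, so a term related to most, but not all, query terms can still rank well.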

Fig. 1 Context-aware query expansion method

3.3 Integrating phase 1 and 2

The main idea of this work is to expand queries by enriching them with additional context, following the process illustrated in Fig. 1. To this end, we represent each query by a term vector and a document vector: the term vector represents the presence or absence of each term in the query, whereas the document vector represents the presence or absence of each document among the clicked documents of the query. The information about clicks is extracted from the query logs, while for the new query to expand the top-ranked documents are used instead of clicked documents.

Therefore, we apply the query recommendation model of phase 1 to order the past user queries (extracted from the query log) with respect to the new user query that we intend to expand at the end of this process. The output of this phase is a list of past queries ordered by the RankScore of Eq. (1).

The context of the new query \(Q_n \) is defined as the queries most recommended to \(Q_n\), i.e., the top of the list resulting from phase 1.

As a context-aware query expansion method, phase 2 consists of applying the LSA method to the context extracted in phase 1. Thus, the LSA method is applied using:

  • The new query to expand;

  • Its top-ranked documents;

  • The best-ranked queries recommended to it;

  • The top-clicked documents of every recommended query used.

The output of LSA is a term-document truncated matrix of k dimensions (Sect. 3.2). This matrix is used to compute the similarities between the new query terms and all the terms contained in the matrix using Eq. (8). Finally, the cohesion weight (Eq. 9) measures the weight of the relationship between the whole query and a candidate term of expansion.

The result of the approach is a list of candidate expansion terms ranked according to their correlation with the new query to be expanded.

4 Experimental results

As a test collection, we used the CISI database from the standard SMART collection. This collection provides 111 queries, 1460 documents and a matrix indicating the relevance or non-relevance of each document to each query.

The first experimental step applies the Query Recommendation method presented in Sect. 3.1. We used the average internal similarity (AIS) measure to identify the best value of the Jelinek–Mercer interpolation smoothing parameter \(\lambda \) (Eq. 4) for the LM based on term vectors. We considered each input query (the query to expand) together with its five best-recommended queries as a cluster, and we calculated the AIS of each cluster using the expression [17]:

$$\begin{aligned} \hbox {AIS}\left( c \right) =\frac{\Vert \hbox {sum}\left( c \right) \Vert ^{2}-\left| c \right| }{\left| c \right| \left( {\left| c \right| -1} \right) } \end{aligned}$$
(10)

where c is the cluster of queries, \({\vert }c{\vert }\) the number of vectors (queries) in the cluster, and sum(c) the vector sum of all the vectors in the cluster c.
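Equation (10) can be sketched as follows, assuming (as the expression requires) that the query vectors are unit-normalized, so that AIS reduces to the average pairwise cosine similarity inside the cluster; the function name is illustrative:

```python
import numpy as np

def ais(cluster):
    """Eq. (10): average internal similarity of a cluster.

    `cluster` is a 2-D array whose rows are unit-normalized query
    vectors; for such vectors ||sum(c)||^2 equals |c| plus twice the
    sum of the pairwise dot products."""
    n = len(cluster)
    s = cluster.sum(axis=0)            # sum(c), a vector
    return (float(s @ s) - n) / (n * (n - 1))
```

A cluster of identical unit vectors yields AIS = 1, while a cluster of mutually orthogonal vectors yields AIS = 0, which matches the intuition that higher AIS means more coherent recommendations.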

In Table 1, we report the AIS values for short queries (fewer than 5 terms) and long queries (more than 5 terms).

Table 1 Average internal similarity for short and long queries while varying the smoothing parameter

Table 1 shows that, for short queries, the AIS increases from a low value at \(\lambda =0\) to its best value at \(\lambda =0.2\) and keeps it until \(\lambda =0.8\), while for long queries the best AIS value is also reached at 0.2 and then decreases to its lowest value at 0.8. We conclude that, for query recommendation with term-based LMs on our test collection, a smoothing parameter of 0.2 is enough to reach the best AIS values.

Table 2 AIS for short and long queries using the LMR and QRA approaches for query recommendation

To assess the relevance of this approach, we compared it with the Query Recommendation Algorithm (QRA) proposed in [25], which we improved and used for context extraction in previous work [10].

We present in Table 2 the AIS of the 5 best-recommended queries for each input query, using our Language Model Recommendation (LMR) technique and the QRA, with the interpolation parameter of Eq. (1) set to \(\gamma = 0.2\) for short queries and \(\gamma =0.4\) for long queries.

Table 2 shows that, for both short and long queries, the highest AIS value is reached by our LMR approach. These results once again demonstrate the performance of LMs in the information retrieval field.

As a second experimental step, we applied the LSARQ approach proposed in Sect. 3. We used the Un-interpolated Average Precision (UAP) measure to evaluate the performance of the IR system. We varied the number of recommended queries and searched for relevant documents down to the 20th retrieved document. Figure 2 presents the UAP results for short and long queries.

Fig. 2 Un-interpolated Average Precision for short and long queries using the LSARQ approach according to the number of recommended queries

Fig. 3 Variation of the number of expansion terms used in the LSARQ approach for short and long queries

Fig. 4 Comparison of two query expansion methods for short and long queries

Figure 2 shows that for short queries the UAP reaches its highest value with 2 recommended queries, while for long queries the highest UAP is obtained with 3 recommended queries. We used these settings when varying the number of terms used to expand the initial queries; the results are presented in Fig. 3.

Figure 3 shows that expanding short queries with 2 terms increases the UAP from 0.52 to 0.62, after which it decreases steadily as more expansion terms are added. For long queries, the UAP stays at 0.35 when adding only 2 terms with the LSARQ method, then increases to reach its highest value of 0.52 at 5 and 6 expansion terms, and decreases when more than 6 terms are added.

To evaluate our approach, we compare it in Fig. 4 with the baseline (original query) and the LSA model (query expansion using LSA without recommended queries).

Figure 4 shows that our proposed approach significantly improves the results returned by the information retrieval (IR) system for both short and long queries. Table 3 and Fig. 5 confirm these results, showing the performance of the LSARQ approach relative to the baseline and the LSA model.

Table 3 shows that our context-aware query expansion approach LSARQ improves the precision of the IR system over expansion using the LSA method alone by 24 % for short queries and by 40.54 % for long queries.

Table 3 Comparing the performance of the query expanded by the proposed LSARQ approach to the baseline and the LSA method

Fig. 5 Performance comparison by the F-measure

In Fig. 5, we evaluate LSARQ using the F-measure. For short queries, the results show an improvement of 7.76 % over the search with the original query and of 5.04 % over the expansion by LSA. For long queries, the F-measure increases by 13.31 % over the original queries and by 6.23 % over the LSA method, which indicates the good performance of our approach.

5 Conclusion

In this paper, we proposed a query expansion approach based on the LSA method and LMs. We used LMs to enrich the new user query with additional context extracted from the log of past user queries. Using the CISI text database from the standard SMART collection, we showed that our LSARQ approach improves the effectiveness of the information retrieval system by 24 % for short queries and 40.54 % for long queries compared with expansion using the LSA technique alone, and by 19.23 % for short queries and 52.94 % for long queries compared with the original user queries. We conclude that the proposed combination gives better results than each individual method.

In future work, we intend to carry out further experiments, investigating other combinations of query expansion methods and proposing a new method for context extraction.