1 Introduction

Information Retrieval (IR) broadly describes methodologies and techniques that accurately distinguish the documents in a corpus that are relevant to a query from those that are not. An IR architecture has three major components: the representation of documents and queries, a ranking function to match queries with documents, and query expansion (QE) [26, 28]. In the literature, various methodologies and approaches [56, 59] have been developed and proposed for these components to improve IR system performance [14, 27, 38, 40, 55]. However, considerable scope remains for QE, because the queries given by users carry ambiguity and uncertainty [23]. Query expansion is a way to enhance the retrieval performance of an IR system. Consider an example: when a user submits the query "car", the search engine needs to retrieve all the documents related to cars. Query expansion helps in this process by expanding "car" into "cars", "automobiles", and "vehicles". Although a few researchers [23, 56, 59] have achieved success to some extent, substantial scope for improvement remains.

Some researchers have used ontologies to capture the semantic meaning of documents and thereby improve the quality of user queries [34]. The authors identified suitable concepts that describe and illustrate the content of documents. The main challenge was to discard unrelated concepts while keeping only the relevant ones.

Before investigating QE approaches, we need to understand the importance of QE and its role. Why do we require query expansion approaches? The answer is that users are not experts at expressing their requirements in the form of queries. Studies show that, most of the time, they give only two or three words in a query [60]. Such short queries are ambiguous and uncertain by nature [18, 23]. For example, a user puts his/her request (query) as "What is Java." Through this query, the user might wish to get results for the Java technology, be interested in different variants of Java coffee, or be looking for the island of Java. The other reason for using QE approaches is the low recall rate, as most applications demand highly precise information but not necessarily high recall [18]. QE is one of the most widely used practices for handling the issues discussed above. Query reformulation or expansion approaches handle the "vocabulary mismatch problem" [23, 54, 66].

There are various QE approaches, but "pseudo-relevance feedback" (PRF) [6] is the most popular and widely used [9]. In this approach, the user submits his/her request in the form of a text query to an IR system, which extracts relevant documents. The top extracted documents are then selected and used to identify unique terms, called the term pool. Researchers have used various term weighting schemes to find suitable terms in the literature so far [9, 10, 23, 25, 50, 57]. The PRF method has one limitation, i.e., query drift: if PRF returns irrelevant documents, then unsuitable terms are added to the query, which decreases performance [23, 25].

In this paper, we use three datasets, i.e., CACM, the Centre for Inventions and Scientific Information (CISI), and the Text REtrieval Conference-3 (TREC-3), to execute all the experiments. The CACM dataset consists of titles and abstracts from the CACM journal and contains 3204 documents. CISI is a public dataset used for information retrieval; each document contains a unique ID, title, abstract, and cross-language references to other documents. The TREC-3 collection is much larger, containing more than eight hundred thousand documents, with topics 151-200 serving as queries. We conduct all the experiments on 50 randomly selected queries from each dataset. The organization of the remainder of the paper is as follows. Section 2 outlines the research objectives of this paper. The related work in the field is described in Section 3. Section 4 presents our QE framework for IR; this framework contains various modules, such as document pre-processing and indexing. Section 5 presents all outcomes and their discussion. Section 6 concludes the paper.

2 Research objectives

  • First, we study the effect of adding synonyms, the n-gram PRF method, abbreviation resolution, and co-occurrence features on the performance of baseline approaches.

  • Second, we propose a new combined-techniques-based query expansion approach that integrates WordNet, concept-based normalization, PRF, and co-occurrence features. The integration inherits the benefits of each method, resulting in enhanced performance.

  • Third, we compare the proposed approach with recently developed QE techniques of a similar type on three benchmark datasets: CACM, CISI, and TREC-3.

3 Related work

According to the literature, most queries contain only one, two, or three words [33]. Such short queries increase ambiguity, because the same or similar query terms admit more than one interpretation; this is called the vocabulary mismatch problem. The pursuit of both high recall and high precision in IR has driven much of the work on QE.

Cooper et al. [15] developed various graphical relations between items to improve the performance of IR. Horng et al. [29] implemented a new term weighting scheme for QE using a genetic algorithm. Gong et al. [22] proposed a new query expansion approach using WordNet; the authors applied the concept of substitution and hypernym/hyponym associations in their approach and obtained significantly improved results. Fattahi et al. [21] used domain-related topical and non-topical terms to develop a QE model. Latiri et al. [41], in their proposal for a QE approach, used association rule mining techniques. The term weighting scheme presented by Bendersky et al. [5] used a genetic algorithm to determine the weighting factor for the query.

More recent work has explored external data sources for QE: Anand et al. [1], Azad et al. [3, 4], Bouchoucha et al. [7], Hsu et al. [30], and Kotov et al. [37] explored the usage of Wikipedia to enhance QE for retrieving documents relevant to an original query. A few authors suggested the use of an automatically built thesaurus in query expansion; to create the thesaurus, they used relationships between concepts such as antonym, synonym, part-of, or hyponym. However, adding terms to a query using an ontology may also retrieve irrelevant documents (Li et al. [42]; Macdonald et al. [43]). Mahler [44] proposed a holistic query expansion (HQE) method. Wang Zhixiao et al. [64] presented an approach for QE that relies on Global Analysis and Ontology (GAO), combining statistical and semantic approaches. Chang et al. [12] implemented a new fuzzy-based QE approach to improve the IR system, with satisfactory results. Nowacka et al. [46] developed a novel method for QE using fuzzy logic. Khennak et al. [35] developed a QE approach using accelerated particle swarm optimization, evaluated it on the MEDLINE dataset, and obtained exceptional results. Singh et al. [56] proposed a different approach: they used corpus-based lexical co-occurrence techniques and chose the best grouping of query terms from the term pool obtained using pseudo-feedback-based QE. Gupta et al. [23] introduced a new QE approach based on co-occurrence and PRF, verifying their results on the CISI, CACM, and TREC-3 datasets, which improved IR performance.

Htun et al. [31] performed two user studies based on different collaborative search interfaces and information access scenarios. They used several IR, Collaborative IR, and Collaborative Information Seeking evaluation metrics to compare the interfaces, and concluded that various features of a query, such as the time spent on it, its popularity, and its effectiveness, may be useful for estimating retrieval performance. Singh et al. [58] used rank fusion and semantic notion methods in query expansion. Raza et al. [48] presented a survey on statistical approaches for query expansion. Fang et al. [20] suggested a semantic sequential dependence model (SSDM) that combines semantic content with the conventional SDM to fetch biomedical documents; they used BioASQ as the dataset and reported satisfactory results. Sharma et al. [52] provided a comparative study of recently established query expansion methods that use fuzzy logic to obtain relevant documents from broad datasets for a specific user query. Dahab et al. [16] used spectral-analysis-based information retrieval to account for term proximity; the authors used a distribution-based approach (KLD) and WordNet to select the candidate terms. Zingla et al. [70] used association rules for term retrieval, focusing on combining external sources with association rule mining to enhance query representation; to deal with query drift in the future, more significance can be given to the original query terms. Huang et al. [32] proposed a code-changes-based approach to query expansion for code search; the results show precision improvements of 52% to 62%.

Azad et al. [2] presented a survey on QE techniques proposed between 1960 and 2017, comparing them in terms of various features such as the data sources used, weighting and ranking methodologies, user participation, and applications. Bounhas et al. [8] built a morpho-semantic knowledge graph from a classical Arabic vocalized corpus, combining the Ghwanmeh stemmer and MADAMIRA to derive a multi-level lexicon. Perez et al. [47] proposed an automatic QE approach based on collaborative feature location models. Azad et al. [3] developed a new QE approach using Wikipedia and WordNet: to address the problem of relating query terms, they used Wikipedia and WordNet as data sources, weighted expansion terms with in-link and tf-idf measures, and obtained satisfactory results. Sharma et al. [53] presented a new QE technique based on cuckoo search, particle swarm optimization, and fuzzy logic. They first used the Okapi-BM25 ranking function to collect documents and create a candidate term pool, then applied cuckoo search to select the best candidate expansion terms to append to the query. The technique was tested on three benchmark datasets (CISI, CACM, and TREC-3) with satisfactory results.

Chandra et al. [11] demonstrated the use of query enhancement in a Cross-Lingual IR (CLIR) system, in which the user can input a query in any known language and get the output in the desired language; they used Hindi-English CLIR to map Hindi queries onto English documents and improve MAP results. However, future work needs to increase result quality and decrease the time spent fetching relevant documents. Kumar et al. [39] proposed the use of user profile information, such as user ratings and user tags, to fetch relevant information; the PRF method is used here to capture social information, and in the future all user features could be used to extract relevant items. The vocabulary mismatch problem also occurs in microblog texts due to their shorter length, so to address this challenge Wang et al. [65] used a k-nearest neighbor (kNN) algorithm to derive terms from local word embeddings and augment the initial query; they used the official TREC Twitter corpora as the dataset, and the results indicate that this method improves retrieval performance. Azad et al. [4] tried to predict the relationship between terms and to relate those terms to the user query; they suggested a query expansion procedure that uses the term frequency-inverse document frequency algorithm, kNN-based cosine similarity, and the calculation of a correlation score. They used the FIRE dataset to examine the proposed method, which gives satisfactory retrieval performance compared with existing approaches. Dalton et al. [17] suggested word-based and entity-based methods to deal with complex topics, resulting in a 20% improvement in MAP. Nasir et al. [45] discussed a novel query expansion approach that uses both corpus-based techniques and relevance feedback; semantic similarity between the user query and the retrieved candidate terms is determined using different knowledge sources, which helps in query reformulation. Torjmen-Khemakhem et al. [63] discussed a novel QE process for medical text based on retro-semantic mapping, in which a text database and UMLS concepts are used for medical image retrieval and the most appropriate concept is selected. Gupta et al. [24] suggested a novel swarm-based approach that uses KHM and fuzzy logic to produce high-quality data clustering; two standard datasets, CACM and CISI, were used to evaluate the clustering results, and this approach may be combined with document clustering in future work. Wu et al. [67] combined pattern mining, topic modeling, and relevance ranking techniques for multi-document summarization, using the DUC 2006 and DUC 2007 datasets for experiments and achieving better performance than existing methods.

Esposito et al. [19] proposed a hybrid approach using lexical resources and word embeddings for query expansion, designing a question-answering system for the Italian language in the cultural heritage domain; as future work, other languages and domain knowledge need to be considered to generalize the results. This method is a standard statistical approach. Chaudhary et al. [13] proposed a framework, Image Retrieval using Knowledge Embedding (ImReKE), for embedding knowledge into images and queries using a knowledge base, allowing retrieval approaches to better understand the context of queries and images; the improvement is greatest when the knowledge embedding is created using their proposed knowledge base rather than existing KBs such as ConceptNet and ImageNet. Khennak et al. [36] suggested two correlation measures for query reformulation: an external correlation that relies on feature co-occurrence and an internal correlation that uses term proximity. The authors used MEDLINE as the dataset, and the experiments showed the robustness of this method compared to existing work.

In the approaches mentioned above, suitable terms are identified and used for QE without considering the various conditions necessary to obtain an optimized query using knowledge sources and other IR system parameters. This paper explores the effect of several such processes on query expansion and, based on their observed effects, combines them into a framework for Automatic Query Expansion (AQE).

4 Proposed query expansion framework

In this section, we explain the various components of the presented AQE framework, shown in Fig. 1; its modules are discussed in detail in the following subsections. The model comprises five components: document preprocessing, query preprocessing and expansion, indexing, querying, and document retrieval. The proposed method combines several techniques to enhance the performance of the IR system, and contextual information is also used. WordNet is applied to find associations between query terms and document terms. Our method improves the pseudo-relevance feedback method by adding a linear document context combination, which is the specialty of our approach: it helps remove non-relevant documents retrieved during the pseudo-relevance feedback process.

Fig. 1 Framework of proposed automatic query expansion

4.1 Document preprocessing

All three datasets (CACM, CISI, and TREC-3) are preprocessed using the following steps; a combined sketch of the pipeline is given after the list:

  1. Segmentation of sentences: Each document contains a few paragraphs, and these paragraphs are composed of many sentences. We segmented all the sentences within paragraphs using the Lucene framework, which is written in Java. This step identifies the boundaries of each sentence. Punctuation marks are generally used as sentence boundaries in written language, differentiating sentences from each other. Here, dictionary-based segmentation is used, which is more flexible and appropriate for query expansion; it uses a lexicon tool to decide word boundaries [68].

  2. Replacement of characters: After sentence segmentation, we perform character replacements in sentences. All character images are replaced with textual characters, and all Roman numerals are replaced with Arabic numerals. Character replacement is essential and helps to capture the right semantic terms when matching queries and documents.

    In this work, we also normalized abbreviated and hyphenated tokens: for example, T.V. is replaced with Television, and Hewlett-Packard with Hewlett Packard. Many other changes of this kind were made in the character replacement process. This step brings all the words into a single format.

  3. Removal of text and special characters: We remove all HTML tags, headings, and abbreviations to reduce the matching complexity between a query and a document. We also removed very short sentences, i.e., sentences consisting of only one or two words, as they produce noise when determining precision and recall. Special characters such as hyphens, asterisks, and slashes were removed as well to smooth the matching process.

  4. Abbreviation resolution: After the removal of special characters, we expand specific abbreviations, which helps in retrieving relevant results; e.g., "inc." becomes "incorporated".

  5. Tokenization: This transforms the text of a document into a list of words or tokens. It usually requires some preprocessing of the text, such as converting all words to lower case and eliminating punctuation characters. Individual tokens are identified after this initial preprocessing. Once the text of a document has been tokenized, it is crucial to determine which words to use, i.e., which descriptors are jointly useful for defining the content of the document and for separating it from the other documents in the set.

  6. Stop word removal: Stop word removal is the process of removing the least essential terms from documents and queries. Stop words consist of prepositions, conjunctions, and articles. These terms are removed because they are not useful for identifying high-ranking documents with respect to a user query. Even though stop words have a grammatical function and are essential for the comprehension of sentences, they are of little use in discriminating between documents in IR.

  7. Stemming: Stemming removes suffixes from terms in order to reduce them to a common stem, called the root form. It typically removes gerund endings (-ing), plural endings, and past-tense endings.
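
To make the pipeline concrete, the sketch below chains the seven steps for a single document in Python. It is a minimal illustration rather than the Lucene-based implementation used in this work: the stop-word list, the abbreviation map, and the suffix-stripping rule are simplified stand-ins for the actual resources.

```python
import re

# Illustrative resources; the lists actually used in this work are larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are"}
ABBREVIATIONS = {"inc.": "incorporated", "t.v.": "television"}

def preprocess(document):
    """Apply steps 1-7 of Section 4.1 to one document."""
    # Step 1: sentence segmentation on terminal punctuation (a stand-in
    # for the dictionary-based segmentation done with Lucene).
    sentences = re.split(r"(?<=[.!?])\s+", document)
    tokens = []
    for sentence in sentences:
        if len(sentence.split()) <= 2:              # step 3: drop concise sentences
            continue
        sentence = sentence.replace("-", " ")       # steps 2-3: character handling
        sentence = re.sub(r"[*/]", " ", sentence)
        for word in sentence.lower().split():       # step 5: tokenization
            word = ABBREVIATIONS.get(word, word)    # step 4: abbreviation resolution
            word = word.strip(".,;:!?\"'()")
            if not word or word in STOP_WORDS:      # step 6: stop-word removal
                continue
            for suffix in ("ing", "ed", "s"):       # step 7: naive suffix stripping
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    word = word[:-len(suffix)]
                    break
            tokens.append(word)
    return tokens

print(preprocess("Hewlett-Packard is developing new retrieval systems."))
# -> ['hewlett', 'packard', 'develop', 'new', 'retrieval', 'system']
```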

4.2 Document indexing and query expansion

All the documents are indexed after preprocessing the whole corpus; we used the Lucene framework to create an index over each document. Queries are preprocessed in the same way as documents and are then parsed to find phrases for similar concepts. Instead of using any explicit method, we used knowledge sources to deduce fitting terms for QE in various cases. These knowledge sources are closely related to the data corpora used in this paper and can be categorized as ontological assets and automatically produced knowledge sources. In this work, our primary focus is to determine the best word associations for query expansion, which are query-specific. These word associations are as follows:

  • Lexical analysis: In this work, we find variants of each query term. These variants may be singular/plural forms, orthographic or spelling variations, and morphological variants.

  • Semantic analysis: When we perform lexical analysis, the main problem is with synonyms. Typically, synonyms are lexically distinct but similar in meaning, which is known as semantic similarity.

  • Ontological analysis: We used ontological resources such as the WordNet API to find relationships between query terms and expansion terms. The ontological analysis provides specialized or generalized associations between terms, such as has-a or is-a-part-of relationships. WordNet parsing is therefore applied to find all possible synsets of candidate expansion terms (see the sketch after this list).

  • Co-occurrence relationship analysis: We performed co-occurrence analysis on each corpus, following Gupta et al. [23]. When two or more terms appear together in the corpus many times, they are treated as co-occurring terms; for example, if an expansion term appears with a query term many times, it can be considered a co-occurring term for that query. Two types of co-occurrence relationships are used in this work: PRF and statistically regular n-grams extracted from the corpus.
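
The following sketch illustrates how the semantic and ontological analyses above can be realized with the NLTK interface to WordNet. This work accessed WordNet through its API on the Java side, so the function below is an illustrative stand-in rather than the actual implementation.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def wordnet_expansions(term, limit=10):
    """Collect synonyms (semantic analysis) plus hypernym/hyponym
    lemmas (ontological analysis) for a single query term."""
    expansions = set()
    for synset in wn.synsets(term):
        # Lexically distinct lemmas with similar meaning (synonyms).
        expansions.update(lemma.name() for lemma in synset.lemmas())
        # Generalized (is-a) and specialized relations from the ontology.
        for related in synset.hypernyms() + synset.hyponyms():
            expansions.update(lemma.name() for lemma in related.lemmas())
    expansions.discard(term)
    return sorted(expansions)[:limit]

print(wordnet_expansions("car"))  # e.g. 'auto', 'automobile', 'cab', ...
```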

4.3 Query normalization based on concepts

In this work, we use the Okapi-BM25 model (Robertson et al. [49, 50]; Gupta et al. [28]) for ranking the documents. This widely used model ranks documents for a query based on their similarity. However, it has a few issues: Okapi-BM25 is not able to differentiate between query terms and concept terms. The other issue is corpus-related: the documents and queries in the TREC-3 dataset usually contain more than one concept, and Okapi-BM25 fails to discriminate between them. The length of the documents also plays an important role here: a short document covering a single concept is given a higher rank than a longer document that covers the same concept along with others.

Zhou et al. [69] presented a solution to the issues mentioned above using a conceptual IR model. Stokes et al. [61, 62] proposed a newer and more effective solution, concept-based query normalization, which we use in this work. To solve the former issue, we categorize query terms into two categories: concept terms (tc) and non-concept terms (tn). A query term is a non-concept term if it has no entry in any knowledge source; otherwise, it is a concept term. The similarity between a document d and a given query q containing both concept and non-concept terms can then be determined as follows [62]:

$$ sim_{score}(q,d) = nsim_{score}(q,d) + csim_{score}(q,d) $$
(1)

where nsimscore(q, d) represents the similarity score for the non-concept query terms and csimscore(q, d) the similarity score for the concept terms. These similarity scores are defined mathematically as:

$$ nsim_{score} = \sum_{t \in Q_n} sim_t(q,d) $$
(2)
$$ nsim_{score} = \sum_{t \in Q_n} r_{p,t} \cdot w_t \cdot r_{q,t} $$
(3)

where Qn is the collection of all non-concept terms in query q, and

$$ r_{p,t} = \frac{(k_1 + 1)\, f_{p,t}}{k_1 \left[ (1-b) + b \cdot \frac{W_p}{avg\,W_p} \right] + f_{p,t}} $$
(4)
$$ w_t = \log \frac{N - f_t + 0.5}{f_t + 0.5} $$
(5)
$$ r_{q,t} = \frac{(k_3 + 1)\, f_{q,t}}{k_3 + f_{q,t}} $$
(6)

where k1 and b are constants set to 1.2 and 0.75, respectively, and k3 is set to 7 in this work after an empirical study. fp,t and fq,t denote the frequency of term t in document p and in query q, respectively. Wp indicates the document length, avg Wp the mean document length over the whole collection, N the total number of documents in the collection, and ft the number of documents containing term t. The similarity score for concept terms can be defined as:

$$ csim(q,d) = \sum_{C \in Q_c} sim_C(q,d) $$
(7)
$$ csim(q,d) = \sum_{C \in Q_c} Norm\left( sim_{t_{C1}}(q,d), \ldots, sim_{t_{CN}}(q,d) \right) $$
(8)

where Qc is the collection of all concepts in query q, C represents a concept in Qc, and tCi is a concept term belonging to C.
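
A compact sketch of this scoring scheme is given below. The per-term score follows Eqs. (3)-(6) directly; since the Norm(...) operator of Eq. (8) is not spelled out here, averaging a concept's per-term scores is used as one plausible reading.

```python
import math

def bm25_term_score(f_pt, f_qt, W_p, avg_Wp, N, f_t, k1=1.2, b=0.75, k3=7.0):
    """sim_t(q, d) = r_{p,t} * w_t * r_{q,t}, per Eqs. (3)-(6)."""
    r_pt = ((k1 + 1) * f_pt) / (k1 * ((1 - b) + b * W_p / avg_Wp) + f_pt)
    w_t = math.log((N - f_t + 0.5) / (f_t + 0.5))
    r_qt = ((k3 + 1) * f_qt) / (k3 + f_qt)
    return r_pt * w_t * r_qt

def concept_normalized_similarity(non_concept_scores, concept_scores):
    """Eq. (1): plain BM25 scores for non-concept terms plus per-concept
    normalized scores for concept terms.
    `non_concept_scores`: list of sim_t values for terms in Q_n (Eq. 2).
    `concept_scores`: dict mapping each concept C in Q_c to the list of
    sim_t values of its member terms (Eqs. 7-8, with Norm assumed to be
    the mean so that multi-term concepts are not over-counted)."""
    nsim = sum(non_concept_scores)
    csim = sum(sum(s) / len(s) for s in concept_scores.values())
    return nsim + csim
```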

4.4 Pseudo relevant N-gram feedback

After gathering the concepts, we apply the pseudo-relevant N-gram feedback method. Its steps are listed below, followed by a short sketch:

  1. First, we retrieve relevant documents using Okapi-BM25 and select only the top 100 most relevant.

  2. Next, we collect all unique terms (unigrams, bigrams, and trigrams) from these top 100 relevant documents; the tokens were formed using the Lucene framework.

  3. After collecting all N-grams, we compute the suitability score of each term using the method proposed by Gupta et al. [23], a term-reweighting scheme that relies on four IR factors to assess a term's suitability for query expansion. All terms are then arranged in decreasing order of suitability score, and the top 10 terms are selected for query expansion.
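
The sketch below shows the shape of this procedure. The `suitability` argument is a placeholder for the four-factor scoring of Gupta et al. [23], whose exact formula is not reproduced here; the minimum-frequency filter anticipates the pruning described in Section 5.6.

```python
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf_ngram_expansion(ranked_docs, query, suitability, k_docs=100, k_terms=10):
    """Pseudo-relevant N-gram feedback: pool uni-, bi- and trigrams from
    the top-ranked documents, score them, and keep the best candidates.
    `ranked_docs` is a list of token lists already ordered by Okapi-BM25;
    `suitability(gram, query, pool)` is a caller-supplied scoring function."""
    pool = Counter()
    for doc in ranked_docs[:k_docs]:             # step 1: top 100 documents
        for n in (1, 2, 3):                      # step 2: n-gram term pool
            pool.update(ngrams(doc, n))
    candidates = [g for g, c in pool.items() if c >= 2]   # drop rare n-grams
    candidates.sort(key=lambda g: suitability(g, query, pool), reverse=True)
    return candidates[:k_terms]                  # step 3: top 10 terms
```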

4.5 Document context combination

Although selecting QE terms using the method discussed above improves the quality of a query and yields a significant improvement in MAP, a drop in MAP can be observed for some queries in the datasets. Gupta et al. [23] witnessed this negative impact on document retrieval; the reason is the selection of non-relevant documents by PRF. To resolve this problem, we use a linear document context combination method, described below (a sketch follows the list):

  1. We divide the selected top-ranked 100 documents into groups by concept level, i.e., by the number of concepts each document contains.

  2. We then re-compute the score of each document using Eq. (9) and re-rank the documents within each group [62].

$$ S_i = P_i + \alpha \cdot \frac{D_i}{D_{max}} \cdot P_{max} $$
(9)

where Dmax and Pmax are the maximum similarity scores among all top-100 documents. We set α to 0.5 after an empirical study.
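
A minimal sketch of this re-ranking step follows. The precise definitions of Pi and Di come from Stokes et al. [62] and are taken as given here; the function only shows the arithmetic of Eq. (9) applied within each concept-level group.

```python
def rerank_by_context(groups, alpha=0.5):
    """Linear document context combination (Eq. 9).
    `groups` maps a concept level to a list of (doc_id, P_i, D_i) triples
    for the top-100 documents, grouped as in step 1."""
    all_docs = [t for docs in groups.values() for t in docs]
    p_max = max(p for _, p, _ in all_docs)
    d_max = max(d for _, _, d in all_docs)
    reranked = {}
    for level, docs in groups.items():
        scored = [(doc_id, p + alpha * (d / d_max) * p_max)   # Eq. (9)
                  for doc_id, p, d in docs]
        # Step 2: re-rank the documents within each concept-level group.
        reranked[level] = sorted(scored, key=lambda t: t[1], reverse=True)
    return reranked
```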

5 Experimental analysis

As discussed above, CACM and CISI documents are in plain-text format, whereas TREC-3 documents are XML documents. CACM consists of 3204 English text documents and 64 queries. CISI has 1460 English text documents on IR and contains more than 100 queries. TREC-3 has more than eight hundred thousand documents. In this work, MAP, F-measure, and precision-recall plots are used to measure the performance of the proposed method. Average precision is the mean of the precision scores obtained after each relevant document is retrieved, and MAP is the average precision across multiple queries. The evaluation measures are defined as follows (a small implementation sketch is given after the equations):

$$ \mathrm{Precision} = \frac{\text{number of retrieved relevant documents}}{\text{number of retrieved documents}} $$
(10)
$$ \mathrm{Recall} = \frac{\text{number of retrieved relevant documents}}{\text{number of relevant documents}} $$
(11)
$$ \text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
(12)
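
These measures are standard; for completeness, the sketch below computes them for a single query, together with MAP over a set of queries, following the definitions above.

```python
def precision_recall_f1(retrieved, relevant):
    """Eqs. (10)-(12) for one query; inputs are collections of doc IDs."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def mean_average_precision(runs):
    """`runs` is a list of (ranked_retrieved, relevant_set) pairs, one per
    query. Average precision sums the precision at each rank where a
    relevant document appears, divided by the number of relevant documents."""
    aps = []
    for retrieved, relevant in runs:
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        aps.append(precision_sum / len(relevant) if relevant else 0.0)
    return sum(aps) / len(aps) if aps else 0.0
```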

5.1 Comparing baseline approaches

The experiments are divided into two types of evaluation analysis. The first compares baseline performance without adopting the QE approach, including a comparison with the Okapi ranking algorithm and other baseline approaches on all three datasets; in this analysis, we observe the effect of the individual techniques used in the proposed QE approach on the results. The second analysis examines the combined effect of all the different techniques, comparing the results with the recently developed QE approaches of Gupta et al. [25], Khennak et al. [35], and Sharma et al. [53].

The main objective of this experiment is to demonstrate the improvement obtained with the baseline model, which can then serve as the basis for the proposed QE approach. Table 1 shows the comparison of MAP values obtained by Okapi-BM25 and our concept-based normalization baseline model for all three datasets.

Table 1 Comparison of MAP values obtained by the baseline approach and Okapi-BM25

This table gives the overall performance of the models; however, it is also interesting to observe the query-wise analysis for all three datasets. In this work, 50 queries were selected from CACM and CISI, while 50 queries (topics 151-200) were taken from TREC-3 for the query-wise analysis. We compute the F-measure for all queries over the top 50 retrieved documents. Figures 2, 3 and 4 illustrate the results, which clearly show that the baseline model gives higher F-measure values than Okapi-BM25 for all three datasets.

Fig. 2 Comparison of F-measure values obtained by the Baseline approach with Okapi-BM25 for CACM

Fig. 3 Comparison of F-measure values obtained by the Baseline approach with Okapi-BM25 for CISI

Fig. 4 Comparison of F-measure values obtained by the Baseline approach with Okapi-BM25 for TREC-3

Figure 2 depicts that our baseline model achieves better F-measure values than Okapi-BM25 for all 50 selected queries on the CACM dataset. Similar results can be seen in Figs. 3 and 4 for the CISI and TREC-3 datasets.

5.2 Analyze the effect of lexical variation on query expansion

In Section 4, we discussed the various types of term relations used in QE approaches: lexical variation, synonymy, ontologically related words, and co-occurring terms. These relations have a considerable impact on finding terms for query expansion. In this subsection, we consider term equivalence through lexical variation as the relation type; as discussed in the section above, words may be split at breakpoints such as hyphens. In this work, we use WordNet as the lexical database. Table 2 shows the comparison of MAP values obtained by lexical-variation-based QE with the baseline model. We can see a slight improvement in results after using lexical-variation-based query expansion.

Table 2 Comparison of MAP values obtained by the baseline approach and Lexical variation based QE approach

A query-wise analysis is also done to check the effectiveness of lexical-variation-based query expansion. Figures 5, 6 and 7 show the results for the CACM, CISI, and TREC-3 datasets.

Fig. 5 Comparison of Baseline approach with Lexical variation based QE in terms of F-Measure for CACM

Fig. 6 Comparison of Baseline approach with Lexical variation based QE in terms of F-Measure for CISI

Fig. 7 Comparison of Baseline approach with Lexical variation based QE in terms of F-Measure for TREC-3

Figure 5 presents the comparison of the Baseline approach with the lexical-variation-based approach for the CACM dataset. It is clear from this figure that lexical variation has a favorable impact on the performance of the QE method. Similar comparisons are shown in Figs. 6 and 7 for the CISI and TREC-3 datasets, respectively.

5.3 Analyze the effect of synonyms on query expansion

This section discusses the effect of synonyms on query expansion. In this work, we expand terms with their synonyms for QE. The experimental analysis is done as above, both overall and query-wise. Table 3 compares the Mean Average Precision (MAP) values obtained by the two approaches; it shows only minor improvements in MAP, which cannot be considered significant.

Table 3 Comparison of MAP values obtained by the baseline approach and Synonym based QE approach

The comparison of F-measure values for the query-wise analysis is presented in Figs. 8, 9 and 10. Figure 8 shows the comparison for the CACM dataset and depicts that using synonyms in the QE approach improves performance slightly. A similar analysis is made for CISI and TREC-3 in Figs. 9 and 10, respectively; synonyms also enhanced performance on these datasets. Finally, relevant documents are collected using the synonym feature.

Fig. 8 Comparison of Baseline approach with synonym based QE in terms of F-Measure for CACM

Fig. 9 Comparison of Baseline approach with synonym based QE in terms of F-Measure for CISI

Fig. 10 Comparison of Baseline approach with synonym based QE in terms of F-Measure for TREC-3

5.4 Analyze the effect of abbreviation on query expansion

Abbreviation is one type of synonymy and is used very frequently in domain-specific documents such as scientific texts. As discussed in the sections above, there are three ways of handling abbreviations: first, creating a set of long-form/short-form pairs (Schwartz et al. [51]); second, using an external resource; and last, converting all abbreviations to their long forms in the corpus.

We used the last option in this work, converting all abbreviations into their extended forms. Table 4 shows the comparison of MAP values achieved by the various approaches; the query-wise analysis is shown in Figs. 11, 12 and 13.

Table 4 Comparison of MAP values obtained by the baseline approach and abbreviation based QE approach
Fig. 11 Comparison of Baseline approach with Abbreviation based QE in terms of F-Measure for CACM

Fig. 12 Comparison of Baseline approach with Abbreviation based QE in terms of F-Measure for CISI

Fig. 13 Comparison of Baseline approach with Abbreviation based QE in terms of F-Measure for TREC-3

Figure 11 demonstrates the results for the CACM dataset, and it is clear from this figure that using abbreviations for QE decreases F-measure values for almost all 50 queries. Similarly, the performance of QE drops for the other two datasets, CISI and TREC-3, after using abbreviations, as shown in Figs. 12 and 13, respectively.

From the results above, it can be seen that ambiguity becomes an issue when abbreviations are used for query expansion: an abbreviation may have more than one long form. Consequently, many queries suffered from the query-drift problem, and the QE approach could not perform better.

5.5 Analyze the effect of related and co-occurring terms on query expansion

So far, ontological resources have been used for query expansion. In this section, we focus on the use of co-occurring terms. We use a log-likelihood association metric, taking the document length as the window size for a co-occurrence pair: we find the terms that frequently appear together in documents and, finally, add these co-occurring terms to the original query. A sketch of the computation follows.
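
The exact log-likelihood formulation is not spelled out here; the sketch below uses Dunning's log-likelihood ratio over a 2x2 contingency table with a whole document as the co-occurrence window, a common instantiation of this metric.

```python
from math import log

def cooccurrence_llr(docs, term_a, term_b):
    """Association score for two terms via the log-likelihood ratio.
    `docs` is an iterable of token sets; each document is one window."""
    k11 = k12 = k21 = k22 = 0
    for tokens in docs:
        a, b = term_a in tokens, term_b in tokens
        if a and b:
            k11 += 1        # both terms present
        elif a:
            k12 += 1        # only term_a
        elif b:
            k21 += 1        # only term_b
        else:
            k22 += 1        # neither
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22

    def term(k, row, col):
        return k * log(k * n / (row * col)) if k else 0.0

    return 2 * (term(k11, row1, col1) + term(k12, row1, col2)
                + term(k21, row2, col1) + term(k22, row2, col2))
```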

Table 5 presents the MAP values obtained from our expansion experiments using co-occurring terms, compared with the Baseline approach. The table shows a slight improvement in QE performance over the Baseline approach. For the query-wise analysis, we compute F-measure values for each query under the various approaches, as shown in Figs. 14, 15 and 16 for all three datasets.

Table 5 Comparison of MAP values obtained by the baseline approach and Co-occurrence based QE approach
Fig. 14 Comparison of Baseline approach with co-occurrence method in terms of F-Measure for CACM

Fig. 15 Comparison of Baseline approach with co-occurrence method in terms of F-Measure for CISI

Fig. 16 Comparison of Baseline approach with co-occurrence method in terms of F-Measure for TREC-3

Figure 14 clearly shows that using co-occurrence for QE does not have much impact on performance for CACM, as the values are almost the same for most queries. Similar results hold for CISI and TREC-3 in Figs. 15 and 16, respectively.

5.6 Analyze the effect of PRF relevance N-gram on query expansion

This section shows the effect of using the PRF-relevance N-gram method for query expansion. The proposed approach is based on pseudo-relevant N-gram feedback, as discussed in Section 4.4. In this work, we take unigrams, bigrams, and trigrams as co-occurring terms from the datasets. We then remove all stop words from the obtained N-grams, as well as infrequent N-grams occurring fewer than two times. Finally, we determine the top ten associated phrases for the given user query. Table 6 tabulates the MAP values obtained from our expansion experiments using N-gram-PRF-based QE and the Baseline approach; it clearly shows that PRF using unigrams and bigrams improves MAP values significantly in comparison to the others.

Table 6 Comparison of MAP values obtained by the baseline approach and PRF relevance N-gram based QE approach

For the query-wise analysis, we compute F-measure values for each query under the various approaches, as demonstrated in Figs. 17, 18 and 19 for all three datasets. Figure 17 shows some improvement in performance using N-gram-PRF-based QE for the CACM dataset. Figure 18 shows a similar improvement for CISI, and a comparable enhancement can be observed for TREC-3 in Fig. 19.

Fig. 17 Comparison of Baseline approach with N-gram PRF based QE in terms of F-Measure for CACM

Fig. 18 Comparison of Baseline approach with N-gram PRF based QE in terms of F-Measure for CISI

Fig. 19 Comparison of Baseline approach with N-gram PRF based QE in terms of F-Measure for TREC-3

Pseudo-relevance N-gram feedback improves performance because the top-ranked documents contain two or more query concepts, which mutually disambiguate each other. Hence, it improves the likelihood of selecting good feedback terms for query expansion.

5.7 Analyze the combined effect of all techniques on query expansion

This section analyzes the combined effect of all the techniques discussed above on query expansion. We include the lexical variation, synonym, co-occurrence, and PRF-relevance N-gram techniques in the combined QE approach; abbreviation is not included, as it degrades performance. Table 7 tabulates the MAP values obtained by our combined-techniques-based QE approach, compared with the Baseline method and the recently developed QE approaches of Gupta et al. [25], Khennak et al. [35], and Sharma et al. [53]. Table 7 shows that our combined approach obtains higher MAP values than the rest of the approaches for the CACM and CISI datasets. For TREC-3, the proposed approach performs better than the Baseline method, Khennak et al. [35], and Sharma et al. [53], but it lags behind the approach of Gupta et al. [25]: our method's MAP value is 1.71% lower.

Table 7 Comparison of MAP values obtained by various approaches

The proposed combined-techniques-based QE improves results compared with the other approaches because it takes advantage of each constituent method. The abbreviation method decreases performance and is therefore not included in the combination.

Query-wise analysis and comparison are also performed to check the performance of all QE approaches. Figures 20, 21 and 22 present the comparison of F-measure values for the three datasets. Figure 20 presents the results for the CACM dataset. It shows that our baseline method is inferior to the approach of Khennak et al. [35]; however, the combined-techniques-based QE outperforms both Khennak et al. [35] and Sharma et al. [53] for 44 queries, while for the remaining six queries the results are the same. This figure also depicts the superiority of our combined-technique-based QE over the approach of Gupta et al. [25]. Figure 21 presents the comparison on the CISI dataset, where the combined-techniques-based QE obtains higher F-measure values than the approaches of Khennak et al. [35], Sharma et al. [53], and Gupta et al. [25]. Similarly, Fig. 22 shows the outcomes for the TREC-3 dataset: our combined approach outperforms the propositions of Khennak et al. [35] and Sharma et al. [53] but lags behind the approach of Gupta et al. [25]. The combined method achieved MAP values of 0.287 on CACM, 0.2546 on CISI, and 0.3027 on TREC-3; compared with Gupta et al. [25], this is an improvement of 0.31% on CACM and 1.19% on CISI, and a reduction of 1.71% on TREC-3. Our proposed method works well on CACM and CISI because these datasets contain fewer documents; TREC-3 contains a large number of documents, where the method is less effective.

Fig. 20 Comparison of combined techniques based QE with baseline method, Khennak et al. [35], Sharma et al. [53] and Gupta et al. [25] in terms of F-Measure for CACM

Fig. 21 Comparison of combined techniques based QE with baseline method, Khennak et al. [35], Sharma et al. [53] and Gupta et al. [25] in terms of F-Measure for CISI

Fig. 22 Comparison of combined techniques based QE with baseline method, Khennak et al. [35], Sharma et al. [53] and Gupta et al. [25] in terms of F-Measure for TREC-3

We present precision-recall curves to compare the overall performance of the different approaches, as shown in Figs. 23, 24 and 25 for the three datasets. Figure 23 shows the improvement in QE performance using the combined effect of all techniques, compared with the Baseline method and the approaches of Khennak et al. [35], Sharma et al. [53], and Gupta et al. [25] on the CACM dataset. Figure 24 presents the comparison for the CISI dataset and clearly shows the improvement in QE performance. Similarly, Fig. 25 shows the results of using the combined effect of all techniques compared with the Baseline method on the TREC-3 dataset; it also depicts that our proposed approach enhances the overall performance of QE compared with the approaches of Khennak et al. [35] and Sharma et al. [53], but lags behind the proposal of Gupta et al. [25].

Fig. 23 Comparison of Precision-Recall curves for various proposals on CACM dataset

Fig. 24 Comparison of Precision-Recall curves for various proposals on CISI dataset

Fig. 25 Comparison of Precision-Recall curves for various proposals on TREC-3 dataset

The above results show that our proposed combined-techniques-based QE approach improves performance significantly for small datasets like CACM and CISI; for large datasets such as TREC-3, the improvement is smaller. As noted above, the enhancement stems from combining the advantages of each constituent method, with the performance-degrading abbreviation method excluded.

The combined approach could also be used for text categorization. It can further be applied to query expansion in image retrieval, Cross-Language Information Retrieval (CLIR), multimedia retrieval, mobile search, information filtering, and the medical field, as well as in other applications such as blog search and recommendation systems.

6 Conclusion

This paper is a step toward exploring the various conditions essential for obtaining an optimal query with a QE technique. Our proposed technique uses knowledge sources and other IR parameters. AQE-based methods can also be applied with neural networks and in the bio and health domains, and our combined approach can be utilized in the medical field. The proposed work tested the effect of sources of term expansion, pseudo-relevance feedback, and ontologies on QE techniques. This work also leads to a new QE framework, i.e., combined-techniques-based QE, for improving IR performance. The combined model was evaluated and verified on three standard datasets (CACM, CISI, and TREC-3) and compared with recently proposed approaches in the same field, namely the QE approaches of Khennak et al. [35], Sharma et al. [53], and Gupta et al. [25]. The current work concludes that the proposed combined-techniques-based QE approach enhances IR performance significantly for small datasets like CACM and CISI; for large datasets like TREC-3, the improvement is not significant, so future work should consider other large datasets as well. Our proposed method does not consider abbreviations for expanding the query, and this remains an additional feature to investigate in the future. Finally, the presented work does not consider external environmental factors such as the sentiments or mood of the user when retrieving documents; such factors can serve as a basis for future research.