1 Introduction

Conventional search engines such as Bing, Baidu, Google, and others play a very important role in providing useful information to end users in today's information age. There is, however, one major limitation to these search engines: a lot of resources are not available for them to access due to proprietary or commercial reasons [1]. Many digital libraries and news blogs are in this category. For example, the ACM and IEEE digital libraries hold scientific publications on computing and electrical and electronics engineering, respectively, while Bloomberg is an authoritative financial news blog. Full access to them is granted only to legitimate users with login credentials. Some of these information resources, scattered across the web, are either ignored or unknown to many web users. Targeting these information sources, federated search makes them accessible to more web users [2, 3]. This is achieved by providing a single search interface that simultaneously forwards a user query to multiple independent resources and merges their returned result lists into a single list for end users [4]. On the real-world web, a site like Priceline is federated in nature, with hundreds of available resources in the back-end.

In federated search, there are three major research problems. First, "resource description" concerns the contents of each resource and other information such as the size of the collection and the overlap rates between two or more collections. Second, "resource selection" concerns how to select a group of the most useful resources for a given user query. Finally, "result merging" is about how to fuse the result lists returned from the multiple resources. These three problems are interrelated: both resource selection and result merging need information about all component resources (resource description) to make proper decisions.

As shown in Fig. 1, the typical processes involved in a federated search system are as follows (a minimal sketch of this broker loop is given after the list):

  • The user inputs her query on her machine via a federated search interface for end users.

  • The query is transferred to the central broker.

  • The broker selects a group of relevant resources and forwards the query to those selected.

  • Each selected resource does the retrieval and returns a list of documents to the broker.

  • The broker merges the results coming from the multiple resources into a single list and sends it to the user machine.

  • The result list is displayed on the user’s machine.
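
To make these steps concrete, the sketch below shows a minimal broker loop in Python under assumed interfaces; `Resource`, `select_resources`, and `merge` are illustrative placeholders rather than a standard federated search API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Resource:
    name: str
    search: Callable[[str], List[Tuple[str, float]]]  # query -> (doc_id, score) list

def broker(query: str, resources: List[Resource],
           select_resources, merge, k: int = 5):
    # 1) Resource selection: pick the k most promising resources.
    selected = select_resources(query, resources, k)
    # 2) Forward the query; each selected resource retrieves independently.
    result_lists: Dict[str, List[Tuple[str, float]]] = {
        r.name: r.search(query) for r in selected
    }
    # 3) Result merging: fuse the per-resource lists into a single ranking.
    return merge(query, result_lists)
```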

Fig. 1 Processes involved in a federated search system

Research on federated search has been conducted in two different scenarios: cooperative and uncooperative environments. In a cooperative environment, resources agree to share some vital information about their contents and corpus statistics with the broker. This is the case in some contexts such as enterprise search. On the web, by contrast, many resources are independent and autonomous, and treat the broker as an ordinary user; it may not be possible for such resources to share any extra information with the broker.

STARTS [5] is among the early studies that proposed protocols for a cooperative environment. These protocols define the information and modes of communicating it between all the resources involved and the broker. Similarly, GlOSS [6] is also an early work in the literature that proposed a methodology for identifying the most relevant resources to search for a given user query based on the relevant documents within the resources.

On the other hand, in an uncooperative environment, a key issue is how to obtain accurate information about all the resources. The general idea is to obtain it through the communication channel as an ordinary user. Query-based sampling [7] and some variants [8, 9] have been proposed in the literature. In this way, the broker can still learn some useful information about the resources and then utilize it for the resource selection and result merging tasks.

Over the last three decades, considerable progress on federated search has been made. However, to our knowledge, there is only one literature review paper on federated search so far [10]. Although it is very good and comprehensive, it was published over a decade ago. Therefore, we think a new review paper is desirable. The purpose of this paper is to provide a general picture of the major work in federated search over the years, with more attention to recent research work and related activities (such as workshops and evaluation events).

The remaining parts of this paper are organized as follows: Sect. 2 describes the survey methodology and the selected papers. Sections 3, 4, and 5 explain the methods proposed for resource representation, resource selection, and result merging, respectively, along with their benefits and drawbacks. Section 6 discusses some federated search systems and research prototypes. Section 7 discusses the data sets used for performance evaluation of federated search systems and/or their components. Section 8 presents some related research issues, including retrieval evaluation, aggregated search, metasearch, and personalizing federated search. Finally, Sect. 9 concludes the paper with some future research directions.

2 Survey methodology and selected papers

2.1 Selection criteria

Shokouhi and Si’s review paper [10] was published in 2011, and all the papers it reviewed were published in 2010 or earlier. Therefore, we try to include all relevant papers published since 2011. For papers published before 2011, we do not try to include all of them, but instead select representative papers based on their quality and the problems they address.

We used DBLP and Google Scholar for the search, with “distributed information retrieval,” “federated search,” and “federated retrieval” as keyword queries. We also checked the reference lists of some selected papers to find more relevant papers. All downloaded papers were manually checked for relevance before final selection.

However, there are a few exceptions as follows:

  1. For reports submitted to the 2013 and 2014 TREC FedWeb tracks, only those of the three best-performing runs per year are included.

  2. For both topics “aggregated search” and “metasearch,” only some representative papers are included.

2.2 Selected papers

Through the selection process, we identified approximately 122 articles that satisfied our criteria. Among these, resource selection emerged as the most frequently published topic, with over 55 directly related articles. Conversely, security issues had the fewest articles, with only four published pieces. The number of articles addressing each of the other interconnected federated search problems ranged from 7 to 15.

2.3 Discussion

Despite the fact that conventional search engines are the primary tools used by most web users to locate information on the web, a substantial portion of the web’s content is not accessible through these search engines. For instance, Google reported discovering over 30 trillion URLs in 2012, but a nine-year study covering 2006 to 2015, presented in [11], indicates that the total size of the Google index was only 45.7 billion documents as of January 2015. Various studies, including those reported in [12] and [13], demonstrate that relying solely on search engines causes web users to miss out on a significant number of relevant documents that are exclusively available from specialized information sources. Federated search provides a solution to this issue by targeting these information sources and linking them directly to web users through a single interface. As a result, web users are able to search multiple independent resources through a single interface, rather than having to search them individually.

To ensure optimal performance of federated search systems, it is crucial for the broker to forward the user query to the most relevant resources and merge the results based on their relevance to the query. As such, researchers have identified three interrelated problems that must be addressed for federated search to function properly: resource representation, resource selection, and result merging.

3 Resource representation

For the broker to function properly, it needs a considerable amount of information about every resource in order to perform the resource selection and result merging tasks. There are two typical scenarios. One is that all the resources are cooperative and willing to provide comprehensive information when required by the broker; a special channel between the two parties can then be set up for this purpose. The other is that some or all of the resources are independent and uncooperative. In such a situation, the broker has to collect useful information through the resource’s ordinary communication channel for end users. In the following, we discuss these two scenarios one by one.

3.1 Cooperative environment

In a cooperative environment, the resources agree to exchange the information required by the broker to perform searching and merging accurately via an established protocol. Enterprise search is an example of federated search that works in a cooperative environment, since both the resources and the interface are owned and maintained by the same entity. The resources therefore provide the broker with details of their metadata, such as document frequencies, lists of stop words, the number of terms in each document, and the total number of terms in the collection as a whole [14]. Cooperative resource discovery [15] is another method proposed for the cooperative environment, in which each resource provides the broker with the number of terms and the resources in which those terms appear. However, for resources with diverse content, different sets of metadata are required by the broker to function effectively. For this reason, [16] considered previous query logs as metadata to enhance vertical selection. There is, however, a drawback to the cooperative approach: it may not be workable in a real-world web environment, where most resources are owned by different entities.

3.2 Uncooperative environment

In contrast, in an uncooperative environment, which is the typical situation on the Web, no standardization is in place for resources to provide detailed information about their corpus statistics to the broker [10]. As a result, the widely used query-based sampling (QBS) strategy [7] is used to sample a sufficient number of documents from each resource and index them in what is referred to as a centralized sample database (CSD). In QBS, a single-term query chosen from either a reference dictionary or the resource’s search interface is issued to the resource. The top n documents returned are downloaded and indexed in the CSD. The next query is selected from the terms of the sampled documents. The sampling process continues until a stopping criterion is reached, commonly around 300 distinct documents. The majority of the uncooperative-environment research proposed in the last three decades has used this method to obtain representative documents [10, 17]. The CSD is used for both resource selection and result merging. Limiting the number of sampled documents to 300 has the drawback of oversampling resources with small content while under-sampling those with large content.
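
The following is a minimal sketch of QBS under the description above; the `search_resource(term)` interface, the seed dictionary, and the query budget are assumptions of the sketch, while the ~300 distinct-document stopping criterion follows the text.

```python
import random

def query_based_sampling(search_resource, seed_terms,
                         n_top=4, max_docs=300, max_queries=1000):
    """Minimal sketch of query-based sampling (QBS) [7].

    search_resource(term) -> list of document texts is an assumed interface.
    """
    sampled = set()                  # distinct documents drawn from the resource
    vocabulary = list(seed_terms)    # first probes come from a reference dictionary
    for _ in range(max_queries):     # query budget guards against dead ends
        if len(sampled) >= max_docs or not vocabulary:
            break
        term = random.choice(vocabulary)           # single-term probe query
        for doc in search_resource(term)[:n_top]:  # download the top-n results
            if doc not in sampled:
                sampled.add(doc)
                vocabulary.extend(doc.split())     # later queries come from samples
    return sampled
```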

Table 1 Summary of some selected published studies for the resource description and corpus size estimation

3.3 Estimation of resource size

Estimating resource sizes at certain intervals yields information about freshness, quality of content, and which resources have the most diverse contents [22]. Further, the size of a resource is one of the factors used in determining the most relevant resources to search in most of the proposed resource selection algorithms [23, 24]. However, in an uncooperative environment, the resources’ corpus sizes are not available to the broker. For this reason, various methodologies have been proposed. Consider the query pool method [18] as an example. This method estimates the size of a resource by randomly selecting a term from a dictionary, issuing it as a query to the resource, and then downloading all the matched documents into an index. Afterward, the terms with the highest document frequency are extracted to form the query pool. Next, the terms in the query pool are issued one by one as queries to the downloaded index, and all the documents with distinct ids are harvested. Similarly, the sample–resample method proposed in [24] estimates the size of a resource by selecting a single-term query from the centralized sample database and issuing it to the resource. The next query is selected from the documents downloaded for the previous query. This process continues until predefined criteria are met. The resource size is estimated using the following equation:

$$\begin{aligned} R_\textrm{Size}=\frac{R_\textrm{dfqi}\times R_\textrm{sample}}{R_\textrm{dqisample}} \end{aligned}$$
(1)

where \(R_\textrm{dfqi}\) is the number of documents from resource R that contain query \(q_i\), \(R_\textrm{sample}\) is the number of documents sampled from resource R, and \(R_\textrm{dqisample}\) is the number of documents sampled from R that contain \(q_i\).
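
Equation 1 is straightforward to compute once the sampling statistics are in hand; the sketch below includes a worked example with invented numbers for illustration.

```python
def sample_resample_size(df_q_resource, n_sampled, df_q_sample):
    """Estimate a resource's size via sample-resample (Eq. 1) [24].

    df_q_resource: documents in resource R matching query qi (resample step)
    n_sampled:     documents sampled from R into the CSD
    df_q_sample:   sampled documents from R matching qi
    """
    return df_q_resource * n_sampled / df_q_sample

# Example (invented numbers): 120 matches in R, 300 sampled documents,
# 9 matches among the samples -> estimated size of 4000 documents.
print(sample_resample_size(120, 300, 9))  # 4000.0
```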

Other methods proposed in the literature include: random sampling [20], uncorrelated terms query [21], and the capture–recapture method [19].

However, most of the aforementioned methods [9, 19, 24] are based on random samples. Nguyen et al. [25] argued that approaches based on random samples in most cases contain noisy data. Therefore, they proposed the reference corpus method of resource size estimation, which uses the ClueWeb09 dataset as the reference corpus. For a given query set Q, Eq. 2 estimates the resource size:

$$\begin{aligned} S_{\text {size}} = \frac{1}{|Q|}\sum _{q \in Q}\frac{R_s}{\text {df}_{q}} \times |\textrm{ClueWeb}_\textrm{size}| \end{aligned}$$
(2)

where \(\vert Q \vert \) is the number of queries, \(R_s\) is the number of documents resource R returns for a particular query q, \(\text {df}_q\) is the ClueWeb document frequency for query q, and \(\vert \textrm{ClueWeb}_\textrm{size} \vert \) is the total size of the ClueWeb collection.
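
A compact sketch of Eq. 2 follows; the per-query counts and ClueWeb statistics are assumed to have been collected beforehand, and the data layout is illustrative.

```python
def reference_corpus_size(results_per_query, clueweb_df, clueweb_size):
    """Reference-corpus size estimation (Eq. 2) [25].

    results_per_query: {query: number of documents resource R returns (R_s)}
    clueweb_df:        {query: document frequency df_q in ClueWeb09}
    clueweb_size:      total number of documents in the ClueWeb09 collection
    """
    ratios = [results_per_query[q] / clueweb_df[q] for q in results_per_query]
    return sum(ratios) / len(ratios) * clueweb_size
```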

A summary of some selected studies on resource description and corpus size estimation is presented in Table 1.

3.4 Estimation of resource overlapping rates

Resource overlap rates refer to the extent to which two or more resources share the same or similar documents. In a federated environment, it is essential to estimate the extent to which the contents of the resources overlap, because searching different resources that return similar documents not only wastes the search user’s time but also degrades search effectiveness [26]. As such, Bernstein et al. [27] proposed a hash-vector-based method that detects and discards similar and near-similar documents from the merged result list in a cooperative environment. For an uncooperative environment, Shokouhi and Zobel [26] used the sampled documents in the centralized sample index to estimate the overlap rate between two resources. That is, the number of similar documents within two different resources can be estimated using the following equation:

$$\begin{aligned} K = \frac{|R_1||R_2 |\times D}{|{\textrm{Sr}}_1||{\textrm{Sr}}_2 |} \end{aligned}$$
(3)

where K is the estimated number of similar documents between resources \(R_1\) and \(R_2\), \({\textrm{Sr}}_1\) and \({\textrm{Sr}}_2\) are the documents sampled from \(R_1\) and \(R_2\), and D is the number of similar documents observed between the two samples.
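
In code, Eq. 3 is a single expression; all inputs are assumed to come from the sampling stage.

```python
def estimated_duplicates(size_r1, size_r2, dups_in_samples,
                         n_sampled_r1, n_sampled_r2):
    """Estimate the number of duplicate documents shared by two
    resources from their samples (Eq. 3) [26]."""
    return size_r1 * size_r2 * dups_in_samples / (n_sampled_r1 * n_sampled_r2)
```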

4 Resource selection

In federated search, it is not good policy for the broker to forward the received query to all participating resources, as some may not be relevant to the given query. As such, resource selection is necessary for the broker to select only those resources with a high probability of returning relevant documents. This section reviews and categorizes well-known resource selection methods proposed in the literature.

Table 2 Summary of some selected published studies for the resource selection problem (records from 1 to 9)
Table 3 Summary of some selected published studies for the resource selection problem (records from 10 to 16)
Table 4 Summary of some selected published studies for the resource selection problem (records from 17 to 22)

4.1 Heuristic methods

The heuristic methods rely on lexicon statistics either obtained from or provided by the resources. Most of the early studies consider each resource as a big document. That is, the boundaries of the documents in each information resource are collapsed to form a single big document that contains only a bag of words. Upon receiving the user’s query, the broker computes the query’s similarity with the lexicon statistics of each information resource and ranks the resources by their relevance scores. The big document approaches include CORI [28], GlOSS [6], and CVV [15]. The studies in [24, 44] reported that CORI [28] is the most effective and straightforward resource selection method in the literature; it uses a Bayesian inference network to calculate the relevance score of each information resource for a given query. However, the limitation of the big document approach is that, by removing the boundaries between separate documents, the relevance of individual documents cannot be ascertained; only the resource’s overall relevance to the given query can be estimated.

On the other hand, the models proposed by [23, 24, 45, 46] move away from collapsing the document boundaries as in the big document approach. Rather, they consider each resource as a collection of documents, and the relevance of a resource is estimated based on the relevance of its constituent documents. CRCS [23] is a resource selection algorithm based on this small document approach. In CRCS, the broker issues the user query to the CSD. The number of documents and their ranking positions in the top k of the generated ranking are used to determine the relevance of a resource to the given query. The relevance score of resource \(R_i\) is computed using Eq. 4:

$$\begin{aligned} S\left( R_{i}\right) = \frac{S_{i}}{S_{\max }\times S_{s}} \times \sum _{d\in S_{s}}^{}S(d) \end{aligned}$$
(4)

where \(S_i\) is the estimated size of resource i, \(S_\textrm{max}\) is the estimated size of the largest resource, \(S_s\) is the number of documents sampled from resource i during the sampling phase, and S(d) is the contribution of document d to the weight of the resource that returned it. The S(d) value is computed either linearly or exponentially, as shown in Eqs. 5 and 6.

$$\begin{aligned} S\left( d \right) = \left\{ \begin{array}{ll} k - l, &{} \text {if } l < k \\ 0, &{} \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(5)
$$\begin{aligned} S\left( d \right) = \beta \exp ( - \gamma \cdot l) \end{aligned}$$
(6)

where k is the number of top documents in the CSD ranked list that are considered, set to 50; l is the rank of document d in the CSD ranked list; and \(\beta \) and \(\gamma \) are constant parameters whose values are set to 1.2 and 0.28, respectively. ReDDE [24] is considered the most common resource selection algorithm based on the small-document approach [47]. In ReDDE, the relevance of a resource to a given query is estimated based on the number of documents that the resource has in the top k results when the query is run on the CSD.
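
The sketch below implements the CRCS scoring of Eqs. 4–6 under the parameter values stated above; representing the CSD ranking as a rank-ordered list of source-resource ids (with l counted from 0) is an assumption of this sketch.

```python
import math

def crcs_scores(csd_ranking, sizes, sample_counts,
                k=50, beta=1.2, gamma=0.28, exponential=False):
    """CRCS resource scoring (Eqs. 4-6) [23].

    csd_ranking:   resource id of each document in the CSD result list,
                   in rank order (rank l counted from 0 here)
    sizes:         {resource: estimated size S_i}
    sample_counts: {resource: number of sampled documents S_s}
    """
    s_max = max(sizes.values())
    contrib = {}
    for l, resource in enumerate(csd_ranking[:k]):
        # Eq. 5 (linear) or Eq. 6 (exponential) document contribution
        s_d = beta * math.exp(-gamma * l) if exponential else (k - l)
        contrib[resource] = contrib.get(resource, 0.0) + s_d
    # Eq. 4: scale by estimated size, normalize by sample count
    return {r: sizes[r] / (s_max * sample_counts[r]) * c
            for r, c in contrib.items()}
```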

The Text Retrieval Conference (TREC) built large collections of documents gathered from real-world search engines in order to facilitate research on federated search with datasets that resemble real-world federated environments. Numerous approaches were proposed for the resource selection task in both the 2013 and 2014 TREC FedWeb tracks [48, 49].

In the approach proposed by [31], a term-weighted frequency scheme was used to select the relevant resources for the given queries. Their approach considered each search engine as a collection of document descriptors (e.g., terms), and the relevance score of a search engine (resource) for a given query is obtained based on the number of documents and the number of query terms appearing in those documents. The approach proposed in [50] ranked the resources based on their relevance as well as the opinion expressed toward the given query. Furthermore, [51] used the Google search API to compute the relevance scores of the resources. The search engine impact factor (SEIF) method was proposed in [34]. In this method, the sources are ranked based on their popularity or market share, on the assumption that the most popular search engines (Google, Bing, Baidu, etc.) contain more relevant documents than less popular ones. Although this method is independent of the user query, it was the best-performing method in the 2014 TREC FedWeb track [48]. In the model proposed in [35], all the documents of each search engine (resource) provided in the dataset are concatenated into a single big document. Then, the topic models of each resource and of the given queries are obtained using latent Dirichlet allocation (LDA), and the resources are ranked by the number of topics they share with each given query. The methodology proposed in [32] employs the Tally statistical method [52] for resource selection. According to their submission, keeping representative documents in the CSD is expensive, and it is easier to preserve term-related features instead. Thus, they extracted each resource’s term features and computed the relevance scores of the sources based on these features. Recently, Urak et al. [41] argued that using the SEIF model [34] to select the relevant resources would repeatedly choose the same resources, since the search engine market is dominated by giants such as Google, Baidu, and Bing. Based on this observation, they proposed a method that includes long-tail resources among those selected for a given query. With long-tail resources, the user who issued the query can explore documents from smaller relevant resources. Eq. 7 is used to select the final resources to search for the given query q.

$$\begin{aligned} S_{f}\left( q,\ s \right) = \left( 1 - \delta \right) S_{\text {best}}\left( q,\ s \right) + \ \delta S_{\text {tail}}(q,\ s) \end{aligned}$$
(7)

where \(\delta \) is a control parameter ranging between 0 and 1, used to ensure that the final selected resources are balanced between the best-performing and long-tail resources.
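
Equation 7 is a simple linear interpolation between the two scores; a minimal sketch:

```python
def final_resource_score(best_score, tail_score, delta=0.5):
    """Blend the best-performing and long-tail scores (Eq. 7) [41];
    delta in [0, 1] controls how much weight the long tail receives."""
    return (1 - delta) * best_score + delta * tail_score
```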

4.2 Machine learning-based methods

All of the above-mentioned resource selection methodologies use traditional document–query similarity in selecting the relevant resources to search. Machine learning techniques have recently proven to be a viable alternative to traditional methods of computing document–query similarity. To this effect, various machine learning methods have been proposed for resource selection in the literature. Arguello et al. [29] extracted three types of features, namely collection features, query topic features, and click-through features, and trained a classifier for resource selection. A joint probabilistic model that estimates a source’s relevance based on its similarity with the already selected resources was proposed by [30]. Xu and Li [53] postulated that using more features can improve the performance of collection selection algorithms. As such, they proposed a method that used two separate sets of features, query-dependent and query-independent, and then combined them to form the query–collection feature vector. SVMrank [54] was used to learn a ranking function over all the resources. Similarly, in [38], three different sets of features (query-independent, term-based, and sampled-document features) were used for resource ranking in selective search. Wu et al. [42] proposed the LTRRS algorithm, which combined all the features proposed in [38] in addition to the topic relevance feature introduced in their paper. They used LambdaMART [55] to train the function for ranking the resources.
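
As an illustration of this learning-to-rank formulation, the sketch below trains a LambdaMART-style ranker on synthetic query–resource feature vectors. The features and data are invented, and LightGBM’s `LGBMRanker` stands in for the SVMrank and LambdaMART implementations used in the cited papers.

```python
import numpy as np
import lightgbm as lgb

# Synthetic toy data: 2 queries with 4 candidate resources each and 3
# features per (query, resource) pair (e.g., CSD hit count, estimated
# size, term overlap -- illustrative stand-ins, not the exact feature
# sets of the cited papers).
rng = np.random.default_rng(0)
X = rng.random((8, 3))
y = rng.integers(0, 3, size=8)   # graded resource-relevance labels
group = [4, 4]                   # rows per query, in order

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50,
                        min_child_samples=1)
ranker.fit(X, y, group=group)
print(ranker.predict(X[:4]))     # scores used to rank query 1's resources
```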

As previously stated, the broker must search the CSD for each query received in order to determine the most relevant resources for that query, but this process is repetitive and bandwidth-consuming [46]. Therefore, Garba et al. [43] recently proposed an embedding-based model for resource selection that utilizes past queries. In particular, for each current user query received by the broker, its similarity with the past queries residing in the query log is obtained. Then, the resources selected for past queries that are similar to the current query are reselected for search. Specifically, let \(S_k=\{s_1,s_2,...,s_m\}\) be the set of resources with indexed documents in the CSD, and let \(Q_p=\{q_1,q_2,...,q_n\}\) be the set of past queries. The similarity between the current query and each past query stored in the query log, sim(q, \(q_l\)), is estimated by computing the cosine similarity of their term vectors using a word embedding technique, as shown in Eq. 8:

$$\begin{aligned} \text {sim}\left( q, q_l\right) = \frac{V_q \cdot V_{q_l}}{\left\| V_q \right\| \left\| V_{q_l} \right\| } \end{aligned}$$
(8)

where \(V_q\) is the term vector of the current query and \(V_{q_l}\) is the term vector of the past query. In their paper, the current and past queries are considered similar if their sim(q, \(q_l\)) score is greater than or equal to 0.65. Finally, the relevance of the current query to resource \(s_k\) is estimated using Eq. 9:

$$\begin{aligned} \text {rel}\left( q, s_{k} \right) = \sum _{l = 1}^{n}\text {rel}\left( s_{k} \vert q_{l} \right) \text {sim}(q, q_{l}) \end{aligned}$$
(9)

where rel(\(s_k\) \(\vert \) \(q_l\)) is the relevance score of resource \(s_k\) given the past query \(q_l\), obtained using the ReDDE algorithm, and sim(q, \(q_l\)) is the similarity score between the current and past queries. Zhu et al. [56] used k-means and latent semantic indexing (LSI) for resource selection. In their approach, the content of each resource is partitioned into a number of clusters using the k-means clustering algorithm. The semantic structure of each cluster is then captured using LSI, which measures the relationships between clusters and estimates each cluster’s relevance to the given query.
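
A sketch of Eqs. 8 and 9 combined is given below. The query-log layout and the precomputed embedding vectors are assumptions of the sketch, while the 0.65 threshold and the ReDDE-based per-resource scores follow the paper.

```python
import numpy as np

def cosine(u, v):
    # Eq. 8: cosine similarity between query embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relevance_from_log(q_vec, query_log, threshold=0.65):
    """Eq. 9: score resources by reusing past selections [43].

    query_log: list of (past_query_vector, {resource: ReDDE score}) pairs;
    the 0.65 threshold follows the paper, the data layout is assumed.
    """
    scores = {}
    for ql_vec, resource_scores in query_log:
        sim = cosine(q_vec, ql_vec)
        if sim >= threshold:                      # only sufficiently similar logs
            for resource, rel in resource_scores.items():
                scores[resource] = scores.get(resource, 0.0) + rel * sim
    return scores
```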

Other recent approaches in the literature were proposed by [57, 58]. Calì and Straccia [57] argue that since most of the content from federated resources is accessed by filling out an online form, this can be equated to querying relational database tables. Based on this notion, they proposed a novel approach that uses a mediated schema to integrate the resources into a single interface. On that interface, their approach automates all of the building blocks of federated search (document sampling, size estimation, resource selection, and result merging). In [58], by contrast, an approach was proposed that uses distributed information retrieval to detect the unlawful alteration, manipulation, and reuse of copyrighted works.

Almost all of the above-mentioned resource selection models [23, 24, 28, 42, 43] select a group of most relevant resources for the given query by considering relevance alone. However, it has been established that many queries issued to search systems are either ambiguous or multifaceted [59, 60]. Therefore, the LDA-RS resource selection algorithm was proposed in [39] to balance relevance and diversity in selecting the resources to search for the given query. To generate the diversity-ranked list, each document in the initial ranked list is considered as a vector of terms \(d_i=\{t_1,t_2,...,t_n\}\), and LDA is applied to each document to compute the probability of the query topics it covers. The goodness (i.e., relevance and diversity) of each document is obtained using the following expression:

$$\begin{aligned} G\left( d_{i},q,R_{\text {div}} \right) = \ \lambda \Gamma \left( d_{i},q \right) - (1 - \lambda )\max _{r_{j} \in R_{\text {div}}}{\text {sim}(d_{i},r_{j})} \end{aligned}$$
(10)

where \(\lambda \) is the relevance–diversity control parameter, \(\Gamma (d_i, q)\) is the document relevance score obtained from the initial ranked list, and sim(\(d_i\), \(r_j\)) is the similarity between document \(d_i\) and an already selected document \(r_j\), obtained using the cosine similarity of their vectors. In the LDA-RS paper, the Indri search engine was used to obtain \(\Gamma (d_i, q)\), and a KL-divergence retrieval model was used for sim(\(d_i\), \(r_j\)).
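
Equation 10 is a greedy, MMR-style selection criterion. A sketch follows, where `relevance` holds the initial-list scores (Indri scores in the paper) and `sim` is a document-similarity function (KL-divergence based in the paper); both are assumed callables here.

```python
def diversify(ranked, relevance, sim, lam=0.5, k=10):
    """Greedy selection by Eq. 10 [39]: trade off relevance against
    similarity to already selected documents."""
    selected = []
    candidates = list(ranked)
    while candidates and len(selected) < k:
        def goodness(d):
            # penalty: similarity to the closest already-selected document
            penalty = max((sim(d, r) for r in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * penalty
        best = max(candidates, key=goodness)
        selected.append(best)
        candidates.remove(best)
    return selected
```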

Similarly, a mean–variance method of search result diversification was proposed by Ghansah and Wu [36]. In their approach, the query received by the broker is executed on the CSD to generate the initial ranking, with an Indri retrieval system used as the retrieval model, and the resources with the highest number of documents are selected as the most relevant for the query. Afterward, the initially generated ranking is reranked using the portfolio algorithm proposed in [61]. A constant score is assigned to the selected resources for each of their documents in the reranked list, and the resources with the highest scores are considered the most relevant and diverse.

Tables 2, 3, and 4 summarize some selected studies on resource selection proposed in the literature.

4.3 Other methods

In most organizations, information is stored on multiple servers due to location or technical issues and is mostly available as unstructured files. To facilitate access to this information, many companies create enterprise search systems [62]. Such a system is designed to save employees’ time, improve decision-making, and find information regardless of its format or the server on which it is stored. In [63], an advanced resource selection model for enterprise search that utilizes semantic middleware schemas was proposed.

5 Result merging

Result merging is the final stage among the interrelated problems of federated search. The goal of result merging models is to collate all the results returned by the selected resources using calculated scores, and the scores generated should be comparable across the multiple resources. Consequently, an effective result merging approach is critical to the success of federated search systems: even when the most relevant resources are chosen, proper result merging is necessary to guarantee the effectiveness of the final result list.

Nevertheless, merging multiple result lists is a challenging task due to discrepancies among the resources in terms of content, the use of different retrieval models to retrieve the documents, and, in most cases, the non-availability of the documents’ full text at merging time. These difficulties make result merging the least researched area in federated search, especially for uncooperative environments. As with resource selection, result merging methods can be subdivided into heuristic methods and machine learning methods.

5.1 Heuristic methods

In the literature, one of the early result merging models [64] assumed that the resources would return their ranked results along with their collection index term statistics. However, it was argued in [10] that this assumption is not realistic in a web environment, because most resources are not cooperative. Owing to the uncooperative nature of most resources, the approaches proposed by [44, 47, 65, 66], with different methodologies, utilized the representative documents in the CSD to compute the merging scores. That is, when the broker receives a user query, it forwards the query to the most relevant resources and also runs it on the CSD. The merging score of a document is estimated by mapping its rank in a resource result list to its relevance score obtained from the CSD ranking. One disadvantage of these approaches is that their effectiveness depends on a high number of overlapping documents between the resource result lists and the CSD ranking.
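
The rank-to-score mapping these methods share can be sketched as a per-resource regression over the documents that appear in both a resource’s list and the CSD ranking; the linear form below is one simple choice for illustration, not the exact model of any single cited paper.

```python
import numpy as np

def csd_rank_mapping(resource_ranks, csd_scores):
    """Fit a per-resource linear mapping from rank to CSD score, in the
    spirit of the CSD-based merging methods [44, 47, 65, 66].

    resource_ranks: {doc_id: rank in the resource's result list}
    csd_scores:     {doc_id: score from running the query on the CSD}
    """
    overlap = [d for d in resource_ranks if d in csd_scores]
    x = np.array([resource_ranks[d] for d in overlap], dtype=float)
    y = np.array([csd_scores[d] for d in overlap])
    slope, intercept = np.polyfit(x, y, 1)   # needs >= 2 overlapping documents
    # the returned function scores any rank, including non-overlap documents
    return lambda rank: slope * rank + intercept
```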

For the result merging task, a few runs were submitted to the 2013 and 2014 TREC FedWeb tracks. In [67], data fusion techniques were used to merge the results. Specifically, the document ranks returned by the resources were converted into ranking scores using a rank fusion technique [68]: each document’s relevance score was calculated by adding its ranks and its frequency of appearance across the multiple resource lists. However, the merging effectiveness of this approach suffers in the absence of many shared documents across the different resources’ result lists. Similarly, the approaches proposed in [31, 69] computed the documents’ merging scores by first converting their ranks into relevance scores and then multiplying them by the resource relevance score obtained in the resource selection phase. Specifically, Pal and Mitra [69] obtained the document score by taking the reciprocal of the log of the document’s rank. The effectiveness of these approaches depends on the effectiveness of the resource selection algorithm.

In [70], sentiment diversification was used to improve the effectiveness of the merged result list. Specifically, they converted the document ranks returned by the resources into a ranking score using the following equation:

$$\begin{aligned} s\left( d \right) = \ \frac{r(d)}{n} \times s(S_{i}) \end{aligned}$$
(11)

where s(d) is the document relevance score, r(d) is the document’s rank in the resource’s ranked list, n is the number of documents the resource returned in its ranked list, and s(\(S_i\)) is the source relevance score obtained in the resource selection phase. The sentiment diversification is obtained using the SentiWordNet lexicon approach [71]. That is, each document’s sentiment toward the given query is obtained from the sentiment of the terms appearing in it, using the following equation:

$$\begin{aligned} \text {sent}\left( d \right) = \ \sum _{t \in d}^{}{\text {sent}\left( t \right) \frac{tf(t,d)}{|d|}} \end{aligned}$$
(12)

where sent(t) is the sentiment of term t obtained from SentiWordNet, tf(t, d) is the frequency of term t in document d, and \(\vert d \vert \) is the total number of terms in document d. The final merging score for each document, \(s_m(d)\), is obtained by iteratively adding documents to the final ranking list using the following equation:

$$\begin{aligned} s_{m}\left( d \right) = \text {argmax}(s_{\text {norm}}\left( d \right) \times \text {sent}\left( d \right) ) \end{aligned}$$
(13)

Unfortunately, no significant difference was observed for this method compared to the non-diversified methods proposed in the TREC 2014 FedWeb track. Recently, a snippet-based result merging model was proposed in [72]. In merging the results, it uses only the snippets provided by the search engines at query time to estimate the merging score for each document, making no assumptions about the resources’ corpus sizes or retrieval models.
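
A sketch of the merging procedure of Eqs. 11–13 follows. Here `sentiments` is assumed to hold the per-document values of Eq. 12, and the rank-to-score conversion is written so that top-ranked documents score highest, which is an assumption about the intent of Eq. 11 (a literal r(d)/n would favor the bottom of the list).

```python
def rank_to_score(rank, n, source_score):
    # Eq. 11, with rank 1 mapped to the highest score (see note above).
    return (n - rank + 1) / n * source_score

def sentiment_diversified_merge(scored_docs, sentiments, k=10):
    """Eq. 13: build the merged list by iteratively taking the document
    with the highest normalized score times sentiment weight [70].

    scored_docs: {doc_id: Eq. 11 score}; sentiments: {doc_id: Eq. 12 value}.
    """
    max_s = max(scored_docs.values())        # for score normalization
    merged, remaining = [], set(scored_docs)
    while remaining and len(merged) < k:
        best = max(remaining,
                   key=lambda d: scored_docs[d] / max_s * sentiments[d])
        merged.append(best)
        remaining.remove(best)
    return merged
```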

5.2 Machine learning-based methods

Although many machine learning models have been applied to various information retrieval tasks, only a few have been used for result merging in federated search. Tjin–Kam–Jet and Hiemstra [73] treated the result merging problem as a classification problem. Based on the readily available information in the resource result lists, they extracted some relevant features, such as the number of documents in each resource result list, the presence or absence of a URL for a document, query term occurrences in the title, etc. They utilized SVMrank to train a ranking function that merges the multiple result lists into a single ranking. A similar approach was proposed in [74] with additional features, such as the resource ranking score obtained in the resource selection phase, employing a boosting algorithm [75] to learn the ranking function. Furthermore, Ponnuswami et al. [76] used a gradient boosting algorithm to learn the composition of the final merged result list when different verticals return result lists. Recently, Vo [77] used genetic programming to propose a methodology for calculating the scores of all the documents to be merged. Either the full text or excerpts of the documents in question, such as ranking position, title, and description, as well as the BM25 scores of both title and description, are used; in their study, 45 attributes and 4 parameters were used in computing the merging score. Similarly, a reranker for a multilingual metasearch engine was proposed in [78]. This reranker is designed for a multi-stage metasearch engine: the first stage retrieves candidate documents for a given query from conventional search engines; in the second stage, the retrieved documents are reranked with a neural model, with each document scored according to its relevance to the given query; a final step formats the documents and returns them to the user.

Table 5 summarizes some selected studies on result merging proposed in the literature. Furthermore, based on Tables 2, 3, 4, and 5, it is evident that most of the approaches proposed over the last three decades have focused on solving the resource selection problem, while few have focused on the result merging problem. In addition, very few of them use machine learning methods; most employ heuristic methods.

Table 5 Summary of some selected published studies for the result merging problem

5.3 Other methods

Several search engines get most of their revenue from sponsored search, where advertisers bid on slots to display targeted sponsored ads alongside the search results. For conventional search engines, the process of displaying ads alongside results is straightforward; for federated search, it is not. This is because, for a federated search system to know which ads to show, the returned documents must have relevance scores, as it is the scores that determine the likelihood of an ad being clicked [80]. For this reason, a mechanism that incentivizes the inclusion of document relevance scores in the returned resource result lists was proposed in [80]. In [81], a revenue-sharing mechanism between the search interface provider and the information sources that provide the content in the federated setting was proposed.

Due to the autonomous nature of the resources, there is no uniformity in the format in which each resource presents its results to the broker. Each resource presents its results in its own generic format, even though some of them provide an application programming interface (API) for easy extraction of their results by the broker [82]. As a result, a standardized protocol for the exchange of search results between the resources and the broker was proposed [82]. Furthermore, a model that predicts a web page’s relevance to a given query based on the web-page snippet provided by the resource was proposed in [83].

During the past decade, LinkedIn has evolved into a site that contains information about professionals, their profiles, job postings, and professional groups. People usually visit the site to search for jobs, hire people, join professional groups, and download content. To enhance the user experience, [84] proposed a personalized federated search that utilizes users’ search histories to aggregate the search results into a single list for LinkedIn users.

6 Systems and project prototypes

In the mid-1990s, researchers began exploring the potential benefits of federated search technology to enhance information retrieval systems’ efficiency and accessibility. Studies have been conducted over the years to investigate the effectiveness of federated search systems, including the impact of resource selection and result merging algorithms on meeting users’ information needs. The findings of this research aided in the development of federated search systems that are now utilized in digital libraries, government databases, and corporate enterprises.

6.1 First-generation systems

The first generation of federated search systems was developed using various protocols proposed to work in a cooperative environment, allowing multiple independent resources to be searched simultaneously through a single interface. Some of these protocols allow users to specify the resource to which their query should be routed. Early systems such as MetaCrawler and some digital libraries utilized these protocols, which were proposed in research such as STARTS [5] and SDLIP [85].

The STARTS project aimed to create search protocols that allow each participating resource to share information with the broker, enabling simultaneous searching across multiple independent resources. Meanwhile, SDLIP proposed middleware for search interfaces that facilitates cross-searching and information sharing among various digital libraries. The protocol was used to connect the digital libraries of the University of California at Berkeley, San Diego, and Santa Barbara, and the California Digital Library (CDL) through a single interface. In SDLIP, however, it is the search users who determine which resource receives their query.

These two early research efforts proposed protocols for a cooperative environment in which the resources involved disclose their corpus information to the broker through an agreed channel of communication. Even though the resources provide the broker with full information about their corpora, the systems developed using these protocols had limitations. First, merging documents from different resources was difficult due to varying corpus management. Second, the broker needed to periodically check for free and charged information from the resources. Third, these protocols were designed for textual data only. Additionally, in SDLIP, the interface controlled the time allocated for a search session, which could result in the session closing before the user was done. Lastly, the assumed level of information disclosure may not be realistic in a web environment.

6.2 Second-generation systems

Combining the advantages of the STARTS and SDLIP protocols led to the development of advanced search systems that enable automatic resource selection in a cooperative environment. The SDARTS study [86] falls into this category, as it combines the SDLIP and STARTS protocols to develop an advanced search system capable of cross-searching both local and internet resources. The SDARTS model used the combined protocols to develop three sets of wrappers: for text documents, XML documents, and web documents. A wrapper is a piece of software that defines the interaction between resources that participate in a federated setting. These wrappers were integrated to create a sophisticated search interface that can access information on local resources and on the internet. However, because SDARTS combined the protocols of STARTS and SDLIP, it inherits all their limitations.

6.3 Third-generation systems

In response to the wide acceptance of federated search technology among organizations and government institutions, researchers turned their attention to the development of a variety of wrappers. With these wrappers, hundreds of resources with different content can be accessed in an uncooperative environment setting.

The FedStats portal is a federated search portal that provides statistical information published by more than 100 federal agencies in the USA. With this portal, individuals and businesses can search for information without having to know which agency provides it. The portal was developed by Carnegie Mellon University researchers and the federal statistics team under the FedLemur project [87]. The project created a wrapper for each of the target agencies’ websites; with these wrappers, a user query can be translated into the query language of the target agency, forwarded, and the received results merged into a single list, with a separate wrapper developed for each of these processes. A limitation of this project, at the time it was developed, was the use of the SSL and CORI algorithms for merging the results, which were found to be less effective in the literature, as discussed in Sect. 5.1.

6.4 Fourth-generation systems

Federated searches are extensively employed across various sectors, primarily in academics, enterprises, and the tourism industry. Such searches cater to the needs of users seeking relevant information from multiple sources, thereby necessitating the development of sophisticated systems. These systems operate in an uncooperative environment, leveraging advanced resource selection and result merging methods.

Jayakody et al. [88] highlighted the challenges faced by the European Connected Factory Platform for Agile Manufacturing (EFPF) project, which aims to connect participating resources such as NIMBLE, COMPOSITION, vf-OS, and DIGICOR to offer seamless access to users. Because the repositories hold different content, the project faces significant challenges in content acquisition and interoperability. Zenodo is an open-access federated search system developed to enable researchers to share their findings and promote collaboration, while the EEXCESS EU project [89] aimed to create a federated search system with access to different third-party search engines. Additionally, Tanium Reveal [90], a federated search engine for unstructured file systems managing sensitive data in enterprise networks, is designed such that each endpoint controls its own index, and the central interface neither interferes with the resources’ indexed content nor keeps a sample of it in a local database. Thus, when the broker receives a query, it is forwarded to all resources, which perform three tiers of processing to generate the result lists returned to the broker.

Collarana et al. [91] proposed a federated hybrid search engine (FuhSen). They also identified resource content variation as a major barrier to interoperability in both searching and merging results, and they address this challenge using an on-demand knowledge graph to estimate the semantic similarity and relatedness of resources to a given query. Damas et al. [4] developed a federated search system for sports-related websites, in which four separate indexes are created for competitions, teams, managers, and players. The query is divided into terms and sent to the respective indexes, and the results are merged using the approach proposed in [92].

In summary, result merging and query optimization are the major challenges faced in the development of sophisticated federated search systems. The reason is that the resources involved in a federated setting use different methods of indexing, processing, and retrieving documents. It is challenging to optimize a query so that it generalizes across all the resources, each resource retrieves its best results for that query, and the broker can then merge the returned results into a single list.

7 Datasets

In information retrieval research, document corpora or testbeds serve as real-world search engine simulations for users to submit their information needs and receive a ranked list of documents. These testbeds contain a document corpus, a set of test queries to simulate user information needs, and relevance judgments for each document. These testbeds enable researchers to test the effectiveness of retrieval systems and develop improved approaches to meet user needs.

In the domain of federated search, information sources are considered autonomous, containing diverse content with some overlap between them [5]. Consequently, there is a need to develop testbeds that can simulate real-world federated search systems. One common method of creating such testbeds is to partition TREC datasets into smaller corpora that can serve as information sources. For instance, Xu et al. [93] used a K-means clustering algorithm to divide the TREC4 dataset into 100 information sources, while Powell et al. [94] used TREC 1–4 disks to create TREC-123-100col. Nevertheless, the primary drawbacks of these testbeds are their limited size compared to actual real-world information sources, and a nearly uniform distribution of documents across the created information sources.

According to [10, 95], the performance of federated search approaches is heavily influenced by the datasets used to evaluate them. In other words, models that perform well on smaller testbeds may not perform equally well on larger ones. To address this issue, 100col-GOV2 testbeds were created from the TREC GOV2 dataset [23], and wikipedia-100col-Kmeans was created from the Wikipedia Clueweb dataset [44].

The aforementioned testbeds are artificially created by dividing TREC datasets and assigning retrieval models, which may not reflect real-world federated search environments. Additionally, these testbeds primarily consist of text documents and may not account for the diverse range of content provided by some resources.

In an effort to address the limitations of existing testbeds and enable research that simulates real-world federated search environments, TREC created the FedWeb datasets. The FedWeb datasets are extensive collections of documents obtained from real-world search engines, where the search engines retrieve the documents using their proprietary retrieval models. In contrast to the previously mentioned federated search datasets, the TREC FedWeb 2013 dataset was sampled from 157 real-world search engines in 24 vertical categories, such as academic journals, blogs, news, videos, images, entertainment, shopping, and kids [48]. The 2014 TREC FedWeb dataset, on the other hand, was drawn from 149 real-world search engines across 24 vertical categories [48]. Another dataset created for federated search research was proposed in [96]; it was crawled from 109 real-world conventional search engines and specialized databases.

The 2013 TREC FedWeb dataset was created using 2000 queries, of which the first 1000 were single-term queries sampled from the ClueWeb09 Cat-A collection. The remaining 1000 queries were search engine dependent, selected from the vocabulary of the snippets returned by the search engines for the first 1000 queries.

In contrast to the TREC 2013 FedWeb dataset, the TREC 2014 FedWeb dataset was created by issuing 4000 queries to search engines. The first 2000 queries were single-term queries from ClueWeb09 Cat-A, while the remaining 2000 queries were search engine-specific.

Real-world search engines have significant overlap among their returned results, and to account for this, the TREC FedWeb datasets include a list of duplicate documents that must be removed before evaluating rankings generated by models that use the datasets. These datasets have features that resemble those of real-world federated search systems, making them ideal for testing federated search approaches.

8 More related issues

In the last three sections, we have reviewed work on three major aspects of federated search: resource description, resource selection, and result merging. In this section, we review work on some issues other than these three.

8.1 Evaluation

Result evaluation is an important aspect of information retrieval. However, it is more complicated for a federated search system than for a centralized search system. It may be desirable to evaluate the three major components (resource representation, resource selection, and result merging) separately, but there have also been efforts to evaluate the whole system in different ways.

The work in [97] is probably the first to address this problem. A flexible simulation model is defined to analyze the performance issues of a distributed information retrieval system. Response time, throughput, and resource utilization are measured under different parameter settings, including the number of users and text collections, average query length, I/O and CPU workloads, network latency, the time to merge results from different IR servers, and so on.

A new measure, average ranked relative recall, was proposed in [98] to evaluate the results of a distributed information retrieval system. Considering that the result of a distributed information retrieval system is almost always worse than that of a centralized retrieval system, the former can be evaluated using the results of a centralized system as the baseline.

Both [99] and [100] concern the performance of component retrieval servers, and corresponding estimation methods were proposed. These can be useful for federated search tasks, including resource selection and result merging, and may also be useful for evaluating the whole federated search system.

A user study was presented in [101] to evaluate a federated medical search engine, MedSocket, in an established clinical setting, applying the Human, Organization, and Technology fit (HOT-fit) evaluation framework. Another user study, on the interactive patent search system PerFedPat, was carried out in [102], and a prototype web-based federated search engine for art and cultural heritage was evaluated in [103]. In these studies, both efficiency and effectiveness were evaluated.

8.2 Aggregated search

In the context of web search, information-seeking users are becoming more adept at identifying documents relevant to their queries, and some users are looking for more than just textual documents. Therefore, most search engines nowadays display multiple types of content, such as images, maps, videos, and other media, on the search engine result page (SERP). The aggregation of diverse content on SERPs is referred to as aggregated search. Aggregated search can be regarded as an instance of federated search; it needs to deal with three key problems for a given user query. The first is to determine which verticals (resources) are relevant. The second is to determine which documents from the chosen verticals should appear on the SERP. The third is the vertical presentation problem: how to display all the selected content on the SERP.

Although federated search and aggregated search have some similarities, they also differ in some respects, as highlighted in [104]. First, most recent studies on federated search were carried out in the uncooperative environment, in which no cooperation exists between the broker and the resources; in aggregated search, on the other hand, there is full cooperation, and the verticals are maintained centrally. Second, the goal of resource selection in federated search is to select as few resources as possible for a given query, on the premise that selecting a few resources may improve retrieval performance, whereas in vertical selection the goal is to determine which verticals are relevant to the query and which are not. Third, the same scoring formula is used to evaluate the relevance of all resources for a given query in federated search resource selection, while vertical selection scores each vertical’s relevance to a query separately. Over the last decade, different approaches [105, 106, 107] to vertical selection and presentation have been proposed in the literature. In a nutshell, aggregated search is a research area that focuses on the composition of the SERP. Its primary goals are: (i) determining which verticals to include and where to place them on the SERP; (ii) determining users’ behavior on the presented results; and (iii) determining what factors influence that behavior.

8.3 Metasearch

Metasearch engines [108] try to combine results from a number of component search engines. They can work as general-purpose or specialized search engines, depending on the type of search engines underneath. Metasearch and federated search look very similar, but many metasearch papers assume that the collections in the component search systems are the same or overlap significantly. Therefore, a major objective of metasearch research is how to improve retrieval performance by combining results from different retrieval systems over an identical collection. In some cases, metasearch is referred to as data fusion [109].

In order to achieve better retrieval performance, a variety of techniques have been tried to obtain good weighting schemes for merging results. Borda count and Bayesian inference-based approaches were investigated in [110], Condorcet fusion in [111, 112], a multiple linear regression-based method was proposed in [113], linear programming was investigated in [114], a method using the fuzzy analytic hierarchy process and a modified extended ordered weighted averaging operator was investigated in [115], and an ant colony-based search was investigated in [116].
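
As one concrete example of these weighting schemes, a minimal Borda-count fusion [110] can be written in a few lines; the document ids in the usage example are invented for illustration.

```python
def borda_fuse(result_lists):
    """Borda-count fusion: each list awards a document points inversely
    proportional to its rank (the top of an n-item list gets n points),
    and points are summed across lists."""
    scores = {}
    for ranking in result_lists:
        n = len(ranking)
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0) + (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three engines searching the same collection.
print(borda_fuse([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2", "d1"]]))
# -> ['d2', 'd1', 'd3']
```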

As an alternative to fusion, another line of work re-ranks all the results from multiple search engines after removing duplicates. A re-ranking method proposed in [117] considered text-based, factor-based, rank-based, semantic-based, and classifier-based features extracted from the web pages retrieved by the component search engines.
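A minimal sketch of this dedup-then-rerank pattern follows; the feature names and weights are placeholders standing in for the feature families used in [117], not its actual features.

```python
def rerank(results, weights):
    """Pool results from multiple engines, drop duplicates by URL, and
    re-rank by a weighted sum of per-document feature scores."""
    seen, pooled = set(), []
    for doc in results:
        if doc["url"] not in seen:   # remove cross-engine duplicates
            seen.add(doc["url"])
            pooled.append(doc)
    return sorted(pooled,
                  key=lambda d: sum(w * d.get(f, 0.0)
                                    for f, w in weights.items()),
                  reverse=True)

# Usage with placeholder feature names and weights:
# rerank(docs, {"text_score": 0.5, "rank_score": 0.3, "class_score": 0.2})
```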

As another alternative to fusion, one policy is to estimate the effectiveness of all component search engines and choose the best one per query. In [118], five heuristic measures were proposed for evaluating the relative relevance of the result lists from multiple search engines; all of them take into account the redundancy and ranking of documents across the lists.
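The exact measures of [118] are not reproduced here; the sketch below gives one plausible redundancy-based heuristic in the same spirit, where a list earns credit when its highly ranked documents also appear in other engines' lists.

```python
def list_quality(result_lists):
    """Score each result list by cross-list redundancy: a document at
    0-based rank r contributes 1 / (r + 1) for every other list that
    also returned it. A higher score suggests a more reliable engine."""
    scores = []
    for i, ranking in enumerate(result_lists):
        others = [set(r) for j, r in enumerate(result_lists) if j != i]
        scores.append(sum(sum(doc in o for o in others) / (rank + 1)
                          for rank, doc in enumerate(ranking)))
    return scores

# For a given query, keep only the list with the highest estimated quality:
# best = max(range(len(lists)), key=lambda i: list_quality(lists)[i])
```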

The design and implementation of some metasearch systems were presented in [115, 117, 119].

8.4 Personalizing federated search

With the advancement of communication technologies and the latest generation of mobile devices (i.e., smartphones, tablets, etc.), people can now access the internet at any time, from any location, using any mobile gadget. This internet penetration gave birth to different types of large-scale social media networks, such as Facebook, WeChat, Twitter, and WhatsApp. These social networks are now widely recognized as important tools for disseminating information and exchanging ideas [120]. Social media networks allow users to tag posts or documents, and the tags can subsequently be used to label content by topic [121]. Several bookmarking sites that support tagging, such as PinterestFootnote 12 and FlickrFootnote 13, are available on the internet. Such sets of tags can be used to build a user preference profile [122]. Several approaches [37, 122, 123] have exploited these tags to personalize resource selection and result merging.
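As a simple illustration of how such tags might be turned into a profile, the sketch below builds a normalized tag-frequency vector; this is our own minimal rendering of the idea, not the exact construction used in [122].

```python
from collections import Counter

def build_tag_profile(user_tags):
    """Build a normalized term-weight profile from a user's tag history,
    e.g., tags collected from social bookmarking sites."""
    counts = Counter(tag.lower() for tag in user_tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

profile = build_tag_profile(["python", "search", "python", "ir"])
# {'python': 0.5, 'search': 0.25, 'ir': 0.25}
```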

Kechid and Drias [123] argued that most earlier result merging approaches considered only document-query relevance when ranking the final results, leaving the user's preferences and interests out of account. To deal with this problem, they proposed a personalized approach that takes into account document relevance to (i) the user query, (ii) the user profile, and (iii) the user preferences; documents are then ranked by the sum of the three scores. A similar approach was proposed in [122]; the difference is that, instead of using personal data and preferences to create the user profile as in [123], a set of tags is used to build the profile in [122]. Similarly, Hamid and Samir [37] posited that, to meet user information needs, user profiles must be considered in addition to the documents' relevance to the query. They therefore proposed a resource selection algorithm that considers the user's profile: a set of local and global user profiles is created, where local profiles capture document preferences and interests and global profiles capture the user's device preferences and situation. A collaborative scoring scheme is then used to compute the relevance score for the resources.
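A minimal sketch of this three-component ranking follows, using cosine similarity over sparse term-weight vectors as a stand-in for the relevance functions; the function names and the choice of cosine similarity are our assumptions, not details from [123].

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def personalized_score(doc, query, profile, preferences):
    # Final score is the sum of the three relevance components, as in [123]
    return cosine(doc, query) + cosine(doc, profile) + cosine(doc, preferences)

def merge(docs, query, profile, preferences):
    """Rank pooled documents by their personalized scores."""
    return sorted(docs,
                  key=lambda d: personalized_score(d, query, profile, preferences),
                  reverse=True)
```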

8.5 Security issues

Security is a very important issue for a distributed information system because it can be accessed by many different people from many different endpoints. When developing a federated search system, security should therefore be considered at several levels.

Reveal [90] can evaluate compliance with security standards for data protection, such as those mandated by government regulations and laws. Examples include the PCI standards for protecting personal credit card payment information [51], the HIPAA standards for securing patient health data [17], and the GDPR standards for protecting personally identifiable information [23, 81]. Reveal can detect patterns of sensitive text, thereby identifying regulatory noncompliance.
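Reveal's internals are not described in this survey; the sketch below merely illustrates the general idea of pattern-based detection of sensitive text, with deliberately simplified, illustrative regular expressions.

```python
import re

# Illustrative patterns only: real compliance scanners use far more
# elaborate rules and validation (e.g., Luhn checksums for card numbers).
SENSITIVE_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_sensitive(text):
    """Return (label, matched_text) pairs for every sensitive pattern found."""
    return [(label, m.group())
            for label, pattern in SENSITIVE_PATTERNS.items()
            for m in pattern.finditer(text)]

hits = find_sensitive("Card: 4111 1111 1111 1111, SSN: 123-45-6789")
# [('credit_card', '4111 1111 1111 1111'), ('us_ssn', '123-45-6789')]
```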

9 Conclusion and future research directions

The ubiquity of conventional search engines as vital tools in the present-day information age is undeniable. Although they cater to the information needs of countless individuals seeking information on the web, they cannot provide access to a substantial proportion of the information sources available there.

Federated search targets those information sources by acting as an intermediary between them and information seekers, enabling queries to be forwarded to multiple resources through a single search interface.

Researchers have made significant progress in addressing the interrelated issues in federated search, including resource description, resource selection, and result merging. This paper reviews various state-of-the-art models, with a particular emphasis on resource selection and result merging, highlights their methodologies and limitations, and provides insights into potential areas for future research. Furthermore, the available testbeds used for evaluating federated search models are discussed, and some federated search systems and prototypes are also reviewed.

Although numerous approaches have been proposed to tackle federated search challenges, most of them utilize partitioned datasets that are not realistic reflections of real-world web federated search systems. To address this gap, TREC created the 2013 and 2014 TREC FedWeb datasets, which replicate actual federated search systems. Despite this, few new models have been proposed using these datasets. Therefore, further research using these datasets, as well as the development of additional ones, is a promising direction.

Search Results Diversification: In the field of information retrieval, it is commonly reported that many search queries are ambiguous or multi-faceted. Result diversification has been proposed as a way to cover the different interpretations of such queries. However, few approaches have been proposed for diversifying search results within the federated search result merging problem. Thus, there is a need for an approach that can diversify merged results using only the snippets available in the FedWeb datasets.

Query Expansion: In the field of information retrieval, previous research has established that query expansion can significantly enhance retrieval performance for short queries in centralized search systems. However, the same level of success has not been reported in the federated search literature. This is due to the difficulty of finding suitable sources to select expansion terms from. As such, there is a need for an approach that explores alternative sources for selecting expansion terms beyond traditional feedback documents or external dictionaries, such as WordNet.

Multimedia Data Sampling: In the context of obtaining resource corpus information in uncooperative environments, sampling methods have been proposed in the literature, primarily for textual data. However, it is becoming increasingly apparent that multimedia data, such as images and videos, are prevalent in resource indexes. As a result, there is a need for novel approaches that can effectively sample multimedia data based on their features to cater to the needs of federated search research.

Image Retrieval: In recent times, image retrieval has garnered significant attention from researchers due to the exponential growth in the volume of images generated in various domains such as medical images, satellite images, and social media. While several approaches have been proposed for real-time retrieval in centralized systems, there is a notable gap in the literature concerning federated search approaches for image retrieval. Hence, creating an image dataset that simulates a real-world federated search environment and proposing models for resource selection and result merging is a promising direction to explore. Such models could be useful for effectively retrieving images in federated search systems, which will enhance their performance and utility for various applications.