1 Introduction

Even though the web can be seen as a single network of distributed repositories, many traditional information retrieval approaches are difficult to put into practice at this scale. One of the most important reasons is the variety, heterogeneity and distribution of information sources. If all available sources are queried at the same time, the operation returns an overwhelming amount of information and consumes considerable time. Distributed Information Retrieval (DIR) [1, 2] provides a solution to the problem of searching over several dispersed information sources. A DIR system consists of three phases, namely source description [3], source selection [4, 5], and result merging [6]. In the first phase, representations of the available remote sources are created, containing important information about the sources such as their contents and their sizes. In the second phase, the DIR system selects a subset of sources that are most useful for the user's query; the source descriptions are used to estimate the relevance of each source and to rank sources accordingly. The third phase combines the documents retrieved from the selected sources into a single ranked list, which is presented to the user.

In this paper, we address the source selection problem in a context with a large number of sources. Previous research [7] showed that the source selection phase is vital for the overall effectiveness of a DIR system. We formalize source selection as a combinatorial optimization problem: finding the optimal combination of sources (a selection) in a prohibitively large search space containing all possible combinations. We search for the solution that maximizes the similarity between the sources composing a selection and the user query. We address this problem with intelligent methods, in particular Genetic Algorithms (GAs) [8], which are considered robust and efficient [9] and outperform analytical methods on large-scale data [10].

2 Related Work

2.1 Sources Selection Approaches in a Distributed Environment

The first generation of source selection approaches, known as big-document approaches, represents each source as a concatenation of its documents. The resulting big documents are ranked according to their lexical similarity with the query using standard information retrieval techniques based on tf (term frequency) and idf (inverse document frequency); for source selection, df (document frequency) is used instead of tf and icf (inverse collection frequency) instead of idf. The best-known approaches are CORI [1, 4] and GlOSS [11]. The second generation, or small-document approaches, uses a centralized index of documents sampled from the different sources. Sources are selected based on the ranking of their documents for a given query: document relevance is estimated, and sources are ranked according to the number and positions of their documents in the centralized ranking. Examples of these approaches are CRCS [12] and [5, 13]. Finally, classification-based approaches combine the above approaches with a number of other query-based and corpus-based features in a machine learning framework [7, 14].

2.2 Genetic Algorithm in Information Retrieval

In recent years, there has been growing interest in applying GAs to different areas of Information Retrieval (IR) [15]. GAs have been used to modify document descriptors [16] or user queries [17,18,19,20,21,22], to optimize web crawling [23, 24], and to optimize parameters independently of the retrieval model [15, 25]. Other works used GAs to address adaptive information retrieval through evolutionary user modeling [26] and to generate and adapt user profiles for filtering documents that match the user's interests [27]. To our knowledge, very few works address the DIR problem with evolutionary methods. For example, in [28] the authors proposed a GA-based algorithm for selecting, within a meta-search engine, the search engines appropriate for a user query; the relevance between a search engine and the query introduced by the user serves as the fitness function. Their experiments are based on a simulation in MATLAB.

3 A Genetic Algorithm for Information Sources Selection

In this work, we propose an approach based on genetic algorithms to select, in the best possible way, the information sources to be queried.

3.1 Problem Definition

We define the problem of sources selection with a pair (instance, question) as follows:

Instance: S = {\(s_1, s_2, \ldots, s_n\)}, a set of n information sources, and a user query q.

Question: determine a subset S' of S such that the similarity between its elements and q is maximal, where \( |S'| = k < |S| = n \); k is the number of sources selected for a query.

3.2 Search Space

The search space includes all possible solutions of the problem. As a solution is a combination of k sources out of n, the size of the search space is given by:

$$\begin{aligned} C_{n}^{k} = \begin{pmatrix} n \\ k \end{pmatrix} = \frac{n!}{k!\,(n-k)!} \end{aligned}$$

When n is large, the number of possible combinations is enormous; for example, with n = 100 and k = 10 there are already about \(1.7 \times 10^{13}\) combinations, so no exhaustive method can yield a good-quality solution in reasonable time. One approach to cope with this issue is the use of artificial intelligence techniques such as genetic algorithms.

3.3 Genetic Algorithm

We propose a Genetic Algorithm for Source Selection, called GASS, whose aim is to find the optimal selection for a user query. GASS starts from a population of candidate solutions, each represented by a subset of sources. Each solution, or chromosome, is evaluated by the fitness function. Genetic operators (selection, crossover and mutation) are used to generate a new population from the current one. Once a new generation is created, the genetic process is repeated iteratively until an optimal solution, or failing that a solution of good quality, is found (Fig. 1).

Fig. 1. Source selection approach

Solution Encoding. A solution to the problem defined above is a set of k sources. The solution is represented by a vector of length k containing information sources, which are encoded as integers to simplify their manipulation. Thus a source \(s_i\) is identified by an integer between 1 and n, and a possible solution is a vector of k integers between 1 and n. For example, if the number of sources is 5 and the number of sources to be selected is 3, a solution can be {1, 4, 5}, {2, 3, 5}, {3, 4, 5}, etc.
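To make the encoding concrete, the following minimal Java sketch (our own illustration; the class and method names are hypothetical, not taken from the paper's implementation) stores a chromosome as an array of k distinct source identifiers and treats two chromosomes as equal when they select the same sources, regardless of order:

```java
import java.util.Arrays;

/** A candidate selection: k distinct source identifiers in [1, n]. */
class SelectionChromosome {
    final int[] genes;                       // e.g. {1, 4, 5} for n = 5, k = 3

    SelectionChromosome(int[] genes) {
        this.genes = genes.clone();
    }

    /** Order is irrelevant: {8, 3, 5, 4} selects the same sources as {3, 4, 5, 8}. */
    boolean sameSelection(SelectionChromosome other) {
        int[] a = genes.clone(), b = other.genes.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }
}
```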

Fitness Function. The solution evaluation function is a performance measure that assesses the quality of each solution. We evaluate a solution, called sol, by the average of the similarities between the sources composing it and the user query. To calculate the similarity between a solution and the query q, we consider sol as a collection of documents representing the sources. This similarity is calculated using the following formula:

$$\begin{aligned} Similarity (sol, q)= \dfrac{\sum _{h \in sol} Similarity (h, q)}{k} \end{aligned}$$
(1)

Similarity (sol, q): similarity between a solution and the query q.

Similarity (h, q): similarity between source h in the solution and the query q.

k: number of sources in the solution.

The similarity between a source and the query can be calculated with the cosine measure of the vector space model. The source is considered as a set of terms. We represent the query and the source by vectors of term weights in an m-dimensional space, where m is the number of terms present in the search space. Thus the similarity between a source h and a query q is given as follows:

$$\begin{aligned} Similarity (h, q)= \dfrac{\sum _{j=1}^{m} (t_{hj}*t_{qj})}{\sqrt{\sum _{j=1}^{m} (t_{hj})^2 * \sum _{j=1}^{m} (t_{qj})^2}} \end{aligned}$$
(2)

where \(t_{hj}\) and \(t_{qj}\) are the weights of term j in source h and in query q respectively, calculated using the tf-idf approach [29] with tf (term frequency) replaced by df and idf (inverse document frequency) replaced by icf. The weight is defined as follows:

$$\begin{aligned} Weight (q/c) = df * icf \end{aligned}$$
(3)

where df is the document frequency and icf is the inverse collection frequency, computed as log(N/cf), with N the total number of collections and cf the number of collections that contain the term.
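As an illustration, here is a small Java sketch of Eqs. (1)-(3) (our own code with hypothetical names; it assumes the df*icf term weights have been precomputed as vectors over the m terms of the search space):

```java
class Fitness {
    /** df * icf weight of a term (Eq. 3): N collections in total, cf contain the term. */
    static double weight(int df, int N, int cf) {
        return df * Math.log((double) N / cf);
    }

    /** Cosine similarity between a source's and the query's weight vectors (Eq. 2). */
    static double similarity(double[] source, double[] query) {
        double dot = 0, normS = 0, normQ = 0;
        for (int j = 0; j < source.length; j++) {    // j ranges over the m terms
            dot   += source[j] * query[j];
            normS += source[j] * source[j];
            normQ += query[j] * query[j];
        }
        return dot / Math.sqrt(normS * normQ);
    }

    /** Fitness of a solution: average source-query similarity (Eq. 1). */
    static double fitness(int[] sol, double[][] sourceWeights, double[] queryWeights) {
        double sum = 0;
        for (int h : sol)                            // sources are numbered 1..n
            sum += similarity(sourceWeights[h - 1], queryWeights);
        return sum / sol.length;                     // divide by k = |sol|
    }
}
```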

Algorithm 1 outlines the proposed genetic algorithm for the selection of information sources.

Algorithm 1. GASS: genetic algorithm for source selection
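Since the original listing is not reproduced here, the following Java sketch gives one plausible reading of the loop described above (a minimal sketch under our own assumptions, not the paper's exact code; the operators are passed in as parameters and correspond to the sketches given with Algorithms 2 and 3 below):

```java
import java.util.*;
import java.util.function.*;

class Gass {
    /** Evolve for maxGen generations: crossover with probability pc, mutation
     *  with probability pm, then keep the fittest popSize chromosomes. */
    static int[] gass(int popSize, int maxGen, double pc, double pm,
                      Supplier<int[]> randomChromosome,
                      ToDoubleFunction<int[]> fitness,
                      BinaryOperator<int[]> crossover,
                      UnaryOperator<int[]> mutate,
                      Random rnd) {
        List<int[]> pop = new ArrayList<>();
        while (pop.size() < popSize) pop.add(randomChromosome.get());
        for (int gen = 0; gen < maxGen; gen++) {
            List<int[]> next = new ArrayList<>(pop);
            for (int i = 0; i + 1 < pop.size(); i += 2)       // crossover with probability Pc
                if (rnd.nextDouble() < pc)
                    next.add(crossover.apply(pop.get(i), pop.get(i + 1)));
            for (int i = 0; i < next.size(); i++)             // mutation with probability Pm
                if (rnd.nextDouble() < pm)
                    next.set(i, mutate.apply(next.get(i)));
            next.sort(Comparator.comparingDouble(c -> -fitness.applyAsDouble(c)));
            pop = new ArrayList<>(next.subList(0, popSize));  // natural selection
        }
        return pop.get(0);                                    // best chromosome found
    }
}
```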

In the following, we describe the different components of the algorithm.

Initial Population. The evolution process starts with an initial population of size PopSize generated randomly from the set of possible combinations. It consists of a set of chromosomes, each denoting a solution to the problem and represented by a vector of k sources encoded by integers from 1 to n. During population generation, duplicate genes (the same source appearing twice) are avoided within a chromosome: a chromosome such as {2, 5, 3, 2}, in which 2 is repeated, must not be constructed. Duplicate chromosomes must also be avoided in the population; for example, {3, 4, 5, 8} is the same as {8, 3, 5, 4} because the order is not important.
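A straightforward way to enforce both constraints is sketched below (our own illustration in Java; the helper names are hypothetical): duplicate genes are ruled out by sampling without replacement, and duplicate chromosomes by comparing sorted gene vectors.

```java
import java.util.*;

class InitialPopulation {
    /** One random chromosome: k distinct sources out of n, sampled without replacement. */
    static int[] randomChromosome(int n, int k, Random rnd) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= n; i++) ids.add(i);
        Collections.shuffle(ids, rnd);               // no duplicate genes by construction
        int[] genes = new int[k];
        for (int i = 0; i < k; i++) genes[i] = ids.get(i);
        return genes;
    }

    /** PopSize chromosomes with no order-insensitive duplicates among them. */
    static List<int[]> generate(int n, int k, int popSize, Random rnd) {
        Set<String> seen = new HashSet<>();
        List<int[]> pop = new ArrayList<>();
        while (pop.size() < popSize) {
            int[] c = randomChromosome(n, k, rnd);
            int[] key = c.clone();
            Arrays.sort(key);                        // {8,3,5,4} == {3,4,5,8} as selections
            if (seen.add(Arrays.toString(key))) pop.add(c);
        }
        return pop;
    }
}
```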

Genetic Operators. (a) Selection. The selection operator simulates "survival of the fittest". There are various mechanisms to implement this operator; the idea is to give preference to the better chromosomes. We use natural selection, which carries the best chromosomes into the next generation. The best chromosomes are identified by their fitness values.

(b) Crossover. Crossover is a genetic operator that combines two chromosomes to form new offspring. It occurs only with crossover probability Pc; chromosomes that are not subjected to crossover remain unmodified. The intuition behind crossover is the exploration of new solutions and the exploitation of old ones. We use single-point crossover without duplicates: the gene values in a generated chromosome must not be repeated (see Algorithm 2).

Example. Consider n = 9 and k = 6, with S = {1, 2, 3, 4, 5, 6, 7, 8, 9} and S' a subset of S of length 6. Figure 2 shows an example of crossover.

Fig. 2. Single-point crossover

Algorithm 2. Single-point crossover without duplicates
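As the listing is not reproduced, the following Java sketch shows one common way to realize a duplicate-free single-point crossover (the exact duplicate-handling strategy is our own assumption): the head of the child is copied from the first parent, and the tail is filled from the second parent while skipping genes already present.

```java
import java.util.*;

class Crossover {
    /** Single-point crossover without duplicate genes (one reading of Algorithm 2). */
    static int[] crossover(int[] parentA, int[] parentB, Random rnd) {
        int k = parentA.length;
        int point = 1 + rnd.nextInt(k - 1);          // crossover point in [1, k-1]
        int[] child = new int[k];
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < point; i++) {            // head: copied from parent A
            child[i] = parentA[i];
            used.add(parentA[i]);
        }
        int pos = point;                             // tail: genes of parent B,
        for (int gene : parentB)                     // skipping any gene already used
            if (pos < k && used.add(gene)) child[pos++] = gene;
        // Parent B holds k distinct genes, at most `point` of which are already
        // used, so the child is always completely filled.
        return child;
    }
}
```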

(c) Mutation. Mutation is the process of randomly altering the genes of a particular chromosome: the gene values of a solution are modified with some probability Pm. The objective of mutation is to restore lost genetic material and explore a variety of solutions. We use single-point mutation: a gene is changed, with a certain probability, to a random number drawn from the interval [1, n], while avoiding gene duplication (see Algorithm 3). Figure 3 shows an example of mutation.

Fig. 3. A single-point mutation (on the 4th gene)

Algorithm 3. Single-point mutation without duplicates
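A minimal Java sketch of such a mutation (again our own reading of the algorithm; it assumes k < n, as required by the problem definition, so a fresh gene value always exists):

```java
import java.util.*;

class Mutation {
    /** Single-point mutation without duplicates (one reading of Algorithm 3). */
    static int[] mutate(int[] chromosome, int n, Random rnd) {
        int[] mutant = chromosome.clone();
        int pos = rnd.nextInt(mutant.length);        // gene to alter, e.g. the 4th in Fig. 3
        Set<Integer> used = new HashSet<>();
        for (int g : mutant) used.add(g);
        int gene;
        do {
            gene = 1 + rnd.nextInt(n);               // random value in [1, n]
        } while (used.contains(gene));               // re-draw while it duplicates a gene
        mutant[pos] = gene;                          // k < n guarantees termination
        return mutant;
    }
}
```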

There is no simple way to configure the GA parameters. We set these parameters (population size, crossover and mutation probabilities) empirically during the experiments.

Termination Criterion. The generation process is repeated until a termination criterion is reached; here, the criterion is a maximum number of generations, chosen so that the algorithm converges.

4 Experiments

We implemented the proposed genetic algorithm in Java using the JGAP library. In this section we describe the data and measures used in the experiments.

4.1 Test Sources

We used databases of scientific research documents covering different domains (computer science, medicine, law, ...). Access to these libraries is provided, through a user account, by a platform called SNDL. The databases used are described in Table 1.

Table 1. Test sources

The Query-Based Sampling (QBS) method [3] is used to construct the source descriptions. Probe queries composed of a single term are sent to each source; the queries are chosen according to the domains to which the sources belong. For each query (in a set of 15 queries), the top 4 documents are downloaded from the source, and these documents are used to represent it. We used Indri to index these source representations and search for documents. We set the genetic algorithm parameters as follows: crossover rate 60%, mutation rate 10%, and number of generations 1500. A set of 20 test queries was selected manually; we chose general queries that return results and avoided queries that return no response. We varied the number of sources to select (k = 2, 4, 6, 8, 9).
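For illustration, these settings could be wired to the sketches of Sect. 3.3 roughly as follows (hypothetical glue code; the population size is not reported in the paper, so the value below is an arbitrary placeholder, as are sourceWeights and queryWeights):

```java
Random rnd = new Random();
int n = sourceWeights.length;                 // number of available sources
int k = 4;                                    // varied over {2, 4, 6, 8, 9}
int[] best = Gass.gass(
        /* popSize */ 50,                     // not reported in the paper; placeholder
        /* maxGen  */ 1500,
        /* pc      */ 0.60,
        /* pm      */ 0.10,
        () -> InitialPopulation.randomChromosome(n, k, rnd),
        c -> Fitness.fitness(c, sourceWeights, queryWeights),
        (a, b) -> Crossover.crossover(a, b, rnd),
        c -> Mutation.mutate(c, n, rnd),
        rnd);
```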

4.2 Evaluation Function

Since the complete set of relevant documents for a query is difficult to determine for our data set, we use precision to evaluate our algorithm, given by the following formula:

$$\begin{aligned} Precision=\dfrac{|Selected\,\,Relevant\,\, Sources|}{|Selected\,\,Sources|} \end{aligned}$$
(4)

To identify relevant sources, the user query is sent to each selected source and only the relevant returned documents are counted. We analyze the first 20 documents returned by each source, and we asked users to judge the relevance of the returned documents. A source is marked relevant if it returns at least 3 documents relevant to the query. The average precision is calculated over the 20 test queries. We compared the proposed algorithm (GASS) with the CORI algorithm [4], using CORI's default parameters.
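This evaluation protocol can be summarized by the following Java sketch (our own illustration; relevantDocCount is a hypothetical map from each selected source to the number of its top-20 documents judged relevant):

```java
import java.util.*;

class Evaluation {
    /** Precision (Eq. 4): a selected source counts as relevant when at least
     *  3 of its top 20 returned documents were judged relevant by users. */
    static double precision(int[] selected, Map<Integer, Integer> relevantDocCount) {
        long relevantSources = Arrays.stream(selected)
                .filter(s -> relevantDocCount.getOrDefault(s, 0) >= 3)
                .count();
        return (double) relevantSources / selected.length;
    }
}
```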

4.3 Experiment Results

Table 2 shows the average precision reached by each algorithm over the 20 queries. Figure 4 shows that the proposed algorithm outperforms the CORI algorithm in terms of precision. It can be concluded that the proposed algorithm provides a solution to the source selection problem in a distributed environment that is better than, or at least as effective as, a state-of-the-art source selection algorithm (CORI).

Table 2. Average precision of GASS and CORI algorithms
Fig. 4. Precision values for GASS and CORI algorithms on SNDL sources

5 Conclusion

We have shown in this work how bio-inspired methods, and more precisely genetic algorithms, can provide solutions to the problem of source selection in a multi-source environment. We designed a genetic algorithm to find the best sources for a user query, and experiments showed that it improves precision in comparison with the CORI algorithm. This suggests that the approach can be effective in distributed information retrieval. In future work, we plan to tune the experimental parameter values to achieve better results, and further experiments with a larger number of sources are needed to validate our proposal.