Keywords

Introduction

In modern history, the Internet in its all probability is the most desired and used distributed processing and computing environment and on the other hand, a user in this environment with limited capabilities for retrieving the desired result are at best. In such distributed processing environment, queries posed by user or applications routinely require gathering data from multiple sites or data sources [1]. Advanced database systems are embedded with the capability of various transparency mechanisms due to which as user never face or feel the effect of distribution of data and in such systems the data pertinent to a query move to a single server for final compilation. In traditional way, query on distributed processing environment is optimized centrally and executed synchronously [2]. Improving the performance of a database system is one of the key research issues from over the decades. Distributed database system (DDBS) is effectual means to improve reliability and availability of data to improve the overall performance of a processing system [3]. The users at local sites can work independently as well as communicate with other sites to retrieve data for answering global queries. Such a setup is referred to as a distributed database system (DDS) [4]. Query posed on a DDS is generally decomposed into secondary queries, which are executed at respective local sites with local data store, before communicated to central control sites for final result generation and finally at the user end, an integrated result is displayed.

A distributed query processing (DQP) strategy aims to minimize the overall cost of query processing in such systems [5, 6]. The cost of query processing in a DDS comprises two costs: the local processing cost and the site-to-site communication or the transmission cost of relation fragments. The total cost incurred in processing a distributed query can thus be taken as the sum of the local processing cost, cost of local operations on local data and cost of transferring locally processed data from participating sites to a central control sites. DQP iteratively evaluates all QEPs of a user query and determines the most optimal query plan that minimizes the total cost that is local processing (CPU, I/O) cost and communication cost [7]. In DDBS, the number of query equivalent plans (QEPs) grows at least exponentially through the increase in amount of relations accessed by the query, as relations are either replicated fully or replicated with partitioned sub-relations. Performing an exhaustive search for optimal QEPs over the all possible combinations of query plans is not viable due to a huge solution space. Therefore, in large DDB, devising a query processing strategy that optimizes the total query processing cost is shown to be a combinatorial optimization problem with NP-complete in complexity nature [8].

Over the last two decades, evolutionary algorithms have gained immense popularity due to their applicability in solving engineering optimization problems and complex scientific problems. These algorithms are inspired by the Darwinian evolution that accentuates the concept of “Survival of the Fittest”. It is, thus, metaphorical to the natural social behaviour and biological evolution of species. The evolutionary strategies are now proved to be the most proficient method of choice for solving such problems. Genetic algorithm-based techniques which belong to the class of evolutionary algorithms have also been widely used in solving complex real-life science and engineering problems. The strength of GA as a meta-heuristic comes from its ability to combine the good features from several solutions to create new and better solutions over generations. Most real-world scientific and engineering problems have often conflicting and competing objectives that need to be optimized. In subsequent years, several different evolutionary algorithms (VEGA [9], NPGA [10], NSGA [11], NSGA-II [4], SPEA [11], SPEA-II [11], PAES [11], PESA [12, 11]) have effectively employed to solve the classic optimization problems.

This paper is organized as follows. Next section describes DQP and its motivation and proposed heuristic. Fundamentals of MOEA are discussed in subsequent section, also its suitability for DQP problem. Vector evaluated genetic algorithm (VEGA) is described in next section and an example illustrate the use of VEGA-based DQPG algorithm for generating optimal query plans. In graph section performance over other techniques in same section, in which comparison. In the last section conclusive discussion is presented with summary of approach.

Distributed Query Processing

In today’s scenario, with a multifold increase in the size operational data leads to growth of size of DDBS, the communication cost asserts a major impact on the overall cost of query processing. The cost incurred in communicating data through a congested network path or the communication of large data units between sites with higher communication costs can highly influence the cost of query processing. It thus also plays a key role in determining the overall performance of a DDBS [4, 6]. There can be a number of possible ways to process and communicate sub-relations relevant to the user query. In difference to the centralized query processing, distributed query processing cannot statically generate result at one single site. In DQP the user query is executed at node and its neighbours [6, 13, 14]. The optimal query plan generation and selection greatly depend on the cost function, as shown in Fig. 1.

Fig. 1
figure 1

Distributed query optimization

In distributed relational database, logical data is replicated either fully or partitioned way to achieve higher degree of reliability in environment and serve higher availability data for efficient processing closer to the user sites or locations. This leads query processing to complex optimizations scenario, as a number of QEPs are exponentially increased. An approach is proposed in this paper using genetic algorithm (GA) to generate the query plans keeping optimality among solutions at highest priory. Primarily, a query optimization focuses on to diminish cost or amount of data a query requiring from multiple logical sites for processing queries. As a result, computing optimal distributed query plans becomes a multifaceted problem. In [14] a distributed query plan generation (DQPG) approach for single-objective, query proximity cost (QPC) is proposed, where optimization among QEPs is according to the heterogeneity within a QEP on the order of sites used. We proposed a solution for DQPG problem, and analysed optimization performance by including additional design objective. In the following objectives are briefly discussed.

Motivation: DQP is a model to determine the optimal policy for processing any given user query. The model is based on operating cost (cost of local processing and communication), which is a function of processing sites for query operators and sequence of these database operators. In DQP quality of QEP is evaluated according to the processing time required by the QEP [15]. In DQP, cost of data transmission between sites is the dominating cost. The objective of distributed query processing is to minimize the data transmission among sites thereby reducing the data transmission cost. Semi-joins have been used [16, 17] to reduce the communication cost. Semi-join reduces the communication cost and exploits the parallelism for query plan execution. Semi-joins carry out this by reducing the amount of attributes or tuples/attributes transfers among sites for generating results, thereby reducing the communication cost. DQP is phenomenon driven by the heuristics [18].

Heuristic - 1: Query proximity cost (QPC) quantifies the heterogeneity in a QEP; a QEP with least number of distinct sites is better as it leads to less data transfer among sites. In this case, more than one QEPs with higher number of same relation is considered better [13, 14]. The fitness function, i.e. the QPC, based on these two heuristics is given below:

$$ {\text{QPC}} = \sum\limits_{i = 1}^{M} {\frac{{K_{i} }}{N}\left( {1 - \frac{{K_{i} }}{N}} \right)} $$

where M indicates the total number of sites required in a QEP, K i refer times the ith site in the QEP and finally N is the total number of relations required in QEP. GA uses the above fitness function to select the query plans having minimum QPC for the next generation. The selected query plans undergo genetic operator (crossover, mutation) for the generation of new pool.

Heuristic - 2: Communication cost (CC), first of all algorithm forms the all possible trees assuming sites as nodes of tree, then evaluate minimum cost tree among all. The communication path between two node is defined on the basis of data stored into the sites, such as the communication cost between node A and B is different in the case when we want move from B to A or A to B. The data from sites is transferred such a way that caused minimum cost to overall plan. So we are trying to minimize this heuristic. For the generation of the communication cost for query plan set, we need a communication cost table consisting communication cost of all site-to-site path.

Multi-objective Evolutionary Algorithm (MOEA)

These algorithms are inspired by the Darwinian evolution that accentuates the concept of “Survival of the Fittest” [19, 20]. It is, thus, metaphorical to the natural social behaviour and biological evolution of species. The evolutionary techniques are now proved to be the most proficient method of choice for solving such problems. Genetic algorithm-based techniques which belong to the class of evolutionary algorithms have also been widely used in solving complex real-life science and engineering and scientific problems [21]. The strength of GA as a meta-heuristic comes from its ability to combine the good features from several solutions to create new and better solutions over generations. Most real-world scientific and engineering problems have often conflicting and competing objectives that need to be optimized [22]. The evolutionary tactics are established as a best possible tool for many engineering and science related problems to optimize the different objectives and find efficient trade-offs unlike the classic techniques. The strength of GA as a meta-heuristic comes from its ability to combine the good features from several solutions to create new and better solutions over generations. Design objectives in most engineering and science scenario are disparate and contradictory for which optimized solution is required [20, 22]. A technique based on evolution is best fit for these kinds of problem, each evolution or iteration generated new improved solution and continues until certain degree of optimality is achieved over set of solutions. Evolutionary techniques are being used and suited for such problems on which concurrent optimization can be achieved over multiple objectives. The best part of GA emerges for multi-heuristic problems on its ability to combine the good features from several solutions to create new and better solutions over generations [20]. Most real-world scientific and engineering problems have often conflicting and competing objectives that need to be optimized. The evolutionary strategies are proved to be best suited for this class of problems as they can simultaneously optimize the different objectives and find efficient trade-offs unlike the classic techniques.

Many engineering and scientific problems cannot realistically modelled using single design objective, and thus multiple and often conflicting objectives are designed. Single genetic algorithm (SGA) is potentially unable to solve optimization problems with multiple conflicting objective functions, and this leads to a set of evolution of customized genetic algorithms; these are demonstrated and observed effective on determining the excellent solutions. The solution set consists of a pareto front of optimal solutions for problem, and a solution in pareto front is considered best with respect to other solution for each of the objectives.

Schaffer (1985) considers the capability of single-objective GA (SGA) for multi-objective-based optimization problems and adapted a new variant of GA, named vector evaluated genetic algorithm (VEGA). In this new development, basic methodology of GA is same except the selection process of chromosomes, as selection is based performed on entire population by applying fitness values from each of the objectives, simultaneously and this generates set of population equal to the number of objectives considered. A problem having m design objectives and total population size of N, m subpopulations of N/m size is created. The selection step was modified to perform proportional selection at each generation for each of the objectives [15], and next for algorithm to proceed with the application of crossover and mutation in the standard mode.

VEGA is implemented over bi-objective query optimization, and performance of aggregation-based genetic algorithm (weighted-sum approach) with VEGA is analysed. The outcome of VEGA-based optimizations is better, as a aggregation-based GA has main demerits in complexity to decide the weights that can be appropriate to scale the objective, to accomplish this prior information which is required about problem, particularly if we consider that optimal point obtained will be purpose of these weights [23].

Vector Evaluated Genetic Algorithm (VEGA)

VEGA is first practical algorithm, implemented for the population with multiple properties or objectives. In [24] Schaffer adapted fundamental GA for multi-objectives by modifying the selection genetic operator of GA. In VEGA vector term represents the objective value for the multi-objective problem. This variant of genetic algorithm is a simplest and an uncomplicated adaption of single-objective genetic algorithm (SGA) for multi-objective optimization with modified selection genetic operator [25]. This algorithm divides total population into set of subpopulations equal to number of design objectives. Each subpopulation is created based on the fitness values of specific objectives, and for each generation, fitness assignment for first subpopulation is done according to the values of first objective function, and so on [24]. Selection techniques are applied on each of the subpopulations and based on the objectives, fitness value due to mating pool is created. In order to minimize positional partiality among the solutions by randomly shuffling before they are partitioned according to the objectives into set of subpopulations, this is useful in managing problems on which objective takes values of different domains and orders of magnitudes. As all QEPs of each of subpopulations is allotted fitness and restricted selection within each of subpopulations, this emphasizes better solution equivalent to respective objective function. Moreover, since neither two QEPs are compared on different objective functions nor generate any complexity. In VEGA, any fundamental operator of selection can be used, and we used for a proportionate (Roulette wheel method) as selection operator. Selected QEPs are inserted into the mating pool for the further reproduction using genetic operators, like crossover using crossover probability (P c ) and mutation using mutation probability (P m ). VEGA iteratively evolves solutions until as stopping criteria met or fulfilled these criteria values are provided by decision maker or user [26].

(a) Algorithm

(b) Example

In GA, first chromosomes for a population are initialized. In proposed solution, two important aspects are the values used for encoding a chromosome and the sizes of the chromosome. The size of a QEP(chromosome) is equal to the number of relation required in FROM clause of the query and values in a QEP are name of sites of relations. Tables 1 and 2 show relation site matrix and some valid query plans. VEGA initially divides entire population into subpopulations, as 20 chromosomes are in Table 3, initial population (P 0). The number of objective decides the size of subpopulation (q) = N/M, where N is the total number of query plans and M is the number of objectives. Initial population (P 0) consists of 20 query plans, and is now equally divided into subpopulation of 10 query plans in each.

Table 1 Relation site matrix
Table 2 Valid query plans (chromosomes)
Table 3 Partitioned population, subpopulation 1 (objective 1) and subpopulation 2 (objective 2)

Query: Select a, b

From R 1 , R 2 , R 3 , R 4

Where R 1 .a = R 4 .t

and R4.p = R2.x

and R2.x = R3.n;

In this query R 1 , R 2 , R 3 , R 4 are required for result retrieval. Relation R 1 is replicated on sites (1, 4, 6, 9), R 2 at sites (1, 4, 5, 7), R 3 at sites (1, 3, 5, 8) and similarly relation R 4 at sites (1, 2, 4, 7) as shown in Table 1 relation site matrix (RSM). Total number of relation required by the query is 4, and the QEP size would be 4. The values into QEP are names of sites for the relation and ordering will be from R 1 to R 4, from left to right.

Selection of the fitter and better chromosomes is achieved parallel in VEGA, as both subpopulation are treated differently with different heuristics, as first subpopulation is based on the heuristic 1 and second on the heuristic 2. In Table 3, query plans with lower fitness value are inserted into the mating pool from subpopulation 1, similarly from the subpopulation 2, query plans with lower communication cost are inserted into mating pool.

Now for implementing the crossover and mutation, combine all mating pool of different subpopulations. Table 4 mating pool consists of fitter and better QEPs, combine mating pool 1 and pool 2. Next, apply crossover with P c in [0.5, 0.9]. In graph section, the effect of different crossover probability (P c ) values on generation of new population is shown. Mutation required in the QEP (chromosomes) according to P m value, for VEGA P m , is [0.0, 0.2]. In Table 5, new population consists of new set of QEPs for initial population given in Table 4.

Table 4 Mating pool sub
Table 5 New population

(c) Graphs

The comparative analysis of optimization performance is according to parameters, e.g. Number of generations, Average query cost and Top-K query of VEGA; aggregation-based genetic algorithm is shown in the following graphs. The experimental setup includes simulation on genetic algorithm toolkit in MATLAB with modelled cost functions and pool of QEP given relation site matrix. For the comparative experimentation, different crossover schemes are implemented: single point crossover (SPC), double point crossover (DBC) and uniform point crossover (UPC). UPC converged more consistently as compared to other schemes for all set of parameters. In Figs. 2 and 3, the comparative results are shown. The experimental analysis states that VEGA’s convergence is better than any multi-objective aggregation-based genetic algorithm for different values of crossover and mutation probability, P c and P m .

Fig. 2
figure 2

Top-K versus generation for fixed average fitness

Fig. 3
figure 3

Average fitness versus generation for fixed Top-K

Conclusion

Distributed database system is one of the most reliable processing environments; logical data is kept in multiple sites. The users at local sites can work independently as well as communicate with other sites to retrieve data for answering global queries; setup is defined as distributed query processing (DQP). A distributed query processing strategy aims to reduce the overall cost of query processing. The query optimizer primarily minimizes the time required for result retrieval for a user query. Genetic algorithm is nature-based optimization technique which has also been widely used in solving complex real-life optimization problems. The strength of GA as a meta-heuristic comes from its ability to combine the good features from several solutions to create new and better solutions over generations. Schaffer proposed a multi-objective genetic algorithm to accommodate the different types of fitness function, vector valued fitness values, called vector evaluated genetic algorithm. Initial steps are similar to single-objective GA, while selection is modified, so that at proportional selection according to each of the objectives is achieved. This selection approach justifies the importance of each of the design objectives of optimization problem, and in query optimization query plans are selected for evolution in proportional contribution of each of the objective populations. The performance trade-offs are discussed result section, and the comparative graphs are drawn between various parameters, e.g. number of generation (evolution) and average query cost.