Keywords

1 Introduction

A data warehouse (DW) is a data repository that provides an integrated environment for reporting, analyzing, and supporting queries which requires complex aggregations of huge amounts of historical data [1, 2]. It is a big challenge to operate and manage such an integrated data store in a cost-effective way. Materialized views are the intermediate results stored in a data warehouse [3] that avert accessing the original data sources and thus increase the efficiency of the queries posed to a data warehouse. Optimal selection of materialized views is NP-hard problem [4] and is needed to design a data warehouse effectively. Recently, lot of attention is given to solve this problem. Researchers have given various frameworks and algorithms to deal with this problem. Data cube and lattice [5,6,7,8,9,10,11,12,13], MVPP [14,15,16,17] and AND-OR view graphs [18,19,20,21] are various frameworks that exist in literature. Lattice framework captures dependency among aggregate views and is used in building the data cubes with multiple dimensions. MVPP is a global query processing plan for the complete query set and exploits the existence of common sub-expressions. AND-OR view graph is used to express all the possible execution plans for evaluating a query in the query set. Using the above frameworks, various algorithms like heuristic-based algorithms [11, 20], greedy algorithms [5, 8, 18, 22], evolutionary algorithms such as genetic algorithm [9, 10, 14, 15, 21, 23, 24] and simulated annealing [13, 16, 17]have been presented for view selection. However, heuristic-based algorithms cannot compute a perfect solution within the acceptable time due to the complex nature of problem. Greedy algorithms are highly problem-dependent and are susceptible to poor local minima, while evolutionary algorithms (EA) work on randomly selected multiple solutions simultaneously to find out the optimum most solution and can be applied on various problems. Various evolutionary algorithms like particle swarm optimization (PSO) [25, 26], bee colony optimization (BCO) [27], ant colony optimization (ACO) [28], and differential evolution (DE) [29]have been used in context of materialized view selection. Backtracking search optimization algorithm (BSA) comes under the category of evolutionary algorithm and aids in solving various optimization problems [30]. With a simple structure, it is effective, fast, and has strategies for generating trial populations by taking favor of the experiences gained from previous generations and gives BSA very powerful capabilities for exploring and exploiting the population [30]. In this paper, we have implemented BSA on lattice framework to find an optimal set of views within the space constraint, thereby minimizing the total cost of query processing. We have compared our results with genetic algorithm and PSO to prove the effectiveness of BSMVSA.

This paper is organized as follows: Definition and mathematical model of materialized view selection problem is given in Sect. 2. Section 3 gives the brief review of backtracking search optimization algorithm (BSA), while Sect. 4 presents BSA-based view selection in a stepwise manner. Experimental results under different space constraints, along with comparative performance analysis with other algorithms, are discussed in Sect. 5. Section 6 ends up with conclusions.

2 Materialized View Selection

2.1 Problem Definition

Materialized view selection [31] problem can be illustrated as follows: Given some storage space X and query set Z, the problem is to determine a set of materialized views V that reduce the total query processing cost of the selected views, under the constraint that the total space occupied by selected views, V, is less than X. View selection is an important decision in the effective design of data warehouse. We have used the lattice framework [5] in determination of materialized view selection as lattice framework captures and models dependencies among aggregate views. Group-by clause [32] characterizes a data cube (DC). A path exists between the two cubes ci and cj if there is a dependency relation ci ≤ cj, exists between the two cubes ci and cj, implying if a query can be resolved from ci, then it can also be resolved by using cube cj. 2 N data cubes are possible in the lattice framework, for a fact relation having N dimensions. For example, sales transactions in a data warehouse system (taken from TPC-H star schema benchmark [33]) have three dimensions, part (P), supplier (S), and customer (C), and in fact depicts the sale of parts(P) from suppliers(S) to customers(C). This fact relation will generate 23 = 8 data cubes in the lattice as shown in Fig. 1.

Fig. 1
figure 1

Lattice framework with 8 possible data cubes [5]

Each level of the lattice has data cubes which are aggregated at same level of aggregation along different dimensions. The number beside each cube indicates its size (i.e., number of rows). The top cube, i.e., psc, is the base cuboid at the lowest level of aggregation. Bottom cuboid has highest level of aggregation. If a query can be directly answered from cube (*, s,*) then it can also be answered from any of its parent cubes (p, s,*), (*, s, c), or (p, s, c) by summarizing the data along some dimensions [26].

2.2 Mathematical Model of Materialized View Selection Problem

Selection of materialized views is crucial for designing an efficient data warehouse. It aims at reducing the query response time by selecting an optimal set of materialized views within the storage space and cost considerations, to help accelerating the entire data warehouse. We have used the lattice framework, where cubes correspond to the views to be selected, whenever a query is invoked by user. Query invoking frequency corresponds to the cube invoking frequency. We have followed the linear cost model as proposed in [5], for evaluating the cost of answering query. This query answering cost is equivalent to the no. of rows to be accessed in the corresponding cube for the query. So, the materialized view selection (MVS) using lattice framework [9] can be defined as follows: Given a cube-lattice X, having a set of s cubes B = (b1, b2, …, bs), set of y user queries U = (u1, u2, …, uy), a set of query frequency values W = (wu1, wu2,…, wuy), set of update frequency values CF = (gb1, gb2, …, gbs) of the cubes in B, constrained by storage space T. Our objective function is to select a set of cubes (views) P to minimize the cost function defined in Eq. 1, under the space constraint \( \sum\nolimits_{b \in P} {\left| B \right| \le {\text{T}}} \) [9].

$$ \sum\nolimits_{{{\text{i}} = 1}}^{\text{k}} {{\text{wui*R}}\left( {{\text{ui }},\,{\text{P}}} \right)} + \sum\nolimits_{b \in P} {{\text{gb*M}}\left( {{\text{b }},{\text{P}}} \right)} $$
(1)

where R(ui, P) and M(b, P) depict the cost to evaluate query ui and the maintenance cost of cube b, with reference to the set of materialized cubes (views) P.

3 Brief Review of Backtracking Search Optimization Algorithm

Backtracking search optimization algorithm (BSA) is one of the most popular evolutionary algorithms, for finding solution to global optimization problems. It has a simple structure, i.e., fast and efficient in solving various problems. BSA retains a memory in which it caches a population from previous generation and uses the past generated solutions while searching for solutions with superior fitness values. It has strategies for generating trial populations by taking favor of the experiences gained from previous generations and gives BSA very powerful capabilities for exploring and exploiting the population [30]. Random mutation and non-uniform crossover strategy of BSA produces very useful trial populations in each generation and enhances its ability to solve problems. It can be described by partitioning its functions into five major processes: initialization, selection-I, mutation, crossover, and selection-II [30]. Algorithm 1 presents the pseudocode for BSA.

Algorithm1: Pseudocode for BSA [30]

4 BSA-Based Materialized View Selection (BSMVSA)

The steps of the BSA-based algorithm for materialized view selection are elaborated below.

  1. Step1:

    Population initialization: In this step, a set of views V is created constrained by the rule that the total space occupied by V is less than S as shown in (2) and the objective function, i.e., cost function of (1) is defined as

    $$ {\text{V}}_{\text{ab}} \sim {\text{U}}\,\left( {{\text{x}},{\text{y}},{\text{z}}} \right) $$
    (2)

    where b = 1, 2, … z, a = 1, 2, …, y, x represents the number of cubes (views), z corresponds to the problem dimension, and y represents the number of user queries. Each Va is a target individual in the view set V. U is the uniform distribution. Fitness function, of the initial population (set of views V), fitness Va is initialized as the objective function.

  2. Step2:

    Selection-I: Selection-I of BSMVSA gives the historical population oldV according to (3). It is used for directing the search toward the set of

    $$ {\text{oldV}}_{\text{ab}} \sim {\text{U}}\left( {{\text{x}},{\text{y}},{\text{z}}} \right) $$
    (3)

    The historical population oldV can be redefined at the start of each iteration by the ‘if-then’ rule shown in (4):

    $$ {\text{if}}\,{\text{m}} < {\text{n}}\,{\text{then }}\,{\text{oldV}}: = {\text{V}}|{\text{m}},{\text{n }} \sim {\text{U}}\left( {0,\,1} \right), $$
    (4)

    After determining oldV, (5) is used to alter the order of views in oldV randomly.

    $$ {\text{oldV}}: = {\text{permuting}}\left( {\text{oldV}} \right) $$
    (5)
  3. Step3:

    Mutation: BSMVSA’s mutation process gives the basic form of trial set of views,M using (6)

    $$ {\text{M}} = {\text{V}} + {\text{R}}.\,\left( {{\text{oldV}}{-}{\text{V}}} \right) $$
    (6)

    In this, R manages the amplitude of (oldV − V). The value of R = 3.rndn.

  4. Step4:

    Crossover: During the crossover, BSMVSA generates the final form of the trial set of views S with the help of two steps as shown in Algorithm1 (Lines 15–21). The boundary control mechanism in (Algorithm1, lines 29–34) is used to reconstruct the individuals.

  5. Step5:

    Selection-II: In this stage, the trial set of views Si’s having better fitness values than the analogous set of views Vi’s are used to update the Vi’s.

5 Experimental Results

We have implemented BSMVSA using MATLAB and conducted experiments by running it over standard test data sets of TPC-H star schema benchmark [33]. To test and validate the effectiveness of BSMVSA, we have compared it with PSO and genetic algorithm. The main parameters of BSMVSA are as follows: population size k = 50, problem dimension d = 3, and mix rate = .5. We have calculated the cost function on varying the dimensions of lattice, query invoking frequency, and cube invoking frequency to show the effectiveness of BSA for materialized view selection.

5.1 Considering Different Space Constraints

If unlimited space is provided to store materialized views, then all the cubes can be materialized to attain minimum query processing cost. But providing unlimited space is infeasible, so we considered different cases of space constraints as 5, 10, 15, 20, 25, 30, 35, and 40% to examine the performance of the proposed algorithm in selecting views under different space constraints. It is from Fig. 2 that aimlessly increasing the storage space cannot reduce the cost. According to the test data set, the optimal and effective storage space in terms of cost is about 20% of the total views.

Fig. 2
figure 2

Results of BSMVSA under different space constraints

5.2 On Varying the Dimensions of Lattice

We have considered three cases, i.e., by using (a) three dimensions (b) four dimensions, and (c) five dimensions, respectively. We have used the part of TPC-H benchmark [33]. For three dimensions, we chose Product p, Customer c, and Supplier s from the test set. For the fourth dimension, we included Time t dimension from the benchmark. For considering five dimensions, we added an additional dimension, Location l. The results of the experiments on changing the number of dimensions of lattice are depicted in Fig. 3a, b with uniform frequency sets and random frequency sets, respectively. It has been observed that on an exponential rise in the number of views, the processing cost of query increases linearly.

Fig. 3
figure 3

a Results of BSMVSA on varying lattice dimensions keeping the query frequency and update frequency uniform b Results of BSMVSA on varying lattice dimensions with random query frequency and update frequency

5.3 On Increasing the Number of User Queries

To examine the performance distribution of query processing cost achieved, we conducted an experiment on increasing the user queries. Keeping the query invoking frequency and cube invoking frequency uniform or random, the difference between query processing cost without materialization and query processing cost with materialization decreases on increasing the number of user queries as shown in Fig. 4a, b. This apparently depicts that our proposed algorithm, BSMVSA is scalable.

Fig. 4
figure 4

a Number of user queries versus query processing cost (keeping query frequency and update frequency uniform) b Number of user queries versus query processing cost (with random query frequency and update frequency)

5.4 Comparison with PSO and Genetic Algorithm

BSMVSA yields much better results than PSO and genetic algorithm, inspite of storage and invoking frequency of data set used. Thus, BSMVSA is a better option than PSO and GA in selecting materialized views with lower query processing cost (Fig. 5).

Fig. 5
figure 5

Comparative results using genetic, PSO, and BSMVSA

6 Conclusion

In this study, we have implemented BSA for selection of materialized views using lattice framework. Experiments were conducted using TPC-H benchmark data set on different lattice dimensions, on different frequency sets, and considering different cases of storage space. According to the experimental results, BSMVSA always generates a superior solution. The total cost of processing and maintaining the queries approaches a minimum when space is approximately 20% of size of all views. This clearly shows that any further increase beyond this amount does not notably reduce the cost. To prove its effectiveness, over other view selection methods, it is compared with PSO and genetic algorithm.