
1 Introduction

A data warehouse is a repository of data integrated from multiple sources and delivered to decision makers, supporting their complex OLAP queries and helping them make proper strategic decisions. Due to their huge volume, data warehouses’ response time to complex OLAP queries is a well-known problem. To shorten that response time, several optimization methods exist at different phases of building a data warehouse: indexing for the physical phase, and both fragmentation and view materialization for the logical phase. Unlike regular views, a materialized view stores the result of a query physically in a table. Queries executed on these tables are significantly faster than those executed on the whole raw data. However, every materialized view occupies additional storage space and needs to be maintained (updated) from the raw data periodically. In a data warehouse context, one always has to deal with limited resources; therefore, not all possible views can be materialized, due to storage space and maintenance constraints. Materialized View Selection (MVS) is the problem of choosing, among the universe of views, the set that minimizes the Total Processing Cost (TPC) when materialized, with respect to space or maintenance constraints. In this paper, we focus on the MVS problem with a storage space constraint.

The MVS problem is proven to be NP-hard. Therefore, several approaches using heuristics and meta-heuristics have been proposed in the literature.

2 Related Work

Any discussion of the MVS problem has to start with the first paper dealing with it, [5]. At that time, only the storage-space-constrained MVS problem was considered. The authors proposed a greedy algorithm named HRU, which selects views to materialize iteratively as follows: for every view not yet chosen for materialization, the algorithm computes its total benefit when materialized, and the view with the maximum benefit is then materialized. This step is repeated until there is not enough space to materialize further views. An extension to the selection of both views and indices was proposed in [1]. [3] also proposed greedy-based heuristics; the difference from HRU is that they select the view with the maximum benefit per unit of space. [6] compared different greedy-based algorithms on large problem instances (>10 dimensions), after proposing a search-space reduction algorithm based on the observation that some hierarchically related views may have the same size. Several evolutionary approaches have been proposed for the storage-space-constrained MVS problem, such as [9], in which the authors proposed a Genetic Greedy Algorithm that uses the greedy concept in its repairing function. The MVS problem with a maintenance constraint was first considered in [2], after noticing that storage space causes fewer problems as a constraint; the authors proposed an inverted-tree and an A-star algorithm. [14] applied a hybrid evolutionary algorithm consisting of two levels of processing: the higher-level algorithm searches for a good global processing plan among local processing plans based on queries, while the lower-level algorithm selects the best set of materialized views with the minimal total cost for a particular global processing plan. [8] proposed a genetic algorithm with a penalty function, and [13] extended [8] by replacing the penalty function with a stochastic ranking procedure.
In all of the aforementioned works, the authors used one of three frameworks to represent all possible views. The first one is the lattice framework introduced in [5]; the other two are the Multiple View Processing Plan (MVPP) and the AND-OR graph.

3 Background

3.1 Lattice Framework

V. Harinarayan et al. [5] introduced a directed acyclic graph G = (V, E) called a lattice to represent the dependencies between all possible views of a star schema, where each view is characterized by a group-by clause. The vertices V(G) of this graph represent the set of views or queries, while the edges E(G) represent the relations between those queries. For a pair of vertices (x, y), we say that x depends on y (x \(\le \) y) if the query x can be answered using the result of y. An edge \((v_i, v_j)\) between two views \(v_i\) and \(v_j\) exists only if \(v_i\) is the “immediate proper ancestor” of \(v_j\), which means that \(v_j \le v_i\) and \(\not \exists v_k\) such that \(v_j \le v_k \wedge v_k \le v_i\), for any three distinct \(v_i, v_j\) and \(v_k\). Fig. 1 is the lattice representation of the view dependencies of the star schema given in the same figure.
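To make the lattice construction concrete, the following minimal sketch (illustrative Python, with hypothetical names) enumerates the views of a star schema as sets of group-by attributes, where x depends on y iff x's attributes are a subset of y's, and derives the "immediate proper ancestor" edges:

```python
from itertools import combinations

def build_lattice(dims):
    """Enumerate all views of a star schema as frozensets of group-by attributes."""
    views = [frozenset(c) for r in range(len(dims) + 1)
             for c in combinations(dims, r)]
    # an edge (y, x) exists iff y is the immediate proper ancestor of x:
    # x < y (proper subset) and no view z sits strictly between them
    edges = [(y, x) for x in views for y in views
             if x < y and not any(x < z < y for z in views)]
    return views, edges

# a 3-dimension schema yields the 2^3 = 8 views of the lattice
views, edges = build_lattice(["part", "supplier", "customer"])
```

For 3 dimensions this produces 8 views and 12 cover edges, matching the shape of the lattice in Fig. 1.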

Fig. 1.

Example of a lattice framework representing dependencies between views of a star schema

3.2 Space-Constrained MVS Problem

The vertices and edges of the lattice are weighted in the following way:

  • Every vertex \(v \in V(G)\) has three weights:

    • \(r_v\): the initial data scan cost.

    • \(f_v\): query frequency.

    • |v|: the storage space needed to materialize the view v itself.

  • Every edge \((v,u) \in E(G)\) has one weight:

    • \(w_q(v,u)\): processing cost of query u using v.

An additional vertex \(v_+\) is added to the graph; this vertex represents the raw data from which every query can be computed. For every pair of queries \((u, v)\), we call \(q(u, v)\) the query processing cost of u using v. \(q(u, v)\) is the sum of the query processing costs of every edge in the shortest path between u and v, added to the initial data scan cost of v (\(r_v\)). If no path links u to v, the vertex \(v_+\) is used. Let M be the set of views to be materialized, and \(q(v, M)\) the minimum query processing cost of v in the presence of M. The space-constrained materialized view selection (MVS) problem can be formulated as follows:

  • Select a set of views M to materialize which minimizes \(T(G, M)\), where:

    $$\begin{aligned} T(G,M) = \sum _{v \in V(G)} f_v . q(v,M) \end{aligned}$$
    (1)
  • Under the constraint \( H(G,M) \le S\), where:

    $$\begin{aligned} H(G,M) = \sum _{v \in M} |v| \end{aligned}$$
    (2)
  • S is the storage threshold not to be exceeded by the views of M.
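The objective and constraint above can be sketched directly. The snippet below is illustrative only (hypothetical names), and simplifies \(q(v, M)\) with the linear cost model of [5], where answering a query from a view or from the raw data \(v_+\) costs the size of that view:

```python
def q(v, M, size, depends, v_plus):
    """Minimum processing cost of query v given materialized set M.
    Simplified linear cost model: answering v from u costs |u|;
    the raw-data vertex v_plus is always available as a fallback."""
    candidates = [u for u in M if depends(v, u)] + [v_plus]
    return min(size[u] for u in candidates)

def total_processing_cost(views, M, freq, size, depends, v_plus):
    """T(G, M) = sum over all views of f_v * q(v, M)   (formula 1)"""
    return sum(freq[v] * q(v, M, size, depends, v_plus) for v in views)

def total_space(M, size):
    """H(G, M) = total storage occupied by M           (formula 2)"""
    return sum(size[v] for v in M)
```

A feasible M must satisfy `total_space(M, size) <= S`; the goal is to minimize `total_processing_cost` under that constraint.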

3.3 Quantum Inspired Evolutionary Algorithm Overview

The smallest unit of information in quantum computing is called a q-bit. Each q-bit is defined by two complex numbers \(\alpha \) and \(\beta \) verifying \(|\alpha |^2+|\beta |^2=1\): \(|\alpha |^2\) represents the probability of the q-bit being in the state ‘0’, while \(|\beta |^2\) represents the probability of it being in the state ‘1’. In QEA, each individual q is represented by a quantum vector of length m (m being the length of solutions), where each column is a single q-bit, which means: \(|\alpha _i|^2+|\beta _i|^2=1, i=1,2,...,m. \)

The q-bit representation of solutions has the advantage of being able to represent a linear superposition of states, i.e., a single quantum vector can refer to multiple solution states with different probabilities [4], which gives QEA better population diversity. Like any independent evolutionary algorithm, QEA has its own operators; we mention measurement, quantum interference and mutation. Measurement is the operation allowing QEA to generate a solution vector from a quantum vector. Quantum interference intensifies the search around the best self/global solution: it consists of moving the state of each q-bit in the direction of the corresponding bit’s value in the best solution so far, which is accomplished by a rotation whose angle is a function of the amplitudes \(\alpha _i\) and \(\beta _i\). Like evolutionary mutation, quantum mutation is used to inspect some of the current solution’s neighbors with some defined probability.
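The measurement operator described above can be sketched in a few lines (illustrative code; the q-bits are modeled as plain (\(\alpha\), \(\beta\)) pairs):

```python
import random

def measure(qvec):
    """Collapse a quantum vector [(alpha, beta), ...] into a binary solution:
    bit i is set to 1 with probability |beta_i|^2."""
    return [1 if random.random() < abs(b) ** 2 else 0 for (a, b) in qvec]

# a q-bit with beta = 1 always collapses to 1, with beta = 0 always to 0;
# intermediate amplitudes give a genuine superposition of solutions
```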

3.4 Differential Evolution

The Differential Evolution (DE) algorithm was first introduced not as a separate evolutionary algorithm but as a variation of the genetic algorithm [12]. It is considered a unique EA because it is not biologically motivated. DE is a population-based algorithm designed to optimize functions in an n-dimensional continuous domain [12]. Each individual in the population is an n-dimensional vector representing a candidate solution to the problem. The basic idea of the DE algorithm is as follows:

Like every EA, we first randomly generate a set of individuals \(X=\{x_1,x_2,x_3,...,x_n\}\). Then we proceed in the following way:

  1. Mutant vector generation: For each individual \(x_i\), a mutant vector \(v_i\) is generated using three different random individuals \(x_{r1}, x_{r2}\) and \(x_{r3}\), all distinct from \(x_i\), following formula (3):

    $$\begin{aligned} v_i = x_{r1} + F(x_{r2} - x_{r3}) \end{aligned}$$
    (3)

    where \(x_{r1}, x_{r2}\) and \(x_{r3}\) are three distinct vectors, and F is a real number used to scale the difference between \(x_{r2}\) and \(x_{r3}\).

  2. Trial vector generation: The mutant vector \(v_i\) is then combined (crossed over) with the individual \(x_i\) itself to generate a trial vector \(u_i\). Crossover is done as follows (formula (4)):

    $$\begin{aligned} u_{ij} = \left\{ \begin{array}{l} v_{ij},\;\; if (rand<c)\; or\; (j=Jr) \\ x_{ij},\;\; otherwise \end{array} \right. \end{aligned}$$
    (4)

    for every j in [1, n], where n is the problem size and also the size of \(u_i\), \(v_i\) and \(x_i\); \(u_{ij}\) is the \(j^{th}\) component of \(u_i\), \(v_{ij}\) the \(j^{th}\) component of \(v_i\), and \(x_{ij}\) the \(j^{th}\) component of the individual \(x_i\); rand is a random number drawn from the uniform distribution on [0, 1]; c is the constant crossover rate (in [0, 1]); and Jr is a random integer drawn from the uniform distribution on [1, n].

  3. Best fit choice: After the trial vector \(u_i\) is created, it is compared to the individual \(x_i\), and the fitter vector of each pair \((x_{i}, u_{i})\) is kept for the next generation of the DE algorithm; the less fit one is discarded (formula (5)):

    $$\begin{aligned} x_{i+1} = \left\{ \begin{array}{l} u_{i},\;\; if\; (f(u_{i})<f(x_{i})) \\ x_{i},\;\; otherwise \end{array} \right. \end{aligned}$$
    (5)
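The three steps above can be sketched as one DE generation over a population of real-valued vectors (illustrative code; parameter values F and c are placeholders, not the tuned settings of Table 2):

```python
import random

def de_step(pop, f, F=0.5, c=0.9):
    """One generation of classic DE on a list of real vectors,
    minimizing the fitness function f. Requires len(pop) >= 4."""
    n = len(pop[0])
    new_pop = []
    for i, x in enumerate(pop):
        r1, r2, r3 = random.sample([j for j in range(len(pop)) if j != i], 3)
        # mutation: v = x_r1 + F * (x_r2 - x_r3)                  (formula 3)
        v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(n)]
        # binomial crossover with guaranteed index Jr             (formula 4)
        Jr = random.randrange(n)
        u = [v[j] if random.random() < c or j == Jr else x[j] for j in range(n)]
        # best fit choice: keep the fitter of (x, u)              (formula 5)
        new_pop.append(u if f(u) < f(x) else x)
    return new_pop
```

Because of the greedy selection in the last step, the best fitness in the population never degrades from one generation to the next.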

4 QDE Algorithm

In this paper we propose a new hybrid algorithm (QDE) using both QEA and DE algorithms. In QDE, the population is a set of quantum vectors. Each quantum vector \(Q_i\) is considered as a vector. After measurement, a set of solutions \(X_i\) is generated from every \(Q_i\) vector. Then, mutant vectors generation and trial vectors generation are applied on every vector \(Q_i\) like in the original DE algorithm. The best fit choice is also applied on both the original \(Q_i\) population and the new trial vectors based on the fitness of their corresponding solutions after applying measurement. After that, the Quantum interference operator of QEA is used on the new \(Q_i\) generation to improve it by rotating towards the global best solution. The above steps are repeated until some stop criteria are satisfied.
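A compressed, illustrative sketch of this loop is given below. It is not the authors' exact implementation: the quantum vectors are angle (\(\theta\)) vectors, fitness comparisons are made on fresh (hence noisy) measurements, and the interference rule shown is a simplified stand-in for the look-up table of [7]; all parameter values are placeholders.

```python
import math, random

def measure(theta):
    # collapse a theta vector: bit i is 1 with probability sin^2(theta_i)
    return [1 if random.random() < math.sin(t) ** 2 else 0 for t in theta]

def qde(fitness, n, pop_size=20, F=0.5, c=0.9, gamma=0.05, iters=100):
    pop = [[random.uniform(0, math.pi / 2) for _ in range(n)]
           for _ in range(pop_size)]
    fit = lambda q: fitness(measure(q))  # noisy: evaluates one measurement
    best, best_f = None, float("inf")
    for _ in range(iters):
        # DE mutation / crossover / selection act on the theta vectors
        for i in range(pop_size):
            r1, r2, r3 = random.sample(
                [j for j in range(pop_size) if j != i], 3)
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(n)]
            Jr = random.randrange(n)
            u = [v[j] if random.random() < c or j == Jr else pop[i][j]
                 for j in range(n)]
            if fit(u) < fit(pop[i]):
                pop[i] = u
        # track the global best measured solution
        for q in pop:
            x = measure(q)
            fx = fitness(x)
            if fx < best_f:
                best, best_f = x, fx
        # simplified interference: rotate each theta toward the best bits
        for q in pop:
            for j in range(n):
                step = gamma if best[j] == 1 else -gamma
                q[j] = min(math.pi / 2, max(0.0, q[j] + step))
    return best, best_f
```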

The main idea of the QDE is outlined in the Fig. 2 and its pseudo code is given in Algorithm 1.

Fig. 2.

Main steps of QDE Algorithm

4.1 Solution Representation

In this work, we represent every solution by a binary vector \(X_i = (x_1, x_2, x_3, ..., x_N)\), where N is the total number of views. A component \(x_i\) takes the value 1 if the view \(v_i\) is materialized, and 0 otherwise. As shown in Fig. 3, the binary vector X is the representation of the solution given by the lattice L (the lattice structure of a star model with 3 dimensions), in which the materialized views are shaded in grey. In this example, the views \(v_2\) and \(v_3\) are materialized, which corresponds to the value 1 in the second and third positions of the vector X, whereas the rest of the views are not materialized, hence the value 0 in the remaining positions.

Fig. 3.

Example of solution representation

4.2 Q-Bit Representation

Since \(|\alpha |^2 + |\beta |^2 =1\) is the equation of a unit circle, each point on its perimeter can be represented by a single variable \(\theta \), with Cartesian coordinates \(\cos \theta \) and \(\sin \theta \), where \(\theta \) is defined in \([0,2\pi ]\). Therefore, instead of using \(\alpha \) and \(\beta \) to represent a q-bit as in QEA, we use the variable \(\theta \), from which the values of \(\alpha \) and \(\beta \) can be inferred. This not only saves space, but also makes the use of the quantum interference operator easier, because the rotation angle is added directly to the angle \(\theta \) itself.
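The angle encoding can be stated in two one-line helpers (illustrative names):

```python
import math

def amplitudes(theta):
    """Recover the q-bit amplitudes (alpha, beta) from the single angle theta;
    the normalization |alpha|^2 + |beta|^2 = 1 then holds automatically."""
    return math.cos(theta), math.sin(theta)

def interfere(theta, gamma):
    """Quantum interference under the angle encoding: the rotation angle
    gamma is simply added to theta."""
    return theta + gamma
```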

4.3 Quantum Interference Application

As mentioned in the section above, quantum interference is applied by adding the rotation angle \(\gamma \) directly to the angle \(\theta \). In order to apply this operator, we use the same look-up table as in [7], described in Table 1.

Table 1. Look up table from [7]

The choice of the rotation angle \(\gamma \) is very important: a badly chosen value can lead to premature convergence, very late convergence, or divergence.

4.4 Discrete Vs Continuous Search Space

Both the QEA and DE algorithms are designed to solve continuous problems: solution vectors move around a search space within a continuous real domain. The MVS problem, however, has a binary discrete search space, so some adaptation of those algorithms is needed. Several works, such as [10, 11], use transfer functions, which interpret a position in the continuous search space as a probability of being at the value 0 or 1. In this work, we follow a similar approach. Instead of a population of solution vectors, the population is a set of quantum vectors, which generate solutions using the measurement operator mentioned earlier. So, instead of manipulating the actual position of each solution, we manipulate the probability of each position being 0 or 1. This way, any algorithm dedicated to continuous search spaces can be used to solve a problem with a binary discrete search space.

4.5 Dealing with Unfeasible Solutions

Since MVS is a constrained combinatorial optimization problem, several solutions generated during the algorithm’s run will not respect the constraint, forming what are called unfeasible solutions. One of the most commonly used methods to deal with this problem is a penalty function, which penalizes unfeasible solutions and ensures that even if they have good fitness, they will not be ranked highly among other solutions. In our work, we do not use a penalty function. Instead, we use a repairing function: we keep dematerializing the smallest view, iteration after iteration, until the constraint is no longer violated. This ensures that, after measurement, there are no unfeasible solutions.
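The repairing function described above can be sketched as follows (illustrative code; `sizes[i]` stands for \(|v_i|\) and S for the storage threshold):

```python
def repair(solution, sizes, S):
    """Repair an unfeasible binary solution by dematerializing the smallest
    materialized view until the space constraint H(G, M) <= S holds."""
    materialized = [i for i, bit in enumerate(solution) if bit == 1]
    while materialized and sum(sizes[i] for i in materialized) > S:
        smallest = min(materialized, key=lambda i: sizes[i])
        solution[smallest] = 0          # dematerialize the smallest view
        materialized.remove(smallest)
    return solution
```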

4.6 The “Reset” Operator

As in most swarm algorithms, every single particle seeks the best solution. With a lack of diversity, after some iterations the particles stagnate at the same position (around the best solution); consequently, the population prematurely converges to a local optimum. To deal with this issue, we introduce the reset operator (also known as reseeding or extinction) to ensure escape from local optima. If the best global solution does not change for a certain number of iterations, the reset operator is applied by replacing all the quantum vectors with new ones. By doing that, while keeping the best global solution found so far, we ensure that those vectors will seek the best solution from different starting points, covering new areas of the search space and ensuring diversity for the whole population. In QDE, the number of iterations needed to trigger the reset operator is set to 100.
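The reset operator amounts to a stall counter check (illustrative sketch, using the angle encoding of Sect. 4.2; the global best is assumed to be kept outside the population):

```python
import math, random

def maybe_reset(pop, stall_counter, n, limit=100):
    """Reset operator: if the global best has not improved for `limit`
    iterations, replace every quantum (theta) vector with a fresh random
    one and clear the counter; otherwise leave the population untouched."""
    if stall_counter >= limit:
        pop = [[random.uniform(0, math.pi / 2) for _ in range(n)]
               for _ in range(len(pop))]
        stall_counter = 0
    return pop, stall_counter
```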

5 Experimental Results

In this section, we demonstrate the effectiveness of the proposed QDE algorithm against the well-known HRU algorithm of [5]. We also test it against the Genetic Algorithm, one of the most commonly used evolutionary algorithms for solving the MVS problem with both space and maintenance constraints. The tests were conducted on an Intel Core i5 2.4 GHz PC with 8 GB of RAM. After several tuning tests, the parameters were set as shown in Table 2.

Table 2. Parameters used for QDE/GEA implementation

We chose to compare QDE to HRU on 5- and 6-dimension lattices. Table 3 shows the solution fitness (TPC) given by HRU, compared to the average over 30 independent runs of QDE with 1000 iterations each. For the 5-dimension lattice, QDE outperforms HRU for almost all space thresholds; both provide the same TPC for the space thresholds 0.7, 0.2 and 0.1, which is due to the small search space of this lattice. For the 6-dimension lattice, QDE provides better solutions for all space thresholds.

Table 3. QDE vs HRU for 5-dimension and 6-dimension lattices

Tables 4, 5 and 6 contain the results of 30 independent runs of QDE and GEA on 5-, 6- and 7-dimension lattices, respectively. We kept track of the mean, min and max TPC found during these runs.

Table 4. QDE vs GEA for 5-dimension lattice

Observing Table 4, one can see that QDE outperforms GEA for all thresholds. What appears to be the best solution found by GEA (min TPC) is almost always the mean TPC of QDE; i.e., QDE found that solution in every single run, since mean = min = max for all thresholds except 0.6, as shown in Table 4.

As for the 6-dimension lattice, Table 5 shows that QDE finds better solutions, since its mean, min and max TPC are all better than GEA’s, except for thresholds 0.1 and 0.3, where GEA finds a similar min TPC.

Finally, for the 7-dimension lattice (Table 6), the superiority of QDE over GEA becomes very clear: QDE outperforms GEA on all the statistical indicators (min, max and mean TPC), for all thresholds without exception. This demonstrates that the quality of QDE’s solutions scales with problem size.

Table 5. QDE vs GEA for 6-dimension lattice
Table 6. QDE vs GEA for 7-dimension lattice

6 Conclusion

In this paper, we adapted a binary version of the Differential Evolution (DE) algorithm to solve the MVS problem with a space constraint. We used the lattice structure of [5] to represent the hierarchy between the different views. The adaptation of the DE algorithm to the discrete binary case was done by merging the DE algorithm with the QEA algorithm, instead of using transformation functions such as sigmoid or V-shaped functions. The experimental results show the efficiency of the proposed QDE algorithm in comparison with the HRU and GEA algorithms, as QDE’s TPC values are significantly better. As a perspective, we want to investigate the use of QEA alongside other evolutionary algorithms, in order to further study the efficiency of QEA when used instead of transformation functions.