
1 Introduction

The rise in data generation by devices has raised challenges in finding optimal ways to visualize data. Earlier, visualization was conducted through static tools like Tableau, Microsoft Excel and other related applications [12]. Today, the collected datasets are humongous, with multiple attributes [10]; thus, view selection (VS) is a critical issue. Manually determining the most appropriate view from a list of offered views is tedious and time-consuming. Thus, there is a pressing need to dynamically allocate views based on on-demand queries fired by users. The difficulty is that a query depends on numerous factors, such as similarities in view generation, informational semantics, grouping of data and aggregation. This requires scanning the entire materialized query table (MQT) every time a query is fired, which is cost-ineffective.

A static or a dynamic approach can be used for view materialization (VM). Static VM materializes views before the workload executes and keeps them unchanged until the last statement of the workload; if a database object updates the underlying relations or tables in between, those changes are not immediately propagated for the remaining workload, which creates inconsistency in the database. In contrast, a dynamic view is updated automatically whenever a database object modifies the database. The VM process is illustrated in Fig. 1. Incremental approaches have been proposed that define multiple performance metrics, such as accuracy of view updates, misses in base tables and latency in fetching queries. The performance parameters include optimizing a query sequence and involve view complexity, memory constraints and view selectivity [8]. A materialized view is a database object that contains the result of a query; it may be a local copy of remotely located data, a join result or a summary computed by an aggregate function. The collection of views is chosen so that most of the queries can be answered from it, thereby reducing the overall query execution time. The drawback of VM is that views need to be refreshed whenever an update happens in the base tables; hence, the overall system does not scale as the number of users increases. Table 1 presents a comparative analysis of different existing approaches for VM.

Fig. 1 Sequence flow of view materialization

Table 1 Comparative table for existing approaches for VM

In the literature, many authors have proposed solutions to VM challenges such as scalability, information extraction and retrieving patterns from raw data. The conditions under which a view needs to be materialized [21], materialization in hybrid integration [3] and the optimization of materialization in SPARQL [28] are discussed. A PROforma-based approach for view updating [19] and the use of discrete genetic operators and optimization techniques for materializing views [18] have been presented. Overlapping sub-expressions [16] to identify common expressions for materialization and a game-theory-based framework [6] for view selection are discussed. A priority-based materialization method [13], multi-clustering [27] to remove noisy views from the materialization process and cross-row constraints [11] in materialization are addressed. Data mining techniques [22] and frequent pattern analysis [24] have been proposed to support association rule mining (ARM) applications such as word embeddings in textual data [15], label extraction in intrusion detection systems [5] and decentralized storage patterns in blockchain [17]. The authors in [26] addressed the problems of selecting views dynamically and dropping materialized views (MVs). Researchers in [4, 7] use a clustering method to find clusters of closely related queries. Phan et al. [20] describe an automated, dynamic materialized query table management scheme that materializes views.

Materialized views can be managed by a least recently used (LRU) policy. Hossein et al. [6] proposed a game theory approach where one player is greedy for the query processing time and the other player is greedy for the view maintenance cost. The main advantage of the approach is flexibility; however, as the number of parameters increases, finding the materialized set becomes complex.

1.1 Motivation

As the amount of data on the server is huge, views need to be processed and stored efficiently in a systematic storage system where they can be frequently accessed based on business logic. To minimize query execution time, we cannot bring the whole relation or database into memory; we need to focus only on the selected portion that is sufficient to answer the maximum query workload. Many authors propose data mining [4], pattern analysis [23, 25], label extraction [5] and game theory approaches [6] to address VM issues. However, reducing the size of relevant data in memory remains important. Hence, motivated by the same, DAMS proposes a dynamic VM approach with twofold benefits: first, it reduces query execution time by materializing selected candidate views through the proposed attribute selection algorithm based on association rules; then, views are prioritized for materialization by clustering them on the resulting confidence and support metrics.

1.2 Research Contributions

The research contributions of this paper are as follows:

  • A novel attribute selection algorithm based on association rule mining is presented for VS over historical queries.

  • Selected views are clustered based on the support and confidence metrics and then prioritized for materialization.

1.3 Organization of the Paper

The paper is organized into five sections. Section 1 gives the introduction and the motivation behind the paper, and Sect. 2 presents the key terminologies in the view materialization process. Section 3 confers the proposed approach and architecture. Section 4 presents the research challenges and future scope in the materialization process and in solving associated attribute sets. Finally, Sect. 5 presents the conclusion.

2 View Materialization: Key Terminologies

For view materialization, we need to consider both the space and the processing time of a query as constraints on the system. Whenever a database object updates any information in a base table, those changes must be propagated to the view. Hence, the key challenge is to maintain consistency in the materialized view. Some systems re-compute the materialized view from scratch, which is undesirable because re-computation over the base tables consumes more time. Instead, the views should be re-computed in an incremental fashion.
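To make the incremental idea concrete, the following is a minimal Python sketch (an illustration under our own assumptions, not the paper's implementation) of a SUM-style materialized view that is refreshed incrementally: each base-table change propagates only its delta instead of triggering a full re-computation. The class and column names are hypothetical.

from collections import defaultdict

class IncrementalSumView:
    """Materialized view of SUM(amount) GROUP BY key, refreshed incrementally."""

    def __init__(self, base_rows):
        # Full computation happens only once, at materialization time.
        self.view = defaultdict(float)
        for key, amount in base_rows:
            self.view[key] += amount

    def on_insert(self, key, amount):
        # Propagate only the delta instead of rescanning the base table.
        self.view[key] += amount

    def on_delete(self, key, amount):
        self.view[key] -= amount

    def on_update(self, key, old_amount, new_amount):
        self.view[key] += new_amount - old_amount

# Usage: the view stays consistent with the base table after every change.
mv = IncrementalSumView([("Mumbai", 100.0), ("Delhi", 250.0), ("Mumbai", 50.0)])
mv.on_insert("Delhi", 75.0)          # base-table INSERT propagated
mv.on_update("Mumbai", 50.0, 80.0)   # base-table UPDATE propagated
print(dict(mv.view))                 # {'Mumbai': 180.0, 'Delhi': 325.0}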

2.1 Performance Evaluation Parameters for View Materialization

Selectivity of View: The selectivity of a view can be defined as follows [21]:

$$ \text{View\_selectivity} = \frac{N_{\text{rowquery}}}{N_{\text{rowview}}} $$

where Nrowquery denotes the number of rows that qualify the view's defining query (i.e., the rows present in the view) and Nrowview denotes the number of rows in the base relation over which the view is defined. Computing views with different parameters and constraints therefore yields different view selectivities over the same database.

To understand this better, consider the following example:

CREATE VIEW sales_view_1 AS SELECT att_1, att_2 FROM sales WHERE att_3 = k1 HAVING SUM(att_4) > k2;

Suppose the total number of rows in the base table is 5000 and the number of rows in sales_view_1 is 1500. Now change the constraints to WHERE att_3 > k1 AND att_3 < k2 GROUP BY att_5 HAVING AVG(att_4) > k3, and suppose the resulting rows in sales_view_1 become 1000. The difference in view selectivity due to different constraints on the same database table is now apparent:

View selectivity (sales_view_1, first constraints) = 1500/5000 = 0.3

View selectivity (sales_view_1, second constraints) = 1000/5000 = 0.2

View Complexity: View complexity arises from joins between two or more relations in combination with WHERE, GROUP BY and HAVING clauses, because a view with a JOIN plus further clauses is complex and takes more time to compute. Compare the complexity of sales_view_1 and sales_view_2 below.

CREATE VIEW sales_view_1 AS SELECT att_1, att_2 FROM sales WHERE att_3 = k1;

CREATE VIEW sales_view_2 AS SELECT att_1, att_2 FROM sales WHERE att_3 = k1 GROUP BY att_5 HAVING AVG(att_4) > k2;

Database Size: The size depends on the organization and its business requirements, and on how frequently data is accessed from multiple resources for analysis and fact finding. A small organization stores a relatively small database compared to an organization with huge amounts of data stored across different geographical locations. So, the size of the materialized view also depends on the organization and its requirements.

Query Optimization: Based on the views calculated and stored in the materialized query table, a few conditions are applied to check whether a view is capable of providing the answer; if not, the result must be fetched from the original base tables. Generally, one of two approaches is preferred: in the first, the attributes of the query are matched against the attributes of the view; in the second, the join attributes, selection conditions and aggregate functions of the workload query are compared with those of the materialized view. The following example defines a view and a query it may answer.

CREATE VIEW Employee_Department_V1 AS SELECT E.employee_name, D.department_name, D.department_location, M.employee_name AS Contractor FROM Employee E JOIN Department D USING (department_no) JOIN Employee M ON (M.employee_no = D.manager);

Query: SELECT E.employee_name, D.department_name FROM Employee E JOIN Department D USING (department_no) WHERE D.department_no = 101;
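To illustrate the first (attribute-matching) approach, the following is a minimal Python sketch under the assumption that queries and views are summarized by the sets of attributes they reference; the helper can_answer and the attribute sets below are hypothetical, not part of the paper. The second approach would additionally compare join predicates, selection conditions and aggregate functions.

def can_answer(query_attrs, view_attrs):
    # A view can serve a query only if it exposes every attribute the
    # query projects or filters on (a necessary, not sufficient, test).
    return set(query_attrs) <= set(view_attrs)

view_attrs = {"employee_name", "department_name", "department_location", "Contractor"}
query_attrs = {"employee_name", "department_name", "department_no"}

if can_answer(query_attrs, view_attrs):
    print("answer the query from the materialized view")
else:
    # department_no is filtered on but not exposed by the view.
    print("fall back to the base tables")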

A VS problem is driven by parameters such as the size of the view, the frequency of view updates [9], query prioritization [14] and the cost function. The approaches to dynamic view materialization are depicted in Fig. 2.

Fig. 2 Approaches to dynamic view materialization

3 DAMS: The Proposed Approach

In the proposed approach, a heuristic framework is presented. The workload is considered a sequence of statements {s1, s2, …, sn} to be executed in a specific order. Figure 3 shows the VM process. View selection mainly involves four steps:

Fig. 3 View creation

1. Preprocessing of queries to generate the attribute matrix

2. Identification of the association rules

3. Identification of clusters

4. View selection.

3.1 Attribute Matrix Generation Table

The attribute matrix generation algorithm generates an attribute matrix (Mkj), where k is the number of queries in the workload (Q) and j is the number of attributes present in the relation/table (A). Attset is the attribute set created from these attributes; initially it contains φ, and after the algorithm completes, Attset holds the collection of singleton attribute sets belonging to the queries Qj.

Algorithm 1: Attribute matrix generation
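Only the description of Algorithm 1 is reproduced here, so the following Python sketch reconstructs what the prose specifies: a k × j binary matrix over the workload plus the initial singleton attribute sets. Representing each query as the set of attribute names it touches is our assumption.

def attribute_matrix(workload, attributes):
    # M[k][j] = 1 iff attribute j appears in query k (the matrix Mkj above).
    matrix = [[1 if att in query_attrs else 0 for att in attributes]
              for query_attrs in workload]
    # Attset starts empty and ends as the singleton sets {At1}, {At2}, ...
    attset = [frozenset({att}) for att in attributes]
    return matrix, attset

# Hypothetical workload: each query reduced to the attributes it references.
attributes = ["At1", "At2", "At3", "At4", "At5"]
workload = [{"At1", "At3"},   # Q1 -> 10100
            {"At4", "At5"}]   # Q2 -> 00011
M, attset = attribute_matrix(workload, attributes)
print(M)   # [[1, 0, 1, 0, 0], [0, 0, 0, 1, 1]]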

That is, Attset = {{At1}, {At2}, {At3}, {At4}, …, {Atj}}. Based on the values of Mkj generated by Algorithm 1, a dimension table is presented in Table 2. All attributes present in the query workload {Q1, Q2, Q3, Q4, Q5, Q6} are depicted in the columns, and the individual rows represent the queries. In a cell, the value '1' indicates the presence and '0' the absence of the attribute in the specific query. Using this attribute matrix, we can correlate the set of queries in the workload and identify the clusters.

Table 2 DAMS: generated attribute matrix table

3.2 Mapping Association Rules to Find Clusters

The main issue is to find the view that is associated with the maximum number of queries, i.e., one that can answer a major portion of the workload.

Consider the following query:

SELECT SUM(salary) FROM EmployeeDB WHERE Branch_office = 'Mumbai' AND Joining_year = 2019 GROUP BY department;

In the above query, we first identify the rows with branch office 'Mumbai' and joining year 2019, then compute the sum of salaries using the aggregate function, grouping them on the basis of department. The same query can also be evaluated by first grouping the data on the basis of department for branch Mumbai and year 2019 and then applying the aggregate function. Both orders produce the same result, so we want whichever finds it faster. This is depicted in Fig. 4.
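A small Python sketch (with made-up rows) of the equivalence just described: filtering before grouping and grouping before filtering return the same totals, so the optimizer is free to pick whichever order is cheaper.

from collections import defaultdict

rows = [  # (department, branch_office, joining_year, salary) -- made-up data
    ("Sales", "Mumbai", 2019, 50_000),
    ("Sales", "Delhi",  2019, 55_000),
    ("IT",    "Mumbai", 2019, 70_000),
    ("IT",    "Mumbai", 2018, 65_000),
]

def filter_then_group(rows):
    totals = defaultdict(int)
    for dept, office, year, salary in rows:
        if office == "Mumbai" and year == 2019:   # filter first
            totals[dept] += salary                # then aggregate per group
    return dict(totals)

def group_then_filter(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)                # group first
    return {dept: sum(s for _, o, y, s in grp if o == "Mumbai" and y == 2019)
            for dept, grp in groups.items()
            if any(o == "Mumbai" and y == 2019 for _, o, y, _ in grp)}

assert filter_then_group(rows) == group_then_filter(rows)
print(filter_then_group(rows))   # {'Sales': 50000, 'IT': 70000}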

Fig. 4 Proposed architectural diagram for view materialization

Consider the set of items ITEM(I) = {Bread, Milk, Butter}, which are frequent parts of transactions, as shown in Table 3, where the presence and absence of an attribute are indicated with the numeric values 1 and 0, respectively. From a collection of transactions we can identify closely related attributes that occur together frequently. In online transactions, such rules help customers make decisions based on current trends, irrespective of the immediate requirement; this is how suggestions appear on e-commerce and social media platforms [4]. The rows of the attribute matrix are represented in the following manner.

Table 3 Mapping of generated associated values to rule sets

$$ Q1 = 10100, \quad Q2 = 00011 $$

The bit vector thus records the presence and absence of each attribute. Now, to find the correlation between multiple queries, we can make use of bitwise logical operators.
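A minimal Python sketch of this bit-vector idea, reusing the encodings Q1 = 10100 and Q2 = 00011 from the text (the third query is hypothetical): a bitwise AND between two query masks exposes exactly the attributes they share.

# Bit i set => attribute At(i+1) occurs in the query (leftmost bit = At1).
Q1 = 0b10100   # At1, At3 (from the text)
Q2 = 0b00011   # At4, At5 (from the text)
Q3 = 0b10110   # hypothetical third query: At1, At3, At4

def shared_attributes(a, b, n_attrs=5):
    # Names of the attributes common to both queries, via bitwise AND.
    common = a & b
    return [f"At{i + 1}" for i in range(n_attrs)
            if common & (1 << (n_attrs - 1 - i))]

print(shared_attributes(Q1, Q2))   # [] -> unrelated queries
print(shared_attributes(Q1, Q3))   # ['At1', 'At3'] -> candidates for one cluster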

3.3 Clustering and Prioritizing Queries

Based on the bit vectors supporting a query set, we identify the attributes, create clusters of them, prioritize the clusters based on their frequency in the transactions and use the clusters as a decision support system for view selection. Consider Table 4, which stores the bit value in the matrix for each query and attribute.

Table 4 To generate clusters and prioritize query sets

According to Agrawal et al. [1], the ARM problem is defined as follows. Let A = {At1, At2, At3, At4, …} be the set of attributes from the query workload that belong to a transaction; we call these items. Let Q = {Q1, Q2, Q3, Q4, Q5, Q6, …} be the set of queries in the workload. Each query carries a unique identification and contains a subset of the item set, and a rule is defined as X → Y, where X, Y ⊆ A and X ∩ Y = ∅ [2].

To identify interesting correlations from the set of transactions, constraints can be applied. The key constraints for drawing conclusions in this system are minimum support and minimum confidence thresholds.

Support: The support supp(A ⇒ B) of an item set A ∪ B is defined as the frequency with which items A and B occur together in the transaction data set.

$$ \text{Support}(A \Rightarrow B) = \frac{n(A \cup B)}{N} $$

where N is the total number of transactions. If, out of 1000 transactions, A and B occur together in 150, then Support(A ⇒ B) = 150/1000 = 0.15, or 15%.

Confidence: The confidence is defined as the strength of the association between items A and B, i.e., how likely B is to be selected when A is present.

$$ \text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \Rightarrow B)}{\text{Support}(A)} $$

For example, if the confidence value is 0.15/0.15 = 1 (i.e., 100%), then in every transaction where attribute A is present, B is also present, so we place both attributes in the same cluster. The purpose of the association rules is to create clusters based on the priority of the views. For prioritization, we have chosen the Apriori algorithm because of its efficient approach to categorizing dataset items [2]. To understand the concept, consider hypothetical data along with their frequencies in the query workload.
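Before walking through the data, here is a short Python sketch of the two metrics that reproduces the arithmetic above (support 0.15, confidence 1) on a made-up transaction list:

def support(transactions, itemset):
    # Fraction of transactions containing every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # conf(A => B) = supp(A ∪ B) / supp(A).
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Hypothetical data mirroring the text: A and B co-occur in 150 of 1000
# transactions, and A never occurs without B.
transactions = [{"A", "B"}] * 150 + [{"C"}] * 850
print(support(transactions, {"A", "B"}))        # 0.15
print(confidence(transactions, {"A"}, {"B"}))   # 1.0 -> same cluster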

Consider the item sets, starting with one and two attributes:

1. Attribute set: {{At1}, {At2}, {At3}, {At4}, {At5}}

2. Attribute set: {{At1, At2}, {At1, At3}, {At3, At4}, … and so on}

3. Attribute set: combinations of three attributes together, and so on; these attribute sets depend on the number of attributes present in the table.

So, we first find the frequent sets with a single attribute in the database, which we call L1, and likewise we continue computing up to Lk.

Lk = the set of frequent attribute sets of size k.

Step 1: Find the frequent sets with one attribute in the database.

L1 = {{At1}, {At2}, {At3}, {At4}, {At5}}

The next step is to generate the candidate sets with two attributes by observing L1.

C2 = {{At1, At2}, {At1, At3}, {At1, At4}, {At1, At5}, {At2, At3}, {At2, At4}, {At2, At5}, {At3, At4}, {At3, At5}, {At4, At5}}

Now scan the attribute table for each {Ati, Atj} occurring in a query Qi, where Ati, Atj ∈ {At1, At2, At3, At4, At5} and Qi belongs to the query workload.

L2 = {{At1, At2}, {At1, At3}, {At1, At4}, {At1, At5}, {At2, At3}, {At2, At4}, {At2, At5},{At3, At4}, {At3, At5}, {At4, At5}}

C3 is then generated by combining all possible sets of L2 that pass the minimum support and confidence thresholds.

C3 = {{At1, At2, At3}, {At1, At2, At4}, {At1, At2, At5}, … and so on}

Using C3, we obtain L3; let us assume only two sets finally qualify the minimum threshold.

L3 = {{At2, At3, At4}, {At2, At4, At5}}

C4 = φ, so we stop the clustering of attributes here because no further attribute set can qualify the threshold for Lk. We also calculate the support and confidence of every attribute set.

For example, for the attribute set {At2, At4, At5}, the following rules are evaluated:

Confidence({At2, At4} ⇒ {At5})

Confidence({At2, At5} ⇒ {At4})

Confidence({At4, At5} ⇒ {At2})

Confidence({At2} ⇒ {At4, At5})

Confidence({At4} ⇒ {At2, At5})

Confidence({At5} ⇒ {At2, At4})

Support({At2, At4} ⇒ {At5})

and so on; the remaining support values for this case can be calculated similarly.

This result helps us find the more relevant and correlated attribute sets, prioritize the strong associations and gauge how strongly attributes are connected; strongly connected attributes are kept in the same cluster and can be resolved by the same view.

Algorithm 2: Candidate attribute set generation for view selection

3.4 View Selection by Attribute Selection Using Candidate Sets

Building on the attribute matrix, Algorithm 2 finds the candidate sets of attributes that are frequent and therefore need to be materialized for efficient access and storage; these sets form the possible candidate views. Here k is the number of queries (Q), j is the number of attributes present in the relation/table (A), and Attset is the attribute set calculated by Algorithm 1, updated in each iteration using the Cartesian product of the previous sets, i.e., all pairwise combinations of the previous set elements. After the first iteration, Attset = {{At1, At2}, {At1, At3}, {At1, At4}, {At1, At5}, {At2, At3}, {At2, At4}, {At2, At5}, {At3, At4}, {At3, At5}, {At4, At5}}; after the second iteration, Attset = {{At1, At2, At3}, {At1, At2, At4}, {At1, At2, At5}, … and so on}. Cm is the candidate set in which the occurrences of each attribute set across the queries are counted, and the sets that qualify the minimum support count are moved into the frequent set Lm, where m denotes the number of iterations required to find the minimal attribute sets considered for the materialization process.
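Since only the description of Algorithm 2 is reproduced here, the following is a hedged Python sketch of the Cm/Lm candidate-generation loop it describes; the workload representation and the min_support_count parameter are our assumptions, not the paper's exact interface.

from itertools import combinations

def frequent_attribute_sets(workload, attributes, min_support_count):
    # Apriori-style search for attribute sets frequent across the workload,
    # following the Cm/Lm scheme described above.
    # L1: singleton sets that meet the minimum support count.
    current = [frozenset({a}) for a in attributes
               if sum(a in q for q in workload) >= min_support_count]
    frequent = list(current)
    size = 2
    while current:
        # Cm: combine surviving sets into candidates one attribute larger.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size}
        # Lm: keep only candidates contained in enough workload queries.
        current = [c for c in candidates
                   if sum(c <= q for q in workload) >= min_support_count]
        frequent.extend(current)
        size += 1
    return frequent   # clusters of frequently co-queried attributes

# Hypothetical workload of queries reduced to attribute sets.
workload = [frozenset(q) for q in
            [{"At2", "At4", "At5"}, {"At2", "At3", "At4"},
             {"At2", "At4", "At5"}, {"At1", "At3"}]]
print(frequent_attribute_sets(workload, ["At1", "At2", "At3", "At4", "At5"], 2))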

We then materialize only those views that are capable of answering multiple queries, which reduces the candidate set. Query order permutation and the frequency of past queries are also used to reduce the overall query processing.

4 Research Challenges and Future

In VM, refreshes need to be performed on the database objects whenever an update occurs, and these updates need to be propagated in real time to simultaneous users. The challenge lies in keeping the updates consistent among all users. The research challenges are as follows:

  • Partial versus Complete Refresh of View: The refreshes are sometimes partial, where the updates are applied as they happen and the view is presented. This increases the efficiency of query updates but may introduce inconsistency among parallel users operating simultaneously. The refreshes are sometimes complete, where all updates first occur on the VM and then the view is presented to users. This allows consistency among all users for a common view but increases the latency of the query. Fine-tuning an appropriate balance is a difficult task.

  • Cost of Materialization: Another challenge is establishing a trade-off between the limited memory of constrained environments and the query execution time. Increasing memory usage reduces the time to fetch query results, which is not feasible in low-powered environments.

  • Attribute Selection: Another challenge is identifying the proper sets of frequent attributes during VM. Frequent attributes may be pre-fetched so that latency reduces. The challenge is to determine the estimated range of attribute values that need to be pre-fetched.

As rule-based queries have causal dependencies, the cost of querying views increases. To address this, as part of future work the authors propose machine-learning-based approaches on graph-based models to classify the set of candidate views to be materialized, reducing the overall query and maintenance cost of updating views. For this purpose, queries would be clustered as graph datasets, and a priority algorithm for the selection of materialized views would be proposed, which in turn reduces the overall storage and query processing cost.

5 Conclusion and Future Work

Optimal VS requires materialization in order to minimize the overall data retrieval time. In the proposed approach, DAMS, the features of the assigned workloads are grouped by similarity and dissimilarity into cluster sets based on confidence values. To define the candidate views, we use clusters of related queries that can jointly answer multiple queries. If the candidate views are large in number, it takes more time to identify which candidates should be merged while still ensuring the capability to answer multiple queries. The rule mining concept provides much stronger associated attribute sets. It also helps in clustering related attributes that appear frequently in the queries on a database over a given time period.