Keywords

1 Introduction

Organizing data in structured forms has been a traditional but still mainstream approach to improve the efficiency of query processing on database. Suppose B+Tree [1, 2], a data structure which is widely employed to organize a relational table or a secondary index in relational database. B+Tree is composed of two different types of nodes; leaf nodes store table records or index entries in the designated key order, whereas internal nodes store the navigation information to the leaf nodes. The navigation information allows the query processing to limit the search space, thus speeding up the processing if the query constraint is substantially selective along the key attribute.

In this technical direction, this paper proposes synopsis embedment, an idea of incorporating synopsis (i.e. precomputed summary or abstraction information) into an existing data structure. Using the embedded synopsis, the query processing may obtain an exact or approximate [3, 5] answer without accessing all the concerned records stored in the leaf nodes. Thus, the use of the synopsis may reduce the access cost, improving the efficiency of the query processing. This paper presents a design of efficient search algorithms on the synopsis-embedded B+Tree structure and clarifies its performance superiority by showing our experiments in scenarios of exact and approximate query processing.

Fig. 1.
figure 1

Example of synopsis embedment. A B+Tree structure organizes a given relational table with synopsis. The B+tree structure is composed of Nodes 1 to 7. The internal nodes, Node 1 to 3, contain synopses (colored in orange) as well as separation values (colored in blue) and references to its child nodes.

2 Synopsis Embedment and Synopsis-Aware Search

Synopsis Embedment. This section presents synopsis embedment, an idea of incorporating synopsis into an existing data structure, and search algorithms that exploit the embedded synopsis to improve the processing efficiency of queries. We take B+Tree as an example, but the same idea can be extended to other structures. B+tree is a major data structure employed in relational database. According to the standard design, a B+Tree structure is composed of multiple nodes (i.e., fixed-length pages) that are structured in a balanced tree form. Records are stored only in the leaf nodes in an sorted order of the designated key attribute. Each of other (internal) nodes contains one or more separation values of the designated key attribute and references to its child nodes (internal nodes or leaf nodes), each being in charge of one of the separated key space. Query processing starts at the root (top internal) node, and then recursively visits each of its child node only if its descendant nodes may hold records satisfying the query constraint until fetching all such records. This approach limits the search space, reducing the query processing cost.

Synopsis embedment extends the information that the internal node holds regarding its descendant nodes. Figure 1 illustrates an example case of synopsis embedment in B+Tree. Suppose a relational table presented in the figure, which mimics a sales order history; each record contains an order identifier (ORDERID as the primary key), price (PRICE), order date (ORDERDATE) and ship date (SHIPDATE) of a respective order. The figure also illustrates a B+Tree structure that embeds synopsis information. Each entry of the internal node contains synopsis (colored in orange) regarding the records stored in its descendant nodes. For example, the first synopsis in Node 2 covers all the records stored in Node 4, whereas the first synopsis in Node 1 covers Nodes 4 and 5. In this case, the synopsis comprises (1) the number of records, (2) the average value of the key attribute (ORDERID), (3) the average value of an non-key attribute (PRICE), and (4) the maximum value of an non-key attribute (ORDERDATE).

Exact Search Exploiting Embedded Synopsis. Here, we present a search algorithm (synopsis-aware search; SAS) for exact query processing that utilizes the synopsis embedded in B+Tree to improve the processing efficiency.

figure a

Algorithm 1 presents the pseudo code of SAS, which extends a conventional depth-first search on B+Tree. The conventional algorithm examines internal nodes of B+Tree until it reaches the leaf nodes, where the matched records are retrieved to the result buffer to form the query answer. In contrast, if the embedded synopsis is mathematically sufficient to form a part of the query answer, SAS merely retrieves the synopsis information to the result buffer, but it no longer examines its child nodes (at lines 6 to 8). Hence, SAS can reduce the search cost even though it can still obtain an exact answer to the given query.

Approximate Search Exploiting Embedded Synopsis. SAS is useful to reduce the search cost, but it works only if the embedded synopsis is mathematically sufficient for a given query. Here, we presents the improved version (SAS+) that extends SAS to approximate processing. Even if the embedded synopsis may not be perfectly sufficient to a given query, it may be able to provide the approximate answer; this solution potentially offers the performance improvement to more queries, and may be acceptable for use cases where query responsiveness matters more than query exactness, such as rough statistics surveys.

Fig. 2.
figure 2

Inter-attribute result composition. For a query having a composed constraint, separate synopsis-aware search offers a partial result and the inter-attribute result composition technique composes the results to obtain an approximate answer.

This paper presents an inter-attribute result composition technique that composes separate results obtained from multiple synopsis-embedded B+Tree to offer an approximate query answer. Figure 2 illustrates its example. Assume that a separate synopsis-embedded B+Tree is organized for each attribute that is likely to appear in query constraints. The figure exemplifies two B+Tree structures, one by ORDERID and another by SHIPDATE. The search process is two-fold. First, SAS is applied to each of the B+Tree structures; specifically speaking, the B+Tree structure by ORDERID offers a partial result that ignores a query constraint on SHIPDATE, and vice versa. Second, the obtained partial results are mathematically composed to an approximate answer. How to compose partial results depends on the query design. The example query is supposed to report a sum value; thus, arithmetic multiplication works for the composition. This mathematical composition assumes that the concerned attributes are statistically independent. If the attributes are correlated, the obtained result possibly contain non-negligible errors. Hence, the improved algorithm (SAS+) additionally decides whether the search process should utilize the synopsis (to be composed into an approximate answer), or obtain the answer by traversing to the leaf nodes, based on the correlation value between the concerned attributes.

3 Experiment

This section presents the experiments that we conducted to clarify how effectively the proposed synopsis-aware approaches worked. We implemented an experimental query engine in C; the engine incorporated the synopsis-aware search algorithms (SAS and SAS+) based on the well-known B+Tree design [2]. We tested the query performance by using this query engine and the TPC-H datasets. The datasets were generated by the standard dbgen tool (generating a uniform dataset) [6] and the revised dbgen tool (generating a skewed dataset) [7], both with the scale factor of 10, and they were loaded into the data structures managed by the experimental query engine. The zipf value was set to 1.0 for the skewed dataset. Four different test queries, Q1 to Q4, were tested on the datasets. All the experiments were performed on the two-socket server with dual Intel Xeon 6132 processors (each having 14 processing cores at 2.60 GHz), 96 GB main memory, 28 TB storage (composed of 24 hard disks by RAID6) running CentOS Linux 7.7. All the data structures were stored in the ext4 file system constructed in the storage space. The node size of B+Tree was set to 4096 bytes. We performed five trials of each test; this paper reports their average value.

For each dataset, we prepared two relation files, LINEITEM (ORDERKEY) and LINEITEM (SHIPDATE), and embedded the following synopsis information: the number of records; the sum, maximum and minimum values of the key attribute (L_ORDERKEY and L_SHIPDATE); and the sum, maximum and minimum values of an additional attribute (L_EXTENDEDPRICE). This synopsis configuration yielded only 0.80% space overhead for the uniform and skewed datasets.

Exact Query Processing of Search Queries with Sngle Constraints. First, the paper presents that the synopsis-aware search (SAS) performs significantly faster when the embedded synopsis information is sufficient.

We executed the following test queries (Q1, Q2 and Q3) on LINEITEM (ORDERKEY).

figure b

Compared with the baseline case (having no synopsis embedded), SAS successfully reduced the necessary execution time for Q1 and Q2; the baseline took 0.549–0.553 s and 0.533–0.535 s respectively, whereas SAS only took 0.020–0.021 s for both, achieving speedups by factors of 26.3 to 27.5. In contrast, SAS could not speed up Q3, but worked comparably with the baseline case; the baseline took 0.649–0.650 s, whereas SAS also took 0.636–0.638 s. This difference was clearly because the embedded synopsis satisfied the constraint and derivative attributes of Q1 and Q2, but it did not for Q3.

Approximate Query Processing of Search Queries with Composite Constraints. Second, the paper presents that the synopsis-aware search plus approximate processing (SAS+) performs faster with moderate error rates.

As additional synopsis, we further embedded the correlation value between the key attribute and another attributes (L_ORDERKEY and L_SHIPDATE) for each file. This synopsis configuration yielded only 1.62% space overhead for the uniform and skewed datasets. Then, we executed the following test query (Q4).

figure c

Note that Q4 limits the search space along with the L_ORDERKEY and L_SHIPDATE attributes. LINEITEM (ORDERKEY) or LINEITEM (SHIPDATE) do not hold the synopsis information that perfectly offers an exact query result, but they offer an opportunity of approximate processing.

Compared with the baseline case (13.280 s and 13.280 s), SAS significantly reduced the necessary execution time (0.041 s and 0.040 s) for the uniform and skewed datasets. However, note that the query processing was approximate for Q4. SAS induced only negligible (0.2%) error for the uniform dataset, but it involved substantial (14.3%) error for the skewed dataset. This was seemingly because SAS assumed the uniformity of the attribute value distribution and applied the synopsis-based approximation to the entire space. By contrast, SAS+ utilized the inter-attribute correlation information to limit the search space to which the synopsis-based approximation was applied. SAS+ also performed significantly faster (0.521 s) than the baseline method with a small (3.0%) error for the uniform dataset. Interestingly, SAS+ offered the balanced property for the skewed dataset. Specifically speaking, SAS+ offered a much smaller (6.6%) error than SAS (14.3%) with much smaller execution time (6.684 s) than the baseline case (13.280 s).

4 Related Work and Conclusion

Related Work. Utilizing a precomputed synopsis has been actively studied for conventional exact query processing and approximate query processing [3, 5]. Typically utilized synopses can be roughly grouped into histograms, wavelets and sketches. Recently, Liang et al. proposed building a tree of partial aggregates in advance [4]. This idea is close to our study. However, our paper further presents an idea of embedding a synopsis into an existing data structure such as B+Tree and an technique, inter-attribute result composition, to apply the approximation to queries having composite constraints.

Conclusion. This paper has proposed synopsis embedment, an idea of incorporating synopsis into an existing data structure. The embedded synopsis allows the query processing on the data structure to reduce the search cost. This paper has presented efficient search algorithms on the synopsis-embedded B+Tree structure. The presented experiments have clarified their performance superiority in scenarios of exact and approximate query processing. There remain open problems in the synopsis embedment; we would like to explore how to manage the freshness of embedded synopsis and the accuracy of query answers.