Definitions

Caching for SQL-on-Hadoop refers to techniques and systems that store data in order to provide faster access to that data for Structured Query Language (SQL) engines running on the Apache Hadoop ecosystem.

Overview

The Apache Hadoop software project (Apache Hadoop 2018) has grown in popularity for distributed computing and big data. The Hadoop stack is widely used for storing large amounts of data and for large-scale, distributed, fault-tolerant processing of that data. The Hadoop ecosystem has been important for organizations seeking to extract actionable insight from large volumes of collected data, a task that is difficult or infeasible with traditional data processing methods.

The main storage system for Hadoop is the Hadoop Distributed File System (HDFS), a distributed storage system which provides fault-tolerant and scalable storage. The main data processing framework for Hadoop is MapReduce, which is based on the Google MapReduce project (Dean and Ghemawat 2008). MapReduce is a programming model and distributed batch processing framework for reliably processing large volumes of data, typically from HDFS. However, MapReduce's distinct programming model can be a barrier for nondevelopers, such as data analysts, and for business intelligence (BI) tools.

Because of this barrier to entry, new tools and frameworks have emerged for the Hadoop ecosystem. In particular, Structured Query Language (SQL) frameworks have gained popularity for the Hadoop software stack. SQL is a domain-specific declarative language for retrieving and manipulating data, typically in relational database management systems (RDBMS). SQL engines for Hadoop have been effective at empowering users who are already familiar with the SQL language and existing data management tools. These frameworks typically read data from HDFS, process it in a distributed way, and return the desired answer to the user. Many SQL engines are available for Hadoop, including Apache Hive (2018), Apache Spark SQL (2018), Apache Impala (2018), Presto (Facebook 2018), and Apache Drill (2018).

These SQL-on-Hadoop engines combine the convenience of querying data with SQL and the power of distributed data processing with Hadoop. However, these engines are commonly not tightly integrated into the Hadoop ecosystem. Separating the computation engine from the storage system can have advantages, such as cost effectiveness and operational flexibility, but one potential weakness is that the performance of accessing data may decline. Since an RDBMS typically has tight integration and control of the entire stack, from computation to storage, its performance can be faster and more predictable. To offer an experience comparable to traditional SQL processing in an RDBMS, distributed SQL engines on Hadoop turn to caching to achieve the desired performance.

There are several different ways SQL engines on Hadoop can take advantage of caching to improve performance. The primary methods are as follows:

  • Internal caching within the SQL engine

  • Utilizing external storage systems for caching

SQL Engine Internal Caching

A common way a SQL engine uses caching is to implement its own internal cache. This gives each SQL engine the most control over what data to cache and how to represent the cached data. An internal cache is also usually the fastest to access, since it is located closest to the computation being performed (in the same memory address space or on local storage) and in the desired format. However, an engine's internal cache may not be shareable between different users and queries, a capability that is beneficial in a multitenant environment. Also, not all SQL processing frameworks implement an internal cache, so queries on those frameworks cannot benefit from cached data.
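At its core, an engine-internal cache can be as simple as a bounded least-recently-used (LRU) map keyed by data partition, held in the engine's own address space. The following is a minimal, hypothetical sketch of that idea in Python; it does not reproduce any particular engine's implementation:

```python
from collections import OrderedDict

class InternalCache:
    """Minimal LRU cache, as a SQL engine might keep per-partition data
    in its own process. Hypothetical sketch, not a real engine's code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # partition_id -> cached rows

    def get(self, partition_id):
        if partition_id not in self._entries:
            return None  # cache miss: caller must read from HDFS
        self._entries.move_to_end(partition_id)  # mark as recently used
        return self._entries[partition_id]

    def put(self, partition_id, rows):
        self._entries[partition_id] = rows
        self._entries.move_to_end(partition_id)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

Because the entries live inside the engine's own process, access is very fast, but nothing outside that process can share them, which illustrates the trade-off described above.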

Apache Spark SQL

Apache Spark SQL is a SQL query engine built on top of Apache Spark, a distributed data processing framework. Spark SQL exposes a concept called a DataFrame, which represents a distributed collection of structured data, similar to a table in an RDBMS. Caching in Spark SQL utilizes the caching in the Spark computation engine. A user can choose to cache a DataFrame in Spark SQL by explicitly invoking a cache command (Spark SQL 2018). When the command is invoked, the Spark SQL engine caches the DataFrame internally in the Spark engine. There are several user-configurable storage levels for the cached data, including MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and others (Spark RDD 2018). An explicit command must be invoked to remove the DataFrame from the cache.
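This workflow can be sketched with PySpark as follows. The sketch assumes an active SparkSession and a hypothetical Parquet path; the imports are deferred into the function so the helpers remain importable without a Spark installation:

```python
def cache_dataframe(spark, path):
    """Read a dataset and pin it in Spark's internal cache.
    `spark` is an active SparkSession; `path` is a hypothetical
    Parquet location used here only for illustration."""
    from pyspark.storagelevel import StorageLevel

    df = spark.read.parquet(path)
    # Choose how cached blocks are stored; MEMORY_AND_DISK spills
    # partitions that do not fit in memory to local disk.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # caching is lazy; an action materializes the cache
    return df

def release_dataframe(df):
    """Caching is explicit in both directions: an explicit call
    removes the DataFrame from the cache."""
    df.unpersist()
```

Note that `df.cache()` is shorthand for persisting at the default storage level; `persist` lets the user select among the levels listed above.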

Apache Hive and LLAP

Apache Hive is a SQL data warehouse for Hadoop which delegates data processing to an existing distributed computation engine: MapReduce, Apache Spark, or Apache Tez. Hive uses Live Long And Process (LLAP) (Apache Hive LLAP 2018) to cache data for SQL processing. Currently, only Tez takes advantage of the caching available via LLAP.

LLAP is a long-lived daemon which runs alongside Hadoop components in order to provide caching and data pre-fetching benefits. Since Tez takes advantage of LLAP, when Hive uses Tez as the computation engine, data access will go to the LLAP daemons, instead of the HDFS DataNodes. With this access pattern and architecture, LLAP daemons can cache data for Hive queries.

External Storage Systems for Caching

Another major technique for caching for SQL engines on Hadoop is to utilize an external storage system for caching data. Using a separate system can have several benefits. An external system may be able to manage the cache more effectively and provide additional cache-related features. Also, an external system can be deployed independently from the other components of the ecosystem, which can provide operational flexibility. Additionally, sharing cached data can be enabled or made simpler with a separate system for caching. By using a separate system for handling the caching of data, SQL engines can take advantage by reading the data from the cache, instead of the originating data source. This can greatly improve performance, especially when frequently accessed files are cached. Below are some of the main systems which can provide caching in Hadoop ecosystems.
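The read-through pattern these external systems share can be sketched as follows. The names are illustrative only and do not correspond to any system's actual API:

```python
class ReadThroughCache:
    """External cache shared by multiple SQL engines: on a miss, read
    from the originating data source and retain a copy for later
    readers. Illustrative sketch only."""

    def __init__(self, origin_reader):
        self._origin_reader = origin_reader  # e.g., a function reading HDFS
        self._cached = {}                    # path -> bytes

    def read(self, path):
        if path in self._cached:
            return self._cached[path]        # served from the cache
        data = self._origin_reader(path)     # fall back to the data source
        self._cached[path] = data            # populate for later readers
        return data
```

Because the cache is a separate service rather than engine-internal state, a file loaded by one SQL engine's query can be served from the cache to a different engine's query.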

Alluxio

Alluxio (2018) is a memory-speed virtual distributed storage system. Alluxio provides a unified namespace across multiple disparate storage systems. Alluxio enables users to “mount” any existing storage system, such as HDFS, Amazon S3, or a storage appliance, and presents a single, unified namespace encompassing all the data. When a mount is established, applications simply interact with the Alluxio namespace, and the data is transparently accessed from the mounted storage and cached in Alluxio. Since there is no limit to how many mounts are possible, Alluxio enables accessing all data from a single namespace and interface and allows queries to span multiple different data sources seamlessly. Figure 1 shows how Alluxio can mount multiple storage systems for multiple computation frameworks.

Caching for SQL-on-Hadoop, Fig. 1

Alluxio mounting various storage systems

Alluxio enables caching for SQL engines by providing memory-speed access to data via its memory-centric architecture. Alluxio stores data on different tiers, which include memory, solid-state drive (SSD), and disk. With the memory tier and the tiered storage architecture, Alluxio can keep the most important and most frequently accessed data in the fastest memory tier and store less frequently accessed data in slower tiers with more capacity.
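The tiering idea can be sketched as a simple promotion and demotion scheme: recently accessed blocks rise to the memory tier, and overflow is pushed down to larger, slower tiers. This is an illustrative toy model with made-up capacities, not Alluxio's actual policy:

```python
from collections import OrderedDict

# Tiers ordered fastest to slowest; capacities (in blocks) are illustrative.
TIERS = [("memory", 2), ("ssd", 4), ("disk", 8)]

class TieredStore:
    """Toy tiered store: new and re-read blocks live in the fastest
    tier, and each tier's overflow is demoted to the next tier down."""

    def __init__(self):
        self._tiers = {name: OrderedDict() for name, _ in TIERS}

    def write(self, block_id, data):
        self._place(block_id, data)

    def read(self, block_id):
        for name, _ in TIERS:
            if block_id in self._tiers[name]:
                data = self._tiers[name].pop(block_id)
                self._place(block_id, data)  # promote to the memory tier
                return data
        return None  # not cached; would be fetched from the mounted store

    def _place(self, block_id, data):
        self._tiers["memory"][block_id] = data
        # Demote the oldest block of any tier that is over capacity.
        for i, (name, cap) in enumerate(TIERS):
            tier = self._tiers[name]
            while len(tier) > cap:
                old_id, old_data = tier.popitem(last=False)
                if i + 1 < len(TIERS):
                    self._tiers[TIERS[i + 1][0]][old_id] = old_data
                # else: evicted entirely (the mounted store still has it)

    def tier_of(self, block_id):
        for name, _ in TIERS:
            if block_id in self._tiers[name]:
                return name
        return None
```

The key property is that a block's tier reflects its recent access pattern, so hot data is served at memory speed while capacity is bounded per tier.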

SQL engines can access any data via the Alluxio unified namespace, and the data will be either fetched from the configured mounted storages, or served from Alluxio-managed storage (memory, SSD, and disk), which can greatly improve performance. When Alluxio is deployed colocated with the computation cluster, the Alluxio memory storage can behave similar to the internal caches of SQL engines. Alluxio is able to transparently cache data from various other storage systems, like HDFS. Additionally, since the Alluxio storage is independent from the SQL engines, the cached data can easily be shared between different computation frameworks, thus allowing greater flexibility for deployments.

Apache Ignite HDFS Cache

Apache Ignite (2018) is a distributed platform for in-memory storage and computation. Although Apache Ignite is a full stack of components, including a key/value store, a SQL processing engine, and a distributed computation framework, it can also be used to cache data for HDFS and for SQL engines on Hadoop.

Apache Ignite provides a caching layer for HDFS via the Ignite File System (IGFS). Clients and applications interact with IGFS for data, and IGFS handles reading and writing the data in memory on the nodes. When using IGFS as a cache, users can optionally configure a secondary file system for IGFS, which enables transparent read-through and write-through behavior. With HDFS configured as the secondary file system, data read from or written to HDFS through IGFS is transparently cached in Ignite's in-memory storage.
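The read-through and write-through behavior against a secondary file system can be sketched as follows. The names are illustrative and do not reflect the actual IGFS API:

```python
class WriteThroughCache:
    """Toy model of an in-memory file system layer (in the style of
    IGFS) backed by a secondary store (standing in for HDFS).
    Illustrative sketch, not the Ignite API."""

    def __init__(self, secondary):
        self._memory = {}            # in-memory copies on the cache nodes
        self._secondary = secondary  # dict standing in for HDFS

    def write(self, path, data):
        self._memory[path] = data
        self._secondary[path] = data  # write-through to the secondary store

    def read(self, path):
        if path in self._memory:
            return self._memory[path]  # served from memory
        data = self._secondary[path]   # read-through from the secondary store
        self._memory[path] = data      # cached for subsequent reads
        return data
```

Writes land in both the cache and the backing store, so the secondary file system remains the source of truth while reads are served from memory whenever possible.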

Using Apache Ignite, SQL engines can access their HDFS data via IGFS, and the data can be cached in the Ignite memory storage. Any SQL engine can take advantage of this cached data for faster queries. When using IGFS, the SQL engine's internal caching does not need to be used, which also enables sharing of the cached data. However, IGFS allows only a single secondary HDFS-compatible file system; if SQL queries need to access varied data sources, IGFS cannot cache data from all of them.

HDFS Centralized Cache Management

Another external caching option is to implement the cache in the Hadoop storage system itself, HDFS. Relying on the operating system (OS) buffer cache is one way to take advantage of caching in HDFS, but users have no control over the OS buffer cache. Instead, HDFS offers a centralized cache management feature (Apache Hadoop HDFS 2018), which enables users to explicitly specify files or directories to be cached in memory. Once a user specifies a path to cache, the HDFS DataNodes are notified and cache the data in off-heap memory. A separate command must be invoked to remove a file or directory from the cache. HDFS caching can accelerate access to data, since the data can be served from memory instead of disk.
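The explicit pin and unpin workflow is driven by the `hdfs cacheadmin` command. A sketch of the typical sequence follows; the pool and path names are illustrative, and the commands assume a running HDFS cluster with centralized caching enabled:

```shell
# Create a cache pool, an administrative group for cache directives.
hdfs cacheadmin -addPool analytics_pool

# Pin a directory: blocks of its files are cached in DataNode off-heap memory.
hdfs cacheadmin -addDirective -path /warehouse/hot_table -pool analytics_pool

# Inspect the active cache directives.
hdfs cacheadmin -listDirectives

# Removing a path from the cache is a separate, explicit step.
hdfs cacheadmin -removeDirectives -path /warehouse/hot_table
```

The DataNodes asynchronously cache the pinned blocks, so reads of those files are then served from memory rather than disk.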

By enabling the HDFS centralized cache, the data of specified files will reside on memory on the HDFS DataNodes, so any SQL engines or applications accessing those files will be able to read the data from memory. Increasing the data access performance of HDFS will help any queries accessing the cached files. Apache Impala has additional support for utilizing the HDFS centralized caching feature for Impala tables which are read from HDFS. With Apache Impala, users can specify tables or partitions of tables to be cached via HDFS centralized caching. IBM Big SQL (Floratou et al. 2016) also uses the HDFS cache for accelerating access to data and utilizes new caching algorithms Adaptive SLRU-K and Adaptive EXD to improve the cache hit rates for better performance.

Since the centralized caching is an HDFS feature, it can help improve performance for any HDFS data access. However, for disaggregated clusters, where the HDFS cluster is separate from the SQL engine cluster, the HDFS caching may not be able to help much. Separating computation and storage clusters is becoming popular for its scalability, flexibility, and cost effectiveness. In these environments, when the computation cluster requires data, the data must be accessed across the network, so even if the data resides in the memory cache on the HDFS DataNodes, the network transfer may be the performance bottleneck. Therefore, further improving performance via caching requires either SQL engine internal caching or a separate caching system. Also, since the caching system is tightly integrated with HDFS, the deployment is not as flexible as other external caching systems.

Conclusion

Caching can significantly improve query performance for SQL engines on Hadoop. There are several different techniques for caching data for SQL frameworks. SQL engine internal caching has the potential for the greatest performance, but each framework must implement a cache, and the cached data is not always shareable by other applications or queries. Another option is to use the HDFS centralized caching feature to store files in memory on the HDFS DataNodes. This server-side cache is useful to speed up local access to data, but if SQL engines are not colocated with HDFS, or the data is not stored in HDFS, the network may become the performance bottleneck. Finally, external systems can cache data for SQL frameworks on Hadoop. When colocated with the computation cluster, external systems can store data in memory similar to an internal cache, but provide the flexibility to share the data effectively between different applications or queries.

Cross-References