1 Introduction

In the current Big Data era, advances in computing technologies have triggered an explosion of digital information. Such gigantic amounts of data have further widened the speed gap between computation and storage and brought about both performance and power consumption challenges.

One intuitive solution to these challenges is to introduce hybrid storage architectures that combine the advantages of multiple types of storage devices. For example, flash-based hybrid storage drives (HSDs) have been proposed to integrate standard hard disk drives (HDDs) and flash-based solid-state drives (FSSDs) into a single storage enclosure [1]. By combining low-power, fast-access FSSDs with low-cost, high-capacity HDDs, flash-based HSDs deliver fast random access to end users. Although flash-based HSDs are gaining popularity in personal computers, they suffer from several striking shortcomings of FSSDs, namely high write latency and low endurance, which seriously hinder the successful integration of FSSDs into HSDs, especially in data-intensive server scenarios. A series of techniques have been proposed to address these issues [34, 36]. However, most of them only target specific usage scenarios and cannot act as a general solution to eliminate FSSDs' drawbacks, which continue to threaten the future success of FSSD-based HSDs. There remains a continuing need for better technologies in the storage market.

Fortunately, new cutting-edge non-volatile random-access memory (NVRAM) devices are emerging, such as phase-change memory (PCM), spin-torque transfer memory (STT-RAM), and resistive RAM (RRAM). These memory devices provide a rich set of advanced features: they offer the same non-volatility as conventional HDDs and FSSDs while providing speeds approaching those of DRAM. Among these technologies, PCM is particularly promising, with several companies and universities already providing prototype chips and devices [2, 22]. Compared to FSSD, PCM offers a number of performance and energy advantages [6]. First, PCM has a much lower read response time than FSSD: around 50 ns, nearly 500 times faster. Second, PCM can overwrite data directly on the memory cell, in contrast to FSSD's erase-before-write. The write response time of PCM is less than 1 \(\upmu \)s, nearly three orders of magnitude faster than that of FSSD. Third, the program energy of PCM is 6 Joules/GB, one third that of FSSD [6]. Thus, PCM is a viable alternative to FSSDs for building hybrid storage systems.

In this paper, we design a novel hybrid storage system that leverages PCM as a write cache to merge random write requests and improve access locality for HDDs. It includes new caching and destaging algorithms for better I/O performance as well as wear-leveling algorithms for enhanced PCM lifetime.

A rich set of techniques have used NVRAM as a data cache to improve disk I/O [3, 9, 11, 17, 38]. Most of them use LRU-like methods (e.g., Least Recently Written, LRW [9, 11]) to manage a small non-volatile cache and improve the performance and reliability of HDD-based storage and file systems. However, these methods do not scale to a large PCM cache. For a high-density PCM cache with a capacity of several GBs, fine-grained LRU/LRW management incurs large DRAM overheads for maintaining the LRU stack and the address mapping. In addition, LRU/LRW cannot ensure that destaging I/O traffic is presented to the hard disks as sequential writes. CSCAN [11], used as a supplement to LRW, eases this issue to some extent, but its O(log(n)) insertion time makes it unsuitable for managing a large cache.

Therefore, it is crucial to rethink current cache management strategies for PCM. In this paper, we propose a novel cache management algorithm, named HALO, which uses a chained hash table to manage PCM, merge random write requests, and improve data access locality to hard disks. It manages address mapping through cuckoo hash tables and implements a new two-way destaging policy. Together, these techniques significantly reduce DRAM overheads while maintaining constant time for both insertion and query. In addition, by removing the dirty-page write-back deadlines that commonly exist in DRAM-based caching systems, HALO enables better write caching and destaging and thus achieves better I/O performance. Finally, by storing cache mapping information on non-volatile PCM, the storage system can recover quickly and maintain integrity in case of system crashes.

To use PCM as a write cache, we must address PCM's limited durability. Several existing wear-leveling techniques have shown good endurance improvement for PCM-based memory systems [28, 29, 44]. However, these techniques are not specifically designed for PCM used in storage and file systems, and thus can negatively impact the spatial locality of file system accesses, which in turn degrades the read-ahead and sequential access performance of file systems. We propose two new wear-leveling techniques: rank-bank round-robin wear leveling and space filling curve wear leveling. The space filling curve wear leveling not only provides a good write balance among different regions of the device, but also preserves data locality and adapts well to the file system's I/O access characteristics.

To assess the effectiveness of our hybrid storage system, we conduct extensive experiments on a diverse set of real-world I/O workloads. By thoroughly studying the system's I/O performance and endurance characteristics compared to traditional techniques, we demonstrate that HALO is a fast and durable PCM write cache for HDD-based storage systems. Specifically, HALO reduces execution time by 36.8% on average compared to the LRU caching scheme, and SFC wear leveling extends the lifetime of the PCM device by a factor of 21.6. This paper is a substantial extension of our preliminary study presented in [20]. It further includes the rank-bank round-robin wear-leveling algorithm for comparison and the two-way destaging scheme for better front-end service performance. In addition, we provide more comprehensive evaluation results, such as HDD throughput, response time, detailed write access counts, bank deviations and the memory overheads of the mappings.

The rest of the paper is organized as follows. We first describe the design of a hybrid PCM-HDD architecture in Sect. 2, followed by Sect. 3 where the details of our wear-leveling algorithms are provided. Section 4 then provides experimental methodology and results. Section 5 provides a brief overview of related work. Finally, we conclude the paper in Sect. 6.

2 PCM based hybrid storage

Hybrid storage devices have been constructed in many different ways. Most HSDs are built using flash-based solid-state devices as either a non-volatile cache or a prefetch buffer inside the hard drives. The combination of FSSDs and HDDs offers an economic advantage thanks to low-cost components and mass production. This composition of hybrid storage devices, shown in Fig. 1a, is currently the most popular.

Fig. 1 Different architectures of hybrid storage devices

Exploring emerging NVRAM devices such as PCM as components of hybrid storage devices has attracted significant research interest. Research in this direction proceeds along two distinct paths. Along the first path, PCM is used as a direct replacement for FSSDs, as shown in Fig. 1b. Along the second path, PCM is used in combination with FSSDs to compensate for FSSDs' lack of in-place updating, and possibly to push HDDs out of hybrid storage devices, as shown in Fig. 1c. For example, Sun et al. [36] use this type of hybrid storage device to demonstrate its capability for high performance and increased endurance with low energy consumption. However, there are two major problems associated with this approach. First, since FSSDs provide the primary data storage space, the erase-before-write problem still exists, although it occurs at a lower frequency. This causes significant performance loss for data-intensive applications. Second, without HDDs in the memory hierarchy, large volumes of storage space cannot be provided at a reasonable cost. In terms of cost per gigabyte, FSSDs are still about 10 to 20 times more expensive than HDDs.

For the above reasons, we investigate the benefits of leveraging PCM as a write cache in hybrid storage devices designed along the first path. As shown in Fig. 1b, we use PCM to completely replace FSSDs while retaining HDDs for their advantage in storage capacity. With the fast development of PCM technologies, we expect that PCM-based hybrid storage drives will become more popular. In this section, we describe our hybrid storage system, HALO, which uses PCM as a write cache for HDDs to improve the performance and reliability of HDD-based storage and file systems. Specifically, we first introduce the HALO framework and its basic supporting data structures, and then elaborate on the caching and destaging algorithms.

2.1 HALO framework and data structures

Using PCM as a cache for HDDs demands efficient caching algorithms. We design a new caching algorithm, referred to as HALO, to manage data blocks that are cached in PCM for hard disk drives. HALO is a non-volatile caching algorithm which uses a HAsh table to manage PCM and merge random write requests, thereby improving access LOcality for HDDs. Figure 2 shows the HALO framework. The basic data structure of HALO is a chained hash table used to maintain the mapping of HDD LBNs (Logical Block Numbers) to PCM PBNs (PCM Block Numbers). Sequential regions on HDDs, in units of 1 MB, are each managed by one hash bucket. The information associated with these sequential regions is used to make cache replacement decisions.

Fig. 2 Design of the HALO framework

Fig. 3 Data structures of HALO caching

2.1.1 Mapping management

As shown in Fig. 3, the chained hash table consists of an in-DRAM array (the bucketinfo table) and on-PCM mapping structures. A separate cuckoo hash table enables space-efficient, fast queries. The bucketinfo table stores information for HDD data regions. Each bucket item in the table represents a 1 MB region on the disk partition or logical volume. Hence, the number of buckets in the bucketinfo table is determined by the size of the disk volume. Each bucketinfo item, if activated, contains three components: listhead, bcounts, and recency. listhead maintains the PBN of the head block of the list of cache items that map to the same sequential 1 MB disk area, bcounts records the number of cached blocks, and recency records the latest access time-stamp among all cache items in this bucket. We use a global request counter, globalReqClock, to represent the time-stamp; whenever a request arrives, the counter is incremented by one. The total_counts variable records how many HDD blocks are cached in PCM, while activated_bucks indicates the number of activated items in the bucketinfo table. buck_scan is used to search the bucketinfo table for a candidate destaging bucket.

Cache items associated with a bucket do not need to be linked in ascending order of LBNs, because they are only accessed in groups during destaging. Each newly inserted item is linked to the head of the list, which guarantees constant-time insertion. Each cache item maintains the mapping of a 4 KB block from an HDD block address (LBN) to a PCM block number (PBN). It contains an LBN (the starting LBN of 8 sequential 512 B HDD blocks), the PBN of the next PCM block in the list, and an 8-bit bitmap that represents the fragmentation inside the 4 KB PCM block: each nonzero bit marks a cached 512 B HDD block. Each cache item is stored in the metadata section of its PCM block [2].
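
To make these structures concrete, the following C sketch shows one possible layout of the bucketinfo items, the per-block cache item metadata, and the global cache state. It is a minimal illustration assuming the field names and widths described above; it is not the authors' exact implementation.

```c
#include <stdint.h>

/* Illustrative layout of the HALO mapping structures described above.
 * Field names and widths are assumptions, not the authors' exact code. */

#define LBNS_PER_MB  2048u        /* one 1 MB region = 2048 x 512 B HDD blocks */
#define INVALID_PBN  0xFFFFFFFFu  /* end-of-list marker for the PBN chains     */

/* In-DRAM bucketinfo item: one entry per 1 MB sequential HDD region. */
struct bucket_info {
    uint32_t listhead;   /* PBN of the head cache item mapped to this region */
    uint32_t bcounts;    /* number of 4 KB blocks currently cached here      */
    uint64_t recency;    /* globalReqClock value of the most recent access   */
};

/* Per-block metadata kept in each PCM block's metadata section (on PCM). */
struct cache_item {
    uint64_t lbn;        /* starting LBN of the 8 cached 512 B HDD blocks      */
    uint32_t next_pbn;   /* next PCM block in the same bucket's list           */
    uint8_t  bitmap;     /* which 512 B blocks inside the 4 KB block are valid */
};

/* Global cache state kept in DRAM. */
struct halo_cache {
    struct bucket_info *bucketinfo;  /* one entry per 1 MB disk region        */
    uint64_t num_buckets;            /* disk volume size / 1 MB               */
    uint64_t total_counts;           /* HDD blocks currently cached in PCM    */
    uint64_t activated_bucks;        /* buckets currently holding cached data */
    uint64_t buck_scan;              /* cursor for the destage victim search  */
    uint64_t globalReqClock;         /* incremented on every request          */
};
```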

2.1.2 Cuckoo hash table

To achieve fast retrieval of HDD blocks, a DRAM-based cuckoo hash table is maintained using the LBN as the key and the PBN as the value. On a cache hit, the PBN of the cache item is returned, which enables fast access to the data in PCM. Traditionally, hash tables resolve collisions through linear probing or chaining, and they can answer lookup queries in O(1) time only when their load factors are very low, i.e., smaller than log(n)/n, where n is the table size; as the load factor increases, query time can degrade to O(log(n)) or even O(n). Cuckoo hashing solves this issue by using multiple hash functions [10, 26]. It achieves O(1) lookups (albeit with a larger constant than linear hashing) as well as good space efficiency, i.e., a high load factor. Next, we introduce how we achieve such space efficiency.

Fig. 4 Maximum load factors for different numbers of hash functions in cuckoo hashing

We set 100 as the maximum displacement threshold in the cuckoo hash table. When the cuckoo hash table cannot find an available slot for a newly inserted record within 100 item displacements, the table is considered almost full and requires a larger table and rehashing; the critical load factor reached before rehashing is recorded as the maximum load factor. Figure 4 shows how the number of hash functions influences the average maximum load factor achieved when running the seven traces. With two hash functions, only about a 50% load factor can be achieved; as the number of functions increases, the load factor of the cuckoo hash table initially grows rapidly, but the slope of the curve then flattens and the additional benefit becomes marginal. We therefore use four hash functions in our design, since a larger number of functions would in turn bring higher query and computation overheads. In addition, we choose a sufficiently large initial hash table size to keep the load factor below 80% when the PCM cache is fully loaded. In this way we keep the average number of displacements per insertion below two for better performance.
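
As an illustration of this design point, the sketch below shows a cuckoo table with four hash functions, a 100-displacement insertion bound, and O(1) lookups. The hash function, single-table slot layout and eviction order are assumptions for illustration; the authors' implementation may differ.

```c
#include <stdint.h>

/* Hedged sketch of the cuckoo hash table used for LBN -> PBN lookups. */

#define NUM_HASHES    4      /* four hash functions, as chosen in the text   */
#define MAX_DISPLACE  100    /* displacement threshold before resize/rehash  */
#define EMPTY_KEY     UINT64_MAX

struct cuckoo_slot { uint64_t lbn; uint32_t pbn; };

struct cuckoo_table {
    struct cuckoo_slot *slots;
    uint64_t size;                       /* number of slots        */
    uint64_t seeds[NUM_HASHES];          /* per-function hash seeds */
};

/* Simple multiplicative hash; a stand-in for the real hash functions. */
static uint64_t hash_i(const struct cuckoo_table *t, int i, uint64_t lbn) {
    uint64_t x = lbn ^ t->seeds[i];
    x *= 0x9E3779B97F4A7C15ull;
    return (x >> 17) % t->size;
}

/* O(1) lookup: probe the candidate slot of each hash function. */
int cuckoo_lookup(const struct cuckoo_table *t, uint64_t lbn, uint32_t *pbn_out) {
    for (int i = 0; i < NUM_HASHES; i++) {
        const struct cuckoo_slot *s = &t->slots[hash_i(t, i, lbn)];
        if (s->lbn == lbn) { *pbn_out = s->pbn; return 1; }
    }
    return 0;  /* cache miss */
}

/* Insert with bounded displacement; returns 0 when rehashing is required. */
int cuckoo_insert(struct cuckoo_table *t, uint64_t lbn, uint32_t pbn) {
    struct cuckoo_slot cur = { lbn, pbn };
    for (int d = 0; d < MAX_DISPLACE; d++) {
        for (int i = 0; i < NUM_HASHES; i++) {
            struct cuckoo_slot *s = &t->slots[hash_i(t, i, cur.lbn)];
            if (s->lbn == EMPTY_KEY) { *s = cur; return 1; }
        }
        /* All candidate slots taken: evict one and reinsert the victim. */
        struct cuckoo_slot *victim = &t->slots[hash_i(t, d % NUM_HASHES, cur.lbn)];
        struct cuckoo_slot tmp = *victim;
        *victim = cur;
        cur = tmp;
    }
    return 0;  /* table effectively full: grow and rehash */
}
```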

Here, we give a sample calculation of the DRAM overhead incurred by the HALO data structures. For a 2 GB PCM cache with a 4 KB cache block size, 0.5 M items are placed in the hash table. As each item takes 8 bytes and the load factor of the cuckoo hash table is 0.8, the total memory overhead of the cuckoo hash table is about 5 MB. With the bucketinfo table normally consuming about 6–12 MB of DRAM, we need less than 20 MB of DRAM to implement HALO cache management for a 2 GB PCM cache and a 1 TB hard disk.
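
This back-of-the-envelope estimate can be written out as follows (assuming 8-byte cuckoo entries and the 0.8 load factor stated above):

$$\begin{aligned} N_{items}&= \frac{2~\text {GB}}{4~\text {KB}} = 0.5~\text {M} \\ Mem_{cuckoo}&= \frac{N_{items} \times 8~\text {B}}{0.8} = 5~\text {MB} \\ Mem_{total}&\approx Mem_{cuckoo} + Mem_{bucketinfo} \approx 5~\text {MB} + (6\text {--}12)~\text {MB} < 20~\text {MB} \end{aligned}$$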

2.1.3 Recovery from system crashes

The mapping information of each PCM block, which contains the LBN, the next PBN and the bitmap, is stored on non-volatile PCM. Therefore, if the system crashes, it can reboot and then either destage the dirty items from PCM to HDD or rebuild the in-DRAM hash tables by scanning the information stored at fixed positions in the PCM metadata sections (to recover each cache item's LBN, bitmap and next PBN). As PCM's read performance is similar to that of DRAM, the recovery procedure should take only seconds to rebuild the in-memory mapping data structures. In doing so, we avoid loss of cached data and guarantee system integrity.
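
A minimal sketch of this recovery scan is shown below, building on the structures and cuckoo table sketched in Sect. 2.1.1. pcm_read_meta() and pcm_write_meta() are assumed helpers that access a block's metadata section, and re-linking the recovered items at the list heads is only one possible way to restore the bucket lists, not necessarily the authors' method.

```c
/* Hedged sketch of crash recovery: rebuild the DRAM tables by scanning the
 * metadata section of every PCM block. pcm_read_meta()/pcm_write_meta() are
 * assumed helpers; cuckoo_insert() and the structs come from earlier sketches. */

extern int  pcm_read_meta(uint32_t pbn, struct cache_item *out);
extern void pcm_write_meta(uint32_t pbn, const struct cache_item *in);

void halo_recover(struct halo_cache *hc, struct cuckoo_table *ct,
                  uint32_t num_pcm_blocks) {
    for (uint32_t pbn = 0; pbn < num_pcm_blocks; pbn++) {
        struct cache_item meta;
        if (!pcm_read_meta(pbn, &meta) || meta.bitmap == 0)
            continue;                           /* block holds no valid cached data */

        uint64_t b = meta.lbn / LBNS_PER_MB;    /* index of the 1 MB HDD region */
        struct bucket_info *bi = &hc->bucketinfo[b];
        if (bi->bcounts == 0)
            hc->activated_bucks++;              /* first item seen for this region */

        /* Re-link the item at the head of its bucket's list and restore the
         * LBN -> PBN mapping in the DRAM cuckoo hash table. */
        meta.next_pbn = (bi->bcounts == 0) ? INVALID_PBN : bi->listhead;
        pcm_write_meta(pbn, &meta);
        bi->listhead = pbn;
        bi->bcounts++;
        hc->total_counts++;
        cuckoo_insert(ct, meta.lbn, pbn);
    }
}
```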

2.2 HALO caching scheme

Our caching algorithm is described in Algorithm 1. When a request arrives, the bucket index is computed from the request's LBN. The hash table is then searched for an entry corresponding to the LBN. In the event of a cache hit, the PBNs are returned from the hash table and the corresponding blocks are either written in place to, or read from, the PCM. The corresponding bucket's recency in the bucketinfo table is also updated to the current time-stamp. In the event of a cache miss on a read request, data is read directly from the HDD without updating the cache. In the event of a cache miss on a write request, a cache item is allocated in the PCM and the data is written to that cache block. Then, if the bucket item of the bucketinfo table for the LBN is empty, it is activated. After that, the bucket item's list of cache items is updated, the address mapping is added to the hash table, the recency of the bucket is set to the current time-stamp, and the bucket's bcounts is incremented. The updated access statistics are used by the two-way destaging algorithm to conduct destaging.

Algorithm 1 HALO caching scheme
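
The following C sketch mirrors the steps of Algorithm 1 as described above, reusing the structures and cuckoo table sketched in Sect. 2.1. The pcm_ and hdd_ I/O helpers, pcm_alloc_block() and link_to_bucket_head() are assumed interfaces for illustration only, not the authors' code.

```c
/* Hedged sketch of the HALO caching path (Algorithm 1). The pcm_ and hdd_
 * helpers, pcm_alloc_block() and link_to_bucket_head() are assumed interfaces. */

enum io_type { IO_READ, IO_WRITE };
struct io_req { enum io_type type; uint64_t lbn; void *data; };

extern void     pcm_read(uint32_t pbn, void *data);
extern void     pcm_write(uint32_t pbn, const void *data);
extern void     hdd_read(uint64_t lbn, void *data);
extern uint32_t pcm_alloc_block(struct halo_cache *hc);
extern void     link_to_bucket_head(struct halo_cache *hc, uint64_t bucket,
                                    uint32_t pbn);

void halo_handle_request(struct halo_cache *hc, struct cuckoo_table *ct,
                         struct io_req *req) {
    hc->globalReqClock++;                            /* advance the logical clock */
    uint64_t bucket = req->lbn / LBNS_PER_MB;        /* 1 MB region index         */
    uint32_t pbn;

    if (cuckoo_lookup(ct, req->lbn, &pbn)) {
        /* Cache hit: serve from PCM and refresh the bucket's recency. */
        if (req->type == IO_WRITE) pcm_write(pbn, req->data);
        else                       pcm_read(pbn, req->data);
        hc->bucketinfo[bucket].recency = hc->globalReqClock;
        return;
    }

    if (req->type == IO_READ) {
        hdd_read(req->lbn, req->data);               /* read miss: bypass the cache */
        return;
    }

    /* Write miss: allocate a PCM block, cache the data and record the mapping. */
    pbn = pcm_alloc_block(hc);
    pcm_write(pbn, req->data);
    struct bucket_info *bi = &hc->bucketinfo[bucket];
    if (bi->bcounts == 0) hc->activated_bucks++;     /* activate the bucket        */
    link_to_bucket_head(hc, bucket, pbn);            /* prepend to the bucket list */
    cuckoo_insert(ct, req->lbn, pbn);
    bi->recency = hc->globalReqClock;
    bi->bcounts++;
    hc->total_counts++;
}
```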

2.3 Two-way destaging

Since the capacity of the PCM cache is limited, we have to destage some dirty data from PCM to HDD in order to free cache space for new requests. We therefore propose a two-way destaging approach. It comprises two types of destaging, on-demand destaging and background destaging, which are activated to evict data buckets from PCM. Next, we describe when each type of destaging is triggered and how victim buckets are selected.

On-demand destaging is activated when the utilization of the PCM cache reaches a high percentage, e.g., 95% of the total size. Such an on-demand method can sometimes add wait delays to front-end I/O requests, especially when the I/O load intensity is high. To complement this approach and relieve such contention, we introduce a second destaging method, which is triggered when the PCM utilization is relatively high, e.g., 80%, and the front-end I/O intensity is low (specifically, disk utilization below 10%). By combining these two destaging methods, PCM space can be reclaimed appropriately with minimal performance impact on front-end workloads.

For either destaging method, a bucket is eligible to be destaged to the HDD if either of the following two conditions holds. First, the bucket's bcounts is greater than the average bcounts value plus a constant threshold \({TH}_{BCOUNTS}\), and the bucket's recency is older than globalReqClock by a constant \({TH}_{RECENCY}\). For every unsuccessful round of scanning, these two thresholds are dynamically decreased to ensure that victim buckets can be found within a reasonable number of steps. Second, the bucket's recency is older than globalReqClock by a constant \({OD}_{RECENCY}\) (\({OD}_{RECENCY} \gg {TH}_{RECENCY}\)). As soon as a bucket is identified as eligible for destaging, all cache blocks associated with it are destaged to the HDD in a batch, the bucket is deactivated, and the corresponding items in the cuckoo hash table are deleted. As these cache blocks map to a 1 MB sequential region of the HDD, the batch of write-backs incurs only a single seek operation on the HDD, thus providing good write locality and minimal interference with read requests.

We select these two criteria for determining destaging candidates for the following reasons. First, we want to choose a bucket that has enough items to form a sufficiently large sequential write to the HDD, increasing the spatial locality of write operations, while at the same time not being recently used, so as to preserve temporal locality. Second, very old and small buckets are evicted from the PCM via the control variable \({OD}_{RECENCY}\).
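
As a summary of these rules, the sketch below implements the two trigger checks and the per-bucket eligibility test described above. The utilization probes and the adaptive adjustment of the thresholds are only sketched; the concrete percentages follow the values given in the text.

```c
/* Hedged sketch of two-way destaging: trigger conditions and the per-bucket
 * eligibility test, reusing the structures from Sect. 2.1.1. */

#define ON_DEMAND_UTIL   0.95   /* on-demand destaging threshold            */
#define BACKGROUND_UTIL  0.80   /* background destaging threshold           */
#define DISK_IDLE_UTIL   0.10   /* "front-end I/O intensity is low" cut-off */

static int should_destage(double pcm_util, double disk_util) {
    if (pcm_util >= ON_DEMAND_UTIL)
        return 1;                                    /* on-demand destaging  */
    if (pcm_util >= BACKGROUND_UTIL && disk_util < DISK_IDLE_UTIL)
        return 1;                                    /* background destaging */
    return 0;
}

/* A bucket is a destage victim if it is both large and cold (condition 1),
 * or simply very old (condition 2). avg_bcounts is the running average. */
static int bucket_eligible(const struct halo_cache *hc,
                           const struct bucket_info *bi,
                           uint64_t avg_bcounts,
                           uint64_t th_bcounts,     /* TH_BCOUNTS (adaptive)   */
                           uint64_t th_recency,     /* TH_RECENCY (adaptive)   */
                           uint64_t od_recency)     /* OD_RECENCY >> TH_RECENCY */
{
    uint64_t age = hc->globalReqClock - bi->recency;
    if (bi->bcounts > avg_bcounts + th_bcounts && age > th_recency)
        return 1;
    if (age > od_recency)
        return 1;
    return 0;
}
```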

3 Wear leveling

Although the write endurance of PCM is 3–4 orders of magnitude better than that of FSSDs, it is still worse than that of traditional HDDs. When PCM is used as storage, excessively unbalanced wearing of its cells must be prevented to extend its lifetime. A popular PCM wear-leveling technique [28] avoids frequent writes to the same regions by shifting cache lines and spreads requests through randomization at the granularity of cache lines (256 B). This technique is feasible when PCM is used as part of main memory; however, when PCM is used as a cache for back-end storage, it can harm the spatial locality of file system requests, which are normally several KBytes or MBytes in size. In addition, the use of a Feistel network or an invertible binary matrix for address randomization requires extra hardware to achieve fast transformation. To address these issues, we propose two wear-leveling algorithms for PCM in hybrid devices: rank-bank round-robin wear leveling and space filling curve (SFC) based wear leveling. Instead of using 256 B cache lines or single bits as wear-leveling units, our algorithms use stripes (32 KB each). Such larger units significantly reduce the number of data movements caused by wear leveling. In addition, given the fast access time of PCM devices, the time to move a 32 KB stripe is quite small (less than 0.1 ms), so the data movement overhead does not affect the response times of front-end requests. The important parameters of our algorithms are listed in Table 1.

Table 1 Parameters used for wear leveling

3.1 Rank-bank round-robin wear leveling

The rank-bank round-robin wear-leveling technique is inspired by the RAID architecture. It adopts a similar round-robin procedure to distribute the address space among PCM ranks and banks in order to achieve uniform inter-region write traffic. We first apply block striping over the PCM device to ensure an even distribution of writes at the rank and bank granularity. This scheme iteratively distributes data first over ranks, and then over banks within the same rank, so that consecutive writes to the same rank are spread over its banks. The scheme is shown in Fig. 5. Aside from ensuring a good write distribution across ranks and banks, it also takes full advantage of parallel access to all ranks in the PCM. This means that \(N_{ranks}\) blocks of data can be written at the same time, where \(N_{ranks}\) is the number of ranks in the device. This parallel access translates into improved response times, which is important for data-intensive applications. After block striping, we apply a start-gap rotation scheme inside each bank similar to the method in [28], but with a different data unit size (stripes of 32 KB rather than cache lines of 256 B, for better spatial locality and less frequent rotations). The calculation of the LSN (logical stripe number), rank index, bank index and stripe offset within a bank for the address mapping of rank-bank round-robin wear leveling is given in Eq. (1).

$$\begin{aligned} LSN&= \left\lfloor \dfrac{PBN}{S_{stripe}}\right\rfloor \\ Rank&= LSN \bmod N_{ranks} \\ Bank&= \left\lfloor \dfrac{LSN}{N_{ranks}}\right\rfloor \bmod N_{banks} \\ OS_{inBank}&= \left\lfloor \dfrac{LSN}{N_{ranks} \times N_{banks}}\right\rfloor \end{aligned}$$
(1)
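
A possible implementation of the mapping in Eq. (1) is sketched below, using the nested round-robin interpretation (ranks first, then banks within a rank) described above. The stripe size in PBN units is an assumption (32 KB stripes over 4 KB PCM blocks); the device geometry follows Sect. 3.2.

```c
#include <stdint.h>

/* Sketch of the rank-bank round-robin mapping in Eq. (1). S_STRIPE is assumed
 * to be the stripe size expressed in PBN units (32 KB / 4 KB blocks = 8). */

#define S_STRIPE  8u    /* PCM blocks per 32 KB stripe (assumed PBN granularity) */
#define N_RANKS   8u    /* ranks per device */
#define N_BANKS  16u    /* banks per rank   */

struct rr_addr { uint32_t rank, bank, os_in_bank; };

static struct rr_addr rr_map(uint64_t pbn) {
    uint64_t lsn = pbn / S_STRIPE;                 /* logical stripe number (LSN)   */
    struct rr_addr a;
    a.rank       = lsn % N_RANKS;                  /* stripe over ranks first       */
    a.bank       = (lsn / N_RANKS) % N_BANKS;      /* then over banks within a rank */
    a.os_in_bank = lsn / (N_RANKS * N_BANKS);      /* stripe offset inside the bank;
                                                      start-gap rotation is applied
                                                      on top of this offset         */
    return a;
}
```
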
Fig. 5 Rank-bank round-robin wear leveling

3.2 Space filling curve based wear leveling

The rank-bank round-robin wear-leveling algorithm achieves an even distribution among all banks and ranks in most cases, as described later in Sect. 4. However, under certain workloads, a few intensively accessed banks still reduce the lifetime of the PCM device. To solve this problem, we propose using the Hilbert space filling curve (SFC) to further improve wear leveling. SFCs are mathematical curves whose domain spans a multidimensional geometric space in a balanced manner [19].

In theory, there are infinitely many ways to map one-dimensional points to multi-dimensional ones, but what makes SFCs suitable in our case is that their mapping preserves data locality: points whose 1D indices are close together are mapped to points that remain close in the higher-dimensional space. In our case, the LBN sequence is represented by the 1D order of points. The 3D space into which the LBN sequence is mapped is constructed as a tuple of three elements along the stripe dimension (the offset of a stripe in a bank), the bank dimension (the offset of a bank in a rank), and the rank dimension (the offset of a rank in the device).

$$\begin{aligned} LSN&= \left\lfloor \dfrac{PBN}{S_{stripe}}\right\rfloor \\ OS_{inStripe}&= PBN \bmod S_{stripe} \\ Cube_{no}&= LSN \bmod N_{cubes} \\ OS_{inCube}&= \left\lfloor \dfrac{LSN}{N_{cubes}}\right\rfloor \\ Seq&= StartGapMap(OS_{inCube}) \\ (R,B,S)&= SFCMapFunc(Seq) \end{aligned}$$
(2)

We have 512 stripes in a bank, 16 banks in a rank, and 8 ranks in a device. We evenly split the 3D space into 32 cubes along the stripe dimension; in other words, each cube contains 16 \(\times \) 16 \(\times \) 8 stripes (i.e., #stripe \(\times \) #bank \(\times \) #rank). After splitting, we apply the round-robin method to distribute accesses across these cubes, and inside every cube a start-gap-like stripe shifting is implemented, making the 3D SFC cube move like a snake. As a consequence, consecutive writes within the same cube can only happen for addresses that are at least 32 stripes apart, which dramatically reduces the possibility of intensive writing in the same region. Within each cube, we apply the SFC to further disperse accesses, orchestrating it to spread accesses across ranks as much as possible; this helps exploit parallelism in the hardware.
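
The address decomposition in Eq. (2) can be sketched as follows. The Hilbert-curve ordering of the 16 × 16 × 8 cube (SFCMapFunc) and the in-cube start-gap rotation (StartGapMap) are assumed to be available as a precomputed table and a helper function, respectively; generating the Hilbert ordering itself is not shown.

```c
#include <stdint.h>

/* Sketch of the SFC-based mapping in Eq. (2). Device geometry follows the text:
 * 512 stripes/bank, 16 banks/rank, 8 ranks, and 32 cubes along the stripe axis.
 * sfc_map[] (the Hilbert ordering of one cube) and start_gap_map() are assumed
 * to exist; building them is outside this sketch. */

#define SFC_S_STRIPE      8u                  /* PCM blocks per 32 KB stripe (assumed) */
#define N_CUBES          32u                  /* cubes along the stripe dimension      */
#define STRIPES_PER_CUBE (16u * 16u * 8u)     /* #stripe x #bank x #rank per cube      */

struct sfc_addr { uint32_t rank, bank, stripe; };

extern struct sfc_addr sfc_map[STRIPES_PER_CUBE];      /* precomputed Hilbert order */
extern uint32_t start_gap_map(uint32_t os_in_cube);    /* start-gap shift in a cube */

static struct sfc_addr sfc_translate(uint64_t pbn) {
    uint64_t lsn        = pbn / SFC_S_STRIPE;        /* logical stripe number        */
    uint32_t cube_no    = lsn % N_CUBES;             /* round-robin across the cubes */
    uint32_t os_in_cube = lsn / N_CUBES;             /* position inside the cube     */
    uint32_t seq        = start_gap_map(os_in_cube); /* snake-like in-cube shifting  */
    struct sfc_addr a   = sfc_map[seq];              /* (rank, bank, stripe) in cube */
    a.stripe += cube_no * 16u;                       /* shift by the cube's base:
                                                        16 stripes per bank per cube */
    return a;
}
```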

In summary, using the SFC in combination with the round-robin method, we map a 1D sequence of block numbers to a 3D triple of rank number, bank number and stripe number. The address mapping scheme is depicted in Fig. 6. The left figure shows the logical organization of the device with its 32 cubes (or, more precisely, parallelepipeds, since each is of size \(8\times 16\times 16\)). The right figure shows the 3-dimensional space filling curve used in our work. The mapping scheme starts with the PBN provided by the system and produces a 3-tuple (R, B, S) calculated according to Eq. (2). The SFC-based wear leveling is designed to achieve a good trade-off among write uniformity, access parallelism and spatial locality.

4 Experimental evaluations

In this section, we first introduce the evaluation methodology and then demonstrate the experimental results in terms of I/O performance and PCM endurance.

4.1 Evaluation methodology

4.1.1 Experimental setup

To realize our proposed PCM-based write cache for hybrid storage devices, we have designed a PCM simulation framework that simulates different caching schemes (HALO and LRU), wear-leveling algorithms, and the PCM device's characteristics, including its hardware structure, performance and wearing status. As shown in Fig. 7, block-level I/O traces are fed to the simulator. The I/O requests are processed by the caching and wear-leveling schemes, which generate two types of intermediate I/O requests: PCM requests and HDD requests. The PCM requests are processed by the PCM simulator to obtain response and wear-leveling results. The HDD requests, which come from cache misses and destaging, are stored as HDD trace files. These HDD trace files are then replayed with the blktrace tool [4] on a 500 GB, 7200 RPM Seagate disk in a CentOS 5 Linux 2.6.32 system with an Intel E4400 CPU and 2.0 GB of memory. The DRAM-based system buffer cache is bypassed during trace replaying. Traces are replayed in a closed-loop manner to measure the system service rate.

Fig. 6 Space filling curve based wear leveling

Fig. 7 PCM simulation and trace replay

Table 2 Workload statistics

4.1.2 Metrics

Because PCM devices have much higher throughput (more than 10 times) and much faster responses than HDDs [2], we reasonably assume that the total execution time of a workload trace is dominated by the replay time of the HDD trace. For example, if the HDD traffic rate is about 10% and the write throughput is about 50 MB/sec, then the PCM cache must sustain a throughput of about 500 MB/sec, which is consistent with the reported performance of current PCM devices [2].

Based on the above discussion, the workload execution time can be calculated as follows: (\(Total\_IO\_Size * Traffic\_Rate / Average\_Throughput\)). The traffic rate is calculated as the total number of accessed disk sectors (after the PCM cache’s filtering) divided by the total number of requested sectors in the original workloads. This metric is similar to the cache miss rate. The lower the traffic rate we can achieve, the better the cache scheme performs. In order to achieve shorter execution times and better I/O performance, we must minimize the traffic rate and at the same time maximize the average HDD throughput. According to our tests, a standard hard disk can achieve as high as 100 MB/sec of throughput for sequential workloads but can only achieve 0.5 MB/sec for workloads with small random requests. We will evaluate whether the HALO caching scheme can reduce the HDD traffic rate while maximizing average throughput of a hard disk by reducing the inter-request seek distance among all disk writes.
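
Putting the two observations above together (with the numbers taken from the example earlier in this subsection), the metric and the implied PCM throughput requirement can be summarized as:

$$\begin{aligned} T_{exec}&\approx \frac{Total\_IO\_Size \times Traffic\_Rate}{Average\_Throughput_{HDD}} \\ Throughput_{PCM}&\approx \frac{Throughput_{HDD}}{Traffic\_Rate} = \frac{50~\text {MB/s}}{0.1} = 500~\text {MB/s} \end{aligned}$$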

To evaluate the wear-leveling techniques, we define the PCM life ratio metric, calculated by dividing the achieved lifetime by the maximum lifetime. The life ratio is significantly affected by the uniformity of write requests. For example, if all write traffic goes to 1% of the PCM area, the life ratio drops to 1% of the maximum lifetime. When the device provides no over-provisioning regions, the life ratio is directly determined by the region with the maximum write count.
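
Using the notation introduced later in Sect. 4.3 (NAME-max is a scheme's maximum bank write count deviation and Avg-counts the average bank write count), the life ratio of a scheme can be written as:

$$\begin{aligned} Life\_Ratio&= \frac{Avg\text {-}counts}{Avg\text {-}counts + NAME\text {-}max} = \frac{1}{NAME\text {-}max/Avg\text {-}counts + 1} \end{aligned}$$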

4.1.3 Workloads

In our tests, we use seven representative real-world I/O traces. The workload statistics are described in Table 2. Specifically, the traces Fin1 and Fin2 were collected with the SPC-1 benchmark suite at a large financial organization [37]; the trace Dap was collected at a Display Advertisement Platform’s payload server; the trace Exchange was collected at a Microsoft Exchange 2007 mail server for 5000 corporate users; the trace TPC-E was collected on a storage system of 12 28-disk RAID-0 arrays under an OLTP benchmark, TPC-E [35]; the trace Mail was collected on a mail server by Florida International University [35]. The seventh trace Randw was collected by us on the target disk while running the IOmeter benchmark [25] with the 4 KB-100% random-100% write workload for 120 min.

4.2 I/O performance

Fig. 8 Execution time

To evaluate the execution times of the aforementioned seven traces, we choose 512 MB as the cache size for Fin1 and Fin2, and 2 GB for the other five traces. Figure 8 shows the results for Non-cache, HALO-cache and LRU-cache. As we can see, the execution times are reduced greatly for all traces. The execution time improvement of HALO caching over LRU caching is 47.11% for Fin1, 27.48% for Fin2, 18.97% for Dap, 28.27% for Exchange, 66.10% for TPC-E, 2.65% for Mail, and 67.11% for Randw. The level of improvement is mainly determined by the randomness and write ratio of the traces. Note that for the Mail trace, the improvement is only 2.65%; the reason is that most write requests in the Mail trace are sequential, leaving little room for HALO caching to improve on LRU caching. As introduced in Sect. 4.1, the improvement in execution time comes from two aspects: first, the reduced traffic rate due to better cache hits; second, the increased average HDD throughput due to improved write sequentiality.

Fig. 9 Traffic rate for Fin1, Fin2 and Dap

Fig. 10 Traffic rate for Exchange, TPC-E, Mail and Randw

Figures 9 and 10 show the traffic rates for the seven traces under the HALO cache policy and the LRU cache policy. In most cases (with the cache size ranging from 256 MB to 6 GB), HALO consistently achieves 5–10% lower traffic rates than LRU. Taking the trace Fin1 as an example, HALO achieves 5, 6, 9, 10.3, 3, and 1% lower traffic rates than LRU as the cache size varies from 256 MB to 3 GB. We observe similar results for Exchange and TPC-E. For Dap, because access repeatability and temporal locality are very poor, the traffic rate remains relatively high for both HALO and LRU; however, HALO still achieves a 6% lower traffic rate than LRU. For Randw, which has a completely random write access pattern, no cache scheme gets any appreciable number of cache hits, which explains why the traffic rates of HALO and LRU are almost identical. Yet HALO still brings a significant improvement in execution time because of improved write access locality. For Fin2, as the cache size grows, the PCM write cache consistently reduces more traffic until the cache size reaches 768 MB; beyond that point the benefit levels off because of Fin2's relatively high read ratio (84.60%), which limits the opportunities for optimizing write operations. For Mail, HALO's improvement becomes more insignificant for larger cache sizes; moreover, for small cache sizes (512 MB–2 GB) the traffic rate under LRU caching is about 2% lower than under HALO caching, because most write requests are already uniformly sequential, so our cache scheme, which targets random-write workloads, cannot show much improvement.

Fig. 11 HDD throughput rate and HALO's improvement over LRU

Fig. 12 Average inter-request LBN distance

Table 3 Wear leveling results

We use the average inter-request LBN distance as a metric to evaluate the sequentiality of I/O access to the HDD; the results are shown in Fig. 12. We notice that the average inter-request LBN distance is reduced greatly by HALO caching for almost all traces. This explains why the average disk throughput with HALO is much higher than with Non-cache and with LRU, as shown in Figs. 11 and 12. However, for Fin2, HALO does not reduce the LBN distance, because the majority of Fin2 requests are reads and there is little room for HALO to improve performance. Figure 13 shows the normalized HDD request response times for LRU and HALO. On average, HALO achieves 34.2% better request response times than LRU because of the improved access locality and smoothed workload intensity delivered by HALO caching and two-way destaging.

Fig. 13 Normalized disk response time

4.3 Wear leveling results

Table 3 and Fig. 14 illustrate the results for the different wear-leveling schemes. SG-max represents the maximum bank write count deviation for non-randomized region-based start-gap wear leveling (SG). Avg-counts represents the average bank write count over all 128 banks. NAME-life represents the life ratio relative to perfect wear leveling, which is calculated as \(\frac{1}{(NAME\text{-}max/Avg\text{-}counts +1)}\). The “lifetime improvement” column indicates the lifetime improvement of SFC wear leveling over SG. The average life ratio is 0.9255 for SFC-based wear leveling (SFC), 0.619 for rank-bank round-robin wear leveling (Rank-Bank RR or RR), and 0.0598 for SG. The average lifetime improvement of SFC over SG across all traces is 21.60.

Figure 15 shows the bank write count deviations for RR and SFC under the Fin2 workload. The x-axis is the bank number across 8 ranks \(\times \) 16 banks per rank (128 banks in total), and the y-axis is the deviation of wear counts among all banks. We observe that SFC achieves very good wear uniformity and very low deviations across all banks, which is very helpful for extending the life ratio and lifetime of a PCM device. Bank deviations for the other traces are similar and are not shown here due to space limits.

Fig. 14 Life ratio comparison between different wear leveling techniques

Fig. 15 Wear leveling results for Fin2

5 Related work

In this section, we review recent works on nonvolatile memory and caching techniques.

5.1 Hybrid memory systems

While NVRAM has non-volatility and energy benefits over DRAM and great performance advantages over HDDs, it has limitations such as short device endurance and asymmetric access latency. To overcome these limitations, researchers have combined conventional memory systems with NVRAM so as to leverage the benefits of both DRAM and NVRAM. Ramos et al. [30], Wang et al. [40] and Zhang et al. [43] place PCM and DRAM side-by-side behind the bus to build a hybrid system [18]. Ramos et al. [30] introduce hardware extensions to the memory controller (MC) to monitor the popularity and write intensity of memory pages. They migrate pages between DRAM and PCM and let the operating system (OS) update the memory mapping [39], such that performance-critical and frequently written pages are placed in DRAM, while non-critical and rarely written pages reside in PCM. Zhang et al. [43] also rely on the MC to monitor access patterns; their work differs from [30] by relying completely on the OS to manage pages and treating DRAM as an OS-managed write partition. Over time, frequently written pages are placed in DRAM to reduce writes to PCM. Shi et al. present a flash+PCM hybrid non-volatile disk cache system in which flash is used mainly as a read cache and PCM as a write cache [33]. In [16], non-volatile memory is used as both the buffer cache and the journaling layer for the ext4 file system. Our work aims at combining PCM and HDDs into a back-end storage device. Instead of relying on hardware and/or online detection mechanisms to partition data between PCM and HDDs, we place data blocks based on logical block numbers and record the mapping with hash tables, which avoids involving the OS in monitoring data access patterns and the need for additional hardware.

5.2 Caching and logging

Least recently used (LRU) and least frequently used (LFU) are common and effective cache replacement policies. LRU-k [24], LRFU [15], MQ [45] and LIRS [14] are important improvements over the basic LRU policy; they consider inter-access time, access history and access frequency to improve the hit ratio. DULO [13] and DISKSEEN [7] complement LRU by leveraging the spatial locality of data accesses. DULO gives priority to random blocks by evicting sequential blocks (those with similar block addresses and timestamps). However, the limited sequential bank size and the volatility of DRAM prevent DULO from detecting sequences within a larger global address space and over a longer time scale. For this reason, DULO is not positioned to attain performance improvements over LRU for random-access workloads in storage and file systems.

Logging is a method that aims to mitigate random writes to hard drives [31] and flash-based SSDs [42]. When data blocks are appended to a log rather than updated in place, garbage collection becomes necessary and turns into a critical issue. DCD [23] uses a hard disk as a log disk to improve the random write performance of hard disks. However, it does not solve issues such as random reads from the log disk, and it suffers from expensive destaging operations (still random writes) under heavy workloads. Several techniques employ non-volatile devices to boost the performance of storage and file systems. Some use NVRAM as the file system metadata store [8, 9, 27], while others use NVRAM as LRU/LRW caches in file and storage systems [3, 17, 21, 38]. However, these techniques have limitations: they can only boost performance for certain types of file systems, and they cannot ensure sequential write-back to HDDs because they use an LRU policy to manage cache replacement. The HALO scheme proposed in our hybrid storage system addresses both of these limitations.

5.3 Wear leveling for PCM

Many research efforts have been devoted to wear leveling in order to extend the lifetime of PCM. Qureshi et al. [29] make writes uniform in the average case by organizing data as rotating lines in a page; for each newly allocated page, a random number is generated to determine the detailed rotation behavior. Seong et al. [32] use a dynamic randomized address mapping scheme that swaps data using random keys to prevent adversarial attacks. Zhou et al. [44] propose a wear-leveling mechanism that integrates two techniques at different granularities: a fine-grained row-shifting mechanism that rotates a physical row one byte at a time for a given shift interval, and a coarse-grained segment-swapping mechanism that swaps the most frequently written segments with less frequently written ones. Their work suffers from the overhead of hardware address mapping and the overhead of periodically sorting segments to pick appropriate candidates for swapping. Ipek et al. [12] improve the lifetime of PCM by replicating a single physical memory page over two faulty, otherwise unusable PCM pages; with modifications to the memory controller, TLBs and OS, their work greatly improves PCM lifetime. In [41], Wang et al. provide inter-set and intra-set wear-leveling techniques to even out write operations to ReRAM used as on-chip caches. Our wear-leveling work is distinguished from these prior efforts: we regard the global address space as a multidimensional geometric space and employ a novel space filling curve-based algorithm to evenly distribute accesses across different dimensions. Compared with existing work, our approach significantly extends the lifetime of PCM in hybrid storage devices.

6 Conclusions

In this paper, we propose a new hybrid PCM+HDD storage system that leverages PCM as a write cache to merge random write requests and improve access locality for the storage system. Along with this, we design a caching scheme, named HALO, which utilizes the fast access and non-volatility of PCM to improve system I/O performance with guaranteed reliability. Results from a diverse set of workloads show that HALO achieves lower traffic rates to HDDs due to better caching and higher system throughput thanks to smarter destaging techniques. As a result, our approach reduces execution times significantly, by 36.8% on average. This hybrid storage organization is especially beneficial for workloads with intensive random writes. We also design two wear-leveling schemes, rank-bank round-robin wear leveling and space filling curve based wear leveling, to extend the lifetime of PCM in the proposed hybrid devices. Our results show that SFC-based wear leveling improves the lifetime of PCM devices by as much as 21.6 times. In addition, we have integrated an in-house PCM simulator into DiskSim 4.0 [5] as a new hardware model.

In the future, we plan to devise dynamically partitioned read/write cache schemes for improving a wider range of workload patterns. In addition, a scalable and hierarchical hash table can be introduced for further reducing memory space overhead.