1 Introduction

SRAM has been the mainstream cache technology for years owing to its high access speed, low dynamic power, and other desirable features. However, as more and more cores are integrated on a chip, caches need larger capacity, and increasing the capacity of an SRAM cache leads to high leakage power, which causes a serious on-chip heat dissipation problem. Researchers are therefore looking for alternatives to SRAM.

STT-RAM is regarded as the most promising replacement for SRAM because it offers almost all the desired features of a universal memory and cache, including high storage density, fast read speed, and non-volatility. However, two drawbacks, long write latency and high write energy, limit the application of STT-RAM in L1 cache design. Several efficient schemes have been proposed to overcome these drawbacks when applying STT-RAM in cache design [24, 6, 10], such as relaxing the non-volatility and hybrid cache design.

To overcome these two drawbacks, we propose to relax the non-volatility of STT-RAM, which yields a significant improvement in performance and power consumption. In addition, we design a matching refresh scheme to improve the cache’s reliability. We simulate the proposed L1 cache architecture on the GEM5 simulator and analyze its overall performance from the simulation results.

2 STT-RAM Features

The basic storage cell of STT-RAM is the magnetic tunnel junction (MTJ), shown in Fig. 1. An MTJ contains two magnetic layers, the free layer and the reference layer, separated by an oxide layer. The magnetization direction of the reference layer is fixed, whereas that of the free layer can be switched by a current. If the directions of the two layers are parallel, the MTJ is in the low-resistance state; if they are anti-parallel, the MTJ is in the high-resistance state.

Fig. 1. The MTJ design (1T1J). (a) High-resistance state. (b) Low-resistance state.

The MTJ’s non-volatility can be quantified by its retention time, denoted τ. τ is related to the thermal stability factor ∆ and can be calculated with Eq. (1) [1].

$$ \tau \approx \tau_{0} \exp\left(\Delta\right) $$
(1)

τ_0: the attempt time, set to 1 ns.

∆ is derived from Eq. (2).

$$ \Delta = \frac{E_{F}}{k_{B} T} = \frac{M_{s} V H_{K}}{2 k_{B} T} $$
(2)

M_s: the saturation magnetization.

H_K: the effective anisotropy field.

T: the working temperature.

k_B: the Boltzmann constant.

V: the volume for the STT-RAM write current.

From Eqs. (1) and (2), it follows that the data retention time of an MTJ decreases exponentially as its working temperature T increases.

Depending on the write pulse width T_w, MTJ switching falls into three regions: thermal activation, dynamic reversal, and precessional switching. We relax the MTJ’s retention time to 30 μs (∆ = 10.3) and adjust T_w to obtain three design options, which are called LRS1, LRS2, and LRS3 in this paper [2].
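As a quick numerical check of Eq. (1), the following minimal Python sketch evaluates the retention time for ∆ = 10.3 with the 1 ns attempt time given above. The function names are illustrative only, and the temperature scaling in `scale_delta` assumes that the energy barrier E_F in Eq. (2) does not itself change with temperature, which is a simplification not stated in the text.

```python
import math

TAU0_S = 1e-9  # attempt time tau_0 = 1 ns, as stated for Eq. (1)


def retention_time(delta, tau0=TAU0_S):
    """Eq. (1): tau ~ tau_0 * exp(Delta), in seconds."""
    return tau0 * math.exp(delta)


def delta_for_retention(tau, tau0=TAU0_S):
    """Invert Eq. (1): thermal stability factor needed for a target retention time."""
    return math.log(tau / tau0)


def scale_delta(delta_ref, t_ref_k, t_k):
    """Eq. (2) with E_F held constant (assumption): Delta is proportional to 1/T."""
    return delta_ref * t_ref_k / t_k


tau = retention_time(10.3)
print(f"Delta = 10.3 -> retention of about {tau * 1e6:.1f} us")
print(f"30 us retention -> Delta of about {delta_for_retention(30e-6):.2f}")

# Hypothetical temperatures, chosen only to illustrate the trend.
hot_delta = scale_delta(10.3, t_ref_k=300.0, t_k=350.0)
print(f"At 350 K (reference 300 K): retention of about {retention_time(hot_delta) * 1e6:.1f} us")
```

The first two lines of output confirm that ∆ = 10.3 and a retention time of roughly 30 μs are consistent with each other; the last line illustrates how sharply retention drops when the temperature rises.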

3 STT-RAM LLC Design

3.1 Performance Parameters

Based on the analysis in Section 2, we simulate the performance of the three LRS STT-RAM designs with NVSim [5]. The results are shown in Table 1.

Table 1. The parameters for multi-retention STT-RAM cells.

Table 1 shows that when the retention time of the MTJ is relaxed to the μs level, its access speed is almost the same as that of SRAM. The performance gap among the three designs is small; however, the area of LRS3 is much larger than that of the other two designs.

Based on the above analysis, LRS2 STT-RAM has the best overall performance of the three designs and is used in our cache design.

3.2 L1 Cache Architecture

In a 2 GHz processor system, the read latency of LRS2 is 2 cycles and its write latency is 3 cycles, whereas both latencies of SRAM are 3 cycles. However, the retention time of LRS2 is only 30 μs, which is shorter than the write interval of many blocks in the L1 cache. Much of the stored data would therefore become invalid if no measures were taken.

We therefore propose to add two counters, the refresh-counter and the access-counter, to every LRS2 block in the L1 cache. The refresh-counter monitors how long the data have been stored in a volatile block. We divide the retention time of STT-RAM into N_R intervals, where N_R is the maximum value of the refresh-counter. Generally, we set N_R = 15 for a 32 KB L1 cache. The maximum value of the access-counter is 8. So the refresh-counter is 4 bits, and the access-counter is 3 bits. The architecture is shown in Fig. 2.

Fig. 2. The L1 cache architecture

All counters are driven by a global clock whose period is T_gc = τ/N_R. The access-counter is used to track whether a block has been accessed within the most recent 5 T_gc, because most data in L1 are accessed again shortly after being written into a block. If a block is not accessed within 5 T_gc, we assume that it will not be accessed again during its lifetime. The L2_Writeback bit records whether the last write operation was a writeback from the L2 cache: it is set to 1 if so, and to 0 otherwise.
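To make the timing concrete, the short sketch below works out the global clock period from the values given in the text (τ = 30 μs, N_R = 15, a 2 GHz core clock). The helper names are illustrative and not part of the proposed design.

```python
RETENTION_S = 30e-6   # LRS2 retention time tau, from the text
N_R = 15              # refresh-counter maximum for a 32 KB L1 cache, from the text
CLOCK_HZ = 2e9        # 2 GHz processor clock, from the text
IDLE_PERIODS = 5      # a block idle for 5 T_gc is assumed dead, from the text


def global_clock_period(retention_s, n_r):
    """T_gc = tau / N_R, in seconds."""
    return retention_s / n_r


def seconds_to_cycles(seconds, clock_hz):
    """Convert a duration to a whole number of core clock cycles."""
    return round(seconds * clock_hz)


t_gc = global_clock_period(RETENTION_S, N_R)
t_gc_cycles = seconds_to_cycles(t_gc, CLOCK_HZ)
idle_window_s = IDLE_PERIODS * t_gc

print(f"T_gc = {t_gc * 1e6:.1f} us = {t_gc_cycles} cycles")
print(f"idle window = {idle_window_s * 1e6:.1f} us")
```

Under these values each global clock tick spans a few thousand core cycles, so the counters are updated far less often than the cache is accessed.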

To perform a refresh, the data stored in an LRS block are first read out into the buffer and then written back to the block. If a write request arrives during this process, we abort the refresh and execute the write; if a read request arrives, it gets the data from the buffer directly and does not need to wait for the refresh to complete. The whole process takes about 5 cycles in a 2 GHz CPU system, which is much shorter than the retention time of LRS2 STT-RAM, so the refresh duration need not be considered when calculating N_R and T_gc.
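The interaction between an in-flight refresh and incoming requests can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, timing is not modeled, and the buffer is treated as holding a single block.

```python
class RefreshableBlock:
    """Toy behavioral model of one LRS block and its refresh buffer."""

    def __init__(self, data):
        self.data = data
        self.buffer = None
        self.refreshing = False

    def start_refresh(self):
        # Step 1: copy the block contents into the buffer.
        self.buffer = self.data
        self.refreshing = True

    def finish_refresh(self):
        # Step 2: write the buffered data back, unless the refresh was aborted.
        if self.refreshing:
            self.data = self.buffer
            self.buffer = None
            self.refreshing = False

    def read(self):
        # A read during a refresh is served from the buffer without waiting.
        return self.buffer if self.refreshing else self.data

    def write(self, value):
        # A write during a refresh aborts the refresh; the new data win.
        self.buffer = None
        self.refreshing = False
        self.data = value


block = RefreshableBlock(b"old")
block.start_refresh()
during_refresh = block.read()   # served from the buffer
block.write(b"new")             # aborts the refresh
block.finish_refresh()          # no effect, the refresh was already aborted
after_write = block.read()
```

A write makes the refresh unnecessary because it rewrites the cell anyway, which is why aborting loses nothing.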

At the end of each T_gc, all counters are incremented by 1. Both the refresh-counter and the access-counter of an LRS2 block are reset to 0 on a write access, whereas only the access-counter is reset to 0 on a read access. When a refresh-counter reaches N_R, we check the access-counter: if it is lower than 5, we refresh the block; otherwise, we check L2_Writeback. If L2_Writeback is 0, we write the data back to the L2 cache and invalidate the block in the L1 cache; if it is 1, we only invalidate the block.
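The counter rules above can be sketched as a per-block state machine. This is a minimal illustration, not the authors' implementation: the names are hypothetical, and three details are assumptions not spelled out in the text, namely that an access-counter value of exactly 5 counts as idle, that the access-counter saturates at its maximum, and that a refresh resets the refresh-counter.

```python
from dataclasses import dataclass

N_R = 15         # refresh-counter maximum, from the text
ACCESS_MAX = 8   # access-counter maximum, from the text
IDLE_LIMIT = 5   # idle threshold in T_gc periods, from the text


@dataclass
class BlockState:
    refresh_counter: int = 0
    access_counter: int = 0
    l2_writeback: bool = False   # True if the last write came from the L2 cache
    valid: bool = True


def on_write(block, from_l2):
    """A write resets both counters and records where the data came from."""
    block.refresh_counter = 0
    block.access_counter = 0
    block.l2_writeback = from_l2
    block.valid = True


def on_read(block):
    """A read resets only the access-counter."""
    block.access_counter = 0


def on_global_tick(block):
    """Advance one T_gc period and return the action taken for this block."""
    if not block.valid:
        return "none"
    block.refresh_counter += 1
    # Assumption: the access-counter saturates instead of wrapping around.
    block.access_counter = min(block.access_counter + 1, ACCESS_MAX)
    if block.refresh_counter < N_R:
        return "none"
    if block.access_counter < IDLE_LIMIT:
        # Assumption: a refresh restarts the retention window.
        block.refresh_counter = 0
        return "refresh"
    # Assumption: a counter value of exactly 5 is already treated as idle.
    block.valid = False
    return "invalidate" if block.l2_writeback else "writeback_and_invalidate"


def run_ticks(block, n):
    """Apply n global clock ticks and return the last action."""
    action = "none"
    for _ in range(n):
        action = on_global_tick(block)
    return action
```

For example, a block written by the core and never touched again reaches `run_ticks(block, 15)` with the result `"writeback_and_invalidate"`, while a block that was read a few ticks before the deadline is refreshed instead.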

The hardware overhead is (4 bits × 2)/(64 bytes) = 1.56%. Based on simulation results, these counters account for less than 6% of the overall dynamic power consumption, which has little influence on the overall performance.
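The overhead figure can be reproduced with a few lines of arithmetic. The sketch below uses the 8 extra bits per 64-byte block from the formula above; the total for a 32 KB cache is derived here for illustration and is not a number reported in the text.

```python
EXTRA_BITS_PER_BLOCK = 4 * 2    # "4 bits x 2", as in the overhead formula
BLOCK_BYTES = 64                # block size used in the overhead formula
CACHE_BYTES = 32 * 1024         # 32 KB L1 cache, from the text


def storage_overhead(extra_bits, block_bytes):
    """Fraction of extra storage relative to the data bits of one block."""
    return extra_bits / (block_bytes * 8)


overhead = storage_overhead(EXTRA_BITS_PER_BLOCK, BLOCK_BYTES)
num_blocks = CACHE_BYTES // BLOCK_BYTES
total_extra_bytes = num_blocks * EXTRA_BITS_PER_BLOCK // 8

print(f"per-block overhead: {overhead:.4f}")
print(f"blocks: {num_blocks}, extra storage: {total_extra_bytes} bytes")
```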

4 Simulation

4.1 Experimental Setup

In this article, we use GEM5 [7, 8] to conduct architectural experiments that evaluate the overall system performance of LRS STT-RAM. The configuration is shown in Table 2. The benchmarks are selected from SPEC CPU 2006 [10].

Table 2. GEM5 configuration

4.2 Architectural Simulation

Performance is measured in instructions per cycle (IPC), and the results are shown in Fig. 3. LRS3 achieves the best performance (100.5%), while LRS1 has the lowest (99.1%). The IPC of LRS2 is the same as that of SRAM.

Fig. 3. The IPC test results

The leakage power consumption results are shown in Fig. 4. LRS3 has the highest leakage power, at 18.6% on average, whereas LRS1 has the lowest, at 13.9%. The leakage power of LRS2 is 16.0%. The leakage power consumption is doubled by the introduction of the refresh buffer; however, the total is still much lower than that of the SRAM cache.

Fig. 4. The total leakage power consumption

The dynamic power consumption results, which include the refresh energy, are shown in Fig. 5. LRS3 also has the highest dynamic power, at 101.1%, followed by LRS2 at 87.3%. LRS1 has the lowest dynamic power, at 80.4%.

Fig. 5. The total dynamic power consumption

The overall power consumption shown in Fig. 6 is the sum of the leakage and dynamic power consumption. LRS1’s power consumption is only about half that of SRAM, at 58.6%. LRS2 is a bit higher than LRS1, at 63.8%. Compared with these two designs, LRS3 has the highest power consumption, at 73.9%.

Fig. 6. The overall power consumption

5 Conclusion

In this paper, we propose a novel L1 cache architecture based on volatile STT-RAM to improve reliability and save energy. Our simulation results show that the proposed volatile STT-RAM L1 cache has the same overall performance as SRAM while consuming only 63.8% of its power. In addition, the total on-chip area of the proposed L1 cache can be reduced by 64.8% in the ideal case.