1 Introduction

The concern of sustainability has transformed the high-performance computing (HPC) landscape, and energy is now as important as performance. Supercomputers are ranked not only by the TOP500 List [39] but also by the Green500 [8]. As computing systems approach an enormous scale, power consumption accounts for a large share of their total cost of ownership, making power management an increasingly important research focus in supercomputing. Take Tianhe-2, the fastest supercomputer on the TOP500 list (as of June 2014), as an example: its total power consumption is up to 17,808 kW [39]. Running Tianhe-2 for a year consumes 156 GWh. To put this figure in perspective, it equals the annual household electricity consumption of over 312,800 persons in China or 36,000 persons in the US. The electricity bill for Tianhe-2 runs between $65,000 and $100,000 a day [40]. Among the top ten supercomputers, seven have similar power efficiencies, ranging from about 1900 to 2700 Mflops/W. This implies that huge power consumption is not an exception but a commonplace problem. The major source of power consumption in these supercomputers is their many-core processors. For example, Tianhe-2 consists of 32,000 Xeon E5 and 48,000 Xeon Phi processors, totaling 3,120,000 cores, which contribute over 60 % of the system power. To reduce the power costs and carbon footprint of data centers, improving the power efficiency of state-of-the-art many-core architectures has become a pressing research problem.

It has been shown that the energy consumption of a program exhibits convex energy behavior; that is, there exists an optimal CPU frequency at which energy consumption is minimal [41]. Dynamic voltage and frequency scaling (DVFS) trades off performance against power by dynamically and adaptively changing the clock frequency and supply voltage of the CPUs. Existing work on DVFS [6, 10, 26, 33, 42] has experimentally confirmed its effectiveness, saving about 15 to 90 % of the CPU chip's energy. With increasingly data-intensive HPC workloads and multi-tenant cloud computing workloads, there are ever more energy-saving opportunities to scavenge, and DVFS is the core technology suited for this purpose; in other words, DVFS is an indispensable part of a green HPC system. However, reaping power savings through frequency/voltage scaling without causing a disproportionately large delay in runtime, i.e., optimizing the energy–delay product (EDP), remains a research challenge. Most prior DVFS studies and solutions did not consider the latency of voltage/frequency scaling. Our investigation shows that the latency of voltage scaling is non-negligible, especially on many-core architectures with multiple voltage domains [12, 16, 29, 32, 34]. Scheduling power state transitions without awareness of the latencies involved falls short of the expected power efficiency; worse still, performing power state transitions too aggressively introduces extra performance loss and energy dissipation.

In this paper, we explore the latency characteristics of DVFS and design a novel latency-aware DVFS algorithm for many-core computing architectures in which the DVFS latency becomes a notable issue. A few existing studies have considered the DVFS overheads. Ye and Xu [47] proposed reducing the number of power state transitions by allocating each task to a core already in a suitable power state. However, the program execution pattern usually changes with the workflow, so the optimal power settings for different phases of program execution are likely to differ. Although task allocation reduces the number of DVFS transitions, it can miss good opportunities for saving energy. Ioannou et al. [13] recognized the latency overhead problem, but they merely spaced voltage transitions apart using a minimum-interval threshold. This mitigation is clearly suboptimal, and more efficient ways to deal with the latency issue are possible.

To bridge this gap, we propose a new latency-aware DVFS algorithm that avoids unnecessarily aggressive power state transitions. "Aggressive" here means that the next power state transition follows too soon after the last one; such overly frequent voltage/frequency changes are not only unprofitable but also detrimental, given the extra time and energy costs they introduce. We implement our ideas in a usable power management library on top of the Barrelfish multikernel operating system [2] and evaluate its effectiveness on the Intel Single-Chip Cloud Computer (SCC) [12]. By calling the power management routines of the library at profitable locations (usually I/O or synchronization points), an application program or framework, such as our Rhymes shared virtual memory (SVM) system [19], can reap energy savings easily. Our current design of the library adopts a custom offline profiler to obtain a per-application execution profile for guiding power tuning decisions. Experimental results using various well-known benchmarks (e.g., Graph 500 [11] and Malstone [3]) show that our latency-aware DVFS algorithm makes significant energy and EDP improvements over both the baseline power management scheme (without latency awareness) and the scheme proposed by Ioannou et al. [13] for amortizing DVFS latency costs.

On top of our previous publication [18], this paper extends the work with a thorough latency-aware DVFS algorithm, presents the design and implementation of a new dynamic power management (DPM) solution based on the algorithm, and provides more complete and in-depth experimental evaluation results to prove its effectiveness. While our study was performed on the Intel SCC, which is only a research processor consisting of Pentium P54C cores, its power-related design is typical of state-of-the-art multicore or many-core chips with on-chip networks and fine-grained DVFS support (multiple clock/voltage domains). DVFS latency issues are not specific to the Intel SCC; they affect almost all chip multiprocessors, such as the Xeon Phi, whose frequency/voltage scaling latency is in the millisecond range. Our findings and proposed solutions are therefore insightful for the general development of energy-efficient many-core computing architectures. The generic contributions of this work that are independent of the SCC or Barrelfish are as follows:

  • We carry out an in-depth study of the latency characteristics of voltage/frequency scaling on a real many-core hardware platform. We confirm that the DVFS latency is non-negligible (sometimes up to hundreds of milliseconds in practice), yet it is neglected or handled poorly by traditional DVFS schemes. Ignoring this factor brings considerable side effects on system performance and chip power consumption in the attempt to save energy by DVFS.

  • Based on the experimental investigation of many-core DVFS latencies, we devise a novel latency-aware DVFS control algorithm for a profile-guided, phase-based power management approach applicable to shared-memory programming. The control algorithm is particularly useful for chip multiprocessors with multiple clock/voltage domains and non-trivial DVFS latencies. It is in fact not restricted to a profile-guided DPM approach, but applicable to all other DVFS-based power management approaches [13, 23, 24, 26]. We present experimental results taken on a real system with a working implementation to demonstrate the effectiveness of the proposed DVFS scheme.

The remainder of this paper is organized as follows. Section 2 discusses the basic concept of DVFS latency and our investigation into its effect on many-core architectures. We describe our new latency-aware DVFS algorithm and its implementation in Sect. 3. Section 4 presents our experimental results and analysis. Section 5 reviews related work. Finally, we conclude the paper in Sect. 6.

2 DVFS latency on many-core architectures

Before presenting the latency-aware DVFS algorithm, it is important to first investigate the latency behaviors of voltage/frequency scaling on a typical many-core system. In particular, we focus the study on many-core tiled architectures with multiple voltage domains.

2.1 Basics of DVFS latency

As a key feature for dynamic power management, many CPU chips provide multiple power states (pairs of voltage/frequency, or \(V/f\) henceforth) for the system to adaptively switch between. Scheduling DVFS according to the varying program execution behavior, such as compute-intensiveness and memory access pattern, can help to save energy without compromising performance. One basic but important rule for DVFS is that the voltage must be high enough to support the frequency at all times, i.e., the current frequency cannot exceed the maximal frequency that the current voltage supports. As shown in Fig. 1, we assume that there are three different frequency values provided by the hardware, \(F0\), \(F1\) and \(F2\), where \(F0<F1<F2\). For each frequency, there is a theoretical least-voltage value that satisfies this frequency's need. According to this condition, we can draw a line of "safe boundary" on the voltage–frequency coordinate plane in Fig. 1. All the \(V/f\) states above this boundary are unsafe (or dangerous) as they violate the basic condition and could damage the hardware. On the other hand, all the \(V/f\) states under this boundary are considered safe.

Fig. 1 Relationship between voltage and frequency during dynamic scaling

We must consider whether the power state will cross the safe boundary during the scaling. For example, when scaling up voltage and frequency, if we scale the frequency first, the voltage may not yet be high enough to support the new frequency. Since execution performance depends only on frequency, keeping the voltage at the least operational level for each frequency yields the most power-efficient states (the green states in Fig. 1). Of course, we can apply a voltage much higher than the least voltage for each frequency (the orange states in Fig. 1). These states are safe, but less power-efficient than the least-voltage states with the same frequency.

To change the power state (voltage and frequency values) from (\(V_{s}\), \(F_s\)) to (\(V_d\), \(F_d\)), assuming both are safe states, we in fact have to scale the voltage and frequency separately. The problem is that both frequency and voltage scaling incur some delay. Moreover, the latency of voltage scaling is generally much higher than that of frequency scaling (a millisecond scale versus a handful of CPU cycles).

We find that the latency of voltage scaling needs to be taken into account only when both the frequency and voltage have to be scaled up. In the other cases, where min(\(V_s\), \(V_d\)) is high enough to support max(\(F_s\), \(F_d\)), although latency is involved in scaling the voltage from \(V_s\) to \(V_d\) (and the frequency from \(F_s\) to \(F_d\)), the program can actually keep running during the voltage (or frequency) scaling since the current voltage level is high enough to support both \(F_s\) and \(F_d\). To reap energy savings, apart from the minuscule latency of scaling down the frequency, there is no noticeable latency after scaling down the voltage. Restoring or increasing the CPU performance, by contrast, is liable to a millisecond-scale latency penalty. Specifically, in the case that \(V_s<V_d\) and \(F_s<F_d\), after scaling up the voltage (which has to be done first for the safety reason explained above), we must wait until the voltage reaches the level of \(V_d\), which is safe to support the new frequency \(F_d\). If we scale the frequency to \(F_d\) while the voltage level is not high enough to support it, the CPU will stop working. This situation is very dangerous and could damage the chip.

In conclusion, the strategies for voltage/frequency scaling and the associated latency costs are summarized in Table 1. For better power efficiency, we assume the system switches between power-efficient states only; under this assumption, \(F_s >F_d\) only if \(V_s>V_d\). When lowering the power state, we scale down the voltage after scaling down the frequency, so the program need not wait for the voltage scaling to finish. When lifting the power state, the program has to suspend and wait until the voltage has been scaled up, and only then scale up the frequency.

Table 1 DVFS latency in different scaling cases
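To make the ordering concrete, the following C sketch illustrates the two strategies of Table 1. It is a minimal sketch only: set_frequency(), set_voltage() and wait_for_voltage() are hypothetical placeholders for the platform-specific driver calls, not an actual API.

```c
/* A minimal sketch of the two scaling strategies summarized in Table 1.
 * set_frequency(), set_voltage() and wait_for_voltage() stand in for the
 * platform-specific driver calls; they are placeholders, not real APIs. */

typedef struct { int freq_mhz; int volt_mv; } power_state_t;

static void set_frequency(int freq_mhz)   { (void)freq_mhz; }  /* ~tens of cycles        */
static void set_voltage(int volt_mv)      { (void)volt_mv;  }  /* starts the transition  */
static void wait_for_voltage(int volt_mv) { (void)volt_mv;  }  /* blocks until settled   */

/* Lowering the power state: frequency first, then voltage.  The program
 * keeps running; the old, higher voltage is always safe for the new,
 * lower frequency, so no waiting is needed. */
static void scale_down(power_state_t cur, power_state_t dst)
{
    set_frequency(dst.freq_mhz);
    if (dst.volt_mv < cur.volt_mv)
        set_voltage(dst.volt_mv);            /* no need to wait for completion */
}

/* Lifting the power state: raise the voltage first and wait until it has
 * settled (millisecond-scale latency), and only then raise the frequency.
 * Raising the frequency early would cross the safe boundary. */
static void scale_up(power_state_t cur, power_state_t dst)
{
    if (dst.volt_mv > cur.volt_mv) {
        set_voltage(dst.volt_mv);
        wait_for_voltage(dst.volt_mv);       /* this wait is the DVFS latency cost */
    }
    set_frequency(dst.freq_mhz);
}
```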

2.2 DVFS latency on many-core architectures

The lack of a model characterizing DVFS latency for many-core architectures with multiple voltage domains is a crucial research gap to fill. In this section, we investigate the DVFS latency behavior and contribute an experimental model on a representative many-core x86 chip, the Intel SCC [12], which was designed as a vehicle for scalable many-core software research. The SCC is a 48-core CPU consisting of six voltage domains and 24 frequency domains. Each two-core tile forms a frequency domain, while every four tiles form a voltage domain (a.k.a. voltage island). The frequency of each tile can be scaled by writing the global clock unit (GCU) register shared by the two cores of the tile. The SCC contains a voltage regulator controller (VRC) that allows independent changes to the voltage of an eight-core voltage island [14].

According to Intel's documentation [15], a voltage change is of the order of milliseconds, whereas a frequency change can finish within 20 CPU cycles on the SCC. We also conducted experiments to measure the latencies accurately. We found that the latency of frequency scaling is nearly unnoticeable, so we can concentrate on the voltage switching time alone. To measure it, we designed a microbenchmark which performs a series of power state transitions between various possible power states (\(V/f\) pairs). Adjacent transitions are separated by sufficiently long computation time to avoid interference in the measurements. We adopt a commonly used method in the community for measuring voltage scaling latencies, which we call "double write": writing the VRC register twice when it is necessary to wait for the voltage transition. It is the second write to the VRC register that exposes the latency; it returns only as soon as the voltage reaches the desired value. During the execution of the microbenchmark, we record the wall-clock times of all "double writes" on the VRC register and take them as the voltage scaling latencies. The timestamps for wall-clock time measurement are taken from the global timestamp counter based on the 125 MHz system clock of the SCC board's FPGA (off the chip). We launch the microbenchmark program on 4, 8, 12, 28, 32 and 36 cores to produce simultaneous voltage scaling on 1, 2, 3, 4, 5 and 6 voltage domains, respectively.
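The following C sketch outlines the "double write" measurement. The helpers write_vrc() and read_global_timestamp() are hypothetical wrappers around the SCC-specific VRC register write and the off-chip FPGA timestamp counter; the register encodings are omitted, so this is a sketch of the method rather than the actual microbenchmark code.

```c
/* Sketch of the "double write" voltage latency measurement.  The two
 * helper functions are platform-specific stubs (assumptions): write_vrc()
 * wraps the VRC register write, read_global_timestamp() reads the global
 * counter driven by the 125 MHz FPGA system clock. */
#include <stdio.h>
#include <stdint.h>

#define FPGA_CLOCK_HZ 125000000ULL

static void write_vrc(int domain, int volt_mv) { (void)domain; (void)volt_mv; }
static uint64_t read_global_timestamp(void)    { return 0; }

/* Returns the measured voltage scaling latency in milliseconds. */
static double measure_voltage_latency(int domain, int volt_mv)
{
    write_vrc(domain, volt_mv);              /* 1st write: starts the transition      */
    uint64_t t0 = read_global_timestamp();
    write_vrc(domain, volt_mv);              /* 2nd write: returns only once the      */
                                             /* voltage has reached volt_mv           */
    uint64_t t1 = read_global_timestamp();
    return (double)(t1 - t0) * 1000.0 / FPGA_CLOCK_HZ;
}

int main(void)
{
    /* e.g., scale voltage domain 0 from 0.8 V to 0.9 V and report the latency */
    printf("latency = %.2f ms\n", measure_voltage_latency(0, 900));
    return 0;
}
```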

Figure 2 shows the average latency of voltage scaling measured in two cases: from 0.8 to 0.9 V and from 0.9 to 1.1 V. For a single voltage domain, the latencies in both cases are about 30 ms. However, when multiple voltage domains scale their voltages simultaneously, the latency seen by each domain surges to a much higher level and increases linearly with the number of domains. Our experiments show that scaling all six voltage domains simultaneously from 0.8 to 0.9 V takes about 195 ms, which is a very high overhead for on-die DVFS. Voltage switching time in the millisecond range may be SCC specific, but the latency surge due to concurrent voltage requests represents a common problem. We attribute the linear latency increase to the use of a single VRC (located at a corner of the on-chip mesh) to control the voltages of all the domains. Although this simplifies the VRC circuitry and saves die area, it presents a bottleneck for workloads that issue frequent concurrent voltage switching requests. We believe that many (predominantly Intel) chip multiprocessors, e.g., Intel Ivy Bridge, are prone to this scalability issue since their DVFS designs resemble the SCC's, having a global chip-wide voltage regulator for all cores or domains. While we agree that fine-grained DVFS offers more power savings, it is hard to scale the number of on-chip regulators for a many-core processor for compounded reasons related to regulator loss, inductor size and die area. This is where latency-aware software-level DVFS techniques can help to address this architectural problem.

Fig. 2 Latency of voltage scaling on a chip with multiple voltage domains

3 Latency-aware power management

3.1 Baseline power management scheme

Our baseline dynamic power management (DPM) scheme adopts a profile-guided approach for determining the optimal power states for different portions of the program execution. The scheme is implemented as a power management library and a kernel-level DVFS controller. We employ the library to optimize the power efficiency of Rhymes SVM [19], an SVM runtime system we developed for running parallel applications on the SCC port of Barrelfish as if they were running on a cache-coherent shared-memory machine. In the SVM programming environment, application code generally employs the synchronization routines (lock acquire, lock release and barrier) provided by the SVM library to enforce memory consistency of shared data across parallel threads or processes. The parallel program execution is thus typically partitioned by locks and/or barriers. Moreover, the code segments across a barrier or a lock operation are likely to perform different computations and exhibit different memory access patterns. Thus, the program execution can be divided into phases by these barriers and locks. The phases can be classified into stages performing the real computation and busy-waiting stages corresponding to barrier or lock intervals. A per-application phase-based execution profile recording the execution pattern of each phase can be derived by an offline profiling run of the program. Note that the latency-aware DVFS algorithm that we are going to propose will be evaluated based on, but is not limited to, this power management approach.
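As a purely illustrative sketch (not the actual Rhymes or profiler interface), a profiling hook of the following shape could record the per-phase information later used for power tuning; the structure fields, sizes and hook name are all assumptions.

```c
/* Illustrative sketch of per-phase profiling at synchronization points.
 * The structure fields and the hook name are hypothetical; the real
 * profiler records comparable information (phase length, IPC, bus
 * utilization, busy-waiting flag) around the barrier and lock routines. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_PHASES 4096

typedef struct {
    uint64_t runtime_us;    /* T_k: measured phase execution time    */
    double   ipc;           /* instructions per cycle from the PMCs  */
    double   bus_util;      /* ratio of bus cycles from the PMCs     */
    bool     busy_waiting;  /* lock/barrier spin phase?              */
    int      optimal_freq;  /* filled in later by the DVFS algorithm */
} phase_profile_t;

static phase_profile_t profile[MAX_PHASES];
static int cur_phase;

/* Called by the SVM runtime whenever a barrier or lock boundary is crossed. */
void profiler_phase_boundary(uint64_t now_us, double ipc, double bus_util,
                             bool was_busy_waiting)
{
    static uint64_t phase_start_us;
    if (cur_phase < MAX_PHASES) {
        profile[cur_phase].runtime_us   = now_us - phase_start_us;
        profile[cur_phase].ipc          = ipc;
        profile[cur_phase].bus_util     = bus_util;
        profile[cur_phase].busy_waiting = was_busy_waiting;
        cur_phase++;
    }
    phase_start_us = now_us;
}
```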

One of the key problems of the profile-guided DVFS scheme is how to determine the optimal power state for each phase. We designed prediction models [17] for the optimal power and runtime performance in each phase. The power model and performance model are based on two indexes, instructions per cycle (IPC) and bus utilization (ratio of bus cycles), which are derived from the performance monitor counters (PMCs) provided by the CPU. As the power/performance models are not the focus of this work, their details are not included in this paper.

We assume the goal of power management is to minimize the EDP, or energy–performance ratio [43], a commonly used metric for evaluating the power efficiency of DPM solutions. We can then predict the EDP of each phase at a certain power state \((f,v)\) (henceforth, we use the frequency alone to represent the power state, as we assume the voltage is kept at the least value) using the power and performance models as follows:

$$\begin{aligned} \mathrm{EDP}(f) &= \mathrm{Energy}(f) \cdot \mathrm{Runtime}(f)\\ &= \mathrm{Runtime}(f) \cdot \mathrm{Power}(f) \cdot \mathrm{Runtime}(f)\\ &= \mathrm{Power}(f) \cdot \mathrm{Runtime}(f)^2 \end{aligned}$$
(1)

Then, we can choose the optimal power state for each phase to achieve the minimal EDP. However, this method does not consider the latency of voltage/frequency scaling. If the power state before the phase begins differs from the predicted optimal power state for the phase, we have to change the power state first, which introduces some latency and extra power consumption. Thus, a method that does not take latency into account can lead to wrong decisions.

3.2 Latency-aware DVFS

Algorithm 1 The latency-aware DVFS algorithm

Based on our investigation in Sect. 2, DVFS latency is non-negligible and should be taken into account in optimal power state tuning. In essence, power states must be altered with respect to the implicit deadlines imposed by phase transitions so that the performance boost or energy reduction effects can take place for a sufficient length of time. As the latency of frequency scaling is minuscule, we only consider the latency of scaling up the voltage. Besides the voltage transition time, issuing a power request also incurs some latency overhead, as it entails context switching between user space and the kernel.

Our proposed latency-aware DVFS algorithm is shown in Algorithm 1. We denote the latency of scaling up voltage as \(\varDelta _s\) and the latency of issuing a power request as \(\varDelta _i\). For an application with a sequence of profiled phases \(P_k\)’s, we assume that the execution time of each phase, \(T_k\), can be obtained in the profiling run, during which we can also get certain basic information of each phase, like whether it is busy waiting or performing real computations. The algorithm is composed of two for-loops.

1st loop For each phase, there are two cases in determining the optimal power state. On one hand, if the phase \(P_k\) is a busy-waiting phase, we need to reduce the power as much as possible without increasing the execution time of the phase. So, we check the length of the execution time (\(T_k\)) to choose the optimal power state. If \(T_k \leqslant \varDelta _i\) (meaning the phase is not even long enough to cover the time of issuing a request to change the power level), the system does nothing and keeps the current power state. Otherwise, if \(T_k \leqslant \varDelta _s\) (meaning the phase is not long enough to scale the voltage), the system keeps the voltage and scales the frequency down to the lowest level \(f_\mathrm{min}\). If the busy-waiting time is long enough for scaling down the voltage, the algorithm scales both the frequency and voltage down to their lowest operating points. On the other hand, if the phase is not busy waiting but performs real computation, we compute the optimal power setting using Eq. 3; the method of tuning is detailed below.

2nd loop It is possible that the execution time of a busy-waiting phase \(P_k\) is not long enough to scale the frequency or voltage to the lowest level (so the system keeps running in some high-power state left by \(P_{k-1}\) or \(P_{k-2}\)...), but the next phase \(P_{k+1}\) does not need such a high-power setting. In this case, it is actually better to lower the power state as early as possible to reduce energy wasted in busy waiting. Therefore, for each busy-waiting phase \(P_k\), if the frequency (\(f_k\)) and/or voltage (\(v_k\)) settings are higher than those of the next phase (which is supposedly performing real computation), frequency and/or voltage will be scaled down in advance to the \(V/f\) values of the next phase.
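A C sketch of these two loops is given below. The phase record and constants are illustrative, and the EDP-based selection for computation phases (Eq. 3) is stubbed out here; it is sketched separately at the end of this subsection.

```c
/* A sketch of the two passes of Algorithm 1, restricted to the handling of
 * busy-waiting phases.  The phase_t fields and constants are illustrative;
 * the EDP-based choice for computation phases (Eq. 3) is stubbed out. */
#include <stdbool.h>
#include <stdint.h>

#define F_MIN_MHZ   100     /* lowest frequency level (illustrative)        */
#define V_MIN_MV    800     /* lowest voltage level (illustrative)          */
#define DELTA_I_US  2000    /* latency of issuing a power request (2 ms)    */
#define DELTA_S_US  240000  /* worst-case voltage scale-up latency (240 ms) */

typedef struct {
    uint64_t runtime_us;    /* T_k from the profiling run     */
    bool     busy_waiting;  /* barrier/lock spin phase?       */
    int      freq;          /* frequency chosen for the phase */
    int      volt;          /* corresponding least voltage    */
} phase_t;

/* Placeholder for the EDP-optimal choice of Eq. 3 (see the later sketch). */
static void select_compute_state(phase_t *p, int cur_freq, int cur_volt)
{
    (void)cur_freq; (void)cur_volt;
    p->freq = 800; p->volt = 1100;      /* stub only */
}

static void latency_aware_schedule(phase_t *ph, int n)
{
    /* 1st loop: choose a power state for every phase in program order. */
    int f = 800, v = 1100;              /* state at the start of the run */
    for (int k = 0; k < n; k++) {
        if (ph[k].busy_waiting) {
            if (ph[k].runtime_us <= DELTA_I_US) {
                ph[k].freq = f; ph[k].volt = v;                /* too short: keep state */
            } else if (ph[k].runtime_us <= DELTA_S_US) {
                ph[k].freq = F_MIN_MHZ; ph[k].volt = v;        /* scale frequency only  */
            } else {
                ph[k].freq = F_MIN_MHZ; ph[k].volt = V_MIN_MV; /* scale both f and V    */
            }
        } else {
            select_compute_state(&ph[k], f, v);                /* Eq. 3                 */
        }
        f = ph[k].freq; v = ph[k].volt;
    }

    /* 2nd loop: if a busy-waiting phase was left at a higher state than the
     * next phase needs, lower it in advance to cut busy-waiting energy. */
    for (int k = 0; k + 1 < n; k++) {
        if (ph[k].busy_waiting) {
            if (ph[k].freq > ph[k + 1].freq) ph[k].freq = ph[k + 1].freq;
            if (ph[k].volt > ph[k + 1].volt) ph[k].volt = ph[k + 1].volt;
        }
    }
}
```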

For a phase which is not busy waiting, assuming the optimization targets the least EDP, the optimal power state for the phase, denoted by \(f_\mathrm{optm}\), should be the frequency value (with its corresponding least voltage) that minimizes the sum of the EDP consumed in executing the phase and the EDP consumed in voltage/frequency scaling (from the current power state \(f_c\) to \(f\)), denoted by \(\mathrm{EDP}_\mathrm{phaseRun}(f)\) and \(\mathrm{EDP}_{(f_c{\rightarrow }f)}(f)\), respectively. The sum of the two EDPs, denoted by \(\mathrm{sumEDP}(f)\), is given as follows:

$$\begin{aligned} \mathrm{sumEDP}(f) &= \mathrm{EDP}_\mathrm{phaseRun}(f)+\mathrm{EDP}_{(f_{c}{\rightarrow }f)}(f)\\ &= p_{f}\cdot (t_{f})^2+\frac{1}{2}(p_{f_c}+p_{f})\cdot (\varDelta _i+\varDelta _{s(f_c{\rightarrow }f)})^2 \end{aligned}$$
(2)

As shown in Eq. 2, the power during voltage and frequency scaling is estimated as the average of the powers before and after the scaling (\(\frac{1}{2}(p_{f_c}+p_{f})\)). The runtime overhead of DVFS consists of the latency of issuing a power request (\(\varDelta _i\)) and the DVFS latency (\(\varDelta _{s(f_c{\rightarrow }f)}\)) of transiting from the current power state \(f_c\) to \(f\). The DVFS latency \(\varDelta _{s(f_c{\rightarrow }f)}\) is derived according to the different scaling cases described in Table 1. As we ignore the latency of frequency scaling, \(\varDelta _{s(f_c{\rightarrow }f)}\) equals zero for the first case (scaling down frequency/voltage) in Table 1, and equals \(\varDelta _s\) for the second case. In Eq. 2, the power \(p_{f_c}\) at the current power state \(f_c\), the power \(p_f\) at \(f\) and the runtime \(t_f\) at \(f\) can be estimated by the performance/power models.

Hence, the optimal power state \(f_\mathrm{optm}\) is given by Eq. 3.

$$\begin{aligned} f_\mathrm{optm}=f \, \text { s.t. } \, \mathrm{sumEDP}(f)=\min _{f_\mathrm{min}{\leqslant }f{\leqslant }f_\mathrm{max}}\mathrm{sumEDP}(f) \end{aligned}$$
(3)

Our current design adopts an offline profile-guided DPM approach. As the number of possible power states (\(V/f\) pairs) is usually small, the complexity of the minimization process is not a concern. Thus, the optimal power state minimizing \(\mathrm{sumEDP}\) for each phase can be selected according to Eq. 3 from the candidate power states listed in Table 3. These optimal power settings are then applied in subsequent production runs.

As revealed in Sect. 2, the largest latency for voltage scaling measured through microbenchmarking is about 195 ms. But in full-load tests with real-world benchmark programs like Graph 500, we observe that the actual latency can reach 240 ms. Voltage scale-up events usually happen upon barrier exits, where all cores (all six voltage islands) request power state transitions simultaneously. It is therefore an effective heuristic to set \(\varDelta _s\) to 240 ms in Eq. 2; this setting was also experimentally validated to be the most effective choice in our tests. Although the latency for the local core to issue a power request is of the order of thousands of cycles, we set \(\varDelta _i\) to 2 ms in our experiments to account for the context switching overheads.
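The state selection of Eqs. 2 and 3 can then be sketched as follows, using the \(\varDelta _s\) and \(\varDelta _i\) values just discussed. The candidate list is a subset of the Table 3 entries, and predict_power()/predict_runtime() are illustrative stubs standing in for the power/performance models of [17]; none of this is the actual implementation.

```c
/* A sketch of the EDP-optimal state selection of Eqs. 2 and 3.  The
 * candidate states are a subset of the SFLV entries of Table 3, and the
 * two predictor functions are illustrative stubs, not the real models. */
typedef struct { int freq; int volt; } state_t;

static const state_t candidates[] = {
    {800, 1100}, {533, 900}, {400, 800}, {320, 800}, {200, 800}, {100, 800}
};
#define N_CAND ((int)(sizeof(candidates) / sizeof(candidates[0])))

#define DELTA_I_S 0.002   /* latency of issuing a power request (s)  */
#define DELTA_S_S 0.240   /* worst-case voltage scale-up latency (s) */

/* Illustrative model stubs: predicted chip power (W) and phase runtime (s). */
static double predict_power(int freq)   { return 20.0 + 0.05 * freq; }
static double predict_runtime(int freq) { return 80.0 / freq; }

/* Eq. 2: EDP of running the phase at f plus EDP of the transition fc -> f. */
static double sum_edp(state_t cur, state_t cand)
{
    double p_f = predict_power(cand.freq);
    double t_f = predict_runtime(cand.freq);
    double p_c = predict_power(cur.freq);

    /* DVFS latency: zero when scaling down, Delta_s when the voltage rises. */
    double trans = DELTA_I_S + ((cand.volt > cur.volt) ? DELTA_S_S : 0.0);

    return p_f * t_f * t_f + 0.5 * (p_c + p_f) * trans * trans;
}

/* Eq. 3: pick the candidate state that minimizes sumEDP. */
static state_t select_optimal_state(state_t cur)
{
    state_t best = candidates[0];
    double  best_edp = sum_edp(cur, candidates[0]);
    for (int i = 1; i < N_CAND; i++) {
        double e = sum_edp(cur, candidates[i]);
        if (e < best_edp) { best_edp = e; best = candidates[i]; }
    }
    return best;
}
```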

3.3 Implementation on Barrelfish

We designed and implemented a DVFS controller and user library on Barrelfish, a multikernel many-core operating system developed by ETH Zurich and Microsoft Research [2], to assess the effectiveness of the latency-aware DVFS algorithm. Our DVFS controller follows a domain-aware design adapted to many-core chips with clustered DVFS support (Intel's SCC is a typical example). Specifically, each CPU core plays a role inside the controller. The roles include stub cores (SCore), frequency domain masters (FMaster) and voltage domain masters (VMaster). All cores act as SCores. Meanwhile, in each frequency or voltage domain, we assign one core as the frequency or voltage master, which is responsible for determining the domain-optimal power level and scaling the power level of the domain. The domain-wide optimization policy is flexible and configurable according to different scenarios. In our current implementation, the domain-wide power setting adopts an "arithmetic mean" policy as Ioannou et al. [13] proposed; that is, the power level of a domain is set to the arithmetic mean of the frequencies or voltages requested by all the cores in the domain.
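A minimal sketch of this domain-wide policy is shown below; the gathering of per-core requests and the core-to-domain mapping are simplified away, so only the averaging step of the policy is illustrated.

```c
/* Sketch of the domain-wide "arithmetic mean" policy: the domain master
 * sets the domain's level to the mean of the levels requested by the
 * member cores (following Ioannou et al. [13]).  Request collection and
 * core-to-domain mapping are omitted. */
#include <stddef.h>

/* Returns the frequency (MHz) the domain master should apply, given the
 * requests gathered from the n cores of its domain. */
int domain_mean_policy(const int *requested_mhz, size_t n)
{
    if (n == 0)
        return 0;
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += requested_mhz[i];
    return (int)(sum / (long)n);   /* arithmetic mean of the requests */
}
```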

As shown in Fig. 3, the DVFS controller is made up of three main modules, namely the broker, the synchronizer and the driver, which are implemented at the kernel level. The broker instances running on the CPU cores collectively control the frequency–voltage settings for the chip, using the capabilities provided by the synchronizer and driver modules. Below, we describe each module in more detail.

  • Broker is an event-driven subroutine that intelligently performs the DVFS actions. When the system boots up, the broker is responsible for determining the role of the local core and for handling the DVFS requests made from user space via the API. If the local core is an FMaster or VMaster, it also handles the events for synchronizing the DVFS requests from other cores in the domain.

  • Synchronizer is the module in which we designed an inter-core communication protocol to synchronize power requests from different CPU cores. The protocol implementation on the Intel SCC applies a real-time technique that makes use of the efficient inter-processor interrupt (IPI) hardware support to guarantee better DVFS efficiency. This virtual real-time IPI-based inter-core communication mechanism greatly reduces the response time of power tuning requests.

  • Driver is a low-level layer of code that carries out the actual frequency and voltage scaling operations supported by the many-core hardware. On the Intel SCC, the frequency of a two-core tile is scaled by writing the configuration register of the GCU, which is shared by the two cores on the tile. The voltage is changed by writing a 17-bit VRC register [14].

Fig. 3 Design of the DVFS controller on the Barrelfish OS

The API block in Fig. 3 refers to the user-space library provided for programmers or execution environments to drive the DVFS controller. It is a lightweight DVFS interface that facilitates the development of high-level DPM policies at the middleware or application level. The main functions of the API are described in Table 2. A DPM policy needs only this API to make local DVFS requests to the DVFS controller; in other words, the kernel parts of the DVFS controller are totally transparent to users.

Table 2 The main functions of DVFS interface implemented on Barrelfish
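Purely for illustration, the sketch below shows how a runtime such as Rhymes might drive the controller around a barrier. The names dvfs_request() and rhymes_barrier() are invented stand-ins, not the actual Table 2 API or the real Rhymes routines, and the chosen frequency/voltage values are examples from the profile discussion above.

```c
/* Hypothetical usage sketch of the DVFS library around a barrier.  Both
 * dvfs_request() and rhymes_barrier() are illustrative stand-ins for the
 * real library/runtime calls listed in Table 2 and provided by Rhymes. */
#include <stdbool.h>

static void dvfs_request(int freq_mhz, int volt_mv, bool wait)
{
    (void)freq_mhz; (void)volt_mv; (void)wait;   /* stand-in only */
}
static void rhymes_barrier(void) { /* busy-wait until all cores arrive */ }

void barrier_with_dpm(void)
{
    /* Entering a (potentially long) busy-waiting phase: drop to the lowest
       profitable state recorded in the profile for this phase. */
    dvfs_request(100, 800, false);

    rhymes_barrier();

    /* Leaving the barrier: restore the profiled state of the next
       computation phase; the voltage scale-up latency is paid here. */
    dvfs_request(800, 1100, true);
}
```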

4 Experimental evaluation

4.1 Experimental environment and testing methodology

We evaluate the latency-aware DVFS solution on an Intel SCC machine (with 32 GB RAM) using several well-known benchmarks. The operating system is the SCC port of Barrelfish. The instantaneous chip power can be measured by reading the power sensors provided by the Intel SCC platform. Thus, the energy consumption could be obtained by integrating the instantaneous power over time. All the experiments were conducted on 48 cores of the SCC. As the temperature of the SCC board was maintained at around \(40\,^{\circ }\mathrm {C}\), we ignored the impact of the temperature on the power of the CPU chip. The clock frequencies of both the mesh network and memory controllers (MCs) of the SCC were fixed at 800 MHz during the experiments.

As discussed in Sect. 2, a frequency change in a frequency domain is valid only if the new frequency value is "safe" to reach at the current voltage. On the SCC platform, the frequency is scaled by a frequency divider (Fdiv) with a value from 2 to 16, and the resulting frequency equals 1600 MHz/Fdiv. According to Intel's SCC documentation [15], a voltage of 0.8 V is enough to support 533 MHz. However, when booting Barrelfish on 48 cores of the SCC, we find that the booting process always fails at the bootstrap of the 25th core if the initial voltage is 0.8 V while the initial frequency is 533 MHz. We also find that the system throws unexpected errors when the voltage is scaled down to 0.7 V, especially when we launch programs on a large number of cores (e.g., 48 cores). To keep programs running safely, we set the least voltage for 533 MHz to 0.9 V, and 0.8 V for frequencies lower than or equal to 400 MHz. In short, we derived a safe-frequency-least-voltage (SFLV) table (see Table 3) that we use to tune the \(V/f\) settings.

Table 3 Combinations of safe-frequency and least-voltage settings for CPU cores of the Intel SCC
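The SFLV mapping just described can be sketched as follows; the intermediate entries of Table 3 are approximated here by the three voltage rules stated in the text, so this is an illustration rather than a reproduction of the table.

```c
/* Sketch of the safe-frequency-least-voltage (SFLV) mapping of Sect. 4.1:
 * f = 1600 MHz / Fdiv (Fdiv = 2..16); least voltage 1.1 V for 800 MHz,
 * 0.9 V for 533 MHz, and 0.8 V for frequencies <= 400 MHz.  Intermediate
 * Table 3 entries are approximated by these three rules. */

int sflv_freq_mhz(int fdiv)               /* valid Fdiv range: 2..16 */
{
    return (fdiv >= 2 && fdiv <= 16) ? 1600 / fdiv : -1;
}

int sflv_least_voltage_mv(int freq_mhz)
{
    if (freq_mhz > 533) return 1100;      /* 800 MHz needs 1.1 V             */
    if (freq_mhz > 400) return 900;       /* 533 MHz needs 0.9 V             */
    return 800;                           /* <= 400 MHz runs safely at 0.8 V */
}
```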

Based on the above experimental conditions, we set up four different power management (DPM) policies for comparison in terms of power, runtime performance, energy consumption and the EDP index. The four policies are detailed as follows:

  • Static 800M To evaluate the efficiency of the various DPM schemes, we need a static power policy as a control experiment. This policy uses a static power setting at the highest power state: all CPU cores' frequencies are set to 800 MHz, and their voltages are set to the least value of 1.1 V throughout the run. The profile information of each benchmark program is also derived using this experimental setting.

  • Baseline DVFS This policy refers to our baseline profile-guided DPM scheme without the latency-aware DVFS algorithm (i.e., a “latency-unaware” policy). All \(V/f\) switching is done observing the SFLV table. Although we do not consider the DVFS latency in this policy, we set the latency of issuing a power request (\(\varDelta _i\) in Sect. 3.2) to 2 ms to take into account the overhead of power state switching.

  • Latency-aware DVFS Based on the baseline policy, this is an enhanced policy that considers the voltage scaling latency and adjusts the DVFS decisions according to the algorithm presented in Sect. 3.2. The latency of scaling up voltage (\(\varDelta _s\)) is set to be the maximum value (240 ms).

  • Threshold-based DVFS Also based on the baseline policy, this policy emulates the solution of Ioannou et al. [13]: a fixed threshold of 240 ms (the maximal voltage scaling latency) is imposed between voltage transitions. If the time distance between the current voltage scaling and the prior one is less than this threshold, the policy ignores the voltage scaling request. This solution was considered effective for avoiding excessive (non-profitable) power state transitions, and we compare it with our latency-aware scheme.

4.2 Benchmark programs

Experimental comparison was done using four benchmark programs, namely Graph 500, LU, SOR and Malstone, listed in Table 4. We ported these application programs to our Rhymes SVM system [19], which leverages software virtualization to restore cache coherence on the SCC machine with its non-coherent memory architecture. In this way, programmability at the application level is not much compromised compared with a traditional shared-memory programming model. The porting effort was limited to converting the original memory allocation and synchronization code into code using the malloc, lock and barrier functions provided by Rhymes. Among the benchmark programs, Graph 500 and Malstone are "big-data" computing applications, while the other two are classical scientific computing algorithms. In particular, Graph 500 is the most complex but representative one, so it is worth more elaboration as follows.

Table 4 Characteristics of the four benchmark programs

Graph 500 is a project maintaining a list of the most powerful machines designed for data-intensive applications [11]. Researchers observed that data-intensive supercomputing applications are of growing importance in current HPC workloads, but existing benchmarks did not provide useful information for evaluating supercomputing systems for such applications. To guide the design of hardware architectures and software systems to support these applications, the Graph 500 benchmark was proposed and developed. Data-intensive benchmarks are expected to have more potential for energy saving than compute-intensive ones [4], so Graph 500 is a suitable benchmark for evaluating our solution. The workflow of Graph 500 is described in Algorithm 2. Its kernel workload performs breadth-first searches (BFSes) over a large-scale graph. In our experiments, the problem size for every Graph 500 test is set as follows: SCALE = 18 (262,144 vertices) and EDGE factor = 16 (4,194,304 edges). The execution of Graph 500 (including 64 BFS runs) is divided into 1700+ phases delimited by barrier and lock operations using the profile-guided DPM approach described in Sect. 3.1. In the original Graph 500 benchmark, only step 2 and step 4.2 (a.k.a. the kernels) are timed and included in the performance report. Since our goal is not to compare the kernels' performance with other machines, we did not follow this way of timing and took the total execution time instead.

Algorithm 2 Workflow of the Graph 500 benchmark

For the other three benchmark programs, LU factors a matrix as the product of a lower triangular matrix and an upper triangular matrix; the program performs blocked dense LU factorization with a problem size of a \(2048\times 2048\) matrix and a \(16\times 16\) block size. The SOR benchmark performs red-black successive over-relaxation on a \(4096\times 2048\) matrix. Malstone [3] is a stylized benchmark for data-intensive computing, which implements a data mining algorithm to detect "drive-by exploits" (or malware) from log files; we used a log file of 300,000 records for testing.

Table 4 shows the performance characteristics of the four benchmark programs. We can see that both LU and SOR are actually more data-intensive (having higher bus utilization and lower IPC) than Graph 500 and Malstone, which implies that Graph 500 and Malstone should have less potential for power saving. On the other hand, Graph 500 and LU are divided by barrier and lock operations into many phases, while SOR and Malstone consist of only a few phases. A higher number of phases implies more power state transition opportunities and a more severe concern about the DVFS latency. Therefore, we expect our latency-aware algorithm to deliver larger power efficiency improvements over the other policies for the benchmarks with more phases, i.e., Graph 500 and LU.

4.3 Experimental results

Under the experimental settings described in Sect. 4.1, we monitor the power, runtime, energy and EDP variations of the four benchmarks under the different power management policies. The results are shown in Table 5. Note that the results were obtained with the optimization target of minimal EDP, as described in Sect. 3.

Table 5 Results of average power, runtime, energy and EDP obtained during benchmark program executions under different power management policies

In Table 5, “Runtime” denotes the total execution time of the benchmark program. “AvgPower” refers to the average chip power of the SCC, including the power of the CPU cores and the network-on-chip (NoC). “Energy” is the energy consumption of the chip during the execution, i.e., the product of average power and runtime, and “EDP” is the product of energy and runtime. We also present the results (the items marked with *) normalized to the corresponding Static 800M values. For ease of visualizing the comparison, we plot the normalized values of runtime, average power, energy and EDP as histograms as shown in Fig. 4.

Fig. 4 Results of average power, runtime, energy and EDP of the benchmarks running under different power management policies. The values are normalized to those of the Static 800M policy. a Graph 500, b LU, c SOR, d Malstone

The experimental results for Graph 500 are plotted in Fig. 4a, from which we can see that all three DVFS policies achieve considerable savings in energy or EDP compared with the static power mode. Although the baseline (latency-unaware) DVFS policy achieves 40.7 % energy saving, it gives the worst EDP result. Threshold-based DVFS achieves 31.6 % energy saving and 9.9 % EDP reduction. Latency-aware DVFS brings about 54.7 % energy saving and 33.7 % EDP reduction. Compared with the baseline DVFS policy, our latency-aware DVFS algorithm achieves 23.6 % more energy saving and 40.9 % more EDP saving. Compared with threshold-based DVFS, latency-aware DVFS further reduces the energy and EDP by 33.8 and 26.4 %, respectively.

For the LU benchmark (Fig. 4b), although the three DVFS policies can all reduce the average power and energy significantly (average reduction of 63.2 and 28.1 %, respectively), only latency-aware DVFS reduces the EDP product (by 62.5 %). The other two policies, baseline and threshold based, give the worst EDP figures (increased by 73.4 and 52.4 %, respectively) due to substantial performance loss.

For the SOR benchmark (Fig. 4c), latency-aware DVFS performs better than other policies in all aspects, including average power, runtime, energy and EDP (although the improvements over baseline DVFS are marginal for this program). Compared with Static 800M, it has 60.0 % energy saving and 58.9 % EDP reduction, outperforming the threshold-based policy by saving 56.6 % more energy and giving 57.0 % better EDP without observable performance degradation.

For Malstone (Fig. 4d), we can see all the three DVFS schemes can achieve significant energy saving and EDP reduction. Moreover, our latency-aware DVFS scheme achieves the least EDP as desired (57.2 % less than the static policy’s EDP) despite the 17.7 % runtime increase it costs.

In summary, compared with the static policy (Static 800M), our latency-aware DVFS algorithm achieves 51.2 % average EDP reduction (with 55.3 % average energy saving) while the average overhead of execution time is 8.8 %. Compared with baseline DVFS, it gives 31.3 % EDP reduction, 24.0 % energy saving and 15.2 % less overhead of execution time in the average case. It also wins over the DVFS solution of Ioannou et al. [13] by an average of 42.5 % further reduction in EDP and 44.9 % more energy saving.

4.4 Analysis and discussion

We further analyze and discuss the experimental results by linking to observations about the chip power variation (Fig. 5) during the execution of the benchmarks.

Fig. 5 Chip power comparison during execution under different power management policies. a Graph 500, b LU, c SOR, d Malstone

4.4.1 Analysis of Graph 500

Figure 5a shows the chip power of the SCC when Graph 500 was run under the different power management policies. For the first 13 s in the figure, the performance and power obtained under the different policies are nearly the same. This is because the program is performing compute-intensive edge generation and graph construction in that time range, where the opportunity for power saving is very limited (so a high-power setting was applied). Beyond this range, the program begins to perform breadth-first searches and the workload becomes more data-intensive (or memory-bound), so the DPM policies get opportunities to lower the power with little performance loss. With our proposed latency-aware DVFS algorithm, the DPM solution effectively avoids most of the excessive long-latency voltage transitions and therefore achieves better runtime performance. We can also see that both the baseline and threshold-based DVFS policies cause the chip power to jump up and drop down aggressively, with long spikes in the figure. The latency-aware policy, however, keeps the power variation more stable, lingering in the low-power range of around 20–30 W.

The results confirm that aggressive power state transitions translate into either a sizable slowdown in runtime or an increase in average chip power consumption. Although the threshold-based policy tries to avoid excessive voltage transitions by putting a fixed time gap between DVFS scheduling reference points, it may make wrong decisions that allow non-profitable power state transitions while omitting profitable ones. This is also why the chip power stays in the higher-power region most of the time for the threshold-based DVFS curve in Fig. 5a; we can conjecture that it missed quite a lot of rewarding transitions from high-power to low-power states. Therefore, compared with latency-aware DVFS, although threshold-based DVFS gets an 11.2 % performance improvement (or 10.1 % runtime reduction), it consumes 68.1 % more average power.

To make the comparison fairer, we also present the chip power under a static 533 MHz/0.9 V power setting (denoted as Static 533M in Fig. 5a), which is a moderately throttled power state. As shown in Fig. 5a, although Static 533M performs better than Static 800M (full speed) in energy consumption (83.7 % of Static 800M's energy), its EDP is still a little (3.9 %) higher. Moreover, Static 533M loses in both energy and EDP to nearly all of the other three policies (84.7 % more energy and 56.7 % more EDP than latency-aware DVFS).

4.4.1.1 Analysis of the phase-based profiles

Since our baseline DPM scheme follows a profile-guided approach, we present the execution profile of Graph 500 to figure out the correlation between the profile and DVFS scheduling. The profiles of Graph 500 with and without latency awareness are shown in Fig. 6. For clear visualization, we show only the most representative part of the profiles from phase #300 to #500. The profiles include the optimal power setting (\(V/f\) pair) for each phase.

Fig. 6 Comparison of the profiles with and without the latency-aware algorithm: in the upper sub-figure, we can find some aggressive DVFS decisions which cause power state transitions between different voltage levels (referring to Table 3, frequencies of 800, 533 MHz and other values below 533 MHz imply different corresponding voltages). After applying the latency-aware algorithm, we can see in the lower sub-figure that these aggressive state transitions have disappeared

It is quite common in SVM programming styles that the master process exhibits an execution pattern somewhat different from other processes of the same parallel program. For example, the master process usually takes up the duties of reading the input data file, allocating a shared array for all cores to access, and printing the aggregated computation result. We assume the master process is always running on core 0 and all “non-master” processes share a very similar execution pattern. In Fig. 6, “core0” and “core1” in the legend denote the profiles of the master process and the processes running on other cores, respectively.

The top sub-figure of Fig. 6 shows the profile information derived without the latency-aware algorithm. The \(x\)-axis represents the phase number; the \(y\)-axis stands for the optimal frequency (MHz) for the corresponding phase. As described in Sect. 4.1, the least voltage for 800 MHz is 1.1 V; for 533 MHz, it is 0.9 V; and for frequency levels below 533 MHz, it is 0.8 V. We can see in the figure that there exist many excessive DVFS scheduling decisions due to the lack of latency awareness. In particular, the frequency is scaled many times between different voltage levels, which incurs high voltage-switching costs.

We expect the latency-aware DVFS algorithm proposed in Sect. 3 to avoid such excessive DVFS decisions. The bottom sub-figure of Fig. 6 shows that there are far fewer DVFS decisions made between different voltage levels after applying the latency-aware algorithm. The DVFS decisions become more "conservative" in the cases where voltage scaling is needed. This shows that our proposed algorithm is capable of choosing the most profitable power states to switch between when the temporal effects of voltage scaling are considered.

4.4.2 Analysis of LU

The chip power variation of the LU program execution is depicted in Fig. 5b. The execution is divided into 525 phases by barriers and locks. In the figure, it is obvious that the execution times obtained by the baseline and threshold-based policies both increase to over double that obtained by latency-aware DVFS. Meanwhile, the power maintained under these two policies is quite unstable, fluctuating vigorously between 30 and 50 W. By comparing the DVFS decisions, we find that the baseline scheme keeps switching the power state frequently between different voltage levels (0.9 and 0.8 V). As we investigated in Sect. 2, such voltage scaling has a high latency cost to pay. These frequent voltage transitions are the cause of the dismal performance and additional energy dissipation observed in Fig. 4b. Although the threshold-based scheme avoids some of the excessive power state transitions and performs slightly better than the baseline, its power efficiency is still much worse than that of latency-aware DVFS.

4.4.3 Analysis of SOR

Figure 5c displays the chip power consumed by the SOR program during its execution. We can see that enabling DVFS causes essentially no increase in runtime, no matter which policy is used. By inspecting the profile of this benchmark, we find that the main portion of execution, from about the 20th second to the end, corresponds to a single phase performing a sorting procedure, which is highly memory-bound in nature: up to 98 % bus utilization is noted during our performance monitoring. For such a data-intensive workload, we can lower the frequency and/or voltage to save energy with vanishingly small performance loss. This is also observable in Fig. 4c, where all the runtime bars are almost the same.

Another interesting phenomenon is that threshold-based DVFS stops reducing power after the 20th second. This is because the policy finds that the upcoming frequency/voltage scaling is not far enough away from the last scaling and therefore skips the scaling request. With only one inaccurate decision, it sacrifices the vast opportunity to save power over the long-running phase. This again implies that the strategy of spacing DVFS scheduling decisions apart by a universal gap can be overkill.

On the other hand, we can see that the baseline and latency-aware policies perform quite alike for this benchmark. This is because the entire execution of SOR has only nine phases, leaving too little room for latency-aware DVFS scheduling to yield significant improvements. This contrasts with the very different situation of the LU program execution, which has hundreds of phases.

4.4.4 Analysis of Malstone

Figure 5d shows the chip power curves of the Malstone execution under the different policies. The results show that all three DVFS policies achieve significant power reduction with slight performance degradation. The effect of latency-aware DVFS on this benchmark is more biased towards power saving, maintaining the lowest power state among all the policies for most of the time. Although latency-aware DVFS gives a longer execution time, it still achieves a better performance–energy trade-off than the others, resulting in the minimum EDP.

5 Related work

Power-efficient system designs and energy optimization techniques are of crucial importance to almost every kind of computing system, including supercomputers, clouds and mobile devices. One fundamental approach to lowering system power consumption is to build hardware using low-power design technologies. Thapliyal et al. provide various designs at the electronic circuitry level to lower the power usage of fundamental compute operations [1, 35–38]. For example, exploiting reversible logic implementations of carry look-ahead and carry skip BCD arithmetic units can effectively reduce the number of reversible gates used and the garbage output produced, thus lowering the power usage of the processor chip. Recent work by Yang et al. [46] demonstrates a low-power design for ASIX-based radar system-on-chips (SoCs). Apart from employing low-power ASIX cores, they also improve the dynamic clock-distribution mechanism of the power management module for fine-grained power state control.

OS- and application-level power management solutions constitute another approach to the problem. Dynamic voltage and frequency scaling (DVFS) is a canonical technique for trimming power consumption by dynamically adjusting voltage and/or frequency levels to the time-varying needs of the system. Existing work on DVFS [6, 10, 26, 33, 42] has experimentally revealed probable savings of about 15 to 90 % of the CPU chip's energy by the technique. DVFS mechanisms, including the so-called multiple energy gears [9] and dynamic overclocking [22], and DVFS policies are well studied in the literature [9, 13, 21, 23, 30, 31]. Freeh and Lowenthal [9] present a power-efficient execution framework using multiple frequency–voltage settings. Ma et al. [23] adopt control theory to precisely control the power of an entire many-core chip. David et al. [5] demonstrate a power management algorithm that runs in real time and dynamically adjusts the performance of islands of cores to reduce power consumption while maintaining the same level of performance. Li et al. [21] present a software-controlled execution scheme that considers the effects of dynamic concurrency throttling (DCT) and dynamic voltage/frequency scaling (DVFS) in the context of hybrid programming models. Lo and Kozyrakis [22] study the power–performance impact of CPU TurboMode and propose autoturbo to dynamically manage TurboMode for modern multicore chips. However, none of these works explored the latency behavior of DVFS, even though their evaluations were conducted on real multicore hardware platforms.

This paper fills the research gap by developing a novel DVFS algorithm that counteracts the side effects of the scaling latency. The direct inspiration for this paper comes from our study of the DVFS latency characteristics on a many-core chip. We found that the latency is non-negligible and varies case by case, calling for a more intelligent DVFS scheduling algorithm to counteract its adverse effects. Some prior work did reveal the potential problem with DVFS latency [13, 27, 28, 47]; these studies are discussed below.

Ravishankar et al. [28] argue that if fine-grained DVFS support is provided, the overheads of DVFS will scale with the number of cores in multiprocessor system-on-chip (MPSoC) platforms. There also exist a few research efforts on combating the DVFS overhead issue. For instance, thread motion [27] or thread migration [28] has been introduced into DPM schemes to reduce the DVFS overhead. Ravishankar et al. [28] propose a power-aware thread migration algorithm to dynamically migrate threads to appropriate cores with different, static power states. However, thread migration can make the execution environment much more complex.

Ye et al. [47] propose reducing the number of power state transitions by introducing task allocation into learning-based dynamic power management for multicore processors. However, the program execution pattern usually changes with the workflow, so the optimal power settings for different phases of program execution are likely to differ. Although task allocation reduces the number of DVFS transitions, it can also miss important power saving opportunities.

Finally, Ioannou et al. [13] propose a hierarchical DVFS controller using a phase prediction algorithm for MPI applications. Their work is the closest to ours: both adopt a phase-based hierarchical power management approach and use the Intel SCC hardware for experimental evaluation. But we target shared-memory programs (on Barrelfish), while they target MPI programs (evaluated on SCC Linux). Their solution is also aware of the non-negligible voltage transition costs, and attempts to amortize the costs by making the reference points for DVFS scheduling decisions far apart. Their study shows that a DVFS decision affecting the whole chip can take up to 6 ms, and that an empirical threshold of at least 20 ms between the reference points should be used. Our study, on the other hand, reveals that the overhead can surge to over 200 ms in the worst case, when all voltage domains of a busy chip make concurrent voltage scaling requests, and suggests a more fine-grained algorithm that segregates scale-up and scale-down scenarios. Our previous work [18] investigated the DVFS latency behavior and contributed a preliminary version of latency-aware DVFS for many-core architectures. In this paper, we extend it in two main aspects. First, we improve our DVFS algorithm by considering the impact of busy-waiting phases on DVFS scaling; the details of the algorithm (Sect. 3) are supplemented in this paper. Second, we provide a more thorough experimental comparison between our improved algorithm and the threshold-based method of Ioannou et al. [13]. Our DVFS control algorithm is shown to work better in all cases, achieving a better balance between performance and energy.

6 Conclusion

In this paper, we investigate the latency characteristics of DVFS for many-core architectures with multiple voltage domains and find non-negligible DVFS latency on the Intel SCC many-core architecture. Based on this study, a latency-aware DVFS control algorithm is proposed to avoid excessive power state transitions. We implement the algorithm in a working DVFS controller for the Barrelfish operating system, plus a power management library for (virtual) shared-memory parallel programs. A thorough experimental evaluation using Graph 500, along with other benchmarks, was conducted on the Intel SCC platform. The experimental results show that our latency-aware DVFS algorithm achieves, on average, 15.2 % less execution time, 24.0 % more energy saving and 31.3 % better EDP than a baseline profile-guided dynamic power management (DPM) policy without latency awareness. The algorithm also outperforms a prior DVFS scheme that employs a universal time gap between DVFS scheduling decisions to amortize heavy voltage scaling costs.