1 Introduction

The power consumption of supercomputers has become a major limiting factor for further increases in performance. For particularly compute-intensive work-loads like the high-performance Linpack benchmark, a power efficiency of 7 GFlop/s/W can currently be reached on (smaller) supercomputers [13].

With the goal of keeping the power consumption of an exascale system below 20 MW, an improvement in power efficiency of at least a factor seven is required. Achieving this goal will require combining multiple strategies.

Leveraging existing system features that impact performance and power consumption (and thus energy-to-solution) is one of them. We review some of these features available on GPU-accelerated servers based on POWER8 processors, including, e.g., the core clock frequencies of processor and accelerator. Higher frequencies typically reduce time-to-solution and, due to the behaviour of CMOS technology, unavoidably increase power consumption. In line with previous observations, we find that energy-to-solution decreases when the clock speed is increased, i.e. that a “run to idle” clock tuning strategy is beneficial. For the considered system this effect is particularly large because of the high idle power consumption, to which the memory sub-system attached to the POWER8 processors, comprising a large number of memory buffer chips, contributes significantly.

To identify the setting for which energy-to-solution becomes minimal, performance and power models can be helpful. Performance models enable prediction of time-to-solution for different parts of a work-load, while a power model is needed to predict the power that is consumed on average while executing a particular part of the work-load. Combining both multiplicatively allows energy-to-solution to be predicted.

For this paper we consider three applications from different research areas. They have in common that they target research questions for which significant HPC resources are required. Their performance characteristics, however, differ significantly. B-CALM is an application for simulating electro-magnetic fields in dispersive media, which is relevant for developing photonic products. From the area of materials research we selected the application KKRnano, which implements the Density Functional Theory (DFT) method in a particularly scalable way. Finally, we also consider the molecular dynamics package GROMACS.

With this article we make the following contributions:

  1. An overview of opportunities for optimising energy-to-solution on GPU-accelerated POWER8 servers is given.

  2. Results of energy-to-solution measurements for kernels of several relevant scientific applications and different system configurations are presented.

  3. A simple power model is derived and combined with available performance models to model energy-to-solution.

After providing background on the relevant technologies in Sect. 2, we discuss the considered applications in Sect. 3 and the options to tune for energy efficiency in Sect. 4. In Sect. 5 we present results from power measurements for different configurations. Based on these results we derive power and energy models with empirically determined parameters in Sect. 6. Before presenting our conclusions in Sect. 8, we give an overview of related work in Sect. 7.

2 Technology Background

All results presented in this paper have been obtained on a single POWER8 8247-42L Server [8]. This server, which comprises 2 POWER8 Dual-Chip Modules (DCM), was the first to support acceleration by NVIDIA GPUs. The server considered here is equipped with 2 Tesla K40m cards.

Each POWER8 DCM comprises two chips with 5 cores each, i.e. there are 20 cores per node. Each core offers two sets of the following instruction pipelines: fixed point (FXU), floating point (VSU), pure load (LU) and load-store (LSU). Instructions are processed out-of-order to increase instruction-level parallelism. The cores feature 8-way Simultaneous Multi-threading. The two VSUs support VSX (Vector-Scalar eXtension) instructions, including two-way SIMD double-precision fused multiply-add instructions. This results in a theoretical peak performance per node of

$$\begin{aligned} B_\mathrm {fp}^\mathrm {(CPU)}= N_\mathrm {core}\cdot f_\mathrm {CPU}\cdot 8\,\mathrm{Flop} \le 590\,\mathrm{GFlop/s}, \end{aligned}$$
(1)

where \(N_\mathrm {core}\) denotes the number of cores (here \(N_\mathrm {core}=20\)) and \(f_\mathrm {CPU}\) is the core clock frequency.

Each POWER8 processor is connected to eight external memory buffer chips (Centaur chips). Each of these 16 chips is connected via a single link to one of the POWER8 processors; each link provides a bandwidth of 16 GByte/s for reading and 8 GByte/s for writing.

Each Tesla K40m hosts a GK110B GPU of the Kepler generation that comprises \(N_\mathrm {SM}=15\) streaming multi-processors. Each streaming multi-processor has 64 double-precision floating-point pipelines, each capable of executing one fused multiply-add instruction every clock cycle. This results in a theoretical peak performance per GPU of

$$\begin{aligned} B_\mathrm {fp}^\mathrm {(GPU)}= N_\mathrm {SM}\cdot f_\mathrm {GPU}\cdot 128\,\mathrm{Flop} \le 1622\,\mathrm{GFlop/s}, \end{aligned}$$
(2)

where \(f_\mathrm {GPU}\) is the GPU’s clock frequency. The device memory is based on GDDR5 technology and is directly attached to the GK110B GPU with a theoretical peak bandwidth of 288 GByte/s.

The POWER8 processor provides an on-chip controller (OCC) to measure a set of sensors in the hardware. The data is available out-of-band via a service processor and can be read out by a tool called Amester [15, 28], which has recently been made available as open source. The power sensors considered in this publication are sampled with an interval of 0.25 ms. The read-out granularity depends on the number of sensors, each of which adds a latency of typically 200 ms. The data is, therefore, gathered at irregular intervals and is re-sampled at evenly spaced intervals of \(\varDelta \tau = 0.1\) s. To calculate the overall energy consumption we sum the re-sampled power measurements \(P_i\) and multiply by the measurement interval, i.e. \(E \approx \varDelta \tau \sum _i P_i\). The sensors do not allow consumed power to be attributed exactly to individual server components. For instance, the sensor for the 12 V domain includes different I/O devices, also covering part of the power consumed by the GPUs. We combine the values of these sensors when presenting power traces, thus overestimating the actual power consumed by the GPU.
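The post-processing of the power traces can be summarised by a minimal sketch, assuming the Amester samples are available as arrays of time stamps and power readings; the synthetic trace used here is purely illustrative.

```python
import numpy as np

def energy_from_trace(t, p, dt=0.1):
    """Estimate energy (J) from an irregularly sampled power trace.

    t  : sample time stamps in seconds (irregular, as delivered by Amester)
    p  : power readings in W at those time stamps
    dt : re-sampling interval in seconds (0.1 s as described above)
    """
    t_even = np.arange(t[0], t[-1], dt)   # evenly spaced time grid
    p_even = np.interp(t_even, t, p)      # re-sample the irregular trace
    return dt * p_even.sum()              # E ~ sum_i P_i * delta_tau

# purely illustrative trace: idle power plus a kernel-active phase
t = np.sort(np.random.uniform(0.0, 60.0, 400))
p = 600.0 + 150.0 * ((t > 10.0) & (t < 50.0))
print(f"estimated energy: {energy_from_trace(t, p) / 1e3:.1f} kJ")
```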

As can be seen from Fig. 4, the read-out time interval can be large compared to the time scale on which power consumption changes. We verified, however, that energy-to-solution values determined from repeated measurements of the same application kernel are consistent to within a reasonable bound, which we take as an indication that the power sensor sampling rates are sufficiently fine.

3 Applications

KKRnano is an application from materials research and features a very high level of scalability [39]. It is based on the Density Functional Theory (DFT) method that enables an accurate description of the electronic properties of a material. Instead of solving for the many-electron wave-function, an effective potential is utilised, treating the problem as a single electron system via the Kohn-Sham equation \(\hat{H}\varPsi = E\varPsi \). Here \(\hat{H}\) denotes the Hamilton operator, E the energy and \(\varPsi \) the electron wave functions. KKRnano finds solutions to this equation in the form of the Green function \(\hat{G}(E) = (E - \hat{H})^{-1}\), as proposed by Korringa, Kohn and Rostoker (KKR) [4, 24, 25].

In large systems, where the number of atoms \(N_{\mathrm {atom}}\gg 1000\), the Green function formulation can be approximated by systematically truncating long-ranged interactions between well-separated atoms. This reduces the overall complexity of the method from cubic to linear and, hence, large systems with 100,000 atoms and more become feasible. The high level of exposed parallelism can be exploited using MPI plus OpenMP or MPI plus compute accelerators, like GPUs.

Most of the computational effort is spent on solving a linear system \(\varLambda \gamma = \omega \) for each atom. The matrix \(\varLambda \) is block-sparse with a typical block size \(b = 16\). The inversion is done locally on a single node using the Quasi Minimal Residual (QMR) method, an iterative solver [16], modified to operate on matrices and vectors with block structure. For the following discussion, we focus on the application of the operator to a block-structured dense vector as the central part of the QMR solver. KKRnano utilises double-precision complex numbers, requiring 16 Byte per element and 8 Flop to perform a complex multiply-accumulate operation.

Our performance modelling approach is based on the information exchange function concept [5, 30], which captures as a function of the problem size the amount of information that has to be exchanged within a computer architecture. On a single node the most important factors are the number of floating-point operations \(I_\mathrm {fp}\) as well as the amount of data loaded and stored, namely \(I_\mathrm {ld}\) and \(I_\mathrm {st}\). We further need to consider the data exchanged between the accelerator and its host, \(I_{\mathrm {acc}}\). For the KKRnano kernel we previously derived these information exchange functions:

$$\begin{aligned} I_\mathrm {fp}= & {} 2N_{\mathrm {iter}}\cdot \frac{N_{\mathrm {atom}}}{N_{\mathrm {node}}} \cdot N_{\mathrm {tr}}\cdot N_{\mathrm {cl}}\cdot b^3 \cdot 8\,\text {Flop}\,, \end{aligned}$$
(3)
$$\begin{aligned} I_\mathrm {ld}= & {} 2N_{\mathrm {iter}}\cdot \frac{N_{\mathrm {atom}}}{N_{\mathrm {node}}} \cdot N_{\mathrm {tr}}\cdot N_{\mathrm {cl}}\cdot b^2 \cdot 16\,\text {Byte}\,, \end{aligned}$$
(4)
$$\begin{aligned} I_\mathrm {st}= & {} 2N_{\mathrm {iter}}\cdot \frac{N_{\mathrm {atom}}}{N_{\mathrm {node}}} \cdot N_{\mathrm {cl}}\cdot b^2 \cdot 16\,\text {Byte}\,, \end{aligned}$$
(5)
$$\begin{aligned} I_{\mathrm {acc}}= & {} \frac{N_{\mathrm {atom}}}{N_{\mathrm {node}}} \cdot (2 + N_{\mathrm {cl}}) \cdot N_{\mathrm {tr}}\cdot b^2 \cdot 16\,\text {Byte}\,. \end{aligned}$$
(6)

To construct a performance model, we follow a procedure described in [3], where we made the assumption that latency depends linearly on the amount of exchanged information. When executing the application on the POWER8 processors, we expect it to be bound by floating-point operation throughput, because of the relatively high memory bandwidth. On the GPU, however, we expect the memory bandwidth to be the limiting factor. Additionally, we have to take the time needed to exchange data between host and device into account. We, therefore, make the following semi-empirical modelling ansatz:

$$\begin{aligned} \varDelta t_{\mathrm {solver}}^{\mathrm {CPU}}= & {} a_0^{\mathrm {CPU}} + a_{1,\mathrm {fp}}^{\mathrm {CPU}}I_\mathrm {fp}\end{aligned}$$
(7)
$$\begin{aligned} \varDelta t_{\mathrm {solver}}^{\mathrm {GPU}}= & {} a_0^{\mathrm {GPU}} + a_{1,\mathrm {mem}}^{\mathrm {GPU}}(I_\mathrm {ld}+ I_\mathrm {st}) + a_{1,\mathrm {acc}}^{\mathrm {GPU}}I_{\mathrm {acc}}\,, \end{aligned}$$
(8)

where the coefficients have to be determined by fitting this ansatz to performance measurements. As the constant offsets are sufficiently small, we set \(a_0^{\mathrm {CPU}} = 0\) and \(a_0^{\mathrm {GPU}} = 0\).

Table 1. Parameters used for all KKRnano benchmark runs.
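The model can be evaluated directly from the information exchange functions. The following sketch implements Eqs. (3)–(8); the parameter values and the sustained-throughput coefficient used in the example are illustrative placeholders, not the settings of Table 1.

```python
def kkrnano_info_exchange(n_iter, n_atom_per_node, n_tr, n_cl, b=16):
    """Information exchange functions of Eqs. (3)-(6)."""
    i_fp = 2 * n_iter * n_atom_per_node * n_tr * n_cl * b**3 * 8    # Flop
    i_ld = 2 * n_iter * n_atom_per_node * n_tr * n_cl * b**2 * 16   # Byte
    i_st = 2 * n_iter * n_atom_per_node * n_cl * b**2 * 16          # Byte
    i_acc = n_atom_per_node * (2 + n_cl) * n_tr * b**2 * 16         # Byte
    return i_fp, i_ld, i_st, i_acc

def t_solver_cpu(i_fp, a1_fp):
    """Eq. (7) with a0 = 0: throughput-limited CPU execution."""
    return a1_fp * i_fp

def t_solver_gpu(i_ld, i_st, i_acc, a1_mem, a1_acc):
    """Eq. (8) with a0 = 0: memory- and transfer-limited GPU execution."""
    return a1_mem * (i_ld + i_st) + a1_acc * i_acc

# illustrative parameters (not those of Table 1)
i_fp, i_ld, i_st, i_acc = kkrnano_info_exchange(
    n_iter=100, n_atom_per_node=64, n_tr=13, n_cl=13)
# assuming an illustrative sustained throughput of about 18 GFlop/s
print(f"predicted CPU solver time: {t_solver_cpu(i_fp, 1.0 / 18e9):.1f} s")
```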

B-CALM (Belgium-California Light Machine) is a research application for studying photonics or, more generally, electromagnetic waves in media [41]. It implements the Finite-Difference Time-Domain (FDTD) numerical method for simulating classical electrodynamics, that is, for solving Maxwell’s equations in a medium [37]. The evolution of the discretised electromagnetic fields is described by a first-order spatial and temporal difference equation, alternating between the magnetic and electric field components.

B-CALM exploits the inherently high level of parallelism of the FDTD algorithm by making use of GPUs. Significant gains in performance compared to CPU-only implementations have been reported for FDTD in general [27, 33, 41]. Realistic problems require the use of multiple distributed GPUs due to the memory footprint.

We apply a similar semi-empirical performance modelling approach as for KKRnano [3]. We consider a single node where the simulation domain is parallelised over \(p=2\) GPUs in z-direction. The total time taken for an update step \(\varDelta t\) can be decomposed into three contributions: boundary update \(\varDelta t_\mathrm {bnd}\), update of the interior domain \(\varDelta t_\mathrm {bulk}\) and exchange of the boundary \(\varDelta t_\mathrm {com}\) between the direct neighbours of the p processes, where the last two operations are overlapped. Our semi-empirical performance model ansatz is:

$$\begin{aligned} \varDelta t= & {} \varDelta t_\mathrm {bnd}+ \max \left( \varDelta t_\mathrm {bulk}, \varDelta t_\mathrm {com}\right) \,, \end{aligned}$$
(9)
$$\begin{aligned} \varDelta t_\mathrm {bulk}= & {} a_\mathrm {bulk}+ V \cdot b_\mathrm {bulk}\,168\,\text {Byte}, \end{aligned}$$
(10)
$$\begin{aligned} \varDelta t_\mathrm {com}= & {} a_\mathrm {com}+ S \cdot b_\mathrm {com}\,48\,\text {Byte}, \end{aligned}$$
(11)
$$\begin{aligned} \varDelta t_\mathrm {bnd}= & {} a_\mathrm {bnd}+ S \cdot b_\mathrm {bnd}\,\,336\,\text {Byte}, \end{aligned}$$
(12)

where \(V = (L_x/p - 2) \cdot L_y\cdot L_z\) is the local sub-domain’s volume excluding halo layers and \(S = L_y\cdot L_z\) the corresponding halo volume. For our benchmarks, we utilise a one-dimensional domain decomposition with the parameters summarised in Table 2.

Table 2. Parameters used for all B-CALM benchmark runs.
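A minimal sketch of this model is given below; the coefficients \(a_*\) (latencies) and \(b_*\) (inverse effective bandwidths in s/Byte) are fit parameters, and the grid sizes and coefficient values in the example are illustrative rather than those of Table 2.

```python
def bcalm_step_time(Lx, Ly, Lz, p,
                    a_bulk, b_bulk, a_com, b_com, a_bnd, b_bnd):
    """Per-step time model of Eqs. (9)-(12) for a 1-D decomposition over p GPUs."""
    V = (Lx // p - 2) * Ly * Lz          # interior cells of the local sub-domain
    S = Ly * Lz                          # cells per halo layer
    t_bulk = a_bulk + V * b_bulk * 168   # 168 Byte per interior cell update, Eq. (10)
    t_com  = a_com  + S * b_com  * 48    # 48 Byte exchanged per halo cell, Eq. (11)
    t_bnd  = a_bnd  + S * b_bnd  * 336   # 336 Byte per boundary cell update, Eq. (12)
    return t_bnd + max(t_bulk, t_com)    # bulk update and exchange overlap, Eq. (9)

# illustrative grid and coefficients (not the values of Table 2)
dt_step = bcalm_step_time(Lx=512, Ly=256, Lz=256, p=2,
                          a_bulk=1e-5, b_bulk=1 / 200e9,
                          a_com=2e-5, b_com=1 / 10e9,
                          a_bnd=1e-5, b_bnd=1 / 200e9)
print(f"predicted time per update step: {dt_step * 1e3:.2f} ms")
```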

GROMACS performs molecular dynamics simulations by integrating the Newtonian equations of motion. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids. A very important design goal of GROMACS is high performance, which is achieved through algorithmic optimisations and by exploiting all available hardware resources. To achieve the latter, it uses intrinsic functions for SIMD vectorisation and supports MPI and OpenMP for parallelisation on CPUs. CUDA C is used to accelerate the calculation of the non-bonded force components on NVIDIA accelerators. GPU-accelerated runs of GROMACS utilise task parallelism between different force components to maximise resource utilisation, i.e. to overlap CPU and GPU work. GROMACS applies dynamic CPU/GPU load balancing; however, for optimal efficiency the CPU and GPU parts of the system need to be balanced [1].

For all runs of GROMACS in this article, a water box with 1 million water molecules, i.e. 3 million particles, was simulated for 12000 time steps, corresponding to 24 picoseconds of simulation time. We selected the Particle Mesh Ewald method for the long-range electrostatics and applied a domain decomposition with 40 domains. Each domain is processed by one MPI rank with 4 OpenMP threads, and 20 domains share each GPU.

4 Tuning for Energy Efficiency

The energy E required to execute a computational task is given by

$$\begin{aligned} E = \int _{t_0}^{t_0 + \varDelta t}\,d\tau \,P(\tau )\,, \end{aligned}$$
(13)

where \(t_0\) is the time when the computational task is started, \(\varDelta t\) the time needed to execute the task, and \(P(\tau )\) the power consumed by the system at time \(\tau \). If we assume the power consumption during execution of the kernel to be constant, i.e. \(P(\tau ) = \overline{P}\), then the relation simplifies to \(E = \overline{P}\cdot \varDelta t\).

For the given node architecture, we identify the following options for changing the implementation or the execution environment such that \(\varDelta t\) and P change:

  1. Dynamic voltage and frequency scaling (DVFS) capabilities of the POWER8 processor.

  2. Exploitation of per-core DVFS settings and low-power states of the POWER8 processor.

  3. Modification of the GPU’s clock frequency \(f_\mathrm {GPU}\).

  4. Change of GPU driver settings.

  5. Use of algorithms optimised for energy-to-solution.

The frequency f at which a processing element runs exerts a significant influence on its power drain, as the voltage has to be increased when f is raised. The power consumption effectively scales as \(P \propto f^\gamma \), with \(\gamma \simeq 3\). On POWER8 the clock frequency can be managed by restricting the range of frequencies used by the Linux kernel governor. The strategy employed by the governor follows the on-demand setting and results in the highest available frequency being chosen during our kernels’ runtime and a power-saving state in between runs. The available frequencies are distributed almost uniformly between 2.061 GHz and 3.690 GHz in 50 steps, out of which we chose \(f_\mathrm {CPU}\in \{2.061, 2.294, 2.493, 2.693, 2.892, 3.092, 3.291, 3.491, 3.690\}\,\text {GHz}\).
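As an illustration, restricting the governor’s frequency range can be done through the Linux cpufreq sysfs interface; the following sketch is illustrative and assumes root privileges as well as the standard cpufreq sysfs layout.

```python
import glob

def set_cpu_frequency_range(fmin_khz, fmax_khz):
    """Restrict the frequency range available to the cpufreq governor.

    Frequencies are given in kHz, e.g. 3690000 for 3.690 GHz;
    writing these files requires root privileges.
    """
    for cpu in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
        with open(f"{cpu}/scaling_min_freq", "w") as f:
            f.write(str(fmin_khz))
        with open(f"{cpu}/scaling_max_freq", "w") as f:
            f.write(str(fmax_khz))

# pin the governor to the highest available frequency ("run to idle")
set_cpu_frequency_range(3690000, 3690000)
```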

The POWER8 processor comprises a PowerPC-based On-Chip Controller (OCC), which provides for real-time control of per-core clock frequency and power states [17]. Additionally, cores can be switched off under control of the operating system, enabling software to reduce the number of active cores. Instead of user selection of active cores, one may use thread pinning to move all work-load to a subset of cores and rely on the OS to detect unused cores and adjust frequency and power settings accordingly.

Recently, GPUs have also become able to operate at different clock frequencies. For the K40 GPU we considered 4 different clock states, namely \(f_\mathrm {GPU}\in \{666, 745, 810, 845\}\,\text {MHz}\). Changing the clock frequency of the K40 GPU is possible via application clock settings, which are accessible through the NVIDIA Management Library (NVML) or the nvidia-smi command-line tool [26].
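A sketch of how the GPU application clocks could be set from a script via nvidia-smi is shown below; the memory clock of 3004 MHz is an assumption for the K40m, and the supported clock pairs should be queried first with `nvidia-smi -q -d SUPPORTED_CLOCKS`.

```python
import subprocess

def set_gpu_application_clocks(graphics_mhz, mem_mhz=3004, device=0):
    """Set GPU application clocks via nvidia-smi (requires sufficient privileges).

    mem_mhz=3004 is the memory application clock assumed here for the K40m.
    """
    subprocess.run(
        ["nvidia-smi", "-i", str(device), "-ac", f"{mem_mhz},{graphics_mhz}"],
        check=True)

# sweep over the clock states considered in the text
for f_gpu in (666, 745, 810, 845):
    set_gpu_application_clocks(f_gpu)
    # ... run the application kernel and record the power trace ...
```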

The architecture considered here is best used from an energy-efficiency point of view if significant parts of the application can be off-loaded to the GPUs. The reason is that the GPU typically requires less energy per floating-point operation. In many cases, there is no concurrent execution of tasks on both CPU and GPU, i.e. the CPU should be largely inactive while a kernel is executed on the GPU. However, by default the GPU driver polls on an active lock while waiting for a kernel to complete. Alternatively, the driver can yield its thread to the Linux kernel scheduler and thus avoid polling, which is achieved by changing the device flags from cudaDeviceScheduleSpin to cudaDeviceScheduleYield. This may come with a small latency penalty, as notification through the kernel scheduler can be delayed.

Recently, optimisation of algorithms for energy-efficiency attracted increasing attention. In particular, in case of solving linear systems often a choice between different algorithms exists (see, e.g., [22] for a recent investigation). We have not pursued this option within the scope of this paper.

5 Measurement Results

Exploitation of per-core DVFS settings and low-power states. In Fig. 1 we show power measurements for the KKRnano solver running either on 20 or 10 cores. We observe that the additional power consumed by the processor after starting the execution of the kernel is reduced by almost a factor of 2. At the same time, the execution time \(\varDelta t\) almost doubles. Given the high base power load of the processors, memory and GPUs, the overall reduction in power consumption is less than 15 % and thus energy-to-solution increases by more than 70 %.

Fig. 1. Power consumption for four invocations of the KKRnano solver running on the CPU only. 40 threads are distributed either over 20 cores (1st and 3rd invocation) or 10 cores (2nd and 4th).

Change of GPU driver settings. For the same application kernel, we analysed the power consumption when off-loading it to the GPUs. While a kernel is running on the GPU, there are no application tasks running on the CPU. Nevertheless, a significant increase of the power consumption beyond what is consumed without the user application running can be seen in Fig. 2. Between kernel launches the CPU performs tasks for steering the solver’s progress. We investigated the effect of advising the CUDA driver to yield its thread to the operating system scheduler when possible. No effect was observed, consistently for both KKRnano and B-CALM.

Fig. 2. Power consumption for two invocations of the KKRnano solver running on the GPU with the GPU driver polling for completion of the kernel (Spin) or yielding its thread to the operating system (Yield).

Table 3. Energy-to-solution for the KKRnano solver running on the CPU in units of kJ.

Modification of clock frequencies. Next we consider the power consumed by applications running on the POWER8 processor at different clock frequencies \(f_\mathrm {CPU}\). In Fig. 3 we show the power consumed by the processors, the memory subsystem as well as the GPU (including the other I/O devices attached to the 12 V power rail) during execution of the KKRnano solver on the CPU. As expected, the power consumption increases with \(f_\mathrm {CPU}\), while the execution time reduces. Results for energy-to-solution as a function of \(f_\mathrm {CPU}\) are documented in Table 3. For small clock frequencies the increase in time-to-solution overcompensates the reduction in power consumption, which causes energy-to-solution to increase. It is thus beneficial to use a higher clock frequency.

Fig. 3. Power consumption for multiple invocations of the KKRnano solver running on the CPU for different settings of \(f_\mathrm {CPU}\).

Table 4. Energy-to-solution for B-CALM in units of kJ.

As shown in Fig. 4, a similar effect is observed for B-CALM, where the kernels run on the GPU. Since the kernels run entirely on the GPU, we consider here only the case where the processor clock is kept fixed at its minimal value and \(f_\mathrm {GPU}\) is varied. In Tables 4 and 5 we show our results for energy-to-solution for B-CALM as well as the GPU-accelerated version of the KKRnano solver when using different GPU clock settings. In both cases we observe that a larger \(f_\mathrm {GPU}\) results in smaller energy-to-solution.

Fig. 4. Example of a GPU power trace over multiple invocations of B-CALM. GPU clocks are set to each of the available values.

Table 5. Energy-to-solution for the KKRnano solver running on the GPU in units of kJ.

As opposed to the other two codes, GROMACS has a task parallelisation scheme that allows tasks to be executed on CPU and GPU concurrently. Data is exchanged at defined points in the algorithm [29]. As both the CPU and the GPU consume a significant amount of power even when not being in use, an imperfect overlap of tasks running on any of the 4 processing devices will result in a loss of energy efficiency. Table 6 shows the energy consumption and time-to-solution for a single run as a function of \(f_\mathrm {GPU}\) for different processor frequencies \(f_\mathrm {CPU}\). We observe that changing the processor frequency \(f_\mathrm {CPU}\) has a much larger effect on energy-to-solution than changing the frequency of the GPU \(f_\mathrm {GPU}\). The energy-to-solution is reduced by 8 % to 17 % when using the higher CPU frequency, while variations of \(f_\mathrm {GPU}\) change the energy-to-solution by at most 7 %. While the energy consumed by the GPU increases for larger clock frequencies (up to 40 % with low \(f_\mathrm {CPU}\), up to 22 % for high \(f_\mathrm {CPU}\)), the energy consumed by the remaining parts of the compute node is reduced (by up to 11 %) due to the shorter time-to-solution (see Fig. 5). On the other hand, increasing \(f_\mathrm {CPU}\) has an even more significant effect on time-to-solution (up to 32 %), which over-compensates the increase in CPU power consumption (up to 20 %). The net effect is a reduction of energy-to-solution at higher clock speed, ranging between 8 % and 20 %.

Fig. 5. GROMACS power trace for different GPU clock settings.

Table 6. GROMACS energy to solution in kJ.
Table 7. Performance of STREAM as a function of the CPU frequency.

Finally, we give power measurements for the behaviour of the STREAM benchmark at different clock settings in Table 7 and the corresponding power trace in Fig. 6. Notably, the achieved performance does not depend on the core clock. This is explained by the fact that the serial link between CPU and off-chip memory controller runs at a fixed speed. In line with previous observations [12, 31], one can conclude that for purely memory bandwidth bound operations the clock speed can be reduced without impacting performance.

Fig. 6. Example of a CPU power trace over the four tasks of the STREAM micro-benchmark. Three of the used values for the CPU clock are shown.

6 Power and Energy Modelling

Our strategy to derive an energy model is as follows. We only consider the case where all options for tuning for energy efficiency, except for changing the clock speeds of CPU and GPU, are applied. Furthermore, we assume the power consumption during application kernel execution to be constant. The challenge thus reduces to designing a performance model and a power model that determine the time-to-solution \(\varDelta t\) and the average power consumption \(\overline{P}\) as functions of the CPU and GPU clock frequencies \(f_\mathrm {CPU}\) and \(f_\mathrm {GPU}\), respectively.

For deriving a power model we choose a phenomenological approach, which has the advantage of being sufficiently simple to be of practical use. The compute devices concerned are based on CMOS circuits, where power is typically split into two components: a static contribution due to leakage currents and a dynamic contribution due to charging and discharging the capacitive elements of the circuit. The latter is assumed to depend linearly on the clock frequency and quadratically on the supply voltage. Since the supply voltage needs to be increased for higher clock frequencies, we can to first approximation assume a cubic dependence on the frequency. We therefore make the following generic ansatz in terms of the frequency f:

$$\begin{aligned} \overline{P}(f) = p_0 + p_3 f^\gamma \,, \end{aligned}$$
(14)

with \(\gamma = 3\) fixed. The parameters \(p_0\) and \(p_3\) will be determined through least-squares fits to experimental data.
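The fit itself is straightforward; the sketch below uses illustrative measurement values (the real input are the mean powers extracted from the Amester traces).

```python
import numpy as np
from scipy.optimize import curve_fit

def power_model(f, p0, p3):
    """Eq. (14) with gamma = 3: static plus dynamic (cubic) contribution."""
    return p0 + p3 * f**3

# illustrative (frequency, mean power) pairs; units GHz and W
f_cpu  = np.array([2.061, 2.693, 3.092, 3.690])
p_mean = np.array([610.0, 675.0, 740.0, 875.0])

(p0_fit, p3_fit), _ = curve_fit(power_model, f_cpu, p_mean, p0=(500.0, 5.0))
print(f"p0 = {p0_fit:.0f} W, p3 = {p3_fit:.1f} W/GHz^3")
```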

In Fig. 7 we show results for power consumption of the KKRnano solver as well as 4 different STREAM benchmarks, which are all executed on the POWER8 processor, as a function of \(f_\mathrm {CPU}\). Multiple measurements have been obtained by executing the application kernels and benchmarks multiple times in sequence. Active phases were selected based on thresholding the power for the GPU halfway between minimum and maximum values. We observe that Eq. (14) using \(f=f_\mathrm {CPU}\) parametrises the measurements very well.

The instruction mix of the KKRnano kernel is similar to the triad micro-benchmark. However, the KKRnano kernel achieves a significantly higher instruction throughput than triad (instructions per cycle, IPC = 1.44 versus 0.24, when using 40 threads). The higher throughput is mainly due to the data reuse in the matrix-matrix multiplication and the resulting cache efficiency. This is consistent with the observation that \(p_3\) is larger for KKRnano than for the STREAM benchmarks.

Fig. 7. Power measurement results for different settings of \(f_\mathrm {CPU}\) for the KKRnano kernel running on the CPU (left) and the STREAM benchmarks (right). The solid lines show results from fits to Eq. (14).

We extended the analysis to the GPU-accelerated application kernels, where we varied \(f_\mathrm {GPU}\). Results for KKRnano and B-CALM are summarised in Fig. 8. Again we observe that Eq. (14) provides a good parametrisation of the measured power.

Fig. 8. Fitting the model against KKRnano GPU (left) and B-CALM (right) experimental results.

Fig. 9. Estimated distributions of power drain for \(f_\mathrm {CPU} = 3.69\,\text {GHz}\) for the four STREAM micro-benchmarks on the CPU level (left) and on the node level (right).

Fig. 10. KKRnano performance model results versus processor clock \(f_\mathrm {CPU}\).

Next, we extend the performance model for KKRnano introduced in Eq. (7) by considering the parameter \(a_{1,\mathrm {fp}}^{\mathrm {CPU}}\) to be a function of the clock frequency \(f_\mathrm {CPU}\). We observe from Fig. 10 that the results for different \(f_\mathrm {CPU}\) can be parametrised as follows:

$$\begin{aligned} \frac{1}{a_{1,\mathrm {fp}}^\mathrm {CPU}(f_\mathrm {CPU})} = b_{1,0,\mathrm {fp}}^\mathrm {CPU} + b_{1,1,\mathrm {fp}}^\mathrm {CPU}\cdot f_\mathrm {CPU}\end{aligned}$$
(15)

with the empirical factors \(b_{1,0,\mathrm {fp}}^\mathrm {CPU} = 0.28\,\text {GFlop/s}\) and \(b_{1,1,\mathrm {fp}}^\mathrm {CPU}= 5.01\,\text {GFlop}\).

Plugging this result into Eq. (7) gives us \(\varDelta t_{\mathrm {solver}}^{\mathrm {CPU}}\) as a function of the processor clock speed \(f_\mathrm {CPU}\). We combine this with the power model of Eq. (14) to derive the following energy model:

$$\begin{aligned} E^\mathrm {CPU}_\mathrm {solv} = \overline{P}(f_\mathrm {CPU})\cdot \varDelta t(f_\mathrm {CPU}) = I_\mathrm {fp}\frac{p_0 + p_3\cdot f_\mathrm {CPU}^3}{b_{1,0,\mathrm {fp}}^{CPU} + b_{1,1,\mathrm {fp}}^{CPU}\cdot f_\mathrm {CPU}}\,. \end{aligned}$$
(16)

In particular, if \(p_0\) is large compared to \(p_3 \cdot f_\mathrm {CPU}^3\) and if \(b_{1,0,\mathrm {fp}}^{CPU}\) is small compared to \(b_{1,1,\mathrm {fp}}^{CPU} \cdot f_\mathrm {CPU}\), the frequency dependence may be dominated by the denominator. In this case we expect \(E^\mathrm {CPU}_\mathrm {solv} \propto f_\mathrm {CPU}^{-1}\). Once \(f_\mathrm {CPU}\) becomes large, the numerator will dominate such that \(E^\mathrm {CPU}_\mathrm {solv} \propto f_\mathrm {CPU}^2\). In Fig. 11 we plot model results for a wide range of processor clock frequencies. We observe that energy-to-solution initially drops, until for large frequencies the numerator prevails.
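A direct way to exploit Eq. (16) is to scan it over the admissible clock range and pick the minimum, as sketched below; the workload \(I_\mathrm {fp}\) and the power-model coefficients used here are illustrative, while \(b_{1,0,\mathrm {fp}}^\mathrm {CPU}\) and \(b_{1,1,\mathrm {fp}}^\mathrm {CPU}\) are the fitted values quoted above.

```python
import numpy as np

def energy_cpu_solver(f_ghz, i_fp, p0, p3, b0=0.28, b1=5.01):
    """Eq. (16): modelled energy-to-solution (J) of the KKRnano CPU solver.

    f_ghz : processor clock in GHz
    i_fp  : floating-point work of the kernel in Flop, Eq. (3)
    p0,p3 : power-model coefficients of Eq. (14), in W and W/GHz^3
    b0,b1 : throughput coefficients of Eq. (15), in GFlop/s and GFlop
    """
    throughput = (b0 + b1 * f_ghz) * 1e9     # sustained Flop/s, Eq. (15)
    power = p0 + p3 * f_ghz**3               # average power, Eq. (14)
    return i_fp * power / throughput

# scan the clock range and pick the energy-optimal setting
f = np.linspace(2.0, 4.0, 201)
e = energy_cpu_solver(f, i_fp=1e13, p0=550.0, p3=6.5)  # illustrative workload and fit
print(f"energy-optimal f_CPU ~ {f[np.argmin(e)]:.2f} GHz")
```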

Consistent with the results shown in Table 3, we determine the energy-optimal clock setting to be \(f_\mathrm {CPU}= 3.5\,\text {GHz}\). We find that our model systematically underestimates the required energy by at most around \(7\,\%\) over the range 2–4 GHz. KKRnano on the CPU achieves an energy efficiency of around \(45\,\text {J/GFlop}\), or equivalently \(22.2\,\text {MFlop/J}\).

Fig. 11. Modelled energy-to-solution for the KKRnano CPU solver as a function of the processor core clock \(f_\mathrm {CPU}\). For comparison, the measured data from Table 3 are shown.

7 Related Work

Given the growing importance of power consumption in modern HPC systems, there is an increasing number of papers reporting on work related to power measurements as well as power modelling.

An early framework for accurate and scalable power profiling is PowerPack [18], which, however, relies on power data acquisition devices being added to all cluster nodes. Another framework is PowerScope, which was developed for mobile compute platforms [14]. Some node architectures feature fine-grained power measurement capabilities, which are often based on reconfigurable hardware components for high-frequency read-out of power sensors (see, e.g., [20]). Demmel and Gearhart proposed to use on-chip counters, more specifically Intel’s Running Average Power Limit (RAPL) interface, for measuring energy consumption at subroutine level [10]. Investigations of the use of the RAPL interface for measuring memory power can be found in the literature [9]. Such measurements using counters that can be read out from the processor have become easier with their integration into performance measurement frameworks like PAPI [42]. An integration of the node-level power measurement framework Amester into Vampir has been reported by Knobloch and others [23]. Another approach integrates power measurement capabilities into a Linux kernel module [32]. Both the PowerPack and the PAPI framework have been used to explore the energy footprint of linear algebra algorithms [11].

Other work aims at modelling power consumption at the processor level. For instance, a power measurement and runtime power modelling framework for the Intel NetBurst architecture is presented in [21]. Processor-level power models have also been proposed and used in the context of processor selection to identify, e.g., processors that yield a certain frequency within a given power budget [45]. At a time when direct power measurement capabilities were largely non-existent, Van Bui and others advocated using power models with performance counters as input for estimating the power consumed by applications [6]. Such approaches can be exploited for dynamic profitability estimation of dynamic voltage and frequency scaling (DVFS) [12, 31].

With GPUs becoming more widely used as computational devices, research on power models for GPUs was started. Statistical methods to model the power and energy consumption of common high-performance kernels have been employed successfully [19]. In a more recent approach, hardware performance counter data is combined with machine learning to model power and performance for modern GPU-based systems [34]. Wu and others also applied machine learning in their work [44].

Other approaches to power modelling target numerical kernels. Statistical methods like multi-variable regression are used to model the power and performance of the High-Performance Linpack benchmark [36]. Model parameters are application parameters like matrix size or block size. The same numerical kernel is also considered in [7]. Power models for another important numerical kernel, namely Cholesky factorisation, are considered using a semi-analytical approach [2]. Machine learning techniques are also employed to create application-specific power models [40]. Input parameters include application-specific parameters as well as system parameters, like the clock frequency. For an extensive survey of power- and energy-efficient techniques for high-performance numerical linear algebra operations we refer to the work of Tan and others [38].

Performance models for more complex applications are, e.g., considered in [43]. Here Wittmann and others use models to explore energy efficiency of a computational fluid-dynamics (CFD) application based on the Lattice Boltzmann method on Intel Sandy Bridge multi-core processors. They combined an analytic performance model with a simple phenomenological power model. A semi-analytical performance and power model for another CFD application, namely a mini-application version of the spectral element method based code Nek5000, is presented in [35].

8 Summary and Conclusions

In this paper we investigated, for several scientific applications, how the energy efficiency of currently available GPU-accelerated POWER8 servers can be optimised. We discussed several options for changing the application and/or system configuration such that energy-to-solution for a given workload is reduced. We found that, in particular, adjusting the processor or GPU clock has a major impact on energy-to-solution. Except for the STREAM benchmarks, we found that applications typically benefit from higher clock speeds. This behaviour can be explained by the relatively high base power consumption. Within a certain range of clock frequencies it is thus more beneficial to reduce time-to-solution by using higher clock frequencies. For applications running on the processor only, this holds, however, only if they are limited by instruction throughput.

Furthermore, we found that the measured power consumption can be described by a simple phenomenological power model, which describes power consumption as a function of the CPU or GPU clock frequency. For one of the application kernels, the KKRnano solver, we extended a performance model to include variations of the processor clock frequency. By combining the performance and power models we derived a simple energy model. The curve shown in Fig. 11 is based on only four parameters, which are relatively easy to determine. This can be used to tune the clock frequency. For KKRnano the model results and measurements lead to reasonably consistent results for the CPU clock frequency \(f_\mathrm {CPU}\) that minimises energy-to-solution.

A significant fraction of the base power consumption, i.e. the power that is consumed without an application running, is due to the memory subsystem. The base power consumption of servers based on POWER8 processors with a smaller number of memory buffer chips is lower, and therefore the clock frequency for which energy-to-solution is minimised is likely to be smaller. The picture is expected to change again for the recently announced POWER9-SO processors with directly attached memory.