1 Introduction

Rising IT spending on power has increased the awareness of and need for monitoring, management, and optimization of data-center energy consumption. HPC centers face additional performance-based constraints and account for disproportionately large power and energy costs. Among the numerous challenges in scaling systems to Exascale, capping the power consumption at a reasonable limit, say 20 MW, is probably the most important one.

Although computing systems offer an increasingly sophisticated set of power measurement and management capabilities, on most platforms fine-grained power measurements are difficult or impossible without modifying the hardware. In this paper, we focus on IBM systems based on the POWER7 processor. These systems are equipped with on-board power measurement circuits that measure the power consumed by the full system, the processor socket, the memory sub-system, the I/O sub-system, and the fans [1]. Additionally, the power consumed in different parts of the POWER7 processor can be estimated using a hardware-supported power proxy. This power proxy translates information from activity monitors into a power estimate using a programmable weight factor [2]. Information from the various sensors is collected by a dedicated microcontroller, the Thermal and Power Management Device (TPMD). This device is also used to implement a given power policy and management direction. It may, for instance, reduce the processor frequency to save power at the expense of system performance.

Applications in the HPC space tend to be designed and tuned to maximize performance without consideration for energy efficiency. Programming approaches can mask the real utilization of resources like CPU or memory (e.g., a wait loop in a communication progress function), and a load-balancing style of programming can interfere with autonomous hardware frequency scaling. Therefore, one of the challenges is to enhance methods and policies in this area to exploit energy management mechanisms. To identify optimization potential, one has to study an application's energy profile as well as its utilization of the different system resources.

One of the challenges in obtaining reliable energy profiles is to translate the power consumption $P(t)$ measured at time $t$ into the energy $E(t_0,t)$ consumed since time $t_0$, e.g. the time when program execution started. The energy is given by

$$ E(t_0,t) = \int_{t_0}^t d\tau P( \tau). $$
(1)

Since power is measured at discrete points in time $t_k$, the integral has to be approximated by a sum

$$ E(t_0,t) \simeq \sum_{k=1}^N \Delta t P(t_0+k \Delta t), $$
(2)

where for simplicity we assumed $t_k = t_0 + k\,\Delta t$ with $\Delta t = (t-t_0)/N$. For this approximation to be good, $\Delta t$ should be small compared to the time scale on which $P(t)$ changes.
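As an illustration of Eq. (2), the following minimal Python sketch computes the energy from a series of equally spaced power samples. The function and variable names are ours, not part of any measurement tool, and the sample values are hypothetical.

```python
def energy_from_samples(power_watts, dt_seconds):
    """Approximate E(t0, t) by the sum dt * sum_k P(t0 + k*dt) of Eq. (2).

    power_watts : equally spaced power readings in watts
    dt_seconds  : sampling interval dt in seconds
    Returns the estimated energy in joules.
    """
    return dt_seconds * sum(power_watts)


# Hypothetical 1 ms CPU power samples; the approximation is only good if
# dt is small compared to the time scale on which the power changes.
samples = [92.0, 95.5, 101.2, 98.7]
print(energy_from_samples(samples, 1e-3))  # -> 0.3874 J
```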

To read the information from the TPMD, we use an IBM internal tool called Amester (Automated Measurement of Systems for Temperature and Energy Reporting), which provides an external service to collect the power consumption data. We used VampirTrace [7] to trace the performance of the application, and developed a plugin which queries Amester to add power measurement information to the performance traces.

After discussing related work in Sect. 2, we describe our measurement setup in Sect. 3. Then we present the applications we used and the analysis results in Sect. 4. Finally, we conclude the paper and give an outlook on future work in Sect. 5.

2 Related work

A large number of papers have been published that analyse the power consumption of individual components using synthetic workloads. However, far less information is available on the component-level power consumption of HPC production workloads.

In [6] the power consumption of a Cray XT4 system is studied at node and system level for a set of application-based benchmarks such as the NAS Parallel Benchmarks. A different approach was taken in [5], where the authors analysed rack-level power measurement data collected for 12,500 jobs on a production Blue Gene/P system.

In [3] the power-profiling infrastructure PowerPack is described. This infrastructure targets the analysis of parallel applications and also links power to performance measurements [10]. This setup, however, requires significant modifications of the hardware. Furthermore, a critical analysis of the conversion of power measurements into estimates of the energy consumption is lacking.

In recent years, hardware support for power measurements has been driven by the need to monitor power in order to ensure that a given power envelope is not exceeded. This may, e.g., be mandatory in high-density designs where components under high load may generate more heat than the cooling system is able to remove. In [9] a feedback control mechanism is described that allows the system to be operated in the highest performance state under a fixed power constraint. This requires precise power information to be retrieved periodically.

The authors of [1] present results of an investigation where they use the power monitoring and management features of POWER7-based systems in order to reduce the power consumption. The power measurements are used to fit the parameters of a heuristic model, which describes the power consumption of an application as a function of the frequency. The analysis is, however, restricted to a selection of SPEC CPU2006 workloads and does not include full applications.

Power consumption measurements often require a dedicated hardware setup. Even if power measurement capabilities are integrated into HPC production environments, the data is either not accessible to the user or the user has to make significant efforts to analyse it. Initial attempts to integrate power measurements into widely used performance analysis tools, e.g. Vampir, are reported by the eeClust project [11]. In this project, x86-based systems with external power meters were used.

3 Measurement setup

The heart of our measurement setup is a POWER7 processor-based server, an IBM Power 720 Express, on which the application is executed. The POWER7 processor is a recent generation server processor in the IBM POWER family. The main features of our machine are:

  • Single 4-core 3.0 GHz processor (pSeries, 8202-E4B)

    • 96 GFLOPS peak

    • 4 SMT threads per core

    • Execution units per core

      • 2 fixed-point units

      • 2 load/store units

      • 4 double-precision floating-point units

      • 1 vector unit supporting VSX

    • 32+32 kB L1 instruction and data cache per core

    • 256 kB L2 cache per core

    • 16 MB shared L3 cache

  • 16 GB memory

  • Dual 300 GB 10K RPM SAS disks

  • TPMD (Thermal and Power Management Device)

An additional system is used to run the power measurement service (called Amester) without interfering with the workload execution on the POWER7-based server.

The POWER7-based server is connected to a Power Distribution Unit (PDU), which provides us with an estimate of the power consumed by the server at its power inlet. The PDU we use, a Raritan DPXS 12A-16, only allows for relatively coarse power measurements with a granularity of 3 s and a precision of about 5 %. It should be noted, however, that the power consumption at the system inlet is expected to change at a much slower rate. The data is stored in an SQL database and is used as a consistency check for the power measurements with Amester. The PDU values are expected to be larger, e.g. due to inefficiencies of the power supplies in the POWER7 system. We found the difference to agree with the specified efficiency of the power supplies.
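The consistency check can be illustrated with a small sketch: the inlet power reported by the PDU should roughly equal the node power reported by Amester divided by the power-supply efficiency. The efficiency value and the numbers below are assumptions for illustration, not measured properties of our system.

```python
def pdu_amester_consistent(pdu_watts, node_watts, psu_efficiency, tolerance=0.05):
    """Check whether PDU (inlet) and Amester (node) readings agree once
    power-supply losses are accounted for.

    pdu_watts      : power measured by the PDU at the power inlet
    node_watts     : node power reported by Amester
    psu_efficiency : assumed power-supply efficiency (0..1)
    tolerance      : accepted relative deviation, e.g. the ~5 % PDU precision
    """
    expected_inlet = node_watts / psu_efficiency
    return abs(pdu_watts - expected_inlet) / expected_inlet <= tolerance


# Hypothetical values: 180 W node power, 90 % efficient supplies,
# 205 W at the inlet -> within the 5 % tolerance.
print(pdu_amester_consistent(205.0, 180.0, 0.90))
```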

Fine-grained power measurements on POWER7 are possible with a software tool called Amester, which communicates with the TPMD of the POWER7-based server via the system's Flexible Service Processor (FSP). Amester sends commands to the FSP, which returns the requested data. Counters can be queried with a sampling interval as short as 1 ms. Using this tool it is possible to retrieve, among others, data about the power consumption of the full node, the processor, the memory, the I/O subsystem, and the fans. Amester is executed on a separate x86 server, i.e. in principle it allows for intrusion-free power measurements. The x86 server and the POWER7 system communicate via TCP/IP over a socket connection. Figure 1 shows the hardware (Fig. 1(a)) and software (Fig. 1(b)) setup we used for our experiments.

Fig. 1 Hardware and software setup for Amester measurements. It highlights the different time domains from which timestamps are obtained

To match the Amester measurements with the performance data recorded on the POWER7 processor, a timestamp synchronization issue needs to be resolved. Timestamps of measurements taken by VampirTrace come from the POWER7 CPU time domain. Samples gathered by Amester, on the other hand, carry timestamps originating from the POWER7's FSP. The FSP's timer is a millisecond counter incremented since system start-up, in contrast to the CPU clock. Therefore, in order to correlate Amester's fine-grained measurements with application performance characteristics, a mechanism to calculate the time offset between the two domains is required.

For this purpose, a simple micro-benchmark has been implemented, together with a suite of post-processing mechanisms. The purpose of the benchmark is to produce a number of peaks in IPS (instructions per second) that are marked with both POWER7 CPU and FSP timestamps. CPU timestamps are taken during the micro-benchmark runtime, directly before and after each iteration, and printed as the output of the application. At the same time, Amester is used to gather performance statistics from the TPMD, marked with FSP timestamps. Afterwards, the Amester output is processed and the beginnings and ends of the peaks are recognized. Several mechanisms to ensure the accuracy of this detection have been implemented, i.e. filtering out unwanted values by thresholding, validating the length of the peak data series, a number of cross-checks against values provided by the CPU, and more. Figure 2 shows a sample output of the offset detection. The final offset is given as the average of the offsets determined for the start and end of each peak in the benchmark. This value is later used in the Amester plugin for VampirTrace.
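The final step of the offset computation amounts to averaging the differences between the CPU and FSP timestamps at the detected peak boundaries. The following is a minimal sketch with hypothetical numbers; it omits the thresholding, length validation, and cross-checks described above.

```python
def cpu_fsp_offset(cpu_marks_ms, fsp_marks_ms):
    """Estimate the offset between the POWER7 CPU time domain and the FSP
    millisecond counter from matched peak boundaries.

    cpu_marks_ms : CPU timestamps (ms) taken before/after each benchmark iteration
    fsp_marks_ms : corresponding peak starts/ends detected in the Amester data
    Returns the offset to add to FSP timestamps to map them into CPU time.
    """
    assert len(cpu_marks_ms) == len(fsp_marks_ms)
    diffs = [c - f for c, f in zip(cpu_marks_ms, fsp_marks_ms)]
    return sum(diffs) / len(diffs)


# Hypothetical start/end marks for two IPS peaks:
cpu = [120050.0, 120250.0, 121050.0, 121250.0]
fsp = [20041.0, 20243.0, 21044.0, 21239.0]
print(cpu_fsp_offset(cpu, fsp))  # -> 100008.25 ms
```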

Fig. 2 Sample output of the POWER7 CPU-FSP offset detection. It shows the four peaks in IPS and the calculated beginning and end of the computation phases

In principle this approach should scale to multiple nodes as power data generation and communication do not interfere with the application run. However, the timestamp synchronization would have to be done for each node separately.

VampirTrace is a library to generate event-based trace files from instrumented applications. The VampirTrace workflow as we use it is shown in Fig. 3.

Fig. 3 VampirTrace workflow for our experiments

We developed a plugin for the VampirTrace plugin counter interface [12] to merge the counters provided by Amester into the OTF trace file generated by VampirTrace. As the Amester measurement is out-of-band, we chose a post-mortem plugin, i.e. the values are merged into the trace file at the finalization of the measurement, after the application has generated its results. This keeps the additional measurement overhead at a negligible level. Additional hardware performance counter values can be obtained with the PAPI library [14]. The resulting trace file can be visualized with the Vampir trace file visualizer [7].

The sensors that the VampirTrace plugin queries using Amester are listed in Table 1. The counter names we use for the Amester plugin are of the form P7_IPS and P7_POWER{_COMPONENT}_RESOLUTION.
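The merge step of the plugin can be pictured as follows: each out-of-band Amester sample carries an FSP timestamp and a sensor name and is translated into a counter record in the CPU time domain using the previously determined offset and the naming scheme above. The sketch below is illustrative only; the record layout, the resolution suffix, and the function name are ours, while the actual plugin writes the counters through the VampirTrace plugin counter interface into the OTF trace.

```python
def merge_amester_samples(samples, offset_ms, resolution_ms):
    """Translate out-of-band Amester samples into trace counter records.

    samples       : iterable of (fsp_timestamp_ms, component, value) tuples,
                    where component is e.g. "CPU" or "MEM", or None for the full node
    offset_ms     : offset from the CPU/FSP timestamp synchronization
    resolution_ms : sampling resolution, used here as an assumed name suffix
    Yields (counter_name, cpu_timestamp_ms, value) records.
    """
    for fsp_ts, component, value in samples:
        suffix = "" if component is None else "_" + component
        # The exact form of the RESOLUTION part of the name is an assumption.
        name = "P7_POWER%s_%dMS" % (suffix, resolution_ms)
        yield (name, fsp_ts + offset_ms, value)


for record in merge_amester_samples([(20041.0, "CPU", 83.5), (20042.0, None, 212.0)],
                                    offset_ms=100008.25, resolution_ms=1):
    print(record)
```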

Table 1 Counters which are retrieved from Amester by the VampirTrace plugin

4 Applications and analysis results

For our experiments we selected two codes developed at JSC, PEPC and MP2C, that also run—in different configurations from the ones we used here—on JSC’s Blue Gene/Q supercomputer on several thousand cores.

PEPC (Pretty Efficient Parallel Coulomb solver) [15] is a mesh-free tree code for computing long-range forces, e.g. Coulomb or gravitational forces, in N-body particle systems. The code was initially developed to study problems in plasma physics. It can, however, also be used for problems from other research areas such as astrophysics and biophysics. By using successively larger multipole groups of distant particles, the computational complexity of the long-range force computation is reduced to O(N log N), which is a key prerequisite for the very high scalability of the code. The code is written in Fortran 90 and parallelized using MPI and Pthreads. In our test cases an MPI-only version was used.

Figure 4 shows a Vampir screenshot of a PEPC run with 4 processes, capturing power measurements of the total power consumption as well as of the CPU and memory power consumption. The top left 'master timeline' shows the program activity on a per-process basis. Below are the 'counter timelines', showing the development of the different counters. Since there is more than one sample per pixel, Vampir shows for each counter the maximum value (the upper line), the minimum value (the lower line), and the average (the middle line). On the right side, some statistical information is displayed. The 10 iterations of the test run are clearly distinguishable.

Fig. 4 Vampir screenshot of a PEPC run showing power measurements at system level (top) as well as for CPU (middle) and memory (bottom). For each measurement it shows the maximum (upper line), average (middle line), and minimum (lower line)

A more detailed view of one iteration is shown in Fig. 5. The resolution is now fine enough for Vampir to show the measured counter values instead of the statistical information as in Fig. 4. We see that significant changes in power consumption occur at the millisecond level for all components.

Fig. 5 Vampir screenshot of one iteration of a PEPC run showing that there are significant changes in power consumption at millisecond scale

MP2C (Massively Parallel Multi-Particle Collision) [13] is a code for simulating fluids with solvated particles. It couples Multi-Particle Collision Dynamics (MPC) with Molecular Dynamics (MD). The former is a simulation technique in which particles are treated at the mesoscale. By coupling MPC to MD, hydrodynamic interactions between solvated molecules can be taken into account. The code is written in Fortran 90 and parallelized using MPI and OpenMP.

Figures 6 and 7 show Vampir screenshots of an MP2C run with 4 processes, with component power measurements and runtime characteristics. The 10 iterations of the test case are easily detectable in both figures. Figure 6 plots the power consumption of the memory subsystem against the L3 data cache misses, and Fig. 7 shows the CPU power consumption and the instructions per second (IPS). These values correlate quite nicely, although some peaks in the CPU power consumption cannot be spotted in the IPS counter line. This might be related to the IPS being averaged over a time period of 32 ms, which may hide some details.

Fig. 6 Vampir screenshot of an MP2C run showing memory power measurements and L3 cache misses

Fig. 7 Vampir screenshot of an MP2C run showing detailed CPU power consumption and IPS rate
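To put a number on how closely such counter lines follow each other, one could resample both series to a common time base and compute a correlation coefficient. The sketch below assumes the two series have already been aligned to identical timestamps; it is an illustration, not part of our measurement workflow, and the sample values are synthetic.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long, time-aligned series,
    e.g. CPU power and IPS samples resampled to a common interval."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Synthetic aligned samples: CPU power (W) vs. IPS (10^9 instructions/s)
power = [80.0, 95.0, 97.0, 82.0, 96.0]
ips = [4.1, 7.8, 8.0, 4.3, 7.9]
print(pearson(power, ips))  # close to 1 for strongly correlated series
```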

Figure 8 compares power measurements with 1 ms resolution and with 32 ms resolution for CPU and memory. The 32 ms measurements internally accumulate 32 consecutive 1 ms measurements and average them. Thus, they flatten some details that can be seen in the 1 ms measurements, yet result in the same integrated energy consumption for the whole application run. To calculate the energy consumption of shorter code parts, however, fine-grained measurements are beneficial.

Fig. 8 Comparison of power measurements with 1 ms resolution (upper part) and 32 ms resolution (lower part) for CPU (a) and memory (b)
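The relation between the two resolutions can be illustrated with a short sketch: block-averaging 32 consecutive 1 ms samples removes short spikes but leaves the integrated energy of the whole run unchanged (exactly so when the number of samples is a multiple of the block size). The trace below is synthetic.

```python
def block_average(samples, block=32):
    """Average consecutive blocks of fine-grained samples, e.g. 32 x 1 ms -> 32 ms."""
    return [sum(samples[i:i + block]) / len(samples[i:i + block])
            for i in range(0, len(samples), block)]


def energy(samples, dt_seconds):
    """Integrated energy in joules for equally spaced power samples."""
    return dt_seconds * sum(samples)


# Synthetic 1 ms CPU power trace (64 samples) with a short spike
fine = [80.0] * 64
fine[10:13] = [140.0, 150.0, 140.0]
coarse = block_average(fine, 32)

print(max(fine), max(coarse))                      # spike visible only at 1 ms
print(energy(fine, 1e-3), energy(coarse, 32e-3))   # identical total energy
```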

5 Conclusion and outlook

In this paper we presented a setup that allows us to obtain fine-grained power measurements on the IBM POWER7 platform and to correlate these values with application performance data. We showed that coarse-grained power measurements flatten the dynamics of the power consumption of all components.

The next step is to adapt this workflow to the new Score-P measurement system [8], a unified measurement system used by multiple tools, e.g. Vampir and Scalasca [4].

Furthermore, we are developing a model for the energy consumption at component level based on hardware performance counters, for which such fine-grained power measurements are beneficial. Such models can then be used by all kinds of tools, even profile-based ones.