
1 Introduction

Developers of HPC applications are forced to optimize their applications to reach the maximum possible performance and scalability. This requirement makes performance analysis tools very important elements of HPC systems; their goal is to identify the hot spots of the code that provide room for improvement. Except for basic, single-purpose applications, every region of an application may have different requirements on the underlying hardware. In general, we may speak about several application kernels that are bound by different (micro-)architectural components (e.g. compute, memory-bandwidth, communication or I/O kernels), as presented in [1] and evaluated in [13, 29].

An application performance analysis tool profiles the application - it stores the current time, hardware performance counters, et cetera, to provide information about the program status at a given time. In general, there are several ways to connect the profiling library with the target application and to select when the application's state should be captured: (1) insert the profiling library API functions into the code of the profiled application, or (2) let the profiling library implement a middleware for specific functions (e.g. memory access, I/O or MPI). Another option is monitoring or simulating the application process; however, these approaches may have problems profiling the application performance exactly. The advantage of monitoring is that it does not require instrumenting the application, which on the other hand means that the identification of an exact location in the code is ambiguous. Instrumentation can be inserted manually into the source code, by a compiler at compilation time, or the application's binary can be patched using dynamic or static instrumentation tools.
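As an illustration of approach (2), a profiling library can interpose calls through the standard MPI profiling interface (PMPI), which is also the mechanism behind MPI middleware tools such as mpiP. The following minimal sketch wraps MPI_Send; the record_event helper is a hypothetical placeholder for whatever bookkeeping the profiler performs.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical helper - stands in for the profiler's bookkeeping. */
    static void record_event(const char *name, double seconds)
    {
        fprintf(stderr, "%s took %f s\n", name, seconds);
    }

    /* The profiling library exports its own MPI_Send; the original
       implementation remains reachable under the PMPI_ prefix. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();
        int ret = PMPI_Send(buf, count, datatype, dest, tag, comm);
        record_event("MPI_Send", MPI_Wtime() - start);
        return ret;
    }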

1.1 Motivation

The list of HPC application profiling tools is quite long, and although many features are shared among them, every tool brings something extra to provide a slightly different insight into the application's behavior. New HPC machines come with new challenges that require different ways of optimizing the code. With the upcoming HPC exascale era, there is pressure to reduce the energy consumption of the systems and of the applications too. Several projects develop autotuning tools for energy savings based on CPU frequency scaling or Intel RAPL power capping [9], e.g. GEOPM [6], COUNTDOWN [5], Adagio [22] or READEX [20, 23].

One of the READEX tools is the MERIC [14, 29] library, which has been developed to provide application behavior analysis and information about its energy consumption when different application or system parameters are tuned. MERIC dynamically changes the tuned parameters and searches for the configuration in which each part of the application fully utilizes the system, so as not to waste resources and to bring energy or time savings. This way, a user can detect that some parts of the target application are as fast when using just one of two sockets as when using them both, due to a strong NUMA (Non-Uniform Memory Access) effect, or that the frequency of the CPU cores can be significantly reduced due to an inefficient memory access pattern. MERIC supports manual instrumentation only, which we have identified as a weak spot on the way to reaching the maximum possible savings. First of all, the process of locating where to insert the manual instrumentation into the source code is time consuming, which may lead to a situation where some parts of the code are not sufficiently covered; because of that, the code analysis may miss some of the code's dynamicity and the resulting configuration will be sub-optimal.

It is barely possible to specify a single rule for all autotuning frameworks that decides which parts of a code should be instrumented, but the most universal way is the specification of a minimum region size. Within the READEX project, the minimum size of an instrumented function has been specified as 100 ms, so that the tuning framework is able to change the system settings of contemporary Intel x86 processors and provide reliable energy measurements for all the instrumented regions (Intel RAPL counters [12] and HDEEM [11], a 1 kHz power-sampling energy measurement system, have been used in the project).

To reach the maximum possible savings, the application should contain the maximum possible number of regions that may show different behavior. This results in searching for all regions that last longer than the selected threshold and instrumenting them; nevertheless, the threshold can be extended if the instrumentation is too heavy. In general, overly detailed instrumentation can also be handled at the tuning framework side, which may ignore some of the regions, but even this solution leads to some minimal overhead depending on the framework's implementation (e.g. minimal time between region starts, maximum level of nesting). For the purpose of identifying regions with a runtime longer than a specified threshold, the Timeprof library [14] has been developed. The library measures the time of the application's functions and provides a list of functions that fulfill the condition.
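The selection itself can be as simple as the following sketch, which is not Timeprof's actual implementation but only an assumed illustration of threshold-based filtering: for each function the longest observed call duration is kept, and every function exceeding the threshold (e.g. the 100 ms limit mentioned above) becomes an instrumentation candidate.

    #include <iostream>
    #include <map>
    #include <string>

    // Longest observed duration of a single call, per function (seconds),
    // filled in by the time measurement layer.
    std::map<std::string, double> longest_call;

    // Print every function whose longest call exceeds the threshold.
    void report_candidates(double threshold_seconds)
    {
        for (const auto &entry : longest_call)
            if (entry.second > threshold_seconds)
                std::cout << entry.first << "  " << entry.second << " s\n";
    }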

2 Performance Analysis Tools

List of HPC tools for application performance analysis is very long so we decided to focus on open-source tools that are selected by OpenHPC [19] project whose mission is to provide a reference collection of open-source HPC software components and best practices, lowering barriers to deployment, advancement, and use of modern HPC methods and tools. The project mentions the following tools:Footnote 1

LIKWID [27] is a performance monitoring and benchmarking suite of command-line applications. Extrae [24] is a multi-platform trace-file generator for monitoring the application's performance. Score-P is a library for profiling and tracing that provides core measurement services for other tools - Scalasca [7], TAU [25], Vampir [17] and the Periscope Tuning Framework (PTF) [8]. Scalasca and TAU are very similar profiling and tracing tools that can also cooperate - e.g. Scalasca's trace-files can be visualized using TAU's profile visualizer. The Vampir framework provides event tracing and focuses mainly on the visualization part of the analysis process. PTF, on the other hand, is an autotuning framework providing many plugins to tune the application from various perspectives. GEOPM is an autotuning tool focused on x86 systems that dynamically coordinates hardware settings across all compute nodes used by an application, according to the application's behavior and requests from the resource manager. The last tool on our list is mpiP [21], a lightweight profiling library for MPI applications based on a middleware for the MPI functions; besides that, it also provides a limited list of C API functions to manually instrument the application, as do all the mentioned tools.

Besides splitting the application into different parts of the code, some tools also provide an opportunity to instrument the most time-consuming loops of the target application (e.g. in the case of Score-P we speak about a Phase region, the GEOPM terminology uses the word Epoch, etc.). This kind of annotation is useful especially for tools that not only analyze the application but also provide the opportunity to tune the application performance using some kind of optimization.

3 Manual and Compiler Inserted Instrumentation

Manual instrumentation usually wraps a function or a block of functions (with similar behavior), or it is inserted inside a loop body to detect different behavior within the iterations or, in the case of autotuning tools, to identify the optimal configuration by switching the configuration in each iteration.
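The following minimal sketch illustrates both placements; the prof_region_start/prof_region_stop names are hypothetical placeholders, since every tool discussed here ships its own manual-instrumentation API.

    // Hypothetical profiler API - placeholder names, not a concrete tool.
    void prof_region_start(const char *name);
    void prof_region_stop(const char *name);

    void solver_step();  // assumed application kernel

    void compute(int iterations)
    {
        prof_region_start("compute");          // wrap the whole function
        for (int i = 0; i < iterations; ++i) {
            prof_region_start("solver_step");  // wrap one iteration body
            solver_step();
            prof_region_stop("solver_step");
        }
        prof_region_stop("compute");
    }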

Manual source code instrumentation requires access to the source code to insert the API functions, and at least a basic knowledge of the application behavior to instrument the most significant regions. The application must be recompiled for each change in the instrumentation. Due to these requirements, manual instrumentation is time-consuming and inconvenient.

Although some of the performance analysis tools provide options for analyzing the application without making changes to the source code - using middleware (mpiP), compiler instrumentation (Score-P) or binary instrumentation (Extrae, TAU) - all of the mentioned tools have their own API to let the application user/developer extend the instrumentation with specific parts of the application.

Compiler instrumentation is provided by Score-P or by the GNU profiler gprof [10]; it provides the possibility of wrapping an application's functions with instrumentation at compilation time. In comparison to manual instrumentation, it removes the requirement to browse the source code to locate the requested functions; however, the handicap of needing access to the source code persists. In the default settings the compiler instruments all of the application's functions, without any limit on the function size, which in many cases may cause high profiling overhead, because the performance of even the shortest regions is measured. For this reason, the compiler provides an option to select/filter a subset of the functions to instrument. Unfortunately, this leads to repeated compilation of the target application, which is usually slower than plain compilation (e.g. Score-P does not support parallel compilation).
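As an illustration of the mechanism (a generic GCC example, not the exact plumbing used by Score-P or gprof), code compiled with gcc/g++ -finstrument-functions calls two hooks around every function, which a profiling library can implement; GCC's -finstrument-functions-exclude-function-list and -finstrument-functions-exclude-file-list options then serve as the compile-time filter mentioned above.

    #include <stdio.h>

    /* The attribute keeps the hooks themselves from being instrumented
       (and thus from recursing). */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site)
    {
        fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site)
    {
        fprintf(stderr, "exit  %p\n", fn);
    }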

4 Binary Patching

Binary patching means a modification of an application's execution without recompilation of the source code. The modifications can be done dynamically during the application run, or the binary can be statically rewritten with all the necessary changes and the edited binary stored in a new file.

Dynamic Binary Instrumentation (DBI) tools [4, 16, 18] interrupt the analyzed application process at each point that should be instrumented, switch the context to the tool, and execute the required action. This approach causes an overhead that is usually not acceptable for autotuning or performance analysis. On the other hand, a binary generated by a Static Binary Instrumentation (SBI) tool should not cause any extra overhead in comparison to manually instrumented code, which is confirmed by our measurements presented later in this section.

SBI tools not only insert function calls at certain positions in the instrumented binary, but also add all the necessary dependencies on the shared libraries, so it is not required to recompile the application for its analysis. Also, SBI tools can access both mangled and demangled names of the functions even if the application has been compiled without debug information. SBI tools are provided by TAU (using Dyninst [3], PEBIL [15] or MAQAO [2]) and Extrae (using Dyninst), and Score-P uses Dyninst in its instrumenter to patch the code.

PEBIL is a binary rewriting tool allowing ELF files for the x86-64 architecture to be patched. Unfortunately, the PEBIL project has been closed since 2017, so support for new platforms is not guaranteed. For that reason, we focus on Dyninst and MAQAO only. Of these, MAQAO-2.7.0 supports the IA-64 and Xeon Phi architectures only; the Dyninst-10.0.0 InstructionAPI implementation, on the other hand, supports the IA-32, IA-64, AMD-64, SPARC, POWER, and PowerPC instruction sets, with ARMv8 support in experimental status.

We have evaluated the overhead of manually inserted instrumentation against instrumentation statically inserted by MAQAO and Dyninst. We have used the MERIC library for this measurement, which reads the requested system information and stores the values in memory. A single-threaded application (to remove the influence of MPI/OpenMP barriers on the measurement) contained one region that was executed a thousand times; a sketch of the measurement loop is shown after the following list. We have not seen any difference between the overhead of manual instrumentation and SBI. The overhead of one instrumentation call on an Intel Xeon E5-2697v4 is:

  • 175 µs – when reading a timestamp

  • 375 µs – when reading energy consumption using Intel RAPL (reading four hardware counters and a timestamp)
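The following sketch approximates the described micro-benchmark; the instrumented_probe placeholder stands for one MERIC instrumentation call (in the real measurement it reads a timestamp or the RAPL counters).

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for a single instrumentation call. */
    void instrumented_probe(void);

    int main(void)
    {
        const int repetitions = 1000;
        struct timespec begin, end;

        clock_gettime(CLOCK_MONOTONIC, &begin);
        for (int i = 0; i < repetitions; ++i)
            instrumented_probe();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed = (end.tv_sec - begin.tv_sec)
                       + (end.tv_nsec - begin.tv_nsec) * 1e-9;
        printf("overhead per call: %.1f us\n",
               elapsed / repetitions * 1e6);
        return 0;
    }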

In the case of binary patching of a complex application, the time required to insert the instrumentation should not exceed the time needed for the application compilation. According to Valensi, MAQAO is able to insert 18 000 function calls in less than a minute [28].

4.1 Binary Parsing

Dyninst as well as MAQAO holds the executable in a structure of components, as if the application were decomposed by a compiler. The components and their relations are illustrated in Fig. 1. The base element of a binary is one or several images, where an image is a handle to the executable file associated with this binary. Each image contains a list of functions and global variables. A function can also be inspected for local variables and basic blocks (BBs), where a basic block is a sequence of instructions with a single entry point and a single exit point. The BBs are organized in a control-flow graph (CFG) that represents the branches of the code. From a basic block it is also possible to access its instructions.

When using Dyninst to browse through an application binary for its analysis or patching, all the components on the higher levels must be accessed first; the MAQAO interface, on the other hand, allows a user to access them directly. Since we are primarily interested in the insertion of a function call before and after selected functions, we may stay at the level of functions.
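For illustration, a minimal Dyninst sketch of this top-down traversal (using the BPatch API; details may differ between Dyninst versions) descends from the image through the modules and functions down to the basic blocks:

    #include <BPatch.h>
    #include <BPatch_binaryEdit.h>
    #include <BPatch_flowGraph.h>
    #include <BPatch_function.h>
    #include <BPatch_module.h>
    #include <cstdio>
    #include <set>
    #include <vector>

    int main()
    {
        BPatch bpatch;
        // Open the binary for rewriting; "a.out" is just an example name.
        BPatch_binaryEdit *binary = bpatch.openBinary("a.out");
        BPatch_image *image = binary->getImage();

        std::vector<BPatch_module *> *modules = image->getModules();
        for (BPatch_module *module : *modules) {
            std::vector<BPatch_function *> *functions = module->getProcedures();
            for (BPatch_function *function : *functions) {
                char name[256];
                function->getName(name, sizeof(name));

                // Descend to the control-flow graph and its basic blocks.
                std::set<BPatch_basicBlock *> blocks;
                function->getCFG()->getAllBasicBlocks(blocks);
                std::printf("%s: %zu basic blocks\n", name, blocks.size());
            }
        }
        return 0;
    }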

4.2 Workflow

In this section, we present the process of SBI using MAQAO or Dyninst, with the goal of inserting a profiler function call before and after a selected application function. The patching libraries provide much more functionality than presented here (e.g. static binary analysis or insertion of a function call at more general locations); however, for most profilers and autotuners, wrapping a function with its instrumentation should be sufficient.


Both Dyninst and MAQAO open the binary and start with its decomposition into the components, as presented previously. We can select a function to insert from the dependent shared libraries of the application. If the application has been compiled without the profiling library, the first step should be adding all the necessary dependencies, which is a single function call.
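In Dyninst, for example, adding such a dependency may look like the following sketch (libprofiler.so is a placeholder name; the MAQAO API provides an equivalent call):

    #include <BPatch.h>
    #include <BPatch_binaryEdit.h>

    int main()
    {
        BPatch bpatch;
        BPatch_binaryEdit *binary = bpatch.openBinary("a.out");

        // Add the profiling library as a new dependency of the rewritten
        // binary, so that its functions can be called from inserted code.
        binary->loadLibrary("libprofiler.so");

        binary->writeFile("b.out");  // store the patched binary
        return 0;
    }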

Fig. 1. Components of an application binary produced by a compiler.

With all the necessary dependencies in place, it is possible to find the function we want to insert under the application image, as well as the functions we want to wrap with the profiler instrumentation. To find the function that should be instrumented, the binary (the modules, in the case of Dyninst) must be searched for this function. The function may provide several code locations that could be instrumented, of which we are interested in its entry and exit points (addresses). With such a point it is possible to associate a function call with the requested list of arguments (be aware that there is no argument type control). This change must be committed to the binary. The edited binary is then stored and is ready to be executed to analyze its performance.

The listings present code snapshots that insert a printf call, which will print “FUNC main” at the beginning of the execution of the main function of a C application a.out, and store the binary as b.out using the Dyninst (Listing 1) or MAQAO (Listing 2) libraries. The examples assume that the printf function is available to be added; otherwise the relevant shared library dependency must be added too. Return codes are ignored to reduce the size of the listings.
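A sketch approximating the Dyninst variant of such a listing is shown below (it follows the Dyninst 10 BPatch API as we understand it; the original Listing 1 may differ in details):

    #include <BPatch.h>
    #include <BPatch_binaryEdit.h>
    #include <BPatch_function.h>
    #include <BPatch_point.h>
    #include <BPatch_snippet.h>
    #include <vector>

    int main()
    {
        BPatch bpatch;
        // true = also open the shared-library dependencies (libc with printf).
        BPatch_binaryEdit *binary = bpatch.openBinary("a.out", true);
        BPatch_image *image = binary->getImage();

        // Locate the function to be instrumented and the function to insert.
        std::vector<BPatch_function *> targets, printers;
        image->findFunction("main", targets);
        image->findFunction("printf", printers);

        // Entry points of main - the locations where the call is inserted.
        std::vector<BPatch_point *> *entries = targets[0]->findPoint(BPatch_entry);

        // Build the call printf("FUNC main"); argument types are not checked.
        BPatch_constExpr message("FUNC main\n");
        std::vector<BPatch_snippet *> args;
        args.push_back(&message);
        BPatch_funcCallExpr call(*printers[0], args);

        // Commit the change and store the patched binary under a new name.
        binary->insertSnippet(call, *entries);
        binary->writeFile("b.out");
        return 0;
    }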


5 Conclusion

Compiler and binary instrumentation are solutions for a fully automated application analysis and a subsequently optimized run of the application, but only if such instrumentation does not lead to significantly higher overhead than manual instrumentation. Our measurements have not shown any measurable difference between manual instrumentation and the static binary instrumentation provided by MAQAO or Dyninst. We consider SBI a simple and the most powerful solution; based on this conclusion, when writing a tool for application behavior analysis we recommend also providing SBI support, and we present samples of code using both Dyninst and MAQAO to show how simple a basic SBI tool is.

The problem of the ideal instrumentation (the number and location of the probes) has a massive impact on the effectiveness of every auto-tuning framework. An auto-instrumentation tool can be written to instrument the analyzed application according to the requirements of the autotuner and its way of tuning the application. The Timeprof library helps to identify the significant regions of the application in order to analyze their behavior. We can easily measure the runtime of all the functions of the application with Timeprof, which provides us with a selection of the regions. Afterward, the identified regions are instrumented with the selected library.