
1 Introduction

Developer tools are indispensable for developing large and complex applications. The broad spectrum of existing tools ranges from simple autocompletion in IDEs to sophisticated suites for debugging or performance analysis. While no prior experience is needed to use autocompletion, debugging an application is significantly more demanding and requires a certain degree of expertise.

HPC tools for debugging and performance analysis are of particularly high relevance. Unlike tools for conventional application development, HPC tools are required to maintain an arbitrary number of process states. A debugger for parallel applications, for example, has to manage all control streams (e.g. processes) and possibly their communication.

Likewise, performance analysis tools have to collect metrics like the time spent in a function, the number of function invocations, or the communication time between processes. All this information has to be collected for every process, preferably without changing the program’s behavior.

The acceptance of new libraries like DASH can be increased by providing tool support. Using and extending existing tools has two significant benefits. First, users don’t need to learn new tools. Second, the library developers benefit from the experience of the tool developers and don’t have to reimplement already existing functionality.

In DASH, we follow this approach and provide dedicated support for third-party solutions such as the GDB debugger [12], Score-P [22], and PAPI [5]. This paper describes the extensions and the challenges of incorporating these tools into the DASH template library and the DASH runtime (DART).

The remainder of this paper is structured as follows: publications and tools related to this work are presented in Sect. 2. Section 3 briefly introduces the DASH library and the underlying runtime. Challenges occurring when debugging parallel applications are discussed in Sect. 4. Section 5 explains extensions to Score-P and the DASH library to instrument and trace DART and DASH applications. A specific profiling solution for the MPI backend is presented in Sect. 6. Section 7 describes the integration of the PAPI library into DASH. Finally, Sect. 8 concludes the work and gives an outlook on future work.

2 Related Work

2.1 DASH

The concept of the DASH library and the DART API is explained in detail in [13]. An MPI-3-based DART implementation is described in [33].

Besides DASH, other PGAS approaches exist. UPC [30] and Co-array Fortran [25] extend C and Fortran with the PGAS concept. Other PGAS languages are Chapel [7], Fortress [3], and X10 [8]; all three projects were funded by the DARPA High Productivity Computing Systems program.

GASPI [17] and OpenSHMEM [27] are efforts to create standardized APIs for PGAS programming. UPC++ [32] and Co-array C++ [20] extend C++ with PGAS concepts. STAPL [6] is a C++ template library that shares several concepts with DASH, such as the local view on data and the representation of distributed containers, but does not seem to target classic HPC applications.

2.2 Debugging

We decided to extend the debugger GDB [12] because it is a widely used open-source debugger for sequential applications. LLDB [31] is an alternative, but not as prevalent as GDB. Existing debuggers for parallel applications like Eclipse Parallel Tools [2], PGDB [10], TotalView [28], or DDT [4] are limited to MPI, OpenMP, CUDA, etc. and don’t support the DASH library. TotalView and DDT support the widest range of libraries but are proprietary and therefore not the first choice for dedicated support in an open-source project like DASH.

2.3 Performance Analysis

The Score-P measurement system is a performance analysis software for parallel programs. Because performance analysis and optimization are important for a successful utilization of HPC resources, many tools for analyzing parallel programs are available. Some of them rely on sophisticated profiling techniques, e.g. TAU [29] and HPC-Toolkit [1], while others use event tracing.

Score-P is a joint effort of RWTH Aachen, ZIH TU Dresden, Jülich Supercomputing Centre, TU München, and GNS Braunschweig. Score-P can record a wide range of parallel subsystems such as MPI, OpenMP, SHMEM, CUDA, and PThreads. In addition, automatic compiler instrumentation and manual user-defined instrumentation are supported, while sampling-based recording is planned for future releases. Furthermore, Score-P can utilize CPU performance counters, provided by PAPI, to identify bottlenecks, and it enables performance analysis based on event tracing (OTF2 [11] trace format) or profiling (CUBE format). Post mortem, tools like Vampir [21] or Scalasca [15] support the performance analysis by visualizing the traces and profiles. With Periscope [16] it is possible to perform an online analysis.

3 Overview DASH

The main goal of the DASH project is the development of a C++ template library for distributed data structures that provides a convenient parallel programming model. DASH is divided into four layers (Fig. 1). The DASH runtime (DART) implements different communication substrates like MPI [23] and provides a uniform interface to the DASH C++ template library above it. This library provides containers for distributed data structures (e.g. arrays) which can be used similarly to the containers of the C++ STL. Furthermore, these containers can be used with algorithms provided by the C++ STL. The user application is located on top of the DASH template library and is thus independent of any changes made in DART or the communication substrate. Additionally, the application benefits from performance optimizations in DART. The tools and interface layer is connected to all aforementioned layers and supports the developers of each layer, be it debugging (Sect. 4), performance analysis (Sect. 5), implementing and optimizing the MPI implementation in DART (Sect. 6), or direct support for PAPI counters within the DASH template library (Sect. 7).

Fig. 1 The layered structure of the DASH project

In the following, we briefly explain DART and the DASH template library; more detailed information can be found in [13].

3.1 DART: The DASH Runtime

As mentioned before, DART abstracts the low-level communication libraries from the DASH template library. The DART API is written in plain C and general enough to be implemented on top of any low-level communication substrate. Processes are called units in DART/DASH to stay independent of terms like threads, ranks, or images. All units are part of at least one team.

Units communicate directly via one-sided or collective communication operations. Blocking and non-blocking variants are provided, as well as synchronization mechanisms.

Besides the communication, the DART API manages the distributed global memory, which can be allocated either symmetrically and team-aligned or locally with global accessibility. The former allocates the same amount of storage for every unit in the team: globally, the memory appears as one big block, but locally each unit manages only its part of it. The latter allocates memory locally at a single unit but makes it accessible to other units.
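As a minimal sketch of these two allocation modes: the DART function names and signatures below are illustrative assumptions based on the description above, not necessarily the exact DART API.

// Illustrative sketch of the two DART allocation modes (names/signatures assumed).
#include <cstddef>

extern "C" {
  typedef int dart_team_t;                    // stand-in for the DART team handle
  typedef struct { long dummy; } dart_gptr_t; // stand-in for a DART global pointer

  // Symmetric, team-aligned allocation: every unit in 'team' contributes
  // 'nbytes' of local memory to one logically contiguous global block.
  int dart_team_memalloc_aligned(dart_team_t team, size_t nbytes, dart_gptr_t* gptr);

  // Local allocation with global accessibility: the memory lives at the
  // calling unit only, but other units can address it via the global pointer.
  int dart_memalloc(size_t nbytes, dart_gptr_t* gptr);
}

void allocation_sketch(dart_team_t team) {
  dart_gptr_t team_block;   // one big global block, one local part per unit
  dart_team_memalloc_aligned(team, 1024 * sizeof(double), &team_block);

  dart_gptr_t local_block;  // local memory, globally accessible
  dart_memalloc(256 * sizeof(double), &local_block);
}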

For now, DART supports MPI and GASPI [17] as communication substrates. Furthermore, a System V shared memory implementation for testing purposes and a prototype for GPGPUs exist.

3.2 DASH: Distributed C++ Template Library

The DASH template library is written in C++, which allows efficient PGAS [26] semantics. PGAS approaches often use local and global views on data. In the global view it is not directly visible whether data is accessed remotely or locally, whereas the local view is limited to local data accesses, which need no communication and are therefore much faster than remote accesses. The distributed data containers in the DASH template library combine these local/global views with the C++ STL iterator concept. It is thus possible to iterate over local data only, over all data, or over only a part of it (a mixture of global and local data accesses). Listing 1 demonstrates the construction of a 1-D array (line 2), a global iteration over all elements (lines 9–12), and local data accesses with the C++ STL function fill (line 20). This example shows a small subset of the possibilities to access the data elements of a container. To combine the distributed data structure concept with the C++ STL concepts, new functionality had to be added. In Sect. 5.2 we explain why C++ containers, like the DASH containers, are quite difficult to trace for performance analysis.

Listing 1 DASH array container used with global-view and local-view semantics

1  // allocates an array over all units
2  dash::array<double> my_array(100000, nodeteam);
3
4  // gets the unit's id
5  int myid = nodeteam.myID();
6
7  // The unit with id 0 iterates over all array elements and
8  // assigns them the value 1
9  if (myid == 0) {
10   for (auto it = my_array.begin(); it != my_array.end(); ++it)
11     *it = 1;
12 }
13
14 // blocks all units until every unit reached this routine
15 my_array.barrier();
16
17 // every unit fills its part of the array with its unit id
18 // lbegin() returns the beginning of the local memory while
19 // lend() returns the ending
20 std::fill(my_array.lbegin(), my_array.lend(), myid);

4 Debugging DASH Applications

To conveniently debug parallel DASH applications, it is necessary to interrupt the parallel start-up, provide useful output for DASH data structures, and get access to all data elements (local and remote) of a DASH container.

We decided to extend the widely used and extensible debugger GDB because no existing debugger fulfilled all requirements. GDB doesn’t support debugging multi-process applications directly but allows attaching to existing processes remotely. The initialization process of DASH was modified to interrupt the start-up: in debugging mode, the master process is trapped in an infinite loop while all other processes wait in a barrier. The user can then attach GDB instances to all DASH processes of interest and, for instance, set breakpoints. To continue debugging, the master process has to be released from the infinite loop.
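A minimal sketch of this start-up trap (illustrative, not the actual DASH initialization code); the flag name is hypothetical and would be cleared from an attached GDB session.

// Illustrative start-up trap: the master unit spins until a debugger releases it.
#include <unistd.h>

volatile int dash__wait_for_gdb = 1;   // cleared from GDB: set var dash__wait_for_gdb = 0

void debug_trap_at_startup(int myid) {
  if (myid == 0) {
    // Master unit spins here; attach GDB, clear the flag, then "continue".
    while (dash__wait_for_gdb) {
      sleep(1);
    }
  }
  // All other units would wait in a barrier (e.g. a DART barrier) until the
  // released master reaches the same barrier.
}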

The concept of pretty-printers, provided by newer versions of GDB, supports customizing how data structures are printed during debugging. We used this extension, defined pretty-printers for all DASH containers, and added them to the GDB environment. The following example shows the difference between the default (Listing 2) and the pretty-printed (Listing 3) output for a DASH 1-D array. The pretty-printed version is obviously more helpful for most users, because it provides more specific information.

To produce the pretty-printed output, access to all data elements, local and remote, is necessary. Therefore, we added new methods to the DASH containers, available only in debugging mode, which enable GDB to access all data elements.

More detailed information concerning the start-up process and the modifications made can be found in [18].

Listing 2 Standard GDB output for a DASH 1-D array

$1 = (dash::Array<int, dash::PatternBlocked> &) @0x7fff6018f370: {m_team = @0x619ea0, m_size = 20,
  m_pattern = {<std::__shared_ptr<dash::Pattern, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0xac2de0,
    _M_refcount = {_M_pi = 0xac2dd0}}, <No data fields>},
  m_ptr = {<std::__shared_ptr<dash::GlobPtr<int>, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0xab1310,
    _M_refcount = {_M_pi = 0xab1300}}, <No data fields>},
  m_acc = {<std::__shared_ptr<dash::MemAccess<int>, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0xac2e50,
    _M_refcount = {_M_pi = 0xac2e40}}, <No data fields>}}

Listing 3 Pretty printed GDB output for a DASH one-dimensional array

$1 = dash::Array<int, dash::PatternBlocked> with 20 elements - Team with dart id 0 and position 0 - dart global pointer: std::shared_ptr (count 1, weak 0) 0xab1310 = {1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4}

5 Using Score-P to Analyze DASH and DART

Performance analysis distinguishes between data acquisition and data presentation. Data acquisition describes techniques to obtain performance data during a program execution, such as sampling or instrumentation. Data presentation deals with the visualization of the acquired data. This section explains two performance data presentation techniques (profiling and tracing) and the techniques used to support DASH/DART in Score-P.

Profiling

A profile presents the performance data as summarized aggregations of performance metrics. These metrics can be the number of function invocations or the inclusive/exclusive time spent in a code region, e.g. a function or loop body. The simplest form of a profile provides information about the functions in which the program spent the majority of its time (flat profile), but no caller context. A profile can also offer information about the program execution based on call paths, which results in a call-path profile. It is even possible to split the program execution into phases and create profile records for each phase (phase profile). Performance data for a profile can be gathered from instrumented regions or even from sample-based data [19].

Tracing

Tracing stores performance information in event traces during the program execution. All interesting events are held in memory and flushed to disk per process. These flushes (buffer-flush events [24]) only happen if not enough memory is available or when the program finishes. After the program has finished, the separate event streams are unified.

Performance data are stored in special trace formats, such as OTF2 [11], and are produced from instrumentation or sample-based data. They contain information about where, when, and what kind of event is recorded. Typical events are enters and leaves of subroutines, or communication calls (e.g. MPI point-to-point communication). In contrast to profiling, tracing records time stamps for the events in order to later reconstruct the program behavior. Event tracing provides a great level of detail about a program’s performance, but at the cost of a lot of disk space and buffer-flush events for large and long-running applications.

5.1 DART

The DART function calls can be automatically instrumented by the Score-P adapter using library wrapping. Library wrapping is a technique to redirect function calls to alternative functions instead of calling the original function. The alternative function can record the enter event, call the original library function and forward the parameters, record the leave event, and return the result of the original function to the caller. This approach needs the symbols of all relevant library functions the application is linked against. Library wrapping is a widely used and well-known concept for automatically instrumenting libraries [9].
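As a minimal sketch of this pattern (illustrative only, not the actual Score-P adapter code), a wrapper for a DART call could look as follows, assuming a linker-based wrapping mechanism such as GNU ld's --wrap option; the record_enter/record_leave hooks and the dart_team_t stand-in type are hypothetical.

// Illustrative library-wrapping example, linked with: -Wl,--wrap=dart_barrier
#include <cstdio>

typedef int dart_team_t;                                   // stand-in type

// Hypothetical measurement hooks (a real adapter would call into Score-P).
static void record_enter(const char* region) { std::printf("enter %s\n", region); }
static void record_leave(const char* region) { std::printf("leave %s\n", region); }

extern "C" int __real_dart_barrier(dart_team_t team);      // original symbol (from DART)

extern "C" int __wrap_dart_barrier(dart_team_t team)       // called instead of the original
{
  record_enter("dart_barrier");                            // record the enter event
  int ret = __real_dart_barrier(team);                     // forward to the original function
  record_leave("dart_barrier");                            // record the leave event
  return ret;                                              // return the original result
}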

We extended Score-P with a specialized adapter wrapping all functions defined by the DART API. Besides the adapter, build-system support and DART-specific parameters were added to Score-P. Score-P now records DART API functions, which can be visualized in tools like Vampir.

DART API functions are independent of the underlying communication substrate. Recording only the DART API especially supports the development process of the different DART implementations, because it allows comparing the runtime and behavior of DART functions across different implementations, or even with the native communication substrate.

The Vampir screenshot in Fig. 2 shows an instrumented application with 8 processes. Yellow events represent DART events and red ones MPI events. The call tree (bottom right) displays all recorded DART functions and the MPI functions used; the inclusive and exclusive time is given behind the function name. The graph on top shows the runtime of each recorded function. On the left, the timeline for all processes in the chosen time frame is plotted, visualizing all events ordered by their time stamp.

Fig. 2 Vampir screenshot. On the left, a time frame of the recorded events for 8 processes is shown. The accumulated exclusive time of all measured functions is located top right and the function call graph at the bottom right

5.2 DASH

DASH is a C++ template library implemented header-only, which means that all template declarations and definitions are located in the same header files; this is necessary because the compiler has to see both. The actual type of a template is only known when the user includes the header containing the template definition into the application source code. Further, symbol names of C++ functions and methods appear mangled in an object file. A mangled name encodes additional information about the symbol, for instance type information of the parameters, templates, namespaces, and classes. Symbols of the DASH library are therefore not determined before the actual template instantiations are compiled; in the library file, header-only symbols do not exist. It is thus not possible to know in advance which symbols will actually appear in the application. For this kind of library, the library-wrapping approach mentioned in Sect. 5.1 doesn’t work.

Another approach to instrument the DASH library is automatic compiler instrumentation: the compiler inserts Score-P measurement calls before and after each function call. But real C++ applications tend to have an enormous number of small function calls (e.g. operators), which results in recording many events. Another drawback is the instrumentation of all functions and methods of other libraries used, such as internal functions of the STL. STL internals, with all their constructor calls and overloaded operators, are usually not of interest for the analysis of an application, but they are recorded and increase the number of events, the execution-time overhead, and the memory used to store the events. The Score-P measurement system provides function filters to reduce this overhead, but this costs the user maintenance and extra effort.

Fortunately, Score-P offers a user API to instrument the source code manually. This API essentially provides a set of preprocessor directives (macros) which can be inserted at the beginning of a function or method. Expanded, such a macro creates a new object on the stack that generates an enter event at construction time; when the function returns, this object is destructed and triggers a leave event. Instrumenting every single function this way results in a behavior similar to compiler instrumentation in terms of runtime overhead and memory footprint. To avoid this, we use a level-based approach for the instrumentation. According to the architecture of the DASH library, the instrumentation is partitioned into the following three levels:

  • CONTAINER_LEVEL instrumentation

  • PATTERN_LEVEL instrumentation

  • ALL_LEVEL instrumentation

CONTAINER_LEVEL is the lightest level of instrumentation: only public methods of the DASH container classes are recorded. This level can reveal performance-critical methods of the DASH containers. PATTERN_LEVEL extends the container level with all methods of the DASH pattern classes and delivers additional information about the data distribution. Figure 3 shows Vampir’s function summary plot for a DASH application instrumented at PATTERN_LEVEL. With ALL_LEVEL, all DASH functions and methods are recorded.

Fig. 3 Vampir screenshot. Accumulated exclusive time per function for all functions recorded with PATTERN_LEVEL

The instrumentation and the concrete level can be enabled with the compiler flags -DDASH_SCOREP and -D<LEVELNAME>_LEVEL=true. If no level is set, CONTAINER_LEVEL is the default (only when instrumentation is enabled). A sketch of how such a level guard can be combined with the Score-P user macros is shown below.
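The following is a minimal sketch (not the actual DASH sources) of how the level-based instrumentation can be mapped onto the Score-P user API; the DASH__TRACE_* wrapper macros are illustrative assumptions, while SCOREP_USER_REGION is the RAII-style macro provided by the Score-P user API.

// Illustrative level-guarded instrumentation macros.
#ifdef DASH_SCOREP
  #include <scorep/SCOREP_User.h>
  // Container level: always recorded when instrumentation is enabled.
  #define DASH__TRACE_CONTAINER(name) \
    SCOREP_USER_REGION(name, SCOREP_USER_REGION_TYPE_FUNCTION)
  // Pattern level: only recorded for PATTERN_LEVEL or ALL_LEVEL builds.
  #if defined(PATTERN_LEVEL) || defined(ALL_LEVEL)
    #define DASH__TRACE_PATTERN(name) \
      SCOREP_USER_REGION(name, SCOREP_USER_REGION_TYPE_FUNCTION)
  #else
    #define DASH__TRACE_PATTERN(name)
  #endif
#else
  // Without -DDASH_SCOREP both macros expand to nothing.
  #define DASH__TRACE_CONTAINER(name)
  #define DASH__TRACE_PATTERN(name)
#endif

// Usage inside a (hypothetical) container method:
//   T& dash::Array<T>::operator[](size_t i) {
//     DASH__TRACE_CONTAINER("dash::Array::operator[]");
//     ...
//   }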

The disk space used by the produced traces for the different instrumentation levels is shown in Table 1. NO_DASH represents DART and MPI functions only (no DASH functions were recorded), while COMPILER recorded all functions (automatic compiler instrumentation). Table 1 shows that instrumenting the DASH container functions alone already increased the trace size by about 10 GB. Recording all DASH functions used 61 GB, which is 10 times more than NO_DASH. However, compared to COMPILER, ALL_LEVEL uses significantly less disk space.

Table 1 Comparison of the disk space used for the different instrumentation levels. Measured was a 2-D heat equation running on two nodes (4 cores each) with a problem size of 64 × 64. The columns describe the different instrumentation levels. NO DASH means only MPI and DART calls are recorded; COMPILER used automatic compiler instrumentation

In addition to the trace sizes, we compared the runtimes of the instrumentation levels (Table 2). For this, we measured the whole application execution time, which includes the initialization and finalization of DASH. The reason for the large factor of 3 between PLAIN and NO DASH is the tracing of DART and MPI events in NO DASH; together with the large number of iterations, this causes the big runtime overhead. The difference between CONTAINER_LEVEL and PATTERN_LEVEL is quite acceptable, considering the higher degree of information. Table 2 also shows that the highest instrumentation level of DASH (ALL_LEVEL) produced significantly less overhead than the automatic compiler instrumentation (COMPILER).

Table 2 Comparison of the runtime for a 2-D heat equation running on two nodes (4 cores each) with a problem size of 64 × 64. The columns describe the different instrumentation levels. PLAIN represents the runtime without instrumentation, NO DASH means only DART and MPI calls were recorded, and COMPILER is the automatic compiler instrumentation

Nevertheless, our approach has two drawbacks. First, using the manual instrumentation macros in constexpr functions is not possible. Second, the __func__ macro only provides the function/method name, but no namespace or class information, for manually instrumented functions. For future work, modern introspection techniques could be of interest to improve the quality of the information in our manual instrumentation.

6 MPI Profiling

As an additional option for lightweight profiling on the MPI level, we extended and adapted IPM [14]. IPM is a low-overhead performance profiling tool that provides a high-level overview of application activity and allows for an attribution of execution time to various event sources. IPM keeps its performance data in a hashtable, using both the hash key and the hash value to store information. Upon encountering a relevant application event, a hash key is formed from a numeric event ID and context information. The hash values consist of first-order statistics of the events and their durations.
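A minimal sketch of this bookkeeping scheme, with illustrative types that only mimic the description above (IPM's actual data structures differ):

// Illustrative (event, context) -> statistics bookkeeping.
#include <algorithm>
#include <cstdint>
#include <unordered_map>

struct EventKey {
  int      event_id;   // numeric ID of the MPI/RMA call
  int      partner;    // communication/synchronization partner rank
  uint64_t bytes;      // transfer size, where applicable
  bool operator==(const EventKey& o) const {
    return event_id == o.event_id && partner == o.partner && bytes == o.bytes;
  }
};

struct EventKeyHash {
  size_t operator()(const EventKey& k) const {
    return std::hash<uint64_t>()((uint64_t(k.event_id) << 48) ^
                                 (uint64_t(k.partner)  << 32) ^ k.bytes);
  }
};

struct EventStats {            // first-order statistics of the durations
  uint64_t count = 0;
  double   sum = 0.0, min = 1e300, max = 0.0;
};

std::unordered_map<EventKey, EventStats, EventKeyHash> profile;

void record_event(const EventKey& key, double duration) {
  EventStats& s = profile[key];  // lookup or create the (event, context) entry
  ++s.count;
  s.sum += duration;
  s.min  = std::min(s.min, duration);
  s.max  = std::max(s.max, duration);
}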

To work with DASH, IPM was extended to support MPI-3 one-sided operations. A total of 38 functions from the MPI-3 RMA (remote memory access) API were added to the event coverage of IPM. These functions deal with the setup of one-sided communication (window creation), the actual communication operations (put and get), as well as synchronization operations (flush, flush local). The event context information captured by IPM also includes the size of data transfers (where appropriate, i.e., for MPI_Put, MPI_Get, MPI_Accumulate, and similar) and the communication and synchronization partner ranks (i.e., the origin or target ranks of the operation). For each monitored application event, a lookup is performed in the hashtable and the associated statistical data is updated. By default, IPM records the number of invocations for each (event, context) pair as well as the minimum, maximum, and sum of all event durations encountered.

It is possible to extend the default context information for application events by including a callsite and a region identifier. The callsite ID is automatically generated by IPM by keeping track of the callstack for each invocation of an MPI operation; this allows differentiating MPI calls from different calling contexts (e.g., subroutines). Region identifiers can serve the same purpose, but are manually managed by the application developer, for example to differentiate phases of an application (e.g., initialization, main iteration loop, result analysis).
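A sketch of manually managed regions, assuming IPM's MPI_Pcontrol-based region convention (a positive level with a label enters a region, a negative level leaves it); the region labels and the solver function are illustrative.

// Illustrative region markers around application phases.
#include <mpi.h>

void solver_iterations() { /* ... main iteration loop ... */ }

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  MPI_Pcontrol(1,  "init");    // enter region "init"
  /* ... set up data structures ... */
  MPI_Pcontrol(-1, "init");    // leave region "init"

  MPI_Pcontrol(1,  "solve");   // enter region "solve"
  solver_iterations();
  MPI_Pcontrol(-1, "solve");   // leave region "solve"

  MPI_Finalize();
  return 0;
}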

IPM keeps the data in per-process local hashtables; on program termination, all hashtables are merged at the master rank and application performance reports are generated. In its most basic form, this is an application performance report banner written to stdout immediately after program termination. An example IPM report with full detail is shown in Fig. 4. Evidently, this program (a simple communication benchmark) spent most of its time in communication and synchronization routines. More detailed data is recorded by IPM in an XML-based performance data file suitable for archival. A parser reads the report file and generates an HTML profiling report that can be viewed in a web browser. The HTML profiling report includes displays such as a communication topology heatmap (the amount of data or time spent for communication between any pair of processes) and a breakdown of the time spent in communication routines by transfer size. Examples of these displays are shown in Fig. 5.

Fig. 4 Example IPM banner output showing high-level metrics such as the total time spent in MPI operations and individual contributing MPI calls

Fig. 5 Example performance profiling displays generated by IPM: time spent in MPI calls, sorted by total MPI time across ranks (left); communication topology map (right)

The detail level of data collection and output can be specified in IPM using environment variables. IPM_REPORT specifies the amount of information to be included in the banner output upon program termination (the available settings are none, terse, and full). The environment variable IPM_LOG specifies the amount of data to be included in the XML profiling log. The setting full means that all information contained in the hashtable is available in the XML file, whereas terse means that only summaries are available.

7 PAPI Support in DASH

Modern microprocessors provide registers that count the occurrence of events related to performance metrics. Hardware performance counters allow querying clock cycles and the occurrence of more specific events like cache misses. Their usage and semantics vary between processor architectures, however, and they can usually only be accessed using special instructions.

The PAPI (Performance API) project [5] specifies a standard application programming interface that allows portable access to hardware performance counters and has evolved into a de-facto standard in performance evaluation. The DASH library provides dedicated support for PAPI: its usage can be specified in the build configuration, and performance-related concepts have been designed with PAPI in mind.

Application developers using DASH can call PAPI functions directly. However, the DASH developer toolkit also provides its own interface for performance counters that switches to native fallback implementations for several architectures and platforms if PAPI is not available on a system. Typically, the DASH performance interface just acts as a lightweight wrapper around PAPI functions with a more convenient, intuitive interface.
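For illustration, direct use of PAPI (independent of the DASH wrapper) could look like the following sketch, using the PAPI low-level API; the choice of events and the kernel function are assumptions, and error handling is largely omitted.

// Illustrative direct PAPI usage with the low-level API.
#include <papi.h>
#include <cstdio>

void measured_kernel() { /* ... code to measure ... */ }

int main() {
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

  int eventset = PAPI_NULL;
  PAPI_create_eventset(&eventset);
  PAPI_add_event(eventset, PAPI_TOT_CYC);   // total cycles
  PAPI_add_event(eventset, PAPI_L2_DCM);    // L2 data cache misses

  long long values[2] = {0, 0};
  PAPI_start(eventset);
  measured_kernel();
  PAPI_stop(eventset, values);

  std::printf("cycles: %lld, L2 data cache misses: %lld\n",
              values[0], values[1]);
  return 0;
}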

As an example, we demonstrate the use of the DASH Timer class and discuss available fallback implementations.

7.1 The DASH Timer Class

Time measurements allow deriving the most common metrics for application performance tuning, such as latency and throughput. Time can be measured in clock-based or counter-based domains.

Clock-based time measurements refer to wall-clock time and are provided by the operating system. Functions like clock_gettime are reasonably convenient for end users and have standardized semantics. Their accuracy and precision depend on the operating system’s implementation, however. To address a common misconception, note that a timestamp’s resolution only represents its unit of measure but does not imply precision: a timestamp with 1 ns resolution might only offer 1 ms precision or less. Therefore, two consecutive timestamps might have identical values if both measurements fall into the same time-sampling interval.
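The following sketch (Linux/POSIX, illustrative) contrasts the resolution reported by clock_getres with the smallest observed difference between consecutive clock_gettime timestamps, which gives a rough lower bound on the achievable precision.

// Illustrative comparison of reported resolution vs. observed precision.
#include <time.h>
#include <cstdio>

static long long to_ns(const timespec& t) {
  return t.tv_sec * 1000000000LL + t.tv_nsec;
}

int main() {
  timespec res;
  clock_getres(CLOCK_MONOTONIC, &res);           // reported resolution
  std::printf("reported resolution: %lld ns\n", to_ns(res));

  long long min_delta = 0;
  timespec a, b;
  clock_gettime(CLOCK_MONOTONIC, &a);
  for (int i = 0; i < 1000000; ++i) {
    clock_gettime(CLOCK_MONOTONIC, &b);
    long long d = to_ns(b) - to_ns(a);
    if (d > 0 && (min_delta == 0 || d < min_delta)) min_delta = d;
    a = b;
  }
  std::printf("smallest observed delta: %lld ns\n", min_delta);
  return 0;
}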

Counter-based time measurements require low-level interaction with hardware in client code. Dedicated hardware counters like the time stamp counter (read via RDTSC) achieve optimal accuracy and precision at instruction level and return monotonic counter values. As an advantage, consecutive counter-based time measurements are guaranteed to yield different values. Machine instructions and the semantics of counter registers are specific to each CPU architecture and tedious to use for end users, however. At the time of this writing, no standardized interface exists to access hardware counters; efforts like PerfCtr and Perfmon have not been accepted as part of vanilla Linux. In addition, counter values can only be related to pseudo wall-clock time via frequency scaling, i.e. by dividing by a hypothetical constant CPU clock frequency.

The DASH Timer class represents a unified, concise interface for both clock- and counter-based measurements. In both cases, elapsed time between two measurements is automatically transformed to wall-clock time, applying frequency scaling when needed. Developers can still access raw timestamp values directly. Listing 4 shows how the scaling frequency for counter-based timers in DASH can be configured explicitly. Instances of the DASH Timer are then initialized for either clock- or counter-based measurement.

We apply the RAII scheme in the Timer class to provide convenient usage: in the constructor, a class member is initialized with the current timestamp. The elapsed time since creation of a Timer object can then be resolved without maintaining a start timestamp manually, as shown in Listing 5.

Listing 4 Configuring the DASH Timer

dash::Timer<dash::Timer::Counter>::Calibrate(
    2560.0   // bogo-MIPS
);

Listing 5 Usage of the DASH Timer class

// Create timer based on time stamp counter:
dash::Timer<dash::Timer::Counter> timer_ct;
// Create timer based on wall-clock time:
dash::Timer<dash::Timer::Clock>   timer_ck;
// Operation to measure:
do_work();
// Get timestamp of construction of the timer:
auto timestamp_start = timer_ct.start();
// Get current timestamp based on time stamp counter:
auto timestamp_end   = timer_ct.now();
// Get the time in nanoseconds elapsed since construction
// of the timer instance. Counter value in timestamp is
// converted to real-time domain using frequency scaling:
double elapsed_ns  = timer_ct.elapsed();
// same as:
double elapsed_ns2 = timer_ct.elapsed_since(timer_ct.start());
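The RAII scheme described above can be illustrated with a minimal clock-based sketch using std::chrono; this is not the actual DASH Timer implementation, and the class name is illustrative.

// Minimal RAII timer sketch: the constructor stores a timestamp,
// elapsed_ns() reports the time since construction.
#include <chrono>

class ScopedTimer {
  using Clock = std::chrono::steady_clock;
  Clock::time_point start_;
public:
  ScopedTimer() : start_(Clock::now()) {}        // timestamp taken at construction

  double elapsed_ns() const {
    return std::chrono::duration<double, std::nano>(Clock::now() - start_).count();
  }
};

// Usage:
//   ScopedTimer t;
//   do_work();
//   double ns = t.elapsed_ns();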

7.2 Fallback Timer Implementations

The accuracy and precision of time measurements are tedious to verify and optimize for a specific architecture. PAPI implements stable time measurements for a wide range of platforms based on hardware counters like the time stamp counter (RDTSC).

The DASH developer toolkit includes its own implementations to access time stamp counter registers and the operating system’s time functions like clock_gettime on Unix. Therefore, DASH applications do not depend on the PAPI library for reliable time measurements, but PAPI can still be easily integrated into an existing application, as both variants are accessed through the same interface.

If DASH is configured without PAPI support, the choice of a fallback implementation is made at compile time. The decision for a concrete fallback variant depends on the system architecture, the operating system, and the functions and flags available in the linked C standard library. Clock-based timestamps are then typically obtained from the Linux function clock_gettime using the most precise mode available on the system. For counter-based timestamps, specific assembly instructions read values from counter registers such as RDTSC on x86 architectures or PCCNT on ARM32.
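As an illustration of such a counter-based fallback on x86 (not the actual DASH code), the time stamp counter can be read via the __rdtsc intrinsic and related to wall-clock time by frequency scaling; the helper names and the assumed clock frequency are illustrative.

// Illustrative x86 counter-based timestamp fallback (GCC/Clang).
#include <cstdint>
#include <x86intrin.h>

inline uint64_t timestamp_counter() {
  return __rdtsc();                       // raw TSC value, monotonic per core
}

// Converting a counter difference to nanoseconds requires frequency scaling
// with an assumed constant clock frequency (here: hypothetical 2.5 GHz).
inline double ticks_to_ns(uint64_t ticks, double ghz = 2.5) {
  return static_cast<double>(ticks) / ghz;
}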

8 Conclusion and Future Work

This paper described the current tool infrastructure of the DASH project. To set the context, the project was briefly introduced and its main components were explained. Then the debugging support for DASH was described, in particular the challenges of controlling many processes and the use of new pretty-printers. Following this, we explained the differences between profiling and tracing and presented our extensions for DASH/DART support in the Score-P measurement environment. We discussed why different approaches are needed to support DASH and DART, and demonstrated in particular the necessity of different instrumentation levels (recording only a few or all functions) in the DASH template library. Finally, we presented a profiling tool for MPI (especially supporting DART-MPI developers) and showed how PAPI counters are integrated into DASH.

The next step will be a performance monitoring environment for memory accesses in DASH containers, which will especially help DASH/DART developers to verify distribution patterns and to evaluate the memory access behavior of the implemented DART communication libraries. The monitoring environment can also be used to analyze and optimize existing patterns and the algorithms using them.