Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

The development of Message Passing Interface (MPI) [1] applications is a time consuming and complex task. One of the key challenges, aside from achieving high efficiency, is guaranteeing soundness of an application’s use of MPI, i.e., its correct usage of the MPI API. While some MPI related errors may directly cause wrong results, application crashes, or hangs, some errors may only manifest on some systems or runs and then in some cases only long after their cause or simply by producing wrong results at the end of the execution. If done manually, finding such problems can be a long and difficult task and developers therefore require tool support that aids in the removal of these errors. Runtime error detection, i.e., detecting errors during an application run, is one tool class that provides this support. We develop the Marmot Umpire Scalable Tool (MUST), named after its predecessor tools Marmot [2] and Umpire [3], for this purpose.

Recent advances in runtime deadlock detection [4] and datatype correctness checks [5] allow MUST to efficiently detect complex errors. However, detecting such errors is only half the solution to the overall problem. Any tool must also present all details about a detected error in a way that helps users understand the erroneous behavior of their codes and help them fix the problem. Consider the following examples that illustrate some potential complexities:

Figure 1 presents two deadlock scenarios with simplified MPI calls. Two processes attempt to send and receive a message from each other using blocking receive and non-blocking send calls. The example in Fig. 1a results in a deadlock, as both processes issue the MPI_Recv call without issuing any send calls first. As a result, both processes wait in a cyclic fashion for each other’s send call, which is never reached, and hence can’t continue execution. MUST’s graph-based deadlock detection catches this error and presents the user with a wait-for graph. As no process issued a send call before the receive call, this report includes the key items to understand the error, which in this case are the processes involved in the deadlock and their individual active MPI calls. The situation in Fig. 1b represents a similar communication, which also results in a deadlock, due to a mismatch in the given message tags. MUST’s wait-for graph shows the user that both processes are blocked in the MPI_Recv call. As both processes have active send calls, the simple criteria used in the example above doesn’t hold and the tool user needs to investigate these calls manually in order to determine whether a tag or even a communicator mismatch exists. Different source files that contain active send calls or the use of variables as tag arguments can complicate this further.

Fig. 1
figure 1

MPI usage error examples. (a) Recv-recv deadlock. (b) Deadlock resulting from a tag mismatch

In this paper, we present a set of novel output extensions of MUST that provide tool users with the necessary fine-grained and detailed information of such complex error situations, but without overwhelming them with additional unrelated data. In particular, we include:

  • A parallel call stack that highlights the processes that MUST determined as the root of a deadlock,

  • A condensed message queue that only lists send and receive calls that are meaningful in a deadlock situation, and

  • A call-stack based decomposition of the message queue graph to augment a regular message queue graph with source location information, and

  • A datatype tree view that highlights error positions in derived datatypes.

We first present an overview of MUST, its correctness checks, and its basic error report in Sect. 2, followed by a summary of MUST’s current deadlock view and datatype usage reports. Afterwards, we present our proposed deadlock view extensions in Sect. 4. Section 5 presents how we can efficiently pinpoint particular error positions in derived datatypes. Finally, we present related work in Sect. 6 and conclude in Sect. 7.

2 MUST

MUST detects MPI usage errors, i.e., usage of MPI calls that are not consistent with restrictions laid out in the MPI standard, during an application run and reports them to the user. Examples for such usage errors are illegal parameters to MPI calls, writes to a send buffer while an asynchronous message transfer is in progress, inconsistent orderings of collective operations, or deadlocks due to improper synchronization. MUST uses the MPI profiling interface to intercept and analyze all MPI calls that an application issues. The tool can be loaded into the application using the LD_PRELOAD mechanism. In this case, the usage of the tool becomes as easy as replacing the respective mpiexec command with a wrapper command called mustrun.

We distinguish two types of correctness checks: local correctness checks and non-local checks. Local checks only require information that is available on a single MPI process and hence don’t require any communication for their execution. As a result, MUST is able to execute local checks inside each application process, or more precisely inside the MUST MPI wrappers used to intercept all MPI calls. Using local checks, we can, e.g., detect whether a datatype that is used in a communication call is committed or whether parameters to MPI calls are out of range. Non-local correctness checks require information from more than one process. Datatype signature matching between sending and receiving communication calls is one such example. The implementation of non-local correctness checks requires additional communication and hence a separate communication mechanism that can forward information about MPI calls to other processes or extra resources. MUST uses the Generic Tool Infrastructure (GTI) [6] for this purpose. Currently MUST provides the following classes of correctness checks covering a wide spectrum of possible error cases:

  • Local:

    • Integer checks (e.g., restrictions on tags, counts, sizes, and offsets)

    • Integrity checks (e.g., Arrays allocated or communication buffer present)

    • MPI resource surveillance (e.g., use of requests, datatypes, reduce operations, groups, and communicators)

    • Resource leak checks

    • Communication buffer overlap checks

  • Non-local:

    • Collective verification (e.g., matching roots and compatible reduce operations)

    • Lost message detection

    • Message type matching (for both point-to-point and collective operations)

    • Deadlock detection

Previous work [4] includes extensive performance results and has shown the feasibility of this approach, including its scalability using an application study on up to 512 processes.

In its initial form, the basic output of MUST is an HTML table that follows the format of Marmot [7]. In Marmot checks had to be implemented for each MPI call, even for the same error conditions, leading to significant code duplication of any error reporting. MUST avoids this redundancy with the use of so-called argument IDs. Figure 2 shows a basic MUST report with an integer usage error. The check that detects the negative count argument in the MPI_Send call is mapped to many different calls and argument types. MUST uses the argument IDs to identify the argument number and name, which increases the detail in its output reports. Further, MUST uses the Stackwalker API of the Dyninst projectFootnote 1 to retrieve call stack information for each MPI call it intercepts.

Fig. 2
figure 2

Example MUST error report

3 Shortcoming of Current Error Views

While the initial MUST implementation provided useful information about violated checks, the output format was not optimal and omitted several key pieces of information a user requires to identify the broken code location and to fix it. These shortcomings were introduced because the initial output format was driven by the implementation of the tool and what it naturally collects, without taking the user’s needs into account. This is, unfortunately, common for many tools, which flood the user with raw data, but fail to provide some essential details. We illustrate two such problems in the following, using the examples of deadlock detection and problems with complex datatypes. We will first show (in this section) why the existing views are insufficient and (following in the next two sections) how we were able to work around it.

3.1 Example 1: Pinpointing Deadlocks

A key feature of MUST is its graph-based deadlock detection [8]. It creates a wait-for graph and then uses this graph to identify existing deadlock conditions. If such a condition is found, the tool provides the user with a list of processes that are in a deadlocked state as well as their wait-for dependencies that cause them to be deadlocked. This enables MUST to separate processes that cause the deadlock from processes that hang due to waiting for deadlocked processes directly or indirectly.

The graph based approach also has the additional advantage that we can use the graph itself to visualize the deadlock conditions and the wait-for dependencies to the user. As a result, MUST’s previous deadlock view provides:

  • A textual description of the deadlock situation,

  • A wait-for graph of the deadlocked processes, and

  • A source location list of the deadlock processes.

In the following we use the erroneous sequence of MPI calls in Fig. 1b as an example to illustrate MUST’s previous output. Figure 3a shows the wait-for graph (WFG) that MUST provides for this example. However, this graph along with the source location lists of the deadlocked processes alone is not sufficient to identify the root cause for this error. From our experience, a tool must provide answers to the following questions:

  1. 1.

    Which processes cause the deadlock?

  2. 2.

    What MPI calls are active on these processes?

  3. 3.

    Which control flow led to these active calls?

  4. 4.

    In the case of involved point-to-point operations, which other active communications exist?

Fig. 3
figure 3

Deadlock view components for the example in Fig. 1b. (a) Wait-for graph. (b) Message queue graph. (c) Parallel call stack graph. (d) Call stack graph decomposition of the message queue graph

While MUST’s previous output provides answers to the first two questions it does not provide information on the latter two. Also, the list of source locations is insufficient for deadlock reports that involve more than a few processes.

3.2 Example 2: Viewing Datatype Related Problems

The MPI standard imposes constraints for communication operations. Erroneous usage of MPI datatypes may collide with three of such constraints. In the following we sketch these three referring to version 2.2 of the MPI standard [1]:

  • For sending operations, the application may not modify the communication buffer, until the send completes.

  • For receiving operations, the application must not access any part of the communication buffer, until the receive completes.

  • The type signature of a communication must adhere to matching rules during the following three steps:

    1. 1.

      MPI types must match programming language types for reads from the application memory (except for the MPI type MPI_BYTE),

    2. 2.

      MPI types must match on receiver and sender sides during transport to receiver, and

    3. 3.

      MPI types must match programming language types for writes to the application memory (except for the MPI type MPI_BYTE).

In MUST we provide checks for overlapping communication buffers handling a sub-set of clashes with the first two constraints, and for type matching in communication which meets step two of the latter constraint. These checks handle any (derived) datatypes that communication calls may use. We provide no checks for memory manipulation done in application context. Instead, we focus on simultaneous MPI communications that break any of these constraints. If MUST detects such an error, it is crucial that it provides precise information on its source. While the simplest solution would be to provide memory addresses, this provides unsatisfactory details on where the error resides in a communication buffer and its associated MPI datatype. We currently use a path expression approach [5] to pin-point these error locations. An example for this path expression can be found in Sect. 5. While these expressions provide an exact position of the error location within a datatype signature, they require a deep understanding of their format, while losing information about the overall structure of the involved datatype(s).

4 Deadlock View in MUST

As the last section illustrated, MUST’s previous deadlock view lacked detail, especially for message mismatch situations, and scalability. To overcome this limitation, we propose a new, dedicated deadlock view that contains the following elements:

  • A textual summary,

  • A communicator overview,

  • The WFG with a legend,

  • A parallel call stack,

  • A graph representation of the current message queue, and

  • A decomposition of the message queue that uses a parallel call stack.

Our new output generator in MUST combines all of these elements in a single HTML page (for better readability, however, we present the individual elements in separate sub-figures). While the textual summary matches our previous outputs, we use the communicator overview to represent each communicator with an upper case letter. In the erroneous sequence of MPI calls in Fig. 1b, which we use as an example throughout this section, the application only uses MPI_COMM_WORLD, which we represent as comm A. If additional communicators are defined by the application, the communicator summary includes information on the MPI calls that created the communicator. The WFG (Fig. 3a) matches our previous outputs, except that we now use the communicator symbols to also present information on the communicators in use. We also add a legend to this graph as it may contain intermediate nodes to represent complex MPI semantics. Additionally, the new view shows the parallel call stack to provide insights for Question 3 (introduced in Sect. 3) and the last two graphs to provide information for Question 4, which we describe in the following.

Figure 3c shows MUST’s parallel call stack for our example. It helps to illustrate control flow decisions that lead to the deadlock condition. While it is challenging to represent information on the control flow of the individual processes in all details, this limited view provided by call stacks is in most cases sufficient. Additional static source analysis may reveal control flow relevant variables to enrich parallel call stack graphs with further information, as an extension [9] of the STAT [10] tool shows. Further, these graphs scale well with the number of application processes. For our purposes, we limit this call stack graph to only the application processes that are part of the deadlock in order to remove any unnecessary information and provided the most concise representation.

Question 4 addresses situations where point-to-point operations are involved in a deadlock. In this case the root-cause of the error may be a tag or communicator mismatch. In order to understand this situation, the application developer requires information about any active and meaningful point-to-point call, whether it is involved in the actual deadlock condition or not. MUST provides a message queue graph for this purpose. Since MUST detects which processes are part of the deadlock, while it also determines which processes are blocked in point-to-point calls, we can automatically reduce the full message queue graph to only present messages that:

  • Were started by a process that is part of the deadlock;

  • Have active send operations, which target a process that hangs in a receive operation or a completion that includes a non-blocking receive operation; or

  • Have active receive operations, which target a process that hangs in a send operation or a completion that includes a non-blocking send operation.

Using these conditions, we can condense our output to only present relevant point-to-point operations. Figure 3b shows this graph for our example. The graph includes an arc from node 0 (which represents process 0) to node 1 to represent the MPI_Isend call that was issued on process 0 before the deadlock manifested. The other arc represents the MPI_Isend operation that was started by process 1.

MUST’s condensed message queue graph allows application developers to determine whether a potential mismatch exists. In our example, Fig. 3a shows that process 0 waits for a matching send operation of process 1, which uses the tag 200, while Fig. 3b shows that a send operation exits, but with tag 100. If a mismatch exists, the user needs to be able to identify the call and control flow origin of the mismatched operation. We use a parallel call stack to represent all MPI operations that started any operation within MUST’s relevant message queue graph. This identifies the call stacks of these operations, but as each operation may use multiple targets, tags, and communicators, we need to highlight which individual parts of the message queue graph result from each leaf of the call stack graph. As a result, we decompose the message queue graph into sub-graphs that represent the components that each MPI operation creates. Figure 3d shows this call-graph-based decomposition for our example. This graph allows the tool user to determine which message might be mismatched, while it contains information about its source location along with limited control flow information.

5 Type Tree View

In this section we will describe a new, more expressive graphical view for datatype related errors.

Listing 1 Example for a communication buffer overlap

  double velocity[3]; double spin[3]; char charge;

  double radius; double mass; };

struct particle cloud[112];

MPI_Datatype structtype, indexedtype;

int blocklens[7] = {3, 3, 3, 3, 1, 1, 1};

MPI_Datatype types[7] = {MPI_DOUBLE, MPI_INT, MPI_DOUBLE,

    MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE, MPI_DOUBLE};

// displs derived from c-struct by MPI_Get_address()

MPI_Aint displs[7] = {0, 24, 40, 64, 88, 96, 104, 112};

MPI_Type_struct (7, blocklens, displs, types,

    &structtype);

int array_of_blocklens[8] = {3, 2, 1, 2, 4, 8, 1, 3};

int array_of_displs[8] = {3, 13, 23, 34, 44, 55, 65, 76};

MPI_Type_indexed (8, array_of_blocklens, array_of_displs,

    structtype, &indexedtype);

MPI_Type_commit(&indexedtype);

MPI_Sendrecv(cloud, 1, indexedtype, 0, 42, cloud + 25, 1,

    indexedtype, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

The code example in Listing 1 sketches a particle simulation where information about a subset of the particles needs to be transferred to a neighbor process. In the application a C struct holds the information about a particle. The set of particles is organized in an array of this struct. Using derived datatypes, MPI enables us to select the subset from the array and send it in a single contiguous operation to the neighbor. To create the fitting datatype, the example uses at first the MPI_Type_struct constructor to represent the C struct and then an MPI_Type_indexed constructor to select parts of an array of this struct. While the first constructor is correct with respect to type matching, the second one causes a communication buffer overlap when the example issues the MPI_Sendrecv call (performed as local operation in this simplified example). MUST’s current path expressions calculate to [0](INDEXED)[5][4](STRUCT)[0][0](DOUBLE) for the sending part and [0](INDEXED)[3][0](STRUCT)[0][0](DOUBLE) for the receiving part of the MPI_Sendrecv call. Figure 4a sketches the overlap within the array (called cloud) of the C structure, i.e., the elements that the MPI_Type_indexed constructor selects from the array. While this representation highlights the overlap, this display loses information about the internal datatype structure. To combine the expressiveness of the path expression and the overview of such a memory map, we propose an overlap graph. This graph visualizes the two path expressions that cause the overlap along with a sketched structure of the datatypes in use. Figure 4b shows this graph for the example in Listing 1. We represent the path expressions of the overlap in red in this graph. For overlaps the trees of the colliding communication operations will either join at a node of the same basic MPI type and absolute offset, as in our example, or we use a compound node if the overlap occurs for two different types/offsets. We join further tree nodes if they compare to equal sub-types, as for the MPI_Type_struct in our example. We compute this by recursing the type trees from the leaf towards its root.

Fig. 4
figure 4

Overlap view for the example in Listing 1. (a) Array indices of send/receive marked blue/green. (b) Overlap graph

An example for a type mismatch can be derived from the above example by mixing up the struct entries for charge and radius at one of the neighbor processes. The current path expression for this situation calculates to [0](INDEXED)[0][0](STRUCT)[4][0](CHAR) and [0](INDEXED) [0][0](STRUCT)[4][0](DOUBLE), indicating that an MPI_CHAR mismatches with an MPI_DOUBLE. To display the mismatch we create a tree for both involved datatypes where we skip nodes apart from the (red) error path while we keep a few basic MPI types near the mismatch position to have a more detailed context of the mismatch. To derive a smaller graph we merge similar nodes of both trees. Figure 5 provides the resulting view for the sketched mismatch situation.

Fig. 5
figure 5

Type mismatch view for a confusion in the type definition

6 Related Work

This work directly relates to other runtime error detection approaches for MPI applications, which include Marmot [2], Umpire [3], ISP [11], MPI-Check [12], and Intel’s approach [13]. While MUST as successor of both Marmot and Umpire identifies deadlocks with a graph-based approach, the MPI-Check tool and Intel’s approach use a timeout-based deadlock detection. As a result, these tools only provide a list of all active MPI calls when the presence of a deadlock is suspected. ISP runs a replay based investigation of all possible interleavings of an MPI application. As a result, this tool can detect some deadlocks that MUST would not detect in a certain application run. ISP’s deadlock output includes a trace of all MPI calls that each process issued, as well as their matching decisions. While very detailed, this output will get overly complex, especially for longer application runs with more than a few processes. While our output contains no complete history of all issued MPI calls, we provide the user with a more scalable deadlock view that condenses relevant history information with the use of a reduced message queue graph.

The STAT [10] tool and debuggers like DDT and Totalview use parallel call stack graphs and/or message queue graphs. Debuggers use interfaces to the MPI library [14] to retrieve message queue information, whereas MUST tracks all MPI calls during the whole application run. Existing integrations of runtime error detection tools with debuggers, e.g. DDT and Marmot [15], could be extended to provide debuggers with information on which processes cause a deadlock. Debuggers could than condense message queue graphs as in our approach. Also, the representation of derived datatypes with trees is based on ideas of the flattening on the fly technique [16].

7 Conclusion

We present the MUST runtime error detection tool for MPI applications along with extensions of its error reports. Our previous output for deadlock situations failed to capture information on active point-to-point messages, which is crucial in the detection of message mismatch situations. We use message queue graphs to present these active operations. MUST’s graph-based deadlock detection yields a set of processes that cause the deadlock, which allows us to condense parallel call stacks and message queues to only include relevant information. In order to add call location information to the message queue graph representation, we propose an extended parallel call stack graph that includes a decomposition of the message queue graphs in their leaves. While these representations allow us to present relevant information for the removal of deadlocks at moderate scale, we still need to investigate their practicability for thousands or more processes. While our approach allows us to visualize deadlocks that only involve a few processes, it may fail for complex deadlocks that involve all or most application processes. This especially affects the size of the WFG and the message queue graphs.

Our second error view provides a detailed output for errors that involve derived datatypes. This includes communication buffer overlaps, and type mismatches between point-to-point or collective MPI operations. The removal of these errors requires a precise understanding of which part in a derived datatype causes the error. As a result, we use a narrowed type tree representation that highlights the position in the datatype that causes the error, while it sketches the structure of the involved datatypes at the same time.