
1 Introduction

Interconnection networks used in today’s supercomputers play a vital role in the overall performance, efficiency, and scalability of scientific simulation and modeling. HPC applications achieve a paltry 5–15% of a machine’s peak performance [1,2,3] on modern microprocessor-based supercomputers. A significant fraction of the loss comes from inter-node data movement.

When applications fail to make effective use of compute resources at scale, application developers resort to profilers to understand bottlenecks. There are plenty of state-of-the-art and commercial tools for CPU profiling [4,5,6,7,8,9,10] that capture metrics such as CPU cycles, cache misses, and branch mispredictions, and associate the measurements back to the application source code or application data objects [11, 12].

Domain scientists can only reason about performance when the measurements are attributed back to the application source code. Unfortunately, network performance problems are a “black box” from an application developer’s viewpoint. CPU-side profilers typically quantify the amount of delay spent waiting for a network communication but offer little insight into why a given network transaction was slow. Even the most sophisticated network performance analysis techniques [13,14,15,16] only reason about communication endpoints; they do not capture measurements from the under-the-hood workings of the autonomous interconnection hardware, which includes network interface cards (NICs), bridges, and switches.

Figure 2 in Appendix A shows the execution profile of NWChem [17]—a US Department of Energy flagship computational chemistry code—running with 1024 MPI ranks on the Dragonfly [18, 19] interconnection network of the NERSC Edison [20] supercomputer. The figure shows a hot path in the CPU profile taken with HPCToolkit [5], a state-of-the-art CPU PMU-based profiler. The figure shows a deep call stack with various layers of host-side code leading to the vendor-provided networking API dmapp_lock_acquire, which acquires a lock on a remote node. The execution spends a significant fraction (26%) of its time waiting in this networking API, but the profile offers no insight into the cause of this wait. This leaves an application developer with many unanswered questions:

  1. Is there load imbalance in the code? Our conversations with the NWChem application developers eliminated load imbalance and contention for a single lock—the workload is dynamically balanced.

  2. Is the network lock implementation suboptimal? Our conversation with Cray Inc. eliminated this possibility—the network lock is a local-spinning MCS [21] lock.

  3. Is the communication network performing poorly?

  4. Is there interference from another job that affected this execution?

  5. Is the observed, seemingly network-related problem indeed a network bandwidth problem, or is it caused by delays in the local NICs injecting messages?

  6. If locking is frequent, is the lock-release message getting delayed in the interconnection network? If so, can we use a separate high-priority virtual channel for such network communication that appears on the critical path?

Clearly, traditional CPU profilers cannot answer these questions since they cannot measure what happens in the interconnection network hardware components. Once a network-related transaction leaves the CPU, even in a simple network, the following events happen. The message gets enqueued as a command to the NIC. The NIC notices the command at some later point, which introduces an arbitrary delay. The NIC may then initiate a DMA transfer from the local DRAM if the command is a send/put. It packetizes a put/send command into multiple MTU-sized packets and injects them one by one. The NIC may then wait for the acknowledgement of every packet (as is the case in the Gen-Z [22] protocol). Different packets may take different paths in the network based on the network routing heuristics. At each router hop, a packet may be subject to different policies and arbitration delays before being forwarded to an output port. At the destination NIC, the packets may arrive out of order (which can happen in Gen-Z [22]). The destination NIC may delay injecting packet-level acknowledgments. The destination CPU may get notified some time later, after the entire message is reassembled, and may introduce further delays before a message-level acknowledgment is generated. Finally, the acknowledgment message is subject to the same set of uncertainties on its return journey.

With myriad autonomous, unsynchronized components, it is virtually impossible to track how a message gets affected in its round trip from one host to another. Prior network performance analysis efforts have conducted event-driven simulations of the network with characteristic workloads to design better networks, paying little attention to delivering developer insights. Production hardware has offered simple counters in network routers to collect aggregate runtime data, which provide coarse-grained statistics that help system administrators spot anomalous or overloaded hardware components; these techniques are tedious, vendor specific, and often not accessible to CPU profilers.

No prior art has addressed the challenge of tracking an individual message from its source location through every hop in every hardware component in an interconnection network and associating the observed performance metrics with the source and target host code. This level of detailed measurement, in conjunction with full CPU-side context-sensitive profiling, is the basis for delivering rich, end-to-end application insights. Only such detailed profiling and tracing can answer the questions we raised previously in the NWChem example. Evidently, tracking every message and every packet in the network with this level of detailed statistics is a recipe for a performance-data deluge and would bring the network to a grinding halt merely collecting the measurement data. Statistical sampling comes to our rescue: we collect detailed data, but only for a sparse set of samples.

Our strategy is to “mark” network transactions to be monitored on a sampling basis at the origin (CPU) and record statistics of such marked messages at every hop along their journey in an interconnection network. By retaining both CPU-side profiles and network profiles for a sparse set of samples, we are able to observe what happens to network transactions and elevate the measurements to application source code in a manner that sheds light on the causes of network-related problems for the application developer. The result is that the application developer, with a full understanding of the problem, may:

  1. Choose to refactor the source code to better utilize the network, or

  2. Provision more network resources to reduce network-related bottlenecks that are caused by her application, or

  3. Conclusively infer that the problem was caused not by the application but by interference from another job, in which case the solution lies in better network provisioning or job scheduling, or

  4. Pinpoint that the problem is not in the network provisioning but in the networking algorithms, an anomalous router, or the local network interface (NIC) software or hardware.

2 Related Work

There is a rich literature on profiling and tracing CPU executions. Profiling provides aggregated metrics, whereas tracing captures the time-varying behavior of executions. Profiling and tracing come in various flavors and granularities. It is common to instrument source code or binaries, manually or via a tool, at function, loop, or basic-block granularity. Hardware event-based sampling is an orthogonal method in which a CPU PMU counter overflow triggers an interrupt that a profiler captures and attributes to the application binary and, in turn, to the source code [4, 5, 23]. None of these techniques measure data from interconnection network hardware.

MPI profilers [24,25,26] capture communication metrics at endpoints: they measure the time spent in networking-related tasks by wrapping or intercepting each MPI library function. Advanced methods [13] can replay execution traces to pinpoint the root causes of some performance bugs. However, none of these methods obtain measurements from the networking hardware. As a result, although one might observe anomalous communication delays, there exists little evidence to isolate problems to a host-side NIC, a router, the destination NIC, or the destination CPU.

Networking hardware design is often performed via low-level event-driven simulators [27,28,29]. These simulators are driven by predefined communication patterns to assess the strength of hardware designs or algorithms. A low-level simulator can simulate only a small amount (often milliseconds to a second) of real execution. High-level simulators [30,31,32] capture communication traces from real executions and replay them to drive coarse-grained network simulators. Both high-level and low-level network simulators treat the CPU execution as a black box and focus only on the networking aspect, and hence are incapable of offering insights to an application developer at the source-code level.

There is a rich literature on network profilers for Ethernet [33,34,35]. We are unaware of any tool that can a) attribute network profiles to application source code, or b) perform path-synchronous sampling to capture a specific network transaction (e.g., the traversal of a specific packet) throughout its journey. Network-side monitoring schemes such as sFlow [35] and NetFlow [34] capture the source and destination of a packet flowing through a component. They, however, lack the full path information of a sampled transaction, and hence the hop-by-hop details of any specific packet are unavailable. sFlow can aggregate the data from many components over long periods of time and filter the traffic originating from the same source and going to the same destination to reconstruct an “average” behavior and a “typical” path; but such schemes cannot attribute the observed behavior to the application source code because, over time, many source-code locations can contribute to the same “flow”. Moreover, samples from different components lack temporal correlation, so one can observe only the aggregate behavior of traffic and cannot pinpoint a specific anomaly to its causes; aggregate metrics handicap the ability to pinpoint the cause of a transient anomalous behavior.

3 Methodology

A key requirement for attributing network behavior to application source code is to identify what happens to a transaction initiated by a line of source code (or even an assembly instruction) throughout its journey through the network. Such hop-by-hop tracking retains temporal correlation among performance metrics generated by unsynchronized components. We track hop-by-hop metrics of a small subset of packets as they are forwarded across a network. This is done on a sampling basis because observing every transaction is infeasible from both a space- and time-overhead viewpoint. In other words, one in N packets originating from a source is chosen to be tracked throughout its journey. The choice of N can be arbitrary or more intelligent. Each endpoint may choose the same or a different value of N, and endpoints need not coordinate when they track a packet. Sampling ensures that any event that is statistically significant will be observed with a frequency proportional to its occurrence. We propose the following extensions:

Protocol Extension: Every packet of the protocol carries a special Performance Monitoring (PM) tag. The PM tag may be present at a designated offset in the packet header to make it quick to inspect in hardware. We call a packet whose PM tag is enabled a “marked” packet. We have already incorporated a PM tag in the Gen-Z protocol [22] to enable performance tools.

Hardware Extensions:

  1. The NIC exposes a special “track me” (TM) tag to the software. The software may assert the TM bit in a command it issues to the local NIC, instructing the NIC to track the command.

  2. The NIC propagates the TM bit from a CPU-issued command into one or more packets by setting the PM bit in the packet(s) it injects into the network on behalf of the command (a sketch of this marking path follows the list).

  3. Every switch inspects the PM tag of each packet it routes. If the PM tag is enabled in an incoming packet, the switch logs a performance-data record into its local buffer (typically an SRAM). The PM tag is propagated through the switch from the incoming packet to the corresponding outgoing packet.
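To make the software/NIC marking path concrete, the following minimal Python sketch models these two extensions; the sampling period, the dictionary-based command and packet formats, and nic.submit are our own illustrative assumptions, not a vendor API.

```python
import itertools
import random

SAMPLE_PERIOD_N = 100            # track roughly one in N commands (our assumption)
_next_cid = itertools.count(1)   # locally unique command ids (CID)

def issue_command(nic, payload, dst):
    # Profiler side: on a sampling basis, assert the TM ("track me") flag
    # in the command handed to the local NIC.
    tm = random.randrange(SAMPLE_PERIOD_N) == 0
    cmd = {"cid": next(_next_cid), "dst": dst, "payload": payload, "tm": tm}
    nic.submit(cmd)              # hypothetical NIC command queue
    return cmd

def nic_packetize(cmd, mtu=256):
    # NIC side: split the command into MTU-sized packets; if the command
    # carries TM, set the PM bit in exactly one (randomly chosen) packet.
    chunks = [cmd["payload"][i:i + mtu]
              for i in range(0, len(cmd["payload"]), mtu)]
    marked = random.randrange(len(chunks)) if (cmd["tm"] and chunks) else -1
    return [{"dst": cmd["dst"], "pm": (i == marked), "data": c}
            for i, c in enumerate(chunks)]
```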

The fact that a marked packet’s information is logged at each hop allows us to achieve path-synchronous sampling. A key piece of information logged at each hop is the unique identity of the next hop of the packet. The next-hop information allows us, in a post-mortem pass, to reconstruct the full path along the journey of a marked packet. In systems with a request-response protocol (e.g., Gen-Z), the PM tag is retained from the request to the response so that the journey is tracked in both directions. To accomplish this, the endpoint hardware (e.g., the NIC) may be modified to propagate the PM tag from request to response. We assume that every network packet contains at least its source identifier (SID), destination identifier (DID), a tag (which need not be unique), and the PM tag. The log in each component contains at least the following information (a sketch of such a record follows the list):

  1. The arrival time of the packet or command (component-local time).

  2. The departure time of the packet or command (component-local time).

  3. The identity of the next hop (outgoing port) of the packet/command.

  4. (Optional) In addition to the three necessary fields above, a component may include any additional data: for example, anomalous conditions at the time of routing the marked packet (e.g., running out of credits when transmitting this packet), the position of the packet in a router’s input queue on arrival, conflicts during router arbitration, etc.
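To make the record format concrete, here is a minimal Python sketch of the per-hop record a switch might append to its local SRAM buffer when it routes a marked packet; the field names, the on_route hook, and the switch attributes (sram_log, forward) are our own illustrations, not an actual hardware interface.

```python
from dataclasses import dataclass, field

@dataclass
class HopRecord:
    # Mandatory fields; timestamps are in the component's local clock domain.
    src_id: int        # SID of the marked packet
    dst_id: int        # DID of the marked packet
    pkt_tag: int       # packet tag (PKID); need not be globally unique
    next_hop: int      # identity of the outgoing port / next component
    arrival_ts: int    # local arrival time
    departure_ts: int  # local departure time
    # Optional, component-specific extras, e.g. {"out_of_credit": True}.
    extra: dict = field(default_factory=dict)

def on_route(switch, pkt, out_port, t_in, t_out):
    # Switch datapath hook: log only packets whose PM tag is set, and let
    # the PM tag travel with the packet to the outgoing port.
    if pkt.pm:
        switch.sram_log.append(
            HopRecord(pkt.src, pkt.dst, pkt.tag, out_port, t_in, t_out))
    switch.forward(pkt, out_port)
```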

Fig. 1. An interconnection network with four switches, two endpoints, and their respective NICs. A message sent from A to B traverses the path A\(\rightarrow \)NICa \(\rightarrow \)S1 \(\rightarrow \)S2 \(\rightarrow \)S4 \(\rightarrow \)NICb \(\rightarrow \)B. The profiler captures CPU-side contexts, marks the message to be tracked in the network, and logs data to a collection server. NICs and switches that the marked packet traverses also log their data to the collection server.

Figure 1 depicts the workflow when endpoint A wants to send a message to endpoint B and the packet follows the route A\(\rightarrow \)NICa \(\rightarrow \)S1 \(\rightarrow \)S2 \(\rightarrow \)S4 \(\rightarrow \)NICb \(\rightarrow \)B in an example network:

  1. The software (the profiler running on the source CPU, endpoint A) chooses, on a sampling basis, a transaction to be monitored. The choice can be random sampling or more intelligent, if desired.

  2. The software captures its CPU calling context (CTXT1) and creates a locally unique command id (CID1) representing the network command.

  3. The software (at time T1) issues the network command to NICa, passing the unique id (CID1) and setting the TM flag.

  4. The software logs the tuple \(\langle \)CTXT1, CID1, T1, A, B\(\rangle \).

  5. NICa at a later point (time T2) inspects the command, generates some M network packets for the command, and, observing the TM flag, enables the PM tag in one (randomly chosen or otherwise) of the M network packets.

  6. NICa injects the PM-marked packet at time T3 to the switch S1. Let the id of the marked packet be PKID, and let the last packet corresponding to CID1 leave at time T4. NICa logs the local information tuple \(\langle \)CID1, A, B, PKID, S1, T2, T3, T4 \(\rangle \).

  7. The switch S1 notices the marked packet with PKID at time T5, forwards it to switch S2 at time T6, and logs the information tuple \(\langle \)A, B, PKID, S2, T5, T6 \(\rangle \).

  8. The switch S2 notices the marked packet with PKID at time T7, forwards it to switch S4 at time T8, and logs the information tuple \(\langle \)A, B, PKID, S4, T7, T8 \(\rangle \).

  9. The switch S4 notices the marked packet with PKID at time T9, forwards it to NICb at time T10, and logs the information tuple \(\langle \)A, B, PKID, NICb, T9, T10 \(\rangle \).

  10. NICb at time T11 assembles all the packets, creates an entry for endpoint B, and produces the log entry \(\langle \)A, B, PKID, CID2, B, T10, T11, TM = 1\(\rangle \).

  11. The CPU at endpoint B at time T12, in calling context CTXT2, receives the full message and, on noticing the TM flag, logs the tuple \(\langle \)CTXT2, CID2, T12, A, B\(\rangle \).

For brevity, we do not discuss responses, acknowledgments, or dropped packets. In unreliable networks, when a marked packet is dropped, no further logs will be available—a clear indication of a dropped packet. We also do not discuss what additional information a component may log; there can be component-specific fields, which, for example, can include link-level credits.

Collection server: Hardware has a limited local buffer to log performance data. Hence, we use management software running on each hardware component to periodically drain the collected logs to a centralized server. The SRAM buffer on the hardware acts as a circular buffer. All modern HPC networking components have additional management hardware with Ethernet connections of \(\sim \)1 Gbps. The management software on each component is capable of NFS-mounting a remote distributed server and dumping logs from local memory to a unique file on the remote server.

Post-mortem analysis: The collection server contains the logs collected from all the components that marked packets traverse. A post-mortem analysis of the logs in the collection server allows a software tool to reconstruct the complete path traversed by each marked packet initiated at a source and to associate the data with the application source code in its calling context. In the previous example, starting from the CPU-side log of endpoint A, we can go through the following steps to reconstruct the path:

  1. Endpoint A’s log entry \(\langle \)CTXT1, CID1, T1, A, B\(\rangle \) tells us that at source-code context CTXT1, a command CID1 was issued to target B.

  2. Sifting through NICa’s logs for CID1 shows the entry \(\langle \)CID1, A, B, PKID, S1, T2, T3, T4 \(\rangle \). T2−T1 is the in-node delay. The command took a total of T4−T2 time to get injected. The marked packet has the tag PKID and was injected at time T3 toward switch S1.

  3. Sifting through switch S1’s logs for \(\langle \)A, B, PKID, T3 \(\,\pm \,\varDelta \rangle \) shows a record \(\langle \)A, B, PKID, S2, T5, T6 \(\rangle \). The packet’s delay at hop S1 is T6−T5. It was forwarded to S2.

  4. Sifting through switch S2’s logs for \(\langle \)A, B, PKID, T6 \(\,\pm \,\varDelta \rangle \) shows a record \(\langle \)A, B, PKID, S4, T7, T8 \(\rangle \). The packet’s delay at hop S2 is T8−T7. It was forwarded to S4.

  5. Sifting through switch S4’s logs for \(\langle \)A, B, PKID, T8 \(\,\pm \,\varDelta \rangle \) shows a record \(\langle \)A, B, PKID, NICb, T9, T10 \(\rangle \). The packet’s delay at hop S4 is T10−T9. It was forwarded to the destination NICb.

  6. Sifting through NICb’s logs for \(\langle \)A, B, PKID, T10 \(\,\pm \,\varDelta \rangle \) shows the entry \(\langle \)A, B, PKID, CID2, B, T10, T11, TM = 1\(\rangle \). T11−T10 is the delay at NICb. The message was delivered to the destination B.

  7. Sifting through endpoint B’s logs for \(\langle \)A, B, CID2, T11 \(\,\pm \,\varDelta \rangle \) shows the entry \(\langle \)CTXT2, CID2, T12, A, B\(\rangle \). CTXT2 is the receiving application calling context. The packet’s journey ends here.

The full calling context with source-code attribution at both endpoints, along with hop-by-hop metrics for the traversal A(CTXT1)\(\rightarrow \)NICa \(\rightarrow \)S1 \(\rightarrow \)S2 \(\rightarrow \)S4 \(\rightarrow \)NICb \(\rightarrow \)B(CTXT2), including the in-NIC delays, is easily reconstructed. Since each component logs data into its local buffer, there is no need for concurrency control. There is also no need for perfectly synchronized clocks across the system, but we expect the components’ clocks to be kept close enough via standard protocols such as NTP.
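This stitching is mechanical and easy to automate. The following minimal Python sketch captures its essence; the record field names, the log layout, and the find helper are our own illustrations, and delta corresponds to the \(\pm \,\varDelta \) window used above.

```python
def find(records, t_ref=None, delta=None, **keys):
    # First record whose fields match `keys`; when t_ref is given, the record's
    # arrival time must also fall within the +/- delta window described above.
    for r in records:
        if all(r.get(k) == v for k, v in keys.items()) and \
           (t_ref is None or abs(r["t_in"] - t_ref) <= delta):
            return r
    return None

def reconstruct_path(cpu_log, nic_logs, switch_logs, delta):
    # cpu_log: endpoint A's record, e.g. {"ctxt": ..., "cid": ..., "t": ..., "src_nic": ...}
    # nic_logs / switch_logs: {component_id: [record, ...]} drained from each component.
    path = [("CPU", cpu_log["ctxt"])]

    nic = find(nic_logs[cpu_log["src_nic"]], cid=cpu_log["cid"])
    path.append(("NIC", cpu_log["src_nic"],
                 nic["t_inject"] - cpu_log["t"]))          # in-node delay

    hop, t, pkid = nic["next_hop"], nic["t_inject"], nic["pkid"]
    while hop in switch_logs:                              # follow next-hop pointers
        rec = find(switch_logs[hop], t_ref=t, delta=delta, pkid=pkid)
        if rec is None:                                    # trail lost, e.g. dropped packet
            break
        path.append(("SWITCH", hop, rec["t_out"] - rec["t_in"]))   # per-hop delay
        hop, t = rec["next_hop"], rec["t_out"]

    path.append(("DEST", hop))     # destination NIC / endpoint reached
    return path
```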

Alternative uses: Our approach samples a randomly chosen transaction in a window of N transactions. Alternatively, we may sample exactly every N\(^{th}\) transaction; in fact, a precise, predetermined transaction may be sampled, if desired. Instead of the software at the source of a transaction enabling the PM tag, any component may choose to enable the PM tag and capture a partial path. Although we suggested unsynchronized sampling from endpoints, we do not preclude sampling in a synchronized manner, which is useful for debugging purposes. Our approach associates metrics with the source-code location that initiated a transaction; we do not preclude associating metrics with some other place in the source code, e.g., a network wait event associated with a non-blocking transaction.

4 Implementation

We implemented our network performance monitoring prototype using the SST/Macro event-driven network simulator framework [32] and open-sourced it [36]. SST/Macro models hardware components such as CPUs, memory, NICs, switches, and crossbars. The networking components of SST/Macro are mature, with various configurable network topologies, bandwidths, latencies, and packet-based routing and arbitration algorithms, making it ideally suited for our evaluation. SST/Macro is easy to extend with additional hardware and software components, which was necessary for our extensions. SST/Macro is driven by “skeleton” C++ code that mimics an HPC C++ application written using MPI and needs trivial or no modifications to work with SST/Macro.

We enhanced SST/Macro in the following ways. We introduced a new flag (the PM bit) in the SST/Macro packet format. We extended the SST/Macro NIC and switch hardware components with the additional capability to log “marked” packets to a bounded SRAM buffer; we chose a bounded buffer of 2 KB in each router and NIC. We introduced a new hardware subcomponent, the “drainer”, in NICs and routers, which reads the performance data accumulated in the local bounded SRAM buffer and transfers it to on-component management software. The management software NFS-mounts a file on the remote data collection server and drains the incoming performance logs to the server.
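As a rough illustration of the drainer’s data path (a sketch under our own assumptions, not SST/Macro’s actual component API), the bounded SRAM log behaves like a circular buffer that overwrites the oldest records when full, and the drainer periodically hands its contents to the management software, which appends them to the NFS-mounted file on the collection server:

```python
from collections import deque

class SramLog:
    # Bounded circular log: the oldest records are overwritten when full.
    def __init__(self, capacity_bytes=2048, record_bytes=32):
        self.buf = deque(maxlen=capacity_bytes // record_bytes)

    def append(self, record):
        self.buf.append(record)

    def drain(self):
        records = list(self.buf)
        self.buf.clear()
        return records

def drainer_tick(sram_log, nfs_file_path):
    # Called periodically by the on-component management software: move the
    # accumulated records into the NFS-mounted file on the collection server.
    with open(nfs_file_path, "a") as f:
        for rec in sram_log.drain():
            f.write(repr(rec) + "\n")
```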

Additionally, we implemented extensions to the NIC-software interface to express the ability to track a message. We extended the NIC with the ability to mark one out of N packets with the PM bit if the command issued from the CPU carried the TM flag, and to append its log record to a local SRAM buffer.

We drive the profiling with a software profiler in SST that uses random sampling to determine if a message needs to be monitored. If so, it sets a special TM tag when it issues a command to its NIC. Also, the CPU profiler collects the calling context and logs the CPU metrics about the message in a per-CPU log file.

The post-mortem analysis inspects the log files to reconstruct the path taken by each marked packet from each endpoint and associates performance metrics with each hop on the path, as described in the previous section. The output of our post-mortem analysis is a set of files containing the path information of all the marked messages. The path information also contains the performance metrics attributed to each hop along the path.

To visualize how the application behaves, we generate a heatmap and a set of stacked bar graphs from the performance metrics using the Plotly graphing library [37]. Figure 3a in Appendix B shows the heatmap for an example NCAST program. The heatmap shows the total time taken by each marked packet to travel from the source CPU to the destination CPU. The points on the x-axis correspond to the times at which messages were initiated, and the points on the y-axis correspond to the processes that sent the messages. A point on the heatmap that is darker than other points signifies that the message took relatively longer to travel from the source to the destination. In addition to the heatmap, we also generate a set of stacked bar graphs, one for each process that initiated a message in the program. Figure 3b shows a bar graph for process 97 in the NCAST program. Each bar represents the cumulative time spent by a message in the network components along its path, and each stack in a bar represents the time spent at one network component. A large stack in a bar shows that the message was stuck in that component for a long time. To summarize, we can use the heatmap to identify which messages were delayed, and then use the stacked bar graph corresponding to the process that initiated a delayed message to identify the network component that caused the delay.
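For reference, figures of this kind can be generated with a few lines of Plotly; the sketch below is illustrative only (the data shaping and file names are our own) and is not the exact plotting code behind Fig. 3:

```python
import plotly.graph_objects as go

def make_heatmap(time_bins, ranks, latency, out="heatmap.html"):
    # latency[i][j]: end-to-end time of the marked message that rank ranks[i]
    # initiated around time time_bins[j]; darker cells mean slower messages.
    fig = go.Figure(go.Heatmap(x=time_bins, y=ranks, z=latency,
                               colorscale="Greys"))
    fig.update_layout(xaxis_title="message initiation time",
                      yaxis_title="sending process (MPI rank)")
    fig.write_html(out)

def make_stacked_bars(per_hop_delays, out="rank_bars.html"):
    # per_hop_delays: {component_name: [delay of message 0, message 1, ...]};
    # each bar is one message, each stacked segment the time at one component.
    n_msgs = len(next(iter(per_hop_delays.values())))
    fig = go.Figure([go.Bar(name=comp, x=list(range(n_msgs)), y=delays)
                     for comp, delays in per_hop_delays.items()])
    fig.update_layout(barmode="stack", xaxis_title="message",
                      yaxis_title="time spent per hop")
    fig.write_html(out)
```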

5 Evaluation

We evaluated our prototype implementation to answer the following questions: (1) how effective is our prototype in finding performance bottlenecks in the network due to the application? (2) does our prototype monitor network traffic with low overhead? All our experiments were run on a four-socket, 15-core Intel Xeon E7-4890 machine clocked at 2.8 GHz with 1 TB of DRAM. Our setup simulated the NERSC Edison [20] system with the Dragonfly [18, 19] topology containing 5586 compute nodes.

Effectiveness: To evaluate the effectiveness of our prototype, we executed an MPI skeleton program with 4096 MPI ranks. In the skeleton program—NCAST—a single MPI process (rank 42) is bombarded with multiple large (4 MB) messages from all the other MPI processes in the network. Because a large number of messages are sent to a single node, the NIC at the destination CPU becomes a bottleneck. Also, since all packets flow through a single network switch before reaching the destination, the switch at the last hop becomes congested. Our goal is to use our prototype implementation to precisely identify the network component that is the bottleneck in the NCAST program.
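For reference, the communication pattern of NCAST is roughly the following. The actual skeleton driving SST/Macro is C++; this mpi4py rendering is only an illustrative sketch, with the target rank and message size taken from the experiment described above.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

TARGET = 42                    # the single rank being bombarded
MSG_BYTES = 4 * 1024 * 1024    # 4 MB messages, as in the experiment

if rank == TARGET:
    buf = np.empty(MSG_BYTES, dtype=np.uint8)
    for _ in range(comm.Get_size() - 1):
        comm.Recv(buf, source=MPI.ANY_SOURCE)   # one large message per peer
else:
    msg = np.zeros(MSG_BYTES, dtype=np.uint8)
    comm.Send(msg, dest=TARGET)                 # everyone targets rank 42
```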

Figure 3 in Appendix B shows the graphs generated by our prototype after executing the NCAST program. The heatmap in Fig. 3a in Appendix B reveals a surprising and non-obvious performance problem—the messages are all serialized; the MPI ranks are sending messages one after another, resulting in the diagonal in the heatmap. Samples from all CPUs except CPU 42 are sparse, and the CPU 42 samples show that it is continuously sending messages to other processes, which is reflected in the thick horizontal line in the heatmap. The reason for the serialization is the large message size: for large messages, each MPI process sends a short notification message to the target (rank 42), and the target fetches the large messages from the sources one by one. The concurrency gets completely destroyed—a subtle anomaly invisible in the CPU-only profiles but distinctly visible in the full network telemetry.

On the diagonal, we can see that the points above CPU 80 appear darker, which means that those messages took relatively longer than the earlier messages. This shows that messages sent later are getting delayed either at the destination or at a network switch. Figure 3b in Appendix B shows the stacked bar graph of CPU 97. The large stack in the bar graph represents the time spent at switch 21. Switch 21 directly connects to node 42, which is receiving messages from all other nodes; hence the stack is large, since all the packets are queued at switch 21. We observe a similar pattern in the bar graphs corresponding to all other CPUs after CPU 80. This shows that all the messages are queued up at switch 21, which has become choked.

Efficiency: We evaluate the efficiency of our network performance monitoring scheme by measuring the simulation and wall-clock time on three MPI skeletons: NCAST, broadcast, and a multi-app skeleton. We execute each application five times and report the geometric mean of the overhead. The skeletons were designed such that their simulation time was at least one second. We tracked one in every hundred NIC commands at each endpoint. Our measurements showed that the hardware extensions added a negligible 0.16% mean overhead to the simulation time. The wall-clock time for simulation marginally increased (by 4.8%) over the original execution without our extensions to SST/Macro. The average size of the log files generated for the three applications was 61 MB.

6 Conclusions

Application developers understand performance better when measurements are attributed back to the source code. However, it is hard to attribute performance measurement data from the myriad autonomous, asynchronously operating hardware components in an HPC system back to the application source code. Traditional profilers have either focused only on CPU-side hardware measurements for source-code attribution or focused on network-side hardware measurements without source-code attribution.

We developed a protocol extension to track the flow of packets and collect hardware performance data in the emerging memory-semantic communication protocol—Gen-Z. We enhanced the router and NIC hardware and management software with additional components for logging performance data. We enhanced traditional CPU profilers to unify CPU profiles with telemetry from the networking hardware. Our sampling-based scheme, implemented in the SST/Macro simulator, shows the promise of our technique in offering unified, system-wide performance insights to application developers.

Our future work involves extensively evaluating our methods on serious workloads, working with hardware development teams to incorporate our proposed extensions, and working with software profiling tools to best utilize the network telemetry.