1 Introduction

Since the first TOP500 listing, the processing power of supercomputers has increased by 5–6 orders of magnitude [1, 2], and computer simulations have become important research tools [3, 4]. Efficient design of scientific simulation tools has become increasingly challenging as both software complexity and hardware have changed significantly. The community has evolved from adapting codes for single-processor vector machines [5], to shared-memory supercomputers [6], to small Beowulf clusters [7], to massive clusters [8], and eventually to massive clusters with significant co-processors [9]. Recently, memory systems have also become more complex with high-bandwidth memory [10] and non-volatile memory [11]. Adapting research code to specific hardware environments is extremely labor intensive and is not a viable option for most scientific codes, given the variability of existing and future systems. Even well-established codes, like Quantum Espresso, do not have a software team dedicated to performance optimization on every computing cluster their users have access to. We believe that self-adapting resource management is the path forward to fully utilize the available hardware resources without altering the actual research code.

1.1 State-of-the-art cluster-level and node-level resource management

The resource management in supercomputers has two levels: the cluster level and the node level. On the cluster level, job scheduling tools (e.g., PBS [12] and SLURM [13]) allocate computing resources (e.g., individual nodes) to each simulation task, focusing, e.g., on maximal overall system utilization. On each node, the operating system (OS) manages the computing and storage resources.

The OS manages memory and CPU usage [14,15,16]. However, scientific software developers and users are expected to explicitly limit the peak memory usage below the node memory [17]. Explicit and flexible memory management is challenging given the wide range of memory per CPU core, e.g., 1.4 GB/core on the Stampede 2 KNL sub-cluster [18] and 6.4 GB/core on the Halstead cluster at Purdue University [19].

The OS manages the CPU resources using a time-sharing method [20,21,22]: time slots are allocated among active processes to ensure fair sharing [23,24,25]. Context switches, i.e., switching a CPU core between different processes, happen regularly and cause overhead. Most of the large clusters we are aware of, and certainly the most massive clusters, assign a CPU or a full node to one specific user with one specific executable to minimize context switching. When a node is exclusively assigned to a specific research software, that software is expected to match the number of active threads to the available CPU cores to maximize CPU utilization while minimizing context switching.

1.2 Motivation and requirements for a general resource management tool

Scientific simulations often have memory usage variations [26, 27] and load imbalance [28,29,30,31]. Adding dynamic memory and CPU management to existing software would typically require fundamental code modifications. Such modifications are time intensive and error prone, and resource management can be difficult for non-specialist scientific software developers [32,33,34]. Therefore, we introduce a general resource manager that treats existing software as black boxes and requires only minimal code intrusions.

There are many workflow management tools that combine multiple programs and coordinate their execution, such as SWIFT [35], Pegasus [36], HTCondor [37], ANSYS Workbench [38], Galaxy [38], etc. These high-level tools analyze the input–output dependencies, generate a directed acyclic graph, and execute the programs accordingly. However, they lack dependency management at the iteration level, which is fundamentally required to control memory usage. The capability to adjust CPU usage is also beyond the scope of these existing tools.

We envision and demonstrate a different approach that uses available OS facilities and requires neither expertise in performance optimization nor access to administrative privileges. The OS provides a tool set that, in general, allows resources to be managed in fine detail. To make this available to computational researchers, these tools have to be combined into a usable application. This application, the general manager of resources (MARE), has to be flexible enough to accommodate any resource scenario computational scientists may encounter, and its users must be able to program the resource management control.

This paper presents MARE, a resource manager that combines abstractions of resource management with programmable workflow management. MARE allows researchers who are not specialized in performance optimization to easily optimize computational resource usage. MARE treats software as black boxes, which allows users to optimize the resource usage of scientific software without knowing its internal details. One key innovation of MARE is ensuring optimal CPU usage with dynamic and task-adaptive multithreading. MARE requires no system administration privileges.

The remainder of the paper is organized as follows: Sect. 2 presents the design details of MARE. Sections 3 and 4 show two application examples that demonstrate several concrete and typical resource usage problems that can be solved with MARE. Conclusions are presented in Sect. 5.

2 Framework design and key features

The MARE framework consists of three major components: manager, agent, and client. First, the MARE manager starts on a single node. Second, one MARE agent is started on each available compute node; during the agent start, the location of the MARE manager is provided to the agent. In the last step of the initialization, each application software process registers an associated client with the respective MARE agent. During runtime, each client monitors its application software status by instrumenting the scientific software and repeatedly communicates the status via its agent to the MARE manager (see Fig. 1). Depending on the management policy and the received information, the MARE manager dynamically adjusts the assigned resources for each application.

Fig. 1

The software architecture of the MARE resource management framework, containing clients, agents, and a manager. The manager can share a computing node (0 or 1 in the figure) with the simulation code or be hosted on a dedicated node (2 in the figure). The manager communicates with each client, but only one arrow to the first client on Node 1 is shown in the figure for clarity

2.1 Manager design and implementation

MARE requires only a single manager that runs on either a compute node or a login node. The manager receives application software status updates from all agents via text files stored in a RAM-backed virtual filesystem. These text files are monitored with Linux's inotify facility, which notifies the manager about new status files available to be processed. The MARE manager processes concurrent status files in the order in which the inotify notifications are received. Scientific application software often requires multiple compute nodes, and shared network filesystems like NFS [39], Lustre [40], GPFS [41], etc. are mounted on the compute nodes. However, MARE agents do not move the status files to a network filesystem, since the inotify feature is not sufficiently supported on network filesystems [42, 43]. Instead, the widely available SSH tool is used to transfer application software status files to the local filesystem of the manager. With proper caching of the SSH authentication, the latency of sending one status file is as low as 2 ms. This enables a management granularity (solver/iteration) of the MARE framework on the order of seconds.
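
The event handling at the core of the manager can be sketched with standard Linux calls. The following minimal C++ sketch is not MARE's actual code; the directory path is an illustrative assumption. It watches a RAM-backed directory for completed status files and hands each file name to the policy processing:

// Minimal sketch of the manager's event loop: watch a RAM-backed directory
// (path chosen for illustration) for completed status files via Linux inotify.
#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main() {
    const char* status_dir = "/dev/shm/mare";            // hypothetical status directory
    int fd = inotify_init();                             // blocking inotify event queue
    inotify_add_watch(fd, status_dir, IN_CLOSE_WRITE | IN_MOVED_TO);
    alignas(inotify_event) char buf[4096];
    while (true) {
        ssize_t len = read(fd, buf, sizeof(buf));        // blocks until events arrive
        if (len <= 0) continue;
        for (char* p = buf; p < buf + len; ) {
            auto* ev = reinterpret_cast<inotify_event*>(p);
            if (ev->len > 0) {
                std::string file = std::string(status_dir) + "/" + ev->name;
                std::printf("new status file: %s\n", file.c_str());
                // the manager would parse the file here and invoke the policy function
            }
            p += sizeof(inotify_event) + ev->len;
        }
    }
}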

The management policy is contained in a function that is executed every time a new status file is detected. Any kind of status information or metadata of the application software process can be input to the policy function, ranging from CPU or memory load to process information, application data, or even user input. The manager also provides adjustment APIs for the policy function that wrap OS details which would otherwise require advanced OS knowledge. Application software status updates direct the manager, via the policy function, to make adjustments. Example policies for synchronization and dynamic CPU usage adjustments are provided with MARE.
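
As an illustration only (the field and function names below are assumptions, not MARE's documented API), a user-defined policy can be thought of as a callback that receives the parsed status update and returns the adjustments the manager should apply:

// Hypothetical shape of a user-defined policy callback; names and fields are
// illustrative assumptions rather than MARE's actual interface.
#include <map>
#include <string>
#include <vector>

struct ClientStatus {                  // parsed from one status file
    std::string node;                  // compute node hosting the process
    int pid = 0;                       // process id of the application
    std::string event;                 // e.g., "segment_done", "waiting_at_start"
    double memory_gb = 0.0;            // reported memory usage
};

struct Adjustment {                    // handed to the manager's adjustment APIs
    int pid = 0;
    int num_threads = 0;               // new thread count, delivered via SIGUSR1/SIGUSR2
    bool wake_up = false;              // release a process sleeping at a control point
};

// Executed by the manager every time a new status file is detected.
std::vector<Adjustment> policy(const ClientStatus& update,
                               std::map<int, ClientStatus>& known_processes) {
    std::vector<Adjustment> actions;
    known_processes[update.pid] = update;
    if (update.event == "segment_done") {
        // e.g., redistribute the finished process's cores among the remaining ones
    }
    return actions;
}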

2.2 Agent design

An agent is a thin layer needed on each compute node to handle the network communication with the MARE manager. The agent uses the inotify feature to monitor application software status files. The event-driven design of inotify avoids unnecessary file polling, which saves CPU resources. The agent forwards new status files to the MARE manager.

2.3 Client design and implementation

Each application software process has an exclusive associated client. The client connects the application software process with the agent and interacts via two interfaces. The first interface consists of control points in the source code of the application software. These control points allow the client to monitor the application status; code locations around MPI function calls are natural candidates for control points. The client writes the status updates and process metadata into the status files that are forwarded by the agent to the manager.

The second interface provides dynamic CPU resource adjustment. During the client's initialization, a signal handler of the framework is registered to receive SIGUSR1 and SIGUSR2 signals from the OS. The signal handler automatically adjusts the number of threads (CPU cores) used by the application software according to the signals received from the manager.
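
A simplified sketch of this interface is given below, assuming for illustration that SIGUSR1 requests one additional thread and SIGUSR2 one fewer; the actual signal encoding used by MARE may differ. To stay async-signal-safe, the handler only records the requested count, which is applied at the next safe point:

// Sketch of the client-side signal interface (signal semantics assumed for
// illustration). The handler only updates an atomic counter; the new thread
// count is applied outside the handler, e.g., before the next parallel region.
#include <atomic>
#include <csignal>
#include <omp.h>

static std::atomic<int> requested_threads{1};

static void thread_signal_handler(int sig) {
    if (sig == SIGUSR1) requested_threads.fetch_add(1);
    if (sig == SIGUSR2 && requested_threads.load() > 1) requested_threads.fetch_sub(1);
}

void mare_client_init() {              // called once during the client's initialization
    std::signal(SIGUSR1, thread_signal_handler);
    std::signal(SIGUSR2, thread_signal_handler);
}

void mare_apply_thread_count() {       // called at safe points in the application
    omp_set_num_threads(requested_threads.load());
}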

Since the framework is written in C++ and is expected to work with various software, a version of the client-side APIs without name mangling [44] is provided for C and Fortran. Additionally, a special wrapper interface (MEX) is provided for the popular engineering software Matlab and Octave.
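
The wrapper layer can be sketched as follows; the exported names are illustrative assumptions. Declaring the functions extern "C" suppresses C++ name mangling so that C code can call them directly and Fortran can bind to them with bind(C) from iso_c_binding:

// Sketch of unmangled client-side API wrappers (names illustrative).
#include <cstdio>
#include <string>

namespace mare {                       // simplified stand-in for the C++ client implementation
void control_point(const std::string& event) { std::printf("control point: %s\n", event.c_str()); }
void wait_for_manager()                { /* sleep until the manager's wake-up signal (omitted) */ }
}

// exported without name mangling for C and Fortran callers
extern "C" void mare_control_point(const char* event_name) { mare::control_point(event_name); }
extern "C" void mare_wait()                                 { mare::wait_for_manager(); }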

2.4 Design considerations

The MARE framework stores and transfers application software status between the different MARE components as text files. Using the file system rather than direct communication (pipes or sockets) decouples the components of the MARE framework for the highest flexibility. This design makes the client an optional component: an application software can avoid the client by generating the MARE status files by itself.
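
For illustration, a client-less application could emit such a file directly; the directory and the key/value layout below are assumptions for this sketch, not a documented MARE format:

// Hypothetical example of an application writing a MARE status file itself.
#include <fstream>
#include <string>
#include <unistd.h>

void write_status_file(const char* event) {
    std::ofstream out("/dev/shm/mare/status_" + std::to_string(getpid()) + ".txt");
    out << "pid " << getpid() << "\n"
        << "event " << event << "\n";  // e.g., "segment_done"
}   // closing the file triggers the agent's IN_CLOSE_WRITE inotify event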

2.5 Example policies and control points

Different software applications have different resource usage patterns. The MARE resource manager is designed to be highly customizable and programmable via user-defined policies. A few policy examples that address typical resource usage patterns are provided. The following application examples illustrate the management of CPU load and the simultaneous management of memory and CPU load. Similar policies can be imagined to control memory bandwidth, network bandwidth, energy consumption, allocation budgets on community clusters, etc.

Table 1 Potential control point locations for different types of scientific simulations

Many scientific simulations proceed iteratively. Within one iteration, the problem is divided into smaller chunks for parallel processing. An ideal location for control points is the end of each chunk, before the synchronization that prepares the next iteration, as illustrated in the sketch below. Table 1 shows potential control point locations for different types of simulations in addition to the nanoscale simulation examples demonstrated in the later sections.
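
A minimal sketch of this placement is shown below, using the illustrative client calls introduced in Sect. 2.3 (mare_control_point and mare_wait are assumed names) around the MPI synchronization of an iterative, chunk-parallel solver:

// Sketch of control point placement at the end of a chunk, before the
// synchronization for the next iteration.
#include <mpi.h>

extern "C" void mare_control_point(const char* event_name);  // illustrative client API
extern "C" void mare_wait();
void compute_local_chunk(int iteration);                     // application-specific work

void solve(int num_iterations) {
    for (int it = 0; it < num_iterations; ++it) {
        compute_local_chunk(it);             // local chunk of the current iteration
        mare_control_point("chunk_done");    // report progress; the manager may reassign cores
        mare_wait();                         // optionally sleep until resources are granted
        MPI_Barrier(MPI_COMM_WORLD);         // synchronization preparing the next iteration
    }
}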

3 Policy example: CPU load

To exemplify a policy that manages dynamic multithreading, the ab-initio code Quantum Espresso (QE) [49,50,51] is considered, applied to a heterojunction of two 2D materials. The code is run on Purdue University's Halstead cluster [19, 52] for performance testing. Each Halstead node consists of two 10-core CPUs with 128 GB of memory, connected by a 100 Gbps Infiniband network.

3.1 Application example: structure optimization with Quantum Espresso

Fig. 2

Illustration of the monolayer MoS\(_{2}\)/WS\(_{2}\) heterostructure system considered in the Quantum Espresso example

Figure 2 illustrates the application example of a monolayer MoS\(_{2}\)/WS\(_{2}\) heterostructure. The electronic properties of the heterostructure are calculated within the density functional theory framework of the plane-wave-based QE. The QE software is written in Fortran and was augmented to utilize the MARE client APIs following Sect. 2.3.

Fig. 3

Schematic of the program flow for the Quantum Espresso structure optimization. The inner self-consistent loop performs the electronic density calculation, which involves solving the mutually independent momentum point (labeled "k0", "k1", etc.) contributions to the density. The calculation of the k-point contributions is repeated until a converged electronic density is found. The outer loop iterates between ion positions and electronically mediated forces and continues until global convergence is reached

Figure 3 shows the simplified software flow of a typical electronic-structure calculation with QE, which consists of an outer structural relaxation loop and an inner electronic relaxation loop [53]. The ions are moved according to the forces on the ions, which depend on the electronic density, which in turn depends on the ion positions. Due to their mutual dependence, ionic positions and electronic density have to be solved iteratively until convergence is achieved. Most of the simulation time is spent in the solution of the momentum (k) point contributions to the density, which is iterated often due to the self-consistency loops. For this example, a single electronic density calculation with fixed ionic positions is used; the resource management would be the same for the relaxation of ionic positions.

3.2 Without MARE: load imbalance

Fig. 4

CPU usage of the Quantum Espresso example of Fig. 2 with 13 k-points and 13 MPI processes. Multithreading is not enabled to avoid oversubscription of the 20 available CPU cores. This setup leaves 7 CPU cores unused. Here, each color corresponds to one k-point and one MPI process

The solutions of the individual k-points are independent of each other and can run trivially in parallel. However, since the number of required k-points is determined by the modeled problem, it can easily be incommensurate with the number of available CPU cores per node. This leads to load imbalance, as illustrated in Fig. 4 for this example with 13 independent k-points and 20 CPU cores per compute node. In this case, ideal CPU usage cannot be achieved when the number of threads for each process is statically set by environment variables.

3.3 Multithreading management by the operating system

One way to improve the CPU usage is to increase the number of threads per process from 1 to 2, which leads to an oversubscription of 26 computing threads on the 20 available CPU cores. Modern multi-user and multi-program operating systems are capable of handling such oversubscription with context switching [54].

Fig. 5

CPU usage of the example simulation of Fig. 4 with 2 computing threads per process and context switching by the OS. The 13 k-points correspond to the 13 colors, and their computation is spread over 20 cores. The CPU usage improves and the simulation time drops by 197 s, from 1474 s in the single-threaded case of Fig. 4 down to 1278 s, but a highly frequent oscillation of the CPU usage is found

The CPU usage with context switching shown in Fig. 5 is indeed higher than in the single-threaded situation shown in Fig. 4. Consequently, the end-to-end time improves by 197 s. However, the context switching also causes a very frequent oscillation of the CPU usage per process, as expected (see Fig. 6), which creates some CPU overhead by itself.

Fig. 6

CPU usage of 1 process with 2 threads in the oversubscription situation of the example calculation in Fig. 5, averaged over subsequent time intervals of 1.7 s. The usage frequently oscillates between 1 and 2 cores due to frequent context switching by the OS

This is illustrated in the CPU usage of 1 process using 2 threads shown in Fig. 6. The OS-controlled context switching causes the CPU usage to oscillate between 1 and 2.

Typically, to minimize context switching, the number of running processes or threads has to match the number of CPU cores. The number of processes is typically determined prior to the simulation by specifying the number of MPI ranks. The number of threads is determined either prior to the simulation by environment variables or during the simulation by explicit API calls.

3.4 Management policy for CPU load

The overhead created by this oscillation can be avoided when information from the application software is used to dynamically adjust the number of threads by the MARE framework following a CPU load management policy. Processes with smaller and/or fewer computational tasks complete their computation sooner. The CPU resources of those processes are then reclaimed by the MARE manager and redistributed to processes with remaining tasks. Since threads share their memory, making the required data available to the newly assigned CPU cores does not require explicit data movement. As a result of this policy, tasks with higher computational load are effectively assigned a larger number of cores, which improves the overall CPU load balance. To enable core reassignment, the application software needs to be augmented with a control point that (1) reports to the manager that the computational task is done (see Fig. 1) and (2) puts the corresponding process into sleep mode. When all processes reach that barrier, the CPU resource assignment is reset to the default and the processes are signaled to proceed. In the example of Quantum Espresso, "proceeding" means to continue the calculation with further self-consistency iterations.
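
The redistribution step of such a policy reduces to splitting the node's cores as evenly as possible among the processes that still have work. A minimal sketch (function name illustrative):

#include <vector>

// Divide all cores of a node among the processes that are still computing.
std::vector<int> redistribute_cores(int total_cores, int active_processes) {
    std::vector<int> threads(active_processes, total_cores / active_processes);
    int remainder = total_cores % active_processes;
    for (int i = 0; i < remainder; ++i)
        ++threads[i];                        // the first 'remainder' processes get one extra core
    return threads;                          // e.g., 20 cores, 6 processes -> {4, 4, 3, 3, 3, 3}
}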

Many large-scale software packages face similar load imbalance that this policy of dynamic multithreading can address. Even unforeseen scenarios can be improved due to the dynamic nature of MARE's thread control. This policy essentially requires multiple processes to be hosted on the same compute node to benefit from the shared memory. Therefore, it becomes more effective when a larger number of cores per compute node is available.

3.5 MARE control points in Quantum Espresso

Fig. 7

Control points required for the electron density calculation. For each electron density iteration, MPI collective communication is required after initialization, computation of k-points and update of the force field. The MARE manager dynamically optimizes the CPU usage for each code segment individually

The example of Fig. 2 requires Quantum Espresso to trigger a global collective MPI communication after each of the electron initialization, the k-point computation, and the update of the force field. In addition to the multithreaded computation of k-points discussed above, the electron initialization and the update of the force field can also be solved with multiple threads. Thereby, electron initialization, k-point computation, and force field update are logical segments that are independently CPU-load-optimized by MARE. Accordingly, at the end of each of these segments and right before the MPI communication, a MARE control point is inserted (see Fig. 7). It is worth mentioning that the addition of these control points, including the corresponding interfaces to the numerical library, is the only code alteration needed in Quantum Espresso to use MARE. This minimal intrusion makes it easy for simulation developers to adapt MARE for their needs.

3.6 Dynamic multithreading

Table 2 Number of threads assigned to each of the 13 processes, each of which solves one specific k-point

The goal of this CPU load example of the MARE resource manager is to dynamically adjust the number of threads assigned to each process so that the available CPU cores are always fully utilized. Here, the 13 processes that handle the 13 considered k-points are initialized with 1 thread each. Then, 7 of the 13 processes are assigned one additional thread to fully utilize the 20 CPU cores in total. These 7 processes with 2 threads finish their calculations earlier than the other 6 processes that have only 1 thread each. Once the 7 double-threaded processes reach their MARE control points, they signal the MARE manager, which in turn redistributes the overall 20 CPU cores to the remaining 6 processes as equally as possible. This redistribution of CPU cores is repeated until all processes have reached the same control point (as detailed in Table 2). Then, the thread assignment is reset by the MARE manager for the next computational segment.

3.7 Assessment of performance improvement

Fig. 8

CPU usage of the same simulation example and setup as Fig. 4 with the MARE resource manager. The bottom 7 colors represent k-points with 2 initial threads, whereas the top 6 colors represent k-points with 1 initial thread. The available 20 CPU cores are fully utilized. The spike-like CPU usage changes indicate reassignment of idling CPU resources to the remaining active k-points close to the end of each simulation iteration (illustrated in Table 2). The simulation time improves by 297 s, from 1474 s in the single-threaded case of Fig. 4 down to 1177 s

Figure 8 shows the CPU usage of the same Quantum Espresso simulation with the MARE manager optimizing the CPU utilization. All available 20 CPU cores are almost fully utilized during the simulation time. Compared to the case with no multithreading (Fig. 4), the end-to-end timing reduces from 1474 to 1177 s, a speedup factor of 1.25\(\times\). MARE's dynamic thread control still accelerates the end-to-end calculation by 1.08\(\times\) compared to the case of constant multithreading (Fig. 5).

3.8 MARE in Quantum Espresso with multiple parallelization levels

A Quantum Espresso example that simulates the energy barrier of vacancy diffusion in Si [55] is used to demonstrate the effectiveness of MARE in scenarios with multiple parallelization levels. A CPU load management policy is applied to a Nudged Elastic Band (NEB) [56, 57] calculation in QE. The NEB simulation tool is built on top of the electronic structure computing engine exemplified in Fig. 3. The NEB method divides the simulation into independent images within each iteration; one image consists of the electronic structure computation discussed in Sect. 3.1. This simulation example contains 10 independent images with the image freezing optimization [58] enabled. In this demonstration, the simulation is allowed to run for 9 iterations; some images are calculated fewer than 9 times, as discussed in detail below. At minimum, two MARE control points per MPI rank are utilized: one control point signals the start and the other the completion of an iteration. Concretely, in this example, 12 MPI ranks are used to solve 6 images simultaneously, with each image supported by 2 ranks. More control points could be used for each individual image calculation, equivalent to the example in Sect. 3.5; however, for simplicity of the demonstration, only two control points at the image level are applied. The Gilbreth cluster [59] at Purdue University with 24 CPU cores and 196 GB memory per node is used for the demonstration.

Fig. 9

Workload distribution among image groups without resource control by MARE. Each of the 10 colors represents the calculation of a specific image. The completion of the iterations of all images is marked with dashed lines. White areas represent idling ranks. The letters A, B, and C highlight the 3 different types of load imbalance encountered in this example

This simulation setup exemplifies different types of load imbalance in NEB simulations.

  • Uneven distribution of images among image groups: In this example, some image groups have to solve 2 or 3 images, while the rest solve only 1. This causes idling ranks and gaps in the workload distribution of type A in Fig. 9. While this type of imbalance is under user control, it is often unavoidable due to simulation setup needs.

  • Different workload of the images within each iteration: Different images cause different numerical load until they converge (see B in Fig. 9). This difference is unpredictable and varies between iterations.

  • Frozen images in one iteration: The image freezing optimization skips iterations for images with small errors. While this reduces the numerical workload of an image group, it causes load imbalance between image groups (see C in Fig. 9).

Fig. 10

Workload distribution among the image groups of Fig. 9 with resource control by MARE. All colors and dashed lines have the same meaning as in Fig. 9. Significant reduction of idling ranks (white space) is achieved. The total simulation with MARE takes 11449.3 s, which is 1.19\(\times\) faster than the 13613.4 s of simulation time without MARE

Once the MARE control points at the end of each iteration signal the completion of an image group to the MARE manager, the MARE manager can redistribute the available CPU cores to other image groups still busy with the electronic structure calculation. This CPU core distribution is reset at the beginning of each iteration. This method is applicable to all 3 types of load imbalance discussed above. As a result of the MARE management, Fig. 10 shows a 1.19\(\times\) speedup for this example.

4 Policy example: memory and CPU load

Simulations with the recently developed ROBIN method [60] are used to exemplify a policy that optimizes memory and CPU load on two different HPC systems: (1) the Rice cluster [61] at Purdue University with 20 CPU cores and 64 GB memory per node and (2) the Intel KNL partition of the Stampede 2 cluster [18, 62] at the Texas Advanced Computing Center (TACC) with 68 cores and 96 GB memory per node.

4.1 ROBIN simulations of materials

Fig. 11

Schematic of the simulation region partitioning; the simulation region is modeled as a sphere with a large number of atoms. Partitioning compatible with the recursive Green's function method requires every segment to have at most two neighboring segments. The memory load of each segment scales quadratically with the number of atoms in it

State-of-the-art material simulations require periodicity at the simulation domain boundaries. While these boundary conditions are typically appropriate for ideal solid-state materials, realistically fabricated material samples typically host defects, dislocations, and irregularly strained interfaces. The recently developed Recursive Open Boundary and INterfaces (ROBIN) method allows the explicit discretization of millions of atoms and the solution of material properties within a center region of the discretized atom pool (see circle and zoom-in in Fig. 11). ROBIN does not constrain the simulation area with any symmetry. This allows accurate prediction of disordered materials and interfaces and avoids artificial periodicity assumptions. It was shown in Ref. [60] that the periodicity assumption can even yield an erroneous band gap when the actual disordered material is gapless (see Fig. 4 in Ref. [60]). Solving the impact of the environment on a center simulation region requires partitioning the atoms and iterating quantum transport equations over the segments (see Fig. 11).

4.2 Peak memory usage

Fig. 12

Memory usage of an example calculation of the ROBIN method for two electronic energies solved in sequence with 20 threads used

Depending on the overall distribution of discretized atoms and the partition strategy, the number of atoms in each segment can vary significantly (see Fig. 11). The memory load of the quantum transport equations scales quadratically with the number of atoms in each respective segment. Since the quantum transport equations are iterated over the segments, the peak memory usage varies strongly over the simulation time (see Fig. 12).

The example of Fig. 12 solves the quantum transport equations of ROBIN for two different energies in sequence. The calculation of each energy point has a maximum memory usage of 23 GB and its highest usage plateau at 13 GB. As is common for quantum transport simulations, different electronic energies can be solved independently in parallel. In production ROBIN simulations, hundreds to thousands of energy points [63, 64] are solved, which parallelize embarrassingly.

Fig. 13

Memory usage of the calculation of 8 energy points with multithreading to fully utilize 20 CPU cores. (a) 2 parallel processes with 10 threads each. (b) 4 parallel processes with 5 threads each enabled by the MARE manager. The MARE framework enables the simultaneous calculation of 4 energy points with the memory consumption under the system limit

The parallelism at the level of energy points is preferred over the lower-level parallel computation within energy points because of the better resource utilization efficiency and throughput discussed in a later section. Given that the computational behavior of each energy point is similar regardless of the specific energy value, computing two energy points simultaneously (with 10 threads each on a node with 20 CPU cores) almost doubles the peak memory usage of Fig. 12 (i.e., above 32 GB), as shown in Fig. 13a. The more efficient, CPU-commensurate calculation scenario of ROBIN would involve four simultaneous energy point calculations (with 5 threads each), which then exceeds the node's memory limit. Swap space is designed to absorb memory usage bursts. However, swapping, even briefly, is typically turned off on HPC systems to avoid expensive page faults [17]. Frequent page faults lead to memory thrashing, which significantly slows down application software [17]. The users of HPC systems are expected to explicitly limit the memory usage to the amount of available RAM. Therefore, keeping the peak memory usage of the ROBIN simulation under the system limit may cause reduced parallelism and CPU underutilization. Otherwise, the simulation may face reliability issues (crashes due to out-of-memory).

4.3 Management policy for memory and CPU load

This example management policy provides a synchronization mechanism (semaphore) to manage the dependencies among software processes. Two functions are provided for processes to acquire and release resources. The wait function allows a process to sleep and is controlled via signals from the manager; in this example, wait is applied as long as required resources are unavailable. The signal function reports the application progress to the resource manager; in this example, it reports the completion of a task, which triggers the manager to redistribute resources to processes blocked in the wait function. Both wait and signal are executed at control points added to the respective application software. This allows MPI processes to be pipelined to control the application software progress and keep all available machine resources constantly utilized.
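
A manager-side sketch of this semaphore-like policy is given below, assuming for illustration that clients report the events "waiting_at_start" and "reached_peak_memory"; the initial wake-up of the very first process is omitted for brevity and the names are not MARE's actual policy code:

#include <deque>
#include <string>

struct MemoryPipelinePolicy {
    std::deque<int> waiting_pids;            // processes sleeping at their start control point

    // called from the policy function for each status update; returns the pid
    // of the process the manager should wake up, or -1 if none
    int on_status(const std::string& event, int pid) {
        if (event == "waiting_at_start") {
            waiting_pids.push_back(pid);     // wait(): the process blocks until resources free up
        } else if (event == "reached_peak_memory" && !waiting_pids.empty()) {
            int next = waiting_pids.front(); // signal(): the reporting process passed its peak,
            waiting_pids.pop_front();        // so the next process can safely start ramping up
            return next;
        }
        return -1;
    }
};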

Application software should generally be free of race conditions [65, 66]; for software managed by MARE, this is essential. The design of MARE ensures that it does not introduce any race conditions by itself.

4.4 Control points with the MARE framework

Fig. 14

Memory usage of 4 processes running the ROBIN application simultaneously. The control points added to the ROBIN code enable the delayed start to pipeline the peak memory usage

To enable MARE’s control of ROBIN, two control points are added to the ROBIN source code. The first control point is added at the beginning of the simulation: at this point, the MARE manager is signaled via the agent that the code execution has reached this position, and the software process is put into a sleep state, waiting for a wake-up signal from the MARE manager. The second control point is added at the source code position of peak memory usage (see Fig. 12). Depending on the management policy, this procedure can yield a delay between two consecutive processes. Such a delay can be desired to keep the peak memory usage below the machine’s limit. The client monitors whether its corresponding host software reaches the second control point and reports the software status to the manager. Then, the manager sends the wake-up signal to the next process to enter the execution pipeline (see Fig. 14), as sketched below. The pipelined execution results in smoother, under-the-limit memory consumption for the parallel computation of 4 energy points on a compute node, shown in Fig. 13b, which results in faster simulation because of higher resource utilization efficiency.
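
On the application side, the two control points can be pictured as follows; the function names follow the illustrative client API of Sect. 2.3, and the ROBIN routine names are placeholders rather than the actual code:

extern "C" void mare_control_point(const char* event_name);  // illustrative client API
extern "C" void mare_wait();
void allocate_and_fill_matrices(double energy);              // placeholder for ROBIN internals
void iterate_quantum_transport(double energy);

void solve_energy_point(double energy) {
    mare_control_point("waiting_at_start");  // first control point: report arrival ...
    mare_wait();                             // ... and sleep until the manager wakes this process

    allocate_and_fill_matrices(energy);      // memory usage grows toward its peak here

    mare_control_point("reached_peak_memory");  // second control point: the manager may now
                                                // admit the next process into the pipeline
    iterate_quantum_transport(energy);       // remaining work with declining memory usage
}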

4.5 Memory management to enable efficient simulation setup

MARE can enable running 4 processes in parallel: the example in Fig. 13b uses the MARE resource manager to pipeline the parallel calculation of four energy points. By properly pipelining the 4 processes, the MARE resource manager successfully limits the overall peak memory usage below the system capacity of 64 GB. Additionally, the MARE resource manager reduces memory usage oscillations, which improves the overall resource utilization.

Fig. 15

Multithreading (a) strong scaling and (b) parallel efficiency of the ROBIN simulation for one energy point on a compute node with 20 CPU cores on the Rice cluster

The multithreading parallel efficiency is another optimization factor that impacts the end-to-end application timing. Figure 15a shows the strong scaling of one energy point of ROBIN with respect to the number of parallel threads on one Rice node with 20 CPU cores. The end-to-end timing deviates further from the projected ideal scaling line as more threads are used.

More efficient CPU usage would therefore favor running the ROBIN code with as few threads per process as possible but as many processes per compute node as possible. MARE's pipelining of processes discussed in the previous section enables more parallel processes, which results in more efficient CPU usage and shorter end-to-end timing. Since the solution of different energy points is embarrassingly parallel, we consider in the following the time-to-solution per energy point.

Fig. 16

Memory usage on a Rice compute node of the example ROBIN simulation (multithreading enabled). The black line shows the memory usage of 4 simulation processes with 5 threads each using the MARE framework. The gray line shows the memory usage of 2 simulation processes with 10 threads each, i.e., the optimal setting without the MARE framework. MARE enables more efficient resource usage and a shorter time-to-solution

The change in this time-to-solution per energy point is shown in Fig. 16. The black (gray) line in Fig. 16 shows the memory usage over time when four (two) processes run in parallel per node of the Rice cluster. It is worth repeating that only MARE enables running 4 processes simultaneously below the system's memory limit. With 4 processes per compute node, the number of threads per process needed to maximally utilize the Rice CPUs reduces from 10 to 5. The parallel efficiency improves from 68.7% to 78.4%. As a result, the MARE framework effectively enables processing four energy points every 2937 s (734 s per energy point). In contrast, without MARE, one compute node processes two energy points every 1588 s (794 s per energy point). Using this scheme, MARE improves the calculation throughput per node by 1.08\(\times\).
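
The throughput comparison follows directly from the per-node batch times reported above:

\[
  t_{\mathrm{MARE}} = \frac{2937\,\mathrm{s}}{4\ \text{energy points}} \approx 734\,\mathrm{s/point},\qquad
  t_{\mathrm{no\ MARE}} = \frac{1588\,\mathrm{s}}{2\ \text{energy points}} = 794\,\mathrm{s/point},\qquad
  \frac{794\,\mathrm{s}}{734\,\mathrm{s}} \approx 1.08 .
\]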

Fig. 17

The multithreading parallel efficiency of the ROBIN simulation for one energy point solved on the Stampede 2 cluster

The maximum performance improvement MARE can achieve depends on the hardware configuration and, with it, the specific strong scaling behavior. Figure 17 shows the strong scaling of ROBIN on the Stampede 2 cluster at the Texas Advanced Computing Center with respect to the number of threads.

Fig. 18

Memory usage on a compute node of an example simulation with 60 energy points (multithreading enabled). The black line shows the memory usage of 5 simulation processes (12 energy points per process) with the MARE framework. The gray line shows the memory usage of 3 simulation processes (20 energy points per process) without the MARE framework. The run with the MARE framework (black) finishes faster than the run without it (gray)

Accordingly, Fig. 18 illustrates a stronger performance improvement on Stampede 2 than on Rice (shown in Fig. 16). Figure 18 shows the memory usage of ROBIN when run on Stampede 2. The gray line shows three processes with 22 threads per process, which is the optimal setting without the MARE manager. The black line shows five processes with 13 threads per process, managed by MARE. Since the parallel efficiency improves with MARE from 50.4% with 22 threads to 66.2% with 13 threads (see Fig. 17), the processing throughput of one Stampede 2 node improves from 3 energy points every 1583 s to 5 energy points every 2174 s. That means the per-energy-point execution time improves from 528 to 435 s (1.21\(\times\) speedup). Figure 18 also shows that the MARE framework enables a more efficient usage of the available node memory.

4.6 CPU management to automatically balance computing resource distribution

Fig. 19

CPU usage over time of the simulation of Fig. 16 with 4 simulation processes and the MARE framework without additional adjustment. The first and last approximate 1000 s show CPU ramping in increments of 5 cores

The scheduling of computational processes entails a ramping of the number of active threads per node. This can be seen in Fig. 19, which shows the CPU usage of a single Rice node in the case of the 4-process simulation of Fig. 16. The delaying of MPI processes needed to keep the memory load below the node's maximum memory (see the memory usage management discussion of Sect. 4.2) causes the CPU usage to increment in steps of 5 at the beginning and decrement in steps of 5 at the end of the simulation, since each process is set here by environment variables to spawn 5 threads. In addition to optimizing the memory usage, MARE can improve the CPU usage with the dynamic multithreading discussed in Sect. 3.6.

Fig. 20

CPU usage of the first process of the simulation of Fig. 19 with and without the dynamic adjustment of the number of spawned threads. The gray line shows the constant setting of 5 threads as in Fig. 19. The black line shows the dynamic spawning which starts with 20 threads when all other processes are paused. It reduces in steps whenever another process becomes active to avoid oversubscription of the node

Fig. 21

Total CPU usage of all processes of the simulation in Fig. 20. The black line shows the CPU usage with MARE's dynamic multithreading. It shows full CPU usage at the beginning and at the end of the simulation. For comparison, the CPU usage without MARE's dynamic thread handling of Fig. 19 is shown in gray

MARE’s multithreading policy dynamically adjusts the number of computing threads (CPU cores) used by each process during the simulation, as shown for process 1 in Fig. 20 (black line). For comparison, Fig. 20 shows the number of threads for the same process without MARE’s dynamic multithreading (gray line). The dynamic policy initially assigns all 20 CPU cores to process 1, since it is the first to enter the scheduling pipeline. When the second process enters the pipeline, the 20 cores are distributed among the two processes; accordingly, the number of threads for process 1 drops to 10. When the third process is scheduled to start, processes 1 and 2 retain 6 threads each, and process 3 receives the remaining 8 available threads. Once the fourth process becomes active, all processes are assigned 5 threads. The reverse of this assignment procedure happens near the end of the simulation, whenever another process finishes its simulation tasks. Note that both the dynamic (black) and static (gray) multithreading scenarios show oscillations in the CPU usage when the code iterates between serial and multithreaded sections. The overall CPU usage of the 4 processes is shown in Fig. 21. The black line shows the higher CPU usage at the beginning and end of the simulation achieved with the dynamic CPU adjustment.

Fig. 22

a The memory usage of the same example in Fig. 16 on the Rice cluster. b The memory usage of the same example in Fig. 18 on the Stampede 2 KNL cluster. Dynamic multithreading is enabled in this figure in addition to memory usage pipelining shown in Figs. 16 and 18. Additional speedup (see the detailed comparison in Table 3) and smoother memory usage can be observed

Figure 22a shows the memory usage of the same example on the Rice cluster; Fig. 22b shows the memory usage of the example on the Stampede 2 cluster. The additional CPU management enables processing four energy points every 2818 s on the Rice cluster (compared with 2937 s with memory management only), for a speedup factor of 1.13\(\times\) relative to the setup without MARE. The processing time of 5 energy points on one Stampede 2 node improves from 2174 to 2037 s with the additional CPU management as well (1.30\(\times\) speedup relative to the setup without MARE).

4.7 Overall performance improvement

Table 3 Average time-to-solution (s) per energy point for the ROBIN example calculation on the Rice cluster and the Stampede 2 cluster with different levels of optimization

Table 3 summarizes the calculation throughput on the Rice cluster and the Stampede 2 cluster with and without the MARE framework. When memory management schedules the processes (labeled “With MARE (memory)”), more processes can run simultaneously and fewer threads per process are needed to utilize all CPUs. This improves the parallel efficiency and thus the throughput. The dynamic adjustment of multithreading reduces the CPU underutilization during pipeline filling and draining (labeled “With MARE (memory + CPU)”), which improves the throughput further.

5 Conclusion

The majority of supercomputers are dedicated to scientific simulations. Scientific simulation software typically allows users to run varying applications with hard-to-predict computational load. In consequence, the supercomputer hardware tends to be incompletely utilized during scientific simulation runs. This paper presents the MARE framework, which solves this problem by adding a few control points to the scientific simulation codes. With these observation points, MARE schedules the computation of each parallel process to optimize memory and CPU load during the simulation's runtime. In this way, MARE enables even inexperienced users to optimally utilize the available infrastructure without significantly altering the scientific software. MARE accepts user-defined policies for, e.g., optimizing memory usage, CPU usage, or both. The applications of MARE to Quantum Espresso and ROBIN demonstrate resource efficiency improvements with only superficial knowledge of the simulation software.