1 Introduction

For many years, energy consumption in HPC systems has been a challenging topic. Data centers account for around 4% of global electricity consumption, between 73 and 140 million MWh per year [8]. It is estimated that from 2010 to 2018 the demand for work in data centers increased by 550% while their consumption increased by only 6%, thanks to the advances in energy efficiency that have been achieved [1]. Current supercomputers such as Frontier [9] and Fugaku, even being much more efficient than their predecessors, require more than 20 MW during peak loads. Therefore, Green Computing has become a recent trend in computer science, which aims to reduce the energy consumption and carbon footprint produced by computers on distributed platforms such as clusters, grids, and clouds [6].

From the environmental and economic points of view, multiple works address the development of energy-efficient HPC systems, since such systems make it possible to reduce high consumption (e.g., Frontier, which was placed at the top of the GREEN500 as the most efficient supercomputer in the world). The advances made in this area are highly relevant since they enable the reduction of CO2 emissions from computing [10].

While traditional scheduling solutions attempt to minimize processing time without taking energy into account, energy-aware job scheduling has become a trend in computing facilities. Most of those solutions are based on job grouping and migration, considering that the number of resources needed for a workload can be minimized [2]. Other solutions propose real-time dynamic scheduling systems to execute applications efficiently while optimizing energy consumption [5]. Those solutions are not usually applicable to supercomputers, where jobs are statically allocated and elasticity or migration mechanisms are not available. In supercomputers, techniques like varying CPU frequencies [3] or allocating jobs to CPUs or heterogeneous devices [4] are more popular for increasing the energy efficiency of the system without compromising the application makespan.

The main goal of this work is to increase energy efficiency in HPC platforms by means of malleability and dynamic energy-aware application scheduling. Malleability is a promising technique to reduce the energy consumption of parallel applications by dynamically adjusting the resources to the computation requirements of each application. In this work, we present a resource scheduler capable of choosing, using real-time data, the application configurations that reduce energy consumption under different constraints. We use the FlexMPI [7] runtime to dynamically adapt and reconfigure the applications, while also adapting the frequency of the CPUs in the platform. Scheduling algorithms range from the simplest one, which minimizes the energy consumption over all possible configurations, to the most sophisticated one, which uses a cost function that allows the user to prioritize the minimization of execution time or energy consumption. We use mathematical methods to predict the application workload and execution phases, which enable the resource scheduler to refine its decisions. We also provide a set of use cases integrated with the FlexMPI runtime that can be used as benchmarks. Additionally, using these use cases, we provide an energy profile modeler that estimates the application energy consumption, an energy-aware malleable scheduler that is able to run the use cases efficiently, and a practical evaluation on a real platform.

Fig. 1. Overview of the system architecture.

2 Energy-Aware FlexMPI System Architecture

Figure 1 depicts the proposed energy-aware system architecture based on the FlexMPI framework. The first component comprises the parallel applications, which respond to FlexMPI commands. These commands are sent by the root process after being received from the controller. Energy is measured by the root process and sent to the controller, which forwards it to the scheduler. The root application consists of a single thread and is in charge of executing and controlling FlexMPI. The Performance monitor is the component that collects the energy metrics and forwards them to the scheduler. It is also in charge of receiving the new configurations from the scheduler and forwarding them to the controller. It is the intermediate piece between FlexMPI and another component, developed in MATLAB, that runs on a different compute node. The Malleability scheduler is the element that continuously analyzes the available information (energy, minimum points, workload predictions, and phases) and determines the ideal configuration to apply at every reconfiguration point. This information is received from the monitor (integrated with FlexMPI), the profiler, and the prediction model. The Profile modeler collects energy time series and uses them to model the energy that the applications consume under different configurations (number of processes). The reconstruction is done using mathematical methods (i.e., interpolation) and the results are sent to the scheduler.

The FlexMPI controller is in charge of receiving the desired malleability configurations, sending the malleability commands to the processes (expanding or shrinking them), and updating the frequency and voltage of the processors. This component interacts with two elements: the FlexMPI malleability driver [7] and Dynamic Voltage and Frequency Scaling (DVFS). DVFS is a technique used to reduce energy consumption in digital systems by scaling down the voltage and frequency of the processors.

3 Malleable Resource Scheduling for Energy Saving

There are two principal components in the proposed architecture: an energy-aware scheduler that determines the application configuration and DVFS value at run-time, and an energy profile modeler that supports the scheduler's decision-making process. This section describes both of them.

3.1 Malleability Scheduler

The malleability scheduler is in charge of analyzing all the available information in real-time (energy, execution times, application phases, etc.). The scheduler calculates which is the best configuration (frequency and the number of processes) to apply in order to reduce energy or execution time.

From the scheduler’s point of view, making the best decision requires collecting system and application performance from historical records, data from the current execution, and predictions of future performance based on models. Note that in the proposed framework, runtime energy and performance metrics are collected by FlexMPI, which integrates a performance and energy monitor, and these measurements are sent to the scheduler for decision-making.

In the cost function depicted in Eq. 3.1, \(E(NP, freq)\) and \(T(NP, freq)\) are the energy and execution time of a specific configuration of the number of processes and frequency (NP and freq, respectively), and \(E_{max}\) and \(T_{max}\) are the maximum existing values of both parameters. \(W_1\) and \(W_2\) are normalized weights, with \(W_1 + W_2 = 1\). The scheduling algorithm can find multiple solutions (by minimizing the cost function) according to the weight parameters. For instance, when the user desires to balance the energy and execution time values, both weights should be similar. If it is necessary to minimize the energy, then \(W_1\) should be higher (and vice-versa if the goal is only to minimize the execution time).

$$\begin{aligned} C(NP,freq) = W_1 \frac{E(NP,freq)}{E_{max}} + W_2 \frac{T(NP,freq)}{T_{max}} \end{aligned}$$
(3.1)

The scheduling algorithm consists of finding the minimum value of the cost function across all the existing combinations of the number of processes and DVFS values. Given that not all energy data is available, we propose an energy profile modeler that reduces the complexity and the amount of monitoring information needed by the scheduler.
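As a minimal sketch of this exhaustive minimization (assuming energy and time measurements are available as dictionaries keyed by configuration; the helper name and toy values are hypothetical), the selection based on Eq. 3.1 can be written as:

```python
def best_configuration(energy, time, w1=0.5, w2=0.5):
    """Return the (NP, freq) pair that minimizes the cost of Eq. 3.1.

    `energy` and `time` map (NP, freq) tuples to measured Joules and seconds.
    The cost normalizes each term by its maximum observed value.
    """
    e_max = max(energy.values())
    t_max = max(time.values())
    cost = lambda cfg: w1 * energy[cfg] / e_max + w2 * time[cfg] / t_max
    return min(energy, key=cost)

# Toy example: 4 processes at 1.4 GHz uses the least energy,
# while 8 processes at 2.4 GHz is the fastest configuration.
energy = {(4, 1.4): 500.0, (8, 2.4): 900.0}
time = {(4, 1.4): 120.0, (8, 2.4): 60.0}
best_configuration(energy, time, w1=0.9, w2=0.1)  # → (4, 1.4): energy wins
best_configuration(energy, time, w1=0.1, w2=0.9)  # → (8, 2.4): time wins
```

Shifting the weights flips the chosen configuration, which is exactly the user-controlled trade-off the cost function is designed to expose.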

3.2 Energy Profile Modeler

This section introduces the mathematical method employed to implement the energy modeler. This method approximates the information related to the energy consumption of the application, allowing the system to select the most appropriate setup based on the number of processes and the DVFS value (NP and freq). The existing combinations can be seen as two surfaces, with the x- and y-axes ranging over the possible NP and freq values, and the z-axis holding the corresponding application energy or execution time.

The main idea behind this proposal is to monitor the application performance using a limited number of configurations under certain NP and freq values. We denote each measurement as a sample. For each sample, the application’s energy consumption and execution time are collected. The mathematical method reconstructs the energy and execution time surfaces by means of linear interpolation techniques. Algorithm 1 describes the sampling process. Initially, we collect samples using the maximum and minimum number of processes while considering multiple ranges of freq values. We have observed that the overhead of updating the DVFS values is negligible. In contrast, adapting the number of processes has a more relevant impact, related not only to process creation or destruction but also to data redistribution. In the following iterations, new samples are collected based on the maximum difference between the application sample and the model. This process enables the refinement of the interpolation. Algorithm 2 depicts the interpolation algorithm that reconstructs the surfaces based on the existing samples.
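A minimal one-dimensional sketch of this refinement loop (for a fixed freq, with a hypothetical `measure` function standing in for a real FlexMPI sample, and a synthetic energy curve chosen only for illustration) could look as follows:

```python
import numpy as np

def measure(n_procs):
    """Hypothetical stand-in for sampling one configuration's energy (J)."""
    return 2000.0 / n_procs + 15.0 * n_procs

def refine_profile(np_min, np_max, iterations):
    """Iteratively sample where the linear model deviates most from reality."""
    # Initial samples: minimum and maximum number of processes.
    samples = {np_min: measure(np_min), np_max: measure(np_max)}
    for _ in range(iterations):
        xs = sorted(samples)
        ys = [samples[x] for x in xs]
        grid = np.arange(np_min, np_max + 1)
        model = np.interp(grid, xs, ys)           # linear interpolation
        errors = {int(x): abs(m - measure(x)) for x, m in zip(grid, model)}
        worst = max(errors, key=errors.get)       # largest model/sample gap
        samples[worst] = measure(worst)           # refine the model there
    return samples

profile = refine_profile(1, 24, iterations=4)     # 2 initial + 4 refined samples
```

Each iteration adds exactly one sample at the point of maximum disagreement, so the model converges quickly near the curved, low-NP region while the nearly flat tail needs few samples.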

3.3 FlexMPI Support

In this work, we leverage the malleability support provided by FlexMPI to collect multiple sample points (number of processes and DVFS values) during the application execution. The system performs two operations. First, it collects the performance metrics of the current configuration of the execution. Second, once it has enough information for that configuration, it uses malleability to reconfigure the application and analyze another sample (expanding or shrinking the number of processes and changing the DVFS value). Following this approach, the system is able to execute multiple analyses during one execution of the real use case. By leveraging malleability, our proposal is able to build the models in the early execution stages, reducing data collection time and making the models available during the first application execution. Note that this approach also consumes less energy than running independent executions varying both the number of processes and the DVFS value. We assume that the performance behavior at the beginning of the application is representative of the whole execution.
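The order in which configurations are visited can be sketched as a simple generator (a hypothetical illustration of the boundary-NP-first sampling policy described above, exploiting the fact that DVFS changes are far cheaper than process reconfigurations):

```python
def sampling_schedule(np_values, freq_values):
    """Yield (NP, freq) configurations to visit during a single run.

    Process-count changes are expensive (creation/destruction plus data
    redistribution), so NP is reconfigured only twice: all DVFS values
    are swept at each boundary NP before moving on.
    """
    for n in (min(np_values), max(np_values)):
        for f in freq_values:
            yield (n, f)

schedule = list(sampling_schedule(range(1, 25), [1.2, 1.4, 2.4]))
# visits (1, 1.2), (1, 1.4), (1, 2.4), then (24, 1.2), (24, 1.4), (24, 2.4)
```

This keeps the number of costly malleability operations constant regardless of how many frequency steps are profiled.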

Algorithm 1. Sampling process.
Algorithm 2. Interpolation algorithm.

4 Evaluation

The evaluation has been carried out on a bare-metal cluster whose compute nodes feature Intel(R) Xeon(R) Gold 6212U processors with 24 cores and 330 GB of RAM. The nodes are connected by 10 Gbps Ethernet. A FlexMPI-based implementation of the Jacobi method is used. Our implementation includes a set of tunable parameters that modify the behavior of the code, generating different performance profiles of CPU, I/O, and network usage. For this reason, we have designed a set of configurations that cover a wide range of performance patterns. In the following, we describe the main use cases:

  • Use case 1 (UC1): CPU-intensive application with high data locality.

  • Use case 2 (UC2): CPU-intensive application with low data locality.

  • Use case 3 (UC3): Combination of CPU and communication intensive phases.

  • Use case 4 (UC4): Combination of CPU and I/O intensive phases.

  • Use case 5 (UC5): Combination of CPU, I/O, and communication intensive phases.

To obtain execution profiles, the application has been run in multiple iterations. Each iteration includes all the executions combining the entire frequency range (from 1.2 GHz to 2.4 GHz) and all the available cores (from 1 to 24). The energy metric represents the average energy of the existing samples.

In Fig. 2, we can observe the energy profile for UC1, which is very similar to the energy profiles of UC2 and UC4 (note that we do not include all the figures due to space constraints). We can state that increasing the number of processes reduces the overall consumed energy. However, there is a point after which the reduction is very small and the profile looks like a flat surface. Figure 3 plots the energy profile for the fifth use case, UC5, which is a combination of UC3 and UC4, and that is why it is also very similar to UC3. This profile includes features of UC4 towards the end of the figure because the I/O intensity is dominant there. As the frequency increases, the energy consumption increases as well, and the minimum energy point occurs at a frequency of 1.4 GHz. Regarding the number of processes, we can see that the more processes are included, the higher the energy consumption.

4.1 Energy Modeler Accuracy

In this section, we analyze the experimental results of the energy profile modeler. The comparison between the real and the interpolated models takes into account: (1) how the surface is refined during the modeler’s iterative process, and (2) the differences between the real and the modeled surfaces. In these examples, the interpolation is performed in only four iterations because the results after the fourth iteration were already very similar to the real profile.

Fig. 2.
figure 2

Generalized energy profile to represent the use cases UC1, UC2 and UC4. Note that the colors on the graphs identify the energy range to which the points belong.

Fig. 3.
figure 3

Generalized energy profile that represents the use cases UC3 and UC5. Note that the colors on the graphs identify the energy range to which the points belong.

Figures 4 and 5 plot the interpolated energy profile model for UC1. It is important to highlight that in the first iteration, the profile is a plane defined by only two points. It is then divided in two for the second iteration by taking a midpoint and, as more data from new iterations is included, it becomes closer to the actual CPU profiles. If we consider the differences with the real profile, we can see that in the first iteration there are differences between 100 J and 400 J. After four iterations, the maximum difference is 180 J between 2 and 4 processes and decreases to 0.03 J between 4 and 20 processes.

Fig. 4.
figure 4

Interpolated energy profile with 1 iteration refinement for use cases UC1, UC2 and UC4. Quantitative results are included in Tables 1 to 4.

Fig. 5.
figure 5

Interpolated energy profile with 4 iteration refinement for use cases UC1, UC2 and UC4. Quantitative results are included in Tables 1 to 4.

Fig. 6.
figure 6

Interpolated energy profile with 1 iteration refinement for use cases UC3 and UC5. Quantitative results are included in Tables 1 to 4.

In Figs. 6 and 7, we can see the modeled surface for UC3, which is very similar to UC5. Considering the profile with four iterations, the maximum difference is 190 J at the lowest process counts and 13 J at the highest. In this case, the difference can have a greater influence because the minimum energy point location is unclear. Following the same reasoning as in the previous analysis, considering the lowest energy point, the application spends 531.51 J per iteration. Considering the maximum saving, we observe a saving of 59.50% per iteration (max. 1310.59 J per iteration). Compared to the real surface, the model selects a non-optimal configuration with a cost 9.50% higher than the one obtained from the real surface.

In order to show the overall results, Table 1 summarizes the scheduling metrics based on minimizing only energy consumption. Table 2 depicts the scheduling results based on the cost function giving more weight to energy. Next, Table 3 shows the scheduling results based on the cost function with the same weight for energy and execution time. Finally, Table 4 shows the scheduling results based on the cost function prioritizing execution time.

Fig. 7.
figure 7

Interpolated energy profile with 4 iteration refinement for use cases UC3 and UC5. Quantitative results are included in Tables 1 to 4.

Table 1. Summary table of scheduling results based on minimizing energy.
Table 2. Summary table of scheduling results based on cost function with more weight in energy.
Table 3. Summary table of scheduling results based on cost function with the same weight in energy and execution time.
Table 4. Summary table of scheduling results based on cost function with more weight in execution time.

Regarding the interpolation, for some profiles the energy saving is the same as with the real profile, and for others it is slightly lower. However, this difference is compensated because the interpolation applies the best configuration earlier in the execution. On the other hand, the scheduling results obtained using the cost function show savings similar to those of plain energy minimization. However, taking into account that some results are equal, or even slightly lower, we consider this method better because it allows the user to adjust the target of the cost function: saving energy or execution time.

5 Conclusion

In this work, we have introduced a dynamic energy-profile scheduler for MPI-based applications that integrates the FlexMPI runtime and application modeling at run-time. The scheduler exploits the previously built models to determine the best configuration (DVFS value and number of processes) for each application in order to reduce energy consumption. We have also completed an evaluation on a real platform, which demonstrates that our proposal can minimize either the energy consumption or the execution time of the scheduled application. Finally, we are working on a machine learning model capable of predicting, in near-real-time and given the execution information (wall-clock time, energy, data used), which phase of the application execution comes next. With this information, we will be able to enrich the scheduler with a global vision of which phase is going to be executed in the future, improving the scheduling decisions.