1 Introduction

For many years, energy consumption in HPC systems has been a challenging topic. Data centers account for around 4% of global electricity consumption, between 73 and 140 million MWh per year [8]. It is estimated that from 2010 to 2018 the demand for work in data centers increased by 550% while their consumption increased by only 6%, thanks to the advances in energy efficiency that have been achieved [1]. Current supercomputers such as Frontier [9] and Fugaku, even being much more efficient than their predecessors, require more than 20 MW during peak loads. Therefore, Green Computing has become a recent trend in computer science, which aims to reduce the energy consumption and carbon footprint produced by computers on distributed platforms such as clusters, grids, and clouds [6].

From the environmental and economic points of view, multiple works address the development of energy-efficient HPC systems, since such systems make it possible to reduce high consumption (e.g., Frontier, which was placed at the top of the GREEN500 as the most efficient supercomputer in the world). The advances made in this area are highly relevant since they enable the reduction of CO2 emissions from computing [10].

While traditional scheduling solutions attempt to minimize processing time without taking energy into account, energy-aware job scheduling has become a trend in computing facilities. Most of those solutions are based on job grouping and migration, considering that the number of resources needed for a workload can be minimized [2]. Other solutions propose real-time dynamic scheduling systems to execute applications efficiently while optimizing energy consumption [5]. Those solutions are not usually applicable to supercomputers, where jobs are statically allocated and elasticity or migration mechanisms are not available. In supercomputers, techniques like varying CPU frequencies [3] or allocating jobs to CPUs or heterogeneous devices [4] are more popular for increasing the energy efficiency of the system without compromising the application makespan.

The main goal of this work is to increase energy efficiency in HPC platforms by means of malleability and dynamic energy-aware application scheduling. Malleability is a promising technique to reduce the energy consumption of parallel applications by dynamically adjusting the resources to the computation requirements of each application. In this work, we present a resource scheduler capable of choosing, using real-time data, the application configurations that reduce energy consumption under different constraints. We use the FlexMPI [7] runtime to dynamically adapt and reconfigure the applications, while also adapting the frequency of the CPUs in the platform. Scheduling algorithms range from the simplest one, which minimizes the energy consumption over all possible configurations, to the most sophisticated one, which uses a cost function that allows the user to prioritize the minimization of execution time or energy consumption. We use mathematical methods to predict the application workload and execution phases, which enable the resource scheduler to refine its decisions. We also provide a set of use cases integrated with the FlexMPI runtime that can be used as benchmarks. Additionally, using these use cases, we provide an energy profile modeler that estimates the application energy consumption, an energy-aware malleable scheduler that is able to run the use cases efficiently, and a practical evaluation on a real platform.

Fig. 1. Overview of the system architecture.

2 Energy-Aware FlexMPI System Architecture

Figure 1 depicts the proposed energy-aware system architecture based on the FlexMPI framework. The first component comprises the parallel applications, which respond to FlexMPI commands. These commands are sent by the root process after being received from the controller. Energy is measured by the root process and sent to the controller, which forwards it to the scheduler. The root application consists of a single thread and is in charge of executing and controlling FlexMPI. The Performance monitor is the component that collects the energy metrics and forwards them to the scheduler. It is also in charge of receiving the new configurations from the scheduler and forwarding them to the controller. It is the intermediate piece between FlexMPI and another component, developed in MATLAB, that runs on a different compute node. The Malleability scheduler is the element that continuously analyzes the available information (energy, minimum points, workload predictions, and phases) and determines the ideal configuration to apply at every reconfiguration point. This information is received from the monitor (integrated with FlexMPI), the profiler, and the prediction model. The Profile modeler collects energy time series and uses them to model the energy that the applications consume under different configurations (number of processes). The reconstruction is done using mathematical methods (i.e., interpolation) and the results are sent to the scheduler.

The FlexMPI controller is in charge of receiving the desired malleability configurations, sending the malleability commands to the processes (expanding or shrinking them), and updating the frequency and voltage of the processors. This component interacts with two elements: the FlexMPI malleability driver [7] and Dynamic Voltage and Frequency Scaling (DVFS). DVFS is a technique used to reduce energy consumption in digital systems by scaling down the voltage and frequency of the processors.

3 Malleable Resource Scheduling for Energy Saving

There are two principal components in the proposed architecture: an energy-aware scheduler that determines the application configuration and DVFS value at run-time, and an energy profile modeler that supports the scheduler's decision-making process. This section describes both of them.

3.1 Malleability Scheduler

The malleability scheduler is in charge of analyzing all the available information in real-time (energy, execution times, application phases, etc.). The scheduler calculates which is the best configuration (frequency and the number of processes) to apply in order to reduce energy or execution time.

From the scheduler’s point of view, making the best decision requires collecting system and application performance from historical records, data from the current execution, and predictions of future performance based on models. Note that in the proposed framework, runtime energy and performance metrics are collected by FlexMPI, which integrates a performance and energy monitor, and these measurements are sent to the scheduler for decision-making.

In the cost function depicted in Eq. 3.1, \(E(NP, freq)\) and \(T(NP, freq)\) are the energy and execution time of a specific configuration of the number of processes and frequency (NP and freq, respectively), and \(E_{max}\) and \(T_{max}\) are the maximum existing values of both parameters. \(W_1\) and \(W_2\) are normalized weights, with \(W_1 + W_2 = 1\). The scheduling algorithm can find multiple solutions (by minimizing the cost function) according to the weight parameters. For instance, when the user desires to balance the energy and execution time values, both weights should be similar. If it is necessary to minimize the energy, then \(W_1\) should be higher (and vice-versa if the goal is only to minimize the execution time).

$$\begin{aligned} C(NP,freq) = W_1 \frac{E(NP,freq)}{E_{max}} + W_2 \frac{T(NP,freq)}{T_{max}} \end{aligned}$$
(3.1)

The scheduling algorithm consists of finding the minimum value of the cost function across all the existing combinations of the number of processes and DVFS values. Given that not all energy data is available, we propose an energy profile modeler that reduces the complexity and the amount of monitoring information needed by the scheduler.
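As a minimal sketch of this exhaustive minimization (assuming energy and time measurements are available as dictionaries keyed by configuration; the helper name and toy values are hypothetical), the selection based on Eq. 3.1 can be written as:

```python
def best_configuration(energy, time, w1=0.5, w2=0.5):
    """Return the (NP, freq) pair that minimizes the cost of Eq. 3.1.

    `energy` and `time` map (NP, freq) tuples to measured Joules and seconds.
    The cost normalizes each term by its maximum observed value.
    """
    e_max = max(energy.values())
    t_max = max(time.values())
    cost = lambda cfg: w1 * energy[cfg] / e_max + w2 * time[cfg] / t_max
    return min(energy, key=cost)

# Toy example: 4 processes at 1.4 GHz uses the least energy,
# while 8 processes at 2.4 GHz is the fastest configuration.
energy = {(4, 1.4): 500.0, (8, 2.4): 900.0}
time = {(4, 1.4): 120.0, (8, 2.4): 60.0}
best_configuration(energy, time, w1=0.9, w2=0.1)  # → (4, 1.4): energy wins
best_configuration(energy, time, w1=0.1, w2=0.9)  # → (8, 2.4): time wins
```

Shifting the weights flips the chosen configuration, which is exactly the user-controlled trade-off the cost function is designed to expose.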

3.2 Energy Profile Modeler

This section introduces the mathematical method employed to implement the energy modeler. This method approximates the information related to the energy consumption of the application, allowing the system to select the most appropriate setup based on the number of processes and the DVFS value (NP and freq). The existing combinations can be seen as two surfaces, with the x- and y-axes ranging over the possible NP and freq values, and the z-axis holding the corresponding application energy or execution time.

The main idea behind this proposal is to monitor the application performance using a limited number of configurations under certain NP and freq values. We denote each measurement as a sample. For each sample, the application’s energy consumption and execution time are collected. The mathematical method reconstructs the energy and execution time surfaces by means of linear interpolation techniques. Algorithm 1 describes the sampling process. Initially, we collect samples using the maximum and minimum number of processes while considering multiple ranges of freq values. We have observed that the overhead of updating the DVFS values is negligible. In contrast, adapting the number of processes has a more relevant impact, related not only to process creation or destruction but also to data redistribution. In the following iterations, new samples are collected based on the maximum difference between the application sample and the model. This process enables the refinement of the interpolation. Algorithm 2 depicts the interpolation algorithm that reconstructs the surfaces based on the existing samples.
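A minimal one-dimensional sketch of this refinement loop (for a fixed freq, with a hypothetical `measure` function standing in for a real FlexMPI sample, and a synthetic energy curve chosen only for illustration) could look as follows:

```python
import numpy as np

def measure(n_procs):
    """Hypothetical stand-in for sampling one configuration's energy (J)."""
    return 2000.0 / n_procs + 15.0 * n_procs

def refine_profile(np_min, np_max, iterations):
    """Iteratively sample where the linear model deviates most from reality."""
    # Initial samples: minimum and maximum number of processes.
    samples = {np_min: measure(np_min), np_max: measure(np_max)}
    for _ in range(iterations):
        xs = sorted(samples)
        ys = [samples[x] for x in xs]
        grid = np.arange(np_min, np_max + 1)
        model = np.interp(grid, xs, ys)           # linear interpolation
        errors = {int(x): abs(m - measure(x)) for x, m in zip(grid, model)}
        worst = max(errors, key=errors.get)       # largest model/sample gap
        samples[worst] = measure(worst)           # refine the model there
    return samples

profile = refine_profile(1, 24, iterations=4)     # 2 initial + 4 refined samples
```

Each iteration adds exactly one sample at the point of maximum disagreement, so the model converges quickly near the curved, low-NP region while the nearly flat tail needs few samples.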

3.3 FlexMPI Support

In this work, we leverage the malleability support provided by FlexMPI to collect multiple sample points (number of processes and DVFS values) during the application execution. The system performs two operations. First, it collects the performance metrics of the current configuration of the execution. Second, once it has enough information for that configuration, it uses malleability to reconfigure the application and analyze another sample (expanding or shrinking the number of processes and changing the DVFS value). Following this approach, the system is able to execute multiple analyses during one execution of the real use case. By leveraging malleability, our proposal is able to build the models in the early execution stages, reducing data collection time and making the models available during the first application execution. Note that this approach also consumes less energy than running independent executions varying both the number of processes and the DVFS value. We assume that the performance behavior at the beginning of the application is representative of the whole execution.
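The order in which configurations are visited can be sketched as a simple generator (a hypothetical illustration of the boundary-NP-first sampling policy described above, exploiting the fact that DVFS changes are far cheaper than process reconfigurations):

```python
def sampling_schedule(np_values, freq_values):
    """Yield (NP, freq) configurations to visit during a single run.

    Process-count changes are expensive (creation/destruction plus data
    redistribution), so NP is reconfigured only twice: all DVFS values
    are swept at each boundary NP before moving on.
    """
    for n in (min(np_values), max(np_values)):
        for f in freq_values:
            yield (n, f)

schedule = list(sampling_schedule(range(1, 25), [1.2, 1.4, 2.4]))
# visits (1, 1.2), (1, 1.4), (1, 2.4), then (24, 1.2), (24, 1.4), (24, 2.4)
```

This keeps the number of costly malleability operations constant regardless of how many frequency steps are profiled.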

Algorithm 1. Sampling process.
Algorithm 2. Interpolation algorithm.

4 Evaluation

The evaluation has been carried out on a bare-metal cluster whose compute nodes feature Intel(R) Xeon(R) Gold 6212U processors with 24 cores and 330 GB of RAM. The nodes are connected by 10 Gbps Ethernet. A FlexMPI-based implementation of the Jacobi method is used. Our implementation includes a set of tunable parameters that modify the behavior of the code, generating different performance profiles of CPU, I/O, and network usage. For this reason, we have designed a set of configurations that cover a wide range of performance patterns. In the following, we describe the main use cases:

  • Use case 1 (UC1): CPU-intensive application with high data locality.

  • Use case 2 (UC2): CPU-intensive application with low data locality.

  • Use case 3 (UC3): Combination of CPU and communication intensive phases.

  • Use case 4 (UC4): Combination of CPU and I/O intensive phases.

  • Use case 5 (UC5): Combination of CPU, I/O, and communication intensive phases.

To obtain execution profiles, the application has been run in multiple iterations. Each iteration includes all the executions combining the entire frequency range (from 1.2 GHz to 2.4 GHz) and all the available cores (from 1 to 24). The energy metric represents the average energy of the existing samples.

In Fig. 2, we can observe the energy profile for UC1, which is very similar to the energy profiles of UC2 and UC4 (note that we do not include all the figures due to space constraints). We can state that increasing the number of processes reduces the overall consumed energy. However, there is a point after which the reduction is very small and the profile looks like a flat surface. Figure 3 plots the energy profile for the fifth use case, UC5, which is a combination of UC3 and UC4, and that is why it is also very similar to UC3. This profile includes features of UC4 towards the end of the figure because the I/O intensity is dominant there. As the frequency increases, the energy consumption increases as well, and the minimum energy point occurs at a frequency of 1.4 GHz. Regarding the number of processes, we can see that the more processes are included, the higher the energy consumption.

4.1 Energy Modeler Accuracy

In this section, we analyze the experimental results of the energy profile modeler. The comparison between the real and the interpolated models takes into account: (1) how the surface is refined during the modeler’s iterative process, and (2) the differences between the real and the modeled surfaces. In these examples, the interpolation is performed in only four iterations because the results after the fourth iteration were already very similar to the real profile.

Fig. 2.
figure 2

Generalized energy profile to represent the use cases UC1, UC2 and UC4. Note that the colors on the graphs identify the energy range to which the points belong.

Fig. 3.
figure 3

Generalized energy profile that represents the use cases UC3 and UC5. Note that the colors on the graphs identify the energy range to which the points belong.

Figures 4 and 5 plot the interpolated energy profile model for UC1. It is important to highlight that in the first iteration, the profile is a plane defined by only two points. It is then divided in two for the second iteration by taking a midpoint and, as more data from new iterations is included, it becomes closer to the actual CPU profiles. If we consider the differences with the real profile, we can see that in the first iteration there are differences between 100 J and 400 J. After four iterations, the maximum difference is 180 J between 2 and 4 processes and decreases to 0.03 J between 4 and 20 processes.

Fig. 4.
figure 4

Interpolated energy profile with 1 iteration refinement for use cases UC1, UC2 and UC4. Quantitative results are included in Tables 1 to 4.

Fig. 5.
figure 5

Interpolated energy profile with 4 iteration refinement for use cases UC1, UC2 and UC4. Quantitative results are included in Tables 1 to 4.

Fig. 6.
figure 6

Interpolated energy profile with 1 iteration refinement for use cases UC3 and UC5. Quantitative results are included in Tables 1 to 4.

In Figs. 6 and 7, we can see the modeled surface for UC3, which is very similar to UC5. Considering the profile with four iterations, the maximum difference is 190 J at the lowest process counts and 13 J at the highest. In this case, the difference can have a greater influence because the minimum energy point location is unclear. Following the same reasoning as in the previous analysis, considering the lowest energy point, the application spends 531.51 J per iteration. Considering the maximum saving, we observe a saving of 59.50% per iteration (max. 1310.59 J per iteration). Compared to the real surface, the model selects a non-optimal configuration with a cost 9.50% higher than the one obtained from the real surface.

In order to show the overall results, Table 1 summarizes the scheduling metrics based on minimizing only energy consumption. Table 2 depicts the scheduling results based on the cost function giving more weight to energy. Next, Table 3 shows the scheduling results based on the cost function with the same weight for energy and execution time. Finally, Table 4 shows the scheduling results based on the cost function prioritizing execution time.

Fig. 7.
figure 7

Interpolated energy profile with 4 iteration refinement for use cases UC3 and UC5. Quantitative results are included in Tables 1 to 4.

Table 1. Summary table of scheduling results based on minimizing energy.
Table 2. Summary table of scheduling results based on cost function with more weight in energy.
Table 3. Summary table of scheduling results based on cost function with the same weight in energy and execution time.
Table 4. Summary table of scheduling results based on cost function with more weight in execution time.

Regarding the interpolation, for some profiles the energy saving is the same as with the real profile, and for others it is slightly lower. However, this difference is compensated because the interpolation applies the best configuration earlier in the execution. On the other hand, the scheduling results obtained using the cost function show savings similar to those of plain energy minimization. However, taking into account that some results are equal, or even slightly lower, we consider this method better because it allows the user to adjust the target of the cost function: saving energy or execution time.

5 Conclusion

In this work, we have introduced a dynamic energy-profile scheduler for MPI-based applications that integrates the FlexMPI runtime and application modeling at run-time. The scheduler exploits the previously built models to determine the best configuration (DVFS value and number of processes) for each application in order to reduce energy consumption. We have also completed an evaluation on a real platform, which demonstrates that our proposal can minimize either the energy consumption or the execution time of the scheduled application. Finally, we are working on a machine learning model capable of predicting, in near-real-time and given the execution information (wall-clock time, energy, data used), which phase of the application execution comes next. With this information, we will be able to enrich the scheduler with a global vision of which phase is going to be executed in the future, improving the scheduling decisions.