Evaluating Scalability and Efficiency of the Resource and Job Management System on Large HPC Clusters

Georgiou, Yiannis; Hautreux, Matthieu

doi:10.1007/978-3-642-35867-8_8


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7698)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing


Abstract

The Resource and Job Management System (RJMS) is the middleware in charge of delivering computing power to applications in HPC systems. The increasing number of computational resources in modern supercomputers brings new levels of parallelism and complexity. To maximize global throughput while keeping applications efficient, the RJMS must deal with issues such as manageability, scalability, and network topology awareness. This paper evaluates the SLURM RJMS with respect to these issues. It presents studies performed to evaluate, adapt, and prepare the configuration of the RJMS so that it can efficiently manage two Bull petaflop supercomputers installed at CEA, Tera-100 and Curie. The studies evaluate the capability of SLURM to manage large numbers of compute resources and jobs, as well as to provide an optimal placement of jobs on clusters that use a tree interconnect topology. The experiments presented in this paper are conducted on both real-scale and emulated supercomputers, using synthetic workloads. These workloads are derived from the ESP benchmark and adapted to the evaluation of the RJMS internals. Emulations of larger supercomputers are performed to assess the scalability of SLURM and its direct eligibility to manage larger systems.

Author information

Authors and Affiliations

BULL S.A.S, France
Yiannis Georgiou
CEA-DAM, France
Matthieu Hautreux

Editor information

Editors and Affiliations

Google, 1600 Amphitheater Parkway, 94043, Mountain View, CA, USA
Walfredo Cirne
Mathematics and Computer Science Division, Argonne National Laboratory, Bldg 240, 60439, Argonne, IL, USA
Narayan Desai
Facebook Inc., 1601 Willow Road, 94025, Menlo Park, CA, USA
Eitan Frachtenberg
Robotics Research Institute, TU Dortmund, Otto-Hahn-Str. 8, 44227, Dortmund, Germany
Uwe Schwiegelshohn

About this paper

Cite this paper

Georgiou, Y., Hautreux, M. (2013). Evaluating Scalability and Efficiency of the Resource and Job Management System on Large HPC Clusters. In: Cirne, W., Desai, N., Frachtenberg, E., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2012. Lecture Notes in Computer Science, vol 7698. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35867-8_8

DOI: https://doi.org/10.1007/978-3-642-35867-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35866-1
Online ISBN: 978-3-642-35867-8
eBook Packages: Computer Science, Computer Science (R0)
