1 Introduction

A task efficiency monitoring system is essential for detecting incorrectly launched calculations that use cluster resources inefficiently. This paper describes a new task performance monitoring system, HPC TaskMaster, developed at the HSE University for the cHARISMa (Computer of HSE for Artificial Intelligence and Supercomputer Modeling) cluster.

The developed system allows users to view reports on the performance of their tasks together with interactive graphs of task execution, and it automatically identifies tasks that ran inefficiently. With access to the results of this analysis, users can run their tasks more efficiently in the future, which saves cluster machine time.

In addition, the system allows the administrators of the cluster to collect statistics about user tasks, which were previously unavailable.

The most common examples of the inefficient usage of cluster resources are:

  • allocation of insufficient or excessive resources for a task;

  • running a non-parallel task on multiple CPU cores or GPUs;

  • allocation of the compute node capacity without starting calculations.

The following requirements were defined for the design of the task performance monitoring system.

  1. The system should collect the following data for each task:

    • utilization of specific CPU cores allocated for the task;

    • utilization of GPUs allocated for the task;

    • GPU memory utilization;

    • GPU power consumption;

    • utilization of RAM by the task;

    • file system usage.

  2. The system must analyze the collected data and use it to determine whether the task ran efficiently.

  3. The system must provide users with access to the list of completed tasks and reports on their completion using a web application.

The rest of this paper is organized as follows. A comparison of different monitoring systems is carried out in Sect. 2. In Sect. 3, the architecture of the system is described. The detection of inefficient user tasks is considered in Sect. 4. User statistics are provided in Sect. 5. Finally, Sect. 6 shows the conclusions of this work.

2 Related Work

The key feature of the HSE cluster is how it allocates resources for user tasks. Instead of allocating the entire compute node for one task, the user is given a certain number of processor cores and GPUs. As a result, several dozen tasks can run on one compute node at once, which improves the utilization of cluster resources. Due to this feature, ready-made solutions for monitoring system resources, such as Nagios and Zabbix, are not suitable for this cluster. cHARISMa already has a monitoring system of its own [4]; however, it is designed to display only the global usage across the whole cluster and its nodes.

Since one of the HSE University goals is to provide cluster users with a secure system in the HSE University environment, a new monitoring system was built using open-source monitoring tools. Chan [3], Wegrzynek [11], Kychkin [6], and Safonov [10] describe how a combination of programs such as Telegraf, InfluxDB and Grafana allows one to quickly set up and run a cluster resource monitoring system. In [2, 3], it is also described how the Slurm plugin acct_gather makes it possible to collect metrics for Slurm tasks, which is precisely the data required for a task efficiency monitoring system. Since all programs, except Telegraf, are already installed on cHARISMa, this approach can be used to monitor tasks on the cluster.

The development of LIKWID Monitoring Stack [9], a task monitoring system using InfluxDB, Grafana and built-in LIKWID tools for monitoring tasks on the cluster, also deserves attention. For each task, a dashboard is created from ready-made JSON templates, which allows creating personalized graphs for each task. The disadvantages of using the LIKWID Monitoring Stack on the HSE cluster include the need to use LIKWID tools for the system to operate and the lack of a web interface for the system in addition to Grafana, which makes the system inconvenient to use on a cluster with a large number of users and tasks.

In addition to monitoring cluster resources, the system must analyze the effectiveness of user tasks. A well-known system for creating reports on the effectiveness of tasks is JobDigest [7, 8]. It analyzes the collected integral values and, based on them, applies a tag to the task describing the property of the task (for example, “low GPU utilization”). Although using tags is convenient for searching and filtering tasks, it is not always possible to provide an overall picture of the effectiveness of the task using tags alone.

Summarizing all the above, we can conclude that there is no ready-made task monitoring system that fits the individual characteristics of the cHARISMa cluster and can be integrated into the HSE University environment. It is necessary to develop our own software system for evaluating the effectiveness of tasks, which can be flexibly configured for specific types of user tasks, restrict access for cluster users, and take into account the compliance of tasks with registered scientific and educational projects. The open-source software Telegraf, InfluxDB and Grafana is a suitable basis for the system.

3 System Architecture

This section describes the monitoring infrastructure of the HPC TaskMaster system, shown in Fig. 1.

Fig. 1. Diagram of the system components

The Slurm task scheduler is used to run tasks on the cluster. The main data of Slurm tasks is stored in the MySQL relational database using the Slurm database daemon (slurmdbd), and the task metrics are written to the InfluxDB time series database using the plugin acct_gather. This plugin collects memory and filesystem usage (read/write) for each task.

The required utilization metrics for specific CPU cores and GPUs are collected with the Telegraf daemon, which has built-in plugins for these metrics. Thus, having the CPU and GPU IDs assigned to the task, the system can collect metrics for these components and, therefore, distinguish the utilization of different tasks on one node. Additional metrics are collected using custom plugins written in Python.
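As a rough illustration of what such a custom plugin might look like, the sketch below formats one sample in the InfluxDB line protocol, which Telegraf can parse from the standard output of an external script (for instance through its exec input plugin). The measurement name `gpu_power`, the tag keys and the helper names are hypothetical; the actual plugins of HPC TaskMaster are not described in this paper.

```python
import time


def escape_tag(value: str) -> str:
    """Escape commas, equals signs and spaces, which are special in tags."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one sample as: measurement,tag=value field=value timestamp."""
    tag_part = ",".join(
        f"{escape_tag(str(k))}={escape_tag(str(v))}"
        for k, v in sorted(tags.items())
    )
    # All field values are written as floats to keep the field type stable.
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical sample: power draw of GPU 0 on node cn-001, in watts.
    print(to_line_protocol("gpu_power", {"host": "cn-001", "gpu": "0"},
                           {"watts": 251.5}, time.time_ns()))
```

Tagging every sample with the host and the device ID is what would allow the samples to be matched later with the CPU and GPU IDs assigned to a task.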

The collected metrics are stored in the InfluxDB database. InfluxDB was chosen as a time-series database because of Telegraf support and Slurm acct_gather plugin support, which allows one to store all the required metrics in one database.

Grafana is used as a tool for visualizing graphs on the cHARISMa cluster. Grafana provides extensive options for configuring and formatting charts and also supports creating them through its API. This API allows automating the creation of graphs for each task. New graphs for each task are created using JSON templates. When the user requests them, the graphs are built automatically in Grafana from the available data about the task. The created graphs are embedded in the system's website in an iframe, where the user can interactively view the graphs for the period of task execution. The system creates graphs for both completed and running tasks, so users can observe the work of their tasks in real time.
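The template-based creation of graphs can be sketched roughly as follows. This is an illustration under assumptions, not the system's actual code: the template is heavily reduced, the names `TEMPLATE`, `build_payload` and `embed_url` are hypothetical, and the request body and the `d-solo` panel URL follow the Grafana HTTP API and panel-embedding conventions as we understand them.

```python
import copy

# Hypothetical, heavily reduced dashboard template; real templates also
# carry data-source queries and panel layout.
TEMPLATE = {
    "id": None,
    "uid": None,
    "title": "Task {job_id}",
    "panels": [{"id": 1, "type": "timeseries", "title": "CPU utilization"}],
}


def build_payload(template, job_id):
    """Fill the template and wrap it in the body assumed for the
    dashboard create/update endpoint (POST /api/dashboards/db)."""
    dashboard = copy.deepcopy(template)  # never mutate the shared template
    dashboard["uid"] = f"task-{job_id}"
    dashboard["title"] = template["title"].format(job_id=job_id)
    return {"dashboard": dashboard, "overwrite": True}


def embed_url(base_url, uid, panel_id, start_ms, end_ms):
    """URL of a single panel, usable as the src attribute of an iframe.
    The time range is given in epoch milliseconds."""
    return (f"{base_url}/d-solo/{uid}/task?panelId={panel_id}"
            f"&from={start_ms}&to={end_ms}")
```

Deriving the dashboard UID from the task ID keeps the operation idempotent: requesting the graphs of the same task twice overwrites the dashboard instead of creating a duplicate.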

The advantage of using a combination of Telegraf, InfluxDB and Grafana is the ability to install and configure these tools on any cluster. Moreover, these tools make the monitoring system quite flexible – additional data for the system can be collected using the built-in plugins of Telegraf or developed ones.

The HPC TaskMaster system has a negligible impact on the performance of compute nodes: the installed Telegraf daemon uses only 0.03% of the overall CPU performance. In addition to Telegraf, another source of load on the cluster is InfluxDB. Installed on the head node, InfluxDB uses an average of 5 GB of storage per month. To free up storage, a retention policy that compresses metrics older than 6 months is used.

The HPC TaskMaster system is developed on Django, a Python web framework that has a large number of available packages and a wide range of tools for developing web applications, which allows one to develop a monitoring system using Telegraf, InfluxDB and Grafana. In addition, Django has a built-in administration panel through which the administrator can configure the monitoring system without making changes to the source code of the program.

The task performance monitoring system works according to the following principles:

  • metrics are collected on each compute node using Telegraf and stored in the InfluxDB database on the head node. Metrics from the acct_gather plugin are also stored in InfluxDB;

  • the system updates its local MySQL database by comparing its tasks with those from the Slurm database;

  • while the task is running, aggregated metrics are collected for it from the InfluxDB database at a fixed interval;

  • when the task completes, its aggregated metrics are collected one final time;

  • the collected aggregated metrics are analyzed by the system, and an inference about the efficiency of the task is generated.
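The synchronization and collection steps in the list above can be sketched as a pure planning function. This is a minimal sketch under assumptions: the function name, the dictionary layout and the set of finished states are hypothetical, and the real system reads both sides from MySQL rather than from in-memory dictionaries.

```python
# Slurm job states treated here as "finished" (an assumed subset).
FINISHED = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def plan_updates(local_states, slurm_states):
    """Compare local task states with those reported by Slurm.

    Both arguments map a job ID to its state string. The result lists
    which tasks to create locally, which running tasks need their
    aggregated metrics refreshed, and which tasks have just finished
    and need one final collection.
    """
    plan = {"create": [], "refresh": [], "finalize": []}
    for job_id, state in sorted(slurm_states.items()):
        if job_id not in local_states:
            plan["create"].append(job_id)
        if state == "RUNNING":
            plan["refresh"].append(job_id)
        elif state in FINISHED and local_states.get(job_id) != state:
            plan["finalize"].append(job_id)
    return plan


local = {1: "RUNNING", 2: "RUNNING", 3: "COMPLETED"}
slurm = {1: "RUNNING", 2: "COMPLETED", 3: "COMPLETED", 4: "RUNNING"}
print(plan_updates(local, slurm))
```

In this example, task 4 is new, tasks 1 and 4 are refreshed, and task 2 is finalized; task 3 is skipped because its final state is already stored locally.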

4 Detecting Inefficient Tasks

The user interacts with the HSE high-performance computing cluster [4] by launching tasks through the Slurm workload manager. A task is a set of user processes for which the workload manager allocates computing resources (compute nodes, CPUs, GPUs, etc.). Each launch of the user’s program generates a new task, which is recorded in the database and analyzed.

Here we define task efficiency as the usage of allocated resources above a certain threshold.

4.1 Collected Data

HPC TaskMaster collects two types of data about running tasks on the HPC cluster:

  1) parameters characterizing the running task;

  2) metrics that characterize the execution of the task.

Parameters. Table 1 shows the task parameters and their type.

Table 1. Parameters of the task

Metrics

Table 2 shows the metrics collected during the execution of the task. Each metric forms a time series \(\theta _i\). \(\Theta = \{\theta _i\}\) denotes the set of all time series of the task.

The frequency of collecting metrics can be adjusted and selected in such a way as to obtain sufficiently detailed information about the task without overloading the system with data collection and storage.

Table 2. Collected metrics and collection frequency

4.2 Data Processing

Aggregated Metrics

To simplify the analysis, aggregated metrics \(\varLambda ^k=(\lambda _1^k,\,\cdots ,\,\lambda _m^k)\) are calculated for each time series [5]. They include the minimum, maximum, average, median and standard deviation. In addition to them, the tuple \(\varLambda ^k\) includes the average load of each node and the combined average load of the nodes.
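A minimal sketch of this aggregation step, using only the Python standard library, is given here. The function names are hypothetical, and two details are assumptions because the text does not fix them: the population standard deviation is used, and the combined load is taken as the mean of the per-node means.

```python
import statistics


def aggregate(series):
    """Aggregated metrics of one time series (a list of samples)."""
    return {
        "min": min(series),
        "max": max(series),
        "mean": statistics.fmean(series),
        "median": statistics.median(series),
        # Assumption: population (not sample) standard deviation.
        "std": statistics.pstdev(series),
    }


def node_averages(per_node):
    """Average load of each node and the combined average of the nodes.

    per_node maps a node name to its list of load samples.
    """
    per_node_mean = {node: statistics.fmean(s) for node, s in per_node.items()}
    # Assumption: the combined load is the mean of the per-node means.
    combined = statistics.fmean(list(per_node_mean.values()))
    return per_node_mean, combined


print(aggregate([0.0, 50.0, 100.0]))
print(node_averages({"cn-001": [80.0, 100.0], "cn-002": [40.0, 60.0]}))
```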

Tags

Since the task parameters are a heterogeneous set of data (integers, strings, dates), to simplify their analysis, a system of tags, i.e., “labels” indicating the type of task, execution time, and other properties of the task, is introduced. Table 3 contains a list of tags currently available in the system. Additional tags can be developed and implemented into the system.

Table 3. List of tags

The tuple \(T^k=(\tau _1^k,\dots , \tau _n^k)\) is assigned to the task with the ID k, where n is the number of tags in the system. The element \(\tau _i^k\) is the indicator of the i-th tag: it takes the value 1 if all conditions of the tag are met and the tag is assigned to the task, and 0 otherwise.

Indicators

To determine whether a task is working inefficiently, it is necessary to evaluate the utilization of the components involved in the task. To do this, the concept of problem indicators is introduced.

Indicators are dimensionless values that decrease linearly as the value of the corresponding metric grows.

Indicators take a value from 0 (the allocated resources are fully used) to 1 (the allocated resources are not used at all). For example, the value of the indicator \(l_j^k\) is calculated from the aggregated metric \(\lambda _j^k\in \varLambda ^k\) using formula (1).

$$\begin{aligned} l_j^k = 1 - \frac{\lambda _j^k - {a}_j}{{b}_j-{a}_j},\quad l_j^k\in [0,\,1], \end{aligned}$$
(1)

where \({a}_j,\;{b}_j\) are the administrator-defined parameters referring to the minimum and maximum possible values of the j-th element of the aggregated metrics.
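Formula (1) can be sketched in a few lines of Python. The function name is hypothetical, and clamping the result to the interval [0, 1] is an assumption about how values outside the administrator-defined range are handled.

```python
def indicator(value, a, b):
    """Indicator from formula (1): 0 means full use, 1 means no use.

    value is the aggregated metric; a and b are its administrator-defined
    minimum and maximum possible values.
    """
    if b <= a:
        raise ValueError("b must be greater than a")
    raw = 1.0 - (value - a) / (b - a)
    # Assumption: metrics outside [a, b] are clamped so that the
    # indicator stays within [0, 1].
    return min(1.0, max(0.0, raw))


# Average GPU utilization of 25% on a 0..100% scale.
print(indicator(25.0, 0.0, 100.0))  # 0.75
```

For instance, an average utilization of 25% with \(a_j = 0\) and \(b_j = 100\) gives \(l_j^k = 1 - 25/100 = 0.75\), that is, a task that leaves most of the allocated resource idle.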

Indicators are placed in the tuple of indicators \(L^k = (l_1^k,\dots ,\,l_m^k)\).

The list of currently available indicators is presented in Table 4. Additional indicators can be developed and implemented into the system. The number of indicators for a specific task depends on the number of cores, compute nodes and GPUs used.

Table 4. List of indicators

4.3 Inferences

To help users to interpret the results, the system has a set of inferences \(\varPhi =(\phi _i)\). Inferences are the result of the analysis of the task.

Different requirements for tags and indicator values are set for each inference. An inference is assigned to the task when all the conditions are met. Several inferences can correspond to one task at once.

Denote the concatenation of the tuples of indicators \(L^k\) and tags \(T^k\) as

$$\begin{aligned} N^k=( l_1^k,\,\dots ,\,l_m^k,\,\tau _1^k,\,\dots ,\,\tau _n^k). \end{aligned}$$
(2)

Let \(\varOmega _i\) be the set of conditions on the elements of the tuple \(N^k\) under which the inference \(\phi _i\) is assigned.

Then we can associate the set \(C^k\) with each task:

$$\begin{aligned} C^k = \{\phi _i\in \varPhi :\, \prod _{\omega \in \varOmega _i}\mathbbm {1}_\omega (N^k) = 1\}, \end{aligned}$$
(3)

where \(\mathbbm {1}_\omega (N^k)\) is the indicator function equal to 1 if the condition \(\omega \in \varOmega _i\) is met by \(N^k\) and 0 otherwise. In other words, the set \(C^k\) contains the inferences assigned to the task.
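The selection rule in formula (3) amounts to keeping every inference whose conditions all hold. The sketch below illustrates this; the inference names, the keys of the tuple and the threshold 0.5 are hypothetical and are not taken from the system's tables.

```python
def assigned_inferences(n, rules):
    """Return the set C of inferences whose conditions all hold for n.

    n maps indicator and tag names to their values (the tuple N^k);
    rules maps an inference name to its list of conditions (Omega_i),
    each condition being a predicate over n.
    """
    return {
        name
        for name, conditions in rules.items()
        # all(...) plays the role of the product of indicator functions.
        if all(condition(n) for condition in conditions)
    }


# Hypothetical rules: indicators close to 1 mean unused resources.
RULES = {
    "efficient": [
        lambda n: n["tag_failed"] == 0,
        lambda n: n["cpu_avg"] <= 0.5,
        lambda n: n["gpu_avg"] <= 0.5,
    ],
    "gpu_underused": [
        lambda n: n["tag_failed"] == 0,
        lambda n: n["gpu_avg"] > 0.5,
    ],
}

print(assigned_inferences({"cpu_avg": 0.1, "gpu_avg": 0.2, "tag_failed": 0}, RULES))
```

Because each inference is tested independently, several inferences can be assigned to one task at once, which matches the description in this section.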

4.4 Example

Let us consider a computational task performed on the cHARISMa supercomputer using 176 cores and 16 NVIDIA Tesla V100 GPU accelerators on 4 compute nodes. Table 5 shows the parameters of the task.

Table 5. Parameters of the task

The aggregated metrics across all compute nodes for the example task are shown in Table 6.

Table 6. Aggregated metrics by node
Table 7. Aggregated metrics of compute node cn-001

Table 7 shows the aggregated metrics of the time series for compute node cn-001. Data for compute nodes cn-002, cn-003, cn-004 are not shown to save space.

Table 8. List of indicators

Tags of the Task

Based on the parameters of the task from Table 5 and the tags from Table 3, no tag will be assigned to task 405408, since it completed without an error and is not a launch of one of the packages. Therefore, the tuple of task tags will have the form \(T^{405408} = (0,\,0,\,0,\,0,\,0,\,0)\).

Indicators of the Task

Based on the data from Tables 6 and 7, the system calculates the values of the indicators shown in Table 8.

Inferences of the Task

After the previous steps, we get the tuple

\(N^{405408} = (l_1,\dots ,\,l_{202},\,\tau _1,\dots ,\,\tau _6)\).

As an example, let us consider the three inferences presented in Table 9.

Table 9. Inferences

Based on the tuple \(N^{405408}\), the system will associate the set \(C^{405408}=\{\phi _1\}\) with task 405408, since the task is executed without errors and all resources are used.

An example of the task report with an inference of inefficient salloc usage is shown in Fig. 2.

Fig. 2. Task report

5 User Statistics

System administrators have access to inference statistics for each cluster user for a selected period of time. An example of statistics is shown in Fig. 3. Using this pie chart, administrators can understand which types of tasks are causing difficulties for the user. Once the problem that the user has encountered is determined, the user can receive a personal consultation to solve it.

Fig. 3. Graphs of the utilization of computing resources by the task

Statistics of the most active users of the cluster with the lowest percentage of efficient tasks are compiled monthly, and personal consultations are held on the basis of these statistics. By tracking trends in user efficiency by month, we can assess how much the HPC TaskMaster system increases the efficiency of using cluster resources.

6 Conclusions

The developed task performance monitoring system, HPC TaskMaster, provides all the necessary information about tasks (main information, aggregated metrics, graphs, and inferences) in one place. The system helps users to identify problems with both existing scientific applications and applications they develop themselves, which simplifies work with the cluster and allows users to perform scientific calculations faster and more efficiently in the future.

HPC TaskMaster is constantly evolving and improving. Among the future directions for development are:

  • monitoring the effectiveness of individual categories of applications using machine learning tools;

  • adding new types of indicators and tags to generate new inferences;

  • smart recognition of the type of running application;

  • development of a module for notifying users when they launch inefficient tasks.

HPC TaskMaster is available to all cluster users of cHARISMa via the personal account of the supercomputer complex. HPC TaskMaster is also available for public use [1], and any suggestions for improving the project are greatly appreciated.

The research was performed using the cHARISMa HPC cluster of the HSE University [4].