1 Introduction

In recent years, computer clusters together with the Message Passing Interface (MPI) have proven effective for solving large simulation problems such as molecular dynamics [1,2,3], particle diffusion [4, 5], sinoatrial node cell synchronization [6], porous network behavior [7, 8], and machine learning [9, 10], among many others. This is mainly due to the decreasing price of computing infrastructure and its increasing processing capacity.

Clusters commonly start with a homogeneous infrastructure; however, as they are updated, new hardware is frequently incorporated, making them heterogeneous. Consequently, when a parallel application is executed on a cluster, it is convenient to use the most suitable nodes in order to reduce response time. This leads to a task allocation problem [11, 12], where the most appropriate processors must be selected to execute specific application processes.

Process allocation on cluster processors can be as simple as blindly assigning each process to a randomly selected core; however, response time may suffer if nodes with scarce available resources are selected. Another general allocation algorithm is the one used when running an MPI application, where processes are assigned to processing units in cyclic order, seeking load balance. When the number of processes exceeds the number of available processing units, more than one MPI process must be allocated to the same unit. This solution may be problematic if the slowest processor has to run processes that require a long processing time. On the other hand, many hardware characteristics and load-information criteria could be considered for processor selection, such as the number of logical cores (multithreading technology), the levels and sizes of the L1, L2, and L3 cache memories, the Random Access Memory (RAM) organization (Uniform Memory Architecture, UMA, or Non-Uniform Memory Architecture, NUMA), processing speed, communication latency, etc. However, considering all these elements and resources can result in a complex allocation algorithm that requires specific information provided by users about the behavior of their applications (for example, the communication graph or a processing-cost estimation) or about the cluster characteristics.

In this paper we propose PAARes (Process Allocation based on the Available Resources), an algorithm for process allocation on cluster nodes that takes into account the following resource information of each node: processor type (physical or logical cores, i.e., multithreading), frequency (processor speed), cache memory (all levels, for example L1, L2, and L3), and load (number of running processes). Based on this information, the nodes are ordered by their processing availability: nodes with higher availability are placed first, while overloaded nodes are placed at the end of the list. The PAARes algorithm works in three phases. First, it automatically and transparently gathers information about the resources of each node, so the user does not need to provide any information. Second, a set of keys is generated to compare the characteristics of different nodes. Third, an ordered node list is built and used to allocate the processes to the nodes.

To compare the performance of our approach, PAARes is integrated into the process distribution performed by OpenMPI and evaluated using the NPB (NAS Parallel Benchmarks) applications [13]. With PAARes, before each execution, the nodes with the lowest load or the highest processing capacity are selected in order to reduce the execution time of the MPI application.

The rest of the document is structured as follows: Sect. 2 presents the state of the art of process allocation algorithms. In Sect. 3, the proposed PAARes algorithm is described. Section 4 shows the experimental setup considered for comparing our proposal performance. In Sect. 5 the obtained results are presented, and finally, Sect. 6 includes the conclusions and future work.

2 Related works

In order to reduce the execution time of cluster applications, the process distribution problem has been studied in several works. Some blind allocation policies do not consider any information when choosing the processing unit where a process will be executed.

In [14, 15] processors are selected based on the requirements of the user application, such as a specific number of nodes, the number of processors per node, and a minimum amount of memory. If the required resources are available, one process is assigned to each processing unit to run the application; otherwise, the application is not executed. The main disadvantage of this allocation policy is that heavy processes are not necessarily assigned to the nodes with the highest processing capacity, but to those that merely comply with the requirements. Another example of this type of allocation is found in MPI implementations.

Most MPI distributions use by default a cyclic round-robin policy, where the assignment can be carried out by Processing Units (PU) or by nodes [16,17,18]. In an assignment by PU, one process is usually assigned per processor or core of a single node; if there are more processes to allocate, they are distributed to the next node, and so on until all processes are placed. In an assignment by nodes, one process is allocated per cluster node; if the number of processes is greater than the number of nodes, the procedure is repeated until all processes are placed. Both policies are blind, since they use a list of node IP addresses or node names provided by the user, which MPI reads sequentially to establish the process distribution order.
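For illustration, the following Python sketch reproduces these two cyclic policies; the node names and core counts are hypothetical, and actual MPI runtimes obtain this information from the user-provided hostfile rather than from code.

```python
# Hypothetical cluster: (node name, number of processing units per node).
# Real MPI runtimes read this information from a user-provided hostfile.
nodes = [("node1", 4), ("node2", 4), ("node3", 2)]

def assign_by_pu(nodes, n_procs):
    """Fill every processing unit of a node before moving to the next node,
    wrapping around when there are more processes than processing units."""
    slots = [name for name, pus in nodes for _ in range(pus)]
    return [(rank, slots[rank % len(slots)]) for rank in range(n_procs)]

def assign_by_node(nodes, n_procs):
    """Place one process per node in each round, cycling over the node list."""
    return [(rank, nodes[rank % len(nodes)][0]) for rank in range(n_procs)]

print(assign_by_pu(nodes, 6))    # ranks 0-3 on node1, ranks 4-5 on node2
print(assign_by_node(nodes, 6))  # ranks cycle over node1, node2, node3, ...
```

In both cases the assignment depends only on the order of the list, which is why these policies are described as blind.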

Other works [19, 20] rely on theoretical models, assuming prior knowledge of process behavior, such as the time at which a process enters the cluster and the execution time of each process. This information is used to define an execution configuration that reduces the average response time of the cluster processes. Although theoretical models seek optimal assignments, it is often difficult (or impossible) to obtain the information they require (such as the execution time of a process); besides, the optimal allocation of resources is known to be an NP-complete problem [21].

Approaches such as [22,23,24,25] consider the way in which the processes of an application communicate, allocating to the same processor the processes that communicate most frequently, thus decreasing network communications and reducing latency. In [22, 25] a parallel application is first executed at least once to trace the process interactions and build a communication graph; the processes are then assigned based on the level of communication they maintain, following the graph. In [23] the assignment of processes to cluster nodes considers the number of network cards in each node; processes are assigned per network card, seeking communication balance and avoiding bottlenecks. In [24] a logical tree is generated to represent the cluster hardware (nodes, cores per node, etc.), where the cores correspond to the leaves of the tree. Then, a communication map is defined representing the interactions between the application processes. Processes that communicate most form groups that are assigned to nodes based on the generated tree; assigning them to the same node strengthens local communication and reduces its cost. Although this solution helps reduce latency, it is not always practical, as it requires the user to provide information about how the processes interact, which is not always available or easy to obtain.

We can also find assignment policies that analyze more specific information about the available hardware in order to decide the process assignment and improve application performance. Some MPI distributions (for example [26,27,28,29]) consider the hardware of a cluster, allowing a range of possible process assignments based on the characteristics and quantity of cluster nodes, processors, cores, cache memory levels L1, L2, or L3 (considering only their presence, not their sizes), or the memory architecture of the nodes (UMA or NUMA). While these works consider different aspects of the hardware, they have two main disadvantages. The first is that these assignments depend heavily on the specification of the required hardware; for example, if a process requires an L3 cache in order to run on recent computing hardware, nodes without that cache level will not be considered for collaboration even though they may offer the best available processing resources; moreover, developers must know and indicate the hardware characteristics required for execution. The second disadvantage is that, in a multiprogramming environment where a node executes processes from different applications, the assignment made by these MPI distributions does not consider the overload that may exist on some nodes.

In this paper we propose a process allocation strategy called PAARes (Process Allocation based on the Available Resources), which builds a list in which the nodes of a dedicated or non-dedicated cluster are sorted by their available processing capacity. This is achieved by collecting hardware resource and load-state information from each node to determine its processing capacity. The assignment of processes to nodes is then guided by the ordered node list. The next section describes this algorithm in detail.

3 Process allocation proposal

The purpose of this algorithm is to decrease the execution time of parallel applications by determining a process allocation based on a properly ordered list of cluster nodes. Considering each node's available processing and cache capacities, the nodes with higher processing capacity are placed at the beginning of the list, while the slower nodes are located at the end. Algorithm 1 shows the general PAARes procedure to be performed when an MPI parallel application is launched; its list parameter contains the names of the existing cluster nodes.

[Algorithm 1: general PAARes process allocation]

The PAARes strategy is carried out in three main phases: information gathering, information analysis, and process allocation. The first two phases correspond to lines 2–5 of Algorithm 1; in them, a view of the characteristics and load state of the nodes is obtained and analyzed to quantify their processing availability through the initialization of four key values per node. The third phase is performed by lines 4–6, which order the nodes according to their key values and finally allocate the application processes. Each of these stages is described below.

3.1 Information gathering

A fundamental phase of the PAARes algorithm is information gathering, which measures the resource usage and workload state of each node in the cluster. PAARes works on Linux clusters and gathers the following information from each node:

  1. The number of processing cores

  2. The maximum core frequency in MHz

  3. Cache memory characteristics

  4. The number of running processes.

Algorithm 2 details the information gathered to estimate the available resources of a node. This includes the number of physical cores per node (line 2), the number of logical cores (line 3), the maximum core frequency (line 4), the amount of cache memory at the different levels (line 5), and the number of processes running on the node that use more than 5% CPU (line 6), since most system processes use less than 5%.

\(L_ {i}\) is an array that stores the amount of memory in each particular level of cache; currently, most nodes have three cache levels: L1, L2, and L3. For simplicity, all cache memories are considered to have the same frequency or speed, regardless of their level.

[Algorithm 2: node information gathering]

These data are extracted from operating system sources that describe the hardware. For example, the /proc/cpuinfo file and the \({\textit{lscpu}}\) or ps commands provide the number of cores and their frequencies (Fig. 1 shows an example of the /proc/cpuinfo output). This is not the only way to obtain resource information; tools for monitoring and querying system resources continue to be developed, for example \({\textit{hwloc}}\) and \({\textit{numactl}}\) [30, 31].

Fig. 1 Processor information obtained from /proc/cpuinfo
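The following Python sketch illustrates how this information could be gathered on a node using the lscpu and ps commands mentioned above. The lscpu field names are assumptions based on common Linux distributions, and the sketch is not the authors' exact implementation; cache sizes are kept as the strings reported by lscpu and would still need conversion to a common unit.

```python
import subprocess

def lscpu_fields():
    """Return the output of lscpu as a {field: value} dictionary."""
    out = subprocess.run(["lscpu"], capture_output=True, text=True).stdout
    fields = {}
    for line in out.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

def gather_node_info():
    """Collect, for the local node, the data used by PAARes (sketch only)."""
    cpu = lscpu_fields()
    sockets = int(cpu.get("Socket(s)", "1"))
    cores_per_socket = int(cpu.get("Core(s) per socket", "1"))
    threads_per_core = int(cpu.get("Thread(s) per core", "1"))
    nb_pcores = sockets * cores_per_socket
    # Logical cores beyond the physical ones (0 when multithreading is off).
    nb_lcores = nb_pcores * (threads_per_core - 1)
    frequency = float(cpu.get("CPU max MHz", "0"))
    # Cache sizes as reported by lscpu (e.g. "512 KiB"); unit conversion is
    # omitted from this sketch.
    caches = {"L1": cpu.get("L1d cache", "0"),
              "L2": cpu.get("L2 cache", "0"),
              "L3": cpu.get("L3 cache", "0")}
    # Number of processes currently using more than 5% CPU.
    pcpu = subprocess.run(["ps", "-eo", "pcpu", "--no-headers"],
                          capture_output=True, text=True).stdout.split()
    p_load = sum(1 for v in pcpu if float(v) > 5.0)
    return {"nb_Pcores": nb_pcores, "nb_Lcores": nb_lcores,
            "frequency": frequency, "caches": caches, "p_load": p_load}

if __name__ == "__main__":
    print(gather_node_info())
```

In a cluster deployment, this script would be run on (or via ssh to) every node listed in the cluster node list.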

3.2 Information analysis

After gathering the information of each node, PAARes generates a set of four values or keys that quantify the processing capability available on the node. The larger the key values, the higher the available processing throughput of a node. The four keys generated by PAARes are listed below; a code sketch of their computation is given after the list.

  1. \({\textit{key}}_1\): A numeric value that defines the processing capacity of each node. It is calculated by Eq. (1):

    $$\begin{aligned} {\textit{key}}_1=\dfrac{\left( {\textit{info.nb}}\_{\textit{Pcores}}+ \dfrac{{\textit{info.nb}}\_{\textit{Lcores}}}{2}\right) \times {\textit{info.frequency}}}{{\textit{info.p}}\_{\textit{load}}+1} \end{aligned}$$
    (1)

    This key considers the physical and logical cores, their frequency, and the number of processes currently running on the node. As observed, \({\textit{key}}_1\) is calculated in two parts. First, the total number of cores is multiplied by the maximum core frequency to obtain the total processing capacity of the node; although the number of logical cores may equal that of the physical cores, they do not provide the same processing gain as the physical ones: in [32, 33] the reported gains from using logical cores only reached between 30% and 50%. For this reason, in \({\textit{key}}_1\) the number of logical cores is divided by 2, assuming a gain of 50%. Second, a penalty is applied by dividing the processing capacity by the number of processes already running on the node plus one (\({\textit{info.p}}\_{\textit{load}}+1\)). When a large number of processes is running on a node, \({\textit{key}}_1\) takes lower values.

  2. \({\textit{key}}_2\): The maximum amount of L3 cache memory associated with a core. If the core does not have this cache level, the value of this key is 0.

  3. \({\textit{key}}_3\): The maximum amount of L2 cache memory associated with a core. If the core does not have an L2 cache, the value of this key is 0.

  4. \({\textit{key}}_4\): The maximum amount of L1 cache memory associated with a core.

Table 1 shows an example of a cluster with 6 nodes. For each node, the information about physical cores, logical cores, core frequency, L3, L2, and L1 cache memory, and the number of processes using more than 5% CPU (Load) is given. In this example, not all nodes have logical cores or L3 cache memory (it is a heterogeneous cluster), and the only node with a Load value greater than 0 is node 3. Table 2 shows the key values calculated with the definitions given above.

Table 1 Example of a set of 6 cluster nodes and their resources
Table 2 Keys values of the nodes described in Table 1

3.3 PAARes process allocation

The key values are used to build a sorted node list, in which the position of a node reflects its processing availability. The first nodes are those with the most available resources (and therefore the highest processing throughput), while the nodes at the end of the list are the most overloaded. The PAARes strategy to build the sorted list is given in Algorithm 3.

[Algorithm 3: construction of the sorted node list]

The ordering of nodes (lines 2–5) in the list considers the four key attributes. All nodes are first sorted by \({\textit{key}}_1\) in descending order (line 2), leaving the node with the highest \({\textit{key}}_1\) value at the beginning of the list.

If two nodes obtain the same \({\textit{key}}_1\) value, \({\textit{key}}_2\), \({\textit{key}}_3\), and \({\textit{key}}_4\) are used to decide which of them should be placed before the other (lines 3–5); if all keys have the same values, the two nodes are placed one after the other, in either order, as illustrated in Tables 2, 3, and 4.

Table 3 shows the nodes of Table 2 after being sorted by \({\textit{key}}_1\). From the table, it can be observed that node 6 gets the first position (with \({\textit{key}}_1\) value = 31.5) and node 3 the last one (with \({\textit{key}}_1\) value = 6.6) in the list. Here, nodes 1 and 2 get the same \({\textit{key}}_1\) value; in this case the algorithm considers the other key values to find out which of them should be placed before the other. Since both \({\textit{key}}_2\) and \({\textit{key}}_3\) values are the same for node 1 and node 2, the \({\textit{key}}_4\) value is used to decide the final ordering. Node 2 (with \(key_4 = 64\)) is then placed before node 1 (with \({\textit{key}}_4=32\)), obtaining the results given in Table 4.

After the ordering steps, a file called hostfile is generated (line 7 of Algorithm 1) storing the sorted list (containing only the node names or their IP addresses). Before the execution of a parallel program, the hostfile is read (line 8 of Algorithm 1) so that the nodes with the highest processing availability, found at the beginning of the file, are used first, and the application processes are then placed following an assignment-by-PU policy.
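A minimal sketch of the ordering and hostfile generation is given below. Sorting by the (key1, key2, key3, key4) tuple in descending order reproduces both the primary ordering by \({\textit{key}}_1\) and the tie-breaking by the remaining keys, thanks to Python's lexicographic tuple comparison. The node names and key values are illustrative and do not correspond to Table 2, and the commented mpirun call is only an assumed way of consuming the generated hostfile.

```python
# Hypothetical nodes and their (key1, key2, key3, key4) tuples; in PAARes
# these values come from the gathering and analysis phases.
node_keys = {
    "nodeA": (24.0, 8192, 1024, 64),
    "nodeB": (12.0, 4096, 512, 32),
    "nodeC": (12.0, 4096, 512, 64),   # ties with nodeB on key1, key2, key3
    "nodeD": (5.0, 0, 256, 32),
}

# Descending lexicographic sort: key1 decides first; key2, key3, and key4
# are only consulted on ties, which reproduces Algorithm 3's tie-breaking.
ordered = sorted(node_keys, key=node_keys.get, reverse=True)
print(ordered)  # ['nodeA', 'nodeC', 'nodeB', 'nodeD']

# Write the hostfile later read by the MPI launcher (names or IP addresses).
with open("hostfile", "w") as hf:
    hf.write("\n".join(ordered) + "\n")

# The parallel application would then be started with a by-PU (by-slot)
# mapping, for example (assumed OpenMPI command line):
#   mpirun -np 16 --hostfile hostfile ./application
```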

Table 3 Example of values of the four keys for each node of Table 2; nodes ordered by the value of \({\textit{key}}_1\)
Table 4 Final arrangement obtained by ordering nodes 1 and 2 by \({\textit{key}}_4\)

4 Experimental setup

The NAS Parallel Benchmarks (NPB) [13] contain a set of MPI programs intended to evaluate parallel computers, mainly in terms of processing and memory performance. In this work the NPB kernels are used to compare the performance of PAARes against OpenMPI's default distribution policy, which maps processes by processing units (by slot [18, 34]) using a user-provided node list. The application characteristics and the experimental infrastructure are described below.

4.1 Applications

The characteristics of the five NPB kernel applications are the following:

  1. IS (Integer Sort): the main challenge for the cluster when executing this program is performing random memory accesses.

  2. EP (Embarrassingly Parallel): this application executes independent tasks, that is, processing tasks with very little or no communication among them.

  3. CG (Conjugate Gradient): this application performs mathematical calculations along with irregular, long-distance communications.

  4. MG (simplified MultiGrid): this program performs structured communications, both short- and long-distance, and makes intensive use of memory.

  5. FT (Fast Fourier Transform): this application computes a partial solution using FFTs; its objective is to evaluate the communication among all the processes.

Each of these applications can be executed with different problem sizes, called classes and identified by letters, where each class is roughly four times larger than the previous one. For the evaluation of our proposal, classes A (the smallest problem size), B, and C (the largest problem size) are used.

4.2 Infrastructure

The experimental infrastructure consists of two clusters, a homogeneous cluster (cluster 1) and a heterogeneous cluster (cluster 2). The hardware specifications and the operating system of each cluster are shown in Tables 5 and 6. All nodes are connected through a Gigabit Ethernet switch.

Table 5 Characteristics of the Cluster 1 (homogeneous)
Table 6 Characteristics of the Cluster 2 (heterogeneous)

The software specifications are: gcc compiler version 4.8.5, OpenMPI version 1.10.2, and Java SDK 1.8.0_25.

4.3 Test scenarios

The proposed PAARes strategy is compared in two scenarios: dedicated and non-dedicated operation. In the dedicated scenario only one NPB application is executed at a time, with no external processing tasks affecting system performance. In a dedicated homogeneous cluster, the selection of nodes for process allocation makes no difference, since all nodes have the same characteristics and all their processing capacity is available, so PAARes behaves exactly like OpenMPI and obtains the same performance; for this reason, this document only presents the results where a difference is observed. In the dedicated scenario on the heterogeneous cluster, PAARes considers the different processing capacities of the nodes to build the hostfile and place the NPB processes.

The non-dedicated scenario considers a homogeneous or heterogeneous cluster where external interfering processes are executed on one or more nodes to generate additional load. In our experiments the external load was generated by running \({\textit{stress}}\) processes.
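As an illustration only, such interfering load could be launched as in the sketch below, which assumes password-less ssh access to the nodes and that the stress tool is installed; the node names, number of workers, and duration are hypothetical and are not the exact parameters used in our experiments.

```python
import subprocess

# Nodes to overload; names, worker count, and duration are hypothetical.
overloaded_nodes = ["node2", "node5"]
for node in overloaded_nodes:
    # Start one CPU-bound stress worker for 600 seconds on each chosen node.
    subprocess.Popen(["ssh", node, "stress", "--cpu", "1", "--timeout", "600"])
```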

5 Results

In this section, the PAARes strategy is evaluated against the default OpenMPI process distribution by executing the NPB MPI applications under the test scenarios previously presented. For the heterogeneous cluster, the default OpenMPI allocation algorithm (assignment by slot) uses a hostfile containing the nodes in the order given in Table 6, with the most recent nodes at the bottom. For the homogeneous cluster, the hostfile lists nodes 1 to 6 of Table 5, all of them with the same characteristics.

Each reported NPB execution time is the average over the three classes (A, B, and C) for the same number of processes. For each test, five comparisons (one per NPB application) are presented. It is worth mentioning that PAARes does not need to know the inter-process communication graph or the amount of computation of each application process, regardless of whether class A, B, or C is executed. To simplify the presentation of results, the default OpenMPI process distribution is referred to as MPI. In each graph, the number of processes is varied: from 2 to 32 in cluster 1 and from 2 to 64 in cluster 2.

5.1 Non-dedicated homogeneous cluster

In a homogeneous cluster where some nodes are overloaded by the execution of more than one application, PAARes first allocates processes to the least loaded nodes and only then to the most loaded ones. Figure 2 shows five graphs (one per NPB application) plotting the execution times obtained by PAARes and MPI; here, only one node carries extra load, executing one stress process. As described in Sect. 4.1, each NPB application has different characteristics; for all of them PAARes obtained shorter times than the default MPI distribution, with reductions ranging from 1.9% (IS with 32 processes) to 73.5% (CG with 2 processes).

Fig. 2 Average execution times for the NPB applications on the non-dedicated homogeneous cluster (cluster 1)

5.2 Dedicated heterogeneous cluster

In a dedicated heterogeneous cluster, PAARes first selects the nodes with the highest processing capability, obtaining better execution times in most cases. Figure 3 compares PAARes and MPI (one graph per NPB application) on cluster 2. In this scenario, for the FT application with 2 and 4 processes, only problem sizes A and B were executed, due to memory constraints that arose while running class C. In general, PAARes achieves better execution times for most NPB applications. As seen in the figure, with 64 processes, when the entire cluster is used, the execution times are very similar; in some cases MPI even obtained lower times. This is due to two factors: (i) the problem size and (ii) the default node ordering used by MPI. PAARes uses the nodes with the most available resources from the beginning, while MPI only uses them when the execution involves 64 processes, because the fastest nodes are at the end of its hostfile. Since the processing and communication characteristics of the processes are not known in advance, it may happen that the processes assigned first have a lower cost than those assigned last, resulting in no improvement in PAARes times. However, the results obtained after executing the NPB applications show that this is a rare case, which only happens when all the cluster cores are used.

Fig. 3 Average execution times for the NPB applications on the dedicated heterogeneous cluster (cluster 2)

5.3 Non-dedicated heterogeneous cluster

A non-dedicated heterogeneous cluster adds another parameter to consider in process allocation, which in turn increases the degree of heterogeneity. Since the nodes have different processing capabilities, the fastest ones will not be at the top of the list if they are executing extra load, and the slowest ones will not be at the bottom if they are lightly loaded. Figure 4 compares PAARes and MPI (one graph per NPB application) on cluster 2, with 50% of the nodes overloaded in order to obtain a higher degree of heterogeneity. For the FT application with 2 and 4 processes, only problem sizes A and B are averaged. As in the previous results, PAARes achieves better execution times in most cases; when 64 processes are used, the results are very similar.

Fig. 4 Average execution times for the NPB applications on the non-dedicated heterogeneous cluster (cluster 2)

It is worth mentioning that the MPI results depend on how the programmer ordered the nodes in the hostfile, while PAARes guarantees that the nodes with less workload or more available resources are always used first, thanks to its four-key sorting strategy.

6 Conclusions and future work

The allocation of processes is a challenge for applications that require high processing capacity. In this paper, we proposed a process allocation strategy called PAARes, which works in dedicated and non-dedicated environments. The algorithm creates a machine file containing a node list sorted by the available processing capacity of the nodes, considering the physical and logical cores, their frequency, the cache memory levels present at each node, and the number of running processes. The user does not need to provide this information manually; PAARes obtains it automatically and quantifies it by means of four keys, which are the basic criteria used to order the processing nodes.

To evaluate the proposal we employed the NAS Parallel Benchmarks, considering their five kernel applications executed on homogeneous and heterogeneous clusters. We compared the performance of PAARes with the default OpenMPI distribution, executing the benchmarks in dedicated and non-dedicated scenarios. The results show that PAARes gives better performance than the default MPI distribution, which does not consider the physical and logical node characteristics for process allocation. This allows us to claim that a process assignment that takes into account more detailed information about the characteristics and load state of the cluster nodes can, in most cases, reduce the execution time of parallel applications without prior knowledge of their communication patterns or processing costs.

Future work will consider additional information about the processor architecture, for example, the NUMA organization and the connected I/O devices. If the communication graph of a parallel application is available, PAARes could also be extended to use it as an additional key, with the aim of further improving performance.