1 Introduction

Nowadays, large-scale distributed systems such as clusters, clouds, grids, and datacenters are used to provide high-performance computing (HPC) solutions to large parallel applications. These systems consume a huge amount of energy, which raises several environmental and economic concerns [11, 31]. For example, a 360-teraFLOPS high-performance cluster can consume nearly 20 MW of power [32]. A datacenter with one thousand racks demands almost 10,000 kW of electric power per day to function [19]. In 2006, the United States Environmental Protection Agency reported that the overall energy consumed by U.S. datacenters and servers was 61.4 billion kWh, equivalent to the power expenditure of 5.8 million households [32]. It was estimated in 2008 that HPC devices consume approximately 0.5% of the world's total electricity production, a share expected to quadruple by 2020 [5]. High power consumption in cluster systems and datacenters raises their operational costs and produces more heat, requiring cooling devices that ultimately emit large quantities of carbon dioxide and harm the environment; it also affects the reliability, availability, and performance of the systems. Therefore, reducing the energy consumption of HPC cluster systems has become a vital research topic.

Over the past decade, appreciable developments in computing technologies have boosted the power efficiency of HPC systems. Dynamic Power Management (DPM) [3, 7,8,9, 18, 22,23,24] and Dynamic Voltage and Frequency Scaling (DVFS) [7, 10, 16, 26,27,28,29,30] are two popular techniques used in modern processors to minimize power consumption. With DPM, processors are turned off when they are idle [3, 22]. DPM is appropriate when the idle slots of the processors are very long, and not suitable for applications with precedence-constrained tasks, since such applications involve only short idle times between task computations [7, 18]. With DVFS, processors can operate at different levels of frequency and voltage [16]. DVFS allows processors to reduce power consumption by scaling down their supply voltages and corresponding frequencies, since there is a quadratic relationship between the power consumed and the operating voltage [21]. For example, the AMD Opteron processor operates at six frequency levels varying from 1000 to 2600 MHz and correspondingly consumes between 32 and 95 W of power [1].

Parallel applications are modeled as directed acyclic graphs (DAGs). In a DAG, each node represents a task, and the edges represent dependencies between tasks [14]. Normally, the weight of a node represents the computation cost of a task, and the weight of an edge represents the communication cost required to send data from a parent task to its child [13, 17]. With DVFS, it is possible to minimize the energy consumption of a parallel application by assigning a different frequency to each task of the application [7]. However, minimizing energy consumption can slow down the computation of parallel applications, because the techniques used for energy reduction decrease the overall computational power of HPC devices, which may degrade system performance [20]. Scaling down the supply voltages and turning off some processors decrease the computational capability of the system and increase the overall schedule length, or makespan, of the application. Therefore, when concentrating on energy efficiency, the makespan must also be taken care of [20].

Many task scheduling algorithms have been proposed that use the DVFS technique to obtain energy-efficient schedules. The main idea behind these algorithms is to use the slack time of tasks to scale down the supply voltages and frequencies of the processors. This paper presents an energy-aware task scheduling algorithm using slack reclamation for heterogeneous cluster systems. The algorithm decreases energy consumption as much as possible without affecting the schedule length, using the DVFS technique. The presented algorithm is an improved version, for heterogeneous cluster systems, of NCM (Not Changing Makespan), a sub-algorithm of the EASLA algorithm [7]. It also makes use of a fast, low-complexity algorithm called PEFT (Predict Earliest Finish Time) [2] to compute the schedule length of the application, and it downscales processor frequencies only when the schedule length of the application does not change.

The contributions of our work are summarized as follows:

  1. We develop an improved version of the NCM algorithm which can be utilized for different types of DAG applications.

  2. It makes use of the PEFT algorithm [2] to determine the schedule length of the application. PEFT is a fast, low-complexity scheduling algorithm which uses a lookahead feature to compute the schedule of the application.

  3. It downscales the frequencies of the processors using the DVFS concept and, as in NCM, redistributes the slack of the non-critical tasks of the application to reduce the energy consumption of the processors in the cluster system.

  4. We perform a simulation study to evaluate the effectiveness of the proposed improved NCM algorithm over the NCM algorithm on randomly generated graphs with various characteristics. The comparative results demonstrate its effectiveness.

The rest of the paper is organized as follows: Sect. 2 explains the related work. Sect. 3 introduces the system models and formalizes the scheduling problem. Sect. 4 presents the improved NCM algorithm and its complexity analysis. An execution trace of the presented algorithm, with the help of an example, is also given in this section. The experimental results and discussion are presented in Sect. 5. Sect. 6 concludes the work.

2 Related work

In the literature, many algorithms have been proposed for energy efficient scheduling of precedence-constrained tasks of the parallel applications. In this section, we give a brief explanation of some existing energy-aware task scheduling algorithms.

Juurlink and Langen proposed a leakage-aware multiprocessor scheduling algorithm to reduce total energy consumption [4]. The algorithm first finds a schedule using an earliest-deadline-first scheduler. Next, binary search is used to find the number of processors required such that the total length of the schedule is not more than that obtained earlier. After that, frequency and voltage scaling is applied to further reduce the energy consumed. This algorithm was designed for homogeneous multiprocessor systems.

Some authors proposed duplication-based scheduling algorithms [12, 32] and clustering-based scheduling algorithms [16, 26] to minimize energy consumption. In [32], the authors presented two energy-efficient duplication-based scheduling algorithms named EAD (Energy-Aware Duplication) and PEBD (Performance-Energy Balanced Duplication) for parallel applications on homogeneous clusters; both leverage the DVFS concept to reduce energy dissipation in processors. In [12], the authors proposed an algorithm that minimizes duplication and saves energy on homogeneous systems. For this, an ideal threshold was calculated, depending on the schedule length and the power of the processor, and tasks are duplicated only up to this threshold. Moreover, a duplicated task must satisfy the property that duplicating it neither increases the energy consumed nor degrades performance. Tasks are grouped according to the predecessor tasks they communicate with in the graph, and each group of tasks is assigned to one of the available processors. The tasks are executed at the highest frequency levels, and DVFS is applied only during the times when the processors are idle. In [16], the authors gave a power-aware task scheduling algorithm called EAEPS (Energy Aware Edge Priority Scheduling) for multiprocessors. This algorithm is an energy-aware version of the EPS (Edge Priority Scheduling) algorithm [15]. It zeroes the higher-priority edges and reduces energy consumption by applying the DVFS technique to non-critical tasks, utilizing the concept of voltage scaling for non-critical tasks of the application similar to [26].

Wang et al. [27] studied the slack time of non-critical tasks and reduced energy consumption without increasing the makespan of the applications. Further, they considered green SLAs (Service Level Agreements) and presented energy-aware task scheduling heuristics, namely a best-effort power scheduling algorithm and an energy-performance tradeoff scheduling algorithm, to reduce energy consumption in DVFS-enabled clusters. In [26], the authors extended the work of [27] and proposed two energy-aware task scheduling algorithms for parallel applications on heterogeneous multiprocessor systems, named PALS (Power Aware List Scheduling) and PATC (Power Aware Task Clustering). In this work, along with energy-aware scheduling, the authors also considered how to allocate data to the different memories of a heterogeneous system with memory shared among various processors.

In [20], the authors developed a power-aware task scheduling algorithm called PASTA for heterogeneous systems. It follows a score-based approach. The goal is to find, out of the given set of processors, a combination whose makespan matches that of the HEFT (Heterogeneous Earliest Finish Time) algorithm [25]. Scores are first calculated for the processors; tasks are then scored by multiplying their earliest finish time by the energy consumed on each of the chosen processors. For a given task, the processor with the best (i.e., lowest) score is selected. However, PASTA produces quite a high makespan.

In [28], the authors defined a term ECR (energy consumption ratio) to characterize energy-efficient schedules and converted the energy-efficient task scheduling problem into the problem of minimizing the total ECR. They proved that minimizing ECR is NP-hard by transforming it into a variable-sized bin-packing problem and gave two heuristics, MECRI (Minimum Energy Consumption Ratio Increment) and PTAB (Power-Aware Task Allocation for Batch Tasks), to minimize ECR. Further, they presented an algorithm for task migration and frequency readjustment which further minimizes the energy consumption of the server.

In [29], the authors proposed an algorithm to minimize energy consumption for DVFS-enabled systems. The algorithm is developed using both an upward approach (from exit to entry tasks) and a downward approach (from entry to exit tasks). It uses the existing HEFT algorithm [25] to compute the schedule length of the application.

Hu et al. [7] developed an algorithm called EASLA for cluster systems with DVFS-enabled processors. The main idea behind their algorithm is to distribute the slacks of sets of tasks and scale voltages/frequencies down to achieve energy reduction. It uses the ETF (Earliest Task First) algorithm to obtain the initial schedule and iteratively allocates slack to the maximal independent set of tasks that achieves the greatest energy reduction. Our approach is quite similar, but we make use of the PEFT algorithm [2], which has proved to be faster.

3 System model and problem formulation

This section introduces the system model used in the work and formulates the scheduling problem. The notations used throughout the text of this paper with their respective descriptions are shown in Table 1.

Table 1 Symbols and their meaning

3.1 Cluster model

The cluster model utilized in this work consists of a set of heterogeneous processors \(P = \{p_1, p_2, \ldots, p_p\}\) which are fully interconnected with each other through identical communication links. Each processor in the cluster system is DVFS-enabled and has different voltage levels with corresponding frequencies. The communication latency between processors is assumed to be negligible. When two tasks are executed on the same processor, the communication cost between them is omitted. A task being executed by a processor cannot be preemptively interrupted by another task. A processor can perform execution and communication simultaneously. Also, duplication of tasks is not considered in this model.
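As a point of reference, the following C++ sketch shows one possible in-memory representation of this cluster model. The type and field names (PowerState, Processor, Cluster, service_rate) are our own illustrative choices, not part of the paper.

```cpp
#include <vector>

// Illustrative representation of the cluster model described above.
struct PowerState {
    double voltage;       // supply voltage V_j at state j (V)
    double frequency;     // operating frequency f_j at state j (MHz)
    double service_rate;  // SR_j, relative execution speed at state j
};

struct Processor {
    // State 0 is the highest power state; higher indices are the
    // downscaled DVFS states.
    std::vector<PowerState> states;
    double available_time = 0.0;  // T_av(p_j): earliest time p_j is free
};

// A heterogeneous, fully connected cluster is simply a set of processors.
using Cluster = std::vector<Processor>;
```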

3.2 Application model

An application, a set of tasks, is depicted using a Directed Acyclic Graph (DAG) \(G = (V, E)\), where V is the set of nodes, each node denoting a task, and E is the set of directed edges between tasks. In the DAG, a directed edge \({e}_{i,j} \in E\) represents the precedence constraint from task \({t}_{i} \in V\) to \({t}_{j} \in V\), meaning \({t}_{j}\) cannot begin its execution until task \({t}_{i}\) has finished and its output has been transferred to \({t}_{j}\). The weight on an edge, denoted \(c_{i,j}\), gives the communication time from task \({t}_{i}\) to \({t}_{j}\). The communication time between two tasks is considered zero when both tasks are executed on the same processor. In addition, a computation cost matrix gives the execution cost of each task \({t}_{i} \in V\) on the different processors of the cluster system; for instance, \(w_{i,j}\) represents the computation cost of task \(t_i\) on processor \(p_j\). A sample DAG [8] and its computation cost matrix are given in Fig. 1 and Table 2, respectively.

Fig. 1 A sample DAG

Table 2 Computation cost matrix of the tasks on each processor for a 3-processor system
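For concreteness, the application model can be represented as in the following minimal sketch; the names Task, DAG, comm, and w are assumptions of the sketch, not code from the paper.

```cpp
#include <vector>

// Illustrative representation of the application model described above.
struct Task {
    std::vector<int> parents;   // parent(t_i): indices of direct parents
    std::vector<int> children;  // child(t_i): indices of direct children
};

struct DAG {
    std::vector<Task> tasks;                // V, tasks indexed 0..n-1
    std::vector<std::vector<double>> comm;  // c[i][j]: edge cost, 0 if no edge
    std::vector<std::vector<double>> w;     // w[i][j]: cost of t_i on p_j
};
```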

3.3 Energy model

The energy model used in this work is based on the power consumption model of CMOS-based logic circuits. The processors considered in this model are DVFS-enabled, so energy consumption can be minimized by appropriately scaling down the frequencies and supply voltages of the processors. Power consumption in CMOS-based logic circuits is due to leakage currents, short-circuit currents, and the switching activity of instruction processing. In general, the power consumption of a processor is the sum of its static and dynamic power consumption. Since static power consumption is not affected by the DVFS state of the processors, it is excluded from the model, and the total power consumption of a processor is taken to be its dynamic power consumption, expressed as follows:

$$\begin{aligned} {\mathscr {P}}_{c} = C_{eff}V_{j}^2f_{j} \end{aligned}$$
(1)

where \(C_{eff}\) is the effective capacitance, and \(V_{j}\) and \(f_{j}\) are the supply voltage and operating frequency of the processor at the \(j^{th}\) state [6]. Thus, the power consumed is reduced by using DVFS, as there is a quadratic relationship between the power consumed and the voltage supplied.

Let \({\mathscr {P}}^{i}_j\) represent the power consumed by the processor working on task \(t_i\) at power state j, and let \(w^{i}_j\) be the computation cost of task \({t}_{i}\) at power state j. Each state of a DVFS-enabled processor has its own power consumption rate and service rate (SR). The highest power state is indexed 0, with lower states indexed increasingly. If task \(t_i\) is assigned to processor k, then \(w^{i}_j\) is given by

$$\begin{aligned} w^{i}_j = w_{i,k} \cdot \frac{SR_0}{SR_j} \end{aligned}$$
(2)

Taking n as the number of tasks in the application, the total energy consumed by the system can be written as

$$\begin{aligned} EN = \sum _{i=0}^{n-1} {\mathscr {P}}^{i}_j \cdot w^{i}_j = \sum _{i=0}^{n-1} C_{eff} (V^{i}_j)^2 f^{i}_j \cdot w^{i}_j \end{aligned}$$
(3)

where \(f^{i}_j\) represents the processor frequency, and \(V^{i}_j\) represents the voltage, on which task \(t_i\) is executed at power state j.
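The following sketch puts Eqs. (1)–(3) together, reusing the Processor, Cluster, and DAG structures sketched earlier. The assignment and state vectors, which record the allocated processor and power state of each task, are assumptions of the sketch.

```cpp
#include <cstddef>
#include <vector>

// Dynamic power at a given state, Eq. (1): P = C_eff * V^2 * f.
double dynamicPower(double c_eff, const PowerState& s) {
    return c_eff * s.voltage * s.voltage * s.frequency;
}

// Computation cost of a task at power state j on processor p, Eq. (2):
// the cost at the highest state (w_ik) stretched by the service-rate ratio.
double scaledCost(double w_ik, const Processor& p, int j) {
    return w_ik * p.states[0].service_rate / p.states[j].service_rate;
}

// Total energy, Eq. (3): power times (scaled) execution time, summed over
// all tasks. assignment[i] and state[i] record the processor and power
// state of task t_i; both vectors are assumptions of this sketch.
double totalEnergy(const DAG& g, const Cluster& cluster, double c_eff,
                   const std::vector<int>& assignment,
                   const std::vector<int>& state) {
    double en = 0.0;
    for (std::size_t i = 0; i < g.tasks.size(); ++i) {
        const Processor& p = cluster[assignment[i]];
        en += dynamicPower(c_eff, p.states[state[i]])
            * scaledCost(g.w[i][assignment[i]], p, state[i]);
    }
    return en;
}
```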

3.4 Problem formulation

This subsection first gives some known definitions ascribing common properties of task scheduling and then formulates the task scheduling problem.

Definition 1

(\(parent({t}_{i})\)). It denotes the set of direct parents of task \({t}_{i}\) in a graph. An entry task \({t}_{entry}\) is a task which doesn't have any parent. A graph may contain multiple entry tasks; if it does, a dummy entry task having zero computation and communication costs may be added to the graph.

Definition 2

(\(child({t}_{i})\)). It denotes the set of direct children of task \({t}_{i}\) in a graph. An exit task \({t}_{exit}\) is a task which doesn't have any child. A graph may contain multiple exit tasks; if it does, a dummy exit task having zero computation cost, with zero-cost communication edges from the existing exit tasks to the dummy task, may be added to the graph.

The uniqueness of \({t}_{entry}\) and \({t}_{exit}\) is not presumed in this work.

Definition 3

(\(AFT({t}_{i})\)). It is the Actual Finish Time of a task \({t}_{i}\) and is defined as the point of time where the execution of task \({t}_{i}\) is finished by some allocated processor.

Definition 4

Makespan or schedule length refers to the finish time of the last task in the scheduled DAG and it can be defined as follows:

$$\begin{aligned} makespan = AFT(t_{exit}) \end{aligned}$$
(4)

where \(AFT(t_{exit})\) is the Actual finish time of the exit task.

Definition 5

(\(EST(t_i,p_j)\)). It denotes the Earliest Start Time of a task \(t_i\) on a processor \(p_j\) and is obtained as follows:

$$\begin{aligned}&EST(t_i,p_j) \nonumber \\&\quad = \max \Big \{T_{av}(p_j), \max _{t_k \in parent(t_i)} \{AFT(t_k) + c_{k,i} \} \Big \} \end{aligned}$$
(5)

where \(T_{av}(p_j)\) is the earliest time at which processor \(p_j\) becomes available. The communication cost \(c_{k,i}\) is taken as zero if \(t_k\) is allocated to processor \(p_j\). EST is zero for the entry task.

Definition 6

(\(EFT(t_i,p_j)\)). It denotes the Earliest Finish Time of a task \(t_i\) on a processor \(p_j\) and is obtained as

$$\begin{aligned} EFT(t_i,p_j) = EST(t_i,p_j) + w_{i,j} \end{aligned}$$
(6)

which is the sum of the EST of task \(t_i\) on processor \(p_j\) and the execution cost of \(t_i\) on processor \(p_j\).
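Eqs. (5) and (6) translate directly into code, reusing the earlier structures; the AFT and assignment arrays are assumed to be filled in as tasks get scheduled, and the names are our own.

```cpp
#include <algorithm>
#include <vector>

// EST per Eq. (5): the later of the processor's available time T_av(p_j)
// and the arrival of the last parent's data.
double EST(const DAG& g, const Cluster& cluster, int i, int j,
           const std::vector<double>& AFT, const std::vector<int>& assignment) {
    double ready = cluster[j].available_time;                  // T_av(p_j)
    for (int k : g.tasks[i].parents) {
        double c = (assignment[k] == j) ? 0.0 : g.comm[k][i];  // zero on same processor
        ready = std::max(ready, AFT[k] + c);
    }
    return ready;  // zero for the entry task on an idle processor
}

// EFT per Eq. (6): EST plus the computation cost of t_i on p_j.
double EFT(const DAG& g, const Cluster& cluster, int i, int j,
           const std::vector<double>& AFT, const std::vector<int>& assignment) {
    return EST(g, cluster, i, j, AFT, assignment) + g.w[i][j];
}
```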

Definition 7

(\(LFT(t_i,p_j)\)). It denotes the Latest Finish Time of a task \(t_i\) on a processor \(p_j\) and is defined as follows:

$$\begin{aligned}&LFT(t_i,p_j)\nonumber \\&\quad = \min _{p_x \in P} \Big \{ \min _{t_k \in child(t_i)} \{LFT(t_k,p_x) - c_{i,k} - w_{k,x} \} \Big \} \end{aligned}$$
(7)

where \(c_{i,k}\) is zero if \(p_x = p_j\). The LFT of the exit task equals the makespan on all processors.
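Eq. (7) can be evaluated from the exit task backwards (in reverse topological order), as in this sketch; the lft table is assumed to already hold the children's values.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// LFT per Eq. (7): for the exit task the LFT equals the makespan on every
// processor; otherwise each child, on each candidate processor, constrains
// how late t_i may finish.
double LFT(const DAG& g, int n_proc, int i, int j, double makespan,
           const std::vector<std::vector<double>>& lft /* children filled */) {
    if (g.tasks[i].children.empty()) return makespan;  // exit task
    double limit = std::numeric_limits<double>::infinity();
    for (int x = 0; x < n_proc; ++x) {
        for (int k : g.tasks[i].children) {
            double c = (x == j) ? 0.0 : g.comm[i][k];  // zero on same processor
            limit = std::min(limit, lft[k][x] - c - g.w[k][x]);
        }
    }
    return limit;
}
```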

Definition 8

(\(OCT(t_i,p_j)\)). It denotes the Optimistic Cost Table value, which indicates the maximum among the shortest paths from the children of task \(t_i\) to the exit task when \(t_i\) is assigned to processor \(p_j\). It is defined as follows:

$$\begin{aligned}&OCT(t_i,p_j) \nonumber \\&\quad = \max _{t_k \in child(t_i)} \Big \{ \min _{p_x \in P} \{ OCT(t_k,p_x) + w_{k,x} + \overline{c_{i,k}} \} \Big \} \end{aligned}$$
(8)

\(\overline{c_{i,k}}\) is zero if \(p_x = p_j\).

\(OCT(t_{exit},p_j) = 0\) for all processors \(p_j\). The definition of OCT is taken from [2].
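Eq. (8) admits the same reverse-topological evaluation, sketched below; note that we use the plain edge cost comm[i][k] as a stand-in for \(\overline{c_{i,k}}\), which is a simplifying assumption of this sketch.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// OCT per Eq. (8), evaluated in reverse topological order: the oct table
// is assumed to already hold the values of t_i's children.
double OCT(const DAG& g, int n_proc, int i, int j,
           const std::vector<std::vector<double>>& oct /* children filled */) {
    if (g.tasks[i].children.empty()) return 0.0;       // OCT(t_exit, p_j) = 0
    double worst = 0.0;
    for (int k : g.tasks[i].children) {
        double best = std::numeric_limits<double>::infinity();
        for (int x = 0; x < n_proc; ++x) {
            double c = (x == j) ? 0.0 : g.comm[i][k];  // zero on same processor
            best = std::min(best, oct[k][x] + g.w[k][x] + c);
        }
        worst = std::max(worst, best);                 // max over children
    }
    return worst;
}
```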

Our problem can be formulated as follows. We are given an application graph consisting of |V| tasks with their precedence constraints defined, and p processors, each with its own DVFS states. Our goal is to minimize the energy consumption of the cluster system as much as possible while keeping the schedule length the same as that achieved by the best-effort algorithm (PEFT in our case).

Let \(EN_{best\_effort}\) be the energy consumed by the schedule of the best-effort algorithm, calculated by running all processors at their highest power states, and let \(EN_{algo}\) be the energy consumed by an algorithm using the DVFS technique. The energy saving (ES) metric can be defined from [7] as follows:

$$\begin{aligned} ES = \frac{EN_{best\_effort} - {EN}_{algo}}{EN_{best\_effort}} \end{aligned}$$
(9)

Hence, we have to maximize the value of ES, as expressed by Eq. (9).
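Eq. (9) translates into a one-line helper; both inputs come from Eq. (3).

```cpp
// Energy saving per Eq. (9): relative reduction against the best-effort
// schedule run entirely at the highest power states.
double energySaving(double en_best_effort, double en_algo) {
    return (en_best_effort - en_algo) / en_best_effort;
}
```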

4 The improved NCM algorithm

In [7], the authors presented an energy-aware scheduling algorithm called EASLA (Energy-Aware Service Level Agreement) for cluster systems having DVFS-enabled processors. The EASLA algorithm reduces energy consumption by utilizing the slack of tasks and downscaling the frequencies of processors. It takes the schedule obtained from the ETF (Earliest Task First) algorithm as input and consists of two sub-algorithms named NCM (not changing makespan) and UAME (using accepted makespan extension). The NCM algorithm downscales frequencies when the schedule length does not change, whereas the UAME algorithm downscales frequencies when the schedule length is extended. In this work, we present an improved NCM algorithm for heterogeneous cluster systems that downscales frequencies when the schedule length does not change. Like NCM [7], it reduces energy consumption by using the slacks of the tasks without affecting the schedule length of the application. The improved NCM algorithm differs from the NCM algorithm in that it uses the PEFT algorithm [2] to determine the makespan of the application. PEFT is a fast, low-complexity list-based scheduling algorithm which uses a lookahead feature to compute the schedule of the application.

Algorithm 1 presents the improved NCM algorithm. The algorithm first computes the makespan of the given application according to the PEFT algorithm [2] and then downscales the frequencies of the processors according to the NCM algorithm [7]. The main steps are as follows. A matrix called the Optimistic Cost Table (OCT) is built, in which each element \(OCT(t_{i},p_{j})\) gives the maximum among the shortest paths from task \(t_{i}\)'s children to the exit node when processor \(p_{j}\) is selected. Then the Optimistic EFT (OEFT) is computed as the sum of the OCT value and the Earliest Finish Time (EFT), as given in Eq. (10). Among the tasks that are ready to execute, the task with the highest OCT is chosen, and it is assigned to the processor with the minimum OEFT for it.

$$\begin{aligned} OEFT(t_i, p_j) = EFT(t_i, p_j) + OCT(t_i, p_j) \end{aligned}$$
(10)

The next step is the calculation of the available slack for each task. The makespan is the overall time required to complete all the tasks. The slack of a task is the difference between its LFT and the sum of its EST and computation cost [7].

$$\begin{aligned} slack_i = LFT(t_i,p_j) - EST(t_i,p_j) - w_{i,j} \end{aligned}$$
(11)

Here, \(p_j\) is the processor assigned to the \(i^{th}\) task in the cluster system. Only tasks which are independent of each other can share slack; in other words, tasks which are not reachable from each other in the DAG can share slack.
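The reachability condition just stated can be checked with a plain depth-first search, as in the following sketch (reachable and canShareSlack are hypothetical helper names, reusing the DAG structure sketched in Sect. 3.2):

```cpp
#include <vector>

// True if 'to' is reachable from 'from' along directed edges of the DAG.
bool reachable(const DAG& g, int from, int to) {
    std::vector<bool> seen(g.tasks.size(), false);
    std::vector<int> stack = {from};
    while (!stack.empty()) {
        int u = stack.back(); stack.pop_back();
        if (u == to) return true;
        if (seen[u]) continue;
        seen[u] = true;
        for (int k : g.tasks[u].children) stack.push_back(k);
    }
    return false;
}

// Two tasks may share slack only if neither is reachable from the other.
bool canShareSlack(const DAG& g, int i, int j) {
    return !reachable(g, i, j) && !reachable(g, j, i);
}
```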

A list L is initialized with all the tasks, sorted in non-increasing order of their OCT values. EST is calculated for each task in increasing order of OCT values, and LFT for each task in the reverse order. If the task under consideration cannot be scaled down using the available slack, it is deleted from L. Otherwise, frequency/voltage scaling is applied on the task's assigned processor and the amount of energy saved is calculated. Among the remaining tasks, the task which achieves the maximum energy reduction is scaled down using DVFS. Then the voltage, frequency, and computation time of that task are updated according to the new values. The algorithm terminates when L becomes empty.

Thus, at the end, we obtain a schedule which gives the processors assigned to the tasks, their execution order, and the DVFS level of the processor used for each task.
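Putting the pieces together, the following condensed sketch mirrors the slack-reclamation loop just described, reusing the structures and helpers from the earlier sketches. For brevity it takes precomputed est and lft arrays and does not re-propagate them after each scaling step, which Algorithm 1 does; w_cur holds each task's current (possibly stretched) computation cost, initialized to its cost at the highest state, and all names are our own assumptions.

```cpp
#include <algorithm>
#include <list>
#include <vector>

void reclaimSlack(const DAG& g, const Cluster& cluster, double c_eff,
                  const std::vector<int>& assignment, std::vector<int>& state,
                  const std::vector<double>& est, const std::vector<double>& lft,
                  std::vector<double>& w_cur) {
    std::list<int> L;
    for (int i = 0; i < (int)g.tasks.size(); ++i) L.push_back(i);
    // (Algorithm 1 sorts L by non-increasing OCT; omitted here for brevity.)
    while (!L.empty()) {
        int best_task = -1, best_state = -1;
        double best_saving = 0.0, best_cost = 0.0;
        for (auto it = L.begin(); it != L.end(); ) {
            int i = *it, j = assignment[i];
            const Processor& p = cluster[j];
            double slack = lft[i] - est[i] - w_cur[i];               // Eq. (11)
            double e_cur = dynamicPower(c_eff, p.states[state[i]]) * w_cur[i];
            int cand = -1;
            double cand_saving = 0.0, cand_cost = 0.0;
            // Try lower power states whose stretched cost still fits the slack.
            for (int k = state[i] + 1; k < (int)p.states.size(); ++k) {
                double w_k = scaledCost(g.w[i][j], p, k);            // Eq. (2)
                if (w_k - w_cur[i] > slack) continue;                // does not fit
                double save = e_cur - dynamicPower(c_eff, p.states[k]) * w_k;
                if (save > cand_saving) {
                    cand = k; cand_saving = save; cand_cost = w_k;
                }
            }
            if (cand < 0) { it = L.erase(it); continue; }            // cannot scale: drop
            if (cand_saving > best_saving) {
                best_task = i; best_state = cand;
                best_saving = cand_saving; best_cost = cand_cost;
            }
            ++it;
        }
        if (best_task < 0) break;           // no remaining task can save energy
        state[best_task] = best_state;      // apply DVFS to the winning task
        w_cur[best_task] = best_cost;       // record its stretched cost
        L.remove(best_task);
        // Algorithm 1 re-propagates EST/LFT here before the next pass.
    }
}
```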

Algorithm 1 The improved NCM algorithm

4.1 Time complexity of algorithm

The time complexity of our algorithm can be analyzed as follows. Let |V| be the number of tasks, |E| the number of edges, p the number of processors, and m the number of power states per processor. The first step is scheduling with the best-effort algorithm, i.e., PEFT, whose complexity is known to be \(O(p(|E| + |V|) + |V|^{2}.p)\). Then the tasks are sorted by their OCT values, which takes O(|V|log|V|) time. Next, the available slack times of the tasks are found iteratively, and the processors' DVFS states are checked for a state that can be scaled down using the slack time. Each iteration requires \(O(m|V| + |E|)\) time, so over all iterations this step takes \(O(|V|(m|V| + |E|))\). Thus, the time complexity of our algorithm is \(O(p(|E| + |V|) + |V|^{2}.p + |V|(log|V| + m|V| + |E|))\).

4.2 An illustrative example

Consider the DAG given in Fig. 1, which contains eight tasks labeled \({t}_{0}\) to \({t}_{7}\). The corresponding computation cost matrix of the tasks for a three-processor system is given in Table 2. The tasks \({t}_{0}\) and \({t}_{7}\) are the entry and exit tasks of the DAG, respectively. The edges of the graph are labeled with the communication costs.

The improved NCM algorithm initially calculates the schedule length of the given DAG using the PEFT algorithm and then determines the slack of tasks using the NCM algorithm. PEFT first computes the OCT of each task on the different processors, along with an average OCT for each task. The OCT values for this example are shown in Table 3. Based on the OCT computation, tasks are assigned to the processors in order. The tasks selected and the processors allocated, along with the actual finish times, are given in Table 4. After this, slack times are calculated for the various tasks; they are given in Table 5.

Table 3 OCT for the illustration
Table 4 Tasks and the processors assigned and the actual finish time
Table 5 Slack times after first iteration

Based on the slack times of the tasks, DVFS is applied and energy is saved according to the algorithm discussed above. Here, tasks \(t_2\), \(t_5\), and \(t_7\) cannot be scaled down and are removed from the list L. Task \(t_4\) achieves the maximum ES for DVFS state 5 of processor 2. Thus, to save energy, the power state of processor 2 is scaled down from state 0 to state 5, and task \(t_4\) is removed from the list L. Now tasks \(t_0\), \(t_1\), \(t_3\), and \(t_6\) remain in L. Further iterations proceed similarly using slack allocation, and overall, 12.29% energy is saved by our algorithm compared to the best-effort algorithm. The energy of the best-effort algorithm is calculated by running all processors at their highest power states.

5 Experimental results and discussion

In this section, we evaluate the results of the improved NCM algorithm and compare its performance with the NCM algorithm [7]. For the comparison, we use energy saving (ES), as given by Eq. (9), as the performance metric. For this purpose, we generated random graphs based on different input parameters such as the number of tasks, CCR values, fat, and regularity. We performed the simulations in the C++ programming language and used the three types of processors listed below. The processors are DVFS-enabled, and their details are taken from [24]. The specifications of the processors are given in Tables 6 and 7.

  • AMD Opteron processors (high-performance and high-energy).

  • Intel Pentium M processors (low-performance and low-energy).

  • VIA C7-M processors (medium-performance and low-energy).

Table 6 P-states and power consumption
Table 7 Service rate of processors

The random DAGs used in the experiments are generated using the following parameters [17]:

  • n: It denotes the number of tasks present in an input DAG.

  • CCR: Communication-to-computation ratio. It represents the ratio of the sum of the communication costs of all edges to the sum of the computation costs of all nodes in a DAG.

  • fat: It affects the width and height of the DAG. The width and height of the DAG are maintained by a uniform distribution using \(fat \times \sqrt{n}\) and \(\frac{\sqrt{n}}{fat}\) as a mean value for width and height, respectively.

  • regularity: It determines how uniformly tasks are distributed across the levels of the graph. A smaller value of regularity means that the levels of the DAG contain different numbers of tasks, whereas a higher value indicates that the number of tasks at each level is almost the same.

Table 8 Parameters and their corresponding values used in the generation of random graphs

Table 8 gives the details of the parameters used to generate the random graphs. 144 DAGs are generated using combinations of these parameters. Further, for each DAG, we generated ten different random graphs having the same structure but with different execution and communication costs on the nodes and edges, respectively. Thus, in total, 1440 random DAGs are used in this work.

Fig. 2 Energy saving results for random graphs with respect to number of tasks when processors are 3

Fig. 3 Energy saving results for random graphs with respect to number of tasks when processors are 6

Fig. 4 Energy saving results for random graphs with respect to number of tasks when processors are 9

Fig. 5 Energy saving results for random graphs with respect to number of tasks when processors are 12

Figs. 2, 3, 4 and 5 show the ES results for random graphs with respect to the number of tasks for 3, 6, 9, and 12 processors. The results show that the improved NCM algorithm gives better ES than the NCM algorithm for the various numbers of processors. For example, when the DAG size is 200, the improved NCM algorithm saves 13.71%, 14.75%, 15.14%, and 16.03% energy as compared to the NCM algorithm for 3, 6, 9, and 12 processors, respectively. It is also noticed that increasing the number of processors leads to a slight increase in ES. This can be credited to the fact that with an increase in the number of processors, the number of tasks allocated to each processor decreases. In our experiments, the improved NCM algorithm always outperforms the NCM algorithm.

Fig. 6 Energy saving results for random graphs with respect to number of processors when tasks are 10

Fig. 7 Energy saving results for random graphs with respect to number of processors when tasks are 20

Fig. 8 Energy saving results for random graphs with respect to number of processors when tasks are 40

Fig. 9 Energy saving results for random graphs with respect to number of processors when tasks are 80

Fig. 10 Energy saving results for random graphs with respect to number of processors when tasks are 100

Fig. 11 Energy saving results for random graphs with respect to number of processors when tasks are 200

Figs. 6, 7, 8, 9, 10 and 11 show the ES results for random graphs with respect to the number of processors for 10, 20, 40, 80, 100, and 200 tasks. The results show that the improved NCM algorithm gives better ES than the NCM algorithm for the various numbers of tasks. For example, when the number of processors is 9, the improved NCM algorithm saves 13.12%, 13.47%, 13.96%, 14.59%, 14.87%, and 15.14% energy as compared to the NCM algorithm for 10, 20, 40, 80, 100, and 200 tasks, respectively. It is also noticed that as the number of tasks in the application increases, the energy saving also increases; here too, the improved NCM algorithm performs better than the NCM algorithm in all cases. However, the saving stops increasing at some point, as the amount of usable slack decreases.

Fig. 12 Energy saving results for random graphs with respect to CCR

Fig. 12 presents the energy saving for random graphs with respect to CCR. We used 0.2, 0.5, 1, and 2 as the CCR values in the experiments. For all values of CCR, the results show that the improved NCM algorithm gives better energy savings than the NCM algorithm, saving up to 14.93% energy as compared to the NCM algorithm.

6 Conclusions

In this paper, we presented an energy-aware scheduling algorithm using slack reclamation for heterogeneous cluster systems with DVFS-enabled processors. The algorithm decreases energy consumption as much as possible without affecting the schedule length. The presented algorithm is an improved version of the NCM algorithm, a sub-algorithm of the well-known EASLA algorithm. The time complexity of the proposed algorithm is \(O(p(|E| + |V|) + |V|^{2}.p + |V|(log|V| + m|V| + |E|))\), where |V| and |E| are the numbers of tasks and edges in a given DAG, and p and m are the number of processors and the total number of DVFS power states. We performed experiments on various randomly generated DAGs and compared the performance of the improved NCM algorithm with the NCM algorithm. The results demonstrate that the improved NCM algorithm outperforms the NCM algorithm in terms of energy saving. In future work, we may add some real-world application graphs to our experiments. Furthermore, real systems may be used to run the applications and record their efficiency and energy savings.