1 Introduction

As technology advances towards the deep submicron region, the shrinking device sizes and growing transistor counts result in exponential increase of chip power density, which in turn leads to the elevated chip temperature. High temperatures are responsible for transient faults caused by timing errors since every 10 °C temperature increase can cause about 5 % interconnect delay [1]. In addition, elevated temperatures directly impact electromigration and hence reduce the mean time to failure (MTTF) of the chip. In a word, a system will fall into the predicament of functional incorrectness, low reliability and even hardware failures if the operating temperature exceeds a certain threshold. Therefore, thermal management has been a significant and pressing research issue in computing systems, especially for the embedded system with limited cooling techniques.

Considerable research effort has been devoted to the design of temperature-aware task allocation and scheduling in real-time MPSoC systems [210]. Research works on temperature-aware task allocation and scheduling can be classified into two categories. One category of research works aims to maximize the system performance or minimize the energy consumption under thermal constraints [26]. For instance, Chantem et al. [2] investigated the impacts of temperature and thermal cycling on system lifetime reliability, and presented an online task assignment and scheduling algorithm to maximize MTTF. Static and dynamic temperature-aware scheduling techniques were studied in [3] for MPSoCs to minimize energy, balance energy, and minimize hot spots. An integer linear programming (ILP) solution was first proposed to statically solve the temperature-aware task scheduling problem of MPSoC, then an adaptive dynamic policy that modifies the workload allocation policy in the runtime was designed based on the temperature history. Saha et al. [4] developed a genetic algorithm-based approach to solve the thermal constrained energy-aware real-time task mapping for heterogeneous multiprocessor systems. The proposed task mapping strategy minimizes the energy consumption with considerations of real-time constraint and maximum temperature limit. The widely utilized system-level power management technique, dynamic voltage scaling (DVS), has also been exploited in [5, 6] to minimize the energy consumption under thermal constraints. However, scaling down the operating frequency of a processor degrades the performance of the processor [11]. In addition, all the above works focus on improving lifetime reliability or minimizing energy for MPSoCs under the thermal constraint, however, the thermal optimization is not taken into account. This is not suitable for real-time systems deployed certain safety-critical applications (e.g. implantable cardioverters in medical applications) where temperature optimization is crucial for the correct and safe operation of systems.

Another category of existing research works concentrates on reducing the peak temperature under the constraint of a given threshold temperature [710]. An effective thermal management technique, that is load balancing through activity/task migration, has been utilized in [7, 8] to optimize the peak temperature of MPSoCs. Ebi et al. [7] presented an extremum-seeking control-based method to minimize on-chip peak temperature and optimize thermal balancing by finding an optimal scheme of activity migration based on observed temperatures. In the literature [8], a task migration algorithm that balances the loads on different cores was proposed to reduce hotspots. Ghahfarokhi and Ejlali [9] developed a schedule swapping scheme to mitigate the peak temperature of multiple processors in a distributed system. The peak temperature is reduced by swapping tasks of overheated processors with the coldest processor, and deadline constraints of tasks are guaranteed using a feasibility checking technique at static time. However, the schemes proposed in [79] may incur significant time and energy overheads due to frequent task migration or swapping. An optimal phased steady-state mixed integer linear programming (MILP) based solution was designed in [10] to solve the problem of temperature-aware real-time MPSoC mapping and scheduling for minimizing the chip peak temperature. However, the thermal characteristics of tasks and processors are not considered in the above works for temperature optimization.

In this paper, we proposed a thermal-ware task allocation and scheduling scheme for heterogeneous MPSoC systems to reduce peak temperature. The proposed scheme first attempts to improve the thermal profiles of tasks by minimizing the task steady state temperatures through task mapping, then further reduces the peak temperature of tasks on the processors by using the task splitting technique. The thermal characteristics of tasks and processors are exploited in the proposed task mapping strategy, and two scenarios (ideal and realistic scenarios) are considered in the proposed task splitting policy.

The rest of the paper is organized as follows. Section 2 introduces the system model and defines the thermal-aware MPSoC task mapping and scheduling problem. Section 3 presents the proposed temperature-aware task allocation strategy. Section 4 describes the proposed temperature-aware task splitting policy. The benefits and efficiency of the proposed algorithms are experimentally determined in Section 5. Section 6 concludes the paper.

2 System Model and Problem Definition

This section describes the system model and the thermal-aware task-to-processor mapping and scheduling problem.

2.1 Processor and Task Model

Processing element (PE) is the fundamental computational block integrated in the MPSoC system. For heterogeneous MPSoC platforms, the PEs can be instruction set processors (ISP), digital signal processors (DSP), general-purpose CPUs or FPGA fabric tiles. In addition, each PE can support a wide range of voltages and operating speeds (e.g. [3, 12, 13]), or equipped with a fixed voltage and operating frequency (e.g. [14, 15]). This work considers the latter system architecture (no DVS) for separate concerns since DVS would add another dimension for optimization. Specifically, the MPSoC system PE is composed of M heterogeneous processing elements {P E 1, P E 2, ⋯ , P E M }, where each processing element P E m (1 ≤ mM) is assumed to support one active mode and one sleep mode. The active mode of P E m is characterized by a fixed supply voltage/frequency pair (v m , f m ). Tasks are only executed when the processor is in active mode while the processor is idle in sleep mode. It is assumed that a processor can be switched from one mode to another mode at any time with the timing overhead of t sw, during which no computation is allowed.

Consider a task set Γ comprising N independent periodic real-time tasks {τ 1, τ 2, ⋯ , τ N }. The characteristics of a task τ i (1 ≤ iN) is modeled using a quadruplet τ i :{c i , μ i , p i , d i }, where c i is the worst case execution cycles of task τ i , and μ i (ranging in (0, 1]) is an active factor that defines how intensively functional units have been utilized by the task τ i [16]. p i and d i are the period and relative deadline of the task, respectively. For a given task set (e.g. Γ), the hyper-period of the task set, denoted by L, is the lowest common multiple (LCM) of all task periods {p 1, p 2, ⋯ , p N }. Assume the task τ i is running on the processing element P E m , then the corresponding execution time of task τ i is given by \(\frac {c_{i}}{f_{m}}\).

2.2 Power Model

The power consumption of a CMOS device can be modeled as the sum of leakage power dissipation and dynamic power dissipation. The leakage power is temperature dependent and can be expressed as P leak=N gate V dd I leak, where N gate is the number of gates, V dd is the supply voltage, and I leak is the leakage current. The leakage current I leak can be formulated by a nonlinear exponential equation [17] as

$$ I^{leak}=I^{s}(\textit{A}T^{2}e^{(\vartheta_{1}V^{dd}+\vartheta_{2})/T}+\textit{B}e^{(\vartheta_{3}V^{dd}+\vartheta_{4})}), $$
(1)

where I s is the leakage current at a certain temperature and supply voltage, T is the operating temperature, and A, B, 𝜗 1, 𝜗 2, 𝜗 3, and 𝜗 4 are all empirical technology constants. Since the leakage current changes super linearly with temperature [18], the leakage power consumption of processing element P E m can be closely approximated by \(P^{leak}_{m}=\alpha _{m}v_{m}+\delta _{m}Tv_{m}\) [18], where α m and δ m are both curve fitting constants and dependent on the architecture of P E m . The dynamic power \(P_{m}^{dyn}\) is independent of the temperature and can be estimated by a strictly increasing and convex function of the operating frequency, that is, \(P_{m}^{dyn} \propto {f_{m}^{3}}\) [19]. As the operating frequency is nearly linear with the supply voltage [19], the power consumption P m, i of task τ i on the processing element P E m is formulated as

$$ P_{m,i}=\mu_{i}(C^{ef}_{m}{v_{m}^{3}}+\alpha_{m}v_{m}+\delta_{m}Tv_{m}), $$
(2)

where μ i is the active factor of task τ i , and \(C_{m}^{ef}\) is the effective switching capacitance of processor P E m .

2.3 Temperature Model

An accurate and practical dynamic model of temperature is needed to accurately characterize the thermal behavior of an application. In this paper, a lumped RC thermal model proposed by Skadron et al. [20] that is widely used in the literature is adopted as the temperature model to predict the temperature of the processor. Let T(t) be the temperature at time instance t, as is given by

$$ RC\frac{dT(t)}{dt}+T(t)-RP(t)=T^{a}, $$
(3)

where P(t) is the power consumption at time instance t. R and C are thermal resistance and capacitance, respectively. They are both processor architecture dependent constants. T a is the ambient temperature of the die.

Consider a task τ i executing on the processing element P E m during the time interval [t 0, t 0 + Δt]. Let T 0 be the initial temperature at time instance t 0, the ending temperature of the task τ i at time instance t 0 + Δt is derived by solving (3) and then formulated as

$$ T_{m,i}(t_{0}+{\Delta} t)=T_{0}e^{-K_{m,i}{\Delta} t}+T^{std}_{m,i}(1-e^{-K_{m,i}{\Delta} t}), $$
(4)

where

$$ T^{std}_{m,i}=\frac{R_{m}(\alpha_{m}v_{m}+C_{m}^{ef}{v_{m}^{3}})\mu_{i}+T^{a}}{1-R_{m}\delta_{m}v_{m}\mu_{i}} $$
(5)

is the steady state temperature of task τ i on the processing element P E m , and \(K_{m,i}=\frac {1-R_{m}\delta _{m}v_{m}\mu _{i}}{R_{m}C_{m}}\) is a time constant that depends on the task and processing element.

2.4 Problem Definition

Given an input task set Γ of N tasks {τ 1, τ 2, ⋯ , τ N } on a MPSoC system PE of M heterogeneous processing elements {P E 1, P E 2, ⋯ , P E M }, a task partition {Γ1, Γ2, ⋯ , Γ M } is feasible if \({\Gamma }_{m}\bigcap {\Gamma }_{k}=\emptyset \)mk (1 ≤ m, kM) and \({\Gamma }_{1} \bigcup {\Gamma }_{2} \bigcup {\cdots } \bigcup {\Gamma }_{M} = {\Gamma }\), where Γ m and Γ k indicate the subset of tasks allocated to processing element P E m and P E k , respectively. In addition, all tasks should be finished before their individual deadlines, that is, R T m, i d i holds for ∀1 ≤ mM and ∀1 ≤ iN, where R T m, i is the worst case response time of task τ i executing on the processing element P E m , and d i is the relative deadline of task τ i . The worst case response time of task τ i is formulated as

$$ RT_{m,i}=\frac{c_{i}}{f_{m}}+{\sum}_{\tau_{j} \in {\Gamma}_{m},p_{j}<p_{i}}{\big\lceil \frac{RT_{m,i}}{p_{j}} \big\rceil }\times \frac{c_{j}}{f_{m}}, $$
(6)

where \(\frac {c_{i}}{f_{m}}\) and \(\frac {c_{j}}{f_{m}}\) are the execution time of task τ i and τ j on the processing element P E m , respectively. Task τ j has a higher priority than task τ i for j < i, and \(\big \lceil \frac {RT_{m,i}}{p_{j}} \big \rceil \) indicates the number of instances of task τ j during the time interval of R T m, i . The objective of the studied problem is to find an optimal task partition and scheduling strategy, which generates the minimum on-chip peak temperature. Let T Peak denote the on-chip peak temperature, then it is given by

$$T^{peak}= \max {\sum}_{m\in [1,M]} {\sum}_{\tau_{i} \in {\Gamma}_{m}} T_{m,i}(t), $$
(7)

where T m, i (t) is the temperature of task τ i on the processing element P E m at time instance t and can be obtained using Eq. 4.

3 The Proposed Temperature-aware Task Allocation Strategy

The steady state temperature of a task is defined as the temperature that will be reached if infinite number of instances of the task execute continuously on the processor, which is in general utilized to represent the thermal characteristics of a task. Specifically, the lower the task steady state temperature, the better thermal characteristics the task [21]. For instance, a task is assumed to be a hot task if its steady state temperature exceeds the system maximum temperature limit, otherwise it is assumed to be a cool task [21, 22]. As is shown in Eq. 5, the steady state temperature of a given task is explicitly determined by parameters of the processor where the task executes. In other words, the steady state temperature of tasks depends on the task-to-processor assignment. Therefore, this paper aims to minimize the steady state temperature of all tasks by finding an optimal task mapping strategy.

Given a set Γ of N real-time tasks, it can be partitioned into M subsets {Γ1, Γ2, ⋯ , Γ M }, where Γ m (1 ≤ mM) indicates the subset of tasks assigned to processing element P E m . It is clear that there are M N partitioning instances. In other words, assigning tasks to processors is essentially a NP hard problem. Thus, this work focuses on proposing a suboptimal task-to-processor assignment heuristic, in which the preference of a task to any processor is set according to its ascending order of steady state temperature. The most preferred processor for a task is the one that makes the task has a minimal steady state temperature. On the contrary, the processor that produces a maximal steady state temperature of the task is least preferred.

The proposed thermal-aware task allocation heuristic attempts to assign tasks to their most preferred processor to minimize the steady state temperature of individual tasks in the system. However, due to the real-time constraint of tasks and the limited capacity of processors, not all tasks can be assigned to their respective most preferred processor. As a result, a ranking mechanism needs to be designed in the proposed algorithm to handle the competition for a processor between multiple tasks.

Let \({\Delta } T_{m,i}^{std}\) be the minimum steady state temperature increment of task τ i from its current processing element P E m to other processing elements, as is given by

$$ {\Delta} T_{m,i}^{std} = \min\limits_{k\neq m,1\le k,m \le M } \{T_{k,i}^{std}| T_{k,i}^{std} \ge T_{m,i}^{std} \}-T_{m,i}^{std}, $$
(8)

where \(T_{k,i}^{std}\) and \(T_{m,i}^{std}\) are the steady state temperature of task τ i on processing element P E k and P E m , respectively. They can be obtained using Eq. 5. The minimum steady state temperature increment of a task reflects the degradation of task thermal profiles when the task is allocated to its less preferred processor. This metric can be utilized to rank the tasks that are competing for the same processor. More specifically, for a given processor, tasks are sorted by the values of ΔT std in decreasing order and the tasks with larger values have higher priorities to occupy the processor. Based on the ranking mechanism, a thermal-aware task allocation heuristic is designed. The task-to-processor assignment can be started from any processor, for instance, it starts from the processor with index 1 in Algorithm 1 (line 18). The principle of the heuristic is that a task assigned to a processor is not considered for an assignment to any other processor where the task has a higher steady state temperature than its currently assigned processor. The same process is repeated for all the processors, and is stopped when all the tasks are assigned to exactly one processor.

The proposed thermal-aware task allocation heuristic given in Algorithm 1 operates as follows. It takes as input a task set Γ, a processor set PE, M subsets {Γ1, Γ2, ⋯ , Γ M }, and the utilization {U 1, U 2, ⋯ , U M } of M processors. Lines 1-2 of the algorithm initialize the subsets {Γ1, Γ2, ⋯ , Γ M } and the utilizations {U 1, U 2, ⋯ , U M } to \(\{\varnothing ,\varnothing ,\cdots ,\varnothing \}\) and {0,0, ⋯ , 0}, respectively. The minimum steady state temperature increment of tasks on each processor is iteratively derived in lines 3-15. Lines 5-6 compute the steady state temperatures of task τ i on M processors, and obtain the maximum using \(\delta _{i}=\max \{T_{1,i}^{std},T_{2,i}^{std},\cdots ,T_{m,i}^{std},\cdots ,T_{M,i}^{std}\}\). If the task τ i is not on its least suitable processor, that is, \(T_{m,i}^{std} \neq \delta _{i}\), the minimum steady state temperature increment \({\Delta } T_{m,i}^{std}\) of the task is calculated using Eq. 8 (lines 9-10); otherwise, it is set to 0 (lines 11-13). The iteration stops when the minimum steady state temperature increment of all tasks are derived. Then a M × N matrix A is constructed to store these increments by \(a_{mi}={\Delta } T_{m,i}^{std}\), where a m i is the element of matrix A (line 16).

Lines 17-35 describe the procedure of task allocation to M processors. For each processor P E m , tasks in the task set Γ are sorted in decreasing order of a m i (line 19), then the algorithm iteratively selects the top (τ i ) of ordered task list and attempts to assign it to processor P E m . If the processor P E m whose utilization is U m has the capacity to accommodate task τ i whose utilization is U m, i (lines 20-21), the task is assigned to the processor (line 22), and not considered for allocation to other processors where the task has a higher steady state temperature as compared to the current processor P E m . If the task τ i was previously assigned to other processors that generate higher steady state temperatures (lines 23-24), such task assignments need to be removed, and the matrix A, the processor utilization U k need to be updated accordingly (lines 25-26). Then the same procedure repeats for next processor. When the first iteration is completed, the algorithm starts the next iteration again from the first processor, and repeats the iterations until all the tasks in the task set are assigned to exactly one processor. Finally, the algorithm outputs the M subsets {Γ1, Γ2, ⋯ , Γ M }, which is allocated to M processors (line 35).

figure e

4 The Proposed Temperature-aware Task Splitting Policy

Task splitting refers to the technique that equally partitions a task into multiple sections and executes them with slack time alternatively. Unlike the previous work presented in [16] that utilizes task splitting technique to maximize throughput, the proposed thermal-aware task scheduling algorithm adopts the task splitting technique and exploits the slack that is generated due to the early completion of real-time tasks to reduce the peak temperature. A motivational example given in Fig. 1 shows the effectiveness of task splitting in reducing peak temperature. Consider a task τ with execution time of 600 ms, deadline of 1000 ms, and steady state temperature of 105 °C. The initial temperature is set to 45 °C. As is shown in Fig. 1a, the task τ has a slack time of 400 ms. When the task is executing in active mode, its temperature increases from 45 °C to 96.9 °C, then drops to 55 °C when the processor is idle. The peak temperature of 96.9 °C is reached at the end of task execution. Figure 1b demonstrates the temperature profiles of task τ when the task is equally split into 5 sections and executed alternately with the slack time. The peak temperature of 82.8 °C is reached at the end of the 5th section. Thus, in this example, the peak temperature can be reduced up to 14.1 °C by using the task splitting technique.

Figure 1
figure 1

The motivational example of task splitting.

The proposed scheduling algorithm first identifies hot tasks from the task set based on the task steady state temperature,then splits each hot task into multiple sections to enable the alternation of hot task execution and slack time, such that the thermal profile is improved and the peak temperature is hence reduced. Two scenarios, that is, one ideal scenario that does not consider the mode switching overhead and another realistic scenario that covers the mode switching overhead are both taken into account in the algorithm.

4.1 The Ideal Scenario without Considering the Mode Switching Overhead

In the ideal scenario, all the slack on the processor can be exploited to cool down the execution of hot tasks and each hot task can be split into endless sections when mode switching overhead is negligible. This is motivated by the observation in [16] that the more the task splitting operation, the lower the task peak temperature. Additionally, all the hot tasks on the processor are assumed to have a uniform ideal peak temperature after thermal optimization, which is realized by endless task splitting and slack distribution. The focus of the ideal scenario is to calculate the uniform ideal peak temperature of tasks on the processor and determine the corresponding ideal slack allocated to each task.

Let Γ m be the subset of tasks allocated to processing element P E m , then the total slack \(SL_{m}^{tot}\) generated on the processor during a hyper-period L m is given by

$$ SL_{m}^{tot}=L_{m}-{\sum}_{\tau_{i} \in {\Gamma}_{m}} \frac{c_{i}}{f_{m}} \times \frac{L_{m}}{p_{i}}, $$
(9)

where L m is the hyper-period of tasks in the subset Γ m , \(\frac {c_{i}}{f_{m}}\) is the execution time of task τ i at the frequency f m , and \(\frac {L_{m}}{p_{i}}\) is the number of instances of task τ i during the time interval of L m .

A task can be either a hot task or a cool task based on its steady state temperature. Specifically, if the steady state temperature of task τ i is larger than the maximum temperature limit T max, the task τ i is deemed to be a hot task and is inserted into the hot subset \({{\Gamma }^{h}_{m}}\); otherwise it is a cool task. According to the definition of ideal scenario, all the available slack time on the processor are assigned to hot tasks for reducing temperature, that is

$$ {\sum}_{\tau_{i} \in {{\Gamma}_{m}^{h}}} sl_{i}^{ideal}\times \frac{L_{m}}{p_{i}}=SL_{m}^{tot}, $$
(10)

where \(sl_{i}^{ideal}\) is the ideal slack allocated to hot task τ i . As is mentioned above, all the hot tasks on the processor will assume a uniform ideal peak temperature \(T_{m}^{ideal}\) through endless task splitting and slack distribution, that is,

$$ \lim\limits_{\forall \tau_{i} \in {{\Gamma}_{m}^{h}},S_{i}\rightarrow \infty} T^{std}_{m,i}=T_{m}^{ideal} , $$
(11)

where S i is the split number of task τ i , and the steady state temperature \(T^{std}_{m,i}\) of task τ i is given in Eq. 5.

Hence \(sl_{i}^{ideal}\) and \(T^{std}_{m,i}\), which are given as follows, can be derived by solving the system of the Eqs. 5, 9, 10, and 11.

$$ \small sl_{i}^{ideal}=(\frac{T^{std}_{m,i}}{T^{ideal}_{m}}-1)\frac{c_{i}B_{m,i}}{f_{m}}, $$
(12)
$$ \small T^{ideal}_{m}=\frac{{\sum}_{\tau_{i} \in {{\Gamma}_{m}^{h}}}\frac{L_{m}}{p_{i}} \frac{c_{i}}{f_{m}}T^{std}_{m,i}B_{m,i}}{L_{m}-{\sum}_{\tau_{i} \in {\Gamma}_{m}}\frac{L_{m}}{p_{i}} \frac{c_{i}}{f_{m}}+{\sum}_{\tau_{i} \in {{\Gamma}_{m}^{h}}}\frac{L_{m}}{p_{i}} \frac{c_{i}}{f_{m}}B_{m,i}}, $$
(13)

where B m, i =(1−μ i R m δ m v m ).

4.2 The Realistic Scenario with Considering the Mode Switching Overhead

In the realistic scenario that considers the mode switching overhead, the key point of minimizing the peak temperature is find the optimal split number for each hot task. This is due to the fact that given a hot task, increasing the number of task splitting operation can further reduce the task peak temperature, but also cause a larger switching overhead. It has been proved in [16] that the effectiveness of task splitting technique in reducing peak temperature is maximized when the switching overhead of a task is equal to its allocated slack time. In addition, considering that the processing element needs to change its mode twice for each task splitting operation. Therefore, the optimal split number S i of task τ i is given by

$$ S_{i}=\lfloor \frac{sl_{i}}{2\times t^{sw}}\rfloor, $$
(14)

where S i is the optimal split number of task τ i , s l i is the slack time allocated to task τ i , and 2 × t sw is the mode switching overhead of one task splitting operation.

4.3 The Proposed Temperature-aware Task Splitting Algorithm

Both the ideal scenario and realistic scenario are discussed in the proposed temperature-aware task splitting algorithm, as is described in Algorithm 2. The algorithm takes as input the tasks of subset Γ m on the processing element P E m , the maximum temperature limit T max, and an arbitrarily small positive number 𝜖. Line 1 identifies the hot tasks in the subset Γ m based on their steady state temperatures. Specifically, if \(T^{std}_{m,i}\ge T^{max}\), the task τ i is a hot task and inserted into the hot subset \({{\Gamma }^{h}_{m}}\). Lines 2-9 show the ideal scenario that does not consider the mode switching overhead t sw. Line 3 calculates the uniform ideal peak temperature \(T_{m}^{ideal}\) using Eq. 13, and lines 4-5 derive the ideal slack \(sl_{i}^{ideal}\) of each hot task using Eq. 12. Then the available slack allocated to task τ i under the real-time constraint is obtained using the procedure \(sl_{i}^{rt}=\) SLK (τ i , Γ m ) in line 6. The slack \(sl_{i}=\min \{sl_{i}^{ideal},sl^{rt}_{i}\}\) is finally assigned to task τ i with endless task splitting for thermal optimization, where the operator \(\vartriangleleft \) indicates the operation of task splitting (lines 7-8).

figure f

The realistic scenario is described in lines 10-16. The slack assigned to task τ i is s l i = SLK (τ i , Γ m ) and the optimal split number S i of the task is computed using Eq. 14 (lines 12-13). Then line 14 splits task τ i into S i sections and alternates their execution with slack time to reduce peak temperature.

The procedure SLAK that derives the maximum available slack for a task is a binary search-based approach. Inputs to the procedure are the task τ i , the subset Γ m , and the arbitrarily small positive number 𝜖. A search space [s l l o w , s l h i g h ] is defined and initialized to [0, d i R T m, i ], where s l l o w and s l h i g h denote the lower and upper bound of the space, respectively, and d i and R T m, i are the deadline and response time of task τ i , respectively (line 17). The search length, which is denoted by ρ, is set to ρ=s l h i g h s l l o w (line 18). Lines 19-28 describe the searching process. In each round of iteration, a dummy task τ t e m is created and initialized to τ i , the median s l m i d of the search space [s l l o w , s l h i g h ] is calculated and taken as the slack assigned to the dummy task τ t e m , and a dummy subset Γ t e m is created and set to Γ m + τ t e m . The procedure RTFA presented in Algorithm 2 is called to check if the timing constraint of tasks in the subset Γ t e m is met. The search space [s l l o w , s l h i g h ] and the search length ρ are updated in each iteration, and the process stops when the ρ is less than yet close enough to the arbitrarily small positive number 𝜖. The lower bound s l l o w of the search space is returned as the maximum available slack that could be assigned to the task τ i (line 29). The feasibility analysis technique, referred to as real-time feasibility analysis (RTFA), is utilized to check if the timing constraint is satisfied. If the response time R T m, i of the task τ i exceeds the deadline d i of the task, the task τ i cannot be feasibly assigned to the processing element P E m (lines 30-40).

There is a possibility that the available slack time during the system is insufficient to control the peak temperature below the maximum temperature limit. In this scenario, the thermal characteristics of cool tasks can be exploited to further reduce the peak temperature by splitting cool tasks into multiple sections and alternating the execution of cool subtasks and hot subtasks. If cool tasks still cannot help to limit the peak temperature under the safe threshold, in which case the task set is deemed infeasible.

5 Experimental Results And Discussions

Extensive simulation experiments have been conducted to validate the effectiveness of the proposed task allocation and scheduling scheme in thermal management. The proposed algorithms were implemented in C++, and the simulation was performed on a machine with Intel Dual-Core 2.3 GHz processor and 8GB memory. This section first describes experimental settings for the simulation, then compares the proposed algorithms with benchmarking schemes. Moreover, the same simulation settings are adopted for the proposed algorithms and benchmarking schemes for the sake of fair comparison.

5.1 Experimental Settings

A simulated platform of multiple processing elements is constructed. The number of processing elements is assumed to be six, and the corresponding supply voltages of six processing elements P E 1, P E 2, P E 3, P E 4, P E 5, P E 6 are set to 0.85 V, 0.9 V, 0.95 V, 1.0 V, 1.05 V, 1.1 V [23]. The parameters of the operating point of each processing element are constant, that is, the supply voltage v, curve fitting constants α and δ, effective switching capacitance C ef and operating frequency f of each processing element are fixed, as is listed in Table 1.

Table 1 Parameters of the simulated platform [23].

25 benchmarking tasks are selected from Mibench [24] and Mediabench [25] to form the task set Γ for validating the proposed task allocation and scheduling algorithm. The numbers of clock cycles of these tasks are in the range of [4 × 107, 6 × 108]. The architecture-level power simulator McPAT [26] is utilized to obtain power consumptions of tasks. It can model all three types of power dissipation, including dynamic, leakage, and short-circuit power, and provide a complete view of power consumptions. The task active factor μ evenly takes values between [0.4,1] to manifest the heterogeneous nature of tasks [16]. The average of thermal resistance R and thermal capacitance C of processing elements are assumed to be 0.80 K/W and 340 J/K, respectively [23], and the variance of the R and C of a processing element is deemed to be 0.05 and 10.0, respectively. The overhead of mode switching is assumed to be 5 ms [16], and the ambient temperature T a is set to 35 °C. The temperature is obtained using the thermal simulator HotSpot [20], which is an accurate yet fast and practical tool to capture thermal profiles.

5.2 Experimental Results

5.2.1 Comparison of the Task Steady State Temperature

Two benchmarking methods are implemented and compared to evaluate the effectiveness of the proposed task allocation heuristics in minimizing task steady state temperature. The first one is the Random algorithm that randomly assigns tasks to processors without utilizing any techniques. The second one, referred to as RMBF [27], is the partitioning heuristic that assigns the task with the highest priority to the processor with smallest unused capacity among those processors on which it fits. The proposed thermal-aware task allocation heuristic attempts to select the most suitable processing element in terms of task steady state temperature for each task in the system.

Figure 2 shows the steady state temperature of 25 tasks under the proposed method, and benchmarking schemes Random and RMBF [27]. It has been demonstrated in Fig. 2 that the steady state temperature of the proposed algorithm is almost much lower than that of the Random and RMBF [27] method. For example, the average steady state temperature of 25 tasks achieved by the proposed algorithm, benchmarking schemes Random and RMBF are 68.86 °C, 72.4 °C, and 71.32 °C, respectively. Thus, the average steady state temperature achieved by the proposed algorithm is 3.57 °C lower than that of scheme Random, and 2.46 °C lower than that of scheme RMBF [27]. The proposed algorithm outperforms the benchmarking methods Random and RMBF [27] in thermal management since it fully takes the thermal characteristics of tasks and processing elements into account during the task allocation.

Figure 2
figure 2

The steady state temperature of 25 tasks under the proposed method, and benchmarking schemes Random and RMBF [27].

5.2.2 Comparison of the Peak Temperature

In the proposed scheme, the thermal profiles of tasks in the task set are improved by selecting the most suitable processing element for each task, then the peak temperature is reduced by splitting hot tasks into multiple sections and enabling the alternation of hot task execution and slack time. The proposed algorithm is compared with two benchmarking methods in terms of peak temperature to demonstrate its effectiveness of thermal management. The first benchmarking scheme, referred to as NOTM, does not utilize any thermal management techniques while the second benchmarking scheme, referred to as task sequencing (TS) [21], utilizes thermal characteristics of tasks to derive a task sequence in the alternate order of being cool-hot.

Figure 3 plots the peak temperature of six processing elements under the proposed method, and benchmarking schemes TS [21] and NOTM. The initial temperature of each processing element is assumed to equal the ambient temperature T a, which is set to 35 °C in this study. It has been shown in Fig. 3 that the peak temperature of six processing elements achieved by the proposed method is much lower than that of benchmarking schemes TS [21] and NOTM. For instance, the proposed method reduces the peak temperature of tasks on the processing element P E 2 by 5.8 % as compared to the TS [21] scheme, and 11.5 % as compared to the NOTM scheme. The NOTM method does not employ any thermal control techniques and is utilized as a baseline to show the highest efficiency of the proposed algorithm in reducing peak temperature. The proposed algorithm also outperforms the state-of-the-art scheme TS [21] in thermal management since it first improves the thermal profiles of each task through a thermal-ware task mapping heuristics, then exploits the task splitting technique and slack time to cool down the execution of hot tasks on the processor for a lower peak temperature.

Figure 3
figure 3

The peak temperature of six processing elements under the proposed method, and benchmarking schemes TS [21] and NOTM.

In addition, Fig. 4 compares the instantaneous temperature of benchmarking scheme TS [21] and the proposed task splitting algorithm. Two tasks are running on the processor. One is a cool task with execution time of 520 s and another is a hot task with execution time of 560 s. The TS [21] method executes the two tasks in the order of being cool-hot. The proposed task splitting algorithm first partitions the cool task and hot task into two cool subtasks and two hot subtasks respectively, then interleaves the execution of cool subtasks and hot subtasks. The peak temperature of the proposed task splitting algorithm is 92.59 °C, which is 5.75 °C lower than that of the benchmarking method TS [21].

Figure 4
figure 4

Compare the instantaneous temperature of benchmarking scheme TS [21] and the proposed task splitting algorithm.

6 Conclusions

This paper explores the thermal-aware task mapping and scheduling for heterogeneous MPSoC systems to reduce the on-chip peak temperature under the real-time constraint. The proposed algorithms explicitly exploit thermal characteristics of both tasks and processors to minimize the peak temperature without incurring significant overheads. The proposed algorithms are implemented in two steps. In the first step, tasks are assigned to individual processors to minimize their respective steady state temperature, while in the second steps, tasks with undesired thermal characteristics (hot tasks) are selected and split into multiple sections to enable the alternation of hot task execution and slack time, such that the peak temperature of tasks is reduced. Experimental results show that a 11.5 % reduction of peak temperature can be achieved by the proposed algorithms as compared to the benchmarking schemes.