1 Introduction

An efficient multiprocessor SoC (MPSoC) implementation requires automated exploration to find an efficient HW allocation, task mapping and scheduling [13]. Heterogeneous MPSoCs are needed for low-power, high-performance and high-volume markets [36]. The central idea in MPSoCs is to increase performance and energy efficiency, which is achieved by efficient communication between cores and by keeping the clock frequency low while providing enough parallelism.

Mapping means placing each application task onto some processing element (PE), as depicted in Fig. 1. Task refers here to the smallest unit of computation that can be re-located. Scheduling means determining the execution timetable of the application components on the platform. The example in Fig. 1 shows that tasks 0 and 1 are mapped to PE 0, task 2 is mapped to PE 1, and task N−1 is mapped to PE M−1. In a strict mapping problem, the state of the system is defined as a mapping of each task to a PE. The state is optimized with respect to given criteria, such as the system’s throughput, latency or power. The optimized criterion is defined by a cost function whose value is minimized. For example, the system’s performance can be optimized by setting the cost function to be its total execution time for a given program and input.

Fig. 1

Conceptual view of mapping application tasks to processing elements

Tasks in a general mapping problem may execute arbitrary programs, deterministic or non-deterministic. In this paper, however, tasks are restricted to a subset known as Kahn Process Networks (KPNs) [15]. Each task executes a program where reads are blocking, and testing for readable data on a communication channel can only be done by a blocking read. This enforces a deterministic result of the computation regardless of the timing and mapping of tasks.
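The determinism property of blocking reads can be illustrated with a small sketch. This is a hypothetical two-process network built on Python threads and queues; the process names and values are illustrative, not from the paper:

```python
import queue
import threading

def producer(out_ch):
    # Writes a fixed sequence to its output channel; writes do not block here.
    for value in range(5):
        out_ch.put(value)
    out_ch.put(None)  # end-of-stream marker (our own convention)

def consumer(in_ch, results):
    # A KPN process may only test for data via a blocking read:
    # Queue.get() blocks until an item is available.
    while True:
        value = in_ch.get()  # blocking read
        if value is None:
            break
        results.append(value * 2)

channel = queue.Queue()
results = []
threads = [threading.Thread(target=producer, args=(channel,)),
           threading.Thread(target=consumer, args=(channel, results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Regardless of thread timing, the output is always the same.
print(results)  # [0, 2, 4, 6, 8]
```

Because the consumer cannot poll the channel, only block on it, the result does not depend on how the threads are interleaved, which is the KPN determinism argument in miniature.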

A large design space must be pruned systematically, since exploration of the whole design space is not feasible [13]. A fast optimization procedure is desired in order to cover a reasonable part of the design space. However, speed comes at the expense of accuracy. Iterative optimization algorithms evaluate a number of application mappings for each resource allocation candidate. The application is simulated for each mapping to evaluate the cost of a solution. The cost may depend on multiple factors, such as cycle count, latency, energy consumption, silicon area and others.

This paper investigates the use of the Simulated Annealing (SA) algorithm in task mapping. We analyze the research question of how to select SA optimization parameters for a given point in design space exploration. Section 2 presents a survey of the current state-of-the-art in SA task mapping. Section 3 surveys the SA parameter choices made in these works. Section 4 presents our results on SA global optimum properties, and Sect. 5 properties of SA acceptance functions. Section 6.1 presents recommendations for reporting SA results to enable disciplined comparisons and reproduction of the experiments by other researchers. Section 6.2 presents recommendations for selecting SA parameters based on related work and new analysis. Finally, we conclude the paper with high-level directions on how to improve the existing state-of-the-art in Sect. 7.

2 Related work

We limit the discussion about mapping heuristics to Simulated Annealing and mention other approaches when direct comparison has been reported in literature.

2.1 Simulated Annealing algorithm

SA is a widely used metaheuristic for complex optimization problems. The term heuristic means that optimality is not guaranteed but the algorithm usually produces satisfactory results, whereas meta denotes that it can be fitted to many kinds of problems. SA is a probabilistic non-greedy algorithm that explores the search space of a problem by annealing from a high to a low temperature. Temperature is a historic term originating from annealing in metallurgy, where material is heated and cooled to increase the size of its crystals and reduce their defects. Temperature indicates the algorithm’s willingness to accept moves to a worse state.

Probabilistic behavior means that SA can find solutions of different goodness between runs. Non-greedy means that SA may accept a move into a worse state, and this allows escaping local minima. A local minimum is a point in the design space where no single move can improve the cost, but perhaps two or three consecutive moves can. The algorithm always accepts a move into a better state. A move to a worse state is accepted with a changing probability. This probability decreases along with the temperature, and thus the algorithm starts as a non-greedy algorithm and gradually becomes more and more greedy.

Figure 2 shows the SA pseudocode. The algorithm takes the initial temperature T 0 and an initial state S as parameters. The state is modified on every iteration. The Cost function evaluates the objective to be optimized, e.g. by simulating the platform. The algorithm seeks to minimize the cost. The Temperature function returns the annealing temperature as a function of T 0 and the loop iteration number i. The Move function generates a new state from a given state. In our case it re-locates some task(s), and the associated cost (e.g. cycle count) is evaluated.

Fig. 2

Pseudocode of the Simulated Annealing (SA) algorithm

This new state can be accepted as a base for the next iteration or discarded. A move to a better state is always accepted, and the Accept function calculates the probability of accepting a state change when the cost function difference ΔC>0. The Random function returns a random real number in the range [0,1).

The End-Condition function returns True iff optimization should be terminated. Parameters of the end conditions are not shown in the pseudocode. These may include various measures, such as the number of consecutive rejected moves, the current and a given final temperature, or the current and an accepted final cost. Finally, the algorithm returns the best state S best in terms of the Cost function.
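The pseudocode described above can be sketched as a minimal Python implementation. The function arguments stand in for the Cost, Move, Temperature, Accept and End-Condition functions of Fig. 2; the toy bit-vector problem in the usage part is purely illustrative, not one of the paper's benchmarks:

```python
import math
import random

def simulated_annealing(cost, move, temperature, accept, end_condition, s0, t0):
    """Minimize cost(s) starting from state s0 at initial temperature t0."""
    s, c = s0, cost(s0)
    s_best, c_best = s, c
    i = 0
    while not end_condition(i, temperature(t0, i)):
        t = temperature(t0, i)
        s_new = move(s)
        c_new = cost(s_new)
        delta_c = c_new - c
        # Improving moves are always accepted; worsening (and lateral) moves
        # are accepted with probability Accept(delta_c, t).
        if delta_c < 0 or random.random() < accept(delta_c, t):
            s, c = s_new, c_new
            if c < c_best:
                s_best, c_best = s, c
        i += 1
    return s_best

# Toy usage: anneal an 8-element 0/1 mapping toward the all-zero state,
# with cost = number of ones.
random.seed(0)

def flip_one_bit(s):
    j = random.randrange(len(s))
    return s[:j] + [1 - s[j]] + s[j + 1:]

best = simulated_annealing(
    cost=sum,
    move=flip_one_bit,
    temperature=lambda t0, i: t0 * 0.95 ** i,
    accept=lambda dc, t: math.exp(-dc / t),
    end_condition=lambda i, t: i >= 500,
    s0=[1] * 8,
    t0=1.0,
)
print(sum(best))  # 0 (the toy global optimum)
```

Note that the best state seen so far is tracked separately, so a late worsening move cannot discard a good solution found earlier.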

2.2 SA in task mapping

Table 1 shows SA usage in 14 publications, each of which is summarized below. All publications use SA for task mapping, 5 for task scheduling, and 1 also for communication routing. None uses SA simultaneously for all purposes. There are many other methods for task mapping, such as genetic algorithms (GA), but they are outside the scope of this paper. Synthetic app. means the paper uses task graphs that are not directed toward any particular application, but are artificial and meant for benchmarking the mapping algorithms. SA usually performs hundreds or thousands of iterations and therefore it is usually run off-line. Performing SA mapping at runtime is rare.

Table 1 Publications where Simulated Annealing has been applied in Task mapping, Scheduling, or Communication routing

Ali [1] optimizes performance by mapping continuously executing applications onto heterogeneous PEs and interconnects while preserving two quality of service constraints: maximum end-to-end latency and minimum throughput. Mapping is optimized statically to increase the QoS safety margin in a multisensor shipboard computer. Tasks are initially mapped by a fast greedy heuristic, after which SA further optimizes the placement. SA is compared with 9 other heuristics. SA and GA were the best heuristics in the comparison. SA was slightly faster than GA, with 10 % less running time.

Bollinger [4] optimizes performance by mapping a set of processes onto a multiprocessor system and assigning interprocessor communication to multiple communication links to avoid traffic conflicts. The purpose of the paper is to investigate task mapping in general.

Braun [5] optimizes performance by mapping independent (non-communicating) general purpose computing tasks onto distributed heterogeneous PEs. The goal is to execute a large set of tasks in a given time period. An example task given was analyzing data from a space probe and sending instructions back to the probe before a communication black-out. SA is compared with 10 heuristics, including GA and Tabu search (TS). GA, SA and TS execution times were made approximately equal to compare the effectiveness of these heuristics. GA mapped tasks were 20–50 % faster compared to SA. Tabu mapped tasks vary from being 50 % slower to 5 % faster compared to SA. SA was run with only one mapping per temperature level, but repeating the annealing process 8 times for two different temperature coefficients. One mapping per temperature level means an insufficient number of mappings for good exploration. However, GA is better with the same number of mappings. GA gave the fastest solutions.

Coroyer [6] optimizes performance excluding communication costs by mapping and scheduling directed acyclic graphs (DAGs) to homogeneous PEs. 7 SA heuristics are compared with 27 list scheduling heuristics. SA results were the best compared to other heuristics, but SA’s running time was two to four orders of magnitude higher than other heuristics. The purpose of the paper is to investigate task mapping and scheduling in general.

Ercal [10] optimizes performance by mapping DAGs onto homogeneous PEs and a network of homogeneous communication links. Load balancing constraints are maintained by adding a penalty term to the objective function. Performance is estimated with a statistical measure that depends on the task mapping, the communication profile of tasks and the distance of communicating tasks on the interconnect network. Simulation is not used. The model assumes communication is Programmed IO (PIO) rather than DMA. SA is compared with a proposed heuristic called Task Allocation by Recursive Mincut. SA’s best result is better than the mean result in 4 out of 7 cases. The running time of SA is two orders of magnitude higher than that of the proposed heuristic.

Ferrandi [11] optimizes performance by mapping and scheduling a Hierarchical Task Graph (HTG) [12] onto a reconfigurable MPSoC of heterogeneous PEs. HTGs were generated from C programs parallelized with OpenMP [33]. SA is compared with Ant Colony Optimization (ACO) [9], TS and a FIFO scheduling heuristic combined with first-available PE mapping. SA running time was 28 % larger than ACO’s and 12 % less than TS’s. FIFO scheduling happens at runtime. SA gives 11 % worse results (performance of the solution) than ACO and results comparable with TS.

Kim [16] optimizes performance by mapping and scheduling independent tasks that can arrive at any time to heterogeneous PEs. Tasks have priorities and soft deadlines, both of which are used to define the performance metric for the system. Dynamic mapping is compared to static mapping where the arrival times of tasks are known ahead of time. SA and GA were used as static mapping heuristics. Several dynamic mapping heuristics were evaluated. Dynamic mapping runtimes were not given, but they were very probably many orders of magnitude lower than those of static methods because dynamic methods are executed during the application runtime. Static heuristics gave noticeably better results than dynamic methods. SA gave 12.5 % better performance than dynamic methods, and did slightly better than GA. SA runtime was only 4 % of the GA runtime.

Koch [18] optimizes performance by mapping and scheduling DAGs representing DSP algorithms to homogeneous PEs. SA is benchmarked against list scheduling heuristics. SA is found superior to other heuristics, such as Dynamic Level Scheduling (DLS) [34]. SA does better when the proportion of communication time increases over the computation time, and the number of PEs is low.

Lin [22] optimizes performance while satisfying real-time and memory constraints by mapping general purpose synthetic tasks to heterogeneous PEs. SA reaches a global optimum with 12 node graphs.

Nanda [24] optimizes performance by mapping synthetic random DAGs to homogeneous PEs on hierarchical buses. The performance measure to optimize is an estimate of expected communication costs and loss of parallelism with respect to critical path (CP) on each PE. Schedule is obtained with list scheduling heuristics. Two SA methods are presented where the second one is faster in runtime but gives worse solutions. The two algorithms reach within 2.7 % and 2.9 % of the global optimum for 11 node graphs.

Ravindran [31] optimizes performance by mapping and scheduling DAGs with resource constraints on heterogeneous PEs. SA is compared with DLS and a decomposition based constraint programming approach (DA). DLS is the computationally most efficient approach, but it loses to SA and DA in solution quality. DLS runs in less than a second, while DA takes up to 300 seconds and SA takes up to 5 seconds. DA is an exact method based on constraint programming that beats SA in solution quality in most cases, but SA is found to be the most viable approach for larger graphs where constraint programming fails due to the complexity of the problem space.

Wild [35] optimizes performance by mapping and scheduling DAGs to heterogeneous PEs. PEs are processors and accelerators. SA is compared with TS, FAST [20, 21] and a proposed Reference Constructive Algorithm (ReCA). TS gives 6–13 %, FAST 4–7 %, and ReCA 1–6 % better application execution time than SA. FAST is the fastest optimization algorithm. FAST is 3 times as fast as TS for 100 node graphs and 7 PEs, and 35 times as fast as SA. TS is 10 times as fast as SA.

Xu [37] optimizes performance of a rule-based expert system by mapping dependent production rules (tasks) onto homogeneous PEs. A global optimum is solved for the estimate by using linear programming (LP) [23]. SA is compared with the global optimum. SA reaches within 2 % of the global optimum in 1 % of the optimization time compared to LP. However, the cost function is a linear estimate of the real cost, so in this sense the global optimum is not exact.

In summary, SA performs rather well on average. The differences in solution quality are usually 1–10 %, whereas algorithm runtimes vary greatly, e.g. from a few percent up to 100x–1000x. Heuristics have been commonly evaluated for systems with 6–8 PEs, and task graph sizes vary greatly, from 12 to 4000 tasks. So far, most of the comparisons have been between heuristics and unfortunately not against the global optimum; reference optima were reported only for small cases (11–12 node graphs). The global optimum is the exact comparison point and should be sought when feasible. Even if near-optimal results in small test cases do not guarantee performance in larger problems, at least they show the performance trend (e.g. linear vs. quadratic degradation). Nevertheless, extrapolating the performance is somewhat dubious.

3 SA parameters

Table 2 shows parameter choices for the publications presented in Sect. 2.2. The move and acceptance functions, the annealing schedule, and the number of iterations per temperature level were investigated, where N is the number of tasks and M is the number of PEs. “?” indicates that the information is not available. “N/A” indicates that the value does not apply to a particular publication. A detailed explanation of the parameters is presented in Sects. 3.1, 3.2 and 3.3.

Table 2 Utilized move heuristics, acceptance functions, and annealing schedule in Simulated Annealing. Geometric annealing schedules (G) have a temperature scaling co-efficient “q”. Adaptive initial temperature “T 0” and Stop condition are also marked. “L” is the number of iterations per temperature level, where N is the number of tasks and M is the number of PEs

3.1 Move functions

Table 2 shows 11 different move functions applied in the publications. There are two kinds of functions. Agnostic ones do not consider the application or platform structure at all, whereas the others may weigh some tasks more than the rest.

Single Task (ST) move takes one random task and moves it to a random PE. It is the most common move heuristic; 9 out of 14 publications use it. It is not known how many publications exclude the current PE from the randomization so that a different PE is always selected. Excluding the current PE is useful because evaluating the same mapping again on consecutive iterations is counterproductive.

EM [35] is a variation of ST that limits task randomization to nodes that have an effect on the critical path length in a DAG. The move heuristic is evaluated with SA and TS. SA solutions are improved by 2–6 %, but TS solutions are not improved. Using EM multiplies the SA optimization time by 50–100x, which makes its efficiency dubious. However, using EM for TS approximately halves the optimization time!

Swap 1 (SW1) move is used in 3 publications. It chooses 2 random tasks and swaps their PE assignments. These tasks should preferably be mapped on different PEs. MBOL is a variant of SW1 where task randomization is altered with respect to annealing temperature. At low temperatures, only the tasks that are close in system architecture are considered for swapping. At high temperatures more distant tasks are considered.

Priority move 4 (P4) is a scheduling move that swaps the priorities of two random tasks. This can be viewed as swapping positions of two random tasks in a task permutation list that defines the relative priorities of tasks. P1 is a variant of P4 that considers only tasks that are located on the same PE. A random PE is selected at first, and then priorities of two random tasks on that PE are swapped. P3 scheduling move selects a random task, and moves it to a random position in a task permutation list.

Hybrid 1 (H1) combines task assignment and scheduling simultaneously. First the ST move is applied, and then P3 is applied to the same task to set a random position on the permutation list of the target PE. Koch [18] uses a variant of H1 that preserves the precedence constraints of the moved task in selecting the random position in the permutation of the target PE. That is, schedulable order is preserved in the permutation list.

MLI is a combination of ST and SW1 that tries three mapping alterations. First, ST is tried greedily. The move heuristic terminates if the ST move improves the solution; otherwise the move is rejected. Then SW1 is tried with the SA acceptance criterion. The move heuristic terminates if the acceptance criterion is satisfied. Otherwise, ST is tried again with the SA acceptance criterion.

ST is the most favored mapping move. Heuristics based on swapping or moving priorities in the task priority list are the most favored scheduling moves. The choice and impact of mapping and scheduling moves have not been studied thoroughly.
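The two most common mapping moves can be sketched as follows, assuming a mapping is a list that gives the PE index of each task. The function names and the tie-breaking choices are our own, not taken from the surveyed papers:

```python
import random

def single_task_move(mapping, num_pes):
    """ST: move one random task to a random PE, excluding its current PE
    so the same mapping is never re-evaluated on consecutive iterations."""
    new_mapping = list(mapping)
    task = random.randrange(len(mapping))
    candidates = [pe for pe in range(num_pes) if pe != mapping[task]]
    new_mapping[task] = random.choice(candidates)
    return new_mapping

def swap_move(mapping):
    """SW1: swap the PE assignments of two random tasks, preferring tasks
    mapped on different PEs (our reading of the description above)."""
    new_mapping = list(mapping)
    a = random.randrange(len(mapping))
    others = [t for t in range(len(mapping)) if mapping[t] != mapping[a]]
    b = random.choice(others) if others else random.randrange(len(mapping))
    new_mapping[a], new_mapping[b] = new_mapping[b], new_mapping[a]
    return new_mapping
```

Note the structural difference: an ST move changes exactly one task's PE, while an SW1 move preserves the number of tasks on each PE.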

3.2 Acceptance functions

Table 2 shows that 4 different acceptance functions have been used in publications. Acceptance function takes the change in cost ΔC and temperature T as parameters. There are 3 relevant cases to decide whether to accept or reject a move.

1. ΔC<0 is trivially accepted since the cost decreases.

2. ΔC=0 is often accepted probabilistically. The probability is usually 0.5 or 1.0.

3. ΔC>0 is accepted with a probability that decreases as T decreases or as ΔC grows.

The most common choice is exponential acceptor function (denoted with E in the table)

$$ \mathit{Accept}(\Delta C, T)=\exp\biggl( \frac{-\Delta C}{T}\biggr). $$
(1)

For ΔC>0 the probabilities fall into the range (0,1), which means that quite many worsening moves are accepted. Unfortunately, the behavior depends heavily on the ratio of the cost (which depends on the system) and the temperature (set by the SA developer).

In order to avoid manual parameter tuning, Orsila [27, 30] uses the normalized inverse exponential function (NIE)

$$ \mathit{Accept}(\Delta C, T)=\frac{1}{1+\exp(\frac{\Delta C}{0.5 C_0 T})}. $$
(2)

First, the temperature T is in the normalized range (0, 1] regardless of the system, and ΔC is normalized with respect to the initial cost C 0. Second, the denominator is always ≥2 for ΔC≥0, and hence the acceptance probabilities are in the range (0, 0.5].

The normalized exponential (NE) acceptor

$$ \mathit{Accept}( \Delta C, T)=\exp\biggl(\frac{-\Delta C}{0.5 C_0 T}\biggr) $$
(3)

uses similar normalizations as the NIE acceptor (2). The difference is that Accept(0,T)=1.0 for NE (just like E) and 0.5 for NIE. Figure 3 illustrates the differences. The x-axis shows the temperature. It decreases during the process, and hence optimization proceeds from right to left. The y-axis shows the acceptance probability. The lines show the acceptance probabilities for moves that are 5 % and 30 % worse than the current one. NE was found to be slightly better than NIE in [25] (see Sect. 5.1).

Fig. 3

Probabilities of Normalized Exponential (NE) and Normalized Inverse Exponential (NIE) acceptance function. Small worsening moves are accepted with higher probability. Moreover, NE accepts nearly twice as many moves as NIE at high temperatures (beginning of optimization)

Braun [5] uses an inverse exponential function (IE)

$$ \mathit{Accept}(\Delta C, T)=\frac{1}{1+\exp(\frac{\Delta C}{T})}. $$
(4)

Temperature normalization is not used, but the same effect is achieved by setting the initial temperature properly: T 0=C 0.

Using an exponential acceptor and dynamic initial temperature is the most common choice.
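The four acceptance functions (1)–(4) can be transcribed directly; each returns the acceptance probability of a worsening move. The function names are ours:

```python
import math

def accept_e(delta_c, t):
    """E: Eq. (1); depends on the absolute cost scale of the system."""
    return math.exp(-delta_c / t)

def accept_nie(delta_c, t, c0):
    """NIE: Eq. (2); delta_c is normalized by the initial cost c0 and
    t lies in (0, 1], so worsening moves get probabilities in (0, 0.5]."""
    return 1.0 / (1.0 + math.exp(delta_c / (0.5 * c0 * t)))

def accept_ne(delta_c, t, c0):
    """NE: Eq. (3); like NIE but Accept(0, t) = 1.0, as with E."""
    return math.exp(-delta_c / (0.5 * c0 * t))

def accept_ie(delta_c, t):
    """IE: Eq. (4); no normalization, so T0 must be scaled to the costs."""
    return 1.0 / (1.0 + math.exp(delta_c / t))

# A 5 % worsening move (delta_c = 0.05 * c0) at a mid-range temperature:
print(round(accept_ne(5.0, 0.5, 100.0), 3))   # 0.819
print(round(accept_nie(5.0, 0.5, 100.0), 3))  # 0.45
```

The example values illustrate the roughly 2x gap between NE and NIE acceptance probabilities visible in Fig. 3.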

3.3 Annealing schedule

Table 2 shows that 4 different annealing schedules have been used in the publications. The annealing schedule is a trade-off between solution quality and optimization time. It defines the temperature levels and the number of iterations for each temperature level. Optimization starts at the initial temperature T 0 and ends at the final temperature T f . These need not be constants, in which case the initial temperature selection and/or the stopping criterion are adaptive. The stopping criterion defines when optimization ends.

Geometric temperature scale (G) is the most common annealing schedule (11 out of 14). Temperature T is multiplied by a factor q∈(0,1) to get the temperature for the next level, that is,

$$ T_{next} = q T. $$
(5)

Usually several iterations are spent on a given temperature level before switching to the next level. The scaling factor q is most often 0.95 or 0.99 in the literature. Some works use q=0.9, which was also used by Kirkpatrick et al. who presented SA [17]. They used SA for partitioning circuits onto two chips, which is analogous to mapping tasks onto two PEs. Figure 4 shows the temperature as a function of the iteration count for 4 different schedules. Naturally, the temperature decreases faster with a small q and optimization will terminate sooner. Moreover, the iteration count per temperature level has a major impact (see Sect. 3.5).
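A sketch of the geometric schedule with L iterations per temperature level; the default q and L values below are merely common literature choices:

```python
def geometric_temperature(t0, i, q=0.95, L=16):
    """G: unrolls Eq. (5) to give the temperature at iteration i when
    L iterations are spent per level: T_i = q**(i // L) * T0."""
    return t0 * q ** (i // L)

print(geometric_temperature(1.0, 0))          # 1.0 (first level)
print(geometric_temperature(1.0, 16))         # 0.95 (second level)
print(round(geometric_temperature(1.0, 32, q=0.9), 2))  # 0.81 (= 0.9**2)
```

This also shows why q and L interact: halving L or lowering q both shorten the annealing, as Fig. 4 illustrates.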

Fig. 4

Examples of geometric annealing schedule with two scaling coefficients q=0.95 and q=0.9. Moreover, two iteration counts are shown L=1 and L=16. Too small q and L will terminate SA quickly and likely cause sub-optimal results

Bollinger’s schedule [4], denoted B, considers the size of improvement during scaling. It computes the ratio R of the minimum cost to the average cost on a temperature level, and then applies a geometric temperature multiplier

$$ T_{next} = \min(R, q_L) T, $$
(6)

where q L is an experimentally chosen lower bound for the temperature multiplier. q L varied in range [0.9,0.98].

Koch [18] and Ravindran [31], denoted K, use a dynamic q factor that depends on the statistical variance of cost at each temperature level. The temperature is calculated with

$$ T_{next} = \frac{T}{1 + \frac{T \ln(1 + \delta)}{3 \sigma_T}} $$
(7)

where δ is an experimental parameter used to control the temperature step size and σ T is the standard deviation of cost function on temperature level T. Higher δ or lower σ T means a more rapid decrease in temperature. Koch suggests δ in range [0.25,0.50].
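Equation (7) can be transcribed as a sketch; the handling of a zero-variance level is our own assumption, not specified in [18] or [31]:

```python
import math
import statistics

def koch_next_temperature(t, level_costs, delta=0.25):
    """K: Eq. (7); the temperature step depends on the standard deviation
    of the costs sampled on the current temperature level."""
    sigma = statistics.pstdev(level_costs)
    if sigma == 0.0:
        # Degenerate level with no cost variation: keep T unchanged
        # (our own assumption).
        return t
    return t / (1.0 + t * math.log(1.0 + delta) / (3.0 * sigma))

# Larger cost spread (sigma) -> slower cooling; larger delta -> faster cooling.
t1 = koch_next_temperature(1.0, [0.9, 1.1])             # sigma = 0.1
t2 = koch_next_temperature(1.0, [0.9, 1.1], delta=0.5)
print(t2 < t1 < 1.0)  # True
```

The sketch makes the adaptivity concrete: a level with widely varying costs cools slowly, giving SA more time to sample that region.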

Coroyer [6] experimented with a fractional temperature scale (F). Temperature T is calculated as \(T_{i} = \frac{T_{0}}{i}\) where T 0 is the initial temperature and i is the iteration number starting from 1. Fractional temperature was found to be worse than geometric.

3.4 Adaptivity in start and stop conditions

The initial temperature T 0 can be chosen experimentally or methodically. Ten works present methods to set T 0 based on given problem space characteristics and a desired acceptance criterion for moves [4–6, 10, 18, 22, 26, 29, 30, 37]. For example, T 0 can be set purely based on simulation so that it is raised high enough that the average move acceptance probability p is high enough for aggressive statistical sampling of the problem space, e.g. p≈0.9 [6, 18]. In contrast, our method determines T 0 from the task graph and the system architecture sizes [26, 29, 30].

There are several common stopping criteria. For example, optimization ends when a given T f has been reached [6, 18], a given number of consecutive rejections happen, a given number of consecutive moves has not improved the solution, a given number of consecutive solutions are identical [6], or a solution (cost) with given quality is reached [18]. These criteria can also be combined. Constants associated with these criteria are often decided experimentally, but adaptive solutions exist too. We use adaptive T f that is computed from problem space similarly to T 0 [26, 29, 30].

3.5 Iterations per temperature level

The number of iterations per temperature level L is defined as an experimental constant or a function of the problem space. It affects the annealing schedule and hence also the acceptance behavior notably, as highlighted in Fig. 4. No single method is clearly more common than others, but functions are more common than constants (8 vs. 4 papers). A function of the problem space (based on the number of tasks and PEs, and the cost) is more widely applicable, but a constant can be tuned more finely to a specific problem. Parameter L often depends on the task count N and the PE count M. Ercal [10] proposes an L that is proportional to N(M−1), the number of neighboring solutions in the mapping space.

Lin [22] uses an adaptive number of iterations per temperature level (LLI). The approach starts with L 0=N(N+M) at T 0. The number of iterations for the next temperature level is calculated by L next =min(1.1L, N(N+M)^2). However, on each temperature level an extra XL iterations are computed iff X>L, where \(X = \frac{1}{\exp((C_{\min}-C_{\max}) / T)}\) and C min and C max are the minimum and maximum costs on the temperature level T. The initial temperature is computed as T 0=a+b, where a is the maximum execution cost of any task on any PE, and b is the maximum communication cost between any two tasks. Then the average and standard deviation of the cost at T 0 are computed by sampling a number of moves. The temperature is doubled (once) if the minimum and maximum cost are not within two standard deviations of the average.
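Lin's level-size rules can be transcribed as a sketch; the function names are ours, and the rules follow the description above:

```python
import math

def lin_next_iterations(l_cur, n, m):
    """LLI: the next level's iteration count, capped at N(N+M)**2."""
    return min(1.1 * l_cur, n * (n + m) ** 2)

def lin_extra_iterations(l_cur, c_min, c_max, t):
    """Extra X*L iterations are run iff X > L, with
    X = 1 / exp((c_min - c_max) / t) >= 1 since c_min <= c_max."""
    x = 1.0 / math.exp((c_min - c_max) / t)
    return x * l_cur if x > l_cur else 0.0

n, m = 32, 2
l0 = n * (n + m)  # initial level size L0 = N(N+M)
print(l0)                                       # 1088
print(round(lin_next_iterations(l0, n, m), 1))  # 1196.8
print(lin_extra_iterations(8, 90.0, 110.0, 100.0))  # 0.0 (X ≈ 1.22 < L)
```

Since X grows as the cost spread on a level grows relative to T, the extra iterations kick in exactly on levels where the search still sees large cost variation.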

Ferrandi [11] uses an adaptive number of iterations per temperature level (LFE). The temperature level is switched on the first accepted move on the level. However, there can be arbitrarily many rejected moves before that.

4 On global optimum results

This section analyzes how many mapping iterations are needed to reach a given solution quality. The SA convergence rate and the number of mapping iterations are compared against global optimum solutions. Specifically, the effects of the L value and the acceptance function on the convergence rate and solution quality are analyzed.

One difficulty in experiments with heuristics is the choice of an appropriate reference point. Unfortunately, a global optimum is hard to find for large task graphs since the mapping problem is NP-hard. Therefore, we present a thorough experiment where SA is compared to a global optimum found with exhaustive search. The effectiveness of heuristics may decrease rapidly as the number of task nodes grows. Each task can be placed on any of the M PEs, and hence an N-item vector stores the full mapping. The total number of mappings X for N tasks and M PEs is X=M^N. The number of mappings grows exponentially with respect to tasks, and polynomially with respect to PEs. Adding PEs is preferable to adding nodes from this perspective.
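The size of the mapping space follows directly from X = M^N, for example:

```python
def mapping_count(num_tasks, num_pes):
    """Total number of distinct task-to-PE mappings, X = M**N."""
    return num_pes ** num_tasks

# 32 tasks on 2 PEs; fixing one task (valid for homogeneous PEs on a single
# shared bus, as in Sect. 4.1) halves the space to 2**31 mappings per graph.
print(mapping_count(32, 2))  # 4294967296
print(mapping_count(31, 2))  # 2147483648
```

The asymmetry is visible here: adding one task doubles the 2 PE space, whereas adding a PE only multiplies it polynomially per task.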

4.1 Experiment setup

Our SA with automatic temperature (SA+AT) algorithm in [30] is further analyzed with respect to global optimum convergence. The automatic temperature algorithm defines the start temperature T 0, the final temperature T f , and the number of iterations per temperature level L so that solution quality and SA runtime are balanced. SA+AT is applied to a problem space with a variable number of PEs and nodes using Kahn Process Networks (KPNs). Specifically, we solve the cases shown in Table 3 by brute force, and then compare the SA+AT heuristic with the brute force global optimum results.

Table 3 The optimization cases for brute force optimization. For each case we find the global optimum for 10 graphs generated with a KPN generator. G means 10^9

More than 2 PEs would have been too costly to compute by brute force for 32 or more nodes. Even so, the experiments together required over 500 days of CPU time. Therefore, we limited the number of nodes for each case as follows. The number of nodes for each case is picked so that the number of brute force iterations needed to find the global optimum is the same within a factor of 0.5 to 2. As one task mapping was always fixed to one PE, the brute force complexity of the problem is X=2^31=2 147 483 648≈2.1×10^9 mappings for each graph. Note that fixing one node is a valid choice because the PEs are homogeneous and connected with a single shared bus in our cases. Then, given M PEs, we selected the number of nodes N so that the following equation roughly holds: M^(N−1)=X.

In Table 3, mappings per second determines the speed at which brute force optimization can make progress. The average for the 2 and 8 PE cases varies by a factor of 1.7x. Optimization time in brute force and SA is dominated by the time to evaluate a mapping. The number of CPU days consumed in the brute force experiment is also presented. Note that since 10 graphs were optimized for each case, one tenth of this value represents the brute force optimization time for a single graph.

Ten synthetic graphs were generated with kpn-generator [19] for each system size. For 2 PE graphs the target distribution parameter was set to 10 %, in the other cases to 30 %. Target distribution defines the relative number of tasks that can act as a target for each task. For example, a 10 % target distribution for a 32 node graph means each task can communicate with at most 3 other tasks. All graphs were cyclic graphs (default) and the b-model parameter for both communication size and computation time burstiness was set to 0.7 (default).

SA+AT was run 1000 times independently for each of the 10 graphs for each system size. We measured the proportion of SA runs that come within p percent of the global optimum cost. Here, we use the execution time t as the cost. We measured the number of SA+AT runs that got execution time t≤(1+p)t o , where p is the execution time overhead compared to the global optimum execution time t o . Although the graphs generated by kpn-generator are cyclic, each task is activated a certain fixed number of times. Execution of the graph is complete when there are no more activations, and that instant defines the execution time.
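The measured metric can be sketched as a hypothetical helper; the run times in the example are illustrative, not measurements from the experiment:

```python
def proportion_within(run_times, t_opt, p):
    """Fraction of SA runs whose execution time t satisfies t <= (1+p)*t_opt,
    i.e. runs within overhead p of the global optimum execution time t_opt."""
    hits = sum(1 for t in run_times if t <= (1.0 + p) * t_opt)
    return hits / len(run_times)

# Hypothetical run times (us) against an optimum of 535 us, with p = 5 %:
times = [535.0, 560.0, 700.0, 545.0]
print(proportion_within(times, 535.0, 0.05))  # 0.75
```

In the experiment this fraction is computed over the 1000 independent SA+AT runs per graph.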

The 2 PE graphs for the experiment are available at the DCS task mapper web page [7] under the section “Experimental data”. The experiment data for 3, 4, 6 and 8 PEs is available at [8]. The first case with 2 PEs used the NIE acceptor, but once NE was found better it was used for the cases with 3–8 PEs. A reference C implementation of the SA+AT algorithm is also available at [32]. The experiment can be repeated on a GNU/Linux system by using DCS task mapper and jobqueue [14].

4.2 Example convergence with one graph

Let us first consider the difficulty of the mapping problem. Figure 5 shows the task graph execution time for one graph plotted against the number of mappings. All possible mapping combinations for 32 nodes and 2 PEs were computed with a brute force search. One node mapping was fixed and the remaining 31 were varied, which resulted in 2^31 mappings. The initial execution time is 875 μs (all tasks on a single PE), the mean 1033 μs and the standard deviation 113 μs. It is interesting to note that there is exactly one optimum mapping, which results in 535 μs. That is nearly 2x as fast as the average and about 3.2x as fast as the worst case. Hence, mapping has a large impact on performance, but the optimal case is extremely unlikely to be found by random choice. It is like a needle in a haystack.

Fig. 5
figure 5

Task graph execution time for one graph plotted against the number of mappings. All possible mapping combinations for 32 nodes and 2 PEs were computed with brute force search. One node mapping was fixed. 31 node mappings were varied, resulting in 2^31 ≈ 2.1×10^9 mappings. The initial execution time is 875 μs, mean 1033 μs and standard deviation 113 μs. There is only one mapping that reaches the global optimum 535 μs

Table 4 shows how many SA+AT runs got execution time overhead p or less compared to global optimum execution time t o . It also shows the associated number of mappings within that range computed from the brute force search.

Table 4 The number of mappings within p % of the global optimum computed with a brute force search, and the proportion of SA runs that converged within that range. Note that there are only very few good mappings out of 2 billion. A higher SA run proportion is better

There is only one mapping that achieves the global optimum of 535 μs, but SA reaches it in 2.1 % of the runs, which is a very good result considering the difficulty of this mapping problem. Repeating SA yields the global optimum in approximately 10^5 iterations. Purely random mapping (RM) would find the global optimum in approximately half the runs only after 10^9 iterations; in a million iterations, RM only converges within 13 % of the global optimum.
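These repeated-run figures follow from a geometric restart model: if a single run succeeds with probability s and costs m iterations on average, the expected total iteration count is m/s. A sketch with illustrative numbers of the same magnitude as in the text:

```python
def expected_iterations(mean_iters_per_run, success_rate):
    """Expected total iterations when restarting SA until success (geometric model)."""
    return mean_iters_per_run / success_rate

# Illustrative: ~2100 mappings per run combined with the 2.1 % optimum hit rate
# from the text lands on the ~1e5 iterations quoted for repeated SA.
print(expected_iterations(2100, 0.021))  # approximately 1e5
```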

Interestingly, there is also only one mapping that is the worst solution: 1708 μs. Reversing the cost difference comparison in SA yields the worst solution in 0.9 % of the runs. The SA algorithm can therefore also find the global maximum.

4.3 Convergence results for 10 graphs

Table 5 shows the execution time overheads for 2 PEs and 32 tasks considering all 10 task graphs. The values are well aligned with Table 4 and are also shown in Fig. 6. The first row, p=0 %, means the global optimum (shortest execution time). p=10 % means an execution time no more than 10 % over the global optimum. The last two rows show the mean and median number of mappings tried in a single SA+AT run. SA uses only a small fraction of the brute force iterations, but does not always converge to the global optimum. The optimum result is obtained in 0.6–15.2 % of cases depending on L. However, the slightly looser bound p=+5 % is already reached in 7.6–64.9 % of runs, which is very good. For each L, the 90 % probability level is marked in boldface in the table, and the dotted vertical lines in Fig. 6 show the corresponding overhead p.

Fig. 6
figure 6

SA+AT convergence with respect to global optimum with 2 PEs and 32 task nodes, as in Table 5. Curves show proportion of SA+AT runs that converged within p from global optimum for a given value of L. Lower p value on X-axis is better. Higher probability on Y-axis is better. SA+AT chooses L=32

Table 5 SA+AT convergence with respect to global optimum for 2 PEs and 32 nodes. Values in the table show the proportion of SA+AT runs that converged within p from the global optimum. A higher value is better. The experiment varied the parameter L=16–256, and the automatic parameter selection method of SA+AT chooses L=32. The 90 % level is marked in boldface in each column

Determining the cost is the most time consuming operation, as it usually requires some sort of simulation. Therefore, the number of mapping iterations should be minimized to obtain a reasonable runtime for the heuristic. The runtime of the heuristic itself is often negligible, and hence it pays off to design a more complex algorithm which cleverly reduces the iteration count.

Table 6 shows the expected number of mappings needed to obtain a result within p percent of the global optimum by repeatedly running SA+AT. The values are derived from the convergence rates. For example, take the cell p=5 %, L=16 in Table 5. It shows that roughly 7.6 % of the runs reached that performance. Dividing the mean number of mappings by that rate gives 840/0.0757 ≈ 11 099, which is shown in Table 6. The automatically selected L value works rather well, as it has the smallest or second smallest iteration count for each value of p.
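The derivation of Table 6 from Table 5 is a single division; a sketch using the quoted cell values (the helper name is ours):

```python
def expected_mappings(mean_mappings_per_run, convergence_rate):
    """Expected mappings over repeated SA+AT runs to reach the target quality."""
    return mean_mappings_per_run / convergence_rate

# p = 5 %, L = 16: mean 840 mappings per run at a 7.57 % convergence rate,
# close to the 11 099 mappings quoted in Table 6.
print(round(expected_mappings(840, 0.0757)))
```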

Table 6 Approximate expected number of mappings for SA+AT to reach performance within p percent from the global optimum with 2 PEs and 32 nodes. Values are derived from Table 5. SA+AT chooses L=32. The best values (smallest) are in boldface on each row

Note that convergence is not guaranteed but probabilistic, which means the actual global optimum might never be reached. Brute force for 2 PEs uses 2.0×10^4–2.3×10^4 times the iterations compared to SA, for L=32 and L=64, respectively. Hence, SA offers 4 orders of magnitude improvement over brute force. Moreover, results within 3 % of the global optimum are reached with 5 orders of magnitude fewer iterations by using SA+AT, and results within 16 % with 6 orders of magnitude fewer.

4.4 Differences between graphs

There are of course differences between the 10 tested graphs. Table 7 shows the convergence rate variance between graphs for L values 16–256 in the 2-PE case. Minimum and maximum convergence rates vary by a factor of approximately 50x. Doubling L approximately doubles the mean convergence rate. For L≥32 the global optimum was found for each graph when SA was run 1000 times. However, L should be scaled according to the problem; otherwise a small L obtains poor results with large problems, or a large value wastes time on simple problems.

Table 7 Global optimum convergence rate varies between the 10 tested 32-node graphs and L values. The Minimum column denotes the hardest graph and the Maximum column the easiest one. The value 0 % in the Min column means there was a graph for which the global optimum was not found in 1000 SA+AT runs. Note that the Mean column has the same values as the p=0 % row in Table 5

With L=256 the hardest 32-node graph converged with probability 0.4 %, but the same graph converged within 1 % with 4.4 % probability, within 2 % with 8.8 % probability, and within 3 % with 22.3 % probability. Sacrificing just 3 % from optimum more than doubles the convergence rate on average, as shown by Table 5.

The easiest graph converged to the optimum with 27.8 % probability, but within 3 % only slightly more often, with 32.6 % probability. This indicates that it does not have many solutions near the global optimum, but hitting the global optimum itself is rather probable (27.8 %). This is the opposite of the hardest graph, where hitting the global optimum is hard (probability 0.4 %), but there are surprisingly many solutions within 3 %, found with 22.3 % probability. Such differences necessitate that the evaluation of a heuristic uses many task graphs.

4.5 Applicability of SA on larger problems

SA performed well in the 2-PE case, and the experiments were repeated with larger PE counts. As mentioned, the task count was reduced since otherwise the brute force solution would be infeasible. Tables 8 and 9 show the results for 8 PEs and 11 tasks. The result tables for 3, 4 and 6 PEs are found in the Appendix.

Table 8 Proportion of SA+AT runs that converged within p from global optimum for 8 PEs and 11 nodes. A higher value is better. SA+AT chooses L=77
Table 9 Approximate expected number of mappings for SA+AT with 8 PEs and 11 nodes. SA+AT chooses L=77

Convergence is faster with 8 PEs than with 2 PEs. First of all, the problem size is smaller (1.1G < 2.1G), but that is not the only reason. The task count is also smaller (N=11 < N=32), and this has a bigger impact, as is evident from Tables 13, 14, 15, 16, 17, 18. Note that once the convergence proportion is rounded to 1.000 no more values are shown, but the iteration counts still increase very slightly (e.g. L=32, p=11–17 %).

Figure 7(a) shows how quickly SA converges towards the global optimum with all tested PE counts. Figure 7(b) plots the probability of reaching solutions with p=0 % (optimum) and p=5 % as a function of task count (left Y axis). Note that the problem sizes are roughly equal, from 1.1G to 4.3G solutions, which allows estimating the impact of task count. Moreover, the expected iteration counts are shown on the right Y axis. The iteration count for the optimum solution increases with task count and has a peculiar spike at 20 tasks (3 PEs). In contrast, the iteration count changes only modestly for p=5 %.

Fig. 7
figure 7

Convergence rates for M=2–8 PEs with automatically selected L values (32–77)

Reaching the optimum is most likely with 11 tasks and least likely with 32 tasks (90.1 % vs. 1.6 %). Sacrificing the quality by 5 % increases the likelihoods notably (99.1 %, 15.4 %). The shape of the probability curves seems reciprocal, i.e. y=1/x. Although the probability of a good solution drops quickly, the knee-point is around 20 tasks and the drop after that is much smaller. This gives hope that decent results could be achieved with larger task counts, but confirming that would need very heavy brute force runs, which are currently infeasible.

Table 10 shows the relative optimization time for each PE case when targeting within p percent of the global optimum. The table shows that reaching a solution within 5 percent of the global optimum can mean a significant saving in optimization time. This is especially true for the 2 and 3 PE cases, where the probability of reaching the global optimum with a single SA+AT run is very low. For 2 PEs, reaching within 5 % of the optimum takes only 1/10 of the optimization time, and within 10 % only 1/29. For 3 PEs, the corresponding cases take just 1/18 and 1/43 of the time. We observe that repeating SA several times for a given problem is a beneficial strategy for finding good solutions when the probability of reaching an optimum solution in one optimization run is low.

Table 10 Relative optimization time for reaching within p percent of the global optimum. The time to reach the global optimum is scaled to 1.0 time units for each PE count

Furthermore, executing one round of SA+AT takes between 0.4 s and 12.6 s for 2 and 8 PEs, respectively. The expected time to reach the global optimum with SA+AT is 29.1 s for 2 PEs, 49.8 s for 3 PEs, 12.7 s for 4 PEs, 5.9 s for 6 PEs and 22.0 s for 8 PEs. Sacrificing solution quality to within 5 percent of the global optimum decreases the optimization time to 2.9 s for 2 PEs, 2.8 s for 3 PEs, 4.4 s for 4 PEs, 4.2 s for 6 PEs and 17.3 s for 8 PEs.
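These expected times follow a restart model applied to wall-clock time: expected time = (time per run) / (success probability). The numbers below are illustrative stand-ins of the right magnitude, not exact values from the tables:

```python
def expected_time(run_time_s, success_prob):
    """Expected wall-clock time when restarting SA+AT until the target is met."""
    return run_time_s / success_prob

# Illustrative: a 0.4 s run with a ~1.4 % chance of hitting the optimum lands
# on the ~29 s scale reported for 2 PEs.
print(round(expected_time(0.4, 0.014), 1))
```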

Results show that a large N makes optimization more complex although the total number of solutions M^N is rather constant. We suspect that this might be due to the homogeneity of our HW platform. For example, let us assume that we move all tasks from PE 0 to PE 1, from PE 1 to PE 2, …, and from PE M−1 to PE 0. Since the PEs are identical, this is just a matter of their numbering, and there should be (M−1)! permutations with (practically) equal performance. Consequently, the effective design space of the mapping is actually much smaller. With heterogeneous resources or a hierarchical network this would not apply, since moving the tasks then has more complex consequences due to their dependencies.
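The (M−1)! symmetry argument can be made concrete by canonicalizing mappings up to PE renaming: two mappings that differ only in PE numbering reduce to the same canonical form. A sketch (the function is ours, for illustration only):

```python
def canonical(mapping):
    """Relabel PEs in order of first appearance, collapsing PE-renaming symmetry."""
    relabel = {}
    return tuple(relabel.setdefault(pe, len(relabel)) for pe in mapping)

# Shifting every task one PE over (PE 0 -> PE 1, ..., PE M-1 -> PE 0) on
# identical PEs yields the same canonical mapping, hence equal performance.
print(canonical((0, 0, 1, 2)) == canonical((1, 1, 2, 0)))  # -> True
```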

Earlier, SA+AT has been compared to Group Migration (GM), a hybrid of SA and GM, Random mapping, and the Optimal Subset Mapping algorithm with large 300-task graphs and 2, 4, 8 PEs [28]. In that experiment, SA+AT was the best heuristic studied. The speedups w.r.t. a single PE were about 1.9x, 2.7x and 3.7x, respectively, but the optimum was unknown. The relation between the mean iteration count and problem size is far from linear, but we have not yet formulated it. For example, in this paper the 8-PE case requires about 20x the iterations compared to 2 PEs. However, in [28] the ratio is only 2.4x, and the iteration counts are rather small, 37k and 88k, considering the size of the task graph.

5 On SA acceptance probability

This experiment studied the impact of the acceptance function.

5.1 On acceptor functions

The normalized exponential and normalized inverse exponential acceptor functions are compared with respect to solution quality and convergence. Very few SA papers try different acceptor functions. We ran an experiment to compare the normalized inverse exponential (2) and normalized exponential (3) acceptor functions. The parameter selection method from [30] was applied to (3). The experiment was re-run for both acceptor functions, 100 graphs, 2–4 PEs, and 10 times for each case, totaling 6 000 optimization runs.

Table 11 shows the average gain and iteration count results for both acceptor functions. In terms of gain, the normalized exponential acceptor is better by no more than half a percent.

Table 11 Comparing average gain and iteration count values for the normalized inverse exponential (NIE) and normalized exponential (NE) acceptors. A higher value is better for gain and a smaller one for iterations

However, the exponential acceptor needs 2–7 % more iterations. Both SA algorithms have an equal number of iterations per temperature level from T_0 to T_f. Optimization terminates when T ≤ T_f and L consecutive rejections happen. The acceptance function affects the number of consecutive rejections, which makes the difference in the number of iterations. The inverse exponential acceptor has a higher rejection probability, which makes it stop earlier than the exponential acceptor.
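For reference, the two acceptors can be written out directly. The sketch below assumes the standard exponential form for (3), capped at 1 for improving moves (an assumption on our part, since (3) is not reproduced here), and a normalized temperature T ∈ (0,1] with initial cost C_0, matching (2):

```python
import math

def accept_nie(dC, T, C0):
    """Normalized inverse exponential acceptor (2)."""
    return 1.0 / (1.0 + math.exp(dC / (0.5 * C0 * T)))

def accept_ne(dC, T, C0):
    """Assumed exponential form for acceptor (3), capped at 1 for improving moves."""
    return min(1.0, math.exp(-dC / (0.5 * C0 * T)))

# At dC = 0 the NE acceptor always accepts while NIE accepts with probability
# 0.5, so NIE rejects more often and therefore terminates earlier.
print(accept_ne(0, 0.1, 1000), accept_nie(0, 0.1, 1000))  # -> 1.0 0.5
```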

Furthermore, we compared the brute force results with 32 nodes and 2 PEs between the normalized exponential and normalized inverse exponential acceptors. Table 12 shows the difference between the global optimum convergence rates of the two acceptors for SA+AT. The results indicate that the normalized exponential acceptor (NE) converges slightly more frequently than the normalized inverse exponential acceptor (NIE). The SA+AT choice L=32 displays a difference of at most 2.4 % in absolute convergence rate.

Table 12 The difference in SA+AT convergence rate between the normalized exponential (NE) and normalized inverse exponential (NIE) acceptors. The overhead percentage p shows convergence within p of the global optimum. A positive value indicates the normalized exponential acceptor is better. SA+AT chooses L=32

We recommend using normalized exponential acceptor NE for task mapping.

5.2 On zero transition probability

The SA acceptance function defines the probability of accepting a move for a given objective change and temperature. Zero transition probability is the probability of accepting a move that does not change the cost. One might speculate that although the cost does not change, the new mapping may be a better base for the next moves. We studied the effect of the zero transition probability on the solution quality.

The normalized inverse exponential acceptance function (2) gives a 0.5 probability for a zero transition, that is, when ΔC=0. The acceptance function was modified to test the effect of the zero transition probability:

$$ \mathit{Accept}(\Delta C, T)= \frac{2 a}{1+\exp(\frac{\Delta C}{0.5 C_0 T})}, $$
(8)

where a is the acceptance probability for ΔC=0. SA+AT was re-run for several graphs with the setup specified in detail in [30], with distinct probabilities a ∈ {0.1, 0.2, …, 1.0}. Two 16 node cyclic graphs, one 128 node cyclic graph, two 16 node acyclic graphs, and one 128 node acyclic graph, all with target distribution 10 %, were tested on a 3 PE system. However, no causality was found between the solution quality and the zero transition probability. It seems that this probability is insignificant.
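Equation (8) is a one-line modification of the NIE acceptor; a sketch:

```python
import math

def accept_zero(dC, T, C0, a):
    """Modified NIE acceptor (8): accepts a zero-cost move with probability a."""
    return 2.0 * a / (1.0 + math.exp(dC / (0.5 * C0 * T)))

# With dC = 0 the exponential term equals 1, so the probability is exactly a.
print(accept_zero(0, 0.5, 1000, 0.3))  # -> 0.3
```

Setting a = 0.5 recovers the original acceptor (2).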

6 Recommendations

In this section we summarize our findings and give recommendations for reporting results on task mapping with simulated annealing. In addition, we present a compact list of how to set up SA for task mapping.

6.1 On reporting results

There are several guidelines that apply to publishing data on parallel computation [2] and heuristics in general [3]. Unfortunately, we found that many of the cited publications leave out details that are important for reproducing the results with SA. Therefore, we present the following recommendations for publishing results on task mapping with SA:

  1.

    Report at least the pseudocode of the algorithm. A textual description of the algorithm is often too ambiguous for reproducing the experiment. Specify the temperature scale, cost, move, and acceptance functions.

  2.

    Report numerical values for the algorithm's constants. Specify the temperature scaling factor, initial and final temperatures, and the number of iterations per temperature level. These are needed to reproduce the same results.

  3.

    Report the convergence rate. Plot the number of mapping iterations against the optimized values (e.g. speedup). Report the mean and median number of iterations.

  4.

    Compare the result from heuristics with the global optimum result for a trivial problem that can be solved by brute force. Report the proportion of optimization runs, and the number of iterations, that reach within p percent of the global optimum where p is varied. For example, it can be informative to know that 95 percent of optimization runs yield a result that is only 5 percent worse than the global optimum and use a specific number of iterations on average for this result.

  5.

    Report the optimization time per iteration. This corresponds mostly to simulation time per mapping.

6.2 Recommended practices for task mapping with Simulated Annealing

Based on existing data and results, we recommend the following practices for task mapping with SA:

  1.

    Scale the number of iterations per temperature level with the system size. That means that L should be proportional to α=N(M−1), where N is the number of tasks and M is the number of PEs. α is the number of neighboring mapping solutions. The results of Sect. 4 and [6, 26, 30] indicate that α times a small multiplier is sufficient for good convergence, provided that SA is repeated several times.

    Furthermore, as a general result, it is important to note that the number of mapping iterations should grow as a function of the problem complexity parameters N and M, unless experimental results indicate good results for a specific L value.

  2.

    Use a geometric temperature schedule with 0.90 ≤ q ≤ 0.99. Most known results use values in this range. Our experiments have found q=0.95 to be a suitable value. The L value has to be adjusted together with q: dropping q from 0.95 to 0.90 implies that the number of iterations in a given temperature range halves unless L is doubled.

  3.

    Use a systematic method for choosing the initial and final temperatures, e.g. one published in [46, 10, 18, 22, 26, 29, 30, 37]. A systematic method decreases manual work and the risk of careless parameter selection. A systematic method may also decrease optimization time.

  4.

    Use a normalized exponential acceptance function (NE) (3). Section 5 indicates that using exponential rather than inverse exponential acceptance function is preferred. The zero transition probability did not explain why inverse exponential did worse. Normalized acceptance function is defined with a normalized temperature range T∈(0,1] which makes annealing schedules more comparable between problems. It is easy to select a safe but wasteful range when temperature is normalized, e.g. T∈(0.0001,1]. It is also easier to add new influencing factors into the cost function since the initial cost does not directly affect the selection of initial temperature when a normalized acceptance function is used. Instead, the relative change in cost function value in moves is a factor in the selection of initial temperature.

  5.

    Use the ST (single task) move function (Sect. 3.1) if in doubt. It is the most common heuristic, which also makes comparison to other works easier. However, exploring the impact of the move function calls for more research.

  6.

    Run the heuristics several times for a given problem. Section 4 shows that repetition, that is, restarting the optimization, is necessary. The variance of the solution quality can be significant, which becomes visible by running the heuristics several times. SA usually terminates in a low temperature state in the neighborhood of a local minimum. We call this neighborhood a valley. It is hard to escape from a valley when the temperature is low. Running SA several times increases the probability that better valleys are found in the optimization process. Increasing the number of iterations per optimization run may not address this, because the algorithm will most likely end up in a single valley. Multiple valleys should be explored.

  7.

    Record the iteration number at which the best solution is reached. Some algorithms continue running but make no further progress at the end of the optimization. If the termination iteration number is much higher than the best solution iteration number, the annealing can perhaps be stopped earlier without sacrificing quality.
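The recommendations above can be collected into a small parameter-selection helper; a sketch under the paper's conventions (the multiplier k and the returned dictionary layout are our illustrative choices):

```python
def sa_at_parameters(n_tasks, n_pes, q=0.95, k=1):
    """Collect the recommended SA+AT settings; k is an illustrative multiplier."""
    alpha = n_tasks * (n_pes - 1)  # number of neighboring mapping solutions
    return {
        "L": k * alpha,    # iterations per temperature level
        "q": q,            # geometric temperature scaling factor
        "acceptor": "NE",  # normalized exponential acceptance function
        "move": "ST",      # single task move heuristic
    }

# 32 tasks on 2 PEs gives L = 32 and 11 tasks on 8 PEs gives L = 77, matching
# the automatic choices reported in Sect. 4.
print(sa_at_parameters(32, 2)["L"], sa_at_parameters(11, 8)["L"])  # -> 32 77
```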

7 Conclusion

A survey of the state-of-the-art in task mapping with SA was presented. It was found that SA parameters are often incompletely presented and explained in publications. Recommendations were given on how to better report this information. Although SA is an efficient optimization method for task mapping, there are many practices for selecting the SA parameters. Based on our experiments and findings, we presented a detailed set of recommended parameters to help researchers reproduce and compare their work. Thorough experiments with 2–8 PEs and 11–32 tasks showed that SA can achieve solutions very close to the global optimum with a 4–6 orders of magnitude reduction in optimization time. Our future work includes adding memory and time constraints to the objective function, evaluating applicability to larger problems, and further automating the task mapping process as a whole.