1 Introduction

The shift from algorithm-based to machine learning-based computing continues to accelerate. A study [1] suggests that the popularity of neural networks will double the computation demands on datacenters. This growth in demand puts pressure on building large-scale systems to meet it. Fortunately, thanks to advances in transistor technology, the number of cores per chip has increased enormously [2,3,4]. The network-on-chip (NoC) has become the standard communication medium for multi-/many-core systems on-chip because of properties such as scalability and parallelism [5,6,7]. For example, Cerebras has integrated 800K cores on a single NoC-based chip with trillions of transistors [2], which can deliver more performance than a cluster of traditional machines. One could even build a datacenter on a chip, as many-core on-chip systems offer better power efficiency, interconnection, and latency than traditional local area network (LAN)-based systems [8]. However, the fraction of time an application spends in communication increases with the number of compute nodes in a system; communication time can exceed 50% of an application's runtime in large-scale systems [9]. Communication is therefore the major bottleneck to achieving high parallelism in systems with many compute nodes, as it costs far more time and energy than computation [9,10,11].

Given heterogeneous workload requests with diverse computation and communication (data dependency) demands, a challenging research problem is to find the optimal mapping of resource demands onto heterogeneous resources, achieving high-performing task-resource co-allocation. Our target is both computation and communication efficiency for heterogeneous workloads, without incurring overhead, while mapping onto heterogeneous processor tiles connected by heterogeneous routers and links. Mapping optimization in NoC-based MCSoC faces the new challenges of limited energy and capacity budgets. Billions of transistors are integrated on a single chip; however, as transistors keep shrinking, current leakage poses greater challenges, and the resulting increase in power density has caused the failure of Dennard scaling [12]. It is not feasible to activate all cores at their maximum speed simultaneously, leaving a large amount of silicon inactive (dark silicon) or operating at a lower frequency (dim silicon) [13]. Mapping solutions should therefore account for the heterogeneous frequencies of computation and communication resources, including the inactive ones. Dark and dim silicon are realized by applying dynamic voltage-frequency scaling (DVFS), including power-gating. The search space for task-to-core mapping grows quadratically or even exponentially with the number of core, router, and link resources in an MCSoC; for example, the number of communication paths between cores grows this way with the core count in a NoC-based MCSoC. The large number of task-to-node assignment options and data communication (routing) options significantly inflates the solution space and the time complexity of obtaining an optimal solution.
The communication path also depends on the topology of the interconnection network, e.g., mesh or ring. The high time complexity of computing a solution makes it harder to apply in a real-time-constrained system that must meet application deadlines. The mapping or resource allocation problem, like the bin packing problem, is NP-complete and/or NP-hard [14]: the time complexity grows exponentially with the number of tasks in an application and/or resources in the MCSoC, and no known algorithm solves it in polynomial time. Large problem instances cannot be handled by linear programming optimizers (e.g., IBM CPLEX [15]), since they fail to compute optimal solutions in time, especially for run-time mapping decisions. Heuristics can produce solutions quickly without significant hardware overhead, whereas machine learning, a young field with many underexplored research opportunities [16], may require significant training and/or hardware overhead to provide efficient mapping. Heuristics can also cover all possible cases of heterogeneous tasks (and traffic), their data dependencies, and heterogeneous resources in complex, large platforms. This motivates the use of heuristics for application-architecture mapping in NoC-based MCSoC.

The major contributions of this work are outlined below:

  • Mixed-integer Linear Programming Model for Mapping: A mathematical model using mixed-integer linear programming for mapping tasks onto NoC-based MCSoC has been proposed and implemented using IBM CPLEX [15] for optimal solutions. The model aims to accelerate applications by reducing the execution times (sum of computation and communication times) of running tasks. The model includes computation and communication capacity, energy budget constraints of MCSoC, and also incorporates the computation and communication demands of the applications. Mapping tasks of multiple applications is addressed in the MILP model.

  • Simulated Annealing and Genetic Algorithms for Mapping: SA and GA heuristics are proposed to map tasks onto NoC-based MCSoC with the goal of reducing execution times of running tasks to accelerate applications. The energy and performance constraints of the proposed MILP model are integrated with SA and GA for mapping in energy and computation/communication capacity-constrained NoC-based MCSoC systems.

  • Evaluation on Real Benchmarks and Platform: The MILP model and the SA and GA algorithms were evaluated on both small-scale and large-scale systems using garnet [17], a cycle-accurate NoC model on the gem5 [18] platform. The embedded system benchmark suite (E3S) [19] is used to evaluate all the proposed approaches. An in-house simulator was used to generate mapping results for subsequent latency and performance evaluation on the gem5/garnet platform. Mapping of multiple applications is simulated, as every application in E3S contains multiple sub-applications.

Related work is presented in Sect. 2. A mathematical model of mapping tasks onto the NoC-based MCSoC architecture is formulated using MILP in Sect. 3. Then, we discuss the simulated annealing and genetic algorithm mapping heuristics for mapping tasks onto NoC in Sects. 4 and 5, respectively. A mapping example illustrating the mapping of a sample application using MILP and the heuristics is presented in Sect. 6. Simulation results in Sect. 7 demonstrate the effectiveness of the proposed high-performance application mapping solutions, particularly the performance of the SA and GA heuristics compared to the optimal solutions from the MILP model.

2 Related work

Mapping decisions are driven by various energy and performance goals, such as minimizing the execution time of applications, reducing latency in interconnection networks, and optimizing the energy consumption of the MCSoC. Depending on the application needs, the mapping strategy maps the tasks onto the system and allocates/configures the resources in the system. Various task mapping algorithms have been developed for multicore systems: some are based on graph theory, some use mathematical programming, and others are heuristic algorithms (we do not discuss machine learning-enabled task mapping in this work). Variations of shortest-tree algorithms map highly correlated tasks of the task graph to nearby cores in order to shorten the communication delay [20,21,22,23,24]. For example, [23] maps closely communicating threads to neighboring cores to improve the locality of data accesses. Similarly, the multi-application mapping in [25, 26] maps the tasks of the same application together to reduce the communication distance; for example, [25] finds an optimal region for each application using the maximal empty rectangle technique. Along with shortening the communication distance, the communication traffic is split across multiple paths to improve bandwidth efficiency during task mapping in [20]. The run-time mappings in [27,28,29] map communicating tasks into a close neighborhood to obtain contiguous mapping regions. The hill climbing technique in [29] maps the tasks of an application as close together as possible. The run-time mapping in [28] selects a square region for each incoming application to reduce congestion in the mapped neighborhood. Some work focuses on balancing the load distribution to reduce contention and hotspots [30,31,32,33]. Some work considers dark and/or dim silicon cores (voltage-frequency scaled) during task mapping to reduce the overall energy dissipation in the system [34, 35].
Some work minimizes energy hotspots or contention together with overall chip power [32, 35]. Run-time mapping in dark silicon, presented in [34, 36], aims to increase the power budget for activating more resources. Some work focuses on thermal management during task mapping in NoCs [37,38,39,40]. Some work employs fault-tolerant task mapping to enhance the reliability of the system while reducing energy consumption [41]. Some work, such as [42, 43] and this paper, focuses on reducing energy consumption and accelerating applications at the same time. This paper extends the paper published in [44] with several modifications, including the implementation of the MILP model using an LP solver and the addition of MILP results. In this work, we address the placement of computation- and communication-demanding tasks of multiple applications onto heterogeneous NoCs under computation/communication capacity, execution time, and energy budget constraints. The proposed mapping approaches accelerate the applications by reducing their execution time, which can indirectly reduce energy consumption. We modeled the task mapping problem using an MILP model and proposed simulated annealing and genetic algorithm heuristics for quick mapping solutions applicable in real-time systems. We implemented all the proposed models and heuristics and compared their results.

3 Linear programming modeling for mapping

For the linear programming (LP) formulations, we assume a mesh architecture that employs shortest-path routing for data delivery. A \(3 \times 3\) 2D-Mesh NoC-based multicore system is shown in Fig. 1, where a tile contains a core, its cache, and a router. Cores do the computation and routers connected to cores are used for communication between tasks mapped to cores.

Fig. 1
figure 1

Two-dimensional-Mesh NoC-based multicore system on-chip

Each node (or tile interchangeably) i in multicore systems has a frequency denoted by \(f_i\). Two adjacent nodes (routers) i and j are connected by a bidirectional link \(l_{ij}\) with a maximum communication capacity of \(W_{ij}\). The capacity of the nodes (big/little cores and routers) and links (high/low bandwidth links) varies to demonstrate heterogeneity in 2D-Mesh NoC. The capacity of the cores is varied by changing the number of computational operations per second. The capacity of the links is varied by varying link widths, where \({link}_{bandwidth} = {link}_{width} \times {link}_{frequency}\). The frequency of the router determines its packet forwarding rate. The capacity of a router is varied by changing the number of ports, the number of buffers in each port, and the size of each individual buffer. The size of a buffer determines the size of a flit (flow control digit), as a flit (part of a packet) needs to be stored in a buffer due to contention in the NoC.

Given a system that consists of computation and communication resources, this LP model considers a sequence of K applications \(\{A_{1}...A_{K}\}\) for mapping. Each application contains a set of Z tasks \(\{t_{1}...t_{Z}\}\). A sample application graph with computation and communication demands is shown in Fig. 2.

Fig. 2
figure 2

A sample application task graph G

Let’s assume, \((X_k^t)^m\) is a binary variable to indicate the mapping of task t of application k to node (core) m.

3.1 Computation and execution time constraints

Assume \({E}_{k}\) is the execution time of an application k, \((E_k^t)^m\) is the execution time of task t on node m, the total number of tasks in an application k is Z, the total number of nodes in the system is M, \((C_k^t)^m\) represents the computation time of task t on node m, \({Q}_{m}\) is the size (quantum) of the time slice given by node m to each task under the round-robin scheduling policy, and \({S}_{m}\) is the time node m needs to switch execution from one task to another. \((C_k^t)^m\) is calculated by dividing the computation demand \({(cmp)}_{k}^{t}\) by the processor capacity (operations per second) \({pc}_{m}\), where \({(cmp)}_{k}^{t}\) is the number of computational operations requested by task t of application k.

$$\begin{aligned} (C_k^t)^m = (X_k^t)^m \, {(cmp)}_{k}^{t} / {pc}_{m}, \quad \forall 1\le t\le Z, \forall 1\le k\le K, \forall 1\le m\le M. \end{aligned}$$
(1)
$$\begin{aligned} (E_k^t)^m = \sum _{1\le m \le M} (C_k^t)^m \, (1 + {S}_{m}/{Q}_{m}) \end{aligned}$$
(2)
$$\begin{aligned} {E}_{k} = \sum _{\begin{array}{c} 1\le t\le Z \\ 1\le m \le M \end{array}} (E_k^t)^m \end{aligned}$$
(3)
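
The timing relations in Eqs. (1)-(3) can be sketched in a few lines of Python; the function names and the demand, capacity, and scheduling values below are illustrative, not part of the model.

```python
def computation_time(cmp_demand, pc, assigned):
    # Eq. (1): C = X * cmp / pc, where X is the binary assignment indicator
    return (cmp_demand / pc) if assigned else 0.0

def task_execution_time(c_times, S, Q):
    # Eq. (2): round-robin scheduling overhead scales each computation
    # time by (1 + S_m / Q_m) for its node m
    return sum(c * (1.0 + S[m] / Q[m]) for m, c in enumerate(c_times))

def app_execution_time(task_exec_times):
    # Eq. (3): application execution time is the sum over its tasks
    return sum(task_exec_times)
```

For instance, a task of 100 operations on a node with capacity 50 ops/s takes 2 s of computation; with S/Q = 0.1 the scheduled execution time becomes 2.2 s.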

Since the mapping needs to satisfy the computation time requirements (quality-of-service requirements), \({E}_{k}^{q}\), for all the applications, we arrive at the computation constraint for each application k:

$$\begin{aligned} {E}_{k} \le {E}_{k}^{q},\quad \forall 1\le k\le K. \end{aligned}$$
(4)

3.2 Communication and link constraints

Mapping needs to consider the capacity of the communication links in the NoC and the traffic on those links, as the execution time of tasks depends on the communication time with mapped tasks at remote nodes in the MCSoC. Adjacent tasks \(T_k^s\) and \(T_k^t\) in the task graph need to communicate with each other, with a traffic load of \({(com)}_k^{st}+{(com)}_k^{ts}\). When they are mapped to two nodes, e.g., p and q, the traffic is delivered through a path, denoted \(Ph_{pq}\), which consists of a set of NoC communication links. Such a path is determined by the underlying routing algorithm, the XY-routing algorithm in this work. Multiple traffic flows may simultaneously traverse a communication link \(l_{mn}\), so the total communication traffic load on a link is:

$$\begin{aligned} \omega _{mn}=\sum _{{\begin{array}{c} 1\le p,q\le N\\ 1\le s,t \le Z\\ 1\le k \le K \\ \exists l_{mn}\in Ph_{pq} \end{array}}} {(com)}_k^{st} \cdot (X_k^s)^p \cdot (X_k^t)^q. \end{aligned}$$
(5)

Since each communication link \(l_{mn}\) has a maximum bandwidth capacity of \(W_{mn}\), we arrive at the communication constraint:

$$\begin{aligned} \omega _{mn} \le W_{mn}, \quad \forall 1\le m,n\le N. \end{aligned}$$
(6)

Communication time on the link \(l_{mn}\) is determined by dividing the traffic load \(\omega _{mn}\) on the link \(l_{mn}\) by the corresponding bandwidth of the link \(W_{mn}\):

$$\begin{aligned} {C}_k^{mn} = \omega _{mn}/ W_{mn},\quad \forall 1\le m,n\le N. \end{aligned}$$
(7)
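
The link-load aggregation of Eq. (5) under XY-routing can be sketched as follows; `xy_path` and `link_loads` are hypothetical helper names, and nodes are assumed to be numbered row-major on an N-column mesh.

```python
def xy_path(src, dst, N):
    """Return the list of directed links (a, b) on the XY route src -> dst
    in an N-column 2D mesh with row-major node numbering."""
    links, cur = [], src
    # XY-routing: travel along X (columns) first, then along Y (rows)
    while cur % N != dst % N:
        nxt = cur + 1 if dst % N > cur % N else cur - 1
        links.append((cur, nxt))
        cur = nxt
    while cur // N != dst // N:
        nxt = cur + N if dst // N > cur // N else cur - N
        links.append((cur, nxt))
        cur = nxt
    return links

def link_loads(flows, N):
    """flows: list of (src_node, dst_node, traffic) tuples.
    Returns the per-link traffic totals of Eq. (5)."""
    loads = {}
    for src, dst, traffic in flows:
        for link in xy_path(src, dst, N):
            loads[link] = loads.get(link, 0) + traffic
    return loads
```

Dividing each accumulated load by the link bandwidth \(W_{mn}\) then yields the communication times of Eq. (7), and comparing it against \(W_{mn}\) checks the constraint of Eq. (6).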

3.3 Energy budget constraint

The energy dissipation \({P}_{energy}\) is determined by multiplying the computation and communication loads by the energy dissipation per load unit. \(e_{cmp}\) and \(e_{com}\) represent the energy coefficients of the computation and communication loads, respectively. In 22 nm technology, \(e_{cmp}\) and \(e_{com}\) are approximately 45 picojoules (pJ) per computational operation (assuming 8 bits per computational operation for simplicity) and 65 pJ per communicated bit, respectively [45].

$$\begin{aligned} {P}_{energy} = e_{cmp} \cdot \sum _{\begin{array}{c} 1\le t\le Z \\ 1\le k\le K\\ 1\le m \le N \end{array}} (X_k^t)^m {(cmp)}_{k}^{t} + e_{com} \cdot \sum _{1\le m, n\le N} \omega _{mn}. \end{aligned}$$
(8)

Since the mapping needs to satisfy the energy budget \(P_{budget}\), we arrive at the energy budget constraint:

$$\begin{aligned} {P}_{energy} \le P_{budget}. \end{aligned}$$
(9)
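
A minimal sketch of the energy check in Eqs. (8)-(9), using the 22 nm coefficients quoted above; the function names and workloads are illustrative.

```python
E_CMP = 45e-12  # joules per computational operation (22 nm figure above)
E_COM = 65e-12  # joules per communicated bit (22 nm figure above)

def mapping_energy(total_ops, total_bits):
    # Eq. (8): energy = e_cmp * computation load + e_com * communication load
    return E_CMP * total_ops + E_COM * total_bits

def within_budget(total_ops, total_bits, budget):
    # Eq. (9): the mapping is feasible only if it fits the energy budget
    return mapping_energy(total_ops, total_bits) <= budget
```

For example, a mapping with 10^9 operations and 10^9 communicated bits dissipates about 0.11 J under these coefficients.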

3.4 Objective function

The objective is to minimize the total execution time (sum of computation and communication time) of the tasks to accelerate the applications:

$$\begin{aligned} Minimize{:}\quad \sum _{1\le k\le K} \biggl ( {E}_{k} + \sum _{1\le m,n \le N} {C}_k^{mn} \biggr ) \end{aligned}$$
(10)

subject to the constraints given in Eqs. (4), (6), and (9). This objective function, together with the communication, computation, and energy constraints, forms an MILP model, as it contains both binary and continuous variables.
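
For toy instances, the structure of the model can be illustrated by a brute-force search over all task-to-node assignments; this is only a sketch with an assumed hop-count proxy for the communication time, not the MILP formulation itself, and it is tractable only for a handful of tasks and nodes.

```python
from itertools import product

def hop_comm_time(assign, comm, N):
    # Proxy for communication time: traffic x Manhattan hop distance on an
    # N-column 2D mesh with row-major node numbering (illustrative only)
    total = 0.0
    for (s, t), traffic in comm.items():
        a, b = assign[s], assign[t]
        total += traffic * (abs(a % N - b % N) + abs(a // N - b // N))
    return total

def brute_force_map(cmp_demand, pc, comm, N):
    """Exhaustively try every task-to-node assignment and return the one
    minimizing computation time plus the hop-based communication proxy."""
    Z, M = len(cmp_demand), len(pc)
    best, best_cost = None, float("inf")
    for assign in product(range(M), repeat=Z):  # one node per task
        comp = sum(cmp_demand[t] / pc[assign[t]] for t in range(Z))
        cost = comp + hop_comm_time(assign, comm, N)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost
```

With M nodes and Z tasks this search visits M^Z assignments, which illustrates why solvers and heuristics are needed for realistic instances.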

A unique property of the proposed MILP model is the inclusion of both computation and communication in the mapping formulation: it captures the computation and communication capacities of the NoC resources as well as the computation and communication demands of the tasks and applications, and its objective function accelerates both the computation and communication times of the mapping solution. The model is designed to handle the heterogeneity of applications and systems/networks, which is crucial when mapping tasks onto NoC-based multi-/many-core architectures. This architecture-awareness and application-demand awareness help to map a greater amount of computation and communication by utilizing the NoC resources effectively. The model places the computation and communication demands of tasks by considering the loads and energy consumption on the nodes (cores and routers) and links of the NoC. By accounting for utilization across different regions of the NoC, the energy budget, and task deadline constraints, the proposed MILP model reduces the execution time for both computation and communication, accelerating applications and systems. The reduced execution time further lowers energy consumption, as energy consumption tends to grow with time (assuming other factors, such as power usage, remain constant).

We observe long computation times for an LP solver to discover an optimal solution. For example, mapping an application with 28 tasks onto a 36-core NoC takes hours for LP solvers such as IBM CPLEX [15] and the mixed-integer LP solver (intlinprog) in MATLAB [46]. When the number of applications and/or the NoC size increases, an LP solver cannot deliver a solution within a time period reasonable for real-time systems. Simulated annealing (SA) and genetic algorithm (GA) heuristics can produce near-optimal solutions far more quickly than LP solvers. As in the LP modeling, a mesh NoC-based MCSoC is used to evaluate the proposed SA- and GA-based mappings. The XY-routing algorithm is used to find the distance between two communicating tasks mapped to different cores in a 2D-Mesh NoC, where it functions as a shortest-path algorithm, much like Dijkstra's algorithm [14]. Although a 2D-Mesh NoC is used for the MILP model and heuristics in this work, all these approaches can be adapted to other NoC topologies, such as a ring or torus, by changing or modifying the routing algorithm.

4 Simulated annealing for mapping

Simulated annealing (SA) [47] has proven quite effective at approximating global optima for many NP-hard or NP-complete combinatorial problems, such as mapping. Due to the high time complexity of the mapping problem, an SA technique is proposed to quickly solve the application-architecture mapping problem in MCSoC. The proposed SA can address the heterogeneity of applications and systems and can handle the mapping of a large number of applications and tasks in large-scale systems. The pseudocode of the proposed SA for task mapping onto NoC-based multicore systems is shown in Algorithm 1. The technique starts with an initial temperature of \(T=10{,}000\), an \(\alpha\) value of 0.95, the cooling function \(T = T \cdot \alpha\), an \(\epsilon\) value of 0.5, and an initial mapping solution. The values of T, \(\alpha\), and \(\epsilon\) are determined empirically (based on the best performance). The initial mapping is evaluated first. The heuristic then executes until it meets a predetermined stopping criterion: either the temperature drops below \({10}^{-20}\) or the mapping remains unchanged for 250 iterations. At each iteration, the current mapping is perturbed to produce a new mapping, which is compared to the current one. The new mapping must satisfy the constraints in Eqs. (4), (6), and (9) before it is considered for acceptance. If the new mapping yields a better execution time (i.e., the fitness function being optimized) for the mapped application(s), it becomes the current mapping. When the new mapping is worse than the current one, it is still accepted if \(\epsilon\) is smaller than \(e^{- |{score}_{current}- {score}_{new}|/T}\), where \({score}_{new}\) and \({score}_{current}\) are the new and current values returned by the evaluation function.
Accepting worse solutions helps explore a broader search space, potentially escaping local optima and discovering better solutions that may lie further away in the search space. This strategy of accepting both better and worse solutions resembles the exploration and exploitation strategies in reinforcement learning (RL) [48]. Exploitation in SA means accepting better solutions with high probability, while exploration is achieved by accepting worse solutions with a certain probability to avoid local optima. The exploration-exploitation trade-off in SA is controlled by the cooling schedule (\(e^{- |{score}_{current}- {score}_{new}|/T}\)). At higher temperatures, SA is more explorative (more likely to accept worse solutions); the probability of accepting worse solutions gradually reduces as the temperature decreases. As the optimization progresses and the temperature parameter decreases (during the annealing process), SA becomes more likely to accept only better solutions, converging toward an optimal or near-optimal solution. This feature of SA mimics a machine learning or reinforcement learning agent that becomes smarter with more training.

The evaluation function calculates a score for a mapping using the computation score and communication score. Communication scores are calculated by considering the communication times between the communicating tasks in NoC. Computation scores are calculated by considering the computation times of the tasks at the corresponding mapped nodes. The return from the evaluation function (in SA) is the inverse of the summed computation and communication times (execution time). So, a higher score (lower execution time) means better performance.
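
The acceptance rule and cooling schedule described above can be sketched as follows, assuming the score is the inverse of execution time (higher is better) and using the fixed \(\epsilon = 0.5\) from the text.

```python
import math

def accept(score_current, score_new, T, eps=0.5):
    # Better (or equal) mapping: always keep it (exploitation)
    if score_new >= score_current:
        return True
    # Worse mapping: keep it only while the temperature is high enough
    # relative to the score gap (exploration, to escape local optima)
    return eps < math.exp(-abs(score_current - score_new) / T)

def cool(T, alpha=0.95):
    # Geometric cooling schedule T = T * alpha used by the proposed SA
    return T * alpha
```

At high T the exponential term is close to 1, so even noticeably worse mappings pass the \(\epsilon\) test; as T shrinks toward the \(10^{-20}\) stopping threshold, only improvements survive.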

Algorithm 1
figure a

Simulated Annealing Heuristic for Task-to-Node Assignment in MCSoC

Similar to the MILP model, the unique properties of the proposed SA are architecture-awareness and application-demand awareness during mapping. The proposed SA keeps track of the utilization of the computation and communication capacity of the NoC resources and is aware of the computation and communication demands of the tasks and applications. Its evaluation function considers both computation and communication scores, reflecting the performance of the mapping in terms of both task execution and inter-task communication, and the method can handle the heterogeneity of applications and systems/networks. The implementation employs a strategy that balances exploration and exploitation, similar to techniques used in RL; a major property of SA is the use of probability to accept worse solutions to avoid local optima. The proposed SA places the computation and communication demands of tasks by considering the current loads and energy consumption on the nodes (cores and routers) and links of the NoC. By considering current utilization in different regions of the NoC along with the energy budget and task deadline constraints, the proposed SA accelerates both computation and communication, leading to higher performance for systems and applications. Faster execution further reduces energy consumption, as energy consumption increases with time.

5 Genetic algorithm for mapping

Genetic algorithms (GAs) follow evolutionary processes [49,50,51] to search large solution spaces. The GA approach of combining and evolving previous solutions can yield efficient solutions that may not be obvious and would be difficult to arrive at intuitively. In a classical GA, a solution is encoded as a chromosome. In the proposed GA for mapping, a chromosome represents the mapping of tasks onto nodes, and a population consists of multiple chromosomes, i.e., multiple mappings of (multiple) tasks onto nodes. The algorithm begins by encoding a set of possible mapping solutions as chromosomes; the entire set of mapping solutions (chromosomes) is called a population (a list of mapping solutions). Like SA, the proposed GA for mapping is an iterative improvement algorithm: the population of mapping solutions is iteratively updated toward better solutions until a stopping criterion is met. The iteration starts with a population of randomly generated individuals (mappings), and the population in each iteration is called a generation. In each generation, the performance score (inverse of execution time) of every individual in the population is evaluated, using a calculation similar to that in simulated annealing based on computation and communication scores. Mappings with lower execution times are stochastically selected from the current population, and each selected mapping is modified (recombined and possibly randomly mutated) to form a new mapping. The new candidate solutions are then used in the next iteration of the algorithm. A GA typically terminates either when a maximum number of generations has been produced or when a satisfactory execution time has been reached by the population. In this paper, we terminate the iteration when the maximum number of generations is reached.
Similar to the proposed SA algorithm, the new mapping solution in the proposed GA also needs to meet the constraints in Eqs. (4), (6), and (9), before the mapping candidate is considered for acceptance. The pseudocode of the proposed GA for mappings onto multicore systems is shown in Algorithm 2.
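
One generation of the loop described above might be sketched as follows; `evaluate`, `select`, `crossover`, and `mutate` are placeholders for the steps detailed in the following subsections.

```python
def ga_step(population, evaluate, select, crossover, mutate):
    """Produce the next generation from the current population of mappings."""
    scored = [(evaluate(m), m) for m in population]  # evaluation step
    pool = select(scored)              # stochastic, fitness-biased mating pool
    next_gen = []
    # pair up the mating pool (assumed even-sized for this sketch)
    for a, b in zip(pool[::2], pool[1::2]):
        c1, c2 = crossover(a, b)       # recombine two parent mappings
        next_gen += [mutate(c1), mutate(c2)]
    return next_gen
```

Constraint checking against Eqs. (4), (6), and (9) would sit inside `evaluate`, returning an invalid score for infeasible mappings so they are filtered out by selection.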

Algorithm 2
figure b

Genetic Algorithm Heuristic for Task-to-Node Assignment in MCSoC

The four steps of the proposed GA, described below, are evaluation, selection, crossover, and mutation. One cycle of selection, crossover, mutation, and evaluation is called a generation.

5.1 Evaluation

The evaluation function calculates a score for the mapping using the computation and communication times. A higher performance score returned from the evaluation indicates a lower execution time. The evaluation function can also return invalid scores; for example, a score of 0 indicates an invalid mapping that violates the computation capacity, communication capacity, or energy budget constraints.

5.2 Selection

The selection phase chooses the best-suited mappings and discards the worse (higher execution time) mapping solutions. The main idea is to increase the chances of better mapping solutions being preserved into the next generation. Selection consists of two steps: selection and sampling [52]. The linear ranking selection algorithm [53] is used for selection. It sorts the mappings in descending order of performance score, with the highest-scoring mapping at the beginning of the sorted list and the lowest at the end. It then assigns a selection probability to each mapping in the current population based on its rank in the sorted list; the selection probability is a linear function of the mapping's rank. The selection probability \(Prob_{i}\) for the \(i\)-th mapping is calculated as follows:

$$\begin{aligned} Prob_{i} = (1 / {pop}_{size}) \cdot (n_{max} - (n_{max} - n_{min}) \cdot (i / ({pop}_{size} - 1))) \end{aligned}$$
(11)

where \({pop}_{size}\) is the total number of mappings in the population, \(n_{max}\) is a constant representing the maximum probability for the fittest mapping, \(n_{min}\) is a constant representing the minimum probability for the least fit mapping, and i is the rank of the mapping in the sorted list, ranging from 0 to \({pop}_{size} - 1\). \(n_{max}\) and \(n_{min}\) are determined empirically and control the level of exploration and exploitation desired in the GA. A higher \(n_{max}\) increases the selection pressure toward the best mappings, making them more likely to be selected, while a smaller \(n_{max}\) reduces that pressure, allowing more exploration. Similarly, a higher \(n_{min}\) gives the worst mappings a greater chance of being selected, promoting exploration, while a smaller \(n_{min}\) reduces their chance, allowing more exploitation. \(n_{max}\) and \(n_{min}\) are set to 1.1 and 0.9, respectively, in our experiments, as these values produce the best-performing solutions. The selection algorithm favors better mappings by assigning them higher probabilities; however, most solutions, including worse ones, retain some probability of selection for the next generation. This helps avoid premature convergence to local optima, leading to a good mapping solution.
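
Eq. (11) can be sketched directly; with \(n_{max} = 1.1\) and \(n_{min} = 0.9\) as in the text, the probabilities sum to 1 by construction.

```python
def rank_probabilities(pop_size, n_max=1.1, n_min=0.9):
    # Eq. (11): linear ranking selection probability for rank i
    # (i = 0 is the fittest mapping in the sorted list)
    return [(1.0 / pop_size) *
            (n_max - (n_max - n_min) * i / (pop_size - 1))
            for i in range(pop_size)]
```

For a population of 5, the fittest mapping receives probability 1.1/5 = 0.22 and the least fit 0.9/5 = 0.18, a gentle bias that preserves diversity.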

The sampling algorithm then produces the mating pool by selecting individual mappings from the current mapping solutions. The Roulette Wheel Selection algorithm is used for sampling: the algorithm spins a "pseudo" roulette wheel by randomly selecting a position. Each slice of the wheel is a candidate mapping solution, with slice size proportional to the performance score returned by the evaluation function. Because better mapping solutions occupy larger slices, they have a higher probability of being selected as winners of the spin. The chosen mapping corresponds to the slice where the wheel stops, and the spinning process is repeated until the mating pool contains the desired number of selected mappings.
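
A minimal sketch of the roulette-wheel spin, assuming slice sizes proportional to the performance scores; the `rng` parameter is an illustrative hook for injecting the random spin.

```python
import random

def roulette_spin(scores, rng=random.random):
    """Return the index of the mapping whose slice the wheel stops on."""
    total = sum(scores)
    spin, acc = rng() * total, 0.0      # random position on the wheel
    for i, s in enumerate(scores):
        acc += s                        # walk the slices in order
        if spin < acc:
            return i
    return len(scores) - 1              # guard against float round-off
```

Repeated spins fill the mating pool; a mapping with score 3 is three times as likely to win a spin as one with score 1.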

5.3 Crossover

A set of chromosome pairs is formed by randomly selecting mappings (chromosomes) from the current mappings. For each pair, a random number r between 0 and 1 is generated; if r is less than or equal to the crossover rate, the pair is chosen for crossover. For each chosen pair, every task (gene) is exchanged between the two parents with a certain probability called the swapping probability; this method is known as uniform crossover [54]. The crossover rate and swapping probability are empirically determined to be 0.7 and 0.5, respectively, to produce the best-performing solutions. A higher crossover rate performs more crossovers and introduces more diversity into the mapping solutions, which helps reach a better mapping solution and aids convergence. Uniform crossover introduces a higher level of exploration in the search space than traditional crossover methods, such as single-point or multi-point crossover, as it can create new mappings with gene combinations present in neither parent. This exploration property helps maintain genetic diversity in the population and increases the chances of discovering better solutions during optimization.
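
Uniform crossover with a swapping probability of 0.5 can be sketched as follows; the `rng` parameter is an illustrative hook for the per-gene random draw.

```python
import random

def uniform_crossover(parent_a, parent_b, p_swap=0.5, rng=random.random):
    """Exchange each gene (task placement) between the two parent
    mappings independently with probability p_swap."""
    child_a, child_b = list(parent_a), list(parent_b)
    for g in range(len(child_a)):
        if rng() < p_swap:
            child_a[g], child_b[g] = child_b[g], child_a[g]
    return child_a, child_b
```

Because each gene position is decided independently, the children can combine placements in ways no single cut point could produce, which is the exploration benefit noted above.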

5.4 Mutation

Mutation introduces diversity into the population of mappings to prevent the GA from converging to local optima. For each mapping, a random number r between 0 and 1 is generated; if r is less than or equal to the mutation rate of 0.05, the mapping is chosen for mutation. Two tasks (genes) are then selected randomly within the chosen mapping, and their values (core assignments) are swapped to create a new mapping. The mutation rate is kept small because a higher mutation rate (e.g., 0.5) could turn the proposed GA into a random search, which reduces the quality of the mapping solution. Similar to reinforcement learning (RL), the GA balances exploitation and exploration to iteratively improve solutions over generations: crossover and mutation drive exploration, while selection drives exploitation.
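The swap mutation above can be sketched as follows; the function name and the list encoding of a mapping are assumptions for illustration only.

```python
import random

MUTATION_RATE = 0.05  # kept small to avoid degenerating into random search

def swap_mutation(mapping):
    """With probability MUTATION_RATE, swap two randomly chosen tasks
    (gene values) within a single mapping to create a new mapping."""
    mutated = list(mapping)
    if random.random() <= MUTATION_RATE:
        i, j = random.sample(range(len(mutated)), 2)  # two distinct positions
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

A swap perturbs task placement without adding or removing assignments, so the mutated mapping remains a valid permutation of the original one.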

Table 1 Properties of the proposed mapping solutions

Similar to the proposed MILP model and SA method, the unique properties of the proposed GA are architecture awareness and application-demand awareness during mapping. The evaluation function in the proposed GA considers both computation and communication times when calculating the execution time of the mapping solutions. The proposed GA method can handle the heterogeneity of applications and systems/networks. It is aware of the utilization of the computation and communication capacity of the NoC resources and of the computation and communication demands of the tasks and applications. Another unique aspect of the proposed GA is its use of selection and sampling algorithms. The proposed GA uses linear ranking selection to assign selection probabilities to each mapping based on its performance score, and the Roulette Wheel Selection sampling algorithm to produce mating pools based on the performance scores of the candidate mapping solutions. The combination of selection and sampling in the proposed GA balances exploitation and exploration. By considering the current utilization in different regions of the NoC along with the energy budget and task deadline constraints, the proposed GA accelerates both computation and communication, leading to the acceleration of applications and systems. Furthermore, the reduced execution time contributes to reduced energy consumption.

The differences and similarities in properties of all three proposed mapping solutions for high performance are summarized in Table 1.

6 Mapping example

A sample mapping of an application with two sub-applications (sub app-1 and sub app-2) on a \(4\times 4\) 2D-Mesh NoC is shown in Fig. 3. Both the computation and communication demands of the applications are presented in the figure and are considered for mapping. The node capacity, link capacity, energy budget, and deadline constraints must be satisfied by any mapping solution. Node capacities are represented as computational operations per second, while link capacities are represented as the width of the links.

Fig. 3 Application mapping using MILP and heuristics in \(4\times 4\) heterogeneous NoC

The total energy consumption budget for a mapping solution is set to 12 nanojoules (nJ) based on 22 nm technology. Leakage energy for idle NoC resources is not considered in this example. The deadline to complete the application is set to 15 nanoseconds (ns). The node and link capacities and the energy budget of a \(4\times 4\) NoC are shown in Fig. 3a. The frequency of the routers and links is set to 1 gigahertz (GHz). Link widths vary from 8 bits to 16 bits; the bit-width and frequency of a link determine its bandwidth. Routers have 2, 3, or 4 ports depending on their position within the 2D-Mesh NoC. We assume that routers have a sufficient number of buffers to store and forward the flits. The capacities of the cores are on the order of billions of operations per second; for example, the capacity of the first core is 8 billion operations per second. Individual node and link capacities are varied to demonstrate mapping in a heterogeneous NoC. Figure 3b–d shows the node- and link-wise computation and communication load distribution produced by MILP, simulated annealing, and the genetic algorithm, respectively. Node load and node capacity determine the computational time of a task; similarly, link load and link capacity determine the communication time between mapped tasks. The goals of the MILP and the heuristics are to accelerate the execution of the applications and reduce energy consumption. Because the loads in the NoC determine execution time and energy consumption, the loads are shown in the figures. Furthermore, the node- and link-wise load distribution demonstrates resource utilization in different regions of the NoC. High utilization at a node or link can introduce hotspots, which can block or delay flits from reaching their destinations.
The total energy in the NoC is calculated by summing up the computational energy consumption at the node and the communication energy consumption on the links. Similarly, the total execution time is calculated by summing the computation and communication times of the tasks in the applications.
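As an illustration of these two aggregations, the sketch below sums per-node and per-link contributions. It is a simplified sketch, not the paper's actual energy or timing model: the per-unit energy costs, the dictionary-based load representation, and the function names are all assumptions.

```python
def total_energy(node_loads, link_loads, e_comp_per_op, e_comm_per_bit):
    """Total energy = computational energy at the nodes plus
    communication energy on the links (illustrative per-unit costs)."""
    comp = sum(load * e_comp_per_op for load in node_loads.values())
    comm = sum(load * e_comm_per_bit for load in link_loads.values())
    return comp + comm

def total_execution_time(node_loads, node_caps, link_loads, link_caps):
    """Total execution time = sum of computation times (node load over
    node capacity) and communication times (link load over link capacity)."""
    comp = sum(node_loads[n] / node_caps[n] for n in node_loads)
    comm = sum(link_loads[l] / link_caps[l] for l in link_loads)
    return comp + comm
```

In this simplified view, a node carrying 8 units of load at capacity 4 contributes 2 time units, mirroring how node load and node capacity determine computational time in the example above.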

Figure 3b–d also shows the total energy consumption and execution time from the MILP and the heuristics for a random application in this mapping example. The total energy consumption and execution time of simulated annealing and the genetic algorithm are very close to each other, with simulated annealing performing slightly better. The MILP solutions are 10–20% better (lower) in total energy consumption and execution time than the solutions obtained from simulated annealing and the genetic algorithm. However, both the MILP model and the heuristics produce mapping solutions that meet the node and link capacities and energy budget of the NoC and the deadline demands of the application. With a more constrained mapping, such as a reduced energy budget or deadline, the gap in total energy consumption and execution time between the MILP and the heuristics would decrease. However, none of these methods may produce a mapping solution if the mapping problem is very tightly constrained. Therefore, the constraints need to be set carefully after conducting some experiments with the capacity, budget, and deadline constraints.

7 Simulation and results

We implemented the mixed-integer linear programming optimization model described in Sect. 3 using the commercial solver IBM CPLEX [15], which searches for the optimal mapping solution. We integrated the CPLEX solver into MATLAB and invoked it through the MATLAB optimization toolbox [46]. We implemented both the proposed SA and GA for mapping evaluation using our in-house simulator, which includes the mathematical constraints given in Sect. 3 (Eqs. 4, 6, and 9). The mapping solutions were then fed to gem5/garnet to evaluate the latency and throughput performance of the NoCs. We simulated multicore systems ranging from 16 to 100 cores, where the cores are interconnected by a mesh-based NoC. For instance, a \(10\times 10\) 2D-Mesh NoC is used for the 100-core system, where each core has a router for communication with the other cores and memory in the MCSoC. Both large-scale and small-scale NoCs were simulated to show the robustness of the mapping algorithms. The performance score of a mapping solution is calculated as the inverse of the sum of the computation and communication times of the tasks mapped onto the system.
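The performance score defined at the end of the paragraph above is simply the reciprocal of total time, so faster mappings score higher. A one-line sketch (the function name is illustrative):

```python
def performance_score(computation_time, communication_time):
    """Performance score used to rank mapping solutions: the inverse of
    the total (computation + communication) time of the mapped tasks."""
    return 1.0 / (computation_time + communication_time)
```

For example, a mapping with 2 time units of computation and 3 of communication scores 0.2, while a faster mapping totaling 4 time units scores 0.25 and would be preferred.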

Real-system experiments were conducted using the gem5 [18] and garnet [17] platforms to evaluate NoC performance in terms of end-to-end latency and network throughput. The mapping solutions from the linear programming solver and from the in-house simulator for the SA and GA heuristics are fed to the garnet [17] on-chip network in the gem5 platform [18] for latency and throughput evaluation on real systems. The gem5/garnet configuration for the 64-core 2D-Mesh NoC is shown in Table 2. The simulation configuration comprises the following settings: Instruction Set Architecture = ALPHA, Interconnect Frequency = 1 GHz, Number of Virtual Networks (VN) = 3, Number of Virtual Channels (VC) per VN = 4, Number of Buffers per VC = 4, NoC sizes of \(8\times 8\) and \(4\times 4\), and XY-routing [55]. The 64-core and 16-core NoCs are simulated in gem5, as gem5 only allows square mesh NoCs whose side length is a power of two. Each core has an associated router for communication with the other cores and memory in garnet. The injection rate is set to 0.3, ensuring a high packet injection rate to test the robustness of the mapping solutions, when comparing the MILP, SA, and GA solutions. Furthermore, the latency and throughput performance of all solutions is evaluated at injection rates from 0.05 to 0.3 to demonstrate the robustness of the algorithms.

Table 2 gem5 and garnet configuration for 64-core Mesh NoC

The proposed MILP model and the SA and GA algorithms were evaluated using the E3S benchmark suite [19]. E3S comprises applications in the Consumer (C), Auto-Industry (A), Networking (N), Telecom (T), and Office Automation (O) domains. The Consumer application has two sub-applications with 12 tasks; the Auto-Industry application has four sub-applications with 24 tasks; the Networking application has four sub-applications with 13 tasks; the Telecom application has eight sub-applications with 28 tasks; and Office Automation is a single application with 5 tasks. The energy budget constraint is set to 50 mJ (millijoules) for all applications in the E3S benchmarks.

Fig. 4 Total energy comparison for MILP, SA, and GA on different NoC sizes under E3S benchmarks

Fig. 5 Execution time comparison for MILP, SA, and GA on different NoC sizes under E3S benchmarks

The energy consumption of the proposed MILP model and the SA and GA algorithms was evaluated using 64-core and 16-core 2D-Mesh NoCs for all applications, as shown in Fig. 4a and b. Simulation results for the E3S benchmarks on the 64-core 2D-Mesh NoC suggest that the energy dissipation of the SA and GA solutions is within 5–15% of the optimal or near-optimal MILP solutions. This shows the suitability of SA and GA for high-performance mapping in NoC-based multi-/many-core systems. However, on the 16-core NoC, the energy dissipation of the MILP solutions is 50–90% better than that of the SA and GA solutions. These results indicate that SA and GA perform very close to the MILP solutions when enough resources are available but can struggle when computation and communication resources are limited, as in the small-scale 16-core NoC. Simulation results also suggest that the energy consumption of the GA is, on average, only 2–6% higher than that of the SA for both the 64-core and 16-core NoCs. The faster execution of the applications also indirectly contributes to the lower energy consumption in all the proposed approaches.

Fig. 6 Execution time comparison at different NoC sizes for different applications under E3S benchmark

Execution time performance of all the applications for NoCs with 64 cores and 16 cores is shown in Fig. 5a and b. The execution time for all the algorithms is in the range of milliseconds for all the applications, demonstrating the efficiency of the algorithms. Execution time decreases significantly (by 100–500%) as the NoC size increases from 16 to 64 cores for all the applications, with very few exceptions. This demonstrates the scalability of the MILP model and the SA and GA algorithms, as they properly utilize the increased availability of computation and communication resources in larger NoCs when running the tasks.

Fig. 7 Execution time comparison at different NoC sizes for different applications under E3S benchmark

Fig. 8 Execution time comparison at different NoC sizes for Office Automation application under E3S benchmark

Fig. 9 Execution time of SA and GA for Telecom application for different parameter settings

Execution time performance of all the applications for NoC sizes ranging from \(4 \times 4\) to \(10 \times 10\) is presented to demonstrate the scalability of the SA and GA algorithms. Results for MILP are not included, as MILP takes a very long time to generate solutions for 100 cores. We again observe that execution time decreases as the NoC size increases for all applications, with a few exceptions. The SA and GA algorithms map tasks as close together as possible to reduce communication time, despite the possibility of increased communication distance between mapped tasks as the NoC size grows. The performance of the SA and GA algorithms differs across applications: SA performs better for the Consumer, Networking, and Auto-Industry applications, while GA performs better for the Telecom and Office Automation applications. For the Consumer application, as shown in Fig. 6a, the execution time from the SA algorithm is 10–50% (38% on average) better than that of the GA. For the Auto-Industry application, as shown in Fig. 6b, the execution time from the SA algorithm is 19% better (on average) than that of the GA, with one exception: the GA performs significantly better than the SA on a 49-core NoC. For the Networking application, as shown in Fig. 7a, the execution time of the SA algorithm is 25–90% better than that of the GA, with the advantage of SA growing (up to 90%) in larger NoCs. For the Telecom application, as shown in Fig. 7b, the execution time of the SA algorithm is 4% worse than that of the GA. For the Office Automation application, as shown in Fig. 8, the execution time of the GA algorithm is, on average, 5% better than that of the SA. Along with better performance, the time complexity of the SA is relatively lower (better) than that of the GA. The higher time complexity of the GA mainly results from the multiple stages (selection, crossover, and mutation) involved in generating solutions.

Fig. 10 Latency and throughput comparison for MILP, SA, and GA on 64-core NoC under E3S benchmarks

We also evaluated the performance of the proposed SA and GA at different temperatures and crossover rates, respectively, on a \(6\times 6\) NoC. The Telecom application is chosen here because it has the largest number of tasks among the E3S benchmarks. An initial temperature of \(T=10{,}000\) is chosen for the SA, as it provides the best solution: as shown in Fig. 9a, the solution quality of the SA peaks at \(T=10{,}000\). As shown in Fig. 9b, a crossover rate of 0.7 provides a good mapping solution. Although a crossover rate of 0.4 provides the best mapping in this experiment, we select 0.7 because a higher crossover rate provides more diversity in the mapping population, leading to better mappings and convergence.

Fig. 11 Latency and throughput comparison for MILP, SA, and GA on 16-core NoC under E3S benchmarks

Latency and throughput performance of the proposed MILP model and the SA and GA algorithms was evaluated using 64-core and 16-core 2D-Mesh NoCs on the garnet/gem5 platform. Simulation results for the E3S benchmarks on the 64-core 2D-Mesh NoC suggest that the latency and throughput of the SA and GA solutions are within 5–15% of the optimal/near-optimal MILP solutions, as shown in Fig. 10a and b. This shows the suitability of SA and GA for high-performance mapping in NoC-based multi-/many-core systems. Simulation results also suggest that the latency and throughput performance of the GA is, on average, only 2–6% better than that of the SA; the GA performs slightly better than the SA for the Auto-Industry, Office Automation, and Telecom applications. On the 16-core 2D-Mesh NoC, the latency and throughput performance of MILP, SA, and GA is very similar, as shown in Fig. 11a and b. This implies that performance does not change significantly in a small-scale NoC, as the models and algorithms can address the resource limitations of small-scale systems such as a 16-core NoC.

Fig. 12 Latency and throughput comparison for MILP for different injection rates under E3S benchmarks

Fig. 13 Latency and throughput comparison for SA for different injection rates under E3S benchmarks

The proposed SA and GA algorithms avoid local optima by accepting worse solutions and by performing mutation/crossover, respectively, to reach a solution closer to the globally optimal mapping. This is why the solutions from SA and GA are very close to the MILP solution. The major advantage of the proposed SA and GA for mapping is their ability to quickly produce a near-optimal mapping solution, making them suitable for real-time systems, unlike the solutions generated by LP solvers using the MILP model designed in Sect. 3. An LP solver explores the solution space exhaustively to obtain the optimal solution, which can take a long time and may not produce a solution within a finite time, especially for large-scale NoC-based MCSoC systems.
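The SA's acceptance of worse solutions mentioned above is commonly realized with a Metropolis-style criterion; the sketch below illustrates that standard rule under the assumption that the paper's SA follows it (the function name and cost convention are illustrative).

```python
import math
import random

def accept(delta_cost, temperature):
    """Metropolis-style acceptance: always accept an improving move
    (delta_cost <= 0); accept a worsening move with probability
    exp(-delta_cost / temperature), which lets the search escape
    local optima while the temperature is high."""
    if delta_cost <= 0:
        return True
    return random.random() < math.exp(-delta_cost / temperature)
```

At a high initial temperature (such as the \(T=10{,}000\) used here) most worsening moves are accepted, encouraging exploration; as the temperature cools, the rule increasingly behaves like greedy hill climbing.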

Fig. 14 Latency and throughput comparison for GA for different injection rates under E3S benchmarks

The throughput of all the algorithms for all the applications increases with the injection rate of packets or flits in the NoC, as shown in Figs. 12b, 13b, and 14b for the 64-core NoC. As more packets are injected into the NoC, more packets are delivered to their destinations, which shows the robustness of the mapping solutions. However, as more flits are injected into the NoC, they also suffer more delay due to resource limitations, such as router buffers and link capacity. This is why latency increases with the injection rate, as shown in Figs. 12a, 13a, and 14a for the 64-core NoC.

8 Conclusion

The application mapping problem has been formulated as a mixed-integer linear programming (MILP) model that considers the computation and communication capacities and energy budget constraints of the NoC-based MCSoC, as well as the execution time requirements and the data dependencies between tasks in applications. The MILP model aims to accelerate applications by reducing the computation and communication times of running tasks. Simulated annealing (SA) and genetic algorithm (GA) heuristics have been proposed to optimize the application-architecture mapping and accelerate applications while respecting the capacity and budget constraints of NoC-based MCSoC systems. Simulation results showed that both heuristics perform very close to the optimal solutions generated by the MILP model, with simulated annealing outperforming the genetic algorithm in most cases. The energy consumption of the SA and GA solutions is very close to the optimal MILP solution on a 64-core NoC, whereas the energy differences are larger on a 16-core NoC. As the size of the NoC increases from 16 to 100 cores, the proposed MILP model and the SA and GA algorithms accelerate the applications, as demonstrated in the throughput results. This shows the scalability of the proposed MILP model and heuristics.