1 Introduction

The shift from algorithm-based to machine learning-based computing continues to accelerate. A study [1] suggests that the popularity of neural networks will double the computation demands on datacenters. This growth in demand puts pressure on building large-scale systems to meet it. Fortunately, thanks to advances in transistor technology, the number of cores per chip has increased enormously [2,3,4]. The network-on-chip (NoC) has become the standard communication medium for multi-/many-core systems on-chip because of properties such as scalability and parallelism [5,6,7]. For example, Cerebras has integrated 800K cores on a single NoC-based chip with trillions of transistors [2], which can deliver more performance than a cluster of traditional machines. One could even build a datacenter on a chip, as many-core on-chip systems offer better power efficiency, interconnection, and latency than traditional local area network (LAN)-based systems [8]. However, the fraction of time an application spends in communication increases with the number of compute nodes in a system; communication time can exceed 50% of an application's runtime in large-scale systems [9]. Communication is therefore the major bottleneck to achieving high parallelism in systems with many compute nodes, as it costs far more time and energy than computation [9,10,11].

Given heterogeneous workload requests with diverse computation and communication (data dependency) demands, a challenging research problem is to find the optimal mapping of resource demands onto heterogeneous resources, achieving high-performing task-resource co-allocation. Our target is both computation and communication efficiency for heterogeneous workloads, without incurring overhead, while mapping onto heterogeneous processor tiles connected by heterogeneous routers and links. Mapping optimization in NoC-based MCSoC faces the new challenges of limited energy and capacity budgets. Billions of transistors are integrated on a single chip; however, as transistors keep shrinking, current leakage poses greater challenges, and the resulting increase in power density has caused the failure of Dennard scaling [12]. It is not feasible to activate all cores at their maximum speed simultaneously, leaving a large amount of silicon inactive (dark silicon) or operating at a lower frequency (dim silicon) [13]. Mapping solutions should therefore account for the heterogeneous frequencies of computation and communication resources, including the inactive ones. Dark and dim silicon are realized by applying dynamic voltage-frequency scaling (DVFS), including power-gating. The search space for task-to-core mapping grows quadratically or even exponentially with the number of core, router, and link resources in an MCSoC; for example, the number of communication paths between cores grows this way with the core count in a NoC-based MCSoC. The large number of task-to-node assignment options and data communication (routing) options significantly inflates the solution space and the time complexity of obtaining an optimal solution.
The communication path also depends on the topology of the interconnection network, e.g., mesh or ring. The high time complexity of computing a solution makes it harder to apply in a real-time-constrained system that must meet application deadlines. The mapping or resource allocation problem, like the bin packing problem, is NP-complete and/or NP-hard [14]: the time complexity grows exponentially with the number of tasks in an application and/or resources in the MCSoC, and no known algorithm solves it in polynomial time. Large problem instances cannot be handled by linear programming optimizers (e.g., IBM CPLEX [15]), since they fail to compute optimal solutions in time, especially for run-time mapping decisions. Heuristics can produce solutions quickly without significant hardware overhead, whereas machine learning, a young field with many underexplored research opportunities [16], may require significant training and/or hardware overhead to provide efficient mapping. Heuristics can also cover all possible cases of heterogeneous tasks (and traffic), their data dependencies, and heterogeneous resources in complex, large platforms. This motivates the use of heuristics for application-architecture mapping in NoC-based MCSoC.

The major contributions of this work are outlined below:

  • Mixed-integer Linear Programming Model for Mapping: A mathematical model using mixed-integer linear programming for mapping tasks onto NoC-based MCSoC has been proposed and implemented using IBM CPLEX [15] for optimal solutions. The model aims to accelerate applications by reducing the execution times (sum of computation and communication times) of running tasks. The model includes computation and communication capacity, energy budget constraints of MCSoC, and also incorporates the computation and communication demands of the applications. Mapping tasks of multiple applications is addressed in the MILP model.

  • Simulated Annealing and Genetic Algorithms for Mapping: SA and GA heuristics are proposed to map tasks onto NoC-based MCSoC with the goal of reducing execution times of running tasks to accelerate applications. The energy and performance constraints of the proposed MILP model are integrated with SA and GA for mapping in energy and computation/communication capacity-constrained NoC-based MCSoC systems.

  • Evaluation on Real Benchmarks and Platform: The MILP model and the SA and GA algorithms were evaluated on both small-scale and large-scale systems using garnet [17], a cycle-accurate NoC model on the gem5 [18] platform. The embedded system benchmark suite (E3S) [19] is used to evaluate all the proposed approaches. An in-house simulator was used to generate mapping results for subsequent latency and performance evaluation on the gem5/garnet platform. Mapping of multiple applications is simulated, as every application in E3S contains multiple sub-applications.

Related work is presented in Sect. 2. A mathematical model of mapping tasks onto the NoC-based MCSoC architecture is formulated using MILP in Sect. 3. Then, we discuss the simulated annealing and genetic algorithm mapping heuristics for mapping tasks onto NoC in Sects. 4 and 5, respectively. A mapping example illustrating the mapping of a sample application using MILP and the heuristics is presented in Sect. 6. Simulation results in Sect. 7 demonstrate the effectiveness of the proposed high-performance application mapping solutions, particularly the performance of the SA and GA heuristics compared to the optimal solutions from the MILP model.

2 Related work

Mapping decisions are driven by various energy and performance goals, such as minimizing the execution time of applications, reducing latency in interconnection networks, and optimizing the energy consumption of the MCSoC. Depending on the application needs, the mapping strategy maps the tasks onto the system and allocates/configures the resources in the system. Various task mapping algorithms have been developed for multicore systems: some are based on graph theory, some use mathematical programming, and others are heuristic algorithms (we do not discuss machine learning-enabled task mapping in this work). Variations of shortest-tree algorithms map highly correlated tasks of the task graph to nearby cores in order to shorten the communication delay [20,21,22,23,24]. For example, [23] maps closely communicating threads to neighboring cores to improve the locality of data accesses. Similarly, the multi-application mapping in [25, 26] maps the tasks of the same application together to reduce the communication distance; for example, [25] finds an optimal region for each application using the maximal empty rectangle technique. Along with shortening the communication distance, the communication traffic is split across multiple paths to improve bandwidth efficiency during task mapping in [20]. The run-time mappings in [27,28,29] map communicating tasks into a close neighborhood to obtain contiguous mapping regions. The hill climbing technique in [29] maps the tasks of an application as close together as possible. The run-time mapping in [28] selects a square region for each incoming application to reduce congestion in the mapped neighborhood. Some work focuses on balancing the load distribution to reduce contention and hotspots [30,31,32,33]. Some work considers dark and/or dim silicon cores (voltage-frequency scaled) during task mapping to reduce the overall energy dissipation in the system [34, 35].
Some work minimizes energy hotspots or contention together with overall chip power [32, 35]. Run-time mapping in dark silicon, presented in [34, 36], aims to increase the power budget for activating more resources. Some work focuses on thermal management during task mapping in NoCs [37,38,39,40]. Some work employs fault-tolerant task mapping to enhance the reliability of the system while reducing energy consumption [41]. Some work, such as [42, 43] and this paper, focuses on reducing energy consumption and accelerating applications at the same time. This paper extends the paper published in [44] with several modifications, including the implementation of the MILP model using an LP solver and the addition of MILP results. In this work, we address the placement of computation- and communication-demanding tasks of multiple applications onto heterogeneous NoCs under computation/communication capacity, execution time, and energy budget constraints. The proposed mapping approaches accelerate the applications by reducing their execution time, which can indirectly reduce energy consumption. We modeled the task mapping problem using an MILP model and proposed simulated annealing and genetic algorithm heuristics for quick mapping solutions applicable in real-time systems. We implemented all the proposed models and heuristics and compared their results.

3 Linear programming modeling for mapping

For the linear programming (LP) formulations, we assume a mesh architecture that employs shortest-path routing for data delivery. A \(3 \times 3\) 2D-Mesh NoC-based multicore system is shown in Fig. 1, where a tile contains a core, its cache, and a router. Cores do the computation and routers connected to cores are used for communication between tasks mapped to cores.

Fig. 1
figure 1

Two-dimensional-Mesh NoC-based multicore system on-chip

Each node (or tile interchangeably) i in multicore systems has a frequency denoted by \(f_i\). Two adjacent nodes (routers) i and j are connected by a bidirectional link \(l_{ij}\) with a maximum communication capacity of \(W_{ij}\). The capacity of the nodes (big/little cores and routers) and links (high/low bandwidth links) varies to demonstrate heterogeneity in 2D-Mesh NoC. The capacity of the cores is varied by changing the number of computational operations per second. The capacity of the links is varied by varying link widths, where \({link}_{bandwidth} = {link}_{width} \times {link}_{frequency}\). The frequency of the router determines its packet forwarding rate. The capacity of a router is varied by changing the number of ports, the number of buffers in each port, and the size of each individual buffer. The size of a buffer determines the size of a flit (flow control digit), as a flit (part of a packet) needs to be stored in a buffer due to contention in the NoC.

Given a system that consists of computation and communication resources, this LP model considers a sequence of K applications \(\{A_{1}...A_{K}\}\) for mapping. Each application contains a set of Z tasks \(\{t_{1}...t_{Z}\}\). A sample application graph with computation and communication demands is shown in Fig. 2.

Fig. 2
figure 2

A sample application task graph G

Let’s assume, \((X_k^t)^m\) is a binary variable to indicate the mapping of task t of application k to node (core) m.

3.1 Computation and execution time constraints

Assume \({E}_{k}\) is the execution time of an application k, \((E_k^t)^m\) is the execution time of task t on node m, the total number of tasks in an application k is Z, the total number of nodes in the system is M, \((C_k^t)^m\) represents the computation time of task t on node m, \({Q}_{m}\) is the size (quantum) of the time slice given by node m to each task under the round-robin scheduling policy, and \({S}_{m}\) is the time node m needs to switch execution from one task to another. \((C_k^t)^m\) is calculated by dividing the computation demand \({(cmp)}_{k}^{t}\) by the processor capacity (operations per second) \({pc}_{m}\), where \({(cmp)}_{k}^{t}\) is the number of computational operations requested by task t of application k.

$$\begin{aligned} (C_k^t)^m = (X_k^t)^m \, {(cmp)}_{k}^{t} / {pc}_{m}, \quad \forall 1\le t\le Z, \forall 1\le k\le K, \forall 1\le m\le M. \end{aligned}$$
(1)
$$\begin{aligned} (E_k^t)^m = \sum _{1\le m \le M} (C_k^t)^m \, (1 + {S}_{m}/{Q}_{m}) \end{aligned}$$
(2)
$$\begin{aligned} {E}_{k} = \sum _{\begin{array}{c} 1\le t\le Z \\ 1\le m \le M \end{array}} (E_k^t)^m \end{aligned}$$
(3)
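
The timing relations in Eqs. (1)-(3) can be sketched in a few lines of Python; the function names and the demand, capacity, and scheduling values below are illustrative, not part of the model.

```python
def computation_time(cmp_demand, pc, assigned):
    # Eq. (1): C = X * cmp / pc, where X is the binary assignment indicator
    return (cmp_demand / pc) if assigned else 0.0

def task_execution_time(c_times, S, Q):
    # Eq. (2): round-robin scheduling overhead scales each computation
    # time by (1 + S_m / Q_m) for its node m
    return sum(c * (1.0 + S[m] / Q[m]) for m, c in enumerate(c_times))

def app_execution_time(task_exec_times):
    # Eq. (3): application execution time is the sum over its tasks
    return sum(task_exec_times)
```

For instance, a task of 100 operations on a node with capacity 50 ops/s takes 2 s of computation; with S/Q = 0.1 the scheduled execution time becomes 2.2 s.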

Since the mapping needs to satisfy the computation time requirements (quality-of-service requirements), \({E}_{k}^{q}\), for all the applications, we arrive at the computation constraint for each application k:

$$\begin{aligned} {E}_{k} \le {E}_{k}^{q},\quad \forall 1\le k\le K. \end{aligned}$$
(4)

3.2 Communication and link constraints

Mapping needs to consider the capacity of the communication links in the NoC and the traffic on those links, as the execution time of tasks depends on the communication time with mapped tasks at remote nodes in the MCSoC. Adjacent tasks \(T_k^s\) and \(T_k^t\) in the task graph need to communicate with each other, with a traffic load of \({(com)}_k^{st}+{(com)}_k^{ts}\). When they are mapped to two nodes, e.g., p and q, the traffic is delivered through a path, denoted \(Ph_{pq}\), which consists of a set of NoC communication links. Such a path is determined by the underlying routing algorithm, the XY-routing algorithm in this work. Multiple traffic flows may simultaneously traverse a communication link \(l_{mn}\), so the total communication traffic load on a link is:

$$\begin{aligned} \omega _{mn}=\sum _{{\begin{array}{c} 1\le p,q\le N\\ 1\le s,t \le Z\\ 1\le k \le K \\ \exists l_{mn}\in Ph_{pq} \end{array}}} {(com)}_k^{st} \cdot (X_k^s)^p \cdot (X_k^t)^q. \end{aligned}$$
(5)

Since each communication link \(l_{mn}\) has a maximum bandwidth capacity of \(W_{mn}\), we arrive at the communication constraint:

$$\begin{aligned} \omega _{mn} \le W_{mn}, \quad \forall 1\le m,n\le N. \end{aligned}$$
(6)

Communication time on the link \(l_{mn}\) is determined by dividing the traffic load \(\omega _{mn}\) on the link \(l_{mn}\) by the corresponding bandwidth of the link \(W_{mn}\):

$$\begin{aligned} {C}_k^{mn} = \omega _{mn}/ W_{mn},\quad \forall 1\le m,n\le N. \end{aligned}$$
(7)
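
The link-load aggregation of Eq. (5) under XY-routing can be sketched as follows; `xy_path` and `link_loads` are hypothetical helper names, and nodes are assumed to be numbered row-major on an N-column mesh.

```python
def xy_path(src, dst, N):
    """Return the list of directed links (a, b) on the XY route src -> dst
    in an N-column 2D mesh with row-major node numbering."""
    links, cur = [], src
    # XY-routing: travel along X (columns) first, then along Y (rows)
    while cur % N != dst % N:
        nxt = cur + 1 if dst % N > cur % N else cur - 1
        links.append((cur, nxt))
        cur = nxt
    while cur // N != dst // N:
        nxt = cur + N if dst // N > cur // N else cur - N
        links.append((cur, nxt))
        cur = nxt
    return links

def link_loads(flows, N):
    """flows: list of (src_node, dst_node, traffic) tuples.
    Returns the per-link traffic totals of Eq. (5)."""
    loads = {}
    for src, dst, traffic in flows:
        for link in xy_path(src, dst, N):
            loads[link] = loads.get(link, 0) + traffic
    return loads
```

Dividing each accumulated load by the link bandwidth \(W_{mn}\) then yields the communication times of Eq. (7), and comparing it against \(W_{mn}\) checks the constraint of Eq. (6).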

3.3 Energy budget constraint

The energy dissipation \({P}_{energy}\) is determined by multiplying the computation and communication loads by the energy dissipation per load unit. \(e_{cmp}\) and \(e_{com}\) represent the energy coefficients of the computation and communication loads, respectively. In 22 nm technology, \(e_{cmp}\) and \(e_{com}\) are approximately 45 picojoules (pJ) per computational operation (assuming 8 bits per computational operation for simplicity) and 65 pJ per communicated bit, respectively [45].

$$\begin{aligned} {P}_{energy} = e_{cmp} \cdot \sum _{\begin{array}{c} 1\le t\le Z \\ 1\le k\le K\\ 1\le m \le N \end{array}} (X_k^t)^m {(cmp)}_{k}^{t} + e_{com} \cdot \sum _{1\le m, n\le N} \omega _{mn}. \end{aligned}$$
(8)

Since the mapping needs to satisfy the energy budget \(P_{budget}\), we arrive at the energy budget constraint:

$$\begin{aligned} {P}_{energy} \le P_{budget}. \end{aligned}$$
(9)
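
A minimal sketch of the energy check in Eqs. (8)-(9), using the 22 nm coefficients quoted above; the function names and workloads are illustrative.

```python
E_CMP = 45e-12  # joules per computational operation (22 nm figure above)
E_COM = 65e-12  # joules per communicated bit (22 nm figure above)

def mapping_energy(total_ops, total_bits):
    # Eq. (8): energy = e_cmp * computation load + e_com * communication load
    return E_CMP * total_ops + E_COM * total_bits

def within_budget(total_ops, total_bits, budget):
    # Eq. (9): the mapping is feasible only if it fits the energy budget
    return mapping_energy(total_ops, total_bits) <= budget
```

For example, a mapping with 10^9 operations and 10^9 communicated bits dissipates about 0.11 J under these coefficients.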

3.4 Objective function

The objective is to minimize the total execution time (sum of computation and communication time) of the tasks to accelerate the applications:

$$\begin{aligned} Minimize{:}\quad \sum _{1\le k\le K} \biggl ( {E}_{k} + \sum _{1\le m,n \le N} {C}_k^{mn} \biggr ) \end{aligned}$$
(10)

subject to the constraints given in Eqs. (4), (6), and (9). This objective function, together with the communication, computation, and energy constraints, forms an MILP model, as it contains both binary and continuous variables.
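
For toy instances, the structure of the model can be illustrated by a brute-force search over all task-to-node assignments; this is only a sketch with an assumed hop-count proxy for the communication time, not the MILP formulation itself, and it is tractable only for a handful of tasks and nodes.

```python
from itertools import product

def hop_comm_time(assign, comm, N):
    # Proxy for communication time: traffic x Manhattan hop distance on an
    # N-column 2D mesh with row-major node numbering (illustrative only)
    total = 0.0
    for (s, t), traffic in comm.items():
        a, b = assign[s], assign[t]
        total += traffic * (abs(a % N - b % N) + abs(a // N - b // N))
    return total

def brute_force_map(cmp_demand, pc, comm, N):
    """Exhaustively try every task-to-node assignment and return the one
    minimizing computation time plus the hop-based communication proxy."""
    Z, M = len(cmp_demand), len(pc)
    best, best_cost = None, float("inf")
    for assign in product(range(M), repeat=Z):  # one node per task
        comp = sum(cmp_demand[t] / pc[assign[t]] for t in range(Z))
        cost = comp + hop_comm_time(assign, comm, N)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost
```

With M nodes and Z tasks this search visits M^Z assignments, which illustrates why solvers and heuristics are needed for realistic instances.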

A unique property of the proposed MILP model is the inclusion of both computation and communication in the mapping formulation: it captures the computation and communication capacities of the NoC resources as well as the computation and communication demands of the tasks and applications, and its objective function accelerates both the computation and communication times of the mapping solution. The model is designed to handle the heterogeneity of applications and systems/networks, which is crucial when mapping tasks onto NoC-based multi-/many-core architectures. This architecture-awareness and application-demand awareness help to map a greater amount of computation and communication by utilizing the NoC resources effectively. The model places the computation and communication demands of tasks by considering the loads and energy consumption on the nodes (cores and routers) and links of the NoC. By accounting for utilization across different regions of the NoC, the energy budget, and task deadline constraints, the proposed MILP model reduces the execution time for both computation and communication, accelerating applications and systems. The reduced execution time further lowers energy consumption, as energy consumption tends to grow with time (assuming other factors, such as power usage, remain constant).

We observe long computation times for an LP solver to discover an optimal solution. For example, mapping an application with 28 tasks onto a 36-core NoC takes hours for LP solvers such as IBM CPLEX [15] and the mixed-integer LP solver (intlinprog) in MATLAB [46]. When the number of applications and/or the NoC size increases, an LP solver cannot deliver a solution within a time period reasonable for real-time systems. Simulated annealing (SA) and genetic algorithm (GA) heuristics can produce near-optimal solutions far more quickly than LP solvers. As in the LP modeling, a mesh NoC-based MCSoC is used to evaluate the proposed SA- and GA-based mappings. The XY-routing algorithm is used to find the distance between two communicating tasks mapped to different cores in a 2D-Mesh NoC, where it functions as a shortest-path algorithm, much like Dijkstra's algorithm [14]. Although a 2D-Mesh NoC is used for the MILP model and heuristics in this work, all these approaches can be adapted to other NoC topologies, such as a ring or torus, by changing or modifying the routing algorithm.

4 Simulated annealing for mapping

Simulated annealing (SA) [47] has proven quite effective at approximating global optima for many NP-hard or NP-complete combinatorial problems, such as mapping. Due to the high time complexity of the mapping problem, an SA technique is proposed to quickly solve the application-architecture mapping problem in MCSoC. The proposed SA can address the heterogeneity of applications and systems and can handle the mapping of a large number of applications and tasks in large-scale systems. The pseudocode of the proposed SA for task mapping onto NoC-based multicore systems is shown in Algorithm 1. The technique starts with an initial temperature of \(T=10{,}000\), an \(\alpha\) value of 0.95, the cooling function \(T = T \cdot \alpha\), an \(\epsilon\) value of 0.5, and an initial mapping solution. The values of T, \(\alpha\), and \(\epsilon\) are determined empirically (based on the best performance). The initial mapping is evaluated first. The heuristic then executes until it meets a predetermined stopping criterion: either the temperature drops below \({10}^{-20}\) or the mapping remains unchanged for 250 iterations. At each iteration, the current mapping is perturbed to produce a new mapping, which is compared to the current one. The new mapping must satisfy the constraints in Eqs. (4), (6), and (9) before it is considered for acceptance. If the new mapping yields a better execution time (i.e., the fitness function being optimized) for the mapped application(s), it becomes the current mapping. When the new mapping is worse than the current one, it is still accepted if \(\epsilon\) is smaller than \(e^{- |{score}_{current}- {score}_{new}|/T}\), where \({score}_{new}\) and \({score}_{current}\) are the new and current values returned by the evaluation function.
Accepting worse solutions helps explore a broader search space, potentially escaping local optima and discovering better solutions that may lie further away in the search space. This strategy of accepting both better and worse solutions resembles the exploration and exploitation strategies in reinforcement learning (RL) [48]. Exploitation in SA means accepting better solutions with high probability, while exploration is achieved by accepting worse solutions with a certain probability to avoid local optima. The exploration-exploitation trade-off in SA is controlled by the cooling schedule (\(e^{- |{score}_{current}- {score}_{new}|/T}\)). At higher temperatures, SA is more explorative (more likely to accept worse solutions); the probability of accepting worse solutions gradually reduces as the temperature decreases. As the optimization progresses and the temperature parameter decreases (during the annealing process), SA becomes more likely to accept only better solutions, converging toward an optimal or near-optimal solution. This feature of SA mimics a machine learning or reinforcement learning agent that becomes smarter with more training.

The evaluation function calculates a score for a mapping using the computation score and communication score. Communication scores are calculated by considering the communication times between the communicating tasks in NoC. Computation scores are calculated by considering the computation times of the tasks at the corresponding mapped nodes. The return from the evaluation function (in SA) is the inverse of the summed computation and communication times (execution time). So, a higher score (lower execution time) means better performance.
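
The acceptance rule and cooling schedule described above can be sketched as follows, assuming the score is the inverse of execution time (higher is better) and using the fixed \(\epsilon = 0.5\) from the text.

```python
import math

def accept(score_current, score_new, T, eps=0.5):
    # Better (or equal) mapping: always keep it (exploitation)
    if score_new >= score_current:
        return True
    # Worse mapping: keep it only while the temperature is high enough
    # relative to the score gap (exploration, to escape local optima)
    return eps < math.exp(-abs(score_current - score_new) / T)

def cool(T, alpha=0.95):
    # Geometric cooling schedule T = T * alpha used by the proposed SA
    return T * alpha
```

At high T the exponential term is close to 1, so even noticeably worse mappings pass the \(\epsilon\) test; as T shrinks toward the \(10^{-20}\) stopping threshold, only improvements survive.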

Algorithm 1
figure a

Simulated Annealing Heuristic for Task-to-Node Assignment in MCSoC

Similar to the MILP model, the unique properties of the proposed SA are architecture-awareness and application-demand awareness during mapping. The proposed SA keeps track of the utilization of the computation and communication capacity of the NoC resources and is aware of the computation and communication demands of the tasks and applications. Its evaluation function considers both computation and communication scores, reflecting the performance of the mapping in terms of both task execution and inter-task communication, and the method can handle the heterogeneity of applications and systems/networks. The implementation employs a strategy that balances exploration and exploitation, similar to techniques used in RL; a major property of SA is the use of probability to accept worse solutions to avoid local optima. The proposed SA places the computation and communication demands of tasks by considering the current loads and energy consumption on the nodes (cores and routers) and links of the NoC. By considering current utilization in different regions of the NoC along with the energy budget and task deadline constraints, the proposed SA accelerates both computation and communication, leading to higher performance for systems and applications. Faster execution further reduces energy consumption, as energy consumption increases with time.

5 Genetic algorithm for mapping

Genetic algorithms (GAs) follow evolutionary processes [49,50,51] to search large solution spaces. The GA approach of combining and evolving previous solutions can yield efficient solutions that may not be obvious and would be difficult to arrive at intuitively. In a classical GA, a solution is encoded as a chromosome. In the proposed GA for mapping, a chromosome represents the mapping of tasks onto nodes, and a population consists of multiple chromosomes, i.e., multiple mappings of (multiple) tasks onto nodes. The algorithm begins by encoding a set of possible mapping solutions as chromosomes; the entire set of mapping solutions (chromosomes) is called a population (a list of mapping solutions). Like SA, the proposed GA for mapping is an iterative improvement algorithm: the population of mapping solutions is iteratively updated toward better solutions until a stopping criterion is met. The iteration starts with a population of randomly generated individuals (mappings), and the population in each iteration is called a generation. In each generation, the performance score (inverse of execution time) of every individual in the population is evaluated, using a calculation similar to that in simulated annealing based on computation and communication scores. Mappings with lower execution times are stochastically selected from the current population, and each selected mapping is modified (recombined and possibly randomly mutated) to form a new mapping. The new candidate solutions are then used in the next iteration of the algorithm. A GA typically terminates either when a maximum number of generations has been produced or when a satisfactory execution time has been reached by the population. In this paper, we terminate the iteration when the maximum number of generations is reached.
Similar to the proposed SA algorithm, the new mapping solution in the proposed GA also needs to meet the constraints in Eqs. (4), (6), and (9), before the mapping candidate is considered for acceptance. The pseudocode of the proposed GA for mappings onto multicore systems is shown in Algorithm 2.
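
One generation of the loop described above might be sketched as follows; `evaluate`, `select`, `crossover`, and `mutate` are placeholders for the steps detailed in the following subsections.

```python
def ga_step(population, evaluate, select, crossover, mutate):
    """Produce the next generation from the current population of mappings."""
    scored = [(evaluate(m), m) for m in population]  # evaluation step
    pool = select(scored)              # stochastic, fitness-biased mating pool
    next_gen = []
    # pair up the mating pool (assumed even-sized for this sketch)
    for a, b in zip(pool[::2], pool[1::2]):
        c1, c2 = crossover(a, b)       # recombine two parent mappings
        next_gen += [mutate(c1), mutate(c2)]
    return next_gen
```

Constraint checking against Eqs. (4), (6), and (9) would sit inside `evaluate`, returning an invalid score for infeasible mappings so they are filtered out by selection.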

Algorithm 2
figure b

Genetic Algorithm Heuristic for Task-to-Node Assignment in MCSoC

The four steps of the proposed GA, described below, are evaluation, selection, crossover, and mutation. One cycle of selection, crossover, mutation, and evaluation is called a generation.

5.1 Evaluation

The evaluation function calculates a score for the mapping using the computation and communication times. A higher performance score returned from the evaluation indicates a lower execution time. The evaluation function can also return invalid scores; for example, a score of 0 indicates an invalid mapping that violates the computation capacity, communication capacity, or energy budget constraints.

5.2 Selection

The selection phase chooses the best-suited mappings and discards the worse (higher execution time) mapping solutions. The main idea is to increase the chances of better mapping solutions being preserved into the next generation. Selection consists of two steps: selection and sampling [52]. The linear ranking selection algorithm [53] is used for selection. It sorts the mappings in descending order of performance score, with the highest-scoring mapping at the beginning of the sorted list and the lowest at the end. It then assigns a selection probability to each mapping in the current population based on its rank in the sorted list; the selection probability is a linear function of the mapping's rank. The selection probability \(Prob_{i}\) for the \(i\)-th mapping is calculated as follows:

$$\begin{aligned} Prob_{i} = (1 / {pop}_{size}) \cdot (n_{max} - (n_{max} - n_{min}) \cdot (i / ({pop}_{size} - 1))) \end{aligned}$$
(11)

where \({pop}_{size}\) is the total number of mappings in the population, \(n_{max}\) is a constant representing the maximum probability for the fittest mapping, \(n_{min}\) is a constant representing the minimum probability for the least fit mapping, and i is the rank of the mapping in the sorted list, ranging from 0 to \({pop}_{size} - 1\). \(n_{max}\) and \(n_{min}\) are determined empirically and control the level of exploration and exploitation desired in the GA. A higher \(n_{max}\) increases the selection pressure toward the best mappings, making them more likely to be selected, while a smaller \(n_{max}\) reduces that pressure, allowing more exploration. Similarly, a higher \(n_{min}\) gives the worst mappings a greater chance of being selected, promoting exploration, while a smaller \(n_{min}\) reduces their chance, allowing more exploitation. \(n_{max}\) and \(n_{min}\) are set to 1.1 and 0.9, respectively, in our experiments, as these values produce the best-performing solutions. The selection algorithm favors better mappings by assigning them higher probabilities; however, most solutions, including worse ones, retain some probability of selection for the next generation. This helps avoid premature convergence to local optima, leading to a good mapping solution.
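
Eq. (11) can be sketched directly; with \(n_{max} = 1.1\) and \(n_{min} = 0.9\) as in the text, the probabilities sum to 1 by construction.

```python
def rank_probabilities(pop_size, n_max=1.1, n_min=0.9):
    # Eq. (11): linear ranking selection probability for rank i
    # (i = 0 is the fittest mapping in the sorted list)
    return [(1.0 / pop_size) *
            (n_max - (n_max - n_min) * i / (pop_size - 1))
            for i in range(pop_size)]
```

For a population of 5, the fittest mapping receives probability 1.1/5 = 0.22 and the least fit 0.9/5 = 0.18, a gentle bias that preserves diversity.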

The sampling algorithm then produces the mating pool by selecting individual mappings from the current mapping solutions. The Roulette Wheel Selection algorithm is used for sampling: the algorithm spins a "pseudo" roulette wheel by randomly selecting a position. Each slice of the wheel is a candidate mapping solution, with slice size proportional to the performance score returned by the evaluation function. Because better mapping solutions occupy larger slices, they have a higher probability of being selected as winners of the spin. The chosen mapping corresponds to the slice where the wheel stops, and the spinning process is repeated until the mating pool contains the desired number of selected mappings.
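
A minimal sketch of the roulette-wheel spin, assuming slice sizes proportional to the performance scores; the `rng` parameter is an illustrative hook for injecting the random spin.

```python
import random

def roulette_spin(scores, rng=random.random):
    """Return the index of the mapping whose slice the wheel stops on."""
    total = sum(scores)
    spin, acc = rng() * total, 0.0      # random position on the wheel
    for i, s in enumerate(scores):
        acc += s                        # walk the slices in order
        if spin < acc:
            return i
    return len(scores) - 1              # guard against float round-off
```

Repeated spins fill the mating pool; a mapping with score 3 is three times as likely to win a spin as one with score 1.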

5.3 Crossover

A set of chromosome pairs is formed by randomly selecting mappings (chromosomes) from the current mappings. For each pair, a random number r between 0 and 1 is generated; if r is less than or equal to the crossover rate, the pair is chosen for crossover. For each chosen pair, every task (gene) is exchanged between the two parents with a certain probability called the swapping probability; this method is known as uniform crossover [54]. The crossover rate and swapping probability are empirically determined to be 0.7 and 0.5, respectively, to produce the best-performing solutions. A higher crossover rate performs more crossovers and introduces more diversity into the mapping solutions, which helps reach a better mapping solution and aids convergence. Uniform crossover introduces a higher level of exploration in the search space than traditional crossover methods, such as single-point or multi-point crossover, as it can create new mappings with gene combinations present in neither parent. This exploration property helps maintain genetic diversity in the population and increases the chances of discovering better solutions during optimization.
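
Uniform crossover with a swapping probability of 0.5 can be sketched as follows; the `rng` parameter is an illustrative hook for the per-gene random draw.

```python
import random

def uniform_crossover(parent_a, parent_b, p_swap=0.5, rng=random.random):
    """Exchange each gene (task placement) between the two parent
    mappings independently with probability p_swap."""
    child_a, child_b = list(parent_a), list(parent_b)
    for g in range(len(child_a)):
        if rng() < p_swap:
            child_a[g], child_b[g] = child_b[g], child_a[g]
    return child_a, child_b
```

Because each gene position is decided independently, the children can combine placements in ways no single cut point could produce, which is the exploration benefit noted above.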

5.4 Mutation

Mutation introduces diversity into the population of mappings to prevent the GA from converging to local optima. For each mapping, a random number r between 0 and 1 is generated; if r is less than or equal to the mutation rate of 0.05, the mapping is chosen for mutation. Two tasks (genes) are then selected randomly within the chosen mapping, and their values (core assignments) are swapped to create a new mapping. The mutation rate is kept small because a higher mutation rate (e.g., 0.5) could turn the proposed GA into a random search, which reduces the quality of the mapping solution. Similar to reinforcement learning (RL), the GA balances exploitation and exploration to iteratively improve solutions over generations: crossover and mutation drive exploration, while selection drives exploitation.
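The swap mutation above can be sketched as follows; the function name and the list encoding of a mapping are assumptions for illustration only.

```python
import random

MUTATION_RATE = 0.05  # kept small to avoid degenerating into random search

def swap_mutation(mapping):
    """With probability MUTATION_RATE, swap two randomly chosen tasks
    (gene values) within a single mapping to create a new mapping."""
    mutated = list(mapping)
    if random.random() <= MUTATION_RATE:
        i, j = random.sample(range(len(mutated)), 2)  # two distinct positions
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

A swap perturbs task placement without adding or removing assignments, so the mutated mapping remains a valid permutation of the original one.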

Table 1 Properties of the proposed mapping solutions

Similar to the proposed MILP model and SA method, the unique properties of the proposed GA are architecture awareness and application-demand awareness during mapping. The evaluation function in the proposed GA considers both computation and communication times when calculating the execution time of the mapping solutions. The proposed GA method can handle the heterogeneity of applications and systems/networks. It is aware of the utilization of the computation and communication capacity of the NoC resources and of the computation and communication demands of the tasks and applications. Another unique aspect of the proposed GA is its use of selection and sampling algorithms. The proposed GA uses linear ranking selection to assign selection probabilities to each mapping based on its performance score, and the Roulette Wheel Selection sampling algorithm to produce mating pools based on the performance scores of the candidate mapping solutions. The combination of selection and sampling in the proposed GA balances exploitation and exploration. By considering the current utilization in different regions of the NoC along with the energy budget and task deadline constraints, the proposed GA accelerates both computation and communication, leading to the acceleration of applications and systems. Furthermore, the reduced execution time contributes to reduced energy consumption.

The differences and similarities in properties of all three proposed mapping solutions for high performance are summarized in Table 1.

6 Mapping example

A sample mapping of an application with two sub-applications (sub app-1 and sub app-2) on a \(4\times 4\) 2D-Mesh NoC is shown in Fig. 3. Both the computation and communication demands of the applications are presented in the figure and are considered for mapping. The node capacity, link capacity, energy budget, and deadline constraints must be satisfied by any mapping solution. Node capacities are represented as computational operations per second, while link capacities are represented as the width of the links.

Fig. 3 Application mapping using MILP and heuristics in \(4\times 4\) heterogeneous NoC

The total energy consumption budget for a mapping solution is set to 12 nanojoules (nJ) based on 22 nm technology. Leakage energy for idle NoC resources is not considered in this example. The deadline to complete the application is set to 15 nanoseconds (ns). The node and link capacities and the energy budget of a \(4\times 4\) NoC are shown in Fig. 3a. The frequency of the routers and links is set to 1 gigahertz (GHz). Link widths vary from 8 bits to 16 bits; the bit-width and frequency of a link determine its bandwidth. Routers have 2, 3, or 4 ports depending on their position within the 2D-Mesh NoC. We assume that routers have a sufficient number of buffers to store and forward the flits. The capacities of the cores are on the order of billions of operations per second; for example, the capacity of the first core is 8 billion operations per second. Individual node and link capacities are varied to demonstrate mapping in a heterogeneous NoC. Figure 3b–d shows the node- and link-wise computation and communication load distribution produced by MILP, simulated annealing, and the genetic algorithm, respectively. Node load and node capacity determine the computational time of a task; similarly, link load and link capacity determine the communication time between mapped tasks. The goals of the MILP and the heuristics are to accelerate the execution of the applications and reduce energy consumption. Because the loads in the NoC determine execution time and energy consumption, the loads are shown in the figures. Furthermore, the node- and link-wise load distribution demonstrates resource utilization in different regions of the NoC. High utilization at a node or link can introduce hotspots, which can block or delay flits from reaching their destinations.
The total energy in the NoC is calculated by summing up the computational energy consumption at the node and the communication energy consumption on the links. Similarly, the total execution time is calculated by summing the computation and communication times of the tasks in the applications.
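As an illustration of these two aggregations, the sketch below sums per-node and per-link contributions. It is a simplified sketch, not the paper's actual energy or timing model: the per-unit energy costs, the dictionary-based load representation, and the function names are all assumptions.

```python
def total_energy(node_loads, link_loads, e_comp_per_op, e_comm_per_bit):
    """Total energy = computational energy at the nodes plus
    communication energy on the links (illustrative per-unit costs)."""
    comp = sum(load * e_comp_per_op for load in node_loads.values())
    comm = sum(load * e_comm_per_bit for load in link_loads.values())
    return comp + comm

def total_execution_time(node_loads, node_caps, link_loads, link_caps):
    """Total execution time = sum of computation times (node load over
    node capacity) and communication times (link load over link capacity)."""
    comp = sum(node_loads[n] / node_caps[n] for n in node_loads)
    comm = sum(link_loads[l] / link_caps[l] for l in link_loads)
    return comp + comm
```

In this simplified view, a node carrying 8 units of load at capacity 4 contributes 2 time units, mirroring how node load and node capacity determine computational time in the example above.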

Figure 3b–d also shows the total energy consumption and execution time from the MILP and the heuristics for a random application in this mapping example. The total energy consumption and execution time of simulated annealing and the genetic algorithm are very close to each other, with simulated annealing performing slightly better. The MILP solutions are 10–20% better (lower) in total energy consumption and execution time than the solutions obtained from simulated annealing and the genetic algorithm. However, both the MILP model and the heuristics produce mapping solutions that meet the node and link capacities and energy budget of the NoC and the deadline demands of the application. With a more constrained mapping, such as a reduced energy budget or deadline, the gap in total energy consumption and execution time between the MILP and the heuristics would decrease. However, none of these methods may produce a mapping solution if the mapping problem is very tightly constrained. Therefore, the constraints need to be set carefully after conducting some experiments with the capacity, budget, and deadline constraints.

7 Simulation and results

We implemented the mixed-integer linear programming optimization model described in Sect. 3 using the commercial solver IBM CPLEX [15], which searches for the optimal mapping solution. We integrated the CPLEX solver into MATLAB and invoked it through the MATLAB optimization toolbox [46]. We implemented both the proposed SA and GA for mapping evaluation using our in-house simulator, which includes the mathematical constraints given in Sect. 3 (Eqs. 4, 6, and 9). The mapping solutions were then fed to gem5/garnet to evaluate the latency and throughput performance of the NoCs. We simulated multicore systems ranging from 16 to 100 cores, where the cores are interconnected by a mesh-based NoC. For instance, a \(10\times 10\) 2D-Mesh NoC is used for the 100-core system, where each core has a router for communication with the other cores and memory in the MCSoC. Both large-scale and small-scale NoCs were simulated to show the robustness of the mapping algorithms. The performance score of a mapping solution is calculated as the inverse of the sum of the computation and communication times of the tasks mapped onto the system.
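The performance score defined at the end of the paragraph above is simply the reciprocal of total time, so faster mappings score higher. A one-line sketch (the function name is illustrative):

```python
def performance_score(computation_time, communication_time):
    """Performance score used to rank mapping solutions: the inverse of
    the total (computation + communication) time of the mapped tasks."""
    return 1.0 / (computation_time + communication_time)
```

For example, a mapping with 2 time units of computation and 3 of communication scores 0.2, while a faster mapping totaling 4 time units scores 0.25 and would be preferred.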

Real-system experiments were conducted using the gem5 [18] and garnet [17] platforms to evaluate NoC performance in terms of end-to-end latency and network throughput. The mapping solutions from the linear programming solver and from the in-house simulator for the SA and GA heuristics are fed to the garnet [17] on-chip network in the gem5 platform [18] for latency and throughput evaluation on real systems. The gem5/garnet configuration for the 64-core 2D-Mesh NoC is shown in Table 2. The simulation configuration comprises the following settings: Instruction Set Architecture = ALPHA, Interconnect Frequency = 1 GHz, Number of Virtual Networks (VN) = 3, Number of Virtual Channels (VC) per VN = 4, Number of Buffers per VC = 4, NoC sizes of \(8\times 8\) and \(4\times 4\), and XY-routing [55]. The 64-core and 16-core NoCs are simulated in gem5, as gem5 only allows square mesh NoCs whose side length is a power of two. Each core has an associated router for communication with the other cores and memory in garnet. The injection rate is set to 0.3, ensuring a high packet injection rate to test the robustness of the mapping solutions, when comparing the MILP, SA, and GA solutions. Furthermore, the latency and throughput performance of all solutions is evaluated at injection rates from 0.05 to 0.3 to demonstrate the robustness of the algorithms.

Table 2 gem5 and garnet configuration for 64-core Mesh NoC

The proposed MILP model and the SA and GA algorithms were evaluated using the E3S benchmark suite [19]. E3S comprises applications in the Consumer (C), Auto-Industry (A), Networking (N), Telecom (T), and Office Automation (O) domains. The Consumer application has two sub-applications with 12 tasks; the Auto-Industry application has four sub-applications with 24 tasks; the Networking application has four sub-applications with 13 tasks; the Telecom application has eight sub-applications with 28 tasks; and Office Automation is a single application with 5 tasks. The energy budget constraint is set to 50 mJ (millijoules) for all applications in the E3S benchmarks.

Fig. 4 Total energy comparison for MILP, SA, and GA on different NoC sizes under E3S benchmarks

Fig. 5 Execution time comparison for MILP, SA, and GA on different NoC sizes under E3S benchmarks

The energy consumption of the proposed MILP model and the SA and GA algorithms was evaluated using 64-core and 16-core 2D-Mesh NoCs for all applications, as shown in Fig. 4a and b. Simulation results for the E3S benchmarks on the 64-core 2D-Mesh NoC suggest that the energy dissipation of the SA and GA solutions is within 5–15% of the optimal or near-optimal MILP solutions. This shows the suitability of SA and GA for high-performance mapping in NoC-based multi-/many-core systems. However, on the 16-core NoC, the energy dissipation of the MILP solutions is 50–90% better than that of the SA and GA solutions. These results indicate that SA and GA perform very close to the MILP solutions when enough resources are available but can struggle when computation and communication resources are limited, as in the small-scale 16-core NoC. Simulation results also suggest that the energy consumption of the GA is, on average, only 2–6% higher than that of the SA for both the 64-core and 16-core NoCs. The faster execution of the applications also indirectly contributes to the lower energy consumption in all the proposed approaches.

Fig. 6 Execution time comparison at different NoC sizes for different applications under E3S benchmark

Execution time performance of all the applications for NoCs with 64 cores and 16 cores is shown in Fig. 5a and b. The execution time for all the algorithms is in the range of milliseconds for all the applications, demonstrating the efficiency of the algorithms. Execution time decreases significantly (by 100–500%) as the NoC size increases from 16 to 64 cores for all the applications, with very few exceptions. This demonstrates the scalability of the MILP model and the SA and GA algorithms, as they properly utilize the increased availability of computation and communication resources in larger NoCs when running the tasks.

Fig. 7 Execution time comparison at different NoC sizes for different applications under E3S benchmark

Fig. 8 Execution time comparison at different NoC sizes for Office Automation application under E3S benchmark

Fig. 9 Execution time of SA and GA for Telecom application for different parameter settings

Execution time performance of all the applications for NoC sizes ranging from \(4 \times 4\) to \(10 \times 10\) is presented to demonstrate the scalability of the SA and GA algorithms. Results for MILP are not included, as MILP takes a very long time to generate solutions for 100 cores. We again observe that execution time decreases as the NoC size increases for all applications, with a few exceptions. The SA and GA algorithms map tasks as close together as possible to reduce communication time, despite the possibility of increased communication distance between mapped tasks as the NoC size grows. The performance of the SA and GA algorithms differs across applications: SA performs better for the Consumer, Networking, and Auto-Industry applications, while GA performs better for the Telecom and Office Automation applications. For the Consumer application, as shown in Fig. 6a, the execution time from the SA algorithm is 10–50% (38% on average) better than that of the GA. For the Auto-Industry application, as shown in Fig. 6b, the execution time from the SA algorithm is 19% better (on average) than that of the GA, with one exception: the GA performs significantly better than the SA on a 49-core NoC. For the Networking application, as shown in Fig. 7a, the execution time of the SA algorithm is 25–90% better than that of the GA, with the advantage of SA growing (up to 90%) in larger NoCs. For the Telecom application, as shown in Fig. 7b, the execution time of the SA algorithm is 4% worse than that of the GA. For the Office Automation application, as shown in Fig. 8, the execution time of the GA algorithm is, on average, 5% better than that of the SA. Along with better performance, the time complexity of the SA is relatively lower (better) than that of the GA. The higher time complexity of the GA mainly results from the multiple stages (selection, crossover, and mutation) involved in generating solutions.

Fig. 10 Latency and throughput comparison for MILP, SA, and GA on 64-core NoC under E3S benchmarks

We also evaluated the performance of the proposed SA and GA at different temperatures and crossover rates, respectively, on a \(6\times 6\) NoC. The Telecom application is chosen here because it has the largest number of tasks among the E3S benchmarks. An initial temperature of \(T=10{,}000\) is chosen for the SA, as it provides the best solution: as shown in Fig. 9a, the solution quality of the SA peaks at \(T=10{,}000\). As shown in Fig. 9b, a crossover rate of 0.7 provides a good mapping solution. Although a crossover rate of 0.4 provides the best mapping in this experiment, we select 0.7 because a higher crossover rate provides more diversity in the mapping population, leading to better mappings and convergence.

Fig. 11 Latency and throughput comparison for MILP, SA, and GA on 16-core NoC under E3S benchmarks

Latency and throughput performance of the proposed MILP model and the SA and GA algorithms was evaluated using 64-core and 16-core 2D-Mesh NoCs on the garnet/gem5 platform. Simulation results for the E3S benchmarks on the 64-core 2D-Mesh NoC suggest that the latency and throughput of the SA and GA solutions are within 5–15% of the optimal/near-optimal MILP solutions, as shown in Fig. 10a and b. This shows the suitability of SA and GA for high-performance mapping in NoC-based multi-/many-core systems. Simulation results also suggest that the latency and throughput performance of the GA is, on average, only 2–6% better than that of the SA; the GA performs slightly better than the SA for the Auto-Industry, Office Automation, and Telecom applications. On the 16-core 2D-Mesh NoC, the latency and throughput performance of MILP, SA, and GA is very similar, as shown in Fig. 11a and b. This implies that performance does not change significantly in a small-scale NoC, as the models and algorithms can address the resource limitations of small-scale systems such as a 16-core NoC.

Fig. 12 Latency and throughput comparison for MILP for different injection rates under E3S benchmarks

Fig. 13 Latency and throughput comparison for SA for different injection rates under E3S benchmarks

The proposed SA and GA algorithms avoid local optima by accepting worse solutions and by performing mutation/crossover, respectively, to reach a solution closer to the globally optimal mapping. This is why the solutions from SA and GA are very close to the MILP solution. The major advantage of the proposed SA and GA for mapping is their ability to quickly produce a near-optimal mapping solution, making them suitable for real-time systems, unlike the solutions generated by LP solvers using the MILP model designed in Sect. 3. An LP solver explores the solution space exhaustively to obtain the optimal solution, which can take a long time and may not produce a solution within a finite time, especially for large-scale NoC-based MCSoC systems.
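The SA's acceptance of worse solutions mentioned above is commonly realized with a Metropolis-style criterion; the sketch below illustrates that standard rule under the assumption that the paper's SA follows it (the function name and cost convention are illustrative).

```python
import math
import random

def accept(delta_cost, temperature):
    """Metropolis-style acceptance: always accept an improving move
    (delta_cost <= 0); accept a worsening move with probability
    exp(-delta_cost / temperature), which lets the search escape
    local optima while the temperature is high."""
    if delta_cost <= 0:
        return True
    return random.random() < math.exp(-delta_cost / temperature)
```

At a high initial temperature (such as the \(T=10{,}000\) used here) most worsening moves are accepted, encouraging exploration; as the temperature cools, the rule increasingly behaves like greedy hill climbing.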

Fig. 14 Latency and throughput comparison for GA for different injection rates under E3S benchmarks

The throughput of all the algorithms for all the applications increases with the injection rate of packets or flits in the NoC, as shown in Figs. 12b, 13b, and 14b for the 64-core NoC. As more packets are injected into the NoC, more packets are delivered to their destinations, which shows the robustness of the mapping solutions. However, as more flits are injected into the NoC, they also suffer more delay due to resource limitations, such as router buffers and link capacity. This is why latency increases with the injection rate, as shown in Figs. 12a, 13a, and 14a for the 64-core NoC.

8 Conclusion

The application mapping problem has been formulated as a mixed-integer linear programming (MILP) model that considers the computation and communication capacities and energy budget constraints of the NoC-based MCSoC, as well as the execution time requirements and the data dependencies between tasks in applications. The MILP model aims to accelerate applications by reducing the computation and communication times of running tasks. Simulated annealing (SA) and genetic algorithm (GA) heuristics have been proposed to optimize the application-architecture mapping and accelerate applications while respecting the capacity and budget constraints of NoC-based MCSoC systems. Simulation results showed that both heuristics perform very close to the optimal solutions generated by the MILP model, with simulated annealing outperforming the genetic algorithm in most cases. The energy consumption of the SA and GA solutions is very close to the optimal MILP solution on a 64-core NoC, whereas the energy differences are larger on a 16-core NoC. As the size of the NoC increases from 16 to 100 cores, the proposed MILP model and the SA and GA algorithms accelerate the applications, as demonstrated in the throughput results. This shows the scalability of the proposed MILP model and heuristics.