1 Introduction

Modern applications in domains such as healthcare, production, sales, the web, and large organizations generate huge volumes of data [1]. These applications rely heavily on computer vision, pattern recognition, image processing, and similar techniques that are computation and communication intensive and therefore require manycore processors. The benefits of manycore processors grow when data are processed in parallel, which enables faster execution of these applications. A network-on-chip (NoC) [2] supports parallel processing of instructions and data by providing an efficient interconnect among the cores of multi-/manycore processors. These applications belong to interdisciplinary fields that extract high-dimensional data from the real world, such as computational biology, biometrics, biomedical imaging, artificial intelligence, robotics, security, self-driving cars, big data, and knowledge engineering, to produce useful information. The consolidation of data-intensive applications is putting unprecedented pressure on the NoC interconnection fabric. On the one hand, the communication speed of the on-chip network is one of the factors limiting processor computation. On the other hand, the power consumed by the NoC approaches 42% of the on-chip power as network size in manycore processors increases [3]. The efficiency of the NoC architecture is therefore a major concern for parallel computer architects.

The NoC scales bandwidth efficiently, enhancing the performance of manycore processors over dedicated point-to-point signal wires and shared buses. Parallel applications exploit this potential by moving data packets over different links simultaneously. In the tiled architecture [4] of manycore processors, each tile comprises a core, caches, and a router. Each core has a small network interface (NI) that splits large cache messages into smaller flits, which are transmitted to the destination core through the NoC. Conventionally, a single network interconnects the on-chip routers and connects each router to its network interface; in multiple networks-on-chip (Multi-NoC), there is more than one such network. Multi-NoC has emerged as a faster communication infrastructure for parallel applications. These architectures are known for the flexibility they offer in devising a better power-performance trade-off [5], and they can address the communication bottleneck raised by emerging data-intensive applications by providing simple, independent, and parallel data flows through more than one NoC interconnect. Multi-NoC has therefore been adopted in silicon prototypes, e.g., OpenPiton [6], Xeon Phi [7], SCORPIO [8], Tile [9], TRIPS [10], and RAW [11], to isolate different message classes for deadlock freedom and quality of service.

Parallelism is one of the primary ways to increase performance consistently, since the slowdown of Moore's law leaves insufficient power to keep all cores active and limits per-core performance gains. Parallelism was initially explored at the software level; the focus then shifted to hardware parallelism. Hardware-based solutions are faster; however, they are power hungry and hence less preferred for actual implementation. Power-efficient customized architectures are popular in this direction [12].

Current Multi-NoCs are always designed with a network selector (i.e., a traffic distribution hardware unit); however, its placement and circuit design have not been analyzed. In this paper, we explore answers to the following questions:

  • Why do we need a network selector?

  • Where should the network selector be placed?

  • What should the hardware circuit design of the network selector be?

  • Does the placement of the network selector in the data plane or the control plane affect the complexity of the network selector circuit as well as the overall Multi-NoC design?

  • What consequences can be observed in the Multi-NoC design when the placement of the network selector changes?

In this paper, we derive a customized Multi-NoC hardware circuit design by changing the placement of the network selector hardware unit from the conventional network interface to the router of the Multi-NoC. The network selector functions as a digital demultiplexer that takes information from one input and transmits it over one of several outputs. In a Multi-NoC, a flit is demultiplexed onto one of the networks through this network selection hardware unit (named Net-Demux). Conventionally, the network selector is placed in the data plane of the network interface. We propose a new hardware implementation in the control plane of the router. This placement changes the network selector's circuit complexity, area, power, and performance. The premise is that changing the placement of the network selector can efficiently customize the Multi-NoC. Our design includes two important hardware features. First, the network selector circuit is customized by moving it from the network interface to the router. Second, this change in turn customizes the Multi-NoC itself. We exploit this opportunity to reduce the hardware complexity of the Multi-NoC. Extensive experimental results demonstrate the effectiveness of the proposed customization in boosting NoC performance.

We propose an improved placement of the Net-Demux at the switch allocator of the router as an alternative to the network interface and the routing unit. The Net-Demux placement changes the average number of signal transitions per cycle and hence varies the switching activity of the circuit. Power dissipation at the Net-Demux is related to the difference between input and output switching activity. In conventional Multi-NoC architectures, the network selector is placed at the network interface; moving it into the router's control plane is what yields the power efficiency reported here. The key contributions of this paper are as follows:

  1. Exploring the possible placements of the network selector.

  2. Analyzing the network selector architecture under different placements.

  3. Comparing the different placements with the traditional one.

The idea is implemented and evaluated with the Gem5Footnote 1 full system simulator, an extensively used open-source simulator for evaluating the performance of manycore processor architectures. Gem5 is integrated with Garnet, a detailed interconnection network model that simulates a detailed router micro-architecture and the requisite components of on-chip networks [15]. The hardware parameters of these components can be configured with different values in Garnet during simulation. Placing the network selector requires significant changes to the Garnet code. Gem5 is also integrated with the ORION 2.0 simulator [16] to estimate NoC power and area, which helps in early-stage design space exploration for multicore and manycore processors.

We have integrated Gem5 with the PARSEC benchmark suite [17] to analyze our proposals with real application workloads. The full system simulation runs the parallel section of each benchmark, which is the most important part for performance assessment. It proceeds in three phases. The first is the warm-up phase, where the initially empty caches are filled with data. The second, the region of interest (RoI), is a relatively steady state. After benchmark execution, the third phase is the clean-up phase, where the operating system performs garbage collection.

The rest of this paper is organized as follows. Section 2 discusses the related work on Multi-NoC architectures. Section 3 introduces the placement impact on digital circuit design techniques. Section 4 discusses Net-Demux placement at the data and the control plane of the NoC. Hardware synthesis and benchmark results are presented in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related work

In this section, we discuss various research works related to placement of hardware components on-chip and customizations in Multi-NoC architectures.

2.1 Placement of hardware components on-chip

In this subsection, we discuss the latest work on the placement of the Net-Demux and other hardware components. The different approaches, their contributions, benefits, and limitations are summarized in Table 1. Yadav et al. [18, 19] initiated the discussion on the placement of the Net-Demux in Multi-NoCs and also proposed the idea of placing it in the control plane of the router.

The placement of hardware components has been explored by a number of researchers. Abts et al. [20] explored the optimal placement of memory controllers for different topologies (mesh and torus), routings, and workloads, since there are fewer memory controllers than cores. Efficient memory controller placement can reduce contention and hot spots and lower the latency variance for memory-intensive applications. Zhao et al. [21] proposed fast evaluation of memory controller placement using path-load assessment; the complexity of their path-load counting algorithm is \(O(M\times M\times K)\) for an \(M\times M\) mesh with K memory controllers. These explorations are limited to mesh and torus topologies. Likewise, Hung et al. [22] proposed optimized IP placement using a genetic algorithm to minimize thermal hot spots. Hu and Marculescu [23] proposed an energy-aware mapping of tasks to cores. Srinivasan and Chatha [24] proposed optimal static routes between cores to meet bandwidth and latency requirements. However, all such placement aspects have been explored only with a single-NoC.

Table 1 Literature summary on placement of hardware components on-chip

In this paper, we present a detailed analysis of Net-Demux placement, an idea Yadav et al. [18, 19] discussed only briefly. Likewise, a network controller [25] has been implemented for traffic distribution in a hybrid wired–wireless architecture, but without analysis of the network controller architecture and its placement. The placement of other on-chip hardware components poses challenges different from Net-Demux placement. In memory controller placement [20, 21], the number of memory controllers is smaller than the number of routers, which requires devising a suitable on-chip placement of the controllers, whereas Net-Demux units are in proportion to the number of routers; we therefore look for a Net-Demux placement within the tile. IP virtualization and placement [22] minimizes thermal hotspots in the NoC. Likewise, IP cores are mapped onto the NoC [23] to minimize energy consumption according to the proposed algorithm. In our approach, Net-Demux placement minimizes energy consumption by reducing static power through hardware customization. Optimal route selection [24] has also been proposed to find minimal paths and save energy, but the algorithm complexity introduces some hardware overhead. In contrast, the proposed placement simplifies the hardware.

Net-Demux placement improves static power significantly without any performance penalty, while power gating techniques [28] improve static power at the cost of a performance penalty [33]. To mitigate this penalty, power gating has been applied to wireless multiple NoCs [29]; however, the power saving is offset by the overhead of antenna power. The switch allocator has also been customized [30] to transfer two flits in a single cycle, but latency increases by 26–40% with this approach. We also customize the switch allocator by integrating the Net-Demux; since our hardware is simple, it introduces no latency overhead.

2.2 Multi-NoC customizations

In this subsection, we discuss traditional Multi-NoC architectures, their customizations, and traffic communication approaches. Multi-NoC is adopted over single-NoC since its customization flexibility further improves NoC efficiency significantly. Balfour and Dally [34] duplicated NoC topologies, such as Mesh and CMesh, to improve system performance. Carara et al. [35] replicated the physical networks, taking advantage of the abundance of wires between routers to enhance efficiency. Since replication doubles the power of the NoC, Yoon et al. [36] partitioned the NoC into a number of sub-networks while keeping the overall amount of wires and buffering resources constant to avoid power overheads.

The wires can be customized in a Multi-NoC in different ways. Grot et al. [37] proposed partitioning into two parallel networks named multidrop express channels (MECS-X2). Kumar et al. [38] proposed channel slicing to improve the poor channel utilization of concentrated channels. Gómez et al. [39] followed a slightly different approach to network partitioning: they divided the wires into several parallel links while keeping a single router, which improved network throughput and reduced area and power consumption.

Internal components of these architectures have been replicated as well. Teimouri et al. [40] divided the n-bit wide network resources in a router, such as links, buffers, and the crossbar switch, into two parallel n/2-bit sub-networks to introduce reconfigurable shortcut paths. Noh et al. [41] proposed parallel crossbars to increase the flit transfer rate between input and output ports; the resulting router has a simpler design that performs better than a single-plane router. Gilabert et al. [42] proposed a multi-switch architecture and compared it to a multiplane NoC; the multi-switch approach provides better performance than an equivalent multi-network with only a small area overhead.

The traffic communication can also be customized using Multi-NoC [43, 44]. Volos et al. [45] proposed a specialized cache–coherence network-on-chip (CCNoC). This architecture combines asymmetric multiplane and virtual channels for efficient cache coherence communication.

3 Placement impact on circuit design techniques

In this section, we discuss the design metrics of digital circuits that are affected by the placement of hardware logic: timing delay, static power, and dynamic power. Typical placement objectives seek improvement in these metrics for an overall gain in power-performance efficiency. The timing delay is directly affected by the placement of the Net-Demux, and an inferior placement can degrade the overall efficiency. The dynamic power of a logic gate can be reduced by minimizing the physical capacitanceFootnote 2 and the switching activity. Switching activity can be minimized at all levels of design abstraction; here, we discuss the front-end digital circuit characteristics that are affected by Net-Demux placement.

Fig. 1 Switching activity variation due to circuit topology, illustrated through (a) a chain structure and (b) a tree structure. The input signal probability is uniform, i.e., \(P_{1\,(X, Y, Z, W)} = 0.5\), for all inputs

3.1 Logic restructuring

The topology of a circuit affects its overall power dissipation. The logic network \(O = X\cdot Y\cdot Z\cdot W\), shown in Fig. 1, can be implemented in two alternative ways. Let us assume that all primary inputs (X, Y, Z, W) are uniformly distributed (i.e., \(P_{1\,(X,\,Y,\,Z,\,W)}\,=\,0.5\)), so that \(P\,(X=1)\,=\,0.5\) and likewise for the other inputs. For an AND gate with inputs X and Y, the probability that the output is 1 is \(P_{1\,(X,\,Y)}\,=\,P_{1\,(X)}\times P_{1\,(Y)}\), the probability that the output is 0 is \(P_{0\,(X,\,Y)}\,=\,1\,-\,P_{1\,(X,\,Y)}\), and the transition probability is

$$\begin{aligned} {P_{(0 \rightarrow 1)\,(X,\,Y)}} = {P_{0\,(X,\,Y)}} \times {P_{1\,(X,\,Y)}} \end{aligned}$$
(1)
Fig. 2 Reordering of input signals affects the output switching activity of the circuit: (a) inputs ordered as X, Y, Z; (b) input order changed to Z, Y, X. \(\triangle P_{0 \rightarrow 1}\) shows the reduction in output switching activity over the intermediate output switching activity (here \(\triangle P_{0 \rightarrow 1} = \triangle P_{0 \rightarrow 1}(O)/P_{0 \rightarrow 1}\); a lower switching activity is better)

Likewise, the output signal transition probabilities \(P_{0 \rightarrow 1\,(O1)}\), \(P_{0 \rightarrow 1\,(O2)}\), and \(P_{0 \rightarrow 1\,(O)}\) are calculated for the chain structure shown in Fig. 1a; the transition probability \(P_{0 \rightarrow 1\,(O2)}\) of output O2 changes for the tree structure shown in Fig. 1b. Comparing the chain and tree topologies, the chain implementation has lower switching activity at the intermediate outputs than the tree implementation under random inputs. Lower switching activity at intermediate outputs matters because these intermediate signals may become inputs of other circuits; switching activity thus propagates through the circuit and may vary significantly with placement, as the Net-Demux integrates with different hardware modules of the NoC circuit.
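As a concrete check, the short Python sketch below applies Eq. (1) to the four-input AND network of Fig. 1, assuming the usual decomposition (chain: \(O1 = X\cdot Y\), \(O2 = O1\cdot Z\), \(O = O2\cdot W\); tree: \(O1 = X\cdot Y\), \(O2 = Z\cdot W\), \(O = O1\cdot O2\)); it reproduces the lower intermediate switching activity of the chain implementation.

```python
def p1_and(p1_a, p1_b):
    """Probability that a 2-input AND gate outputs 1."""
    return p1_a * p1_b

def p_transition(p1):
    """0 -> 1 transition probability of a signal with P(signal = 1) = p1 (Eq. 1)."""
    return (1.0 - p1) * p1

# Uniformly distributed primary inputs X, Y, Z, W.
pX = pY = pZ = pW = 0.5

# Chain structure of Fig. 1a: O1 = X.Y, O2 = O1.Z, O = O2.W
o1 = p1_and(pX, pY)            # P1 = 0.25
o2 = p1_and(o1, pZ)            # P1 = 0.125
o  = p1_and(o2, pW)            # P1 = 0.0625
chain = [p_transition(p) for p in (o1, o2, o)]

# Tree structure of Fig. 1b: O1 = X.Y, O2 = Z.W, O = O1.O2
t1 = p1_and(pX, pY)            # P1 = 0.25
t2 = p1_and(pZ, pW)            # P1 = 0.25
t  = p1_and(t1, t2)            # P1 = 0.0625
tree = [p_transition(p) for p in (t1, t2, t)]

print("chain P(0->1):", chain)   # [0.1875, 0.109375, 0.05859375]
print("tree  P(0->1):", tree)    # [0.1875, 0.1875, 0.05859375]
```

The intermediate node of the chain (O2) switches less often than the corresponding node of the tree, matching the observation above.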

3.2 Input signal ordering

The ordering of the input signals is another parameter that affects the switching activity of a circuit. The impact of reordering the input signals is shown in Fig. 2. The circuits are identical in topology, but input signal X is swapped with Z between the circuits of Fig. 2a, b. Let the probabilities of the input signals being 1 be \(P_{(X=1)}\,=\,0.8\), \(P_{(Y=1)}\,=\,0.2\), and \(P_{(Z=1)}\,=\,0.1\). Then, from Eq. (1), the output switching activity is calculated as follows.

  1. Circuit in Fig. 2a: the transition probability of the intermediate output is \((1-0.8\times 0.2)\,(0.8\times 0.2)\,=\,0.1344\), and the final output transition probability is \((1-0.1344\times 0.1)\,(0.1344\times 0.1)\,=\,0.0132\).

  2. Circuit in Fig. 2b: the probability of a \(0 \rightarrow 1\) transition at the intermediate output is \((1-0.2\times 0.1)\,(0.2\times 0.1)\,=\,0.0196\), and the final output transition probability is \((1-0.0196\times 0.8)\,(0.0196\times 0.8)\,=\,0.0154\).

The ratio of final to intermediate switching activity is 0.098 for the circuit in Fig. 2a and 0.78 for the circuit in Fig. 2b; it is beneficial to postpone the introduction of signals with a high transition rate [46]. Thus, a simple reordering of the input signals significantly reduces the signal transition rate (switching activity).

3.3 Time delay of signal paths

The timing delay is the time required for a signal to travel from an input pin to an output pin of the circuit. It is the sum of the interconnect and gate delays that constitute the path: a signal travels from the input pads through the gate outputs and proceeds toward the output pad along the circuit path [47]. The longest path, which introduces the maximum timing delay, is known as the critical path of the circuit.

The delay along the critical path decides the clock period or, equivalently, the maximum operating frequency of the circuit. For reliable operation, the interval between two clock arrivals should exceed the critical path delay. Placing the Net-Demux at different locations varies the signal path lengths, the number of signal paths, and the critical path delay.

The placement of the Net-Demux can be modeled mathematically using timing graphs [48]. Let the total number of signal paths in the Multi-NoC circuit be \(\rho _n\) and in the network selector be \(q_m\), where 'n' and 'm' are finite integers greater than or equal to 1. Assume each signal path comprises 'X' logic gates. The network selector is integrated at, say, the \(i^{th}\) position of one signal path \(\rho _j\), where \(1\le i\le X\) and \(1\le j\le n\).

  1. After integration of the network selector, the total number of signal paths is \(s_o=\rho _n+q_m\), where 'o' is a finite integer.

  2. Consider a signal path \(s_{iz}\) among the total \(s_o\) paths that starts at logic gate \(x_i\) and ends at logic gate \(x_z\), i.e., \(s_{iz}: x_i \leadsto x_z\) for some \(1\le i, z \le X\), where \(D_{x_{i}}\) and \(D_{x_{z}}\) are the delays of logic gates \(x_i\) and \(x_z\). The timing of gate \(x_z\) thus depends on \(x_i\); after integration of the Net-Demux on this path, its delay includes \(D_{x_{i}}+D_{x_{z}}\).

  3. Let \(\phi\) be the set of signal path delays without integration of the network selector, \(\phi =\{D_{\rho _1},D_{\rho _2},\ldots ,D_{\rho _{n}}\}\), where \(D_{\rho _1}\), \(D_{\rho _2}\), and \(D_{\rho _{n}}\) are the total delays of all logic gates in signal paths \(\rho _1\), \(\rho _2\), and \(\rho _{n}\), respectively. The signal path \(\rho _j\) with maximum delay \(\max (\phi )\) is the critical path of the circuit without the Net-Demux.

  4. Let \(\psi\) be the set of signal path delays after integration, \(\psi =\{D_{s_1},D_{s_2},\ldots ,D_{s_{n+m}}\}\), where \(D_{s_1}\), \(D_{s_2}\), and \(D_{s_{n+m}}\) are the total delays of all logic gates in signal paths \(s_1\), \(s_2\), and \(s_{n+m}\), respectively. The signal path with maximum delay \(\max (\psi )\) is the critical path of the circuit with the Net-Demux.

  5. When \(\max (\psi )>\max (\phi )\), the critical path delay of the Multi-NoC circuit increases. Otherwise, the critical path is unchanged even after integration of the Net-Demux, and there is no impact.

We infer that the integration of the Net-Demux either increases the critical path delay or leaves the critical path unaffected, depending on where the Net-Demux is integrated in the Multi-NoC. Therefore, the Net-Demux should be integrated on a noncritical signal path. Note that, besides the path delay, the input signal ordering also impacts the circuit's timing characteristics. We analyze all these properties at the circuit abstraction level. In the next section, we discuss the placement impact on the Net-Demux and the NoC architecture.
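The argument above can be made concrete with a small timing-graph sketch. The gate delays below are hypothetical and stand in for the synthesis tool's timing engine; the point is only that adding the Net-Demux delay to a noncritical path leaves \(\max (\psi )=\max (\phi )\), while adding it to the critical path increases it.

```python
# Minimal timing-graph sketch: each signal path is a list of gate delays (ns).
# Delay values are hypothetical and only illustrate the max(phi) / max(psi) argument.

def path_delay(gate_delays):
    """Total delay of one signal path (sum of its gate delays)."""
    return sum(gate_delays)

def critical_path_delay(paths):
    """Maximum over the set of path delays, i.e., max(phi) or max(psi)."""
    return max(path_delay(p) for p in paths)

# Multi-NoC circuit paths rho_1 .. rho_3 before Net-Demux integration.
rho = [
    [0.10, 0.12, 0.15, 0.20],   # rho_1 (0.57 ns)
    [0.08, 0.09, 0.11],         # rho_2 (0.28 ns)
    [0.14, 0.16, 0.18, 0.22],   # rho_3 (0.70 ns, critical path)
]
max_phi = critical_path_delay(rho)

# Case 1: Net-Demux (delay 0.05 ns) inserted on the noncritical path rho_2.
psi_noncritical = [rho[0], rho[1] + [0.05], rho[2]]
# Case 2: Net-Demux inserted on the critical path rho_3.
psi_critical = [rho[0], rho[1], rho[2] + [0.05]]

print(round(max_phi, 2))                                # 0.70
print(round(critical_path_delay(psi_noncritical), 2))   # 0.70 -> critical path unchanged
print(round(critical_path_delay(psi_critical), 2))      # 0.75 -> critical path increased
```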

4 Architectural implications of Net-Demux placement

The Net-Demux can be placed in either the control planeFootnote 3 or the data plane. The control plane refers to the part of the hardware responsible for generating control signals, i.e., coordinating the movement of packets through the data plane. The data plane, or data path, handles the storage and movement of packets; it comprises a set of input and output buffers located in the network interface (NI) as well as the routers/switches. The data and control planes have data and control input/output (I/O), respectively.

In Fig. 3, we compare the hardware implementation overhead of a single demultiplexer for control plane versus data plane placement. A smallFootnote 4 bus width of C bits (say) of I/O suffices to select the NoC network, compared to the largeFootnote 5 bus width of B bits (say) of I/O required by a data plane demultiplexer. The placement affects static power and area as follows.

  1. The control plane placement is independent of the network link width.Footnote 6

  2. The hardware overhead saving with control plane placement is \((I_{NI} \times B)/(I_{R} \times C)\)Footnote 7 times that of data plane placement; because the link width \(B \gg C\) and \(I_{NI}\,=\,I_{R}\,= \,4\),Footnote 8 the control plane placement is B/C times better than the data plane placement (see the sketch after this list).

  3. The control plane placement scales better than the data plane placement with an increasing number of NoC networks, because the width of the select lines and the number of output lines incur only a minor hardware implementation overhead at the lower input link width of C bits.
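A back-of-the-envelope sketch of point 2, with assumed widths B = 128 bits and C = 4 bits (illustrative values only, not the synthesized configuration):

```python
# Hardware overhead comparison for Net-Demux placement (illustrative values only).
# B: data plane link width (bits), C: control plane signal width (bits) -- assumed.
B, C = 128, 4
I_NI, I_R = 4, 4          # Net-Demux instances per tile at NI and at router

data_plane_bits    = I_NI * B      # demux bit-width handled in the data plane
control_plane_bits = I_R * C       # demux bit-width handled in the control plane

print(data_plane_bits / control_plane_bits)   # 32.0, i.e., B/C since I_NI == I_R
```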

Fig. 3 Placement of demultiplexer

Fig. 4 Placement of Net-Demux at (a) the network interfaces (NI 0, NI 1, NI 2, NI 3) and (b) the router's network links (N1, E1, S1, W1)

Thus, control plane placement is beneficial over data plane placement. It not only modifies the NoC micro-architecture but also brings variations in the circuit paths. Since the circuit's physical design significantly impacts power dissipation, area, and timing delay, an incompatible placement degrades the overall efficiency, and this cannot be recovered even with intelligent routing. The Multi-NoC is formed by partitioning a single-NoC that has four network links and four NIs.

In Fig. 4, we compare the placement impact in the data plane (at the NI) versus the control plane (on the router) of a dual-NoC. In a conventional Multi-NoC, the Net-Demux is placed at the NI, as shown in Fig. 4a. The links of both NoC networks are connected to the NIs as \(NI\,0\,(l_0,\,l_4)\), \(NI\,1\,(l_1,\,l_5)\), \(NI\,2\,(l_2,\,l_6)\), and \(NI\,3\,(l_3,\,l_7)\). Every NI can distribute traffic to either NoC network, so four Net-Demux hardware units are placed, one per NI. We call this conventional Multi-NoC the dual-network-NoC. In Fig. 4b, placing the Net-Demux at the router customizes the dual-network-NoC architecture: the two networks are formed through the routers, and single links suffice to connect the NIs, i.e., \(NI\,0\,(l_0)\), \(NI\,1\,(l_1)\), \(NI\,2\,(l_2)\), and \(NI\,3\,(l_3)\), to a single router. The four Net-Demux hardware units are placed at the N1, E1, S1, and W1 network links of the router. We call this architecture the 2-network-NoC.

The placement changes the architecture of the Net-Demux according to the function of the hardware unit it is integrated with, and it also customizes the NoC architecture. In the following subsections, we discuss the conventional placement at the NI in detail and compare it with our proposed placements at the routing unit and the switch allocator.

4.1 Net-Demux at network interface

The dual-network-NoC places the Net-Demux at the NI, between the core and the router. Core messages enter the NI and then the router through the NI links, as shown in Fig. 5. A router has eight core links and eight network links in the dual-network-NoC. The odd-numbered links constitute the \({\mathrm{NoC}}_1\) network, whereas the even-numbered links constitute the \({\mathrm{NoC}}_2\) network.

Fig. 5 Network interface with Net-Demux hardware implementation

Fig. 6 Comparison of the network interface (a) with Net-Demux placement in the dual-network-NoC and (b) without it in the single- or 2-network-NoC. The circuit design flow reflects the complexity of placement at the NI: comparing the timing delays of the two circuits, \(a\,>\,b\)

Messages are the logical unit of communication and may be arbitrarily long. These messages are divided into packets that are further segmented into fixed-length flits.Footnote 9 If the number of generated packets is k, then the packet size is \(S\,=\,(M-2)/k\) and the flit size is \(N\,=\,(M-2)/(k\times f)\), where f is the number of flits per packet. On receiving a message at the NI, the flits are generated separately for control and data packets, so each message is first checked to determine whether it produces control or data packets.

A control packet is composed of one control flit, which carries a coherence command and the memory address. In contrast, data packets are made up of fiveFootnote 10 flits: a head flit containing the destination address, three body flits, and a tail flit. The tail flit indicates the end of a packet; on its arrival, a signal is triggered to generate the next flit.

Although the control packet is a single flit, it carries all the required destination information. Only the control flit and the head flit carry control and routing information; the body and tail flits follow the head flit to reach their respective NoC networks and carry the payload. The last data flit is padded with zeros if required.
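A minimal behavioral sketch of this NI-side packetization is given below; the 128-bit flit width, the field layout, and the coherence command string are illustrative assumptions rather than the exact flit format of the implementation.

```python
# Sketch of NI-side packetization: control packet = 1 flit,
# data packet = head + 3 body + tail flits. Sizes and fields are assumptions.
FLIT_BITS = 128
FLITS_PER_DATA_PACKET = 5

def make_control_packet(coherence_cmd, address):
    # Single control flit carrying the command and the memory address.
    return [{"type": "control", "cmd": coherence_cmd, "addr": address}]

def make_data_packet(destination, payload_bits):
    head = {"type": "head", "dest": destination}
    # Split the payload across three body flits; pad the last one with zeros.
    body_payload = payload_bits.ljust(3 * FLIT_BITS, "0")
    body = [{"type": "body", "payload": body_payload[i * FLIT_BITS:(i + 1) * FLIT_BITS]}
            for i in range(3)]
    tail = {"type": "tail"}          # tail flit signals the end of the packet
    return [head] + body + [tail]

ctrl = make_control_packet("GETX", 0x1F40)   # "GETX" is an example command name
pkt = make_data_packet(destination=(2, 3), payload_bits="1011" * 24)
assert len(ctrl) == 1 and len(pkt) == FLITS_PER_DATA_PACKET
```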

Table 2 NoC selection for cache messages via bit encoding

Every message is prepended with two bits that identify its message class, as shown in Table 2. The encoding of flag m is specified per message class: m is computed from bits \(b_0\) and \(b_1\), and its value decides the NoC selected for message forwarding. These bits are fed to inputs \(i_1[0]\) and \(i_2[0]\) of the Net-Demux circuit, as shown in Fig. 5, and XORed. The output m of the XOR gate drives two AND gates, A1 and A2, whose other input is the message signal. Whenever the message belongs to the Control or Writeback Control class, flag m is set to 1 and AND gate A2 passes the message bits, which ensures the selection of \({\mathrm{NoC}}_2\) for these messages; for the other message classes, \({\mathrm{NoC}}_1\) is selected.
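The selection logic can be summarized as follows. The two-bit class codes in this sketch are assumptions chosen so that Control and Writeback Control XOR to 1 (Table 2 defines the actual encoding); the XOR-based computation of flag m and the resulting NoC choice follow the description above.

```python
# Net-Demux selection at the NI: m = b0 XOR b1 decides the target NoC.
# The two-bit class encodings below are assumed for illustration; the paper's
# Table 2 defines the actual codes. Control and Writeback Control must XOR to 1.
MESSAGE_CLASS_BITS = {
    "Control":            (0, 1),
    "Writeback Control":  (1, 0),
    "Data":               (0, 0),
    "Writeback Data":     (1, 1),
}

def select_noc(message_class):
    b0, b1 = MESSAGE_CLASS_BITS[message_class]
    m = b0 ^ b1                      # output of the XOR gate
    return "NoC2" if m == 1 else "NoC1"

for cls in MESSAGE_CLASS_BITS:
    print(cls, "->", select_noc(cls))
# Control and Writeback Control are steered to NoC2; the other classes go to NoC1.
```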

The Net-Demux is integrated with flit generation (FG), as shown in Fig. 6a and compared with the single-NoC case in Fig. 6b. Once the credit signal indicates the availability of an empty virtual channel, the FG checks the counter. If it is zero, the first flit is checked for signal m: if m is true, the output port address of \({\mathrm{NoC}}_2\) is assigned to the flit; otherwise the \({\mathrm{NoC}}_1\) address is allocated. The complete Net-Demux circuit design flow, from the incoming message at the NI to the selection of a NoC network for the packet, is shown in Fig. 6a.

Placing the Net-Demux at the NI increases the critical path delay because it lies on the critical path: the timing delay 'a' with Net-Demux placement is larger than the timing delay 'b' without it. Since \(a > b\), this placement lengthens the critical path of the circuit, incurring overhead in power, area, and timing delay.

4.2 Net-Demux at router

The conventional Net-Demux placement is at the NI, which is part of the data plane. On the router, the placement can be in either the control plane or the data plane. In the data path, placement at the input/output of the crossbar is feasible; however, the area and power consumption are higher than in the control plane, since the number of required Net-Demux hardware units is on the order of the number of router inputs feeding \({\mathrm{NoC}}_1\).

The area and static power are proportional to the data path width. Since the hardware cost grows in proportion to the data path width, we have not explored Net-Demux placement at the crossbar. Instead, we propose placement at the routing unit and at the switch allocator of the router, owing to the lower hardware cost of these control plane units.

Fig. 7 Placement of Net-Demux between the input unit and the routing unit of the router. One half-adder and two full-adders are required to implement the logic of the port encoder. The flowchart demonstrates the hardware implementation of Net-Demux placement after the route computation of the RU for selecting either of the NoC networks

Routing Unit (RU). The RU selects the output port for flit traversal. Flits are divided into head, body, and tail flits. The least significant three bits of the head flit hold the destination information used during route computation. The port encoder of the Net-Demux converts this information into a four-bit output port; it is made of one half-adder and two full-adders. The computed output port is updated in the Status Registers (SR).

The routing unit reads the status register to obtain the routing information for the head flit. As shown in Fig. 7, two control lines read the destination address \(x_d\) and \(y_d\) from the SR and feed this information to the input of the Net-Demux, which checks the message class of the head flit and assigns its NoC network. The selection depends on signal m: if m is zero, the Net-Demux selects the output port of \({\mathrm{NoC}}_1\); otherwise, it selects the output port of the \({\mathrm{NoC}}_2\) network. The circuit design flow illustrates the control flow of the routing unit. The integration of the Net-Demux increases the length of each signal path of the routing unit.
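A behavioral sketch of the RU placement is shown below, assuming XY dimension-order routing on a mesh (an illustrative assumption, since the route computation itself is orthogonal to the placement): the output port is computed first, and the Net-Demux then binds it to NoC1 or NoC2 using flag m.

```python
# Behavioral model of Net-Demux at the routing unit (RU).
# XY dimension-order routing is an assumption for illustration.

def route_compute(x_cur, y_cur, x_d, y_d):
    """Return the output port direction for the head flit."""
    if x_d != x_cur:
        return "E" if x_d > x_cur else "W"
    if y_d != y_cur:
        return "N" if y_d > y_cur else "S"
    return "Local"

def ru_net_demux(x_cur, y_cur, x_d, y_d, m):
    """Select the output port, then demultiplex it onto NoC1 (m == 0) or NoC2 (m == 1)."""
    port = route_compute(x_cur, y_cur, x_d, y_d)
    network = "NoC2" if m else "NoC1"
    return port, network

print(ru_net_demux(1, 1, 3, 0, m=0))   # ('E', 'NoC1')
print(ru_net_demux(1, 1, 1, 2, m=1))   # ('N', 'NoC2')
```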

Switch Allocator (SA). In each cycle, crossbar connections are established between input and output ports for flit traversal, and the SA schedules these connections. In Fig. 8, the timing arc (red dashed line) exhibits a critical path across three submodules: the local arbiter, the global arbiter, and the masking logic. The local arbiter selects a winner among the competing virtual channels (VCs) at each input port. Its output becomes the select line of MUX1, which selects the single wire carrying the output port information for the flit of the winning VC.

Fig. 8 Net-Demux placement at the switch allocator of the router. Only three half-adders are needed to implement the switch encoder logic. The flowchart demonstrates the selection of either of the NoC networks with this placement

The Net-Demux is placed between the local and global arbiters. A 1-bit select line m selects the NoC network according to the message class (refer to Table 2): if m is '0,' the request proceeds to \({\mathrm{NoC}}_1\); otherwise, it proceeds to the \({\mathrm{NoC}}_2\) network. Competition between different inputs requesting the same output physical channel is resolved by the Global Round-Robin (G-RR) arbiter; each output channel has a G-RR arbiter that processes the corresponding input request bits. The grants generated by the global arbiter are used to set up the crossbar control registers.

The output of the global arbiter also feeds the masking logic, whose other input is the V-bit input line from the local arbiter. The masking logic activates the next VC request in place of the most recently served VC. In the next subsection, we compare the hardware implementation overhead of the Net-Demux across the different placements.
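The following sketch models this allocation flow at a behavioral level: a round-robin local arbiter picks the winning VC per input port, the 1-bit flag m steers that request to NoC1 or NoC2, and a global round-robin arbiter resolves output-port conflicts per network. It is a simplified illustration of the data flow, not the RTL of the synthesized allocator.

```python
# Behavioral sketch of Net-Demux between local and global arbitration in the SA.

def round_robin(requests, last_grant):
    """Grant the first requester after last_grant (round-robin); return its index or None."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None

def switch_allocate(input_ports, last_local, last_global):
    """input_ports: per input port, a list of VC requests (None or (out_port, m))."""
    noc_requests = {"NoC1": {}, "NoC2": {}}
    for in_port, vcs in enumerate(input_ports):
        vc = round_robin([r is not None for r in vcs], last_local[in_port])
        if vc is None:
            continue
        out_port, m = vcs[vc]                      # MUX1: pick the winning VC's request
        network = "NoC2" if m else "NoC1"          # Net-Demux on the 1-bit select line m
        noc_requests[network].setdefault(out_port, []).append(in_port)
    grants = {}
    for network, per_out in noc_requests.items():
        for out_port, contenders in per_out.items():
            req_vec = [False] * len(input_ports)
            for in_port in contenders:
                req_vec[in_port] = True
            winner = round_robin(req_vec, last_global[out_port])   # G-RR arbiter
            grants[(network, out_port)] = winner
    return grants

# Two input ports, each with a winning VC targeting output port "E" on different NoCs.
ports = [[("E", 0), None], [None, ("E", 1)]]
print(switch_allocate(ports, last_local=[0, 0], last_global={"E": 0}))
# {('NoC1', 'E'): 0, ('NoC2', 'E'): 1}
```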

4.3 Comparative analysis between different placements

We first compare network interface versus router placement. In Table 3, we examine the changes in each hardware component's architecture under the different Net-Demux placements, and in Table 4 we compare the impact of placement on the hardware metrics. We then compare the placements within the router, i.e., RU versus SA.

Table 3 Comparative changes in hardware component’s architectures with different Net-Demux placements
Table 4 Gain in hardware metrics with different Net-Demux placements

Network Interface (NI) versus Router Placement. At the NI, the Net-Demux sits in the data plane, whereas at the router it sits in the control plane. As \(K<< N\)Footnote 11, the hardware component overhead is lower for the control plane placement, as compared in Table 3. Hence, placement in the router's control plane is beneficial over placement at the NI; the data plane placement at the NI incurs more area and power overhead for implementing the Net-Demux, as given in Table 4.

Placements at Routing Unit (RU) versus Switch Allocator (SA). The Net-Demux can be placed in the router either at the RU or at the SA. The architecture of the Net-Demux varies with the placement: since each functional unit of the router performs a specific task, the Net-Demux inputs and outputs vary accordingly, as compared in Table 3. For example, the RU computes the output port for the incoming head flit, whereas the SA resolves contention between channels and enables the switch to forward the flit across the crossbar. The encoder differs between the two placements, and therefore so does the minimum number of required half-adders. Although the input signals have the same width for both placements, the Net-Demux architecture is less complicated at the SA than at the RU.

Impact on Circuit Characteristics. Although we have not explored the back-end impact of placement in detail, we presume the following implications:

  1. Input Signal Ordering: This characteristic recommends postponing the introduction of input signals with a high transition rate.Footnote 12 For the RU, the signal passes through the routing algorithm before becoming the input of the Net-Demux, so its transition rate is higher than for the SA, where the signal passes through only one multiplexer before reaching the Net-Demux. Placement at the SA is therefore more efficient than at the RU, because a low-transition-rate signal enters the Net-Demux earlier.

  2. Timing Delay: The critical path is the key characteristic for timing delay. The critical path delay is affected by placement at the NI. The router micro-architecture is pipelined, and the crossbar dominates the router's critical path delay, so Net-Demux placement at the RU or SA does not affect it. Since the frequency need not change with these placements, there is no impact on data arrival in each router cycle.

  3. Circuit Topology: The optimization algorithms of the fabrication tools decide the circuit topology. Therefore, the topology of the circuit changes with placementFootnote 13 and hence so does the switching activity.Footnote 14 These are the primary factors that affect dynamic power.

Impact on Scalability. Current technology scaling makes the NoC sensitive to faults, severe congestion, traffic hotspots, and thermal issues, so the routing in these processors is more complex. Complex routing circuits have high switching activity because of the number of digital units on every path. It is therefore better to place the Net-Demux at the SA of the router.

5 Experimental results

In this section, we first discuss the hardware synthesis results to estimate the placement impact on power, area, and critical path delay of the circuit. Then, for performance evaluation, we run benchmarks in full system simulation mode.

Table 5 Parameter configuration for hardware synthesis

5.1 Hardware synthesis

The parameter configuration for hardware synthesis is listed in Table 5. We performed Register Transfer Level (RTL) synthesis in SystemVerilog to evaluate the hardware area, power, and critical path delay of the Routing Unit (RU) and Switch Allocator (SA) placements of the Net-Demux, compared with the conventional Network Interface (NI) placement. We synthesize using Synopsys Design Compiler with 32 nm standard-cell libraries and LVTFootnote 15 cell technology. The target clock frequency is set to 2 GHz.

We compare the slack values of a number of signal paths at the NI, RU, and SA in Table 6. Slack is the permissible delay of a cell activity that does not delay the overall circuit output; path slack is the difference between the data required time and the data arrival time. If the data arrival coincides with the required time, the slack is ideally zero, and paths whose slack is close to zero are critical. If a large number of paths in a circuit have near-zero slack, a slight variation in slack may render the circuit unreliable. The time at which a signal arrives can vary due to variations in input data, temperature, voltage, and manufacturing defects, so the design should ensure that all signals arrive neither too early nor too late despite these variations.

Table 6 Path slack variations in different signal paths across Net-Demux placements

A positive slack implies that the arrival time of a cell may be postponed by the slack value without affecting the overall delay of the circuit; it shows that data arrive before they are required. Researchers therefore emphasize restructuring the schematic to have fewer equally critical paths: instead of looking at the slack of just the most critical path, the design should consider the entire distribution of slacks [51]. The NI's path slack values are on the order of \(10^{-3}\) ns, which is very close to zero, compared to the RU and SA values on the order of \(10^{-2}\) ns. The remaining positive slack reduces the risk of violating the timing of the circuit [52].
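At this level, slack bookkeeping reduces to a subtraction per path; the sketch below shows the computation and the flagging of near-critical paths. The arrival/required times and the 0.01 ns threshold are assumed for illustration and only mirror the orders of magnitude reported in Table 6.

```python
# Path slack = data required time - data arrival time (ns).
# Example timing values and the near-critical threshold are illustrative assumptions.
paths = {
    "NI_path": {"arrival": 0.498, "required": 0.500},   # slack ~ 1e-3 ns (near critical)
    "RU_path": {"arrival": 0.460, "required": 0.500},   # slack ~ 4e-2 ns
    "SA_path": {"arrival": 0.455, "required": 0.500},
}

NEAR_CRITICAL_NS = 0.01

for name, t in paths.items():
    slack = t["required"] - t["arrival"]
    flag = "near-critical" if slack < NEAR_CRITICAL_NS else "safe"
    print(f"{name}: slack = {slack:.3f} ns ({flag})")
```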

Area. The areas of the different circuit components for the various Net-Demux placements are compared in Fig. 9. The total area is the sum of four factors: combinational area, noncombinational area, net interconnect area, and buffers. The combinational area comes from logic cells such as AND and OR gates; the noncombinational area comes from registers; the net interconnect area connects all cells and is computed from the library's wire load models. The buffers are the primary contributor to the area compared to the other NoC components. The total gainFootnote 16 in area is 39% with placement at the SA and 36% with placement at the RU, relative to placement at the NI.

Fig. 9 Area comparison

Power and Critical Path Delay. We compare Net-Demux placement at the NI in the dual-network-NoC with placement at the RU and SA in the 2-network-NoC, and we examine the area, power, and critical path delay of these placements against the single-NoC.

The power dissipation is the sum of the static/leakage power and the dynamic/switching power. The leakage power depends on the static current flowing from the voltage supply to ground in the absence of switching activity. The dynamic power consists of the (dis)charging of the load capacitance and the short-circuit power, and it therefore varies with the switching activity of the circuit. Another important design metric is the critical path delay, because it determines the NoC frequency by estimating the worst path taken by the data.

In Table 7, we normalize the results of placements with respect to single-NoC. The values of design metrics are considered as ‘1’ for single-NoC. We estimate the gain in dual-network-NoC and 2-network-NoC over single-NoC.

Table 7 Area, power, and critical path delay comparison relative to single-NoC for different placements (the area, power, and critical path delay of the single-NoC are normalized to one)
Table 8 Improvement in area, power and critical path delay for different placements

Table 8 is derived from Table 7 and shows the percentage improvement in area, power, and delay for the different placements. Here, we estimate the gain of each metric for the different NoCs. Since the dual-network-NoC already exists, its gains are compared with the single-NoC and the proposed 2-network-NoC. For example, the gain in area of the dual-network-NoC over the single-NoC is \((1-0.52)/1= 0.48\), and over the 2-network-NoC it is \((0.47-0.52)/0.47= -0.106\); the other metrics are computed likewise. This computation is done for Net-Demux placement at the NI. Similarly, for Net-Demux placement at the RU, the gain of the 2-network-NoC is compared with the single-NoC and the existing dual-network-NoC; for example, the gain in area of the 2-network-NoC over the single-NoC is \((1-0.55)/1= 0.45\) and over the dual-network-NoC is \((0.68-0.55)/0.68= 0.191\). The gains of the other metrics are computed in the same way for the RU as well as the SA. In Table 8, the values are given as percentages and are truncated after two decimal places for simpler representation; the '+' sign indicates an advantage and '–' a drawback. The area and static power improve significantly for all three placements over the single-NoC. Since the flit width in the dual-network-NoC and the 2-network-NoC is half that of the single-NoC, and the buffer size is proportional to the flit width, a significant gain is observed in these metrics, as buffers occupy the major share of area and consume the most static power [53].
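The entries of Table 8 follow directly from the normalized values of Table 7 via (baseline - candidate)/baseline; the sketch below reproduces the area examples worked out above (the normalized inputs are the ones quoted in the text, since Table 7 is not reproduced here).

```python
def gain(baseline, candidate):
    """Relative improvement of candidate over baseline: (baseline - candidate) / baseline."""
    return (baseline - candidate) / baseline

# Normalized area values quoted in the text (single-NoC = 1).
print(round(gain(1.00, 0.52), 3))   #  0.48  : dual-network-NoC vs single-NoC
print(round(gain(0.47, 0.52), 3))   # -0.106 : dual-network-NoC vs 2-network-NoC
print(round(gain(1.00, 0.55), 3))   #  0.45  : 2-network-NoC (RU) vs single-NoC
print(round(gain(0.68, 0.55), 3))   #  0.191 : 2-network-NoC (RU) vs dual-network-NoC
```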

However, Net-Demux placement at the NI in the dual-network-NoC consumes more area and static power than the 2-network-NoC. The number of NI links in the dual-network-NoC is eight, whereas it is only four in the 2-network-NoC; fewer links imply fewer buffers and hence less area and static power for the 2-network-NoC. The overall gain in area and static power is observed with Net-Demux placement at the SA in the 2-network-NoC over the dual-network-NoC.

Net-Demux placement at the NI also increases the critical path delay, as shown in Table 8: data arrival is delayed by 8% relative to the single-NoC and the 2-network-NoC. However, the increase in critical path delay at the RU and SA does not impact the overall router frequency, since the router is a pipelinedFootnote 17 architecture.

The switching power varies across the placements because of the variation in input and output switching activity. As these results are obtained with the default wire load model under low traffic, we further evaluate the performance of the different placements with the PARSEC benchmark workload.

5.2 Benchmark results

We have experimented with the PARSEC benchmarks using the Gem5 simulator; the simulation configuration is listed in Table 9. We compare the execution time of Net-Demux placement at the router with placement at the NI. For full system simulation, we use Gem5 [13, 14, 54] integrated with the GARNET [15, 55] NoC simulator. A Linux 2.6.27 kernel image is booted with the ALPHA instruction set architecture. For power results, GARNET is integrated with ORION [16, 56].

Table 9 Simulation configuration

The volume of messages varies across the message classes of different PARSEC applications [17]. PARSEC comprises data-parallel benchmarks, i.e., blackscholes, bodytrack, fluidanimate, and vips, whereas the canneal benchmark is unstructured. These benchmarks also differ in data granularity: fluidanimate and canneal are fine-grained, blackscholes is coarse-grained, and bodytrack and vips are moderate. Hence, these benchmarks show different impacts on performance.

Fig. 10 Normalized execution time comparison among Net-Demux placements at the Routing Unit (RU), Switch Allocator (SA), and Network Interface (NI) [57] (the execution time of the single-NoC is normalized to one)

We compare the execution time of Net-Demux placements at the RU and SA with the conventional placement at the NI [57] in Fig. 10. Relative to the single-NoC, minor degradations of 10% and 4% in execution time are observed for the placements at the NI and RU, respectively, whereas a 6% improvement is achieved with the placement at the SA. Compared with the placement at the NI, the placements at the RU and SA are 6% and 14% more efficient, respectively.

6 Conclusions

We propose placing the network selector hardware unit in the control plane of the router, instead of at the conventional network interface, to improve the power efficiency of the Multi-NoC. The Net-Demux architecture is modified according to the circuit in which it is placed; in the control plane, the wires' bus width and the registers' size are quite small compared to the data plane. Net-Demux placement at the switch allocator has the lowest hardware implementation overhead compared with placement at the routing unit or the network interface. Placement in the control plane does not impact the critical path delay, since the router is a pipelined architecture; in contrast, Net-Demux placement at the network interface increases the critical path delay. Comparing synthesis results, Net-Demux placement at the switch allocator improves area by 20%, static power by 21%, dynamic power by 29%, and critical path delay by 33% over the conventional placement. With the PARSEC benchmark, the execution time with Net-Demux placement at the switch allocator improves by 6% over the single-NoC, and the placements at the routing unit and the switch allocator improve execution time by 6% and 14%, respectively, over the placement at the network interface.

Presently, our work is limited to a two-network NoC. In the future, we shall implement the proposed placement at the router for more than two NoC networks; the benefits could increase with the number of NoC networks, since the number of Net-Demux outputs is proportional to the number of networks. Another limitation is that we have explored the proposed placement only for a five-stage pipelined router architecture. It can be further explored for one-, two-, three-, or four-stage pipelined architectures, and it will also be interesting to study the placement impact on the critical path delay of these pipelines. However, when router pipeline stages are reduced, bypassing and speculation techniques are used to transfer the flit; these techniques are effective in improving network throughput under low network traffic, and since a bypassing or speculation failure results in a longer critical path delay, exploring the placement with a reduced-pipeline router architecture is suitable only for low-traffic workloads. This approach can also be applied to wireless Multi-NoC, since our approach reduces power, which is a major concern in wireless NoC [58]. Multi-chip processor architectures also require a traffic distribution unit between chips, so our proposed placement idea can be repurposed for multiple chips [59].