1 Introduction

The network-on-chip (NoC) is widely regarded as a first-order component of current and future multicore and manycore chip multiprocessors (CMPs), because its flexibility and scalability let it keep pace with rapidly increasing core counts. Unfortunately, NoCs raise concerns because of their excessive power consumption. For example, the NoC of the Intel Terascale 80-core chip consumes 28 % of the chip power [1]; for MIT RAW, the share is up to 36 % [2, 3]; and in the future, NoCs in many-core processors are estimated to consume hundreds of watts [4] if current network implementations are naively scaled. One of the key components of an NoC router is the buffer, which is necessary for high performance. However, buffers consume significant power (up to 30 % – 40 % of NoC power [5]). Recent studies have explored several optimizations to reduce the power consumption of router buffers. Bufferless routing [6] presents a new algorithm for routing without using buffers in router input/output ports. Bufferless routing has been shown to be very effective at saving power at low network load; however, because of low bandwidth and detouring, it incurs a significant performance penalty at higher network loads. FlexiBuffer [7] performs power gating on buffers to reduce power consumption. Although it demonstrates clear advantages in router power saving, FlexiBuffer still incurs considerable leakage power, especially at low network load, which is the condition many real-world applications run under for much of the time. The deficiencies of FlexiBuffer are detailed in Sect. 2.

In this paper, we propose HVCRouter, a novel NoC router design with heterogeneous virtual channels (VCs). To reduce power efficiently, HVCRouter integrates one bufferless channel to leverage its power efficiency. In addition, HVCRouter allows power gating not only of buffers but also of channels, capturing more power optimization opportunities. However, these optimizations do not come without challenges. To make good use of the heterogeneous VCs, and to power gate channels as well as buffers without degrading performance at varying network loads, the channel allocation and flow control mechanisms of conventional NoC routers must be modified and carefully orchestrated, as detailed in Sect. 3.

In this paper, we make the following contributions:

  • We introduce a novel NoC router design with heterogeneous virtual channels. In particular, it includes a bufferless channel to exploit the power efficiency of bufferless operation at low network utilization.

  • We present a fine-grained power-gating algorithm, which enables HVCRouter to adapt well to varying network loads and achieve high power efficiency. To the best of our knowledge, our approach is the first to exploit power-saving opportunities in an NoC router at the buffer level and the channel level simultaneously.

  • Our experiments show that HVCRouter achieves performance similar to that of FlexiBuffer, the best design in the literature, while consuming an average of 22.797 % less power and providing 20.698 % lower energy-delay product (EDP).

The rest of the paper is organized as follows. For background information, Sect. 2 introduces the architecture of a modern NoC router, discusses bufferless routing and FlexiBuffer, and motivates our work. Section 3 describes the architecture of HVCRouter. In Sect. 4, we present our approach for power gating and VC allocation. Section 5 evaluates our work. Section 6 discusses related work. In Sect. 7, we summarize and conclude the paper.

2 Background and Motivation

2.1 Router Architecture

The architecture of a typical modern virtual-channel NoC router is shown in Fig. 1. When a flit of a packet arrives at a router on one of the input virtual channels of the router’s input port, the flit is held in a buffer of that channel until it can be forwarded. The processing of a flit is divided into several pipeline stages. First, route computation (RC) is performed by the routing unit to decide the output port. Second, virtual-channel allocation (VA), performed by the VC allocator, assigns an available output virtual channel in the given output port. If all output channels are occupied, the flit is kept in the buffer and waits until one becomes vacant. Third, switch allocation (SA), performed by the switch allocator, schedules a time slot on the switch and the output channel, and forwards the flit to the routed output port during this time slot. Finally, in switch traversal (ST), the flit departs the current router and travels to the next router on its routing path.
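
As a rough illustration of the four stages described above, the following Python sketch walks a single flit through RC, VA, SA and ST. All names (`Flit`, `route_compute`, the port labels) and the choice of XY dimension-order routing are hypothetical and not taken from the paper; a real router executes these stages as a hardware pipeline over several cycles.

```python
from dataclasses import dataclass, field

@dataclass
class Flit:
    dest: tuple            # (x, y) coordinates of the destination router
    out_port: str = None   # filled in by route computation
    out_vc: int = None     # filled in by VC allocation
    log: list = field(default_factory=list)

def route_compute(flit, here):
    """RC: choose an output port (XY dimension-order routing, an assumption)."""
    (x, y), (dx, dy) = here, flit.dest
    if dx != x:
        flit.out_port = "E" if dx > x else "W"
    elif dy != y:
        flit.out_port = "N" if dy > y else "S"
    else:
        flit.out_port = "LOCAL"
    flit.log.append("RC")

def vc_allocate(flit, free_vcs):
    """VA: claim a free output VC on the chosen port; False means stall."""
    vcs = free_vcs.get(flit.out_port, [])
    if not vcs:
        return False       # flit stays buffered until a VC becomes vacant
    flit.out_vc = vcs.pop(0)
    flit.log.append("VA")
    return True

def switch_allocate(flit, busy_ports):
    """SA: reserve the switch output for one time slot; False means retry."""
    if flit.out_port in busy_ports:
        return False
    busy_ports.add(flit.out_port)
    flit.log.append("SA")
    return True

def switch_traverse(flit):
    """ST: the flit crosses the switch and leaves the router."""
    flit.log.append("ST")
    return flit.out_port, flit.out_vc

flit = Flit(dest=(2, 0))
route_compute(flit, here=(0, 0))
free_vcs = {"E": [1, 2]}   # hypothetical free output VCs per port
busy = set()
if vc_allocate(flit, free_vcs) and switch_allocate(flit, busy):
    hop = switch_traverse(flit)
```

The sketch makes the role of the buffer visible: whenever `vc_allocate` or `switch_allocate` returns `False`, the flit has to wait somewhere until the next attempt.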

Fig. 1. The router architecture.

The buffer is one of the most important components in a modern NoC router, since it decouples the allocation of channel resources. Buffers within each router improve bandwidth efficiency in the network because they enable a flit to wait until its allocated output channel and the switch are ready. Without buffers, the flit has to be dropped or misrouted, that is, sent to a less desirable output port; buffers therefore reduce the number of dropped or misrouted packets. It has been shown that more buffers result in significantly higher performance. However, buffers account for a significant portion of NoC power. Moreover, Kim et al. [7] observed that even when the network is saturated, not all of the buffer resources are fully utilized.

2.2 Bufferless Routing and Flexibuffer

Recent studies have explored several optimizations to reduce the power consumption of router buffers. Bufferless routing [6] presents a new algorithm for routing without using buffers in router input/output ports. When multiple flits are routed to the same output channel, arbitration is performed as usual to choose one flit to get the channel. Bufferless routing misroutes the remaining flits to other output channels and guarantees that those flits will eventually reach their destinations. In contrast to misrouting, Mitchell et al. [8] propose another bufferless router design that drops the flits that lose arbitration and retransmits them later. Bufferless routing has been shown to be very power efficient at low network load. However, it suffers from poor performance and energy efficiency at higher loads, because the misrouting or dropping caused by link contention leads to increased link utilization, which in turn further increases link contention, creating a positive feedback cycle. Consequently, bufferless networks incur a significant performance penalty at higher network loads and saturate at lower throughputs than buffered networks.
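
The arbitrate-then-misroute step can be sketched as follows. This is a minimal illustration rather than the algorithm of [6]: the oldest-first priority, the tuple layout and the function name `deflect_route` are assumptions made for the example, and it requires that there are at least as many output ports as incoming flits.

```python
def deflect_route(flits, ports):
    """Assign every incoming flit to a distinct output port in one cycle.

    flits: list of (age, preferred_port) tuples; ports: available output ports.
    Oldest-first arbitration is an assumption made for this sketch.
    Returns a list of (age, preferred_port, granted_port) tuples.
    """
    free = list(ports)
    result = []
    # Older flits arbitrate first, so they are the least likely to be deflected.
    for age, preferred in sorted(flits, key=lambda f: -f[0]):
        if preferred in free:
            granted = preferred     # won arbitration: productive hop
        else:
            granted = free[0]       # lost arbitration: misrouted (deflected)
        free.remove(granted)
        result.append((age, preferred, granted))
    return result

# Three flits all want the east port; only the oldest one gets it.
out = deflect_route([(5, "E"), (9, "E"), (1, "E")], ["N", "E", "S", "W"])
deflected = [r for r in out if r[1] != r[2]]
```

Every deflected flit occupies a link that does not bring it closer to its destination, which is the source of the extra link utilization at higher loads.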

Instead of eliminating buffers from routers, FlexiBuffer employs a power-gating policy and adjusts the size of the active buffers adaptively. Although it demonstrates clear performance advantages over bufferless routing under medium-to-high network loads, we observe that optimization opportunities remain. First, FlexiBuffer keeps all of the virtual channels active at all times, which incurs significant static power when the network is lightly utilized. Second, FlexiBuffer includes buffers in each channel, controls them separately, and keeps some buffers active at all times for each channel, which also introduces considerable static power. We are thereby motivated to propose a router design with heterogeneous virtual channels that eliminates buffers from one channel to reduce static power. In addition, instead of considering channels separately, our design manages channels and buffers in a global framework, which allows it to adaptively power gate channels as well as their buffers when appropriate, achieving higher power efficiency without degrading performance.

Fig. 2. The architecture of HVCRouter.

3 HVCRouter Architecture

The architecture of HVCRouter is similar to the conventional design shown in Fig. 1, with the major differences in the input port and the VC allocator, as highlighted in Fig. 2. HVCRouter employs a mixture of buffered and bufferless VCs: among the multiple VCs in an input port, the lowest-order one, i.e., VC-0, has no buffers. The output of VC-0 is connected not only to the switch as usual, but also to the buffers of VC-1 through a demultiplexer. The demultiplexer’s output selection can be controlled by both the VC allocator and the switch allocator. When a flit coming from the bufferless VC (VC-0) does not win the output VC arbitration during VC allocation, the VC allocator signals the corresponding demultiplexer to forward the flit to the tail of the VC-1 buffers, where it waits for the next round of VC allocation. A similar process applies to switch allocation: if a flit in the bufferless VC loses the arbitration, the switch allocator forwards it into the buffer of VC-1 as well. This differs from bufferless routing, in which such flits have to be dropped or misrouted, significantly increasing latency. This internal flit-forwarding design does not noticeably increase power consumption, because VC-0 only serves at low network load, in which case contention and arbitration rarely happen. If the load increases, HVCRouter instead relies on buffered channels to service the packet flow and power gates VC-0, as detailed in Sect. 4. To solve the problem of out-of-order arrival, HVCRouter holds flits in a receiver-side buffer until all flits of a packet have arrived. We adopt a bufferless channel instead of using power gating to disable all buffer entries in a channel at an input port. This choice is based on the observation that many real-world applications have low network utilization for much of the time [6]; a bufferless channel suffices in those cases, and the large area and high power consumption of buffers can be saved. Moreover, even for a buffered channel with power gating, transitions between power states increase power consumption. Besides the introduction of a bufferless VC and the demultiplexer logic, another notable difference in HVCRouter is the VC allocator, which integrates a control unit (CU) that dynamically and adaptively turns virtual channels and buffers on and off.
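
The role of the demultiplexer at the VC-0 output can be illustrated with a toy software model. The class and method names below (`InputPort`, `accept_on_vc0`, `resolve_vc0`) are hypothetical, and the model ignores timing; it only shows where a flit ends up after winning or losing VC or switch arbitration.

```python
from collections import deque

class InputPort:
    """Toy model of an input port with a bufferless VC-0 and a buffered VC-1."""

    def __init__(self, vc1_depth):
        self.vc0 = None                 # VC-0 holds at most the flit in flight
        self.vc1 = deque()              # VC-1 FIFO buffer
        self.vc1_depth = vc1_depth      # number of VC-1 buffer entries

    def accept_on_vc0(self, flit):
        self.vc0 = flit

    def resolve_vc0(self, won_arbitration):
        """Model the demultiplexer at the VC-0 output.

        Returns the flit if it proceeds to the switch, or None if it was
        forwarded to the tail of the VC-1 buffer to retry later.
        """
        flit, self.vc0 = self.vc0, None
        if won_arbitration:
            return flit                 # demux selects the switch
        if len(self.vc1) >= self.vc1_depth:
            # Assumption: flow control reserves a VC-1 entry for this case.
            raise RuntimeError("VC-1 full; flow control should prevent this")
        self.vc1.append(flit)           # demux selects the VC-1 buffer tail
        return None

port = InputPort(vc1_depth=4)
port.accept_on_vc0("flit-A")
sent = port.resolve_vc0(won_arbitration=False)   # flit-A loses VA or SA
```

Unlike a purely bufferless router, the losing flit is neither dropped nor deflected; it simply continues as an ordinary buffered flit of VC-1.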

Algorithm 1.

4 Power-Gating and VC Allocation

4.1 Power-Gating Algorithm

The power-gating algorithm of HVCRouter introduces a state machine (depicted in Fig. 3) with two states, characterized as follows:

  • state-0: only VC-0 and VC-1 are turned on.

  • state-1: VC-0 is turned off, VC-1 is turned on, other VCs may or may not be turned on.

Fig. 3. The state machine of HVCRouter.

Let us examine the power-gating algorithm shown in Algorithm 1. Initially, HVCRouter operates in state-0, with only VC-0 and VC-1 turned on (VC-1 with p active buffer entries; p is explained below), so that the bufferless channel reduces static power to a minimum at low network utilization. In Line 4, vc_occupancy(no) calculates the number of busy buffer entries (those holding packet flits) in VC-no, to estimate the load. If the load of VC-1 exceeds a threshold, turn_on_vc(2, p) in Line 5 turns on VC-2 together with p of its buffer entries. Here p represents the minimum number of active buffer entries needed to prevent stalls caused by the lack of available buffer entries: \(p = \max(t_w, t_{crt})\), where \(t_w\) is the wake-up delay of a buffer entry and \(t_{crt}\) is the credit round-trip latency [7, 9]. In Line 6, VC-0 is power gated. As a result, the state transitions to state-1 in Line 7.
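
The state-0 part of this procedure can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the paper's Algorithm 1: the class `PortState`, the method names other than `turn_on_vc`, the threshold value and the example delays are all hypothetical.

```python
def min_active_entries(t_wakeup, t_credit_rt):
    """p = max(t_w, t_crt): entries kept awake to hide wake-up and credit delay."""
    return max(t_wakeup, t_credit_rt)

class PortState:
    def __init__(self, num_vcs, p, threshold):
        self.state = 0
        self.p = p
        self.threshold = threshold       # hypothetical occupancy threshold
        # active[i] = number of powered-on entries of VC-i (None = VC gated)
        self.active = [None] * num_vcs
        self.active[0] = 0               # VC-0 is on but has no buffers
        self.active[1] = p               # VC-1 starts with p active entries

    def turn_on_vc(self, no, entries):
        self.active[no] = entries

    def turn_off_vc(self, no):
        self.active[no] = None

    def step_state0(self, vc1_occupancy):
        """One evaluation in state-0; the argument stands in for vc_occupancy(1)."""
        if self.state == 0 and vc1_occupancy > self.threshold:
            self.turn_on_vc(2, self.p)   # add a second buffered channel
            self.turn_off_vc(0)          # power gate the bufferless VC
            self.state = 1               # move to state-1

p = min_active_entries(t_wakeup=3, t_credit_rt=4)   # example delays in cycles
port = PortState(num_vcs=4, p=p, threshold=2)
port.step_state0(vc1_occupancy=1)    # light load: stay in state-0
light = port.state
port.step_state0(vc1_occupancy=3)    # VC-1 load exceeds the threshold
```

After the second call the port is in state-1 with VC-1 and VC-2 active and VC-0 gated, which matches the state definitions given earlier in this section.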

Algorithm 2.

In state-1, in Lines 13 – 17, the algorithm checks whether some channel has encountered congestion, and if so turns on a higher-order channel to mitigate the congestion. Conversely, in Lines 18 – 22, the algorithm checks whether some channel is idle while its immediately lower-order channel is also lightly utilized; in that case, the idle channel is power gated. In Line 23, the algorithm invokes manage_router_buffers() to perform fine-grained power gating on router buffers. As shown in Algorithm 2, for each buffered and active virtual channel, if the current number of active-but-idle buffer entries is no more than \(P - Q\) and there are buffer entries in the sleep state, then one more buffer entry is woken up. Conversely, if the current number of active-but-idle buffer entries is more than \(P - Q\) and the number of active buffer entries is more than P, then one buffer entry is power gated. In contrast to turning on buffers, power gating is executed less frequently (once per T consecutive cycles) to avoid thrashing and reduce control overhead. Finally, in Lines 24 – 34 of Algorithm 1, if all channels except VC-1 are power gated and VC-1 has a low utilization, then VC-0 is turned on and the state transitions back to state-0.
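
The per-channel buffer rule can be written down as a small function. This is a hedged sketch of the rule as described in the text, not a reproduction of Algorithm 2: the function name `manage_vc_buffers` and its arguments are hypothetical, P, Q and T are treated as plain parameters because the text does not give their values, and "once per T cycles" is modeled as `cycle % T == 0`, which is an assumption.

```python
def manage_vc_buffers(active, busy, total, P, Q, cycle, T):
    """Return the new number of active entries for one buffered, active VC.

    active: powered-on entries; busy: entries currently holding flits;
    total: physical entries in the VC; P, Q, T: parameters named in the text.
    """
    idle_active = active - busy
    if idle_active <= P - Q and active < total:
        return active + 1     # few idle entries left: wake one right away
    if idle_active > P - Q and active > P and cycle % T == 0:
        return active - 1     # plenty of idle entries: gate one, every T cycles
    return active

# Hypothetical parameter values chosen only to exercise each branch.
grow = manage_vc_buffers(active=4, busy=3, total=8, P=4, Q=2, cycle=7, T=16)
hold = manage_vc_buffers(active=6, busy=1, total=8, P=4, Q=2, cycle=7, T=16)
shrink = manage_vc_buffers(active=6, busy=1, total=8, P=4, Q=2, cycle=32, T=16)
```

Waking is allowed in any cycle while gating is rate limited, so the number of active entries rises quickly under load and falls slowly, which is what prevents thrashing.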

As the algorithm shows, when the load increases, HVCRouter tends to turn on more channels first, rather than more buffers in an already active channel. The reason is that more virtual channels inside a single physical channel allow other flits to use channel bandwidth that would otherwise be left idle when a flit blocks [9], which improves performance.

4.2 VC Allocation

This section discusses the VC allocation policy of HVCRouter. The VC allocator allocates an output channel for a flit according to the state of the corresponding input port of the downstream router. Suppose a flit in router A is waiting for VC allocation, with its routed output port being \(P_{A\_OUT}\), and the downstream router is B with the corresponding input port being \(P_{B\_IN}\). If the state of \(P_{B\_IN}\) is state-0, which implies that only VC-0 and VC-1 are active in \(P_{B\_IN}\), then the VC allocator of A checks whether VC-0 in \(P_{B\_IN}\) is idle (A knows this through conventional credit-based backpressure flow control [9]). If it is, the allocator assigns the flit to VC-0 in \(P_{A\_OUT}\); otherwise, it assigns the flit to VC-1 in \(P_{A\_OUT}\). To support power gating of router buffers, HVCRouter modifies conventional credit-based flow control in a way similar to FlexiBuffer. However, due to the heterogeneous nature of HVCRouter, the credit count of VC-0 is initialized to 1, that of VC-1 is initialized to \(p-1\) (one entry may be taken by a flit forwarded from VC-0), and the others are initialized to p. If the state of \(P_{B\_IN}\) is state-1, and the input channel with the maximum number of idle buffers, i.e., maximum credits, is VC-m, then the VC allocator in A assigns the flit to VC-m in \(P_{A\_OUT}\).
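
The allocation policy and the credit initialization can be summarized in a short sketch. The function names, the use of `None` to mark a power-gated VC, the tie-break toward the lowest VC index and the behavior when no credit is available are assumptions made for the example; the text itself only specifies the choices between VC-0, VC-1 and VC-m.

```python
def initial_credits(num_vcs, p):
    """Credit counts as described: VC-0 -> 1, VC-1 -> p - 1, others -> p."""
    return [1, p - 1] + [p] * (num_vcs - 2)

def allocate_output_vc(downstream_state, credits):
    """Pick the output VC in P_A_OUT from the state of P_B_IN.

    credits[i] is router A's credit count for VC-i of the downstream port;
    None marks a power-gated VC. Returns None when no VC can be allocated
    (an assumption; the flit would then wait for the next round).
    """
    if downstream_state == 0:
        if credits[0] > 0:
            return 0        # VC-0 idle downstream: use the bufferless channel
        return 1 if credits[1] > 0 else None
    # state-1: choose the active buffered VC with the most credits,
    # breaking ties toward the lowest VC index (an assumption).
    candidates = [(c, -vc) for vc, c in enumerate(credits)
                  if vc >= 1 and c is not None and c > 0]
    if not candidates:
        return None
    return -max(candidates)[1]

credits = initial_credits(num_vcs=4, p=4)
first = allocate_output_vc(0, credits)                 # VC-0 has a credit
second = allocate_output_vc(0, [0, 3, 4, 4])           # VC-0 busy downstream
third = allocate_output_vc(1, [None, 1, 3, None])      # state-1, VC-2 freest
```

Choosing the VC with the most credits in state-1 spreads flits over the active channels, so no single channel fills up while others sit idle.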

Table 1. Network configuration.

Fig. 4. Power of HVCRouter, normalized to that of FlexiBuffer, for six traffic patterns.

Fig. 5. Performance of HVCRouter and FlexiBuffer for (a) Uniform random (b) Bit reverse (c) Transpose (d) Hot-spot (e) Shuffle and (f) Bit complement traffic patterns.

5 Evaluation

We evaluate HVCRouter in terms of power, network latency and energy-delay product using BookSim [10] and DSENT [11]. BookSim is a cycle-accurate interconnection network simulator used to measure network latency, and we modified it to model HVCRouter. Taking the results obtained from BookSim together with the network configuration files as input, DSENT outputs the power values. The detailed network configurations used in our evaluation are listed in Table 1.

Fig. 6. The energy delay product (EDP) of HVCRouter for six traffic patterns, normalized to that of FlexiBuffer.

Fig. 7. Power of HVCRouter for real workload traces, normalized to that of FlexiBuffer.

5.1 Synthetic Traffic Evaluation

Figure 4 shows the power consumption of HVCRouter for six synthetic traffic patterns, normalized to that of FlexiBuffer. HVCRouter consumes less power than FlexiBuffer for all the patterns across the packet injection rates tested. Let us examine the results in detail. All the curves show a similar trend. At low injection rates, HVCRouter consumes significantly less power than FlexiBuffer, with a maximum reduction of 25.75 % for Hot-spot. When the network is lightly utilized, HVCRouter uses the bufferless virtual channel to reduce both static and dynamic power by scheduling packets into the bufferless channel whenever possible and power gating the other channels as well as buffers. As the injection rate increases, HVCRouter adaptively turns on channels and buffers to mitigate the link contention of the bufferless channel. Owing to the finer-grained power gating at both the channel and buffer levels, HVCRouter maintains lower power consumption than FlexiBuffer, consuming 22.367 % less power on average.

The performance of HVCRouter is evaluated by measuring the packet latency. Figure 5 presents the average packet latency achieved by HVCRouter and FlexiBuffer for six traffic patterns. As the results show, HVCRouter achieves performance similar to that of FlexiBuffer, with an average packet latency 1.704 % larger than that of FlexiBuffer. The results indicate that HVCRouter makes good use of the heterogeneous architecture and reacts quickly to congestion.

Figure 6 reports the energy delay product (EDP) of HVCRouter for six traffic patterns, normalized to that of FlexiBuffer. Overall, HVCRouter achieves on average 21.08 % lower EDP than FlexiBuffer across all the patterns.

5.2 Real Workload Evaluation

We also use traffic generated by SynFull [12] for BookSim to simulate real-world workloads. SynFull introduces a synthetic traffic generation methodology that captures both application and cache coherence behavior to evaluate NoCs. Given a real-world benchmark as input, SynFull outputs a traffic model, which can then be fed into BookSim for simulation. Currently, the traffic models made publicly available by SynFull are a set of 16 models produced from the PARSEC [13] and SPLASH-2 [14] benchmarks with the sim-small input set for 16 cores. The power and packet latency are measured, with the results shown in Figs. 7 and 8, respectively. The results demonstrate that HVCRouter consumes an average of 24.83 % less power than FlexiBuffer across all of the benchmarks.

Fig. 8. Performance of HVCRouter and FlexiBuffer for real workload traces.

Fig. 9. The energy delay product (EDP) of HVCRouter for real workload traces, normalized to that of FlexiBuffer.

Regarding network performance, HVCRouter overall delivers performance similar to that of FlexiBuffer, with an average latency 7.53 % larger than that of FlexiBuffer. Figure 9 reports the energy delay product (EDP) of HVCRouter, normalized to that of FlexiBuffer. HVCRouter provides an average of 19.176 % lower EDP than FlexiBuffer across all the workloads.
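
As a rough consistency check on the reported averages, the normalized EDP can be approximated as the product of the normalized power and the normalized latency. The sketch below rests on two assumptions that the paper does not state: energy scales with average power over the same simulated interval, and delay scales with average packet latency. The function name is hypothetical.

```python
def normalized_edp(power_saving_pct, latency_increase_pct):
    """EDP of HVCRouter relative to FlexiBuffer under a simple product model.

    Assumes energy scales with average power over the same simulated
    interval and delay scales with average packet latency.
    """
    power_ratio = 1.0 - power_saving_pct / 100.0
    latency_ratio = 1.0 + latency_increase_pct / 100.0
    return power_ratio * latency_ratio

synthetic = normalized_edp(22.367, 1.704)   # averages reported in Sect. 5.1
real = normalized_edp(24.83, 7.53)          # averages reported in Sect. 5.2
synthetic_saving = (1.0 - synthetic) * 100.0
real_saving = (1.0 - real) * 100.0
```

Under this model the estimated EDP reductions come out at roughly 21.0 % and 19.2 %, close to the reported 21.08 % and 19.176 %. Exact agreement is not expected, since the reported values presumably average per-configuration EDP figures rather than multiplying two averages.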

6 Related Work

Recent studies have explored NoC power optimizations at various granularities. First, at the channel and switch level, Michelogiannakis et al. [15] propose adaptive bandwidth networks (ABNs), which divide channels and switches into lanes to reduce NoC power consumption. Second, at the router level, NoRD [16] provides a power-gating bypass that decouples a node’s ability to transfer packets from the powered-on/off status of the associated router, thereby increasing the length of idle periods and eliminating the node-network disconnection problem. Panthre [17] adopts topology and routing reconfiguration to steer packets away from power-gated components, providing long intervals of uninterrupted sleep to selected units. Matsutani et al. [18] adopt look-ahead routing to hide the wake-up delay and reduce short-term sleeps of channels. Chen et al. [19] propose a performance-aware power reduction scheme that aims to achieve non-blocking power gating of on-chip network routers through look-ahead routing. Router Parking [20] dynamically and selectively power gates routers attached to parked cores, and adopts adaptive routing to preserve network performance. In addition, some topology-dependent approaches have been proposed. Yue et al. [21] present Smart Butterfly, a core-state-aware NoC power-gating scheme based on the flattened butterfly topology that uses the active/sleep state information of processing cores to improve power-gating effectiveness. Since a Clos network has multiple alternative paths for every packet, Chen et al. [22] propose the power-gating scheme MP3, which achieves minimal performance penalty and saves more static energy than a conventional Clos network. Third, at the subnet granularity, Balfour et al. [23] present a concentrated mesh topology with replicated sub-networks to improve area and energy efficiency in NoCs. Das et al. [24] propose the Catnap architecture, which performs power gating on subnets in a multi-layer NoC. Mishra et al. [25] introduce two separate networks on chip, one optimized for bandwidth and the other for latency, and steer applications to the appropriate network. darkNoC [26] integrates multiple layers of architecturally identical but physically different routers, leveraging the extra transistors available due to dark silicon. Each layer is separately optimized for a particular voltage-frequency range. Finally, at the router buffer level, Moscibroda et al. [6] found that a bufferless router consumes very low leakage power compared with a traditional buffered router, and they present a bufferless NoC design and a new routing algorithm. However, a bufferless NoC is only suitable at low network load. Fallin et al. [27] propose the minimally-buffered deflection (MinBD) router, which combines the deflection routing of bufferless networks with a small side buffer to reduce deflections, improving on bufferless routing to some extent. FlexiBuffer [7] reduces buffer leakage power by using fine-grained power gating and adjusting the size of the active buffers adaptively.

7 Conclusion

The router buffer makes a significant contribution to overall NoC power. We presented HVCRouter, a novel NoC router architecture that couples buffered and bufferless virtual channels. Employing a fine-grained power-gating policy, HVCRouter consumes an average of 22.797 % less power than FlexiBuffer, the state-of-the-art power-efficient NoC router design. In terms of performance, HVCRouter is close to FlexiBuffer, with average packet latency 1.704 % higher on synthetic traffic and 7.53 % higher on real workload traces.